# Imports
First we will import all the newly developed modules from the hupml library:

In [1]:
# Hupml imports
from hupml import LoadConfig
from hupml import MlDataFrame
from hupml import PipelineBase

# Some imports for this demo
import pandas as pd
import json
import os

One module has been developed that we won't be using today: a database connection class. If you ever work for a client with a database, you can easily kickstart your project with it. It includes several methods for reading from the database into a Pandas DataFrame and for writing a DataFrame back to the database.

In [2]:
# from hupml.database_connection import DbConnection 

We also need a couple of variables and helper functions specifically for this demo:

In [3]:
root_dir = (os.path.abspath(os.path.dirname('../..')))
telco_data = pd.read_csv(f'{root_dir}/data/telco_customer_churn.csv')

def print_dict_pretty(d):
    print(json.dumps(d, indent=4))

# Pipeline
There are a lot of steps involved in training models that are often overlooked. A lot of time goes into data engineering (fixing data gaps, improving data quality etc.), feature engineering, parameter tuning and reporting your results. In previous projects, we also noticed we had to manipulate a lot of data. The more data sources you have and the more models you make, the more data-manipulation paths you can walk. To avoid drowning in an ever-increasing number of if-statements, the `PipelineBase` class has been set up in the hupml package.

The idea is that all the steps involved in a Data Science project are registered with one or multiple pipeline classes. For this, you only need to make a new class and inherit from `PipelineBase` from the hupml package. For everybody who doesn't know a lot about Object Oriented Programming (OOP), I think the examples will be illustrative enough. If not, look for someone who can help you.

### Step 1: Build your own class
First we inherit from `PipelineBase` and set up the absolute minimum for a class:

In [4]:
class Pipeline(PipelineBase):
    pass

# Instantiate pipeline object
p = Pipeline()

TypeError: Can't instantiate abstract class Pipeline with abstract methods handle_nans

As you can see, we get an error. Some methods need to be implemented (these methods are the "abstract" methods). Let's do that:

In [5]:
class Pipeline(PipelineBase):       
    def handle_nans(self):
        print('Method "handle_nans" was called')

# Instantiate pipeline object
p = Pipeline()

TypeError: __init__() missing 2 required positional arguments: 'df' and 'method_settings'

Ahhhh more errors! The class actually needs two arguments called `df` and `method_settings`. Let's look at those in step 2.
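
The first error above is the standard behaviour of Python's `abc` module when a subclass leaves an abstract method unimplemented. The sketch below reproduces it with a hypothetical stand-in class, `SketchBase`. It is an assumption about how a base class like `PipelineBase` could be declared, not hupml's actual code.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for PipelineBase (assumption, not hupml code)."""

    def __init__(self, df, method_settings):
        self.df = df
        self.method_settings = method_settings

    @abstractmethod
    def handle_nans(self):
        """Every subclass must decide how missing values are handled."""


class Incomplete(SketchBase):
    pass  # handle_nans is missing, so this class cannot be instantiated


class Complete(SketchBase):
    def handle_nans(self):
        print('Method "handle_nans" was called')


try:
    Incomplete(df=None, method_settings=[])
except TypeError as err:
    # The abstract-method check runs before __init__ is called
    print(f'TypeError: {err}')

# Works: the abstract method is implemented and both arguments are given
p_sketch = Complete(df=None, method_settings=[{'handle_nans': None}])
```

The abstract-method check happens before `__init__` runs, which matches the order of the two errors seen in this step: first the missing method, then the missing arguments.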

### Step 2: Passing your data and pipeline method settings
The argument `df` is the data (DataFrame/MlDataFrame) we pass to the `Pipeline` class. The argument `method_settings` is a list of dictionaries containing the methods and their arguments that need to be called, in the order specified. This list looks like the following:
```
[
    {'method1_name': {'argument1_name': argument1_value}}, # Method 1
    {'method2_name': {'argument1_name': argument1_value, 'argument2_name': argument2_value}}, # Method 2
    ETC....
]
```

Let's say that when our pipeline runs, we only want to run the `handle_nans` method. We would then do the following:

In [7]:
class Pipeline(PipelineBase):       
    def handle_nans(self):
        print('Method "handle_nans" was called')

method_settings = [
    {'handle_nans': None}, # The method handle_nans doesn't have any arguments, so we set the method argument to None
    {'handle_nans': None} # Let's run handle_nans twice in a row
]

# Instantiate pipeline object
p = Pipeline(df=telco_data, method_settings=method_settings)
# List pipeline settings
print(p)

[{'handle_nans': None}, {'handle_nans': None}]


No more errors, woopwoop! Let's add a few more methods and pass some arguments:

In [8]:
class Pipeline(PipelineBase):
    def handle_nans(self):
        print('Method "handle_nans" was called')

    def print_text_from_argument(self, text='asfd'):
        print(text)

    def print_predefined_text(self):
        print('Predefined text')

    def n_times_squared(self, value: int, n: int):
        result = value
        for i in range(0, n):
            result = result ** 2
        print(f'Squaring the number {value} for {n} times in a row gives = {result}')


method_settings = [
    {'handle_nans': {}},
    {'print_text_from_argument': {'text': 'This is the text passed to the method'}},
    {'print_text_from_argument': {'text': 1}},
    {'print_predefined_text': None},
    {'n_times_squared': {'value': 2, 'n': 2}},
    {'print_text_from_argument': {'text': 'Same method is called again, but later in the pipeline'}}
]

# Instantiate pipeline object
p = Pipeline(df=telco_data, method_settings=method_settings)
# List pipeline settings
print(p)

[{'handle_nans': {}}, {'print_text_from_argument': {'text': 'This is the text passed to the method'}}, {'print_text_from_argument': {'text': 1}}, {'print_predefined_text': None}, {'n_times_squared': {'value': 2, 'n': 2}}, {'print_text_from_argument': {'text': 'Same method is called again, but later in the pipeline'}}]


### Step 3: Execute your pipeline
The pipeline class contains the method `run`, which executes your pipeline as defined. Simply call this method:

In [9]:
p.run()

Method "handle_nans" was called
This is the text passed to the method
1
Predefined text
Squaring the number 2 for 2 times in a row gives = 16
Same method is called again, but later in the pipeline
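
To make the link between `method_settings` and the output above more concrete, here is a minimal sketch of how a `run`-style method could dispatch the steps. The class `SketchPipeline` is hypothetical and only illustrates the idea with `getattr`; hupml's `PipelineBase.run` may be implemented differently.

```python
class SketchPipeline:
    """Hypothetical stand-in (assumption); not the hupml implementation."""

    def __init__(self, method_settings):
        self.method_settings = method_settings

    def run(self):
        # Walk the list in order; each entry is a one-item dict
        for step in self.method_settings:
            for method_name, kwargs in step.items():
                method = getattr(self, method_name)  # look the method up by name
                method(**(kwargs or {}))  # None and {} both mean "no arguments"

    def print_predefined_text(self):
        print('Predefined text')

    def n_times_squared(self, value: int, n: int):
        result = value
        for _ in range(n):
            result = result ** 2
        print(f'Squaring the number {value} for {n} times in a row gives = {result}')


sketch = SketchPipeline([
    {'print_predefined_text': None},
    {'n_times_squared': {'value': 2, 'n': 2}},
])
sketch.run()
```

Because the settings are a list rather than a single dictionary, the same method name can appear more than once and the order of execution is explicit.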


### Step 4: Load your settings from a configuration file
You can also define your pipeline settings in a `.yaml` file and let the pipeline class load this file. For this demo we made the `pipeline_settings_demo.yaml` file:

In [12]:
with open(f'{root_dir}/configs/pipeline_settings_demo.yaml', 'r') as f:
    print(f.read())

pipeline:
  - print_text_from_argument: {text: 'This is the text passed to the method'}
  - print_text_from_argument: {text: 1}
  - print_predefined_text:
  - n_times_squared: {value: 2, n: 2}
  - print_text_from_argument: {text: 'Same method is called again, but later in the pipeline'}


Now we load this as follows:

In [13]:
class Pipeline(PipelineBase):
    def handle_nans(self):
        print('Method "handle_nans" was called')

    def print_text_from_argument(self, text='asfd'):
        print(text)

    def print_predefined_text(self):
        print('Predefined text')

    def n_times_squared(self, value: int, n: int):
        result = value
        for i in range(0, n):
            result = result ** 2
        print(f'Squaring the number {value} for {n} times in a row gives = {result}')

# Instantiate pipeline object
p = Pipeline.from_yaml_file(df=telco_data, path=f'{root_dir}/configs/pipeline_settings_demo.yaml')
# List pipeline settings
print(p)

[{'print_text_from_argument': {'text': 'This is the text passed to the method'}}, {'print_text_from_argument': {'text': 1}}, {'print_predefined_text': None}, {'n_times_squared': {'value': 2, 'n': 2}}, {'print_text_from_argument': {'text': 'Same method is called again, but later in the pipeline'}}]


The configuration file doesn't include `handle_nans` like before, so we _shouldn't_ see this when we run it. Now let's run it:

In [14]:
p.run()

This is the text passed to the method
1
Predefined text
Squaring the number 2 for 2 times in a row gives = 16
Same method is called again, but later in the pipeline


That's it! To help you use the `Pipeline` and `PipelineBase` classes correctly, here are a few more notes:

# Considerations and advanced usage
If you have _(mostly) the same data manipulations_ for each pipeline, you can probably use just a single class as described above. However, if this class becomes extremely large (a code smell) and large portions of the code clearly apply only to certain types of pipelines, you might consider splitting it into several subclasses.

For example, handling nans is something you probably always want to do, but you might have completely different methods for classification models and regression models. So you might build a `Pipeline` class as above, but make two _extra_ classes `PipelineClassification` and `PipelineRegression` that _inherit_ from your `Pipeline` class. Another example is having both timeseries and non-timeseries data. Here, too, you might consider using separate subclasses if that seems logical.
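
The class layout described above could look roughly like the sketch below. To keep it self-contained, `Pipeline` here is a plain stand-in rather than a real subclass of `PipelineBase`, and the method names `encode_target` and `scale_target` are hypothetical examples, not part of hupml.

```python
class Pipeline:
    """Stand-in for your own Pipeline class (which would inherit from PipelineBase)."""

    def handle_nans(self):
        # Shared step: every pipeline type needs this
        print('Method "handle_nans" was called')


class PipelineClassification(Pipeline):
    def encode_target(self):
        # Hypothetical classification-only step
        print('Encoding class labels')


class PipelineRegression(Pipeline):
    def scale_target(self):
        # Hypothetical regression-only step
        print('Scaling the target')


# Both subclasses inherit the shared step without redefining it
PipelineClassification().handle_nans()
PipelineRegression().handle_nans()
```

Shared steps live in one place, while each subclass only carries the methods that make sense for its type of model.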

We advise you to start with what we described above, and only refactor when it becomes necessary.