**Learning outcomes**

- Enumerate the components of an ML pipeline
    - Reusable data exploration scripts
    - Reusable data prep
    - Reusable model training

- Build each component using standard Python libraries:
    - Build reusable data exploration scripts
    - Build reusable data prep scripts
    - Build reusable model training scripts


## Introduction

Data scientists are the key stakeholders when it comes to creating machine learning or statistical models. Most data science teams pull raw data into a number of Jupyter notebooks and then carry out the usual steps of cleaning and preparing the data.

![](../images/ops1.png)

But how are the models that a data science team creates used as part of a bigger product? The model-building journey doesn't end the moment a model is trained and reaches sufficient performance. The data exploration, data preparation and model training tasks need to be structured as separate, functioning modules.

![](../images/ops2.png)

## Training Pipeline

One of the first things to do is to create a modelling pipeline. A pipeline consists of the following components:

1. Reusable scripts to explore data
2. Reusable scripts to prepare data
3. Reusable scripts to train a model


To demonstrate how a training pipeline works, we will use a notebook that already contains the code for data exploration, preparation and model training.


<a href="Data Scientist Notebook.ipynb">Notebook</a>

Use the notebook linked above to create a training pipeline.

**1. Creating a data exploration script**

We will modify the code in the original notebook, more specifically the part where we do data exploration.

![](../images/ops3.png)

We will create a Python script to do data exploration for us. Below we describe some of the parts that will go into this script.

In [None]:
import pandas as pd
df = pd.read_csv("../data/credit.csv")

The typical scenario a data exploration script should handle is the arrival of new training or inference data, which needs a preliminary exploration.

For a similar but new dataset, we need our data exploration script to do the following:

1. Validate the column names and number
2. If the dataset is being used for model inference, validate the levels in a categorical variable
3. Save the data exploration plots given in the original notebook
4. Save a report on missing values
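
As a rough illustration of items 1, 2 and 4 above, the checks could look like the sketch below. It is not the notebook's code: the function names (`validate_columns`, `validate_levels`, `missing_value_report`), the toy column names, and the idea of passing the expected columns and allowed levels in as arguments are all assumptions made for this example.

```python
import pandas as pd


def validate_columns(df, expected_columns):
    """Return a list of problems with the column names or count (empty if none)."""
    problems = []
    if len(df.columns) != len(expected_columns):
        problems.append(
            f"expected {len(expected_columns)} columns, found {len(df.columns)}"
        )
    missing = [c for c in expected_columns if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if unexpected:
        problems.append(f"unexpected columns: {unexpected}")
    return problems


def validate_levels(df, allowed_levels):
    """Return, per categorical column, the levels that are not in the allowed set."""
    unseen = {}
    for column, levels in allowed_levels.items():
        if column not in df.columns:
            continue  # a missing column is reported by validate_columns
        extra = set(df[column].dropna().unique()) - set(levels)
        if extra:
            unseen[column] = sorted(extra)
    return unseen


def missing_value_report(df):
    """Count and percentage of missing values per column."""
    return pd.DataFrame(
        {
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(2),
        }
    )


# Toy batch standing in for newly arrived data; the column names are made up.
new_data = pd.DataFrame(
    {"age": [25, None, 40], "housing": ["own", "rent", "boat"]}
)
column_problems = validate_columns(new_data, ["age", "housing", "default"])
unseen_levels = validate_levels(new_data, {"housing": ["own", "rent"]})
report = missing_value_report(new_data)
```

Each function returns its findings instead of raising an error, so the script can decide whether to log them, stop, or write them out (for example with `report.to_csv(...)`).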

In [None]:
### Define a function to validate the column names and number (Instructor guided)


### Define a function to validate the categorical levels  (Instructor guided)


### Decile computation (Instructor guided)


### Function to create and save exploration plots (Instructor guided)
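
For the decile computation listed in the cell above, one possible approach is `pandas.qcut`. The sketch below is only an assumption about what that step might involve: the helper name `decile_summary` and the toy `amount` column are hypothetical, and the original notebook may compute deciles differently.

```python
import pandas as pd


def decile_summary(df, column):
    """Assign each row to a decile of `column` and summarise each decile."""
    # duplicates="drop" merges bins when repeated values make bin edges collide,
    # so fewer than ten groups can come back for heavily tied data.
    deciles = pd.qcut(df[column], q=10, labels=False, duplicates="drop")
    return (
        df.groupby(deciles)[column]
        .agg(["count", "min", "max", "mean"])
        .rename_axis("decile")
    )


# Toy data: the integers 1..100, which split into ten equal groups.
toy = pd.DataFrame({"amount": range(1, 101)})
summary = decile_summary(toy, "amount")
```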


**Python Data Exploration Script** {todo as class exercise guided by instructor}


**2. Creating Data Preparation Script**

This script should be able to:

- Handle missing values; in our specific case, drop them
- Create dummy variables
- Split data into target and predictor matrices
- Create a train test split
- Save the train/test data to assets/data folder
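
The steps listed above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function name `prepare_data`, the file names (`X_train.npy` and so on), the default split size, and the toy columns are all hypothetical, and the target is assumed to be numeric already.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data(df, target, out_dir="assets/data", test_size=0.25, seed=42):
    """Drop missing rows, one-hot encode, split, and save the arrays."""
    df = df.dropna()
    y = df[target].to_numpy()
    # One-hot encode categorical predictors; numeric columns pass through.
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)
    X_train, X_test, y_train, y_test = train_test_split(
        X.to_numpy(), y, test_size=test_size, random_state=seed
    )
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, array in [
        ("X_train", X_train), ("X_test", X_test),
        ("y_train", y_train), ("y_test", y_test),
    ]:
        np.save(out / f"{name}.npy", array)
    # The column order is returned so later data can be encoded the same way.
    return list(X.columns)


# Toy data with made-up columns; one row has a missing value and is dropped.
toy = pd.DataFrame(
    {
        "amount": [100.0, 250.0, None, 80.0, 300.0, 120.0, 90.0, 400.0, 60.0],
        "housing": ["own", "rent", "own", "rent", "own", "own", "rent", "rent", "own"],
        "default": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    }
)
with tempfile.TemporaryDirectory() as tmp:
    columns = prepare_data(toy, target="default", out_dir=tmp)
    train_shape = np.load(Path(tmp) / "X_train.npy").shape
    test_shape = np.load(Path(tmp) / "X_test.npy").shape
```

Returning the encoded column names is a design choice for this sketch: data prepared later, with fewer categorical levels, can then be aligned to the same columns.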

We will continue to use the `Data Scientist Notebook` as a reference to build the script.

In [None]:
## Create a function to create target and predictor matrices (Instructor guided)


## train test split (Instructor guided)

## save train test data as numpy arrays (Instructor guided)

**3. Creating model training script**

Lastly, we can create a model training script. Once `exploration.py` and `prep.py` have run, we can run a `train.py` script.

This script should be able to:

- Read the data prepared and stored in the `assets/data/` directory.
- Train a decision tree model using grid search.
- Save the trained model and classification metrics in the `assets/model` directory.
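
A minimal sketch of these three steps follows, using scikit-learn's `GridSearchCV` and `DecisionTreeClassifier`. The function name `train_and_save`, the file names (`X_train.npy`, `model.pkl`, `metrics.json`), the parameter grid, and the use of `pickle` for serialization are assumptions for illustration; `joblib` is a common alternative for saving models.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def train_and_save(data_dir="assets/data", model_dir="assets/model",
                   param_grid=None, cv=3):
    """Grid-search a decision tree, then save the best model and its metrics."""
    data_dir, model_dir = Path(data_dir), Path(model_dir)
    X_train = np.load(data_dir / "X_train.npy")
    y_train = np.load(data_dir / "y_train.npy")
    X_test = np.load(data_dir / "X_test.npy")
    y_test = np.load(data_dir / "y_test.npy")

    if param_grid is None:  # hypothetical grid, chosen only for the example
        param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=cv)
    search.fit(X_train, y_train)

    metrics = classification_report(
        y_test, search.predict(X_test), output_dict=True, zero_division=0
    )
    metrics["best_params"] = search.best_params_

    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "model.pkl", "wb") as f:
        pickle.dump(search.best_estimator_, f)
    with open(model_dir / "metrics.json", "w") as f:
        json.dump(metrics, f, indent=2, default=float)
    return metrics


# Toy run in a temporary directory with synthetic, well-separated data.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    X = np.linspace(-1, 1, 60).reshape(-1, 1)
    y = (X[:, 0] > 0).astype(int)
    np.save(data_dir / "X_train.npy", X[::2])
    np.save(data_dir / "y_train.npy", y[::2])
    np.save(data_dir / "X_test.npy", X[1::2])
    np.save(data_dir / "y_test.npy", y[1::2])
    metrics = train_and_save(data_dir, Path(tmp) / "model")
    model_saved = (Path(tmp) / "model" / "model.pkl").exists()
```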


In [1]:
## Read data saved in the data directory (Instructor guided)

## Model train function (Instructor guided)

## Model Save function (Instructor guided)

## Calculate metrics (Instructor guided)

## Save metrics (Instructor guided)

Once we have an ML training pipeline, we can train models quickly. But another aspect of operationalizing ML is having an inference pipeline.

An inference pipeline should be able to do the following:

1. Take inference data and do the necessary data prep
2. Read serialized model and run predictions
3. Save summary stats of the data and predictions with a date-time stamp


Below we build an inference pipeline:

1. We will modify `prep.py` and `exploration.py` to suit our inference requirements. At inference time we don't have access to the target variable.
2. We will also create a new `infer.py` script to perform inference and save results.
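
One way the inference-time prep might look is sketched below. Because there is no target column, the function only builds the predictor matrix. One-hot encoding a small batch can produce fewer dummy columns than training did, so the sketch aligns the result to the training columns. The name `prepare_inference_data`, the `train_columns` argument, the output file names, and the time-stamp format are all assumptions for this example.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import numpy as np
import pandas as pd


def prepare_inference_data(df, train_columns, out_root="assets/inference"):
    """Encode inference data like the training data and save it with summary stats."""
    df = df.dropna()  # mirrors the training prep; dropped rows get no prediction
    X = pd.get_dummies(df, dtype=float)
    # Dummy columns absent from this batch are added as zeros; extras are dropped.
    X = X.reindex(columns=train_columns, fill_value=0.0)

    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(out_root) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    np.save(run_dir / "X_infer.npy", X.to_numpy())
    df.describe(include="all").to_csv(run_dir / "summary_stats.csv")
    return run_dir


# Toy batch in which the "own" housing level never appears.
with tempfile.TemporaryDirectory() as tmp:
    new_batch = pd.DataFrame(
        {"amount": [120.0, 80.0], "housing": ["rent", "rent"]}
    )
    run_dir = prepare_inference_data(
        new_batch, ["amount", "housing_own", "housing_rent"], out_root=tmp
    )
    X_infer = np.load(run_dir / "X_infer.npy")
    saved_files = sorted(p.name for p in run_dir.iterdir())
```

Dropping rows with missing values keeps the prep consistent with training, but it also means those rows receive no prediction; whether that is acceptable depends on the use case.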


In [None]:
## validate inference data (Instructor guided)

## validate cat levels (Instructor guided)

## create X matrix (Instructor guided)

## get summary stats (Instructor guided)

## save prepared data and summary stats in an appropriately time-stamped directory (Instructor guided)

Finally, we can write an `infer.py` script. This script should:

1. Read the prepared inference data
2. Read the saved model
3. Do inference
4. Save inference results
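
The four steps above might be sketched as follows. The function name `run_inference`, the file names (`X_infer.npy`, `model.pkl`, `predictions.csv`), and the use of `pickle` are assumptions for this example, and the tiny decision tree is trained here only so that the sketch has a model to load. Unpickle only model files that come from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def run_inference(run_dir, model_path="assets/model/model.pkl"):
    """Load prepared data and a saved model, predict, and save the results."""
    run_dir = Path(run_dir)
    X = np.load(run_dir / "X_infer.npy")
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    predictions = model.predict(X)
    out_path = run_dir / "predictions.csv"
    pd.DataFrame({"prediction": predictions}).to_csv(out_path, index=False)
    return out_path


# Toy run: fit and pickle a tiny model, then score two rows with it.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    toy_model = DecisionTreeClassifier(random_state=0).fit(
        [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
    )
    with open(tmp / "model.pkl", "wb") as f:
        pickle.dump(toy_model, f)
    np.save(tmp / "X_infer.npy", np.array([[0.0], [3.0]]))
    out_path = run_inference(tmp, model_path=tmp / "model.pkl")
    predictions = pd.read_csv(out_path)["prediction"].tolist()
```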

In [None]:
## load model (Instructor guided)

## load data (Instructor guided)

## get predictions (Instructor guided)

## save predictions (Instructor guided)