# MLFLOW - Deploying Machine Learning in Production

In this assignment you will be writing a script that train models and use `mlflow` to submit runs. 

In [1]:
%%writefile ./new_data.json
{"age": {"0": 40, "1": 47},
 "balance": {"0": 580, "1": 3644},
 "campaign": {"0": 1, "1": 2},
 "contact": {"0": "unknown", "1": "unknown"},
 "day": {"0": 16, "1": 9},
 "default": {"0": "no", "1": "no"},
 "duration": {"0": 192, "1": 83},
 "education": {"0": "secondary", "1": "secondary"},
 "housing": {"0": "yes", "1": "no"},
 "job": {"0": "blue-collar", "1": "services"},
 "loan": {"0": "no", "1": "no"},
 "marital": {"0": "married", "1": "single"},
 "month": {"0": "may", "1": "jun"},
 "pdays": {"0": -1, "1": -1},
 "poutcome": {"0": "unknown", "1": "unknown"},
 "previous": {"0": 0, "1": 0}}

Overwriting ./new_data.json


In [2]:
#Load all necessary libraries
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib
import json

# Load Dataset
bank = pd.read_csv('bank-full.csv', delimiter = ';')

# Split data between train and validation
X_train, X_test, y_train, y_test = train_test_split(bank.drop(columns = "y"), bank["y"], 
                                                    test_size = 0.10, random_state = 42)

X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)
onehoter = OneHotEncoder(handle_unknown='ignore', sparse=False)

In [3]:
# bank.info()
# bank.describe()
# bank.head()

Question 1: Create pre-processing function to be later used as part of the pipeline (custom transformer)

In [4]:
def train_transformations(df):
    cat_cols = X_train.select_dtypes(['object']).columns
    onehoter.fit(X_train[cat_cols])
    onehot_cols = [f'{col}_{cat}' for i, col in enumerate(X_train[cat_cols].columns) for cat in onehoter.categories_[i]]
    res = onehoter.transform(df[cat_cols])
    df_onehot = pd.DataFrame(res, columns=onehot_cols)
   
    num_cols = X_train.select_dtypes(['integer', 'float']).columns
    znormalizer = StandardScaler()
    znormalizer.fit(X_train[num_cols])
    df_norm = znormalizer.transform(df[num_cols])
    df_norm = pd.DataFrame(df_norm, columns=num_cols)

    df_featurized = df_onehot 
    df_featurized[num_cols] = df_norm 

    del df_onehot, df_norm
    return df_featurized

Question 2: Creating a custom transformer from the previously defined function

In [5]:
pre_processing = FunctionTransformer(train_transformations)

Question 3: Creating the pipeline and defining each of two steps: (i) pre-processing, and; (ii) model (Logistic)

In [6]:
pipeline = Pipeline(steps=[
    ["data_pre_processing", pre_processing],
    ["model", LogisticRegression()]
], verbose=True)

In [7]:
# pipeline.get_params().keys()

Question 4: Call `fit` and `predict` on the pipeline to make sure that it all works. Remember to pass them the **un-processed** (original) data, since the data processing should be built into the pipeline now.

In [8]:
#Set parameters for Logistic Regression estimator ('model') inside the pipeline
pipeline.set_params(model__C=1.0,                 # C: default=1.0
                    model__solver='liblinear',   # solver: {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
                    model__max_iter=100,         # max_iter: default=100
                    model__fit_intercept=True,   # fit_intercept:{True, False}, default=True
                    model__penalty='l2')         # penalty: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’ 
                                                 # Warning: The choice of the algorithm depends on the penalty chosen. 
                                                 #          Not all algorithms support every type of penalty 

#Fit Training Data to Model
pipeline.fit(X_train, y_train)

#Prediction on Training and Test Data
# y_train_pred = #To Do
# y_test_pred = #To Do



[Pipeline]  (step 1 of 2) Processing data_pre_processing, total=   0.1s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.2s


Pipeline(steps=[('data_pre_processing',
                 FunctionTransformer(func=<function train_transformations at 0x7fee48b4baf0>)),
                ['model', LogisticRegression(solver='liblinear')]],
         verbose=True)

Question 5: Evaluate your model by calculating the precision and recall.

In [9]:
#Create a function to evaluate the model performance using precision and recall
def eval_metrics(actual, pred):
    precision = precision_score(actual, pred, pos_label='yes')
    recall = recall_score(actual, pred, pos_label='yes')
    
    return precision, recall

#Calculation of evaluation metrics - Precision and Recall for training and validation data
y_pred_train = pd.Series(pipeline.predict(X_train))
(precision_train, recall_train) = eval_metrics(y_train, y_pred_train)
y_pred_test = pd.Series(pipeline.predict(X_test))
(precision_test, recall_test) = eval_metrics(y_test, y_pred_test)

# Print Model (Logistic Regression) parameters
print()
print('Main Parameters used in logistic regression are: C={}, solver={}, max_iter={}, fit_intercept={} and penalty={}'.format(pipeline['model'].get_params()['C'],
                                                                                                                             pipeline['model'].get_params()['solver'],
                                                                                                                             pipeline['model'].get_params()['max_iter'],
                                                                                                                             pipeline['model'].get_params()['fit_intercept'],
                                                                                                                             pipeline['model'].get_params()['penalty']))
# Print Evaluation Metrics for the Model (Logistic Regression)
print()
print('Precision = {:.2f}% and recall = {:.2f}% on the training data.'.format(precision_train, recall_train))
print('Precision = {:.2f}% and recall = {:.2f}% on the validation data.'.format(precision_test, recall_test))



Main Parameters used in logistic regression are: C=1.0, solver=liblinear, max_iter=100, fit_intercept=True and penalty=l2

Precision = 0.66% and recall = 0.35% on the training data.
Precision = 0.63% and recall = 0.34% on the validation data.


Question 6: Save your pipeline object using `joblib` as shown [here](https://sklearn.org/modules/model_persistence.html).

In [10]:
#store 'pipeline' as pickle file using joblib
joblib.dump(pipeline, 'pipeline.pkl')

['pipeline.pkl']

Question 7: Now write a **new script** for scoring: it loads the pipeline you saved in the last step, reads the data `../data/new_data.json` and converts it to a `pandas.DataFrame` object, and obtains predictions on it. The predictions should be stored as a `json` file `../data/new_preds.json`.

In [11]:
#Call and load stored 'pipeline' 
pipeline = joblib.load('pipeline.pkl')

#Read json file with new data and write into a pandas dataframe  
with open('./new_data.json', 'r') as f:
    data = json.load(f)
new_predictions = pd.DataFrame(data)

#Use predict method of pipeline to score (make prediction) on new data 
new_predictions['prediction'] = pd.Series(pipeline.predict(new_predictions))

#Write predictions of new data into a json file
new_predictions.to_json('./new_preds.json', orient='columns')

In [12]:
# Read json file containing predictions made for the new data and load them into a dataframe
with open('./new_preds.json', 'r') as f:
    data = json.load(f)
    
new_pred_dataframe= pd.DataFrame(data)

#Print predictions for each observation contained in the new_data.json file and the dataframe with the data and prediction
print(new_pred_dataframe['prediction'])
new_pred_dataframe

0    None
1    None
Name: prediction, dtype: object


Unnamed: 0,age,balance,campaign,contact,day,default,duration,education,housing,job,loan,marital,month,pdays,poutcome,previous,prediction
0,40,580,1,unknown,16,no,192,secondary,yes,blue-collar,no,married,may,-1,unknown,0,
1,47,3644,2,unknown,9,no,83,secondary,no,services,no,single,jun,-1,unknown,0,


Question 8: Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? what steps you took, what obstacles you encountered. how you link this exercise to real-world, machine learning problem-solving. (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.

Incoming experience: No incoming experience apart from previous assignments.

Steps taken: This week's lesson was about productionizing pipelines. Got a feel for how pipelines and joblib works and is used.

Obstacles: Figuring out some errors with inputs passed to libraries.

Link to real world: Helped me understand the different functions in Scikit learn and joblib.

Steps missing (with just this week's learning): MLFlow