## Machine Learning Model Building Pipeline: Wrapping up for Deployment


In the previous sections, we worked through a typical machine learning pipeline to build a regression model that predicts house prices. We carried out exploratory data analysis to understand the data, engineered the variables to make them suitable for a regression model, and selected a subset of variables using Lasso regression.

Now we need to deploy our model so that, when it receives new data, it can estimate the SalePrice from the characteristics of the house. The code needs some modification before it is suitable for production; those changes are shown next. In this section, we summarise the key parts of the code that will go into production.

### Setting the seed to ensure reproducibility

Because we are engineering variables and pre-processing data with the aim of deploying the model, we need to ensure reproducibility. Hence, for each step that includes some element of randomness, it is important that we set the seed.
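
As a small illustration of this point, the sketch below splits a synthetic array twice with the same `random_state` and once with a different one. The array and the seed values are placeholders chosen for the example, not values taken from the house price pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real feature matrix (assumption)
X = np.arange(20).reshape(10, 2)

# Same seed: the split is identical on every run
train_a, test_a = train_test_split(X, test_size=3, random_state=0)
train_b, test_b = train_test_split(X, test_size=3, random_state=0)

# Different seed: the split will usually differ
train_c, test_c = train_test_split(X, test_size=3, random_state=1)

print(np.array_equal(test_a, test_b))  # same seed, same rows
print(np.array_equal(test_a, test_c))  # different seed, typically different rows
```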

In [1]:
# To handle datasets. These are standard imports
import pandas as pd
import numpy as np
import os

# for plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

# to display data with high width
pd.set_option('display.width', 1000)

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to build the models
from sklearn.linear_model import Lasso

# to evaluate the models
from sklearn.metrics import mean_squared_error
from math import sqrt

# to persist the model and the scaler
# (sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly)
import joblib

In [2]:
# Make the output of the notebook stable across runs by setting the random seed
np.random.seed(42)

# To make the pictures pretty
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Set up directories to work with datasets and images
PROJECT_ROOT_DIR = os.getcwd()
DATASET_FOLDER = "DataSets"    # Data goes into the DataSets folder
IMAGES_FOLDER = "IMAGES"       # Images go into IMAGES folder

DATASET_PATH = os.path.join(PROJECT_ROOT_DIR, DATASET_FOLDER)
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, IMAGES_FOLDER)

def load_data(filename, dataset_path=DATASET_PATH,**kwargs):
    """Helper Function to load data. Inputs are file name and directory where datasets are stored"""
    file_with_path = os.path.join(dataset_path, filename)
    return pd.read_csv(file_with_path,**kwargs)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    """Helper function to save the current figure. Inputs are the figure name, file extension and resolution (dpi)"""
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [3]:
# load dataset
data = load_data(filename="Housing_Data.csv")
print(data.shape)
data.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [6]:
features = load_data(filename="selected_features.csv")
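
The cells that turn `features` into a list of column names and create `X_train`, `X_test`, `y_train` and `y_test` are not shown in this section. The sketch below is one way those steps might look. It assumes that `selected_features.csv` holds one feature name per row in a single column, and it uses a tiny synthetic frame, an assumed 10% test size and an assumed seed of 0 in place of the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for selected_features.csv: one feature name per row
features_df = pd.DataFrame({"feature": ["LotArea", "OverallQual", "GrLivArea"]})
features = features_df["feature"].tolist()  # plain list of column names

# Tiny synthetic frame standing in for the pre-processed housing data
rng = np.random.RandomState(0)
data_demo = pd.DataFrame(rng.rand(20, 3), columns=features)
data_demo["SalePrice"] = np.log(100000 + 50000 * rng.rand(20))  # log-scale target

# Fixing random_state makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    data_demo[features], data_demo["SalePrice"], test_size=0.1, random_state=0
)
print(X_train.shape, X_test.shape)
```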

In [34]:
# fit scaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X_train[features]) #  fit  the scaler to the train set for later use

# we persist the scaler for future use
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']
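
To show why the fitted scaler is saved, the sketch below persists a `MinMaxScaler`, loads it back and checks that the reloaded object transforms data identically. The synthetic values and the temporary directory are assumptions made for the example; the notebook itself writes `scaler.pkl` to the working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic training data standing in for X_train[features] (assumption)
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])

scaler_demo = MinMaxScaler()
scaler_demo.fit(X_demo)  # learns the per-column minimum and maximum

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(scaler_demo, path)  # persist the fitted scaler
    reloaded = joblib.load(path)    # what a scoring script would do

# The reloaded scaler carries the same learned statistics
new_rows = np.array([[2.5, 25.0]])
same_output = np.allclose(scaler_demo.transform(new_rows), reloaded.transform(new_rows))
print(same_output)
```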

In [35]:
# transform the train and test set with the fitted scaler, keeping the feature names as column names
X_train = pd.DataFrame(scaler.transform(X_train[features]), columns=features)
X_test = pd.DataFrame(scaler.transform(X_test[features]), columns=features)
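
For reference, `MinMaxScaler` with its default settings maps each column to (x - min) / (max - min), using the minimum and maximum learned from the training set. The sketch below reproduces that arithmetic with NumPy on made-up numbers and shows that a test value outside the training range lands outside [0, 1].

```python
import numpy as np

# Made-up single-column example (assumption, not the housing data)
train_col = np.array([50.0, 100.0, 150.0])
test_col = np.array([75.0, 200.0])

# The statistics come from the training data only
col_min, col_max = train_col.min(), train_col.max()

def min_max(x):
    """Scale x with the training minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max(train_col)
test_scaled = min_max(test_col)
print(train_scaled)
print(test_scaled)
```

This is why the scaler is fitted once on the training set and then reused unchanged: refitting on new data would change `col_min` and `col_max` and make the model's inputs inconsistent with what it was trained on.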

In [36]:
X_test.to_csv("fairmldataX.csv")
y_test.to_csv("fairmldataY.csv")

In [37]:
# train the model
lin_model = Lasso(alpha=0.005, random_state=0) # remember to set the random_state / seed
lin_model.fit(X_train, y_train)

# we persist the model for future use
joblib.dump(lin_model, 'lasso_regression.pkl')

['lasso_regression.pkl']
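
Once both artefacts are on disk, scoring new data amounts to loading them, scaling the incoming rows and undoing the log transform of the target. The sketch below illustrates that flow end to end on synthetic data. The function `predict_price`, the two feature names and the temporary file paths are hypothetical and only serve the example.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

feature_names = ["LotArea", "GrLivArea"]  # hypothetical subset of features

# Synthetic training data with a log-scale target, as in the notebook
rng = np.random.RandomState(0)
X_raw = pd.DataFrame(rng.uniform(500, 3000, size=(50, 2)), columns=feature_names)
y_log = np.log(50000 + 60 * X_raw["GrLivArea"])

scaler_demo = MinMaxScaler().fit(X_raw)
model_demo = Lasso(alpha=0.005, random_state=0).fit(scaler_demo.transform(X_raw), y_log)

def predict_price(new_data, scaler_path, model_path):
    """Load the persisted scaler and model and return prices in original units."""
    scaler = joblib.load(scaler_path)
    model = joblib.load(model_path)
    scaled = scaler.transform(new_data[feature_names])
    return np.exp(model.predict(scaled))  # undo the log transform

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    model_path = os.path.join(tmp_dir, "lasso_regression.pkl")
    joblib.dump(scaler_demo, scaler_path)
    joblib.dump(model_demo, model_path)
    prices = predict_price(X_raw.head(3), scaler_path, model_path)

print(prices)
```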

In [None]:
! pip install mlflow

In [7]:
from mlflow import log_metric, log_param, log_artifact

In [None]:
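# start the MLflow tracking UI (serves on http://localhost:5000 by default; the cell blocks until interrupted)
! mlflow ui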

In [157]:
# evaluate the model:
# remember that we log transformed the output (SalePrice) in our feature engineering notebook / lecture.

# In order to get the true performance of the Lasso
# we need to transform both the target and the predictions
# back to the original house prices values.

# We will evaluate performance using the mean squared error and the
# root of the mean squared error

pred = lin_model.predict(X_train)
print('linear train mse: {}'.format(mean_squared_error(np.exp(y_train), np.exp(pred))))
print('linear train rmse: {}'.format(sqrt(mean_squared_error(np.exp(y_train), np.exp(pred)))))
print()
pred = lin_model.predict(X_test)
print('linear test mse: {}'.format(mean_squared_error(np.exp(y_test), np.exp(pred))))
print('linear test rmse: {}'.format(sqrt(mean_squared_error(np.exp(y_test), np.exp(pred)))))
print()
print('Median house price: ', np.exp(y_train).median())

linear train mse: 1311097338.457859
linear train rmse: 36209.0781221762

linear test mse: 1422327230.7941039
linear test rmse: 37713.75386770858

Median house price:  163000.00000000012
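
Because the model was trained on log(SalePrice), errors computed directly on its output are in log units. The sketch below, on made-up numbers, computes the RMSE both ways to show the difference. Only the back-transformed figure is expressed in the units of the original prices.

```python
import numpy as np

# Made-up true prices and predictions (assumption, not results from the notebook)
true_price = np.array([100000.0, 200000.0, 300000.0])
pred_price = np.array([110000.0, 190000.0, 300000.0])

# What the model actually sees and returns: log-scale values
y_log = np.log(true_price)
pred_log = np.log(pred_price)

# RMSE in log units
rmse_log = np.sqrt(np.mean((y_log - pred_log) ** 2))

# RMSE in price units, after undoing the log transform
rmse_price = np.sqrt(np.mean((np.exp(y_log) - np.exp(pred_log)) ** 2))

print(rmse_log)
print(rmse_price)
```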


In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X_train, y_train)

In [12]:
data = pd.read_csv("mlFlowData.csv")

In [13]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,LotShape,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,HalfBath,Functional,Fireplaces,GarageCars,WoodDeckSF,ScreenPorch,GarageYrBlt_na,LotFrontage,Saleprice
0,0.0,0,0.630999,0,0.555556,0.25,0.615942,0.540984,0.0,0.215982,0.764014,0.0,0.714182,0.333333,1.0,0.0,0,0.666667,0.5,0.0,0.0,0,0.388581,12.209188
1,0.176471,0,0.389061,0,0.555556,0.75,0.5,0.934426,0.0,0.071403,0.398758,0.331197,0.549294,0.333333,0.333333,0.0,1,0.666667,0.25,0.0,0.0,0,0.490408,11.798104
2,0.176471,0,0.329918,0,0.444444,0.375,0.565217,0.983607,0.100625,0.032778,0.406964,0.119658,0.453307,0.333333,0.333333,0.0,1,0.333333,0.25,0.0,0.0,0,0.388581,11.608236
3,0.235294,0,0.399404,0,0.666667,0.5,0.76087,0.52459,0.186875,0.069454,0.469855,0.462607,0.636999,0.0,0.666667,0.5,1,0.333333,0.5,0.336056,0.0,0,0.50869,12.165251
4,0.823529,0,0.050188,0,0.555556,0.5,0.717391,0.655738,0.238125,0.0,0.171149,0.302885,0.419061,0.0,0.333333,0.5,1,0.0,0.25,0.0,0.0,0,0.0,11.385092


In [17]:
# House price dataset Example
def train(n_estimators=10, max_features=10):
    import os
    import warnings
    import sys

    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import ElasticNet, LinearRegression
    from sklearn.ensemble import RandomForestRegressor


    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2


    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Read the pre-processed house price data
    # Assumes mlFlowData.csv is located in the same folder as the notebook
    data_file = "mlFlowData.csv"
    data = pd.read_csv(data_file)

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The target column is "Saleprice", which holds the log-transformed sale price
    train_x = train.drop(["Saleprice"], axis=1)
    test_x = test.drop(["Saleprice"], axis=1)
    train_y = train[["Saleprice"]]
    test_y = test[["Saleprice"]]

    # Useful for multiple runs (only doing one run in this sample notebook)    
    with mlflow.start_run():
        # Fit a RandomForestRegressor with the requested hyperparameters
        regressor = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features)
        regressor.fit(train_x, train_y)

        # Evaluate metrics on the test set (Saleprice is on the log scale)
        predictions = regressor.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predictions)

        # Print out metrics
        print(f"RF regression model, n_estimators = {n_estimators}, max_features = {max_features}")
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        # Log parameter, metrics, and model to MLflow
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_features", max_features)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        mlflow.sklearn.log_model(regressor, "model")

In [19]:
train(n_estimators=10, max_features=8)

RF regression model, n_estimators = 10, max_features = 8
  RMSE: 0.1525051477189051
  MAE: 0.1159105750216217
  R2: 0.8044182852620506


In [20]:
train(n_estimators=10, max_features=15)

RF regression model, n_estimators = 10, max_features = 15
  RMSE: 0.15640975934024257
  MAE: 0.11996106355405392
  R2: 0.7942750627754569


In [21]:
train(n_estimators=15, max_features=5)

RF regression model, n_estimators = 15, max_features = 5
  RMSE: 0.15694185055941098
  MAE: 0.11810069520120126
  R2: 0.792872968341568


In [22]:
train(n_estimators=15, max_features=10)

RF regression model, n_estimators = 15, max_features = 10
  RMSE: 0.14792411664884944
  MAE: 0.1132494029009014
  R2: 0.8159917852242475


In [23]:
train(n_estimators=15, max_features=15)

RF regression model, n_estimators = 15, max_features = 15
  RMSE: 0.13971426965873598
  MAE: 0.10317702847747703
  R2: 0.8358500432616753


In [24]:
train(n_estimators=15, max_features=18)

RF regression model, n_estimators = 15, max_features = 18
  RMSE: 0.13797142951282002
  MAE: 0.10482849717837828
  R2: 0.8399198174427551
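
The six runs can be compared side by side by collecting their printed metrics into a DataFrame. The sketch below copies the numbers from the outputs in this section; the column names and the choice to rank by RMSE are assumptions made for the example.

```python
import pandas as pd

# Metrics copied from the six runs printed in this section
runs = pd.DataFrame(
    [
        (10, 8, 0.1525051477189051, 0.1159105750216217, 0.8044182852620506),
        (10, 15, 0.15640975934024257, 0.11996106355405392, 0.7942750627754569),
        (15, 5, 0.15694185055941098, 0.11810069520120126, 0.792872968341568),
        (15, 10, 0.14792411664884944, 0.1132494029009014, 0.8159917852242475),
        (15, 15, 0.13971426965873598, 0.10317702847747703, 0.8358500432616753),
        (15, 18, 0.13797142951282002, 0.10482849717837828, 0.8399198174427551),
    ],
    columns=["n_estimators", "max_features", "rmse", "mae", "r2"],
)

# Lower RMSE is better, so sort in ascending order
ranked = runs.sort_values("rmse").reset_index(drop=True)
print(ranked)
```

On these numbers the run with `n_estimators=15` and `max_features=18` has the lowest RMSE, while `max_features=15` has the lowest MAE. All three metrics are computed on the log-scale target, and each setting was evaluated on a single random split, so small differences between runs should be read with caution.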
