# Project: Managing the ModelLife Cycle with MLflow and GCP

Create a End-to-End ModelLife cycle that includes pre-processing steps, the optimal ML algorithm and hyperparameters, and post-processing logic.
At the end, I'll create a model package which I'll store in a Storage Cloud on GCP

### References:

  - https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.evaluation

### Author: 

  - **Nardini, Ivan - Sr. Customer Advisor | CI & Analytics Team | ModelOps & Decisioning**

## Install last version of MLFlow

In [3]:
# This installs MLflow on Databricks Runtime

dbutils.library.installPyPI("xgboost", "1.0.2")
dbutils.library.installPyPI("mlflow", "1.7.0")
dbutils.library.restartPython()

## Import Libraries

In [5]:
#Starting libraries
import numpy as np
import pandas as pd
import pyspark

#Machine Learning libraries
import sklearn
import xgboost

#Charts library
import matplotlib.pyplot as plt
import seaborn as sns

#MLflow
import mlflow.spark
import mlflow.xgboost

#utils
import urllib
import warnings

Boston House Prices
-------------------
[https://archive.ics.uci.edu/ml/machine-learning-databases/housing/]( https://archive.ics.uci.edu/ml/machine-learning-databases/housing/)

Contains information collected by the U.S. Census Service regarding housing in the Boston, Massachusetts area.

Originally published by Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.

Rows: 506  

|Column|Type|Description        |
|------| :---: |----------------|
|crim|float|per capita crime rate by town|
|zn|float|proportion of residential land zoned for lots over 25,000 sq.ft|
|indus|float|proportion of non-retail business acres per town|
|chas|int|Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)|
|nox|float|nitric oxides concentration (parts per 10 million)|
|rm|float|average number of rooms per dwelling|
|age|float|proportion of owner-occupied units built prior to 1940|
|dis|float|weighted distances to five Boston employment centres|
|rad|float|index of accessibility to radial highways|
|tax|float|full-value property-tax rate per $10,000|
|ptratio|float|pupil-teacher ratio by town|
|b|float|1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town|
|lstat|float|% lower status of the population|
|medv|float|median value of owner-occupied homes in $1000’s|

## Spark session

In the Databricks notebook, when you create a cluster, the SparkSession is created for you. In both cases it’s accessible through a variable called spark.

In [8]:
spark

## Import Data

In [10]:
from urllib import request
request.urlretrieve("https://github.com/sassoftware/python-sasctl/raw/master/examples/data/boston_house_prices.csv","/tmp/boston_house_prices.csv")
dbutils.fs.mv("file:/tmp/boston_house_prices.csv","dbfs:/data/boston_house_prices.csv")

In [11]:
df = (spark.read
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/data/boston_house_prices.csv")
)

display(df)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9


## Explore Data

In [13]:
display(df.describe())

summary,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.6135235573122535,11.363636363636363,11.136778656126504,0.0691699604743083,0.5546950592885372,6.284634387351787,68.57490118577078,3.795042687747034,9.549407114624506,408.2371541501976,18.455533596837967,356.67403162055257,12.653063241106723,22.532806324110695
stddev,8.601545105332491,23.32245299451514,6.860352940897589,0.2539940413404101,0.1158776756675558,0.7026171434153232,28.148861406903595,2.10571012662761,8.707259384239366,168.53711605495903,2.164945523714445,91.29486438415782,7.141061511348571,9.197104087379817
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [14]:
#median value of owner-occupied homes in $1000’s
display(df[['medv']])

medv
24.0
21.6
34.7
33.4
36.2
28.7
22.9
27.1
16.5
18.9


In [15]:
#median value of owner-occupied homes in $1000’
#average number of rooms per dwelling

display(df[['medv', 'rm']])

medv,rm
24.0,6.575
21.6,6.421
34.7,7.185
33.4,6.998
36.2,7.147
28.7,6.43
22.9,6.012
27.1,6.172
16.5,5.631
18.9,6.004


In [16]:
# Look at other relationships
# crim - per capita crime rate by town
# lower - % lower status of the population


fig, ax = plt.subplots()
plotdf = df[["rm", "crim", "lstat", "medv", "rad", "tax"]].toPandas()

pd.plotting.scatter_matrix(plotdf)
# ax.set_title('Scatter plot')

display(fig.figure)

In [17]:
# Let's calculate correlation

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=list(df.columns), outputCol="features")
df_ftz = assembler.transform(df)

from pyspark.ml.stat import Correlation

pearsonCorr = Correlation.corr(df_ftz, 'features').collect()

corrdf = pd.DataFrame(pearsonCorr[0][0].toArray())

In [18]:
corrdf.index, corrdf.columns = df.columns, df.columns
fig, ax = plt.subplots()
sns.heatmap(corrdf)
display(fig.figure)

## Model Development and Model Tracking with Mlflow

We will fit: 

  - Baseline Model (by calculating the average housing value in the training dataset)

and then we challenge it with 

  - Linear Regression

In [20]:
# Train and Test splitting
train, test= df.randomSplit([0.7, 0.3], seed=12345)

print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

In [21]:
#Baseline model

from pyspark.sql.functions import avg
from pyspark.sql.functions import lit

fit = train.groupby().avg('medv').collect()[0][0]
print("Average home value: {}".format(fit))

predict = test.withColumn("prediction", lit(fit))
display(predict)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv,prediction
0.01311,90.0,1.22,0,0.403,7.249,21.9,8.6966,5,226,17.9,395.93,4.81,35.4,22.8909090909091
0.0136,75.0,4.0,0,0.41,5.888,47.6,7.3197,3,469,21.1,396.9,14.8,18.9,22.8909090909091
0.01432,100.0,1.32,0,0.411,6.816,40.5,8.3248,5,256,15.1,392.9,3.95,31.6,22.8909090909091
0.01439,60.0,2.93,0,0.401,6.604,18.8,6.2196,1,265,15.6,376.7,4.38,29.1,22.8909090909091
0.01951,17.5,1.38,0,0.4161,7.104,59.5,9.2229,3,216,18.6,393.24,8.05,33.0,22.8909090909091
0.01965,80.0,1.76,0,0.385,6.23,31.5,9.0892,1,241,18.2,341.6,12.93,20.1,22.8909090909091
0.02009,95.0,2.68,0,0.4161,8.034,31.9,5.118,4,224,14.7,390.55,2.88,50.0,22.8909090909091
0.02187,60.0,2.93,0,0.401,6.8,9.9,6.2196,1,265,15.6,393.37,5.03,31.1,22.8909090909091
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,22.8909090909091
0.03359,75.0,2.95,0,0.428,7.024,15.8,5.4011,3,252,18.3,395.62,1.98,34.9,22.8909090909091


In [22]:
# Evaluate BaseModel

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="medv")
rmse = evaluator.evaluate(predict)
mse = evaluator.evaluate(predict, {evaluator.metricName: "mse"})
r2 = evaluator.evaluate(predict, {evaluator.metricName: "r2"})
mae = evaluator.evaluate(predict, {evaluator.metricName: "mae"})

print("rmse on the test set for the baseline model: {}".format(rmse))
print("mse on the test set for the baseline model: {}".format(mse))
print("r2 on the test set for the baseline model: {}".format(r2))
print("mae on the test set for the baseline model: {}".format(mae))

In [23]:
# Track the Baseline experiment

from mlflow import log_metric,  log_artifact

with mlflow.start_run(run_name="Basic RF Experiment") as run:
  
  # Log a metrics
  log_metric("rmse", rmse)
  log_metric("mse", mse)
  log_metric("r2", r2)
  log_metric("mae", mae)
  
  #Log artefacts (Scored Test data)
  scored_df = predict.toPandas()
  scored_df.to_csv('scored_df.csv')
  log_artifact("scored_df.csv")

  runID = run.info.run_uuid
  experimentID = run.info.experiment_id
  
  print("Inside MLflow Run with run_id {} and experiment_id {}".format(runID, experimentID))

In [24]:
# Update parameter : random seed, split

## Logistic Regression Model

In [26]:
features = df.schema.names[:-1]
assembler_features = VectorAssembler(inputCols=features, outputCol="features")
abt_train = assembler_features.transform(train)
abt_test = assembler_features.transform(test)

#display
display(abt_train)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv,features
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,"List(1, 13, List(), List(0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98))"
0.00906,90.0,2.97,0,0.4,7.088,20.8,7.3073,1,285,15.3,394.72,7.85,32.2,"List(1, 13, List(), List(0.00906, 90.0, 2.97, 0.0, 0.4, 7.088, 20.8, 7.3073, 1.0, 285.0, 15.3, 394.72, 7.85))"
0.01096,55.0,2.25,0,0.389,6.453,31.9,7.3073,1,300,15.3,394.72,8.23,22.0,"List(1, 13, List(), List(0.01096, 55.0, 2.25, 0.0, 0.389, 6.453, 31.9, 7.3073, 1.0, 300.0, 15.3, 394.72, 8.23))"
0.01301,35.0,1.52,0,0.442,7.241,49.3,7.0379,1,284,15.5,394.74,5.49,32.7,"List(1, 13, List(), List(0.01301, 35.0, 1.52, 0.0, 0.442, 7.241, 49.3, 7.0379, 1.0, 284.0, 15.5, 394.74, 5.49))"
0.01381,80.0,0.46,0,0.422,7.875,32.0,5.6484,4,255,14.4,394.23,2.97,50.0,"List(1, 13, List(), List(0.01381, 80.0, 0.46, 0.0, 0.422, 7.875, 32.0, 5.6484, 4.0, 255.0, 14.4, 394.23, 2.97))"
0.01501,80.0,2.01,0,0.435,6.635,29.7,8.344,4,280,17.0,390.94,5.99,24.5,"List(1, 13, List(), List(0.01501, 80.0, 2.01, 0.0, 0.435, 6.635, 29.7, 8.344, 4.0, 280.0, 17.0, 390.94, 5.99))"
0.01501,90.0,1.21,1,0.401,7.923,24.8,5.885,1,198,13.6,395.52,3.16,50.0,"List(1, 13, List(), List(0.01501, 90.0, 1.21, 1.0, 0.401, 7.923, 24.8, 5.885, 1.0, 198.0, 13.6, 395.52, 3.16))"
0.01538,90.0,3.75,0,0.394,7.454,34.2,6.3361,3,244,15.9,386.34,3.11,44.0,"List(1, 13, List(), List(0.01538, 90.0, 3.75, 0.0, 0.394, 7.454, 34.2, 6.3361, 3.0, 244.0, 15.9, 386.34, 3.11))"
0.01709,90.0,2.02,0,0.41,6.728,36.1,12.1265,5,187,17.0,384.46,4.5,30.1,"List(1, 13, List(), List(0.01709, 90.0, 2.02, 0.0, 0.41, 6.728, 36.1, 12.1265, 5.0, 187.0, 17.0, 384.46, 4.5))"
0.01778,95.0,1.47,0,0.403,7.135,13.9,7.6534,3,402,17.0,384.3,4.45,32.9,"List(1, 13, List(), List(0.01778, 95.0, 1.47, 0.0, 0.403, 7.135, 13.9, 7.6534, 3.0, 402.0, 17.0, 384.3, 4.45))"


In [27]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol = 'medv', maxIter=10)
lrModel = lr.fit(abt_train)

In [28]:
beta = pd.DataFrame(np.sort(lrModel.coefficients), columns=['betacoeff'])
beta['coeffnames'] = features
display(beta)

betacoeff,coeffnames
-17.46888971786554,crim
-1.4909952525185537,zn
-0.9718883999163934,indus
-0.5831414916337778,chas
-0.0933009852194109,nox
-0.0119657707014454,rm
0.0081831176249824,age
0.0082449067875343,dis
0.0254898486681503,rad
0.0381002798501165,tax


In [29]:
# Make predictions
predictions = lrModel.transform(abt_test)
display(predictions.select('medv', 'prediction'))

medv,prediction
35.4,30.540475426489916
18.9,14.616801123574568
31.6,32.683946496330144
29.1,31.44088077115888
33.0,23.496332827277648
20.1,19.34107271420518
50.0,42.87656631565828
31.1,31.870648246118343
34.7,31.245246609431163
34.9,33.97465987255054


In [30]:
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="medv")
rmse = evaluator.evaluate(predict)
mse = evaluator.evaluate(predict, {evaluator.metricName: "mse"})
r2 = evaluator.evaluate(predict, {evaluator.metricName: "r2"})
mae = evaluator.evaluate(predict, {evaluator.metricName: "mae"})

print("rmse on the test set for the baseline model: {}".format(rmse))
print("mse on the test set for the baseline model: {}".format(mse))
print("r2 on the test set for the baseline model: {}".format(r2))
print("mae on the test set for the baseline model: {}".format(mae))