## Serialization of Machine Learning Models

Serialization is the process of converting a Python object into a stream of bytes so that it can be stored in memory, a database, or a file, or transmitted elsewhere. Its main objective is to save the state of an object so that it can be recreated when needed. The reverse process is deserialization. This article covers JSON, pickle, and joblib serialization.
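
As a minimal sketch of the round trip described above, the snippet below turns a small dictionary into bytes with pickle and restores it. The dictionary `state` and its contents are made up for illustration.

```python
import pickle

# Hypothetical object whose state we want to preserve
state = {"name": "demo", "weights": [0.5, 1.5, 2.5]}

# Serialization: Python object -> bytes
payload = pickle.dumps(state)

# Deserialization: bytes -> a new, equal Python object
restored = pickle.loads(payload)

print(type(payload).__name__)   # bytes
print(restored == state)        # True
print(restored is state)        # False, it is a separate copy
```

The restored object is equal to the original but is a distinct object, which is what recreating the state means in practice.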
The example uses a car price dataset, which is clean and needs little preprocessing.

In [1]:
# Importing the required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
# loading the data
autos = pd.read_csv(r'C:\Users\Chuks\datasets\CarPrice_Assignment.csv')
autos.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |


In [3]:
autos.shape

(205, 26)

In [4]:
autos.columns

Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')

In [5]:
# Dropping irrelevant columns
autos.drop(['car_ID', 'symboling', 'CarName'], axis=1, inplace=True)

In [6]:
# Checking for null values
autos.isnull().values.any()

False

In [7]:
# Converting categorical data to numeric data using one-hot encoding
autos = pd.get_dummies(autos)

In [8]:
autos.shape

(205, 52)

The data set has 205 records and 26 columns. We dropped 3 of them, leaving 23. There are no null values in the data, as the isnull() call shows. Because machine learning algorithms generally require numeric input, we converted the categorical features to numeric ones with one-hot encoding by calling the get_dummies() function. This increased the number of columns from 23 to 52 (51 features plus the price target).
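
To see why one-hot encoding increases the column count, here is a small sketch on a toy frame. The values are made up, and only the column names borrow from the car dataset.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and two categorical columns
toy = pd.DataFrame({
    "horsepower": [111, 154, 102],
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "sedan"],
})

encoded = pd.get_dummies(toy)

# Numeric columns pass through unchanged; each categorical column is
# replaced by one indicator column per distinct category.
print(list(encoded.columns))
print(toy.shape, "->", encoded.shape)
```

Each categorical column with k distinct values becomes k indicator columns, so two categorical columns with two values each turn 3 columns into 5.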

In [9]:
# Selecting the features and target
X = autos.drop('price', axis=1)
Y = autos['price']

In [10]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [11]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((164, 51), (41, 51), (164,), (41,))
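
The split above is random, so the scores reported in this article will differ from run to run. A minimal sketch of making it reproducible follows, assuming an arbitrary seed of 42 and placeholder arrays in place of the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset above
X_demo = np.arange(205 * 2).reshape(205, 2)
y_demo = np.arange(205)

# random_state=42 is an arbitrary, assumed seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same seed -> identical test rows on every run
print(np.array_equal(split_a[1], split_b[1]))   # True
print(split_a[0].shape, split_a[1].shape)       # (164, 2) (41, 2)
```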

In [12]:
model = LinearRegression().fit(x_train, y_train)

In [13]:
Training_score = model.score(x_train, y_train)
Training_score

0.9445695211160382

In [14]:
y_pred = model.predict(x_test)
y_pred

array([ 6863.62855482, 21566.20178414, 11408.97411649,  5985.98770385,
        9468.94096303, 14340.78835904,  7739.29658797, 36655.82084638,
        7445.25467913,  9627.28951173,  4008.7947898 , 10375.29411098,
        7545.67645761,  9352.85961158, 16602.53073294,  6731.29361765,
        1950.47008273,  5267.3968403 , 34028.        ,  4606.80719846,
        9722.77675454,  9126.86254149,  6212.50208146, 20744.80630235,
        5110.82639683,  9768.54984982, 18391.5718596 ,  9763.09574787,
        5829.37005383,  6210.3225186 , 33559.77312113, 16233.83973568,
       13185.51774306, 18033.56617963, 28845.24591507, 10063.65792034,
       10522.65415119, 10586.04337225, 14003.72864568,  6257.37536935,
       15416.73855962])

In [15]:
Testing_score = r2_score(y_test, y_pred)
Testing_score

0.8163709811361523

Our target (label) is the price column because it is what we want to predict. The data was split into training and testing sets in an 80:20 ratio. As a result, 164 records with 51 features were used for training, while 41 records with 51 features were used for testing. After fitting, the model produced an R-squared score of about 0.94 on the training data and about 0.82 on the testing data.
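
For reference, the R-squared score used above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that formula against r2_score on made-up numbers. For regressors such as LinearRegression, model.score returns the same quantity.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both values should agree (0.9625 for these numbers)
print(r2_manual, r2_score(y_true, y_hat))
```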

## Serializing our Model 

For a simple linear regression model, the only parameters we need to save to disk are the model coefficients and the intercept. We can use JSON because we know exactly which parameters to save. For more complex models, where we do not know which parameters must be saved, JSON is a poor fit.
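
The reason the coefficients and intercept are sufficient is that a linear model's prediction is a dot product plus a constant. The sketch below verifies this on synthetic data. The arrays and the `fitted` name are assumptions for illustration, not part of the car example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): 50 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

fitted = LinearRegression().fit(X_demo, y_demo)

# A linear model's prediction is just a dot product plus the intercept
manual = X_demo @ fitted.coef_ + fitted.intercept_

print(np.allclose(manual, fitted.predict(X_demo)))   # True
```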

In [16]:
import json

In [17]:
model.coef_

array([ 4.07552804e+01, -2.98927172e-01,  5.20655385e+02,  8.10359824e+01,
        2.79482330e+00,  1.94335957e+02, -7.55323689e+03, -7.69760029e+03,
       -8.91753741e+02,  1.54857694e+01,  2.68542580e+00, -3.53136714e+02,
        3.44366169e+02,  4.89253709e+03, -4.89253709e+03, -9.00049769e+02,
        9.00049769e+02, -1.60643371e+02,  1.60643371e+02,  2.33461986e+03,
       -1.70228766e+02, -8.21648133e+02, -1.41806383e+02, -1.20093657e+03,
        5.09810471e+02, -6.42262885e+02,  1.32452414e+02, -2.56113612e+03,
        2.56113612e+03, -4.65923313e+02, -1.78642537e+03, -1.80297955e+03,
        3.50856111e+03,  2.25939285e+03, -5.30511116e+03,  3.59248544e+03,
        8.51720193e+02, -1.65729103e+03,  4.01724465e+02, -5.02478144e+02,
        1.31829432e+04, -1.58691041e+04,  3.59248544e+03, -3.14132217e+02,
       -3.51503935e+02,  3.59248544e+03,  4.89253709e+03, -3.66081079e+03,
       -4.38251651e+02, -3.46173787e+03, -2.58586066e+02])

In [18]:
model.intercept_

-13818.542782425691

In [19]:
# Converting the coefficients and intercept to plain Python values and saving them in a dictionary
model_param = {}

model_param['coef'] = model.coef_.tolist()
model_param['intercept'] = model.intercept_.tolist()

model_param

{'coef': [40.755280367373025,
  -0.29892717234860733,
  520.6553845638124,
  81.03598240961553,
  2.7948232978207557,
  194.33595741236456,
  -7553.236891853626,
  -7697.600290935531,
  -891.7537410444038,
  15.485769425219587,
  2.6854257992904422,
  -353.13671367317875,
  344.3661688915354,
  4892.537090397717,
  -4892.537090397679,
  -900.0497692907209,
  900.0497692906896,
  -160.64337068410146,
  160.6433706841133,
  2334.6198557257526,
  -170.22876554892156,
  -821.6481329834207,
  -141.80638256570643,
  -1200.9365746277256,
  509.81047094994875,
  -642.2628850939834,
  132.4524141440511,
  -2561.1361194170813,
  2561.1361194170886,
  -465.9233125860318,
  -1786.4253748033466,
  -1802.9795521690273,
  3508.5611097725064,
  2259.3928518960515,
  -5305.111163675709,
  3592.4854415655127,
  851.7201931846895,
  -1657.291033964342,
  401.7244650281545,
  -502.47814399988374,
  13182.943171855353,
  -15869.1040936695,
  3592.485441565521,
  -314.1322169873389,
  -351.5039345698659,
  

In [20]:
# json.dumps (the 's' stands for string) serializes the dictionary to a JSON string
json_txt = json.dumps(model_param, indent=4)
json_txt

'{\n    "coef": [\n        40.755280367373025,\n        -0.29892717234860733,\n        520.6553845638124,\n        81.03598240961553,\n        2.7948232978207557,\n        194.33595741236456,\n        -7553.236891853626,\n        -7697.600290935531,\n        -891.7537410444038,\n        15.485769425219587,\n        2.6854257992904422,\n        -353.13671367317875,\n        344.3661688915354,\n        4892.537090397717,\n        -4892.537090397679,\n        -900.0497692907209,\n        900.0497692906896,\n        -160.64337068410146,\n        160.6433706841133,\n        2334.6198557257526,\n        -170.22876554892156,\n        -821.6481329834207,\n        -141.80638256570643,\n        -1200.9365746277256,\n        509.81047094994875,\n        -642.2628850939834,\n        132.4524141440511,\n        -2561.1361194170813,\n        2561.1361194170886,\n        -465.9233125860318,\n        -1786.4253748033466,\n        -1802.9795521690273,\n        3508.5611097725064,\n        2259.39285189

In [21]:
# Create a 'models' folder in the working directory and write ('w') the JSON text into it
import os
os.makedirs('models', exist_ok=True)
with open('models/regressor_param.txt', 'w') as file:
    file.write(json_txt)

In [22]:
# Open the file with 'r' for reading; json.load converts its contents to a Python dictionary
with open('models/regressor_param.txt', 'r') as file:
    json_text = json.load(file)

In [23]:
json_text

{'coef': [40.755280367373025,
  -0.29892717234860733,
  520.6553845638124,
  81.03598240961553,
  2.7948232978207557,
  194.33595741236456,
  -7553.236891853626,
  -7697.600290935531,
  -891.7537410444038,
  15.485769425219587,
  2.6854257992904422,
  -353.13671367317875,
  344.3661688915354,
  4892.537090397717,
  -4892.537090397679,
  -900.0497692907209,
  900.0497692906896,
  -160.64337068410146,
  160.6433706841133,
  2334.6198557257526,
  -170.22876554892156,
  -821.6481329834207,
  -141.80638256570643,
  -1200.9365746277256,
  509.81047094994875,
  -642.2628850939834,
  132.4524141440511,
  -2561.1361194170813,
  2561.1361194170886,
  -465.9233125860318,
  -1786.4253748033466,
  -1802.9795521690273,
  3508.5611097725064,
  2259.3928518960515,
  -5305.111163675709,
  3592.4854415655127,
  851.7201931846895,
  -1657.291033964342,
  401.7244650281545,
  -502.47814399988374,
  13182.943171855353,
  -15869.1040936695,
  3592.485441565521,
  -314.1322169873389,
  -351.5039345698659,
  

In [24]:
json_model = LinearRegression()

In [25]:
# Deserializing the JSON model: restore the saved parameters on a fresh estimator
json_model.coef_ = np.array(json_text['coef'])
json_model.intercept_ = np.array(json_text['intercept'])

In [26]:
# Predicting
y_pred = json_model.predict(x_test)

# Json model test score
r2_score(y_test, y_pred)

0.8163709811361523

In [27]:
# Initial test score
Testing_score

0.8163709811361523

A drawback of serializing with JSON is that you need to know which model parameters to write to disk. We can serialize a simple linear regression with JSON because the coefficients and the intercept are the only parameters we need to save. If you are working with more complex models, such as decision trees or Support Vector Machines, you may have to use other serialization techniques such as pickle and joblib.
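
To illustrate the drawback, the sketch below fits a small decision tree on synthetic data and tries to dump its attributes to JSON. The data and variable names are assumptions for illustration.

```python
import json
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=40)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# The learned structure lives in a nested Tree object, not in a flat
# list of numbers, so dumping the estimator's attributes to JSON fails.
try:
    json.dumps(vars(tree))
    json_ok = True
except TypeError as err:
    json_ok = False
    print("JSON failed:", err)

# pickle handles the whole object without us naming any parameter
clone = pickle.loads(pickle.dumps(tree))
print(np.allclose(clone.predict(X_demo), tree.predict(X_demo)))
```

With JSON we would have to extract and rebuild the tree structure by hand, whereas pickle stores the estimator as a whole.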

The pickle module implements protocols for serializing and deserializing Python objects to and from a byte stream.
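
As a small sketch of what the protocols mean in practice, the snippet below pickles a fitted model in memory and selects the protocol version explicitly. The tiny dataset is made up, and choosing HIGHEST_PROTOCOL is only one reasonable option.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_demo, y_demo)

# protocol selects the pickle format version; HIGHEST_PROTOCOL is the
# newest one the running interpreter supports.
payload = pickle.dumps(fitted, protocol=pickle.HIGHEST_PROTOCOL)

restored = pickle.loads(payload)   # only unpickle data you trust

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
print(np.allclose(restored.predict(X_demo), fitted.predict(X_demo)))
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.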

In [28]:
import pickle

In [29]:
# Serialization (the with block closes the file handle afterwards)
with open('models/model.pkl', 'wb') as file:
    pickle.dump(model, file)

In [30]:
# Deserialization (the with block closes the file handle afterwards)
with open('models/model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)

In [31]:
# Prediction
y_pred = pickle_model.predict(x_test)
y_pred

array([ 6863.62855482, 21566.20178414, 11408.97411649,  5985.98770385,
        9468.94096303, 14340.78835904,  7739.29658797, 36655.82084638,
        7445.25467913,  9627.28951173,  4008.7947898 , 10375.29411098,
        7545.67645761,  9352.85961158, 16602.53073294,  6731.29361765,
        1950.47008273,  5267.3968403 , 34028.        ,  4606.80719846,
        9722.77675454,  9126.86254149,  6212.50208146, 20744.80630235,
        5110.82639683,  9768.54984982, 18391.5718596 ,  9763.09574787,
        5829.37005383,  6210.3225186 , 33559.77312113, 16233.83973568,
       13185.51774306, 18033.56617963, 28845.24591507, 10063.65792034,
       10522.65415119, 10586.04337225, 14003.72864568,  6257.37536935,
       15416.73855962])

In [32]:
# Pickle model test score
r2_score(y_test, y_pred)

0.8163709811361523

In [33]:
# Initial test score
Testing_score

0.8163709811361523

An alternative to pickle is the joblib module. It is a better fit for serializing scikit-learn models because it works more efficiently with objects that hold large NumPy arrays internally. Machine learning models often store their parameters in such arrays, including sparse vectors, which is why joblib is often the preferred option for serializing models to disk.
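
joblib.dump also accepts a compress argument, which can reduce file size for models with large arrays. The sketch below uses synthetic data, a temporary directory, and an assumed compression level of 3.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo.sum(axis=1)
fitted = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")   # hypothetical file name

    # compress accepts 0-9; 3 is an assumed middle-of-the-road choice
    joblib.dump(fitted, path, compress=3)
    restored = joblib.load(path)

print(np.allclose(restored.predict(X_demo), fitted.predict(X_demo)))
```

Higher compression levels trade slower saving and loading for smaller files, so the right value depends on the model size.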

In [34]:
import joblib

In [35]:
filename = 'models/model.joblib'

In [36]:
# Serialization
joblib.dump(model, filename)

['models/model.joblib']

In [37]:
# Deserialization
joblib_model = joblib.load(filename)

In [38]:
# Predicting
y_pred = joblib_model.predict(x_test)
y_pred

array([ 6863.62855482, 21566.20178414, 11408.97411649,  5985.98770385,
        9468.94096303, 14340.78835904,  7739.29658797, 36655.82084638,
        7445.25467913,  9627.28951173,  4008.7947898 , 10375.29411098,
        7545.67645761,  9352.85961158, 16602.53073294,  6731.29361765,
        1950.47008273,  5267.3968403 , 34028.        ,  4606.80719846,
        9722.77675454,  9126.86254149,  6212.50208146, 20744.80630235,
        5110.82639683,  9768.54984982, 18391.5718596 ,  9763.09574787,
        5829.37005383,  6210.3225186 , 33559.77312113, 16233.83973568,
       13185.51774306, 18033.56617963, 28845.24591507, 10063.65792034,
       10522.65415119, 10586.04337225, 14003.72864568,  6257.37536935,
       15416.73855962])

In [39]:
# Joblib model test score
r2_score(y_test, y_pred)

0.8163709811361523

In [40]:
# Initial test score
Testing_score

0.8163709811361523

In [41]:
!ls models

model.joblib
model.pkl
regressor_param.txt


The files above are the serialized models that were saved to disk. All test scores from the deserialized models match the initial test score (about 0.82), which shows that the serialized models work. joblib is the preferred option here because it efficiently handles models whose parameters are stored as large or sparse arrays and, like pickle, does not require you to know which model parameters to serialize.