### Machine Learning Model Development & Deployment

Pre-Requisite:
    1. Use EDA_Feature_Engineering.ipynb to get good insights into dataset.
        a. Make sure to capture the salient points of EDA & feature engineering in project description.
    2. Decide whether it is a classification, regression or unsupervized model.
    3. Have a goood understanding on the evaluation metrics.
    4. Pay attention encoding of categorical variables.


### Project Description

Dataset: 
    1. Dataset_Insurance_6x1.csv

Description:
    1. List of Features: ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
    2. List of Targets: ['expenses']
    3. List of Categorical Variables: ['sex', 'smoker', 'region']
    4. List of Categorical Targets []

ENCODING:
    1. smoker: yes = 1 || no = 0
    2. sex: male = 1 || female = 0
    3. region: northeast = 0 || northwest = 1 || southeast = 2 || southwest = 3
    
OBJECTIVE:
    1. Develop a model to predict health expense given a set of features.

METRICS:
    1. Model accuracy to be assessed by MAE.
    2. Fit by R2

RESULTS: All Features Retained

![image-6.png](attachment:image-6.png)

RESULTS: children & region dropped from the dataset

![image-4.png](attachment:image-4.png)


In [26]:
# Import relevant libraries

import os
import warnings
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns

import pandas as pd
pd.set_option('display.max_columns', 100)
pd.options.display.width=None

from tabulate import tabulate
tabulate.PRESERVE_WHITESPACE = False

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import r2_score

In [27]:
# Global constants

RND_STATE = 39  # random_state where used is assigned RND_STATE
TESTSIZE = 0.2  # test_size where used is assigned TESTSIZE

In [28]:
# Dataset I/O definitions

PATH = r"C:\DSML_Case_Studies_2021.04.24\01_Dataset"
OUTPATH = r"C:\DSML_Case_Studies_2021.04.24\03_Output"
DATASET = r"\Dataset_Insurance_6x1.csv"
PREFIX = r"\Dataset_Insurance_"

#Specify number of features and targets

n_features = 6
n_target = 1

In [29]:
# Dataframe Definition & Classifying Features & Targets

df = pd.read_csv(f"{PATH}{DATASET}")
# df = df.drop(columns=['children', 'region'])
df = df.round(decimals=4)


collst = []
for columns in df.columns:
    collst.append(columns)

featlst = collst[0:len(collst)-n_target]
targlst = collst[-n_target:]

cat_df = df.select_dtypes(include=['object'])
catlst = []
for col in cat_df.columns:
    catlst.append(col)

y_catlst = [value for value in catlst if value in targlst]

In [30]:
print("Dataframe BEFORE Encoding: ")

Dataframe BEFORE Encoding: 


In [31]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [32]:
# Encode Categorical Columns

for i in range(0, len(collst), 1):
    temp = df.dtypes[collst[i]]
    if temp == 'object':
        df[collst[i]] = df[collst[i]].astype('category')
        df[collst[i]] = df[collst[i]].cat.codes
    else:
        continue
        
# Features & Target DataFrame

X = df.drop(columns=targlst)
y = df.drop(columns=featlst)

In [33]:
print("Dataframe AFTER Encoding: ")

Dataframe AFTER Encoding: 


In [34]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,0,27.9,0,1,3,16884.92
1,18,1,33.8,1,0,2,1725.55
2,28,1,33.0,3,0,2,4449.46
3,33,1,22.7,0,0,1,21984.47
4,32,1,28.9,0,0,1,3866.86


In [35]:
print("Descriptive Stats: ")

Descriptive Stats: 


In [36]:
desc_stat = df.describe().T.round(3) # Univariate analyses
print(tabulate(desc_stat, headers=desc_stat.columns, tablefmt="github", numalign="right"))

|          |   count |    mean |   std |     min |     25% |     50% |     75% |     max |
|----------|---------|---------|-------|---------|---------|---------|---------|---------|
| age      |    1338 |  39.207 | 14.05 |      18 |      27 |      39 |      51 |      64 |
| sex      |    1338 |   0.505 |   0.5 |       0 |       0 |       1 |       1 |       1 |
| bmi      |    1338 |  30.665 | 6.098 |      16 |    26.3 |    30.4 |    34.7 |    53.1 |
| children |    1338 |   1.095 | 1.205 |       0 |       0 |       1 |       2 |       5 |
| smoker   |    1338 |   0.205 | 0.404 |       0 |       0 |       0 |       0 |       1 |
| region   |    1338 |   1.516 | 1.105 |       0 |       1 |       2 |       2 |       3 |
| expenses |    1338 | 13270.4 | 12110 | 1121.87 | 4740.29 | 9382.03 | 16639.9 | 63770.4 |


In [37]:
def data_preprocess(X,y):
    X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(), test_size=TESTSIZE, random_state=RND_STATE)
    scaler = MinMaxScaler()
    scaler.fit(X_train)
    
    # Now apply the transformations to the data:
    
    train_scaled = scaler.transform(X_train)
    test_scaled = scaler.transform(X_test)
    y_train=y_train.copy()
    y_test=y_test.copy()
    
    return(train_scaled, test_scaled, y_train, y_test)

In [38]:
# Train Test Split

X_train_sc, X_test_sc, y_train, y_test = data_preprocess(X, y)

In [39]:
# Linear Regression (LR) Model 

from sklearn.linear_model import LinearRegression

LR = LinearRegression()
LR.fit(X_train_sc, y_train)
y_pred = LR.predict(X_test_sc)

TrAcc_LR = round(LR.score(X_train_sc, y_train), 2)
TeAcc_LR = round(LR.score(X_test_sc, y_test), 2)
RMSE_LR = round(np.sqrt(MSE(y_test, y_pred)), 2)
MAE_LR = round(MAE(y_test, y_pred), 2)
RSq_LR = round(r2_score(y_test, y_pred), 2)
AdjRSq_LR = round(1-((1-RSq_LR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)

In [40]:
# Polynomial Regression (PR) Model

from sklearn.preprocessing import PolynomialFeatures

PR = PolynomialFeatures()
X_train_temp = PR.fit_transform(X_train_sc)
X_test_temp = PR.fit_transform(X_test_sc)

# Note: Fit is Polynomial, Model is Sill Linear Regression 

LR_Temp = LinearRegression()
LR_Temp.fit(X_train_temp, y_train)

y_pred = LR_Temp.predict(X_test_temp)

TrAcc_PR = round(LR_Temp.score(X_train_temp, y_train), 2)
TeAcc_PR = round(LR_Temp.score(X_test_temp, y_test), 2)
RMSE_PR = round(np.sqrt(MSE(y_test, y_pred)), 2)
MAE_PR = round(MAE(y_test, y_pred), 2)
RSq_PR = round(r2_score(y_test, y_pred), 2)
AdjRSq_PR = round(1-((1-RSq_PR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)


In [41]:
# K-Nearest Neighbors Regressor (KNNR) Model

from sklearn.neighbors import KNeighborsRegressor

knn_index = []
knn_Tracc = []
knn_Teacc = []
knn_rmse = []
knn_mae = []
knn_rsq = []
knn_adj = []

for k in range(0, 20, 1):
    k = k+1
    KNNM = KNeighborsRegressor(n_neighbors = k)
    KNNM.fit(X_train_sc, y_train)
    y_pred = KNNM.predict(X_test_sc)
    TrAcc_KNNR = round(KNNM.score(X_train_sc, y_train), 2)
    TeAcc_KNNR = round(KNNM.score(X_test_sc, y_test), 2)
    RMSE_KNNR = round(np.sqrt(MSE(y_test, y_pred)), 2)
    MAE_KNNR = round(MAE(y_test, y_pred), 2)
    RSq_KNNR = round(r2_score(y_test, y_pred), 2)
    AdjRSq_KNNR = round(1-((1-RSq_KNNR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)
    knn_index.append(k)
    knn_Tracc.append(TrAcc_KNNR)
    knn_Teacc.append(TeAcc_KNNR)
    knn_rmse.append(RMSE_KNNR)
    knn_mae.append(MAE_KNNR)
    knn_rsq.append(RSq_KNNR)
    knn_adj.append(AdjRSq_KNNR)

K_TrAcc = pd.DataFrame(np.column_stack([knn_index, knn_Tracc]), columns=['K-Neighbors', 'Train_Accuracy'])
K_TeAcc = pd.DataFrame(np.column_stack([knn_index, knn_Teacc]), columns=['K-Neighbors', 'Test_Accuracy'])
K_RMSE = pd.DataFrame(np.column_stack([knn_index, knn_rmse]), columns=['K-Neighbors', 'RMS_Error'])
K_MAE = pd.DataFrame(np.column_stack([knn_index, knn_mae]), columns=['K-Neighbors', 'MA_Error'])
K_RSq = pd.DataFrame(np.column_stack([knn_index, knn_rsq]), columns=['K-Neighbors', 'RSq'])
K_ARSq = pd.DataFrame(np.column_stack([knn_index, knn_adj]), columns=['K-Neighbors', 'Adj_RSq'])

TrAcc_KNNR = K_TrAcc.Train_Accuracy.min()
TeAcc_KNNR = K_TeAcc.Test_Accuracy.min()
RMSE_KNNR = K_RMSE.RMS_Error.min()
MAE_KNNR = K_MAE.MA_Error.min()
RSq_KNNR = K_RSq.RSq.min()
AdjRSq_KNNR = K_ARSq.Adj_RSq.min()

In [42]:
# Kernel Support Vector Regression (KSVR) Model

from sklearn import svm

KSVR = svm.SVR(kernel='poly', degree=4, gamma='auto') # When kernel = poly, define degree
KSVR.fit(X_train_sc, y_train)
y_pred = KSVR.predict(X_test_sc)

TrAcc_KSVR = round(KSVR.score(X_train_sc, y_train), 2)
TeAcc_KSVR = round(KSVR.score(X_test_sc, y_test), 2)
RMSE_KSVR = round(np.sqrt(MSE(y_test, y_pred)), 2)
MAE_KSVR = round(MAE(y_test, y_pred), 2)
RSq_KSVR = round(r2_score(y_test, y_pred), 2)
AdjRSq_KSVR = round(1-((1-RSq_KSVR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)

In [43]:
# Random Forest Regression (RFR) Model

from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=200, max_depth=5, criterion='mae', random_state=RND_STATE, n_jobs=-1)
RFR.fit(X_train_sc, y_train)
y_pred = RFR.predict(X_test_sc)

TrAcc_RFR = round(RFR.score(X_train_sc, y_train), 2)
TeAcc_RFR = round(RFR.score(X_test_sc, y_test), 2)
RMSE_RFR = round(np.sqrt(MSE(y_test, y_pred)), 2)
MAE_RFR = round(MAE(y_test, y_pred), 2)
RSq_RFR = round(r2_score(y_test, y_pred), 2)
AdjRSq_RFR = round(1-((1-RSq_RFR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)

In [44]:
# Extra Trees Regression (ETR) Model

from sklearn.ensemble import ExtraTreesRegressor

ETR = ExtraTreesRegressor(n_estimators=200, max_depth=5, criterion='mae', random_state=RND_STATE)
ETR.fit(X_train_sc, y_train)
y_pred = ETR.predict(X_test_sc)

TrAcc_ETR = round(ETR.score(X_train_sc, y_train), 2)
TeAcc_ETR = round(ETR.score(X_test_sc, y_test), 2)
RMSE_ETR = round(np.sqrt(MSE(y_test, y_pred)), 2)
MAE_ETR = round(MAE(y_test, y_pred), 2)
RSq_ETR = round(r2_score(y_test, y_pred), 2)
AdjRSq_ETR = round(1-((1-RSq_ETR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)


In [None]:
# CatBoost Regression (CBR) Model

import catboost as cb

train_dataset = cb.Pool(X_train_sc, y_train)
test_dataset = cb.Pool(X_test_sc, y_test)

CBR = cb.CatBoostRegressor(loss_function='MAE')

#For Hyperparameter turning - Only few parameters are considered

grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}

CBR.grid_search(grid, train_dataset)

y_pred = CBR.predict(X_test_sc)

TrAcc_CBR = round(CBR.score(X_train_sc, y_train), 2)
TeAcc_CBR = round(CBR.score(X_test_sc, y_test), 2)
RMSE_CBR = round(np.sqrt(MSE(y_test, y_pred)), 2)
MAE_CBR = round(MAE(y_test, y_pred), 2)
RSq_CBR = round(r2_score(y_test, y_pred), 2)
AdjRSq_CBR = round(1-((1-RSq_CBR**2)*((len(X_test_sc)-1)/((X_test_sc.shape[0])-1))), 2)

In [46]:
modlst = ['Linear', 'Poly_Regression','K-Nearest', 'Kernel_SVM', 'Random_Forest', 'Extra_Trees', 'CatBoost_Regressor']
score1 = ['TrAcc_LR', 'TrAcc_PR', 'TrAcc_KNNR', 'TrAcc_KSVR','TrAcc_RFR', 'TrAcc_ETR', 'TrAcc_CBR']
score2 = ['TeAcc_LR', 'TeAcc_PR', 'TeAcc_KNNR', 'TeAcc_KSVR', 'TeAcc_RFR', 'TeAcc_ETR', 'TeAcc_CBR']
score3 = ['RMSE_LR', 'RMSE_PR', 'RMSE_KNNR', 'RMSE_KSVR', 'RMSE_RFR', 'RMSE_ETR', 'RMSE_CBR']
score4 = ['MAE_LR', 'MAE_PR', 'MAE_KNNR', 'MAE_KSVR', 'MAE_ETR', 'MAE_ETR', 'MAE_CBR']
score5 = ['RSq_LR', 'RSq_PR', 'RSq_KNNR', 'RSq_KSVR', 'RSq_RFR', 'RSq_ETR', 'RSq_CBR' ]
score6 = ['AdjRSq_LR','AdjRSq_PR', 'AdjRSq_KNNR', 'AdjRSq_KSVR', 'AdjRSq_RFR', 'AdjRSq_ETR', 'AdjRSq_CBR' ]

tracclst = []
teacclst = []
rmselst = []
maelst = []
rsqlst = []
adjrsqlst = []

for i in range(0, len(score1), 1):
    var1 = vars()[score1[i]]
    var2 = vars()[score2[i]]
    var3 = vars()[score3[i]]
    var4 = vars()[score4[i]]
    var5 = vars()[score5[i]]
    var6 = vars()[score6[i]]
    tracclst.append(var1)
    teacclst.append(var2)
    rmselst.append(var3)
    maelst.append(var4)
    rsqlst.append(var5)
    adjrsqlst.append(var6)

Summary = pd.DataFrame(np.column_stack([modlst, tracclst, teacclst, rmselst, maelst, rsqlst, adjrsqlst]), 
                      columns=['Model Name', 'Train_Accuracy', 'Test_Accuracy', 'RMSE', 'MAE', 'RSq', 'AdjRSq'])

In [47]:
tot_models = len(modlst)
print("MLM Evaluation Summary: ")

MLM Evaluation Summary: 


In [48]:
Summary.head(tot_models)

Unnamed: 0,Model Name,Train_Accuracy,Test_Accuracy,RMSE,MAE,RSq,AdjRSq
0,Linear,0.75,0.75,6329.62,4294.75,0.75,0.56
1,Poly_Regression,0.85,0.83,5184.48,3018.57,0.83,0.69
2,K-Nearest,0.8,0.64,5813.24,3629.51,0.64,0.41
3,Kernel_SVM,-0.1,-0.1,13287.96,8654.58,-0.1,0.01
4,Random_Forest,0.87,0.83,5146.38,2297.74,0.83,0.69
5,Extra_Trees,0.85,0.83,5236.59,2297.74,0.83,0.69
6,CatBoost_Regressor,0.86,0.83,5141.92,1815.37,0.83,0.69


In [49]:
Output = r"MLM_Evaluation.xlsx"

writer = pd.ExcelWriter(f"{OUTPATH}{PREFIX}{Output}", engine='xlsxwriter', options={'strings_to_numbers': True})
Summary.to_excel(writer, sheet_name='Model Evaluation')
writer.save()

In [51]:
!jupyter nbconvert MLM_Dataset_Insurance.ipynb --to html --no-input

[NbConvertApp] Converting notebook MLM_Dataset_Insurance.ipynb to html
[NbConvertApp] Writing 645147 bytes to MLM_Dataset_Insurance.html
