### Machine Learning Model Development & Deployment

Prerequisites:
    1. Use EDA_Feature_Engineering.ipynb to get good insights into the dataset.
        a. Make sure to capture the salient points of EDA & feature engineering in the project description.
    2. Decide whether it is a classification, regression or unsupervised model.
    3. Have a good understanding of the evaluation metrics.
    4. Pay attention to the encoding of categorical variables.

### Project Description

Dataset: 
    1. Dataset_Insurance_6x1.csv

Description:
    1. List of Features: ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
    2. List of Targets: ['expenses']
    3. List of Categorical Variables: ['sex', 'smoker', 'region']
    4. List of Categorical Targets: []

ENCODING:
    1. smoker: yes = 1 || no = 0
    2. sex: male = 1 || female = 0
    3. region: northeast = 0 || northwest = 1 || southeast = 2 || southwest = 3
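
The mapping above matches what `cat.codes` produces when categories are sorted alphabetically. As a minimal sketch, the same encoding can be written out explicitly so it does not depend on category order. The `ENCODING` dictionary and the three-row `sample` frame (values taken from rows shown by `df.head()` in this notebook) are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Explicit code tables, copied from the ENCODING list above (hypothetical name).
ENCODING = {
    "smoker": {"no": 0, "yes": 1},
    "sex": {"female": 0, "male": 1},
    "region": {"northeast": 0, "northwest": 1, "southeast": 2, "southwest": 3},
}

# Small illustrative frame; the real notebook works on the full dataset.
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Replace each label with its integer code, column by column.
encoded = sample.copy()
for column, mapping in ENCODING.items():
    encoded[column] = sample[column].map(mapping)

print(encoded)
```

With an explicit mapping, a label that is missing from the table becomes NaN instead of silently receiving a new code, which can make unexpected values easier to spot.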
    
Observations:
    1. No missing data.
    2. Features are a mix of categorical and continuous data.
    3. Feature 'bmi' alone follows a normal distribution.
    4. Smoker : Non-smoker :: 274:1064 --> Fair mix.
    5. Female : Male :: 662 : 676 --> Good mix
    6. Good regional mix of data.
    7. Outliers are observed in expenses (target); they could be attributed to smoker status / high bmi.
    8. Outliers ARE RETAINED.
    9. Higher linear correlation observed between smoker & expenses. Moderate correlation between age & expenses.
    10. Non-linear correlation (PPS) tells a similar story. However, it gives higher weight to age & expenses.
    11. VIF < 5 for 'children', 'sex', 'region', 'smoker'
        5 <= VIF < 11 for 'age', 'bmi'
    12. PCA top features: 'bmi', 'smoker', 'children'
    13. DTR top features: 'bmi', 'smoker', 'age'
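
Observation 11 reports VIF values without showing how they are obtained. As a rough illustration, the textbook VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The helper `vif_scores` and the synthetic columns below are assumptions made for this sketch. The notebook imports `variance_inflation_factor` from statsmodels, whose result depends on whether a constant column is included in the design matrix, so this sketch is not expected to reproduce the numbers listed above.

```python
import numpy as np

def vif_scores(X):
    """Textbook VIF per column of a 2-D array: 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

# Synthetic demo: `c` is nearly a copy of `a`, `b` is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
demo_vif = vif_scores(np.column_stack([a, b, c]))
print(demo_vif)
```

In this demo the two nearly collinear columns get large VIF values while the independent column stays close to 1, which is the pattern the thresholds in observation 11 are meant to flag.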

OBJECTIVE:
    1. Develop a model to predict health expense given a set of features.

METRICS:
    1. Model accuracy to be assessed by MAE (mean absolute error).
    2. Goodness of fit to be assessed by R2 (coefficient of determination).
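
To make the two metrics concrete, here is a minimal NumPy sketch of their standard definitions: MAE is the mean of |y - y_hat|, and R2 is 1 - SS_res / SS_tot. The function names `mae` and `r_squared` and the four toy values are assumptions for illustration only; the notebook itself relies on `sklearn.metrics`.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Toy values, not taken from the insurance dataset.
toy_true = [100.0, 200.0, 300.0, 400.0]
toy_pred = [110.0, 190.0, 310.0, 390.0]

toy_mae = mae(toy_true, toy_pred)
toy_r2 = r_squared(toy_true, toy_pred)
print("MAE:", toy_mae)
print("R2:", round(toy_r2, 3))
```

MAE is expressed in the units of the target (here, expenses), while R2 is unitless, which is why the two are reported side by side.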



In [69]:
# Import relevant libraries

import os
import warnings
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns

import pandas as pd
pd.set_option('display.max_columns', 100)
pd.options.display.width=None

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [70]:
# Global constants

RND_STATE = 39  # random_state where used is assigned RND_STATE
TESTSIZE = 0.2  # test_size where used is assigned TESTSIZE

In [71]:
# Dataset I/O definitions

PATH = r"C:\DSML_Case_Studies_2021.04.24\01_Dataset"
OUTPATH = r"C:\DSML_Case_Studies_2021.04.24\03_Output"
DATASET = r"\Dataset_Insurance_6x1.csv"

#Specify number of features and targets

n_features = 6
n_target = 1

In [72]:
# Dataframe Definition & Classifying Features & Targets

df = pd.read_csv(f"{PATH}{DATASET}")
df = df.round(decimals=4)

# All column names, split into features (leading columns) and targets (trailing columns)
collst = list(df.columns)

featlst = collst[:-n_target]
targlst = collst[-n_target:]

# Columns stored as strings are treated as categorical
cat_df = df.select_dtypes(include=['object'])
catlst = list(cat_df.columns)

y_catlst = [value for value in catlst if value in targlst]

In [73]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [74]:
# Encode Categorical Columns

# cat.codes assigns integer codes in sorted (alphabetical) category order
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].astype('category').cat.codes

In [75]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,0,27.9,0,1,3,16884.92
1,18,1,33.8,1,0,2,1725.55
2,28,1,33.0,3,0,2,4449.46
3,33,1,22.7,0,0,1,21984.47
4,32,1,28.9,0,0,1,3866.86


In [76]:
# Function to split the dataset into train and test sets and min-max scale the features

def data_split_scale(feature, target):
    # Use the function arguments rather than the global X and y
    X_train, X_test, y_train, y_test = train_test_split(feature, target.values.ravel(),
                                                        test_size=TESTSIZE, random_state=RND_STATE)
    # Fit the scaler on the training split only, then apply it to the test split
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train, y_test

In [77]:
X = df.drop(columns=targlst)
y = df.drop(columns=featlst)

X_train_scaled, X_test_scaled, y_train, y_test = data_split_scale(X, y)

In [80]:
# Model 1: Linear Regression

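# Fit on the scaled arrays returned by data_split_scale
LR = LinearRegression()
LR.fit(X_train_scaled, y_train)

y_pred = LR.predict(X_test_scaled)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))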

Mean Absolute Error: 4294.74508821275
Mean Squared Error: 40064146.26795162
Root Mean Squared Error: 6329.624496599433
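
The METRICS section also names R2, which the cell above does not report. A minimal sketch of how it could be added with `metrics.r2_score` follows. It runs on synthetic data because the CSV is not available here, so every `*_demo` name and the generated values are assumptions, and the printed numbers say nothing about the insurance model.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dataset: 3 features, linear target plus noise.
rng = np.random.default_rng(39)
X_demo = rng.uniform(0, 50, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(size=200)

# Same split-then-scale order as data_split_scale: the scaler sees only training rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=39
)
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

model_demo = LinearRegression().fit(X_tr_scaled, y_tr)
y_hat_demo = model_demo.predict(X_te_scaled)

# Report the error metric and the fit metric together.
mae_demo = metrics.mean_absolute_error(y_te, y_hat_demo)
r2_demo = metrics.r2_score(y_te, y_hat_demo)
print("Mean Absolute Error:", mae_demo)
print("R2:", r2_demo)
```

In the notebook itself, the equivalent call would presumably be `metrics.r2_score(y_test, y_pred)` placed next to the existing print statements.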
