## Modelling Notebook

Use this notebook to try out the models you are considering. No preprocessing is done here. Steps:

1. Read in the `data/final_data.csv` file that you created in `Data Cleaning.ipynb`
2. Try various models and print appropriate metrics (accuracy, MSE, etc.)
3. Pick a final model and save it as `models/model.pkl`
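
Step 2 can be sketched as a small helper that fits each candidate and reports the same metrics for all of them. This is a minimal sketch on synthetic data: the `evaluate` helper and the generated dataset are illustrative assumptions, not part of this notebook's data or code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, x_train, x_test, y_train, y_test):
    """Fit a regressor and return (test MSE, test R^2). Hypothetical helper."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred)


# Synthetic stand-in for the cleaned dataset (assumption: illustration only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# train_test_split returns x_train, x_test, y_train, y_test in that order
splits = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Decision Tree", DecisionTreeRegressor(random_state=0)),
]:
    results[name] = evaluate(model, *splits)
    mse, r2 = results[name]
    print(f"{name}: MSE={mse:.4f}, R^2={r2:.4f}")
```

Collecting the scores in one dictionary makes it easier to compare models on the same split before picking the one to save.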

In [1]:
import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv("../data/final_dataset.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,0,19,0,27.9,0,1,3,16884.924
1,1,18,1,33.77,1,0,2,1725.5523
2,2,28,1,33.0,3,0,2,4449.462
3,3,33,1,22.705,0,0,1,21984.47061
4,4,32,1,28.88,0,0,1,3866.8552
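
The `Unnamed: 0` and `Unnamed: 0.1` columns in this preview look like row indices written out by earlier `to_csv` calls (passing `index=False` when saving avoids them). If that is the case, they carry no information about `charges`, yet the cells below pass them to the models as features. A minimal sketch of filtering them out, assuming they really are leftover indices; the small frame is copied from a few columns of the preview above:

```python
import pandas as pd

# A few rows and columns copied from the preview, for illustration only
preview = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "age": [19, 18, 28],
    "bmi": [27.9, 33.77, 33.0],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Keep every column whose name does not start with "Unnamed"
cleaned = preview.loc[:, ~preview.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))  # ['age', 'bmi', 'charges']
```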


In [4]:
y = data["charges"]
# Features: every column except the target; drop() returns a new DataFrame
x = data.drop(columns="charges")
x

Unnamed: 0.1,Unnamed: 0,age,sex,bmi,children,smoker,region
0,0,19,0,27.900,0,1,3
1,1,18,1,33.770,1,0,2
2,2,28,1,33.000,3,0,2
3,3,33,1,22.705,0,0,1
4,4,32,1,28.880,0,0,1
...,...,...,...,...,...,...,...
1333,1333,50,1,30.970,3,0,1
1334,1334,18,0,31.920,0,0,0
1335,1335,18,0,36.850,0,0,2
1336,1336,21,0,25.800,0,0,3


In [5]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

## Linear Regression

In [6]:
from sklearn.linear_model import LinearRegression
model1 = LinearRegression()
model1.fit(x_train, y_train)


LinearRegression()

In [7]:
from sklearn.metrics import mean_squared_error
y_pred1 = model1.predict(x_test)
print("Mean Squared Error for Linear Regression is ", mean_squared_error(y_test, y_pred1))
# LinearRegression.score returns the coefficient of determination (R^2)
print("R^2 score for Linear Regression is ", model1.score(x_test, y_test))

Mean Squared Error for Linear Regression is  31888709.315288536
R^2 score for Linear Regression is  0.7996058765429954


## Polynomial Regression

In [8]:
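from sklearn.preprocessing import PolynomialFeatures
# Expand the features into all polynomial terms up to degree 4
model2 = PolynomialFeatures(degree=4)
x_poly = model2.fit_transform(x_train)
# Fit an ordinary linear regression on the expanded features
reg = LinearRegression()
reg.fit(x_poly, y_train)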

LinearRegression()

In [9]:
from sklearn.metrics import mean_squared_error
# Apply the same polynomial expansion to the test set before predicting
y_pred2 = reg.predict(model2.transform(x_test))

print("Mean Squared Error for Polynomial Regression is ", mean_squared_error(y_test, y_pred2))

Mean Squared Error for Polynomial Regression is  51425934.556224175
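
The degree-4 model has a higher test MSE than the plain linear model above, which is consistent with overfitting, although nothing in this notebook confirms the cause. One way to check is to compare several degrees with cross-validation. The sketch below does this on synthetic data; the dataset, the degree range and `cv=5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic term (assumption: illustration only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-2, 2, size=(150, 2))
y_demo = 1.0 + 2.0 * x_demo[:, 0] - x_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=150)

cv_mse = {}
for degree in (1, 2, 3, 4):
    # The pipeline refits the polynomial expansion inside every fold
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(pipe, x_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # scores are negated MSE values
    print(f"degree={degree}: mean CV MSE={cv_mse[degree]:.4f}")

best_degree = min(cv_mse, key=cv_mse.get)
print("Lowest mean CV MSE at degree", best_degree)
```

Wrapping `PolynomialFeatures` and `LinearRegression` in one pipeline also removes the need to call `transform` on the test set by hand.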


## Decision Tree

In [10]:
from sklearn.tree import DecisionTreeRegressor
# random_state is not set, so the result may vary slightly between runs
model3 = DecisionTreeRegressor()
fitted = model3.fit(x_train, y_train)


In [11]:
y_pred3 = model3.predict(x_test)
print("Mean Squared Error for Decision Tree is ", mean_squared_error(y_test, y_pred3))

Mean Squared Error for Decision Tree is  49909650.19405501


In [12]:
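import pickle
# Save the linear regression model, which had the lowest test MSE of the three
filepath = r'../models/Model.pkl'
with open(filepath, 'wb') as f:  # the context manager closes the file
    pickle.dump(model1, f)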
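
To confirm that a saved file is usable, the model can be loaded back and its predictions compared with those of the original. The sketch below uses a temporary directory and a toy model in place of `../models/Model.pkl` and `model1`; both substitutions are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for model1 (assumption: illustration only)
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(x_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Write the model, then read it back from the same file
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions
same = np.allclose(loaded.predict(x_demo), toy_model.predict(x_demo))
print("Round trip OK:", same)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust.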