## EDUNET FOUNDATION - Class Exercise Notebook

### LAB 25 - Developing a full-fledged project from scratch using different ML algorithms and Python packages

<center><div style="background-color:#faa6ce; color:#19180F; font-size:27px; font-family:cursive; padding:25px; border: 2px solid #19180F;"> 
Case study on predicting Health Insurance Premiums
</div></center>

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
About Dataset
</div>

---

The `insurance.csv` dataset contains 1338 observations (rows) and 7 features (columns): 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region). The nominal features are converted into numeric indicators, one for each level, before modelling.

The purpose of this exercise is to explore the features and their relationships, and to fit a multiple linear regression that uses individual attributes such as age, physical and family condition, and location to model existing medical expenses. The fitted model can then be used to predict future medical expenses, which helps a medical insurer decide what premium to charge.
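
For reference, a multiple linear regression models the target as an intercept plus a weighted sum of the features. The notation below is generic and introduced here only for illustration (the symbols $\theta_j$, $x_j$, $p$ and $n$ do not appear in the notebook's code); $J(\theta)$ is the mean squared error that the notebook prints later.

$$
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p,
\qquad
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
$$

Fitting the model means choosing the coefficients $\theta_0, \dots, \theta_p$ that minimize $J(\theta)$ on the training data.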

- **Find dataset : https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction**

---


<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Importing modules for EDA
</div>

In [1]:
import pandas  as pd #Data manipulation
import numpy as np #Data manipulation
import matplotlib.pyplot as plt # Visualization
import seaborn as sns #Visualization
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Reading the dataset
</div>

In [None]:
df = pd.read_csv('insurance.csv')
df

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Understanding the data
</div>

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.sample(7)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
duplicate_data=df[df.duplicated()]

In [None]:
duplicate_data

In [None]:
df=df.drop_duplicates()
df

In [None]:
df.describe()

#### Get the total number of males and females in the dataset

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='sex',data=df)

#### Get the total number of people in each region

In [None]:
plt.figure(figsize=(10,5))
plt.style.use('fivethirtyeight')
ax=sns.countplot(x='region',data=df,palette='dark',)
ax.set_xlabel(xlabel='Location',fontsize=18)
ax.set_ylabel(ylabel='Total people in Region',fontsize=18)
ax.set_title(label='Region',fontsize=20)
plt.show()

#### Average Expenses of Male and Female

In [None]:
plt.figure(figsize=(10,5))
df.groupby(['sex'])['expenses'].mean().plot.bar()
plt.ylabel('Average Expense')
plt.title("Average Expenses of Male and Female",fontsize=18)
plt.xticks(rotation = 0)
plt.show()

#### Average Expenses of a smoker and Non-smoker

In [None]:
plt.figure(figsize=(10,5))
df.groupby(['smoker'])['expenses'].mean().plot.bar()
plt.ylabel('Average Expense')
plt.title("Average Expenses of a smoker and Non-smoker",fontsize=18)
plt.xticks(rotation = 0)
plt.show()

#### Average Expenses of people in different regions

In [None]:
plt.figure(figsize=(10,5))
df.groupby(['region'])['expenses'].mean().plot.bar()
plt.ylabel('Average Expense')
plt.title("Average Expenses of people in different regions",fontsize=18)
plt.xticks(rotation = 0)
plt.show()

In [None]:
sns.lmplot(x='bmi',y='expenses',data=df,aspect=2,height=6)
plt.xlabel('Body Mass Index$(kg/m^2)$: as Independent variable')
plt.ylabel('Insurance Charges: as Dependent variable')
plt.title('Charge Vs BMI');

In [None]:
sns.lmplot(x='age',y='expenses',data=df,aspect=2,height=6)
plt.xlabel('Age: as Independent variable')
plt.ylabel('Insurance Charges: as Dependent variable')
plt.title('Charge Vs Age');

In [None]:
df1 = df[["age","bmi","children","expenses"]]
df1

In [None]:
df1.corr()

In [None]:
# correlation plot
corr = df1.corr()
sns.heatmap(corr, cmap = 'BuPu', annot= True);

The heatmap suggests that **age** and **bmi** are both positively correlated with **expenses**.

In [None]:
f = plt.figure(figsize=(14,6))
ax = f.add_subplot(121)
sns.scatterplot(x='age',y='expenses',data=df,palette='magma',hue='smoker',ax=ax)
ax.set_title('Scatter plot of Charges vs age')

ax = f.add_subplot(122)
sns.scatterplot(x='bmi',y='expenses',data=df,palette='viridis',hue='smoker',ax=ax)
ax.set_title('Scatter plot of Charges vs bmi')

As the plots above show, when the continuous columns age and bmi are plotted against expenses, smokers' expenses are higher than non-smokers'.

In [None]:
sns.lmplot(x='age',y='expenses',data=df,palette='magma',hue='smoker',aspect=2,height=6)
# lmplot creates its own figure, so set the title with plt instead of the earlier `ax`
plt.title('Scatter plot of Charges vs age');

In [None]:
sns.lmplot(x='bmi',y='expenses',data=df,palette='viridis',hue='smoker',aspect=2,height=6)
# lmplot creates its own figure, so set the title with plt instead of the earlier `ax`
plt.title('Scatter plot of Charges vs bmi');

The regression lines fitted separately for smokers and non-smokers show the same pattern: at a given age or bmi, smokers' expenses are higher than non-smokers'.

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Feature Engineering: Transform the Categorical Features into Dummy Variables 👷
</div>
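
As a minimal sketch of what dummy encoding with `drop_first=True` does, the toy frame below is encoded in the same way. The values are hypothetical and not taken from `insurance.csv`. The first level of each column, in sorted order, is dropped and acts as the implicit baseline.

```python
import pandas as pd

# Hypothetical toy data that mimics the categorical columns of the dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# drop_first=True removes the first level (in sorted order) of each column,
# so 'female', 'no' and 'northwest' become the implicit baselines in this toy frame
encoded = pd.get_dummies(toy, columns=["sex", "smoker", "region"], drop_first=True)

# Cast to int so the indicators are 0/1 regardless of the pandas version
encoded = encoded.astype(int)
print(encoded)
```

Dropping one level per feature avoids perfectly collinear columns, which matters for linear regression with an intercept.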

In [None]:
# Dummy variables: one-hot encode the categorical columns, dropping the first level of each
categorical_columns = ['sex','smoker','region']
df = pd.get_dummies(data = df,
                    columns = categorical_columns,
                    drop_first = True)


In [None]:
df.info()

In [None]:
df.head()

In [None]:
# Convert the dummy columns to 0/1 integers (works whether get_dummies returned bool or uint8)
dummy_columns = ['sex_male','smoker_yes','region_northwest','region_southeast','region_southwest']
df[dummy_columns] = df[dummy_columns].astype(int)

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()


<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Feature Scaling: Transform the continuous (numerical) data <br>onto the same scale of [0, 1].
</div>
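
Min-max scaling maps each value to (x - min) / (max - min), computed per column. The sketch below uses hypothetical ages and BMIs to check that `MinMaxScaler` matches this formula. It also fits the scaler on the training rows only and reuses it for new rows, which is one common way to keep test-set information out of the preprocessing step; this is an assumption of the sketch, not what the notebook's own cell does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages and BMIs, for illustration only
train = np.array([[18.0, 20.0], [40.0, 30.0], [64.0, 45.0]])
test = np.array([[29.0, 25.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learn min/max from the training rows
test_scaled = scaler.transform(test)         # reuse the same min/max for new rows

# Manual formula: (x - min) / (max - min), column by column
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(np.allclose(train_scaled, manual))
print(test_scaled)
```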

In [None]:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
df[['age','bmi']]=scaler.fit_transform(df[['age','bmi']])

In [None]:
df.head()

In [None]:
X=df.drop(columns=['expenses']) # collecting only the features.

In [None]:
X

In [None]:
y=df['expenses'] # collecting the target values
y

- **We apply a log transformation to the expenses column so that its distribution is closer to normal.**
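
The effect of a log transform can be sketched on synthetic right-skewed data. The log-normal sample below is an assumption made for illustration and is not the real expenses column. The transform reduces the skewness, and `np.exp` maps values back to the original units, which is needed whenever log-scale predictions are reported as expense amounts.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample standing in for the 'expenses' column
rng = np.random.default_rng(0)
expenses = rng.lognormal(mean=9.0, sigma=0.9, size=1000)

log_expenses = np.log(expenses)

print("skewness before:", skew(expenses))
print("skewness after: ", skew(log_expenses))

# np.exp inverts np.log, so log-scale values can be mapped back to original units
recovered = np.exp(log_expenses)
print(np.allclose(recovered, expenses))
```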

In [None]:
y_log=np.log(y)
y_log

In [None]:
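from scipy.stats import boxcox
# Box-Cox estimate of the transformation parameter; a lambda close to 0 corresponds to a log transform
y_bc,lam, ci= boxcox(df['expenses'],alpha=0.05)
f= plt.figure(figsize=(12,4))

# histplot replaces the deprecated distplot; stat='density' keeps the density scale
ax=f.add_subplot(121)
sns.histplot(df['expenses'],bins=40,color='r',kde=True,stat='density',ax=ax)
ax.set_title('Distribution of insurance charges')

ax=f.add_subplot(122)
sns.histplot(y_log,bins=40,color='b',kde=True,stat='density',ax=ax)
ax.set_title('Distribution of insurance charges in $log$ scale');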

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Splitting the required data into training and test dataset.
</div>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y_log,test_size=0.3,random_state=22)

In [None]:
print("X_train.shape=", X_train.shape)
print("y_train.shape=", y_train.shape)
print("X_test.shape=", X_test.shape)
print("y_test.shape=", y_test.shape)

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Apply LinearRegression.
</div>
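
The two metrics reported for each model in this notebook can be reproduced by hand. The sketch below uses synthetic data from `make_regression` (an assumption for illustration, with `_demo` names chosen to avoid clashing with the notebook's variables) and checks that the manual MSE and R-squared agree with `mean_squared_error` and the estimator's `score` method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only to illustrate the metrics
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=22)

model_demo = LinearRegression().fit(X_tr, y_tr)
y_hat = model_demo.predict(X_te)

# MSE: mean of the squared residuals
mse_manual = np.mean((y_te - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te - y_hat) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_te, y_hat)))
print(np.isclose(r2_manual, model_demo.score(X_te, y_te)))
```

An R-squared of 1 means perfect predictions, 0 means the model does no better than predicting the mean of the test targets, and negative values mean it does worse.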

In [None]:
# Scikit Learn module
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)

y_pred_LR = lin_reg.predict(X_test)

# Evaluation: MSE (computed on the log-transformed target)
from sklearn.metrics import mean_squared_error
J_mse_sk_LR = mean_squared_error(y_test, y_pred_LR)

# R-squared
R_square_sk_LR = lin_reg.score(X_test,y_test)

print('The Mean Squared Error (MSE) or J(theta) is: ',J_mse_sk_LR)
print('R-squared obtained with the scikit-learn library is: ',R_square_sk_LR)

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Apply Support Vector Machine Regression.
</div>

In [None]:
# Support Vector Machine regressor
from sklearn.svm import SVR

SVM_R = SVR(kernel='rbf')
SVM_R.fit(X_train, y_train)

y_pred_svm = SVM_R.predict(X_test)

# Evaluation: MSE (computed on the log-transformed target)
from sklearn.metrics import mean_squared_error
J_mse_sk_svm = mean_squared_error(y_test, y_pred_svm)

# R-squared
R_square_sk_svm = SVM_R.score(X_test,y_test)

print('The Mean Squared Error (MSE) or J(theta) is: ',J_mse_sk_svm)
print('R-squared obtained with the scikit-learn library is: ',R_square_sk_svm)

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Apply Decision Tree Regression.
</div>

In [None]:
# Scikit Learn module
from sklearn.tree import DecisionTreeRegressor
Dec_Tree_Reg = DecisionTreeRegressor(max_depth=4, min_samples_split=20)
Dec_Tree_Reg.fit(X_train, y_train)

y_pred = Dec_Tree_Reg.predict(X_test)

# Evaluation: MSE (computed on the log-transformed target)
from sklearn.metrics import mean_squared_error
J_mse_sk_Dec_Tree_Rege = mean_squared_error(y_test, y_pred)

# R-squared
R_square_sk_Dec_Tree_Rege = Dec_Tree_Reg.score(X_test,y_test)

print('The Mean Squared Error (MSE) or J(theta) is: ',J_mse_sk_Dec_Tree_Rege)
print('R-squared obtained with the scikit-learn library is: ',R_square_sk_Dec_Tree_Rege)


In [None]:
R_square = [R_square_sk_LR,R_square_sk_svm,R_square_sk_Dec_Tree_Rege]
models = ['Linear Regression', 'Support Vector Machine', 'Decision Tree']

In [None]:
sns.barplot(x=R_square, y=models, color="g")
plt.xlim([-0.1,1.0])
plt.xlabel('R-Square')
plt.title('R-Square')
plt.show()

The R-squared value is higher for the Support Vector Machine (SVM) and the Decision Tree than for Linear Regression.

Thus, SVM and Decision Tree seem to be the best models for this data.
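
A ranking based on a single train/test split can depend on the chosen `random_state`. One way to check such a comparison is k-fold cross-validation with `cross_val_score`. The sketch below runs on synthetic data from `make_regression`, not on the insurance data, so the scores it prints are illustrative only and say nothing about which model wins on this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the prepared feature matrix and target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Support Vector Machine": SVR(kernel="rbf"),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, min_samples_split=20, random_state=0),
}

# Mean and standard deviation of R-squared over 5 folds for each candidate
cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

To apply the same check to the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y_log`.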

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
Save the SVM Model Using Pickle.
</div>
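
Pickling only the estimator does not store the fitted `MinMaxScaler`, so new raw inputs could not be scaled consistently later. One option, sketched here on synthetic data, is to bundle the scaler and the model in a scikit-learn pipeline, pickle the pipeline, and convert log-scale predictions back with `np.exp`. The data, the coefficients used to generate it, and the `_demo` names are hypothetical.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in data: two features (think age, bmi) and a positive, right-skewed target
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=[18, 16], high=[64, 50], size=(200, 2))
y_demo = np.exp(7.0 + 0.03 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(0, 0.1, 200))

# The pipeline scales the inputs, then fits the SVR on the log of the target
pipeline_demo = make_pipeline(MinMaxScaler(), SVR(kernel="rbf"))
pipeline_demo.fit(X_demo, np.log(y_demo))

# Serialize and restore in memory; pickle.dump / pickle.load do the same with files
restored = pickle.loads(pickle.dumps(pipeline_demo))

# Raw (unscaled) inputs go in; np.exp converts log-scale predictions to original units
new_rows = np.array([[30.0, 25.0], [55.0, 32.0]])
predicted = np.exp(restored.predict(new_rows))
print(predicted)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.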

In [None]:
import pickle

In [None]:
# Save the model to disk (the with-block closes the file handle)
filename = 'finalized_model.sav'
with open(filename, 'wb') as file:
    pickle.dump(SVM_R, file)

In [None]:
# Load the Model back from file
with open(filename, 'rb') as file:  
    SVM_Model = pickle.load(file)

SVM_Model

In [None]:
# Use the reloaded model to
# calculate the R-squared score and predict target values

# Calculate the score (R-squared for a regressor)
score = SVM_Model.score(X_test, y_test)
# Print the score
print("Test score: {0:.2f} %".format(100 * score))

# Predict the log-scale expenses using the reloaded model
Ypredict = SVM_Model.predict(X_test)

Ypredict

<div style="background-color:#faa6ce; color:#19180F; font-size:20px; font-family:cursive; padding:10px; border: 2px solid #19180F;"> 
<p style="text-align: center;">Happy Learning 😀</p>
</div>