# Regression on the Medical Cost Personal Dataset

# Contents

- Data Pre-processing
- Feature engineering
- Predictive Modelling
- Project Outcomes & Conclusions

# Data Pre-processing

# Import Libraries

In [27]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
warnings.filterwarnings("ignore")

# Import DataSet

In [2]:
df=pd.read_csv("insurance.csv")

# Check Top 5 Rows

In [3]:
df.head()

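| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |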


# Check Features

In [4]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [5]:
df.children.value_counts()

0    574
1    324
2    240
3    157
4     25
5     18
Name: children, dtype: int64

# Feature engineering

In [6]:
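# LabelEncoder numbers the classes of a column 0..n-1 in sorted order
le=LabelEncoder()
df['sex']=le.fit_transform(df['sex'])  # female -> 0, male -> 1
df['children']=le.fit_transform(df['children'])  # already integers 0-5, so the codes are unchanged
df['smoker']=le.fit_transform(df['smoker'])  # no -> 0, yes -> 1
df['region']=le.fit_transform(df['region'])  # northeast -> 0, northwest -> 1, southeast -> 2, southwest -> 3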

**Observation:**
   - **`LabelEncoder` converts the categorical features (`sex`, `smoker`, `region`) into integer codes so that the models can use them.**
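
To make the encoding concrete, here is a minimal sketch on a toy frame (the toy values are made up for illustration and are not rows from `insurance.csv`). `LabelEncoder` assigns integer codes in sorted order of the class labels, which puts an arbitrary order on a nominal column such as `region`. `pd.get_dummies` is shown as one common alternative, not as what this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (not rows from insurance.csv)
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# LabelEncoder sorts the classes and numbers them 0..n-1
le_region = LabelEncoder()
region_codes = le_region.fit_transform(toy["region"])
region_mapping = {label: code for code, label in enumerate(le_region.classes_)}
print(region_mapping)

# Alternative: one-hot encoding avoids implying an order between regions
one_hot = pd.get_dummies(toy, columns=["region"], drop_first=True)
print(one_hot.columns.tolist())
```

With one-hot encoding, linear and distance-based models no longer treat `southwest` (code 3) as "larger" than `northeast` (code 0). Tree-based models are usually less sensitive to this choice.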

# Correlation between columns

In [7]:
df.corr().style.background_gradient(cmap='GnBu')  # Correlation matrix shaded as a heatmap; the charges row shows how strongly each feature correlates with the target.

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| age | 1.0 | -0.020856 | 0.109272 | 0.042469 | -0.025019 | 0.002127 | 0.299008 |
| sex | -0.020856 | 1.0 | 0.046371 | 0.017163 | 0.076185 | 0.004588 | 0.057292 |
| bmi | 0.109272 | 0.046371 | 1.0 | 0.012759 | 0.00375 | 0.157566 | 0.198341 |
| children | 0.042469 | 0.017163 | 0.012759 | 1.0 | 0.007673 | 0.016569 | 0.067998 |
| smoker | -0.025019 | 0.076185 | 0.00375 | 0.007673 | 1.0 | -0.002181 | 0.787251 |
| region | 0.002127 | 0.004588 | 0.157566 | 0.016569 | -0.002181 | 1.0 | -0.006208 |
| charges | 0.299008 | 0.057292 | 0.198341 | 0.067998 | 0.787251 | -0.006208 | 1.0 |


# Correlation With Target

In [8]:
df.corr()['charges'].sort_values()

region     -0.006208
sex         0.057292
children    0.067998
bmi         0.198341
age         0.299008
smoker      0.787251
charges     1.000000
Name: charges, dtype: float64
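
`df.corr()` computes the Pearson correlation coefficient by default. As a minimal sketch of what that number means, the snippet below computes it by hand on made-up toy values and compares it with pandas. The helper name `pearson_r` is hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not rows from insurance.csv)
toy = pd.DataFrame({
    "smoker": [1, 0, 0, 1, 0, 1],
    "charges": [17000.0, 1700.0, 4400.0, 22000.0, 3900.0, 30000.0],
})

def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

manual = pearson_r(toy["smoker"], toy["charges"])
pandas_value = toy["smoker"].corr(toy["charges"])
print(manual, pandas_value)
```

Pearson correlation only measures linear association. For a label-encoded nominal column such as `region`, the value also depends on the arbitrary order of the integer codes, so a coefficient near zero there should be read with caution.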

In [9]:
df.head()

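| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 | 3866.8552 |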


# Predictive Modelling

# Split the feature and target output

In [10]:
y=df['charges']
X=df.drop('charges',axis=1)

In [11]:
X

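| | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.900 | 0 | 1 | 3 |
| 1 | 18 | 1 | 33.770 | 1 | 0 | 2 |
| 2 | 28 | 1 | 33.000 | 3 | 0 | 2 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 |
| 4 | 32 | 1 | 28.880 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | 1 | 30.970 | 3 | 0 | 1 |
| 1334 | 18 | 0 | 31.920 | 0 | 0 | 0 |
| 1335 | 18 | 0 | 36.850 | 0 | 0 | 2 |
| 1336 | 21 | 0 | 25.800 | 0 | 0 | 3 |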


In [12]:
y

0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges, Length: 1338, dtype: float64

# Train Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# StandardScaler

In [14]:
sc=StandardScaler()
numeric=['age', 'bmi', 'children']
X_train[numeric]=sc.fit_transform(X_train[numeric])
X_test[numeric]=sc.transform(X_test[numeric])
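
The scaler is fitted on the training split only and then reused on the test split, which keeps test-set statistics out of training. A minimal sketch of that pattern on synthetic data (the arrays below are randomly generated for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic "age-like" column, for illustration only
rng = np.random.default_rng(0)
train = rng.normal(loc=40.0, scale=12.0, size=(100, 1))
test = rng.normal(loc=40.0, scale=12.0, size=(30, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.mean(), train_scaled.std())  # close to 0 and 1
print(test_scaled.mean(), test_scaled.std())    # close to, but not exactly, 0 and 1
```

The test split is not expected to come out with exactly zero mean and unit variance, because it is standardised with the training statistics.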

# LinearRegression

In [21]:
lr=LinearRegression()
lr.fit(X_train,y_train)
y_pre=lr.predict(X_test)
print("Linear Regression")
print("R2 Score:",r2_score(y_test,y_pre),"\n","MAE:",mean_absolute_error(y_test,y_pre),"\n","RMSE:",np.sqrt(mean_squared_error(y_test,y_pre)))

Linear Regression
R2 Score: 0.7602640802497019 
 MAE: 4204.415654724193 
 RMSE: 5927.226827909312


**Observation:**
- **R2 Score of LinearRegression is 76.0%**
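
The three metrics printed for each model can be checked by hand. The sketch below uses four made-up values (not from the dataset) and compares manual formulas with scikit-learn. Note that `np.sqrt(mean_squared_error(...))` is the root mean squared error (RMSE), which is in the same units as `charges`, not the MSE itself.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Toy values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred
mae = np.abs(residuals).mean()    # mean absolute error
mse = (residuals ** 2).mean()     # mean squared error
rmse = np.sqrt(mse)               # root mean squared error
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1.0 - (residuals ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
print("sklearn R2:", r2_score(y_true, y_pred))
```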

# DecisionTreeRegressor

In [22]:
dt=DecisionTreeRegressor()
dt.fit(X_train,y_train)
y_pre=dt.predict(X_test)
print("DecisionTreeRegressor")
print("R2 Score:",r2_score(y_test,y_pre),"\n","MAE:",mean_absolute_error(y_test,y_pre),"\n","RMSE:",np.sqrt(mean_squared_error(y_test,y_pre)))

DecisionTreeRegressor
R2 Score: 0.7238503324918447 
 MAE: 2913.2572404049774 
 RMSE: 6361.466655191974


**Observation:**
- **R2 Score of DecisionTreeRegressor is 72.4%**

# SVR (Support Vector Regression)

In [23]:
svr=SVR()
svr.fit(X_train,y_train)
y_pre=svr.predict(X_test)
print("SVR")
print("R2 Score:",r2_score(y_test,y_pre),"\n","MAE:",mean_absolute_error(y_test,y_pre),"\n","RMSE:",np.sqrt(mean_squared_error(y_test,y_pre)))

SVR
R2 Score: -0.0813150467358068 
 MAE: 8284.522600650624 
 RMSE: 12588.127003492918


**Observation:**
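- **R2 Score of SVR is -8.1%. A negative R2 means the model fits the test set worse than always predicting the mean of `charges`.**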
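
A negative R2 can look like a bug, so here is a minimal sketch of how it arises, using made-up numbers. Always predicting the mean of the target gives an R2 of exactly 0, and anything worse than that goes below zero. One plausible cause in this notebook, which was not verified here, is that `SVR` was run with its defaults (`C=1.0`, `epsilon=0.1`, RBF kernel) against an unscaled target measured in thousands.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: always predict the mean of the target
mean_baseline = np.full_like(y_true, y_true.mean())
# A predictor that is worse than the baseline: always predict 0
poor_prediction = np.zeros_like(y_true)

r2_baseline = r2_score(y_true, mean_baseline)
r2_poor = r2_score(y_true, poor_prediction)
print(r2_baseline, r2_poor)
```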

# RandomForestRegressor

In [24]:
rf=RandomForestRegressor()
rf.fit(X_train,y_train)
y_pre=rf.predict(X_test)
print("RandomForestRegressor")
print("R2 Score:",r2_score(y_test,y_pre),"\n","MAE:",mean_absolute_error(y_test,y_pre),"\n","RMSE:",np.sqrt(mean_squared_error(y_test,y_pre)))

RandomForestRegressor
R2 Score: 0.8439337627868168 
 MAE: 2602.190610648192 
 RMSE: 4782.329247536528


**Observation:**
- **R2 Score of RandomForestRegressor is 84.4%**

# KNeighborsRegressor

In [25]:
knn=KNeighborsRegressor()
knn.fit(X_train,y_train)
y_pre=knn.predict(X_test)
print('KNN')
print("R2 Score:",r2_score(y_test,y_pre),"\n","MAE:",mean_absolute_error(y_test,y_pre),"\n","RMSE:",np.sqrt(mean_squared_error(y_test,y_pre)))

KNN
R2 Score: 0.705990982031571 
 MAE: 3878.3953321199097 
 RMSE: 6563.950820437874


**Observation:**
- **R2 Score of KNN is 70.6%**

**Conclusion**
- **Ranking by R2 score: RandomForestRegressor (84.4%) > LinearRegression (76.0%) > DecisionTreeRegressor (72.4%) > KNN (70.6%) > SVR (-8.1%)**
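
A comparison like this can also be run in a single loop. The sketch below uses synthetic data from `make_regression` as a stand-in for the insurance features, so its scores will not match the ones above. It also fixes `random_state` on the tree-based models, which the individual cells did not, so repeated runs give the same ranking.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "KNeighborsRegressor": KNeighborsRegressor(),
}

# Fit every model on the same split and record its test-set R2
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))

# Print the models from best to worst R2
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```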

**RandomForestRegressor shows the highest R2 score, so we now apply cross-validation to it.**

# Using Cross Validation

In [29]:
rf=RandomForestRegressor(random_state=42)
score = cross_val_score(rf,X_train,y_train,cv=5,n_jobs=-1)
rf.fit(X_train,y_train)
print("RandomForest cross-validation R2 (%):",score.mean()*100)
print("Test R2 (%):",rf.score(X_test,y_test)*100)

RandomForest cross-validation R2 (%): 82.59242205983405
Test R2 (%): 84.64353574853794
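
For a regressor, `cross_val_score` and `.score()` report R2 by default, so the two figures above are R2 values rather than classification accuracy. The sketch below makes the metric explicit with `scoring="r2"` and reports the spread across folds as well as the mean. It runs on synthetic data from `make_regression`, and the shuffled `KFold` is an illustrative choice, not the setup used in the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

# Shuffled 5-fold split with a fixed seed so the folds are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print(f"R2 per fold: {np.round(scores, 3)}")
print(f"Mean R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score depends on which rows land in each fold, which matters on a dataset of this size.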


# Project Outcomes & Conclusions

**Here are some of the key outcomes of the project:**
- The dataset is quite small: 1338 samples and 7 columns (6 features plus the target `charges`).
- Some features are categorical, so we applied `LabelEncoder` to convert them into numerical form.
- Examining the correlations gave some insight into the relationships between the features: `smoker` has by far the strongest correlation with `charges` (0.79), followed by `age` (0.30) and `bmi` (0.20).
- Testing multiple algorithms gave us some understanding of how model performance varies across algorithms on this specific dataset.
- With R2 score as the key metric, the Random Forest Regressor performed best (about 0.84 on the test set and 0.83 on average over 5-fold cross-validation), followed by Linear Regression (about 0.76).