# Prediction of Premium Charges based on Health factors

The objective is to predict the insurance premium (`charges`) from the BMI, age, sex, region, and smoking status of each individual.

In [45]:
import numpy as np  
import pandas as pd  
from sklearn.preprocessing import LabelEncoder

from scipy import stats

from sklearn.model_selection import train_test_split,KFold,learning_curve
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

In [46]:
df = pd.read_csv('insurance.csv')
df.head()

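| | Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | female | 28 | 0 | yes | southwest | 16885 |
| 1 | 1 | 18 | male | 34 | 1 | no | southeast | 1726 |
| 2 | 22 | 18 | male | 34 | 0 | no | southeast | 1137 |
| 3 | 57 | 18 | male | 32 | 2 | yes | southeast | 34303 |
| 4 | 121 | 18 | male | 24 | 0 | no | northeast | 1706 |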


## Data Wrangling

In [47]:
cols = df[['age','bmi','children','charges']]
cols.isnull()


| | age | bmi | children | charges |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| ... | ... | ... | ... | ... |
| 1333 | False | False | False | False |
| 1334 | False | False | False | False |
| 1335 | False | False | False | False |
| 1336 | False | False | False | False |


In [48]:
cols = df[['sex','smoker','region']]
cols.isna()

| | sex | smoker | region |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| ... | ... | ... | ... |
| 1333 | False | False | False |
| 1334 | False | False | False |
| 1335 | False | False | False |
| 1336 | False | False | False |
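
The boolean frames above are hard to scan for a single `True`. A minimal sketch of a more compact check is to count missing entries per column. The `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the insurance data (values are made up).
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "bmi": [27.9, 33.8, 33.0, np.nan],
    "smoker": ["yes", "no", None, "no"],
})

# Count missing entries per column instead of printing a full boolean frame.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if any cell in the whole frame is missing.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```

Applied to the notebook's data, the same idea would be `df.isna().sum()`, which returns one count per column.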


### Label encoding for categorical variables

In [49]:
le = LabelEncoder()
df1 = df[['sex','smoker','region']]
df[['sex','smoker','region']] = df1.apply(lambda x: le.fit_transform(x))
df

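| | Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | 0 | 28 | 0 | 1 | 3 | 16885 |
| 1 | 1 | 18 | 1 | 34 | 1 | 0 | 2 | 1726 |
| 2 | 22 | 18 | 1 | 34 | 0 | 0 | 2 | 1137 |
| 3 | 57 | 18 | 1 | 32 | 2 | 1 | 2 | 34303 |
| 4 | 121 | 18 | 1 | 24 | 0 | 0 | 0 | 1706 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 1265 | 64 | 1 | 24 | 0 | 1 | 2 | 26927 |
| 1334 | 1334 | 18 | 0 | 32 | 0 | 0 | 0 | 2206 |
| 1335 | 1335 | 18 | 0 | 37 | 0 | 0 | 2 | 1630 |
| 1336 | 1336 | 21 | 0 | 26 | 0 | 0 | 3 | 2008 |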
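
Because one `LabelEncoder` instance is refit for every column, only the mapping of the last column remains available afterwards. If the integer codes need to be decoded later (for example, to encode new inputs at prediction time), one option is to keep a separate encoder per column. This is a sketch on a made-up `toy` frame; the `encoders` dictionary is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same categorical columns as the notebook (values made up).
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northeast"],
})

# Keep one fitted encoder per column so each mapping can be inspected or inverted.
encoders = {}
encoded = toy.copy()
for col in ["sex", "smoker", "region"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so each label's code is its position in that array.
region_mapping = {label: code for code, label in enumerate(encoders["region"].classes_)}
print(region_mapping)
```

Note that the codes depend on which labels are present: with only three regions in this toy frame, `southwest` is coded 2, whereas it is coded 3 in the notebook output above.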


## Correlation

In [50]:
df3 = df.corr(method='pearson').round(2)
df3.loc[df3.index == 'charges']

| | Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| charges | -0.0 | 0.3 | 0.06 | 0.2 | 0.07 | 0.79 | -0.01 | 1.0 |


The correlation values above show that only the `smoker` feature is strongly correlated with `charges` (0.79). The other features are weakly correlated at best: 0.30 for `age`, 0.20 for `bmi`, and below 0.1 for `children`, `sex`, and `region`. However, age, BMI, number of children, and sex are still important features that could play a vital role in predicting the premium charges.
Hence, before these features are used, they have to be checked for outliers.
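
To read the correlations at a glance, they can be ranked by absolute value against the target. The sketch below uses a small made-up frame, so its numbers are illustrative only and are not the notebook's results.

```python
import pandas as pd

# Made-up values; 'smoker' is already encoded as 0/1.
toy = pd.DataFrame({
    "age": [19, 25, 33, 45, 52, 60],
    "smoker": [1, 0, 0, 1, 0, 1],
    "charges": [17000, 2500, 4000, 24000, 9000, 30000],
})

# Pearson correlation of every feature with the target column.
corr_with_target = toy.corr(method="pearson")["charges"].drop("charges")

# Rank by strength, ignoring the sign.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked.round(2))
```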

## Outlier Detection via box plots

The box plot of each feature shows at least a few outliers. Hence, the outliers have to be reduced so that the model does not overfit to them.
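
A minimal sketch of how such box plots could be produced with matplotlib is given below. The data is randomly generated for illustration (the column names follow the notebook, the distributions are assumptions), so the plots will not match the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the numeric columns (distributions are assumptions).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    "children": rng.integers(0, 6, size=200),
    "charges": rng.lognormal(mean=9, sigma=0.9, size=200),
})

# One box plot per feature; points beyond the whiskers are drawn as outliers.
fig, axes = plt.subplots(1, len(toy.columns), figsize=(12, 3))
for ax, col in zip(axes, toy.columns):
    ax.boxplot(toy[col].to_numpy())
    ax.set_title(col)
fig.tight_layout()
```

By default the whiskers extend 1.5 times the interquartile range beyond the quartiles, which is the same rule used by the IQR filter later in the notebook.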

## Outlier Elimination based on Z-Score normalisation (Kurtosis)

Z-scores are used to identify and remove the heavy-tail records (high kurtosis) of each feature.

In [51]:
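# Features and target used for outlier detection
df4 = df[['age','bmi','children','charges','sex','smoker','region']]
# Column-wise z-score: (value - column mean) / column standard deviation
z_score = stats.zscore(df4)
#sns.distplot(z_score,hist=True);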

In [52]:
# Row and column positions of values more than 3 standard deviations above the mean
np.where(z_score>3)

(array([  32,   69,  164,  166,  238,  258,  298,  379,  398,  438,  454,
         511,  514,  543,  568,  577,  746,  796,  819,  833,  859,  878,
         937,  969, 1036, 1085, 1130, 1185, 1249], dtype=int64),
 array([2, 1, 2, 2, 1, 1, 2, 3, 2, 2, 2, 2, 2, 3, 2, 3, 2, 2, 3, 2, 3, 2,
        2, 2, 3, 2, 2, 1, 3], dtype=int64))

In [53]:
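# Keep rows where every column has z < 3. Only the high side is filtered here;
# np.abs(z_score) < 3 would also drop values far below the mean.
df4 = df4[(z_score<3).all(axis=1)]
df4.shape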

(1309, 7)
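
The filter above only removes values far above the mean. A two-sided variant takes the absolute z-score first. The sketch below uses synthetic data with one planted high outlier and one planted low outlier; it is an illustration of the technique, not a rerun of the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with two planted outliers (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bmi": rng.normal(30, 5, 500),
    "charges": rng.normal(10000, 2000, 500),
})
toy.loc[0, "bmi"] = 80.0          # far above the mean
toy.loc[1, "charges"] = -20000.0  # far below the mean

z = stats.zscore(toy.to_numpy())

# One-sided rule (as in the cell above): misses the low outlier in row 1.
one_sided_keep = (z < 3).all(axis=1)

# Two-sided rule: drops rows that are extreme in either direction.
two_sided_keep = (np.abs(z) < 3).all(axis=1)
filtered = toy[two_sided_keep]
print(len(toy), "->", len(filtered))
```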

## Outlier elimination based on Interquartile range from upper and lower extremes

In [54]:
Q1 = df4.quantile(0.25)
Q3 = df4.quantile(0.75)
IQR = Q3-Q1

In [55]:
# Drop any row with a value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any column
df5 = df4[~((df4 < (Q1 - 1.5*IQR)) | (df4 > (Q3 + 1.5*IQR))).any(axis=1)]
df5.shape

(1041, 7)

After applying the IQR rule, the number of outliers decreases considerably (1309 rows are reduced to 1041). However, removing the outliers completely may not be good practice, since it might eliminate most of the records in the dataset.
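
One thing that may be worth checking is how the IQR rule behaves on label-encoded columns. For a binary column where one category is a small minority, Q1 and Q3 can coincide, the IQR becomes 0, and every row of the minority category is flagged. The sketch below illustrates this on made-up data and shows a variant restricted to continuous columns; `iqr_mask` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Made-up data: one extreme bmi value and one smoker among eight rows.
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 28.0, 30.0, 31.0, 33.0, 60.0],
    "smoker": [0, 0, 0, 0, 0, 0, 1, 0],
})

def iqr_mask(frame, columns, k=1.5):
    """Return True for rows that lie inside the IQR fences of all given columns."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    outside = (frame[columns] < q1 - k * iqr) | (frame[columns] > q3 + k * iqr)
    return ~outside.any(axis=1)

# Applying the rule to every column also removes the only smoker (IQR of 'smoker' is 0).
all_cols = toy[iqr_mask(toy, ["bmi", "smoker"])]

# Restricting the rule to the continuous column keeps the smoker row.
continuous_only = toy[iqr_mask(toy, ["bmi"])]
print(len(all_cols), len(continuous_only))
```

Comparing `df4['smoker'].value_counts()` with `df5['smoker'].value_counts()` would show whether this effect occurs in the notebook's data.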

# Model Development and Evaluation

Since the target is a continuous value rather than a class label, this is a regression problem. The learning curve of each model is visualised to check for overfitting or underfitting. In addition, the evaluation metrics of each model are compared to find the most suitable model. The following models are checked:
1. XGBoost Regressor
2. Random Forest Regressor
3. Linear Regression
4. Lasso Regression
5. Ridge Regression

In [56]:
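import matplotlib.pyplot as plt

def plot_curve(estimator, x, y, cv=KFold(), m=np.linspace(0.5, 1, 5)):
    """Plot training and cross-validation scores against the training-set size."""
    # Pass cv through so that the requested splitter is actually used
    size, score_train, score_test = learning_curve(estimator, x, y, cv=cv, train_sizes=m)
    mean_train = np.mean(score_train, axis=1)
    mean_test = np.mean(score_test, axis=1)
    std_train = np.std(score_train, axis=1)
    std_test = np.std(score_test, axis=1)
    # Shaded bands show one standard deviation across the folds
    plt.fill_between(size, mean_train - std_train, mean_train + std_train, alpha=0.1)
    plt.fill_between(size, mean_test - std_test, mean_test + std_test, alpha=0.1)
    plt.plot(size, mean_train, label='Training score')
    plt.plot(size, mean_test, label='Cross-validation score')
    plt.xlabel('Training samples')
    # learning_curve returns the estimator's default score (R2 for regressors), not an error
    plt.ylabel('Score')
    plt.legend()
    plt.title('Learning curve')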

In [57]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

def prediction(model):
    """Fit the model on an 80/20 split of df5 and print the test-set metrics."""
    x = df5[['age','bmi','children','sex','smoker','region']]
    y = df5['charges']
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    model = model.fit(x_train, y_train)
    yhat = model.predict(x_test)
    # Copy so that the extra columns are not written back into x_test
    df_pred = x_test.copy()
    df_pred['charges'] = yhat           # predicted charges
    df_pred['Actual values'] = y_test   # observed charges
    print('MSE:', mean_squared_error(y_test, yhat))
    print('R2:', r2_score(y_test, yhat))
    print('MAE:', mean_absolute_error(y_test, yhat))
    return model, df_pred
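
The `prediction` helper scores each model on a single random split, so the metrics can change from run to run. A cross-validated comparison is one way to get steadier numbers. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` instead of `df5`, and it leaves out `XGBRegressor` to keep the example dependent on scikit-learn only (it could be added to the `models` dictionary in the same way).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data with six features (stands in for the six predictors).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=0.1),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Shuffled 5-fold split with a fixed seed so that every model sees the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    mean_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```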

### XGBoost Regressor

### Random Forest Regressor

### Lasso Regression

### Linear Regression

### Ridge Regression

In [60]:
x = df5[['age','bmi','children','sex','smoker','region']]
y = df5['charges']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

In [62]:
import pickle
r = Ridge(alpha=0.1)
model = r.fit(x_train, y_train)
# The context manager closes the file once the model has been written
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

In [63]:
model

Ridge(alpha=0.1)
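
A saved model can be loaded back with `pickle.load` and used for prediction. The sketch below is self-contained: it fits a `Ridge` model on random data and writes it to a temporary directory (both are assumptions made for illustration), then checks that the restored model predicts the same values.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Random stand-in data with six features (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = X @ np.array([1.0, 2.0, 0.5, 0.0, 3.0, -1.0]) + rng.normal(scale=0.1, size=50)
model = Ridge(alpha=0.1).fit(X, y)

# Save to a temporary location so the example does not overwrite a real model.pkl.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Load the model back and confirm that the predictions are unchanged.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

same = np.allclose(model.predict(X[:5]), restored.predict(X[:5]))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and loading generally works most reliably with the same scikit-learn version that was used for saving.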

### Inference

Based on the learning curves and predictions of the regression models, the following insights are listed.

* **XGBoost Regressor**: This model has a very low R2 score of 0.12, and the learning curve hints at high variance/bias. Hence, this model may not be suitable.
* **Random Forest Regressor**: This model has a better R2 score than XGBoost, but the learning curve shows deviation. Hence, this model is also not suitable.
* **Ridge Regression**: This model has an R2 score of 0.42, and the learning curve shows low bias and variance.
* **Lasso Regression**: This model has an R2 score of 0.44, and the learning curve shows low bias and variance.
* **Linear Regression**: This model has an R2 score of 0.26, and the learning curve shows low bias and variance.

Comparing the Ridge, Lasso, and Linear models:
* The Linear model has a lower R2 score than the other two. Hence, it can be eliminated.

Comparing Ridge and Lasso:
* Lasso has a better R2 score than Ridge. However, when the predicted values of the two models are compared, the suppressing (coefficient-shrinking) behaviour of the Lasso model causes some values to be predicted well below the actual value, while the Ridge predictions match the actual values considerably better.

Hence, the Ridge Regression model is chosen for the prediction of charges.
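
The difference in shrinkage behaviour between the two models can be illustrated on synthetic data. In this sketch (data, coefficients, and `alpha` values are assumptions chosen for illustration), only the first two features drive the target. Lasso sets the coefficients of the irrelevant features to exactly zero and shrinks the others noticeably, while Ridge shrinks all coefficients only slightly.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: the last two features are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
```

Stronger shrinkage of the relevant coefficients pulls predictions towards the mean, which is consistent with Lasso under-predicting large actual values.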

***Note: The MSE and MAE values of all the models are very high because a few outliers remain in the data. Since eliminating all the outliers might remove most of the records, this error is accepted as an exception.***
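
The effect of a single badly predicted outlier on these metrics can be seen in a small worked example. The numbers below are made up: four predictions are off by 100 and one is off by 30000.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted charges; the last record is an outlier.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0, 40000.0])
y_pred = np.array([1100.0, 1900.0, 3100.0, 3900.0, 10000.0])

# Metrics with and without the outlier record.
mse_all = mean_squared_error(y_true, y_pred)
mae_all = mean_absolute_error(y_true, y_pred)
mse_trim = mean_squared_error(y_true[:-1], y_pred[:-1])
mae_trim = mean_absolute_error(y_true[:-1], y_pred[:-1])

print("MSE with / without outlier:", mse_all, mse_trim)
print("MAE with / without outlier:", mae_all, mae_trim)
```

Because MSE squares each error, the single outlier dominates it far more than it dominates MAE.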