# Health Insurance Cost Prediction


Many factors affect how much health insurance companies charge. Below are some of the major factors that insurers in the USA consider when setting health insurance premiums.


* **age:**  age of the primary beneficiary

* **sex:**  gender of the insurance contractor (female or male)

* **bmi:**  body mass index, a value derived from a person's mass and height. BMI is defined as body mass divided by the square of body height and is expressed in units of kg/m², with mass in kilograms and height in metres

* **children:**  number of children covered by health insurance (number of dependents)

* **smoker:**  whether the beneficiary smokes (yes or no)

* **region:**  the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

My goal is to predict the health insurance premium cost based on the factors above.
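
As a quick illustration of the BMI definition in the feature list, here is a minimal plain-Python sketch. The function name `bmi` and the example person are made up for illustration and are not part of the dataset or the notebook.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: mass divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
print(round(bmi(70, 1.75), 2))  # 22.86
```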

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-poster'
plt.style.use('seaborn-poster')

In [None]:
df=pd.read_csv('../input/insurance/insurance.csv')
df.sample(5)

In [None]:
df.info()

## Visualization and EDA

In [None]:
# Mean charges per region, sorted from lowest to highest
charges = df['charges'].groupby(df.region).mean().sort_values(ascending = True)
print(charges)

sns.barplot(x=charges, y=charges.index, palette='Blues')
plt.title("Average Health Cost by Region")
plt.show()

In [None]:
sns.boxplot(x=df['smoker'],y=df['charges'])
plt.title('Health Cost for Smokers and Non-Smokers')
plt.show()

In [None]:
sns.pairplot(data=df, hue='smoker')

In [None]:
sns.displot(df,x='charges',hue='smoker')
plt.show()

In [None]:

from sklearn.preprocessing import LabelEncoder

# Treat the text columns as categoricals, then replace each label with an integer code
df[['sex', 'smoker', 'region']] = df[['sex', 'smoker', 'region']].astype('category')
label = LabelEncoder()
label.fit(df.region.drop_duplicates())
df.region = label.transform(df.region)
label.fit(df.sex.drop_duplicates())
df.sex = label.transform(df.sex)

label.fit(df.smoker.drop_duplicates())
df.smoker = label.transform(df.smoker)

df.head()
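
The cell above replaces each category label with an integer. As a rough plain-Python sketch of that mapping (the helper `label_encode` is hypothetical, not a scikit-learn function), codes are assigned in sorted label order, which is how `LabelEncoder` orders its classes. Note that the resulting integers imply an ordering that has no real meaning for a column like `region`.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


# A few region labels written out by hand for illustration
regions = ["southwest", "southeast", "northwest", "northeast", "southeast"]
codes, classes = label_encode(regions)
print(classes)  # ['northeast', 'northwest', 'southeast', 'southwest']
print(codes)    # [3, 2, 1, 0, 2]
```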


In [None]:
sns.heatmap(df.corr(),annot=True,cmap='Blues_r')
plt.title("Heatmap of Correlation")
plt.show()

In [None]:
'''df_copy=df.copy()
df['smoker']=df['smoker'].map({'yes':1,'no':0})
#df.drop('sex',axis=1,inplace=True)
df=pd.get_dummies(df,columns=['region','sex'])
df=df[df.columns[[0,1,2,3,5,6,7,8,9,10,4]]] #rearranging columns
df.head()'''

![BMI](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fimages.agoramedia.com%2Feverydayhealth%2Fgcms%2FBMI-in-Adults-722x406.jpg&f=1&nofb=1)

### Observations

* Smokers are charged more than non-smokers.
* People with an unhealthy BMI (greater than 25) are charged more.
* Relatively few people pay a high premium.
* Premiums are higher for older people than for younger ones.
* People living in the southeast have the highest average medical cost and those in the southwest the lowest.
* Sex, number of children and region have the lowest impact on medical cost.
* Smoking, BMI and age have the highest impact on medical cost.
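
The "impact" in these observations is read off the correlation heatmap, and `df.corr()` computes the Pearson coefficient by default. A minimal plain-Python sketch of that coefficient follows; the numbers are made up for illustration and are not rows from `insurance.csv`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up values: smoker flag (0/1) and charges for six hypothetical people
smoker = [0, 0, 0, 1, 1, 1]
charges = [2000.0, 3000.0, 2500.0, 20000.0, 25000.0, 22000.0]
print(round(pearson(smoker, charges), 3))
```

A value close to 1 or -1 indicates a strong linear relationship, and a value near 0 a weak one.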

## Prediction Model

In [None]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

X = df.drop('charges',axis=1)
y=df['charges']

X2=df.drop(['charges','sex','region'],axis=1) #dropping features with lowest impact.
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.2, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X

## LinearRegression Model

In [None]:
lin_reg1=LinearRegression()
lin_reg1.fit(X_train,y_train)

lin_reg2=LinearRegression()
lin_reg2.fit(X2_train,y2_train)


In [None]:
y_pred1=lin_reg1.predict(X_test)
y_pred2=lin_reg2.predict(X2_test)

df2=pd.DataFrame({'Actual':y_test,'Predicted':y_pred1})
sns.scatterplot(x=y_test,y=y_pred1)
plt.title("Predicted vs Actual cost as per Linear Model")
plt.show()

In [None]:
print("Metrics for Model 1")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred1))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred1))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))
print("R2 score: ",metrics.r2_score(y_test, y_pred1))
print("\n")
print("Metrics for Model 2")
print('Mean Absolute Error:', metrics.mean_absolute_error(y2_test, y_pred2))
print('Mean Squared Error:', metrics.mean_squared_error(y2_test, y_pred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y2_test, y_pred2)))
print("R2 score: ",metrics.r2_score(y2_test, y_pred2))

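The model trained with all features performed better than the model trained with only the important features. Note that the two models use different train/test splits (`test_size=0.3, random_state=0` versus `test_size=0.2, random_state=1`), so the comparison is only approximate.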
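
For reference, the four metrics printed above can be written out in plain Python. This is only a sketch of the standard definitions, not scikit-learn's implementation, and the sample values are arbitrary.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """R2 score: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Arbitrary example values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mae(y_true, y_pred), mse(y_true, y_pred))  # 0.5 0.375
print(round(r2(y_true, y_pred), 4))              # 0.9486
```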

## PolynomialRegression Model

In [None]:
from sklearn.preprocessing import PolynomialFeatures
lin_reg=LinearRegression()
for i in [2,3,4,5]:
    poly_reg=PolynomialFeatures(degree=i)

    X_poly=poly_reg.fit_transform(X_train)

    lin_reg.fit(X_poly,y_train)
    # Reuse the transformer fitted on the training data
    y_pred_poly=lin_reg.predict(poly_reg.transform(X_test))

    print("Degree: ",i)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_poly))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_poly))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_poly)))
    print("R2 score: ",metrics.r2_score(y_test, y_pred_poly))
    print("\n")
    
# Degree 2 gave the best scores, so refit with degree 2
poly_reg=PolynomialFeatures(degree=2)

X_poly=poly_reg.fit_transform(X_train)

lin_reg.fit(X_poly,y_train)
y_pred_poly=lin_reg.predict(poly_reg.transform(X_test))

Polynomial regression of degree 2 performed best among the degrees tried (2 to 5).
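
To make the polynomial step concrete, here is a plain-Python sketch of what a degree-2 expansion produces for one row: a constant term, the original features, and every product of two features. The helper `polynomial_features` is hypothetical; it is meant to mirror the column layout of `PolynomialFeatures` with default settings, not to replace it.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree):
    """All products of up to `degree` features, starting with the constant 1."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            features.append(prod(row[i] for i in combo))
    return features


# Two features a=2, b=3 expand to [1, a, b, a^2, a*b, b^2]
print(polynomial_features([2, 3], 2))  # [1, 2, 3, 4, 6, 9]

# Six input features, as in X, give 28 columns at degree 2
print(len(polynomial_features([1] * 6, 2)))  # 28
```

The column count grows quickly with the degree, which is one reason higher degrees can fit the training data more closely without predicting better on the test data.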

In [None]:
y_pred_poly=lin_reg.predict(poly_reg.transform(X_test))
df3=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_poly})

sns.scatterplot(x=y_test,y=y_pred_poly)
plt.title("Predicted vs Actual cost as per Polynomial Regression")
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_poly))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_poly))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_poly)))
print("R2 score: ",metrics.r2_score(y_test, y_pred_poly))

## RandomForestRegressor Model

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(
    n_estimators = 1100,
    max_depth = 4,
    random_state = 1,
    max_leaf_nodes=1000
)

rfr.fit(X_train, y_train)


In [None]:
y_pred_rf=rfr.predict(X_test)
df3=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_rf})

sns.scatterplot(x=y_test,y=y_pred_rf)
plt.title("Predicted vs Actual cost as per RandomForest Regressor")
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_rf))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_rf))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rf)))
print("R2 score: ",metrics.r2_score(y_test, y_pred_rf))

In [None]:
rfr2 = RandomForestRegressor(
    n_estimators = 700,
    max_depth = 4,
    random_state = 1,
    max_leaf_nodes=1000
)

rfr2.fit(X2_train, y2_train)

y_pred2=rfr2.predict(X2_test)

print("Metrics for Model 2")
print('Mean Absolute Error:', metrics.mean_absolute_error(y2_test, y_pred2))
print('Mean Squared Error:', metrics.mean_squared_error(y2_test, y_pred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y2_test, y_pred2)))
print("R2 score: ",metrics.r2_score(y2_test, y_pred2))

For the RandomForest Regressor too, the model trained with all features performed better.
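
As a rough intuition for how a random forest makes predictions, the sketch below averages many depth-1 trees, each fitted on a bootstrap resample of one made-up feature. It is a simplified illustration only: `RandomForestRegressor` grows deeper trees and also subsamples features, and the function names and data here are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), xs[0], overall, overall)  # fallback when no split is possible
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def fit_forest(xs, ys, n_trees=25, seed=1):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_forest(stumps, x):
    """Average the predictions of all stumps."""
    preds = [lm if x <= t else rm for t, lm, rm in stumps]
    return sum(preds) / len(preds)


# Made-up data: ages and charges, not taken from insurance.csv
ages = [18, 22, 25, 30, 50, 55, 60, 64]
costs = [2000.0, 2000.0, 2000.0, 2000.0, 12000.0, 12000.0, 12000.0, 12000.0]
forest = fit_forest(ages, costs)
print(predict_forest(forest, 18), predict_forest(forest, 64))
```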

In [None]:
y_pred_rf_train=rfr.predict(X_train)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, y_pred_rf_train))
print('Mean Squared Error:', metrics.mean_squared_error(y_train, y_pred_rf_train))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_rf_train)))
print("R2 score: ",metrics.r2_score(y_train, y_pred_rf_train))

sns.scatterplot(x=y_train,y=y_pred_rf_train)
plt.title("Predicted vs Actual cost as per RandomForest Regressor on Training Data")
plt.show()


## Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

alphas = [0.1000001, 0.0001, 0.0001, 0.001, 0.01, 0.1,0.5, 0.0000002]

# Note: the `normalize` argument was removed in scikit-learn 1.2, so this cell assumes an older version
for a in alphas:
    ridge_reg = Ridge(alpha=a, normalize=True,fit_intercept=True,max_iter=1000).fit(X_train,y_train)
    score = ridge_reg.score(X_test, y_test)
    pred_y = ridge_reg.predict(X_test)
    mse = metrics.mean_squared_error(y_test, pred_y)
    print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
        .format(a, score, mse, np.sqrt(mse)))

In [None]:
# ridge_reg is the model from the last loop iteration (alpha=0.0000002)
y_pred_rd=ridge_reg.predict(X_test)
df3=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_rd})

sns.scatterplot(x=y_test,y=y_pred_rd)
plt.title("Predicted vs Actual cost as per Ridge Regression")
plt.show()

Ridge Regression with alpha 0.0001 performed best among the alpha values tried, but it still did not outperform the Random Forest Regressor.
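
To show what `alpha` does in ridge regression, here is a plain-Python sketch for a single feature with an unpenalised intercept. It minimises the squared error plus `alpha` times the squared slope, so a larger `alpha` pulls the slope towards zero. The data is made up, and the numbers will not match the notebook's output because the notebook also passes `normalize=True` and uses six features.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimising sum((y - b - w*x)^2) + alpha * w^2, intercept b unpenalised."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + alpha)


# Made-up data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
for alpha in [0.0, 1.0, 10.0]:
    print(alpha, round(ridge_slope(xs, ys, alpha), 3))
```

With `alpha=0` the result is the ordinary least-squares slope of 2.0, and with `alpha=10` it shrinks to 1.0. This is consistent with very small alpha values behaving almost like plain linear regression.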

## Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

alphas = [0.1000001, 0.0001, 0.0001, 0.001, 0.01, 0.1,0.5, 0.0000002]

for a in alphas:
    # Fit one Lasso model per alpha value and score it on the test set
    lasso_reg = Lasso(alpha=a, fit_intercept=True, normalize=True, precompute=True, max_iter=10000,
                  tol=0.0001, warm_start=False, positive=False, random_state=1, selection='cyclic'
                ).fit(X_train, y_train)
    y_pred_ls=lasso_reg.predict(X_test)
    mse = metrics.mean_squared_error(y_test, y_pred_ls)
    score = lasso_reg.score(X_test, y_test)
    print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
        .format(a, score, mse, np.sqrt(mse)))

In [None]:
y_pred_ls=lasso_reg.predict(X_test)
df3=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_ls})

sns.scatterplot(x=y_test,y=y_pred_ls)
plt.title("Predicted vs Actual cost as per Lasso Regression")
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_ls))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_ls))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_ls)))
print("R2 score: ",metrics.r2_score(y_test, y_pred_ls))
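
Lasso differs from ridge in that its penalty can set a coefficient to exactly zero. The plain-Python sketch below shows this for a single feature, assuming the objective `(1 / (2 * n)) * sum((y - b - w*x)^2) + alpha * |w|` with an unpenalised intercept. The helper names and data are made up, and the values will not match the notebook's output, which uses `normalize=True` and six features.

```python
def soft_threshold(z, alpha):
    """Shrink z towards zero by alpha, returning exactly 0 inside [-alpha, alpha]."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0


def lasso_slope(xs, ys, alpha):
    """Closed-form lasso slope for one feature with an unpenalised intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return soft_threshold(sxy / n, alpha) / (sxx / n)


# Made-up data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
for alpha in [0.0, 1.0, 4.0]:
    print(alpha, lasso_slope(xs, ys, alpha))
```

Here the slope goes from 2.0 at `alpha=0` to 1.5 at `alpha=1` and to exactly 0.0 at `alpha=4`.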

# Conclusion

In this problem of predicting the insurance premium to be paid, the RandomForest Regressor performed best, with the following errors and scores on train and test data. For the algorithms where both feature sets were compared (linear regression and random forest), models trained with all features performed better than models trained with only the important features.

### Train Data

- Mean Absolute Error: 2461.2990118276616
- Mean Squared Error: 19048537.765107304
- Root Mean Squared Error: 4364.463055761534
- R2 score:  0.8648907958535507

### Test Data
- Mean Absolute Error: 2494.4582408483166
- Mean Squared Error: 17609514.276676927
- Root Mean Squared Error: 4196.369177834205
- R2 score:  0.889574381750296

R2 and RMSE were better on the test data than on the training data, although MAE was slightly higher on the test data.

In [None]:
# Predict on the full dataset and compare with the actual charges
y_pred_rf_all=rfr.predict(X)
df_all=pd.DataFrame({'Actual':y,'Predicted':y_pred_rf_all})
df_all.head(10)

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor