
# **Health Insurance Premium Prediction**

The premium for a health insurance policy varies from person to person because many factors affect it. Take age as an example: a young person is much less likely to have major health problems than an older person, so treating an older person tends to be more expensive. That is why an older person is usually required to pay a higher premium than a young person.


The dataset that I am using for the task of health insurance premium prediction was collected from Kaggle. It contains the following columns: age, sex, body mass index (bmi), number of children, whether the person smokes (smoker), region, and the premium amount (charges).

In [17]:
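# Import the libraries and load the dataset
import warnings
import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

df = pd.read_csv('Health_insurance.csv')
df.head()  # last expression in the cell, so the first five rows are displayed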

# **EDA**

In [18]:
#checking null values
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset has no missing values, so it is ready to be used. After getting a first impression of the data, I noticed the “smoker” column, which indicates whether the person smokes or not. This is an important feature of this dataset because a person who smokes is more likely to have major health problems than a person who does not smoke. So let’s look at the distribution of people who smoke and who do not:

In [19]:
import plotly.express as px
figure=px.histogram(df,x='sex',color='smoker',title='Number of smokers')
figure.show()


According to the above visualisation, 547 females and 517 males don’t smoke, while 115 females and 159 males do smoke. It is important to use this feature while training a machine learning model. Both the “sex” and “smoker” columns contain string values, so I will now replace those values with 0 and 1:
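The counts read off the histogram can be turned into proportions with a few lines of pandas. This is only a sketch: it hard-codes the four counts quoted above instead of recomputing them from the CSV, and the names `counts`, `smoker_share` and `smoker_rate_by_sex` are introduced here for illustration.

```python
import pandas as pd

# Counts quoted in the text (read off the histogram), not recomputed from the CSV.
counts = pd.DataFrame(
    {"non_smoker": [547, 517], "smoker": [115, 159]},
    index=["female", "male"],
)

total = int(counts.to_numpy().sum())            # number of people in the dataset
smoker_share = counts["smoker"].sum() / total   # overall fraction of smokers
smoker_rate_by_sex = counts["smoker"] / counts.sum(axis=1)  # fraction within each sex

print(f"Total people: {total}")
print(f"Overall share of smokers: {smoker_share:.3f}")
print(smoker_rate_by_sex.round(3))
```

Roughly one person in five smokes, and the rate is somewhat higher among males than among females.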

In [20]:
df['sex']=df['sex'].map({'female':0,'male': 1})
df['smoker']=df['smoker'].map({'no': 0,'yes': 1})
df.head()

   age  sex     bmi  children  smoker     region      charges
0   19    0    27.9         0       1  southwest    16884.924
1   18    1   33.77         1       0  southeast    1725.5523
2   28    1    33.0         3       0  southeast     4449.462
3   33    1  22.705         0       0  northwest  21984.47061
4   32    1   28.88         0       0  northwest    3866.8552
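One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN` without raising an error. The sketch below uses a small made-up frame (the `sample` data is hypothetical, not taken from the dataset) to show how such unmapped values could be detected after encoding.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset.
sample = pd.DataFrame({
    "sex": ["female", "male", "Male"],   # note the inconsistent capitalisation
    "smoker": ["yes", "no", "no"],
})

encoded = sample.copy()
encoded["sex"] = sample["sex"].map({"female": 0, "male": 1})
encoded["smoker"] = sample["smoker"].map({"no": 0, "yes": 1})

# Series.map yields NaN for values that are not keys of the dictionary,
# so counting NaNs after the mapping reveals unexpected categories.
unmapped = encoded[["sex", "smoker"]].isna().sum()
print(unmapped)
```

If every count is zero, as the `df.head()` output suggests for this dataset, the encoding covered all categories.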


Now let’s have a look at how the people in the dataset are distributed across regions:

In [21]:
pie=df['region'].value_counts()
regions=pie.index
population=pie.values
fig=px.pie(df,values=population,names=regions)
fig.show()

In [22]:
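# Correlation between the numeric features of the dataset.
# numeric_only=True skips the string column 'region', which recent pandas versions would otherwise reject.
print(df.corr(numeric_only=True))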



               age       sex       bmi  children    smoker   charges
age       1.000000 -0.020856  0.109272  0.042469 -0.025019  0.299008
sex      -0.020856  1.000000  0.046371  0.017163  0.076185  0.057292
bmi       0.109272  0.046371  1.000000  0.012759  0.003750  0.198341
children  0.042469  0.017163  0.012759  1.000000  0.007673  0.067998
smoker   -0.025019  0.076185  0.003750  0.007673  1.000000  0.787251
charges   0.299008  0.057292  0.198341  0.067998  0.787251  1.000000
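To read the matrix more easily, the `charges` column can be sorted so that the features most strongly correlated with the premium come first. The sketch below copies the values from the printed matrix into a Series instead of recomputing them; on the real frame the same ranking would come from sorting the `charges` column of the correlation matrix.

```python
import pandas as pd

# Correlation of each feature with 'charges', copied from the printed matrix.
corr_with_charges = pd.Series({
    "age": 0.299008,
    "sex": 0.057292,
    "bmi": 0.198341,
    "children": 0.067998,
    "smoker": 0.787251,
})

# Strongest linear relationship with the premium first.
ranked = corr_with_charges.sort_values(ascending=False)
print(ranked)
```

Smoking status stands out clearly, followed by age and bmi, while sex and the number of children show only a weak linear relationship with the charges.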


# **Health Insurance Premium Prediction Model**
Now let’s move on to training a machine learning model for the task of predicting health insurance premiums. First, I’ll split the data into training and test sets, keeping 20% of the rows for testing:

In [23]:
from sklearn.model_selection import train_test_split

# Features used for prediction and the target column
x = np.array(df[['age', 'sex', 'bmi', 'smoker']])
y = np.array(df['charges'])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
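As a quick sanity check on the split sizes, the sketch below runs `train_test_split` on placeholder arrays with the same number of rows as the dataset. The row count is an assumption derived from the smoker counts quoted earlier (547 + 517 + 115 + 159), and `x_demo` and `y_demo` are dummy arrays, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row count assumed from the smoker/non-smoker counts quoted earlier.
n_rows = 547 + 517 + 115 + 159

# Placeholder arrays with the same shape as the real features and target.
x_demo = np.zeros((n_rows, 4))
y_demo = np.zeros(n_rows)

xtr, xte, ytr, yte = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print("train:", xtr.shape, "test:", xte.shape)
```

With `test_size=0.2` the test set holds 268 rows, which matches the number of rows in the prediction output shown later in this notebook.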

In my experiments, the random forest algorithm performed best for this task, so here I will train the model using random forest regression:

In [37]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor=RandomForestRegressor(n_estimators=100,random_state=42)
rf_regressor.fit(xtrain,ytrain)
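One way such a comparison between algorithms could be made is with cross-validation. The sketch below is an assumption about the procedure, not the comparison that was actually run for this notebook: it uses synthetic data with a made-up relationship between the features and the target, and it compares only a linear regression baseline against a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (hypothetical relationship, for illustration only).
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 65, n)
sex = rng.integers(0, 2, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, sex, bmi, smoker])
y = 250 * age + 20000 * smoker * (bmi > 30) + rng.normal(0, 2000, n)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# 5-fold cross-validated mean absolute error for each candidate (lower is better).
cv_mae = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

for name, value in cv_mae.items():
    print(f"{name}: MAE = {value:.2f}")
```

On the real dataset, `X` and `y` would be replaced by the training features and charges, and more candidate models could be added to the `models` dictionary.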

# **Now let’s have a look at the predicted values of the model**

In [38]:
# Predict with the model trained above (rf_regressor)
ypred = rf_regressor.predict(xtest)
data = pd.DataFrame(data={'Predicted Premium Amount': ypred})
print(data)

     Predicted Premium Amount
0                10915.202177
1                 5506.018889
2                28213.989351
3                 9810.025399
4                34554.545578
..                        ...
263              46957.030915
264              12469.325503
265               6302.277453
266              47093.063568
267              10758.613687

[268 rows x 1 columns]


In [39]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae=mean_absolute_error(ytest,ypred)
mse=mean_squared_error(ytest,ypred)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Squared Error: {mse:.2f}")

Mean Absolute Error: 2659.01
Mean Squared Error: 23576595.80
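The mean squared error is expressed in squared currency units, which makes it hard to compare with the premiums themselves. Taking its square root gives the root mean squared error (RMSE), which is in the same units as the charges. The sketch below simply reuses the two numbers printed above instead of recomputing them from the model.

```python
import math

# Values copied from the printed output above.
mae = 2659.01
mse = 23576595.80

# RMSE is the square root of the MSE, in the same units as 'charges'.
rmse = math.sqrt(mse)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
```

The RMSE comes out noticeably larger than the MAE, which suggests that a smaller number of predictions are off by a wide margin, since squaring penalises large errors more heavily.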


# **Summary**
**The premium for a health insurance policy varies from person to person because many factors affect it. In this notebook, a random forest regressor trained on age, sex, bmi and smoking status predicted the premium with a mean absolute error of about 2659 on the test set. I hope you liked this article on health insurance premium prediction with machine learning using Python.**