# Description of the question
Health insurance is a type of insurance that covers medical expenses incurred by the insured. In a medical emergency, a health insurance policy acts as financial assistance to the policyholder. As treatment costs keep rising and quality medical care becomes harder to afford, people tend to purchase the health insurance plan that suits them best and pay a premium in exchange for medical benefits.
The insurance premium may vary with the insurance plan and with the age, profession, family health history, and health issues of the insured, among other factors. Determining an accurate premium for each customer's circumstances is therefore important for building stronger customer relationships, customizing health insurance plans, and reducing the risk faced by the insurer.
The main objective of this analysis is to help the insurer price policies more accurately by predicting the policy premium and identifying the factors that most strongly affect the premium price, based on data collected from individuals.
To predict the premium accurately, we identify the factors associated with a person's yearly health insurance premium and fit models to the data, aiming for a model that predicts the premium price accurately.
Based on these predictions, the insurer can make better decisions in risk management and when suggesting insurance plans to customers.

In [None]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
df = pd.read_csv('../input/medical-insurance-premium-prediction/Medicalpremium.csv')
df.head()

# Exploratory Data Analysis

In [None]:
df.describe()

##  Checking for Null Values

In [None]:
df.isnull().sum()

## EDA on response variable

In [None]:
sns.histplot(data=df, x='PremiumPrice',bins=10, kde=True )

The response variable PremiumPrice is not normally distributed, so we might need to transform it when building models.
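
One common option, if the distribution turns out to be right-skewed, is a log transform. The sketch below is only an illustration on synthetic data: the `premium` series is generated, not read from `Medicalpremium.csv`, and the choice of `log1p` is an assumption rather than something this notebook applies.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for PremiumPrice (not the real data).
rng = np.random.default_rng(0)
premium = pd.Series(rng.lognormal(mean=10.0, sigma=0.5, size=1000), name="PremiumPrice")

# log1p compresses the long right tail of the distribution.
log_premium = np.log1p(premium)

# expm1 is the inverse, used to map model predictions back to the original scale.
restored = np.expm1(log_premium)

print("skewness before:", round(premium.skew(), 2))
print("skewness after: ", round(log_premium.skew(), 2))
```

If a model is trained on the transformed target, its predictions have to be passed through `np.expm1` before they are compared with the original premium prices.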

## EDA on Quantitative variables

### Correlation Map

In [None]:
plt.figure(figsize=(9,9))
corr = df.corr()
sns.heatmap(corr, cmap='coolwarm', annot=True,linewidths=0.1)

* Premium price is strongly related to the customer's age: the correlation coefficient of +0.71 indicates a strong positive relationship.
* Premium price shows very low correlation with several variables, which should be dropped or re-engineered when building the model.
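
Reading the target's row off a large heatmap can be tedious, so it may help to rank the correlations directly. This is a minimal sketch on a toy frame: the `toy` data and its coefficients are made up for illustration, and only the column names `Age`, `Height` and `PremiumPrice` are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: Age drives the target, Height is pure noise (illustrative only).
rng = np.random.default_rng(1)
n = 500
age = rng.integers(18, 67, size=n)
height = rng.normal(168, 10, size=n)
toy = pd.DataFrame({
    "Age": age,
    "Height": height,
    "PremiumPrice": 15000 + 300 * age + rng.normal(0, 3000, size=n),
})

# Correlation of every feature with the target, ranked by absolute value.
target_corr = (
    toy.corr()["PremiumPrice"]
    .drop("PremiumPrice")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

On the real data the same three chained calls can be applied to `df` in place of `toy`.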



### Age

In [None]:
sns.lineplot(x=df.Age,y=df.PremiumPrice)

### Height

In [None]:
sns.lineplot(x=df.Height,y=df.PremiumPrice)

### Weight

In [None]:
sns.lineplot(x=df.Weight,y=df.PremiumPrice)

* The plots above show premium price against height and against weight; neither shows a specific pattern that would indicate a positive or negative relationship.

* Since height and weight on their own show barely any relationship with premium price, it is better either to ignore them or to combine them into a new variable such as BMI that may have an impact on the premium price.

## EDA on Qualitative variables

In [None]:
Categorical_Variables = ['Diabetes', 'BloodPressureProblems', 'AnyTransplants',
       'AnyChronicDiseases','KnownAllergies', 'HistoryOfCancerInFamily', 'NumberOfMajorSurgeries']

In [None]:
plt.figure(figsize=(15,15))
a = 3
b = 3
c = 1

for feature in Categorical_Variables:
    plt.subplot(a,b,c)
    df.groupby(feature)['PremiumPrice'].mean().plot.bar()
    c=c+1
    
plt.show()

In [None]:
plt.figure(figsize=(15,15))
a = 3
b = 3
c = 1

for feature in Categorical_Variables:
    plt.subplot(a,b,c)
    sns.kdeplot(x='PremiumPrice', data=df, hue=feature, fill=True, common_norm=False, 
                alpha =0.4, warn_singular=False)
    c=c+1
    
plt.show()

In [None]:
plt.figure(figsize=(15,15))
a = 3
b = 3
c = 1

for feature in Categorical_Variables:
    plt.subplot(a,b,c)
    df.groupby(feature)['PremiumPrice'].median().plot.bar()
    c=c+1
    
plt.show()

According to the bar charts of group means and medians above, variables such as <b>"AnyTransplants", "AnyChronicDiseases", "HistoryOfCancerInFamily" and "NumberOfMajorSurgeries"</b> have a clear impact on premium price, while the other variables do not seem to affect the premium much.

## Creating new variables

As discussed earlier, weight and height don't seem to have a big effect on the premium price on their own. A new variable, BMI, built from these two might have more impact.<br>

It is hard to draw conclusions from the raw BMI value alone. Assigning each person to one of the categories below according to their BMI and analyzing the groups might give better insight.<br>

* BMI less than 18.5 falls within the underweight range.
* BMI 18.5 to <25 falls within the normal weight range.
* BMI 25.0 to <30 falls within the overweight range.
* BMI 30.0 or higher falls within the obesity range.

In [None]:
# BMI = weight / height^2; dividing Height by 100 treats it as centimetres converted to metres.
df['BMI'] = df.Weight / (df.Height / 100) ** 2
df.head()

In [None]:
# Half-open bins [0, 18.5), [18.5, 25), [25, 30), [30, inf) match the categories above,
# so boundary values such as exactly 18.5, 25 or 30 are no longer left unlabelled.
bmi_bins = [0, 18.5, 25, 30, np.inf]
bmi_labels = ['Under Weight', 'Normal', 'Over Weight', 'Obesity']

# astype(str) keeps BMI_Status a plain string column, as in the original labelling.
df['BMI_Status'] = pd.cut(df.BMI, bins=bmi_bins, labels=bmi_labels, right=False).astype(str)

In [None]:
plt.figure(figsize=(9,6))
ax = sns.boxplot(x='BMI_Status', y='PremiumPrice', data=df)

According to the box plot above, people in the <b>obese and overweight</b> categories are likely to have higher premium prices.

## One hot encoding on nominal variables

In [None]:
df_BMI_Status = pd.get_dummies(df.BMI_Status)
df = pd.concat([df,df_BMI_Status], axis=1)
df = df.drop(['BMI_Status','BMI'],axis=1)

# Feature selection techniques

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

X = df.drop('PremiumPrice', axis =1)
y = df['PremiumPrice']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

random_forest = RandomForestRegressor()
random_forest.fit(X_train,y_train)
feature_imp1 = random_forest.feature_importances_
sns.barplot(x=feature_imp1, y=X.columns)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features From Random forest regressor")
plt.show();

In [None]:
import xgboost
from xgboost import XGBRFRegressor

X = df.drop('PremiumPrice', axis =1)
y = df['PremiumPrice']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

# Use a distinct name so the fitted model does not shadow the imported xgboost module.
xgb_rf = XGBRFRegressor()
xgb_rf.fit(X_train,y_train)
feature_imp2 = xgb_rf.feature_importances_
sns.barplot(x=feature_imp2, y=X.columns)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features From XGBoost")
plt.show();

# Regression techniques

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

X = df.drop('PremiumPrice', axis =1)
y = df['PremiumPrice']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

regressor = LinearRegression()
regressor.fit(X_train,y_train)

y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

## Gradient Boosting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

from sklearn.ensemble import GradientBoostingRegressor
regressor = GradientBoostingRegressor(n_estimators= 15)
regressor.fit(X_train,y_train)

y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

## Ridge Regressor

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

from sklearn.linear_model import Ridge
regressor = Ridge()
regressor.fit(X_train,y_train)

y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

## Lasso

In [None]:
from sklearn.linear_model import Lasso

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

# Fit the scaler on the training split only, so test-set statistics do not leak into training.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

regressor = Lasso()
regressor.fit(X_train,y_train)

y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

regressor = RandomForestRegressor()
regressor.fit(X_train,y_train)

y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

## xgboost

In [None]:
import xgboost
from xgboost import XGBRFRegressor


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

regressor = XGBRFRegressor()
regressor.fit(X_train,y_train)

y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

#  Hyperparameter tuning

## Random search

In [None]:
from sklearn.model_selection import KFold, RepeatedKFold, GridSearchCV, cross_validate, train_test_split,RandomizedSearchCV

In [None]:
rf = RandomForestRegressor()

n_estimators = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]
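# 'auto' is no longer accepted by recent scikit-learn releases; for regressors
# it meant "use all features", which is what 1.0 expresses.
max_features = [1.0, 'sqrt']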
max_depth = [5, 10, 15, 20, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid,
                               scoring='neg_mean_squared_error', 
                               n_iter = 10, cv = 5, verbose=2, 
                               random_state=42, n_jobs = 1)

In [None]:
rf_random.fit(X_train,y_train)

In [None]:
rf_random.best_params_
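
The best parameters by themselves do not say how the tuned model does on unseen data. A possible next step is to score `best_estimator_` on the hold-out split. The sketch below is an illustration under assumptions: it runs on synthetic `make_regression` data with a deliberately tiny grid so it finishes quickly, and the names `X_demo`, `y_demo`, `search` and `best_rf` are hypothetical, not part of this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the premium dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# A small search space so the example runs in a few seconds.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [5, 10],
        "min_samples_leaf": [1, 2],
    },
    scoring="neg_mean_squared_error",
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is refitted on the whole training split.
best_rf = search.best_estimator_
holdout_r2 = r2_score(y_te, best_rf.predict(X_te))
print(search.best_params_)
print("hold-out R^2:", holdout_r2)
```

In this notebook the equivalent would be `rf_random.best_estimator_.predict(X_test)` followed by `r2_score(y_test, ...)`, which makes the tuned model comparable with the untuned regressors above.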

# Converting the problem to a Classification problem

Since the regression models performed poorly, I decided to recast the task as a classification problem by binning the premium price into five classes.
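
How the target is binned affects class balance, which in turn affects how informative plain accuracy is. The sketch below contrasts equal-width bins (`pd.cut`) with equal-frequency bins (`pd.qcut`) on synthetic prices. The data is generated for illustration, and `pd.qcut` is shown only as an alternative to consider, not as something this notebook uses.

```python
import numpy as np
import pandas as pd

# Synthetic skewed prices standing in for PremiumPrice (illustrative only).
rng = np.random.default_rng(2)
prices = pd.Series(rng.lognormal(mean=10.0, sigma=0.4, size=1000))
labels = ["Low", "Basic", "Average", "High", "SuperHigh"]

# Equal-width bins: every class spans the same price range, so counts can be very uneven.
equal_width = pd.cut(prices, bins=5, labels=labels)

# Equal-frequency bins: every class holds roughly the same number of rows.
equal_count = pd.qcut(prices, q=5, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With uneven classes, a classifier can reach high accuracy mostly by predicting the largest class, so the confusion matrix is worth reading alongside the accuracy score.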

In [None]:
# Split PremiumPrice into 5 equal-width bins and label them from lowest to highest.
pr_lab = ['Low','Basic','Average','High','SuperHigh']
df['PremiumLabel'] = pd.cut(df['PremiumPrice'], bins=5, labels=pr_lab, precision=0)

df.head()

In [None]:
import category_encoders as ce
import pandas as pd

encoder_PremiumLabel= ce.OrdinalEncoder(cols=['PremiumLabel'],return_df=True,
                           mapping=[{'col':'PremiumLabel',
'mapping':{'Low':0,'Basic':1,'Average':2,'High':3,'SuperHigh':4}}])

df = encoder_PremiumLabel.fit_transform(df)
df.head()

In [None]:
df = df.drop('PremiumPrice', axis = 1)

X = df.drop('PremiumLabel', axis =1)
y = df['PremiumLabel']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

In [None]:
from xgboost import XGBClassifier

classifier = XGBClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))