# Medical Insurance Project

## Problem Statement
Health insurance covers medical expenses that arise from illness, such as hospitalisation costs, medicines, and doctor consultation fees. Its main purpose is to give access to good medical care without straining the policyholder's finances. Health insurance plans protect against high medical costs and typically cover hospitalisation expenses, day care procedures, domiciliary expenses, and ambulance charges, among others. In this project, the medical insurance charges are predicted from input features such as age, BMI, number of dependents, smoking status, and region.

#### Columns
 - age: age of primary beneficiary
 - sex: insurance contractor gender, female, male
 - bmi: Body mass index, an objective index of body weight relative to height (kg/m^2), computed as weight divided by height squared. It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.
 - children: Number of children covered by health insurance / Number of dependents
 - smoker: whether the beneficiary smokes
 - region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest.
 - charges: Individual medical costs billed by health insurance
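
The `bmi` definition above can be made concrete with a minimal sketch. The function name `body_mass_index` and the sample weight and height are illustrative assumptions, not values from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI as weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 2))
```

A result between 18.5 and 24.9 falls in the range the column description calls ideal.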
 
 
 Links - https://github.com/dsrscientist/dataset4/blob/main/medical_cost_insurance.csv

### Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings('ignore')

## READING DATASET

In [None]:
df = pd.read_csv('Medical Insurance Project.csv')  
df

### Let's do some non-graphical analysis first to understand the dataset

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

## Some Observations Based On Above Description


### Let's check the value counts for each column

In [None]:
print(df['age'].value_counts())


In [None]:
print(df['sex'].value_counts())

## The 'sex' column is balanced in this case

In [None]:
print(df['bmi'].value_counts())

In [None]:
print(df['children'].value_counts())

In [None]:
print(df['smoker'].value_counts())

In [None]:
print(df['region'].value_counts())

In [None]:
print(df['charges'].value_counts())

## The target variable is continuous, so this is a regression problem

In [None]:
df.isna().sum()  # count of missing values in each column

### No null values

### Let's visualise the relationship between the target variable and the continuous features

In [None]:
sns.regplot(x = 'charges', y = 'age', data = df)  # regplot draws a scatter plot with a fitted linear regression line

plt.show()

In [None]:
sns.regplot(x = 'charges', y = 'bmi', data = df)  

plt.show()

In [None]:
sns.regplot(x = 'charges', y = 'children', data = df)  

plt.show()

In [None]:
# Checking for skewness (numeric columns only; the object columns are not encoded yet)

df.skew(numeric_only=True)

### From the above output we can see that skewness is present in the "charges" column, but we do not remove skewness from the target variable, so we keep it as it is.
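
To make the skewness figure less abstract, here is a minimal sketch of how it is computed and how a log transform would change it. The sample values are made up, and the helper `sample_skewness` is a plain moment-based estimate; pandas applies a bias correction, so its numbers differ slightly. The notebook itself keeps the target untransformed.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed charges (not taken from the dataset)
charges = [1000, 2000, 3000, 4000, 5000, 12000, 40000]
log_charges = [math.log1p(c) for c in charges]

skew_raw = sample_skewness(charges)
skew_log = sample_skewness(log_charges)
print(f"raw: {skew_raw:.2f}, log1p: {skew_log:.2f}")
```

A positive value means a long right tail; the log-transformed values are closer to symmetric.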

## Checking outliers using boxplot

### We don't check outliers for the target variable or for categorical data, so we drop those columns first

In [None]:
# keep only the numeric feature columns (age, bmi, children)
df3 = df.drop(columns = ['charges', 'sex', 'smoker', 'region'])

In [None]:
# boxplot

plt.figure(figsize=(5,6))     # figure width 5 and height 6, in inches

ax = sns.boxplot(data=df3)   

plt.yticks(range(5,61,5))     # y-axis ticks from 5 to 60 in steps of 5

plt.xlabel('Checking For Outliers')

plt.show()

### We can see that outliers are present in the 'bmi' column
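
The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default. The sketch below applies that rule directly; the helper name `iqr_outliers` and the `bmi` values are illustrative assumptions, not taken from the dataset.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up bmi values for illustration only
bmi_values = [18.5, 22.0, 24.3, 26.1, 27.9, 29.4, 30.8, 32.5, 34.0, 52.6]
print(iqr_outliers(bmi_values))
```

A fixed cut-off such as `bmi < 45` is a simpler alternative to computing the bounds from the data.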

### Removing outliers from 'bmi' column

In [None]:
df = df[df['bmi'] < 45].reset_index(drop=True)  # keep only rows with bmi below 45
df.shape

In [None]:
df.head()

### Plotting Heatmap (Correlation matrix)
### Let's plot a heatmap to visualise the correlation coefficients and check for multicollinearity

In [None]:
df_corr = df.corr(numeric_only=True).abs()  # absolute pairwise correlation coefficients between the numeric columns

plt.figure(figsize = (5, 4))

sns.heatmap(df_corr, annot = True, annot_kws = {'size' : 10})

plt.show()

## From the above heatmap we can see that no pair of features is strongly correlated, so there is no sign of multicollinearity
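
A heatmap only shows pairwise correlations, so a feature that depends on several others at once can go unnoticed. The variance inflation factor (VIF) checks for that. The sketch below is a numpy-only illustration on synthetic data; it reads the VIFs off the diagonal of the inverse correlation matrix, and the helper name `vif_from_correlation` is an assumption for this example. The `variance_inflation_factor` function imported at the top of the notebook is the usual library route.

```python
import numpy as np


def vif_from_correlation(features: np.ndarray) -> np.ndarray:
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic data: x3 is almost a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vif = vif_from_correlation(np.column_stack([x1, x2, x3]))
print(np.round(vif, 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of multicollinearity, while values near 1 indicate none.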

### Before building a model we have to encode the columns that have the 'object' datatype

In [None]:
from sklearn.preprocessing import LabelEncoder
lab_enc = LabelEncoder()

In [None]:
df2 = lab_enc.fit_transform(df['sex'])

pd.Series(df2)

In [None]:
df['sex'] = df2

df

In [None]:
df2 = lab_enc.fit_transform(df['smoker'])

pd.Series(df2)

In [None]:
df['smoker'] = df2

df

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
ord_enc = OrdinalEncoder(categories = [['southeast', 'southwest', 'northeast', 'northwest']])

Encoded_df = ord_enc.fit_transform(df[['region']])

Encoded_df

In [None]:
df['region'] = Encoded_df

df
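
Ordinal encoding assigns `region` the values 0 to 3, which a linear model reads as an ordering even though the regions have none. One alternative, shown here only as a sketch on a tiny made-up frame, is one-hot encoding with `pd.get_dummies`; the `sample` frame is an assumption for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same 'region' labels as the dataset
sample = pd.DataFrame({
    "age": [19, 33, 45, 61],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# drop_first=True drops one category to avoid a redundant column
encoded = pd.get_dummies(sample, columns=["region"], drop_first=True, dtype=int)
print(encoded)
```

If this encoding were used, every row passed to `predict` would need the same dummy columns in the same order.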

### After encoding, we check the correlation again using a heatmap

In [None]:
df_corr = df.corr().abs()  # absolute pairwise correlation coefficients between all columns

plt.figure(figsize = (5, 4))

sns.heatmap(df_corr, annot = True, annot_kws = {'size' : 10})

plt.show()

### We can clearly see that there are no highly correlated features in our dataset

## Now our data is ready to build a model

## Model Building

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge




## Linear Regression

In [None]:
# Divide data into features and labels

y = df['charges']

x = df.drop(columns = ['charges'] )

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state=384)

In [None]:
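scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit the scaler on the training data only
x_test = scaler.transform(x_test)        # reuse the training statistics for the test data

In [None]:
from sklearn.linear_model import LinearRegression

# fit on the scaled training data so the model matches the scaled inputs used for prediction
regression = LinearRegression()

regression.fit(x_train, y_train) 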

In [None]:
from sklearn.metrics import mean_absolute_percentage_error as mape
    
y_pred = regression.predict(x_train)
print('Training Error : ', mape(y_train, y_pred))
  
y_pred = regression.predict(x_test)
print('Validation Error : ', mape(y_test, y_pred))
print()
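
The error reported above is the mean absolute percentage error. scikit-learn returns it as a fraction, so 0.25 means 25%. A minimal sketch of the calculation, with made-up numbers and a hypothetical helper name `mape_fraction`:

```python
def mape_fraction(y_true, y_pred):
    """Mean of |actual - predicted| / |actual|, returned as a fraction."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up charges and predictions
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
error = mape_fraction(actual, predicted)
print(f"{error:.4f} as a fraction, {error * 100:.2f}%")
```

Because each error is divided by the actual value, small charges that are predicted badly weigh heavily in this metric.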

### Predict the charges from given features

In [None]:
df.tail(2)

In [None]:
print('Insurance cost : ',regression.predict(scaler.transform([[61, 0, 29.07, 0, 1, 3.0]])))
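
Prediction only makes sense when the model was trained on data scaled the same way as the input row. One way to keep the two steps together is a scikit-learn pipeline. The sketch below uses synthetic data rather than the insurance dataset, and `X_demo`, `y_demo`, and `model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noise-free data: y = 3*x0 + 2*x1 + 5
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 5

# The pipeline fits the scaler and the model together on the same data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_demo, y_demo)

# Raw, unscaled inputs go in; scaling happens inside the pipeline
prediction = model.predict([[1.0, 1.0]])[0]
print(round(prediction, 3))
```

With a pipeline there is no separate `scaler.transform` call to forget or to apply in the wrong order.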

### Let's check how well the model fits the training data / how well the model has learned

In [None]:
# R2 score on the training data (score() returns R2, not adjusted R2)

regression.score(x_train,y_train)
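
`score()` returns the plain R2. If an adjusted R2 is wanted, it can be derived from R2, the number of rows, and the number of features. A minimal sketch with hypothetical numbers; the helper name `adjusted_r2` is an assumption:

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers: R2 of 0.75 from 1000 rows and 6 features
print(round(adjusted_r2(0.75, 1000, 6), 4))
```

In the notebook this could be called as `adjusted_r2(regression.score(x_train, y_train), *x_train.shape)`.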

### Let's plot and visualize

In [None]:
y_pred = regression.predict(x_test)

y_pred

In [None]:
plt.scatter(y_test,y_pred)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Actual VS Model Predicted')
plt.show()

## Random Forest Regressor

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25, random_state = 41)

# re-fit the scaler on the new training split
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
from sklearn.ensemble import RandomForestRegressor

# fit on the insurance training data, not on synthetic make_regression data
rfr = RandomForestRegressor(max_depth=6)
rfr.fit(x_train, y_train)

In [None]:
df.tail(2)

In [None]:
print(rfr.predict(scaler.transform([[61, 0, 29.07, 0, 1, 3.0]])))


In [None]:
RM = RandomForestRegressor()


RM.fit(x_train, y_train) 
y_pred = RM.predict(x_train)
print('Training Error : ', mape(y_train, y_pred))
  
y_pred = RM.predict(x_test)
print('Validation Error : ', mape(y_test, y_pred))
print()

### Predict the charges from given features

In [None]:
print('Insurance cost : ',RM.predict(scaler.transform([[61, 0, 29.07, 0, 1, 3.0]])))

### Let's check how well the model fits the training data / how well the model has learned

In [None]:
# R2 score on the training data (score() returns R2, not adjusted R2)

RM.score(x_train,y_train)

## Lasso

In [None]:
from sklearn.linear_model import Lasso, Ridge

In [None]:
l = Lasso()
l.fit(x_train, y_train) 


In [None]:
l.fit(x_train, y_train) 
y_pred = l.predict(x_train)
print('Training Error : ', mape(y_train, y_pred))
  
y_pred = l.predict(x_test)
print('Validation Error : ', mape(y_test, y_pred))
print()

### Predict the charges from given features

In [None]:
print('Insurance cost : ',l.predict(scaler.transform([[61, 0, 29.07, 0, 1, 3.0]])))

### Let's check how well the model fits the training data / how well the model has learned

In [None]:
# R2 score on the training data (score() returns R2, not adjusted R2)

l.score(x_train,y_train)

## Ridge

In [None]:
r = Ridge()
r.fit(x_train, y_train) 

In [None]:
r.fit(x_train, y_train) 
y_pred = r.predict(x_train)
print('Training Error : ', mape(y_train, y_pred))
  
y_pred = r.predict(x_test)
print('Validation Error : ', mape(y_test, y_pred))
print()

### Predict the charges from given features

In [None]:
print('Insurance cost : ',r.predict(scaler.transform([[61, 0, 29.07, 0, 1, 3.0]])))

### Let's check how well the model fits the training data / how well the model has learned

In [None]:
# R2 score on the training data (score() returns R2, not adjusted R2)

r.score(x_train,y_train)

## AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
ADB = AdaBoostRegressor()
ADB.fit(x_train, y_train) 

In [None]:
ADB.fit(x_train, y_train) 
y_pred = ADB.predict(x_train)
print('Training Error : ', mape(y_train, y_pred))
  
y_pred = ADB.predict(x_test)
print('Validation Error : ', mape(y_test, y_pred))
print()

### Predict the charges from given features

In [None]:
print('Insurance cost : ',ADB.predict(scaler.transform([[61, 0, 29.07, 0, 1, 3.0]])))

### Let's check how well the model fits the training data / how well the model has learned

In [None]:
# R2 score on the training data (score() returns R2, not adjusted R2)

ADB.score(x_train,y_train)

## Conclusion

### Random Forest gives the lowest mean absolute percentage error of the models tried, so it is the best choice here for predicting medical insurance charges
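
A comparison like this is easiest to trust when every model is fit and scored on the same split. The sketch below shows one way to do that in a loop. It runs on synthetic data, not the insurance dataset, so its ranking says nothing about the results above; `X_demo`, `y_demo`, and the other names are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the insurance data (strictly positive targets)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 6))
y_demo = 1000 + 5000 * X_demo[:, 0] + 8000 * X_demo[:, 4] + rng.normal(0, 100, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}

# Fit every model on the same split and record its validation MAPE
results = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    results[name] = mean_absolute_percentage_error(y_te, estimator.predict(X_te))

for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.4f}")
```

Sorting by validation error rather than training error avoids rewarding a model that simply memorises the training rows.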