# **Medical Insurance Cost Prediction**

*  Problem Statement
*  Dataset description
*  Importing the dependencies
*  Data Cleaning and Preprocessing
*  Feature Engineering
*  Exploratory data analysis (EDA)
*  Modeling
*  Model validation
*  Saving the model for deployment
*  Deployment using Flask



## **Problem Statement**
The goal is to help medical insurance companies build an automated system that predicts the medical insurance cost for a client from a given set of features.

## **Dataset description**

1. **Age**: The age of the individual.
2. **Sex**: The gender of the individual (male or female).
3. **BMI** (Body Mass Index): A measure of body fat based on height and weight.
4. **Children**: The number of children/dependents covered by the insurance.
5. **Smoker**: Whether the individual is a smoker or not (yes/no).
6. **Region**: The region of the individual (e.g., southwest, southeast, northwest, northeast).
7. **Charges**: The medical insurance costs for the individual.



## **Importing the dependencies**

In [39]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error,r2_score
from sklearn.model_selection import cross_val_score

import pickle

### Data Collection & Analysis

In [2]:
# loading the data from csv file to a Pandas DataFrame
df = pd.read_csv('/content/insurance.csv')

In [3]:
# first 5 rows of the dataframe
df.head()

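|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |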


In [4]:
# number of rows and columns
df.shape

(1338, 7)

In [5]:
# getting some information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


## **Data Cleaning and Preprocessing**

In [6]:
# checking for missing values
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [7]:
# summary statistics of the numerical columns
df.describe()

|   | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [8]:
# Check for duplicates
duplicates = df.duplicated()
print("Number of duplicate rows:", duplicates.sum())

# Drop duplicates
df = df.drop_duplicates()

# Print the shape of the cleaned DataFrame
print("Shape of the cleaned DataFrame:", df.shape)

Number of duplicate rows: 1
Shape of the cleaned DataFrame: (1337, 7)
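
Before dropping duplicates it can help to look at the rows involved. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from `insurance.csv`); `keep=False` marks every member of a duplicate group, so both copies become visible.

```python
import pandas as pd

# Hypothetical toy data containing one exact duplicate row
toy = pd.DataFrame({
    'age': [19, 19, 40],
    'bmi': [30.59, 30.59, 25.0],
    'charges': [1639.56, 1639.56, 7000.0],
})

# keep=False flags all copies of a duplicated row, not just the later ones
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence; reset_index gives a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
print("Shape after dropping duplicates:", deduped.shape)
```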


## **Feature Engineering**

In [9]:
# adding a new feature

# work on an explicit copy so the assignment below does not trigger
# pandas' SettingWithCopyWarning
df = df.copy()

# create BMI category
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['bmi_category'] = df['bmi'].apply(categorize_bmi)


In [10]:
df.head()

|   | age | sex | bmi | children | smoker | region | charges | bmi_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | Overweight |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | Obese |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | Obese |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | Normal |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | Overweight |
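
The same BMI categories can also be produced in a vectorized way with `pd.cut`. This is a sketch of an alternative, not the notebook's method; the helper name `add_bmi_category` and the sample values are hypothetical. The bin edges mirror the thresholds used in `categorize_bmi`.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the thresholds in categorize_bmi; right=False makes each
# interval closed on the left, e.g. [18.5, 25), matching the >= / < checks.
BMI_BINS = [-np.inf, 18.5, 25, 30, np.inf]
BMI_LABELS = ['Underweight', 'Normal', 'Overweight', 'Obese']

def add_bmi_category(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a string 'bmi_category' column."""
    out = frame.copy()
    out['bmi_category'] = pd.cut(
        out['bmi'], bins=BMI_BINS, labels=BMI_LABELS, right=False
    ).astype(str)
    return out

# Hypothetical sample values chosen to sit on the category boundaries
sample = pd.DataFrame({'bmi': [18.4, 18.5, 24.9, 25.0, 29.9, 30.0]})
result = add_bmi_category(sample)
print(result)
```

Returning a copy rather than assigning into the input frame also avoids chained-assignment warnings.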


## **Exploratory data analysis (EDA)**

In [11]:
# Create a histogram of client ages using Plotly
fig = px.histogram(df, x='age', range_x=[15, 65], nbins=24, marginal='rug')
fig.update_layout(
    title='Age Distribution',
    xaxis_title='Age',
    yaxis_title='Count',  # px.histogram plots counts by default, not densities
    bargap=0.05,  # Adjust the gap between bars
    plot_bgcolor='rgba(0,0,0,0)',  # Set background color to transparent
    paper_bgcolor='rgba(0,0,0,0)',  # Set paper color to transparent
)
fig.show()

# 📊



*   The mean age of the clients is around 39.
*   The largest share of clients is between 18 and 22 years old, an age range that likely corresponds to students.



In [12]:
# Gender column

sex_distribution = df['sex'].value_counts().reset_index()
sex_distribution.columns = ['Sex', 'Count']

# Plot the interactive countplot with a smaller size
fig = px.bar(sex_distribution, x='Sex', y='Count', color='Sex',
             labels={'Sex': 'Sex', 'Count': 'Count'},
             title='Sex Distribution',
            )
fig.update_layout(
    width=1000,  # Set the width of the plot
    height=650,  # Set the height of the plot
    showlegend=False
)
fig.show()

In [13]:
# Group the data by 'sex' column and calculate the mean of 'charges' for each group
charges_by_sex = df.groupby('sex')['charges'].mean().reset_index()

# Plot the interactive bar plot
fig = px.bar(charges_by_sex, x='sex', y='charges',
             labels={'sex': 'Sex', 'charges': 'Average Charges'},
             title='Average Charges by Gender',
             color='sex',
            )
fig.show()

# 📊

*   The number of clients is nearly the same for both genders.

*   **Males** have slightly higher average charges than **females**.







In [14]:
# bmi distribution

fig = px.histogram(df, x='bmi', nbins=30, marginal='rug')
fig.update_layout(
    title='BMI Distribution',
    xaxis_title='BMI',
    yaxis_title='Count',  # px.histogram plots counts by default, not densities
    bargap=0.05,  # Adjust the gap between bars
)
fig.show()

# 📈
*   BMI is approximately normally distributed, with a mean of about **30**.



In [15]:
# children column

children_distribution = df['children'].value_counts().reset_index()
children_distribution.columns = ['Number of Children', 'Count']

# Plot the interactive pie chart
fig = px.pie(children_distribution, values='Count', names='Number of Children',
             title='Children Distribution')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

In [16]:
# Group the data by 'children' column and calculate the mean of 'charges' for each group
charges_by_children = df.groupby('children')['charges'].mean().reset_index()

# Plot the interactive bar plot
fig = px.bar(charges_by_children, x='children', y='charges',
             labels={'children': 'Number of Children', 'charges': 'Average Charges'},
             title='Average Charges by Number of Children',
             color='charges',
             color_continuous_scale='viridis')
fig.show()

# 📊


*   Average charges rise with the number of children and are highest for clients with **2, 3, and 4** children.



In [17]:
# smoker column

smoker_distribution = df['smoker'].value_counts().reset_index()
smoker_distribution.columns = ['Smoker', 'Count']

# Plot the interactive bar plot
fig = px.bar(smoker_distribution, x='Smoker', y='Count',
             labels={'Smoker': 'Smoker', 'Count': 'Count'},
             title='Smoker Distribution',
             color='Smoker',
             )
fig.update_layout(showlegend=False)
fig.show()

In [18]:
# Group the data by 'smoker' column and calculate the mean of 'charges' for each group
charges_by_smoker = df.groupby('smoker')['charges'].mean().reset_index()

# Plot the interactive bar plot
fig = px.bar(charges_by_smoker, x='smoker', y='charges',
             labels={'smoker': 'Smoker', 'charges': 'Average Charges'},
             title='Average Charges by Smoking Status',
             color='smoker',
             )
fig.show()

# 📈

*   We noticed something interesting: most clients don't smoke **(about 80%)**, but those who do are charged about **4 times as much** on average as those who don't.
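
The "about 4 times" figure is a ratio of group means. A minimal sketch of that calculation, using a hypothetical toy frame rather than the real dataset (the helper name `smoker_charge_ratio` is illustrative), and assuming `smoker` still holds the raw `yes`/`no` labels:

```python
import pandas as pd

def smoker_charge_ratio(frame: pd.DataFrame) -> float:
    """Mean charges of smokers divided by mean charges of non-smokers."""
    means = frame.groupby('smoker')['charges'].mean()
    return means['yes'] / means['no']

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'no', 'yes'],
    'charges': [30000.0, 8000.0, 10000.0, 6000.0, 34000.0],
})

ratio = smoker_charge_ratio(toy)
share = (toy['smoker'] == 'yes').mean()  # fraction of clients who smoke
print(f"smoker share: {share:.0%}, charge ratio: {ratio:.1f}x")
```

Applied to the notebook's `df` before the encoding step, the same function would give the ratio behind the observation above.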

In [19]:
# region column

region_counts = df['region'].value_counts().reset_index()
region_counts.columns = ['Region', 'Count']

# Plot the tree map
fig = px.treemap(region_counts, path=['Region'], values='Count',
                 title='Distribution of Regions')
fig.show()

# 📈

*   **There is no significant difference between regions.**



In [20]:
# bmi_category column

bmi_category_counts = df['bmi_category'].value_counts().reset_index()
bmi_category_counts.columns = ['BMI Category', 'Count']

# Plot the interactive bar plot
fig = px.bar(bmi_category_counts, x='BMI Category', y='Count',
             labels={'BMI Category': 'BMI Category', 'Count': 'Count'},
             title='BMI Category Distribution',
             color='BMI Category',
             )
fig.show()

In [21]:
# distribution of charges value

fig = px.histogram(df, x='charges',
                   title='Charges Distribution',
                   labels={'charges': 'Charges', 'count': 'Density'},
                   marginal='rug')
fig.show()

### Removing outliers from the DataFrame using the IQR

In [22]:
# remove outliers from our dataframe

# Select numerical columns excluding the target column ('charges')
numerical_cols = df.select_dtypes(include='number').columns.tolist()
numerical_cols.remove('charges')
print("Shape before removing outliers:", df.shape)

# Calculate the IQR for each numerical column
q1 = df[numerical_cols].quantile(0.25)
q3 = df[numerical_cols].quantile(0.75)
iqr = q3 - q1

# Define the lower and upper bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Remove outliers from each numerical column
df = df.copy()
for col in numerical_cols:
    df = df[(df[col] >= lower_bound[col]) & (df[col] <= upper_bound[col])]
print("Shape after removing outliers:", df.shape)

Shape before removing outliers: (1337, 8)
Shape after removing outliers: (1328, 8)
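
To make the IQR rule concrete, here is a small worked sketch on a hypothetical series of BMI-like values (the helper name `iqr_bounds` and the numbers are illustrative only). A value is treated as an outlier when it falls below `q1 - 1.5 * iqr` or above `q3 + 1.5 * iqr`.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) outlier bounds based on the interquartile range."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values; 60.0 is deliberately far from the rest
toy_bmi = pd.Series([22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 60.0])

low, high = iqr_bounds(toy_bmi)
kept = toy_bmi[(toy_bmi >= low) & (toy_bmi <= high)]

print("bounds:", low, high)
print("rows kept:", len(kept), "of", len(toy_bmi))
```

Here the quartiles are 25.5 and 32.5, so the bounds are 15.0 and 43.0 and only the value 60.0 is dropped.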


### Encoding the categorical features

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1328 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           1328 non-null   int64  
 1   sex           1328 non-null   object 
 2   bmi           1328 non-null   float64
 3   children      1328 non-null   int64  
 4   smoker        1328 non-null   object 
 5   region        1328 non-null   object 
 6   charges       1328 non-null   float64
 7   bmi_category  1328 non-null   object 
dtypes: float64(2), int64(2), object(4)
memory usage: 93.4+ KB


In [24]:
# encoding sex column by Label encoder
sex_encoder = LabelEncoder()
df['sex'] = sex_encoder.fit_transform(df['sex'])

In [25]:
# encoding 'smoker' column
smoker_encoder = LabelEncoder()
df['smoker'] = smoker_encoder.fit_transform(df['smoker'])

In [26]:
# encoding 'region' column by one hot encoder
region_encoder = OneHotEncoder(handle_unknown='ignore')

enc_data = pd.DataFrame(region_encoder.fit_transform(df[['region']]).toarray(), columns = 'region'+'_'+region_encoder.categories_[0])
df = pd.concat([df.reset_index(drop=True), enc_data.reset_index(drop=True)], axis=1)
df.drop(columns='region', inplace=True)

In [27]:
# encoding 'bmi_category' column by one hot encoder
bmi_category_encoder = OneHotEncoder(handle_unknown='ignore')

enc_data = pd.DataFrame(bmi_category_encoder.fit_transform(df[['bmi_category']]).toarray(), columns = 'bmi_category'+'_'+bmi_category_encoder.categories_[0])
df = pd.concat([df.reset_index(drop=True), enc_data.reset_index(drop=True)], axis=1)
df.drop(columns='bmi_category', inplace=True)

In [28]:
df.shape

(1328, 14)

### Splitting the Features and Target

In [29]:
# Split the data
X = df.drop(columns=['charges'])
y = df['charges']

In [30]:
X.head()

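|   | age | sex | bmi | children | smoker | region_northeast | region_northwest | region_southeast | region_southwest | bmi_category_Normal | bmi_category_Obese | bmi_category_Overweight | bmi_category_Underweight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |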


In [31]:
y.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

### Splitting the data into Training data & Testing Data

In [32]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [33]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1062, 13), (266, 13), (1062,), (266,))

### Model Training

In [34]:
# Train Linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Train Ridge regression model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

# Train Lasso regression model
lasso_model = Lasso(alpha=2.0)
lasso_model.fit(X_train, y_train)

# Train Random Forest regression model
random_F = RandomForestRegressor(n_estimators=150, min_samples_split=25, random_state=0)
random_F.fit(X_train, y_train)

### Model Evaluation

In [35]:
# Predictions
linear_preds = regressor.predict(X_test)
ridge_preds = ridge_model.predict(X_test)
lasso_preds = lasso_model.predict(X_test)
random_forest_preds = random_F.predict(X_test)

# Evaluate models
# Calculate the R^2 score for each model
linear_r2 = r2_score(y_test, linear_preds)
ridge_r2 = r2_score(y_test, ridge_preds)
lasso_r2 = r2_score(y_test, lasso_preds)
random_forest_r2 = r2_score(y_test, random_forest_preds)

print("R^2 score for Linear Regression:", linear_r2)
print("R^2 score for Ridge Regression:", ridge_r2)
print("R^2 score for Lasso Regression:", lasso_r2)
print("R^2 score for Random Forest Regressor:", random_forest_r2)

R^2 score for Linear Regression: 0.7489383821550097
R^2 score for Ridge Regression: 0.7491874224194273
R^2 score for Lasso Regression: 0.7491502382356299
R^2 score for Random Forest Regressor: 0.8323157656994842
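
The notebook imports `mean_absolute_error` but reports only R². As a rough illustration of what the two metrics measure, the sketch below computes both by hand with NumPy on hypothetical toy values (the function names `r2` and `mae` are illustrative, and the numbers are not model output). R² is one minus the ratio of the residual sum of squares to the total sum of squares; MAE is the average absolute error, expressed in the same units as `charges`.

```python
import numpy as np

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

def mae(y_true, y_pred) -> float:
    """Mean absolute error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical toy values for illustration only
y_true_toy = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred_toy = [1100.0, 1900.0, 3200.0, 3800.0]

r2_toy = r2(y_true_toy, y_pred_toy)
mae_toy = mae(y_true_toy, y_pred_toy)
print("R^2:", r2_toy)
print("MAE:", mae_toy)
```

For these toy values the residual sum of squares is 100,000 and the total sum of squares is 5,000,000, giving an R² of 0.98 and an MAE of 150.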


## **Model validation**

In [36]:
scores = cross_val_score(estimator=random_F, X=X_train, y=y_train, cv=10)

In [37]:
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', scores.mean())
print('Standard deviation of cross-validation scores:', scores.std())

Cross-validation scores: [0.88758225 0.84657567 0.85404172 0.87331407 0.92790895 0.89529368
 0.80506849 0.82843036 0.80714236 0.79658617]
Mean cross-validation score: 0.8521943733294094
Standard deviation of cross-validation scores: 0.04151036968045332
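
For a regressor, `cross_val_score` uses the estimator's own `score` method when no `scoring` argument is given, so the values above are R² scores. The sketch below shows how an error metric could be requested instead. It runs on synthetic data from `make_regression` with a `LinearRegression` model as stand-ins (both are assumptions for illustration, not the notebook's data or model).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train / y_train (assumption)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Shuffled folds; passing an integer cv for a regressor gives unshuffled KFold instead
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LinearRegression()

# Default scoring for a regressor is R^2
r2_scores = cross_val_score(model, X_toy, y_toy, cv=cv)

# scikit-learn reports errors as negative scores so that "higher is better"; flip the sign
mae_scores = -cross_val_score(model, X_toy, y_toy, cv=cv, scoring='neg_mean_absolute_error')

print(f"R^2: {r2_scores.mean():.3f} +/- {r2_scores.std():.3f}")
print(f"MAE: {mae_scores.mean():.2f} +/- {mae_scores.std():.2f}")
```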


## **Saving the model for deployment**

In [40]:
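# save the trained model and the fitted encoders (the encoders are needed again at prediction time)
with open('insurance_cost_model.sav', 'wb') as f:
    pickle.dump(random_F, f)

encoders = {'sex':sex_encoder, 'smoker':smoker_encoder, 'region':region_encoder, 'bmi_category':bmi_category_encoder}
with open('insurance_cost_encoders.sav', 'wb') as f:
    pickle.dump(encoders, f)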
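
An alternative to saving the model and encoders separately is to bundle preprocessing and model into one scikit-learn `Pipeline`, so that a single pickled object accepts raw column values. This is only a sketch of that design on hypothetical toy data, not what the notebook does. It one-hot encodes `sex` and `smoker` instead of label-encoding them, uses a small forest for speed, and leaves out `bmi_category`.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ['sex', 'smoker', 'region']
numeric = ['age', 'bmi', 'children']

# One-hot encode the categorical columns; numeric columns pass through unchanged
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough',
)
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Hypothetical toy rows for illustration, not taken from insurance.csv
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female', 'male', 'female'],
    'smoker': ['yes', 'no', 'no', 'no', 'yes', 'no'],
    'region': ['southwest', 'southeast', 'northwest', 'northeast', 'southeast', 'northwest'],
    'age': [19, 18, 28, 45, 52, 33],
    'bmi': [27.9, 33.8, 33.0, 25.1, 30.2, 22.7],
    'children': [0, 1, 3, 2, 0, 1],
    'charges': [16000.0, 1700.0, 4400.0, 8000.0, 40000.0, 5000.0],
})
pipeline.fit(toy[categorical + numeric], toy['charges'])

# Round-trip through pickle: the restored object carries its own encoders
restored = pickle.loads(pickle.dumps(pipeline))

new_client = pd.DataFrame([{
    'sex': 'female', 'smoker': 'no', 'region': 'northeast',
    'age': 40, 'bmi': 26.0, 'children': 2,
}])
prediction = restored.predict(new_client)[0]
print("Predicted charges:", prediction)
```

With this layout the serving code would not need to rebuild the one-hot columns by hand.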

## **Deployment using Flask**

In [44]:
from flask import Flask, render_template, request
import pickle
import pandas as pd

# Look up templates in /content, where index.html is stored
app = Flask(__name__, template_folder='/content')

# Load the trained model and the fitted encoders
with open('/content/insurance_cost_model.sav', 'rb') as f:
    model = pickle.load(f)
with open('/content/insurance_cost_encoders.sav', 'rb') as f:
    encoders = pickle.load(f)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():

    # Get form data; form values arrive as strings, so cast the numeric ones
    data = {}
    data['age'] = int(request.form.get('age'))
    data['sex'] = request.form.get('sex')
    data['bmi'] = float(request.form.get('bmi'))
    data['children'] = int(request.form.get('children'))
    data['smoker'] = request.form.get('smoker')
    data['region'] = request.form.get('region')
    data['bmi_category'] = request.form.get('bmi_category')

    df = pd.DataFrame([data])

    # Label-encode 'sex' and 'smoker' with the encoders fitted during training
    df['sex'] = encoders['sex'].transform(df['sex'])
    df['smoker'] = encoders['smoker'].transform(df['smoker'])

    # One-hot encode 'region' and 'bmi_category' in the same column order as in training
    for col in ['region', 'bmi_category']:
        for i in encoders[col].categories_[0]:
            df[col + '_' + i] = 0.0
        df[col + '_' + data[col]] = 1.0
        df.drop(columns=col, inplace=True)

    # model.predict returns an array; pass the single predicted value to the template
    pred = model.predict(df)
    return render_template('index.html', prediction=pred[0])

if __name__ == "__main__":
    app.run(debug=True)

 * Serving Flask app '__main__'
 * Debug mode: on
 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
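
The model expects the 13 feature columns in the same order as `X.head()` above. The sketch below shows, in plain Python, how one submitted form could be mapped onto that layout. The helper name `build_feature_row` is hypothetical, and the category orders are assumptions read off the `X.head()` output (`LabelEncoder` sorts its classes alphabetically, so `female` maps to 0 and `no` maps to 0).

```python
# Category orders as they appear in the X.head() column names (assumption)
REGIONS = ['northeast', 'northwest', 'southeast', 'southwest']
BMI_CATEGORIES = ['Normal', 'Obese', 'Overweight', 'Underweight']
SEX_CLASSES = ['female', 'male']   # LabelEncoder order: female -> 0, male -> 1
SMOKER_CLASSES = ['no', 'yes']     # LabelEncoder order: no -> 0, yes -> 1

def build_feature_row(form: dict) -> dict:
    """Map raw form strings to the 13 model features, in training column order."""
    row = {
        'age': int(form['age']),
        'sex': SEX_CLASSES.index(form['sex']),
        'bmi': float(form['bmi']),
        'children': int(form['children']),
        'smoker': SMOKER_CLASSES.index(form['smoker']),
    }
    # One-hot columns: exactly one region and one BMI category are set to 1.0
    for region in REGIONS:
        row['region_' + region] = 1.0 if form['region'] == region else 0.0
    for category in BMI_CATEGORIES:
        row['bmi_category_' + category] = 1.0 if form['bmi_category'] == category else 0.0
    return row

# Form values matching the first row of the dataset shown earlier
form = {
    'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0',
    'smoker': 'yes', 'region': 'southwest', 'bmi_category': 'Overweight',
}
row = build_feature_row(form)
print(list(row.items()))
```

Since dictionaries keep insertion order, `pd.DataFrame([row])` would yield the columns in this order. An unknown `sex` or `smoker` value raises `ValueError`, which a real endpoint would need to handle.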
