![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



The dataset needs some cleaning before it is ready for modeling. Once your model is built on the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This new dataset has the same columns as your training data minus the `charges` column, and it puts your model's real-world utility to the test by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

## Import Libraries & Data

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

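|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |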


## Data Exploration

In [2]:
# Checking the columns
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB


In [3]:
# Inspecting the rows that contain at least one missing value
missing = insurance[insurance.isna().any(axis=1)]
missing.head(10)

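|    | age   | sex    | bmi    | children | smoker | region    | charges     |
|----|-------|--------|--------|----------|--------|-----------|-------------|
| 23 | -34.0 | female | NaN    | 1.0      | yes    | NaN       | $37701.8768 |
| 32 | NaN   | NaN    | 28.6   | NaN      | NaN    | Southwest | $nan        |
| 43 | 37.0  | female | NaN    | NaN      | NaN    | southeast | 6313.759    |
| 44 | NaN   | male   | NaN    | NaN      | no     | NaN       | NaN         |
| 49 | NaN   | NaN    | NaN    | 1.0      | NaN    | NaN       | NaN         |
| 51 | 21.0  | NaN    | NaN    | 2.0      | NaN    | Northwest | $3579.8287  |
| 58 | NaN   | NaN    | 22.88  | NaN      | yes    | NaN       | 23244.7902  |
| 63 | NaN   | female | 25.935 | 1.0      | NaN    | Northwest | 4133.64165  |
| 65 | NaN   | NaN    | 28.9   | NaN      | NaN    | Southwest | 1743.214    |
| 76 | NaN   | F      | NaN    | 1.0      | no     | Southeast | $nan        |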


In [4]:
insurance.isna().sum().sort_values()

charges     54
age         66
sex         66
bmi         66
children    66
smoker      66
region      66
dtype: int64

### Dealing with Missing Data

In [5]:
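# Threshold: 5% of the number of rows
threshold = len(insurance) * 0.05
# Columns whose missing-value count is at or below the threshold
cols_few_missing = insurance.columns[insurance.isna().sum() <= threshold]

# Dropping the rows (not the columns) that have a missing value in any of those columns;
# .copy() avoids SettingWithCopyWarning in the column assignments that follow
insurance_clean = insurance.dropna(subset=cols_few_missing).copy()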

In [6]:
# Checking for further missing values
insurance_clean.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
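The 5% rule used above can be easier to follow on a small example. The sketch below uses a hypothetical toy DataFrame (the column names `a` and `b` and all values are made up for illustration): rows are dropped only where a column with few missing values is empty, while a column with many missing values is left alone for a different treatment such as imputation. In the insurance data the threshold is 1338 × 0.05 = 66.9, and every column has at most 66 missing values, so all incomplete rows are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows, 'a' has 1 missing value, 'b' has 10
toy = pd.DataFrame({
    "a": [np.nan] + list(range(39)),
    "b": [np.nan] * 10 + list(range(30)),
})

threshold = len(toy) * 0.05  # 5% of 40 rows = 2.0
na_counts = toy.isna().sum()

# Columns at or below the threshold: dropping their incomplete rows loses little data
sparse_na_cols = na_counts[na_counts <= threshold].index.tolist()
toy_clean = toy.dropna(subset=sparse_na_cols)

print(sparse_na_cols)                   # ['a']
print(len(toy), "->", len(toy_clean))   # 40 -> 39
print(toy_clean["b"].isna().sum())      # 9 missing values remain in 'b'
```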

### Checking and Cleaning Each Column

In [7]:
insurance_clean["age"].dtype

dtype('float64')

In [8]:
# Converting the column to int
insurance_clean["age"] = insurance_clean["age"].astype("int")
insurance_clean["age"].describe()

count    1208.000000
mean       35.355960
std        22.061241
min       -64.000000
25%        24.750000
50%        38.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [9]:
# Keeping only the rows with a positive age
insurance_clean = insurance_clean.loc[insurance_clean["age"] > 0, :]
insurance_clean["age"].describe()

count    1149.000000
mean       39.204526
std        14.163214
min        18.000000
25%        26.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [10]:
# Checking the sex column
insurance_clean["sex"].dtype

dtype('O')

In [11]:
# Stripping surrounding whitespace and checking the categories
insurance_clean["sex"] = insurance_clean["sex"].str.strip()
insurance_clean["sex"].value_counts()

male      471
female    457
M          58
woman      56
man        54
F          53
Name: sex, dtype: int64

In [12]:
# Mapping inconsistent labels to the two expected categories
categories = {"M":"male", "woman":"female", "man":"male", "F":"female"}
insurance_clean["sex"] = insurance_clean["sex"].replace(categories)
insurance_clean["sex"].value_counts(dropna=False)

male      583
female    566
Name: sex, dtype: int64
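When labels are normalized with a mapping like the one above, it can help to confirm that nothing unexpected is left over. The following is a minimal sketch on a hypothetical Series (the sample values and the `expected` set are assumptions for illustration), not part of the original notebook:

```python
import pandas as pd

# Hypothetical raw labels, including stray whitespace
raw = pd.Series([" male", "F", "woman", "man ", "M", "female"])

mapping = {"M": "male", "man": "male", "F": "female", "woman": "female"}
clean = raw.str.strip().replace(mapping)

# Any label outside the expected set signals a variant the mapping missed
expected = {"male", "female"}
unexpected = set(clean.unique()) - expected

print(clean.tolist())  # ['male', 'female', 'female', 'male', 'male', 'female']
print(unexpected)      # set()
```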

In [13]:
# Checking the bmi column
insurance_clean["bmi"].dtype

dtype('float64')

In [14]:
insurance_clean["bmi"].describe()

count    1149.000000
mean       30.592620
std         6.124013
min        15.960000
25%        26.200000
50%        30.300000
75%        34.700000
max        53.130000
Name: bmi, dtype: float64

In [15]:
# Checking the children column
insurance_clean["children"].dtype

dtype('float64')

In [16]:
# Correcting the data type
insurance_clean["children"] = insurance_clean["children"].astype("int")
insurance_clean["children"].dtype

dtype('int64')

In [17]:
insurance_clean["children"].describe()

count    1149.000000
mean        0.947781
std         1.314243
min        -4.000000
25%         0.000000
50%         1.000000
75%         2.000000
max         5.000000
Name: children, dtype: float64

In [18]:
# Clipping negative values to a lower bound of 0
insurance_clean["children"] = insurance_clean["children"].clip(lower=0)
insurance_clean["children"].describe()

count    1149.000000
mean        1.017406
std         1.192183
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max         5.000000
Name: children, dtype: float64

In [19]:
# Checking the smoker column
insurance_clean["smoker"].dtype

dtype('O')

In [20]:
insurance_clean["smoker"].value_counts(dropna=False)

no     912
yes    237
Name: smoker, dtype: int64

In [21]:
# Checking the region column
insurance_clean["region"].dtype

dtype('O')

In [22]:
insurance_clean["region"].value_counts(dropna=False)

Southeast    160
southeast    148
southwest    147
Northeast    142
northwest    141
Northwest    141
northeast    138
Southwest    132
Name: region, dtype: int64

In [23]:
insurance_clean["region"] = insurance_clean["region"].str.strip().str.lower()
insurance_clean["region"].value_counts()

southeast    308
northwest    282
northeast    280
southwest    279
Name: region, dtype: int64

In [24]:
# Checking the charges column
insurance_clean["charges"].dtype

dtype('O')

In [25]:
insurance_clean["charges"].sum()

'16884.9241725.5523$4449.462$21984.47061$3866.85528240.58967281.5056$6406.410728923.13692$2721.320827808.72511826.84311090.71781837.23710797.336210602.38536837.46713228.846954149.7361137.011$6203.9017514001.1338$14451.8351512268.632252775.1921538711.035585.5762198.18985$13770.097951194.55914$1625.4337515612.193352302.339774.276348173.3613046.0624949.75876272.477220630.283513393.356353556.922312629.89672211.13075$23568.272$37742.57578059.679147496.4944513607.3687534303.16725989.523658606.21744504.662430166.6181714711.743814235.0726389.37785$5920.104117663.1442$16577.77956799.45811741.72611946.62597726.85411356.66091532.46974441.213157935.2911537165.163811033.6617$39836.51921098.5540543578.939411073.1768026.666611082.5772$2026.974110942.1320530184.936747291.055$3766.883812105.3210226.284222412.648515820.6996186.12721344.846730942.19185003.85317560.379752331.5193877.304252867.1196$47055.532110825.253711881.3584646.7592404.733811488.3169511381.325419107.77968601.32936686.4313$7740.3371705.

In [26]:
# Removing the dollar sign and converting the column to float
insurance_clean["charges"] = insurance_clean["charges"].str.strip("$")
insurance_clean["charges"] = insurance_clean["charges"].astype("float")
assert insurance_clean["charges"].dtype == "float"

In [27]:
insurance_clean["charges"].dtype

dtype('float64')

In [28]:
insurance_clean["charges"].describe()

count     1149.000000
mean     13331.073243
std      12171.162115
min       1121.873900
25%       4746.344000
50%       9541.695550
75%      16577.779500
max      63770.428010
Name: charges, dtype: float64
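A slightly more defensive way to parse a currency column is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN` instead of raising. The sketch below runs on a hypothetical Series whose values imitate the formats seen in `charges`; it is an alternative illustration, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical values imitating the formats seen in the charges column
raw_charges = pd.Series(["16884.924", "$4449.462", "$nan", None, " $3866.8552 "])

parsed = pd.to_numeric(
    raw_charges.str.strip().str.replace("$", "", regex=False),
    errors="coerce",  # unparseable entries become NaN rather than raising
)

print(parsed.dtype)         # float64
print(parsed.isna().sum())  # 2 ("$nan" and the missing entry)
```

After coercion, the remaining `NaN` values can be counted and then dropped or imputed explicitly.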

In [29]:
# Saving a CSV copy of the cleaned data
insurance_clean.to_csv("insurance_clean.csv")

In [30]:
insurance_clean.describe()

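|       | age       | bmi      | children | charges      |
|-------|-----------|----------|----------|--------------|
| count | 1149.0    | 1149.0   | 1149.0   | 1149.0       |
| mean  | 39.204526 | 30.59262 | 1.017406 | 13331.073243 |
| std   | 14.163214 | 6.124013 | 1.192183 | 12171.162115 |
| min   | 18.0      | 15.96    | 0.0      | 1121.8739    |
| 25%   | 26.0      | 26.2     | 0.0      | 4746.344     |
| 50%   | 39.0      | 30.3     | 1.0      | 9541.69555   |
| 75%   | 51.0      | 34.7     | 2.0      | 16577.7795   |
| max   | 64.0      | 53.13    | 5.0      | 63770.42801  |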


## Model development and training

In [31]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split, KFold


# Separating features and target
X = insurance_clean.drop("charges", axis=1)
y = insurance_clean["charges"]

# Identify categorical and numerical columns
categorical_cols = ['sex', 'smoker','region']  
numerical_cols = ["age", "bmi", "children"]  

# Creating the preprocessor
preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), numerical_cols), 
                  ("cat", OneHotEncoder(drop="first"), categorical_cols)])


In [32]:
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Create a pipeline that includes both the preprocessor and your model
model = Pipeline(steps=[("preprocessor", preprocessor), 
                        ("regressor", LinearRegression())])

# Performing 5-fold cross-validation on the training set
kf = KFold(n_splits=5, shuffle=True, random_state=15)
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring="r2")

In [33]:
# Print the cross-validation scores
print("Cross-validation R2 scores:", cv_scores)
print("Average R2 score:", cv_scores.mean())

Cross-validation R2 scores: [0.79270508 0.65538464 0.79323295 0.79440822 0.70318542]
Average R2 score: 0.7477832619660536


In [34]:
r2_score = cv_scores.mean()

In [35]:

# Fitting the model
model.fit(X_train, y_train)

# Predicting on the held-out test set
y_pred = model.predict(X_test)

# Evaluating the model's performance (R2) on the test set
test_r2_score = model.score(X_test, y_test)
print("Test R2 score:", test_r2_score)

Test R2 score: 0.6816623748250806
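The test R2 (about 0.68) is somewhat lower than the cross-validation average (about 0.75), and R2 alone does not say how large the errors are in currency units. One way to complement it is to report MAE and RMSE as well. The sketch below does this on synthetic stand-in data (all names and numbers are hypothetical, not the insurance dataset); with the real pipeline, `y_test` and `y_pred` would take the place of `y_te` and `y_hat`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): a linear signal plus noise
rng = np.random.default_rng(15)
X_demo = rng.normal(size=(300, 3))
y_demo = 13000 + X_demo @ np.array([3500.0, 2000.0, 500.0]) + rng.normal(scale=4000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=15)
demo_model = LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

r2 = r2_score(y_te, y_hat)
mae = mean_absolute_error(y_te, y_hat)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_te, y_hat))   # penalizes large errors more

print(f"R2: {r2:.3f}  MAE: {mae:.1f}  RMSE: {rmse:.1f}")
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few predictions are far off.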


## Making predictions for new data

In [36]:
# Loading the Data

validation_data = pd.read_csv("validation_dataset.csv")
validation_data.head()

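|   | age  | sex    | bmi       | children | smoker | region    |
|---|------|--------|-----------|----------|--------|-----------|
| 0 | 18.0 | female | 24.09     | 1.0      | no     | southeast |
| 1 | 39.0 | male   | 26.41     | 0.0      | yes    | northeast |
| 2 | 27.0 | male   | 29.15     | 0.0      | yes    | southeast |
| 3 | 71.0 | male   | 65.502135 | 13.0     | yes    | southeast |
| 4 | 28.0 | male   | 38.06     | 0.0      | no     | southeast |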


In [37]:
validation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       50 non-null     float64
 1   sex       50 non-null     object 
 2   bmi       50 non-null     float64
 3   children  50 non-null     float64
 4   smoker    50 non-null     object 
 5   region    50 non-null     object 
dtypes: float64(3), object(3)
memory usage: 2.5+ KB


In [38]:
validation_data.describe()

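|       | age       | bmi       | children |
|-------|-----------|-----------|----------|
| count | 50.0      | 50.0      | 50.0     |
| mean  | 46.82     | 39.539907 | 2.78     |
| std   | 21.681074 | 17.725844 | 4.026899 |
| min   | 18.0      | 18.715    | 0.0      |
| 25%   | 28.0      | 27.575    | 0.0      |
| 50%   | 44.5      | 33.8075   | 1.0      |
| 75%   | 60.75     | 40.20875  | 2.75     |
| max   | 92.0      | 89.097296 | 13.0     |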
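The validation maxima (age 92, BMI 89.1, 13 children) lie well outside the ranges seen in the cleaned training data (age 18 to 64, BMI 15.96 to 53.13, children 0 to 5), so a linear model has to extrapolate for those rows. A minimal sketch for flagging such rows is shown below; it uses the first five validation rows printed earlier, and the `train_ranges` dictionary is a hand-written assumption copied from the training summaries rather than something computed in the notebook.

```python
import pandas as pd

# Ranges taken from the cleaned training data summaries (typed in by hand here)
train_ranges = {"age": (18, 64), "bmi": (15.96, 53.13), "children": (0, 5)}

# Numeric columns of the first five validation rows shown earlier
val = pd.DataFrame({
    "age": [18, 39, 27, 71, 28],
    "bmi": [24.09, 26.41, 29.15, 65.502135, 38.06],
    "children": [1, 0, 0, 13, 0],
})

# True where a value falls outside the training range (bounds are inclusive)
out_of_range = pd.DataFrame({
    col: ~val[col].between(lo, hi) for col, (lo, hi) in train_ranges.items()
})
flagged = val[out_of_range.any(axis=1)]

print(flagged.index.tolist())  # [3]
```

Predictions for flagged rows deserve extra caution, since they rest on a relationship the model never observed at those values.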


## Data Cleaning

In [39]:
validation_data["age"] = validation_data["age"].astype("int")
validation_data["children"] = validation_data["children"].astype("int")
validation_data.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
dtype: object

In [40]:
validation_data["sex"].value_counts()

female    25
male      25
Name: sex, dtype: int64

In [41]:
validation_data["smoker"].value_counts()

no     32
yes    18
Name: smoker, dtype: int64

In [42]:
validation_data["region"].value_counts()

northwest    16
southeast    14
northeast    11
southwest     9
Name: region, dtype: int64

In [43]:
validation_data.describe()

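|       | age       | bmi       | children |
|-------|-----------|-----------|----------|
| count | 50.0      | 50.0      | 50.0     |
| mean  | 46.82     | 39.539907 | 2.78     |
| std   | 21.681074 | 17.725844 | 4.026899 |
| min   | 18.0      | 18.715    | 0.0      |
| 25%   | 28.0      | 27.575    | 0.0      |
| 50%   | 44.5      | 33.8075   | 1.0      |
| 75%   | 60.75     | 40.20875  | 2.75     |
| max   | 92.0      | 89.097296 | 13.0     |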


In [44]:
validation_data["predicted_charges"] = model.predict(validation_data)
validation_data["predicted_charges"].describe()

count       50.000000
mean     22736.179180
std      19974.887155
min       -318.026961
25%       7433.353880
50%      13358.676645
75%      32323.746332
max      68043.911924
Name: predicted_charges, dtype: float64

In [45]:
# Linear regression can return negative charges; raising predictions below 1000 to a floor of 1000
validation_data["predicted_charges"] = validation_data["predicted_charges"].clip(lower=1000)
validation_data["predicted_charges"].describe()

count       50.000000
mean     22800.987604
std      19900.750612
min       1000.000000
25%       7433.353880
50%      13358.676645
75%      32323.746332
max      68043.911924
Name: predicted_charges, dtype: float64
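Clipping fixes the negative prediction after the fact. An alternative that could be explored is to fit the regression on the logarithm of the target, so that back-transformed predictions are positive by construction. The sketch below shows this with scikit-learn's `TransformedTargetRegressor` on synthetic data (all values are hypothetical); it is a different technique from the clipping used above and was not evaluated on the insurance data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target (hypothetical stand-in for charges)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(8 + X_demo @ np.array([0.5, 0.2, -0.3]) + rng.normal(scale=0.1, size=200))

# Fit on log(y); predictions are mapped back with exp, so they are always > 0
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
log_model.fit(X_demo, y_demo)

# Deliberately extreme inputs to mimic out-of-range customers
X_new = rng.normal(size=(5, 3)) * 3
preds = log_model.predict(X_new)

print((preds > 0).all())  # True
```

In the notebook's setup, the `LinearRegression()` step of the pipeline could be wrapped the same way, though a log target changes what the model optimizes and its fit would need to be re-checked with cross-validation.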