![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [135]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552
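
The preview shows two inconsistencies that matter later: `region` mixes capitalized and lowercase spellings (`Southeast` vs `southeast`), and some `charges` values carry a `$` prefix. Below is a minimal sketch of normalizing both on a toy frame. The names `raw` and `cleaned` are illustrative and not part of the notebook, and the sketch assumes these are the only irregularities in the two columns.

```python
import pandas as pd

# Toy rows mimicking the inconsistencies visible in the preview (not the real file).
raw = pd.DataFrame({
    "region": ["southwest", "Southeast", "southeast", "northwest"],
    "charges": ["16884.924", "1725.5523", "$4449.462", "$21984.47061"],
})

cleaned = raw.copy()
# Lowercase so that "Southeast" and "southeast" collapse into one category.
cleaned["region"] = cleaned["region"].str.strip().str.lower()
# Remove the "$" prefix and convert to float; unparseable values become NaN.
cleaned["charges"] = pd.to_numeric(
    cleaned["charges"].str.replace("$", "", regex=False), errors="coerce"
)

print(cleaned["region"].unique())
print(cleaned.dtypes)
```

Normalizing `region` before one-hot encoding would keep the encoder from producing separate columns for the same region.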


In [136]:
# Inspect the data quality
#display(insurance.info(), insurance.describe(), insurance.isna().sum())
print(f'percentage of data missing: {(66/insurance.shape[0])*100}%')
# The missing values are client-specific (e.g. sex, smoker) and amount to less than 5%, so I will drop the rows that contain them
insurance.dropna(inplace=True)

# Check the data again
display(insurance.info(), insurance.describe(), insurance.isna().sum())

percentage of data missing: 4.932735426008969%
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1208 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1208 non-null   float64
 1   sex       1208 non-null   object 
 2   bmi       1208 non-null   float64
 3   children  1208 non-null   float64
 4   smoker    1208 non-null   object 
 5   region    1208 non-null   object 
 6   charges   1208 non-null   object 
dtypes: float64(3), object(4)
memory usage: 75.5+ KB


None

Unnamed: 0,age,bmi,children
count,1208.0,1208.0,1208.0
mean,35.35596,30.574971,0.942881
std,22.061241,6.117562,1.311809
min,-64.0,15.96,-4.0
25%,24.75,26.195,0.0
50%,38.0,30.23,1.0
75%,51.0,34.58,2.0
max,64.0,53.13,5.0


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
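
The printed percentage relies on a hard-coded count of 66. The index in the `info()` output runs to 1337 while 1208 rows remain, which suggests that `dropna` removed more rows than that figure implies, since a row is dropped if any of its columns is missing. A small sketch of computing both quantities directly, on a hypothetical toy frame (`toy`, `missing_per_column`, and `rows_dropped_pct` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame: each column has one missing value, but in a different row.
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "sex": ["female", "male", None, "male"],
    "bmi": [27.9, 33.77, 33.0, np.nan],
})

# Share of missing values per column, in percent.
missing_per_column = toy.isna().mean() * 100
# Share of rows that dropna() would remove (rows with at least one missing value).
rows_dropped_pct = toy.isna().any(axis=1).mean() * 100

print(missing_per_column)
print(f"rows removed by dropna: {rows_dropped_pct}%")
```

In this toy example each column is 25% missing, yet 75% of the rows would be dropped, so the per-column figure can understate the data lost.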

In [137]:
# Prepare the data for modeling
# Steps: one-hot encode the region column, map smoker to 1/0, map sex to 1 (female) / 0 (male),
# and convert the charges column to float
dummies = pd.get_dummies(data=insurance['region'])
insurance.loc[:,'smoker'] = insurance['smoker'].apply(lambda x: 1 if x == 'yes' else 0)
insurance.loc[:,'sex'] = insurance['sex'].apply(lambda x: 1 if x == 'female' else 0)
insurance.loc[:,'charges'] = insurance['charges'].str.strip('$').astype(float)
# Fill any remaining NaN values with the column mean (numeric columns only)
insurance.fillna(value=insurance.mean(numeric_only=True), inplace=True)
# I won't rename the sex column for this exercise, but I would recommend doing so
display(insurance.head(2), dummies.head(2), insurance.describe())

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,1,27.9,0.0,1,southwest,16884.924
1,18.0,0,33.77,1.0,0,Southeast,1725.5523


Unnamed: 0,Northeast,Northwest,Southeast,Southwest,northeast,northwest,southeast,southwest
0,0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0,0


Unnamed: 0,age,sex,bmi,children,smoker,charges
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,35.35596,0.396523,30.574971,0.942881,0.205298,13311.273947
std,22.061241,0.489378,6.117562,1.311809,0.404087,12131.029019
min,-64.0,0.0,15.96,-4.0,0.0,1121.8739
25%,24.75,0.0,26.195,0.0,0.0,4750.065725
50%,38.0,0.0,30.23,1.0,0.0,9447.316375
75%,51.0,1.0,34.58,2.0,0.0,16579.959053
max,64.0,1.0,53.13,5.0,1.0,63770.42801
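
The summary above reports negative minimums for `age` (-64) and `children` (-4), which cannot be valid values. Two possible treatments are sketched below on a hypothetical toy frame. Which one is appropriate depends on an assumption about how the negatives arose, and the data alone does not settle that.

```python
import pandas as pd

# Toy rows with the same kind of invalid values (not the real data).
toy = pd.DataFrame({"age": [19.0, -18.0, 50.0], "children": [0.0, -4.0, 3.0]})

# Option A: assume the negatives are sign errors and take absolute values.
fixed = toy.copy()
fixed[["age", "children"]] = fixed[["age", "children"]].abs()

# Option B: assume the negatives are unrecoverable and drop those rows.
valid = toy[(toy["age"] >= 0) & (toy["children"] >= 0)]

print(fixed)
print(valid)
```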


In [138]:
# create train and test data
from sklearn.model_selection import train_test_split as tts

X = insurance.drop(columns=['charges','region'])
X = pd.concat([X,dummies],axis=1)
y = insurance['charges']

X_train, X_test, y_train, y_test = tts(X,y, test_size=0.2, random_state=42)

In [139]:
# Scale numeric features then combine with categorical features
scaler = StandardScaler()
num_cols = ['age','bmi','children']
X_train_scaled = scaler.fit_transform(X_train[num_cols])
X_test_scaled = scaler.transform(X_test[num_cols])

# Convert scaled data back to a df
X_train_df = pd.DataFrame(X_train_scaled, columns=num_cols)
X_test_df = pd.DataFrame(X_test_scaled, columns=num_cols)

# Concatenate data with categorical features
categorical_features = [col for col in X_train.columns if col not in num_cols]
X_train = pd.concat([X_train_df, X_train[categorical_features].reset_index(drop=True)], axis=1)
X_test = pd.concat([X_test_df, X_test[categorical_features].reset_index(drop=True)], axis=1)

# Check the data
display(X_train.head(),X_test.head())

Unnamed: 0,age,bmi,children,sex,smoker,Northeast,Northwest,Southeast,Southwest,northeast,northwest,southeast,southwest
0,0.618753,-0.558004,0.036004,0,0,0,0,0,0,0,0,0,1
1,1.060491,0.255564,1.548171,1,0,0,0,0,0,0,0,0,1
2,1.28136,0.532177,0.036004,0,1,0,0,0,0,0,0,0,1
3,-0.088027,1.941278,0.036004,0,0,0,0,1,0,0,0,0,0
4,0.530405,0.890148,0.036004,1,0,0,0,0,0,0,0,0,1


Unnamed: 0,age,bmi,children,sex,smoker,Northeast,Northwest,Southeast,Southwest,northeast,northwest,southeast,southwest
0,-0.618113,0.844588,-0.720079,1,0,0,0,0,0,0,1,0,0
1,1.060491,0.133529,-0.720079,1,0,0,0,0,0,0,1,0,0
2,-0.485591,-0.476648,-0.720079,0,0,0,0,0,1,0,0,0,0
3,-0.662286,-0.252916,-0.720079,1,0,0,1,0,0,0,0,0,0
4,0.7071,0.999166,0.792087,0,0,0,1,0,0,0,0,0,0


In [140]:
# Select models for training
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# instantiate models
svr_poly = SVR(kernel="poly", C=100, gamma="auto", degree=3, epsilon=0.1, coef0=1)
linreg = LinearRegression()
rfr = RandomForestRegressor()

# fit the models and score the models
# Perform cross-validation with 5 folds
r2_score_svr = cross_val_score(svr_poly, X_train, y_train, cv=5, scoring='r2')
r2_score_linreg = cross_val_score(linreg, X_train, y_train, cv=5, scoring='r2')
r2_score_rfr = cross_val_score(rfr, X_train, y_train, cv=5, scoring='r2')

# Print and save R-squared scores
print("train SVR Poly R-squared: ", r2_score_svr.mean())
print("train Linear Regression R-squared: ", r2_score_linreg.mean())
print("train Random Forest R-squared: ", r2_score_rfr.mean())

# save r2_score
r2_score = r2_score_rfr.mean()

train SVR Poly R-squared:  0.15188400784622708
train Linear Regression R-squared:  0.6822377047193378
train Random Forest R-squared:  0.8366200856870266
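
`RandomForestRegressor()` is created without a `random_state`, so its cross-validated score can change slightly from run to run. The sketch below uses synthetic data from `make_regression` (an assumption purely for illustration) to show that fixing the seed makes the score repeatable. `mean_r2` is a hypothetical helper name.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the insurance features.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
# Fixed, shuffled folds so that every call evaluates on the same splits.
cv = KFold(n_splits=5, shuffle=True, random_state=42)


def mean_r2(seed):
    """Mean 5-fold R-squared of a random forest built with the given seed."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2").mean()


first, second = mean_r2(0), mean_r2(0)
print(first == second)
```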


In [141]:
X

Unnamed: 0,age,sex,bmi,children,smoker,Northeast,Northwest,Southeast,Southwest,northeast,northwest,southeast,southwest
0,19.0,1,27.900,0.0,1,0,0,0,0,0,0,0,1
1,18.0,0,33.770,1.0,0,0,0,1,0,0,0,0,0
2,28.0,0,33.000,3.0,0,0,0,0,0,0,0,1,0
3,33.0,0,22.705,0.0,0,0,0,0,0,0,1,0,0
4,32.0,0,28.880,0.0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50.0,0,30.970,3.0,0,0,1,0,0,0,0,0,0
1334,-18.0,1,31.920,0.0,0,1,0,0,0,0,0,0,0
1335,18.0,1,36.850,0.0,0,0,0,0,0,0,0,1,0
1336,21.0,1,25.800,0.0,0,0,0,0,0,0,0,0,1


In [142]:
# Use the trained model to predict charges for the data in validation_dataset.csv
# Fit the model on the whole dataset
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])
rfr.fit(X, y)

In [143]:
# load the validation data
validation_data = pd.read_csv('validation_dataset.csv')
data = validation_data.copy()
#display(validation_data.head(),validation_data.describe(), validation_data.info(),validation_data.isna().sum())


# Apply the same preprocessing steps to the validation data

# One-hot encoding for region column
dummies = pd.get_dummies(data=data['region'])
data = pd.concat([data, dummies], axis=1)
data.drop('region', axis=1, inplace=True)  

# Boolean conversion for smoker and sex
data.loc[:, 'smoker'] = data['smoker'].apply(lambda x: 1 if x == 'yes' else 0)
data.loc[:, 'sex'] = data['sex'].apply(lambda x: 1 if x == 'female' else 0)

display(data.head())
display(data.describe())

Unnamed: 0,age,sex,bmi,children,smoker,northeast,northwest,southeast,southwest
0,18.0,1,24.09,1.0,0,0,0,1,0
1,39.0,0,26.41,0.0,1,1,0,0,0
2,27.0,0,29.15,0.0,1,0,0,1,0
3,71.0,0,65.502135,13.0,1,0,0,1,0
4,28.0,0,38.06,0.0,0,0,0,1,0


Unnamed: 0,age,sex,bmi,children,smoker,northeast,northwest,southeast,southwest
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,46.82,0.5,39.539907,2.78,0.36,0.22,0.32,0.28,0.18
std,21.681074,0.505076,17.725844,4.026899,0.484873,0.418452,0.471212,0.453557,0.388088
min,18.0,0.0,18.715,0.0,0.0,0.0,0.0,0.0,0.0
25%,28.0,0.0,27.575,0.0,0.0,0.0,0.0,0.0,0.0
50%,44.5,0.5,33.8075,1.0,0.0,0.0,0.0,0.0,0.0
75%,60.75,1.0,40.20875,2.75,1.0,0.0,1.0,1.0,0.0
max,92.0,1.0,89.097296,13.0,1.0,1.0,1.0,1.0,1.0


In [None]:
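# Scale the data
# Note: fit_transform refits the scaler on the validation data, so these features are
# standardized with different means and variances than the training features were.
# scaler.transform would reuse the statistics fitted on the training split.
num_cols = ['age','bmi','children']
scaled_val = scaler.fit_transform(data[num_cols])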
scaled_val = pd.DataFrame(scaled_val, columns=num_cols)

categorical_features = [col for col in data.columns if col not in num_cols]
data = pd.concat([scaled_val, data[categorical_features].reset_index(drop=True)], axis=1)

Unnamed: 0,age,bmi,children,sex,smoker,northeast,northwest,southeast,southwest
0,-1.342765,-0.880452,-0.446515,1,0,0,0,1,0
1,-0.364345,-0.748241,-0.697366,0,1,1,0,0,0
2,-0.923442,-0.592095,-0.697366,0,1,0,0,1,0
3,1.126581,1.479524,2.563699,0,1,0,0,1,0
4,-0.876851,-0.084336,-0.697366,0,0,0,0,1,0
5,1.07999,1.904435,2.061997,1,1,0,0,1,0
6,-0.83026,-0.423412,-0.195664,1,0,0,1,0,0
7,-0.224571,0.101728,-0.446515,1,0,1,0,0,0
8,0.054978,-0.168963,-0.697366,1,0,0,1,0,0
9,0.75385,-0.335082,0.055187,0,0,0,0,1,0
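
Calling `fit_transform` on new data recomputes the mean and standard deviation, so the same raw value maps to a different scaled value than it did during training. A minimal sketch of the difference, using made-up ages (`train_age`, `new_age`, `reused`, and `refit` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and new values for a single feature.
train_age = np.array([[20.0], [40.0], [60.0]])
new_age = np.array([[40.0], [80.0]])

# Fit once on the training data, then reuse those statistics for new data.
scaler = StandardScaler().fit(train_age)
reused = scaler.transform(new_age)

# Refitting on the new data gives the same raw values a different scale.
refit = StandardScaler().fit_transform(new_age)

print(reused.ravel())  # 40 sits at the training mean, 80 lies well above it
print(refit.ravel())   # the two values are simply centered around their own mean
```

A model trained on features scaled with the training statistics expects new inputs on that same scale, which is what `transform` provides.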


In [145]:
# Add the capitalized region columns seen during training, filled with 0
data['Northeast'] = 0
data['Northwest'] = 0
data['Southeast'] = 0
data['Southwest'] = 0

# Ensure the feature columns are in the same order as they were during fit
data = data[X.columns]

# predict the charges
predictions = rfr.predict(data)
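
Adding the missing columns by hand and then reordering can be combined into a single `reindex` call. This sketch uses a hypothetical column list and toy validation frame (`train_columns`, `val`, and `aligned` are illustrative names):

```python
import pandas as pd

# Hypothetical training column order, including a capitalized duplicate.
train_columns = ["age", "sex", "Southeast", "southeast", "southwest"]

# Toy validation frame: different order, one missing column, one extra column.
val = pd.DataFrame({
    "southeast": [1, 0],
    "age": [18.0, 39.0],
    "sex": [1, 0],
    "southwest": [0, 0],
    "northeast": [0, 1],
})

# Reorder to the training layout; columns absent from val are created with 0.
aligned = val.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Note that `reindex` also silently drops columns that are not in the target list (`northeast` here), so it is worth checking for unexpected extras first.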

In [146]:
validation_data['predicted_charges'] = predictions

In [None]:
# This project would be better if I used a pipeline to build the model
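
As a rough illustration of what a pipeline could look like, the sketch below bundles scaling, one-hot encoding, and the random forest into one object. The five training rows are taken from the `insurance.head()` preview with the `$` prefix and capitalization already cleaned. The variable names and the `n_estimators` and `random_state` settings are assumptions for the example, not the notebook's configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Five cleaned rows from the data preview, used as a toy training set.
train = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(train.drop(columns="charges"), train["charges"])

# A new customer from a region that the toy training set never saw.
new_customer = pd.DataFrame({
    "age": [39], "sex": ["male"], "bmi": [26.41],
    "children": [0], "smoker": ["yes"], "region": ["northeast"],
})
pred = model.predict(new_customer)
print(pred)
```

With this design the scaler and encoder are fitted only when `fit` is called, and `predict` applies exactly the same transformations to new data, so the column-alignment and rescaling steps do not have to be repeated by hand.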