![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [297]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
print(insurance.head())
print(insurance.shape)

    age     sex     bmi  children smoker     region       charges
0  19.0  female  27.900       0.0    yes  southwest     16884.924
1  18.0    male  33.770       1.0     no  Southeast     1725.5523
2  28.0    male  33.000       3.0     no  southeast     $4449.462
3  33.0    male  22.705       0.0     no  northwest  $21984.47061
4  32.0    male  28.880       0.0     no  northwest    $3866.8552
(1338, 7)


## 1. Data Cleaning
### 1.1 Dealing with missing data
First, check whether any values are missing:

In [298]:
print(insurance.isna().sum().sort_values())

charges     54
age         66
sex         66
bmi         66
children    66
smoker      66
region      66
dtype: int64
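
Raw counts are easier to judge as a share of the rows. A minimal sketch on a made-up toy frame (the values and the `toy` and `missing_pct` names are illustrative, not taken from `insurance.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "smoker": ["yes", "no", None, "no"],
    "charges": ["16884.924", None, "$4449.462", None],
})

# isna() gives a boolean frame; its column means are the missing fractions
missing_pct = toy.isna().mean().mul(100).sort_values()
print(missing_pct)
```

A column with a small missing share is usually a reasonable candidate for imputation, while rows with a missing target are typically dropped instead.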


Impute the missing feature values and drop the rows with a missing target:

In [299]:
# Create two imputers, one per data type
Categorical_imputer = SimpleImputer(strategy='most_frequent')
Numerical_imputer = SimpleImputer(strategy='median')
# Define the column groups before imputing
Num_columns = ['age', 'bmi', 'children']
Cat_columns = ['sex', 'smoker', 'region']
# Impute the missing feature values
insurance[Num_columns] = Numerical_imputer.fit_transform(insurance[Num_columns])
insurance[Cat_columns] = Categorical_imputer.fit_transform(insurance[Cat_columns])
# Strip '$' and ',' from charges so the column can be cast to float
insurance['charges'] = insurance['charges'].replace(r'[\$,]', '', regex=True).astype(float)
# Drop rows with a missing target instead of imputing it, to avoid biasing the model
insurance = insurance.dropna(subset=['charges'])
insurance = insurance.reset_index(drop=True)
# Print the result and confirm that no NaNs remain
print(insurance)
print(insurance.isna().sum().sort_values())

       age     sex     bmi  children smoker     region      charges
0     19.0  female  27.900       0.0    yes  southwest  16884.92400
1     18.0    male  33.770       1.0     no  Southeast   1725.55230
2     28.0    male  33.000       3.0     no  southeast   4449.46200
3     33.0    male  22.705       0.0     no  northwest  21984.47061
4     32.0    male  28.880       0.0     no  northwest   3866.85520
...    ...     ...     ...       ...    ...        ...          ...
1267  50.0    male  30.970       3.0     no  Northwest  10600.54830
1268 -18.0  female  31.920       0.0     no  Northeast   2205.98080
1269  18.0  female  36.850       0.0     no  southeast   1629.83350
1270  21.0  female  25.800       0.0     no  southwest   2007.94500
1271  61.0  female  29.070       0.0    yes  northwest  29141.36030

[1272 rows x 7 columns]
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
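
To make the two strategies concrete, here is a small sketch on made-up values (the `toy` frame and the `age_filled`, `sex_filled`, and `n_negative` names are hypothetical). It also shows that imputation only fills gaps: an implausible value such as the negative age in row 1268 of the output above passes through unchanged and needs a separate check.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values: one missing age, one negative age, one missing label
toy = pd.DataFrame({
    "age": [20.0, np.nan, -18.0, 60.0],
    "sex": pd.Series(["male", "female", np.nan, "female"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # fills with the column median
cat_imputer = SimpleImputer(strategy="most_frequent")  # fills with the most common label

age_filled = num_imputer.fit_transform(toy[["age"]]).ravel()
sex_filled = cat_imputer.fit_transform(toy[["sex"]]).ravel()

# Imputation does not repair invalid values, so count them explicitly
n_negative = int((age_filled < 0).sum())
print(age_filled, sex_filled, n_negative)
```

How to treat such values (taking the absolute value, dropping the row, or setting it to missing before imputing) is a modelling decision that this sketch does not make.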


### 1.2 Dealing with categorical data

In [300]:
# Normalize case so that e.g. 'Southeast' and 'southeast' become one category
insurance['region'] = insurance['region'].str.lower()
insurance['smoker'] = insurance['smoker'].str.lower()
# One-hot encode the categorical columns as numeric indicators
insurance = pd.get_dummies(insurance, columns=['smoker','region'], drop_first=True)
# Handle sex separately: it has several spellings of the same value, so map them explicitly
insurance['sex'] = insurance['sex'].str.lower().map({'male': 1, 'm': 1, 'man': 1, 'female': 0, 'f': 0, 'woman': 0})

insurance.head()

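|   | age | sex | bmi | children | charges | smoker_yes | region_northwest | region_southeast | region_southwest |
|---|-----|-----|-----|----------|---------|------------|------------------|------------------|------------------|
| 0 | 19.0 | 0 | 27.9 | 0.0 | 16884.924 | 1 | 0 | 0 | 1 |
| 1 | 18.0 | 1 | 33.77 | 1.0 | 1725.5523 | 0 | 0 | 1 | 0 |
| 2 | 28.0 | 1 | 33.0 | 3.0 | 4449.462 | 0 | 0 | 1 | 0 |
| 3 | 33.0 | 1 | 22.705 | 0.0 | 21984.47061 | 0 | 1 | 0 | 0 |
| 4 | 32.0 | 1 | 28.88 | 0.0 | 3866.8552 | 0 | 1 | 0 | 0 |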


The raw data was inconsistent in three ways. The `sex` column used several spellings for the same value (for example `male`, `man`, and `m`). The `region` column mixed upper and lower case, so the one-hot encoder would have treated the same region as several categories. Some `charges` values carried a `$` prefix and others did not. I handled each column's problem separately: mapping the `sex` variants to a single code, lower-casing `region` and `smoker` before encoding, and stripping the `$` from `charges` so the column is numeric and ready for modelling.
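
The three fixes can be seen in isolation on a few rows. This is a minimal sketch: the `toy` frame is made up for illustration, and only the mapping dictionary is copied from the cell above.

```python
import pandas as pd

# Illustrative rows showing the three kinds of inconsistency
toy = pd.DataFrame({
    "sex": ["M", "female", "man", "F"],
    "region": ["Southeast", "southeast", "Northwest", "northwest"],
    "charges": ["$1725.5523", "4449.462", "$21984.47061", "3866.8552"],
})

# 1. Map every spelling of sex to a single numeric code
sex_map = {"male": 1, "m": 1, "man": 1, "female": 0, "f": 0, "woman": 0}
toy["sex"] = toy["sex"].str.lower().map(sex_map)

# 2. Lower-case region so 'Southeast' and 'southeast' are one category
toy["region"] = toy["region"].str.lower()

# 3. Strip '$' and ',' and cast charges to float
toy["charges"] = toy["charges"].str.replace(r"[\$,]", "", regex=True).astype(float)

print(toy)
```

Any spelling missing from the mapping becomes `NaN` after `.map`, so it is worth checking `toy["sex"].isna().sum()` after this step.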

### 1.3 Data splitting and scaling

In [301]:
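# Separate the features from the target
X = insurance.drop("charges", axis = 1)
y = insurance["charges"]
# Scale the features
# Note: X_scaled is not used by the model in section 2.1, which is fit on the unscaled X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)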
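
In this notebook the scaler is fit on all rows before the train/test split, and the model in the next section is fit on `X` rather than `X_scaled`. One common alternative, sketched here on synthetic data, is to put the scaler inside a pipeline so that it is re-fit on the training folds only. The data, the alpha grid, and the names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a known linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y_demo = X_demo @ np.array([3.0, 0.5, 0.02]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=39
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.01, 0.1, 1, 10]},  # step name comes from make_pipeline
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

test_score = search.score(X_te, y_te)  # R^2 on the held-out rows
print(search.best_params_, test_score)
```

With this layout, calling `search.predict` on new rows applies the same scaling automatically, so there is no separate scaled copy of the data to keep in sync.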

## 2. Model Creation

### 2.1 Model initialization

In [302]:
# Ridge suits a small feature set: its regularization shrinks the influence of less useful features
ridge = Ridge()
# Search over alpha with 5-fold cross-validation, scored by R^2
model = GridSearchCV(ridge,param_grid = {'alpha': [0.0001,0.001,0.01,0.1,1,10,100,1000,10000]}, cv=5, scoring='r2')
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 39)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
# Store the score under a new name so the imported r2_score function is not overwritten
test_r2 = r2_score(y_test,y_pred)
print(test_r2)
print(model.best_params_)

0.6852741689959847
{'alpha': 1}
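
An R² of about 0.69 means the model accounts for roughly 69% of the variance in `charges` on the held-out rows. The metric is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A small worked example with made-up numbers (`y_true` and `y_hat` are illustrative) shows the arithmetic next to `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

# Residual sum of squares: squared prediction errors
ss_res = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0 + 0.25 + 0 = 0.5
# Total sum of squares: squared deviations from the mean (6.0)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20

r2_manual = 1 - ss_res / ss_tot                 # 1 - 0.5 / 20 = 0.975
print(r2_manual, r2_score(y_true, y_hat))
```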


### 2.2 Predicting the new Charges

Importing and viewing the data

In [303]:
validation_data = pd.read_csv('validation_dataset.csv')
print(validation_data)

     age     sex        bmi  children smoker     region
0   18.0  female  24.090000       1.0     no  southeast
1   39.0    male  26.410000       0.0    yes  northeast
2   27.0    male  29.150000       0.0    yes  southeast
3   71.0    male  65.502135      13.0    yes  southeast
4   28.0    male  38.060000       0.0     no  southeast
5   70.0  female  72.958351      11.0    yes  southeast
6   29.0  female  32.110000       2.0     no  northwest
7   42.0  female  41.325000       1.0     no  northeast
8   48.0  female  36.575000       0.0     no  northwest
9   63.0    male  33.660000       3.0     no  southeast
10  27.0    male  18.905000       3.0     no  northeast
11  51.0  female  36.670000       2.0     no  northwest
12  60.0  female  24.530000       0.0     no  southeast
13  57.0  female  28.700000       0.0     no  southwest
14  20.0  female  28.975000       0.0     no  northwest
15  18.0    male  30.400000       3.0     no  northeast
16  83.0    male  89.097296       9.0     no  no

In [304]:
# Apply the same encoding as the training data so the model receives identical features
validation_data['region'] = validation_data['region'].str.lower()
validation_data['smoker'] = validation_data['smoker'].str.lower()
validation_data = pd.get_dummies(validation_data, columns=['smoker','region'], drop_first=True)
validation_data['sex'] = validation_data['sex'].str.lower().map({'male': 1, 'm': 1, 'man': 1, 'female': 0, 'f': 0, 'woman': 0})
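
Encoding the validation file on its own only yields the training columns if it happens to contain the same categories. A defensive option, sketched here under assumptions, is to reindex the encoded frame to the training column list. The `train_columns` list is copied from the encoded training header shown earlier; `new_rows`, `encoded`, and `aligned` are hypothetical names, and the two rows are illustrative.

```python
import pandas as pd

# Column order of the encoded training features (without the target)
train_columns = ["age", "sex", "bmi", "children", "smoker_yes",
                 "region_northwest", "region_southeast", "region_southwest"]

# Two illustrative rows that cover only one region
new_rows = pd.DataFrame({
    "age": [18.0, 39.0],
    "sex": [0, 1],
    "bmi": [24.09, 26.41],
    "children": [1.0, 0.0],
    "smoker": ["no", "yes"],
    "region": ["southeast", "southeast"],
})

# Encode every category, then keep exactly the training columns in training order;
# indicator columns absent from this batch are created and filled with 0
encoded = pd.get_dummies(new_rows, columns=["smoker", "region"])
aligned = encoded.reindex(columns=train_columns, fill_value=0)

print(aligned)
```

Note that `reindex` silently discards columns for categories the model never saw, so an unexpected label ends up encoded the same way as the dropped baseline category. Checking `set(encoded.columns) - set(train_columns)` before aligning makes such cases visible.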

In [305]:
#Predicting charges
validation_data['predicted_charges'] = model.predict(validation_data)
print(validation_data)

     age  sex        bmi  ...  region_southeast  region_southwest  predicted_charges
0   18.0    0  24.090000  ...                 1                 0        4244.742170
1   39.0    1  26.410000  ...                 0                 0       30511.951539
2   27.0    1  29.150000  ...                 1                 0       29450.053520
3   71.0    1  65.502135  ...                 1                 0       51829.768169
4   28.0    1  38.060000  ...                 1                 0        9400.920067
5   70.0    0  72.958351  ...                 1                 0       53931.873453
6   29.0    0  32.110000  ...                 0                 0        9538.707739
7   42.0    0  41.325000  ...                 0                 0       13553.470057
8   48.0    0  36.575000  ...                 0                 0       12048.929454
9   63.0    1  33.660000  ...                 1                 0       12255.032270
10  27.0    1  18.905000  ...                 0                 0