![](image.jpg)


This project combines healthcare insights with predictive analytics. As a Data Scientist at a leading health insurance company, you will use machine learning to predict customer healthcare costs. Your predictions will help the company tailor its services and help customers plan their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |


In [3]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder


In [38]:
# Loading the insurance dataset
insurance = pd.read_csv('data/insurance.csv')
insurance

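|      | age   | sex    | bmi    | children | smoker | region    | charges      |
|------|-------|--------|--------|----------|--------|-----------|--------------|
| 0    | 19.0  | female | 27.900 | 0.0      | yes    | southwest | 16884.924    |
| 1    | 18.0  | male   | 33.770 | 1.0      | no     | Southeast | 1725.5523    |
| 2    | 28.0  | male   | 33.000 | 3.0      | no     | southeast | $4449.462    |
| 3    | 33.0  | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4    | 32.0  | male   | 28.880 | 0.0      | no     | northwest | $3866.8552   |
| ...  | ...   | ...    | ...    | ...      | ...    | ...       | ...          |
| 1333 | 50.0  | male   | 30.970 | 3.0      | no     | Northwest | $10600.5483  |
| 1334 | -18.0 | female | 31.920 | 0.0      | no     | Northeast | 2205.9808    |
| 1335 | 18.0  | female | 36.850 | 0.0      | no     | southeast | $1629.8335   |
| 1336 | 21.0  | female | 25.800 | 0.0      | no     | southwest | 2007.945     |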


In [40]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB


## Data Cleaning

In [29]:
# Check for missing values
insurance.isna().sum()

age         66
sex         66
bmi         66
children    66
smoker      66
region      66
charges     54
dtype: int64

In [30]:
# Check missing data percentage
(insurance.isna().sum() / len(insurance)) * 100

age         4.932735
sex         4.932735
bmi         4.932735
children    4.932735
smoker      4.932735
region      4.932735
charges     4.035874
dtype: float64
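The same percentages can be computed in one step, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a small made-up frame (`toy`, with illustrative values only) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the insurance data (values are illustrative only)
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "charges": ["16884.924", "$1725.5523", None, "$21984.47061"],
})

# isna() yields a boolean mask; its column-wise mean is the share of missing values
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

This is equivalent to dividing `isna().sum()` by `len(...)`, just shorter.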

In [31]:
# Drop all missing data
insurance.dropna(inplace=True)

# Check missing data again
insurance.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [47]:
# Fix negative ages and children (assumes the minus sign is a data-entry error)
insurance['age'] = insurance['age'].abs()
insurance['children'] = insurance['children'].abs()
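Before flipping signs, it can help to count how many values are affected. The following is a minimal sketch on a made-up frame (`toy` is hypothetical), and it assumes that negative values are sign errors rather than invalid records that should be dropped:

```python
import pandas as pd

# Toy frame with one negative age and one negative child count (illustrative only)
toy = pd.DataFrame({
    "age": [19.0, -18.0, 50.0],
    "children": [0.0, -1.0, 3.0],
})

cols = ["age", "children"]

# Count negative entries per column before changing anything
n_negative = (toy[cols] < 0).sum()
print(n_negative)

# Replace each value with its absolute value (assumes the sign is the only error)
toy[cols] = toy[cols].abs()
print(toy)
```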


In [43]:
# Strip the literal dollar sign from `charges`
# (regex=False so "$" is not read as an end-of-string anchor)
insurance['charges'] = insurance['charges'].str.replace("$", '', regex=False)

# Convert `charges` from string to float
insurance['charges'] = insurance['charges'].astype(float)

# Drop all missing data
insurance.dropna(inplace=True)

# Final look at charges column
insurance.charges

0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges, Length: 1207, dtype: float64
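One detail worth illustrating: `$` is a regular-expression anchor, so the replacement should treat it as a literal character. The sketch below works on a made-up Series (`raw` is hypothetical) and swaps `astype(float)` for `pd.to_numeric(..., errors="coerce")`, an alternative that turns unparseable entries into `NaN` instead of raising an error:

```python
import pandas as pd

# Toy column mixing plain numbers, dollar-prefixed numbers, and bad entries
raw = pd.Series(["16884.924", "$4449.462", None, "n/a"], name="charges")

# regex=False makes "$" a literal character rather than an end-of-string anchor
cleaned = raw.str.replace("$", "", regex=False)

# Coerce anything that is not a valid number to NaN
charges = pd.to_numeric(cleaned, errors="coerce")
print(charges)
```

Rows left as `NaN` can then be removed with `dropna()`, as in the cell above.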

In [59]:
# Normalize region names to lowercase (e.g. 'Southeast' -> 'southeast')
insurance['region'] = insurance['region'].str.lower()

# Map alternative spellings in `sex` to 'female' / 'male'
insurance['sex'] = insurance['sex'].replace({
    'woman': 'female',
    'F': 'female',
    'M': 'male',
    'man': 'male',
})
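A quick way to confirm the normalization worked is to check which labels remain afterwards. This sketch uses a made-up frame (`toy` is hypothetical) containing the same kinds of variants handled above:

```python
import pandas as pd

# Toy frame with inconsistent labels (illustrative only)
toy = pd.DataFrame({
    "sex": ["female", "F", "man", "M", "woman", "male"],
    "region": ["southwest", "Southeast", "Northwest",
               "northeast", "southeast", "Northeast"],
})

# Lowercase the region names so case variants collapse into one label
toy["region"] = toy["region"].str.lower()

# Map the alternative spellings onto the two canonical labels
toy["sex"] = toy["sex"].replace({
    "woman": "female",
    "F": "female",
    "M": "male",
    "man": "male",
})

# Any label outside the expected set signals a variant the mapping missed
unexpected = set(toy["sex"].unique()) - {"female", "male"}
print(toy["sex"].value_counts())
print(toy["region"].value_counts())
print("Unexpected sex labels:", unexpected)
```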

## Exploratory Data Analysis

In [48]:
# Descriptive statistics for numeric columns
insurance.describe()

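|       | age       | bmi       | children | charges      |
|-------|-----------|-----------|----------|--------------|
| count | 1207.0    | 1207.0    | 1207.0   | 1207.0       |
| mean  | 39.231152 | 30.574147 | 1.075394 | 13311.273947 |
| std   | 14.075269 | 6.120031  | 1.203277 | 12136.057425 |
| min   | 18.0      | 15.96     | 0.0      | 1121.8739    |
| 25%   | 26.0      | 26.19     | 0.0      | 4749.06145   |
| 50%   | 39.0      | 30.21     | 1.0      | 9447.25035   |
| 75%   | 51.0      | 34.58     | 2.0      | 16582.138605 |
| max   | 64.0      | 53.13     | 5.0      | 63770.42801  |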


In [60]:
# Descriptive statistics for categorical columns
insurance.describe(include='O')

|        | sex  | smoker | region    |
|--------|------|--------|-----------|
| count  | 1207 | 1207   | 1207      |
| unique | 2    | 2      | 4         |
| top    | male | no     | southeast |
| freq   | 612  | 959    | 321       |


In [63]:
# Group ages into four bands: 18-29, 30-39, 40-49, and 50 or older
age_cat = []
for value in insurance.age.values:
    if value < 30:
        age_cat.append('18 - 29')
    elif value < 40:
        age_cat.append('30 - 39')
    elif value < 50:
        age_cat.append('40 - 49')
    else:
        age_cat.append('50 - above')

In [64]:
insurance['age_cat'] = age_cat
insurance['age_cat']

0          18 - 29
1          18 - 29
2          18 - 29
3          30 - 39
4          30 - 39
           ...    
1333    50 - above
1334       18 - 29
1335       18 - 29
1336       18 - 29
1337    50 - above
Name: age_cat, Length: 1207, dtype: object
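The age bands built by the loop above could also be produced with `pd.cut`, shown here as an alternative sketch rather than the notebook's own method. It runs on a made-up Series (`ages` is hypothetical). One difference: the loop labels any age below 30 as `18 - 29`, whereas `pd.cut` returns `NaN` for ages below the first bin edge (18).

```python
import pandas as pd

# Toy ages covering both edges of every band (illustrative only)
ages = pd.Series([19, 29, 30, 39, 40, 49, 50, 64], name="age")

# Left-closed bins: [18, 30), [30, 40), [40, 50), [50, inf)
bins = [18, 30, 40, 50, float("inf")]
labels = ["18 - 29", "30 - 39", "40 - 49", "50 - above"]

# right=False makes each bin include its left edge and exclude its right edge
age_cat = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_cat)
```

The result is an ordered categorical, so the bands sort in age order rather than alphabetically.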