# Insurance Cost Prediction

## Goal

We want to build a linear regression model that predicts the total cost of an insurance bill.

## About the data

We will be using the [Medical Cost Data Set](https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download) from Kaggle.

**age**: age of primary beneficiary

**sex**: gender of the insurance contractor (female or male)

**bmi**: body mass index, an objective measure of whether body weight is high or low relative to height. It is computed as weight in kilograms divided by height in meters squared (kg / m ^ 2); the ideal range is 18.5 to 24.9

**children**: number of children covered by health insurance (number of dependents)

**smoker**: whether the beneficiary smokes (yes or no)

**region**: the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)

**charges**: individual medical costs billed by health insurance

## Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

## Model development

### Load data

In [11]:
insurance = pd.read_csv('data/insurance.csv')

### EDA

In [5]:
insurance.shape

(1338, 7)

In [6]:
insurance.head()

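| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |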


In [7]:
insurance.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

#### Observations

- We have 1,338 observations
- We have 6 features with which to predict 'charges'
- There is no missing data
- Age looks uniformly distributed over the range [18, 64]
- Sex is split almost 50/50
- BMI looks roughly normally distributed, with some right skew
- On average people have 1 dependent; about 50% of people have none
- 80% are non-smokers
- The observations are spread evenly across the 4 regions
- Charges range from as little as 1121 up to 63770
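
The observations above can be checked with a few pandas summaries. The following is a minimal sketch, not the notebook's original cells: `sample` is a hypothetical stand-in built from the five rows printed by `insurance.head()`, so the numbers it prints describe only those five rows. On the real data, replace `sample` with `insurance`.

```python
import pandas as pd

# Hypothetical stand-in for `insurance`: the five rows shown by insurance.head().
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Count, mean, min, max and quartiles of the numeric columns.
numeric_summary = sample.describe()

# Share of each category, as fractions that sum to 1.
sex_share = sample['sex'].value_counts(normalize=True)
smoker_share = sample['smoker'].value_counts(normalize=True)
region_share = sample['region'].value_counts(normalize=True)

# Sample skewness of bmi: a positive value indicates a longer right tail.
bmi_skew = sample['bmi'].skew()

print(numeric_summary.loc[['min', 'max', 'mean']])
print(smoker_share)
```

`describe()` covers the range and average claims, `value_counts(normalize=True)` covers the category splits, and `skew()` gives a number to back the visual impression of the BMI distribution.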

#### Feature engineering

Let's clean up some of the features.

In [12]:
# Encode sex as a binary indicator: 1 for female, 0 for male
insurance['sex'] = (insurance['sex'] == 'female').astype(int)

In [20]:
insurance['bmi'] = insurance['bmi'].round(0).astype(int)

In [18]:
# Encode smoker as a binary indicator: 1 for yes, 0 for no
insurance['smoker'] = (insurance['smoker'] == 'yes').astype(int)

In [22]:
insurance['charges'] = insurance['charges'].round(0).astype(int)

- Encoded sex and smoker as 1 or 0
- Rounded bmi and charges and cast them to integers
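
After these steps `region` is still a text column, and a linear regression needs numeric inputs. One common option, shown here only as a sketch and not as the approach this notebook takes, is one-hot encoding with `pd.get_dummies`. The `sample` frame is a hypothetical stand-in that reuses the regions and rounded charges from the five rows of `insurance.head()`.

```python
import pandas as pd

# Hypothetical stand-in with only the columns needed for the illustration.
sample = pd.DataFrame({
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16885, 1726, 4449, 21984, 3867],
})

# Create one 0/1 indicator column per region. drop_first=True drops the first
# level so the indicators are not perfectly collinear with a model intercept.
encoded = pd.get_dummies(sample, columns=['region'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

The dropped level becomes the baseline: each remaining indicator's coefficient is then read as a difference from that baseline region.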

#### Correlations

Let's find out which features can help predict 'charges'.
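
A minimal sketch of one way to do this, assuming Pearson correlation is the measure of interest: compute the correlation of every numeric column with `charges` and sort the result. The `sample` frame is a hypothetical stand-in built from the five `insurance.head()` rows after the encoding and rounding steps, so its coefficients are not meaningful; on the real data, replace `sample` with `insurance`.

```python
import pandas as pd

# Hypothetical stand-in: the five head() rows after feature engineering
# (sex and smoker encoded as 0/1, bmi and charges rounded to integers).
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': [1, 0, 0, 0, 0],
    'bmi': [28, 34, 33, 23, 29],
    'children': [0, 1, 3, 0, 0],
    'smoker': [1, 0, 0, 0, 0],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16885, 1726, 4449, 21984, 3867],
})

# Keep numeric columns only (region is text), take the 'charges' column of the
# Pearson correlation matrix, and drop the trivial self-correlation.
corr_with_charges = (
    sample.select_dtypes(include='number')
    .corr()['charges']
    .drop('charges')
    .sort_values(ascending=False)
)

print(corr_with_charges)
```

Pearson correlation only captures linear association, so a feature with a low coefficient may still be useful through an interaction or a non-linear effect.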