# Dataset

Task: Predict health insurance charges from features of the insured person (age, sex, bmi, children, smoker, region).

Dataset link on Kaggle: <a href="https://www.kaggle.com/teertha/ushealthinsurancedataset">US Health Insurance Dataset</a>

Dataset description on Kaggle: 
<br> "*This dataset contains **1338 rows** of insured data, where the Insurance charges are given against the following **attributes** of the insured: **Age, Sex, BMI, Number of Children, Smoker and Region**. There are no missing or undefined values in the dataset.*"

In [1]:
import pandas as pd

data = pd.read_csv('insurance.csv')
data.head()


# Check categorical data types and values

In [None]:
data.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [None]:
print(data['sex'].unique())
print(data['smoker'].unique())
print(data['region'].unique())

['female' 'male']
['yes' 'no']
['southwest' 'southeast' 'northwest' 'northeast']


**Answer the following questions:**
1. Which features are categorical?
2. What unique values appear in each feature?
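
One way to check your answers programmatically is sketched below. It assumes a small stand-in frame named `sample` (a hypothetical name, not part of the notebook) rebuilt from the first five rows shown in this notebook's outputs, so it runs without `insurance.csv`. With only five rows it will not surface every unique value in the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for `data`: the first five rows shown in this notebook.
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Treat every non-numeric column as categorical.
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

# Unique values per categorical column, in order of first appearance.
unique_values = {col: sample[col].unique().tolist() for col in categorical_cols}

print(categorical_cols)
print(unique_values)
```

Running the same two statements on the full `data` frame instead of `sample` gives the answer for the whole dataset.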

# Encoding binary categorical features using numerical encoding

In [None]:
data_encoded_1 = data.replace({
    'sex': {'female': 0, 'male': 1},
    'smoker': {'yes': 1, 'no': 0}
})

data_encoded_1.head()

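|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | southwest | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | southeast | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | southeast | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | northwest | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | northwest | 3866.8552 |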


**Note:** The output above should show numeric values in the `sex` and `smoker` columns.

# Encoding nominal features using one-hot encoding

In [None]:
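# One-hot encode only the nominal `region` column.
# dtype=int keeps the indicator columns as 0/1 (newer pandas defaults to True/False).
data_encoded_2 = pd.get_dummies(data_encoded_1, columns=['region'], dtype=int)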

data_encoded_2.head()

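|   | age | sex | bmi | children | smoker | charges | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 16884.924 | 0 | 0 | 0 | 1 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 1725.5523 | 0 | 0 | 1 | 0 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 4449.462 | 0 | 0 | 1 | 0 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 21984.47061 | 0 | 1 | 0 | 0 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 3866.8552 | 0 | 1 | 0 | 0 |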


**Note:** The new dataset `data_encoded_2` must have numeric values for **all features**. 

**Note:** The output column `charges` now sits in the middle of the DataFrame. This does not matter, because we separate it from the inputs in the next code cell.
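
To see what one-hot encoding does in isolation, here is a minimal sketch on a hypothetical `regions` frame holding the four region names found earlier. The `columns` and `dtype=int` arguments are choices made for this illustration so that only `region` is encoded and the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical toy frame: one row per region name seen in the dataset.
regions = pd.DataFrame(
    {'region': ['southwest', 'southeast', 'northwest', 'northeast']}
)

# Each distinct value becomes its own indicator column, named <column>_<value>.
one_hot = pd.get_dummies(regions, columns=['region'], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Every row ends up with exactly one `1`, in the column matching its original region.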

# Splitting data into input and output

In [None]:
data_input = data_encoded_2.drop(columns=['charges'])
data_output = data_encoded_2['charges']

In [None]:
data_input.head()

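|   | age | sex | bmi | children | smoker | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 0 | 0 | 1 | 0 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 0 | 1 | 0 | 0 |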


**Note:** `data_input` must contain the same columns as `data_encoded_2`, except for the `charges` column.

In [None]:
data_output.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

**Note:** `data_output` must contain only the `charges` column.

# Splitting data into (train - validation - test)

In [None]:
from sklearn.model_selection import train_test_split

X, X_test, y, y_test = train_test_split(
    data_input, data_output, test_size=0.20, random_state=0
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

In [None]:
print(X_train.shape)
print(y_train.shape)
print()
print(X_val.shape)
print(y_val.shape)
print()
print(X_test.shape)
print(y_test.shape)

(802, 9)
(802,)

(268, 9)
(268,)

(268, 9)
(268,)


**Note:** The output above should match the following:
```
(802, 9)
(802,)

(268, 9)
(268,)

(268, 9)
(268,)
```
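
The two-step split gives roughly 60% / 20% / 20%: the first call holds out 20% of the rows for testing, and the second holds out 25% of the remaining 80%, which is again 20% of the full data. The sketch below checks that arithmetic on placeholder arrays with the same number of rows (1338). `X_demo` and `y_demo` are hypothetical stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1338  # row count stated in the dataset description

# Placeholder data: only the number of rows matters here.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

# Step 1: hold out 20% for testing.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Step 2: hold out 25% of the remainder for validation.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

sizes = (len(X_train_demo), len(X_val_demo), len(X_test_demo))
print(sizes)  # (802, 268, 268)
print([round(s / n_rows, 2) for s in sizes])
```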

# Feature scaling using StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
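
The scaler is fitted on the training split only, so the validation and test sets are transformed with the training mean and standard deviation rather than their own. As a minimal sketch of what `transform` computes, assuming hypothetical toy arrays `train_demo` and `val_demo`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a single feature.
train_demo = np.array([[1.0], [2.0], [3.0]])
val_demo = np.array([[4.0]])

# Fit on the training data only.
scaler_demo = StandardScaler().fit(train_demo)

# transform() applies z = (x - train_mean) / train_std to any split.
scaled = scaler_demo.transform(val_demo)
manual = (val_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(scaled, manual)
```

The training split itself ends up with mean 0 and unit variance per column; the other splits generally do not, which is expected.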

# Linear Regression

## Training and Validation

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
# Creating a linear regression model
linear_reg = LinearRegression()

# training our model
linear_reg.fit(X_train_scaled, y_train)

# making predictions
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)

# Evaluating our model using R2 score
print('train', r2_score(y_train, y_pred_train))
print('val', r2_score(y_val, y_pred_val))

train 0.7409716664286246
val 0.7238983516037893
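
For reference, the R2 score compares the model's squared error with that of always predicting the mean: R2 = 1 - SS_res / SS_tot. A score of 1 is a perfect fit and 0 is no better than the mean. The sketch below reproduces `r2_score` by hand on hypothetical toy values (`y_true_demo`, `y_pred_demo`), not on the insurance data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy targets and predictions.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error of the model's predictions.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)

# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```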


# Non-linear Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(2)
pf.fit(X_train_scaled)

X_train_poly = pf.transform(X_train_scaled)
X_val_poly = pf.transform(X_val_scaled)
X_test_poly = pf.transform(X_test_scaled)
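
With 9 input features, `PolynomialFeatures(2)` produces 55 columns: 1 bias term, 9 linear terms, 9 squared terms, and 36 pairwise products. The count can be checked with a small sketch on a placeholder array (`X_placeholder` is a hypothetical stand-in for `X_train_scaled`):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 9  # number of columns in the input data
degree = 2

# Placeholder array: only the number of columns matters here.
X_placeholder = np.zeros((3, n_features))

pf_demo = PolynomialFeatures(degree).fit(X_placeholder)
n_out = pf_demo.n_output_features_

# Number of monomials of total degree <= 2 in 9 variables.
expected = comb(n_features + degree, degree)

print(n_out, expected)  # 55 55
```

The model fitted on these columns is still linear in its coefficients; the non-linearity comes entirely from the expanded features.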

In [None]:
# Creating a nonlinear regression model: a linear model fitted on the degree-2 polynomial features
nonlinear_reg = LinearRegression()

# training our model
nonlinear_reg.fit(X_train_poly, y_train)

# making predictions
y_pred_train = nonlinear_reg.predict(X_train_poly)
y_pred_val = nonlinear_reg.predict(X_val_poly)

# Evaluating our model using R2 score
print('train', r2_score(y_train, y_pred_train))
print('val', r2_score(y_val, y_pred_val))

train 0.8352717709879879
val 0.8350813429985007


# Testing the non-linear regression model (degree 2)

In [None]:
y_pred_test = nonlinear_reg.predict(X_test_poly)
r2_score(y_test, y_pred_test)

0.8759528224284139