## Feature Engineering

#### Missing value analysis

Let's analyse the dataset for missing values. The `count` row of `describe()` reports the number of non-null entries in each column:

In [63]:
import pandas as pd
total_data = pd.read_csv("../data/interim/factorised_eda_results.csv")
total_data.describe()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 0.505232 | 30.663397 | 1.094918 | 0.795217 | 1.484305 | 13270.422265 |
| std | 14.04996 | 0.50016 | 6.098187 | 1.205493 | 0.403694 | 1.104885 | 12110.011237 |
| min | 18.0 | 0.0 | 15.96 | 0.0 | 0.0 | 0.0 | 1121.8739 |
| 25% | 27.0 | 0.0 | 26.29625 | 0.0 | 1.0 | 1.0 | 4740.28715 |
| 50% | 39.0 | 1.0 | 30.4 | 1.0 | 1.0 | 1.0 | 9382.033 |
| 75% | 51.0 | 1.0 | 34.69375 | 2.0 | 1.0 | 2.0 | 16639.912515 |
| max | 64.0 | 1.0 | 53.13 | 5.0 | 1.0 | 3.0 | 63770.42801 |


We already know that the dataset doesn't have any nulls, and the count of 1338 in every column confirms it. The table also shows no invalid minimum or maximum values (for example, no negative ages or charges), so, overall, the data seems to be pretty consistent.
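
The range check can also be made explicit in code. The following is a minimal sketch in plain Python that uses the minimum and maximum values copied from the `describe()` output above. The bounds in `expected_bounds` and the helper name `out_of_range` are assumptions chosen for illustration; they do not come from the dataset's documentation.

```python
# Observed (min, max) per column, copied from the describe() output above.
observed = {
    "age": (18.0, 64.0),
    "sex": (0.0, 1.0),
    "bmi": (15.96, 53.13),
    "children": (0.0, 5.0),
    "smoker": (0.0, 1.0),
    "region": (0.0, 3.0),
    "charges": (1121.8739, 63770.42801),
}

# Assumed plausible bounds (hypothetical, for illustration only).
expected_bounds = {
    "age": (0, 120),
    "sex": (0, 1),
    "bmi": (10, 70),
    "children": (0, 20),
    "smoker": (0, 1),
    "region": (0, 3),
    "charges": (0, float("inf")),
}


def out_of_range(observed, expected_bounds):
    """Return the columns whose observed min or max falls outside the bounds."""
    bad = []
    for col, (col_min, col_max) in observed.items():
        lower, upper = expected_bounds[col]
        if col_min < lower or col_max > upper:
            bad.append(col)
    return bad


invalid_columns = out_of_range(observed, expected_bounds)
print(invalid_columns)  # prints []
```

On the full DataFrame, `total_data.isnull().sum()` gives the number of nulls per column directly.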

#### Outlier analysis
Now that we have a tidier dataset, we can proceed to study the outliers.

In [64]:
total_data[['age', 'bmi', 'children', 'charges']].describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


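Overall, the features look reasonably symmetric: for `age`, `bmi` and `children` the mean and the median are close. The target is the exception: the maximum of `charges` (63770.43) is almost four times the 75th percentile (16639.91), and its mean (13270.42) is well above its median (9382.03), which points to a right-skewed distribution. We don't consider this difference significant enough to act on, so let's leave the dataset as it is for now.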
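
One common way to put a number on "several times larger" is Tukey's 1.5 × IQR rule. The analysis above does not use this rule; it is brought in here only as an illustration. The sketch below computes the fences from the quartiles in the table and checks whether each column's minimum or maximum falls outside them. The helper name `tukey_fences` is hypothetical.

```python
# Summary values copied from the describe() output above:
# column -> (min, Q1, Q3, max)
summary = {
    "age": (18.0, 27.0, 51.0, 64.0),
    "bmi": (15.96, 26.29625, 34.69375, 53.13),
    "children": (0.0, 0.0, 2.0, 5.0),
    "charges": (1121.8739, 4740.28715, 16639.912515, 63770.42801),
}


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of Tukey's rule for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# For each column, record whether the min lies below the lower fence
# and whether the max lies above the upper fence.
flags = {}
for col, (col_min, q1, q3, col_max) in summary.items():
    lower, upper = tukey_fences(q1, q3)
    flags[col] = (col_min < lower, col_max > upper)
    print(f"{col}: fences=({lower:.2f}, {upper:.2f}), "
          f"below={flags[col][0]}, above={flags[col][1]}")
```

With these quartiles, the upper fence for `charges` is about 34489.35, well below the maximum of 63770.43. The `bmi` maximum of 53.13 also lies above its upper fence of about 47.29, while `age` and `children` stay within their fences.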

#### Feature engineering
The dataset seems clean and complete, so we won't add any new columns. Let's split it and prepare it for training.

#### Feature scaling
Let's scale our features now. We apply `MinMaxScaler` to every column except the target, `charges`:

In [96]:
from sklearn.preprocessing import MinMaxScaler

# Scale every column except the target
num_variables = list(total_data.columns)
num_variables.remove('charges')

# Rescale each feature to the [0, 1] range
scaler = MinMaxScaler()
norm_features = scaler.fit_transform(total_data[num_variables])
total_data_norm = pd.DataFrame(norm_features, index = total_data.index, columns = num_variables)
total_data_norm["charges"] = total_data["charges"]

# Save the scaled data and preview the first rows
total_data_norm.to_csv("../data/processed/scaled_data.csv")
total_data_norm.head()
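
For reference, `MinMaxScaler` with its default `feature_range=(0, 1)` rescales each column as `(x - min) / (max - min)`. The sketch below reproduces that formula in plain Python for a few `age` values, using the minimum (18) and maximum (64) from the summary table. The helper name `min_max_scale` and the sample values are assumptions made for illustration.

```python
def min_max_scale(values, lo, hi):
    """Map values linearly so that lo becomes 0.0 and hi becomes 1.0."""
    span = hi - lo
    if span == 0:
        # Guard against division by zero for a constant column.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Minimum, median and maximum age from the describe() output.
ages = [18.0, 39.0, 64.0]
scaled_ages = min_max_scale(ages, lo=18.0, hi=64.0)
print(scaled_ages)
```

The minimum maps to 0.0, the maximum to 1.0, and the median age of 39 to 21/46, or roughly 0.457.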