# Tutorial 1: Data Preprocessing

In this tutorial, we perform **data preprocessing** on the *Medical Insurance Charges* dataset.

### Why Preprocessing?

Even though the dataset is clean, preprocessing is necessary because:

- Models require **numerical** inputs, so we must encode categorical data.
- Features like `age`, `bmi`, and `children` are on different **scales**, so large-valued features can dominate scale-sensitive models (for example, regularized linear models or k-nearest neighbors).
- We need to **split** data into training and testing sets for evaluation.
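
To make the first two points concrete, here is a minimal sketch on a made-up three-row frame. The values and the name `toy` are hypothetical and not taken from the dataset; the sketch only shows what scaling and one-hot encoding do to a column.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy rows, made up for illustration only
toy = pd.DataFrame({
    "age": [20, 40, 60],
    "bmi": [22.0, 30.0, 38.0],
    "smoker": ["yes", "no", "no"],
})

# Scaling: each numeric column is shifted to mean 0 and rescaled to unit variance
scaled = StandardScaler().fit_transform(toy[["age", "bmi"]])

# Encoding: the text column becomes one 0/1 column per category
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["smoker"]]).toarray()

print(scaled.round(3))
print(list(encoder.categories_[0]))
print(encoded)
```

After scaling, `age` and `bmi` hold comparable values even though their raw ranges differ, and `smoker` is represented by numeric indicator columns that a model can consume.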

---

### Preprocessing Steps:
1. Load the dataset
2. Check for missing values
3. Separate features and target
4. Encode categorical columns
5. Scale numerical columns
6. Split into train-test sets

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [5]:
url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
df = pd.read_csv(url)
df.head()

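|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |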


### Dataset Overview

Each row describes one insured individual using the columns `age`, `sex`, `bmi`, `children`, `smoker`, `region`, and `charges`.

`charges` is our **target** variable: the amount billed by the insurance provider. The other six columns are the features.

In [14]:
# Check for missing values in each column
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [16]:
# Feature and target separation
X = df.drop("charges", axis=1)  # all columns except the target
y = df["charges"]               # target: billed amount

In [18]:
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']

# Preprocessing setup: scale numeric columns, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    # drop="first" omits the first category of each feature as the baseline
    ("cat", OneHotEncoder(drop="first"), categorical_features)
])

In [20]:
#Applying Transformation
X_processed = preprocessor.fit_transform(X)
X_processed[:5]  # Check first few rows

array([[-1.43876426, -0.45332   , -0.90861367,  0.        ,  1.        ,
         0.        ,  0.        ,  1.        ],
       [-1.50996545,  0.5096211 , -0.07876719,  1.        ,  0.        ,
         0.        ,  1.        ,  0.        ],
       [-0.79795355,  0.38330685,  1.58092576,  1.        ,  0.        ,
         0.        ,  1.        ,  0.        ],
       [-0.4419476 , -1.30553108, -0.90861367,  1.        ,  0.        ,
         1.        ,  0.        ,  0.        ],
       [-0.51314879, -0.29255641, -0.90861367,  1.        ,  0.        ,
         1.        ,  0.        ,  0.        ]])

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape  # Shapes of the split data

((1070, 8), (268, 8))
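
One caveat about the order used here: `fit_transform` runs on all rows before the split, so the scaler's mean and standard deviation are computed with the test rows included. A commonly used alternative is to split the raw data first and fit the preprocessor on the training part only. The sketch below shows that order on a synthetic stand-in frame (random values, not the real dataset); `demo`, `pre`, `X_tr`, and `X_te` are placeholder names introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the insurance frame (random values, illustration only)
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "sex": rng.choice(["female", "male"], n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    "charges": rng.uniform(1000, 50000, n),
})
X_demo = demo.drop("charges", axis=1)
y_demo = demo["charges"]

# 1. Split the raw rows first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Fit on the training rows only, then reuse the fitted transformer on test rows
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
])
X_tr_p = pre.fit_transform(X_tr)
X_te_p = pre.transform(X_te)

print(X_tr_p.shape, X_te_p.shape)
```

With this order the test rows are scaled using statistics learned from the training rows alone, which keeps the later evaluation closer to how the model would see unseen data.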

In [24]:
import joblib

# Save preprocessor and split data
joblib.dump(preprocessor, "preprocessor.pkl")
joblib.dump((X_train, X_test, y_train, y_test), "insurance_data_split.pkl")

['insurance_data_split.pkl']
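
In a later notebook, the saved files can be read back with `joblib.load`, for example `joblib.load("preprocessor.pkl")`. The round trip is sketched below with a small stand-in scaler written to a temporary directory, so the example does not depend on the files created above; `scaler`, `restored`, and the temporary path are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessor (fitted on three made-up values)
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor.pkl")
    joblib.dump(scaler, path)      # save the fitted object
    restored = joblib.load(path)   # load it back, fitted state included

# The restored object transforms new data without refitting
result = restored.transform(np.array([[2.0]]))
print(result)
```

The tuple saved in `insurance_data_split.pkl` can be unpacked the same way: `X_train, X_test, y_train, y_test = joblib.load("insurance_data_split.pkl")`.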