# 02 — Preprocessing

**Goal:** Build a reproducible preprocessing pipeline with scikit-learn.

Now that we have gathered considerable insight into our dataset, we can use it to guide data preprocessing.

**Checklist**
- Train/test split
- Feature scaling
- Feature selection

In [40]:
# Imports
import numpy as np
import pandas as pd
from scipy.io import arff

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import joblib


In [18]:
# Load data
file_path = '/kaggle/input/bodyfat/bodyfat.arff'
bf_arff = arff.loadarff(file_path)
df = pd.DataFrame(bf_arff[0])
df.head()

Unnamed: 0,Density,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,Knee,Ankle,Biceps,Forearm,Wrist,class
0,1.0708,23.0,154.25,67.75,36.2,93.1,85.2,94.5,59.0,37.3,21.9,32.0,27.4,17.1,12.3
1,1.0853,22.0,173.25,72.25,38.5,93.6,83.0,98.7,58.7,37.3,23.4,30.5,28.9,18.2,6.1
2,1.0414,22.0,154.0,66.25,34.0,95.8,87.9,99.2,59.6,38.9,24.0,28.8,25.2,16.6,25.3
3,1.0751,26.0,184.75,72.25,37.4,101.8,86.4,101.2,60.1,37.3,22.8,32.4,29.4,18.2,10.4
4,1.034,24.0,184.25,71.25,34.4,97.3,100.0,101.9,63.2,42.2,24.0,32.2,27.7,17.7,28.7


## Double checking for issues

Datasets often contain issues that cause problems later, when we train our models. Our analysis in the `01_exploration.ipynb` notebook found no such issues in this data, but for the sake of verification we will double-check here before moving on.

We will check for:
- missing (NA) values
- duplicate entries

As for outliers, we already went through them in the previous notebook.

In [28]:
print("Missing values:\n", df.isnull().sum())
print("Duplicates:", df.duplicated().sum())

Missing values:
 Density    0
Age        0
Weight     0
Height     0
Neck       0
Chest      0
Abdomen    0
Hip        0
Thigh      0
Knee       0
Ankle      0
Biceps     0
Forearm    0
Wrist      0
class      0
dtype: int64
Duplicates: 0
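The two checks above can be wrapped in a small helper so that they are easy to re-run whenever the data changes. The following is only a minimal sketch: `check_quality` and the toy frame are hypothetical (the values are made up and are not taken from the body fat dataset).

```python
import numpy as np
import pandas as pd


def check_quality(frame: pd.DataFrame) -> dict:
    """Return the total number of missing cells and of duplicated rows."""
    return {
        "missing": int(frame.isnull().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }


# Toy frame with one missing value and one repeated row (made-up values)
toy = pd.DataFrame({
    "Age": [23.0, 22.0, 22.0, np.nan],
    "Weight": [154.25, 173.25, 173.25, 184.75],
})

report = check_quality(toy)
print(report)  # {'missing': 1, 'duplicates': 1}

# After dropping the problem rows, both counts fall to zero
clean_report = check_quality(toy.dropna().drop_duplicates())
print(clean_report)  # {'missing': 0, 'duplicates': 0}
```

On our dataset both counts are zero, so no cleaning step is needed.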


## Splitting the target from the dataset

The target we will be going with is `'class'`, which contains the body fat percentages of the 252 individuals. We will split it from the dataset and assign it to `'y'` (the standard naming convention for labels). The rest of the dataset will then be assigned to `'X'`.

We will also drop the `'Density'` column, as discussed in the previous notebook.

In [22]:
# Target & features
TARGET = 'class'
X = df.drop(columns=[TARGET, 'Density'])
y = df[TARGET]

X.head()
#y.head()

Unnamed: 0,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,Knee,Ankle,Biceps,Forearm,Wrist
0,23.0,154.25,67.75,36.2,93.1,85.2,94.5,59.0,37.3,21.9,32.0,27.4,17.1
1,22.0,173.25,72.25,38.5,93.6,83.0,98.7,58.7,37.3,23.4,30.5,28.9,18.2
2,22.0,154.0,66.25,34.0,95.8,87.9,99.2,59.6,38.9,24.0,28.8,25.2,16.6
3,26.0,184.75,72.25,37.4,101.8,86.4,101.2,60.1,37.3,22.8,32.4,29.4,18.2
4,24.0,184.25,71.25,34.4,97.3,100.0,101.9,63.2,42.2,24.0,32.2,27.7,17.7


## Splitting the dataset into training and test sets

We now move on to one of the essential steps of data preprocessing. Having defined `'X'` (features) and `'y'` (target), we will split them into two subsets: the **training** set and the **test** set. I often prefer three subsets (training, validation and test), but as this is a fairly small dataset, we will rely on cross-validation on the training set instead of holding out a separate validation set.

The proportions we will use when splitting the dataset are as follows:
- 80% of `df` → `train`
- 20% of `df` → `test`

In [31]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (201, 13)
Test shape: (51, 13)
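To make the split sizes and the cross-validation plan concrete, here is a small sketch on placeholder arrays with the same number of rows as our dataset (252). The arrays, the choice of 5 folds and the variable names are assumptions for illustration; the actual cross-validation setup is decided in the training notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder data: 252 rows, like the body fat dataset (values are arbitrary)
X_demo = np.arange(252 * 2).reshape(252, 2)
y_demo = np.arange(252)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# The test size is rounded up: ceil(0.2 * 252) = 51, leaving 201 for training
print(X_tr.shape, X_te.shape)  # (201, 2) (51, 2)

# Cross-validation reuses the training set only; the test set stays untouched
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
val_sizes = [len(val_idx) for _, val_idx in kfold.split(X_tr)]
print(val_sizes)
```

Each training row ends up in exactly one validation fold, so the fold sizes add up to 201.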


## Scaling features

When preprocessing a dataset for model training, it is good practice to standardise the features, because features are often measured in different units and span very different ranges. In our dataset, for example:
- Age: 20–80 (years)
- Weight: 100–250 (lbs)
- Height: 60–80 (inches)
- Circumferences: 10–120 (cm)

Models that rely on distances, dot products or gradient descent are sensitive to these differences in scale, and as we will be experimenting with several regression models, we will standardise the features here. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into preprocessing.

In [33]:
# Scale the data
scaler = StandardScaler()

# Fit on the training set only, then apply the same transformation to the test set.
# The scaler returns plain arrays, so the column names are passed back in.
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

X_train_scaled.head()
#X_test_scaled.head()

Unnamed: 0,Age,Weight,Height,Neck,Chest,Abdomen,Hip,Thigh,Knee,Ankle,Biceps,Forearm,Wrist
0,-0.517891,-0.405512,0.180513,-0.410457,-0.913066,-0.613129,-0.613767,-0.083661,-1.03053,-0.380167,0.211451,-0.187997,-1.220435
1,0.664124,0.713217,0.499346,0.827925,0.819871,0.692826,-0.027908,-0.39615,-0.235461,-0.634879,1.298621,0.748065,0.742897
2,-0.123886,1.608201,-0.074553,1.254953,2.470288,1.99878,1.070576,0.834275,0.726992,-0.762235,0.924906,0.994397,-0.893213
3,0.427721,-1.032001,-0.074553,-1.349919,-0.924855,-1.082897,-0.511242,-1.060189,-0.863147,-0.953269,-0.264186,-1.173325,-0.456917
4,-1.621106,1.017512,-0.074553,1.254953,1.208898,0.739802,0.938758,1.791273,0.936221,1.020749,0.415295,0.501733,0.197527
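As a sanity check on what `StandardScaler` does, the sketch below reproduces the transformation by hand on synthetic data. The two columns and their ranges are assumptions chosen to resemble `Age` and `Weight`; the values are randomly generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features on different scales
train = pd.DataFrame({
    "Age": rng.uniform(20, 80, 100),
    "Weight": rng.uniform(100, 250, 100),
})
test = pd.DataFrame({
    "Age": rng.uniform(20, 80, 25),
    "Weight": rng.uniform(100, 250, 25),
})

demo_scaler = StandardScaler().fit(train)  # statistics come from train only
train_scaled = pd.DataFrame(demo_scaler.transform(train), columns=train.columns)
test_scaled = pd.DataFrame(demo_scaler.transform(test), columns=test.columns)

# Manual version: subtract the training mean, divide by the training
# standard deviation (StandardScaler uses the population form, ddof=0)
manual = (test - train.mean()) / train.std(ddof=0)

print(train_scaled.mean().round(6).tolist())
print(train_scaled.std(ddof=0).round(6).tolist())
print(np.allclose(manual.to_numpy(), test_scaled.to_numpy()))
```

After scaling, each training column has mean 0 and unit standard deviation. The test columns are only approximately standardised, because they are transformed with the training statistics.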


## Feature selection

In `01_exploration.ipynb`, we went through the correlation of each feature with the target. For this project, we will create two versions of the feature set: one with all the features in the dataset (except `'Density'`, of course) and one with only the features whose correlation coefficient with the target is above **0.3** in absolute value.

We will use the `SelectKBest` class from `scikit-learn`, with `f_regression` as its score function, to select these features automatically. The `f_regression` score of a feature grows with the absolute value of its correlation with the target, so keeping the top `k` features by score is equivalent to keeping the `k` most strongly correlated ones.

In [37]:
# Select top 10 features (features that have |corr| > 0.3)
selector = SelectKBest(score_func=f_regression, k=10)
X_train_kbest = pd.DataFrame(selector.fit_transform(X_train_scaled, y_train))
X_test_kbest = pd.DataFrame(selector.transform(X_test_scaled))

# Adding back the names of the selected columns (as they were removed during selection)
X_train_kbest.columns = X_train.columns[selector.get_support()]
X_test_kbest.columns = X_test.columns[selector.get_support()]

X_train_kbest.head()
#X_test_kbest.head()

Unnamed: 0,Weight,Neck,Chest,Abdomen,Hip,Thigh,Knee,Biceps,Forearm,Wrist
0,-0.405512,-0.410457,-0.913066,-0.613129,-0.613767,-0.083661,-1.03053,0.211451,-0.187997,-1.220435
1,0.713217,0.827925,0.819871,0.692826,-0.027908,-0.39615,-0.235461,1.298621,0.748065,0.742897
2,1.608201,1.254953,2.470288,1.99878,1.070576,0.834275,0.726992,0.924906,0.994397,-0.893213
3,-1.032001,-1.349919,-0.924855,-1.082897,-0.511242,-1.060189,-0.863147,-0.264186,-1.173325,-0.456917
4,1.017512,1.254953,1.208898,0.739802,0.938758,1.791273,0.936221,0.415295,0.501733,0.197527
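The link between `f_regression` and the correlation coefficient can be checked directly. The sketch below uses synthetic data (the column names `a` to `e` and the coefficients are arbitrary assumptions) and compares the columns kept by `SelectKBest` with the columns ranked highest by absolute Pearson correlation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
n_rows = 200

# Synthetic features; only some of them actually drive the target
X_demo = pd.DataFrame(rng.normal(size=(n_rows, 5)), columns=list("abcde"))
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + 0.5 * X_demo["c"] + rng.normal(size=n_rows)

# Top 3 features according to the univariate F-score
demo_selector = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)
by_f_score = set(X_demo.columns[demo_selector.get_support()])

# Top 3 features according to |Pearson correlation| with the target
abs_corr = X_demo.corrwith(y_demo).abs()
by_corr = set(abs_corr.nlargest(3).index)

print(sorted(by_f_score), sorted(by_corr))
print(by_f_score == by_corr)  # True

# Scores per feature, highest first, for inspection
f_scores = pd.Series(demo_selector.scores_, index=X_demo.columns)
print(f_scores.sort_values(ascending=False))
```

Both rankings agree because, for a single feature, the F-statistic is a monotonic function of the squared correlation. Note that the selector in this notebook is fitted on the training set, so its ranking reflects correlations in the training data only.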


## Saving the preprocessed data, scaler and selector

Now that all the required preprocessing steps are done, we can save the data to separate `.csv` files so that we can import them later for model training. Since we will be conducting cross-validation during training, we will also save the fitted scaler and selector alongside the data as `.pkl` files.

In [38]:
X_train_scaled.to_csv("/kaggle/working/X_train_all_features.csv", index=False)
X_test_scaled.to_csv("/kaggle/working/X_test_all_features.csv", index=False)
X_train_kbest.to_csv("/kaggle/working/X_train_top_features.csv", index=False)
X_test_kbest.to_csv("/kaggle/working/X_test_top_features.csv", index=False)
y_train.to_csv("/kaggle/working/y_train.csv", index=False)
y_test.to_csv("/kaggle/working/y_test.csv", index=False)

In [39]:
# Save scaler and selector
joblib.dump(scaler, "/kaggle/working/scaler.pkl")
joblib.dump(selector, "/kaggle/working/selector.pkl")

['/kaggle/working/selector.pkl']
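An alternative worth considering for the training notebook is to chain the same steps in a scikit-learn `Pipeline`, so that the scaler and selector are refitted inside every cross-validation fold rather than once on the whole training set. The following is only a sketch under stated assumptions: the data is synthetic, `LinearRegression` is a placeholder model, and the file name `pipeline.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as our training set (201 x 13)
X_demo = rng.normal(size=(201, 13))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=201)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_regression, k=10)),
    ("model", LinearRegression()),  # placeholder estimator
])

# Every fold refits the scaler and selector on that fold's training part only
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print(scores.round(3))

# Persist the fitted pipeline and confirm the reloaded copy behaves the same
pipe.fit(X_demo, y_demo)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

With this design a single `.pkl` file carries the whole preprocessing chain, which would be fed the unscaled training features rather than the pre-scaled `.csv` files.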