## Data Preprocessing

This notebook preprocesses the dataset (both the train and test sets). The main preprocessing stages are: removing irrelevant attributes, handling missing values, encoding categorical features, and scaling the data (deferred to the modeling stage, as discussed in Step 3.4).

### Step 1 | Import libraries

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.preprocessing import OrdinalEncoder

### Step 2 | Import datasets

In [2]:
train_df = pd.read_csv('data/train_data.csv')
test_df = pd.read_csv('data/test_data.csv')

In [3]:
train_df.head()

| | id | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4515 | Male | 22.0 | 1.71 | 90.0 | yes | yes | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 1.0 | 2.0 | Sometimes | Public_Transportation | Obesity_Type_I |
| 1 | 7949 | Female | 41.0 | 1.64 | 77.0 | yes | yes | 3.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Automobile | Obesity_Type_I |
| 2 | 20677 | Male | 18.0 | 1.8 | 56.0 | yes | yes | 2.0 | 4.0 | Frequently | no | 2.0 | no | 2.0 | 1.0 | no | Automobile | Insufficient_Weight |
| 3 | 18079 | Male | 18.0 | 1.7 | 85.0 | yes | yes | 2.0 | 3.0 | Frequently | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Overweight_Level_II |
| 4 | 5129 | Male | 22.735328 | 1.849425 | 121.657979 | yes | yes | 2.352323 | 2.699971 | Sometimes | no | 2.357978 | no | 1.684582 | 0.739609 | Sometimes | Public_Transportation | Obesity_Type_II |


### Step 3 | Data Preprocessing

#### Step 3.1 | Irrelevant Feature Removals

We drop the 'id' column because it only serves as an index and has no effect on the labels.

In [4]:
train_df = train_df.drop(columns=['id'])

In [5]:
test_df = test_df.drop(columns=['id'])

#### Step 3.2 | Missing Value Treatment

In [6]:
train_df.isnull().sum().sum()

0

In [7]:
test_df.isnull().sum().sum()

0

The dataset has no missing values.

#### Step 3.3 | Categorical Attributes Encoding

In [8]:
ordinal_features = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC']

We apply ordinal encoding to the binary features (for example yes/no) and to the features whose categories have a natural order. The remaining nominal feature, 'MTRANS', is one-hot encoded instead.

We also define a dictionary for label mapping, which converts each text label to an integer.

In [9]:
label_mapping = {
    'Insufficient_Weight': 0,
    'Normal_Weight': 1,
    'Overweight_Level_I': 2,
    'Overweight_Level_II': 3,
    'Obesity_Type_I': 4,
    'Obesity_Type_II': 5,
    'Obesity_Type_III': 6
}
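As a small illustration of how this mapping behaves, the sketch below encodes a few labels and decodes them again. The `inverse_label_mapping` name and the sample labels are assumptions made for this example and are not part of the original notebook. It also shows that a label missing from the dictionary maps to NaN, which would make a later `.astype(int)` call fail.

```python
import pandas as pd

label_mapping = {
    'Insufficient_Weight': 0,
    'Normal_Weight': 1,
    'Overweight_Level_I': 2,
    'Overweight_Level_II': 3,
    'Obesity_Type_I': 4,
    'Obesity_Type_II': 5,
    'Obesity_Type_III': 6
}

# Hypothetical helper: reverse lookup from integer code back to text label,
# useful for turning model predictions into readable class names.
inverse_label_mapping = {code: name for name, code in label_mapping.items()}

# Hypothetical sample labels.
labels = pd.Series(['Obesity_Type_I', 'Normal_Weight', 'Insufficient_Weight'])
encoded = labels.map(label_mapping)
decoded = encoded.map(inverse_label_mapping)

# A label that is not in the dictionary becomes NaN rather than raising.
unmapped = pd.Series(['Obesity_Type_I', 'Unknown_Label']).map(label_mapping)
n_unmapped = int(unmapped.isna().sum())
```

Checking `n_unmapped` before casting to integers is one cheap way to catch a misspelled or unexpected class name early.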

In [10]:
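def categorical_attribute_encoding(df):
    """Return a copy of df with categorical columns encoded as integers."""
    # One-hot encode the nominal 'MTRANS' column and cast the dummies to 0/1 integers.
    df = pd.get_dummies(df, columns=['MTRANS'])
    mtrans_columns = [col for col in df.columns if col.startswith('MTRANS')]
    df[mtrans_columns] = df[mtrans_columns].astype(int)
    # Ordinal-encode the binary and ordered columns.
    # NOTE: the encoder is fitted on whichever frame is passed in, and by default
    # it assigns codes in alphabetical order of the category strings.
    ord_encoder = OrdinalEncoder()
    df[ordinal_features] = ord_encoder.fit_transform(df[ordinal_features])
    df[ordinal_features] = df[ordinal_features].astype(int)
    # Map the text target labels to their integer codes.
    df['NObeyesdad'] = df['NObeyesdad'].map(label_mapping).astype(int)
    return df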

In [11]:
train_df = categorical_attribute_encoding(train_df)
test_df = categorical_attribute_encoding(test_df)
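Because the function is called once per frame, the encoder and the dummy columns are derived from each frame independently. `OrdinalEncoder` with default settings orders string categories alphabetically, so the codes do not necessarily follow a meaningful order, and the two frames only receive the same codes and columns if they contain the same categories. A possible alternative, sketched below on toy data, is to fit one encoder on the training frame with an explicit category order and reuse it on the test frame, then align the one-hot columns. The toy frames and the `frequency_order` list are assumptions for illustration, not the notebook's actual method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for train_df / test_df.
toy_train = pd.DataFrame({
    'CAEC': ['no', 'Sometimes', 'Frequently', 'Always'],
    'MTRANS': ['Walking', 'Bike', 'Automobile', 'Walking'],
})
toy_test = pd.DataFrame({
    'CAEC': ['Sometimes', 'no'],
    'MTRANS': ['Automobile', 'Walking'],
})

# Assumed semantic order of the frequency categories (lowest to highest).
frequency_order = ['no', 'Sometimes', 'Frequently', 'Always']

# Fit once on the training data, then reuse the same encoder for the test data.
encoder = OrdinalEncoder(categories=[frequency_order])
encoder.fit(toy_train[['CAEC']])

train_enc = toy_train.copy()
test_enc = toy_test.copy()
train_enc['CAEC'] = encoder.transform(toy_train[['CAEC']]).ravel().astype(int)
test_enc['CAEC'] = encoder.transform(toy_test[['CAEC']]).ravel().astype(int)

# One-hot encode 'MTRANS', then give the test frame exactly the training columns;
# categories absent from the test frame become all-zero columns.
train_enc = pd.get_dummies(train_enc, columns=['MTRANS'], dtype=int)
test_enc = pd.get_dummies(test_enc, columns=['MTRANS'], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

With this approach a given category always receives the same code in both frames, and downstream models see an identical column layout.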

In [12]:
train_df.head()

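| | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | ... | SCC | FAF | TUE | CALC | NObeyesdad | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1.71 | 90.0 | 1 | 1 | 2.0 | 1.0 | 2 | 0 | ... | 0 | 1.0 | 2.0 | 1 | 4 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 41.0 | 1.64 | 77.0 | 1 | 1 | 3.0 | 1.0 | 2 | 0 | ... | 0 | 0.0 | 0.0 | 1 | 4 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 18.0 | 1.8 | 56.0 | 1 | 1 | 2.0 | 4.0 | 1 | 0 | ... | 0 | 2.0 | 1.0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 18.0 | 1.7 | 85.0 | 1 | 1 | 2.0 | 3.0 | 1 | 0 | ... | 0 | 2.0 | 1.0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 22.735328 | 1.849425 | 121.657979 | 1 | 1 | 2.352323 | 2.699971 | 2 | 0 | ... | 0 | 1.684582 | 0.739609 | 1 | 5 | 0 | 0 | 0 | 1 | 0 |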


In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16606 entries, 0 to 16605
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          16606 non-null  int32  
 1   Age                             16606 non-null  float64
 2   Height                          16606 non-null  float64
 3   Weight                          16606 non-null  float64
 4   family_history_with_overweight  16606 non-null  int32  
 5   FAVC                            16606 non-null  int32  
 6   FCVC                            16606 non-null  float64
 7   NCP                             16606 non-null  float64
 8   CAEC                            16606 non-null  int32  
 9   SMOKE                           16606 non-null  int32  
 10  CH2O                            16606 non-null  float64
 11  SCC                             16606 non-null  int32  
 12  FAF                             

#### Step 3.4 | Feature Scaling

Feature scaling is a preprocessing technique that transforms the values of the features in a dataset to a similar scale. Models such as SVM, KNN, and many linear models rely on distances or gradients, which makes them sensitive to differences in feature scale. Scaling ensures that all features contribute comparably to the model's decisions rather than being dominated by the features with the largest magnitudes.
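To make the effect on distances concrete, here is a minimal sketch with three made-up samples described by height (metres) and weight (kilograms). The values and the `euclidean` helper are assumptions for illustration only. On the raw data the weight column dominates the Euclidean distance; after standardization (subtract the column mean, divide by the column standard deviation) both columns carry comparable weight.

```python
import numpy as np

# Hypothetical samples; columns are Height (m) and Weight (kg).
X = np.array([
    [1.60, 60.0],
    [1.90, 61.0],   # very different height, similar weight
    [1.61, 75.0],   # similar height, different weight
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances: the 0.30 m height gap is almost invisible next to kilograms.
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Distances after scaling: both differences now register on a similar scale.
std_d01 = euclidean(X_std[0], X_std[1])
std_d02 = euclidean(X_std[0], X_std[2])
```

In this toy case the raw distance to the third sample is more than ten times the distance to the second, while the standardized distances are close to each other.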

---

**We Skip It At This Time:**
Not all algorithms require scaled data. For instance, decision-tree-based models are invariant to feature scale. Because we intend to use a mix of models, some of which need scaling and others not, we defer standard scaling to the modeling stage and apply it through pipelines. This lets us scale the data only for the models that benefit from it.
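One way this could look at modeling time is sketched below, assuming scikit-learn pipelines: the distance-based model is wrapped together with a `StandardScaler`, while the tree model consumes the features as they are. The synthetic data, the choice of models, and the `models` dictionary are assumptions for illustration and are not taken from the later notebooks.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3)) * np.array([1.0, 50.0, 0.1])
y = (X[:, 0] > 0).astype(int)

models = {
    # Scaling is part of the pipeline, so it is fitted together with the model.
    'knn': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    # Tree-based model: used without a scaling step.
    'tree': DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
```

Keeping the scaler inside the pipeline means the saved dataset can stay unscaled and each model decides for itself whether a scaling step is needed.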

### Step 4 | Save The Dataset

In [14]:
train_df.to_csv('data/preprocessed_train_df.csv', index=False)
test_df.to_csv('data/preprocessed_test_data.csv', index=False)