# DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

This assignment aims to build practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models.

**Data Exploration and Missing Value Handling**

In [1]:
import pandas as pd
import numpy as np

In [3]:
# Load dataset
df = pd.read_csv("adult_with_headers.csv")

In [12]:
# Basic exploration
print('Shape:',df.shape)
print('Head:\n',df.head())
print('Statistical summary:\n',df.describe())

Shape: (32561, 15)
Head:
    age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0             

In [17]:
# Call the method: a bare `df.info` only echoes the bound method's repr
# (a truncated dump of the frame) instead of the dtype and non-null summary.
df.info()

In [13]:
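# Identify Missing Values
# Replace '?' with NaN. This is an exact match, so values stored with
# surrounding whitespace (for example ' ?') are not replaced.
df.replace("?", np.nan, inplace=True)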

# Check missing values
df.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
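
The all-zero counts above may be an artifact of whitespace, not proof that the data is complete. The dummy column names that appear later in this notebook (`sex_ Male`, `income_ >50K`) suggest the string values carry a leading space, in which case an exact match on `"?"` would never fire. The sketch below uses a small hypothetical frame (not the real dataset) to show the difference that stripping whitespace first can make.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking fields read with a leading space.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" ?", " Sales", " ?"],
    "age": [39, 50, 38],
})

# An exact-match replace leaves " ?" untouched, so nothing is flagged.
exact = toy.replace("?", np.nan)
exact_missing = int(exact.isna().sum().sum())

# Strip whitespace from the string columns first, then replace.
cleaned = toy.copy()
for col in ["workclass", "occupation"]:
    cleaned[col] = cleaned[col].str.strip()
cleaned = cleaned.replace("?", np.nan)
missing_per_column = cleaned.isna().sum()

print(exact_missing)
print(missing_per_column)
```

If this is indeed the cause, passing `na_values="?"` together with `skipinitialspace=True` to `pd.read_csv` is another way to handle it at load time. Either change would alter the encoded values shown later in this notebook.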

In [22]:
# handle missing values
# Separate numerical and categorical columns
num_cols = df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = df.select_dtypes(include=["object"]).columns

# Impute numerical columns with median
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())
# Impute categorical columns with mode
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

In [23]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Copy dataset for scaling
df_scaled = df.copy()

# Standard Scaling
std_scaler = StandardScaler()
df_scaled[num_cols] = std_scaler.fit_transform(df[num_cols])

In [24]:
# Min-Max Scaling
minmax_scaler = MinMaxScaler()
df_minmax = df.copy()
df_minmax[num_cols] = minmax_scaler.fit_transform(df[num_cols])

**When to Use Each Scaling Technique**
StandardScaler:
Use StandardScaler (Z-score standardization) when features have different units or ranges and the algorithm is sensitive to feature scale (for example SVM, k-NN, Logistic Regression, PCA, Neural Networks). It centers each feature at zero mean with unit standard deviation and works best when the data is roughly Gaussian.
Min-Max Scaling:
Use MinMaxScaler when features must lie in a fixed range (such as 0 to 1) for scale-sensitive algorithms (Neural Networks, k-NN, K-Means), when the data is not normally distributed, or when clear minimum and maximum bounds are known from domain knowledge. Because the range is set by the observed extremes, it is sensitive to outliers.
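
The arithmetic behind the two scalers can be sketched with plain NumPy. The values below are hypothetical and chosen to include one extreme entry. This illustrates the formulas and is not a replacement for the scikit-learn classes used above.

```python
import numpy as np

# Hypothetical hours-per-week values with one extreme entry.
x = np.array([20.0, 35.0, 40.0, 40.0, 45.0, 99.0])

# Z-score standardization: subtract the mean, divide by the population
# standard deviation (ddof=0), which is also StandardScaler's convention.
z = (x - x.mean()) / x.std()

# Min-max scaling: map the observed minimum to 0 and the maximum to 1.
mm = (x - x.min()) / (x.max() - x.min())

print(np.round(z, 3))
print(np.round(mm, 3))
```

In this toy example the single large value takes the top of the min-max range and squeezes every other value into roughly the lower third. This is why min-max scaling is usually paired with data that has known, stable bounds.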

# Encoding Techniques

In [26]:
# Identify Categorical Variables by Cardinality
low_cardinality = [col for col in cat_cols if df[col].nunique() < 5]
high_cardinality = [col for col in cat_cols if df[col].nunique() >= 5]

low_cardinality
high_cardinality

['workclass',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'native_country']

In [27]:
# One-Hot Encoding (fewer than 5 categories)
df_encoded = pd.get_dummies(df, columns=low_cardinality, drop_first=True)

In [28]:
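# Label Encoding (5 or more categories)
from sklearn.preprocessing import LabelEncoder

# Keep one fitted encoder per column so the integer codes can be decoded later
encoders = {}

for col in high_cardinality:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df_encoded[col])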

**Pros and Cons of Encoding Techniques**
-> One-Hot Encoding:
Pros:
- It avoids implying any artificial order between categories, making it well suited to nominal data.
- All original category information is retained.
- It works well with algorithms that require numerical input and do not assume ordinality.
Cons:
- For high-cardinality features it adds many columns, making the dataset sparse and memory-intensive.
- The added dimensions can increase the risk of overfitting, especially with a small sample size.
- It can introduce multicollinearity (the dummy variable trap) in models such as linear regression; dropping one of the resulting columns mitigates this.

-> Label/Ordinal Encoding:
Pros:
- It does not increase the number of columns, making it memory efficient.
- When the categories have a genuine order and the codes follow it, that order is preserved, which can help some algorithms (tree-based models in particular tolerate integer codes well).
- It is simple to implement.
Cons:
- Applied to nominal data, it can mislead models into treating the integer labels as magnitudes or ranks, which is not intended.
- LabelEncoder numbers categories in sorted (alphabetical) order, so the codes carry no meaning unless an explicit ordering is supplied.
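
The trade-off can be seen on a tiny hypothetical column. This is a minimal sketch: the category values are illustrative, and pandas category codes stand in for `LabelEncoder`, since both number categories in sorted order.

```python
import pandas as pd

# Hypothetical nominal column; the category values are illustrative only.
toy = pd.DataFrame({"race": ["White", "Black", "Other", "White"]})

# One-hot: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy, columns=["race"])

# Integer codes: categories are sorted alphabetically and numbered from 0,
# the same ordering LabelEncoder uses.
codes = toy["race"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(codes.tolist())
```

One-hot encoding turns the three categories into three columns. The integer codes keep a single column but impose the ordering Black < Other < White, which a linear or distance-based model would read as a magnitude.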

# Feature Engineering

In [31]:
# Capital Net Gain
df_encoded["capital_net_gain"] = df_encoded["capital_gain"] - df_encoded["capital_loss"]
# Combines gain and loss into one signed feature, which may be more informative than either alone.

In [32]:
# Working Intensity
df_encoded["hours_per_age"] = df_encoded["hours_per_week"] / (df_encoded["age"] + 1)
# Hours worked relative to age (the +1 guards against division by zero); a rough proxy for work intensity that may help income prediction.

# Transformation of Skewed Features

In [33]:
# Identify Skewed Features
df[num_cols].skew()

age                0.558743
fnlwgt             1.446980
education_num     -0.311676
capital_gain      11.953848
capital_loss       4.594629
hours_per_week     0.227643
dtype: float64

In [35]:
# Apply Log Transformation
df_encoded["log_capital_gain"] = np.log1p(df_encoded["capital_gain"])

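**Justification for transformation:**

- Reduces right skewness: capital_gain has a skew of about 11.95, the largest of the numerical features.
- Improves model stability by compressing extreme values.
- Helps linear and distance-based models, which are sensitive to large magnitudes.
- log1p computes log(1 + x), so zero values map to 0 instead of being undefined.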
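
The effect on skewness can be checked directly. The sketch below uses a hypothetical series shaped like a capital-gain column (mostly zeros with a few large values), not the real data, and compares the skew before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical capital-gain-like values: mostly zeros with a few large entries.
gains = pd.Series([0] * 90 + [1000, 2000, 3000, 5000, 7000,
                              10000, 15000, 20000, 50000, 99999])

skew_before = gains.skew()

# log1p computes log(1 + x), so zeros map to zero instead of -inf.
log_gains = np.log1p(gains)
skew_after = log_gains.skew()

print(round(float(skew_before), 2), round(float(skew_after), 2))
```

The transformed series is less skewed but still not symmetric, because the large block of zeros remains. For a column like this, a separate binary "has any gain" indicator could be worth trying alongside the log feature.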

In [36]:
# Final Prepared Dataset
print(df_encoded.shape)
df_encoded.head()

(32561, 18)


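| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | capital_gain | capital_loss | hours_per_week | native_country | sex_ Male | income_ >50K | capital_net_gain | hours_per_age | log_capital_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 2174 | 0 | 40 | 39 | True | False | 2174 | 1.0 | 7.684784 |
| 1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 0 | 0 | 13 | 39 | True | False | 0 | 0.254902 | 0.0 |
| 2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 0 | 0 | 40 | 39 | True | False | 0 | 1.025641 | 0.0 |
| 3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 0 | 0 | 40 | 39 | True | False | 0 | 0.740741 | 0.0 |
| 4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 40 | 5 | False | False | 0 | 1.37931 | 0.0 |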
