## Hands-On Material 2
Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

### 1. Data Loading and Initial Exploration

This cell imports the necessary libraries, defines the column names for the Adult dataset (the raw file has no header row), and loads the data from the UCI repository.

In [46]:
import pandas as pd
import numpy as np

col_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'
]

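# Fields in adult.data are separated by ", ", so a missing value appears as ' ?'
# (with a leading space); na_values=' ?' converts those markers to NaN.
df_adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                       header=None, names=col_names, na_values=' ?', engine='python')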

df_adult.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
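Before cleaning, it can help to see how many values were actually parsed as missing. The following is a minimal sketch on a hypothetical three-row sample (the rows and the reduced column list are made up for illustration, not taken from the real file) that mimics the `", "`-separated layout and counts `NaN` values per column:

```python
import io
import pandas as pd

# Hypothetical sample in the same ", "-separated layout as adult.data
sample = io.StringIO(
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 180211, >50K\n"
    "32, Private, 186824, <=50K\n"
)
cols = ["age", "workclass", "fnlwgt", "income"]

# ' ?' (with the leading space) matches the raw field text, so it becomes NaN
toy = pd.read_csv(sample, header=None, names=cols,
                  na_values=" ?", engine="python")

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on `df_adult` would show which columns of the full dataset contain missing entries.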


### 2. Data Cleaning

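Here, we remove every row that contains at least one missing value (`NaN`, produced from the `' ?'` markers during loading) so that the dataset is complete before preprocessing and modeling. `inplace=True` modifies `df_adult` directly and returns `None`, so the result must not be assigned back to a variable.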

In [47]:
df_adult.dropna(inplace=True)
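To make the effect of `dropna` concrete, here is a small sketch on a hypothetical four-row frame (the values are invented for illustration). It uses the non-in-place form, which returns a new DataFrame, so the row counts before and after can be compared:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete rows
toy = pd.DataFrame({
    "age": [39, 54, 32, 45],
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "occupation": ["Sales", "Tech-support", np.nan, "Sales"],
})

rows_before = len(toy)
cleaned = toy.dropna()      # returns a new DataFrame; toy itself is unchanged
rows_after = len(cleaned)

print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
```

A row is removed if any of its columns is missing, which is why both the row with a missing `workclass` and the row with a missing `occupation` disappear.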

### 3. Feature Engineering and Data Splitting

First, we separate the features (X) from the target variable (y). Then, we convert categorical features into a numerical format using one-hot encoding with `pd.get_dummies()`. Finally, we split the encoded data into training and testing sets, using stratification to maintain the same proportion of income classes in both splits.

In [50]:
from sklearn.model_selection import train_test_split

X = df_adult.drop('income', axis=1)
y = df_adult['income']

X_encoded = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, 
    y,
    test_size=0.3,
    random_state=0,
    stratify=y
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (21113, 104)
Test set shape: (9049, 104)
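The two ideas in this step, one-hot encoding and stratified splitting, can be checked on a small scale. The sketch below uses a hypothetical ten-row frame (invented values, with a 60/40 class balance) to show the columns that `pd.get_dummies()` creates and to confirm that `stratify` keeps the class proportions the same in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 rows of '<=50K' and 4 rows of '>50K'
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 29, 41, 36, 58, 23, 44],
    "sex": ["Male", "Female"] * 5,
    "income": ["<=50K"] * 6 + [">50K"] * 4,
})

# Numeric columns pass through; each category becomes its own indicator column
X_toy = pd.get_dummies(toy.drop("income", axis=1))
y_toy = toy["income"]
print(list(X_toy.columns))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

# Class proportions in each split
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

Encoding before splitting, as done above, guarantees that the training and test sets share the same dummy columns.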


### 4. Feature Scaling (Min-Max Scaling)

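This cell applies Min-Max scaling to the features. This technique rescales each feature linearly to a fixed range, which is [0, 1] with the default `feature_range=(0, 1)`. We `fit` the scaler on the training data only, so that the minimum and maximum of each feature are learned without looking at the test set, and then `transform` both the training and test data with those learned values. Fitting on the test data as well would leak information about it into the preprocessing step.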

In [51]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
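As a rough illustration of what the scaler computes, the sketch below applies `MinMaxScaler` to a hypothetical single-feature array (values chosen for easy arithmetic) and compares the result with the formula `(x - min) / (max - min)`, where `min` and `max` come from the training data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[20.0], [30.0], [60.0]])
test = np.array([[10.0], [40.0], [80.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=20, max=60 from train
test_scaled = scaler.transform(test)         # reuses the training min and max

# Same computation by hand, using training statistics only
train_min, train_max = train.min(axis=0), train.max(axis=0)
manual_test = (test - train_min) / (train_max - train_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test set is scaled with training statistics, test values below the training minimum or above the training maximum fall outside [0, 1]. That is expected behavior, not an error.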

### 5. Feature Scaling (Standardization)

This cell applies standardization (also called Z-score normalization). This technique centers each feature at a mean of 0 and rescales it to a standard deviation of 1. It does not change the shape of a feature's distribution, so a skewed feature remains skewed after scaling. As with Min-Max scaling, we `fit` on the training data only and then `transform` both sets.

In [52]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
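The sketch below, on a hypothetical two-feature array, checks that the output of `StandardScaler` matches the Z-score formula `(x - mean) / std` computed by hand. It assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
train = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [10.0, 700.0],
])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)

# Z-score by hand, using the per-column mean and population standard deviation
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(train_std.mean(axis=0))   # approximately 0 for each column
print(train_std.std(axis=0))    # approximately 1 for each column
```

After scaling, both columns have comparable magnitudes, which matters for models that are sensitive to feature scale.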