<a href="https://colab.research.google.com/github/gabitza-tech/ETTI-SummerSchool2025/blob/main/Students_MachineLearning_Intro_FeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 PART 1 - Introduction: Binary Classification with the Adult Income Dataset

In this exercise, we will explore the **Adult Income** dataset, a widely used dataset for **classification tasks**.  

The main goal of this exercise is to:  
- Gain a solid understanding of the dataset  
- Preprocess and prepare it for a **binary classification** task  
- Apply Machine Learning techniques to predict income levels  

---

### 🛠 Tools & Libraries
We will primarily use **Scikit-Learn**, a powerful and versatile library for **Data Science** and **Machine Learning**. Some of its highlights:  

- Ready-to-use datasets for **prototyping and experimentation**  
- Built-in **data preprocessing tools**  
- Wide selection of **Machine Learning algorithms**  
- Easy evaluation with common metrics such as:  
  - ✅ Accuracy  
  - ✅ Precision  
  - ✅ Recall  
  - ✅ F1-score  

Scikit-Learn makes it straightforward to experiment with different models, and while it works well out-of-the-box, understanding the **hyperparameters** is important for improving performance.
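To make the four metrics listed above concrete, here is a minimal sketch on a handful of made-up labels. The arrays `y_true` and `y_hat` are hypothetical and have nothing to do with the Adult data; they only illustrate how each score is computed.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_hat)    # correct predictions / all predictions
precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so all four scores come out equal. On imbalanced data they usually diverge, which is why accuracy alone can be misleading.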

---

### 💾 Saving Results
All results, including **screenshots and brief explanations**, will be saved in Google Docs for documentation purposes.

---

### 📥 Step 1: Load the Dataset
Let's start by loading the **Adult Income** dataset and exploring its structure.


In [2]:
!pip install scikit-learn pandas



In [73]:
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np
import math 
# Load Adult dataset
adult = fetch_openml("adult", version=2, as_frame=True)
x, y = adult.data, adult.target

adult

{'data':        age     workclass  fnlwgt     education  education-num  \
 0       25       Private  226802          11th              7   
 1       38       Private   89814       HS-grad              9   
 2       28     Local-gov  336951    Assoc-acdm             12   
 3       44       Private  160323  Some-college             10   
 4       18           NaN  103497  Some-college             10   
 ...    ...           ...     ...           ...            ...   
 48837   27       Private  257302    Assoc-acdm             12   
 48838   40       Private  154374       HS-grad              9   
 48839   58       Private  151910       HS-grad              9   
 48840   22       Private  201490       HS-grad              9   
 48841   52  Self-emp-inc  287927       HS-grad              9   
 
            marital-status         occupation relationship   race     sex  \
 0           Never-married  Machine-op-inspct    Own-child  Black    Male   
 1      Married-civ-spouse    Farming-fishin

# 📝 Exercise 1: Exploring the Dataset

In this exercise, we will take a closer look at the dataset by examining the features (**X**) and the labels (**y**).  
> **Hint:** The dataset is in **pandas DataFrame** format, so you can leverage all the familiar pandas methods.

### Tasks

1. **Preview the data**: Display the first 5 samples in the dataset.
2. **Dataset size**: How many samples are there in total?
3. **Feature count**: How many features does each sample have?
4. **Number of classes**: How many unique classes are present in the target variable?
5. **Class distribution**: How many samples belong to each class?
6. **Missing values**: Identify the total number of missing values.  
   > **Hint:** Use the `isna()` method in pandas.
7. **Missing value percentages**: Compute what percentage of each feature contains missing data.
8. **Feature types**: Determine the type (categorical or numerical) of the features that contain missing values.

---

Take your time to explore the dataset thoroughly—this step is crucial for **data cleaning** and **preprocessing**, which can significantly impact model performance.


In [74]:
# Missing values in the raw data are marked with "?"; replace them with NaN
print(type(x))
print(type(y))
x = x.replace("?", pd.NA)

# First 5 samples (kept in variables; call display(first_x) to view them)
first_x = x.head(5)
first_y = y.head(5)
# Number of samples (rows) and number of features (columns)
print(x.shape[0])
print(x.shape[1])
# How many classes?
print(y.nunique())
# Samples per class
print(y.unique())
cnt = int((y == ">50K").sum())
print("Samples with '>50K':", cnt)
print("Samples with '<=50K':", y.shape[0] - cnt)
# How many missing values per feature? => isna().sum()
print(x.isna().sum())

# Type of features
print(x.dtypes)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
48842
14
2
['<=50K', '>50K']
Categories (2, object): ['<=50K', '>50K']
Samples with '>50K': 11687
Samples with '<=50K': 37155
age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
dtype: int64
age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status    category
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
dtype: object
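
Tasks 6 to 8 (total missing values, percentage per feature, and the types of the affected features) can be sketched as follows. The small `toy` frame is a hypothetical stand-in so that the snippet runs on its own; on the real data the same calls would be applied to `x`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Adult features
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", None, "Local-gov", None],
    "occupation": ["Sales", "Tech-support", None, "Sales"],
})

missing_total = int(toy.isna().sum().sum())  # total number of missing cells
missing_pct = toy.isna().mean() * 100        # share of missing values per column, in %
cols_with_missing = missing_pct[missing_pct > 0].index
missing_dtypes = toy[cols_with_missing].dtypes  # types of the affected columns

print(missing_total)
print(missing_pct.round(2))
print(missing_dtypes)
```

`isna().mean()` works because `True` counts as 1 and `False` as 0, so the column mean is the fraction of missing entries.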


# 📝 Exercise 2: Handling Missing Data

Now that we have identified the missing values in the dataset, and since each affected feature is missing in **fewer than 6% of the samples**, we can remove the incomplete rows.  

### Tasks

1. Drop the samples with missing values from both **features (X)** and **labels (y)**.  
   > **Hint:** Use `dropna()`.
2. Check the **new size** of the dataset.
3. Calculate **how much data was lost** after removing the missing entries.


In [76]:
# Drop samples that have missing values, for both features and labels
x_drop = x.dropna()
y_drop = y.loc[x_drop.index]  # keep only the labels of the surviving rows

lost_samples = x.shape[0] - x_drop.shape[0]

print("Lost samples:", lost_samples)
# Check new dataset size
print(x_drop.shape[0])
print(y_drop.shape[0])
# Share of the original dataset that was lost
print(f"Lost samples (%): {lost_samples / x.shape[0] * 100:.2f}%")

Lost samples: 3620
45222
45222
Lost samples (%): 7.41%


# 🔄 Transforming Categorical Data into Numbers

Many Machine Learning algorithms require **numerical inputs**, so we need to convert categorical features into numbers.  

For this, we will use **`LabelEncoder`** from `scikit-learn`.  

> This time, we will handle this step for you. Next time, you will be on your own! 🙂
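
As a small illustration of what `LabelEncoder` does, the sketch below encodes a hypothetical list of colour names. Note that scikit-learn documents `LabelEncoder` as a tool for target labels, with `OrdinalEncoder` as the feature-oriented counterpart; applying it column by column, as in this notebook, gives the same kind of integer codes.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ is sorted; each integer code is the position of the label in that array
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

print(codes)
print(mapping)
```

The codes follow alphabetical order, not any meaningful ranking, so a model may read `red > green > blue` into the numbers. This is one motivation for One-Hot Encoding.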


In [104]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Numerically encode features based on the number of unique values
le = LabelEncoder()

# Separate categorical and numeric columns
cat_cols = x_drop.select_dtypes(include=["object", "category"]).columns
num_cols = x_drop.select_dtypes(include=["int64", "float64"]).columns

print(cat_cols)
print(num_cols)
# 1. Label Encoder - Individual numerical values for each categorical value in a feature
x_encoded = x_drop.copy()
for col in cat_cols:
  x_encoded[col] = le.fit_transform(x_drop[col])

# x_encoded has the same structure as x, but with categorical columns having numerical values now
x_encoded.head()

print("data_type_x_encoded",x_encoded.dtypes)
# For labels, we can simply transform them in numerical values in the case of binary classification
y_encoded = le.fit_transform(y_drop)
print(set(y_encoded))


Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country'],
      dtype='object')
Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')
data_type_x_encoded age               int64
workclass         int64
fnlwgt            int64
education         int64
education-num     int64
marital-status    int64
occupation        int64
relationship      int64
race              int64
sex               int64
capital-gain      int64
capital-loss      int64
hours-per-week    int64
native-country    int64
dtype: object
{np.int64(0), np.int64(1)}


# 📝 Exercise 3: Train-Test Split

Next, we will split our dataset into a **training set** and a **test set**.  

> In real-world applications, we would usually also create a **validation set**, but for this introductory exercise, we will keep it simple with just two splits.  

### Tasks

1. Split the data into **train (70%)** and **test (30%)** sets using `train_test_split()`.  
   > **Note:** Use the `stratify` parameter to ensure class proportions are preserved.
2. Check the **number of samples** in the train and test sets.
3. Verify that the **class distribution** is similar in both sets.  
   - Example: If the training set has 65% `'>50K'` and 35% `'<=50K'`, the test set should have a **similar distribution**.


In [84]:
from sklearn.model_selection import train_test_split
import numpy as np

# Stratified 70/30 split; random_state makes it reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x_encoded, y_encoded, test_size=0.3, stratify=y_encoded, random_state=42
)

# Number of samples in each split
print("Train samples:", x_train.shape[0])
print("Test samples:", x_test.shape[0])
# Share of the dataset in each split
print(f"Train share: {y_train.shape[0] / y_encoded.shape[0] * 100:.2f}%")
print(f"Test share: {y_test.shape[0] / y_encoded.shape[0] * 100:.2f}%")
# Class balance check: compare np.bincount(y_train) / len(y_train)
# with the same ratio computed for y_test

Train samples: 31655
Test samples: 13567
Train share: 70.00%
Test share: 30.00%
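
Task 3 asks whether the class distribution is preserved across the two splits. A minimal sketch of such a check is shown here on hypothetical synthetic labels (75% class 0, 25% class 1); with the notebook's variables, the same `np.bincount` calls would be applied to `y_train` and `y_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 150 samples of class 0, 50 of class 1
X_toy = np.arange(400).reshape(200, 2)
y_toy = np.array([0] * 150 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Fraction of each class in the two splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print("Train class shares:", train_share)
print("Test class shares:", test_share)
```

Without `stratify`, the two share vectors can drift apart by chance, especially for small or heavily imbalanced datasets.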

# ⚡ Training & Inference with KNN and Logistic Regression

Now that our data is **cleaned**, **encoded**, and **split** into training and test sets, we can move on to **making predictions**.  

In this section, we will train a **K Nearest Neighbours** classifier and a **Logistic Regression** model, two of the simplest and most widely used algorithms for **binary classification**, and then evaluate their performance on the test set.

# 📝 Exercise 4: Comparing 2 Machine Learning Classification Methods
### Tasks

1. Which algorithm has the better performance? A: the second one (Logistic Regression) has the higher accuracy.
2. Which algorithm is faster to train? Do some profiling on training time taken for each. A: the second one has the higher accuracy but also needs more time, so the first one (KNN) is faster to train, even though it is less accurate.
3. How does changing the `n_neighbors` parameter for KNN affect the results, and how does `max_iter` affect the results for Logistic Regression?
    A: the accuracy improves, but ...
4. Why isn't it ok to simply change the training hyper-parameters of the methods and then evaluate them on the test set?
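
Regarding question 4: if hyper-parameters are tuned against the test set, the test score stops being an unbiased estimate of performance on unseen data. One common alternative, sketched below under the assumption that cross-validation on the training set is acceptable for this exercise, is `GridSearchCV`. The data here is synthetic (`make_classification`), not the Adult dataset, and the candidate values for `n_neighbors` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# 5-fold cross-validation on the training set selects n_neighbors
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # the test set is used only once, at the end

print("Best n_neighbors:", best_k)
print("Test accuracy:", test_accuracy)
```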

In [92]:
import time

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier

# --- KNN: time the call to fit() ---
t_start = time.time()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
t_knn = time.time() - t_start
print("KNN training time:", t_knn)
# --- Make predictions ---
y_pred_knn = knn.predict(x_test)

print("\nKNN Classification Report:\n", classification_report(y_test, y_pred_knn))

# --- Logistic Regression: time the call to fit() ---
# The very large max_iter only removes the iteration cap; on unscaled data
# the solver can still stop early with a convergence warning.
t_start = time.time()
clf = LogisticRegression(max_iter=10000000000000)
clf.fit(x_train, y_train)
t_clf = time.time() - t_start

print("Logistic Regression training time:", t_clf)
# Predict and evaluate
y_pred_lr = clf.predict(x_test)
print("\nLogistic Regression Classification Report:\n", classification_report(y_test, y_pred_lr))


# Alternatively, just for checking accuracy
acc = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", acc)


KNN training time: 0.0818777084350586

KNN Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.91      0.85     10205
           1       0.54      0.32      0.40      3362

    accuracy                           0.76     13567
   macro avg       0.67      0.61      0.63     13567
weighted avg       0.74      0.76      0.74     13567

Logistic Regression training time: 33.27241230010986

Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.95      0.88     10205
           1       0.70      0.35      0.47      3362

    accuracy                           0.80     13567
   macro avg       0.76      0.65      0.67     13567
weighted avg       0.79      0.80      0.78     13567

Accuracy: 0.8026829807621434


STOP: TOTAL NO. OF F,G EVALUATIONS EXCEEDS LIMIT

You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
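
The timings above only cover `fit()`. For KNN, fitting mostly stores (and optionally indexes) the training data, while the neighbour search happens at prediction time, so it can be informative to time both phases. The sketch below does this on synthetic data; the helper `timed` is a hypothetical convenience function, not part of scikit-learn.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)


def timed(fn, *args):
    """Call fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


model = KNeighborsClassifier(n_neighbors=5)
_, fit_seconds = timed(model.fit, X_demo, y_demo)
preds, predict_seconds = timed(model.predict, X_demo)

print(f"fit: {fit_seconds:.4f}s, predict: {predict_seconds:.4f}s")
```

A single measurement is noisy; repeating the call several times and averaging gives a more reliable comparison.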


# 📏 Scaling Numerical Features

When working with numerical data, it’s important to **scale features** so that they have a comparable range.  
For example, scaling values to a **range between 0 and 1** can improve the performance of distance-based models such as KNN and help the solver behind Logistic Regression converge.

Scikit-Learn provides built-in scalers, so we don’t have to implement them manually.

---

### Common Scaling Techniques

**1. StandardScaler**  
Scales each feature to have **mean** $\mu = 0$ and **standard deviation** $\sigma = 1$:  

$$
z = \frac{x - \mu}{\sigma}
$$

**2. MinMaxScaler**  
Scales each feature to a specified range, by default $[0, 1]$:  

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

> You can also choose other ranges, e.g., $[-1, 1]$.
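
The two formulas can be checked by hand against the scikit-learn scalers. The sketch below uses a hypothetical single-feature array `ages`; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[18.0], [25.0], [40.0], [57.0]])  # hypothetical single feature

# Manual versions of the two formulas
z_manual = (ages - ages.mean()) / ages.std()  # population std (ddof=0)
mm_manual = (ages - ages.min()) / (ages.max() - ages.min())

z_sklearn = StandardScaler().fit_transform(ages)
mm_sklearn = MinMaxScaler().fit_transform(ages)

# Same idea with a custom target range
mm_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
print(mm_sym.min(), mm_sym.max())
```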

---

**⚠️ Important:** Always **FIT** your scaler on the **TRAINING set** only.  
Scaling the train and test set together can lead to **data leakage**, which will make your results **skewed, biased, and unfair**.

For example, centering the data with the mean of the **entire** dataset lets the distribution of the test set, which may differ from that of the training set, influence training.
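
One way to make the "fit on train only" rule hard to break is to wrap the scaler and the model in a `Pipeline`. This is a sketch on synthetic data, offered as one possible approach rather than the method used in this notebook: calling `fit` on the pipeline fits the scaler on the training data only, and `score` reuses those training statistics on the test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)                    # scaler statistics come from X_tr only
test_accuracy = pipe.score(X_te, y_te)  # X_te is transformed with the training statistics

scaler = pipe.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print("Test accuracy:", test_accuracy)
```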


# 📝 Exercise 5: Scaling and Its Impact

In this exercise, we will explore how **different scaling methods** affect model performance.

### Tasks

1. Scale your data using:  
   - `StandardScaler()`  
   - `MinMaxScaler()`  
   > **Hint:** Use `fit_transform()` on the training data and `transform()` on the test data.
2. Verify that your scaled data **looks different** from the original data.  
   > It’s always good to double-check that scaling was applied correctly.
3. Train **two separate Logistic Regression models**:  
   - One on the StandardScaler data  
   - One on the MinMaxScaler data  
4. Compare the results. Which scaling method gives **better performance** on the test set?


In [95]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# CODE HERE
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(x_train)
X_test_std = scaler_std.transform(x_test)

scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(x_train)
X_test_minmax = scaler_minmax.transform(x_test)

print(X_train_std[:2])
print('---')
print(X_train_minmax[:2])



[[ 0.33639527 -0.21803666 -0.51625421  0.44399784  1.51020289 -1.71538391
  -0.74030198 -0.25640362  0.38407079 -1.44321823 -0.14805735 -0.21840026
   0.74727875  0.26588708]
 [ 1.7706406   0.82313856  0.52499458 -0.33960785  1.12030743 -1.71538391
   1.2495518  -0.25640362  0.38407079  0.69289591 13.08181239 -0.21840026
  -0.0844001   0.26588708]]
---
[[0.35616438 0.33333333 0.08282957 0.8        0.86666667 0.
  0.23076923 0.2        1.         0.         0.         0.
  0.5        0.95      ]
 [0.61643836 0.5        0.15753371 0.6        0.8        0.
  0.84615385 0.2        1.         1.         1.         0.
  0.39795918 0.95      ]]


In [102]:

# --- Train Logistic Regression on StandardScaler data ---
clf_std = LogisticRegression(max_iter=1000)
clf_std.fit(X_train_std, y_train)

# --- Make predictions ---
y_pred_std = clf_std.predict(X_test_std)
print("\nLogistic Regression (StandardScaler) Classification Report:\n", classification_report(y_test, y_pred_std))

# --- Train Logistic Regression on MinMaxScaler data ---
clf_minmax = LogisticRegression(max_iter=1000)
clf_minmax.fit(X_train_minmax, y_train)

# --- Make predictions ---
y_pred_minmax = clf_minmax.predict(X_test_minmax)
print("\nLogistic Regression (MinMaxScaler) Classification Report:\n", classification_report(y_test, y_pred_minmax))


Logistic Regression (StandardScaler) Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.94      0.89     10205
           1       0.72      0.44      0.55      3362

    accuracy                           0.82     13567
   macro avg       0.78      0.69      0.72     13567
weighted avg       0.81      0.82      0.80     13567


Logistic Regression (MinMaxScaler) Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.94      0.89     10205
           1       0.72      0.43      0.54      3362

    accuracy                           0.82     13567
   macro avg       0.78      0.69      0.71     13567
weighted avg       0.81      0.82      0.80     13567



# 🚀 PART 2 - Transforming Categorical Data with One-Hot Encoding

Many Machine Learning algorithms require **numerical input**, so categorical features must be converted into numbers.  

This time, we will use **`OneHotEncoder`** from `scikit-learn`, which creates a **binary column for each category** in a feature.  

> Good news: You already have the One-Hot Encoded data ready as `x_encoded` and `y_encoded`. (lucky you!)  

However, you will need to **repeat the previous steps** with this new encoding:  
- Split the data into train and test sets  
- Scale the numerical features if necessary  
- Train the model  
- Compare the results with the previous **LabelEncoder** approach  

This will help you understand **how encoding choices impact model performance**.


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# One-Hot Encode Features
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

# Separate categorical and numeric columns
cat_cols = x_drop.select_dtypes(include=["object", "category"]).columns
num_cols = x_drop.select_dtypes(include=["int64", "float64"]).columns

# One-Hot Encode categorical features - Each feature, will be One-Hot Encoded
x_encoded_array = ohe.fit_transform(x_drop[cat_cols])

# Convert encoded array back to DataFrame
x_encoded_df = pd.DataFrame(
    x_encoded_array,
    columns=ohe.get_feature_names_out(cat_cols),
    index=x_drop.index
)

# Combine numeric columns and encoded categorical columns
x_encoded = pd.concat([x_drop[num_cols], x_encoded_df], axis=1)

# x_encoded now has more columns than the original x, but with categorical columns one-hot encoded
x_encoded.head()

# For labels, we can simply transform them into numerical values in the case of binary classification
le = LabelEncoder()
y_encoded = le.fit_transform(y_drop)
print(set(y_encoded))


# 📝 Exercise 6: One-Hot Encoding and Its Impact

In this exercise, we will evaluate how **One-Hot Encoding** affects model performance compared to Label Encoding.

### Tasks

1. Determine **how many features** the One-Hot Encoded dataset now contains.
2. Split the data into **train and test sets**, keeping the **same proportions** as in the first case for a fair comparison.
3. Train **two Logistic Regression classifiers**:  
   - One with scaled features  
   - One without scaling
4. Compare the results and analyze:  
   - What are the **performance gains or losses** with One-Hot Encoding compared to Label Encoding?
5. Is training faster or slower than previously? Do some profiling of time taken for training.

In [None]:
# Code here

# 📝 Exercise 7 (Optional - Homework): Handling Missing Values

For the curious minds: so far, we simply **dropped samples** with missing values.  
But what if we **keep them** and try to fill in the missing information?  

### Tasks

1. Treat the missing values (marked `'?'` in the raw data, `NaN` after loading) as a **separate category** for categorical features.  
   > Hint: This is very easy to do in pandas.
2. Fill the missing values:  
   - **Categorical features:** use the **most frequent value** in the column  
   - **Numerical features:** use the **mean or median** of the column
3. Train your model again and **compare the results**.  
   - Does filling missing values improve performance, or not?


In [None]:
# Code here
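
Both strategies from the task list can be sketched in plain pandas. The `toy` frame and the category name `"Missing"` are hypothetical. In a real experiment the fill values (mode, median) should be computed on the training set only, for the same leakage reasons as with scaling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "workclass": ["Private", np.nan, "Private", "Gov"],
})
workclass = toy["workclass"].astype("category")

# Option 1: keep missingness as its own category
# (a category-typed column needs the new category registered first)
as_category = workclass.cat.add_categories("Missing").fillna("Missing")

# Option 2: fill with the most frequent value / the median
most_frequent = workclass.mode()[0]
mode_filled = workclass.fillna(most_frequent)
age_filled = toy["age"].fillna(toy["age"].median())

print(as_category.tolist())
print(mode_filled.tolist())
print(age_filled.tolist())
```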

# 📝 Exercise 8 (Optional - Homework): Exploring Interesting Relationships

At the end of the day, the goal of data analysis and predictions is to **generate insights** that can have a real-world impact.  
This exercise encourages you to explore **interesting patterns and relationships** in the Adult Income dataset.

### Suggested Questions to Explore

1. Where do people earning **>50K/year** come from?  
2. How many hours do they work in each occupation or category?  
3. What is the **average age** in each category?  
4. How is the **gender balance** for each income category?  
5. How many hours do people in each category work?  
6. Any **other observations** that you find interesting or surprising.  

> Feel free to use **groupby**, **pivot tables**, or **visualizations** to uncover meaningful patterns.


In [None]:
# Code here
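
A possible starting point for these questions is `groupby` plus `crosstab`. The `toy` frame below is invented for illustration; on the real data you would first join the features and the target into one DataFrame (for example by adding an `income` column to a copy of `x_drop`).

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Adult dataset
toy = pd.DataFrame({
    "income": [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "age": [45, 23, 52, 31, 28, 39],
    "hours-per-week": [50, 30, 45, 40, 20, 60],
})

# Average age and weekly hours per income category
summary = toy.groupby("income")[["age", "hours-per-week"]].mean()

# Gender balance within each income category (each row sums to 1)
gender_share = pd.crosstab(toy["income"], toy["sex"], normalize="index")

print(summary)
print(gender_share)
```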