# **Homework 1 - BDA**
### **Francesco Natali 1945581**

The goal of this project is to predict whether an individual earns over $50K per year using demographic and occupational features. The task is approached as a supervised classification problem.


In [3]:
# Import necessary libraries for data manipulation, preprocessing, modeling, and evaluation.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load data
data = pd.read_csv("/content/data1 (2).csv", sep=';')

# Drop education and native-country as instructed
df = data.drop(['education', 'native-country'], axis=1)
print(df.head())

   age          workclass  fnlwgt  education-num       marital-status  \
0   39          State-gov   77516             13        Never-married   
1   50   Self-emp-not-inc   83311             13   Married-civ-spouse   
2   38            Private  215646              9             Divorced   
3   53            Private  234721              7   Married-civ-spouse   
4   28            Private  338409             13   Married-civ-spouse   

           occupation    relationship    race      sex  capital-gain  \
0        Adm-clerical   Not-in-family   White     Male          2174   
1     Exec-managerial         Husband   White     Male             0   
2   Handlers-cleaners   Not-in-family   White     Male             0   
3   Handlers-cleaners         Husband   Black     Male             0   
4      Prof-specialty            Wife   Black   Female             0   

   capital-loss  hours-per-week  target  
0             0              40   <=50K  
1             0              13   <=50K  
2 

Occupational categories are reclassified into five broader groups to reduce complexity and facilitate modeling.

In [6]:
# Combine occupation into 5 categories
# Define the mapping
occupation_map = {
    # Group 1: Office-based and technical roles
    'Adm-clerical': 'Office & Tech Roles',
    'Exec-managerial': 'Office & Tech Roles',
    'Prof-specialty': 'Office & Tech Roles',
    'Tech-support': 'Office & Tech Roles',

    # Group 2: Sales and personal services
    'Sales': 'Sales & Personal Services',
    'Other-service': 'Sales & Personal Services',
    'Priv-house-serv': 'Sales & Personal Services',

    # Group 3: Technical and mechanical work
    'Craft-repair': 'Technical & Mechanical Work',
    'Machine-op-inspct': 'Technical & Mechanical Work',
    'Handlers-cleaners': 'Technical & Mechanical Work',
    'Transport-moving': 'Technical & Mechanical Work',

    # Group 4: Agriculture and military
    'Farming-fishing': 'Agriculture & Military',
    'Armed-Forces': 'Agriculture & Military',

    # Group 5: Security and protective services
    'Protective-serv': 'Security & Protection'
}

# Strip surrounding whitespace first
df['occupation'] = df['occupation'].str.strip()
# Remove rows where occupation is '?'; copy so the new column below is not assigned on a slice
df = df[df['occupation'] != '?'].copy()
# Apply the mapping
df['occupation_grouped'] = df['occupation'].map(occupation_map)
print(df.head())

   age          workclass  fnlwgt  education-num       marital-status  \
0   39          State-gov   77516             13        Never-married   
1   50   Self-emp-not-inc   83311             13   Married-civ-spouse   
2   38            Private  215646              9             Divorced   
3   53            Private  234721              7   Married-civ-spouse   
4   28            Private  338409             13   Married-civ-spouse   

          occupation    relationship    race      sex  capital-gain  \
0       Adm-clerical   Not-in-family   White     Male          2174   
1    Exec-managerial         Husband   White     Male             0   
2  Handlers-cleaners   Not-in-family   White     Male             0   
3  Handlers-cleaners         Husband   Black     Male             0   
4     Prof-specialty            Wife   Black   Female             0   

   capital-loss  hours-per-week  target   occupation_grouped  
0             0              40   <=50K  Office & Tech Roles  
1       
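Because `Series.map` returns `NaN` for any value that is missing from the dictionary, it can be worth checking for unmapped labels after regrouping. The following is a minimal sketch on hypothetical toy data; `sample`, `toy_map`, and the label `Unknown-job` are illustrative names and are not part of the homework dataset.

```python
import pandas as pd

# Hypothetical toy data: 'Unknown-job' is deliberately absent from the mapping
sample = pd.DataFrame({'occupation': [' Sales', ' Tech-support', ' Unknown-job']})
toy_map = {
    'Sales': 'Sales & Personal Services',
    'Tech-support': 'Office & Tech Roles',
}

# Strip whitespace, then map; values missing from the dictionary become NaN
sample['occupation'] = sample['occupation'].str.strip()
sample['occupation_grouped'] = sample['occupation'].map(toy_map)

# Collect the labels that the mapping did not cover
unmapped = sample.loc[sample['occupation_grouped'].isna(), 'occupation'].unique().tolist()
print("Unmapped occupations:", unmapped)
```

An empty list would indicate that every occupation label received a group.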

The target variable is defined and the dataset is split into training and validation subsets. Then, categorical and numerical features are identified to set up a preprocessing pipeline that standardizes numeric values and applies one-hot encoding to categorical variables.

In [7]:
# Define target and features
# Note: X keeps both the raw 'occupation' column and 'occupation_grouped'
X = df.drop(['target'], axis=1)
y = df['target'].str.strip()  # Remove extra whitespace from the target labels

# Train/validation split (70% training, 30% validation)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Remove 'education' and 'native-country' if still in the list
categorical_cols = [col for col in categorical_cols if col not in ['education', 'native-country']]

# Column transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])
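To illustrate what this kind of preprocessor produces, here is a small sketch on hypothetical toy data (the frames `toy_train` and `toy_new`, the transformer `toy_pre`, and the helper `to_dense` are illustrative names, not part of the homework). It assumes the same two transformers as above and shows that `handle_unknown='ignore'` encodes a category never seen during fitting as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames (not the homework data)
toy_train = pd.DataFrame({'age': [20, 30, 40, 50],
                          'sex': ['Male', 'Female', 'Female', 'Male']})
toy_new = pd.DataFrame({'age': [35], 'sex': ['Other']})  # 'Other' never appears in toy_train

toy_pre = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex'])
])


def to_dense(m):
    """Return a NumPy array whether the transformer output is sparse or dense."""
    return m.toarray() if hasattr(m, 'toarray') else np.asarray(m)


train_out = to_dense(toy_pre.fit_transform(toy_train))
new_out = to_dense(toy_pre.transform(toy_new))

print(train_out.shape)  # one scaled column plus one indicator column per category
print(new_out)          # the unseen category gives all-zero indicator columns
```

Placing the preprocessor inside a `Pipeline`, as the later cells do, means its scaling statistics and category lists are learned from the training data only.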

Now we define and evaluate three different classification models: Random Forest, Gradient Boosting, and Logistic Regression. Each model is trained using the training set and evaluated using the validation set. The performance is assessed based on accuracy, the classification report (which includes precision, recall, and F1-score), and the confusion matrix to analyze the true positives, false positives, true negatives, and false negatives.

In [8]:
# Models to evaluate
models = {
    'Random Forest': RandomForestClassifier(random_state=123),
    'Gradient Boosting': GradientBoostingClassifier(random_state=123),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=123)
}

# Loop over models
for name, model in models.items():
    clf = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_val)

    print(f"--- {name} ---")
    print("Accuracy:", accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))
    print("\n")

--- Random Forest ---
Accuracy: 0.8516710069444444
              precision    recall  f1-score   support

       <=50K       0.88      0.92      0.90      6946
        >50K       0.73      0.63      0.68      2270

    accuracy                           0.85      9216
   macro avg       0.81      0.78      0.79      9216
weighted avg       0.85      0.85      0.85      9216

[[6416  530]
 [ 837 1433]]


--- Gradient Boosting ---
Accuracy: 0.8648003472222222
              precision    recall  f1-score   support

       <=50K       0.88      0.95      0.91      6946
        >50K       0.79      0.62      0.69      2270

    accuracy                           0.86      9216
   macro avg       0.84      0.78      0.80      9216
weighted avg       0.86      0.86      0.86      9216

[[6573  373]
 [ 873 1397]]


--- Logistic Regression ---
Accuracy: 0.8489583333333334
              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.90      6946
        >50K    

**Random Forest**: Achieved an accuracy of 0.852, with strong precision for <=50K (0.88) but lower recall for >50K (0.63), indicating difficulty in identifying high-income individuals.

**Gradient Boosting**: Slightly outperformed Random Forest with an accuracy of 0.865 and the same precision for <=50K (0.88). Its precision for >50K was higher (0.79 versus 0.73), but its recall for >50K was marginally lower (0.62 versus 0.63), so it still missed many high-income individuals.

**Logistic Regression**: Had the lowest accuracy (0.849), with strong performance for <=50K but poor precision and recall for >50K, struggling to identify high-income individuals effectively.
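The figures quoted for Random Forest can be recomputed from its confusion matrix, which makes the link between the matrix and the report explicit. The sketch below uses only the numbers printed above; the variable names are illustrative, and the majority-class baseline is added here as a point of comparison rather than taken from the original analysis.

```python
# Confusion matrix reported above for the untuned Random Forest
# (rows = true class, columns = predicted class; order: <=50K, >50K)
cm = [[6416, 530],
      [837, 1433]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)  # of those predicted >50K, the share that truly are
recall_pos = tp / (tp + fn)     # of those truly >50K, the share that was found
baseline = (tn + fp) / total    # accuracy of always predicting <=50K

print(f"accuracy={accuracy:.3f} precision={precision_pos:.2f} recall={recall_pos:.2f}")
print(f"majority-class baseline={baseline:.3f}")
```

Since roughly three quarters of the validation rows belong to the <=50K class, accuracy alone flatters every model, which is why the >50K recall is examined separately.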

To optimize the models and identify the best hyperparameters, we use GridSearchCV. This method performs an exhaustive search over a specified parameter grid for each model and selects the combination with the highest mean cross-validated accuracy (3 folds here). The following cells apply it to the Random Forest, Gradient Boosting, and Logistic Regression models.
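To give a sense of the cost of an exhaustive search, the sketch below enumerates the candidates for a grid shaped like the Random Forest grid in the next cell and counts the cross-validation fits. It uses only the standard library; `candidates`, `n_candidates`, and `n_fits` are illustrative names.

```python
from itertools import product

# Grid with the same shape as the Random Forest grid used in this homework
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
}
cv_folds = 3

# Every combination of the listed values is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the full training set.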

In [9]:
# Random Forest – Grid Search
param_grid_rf = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5]
}

rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=123))
])

grid_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=3, n_jobs=-1, scoring='accuracy')
grid_rf.fit(X_train, y_train)

print("Best RF parameters:", grid_rf.best_params_)
print("Best RF score:", grid_rf.best_score_)

Best RF parameters: {'classifier__max_depth': 20, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 200}
Best RF score: 0.8598736660711794


In [10]:
# Gradient Boosting – Grid Search
param_grid_gb = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.1, 0.05],
    'classifier__max_depth': [3, 5]
}

gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=123))
])

grid_gb = GridSearchCV(gb_pipeline, param_grid_gb, cv=3, n_jobs=-1, scoring='accuracy')
grid_gb.fit(X_train, y_train)

print("Best GB parameters:", grid_gb.best_params_)
print("Best GB score:", grid_gb.best_score_)

Best GB parameters: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 5, 'classifier__n_estimators': 200}
Best GB score: 0.8653150722952255


In [11]:
# Logistic Regression – Grid Search
param_grid_lr = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs']
}

lr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=123))
])

grid_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=3, n_jobs=-1, scoring='accuracy')
grid_lr.fit(X_train, y_train)

print("Best LR parameters:", grid_lr.best_params_)
print("Best LR score:", grid_lr.best_score_)

Best LR parameters: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
Best LR score: 0.846154068259699


After hyperparameter tuning with GridSearchCV, we evaluate the optimized models on the validation set. The best estimators for Random Forest, Gradient Boosting, and Logistic Regression, each refit on the full training set with its best parameters, are used to predict the outcomes, and their performance is assessed using accuracy, precision, recall, F1-score, and the confusion matrix.

In [12]:
# Best models
best_rf = grid_rf.best_estimator_
best_gb = grid_gb.best_estimator_
best_lr = grid_lr.best_estimator_

# Predict and evaluate
for name, model in [('Random Forest', best_rf),
                    ('Gradient Boosting', best_gb),
                    ('Logistic Regression', best_lr)]:
    y_pred = model.predict(X_val)
    print(f"\n--- {name} (Tuned) ---")
    print("Accuracy:", accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))


--- Random Forest (Tuned) ---
Accuracy: 0.8614366319444444
              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91      6946
        >50K       0.77      0.62      0.69      2270

    accuracy                           0.86      9216
   macro avg       0.83      0.78      0.80      9216
weighted avg       0.86      0.86      0.86      9216

[[6528  418]
 [ 859 1411]]

--- Gradient Boosting (Tuned) ---
Accuracy: 0.8716362847222222
              precision    recall  f1-score   support

       <=50K       0.90      0.94      0.92      6946
        >50K       0.78      0.67      0.72      2270

    accuracy                           0.87      9216
   macro avg       0.84      0.80      0.82      9216
weighted avg       0.87      0.87      0.87      9216

[[6521  425]
 [ 758 1512]]

--- Logistic Regression (Tuned) ---
Accuracy: 0.8490668402777778
              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.90    

Among the three tuned models, Gradient Boosting achieved the highest accuracy on the validation set (87.16%) and outperformed the others in detecting individuals earning over 50K, with a precision of 0.78, recall of 0.67, and F1-score of 0.72 for the positive class. These metrics indicate a better balance between false positives and false negatives compared to Random Forest and Logistic Regression.

Random Forest performed comparably in terms of overall accuracy (86.14%) and had nearly the same precision on the >50K class (0.77 versus 0.78), but its lower recall (0.62) shows that it missed more high-income individuals. Logistic Regression, while simpler and faster to train, underperformed with the lowest accuracy (84.91%) and F1-score (0.66) on the >50K class.

Therefore, Gradient Boosting is selected as the best-performing model for this binary classification task: among the tuned models it achieves the highest validation accuracy and the highest F1-score on the >50K class.
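The recall gap between the two tree-based models can be read directly from the tuned confusion matrices. The sketch below uses only the matrices printed above for Random Forest and Gradient Boosting (the Logistic Regression matrix is not shown in full, so it is left out); `tuned` and `summary` are illustrative names.

```python
# Tuned confusion matrices reported above
# (rows = true class, columns = predicted class; order: <=50K, >50K)
tuned = {
    'Random Forest': [[6528, 418], [859, 1411]],
    'Gradient Boosting': [[6521, 425], [758, 1512]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in tuned.items():
    f1_pos = 2 * tp / (2 * tp + fp + fn)  # F1-score for the >50K class
    summary[name] = {'missed_high_income': fn, 'f1_pos': round(f1_pos, 2)}
    print(name, summary[name])
```

Gradient Boosting misclassifies about a hundred fewer high-income individuals than Random Forest while producing almost the same number of false positives.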