# **Final Project Task 3 - Census Modeling Regression**

Requirements

- You can use models (estimators) from sklearn, but feel free to use any library for traditional ML.
    - Note: in sklearn, the LinearRegression estimator is based on OLS, a statistical method. Please use the SGDRegressor estimator, since this is based on gradient descent. 
    - You can use LinearRegression estimator, but only as comparison with the SGDRegressor - Optional.

- Model Selection and Setup:
    - Implement multiple models, to solve a regression problem using traditional ML:
        - Linear Regression
        - Decision Tree Regression
        - Random Forest Regression - Optional
        - Ridge Regression - Optional
        - Lasso Regression - Optional
    - Choose a loss (or experiment with different losses) for the model and justify the choice.
        - MSE, MAE, RMSE, Huber Loss or others
    - Justify model choices based on dataset characteristics and task requirements; specify model pros and cons.
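As a rough illustration of the loss choice discussed above, the sketch below writes the four losses by hand in NumPy and evaluates them on made-up numbers, once without and once with a single extreme target value. The data and the `delta=1.0` threshold are assumptions chosen only for the demonstration.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared errors: large errors are squared, so outliers dominate.
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    # Square root of MSE, expressed in the units of the target.
    return float(np.sqrt(mse(y_true, y_pred)))

def mae(y_true, y_pred):
    # Mean of absolute errors: grows only linearly with the error size.
    return float(np.mean(np.abs(y_true - y_pred)))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it.
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

# Made-up targets (hours per week); every prediction is off by exactly 1 hour.
y_clean = np.array([40.0, 40.0, 38.0, 45.0, 40.0])
y_pred = y_clean + 1.0

# Same targets, but one observation replaced by an extreme value.
y_outlier = y_clean.copy()
y_outlier[0] = 99.0

for name, fn in [("MSE", mse), ("RMSE", rmse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name:5s} clean={fn(y_clean, y_pred):8.2f}  with outlier={fn(y_outlier, y_pred):8.2f}")
```

On this toy input the squared losses react far more strongly to the single outlier than MAE or Huber do, which is the usual argument for preferring MAE or Huber when the target has many extreme values.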


- Data Preparation
    - Use the preprocessed datasets from Task 1.
    - From the train set, create an extra validation set, if necessary. So in total there will be: train, validation and test datasets.
    - Be sure all models have their data preprocessed as needed. Some models require different, or no encoding for some features.
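One way to give each model the preprocessing it needs is to wrap the encoding and scaling steps in a `ColumnTransformer` inside a `Pipeline`. The sketch below is only an illustration: the toy frame and the choice of `age` and `workclass` as columns are assumptions, not the Task 1 output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; the column names only mimic the census data.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60],
    "workclass": ["Private", "Private", "Self-emp-inc", "State-gov", "Private", "State-gov"],
    "hours-per-week": [40, 45, 60, 38, 40, 20],
})
X = toy.drop(columns=["hours-per-week"])
y = toy["hours-per-week"]

# Linear models need scaled numeric features and encoded categorical features.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])

linear_pipe = Pipeline([
    ("prep", preprocess),
    ("model", SGDRegressor(random_state=42)),
])

linear_pipe.fit(X, y)
preds = linear_pipe.predict(X)
print(preds.shape)
```

Tree-based models do not need the scaling step, so a second pipeline for them could keep the encoder and drop the scaler.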


- Model Training and Experimentation
    - Establish a Baseline Model:
        - For each model type, train a simple model with default settings as a baseline.
        - Evaluate its performance to establish a benchmark for comparison.
    - Make plots with train, validation loss and metric on epochs (or on steps), if applicable. - Optional
    - Feature Selection:
        - Use insights from EDA in Task 2 to identify candidate features by analyzing patterns, relationships, and distributions.
    - Experimentation:
        - For each baseline model type, iteratively experiment with different combinations of features and transformations.
        - Experiment with feature engineering techniques such as interaction terms, polynomial features, or scaling transformations.
        - Identify the model that has the best performance metrics on the test set.
    - Hyperparameter Tuning:
        - Perform hyperparameter tuning only on the best-performing model after evaluating all model types and experiments.
        - Avoid tuning models that do not show strong baseline performance or are unlikely to outperform others based on experimentation.
        - Ensure that hyperparameter tuning is done after completing feature selection, baseline modeling, and experimentation, ensuring that the model is stable and representative of the dataset.
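For the optional train/validation curves over epochs mentioned in the list above, one possible approach is to call `SGDRegressor.partial_fit` once per epoch and record a metric after each pass. The sketch below uses synthetic data and an arbitrary 30 epochs; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data (assumption): 3 features, target centred near 40.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 40 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=500)
X_tr, y_tr = X[:400], y[:400]
X_va, y_va = X[400:], y[400:]

model = SGDRegressor(learning_rate="constant", eta0=0.001, random_state=42)

train_mae, val_mae = [], []
for epoch in range(30):
    # One pass over the shuffled training data per epoch.
    order = rng.permutation(len(X_tr))
    model.partial_fit(X_tr[order], y_tr[order])
    train_mae.append(mean_absolute_error(y_tr, model.predict(X_tr)))
    val_mae.append(mean_absolute_error(y_va, model.predict(X_va)))

print(f"epoch 1 val MAE: {val_mae[0]:.3f}, epoch 30 val MAE: {val_mae[-1]:.3f}")
```

The two lists can then be passed to `plt.plot` to draw the train and validation curves on the same axes.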


- Model Evaluation
    - Evaluate models on the test dataset using regression metrics:
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - Root Mean Squared Error (RMSE)
        - R² Score
    - Compare the results across different models. Save all experiment results into a table.

Feature Importance - Optional
- For applicable models (e.g., Decision Tree Regression), analyze feature importance and discuss its relevance to the problem.
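A minimal sketch of the feature importance analysis, assuming a tree-based model and made-up data in which only `age` drives the target. The feature names and the data-generating rule are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up data: the target depends on 'age' only; the other columns are irrelevant.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=300),
    "education-num": rng.integers(1, 17, size=300),
    "noise": rng.normal(size=300),
})
y = 20 + 0.5 * X["age"] + rng.normal(scale=1.0, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances; they are normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased towards high-cardinality features, so `sklearn.inspection.permutation_importance` on a held-out set is a common cross-check.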



Deliverables

- Notebook code with no errors.
- Code and results from experiments. Create a table with all experiment results, including the experiment name and the metric results.
- Explain findings, choices, results.
- Potential areas for improvement or further exploration.


In [73]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
#data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
#columns = [
#    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
#    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
#    "hours-per-week", "native-country", "income"
#]

#data = pd.read_csv(data_url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
#data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
22791,39,Self-emp-inc,372525,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,60,United-States,>50K
10185,48,Self-emp-not-inc,164582,Some-college,10,Married-civ-spouse,Farming-fishing,Husband,White,Male,7298,0,60,United-States,>50K
25298,41,Self-emp-not-inc,97277,Assoc-voc,11,Divorced,Other-service,Unmarried,White,Female,0,0,10,United-States,<=50K
5372,27,Private,36851,Assoc-voc,11,Married-civ-spouse,Sales,Husband,White,Male,0,0,35,United-States,<=50K
10345,28,Private,261725,1st-4th,2,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,0,40,Mexico,<=50K
11658,19,?,318264,Some-college,10,Never-married,?,Own-child,White,Male,0,0,30,United-States,<=50K
4441,43,Self-emp-inc,286750,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Black,Male,0,0,99,United-States,>50K
7324,58,Private,175127,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
19787,29,Private,46609,10th,6,Never-married,Craft-repair,Not-in-family,Black,Male,0,0,40,?,<=50K
8466,25,Private,359985,5th-6th,3,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,33,Mexico,<=50K


## Load datasets

In [74]:
train_raw = pd.read_csv("train_raw.csv")
test_raw = pd.read_csv("test_raw.csv")

In [75]:
train_df = train_raw.copy()
test_df = test_raw.copy()

In [76]:
target_variable = "hours-per-week"

# Separating features and target variable 
x_train = train_raw.drop(columns=[target_variable])
y_train = train_raw[target_variable]

x_test = test_raw.drop(columns=[target_variable])
y_test = test_raw[target_variable]

## Split train set into train and validation set

In [77]:
train_df, val_df = train_test_split(train_raw, test_size=0.2, random_state=42)

## Preprocess transformations

In [78]:
# applying standardization, normalization and scaling on train, test and validation set
def standardization_transform(train_df, test_df, val_df, columns):
    _train_df = train_df.copy()
    _test_df = test_df.copy()
    _val_df = val_df.copy()

    ss = StandardScaler()
    _train_df[columns] = ss.fit_transform(_train_df[columns])
    _test_df[columns] = ss.transform(_test_df[columns])
    _val_df[columns] = ss.transform(_val_df[columns])

    return _train_df, _test_df, _val_df

def normalization_transform(train_df, test_df, val_df, columns):
    _train_df = train_df.copy()
    _test_df = test_df.copy()
    _val_df = val_df.copy()

    mms = MinMaxScaler()
    _train_df[columns] = mms.fit_transform(_train_df[columns])
    _test_df[columns] = mms.transform(_test_df[columns])
    _val_df[columns] = mms.transform(_val_df[columns])

    return _train_df, _test_df, _val_df

def scaling_transform(train_df, test_df, val_df, columns):
    _train_df = train_df.copy()
    _test_df = test_df.copy()
    _val_df = val_df.copy()

    rs = RobustScaler()
    _train_df[columns] = rs.fit_transform(_train_df[columns])
    _test_df[columns] = rs.transform(_test_df[columns])
    _val_df[columns] = rs.transform(_val_df[columns])

    return _train_df, _test_df, _val_df
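The three transform functions above differ only in the scaler they create, so they could be folded into one helper that receives the scaler as an argument. This is a sketch of that idea, not the notebook's code: `scale_splits` is a hypothetical name and the toy frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale_splits(scaler, train_df, test_df, val_df, columns):
    """Fit `scaler` on the train split only, then transform all three splits."""
    train_out, test_out, val_out = (df.copy() for df in (train_df, test_df, val_df))
    # Fitting on train only avoids leaking test/validation statistics.
    train_out[columns] = scaler.fit_transform(train_out[columns])
    test_out[columns] = scaler.transform(test_out[columns])
    val_out[columns] = scaler.transform(val_out[columns])
    return train_out, test_out, val_out

# Made-up frames with a single numeric feature.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "hours-per-week": [40, 40, 50]})
test = pd.DataFrame({"age": [50.0], "hours-per-week": [45]})
val = pd.DataFrame({"age": [30.0], "hours-per-week": [38]})

tr_std, te_std, va_std = scale_splits(StandardScaler(), train, test, val, ["age"])
tr_mm, te_mm, va_mm = scale_splits(MinMaxScaler(), train, test, val, ["age"])
print(tr_std["age"].tolist(), te_mm["age"].tolist())
```

With such a helper, `preprocess1`, `preprocess2` and `preprocess3` would differ only in the scaler instance they pass in.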

In [79]:
#Defining preprocess functions 
def preprocess1(train_df, test_df, val_df):
    s_columns = ['age', 'education-num', 'capital-gain', 'education_simplified', 'sex', 'full_time']
    
    _train_df, _test_df, _val_df = standardization_transform(train_df, test_df, val_df, s_columns)

    all_columns = s_columns + [target_variable]
    _train_df = _train_df[all_columns]
    _test_df = _test_df[all_columns]
    _val_df = _val_df[all_columns]

    print("Preprocessed train_df (head):")
    print(_train_df.head())

    return _train_df, _test_df, _val_df

In [80]:
train_df_p1, test_df_p1, val_df_p1 = preprocess1(train_df, test_df, val_df)


Preprocessed train_df (head):
            age  education-num  capital-gain  education_simplified       sex  \
13242  0.276354       1.301974     -0.208253              1.642777 -1.260805   
16886 -0.691167      -0.326248     -0.208253             -0.344529  0.793144   
5     -1.063291      -0.326248     -0.208253             -0.344529  0.793144   
2656  -0.542318      -0.326248     -0.208253             -0.344529 -1.260805   
13244  1.020602      -0.326248      3.711542             -0.344529  0.793144   

       full_time  hours-per-week  
13242  -0.165111              25  
16886  -0.165111              40  
5      -0.165111              40  
2656   -0.165111              40  
13244  -0.165111              40  


In [81]:
def preprocess2(train_df, test_df, val_df):
    s_columns = ['age', 'education-num', 'capital-gain', 'education_simplified', 'sex', 'full_time']
    
    _train_df, _test_df, _val_df = normalization_transform(train_df, test_df, val_df, s_columns)

    all_columns = s_columns + [target_variable]
    _train_df = _train_df[all_columns]
    _test_df = _test_df[all_columns]
    _val_df = _val_df[all_columns]

    return _train_df, _test_df, _val_df

In [82]:
train_df_p2, test_df_p2, val_df_p2 = preprocess2(train_df, test_df, val_df)

In [83]:
def preprocess3(train_df, test_df, val_df):
    s_columns = ['age', 'education-num', 'capital-gain', 'education_simplified', 'sex', 'full_time']
    
    _train_df, _test_df, _val_df = scaling_transform(train_df, test_df, val_df, s_columns)

    all_columns = s_columns + [target_variable]
    _train_df = _train_df[all_columns]
    _test_df = _test_df[all_columns]
    _val_df = _val_df[all_columns]

    return _train_df, _test_df, _val_df

In [84]:
train_df_p3, test_df_p3, val_df_p3 = preprocess3(train_df, test_df, val_df)

In [85]:
datasets = {
    'p1': (train_df_p1, test_df_p1, val_df_p1),
    'p2': (train_df_p2, test_df_p2, val_df_p2),
    'p3': (train_df_p3, test_df_p3, val_df_p3)
}

## Regression Models

In [90]:
def get_models(name):
    models = {
        'linear_regression_ols': LinearRegression,
        'linear_regression_sgd': lambda **kwargs: SGDRegressor(random_state=42, max_iter=1000, tol=1e-3, learning_rate='adaptive', eta0=0.001, penalty='l2', **kwargs),
        'decision_tree': lambda **kwargs: DecisionTreeRegressor(random_state=42, **kwargs),
        'random_forest': lambda **kwargs: RandomForestRegressor(random_state=42, **kwargs)
    }
    if name not in models:
        raise ValueError(f"Model name '{name}' does not exist!")
    return models[name]

# 🔹 Preprocessed data sets
datasets = {
    'p1': (train_df_p1, test_df_p1, val_df_p1),  
    'p2': (train_df_p2, test_df_p2, val_df_p2),  
    'p3': (train_df_p3, test_df_p3, val_df_p3)   
}
# 🔹 Define experiments
experiments = [
    {'name': 'exp1', 'model_name': 'linear_regression_ols', 'dataset': 'p1', 'kwargs': {}},
    {'name': 'exp2', 'model_name': 'linear_regression_ols', 'dataset': 'p2', 'kwargs': {}},
    {'name': 'exp3', 'model_name': 'linear_regression_ols', 'dataset': 'p3', 'kwargs': {}},
    
    {'name': 'exp4', 'model_name': 'linear_regression_sgd', 'dataset': 'p1', 'kwargs': {}},
    {'name': 'exp5', 'model_name': 'linear_regression_sgd', 'dataset': 'p2', 'kwargs': {}},
    {'name': 'exp6', 'model_name': 'linear_regression_sgd', 'dataset': 'p3', 'kwargs': {}},

    {'name': 'exp7', 'model_name': 'decision_tree', 'dataset': 'p1', 'kwargs': {}},
    {'name': 'exp8', 'model_name': 'decision_tree', 'dataset': 'p2', 'kwargs': {}},
    {'name': 'exp9', 'model_name': 'decision_tree', 'dataset': 'p3', 'kwargs': {}},

    {'name': 'exp10', 'model_name': 'random_forest', 'dataset': 'p1', 'kwargs': {}},
    {'name': 'exp11', 'model_name': 'random_forest', 'dataset': 'p2', 'kwargs': {}},
    {'name': 'exp12', 'model_name': 'random_forest', 'dataset': 'p3', 'kwargs': {}}
]

# 🔹 Running experiments
for experiment in experiments:
    exp_name = experiment['name']
    model_name = experiment['model_name']
    dataset_key = experiment['dataset']
    model_kwargs = experiment['kwargs']

    train_df, test_df, val_df = datasets[dataset_key]

    X_train = train_df.drop(columns=['hours-per-week']).values
    y_train = train_df['hours-per-week'].values
    X_val = val_df.drop(columns=['hours-per-week']).values
    y_val = val_df['hours-per-week'].values
    X_test = test_df.drop(columns=['hours-per-week']).values
    y_test = test_df['hours-per-week'].values

    # Model initialization and training
    model_class = get_models(model_name)
    model = model_class(**model_kwargs)

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Performance metrics calculation: I chose mean absolute error because the data has quite a lot of outliers (over 5000),
    # and MAE is less sensitive to outliers than MSE or RMSE.
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{exp_name}: {model_name} -> MAE: {mae:.4f}, R²: {r2:.4f}")

exp1: linear_regression_ols -> MAE: 5.1142, R²: 0.0688
exp2: linear_regression_ols -> MAE: 5.1142, R²: 0.0688
exp3: linear_regression_ols -> MAE: 5.1142, R²: 0.0688
exp4: linear_regression_sgd -> MAE: 5.1133, R²: 0.0688
exp5: linear_regression_sgd -> MAE: 5.1144, R²: 0.0686
exp6: linear_regression_sgd -> MAE: 380464548964.1384, R²: -57624107197348121149440.0000
exp7: decision_tree -> MAE: 4.4652, R²: 0.1037
exp8: decision_tree -> MAE: 4.4624, R²: 0.1035
exp9: decision_tree -> MAE: 4.4624, R²: 0.1040
exp10: random_forest -> MAE: 4.4224, R²: 0.1560
exp11: random_forest -> MAE: 4.4216, R²: 0.1561
exp12: random_forest -> MAE: 4.4198, R²: 0.1567
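The loop above prints MAE and R² but does not store them. A possible extension, sketched below, computes all four required metrics in one helper and collects the rows in a DataFrame, so the results table does not have to be typed by hand. The helper name `regression_report` and the demo predictions are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(exp_name, model_name, y_true, y_pred):
    """Return one table row with MAE, MSE, RMSE and R² for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "Experiment": exp_name,
        "Model": model_name,
        "Test MAE": mean_absolute_error(y_true, y_pred),
        "Test MSE": mse,
        "Test RMSE": float(np.sqrt(mse)),
        "Test R²": r2_score(y_true, y_pred),
    }

# Made-up targets and two made-up "models", used only to show the table layout.
y_true = np.array([40, 45, 38, 50])
rows = [
    regression_report("demo1", "constant_40", y_true, np.full(4, 40.0)),
    regression_report("demo2", "perfect", y_true, y_true),
]

results_df = pd.DataFrame(rows)
print(results_df)
```

Inside the experiment loop, the same helper could be called with `exp_name`, `model_name`, `y_test` and `y_pred`, appending each row to a list.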


## Hyperparameter tuning

In [91]:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import joblib

#Defining parameters
param_dist = { 
    'max_depth': [None, 5, 10, 20], 
    'min_samples_split': np.arange(2, 20),  
    'min_samples_leaf': np.arange(1, 20),  
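    # Note: 'auto' is rejected for RandomForestRegressor by recent scikit-learn versions (removed in 1.3);
    # any candidate that samples it fails parameter validation, which explains the failed fits reported in the output.
    'max_features': ['auto', 'sqrt', 'log2', None],    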
}

# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist, 
    n_iter=10,  
    cv=5, 
    scoring='neg_mean_absolute_error', 
    random_state=42,
    n_jobs=-1  
)

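# Perform the random search.
# Note: X_train and y_train are the arrays left over from the last iteration of the experiment loop (dataset 'p3').
random_search.fit(X_train, y_train)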

# Print the best hyperparameters and score

print("Best hyperparameters:", random_search.best_params_)
print("Best score:", -random_search.best_score_)

# Evaluation on the validation set
best_model = random_search.best_estimator_
y_val_pred = best_model.predict(X_val)
mae_val = mean_absolute_error(y_val, y_val_pred)
r2_val = r2_score(y_val, y_val_pred)
print(f"Validation MAE: {mae_val:.4f}, Validation R²: {r2_val:.4f}")


# Final evaluation on the test set
y_test_pred = best_model.predict(X_test)
mae_test = mean_absolute_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"Test MAE: {mae_test:.4f}, Test R²: {r2_test:.4f}")

# Get the best model
best_rf_model = random_search.best_estimator_
# Print the best model 
print("Best Random Forest Model:", best_rf_model)
joblib.dump(best_rf_model, 'best_random_forest_model.pkl')



5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Iulia\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Iulia\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "c:\Users\Iulia\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter

Best hyperparameters: {'min_samples_split': np.int64(12), 'min_samples_leaf': np.int64(1), 'max_features': None, 'max_depth': 10}
Best score: 4.3399734277814375
Validation MAE: 4.4590, Validation R²: 0.2372
Test MAE: 4.3501, Test R²: 0.2212
Best Random Forest Model: RandomForestRegressor(max_depth=10, max_features=None,
                      min_samples_leaf=np.int64(1),
                      min_samples_split=np.int64(12), random_state=42)


['best_random_forest_model.pkl']

## Interpretation of tuning model
- Best score: 4.3399734277814375  -- This is the mean cross-validated Mean Absolute Error (MAE) of the best parameter combination found by RandomizedSearchCV. The search uses the `neg_mean_absolute_error` scorer (RandomizedSearchCV maximizes the score, while MAE has to be minimized), so `best_score_` is negative and the code flips its sign before printing. The MAE is therefore approximately 4.34, averaged over the 5 held-out folds of the train data.
- Validation MAE: 4.4590 -- This is the MAE on the validation set: on average, the model's predictions are off by about 4.46 hours per week.
- Validation R²: 0.2372 -- This is the coefficient of determination on the validation set. It indicates that about 24% of the variance in the target variable (hours-per-week) is explained by the features included in the model.
- The results on the test set are similar: R² is 0.2212 (about 22% of the variance of the target variable) and the MAE is 4.3501.
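The sign convention of the score and a search space without the rejected `'auto'` value can be sketched as follows. The synthetic data, the reduced grid, and the small `n_estimators`, `n_iter` and `cv` values are assumptions chosen to keep the example fast; they are not the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption): one informative feature plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 40 + 5 * X[:, 0] + rng.normal(scale=2.0, size=200)

param_dist = {
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
    # 'auto' is left out; 1.0 (all features) is included as a valid alternative.
    "max_features": ["sqrt", "log2", None, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
search.fit(X, y)

# best_score_ is the negated MAE, so flip the sign to report the MAE itself.
cv_mae = -search.best_score_
print("Best params:", search.best_params_)
print(f"Cross-validated MAE: {cv_mae:.4f}")
```

Setting `error_score="raise"` is one way to surface invalid parameter values immediately rather than discovering them in a warning after the search finishes.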


## Regression Models - Report

In [62]:
%pip install tabulate
%pip install jinja2


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip




## Models Overview

In [92]:
from tabulate import tabulate
from jinja2 import Template 

# Results list
results = [
    {'Experiment': 'exp1', 'Model': 'linear_regression_ols', 'Test MAE': 5.1142, 'Test R²': 0.0688},
    {'Experiment': 'exp2', 'Model': 'linear_regression_ols', 'Test MAE': 5.1142, 'Test R²': 0.0688},
    {'Experiment': 'exp3', 'Model': 'linear_regression_ols', 'Test MAE': 5.1142, 'Test R²': 0.0688},
    {'Experiment': 'exp4', 'Model': 'linear_regression_sgd', 'Test MAE': 5.1133, 'Test R²': 0.0688},
    {'Experiment': 'exp5', 'Model': 'linear_regression_sgd', 'Test MAE': 5.1144, 'Test R²': 0.0686},
    {'Experiment': 'exp6', 'Model': 'linear_regression_sgd', 'Test MAE': 380464548964.1384, 'Test R²': -57624107197348121149440.0000},
    {'Experiment': 'exp7', 'Model': 'decision_tree', 'Test MAE': 4.4652, 'Test R²': 0.1037},
    {'Experiment': 'exp8', 'Model': 'decision_tree', 'Test MAE': 4.4624, 'Test R²': 0.1035},
    {'Experiment': 'exp9', 'Model': 'decision_tree', 'Test MAE': 4.4624, 'Test R²': 0.1040},
    {'Experiment': 'exp10', 'Model': 'random_forest', 'Test MAE': 4.4224, 'Test R²': 0.1560},
    {'Experiment': 'exp11', 'Model': 'random_forest', 'Test MAE': 4.4216, 'Test R²': 0.1561},
    {'Experiment': 'exp12', 'Model': 'random_forest', 'Test MAE': 4.4198, 'Test R²': 0.1567},
    {'Experiment': 'exp13', 'Model': 'random_forest_tuned', 'Test MAE': 4.3501, 'Test R²': 0.2212}
]


# DataFrame with list of results
import pandas as pd
results_df = pd.DataFrame(results)

# Best R2 on test set
best_r2 = results_df['Test R²'].max()

def highlight_best_model(row):
    if row['Test R²'] == best_r2:
        return ['font-weight: bold']*len(row) 
    return ['']*len(row)

# Styling
styled_df = results_df.style.apply(highlight_best_model, axis=1)

styled_df


Unnamed: 0,Experiment,Model,Test MAE,Test R²
0,exp1,linear_regression_ols,5.1142,0.0688
1,exp2,linear_regression_ols,5.1142,0.0688
2,exp3,linear_regression_ols,5.1142,0.0688
3,exp4,linear_regression_sgd,5.1133,0.0688
4,exp5,linear_regression_sgd,5.1144,0.0686
5,exp6,linear_regression_sgd,380464548964.1384,-5.762410719734812e+22
6,exp7,decision_tree,4.4652,0.1037
7,exp8,decision_tree,4.4624,0.1035
8,exp9,decision_tree,4.4624,0.104
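9,exp10,random_forest,4.4224,0.156
10,exp11,random_forest,4.4216,0.1561
11,exp12,random_forest,4.4198,0.1567
12,exp13,random_forest_tuned,4.3501,0.2212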


## Results:
- Among the 12 baseline models, Random Forest performs best: it has the lowest test MAE and the highest test R². Among the 3 random forest models, the experiment on the third data set performs best (Test MAE: 4.4198, R²: 0.1567). Overall, however, the tuned random forest model is the best performer, with the smallest mean absolute error (4.3501) and the highest R² value (0.2212).
- Ranking the regression models by performance, random forest comes first, followed by decision tree and then linear regression. OLS-based and SGD-based linear regression are practically tied on the first two data sets (MAE 5.1142 for OLS versus 5.1133 and 5.1144 for SGD), but OLS is more stable: it gives the same result on all three data sets, whereas SGD diverged on the third.
- An important note: the huge error values for experiment 6 are probably due to how the third data set was processed. It was scaled with RobustScaler rather than standardized or normalized, which can leave heavy-tailed features (for example capital-gain) on a very large scale and cause SGD to diverge.
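One possible mechanism behind the experiment 6 values can be sketched on a made-up column. `RobustScaler` divides by the interquartile range, and when that range is zero scikit-learn uses a scale of 1, so the column is left at its raw magnitude. The column below (95 zeros and 5 large values) is an assumption meant to resemble a capital-gain-like feature; whether the Task 1 data behaves this way would need to be checked.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical column: mostly zeros with a few very large values.
col = np.zeros((100, 1))
col[:5, 0] = 99999.0

robust = RobustScaler().fit(col)
standard = StandardScaler().fit(col)

robust_max = robust.transform(col).max()
standard_max = standard.transform(col).max()

# The 25th and 75th percentiles are both 0, so the IQR is 0 and scale_ falls back to 1.
print("RobustScaler scale_:", robust.scale_[0])
print("Max after RobustScaler:", robust_max)
print("Max after StandardScaler:", round(float(standard_max), 2))
```

A feature that stays in the tens of thousands produces very large gradients for `SGDRegressor`, which is consistent with a diverging fit, while OLS and tree-based models are not affected by feature scale in the same way.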

## Justification of the regression model used
1. Linear regression (OLS)
    - I chose linear regression as a starting point for understanding the data sets, because its coefficients make it easy to explain the model's predictions and the effects of the explanatory variables on the target variable.

2. Linear regression (SGD)
    - I chose linear regression with SGD as a gradient-descent-based alternative, which can be more efficient on larger datasets with more explanatory variables.

3. Decision Tree
    - I also chose a decision tree because it can learn complex, non-linear relationships between the target variable and the explanatory variables.

4. Random Forest
    - I also used Random Forest because it combines multiple decision trees to improve model performance and reduce the risk of overfitting. The model also handles complex data better and generalizes better to unseen data.

Note: The variables included in the models were chosen through experimentation; the current set was selected because it gave the highest R² values for model prediction.


## Potential improvements
- The main area for improvement remains the outliers: looking for other ways to deal with them could help normalize the distributions and produce better regression models.
- Another improvement may be to merge categories for some variables so that they can be recoded more easily into fewer categories.