In [1]:
from src.entities import EnvironmentConfiguration

import warnings
warnings.filterwarnings("ignore")
import os

environment_configuration = EnvironmentConfiguration()
from src.preprocessing_pipelines.preprocessing_pipelines import common_data_preprocessing

# Basic

## Common Data Transformations

We are going to apply to our dataset the transformations that are common to every algorithm (enconding etc) and do not imply data leakage(transformations that do not use global statistics like mean, median, min or max)

1. Drop the 'education' column (redundant since 'education.num' encodes the same information).
2. Encode nominal categorical features using one-hot encoding while avoiding multicollinearity 
    (i.e., using dummy encoding with drop_first=True).
3. Create a new feature 'profit' as the difference between 'capital.gain' and 'capital.loss'.
4. Create a new binary feature 'investor' indicating if the person has any capital gain 
    (1 if capital.gain > 0, else 0).
5. Create a new binary feature 'american' indicating if the person is a U.S. citizen 
    (1 if native.country equals 'United-States', else 0).
6. Map the 'income' column to binary labels (0 for '<=50K' and 1 for '>50K').
7. Map the 'sex' column to binary labels (0 for 'Female' and 1 for 'Male').
8. One-hot encode the categorical-nominal features: workclass, marital.status, occupation, 
    relationship, and race (excluding native.country, already encoded as 'american').

In [2]:
import pandas as pd
raw_data = pd.read_csv(environment_configuration.raw_data_folder)

data_without_duplicates = raw_data.drop_duplicates()
common_data_preprocessed_data = common_data_preprocessing(data_without_duplicates)

In [8]:
common_data_preprocessed_data.head(5)

Unnamed: 0,age,fnlwgt,education.num,sex,capital.gain,capital.loss,hours.per.week,income,profit,investor,...,occupation_Transport-moving,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White
0,90,77053,9,0,0,4356,40,0,-4356,0,...,0,1,0,0,0,0,0,0,0,1
1,82,132870,9,0,0,4356,18,0,-4356,0,...,0,1,0,0,0,0,0,0,0,1
2,66,186061,10,0,0,4356,40,0,-4356,0,...,0,0,0,0,1,0,0,1,0,0
3,54,140359,4,0,0,3900,40,0,-3900,0,...,0,0,0,0,1,0,0,0,0,1
4,41,264663,10,0,0,3900,40,0,-3900,0,...,0,0,0,1,0,0,0,0,0,1


## Holdout Test (Validation)

In [3]:
from sklearn.model_selection import train_test_split

# Split the dataset into X (features) and y (target)
X = common_data_preprocessed_data.drop('income', axis=1)  # All features except the 'income' column
y = common_data_preprocessed_data['income']  # The target variable 'income'

# Perform the train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify= y)

# Print the shape of the resulting sets
print(f"Training data shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test data shape: X_test={X_test.shape}, y_test={y_test.shape}")

Training data shape: X_train=(26029, 47), y_train=(26029,)
Test data shape: X_test=(6508, 47), y_test=(6508,)


# Baseline Performance

## Simple Heuristic

Apply the mode to every sample

In [11]:
from sklearn.metrics import fbeta_score
import numpy as np

# Step 1: Identify the majority class
majority_class = y.mode()[0]  # This gets the most frequent class (0 or 1)

# Step 2: Make predictions (always predict the majority class)
y_pred_majority = np.full_like(y_test, majority_class)

# Step 3: Calculate F2 score
f2_score = fbeta_score(y_test, y_pred_majority, beta=2)

print(f"F2 Score for majority class heuristic: {f2_score}")

F2 Score for majority class heuristic: 0.0


## Strong Model - Out of the Box

We are going to test 2 Robust models to see how skillfull the are out of the box

In [None]:
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.metrics import fbeta_score
# XGBoost with 100 estimators

# Initialize XGBoost model
xgb_model = GradientBoostingClassifier(random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model (F2 score)
xgb_f2_score = fbeta_score(y_test, y_pred_xgb, beta=2)

print(f"XGBoost F2 Score: {xgb_f2_score}")

XGBoost F2 Score: 0.6167785234899329


In [None]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import fbeta_score
# Random Forest with 100 estimators

# Initialize XGBoost model
rfc_model = RandomForestClassifier(random_state=42)

# Train the model
rfc_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rfc = rfc_model.predict(X_test)

# Evaluate the model (F2 score)
rfc_f2_score = fbeta_score(y_test, y_pred_rfc, beta=2)

print(f"rfc F2 Score: {rfc_f2_score:.3f}")

rfc F2 Score: 0.634


## Conclusion

Out of the box RandomForestClassifier Got a Result of `f2_score: 0.634`.

Now we have a Baseline to improve it applying the correct `Imbalanced Dataset Framework`

# Experiment 1

## 1. Objective

Apply the `imbalanced Framework` to improve the performance, using heuristics from what works the vest

## 2. Description:

### 2.1 Preprocessing

* The common preprocessment steps applied in `CommonDataTransformations` section
* Use the `RobustScaler()` to put all the data within the same Scale (This scaler is robust to outliers/ Heuristic);
* Resample data Using `SMOTE and Edited Nearest Neighbors - Resampling` (this technique is one tha usually performs the best/Heuristic); 

We should use `pipelines` to perform the transformations inside each cross validation fold (this avoids data-leakage)

### 2.2 Evaluation

#### 2.2.1 Cross Validation

We are going to `RepeatedStratifiedKFold`, because it is the best for imbalanced Datasets for the following reasons:

**2.2.1.1 Maintains Class Balance in Every Split**

* Regular k-fold cross-validation can lead to uneven class distributions in some folds, which is problematic for imbalanced datasets.
* RepeatedStratifiedKFold ensures each fold has the same class proportions as the full dataset, making training and evaluation more reliable.

**2.2.1.2 Reduces Variance in Model Evaluation** 

* A single stratified k-fold split may still introduce randomness in performance metrics.
* By repeating the process multiple times with different splits, the results are more stable and less dependent on how the data was initially partitioned.

**2.2.1.3 Maximizes Data Utilization**

* In small datasets, some samples may appear only in the test set, limiting training opportunities.
* RepeatedStratifiedKFold allows each sample to be used in training and testing multiple times, leading to better generalization.

**2.2.1.4 More Reliable Performance Estimates**

* Imbalanced datasets often suffer from misleading metrics due to the dominance of the majority class.
* Repeating stratified k-fold cross-validation helps in getting a more representative model evaluation, reducing the risk of overfitting to specific splits.

#### 2.2.2 Final Evaluation:

The final model will be evaluated on the `validation dataset`, to ensure it is able to perform in real world datasets

#### 2.2.3 Modeling:

1. Try Cost Sensitive Models (Decision Trees, Logistic Regression with balanced weights, etc...)

2. Try Cost Sensitive Ensembling (Random Forest w/ Class weighting, Random Forest With Bootstrap Class Weighting, Balanced Bagging)

3. Try Cost Sensitive ensemblings using Cost Sensitive Models as base estimators (Use the best models from step 1 as base estimators to step 2)

4. Tune Hyperparameters using cross validation


5. Calibrate Probabilities/Tune the Classification Threshold


6. Ensemble and test on the validation dataset

## Preprocessing

In [None]:
# Check binary and categorical features to avoid scaling them
X_train.nunique()

In [4]:
# Scaler
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer

numeric_features = ["age", "fnlwgt", "capital.gain", "capital.loss", "hours.per.week", "profit",]
categorical_ordinal_features = ["education.num"]

## Just Scale Numeric Features
robust_scaler = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numeric_features),
        ("ord", RobustScaler(), categorical_ordinal_features)
    ],
    remainder="passthrough"
)

# Resampler
from imblearn.combine import SMOTEENN
smoteenn_resampler = SMOTEENN()

# Pipeline
from imblearn.pipeline import Pipeline

steps = [('robust_scaler', robust_scaler),
        ('smoteenn_resampler', smoteenn_resampler)] 

experiment_1_pipeline = Pipeline(steps=steps)


# Cross validation
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

# Scoring
from sklearn.metrics import make_scorer, fbeta_score
scoring = {"f2": make_scorer(fbeta_score, beta=2)}


### Testing the Pipeline

In [27]:
X_scaled_resampled, y_resampled = experiment_1_pipeline.fit_resample(X_train, y_train)

In [37]:
# Convert to pandas DataFrames
X_scaled_resampled_df = pd.DataFrame(X_scaled_resampled, columns=X_train.columns)
y_resampled_df = pd.Series(y_resampled)

# 1. Check the scale of transformed numeric features
# They should be centered around 0 with similar ranges
print("Numeric features statistics after transformation:")
print(X_scaled_resampled_df[numeric_features].describe())

# 2. Compare shapes before and after transformation
print("\nShape before:", X_train.shape)
print("Shape after:", X_scaled_resampled_df.shape)
# SMOTEENN should have changed the number of samples

# 3. Check class distribution before and after resampling
print("\nClass distribution before:", pd.Series(y_train).value_counts())
print("Class distribution after:", pd.Series(y_resampled_df).value_counts())

# 4. Verify untransformed columns remained unchanged
unchanged_cols = [col for col in X_train.columns 
                 if col not in numeric_features + categorical_ordinal_features]
if unchanged_cols:
    print("\nSample of unchanged columns before vs after:")
    print("Before:", X_train[unchanged_cols].head())
    print("After:", X_scaled_resampled_df[unchanged_cols].head())

# 5. Check for any missing values that might have been introduced
print("\nMissing values after transformation:")
print(X_scaled_resampled_df.isnull().sum())


Numeric features statistics after transformation:
                age        fnlwgt  capital.gain  capital.loss  hours.per.week  \
count  30810.000000  30810.000000  30810.000000  30810.000000    30810.000000   
mean       0.148761      0.090882      0.314256   2584.716879       10.637889   
std        0.677218      0.835452      2.400264  12035.498755        2.677982   
min       -1.052632     -1.392634     -7.800000  -4356.000000        1.000000   
25%       -0.368421     -0.472151      0.000000      0.000000        9.000000   
50%        0.121629     -0.008598      0.000000      0.000000       10.000000   
75%        0.598300      0.459047      1.600000      0.000000       13.000000   
max        2.789474     10.958018     11.800000  99999.000000       16.000000   

             profit  
count  30810.000000  
mean       0.717954  
std        0.444395  
min        0.000000  
25%        0.000000  
50%        1.000000  
75%        1.000000  
max        1.000000  

Shape before: (26029,

## Cost Weigthed Models

In [6]:
from src.model_selection import cross_validate_models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

simple_models_weighted = [
    LogisticRegression(solver='lbfgs', class_weight='balanced', penalty= None),
    LogisticRegression(solver='lbfgs', class_weight='balanced', penalty= 'l2'),
    LogisticRegression(solver='liblinear', class_weight='balanced', penalty= 'l1'),    
    DecisionTreeClassifier(class_weight='balanced', max_depth= None),
    DecisionTreeClassifier(class_weight='balanced', max_depth= 50),
    DecisionTreeClassifier(class_weight='balanced', max_depth= 75),
] # Svms are also good for this case, altough they are very computational expensive

simple_cost_sensitive_models_report_path = os.path.join("artifacts", 
                                                        "model_selection", "experiment_1", 
                                                        "simple_cost_sensitive_models.xlsx")

cross_validate_models(
    base_pipeline= experiment_1_pipeline, 
    model_list= simple_models_weighted, 
    X= X_train, 
    y= y_train, 
    scoring= scoring, 
    cv= cv, 
    report_path= simple_cost_sensitive_models_report_path
)


The file was successfully saved to artifacts\model_selection\experiment_1\simple_cost_sensitive_models.xlsx


### Results

The data went throught `robust_scaler()` and then `SMOTEENN()` 

And we got the following results:

| Model                                                                         | f2 Mean  | f2 Std   | f2 Median |
| ----------------------------------------------------------------------------- | -------- | -------- | --------- |
| LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear') | 0,78459  | 0,0089   | 0,783755  |
| LogisticRegression(class_weight='balanced', penalty=None)                     | 0,776524 | 0,010053 | 0,775898  |
| LogisticRegression(class_weight='balanced')                                   | 0,775598 | 0,012051 | 0,776054  |
| DecisionTreeClassifier(class_weight='balanced', max_depth=50)                 | 0,735218 | 0,01123  | 0,733684  |
| DecisionTreeClassifier(class_weight='balanced', max_depth=75)                 | 0,734019 | 0,011054 | 0,733503  |
| DecisionTreeClassifier(class_weight='balanced')                               | 0,733939 | 0,011721 | 0,735074  |

As we can see Logistic Regressors perform the best, specially the `LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear')` with the best mean and lowest Standart Deviation (more stable)



Now we can give a little peak (compare with our validation dataset) to see if we are on the right path:

In [None]:
best_logistic_regressor_pipeline = Pipeline(steps=[('robust_scaler',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num', RobustScaler(),
                                                  ['age', 'fnlwgt',
                                                   'capital.gain',
                                                   'capital.loss',
                                                   'hours.per.week',
                                                   'profit']),
                                                 ('ord', RobustScaler(),
                                                  ['education.num'])])),
                ('smoteenn_resampler', SMOTEENN()),
                ('model',
                 LogisticRegression(class_weight='balanced', penalty='l1',
                                    solver='liblinear'))])

from sklearn.metrics import fbeta_score

# Train the model
best_logistic_regressor_pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred_best_logistic_regressor = best_logistic_regressor_pipeline.predict(X_test)

# Evaluate the model (F2 score)
best_log_reg_f2_score = fbeta_score(y_test, y_pred_best_logistic_regressor, beta=2)

print(f"Best Logistic Regressor F2 Score: {best_log_reg_f2_score:.3f}")

Best Logistic Regressor F2 Score: 0.771


As we can see the Validation score is very close to the score we bot by using cross-validation.

By far, we already got an advantage over our baseline

## Regular Cost-Sensitive Ensemble Models

In [9]:
from src.model_selection import cross_validate_models
import numpy as np

from xgboost import XGBClassifier
from imblearn.ensemble import BalancedBaggingClassifier, EasyEnsembleClassifier, BalancedRandomForestClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Parameter for XGBoost:
total_negative_examples = np.sum(y_train == 0)
total_positive_examples = np.sum(y_train == 1)
scale_pos_weight = total_negative_examples/total_positive_examples

regular_const_sensitive_ensemble_models = [
    XGBClassifier(scale_pos_weight = scale_pos_weight),
    
    # Bagging Algorithms:
    BalancedBaggingClassifier(estimator = DecisionTreeClassifier()),
    BalancedBaggingClassifier(estimator = LogisticRegression()),
    
    # Easy Ensemble Algorithms: 
    EasyEnsembleClassifier(n_estimators= 10), # Uses adaboost as the default estimator
    EasyEnsembleClassifier(n_estimators= 20), # Uses adaboost as the default estimator
    EasyEnsembleClassifier(estimator = LogisticRegression() , n_estimators= 20),  # using logistic Regression

    # Random Forest Based Algorithms:
    BalancedRandomForestClassifier(n_estimators = 10), # Uses Decision Trees
    BalancedRandomForestClassifier(n_estimators = 20) # Uses Decision Trees
]

regular_cost_sensitive_ensemble_report_path = os.path.join("artifacts", 
                                                        "model_selection", "experiment_1", 
                                                        "regular_cost_sensitive_ensemble.xlsx")

cross_validate_models(
    base_pipeline= experiment_1_pipeline, 
    model_list= regular_const_sensitive_ensemble_models, 
    X= X_train, 
    y= y_train, 
    scoring= scoring, 
    cv= cv, 
    report_path= regular_cost_sensitive_ensemble_report_path
)

The file was successfully saved to artifacts\model_selection\experiment_1\regular_cost_sensitive_ensemble.xlsx


### Results

The data went throught `robust_scaler()` and then `SMOTEENN()` 

And we got the following results:


| Model                                                                   | f2 Mean  | f2 Std   | f2 Median |
| ----------------------------------------------------------------------- | -------- | -------- | --------- |
| XGBClassifier(scale_pos_weight = scale_pos_weight)                      | 0,804084 | 0,006663 | 0,803379  |
| EasyEnsembleClassifier(estimator=LogisticRegression(), n_estimators=20) | 0,778849 | 0,009399 | 0,779091  |
| BalancedBaggingClassifier(estimator=LogisticRegression())               | 0,778596 | 0,008683 | 0,780137  |
| EasyEnsembleClassifier()                                                | 0,777076 | 0,009771 | 0,77704   |
| EasyEnsembleClassifier(n_estimators=20)                                 | 0,775906 | 0,010259 | 0,77647   |
| BalancedRandomForestClassifier(n_estimators=20)                         | 0,765735 | 0,011445 | 0,765081  |
| BalancedRandomForestClassifier(n_estimators=10)                         | 0,758    | 0,013299 | 0,756845  |
| BalancedBaggingClassifier(estimator=DecisionTreeClassifier())           | 0,747421 | 0,013164 | 0,745745  |

As we can see:

* `XGBClassifier` was the best classifier with the scale_pos_weight hyperparameter, this model is good and should been tuned further(specially number of estimators);
* `EasyEnsembleClassifier(estimator=LogisticRegression(), n_estimators=20`) was the second best;
* `BalancedBaggingClassifier(estimator=LogisticRegression())` was the third best;
* Two of the best model use Logistic Regression as base estimators, we are going to combine this ensembles with the best Single Logistic Regression model (`LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear')`), and see if we can improve the performance;
* The performances have low standart deviations, so our performance is stable across "splits"; and
* We Already have some good models, improve further more when we apply the `probability tuning techiniques`.


## Best Ensembles Using Best Base Estimators

In [None]:
from src.model_selection import cross_validate_models
from imblearn.ensemble import BalancedBaggingClassifier, EasyEnsembleClassifier
from sklearn.linear_model import LogisticRegression

best_ensembles_using_best_estimators = [    
    
    # Bagging Algorithms:
    BalancedBaggingClassifier(estimator = LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear')), # Best-Base Estimator
    BalancedBaggingClassifier(estimator = LogisticRegression(class_weight='balanced', penalty=None)), # 2nd Best-Base Estimator

    # Easy Ensemble Algorithms: 
    EasyEnsembleClassifier(estimator = LogisticRegression(class_weight='balanced', 
                                                          penalty='l1', solver='liblinear'),
                                                          n_estimators= 20), # Best-Base Estimator

    EasyEnsembleClassifier(estimator = LogisticRegression(class_weight='balanced', penalty=None)), # 2nd Best-Base Estimator
]

best_ensembles_using_best_estimators_report_path = os.path.join("artifacts", 
                                                        "model_selection", "experiment_1", 
                                                        "best_ensembles_using_best_estimators.xlsx")

cross_validate_models(
    base_pipeline= experiment_1_pipeline, 
    model_list= best_ensembles_using_best_estimators, 
    X= X_train, 
    y= y_train, 
    scoring= scoring, 
    cv= cv,  
    report_path= best_ensembles_using_best_estimators_report_path
)  

The file was successfully saved to artifacts\model_selection\experiment_1\best_ensembles_using_best_estimators.xlsx


### Results

The data went throught `robust_scaler()` and then `SMOTEENN()` 

And we got the following results:

| Model                                                                                                                           | f2 Mean  | f2 Std   | f2 Median |
| ------------------------------------------------------------------------------------------------------------------------------- | -------- | -------- | --------- |
| BalancedBaggingClassifier(estimator=LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear'))              | 0,785544 | 0,008117 | 0,785104  |
| EasyEnsembleClassifier(estimator=LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear'),n_estimators=20) | 0,784275 | 0,008907 | 0,785329  |
| BalancedBaggingClassifier(estimator=LogisticRegression(class_weight='balanced', penalty=None))                                  | 0,778873 | 0,009545 | 0,778856  |
| EasyEnsembleClassifier(estimator=LogisticRegression(class_weight='balanced', penalty=None))                                     | 0,778702 | 0,010105 | 0,77826   |


* As we can see the best performance models were ensembles that used the best `LogisticRegressor` from `simple_cost_sensitive_models.xlsx` as the base model. Althought, the performance gain was not significant when compared to 1 single simple logistic regressor;
* Now we can try and perform the hyperparameter tunning, first on the `LogisticRegressor`; after that we can use this logistic regressor as base to `BalancedBaggingClassifier` and `EasyEnsembleClassifier` and tune the other hyperparameters from both ensembles;
* Finally we can tune the hyperparameters of XGBoost (best model so far);
* That way we are going to have a good idea of which models are going to have their Probability Tuned, and be ensembled in the final model


## Hyperparameter Tuning

Altough we can Tune the Classification Threshold during the hyperparameter tunning (probably better results), we are not going to do that, because it will take much more time, and probably improve just a little bit of performance