# Model Optimization
**Project Assumption:** The target model is an SVM.

### Table of Contents
1. Data Loading
    - 1.1 Library Import
    - 1.2 Data Loading
2. Strategy Selection: Rapid Prototyping
    - 2.1 Generating a Stratified Subset (5,000 samples)
    - 2.2 Helper Functions for Training and Testing
    - 2.3 Baseline Model
    - 2.4 Weighted Loss (Class Weight = 'Balanced')
    - 2.5 UnderSampling 1:1
    - 2.6 UnderSampling 2:1
    - 2.7 UnderSampling 10:3
    - 2.8 UnderSampling 5:1
    - 2.9 UnderSampling 2:1 + Balanced
    - 2.10 UnderSampling 5:1 + Balanced
3. Experimental Results & Model Selection
4. Analysis of Suboptimal Model Performance
    - 4.1 Training on a Larger Dataset & Threshold Tuning
    - 4.2 Alternative Model - XGBoost
    - 4.3 Soft Pipeline
    - 4.4 Unethical Pipeline
    - 4.5 More Features Pipeline
    - 4.6 Very Soft Pipeline
    - 4.7 Conclusion
5. Model Optimization
    - 5.1 Initial Optimization
    - 5.2 Extended Hyperparameter Tuning
6. The Best Model
7. Saving Final Model to the File

## 1. Data Loading

### 1.1 Library Import

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

import pickle
import warnings
import joblib

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.model_selection import (
    train_test_split, 
    GridSearchCV, 
    RandomizedSearchCV
)
from sklearn.metrics import f1_score, precision_recall_curve

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler

from xgboost import XGBClassifier

from pipelines.preprocessing_pipeline import preprocessing_pipeline
from pipelines.soft_preprocessing_pipeline import soft_preprocessing_pipeline
from pipelines.very_soft_preprocessing_pipeline import very_soft_preprocessing_pipeline
from pipelines.unethical_preprocessing_pipeline import unethical_preprocessing_pipeline
from pipelines.more_features_preprocessing_pipeline import more_features_preprocessing_pipeline

warnings.filterwarnings('ignore')

Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


### 1.2 Data Loading
The fitted target encoder is saved so that the same label mapping can be reused during model evaluation.

In [35]:
df_train = pd.read_parquet("data/train.parquet")
X_train = df_train.drop(columns=['income_50k'])
y_train_raw = df_train['income_50k']

le = LabelEncoder()
y_train = le.fit_transform(y_train_raw)

joblib.dump(le, 'income_label_encoder.joblib')

['income_label_encoder.joblib']

## 2. Strategy Selection: Rapid Prototyping

Since Support Vector Machines are computationally expensive to train on large datasets, we will perform initial hyperparameter tuning on a **stratified subset of 5,000 rows**.

This allows us to:
1.  **Iterate quickly:** Test different strategies (e.g., Class Weights, Sampling) and narrow down the range of effective hyperparameters without waiting hours for training
2.  **Ensure Reliability:** We utilize **Cross-Validation** to generate trustworthy metrics, ensuring our findings on the subset are robust before applying them to the full dataset
3.  **Target the Right Metric:** We set our optimization objective to **Average Precision**. Given the severe class imbalance, Accuracy is misleading (the "Accuracy Paradox"), as a model could achieve 94% accuracy simply by predicting the majority class. Optimizing for **Average Precision** forces the model to actively learn the minority class boundaries by balancing Precision and Recall.
4. **Fine-Tuning the Prediction Threshold:** Given a solid model, we can adjust the decision threshold to maximize the F1 score.
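The accuracy paradox from point 3 can be made concrete with a small, dependency-free sketch. It assumes a toy sample of 1,000 rows with a 93.8% / 6.2% class split (the proportions of the stratified subset). The `average_precision` helper is a simplified stand-in for scikit-learn's `average_precision_score`, written only for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Simplified stand-in for sklearn.metrics.average_precision_score:
    samples are ranked by descending score and tied scores share one threshold.
    """
    total_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        j = i
        # Consume every sample tied at this score before evaluating the threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        precision, recall = tp / j, tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Assumed toy sample: 938 negatives and 62 positives.
y_true = [0] * 938 + [1] * 62

# A "model" that always predicts the majority class with a constant score.
always_negative = [0] * len(y_true)
constant_scores = [0.0] * len(y_true)

majority_accuracy = accuracy(y_true, always_negative)
majority_ap = average_precision(y_true, constant_scores)
perfect_ap = average_precision(y_true, [float(t) for t in y_true])

print(f"Accuracy of majority-class predictor:  {majority_accuracy:.3f}")
print(f"Average Precision of constant scores:  {majority_ap:.3f}")
print(f"Average Precision of a perfect ranker: {perfect_ap:.3f}")
```

The uninformative predictor reaches 0.938 accuracy but only 0.062 Average Precision (the positive-class prevalence), which is why Average Precision is the more informative objective here.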


Results may not be exactly reproducible across runs, because `patch_sklearn()` replaces the stock scikit-learn estimators with accelerated implementations.

### 2.1 Generating a Stratified Subset (5,000 samples)

In [3]:
X_sub, X_rest, y_sub, y_rest = train_test_split(
    X_train, y_train, 
    train_size=5000, 
    stratify=y_train,
    random_state=42
)

print(f"Subset size: {X_sub.shape}")
print(f"Class distribution in subset:\n{pd.Series(y_sub).value_counts(normalize=True)}")

Subset size: (5000, 41)
Class distribution in subset:
0    0.938
1    0.062
Name: proportion, dtype: float64


### 2.2 Helper Functions for Training and Testing

In [4]:
def create_svm_pipeline(preprocessing_pipeline, sampling_strategy=None, class_weight=None):
    """Build an imblearn pipeline: preprocessing steps, optional undersampler, then an SVC."""
    steps = preprocessing_pipeline.steps.copy()
    
    if sampling_strategy is not None:
        rus = RandomUnderSampler(sampling_strategy=sampling_strategy, random_state=42)
        steps.append(('undersampler', rus))

    svm = SVC(class_weight=class_weight, probability=True, random_state=42)
    steps.append(('svm', svm))

    return ImbPipeline(steps)
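For reference, the `sampling_strategy` values used in the experiments translate into majority-to-minority ratios as sketched below. This shows the arithmetic only, assuming imbalanced-learn's float semantics for under-sampling (the value is the desired minority/majority ratio after resampling) and an assumed 310 positives in the 5,000-row subset (6.2%). `majority_kept` is a hypothetical helper, not a library function.

```python
def majority_kept(n_minority, sampling_strategy):
    """Majority-class rows kept so that n_minority / n_majority is about sampling_strategy."""
    return int(n_minority / sampling_strategy)


n_minority = 310  # assumed: 6.2% of the 5,000-row subset

for strategy in (1.0, 0.5, 0.3, 0.2):
    kept = majority_kept(n_minority, strategy)
    print(f"sampling_strategy={strategy}: {kept} majority rows "
          f"(about {kept / n_minority:.2f}:1)")
```

Inside `GridSearchCV` the sampler only sees the training folds, so the actual counts are smaller, but the ratio is the same.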

In [5]:
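def run_svm_experiment(X, y, preprocessing_pipeline, param_grid, sampling_strategy, title_suffix, class_weight=None):
    """Grid-search an SVM pipeline with 3-fold CV, report the best candidate (by Average Precision), and return the refit model."""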
    pipeline = create_svm_pipeline(preprocessing_pipeline, sampling_strategy, class_weight)

    scoring_metrics = {
        'ROC-AUC': 'roc_auc',
        'Accuracy': 'accuracy',
        'Precision': 'precision',
        'Recall': 'recall',
        'F1-Score': 'f1',
        'Average Precision': 'average_precision'
    }

    grid_search = GridSearchCV(
        pipeline, 
        param_grid, 
        cv=3, 
        scoring=scoring_metrics,
        refit='Average Precision', 
        n_jobs=-1, 
        verbose=1
    )

    print(f"\nRunning Grid Search for {title_suffix}...")
    grid_search.fit(X, y)

    best_model = grid_search.best_estimator_
    
    results = grid_search.cv_results_
    best_index = grid_search.best_index_

    print(f"\n🔹 Best Parameters ({title_suffix}): {grid_search.best_params_}")

    summary_data = []
    for metric_name in scoring_metrics.keys():
        mean_score = results[f'mean_test_{metric_name}'][best_index]
        
        summary_data.append({
            'Metric': metric_name,
            'Mean CV Score': mean_score,
        })

    metrics_df = pd.DataFrame(summary_data)
    
    pd.options.display.float_format = '{:,.4f}'.format
    
    print(f"\n--- PERFORMANCE REPORT ({title_suffix}) ---")
    display(metrics_df)

    
    return best_model

In [6]:
param_grid = [
        {
            'svm__kernel': ['linear'],
            'svm__C': [0.1, 1, 10, 100]
        },
        {
            'svm__kernel': ['rbf'],
            'svm__C': [0.1, 1, 10, 100],
            'svm__gamma': [0.001, 0.01, 0.1, 1]
        }
]
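As a sanity check on the search size, the grid above expands to 4 linear candidates plus 16 RBF candidates. The sketch below counts them with the standard library only; `count_candidates` is a hypothetical helper written for this illustration (scikit-learn's `ParameterGrid` performs the same expansion).

```python
from math import prod

param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10, 100]},
    {
        "svm__kernel": ["rbf"],
        "svm__C": [0.1, 1, 10, 100],
        "svm__gamma": [0.001, 0.01, 0.1, 1],
    },
]


def count_candidates(grid):
    """Number of parameter combinations across all sub-grids."""
    return sum(prod(len(values) for values in sub.values()) for sub in grid)


n_candidates = count_candidates(param_grid)
n_fits = n_candidates * 3  # cv=3 folds per candidate
print(n_candidates, n_fits)
```

These figures match the "20 candidates, totalling 60 fits" line in the grid search logs.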

### 2.3 Baseline Model

In [7]:
model_base = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=None, title_suffix="Baseline", class_weight=None)


Running Grid Search for Baseline...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (Baseline): {'svm__C': 100, 'svm__gamma': 0.001, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (Baseline) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.912 |
| 1 | Accuracy | 0.9464 |
| 2 | Precision | 0.8079 |
| 3 | Recall | 0.1775 |
| 4 | F1-Score | 0.2889 |
| 5 | Average Precision | 0.5104 |


### 2.4 Weighted Loss (Class Weight = 'Balanced')

In [8]:
model_weighted = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=None, title_suffix="Balanced", class_weight='balanced')


Running Grid Search for Balanced...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (Balanced): {'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (Balanced) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9271 |
| 1 | Accuracy | 0.8338 |
| 2 | Precision | 0.2563 |
| 3 | Recall | 0.8774 |
| 4 | F1-Score | 0.3965 |
| 5 | Average Precision | 0.532 |


### 2.5 UnderSampling 1:1

In [9]:
model_1to1 = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=1.0, title_suffix="UnderSampling 1:1", class_weight=None)


Running Grid Search for UnderSampling 1:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 1:1): {'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 1:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9239 |
| 1 | Accuracy | 0.7762 |
| 2 | Precision | 0.2054 |
| 3 | Recall | 0.9032 |
| 4 | F1-Score | 0.3344 |
| 5 | Average Precision | 0.5067 |


### 2.6 UnderSampling 2:1

In [10]:
model_2to1 = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=0.5, title_suffix="UnderSampling 2:1", class_weight=None)


Running Grid Search for UnderSampling 2:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 2:1): {'svm__C': 100, 'svm__gamma': 0.001, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 2:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9247 |
| 1 | Accuracy | 0.8886 |
| 2 | Precision | 0.3332 |
| 3 | Recall | 0.7935 |
| 4 | F1-Score | 0.4693 |
| 5 | Average Precision | 0.5227 |


### 2.7 UnderSampling 10:3

In [11]:
model_10to3 = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=0.3, title_suffix="UnderSampling 10:3", class_weight=None)


Running Grid Search for UnderSampling 10:3...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 10:3): {'svm__C': 10, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 10:3) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9256 |
| 1 | Accuracy | 0.9236 |
| 2 | Precision | 0.431 |
| 3 | Recall | 0.671 |
| 4 | F1-Score | 0.523 |
| 5 | Average Precision | 0.527 |


### 2.8 UnderSampling 5:1

In [12]:
model_5to1 = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1", class_weight=None)


Running Grid Search for UnderSampling 5:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1): {'svm__C': 100, 'svm__gamma': 0.001, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9233 |
| 1 | Accuracy | 0.9358 |
| 2 | Precision | 0.4887 |
| 3 | Recall | 0.5646 |
| 4 | F1-Score | 0.5223 |
| 5 | Average Precision | 0.53 |


### 2.9 UnderSampling 2:1 + Balanced

In [13]:
model_2to1_weighted = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=0.5, title_suffix="UnderSampling 2:1 + Balanced", class_weight='balanced')


Running Grid Search for UnderSampling 2:1 + Balanced...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 2:1 + Balanced): {'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 2:1 + Balanced) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.926 |
| 1 | Accuracy | 0.8058 |
| 2 | Precision | 0.2298 |
| 3 | Recall | 0.9 |
| 4 | F1-Score | 0.3658 |
| 5 | Average Precision | 0.5114 |


### 2.10 UnderSampling 5:1 + Balanced

In [14]:
model_5to1_weighted = run_svm_experiment(
    X_sub, y_sub, preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1 + Balanced", class_weight='balanced')


Running Grid Search for UnderSampling 5:1 + Balanced...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1 + Balanced): {'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1 + Balanced) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9281 |
| 1 | Accuracy | 0.8236 |
| 2 | Precision | 0.2477 |
| 3 | Recall | 0.8935 |
| 4 | F1-Score | 0.3873 |
| 5 | Average Precision | 0.5233 |


## 3. Experimental Results & Model Selection

**Best Model: UnderSampling 5:1**

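The moderate undersampling strategy (reducing the majority class to a 5:1 ratio) yielded the best overall trade-off. Its Average Precision is within 0.002 of the highest score observed (0.532, class weighting in 2.4) and its F1-score is effectively tied with the highest (0.523, 10:3 undersampling in 2.7), while its precision and recall are the most balanced of all strategies.

*   **Average Precision: ~0.53**
*   **F1-Score: ~0.52**
*   **ROC-AUC: ~0.92**
*   **Accuracy: ~0.94**
*   **Precision: ~0.49**
*   **Recall: ~0.56**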


**Optimal Configuration:**
The Grid Search identified the following best hyperparameters for this strategy:
*   **Kernel:** `'rbf'`
*   **C:** `100`
*   **Gamma:** `0.001`
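To make the comparison across strategies easier to audit, the sketch below collects the mean CV Average Precision and F1 scores reported in sections 2.3 to 2.10 and ranks the strategies. Ranking by the mean of the two metrics is an assumption made for this illustration, not a rule stated in the experiments.

```python
# (Average Precision, F1) mean CV scores copied from the reports in section 2.
cv_scores = {
    "Baseline": (0.5104, 0.2889),
    "Balanced": (0.5320, 0.3965),
    "UnderSampling 1:1": (0.5067, 0.3344),
    "UnderSampling 2:1": (0.5227, 0.4693),
    "UnderSampling 10:3": (0.5270, 0.5230),
    "UnderSampling 5:1": (0.5300, 0.5223),
    "UnderSampling 2:1 + Balanced": (0.5114, 0.3658),
    "UnderSampling 5:1 + Balanced": (0.5233, 0.3873),
}

# Assumed ranking rule: average of the two metrics, highest first.
ranking = sorted(cv_scores, key=lambda name: sum(cv_scores[name]) / 2, reverse=True)

for name in ranking:
    ap, f1 = cv_scores[name]
    print(f"{name:<30} AP={ap:.4f}  F1={f1:.4f}")
```

Under this rule the 5:1 and 10:3 undersampling strategies come out on top, with 5:1 narrowly ahead.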

## 4. Analysis of Suboptimal Model Performance

Despite hyperparameter tuning, neither the Average Precision nor the F1 score exceeds 0.55 in any configuration. This section investigates the potential root causes of this underperformance, including:
- **Data Complexity:** The dataset may lack clear separability or contain high noise levels.
- **Model Limitations:** SVMs might not be the optimal algorithm for this specific feature space.
- **Preprocessing Issues:** Potential information loss or artifacts introduced during feature engineering.

### 4.1 Training on a Larger Dataset & Threshold Tuning
We retrain the model on 50k samples and utilize a separate 10k hold-out validation set to optimize the decision threshold. This separation prevents data leakage. Finally, we compare the Training vs. Validation F1 scores. We specifically monitor the training metrics to verify if high performance is achievable at all. Low training scores would suggest that the data itself lacks sufficient signal or separability, or that the preprocessing pipeline is suboptimal.

In [15]:
X_bigger_sub, X_rest, y_bigger_sub, y_rest = train_test_split(
    X_train, y_train, 
    train_size=50000, 
    stratify=y_train,
    random_state=42
)

X_validate, X_rest, y_validate, y_rest = train_test_split(
    X_rest, y_rest, 
    train_size=10000, 
    stratify=y_rest,
    random_state=42
)

print(f"Subset size: {X_bigger_sub.shape}")
print(f"Class distribution in subset:\n{pd.Series(y_bigger_sub).value_counts(normalize=True)}")

Subset size: (50000, 41)
Class distribution in subset:
0   0.9379
1   0.0621
Name: proportion, dtype: float64


### Helper Functions

In [16]:
def evaluate_with_threshold_tuning(model, X_train, y_train, X_val, y_val, verbose=True):
    """Pick the F1-maximizing threshold on the validation set and report train vs. validation F1."""

    if verbose:
        print("Calculating probabilities...")
        
    y_train_proba = model.predict_proba(X_train)[:, 1]
    y_val_proba = model.predict_proba(X_val)[:, 1]

    precisions, recalls, thresholds = precision_recall_curve(y_val, y_val_proba)
    
    with np.errstate(divide='ignore', invalid='ignore'):
        f1_scores = 2 * (precisions * recalls) / (precisions + recalls)
    f1_scores = np.nan_to_num(f1_scores) 

    best_idx = np.argmax(f1_scores)
    best_threshold = thresholds[best_idx] if best_idx < len(thresholds) else 0.5
    
    y_train_pred = (y_train_proba >= best_threshold).astype(int)
    y_val_pred = (y_val_proba >= best_threshold).astype(int)

    train_f1 = f1_score(y_train, y_train_pred)
    val_f1 = f1_score(y_val, y_val_pred)
    diff = train_f1 - val_f1

    if verbose:
        print(f"\nOptimal Threshold Found: {best_threshold:.4f}")
        print("\n--- DIAGNOSTICS (Train vs Validation) ---")
        print(f"Train F1 Score:       {train_f1:.4f}")
        print(f"Validation F1 Score:  {val_f1:.4f}")
        print(f"Difference (Overfit): {diff:.4f}")

    return
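To illustrate the threshold search performed by `evaluate_with_threshold_tuning`, here is a dependency-free toy version. It tries every distinct predicted probability as a candidate threshold and keeps the one with the highest F1. The labels and probabilities are made up for the example, and `best_f1_threshold` is a hypothetical helper.

```python
def f1_at_threshold(y_true, y_proba, threshold):
    """F1 of the positive class when predicting 1 for proba >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_proba) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_proba) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_proba) if p < threshold and t == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, y_proba):
    """Return (threshold, f1) maximizing F1 over the distinct probabilities."""
    candidates = sorted(set(y_proba))
    return max(((t, f1_at_threshold(y_true, y_proba, t)) for t in candidates),
               key=lambda pair: pair[1])


# Made-up validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
y_val_proba = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, f1 = best_f1_threshold(y_val, y_val_proba)
print(f"Best threshold: {threshold:.2f}, F1: {f1:.4f}")
```

As in the helper above, the threshold is selected on held-out data only, so the reported validation score is not inflated by tuning on the training set.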

In [17]:
best_model_bigger_set = run_svm_experiment(
    X_bigger_sub, y_bigger_sub, preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1", class_weight=None)


Running Grid Search for UnderSampling 5:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1): {'svm__C': 10, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9151 |
| 1 | Accuracy | 0.9425 |
| 2 | Precision | 0.5336 |
| 3 | Recall | 0.5794 |
| 4 | F1-Score | 0.5555 |
| 5 | Average Precision | 0.5817 |


In [18]:
evaluate_with_threshold_tuning(best_model_bigger_set, X_bigger_sub, y_bigger_sub, X_validate, y_validate)

Calculating probabilities...

Optimal Threshold Found: 0.5065

--- DIAGNOSTICS (Train vs Validation) ---
Train F1 Score:       0.5767
Validation F1 Score:  0.5723
Difference (Overfit): 0.0044


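### Performance Analysis
The F1-score remains low (~0.57) even after threshold tuning. Crucially, the training and validation scores are nearly identical (a gap of about 0.004), so overfitting is not the problem: the model cannot reach a high score even on the data it was trained on.

The optimal threshold (~0.51) is essentially the default of 0.5, which indicates that no threshold tuning is needed.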

### 4.2 Alternative Model - XGBoost
To verify whether the performance bottleneck is intrinsic to the dataset or a limitation of the Support Vector Machine (SVM) algorithm, we train an XGBoost classifier. As a tree-based ensemble method, XGBoost handles non-linear relationships and feature interactions differently than SVMs, potentially uncovering patterns the previous model missed.
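The search in the next cell compensates for class imbalance through XGBoost's `scale_pos_weight`, set to the negative-to-positive count ratio. A quick worked example of that arithmetic, assuming the counts implied by the rounded class proportions of the 50k subset (about 6.21% positives):

```python
n_total = 50_000
n_pos = 3_105            # assumed: ~6.21% of 50,000, from the rounded proportion
n_neg = n_total - n_pos  # 46,895

scale_pos_weight = n_neg / n_pos
print(f"scale_pos_weight ~ {scale_pos_weight:.2f}")
```

Each positive example therefore weighs roughly as much as 15 negatives in the training loss.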

In [19]:
count_neg = len(y_bigger_sub) - y_bigger_sub.sum()
count_pos = y_bigger_sub.sum()
scale_ratio = count_neg / count_pos

full_pipeline = Pipeline([
    ('preprocessing', preprocessing_pipeline), 
    ('classifier', XGBClassifier(
        objective='binary:logistic',
        scale_pos_weight=scale_ratio,
        n_jobs=-1,
        tree_method='hist' 
    ))
])

param_dist = {
    'classifier__n_estimators': [500, 1000],
    'classifier__learning_rate': [0.03, 0.05, 0.1],
    'classifier__max_depth': [4, 6, 8], 
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__colsample_bytree': [0.6, 0.7, 0.8], 
    'classifier__subsample': [0.7, 0.8, 0.9],
    'classifier__gamma': [0.1, 0.5, 1.0] 
}

scoring_metrics = {
    'ROC-AUC': 'roc_auc',
    'Accuracy': 'accuracy',
    'Precision': 'precision',
    'Recall': 'recall',
    'F1-Score': 'f1',
    'Average Precision': 'average_precision'
}

best_model_xg = RandomizedSearchCV(
    full_pipeline,
    param_distributions=param_dist,
    n_iter=15, 
    scoring=scoring_metrics,
    refit='Average Precision',
    cv=3,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

print("Running Randomized Search...")
best_model_xg.fit(X_bigger_sub, y_bigger_sub)

print(f"\n🔹 Best Parameters: {best_model_xg.best_params_}")
print("\n--- PERFORMANCE REPORT ---\n")

results = []
best_index = best_model_xg.best_index_

for metric_name in scoring_metrics.keys():
    mean_score = best_model_xg.cv_results_[f'mean_test_{metric_name}'][best_index]
    results.append({
        'Metric': metric_name, 
        'Mean CV Score': round(mean_score, 4)
    })

results_df = pd.DataFrame(results)
display(results_df)

Running Randomized Search...
Fitting 3 folds for each of 15 candidates, totalling 45 fits

🔹 Best Parameters: {'classifier__subsample': 0.9, 'classifier__n_estimators': 500, 'classifier__min_child_weight': 3, 'classifier__max_depth': 6, 'classifier__learning_rate': 0.03, 'classifier__gamma': 1.0, 'classifier__colsample_bytree': 0.7}

--- PERFORMANCE REPORT ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9439 |
| 1 | Accuracy | 0.8797 |
| 2 | Precision | 0.3214 |
| 3 | Recall | 0.8434 |
| 4 | F1-Score | 0.4654 |
| 5 | Average Precision | 0.626 |


In [20]:
evaluate_with_threshold_tuning(best_model_xg, X_bigger_sub, y_bigger_sub, X_validate, y_validate)

Calculating probabilities...

Optimal Threshold Found: 0.7924

--- DIAGNOSTICS (Train vs Validation) ---
Train F1 Score:       0.6413
Validation F1 Score:  0.5965
Difference (Overfit): 0.0448


### Performance Analysis

Even with XGBoost, the F1 score remains suboptimal (~0.60 on validation, ~0.64 on train). Since two fundamentally different algorithms (a kernel SVM and gradient-boosted trees) failed to achieve high performance, the choice of model is unlikely to be the main bottleneck.


**Hypothesized Root Causes:**
- **Aggressive Preprocessing:** The feature engineering steps (e.g., grouping industries/occupations) may have discarded critical signal, leading to information loss.
- **Low Separability:** The dataset itself may contain high noise or overlapping classes, making it impossible to separate them with high precision based on the available features.

### 4.3 Soft Pipeline

To investigate if the previous feature engineering was too aggressive (causing information loss), we implemented a "Soft" pipeline with the following adjustments:

- **Financial Features Preserved:** `capital_gains` and `capital_losses` are retained and log-transformed instead of being dropped in favor of just `net_capital`
- **Raw Categories Kept:** Removed broad manual grouping for `major_ind_code` and `major_occ_code`. These are now one-hot encoded with `min_frequency=0.01` to handle cardinality automatically while preserving detail
- **Otherwise Identical:** All other preprocessing steps (cleaning, scaling, and other engineered features) remain exactly the same as in the primary pipeline.
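The source of `soft_preprocessing_pipeline` is not shown here, so the following is only a sketch of the two ideas described above: a log transform for skewed monetary columns and frequency-based grouping of rare categories. It uses the standard library only. `group_infrequent` is a hypothetical helper that mimics the effect of `OneHotEncoder(min_frequency=0.01)`, and all values are invented.

```python
import math
from collections import Counter


def log_transform(values):
    """Compress a heavy right tail; log1p keeps zeros at zero."""
    return [math.log1p(v) for v in values]


def group_infrequent(categories, min_frequency=0.01, other="infrequent"):
    """Replace categories rarer than min_frequency (a fraction of rows) with one label."""
    min_count = min_frequency * len(categories)
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other for c in categories]


# Invented example: most rows have no capital gains, a few have very large ones.
capital_gains = [0, 0, 0, 2_000, 15_000, 99_999]
print([round(v, 2) for v in log_transform(capital_gains)])

# Invented example: two common industries and two rare ones out of 1,000 rows.
industries = (["Retail"] * 600 + ["Manufacturing"] * 391
              + ["Mining"] * 6 + ["Forestry"] * 3)
grouped = Counter(group_infrequent(industries))
print(grouped)
```

With a 1% cutoff on 1,000 rows, any category seen fewer than 10 times is merged into a single bucket, which keeps the one-hot width bounded without manual grouping.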

In [21]:
best_model_bigger_set_soft_pipeline = run_svm_experiment(
    X_bigger_sub, y_bigger_sub, soft_preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1", class_weight=None)


Running Grid Search for UnderSampling 5:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1): {'svm__C': 100, 'svm__gamma': 0.001, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9398 |
| 1 | Accuracy | 0.941 |
| 2 | Precision | 0.5219 |
| 3 | Recall | 0.5988 |
| 4 | F1-Score | 0.5577 |
| 5 | Average Precision | 0.5903 |


In [22]:
evaluate_with_threshold_tuning(best_model_bigger_set_soft_pipeline, X_bigger_sub, y_bigger_sub, X_validate, y_validate)

Calculating probabilities...

Optimal Threshold Found: 0.5250

--- DIAGNOSTICS (Train vs Validation) ---
Train F1 Score:       0.5735
Validation F1 Score:  0.5672
Difference (Overfit): 0.0062


### Performance Analysis: Soft Pipeline
Comparing the "Soft" pipeline to the previous approach, we observe the following performance metrics:

**1. Cross-Validation (Mean CV Score):**
*   **F1-Score:** Remained stable at **0.56**
*   **Average Precision:** Increased from **0.58** to **0.59**

**2. Validation Set (Hold-out):**
*   **F1-Score:** Remained consistent at **0.57**

The optimal threshold (~0.53) stays close to the default of 0.5, which indicates that no threshold tuning is needed.

### 4.4 Unethical Pipeline
To rigorously test the performance limitations, we implemented an "Unethical" pipeline. This setup builds directly upon the **Soft Pipeline** strategy (preserving financial details and raw categories) but deliberately **reintroduces sensitive demographic features**: `sex`, `race`, and `hisp_origin`.

In [23]:
best_model_bigger_set_unethical_pipeline = run_svm_experiment(
    X_bigger_sub, y_bigger_sub, unethical_preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1", class_weight=None)


Running Grid Search for UnderSampling 5:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1): {'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9427 |
| 1 | Accuracy | 0.9441 |
| 2 | Precision | 0.5461 |
| 3 | Recall | 0.5897 |
| 4 | F1-Score | 0.5671 |
| 5 | Average Precision | 0.6106 |


In [24]:
evaluate_with_threshold_tuning(best_model_bigger_set_unethical_pipeline, X_bigger_sub, y_bigger_sub, X_validate, y_validate)

Calculating probabilities...

Optimal Threshold Found: 0.5690

--- DIAGNOSTICS (Train vs Validation) ---
Train F1 Score:       0.5763
Validation F1 Score:  0.5928
Difference (Overfit): -0.0165


### Performance Analysis: Unethical Pipeline
Comparing the "Unethical" pipeline (which reintroduces sensitive demographic features) to the "Soft" pipeline, we observe a small but consistent improvement across all three scores, suggesting that the demographic variables carry additional predictive signal:

**1. Cross-Validation (Mean CV Score):**
*   **F1-Score:** Increased from **0.56** to **0.57**
*   **Average Precision:** Rose from **0.59** to **0.61**

**2. Validation Set (Hold-out):**
*   **F1-Score:** Improved from **0.57** to **0.59**

### 4.5 More Features Pipeline
In this step, we extend the "Unethical" pipeline by re-introducing specific auxiliary features that were previously dropped: `own_or_self`, `vet_benefits`, `unemp_reason`, and `vet_question`.

In [25]:
best_model_bigger_set_more_features_pipeline = run_svm_experiment(
    X_bigger_sub, y_bigger_sub, more_features_preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1", class_weight=None)


Running Grid Search for UnderSampling 5:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1): {'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9427 |
| 1 | Accuracy | 0.9443 |
| 2 | Precision | 0.5478 |
| 3 | Recall | 0.5878 |
| 4 | F1-Score | 0.5671 |
| 5 | Average Precision | 0.61 |


In [26]:
evaluate_with_threshold_tuning(best_model_bigger_set_more_features_pipeline, X_bigger_sub, y_bigger_sub, X_validate, y_validate)

Calculating probabilities...

Optimal Threshold Found: 0.5823

--- DIAGNOSTICS (Train vs Validation) ---
Train F1 Score:       0.5781
Validation F1 Score:  0.5912
Difference (Overfit): -0.0130


### Performance Analysis: More Features Pipeline
Comparing the "More Features" pipeline (adding auxiliary features like `vet_benefits`) to the "Unethical" pipeline, we observe that the additional complexity did not yield performance gains, suggesting these features contain little incremental predictive value:

**1. Cross-Validation (Mean CV Score):**
*   **F1-Score:** Remained stable at **0.57**
*   **Average Precision:** Remained constant at **0.61**

**2. Validation Set (Hold-out):**
*   **F1-Score:** Remained consistent at **0.59**


The optimal threshold is ~0.58, which indicates that slight threshold tuning is needed.

### 4.6 Very Soft Pipeline
In this final configuration, we aim to maximize the information available to the model by combining manual engineering with raw data retention:

*   **Create but Don't Delete:** We generate all our new synthetic features (e.g., `net_capital`, `is_investor`), but unlike previous pipelines, we **do not remove any original columns**. Both the derived features and their raw sources are passed to the model.
*   **No Manual Grouping:** We completely skip the manual categorization step (e.g., grouping industries or employment status). Instead, we use `OneHotEncoder` with a very low **`min_frequency=0.001`**, allowing the model to see virtually all original, granular categories.
*   **Automated Selection:** Since this approach generates a huge number of features (often redundant), we add a **`SelectKBest(k=100)`** step at the end. This forces the algorithm to automatically pick the 100 most predictive features from this massive pool of engineered and raw variables.
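A minimal sketch of univariate top-k selection, the idea behind `SelectKBest`. It assumes the ANOVA F-statistic as the scoring function (scikit-learn's default, `f_classif`; the pipeline's actual choice is not shown here) and uses tiny invented columns. `anova_f` and `select_k_best` are illustrative helpers, not library functions.

```python
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the class groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_k_best(columns, labels, k):
    """Names of the k columns with the highest F statistic, best first."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented data: one informative column, one weak column, one pure-noise column.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "noise": [2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0],
    "weak": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5],
    "strong": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
}

selected = select_k_best(columns, labels, k=2)
print(selected)
```

Because each feature is scored on its own, this kind of filter cannot detect features that are only useful in combination, which is one reason it may plateau at the same level as manual engineering.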

In [27]:
best_model_bigger_set_very_soft_pipeline = run_svm_experiment(
    X_bigger_sub, y_bigger_sub, very_soft_preprocessing_pipeline, param_grid, sampling_strategy=0.2, title_suffix="UnderSampling 5:1", class_weight=None)


Running Grid Search for UnderSampling 5:1...
Fitting 3 folds for each of 20 candidates, totalling 60 fits

🔹 Best Parameters (UnderSampling 5:1): {'svm__C': 10, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (UnderSampling 5:1) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9275 |
| 1 | Accuracy | 0.9422 |
| 2 | Precision | 0.5296 |
| 3 | Recall | 0.6133 |
| 4 | F1-Score | 0.5684 |
| 5 | Average Precision | 0.6093 |


In [28]:
evaluate_with_threshold_tuning(best_model_bigger_set_very_soft_pipeline, X_bigger_sub, y_bigger_sub, X_validate, y_validate)

Calculating probabilities...

Optimal Threshold Found: 0.5852

--- DIAGNOSTICS (Train vs Validation) ---
Train F1 Score:       0.6033
Validation F1 Score:  0.5937
Difference (Overfit): 0.0096


### Performance Analysis: Very Soft Pipeline
Comparing the "Very Soft" pipeline (which utilizes granular raw data and automated `SelectKBest` feature selection) to the previous "More Features" pipeline, we observe that performance has reached a plateau. This indicates that automated selection recovered the same predictive signal as the manually engineered pipelines but did not uncover additional information:

**1. Cross-Validation (Mean CV Score):**
*   **F1-Score:** Remained stable at **0.57**
*   **Average Precision:** Remained consistent at **0.61**

**2. Validation Set (Hold-out):**
*   **F1-Score:** Remained stable at **0.59**

The optimal threshold is ~0.58, which indicates that slight threshold tuning is needed.

### 4.7 Conclusion

After testing multiple preprocessing strategies and feature sets, we arrived at the following conclusions:

**1. Pipeline Evaluation:**
*   **"Soft" Pipeline is the Optimal Balance:** While the "Very Soft" and "Unethical" pipelines achieved the highest raw metrics (Validation F1 ~0.59), the **Soft Pipeline** (Validation F1 ~0.57) remains the preferred choice for production. It slightly improves cross-validated Average Precision over the original pipeline (0.58 to 0.59) without relying on sensitive demographic data or an unwieldy feature space.
*   **The Ethical Trade-off:** Reintroducing protected attributes (`sex`, `race`, `hisp_origin`) in the "Unethical" pipeline **improved performance** (validation F1 rose from 0.57 to 0.59). This indicates that the protected attributes carry predictive signal, i.e. that the dataset reflects demographic disparities in income. However, we explicitly **reject** this gain to adhere to fairness constraints and prevent the model from perpetuating historical biases.
*   **Automation vs. Manual Engineering:** The "Very Soft" pipeline, which used automated feature selection (`SelectKBest`) on raw data, matched the performance of our manually engineered pipelines. This suggests that automated selection is effective but hits the same performance ceiling, and that our manual feature engineering did not miss any substantial signal.

**2. Root Cause Analysis:**
*   **Intrinsic Data Difficulty:** The performance plateau around **F1 ~0.59** (even with all features and raw data) indicates a limit in the data's separability. Since more complex non-linear methods (like XGBoost tested in parallel) also struggled to break this ceiling, we conclude the bottleneck is the **class overlap** in the dataset, not the SVM architecture itself.

**Final Verdict:**
We select the **Soft Preprocessing Pipeline**. It offers competitive predictive performance (~0.57 validation F1) while maintaining strict ethical standards and model interpretability, accepting a minor loss in performance (about 0.02 F1) to ensure a fairer model. No threshold tuning is needed, since its optimal threshold stays close to 0.5.

**Optimal Model Configuration:**
*   **Kernel:** RBF
*   **C:** 100
*   **Gamma:** 0.001
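The selection logic of this section can be summarized in a few lines. The scores are the mean CV Average Precision and validation F1 values reported in sections 4.1 to 4.6. The `uses_protected` flags are our reading of the pipeline descriptions (the Very Soft pipeline is flagged because it keeps all original columns), and ranking the remaining candidates by CV Average Precision is an assumption based on the stated optimization objective.

```python
# name: (uses_protected_attributes, mean CV Average Precision, validation F1)
pipelines = {
    "Original": (False, 0.5817, 0.5723),
    "Soft": (False, 0.5903, 0.5672),
    "Unethical": (True, 0.6106, 0.5928),
    "More Features": (True, 0.6100, 0.5912),
    "Very Soft": (True, 0.6093, 0.5937),  # assumed: keeps every original column
}

# Apply the fairness constraint first, then the optimization objective.
eligible = {name: row for name, row in pipelines.items() if not row[0]}
selected = max(eligible, key=lambda name: eligible[name][1])

best_f1_overall = max(row[2] for row in pipelines.values())
f1_cost = best_f1_overall - pipelines[selected][2]
print(f"Selected: {selected}, validation F1 given up: {f1_cost:.4f}")
```

Under these assumptions the rule picks the Soft pipeline and quantifies the cost of the fairness constraint at roughly 0.03 validation F1.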

## 5. Model Optimization

### 5.1 Initial Optimization
In the previous experiments the best parameters for the Soft pipeline sat on the edge of the grid (`C=100` was the largest candidate and `gamma=0.001` the smallest), so the true optimum may lie outside the range searched. We therefore run a focused search around these values, using the same 50k training subset for consistency but a denser grid that extends to larger `C` and smaller `gamma`.
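Whether a winning value sits on the boundary of its grid can be checked mechanically. The helper below is a hypothetical utility written for this illustration; it flags every tuned parameter whose best value is the smallest or largest candidate.

```python
def edge_parameters(grid, best_params):
    """Return the parameters whose best value is the min or max of its candidate list."""
    on_edge = []
    for name, candidates in grid.items():
        if len(candidates) < 2:
            continue  # a single fixed value (e.g. the kernel) has no interior
        if best_params[name] in (min(candidates), max(candidates)):
            on_edge.append(name)
    return on_edge


# Coarse RBF grid from section 2 and the best parameters reported for the Soft pipeline.
coarse_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.001, 0.01, 0.1, 1]}
coarse_best = {"svm__C": 100, "svm__gamma": 0.001}
print(edge_parameters(coarse_grid, coarse_best))  # both parameters are on the edge
```

An empty list after a refined search suggests that the optimum is bracketed by the grid; a non-empty one suggests widening the range in that direction.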

In [29]:
param_grid_initial_optimization = [
    {
        'svm__kernel': ['rbf'],
        'svm__C': [50, 80, 100, 150, 200, 500, 1000],
        'svm__gamma': [0.0001, 0.0005, 0.0008, 0.001, 0.0015, 0.002, 0.003, 0.004]
    }
]
initial_optimization_model = run_svm_experiment(
    X_bigger_sub, y_bigger_sub, soft_preprocessing_pipeline, param_grid_initial_optimization, sampling_strategy=0.2, title_suffix="Initial Optimization", class_weight=None)


Running Grid Search for Initial Optimization...
Fitting 3 folds for each of 56 candidates, totalling 168 fits

🔹 Best Parameters (Initial Optimization): {'svm__C': 500, 'svm__gamma': 0.0008, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (Initial Optimization) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9396 |
| 1 | Accuracy | 0.9408 |
| 2 | Precision | 0.5196 |
| 3 | Recall | 0.6062 |
| 4 | F1-Score | 0.5596 |
| 5 | Average Precision | 0.6004 |


### 5.2 Extended Hyperparameter Tuning
**Full Dataset Optimization & Micro-Tuning**

Having identified the optimal region (`C=500`, `gamma=0.0008`) on the subset, we now scale up the training to the entire dataset. We perform a very narrow grid search centered around these values to finalize the model hyperparameters, ensuring they are tuned specifically for the full data distribution.

In [33]:
param_grid_extended_optimization = [
    {
        'svm__kernel': ['rbf'],
        'svm__C': [350, 500, 650, 800],
        'svm__gamma': [0.0007, 0.0008, 0.0009]
    }
]
extended_optimization_model = run_svm_experiment(
    X_train, y_train, soft_preprocessing_pipeline, param_grid_extended_optimization, sampling_strategy=0.2, title_suffix="Extended Optimization", class_weight=None)


Running Grid Search for Extended Optimization...
Fitting 3 folds for each of 12 candidates, totalling 36 fits

🔹 Best Parameters (Extended Optimization): {'svm__C': 800, 'svm__gamma': 0.0009, 'svm__kernel': 'rbf'}

--- PERFORMANCE REPORT (Extended Optimization) ---

|   | Metric | Mean CV Score |
|---|--------|---------------|
| 0 | ROC-AUC | 0.9297 |
| 1 | Accuracy | 0.943 |
| 2 | Precision | 0.5358 |
| 3 | Recall | 0.6095 |
| 4 | F1-Score | 0.5703 |
| 5 | Average Precision | 0.6024 |


## 6. The Best Model

After this experimental process, we identified the following configuration as the best for predicting high-income individuals in this highly imbalanced dataset.

**Final Configuration:**
*   **Algorithm:** Support Vector Machine (SVM)
*   **Kernel:** `'rbf'`
*   **Regularization (C):** `800`
*   **Gamma:** `0.0009`
*   **Imbalance Strategy:** Random UnderSampling (5:1 ratio / `sampling_strategy=0.2`)

## 7. Saving Final Model to the File

In [34]:
final_trained_model = extended_optimization_model

model_filename = 'final_model.pkl'

try:
    with open(model_filename, 'wb') as file:
        pickle.dump(final_trained_model, file)
    print(f"\n Final trained model successfully saved to '{model_filename}'")
    print("This model was trained on X_train/y_train using the optimal parameters found.")
except Exception as e:
    print(f"\n Error saving the model: {e}")


 Final trained model successfully saved to 'final_model.pkl'
This model was trained on X_train/y_train using the optimal parameters found.
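For completeness, a sketch of the save/load round trip. A plain dictionary stands in for the fitted pipeline and a temporary directory replaces the working directory, so the example runs anywhere. With the real artifacts, the same `pickle.load` call would be applied to `final_model.pkl` (and `joblib.load` to `income_label_encoder.joblib`).

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
stand_in_model = {"kernel": "rbf", "C": 800, "gamma": 0.0009}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "final_model.pkl"

    # Save, mirroring the cell above.
    with open(model_path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load it back, as an evaluation notebook would.
    with open(model_path, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == stand_in_model)
```

Pickled estimators should be loaded with the same library versions that were used for training, and only from trusted sources.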
