# **Adult income prediction**

## **Leonardo Cofone**

**In this project, I build a machine learning model for a binary classification task (predicting whether an adult's income is above 50K) using a stacking ensemble. The ensemble combines Random Forest, Gradient Boosting, and XGBoost classifiers, with AdaBoost as the final estimator. Preprocessing is handled by a pipeline that performs imputation, feature scaling, and one-hot encoding. I evaluate the model with precision, recall, F1 score, and ROC AUC. Finally, I tune the decision threshold using Youden's J statistic, which aims to raise recall while keeping the false positive rate low.**

In [None]:
#install old libraries for compatibility
!pip uninstall -y scikit-learn > /dev/null 2>&1
!pip uninstall -y category-encoders > /dev/null 2>&1
!pip uninstall -y imbalanced-learn > /dev/null 2>&1

!pip install scikit-learn==1.1.3 > /dev/null 2>&1
!pip install imbalanced-learn==0.9.1 > /dev/null 2>&1

##### **Now the setup is ready!**

## **1) Analyze and work on data**

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('/kaggle/input/adult-income-dataset/adult.csv', na_values="?")
data["income"] = data["income"].map({"<=50K": 0, ">50K": 1})

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isnull().sum()

In [None]:
data.shape

In [None]:
data.duplicated().sum()

In [None]:
#Remove the duplicate rows
data.drop_duplicates(inplace=True)
data.shape

In [None]:
#Check if the classes are imbalanced
print(pd.Series(data["income"]).value_counts(normalize=True))

**The classes are clearly imbalanced: the `>50K` class is the minority.**
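Random oversampling, which the next cell applies through `RandomOverSampler`, balances the classes by drawing minority-class rows with replacement until every class is as large as the biggest one. A minimal pure-Python sketch of that idea on a toy label list (the helper `random_oversample` is hypothetical and is not the imbalanced-learn implementation):

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Indices of the rows belonging to this class
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Draw with replacement until the class reaches the target size
        for _ in range(target - n):
            j = rng.choice(idx)
            out_rows.append(rows[j])
            out_labels.append(labels[j])
    return out_rows, out_labels

# Toy data: 6 rows of class 0, 2 rows of class 1
toy_rows = [[i] for i in range(8)]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
bal_rows, bal_labels = random_oversample(toy_rows, toy_labels)
print(Counter(bal_labels))
```

Because the added rows are exact copies, a copy that ends up in a validation or test split makes the scores look better than they really are, so resampling is safest when restricted to the training split.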

In [None]:
#Split first, then balance only the training set
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X = data.drop(['income'], axis=1)
y = data['income']

#Hold out the test and validation sets before resampling, so duplicated rows cannot leak into them
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.1111, random_state=42, stratify=y_train_full)

#Oversample the minority class in the training set only
X_train, y_train = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

In [None]:
#Double check if the classes are balanced
print(pd.Series(y_train).value_counts(normalize=True))

In [None]:
#Find out which features are categorical, numerical, or heavily skewed
categorical_cols = X_train.select_dtypes(include='object').columns
numeric_cols = X_train.select_dtypes(include=np.number).columns

print("Categorical features: ", categorical_cols)
print("Numerical features: ", numeric_cols)

from scipy.stats import skew
#Treat numeric features with skewness above 0.8 as heavily right-skewed
skewed_feats = X_train[numeric_cols].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewed_features = skewed_feats[skewed_feats > 0.8].index.tolist()

print("High skew features:", skewed_features)
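To illustrate why the skewed features get a log transform in the pipeline, here is a small sketch on synthetic data (not the Adult dataset). It computes the sample skewness by hand, with the same biased moment estimator that `scipy.stats.skew` uses by default, before and after `log1p`. The helper name `sample_skewness` is an assumption made for this example.

```python
import math
import random

def sample_skewness(values):
    """Biased sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)
# Synthetic right-skewed, non-negative feature (log-normal draws)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log1p(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_logged = sample_skewness(logged)
print(f"skewness before log1p: {skew_raw:.2f}")
print(f"skewness after log1p:  {skew_logged:.2f}")
```

`log1p` is only defined for values greater than -1, so this transform suits non-negative features such as counts or monetary amounts.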

## **2) Create a preprocessing pipeline**

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer

log_transformer = make_pipeline(                      
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log1p, feature_names_out="one-to-one"),
    StandardScaler()
)

numeric_transformer = make_pipeline(              
    SimpleImputer(strategy="median"),
    StandardScaler()
)

categorical_transformer = make_pipeline(                       
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore")
)


preprocessor = ColumnTransformer([
    ("log", log_transformer, skewed_features),
    #Keep the original column order (the order of a set difference is not deterministic)
    ("num", numeric_transformer, [col for col in numeric_cols if col not in skewed_features]),
    ("cat", categorical_transformer, categorical_cols),
])

## **3) Train and evaluate different models**

#### **Logistic Regression**

In [None]:
#TRAIN A LOGISTIC REGRESSION MODEL
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_curve

log_reg = Pipeline([
    ("preprocessor1", preprocessor),
    ("log_reg", LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')),
])

log_reg.fit(X_train, y_train)

#EVALUATE THE MODEL
y_val_pred_1 = log_reg.predict(X_val)

print("Confusion Matrix For Logistic Regression:\n", confusion_matrix(y_val, y_val_pred_1))

print(f"Precision: {precision_score(y_val, y_val_pred_1)}")
print(f"Recall: {recall_score(y_val, y_val_pred_1)}")
print(f"F1 Score: {f1_score(y_val, y_val_pred_1)}")

y_val_pred_proba_1 = log_reg.predict_proba(X_val)[:, 1]
print(f"ROC AUC Score: {roc_auc_score(y_val, y_val_pred_proba_1)}")
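Precision, recall, and F1 as printed above all derive from the confusion matrix. A small sketch with made-up counts (not results from this notebook), using scikit-learn's layout where rows are true classes and columns are predicted classes. The function name `scores_from_confusion` is hypothetical.

```python
def scores_from_confusion(cm):
    """Return precision, recall and F1 for the positive class of a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: [[TN, FP], [FN, TP]]
example_cm = [[80, 20], [10, 90]]
precision, recall, f1 = scores_from_confusion(example_cm)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

ROC AUC is different in kind: it is computed from the predicted probabilities across all thresholds, which is why the cells pass `predict_proba` output to `roc_auc_score` instead of the hard predictions.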

#### **Gradient Boosting Classifier**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gb_model = Pipeline([
    ("preprocessor1", preprocessor),
    ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42))
])

gb_model.fit(X_train, y_train)

y_val_pred_4 = gb_model.predict(X_val)

print("Confusion Matrix For Gradient Boosting:\n", confusion_matrix(y_val, y_val_pred_4))
print(f"Precision: {precision_score(y_val, y_val_pred_4)}")
print(f"Recall: {recall_score(y_val, y_val_pred_4)}")
print(f"F1 Score: {f1_score(y_val, y_val_pred_4)}")

y_val_pred_proba_4 = gb_model.predict_proba(X_val)[:, 1]
print(f"ROC AUC Score: {roc_auc_score(y_val, y_val_pred_proba_4)}")


#### **XGB Classifier**

In [None]:
from xgboost import XGBClassifier

xgb_model = Pipeline([
    ("preprocessor1", preprocessor),
    ("xgb", XGBClassifier(n_estimators=200, learning_rate=0.1, use_label_encoder=False, eval_metric="logloss", random_state=42))
])
xgb_model.fit(X_train, y_train)

y_val_pred_3 = xgb_model.predict(X_val)

print("Confusion Matrix For XGBClassifier:\n", confusion_matrix(y_val, y_val_pred_3))
print(f"Precision: {precision_score(y_val, y_val_pred_3)}")
print(f"Recall: {recall_score(y_val, y_val_pred_3)}")
print(f"F1 Score: {f1_score(y_val, y_val_pred_3)}")

y_val_pred_proba_3 = xgb_model.predict_proba(X_val)[:, 1]
print(f"ROC AUC Score: {roc_auc_score(y_val, y_val_pred_proba_3)}")

#### **Random Forest Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = Pipeline([
    ("preprocessor1", preprocessor),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced'))
])

rf_model.fit(X_train, y_train)

y_val_pred_2 = rf_model.predict(X_val)

print("Confusion Matrix For Random Forest:\n", confusion_matrix(y_val, y_val_pred_2))
print(f"Precision: {precision_score(y_val, y_val_pred_2)}")
print(f"Recall: {recall_score(y_val, y_val_pred_2)}")
print(f"F1 Score: {f1_score(y_val, y_val_pred_2)}")

y_val_pred_proba_2 = rf_model.predict_proba(X_val)[:, 1]
print(f"ROC AUC Score: {roc_auc_score(y_val, y_val_pred_proba_2)}")

## **4) Ensemble the three best models**

In [None]:
from sklearn.ensemble import StackingClassifier, AdaBoostClassifier

rf_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced'))
])

xgb_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("xgb", XGBClassifier(n_estimators=200, learning_rate=0.1,
                          use_label_encoder=False, eval_metric="logloss", random_state=42))
])

gb_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42))
])

stack_model = StackingClassifier(
    estimators=[
        ('rf', rf_pipeline),
        ('xgb', xgb_pipeline),
        ('gb', gb_pipeline),
    ],
    final_estimator=AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42),
    cv=5,
    n_jobs=-1
)

stack_model.fit(X_train, y_train)

y_val_pred_5 = stack_model.predict(X_val)

print("Confusion Matrix For Stacking:\n", confusion_matrix(y_val, y_val_pred_5))
print(f"Precision: {precision_score(y_val, y_val_pred_5)}")
print(f"Recall: {recall_score(y_val, y_val_pred_5)}")
print(f"F1 Score: {f1_score(y_val, y_val_pred_5)}")

y_val_pred_proba_5 = stack_model.predict_proba(X_val)[:, 1]
print(f"ROC AUC Score: {roc_auc_score(y_val, y_val_pred_proba_5)}")
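With `cv=5`, `StackingClassifier` trains the final estimator on out-of-fold predictions: each training row is scored by base models that did not see it during fitting. A simplified sketch of that bookkeeping, assuming a stand-in "model" that just predicts the positive rate of its training folds. The helper names are hypothetical, and scikit-learn uses stratified folds by default for classifiers, which this sketch omits.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(labels, k=5):
    """Score every row with a 'model' fitted on the other k-1 folds only."""
    oof = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        # Stand-in model: predicted probability = positive rate of the training folds
        positive_rate = sum(train) / len(train)
        for i in held_out:
            oof[i] = positive_rate
    return oof

toy_labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
oof = out_of_fold_predictions(toy_labels, k=5)
print(oof)
```

In the real ensemble there is one such out-of-fold column per base model, and those columns are the input features of the final estimator, here AdaBoost.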

## **5) Final evaluation on the test set**

In [None]:
y_test_pred_ST = stack_model.predict(X_test)

print(f"Confusion matrix for stacking classifier:\n {confusion_matrix(y_test, y_test_pred_ST)}")
print(f"Recall: {recall_score(y_test, y_test_pred_ST)}")
print(f"Precision:  {precision_score(y_test, y_test_pred_ST)}")
print(f"F1 Score: {f1_score(y_test, y_test_pred_ST)}")
print("Classification Report, Stacking Classifier:")
print(classification_report(y_test, y_test_pred_ST))

## **6) Threshold tuning**

In [None]:
import matplotlib.pyplot as plt

#ROC curve and AUC on the held-out test set
y_test_proba = stack_model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_test_proba)
roc_auc = roc_auc_score(y_test, y_test_proba)
print(f"AUC Score: {roc_auc:.4f}")

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid()
plt.show()

#Pick the threshold on the validation set, so the test set is used for reporting only
fpr_val, tpr_val, thresholds_val = roc_curve(y_val, y_val_pred_proba_5)
J_scores = tpr_val - fpr_val
best_threshold = thresholds_val[np.argmax(J_scores)]
print(f"Best threshold based on Youden's J statistic: {best_threshold:.4f}")

y_test_pred_new_threshold = (y_test_proba >= best_threshold).astype(int)

print(f"Confusion matrix at new threshold {best_threshold:.4f}:\n {confusion_matrix(y_test, y_test_pred_new_threshold)}")
print(f"Recall: {recall_score(y_test, y_test_pred_new_threshold)}")
print(f"Precision: {precision_score(y_test, y_test_pred_new_threshold)}")
print(f"F1 Score: {f1_score(y_test, y_test_pred_new_threshold)}")
print(classification_report(y_test, y_test_pred_new_threshold))
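Youden's J statistic is J = TPR - FPR (equivalently sensitivity + specificity - 1), and the chosen threshold is the one where J is largest. A brute-force sketch of that search on made-up labels and scores, for illustration only. The function `youden_best_threshold` is hypothetical and scans every observed score instead of using `roc_curve`.

```python
def youden_best_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_thr, best_j = None, float("-inf")
    for thr in sorted(set(scores), reverse=True):
        # Predict positive when the score reaches the threshold
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j

# Made-up labels and predicted probabilities
toy_y = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
thr, j = youden_best_threshold(toy_y, toy_scores)
print(f"best threshold: {thr}, J: {j:.2f}")
```

J weights true positives and false positives equally, so it says nothing about precision directly. On an imbalanced dataset a threshold with high J can still produce many false positives relative to true positives, which is worth checking in the classification report.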

## **7) Save the model**

In [None]:
import joblib
joblib.dump({'model': stack_model, 'threshold': best_threshold}, 'stacking__final.pkl')
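The saved file holds a dictionary, so whoever loads it needs to apply the stored threshold to `predict_proba` output instead of calling `predict`, which uses the model's own default decision rule. A small sketch of such a helper. Both `predict_with_threshold` and the stub model are hypothetical, added only so the example runs without the trained ensemble.

```python
def predict_with_threshold(bundle, X):
    """Label rows as 1 when the positive-class probability reaches the stored threshold."""
    proba = bundle["model"].predict_proba(X)
    return [int(row[1] >= bundle["threshold"]) for row in proba]

class StubModel:
    """Stand-in for the fitted stacking model; treats each input as its own probability."""
    def predict_proba(self, X):
        return [[1.0 - p, p] for p in X]

# With the real file this would be: bundle = joblib.load('stacking__final.pkl')
bundle = {"model": StubModel(), "threshold": 0.4}
labels = predict_with_threshold(bundle, [0.1, 0.4, 0.75])
print(labels)
```

Files written by `joblib.dump` are pickle-based, so they should be loaded with the same library versions that were pinned at the top of the notebook, and only from a trusted source.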

## **Thank you for reading my notebook!**