In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
collected_data = pd.read_csv('data/collected_data.csv') # Read the CSV file

keep_col = ['sahm', 'indpro', 'sp500', 'tr10', 't10yff', 'unrate', 'pcepi', 'payems', 'fedfunds', 'date', 'recession']
# Keep only the selected columns and use the 'date' column as the index.
# set_index() returns a new DataFrame, which avoids modifying a slice in place.
data = collected_data[keep_col].set_index('date')
del collected_data
data

date,sahm,indpro,sp500,tr10,t10yff,unrate,pcepi,payems,fedfunds,recession
1962-02-01,-0.17,0.016229,0.016139,-0.043737,1.650556,0.055,0.042,0.827913,0.0237,0
1962-03-01,-0.17,0.005350,-0.005878,-0.108990,1.078182,0.056,0.021,0.819544,0.0285,0
1962-04-01,-0.10,0.002130,-0.063973,-0.087455,1.043000,0.056,0.018,0.829109,0.0278,0
1962-05-01,-0.07,-0.001066,-0.089914,0.030636,1.441818,0.055,0.010,0.817113,0.0236,0
1962-06-01,0.00,-0.002132,-0.085381,0.035411,1.206667,0.055,0.010,0.816714,0.0268,0
...,...,...,...,...,...,...,...,...,...,...
2024-04-01,0.37,-0.000690,-0.042506,0.330591,-0.790909,0.039,0.322,0.820381,0.0533,0
2024-05-01,0.37,0.007466,0.046904,-0.056818,-0.847727,0.040,-0.010,0.824685,0.0533,0
2024-06-01,0.43,0.000622,0.034082,-0.177010,-1.024737,0.041,0.145,0.820780,0.0533,0
2024-07-01,0.53,-0.009475,0.011258,-0.056627,-1.081364,0.043,0.189,0.819624,0.0533,0


In [4]:
# get X and y 
X = data.drop(['recession'], axis=1)
y = data['recession']

# We define the training period.
X_train, y_train = X.loc["1962-02-01":"2012-12-01"], y.loc["1962-02-01":"2012-12-01"]
# We define the test period.
# X_test, y_test = X.loc["2013-01-01":], y.loc["2013-01-01":]

In [27]:
y_train.describe()

count    611.000000
mean       0.135843
std        0.342902
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: recession, dtype: float64

### 1. Metrics

Here are the main metrics we can use to get the **final score** for each **model**:

1. **Precision**: It measures the proportion of true positive predictions relative to all positive predictions. It is **important** when you **want to minimize false positives**.<br>Precision = TP/(TP+FP)
   
2. **Recall** (Sensitivity): It measures the proportion of true positive predictions to all actual positive cases. It is useful when you want to minimize false negatives.<br>Recall = TP/(TP+FN)
   
3. **F1-score**: A metric that balances precision and recall, calculated as their harmonic mean. It is useful when you need both high precision and high recall, because the harmonic mean drops sharply when either component is low.<br>F1 = 2\*Precision\*Recall/(Precision+Recall)
   
4. **Confusion Matrix**: Visualizes true and predicted classes, which can help better understand model performance.
    
5. **Specificity**: It measures the proportion of true negative predictions relative to all actual negative cases. It is useful when you want to **minimize false positives**.<br>Specificity = TN/(TN+FP)
    
6. **Accuracy**: It measures the proportion of correctly predicted cases (both positive and negative) relative to the total number of cases. Accuracy **can be misleading with unbalanced data**, as it can be high even if the model rarely predicts the minority class correctly.<br>Accuracy = (TP+TN)/(TP+TN+FP+FN)
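
As a quick worked example of the formulas above, the sketch below plugs in the confusion-matrix counts reported for the Logistic Regression run in section 2 (TN=452, FP=76, FN=25, TP=58). The `baseline_accuracy` variable is an illustrative addition: it is the accuracy of a hypothetical model that always predicts "no recession".

```python
# Confusion-matrix counts copied from the Logistic Regression output in section 2.
TN, FP, FN, TP = 452, 76, 25, 58

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)

# Hypothetical baseline: always predict class 0 (no recession).
baseline_accuracy = (TN + FP) / (TP + TN + FP + FN)

print(f"Precision:         {precision:.4f}")
print(f"Recall:            {recall:.4f}")
print(f"F1-score:          {f1:.4f}")
print(f"Specificity:       {specificity:.4f}")
print(f"Accuracy:          {accuracy:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

The always-negative baseline reaches a higher accuracy than the fitted model while having zero recall, which is why accuracy alone is a poor score for this dataset.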

### 2. Logistic Regression

With the **class_weight='balanced'** option, Logistic Regression weights each class inversely to its frequency, which lets it handle unbalanced data. Even so, if the classes are very imbalanced, the model may still be biased towards the majority class. It works well for moderate imbalance, but **may need additional techniques to deal with a large imbalance**.
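
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weighting formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`, for the class counts of `y_train` (528 non-recession months, 83 recession months). The variable names are illustrative only.

```python
import numpy as np

# Class counts of y_train as printed in this notebook: [class 0, class 1].
class_counts = np.array([528, 83])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# Formula used by scikit-learn for class_weight='balanced'.
balanced_weights = n_samples / (n_classes * class_counts)

print("Weight for class 0:", round(float(balanced_weights[0]), 4))
print("Weight for class 1:", round(float(balanced_weights[1]), 4))

# After weighting, both classes contribute the same total weight to the loss.
print("Total weight per class:", class_counts * balanced_weights)
```

Each recession month therefore counts roughly six times as much as a non-recession month when the model is fitted.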

In [31]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix

# Checking the distribution of classes in the target variable (y_train).
print("Class distribution in y_train:", np.bincount(y_train))
# Make sure both classes are present in the training data.
if np.bincount(y_train).shape[0] < 2:
    print("One of the classes in the training data is missing.")
else:
    # Stratified cross-validation with 5 folds.
    # Note: shuffle=True mixes months across folds, so the splits ignore time order.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Initialize logistic regression with class_weight='balanced'.
    model = LogisticRegression(class_weight='balanced', random_state=42)

    # We use cross_val_predict to predict the results during cross-validation.
    y_pred = cross_val_predict(model, X_train, y_train, cv=skf)

    # Calculation of model evaluation metrics.
    precision = precision_score(y_train, y_pred, average='binary')
    recall = recall_score(y_train, y_pred, average='binary')
    f1 = f1_score(y_train, y_pred, average='binary')
    accuracy = accuracy_score(y_train, y_pred)
    conf_matrix = confusion_matrix(y_train, y_pred)

    # Extracting the values from the confusion matrix.
    TN, FP, FN, TP = conf_matrix.ravel()

    # Specificity: share of actual negative examples that the model classifies correctly.
    specificity = TN / (TN + FP)

    # Print the results.
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Confusion Matrix: \n{conf_matrix}")
    print(f"Specificity: {specificity:.4f}")

Class distribution in y_train: [528  83]
Precision: 0.4328
Recall: 0.6988
F1-Score: 0.5346
Accuracy: 0.8347
Confusion Matrix: 
[[452  76]
 [ 25  58]]
Specificity: 0.8561


### 3. Decision Tree

Decision Trees can handle unbalanced data, but without regularization they tend to be biased towards the majority class if the classes are highly imbalanced. **Not the best choice for a large imbalance if no balancing techniques are used.**
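
A minimal sketch of one way to apply both ideas, regularization and class balancing, to a Decision Tree. It runs on synthetic data from `make_classification` as a stand-in for `X_train` / `y_train`, and the `max_depth` and `min_samples_leaf` values are arbitrary assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data, with roughly 14% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.86, 0.14], random_state=42
)

# class_weight='balanced' reweights the classes; max_depth and
# min_samples_leaf limit tree growth (illustrative values only).
tree = DecisionTreeClassifier(
    class_weight='balanced', max_depth=4, min_samples_leaf=10, random_state=42
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred_tree = cross_val_predict(tree, X_demo, y_demo, cv=skf)

tree_recall = recall_score(y_demo, y_pred_tree)
print(f"Recall: {tree_recall:.4f}")
```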

### 4. Random Forest

Random Forest is more robust to unbalanced classes, especially when used with the **class_weight='balanced_subsample'** option, which recomputes the class weights on the bootstrap sample of each tree to compensate for the imbalance. This makes it more stable than a single Decision Tree. **Very suitable for unbalanced data if properly set up.**
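
The option can be tried with the same cross-validation pattern used for Logistic Regression. The sketch below is an assumption-laden illustration: it uses synthetic data in place of `X_train` / `y_train`, and `n_estimators=100` is simply the scikit-learn default written out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the training data, with roughly 14% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.86, 0.14], random_state=42
)

# 'balanced_subsample' computes class weights per bootstrap sample.
forest = RandomForestClassifier(
    n_estimators=100, class_weight='balanced_subsample', random_state=42
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred_forest = cross_val_predict(forest, X_demo, y_demo, cv=skf)

forest_cm = confusion_matrix(y_demo, y_pred_forest)
print(f"Confusion Matrix: \n{forest_cm}")
```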

### 5. XGBoost

XGBoost has the **scale_pos_weight** parameter, which rescales the weight of the positive class to correct the imbalance between classes. This makes it one of the most suitable models **to deal with severe imbalance**.
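
The XGBoost documentation suggests the ratio of negative to positive examples as a typical value for `scale_pos_weight`. The sketch below computes that ratio from the class counts of `y_train`; the `xgb_params` dictionary is a hypothetical name, and the model construction is left as a comment because it assumes the `xgboost` package is installed.

```python
# Class counts of y_train as printed in this notebook.
n_negative, n_positive = 528, 83

# Typical starting value: sum(negative instances) / sum(positive instances).
scale_pos_weight = n_negative / n_positive
print(f"scale_pos_weight: {scale_pos_weight:.4f}")

# Hypothetical parameter dictionary for the classifier.
xgb_params = {"scale_pos_weight": scale_pos_weight}

# With xgboost installed, the value could be passed like this:
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params)
```

This is a starting point rather than a tuned value, so it is worth checking against precision and recall in cross-validation.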

### 6. CatBoost

CatBoost also supports imbalance correction through the **class_weights** parameter, and its **auto_class_weights** option can derive the weights from the class frequencies automatically. Good **for dealing with unbalanced classes**.

### 7. SVM

SVM can handle unbalanced classes through the **class_weight='balanced'** parameter. It works well with moderate imbalance, but **may struggle with severe imbalance**, where it may not be the best solution.
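
A minimal sketch of a class-weighted SVM, again on synthetic data rather than `X_train` / `y_train`. The `StandardScaler` step is an addition not mentioned above: SVMs are sensitive to feature scale, so the sketch standardizes the features inside a pipeline so that scaling is refitted within each fold.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data, with roughly 14% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.86, 0.14], random_state=42
)

# Scale the features, then fit an SVM with balanced class weights.
svm_model = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred_svm = cross_val_predict(svm_model, X_demo, y_demo, cv=skf)

svm_f1 = f1_score(y_demo, y_pred_svm)
print(f"F1-Score: {svm_f1:.4f}")
```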