In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, accuracy_score

In [2]:
collected_data = pd.read_csv('data/collected_data.csv') # Read the CSV file

keep_col = ['sahm', 'indpro', 'sp500', 'tr10', 't10yff', 'unrate', 'pcepi', 'payems', 'recession']
data = collected_data[keep_col].copy()  # copy so later column assignments do not act on a view
del collected_data
# data['date'] = pd.to_datetime(data['date'])
# data['date'] = (data['date'] - data['date'].min()).dt.days  # Convert to days.
data

      sahm     indpro      sp500       tr10     t10yff  unrate   pcepi    payems  recession
  0  -0.17   0.016229   0.016139  -0.043737   1.650556   0.055   0.042  0.827913          0
  1  -0.17   0.005350  -0.005878  -0.108990   1.078182   0.056   0.021  0.819544          0
  2  -0.10   0.002130  -0.063973  -0.087455   1.043000   0.056   0.018  0.829109          0
  3  -0.07  -0.001066  -0.089914   0.030636   1.441818   0.055   0.010  0.817113          0
  4   0.00  -0.002132  -0.085381   0.035411   1.206667   0.055   0.010  0.816714          0
...    ...        ...        ...        ...        ...     ...     ...       ...        ...
746   0.37  -0.000690  -0.042506   0.330591  -0.790909   0.039   0.322  0.820381          0
747   0.37   0.007466   0.046904  -0.056818  -0.847727   0.040  -0.010  0.824685          0
748   0.43   0.000622   0.034082  -0.177010  -1.024737   0.041   0.145  0.820780          0
749   0.53  -0.009475   0.011258  -0.056627  -1.081364   0.043   0.189  0.819624          0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751 entries, 0 to 750
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   sahm       751 non-null    float64
 1   indpro     751 non-null    float64
 2   sp500      751 non-null    float64
 3   tr10       751 non-null    float64
 4   t10yff     751 non-null    float64
 5   unrate     751 non-null    float64
 6   pcepi      751 non-null    float64
 7   payems     751 non-null    float64
 8   recession  751 non-null    int64  
dtypes: float64(8), int64(1)
memory usage: 52.9 KB


In [4]:
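# Split the features (X) from the target (y).
X = data.drop(columns=['recession'])
y = data['recession']

# NOTE: the date-string slices below require X and y to be indexed by date
# (a DatetimeIndex). On the default RangeIndex reported by data.info() above,
# they raise a TypeError, so the date index must be set before this cell runs.

# Training period.
X_train, y_train = X.loc["1962-02-01":"2012-12-01"], y.loc["1962-02-01":"2012-12-01"]
# Test period.
X_test, y_test = X.loc["2013-01-01":], y.loc["2013-01-01":]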
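
The date-based split above depends on the rows being indexed by date. As a minimal sketch, assuming monthly observations that start in 1962-02, the cell below builds a synthetic stand-in frame and applies the same slices. The `demo` frame, its random values, and the monthly frequency are assumptions for illustration only, not the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 751 monthly rows (the row count
# reported by data.info()), starting at the first training date.
dates = pd.date_range("1962-02-01", periods=751, freq="MS")
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "unrate": rng.normal(0.05, 0.01, len(dates)),   # synthetic feature
        "recession": rng.integers(0, 2, len(dates)),    # synthetic 0/1 target
    },
    index=dates,
)

X_demo = demo.drop(columns=["recession"])
y_demo = demo["recession"]

# Label-based slicing on a DatetimeIndex includes both endpoints.
X_tr, y_tr = X_demo.loc["1962-02-01":"2012-12-01"], y_demo.loc["1962-02-01":"2012-12-01"]
X_te, y_te = X_demo.loc["2013-01-01":], y_demo.loc["2013-01-01":]

print(len(X_tr), len(X_te))
```

Splitting by date rather than at random keeps every test observation later than every training observation, which avoids leaking future information into the fit.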

### 1. Metrics

Here are the main metrics we can use to get the **final score** for each **model**:

1. **Precision**: It measures the proportion of true positive predictions relative to all positive predictions. It is **important** when you **want to minimize false positives**.<br>Precision = TP/(TP+FP)
   
2. **Recall** (Sensitivity): It measures the proportion of true positive predictions to all actual positive cases. It is useful when you want to minimize false negatives.<br>Recall = TP/(TP+FN)
   
3. **F1-score**: The harmonic mean of precision and recall. It is useful when you want both to be high, because a low value of either component pulls the score down sharply.<br>F1 = 2\*Precision\*Recall/(Precision+Recall)
   
4. **Confusion Matrix**: Tabulates the counts of true versus predicted classes (TP, FP, FN, TN), which shows where the model's errors fall.
    
5. **Specificity**: It measures the proportion of true negative predictions relative to all actual negative cases. It is useful when you want to **minimize false positives**.<br>Specificity = TN/(TN+FP)
    
6. **Accuracy**: It measures the proportion of correctly predicted cases (both positive and negative) relative to the total number of cases. Accuracy **can be misleading with unbalanced data**, as it can be high even if the model predicts the minority class poorly.<br>Accuracy = (TP+TN)/(TP+TN+FP+FN)
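
The formulas above can be checked against scikit-learn on a small hand-made example. This is only a sketch: the `y_true` and `y_pred` arrays are invented for illustration, and specificity is computed by hand from the confusion matrix rather than through a dedicated scikit-learn function.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels: 3 positives out of 10, with one false positive and one false negative.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# The hand-computed values agree with the library functions.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
print(accuracy, accuracy_score(y_true, y_pred))
print(specificity)
```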

### 2. Logistic Regression

With the **class_weight='balanced'** option, Logistic Regression weights each class inversely to its frequency, which lets it handle unbalanced data. It works well for moderate imbalance, but when the classes are very imbalanced the model can still be biased towards the majority class and **may need additional techniques to deal with a large imbalance**.
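
A minimal sketch of this option on synthetic data follows. The data set from `make_classification` and the names `X_syn` and `y_syn` are assumptions for illustration. The `compute_class_weight` call only makes visible the weights that `class_weight='balanced'` implies, namely `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic data with roughly 90% negatives and 10% positives (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Weights implied by class_weight='balanced'.
classes = np.unique(y_syn)
weights = compute_class_weight("balanced", classes=classes, y=y_syn)
print(dict(zip(classes.tolist(), weights.tolist())))

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_syn, y_syn)
pred = clf.predict(X_syn)
```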

### 3. Decision Tree

Decision Trees can handle unbalanced data, but without class weighting or resampling they tend to be biased towards the majority class when the classes are highly imbalanced. **Not the best choice for a large imbalance if no balancing techniques are used.**
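
As a hedged sketch, a tree can be given the same class weighting as the other scikit-learn models. The synthetic data and the `max_depth` and `min_samples_leaf` values below are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# class_weight='balanced' reweights the impurity computation so the
# minority class counts as much as the majority class.
tree = DecisionTreeClassifier(
    class_weight="balanced", max_depth=4, min_samples_leaf=10, random_state=0
)
tree.fit(X_syn, y_syn)
pred = tree.predict(X_syn)

# In-sample recall on the minority class, shown only to illustrate the API.
print(recall_score(y_syn, pred))
```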

### 4. Random Forest

Random Forest is more robust to unbalanced classes than a single Decision Tree, especially when used with the **class_weight='balanced_subsample'** option, which recomputes the class weights on the bootstrap sample of each tree. **Very suitable for unbalanced data if properly set up.**
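
A minimal sketch of that configuration, again on synthetic data (the data set and the number of trees are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# 'balanced_subsample' computes the balanced weights per bootstrap sample,
# instead of once on the full training set.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced_subsample", random_state=0
)
forest.fit(X_syn, y_syn)

# Class probabilities allow choosing a decision threshold other than 0.5.
proba = forest.predict_proba(X_syn)
```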

### 5. XGBoost

XGBoost has the **scale_pos_weight** parameter, which scales the weight of the positive class to correct the imbalance between classes; a common starting value is the ratio of negative to positive samples. This makes it one of the most suitable models **to deal with severe imbalance**.
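
The ratio can be computed directly from the labels. The sketch below uses invented labels and features, and it treats `xgboost` as an optional dependency, fitting the model only when the package is installed. The hyperparameter values are illustrative assumptions.

```python
import numpy as np

# Invented labels: 90 negatives and 10 positives.
y_syn = np.array([0] * 90 + [1] * 10)
# Invented features, shifted by the label so the classes are separable.
X_syn = np.random.default_rng(0).normal(size=(100, 3)) + y_syn[:, None]

# Ratio of negative to positive samples.
n_neg = int((y_syn == 0).sum())
n_pos = int((y_syn == 1).sum())
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)

try:
    from xgboost import XGBClassifier  # optional dependency
except ImportError:
    XGBClassifier = None

if XGBClassifier is not None:
    model = XGBClassifier(
        n_estimators=50, max_depth=3, scale_pos_weight=scale_pos_weight
    )
    model.fit(X_syn, y_syn)
```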

### 6. CatBoost

CatBoost also supports imbalance correction through the **class_weights** parameter, and its **auto_class_weights** option can derive the weights from the class frequencies automatically. Good **for dealing with unbalanced classes**.
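
A hedged sketch of explicit class weights follows. The labels and features are invented, the weighting scheme (majority class at 1, minority class at the negative-to-positive ratio) is one simple choice among several, and `catboost` is treated as an optional dependency that is only used when installed.

```python
import numpy as np

# Invented labels: 90 negatives and 10 positives.
y_syn = np.array([0] * 90 + [1] * 10)
# Invented features, shifted by the label so the classes are separable.
X_syn = np.random.default_rng(0).normal(size=(100, 3)) + y_syn[:, None]

# Weight the minority class by the negative-to-positive ratio.
counts = np.bincount(y_syn)
class_weights = {0: 1.0, 1: float(counts[0] / counts[1])}
print(class_weights)

try:
    from catboost import CatBoostClassifier  # optional dependency
except ImportError:
    CatBoostClassifier = None

if CatBoostClassifier is not None:
    model = CatBoostClassifier(
        iterations=50,
        class_weights=class_weights,
        verbose=0,
        allow_writing_files=False,  # avoid creating a catboost_info directory
    )
    model.fit(X_syn, y_syn)
```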

### 7. SVM

SVM can handle unbalanced classes by using the **class_weight='balanced'** parameter, which scales the misclassification penalty of each class inversely to its frequency. It works well with moderate imbalance, but **may struggle with severe imbalance**.
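
A minimal sketch of a class-weighted SVM on synthetic data. The data set, the RBF kernel, and the scaling step are assumptions for illustration. Scaling is included because SVMs are sensitive to feature magnitudes.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Standardize the features, then fit an SVM whose penalty C is scaled
# per class by the balanced class weights.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced"),
)
svm.fit(X_syn, y_syn)
pred = svm.predict(X_syn)
```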