What You're Aiming For

In this checkpoint, we are going to work on the 'Systemic Crisis, Banking Crisis, Inflation Crisis In Africa' dataset provided by Kaggle.

Dataset description: This dataset covers the banking, debt, financial, inflation and systemic crises that occurred from 1860 to 2014 in 13 African countries: Algeria, Angola, Central African Republic, Ivory Coast, Egypt, Kenya, Mauritius, Morocco, Nigeria, South Africa, Tunisia, Zambia and Zimbabwe. The objective of the ML model is to predict the likelihood that a systemic crisis emerges, given a set of indicators such as the annual inflation rate.

 ➡️ Dataset link



Instructions

Import your data and perform a basic data exploration phase
Display general information about the dataset
Create a pandas profiling report to gain insights into the dataset
Handle missing and corrupted values
Remove duplicates, if they exist
Handle outliers, if they exist
Encode categorical features
Select your target variable and the features
Split your dataset into training and test sets
Based on your data exploration phase, select an ML classification algorithm and train it on the training set
Assess your model's performance on the test set using relevant evaluation metrics
Discuss with your cohort alternative ways to improve your model's performance

In [30]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
# Import necessary libraries for ML models
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix, roc_auc_score

In [31]:
# Load dataset
df = pd.read_csv('African_crises_dataset.csv')

In [32]:
# Display basic information about the dataset
print("Basic Info:")
df.info()  # Gives overview of column types, missing values
print("\nBasic Statistics:")
print(df.describe())  # Provides basic statistics (mean, std, etc.)

# Display first few rows
print("\nFirst 5 rows of the dataset:")
print(df.head())

Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   country_number                   1059 non-null   int64  
 1   country_code                     1059 non-null   object 
 2   country                          1059 non-null   object 
 3   year                             1059 non-null   int64  
 4   systemic_crisis                  1059 non-null   int64  
 5   exch_usd                         1059 non-null   float64
 6   domestic_debt_in_default         1059 non-null   int64  
 7   sovereign_external_debt_default  1059 non-null   int64  
 8   gdp_weighted_default             1059 non-null   float64
 9   inflation_annual_cpi             1059 non-null   float64
 10  independence                     1059 non-null   int64  
 11  currency_crises                  1059 non-null   int64  
 12  inflatio

In [33]:
# Create a pandas profiling report to gain insights into the dataset
profile = ProfileReport(df, title="Profiling Report", explorative=True)
profile.to_file("systemic_crisis_dataset_report.html")  # Saves the report


In [34]:
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())  # Identifies missing values per column


Missing Values:
country_number                     0
country_code                       0
country                            0
year                               0
systemic_crisis                    0
exch_usd                           0
domestic_debt_in_default           0
sovereign_external_debt_default    0
gdp_weighted_default               0
inflation_annual_cpi               0
independence                       0
currency_crises                    0
inflation_crises                   0
banking_crisis                     0
dtype: int64


In [35]:
# Checking for duplicate rows 
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")


Number of duplicate rows: 0


In [36]:
# Visualizing the potential outliers in inflation_annual_cpi
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['inflation_annual_cpi'])
plt.title("Boxplot of Inflation Annual CPI")
plt.show()



In [37]:

# Checking the quantiles to identify extreme outliers
quantiles = df['inflation_annual_cpi'].quantile([0.25, 0.5, 0.75, 0.90, 0.95, 0.99])
print(f"\nQuantiles of inflation_annual_cpi:\n{quantiles}")

# Cap extreme values at the 99th percentile (upper-tail winsorization)
upper_bound = df['inflation_annual_cpi'].quantile(0.99)
df['inflation_annual_cpi'] = df['inflation_annual_cpi'].clip(upper=upper_bound)


Quantiles of inflation_annual_cpi:
0.25      2.086162
0.50      5.762330
0.75     11.644048
0.90     23.975404
0.95     46.015246
0.99    269.604580
Name: inflation_annual_cpi, dtype: float64
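
The percentile cap above is one way to handle outliers. As a point of comparison, here is a minimal sketch of Tukey's 1.5 × IQR rule, which flags values far from the quartiles and caps them at the fences. The `inflation` series is made-up toy data standing in for `df['inflation_annual_cpi']`, and the fence multiplier of 1.5 is the conventional default, not something the notebook prescribes.

```python
import pandas as pd

# Toy series standing in for df['inflation_annual_cpi'] (values are made up)
inflation = pd.Series([2.0, 3.5, 5.8, 7.1, 11.6, 24.0, 46.0, 500.0])

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = inflation.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outlier_mask = (inflation < lower_fence) | (inflation > upper_fence)

# Cap flagged values at the fences; Series.clip does this in one call
capped = inflation.clip(lower=lower_fence, upper=upper_fence)

print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Values flagged as outliers: {int(outlier_mask.sum())}")
```

On a heavily skewed variable the IQR rule tends to flag many more rows than a 99th-percentile cap, so the choice between the two is a judgment call about how much of the tail carries real signal.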


In [38]:
# Encode categorical features using one-hot encoding for 'country' and 'country_code'
df_encoded = pd.get_dummies(df, columns=['country', 'country_code'], drop_first=True)

# Encode 'banking_crisis' to 0/1 where 'crisis' is 1 and 'no_crisis' is 0
df_encoded['banking_crisis'] = df_encoded['banking_crisis'].map({'crisis': 1, 'no_crisis': 0})
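
One thing worth checking at this step: `country_number`, `country_code` and `country` all identify the same entity, so encoding more than one of them adds redundant columns. The sketch below uses a small made-up frame (not the real data) to verify the one-to-one relationship and encode a single identifier. Whether to drop the other two is an assumption of this sketch, not a step taken in the cell above.

```python
import pandas as pd

# Toy frame mimicking the three country identifiers (rows are illustrative)
toy = pd.DataFrame({
    "country_number": [1, 1, 2, 2],
    "country_code": ["DZA", "DZA", "AGO", "AGO"],
    "country": ["Algeria", "Algeria", "Angola", "Angola"],
    "banking_crisis": ["crisis", "no_crisis", "no_crisis", "crisis"],
})

# If every country maps to exactly one code, the columns carry the same information
codes_per_country = toy.groupby("country")["country_code"].nunique()
print(f"One code per country: {codes_per_country.max() == 1}")

# Keep a single identifier and one-hot encode it
encoded = pd.get_dummies(
    toy.drop(columns=["country_number", "country_code"]),
    columns=["country"],
    drop_first=True,
)

# Map the string labels to 0/1; any label missing from the dict would become NaN
encoded["banking_crisis"] = encoded["banking_crisis"].map({"crisis": 1, "no_crisis": 0})
assert encoded["banking_crisis"].notna().all(), "unexpected label in banking_crisis"

print(encoded.columns.tolist())
```

The `notna()` check is a cheap guard: `Series.map` with a dict silently produces NaN for unseen labels, which would otherwise surface only when a model refuses to fit.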

In [39]:
# Display first few rows of the encoded dataset
print("\nFirst 5 rows of the encoded dataset:")
print(df_encoded.head())


First 5 rows of the encoded dataset:
   country_number  year  systemic_crisis  exch_usd  domestic_debt_in_default  \
0               1  1870                1  0.052264                         0   
1               1  1871                0  0.052798                         0   
2               1  1872                0  0.052274                         0   
3               1  1873                0  0.051680                         0   
4               1  1874                0  0.051308                         0   

   sovereign_external_debt_default  gdp_weighted_default  \
0                                0                   0.0   
1                                0                   0.0   
2                                0                   0.0   
3                                0                   0.0   
4                                0                   0.0   

   inflation_annual_cpi  independence  currency_crises  ...  country_code_DZA  \
0              3.441456             0  

In [40]:
# Features and target
X = df_encoded.drop('systemic_crisis', axis=1)
y = df_encoded['systemic_crisis']
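
Before splitting, it can help to look at how balanced the target is, since that sets the baseline any accuracy figure has to beat. The classification reports in this notebook show 294 of 318 test rows in class 0, so always predicting "no crisis" would already score roughly 92% accuracy. A minimal sketch of that check, using a made-up target series rather than the real `y`:

```python
import pandas as pd

# Toy target standing in for df_encoded['systemic_crisis'] (values are made up)
y_demo = pd.Series([0] * 46 + [1] * 4, name="systemic_crisis")

# Share of each class, ordered by label
class_share = y_demo.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy of a model that always predicts the majority class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

With a baseline this high, recall and F1 on the crisis class are usually more informative than accuracy alone.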

In [41]:
# Train-test split with 70-30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
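
Because crises are rare, a purely random split can leave the test set with noticeably more or fewer positive rows than the training set. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio close to identical in both subsets. The data here is synthetic (`X_demo`, `y_demo` are hypothetical names), and adding `stratify=y` to the split above would change the reported numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 1000 rows, 8% positives (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:80] = 1

# stratify=y_demo preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Positive rate, train: {y_tr.mean():.3f}")
print(f"Positive rate, test:  {y_te.mean():.3f}")
```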

In [42]:
# Dictionary to store models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    'Naive Bayes': GaussianNB()
}

# Loop through each model, train it, and evaluate performance
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on the test set
    
    # Evaluate model performance using multiple metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]) if hasattr(model, "predict_proba") else "N/A"
    
    print(f"\nModel: {name}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
    print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
    print("=" * 60)

# Improving Model Performance (Hyperparameter Tuning Example for Random Forest)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Exhaustive grid search for hyperparameter tuning
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters for Random Forest: {grid_search.best_params_}")
best_rf = grid_search.best_estimator_

# Evaluate tuned Random Forest model
y_pred_tuned = best_rf.predict(X_test)
print(f"Tuned Random Forest Accuracy: {accuracy_score(y_test, y_pred_tuned):.4f}")
print(f"Tuned Random Forest Classification Report:\n{classification_report(y_test, y_pred_tuned)}")
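
In the evaluation loop, any model without `predict_proba` reports "N/A" for ROC-AUC. `roc_auc_score` also accepts non-thresholded decision values, so one possible fallback is `decision_function`. The helper name `positive_class_scores` and the synthetic data are assumptions of this sketch. It also relies on `SVC()` not exposing `predict_proba` under its default settings, which is the same behaviour the loop above depends on.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def positive_class_scores(model, X):
    """Return continuous scores for the positive class, or None if unavailable."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return None


# Synthetic imbalanced binary data standing in for the crisis features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

svm = SVC().fit(X_tr, y_tr)
scores = positive_class_scores(svm, X_te)
auc = roc_auc_score(y_te, scores)
print(f"SVM ROC-AUC: {auc:.4f}")
```

Ranking metrics like ROC-AUC only need scores that order the samples, which is why signed distances to the decision boundary work as well as probabilities here.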


Model: Logistic Regression
Accuracy: 0.9843
F1 Score: 0.8837
ROC-AUC: 0.996031746031746
Confusion Matrix:
[[294   0]
 [  5  19]]
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       294
           1       1.00      0.79      0.88        24

    accuracy                           0.98       318
   macro avg       0.99      0.90      0.94       318
weighted avg       0.98      0.98      0.98       318


Model: KNN
Accuracy: 0.9403
F1 Score: 0.5366
ROC-AUC: 0.9134778911564625
Confusion Matrix:
[[288   6]
 [ 13  11]]
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       294
           1       0.65      0.46      0.54        24

    accuracy                           0.94       318
   macro avg       0.80      0.72      0.75       318
weighted avg       0.93      0.94      0.94       318


Model: Decision Tree
Accuracy: 0.9874
F1 Score: 0.9200




Best Parameters for Random Forest: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 100}
Tuned Random Forest Accuracy: 0.9843
Tuned Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       294
           1       0.91      0.88      0.89        24

    accuracy                           0.98       318
   macro avg       0.95      0.93      0.94       318
weighted avg       0.98      0.98      0.98       318
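
As a starting point for the discussion on improving performance, two candidate changes are sketched here: scaling the features inside a `Pipeline` (KNN, SVM and logistic regression are sensitive to feature scale) and tuning on F1 with class weighting instead of plain accuracy, since the positive class is rare. This is a sketch on synthetic data, not a result for the crisis dataset. The grid values and the choice of logistic regression are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the crisis features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.92, 0.08], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is refit on each CV training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Tune the regularization strength, scoring on F1 of the positive class
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, search.predict(X_te))
print(f"Best params: {search.best_params_}")
print(f"Test F1: {test_f1:.4f}")
```

Keeping the scaler inside the pipeline avoids leaking test-fold statistics into training, which matters more as the preprocessing grows.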

