<a href="https://colab.research.google.com/github/RafaelAnga/Artificial-Intelligence/blob/main/Supervised-Learning/Classification/XGBoost_Bankruptcy_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bankruptcy Prediction for Colombian Companies Using XGBoost
By: Rafael Angarita

## Summary of the Code:
This project aims to predict the likelihood of bankruptcy for Colombian companies using financial data. The dataset includes financial metrics such as gross income, total equity, liabilities, and current assets. The model leverages XGBoost and advanced sampling techniques like SMOTEENN to handle the highly imbalanced dataset, where bankruptcy cases are significantly fewer than non-bankruptcy cases. Multiple pipelines with different sampling strategies are compared to identify the best-performing model.

**Business Applications:**
This model can be applied in various industries, particularly in finance and risk management, to:

**Credit Risk Assessment:** Help banks and financial institutions evaluate the risk of lending to companies.

**Investment Decision-Making:** Assist investors in identifying financially stable companies.

**Corporate Governance:** Enable companies to monitor their financial health and take corrective actions.

**Policy Development:**
Aid policymakers in understanding bankruptcy trends and creating supportive regulations.

# Component 1: Data Understanding

*   Which attributes (columns) in the database look most promising?


1.   Ganancia bruta (Gross income)
2.   Ingresos de actividades ordinarias (Ordinary activity Income)
3.   Patrimonio Total (Total Equity)
4.   Total Pasivos (Total passive income)
5.   Pasivos corrientes totales (Total Current Liabilities)
6.   Activos corientes totales (Total Current Assets)


1.   Which attributes do not seem relevant and can be excluded?


*   Sector: If this is categorical, it might not add predictive value unless its being analized by industry.
*   Index: Can be used to identify and keep in the data but if it doesnt have a correlation to backruptcy then it can be excluded.


2.   Is there enough data to draw general conclusions or make accurate predictions?


*   While there is enough train data with 14097 observations/instances there is big imbalance, with observations events of non-bankruptcy and 247 observations of bankruptcy, with 13 attributes including the target 'event'.
*   In terms of the test data, there are 6042 observations with 12 attributes


3.  Do you have sufficient attributes for your modeling method?
*  The provided attributes are common financial indicators that can predict bankruptcy.

4.   Are you merging multiple data sources? If so, are there any areas that may pose problems when merging?
* No merging will be necesary

5.  Have you considered how missing values are handled in each data source?
* The missing val






## Step 1: Initial data collection

* Imported the dataset from Google Drive.
* Verified data dimensions: 14,097 rows, 13 columns for training; 6,042 rows, 12 columns for testing.

In [14]:
# Suppress warnings and unnecessary logs
import warnings
warnings.filterwarnings("ignore")  # Suppress all warnings

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, classification_report
from xgboost import XGBClassifier

In [2]:
from google.colab import drive # Used to connect to google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os    # la librería necesaria para la ruta de la carpeta
os.chdir('/content/drive/MyDrive/Projects/Colombia-bankruptcy-prediction/DataSet')

#mostar lista de archivos
os.listdir()

['example_submission.csv', 'train.csv', 'test.csv']

In [4]:
# Load the training and test datasets
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [5]:
# Display dataset dimensions
print(f"Dataset has {df_train.shape[0]} rows and {df_train.shape[1]} columns.")

Dataset has 14097 rows and 13 columns.


In [6]:
# Check the shape of the data
print(f"Dataset has {df_test.shape[0]} rows and {df_test.shape[1]} columns.")

Dataset has 6042 rows and 12 columns.


In [7]:
# Check data types and non-null values
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14097 entries, 0 to 14096
Data columns (total 13 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   index                               14097 non-null  int64  
 1   Ganancia bruta                      14097 non-null  float64
 2   Ganancia (pérdida)                  14097 non-null  float64
 3   Ingresos de actividades ordinarias  14097 non-null  float64
 4   Costo de ventas                     12335 non-null  float64
 5   Patrimonio total                    14097 non-null  float64
 6   Total pasivos                       14097 non-null  float64
 7   Total de activos                    14097 non-null  float64
 8   Ganancias acumuladas                14061 non-null  float64
 9   Pasivos corrientes totales          14059 non-null  float64
 10  Activos corrientes totales          14092 non-null  float64
 11  Sector                              14097

In [8]:
# Summary statistics
print(df_train.describe())

              index  Ganancia bruta  Ganancia (pérdida)  \
count  14097.000000    1.409700e+04        1.409700e+04   
mean   10156.616372    2.982942e+06        3.962406e+05   
std     5912.146154    7.399979e+06        3.082754e+06   
min        1.000000   -1.531074e+08       -1.679610e+08   
25%     5010.000000    3.795030e+05        3.240000e+02   
50%    10059.000000    1.222298e+06        1.379860e+05   
75%    15301.000000    3.043982e+06        4.941270e+05   
max    24950.000000    3.338862e+08        8.712783e+07   

       Ingresos de actividades ordinarias  Costo de ventas  Patrimonio total  \
count                        1.409700e+04     1.233500e+04      1.409700e+04   
mean                         1.081655e+07     9.011900e+06      7.445149e+06   
std                          2.803366e+07     2.528155e+07      2.014086e+07   
min                          0.000000e+00     0.000000e+00     -7.480299e+07   
25%                          8.917180e+05     4.780630e+05      1.03

## Step 2: Data Description
* Explored data types and identified a highly imbalanced target variable ('event').
* Identified the role of the index is as an indentifier (not part of features).

## Step 3: Data exploration
* Assessed class distribution: 247 bankrupt (1.0) vs. 13,850 non-bankrupt (0.0).
* Considered feature relevance: Highlighted financial metrics as strong predictors.

In [9]:
# Check class distribution for 'Event' (bankruptcy vs non-bankruptcy)
print(df_train['event'].value_counts())

event
0.0    13850
1.0      247
Name: count, dtype: int64


### Exploring the data for both the training and test data

In [10]:
# Check for missing values
print(df_train.isnull().sum())

index                                    0
Ganancia bruta                           0
Ganancia (pérdida)                       0
Ingresos de actividades ordinarias       0
Costo de ventas                       1762
Patrimonio total                         0
Total pasivos                            0
Total de activos                         0
Ganancias acumuladas                    36
Pasivos corrientes totales              38
Activos corrientes totales               5
Sector                                   0
event                                    0
dtype: int64


In [11]:
# Check data types and non-null values
print(df_test.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6042 entries, 0 to 6041
Data columns (total 12 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   index                               6042 non-null   int64  
 1   Ganancia bruta                      6042 non-null   float64
 2   Ganancia (pérdida)                  6042 non-null   float64
 3   Ingresos de actividades ordinarias  6042 non-null   float64
 4   Costo de ventas                     5350 non-null   float64
 5   Patrimonio total                    6042 non-null   float64
 6   Total pasivos                       6042 non-null   float64
 7   Total de activos                    6042 non-null   float64
 8   Ganancias acumuladas                6027 non-null   float64
 9   Pasivos corrientes totales          6034 non-null   float64
 10  Activos corrientes totales          6039 non-null   float64
 11  Sector                              6042 no

In [12]:
# Check for missing values
print(df_test.isnull().sum())

index                                   0
Ganancia bruta                          0
Ganancia (pérdida)                      0
Ingresos de actividades ordinarias      0
Costo de ventas                       692
Patrimonio total                        0
Total pasivos                           0
Total de activos                        0
Ganancias acumuladas                   15
Pasivos corrientes totales              8
Activos corrientes totales              3
Sector                                  0
dtype: int64


# Component 2: Data preparation
-------

## Step 1: Data selection
* Choose relevant columns and rows.

##Step 2: Data cleansing

In [15]:
# Check for missing values and fill them with median/mean values
train_median_cost = df_train['Costo de ventas'].median()
train_median_gains = df_train['Ganancias acumuladas'].median()
train_median_liabilities = df_train['Pasivos corrientes totales'].median()
train_mean_assets = df_train['Activos corrientes totales'].mean()

# Apply to training and test sets
df_train['Costo de ventas'].fillna(train_median_cost, inplace=True)
df_train['Ganancias acumuladas'].fillna(train_median_gains, inplace=True)
df_train['Pasivos corrientes totales'].fillna(train_median_liabilities, inplace=True)
df_train['Activos corrientes totales'].fillna(train_mean_assets, inplace=True)

df_test['Costo de ventas'].fillna(train_median_cost, inplace=True)
df_test['Ganancias acumuladas'].fillna(train_median_gains, inplace=True)
df_test['Pasivos corrientes totales'].fillna(train_median_liabilities, inplace=True)
df_test['Activos corrientes totales'].fillna(train_mean_assets, inplace=True)

In [16]:
# Check for missing values
print(df_train.isnull().sum())

index                                 0
Ganancia bruta                        0
Ganancia (pérdida)                    0
Ingresos de actividades ordinarias    0
Costo de ventas                       0
Patrimonio total                      0
Total pasivos                         0
Total de activos                      0
Ganancias acumuladas                  0
Pasivos corrientes totales            0
Activos corrientes totales            0
Sector                                0
event                                 0
dtype: int64


In [17]:
# Check for missing values
print(df_test.isnull().sum())

index                                 0
Ganancia bruta                        0
Ganancia (pérdida)                    0
Ingresos de actividades ordinarias    0
Costo de ventas                       0
Patrimonio total                      0
Total pasivos                         0
Total de activos                      0
Ganancias acumuladas                  0
Pasivos corrientes totales            0
Activos corrientes totales            0
Sector                                0
dtype: int64


In [18]:
# Original dataset (before encoding/scaling)
X = df_train.drop(columns=['event', 'index'])  # Drop target and irrelevant columns
y = df_train['event']
X_verification = df_test

# Component 3: Modeling
-------

## Step 1: Selection of modeling techniques

In [None]:
#Splits the Train_set into train-test data to train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Define a preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(handle_unknown='ignore'), ['Sector']),
        ('scaler', StandardScaler(), ['Costo de ventas', 'Ganancias acumuladas',
                                      'Pasivos corrientes totales', 'Activos corrientes totales'])
    ],
    remainder='passthrough'
)

# Define pipelines with different sampling strategies
modified_sampling9_11_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('undersample', RandomUnderSampler(sampling_strategy={0: 900}, random_state=42)),
    ('smoteenn', SMOTEENN(
        smote=SMOTE(sampling_strategy={1: 1100}, random_state=42)
    )),
    ('classifier', XGBClassifier(
        eval_metric='logloss',
        learning_rate=0.03,
        max_depth=3,
        n_estimators=100,
        colsample_bytree=0.8,
        subsample=0.8,
        random_state=42
    ))
])

modified_sampling8_10_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('undersample', RandomUnderSampler(sampling_strategy={0: 800}, random_state=42)),
    ('smoteenn', SMOTEENN(
        smote=SMOTE(sampling_strategy={1: 1000}, random_state=42)
    )),
    ('classifier', XGBClassifier(
        eval_metric='logloss',
        learning_rate=0.03,
        max_depth=3,
        n_estimators=100,
        colsample_bytree=0.8,
        subsample=0.8,
        random_state=42
    ))
])

modified_sampling10_10_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('undersample', RandomUnderSampler(sampling_strategy={0: 100}, random_state=42)),
    ('smoteenn', SMOTEENN(
        smote=SMOTE(sampling_strategy={1: 1000}, random_state=42)
    )),
    ('classifier', XGBClassifier(
        eval_metric='logloss',
        learning_rate=0.03,
        max_depth=3,
        n_estimators=100,
        colsample_bytree=0.8,
        subsample=0.8,
        random_state=42
    ))
])

modified_sampling9_12_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('undersample', RandomUnderSampler(sampling_strategy={0: 900}, random_state=42)),
    ('smoteenn', SMOTEENN(
        smote=SMOTE(sampling_strategy={1: 1200}, random_state=42)
    )),
    ('classifier', XGBClassifier(
        eval_metric='logloss',
        learning_rate=0.03,
        max_depth=3,
        n_estimators=100,
        colsample_bytree=0.8,
        subsample=0.8,
        random_state=42
    ))
])


### Testing the train data for predictions

In [None]:
pipelines_to_compare = {
    'Modified Sampling Pipeline 9_11': modified_sampling9_11_pipeline,
    'Modified Sampling Pipeline 8_10': modified_sampling8_10_pipeline,
    'Modified Sampling Pipeline 10_10': modified_sampling10_10_pipeline,
    'Modified Sampling Pipeline 9_12': modified_sampling9_12_pipeline,
}

# Evaluate pipelines on the training set
results = {}

for name, pipeline in pipelines_to_compare.items():
    print(f"\nEvaluating {name} on X_train...")

    # Train the pipeline
    pipeline.fit(X_train, y_train)

    # Predict on test data
    y_pred = pipeline.predict(X_train)
    y_pred_proba = pipeline.predict_proba(X_train)[:, 1]

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_train, y_pred)
    recall = recall_score(y_train, y_pred)
    f1 = f1_score(y_train, y_pred)
    auc = roc_auc_score(y_train, y_pred_proba)

    # Store results
    results[name] = {
        'F1 Score': f1,
        'Accuracy': accuracy,
        'Recall': recall,
        'ROC-AUC': auc
    }

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_train, y_pred))

# Display results for comparison
results_df = pd.DataFrame(results).T

print("Pipeline Comparison Results:")
display(results_df)




Evaluating Modified Sampling Pipeline 9_11 on X_train...

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.63      0.77     11079
         1.0       0.04      0.90      0.08       198

    accuracy                           0.63     11277
   macro avg       0.52      0.76      0.42     11277
weighted avg       0.98      0.63      0.76     11277


Evaluating Modified Sampling Pipeline 8_10 on X_train...

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.60      0.75     11079
         1.0       0.04      0.90      0.08       198

    accuracy                           0.61     11277
   macro avg       0.52      0.75      0.41     11277
weighted avg       0.98      0.61      0.74     11277


Evaluating Modified Sampling Pipeline 10_10 on X_train...

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.07      0.13 

Unnamed: 0,F1 Score,Accuracy,Recall,ROC-AUC
Modified Sampling Pipeline 9_11,0.079164,0.632792,0.89899,0.840192
Modified Sampling Pipeline 8_10,0.07521,0.609648,0.90404,0.841199
Modified Sampling Pipeline 10_10,0.037054,0.087435,1.0,0.741561
Modified Sampling Pipeline 9_12,0.078036,0.627028,0.89899,0.841018


## Step 4: Model evaluation

## Evaluation of the results

In [None]:
pipelines_to_compare = {
    'Modified Sampling Pipeline 9_11': modified_sampling9_11_pipeline,
    'Modified Sampling Pipeline 8_10': modified_sampling8_10_pipeline,
    'Modified Sampling Pipeline 10_10': modified_sampling10_10_pipeline,
    'Modified Sampling Pipeline 9_12': modified_sampling9_12_pipeline
}

results = {}

for name, pipeline in pipelines_to_compare.items():
    print(f"\nEvaluating {name} on X_test...")

    # Train the pipeline
    pipeline.fit(X_train, y_train)

    # Predict on test data
    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)

    # Store results
    results[name] = {
        'F1 Score': f1,
        'Accuracy': accuracy,
        'Recall': recall,
        'ROC-AUC': auc
    }

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

# Display results for comparison
results_df = pd.DataFrame(results).T

print("Pipeline Comparison Results:")
display(results_df)




Evaluating Modified Sampling Pipeline 9_11 on X_test...

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.61      0.76      2771
         1.0       0.04      0.84      0.07        49

    accuracy                           0.62      2820
   macro avg       0.52      0.72      0.41      2820
weighted avg       0.98      0.62      0.75      2820


Evaluating Modified Sampling Pipeline 8_10 on X_test...

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.59      0.74      2771
         1.0       0.04      0.88      0.07        49

    accuracy                           0.59      2820
   macro avg       0.52      0.73      0.41      2820
weighted avg       0.98      0.59      0.73      2820


Evaluating Modified Sampling Pipeline 10_10 on X_test...

Classification Report:
              precision    recall  f1-score   support

         0.0       0.99      0.07      0.13    

Unnamed: 0,F1 Score,Accuracy,Recall,ROC-AUC
Modified Sampling Pipeline 9_11,0.070447,0.616312,0.836735,0.800415
Modified Sampling Pipeline 8_10,0.069976,0.594681,0.877551,0.814154
Modified Sampling Pipeline 10_10,0.035794,0.082979,0.979592,0.742191
Modified Sampling Pipeline 9_12,0.065934,0.608156,0.795918,0.803821


In [None]:
# Generate predictions for both models
pipelines_to_evaluate = {
    'Modified Sampling Pipeline 9_11': modified_sampling9_11_pipeline,
    'Modified Sampling Pipeline 8_10': modified_sampling8_10_pipeline,
    'Modified Sampling Pipeline 10_10': modified_sampling10_10_pipeline,
    'Modified Sampling Pipeline 9_12': modified_sampling9_12_pipeline
}

output_dir = "/content/drive/MyDrive/Projects/Colombia-bankruptcy-prediction/predictions"

import os
os.makedirs(output_dir, exist_ok=True)  # Ensure the output directory exists

for name, pipeline in pipelines_to_evaluate.items():
    print(f"Generating predictions for {name}...")

    # Predict using the pipeline
    predictions = pipeline.predict(X_verification)

    # Create a DataFrame for the submission
    submission = pd.DataFrame({
        'Index': df_test['index'],  # Use the 'index' column from df_test
        'Prediction': predictions
    })

    # Save to CSV
    output_csv_path = os.path.join(output_dir, f"{name.replace(' ', '_')}_prediction23_1.csv")
    submission.to_csv(output_csv_path, index=False)
    print(f"Predictions for {name} saved to {output_csv_path}")



Generating predictions for Modified Sampling Pipeline 9_11...
Predictions for Modified Sampling Pipeline 9_11 saved to /content/drive/MyDrive/Projects/Colombia-bankruptcy-prediction/DataSet/Modified_Sampling_Pipeline_9_11_prediction23_1.csv
Generating predictions for Modified Sampling Pipeline 8_10...
Predictions for Modified Sampling Pipeline 8_10 saved to /content/drive/MyDrive/Projects/Colombia-bankruptcy-prediction/DataSet/Modified_Sampling_Pipeline_8_10_prediction23_1.csv
Generating predictions for Modified Sampling Pipeline 8_11...
Predictions for Modified Sampling Pipeline 8_11 saved to /content/drive/MyDrive/Projects/Colombia-bankruptcy-prediction/DataSet/Modified_Sampling_Pipeline_8_11_prediction23_1.csv
Generating predictions for Modified Sampling Pipeline 9_12...
Predictions for Modified Sampling Pipeline 9_12 saved to /content/drive/MyDrive/Projects/Colombia-bankruptcy-prediction/DataSet/Modified_Sampling_Pipeline_9_12_prediction23_1.csv


## Final project review

**Final Project Performance and Validation**

This bankruptcy prediction model was evaluated through the Kaggle competition platform, achieving notable results:

**Competition Performance:**

The model achieved an accuracy score of **81.81%** on the private test set
Validation Methodology: Performance was assessed on approximately 10% of the hidden test data, providing an unbiased evaluation of the model's generalization capabilities

**Real-World Applicability:**

The strong out-of-sample performance suggests robust predictive power for real-world bankruptcy risk assessment
The model's success on unseen data validates our technical approach, particularly:

The effectiveness of our sampling strategy in handling class imbalance
The robustness of our feature engineering process
The appropriateness of XGBoost for this financial prediction task
This performance places the model as a viable tool for practical applications in credit risk assessment and financial decision-making, demonstrating both statistical validity and business relevance.

**Future Enhancements**

While the model shows strong predictive capability, potential areas for improvement include:

Exploration of additional financial ratios and domain-specific features
Implementation of more sophisticated ensemble techniques
Integration of macroeconomic indicators for broader context
Development of model interpretability tools for stakeholder confidence
The current performance establishes a solid foundation for these future developments while already providing actionable insights for bankruptcy risk assessment.

### Technical Summary of the Project:
**Data Understanding:**

* The dataset contains 14,097 rows and 13 columns for training, and 6,042 rows and 12 columns for testing.
* The target variable (event) is highly imbalanced, with only 247 bankruptcy cases out of 14,097.

**Data Preparation:**

* Missing values in key financial columns are filled using median or mean values from the training set.
* Derived features such as profitability, liquidity, and debt ratios are created to enhance model performance.

**Modeling:**

* Four pipelines with different sampling strategies (e.g., SMOTEENN and undersampling) are created to handle class imbalance.
* The XGBoost Classifier is used with hyperparameters like learning_rate=0.03, max_depth=3, and n_estimators=100.

**Evaluation:**

* Pipelines are evaluated on both training and test sets using metrics such as F1 Score, Accuracy, Recall, and ROC-AUC.
* The best pipeline is selected based on its performance on the test set.

**Deployment:**
* Predictions for the test dataset are generated and saved as CSV files for further analysis or submission.

### Model Limitations and Assumptions

1. Data Temporal Limitations:
   - Model trained on historical Colombian financial data
   - Assumes similar economic conditions in prediction period
   - May not capture sudden market shocks or economic disruptions

2. Key Assumptions:
   - Financial ratios maintain predictive power over time
   - Company reporting standards remain consistent
   - Linear relationship between financial indicators and bankruptcy risk

3. Risk Disclaimers:
   - This model is a decision support tool, not a definitive predictor
   - Predictions should be used in conjunction with expert judgment
   - Regular model retraining is recommended (at least annually)
   - Past performance does not guarantee future results

### Implementation Guidelines

1. Technical Requirements:
   - Python 3.7+
   - Required packages: scikit-learn, xgboost, imblearn
   - Minimum 8GB RAM for processing

2. Data Requirements:
   - Financial statements not older than 12 months
   - Complete set of required financial ratios
   - Standardized reporting format

3. Operational Protocol:
   - Monthly model performance monitoring
   - Quarterly threshold calibration
   - Annual model retraining