# Hybrid Credit Risk Prediction with Reinforcement Learning and LSTM

In [1]:
import zipfile
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

![image.png](../docs/img/data_dictonary_description.png)

In [2]:
zf = zipfile.ZipFile('../data/master_data/GiveMeSomeCredit.zip')

df_cs_train = pd.read_csv(zf.open('cs-training.csv'))
df_cs_test = pd.read_csv(zf.open('cs-test.csv'))
df_sample_entry = pd.read_csv(zf.open('sampleEntry.csv'))

## EDA

### Test

In [None]:
print(df_cs_test.shape)
df_cs_test.head()

### Train

In [None]:
df_cs_train.head()

#### 1. Basic Information

In [None]:
# check the shape of the dataset
print(df_cs_train.shape)

In [None]:
# get information on data types and missing values
df_cs_train.info()

In [None]:
# check for missing values
print(df_cs_train.isnull().sum())

In [None]:
# check for duplicates
print(df_cs_train.duplicated().sum())

#### 2. Statistical summary

In [None]:
# summary statistics
df_cs_train.describe()

#### 3. Handle missing values

In [10]:
# fill missing values (mean imputation as an example)
df_cs_train.fillna(df_cs_train.mean(), inplace=True)

# drop any rows that still contain missing values (a safeguard: after mean imputation of the numeric columns, none should remain)
df_cs_train.dropna(inplace=True)

#### 4. Visualize Target Variable (SeriousDlqin2yrs)

In [None]:
# check distribution of target variable
sns.countplot(x='SeriousDlqin2yrs', data=df_cs_train)
plt.title('Distribution of Target Variable (SeriousDlqin2yrs)')
plt.show()

# calculate percentage of default vs non-default
print(df_cs_train['SeriousDlqin2yrs'].value_counts(normalize=True) * 100)

#### 5. Correlation Matrix

In [None]:
# correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df_cs_train.corr(), annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

#### 6. Feature Distribution Analysis

In [None]:
# plot distributions of numerical features
df_cs_train.hist(bins=20, figsize=(20, 15), color='blue', edgecolor='black')
plt.suptitle('Histograms of Numerical Features')
plt.show()

# alternative: Plot KDE (Kernel Density Estimation) plots
for column in df_cs_train.columns:
    if df_cs_train[column].dtype != 'object':
        sns.kdeplot(df_cs_train[column], fill=True)  # fill replaces the deprecated shade argument
        plt.title(f'Distribution of {column}')
        plt.show()

#### 7. Boxplots to Detect Outliers

In [None]:
# plot boxplots of numerical features
plt.figure(figsize=(15, 10))
sns.boxplot(data=df_cs_train.drop(columns=['SeriousDlqin2yrs']), palette='Set3')
plt.xticks(rotation=90)
plt.title('Boxplots for Outlier Detection')
plt.show()

#### 8. Bivariate Analysis (Target vs Features)

In [None]:
# Boxplot to check feature distribution by target
for column in df_cs_train.columns:
    if df_cs_train[column].dtype != 'object' and column != 'SeriousDlqin2yrs':
        plt.figure(figsize=(6, 4))
        sns.boxplot(x='SeriousDlqin2yrs', y=column, data=df_cs_train)
        plt.title(f'Boxplot of {column} vs SeriousDlqin2yrs')
        plt.show()

# KDE plots to see how the distribution varies between classes
for column in df_cs_train.columns:
    if df_cs_train[column].dtype != 'object' and column != 'SeriousDlqin2yrs':
        sns.kdeplot(df_cs_train[df_cs_train['SeriousDlqin2yrs'] == 0][column], label='No Default')
        sns.kdeplot(df_cs_train[df_cs_train['SeriousDlqin2yrs'] == 1][column], label='Default')
        plt.title(f'Distribution of {column} by Default Status')
        plt.legend()
        plt.show()

#### 9. Pairwise Relationships (Pair Plot)

In [None]:
# Sample pair plot (for a few features due to performance limitations)
sns.pairplot(df_cs_train[['RevolvingUtilizationOfUnsecuredLines', 'age', 'DebtRatio', 'MonthlyIncome', 'SeriousDlqin2yrs']], hue='SeriousDlqin2yrs')
plt.show()

#### 10. Handling Outliers

In [16]:
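# Remove outliers using the IQR (interquartile range) rule.
# The target is excluded: it is binary and mostly 0, so its IQR is 0 and
# every defaulter (SeriousDlqin2yrs == 1) would otherwise be flagged as an outlier.
features = df_cs_train.drop(columns=['SeriousDlqin2yrs'])
Q1 = features.quantile(0.25)
Q3 = features.quantile(0.75)
IQR = Q3 - Q1

# Keep rows where no feature falls outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
is_outlier = ((features < (Q1 - 1.5 * IQR)) | (features > (Q3 + 1.5 * IQR))).any(axis=1)
df_out = df_cs_train[~is_outlier]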

#### 11. Feature Engineering (Optional)

In [17]:
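# Example feature creation
# MonthlyIncome can contain zeros; treat them as missing so the division does not produce inf
income = df_cs_train['MonthlyIncome'].replace(0, float('nan'))
df_cs_train['DebtToIncomeRatio'] = (df_cs_train['DebtRatio'] / income).fillna(0)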

#### 12. Feature Importance (Using Random Forest)

In [None]:
from sklearn.ensemble import RandomForestClassifier # type: ignore

# Prepare the data
X = df_cs_train.drop(columns=['SeriousDlqin2yrs'])
y = df_cs_train['SeriousDlqin2yrs']

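# Train a simple Random Forest model
# (inf values, e.g. from a division by zero in an engineered ratio, are treated as missing and filled with 0)
model = RandomForestClassifier(random_state=42)
model.fit(X.replace([float('inf'), float('-inf')], float('nan')).fillna(0), y)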

# Plot feature importances
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Important Features')
plt.show()

#### 13. Insights and Conclusions

- Summarize your insights:
    - Which features are most correlated with defaulting?
    - Are there any significant outliers or patterns?
    - How imbalanced is the target variable?
    - What relationships exist between features and the target?

#### 14. Address Imbalance in Target Variable

Credit-default datasets such as this one are typically imbalanced, so consider techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or class weights during modeling.

#### Final Notes

- EDA is not a one-size-fits-all process. Always adjust the analysis based on the findings in each step.
- You can also consider applying PCA (Principal Component Analysis) or t-SNE for dimensionality reduction and visualizing patterns in high-dimensional data.

Together, these steps give a working picture of the Give Me Some Credit dataset before moving on to modeling tasks such as credit risk prediction.

## Handle Imbalance

### 1. Check the Class Distribution

If one class has significantly more samples than the other, you have an imbalanced dataset. 
For instance, if you see something like:
- Class 0 (non-delinquent): 93%
- Class 1 (delinquent): 7%

This is a sign of imbalance.

In [None]:
# Check distribution of the target variable
sns.countplot(x='SeriousDlqin2yrs', data=df_cs_train)
plt.title('Class Distribution of SeriousDlqin2yrs')
plt.show()

# Percentage of each class
class_distribution = df_cs_train['SeriousDlqin2yrs'].value_counts(normalize=True) * 100
print(class_distribution)

### 2. Metrics to Evaluate Imbalance

- Class Distribution: Directly see the percentage of each class using .value_counts().
- Class Ratios: Calculate the ratio of minority to majority class to quantify the imbalance. A ratio close to 0 indicates a severe imbalance.

In [None]:
minority_class = df_cs_train['SeriousDlqin2yrs'].value_counts()[1]
majority_class = df_cs_train['SeriousDlqin2yrs'].value_counts()[0]

imbalance_ratio = minority_class / majority_class
print(f"Imbalance Ratio: {imbalance_ratio}")

### 3. Visualize Class Distribution in Features

Use pair plots, histograms, or KDE plots to compare feature distributions between the two classes (delinquent vs. non-delinquent). 
This can help in visualizing patterns and understanding whether certain features show clear separations between the classes.

In [None]:
# KDE plot for numerical features
for column in df_cs_train.columns:
    if df_cs_train[column].dtype != 'object' and column != 'SeriousDlqin2yrs':
        # fill replaces the deprecated shade argument
        sns.kdeplot(df_cs_train[df_cs_train['SeriousDlqin2yrs'] == 0][column], label='No Default', fill=True)
        sns.kdeplot(df_cs_train[df_cs_train['SeriousDlqin2yrs'] == 1][column], label='Default', fill=True)
        plt.title(f'Distribution of {column} by Class')
        plt.legend()
        plt.show()

### 4. Correlation Analysis

Perform a correlation analysis to check whether any features are highly correlated with the target variable. 
Features with strong correlation to SeriousDlqin2yrs may help differentiate between the classes despite the imbalance.

In [None]:
# Correlation heatmap (including target variable)
plt.figure(figsize=(10, 8))
sns.heatmap(df_cs_train.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix with SeriousDlqin2yrs')
plt.show()

### 5. Handling Imbalance

Once you've identified the imbalance, you need strategies to handle it before modeling.

#### 5.1 Resampling Techniques

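- Oversampling the Minority Class (SMOTE): The Synthetic Minority Over-sampling Technique generates synthetic samples for the minority class by interpolating between existing minority samples. In a real pipeline, resample only the training split so that no synthetic points reach the evaluation data.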

In [None]:
from imblearn.over_sampling import SMOTE # type: ignore

X = df_cs_train.drop(columns='SeriousDlqin2yrs')
y = df_cs_train['SeriousDlqin2yrs']

smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)
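
To make "synthetic samples" concrete, the sketch below reproduces the core interpolation step of SMOTE by hand on a toy minority class: pick a minority point, pick one of its k nearest minority neighbours, and place a new point at a random position on the segment between them. The array `minority` and the helper `smote_like_sample` are hypothetical names used only for illustration; this is not imblearn's implementation, which handles sampling strategies and edge cases on top of this idea.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_sample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))        # random minority point
        j = rng.choice(neighbour_idx[i, 1:])   # one of its k neighbours (skipping itself)
        lam = rng.random()                     # position along the segment, in [0, 1)
        synthetic[row] = minority[i] + lam * (minority[j] - minority[i])
    return synthetic


# Toy minority class: 6 points in 2 dimensions (made-up values)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
synthetic = smote_like_sample(minority, n_new=10)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, the new samples always stay inside the region already occupied by the minority class.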

- Undersampling the Majority Class: Randomly remove samples from the majority class to balance the classes.

In [None]:
from imblearn.under_sampling import RandomUnderSampler # type: ignore

undersample = RandomUnderSampler()
X_res, y_res = undersample.fit_resample(X, y)

- Combination of Oversampling and Undersampling: A balanced approach in which the minority class is oversampled and the majority class is undersampled.
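
As a minimal sketch of the combined approach, the example below uses plain random resampling from scikit-learn (`sklearn.utils.resample`) rather than SMOTE, so that it stays self-contained. The toy frame and the `target_size` of 40 rows per class are made-up values for illustration, not recommendations.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 90 majority rows (0) and 10 minority rows (1); values are made up
toy = pd.DataFrame({'feature': range(100),
                    'SeriousDlqin2yrs': [0] * 90 + [1] * 10})
majority = toy[toy['SeriousDlqin2yrs'] == 0]
minority = toy[toy['SeriousDlqin2yrs'] == 1]

target_size = 40  # hypothetical per-class size, between the two original counts

# Undersample the majority class without replacement
majority_down = resample(majority, replace=False, n_samples=target_size, random_state=42)
# Oversample the minority class with replacement
minority_up = resample(minority, replace=True, n_samples=target_size, random_state=42)

# Combine and shuffle
balanced = pd.concat([majority_down, minority_up]).sample(frac=1, random_state=42)
print(balanced['SeriousDlqin2yrs'].value_counts())
```

Meeting in the middle discards less majority data than pure undersampling and duplicates fewer minority rows than pure oversampling.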

#### 5.2 Use Class Weights

Many scikit-learn classifiers, including RandomForestClassifier and LogisticRegression, accept a class_weight parameter that gives more importance to the minority class.

In [None]:
from sklearn.ensemble import RandomForestClassifier     # type: ignore
from sklearn.model_selection import train_test_split    # type: ignore

# stratify=y keeps the class ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Set class weights to 'balanced'
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_train, y_train)
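
To show what `class_weight='balanced'` does, the sketch below computes the weights by hand with the formula from the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`, and compares the result with `compute_class_weight`. The label vector `y_toy` (93 non-defaults, 7 defaults) is a made-up stand-in that mirrors the 93% / 7% split mentioned earlier.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 93 non-defaults and 7 defaults
y_toy = np.array([0] * 93 + [1] * 7)
classes = np.unique(y_toy)

# 'balanced' weights: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
from_sklearn = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

for cls, weight in zip(classes, manual):
    print(f"class {cls}: weight {weight:.3f}")
```

With these counts the minority class receives a weight of roughly 7.14 against roughly 0.54 for the majority class, so each misclassified default costs about 13 times as much during training.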

#### 5.3 Evaluation Metrics Beyond Accuracy

In imbalanced datasets, accuracy may not be a reliable metric since predicting the majority class can lead to high accuracy but poor performance on the minority class. Use metrics like:

- Precision, Recall, F1-Score: Precision measures how many selected items are relevant, while recall measures how many relevant items are selected. The F1-score balances these two metrics.

In [None]:
from sklearn.metrics import classification_report   # type: ignore

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
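
The definitions above can be checked by hand from confusion-matrix counts. The counts below are hypothetical (1,000 test rows, 70 of them defaults) and are chosen only to show how accuracy can look good while recall tells a different story.

```python
# Hypothetical confusion-matrix counts for 1,000 test rows (70 actual defaults)
tp, fp, fn, tn = 40, 60, 30, 870
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the rows flagged as default, how many really defaulted
recall = tp / (tp + fn)      # of the real defaults, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

# Baseline that always predicts "no default"
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"always-negative baseline accuracy={baseline_accuracy:.2f}")
```

Here the model scores 0.91 accuracy, below the 0.93 of the always-negative baseline, yet it catches about 57% of the defaults while the baseline catches none.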

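- AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures how well the model ranks defaulters above non-defaulters across all decision thresholds. Unlike accuracy, it does not depend on a single threshold, which makes it useful for imbalanced datasets.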

In [None]:
from sklearn.metrics import roc_auc_score   # type: ignore

y_pred_proba = rf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f'AUC-ROC Score: {auc}')
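
One way to read the AUC-ROC score is as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count as half). The sketch below checks this pairwise interpretation against `roc_auc_score` on a small made-up set of labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"pairwise AUC: {pairwise_auc:.4f}")
print(f"sklearn AUC:  {roc_auc_score(y_true, scores):.4f}")
```

In this toy set, 13 of the 15 positive-negative pairs are ordered correctly, so both computations give 13/15.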

### 6. Cross-Validation with Stratified K-Folds

When performing cross-validation, use stratified k-fold cross-validation to ensure that each fold has the same proportion of classes, preserving the imbalance ratio during training and validation.

In [None]:
from sklearn.model_selection import StratifiedKFold # type: ignore

skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    ## Train your model here ##
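
For a version of the loop with the training step filled in, the sketch below runs stratified 5-fold cross-validation on a synthetic imbalanced dataset from `make_classification`. The data, the names `X_demo` and `y_demo`, and the model settings are assumptions for illustration; with the real data you would pass your own `X` and `y` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data with roughly 7% positives
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.93, 0.07], random_state=42)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_aucs, fold_pos_rates = [], []

for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    clf = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=42)
    clf.fit(X_demo[train_idx], y_demo[train_idx])

    # Score each held-out fold with AUC-ROC
    proba = clf.predict_proba(X_demo[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_demo[test_idx], proba))
    # Track the share of positives to confirm that stratification worked
    fold_pos_rates.append(y_demo[test_idx].mean())

print("positive rate per fold:", np.round(fold_pos_rates, 3))
print(f"mean AUC-ROC: {np.mean(fold_aucs):.3f}")
```

The positive rate should be nearly identical in every fold, which is the property that stratification is meant to guarantee.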

Summary of Steps for EDA on Imbalanced Data:

1. Identify the imbalance in the target variable using countplot and value_counts.
2. Visualize relationships between the features and the target variable to understand feature distributions in different classes.
3. Correlation analysis to find relationships between features and the target variable.
4. Handle the imbalance using techniques like SMOTE, undersampling, or adjusting class weights.
5. Use appropriate metrics such as precision, recall, F1-score, and AUC-ROC to evaluate the model on imbalanced data.
6. Cross-validation with stratified sampling ensures that the imbalance is preserved in all training and validation folds.

This approach will help you thoroughly explore the dataset, handle imbalanced data, and guide your modeling choices effectively.

## Identify the important features

Before building an LSTM model on the Give Me Some Credit dataset, analyze which features contribute most to predicting the target variable (SeriousDlqin2yrs). 
The steps below walk through several ways to do that:

### 1. Correlation Matrix

A simple first step is to compute the correlation between features and the target variable. This will give you a sense of how strongly individual features are associated with the target variable.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
corr_matrix = df_cs_train.corr()

# Plot the heatmap for the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix with SeriousDlqin2yrs')
plt.show()

# Check correlation with the target variable 'SeriousDlqin2yrs'
target_corr = corr_matrix['SeriousDlqin2yrs'].sort_values(ascending=False)
print(target_corr)

Look for features with a higher absolute correlation with the target variable. These may be strong candidates to include in your model.

### 2. Feature Importance using Random Forest

Although LSTM models don’t inherently provide feature importance, you can use other models like Random Forest to compute feature importance scores and then feed the most important features into your LSTM model.

In [None]:
from sklearn.ensemble import RandomForestClassifier # type: ignore

# Split the dataset into features and target
X = df_cs_train.drop(columns='SeriousDlqin2yrs')
y = df_cs_train['SeriousDlqin2yrs']

# Fit Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Get feature importance
importance = rf.feature_importances_

# Create a DataFrame for visualization
feature_importance = pd.DataFrame({'feature': X.columns, 'importance': importance})
feature_importance = feature_importance.sort_values(by='importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance)
plt.title('Feature Importance from Random Forest')
plt.show()

This will help you identify the top features to focus on for your LSTM model.

### 3. Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) selects features by repeatedly fitting an estimator and removing the least important features, as ranked by the estimator's coefficients or feature importances. 
Because it refits the model after each elimination round, it is slower than reading importances from a single Random Forest fit.

In [None]:
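from sklearn.feature_selection import RFE           # type: ignore
from sklearn.linear_model import LogisticRegression # type: ignore
from sklearn.preprocessing import StandardScaler    # type: ignore

# Initialize a logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Standardize first: RFE ranks features by coefficient magnitude,
# which is only comparable when the features share a common scale
X_std = StandardScaler().fit_transform(X)

# Use RFE for feature selection
# Choose top 10 features
rfe = RFE(logreg, n_features_to_select=10)
rfe.fit(X_std, y)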

# Get the selected features
selected_features = X.columns[rfe.support_]
print('Selected features:', selected_features)

### 4. SHAP (SHapley Additive exPlanations)

SHAP values provide interpretable machine learning insights by showing how much each feature contributes to the final prediction.

In [None]:
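import shap # type: ignore

# Train a model
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Initialize SHAP explainer
explainer = shap.TreeExplainer(model)

# Compute SHAP values on a subsample; TreeExplainer is slow on the full training set
X_sample = X.sample(n=min(1000, len(X)), random_state=42)
shap_values = explainer.shap_values(X_sample)

# Select the values for the positive class (default). Depending on the SHAP version,
# shap_values is either a list with one array per class or a single 3-D array
# of shape (n_samples, n_features, n_classes)
shap_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Plot SHAP summary plot
shap.summary_plot(shap_pos, X_sample)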

The SHAP summary plot will show you which features contribute the most to the model’s predictions, which can guide your feature selection for the LSTM model.

### 5. Statistical Methods (Chi-Square Test for Categorical Features)

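If there are any categorical features (not common in this dataset, but possible if added in preprocessing), you can use a statistical test such as Chi-Square to check whether each feature is associated with the target. scikit-learn's chi2 requires non-negative inputs, so the features are rescaled to [0, 1] first; on continuous features the resulting scores are only a rough guide.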

In [None]:
from sklearn.feature_selection import chi2      # type: ignore
from sklearn.preprocessing import MinMaxScaler  # type: ignore

# Rescale each feature to [0, 1], since chi2 requires non-negative values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Compute Chi-Square scores
chi_scores = chi2(X_scaled, y)

# Create a DataFrame for visualization
chi2_importance = pd.DataFrame({'feature': X.columns, 'chi2_score': chi_scores[0]})
chi2_importance = chi2_importance.sort_values(by='chi2_score', ascending=False)

# Plot the Chi-Square scores
plt.figure(figsize=(10, 6))
sns.barplot(x='chi2_score', y='feature', data=chi2_importance)
plt.title('Chi-Square Feature Importance')
plt.show()

### 6. Using Autoencoders for Feature Extraction

Autoencoders are neural networks that can be used to automatically learn the most relevant features by compressing and reconstructing the data. These compressed representations can then be used as input to your LSTM model.
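
As a rough sketch of the idea, the example below trains a single-hidden-layer autoencoder with scikit-learn's `MLPRegressor`, used here as a lightweight stand-in for a deep-learning framework, by fitting the network to reproduce its own input. The synthetic data, the bottleneck size of 3, and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 rows, 10 features driven by 3 latent factors
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))
X_std = StandardScaler().fit_transform(X_demo)

# Autoencoder: 10 inputs -> 3-unit bottleneck -> 10 outputs, trained with input == target
autoencoder = MLPRegressor(hidden_layer_sizes=(3,), activation='tanh',
                           max_iter=2000, random_state=0)
autoencoder.fit(X_std, X_std)

# Encoder half: apply the first layer's weights and activation to get the compressed features
encoded = np.tanh(X_std @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])

# Reconstruction error of the full network
mse = np.mean((autoencoder.predict(X_std) - X_std) ** 2)

print("encoded shape:", encoded.shape)
print(f"reconstruction MSE: {mse:.4f}")
```

The `encoded` array is the compressed representation that could replace the raw features downstream. A deeper autoencoder in Keras or PyTorch follows the same pattern with more layers.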

Steps to Consider for LSTM Feature Selection:
1. Correlation matrix to find initial relationships between features and the target.
2. Random Forest feature importance to prioritize features that may have a higher predictive power.
3. RFE to systematically reduce the number of features.
4. SHAP values for interpretability of feature impact.
5. Statistical tests for further feature selection (if applicable).
6. Autoencoders for unsupervised feature learning.

**Final Considerations:**

After identifying the top features, you may want to normalize the data and possibly use PCA (Principal Component Analysis) for dimensionality reduction before feeding the data into the LSTM model.
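
The sketch below strings these final steps together: standardize, reduce with PCA, then reshape into the 3-D layout (samples, timesteps, features) that Keras-style LSTM layers expect. The synthetic data and the 95% variance threshold are assumptions. Treating each row as a sequence of length 1 is also an assumption, made because each row in this dataset is a single borrower snapshot rather than a time series.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the selected features: 200 rows, 10 correlated columns
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# 1. Standardize so that every feature has zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)

# 2. Keep as many principal components as needed to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
X_pca = pca.fit_transform(X_std)

# 3. Reshape to (samples, timesteps, features) with a single timestep per row
X_lstm = X_pca.reshape(X_pca.shape[0], 1, X_pca.shape[1])

print("components kept:", pca.n_components_)
print("LSTM input shape:", X_lstm.shape)
```

Fit the scaler and PCA on the training split only, then apply the fitted transforms to validation and test data.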