# Data Analytics for the "Credit" Data Set

### Table of Contents

1. [Loading the Data](#1-loading-the-data)
2. [Data Set Size](#2-data-set-size)
3. [Miscellaneous Observations on the Data](#3-miscellaneous-observations-on-the-data)
4. [Class Distribution](#4-class-distribution)
5. [Distribution of Categorical Features](#5-distribution-of-categorical-features)
6. [Distribution of Numerical Features](#6-distribution-of-numerical-features)
7. [Averages Along Numerical Features](#7-averages-along-numerical-features)
8. [Feature Correlations](#8-feature-correlations)
    - [Correlation loan_amount, loan_duration and target Variable](#81-correlation-loan_amount-loan_duration-and-target-variable)
    - [Correlation loan_amount, purpose/ age and target Variable](#82-correlation-loan_amount-purpose-age-and-target-variable)
9. [Applying Random Forest Classifier to Assess Feature Importance](#9-applying-random-forest-classifier-to-assess-feature-importance)
10. [Summary of Potential Data Flaws](#10-summary-of-potential-data-flaws)

### 1. Loading the Data

Here, we are working with the [German Credit Data Set from the UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data).
The data was originally provided by Prof. Hans Hofmann.
The UCI repository furthermore states that "It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1)." (see [here](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data)). Missing a 'bad' customer is thus five times as costly as turning away a 'good' one, so a high recall on the 'bad' class is ultimately more important than a high precision.
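
To make the cost asymmetry concrete, here is a minimal sketch that scores two hypothetical classifiers with the 5:1 costs quoted above. The labels and predictions are invented for illustration and do not come from the data set; `total_cost` and `recall` are helper names introduced only for this example.

```python
# Misclassification costs quoted from the UCI description, keyed by (actual, predicted)
COST = {
    ("bad", "good"): 5,   # a bad customer classed as good
    ("good", "bad"): 1,   # a good customer classed as bad
    ("good", "good"): 0,
    ("bad", "bad"): 0,
}


def total_cost(y_true, y_pred):
    """Sum the misclassification costs over all samples."""
    return sum(COST[(t, p)] for t, p in zip(y_true, y_pred))


def recall(y_true, y_pred, positive):
    """Share of samples of class `positive` that were predicted as such."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual if actual else 0.0


# Hypothetical ground truth: 7 good and 3 bad customers
y_true = ["good"] * 7 + ["bad"] * 3
# The lenient classifier accepts everyone; the strict one rejects two good
# customers but catches every bad one
pred_lenient = ["good"] * 10
pred_strict = ["good"] * 5 + ["bad"] * 5

for name, pred in [("lenient", pred_lenient), ("strict", pred_strict)]:
    print(name, "cost:", total_cost(y_true, pred), "recall on bad:", recall(y_true, pred, "bad"))
```

In this toy setting the two classifiers make a similar number of mistakes (three vs. two), but the lenient one is far more expensive (cost 15 vs. 2) because all of its errors are missed 'bad' customers.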

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import math
import warnings
import matplotlib.patches as mpatches

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')

credit_data = pd.read_csv(os.path.join('..', 'data', 'real', 'credit.csv'))
credit_data.head()

### 2. Data Set Size

First, we want to explore how many data points are available for training the models.


In [None]:
num_features = credit_data.shape[1]
num_samples = credit_data.shape[0]

print(f"Number of features: {num_features}")
print(f"Number of samples: {num_samples}")

target_col = 'target'
numerical_cols = credit_data.select_dtypes(include=[np.number]).columns
categorical_cols = credit_data.select_dtypes(include=['object']).columns

print(f"Number of numerical columns: {len(numerical_cols)}")
print(f"Number of categorical columns: {len(categorical_cols)}")

Result: In contrast to "bank" or "income", this data set has noticeably fewer samples, so we need to make even better use of the available data. With 21 columns, 7 numerical and 14 categorical (one of which is the label), there are enough features to make thorough feature selection essential for the success of a classification model.

### 3. Miscellaneous Observations on the Data

We check data integrity by looking for null values and duplicate rows and by cleaning up the data types.

In [None]:
# Convert byte strings to normal strings for all object columns
print(f"Element before conversion: {credit_data[categorical_cols[1]][3]}")
for col in credit_data.select_dtypes(include=['object']).columns:
    credit_data[col] = credit_data[col].astype(str).str.replace(r"^b'|'$", "", regex=True).str.strip()
print(f"Element after conversion: {credit_data[categorical_cols[1]][3]}")

num_nulls = credit_data.isnull().sum().sum()
print(f"Number of null values: {num_nulls}")

duplicates = credit_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Results:
- The categorical feature values of this data set are provided as byte-strings. To make working with them easier, they are converted to normal strings here
- The data set does not have any missing values -> no extra preprocessing needed here
- The data set does not have any duplicated samples either -> no extra preprocessing needed here
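
The byte-string cleanup can be tried in isolation. The sketch below applies the same regular expression as the cell above to a few made-up values (they are illustrative, not actual entries of the data set).

```python
import pandas as pd

# Made-up examples of values that were written out as byte-string literals
raw = pd.Series(["b'A11'", "b'radio/tv'", "b' none '", "already clean"])

# Strip a leading b' and a trailing ' and then any surrounding whitespace
cleaned = raw.astype(str).str.replace(r"^b'|'$", "", regex=True).str.strip()

print(cleaned.tolist())
```

Note that the pattern also removes a trailing apostrophe from values that were never byte strings, so it is only safe as long as no legitimate category ends in `'`.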

### 4. Class Distribution

Do we have to account for a class imbalance?

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=credit_data)

plt.title('Distribution of the Target Variable')
plt.show()

Result: There is some class imbalance towards credit applications that were assessed as 'good' (credit was granted). This needs to be addressed during model training. However, there are still enough 'bad' samples available to assess patterns reliably.
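
One common way to address such an imbalance is to weight the classes inversely to their frequency. The sketch below computes the class shares and the usual "balanced" weights, n_samples / (n_classes * class_count). The 700/300 split is an assumption made for this example; in the notebook, `credit_data['target']` would take the place of the toy series.

```python
import pandas as pd

# Assumed class counts, for illustration only
target = pd.Series(["good"] * 700 + ["bad"] * 300, name="target")

counts = target.value_counts()
shares = target.value_counts(normalize=True)

# "Balanced" heuristic: rarer classes get proportionally larger weights
class_weights = len(target) / (target.nunique() * counts)

print(shares.round(2).to_dict())
print(class_weights.round(3).to_dict())
```

With these assumed counts, a 'bad' sample would weigh roughly 2.3 times as much as a 'good' one during training.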

### 5. Distribution of Categorical Features

How many distinct values does each categorical feature have? How are the 1000 samples distributed across these values? Are there obvious correlations between certain values and the target variable?

In [None]:
# Plot every categorical feature except the target itself
plot_cols = [col for col in categorical_cols if col != target_col]
n = len(plot_cols)
ncols = 3
nrows = math.ceil(n / ncols)

fig, axes = plt.subplots(nrows, ncols, figsize=(12, 4 * nrows))
axes = axes.flatten()

for idx, col in enumerate(plot_cols):
    ax = axes[idx]
    # Rows: feature values, columns: target classes ('bad', 'good' in alphabetical order)
    counts = credit_data.groupby([col, target_col]).size().unstack(fill_value=0)
    counts.plot(kind='bar', stacked=True, color=['coral', 'skyblue'], ax=ax, width=0.8)
    ax.set_title(col)
    ax.set_ylabel('count')
    ax.set_xlabel('')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    ax.get_legend().remove()

# Hide any unused subplots (independent of where the target column sits)
for j in range(n, len(axes)):
    fig.delaxes(axes[j])

red_patch = mpatches.Patch(color='coral', label='bad')
blue_patch = mpatches.Patch(color='skyblue', label='good')
fig.legend(handles=[red_patch, blue_patch], loc='upper right', fontsize=12, title=target_col)

plt.suptitle('Histograms of Categorical Features - Stacked by Assessment Result', x=0.5, y=1.02, fontsize=16)
plt.tight_layout()
plt.show()

Conclusions:
- None of the features has more than 10 possible values -> the categorical values can be encoded compactly
- Luckily for us, potential null values are already encoded explicitly, e.g. as 'other' or 'no known savings'
- Some values are represented by very few samples (e.g. is_foreign=no or any guarantor other than none) -> hard to tell to what extent these values actually influence the target variable
- People without a checking account at this bank have far worse chances of getting a credit
- Loans for a radio/TV are mostly granted
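
The visual impressions above can be backed by numbers. The following sketch computes, per category value, the sample count and the share of 'bad' outcomes, and flags sparsely populated values. The toy data, the helper name `bad_rate_by_category` and the threshold `min_count=5` are assumptions for illustration.

```python
import pandas as pd


def bad_rate_by_category(df, col, target_col="target", bad_label="bad", min_count=5):
    """Sample count and share of 'bad' outcomes for each value of `col`."""
    is_bad = df[target_col].eq(bad_label)
    summary = pd.DataFrame({
        "n": df.groupby(col).size(),
        "bad_rate": is_bad.groupby(df[col]).mean(),
    })
    # Rates based on very few samples should be read with caution
    summary["sparse"] = summary["n"] < min_count
    return summary


# Toy data, not taken from the real data set
toy = pd.DataFrame({
    "is_foreign": ["yes"] * 8 + ["no"] * 2,
    "target": ["good"] * 5 + ["bad"] * 3 + ["good"] * 2,
})

summary = bad_rate_by_category(toy, "is_foreign")
print(summary)
```

In the notebook, the same helper could be called in a loop over `categorical_cols` (excluding the target) on `credit_data`.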

### 6. Distribution of Numerical Features

What values do the numerical features have and how are they distributed in their feature space? Are there obvious correlations between certain values and the target variable?

In [None]:
non_int_count = 0
for col in numerical_cols:
    non_int_count = non_int_count + credit_data[col].dropna().apply(lambda x: not x.is_integer()).sum()
print(f"Total non-integer values in numerical columns: {non_int_count}")

Result: Even though all numerical features are provided as floats, they are in fact all integers (always ending in '.0').
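
If the integer nature of the features should be reflected in the data types, the check can be vectorised and followed by a cast. This is a sketch on toy data; `integer_valued_columns` is a hypothetical helper, and the cast assumes that the selected columns contain no missing values.

```python
import numpy as np
import pandas as pd


def integer_valued_columns(df):
    """Return the numerical columns whose non-null values are all whole numbers."""
    result = []
    for col in df.select_dtypes(include=[np.number]).columns:
        values = df[col].dropna()
        if (values % 1 == 0).all():
            result.append(col)
    return result


# Toy data: one float column with whole numbers, one with real fractions
toy = pd.DataFrame({
    "loan_amount": [1169.0, 5951.0, 2096.0],
    "some_ratio": [0.5, 1.0, 2.25],
    "loan_purpose": ["car", "radio/tv", "repairs"],
})

int_cols = integer_valued_columns(toy)
# Cast only the integer-valued columns; this would fail on columns containing NaN
toy = toy.astype({col: "int64" for col in int_cols})

print(int_cols)
print(toy.dtypes)
```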

In [None]:
n = len(numerical_cols)
ncols = 3
nrows = math.ceil(n / ncols)

fig, axes = plt.subplots(nrows, ncols, figsize=(12, 4 * nrows))
axes = axes.flatten()

colors = {'bad': 'coral', 'good': 'skyblue'}
target_values = credit_data[target_col].unique()

for idx, col in enumerate(numerical_cols):
    ax = axes[idx]
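    # Pass all classes in one call so that the bars share bin edges and are actually stacked
    groups = [credit_data.loc[credit_data[target_col] == t, col].dropna() for t in target_values]
    ax.hist(groups, bins=20, stacked=True,
            label=[str(t) for t in target_values],
            color=[colors.get(str(t), 'gray') for t in target_values])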
    ax.set_title(col)
    ax.set_ylabel('count')
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)

# Hide any unused subplots
for j in range(idx + 1, len(axes)):
    fig.delaxes(axes[j])

red_patch = mpatches.Patch(color='coral', label='bad')
blue_patch = mpatches.Patch(color='skyblue', label='good')
fig.legend(handles=[red_patch, blue_patch], loc='upper right', fontsize=12, title=target_col)

plt.suptitle('Histograms of Numerical Features - Stacked by Assessment Result', x=0.5, y=1.02, fontsize=16)
plt.tight_layout()
plt.show()

Conclusions (among others):
- 'Dependents' is in fact more of a binary feature
- Other features like 'num_existing_loans' and 'residence_years' also take only a few distinct values
- Short loans are rejected less often than long loans
- Very small loans are rarely rejected, small loans are rejected the most, and very large loans (above 12,000 €) are mostly accepted (maybe because applications for such big loans are prepared rather carefully)
- Very old applicants seem to have lower loan acceptance rates
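
Statements such as "short loans are rejected less often" can be quantified by binning a numerical feature and computing the share of 'bad' outcomes per bin. The sketch below does this for loan durations; the toy values and the bin edges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real data set
toy = pd.DataFrame({
    "loan_duration": [6, 12, 12, 24, 36, 48],
    "target": ["good", "good", "bad", "good", "bad", "bad"],
})

# Assumed bin edges; intervals are right-closed, i.e. (0, 12], (12, 24], (24, inf)
bins = [0, 12, 24, np.inf]
labels = ["<=12", "13-24", ">24"]
bucket = pd.cut(toy["loan_duration"], bins=bins, labels=labels)

# Share of 'bad' outcomes per duration bucket
bad_rate = toy["target"].eq("bad").groupby(bucket, observed=True).mean()
print(bad_rate)
```

Applied to `credit_data`, the same pattern works for `loan_amount` or `applicant_age` with suitable bin edges.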

### 7. Averages Along Numerical Features

What are the key facts of the average loan application? What about the same facts about the average accepted loan application?

In [None]:
print("Descriptive statistics for the numerical features of all credit applications:")
credit_data.describe().loc[['count', 'mean', 'std', 'min', 'max']].round(2)

In [None]:
print("Descriptive statistics for the numerical features of successful credit applications:")
credit_data[credit_data['target'] == 'good'].describe().loc[['count', 'mean', 'std', 'min', 'max']].round(2)

Conclusions on the data: Successful applicants tend to be slightly older, request lower loan amounts, and opt for shorter loan durations compared to the overall applicant pool. Other features such as number of dependents, residence years, and installment rate show little difference between the groups.
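
Comparing successful applications with the overall pool dilutes the differences, because the overall pool contains the successful ones. A direct comparison of the two classes is a small change, sketched here on toy data (the values are invented for illustration).

```python
import pandas as pd

# Toy data, not taken from the real data set
toy = pd.DataFrame({
    "applicant_age": [25, 40, 35, 50],
    "loan_amount": [4000, 2000, 6000, 1000],
    "target": ["bad", "good", "bad", "good"],
})

numeric = list(toy.select_dtypes(include="number").columns)

# Mean of every numerical feature per class, and the gap between the classes
class_means = toy.groupby("target")[numeric].mean()
difference = class_means.loc["good"] - class_means.loc["bad"]

print(class_means)
print(difference)
```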

### 8. Feature Correlations

Are there any redundant features? Are there any features with noticeably high correlations to the target variable (great for prediction)?

In [None]:
from sklearn.preprocessing import LabelEncoder

# Make a copy to avoid changing your original data
credit_data_numeric = credit_data.copy()

# Encode all categorical columns
for col in credit_data_numeric.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    credit_data_numeric[col] = le.fit_transform(credit_data_numeric[col])

# Now compute the correlation matrix including categorical features
corr_matrix = credit_data_numeric.corr()

plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(
    corr_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
    square=True, fmt='.2f', cbar_kws={"shrink": .8}
)
plt.title('Correlation Matrix (Including Encoded Categorical Features)')
plt.tight_layout()
plt.show()

Conclusions:
- No duplicate features
- Correlation between loan_duration and loan_amount
- Correlation between num_existing_loans and repayment_history, which is plausible because repayment_history contains values such as 'no credits/ all paid' that directly reflect existing loans
- The bottom row shows the correlations between each feature and the target variable: account_status and loan_duration seem to be the most strongly correlated features
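
For nominal features, a correlation computed on label-encoded integers depends on the arbitrary order of the codes. As a complementary check, which is not part of the original analysis, one could compute Cramér's V, an association measure between two categorical variables based on the chi-squared statistic. The sketch below is a plain implementation without bias correction, and `cramers_v` is a helper name introduced here.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence of rows and columns
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Toy data: one feature that determines the target and one that is unrelated to it
target = pd.Series(["good", "bad", "good", "bad"])
informative = pd.Series(["a", "b", "a", "b"])
unrelated = pd.Series(["a", "a", "b", "b"])

print(cramers_v(informative, target))
print(cramers_v(unrelated, target))
```

In the notebook, `cramers_v(credit_data[col], credit_data[target_col])` could be evaluated for each categorical column.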

#### 8.1 Correlation loan_amount, loan_duration and target Variable

In [None]:
plt.figure(figsize=(6, 4))
sns.scatterplot(
    data=credit_data,
    x='loan_duration',
    y='loan_amount',
    hue='target',
    palette={'good': 'skyblue', 'bad': 'coral'},
    alpha=0.7
)
plt.xlabel('Loan Duration')
plt.ylabel('Loan Amount')
plt.title('Loan Amount vs. Loan Duration Colored by Target')
plt.tight_layout()
plt.show()

Conclusion: Short, small loans are usually granted. Short, large loans have the worst chance of acceptance. In general, the longer the loan, the larger it usually is.
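
The reading of the scatter plot can be cross-checked with a small two-way table of 'bad' rates. The sketch below splits duration and amount at assumed thresholds (`DURATION_SPLIT` and `AMOUNT_SPLIT` are hypothetical values) and uses toy data rather than the real samples.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real data set
toy = pd.DataFrame({
    "loan_duration": [6, 6, 12, 12, 36, 36, 48, 48],
    "loan_amount": [500, 6000, 800, 7000, 900, 8000, 1000, 9000],
    "target": ["good", "bad", "good", "bad", "good", "good", "bad", "good"],
})

# Assumed thresholds separating short/long and low/high loans
DURATION_SPLIT = 24
AMOUNT_SPLIT = 2000

groups = toy.assign(
    duration_group=np.where(toy["loan_duration"] <= DURATION_SPLIT, "short", "long"),
    amount_group=np.where(toy["loan_amount"] <= AMOUNT_SPLIT, "low", "high"),
    is_bad=toy["target"].eq("bad"),
)

# Rows: duration group, columns: amount group, cells: share of 'bad' outcomes
bad_rate = groups.groupby(["duration_group", "amount_group"])["is_bad"].mean().unstack()
print(bad_rate)
```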

#### 8.2 Correlation loan_amount, purpose/ age and target Variable

In [None]:
# Boxplots
plt.figure(figsize=(12, 10))

plt.subplot(211)
g1 = sns.boxplot(x="loan_purpose", y="loan_amount", data=credit_data, palette="husl", hue="target")
g1.set_xlabel("Loan Purpose", fontsize=12)
g1.set_ylabel("Loan Amount", fontsize=12)
g1.set_title("Loan Amount Distribution by Loan Purpose", fontsize=16)

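bins = [0, 20, 30, 40, 50, np.inf]
labels = ['<20', '20-29', '30-39', '40-49', '50+']
# Bucket the ages on a copy so that the helper column does not leak into
# credit_data (and from there into the model features used later on)
plot_data = credit_data.assign(
    age_bucket=pd.cut(credit_data['applicant_age'], bins=bins, labels=labels, right=False)
)

plt.subplot(212)
g2 = sns.boxplot(x="age_bucket", y="loan_amount", data=plot_data, palette="husl", hue="target")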
g2.set_xlabel("Applicant Age Bucket", fontsize=12)
g2.set_ylabel("Loan Amount", fontsize=12)
g2.set_title("Loan Amount Distribution by Applicant Age", fontsize=16)

plt.tight_layout()
plt.show()

Conclusions on purposes:
- Smaller loans for used cars are usually accepted more easily
- General tendency to accept smaller loans (less risk for the bank), except for repairs

Conclusions on ages:
- People above 50 are less trusted with large loans
- People between 30 and 40 seem most trustworthy
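
The age buckets rely on `pd.cut` with `right=False`, which makes every interval left-closed, so an applicant aged exactly 30 falls into '30-39'. The following sketch checks the boundary behaviour on a few made-up ages.

```python
import numpy as np
import pandas as pd

bins = [0, 20, 30, 40, 50, np.inf]
labels = ['<20', '20-29', '30-39', '40-49', '50+']

# Made-up ages that sit on or next to the bucket boundaries
ages = pd.Series([19, 20, 29, 30, 49, 50, 75])

# right=False yields the intervals [0, 20), [20, 30), [30, 40), [40, 50), [50, inf)
buckets = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "bucket": buckets}))
```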

### 9. Applying Random Forest Classifier to Assess Feature Importance

In addition to the feature correlation matrix, we now use a simple Random Forest Classifier as a second indication of which features might be important for our model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Prepare data for Random Forest model
X = pd.get_dummies(credit_data.drop(target_col, axis=1), drop_first=True)
y = credit_data[target_col]

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_
feature_names = X.columns

# Aggregate importances for original features
agg_importance = {}
for col in credit_data.drop(target_col, axis=1).columns:
    # Find all dummy columns that start with this feature name + '_'
    related_features = [f for f in feature_names if f == col or f.startswith(col + '_')]
    agg_importance[col] = sum(importances[list(feature_names).index(f)] for f in related_features)

# Create DataFrame for visualization
agg_importance_df = pd.DataFrame({'feature': list(agg_importance.keys()), 'importance': list(agg_importance.values())})
agg_importance_df = agg_importance_df.sort_values(by='importance', ascending=False)

# Plot
plt.figure(figsize=(5, 5))
sns.barplot(x='importance', y='feature', data=agg_importance_df)
plt.title('Aggregated Feature Importance from Random Forest')
plt.xlabel('Importance Score')
plt.ylabel('Original Features')
plt.tight_layout()
plt.show()

Conclusions: In combination with the results from the earlier correlation analysis, it seems promising to use a combination of loan_amount, account_status, loan_duration, applicant_age and loan_purpose, among other factors, as input features for model training.
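
Aggregating dummy importances by matching on `col + '_'` would double count if one column name were a prefix of another (for example a hypothetical `job` next to `job_years`). A possible safeguard, sketched below, is to encode with a separator that does not occur in any column name (here `'__'`, an assumption) and to split on it. The importances in this example are invented; in the notebook they would come from `model.feature_importances_`.

```python
import pandas as pd


def aggregate_importances(importances, original_columns, sep="__"):
    """Sum dummy-column importances back onto the original feature names."""
    totals = {col: 0.0 for col in original_columns}
    for name, value in importances.items():
        # Everything before the separator is the original feature name
        base = name.split(sep, 1)[0]
        totals[base] += float(value)
    return pd.Series(totals).sort_values(ascending=False)


# Toy data with one column name being a prefix of the other
toy = pd.DataFrame({
    "job": ["skilled", "unskilled", "skilled"],
    "job_years": [3, 1, 7],
})
X = pd.get_dummies(toy, prefix_sep="__")

# Hypothetical importances, aligned with the encoded columns
importances = pd.Series(
    {"job_years": 0.4, "job__skilled": 0.35, "job__unskilled": 0.25}
).reindex(X.columns)

agg = aggregate_importances(importances, toy.columns)
print(agg)
```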

### 10. Summary of Potential Data Flaws

- **Class Imbalance:** There is a moderate imbalance in the target variable, with more "good" than "bad" outcomes. This can bias models toward the majority class.
- **Sparse Categories:** Some categorical values are rare, which can lead to unreliable estimates for those groups.

Papers like ['Using neural network rule extraction and decision tables for credit-risk evaluation' by Bart Baesens et al.](https://www.jstor.org/stable/4133928) discuss the challenges of class imbalance and categorical encoding in very similar credit risk datasets, confirming that these are recognized issues in the literature.
