# German Credit Risk Modelling
A Complete End-to-End Default Prediction Workflow

# Step A – Data Pre-Processing


This section prepares the raw German Credit dataset for analysis and modeling.  
We load the data, inspect basic structure, and perform all essential preprocessing so that later stages (EDA, feature engineering, modeling) can run smoothly.


**Key tasks in this step:**
* Load the raw dataset and verify file integrity (shape, column names, basic stats).
* Handle missing or inconsistent values (imputation, removal, or corrections).
* Convert data types (e.g., ensure numeric columns are numeric).
* Standardize categorical variables (strip spaces, unify labels).
* Apply any initial feature transformations or derived columns required for EDA.


> **Goal:** Deliver a clean, consistent dataframe (`df_clean`) that serves as the single source of truth for all downstream analysis.
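
Two of the key tasks above, standardizing categorical labels and enforcing numeric types, can be sketched as follows. This is only an illustration on made-up values: the `raw` frame and its messy labels are assumptions, not rows from `german.data`.

```python
import pandas as pd

# Hypothetical messy input: stray spaces, mixed case, numbers stored as text.
raw = pd.DataFrame({
    "Housing": [" A152", "a152 ", "A153"],
    "Duration": ["6", "48", "12"],
})

clean = raw.copy()
# Strip surrounding whitespace and unify case so equal labels compare equal.
clean["Housing"] = clean["Housing"].str.strip().str.upper()
# Convert text to numbers; errors="raise" fails loudly on unparseable values.
clean["Duration"] = pd.to_numeric(clean["Duration"], errors="raise")
```

Failing loudly on bad numeric values is a deliberate choice here; `errors="coerce"` would instead turn them into missing values to be handled later.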

## Introduction & Objective
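The business problem is to predict whether a credit applicant is a good or bad risk, so that a lender can decide whether to approve the loan. The data is the UCI German Credit dataset (`german.data`): 1,000 applicants described by 20 attributes, with a target coded 1 (good) or 2 (bad).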

## 1. Load Raw Data
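Upload `german.data` (and its documentation file, `german.doc`) to the Colab session, then read the space-separated file into a pandas DataFrame with descriptive column names.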

In [1]:
from google.colab import files

uploadedData = files.upload()

Saving german.data to german.data
Saving german.doc to german.doc


In [2]:
import pandas as pd

columns = ["Checking Account", "Duration", "Credit History", "Purpose", "Credit Amount", "Savings Account", "Present Employment Since", "Installment Rate", "Personal Status and Sex", "Other Debtors", "Present Residence Since", "Property", "Age", "Other Installment Plans", "Housing", "Existing Credits", "Job", "Liable Maintaince Provider", "Telephone", "Foreign_Worker", "Target"]

germanData = pd.read_csv("german.data", sep = " ", names=columns)

germanData.head()

|  | Checking Account | Duration | Credit History | Purpose | Credit Amount | Savings Account | Present Employment Since | Installment Rate | Personal Status and Sex | Other Debtors | ... | Property | Age | Other Installment Plans | Housing | Existing Credits | Job | Liable Maintaince Provider | Telephone | Foreign_Worker | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |
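
Before moving on, it can help to confirm that the file was parsed as expected. The sketch below shows one possible check; `check_structure` is a hypothetical helper, and the two sample rows are copied from the output above but trimmed to five columns for brevity.

```python
import io
import pandas as pd

# Two rows from the head() output, trimmed to the first five columns.
sample = io.StringIO("A11 6 A34 A43 1169\nA12 48 A32 A43 5951\n")
cols = ["Checking Account", "Duration", "Credit History", "Purpose", "Credit Amount"]
df = pd.read_csv(sample, sep=" ", names=cols)

def check_structure(frame, expected_columns, numeric_columns):
    """Hypothetical helper: raise if names or numeric dtypes are not as expected."""
    if list(frame.columns) != list(expected_columns):
        raise ValueError("unexpected column names")
    for col in numeric_columns:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            raise TypeError(f"{col} is not numeric")
    return frame.shape

shape = check_structure(df, cols, ["Duration", "Credit Amount"])
```

On the full dataset the same idea would apply with all 21 column names and the seven numeric attributes.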


## 2. Data Cleaning
*Missing-value treatment, duplicates check, outlier handling*

In [3]:
germanData.isnull().sum()

| Column | Missing values |
|---|---|
| Checking Account | 0 |
| Duration | 0 |
| Credit History | 0 |
| Purpose | 0 |
| Credit Amount | 0 |
| Savings Account | 0 |
| Present Employment Since | 0 |
| Installment Rate | 0 |
| Personal Status and Sex | 0 |
| Other Debtors | 0 |
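
The section heading also mentions a duplicates check and outlier handling, which the cell above does not cover. A minimal sketch of both follows, assuming the common 1.5 × IQR rule for flagging outliers. The `toy` frame is made up for illustration (its last credit amount is deliberately extreme), and `iqr_outlier_mask` is a hypothetical helper.

```python
import pandas as pd

# Toy data: row 3 repeats row 2, and the last credit amount is an invented extreme.
toy = pd.DataFrame({
    "Duration": [6, 48, 12, 12, 24],
    "Credit Amount": [1169, 5951, 2096, 2096, 60000],
})

# Count fully duplicated rows (the first occurrence is not counted).
n_duplicates = int(toy.duplicated().sum())

def iqr_outlier_mask(series, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

outliers = iqr_outlier_mask(toy["Credit Amount"])
```

Whether flagged rows should be capped, transformed, or kept is a modelling decision; large loans are often genuine and informative in credit data.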


## 3. Feature Engineering

In [4]:
from sklearn.preprocessing import LabelEncoder

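le = LabelEncoder()
# Integer-encode every non-numeric column. The encoder is refit for each column,
# so codes follow the sorted labels of that column only.
for col in germanData.columns:
    if germanData[col].dtype == "object":
        germanData[col] = le.fit_transform(germanData[col])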

In [5]:
germanData.head()

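|  | Checking Account | Duration | Credit History | Purpose | Credit Amount | Savings Account | Present Employment Since | Installment Rate | Personal Status and Sex | Other Debtors | ... | Property | Age | Other Installment Plans | Housing | Existing Credits | Job | Liable Maintaince Provider | Telephone | Foreign_Worker | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 4 | 4 | 1169 | 4 | 4 | 4 | 2 | 0 | ... | 0 | 67 | 2 | 1 | 2 | 2 | 1 | 1 | 0 | 1 |
| 1 | 1 | 48 | 2 | 4 | 5951 | 0 | 2 | 2 | 1 | 0 | ... | 0 | 22 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | 2 |
| 2 | 3 | 12 | 4 | 7 | 2096 | 0 | 3 | 2 | 2 | 0 | ... | 0 | 49 | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 1 |
| 3 | 0 | 42 | 2 | 3 | 7882 | 0 | 3 | 2 | 2 | 2 | ... | 1 | 45 | 2 | 2 | 1 | 2 | 2 | 0 | 0 | 1 |
| 4 | 0 | 24 | 3 | 0 | 4870 | 0 | 2 | 3 | 2 | 0 | ... | 3 | 53 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 |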
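
Because the loop above reuses a single encoder, the code-to-label mapping of every column except the last one fitted is lost. If the mapping is needed later (for example to decode predictions or encode new applicants), one option is to keep a fitted encoder per column. A minimal sketch on toy values, where the `encoders` dictionary is an assumption and not part of the original notebook:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder

# Toy frame with category codes in the style of german.data.
toy = pd.DataFrame({
    "Checking Account": ["A11", "A12", "A14", "A11"],
    "Housing": ["A152", "A152", "A153", "A151"],
    "Duration": [6, 48, 12, 42],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    if not is_numeric_dtype(toy[col]):
        enc = LabelEncoder()
        encoded[col] = enc.fit_transform(toy[col])
        encoders[col] = enc

# The stored encoder can map integer codes back to the original labels.
original = encoders["Checking Account"].inverse_transform(encoded["Checking Account"])
```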


#### Separating Features and Target

In [6]:
X = germanData.drop("Target", axis=1)
Y = germanData["Target"].map({1:0, 2:1})

#### Normalizing Features with Standard Scaling

In [7]:
from sklearn.preprocessing import StandardScaler

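# Standardize every column of X (including the label-encoded categoricals)
# to zero mean and unit variance.
scaler = StandardScaler()
X_Scaled = scaler.fit_transform(X)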

In [8]:
X

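|  | Checking Account | Duration | Credit History | Purpose | Credit Amount | Savings Account | Present Employment Since | Installment Rate | Personal Status and Sex | Other Debtors | Present Residence Since | Property | Age | Other Installment Plans | Housing | Existing Credits | Job | Liable Maintaince Provider | Telephone | Foreign_Worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 4 | 4 | 1169 | 4 | 4 | 4 | 2 | 0 | 4 | 0 | 67 | 2 | 1 | 2 | 2 | 1 | 1 | 0 |
| 1 | 1 | 48 | 2 | 4 | 5951 | 0 | 2 | 2 | 1 | 0 | 2 | 0 | 22 | 2 | 1 | 1 | 2 | 1 | 0 | 0 |
| 2 | 3 | 12 | 4 | 7 | 2096 | 0 | 3 | 2 | 2 | 0 | 3 | 0 | 49 | 2 | 1 | 1 | 1 | 2 | 0 | 0 |
| 3 | 0 | 42 | 2 | 3 | 7882 | 0 | 3 | 2 | 2 | 2 | 4 | 1 | 45 | 2 | 2 | 1 | 2 | 2 | 0 | 0 |
| 4 | 0 | 24 | 3 | 0 | 4870 | 0 | 2 | 3 | 2 | 0 | 4 | 3 | 53 | 2 | 2 | 2 | 2 | 2 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 3 | 12 | 2 | 3 | 1736 | 0 | 3 | 3 | 1 | 0 | 4 | 0 | 31 | 2 | 1 | 1 | 1 | 1 | 0 | 0 |
| 996 | 0 | 30 | 2 | 1 | 3857 | 0 | 2 | 4 | 0 | 0 | 4 | 1 | 40 | 2 | 1 | 1 | 3 | 1 | 1 | 0 |
| 997 | 3 | 12 | 2 | 4 | 804 | 0 | 4 | 4 | 2 | 0 | 4 | 2 | 38 | 2 | 1 | 1 | 2 | 1 | 0 | 0 |
| 998 | 0 | 45 | 2 | 4 | 1845 | 0 | 2 | 4 | 2 | 0 | 4 | 3 | 23 | 2 | 2 | 1 | 2 | 1 | 1 | 0 |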


In [9]:
X_Scaled

array([[-1.25456565, -1.23647786,  1.34401408, ..., -0.42828957,
         1.21459768, -0.19601428],
       [-0.45902624,  2.24819436, -0.50342796, ..., -0.42828957,
        -0.82331789, -0.19601428],
       [ 1.13205258, -0.73866754,  1.34401408, ...,  2.33486893,
        -0.82331789, -0.19601428],
       ...,
       [ 1.13205258, -0.73866754, -0.50342796, ..., -0.42828957,
        -0.82331789, -0.19601428],
       [-1.25456565,  1.9992892 , -0.50342796, ..., -0.42828957,
         1.21459768, -0.19601428],
       [-0.45902624,  1.9992892 ,  1.34401408, ..., -0.42828957,
        -0.82331789, -0.19601428]])
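
The scaler above is fitted on the full dataset. If the data is later split into training and test sets, fitting the scaler on the training portion only keeps test-set statistics from leaking into preprocessing. A minimal sketch of that alternative, using synthetic data (`X_demo`, `y_demo`, and `demo_scaler` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, keeping the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
demo_scaler = StandardScaler()
X_train_scaled = demo_scaler.fit_transform(X_train)
X_test_scaled = demo_scaler.transform(X_test)
```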

In [10]:
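# X is already a DataFrame without the raw "Target" column. Copy it and attach
# the remapped target (0 = good, 1 = bad) instead of reindexing on
# germanData.columns, which would add an all-NaN "Target" column.
processed_data = X.copy()
processed_data["target"] = Y
processed_data.to_csv("processed_credit_data.csv", index=False)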

In [11]:
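# Keep the feature names so the scaled file is self-describing (features only, no target).
processed_data = pd.DataFrame(X_Scaled, columns=X.columns)
processed_data.to_csv("processed_credit_data_scaled.csv", index=False)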

In [12]:
import joblib

joblib.dump(scaler, "scaler.pkl")

['scaler.pkl']
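
Saving the scaler matters because new applicants must be transformed with the same mean and standard deviation that were learned here. The round trip might look like the sketch below, which writes to a temporary file rather than the real `scaler.pkl`; the two-column training array and the new applicant row are toy values.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data: duration (months) and credit amount.
train = np.array([[6.0, 1169.0], [48.0, 5951.0], [12.0, 2096.0]])
fitted = StandardScaler().fit(train)

# Persist the fitted scaler and load it back, as a later session would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler_demo.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# Use transform (not fit_transform) so the stored statistics are reused.
new_applicant = np.array([[24.0, 4870.0]])
scaled_new = restored.transform(new_applicant)
```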

## Step A – Data Pre-processing: Summary

### 1. Missing Values
No missing values were found: `isnull().sum()` returned 0 for the columns checked (the UCI German Credit dataset is well curated).  
**Business Note:** In live credit-risk projects, missing values are common (e.g., incomplete applications). Typical strategies include imputation, dropping rows, or treating missingness as its own category.
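
To make the strategies in the note concrete, here is a minimal sketch on invented data (the German Credit file itself needs none of this). It assumes median imputation for a numeric column and a separate `"MISSING"` label for a categorical one; both the gaps and the label are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps inserted for illustration.
toy = pd.DataFrame({
    "Credit Amount": [1169.0, np.nan, 2096.0, 7882.0],
    "Savings Account": ["A65", "A61", None, "A61"],
})

# Numeric column: replace the gap with the median of the observed values.
num_imputer = SimpleImputer(strategy="median")
toy["Credit Amount"] = num_imputer.fit_transform(toy[["Credit Amount"]]).ravel()

# Categorical column: treat missingness as its own category.
toy["Savings Account"] = toy["Savings Account"].fillna("MISSING")
```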

### 2. Categorical Encoding
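Categorical features (e.g., checking account status, credit history, savings account) were converted to integer codes with **Label Encoding** (`LabelEncoder`), applied to every non-numeric column.  
**Business Relevance:** Label encoding assigns codes in sorted label order, which implies an ordering. That is reasonable for ordinal attributes but can mislead linear models on nominal ones such as loan purpose, where **One-Hot** encoding is the safer choice.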
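
For nominal attributes, one-hot encoding avoids implying any order between categories. A minimal sketch with `pandas.get_dummies` on toy values; treating `Purpose` as nominal is an assumption made for illustration.

```python
import pandas as pd

# Toy frame: Purpose has no natural order, Duration is numeric.
toy = pd.DataFrame({
    "Purpose": ["A43", "A46", "A42", "A43"],
    "Duration": [6, 12, 42, 24],
})

# One indicator column per category; other columns are left untouched.
one_hot = pd.get_dummies(toy, columns=["Purpose"], dtype=int)
```

The trade-off is width: a column with many categories expands into as many indicator columns.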

### 3. Feature Scaling
All features in `X`, including the encoded categorical columns, were standardized with **StandardScaler**.  
**Reason:** Scale-sensitive algorithms such as Logistic Regression train more reliably when features are on a comparable scale; tree-based models such as Gradient Boosting are largely insensitive to feature scaling.
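
If only the continuous variables should be standardized, leaving category codes as they are, scikit-learn's `ColumnTransformer` is one way to express that. The sketch below uses a toy frame; which columns count as numeric (`numeric_cols`) is an assumption made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "Duration": [6, 48, 12, 42],
    "Credit Amount": [1169, 5951, 2096, 7882],
    "Checking Account": [0, 1, 3, 0],  # label-encoded category code
})

numeric_cols = ["Duration", "Credit Amount"]
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # keep the remaining columns unchanged
)

# Output columns: the scaled numeric columns first, then the passthrough ones.
transformed = preprocess.fit_transform(toy)
```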

### 4. Class Balance Check
Target distribution ≈ 70 % good / 30 % bad loans.  
**Decision:** Rather than synthetic oversampling (e.g., SMOTE), we will use **`class_weight='balanced'`** during model training to counter class imbalance; the data itself is not resampled in this step.  
**Business Insight:** Financial institutions often prefer class-weight adjustments to avoid the potential bias introduced by synthetic samples.
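
To show what the balanced setting does, the sketch below computes the weights scikit-learn would assign for a 70/30 split and passes the same option to a classifier. The data is synthetic, and `LogisticRegression` is used only as an example estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic target with the same 70/30 ratio as the dataset (0 = good, 1 = bad).
y_demo = np.array([0] * 700 + [1] * 300)

# Balanced weights follow n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# Synthetic features, shifted by class so the problem is learnable.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2)) + y_demo[:, None]

# The estimator applies the same weighting internally during fitting.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_demo, y_demo)
```

With these counts the minority class receives a weight of about 1.67 against about 0.71 for the majority class, so errors on bad loans cost more during training.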

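**Outcome:** Produced clean, encoded feature sets (`processed_credit_data.csv` and the standardized `processed_credit_data_scaled.csv`) plus the fitted scaler (`scaler.pkl`), ready for exploratory analysis and modeling. The class imbalance is left in the data and handled at training time.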

## Final Takeaway for Data Pre-Processing
#### The dataset is clean, encoded, and scaled, with class imbalance to be handled by class weights at training time. We are ready for Step B: Exploratory Data Analysis.
