# Customer Credit Risk Scoring Model

### **Objective:**

The goal of this project is to build a **Customer Credit Risk Scoring Model** to assess the likelihood of a customer defaulting on a loan. The model will classify customers as **low-risk** or **high-risk** based on their financial data and personal characteristics. This project is relevant for the finance industry, particularly for banks and lending institutions that need to evaluate creditworthiness before approving loans.

### **1. Problem Definition:**

Credit risk scoring helps financial institutions make informed lending decisions by estimating the likelihood that a customer defaults on a credit obligation. A customer’s risk score informs whether to approve a loan, what interest rate to set, and how high a credit limit to grant.

### **2. Dataset:**

First, I import the necessary libraries (`assistant` is a helper module used later for the outlier checks).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import assistant

%matplotlib inline

This project uses the German Credit Dataset (Statlog), available at https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data. To load it, I install the `ucimlrepo` Python package with `pip`:

`pip install ucimlrepo`

Then I fetch the dataset as follows.

In [2]:
from ucimlrepo import fetch_ucirepo 

# Fetch German Credit Data (ID: 144).
statlog_german_credit_data = fetch_ucirepo(id=144) 

# Extract features into a DataFrame (copied so that adding a column does not modify the fetched object).
df = statlog_german_credit_data.data.features.copy()
# Add target values as a new column 'credibility'.
df["credibility"] = statlog_german_credit_data.data.targets

df.head(5)

Unnamed: 0,Attribute1,Attribute2,Attribute3,Attribute4,Attribute5,Attribute6,Attribute7,Attribute8,Attribute9,Attribute10,...,Attribute12,Attribute13,Attribute14,Attribute15,Attribute16,Attribute17,Attribute18,Attribute19,Attribute20,credibility
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A121,67,A143,A152,2,A173,1,A192,A201,1
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A121,22,A143,A152,1,A173,1,A191,A201,2
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A121,49,A143,A152,1,A172,2,A191,A201,1
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A122,45,A143,A153,1,A173,2,A191,A201,1
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A124,53,A143,A153,2,A173,2,A191,A201,2
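
The `credibility` column holds the values 1 and 2. As far as I can tell from the dataset documentation, 1 stands for good credit and 2 for bad credit, which would correspond to the low-risk and high-risk classes of this project. The snippet below is a minimal sketch of that mapping on a toy frame; the names `RISK_LABELS`, `risk_label`, and `is_high_risk` are my own and not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; only the target column matters here.
toy = pd.DataFrame({"credibility": [1, 2, 1, 1, 2]})

# Assumed coding, taken from the dataset documentation: 1 = good, 2 = bad.
RISK_LABELS = {1: "low-risk", 2: "high-risk"}

# Human-readable label for inspection and plots.
toy["risk_label"] = toy["credibility"].map(RISK_LABELS)
# Binary target for modeling: 1 marks a high-risk (bad credit) customer.
toy["is_high_risk"] = (toy["credibility"] == 2).astype(int)

print(toy)
```

A 0/1 target like `is_high_risk` is usually more convenient for classifiers than the original 1/2 coding.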


The column names are not descriptive: they all follow the pattern "Attribute[number]". I therefore rename the columns so that the data is easier to interpret.

In [3]:
statlog_german_credit_data.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,Attribute1,Feature,Categorical,,Status of existing checking account,,no
1,Attribute2,Feature,Integer,,Duration,months,no
2,Attribute3,Feature,Categorical,,Credit history,,no
3,Attribute4,Feature,Categorical,,Purpose,,no
4,Attribute5,Feature,Integer,,Credit amount,,no
5,Attribute6,Feature,Categorical,,Savings account/bonds,,no
6,Attribute7,Feature,Categorical,Other,Present employment since,,no
7,Attribute8,Feature,Integer,,Installment rate in percentage of disposable i...,,no
8,Attribute9,Feature,Categorical,Marital Status,Personal status and sex,,no
9,Attribute10,Feature,Categorical,,Other debtors / guarantors,,no
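
As an aside, names could also be derived from the `description` column automatically. The sketch below does this for a toy excerpt of the table above; `to_snake_case` is a hypothetical helper of my own, not something `ucimlrepo` provides.

```python
import re

import pandas as pd

# Toy excerpt of the variables table shown above (first three rows only).
variables = pd.DataFrame({
    "name": ["Attribute1", "Attribute2", "Attribute3"],
    "description": ["Status of existing checking account", "Duration", "Credit history"],
})


def to_snake_case(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with underscores."""
    return "_".join(re.findall(r"[a-z0-9]+", text.lower()))


# Map each generic column name to a slug derived from its description.
rename_map = {
    name: to_snake_case(description)
    for name, description in zip(variables["name"], variables["description"])
}
print(rename_map)
```

Generated names tend to be long (for example `status_of_existing_checking_account`), which is one reason to prefer a short hand-written list.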


Based on these descriptions, I name the columns as follows.

In [4]:
# Define a list of feature column names for the DataFrame.
cols_features = [
    "account_status", "duration", "credit_history", "purpose", "credit_amount", "savings_account_or_bonds", "employment", "installment_rate",
    "status_and_sex", "other_debtors_or_guarantors", "residence", "property", "age", "other_installment_plans", "housing", "num_credits", "job", 
    "num_liable_people", "telephone", "is_foreign"
]

# Rename DataFrame columns (excluding "credibility") using the defined feature names.
df = df.rename(columns=dict(zip(df.columns.drop("credibility"), cols_features)))
df.head(5)

Unnamed: 0,account_status,duration,credit_history,purpose,credit_amount,savings_account_or_bonds,employment,installment_rate,status_and_sex,other_debtors_or_guarantors,...,property,age,other_installment_plans,housing,num_credits,job,num_liable_people,telephone,is_foreign,credibility
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A121,67,A143,A152,2,A173,1,A192,A201,1
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A121,22,A143,A152,1,A173,1,A191,A201,2
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A121,49,A143,A152,1,A172,2,A191,A201,1
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A122,45,A143,A153,1,A173,2,A191,A201,1
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A124,53,A143,A153,2,A173,2,A191,A201,2


Unfortunately, the categorical values are stored as codes (such as A11 or A34), so I'll interpret them as best I can.
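
One way to make the codes readable is a lookup table per column. The sketch below decodes two columns of a toy frame. I transcribed the code meanings from the dataset documentation in shortened form, so treat them as assumptions and verify them there before relying on them; the dictionary names are my own.

```python
import pandas as pd

# Code-to-label tables (assumed, shortened from the dataset documentation).
ACCOUNT_STATUS = {
    "A11": "< 0 DM",
    "A12": "0 to < 200 DM",
    "A13": ">= 200 DM",
    "A14": "no checking account",
}
TELEPHONE = {"A191": "none", "A192": "yes"}

# Toy frame using the values of the first three rows shown earlier.
toy = pd.DataFrame({
    "account_status": ["A11", "A12", "A14"],
    "telephone": ["A192", "A191", "A191"],
})

decoded = toy.assign(
    account_status=toy["account_status"].map(ACCOUNT_STATUS),
    telephone=toy["telephone"].map(TELEPHONE),
)

# Codes missing from a lookup table become NaN, so check that all were decoded.
assert decoded.notna().all().all()
print(decoded)
```

Keeping the decoded values in a separate frame leaves the original codes available for encoding steps later on.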

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   account_status               1000 non-null   object
 1   duration                     1000 non-null   int64 
 2   credit_history               1000 non-null   object
 3   purpose                      1000 non-null   object
 4   credit_amount                1000 non-null   int64 
 5   savings_account_or_bonds     1000 non-null   object
 6   employment                   1000 non-null   object
 7   installment_rate             1000 non-null   int64 
 8   status_and_sex               1000 non-null   object
 9   other_debtors_or_guarantors  1000 non-null   object
 10  residence                    1000 non-null   int64 
 11  property                     1000 non-null   object
 12  age                          1000 non-null   int64 
 13  other_installment_plans      1000 

The data has no null values, but are there any outlying numerical values? Let's find out.

In [6]:
cols_w_outliers = []

for col in df.columns:
    if df[col].dtype == "int64" and assistant.has_outliers(df[col]):
        cols_w_outliers.append(col)

cols_w_outliers

['duration', 'credit_amount', 'age', 'num_credits', 'num_liable_people']
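
The source of the `assistant` module is not shown here, so the exact outlier rule is unknown. A common choice is Tukey's rule, which flags values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below implements that rule under this assumption; `iqr_outlier_mask` and `has_iqr_outliers` are hypothetical stand-ins, not the module's actual functions.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def has_iqr_outliers(values: pd.Series, k: float = 1.5) -> bool:
    """Return True if at least one value is flagged by the IQR rule."""
    return bool(iqr_outlier_mask(values, k).any())


# Toy ages: Q1 = 27, Q3 = 40, IQR = 13, so the fences are 7.5 and 59.5.
ages = pd.Series([22, 25, 27, 30, 33, 35, 40, 42, 75])
print(iqr_outlier_mask(ages).tolist())
print(has_iqr_outliers(ages))
```

In this toy series only the value 75 falls outside the fences.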

In [7]:
for col in cols_w_outliers:
    df[f"{col}_outlies"] = assistant.get_outliers(df[col])

df.head(5)

Unnamed: 0,account_status,duration,credit_history,purpose,credit_amount,savings_account_or_bonds,employment,installment_rate,status_and_sex,other_debtors_or_guarantors,...,job,num_liable_people,telephone,is_foreign,credibility,duration_outlies,credit_amount_outlies,age_outlies,num_credits_outlies,num_liable_people_outlies
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A173,1,A192,A201,1,False,False,True,False,False
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A173,1,A191,A201,2,True,False,False,False,False
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A172,2,A191,A201,1,False,False,False,False,True
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A173,2,A191,A201,1,False,False,False,False,True
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A173,2,A191,A201,2,False,False,False,False,True


In [8]:
for col in cols_w_outliers:
    print(f"{col}: {min(df[df[f'{col}_outlies']][col].unique())}")

duration: 45
credit_amount: 7966
age: 65
num_credits: 4
num_liable_people: 2


In [9]:
for col in cols_w_outliers:
    col_outlies = f"{col}_outlies"
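    # Median of the non-outlying values only.
    median = df.loc[~df[col_outlies], col].median()
    # Replace flagged values with that median; Series.mask avoids the deprecated positional v[0]/v[1] lookups.
    df[col] = df[col].mask(df[col_outlies], median)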
    print(f"{col}: impute outliers with {median}")

duration: impute outliers with 18.0
credit_amount: impute outliers with 2145.5
age: impute outliers with 33.0
num_credits: impute outliers with 1.0
num_liable_people: impute outliers with 1.0
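
To make the imputation step concrete, here is a small worked example on a toy frame with a single flagged value. It is only a sketch of the idea (replace each flagged value with the median of the unflagged ones), with made-up numbers.

```python
import pandas as pd

# Toy data: the last duration is flagged as an outlier.
toy = pd.DataFrame({
    "duration": [6, 12, 18, 24, 72],
    "duration_outlies": [False, False, False, False, True],
})

# Median of the non-outlying values only, so the extreme value does not influence it.
median = toy.loc[~toy["duration_outlies"], "duration"].median()

# Series.mask replaces the entries where the condition is True.
toy["duration"] = toy["duration"].mask(toy["duration_outlies"], median)

print(median)
print(toy["duration"].tolist())
```

Computing the median from the unflagged rows matters: including 72 would shift the replacement value from 15 to 18.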


