# Credit Risk Assessment

## 1. Project Overview

**Context**

The consumer credit department of a bank wants to automate the decision-making process for approving home equity lines of credit. To do this, it will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants who were granted credit through the current loan underwriting process. It will be built with predictive modeling tools, but it must be sufficiently interpretable to provide a reason for any adverse action (rejection).

**Content**

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

**Data source**: https://www.kaggle.com/datasets/ajay1735/hmeq-data


## 2. Imports

In [8]:
import pandas as pd

## 3. Data Loading

In [9]:
credit = pd.read_csv('./datasets/hmeq.csv')
credit.head()

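|   | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|-----|------|---------|-------|--------|-----|-----|-------|--------|-------|------|------|---------|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |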


## 4. Data Inspection

In [10]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
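
The overview states that the adverse outcome occurs in roughly 20% of cases. A minimal sketch of how that share could be checked is shown here; the helper name `target_rate` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def target_rate(df: pd.DataFrame, target: str = "BAD") -> float:
    """Return the share of rows where the binary target equals 1."""
    # For a 0/1 column, the mean is the proportion of ones.
    return float(df[target].mean())


# Toy frame for illustration only; these rows are not the HMEQ data.
toy = pd.DataFrame(
    {"BAD": [1, 0, 0, 0, 0], "LOAN": [1100, 1300, 1500, 1500, 1700]}
)
rate = target_rate(toy)
print(f"Share of BAD == 1: {rate:.1%}")
```

On the loaded data, `target_rate(credit)` would be expected to land near the 20% quoted in the overview.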


## 5. Data Cleaning

### Handling Duplicates

In [11]:
# Check for duplicate rows
duplicates = credit.duplicated()

# Count the number of duplicate rows
num_duplicates = duplicates.sum()

print(f"Number of duplicate rows: {num_duplicates}")

Number of duplicate rows: 0


### Handling Missing Values

Remove rows in which more than 50% of the values are missing.

In [12]:
# Calculate the threshold for 50% missing values
threshold = credit.shape[1] / 2

# Identify rows with more than 50% missing values
rows_with_many_missing = credit.isna().sum(axis=1) > threshold

# Count the number of rows with more than 50% missing values
num_rows_with_many_missing = rows_with_many_missing.sum()

print(f"Number of rows with more than 50% missing values: {num_rows_with_many_missing}")

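# Remove these rows; .copy() makes the result an independent frame so that
# later column assignments do not trigger a SettingWithCopyWarning
credit = credit[~rows_with_many_missing].copy()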

# Verify the removal
print(f"Number of rows after removal: {credit.shape[0]}")

Number of rows with more than 50% missing values: 126
Number of rows after removal: 5834
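
The same row filter can also be written with `DataFrame.dropna(thresh=...)`, which keeps rows that have at least a given number of non-missing values. The sketch below assumes a hypothetical helper `drop_sparse_rows` and a toy frame, and checks that it agrees with the boolean-mask approach used in the cell above.

```python
import math

import numpy as np
import pandas as pd


def drop_sparse_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows in which more than half of the values are missing."""
    n_cols = df.shape[1]
    # dropna(thresh=k) keeps rows with at least k non-missing values.
    # "At most half missing" means "at least ceil(n_cols / 2) present".
    min_non_missing = math.ceil(n_cols / 2)
    return df.dropna(thresh=min_non_missing).copy()


# Toy frame for illustration only; these rows are not the HMEQ data.
toy = pd.DataFrame(
    {
        "BAD": [1, 1, 0],
        "LOAN": [1100, 1500, 1700],
        "MORTDUE": [25860.0, np.nan, 97800.0],
        "VALUE": [39025.0, np.nan, np.nan],
        "YOJ": [10.5, np.nan, 3.0],
    }
)

kept = drop_sparse_rows(toy)

# Reference result using the mask from the notebook cell.
mask = toy.isna().sum(axis=1) > toy.shape[1] / 2
print(kept.equals(toy[~mask]))
```

Either form works; `thresh` is shorter, while the explicit mask makes it easy to count the affected rows before dropping them.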


In [13]:
# Missing values per column by percentage
(credit.isna().mean() * 100).sort_values()

BAD         0.000000
LOAN        0.000000
VALUE       1.628385
CLNO        1.645526
REASON      2.862530
JOB         2.982516
CLAGE       3.119643
NINQ        6.582105
YOJ         6.993486
MORTDUE     7.559136
DELINQ      7.781968
DEROG       9.976003
DEBTINC    21.049023
dtype: float64

### Imputation

**For Numerical Features**:

**Median Imputation**: Missing values in every numerical feature (for example MORTDUE, VALUE, and DEBTINC) are replaced with the column median. The median is simple, interpretable, and less sensitive to outliers than the mean.
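
To illustrate why the median is the more robust fill value, the sketch below compares it with the mean on a toy series that contains one extreme value. The numbers are invented for illustration and are not taken from the HMEQ data.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; one extreme value and one missing entry.
debt_ratio = pd.Series([30.0, 32.0, 34.0, 36.0, 200.0, np.nan])

# Both statistics skip missing values by default.
mean_fill = debt_ratio.mean()      # pulled upward by the extreme value
median_fill = debt_ratio.median()  # middle of the observed values

print(f"mean = {mean_fill:.1f}, median = {median_fill:.1f}")
# mean = 66.4, median = 34.0

# Fill the missing entry with the median.
imputed = debt_ratio.fillna(median_fill)
```

Filling with the mean would place the imputed record well above four of the five observed values, whereas the median keeps it in the middle of the bulk of the data.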


**For Categorical Features**:

**Mode Imputation**: Missing values in the categorical features REASON and JOB are replaced with the most frequent category in each column.
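
One detail worth knowing about `Series.mode()` is that it returns a Series, which can hold several values when categories tie for the highest count. Taking element `[0]` then picks the first of the sorted modes. A small sketch on an invented column (not the HMEQ data):

```python
import pandas as pd

# Toy column for illustration only; "Other" and "Office" tie with two rows each.
job = pd.Series(["Other", "Office", "Other", "Office", None, "Mgr"])

modes = job.mode()     # missing values are ignored by default
fill_value = modes[0]  # first of the sorted modes when there is a tie

filled = job.fillna(fill_value)
print(list(modes), fill_value)
```

When the tie matters, it may be preferable to choose the fill category explicitly rather than rely on sort order.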

In [None]:
# Median Imputation for Numerical Features
numerical_cols = ['VALUE', 'CLNO', 'CLAGE', 'NINQ', 'YOJ', 'MORTDUE', 'DELINQ', 'DEROG', 'DEBTINC']
for col in numerical_cols:
    credit[col] = credit[col].fillna(credit[col].median())

# Mode Imputation for Categorical Features
categorical_cols = ['REASON', 'JOB']
for col in categorical_cols:
    credit[col] = credit[col].fillna(credit[col].mode()[0])
    
# Percentage of missing values per column after imputation
(credit.isna().mean() * 100).sort_values()

BAD        0.0
LOAN       0.0
MORTDUE    0.0
VALUE      0.0
REASON     0.0
JOB        0.0
YOJ        0.0
DEROG      0.0
DELINQ     0.0
CLAGE      0.0
NINQ       0.0
CLNO       0.0
DEBTINC    0.0
dtype: float64
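
As a possible variation on the imputation cell, the columns can be selected by dtype instead of being listed by name. The sketch below is one way to do that under stated assumptions: the helper name `impute_simple` and the toy frame are hypothetical, and numeric columns without missing values (such as BAD and LOAN) simply pass through unchanged.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with medians filled into numeric columns and modes into the rest."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(numeric_cols)

    # fillna with a Series aligns on column labels, so each column gets its own median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    for col in other_cols:
        modes = out[col].mode()
        if not modes.empty:  # an all-missing column has no mode
            out[col] = out[col].fillna(modes[0])
    return out


# Toy frame for illustration only; these rows are not the HMEQ data.
toy = pd.DataFrame(
    {
        "BAD": [1, 0, 0, 1],
        "DEBTINC": [30.0, np.nan, 34.0, 38.0],
        "JOB": ["Other", None, "Other", "Office"],
    }
)

clean = impute_simple(toy)
print(clean.isna().sum().sum())
```

Returning a copy leaves the input frame untouched, which makes it easier to compare the data before and after imputation.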