# Credit risk modelling using Logistic Regression

## Problem Statement

Predict loan defaulters using a Logistic Regression model on the credit risk data, and calculate credit scores.

## Learning Objectives

* perform data exploration, preprocessing and visualization
* implement Logistic Regression from scratch and with the sklearn library
* evaluate the model using appropriate performance metrics
* develop a credit scoring system

## Dataset

The dataset chosen for this project is the '**Give Me Some Credit**' dataset which can be used to build models for predicting loan repayment defaulters. This dataset contains 150000 data points and 11 features.

### Download the dataset

### Install Packages

In [None]:
!pip install pandas==1.3.5

In [None]:
!pip install xverse

### Import Necessary Packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sns
from matplotlib import pyplot as plt
import math
from xverse.transformer import MonotonicBinning,WOE
%matplotlib inline

### Load the dataset

In [None]:
# YOUR CODE HERE
df = pd.read_csv('GiveMeSomeCredit.csv', index_col=0)
df.head()

In [None]:
df.shape

#### Describe the statistical properties of the dataset

In [None]:
df.describe().T

### Pre-processing

#### Inspect correlations to spot unwanted columns

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True)

#### Handle the missing data

Count the null values in each column, then either fill them with the column mean or remove the affected rows.
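The two options can be sketched on a small toy frame. The frame `toy` and its values are made up for illustration and are not taken from the dataset; the cells that follow use the mean-fill option on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Give Me Some Credit dataset
toy = pd.DataFrame({
    'MonthlyIncome': [3000.0, np.nan, 5000.0, 4000.0],
    'NumberOfDependents': [0.0, 2.0, np.nan, 1.0],
})

# Count the missing values per column
null_counts = toy.isnull().sum()

# Option 1: fill each column's gaps with that column's mean
filled = toy.fillna(toy.mean())

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(null_counts)
print(filled)
print(dropped.shape)
```

Mean filling keeps every row but pulls the filled values toward the centre of the distribution, while dropping keeps only fully observed rows at the cost of a smaller sample.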

In [None]:
df.isnull().sum()

In [None]:
# Assign the result back instead of using inplace on a column selection
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].mean())

In [None]:
# Assign the result back instead of using inplace on a column selection
df['NumberOfDependents'] = df['NumberOfDependents'].fillna(df['NumberOfDependents'].mean())

In [None]:
df.isnull().sum()

In [None]:
df.drop(df[df['age']<18].index, inplace=True)
df[df['age']<18]

In [None]:
df.boxplot(column='RevolvingUtilizationOfUnsecuredLines')

In [None]:
upper = np.percentile(df['RevolvingUtilizationOfUnsecuredLines'], 75)
df['RevolvingUtilizationOfUnsecuredLines'] = df['RevolvingUtilizationOfUnsecuredLines'].apply(lambda x: upper if x>1 else x)

In [None]:
df.boxplot(column='RevolvingUtilizationOfUnsecuredLines')

### EDA & Visualization

#### Calculate the percentage of the target labels and visualize with a graph

In [None]:
target0 = (df['SeriousDlqin2yrs'].value_counts()[0]*100)/len(df['SeriousDlqin2yrs'])
target1 = (df['SeriousDlqin2yrs'].value_counts()[1]*100)/len(df['SeriousDlqin2yrs'])
target_percent = [target0, target1]
plt.bar(['0', '1'],target_percent)
plt.title('percentages of target labels')

In [None]:
plt.pie((df['SeriousDlqin2yrs'].value_counts()), labels = [0, 1], autopct = '%10.3f%%')
plt.title("Percentages of customers with serious delinquency in 2 years")

#### Plot the distribution of SeriousDlqin2yrs by age

In [None]:
bins = [df['age'].min(), 30, 60, 90, df['age'].max()]
df['age_bins'] = pd.cut(x = df['age'], bins = bins, include_lowest = True)
df.head()

In [None]:
fig, ax = plt.subplots(figsize=(8,8))
sns.countplot(data=df, x='age_bins', hue='SeriousDlqin2yrs', ax=ax)

In [None]:
plt.figure(figsize=(6,5),dpi=110)
plt.title("age vs MonthlyIncome",fontsize=16)
sns.regplot(data=df,y="age",x="MonthlyIncome")
plt.show()

#### Calculate the correlation and plot the heatmap

In [None]:
# Correlate the numeric columns only (age_bins is categorical)
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, fmt=".1f")
plt.show()

### Data Engineering

#### Weight of Evidence and Information value

* Bin each variable, choosing the bins separately for each one
* Calculate the information value and choose the best features based on the rules given below

| Information Value | Variable Predictiveness |
| --- | --- |
| Less than 0.02 | Not useful for prediction |
| 0.02 to 0.1 | Weak predictive power |
| 0.1 to 0.3 | Medium predictive power |
| 0.3 to 0.5 | Strong predictive power |
| Greater than 0.5 | Suspicious predictive power |

* Calculate Weight of evidence for the selected variables
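As a rough illustration of what these quantities measure, the sketch below computes WoE and IV by hand for a single, already binned toy feature. The data are made up, and the sign convention used here, WoE = ln(share of non-events / share of events), is an assumption: conventions differ between references and libraries, so the signs may not match the output of xverse. The information value does not depend on that choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one binned feature and a binary target (1 = default)
toy = pd.DataFrame({
    'bin':    ['low'] * 4 + ['mid'] * 4 + ['high'] * 4,
    'target': [1, 1, 0, 0,   1, 0, 0, 0,   1, 0, 0, 0],
})

# Events and non-events per bin
grouped = toy.groupby('bin')['target'].agg(events='sum', total='count')
grouped['non_events'] = grouped['total'] - grouped['events']

# Share of all events / non-events that falls into each bin
grouped['event_share'] = grouped['events'] / grouped['events'].sum()
grouped['non_event_share'] = grouped['non_events'] / grouped['non_events'].sum()

# Assumed convention: WoE = ln(non-event share / event share)
grouped['woe'] = np.log(grouped['non_event_share'] / grouped['event_share'])

# IV sums (non-event share - event share) * WoE over the bins
grouped['iv_part'] = (grouped['non_event_share'] - grouped['event_share']) * grouped['woe']
iv = grouped['iv_part'].sum()

print(grouped[['events', 'non_events', 'woe', 'iv_part']])
print(f'Information value: {iv:.4f}')
```

With these toy counts the information value lands in the 0.1 to 0.3 band of the table above. A bin with zero events or zero non-events would make the logarithm undefined, so real implementations usually merge or smooth such bins.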

In [None]:
# age_bins was only a helper column for the EDA plots, so drop it with the target
X = df.drop(columns=['SeriousDlqin2yrs', 'age_bins'])
y = df['SeriousDlqin2yrs']

In [None]:
from xverse.transformer import MonotonicBinning

clf = MonotonicBinning()
clf.fit(X, y)

print(clf.bins)
output_bins = clf.bins

In [None]:
clf = MonotonicBinning(custom_binning=output_bins) #output_bins was created earlier

out_X = clf.transform(X)
out_X.head()

In [None]:
from xverse.transformer import WOE
clf = WOE()
clf.fit(X, y)
clf.woe_df # weight of evidence transformation dataset. This dataset will be used in making bivariate charts as well. 
clf.iv_df #information value dataset

In [None]:
clf.woe_df # weight of evidence transformation dataset. This dataset will be used in making bivariate charts as well. 


In [None]:
out_X_woe = clf.transform(X)

In [None]:
out_X_woe.head()

### Identify features and target, and split them into train and test sets

In [None]:
# YOUR CODE HERE
X = out_X_woe  # WoE-transformed features; the target was never part of this frame
y = df['SeriousDlqin2yrs']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

### Logistic Regression from scratch using gradient method

In [None]:
def sigmoid(x):
  return np.maximum(np.minimum(1 / (1 + np.exp(-x)), 0.9999), 0.0001)

def cost_function(x, y, theta):
  t = x.dot(theta)
  return - np.sum(y * np.log(sigmoid(t)) + (1 - y) * np.log(1 - sigmoid(t))) / x.shape[0]

def gradient_cost_function(x, y, theta):
  t = x.dot(theta)
  return x.T.dot(y - sigmoid(t)) / x.shape[0]

def update_theta(x, y, theta, learning_rate):
  return theta + learning_rate * gradient_cost_function(x, y, theta)

def train(x, y, learning_rate, iterations=500, threshold=0.0005):
  theta = np.zeros(x.shape[1])
  costs = []
  print('Start training')
  for i in range(iterations):
    theta = update_theta(x, y, theta, learning_rate)
    cost = cost_function(x, y, theta)
    print(f'[Training step #{i}] — Cost function: {cost:.4f}')
    costs.append({'cost': cost, 'weights': theta})
    if i > 15 and abs(costs[-2]['cost'] - costs[-1]['cost']) < threshold:
      break
  return theta, costs

theta, costs = train(x_train, y_train, learning_rate=0.0001)

def predict(x, theta):
  return (sigmoid(x.dot(theta)) >= 0.5).astype(int)

# Compare the predictions with the true labels

def get_accuracy(x, y, theta):
  y_pred = predict(x, theta)
  return (y_pred == y).sum() / y.shape[0]

print(f'Accuracy on the training set: {get_accuracy(x_train, y_train, theta)}')

print(f'Accuracy on the test set: {get_accuracy(x_test, y_test, theta)}')

### Implement the Logistic regression using sklearn

Because the class distribution is imbalanced, add class weights to the Logistic Regression and compare the results with the unweighted model.

* Find the accuracy with class weightage in Logistic regression
* Find the accuracy without class weightage in Logistic regression
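To see what the class weightage does, the sketch below reproduces the weights that scikit-learn derives for `class_weight='balanced'`, which are `n_samples / (n_classes * count_of_class)`. The label vector `y_toy` is hypothetical and only serves to show the arithmetic.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 9 non-defaulters and 1 defaulter
y_toy = np.array([0] * 9 + [1])

# Manual calculation: n_samples / (n_classes * count of each class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

# The same weights as computed by scikit-learn
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)

print(manual)
print(weights)
```

The rare class receives the larger weight, so errors on defaulters cost more during fitting. This typically raises recall on the minority class and can lower overall accuracy.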

In [None]:
# Standardize the features, fitting the scaler on the training set only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)


In [None]:
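# With weightage
weighted_lr = LogisticRegression(class_weight='balanced', random_state=123, max_iter=100)
weighted_lr.fit(x_train, y_train)
y_pred = weighted_lr.predict(X_test)
# Accuracy on the test and training sets
weighted_lr.score(X_test, y_test), weighted_lr.score(x_train, y_train)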

In [None]:
# Without weightage
log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)
y_pred = log_reg.predict(X_test)
log_reg.score(X_test,y_test), log_reg.score(x_train, y_train)

### Credit scoring

When scaling the model into a scorecard, we need both the Logistic Regression coefficients from model fitting and the transformed WoE values. We also need to convert the model output from log-odds units to a points system.
The total score sums the contribution of each independent variable $X_i$:

$Score = \sum_{i=1}^{n} \left( -\left( \beta_i \times WoE_i + \frac{\alpha}{n} \right) \times Factor + \frac{Offset}{n} \right)$

Where:

βi: logistic regression coefficient for the variable Xi

α: logistic regression intercept

WoEi: Weight of Evidence value for the variable Xi

n: number of independent variables Xi in the model

Factor, Offset: scaling parameters

  - Factor = pdo / ln(2), where pdo is the number of points needed to double the odds
  - Offset = Target_Score - Factor * ln(Target_Odds), where Target_Score is the score assigned at the chosen Target_Odds
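The scaling can be checked on a small worked example. The sketch below uses the same scaling choices as the scaling cell (20 points to double the odds, 600 points at odds of 50), while the coefficients, WoE values and intercept are made-up numbers for a hypothetical two-variable model. It assumes the model output is the log-odds of default, so the minus sign in the formula makes a higher score mean lower risk.

```python
import numpy as np

# Scaling choices mirroring the scaling cell
pdo, target_score, target_odds = 20, 600, 50
factor = pdo / np.log(2)
offset = target_score - factor * np.log(target_odds)

# Hypothetical model with n = 2 variables (made-up numbers)
coef = np.array([0.5, -0.3])   # beta_i
woe = np.array([0.2, -0.1])    # WoE_i for one applicant
intercept = -2.0               # alpha
n = len(coef)

# Points per variable, then the total score
points = -(coef * woe + intercept / n) * factor + offset / n
score = points.sum()

# Equivalent closed form: offset - factor * (alpha + sum(beta_i * WoE_i))
log_odds = intercept + np.dot(coef, woe)
closed_form = offset - factor * log_odds

def score_at(odds):
    """Score assigned to a given odds value on this scale."""
    return offset + factor * np.log(odds)

print(points, score, closed_form)
print(score_at(50), score_at(100))
```

Splitting the intercept and the offset evenly across the n variables only affects how the points are attributed to each variable; the total is the same as in the closed form. Doubling the odds from 50 to 100 moves the score by pdo points.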

In [None]:
# Scaling factors
coef = log_reg.coef_.ravel()
intercept = log_reg.intercept_
factor = 20/np.log(2)
offset = 600 - ( factor * np.log(50))
factor, offset

In [None]:
# 1st method
# all_scores = []
# for idx,row in X.iterrows():
#   score  = []
#   for j in range(len(row)):
#     asum = (-((row[j] * coef[j]) + (intercept/X.shape[1])) * factor) + (offset/X.shape[1])
#     score.append(asum)
#   all_scores.append(sum(score))
# max(all_scores), min(all_scores)

In [None]:
# 2nd method
# log_reg was fit on standardized inputs, so apply the same scaler before scoring
n_features = X.shape[1]
X_scaled = scaler.transform(X)
all_scores = []
for row in X_scaled:
  a = row * coef                   # B_i * (standardized) WOE_i
  a = a + (intercept/n_features)   # (B_i * WOE_i) + intercept / n
  b = -a * factor                  # -((B_i * WOE_i) + intercept / n) * factor
  b = b + (offset/n_features)      # -((B_i * WOE_i) + intercept / n) * factor + offset / n
  all_scores.append(sum(b))        # sum over the variables

In [None]:
max(all_scores),min(all_scores)

In [None]:
np.array(all_scores)

### Performance Metrics

#### Precision

In [None]:
from sklearn.metrics import precision_score
precision_score(y_test, y_pred ,average='macro') 

#### Recall

In [None]:
from sklearn.metrics import recall_score
recall_score(y_test, y_pred,average='macro') 

#### Classification Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

#### Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
mat
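To connect the matrix to the scores above, the sketch below derives precision and recall from the four confusion-matrix cells on hypothetical labels. It also shows how the positive-class precision differs from the macro average used in the earlier cells, which averages the precision of both classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = default)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Positive-class metrics computed by hand
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Precision of the negative class, needed for the macro average
precision_neg = tn / (tn + fn)
macro_precision = (precision_pos + precision_neg) / 2

print(tn, fp, fn, tp)
print(precision_pos, recall_pos, macro_precision)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
print(precision_score(y_true, y_hat, average='macro'))
```

On imbalanced data the positive-class figures are usually the ones of interest for defaulters, since the macro average can be dominated by how well the majority class is predicted.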