# Bachelor's thesis: CPLE-LDA model

### The corresponding paper is: Loog, Marco. "Contrastive pessimistic likelihood estimation for semi-supervised classification." IEEE transactions on pattern analysis and machine intelligence 38.3 (2015): 462-475.

### This file prepares the data for the CPLE-LDA model and computes the metrics from the model's results. The model itself was implemented in a separate R file, "CPLE-LDA.r", using the package "RSSL".

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, brier_score_loss

## Loading Data

In [3]:
accepts = pd.read_csv('../data/New_accepts.csv', encoding="ISO-8859-1", low_memory=False)
rejects = pd.read_csv('../data/New_rejects.csv', encoding="ISO-8859-1", low_memory=False)

In [4]:
accepts_data = accepts.copy()
rejects_data = rejects.copy()

In [5]:
y_acc = accepts_data["loan_status"]
y_rej = rejects_data["loan_status"]

## Preprocessing

In [6]:
# Fit the scaler on the accepted loans only, then apply the same transform to both sets
num_cols = ['loan_amnt', 'emp_length', 'dti', 'fico']
scaler = StandardScaler()
scaler.fit(accepts_data[num_cols])
accepts_data[num_cols] = scaler.transform(accepts_data[num_cols])
rejects_data[num_cols] = scaler.transform(rejects_data[num_cols])
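
The scaling step above fits on the accepted loans and reuses the same statistics for the rejected loans. A minimal NumPy sketch of what that implies, using made-up toy values rather than the thesis data: the accepted column ends up with mean 0 and standard deviation 1, while the rejected column generally does not.

```python
import numpy as np

# Toy stand-ins for a single feature column; the values are made up.
accepts_col = np.array([600.0, 650.0, 700.0, 750.0])
rejects_col = np.array([500.0, 550.0, 675.0])

# "Fit": the statistics come from the accepted sample only.
mean = accepts_col.mean()
std = accepts_col.std()  # population std (ddof=0), which StandardScaler also uses

# "Transform": both samples are shifted and scaled with the same statistics.
accepts_scaled = (accepts_col - mean) / std
rejects_scaled = (rejects_col - mean) / std

print(accepts_scaled.mean(), accepts_scaled.std())
print(rejects_scaled.mean(), rejects_scaled.std())
```

Reusing the accepted-sample statistics keeps both sets on the same scale, which is what a model trained on one and scored on the other needs.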

In [7]:
accepts_data = pd.get_dummies(data=accepts_data, columns = ['addr_state'], drop_first = True)
rejects_data = pd.get_dummies(data=rejects_data, columns = ['addr_state'], drop_first = True)

In [8]:
accepts_data = accepts_data[['loan_amnt', 'emp_length', 'dti', 'fico','addr_state_1',"addr_state_2","addr_state_3",'loan_status']]
rejects_data = rejects_data[['loan_amnt', 'emp_length', 'dti', 'fico','addr_state_1',"addr_state_2","addr_state_3",'loan_status']]
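
Because `pd.get_dummies` is called separately on each set, the resulting dummy columns depend on which `addr_state` levels occur in that set. The sketch below is a hedged illustration on toy frames, assuming `addr_state` is coded 0 to 3 as the column names suggest; it shows one way to align the rejected set to the accepted set's columns with `reindex`. The names `accepts_toy`, `rejects_toy`, and `rej_aligned` are hypothetical.

```python
import pandas as pd

# Toy frames; level 2 is deliberately absent from the rejected set.
accepts_toy = pd.DataFrame({"addr_state": [0, 1, 2, 3]})
rejects_toy = pd.DataFrame({"addr_state": [0, 1, 1, 3]})

acc_d = pd.get_dummies(accepts_toy, columns=["addr_state"], drop_first=True)
rej_d = pd.get_dummies(rejects_toy, columns=["addr_state"], drop_first=True)

# Give the rejected set exactly the accepted set's columns;
# levels it never saw become all-zero columns.
rej_aligned = rej_d.reindex(columns=acc_d.columns, fill_value=0)

print(list(rej_d.columns))
print(list(rej_aligned.columns))
```

If every level occurs in both files, the alignment is a no-op and the fixed column selection above works unchanged.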

In [9]:
accepts_data.to_csv('../data/CPLE_accepts.csv', encoding='utf-8', index=False)
rejects_data.to_csv('../data/CPLE_rejects.csv', encoding='utf-8', index=False)

## Results

### First, load the results. They are already available, but you can rerun the implementation "CPLE-LDA.R" to regenerate them.

In [10]:
CPLE_result = pd.read_csv('../data/CPLE_result.csv',encoding = "ISO-8859-1", low_memory=False)
# Drop the first column of the results file
CPLE_result = CPLE_result.drop(columns=CPLE_result.columns[0])

In [11]:
yhat = CPLE_result["V2"]

In [12]:
auc = roc_auc_score(y_rej,yhat)
print("AUC:", auc)

AUC: 0.5543551061704693
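
To help read the AUC value: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half, so 0.5 corresponds to random ranking. The sketch below checks this on toy data; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, s_toy)
print(auc_toy)  # 3 of the 4 pairs are ordered correctly -> 0.75
```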


In [13]:
brier = brier_score_loss(y_rej, yhat,pos_label=1)
print("Brier score:", brier)

Brier score: 0.603776847522558
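
The Brier score is the mean squared difference between the predicted probability of the positive class and the observed 0/1 outcome, so lower is better. As a reference point, a constant prediction of 0.5 always scores 0.25. The sketch below reproduces both on toy data; `brier` is a hypothetical helper, and the inputs are made up.

```python
import numpy as np

def brier(y_true, prob_pos):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    prob_pos = np.asarray(prob_pos, dtype=float)
    return np.mean((prob_pos - y_true) ** 2)

y_toy = [0, 1, 1, 0]
p_toy = [0.1, 0.9, 0.8, 0.3]

brier_toy = brier(y_toy, p_toy)         # (0.01 + 0.01 + 0.04 + 0.09) / 4
brier_const = brier(y_toy, [0.5] * 4)   # constant 0.5 prediction
print(brier_toy, brier_const)
```

The formula assumes the score column holds the probability of the class coded as 1; whether that is true for the column read from the results file is an assumption to verify against the R script.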


In [14]:
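# KS statistic: the largest gap between the true positive rate and the
# false positive rate over all thresholds of the ROC curve
fpr, tpr, thresholds = roc_curve(y_rej, yhat, pos_label=1)
ks_statistic = max(tpr - fpr)
print("KS-Statistic:", ks_statistic)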

KS-Statistic: 0.08001999573972729
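
The KS statistic computed above is the maximum of `tpr - fpr` over all thresholds, which measures how far apart the score distributions of the two classes are. A minimal NumPy sketch on toy data, without `roc_curve`; `ks_from_scores` is a hypothetical helper written for illustration.

```python
import numpy as np

def ks_from_scores(y_true, scores):
    """Largest gap between TPR and FPR over all observed score thresholds."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    thresholds = np.unique(scores)
    # Fraction of each class with a score at or above the threshold.
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return np.max(tpr - fpr)

y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
ks_toy = ks_from_scores(y_toy, s_toy)
print(ks_toy)  # 0.5
```

A value near 0 means the two score distributions overlap almost completely, and a value of 1 means they are perfectly separated.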
