# Credit Book Valuation Analysis

## Exercise 2

#### Done by: Darren Ramsook, Date: 6th Nov., 2019

Works on Python 3.6 and above.

In [None]:
# Libraries used for this task:
import pandas as pd #Pandas for Data Manipulation
import seaborn as sns #Seaborn for Data Plotting
import matplotlib.pyplot as plt #Matplotlib for graphics rendering
from sklearn.preprocessing import OrdinalEncoder #Encoding categorical variables
from sklearn.model_selection import train_test_split # Splitting dataset
from sklearn.linear_model import SGDClassifier #ML Algorithm
from sklearn.metrics import accuracy_score #Accuracy Measure
from sklearn.metrics import f1_score #F1 Measure
import numpy as np #Computing Library

The credit book valuation data is located in the "Data/" directory. The data is loaded below, the column names and their inferred data types are listed, and the top 10 rows are shown for inspection:

In [None]:
creditBookDF = pd.read_csv("Data/exercise2_data.csv")
print("Column Titles and Datatypes:\n"+str(creditBookDF.dtypes))
print("\n\nTop 10 rows of data:\n")
creditBookDF.head(10)

### Initial Observation

From an initial observation, the data has the following columns:
* account_no (categorical)
* gender (categorical)
* age (numerical-continuous)
* income (numerical-continuous)
* term (numerical-continuous)
* installment_amount (numerical-continuous)
* interest_rate (numerical-continuous)
* credit_score_at_application (numerical-continuous)
* outstanding_balance (numerical-continuous)
* status (categorical)

Using this data we can now determine relationships using visual methods.

We can analyze the data further with a pair plot, which shows the pairwise relationships between all numerical variables along with a histogram of each variable on the diagonal.

In [None]:
pairplotCreditDF = sns.pairplot(creditBookDF)
plt.show()

From the pairplot, we can see some interesting trends, for example (and most notable) that the relationship between credit_score_at_application and interest_rate follows something resembling an **INVERSE SIGMOID** relationship.

The term variable is also bunched together at regular intervals when plotted against every other variable. This indicates that term may actually be categorical rather than continuous, as first assumed.
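
One quick way to back up this visual impression is to count the distinct values in the column. The snippet below is a minimal sketch on a toy DataFrame with invented values; `looks_categorical` and its `max_levels` threshold are hypothetical helpers, not part of the original analysis. On the real data the same check would be applied to `creditBookDF["term"]`.

```python
import pandas as pd

def looks_categorical(series, max_levels=10):
    """Heuristic: a numeric column with very few distinct values may be categorical."""
    return series.nunique() <= max_levels

# Toy stand-in for creditBookDF (values are invented for illustration only)
toyDF = pd.DataFrame({
    "term": [12, 24, 36, 12, 24, 36, 12, 24],
    "income": [1200.5, 3400.0, 2100.75, 980.0, 4500.2, 1750.0, 2600.9, 3100.4],
})

# List the distinct levels and apply the heuristic to both columns
termLevels = sorted(toyDF["term"].unique().tolist())
print("Distinct term values:", termLevels)
print("term looks categorical:", looks_categorical(toyDF["term"], max_levels=5))
print("income looks categorical:", looks_categorical(toyDF["income"], max_levels=5))
```

The threshold is arbitrary; the point is only that a handful of repeated values is a hint that a column encodes categories.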

The distribution of each variable can be seen in the diagonal panels, where a variable is plotted against itself. For example, age, income, loan_amount and installment_amount appear to follow a **GAMMA DISTRIBUTION**, while interest_rate appears to follow a **NORMAL DISTRIBUTION**.
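
A gamma distribution is right-skewed while a normal distribution is symmetric, so sample skewness gives a rough numeric check to complement the visual reading. The sketch below uses synthetic samples (the column names and parameters are assumptions chosen for illustration); on the real data, `creditBookDF[["age", "income", "interest_rate"]].skew()` would be the analogous call.

```python
import numpy as np
import pandas as pd

# Synthetic samples only: one right-skewed (gamma) and one symmetric (normal)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "gamma_like": rng.gamma(shape=2.0, scale=1000.0, size=5000),
    "normal_like": rng.normal(loc=0.2, scale=0.05, size=5000),
})

# Skewness well above 0 suggests a right-skewed shape; near 0 suggests symmetry
skewness = toy.skew()
print(skewness)
```

Skewness alone does not identify a distribution family; it only indicates whether a symmetric model is plausible.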

The single color palette and the lack of distinction between data points make the data hard to interpret. Coloring the points by status addresses this:

In [None]:
pairplotCDFCategory = sns.pairplot(creditBookDF, hue="status")
plt.show()

# Warning will show if on python 3.6, ignore 

Decision boundaries between the status classes can be identified visually for several variable pairs, such as credit_score_at_application and interest_rate.


To get a better idea of the distributions we can perform **Kernel Density Estimation** on the data and also try to fit linear relationships between the data. This is done as shown below:

In [None]:
pairplotCDFCatKDE = sns.pairplot(creditBookDF, hue="status", kind = "reg",diag_kind="kde")
plt.show()

# Warning will show if on python 3.6, ignore 
# This will take some time to run as it is running KDE as well as fitting a linear reg on all variables

The relationships between some variables are complex; however, non-linear relationships can still be modelled using other methods.


## Modelling Using Probabilistic Methods

The data is first split into historical and current data. The categorical data is encoded at this stage to save computation time.

The problem requires finding the probability that someone defaults, so "PAID_UP" and "LIVE" can be grouped into a single non-default category.

In [None]:
creditBookMLDF = creditBookDF.copy()

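# .copy() gives independent DataFrames, so later column assignments do not act on a slice of creditBookMLDF
historicalDF = creditBookMLDF[creditBookMLDF['status'] != "LIVE"].copy()
currentDF = creditBookMLDF[creditBookMLDF['status'] == "LIVE"].copy()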
creditBookMLDF['status'] = creditBookMLDF['status'].replace({'PAID_UP': 'X', 'LIVE': 'X'})

print("Dimensions of Data:")
print("Historical DataFrame: (#rows:"+str(historicalDF.shape[0])+" , #columns: "+str(historicalDF.shape[1])+")")
print("Current DataFrame: (#rows:"+str(currentDF.shape[0])+" , #columns: "+str(currentDF.shape[1])+")")

From this information, we can determine the following split for the data:
* 80% of the Historical Data would be utilized for training
* The remaining 20% of the Historical Data would be used as a test set
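
The split described above can be sketched on toy data as follows. The `stratify` argument is an assumption added here and is not used in the original notebook; it is one option for keeping the class proportions the same in both sets when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the historical data: 20 rows, 5 labelled 0 and 15 labelled 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 5 + [1] * 15)

# 80/20 split; stratify=y_toy preserves the 25%/75% class mix in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
print("Train rows:", len(X_tr), "Test rows:", len(X_te))
print("Share of class 1 in train:", y_tr.mean(), "and in test:", y_te.mean())
```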

The only categorical data used will be the "gender" field and the output status field.

Preparing the data:

In [None]:
historicalDF['status'] = historicalDF['status'].replace({'PAID_UP': 'X', 'LIVE': 'X'})
currentDF['status'] = currentDF['status'].replace({'PAID_UP': 'X', 'LIVE': 'X'})

# Create and fit Encoders
genEncoder = OrdinalEncoder()
statusEncoder = OrdinalEncoder()
genEncoder = genEncoder.fit(creditBookMLDF.gender.values.reshape(-1,1))
statusEncoder = statusEncoder.fit(creditBookMLDF.status.values.reshape(-1,1))

# Transform Data
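# OrdinalEncoder.transform returns a 2-D array; ravel() flattens it to one value per row
historicalDF['status'] = statusEncoder.transform(historicalDF.status.values.reshape(-1,1)).ravel()
historicalDF['gender'] = genEncoder.transform(historicalDF.gender.values.reshape(-1,1)).ravel()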

# Splitting encoded data into train test sets now
trainDFX = historicalDF.drop(['outstanding_balance','account_no','status'],axis=1)
trainDFY = historicalDF['status']
X_train, X_test, y_train, y_test = train_test_split(trainDFX,trainDFY,test_size = 0.2,random_state=1)

### Using a Probabilistic model to create predictions

A Logistic Regression classifier was trained with an L2 penalty. It outputs a probability between 0 and 1 for each class.

In [None]:
from sklearn.linear_model import LogisticRegression

lf = LogisticRegression(penalty ='l2', tol=1e-4,solver='lbfgs').fit(X_train, y_train)
y_pred_actual = lf.predict(X_test)
y_pred_probability = lf.predict_proba(X_test)

print("Mean accuracy on the training data and labels: " + str(lf.score(X_train, y_train)))
print("F1 score: " + str(f1_score(y_test, y_pred_actual)))
print("Overall accuracy: " + str(accuracy_score(y_test, y_pred_actual)))
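
Because `predict_proba` returns one column per class, ordered as in the fitted model's `classes_` attribute, it is worth checking which column corresponds to a default before using the probabilities. The sketch below uses invented one-feature data and assumes the encoding 0.0 = DEFAULT and 1.0 = non-default, which is what an alphabetical ordinal encoding of "DEFAULT" and "X" would give.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (invented): one feature; label 0.0 = DEFAULT, 1.0 = non-default (assumed encoding)
X_toy = np.array([[0.0], [0.5], [1.0], [1.5], [3.0], [3.5], [4.0], [4.5]])
y_toy = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# LogisticRegression applies an L2 penalty by default
clf = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

# One row per sample, one column per class, in the order given by clf.classes_
proba = clf.predict_proba(X_toy)
default_col = list(clf.classes_).index(0.0)
p_default = proba[:, default_col]

print("Class order:", clf.classes_)
print("Shape of predict_proba output:", proba.shape)
print("P(default) for first and last sample:", p_default[0], p_default[-1])
```

Looking up the column through `classes_` avoids hard-coding an index that would silently change if the label encoding changed.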

### Forecasting for remainder of data



In [None]:
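lfInputs = currentDF.drop(['outstanding_balance','account_no','status'],axis=1)
lfInputs['gender'] = genEncoder.transform(lfInputs.gender.values.reshape(-1,1)).ravel()
# predict_proba returns one column per class; select the column for the encoded DEFAULT label
defaultLabel = statusEncoder.transform([['DEFAULT']])[0, 0]
defaultCol = list(lf.classes_).index(defaultLabel)
currentDF['DefaultProb'] = lf.predict_proba(lfInputs)[:, defaultCol]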

### Expected Repayment amount for Loan

In [None]:
# Expected repayment = probability of not defaulting * outstanding balance
currentDF['ExpectedRepay'] = (1 - currentDF['DefaultProb'])*currentDF['outstanding_balance']
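
The calculation above weights each outstanding balance by the probability that the loan does not default, which implicitly treats a defaulted loan as repaying nothing. A small worked example with invented figures (the `toyLoans` frame is hypothetical) shows the arithmetic:

```python
import pandas as pd

# Invented balances and default probabilities, for illustration only
toyLoans = pd.DataFrame({
    "outstanding_balance": [1000.0, 2000.0, 500.0],
    "DefaultProb": [0.10, 0.50, 0.00],
})

# Expected repayment per loan: (1 - P(default)) * outstanding balance
toyLoans["ExpectedRepay"] = (1 - toyLoans["DefaultProb"]) * toyLoans["outstanding_balance"]

# Portfolio totals: expected sum and its share of the total outstanding balance
expectedTotal = toyLoans["ExpectedRepay"].sum()
recoveryRatio = expectedTotal / toyLoans["outstanding_balance"].sum()
print(toyLoans)
print("Total expected repayment:", expectedTotal)
print("Ratio to total outstanding balance:", recoveryRatio)
```

Here the three loans contribute 900, 1000 and 500 respectively, so the expected total is 2400 out of 3500 outstanding.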

In [None]:
print("Total Expected Sum: " + str(currentDF['ExpectedRepay'].sum()))

ExpectedRepaySum = currentDF['ExpectedRepay'].sum()
OutstandingBalanceSum = currentDF['outstanding_balance'].sum()
ratio = ExpectedRepaySum/OutstandingBalanceSum

paidUpcount = creditBookDF[creditBookDF['status'] == 'PAID_UP'].shape[0]
defaultcount = creditBookDF[creditBookDF['status'] == 'DEFAULT'].shape[0]
countRatio = paidUpcount/defaultcount
print("Ratio of Expected Sum to Outstanding Balance Sum: " + str(ratio))
print("Ratio of PAID_UP to DEFAULT loans: "+str(countRatio))