# Predicting Credit Ratings

This notebook contains my very first Data Science / Machine Learning project. Its aim is to learn about different ML algorithms and apply them to the German Credit data set to predict creditworthiness.

### 1. Importing Packages & Data

First off, I'll import the packages I'll need in this project. These include:
- pandas: using dataframes makes manipulating the data easier
- numpy: for matrix maths
- scikit-learn: for the distance metrics used to compare records
- pytorch:
- matplotlib: for data visualisations and graphs.

In [28]:
import pandas as pd
import numpy as np
import sklearn as skl
import torch as pyt
import matplotlib.pyplot as plt

Next I'll import the data set as a pandas DataFrame. 

I am using the German Credit data set from the UCI Machine Learning repository:

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 

In [125]:
german_credit_data_original = pd.read_csv(r"C:\Users\nicholst\OneDrive - Pension Services Corporation Ltd\Projects\Credit Risk ML\German Credit Data\german.csv")
print(german_credit_data_original)

    AccBalance  Duration CreditHistory Purpose  Amount Savings Employment  \
0          A11         6           A34     A43    1169     A65        A75   
1          A12        48           A32     A43    5951     A61        A73   
2          A14        12           A34     A46    2096     A61        A74   
3          A11        42           A32     A42    7882     A61        A74   
4          A11        24           A33     A40    4870     A61        A73   
..         ...       ...           ...     ...     ...     ...        ...   
995        A14        12           A32     A42    1736     A61        A74   
996        A11        30           A32     A41    3857     A61        A73   
997        A14        12           A32     A43     804     A61        A75   
998        A11        45           A32     A43    1845     A61        A73   
999        A12        45           A34     A41    4576     A62        A71   

     Rate  Status Sex  ... Property  Age OtherPlans  Housing ExistingCredit

### 2. K-Means Clustering  

First I'll try to determine Credit Rating (Good or Bad) from a number of features in the data by looking at the records that lie closest to each other. Although I refer to this as K-Means throughout, the method below classifies each record from the ratings of its k nearest neighbours, which is strictly the k-nearest-neighbours (KNN) algorithm rather than K-Means clustering.

As you can see from the above DataFrame, the data is of mixed type, i.e. it contains continuous data in some columns (e.g. Duration) and categorical data in the rest (e.g. Sex). This means that a standard distance-based algorithm built solely for continuous data will not work in this instance, because the Euclidean distance metric (i.e. the straight-line distance in N dimensions) cannot be calculated when any of the dimensions is not continuous.

However, I can use an alternative method: Gower's distance. This uses the "Manhattan distance" to measure the dissimilarity of the continuous features and the "Dice distance" for the categorical features.
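
To make the idea concrete, here is a minimal sketch of Gower's distance between two records. The helper name `gower_distance`, the two made-up applicants and the ranges passed in are all assumptions for illustration; this is not the code used in the rest of the notebook.

```python
def gower_distance(a, b, numeric_ranges):
    """Unweighted Gower distance between two records given as dicts.

    numeric_ranges maps each numeric feature name to its range (max - min)
    over the whole data set; every other feature is treated as categorical.
    """
    total = 0.0
    for feature in a:
        if feature in numeric_ranges:
            # Range-normalised absolute (Manhattan) difference, in [0, 1]
            total += abs(a[feature] - b[feature]) / max(numeric_ranges[feature], 1)
        else:
            # Categorical: 0 if the categories match, 1 otherwise
            total += 0.0 if a[feature] == b[feature] else 1.0
    return total / len(a)


# Two made-up applicants (illustrative values only)
rec_a = {"Age": 30, "Duration": 12, "Purpose": "A43"}
rec_b = {"Age": 44, "Duration": 36, "Purpose": "A40"}
ranges = {"Age": 56, "Duration": 68}  # assumed ranges for this sketch

print(gower_distance(rec_a, rec_b, ranges))
```

Each feature contributes a value between 0 and 1, so the average is also between 0 and 1, with 0 meaning the two records are identical on every feature.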

First let's decide on which features I'll use. I'll start by using a set of the 10 features I believe to be the most relevant: Age,
AccBalance, CreditHistory, Purpose, Employment, Savings, ExistingCredits, Property, Duration and Rate.

I'll cut down the DataFrame to only include these features. 

In [126]:
feature_list = ['Age','AccBalance','CreditHistory','Purpose','Employment','Savings','ExistingCredits','Property','Duration','Rate']
german_credit_data = german_credit_data_original[feature_list]
print(german_credit_data)

     Age AccBalance CreditHistory Purpose Employment Savings  ExistingCredits  \
0     67        A11           A34     A43        A75     A65                2   
1     22        A12           A32     A43        A73     A61                1   
2     49        A14           A34     A46        A74     A61                1   
3     45        A11           A32     A42        A74     A61                1   
4     53        A11           A33     A40        A73     A61                2   
..   ...        ...           ...     ...        ...     ...              ...   
995   31        A14           A32     A42        A74     A61                1   
996   40        A11           A32     A41        A73     A61                1   
997   38        A14           A32     A43        A75     A61                1   
998   23        A11           A32     A43        A73     A61                1   
999   27        A12           A34     A41        A71     A62                1   

    Property  Duration  Rat

Now that I have my cut-down features DataFrame, I will apply the Manhattan distance to all of the numeric features.

In [127]:
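try:
    from sklearn.metrics import DistanceMetric  # newer scikit-learn releases
except ImportError:
    from sklearn.neighbors import DistanceMetric  # older scikit-learn releases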

# Get the dtype for each column
data_types = german_credit_data.dtypes

# Set up dictionary to save distances
s = {}

# For each column in the df calculate manhattan distance if numeric
for col in german_credit_data:
    if data_types[col] != "object":

        # Calculating matrix of pairwise differences
        s[col] = DistanceMetric.get_metric('manhattan').pairwise(german_credit_data[[col]])

        # Normalising the matrix 
        s[col] = s[col]/max(np.ptp(german_credit_data[[col]]),1)

# Print an example
print("Age")
print(s['Age'])

Age
[[0.         0.80357143 0.32142857 ... 0.51785714 0.78571429 0.71428571]
 [0.80357143 0.         0.48214286 ... 0.28571429 0.01785714 0.08928571]
 [0.32142857 0.48214286 0.         ... 0.19642857 0.46428571 0.39285714]
 ...
 [0.51785714 0.28571429 0.19642857 ... 0.         0.26785714 0.19642857]
 [0.78571429 0.01785714 0.46428571 ... 0.26785714 0.         0.07142857]
 [0.71428571 0.08928571 0.39285714 ... 0.19642857 0.07142857 0.        ]]
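
For a single numeric column, the same range-normalised matrix can also be built with plain NumPy broadcasting. This is a minimal sketch using only the first five ages from the table above, so the range here is that of the five values rather than of the full column; the variable names are my own.

```python
import numpy as np

# First five ages from the table above
ages = np.array([67, 22, 49, 45, 53], dtype=float)

# Pairwise absolute differences via broadcasting: shape (5, 5)
abs_diff = np.abs(ages[:, None] - ages[None, :])

# Normalise by the range of the values (guarding against a zero range)
value_range = max(np.ptp(ages), 1)
age_dist = abs_diff / value_range

print(age_dist.round(3))
```

In one dimension the Manhattan distance is just the absolute difference, which is why no distance-metric class is needed for this step.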


The remaining features are all categorical and as such I will now apply the Dice distance to them.

In [128]:
# For each column in the df calculate dice distance if non-numeric
for col in german_credit_data:
    if data_types[col] == "object":

        # Create a dummy data column
        dummy_df = pd.get_dummies(german_credit_data[[col]])

        # Calculating matrix of pairwise differences
        s[col] = DistanceMetric.get_metric('dice').pairwise(dummy_df)

        # Dice distances already lie in [0, 1], so no further normalisation is needed

# Print an example
print("AccBalance")
print(s["AccBalance"])

AccBalance
[[0. 1. 1. ... 1. 0. 1.]
 [1. 0. 1. ... 1. 1. 0.]
 [1. 1. 0. ... 0. 1. 1.]
 ...
 [1. 1. 0. ... 0. 1. 1.]
 [0. 1. 1. ... 1. 0. 1.]
 [1. 0. 1. ... 1. 1. 0.]]
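
Because every one-hot row contains exactly one 1, the Dice distance between two rows is 0 when the categories match and 1 when they differ. A minimal sketch of that equivalence, using the first five AccBalance values from the table above (the variable names are my own):

```python
import numpy as np
import pandas as pd

# First five AccBalance values from the table above
acc_balance = pd.Series(["A11", "A12", "A14", "A11", "A11"])

# Integer-encode the categories, then compare every pair of rows
codes = acc_balance.astype("category").cat.codes.to_numpy()
mismatch = (codes[:, None] != codes[None, :]).astype(float)

print(mismatch)
```

This gives the same kind of 0/1 matrix as the output above without building the dummy columns first.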


Gower's distance can then be calculated using the formula below:

d = Sum(si*wi)/Sum(wi)

Where: 
- si are the distances for each feature
- wi are the weights

We will use weights of 1 for all features to begin with. 
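
As a minimal sketch of the formula, the hypothetical helper `combine_gower` below takes a dictionary of per-feature distance matrices and optional weights. The toy 2x2 matrices and the weights are made up for illustration.

```python
import numpy as np


def combine_gower(feature_dists, weights=None):
    """Weighted average of per-feature distance matrices.

    feature_dists: dict of feature name -> (n, n) distance matrix
    weights: dict of feature name -> non-negative weight (defaults to 1 each)
    """
    if weights is None:
        weights = {name: 1.0 for name in feature_dists}
    total_weight = sum(weights[name] for name in feature_dists)
    weighted_sum = sum(weights[name] * dist for name, dist in feature_dists.items())
    return weighted_sum / total_weight


# Toy per-feature distances between two records (made-up values)
toy = {
    "Age": np.array([[0.0, 0.5], [0.5, 0.0]]),
    "Purpose": np.array([[0.0, 1.0], [1.0, 0.0]]),
}

print(combine_gower(toy))                                # equal weights
print(combine_gower(toy, {"Age": 1.0, "Purpose": 3.0}))  # Purpose counts three times as much
```

With all weights equal to 1 the formula reduces to the simple mean of the per-feature matrices.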


In [149]:
# Unweighted Gower distance: the mean of the per-feature distance matrices
gowers = sum(s[col] for col in german_credit_data) / len(german_credit_data.columns)
print(gowers)

[[0.         0.64212185 0.54096639 ... 0.49394258 0.5692577  0.69544818]
 [0.64212185 0.         0.50115546 ... 0.44817927 0.27286415 0.54667367]
 [0.54096639 0.50115546 0.         ... 0.48630952 0.66162465 0.62114846]
 ...
 [0.49394258 0.44817927 0.48630952 ... 0.         0.37531513 0.6015056 ]
 [0.5692577  0.27286415 0.66162465 ... 0.37531513 0.         0.64047619]
 [0.69544818 0.54667367 0.62114846 ... 0.6015056  0.64047619 0.        ]]


We now have our full matrix of Gower distances between all of the 1000 records in the data set. 

Let's take the first record and find its 10 nearest neighbours. From these we can get a first-guess probability of the person having a credit rating of "Good".


In [160]:
# Get just first row of distances
distances = list(gowers[0])

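# Create a list of row nums and map each distance back to its row
# (rows with exactly the same distance share one key, so ties collapse to a single row)
nums = [i+1 for i in range(1000)]
dist_map = dict(zip(distances,nums))

# Get IDs of the 10 nearest neighbours
# (index 0 of the sorted list is skipped: it is the record's own distance of zero)
distances.sort()
IDs = []
for i in range(0,10):
    dist = distances[i+1]
    ID = dist_map[dist]-1
    IDs += [ID]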
print("IDs of the 10 nearest neighbours by Gower Distance:")
print(IDs)
print("")

# For each of these 10 records, get the Credit Rating
data_neighbours = german_credit_data_original.iloc[IDs]["CreditRating"]

# Count number of 1's (1 = Good); .get returns 0 if no neighbour is Good
good_count = data_neighbours.value_counts().get(1, 0)
prob = good_count / len(data_neighbours)
print("Probability of Good Credit: "+str(prob*100)+"%")

# Get the actual Credit Rating
Actual = german_credit_data_original.iloc[0]["CreditRating"]
if Actual == 1:
    Actual = "Good"
else:
    Actual = "Bad"
print(f"Actual Credit Rating: {Actual}")



IDs of the 10 nearest neighbours by Gower Distance:
[61, 135, 807, 179, 177, 115, 823, 71, 955, 16]

Probability of Good Credit: 100.0%
Actual Credit Rating: Good
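
One caveat with mapping distances back to row numbers through a dictionary is that two records at exactly the same distance share a key, so only one of them can be recovered. Below is a minimal sketch of an alternative that sorts row positions directly. The helper `nearest_neighbours` and the toy 4x4 matrix are assumptions for illustration, not part of the notebook's results.

```python
import numpy as np


def nearest_neighbours(dist_matrix, row, k):
    """Return the positions of the k nearest records to `row`, excluding itself."""
    # A stable sort keeps tied distances in their original row order
    order = np.argsort(dist_matrix[row], kind="stable")
    order = order[order != row]  # drop the record itself
    return order[:k]


# Toy symmetric distance matrix; rows 1 and 2 are tied at 0.2 from row 0
toy = np.array([
    [0.0, 0.2, 0.2, 0.7],
    [0.2, 0.0, 0.5, 0.3],
    [0.2, 0.5, 0.0, 0.4],
    [0.7, 0.3, 0.4, 0.0],
])

print(nearest_neighbours(toy, 0, 2))
```

Both tied rows are returned here, whereas a distance-keyed dictionary would return the same row twice.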


Now I'll loop over every row to get its predicted probability and its actual credit rating.

In [179]:
k = 10 
row_id = 0 
correct_count = 0
for row in gowers:

    # Get this row's distances
    distances = list(row)

    # Create a list of row nums
    nums = [i+1 for i in range(1000)]
    dist_map = dict(zip(distances,nums))

    # Get IDs of the k nearest neighbours (skipping the record itself)
    distances.sort()
    IDs = []
    for i in range(0,k):
        dist = distances[i+1]
        ID = dist_map[dist]-1
        IDs += [ID]

    # For each of these k records, get the Credit Rating
    data_neighbours = german_credit_data_original.iloc[IDs]["CreditRating"]

    # Count number of 1's (1 = Good); .get returns 0 if no neighbour is Good
    good_count = data_neighbours.value_counts().get(1, 0)
    prob = good_count / len(data_neighbours)

    # Get the actual Credit Rating
    Actual = german_credit_data_original.iloc[row_id]["CreditRating"]
    if Actual == 1:
        Actual = "Good"
    else:
        Actual = "Bad"

    # Determine if estimation is correct
    if prob > 0.5:
        if Actual == "Good":
            correct_count += 1
    else:
        if Actual == "Bad":
            correct_count += 1

    row_id += 1

# Display the results
print(f"Of 1000 records {correct_count} were predicted correctly and {1000-correct_count} were not")
print(f"This means the algorithm is {correct_count*100/1000}% accurate when:")
print(f"k = {k}")
print(f"feature list = {feature_list}")


Of 1000 records 703 were predicted correctly and 297 were not
This means the algorithm is 70.3% accurate when:
k = 10
feature list = ['Age', 'AccBalance', 'CreditHistory', 'Purpose', 'Employment', 'Savings', 'ExistingCredits', 'Property', 'Duration', 'Rate']


With an accuracy of 70.3%, this algorithm is barely more accurate than simply guessing that every record has a "Good" credit rating (as these make up 700 out of 1000 records). This is far from ideal.
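
The "always guess Good" baseline can be computed directly from the labels. This is a minimal sketch on synthetic labels with the 700 / 300 split mentioned above; coding "Bad" as 2 is an assumption, since the notebook only checks whether the rating equals 1.

```python
import pandas as pd

# Synthetic labels with a 700 / 300 split (1 = Good; 2 = Bad is assumed)
labels = pd.Series([1] * 700 + [2] * 300)

# Accuracy obtained by always predicting the most common class
majority_class = labels.mode()[0]
baseline_accuracy = (labels == majority_class).mean()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.1%}")
```

Any model worth keeping should beat this figure by a clear margin.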

I'll now wrap this in a function and call it for a range of values of k, to see if we can achieve a higher accuracy.

In [181]:
def test_k_means(k,gowers_matrix):

    row_id = 0 
    correct_count = 0
    for row in gowers_matrix:

        # Get this row's distances
        distances = list(row)

        # Create a list of row nums
        nums = [i+1 for i in range(1000)]
        dist_map = dict(zip(distances,nums))

        # Get IDs of the k nearest neighbours (skipping the record itself)
        distances.sort()
        IDs = []
        for i in range(0,k):
            dist = distances[i+1]
            ID = dist_map[dist]-1
            IDs += [ID]

        # For each of these k records, get the Credit Rating
        data_neighbours = german_credit_data_original.iloc[IDs]["CreditRating"]

        # Count number of 1's (1 = Good); .get returns 0 if no neighbour is Good
        good_count = data_neighbours.value_counts().get(1, 0)
        prob = good_count / len(data_neighbours)

        # Get the actual Credit Rating
        Actual = german_credit_data_original.iloc[row_id]["CreditRating"]
        if Actual == 1:
            Actual = "Good"
        else:
            Actual = "Bad"

        # Determine if estimation is correct
        if prob > 0.5:
            if Actual == "Good":
                correct_count += 1
        else:
            if Actual == "Bad":
                correct_count += 1

        row_id += 1

    # Display the results
    print(f"Of 1000 records {correct_count} were predicted correctly and {1000-correct_count} were not")
    print(f"This means the algorithm is {correct_count*100/1000}% accurate when:")
    print(f"k = {k}")
    print(f"feature list = {feature_list}")
    print("")

test_k_means(5,gowers)
test_k_means(10,gowers)
test_k_means(15,gowers)
test_k_means(25,gowers)
test_k_means(50,gowers)
test_k_means(100,gowers)
test_k_means(999,gowers)

Of 1000 records 707 were predicted correctly and 293 were not
This means the algorithm is 70.7% accurate when:
k = 5
feature list = ['Age', 'AccBalance', 'CreditHistory', 'Purpose', 'Employment', 'Savings', 'ExistingCredits', 'Property', 'Duration', 'Rate']

Of 1000 records 703 were predicted correctly and 297 were not
This means the algorithm is 70.3% accurate when:
k = 10
feature list = ['Age', 'AccBalance', 'CreditHistory', 'Purpose', 'Employment', 'Savings', 'ExistingCredits', 'Property', 'Duration', 'Rate']

Of 1000 records 735 were predicted correctly and 265 were not
This means the algorithm is 73.5% accurate when:
k = 15
feature list = ['Age', 'AccBalance', 'CreditHistory', 'Purpose', 'Employment', 'Savings', 'ExistingCredits', 'Property', 'Duration', 'Rate']

Of 1000 records 746 were predicted correctly and 254 were not
This means the algorithm is 74.6% accurate when:
k = 25
feature list = ['Age', 'AccBalance', 'CreditHistory', 'Purpose', 'Employment', 'Savings', 'ExistingCred

The highest accuracy achieved was 74.6% when k = 25, which is still not great...

I have two avenues to explore in order to try to improve this accuracy:

   1. Change the features in the feature list.  
   2. Adjust the weighting applied to the various features when calculating the distance metric. 

I'll now experiment by varying the features in the selected feature list, to see whether this generates any better results. It may be that we are missing some features that are key for determining Credit Ratings.
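
One way such an experiment could be organised is sketched below. It scores every subset of a few features with a leave-one-out nearest-neighbour vote. The helper `loo_knn_accuracy`, the feature names `f1` to `f3` and the random data are all stand-ins for illustration, so the printed accuracies say nothing about the real data set.

```python
from itertools import combinations

import numpy as np


def loo_knn_accuracy(dist, labels, k):
    """Leave-one-out accuracy of a majority-vote k-nearest-neighbour classifier."""
    dist = dist.copy()
    np.fill_diagonal(dist, np.inf)  # a record is never its own neighbour
    neighbours = np.argsort(dist, axis=1, kind="stable")[:, :k]
    good_share = (labels[neighbours] == 1).mean(axis=1)
    predicted = np.where(good_share > 0.5, 1, 2)  # same > 0.5 rule as above
    return (predicted == labels).mean()


# Synthetic stand-ins for the labels and per-feature distance matrices
rng = np.random.default_rng(0)
n = 60
labels = np.where(rng.random(n) < 0.7, 1, 2)  # 1 = Good, 2 = Bad (assumed coding)
feature_dists = {}
for name in ["f1", "f2", "f3"]:
    values = rng.random(n)
    feature_dists[name] = np.abs(values[:, None] - values[None, :])

# Score every subset of two or three features with an unweighted Gower distance
results = {}
for size in (2, 3):
    for subset in combinations(feature_dists, size):
        gower = sum(feature_dists[f] for f in subset) / len(subset)
        results[subset] = loo_knn_accuracy(gower, labels, k=5)

for subset, acc in sorted(results.items(), key=lambda item: -item[1]):
    print(subset, f"{acc:.3f}")
```

With many features the number of subsets grows quickly, so in practice the search would need to be restricted, for example to subsets of a fixed size.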