# Homework 2

## The task:

### This homework focuses on using and evaluating three predictive models: Linear Regression, Logistic Regression, and Random Forests. The dataset comes from a credit scoring company that wants to reduce credit repayment risk. The dataset comprises 26 columns and 840 rows, with the following features:

#####  - RiskPerformance
#####  - ExternalRiskEstimate                  
#####  - MSinceOldestTradeOpen                 
#####  - MSinceMostRecentTradeOpen              
#####  - AverageMInFile                         
#####  - NumSatisfactoryTrades                  
#####  - NumTrades90Ever2DerogPubRec            
#####  - PercentTradesNeverDelq                 
#####  - MSinceMostRecentDelq                  
#####  - MaxDelq2PublicRecLast12M                
#####  - MaxDelqEver                             
#####  - NumTotalTrades                          
#####  - NumTradesOpeninLast12M                  
#####  - PercentInstallTrades                    
#####  - MSinceMostRecentInqexcl7days          
#####  - NumInqLast6Mexcl7days                   
#####  - NetFractionRevolvingBurden              
#####  - NetFractionInstallBurden                
#####  - NumRevolvingTradesWBalance              
#####  - NumInstallTradesWBalance                
#####  - NumBank2NatlTradesWHighUtilization      
#####  - PercentTradesWBalance 
##### - DelqEver                               
##### - DelqLast12M                             
##### - PercentSatisfactoryTrades             
##### - NumTradesWBalance      


### The target feature is RiskPerformance, which has two possible outcomes: Good and Bad. The goal of this work is to find a subset of features that are correlated with the target feature, and then use this subset to create and evaluate predictive models to see whether they can accurately predict the target feature. If the models are successful, they will help the company judge which of its customers will be able to repay their credit within a two-year period.

##### The data used in this homework has been sourced from: https://community.fico.com/s/explainable-machine-learning-challenge?tabset-3158a=2

##### The data used for this homework has been cleaned as follows:

- Duplicate rows and constant columns have been dropped.
- Null and irregular cardinalities were checked for: none found.
- Rows with a value of -9 for every feature were dropped.
- Rows with -8 values were subject to imputation, except for NetFractionInstallBurden where they were set to null.
- -7 values were set to be 1.5 * (largest value) of the feature which contained the -7.
- Rows where the value of NumSatisfactoryTrades was greater than NumTotalTrades were dropped.
- MaxDelq2PublicRecLast12M had two equivalent values: 5 and 6 both meant unknown delinquency. Thus, all 5 values were replaced with 6.
- NumInqLast6M and NumInqLast6Mexcl7days both had very similar data. NumInqLast6M was dropped.
- NumTrades60Ever2DerogPubRec and NumTrades90Ever2DerogPubRec both had very similar data. NumTrades60Ever2DerogPubRec was dropped.
- Rows where NumTradesOpeninLast12M was greater than NumTotalTrades were dropped.
- Rows where NumTrades90Ever2DerogPubRec was greater than NumTotalTrades were dropped.
- Rows where MSinceMostRecentTradeOpen was greater than MSinceOldestTradeOpen were dropped.
- Rows where NumRevolvingTradesWBalance was greater than NumTotalTrades were dropped.
- Rows where NumInstallTradesWBalance was greater than NumTotalTrades were dropped.
- Rows where NumBank2NatlTradesWHighUtilization was greater than NumTotalTrades were dropped.
- Rows where NumTotalTrades was greater than MSinceMostRecentTradeOpen were dropped.
- Rows where PercentTradesNeverDelq was 100% and NumTrades60Ever2DerogPubRec had positive entries were dropped.
- Rows where PercentTradesNeverDelq was less than 100% and NumTrades60Ever2DerogPubRec was 0 were dropped.
- Rows where MSinceMostRecentDelq was equal to 0 and NumTrades60Ever2DerogPubRec was 0 were dropped.
- Outliers were examined and deemed to be fit to remain.
- MaxDelq2PublicRecLast12M was recoded to use the MaxDelqEver scale.
- The 7 and 8 values in MaxDelq2PublicRecLast12M and MaxDelqEver were combined.
- New feature created: DelqEver, which records whether an entry has ever been delinquent.
- New feature created: DelqLast12M, which records whether an entry has been delinquent during the last 12 months.
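
Two of the cleaning rules above can be sketched in pandas as follows. This is a minimal illustration on invented toy data, not the code that produced the cleaned CSV: the frame `raw`, the helper `replace_minus_seven`, and the order of the two steps are all assumptions.

```python
import pandas as pd

# Toy data standing in for the raw file; the values are invented for illustration.
raw = pd.DataFrame({
    "NumSatisfactoryTrades": [10, 5, 12, 3],
    "NumTotalTrades":        [12, 4, 12, 8],
    "MSinceMostRecentDelq":  [-7, 20, 40, -7],
})

def replace_minus_seven(column: pd.Series) -> pd.Series:
    """Replace every -7 with 1.5 * the largest value found in the column."""
    fill_value = 1.5 * column.max()
    # Series.where keeps values where the condition holds and fills the rest.
    return column.where(column != -7, fill_value)

# Drop logically inconsistent rows (more satisfactory trades than total trades).
cleaned = raw[raw["NumSatisfactoryTrades"] <= raw["NumTotalTrades"]].copy()

# Assumption: the -7 replacement is applied after the inconsistent rows are removed.
cleaned["MSinceMostRecentDelq"] = replace_minus_seven(cleaned["MSinceMostRecentDelq"])

print(cleaned)
```

The other "Rows where A was greater than B were dropped" rules would follow the same boolean-mask pattern with different column names.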

### First, the prerequisite software tools are imported.

In [None]:
# Library Imports.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches
import seaborn as sns
%matplotlib inline
from patsy import dmatrices


# For shuffling the dataframe
from sklearn.utils import shuffle

# For creating the models
from sklearn.model_selection import train_test_split
from sklearn import linear_model 
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# For model evaluation
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import cross_val_predict

### The cleaned CSV is imported.

In [None]:
df = pd.read_csv('CreditRisk-H2.csv')

# Print the first 10 rows of the dataframe. 
df.head(10)

In [None]:
# Get the numbers of rows and columns in the dataframe
df.shape

There are 840 rows, and 26 columns

##### Inspect the data types of the columns within the dataframe

In [None]:
df.dtypes

##### 'MaxDelqEver', 'MaxDelq2PublicRecLast12M', 'RiskPerformance', 'DelqEver' and 'DelqLast12M' are all categorical features. 'DelqEver' and 'DelqLast12M' contain boolean values rather than continuous values. 'MaxDelqEver' and 'MaxDelq2PublicRecLast12M' are numerical, but each number is a code with a fixed meaning, so they are not continuous features. 'RiskPerformance' has the binary values 'Good' and 'Bad', and is not a continuous feature. These columns are set as categories.

In [None]:
df['MaxDelqEver'] = df['MaxDelqEver'].astype('category')
df['MaxDelq2PublicRecLast12M'] = df['MaxDelq2PublicRecLast12M'].astype('category')
df['RiskPerformance'] = df['RiskPerformance'].astype('category')
df['DelqEver'] = df['DelqEver'].astype('category')
df['DelqLast12M'] = df['DelqLast12M'].astype('category')

In [None]:
df.dtypes

##### The target feature is 'RiskPerformance', so for the predictive models it must be separated from the rest of the features. X will refer to all of the features in the dataframe excluding the target, and y will refer to the target.

In [None]:
X = df[[x for x in df.columns.values if x not in ['RiskPerformance']]]
y = df.RiskPerformance


### In order to ensure that the same data is being used each time this notebook is run, the following code is commented out as it is required to be run just once. The output is stored in 2 separate CSV files named 'training.csv' and 'test.csv' which are imported every time the notebook is run. This will ensure that every user of this notebook will have the same data in the training and test dataframes (explained below).

##### Shuffle the dataframe, and print the first 5 rows

In [None]:
#df = shuffle(df)
#df.head()

##### When training and evaluating predictive models, it is important to split the data into training data and test data. The model never sees the test data during training, which removes the possibility that a good score merely reflects the model having "learned" the data provided. Because the test data is only presented after the model has been trained, it gives a better picture of how the model performs on new, unseen data.

##### Here, the data is split 70% as training data, 30% as test data.

In [None]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
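
The 70/30 split can be tried out on synthetic data without touching the real files. The sketch below assumes a made-up frame `toy` with the same number of rows as the cleaned dataset; only `train_test_split` and its `test_size` and `random_state` arguments come from the notebook itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 840 rows; the column values are random and carry no meaning.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ExternalRiskEstimate": rng.integers(40, 95, size=840),
    "RiskPerformance": rng.choice(["Good", "Bad"], size=840),
})

X_toy = toy.drop(columns="RiskPerformance")
y_toy = toy["RiskPerformance"]

# Same call as in the notebook: 30% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same rows.
X_tr_again, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
print(X_tr.index.equals(X_tr_again.index))
```

`train_test_split` shuffles the rows by default, and a fixed `random_state` makes that shuffle repeatable, which is an alternative to storing the split in CSV files.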

##### Two new dataframes are created, one comprised of only the training data, and one only comprised of the test data. The original dataframe has now been separated into two different dataframes, with the test data being set aside.

In [None]:
#traindf=pd.concat([X_train, y_train], axis=1)
#testdf=pd.concat([X_test, y_test],axis=1)

##### The dataframes are saved to CSV files for importing whenever this notebook is run.

In [None]:
#traindf.to_csv('training.csv', index=False)
#testdf.to_csv('test.csv', index=False)

##### Import the training and test dataframes

In [None]:
traindf = pd.read_csv('training.csv')
testdf = pd.read_csv('test.csv')

In [None]:
traindf.select_dtypes(['float64', 'int64']).describe().T

We can see that there are now 588 rows in the training dataframe, which is 70% of the original 840 rows. There are no negative values present, as expected from the cleaned data.
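
The absence of negative values can also be checked directly rather than read off the summary table. A small sketch, assuming a toy frame `toy_train` in place of `traindf`:

```python
import pandas as pd

# Toy frame standing in for traindf; the values are invented for illustration.
toy_train = pd.DataFrame({
    "ExternalRiskEstimate": [62, 75, 81],
    "NetFractionInstallBurden": [45.0, None, 70.0],
    "RiskPerformance": ["Bad", "Good", "Good"],
})

# Count negative entries per numeric column; missing values are not counted as negative.
numeric = toy_train.select_dtypes("number")
negative_counts = (numeric < 0).sum()
print(negative_counts)
```

A column with any leftover -7, -8 or -9 codes would show a non-zero count here.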

In [None]:
df.dtypes

##### Select only the continuous features from the training dataframe, so as to allow for the upcoming continuous feature correlation analysis.

In [None]:
train_continuous_features = traindf.select_dtypes(['float64', 'int64'])

##### Checking the correlation between the continuous features. A perfect positive correlation is signified by 1.0.

In [None]:
train_continuous_features.corr()

##### An example of a high correlation is between 'NumBank2NatlTradesWHighUtilization' and 'NumRevolvingTradesWBalance'.

In [None]:
# DataFrame.as_matrix() was removed in pandas 1.0; .iloc reads the off-diagonal entry directly.
traindf[['NumBank2NatlTradesWHighUtilization', 'NumRevolvingTradesWBalance']].corr().iloc[0, 1]
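
For reference, `corr()` uses the Pearson coefficient by default. The sketch below computes it by hand on two invented columns and compares the result with pandas; the frame `toy` and the helper `pearson_r` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Two small invented columns.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [2.0, 1.0, 4.0, 3.0, 5.0]})

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

r_manual = pearson_r(toy["a"], toy["b"])
r_pandas = toy["a"].corr(toy["b"])
print(r_manual, r_pandas)
```

Both values agree, which confirms that a result of 1.0 means the two features move together perfectly and 0 means no linear relationship.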

##### To allow for easier analysis of correlation, a correlation matrix is plotted as a heatmap, which shows the correlation between continuous features in illustrated form. All continuous features from the training dataframe are used.

In [None]:
sns.set(style="white")

# Select columns containing continuous data
continuous_columns = traindf[['ExternalRiskEstimate', 'PercentTradesNeverDelq', 'NumTrades90Ever2DerogPubRec', 
'NumSatisfactoryTrades', 'MSinceMostRecentDelq', 'NumBank2NatlTradesWHighUtilization','NetFractionRevolvingBurden', 
'NetFractionInstallBurden', 'MSinceMostRecentTradeOpen', 'NumInstallTradesWBalance', 'NumRevolvingTradesWBalance', 
                              'NumTotalTrades', 'PercentTradesWBalance', 'MSinceMostRecentInqexcl7days', 
                              'NumInqLast6Mexcl7days', 'MSinceOldestTradeOpen',
    'NumTradesOpeninLast12M', 'PercentInstallTrades', 'AverageMInFile', 'PercentSatisfactoryTrades',
                             'NumTradesWBalance']].columns

# Calculate correlation of all pairs of continuous features
corr = traindf[continuous_columns].corr()

# Generate a mask for the upper triangle
# The built-in bool is used because the np.bool alias is not available in every NumPy version
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(14, 12))

# Generate a custom colormap - blue and red
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, annot=True, mask=mask, cmap=cmap, vmax=1, vmin=-1,
            square=True, xticklabels=True, yticklabels=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
plt.yticks(rotation = 0)
plt.xticks(rotation = 45)

Correlations above 0.5 will be considered noteworthy. In this correlation matrix, we see the following strong correlations:

**PercentTradesNeverDelq - ExternalRiskEstimate: 0.56**

**ExternalRiskEstimate - MSinceMostRecentDelq: 0.63**

**PercentTradesNeverDelq - MSinceMostRecentDelq: 0.64**

**NumTotalTrades - NumSatisfactoryTrades: 0.93**

**NetFractionRevolvingBurden - NumBank2NatlTradesWHighUtilization: 0.58**

**NumRevolvingTradesWBalance - NumBank2NatlTradesWHighUtilization: 0.62**

**PercentTradesWBalance - NetFractionRevolvingBurden: 0.57**

**AverageMInFile - MSinceOldestTradeOpen: 0.74**

**NumSatisfactoryTrades - NumTradesWBalance: 0.7**

**NumTradesWBalance - NumTotalTrades: 0.74**

**NumTradesWBalance - NumRevolvingTradesWBalance: 0.61** 
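
A list like the one above can also be extracted from the correlation matrix instead of being read off the heatmap. The following is a sketch on invented data; the helper `strong_pairs` and the frame `toy` are hypothetical, and the 0.5 threshold is the one stated above.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr()
    # Keep the strict lower triangle so each pair appears once and the diagonal is skipped.
    lower_triangle = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower_triangle).stack()
    # Masked-out cells are NaN and never pass the comparison.
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Invented columns: y is an exact multiple of x, z alternates in sign.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12],
    "z": [1, -1, 1, -1, 1, -1],
})

result = strong_pairs(toy)
print(result)
```

On the real data the call would be `strong_pairs(traindf[continuous_columns])`, assuming those names are defined as in the heatmap cell.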

**PercentTradesNeverDelq - ExternalRiskEstimate**
There is a moderately strong correlation between these two features. As stated in the supplied "data dictionary" CSV, ExternalRiskEstimate is "Monotonically Decreasing", meaning that an increase in ExternalRiskEstimate leads to a decrease in the probability of a "Bad" target outcome. This relationship makes sense, as past delinquency performance intuitively informs predictions of future performance.

**ExternalRiskEstimate - MSinceMostRecentDelq**
There is a reasonably high correlation present here, which makes sense as the longer it has been since an entry has had a delinquency associated with it, the less risky it will be when it comes to repayments.

**PercentTradesNeverDelq - MSinceMostRecentDelq**:
There is a reasonably high correlation present here, which indicates that the entries which have not had a delinquency in a long time (or ever) have a higher percentage of trades that were never delinquent. This makes sense because the longer it has been since a delinquency, the longer satisfactory trades have been occurring.

**NumTotalTrades - NumSatisfactoryTrades**
There is a very high correlation here, almost 1:1. This points out that the vast majority of all trades are satisfactory. 

**NetFractionRevolvingBurden - NumBank2NatlTradesWHighUtilization**
There is a reasonably high correlation here. NumBank2NatlTradesWHighUtilization "counts the number of credit cards on a consumer credit bureau report carrying a balance that is at 75% of its limit or greater." Therefore, it can be seen that customers with a high fraction of burden on revolving type accounts are quite likely to have high utilization (75%+). 

**NumRevolvingTradesWBalance - NumBank2NatlTradesWHighUtilization**
Following on from the above point, quite a high proportion of people with revolving accounts that carry a balance have a balance that is 75% of its limit or greater. This suggests that customers with a balance on revolving accounts are likely to be at risk of missing repayments. This correlation will be examined further in the target feature - continuous feature correlations in the next segment.

**PercentTradesWBalance - NetFractionRevolvingBurden**
Revolving accounts appear again, with this correlation showing that the majority of trades which have balance are in fact revolving accounts. It would seem that, in general, revolving accounts have more balance to be repaid than installment accounts.

**AverageMInFile - MSinceOldestTradeOpen**
There is quite a strong correlation here, which makes sense as the longer a customer is with the company, the more opportunities to be in file exist. 

**NumSatisfactoryTrades - NumTradesWBalance**
There is quite a strong relationship here, which makes sense as it has already been shown that NumTotalTrades has an extremely high correlation with NumSatisfactoryTrades, and NumTradesWBalance is a subset of NumTotalTrades.

**NumTradesWBalance - NumTotalTrades**
This high correlation indicates that the majority of trades have remaining balance, which shows that the company has more ongoing trades than completed trades. 

**NumTradesWBalance - NumRevolvingTradesWBalance** 
This reasonably strong correlation indicates that the majority of trades with balance are revolving trades, and as discussed earlier, the majority of these have high utilization. Therefore, the company has a large amount of high utilization trades.

### Scatter plots

A better visualisation of the continuous features can be seen through the use of scatter plots. If any two continuous features have a very high correlation (0.9+) with each other and both have a high correlation with the target feature, then only one will be chosen for the predictive models, as the second of two very highly correlated features offers little extra information and can lead to issues in the model.

In [None]:
# Scatter plots for each pair of strongly correlated continuous features.
# The legend of each plot shows the pair's correlation in the training data,
# which allows the strength of each relationship to be checked visually.

def plot_feature_pairs(pairs):
    """Draw one row of scatter plots, one per (x, y) pair of column names."""
    fig, axs = plt.subplots(1, len(pairs), sharey=False, figsize=(20, 5))
    for ax, (x, y) in zip(axs, pairs):
        # Correlation is taken from traindf so the label matches the plotted points.
        r = traindf[x].corr(traindf[y])
        traindf.plot(kind='scatter', x=x, y=y, label="%.3f" % r, ax=ax)

plot_feature_pairs([('PercentTradesNeverDelq', 'ExternalRiskEstimate'),
                    ('ExternalRiskEstimate', 'MSinceMostRecentDelq'),
                    ('PercentTradesNeverDelq', 'MSinceMostRecentDelq')])

plot_feature_pairs([('NumTotalTrades', 'NumSatisfactoryTrades'),
                    ('NetFractionRevolvingBurden', 'NumBank2NatlTradesWHighUtilization'),
                    ('NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization')])

plot_feature_pairs([('PercentTradesWBalance', 'NetFractionRevolvingBurden'),
                    ('AverageMInFile', 'MSinceOldestTradeOpen'),
                    ('NumSatisfactoryTrades', 'NumTradesWBalance')])

plot_feature_pairs([('NumTradesWBalance', 'NumTotalTrades'),
                    ('NumTradesWBalance', 'NumRevolvingTradesWBalance')])



##### It can be seen that NumTotalTrades and NumSatisfactoryTrades are very highly correlated. Thus, if they are seen to have a high correlation with the target feature in the tests below, a decision will have to be made about which of the features will be used in the predictive model. 

### The relationships between the continuous features in the dataframe and the target feature are set out below, with the goal of finding the most promising features to use in the predictive models. A promising feature is one which shows a high correlation with the target outcome. Features that are more strongly correlated with the target tend to lead to more accurate predictive models.

In [None]:
flierprops = dict(marker='o', markerfacecolor='green', markersize=6,
                  linestyle='none')

# Continuous features to compare against the target, in display order.
boxplot_features = [
    'ExternalRiskEstimate', 'MSinceOldestTradeOpen', 'AverageMInFile',
    'NumSatisfactoryTrades', 'PercentTradesNeverDelq', 'NumRevolvingTradesWBalance',
    'NumBank2NatlTradesWHighUtilization', 'PercentTradesWBalance', 'MSinceMostRecentDelq',
    'NumTotalTrades', 'NetFractionInstallBurden', 'NetFractionRevolvingBurden',
    'NumTrades90Ever2DerogPubRec',
    'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'MSinceMostRecentTradeOpen',
    'NumInstallTradesWBalance', 'PercentInstallTrades', 'NumInqLast6Mexcl7days',
    'PercentSatisfactoryTrades', 'NumTradesWBalance',
]

# One box plot per feature, grouped by the target outcome.
for feature in boxplot_features:
    traindf.boxplot(column=[feature], by=['RiskPerformance'], flierprops=flierprops, figsize=(10,7))
 


From analysing these plots, the following information has been deemed noteworthy:

**ExternalRiskEstimate - RiskPerformance**
From the plot it can be seen that a higher ExternalRiskEstimate leads to a lower chance of receiving a 'Bad' target outcome. There is a strong relationship displayed on the plot, and thus this is a prime candidate for use in the predictive data model. The work of external risk estimators leads to a more confident decision in whether the customer is perceived to be able to make their repayments on time.

**AverageMInFile - RiskPerformance**
While the relationship is not as strong as that seen with 'ExternalRiskEstimate', there is still a correlation between 'AverageMInFile' and the target feature. As 'AverageMInFile' increases, so does the probability of a 'Good' target outcome. This makes intuitive sense, as longer term customers have more data associated with them which can be used to determine their probability of repayment.

**NumBank2NatlTradesWHighUtilization**
Even with an abundance of outliers (which were not clamped due to the decision that this data was highly probable to be correct), there is a link between 'NumBank2NatlTradesWHighUtilization' and the target feature, with a higher utilization pointing to a higher chance of a 'Bad' target outcome. This makes intuitive sense, as a customer with 75%+ utilization has a large amount of money to repay and thus may struggle to do so in a timely manner.

**NumSatisfactoryTrades**
Interestingly, the number of satisfactory trades doesn't seem to have a strong correlation with the target feature, which is counterintuitive. This points to other factors being weighted more heavily in the decision. This is therefore not a good candidate for the predictive model.

Also, as NumTotalTrades has a high correlation with NumSatisfactoryTrades, we can see that this feature also doesn't have a high correlation with the target feature. Thus, neither will be used in the predictive model.

**PercentTradesWBalance**
Upon inspecting this plot, it can be seen that there is a correlation with the target feature. While customers anywhere in the 0 to 100% range can still achieve 'Good' outcomes, the distribution is weighted more heavily towards a 'Bad' outcome at higher values. With an increase in the percentage of trades with balance, there is generally an increase in the chance that a 'Bad' target outcome will occur. This makes sense, as having more to repay intuitively makes on-time repayment less likely.

**MSinceMostRecentDelq**
Here it can be seen that the longer it was since a delinquency, the higher the chance of a 'Good' target outcome. This makes sense, as a customer who recently missed repayments is more likely to do so in the near future. 


**NetFractionRevolvingBurden**
There is quite a strong correlation here, with a higher fraction of burden on a revolving account meaning a higher chance of a 'Bad' outcome. A more burdened account will be more difficult to repay, and thus the risk increases. 


**MSinceOldestTradeOpen**
There is a reasonably strong correlation with the target feature here. Perhaps having a record on the books for a longer period of time allows the company to have a better idea on the repayment risks of their customers, as we see that the greater the number of months since the oldest trade was open, the higher the chance of receiving a "Good" target outcome. 


**Decision on subset to choose for the predictive model:**

**The decision has been made to choose:**
- ExternalRiskEstimate
- AverageMInFile
- PercentTradesWBalance
- MSinceMostRecentDelq
- NetFractionRevolvingBurden
- NumBank2NatlTradesWHighUtilization
- MSinceOldestTradeOpen

This decision was made due to the overall strong correlations of these features with the target feature. The other features in the plot have lesser degrees of correlation, and thus would not add much usefulness in the creation of predictive models.
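
The box plots give a visual impression only. One way to put a number on each relationship is to encode the target as 0/1 and take the Pearson correlation with every continuous feature (the point-biserial correlation). This is a sketch on invented values and is not how the subset above was chosen; the frame `toy` and the encoding Good = 1 are assumptions.

```python
import pandas as pd

# Invented rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    "ExternalRiskEstimate": [55, 60, 62, 70, 78, 85],
    "NetFractionRevolvingBurden": [80, 65, 70, 40, 20, 10],
    "RiskPerformance": ["Bad", "Bad", "Bad", "Good", "Good", "Good"],
})

# Assumption: 'Good' is encoded as 1 and 'Bad' as 0.
target = (toy["RiskPerformance"] == "Good").astype(int)

# Correlate every continuous column with the encoded target, strongest first.
features = toy.drop(columns="RiskPerformance")
target_corr = features.corrwith(target).sort_values(key=abs, ascending=False)
print(target_corr)
```

A positive value means higher feature values go with 'Good' outcomes, a negative value means they go with 'Bad' outcomes.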

### The relationships between the categorical features in the dataframe and the target feature are set out below, with the goal of finding the most promising features to use in the predictive models. A promising feature is one which shows a strong association with the target outcome. Features that are more strongly associated with the target tend to lead to more accurate predictive models.

In [None]:
MaxDelq2PublicRecLast12M = traindf['MaxDelq2PublicRecLast12M'].unique()
dfnew = traindf.copy()
# Start as float so the fractional percentages assigned below fit the column's dtype.
dfnew['percent'] = 0.0

for i in MaxDelq2PublicRecLast12M:
    count = 1 / dfnew[dfnew.MaxDelq2PublicRecLast12M == i].count()['RiskPerformance']
    index_list = dfnew[dfnew['MaxDelq2PublicRecLast12M'] == i].index.tolist()
    for ind in index_list:
        dfnew.loc[ind, 'percent'] = count * 100
        
group = dfnew[['percent','MaxDelq2PublicRecLast12M','RiskPerformance']].groupby(['MaxDelq2PublicRecLast12M','RiskPerformance']).sum()

my_plot = group.unstack().plot(kind='bar', stacked=True, title="RiskPerformance based on MaxDelq2PublicRecLast12M", figsize=(15,7))

good_patch = mpatches.Patch(color='orange', label='Good')
bad_patch = mpatches.Patch(color='blue', label='Bad')
my_plot.legend(handles=[good_patch, bad_patch], frameon = True)

my_plot.set_xlabel("MaxDelq2PublicRecLast12M")
my_plot.set_ylabel("Percentage of customers")
my_plot.set_ylim([0,100])

In [None]:
MaxDelqEver = traindf['MaxDelqEver'].unique()
dfnew = traindf.copy()
# Start as float so the fractional percentages assigned below fit the column's dtype.
dfnew['percent'] = 0.0

for i in MaxDelqEver:
    count = 1 / dfnew[dfnew.MaxDelqEver == i].count()['RiskPerformance']
    index_list = dfnew[dfnew['MaxDelqEver'] == i].index.tolist()
    for ind in index_list:
        dfnew.loc[ind, 'percent'] = count * 100
        
group = dfnew[['percent','MaxDelqEver','RiskPerformance']].groupby(['MaxDelqEver','RiskPerformance']).sum()

my_plot = group.unstack().plot(kind='bar', stacked=True, title="RiskPerformance based on MaxDelqEver", figsize=(15,7))

good_patch = mpatches.Patch(color='orange', label='Good')
bad_patch = mpatches.Patch(color='blue', label='Bad')
my_plot.legend(handles=[good_patch, bad_patch], frameon = True)

my_plot.set_xlabel("MaxDelqEver")
my_plot.set_ylabel("Percentage of customers")
my_plot.set_ylim([0,100])

In [None]:
DelqEver = traindf['DelqEver'].unique()
dfnew = traindf.copy()
# Start as float so the fractional percentages assigned below fit the column's dtype.
dfnew['percent'] = 0.0

for i in DelqEver:
    count = 1 / dfnew[dfnew.DelqEver == i].count()['RiskPerformance']
    index_list = dfnew[dfnew['DelqEver'] == i].index.tolist()
    for ind in index_list:
        dfnew.loc[ind, 'percent'] = count * 100
        
group = dfnew[['percent','DelqEver','RiskPerformance']].groupby(['DelqEver','RiskPerformance']).sum()

my_plot = group.unstack().plot(kind='bar', stacked=True, title="RiskPerformance based on DelqEver", figsize=(15,7))

good_patch = mpatches.Patch(color='orange', label='Good')
bad_patch = mpatches.Patch(color='blue', label='Bad')
my_plot.legend(handles=[good_patch, bad_patch], frameon = True)

my_plot.set_xlabel("DelqEver")
my_plot.set_ylabel("Percentage of customers")
my_plot.set_ylim([0,100])

In [None]:
DelqLast12M = traindf['DelqLast12M'].unique()
dfnew = traindf.copy()
# Start as float so the fractional percentages assigned below fit the column's dtype.
dfnew['percent'] = 0.0

for i in DelqLast12M:
    count = 1 / dfnew[dfnew.DelqLast12M == i].count()['RiskPerformance']
    index_list = dfnew[dfnew['DelqLast12M'] == i].index.tolist()
    for ind in index_list:
        dfnew.loc[ind, 'percent'] = count * 100
        
group = dfnew[['percent','DelqLast12M','RiskPerformance']].groupby(['DelqLast12M','RiskPerformance']).sum()

my_plot = group.unstack().plot(kind='bar', stacked=True, title="RiskPerformance based on DelqLast12M", figsize=(15,7))

good_patch = mpatches.Patch(color='orange', label='Good')
bad_patch = mpatches.Patch(color='blue', label='Bad')
my_plot.legend(handles=[good_patch, bad_patch], frameon = True)

my_plot.set_xlabel("DelqLast12M")
my_plot.set_ylabel("Percentage of customers")
my_plot.set_ylim([0,100])
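
The percentages behind stacked bar charts of this kind can also be obtained in a single step with `pd.crosstab`. The sketch below is an alternative formulation on invented data, not the code used for the charts above; the frame `toy` and its values are assumptions.

```python
import pandas as pd

# Invented rows; 6 and 8 are two of the MaxDelqEver codes.
toy = pd.DataFrame({
    "MaxDelqEver": [6, 6, 6, 6, 8, 8, 8, 8],
    "RiskPerformance": ["Bad", "Bad", "Bad", "Good", "Good", "Good", "Good", "Bad"],
})

# normalize='index' turns each row into shares of that category, so every row sums to 100.
shares = pd.crosstab(toy["MaxDelqEver"], toy["RiskPerformance"], normalize="index") * 100
print(shares)
```

Calling `shares.plot(kind='bar', stacked=True)` on the result would draw one bar per category, with the legend entries taken from the column names.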

**MaxDelq2PublicRecLast12M**
The following special meanings are attached to these numbers:

- 0: derogatory comment
- 1: 120+ days delinquent
- 2: 90 days delinquent
- 3: 60 days delinquent
- 4: 30 days delinquent
- 5, 6: unknown delinquency
- 7: current and never delinquent
- 8, 9: all other

Customers with the value 7 have the highest proportion of 'Good' outcomes compared to the other values. This makes sense, as 7 means current and never delinquent, and thus the customer has a good record for repayment. The other values have a higher correlation with 'Bad' target outcomes, which is to be expected as either the customer has a history of delinquency, or the delinquency is unknown.

**MaxDelqEver**
The following special meanings are attached to these numbers: 

- 1: No such value
- 2: derogatory comment
- 3: 120+ days delinquent
- 4: 90 days delinquent
- 5: 60 days delinquent
- 6: 30 days delinquent
- 7: unknown delinquency
- 8: current and never delinquent
- 9: all other

This is a very similar story to MaxDelq2PublicRecLast12M above. Customers with the value 8 have the highest proportion of 'Good' outcomes compared to the other values. This makes sense, as 8 means current and never delinquent, and thus the customer has a good record for repayment. The other values have a higher correlation with 'Bad' target outcomes, which is to be expected as either the customer has a history of delinquency, or the delinquency is unknown.

**DelqEver** 

DelqEver measures whether an entry has ever been delinquent during its history, and it can be seen from the chart that having a delinquency on record points to a greater chance of a "Bad" target outcome. This makes sense, as a previous failure to repay on time intuitively points to a greater risk of it happening again.

**DelqLast12M**

DelqLast12M measures whether an entry has been delinquent in the last 12 months, and it can be seen from the chart that having a delinquency on record in the last 12 months points to a high chance of a "Bad" target outcome. This makes sense, as a previous failure to repay on time, especially recently, intuitively points to a greater risk of it happening again.

**Decision on subset to choose for the predictive model:**

**The decision has been made to choose:**
MaxDelq2PublicRecLast12M, MaxDelqEver, DelqEver, and DelqLast12M were chosen, as the values which a customer receives for these features have been seen to have a large effect on the target outcome.

## The final chosen subset:
**Continuous:** 
- ExternalRiskEstimate 
- AverageMInFile 
- PercentTradesWBalance
- MSinceMostRecentDelq 
- NetFractionRevolvingBurden 
- MSinceOldestTradeOpen
- NumBank2NatlTradesWHighUtilization


**Categorical:** 
- MaxDelq2PublicRecLast12M
- MaxDelqEver
- DelqEver
- DelqLast12M

### The subset of categorical and continuous features has now been chosen, and thus the creation and evaluation of the predictive models may now begin. First, the data shall be prepared for these models.

### Preparing the data

### The predictive models need numerical data to function properly, and thus, categorical data must be changed to numerical. 

In [None]:
traindf.dtypes

#### Dummy values will be used to give categorical data numerical values, which the predictive models need as descriptive features.

In [None]:
# Get the dummy values for the categorical subset chosen. 

RiskPerformance_dummies = pd.get_dummies(traindf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)
print("RiskPerformance:", RiskPerformance_dummies)

MaxDelqEver_dummies = pd.get_dummies(traindf['MaxDelqEver'], prefix='MaxDelqEver', drop_first=True)
print("MaxDelqEver:", MaxDelqEver_dummies)

MaxDelq2PublicRecLast12M_dummies = pd.get_dummies(traindf['MaxDelq2PublicRecLast12M'], prefix='MaxDelq2PublicRecLast12M', drop_first=True)
print("MaxDelq2PublicRecLast12M:", MaxDelq2PublicRecLast12M_dummies)

DelqEver_dummies = pd.get_dummies(traindf['DelqEver'], prefix='DelqEver', drop_first=True)
print("DelqEver:", DelqEver_dummies)

DelqLast12M_dummies = pd.get_dummies(traindf['DelqLast12M'], prefix='DelqLast12M', drop_first=True)
print("DelqLast12M:", DelqLast12M_dummies)
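What dummy encoding with `drop_first=True` produces can be sketched in plain Python. The function below is an illustration only, not the pandas implementation: the name `one_hot`, its parameters, and the example values are all assumptions made for this sketch. The category list is fixed up front, which is one way to make sure that a training set and a test set receive identical columns even if one of them is missing a category.

```python
def one_hot(values, categories, prefix, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for the given values.

    `categories` is fixed up front (hypothetically, taken from the training
    data), so every dataset encoded with it receives the same columns.
    """
    kept = categories[1:] if drop_first else categories
    names = [f"{prefix}_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return names, rows


# Hypothetical values for illustration only, not taken from the dataset.
names, rows = one_hot(["Bad", "Good", "Good"], ["Bad", "Good"], "RiskPerformance")
print(names)  # ['RiskPerformance_Good']
print(rows)   # [[0], [1], [1]]

# A dataset that lacks some categories still gets the full set of columns.
names2, rows2 = one_hot([7, 3], [2, 3, 4, 7], "MaxDelqEver")
print(names2)  # ['MaxDelqEver_3', 'MaxDelqEver_4', 'MaxDelqEver_7']
print(rows2)   # [[0, 0, 1], [1, 0, 0]]
```

Dropping the first category avoids a redundant column: a row of all zeros already identifies the dropped category.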


                

#### The model will be trained, or 'fit', using all of the chosen subset of categorical and continuous features.

In [None]:
# Continuous features
cont_features = ['ExternalRiskEstimate', 'PercentTradesWBalance', 'NetFractionRevolvingBurden',
                'AverageMInFile', 'MSinceMostRecentDelq', 'MSinceOldestTradeOpen', 'NumBank2NatlTradesWHighUtilization']


# Continuous and categorical features combined 
features = cont_features + RiskPerformance_dummies.columns.values.tolist() + MaxDelqEver_dummies.columns.values.tolist() + MaxDelq2PublicRecLast12M_dummies.columns.values.tolist() + DelqEver_dummies.columns.values.tolist() + DelqLast12M_dummies.columns.values.tolist()

print("Features: ", features)

### A new dataframe will be created which will have the dummy values added, and the category data type features removed. Only the chosen subset will be in the dataframe.

In [None]:
traindf_chosFeat = pd.concat([traindf, RiskPerformance_dummies, MaxDelqEver_dummies, MaxDelq2PublicRecLast12M_dummies, DelqEver_dummies, DelqLast12M_dummies], axis=1)
# Keep the features chosen.
feat_to_keep = features

traindf_chosFeat = traindf_chosFeat.loc[:, feat_to_keep]


#### Check the head of the new df to ensure it contains the wanted features.

In [None]:
traindf_chosFeat.head(5)

## The first predictive model will now be created and evaluated. This model is the Multiple Linear Regression model. This model works by fitting a linear equation to observed data.

## Multiple linear regression

### Set the descriptive features, which will be used to try to predict the target feature.

In [None]:
# The descriptive features, X_train, for the model will be all chosen features except for the target, which will be y_train.

X_train = traindf_chosFeat[[x for x in traindf_chosFeat[features] if x not in ['RiskPerformance_Good']]]
y_train = traindf_chosFeat.RiskPerformance_Good

print("\nDescriptive features in X:\n", X_train)
print("\nTarget feature in y:\n", y_train)

In [None]:
traindf_chosFeat.head()

#### Train aka fit a model using all chosen continuous and categorical features on the training dataset.

In [None]:
multiple_linreg = LinearRegression().fit(X_train, y_train)

# Print the intercept
print("\nIntercept: \n", multiple_linreg.intercept_)
print()
# Print the features and coefficients
print("Features and coefficients:", list(zip(X_train, multiple_linreg.coef_)))

### The linear regression model works based on estimating a set of weights per feature, and also an extra weight called the intercept. The intercept is the mean of Y (the target) when all predictors equal zero. It is an adjustment parameter called the "bias", and gives the base value of the target.

$target\_feature = w_0 + w_1 * feature_1 + w_2*feature_2 + ...+ w_n*feature_n $

From the results of this model, it can be seen that the base value for RiskPerformance is -0.8727520494674129.

The weights for the other features are as follows:

- For every unit increase in **ExternalRiskEstimate** the RiskPerformance target increases by: **0.014011095261030292**
- For every unit increase in **PercentTradesWBalance** the RiskPerformance target increases by: **0.001829684339055234**
- For every unit increase in **NetFractionRevolvingBurden** the RiskPerformance target decreases by: **0.0031226344582442663**
- For every unit increase in **AverageMInFile** the RiskPerformance target increases by: **0.0026798888878722844**
- For every unit increase in **MSinceMostRecentDelq** the RiskPerformance target decreases by: **0.0009084575942394557**
- For every unit increase in **MSinceOldestTradeOpen** the RiskPerformance target increases by: **0.0004927879034892035**
- For every unit increase in **NumBank2NatlTradesWHighUtilization** the RiskPerformance target decreases by: **0.012732620803912575**
- For every unit increase in **MaxDelqEver_3** the RiskPerformance target decreases by: **0.16715543030913155**
- For every unit increase in **MaxDelqEver_4** the RiskPerformance target decreases by: **0.002765512726690825**
- For every unit increase in **MaxDelqEver_5** the RiskPerformance target decreases by: **0.019637458917251053**
- For every unit increase in **MaxDelqEver_6** the RiskPerformance target increases by: **0.1124144744351723**
- For every unit increase in **MaxDelqEver_7** the RiskPerformance target increases by: **0.06548184122182071**
- For every unit increase in **MaxDelq2PublicRecLast12M_3** the RiskPerformance target increases by: **0.17336707488613431**
- For every unit increase in **MaxDelq2PublicRecLast12M_4** the RiskPerformance target decreases by: **0.1089470561126091**
- For every unit increase in **MaxDelq2PublicRecLast12M_6** the RiskPerformance target increases by: **0.11944537854390623**
- For every unit increase in **MaxDelq2PublicRecLast12M_7** the RiskPerformance target increases by: **0.10635333789162385**
- For every unit increase in **DelqEver_True** the RiskPerformance target decreases by: **0.06548184122182067**
- For every unit increase in **DelqLast12M_True** the RiskPerformance target decreases by: **0.10635333789162374**

### Thus, the model is:

**RiskPerformance_Good** = -0.8727520494674129 + ExternalRiskEstimate * 0.014011095261030292 + PercentTradesWBalance * 0.001829684339055234 - NetFractionRevolvingBurden * 0.0031226344582442663 + AverageMInFile * 0.0026798888878722844 - MSinceMostRecentDelq * 0.0009084575942394557 + MSinceOldestTradeOpen * 0.0004927879034892035 - NumBank2NatlTradesWHighUtilization * 0.012732620803912575 - MaxDelqEver_3 * 0.16715543030913155 - MaxDelqEver_4 * 0.002765512726690825 - MaxDelqEver_5 * 0.019637458917251053 + MaxDelqEver_6* 0.1124144744351723 + MaxDelqEver_7 * 0.06548184122182071 + MaxDelq2PublicRecLast12M_3 * 0.17336707488613431 - MaxDelq2PublicRecLast12M_4 * 0.1089470561126091 + MaxDelq2PublicRecLast12M_6 * 0.11944537854390623 + MaxDelq2PublicRecLast12M_7 * 0.10635333789162385 - DelqEver_True * 0.06548184122182067 - DelqLast12M_True * 0.10635333789162374
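How such an equation turns one customer's feature values into a prediction can be sketched as follows. This is a minimal illustration, assuming rounded weights for only two of the features and made-up feature values; the function name `linear_prediction` is hypothetical and the remaining features are left out for brevity.

```python
def linear_prediction(intercept, weights, example):
    """Compute w_0 + sum(w_i * feature_i) for one example."""
    return intercept + sum(w * example[name] for name, w in weights.items())


# Hypothetical, rounded weights and made-up feature values (illustration only).
intercept = -0.87
weights = {"ExternalRiskEstimate": 0.014, "NetFractionRevolvingBurden": -0.003}
example = {"ExternalRiskEstimate": 80, "NetFractionRevolvingBurden": 20}

score = linear_prediction(intercept, weights, example)
print(round(score, 2))  # 0.19
```

The result is a continuous score rather than a class, which is why a threshold is needed before it can be compared with the 0/1 target.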

#### Now that the model has been trained, it will be tested on the training data. The first 100 training examples will be printed to ensure that the model is working correctly (debugging).

In [None]:
# Predict on the training descriptive features.
multiple_linreg_predictions = multiple_linreg.predict(X_train)
print("\nPredictions with multiple linear regression: \n")

# Realign indices so that actual and predicted values line up row by row.
# reset_index returns a new object, so the result is assigned to a new name.
y_train_aligned = y_train.reset_index(drop=True)
noThresPredDf = pd.DataFrame(multiple_linreg_predictions, columns=['Predicted'])

# Show the actual vs predicted values for the first 100 training examples.
actual_vs_predicted_multiplelinreg = pd.concat([y_train_aligned.head(100), noThresPredDf.head(100)], axis=1)

print(actual_vs_predicted_multiplelinreg)

#### The predicted values are continuous (linear regression does not restrict them to exactly 0 or 1), but for the target in this data, the outcome can only be 0 or 1, nothing in between. Thus, the predicted results will be set to 0 or 1. The threshold is set as 0.5, with 0 if the prediction is 0.5 or below, and 1 if the prediction is greater than 0.5. This gives a class to the prediction.

In [None]:
# Set 0.5 and below to 0, and greater than 0.5 to 1.
preddf = pd.DataFrame(multiple_linreg_predictions, columns=['Predicted'])
preddf.loc[preddf.Predicted <= 0.5, 'predicted_Yes_No'] = 0 
preddf.loc[preddf.Predicted > 0.5, 'predicted_Yes_No'] = 1 
preddf = preddf.drop("Predicted", axis=1)

# Reset the indices to allow the rows to align correctly.
# reset_index returns a new object, so the results are assigned to new names.
y_train_aligned = y_train.reset_index(drop=True)
preddf_aligned = preddf.reset_index(drop=True)

# Show the actual vs predicted values for the first 100 training examples.
actual_vs_predicted_multiplelinreg = pd.concat([y_train_aligned.head(100), preddf_aligned.head(100)], axis=1)
print(actual_vs_predicted_multiplelinreg)
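The thresholding step can also be written as a small function. This is a minimal sketch that assumes the same rule as the cell above (values above 0.5 become 1, everything else becomes 0); the name `to_class` and the example predictions are made up for illustration.

```python
def to_class(predicted_value, threshold=0.5):
    """Map a continuous prediction to 0 or 1; values above the threshold become 1."""
    return 1 if predicted_value > threshold else 0


# Hypothetical raw predictions; linear regression can fall outside [0, 1].
raw = [-0.12, 0.49, 0.5, 0.51, 1.07]
classes = [to_class(p) for p in raw]
print(classes)  # [0, 0, 0, 1, 1]
```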


### A comparison can now be made more clearly, with the actual and predicted both being either 0 or 1.

The printMetrics function has been created to give evaluation metrics throughout this work. 

In [None]:
#This function is used repeatedly to compute all metrics
def printMetrics(testActualVal, predictions):
    #classification evaluation measures
    print('\n==============================================================================')
    print("Accuracy: ", metrics.accuracy_score(testActualVal, predictions))
    print("MAE: ", metrics.mean_absolute_error(testActualVal, predictions))
    print("MSE: ", metrics.mean_squared_error(testActualVal, predictions))
    print("RMSE: ", metrics.mean_squared_error(testActualVal, predictions)**0.5)
    print("R2: ", metrics.r2_score(testActualVal, predictions))
    print('\n==============================================================================')
    print("Confusion matrix: \n", metrics.confusion_matrix(testActualVal, predictions))
    print("Classification report:\n ", metrics.classification_report(testActualVal, predictions))
    

In [None]:
printMetrics(y_train, preddf['predicted_Yes_No'])

Analysis of the metrics:

The accuracy is 74.3%, which means that the model predicts correctly 74.3% of the time, which is better than a 50/50 choice but still not very accurate.

The confusion matrix reveals 235 true negatives, 82 false negatives, 69 false positives, and 202 true positives.

The weighted average precision, (correctly predicted positive / predicted positive) is 74%, which shows that when the model predicts a good target outcome, it is correct 74% of the time. Precision is the ratio of correctly predicted positive (here it is a 'good' outcome) observations to the total predicted positive observations.

The weighted average recall, (correctly predicted positive / actual positive) is 74%, which shows when the target outcome is good, it predicts that it is good 74% of the time. (When it's actually good, it predicts good 74% of the time.)

The weighted average F1-score, (aggregation of Precision and Recall) is 74%, which means there are a reasonably low amount of false positives and false negatives.

The MAE and MSE are both 0.2568027210884354, the RMSE is 0.5067570631855419, and the R2 is -0.02840066716085965.

In general, the higher the R2, the better the model fits the data, and the lower the MSE, MAE, and RMSE, the lower the error in the model. In this case, a negative value for R2 indicates a poor fit: the thresholded predictions have a larger squared error than always predicting the mean of the target would.
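The reported figures can be cross-checked from the confusion matrix counts alone. The sketch below assumes the counts are as stated above (235 true negatives, 82 false negatives, 69 false positives, 202 true positives). The precision, recall and F1 it computes are for the 'Good' class only, so they differ slightly from the weighted averages quoted above.

```python
import math

# Counts reported for the training data above.
tn, fn, fp, tp = 235, 82, 69, 202
n = tn + fn + fp + tp

accuracy = (tp + tn) / n
precision_good = tp / (tp + fp)  # of predicted 'Good', how many were 'Good'
recall_good = tp / (tp + fn)     # of actual 'Good', how many were found
f1_good = 2 * precision_good * recall_good / (precision_good + recall_good)

# With 0/1 labels every error has absolute and squared size 1,
# so MAE and MSE both equal the misclassification rate.
mae = mse = (fp + fn) / n
rmse = math.sqrt(mse)

# R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y.
positives = tp + fn
ss_tot = positives * (n - positives) / n
r2 = 1 - (fp + fn) / ss_tot

print(round(accuracy, 4), round(mae, 4), round(rmse, 4), round(r2, 4))
```

That MAE and MSE coincide is a property of 0/1 labels rather than of this model, and R2 becomes negative as soon as the number of errors exceeds `ss_tot`.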


## Evaluation of the model with the test data 

### Repeat the dummies and features dataframe setup for the test dataframe.

In [None]:
RiskPerformance_dummies = pd.get_dummies(testdf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)
print("RiskPerformance:", RiskPerformance_dummies)

MaxDelqEver_dummies = pd.get_dummies(testdf['MaxDelqEver'], prefix='MaxDelqEver', drop_first=True)
print("MaxDelqEver:", MaxDelqEver_dummies)

MaxDelq2PublicRecLast12M_dummies = pd.get_dummies(testdf['MaxDelq2PublicRecLast12M'], prefix='MaxDelq2PublicRecLast12M', drop_first=True)
print("MaxDelq2PublicRecLast12M:", MaxDelq2PublicRecLast12M_dummies)

DelqEver_dummies = pd.get_dummies(testdf['DelqEver'], prefix='DelqEver', drop_first=True)
print("DelqEver:", DelqEver_dummies)

DelqLast12M_dummies = pd.get_dummies(testdf['DelqLast12M'], prefix='DelqLast12M', drop_first=True)
print("DelqLast12M:", DelqLast12M_dummies)


cont_features = ['ExternalRiskEstimate', 'PercentTradesWBalance', 'NetFractionRevolvingBurden',
                'AverageMInFile', 'MSinceMostRecentDelq', 'MSinceOldestTradeOpen', 'NumBank2NatlTradesWHighUtilization']

features = cont_features + RiskPerformance_dummies.columns.values.tolist() + MaxDelqEver_dummies.columns.values.tolist() + MaxDelq2PublicRecLast12M_dummies.columns.values.tolist() + DelqEver_dummies.columns.values.tolist() + DelqLast12M_dummies.columns.values.tolist()
print("\nCont features: ", cont_features)
# print("Train Categ features: ", train_categ_features)
print("Features: ", features)

In [None]:
testdf_chosFeat = pd.concat([testdf, RiskPerformance_dummies, MaxDelqEver_dummies, MaxDelq2PublicRecLast12M_dummies, DelqEver_dummies, DelqLast12M_dummies], axis=1)

feat_to_keep = features

testdf_chosFeat = testdf_chosFeat.loc[:, feat_to_keep]

In [None]:
testdf_chosFeat.head()

##### Divide the test dataframe into X_test with the descriptive features, and y_test with the target feature.

In [None]:
X_test = testdf_chosFeat[[x for x in testdf_chosFeat[features] if x not in ['RiskPerformance_Good']]]
y_test = testdf_chosFeat.RiskPerformance_Good

### Evaluate the predictive model on the test dataframe. 

In [None]:
test_predictions = multiple_linreg.predict(X_test)
print("Actual values of test:\n", y_test)
print("Predictions on test:", test_predictions)


#### Set the threshold of 0.5

In [None]:
test_preddf = pd.DataFrame(test_predictions, columns=['Predicted'])
test_preddf.loc[test_preddf.Predicted <= 0.5, 'predicted_Yes_No'] = 0 
test_preddf.loc[test_preddf.Predicted > 0.5, 'predicted_Yes_No'] = 1 
test_preddf = test_preddf.drop("Predicted", axis=1)

In [None]:
printMetrics(y_test, test_preddf['predicted_Yes_No'])

Analysis of the metrics:

The accuracy is 70.6%, which means that the model predicts correctly 70.6% of the time, which is better than a 50/50 choice but still not very accurate. The model is 3.7 percentage points less accurate than on the training data.

The confusion matrix reveals 91 true negatives, 33 false negatives, 41 false positives, and 87 true positives.

The weighted average precision, (correctly predicted positive / predicted positive) is 71%, which shows that when the model predicts a good target outcome, it is correct 71% of the time. Precision is the ratio of correctly predicted positive (here it is a 'good' outcome) observations to the total predicted positive observations.

The weighted average recall, (correctly predicted positive / actual positive) is 71%, which shows when the target outcome is good, it predicts that it is good 71% of the time. (When it's actually good, it predicts good 71% of the time.)

The weighted average F1-score, (aggregation of Precision and Recall) is 71%, which means there are a reasonably low amount of false positives and false negatives.

The scores all are lower on the test data than the training data. This is to be expected, as the test data is new data for the model, with the model having the chance to "learn" the results of some of the training data. 

The MAE, MSE and RMSE are all higher, indicating more error in the model when used with the test data. The R2 is also lower than before. This points to the model performing worse on the test data than on the training data.

## Evaluation with cross-validation

### cross_val_score() is normally used to generate scores through cross-validation. However, with this model the continuous predictions need to be thresholded before classification metrics can be applied, and thus cross_val_score() won't return accurate results. Instead, cross_val_predict() will be used to generate cross-validated predictions, which will then be thresholded, and metrics applied.

### cross_val_predict() returns cross-validated prediction estimates for each element in the input.
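The idea behind cross-validated predictions can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation: the helper names `k_fold_indices` and `out_of_fold_predictions` are hypothetical, the folds are simple contiguous blocks, and the stand-in "model" just predicts the mean of its training targets.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(y, k, fit, predict):
    """Predict every example with a model that never saw it during fitting."""
    predictions = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        model = fit(train_y)
        for i in test_idx:
            predictions[i] = predict(model)
    return predictions


# Hypothetical stand-in model: always predicts the mean of its training targets.
y = [1, 0, 1, 1, 0, 0]
preds = out_of_fold_predictions(
    y, k=3, fit=lambda ys: sum(ys) / len(ys), predict=lambda m: m
)
print(preds)  # [0.5, 0.5, 0.25, 0.25, 0.75, 0.75]
```

Each example receives exactly one prediction, made by a model fitted on the other folds, so the full prediction list can be thresholded and scored like any single set of predictions.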

In [None]:
# Create 2 new dataframes, one with all of x, one with all of y. This represents the entirety of the data.
crossValXdf=pd.concat([X_train, X_test])
crossValYdf=pd.concat([y_train, y_test])

In [None]:
crossValXdf= crossValXdf.reset_index(drop=True)
crossValXdf

In [None]:
predictions = cross_val_predict(LinearRegression(), crossValXdf, crossValYdf, cv=3)
predictions

In [None]:
# Apply thresholding
for i, predict in enumerate(predictions):
     predictions[i] = 0 if (predict <= 0.5) else 1

In [None]:
predictions

In [None]:
# Print metrics of the cross-validated predictions (actual values first, then predictions)
printMetrics(crossValYdf, predictions)

## Cross validation vs Test data results

Analysis of the metrics:

The accuracy is 70.6% for the test data, and 72.26% for the cross validation data.

The weighted average precision, recall, and F1-score for the test data is 71%, and 72% for the cross validation data.

The cross validated model has lower MAE, MSE, and RMSE scores, indicating less error, and the R2 is higher which indicates a better fit for the model.

Therefore, the cross validated results are better than those obtained on the test data. This is largely down to cross-validation evaluating the model across several splits of the full dataset rather than on a single held-out set.

### Linear regression evaluation

The accuracy scores of 70.6% for the test model and 72.26% for the cross validation model are better than 50/50 chance, but it is still not highly accurate. The low r2 scores seem to indicate that this model is not a great fit for the data. Therefore, further models will be explored.



## Logistic Regression is the next predictive model which will be evaluated. This models the probabilities for classification problems which have two possible outcomes. In classification we can interpret the target feature as the probability of class membership.

#### Training the model using the chosen features

In [None]:
# Train aka fit, a model using all chosen continuous and categorical features.
multiple_logreg = LogisticRegression().fit(X_train, y_train)

# Print the weights learned for each feature (one coefficient per column of X_train).
print("Features: \n", X_train.columns.tolist())
print("Coefficients: \n", multiple_logreg.coef_)
print("\nIntercept: \n", multiple_logreg.intercept_)

$probability(target=1|descriptive\_features)=logistic(w_0 + w_1 * feature_1 + w_2*feature_2 + ...+ w_n*feature_n)$


### The logistic regression model works similarly to the linear regression model by estimating a set of weights per feature, and also an extra weight called the intercept. However, logistic regression is a classification method, which aims to classify an example into one of two classes (target feature is 0 or 1, which is "Bad" or "Good" in our case).

Thus, with classification we can interpret the target feature as the probability of class membership. The weighted sum of the features is the same as in linear regression:
$target\_feature = w_0 + w_1 * feature_1 + w_2*feature_2 + ...+ w_n*feature_n $

The probability of class membership is determined as follows:
$probability(target=1|descriptive\_features)=logistic(w_0 + w_1 * feature_1 + w_2*feature_2 + ...+ w_n*feature_n)$
where $logistic(x)$ is defined as: $logistic(x) = \frac{e ^ x}{1 + e ^ x} = \frac{1}{1+e^{-x}}$
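The two formulas above can be sketched in a few lines of Python. This is a minimal illustration with made-up weights and feature values; the function name `probability_of_class_1` is an assumption for this sketch and is not part of scikit-learn.

```python
import math


def logistic(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def probability_of_class_1(intercept, weights, features):
    """Apply the logistic function to w_0 + sum(w_i * feature_i)."""
    linear_score = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(linear_score)


print(logistic(0))  # 0.5

# Hypothetical weights and feature values: the weighted sum is -1 + 2 - 0.5 = 0.5.
p = probability_of_class_1(-1.0, [0.5, -0.25], [4.0, 2.0])
print(round(p, 4))  # 0.6225
```

A weighted sum of 0 corresponds to a probability of exactly 0.5, so predicting class 1 when the probability exceeds 0.5 is the same as predicting class 1 when the weighted sum is positive.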

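From the results of this model, it can be seen that the base value (intercept) for RiskPerformance is -1.76718641. Because the weighted sum is passed through the logistic function, the intercept and the weights act on the log-odds of a 'Good' outcome rather than directly on the probability.

The weights for the other features are as follows, with each change measured on that log-odds scale: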

- For every unit increase in **ExternalRiskEstimate** the RiskPerformance target increases by: **0.02710619**
- For every unit increase in **PercentTradesWBalance** the RiskPerformance target increases by: **0.00591177**
- For every unit increase in **NetFractionRevolvingBurden** the RiskPerformance target decreases by: **0.02016581**
- For every unit increase in **AverageMInFile** the RiskPerformance target increases by: **0.01673749**
- For every unit increase in **MSinceMostRecentDelq** the RiskPerformance target decreases by: **0.00543609**
- For every unit increase in **MSinceOldestTradeOpen** the RiskPerformance target increases by: **0.00274037**
- For every unit increase in **NumBank2NatlTradesWHighUtilization** the RiskPerformance target decreases by: **0.15378619**
- For every unit increase in **MaxDelqEver_3** the RiskPerformance target decreases by: **0.67130334**
- For every unit increase in **MaxDelqEver_4** the RiskPerformance target decreases by: **0.16776911**
- For every unit increase in **MaxDelqEver_5** the RiskPerformance target decreases by: **0.13580476**
- For every unit increase in **MaxDelqEver_6** the RiskPerformance target increases by: **0.62118979**
- For every unit increase in **MaxDelqEver_7** the RiskPerformance target decreases by: **0.33580757**
- For every unit increase in **MaxDelq2PublicRecLast12M_3** the RiskPerformance target increases by: **0.1982586**
- For every unit increase in **MaxDelq2PublicRecLast12M_4** the RiskPerformance target decreases by: **0.50812457**
- For every unit increase in **MaxDelq2PublicRecLast12M_6** the RiskPerformance target increases by: **0.45209493**
- For every unit increase in **MaxDelq2PublicRecLast12M_7** the RiskPerformance target decreases by: **0.28637155**
- For every unit increase in **DelqEver_True** the RiskPerformance target decreases by: **1.43137883**
- For every unit increase in **DelqLast12M_True** the RiskPerformance target decreases by: **1.48081486**

### Thus, the model is:



**probability(RiskPerformance_Good=1|descriptive_features)=logistic**(-1.76718641 + ExternalRiskEstimate * 0.02710619 + PercentTradesWBalance * 0.00591177 - NetFractionRevolvingBurden * 0.02016581 + AverageMInFile * 0.01673749 - MSinceMostRecentDelq * 0.00543609 + MSinceOldestTradeOpen * 0.00274037 - NumBank2NatlTradesWHighUtilization * 0.15378619 - MaxDelqEver_3 * 0.67130334 - MaxDelqEver_4 * 0.16776911 - MaxDelqEver_5 * 0.13580476 + MaxDelqEver_6 * 0.62118979 - MaxDelqEver_7 * 0.33580757 + MaxDelq2PublicRecLast12M_3 * 0.1982586 - MaxDelq2PublicRecLast12M_4 * 0.50812457 + MaxDelq2PublicRecLast12M_6 * 0.45209493 - MaxDelq2PublicRecLast12M_7 * 0.28637155 - DelqEver_True * 1.43137883 - DelqLast12M_True * 1.48081486)

### Get the predicted probabilities. 
The output is a pair for each example:

- The first component is the probability of the negative class (class 0).
- The second component is the probability of the positive class (class 1).

In [None]:
# Predicted probabilities for each example. 
multiple_logreg_predicted_probs = multiple_logreg.predict_proba(X_train)
# First 100 for debugging purposes (ensuring no problems).
print(multiple_logreg.predict_proba(X_train.head(100)))

### Print the predicted classes of the first 100 training examples, 1 or 0.

In [None]:
print(multiple_logreg.predict(X_train.head(100)))

### Predict on the whole training set.

In [None]:
multiple_logreg_predicted_class = multiple_logreg.predict(X_train)


#### Check the accuracy on the training set. 

In [None]:
multiple_logreg.score(X_train, y_train)

#### Print metrics

In [None]:
printMetrics(y_train, multiple_logreg_predicted_class)

Analysis of the metrics:

The accuracy is 73.8%, which means that the model predicts correctly 73.8% of the time, which is better than a 50/50 choice but still not very accurate.

The confusion matrix reveals 231 true negatives, 81 false negatives, 73 false positives, and 203 true positives.

The weighted average precision, (correctly predicted positive / predicted positive) is 74%, which shows that when the model predicts a good target outcome, it is correct 74% of the time. Precision is the ratio of correctly predicted positive (here it is a 'good' outcome) observations to the total predicted positive observations.

The weighted average recall, (correctly predicted positive / actual positive) is 74%, which shows when the target outcome is good, it predicts that it is good 74% of the time. (When it's actually good, it predicts good 74% of the time.)

The weighted average F1-score, (aggregation of Precision and Recall) is 74%, which means there are a reasonably low amount of false positives and false negatives.

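The MSE is 0.2619047619047619, the RMSE is 0.511766315719159, and the R2 is -0.0488324684951813. The reported MAE is 35.25170068027211, which cannot be a true error size for a 0/1 target (it should equal the MSE); the value most likely comes from unsigned 8-bit integer wraparound in the dummy-encoded labels.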
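For 0/1 labels every absolute error is either 0 or 1, so an MAE of about 35.25 needs explaining. One plausible cause, assumed here rather than confirmed by the source, is that the dummy-encoded target and the predicted classes are stored as unsigned 8-bit integers and the error is computed as predicted minus actual, in which case 0 - 1 wraps around to 255. The sketch below mimics that wraparound with modulo arithmetic and uses the confusion matrix counts reported above.

```python
# Counts reported for the training data above.
false_negatives, false_positives, n = 81, 73, 588


def uint8_diff(pred, actual):
    """Mimic pred - actual for unsigned 8-bit integers (wraps modulo 256)."""
    return (pred - actual) % 256


# A false negative is pred=0, actual=1, which wraps around to 255 instead of -1.
wrapped_mae = (
    false_negatives * uint8_diff(0, 1) + false_positives * uint8_diff(1, 0)
) / n

# With signed arithmetic every error has size 1, so MAE equals the error rate.
signed_mae = (false_negatives + false_positives) / n

# Squaring the wrapped difference wraps again: 255 * 255 = 65025 = 254 * 256 + 1.
wrapped_square = (uint8_diff(0, 1) ** 2) % 256

print(wrapped_mae)     # about 35.2517, matching the reported MAE
print(signed_mae)      # about 0.2619, matching the reported MSE
print(wrapped_square)  # 1
```

Under this assumption the squared error wraps back to 1, which would explain why the MSE, RMSE and R2 look sensible while the MAE does not. Casting the labels to a signed integer or float type before computing the metrics would avoid the issue.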

### Evaluate the model on the test data

In [None]:
# Estimated class probabilities on test set
print(multiple_logreg.predict_proba(X_test))

In [None]:
# Estimated classes on test set
y_predicted = multiple_logreg.predict(X_test)
print(y_predicted)

#### Print out test metrics

In [None]:
printMetrics(y_test, y_predicted)

Analysis of the metrics:

In comparison to the training data, the model had an accuracy of 69.4% on the test data versus 73.8%.

The weighted average precision is 70% for test versus 74% for training.

The weighted average recall is 69% for test versus 74% for training.

The weighted average F1-score is 69% for test versus 74% for training.

The MAE, MSE, and RMSE scores are higher indicating larger error, and the R2 is smaller, indicating a worse fit for the model.

The model is not working as well on new data.

## Model evaluation using cross-validation 

In [None]:
f1_scores = cross_val_score(LogisticRegression(), crossValXdf, crossValYdf, scoring='f1', cv=3)
print("\nF1 CV scores:", f1_scores)
print("\nF1 CV mean score:",f1_scores.mean())

acc_scores = cross_val_score(LogisticRegression(), crossValXdf, crossValYdf, scoring='accuracy', cv=3)
print("\nAccuracy CV scores:",acc_scores)
print("\nAccuracy CV mean score:",acc_scores.mean())

mean_abs_scores = cross_val_score(LogisticRegression(), crossValXdf, crossValYdf, scoring='neg_mean_absolute_error', cv=3)
print("\nneg_mean_absolute_error CV scores:",mean_abs_scores)
print("\nneg_mean_absolute_error CV mean score:",mean_abs_scores.mean())

mean_squared_scores = cross_val_score(LogisticRegression(), crossValXdf, crossValYdf, scoring='neg_mean_squared_error', cv=3)
print("\nneg_mean_squared_error CV scores:",mean_squared_scores)
print("\nneg_mean_squared_error CV mean score:",mean_squared_scores.mean())

r2_scores = cross_val_score(LogisticRegression(), crossValXdf, crossValYdf, scoring='r2', cv=3)
print("\nr2 CV scores:",r2_scores)
print("\nr2 CV mean score:",r2_scores.mean())

F1 CV scores: [0.68592058 0.73764259 0.69565217]

F1 CV mean score: 0.7064051123605676

Accuracy CV scores: [0.69039146 0.75357143 0.69892473]

Accuracy CV mean score: 0.7142958729429858

neg_mean_absolute_error CV scores: [-36.46619217 -34.71785714 -34.89605735]

neg_mean_absolute_error CV mean score: -35.36003555378196

neg_mean_squared_error CV scores: [-0.30960854 -0.24642857 -0.30107527]

neg_mean_squared_error CV mean score: -0.2857041270570142

r2 CV scores: [-0.24033486  0.01302682 -0.20617602]

r2 CV mean score: -0.14449468398311816


The cross validated model has an accuracy of 71.4%, whereas the test data model has 69.4%.

- The cross validated model has an MAE of 35.36, whereas the test data model has 35.58. Both values exceed 1, which is not possible for a 0/1 target, so they should be treated as a computation artifact rather than a real error size.
- The cross validated model has an MSE of 0.2857 (an RMSE of about 0.5345), whereas the test data model has an RMSE of 0.5527.
- The cross validated model has an R2 of -0.14449, whereas the test data model has -0.225.
- The cross validated model has an F1 of 0.7064, whereas the test data model has 0.69.

The cross validated scores are better than the test data scores in every metric. This is most likely because cross-validation evaluates the model on every example across several train/test splits, so its estimate depends less on one particular split.


### Logistic regression evaluation

In linear regression, accuracy scores of 70.6% for the test model and 72.26% for the cross validation model were obtained. For the logistic regression model, the test data model was 69.4% accurate, and the cross validated model was 71.4%. Thus, based on accuracy, the linear regression model is the better of the two.

# Next, the Random Forest predictive model will be evaluated. This model will interpret the target feature as the probability of class membership, and will create these estimated probabilities based on the mean predicted class probabilities of the trees in the forest.

#### Train the random forest with 100 trees

In [None]:
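# max_features='sqrt' matches the old 'auto' behaviour for classifiers; 'auto' is no longer accepted by newer scikit-learn releases.
rfc = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1)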

#### Fit the model on training dataset

In [None]:
rfc.fit(X_train, y_train)

## Feature importance from the RFC model

In [None]:
pd.DataFrame({'feature': X_train.columns, 'importance':rfc.feature_importances_})

### Interpreting random forests

Interpreting random forest is not as straightforward as linear and logistic regression models. The random forest model is somewhat similar to logistic regression in that it predicts classes, and classifies class membership in a similar way: 

$probability(target=1|descriptive\_features)$

A random forest is made up of many decision trees. For a single decision tree, this probability is estimated as the proportion of examples from the positive class at the leaf node that contains the test example. For the forest as a whole, the probability is estimated as the mean of the predicted class probabilities of the trees in the forest. 

So for example if we use 3 trees, with probability for class 1 on given example being 0.6, 0.7 and 0.8, 
then the probability $probability(target=1|descriptive\_features) = \frac{0.6 + 0.7 + 0.8}{3} = 0.7$.
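The averaging in this example can be written out directly. This is a minimal sketch of the idea only, with a hypothetical function name; a real forest would first obtain each tree's probability from the leaf that the example falls into.

```python
def forest_probability(tree_probabilities):
    """Average the class-1 probabilities predicted by the individual trees."""
    return sum(tree_probabilities) / len(tree_probabilities)


# The three tree probabilities from the example above.
p = forest_probability([0.6, 0.7, 0.8])
predicted_class = 1 if p > 0.5 else 0
print(round(p, 2), predicted_class)  # 0.7 1
```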

As it is not possible to look at each of the 100 trees and understand how the probability is estimated, the feature importance table above is what can be used.

From looking at the table, it can be seen that:

ExternalRiskEstimate is the most important feature, with an importance of 0.184277

The following 4 features are the next most important in order, all with an importance over 0.1:

- AverageMInFile
- NetFractionRevolvingBurden
- MSinceOldestTradeOpen
- PercentTradesWBalance

The following are the next most important in order, with an importance over 0.05:

- MSinceMostRecentDelq
- NumBank2NatlTradesWHighUtilization

The rest of the features have extremely low importance.

The importance is like a weighting, with the high importance values being more useful in obtaining a prediction for the target feature. In this model, the importance points to a weighting hierarchy which determines which features are more useful for making the prediction, with that hierarchy displayed below:

In [None]:
plotdf = pd.DataFrame({'feature': X_train.columns, 'importance':rfc.feature_importances_})
sortplotdf = plotdf.sort_values(by=['importance'], ascending=False)
ax = sortplotdf.plot.bar(x='feature', y='importance', figsize=(20,10))

## Predict on the training dataset using the trained random forest model

#### Predicted probabilities for all examples. The output is a pair for each example, the first component is the probability of the negative class (class 0) and the second component is the probability of the positive class (class 1).

In [None]:
rfc.predict_proba(X_train.head(100))

#### Use the model to make predictions on the training data, and print the first 100 rows.

In [None]:
rfc_predictions_100 = rfc.predict(X_train.head(100))
rfc_predictions = rfc.predict(X_train)
rfc_predictions_100

#### Print out the actual versus predicted class for the first 100 rows. 

In [None]:
df_true_vs_rfc_train_predicted = pd.DataFrame({'ActualClass': y_train.head(100), 'PredictedClass': rfc_predictions_100})
df_true_vs_rfc_train_predicted

In [None]:
printMetrics(y_train, rfc_predictions)

#### Analysis of the metrics

Here, everything is a perfect score. This is evidence that the model is overfitting, and has learned the training data exactly. This is highly undesirable, and other parameters will have to be used to remedy this.

## Evaluate the model on the test data

In [None]:
rfc_predictions_test = rfc.predict(X_test)
df_true_vs_rfc_predicted_test = pd.DataFrame({'ActualClass': y_test, 'PredictedClass': rfc_predictions_test})
df_true_vs_rfc_predicted_test

In [None]:
printMetrics(y_test, rfc_predictions_test)

#### Analysis of the metrics

There is a massive difference between the training data metrics and the test metrics, with the test results being far less accurate. This is expected, as the model has overfit on the training data, which leads to poor performance on new data. 
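To make the train/test gap concrete, here is a minimal, dependency-free sketch of a model that memorises its training rows. It is not the random forest used above; the `Memoriser` class and the toy data are assumptions chosen only to show how a perfect training score can coexist with a poor test score.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of positions where the prediction matches the label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

class Memoriser:
    """Toy model: stores every training row and falls back to the majority class."""
    def fit(self, X, y):
        self.lookup = {tuple(row): label for row, label in zip(X, y)}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(tuple(row), self.default) for row in X]

# Made-up data, for illustration only.
X_toy_train, y_toy_train = [[1, 0], [2, 1], [3, 0], [4, 1]], [0, 1, 0, 0]
X_toy_test, y_toy_test = [[5, 1], [6, 0], [7, 1], [8, 1]], [1, 0, 1, 1]

model = Memoriser().fit(X_toy_train, y_toy_train)
train_acc = accuracy(y_toy_train, model.predict(X_toy_train))
test_acc = accuracy(y_toy_test, model.predict(X_toy_test))
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```

A large positive gap between the two scores is the symptom described in the analysis; a well-fitted model would show training and test scores that are close together.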

## Random Forest Cross-validation

In [None]:
f1_scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), crossValXdf, crossValYdf, scoring='f1', cv=3)
print("\nF1 CV scores:", f1_scores)
print("\nF1 CV mean score:",f1_scores.mean())

acc_scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), crossValXdf, crossValYdf, scoring='accuracy', cv=3)
print("\nAccuracy CV scores:",acc_scores)
print("\nAccuracy CV mean score:",acc_scores.mean())

mean_abs_scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), crossValXdf, crossValYdf, scoring='neg_mean_absolute_error', cv=3)
print("\nneg_mean_absolute_error CV scores:",mean_abs_scores)
print("\nneg_mean_absolute_error CV mean score:",mean_abs_scores.mean())

mean_squared_scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), crossValXdf, crossValYdf, scoring='neg_mean_squared_error', cv=3)
print("\nneg_mean_squared_error CV scores:",mean_squared_scores)
print("\nneg_mean_squared_error CV mean score:",mean_squared_scores.mean())

r2_scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), crossValXdf, crossValYdf, scoring='r2', cv=3)
print("\nr2 CV scores:",r2_scores)
print("\nr2 CV mean score:",r2_scores.mean())


#### Calculate the out of bag score.

In [None]:
rfc.oob_score_
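For context on what the out of bag score measures: each tree in a random forest is trained on a bootstrap sample of the training rows, and the rows a tree never saw can be used to score it. The sketch below only illustrates the sampling step; the row count is made up and no model is trained.

```python
import random

rng = random.Random(1)
n_rows = 1000  # made-up training-set size

# A bootstrap sample draws n_rows row indices WITH replacement.
in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
out_of_bag = [i for i in range(n_rows) if i not in in_bag]

oob_fraction = len(out_of_bag) / n_rows
print(f"{oob_fraction:.1%} of rows are out-of-bag for this tree")
```

On average roughly a third of the rows are left out of each tree's sample, which is why the out of bag score can act as a built-in validation estimate without a separate hold-out set.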

##### Analysis:

Using cross-validation, which evaluates the model on several train/test splits, the metric scores are quite respectable in comparison to the other predictive models tested, and better than those obtained on the test dataset. However, it cannot be fully determined whether the random forest parameters are also allowing overfitting within the cross-validation folds. Thus, given the level of overfitting witnessed, it is best to avoid this model unless the parameters are changed and the model no longer overfits the training data so heavily.
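The splitting behind `cv=3` can be sketched as follows. This is a simplified, unshuffled split written for illustration (the helper `k_fold_indices` is hypothetical); for classifiers scikit-learn uses a stratified variant that also keeps the class proportions similar in every fold.

```python
def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k contiguous folds; yield (train_idx, test_idx)."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(840, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

With 840 rows and 3 folds, each model is trained on 560 rows and scored on the remaining 280, and every row is used for scoring exactly once.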

## Analysis thus far

##### So far, the linear regression model has the best metrics. Logistic regression is close behind, within a few percentage points. Random Forest, however, overfit the training dataset, and thus its metrics are not reliable.

The accuracy results thus far are:

**Linear Regression**
70.6% for the test model and 72.26% for the cross validation model

**Logistic Regression** 
69.4% for the test model and 71.4% for the cross validation model 

**Random Forest**
69.4% for the test model and 71.4% for the cross validation model 

What's interesting is that the overfit random forest model obtains the same accuracy as the logistic regression model, but still, both of these models are outperformed very slightly by the linear regression model.

The predictive models thus far have been relatively successful in that they outperform a simple model which just always predicts the majority class. This can be shown below: 

In [None]:
crossValYdf.value_counts()

In [None]:
436/840

Thus, 51.9% of the time, the target value will be 0, or "Bad". In all of the predictive models evaluated above, the accuracy ratings were far higher than this. Thus, in helping the company to solve their RiskPerformance target prediction problem, any of the models would do a better job than the simple majority class model.
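The majority-class baseline can be written out explicitly. The sketch below rebuilds the label list from the counts quoted above (436 of 840 rows in class 0) rather than from `crossValYdf`, so the variable names are illustrative only.

```python
from collections import Counter

# Class counts taken from the text above: 436 of 840 rows are class 0 ("Bad").
y_all = [0] * 436 + [1] * (840 - 436)

# A model that always predicts the most common class is right this often.
majority_class, majority_count = Counter(y_all).most_common(1)[0]
baseline_accuracy = majority_count / len(y_all)
print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping should beat this figure on unseen data.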

However, just because these models were run once with certain parameters does not mean they cannot improve. Thus, seeing as the accuracy for all 3 were roughly the same, and the random forest model was overfitting, attempted improvements to all three will be made.

## Linear Regression with one descriptive feature

It was evident from the boxplot correlation that ExternalRiskEstimate was the feature with the highest correlation with the target feature. Thus, instead of doing multiple linear regression, linear regression with one feature, ExternalRiskEstimate, will be evaluated.

In [None]:
features = ['ExternalRiskEstimate']
linreg = LinearRegression().fit(X_train[features], y_train)

In [None]:
# Print the estimated linear regression coefficients.
print("Features: \n", features)
print("Coefficients: \n", linreg.coef_)
print("\nIntercept: \n", linreg.intercept_)

##### A unit increase in ExternalRiskEstimate leads to a 0.02455584 increase in the target feature, with a bias of  -1.286238303310054
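Because the raw regression output is later cut at 0.5, the fitted line implies a single cut-off value of ExternalRiskEstimate. The sketch below derives it from the coefficient and intercept reported above; the helper names `linreg_score` and `classify` are hypothetical.

```python
# Coefficient and intercept as reported above.
coef = 0.02455584
intercept = -1.286238303310054

def linreg_score(external_risk_estimate):
    """Raw linear regression output for one ExternalRiskEstimate value."""
    return intercept + coef * external_risk_estimate

# Value of ExternalRiskEstimate at which the raw score equals the 0.5 cut-off.
boundary = (0.5 - intercept) / coef

def classify(external_risk_estimate, threshold=0.5):
    """Return 1 when the raw score reaches the threshold, otherwise 0."""
    return 1 if linreg_score(external_risk_estimate) >= threshold else 0

print(round(boundary, 2), classify(60), classify(80))
```

In other words, with one feature the thresholded linear regression reduces to a single rule: estimates below the boundary are predicted as class 0 and estimates at or above it as class 1.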

### Predict on the training data

In [None]:
linreg_predictions = linreg.predict(X_train[features])

# Set less than 0.5 to 0, and 0.5 and greater to 1.
preddf = pd.DataFrame(linreg_predictions, columns=['Predicted'])
preddf.loc[preddf.Predicted < 0.5, 'predicted_Yes_No'] = 0 
preddf.loc[preddf.Predicted >= 0.5, 'predicted_Yes_No'] = 1 
preddf = preddf.drop("Predicted", axis=1)

# Reset the indices so the rows align correctly (reset_index returns a new object).
y_train_aligned = y_train.reset_index(drop=True)
preddf = preddf.reset_index(drop=True)

# Show the actual vs predicted values for the first 100 training examples.
actual_vs_predicted_multiplelinreg = pd.concat([y_train_aligned.head(100), preddf.head(100)], axis=1)
print(actual_vs_predicted_multiplelinreg)

In [None]:
printMetrics(y_train, preddf['predicted_Yes_No'])

In [None]:
# Predicted scores for each example. 
linreg_predictions_test = linreg.predict(X_test[features])

preddfTest = pd.DataFrame(linreg_predictions_test, columns=['Predicted'])
preddfTest.loc[preddfTest.Predicted < 0.5, 'predicted_Yes_No'] = 0 
preddfTest.loc[preddfTest.Predicted >= 0.5, 'predicted_Yes_No'] = 1 
preddfTest = preddfTest.drop("Predicted", axis=1)

test_actual_vs_predicted_multiplelinreg = pd.concat([y_test.reset_index(drop=True), preddfTest.reset_index(drop=True)], axis=1)
print(test_actual_vs_predicted_multiplelinreg)


In [None]:
printMetrics(y_test, preddfTest['predicted_Yes_No'])

Here, the test results actually outperform the training results very slightly. Both score .71 for precision, recall, and f1, but the test accuracy is 71.03% versus 70.4% for training.

### Cross validation

In [None]:
predictions = cross_val_predict(linreg, crossValXdf, crossValYdf, cv=3)
predictions

In [None]:
# Apply thresholding
for i, predict in enumerate(predictions):
     predictions[i] = 0 if (predict < 0.5) else 1

In [None]:
# Pass the actual values first, matching the other printMetrics calls.
printMetrics(crossValYdf, predictions)

### Analysis

The cross validated results outperform the test results slightly:

Accuracy
- Test: 71.03%   CV: 72.26%

precision, recall, f1-score:
- Test: .71      CV: .72

So far, this is the most accurate model evaluated on the test data: 71.03% beats the previous high of 70.6% from the multiple linear regression model, while the cross validation accuracy matches that model's 72.26%. Thus, the attempted improvement was a modest success.
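`printMetrics` is defined earlier in the notebook and is not shown in this section. As a hedged stand-in, the sketch below computes the same four quantities for the positive class from scratch; the function `binary_metrics` and the toy labels are assumptions, and the notebook's own helper may average across both classes instead.

```python
def binary_metrics(actual, predicted, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels, for illustration only.
actual_toy    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_toy = [1, 0, 1, 0, 1, 0, 1, 0]
toy_metrics = binary_metrics(actual_toy, predicted_toy)
print(toy_metrics)
```

Accuracy is symmetric in its two arguments, but precision and recall swap when the actual and predicted values are passed in the wrong order, so the argument order matters.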

## Random Forest with changed parameters

It was seen that the random forest model created earlier was overfitting heavily. Thus, an attempt to remedy that will be made.

In [None]:
# Train RF with 100 trees
rfc = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1)

In [None]:
# Fit model on training dataset
rfc.fit(X_train, y_train)

#### Print feature importance

In [None]:
pd.DataFrame({'feature': X_train.columns, 'importance':rfc.feature_importances_})

#### Feature importances can often be more valuable for selecting features than correlations. Thus, the top 5 features by importance will be used for this model. Features with importance over 0.1 in the table above:
'ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'PercentTradesWBalance', 'AverageMInFile', 'MSinceOldestTradeOpen'

In [None]:
# Continuous features
cont_features_improved = ['ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'PercentTradesWBalance', 'AverageMInFile', 'MSinceOldestTradeOpen']

RiskPerformance_dummies_improved = pd.get_dummies(traindf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)

# Continuous and categorical features combined 
features_improved = cont_features_improved + RiskPerformance_dummies_improved.columns.values.tolist()
print("Features: ", features_improved)

In [None]:
traindf_improved = pd.concat([traindf, RiskPerformance_dummies_improved], axis=1)
# Keep the features chosen.
feat_to_keep = features_improved

traindf_improved = traindf_improved.loc[:, feat_to_keep]

In [None]:
traindf_improved.head(5)

#### Create updated dataframes with the new features

In [None]:
X_train_improved = traindf_improved[[x for x in traindf_improved[features_improved] if x not in ['RiskPerformance_Good']]]
y_train_improved = traindf_improved.RiskPerformance_Good

print("\nDescriptive features in X:\n", X_train_improved)
print("\nTarget feature in y:\n", y_train_improved)

#### Create the model

In [None]:
rfc_improved = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1)

In [None]:
# Fit model on training dataset
rfc_improved.fit(X_train_improved, y_train_improved)

In [None]:
pd.DataFrame({'feature': X_train_improved.columns, 'importance':rfc_improved.feature_importances_})

#### All features have relatively high importance compared to the first feature importance table.

In [None]:
rfc_improved.predict_proba(X_train_improved.head(100))

In [None]:
rfc_predictions_100_improved = rfc_improved.predict(X_train_improved.head(100))
rfc_predictions_improved = rfc_improved.predict(X_train_improved)
rfc_predictions_100_improved

In [None]:
df_true_vs_rfc_train_predicted = pd.DataFrame({'ActualClass': y_train_improved.head(100), 'PredictedClass': rfc_predictions_100_improved})
df_true_vs_rfc_train_predicted

In [None]:
printMetrics(y_train, rfc_predictions_improved)

Unfortunately the model is still overfitting, even with the changes in features.

## Evaluate on test data

In [None]:
# Continuous features
cont_features_improved = ['ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'PercentTradesWBalance', 'AverageMInFile', 'MSinceOldestTradeOpen']

RiskPerformance_dummies_improved = pd.get_dummies(testdf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)

# Continuous and categorical features combined 
features_improved = cont_features_improved + RiskPerformance_dummies_improved.columns.values.tolist()
print("Features: ", features_improved)

In [None]:
testdf_improved = pd.concat([testdf, RiskPerformance_dummies_improved], axis=1)
# Keep the features chosen.
feat_to_keep = features_improved

testdf_improved = testdf_improved.loc[:, feat_to_keep]

In [None]:
X_test_improved = testdf_improved[[x for x in testdf_improved[features_improved] if x not in ['RiskPerformance_Good']]]
y_test_improved = testdf_improved.RiskPerformance_Good

print("\nDescriptive features in X:\n", X_test_improved)
print("\nTarget feature in y:\n", y_test_improved)

In [None]:
rfc_predictions_test_impr = rfc_improved.predict(X_test_improved)
df_true_vs_rfc_predicted_test_impr = pd.DataFrame({'ActualClass': y_test_improved, 'PredictedClass': rfc_predictions_test_impr})
df_true_vs_rfc_predicted_test_impr

In [None]:
printMetrics(y_test, rfc_predictions_test_impr)

In [None]:
scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1), crossValXdf, crossValYdf, scoring='accuracy', cv=3)
print(scores)
print(scores.mean())

In [None]:
rfc_improved.oob_score_

### Analysis: 
Compared to the previous random forest model, this model has a marginally higher test accuracy, at 69.8% versus 69.4%. The out of bag score remains unchanged. While a small improvement has been made, the overfitting is still not acceptable.

### Overfitting again was a rather disappointing outcome, and so further parameters will be changed. 

### Previously, 100 estimators were being used. Now, an attempt will be made to find the number of estimators that gives the highest accuracy.

In [None]:
for i in range(10,110,10):
    rfc_improved = RandomForestClassifier(n_estimators=i, max_features='auto', oob_score=True, random_state=1)
    rfc_improved.fit(X_train_improved, y_train_improved) 
    rfc_predictions = rfc_improved.predict(X_test_improved)
    print("Accuracy: ", i, metrics.accuracy_score(y_test_improved, rfc_predictions))

10, 60, 80, and 90 estimators all perform best. 

### In the previous random forest models, the min_samples_leaf parameter was not specified. Thus, the best combination of estimator size and min_samples_leaf size will be found.

In [None]:
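for min_leaf in [1,5,10,50,100,200,500]:
    for n_est in [10,60,80,90]:
        # Vary both parameters and print each value under its own label.
        rfc_improved = RandomForestClassifier(n_estimators=n_est, max_features='auto', oob_score=True, random_state=1, min_samples_leaf=min_leaf)
        rfc_improved.fit(X_train_improved, y_train_improved) 
        rfc_predictions = rfc_improved.predict(X_test_improved)
        print("estimators: ", n_est, " min_samples_leaf: ", min_leaf, " Accuracy: ", metrics.accuracy_score(y_test_improved, rfc_predictions))

The best rows in the recorded output were:

estimators:  10  min_samples_leaf:  10  Accuracy:  0.7301587301587301
estimators:  10  min_samples_leaf:  60  Accuracy:  0.7301587301587301
estimators:  10  min_samples_leaf:  80  Accuracy:  0.7301587301587301
estimators:  10  min_samples_leaf:  90  Accuracy:  0.7301587301587301

These rows were recorded from a run in which n_estimators was fixed at 60 and the two loop variables were printed under each other's labels, so all four describe the same model (60 estimators, min_samples_leaf of 10), which is why they are equal. The tie therefore does not separate the estimator counts. estimators:  10  min_samples_leaf:  10 will be chosen, as it should use fewer computational resources than the other options, and it is evaluated directly below.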
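The search-and-tie-break logic can be separated from the model itself. The sketch below is a minimal, library-free version: `best_combination` and `fake_evaluate` are hypothetical names, and the scores returned by `fake_evaluate` are made up purely so the example runs.

```python
from itertools import product

def best_combination(evaluate, estimator_grid, leaf_grid):
    """Score every (n_estimators, min_samples_leaf) pair; break ties on fewest trees."""
    results = [
        (evaluate(n_estimators, min_samples_leaf), n_estimators, min_samples_leaf)
        for n_estimators, min_samples_leaf in product(estimator_grid, leaf_grid)
    ]
    # Highest accuracy first; among ties prefer fewer estimators, then smaller leaves.
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results[0], results

def fake_evaluate(n_estimators, min_samples_leaf):
    """Placeholder scorer with made-up numbers; swap in a real fit-and-score step."""
    return 0.73 if min_samples_leaf == 10 else 0.70

best, all_results = best_combination(fake_evaluate, [10, 60, 80, 90], [1, 5, 10, 50])
print(best)
```

In the notebook, the scorer would fit a `RandomForestClassifier` with the given parameters and return its accuracy. Scoring on a validation split or with cross-validation, rather than on the test set, would be one way to keep the final test accuracy an unbiased estimate.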

#### Create and fit the new model

In [None]:
rfc_improved2 = RandomForestClassifier(n_estimators= 10, max_features='auto', oob_score=True, random_state=1, min_samples_leaf=10)

In [None]:
# Fit model on training dataset
rfc_improved2.fit(X_train_improved, y_train_improved)

In [None]:
rfc_improved2.predict_proba(X_train_improved.head(100))

In [None]:
rfc_predictions_100_improved2 = rfc_improved2.predict(X_train_improved.head(100))
rfc_predictions_improved2 = rfc_improved2.predict(X_train_improved)
rfc_predictions_100_improved2

In [None]:
printMetrics(y_train_improved, rfc_predictions_improved2)

#### Evaluate the model on the test data

In [None]:
rfc_predictions_test_impr = rfc_improved2.predict(X_test_improved)
df_true_vs_rfc_predicted_test_impr = pd.DataFrame({'ActualClass': y_test_improved, 'PredictedClass': rfc_predictions_test_impr})
df_true_vs_rfc_predicted_test_impr

In [None]:
printMetrics(y_test_improved, rfc_predictions_test_impr)

## Cross validation:

In [None]:
scores = cross_val_score(RandomForestClassifier(n_estimators=10, max_features='auto', oob_score=True, random_state=1, min_samples_leaf=10), crossValXdf, crossValYdf, scoring='accuracy', cv=3)
print(scores)
print(scores.mean())

In [None]:
rfc_improved2.oob_score_

### Analysis:

The changes to the parameters were a huge success. The model is no longer massively overfitting, and there is a 4 percentage point jump in test accuracy over the first random forest that was created.

The training results here are 78.4% accuracy vs the test results of 73.4%. 

Precision, recall and f1-score are .78 across the board for training, and .74,.73,.73 for test.

The cross validation score for the altered parameter random forest model saw an increase from 71.430% in previous models, to 72.857%, accuracy wise.

The OOB score improved from 0.7295918367346939 to 0.7329931972789115.

This model is now the most accurate of those tested.

### These features worked very well with the random forest model, so now an evaluation will be done to see if using these features can improve the linear regression and logistic regression models.

## Multiple linear regression with new features

In [None]:
multiple_linreg = LinearRegression().fit(X_train_improved, y_train_improved)

# Print the intercept
print("\nIntercept: \n", multiple_linreg.intercept_)
print()
# Print the features and coefficients
print("Features and coefficients:", list(zip(features_improved, multiple_linreg.coef_)))

For a unit increase in:
- 'ExternalRiskEstimate' the target feature increases by: 0.018671818287282132
- 'NetFractionRevolvingBurden' the target feature decreases by: 0.0027070379416195035
- 'PercentTradesWBalance' the target feature increases by: 0.001682642850419566
- 'AverageMInFile' the target feature increases by: 0.0022151730368995
- 'MSinceOldestTradeOpen' the target feature increases by: 0.0005034003076700728

### Fit on the training dataset

In [None]:
# Predict on the training descriptive features.
multiple_linreg_predictions = multiple_linreg.predict(X_train_improved)
print("\nPredictions with multiple linear regression: \n")

# reset_index returns a new object, so keep the result to align the rows.
y_train_improved_aligned = y_train_improved.reset_index(drop=True)
noThresPredDf_imp = pd.DataFrame(multiple_linreg_predictions, columns=['Predicted'])

# Show the actual vs predicted values for the first 100 training examples.
actual_vs_predicted_multiplelinreg = pd.concat([y_train_improved_aligned.head(100), noThresPredDf_imp.head(100)], axis=1)

print(actual_vs_predicted_multiplelinreg)

#### Predict with the training set

In [None]:
# Set less than 0.5 to 0, and 0.5 and greater to 1.
preddf_imp = pd.DataFrame(multiple_linreg_predictions, columns=['Predicted'])
preddf_imp.loc[preddf_imp.Predicted < 0.5, 'predicted_Yes_No'] = 0 
preddf_imp.loc[preddf_imp.Predicted >= 0.5, 'predicted_Yes_No'] = 1 
preddf_imp = preddf_imp.drop("Predicted", axis=1)

# Reset the indices so the rows align correctly (reset_index returns a new object).
y_train_improved_aligned = y_train_improved.reset_index(drop=True)
preddf_imp = preddf_imp.reset_index(drop=True)

# Show the actual vs predicted values for the first 100 training examples.
actual_vs_predicted_multiplelinreg = pd.concat([y_train_improved_aligned.head(100), preddf_imp.head(100)], axis=1)
print(actual_vs_predicted_multiplelinreg)

In [None]:
printMetrics(y_train, preddf_imp['predicted_Yes_No'])

#### Predict with the test dataset

In [None]:
test_predictions = multiple_linreg.predict(X_test_improved)
print("Actual values of test:\n", y_test)
print("Predictions on test:", test_predictions)

In [None]:
test_preddf_imp = pd.DataFrame(test_predictions, columns=['Predicted'])
test_preddf_imp.loc[test_preddf_imp.Predicted < 0.5, 'predicted_Yes_No'] = 0 
test_preddf_imp.loc[test_preddf_imp.Predicted >= 0.5, 'predicted_Yes_No'] = 1 
test_preddf_imp = test_preddf_imp.drop("Predicted", axis=1)

In [None]:
printMetrics(y_test, test_preddf_imp['predicted_Yes_No'])

### Analysis

With an accuracy of 74.65% in training and 73.015% in test, this model is well fit. It is also an improvement on the single descriptive feature linear regression model, with the test results here getting 73.015% versus 71.03%.

This model is therefore now the second best model accuracy wise, after the improved random forest model (73.4% in test for the random forest, 73.015% for this model in test).

## Logistic regression with new features

In [None]:
multiple_logreg_imp = LogisticRegression().fit(X_train_improved, y_train_improved)

print("Features: \n", features_improved)
print("Coefficients: \n", multiple_logreg_imp.coef_)
print("\nIntercept: \n", multiple_logreg_imp.intercept_)

For a unit increase in each feature, the log-odds of the positive class (RiskPerformance_Good) change as follows:
- 'ExternalRiskEstimate': the log-odds increase by 4.57080607e-02
- 'NetFractionRevolvingBurden': the log-odds decrease by 2.03006484e-02
- 'PercentTradesWBalance': the log-odds increase by 4.45268159e-05
- 'AverageMInFile': the log-odds increase by 1.22015790e-02
- 'MSinceOldestTradeOpen': the log-odds increase by 2.05936840e-03
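Logistic regression coefficients act on the log-odds of the positive class rather than on the target directly, so exponentiating them gives a more readable odds multiplier. The sketch below does this for four of the coefficients listed above; the dictionary layout is just one convenient way to hold them.

```python
import math

# Logistic regression coefficients as reported above (change in log-odds per unit).
coefficients = {
    "ExternalRiskEstimate": 4.57080607e-02,
    "NetFractionRevolvingBurden": -2.03006484e-02,
    "AverageMInFile": 1.22015790e-02,
    "MSinceOldestTradeOpen": 2.05936840e-03,
}

# exp(coefficient) is the factor by which the odds of class 1 are multiplied.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds x {ratio:.4f}")
```

A ratio above 1 means a unit increase in the feature raises the odds of a "Good" outcome, and a ratio below 1 means it lowers them.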

In [None]:
multiple_logreg_predicted_probs = multiple_logreg_imp.predict_proba(X_train_improved)
print(multiple_logreg_imp.predict_proba(X_train_improved.head(100)))

In [None]:
print(multiple_logreg_imp.predict(X_train_improved.head(100)))

#### Fit the model

In [None]:
multiple_logreg_predicted_class = multiple_logreg_imp.predict(X_train_improved)

In [None]:
multiple_logreg_imp.score(X_train_improved, y_train_improved)

In [None]:
printMetrics(y_train_improved, multiple_logreg_predicted_class)

In [None]:
y_predicted_imp = multiple_logreg_imp.predict(X_test_improved)
print(y_predicted_imp)

In [None]:
print(multiple_logreg_imp.predict_proba(X_test_improved))

In [None]:
printMetrics(y_test_improved, y_predicted_imp)

## Analysis

The training data results show an accuracy of 73.29%, and the test results show an accuracy of 72.61%. This is a large increase over the first Logistic Regression model which had 69.4% for the test result accuracy. The improvements to the model worked well, but it is still behind both the improved multiple linear regression model and the improved random forest model when it comes to accuracy.

(73.4% in test for the random forest, 73.015% for the multiple linear regression model, and 72.61% in this, the logistic regression model).

## Normalise

### Normalising the data put into a predictive model can often lead to improvements, and thus it will be attempted.

In [None]:
features_to_norm = traindf[['ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'PercentTradesWBalance', 'AverageMInFile', 'MSinceOldestTradeOpen']]
norm_features_df = (features_to_norm - features_to_norm.mean()) / (features_to_norm.std())
print(norm_features_df.head(10))

In [None]:
RiskPerformance_dummies = pd.get_dummies(traindf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)


#### Create new dataframes with normalised data

In [None]:
norm_df_train = pd.concat([norm_features_df, RiskPerformance_dummies], axis=1)
norm_df_train

In [None]:
features_to_norm_test = testdf[['ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'PercentTradesWBalance', 'AverageMInFile', 'MSinceOldestTradeOpen']]
norm_features_test_df = (features_to_norm_test - features_to_norm_test.mean()) / (features_to_norm_test.std())
print(norm_features_test_df.head(10))

In [None]:
RiskPerformance_dummies = pd.get_dummies(testdf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)


In [None]:
norm_df_test = pd.concat([norm_features_test_df, RiskPerformance_dummies], axis=1)
norm_df_test

In [None]:
features = ['ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'PercentTradesWBalance', 'AverageMInFile', 'MSinceOldestTradeOpen']

In [None]:
X_train_norm = norm_df_train[[x for x in norm_df_train[features] if x not in ['RiskPerformance_Good']]]
y_train_norm = norm_df_train.RiskPerformance_Good

In [None]:
X_test_norm = norm_df_test[[x for x in norm_df_test[features] if x not in ['RiskPerformance_Good']]]
y_test_norm = norm_df_test.RiskPerformance_Good

#### Fit the model

In [None]:
rfc_improved2.fit(X_train_norm, y_train_norm)

In [None]:
rfc_improved2.predict_proba(X_train_norm.head(100))

In [None]:
rfc_predictions_100_improved2 = rfc_improved2.predict(X_train_norm.head(100))
rfc_predictions_improved2 = rfc_improved2.predict(X_train_norm)
rfc_predictions_100_improved2

In [None]:
printMetrics(y_train_norm, rfc_predictions_improved2)

In [None]:
rfc_predictions_test_impr = rfc_improved2.predict(X_test_norm)
df_true_vs_rfc_predicted_test_impr = pd.DataFrame({'ActualClass': y_test_norm, 'PredictedClass': rfc_predictions_test_impr})
df_true_vs_rfc_predicted_test_impr

In [None]:
printMetrics(y_test_norm, rfc_predictions_test_impr)

In [None]:
rfc_improved2.oob_score_

In [None]:
scores = cross_val_score(RandomForestClassifier(n_estimators=10, max_features='auto', oob_score=True, random_state=1, min_samples_leaf=10), crossValXdf, crossValYdf, scoring='accuracy', cv=3)
print(scores)
print(scores.mean())

## Analysis

The normalisation did not result in an improvement, and in fact saw a decrease in metric performance compared to the improved random forest model. For example, the test result accuracy here is 73% versus 73.4% in the improved model from earlier.
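In the cells above, the test set is standardised with its own mean and standard deviation. A common alternative is to learn the statistics from the training data and reuse them on the test data, so both sets are on the same scale. The sketch below shows that pattern on made-up values; `fit_scaler` and `standardise` are hypothetical helpers, not functions from the notebook.

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Learn the mean and sample standard deviation from training values only."""
    return mean(values), stdev(values)

def standardise(values, mu, sigma):
    """Apply a previously learned mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Made-up values for a single feature, for illustration only.
train_values = [60.0, 70.0, 80.0, 90.0]
test_values = [65.0, 85.0]

mu, sigma = fit_scaler(train_values)
train_scaled = standardise(train_values, mu, sigma)
test_scaled = standardise(test_values, mu, sigma)  # reuse the training statistics
print(train_scaled, test_scaled)
```

Tree-based models split on thresholds, so a rescaling applied consistently to training and test data would generally not be expected to change a random forest's predictions; scaling tends to matter more for models that are sensitive to feature magnitude.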

## A final attempt will be made to improve the random forest, by only using the top 3 features ranked by importance.

In [None]:
# Continuous features
cont_features_improved = ['ExternalRiskEstimate', 'NetFractionRevolvingBurden', 'MSinceOldestTradeOpen']

RiskPerformance_dummies_improved = pd.get_dummies(traindf['RiskPerformance'], prefix='RiskPerformance', drop_first=True)

# Continuous and categorical features combined 
features_improved = cont_features_improved + RiskPerformance_dummies_improved.columns.values.tolist()
print("Features: ", features_improved)

In [None]:
traindf_improved = pd.concat([traindf, RiskPerformance_dummies_improved], axis=1)
# Keep the features chosen.
feat_to_keep = features_improved

traindf_improved = traindf_improved.loc[:, feat_to_keep]

In [None]:
traindf_improved.head(5)

In [None]:
X_train_improved = traindf_improved[[x for x in traindf_improved[features_improved] if x not in ['RiskPerformance_Good']]]
y_train_improved = traindf_improved.RiskPerformance_Good

print("\nDescriptive features in X:\n", X_train_improved)
print("\nTarget feature in y:\n", y_train_improved)

In [None]:
X_test_improved = testdf_improved[[x for x in testdf_improved[features_improved] if x not in ['RiskPerformance_Good']]]
y_test_improved = testdf_improved.RiskPerformance_Good

print("\nDescriptive features in X:\n", X_test_improved)
print("\nTarget feature in y:\n", y_test_improved)

In [None]:
rfc_improved2 = RandomForestClassifier(n_estimators= 10, max_features='auto', oob_score=True, random_state=1, min_samples_leaf=10)

In [None]:
rfc_improved2.fit(X_train_improved, y_train_improved)

In [None]:
rfc_improved2.predict_proba(X_train_improved.head(100))

In [None]:
rfc_predictions_100_improved2 = rfc_improved2.predict(X_train_improved.head(100))
rfc_predictions_improved2 = rfc_improved2.predict(X_train_improved)
rfc_predictions_100_improved2

In [None]:
printMetrics(y_train_improved, rfc_predictions_improved2)

In [None]:
rfc_predictions_test_impr = rfc_improved2.predict(X_test_improved)
df_true_vs_rfc_predicted_test_impr = pd.DataFrame({'ActualClass': y_test_improved, 'PredictedClass': rfc_predictions_test_impr})
df_true_vs_rfc_predicted_test_impr

In [None]:
printMetrics(y_test_improved, rfc_predictions_test_impr)

In [None]:
rfc_improved2.oob_score_

In [None]:
scores = cross_val_score(RandomForestClassifier(n_estimators=10, max_features='auto', oob_score=True, random_state=1, min_samples_leaf=10), crossValXdf, crossValYdf, scoring='accuracy', cv=3)
print(scores)
print(scores.mean())

## Analysis

Choosing the top 3 features by importance did not yield a better result. The test results accuracy is 71.42% versus 73.4% for the best random forest model, and the oob score has reduced from 0.7329931972789115 to 0.70578231292517. This shows that reducing the random forest model to ever smaller numbers of important features does not guarantee an improvement in the model.

# Conclusion

Many different models were tested throughout this notebook, and while none reached a very high accuracy, they were all a large improvement over a simple predictive model which would have just made a prediction based on the majority class. 

For the company and their business problem, it might be tempting to say that the model with the highest accuracy is the best fit. However, accuracy is not the only metric by which a predictive model must be judged. Resource usage such as CPU load, disk size, and RAM usage may be a concern to the company. If this is the case, then the best model would be the linear regression model with one descriptive feature, as it was simple but still competitive on accuracy (71.03% in test).

Explainability is another factor which may be important, especially for a credit company. Sometimes, sacrificing a little accuracy for a more interpretable predictive model is desirable: if someone, such as a returning customer, wishes to know why their predicted risk performance was bad, the company would like to have good reasoning behind the decision. If this is the case, then the random forest model would not be a good fit, as interpreting it is a difficult task. The improved multiple linear regression model might be the best trade-off here between accuracy and interpretability (73.015% in test).

However, if accuracy is the main priority, then the improved random forest model (73.4% in test) would be the best choice for the company.