<div style="text-align:center;">
    <img src="http://www.infante.space/wp-content/uploads/2018/02/Logo-FEUP.png" width="35%"/>
    <h1 style="font-size: 2.5em">Credit Risk Analysis</h1>
    <h2 style="font-size: 2em">Artificial Intelligence 2020/21 - Supervised Learning</h2>
</div>

## Table of contents

1. [Introduction](#Introduction)

2. [License](#License)

3. [Required libraries](#Required-libraries)

4. [The problem domain](#The-problem-domain)

5. [Step 1: Answering the question](#Step-1:-Answering-the-question)

6. [Step 2: Checking the data](#Step-2:-Checking-the-data)

7. [Step 3: Tidying the data](#Step-3:-Tidying-the-data)

    - [Bonus: Testing our data](#Bonus:-Testing-our-data)

8. [Step 4: Exploratory analysis](#Step-4:-Exploratory-analysis)

9. [Step 5: Classification](#Step-5:-Classification)

    - [Cross-validation](#Cross-validation)

    - [Parameter tuning](#Parameter-tuning)

10. [Step 6: Reproducibility](#Step-6:-Reproducibility)

11. [Conclusions](#Conclusions)

12. [Further reading](#Further-reading)

13. [Acknowledgements](#Acknowledgements)

## Introduction

In this project we use Machine Learning, specifically Supervised Learning, to predict the risk that a loan will not be repaid. We use a provided dataset with approximately 800k entries describing previously issued loans and train several machine learning algorithms on it.

## License

Please see the repository README file for the licenses and usage terms for the instructional material and code in this notebook.

## Required libraries

- **numpy** - provides support for large multidimensional arrays and matrices, along with a collection of high-level mathematical functions that operate on them efficiently.
- **pandas** - provides easy-to-use data structures and fast data analysis tools for Python. It makes it possible to carry out data analysis and modelling without switching to a more domain-specific language.
- **scikit-learn** - used for a variety of tasks, including classification, regression, clustering, model selection, and preprocessing, with algorithms such as naive Bayes, gradient boosting, and K-means.
- **matplotlib** - widely used to produce publication-quality figures in a variety of hard copy formats and interactive environments across platforms. Used to draw charts such as pie charts, scatter plots, histograms, and error charts.
- **seaborn** - statistical data visualisation, such as heat maps.
- **watermark** - prints date and time stamps, version numbers, and hardware information.
- **contextlib** - standard library utilities for context managers (here, `redirect_stdout`).
- **time** - standard library timing functions, used to measure how long the imputation takes.

## Initialization

### Importing modules

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from contextlib import redirect_stdout
import matplotlib.pyplot as plt
import seaborn as sb
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
import time
import sys

### Cleaning and Normalization

After a thorough analysis of the available data, we decided to drop a considerable number of columns from the dataset. These are columns whose data is not relevant to the problem (for example identifiers, titles, descriptions, and zip codes) and columns with too many missing values to be used reliably.
We also derived new columns from existing ones, such as date differences and differences between money amounts, and replaced the 'grade' column with the 'sub_grade' column.
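
The "too many missing values" criterion can be checked per column before deciding what to drop. The snippet below is a minimal sketch on toy data, not the procedure used in this notebook: the helper `columns_above_missing_threshold` and the 50% threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(frame: pd.DataFrame, threshold: float) -> list:
    """Return the columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = frame.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Toy data, not the loan dataset: 'b' is mostly empty, 'c' has a single gap.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Hypothetical threshold: flag columns with more than half of the values missing.
sparse_columns = columns_above_missing_threshold(toy, threshold=0.5)
print(sparse_columns)
```

The flagged names could then be passed to `df.drop(columns=...)`, in the same way as the hand-picked list in the next cell.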

#### Removing unnecessary columns

In [2]:
df = pd.read_csv('./resources/data.csv', index_col=0)

# Remove id column and use a default index
df.reset_index(drop=True, inplace=True)

df.drop(columns = [
        'member_id',
        'grade',
        'emp_title',
        'pymnt_plan',
        'desc',
        'title',
        'zip_code',
        'initial_list_status',
        'out_prncp_inv',
        'total_pymnt_inv',
        'funded_amnt_inv',
        'total_rec_prncp',
        'total_rec_int',
        'total_rec_late_fee',
        'collection_recovery_fee',
        'last_pymnt_d',
        'last_pymnt_amnt',
        'next_pymnt_d',
        'last_credit_pull_d',
        'collections_12_mths_ex_med',
        'mths_since_last_major_derog',
        'policy_code',
        'application_type',
        'annual_inc_joint',
        'dti_joint',
        'verification_status_joint',
        'open_acc_6m',
        'open_il_6m',
        'open_il_12m',
        'open_il_24m',
        'mths_since_rcnt_il',
        'total_bal_il',
        'il_util',
        'open_rv_12m',
        'open_rv_24m',
        'max_bal_bc',
        'all_util',
        'inq_fi',
        'total_cu_tl',
        'inq_last_12m',
        'total_rev_hi_lim',
        'open_acc',
        'mths_since_last_record',
        'mths_since_last_delinq'
        ], inplace = True)

df.to_csv("afterRemoving.csv")



#### Renaming, replacing and aggregating

In [7]:

df = pd.read_csv('./afterRemoving.csv', index_col=0)

# Remove "months" in column "term"
df.term = df.term.str.split().str[0]

# Rename column "sub_grade" to "grade"
df.rename(columns={'sub_grade':'grade'}, inplace = True)

# Normalize emp_length: "< 1 year" -> "0", "10+ years" -> "10", "N years" -> "N"
def normalize_emp_length(emp_length):
    # Missing values (NaN) are not strings; keep them missing
    if not isinstance(emp_length, str):
        return None
    if emp_length == '< 1 year':
        return '0'
    elif emp_length == '10+ years':
        return '10'
    else:
        return emp_length.split()[0]

df.emp_length = df.emp_length.apply(normalize_emp_length)

# The column stays float64 because it still contains missing values (NaN),
# which an integer dtype such as int8 cannot hold; they are imputed later
df.emp_length = pd.to_numeric(df.emp_length, downcast='integer')

df.to_csv('afterRename.csv')

### Convert types
To reduce memory usage and processing time, the data type of each column in the dataset was changed according to the range of the data it holds.
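
The effect of narrower types can be measured with `DataFrame.memory_usage`. The following is a small sketch on made-up values shaped like three of the loan columns; the row count and the values themselves are assumptions, and the savings on the real dataset will differ.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; 4 distinct rows repeated 1000 times.
n_repeats = 1000
toy = pd.DataFrame({
    "loan_amnt": np.tile(np.array([5000, 12000, 35000, 8000], dtype="int64"), n_repeats),
    "term": np.tile(np.array([36, 60, 36, 36], dtype="int64"), n_repeats),
    "purpose": ["car", "credit_card", "car", "house"] * n_repeats,
})

before = toy.memory_usage(deep=True).sum()

# Pick the smallest type that still covers the observed range of each column.
compact = toy.astype({"loan_amnt": "uint32", "term": "uint8", "purpose": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"{before} bytes -> {after} bytes")
```

Narrow types only pay off when the values fit: a type that is too small silently loses range or precision, so it is worth checking each column's minimum and maximum first.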

In [13]:
import sys
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from contextlib import redirect_stdout
import matplotlib.pyplot as plt
import seaborn as sb
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation

df = pd.read_csv('./afterRename.csv', index_col=0)

def convertTypes(df):
    df.loan_amnt = df.loan_amnt.astype('uint32')
    df.term = df.term.astype('uint8')
    df.int_rate = df.int_rate.astype('float16')
    df.installment = df.installment.astype('float16')
    df.grade = df.grade.astype('category')
    df.home_ownership = df.home_ownership.astype('category')
    df.annual_inc = df.annual_inc.astype('uint32')
    df.verification_status = df.verification_status.astype('category')
    df.purpose = df.purpose.astype('category')
    df.addr_state = df.addr_state.astype('category')
    df.dti = df.dti.astype('float16')
    df.delinq_2yrs = df.delinq_2yrs.astype('uint8')
    df.inq_last_6mths = df.inq_last_6mths.astype('uint8')
    df.pub_rec = df.pub_rec.astype('uint8')
    df.revol_bal = df.revol_bal.astype('uint32')
    df.total_acc = df.total_acc.astype('uint8')
    df.out_prncp = df.out_prncp.astype('float16')
    df.total_pymnt = df.total_pymnt.astype('float16')
    df.recoveries = df.recoveries.astype('float16')
    df.acc_now_delinq = df.acc_now_delinq.astype('category')
    df.default_ind = df.default_ind.astype('bool')   

    return df
    
def convertTypesImputer(df):
    df.loan_amnt = df.loan_amnt.astype('uint32')
    df.term = df.term.astype('uint8')
    df.int_rate = df.int_rate.astype('float16')
    df.installment = df.installment.astype('float16')
    df.annual_inc = df.annual_inc.astype('uint32')
    df.dti = df.dti.astype('float16')
    df.delinq_2yrs = df.delinq_2yrs.astype('uint8')
    df.inq_last_6mths = df.inq_last_6mths.astype('uint8')
    df.pub_rec = df.pub_rec.astype('uint8')
    df.revol_bal = df.revol_bal.astype('uint32')
    df.total_acc = df.total_acc.astype('uint8')
    df.out_prncp = df.out_prncp.astype('float16')
    df.total_pymnt = df.total_pymnt.astype('float16')
    df.recoveries = df.recoveries.astype('float16')
    df.default_ind = df.default_ind.astype('bool')
    return df
    
def read_and_convert_imp(file):
    df = pd.read_csv(file, index_col=0)
    df = convertTypesImputer(df)
    return df


def read_and_convert(file):
    df = pd.read_csv(file, index_col=0)
    df = convertTypes(df)
    return df

df = convertTypes(df)
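# Parse the date columns (casting to a unit-less 'datetime64' fails on recent pandas versions)
df.earliest_cr_line = pd.to_datetime(df.earliest_cr_line)
df.issue_d = pd.to_datetime(df.issue_d)

# Calendar-month difference between the first credit line and the loan issue date
df['since_first_cr'] = ((df['issue_d'].dt.year - df['earliest_cr_line'].dt.year) * 12
                        + (df['issue_d'].dt.month - df['earliest_cr_line'].dt.month)).astype('uint16')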
df['diff_loan_funded_amnt'] = (df['loan_amnt'] - df['funded_amnt']).astype('uint16')
    
df.drop(columns = [
        'issue_d',
        'earliest_cr_line',
        'funded_amnt'
        ], inplace = True)


df.to_csv("afterConverting.csv")
#convertTypes(df)
#df.info()

# stratified_sample, _ = train_test_split(df, test_size=0.9, stratify=df[['default_ind']])

"""dtc = DecisionTreeClassifier()

feature_cols = ['annual_inc']
X = df[feature_cols] # Features
y = df['default_ind'].values

(training_inputs,
     testing_inputs,
     training_classes,
     testing_classes) = train_test_split(X, y, test_size=0.25)

dtc.fit(training_inputs, training_classes)

prediction_classes = dtc.predict(testing_inputs)

# with open('credit.dot', 'w') as out_file:
#    out_file = tree.export_graphviz(dtc, out_file=out_file)
    
# tree.plot_tree(dtc)

print("Confusion matrix:\n", metrics.confusion_matrix(testing_classes, prediction_classes))
print("Accuracy:",metrics.accuracy_score(testing_classes, prediction_classes))
print("Precision:", metrics.precision_score(testing_classes, prediction_classes, average='weighted'))"""


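## Undersampling

The active code below draws a stratified 10% sample of the dataset, which preserves the original ratio between `default_ind = 0` and `default_ind = 1`. The balanced undersampling (the same number of rows with `default_ind = 0` and `default_ind = 1`) is kept as a disabled block inside the string at the end of the cell.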
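
The difference between a stratified sample and a balanced undersample can be seen on a toy frame. This is an illustrative sketch, not the notebook's pipeline: the data, the 50% sample size, and the `random_state` values are assumptions chosen to keep the example small and repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 90 non-default rows and 10 default rows.
toy = pd.DataFrame({
    "loan_amnt": range(100),
    "default_ind": [0] * 90 + [1] * 10,
})

# Stratified sample: keep half of the rows and preserve the 9:1 class ratio.
stratified, _ = train_test_split(
    toy, test_size=0.5, stratify=toy["default_ind"], random_state=0
)

# Balanced undersample: keep every default row and an equally sized
# random subset of the non-default rows.
defaults = toy[toy["default_ind"] == 1]
non_defaults = toy[toy["default_ind"] == 0].sample(n=len(defaults), random_state=0)
balanced = pd.concat([defaults, non_defaults])

print(stratified["default_ind"].value_counts().to_dict())
print(balanced["default_ind"].value_counts().to_dict())
```

A stratified sample only shrinks the data, while a balanced undersample also changes the class distribution the model is trained on.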

In [14]:
df = read_and_convert('./afterConverting.csv')

# Shuffle the Dataset.
# df = df.sample(frac=1,random_state=4)
# Keep a stratified 10% sample that preserves the class ratio of default_ind
df, _ = train_test_split(df, test_size=0.9, stratify=df[['default_ind']])

df.to_csv("afterSampling.csv")

"""
# Put all the fraud class in a separate dataset.
fraud_df = shuffled_df.loc[shuffled_df['default_ind'] == 1]

# Randomly select 46467 observations from the non-fraud (majority class)
non_fraud_df = shuffled_df.loc[shuffled_df['default_ind'] == 0].sample(n=fraud_df['default_ind'].count(),random_state=42)

# Concatenate both dataframes again
df = pd.concat([fraud_df, non_fraud_df])

# plot the dataset after the undersampling
plt.figure(figsize=(8, 8))
sb.countplot('default_ind', data=df)
plt.title('Balanced Classes')
plt.show()
"""

## Handling Missing Values

In [None]:
df = read_and_convert('./afterSampling.csv')

print("Columns left with missing values before imputation: ")
print(df.isnull().sum())


start = time.time()

newDf = df[df.columns.difference(['grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state'])]
imputer = KNNImputer(n_neighbors=2)
#newDf = imputer.fit_transform(df[['emp_length', 'revol_util', 'tot_coll_amt', 'tot_cur_bal', 'loan_amnt','annual_inc', 'dti', 'delinq_2yrs', 'pub_rec']])
newDf = pd.DataFrame(imputer.fit_transform(newDf), columns = newDf.columns)
print("KNNImputer completed in {} seconds".format(time.time()-start))
newDf.to_csv("Imputed.csv")



#newDf = read_and_convert_imp("Imputed.csv")

# Replace emp_length | revol_util | tot_coll_amt | tot_cur_bal columns with the imputed values
df['emp_length'] = newDf['emp_length'].values
df['revol_util'] = newDf['revol_util'].values
df['tot_coll_amt'] = newDf['tot_coll_amt'].values
df['tot_cur_bal'] = newDf['tot_cur_bal'].values


#np.set_printoptions(threshold=sys.maxsize)
#print(newDf[:,0] )
"""
df.emp_length = df.emp_length.astype('float16')
df.revol_util = df.revol_util.astype('float16')
df.tot_coll_amt = df.tot_coll_amt.astype('float64')
df.tot_cur_bal = df.tot_cur_bal.astype('float64')
"""

df.to_csv("AfterKNNImputer.csv")
newDf.to_csv("test.csv")

# Columns with missing values
print("\nColumns left with missing values after imputation: ")
print(df.columns[df.isnull().any()].tolist())
print(df.isnull().sum())

Columns left with missing values before imputation: 
loan_amnt                   0
term                        0
int_rate                    0
installment                 0
grade                       0
emp_length               4191
home_ownership              0
annual_inc                  0
verification_status         0
purpose                     0
addr_state                  0
dti                         0
delinq_2yrs                 0
inq_last_6mths              0
pub_rec                     0
revol_bal                   0
revol_util                 40
total_acc                   0
out_prncp                   0
total_pymnt                 0
recoveries                  0
acc_now_delinq              0
tot_coll_amt             6792
tot_cur_bal              6792
default_ind                 0
since_first_cr              0
diff_loan_funded_amnt       0
dtype: int64
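
To make the imputation step concrete, here is a minimal sketch of `KNNImputer` on a toy frame. The values are made up; only the column names are borrowed from the dataset. With `n_neighbors=2`, the missing `emp_length` is replaced by the mean of that column over the two rows that are closest on the remaining features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: the third row is missing emp_length.
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 1100.0, 1200.0, 9000.0],
    "emp_length": [2.0, 4.0, np.nan, 10.0],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# The two nearest rows by loan_amnt are 1100 and 1000,
# so the gap is filled with the mean of their emp_length values (4 and 2).
print(imputed)
```

Because the neighbours are found by distance, columns with large magnitudes dominate the search; scaling the features beforehand is a common way to reduce that effect.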


## Removal of Outliers (Z-Score)

In [None]:
from scipy import stats

"""
df = pd.read_csv('./AfterKNNImputer.csv', index_col=0)

z = np.abs(stats.zscore(df.select_dtypes(exclude=['object', 'bool'])))

df.to_csv('original.csv')
#testDf = df.select_dtypes(exclude=['object', 'bool'])
testDf = df
testDf = testDf[(z < 4).all(axis=1)]
testDf.info()

testDf.to_csv('test1.csv')
"""

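The disabled block above filters rows by Z-score. The sketch below shows the same idea on toy data using plain pandas instead of `scipy.stats.zscore`; the helper name `drop_zscore_outliers` and the data are assumptions, and the threshold of 4 mirrors the disabled code.

```python
import pandas as pd


def drop_zscore_outliers(frame: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep the rows whose numeric columns all lie within `threshold` standard deviations."""
    numeric = frame.select_dtypes(include="number")
    # Population standard deviation (ddof=0), the default of scipy.stats.zscore.
    z = ((numeric - numeric.mean()) / numeric.std(ddof=0)).abs()
    return frame[(z < threshold).all(axis=1)]


# Toy data: 29 ordinary incomes (in thousands) and one extreme value.
toy = pd.DataFrame({
    "annual_inc": list(range(40, 69)) + [1000],
    "grade": ["A"] * 30,
})

filtered = drop_zscore_outliers(toy, threshold=4)
print(len(toy), "->", len(filtered))
```

A column with constant values has a standard deviation of zero, which makes every Z-score undefined and drops all rows, so such columns should be excluded before filtering.
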
## Correlation

### Before 

In [None]:
df = read_and_convert('./afterConverting.csv')
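# Correlate only numeric and boolean columns (category columns cannot be correlated)
corr = df.select_dtypes(include=['number', 'bool']).corr()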
corr.style.background_gradient(cmap='coolwarm')

### After

In [None]:
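df = read_and_convert('./AfterKNNImputer.csv')
# Correlate only numeric and boolean columns (category columns cannot be correlated)
corr = df.select_dtypes(include=['number', 'bool']).corr()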
corr.style.background_gradient(cmap='coolwarm')

In [None]:
# sb.stripplot(x='grade', y='annual_inc', hue='default_ind', data=df)

In [None]:
# df.plot(subplots=True, layout=(6,6))
