<a href="https://colab.research.google.com/github/RozitaAbdoli/credit_default_mining/blob/main/EDA_frequent_patterns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## EDA: Association Rule Mining using Frequent Pattern Growth Algorithm
* An association rule has the form LHS (left-hand side) => RHS (right-hand side)
* LHS U RHS is called an itemset
* The “support” (prevalence) of a rule is the number of times the itemset LHS U RHS occurs in the dataset. An itemset whose support meets a minimum threshold is called a “frequent itemset” or a “frequent pattern”.
* The “confidence” (strength) of a rule = support(LHS U RHS)/support(LHS), i.e. the probability that the items in RHS occur given that the items in LHS have occurred.
* Here, the support threshold is set at 1000 occurrences, and the confidence threshold is set at 0.8.
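
The definitions above can be checked by hand on a tiny example. The sketch below uses made-up transactions (not rows from the credit dataset) and a hypothetical helper, `support_count`, to compute the support and confidence of the rule `['Female', 'University'] => ['no_default']`.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    ["Female", "University", "no_default"],
    ["Female", "University", "default"],
    ["Female", "High School", "no_default"],
    ["Male", "University", "no_default"],
    ["Female", "University", "no_default"],
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items.issubset(t))

lhs = ["Female", "University"]
rhs = ["no_default"]

support_lhs = support_count(lhs, transactions)         # transactions containing LHS
support_rule = support_count(lhs + rhs, transactions)  # transactions containing LHS U RHS
confidence = support_rule / support_lhs

print(support_lhs, support_rule, round(confidence, 3))  # 3 2 0.667
```

With a confidence threshold of 0.8, this toy rule (confidence about 0.67) would be discarded.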



In [None]:
#install and import necessary libraries
!pip install pyfpgrowth 
import os
import pyfpgrowth
import pandas as pd



In [None]:
#Import Drive API and authenticate
from google.colab import drive
#Mount Drive to the Colab VM
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Load the dataset into pandas DataFrame
df = pd.read_csv("/content/drive/MyDrive/Capstone_project/v2_credit_default.csv")

In [None]:
# Keep the categorical attributes. Numeric attributes could be binned into categories and added to this list too.
# .copy() makes df2 an independent DataFrame, so assigning to its columns later does not raise SettingWithCopyWarning.
df2 = df[['SEX', 'EDUCATION', 'MARRIAGE', 'Repay_Sept',
          'Repay_Aug', 'Repay_July', 'Repay_June', 'Repay_May', 'Repay_Apr', 'Default']].copy()
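
As the comment in this cell notes, numeric attributes can be converted to categorical items before mining. One possible way is `pd.cut`. The sketch below is illustrative only: the column name `AGE`, the bin edges, and the labels are assumptions, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical numeric column; the name, bin edges and labels are assumptions
toy = pd.DataFrame({"AGE": [23, 31, 45, 58, 67]})

# pd.cut assigns each value to a right-closed interval, e.g. (20, 30]
toy["AGE_group"] = pd.cut(
    toy["AGE"],
    bins=[20, 30, 40, 50, 60, 80],
    labels=["Age_21_30", "Age_31_40", "Age_41_50", "Age_51_60", "Age_61_80"],
).astype(str)  # plain strings, so each label can act as an item in a transaction

print(toy["AGE_group"].tolist())
```

Prefixing each label with the attribute name keeps the binned items distinct from items of other attributes.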


In [None]:
# turn off chain assignment warning for the next cell
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
# Replace the numeric codes with readable labels so the association rules are easier to interpret.
# Note: 'Others' is used for both EDUCATION and MARRIAGE, so these two items are indistinguishable in the rules.
df2['SEX'] = df2['SEX'].replace({1: 'Male', 2: 'Female'})
df2['EDUCATION'] = df2['EDUCATION'].replace({1: 'Graduate School', 2: 'University', 3: 'High School', 4: 'Others'})
df2['MARRIAGE'] = df2['MARRIAGE'].replace({1: 'Married', 2: 'Single', 3: 'Others'})
# Repayment status: prefix every label with its month so items from different months stay distinct.
for month in ['Apr', 'May', 'June', 'July', 'Aug', 'Sept']:
    labels = {-2: f'{month}_not_used', -1: f'{month}_duly_payment', 0: f'{month}_revolving_credit', 1: f'{month}_late_1month'}
    labels.update({k: f'{month}_late_{k}months' for k in range(2, 9)})
    df2[f'Repay_{month}'] = df2[f'Repay_{month}'].replace(labels)
df2['Default'] = df2['Default'].replace({0: 'no_default', 1: 'default'})

In [None]:
# Convert df2 to a list of lists (one list per row) and look at the first 2 rows.
df2_list = df2.values.tolist()
print(df2_list[0:2])

[['Female', 'University', 'Married', 'Sept_late_2months', 'Aug_late_2months', 'July_duly_payment', 'June_duly_payment', 'May_not_used', 'Apr_not_used', 'default'], ['Female', 'University', 'Single', 'Sept_duly_payment', 'Aug_late_2months', 'July_revolving_credit', 'June_revolving_credit', 'May_revolving_credit', 'Apr_late_2months', 'default']]
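
FP-growth begins by counting how often each single item occurs and discarding the items that fall below the support threshold. A minimal sketch of that first pass, using only the two rows printed above and an illustrative threshold of 2 (the notebook itself uses 1000 on the full dataset):

```python
from collections import Counter

# The two rows printed above
rows = [
    ['Female', 'University', 'Married', 'Sept_late_2months', 'Aug_late_2months',
     'July_duly_payment', 'June_duly_payment', 'May_not_used', 'Apr_not_used', 'default'],
    ['Female', 'University', 'Single', 'Sept_duly_payment', 'Aug_late_2months',
     'July_revolving_credit', 'June_revolving_credit', 'May_revolving_credit',
     'Apr_late_2months', 'default'],
]
threshold = 2  # illustrative value only

# Count every item across all rows, then keep the items that meet the threshold
item_counts = Counter(item for row in rows for item in row)
frequent_items = sorted(item for item, n in item_counts.items() if n >= threshold)

print(frequent_items)  # ['Aug_late_2months', 'Female', 'University', 'default']
```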


In [None]:
patterns = pyfpgrowth.find_frequent_patterns(df2_list, support_threshold = 1000)    

In [None]:
rules = pyfpgrowth.generate_association_rules(patterns, confidence_threshold = 0.8) 

#### Some interesting association rules, selected from the full output below:
* ['Apr_not_used', 'Aug_not_used', 'Female', 'July_not_used', 'June_not_used', 'May_not_used', 'Sept_not_used'] --> ['no_default'] 0.8676880222841226

~87% of females who did not use their credit in any of the 6 months (April to September) are non-defaulters. This is expected.
* ['Apr_revolving_credit', 'July_revolving_credit', 'June_revolving_credit', 'Male', 'May_revolving_credit', 'University'] --> ['no_default'] 0.8115758343895226

~81% of males with a university degree who used revolving credit from April to July are non-defaulters.
* ['Apr_revolving_credit', 'Female', 'July_revolving_credit', 'June_revolving_credit', 'Married', 'May_revolving_credit', 'University'] --> ['no_default'] 0.8515881708652793

~85% of married, university-educated females who used revolving credit from April to July are non-defaulters.
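
Rules like the three quoted above can also be selected programmatically instead of by eye. The sketch below assumes the rules are stored as `{antecedent_tuple: (consequent_tuple, confidence)}`, which is the shape the printing loop in this notebook relies on. `example_rules` holds only the three quoted rules, and `rules_about` is a hypothetical helper name.

```python
# Same shape as the notebook's `rules` dict: {antecedents: (consequents, confidence)}
example_rules = {
    ('Apr_not_used', 'Aug_not_used', 'Female', 'July_not_used',
     'June_not_used', 'May_not_used', 'Sept_not_used'): (('no_default',), 0.8676880222841226),
    ('Apr_revolving_credit', 'July_revolving_credit', 'June_revolving_credit',
     'Male', 'May_revolving_credit', 'University'): (('no_default',), 0.8115758343895226),
    ('Apr_revolving_credit', 'Female', 'July_revolving_credit', 'June_revolving_credit',
     'Married', 'May_revolving_credit', 'University'): (('no_default',), 0.8515881708652793),
}

def rules_about(rules, target_items):
    """Rules whose consequent contains any target item, highest confidence first."""
    targets = set(target_items)
    selected = [
        (antecedent, consequent, confidence)
        for antecedent, (consequent, confidence) in rules.items()
        if targets & set(consequent)
    ]
    return sorted(selected, key=lambda rule: rule[2], reverse=True)

selected_rules = rules_about(example_rules, ['default', 'no_default'])
for antecedent, consequent, confidence in selected_rules:
    print(list(antecedent), '-->', list(consequent), round(confidence, 3))
```

Passing the full `rules` dict instead of `example_rules` would list every rule that predicts the default outcome, which is usually the part of the output most relevant to credit risk.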


In [None]:
# Print each rule as: antecedent items --> consequent items, confidence
for antecedents, consequents in rules.items():   # keys: antecedent tuples; values: (consequent tuple, confidence)
    antec_list = list(antecedents)
    conseq_list = list(consequents[0])
    print(antec_list, '-->', conseq_list, consequents[1])

['Aug_late_2months', 'July_late_2months', 'May_late_2months'] --> ['June_late_2months'] 0.9617097061442564
['Aug_late_2months', 'June_late_2months', 'May_late_2months'] --> ['July_late_2months'] 0.9694793536804309
['Apr_late_2months', 'July_late_2months', 'June_late_2months'] --> ['May_late_2months'] 0.9496021220159151
['Apr_late_2months', 'July_late_2months', 'May_late_2months'] --> ['June_late_2months'] 0.9572192513368984
['Apr_late_2months', 'June_late_2months', 'May_late_2months'] --> ['July_late_2months'] 0.8014925373134328
['Apr_late_2months', 'June_late_2months'] --> ['May_late_2months'] 0.9403508771929825
['July_late_2months', 'Sept_late_2months'] --> ['Aug_late_2months'] 0.9367396593673966
['June_revolving_credit', 'Sept_late_2months'] --> ['May_revolving_credit'] 0.8841033672670321
['May_revolving_credit', 'Sept_late_2months'] --> ['Apr_revolving_credit'] 0.8527298850574713
['Apr_revolving_credit', 'Sept_late_2months'] --> ['May_revolving_credit'] 0.8539568345323741
['June_no