### Exploratory analysis of data linking demographics and financial status to credit history

The dataset and a description of it are available from [this Kaggle page](https://www.kaggle.com/rikdifos/credit-card-approval-prediction).

In [3]:
import sys
import pickle
import itertools
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sklearn

Let's begin by loading our two dataframes.

In [95]:
with open("application_record.csv", "r") as app_data:
    app_df = pd.read_csv(app_data)
    

In [130]:
print(app_df)

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0       5008804           M            Y               Y             0   
1       5008805           M            Y               Y             0   
2       5008806           M            Y               Y             0   
3       5008808           F            N               Y             0   
4       5008809           F            N               Y             0   
...         ...         ...          ...             ...           ...   
438552  6840104           M            N               Y             0   
438553  6840222           F            N               N             0   
438554  6841878           F            N               N             0   
438555  6842765           F            N               Y             0   
438556  6842885           F            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
0               427500.0      

In [97]:
with open("credit_record.csv", "r") as credit_data:
    credit_df = pd.read_csv(credit_data)

In [98]:
print(credit_df)

              ID  MONTHS_BALANCE STATUS
0        5001711               0      X
1        5001711              -1      0
2        5001711              -2      0
3        5001711              -3      0
4        5001712               0      C
...          ...             ...    ...
1048570  5150487             -25      C
1048571  5150487             -26      C
1048572  5150487             -27      C
1048573  5150487             -28      C
1048574  5150487             -29      C

[1048575 rows x 3 columns]


Notice that the application data contains duplicates: some rows are identical in every column except `ID`, so the ID key doesn't uniquely identify a customer. We need to identify the cause of the duplication. These may be multiple accounts belonging to the same person, and those accounts may have different credit histories. To investigate, we write some functions that help us identify duplicate data. We first create a function that takes an index and returns all duplicates below it in the dataframe, assuming the duplicates are contiguous. We also create a function, mostly for testing purposes, that returns the next index that isn't a duplicate of the given one.

In [100]:
def DuplicateList(df, index, col_list):
    '''Parameters are a pandas dataframe, an index number and a list of column numbers.
       Returns a list of all index values greater than the input index that have identical data under the given columns.
       Assumes the duplicated data in the dataframe is contiguous and that the dataframe
       has the default integer index, since rows are looked up by position.'''

    Duplicates = []
    df_null = df.isnull()
    for i in [x for x in df.index.values if x > index]:
        Dupe_Status = True
        for col in col_list:
            # Two missing values count as a match; otherwise compare the values directly.
            if not (df_null.iloc[index, col] and df_null.iloc[i, col]):
                Dupe_Status = Dupe_Status and (df.iloc[index, col] == df.iloc[i, col])
        if not Dupe_Status:
            # Duplicates are assumed contiguous, so stop at the first mismatch.
            break
        Duplicates.append(i)

    return Duplicates

Let's test this on some of the values we can see in our dataframe. 

In [101]:
print(DuplicateList(df=app_df, index=0, col_list = range(1, len(app_df.columns.values))))

[1]


In [116]:
for i in [2, 3, 4, 438552, 438556]:
    print(DuplicateList(df=app_df, index=i, col_list = range(1, len(app_df.columns.values))))
print(DuplicateList(app_df, index=1315, col_list=range(1, len(app_df.columns.values))))
print(app_df.iloc[[1315, 1316],:])

[]
[4, 5, 6]
[5, 6]
[]
[]
[1316, 1317]
           ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
1315  5010298           F            N               Y             0   
1316  5010299           F            N               Y             0   

      AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
1315          135000.0  Commercial associate  Secondary / secondary special   
1316          135000.0  Commercial associate  Secondary / secondary special   

     NAME_FAMILY_STATUS  NAME_HOUSING_TYPE  DAYS_BIRTH  DAYS_EMPLOYED  \
1315            Married  House / apartment      -14918          -1866   
1316            Married  House / apartment      -14918          -1866   

      FLAG_MOBIL  FLAG_WORK_PHONE  FLAG_PHONE  FLAG_EMAIL OCCUPATION_TYPE  \
1315           1                0           0           0     Sales staff   
1316           1                0           0           0     Sales staff   

      CNT_FAM_MEMBERS  
1315              2.0  
1316  

In [148]:
def NextNonDuplicate(df, index, col_list):
    '''Arguments are a dataframe with contiguous duplicate data, an index value and list of columns to compare
        for duplication.
       Returns the index of the next datapoint that isn't a duplicate of the data at index,
       or None if there is no such datapoint.'''
    n_rows = len(df.index.values)
    # Allow negative indices, counted from the end of the dataframe.
    if index < 0:
        index = n_rows + index
    if index >= n_rows - 1:
        return None
    # Compute the duplicate list once rather than once per branch.
    duplicates = DuplicateList(df, index, col_list)
    if not duplicates:
        return index + 1
    if duplicates[-1] >= n_rows - 1:
        return None
    return duplicates[-1] + 1
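
The helpers above walk the rows in a Python loop and rely on duplicates being contiguous. As a cross-check, pandas' built-in `DataFrame.duplicated` can flag rows that repeat on a chosen subset of columns without that assumption. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not taken from the dataset); it compares every column except `ID`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 differ only in ID, as do rows 2 and 4.
toy_df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "CODE_GENDER": ["M", "M", "F", "M", "F"],
    "CNT_CHILDREN": [0, 0, 1, 2, 1],
})

# Compare on every column except the ID key.
feature_cols = [c for c in toy_df.columns if c != "ID"]

# keep=False marks every member of a duplicated group, not just the repeats.
dupe_mask = toy_df.duplicated(subset=feature_cols, keep=False)

# Number of distinct feature rows, to compare against the total row count.
n_unique = len(toy_df.drop_duplicates(subset=feature_cols))

print(dupe_mask.tolist())
print(n_unique, "unique feature rows out of", len(toy_df))
```

Running the same calls with `app_df` in place of `toy_df` should agree with `DuplicateList` wherever the duplicates really are contiguous.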



We could now use the `drop_duplicates` method in pandas to remove duplicates from `app_df`, but this would create a problem. The account IDs in the application data are linked to the account IDs in the credit data, so if we remove duplicates first, we may lose information. Instead, we adopt a different strategy: we identify the accounts that appear in both dataframes and restrict each dataframe to those accounts, for which we have both application data and credit data. This is a natural step, since these are the only data points we can actually use to fit a model; for any other account, we lack either the feature data or the data needed to construct a label.
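
To illustrate why deduplicating first can lose information, here is a small sketch on made-up data (`toy_app`, `toy_credit` and all their values are hypothetical). Two applications share the same features but have different IDs, and each ID has its own credit history. Dropping duplicate feature rows first discards one ID and leaves its credit records without a matching application.

```python
import pandas as pd

# Hypothetical applications: IDs 10 and 11 have identical features.
toy_app = pd.DataFrame({
    "ID": [10, 11, 12],
    "AMT_INCOME_TOTAL": [50000.0, 50000.0, 72000.0],
})

# Hypothetical credit records: IDs 10 and 11 have different histories.
toy_credit = pd.DataFrame({
    "ID": [10, 10, 11, 11],
    "MONTHS_BALANCE": [0, -1, 0, -1],
    "STATUS": ["C", "C", "1", "0"],
})

# Deduplicate on the feature column only, keeping the first occurrence.
deduped_app = toy_app.drop_duplicates(subset=["AMT_INCOME_TOTAL"])

# Credit rows whose ID no longer appears in the deduplicated applications.
orphaned = toy_credit.loc[~toy_credit["ID"].isin(deduped_app["ID"])]

print(deduped_app["ID"].tolist())
print(orphaned)
```

In this toy case the credit history for ID 11 is orphaned even though it differs from the history for ID 10.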

In [167]:
#Boolean series that tracks indices of app_df whose "ID" keys are also in credit_df

app_boolean = app_df["ID"].isin(credit_df["ID"])

In [171]:
app_df_only_including_overlap = app_df.loc[app_boolean]
print(app_df_only_including_overlap)

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0       5008804           M            Y               Y             0   
1       5008805           M            Y               Y             0   
2       5008806           M            Y               Y             0   
3       5008808           F            N               Y             0   
4       5008809           F            N               Y             0   
...         ...         ...          ...             ...           ...   
434808  5149828           M            Y               Y             0   
434809  5149834           F            N               Y             0   
434810  5149838           F            N               Y             0   
434811  5150049           F            N               Y             0   
434812  5150337           M            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
0               427500.0      

Let us do the same for the credit data and only keep the overlap.

In [172]:
credit_boolean = credit_df["ID"].isin(app_df["ID"])

In [173]:
credit_df_only_including_overlap = credit_df.loc[credit_boolean]
print(credit_df_only_including_overlap)

              ID  MONTHS_BALANCE STATUS
92938    5008804               0      C
92939    5008804              -1      C
92940    5008804              -2      C
92941    5008804              -3      C
92942    5008804              -4      C
...          ...             ...    ...
1048570  5150487             -25      C
1048571  5150487             -26      C
1048572  5150487             -27      C
1048573  5150487             -28      C
1048574  5150487             -29      C

[777715 rows x 3 columns]
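
After filtering both frames, a quick sanity check is that they cover exactly the same set of IDs. The sketch below shows the idea on small made-up frames (`toy_app` and `toy_credit` are hypothetical stand-ins for `app_df` and `credit_df`).

```python
import pandas as pd

# Hypothetical stand-ins for the application and credit dataframes.
toy_app = pd.DataFrame({"ID": [1, 2, 3, 4]})
toy_credit = pd.DataFrame({"ID": [2, 2, 3, 5], "STATUS": ["C", "0", "X", "C"]})

# Keep only the rows whose ID appears in the other frame.
app_overlap = toy_app.loc[toy_app["ID"].isin(toy_credit["ID"])]
credit_overlap = toy_credit.loc[toy_credit["ID"].isin(toy_app["ID"])]

# Both filtered frames should now contain the same set of IDs.
shared_ids = set(app_overlap["ID"])
same_ids = shared_ids == set(credit_overlap["ID"])

print(sorted(shared_ids), same_ids)
```

The same comparison could be applied to `app_df_only_including_overlap` and `credit_df_only_including_overlap`.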
