## In this notebook, I analyze [this Kaggle dataset](https://www.kaggle.com/rikdifos/credit-card-approval-prediction). The dataset consists of demographic and financial data for accounts at an unspecified bank, along with a credit history for each account. A single customer at the bank may have multiple accounts attached to them. The goal of the notebook is to clean up the data and to label each customer as good or bad credit, using all of their accounts. The output of this notebook will later be used to build a model that predicts the credit score of customers from their demographic and financial data, in order to quantify the risk of opening a credit account.

### Data Cleaning

In [311]:
import sys
import time
import pickle
import itertools
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sklearn

Let's begin by loading our two dataframes.

In [95]:
with open("application_record.csv", "r") as app_data:
    app_df = pd.read_csv(app_data)
    

In [130]:
print(app_df)

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0       5008804           M            Y               Y             0   
1       5008805           M            Y               Y             0   
2       5008806           M            Y               Y             0   
3       5008808           F            N               Y             0   
4       5008809           F            N               Y             0   
...         ...         ...          ...             ...           ...   
438552  6840104           M            N               Y             0   
438553  6840222           F            N               N             0   
438554  6841878           F            N               N             0   
438555  6842765           F            N               Y             0   
438556  6842885           F            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
0               427500.0      

In [97]:
with open("credit_record.csv", "r") as credit_data:
    credit_df = pd.read_csv(credit_data)

In [98]:
print(credit_df)

              ID  MONTHS_BALANCE STATUS
0        5001711               0      X
1        5001711              -1      0
2        5001711              -2      0
3        5001711              -3      0
4        5001712               0      C
...          ...             ...    ...
1048570  5150487             -25      C
1048571  5150487             -26      C
1048572  5150487             -27      C
1048573  5150487             -28      C
1048574  5150487             -29      C

[1048575 rows x 3 columns]


Let us see if the ID keys are all unique. If so, we want to change the index to correspond to the ID key. This will save time on later computations. 

In [323]:
print(len(set(app_df["ID"])))

438510


This is a bit unfortunate: there are fewer distinct IDs than rows, so some IDs are duplicated. We need to check whether rows that share an ID also share the rest of their data.

In [332]:
app_df_duplicate_id = app_df.duplicated(subset="ID", keep=False)

print(app_df[app_df_duplicate_id])

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
421211  7702516           F            N               Y             2   
421268  7602432           M            N               Y             0   
421349  7602432           F            N               N             0   
421464  7836971           M            Y               N             1   
421698  7213374           M            Y               N             0   
...         ...         ...          ...             ...           ...   
433158  7282535           F            N               Y             0   
433159  7742853           M            N               Y             0   
433217  7135270           F            N               Y             0   
433666  7091721           F            Y               Y             0   
433789  7618285           F            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
421211          180000.0      

There doesn't seem to be a particularly strong pattern between accounts tied to the same ID. It is possible these are joint accounts, but the demographic data doesn't really line up. The dataset also gives no indication of how these IDs are linked to the corresponding ID in the credit dataframe. Fortunately, we don't have many duplicates of this type, so we can just remove them from the dataset.

In [474]:
app_df = app_df.drop_duplicates(subset="ID", keep=False)
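
To make the effect of `keep=False` concrete, here is a minimal sketch on a made-up dataframe (the `toy_app` frame and its values are illustrative assumptions, not rows from the dataset). It shows that every row sharing an ID is flagged and removed, not only the repeats.

```python
import pandas as pd

# Hypothetical stand-in for app_df: ID 2 appears twice with conflicting data.
toy_app = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "CODE_GENDER": ["M", "F", "M", "F"],
})

# keep=False flags every member of a duplicated group, not just the later repeats.
dup_mask = toy_app.duplicated(subset="ID", keep=False)
n_dup_rows = int(dup_mask.sum())

# Dropping with keep=False removes both conflicting rows for ID 2.
toy_clean = toy_app.drop_duplicates(subset="ID", keep=False)
remaining_ids = toy_clean["ID"].tolist()

print(n_dup_rows, remaining_ids, toy_clean["ID"].is_unique)
```

With the default `keep="first"`, one of the two conflicting rows would survive, which is not what we want when we cannot tell which row is correct.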

Notice that we have yet another issue with duplicates: the same customers seem to have multiple account IDs. We need to identify the cause of the duplication. It is possible these are multiple accounts for the same person, and it is also possible these accounts have different credit histories. We're going to collect all accounts with identical application data together and then see how they behave in the credit data. But before we do so, we will reduce both the application dataframe and the credit dataframe to just their intersection along the ID key. Additionally, to make things easier, we will fill in the null values in the application dataframe.

In [467]:
#Boolean series that tracks indices of app_df whose "ID" keys are also in credit_df

app_boolean = app_df["ID"].isin(credit_df["ID"])

app_df_overlap = app_df.loc[app_boolean]
print(app_df_overlap)

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0       5008804           M            Y               Y             0   
1       5008805           M            Y               Y             0   
2       5008806           M            Y               Y             0   
3       5008808           F            N               Y             0   
4       5008809           F            N               Y             0   
...         ...         ...          ...             ...           ...   
434808  5149828           M            Y               Y             0   
434809  5149834           F            N               Y             0   
434810  5149838           F            N               Y             0   
434811  5150049           F            N               Y             0   
434812  5150337           M            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
0               427500.0      

Let us do the same for the credit data and only keep the overlap.

In [468]:
credit_boolean = credit_df["ID"].isin(app_df["ID"])

credit_df_overlap = credit_df.loc[credit_boolean]
print(credit_df_overlap)

              ID  MONTHS_BALANCE STATUS
92938    5008804               0      C
92939    5008804              -1      C
92940    5008804              -2      C
92941    5008804              -3      C
92942    5008804              -4      C
...          ...             ...    ...
1048570  5150487             -25      C
1048571  5150487             -26      C
1048572  5150487             -27      C
1048573  5150487             -28      C
1048574  5150487             -29      C

[777715 rows x 3 columns]


The account ID in the credit dataframe is not unique for each row, since each account has one row per month of history. So, to check that each account ID refers to a unique account, we will simply compare the number of distinct account IDs in the credit dataframe to the number in the overlapping application dataframe.

In [462]:
print(len(set(app_df_overlap["ID"])))
print(len(set(credit_df_overlap["ID"])))

36457
36457
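
The two filters and the count check above can be sketched on small made-up frames (`toy_app` and `toy_credit` are illustrative assumptions). The sketch only shows the pattern: filter each frame by membership in the other, then confirm both sides describe the same set of IDs.

```python
import pandas as pd

# Hypothetical application data: one row per account.
toy_app = pd.DataFrame({"ID": [1, 2, 3], "CODE_GENDER": ["M", "F", "F"]})

# Hypothetical credit data: several monthly rows per account.
toy_credit = pd.DataFrame({
    "ID": [2, 2, 3, 4],
    "MONTHS_BALANCE": [0, -1, 0, 0],
    "STATUS": ["C", "0", "X", "1"],
})

# Keep only the rows whose ID also appears in the other frame.
app_overlap = toy_app[toy_app["ID"].isin(toy_credit["ID"])]
credit_overlap = toy_credit[toy_credit["ID"].isin(toy_app["ID"])]

# Both overlaps should now cover exactly the same set of account IDs.
same_ids = set(app_overlap["ID"]) == set(credit_overlap["ID"])
print(sorted(set(app_overlap["ID"])), len(credit_overlap), same_ids)
```

Comparing the sets directly is slightly stronger than comparing their sizes, since equal counts alone would not rule out two different sets of IDs.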


This is excellent. It means each account is identified by its ID. We now collect all accounts with the same application data together, with the goal of creating a new dataframe that only has 1 row per customer, rather than 1 row per account, and whose ID column is replaced by a column that tracks all accounts for that customer. 

In [463]:
#Fill in missing values.

app_df_overlap = app_df_overlap.fillna(value="Null")

#Assigning variables to the columns of our app_df dataframe

columns = app_df.columns.values.tolist()
columns_remaining = [x for x in columns if x!='ID']

app_df_overlap_grouped = app_df_overlap.groupby(by=columns_remaining, axis=0)

In [475]:
#Defining a function that pulls out the id keys as a list from a dataframe corresponding to a group in the above groupby.

def IDList(df):
    '''Input is a dataframe with an ID column and other columns, where every row is identical except for having 
        different entries under ID.
       Returns a copy of the dataframe with an added IDList column; every row holds the full list of IDs in the group.'''
    # Work on a copy so we never write into a slice of the caller's dataframe
    # (writing into a slice is what triggers pandas' SettingWithCopyWarning).
    df = df.copy()
    list_of_ids = df["ID"].tolist()

    # Every row gets a reference to the same list of IDs.
    df["IDList"] = [list_of_ids] * len(df)
    return df



In [476]:
#Test
t_0 = time.time()
print(IDList(app_df_overlap.iloc[[0, 1], :]))
t_1 = time.time()
print("Time Elapsed: ", t_1 - t_0)
print(len(IDList(app_df_overlap.iloc[[0, 1], :])))

        ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0  5008804           M            Y               Y             0   
1  5008805           M            Y               Y             0   

   AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS  \
0          427500.0          Working    Higher education     Civil marriage   
1          427500.0          Working    Higher education     Civil marriage   

  NAME_HOUSING_TYPE  DAYS_BIRTH  DAYS_EMPLOYED  FLAG_MOBIL  FLAG_WORK_PHONE  \
0  Rented apartment      -12005          -4542           1                1   
1  Rented apartment      -12005          -4542           1                1   

   FLAG_PHONE  FLAG_EMAIL OCCUPATION_TYPE  CNT_FAM_MEMBERS              IDList  
0           0           0            Null              2.0  [5008804, 5008805]  
1           0           0            Null              2.0  [5008804, 5008805]  
Time Elapsed:  0.0074841976165771484
2


In [479]:
app_df_id_list = app_df_overlap_grouped.apply(IDList)

In [480]:
print(app_df_id_list)

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0       5008804           M            Y               Y             0   
1       5008805           M            Y               Y             0   
2       5008806           M            Y               Y             0   
3       5008808           F            N               Y             0   
4       5008809           F            N               Y             0   
...         ...         ...          ...             ...           ...   
434808  5149828           M            Y               Y             0   
434809  5149834           F            N               Y             0   
434810  5149838           F            N               Y             0   
434811  5150049           F            N               Y             0   
434812  5150337           M            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
0               427500.0      

We can now drop duplicates based on demographic data and we will have our desired dataframe. 

In [482]:
app_df_customer_rows = app_df_id_list.drop_duplicates(subset=columns_remaining)
print(app_df_customer_rows)

             ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  \
0       5008804           M            Y               Y             0   
2       5008806           M            Y               Y             0   
3       5008808           F            N               Y             0   
7       5008812           F            N               Y             0   
10      5008815           M            Y               Y             0   
...         ...         ...          ...             ...           ...   
434797  5148694           F            N               N             0   
434801  5149055           F            N               Y             0   
434806  5149729           M            Y               Y             0   
434810  5149838           F            N               Y             0   
434812  5150337           M            N               Y             0   

        AMT_INCOME_TOTAL      NAME_INCOME_TYPE            NAME_EDUCATION_TYPE  \
0               427500.0      
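
As a possible alternative to `apply` followed by `drop_duplicates`, the same one-row-per-customer shape can be sketched with a single groupby aggregation. This is only a sketch on a made-up frame (`toy` and its values are assumptions), and unlike the dataframe above it does not keep an `ID` column next to `IDList`.

```python
import pandas as pd

# Hypothetical accounts: IDs 10 and 11 share identical application data.
toy = pd.DataFrame({
    "ID": [10, 11, 12, 13],
    "CODE_GENDER": ["M", "M", "F", "F"],
    "AMT_INCOME_TOTAL": [427500.0, 427500.0, 112500.0, 270000.0],
    "OCCUPATION_TYPE": [None, None, "Sales staff", "Sales staff"],
})

# Fill missing values first: by default groupby drops rows whose key is missing.
toy = toy.fillna("Null")
key_cols = [c for c in toy.columns if c != "ID"]

# One row per distinct combination of application data, with all IDs collected in a list.
customers = (
    toy.groupby(key_cols, sort=False)["ID"]
       .agg(list)
       .reset_index(name="IDList")
)
id_lists = customers["IDList"].tolist()
print(customers)
```

Aggregating directly avoids calling a Python function once per group on a full dataframe, which may matter on larger data.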

Our application data is now in exactly the form we want it. We're going to save this dataframe, along with the credit dataframe that only has the overlapping accounts for easy reference in later notebooks. 

In [483]:
app_df_customer_rows.to_csv(r'application_data_with_one_row_per_customer.csv')

In [484]:
credit_df_overlap.to_csv(r'credit_data_only_including_accounts_in_app_data.csv')

### Label Construction

We now create our labels. To do so, we need to pick the question we are trying to answer with this data. For this notebook, we will focus on some very precise questions. We will pick performance windows of 6 months, 12 months, 24 months and lifetime, and assign a label column corresponding to each performance window. We will label a customer as "bad credit", with the label 1, if any of their accounts is either cancelled or goes 60+ days overdue at any point during the performance window. If the performance window is "lifetime", we will consider a customer to be a bad customer if any of their accounts closes within 6 months or is 60+ days overdue at any point in the account's lifetime.

To create these labels, we first need to rearrange our credit dataframe so each row corresponds to a single account number. 

In [487]:
credit_pivot = credit_df_overlap.pivot(index = 'ID', columns = 'MONTHS_BALANCE', values = 'STATUS')

In [488]:
print(credit_pivot)

MONTHS_BALANCE  -60  -59  -58  -57  -56  -55  -54  -53  -52  -51  ...  -9   \
ID                                                                ...        
5008804         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    C   
5008805         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    C   
5008806         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    0   
5008808         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
5008809         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
...             ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
5150482         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
5150483         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    X   
5150484         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    0   
5150485         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
5150487         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
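
To see what `pivot` does to the long-format credit data, here is a minimal sketch on a made-up table (`toy_credit` is an illustrative assumption). Each account becomes one row, each month becomes one column, and months with no record for an account show up as NaN.

```python
import pandas as pd

# Hypothetical long-format credit history: one row per (account, month).
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, -1, -2],
    "STATUS": ["C", "0", "X", "1", "0"],
})

# Wide format: index is the account ID, columns are the months, cells are the status.
toy_pivot = toy_credit.pivot(index="ID", columns="MONTHS_BALANCE", values="STATUS")
print(toy_pivot)
```

Note that `pivot` raises a `ValueError` if any (ID, MONTHS_BALANCE) pair occurs more than once, so a successful pivot also confirms there is at most one status per account and month.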

Let's add some more information on each account to this dataframe. We need to know the month in which the account was opened, and, if the account was ever closed, the month in which it was closed.

In [498]:
def OpenMonth(ID):
    '''Given an ID in the credit_pivot dataframe, return the month the account was opened.'''
    ID_Balance = credit_pivot.loc[ID]
    
    return ID_Balance.first_valid_index()

#Test

for i in [5008804, 5008805, 5008809, 5150482, 5150484, 5150487]:
    print(OpenMonth(i), credit_pivot.loc[i, range(-60, OpenMonth(i)+1)])



-15 MONTHS_BALANCE
-60    NaN
-59    NaN
-58    NaN
-57    NaN
-56    NaN
-55    NaN
-54    NaN
-53    NaN
-52    NaN
-51    NaN
-50    NaN
-49    NaN
-48    NaN
-47    NaN
-46    NaN
-45    NaN
-44    NaN
-43    NaN
-42    NaN
-41    NaN
-40    NaN
-39    NaN
-38    NaN
-37    NaN
-36    NaN
-35    NaN
-34    NaN
-33    NaN
-32    NaN
-31    NaN
-30    NaN
-29    NaN
-28    NaN
-27    NaN
-26    NaN
-25    NaN
-24    NaN
-23    NaN
-22    NaN
-21    NaN
-20    NaN
-19    NaN
-18    NaN
-17    NaN
-16    NaN
-15      X
Name: 5008804, dtype: object
-14 MONTHS_BALANCE
-60    NaN
-59    NaN
-58    NaN
-57    NaN
-56    NaN
-55    NaN
-54    NaN
-53    NaN
-52    NaN
-51    NaN
-50    NaN
-49    NaN
-48    NaN
-47    NaN
-46    NaN
-45    NaN
-44    NaN
-43    NaN
-42    NaN
-41    NaN
-40    NaN
-39    NaN
-38    NaN
-37    NaN
-36    NaN
-35    NaN
-34    NaN
-33    NaN
-32    NaN
-31    NaN
-30    NaN
-29    NaN
-28    NaN
-27    NaN
-26    NaN
-25    NaN
-24    NaN
-23    NaN
-22    Na

In [552]:
def CloseMonth(ID):
    '''Given an ID in the credit_pivot dataframe, return the month the account was closed, or the sentinel 100
        if the account was still open at month 0.'''
    ID_Balance = credit_pivot.loc[ID]
    last_month = ID_Balance.last_valid_index()
    
    if last_month==0:
        return 100
    else: 
        return last_month

#Test

for i in [5008804, 5008805, 5008809, 5150482, 5150484, 5150487]:
    print(CloseMonth(i), credit_pivot.loc[i, range(CloseMonth(i), 1)])

100 Series([], Name: 5008804, dtype: object)
100 Series([], Name: 5008805, dtype: object)
-22 MONTHS_BALANCE
-22      X
-21    NaN
-20    NaN
-19    NaN
-18    NaN
-17    NaN
-16    NaN
-15    NaN
-14    NaN
-13    NaN
-12    NaN
-11    NaN
-10    NaN
-9     NaN
-8     NaN
-7     NaN
-6     NaN
-5     NaN
-4     NaN
-3     NaN
-2     NaN
-1     NaN
 0     NaN
Name: 5008809, dtype: object
-11 MONTHS_BALANCE
-11      C
-10    NaN
-9     NaN
-8     NaN
-7     NaN
-6     NaN
-5     NaN
-4     NaN
-3     NaN
-2     NaN
-1     NaN
 0     NaN
Name: 5150482, dtype: object
100 Series([], Name: 5150484, dtype: object)
100 Series([], Name: 5150487, dtype: object)


In [554]:
def MonthsOpen(ID):
    '''Given an ID in the credit dataframe, return the number of months the account was open during the measured window.
        If the account did not close, returns a number bigger than or equal to 100.'''
    return CloseMonth(ID)-OpenMonth(ID)+1

Let's add this information to our dataframe.

In [555]:
account_series = credit_pivot.index.to_series()

credit_with_months = credit_pivot.copy()
credit_with_months["Opening_Month"] = account_series.apply(OpenMonth)
credit_with_months["Closing_Month"] = account_series.apply(CloseMonth)
credit_with_months["Account_Life"] = account_series.apply(MonthsOpen)
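
The three columns can also be sketched without calling a function once per account, by aggregating the long-format table directly. This is a sketch under two assumptions: every row of the long table has a non-null STATUS, so the first and last valid months are simply the minimum and maximum of MONTHS_BALANCE, and the made-up `toy_credit` frame stands in for the real data.

```python
import pandas as pd

# Hypothetical long-format credit history.
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, -3, -4],
    "STATUS": ["C", "0", "X", "1", "0"],
})

STILL_OPEN = 100  # same sentinel that CloseMonth returns for accounts still open at month 0

# First and last recorded month per account.
months = toy_credit.groupby("ID")["MONTHS_BALANCE"].agg(["min", "max"])
opening = months["min"]

# Keep the last month where it is not 0; otherwise substitute the sentinel.
closing = months["max"].where(months["max"] != 0, STILL_OPEN)
account_life = closing - opening + 1

summary = pd.DataFrame({
    "Opening_Month": opening,
    "Closing_Month": closing,
    "Account_Life": account_life,
})
print(summary)
```

Here account 1 is still open at month 0, so it gets the sentinel closing month, while account 2 closed at month -3 after 2 months.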

In [611]:
print(credit_with_months)

MONTHS_BALANCE  -60  -59  -58  -57  -56  -55  -54  -53  -52  -51  ...   -6  \
ID                                                                ...        
5008804         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    C   
5008805         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    C   
5008806         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    C   
5008808         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
5008809         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
...             ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
5150482         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
5150483         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    X   
5150484         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...    0   
5150485         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN   
5150487         NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Let's order our accounts by their opening date to get a sense of when the latest accounts were opened. 

In [556]:
credit_by_opening_date = credit_with_months.sort_values(by="Opening_Month", axis=0)

credit_opening_month_groups = credit_by_opening_date.groupby("Opening_Month")

opening_month_counts = credit_opening_month_groups["Opening_Month"].count()

print(opening_month_counts)

print(opening_month_counts.loc[-5:1].sum()/opening_month_counts.sum())

Opening_Month
-60    321
-59    307
-58    333
-57    304
-56    345
-55    368
-54    358
-53    377
-52    463
-51    476
-50    447
-49    431
-48    459
-47    481
-46    447
-45    457
-44    447
-43    521
-42    540
-41    552
-40    590
-39    675
-38    591
-37    548
-36    533
-35    584
-34    586
-33    560
-32    574
-31    597
-30    617
-29    623
-28    639
-27    682
-26    620
-25    694
-24    703
-23    696
-22    669
-21    650
-20    710
-19    673
-18    706
-17    807
-16    785
-15    774
-14    774
-13    773
-12    771
-11    828
-10    798
-9     770
-8     820
-7     889
-6     824
-5     816
-4     765
-3     800
-2     643
-1     551
 0     315
Name: Opening_Month, dtype: int64
0.10670104506679101
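
The last number printed above is the share of accounts opened in months -5 through 0. A minimal sketch of the slicing involved, on made-up counts (`toy_counts` is an illustrative assumption): `.loc` slices by label and includes both endpoints, and on a sorted index the upper bound may lie past the last label.

```python
import pandas as pd

# Hypothetical number of accounts opened in each month.
toy_counts = pd.Series(
    [30, 20, 25, 15, 10],
    index=[-4, -3, -2, -1, 0],
    name="Opening_Month",
)

# Label-based slice: selects months -1 and 0; the bound 1 is not in the index but is allowed.
recent = toy_counts.loc[-1:1]
recent_share = recent.sum() / toy_counts.sum()
print(recent.index.tolist(), recent_share)
```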


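To construct our labels, we will define an auxiliary function. Since we only care about whether an account was ever 60 or more days overdue, we can use a trick to make our life easier: replace every status we count as fewer than 60 days overdue ('0', 'C' and 'X') by 1, and every status we count as 60 or more days overdue ('1' through '5') by 0. The product of an account's entries is then 1 exactly when none of its recorded months is delinquent.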

In [649]:
replacement_dict = {'0': 1, 'C': 1, 'X': 1, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0}

In [650]:
credit_binary = credit_pivot.replace(replacement_dict)
credit_binary["Opening_Month"] = credit_with_months["Opening_Month"]
credit_binary["Closing_Month"] = credit_with_months["Closing_Month"]
credit_binary["Account_Life"] = credit_with_months["Account_Life"]
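
The product trick can be checked on a few made-up histories (the series below are illustrative assumptions). The sketch relies on the default behaviour of `Series.product`, which skips NaN values and returns 1 for a series with no valid entries.

```python
import numpy as np
import pandas as pd

# 1 = month not delinquent, 0 = delinquent month, NaN = no record for that month.
clean_history = pd.Series([np.nan, 1.0, 1.0, 1.0])
late_history = pd.Series([np.nan, 1.0, 0.0, 1.0])

# A single 0 anywhere makes the product 0, so the label becomes 1 (bad).
clean_label = 1 - clean_history.product()
late_label = 1 - late_history.product()

# With no recorded months at all, the product is 1 and the label is 0 (good).
empty_label = 1 - pd.Series([np.nan, np.nan]).product()
print(clean_label, late_label, empty_label)
```

One consequence worth keeping in mind is that a window with no recorded months is labelled good by the product alone, so closed accounts have to be handled separately.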

In [661]:
def CreditRating(ID, period):
    '''Input: An ID number for an account in the credit dataframe and a time period to measure performance.
              Period must be an int between 1 and 60, or the string LIFETIME.
       Returns a binary credit rating based on the desired performance window. 1 = bad credit, 0 = good credit.'''
    
    ID_credit_history = credit_binary.loc[ID, range(-60, 1)]

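    if period=="LIFETIME":
        # product() skips NaN, so this is 1 only if every recorded month is 1 (not delinquent).
        return 1 - ID_credit_history.product()
    
    else: 
        ID_open = credit_binary.loc[ID, "Opening_Month"]
        # The window ends `period` months after opening, capped at month 0 (range excludes m).
        m = min(ID_open + period, 1)
        # An account that closed within the window counts as bad.
        if period >= credit_binary.loc[ID, "Account_Life"]:
            return 1
        else: 
            return 1 - credit_binary.loc[ID, range(ID_open, m)].product()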
    

Let's run some tests on this function.

In [659]:
print(credit_binary.loc[5008804])

print(CreditRating(5008804, "LIFETIME"))
print(CreditRating(5008804, 6))
print(CreditRating(5008804, 1))



MONTHS_BALANCE
-60                NaN
-59                NaN
-58                NaN
-57                NaN
-56                NaN
-55                NaN
-54                NaN
-53                NaN
-52                NaN
-51                NaN
-50                NaN
-49                NaN
-48                NaN
-47                NaN
-46                NaN
-45                NaN
-44                NaN
-43                NaN
-42                NaN
-41                NaN
-40                NaN
-39                NaN
-38                NaN
-37                NaN
-36                NaN
-35                NaN
-34                NaN
-33                NaN
-32                NaN
-31                NaN
-30                NaN
-29                NaN
-28                NaN
-27                NaN
-26                NaN
-25                NaN
-24                NaN
-23                NaN
-22                NaN
-21                NaN
-20                NaN
-19                NaN
-18                

In [660]:
print(credit_binary.loc[5008808])

print(CreditRating(5008808, "LIFETIME"))
print(CreditRating(5008808, 1))
print(CreditRating(5008808, 6))
print(CreditRating(5008808, 5))
print(CreditRating(5008808, 4))

MONTHS_BALANCE
-60                NaN
-59                NaN
-58                NaN
-57                NaN
-56                NaN
-55                NaN
-54                NaN
-53                NaN
-52                NaN
-51                NaN
-50                NaN
-49                NaN
-48                NaN
-47                NaN
-46                NaN
-45                NaN
-44                NaN
-43                NaN
-42                NaN
-41                NaN
-40                NaN
-39                NaN
-38                NaN
-37                NaN
-36                NaN
-35                NaN
-34                NaN
-33                NaN
-32                NaN
-31                NaN
-30                NaN
-29                NaN
-28                NaN
-27                NaN
-26                NaN
-25                NaN
-24                NaN
-23                NaN
-22                NaN
-21                NaN
-20                NaN
-19                NaN
-18                

In [653]:
print(credit_binary.loc[5008809])

print(CreditRating(5008809, "LIFETIME"))
print(CreditRating(5008809, 1))
print(CreditRating(5008809, 6))
print(CreditRating(5008809, 5))
print(CreditRating(5008809, 4))

MONTHS_BALANCE
-60               NaN
-59               NaN
-58               NaN
-57               NaN
-56               NaN
-55               NaN
-54               NaN
-53               NaN
-52               NaN
-51               NaN
-50               NaN
-49               NaN
-48               NaN
-47               NaN
-46               NaN
-45               NaN
-44               NaN
-43               NaN
-42               NaN
-41               NaN
-40               NaN
-39               NaN
-38               NaN
-37               NaN
-36               NaN
-35               NaN
-34               NaN
-33               NaN
-32               NaN
-31               NaN
-30               NaN
-29               NaN
-28               NaN
-27               NaN
-26               1.0
-25               1.0
-24               1.0
-23               1.0
-22               1.0
-21               NaN
-20               NaN
-19               NaN
-18               NaN
-17               NaN
-16              

Note that it is easy to adjust the definition of the function if we later decide to change our stance on closed accounts. This function is also flexible enough that if we later decide to make our period depend in some way on the month the account was opened, it can still be applied to that problem. We will do this later when we use vintage analysis to bucket accounts into the bottom percentiles of their cohort. Let's add these labels to our credit dataframe for periods = 6, 12, 24 and LIFETIME.

In [665]:
credit_with_labels = credit_with_months.copy()

In [667]:
credit_with_labels["6_Month"] = account_series.apply(lambda x: CreditRating(x, 6))

In [668]:
credit_with_labels["12_Month"] = account_series.apply(lambda x: CreditRating(x, 12))

In [669]:
credit_with_labels["24_Month"] = account_series.apply(lambda x: CreditRating(x, 24))

In [670]:
credit_with_labels["Lifetime"] = account_series.apply(lambda x: CreditRating(x, 'LIFETIME'))

Let's do some quick counts on how many good/bad accounts there are for each window. 

In [680]:
print(credit_with_labels.groupby("6_Month").size())

print(credit_with_labels.groupby("12_Month").size())

print(credit_with_labels.groupby("24_Month").size())

print(credit_with_labels.groupby("Lifetime").size())

6_Month
0.0    32164
1.0     4293
dtype: int64
12_Month
0.0    28310
1.0     8147
dtype: int64
24_Month
0.0    24550
1.0    11907
dtype: int64
Lifetime
0.0    32166
1.0     4291
dtype: int64


For the 6 month and lifetime windows, about 12% of accounts are delinquent (4293 and 4291 out of 36457). For the 12 month window, the figure is about 22%, and for 24 months it is about 33%. Let us save this final credit dataframe and use it to construct labels for customers, rather than just accounts.
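
The delinquency rates can be recomputed directly from the counts printed above. The sketch below copies those counts by hand into a dictionary (the names `bad_counts` and `total_accounts` are just for illustration) and converts them to percentages.

```python
# Number of accounts labelled 1 (bad) in each window, copied from the printed output.
bad_counts = {"6_Month": 4293, "12_Month": 8147, "24_Month": 11907, "Lifetime": 4291}

# Total number of accounts (good plus bad) in every window.
total_accounts = 36457

# Share of bad accounts per window, as a percentage rounded to one decimal place.
bad_rates = {
    window: round(100 * n_bad / total_accounts, 1)
    for window, n_bad in bad_counts.items()
}
print(bad_rates)
```

On the real dataframe, taking the mean of each label column should give the same shares, since the labels are 0 or 1.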