## Model for predicting approval

Here we try to predict, from personal information, whether an applicant would be approved for a credit card.

Problem statement: According to the Consumer Financial Protection Bureau, approximately 45 million Americans do not have a credit score, preventing most of them from obtaining mortgages, car loans, and even personal loans. This problem stems from a lack of financial literacy among Americans, and it costs millions of households thousands of dollars every year. The easiest way to build credit is through credit cards, where people can develop a credit history and improve their credit over time. Before applying for credit, people must meet a certain set of criteria such as age of credit, income, cost of housing, etc. Banks then use this data to determine creditworthiness, and applicants are either approved or denied, with very little feedback given to applicants who were denied. We propose an ML model to help applicants gauge their creditworthiness before applying, helping them avoid an unnecessary credit score drop from a hard pull on their credit report.


### Data preprocessing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the monthly credit records and the applicant information.
credit = pd.read_csv('credit_record.csv')
application = pd.read_csv('application_record.csv')

# Non-null counts per column for each file.
print(credit.count())
print("\n")
print(application.count())


ID                1048575
MONTHS_BALANCE    1048575
STATUS            1048575
dtype: int64


ID                     438557
CODE_GENDER            438557
FLAG_OWN_CAR           438557
FLAG_OWN_REALTY        438557
CNT_CHILDREN           438557
AMT_INCOME_TOTAL       438557
NAME_INCOME_TYPE       438557
NAME_EDUCATION_TYPE    438557
NAME_FAMILY_STATUS     438557
NAME_HOUSING_TYPE      438557
DAYS_BIRTH             438557
DAYS_EMPLOYED          438557
FLAG_MOBIL             438557
FLAG_WORK_PHONE        438557
FLAG_PHONE             438557
FLAG_EMAIL             438557
OCCUPATION_TYPE        304354
CNT_FAM_MEMBERS        438557
dtype: int64


Here we can see a different row count between the two files we are using. Part of this can be attributed to how credit_record.csv is formatted: it holds one row per ID per month (MONTHS_BALANCE), so a single ID can span many rows. As we will see, credit_record.csv also does not contain all the IDs that application_record.csv contains.

In [5]:
creditIDs = credit.ID.unique()
applicationIDs = application.ID.unique()

# applicationIDs has a lot of IDs that creditIDs does not, so we remove those rows.
totalIDsDiff = np.setdiff1d(applicationIDs, creditIDs)
print("number of different IDs in application")
print(len(totalIDsDiff))

# Appending the unwanted rows twice and dropping every duplicated row leaves only the rows to keep.
f_df = application[application.ID.isin(totalIDsDiff)]
filtered_application = pd.concat([application, f_df, f_df]).drop_duplicates(keep=False)

# Same filter in the other direction: IDs in credit that are missing from application.
totalIDsDiff = np.setdiff1d(creditIDs, applicationIDs)
print("\nnumber of different IDs in credit")
print(len(totalIDsDiff))

f_df = credit[credit.ID.isin(totalIDsDiff)]
filtered_credit = pd.concat([credit, f_df, f_df]).drop_duplicates(keep=False)

# Check that both filtered frames now share exactly the same IDs.
creditIDs = filtered_credit.ID.unique()
applicationIDs = filtered_application.ID.unique()

totalIDsDiff = np.setdiff1d(applicationIDs, creditIDs)
print("\nnumber of different IDs in application")
print(len(totalIDsDiff))

totalIDsDiff = np.setdiff1d(creditIDs, applicationIDs)
print("\nnumber of different IDs in credit")
print(len(totalIDsDiff))
len(filtered_credit.ID.unique())
# Note: this filtering approach is very slow; if you have a better way to filter the dataframe, please use it.

number of different IDs in application
402053

number of different IDs in credit
9528

number of different IDs in application
0

number of different IDs in credit
0


36457
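
As a possible faster alternative to the filter in the cell above, here is a minimal sketch that keeps only the IDs present in both frames with a direct boolean mask. The helper name `keep_shared_ids` and the toy frames are hypothetical, and the sketch assumes the only goal is to keep rows whose ID appears in both files.

```python
import pandas as pd


def keep_shared_ids(application, credit):
    """Return both frames restricted to the IDs that appear in each of them."""
    shared = set(application["ID"]).intersection(credit["ID"])
    return (
        application[application["ID"].isin(shared)],
        credit[credit["ID"].isin(shared)],
    )


# Toy data standing in for the two CSV files (values are made up).
toy_application = pd.DataFrame({"ID": [1, 2, 3, 4], "CNT_CHILDREN": [0, 1, 0, 2]})
toy_credit = pd.DataFrame({
    "ID": [2, 2, 3, 5],
    "MONTHS_BALANCE": [0, -1, 0, 0],
    "STATUS": ["C", "0", "X", "1"],
})

toy_filtered_application, toy_filtered_credit = keep_shared_ids(toy_application, toy_credit)
print(sorted(toy_filtered_application["ID"].unique()))
print(sorted(toy_filtered_credit["ID"].unique()))
```

Unlike the `drop_duplicates(keep=False)` approach, a mask never compares whole rows, so it should also leave any genuinely duplicated rows untouched.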

After that, there are two more preprocessing steps I have identified so far; there could be more. We need to perform a vintage analysis on credit_record.csv to create the label we will be predicting, and then we have to remove some null values.
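
For the null-value step, here is a minimal sketch of two common options. It assumes OCCUPATION_TYPE is the only column with missing entries, which is what the counts printed earlier suggest (304354 non-null values out of 438557 rows). The toy frame and the "Unknown" placeholder are hypothetical, and which option suits the model better is still an open choice.

```python
import pandas as pd

# Toy stand-in for application_record.csv (values are made up).
toy_application = pd.DataFrame({
    "ID": [1, 2, 3],
    "AMT_INCOME_TOTAL": [50000.0, 72000.0, 31000.0],
    "OCCUPATION_TYPE": ["Laborers", None, "Managers"],
})

# Count missing values per column to see which columns need attention.
null_counts = toy_application.isna().sum()
print(null_counts)

# Option 1: drop every row that has a missing value (simple, but loses data).
dropped = toy_application.dropna()

# Option 2: keep the rows and mark the gap with an explicit placeholder category.
filled = toy_application.fillna({"OCCUPATION_TYPE": "Unknown"})
```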

Note on the vintage analysis: this dataset does not contain a label that tells whether someone has been approved. Instead, it contains their loan record. We could have chosen a different dataset that includes an approval label, but as in real scenarios, the available data is not always complete. So instead we use vintage analysis to process the credit record into a label that mimics a credit card approval label. Vintage analysis is one of many popular methods for assessing credit risk. It is used to determine how many months of data to consider for the performance window. If a customer defaults (90 days or more past due) during the performance window, the borrower is considered a risk and would more likely be declined for a credit card. We will use this combined with a proportion of the customer's past-due records to determine their likelihood of being approved for credit: 1 means they meet all criteria for approval, and 0 means they are not approved.
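
A minimal sketch of how such a label could be built from the credit record. It assumes the STATUS codes documented for this Kaggle dataset ('0' to '5' for increasing days past due, with '3', '4', and '5' meaning 90 or more days; 'C' paid off; 'X' no loan that month) and treats each ID's earliest MONTHS_BALANCE as the account opening month. The helper name `make_label`, its parameters, the 12-month default, and the toy records are hypothetical, and the proportion-based part of the rule is not covered here.

```python
import pandas as pd

# Assumption: STATUS codes '3', '4', '5' correspond to 90+ days past due.
BAD_STATUSES_90_DAYS = frozenset({"3", "4", "5"})


def make_label(credit, window_months=12, bad_statuses=BAD_STATUSES_90_DAYS):
    """Return a Series indexed by ID: 1 = no default in the window, 0 = default."""
    records = credit.copy()
    # Months since the earliest record of each ID (assumed to be account opening).
    first_month = records.groupby("ID")["MONTHS_BALANCE"].transform("min")
    records["MONTHS_ON_BOOK"] = records["MONTHS_BALANCE"] - first_month
    # Keep only the records that fall inside the performance window.
    in_window = records[records["MONTHS_ON_BOOK"] < window_months]
    is_bad = in_window["STATUS"].astype(str).isin(bad_statuses)
    defaulted = is_bad.groupby(in_window["ID"]).any()
    return (~defaulted).astype(int).rename("APPROVED")


# Toy credit records (values are made up).
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2, 3],
    "MONTHS_BALANCE": [-2, -1, 0, -3, -2, -1, 0, 0],
    "STATUS": ["0", "C", "C", "0", "1", "2", "3", "X"],
})

labels = make_label(toy_credit)
print(labels)
```

Passing `bad_statuses={"2", "3", "4", "5"}` would give a 60-day cutoff instead, which makes it easy to compare the two definitions of a bad customer.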

NOTE to reader: do not include in the final submission.

What is our model doing, then? What is the justification for our model if vintage analysis already exists, where a user doesn't need a credit pull and can instead provide loan info to get an accurate guess of approval?

My own thought about this is that we are trying to connect the features in application_record.csv to credit approval by linking those features to loan history, and loan history to approval. So no, we don't even need loan history, just the features we decide to use?

I am not quite sure if that makes sense tbh. Are we allowed to make that connection/assumption? If we do this, what we are really doing is predicting good or bad loan history based on these features?

-Patrick

### Link

Link to the vintage analysis done by the author of the dataset:
https://www.kaggle.com/code/rikdifos/eda-vintage-analysis

I looked over this, and it still just seems that they set anyone more than 60 days past due to be a bad customer?

Are we allowed to do this kind of target label creation? It seems kind of arbitrary to me tbh.

We should also start a LaTeX document for this because that's what she asks for.

We need some references, like research papers, and a description of the methods we used, which for us will probably include a lot of data processing and figures. We will be using supervised learning, I think. The link I provided has some nice figures and explanations of what they do for preprocessing.