# COGS 118A - Project Checkpoint

# Names

- Angel Olivas

- Hyunseo Park

- Yuanzhen Zhu

- Eric Dong

# Abstract 
This paper investigates how machine learning models can be used to predict whether a credit card account will default. A default occurs when an individual repeatedly fails to pay their bills, leading to account closure and negative marks on their credit history. The consequences are severe and create problems for both lenders and borrowers. Our goal is therefore to predict when an account is likely to default, so that both parties can be notified or the default can be prevented. The study uses a dataset from Taiwan that covers client statistics such as credit accumulation, payment history, and billing amounts. By analyzing the data of around 30,000 clients, we investigate the relationship between these factors and credit card default. Six different data mining methods were employed to extract insights and patterns from the dataset, and these form the basis for how we clean and manipulate the data to build a model that predicts default as a binary outcome. We chose binary logistic regression as the modeling technique because it handles binary classification tasks effectively and estimates the probability of default from the identified factors. To evaluate the model, we will use AUC-ROC curves, with a specific focus on minimizing the false positive rate. This emphasis is essential because incorrect information about default risk, such as wrongly flagging a client as likely to default, can have significant repercussions.

# Background

Understanding and preventing credit card default is vital to financial risk management. When a credit card account is closed because of unpaid charges, the cardholder's credit history is damaged and their financial burden and stress increase sharply. The outstanding debt must still be paid, and any mortgage, student, or auto loans can also be affected or become more expensive. The burden on the lender increases as well, and they might resort to harsher tactics as time passes.<a name="Akinnote"></a>[<sup>[1]</sup>](#Akin) <br><br>
In economic models, certain relationships between variables, known as "stylized facts," can provide valuable insight into the behavior of credit card default. Such models often rely on broad trends, quantified as aggregate statistics, that describe how resources are distributed and flow through a society. Stylized facts are essentially well-understood relationships that tell us what is happening in a system, but external factors sometimes skew them one way or the other, which causes confusion and calls for investigation. One example is the relationship between interest rates and investment: as interest rates go down, investment generally goes up.<a name="Ouliarisnote"></a>[<sup>[2]</sup>](#Ouliaris) One of the most famous examples came before the 2008 financial crisis in the United States: as interest rates decreased, more people who lacked the capital to do so entered the real estate market. This eventually led to a bust because irresponsible lending to these borrowers became rampant. It is not unheard of, however, for investment to decrease as interest rates go down. In Japan in the 1990s, for example, a period of economic uncertainty created economic behaviour that needed to be stopped.<a name="Neilsennote"></a>[<sup>[3]</sup>](#Neilsen,) When this is the case, it is important to look at smaller-scale models to diagnose what is causing the odd behaviour. Although it would be ideal to build classifier models that properly reflect the real world, predictive models are often needed first to gain the insight required to build them. We aim to help create one of these smaller-scale models for Taiwan, as this is within our ability given the dataset we have, using data on debt collection, payment history, and monthly income level. 
The choice of our features is partially informed by both historical and newer models for consumer default risk,<a name="CostaeSilvanote"></a>[<sup>[4]</sup>](#CostaeSilva,) and supports our aim of creating a smaller model for tracking debt in Taiwan. Ultimately, our goal is to enhance risk assessment practices, facilitate early detection of defaulting behaviors, and foster financial stability in the credit card industry. <br><br>
Previous work in this field includes studies from Malaysia<a name="Sayjadahnote"></a>[<sup>[5]</sup>](#Sayjadah,), India<a name="Mahmudinote"></a>[<sup>[6]</sup>](#Mahmudi,), and UCLA<a name="Guinote"></a>[<sup>[7]</sup>](#Gui), all of which used machine learning to predict credit card defaults. All three studies were conclusive: they compared a range of machine learning techniques, including random forest, XGBoost, and AdaBoost, and identified those that achieved high precision. These replicable studies show that machine learning models can be applied effectively to ease the burden of financial management, and their findings serve as valuable references for our research. We aim to build on this work by developing successful models of our own, contributing to research and providing insight for banks and clients dealing with credit card defaults.

# Problem Statement

The problem we aim to solve is predicting whether a given client's credit card account will default, based on a set of variables describing that client. The outcome is clearly binary: yes or no. Because this is a binary classification problem, the candidate solutions we are considering are logistic regression and possibly a Support Vector Machine. Our dataset has around 30 thousand clients and approximately 23 explanatory variables, which fall into around eight groups. All of the explanatory variables are measurable, since they are simply records of the clients and their credit history. The problem is also replicable: we can train our model on other datasets and again predict whether particular credit card accounts would default.

# Data

UPDATED FROM PROPOSAL!

You should have obtained and cleaned (if necessary) data you will use for this project.

Please give the following information for each dataset you are using
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

Our solution builds features from the columns for credit given, repayment status, bill amounts, education, and amount paid per month. As mentioned in the Data section, some of our features are aggregate statistics that are normalized using z-scores and the normal curve. We need to normalize because our variables differ widely in magnitude: some are single-digit values, while some debt amounts have four or five digits. We are hopeful about this method of normalization, but we recognize that the z-score is particularly sensitive to skew in the data. If the dataset does not follow the normal distribution well enough, we plan to explore L1 normalization techniques, which we expect to be more resistant to outliers. For the statistical model, we will use logistic regression, which assigns a probability to the predicted outcome of a client defaulting or paying off their debt. This also makes the relationships between different features easier to interpret. 
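As a rough illustration of the normalization step, the sketch below standardizes two columns of very different magnitude with a z-score and shows how a single extreme value stretches the scale. The function name `z_scores` and the toy values are hypothetical and are not taken from the dataset; in practice a library routine such as scikit-learn's `StandardScaler` could do the same job.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical toy columns (not real rows from the dataset).
repayment_status = [-1, 0, 0, 1, 2]           # single-digit magnitude
bill_amount = [1200, 1500, 1800, 2100, 2400]  # four-digit magnitude

# After standardization both columns are on a comparable scale.
print([round(z, 2) for z in z_scores(repayment_status)])
print([round(z, 2) for z in z_scores(bill_amount)])

# One extreme bill inflates the standard deviation, so the ordinary
# values are squeezed close together: the skew sensitivity noted above.
skewed_bills = [1200, 1500, 1800, 2100, 95000]
print([round(z, 2) for z in z_scores(skewed_bills)])
```

If the real columns turn out to be as skewed as the last example, that would be a reason to try the alternative normalization mentioned above.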

# Evaluation Metrics

One evaluation metric we could use is AUC-ROC, the 'Area Under the ROC Curve', which is a widely recognized evaluation metric for classification models. The ROC curve is obtained by plotting the model's 'True positive rate' (True Positive/(True Positive + False Negative)) against its 'False positive rate' (1 - (True Negative/(True Negative + False Positive))) as the decision threshold varies. In most models these two rates trade off against each other. A better model has a larger area under the ROC curve, which means it achieves a higher TP rate and a lower FP rate at its optimal operating point.
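To make the definition concrete, here is a minimal sketch that sweeps the decision threshold over a handful of predicted probabilities, records the resulting (FPR, TPR) points, and integrates them with the trapezoid rule. The helper names `roc_points` and `auc`, and the labels and scores, are made up for illustration; for the actual project a tested routine such as scikit-learn's `roc_auc_score` would be the more natural choice.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from sweeping the threshold high to low."""
    pos = sum(y_true)             # number of actual defaults (label 1)
    neg = len(y_true) - pos       # number of actual non-defaults (label 0)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(roc_points(y_true, scores)))  # 0.75
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of defaulters above non-defaulters.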

# Preliminary results

NEW SECTION!

Please show any preliminary results you have managed to obtain.

Examples would include:
- Analyzing the suitability of a dataset or algorithm for prediction/solving your problem
- Performing feature selection or hand-designing features from the raw data. Describe the features available/created and/or show the code for selection/creation
- Showing the performance of a base model/hyper-parameter setting.  Solve the task with one "default" algorithm and characterize the performance level of that base model.
- Learning curves or validation curves for a particular model
- Tables/graphs showing the performance of different models/hyper-parameters



In [None]:
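# Load the credit card default dataset into a pandas DataFrame
import pandas as pd

# Path to the Excel file, relative to the notebook's working directory
file = 'default of credit card clients.xls'
# Reading legacy .xls files requires the xlrd package to be installed
df = pd.read_excel(file)
# Display the DataFrame to inspect its columns and size
df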

# Ethics & Privacy

We recognize that our project may raise real data privacy and ethics issues, depending on which variables we use to build our model. Variables such as gender, marital status, and age can be sensitive and private information, so we will make sure that the data we use is completely anonymous and cannot identify or harm any real person it describes. Additionally, our model is for research purposes only and will not be used to discriminate against any group or person. Lastly, we can use tools such as 'deon' to check for data ethics issues.

# Team Expectations 

* *Our team will converse with each other on our group chat*
* *Our team will split work according to fairness and ability*
* *Our team will schedule meetings within a week in order to accommodate changing schedules*
* *Individual members of the team are expected to follow the Project Timeline Proposal as best as possible*
* *If issues arise with a submission for previously split work, individuals are expected to ask for help from other teammates*

# Project Timeline Proposal

Things have been busy, but this is the updated Timeline and future Timeline Proposal for our group.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/17  |  7 PM |  Individually assigned sections of proposal (all)  | Finalize Project Proposal Submission | 
| 5/21  |  5 PM |  Review methods for implementation of feature analysis (all) | Figure out which combinations of features produce better models with error metrics discussed prior by assigning models to each person for analysis | 
| 5/23  | 5 PM  | Make sure project looks presentable for Peer Review (all)  | Clean up project if there's anything that looks rough, discuss individual EDA assignments   |
| 5/27  | 5 PM  | EDA assigned responsibilities (all) | Review/Edit wrangling/EDA; Discuss Analysis Plan  |
| 5/30  | 5 PM  | Finalize wrangling/EDA; Begin programming for project (all) | Discuss/edit project code; Assure all models followed what we said they did |
| 6/04  | 5 PM  | Each model's analysis (all)| Discuss/edit full project |
| 6/07  | 5 PM  | NA | Discuss any responsibilities that may need to be taken care of  |
| 6/14  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="Akinnote"></a>1.[^](#Akin): Akin, J. (1 Nov 2022) How Does Default Impact Your Credit?. *Experian*. https://www.experian.com/blogs/ask-experian/how-does-default-impact-credit/ <br>
<a name="Ouliarisnote"></a>2.[^](#Ouliaris): Ouliaris, S. (9 Dec 2021) What are Economic Models?. *International Monetary Fund*. https://www.imf.org/external/pubs/ft/fandd/2011/06/basics.html <br>
<a name="Neilsennote"></a>3.[^](#Neilsen,): Neilsen, B. (14 Jan 2022) The Lost Decade: Lessons from Japan's Real Estate Crisis. *Investopedia*. https://www.investopedia.com/articles/economics/08/japan-1990s-credit-crunch-liquidity-trap.asp <br>
<a name="CostaeSilvanote"></a>4.[^](#CostaeSilva,): Costa e Silva, E. (5 May 2020) A logistic regression model for consumer default risk. *National Library of Medicine*. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9041570/ <br>
<a name="Sayjadahnote"></a>5.[^](#Sayjadah,): Sayjadah, Y. et al. (28 Oct 2018) Credit Card Default Prediction using Machine Learning Techniques. *IEEE Xplore*. https://ieeexplore.ieee.org/document/8776802 <br>
<a name="Mahmudinote"></a>6.[^](#Mahmudi,): Mahmudi, H. et al. (15 Oct 2022) Evaluation of Gradient Boosting Algorithms on Balanced Home Credit Default Risk. *IEEE Xplore*. https://ieeexplore.ieee.org/document/10041584 <br>
<a name="Guinote"></a>7.[^](#Gui): Gui, L. (2019) Application of Machine Learning Algorithms in Predicting Credit Card Default Payment. *UCLA*. https://escholarship.org/uc/item/9zg7157q#main <br>