# COGS 118A - Final Project


## Group members

- Neil Bajaj
- Ria Singh
- Pratheek Sankeshi 
- Shenova Davis

# Abstract 
The goal of this project is to predict whether a credit card application should be approved, based on the information the applicant provided on their application. We will make this prediction with a machine learning algorithm. The data comes from the Kaggle dataset "A Credit Card Dataset for Machine Learning" (https://www.kaggle.com/datasets/caesarmario/application-data/). We will use factors such as Total Income, Education Type, and Applicant Age; the full set of variables is described in the Data section. We will drop unnecessary columns such as Owned Phone and Owned Email, replace null values, and perform EDA. We will then train several supervised machine learning models and use the best one. Performance will be measured by how accurately each model predicts the Status column, which records whether the application was approved.



# Background

As we get deeper into the age of Big Data, tasks once handled by human evaluation are increasingly being moved to machine learning prediction wherever possible. There has also been an uptick in the number of credit card defaulters. Together, these trends mean credit card companies need a new way to decide whether or not to approve credit card applications based on prior history.

A previous study by Dr. Hemkiran[1] evaluated whether applicants should be approved for a credit card using logistic regression with and without a grid search technique. They found that grid search improved the performance of their model. They also used an Artificial Neural Network (ANN) and found it to be better than the logistic regression model. Another study by Dr. Kibria[2] aimed to create a deep learning model to aid credit card approval decision-making. They compared it against a logistic regression model and a support vector machine (SVM) model and found that the deep learning model outperformed both. However, the ANN and deep learning models are more computationally expensive and time-consuming.

We are attempting to create a predictive machine learning model that determines whether or not an applicant has the right credentials to have their credit card application approved, using the following variables: Applicant Gender, Owned Property, Total Children, Owned Car, Total Income, Housing Type, Total Family Members, Applicant Age, Education Type, and Family Status. The aim is to avoid future credit card defaulters. We will train models such as a Naive Bayes classifier, linear SVM, and logistic regression and compare their performance.


# Problem Statement

We are building this model to measure whether there are any discrepancies in how credit card applications are approved. We also want to check whether big data and machine learning can be used to build a model that predicts if a credit card application is approved. A machine learning model attempts to take human bias towards race, class, gender, etc. (an issue that plagues the financial ratings of individuals) out of the equation so that every application is reviewed fairly. Additionally, a machine learning system can significantly reduce the labour and costs of a credit card company, increasing revenue. If our model can accurately predict whether an application is accepted, that would support our hypothesis. This problem is quantifiable since we are modelling a binary outcome. It is measurable because we use accuracy to validate the performance of our model. Lastly, it is replicable because the model can be run on different datasets and its accuracy checked on each. The model will be built from supervised machine learning algorithms and techniques such as logistic regression, linear SVM, and k-fold cross-validation. We will train it on previously collected data from credit card companies to understand which attributes make an individual more or less likely to be approved for a credit card.

# Data

- Link for our data: https://www.kaggle.com/datasets/caesarmario/application-data

- This data set has about 25,100 observations with 21 variables. 

- The variables that will be used are Applicant Gender, Owned Property, Total Children, Owned Car, Total Income, Housing Type, Total Family Members, Applicant Age, Education Type, Family Status, Total Good Debt, and Total Bad Debt.

- The dataset we are using has already been cleaned to drop any data points with null values and yet the dataset remains robust so we will not be addressing it any further.

- We will additionally drop all the features that we will not be using to further declutter the dataset.

- In addition to cleaning, we one-hot encode all categorical data that is not already binary, as displayed below.





In [10]:
# @hidden_cell
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import sklearn
from scipy import stats

# use seaborn plotting defaults
sns.set()

# data generation and splitting utilities
from sklearn.datasets import make_blobs, make_classification
from sklearn.model_selection import train_test_split

# preprocessing and pipelines
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.kernel_ridge import KernelRidge

# evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

credit  = pd.read_csv('Application_Data.csv')

#Keeping only the columns we intend on using along with the status 
#column which contains the true status of the applicant

credit = credit[["Applicant_Age","Owned_Realty",
                 "Total_Children", "Owned_Car", "Total_Income",
                 "Total_Family_Members","Total_Good_Debt","Total_Bad_Debt",'Status']]

In [11]:
credit.head()

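| | Applicant_Age | Owned_Realty | Total_Children | Owned_Car | Total_Income | Total_Family_Members | Total_Good_Debt | Total_Bad_Debt | Status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 1 | 0 | 1 | 112500 | 2 | 30 | 0 | 1 |
| 1 | 53 | 1 | 0 | 0 | 270000 | 1 | 5 | 0 | 1 |
| 2 | 53 | 1 | 0 | 0 | 270000 | 1 | 5 | 0 | 1 |
| 3 | 53 | 1 | 0 | 0 | 270000 | 1 | 27 | 0 | 1 |
| 4 | 53 | 1 | 0 | 0 | 270000 | 1 | 39 | 0 | 1 |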



While doing the initial EDA, we realised that only 0.4% of our dataset belonged to the negative class. To rectify this imbalance, we used SMOTE to generate synthetic data so that the classes are split at an even 50%, similar to actual credit card approval rates. However, before running SMOTE we need to one-hot encode our data.


The data generation increases the total number of data points from 25,000 to 40,000.

Before one-hot encoding, we check for null or NaN values. As seen below, our dataset doesn't have any, so no handling is required.

In [12]:
credit.isnull().values.any()

False

### One Hot Encoding

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(credit.drop('Status', axis=1),  # features
                                                    credit['Status'],  # target variable
                                                    train_size=0.8,
                                                    random_state=123)

#categorical_features = ["Applicant_Gender", "Housing_Type", "Education_Type", "Family_Status"]


# Define numerical and categorical column selectors
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# Get numerical and categorical column names
numerical_columns = numerical_columns_selector(X_train)
categorical_columns = categorical_columns_selector(X_train)

# Define the scaler and the one-hot encoder
numerical_transformer = StandardScaler()

# drop='first' drops the first category of each categorical feature to avoid
# redundant columns; drop='if_binary' would only do so for two-category
# features (e.g. Applicant_Gender).
categorical_transformer = OneHotEncoder(drop='first')

# Combine both transformers into a single preprocessor for the pipeline
preprocessor = ColumnTransformer(
    transformers=[ 
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

### Running Smote

In [17]:
from imblearn.over_sampling import SMOTE

X_train_processed = preprocessor.fit_transform(X_train)

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train_processed, y_train)
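To make the resampling step less of a black box, here is a simplified sketch of the idea behind SMOTE: each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is an illustration under stated assumptions, not imblearn's implementation; the helper `smote_like_samples` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy minority-class points in 2D (made up for illustration).
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))


def smote_like_samples(X_min, n_new, k=5, rng=rng):
    """Create n_new synthetic points by interpolating towards minority neighbours."""
    # Ask for k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)  # base sample for each new point
    pick = rng.integers(1, k + 1, size=n_new)       # which of its k neighbours to use
    neighbours = X_min[neighbour_idx[base, pick]]

    gap = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    return X_min[base] + gap * (neighbours - X_min[base])


synthetic = smote_like_samples(minority, n_new=30)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two existing minority samples, it always stays inside the region already covered by the minority class.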

Next, we do some additional data cleaning by restoring the column names on the resampled data.

In [19]:

num_columns = list(numerical_columns)

# The preprocessor was already fitted in the SMOTE cell above.
# After keeping only numeric columns, categorical_columns is empty, and
# ColumnTransformer leaves a transformer with no columns unfitted. Asking that
# encoder for feature names is the most likely cause of the NotFittedError,
# so only query it when categorical columns exist.
if categorical_columns:
    cat_columns = list(
        preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns)
    )
else:
    cat_columns = []

column_names = num_columns + cat_columns

X_res_df = pd.DataFrame(X_res, columns=column_names)

### Standard Scaling

The feature values look different after preprocessing because of standard scaling. Standard scaling (also known as standardization) has several benefits:

- Normalization of data: Standard scaling transforms each feature to have zero mean and unit variance. This is particularly useful when features have different units of measurement or scales.
- Better performance of some machine learning algorithms: Algorithms such as K-nearest neighbors (KNN) and support vector machines (SVM) are sensitive to the scale of the input features, so standard scaling can improve their performance.
- Efficient optimization: Many optimization algorithms, such as gradient descent, converge faster when the input features are on the same scale.
- Interpretation of coefficients: For linear regression and other models with coefficients, standard scaling puts the coefficients on the same scale so they can be compared fairly.

Overall, standard scaling is a common preprocessing step that can improve the performance of many machine learning models and make the results easier to interpret. For these reasons, we keep standard scaling.
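The effect of standard scaling can be checked on a small example. The sketch below uses toy values loosely shaped like (age, income) pairs; they are illustrative and not a sample of the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy (age, income) rows on very different scales; values are illustrative.
X = np.array([
    [59, 112500.0],
    [53, 270000.0],
    [41, 90000.0],
    [35, 157500.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# New data should reuse the statistics learned from the training data.
X_new_scaled = scaler.transform(np.array([[47, 135000.0]]))
print(X_new_scaled)
```

Fitting the scaler on the training split only and then calling `transform` on the test split avoids leaking test-set statistics into training.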

This concludes our data cleaning process and makes the data ready to be used in our models.

# Proposed Solution

Our proposed solution for predicting whether individuals get approved for a credit card, and for understanding which variables play a role in that decision, is to use classifiers based on the techniques we've learnt in class.
First, we will use k-fold cross-validation to determine our model of choice by validating over logistic regression, naive Bayes, linear SVM, and kernel ridge regression, with L1, L2, and elastic net penalties where applicable. We will choose the model that gives us the highest accuracy score.

We are using k-fold cross-validation because a single train-validation-test split risks tuning the model to one particular validation set and overfitting to it. K-fold cross-validation instead splits the data into k folds; in each round the model is trained on k-1 folds and evaluated on the remaining fold, so every data point is used for both training and validation, and the scores are averaged across rounds.
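A minimal sketch of k-fold cross-validation with scikit-learn is shown here. It uses synthetic data from `make_classification` as a stand-in for the credit data, and the choices of 5 folds and logistic regression are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the credit data (8 features, binary target).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each of the 5 scores comes from training on 4 folds and testing on the 5th.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print(scores)
print(scores.mean())
```

The mean of the fold scores is the number that would be compared across candidate models.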

We are using logistic regression, naive Bayes, linear SVM, and kernel ridge regression since they can all be applied to classification. Our problem is to predict whether to approve a credit card for an applicant based on a variety of factors from their application. This is a yes-or-no question, which makes it a binary classification task, so it makes sense to use algorithms that can solve binary classification problems. We chose these specific algorithms because we are familiar with them and understand how to run and evaluate them effectively. Furthermore, we use L1, L2, and elastic net penalties where the model supports them, since these regularization terms curb overfitting and improve generalization; note that kernel ridge regression itself only applies an L2 penalty, controlled by alpha.

With the chosen model we will validate across hyperparameters. In the case of logistic regression, we will validate over the values of C = [0.01, 0.1, 1, 10, 100] to find our best model.
For Naive Bayes, we will validate over Gaussian Naive Bayes and Multinomial Naive Bayes, choosing the smoothing parameter alpha for Multinomial Naive Bayes from the values [0.01, 0.05, 0.1, 0.2, 0.25].

For SVM, we will validate over the kernel functions 'linear', 'poly', 'rbf', and 'sigmoid' and over values of C = [0.01, 0.1, 1, 10, 100] to find the best model.

For Kernel Ridge, we will validate over the following values of alpha - [0.01, 0.1, 1.0, 10.0, 100.0], the kernel functions of linear, polynomial, and RBF and related kernel-specific parameters.
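The hyperparameter searches described above can be sketched with `GridSearchCV`. This example covers only the logistic regression and SVM grids, runs on synthetic data rather than the credit data, and assumes 5-fold cross-validation scored by accuracy; the scoring could be swapped for `"f1"`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed credit data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

C_values = [0.01, 0.1, 1, 10, 100]
kernels = ["linear", "poly", "rbf", "sigmoid"]

# Logistic regression: search over the inverse regularization strength C.
log_reg_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": C_values},
    cv=5,
    scoring="accuracy",
)
log_reg_search.fit(X, y)

# SVM: search over every combination of kernel and C.
svm_search = GridSearchCV(
    SVC(),
    param_grid={"kernel": kernels, "C": C_values},
    cv=5,
    scoring="accuracy",
)
svm_search.fit(X, y)

print(log_reg_search.best_params_, log_reg_search.best_score_)
print(svm_search.best_params_, svm_search.best_score_)
```

`best_params_` holds the winning combination and `best_score_` its mean cross-validated score, which makes the searches directly comparable.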

Our extensive search over the models and their hyperparameters should help us find an accurate classification model. As our dataset is not especially large, we are less concerned about computational efficiency, which is something to improve on when expanding the project.

Finally, we will train our model, test it to measure its accuracy, and compare its performance to existing models available on Kaggle. We will set up a confusion matrix to see how the model fares, plot the ROC curve, and report the AUC.
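As a rough sketch of this final evaluation step, the example below computes a confusion matrix, the ROC curve points, and the AUC for a logistic regression model. The data is synthetic and the model choice is an assumption, so the numbers say nothing about the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

# ROC and AUC need scores, so use the predicted probability of class 1.
probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
auc = roc_auc_score(y_test, probabilities)

print(cm)
print(auc)
```

The `fpr` and `tpr` arrays can be passed straight to `plt.plot(fpr, tpr)` to draw the ROC curve.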


# Evaluation Metrics

We will use both precision and recall to evaluate our model, as both false negatives and false positives are of significant concern. A credit card company needs to maximize eligible customers to increase revenue, which means minimizing false negatives, and also needs to minimize the number of individuals who may default to cut losses, which means minimizing false positives. Since neither metric is more important to our model, we will additionally use the F1 score, which incorporates both precision and recall, as the final measure of our model's performance.

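Let TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives.

Precision is the fraction of predicted positives that are correct. It helps gauge how often the model produces false positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is the fraction of actual positives that are correctly classified:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

Accuracy is the fraction of all data points that are correctly classified:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$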
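A small worked example may help make these definitions concrete. The labels below are made up; the sketch counts true/false positives and negatives by hand and checks the results against scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = approved, 0 = not approved.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# Count each outcome by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
print(accuracy, accuracy_score(y_true, y_pred))
```

Here TP = 4, FP = 1, FN = 2, and TN = 3, so precision is 4/5, recall is 4/6, F1 is 8/11, and accuracy is 7/10.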


# Results

After cleaning and balancing our dataset,we proceed to 

cross validation code - repeated k fold

baseline model - accuracy - we will compare to this accuracy to check for improvement
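One possible shape for the repeated k-fold and baseline steps noted above is sketched here. It is an assumption about how they could be set up, not the project's actual code or results: the data is synthetic, the baseline is a `DummyClassifier` that always predicts the most frequent class, and the 5 folds with 3 repeats are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the balanced, preprocessed credit data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 5 folds repeated 3 times with different shuffles gives 15 scores per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

# Baseline: always predict the most frequent class in the training folds.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, scoring="accuracy"
)

# Candidate model evaluated on exactly the same folds.
model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print(baseline_scores.mean())
print(model_scores.mean())
```

Evaluating the baseline and the candidate on the same folds keeps the comparison fair; any real improvement should show up as a higher mean accuracy than the baseline.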





















### Classifier 1 - Logistic Classifier
Include code and accuracy - only include the relevant hyperparameters that converge and work

make verbose (GridSearch) = 2/1 not 3

write up - include why the model, pros and cons, how it performs and which hyperparameters perform best according to our comparison metric

### Classifier 2 - SVM

### Classifier 3 - Naive Bayes

### Classifier 4 - KNN

### Classifier 5 - Random Forest

### Classifier 6 - Ensemble(AdaBoost)

### Classifier 7 - KNN

### Classifier 8 - Ensemble (Gradient boost)

## Final testing
Pick best model and run on test data

# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
