# **Credit Risk Analysis**
# Riley Sample - rcs3396



**Project Definition:**


> This dataset describes customers applying for a credit card. Features including gender, age, debt, and income are used to predict whether an application was approved or denied. I expect the models to reach 70-80% accuracy. Logistic Regression, SVM, and Decision Tree classifiers are compared to determine which fits the data best and gives the highest accuracy.


In [35]:
# import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [36]:
# read in data
df = pd.read_csv('clean_dataset.csv')

In [37]:
df.head(5)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1,30.83,0.0,1,1,Industrials,White,1.25,1,1,1,0,ByBirth,202,0,1
1,0,58.67,4.46,1,1,Materials,Black,3.04,1,1,6,0,ByBirth,43,560,1
2,0,24.5,0.5,1,1,Materials,Black,1.5,1,0,0,0,ByBirth,280,824,1
3,1,27.83,1.54,1,1,Industrials,White,3.75,1,1,5,1,ByBirth,100,3,1
4,1,20.17,5.625,1,1,Industrials,White,1.71,1,0,0,0,ByOtherMeans,120,0,1


In [38]:
# drop industry col
df = df.drop('Industry', axis='columns')

# test accuracy when other cols dropped
# df = df.drop('Ethnicity', axis='columns')
# df = df.drop('Gender', axis='columns')
# df = df.drop('Income', axis='columns')

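# change categorical data into numeric codes
# (assign the result back: an inplace replace on df['col'] does not update df under pandas Copy-on-Write)
df['Ethnicity'] = df['Ethnicity'].replace(['White', 'Black', 'Asian', 'Latino', 'Other'], [1, 2, 3, 4, 5])
df['Citizen'] = df['Citizen'].replace(['ByBirth', 'ByOtherMeans', 'Temporary'], [1, 2, 3])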
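
Integer codes give Ethnicity and Citizen an implied order (for example, 4 > 1) that the categories do not really have. A minimal sketch of an alternative, one-hot encoding with `pd.get_dummies`, is shown below. It is not what this notebook uses, and the `sample` frame is a small hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in rows using the same column names as the dataset
sample = pd.DataFrame({
    "Ethnicity": ["White", "Black", "Asian", "Latino", "Other"],
    "Citizen": ["ByBirth", "ByOtherMeans", "Temporary", "ByBirth", "ByBirth"],
    "Age": [30.83, 58.67, 24.5, 27.83, 20.17],
})

# One 0/1 indicator column per category instead of a single integer code
encoded = pd.get_dummies(sample, columns=["Ethnicity", "Citizen"], dtype=int)
print(encoded.columns.tolist())
```

One-hot columns matter most for Logistic Regression and SVM, which treat the codes as magnitudes; a Decision Tree is less affected by the coding.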

# Data Description

**clean_dataset.csv:**
Machine Learning dataset regarding credit card approvals.

>**Gender:**
>* 0 = Female
>* 1 = Male

>**Age:** Age in years.

>**Debt:** Outstanding debt (scaled)

>**Married:**
>* 0 = Single/Divorced
>* 1 = Married

>**BankCustomer:**
>* 0 = no bank account
>* 1 = has a bank account

>**Industry:** Job sector

>**Ethnicity:**
>* 1 = White
>* 2 = Black
>* 3 = Asian
>* 4 = Latino
>* 5 = Other

>**YearsEmployed:** Years employed

>**PriorDefault:**
>* 0 = no prior default
>* 1 = prior default

>**Employed:**
>* 0 = not employed
>* 1 = employed

>**CreditScore:** Credit score (scaled)

>**DriversLicense:**
>* 0 = no license
>* 1 = has license

>**Citizen:**
>* 1 = by birth
>* 2 = by other means
>* 3 = temporary

>**ZipCode:** Zip code

>**Income:** Income (scaled)

>**Approved:**
>* 0 = not approved
>* 1 = approved








In [39]:
df.head(10)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1,30.83,0.0,1,1,1,1.25,1,1,1,0,1,202,0,1
1,0,58.67,4.46,1,1,2,3.04,1,1,6,0,1,43,560,1
2,0,24.5,0.5,1,1,2,1.5,1,0,0,0,1,280,824,1
3,1,27.83,1.54,1,1,1,3.75,1,1,5,1,1,100,3,1
4,1,20.17,5.625,1,1,1,1.71,1,0,0,0,2,120,0,1
5,1,32.08,4.0,1,1,1,2.5,1,0,0,1,1,360,0,1
6,1,33.17,1.04,1,1,2,6.5,1,0,0,1,1,164,31285,1
7,0,22.92,11.585,1,1,1,0.04,1,0,0,0,1,80,1349,1
8,1,54.42,0.5,0,0,2,3.96,1,0,0,0,1,180,314,1
9,1,42.5,4.915,0,0,1,3.165,1,0,0,1,1,52,1442,1


# Logistic Regression Model

In [40]:
# identify x and y
y = df['Approved']
x = df.drop('Approved', axis='columns')

# create 80% train/ 20% test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
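
The split above has no `random_state`, so each run produces a different test set and slightly different scores. A minimal sketch of a reproducible, stratified split follows. The `demo` frame is hypothetical stand-in data, and dropping the `Unnamed: 0` row-index column is an assumption (that it carries no predictive meaning), not something this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with a row-index column like the one in the CSV
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Unnamed: 0": range(100),
    "Age": rng.uniform(18, 70, size=100),
    "Debt": rng.uniform(0, 20, size=100),
    "Approved": np.repeat([0, 1], [55, 45]),
})

# Features exclude the target and the row index (assumption: the index is not a real feature)
X_demo = demo.drop(["Approved", "Unnamed: 0"], axis="columns")
y_demo = demo["Approved"]

# random_state fixes the shuffle; stratify keeps the approved/denied ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(len(X_tr), len(X_te))
print("approved share in test set:", y_te.mean())
```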

In [41]:
# create logistic regression model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state = 1)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=1)
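
The convergence warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, using a scikit-learn pipeline, is shown below. The data is a hypothetical stand-in with one wide-range column (like Income) and one narrow one, and `max_iter=1000` is an arbitrary safety margin rather than a tuned value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: a wide-range column and a narrow-range column
rng = np.random.default_rng(1)
income = rng.uniform(0, 50_000, size=200)
credit_score = rng.uniform(0, 20, size=200)
X_demo = np.column_stack([income, credit_score])
y_demo = (credit_score + income / 5_000 + rng.normal(0, 2, size=200) > 15).astype(int)

# Standardize each feature to zero mean and unit variance, then fit the model
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=1, max_iter=1000),
)
scaled_lr.fit(X_demo, y_demo)

lr_step = scaled_lr.named_steps["logisticregression"]
print("iterations used:", lr_step.n_iter_[0])
print("training accuracy:", scaled_lr.score(X_demo, y_demo))
```

Putting the scaler inside the pipeline means it is fit on the training data only, so the same object can later be used with `predict` on a test set without leaking test statistics.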

In [42]:
# find y_pred and calculate accuracy
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
lr_acc = accuracy_score(y_test, y_pred)
print("Accuracy Score:", lr_acc)

Accuracy Score: 0.8333333333333334


In [43]:
# classification report
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.92      0.85        73
           1       0.89      0.74      0.81        65

    accuracy                           0.83       138
   macro avg       0.84      0.83      0.83       138
weighted avg       0.84      0.83      0.83       138



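Logistic Regression performed well, reaching about 83% accuracy. Precision was 0.80 for denials and 0.89 for approvals. Recall was 0.92 for denials but 0.74 for approvals, so roughly a quarter of the approved applications in the test set were predicted as denied.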
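
The precision and recall columns in the classification report are derived from the confusion matrix. A minimal sketch of that calculation follows, using short hypothetical label lists rather than this notebook's `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = not approved, 1 = approved
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: of the predicted approvals, how many were truly approved
precision_approved = tp / (tp + fp)
# Recall: of the truly approved, how many the model found
recall_approved = tp / (tp + fn)

print("precision (approved):", precision_approved)
print("recall (approved):", recall_approved)

# The manual values agree with scikit-learn's helpers
assert abs(precision_approved - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall_approved - recall_score(y_true, y_hat)) < 1e-12
```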

# SVM Model

In [44]:
from sklearn import svm

model = svm.SVC(random_state = 1)
model.fit(X_train, y_train)

SVC(random_state=1)

In [45]:
y_pred = model.predict(X_test)
svm_acc = accuracy_score(y_test, y_pred)
print("Accuracy Score:", svm_acc)

Accuracy Score: 0.6594202898550725


In [46]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.62      0.95      0.75        73
           1       0.85      0.34      0.48        65

    accuracy                           0.66       138
   macro avg       0.73      0.64      0.61       138
weighted avg       0.72      0.66      0.62       138



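SVM struggled compared to Logistic Regression, reaching only about 66% accuracy. It strongly favored the denied outcome: recall was 0.95 for denials but 0.34 for approvals. A likely contributor, not tested here, is that the features were left unscaled, and the default RBF kernel in SVC is sensitive to feature scale.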
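
One way to check whether feature scale affects an SVM is to compare the same model with and without standardization. The sketch below does this on hypothetical stand-in data in which a wide-range column is unrelated to the label and a narrow-range column determines it. It illustrates the effect only; it says nothing definite about how SVM would score on this dataset after scaling.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data
rng = np.random.default_rng(1)
income = rng.uniform(0, 50_000, size=300)  # wide range, unrelated to the label here
signal = rng.uniform(0, 1, size=300)       # narrow range, determines the label
X_demo = np.column_stack([income, signal])
y_demo = (signal > 0.5).astype(int)

# Same default SVC, once on raw features and once behind a StandardScaler
unscaled_acc = cross_val_score(SVC(random_state=1), X_demo, y_demo, cv=5).mean()
scaled_acc = cross_val_score(
    make_pipeline(StandardScaler(), SVC(random_state=1)), X_demo, y_demo, cv=5
).mean()

print("unscaled SVC accuracy:", round(unscaled_acc, 3))
print("scaled SVC accuracy:", round(scaled_acc, 3))
```

On raw features the RBF distance is dominated by the wide-range column, so the narrow but informative column is effectively ignored.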

# Decision Tree Model

In [47]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state = 1)
model.fit(X_train, y_train)

DecisionTreeClassifier(random_state=1)

In [48]:
y_pred = model.predict(X_test)
dt_acc = accuracy_score(y_test, y_pred)
print("Accuracy Score:", dt_acc)

Accuracy Score: 0.7898550724637681


In [49]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.81      0.80        73
           1       0.78      0.77      0.78        65

    accuracy                           0.79       138
   macro avg       0.79      0.79      0.79       138
weighted avg       0.79      0.79      0.79       138



The Decision Tree model reached about 79% accuracy, just under Logistic Regression, with balanced precision and recall (0.77 to 0.81) for both outcomes.

# Accuracy of Models

In [50]:
print("Logistic Regression:", lr_acc)
print("Support Vector Machine:", svm_acc)
print("Decision Tree:", dt_acc)

Logistic Regression: 0.8333333333333334
Support Vector Machine: 0.6594202898550725
Decision Tree: 0.7898550724637681


**Conclusion:**
> Logistic Regression (about 83%) and Decision Tree (about 79%) performed well, both at or above the expected 70-80% range, while SVM with its default settings (about 66%) was not a good fit for this dataset.
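
The comparison above rests on a single 138-row test set, so the ranking could shift with a different split. A minimal sketch of a steadier comparison using 5-fold cross-validation is shown below. The data is a hypothetical stand-in, and the scaling pipelines are an assumption added for the sketch, not part of the models fitted in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: the label depends on the first two features plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(random_state=1)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(random_state=1)),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

# Mean accuracy over 5 folds for each model
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```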

**Link to Data:**
Data and data description obtained from kaggle.com
* https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data