## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [31]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import \
     classification_report, confusion_matrix,\
     accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


In [3]:
# The file has no header row, so the column names A1..A16 are supplied explicitly
df = pd.read_csv('cc_approvals.data', names=[f'A{i}' for i in range(1, 17)])
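
The loading step described in the task list can be sketched on a tiny in-memory sample. The three rows below are modeled on the `df.tail()` output shown later in this notebook, so the real `cc_approvals.data` file may format some values differently. The `na_values="?"` argument is an optional extra that the notebook itself does not use; it is shown here only as one possible way to parse the `?` placeholder as NaN at load time.

```python
import io
import pandas as pd

# Three illustrative rows modeled on the notebook's df.tail() output;
# the real cc_approvals.data file may format values differently.
sample = io.StringIO(
    "?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-\n"
    "a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-\n"
    "a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-\n"
)

col_names = [f"A{i}" for i in range(1, 17)]

# header=None: treat the first line as data, not as column names.
# na_values="?": parse the '?' placeholder as NaN (optional, not in the notebook).
sample_df = pd.read_csv(sample, header=None, names=col_names, na_values="?")

print(sample_df.head())          # first (up to) five rows
print(sample_df.isna().sum())    # missing values per column
```

With this approach A2 would be read as a float column directly, so the later `astype('float')` calls would not be needed.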


## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [4]:
df.describe(include='all')

statistic,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
count,690,690,690.0,690,690,690,690,690.0,690,690,690.0,690,690,690.0,690.0,690
unique,3,350,,4,4,15,10,,2,2,,2,3,171.0,,2
top,b,?,,u,g,c,v,,t,f,,f,g,0.0,,-
freq,468,12,,519,519,137,399,,361,395,,374,625,132.0,,383
mean,,,4.758725,,,,,2.223406,,,2.4,,,,1017.385507,
std,,,4.978163,,,,,3.346513,,,4.86294,,,,5210.102598,
min,,,0.0,,,,,0.0,,,0.0,,,,0.0,
25%,,,1.0,,,,,0.165,,,0.0,,,,0.0,
50%,,,2.75,,,,,1.0,,,0.0,,,,5.0,
75%,,,7.2075,,,,,2.625,,,3.0,,,,395.5,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      690 non-null    object 
 1   A2      690 non-null    object 
 2   A3      690 non-null    float64
 3   A4      690 non-null    object 
 4   A5      690 non-null    object 
 5   A6      690 non-null    object 
 6   A7      690 non-null    object 
 7   A8      690 non-null    float64
 8   A9      690 non-null    object 
 9   A10     690 non-null    object 
 10  A11     690 non-null    int64  
 11  A12     690 non-null    object 
 12  A13     690 non-null    object 
 13  A14     690 non-null    object 
 14  A15     690 non-null    int64  
 15  A16     690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
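
Note that A2 and A14 are reported as `object` even though they hold numbers. The `describe()` output above lists `?` as the most frequent A2 value (12 occurrences), which suggests that a text placeholder is what keeps these columns from being numeric. A minimal sketch of how such placeholders could be located, using a hypothetical three-row frame rather than the real data:

```python
import pandas as pd

# Hypothetical miniature frame: A2 should be numeric but contains a '?' placeholder
toy = pd.DataFrame({
    "A1": ["a", "?", "b"],
    "A2": ["30.83", "?", "24.5"],
    "A3": [0.0, 4.46, 0.5],
})

# Elementwise comparison with '?', then count the matches per column
placeholder_counts = (toy == "?").sum()

# Keep only the columns that actually contain the placeholder
print(placeholder_counts[placeholder_counts > 0])
```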


In [6]:
df.tail(17)

index,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


## 4. Train Test Split

Do not split the dataset into X and y yet; just split the original dataset.

random_state=42

test_size=0.33

In [7]:
# Features: every column except A16. Target: A16, kept as a one-column DataFrame.
X_train, X_test, y_train, y_test = train_test_split(
    df.loc[:, df.columns != 'A16'], df.loc[:, df.columns == 'A16'],
    random_state=42, test_size=0.33)
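
The cell above separates features and target inside the `train_test_split` call. The task text instead describes splitting the whole frame first. A sketch of that variant follows, on a synthetic stand-in frame with the same number of rows (690); the names `train_df`, `test_df` and the `_toy` variables are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with 690 rows, like the dataset; the values are synthetic
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "A3": rng.random(690),
    "A16": rng.choice(["+", "-"], size=690),
})

# Split the full frame first ...
train_df, test_df = train_test_split(toy, test_size=0.33, random_state=42)

# ... then separate the target column from each part
X_train_toy, y_train_toy = train_df.drop(columns="A16"), train_df["A16"]
X_test_toy, y_test_toy = test_df.drop(columns="A16"), test_df["A16"]

print(train_df.shape, test_df.shape)
print(y_train_toy.shape)  # a 1-D Series rather than a one-column DataFrame
```

With 690 rows and `test_size=0.33`, this gives the same 462/228 row counts that the notebook reports.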

## 5. Handling Missing Values

Convert every '?' to NaN in both the training and testing sets.

In [8]:
X_train = X_train.replace('?', np.nan)
X_test = X_test.replace('?', np.nan)
y_train = y_train.replace('?', np.nan)
y_test = y_test.replace('?', np.nan)
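
To see what this replacement does, here is a small sketch on a hypothetical frame (the column values are made up). Once `?` has become NaN, the missing entries can be counted per column, and a numeric column stored as text can be cast to float, which is what the next step relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with '?' placeholders in a categorical and a numeric column
toy = pd.DataFrame({
    "A1": ["a", "?", "b"],
    "A2": ["30.83", "58.67", "?"],
    "A3": [0.0, 4.46, 0.5],
})

cleaned = toy.replace("?", np.nan)

# Number of missing entries per column after the replacement
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# The cast to float now succeeds because NaN is a valid float value
a2_as_float = cleaned["A2"].astype(float)
print(a2_as_float.mean())
```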

## 6. Handling Missing Values

Impute the missing numerical values in both the training and testing sets with the column mean.

In [9]:
X_train['A2']=X_train['A2'].astype('float')
X_train['A14']=X_train['A14'].astype('float')

X_test['A2']=X_test['A2'].astype('float')
X_test['A14']=X_test['A14'].astype('float')

X_train['A2'] = X_train['A2'].fillna(X_train['A2'].mean())
X_train['A14'] = X_train['A14'].fillna(X_train['A14'].mean())

X_test['A2'] = X_test['A2'].fillna(X_test['A2'].mean())
X_test['A14'] = X_test['A14'].fillna(X_test['A14'].mean())
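
The cell above fills the test set with the test set's own means. A common alternative is to compute the means on the training set only and reuse them for the test set, so that no statistic is derived from test data. The sketch below shows that variant with scikit-learn's `SimpleImputer` on made-up numbers; it is not what the notebook does, and applying it would slightly change the results reported later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values standing in for the A2 and A14 columns
num_train = pd.DataFrame({"A2": [20.0, np.nan, 40.0], "A14": [100.0, 200.0, np.nan]})
num_test = pd.DataFrame({"A2": [np.nan, 25.0], "A14": [80.0, np.nan]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training data only
train_imputed = pd.DataFrame(
    imputer.fit_transform(num_train), columns=num_train.columns, index=num_train.index
)
# transform reuses those training means for the test data
test_imputed = pd.DataFrame(
    imputer.transform(num_test), columns=num_test.columns, index=num_test.index
)

print(test_imputed)
```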



## 7. Handling Missing Values

Impute the missing categorical values in both the training and testing sets with the column mode (the most frequent value).

In [10]:

X_train['A1'] = X_train['A1'].fillna(X_train['A1'].mode()[0])
X_train['A4'] = X_train['A4'].fillna(X_train['A4'].mode()[0])
X_train['A5'] = X_train['A5'].fillna(X_train['A5'].mode()[0])
X_train['A6'] = X_train['A6'].fillna(X_train['A6'].mode()[0])
X_train['A7'] = X_train['A7'].fillna(X_train['A7'].mode()[0])


X_test['A1'] = X_test['A1'].fillna(X_test['A1'].mode()[0])
X_test['A6'] = X_test['A6'].fillna(X_test['A6'].mode()[0])
X_test['A7'] = X_test['A7'].fillna(X_test['A7'].mode()[0])
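
The repeated lines above can also be written as a loop. The sketch below does so on a hypothetical two-column example and, as an assumption that differs from the notebook, fills the test set with the training mode. Looping over every categorical column also covers columns such as A4 and A5, which the test cell above does not fill; whether the test split has gaps there is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with a few missing entries
cat_train = pd.DataFrame({"A1": ["b", "b", "a", np.nan], "A4": ["u", np.nan, "u", "y"]})
cat_test = pd.DataFrame({"A1": [np.nan, "a"], "A4": ["y", np.nan]})

for col in ["A1", "A4"]:
    # mode() ignores NaN and returns a Series; [0] picks the most frequent value
    train_mode = cat_train[col].mode()[0]
    cat_train[col] = cat_train[col].fillna(train_mode)
    # Reuse the training mode for the test set (assumption, see above)
    cat_test[col] = cat_test[col].fillna(train_mode)

print(cat_test)
```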


## 8. Encoding

The columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical (as is column 11, which was not dropped earlier and is therefore encoded below as well). There are several ways to encode categorical columns; one of them is the pandas function get_dummies().

Use the get_dummies() function to convert the categorical columns to numerical columns (for training the machine learning algorithms).

Do not forget to convert both training and testing sets.

In [11]:
X_train = pd.get_dummies(X_train, prefix=['A1', 'A4','A5','A6', 'A7','A9','A10','A12','A13'])
X_test = pd.get_dummies(X_test, prefix=['A1', 'A4','A5','A6', 'A7','A9','A10','A12','A13'])
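
Because `get_dummies()` creates one column per category it actually sees, encoding the two splits separately can produce different column sets when a category occurs in only one of them. The notebook handles this in the next step by dropping `A7_o` from the training set. Another option, sketched below on made-up data, is to reindex the encoded test set to the training columns.

```python
import pandas as pd

# Made-up data: category 'o' appears in the training split only
enc_train = pd.DataFrame({"A3": [1.0, 2.0, 3.0], "A7": ["v", "h", "o"]})
enc_test = pd.DataFrame({"A3": [4.0, 5.0], "A7": ["v", "h"]})

train_enc = pd.get_dummies(enc_train, columns=["A7"])
test_enc = pd.get_dummies(enc_test, columns=["A7"])

# Align test to the training columns: dummies missing from test are added
# and filled with 0; dummies that exist only in test would be dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

This keeps every training column, whereas dropping the extra column discards the information that a training row belonged to that category. Which trade-off is preferable depends on the model.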

In [12]:
y_train.shape

(462, 1)

## 9. Split into features and target

X_train and y_train will have 462 rows.
X_test and y_test will have 228 rows.

In [13]:
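# Drop the A7_o dummy column so that the training set has the same number of columns (45) as the test set
X_train = X_train.drop('A7_o', axis=1)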
print(f'Features shapes: Train: {X_train.shape} , Test: {X_test.shape}')
print(f'Target shapes: Train: {y_train.shape} , Test: {y_test.shape}')

Features shapes: Train: (462, 45) , Test: (228, 45)
Target shapes: Train: (462, 1) , Test: (228, 1)


## 10. Normalization

In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 462 entries, 382 to 102
Data columns (total 45 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A2      462 non-null    float64
 1   A3      462 non-null    float64
 2   A8      462 non-null    float64
 3   A11     462 non-null    int64  
 4   A14     462 non-null    float64
 5   A15     462 non-null    int64  
 6   A1_a    462 non-null    bool   
 7   A1_b    462 non-null    bool   
 8   A4_l    462 non-null    bool   
 9   A4_u    462 non-null    bool   
 10  A4_y    462 non-null    bool   
 11  A5_g    462 non-null    bool   
 12  A5_gg   462 non-null    bool   
 13  A5_p    462 non-null    bool   
 14  A6_aa   462 non-null    bool   
 15  A6_c    462 non-null    bool   
 16  A6_cc   462 non-null    bool   
 17  A6_d    462 non-null    bool   
 18  A6_e    462 non-null    bool   
 19  A6_ff   462 non-null    bool   
 20  A6_i    462 non-null    bool   
 21  A6_j    462 non-null    bool   
 22  A6_k 

In [15]:
scaler = MinMaxScaler()
scaled = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
X_train[scaled] = scaler.fit_transform(X_train[scaled])
X_test[scaled] = scaler.transform(X_test[scaled])
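
The scaler is fitted on the training data and then applied to the test data, so both are rescaled with the training minimum and maximum: each value x becomes (x - min) / (max - min). A small sketch with made-up numbers that checks this formula by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data
train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[15.0], [40.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train_col)  # learns min=10, max=30
test_scaled = scaler_demo.transform(test_col)        # reuses the training min/max

# Same computation by hand: (x - min_train) / (max_train - min_train)
manual = (test_col - train_col.min()) / (train_col.max() - train_col.min())

print(train_scaled.ravel())
print(test_scaled.ravel())  # test values can fall outside [0, 1]
```

A test value larger than the training maximum (40 here) is mapped above 1, which is expected and usually harmless.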

## 11. Train a Logistic Regression

In [16]:

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)
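
The line above appears to be the tail of scikit-learn's `DataConversionWarning`, which is raised when the target is passed as a column vector (here `y_train` has shape (462, 1)) instead of a 1-D array. The model still trains. A minimal sketch of how the warning could be avoided, using synthetic data and hypothetical names (`X_toy`, `y_toy`, `model`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features and a one-column DataFrame target, like y_train in the notebook
X_toy = pd.DataFrame({"f1": np.linspace(0, 1, 40), "f2": np.linspace(1, 0, 40) ** 2})
y_toy = pd.DataFrame({"A16": np.where(X_toy["f1"] > 0.5, "+", "-")})  # shape (40, 1)

model = LogisticRegression()

# Passing the column as a 1-D Series avoids the conversion warning;
# y_toy.values.ravel() would work equally well.
model.fit(X_toy, y_toy["A16"])

print(model.classes_)
```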


## 12. Make predictions and evaluate the Logistic Regression Model

In [21]:

y_pred = logmodel.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sensitivity_specificity = sensitivity / specificity


In [22]:
print(f' Accuracy of the testing model: {accuracy_score(y_test, y_pred)}')
print(f' Sensitivity of the testing model: {sensitivity}')
print(f' Specificity of the testing model: {specificity}')
print(f' Sensitivity/Specificity of the testing model: {sensitivity_specificity}')

 Accuracy of the testing model: 0.8421052631578947
 Sensitivity of the testing model: 0.792
 Specificity of the testing model: 0.9029126213592233
 Sensitivity/Specificity of the testing model: 0.8771612903225806
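
One detail is worth checking when reading these numbers. `confusion_matrix` orders string labels by sorting them, and `'+'` sorts before `'-'`. With the default order, the `tn, fp, fn, tp` unpacking therefore treats `'-'` as the positive class, so the "sensitivity" above would be the recall on `'-'`. The sketch below illustrates this on a made-up example and shows how the `labels` argument can fix the order. It assumes that `'+'` means "approved", which this notebook does not state explicitly.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 3 true '+' and 4 true '-'
y_true = ["+", "+", "+", "-", "-", "-", "-"]
y_hat = ["+", "+", "-", "-", "-", "-", "+"]

# Default order is sorted: ['+', '-'], so ravel() treats '-' as the positive class
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
default_sensitivity = tp / (tp + fn)      # recall on '-'

# Explicit order ['-', '+'] makes '+' the positive class
tn2, fp2, fn2, tp2 = confusion_matrix(y_true, y_hat, labels=["-", "+"]).ravel()
approved_sensitivity = tp2 / (tp2 + fn2)  # recall on '+'

print(default_sensitivity)
print(approved_sensitivity)
print(recall_score(y_true, y_hat, pos_label="+"))  # same as approved_sensitivity
```

Accuracy is unaffected by the label order; only the roles of sensitivity and specificity are swapped.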


## 13. Repeat steps 11 and 12 for SVM, Decision Tree (DT), and Random Forest (RF)

## SVC:

In [26]:
svc = SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sensitivity_specificity = sensitivity / specificity
print(f' Accuracy of the testing model: {accuracy_score(y_test, y_pred)}')
print(f' Sensitivity of the testing model: {sensitivity}')
print(f' Specificity of the testing model: {specificity}')
print(f' Sensitivity/Specificity of the testing model: {sensitivity_specificity}')

 Accuracy of the testing model: 0.8377192982456141
 Sensitivity of the testing model: 0.768
 Specificity of the testing model: 0.9223300970873787
 Sensitivity/Specificity of the testing model: 0.8326736842105263


  y = column_or_1d(y, warn=True)


## DT: 

In [29]:
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X_train,y_train)
y_pred = dtree.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sensitivity_specificity = sensitivity / specificity
print(f' Accuracy of the testing model: {accuracy_score(y_test, y_pred)}')
print(f' Sensitivity of the testing model: {sensitivity}')
print(f' Specificity of the testing model: {specificity}')
print(f' Sensitivity/Specificity of the testing model: {sensitivity_specificity}')

 Accuracy of the testing model: 0.7807017543859649
 Sensitivity of the testing model: 0.816
 Specificity of the testing model: 0.7378640776699029
 Sensitivity/Specificity of the testing model: 1.1058947368421053


## RF: 

In [33]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sensitivity_specificity = sensitivity / specificity
print(f' Accuracy of the testing model: {accuracy_score(y_test, y_pred)}')
print(f' Sensitivity of the testing model: {sensitivity}')
print(f' Specificity of the testing model: {specificity}')
print(f' Sensitivity/Specificity of the testing model: {sensitivity_specificity}')

  return fit_method(estimator, *args, **kwargs)


 Accuracy of the testing model: 0.868421052631579
 Sensitivity of the testing model: 0.856
 Specificity of the testing model: 0.883495145631068
 Sensitivity/Specificity of the testing model: 0.9688791208791208
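
The four evaluation cells repeat the same code, so they could be folded into one helper and a loop. The sketch below does this on synthetic data from `make_classification`; the helper name `evaluate`, the `models` dictionary, and the `random_state` values are illustrative choices, not part of the notebook. In the notebook itself the loop would receive the existing `X_train`, `X_test`, `y_train`, and `y_test`, with the label list adjusted to `'+'` and `'-'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model and return accuracy, sensitivity, and specificity."""
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    # labels=[0, 1] fixes the order so that class 1 is the positive class
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_te, y_hat),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Synthetic stand-in data with integer labels 0 and 1
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "SVC": SVC(kernel="linear"),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
}

results = {name: evaluate(m, X_tr, X_te, y_tr, y_te) for name, m in models.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The decision tree and random forest cells above do not set `random_state`, so their reported scores may vary somewhat from run to run; fixing the seed, as in this sketch, makes the comparison repeatable.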
