## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [284]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [285]:
cc_apps = pd.read_csv('datasets/cc_approvals.data', header=None)
# Drop columns 11 and 13 in one call; columns= already implies axis=1
cc_apps = cc_apps.drop(columns=[11, 13])
cc_apps.head()

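| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |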


## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [286]:
cc_apps.describe()

| | 2 | 7 | 10 | 14 |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |


In [287]:
cc_apps.describe(include='all')

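| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 690 | 690 | 690.0 | 690 | 690 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690 | 690.0 | 690 |
| unique | 3 | 350 | | 4 | 4 | 15 | 10 | | 2 | 2 | | 3 | | 2 |
| top | b | ? | | u | g | c | v | | t | f | | g | | - |
| freq | 468 | 12 | | 519 | 519 | 137 | 399 | | 361 | 395 | | 625 | | 383 |
| mean | | | 4.758725 | | | | | 2.223406 | | | 2.4 | | 1017.385507 | |
| std | | | 4.978163 | | | | | 3.346513 | | | 4.86294 | | 5210.102598 | |
| min | | | 0.0 | | | | | 0.0 | | | 0.0 | | 0.0 | |
| 25% | | | 1.0 | | | | | 0.165 | | | 0.0 | | 0.0 | |
| 50% | | | 2.75 | | | | | 1.0 | | | 0.0 | | 5.0 | |
| 75% | | | 7.2075 | | | | | 2.625 | | | 3.0 | | 395.5 | |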


In [288]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  12      690 non-null    object 
 12  14      690 non-null    int64  
 13  15      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [289]:
cc_apps.tail(17)

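| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | g | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | g | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | g | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | g | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | g | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | g | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | g | 0 | - |
| 680 | b | 19.5 | 0.29 | u | g | k | v | 0.29 | f | f | 0 | g | 364 | - |
| 681 | b | 27.83 | 1.0 | y | p | d | h | 3.0 | f | f | 0 | g | 537 | - |
| 682 | b | 17.08 | 3.29 | u | g | i | v | 0.335 | f | f | 0 | g | 2 | - |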


In [290]:
cc_apps.isna().value_counts()

0      1      2      3      4      5      6      7      8      9      10     12     14     15   
False  False  False  False  False  False  False  False  False  False  False  False  False  False    690
Name: count, dtype: int64

## 4. Train Test Split

Do not split the dataset into X and y; just split the original dataset, with random_state=42 and test_size=0.33.

In [291]:
from sklearn.model_selection import train_test_split
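
The cell above only imports train_test_split. As a minimal sketch of the step described here, the following splits a whole DataFrame without separating features from the target. The toy_apps frame and the toy_train and toy_test names are hypothetical stand-ins for cc_apps, with made-up rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for cc_apps: a few feature columns plus the label in column 15.
toy_apps = pd.DataFrame({
    0: ["a", "b", "?", "a", "b", "a", "b", "a", "b"],
    1: ["30.83", "58.67", "24.50", "?", "20.17", "32.08", "33.17", "22.92", "54.42"],
    2: [0.0, 4.46, 0.5, 1.54, 5.625, 4.0, 1.04, 11.585, 0.5],
    15: ["+", "+", "+", "+", "+", "-", "-", "-", "-"],
})

# Passing a single DataFrame returns two DataFrames: the training rows and the test rows.
toy_train, toy_test = train_test_split(toy_apps, test_size=0.33, random_state=42)

print(toy_train.shape, toy_test.shape)
```

Splitting before any imputation or scaling means that statistics such as means and modes can later be computed from the training rows alone.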

## 5. Handling Missing Values

Convert any '?' to a NaN value from both training and testing sets.

In [292]:
cc_apps = cc_apps.replace('?', np.nan)

In [293]:
cc_apps.isna().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
12     0
14     0
15     0
dtype: int64
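
The cell above replaces '?' on the full cc_apps DataFrame. If the data has already been split, the same replacement can be applied to each part. This is a small sketch under that assumption; toy_train and toy_test are hypothetical frames with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test splits that still contain the '?' placeholder.
toy_train = pd.DataFrame({0: ["a", "?", "b"], 1: ["30.83", "58.67", "?"]})
toy_test = pd.DataFrame({0: ["b", "a"], 1: ["?", "24.50"]})

# replace() returns a new DataFrame, so assign the result back to each split.
toy_train = toy_train.replace("?", np.nan)
toy_test = toy_test.replace("?", np.nan)

print(toy_train.isna().sum())
print(toy_test.isna().sum())
```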

## 6. Handling Missing Values

Impute the numerical data for both training and testing sets with mean value.

In [294]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       678 non-null    object 
 1   1       678 non-null    object 
 2   2       690 non-null    float64
 3   3       684 non-null    object 
 4   4       684 non-null    object 
 5   5       681 non-null    object 
 6   6       681 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  12      690 non-null    object 
 12  14      690 non-null    int64  
 13  15      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [295]:
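# Column 1 was read as strings because of the '?' placeholders; cast it to float
cc_apps[1] = cc_apps[1].astype(float)

# Assign back instead of using inplace=True on a selected column, which newer pandas versions warn about
cc_apps[1] = cc_apps[1].fillna(cc_apps[1].mean())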

In [296]:
cc_apps.isna().sum()

0     12
1      0
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
12     0
14     0
15     0
dtype: int64
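
The cells above compute the mean over all 690 rows. With separate training and testing sets, one common approach is to compute the mean on the training rows only and reuse it for the test rows. A minimal sketch, assuming hypothetical toy_train and toy_test frames:

```python
import numpy as np
import pandas as pd

# Hypothetical splits; column 1 holds numbers stored as strings, as in the raw file.
toy_train = pd.DataFrame({1: ["20.0", "40.0", np.nan, "30.0"]})
toy_test = pd.DataFrame({1: [np.nan, "50.0"]})

# Cast to float before computing any statistics.
toy_train[1] = toy_train[1].astype(float)
toy_test[1] = toy_test[1].astype(float)

# Learn the fill value from the training rows only, then reuse it for the test rows.
train_mean = toy_train[1].mean()
toy_train[1] = toy_train[1].fillna(train_mean)
toy_test[1] = toy_test[1].fillna(train_mean)

print(toy_train[1].tolist(), toy_test[1].tolist())
```

Reusing the training mean keeps information from the test rows out of the preprocessing step.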

## 7. Handling Missing Values

Impute the categorical data for both training and testing sets with mode value.

In [297]:
# Fill each categorical column that has missing values with its most frequent value
for col in [0, 3, 4, 5, 6]:
    cc_apps[col] = cc_apps[col].fillna(cc_apps[col].mode()[0])


In [298]:
cc_apps.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
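
The same idea carries over to mode imputation when the data is already split: take the most frequent value from the training rows and apply it to both sets. This is a sketch with hypothetical toy_train and toy_test frames, not the notebook's own data.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical splits with missing entries.
toy_train = pd.DataFrame({0: ["b", "a", np.nan, "b"], 3: ["u", np.nan, "u", "y"]})
toy_test = pd.DataFrame({0: [np.nan, "a"], 3: ["y", np.nan]})

for col in toy_train.columns:
    # mode() returns a Series of the most frequent values; take the first one.
    train_mode = toy_train[col].mode()[0]
    toy_train[col] = toy_train[col].fillna(train_mode)
    toy_test[col] = toy_test[col].fillna(train_mode)

print(toy_train)
print(toy_test)
```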

## 8. Encoding

Columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical. There are several ways to encode categorical columns; one of them is the pandas get_dummies() function.

Use get_dummies() to convert the categorical columns to numerical columns so they can be used to train the machine learning algorithms.

Do not forget to convert both training and testing sets.

In [299]:
# One-hot encode the categorical feature columns and the label column 15
Dcc_apps = pd.get_dummies(cc_apps, columns=[0, 3, 4, 5, 6, 8, 9, 12, 15])

In [300]:
Dcc_apps.columns

Index([     1,      2,      7,     10,     14,  '0_a',  '0_b',  '3_l',  '3_u',
        '3_y',  '4_g', '4_gg',  '4_p', '5_aa',  '5_c', '5_cc',  '5_d',  '5_e',
       '5_ff',  '5_i',  '5_j',  '5_k',  '5_m',  '5_q',  '5_r',  '5_w',  '5_x',
       '6_bb', '6_dd', '6_ff',  '6_h',  '6_j',  '6_n',  '6_o',  '6_v',  '6_z',
        '8_f',  '8_t',  '9_f',  '9_t', '12_g', '12_p', '12_s', '15_+', '15_-'],
      dtype='object')

In [301]:
Dcc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 45 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       690 non-null    float64
 1   2       690 non-null    float64
 2   7       690 non-null    float64
 3   10      690 non-null    int64  
 4   14      690 non-null    int64  
 5   0_a     690 non-null    bool   
 6   0_b     690 non-null    bool   
 7   3_l     690 non-null    bool   
 8   3_u     690 non-null    bool   
 9   3_y     690 non-null    bool   
 10  4_g     690 non-null    bool   
 11  4_gg    690 non-null    bool   
 12  4_p     690 non-null    bool   
 13  5_aa    690 non-null    bool   
 14  5_c     690 non-null    bool   
 15  5_cc    690 non-null    bool   
 16  5_d     690 non-null    bool   
 17  5_e     690 non-null    bool   
 18  5_ff    690 non-null    bool   
 19  5_i     690 non-null    bool   
 20  5_j     690 non-null    bool   
 21  5_k     690 non-null    bool   
 22  5_

In [302]:
Dcc_apps

| | 1 | 2 | 7 | 10 | 14 | 0_a | 0_b | 3_l | 3_u | 3_y | ... | 6_z | 8_f | 8_t | 9_f | 9_t | 12_g | 12_p | 12_s | 15_+ | 15_- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.83 | 0.000 | 1.25 | 1 | 0 | False | True | False | True | False | ... | False | False | True | False | True | True | False | False | True | False |
| 1 | 58.67 | 4.460 | 3.04 | 6 | 560 | True | False | False | True | False | ... | False | False | True | False | True | True | False | False | True | False |
| 2 | 24.50 | 0.500 | 1.50 | 0 | 824 | True | False | False | True | False | ... | False | False | True | True | False | True | False | False | True | False |
| 3 | 27.83 | 1.540 | 3.75 | 5 | 3 | False | True | False | True | False | ... | False | False | True | False | True | True | False | False | True | False |
| 4 | 20.17 | 5.625 | 1.71 | 0 | 0 | False | True | False | True | False | ... | False | False | True | True | False | False | False | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | 21.08 | 10.085 | 1.25 | 0 | 0 | False | True | False | False | True | ... | False | True | False | True | False | True | False | False | False | True |
| 686 | 22.67 | 0.750 | 2.00 | 2 | 394 | True | False | False | True | False | ... | False | True | False | False | True | True | False | False | False | True |
| 687 | 25.25 | 13.500 | 2.00 | 1 | 1 | True | False | False | False | True | ... | False | True | False | False | True | True | False | False | False | True |
| 688 | 17.92 | 0.205 | 0.04 | 0 | 750 | False | True | False | True | False | ... | False | True | False | True | False | True | False | False | False | True |
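
The cell above encodes the full DataFrame in one call. When training and testing sets are encoded separately, a category that appears in only one of them produces mismatched columns. One way to handle this, sketched here with hypothetical toy_train and toy_test frames, is to reindex the test columns to match the training columns.

```python
import pandas as pd

# Hypothetical splits: category 'l' in column 3 appears only in the training rows.
toy_train = pd.DataFrame({3: ["u", "y", "l", "u"], 7: [1.25, 3.04, 1.5, 3.75]})
toy_test = pd.DataFrame({3: ["u", "y"], 7: [1.71, 2.5]})

train_enc = pd.get_dummies(toy_train, columns=[3])
test_enc = pd.get_dummies(toy_test, columns=[3])

# The test split has no '3_l' column yet; align it to the training layout
# and fill the missing indicator with False.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Any category that occurs only in the test rows is dropped by this reindex, which matches what a model trained on the training columns can actually use.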


## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

In [303]:
x = Dcc_apps.drop(['15_+', '15_-'], axis=1)
y = Dcc_apps[['15_+', '15_-']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
print('test is :', y_test.count(),'train is :', y_train.count())

test is : 15_+    228
15_-    228
dtype: int64 train is : 15_+    462
15_-    462
dtype: int64


## 10. Normalization

In [304]:
from sklearn.preprocessing import MinMaxScaler

In [305]:
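scaler = MinMaxScaler()
# get_dummies left a mix of integer and string column names; make them all strings
x_train.columns = x_train.columns.astype(str)
x_test.columns = x_test.columns.astype(str)
# Fit the scaler on the training split only, then reuse it on the test split
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)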

## 11. Train a Logistic Regression

In [306]:
from sklearn.linear_model import LogisticRegression

In [307]:
# Keep a single boolean target: True where the original label was '+'
y_train = y_train['15_+']
y_test = y_test['15_+']
logreg = LogisticRegression()
logreg.fit(x_train, y_train)

## 12. Make predictions and evaluate the Logistic Regression Model

In [308]:
from sklearn.metrics import classification_report

In [309]:
y_pred = logreg.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.91      0.80      0.85       125
        True       0.79      0.90      0.84       103

    accuracy                           0.85       228
   macro avg       0.85      0.85      0.85       228
weighted avg       0.85      0.85      0.85       228



## 13. Repeat steps 11 and 12 for SVM, Decision Tree, and Random Forest

In [310]:
from sklearn.svm import SVC
model= SVC(kernel='linear')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.92      0.77      0.84       125
        True       0.77      0.92      0.84       103

    accuracy                           0.84       228
   macro avg       0.84      0.85      0.84       228
weighted avg       0.85      0.84      0.84       228



In [311]:
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier()
DTC.fit(x_train, y_train)
y_pred = DTC.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.82      0.84      0.83       125
        True       0.80      0.78      0.79       103

    accuracy                           0.81       228
   macro avg       0.81      0.81      0.81       228
weighted avg       0.81      0.81      0.81       228



In [312]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier()
RFC.fit(x_train, y_train)
y_pred = RFC.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.88      0.86      0.87       125
        True       0.83      0.86      0.85       103

    accuracy                           0.86       228
   macro avg       0.86      0.86      0.86       228
weighted avg       0.86      0.86      0.86       228
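
The four fit-and-evaluate cells follow the same pattern, so they could also be written as one loop. The sketch below assumes synthetic data from make_classification as a stand-in for the encoded credit data, and the models and scores dictionaries are hypothetical names. Setting random_state on the tree-based models is an addition here to make repeated runs comparable; the cells above do not set it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded credit data (same row count, made-up features).
X, y = make_classification(n_samples=690, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Scale with statistics learned from the training rows only.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "LogisticRegression": LogisticRegression(),
    "SVC (linear)": SVC(kernel="linear"),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```

The scores printed by this sketch describe the synthetic data only and are not comparable to the reports above.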

