## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [2]:
import pandas as pd 
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [3]:
cc_apps = pd.read_csv("datasets/cc_approvals.data",header=None)
cc_apps

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,00202,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,00280,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,00100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,00260,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,00200,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,00200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,00280,750,-


In [4]:
cc_apps.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [5]:
cc_apps.drop(columns= [11, 13], inplace=True)
cc_apps

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,g,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,g,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,g,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,g,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,g,750,-


## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [6]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  12      690 non-null    object 
 12  14      690 non-null    int64  
 13  15      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [7]:
cc_apps.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [8]:
cc_apps.value_counts()

0  1      2       3  4  5   6  7      8  9  10  12  14    15
?  20.08  0.125   u  g  q   v  1.000  f  t  1   g   768   +     1
b  30.17  6.500   u  g  cc  v  3.125  t  t  8   g   1200  +     1
   29.67  1.415   u  g  w   h  0.750  t  t  1   g   100   +     1
   29.83  1.250   y  p  k   v  0.250  f  f  0   g   0     -     1
          2.040   y  p  x   h  0.040  f  f  0   g   1     -     1
                                                               ..
   16.50  0.125   u  g  c   v  0.165  f  f  0   g   0     -     1
   16.92  0.335   y  p  k   v  0.290  f  f  0   s   0     -     1
   17.08  0.085   y  p  c   v  0.040  f  f  0   g   722   -     1
          0.250   u  g  q   v  0.335  f  t  4   g   8     -     1
   ?      10.500  u  g  x   v  6.500  t  f  0   g   0     +     1
Name: count, Length: 690, dtype: int64

In [9]:
cc_apps[15] = cc_apps[15].replace('+', 1)
cc_apps[15] = cc_apps[15].replace('-', 0)
cc_apps[15] = cc_apps[15].astype('int32')
cc_apps

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,g,0,1
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,g,560,1
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,g,824,1
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,g,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,g,0,0
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,g,394,0
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,g,1,0
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,g,750,0


In [10]:
cc_apps.dtypes

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
12     object
14      int64
15      int32
dtype: object

## 4. Train Test Split

Do not split the dataset into X and y, just split the original dataset.

random_state=42

test_size=0.33

**spliting is done in step 9**

## 5. Handling Missing Values

Convert any '?' to a NaN value from both training and testing sets.

In [12]:
cc_apps.value_counts()

0  1      2       3  4  5   6  7      8  9  10  12  14    15
?  20.08  0.125   u  g  q   v  1.000  f  t  1   g   768   1     1
b  30.17  6.500   u  g  cc  v  3.125  t  t  8   g   1200  1     1
   29.67  1.415   u  g  w   h  0.750  t  t  1   g   100   1     1
   29.83  1.250   y  p  k   v  0.250  f  f  0   g   0     0     1
          2.040   y  p  x   h  0.040  f  f  0   g   1     0     1
                                                               ..
   16.50  0.125   u  g  c   v  0.165  f  f  0   g   0     0     1
   16.92  0.335   y  p  k   v  0.290  f  f  0   s   0     0     1
   17.08  0.085   y  p  c   v  0.040  f  f  0   g   722   0     1
          0.250   u  g  q   v  0.335  f  t  4   g   8     0     1
   ?      10.500  u  g  x   v  6.500  t  f  0   g   0     1     1
Name: count, Length: 690, dtype: int64

In [13]:
cc_apps = cc_apps.replace('?', np.nan)

print (cc_apps.value_counts(), '\n')

0  1      2       3  4   5   6  7       8  9  10  12  14      15
a  15.75  0.375   u  g   c   v  1.000   f  f  0   g   18      0     1
b  29.58  4.500   u  g   w   v  7.500   t  t  2   g   0       1     1
   29.67  0.750   y  p   c   v  0.040   f  f  0   g   0       0     1
          1.415   u  g   w   h  0.750   t  t  1   g   100     1     1
   29.83  1.250   y  p   k   v  0.250   f  f  0   g   0       0     1
                                                                   ..
   17.42  6.500   u  g   i   v  0.125   f  f  0   g   100     0     1
   17.50  22.000  l  gg  ff  o  0.000   f  f  0   p   100000  1     1
   17.58  10.000  u  g   w   h  0.165   f  t  1   g   1       0     1
   17.67  4.460   u  g   c   v  0.250   f  f  0   s   0       0     1
   76.75  22.290  u  g   e   z  12.750  t  t  1   g   109     1     1
Name: count, Length: 659, dtype: int64 



## 6. Handling Missing Values

Impute the numerical data for both training and testing sets with mean value.

In [14]:
cc_apps[[1, 2, 7, 10, 14]].isnull().sum()

1     12
2      0
7      0
10     0
14     0
dtype: int64

In [15]:
Continuous = [1, 2, 7, 10, 14]

cc_apps[1] = cc_apps[1].astype('float64')
cc_apps[1].fillna(value= cc_apps[1].mean(), inplace=True)

In [16]:
cc_apps[[1, 2, 7, 10, 14]].isnull().sum()

1     0
2     0
7     0
10    0
14    0
dtype: int64

## 7. Handling Missing Values

Impute the categorical data for both training and testing sets with mode value.

In [17]:
Categorical =  [0, 3, 4, 5, 6, 8, 9, 12]

In [18]:
cc_apps[Categorical].isnull().sum()

0     12
3      6
4      6
5      9
6      9
8      0
9      0
12     0
dtype: int64

In [19]:
cc_apps[0].value_counts()

0
b    468
a    210
Name: count, dtype: int64

In [20]:
for i in Categorical:
    cc_apps[i].fillna(value= cc_apps[i].mode()[0], inplace=True)
    
cc_apps[Categorical].isnull().sum()

0     0
3     0
4     0
5     0
6     0
8     0
9     0
12    0
dtype: int64

## 8. Encoding

The columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical, there are several methods we can use to encode the categorical columns. One of the method called get_dummies().

Use get_dummies() function to convert the categorical columns to a numerical columns (for training the machine learning algorithms).

Do not forget to convert both training and testing sets.

In [21]:
for i in Categorical:
    cc_apps[i].fillna(value= str(cc_apps[i].mode()), inplace=True)
    
cc_apps[Categorical].isnull().sum()

0     0
3     0
4     0
5     0
6     0
8     0
9     0
12    0
dtype: int64

In [22]:
Categorical =  [0, 3, 4, 5, 6, 8, 9, 12]
cc_apps = pd.get_dummies(cc_apps, columns= Categorical, dtype=int)
cc_apps

Unnamed: 0,1,2,7,10,14,15,0_a,0_b,3_l,3_u,...,6_o,6_v,6_z,8_f,8_t,9_f,9_t,12_g,12_p,12_s
0,30.83,0.000,1.25,1,0,1,0,1,0,1,...,0,1,0,0,1,0,1,1,0,0
1,58.67,4.460,3.04,6,560,1,1,0,0,1,...,0,0,0,0,1,0,1,1,0,0
2,24.50,0.500,1.50,0,824,1,1,0,0,1,...,0,0,0,0,1,1,0,1,0,0
3,27.83,1.540,3.75,5,3,1,0,1,0,1,...,0,1,0,0,1,0,1,1,0,0
4,20.17,5.625,1.71,0,0,1,0,1,0,1,...,0,1,0,0,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,21.08,10.085,1.25,0,0,0,0,1,0,0,...,0,0,0,1,0,1,0,1,0,0
686,22.67,0.750,2.00,2,394,0,1,0,0,1,...,0,1,0,1,0,0,1,1,0,0
687,25.25,13.500,2.00,1,1,0,1,0,0,0,...,0,0,0,1,0,0,1,1,0,0
688,17.92,0.205,0.04,0,750,0,0,1,0,1,...,0,1,0,1,0,1,0,1,0,0


## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

In [33]:
from sklearn.model_selection import train_test_split

X=cc_apps.drop(15, axis= 1)
Y=cc_apps[[15]]

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33, random_state=42)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((462, 43), (462, 1), (228, 43), (228, 1))

## 10. Normalization

In [34]:
from sklearn.preprocessing import MinMaxScaler

In [35]:
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

In [36]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

In [37]:
X_train_scaled.shape

(462, 43)

In [38]:
X_test_scaled.shape

(228, 43)

## 11. Train a Logistic Regression

In [39]:
from sklearn.linear_model import LogisticRegression

In [40]:
logi = LogisticRegression()

logi.fit(X_train_scaled, y_train)

## 12. Make predictions and evaluate the Logistic Regression Model

In [41]:
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, precision_score, recall_score, f1_score)

In [42]:
y_pred= logi.predict(X_test_scaled)

print('The accuracy of the model : ', accuracy_score(y_test, y_pred) )
print('The precision of the model : ', precision_score(y_test, y_pred))
print('The recall of the model : ', recall_score(y_test, y_pred))
print('The f1 score of the model : ', f1_score(y_test, y_pred))
print('The confusion matrix of the model : \n', confusion_matrix(y_test,y_pred))

The accuracy of the model :  0.8464912280701754
The precision of the model :  0.788135593220339
The recall of the model :  0.9029126213592233
The f1 score of the model :  0.8416289592760181
The confusion matrix of the model : 
 [[100  25]
 [ 10  93]]


## 13. Repeat the steps 11 and 12 for SVM, DT, and RF

In [43]:
from sklearn.svm import SVC
SVM = SVC()

SVM.fit(X_train_scaled, y_train)
y_pred= SVM.predict(X_test_scaled)

print('The accuracy of the model : ', accuracy_score(y_test, y_pred) )
print('The precision of the model : ', precision_score(y_test, y_pred))
print('The recall of the model : ', recall_score(y_test, y_pred))
print('The f1 score of the model : ', f1_score(y_test, y_pred))
print('The confusion matrix of the model : \n', confusion_matrix(y_test,y_pred))

The accuracy of the model :  0.8508771929824561
The precision of the model :  0.7899159663865546
The recall of the model :  0.912621359223301
The f1 score of the model :  0.8468468468468469
The confusion matrix of the model : 
 [[100  25]
 [  9  94]]


In [44]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()

DT.fit(X_train_scaled, y_train)
y_pred= DT.predict(X_test_scaled)

print('The accuracy of the model : ', accuracy_score(y_test, y_pred) )
print('The precision of the model : ', precision_score(y_test, y_pred))
print('The recall of the model : ', recall_score(y_test, y_pred))
print('The f1 score of the model : ', f1_score(y_test, y_pred))
print('The confusion matrix of the model : \n', confusion_matrix(y_test,y_pred))

The accuracy of the model :  0.8289473684210527
The precision of the model :  0.8333333333333334
The recall of the model :  0.7766990291262136
The f1 score of the model :  0.8040201005025125
The confusion matrix of the model : 
 [[109  16]
 [ 23  80]]


In [47]:
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()

RF.fit(X_train_scaled, y_train)
y_pred= RF.predict(X_test_scaled)

print('The accuracy of the model : ', accuracy_score(y_test, y_pred) )
print('The precision of the model : ', precision_score(y_test, y_pred))
print('The recall of the model : ', recall_score(y_test, y_pred))
print('The f1 score of the model : ', f1_score(y_test, y_pred))
print('The confusion matrix of the model : \n\n', confusion_matrix(y_test,y_pred))

The accuracy of the model :  0.8771929824561403
The precision of the model :  0.8440366972477065
The recall of the model :  0.8932038834951457
The f1 score of the model :  0.8679245283018868
The confusion matrix of the model : 

 [[108  17]
 [ 11  92]]
