## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix, recall_score, \
precision_score, f1_score, roc_auc_score ,classification_report

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [2]:
cc_apps  = pd.read_csv("datasets/cc_approvals.data",header=None)
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [3]:
cc_apps = cc_apps.drop(columns=[11, 13])

In [4]:
cc_apps.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15], dtype='int64')

## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [5]:
cc_apps.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [6]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  12      690 non-null    object 
 12  14      690 non-null    int64  
 13  15      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [7]:
cc_apps.tail(17)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,g,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,g,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,g,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,g,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,g,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,g,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,g,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,g,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,g,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,g,2,-


## 4. Train Test Split

Do not split the dataset into X and y, just split the original dataset.

random_state=42

test_size=0.33

In [95]:
# from sklearn.model_selection import train_test_split
# train, test = train_test_split(cc_apps, test_size=0.33, random_state=42)
# print(train.shape)
# print(test.shape)

## 5. Handling Missing Values

Convert any '?' to a NaN value from both training and testing sets.

In [8]:
import numpy as np
cc_apps.replace('?', np.nan, inplace=True)


## 6. Handling Missing Values

Impute the numerical data for both training and testing sets with mean value.

In [9]:
cc_apps.isna().sum().sum()


54

In [11]:
for column in cc_apps.columns:
    if cc_apps[column].dtype in ['int64', 'float64']:
        mean_value = cc_apps[column].mean()
        cc_apps[column].replace(np.nan, mean_value, inplace=True)
    else:
        mode_value = cc_apps[column].mode()[0]
        cc_apps[column].replace(np.nan, mode_value, inplace=True)

In [12]:
cc_apps.isna().sum().sum()

0

## 7. Handling Missing Values

Impute the categorical data for both training and testing sets with mode value.

In [91]:
# i did it in 6.

## 8. Encoding

The columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical, there are several methods we can use to encode the categorical columns. One of the method called get_dummies().

Use get_dummies() function to convert the categorical columns to a numerical columns (for training the machine learning algorithms).

Do not forget to convert both training and testing sets.

In [13]:
cat_columns = [0, 3, 4, 5, 6, 8, 9, 12]
cc_apps = pd.get_dummies(cc_apps, columns = cat_columns)

In [14]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 44 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       690 non-null    object 
 1   2       690 non-null    float64
 2   7       690 non-null    float64
 3   10      690 non-null    int64  
 4   14      690 non-null    int64  
 5   15      690 non-null    object 
 6   0_a     690 non-null    uint8  
 7   0_b     690 non-null    uint8  
 8   3_l     690 non-null    uint8  
 9   3_u     690 non-null    uint8  
 10  3_y     690 non-null    uint8  
 11  4_g     690 non-null    uint8  
 12  4_gg    690 non-null    uint8  
 13  4_p     690 non-null    uint8  
 14  5_aa    690 non-null    uint8  
 15  5_c     690 non-null    uint8  
 16  5_cc    690 non-null    uint8  
 17  5_d     690 non-null    uint8  
 18  5_e     690 non-null    uint8  
 19  5_ff    690 non-null    uint8  
 20  5_i     690 non-null    uint8  
 21  5_j     690 non-null    uint8  
 22  5_

In [15]:
cc_apps.shape

(690, 44)

In [16]:
cc_apps

Unnamed: 0,1,2,7,10,14,15,0_a,0_b,3_l,3_u,...,6_o,6_v,6_z,8_f,8_t,9_f,9_t,12_g,12_p,12_s
0,30.83,0.000,1.25,1,0,+,0,1,0,1,...,0,1,0,0,1,0,1,1,0,0
1,58.67,4.460,3.04,6,560,+,1,0,0,1,...,0,0,0,0,1,0,1,1,0,0
2,24.50,0.500,1.50,0,824,+,1,0,0,1,...,0,0,0,0,1,1,0,1,0,0
3,27.83,1.540,3.75,5,3,+,0,1,0,1,...,0,1,0,0,1,0,1,1,0,0
4,20.17,5.625,1.71,0,0,+,0,1,0,1,...,0,1,0,0,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,21.08,10.085,1.25,0,0,-,0,1,0,0,...,0,0,0,1,0,1,0,1,0,0
686,22.67,0.750,2.00,2,394,-,1,0,0,1,...,0,1,0,1,0,0,1,1,0,0
687,25.25,13.500,2.00,1,1,-,1,0,0,0,...,0,0,0,1,0,0,1,1,0,0
688,17.92,0.205,0.04,0,750,-,0,1,0,1,...,0,1,0,1,0,1,0,1,0,0


## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

In [17]:
X = cc_apps.drop(15, axis=1)
y = cc_apps.iloc[:, 15]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [18]:
X_train.shape

(462, 43)

In [19]:
X_test.shape

(228, 43)

## 10. Normalization

In [20]:
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

scaler = MinMaxScaler()
scaledX_train = scaler.fit_transform(X_train)
scaledX_test = scaler.fit_transform(X_test)

In [21]:
scaledX_train

array([[0.16793651, 0.0949307 , 0.225     , ..., 1.        , 0.        ,
        0.        ],
       [0.3147619 , 0.10442377, 0.2125    , ..., 1.        , 0.        ,
        0.        ],
       [0.29365079, 0.05695842, 0.0125    , ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.37825397, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.08333333, 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.07809524, 0.1898614 , 0.01875   , ..., 1.        , 0.        ,
        0.        ]])

## 11. Train a Logistic Regression

In [22]:
logreg = LogisticRegression()
logreg.fit(scaledX_train, y_train)

## 12. Make predictions and evaluate the Logistic Regression Model

In [23]:
y_pred = logreg.predict(scaledX_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1], dtype=uint8)

In [24]:
print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')
print(f'The precision of the model : {precision_score(y_test, y_pred)}')
print(f'The recall of the model : {recall_score(y_test, y_pred)}')
print(f'The f1 score of the model : {f1_score(y_test, y_pred)}')
print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')
print(f'The classification report :-\n{classification_report(y_test, y_pred)}')

The accuracy of the model : 1.0
The precision of the model : 1.0
The recall of the model : 1.0
The f1 score of the model : 1.0
The confusion matrix :-
[[180   0]
 [  0  48]]
The classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        48

    accuracy                           1.00       228
   macro avg       1.00      1.00      1.00       228
weighted avg       1.00      1.00      1.00       228



## 13. Repeat the steps 11 and 12 for SVM, DT, and RF

### SVM :

In [25]:
svm_model = svm.SVC()
svm_model.fit(scaledX_train,y_train)

In [26]:
y_pred = svm_model.predict(scaledX_test)
print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')
print(f'The precision of the model : {precision_score(y_test, y_pred)}')
print(f'The recall of the model : {recall_score(y_test, y_pred)}')
print(f'The f1 score of the model : {f1_score(y_test, y_pred)}')
print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')
print(f'The classification report :-\n{classification_report(y_test, y_pred)}')

The accuracy of the model : 1.0
The precision of the model : 1.0
The recall of the model : 1.0
The f1 score of the model : 1.0
The confusion matrix :-
[[180   0]
 [  0  48]]
The classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        48

    accuracy                           1.00       228
   macro avg       1.00      1.00      1.00       228
weighted avg       1.00      1.00      1.00       228



### DT :

In [27]:
classifier = DecisionTreeClassifier()
classifier.fit(scaledX_train,y_train)


In [28]:
y_pred = classifier.predict(scaledX_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1], dtype=uint8)

In [29]:
print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')
print(f'The precision of the model : {precision_score(y_test, y_pred)}')
print(f'The recall of the model : {recall_score(y_test, y_pred)}')
print(f'The f1 score of the model : {f1_score(y_test, y_pred)}')
print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')
print(f'The classification report :-\n{classification_report(y_test, y_pred)}')

The accuracy of the model : 1.0
The precision of the model : 1.0
The recall of the model : 1.0
The f1 score of the model : 1.0
The confusion matrix :-
[[180   0]
 [  0  48]]
The classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        48

    accuracy                           1.00       228
   macro avg       1.00      1.00      1.00       228
weighted avg       1.00      1.00      1.00       228



### RF :

In [30]:
classifier = RandomForestClassifier()
classifier.fit(scaledX_train,y_train)


In [31]:
y_pred = classifier.predict(scaledX_test)
print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')
print(f'The precision of the model : {precision_score(y_test, y_pred)}')
print(f'The recall of the model : {recall_score(y_test, y_pred)}')
print(f'The f1 score of the model : {f1_score(y_test, y_pred)}')
print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')
print(f'The classification report :-\n{classification_report(y_test, y_pred)}')

The accuracy of the model : 1.0
The precision of the model : 1.0
The recall of the model : 1.0
The f1 score of the model : 1.0
The confusion matrix :-
[[180   0]
 [  0  48]]
The classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       180
           1       1.00      1.00      1.00        48

    accuracy                           1.00       228
   macro avg       1.00      1.00      1.00       228
weighted avg       1.00      1.00      1.00       228

