## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of credit card applications. Many are rejected, for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [89]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, precision_score, recall_score, f1_score,
    roc_curve, auc, roc_auc_score,
)
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


In [90]:
# The file has no header row, so name the columns A1..A16 explicitly.
df = pd.read_csv('cc_approvals.data', names=[f'A{i}' for i in range(1, 17)])
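
The task list for this step asks for `header=None` and for dropping columns 11 and 13, while the cell above names the columns instead and keeps all of them. As a minimal sketch of the task-list variant, assuming two illustrative rows in the same header-less, comma-separated layout (the in-memory `raw` buffer stands in for the real file):

```python
import io

import pandas as pd

# Two illustrative rows in the same header-less, comma-separated layout.
raw = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
)

# header=None makes pandas label the columns 0..15 instead of
# consuming the first data row as a header.
cc_apps = pd.read_csv(raw, header=None)
print(cc_apps.head())

# Drop the columns labelled 11 and 13.
cc_apps = cc_apps.drop(columns=[11, 13])
print(cc_apps.shape)
```

With the A1..A16 naming used in this notebook, positional columns 11 and 13 would correspond to A12 and A14.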


## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [91]:
df.describe(include='all')

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
count,690,690,690.0,690,690,690,690,690.0,690,690,690.0,690,690,690.0,690.0,690
unique,3,350,,4,4,15,10,,2,2,,2,3,171.0,,2
top,b,?,,u,g,c,v,,t,f,,f,g,0.0,,-
freq,468,12,,519,519,137,399,,361,395,,374,625,132.0,,383
mean,,,4.758725,,,,,2.223406,,,2.4,,,,1017.385507,
std,,,4.978163,,,,,3.346513,,,4.86294,,,,5210.102598,
min,,,0.0,,,,,0.0,,,0.0,,,,0.0,
25%,,,1.0,,,,,0.165,,,0.0,,,,0.0,
50%,,,2.75,,,,,1.0,,,0.0,,,,5.0,
75%,,,7.2075,,,,,2.625,,,3.0,,,,395.5,


In [92]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 462 entries, 382 to 102
Data columns (total 47 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A2      462 non-null    float64
 1   A3      462 non-null    float64
 2   A8      462 non-null    float64
 3   A11     462 non-null    int64  
 4   A14     462 non-null    float64
 5   A15     462 non-null    int64  
 6   A16     462 non-null    int32  
 7   A1_a    462 non-null    int32  
 8   A1_b    462 non-null    int32  
 9   A4_l    462 non-null    int32  
 10  A4_u    462 non-null    int32  
 11  A4_y    462 non-null    int32  
 12  A5_g    462 non-null    int32  
 13  A5_gg   462 non-null    int32  
 14  A5_p    462 non-null    int32  
 15  A6_aa   462 non-null    int32  
 16  A6_c    462 non-null    int32  
 17  A6_cc   462 non-null    int32  
 18  A6_d    462 non-null    int32  
 19  A6_e    462 non-null    int32  
 20  A6_ff   462 non-null    int32  
 21  A6_i    462 non-null    int32  
 22  A6_j 

In [93]:
df.tail(17)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


## 4. Train Test Split

Do not split the dataset into X and y; just split the original dataset.

random_state=42

test_size=0.33

In [94]:
train, test = train_test_split(df, random_state=42, test_size=0.33)
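
To see how `test_size=0.33` translates into row counts, here is a small sketch on a stand-in frame with 690 rows (the `row_id` column and the `toy` names are made up for illustration). scikit-learn rounds the test share up, so 0.33 × 690 = 227.7 becomes 228 test rows and the remaining 462 go to training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the credit card data.
toy = pd.DataFrame({"row_id": np.arange(690)})

train_toy, test_toy = train_test_split(toy, random_state=42, test_size=0.33)

# The test share is rounded up: ceil(0.33 * 690) = 228.
print(len(train_toy), len(test_toy))
```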

In [95]:
train.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
382,a,24.33,2.5,y,p,i,bb,4.5,f,f,0,f,g,200,456,-
137,b,33.58,2.75,u,g,m,v,4.25,t,t,6,f,g,204,0,+
346,?,32.25,1.5,u,g,c,v,0.25,f,f,0,t,g,372,122,-
326,b,30.17,1.085,y,p,c,v,0.04,f,f,0,f,g,170,179,-
33,a,36.75,5.125,u,g,e,v,5.0,t,f,0,t,g,0,4000,+


## 5. Handling Missing Values

Convert every '?' in both the training and testing sets to NaN.

In [96]:
train = train.replace('?', np.nan)
test = test.replace('?', np.nan)
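
As an alternative sketch, the placeholder can also be handled while reading the file, using the `na_values` argument of `pd.read_csv`. The three-column toy data below is made up; it only shows that both routes mark the same cells as missing, and that parsing with `na_values` lets numeric columns come out as floats directly.

```python
import io

import numpy as np
import pandas as pd

# Made-up data with '?' as the missing-value placeholder.
raw = io.StringIO("a,?,1.5\nb,22.08,?\n?,30.0,2.0\n")
cols = ["A1", "A2", "A3"]

# Route 1: read everything, then replace '?' afterwards (as in the cell above).
toy = pd.read_csv(raw, header=None, names=cols)
replaced = toy.replace("?", np.nan)
print(replaced.isna().sum())

# Route 2: tell the parser that '?' means missing.
raw.seek(0)
parsed = pd.read_csv(raw, header=None, names=cols, na_values="?")
print(parsed.dtypes)
```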


## 6. Handling Missing Values

Impute the missing numerical values in both the training and testing sets with the column mean.

In [97]:
# A2 and A14 are stored as strings (they contained '?'), so cast them to float first.
train['A2'] = train['A2'].astype('float')
train['A14'] = train['A14'].astype('float')

test['A2'] = test['A2'].astype('float')
test['A14'] = test['A14'].astype('float')

# Fill the missing values in each set with that set's own column mean.
train['A2'] = train['A2'].fillna(train['A2'].mean())
train['A14'] = train['A14'].fillna(train['A14'].mean())

test['A2'] = test['A2'].fillna(test['A2'].mean())
test['A14'] = test['A14'].fillna(test['A14'].mean())
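
The cell above fills the testing set with the testing set's own means. A common alternative, sketched here on made-up numbers, is to compute the means on the training set only and reuse them for both sets, so that no statistic is derived from the test data. The `train_toy` and `test_toy` frames are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two numerical columns with missing values.
train_toy = pd.DataFrame({"A2": [20.0, np.nan, 40.0], "A14": [100.0, 200.0, np.nan]})
test_toy = pd.DataFrame({"A2": [np.nan, 50.0], "A14": [np.nan, 0.0]})

num_cols = ["A2", "A14"]

# Means computed on the training set only: A2 -> 30.0, A14 -> 150.0.
train_means = train_toy[num_cols].mean()

# fillna with a Series fills each column with the value under its label.
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(test_filled)
```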



## 7. Handling Missing Values

Impute the missing categorical values in both the training and testing sets with the column mode (the most frequent value).

In [98]:
train['A1'] = train['A1'].fillna(train['A1'].mode()[0])
train['A4'] = train['A4'].fillna(train['A4'].mode()[0])
train['A5'] = train['A5'].fillna(train['A5'].mode()[0])
train['A6'] = train['A6'].fillna(train['A6'].mode()[0])
train['A7'] = train['A7'].fillna(train['A7'].mode()[0])


test['A1'] = test['A1'].fillna(test['A1'].mode()[0])
test['A6'] = test['A6'].fillna(test['A6'].mode()[0])
test['A7'] = test['A7'].fillna(test['A7'].mode()[0])
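
The same fill can be written as a loop, which makes it harder to miss a column in one of the two sets. This is a minimal sketch on made-up data, assuming the mode is taken from the training set and reused for the testing set; the column list is illustrative, not the notebook's full list.

```python
import numpy as np
import pandas as pd

# Made-up categorical columns with a few missing entries.
train_toy = pd.DataFrame({"A1": ["b", "b", "a", np.nan], "A4": ["u", np.nan, "u", "y"]})
test_toy = pd.DataFrame({"A1": [np.nan, "a"], "A4": [np.nan, "y"]})

cat_cols = ["A1", "A4"]
for col in cat_cols:
    # mode() returns a Series; [0] picks the most frequent value.
    most_frequent = train_toy[col].mode()[0]
    train_toy[col] = train_toy[col].fillna(most_frequent)
    test_toy[col] = test_toy[col].fillna(most_frequent)

print(test_toy)
```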


## 8. Encoding

The columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical. There are several ways to encode categorical columns; one of them is get_dummies().

Use the get_dummies() function to convert the categorical columns to numerical columns (for training the machine learning algorithms).

Do not forget to convert both the training and testing sets.

In [100]:
df = pd.concat([train, test], axis=0)
df.reset_index(drop=True, inplace=True)
df['A16'] = LabelEncoder().fit_transform(df['A16'])
df = pd.get_dummies(df, dtype=int)

In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 47 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A2      690 non-null    float64
 1   A3      690 non-null    float64
 2   A8      690 non-null    float64
 3   A11     690 non-null    int64  
 4   A14     690 non-null    float64
 5   A15     690 non-null    int64  
 6   A16     690 non-null    int32  
 7   A1_a    690 non-null    int32  
 8   A1_b    690 non-null    int32  
 9   A4_l    690 non-null    int32  
 10  A4_u    690 non-null    int32  
 11  A4_y    690 non-null    int32  
 12  A5_g    690 non-null    int32  
 13  A5_gg   690 non-null    int32  
 14  A5_p    690 non-null    int32  
 15  A6_aa   690 non-null    int32  
 16  A6_c    690 non-null    int32  
 17  A6_cc   690 non-null    int32  
 18  A6_d    690 non-null    int32  
 19  A6_e    690 non-null    int32  
 20  A6_ff   690 non-null    int32  
 21  A6_i    690 non-null    int32  
 22  A6
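
The encoding cell concatenates the training and testing sets first, so both end up with the same dummy columns. If the two sets are encoded separately instead, a category that is absent from one set produces a missing column. One possible way to handle that, sketched on made-up data, is to align the testing columns to the training columns with `reindex`:

```python
import pandas as pd

# Made-up data: the testing set never contains the categories 'y' and 'l'.
train_toy = pd.DataFrame({"A2": [1.0, 2.0, 3.0], "A4": ["u", "y", "l"]})
test_toy = pd.DataFrame({"A2": [4.0, 5.0], "A4": ["u", "u"]})

train_enc = pd.get_dummies(train_toy, dtype=int)
test_enc = pd.get_dummies(test_toy, dtype=int)

# Add the dummy columns the testing set lacks (filled with 0) and
# put all columns in the same order as the training set.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```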

## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

In [105]:
X = df.drop('A16',axis=1)
y = df['A16']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(f'Features shapes: Train: {X_train.shape} , Test: {X_test.shape}')
print(f'Target shapes: Train: {y_train.shape} , Test: {y_test.shape}')

Features shapes: Train: (462, 46) , Test: (228, 46)
Target shapes: Train: (462,) , Test: (228,)


## 10. Normalization

In [106]:
scaler = MinMaxScaler()
scaled = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
X_train[scaled] = scaler.fit_transform(X_train[scaled])
X_test[scaled] = scaler.transform(X_test[scaled])
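
MinMaxScaler rescales each column as (x - min) / (max - min), using the minimum and maximum learned in `fit`. The sketch below uses a single made-up column to show the arithmetic, and that test values outside the training range land outside [0, 1] because the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data: the training range is 0..10.
train_col = np.array([[0.0], [5.0], [10.0]])
test_col = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)  # learns min=0, max=10
test_scaled = scaler.transform(test_col)        # reuses the training min/max

print(train_scaled.ravel())  # (x - 0) / (10 - 0)
print(test_scaled.ravel())   # 20.0 maps to 2.0, outside [0, 1]
```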

## 11. Train a Logistic Regression

In [107]:

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
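
Steps 10 and 11 can also be chained so that the scaler is always fitted on the training data only. This is a sketch under stated assumptions rather than the notebook's method: it uses scikit-learn's `make_pipeline` on synthetic data from `make_classification`, and the `max_iter=1000` setting is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, not the credit card dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

# fit() scales X_tr and trains the model; score() and predict()
# reuse the scaling learned from the training data.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"accuracy: {test_accuracy:.3f}")
```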

## 12. Make predictions and evaluate the Logistic Regression Model

In [110]:

y_pred = logmodel.predict(X_test)

print(f'accuracy: {accuracy_score(y_test,y_pred)}\n')
print(f'f1_score: {f1_score(y_test,y_pred, average="micro")}\n')
print(f'precision: {precision_score(y_test,y_pred,average="micro")}\n')
print(f'recall: {recall_score(y_test,y_pred, average="micro")}\n')
print(f'classification report:\n {classification_report(y_test,y_pred)}')

accuracy: 0.881578947368421

f1_score: 0.881578947368421

precision: 0.881578947368421

recall: 0.881578947368421

classification report:
               precision    recall  f1-score   support

           0       0.85      0.92      0.88       109
           1       0.92      0.85      0.88       119

    accuracy                           0.88       228
   macro avg       0.88      0.88      0.88       228
weighted avg       0.88      0.88      0.88       228
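
The four printed scores are identical because, for a single-label problem like this one, micro-averaged precision, recall, and F1 all reduce to accuracy. The sketch below uses made-up labels to show this, and contrasts it with `average="binary"` (the scikit-learn default), which scores the positive class only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 5 of 8 predictions are correct.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)

# Micro-averaging pools the counts over both classes.
micro_p = precision_score(y_true, y_hat, average="micro")
micro_r = recall_score(y_true, y_hat, average="micro")
micro_f1 = f1_score(y_true, y_hat, average="micro")

# Binary averaging looks at class 1 only: 3 true positives,
# 1 false positive, 2 false negatives.
binary_p = precision_score(y_true, y_hat, average="binary")
binary_r = recall_score(y_true, y_hat, average="binary")

print(acc, micro_p, micro_r, micro_f1)
print(binary_p, binary_r)
```

The per-class rows of the classification report already give the class-specific precision and recall.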



## 13. Repeat steps 11 and 12 for SVM, Decision Tree (DT), and Random Forest (RF)

## SVC:

In [111]:
svc = SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)

print(f'accuracy: {accuracy_score(y_test,y_pred)}\n')
print(f'f1_score: {f1_score(y_test,y_pred, average="micro")}\n')
print(f'precision: {precision_score(y_test,y_pred,average="micro")}\n')
print(f'recall: {recall_score(y_test,y_pred, average="micro")}\n')
print(f'classification report:\n {classification_report(y_test,y_pred)}')

accuracy: 0.868421052631579

f1_score: 0.868421052631579

precision: 0.868421052631579

recall: 0.868421052631579

classification report:
               precision    recall  f1-score   support

           0       0.82      0.94      0.87       109
           1       0.93      0.81      0.86       119

    accuracy                           0.87       228
   macro avg       0.87      0.87      0.87       228
weighted avg       0.88      0.87      0.87       228



## DT: 

In [112]:
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X_train,y_train)
y_pred = dtree.predict(X_test)

print(f'accuracy: {accuracy_score(y_test,y_pred)}\n')
print(f'f1_score: {f1_score(y_test,y_pred, average="micro")}\n')
print(f'precision: {precision_score(y_test,y_pred,average="micro")}\n')
print(f'recall: {recall_score(y_test,y_pred, average="micro")}\n')
print(f'classification report:\n {classification_report(y_test,y_pred)}')

accuracy: 0.8201754385964912

f1_score: 0.8201754385964911

precision: 0.8201754385964912

recall: 0.8201754385964912

classification report:
               precision    recall  f1-score   support

           0       0.85      0.76      0.80       109
           1       0.80      0.87      0.84       119

    accuracy                           0.82       228
   macro avg       0.82      0.82      0.82       228
weighted avg       0.82      0.82      0.82       228



## RF: 

In [113]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(f'accuracy: {accuracy_score(y_test,y_pred)}\n')
print(f'f1_score: {f1_score(y_test,y_pred, average="micro")}\n')
print(f'precision: {precision_score(y_test,y_pred,average="micro")}\n')
print(f'recall: {recall_score(y_test,y_pred, average="micro")}\n')
print(f'classification report:\n {classification_report(y_test,y_pred)}')

accuracy: 0.8771929824561403

f1_score: 0.8771929824561403

precision: 0.8771929824561403

recall: 0.8771929824561403

classification report:
               precision    recall  f1-score   support

           0       0.87      0.87      0.87       109
           1       0.88      0.88      0.88       119

    accuracy                           0.88       228
   macro avg       0.88      0.88      0.88       228
weighted avg       0.88      0.88      0.88       228
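
The four fit-and-evaluate cells share the same structure, so they could be folded into one loop. The sketch below assumes synthetic data from `make_classification` in place of the credit card features, and fixes `random_state` for the tree-based models, which the cells above do not do, so that repeated runs give the same scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the credit card dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(kernel="linear"),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # f1_score defaults to average="binary", i.e. it scores class 1.
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }
    print(f"{name}: {results[name]}")
```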

