## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [172]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive


In [173]:
import sys                             # Read system parameters.
import numpy as np                     # Work with multi-dimensional arrays and matrices.
import pandas as pd                    # Manipulate and analyze data.
import seaborn as sns                  # Perform data visualization.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer




In [174]:
cc_apps= pd.read_csv('cc_approvals.data', header=None)
cc_apps.head()



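|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |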


In [175]:
# Drop columns 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

print("\nDataFrame after dropping columns 11 and 13:")
cc_apps.head()



DataFrame after dropping columns 11 and 13:


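|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |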
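
The loading steps above can be tried without the data file. The sketch below is only an illustration: it reads the five records shown in the first output from an in-memory string (the names `raw` and `sample` are made up for this example) and drops columns 11 and 13 in the same way.

```python
import io

import pandas as pd

# The first five records, copied from the notebook output above.
raw = """b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
"""

# header=None tells pandas the file has no header row, so the columns
# are labelled with the integers 0..15.
sample = pd.read_csv(io.StringIO(raw), header=None)

# Drop by label; the remaining labels keep their original numbers.
sample = sample.drop(columns=[11, 13])

print(sample.columns.tolist())
```

Because the columns are dropped by label, the surviving columns are still called 12, 14, and 15, which is why those labels appear in the later outputs.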


## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [176]:
#Print the basic statistics
cc_apps.describe()

|   | 2 | 7 | 10 | 14 |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |


In [177]:
#Print the information of the dataset
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  12      690 non-null    object 
 12  14      690 non-null    int64  
 13  15      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [178]:
cc_apps.value_counts()

0  1      2       3  4  5   6  7      8  9  10  12  14    15
?  20.08  0.125   u  g  q   v  1.000  f  t  1   g   768   +     1
b  30.17  6.500   u  g  cc  v  3.125  t  t  8   g   1200  +     1
   29.67  1.415   u  g  w   h  0.750  t  t  1   g   100   +     1
   29.83  1.250   y  p  k   v  0.250  f  f  0   g   0     -     1
          2.040   y  p  x   h  0.040  f  f  0   g   1     -     1
                                                               ..
   16.50  0.125   u  g  c   v  0.165  f  f  0   g   0     -     1
   16.92  0.335   y  p  k   v  0.290  f  f  0   s   0     -     1
   17.08  0.085   y  p  c   v  0.040  f  f  0   g   722   -     1
          0.250   u  g  q   v  0.335  f  t  4   g   8     -     1
   ?      10.500  u  g  x   v  6.500  t  f  0   g   0     +     1
Length: 690, dtype: int64

In [179]:
#Print the last 17 rows.
cc_apps.tail(17)

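|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | g | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | g | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | g | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | g | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | g | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | g | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | g | 0 | - |
| 680 | b | 19.5 | 0.29 | u | g | k | v | 0.29 | f | f | 0 | g | 364 | - |
| 681 | b | 27.83 | 1.0 | y | p | d | h | 3.0 | f | f | 0 | g | 537 | - |
| 682 | b | 17.08 | 3.29 | u | g | i | v | 0.335 | f | f | 0 | g | 2 | - |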
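
Note that `info()` reports 690 non-null values for every column even though rows such as 673 contain `?`. The placeholder is an ordinary string, so pandas does not count it as missing. A minimal sketch of how the placeholders could be counted per column, using a made-up miniature frame (`demo` is hypothetical; in the notebook the frame would be `cc_apps`):

```python
import pandas as pd

# Hypothetical miniature frame with '?' placeholders.
demo = pd.DataFrame({
    0: ["b", "?", "a", "b"],
    1: ["30.83", "29.5", "?", "20.17"],  # numbers stored as text because of '?'
    2: [0.0, 2.0, 10.5, 5.625],
})

# Element-wise comparison gives a boolean frame; summing counts the matches.
placeholder_counts = demo.eq("?").sum()

print(placeholder_counts)
print(demo.isnull().sum())  # all zeros: '?' is not treated as missing
```

A single `?` is enough for pandas to read an otherwise numeric column as text, which matches column 1 being listed with dtype `object` in the `info()` output.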


## 4. Train Test Split

Do not split the dataset into X and y, just split the original dataset.

random_state=42

test_size=0.33

In [180]:
random_state = 42
test_size = 0.33

# train-test split
train_set, test_set = train_test_split(cc_apps, test_size=test_size, random_state=random_state)

print("Training set shape:", train_set.shape)
print("Test set shape:", test_set.shape)


Training set shape: (462, 14)
Test set shape: (228, 14)
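
The 462/228 split follows from how `train_test_split` rounds: the test size is rounded up (0.33 × 690 = 227.7, so 228 rows) and the training set takes the rest. The sketch below reproduces the sizes on a made-up frame of 690 rows (`frame` is a stand-in, not the real data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the dataset.
frame = pd.DataFrame({"feature": np.arange(690), "label": np.arange(690) % 2})

# Passing a single DataFrame returns two DataFrames (train, test).
train_part, test_part = train_test_split(frame, test_size=0.33, random_state=42)

print(len(train_part), len(test_part))

# The two parts never share a row.
print(train_part.index.intersection(test_part.index).empty)
```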


## 5. Handling Missing Values

Convert every '?' to NaN in both the training and testing sets.

In [181]:
# Replace '?' with NaN in both the training and testing sets.
# Assigning the result (instead of inplace=True) avoids modifying a slice of cc_apps in place.
train_set = train_set.replace('?', np.nan)
test_set = test_set.replace('?', np.nan)

print("Training set with NaN values:")
train_set.head()



Training set with NaN values:


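|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 382 | a | 24.33 | 2.5 | y | p | i | bb | 4.5 | f | f | 0 | g | 456 | - |
| 137 | b | 33.58 | 2.75 | u | g | m | v | 4.25 | t | t | 6 | g | 0 | + |
| 346 | NaN | 32.25 | 1.5 | u | g | c | v | 0.25 | f | f | 0 | g | 122 | - |
| 326 | b | 30.17 | 1.085 | y | p | c | v | 0.04 | f | f | 0 | g | 179 | - |
| 33 | a | 36.75 | 5.125 | u | g | e | v | 5.0 | t | f | 0 | g | 4000 | + |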


In [182]:
print("\nTest set with NaN values:")
test_set.head()


Test set with NaN values:


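|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 286 | a | NaN | 1.5 | u | g | ff | ff | 0.0 | f | t | 2 | g | 105 | - |
| 511 | a | 46.0 | 4.0 | u | g | j | j | 0.0 | t | f | 0 | g | 960 | + |
| 257 | b | 20.0 | 0.0 | u | g | d | v | 0.5 | f | f | 0 | g | 0 | - |
| 336 | b | 47.33 | 6.5 | u | g | c | v | 1.0 | f | f | 0 | g | 228 | - |
| 318 | b | 19.17 | 0.0 | y | p | m | bb | 0.0 | f | f | 0 | s | 1 | + |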


## 6. Handling Missing Values

Impute the numerical data in both the training and testing sets with the mean value.

In [183]:
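# Numerical columns, selected by dtype. Column 1 holds numbers but was read as
# object because of its '?' entries, so it is not selected here.
numerical_columns = cc_apps.select_dtypes(include=np.number).columns.tolist()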

imputer = SimpleImputer(strategy='mean')

train_set[numerical_columns] = imputer.fit_transform(train_set[numerical_columns])
test_set[numerical_columns] = imputer.transform(test_set[numerical_columns])

In [184]:
# after imputation
print("Training set after imputation:")
train_set.head()


Training set after imputation:


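|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 382 | a | 24.33 | 2.5 | y | p | i | bb | 4.5 | f | f | 0.0 | g | 456.0 | - |
| 137 | b | 33.58 | 2.75 | u | g | m | v | 4.25 | t | t | 6.0 | g | 0.0 | + |
| 346 | NaN | 32.25 | 1.5 | u | g | c | v | 0.25 | f | f | 0.0 | g | 122.0 | - |
| 326 | b | 30.17 | 1.085 | y | p | c | v | 0.04 | f | f | 0.0 | g | 179.0 | - |
| 33 | a | 36.75 | 5.125 | u | g | e | v | 5.0 | t | f | 0.0 | g | 4000.0 | + |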


In [185]:
train_set.isnull().sum()

0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64

In [186]:
print("\nTest set after imputation:")
test_set.head()


Test set after imputation:


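|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 286 | a | NaN | 1.5 | u | g | ff | ff | 0.0 | f | t | 2.0 | g | 105.0 | - |
| 511 | a | 46.0 | 4.0 | u | g | j | j | 0.0 | t | f | 0.0 | g | 960.0 | + |
| 257 | b | 20.0 | 0.0 | u | g | d | v | 0.5 | f | f | 0.0 | g | 0.0 | - |
| 336 | b | 47.33 | 6.5 | u | g | c | v | 1.0 | f | f | 0.0 | g | 228.0 | - |
| 318 | b | 19.17 | 0.0 | y | p | m | bb | 0.0 | f | f | 0.0 | s | 1.0 | + |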
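
Column 1 still has missing values after this step (5 in the training set, and row 286 of the test set). It was read as `object`, so `select_dtypes(include=np.number)` skips it and it is filled later together with the categorical columns. One possible alternative, sketched here on made-up data with hypothetical column names `col_1` and `col_2`, is to convert the column to floats first so that it takes part in the mean imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature train/test frames; col_1 is numeric but stored as text.
train_demo = pd.DataFrame({
    "col_1": ["24.33", "33.58", np.nan, "30.17"],
    "col_2": [2.5, 2.75, 1.5, 1.085],
})
test_demo = pd.DataFrame({
    "col_1": [np.nan, "46.0"],
    "col_2": [1.5, 4.0],
})

# Convert the text column to floats; anything unparseable becomes NaN.
train_demo["col_1"] = pd.to_numeric(train_demo["col_1"], errors="coerce")
test_demo["col_1"] = pd.to_numeric(test_demo["col_1"], errors="coerce")

numeric_cols = ["col_1", "col_2"]
mean_imputer = SimpleImputer(strategy="mean")

# Learn the means from the training set only, then reuse them for the test set.
train_demo[numeric_cols] = mean_imputer.fit_transform(train_demo[numeric_cols])
test_demo[numeric_cols] = mean_imputer.transform(test_demo[numeric_cols])

print(test_demo)
```

In this toy example the missing test value is filled with the training mean of `col_1`, (24.33 + 33.58 + 30.17) / 3 = 29.36.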


## 7. Handling Missing Values

Impute the categorical data in both the training and testing sets with the mode (the most frequent value).

In [187]:
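# Categorical columns, selected by dtype (object). This selection also includes
# column 1 (numbers stored as text) and the target column 15.
categorical_columns = cc_apps.select_dtypes(include='object').columns.tolist()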
imputer = SimpleImputer(strategy='most_frequent')

train_set[categorical_columns] = imputer.fit_transform(train_set[categorical_columns])
test_set[categorical_columns] = imputer.transform(test_set[categorical_columns])

In [188]:
# Display the sets after imputation
print("Training set after imputation:")
train_set.head()

Training set after imputation:


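|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 382 | a | 24.33 | 2.5 | y | p | i | bb | 4.5 | f | f | 0.0 | g | 456.0 | - |
| 137 | b | 33.58 | 2.75 | u | g | m | v | 4.25 | t | t | 6.0 | g | 0.0 | + |
| 346 | b | 32.25 | 1.5 | u | g | c | v | 0.25 | f | f | 0.0 | g | 122.0 | - |
| 326 | b | 30.17 | 1.085 | y | p | c | v | 0.04 | f | f | 0.0 | g | 179.0 | - |
| 33 | a | 36.75 | 5.125 | u | g | e | v | 5.0 | t | f | 0.0 | g | 4000.0 | + |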


In [189]:
print("\nTest set after imputation:")
test_set.head()


Test set after imputation:


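|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 286 | a | 23.58 | 1.5 | u | g | ff | ff | 0.0 | f | t | 2.0 | g | 105.0 | - |
| 511 | a | 46.0 | 4.0 | u | g | j | j | 0.0 | t | f | 0.0 | g | 960.0 | + |
| 257 | b | 20.0 | 0.0 | u | g | d | v | 0.5 | f | f | 0.0 | g | 0.0 | - |
| 336 | b | 47.33 | 6.5 | u | g | c | v | 1.0 | f | f | 0.0 | g | 228.0 | - |
| 318 | b | 19.17 | 0.0 | y | p | m | bb | 0.0 | f | f | 0.0 | s | 1.0 | + |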


## 8. Encoding

Columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical. There are several ways to encode categorical columns; one of them is the pandas function get_dummies().

Use get_dummies() to convert the categorical columns to numerical columns so they can be used to train the machine learning algorithms.

Do not forget to convert both the training and testing sets.

In [190]:
# Specify categorical columns
categorical_columns = [0, 3, 4, 5, 6, 8, 9, 12]

# Convert categorical columns to numerical using get_dummies for both training and testing sets
train_set = pd.get_dummies(train_set, columns=categorical_columns)
test_set = pd.get_dummies(test_set, columns=categorical_columns)

# Display the sets after one-hot encoding
print("Training set after one-hot encoding:")
train_set.head()




Training set after one-hot encoding:


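|   | 1 | 2 | 7 | 10 | 14 | 15 | 0_a | 0_b | 3_l | 3_u | ... | 6_o | 6_v | 6_z | 8_f | 8_t | 9_f | 9_t | 12_g | 12_p | 12_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 382 | 24.33 | 2.5 | 4.5 | 0.0 | 456.0 | - | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 137 | 33.58 | 2.75 | 4.25 | 6.0 | 0.0 | + | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 346 | 32.25 | 1.5 | 0.25 | 0.0 | 122.0 | - | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 326 | 30.17 | 1.085 | 0.04 | 0.0 | 179.0 | - | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 33 | 36.75 | 5.125 | 5.0 | 0.0 | 4000.0 | + | 1 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |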


In [191]:
print("\nTest set after one-hot encoding:")
test_set.head()


Test set after one-hot encoding:


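|   | 1 | 2 | 7 | 10 | 14 | 15 | 0_a | 0_b | 3_l | 3_u | ... | 6_n | 6_v | 6_z | 8_f | 8_t | 9_f | 9_t | 12_g | 12_p | 12_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 286 | 23.58 | 1.5 | 0.0 | 2.0 | 105.0 | - | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 511 | 46.0 | 4.0 | 0.0 | 0.0 | 960.0 | + | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 257 | 20.0 | 0.0 | 0.5 | 0.0 | 0.0 | - | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 336 | 47.33 | 6.5 | 1.0 | 0.0 | 228.0 | - | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 318 | 19.17 | 0.0 | 0.0 | 0.0 | 1.0 | + | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |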


## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

In [192]:
# Column 15 is the target variable (class label: '+' or '-')
target_column_index = 15

# Split
X_train = train_set.drop(target_column_index, axis=1)
y_train = train_set[target_column_index]

X_test = test_set.drop(target_column_index, axis=1)
y_test = test_set[target_column_index]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print('-----------------------------')
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


X_train shape: (462, 43)
y_train shape: (462,)
-----------------------------
X_test shape: (228, 42)
y_test shape: (228,)


## 10. Normalization

In [193]:
# Columns that exist in the training set but are missing from the test set
a = set(X_train.columns)
b = set(X_test.columns)

a-b

{'6_o'}

In [194]:
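# Add the missing dummy column to the test set, filled with 0.
# Note: the new column is appended as the last column, so the column
# order of X_test now differs from that of X_train.
X_test['6_o'] = 0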

In [195]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((462, 43), (462,), (228, 43), (228,))

In [196]:
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

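scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Note: fit_transform re-fits the scaler here, so the test features are scaled
# with the test set's own min/max rather than the training set's.
X_test_scaled = scaler.fit_transform(X_test)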

In [197]:
X_train_scaled.shape

(462, 43)

In [198]:
X_test_scaled.shape

(228, 43)
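
Two details of this step affect everything downstream. First, `X_test['6_o'] = 0` appends the column at the end, so `X_test` has the same columns as `X_train` but in a different order, and the scaler returns plain arrays in which only the position of a column matters. Second, calling `fit_transform` on the test set scales it with its own minimum and maximum. Both are likely contributors to the very low test accuracies reported in the following sections. A minimal sketch of one way to handle this, on made-up frames (`train_X` and `test_X` are hypothetical stand-ins for `X_train` and `X_test`):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature feature frames; the test frame lacks '6_o'.
train_X = pd.DataFrame({
    "14": [0.0, 500.0, 1000.0],
    "6_o": [0, 1, 0],
    "8_f": [1, 0, 1],
    "8_t": [0, 1, 0],
})
test_X = pd.DataFrame({"14": [250.0, 2000.0], "8_f": [0, 1], "8_t": [1, 0]})

# Give the test frame exactly the training columns, in the training order;
# columns missing from the test frame are created and filled with 0.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_X)  # learn min/max from training data
test_scaled = scaler.transform(test_X)        # reuse the training min/max

print(list(test_X.columns) == list(train_X.columns))
print(test_scaled)
```

With this approach a test value may fall outside the range 0 to 1 (here 2000 maps to 2.0), which is expected when it lies beyond the range seen during training.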

## 11. Train a Logistic Regression

In [199]:
# Logistic Regression model
logreg_model = LogisticRegression(random_state=42)

logreg_model.fit(X_train_scaled, y_train)
y_train_pred = logreg_model.predict(X_train_scaled)


## 12. Make predictions and evaluate the Logistic Regression Model

In [200]:
# Evaluate the model on the training set
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training Accuracy:", train_accuracy)

# Predictions on the test set
y_test_pred = logreg_model.predict(X_test_scaled)

# Evaluate the model on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)
print('------------------------------')
# Display classification report for the test set
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))

Training Accuracy: 0.8809523809523809
Test Accuracy: 0.2850877192982456
------------------------------

Classification Report:
              precision    recall  f1-score   support

           +       0.29      0.41      0.34       103
           -       0.27      0.18      0.22       125

    accuracy                           0.29       228
   macro avg       0.28      0.30      0.28       228
weighted avg       0.28      0.29      0.27       228
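
A quick sanity check helps when reading these numbers. The support column shows 103 '+' and 125 '-' rows in the test set, so a model that always predicts '-' would already reach about 55% accuracy. A test accuracy of about 29% next to a training accuracy of 88% is well below that baseline, which usually points at a problem in the data pipeline (for example, features that do not line up between the training and test sets) rather than at the model itself. The arithmetic, using only the figures from the report above:

```python
# Class counts taken from the 'support' column of the report above.
support = {"+": 103, "-": 125}
total = sum(support.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(support.values()) / total

# Test accuracy reported above, and what inverting every prediction would give.
reported_test_accuracy = 0.2850877192982456
flipped_accuracy = 1 - reported_test_accuracy

print(f"majority baseline: {majority_baseline:.3f}")
print(f"reported accuracy: {reported_test_accuracy:.3f}")
print(f"flipped accuracy:  {flipped_accuracy:.3f}")
```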



## 13. Repeat steps 11 and 12 for SVM, Decision Tree, and Random Forest

In [201]:
# ------------------------SVM model---------------------------------
svm_model = SVC(random_state=42)
svm_model.fit(X_train_scaled, y_train)
y_train_pred_svm = svm_model.predict(X_train_scaled)


# -------------------predictions and evaluate-------------------
train_accuracy_svm = accuracy_score(y_train, y_train_pred_svm)
print("Training Accuracy (SVM):", train_accuracy_svm)

y_test_pred_svm = svm_model.predict(X_test_scaled)


test_accuracy_svm = accuracy_score(y_test, y_test_pred_svm)
print("Test Accuracy (SVM):", test_accuracy_svm)
print('----------------------------------------')

print("\nClassification Report (SVM):")
print(classification_report(y_test, y_test_pred_svm))


Training Accuracy (SVM): 0.8896103896103896
Test Accuracy (SVM): 0.2631578947368421
----------------------------------------

Classification Report (SVM):
              precision    recall  f1-score   support

           +       0.25      0.31      0.28       103
           -       0.28      0.22      0.25       125

    accuracy                           0.26       228
   macro avg       0.27      0.27      0.26       228
weighted avg       0.27      0.26      0.26       228



In [202]:
# ------------------------Decision Tree model---------------------------------

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_scaled, y_train)

y_train_pred_dt = dt_model.predict(X_train_scaled)

# -------------------predictions and evaluate-------------------

train_accuracy_dt = accuracy_score(y_train, y_train_pred_dt)
print("Training Accuracy (Decision Tree):", train_accuracy_dt)

y_test_pred_dt = dt_model.predict(X_test_scaled)

test_accuracy_dt = accuracy_score(y_test, y_test_pred_dt)
print("Test Accuracy (Decision Tree):", test_accuracy_dt)
print('----------------------------------------')


print("\nClassification Report (Decision Tree):")
print(classification_report(y_test, y_test_pred_dt))


Training Accuracy (Decision Tree): 1.0
Test Accuracy (Decision Tree): 0.36403508771929827
----------------------------------------

Classification Report (Decision Tree):
              precision    recall  f1-score   support

           +       0.26      0.21      0.23       103
           -       0.43      0.49      0.46       125

    accuracy                           0.36       228
   macro avg       0.34      0.35      0.34       228
weighted avg       0.35      0.36      0.36       228



In [203]:
# ------------------------Random Forest model---------------------------------


rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_scaled, y_train)


# -------------------predictions and evaluate-------------------
y_train_pred_rf = rf_model.predict(X_train_scaled)
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)
print("Training Accuracy (Random Forest):", train_accuracy_rf)


y_test_pred_rf = rf_model.predict(X_test_scaled)
test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)

print("Test Accuracy (Random Forest):", test_accuracy_rf)
print('----------------------------------------')

print("\nClassification Report (Random Forest):")
print(classification_report(y_test, y_test_pred_rf))


Training Accuracy (Random Forest): 1.0
Test Accuracy (Random Forest): 0.32894736842105265
----------------------------------------

Classification Report (Random Forest):
              precision    recall  f1-score   support

           +       0.19      0.15      0.16       103
           -       0.41      0.48      0.44       125

    accuracy                           0.33       228
   macro avg       0.30      0.31      0.30       228
weighted avg       0.31      0.33      0.32       228

