## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many are rejected, for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.
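
The steps above can be sketched as follows. This is a minimal illustration, assuming the file has no header row. To stay self-contained it reads the five sample rows displayed in this notebook from an in-memory string (a stand-in for `cc_approvals.data`), and it uses the name `cc_apps` from the instructions rather than the `df` used in the cells.

```python
import io

import pandas as pd

# Stand-in for cc_approvals.data: the five sample rows shown in this notebook
sample = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
    "a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+\n"
    "b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+\n"
    "b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+\n"
)

# header=None tells pandas the first line is data; columns are labelled 0..15
cc_apps = pd.read_csv(sample, header=None)
print(cc_apps.head())

# Drop columns 11 and 13 by their integer labels
cc_apps = cc_apps.drop(columns=[11, 13])
print(cc_apps.shape)
```

With `header=None`, the integer labels 11 and 13 are the same columns that are called `A12` and `A14` when the columns are named `A1` to `A16`.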

In [486]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split 

In [487]:
# The raw file has no header row, so name the 16 columns A1..A16
column_names = [f'A{i}' for i in range(1, 17)]
df = pd.read_csv('./datasets/cc_approvals.data', names=column_names)
df.head()

   A1     A2     A3  A4  A5  A6  A7    A8  A9  A10  A11  A12  A13  A14  A15  A16
0   b  30.83    0.0   u   g   w   v  1.25   t    t    1    f    g  202    0    +
1   a  58.67   4.46   u   g   q   h  3.04   t    t    6    f    g   43  560    +
2   a   24.5    0.5   u   g   q   h   1.5   t    f    0    f    g  280  824    +
3   b  27.83   1.54   u   g   w   v  3.75   t    t    5    t    g  100    3    +
4   b  20.17  5.625   u   g   w   v  1.71   t    f    0    f    s  120    0    +


In [488]:
df.drop(['A12','A14'],axis=1,inplace=True)

## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [489]:
df.describe()

             A3        A8      A11          A15
count     690.0     690.0    690.0        690.0
mean   4.758725  2.223406      2.4  1017.385507
std    4.978163  3.346513  4.86294  5210.102598
min         0.0       0.0      0.0          0.0
25%         1.0     0.165      0.0          0.0
50%        2.75       1.0      0.0          5.0
75%      7.2075     2.625      3.0        395.5
max        28.0      28.5     67.0     100000.0


In [490]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      690 non-null    object 
 1   A2      690 non-null    object 
 2   A3      690 non-null    float64
 3   A4      690 non-null    object 
 4   A5      690 non-null    object 
 5   A6      690 non-null    object 
 6   A7      690 non-null    object 
 7   A8      690 non-null    float64
 8   A9      690 non-null    object 
 9   A10     690 non-null    object 
 10  A11     690 non-null    int64  
 11  A13     690 non-null    object 
 12  A15     690 non-null    int64  
 13  A16     690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


## 4. Train Test Split

Do not split the dataset into X and y; just split the original dataset.

random_state=42

test_size=0.33
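
A minimal sketch of splitting a whole DataFrame, as the instruction describes, rather than separate feature and target objects. The frame `toy` and its column names are hypothetical stand-ins with the same number of rows (690) as the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows (690) as the dataset
toy = pd.DataFrame({
    "feature": np.arange(690),
    "target": np.where(np.arange(690) % 2 == 0, "+", "-"),
})

# Passing a single DataFrame returns two DataFrames: train and test
train, test = train_test_split(toy, test_size=0.33, random_state=42)
print(train.shape, test.shape)

# Features and target can still be separated afterwards
X_train_toy = train.drop(columns="target")
y_train_toy = train["target"]
```

With 690 rows and `test_size=0.33`, this yields 462 training rows and 228 test rows, the counts quoted in Section 9.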

In [491]:

X_train, X_test, y_train, y_test = train_test_split(df.drop('A16',axis=1),df['A16'] , 
                                   random_state=42,  
                                   test_size=0.33) 

In [492]:
df.columns.array

<PandasArray>
[ 'A1',  'A2',  'A3',  'A4',  'A5',  'A6',  'A7',  'A8',  'A9', 'A10', 'A11',
 'A13', 'A15', 'A16']
Length: 14, dtype: object

In [493]:
for i in X_train.columns.array:
    print(i)
    print(X_train[i].value_counts())
    
X_train.isna().sum()

A1
A1
b    314
a    140
?      8
Name: count, dtype: int64
A2
A2
23.58    6
23.00    5
18.83    5
25.00    5
?        5
        ..
56.83    1
37.33    1
34.58    1
16.92    1
18.67    1
Name: count, Length: 286, dtype: int64
A3
A3
0.000     15
1.500     14
3.000     13
1.250     13
2.500     11
          ..
10.335     1
0.665      1
9.585      1
11.665     1
1.165      1
Name: count, Length: 174, dtype: int64
A4
A4
u    341
y    114
?      6
l      1
Name: count, dtype: int64
A5
A5
g     341
p     114
?       6
gg      1
Name: count, dtype: int64
A6
A6
c     91
q     52
w     48
k     41
i     40
ff    39
aa    34
m     25
cc    25
x     22
e     16
d     15
?      7
j      6
r      1
Name: count, dtype: int64
A7
A7
v     261
h      91
ff     42
bb     41
z       7
?       7
dd      5
n       3
j       3
o       2
Name: count, dtype: int64
A8
A8
0.000     55
0.250     25
0.125     24
1.000     20
0.040     19
          ..
6.750      1
1.665      1
13.875     1
5.665      1
1.210      1

A1     0
A2     0
A3     0
A4     0
A5     0
A6     0
A7     0
A8     0
A9     0
A10    0
A11    0
A13    0
A15    0
dtype: int64

## 5. Handling Missing Values

Convert any '?' to a NaN value from both training and testing sets.
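
A small sketch of why this conversion is needed, using a toy frame with made-up values: `'?'` is an ordinary string, so `isna()` does not count it as missing until it is replaced with `np.nan`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) that uses '?' as the missing-value marker
toy = pd.DataFrame({"A1": ["a", "?", "b"], "A2": ["30.83", "?", "24.5"]})

# '?' is a regular string, so nothing is reported as missing yet
missing_before = toy.isna().sum()
print(missing_before)

# Replace the marker with NaN so isna() and fillna() can see the gaps
toy = toy.replace("?", np.nan)
missing_after = toy.isna().sum()
print(missing_after)

# A column that contained '?' was read as text and must be cast explicitly
toy["A2"] = toy["A2"].astype(float)
print(toy.dtypes)
```

This is also why `X_train.isna().sum()` reports zeros in the earlier cell even though the value counts list `'?'` entries.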

In [494]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 462 entries, 382 to 102
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      462 non-null    object 
 1   A2      462 non-null    object 
 2   A3      462 non-null    float64
 3   A4      462 non-null    object 
 4   A5      462 non-null    object 
 5   A6      462 non-null    object 
 6   A7      462 non-null    object 
 7   A8      462 non-null    float64
 8   A9      462 non-null    object 
 9   A10     462 non-null    object 
 10  A11     462 non-null    int64  
 11  A13     462 non-null    object 
 12  A15     462 non-null    int64  
dtypes: float64(2), int64(2), object(9)
memory usage: 50.5+ KB


In [495]:
#for i in X_train.columns.array:
    #X_train[i]=X_train[i].replace('?',np.nan)
    #X_test[i]=X_test[i].replace('?',np.nan)

In [496]:
X_train=X_train.replace('?',np.nan)
X_test=X_test.replace('?',np.nan)

## 6. Handling Missing Values

Impute the numerical data for both training and testing sets with mean value.
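
A minimal sketch of mean imputation on toy data (the values are made up). The point it illustrates is that the mean is computed on the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd

# Toy train/test columns (hypothetical values) with gaps
train = pd.DataFrame({"A2": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"A2": [np.nan, 25.0]})

# Compute the statistic on the training set only (NaN values are skipped)
train_mean = train["A2"].mean()

# Reuse it for both sets so no test-set information leaks into training
train["A2"] = train["A2"].fillna(train_mean)
test["A2"] = test["A2"].fillna(train_mean)

print(train_mean)
print(test)
```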

In [497]:
X_train['A2'] = X_train['A2'].astype(float)
X_test['A2'] = X_test['A2'].astype(float)



In [498]:
#X_train.fillna(method=)
#for i in X_train.columns.array:
    #X_train[i].fillna(X_train[i].mean())
    #X_train[i]=X_train[i].replace('?',np.nan)
    #X_test[i]=X_test[i].replace('?',np.nan)
    

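numeric_columns = X_train.select_dtypes(include=np.number).columns.tolist()

for column in numeric_columns:
    print(column)
    # Use the training mean for both sets so the test set does not influence it
    train_mean = X_train[column].mean()
    # Assign the result back; fillna(inplace=True) on a selected column is unreliable
    X_train[column] = X_train[column].fillna(train_mean)
    X_test[column] = X_test[column].fillna(train_mean)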

A2
A3
A8
A11
A15


## 7. Handling Missing Values

Impute the categorical data for both training and testing sets with mode value.
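
A minimal sketch of mode imputation on toy data (the values are made up). As with the mean, the most frequent value is taken from the training set and applied to both sets.

```python
import numpy as np
import pandas as pd

# Toy categorical columns (hypothetical values) with gaps
train = pd.DataFrame({"A4": ["u", "u", "y", np.nan]})
test = pd.DataFrame({"A4": [np.nan, "y"]})

# mode() returns a Series because ties are possible; take the first entry
train_mode = train["A4"].mode()[0]

train["A4"] = train["A4"].fillna(train_mode)
test["A4"] = test["A4"].fillna(train_mode)

print(train_mode)
print(test)
```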

In [499]:
categoric_columns = X_train.select_dtypes(exclude=np.number).columns.tolist()

for column in categoric_columns:
    print(column)
    # Use the training mode (most frequent value) for both sets
    train_mode = X_train[column].mode()[0]
    X_train[column] = X_train[column].fillna(train_mode)
    X_test[column] = X_test[column].fillna(train_mode)

A1
A4
A5
A6
A7
A9
A10
A13


In [500]:
X_train.isna().sum()

A1     0
A2     0
A3     0
A4     0
A5     0
A6     0
A7     0
A8     0
A9     0
A10    0
A11    0
A13    0
A15    0
dtype: int64

In [501]:
X_test.isna().sum()

A1     0
A2     0
A3     0
A4     0
A5     0
A6     0
A7     0
A8     0
A9     0
A10    0
A11    0
A13    0
A15    0
dtype: int64

In [502]:
X_train['A1'].unique()

array(['a', 'b'], dtype=object)

## 8. Encoding

The columns 0, 3, 4, 5, 6, 8, 9, and 12 are categorical. There are several ways to encode categorical columns; one of them is the get_dummies() function.

Use the get_dummies() function to convert the categorical columns to numerical columns (for training the machine learning algorithms).

Do not forget to convert both training and testing sets.
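
When the two sets are encoded separately, a category that appears in only one of them produces mismatched columns. The sketch below, on toy data with made-up values, shows one possible way to handle this: encode each set and then align the test columns to the training columns with `reindex`. Concatenating the two sets before encoding is another way to get matching columns.

```python
import pandas as pd

# Toy sets (hypothetical values): the test set lacks categories 'y' and 'l'
train = pd.DataFrame({"A4": ["u", "y", "l"], "A3": [0.0, 4.46, 0.5]})
test = pd.DataFrame({"A4": ["u", "u"], "A3": [1.54, 5.625]})

# Encode each set; numeric columns pass through unchanged
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Add the dummy columns missing from the test set, filled with 0,
# and put all columns in the same order as the training set
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```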

In [503]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 462 entries, 382 to 102
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      462 non-null    object 
 1   A2      462 non-null    float64
 2   A3      462 non-null    float64
 3   A4      462 non-null    object 
 4   A5      462 non-null    object 
 5   A6      462 non-null    object 
 6   A7      462 non-null    object 
 7   A8      462 non-null    float64
 8   A9      462 non-null    object 
 9   A10     462 non-null    object 
 10  A11     462 non-null    int64  
 11  A13     462 non-null    object 
 12  A15     462 non-null    int64  
dtypes: float64(3), int64(2), object(8)
memory usage: 50.5+ KB


In [504]:
# Combine X_train and X_test so both sets get the same dummy columns
combined_df = pd.concat([X_train, X_test])

# Perform one-hot encoding on categorical columns
combined_encoded = pd.get_dummies(combined_df)

# Split back into xtrain and xtest; copy so that later column assignments
# modify independent frames rather than slices of combined_encoded
xtrain = combined_encoded.iloc[:len(X_train)].copy()
xtest = combined_encoded.iloc[len(X_train):].copy()

## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

## 10. Normalization

In [505]:
# Separate numeric columns
numeric_columns = xtrain.select_dtypes(include=['number']).columns.tolist()

# Apply Min-Max normalization only to numeric columns
scaler = MinMaxScaler()
xtrain[numeric_columns] = scaler.fit_transform(xtrain[numeric_columns])
xtest[numeric_columns] = scaler.transform(xtest[numeric_columns])
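
Min-max normalization rescales each feature as (x - min) / (max - min), with min and max taken from the training data. The sketch below, on made-up numbers, checks `MinMaxScaler` against that formula and shows that the test set is scaled with the training minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature arrays (hypothetical values)
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from train
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same result by hand: (x - min) / (max - min)
train_min, train_max = train.min(axis=0), train.max(axis=0)
manual = (test - train_min) / (train_max - train_min)

print(test_scaled.ravel())
print(manual.ravel())
```

A test value larger than the training maximum (20.0 here) maps to a value above 1, so scaled test features are not guaranteed to lie in [0, 1].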




In [506]:
#scaler = MinMaxScaler()
#xtrain = scaler.fit_transform(xtrain)

#xtrain
xtrain.columns

Index(['A2', 'A3', 'A8', 'A11', 'A15', 'A1_a', 'A1_b', 'A4_l', 'A4_u', 'A4_y',
       'A5_g', 'A5_gg', 'A5_p', 'A6_aa', 'A6_c', 'A6_cc', 'A6_d', 'A6_e',
       'A6_ff', 'A6_i', 'A6_j', 'A6_k', 'A6_m', 'A6_q', 'A6_r', 'A6_w', 'A6_x',
       'A7_bb', 'A7_dd', 'A7_ff', 'A7_h', 'A7_j', 'A7_n', 'A7_o', 'A7_v',
       'A7_z', 'A9_f', 'A9_t', 'A10_f', 'A10_t', 'A13_g', 'A13_p', 'A13_s'],
      dtype='object')

## 11. Train a Logistic Regression

In [507]:
from sklearn.linear_model import LogisticRegression


logmodel = LogisticRegression()
logmodel.fit(xtrain,y_train)


print(logmodel.intercept_)
print(logmodel.coef_)

[0.15644233]
[[-1.21135274 -0.15102634 -0.76692409 -0.47509164 -0.77956735 -0.0584797
   0.05876781 -0.37468044  0.03188745  0.3430811   0.03188745 -0.37468044
   0.3430811   0.60385986 -0.00930644 -0.76818866  0.17529195 -0.45222689
   0.64192422  0.20098095  0.02732623  0.64460484  0.1257272   0.23180214
   0.04128338 -0.24458708 -1.21820361  0.50724799 -0.08160143  0.86324135
  -0.55930831  0.1689582  -0.70145378 -0.29696223 -0.269563    0.36972934
   1.76496046 -1.76467235  0.55972174 -0.55943363  0.76509441 -1.669683
   0.9048767 ]]
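
A raw coefficient array is hard to read without feature names. The sketch below, on made-up data with hypothetical column names, shows one way to pair each coefficient with its column and to check which class the signs refer to.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one informative feature and one noise feature
X = pd.DataFrame({
    "informative": [0.0, 0.1, 0.2, 0.8, 0.9, 1.0],
    "noise": [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],
})
y = pd.Series(["-", "-", "-", "+", "+", "+"])

model = LogisticRegression().fit(X, y)

# classes_ is sorted; for two classes the coefficients describe
# the log-odds of the second entry, classes_[1]
print(model.classes_)

# Label each coefficient with its feature name and sort by value
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

Because the labels are sorted, `'+'` comes before `'-'`, so a positive coefficient raises the predicted odds of `'-'` and a negative one raises the odds of `'+'`.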


## 12. Make predictions and evaluate the Logistic Regression Model

In [508]:
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)


y_pred = logmodel.predict(xtest)
print(y_pred)
print(f'The classification report :-\n{classification_report(y_test, y_pred)}')
print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')


['-' '+' '-' '-' '-' '-' '-' '+' '-' '-' '-' '+' '-' '+' '-' '+' '-' '-'
 '-' '-' '-' '-' '-' '+' '-' '-' '+' '+' '-' '-' '+' '+' '+' '+' '+' '+'
 '+' '+' '+' '+' '+' '+' '-' '+' '-' '+' '-' '-' '+' '-' '-' '+' '-' '-'
 '+' '-' '+' '-' '+' '-' '+' '+' '+' '-' '-' '+' '+' '+' '+' '-' '-' '+'
 '-' '+' '-' '-' '-' '-' '+' '-' '+' '+' '-' '-' '+' '-' '+' '+' '+' '+'
 '+' '+' '+' '+' '+' '+' '+' '+' '-' '-' '-' '-' '+' '-' '+' '+' '-' '+'
 '-' '+' '-' '+' '+' '+' '+' '-' '+' '+' '-' '+' '-' '-' '+' '-' '-' '+'
 '-' '-' '+' '+' '+' '-' '+' '-' '+' '+' '+' '+' '-' '+' '+' '+' '+' '+'
 '-' '-' '-' '+' '+' '-' '-' '-' '-' '+' '-' '+' '+' '+' '-' '-' '+' '-'
 '-' '-' '-' '-' '-' '+' '-' '+' '-' '-' '+' '-' '+' '-' '-' '-' '-' '-'
 '-' '+' '+' '-' '+' '-' '-' '+' '+' '+' '-' '+' '-' '+' '+' '+' '+' '+'
 '+' '+' '+' '+' '-' '+' '+' '-' '+' '+' '-' '-' '+' '-' '+' '+' '-' '-'
 '+' '-' '-' '-' '+' '+' '+' '-' '+' '-' '-' '-']
The classification report :-
              precision    recall  f1-score  

## 13. Repeat steps 11 and 12 for SVM, Decision Tree (DT), and Random Forest (RF)
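
One way to repeat the fit, predict, and evaluate steps without copying code is to loop over the models. The sketch below is only an illustration: it uses synthetic data from `make_classification` rather than the credit card data, and the hyperparameters are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the credit card dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42
)

models = {
    "SVM": SVC(kernel="linear", random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # Predict with the model that was just fitted, then score those predictions
    y_hat = clf.predict(X_te)
    scores[name] = accuracy_score(y_te, y_hat)
    print(f"{name}: {scores[name]:.3f}")
```

Keeping prediction and scoring inside the loop makes it harder to evaluate one model with another model's predictions by accident.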

In [509]:
from matplotlib.pyplot import xticks
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state = 1912, criterion='entropy')
tree.fit(xtrain, y_train)

y_pred = tree.predict(xtest)


report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)


print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')

print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')

Classification Report:
               precision    recall  f1-score   support

           +       0.84      0.81      0.82       103
           -       0.84      0.87      0.86       125

    accuracy                           0.84       228
   macro avg       0.84      0.84      0.84       228
weighted avg       0.84      0.84      0.84       228

The confusion matrix :-
[[ 83  20]
 [ 16 109]]
The accuracy of the model : 0.8421052631578947
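
The figures in the classification report can be recomputed from the confusion matrix. The sketch below takes the matrix printed above, with rows as true classes and columns as predicted classes in the label order `['+', '-']`, and derives accuracy, precision, recall, and F1 by hand.

```python
import numpy as np

# Confusion matrix from the decision tree output above:
# rows = true class, columns = predicted class, label order ['+', '-']
cm = np.array([[83, 20],
               [16, 109]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

Rounded to two decimals, these values agree with the report: precision 0.84 and 0.84, recall 0.81 and 0.87, and F1 0.82 and 0.86 for `'+'` and `'-'` respectively.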


In [510]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 100,
                                criterion = 'entropy',
                                max_depth = 6,
                                min_samples_leaf = 10,
                                min_samples_split = 78,
                                bootstrap = True,
                                oob_score = True,
                                random_state = 1912)

forest.fit(xtrain, y_train)

# Evaluate the forest's own predictions (not the decision tree's y_pred)
y_pred = forest.predict(xtest)
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)


print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')

print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')


In [515]:
from sklearn import svm


model = svm.SVC(kernel='linear',random_state=32)

model.fit(xtrain,y_train)

y_pred=model.predict(xtest)

# Evaluate the model (for a classifier, model.score() returns the accuracy)
print(model.score(xtest,y_test))
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)


print(f'The confusion matrix :-\n{confusion_matrix(y_test, y_pred)}')

print(f'The accuracy of the model : {accuracy_score(y_test, y_pred)}')



0.8377192982456141
Classification Report:
               precision    recall  f1-score   support

           +       0.77      0.92      0.84       103
           -       0.92      0.77      0.84       125

    accuracy                           0.84       228
   macro avg       0.84      0.85      0.84       228
weighted avg       0.85      0.84      0.84       228

The confusion matrix :-
[[95  8]
 [29 96]]
The accuracy of the model : 0.8377192982456141
