## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many are rejected, for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>

## 2. Import Pandas

1. Import pandas and alias it as pd
2. Load the dataset cc_approvals.data into a cc_apps dataframe.
    - Set the header argument to None.
3. Print the first five rows.
4. Drop the columns 11 and 13.

In [62]:
import pandas as pd
import numpy as np
column_names = ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16']
file = 'cc_approvals.data'

df = pd.read_csv(file, names=column_names, header=None)
df.head()


Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [63]:
df.drop(['c11','c13'], axis=1,inplace=True)
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c12,c14,c15,c16
0,b,30.83,0.000,u,g,w,v,1.25,t,t,f,00202,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,f,00043,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,f,00280,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,t,00100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,f,00120,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,f,00260,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,t,00200,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,t,00200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,f,00280,750,-
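
As a possible shortcut for the loading step, `read_csv` can mark the `?` placeholder as missing while reading, through its `na_values` parameter. The sketch below is only an illustration: it reads a three-row inline sample (an assumption, so that it runs without `cc_approvals.data`) and uses a shortened, hypothetical column list.

```python
import io
import pandas as pd

# Tiny inline stand-in for cc_approvals.data (illustrative rows, not the real file).
sample = io.StringIO(
    "b,30.83,0.0,+\n"
    "a,?,4.46,+\n"
    "?,24.50,0.5,-\n"
)

# header=None: the file has no header row; na_values='?' turns '?' into NaN on load.
sample_df = pd.read_csv(sample, header=None,
                        names=['c1', 'c2', 'c3', 'c16'], na_values='?')

print(sample_df.dtypes)        # c2 is parsed as float64 instead of object
print(sample_df.isna().sum())  # one missing value each in c1 and c2
```

With this approach the later `replace('?', np.nan)` and `astype(float)` steps would not be needed, at the cost of deviating from the task list above.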


## 3. Explore the dataset

1. Print the basic statistics.
2. Print the information of the dataset.
3. Print the last 17 rows.

In [64]:
df.describe()

Unnamed: 0,c3,c8,c15
count,690.0,690.0,690.0
mean,4.758725,2.223406,1017.385507
std,4.978163,3.346513,5210.102598
min,0.0,0.0,0.0
25%,1.0,0.165,0.0
50%,2.75,1.0,5.0
75%,7.2075,2.625,395.5
max,28.0,28.5,100000.0


In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      690 non-null    object 
 1   c2      690 non-null    object 
 2   c3      690 non-null    float64
 3   c4      690 non-null    object 
 4   c5      690 non-null    object 
 5   c6      690 non-null    object 
 6   c7      690 non-null    object 
 7   c8      690 non-null    float64
 8   c9      690 non-null    object 
 9   c10     690 non-null    object 
 10  c12     690 non-null    object 
 11  c14     690 non-null    object 
 12  c15     690 non-null    int64  
 13  c16     690 non-null    object 
dtypes: float64(2), int64(1), object(11)
memory usage: 75.6+ KB


In [66]:
df.tail(17)

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c12,c14,c15,c16
673,?,29.5,2.0,y,p,e,h,2.0,f,f,f,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,f,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,f,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,t,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,f,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,f,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,f,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,f,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,f,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,t,140,2,-
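
Note that `describe()` only reports c3, c8, and c15: c2 and c14 are stored as `object` in the `info()` output, and row 673 shows a `?` placeholder. A minimal sketch, on a made-up three-row frame, of how such placeholders could be counted per column:

```python
import pandas as pd

# Illustrative frame: 'c2' looks numeric but holds a '?' placeholder, so pandas
# stores it as object, just like c2 and c14 in the notebook.
demo = pd.DataFrame({
    'c1': ['b', 'a', '?'],
    'c2': ['30.83', '?', '24.50'],
    'c3': [0.0, 4.46, 0.5],
})

# Count the '?' placeholders in each column.
placeholder_counts = demo.eq('?').sum()
print(placeholder_counts)

# Only the truly numeric column shows up in describe().
print(demo.describe().columns.tolist())
```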


## 4. Train Test Split

Do not split the dataset into X and y; just split the original dataset.

random_state=42

test_size=0.33

In [67]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.33, random_state = 42)
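
To see where the 462 / 228 row counts come from, here is a small sketch on a placeholder frame with 690 rows (the `feature` and `label` columns are made up). The `.copy()` calls are an optional addition, not part of the notebook; on older pandas versions they can help avoid a `SettingWithCopyWarning` when the parts are later modified in place.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same number of rows as the dataset (690).
dummy = pd.DataFrame({'feature': np.arange(690), 'label': np.arange(690) % 2})

# Splitting the frame itself keeps features and target together in each part.
train_part, test_part = train_test_split(dummy, test_size=0.33, random_state=42)

# Independent copies, so later in-place edits act on standalone frames.
train_part, test_part = train_part.copy(), test_part.copy()

print(len(train_part), len(test_part))  # 462 228
```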




## 5. Handling Missing Values

Convert every '?' to NaN in both the training and testing sets.

In [68]:
df[df == '?'].count().sum()

67

In [69]:
train[train == '?'].count().sum() + test[test == '?'].count().sum()

67

In [70]:
train.replace('?',np.nan,inplace=True)
test.replace('?',np.nan,inplace=True)


In [71]:
train[train == '?'].count().sum() + test[test == '?'].count().sum()

0

In [72]:
test.isna().sum().sum() + train.isna().sum().sum()

67
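
The same before/after check can be sketched without `inplace=True`, which keeps the original frame available for comparison. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({'c1': ['a', '?', 'b'], 'c2': ['30.83', '?', '24.50']})

# replace() without inplace returns a new frame and leaves `raw` untouched.
cleaned = raw.replace('?', np.nan)

print(raw.eq('?').sum().sum())      # placeholders in the original: 2
print(cleaned.eq('?').sum().sum())  # placeholders after the replacement: 0
print(cleaned.isna().sum().sum())   # NaN values after the replacement: 2
```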

## 6. Handling Missing Values

Impute the numerical data in both the training and testing sets with the mean value.

In [73]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      454 non-null    object 
 1   c2      457 non-null    object 
 2   c3      462 non-null    float64
 3   c4      456 non-null    object 
 4   c5      456 non-null    object 
 5   c6      455 non-null    object 
 6   c7      455 non-null    object 
 7   c8      462 non-null    float64
 8   c9      462 non-null    object 
 9   c10     462 non-null    object 
 10  c12     462 non-null    object 
 11  c14     450 non-null    object 
 12  c15     462 non-null    int64  
 13  c16     462 non-null    object 
dtypes: float64(2), int64(1), object(11)
memory usage: 54.1+ KB


In [74]:
train[['c2', 'c14']] = train[['c2', 'c14']].astype(float)
test[['c2', 'c14']] = test[['c2', 'c14']].astype(float)

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      454 non-null    object 
 1   c2      457 non-null    float64
 2   c3      462 non-null    float64
 3   c4      456 non-null    object 
 4   c5      456 non-null    object 
 5   c6      455 non-null    object 
 6   c7      455 non-null    object 
 7   c8      462 non-null    float64
 8   c9      462 non-null    object 
 9   c10     462 non-null    object 
 10  c12     462 non-null    object 
 11  c14     450 non-null    float64
 12  c15     462 non-null    int64  
 13  c16     462 non-null    object 
dtypes: float64(4), int64(1), object(9)
memory usage: 54.1+ KB


In [75]:
train[['c2','c14']].isna().sum()

c2      5
c14    12
dtype: int64

In [76]:
# numeric_only=True restricts the mean to the numeric columns
train.fillna(train.mean(numeric_only=True),inplace=True)
test.fillna(test.mean(numeric_only=True),inplace=True)


In [77]:
train[['c2','c14']].isna().sum()

c2     0
c14    0
dtype: int64
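
The cell above fills each set with its own mean. A common alternative, sketched here on made-up frames, is to compute the means on the training set only and reuse them for the test set, so that no statistic is derived from test data. This is a variation on the notebook's approach, not what the notebook does.

```python
import numpy as np
import pandas as pd

train_demo = pd.DataFrame({'c2': [20.0, np.nan, 40.0], 'c6': ['w', 'q', None]})
test_demo = pd.DataFrame({'c2': [np.nan, 50.0], 'c6': ['w', 'q']})

# Means of the numeric columns, computed on the training data only.
train_means = train_demo.mean(numeric_only=True)

# Fill both sets with the training means; object columns are left alone.
train_filled = train_demo.fillna(train_means)
test_filled = test_demo.fillna(train_means)

print(train_filled['c2'].tolist())  # [20.0, 30.0, 40.0]
print(test_filled['c2'].tolist())   # [30.0, 50.0]
```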

## 7. Handling Missing Values

Impute the categorical data in both the training and testing sets with the mode (the most frequent value).

In [78]:
test.isna().sum().sum() + train.isna().sum().sum()

42

In [79]:
train.fillna(train.mode().iloc[0], inplace=True)
test.fillna(test.mode().iloc[0], inplace=True)



In [80]:
test.isna().sum().sum() + train.isna().sum().sum()

0
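
A small sketch of what `mode().iloc[0]` does, on made-up categorical frames. As a variation, the test set here is filled with the training modes rather than its own.

```python
import pandas as pd

train_cat = pd.DataFrame({'c1': ['a', 'b', 'b', None], 'c4': ['u', None, 'u', 'y']})
test_cat = pd.DataFrame({'c1': [None, 'a'], 'c4': ['y', None]})

# mode() returns a frame (ties yield several rows); iloc[0] takes the first
# mode of each column.
train_modes = train_cat.mode().iloc[0]

train_cat_filled = train_cat.fillna(train_modes)
test_cat_filled = test_cat.fillna(train_modes)

print(train_modes.to_dict())           # {'c1': 'b', 'c4': 'u'}
print(test_cat_filled['c1'].tolist())  # ['b', 'a']
```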

## 8. Encoding

The columns c1, c4, c5, c6, c7, c9, c10, and c12 (plus the target c16) are categorical. There are several ways to encode categorical columns; one of them is the pandas get_dummies() function.

Use the get_dummies() function to convert the categorical columns to numerical columns (for training the machine learning algorithms).

Do not forget to convert both training and testing sets.

In [81]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      462 non-null    object 
 1   c2      462 non-null    float64
 2   c3      462 non-null    float64
 3   c4      462 non-null    object 
 4   c5      462 non-null    object 
 5   c6      462 non-null    object 
 6   c7      462 non-null    object 
 7   c8      462 non-null    float64
 8   c9      462 non-null    object 
 9   c10     462 non-null    object 
 10  c12     462 non-null    object 
 11  c14     462 non-null    float64
 12  c15     462 non-null    int64  
 13  c16     462 non-null    object 
dtypes: float64(4), int64(1), object(9)
memory usage: 54.1+ KB


In [82]:
train = pd.get_dummies(train)
train.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 44 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c2      462 non-null    float64
 1   c3      462 non-null    float64
 2   c8      462 non-null    float64
 3   c14     462 non-null    float64
 4   c15     462 non-null    int64  
 5   c1_a    462 non-null    uint8  
 6   c1_b    462 non-null    uint8  
 7   c4_l    462 non-null    uint8  
 8   c4_u    462 non-null    uint8  
 9   c4_y    462 non-null    uint8  
 10  c5_g    462 non-null    uint8  
 11  c5_gg   462 non-null    uint8  
 12  c5_p    462 non-null    uint8  
 13  c6_aa   462 non-null    uint8  
 14  c6_c    462 non-null    uint8  
 15  c6_cc   462 non-null    uint8  
 16  c6_d    462 non-null    uint8  
 17  c6_e    462 non-null    uint8  
 18  c6_ff   462 non-null    uint8  
 19  c6_i    462 non-null    uint8  
 20  c6_j    462 non-null    uint8  
 21  c6_k    462 non-null    uint8  
 22  

In [83]:

test = pd.get_dummies(test)
test

Unnamed: 0,c2,c3,c8,c14,c15,c1_a,c1_b,c4_l,c4_u,c4_y,...,c7_v,c7_z,c9_f,c9_t,c10_f,c10_t,c12_f,c12_t,c16_+,c16_-
286,31.428416,1.50,0.000,200.0,105,1,0,0,1,0,...,0,0,1,0,0,1,0,1,0,1
511,46.000000,4.00,0.000,100.0,960,1,0,0,1,0,...,0,0,0,1,1,0,1,0,1,0
257,20.000000,0.00,0.500,144.0,0,0,1,0,1,0,...,1,0,1,0,1,0,1,0,0,1
336,47.330000,6.50,1.000,0.0,228,0,1,0,1,0,...,1,0,1,0,1,0,0,1,0,1
318,19.170000,0.00,0.000,500.0,1,0,1,0,0,1,...,0,0,1,0,1,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,20.830000,0.50,1.000,260.0,0,1,0,0,0,1,...,0,0,1,0,1,0,1,0,0,1
234,58.420000,21.00,10.000,0.0,6700,1,0,0,1,0,...,0,0,0,1,0,1,1,0,1,0
644,36.170000,0.42,0.290,309.0,2,0,1,0,0,1,...,1,0,1,0,1,0,0,1,0,1
271,32.330000,2.50,1.250,280.0,0,0,1,0,1,0,...,1,0,1,0,1,0,0,1,0,1
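
When `get_dummies()` is called separately on each set, a category that appears in only one of them may produce a column the other set lacks. A minimal sketch, on made-up frames, of one way to align the test columns with the training columns using `reindex`:

```python
import pandas as pd

train_enc_in = pd.DataFrame({'c4': ['u', 'y', 'l'], 'c3': [0.0, 1.5, 2.0]})
test_enc_in = pd.DataFrame({'c4': ['u', 'y'], 'c3': [4.0, 0.5]})  # no 'l' here

train_enc = pd.get_dummies(train_enc_in)

# Encoded on its own, the test set would lack the c4_l column, so reindex it to
# the training columns and fill the missing dummy with 0.
test_enc = pd.get_dummies(test_enc_in).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())  # ['c3', 'c4_l', 'c4_u', 'c4_y']
print(test_enc.columns.tolist())   # same columns; c4_l is all zeros
```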


## 9. Split into features and target

X_train and y_train will take 462 rows.
X_test and y_test will take 228 rows.

In [84]:
df.replace('?', np.nan, inplace=True)
df[['c2', 'c14']] = df[['c2', 'c14']].astype(float)
df.fillna(df.mean(numeric_only=True), inplace=True)
df.fillna(df.mode().iloc[0], inplace=True)



df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      690 non-null    object 
 1   c2      690 non-null    float64
 2   c3      690 non-null    float64
 3   c4      690 non-null    object 
 4   c5      690 non-null    object 
 5   c6      690 non-null    object 
 6   c7      690 non-null    object 
 7   c8      690 non-null    float64
 8   c9      690 non-null    object 
 9   c10     690 non-null    object 
 10  c12     690 non-null    object 
 11  c14     690 non-null    float64
 12  c15     690 non-null    int64  
 13  c16     690 non-null    object 
dtypes: float64(4), int64(1), object(9)
memory usage: 75.6+ KB


In [85]:
from sklearn.model_selection import train_test_split
X  = df.drop('c16',axis=1)
Y = df['c16']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state = 42)

print('X_train, Y_train',X_train.shape, Y_train.shape)
print('X_test, Y_test',X_test.shape,Y_test.shape)


X_train, Y_train (462, 13) (462,)
X_test, Y_test (228, 13) (228,)
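
The cells above rebuild the split from `df`. An alternative that reuses frames which are already split is to separate the target column from each part directly. A sketch on a made-up stand-in for the training frame:

```python
import pandas as pd

# Stand-in for an already split and cleaned training frame.
train_frame = pd.DataFrame({
    'c3': [0.0, 4.46, 0.5],
    'c9': ['t', 't', 'f'],
    'c16': ['+', '+', '-'],
})

# Target: the approval column; features: everything else.
y_part = train_frame['c16']
X_part = train_frame.drop(columns='c16')

print(X_part.shape, y_part.shape)  # (3, 2) (3,)
```

Applied to the `train` and `test` frames from step 4, the same two lines would give feature and target parts with 462 and 228 rows respectively.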


## 10. Normalization

In [86]:
X_train.head()

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c12,c14,c15
382,a,24.33,2.5,y,p,i,bb,4.5,f,f,f,200.0,456
137,b,33.58,2.75,u,g,m,v,4.25,t,t,f,204.0,0
346,b,32.25,1.5,u,g,c,v,0.25,f,f,t,372.0,122
326,b,30.17,1.085,y,p,c,v,0.04,f,f,f,170.0,179
33,a,36.75,5.125,u,g,e,v,5.0,t,f,t,0.0,4000


In [87]:
from sklearn.preprocessing import MinMaxScaler

numeric_columns = ['c2', 'c3', 'c8', 'c14', 'c15']

scaler = MinMaxScaler()

X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])

X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])

X_train.head()


Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c12,c14,c15
382,a,0.167937,0.094931,y,p,i,bb,0.225,f,f,f,0.1,0.00456
137,b,0.314762,0.104424,u,g,m,v,0.2125,t,t,f,0.102,0.0
346,b,0.293651,0.056958,u,g,c,v,0.0125,f,f,t,0.186,0.00122
326,b,0.260635,0.0412,y,p,c,v,0.002,f,f,f,0.085,0.00179
33,a,0.365079,0.194608,u,g,e,v,0.25,t,f,t,0.0,0.04
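
The scaled values follow the min-max formula `(x - min) / (max - min)`, with the minimum and maximum taken from the training data. As a quick check, the sketch below scales one made-up column spanning the c15 range reported by `describe()` (0 to 100000) by hand and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One illustrative column covering the range 0 to 100000.
col = np.array([[0.0], [456.0], [4000.0], [100000.0]])

# Min-max scaling by hand: (x - min) / (max - min).
manual = (col - col.min()) / (col.max() - col.min())

# MinMaxScaler with its default feature_range=(0, 1) should agree.
scaled = MinMaxScaler().fit_transform(col)

print(np.allclose(manual, scaled))  # True
print(manual.ravel())
```

The value 456 maps to 0.00456, which matches the c15 entry of row 382 in the output above.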


## 11. Train a Logistic Regression

In [88]:
from sklearn.linear_model import LogisticRegression

# LogisticRegression needs numeric input, so one-hot encode the categorical
# columns first; the test set is reindexed to the training columns.
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)

log_model = LogisticRegression()
log_model.fit(X_train,Y_train)

In [None]:
X_train.info()

## 12. Make predictions and evaluate the Logistic Regression Model

In [89]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, precision_score, recall_score

Y_pred = log_model.predict(X_test)

print(accuracy_score(Y_test,Y_pred))

print(f1_score(Y_test,Y_pred,average='micro'))

print(precision_score(Y_test,Y_pred,average="micro"))

print(recall_score(Y_test,Y_pred, average="micro"))

print(confusion_matrix(Y_test,Y_pred))

print(classification_report(Y_test,Y_pred))
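
A note on the metrics: for a single-label problem like this one, micro-averaged F1, precision, and recall all equal the accuracy, so they add no extra information. Scores for the approved class alone can be obtained with `pos_label`. The sketch below uses hand-made labels, not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels for illustration ('+' approved, '-' rejected).
y_true = ['+', '+', '-', '-', '-', '+']
y_pred = ['+', '-', '-', '-', '+', '+']

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average='micro')

# Scores for the approved class only; pos_label is needed because the labels
# are strings rather than 0/1.
prec_pos = precision_score(y_true, y_pred, pos_label='+')
rec_pos = recall_score(y_true, y_pred, pos_label='+')

print(acc, micro_f1)      # identical values
print(prec_pos, rec_pos)  # 2 of 3 predicted '+' are right; 2 of 3 true '+' are found
```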

## 13. Repeat steps 11 and 12 for SVM, Decision Tree (DT), and Random Forest (RF)
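
One way to approach this step is to loop over the three classifiers with the same fit / predict / score pattern. The sketch below is an outline only: it runs on synthetic data from `make_classification` (an assumption, standing in for the encoded and scaled credit features), uses default hyperparameters, and reports accuracy alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic numeric data standing in for the encoded, scaled credit features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    'SVM': SVC(),
    'DT': DecisionTreeClassifier(random_state=42),
    'RF': RandomForestClassifier(random_state=42),
}

# Same fit / predict / score steps for each model.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```

With the real data, `X_tr`, `X_te`, `y_tr`, and `y_te` would be replaced by the notebook's encoded `X_train`, `X_test`, `Y_train`, and `Y_test`.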