## 1. Project Overview
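<p>Applying for a credit card at a commercial bank has become routine. Applications can still be rejected for many reasons, such as a poor loan history, low income, or other outstanding debt. Reviewing applications manually is error-prone, slow, and expensive in staff time. As the technology has matured, many commercial banks want to use <em>AI</em> to automate credit card approval. This project therefore uses <em>machine learning</em> to build a credit card approval predictor of the kind such banks need.</p>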

<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>

<p>The data comes from the UCI ML Repository: <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Approval Data Set</a>.</p>

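<p>The project is organized as follows:
<ul>
<li>First, load the dataset and take a look at it</li>
<li>Observe that the features mix numerical and categorical data, that the columns span different ranges, and that some values are missing</li>
<li>Preprocess the dataset so that the chosen machine learning model can make good predictions</li>
<li>Once the data has been cleaned, run an exploratory data analysis to look for insights</li>
<li>Finally, build a machine learning model that predicts whether an individual's credit card application will be accepted</li>
</ul>
</p>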
<p>First, load and inspect the dataset. Because the data may involve personal information, the contributor of the dataset anonymized the feature names.</p>

In [None]:
# Attach cloud drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import pandas
import pandas as pd

# Load dataset
data_path = "/content/drive/MyDrive/Colab Notebooks/DataCamp/Project6: Predicting Credit Card Approvals/cc_approvals.data"
cc_apps = pd.read_csv(data_path, header=None)   # default -> header='infer'

# Inspect data
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Inspecting the applications
<p>The features of the dataset have been anonymized to protect privacy, but this <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">blog</a> gives a plausible overview of what they are. The typical features of a credit card application are probably <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code>, and <code>ApprovalStatus</code>. We map these names onto the columns in order (column 0 through column 15) and do some simple exploration.</p>

In [None]:
# Print DataFrame information (info() prints directly and returns None)
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [None]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


In [None]:
# Inspect missing values in the dataset
cc_apps.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-




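*   As seen above, the dataset mixes numeric and non-numeric features and contains missing values, which are marked with "?" (for example, column 0 of row 673).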
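To see how widespread the "?" markers are, we can count them per column. The sketch below uses a small made-up table rather than the real file (the values and the variable names `raw`, `toy`, and `toy_clean` are illustrative assumptions). It also shows that `pd.read_csv` can convert the markers to NaN at load time through its `na_values` parameter, which is one alternative to replacing them later.

```python
import io
import pandas as pd

# Made-up rows that mimic the raw file: '?' marks a missing entry
raw = "b,30.83,0.0\na,?,4.46\n?,24.50,0.5\na,?,1.54\n"

# Loaded as-is, a column containing '?' is read as text (object dtype)
toy = pd.read_csv(io.StringIO(raw), header=None)

# Count the '?' markers in each column
missing_counts = (toy == "?").sum()
print(missing_counts)

# Alternative: let pandas turn '?' into NaN while reading
toy_clean = pd.read_csv(io.StringIO(raw), header=None, na_values="?")
print(toy_clean.dtypes)
```

With `na_values="?"`, the second column is parsed as a float column instead of text, so it would later be treated as numerical data.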



## 3. Splitting the dataset into train and test sets
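<p>To avoid <a href="https://en.wikipedia.org/wiki/Leakage_(machine_learning)">data leakage</a>, information from the test set should not be used when preprocessing the training set. We therefore split the data into training and test sets before doing any preprocessing.</p>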

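<p>In addition, the <code>DriversLicense</code> and <code>ZipCode</code> features are assumed to matter little for predicting credit card approval, so we drop them as a simple form of <strong>feature selection</strong> before building the model in the second half of the project.</p>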

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop features 11 (DriversLicense) and 13 (ZipCode)
cc_apps = cc_apps.drop([11, 13], axis=1)

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)   # a fixed random_state makes the split reproducible across runs

In [None]:
cc_apps_train.shape

(462, 14)

In [None]:
cc_apps_test.shape

(228, 14)
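The split above is purely random, so the share of approved and denied applications can differ slightly between the two sets. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on made-up data (the frame `toy` and its column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 8 approved ('+') and 12 denied ('-') applications
toy = pd.DataFrame({"feature": range(20), "label": ["+"] * 8 + ["-"] * 12})

# stratify keeps the 40/60 class ratio in both the train and the test part
train_toy, test_toy = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["label"]
)

print(train_toy["label"].value_counts())
print(test_toy["label"].value_counts())
```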

## 4. Handling the missing values
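<p>This raises an important question: why handle missing values at all instead of simply ignoring them?
<ul>
<li>Ignoring missing values can seriously hurt a machine learning model's performance, and the model may miss information that would have been useful for training</li>
<li>Many models cannot be trained on data that contains missing values, for example Linear Discriminant Analysis (LDA)</li>
</ul>
</p>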
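The second point is easy to check. The sketch below fits scikit-learn's `LinearDiscriminantAnalysis` on a tiny made-up array that contains one NaN; the data and the flag `fit_failed` are assumptions for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up features with a single missing value
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

try:
    LinearDiscriminantAnalysis().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and rejects NaN before fitting
    fit_failed = True
    print("fit raised ValueError:", err)
```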

### (i) Replace the '?' with NaN

In [None]:
# Import numpy
import numpy as np

# Replace the '?' with NaN in the train and test sets
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)

In [None]:
cc_apps_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
382,a,24.33,2.5,y,p,i,bb,4.5,f,f,0,g,456,-
137,b,33.58,2.75,u,g,m,v,4.25,t,t,6,g,0,+
346,,32.25,1.5,u,g,c,v,0.25,f,f,0,g,122,-
326,b,30.17,1.085,y,p,c,v,0.04,f,f,0,g,179,-
33,a,36.75,5.125,u,g,e,v,5.0,t,f,0,g,4000,+


In [None]:
cc_apps_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
286,a,,1.5,u,g,ff,ff,0.0,f,t,2,g,105,-
511,a,46.0,4.0,u,g,j,j,0.0,t,f,0,g,960,+
257,b,20.0,0.0,u,g,d,v,0.5,f,f,0,g,0,-
336,b,47.33,6.5,u,g,c,v,1.0,f,f,0,g,228,-
318,b,19.17,0.0,y,p,m,bb,0.0,f,f,0,s,1,+


### (ii) Imputation - numerical data
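<p>We use <strong>mean imputation</strong> to fill the missing values in the numerical columns. Only the columns that pandas already parsed as numeric (2, 7, 10, and 14) are affected here. Column 1 still has object dtype because it contained "?", so it is handled together with the categorical columns in the next step.</p>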

In [None]:
# Column means of the numeric columns, computed on the train set only
imputer = cc_apps_train.mean(numeric_only=True)
imputer


2       4.647814
7       2.044567
10      2.476190
14    978.207792
dtype: float64

In [None]:
# Impute the missing values with mean imputation
cc_apps_train.fillna(imputer, inplace=True)
cc_apps_test.fillna(imputer, inplace=True)   # the test set is imputed with the means computed on the train set

# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isna().sum())
print(cc_apps_test.isna().sum())

0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
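The same "fit on train, apply to test" idea is available in scikit-learn as `SimpleImputer`. The sketch below is an alternative to the `fillna` approach used above, not the method this notebook relies on; the toy frames and the column names `debt` and `income` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric data with gaps
train_num = pd.DataFrame({"debt": [1.0, np.nan, 3.0], "income": [100.0, 200.0, np.nan]})
test_num = pd.DataFrame({"debt": [np.nan, 5.0], "income": [np.nan, 50.0]})

# Learn the column means on the train set only
mean_imputer = SimpleImputer(strategy="mean")
train_filled = mean_imputer.fit_transform(train_num)

# Reuse the train-set means to fill the test set
test_filled = mean_imputer.transform(test_num)

print(mean_imputer.statistics_)   # means learned from train_num
print(test_filled)
```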


### (iii) Imputation - categorical data
<p>We fill the missing values in each categorical column with that column's <strong>most frequent value</strong>. Further reading: <a href="https://www.datacamp.com/community/tutorials/categorical-data">Handling Categorical Data</a></p>

In [None]:
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtype == 'object':
        # Most frequent value, computed on the train set only
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # Impute both sets with the train set's most frequent value
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)

# Verify that no NaNs remain
print(cc_apps_train.isna().any().any())
print(cc_apps_test.isna().any().any())

False
False
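A slightly shorter way to get the most frequent value is `Series.mode()`, which ignores NaN by default. This is a sketch on a made-up column (the names `train_col` and `test_col` are illustrative). When two values are tied for most frequent, `mode()` and `value_counts().index[0]` may pick different ones.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with gaps
train_col = pd.Series(["u", "y", "u", np.nan, "u"])
test_col = pd.Series([np.nan, "y"])

# mode() skips NaN; take the first entry in case several values are tied
most_frequent = train_col.mode().iloc[0]

# Fill both sets with the value learned from the train column
train_col = train_col.fillna(most_frequent)
test_col = test_col.fillna(most_frequent)

print(most_frequent)
print(test_col.tolist())
```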


## 5. Preprocessing the data
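<p>First, we convert all categorical data to numerical form. This speeds up computation, and many machine learning models (XGBoost, for example, and especially models implemented in scikit-learn) require numeric input. Second, we rescale the feature values to a common range.</p>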


### (i) Encoding categorical data
<p>Convert the non-numeric data to numeric data.</p>

<p>
<ul> 
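<li>pd.get_dummies(): looks at each column's dtype and, by default, encodes only the non-numeric (object, string, or category) columns; numeric columns pass through unchanged.</li>
<li>One Hot Encoding (for example scikit-learn's OneHotEncoder): ignores the dtype; every column passed to the encoder is one-hot encoded.</li>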
</ul>
</p>


In [None]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)   # pd.get_dummies() picks the columns to encode from their dtype
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindex the columns of the test set so they align with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)   # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html

In [None]:
cc_apps_train.head()

Unnamed: 0,2,7,10,14,0_a,0_b,1_13.75,1_15.83,1_15.92,1_16.00,...,6_z,8_f,8_t,9_f,9_t,12_g,12_p,12_s,15_+,15_-
382,2.5,4.5,0,456,1,0,0,0,0,0,...,0,1,0,1,0,1,0,0,0,1
137,2.75,4.25,6,0,0,1,0,0,0,0,...,0,0,1,0,1,1,0,0,1,0
346,1.5,0.25,0,122,0,1,0,0,0,0,...,0,1,0,1,0,1,0,0,0,1
326,1.085,0.04,0,179,0,1,0,0,0,0,...,0,1,0,1,0,1,0,0,0,1
33,5.125,5.0,0,4000,1,0,0,0,0,0,...,0,0,1,1,0,1,0,0,1,0


### (ii) Feature Scaling
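<p>We rescale the values of every column to the range 0-1.<br>To illustrate what a rescaled column means, take <code>CreditScore</code>: a cardholder's credit score is based on their credit history, and a higher number means the person is considered more financially trustworthy. After rescaling, a <code>CreditScore</code> of 1 is therefore the highest score in the training set.</p>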

In [None]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Drop the redundant dummy column "15_-"; "15_+" (1 = approved) stays as the label
cc_apps_train = cc_apps_train.drop(["15_-"], axis=1)
cc_apps_test = cc_apps_test.drop(["15_-"], axis=1)

# Segregate features and labels; the label is selected as a 1-D array to avoid sklearn's DataConversionWarning
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:, -1].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:, -1].values

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

In [None]:
rescaledX_train

array([[0.0949307 , 0.225     , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.10442377, 0.2125    , 0.08955224, ..., 1.        , 0.        ,
        0.        ],
       [0.05695842, 0.0125    , 0.        , ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.05970149, ..., 1.        , 0.        ,
        0.        ],
       [0.1898614 , 0.01875   , 0.02985075, ..., 1.        , 0.        ,
        0.        ]])

In [None]:
rescaledX_test

array([[0.05695842, 0.        , 0.02985075, ..., 1.        , 0.        ,
        0.        ],
       [0.15188912, 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.025     , 0.        , ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.01594836, 0.0145    , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.0949307 , 0.0625    , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.06645149, 0.11675   , 0.        , ..., 1.        , 0.        ,
        0.        ]])
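Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The sketch below checks this by hand on made-up one-column arrays (`X_train_toy` and `X_test_toy` are illustrative names). It also shows that a test value larger than the training maximum ends up above 1, because the scaler only knows the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [20.0]])

# Fit on the train data only, then transform the test data
scaler_toy = MinMaxScaler(feature_range=(0, 1)).fit(X_train_toy)
scaled = scaler_toy.transform(X_test_toy)

# The same computation by hand, using the train minimum and maximum
train_min = X_train_toy.min(axis=0)
train_max = X_train_toy.max(axis=0)
manual = (X_test_toy - train_min) / (train_max - train_min)

print(scaled.ravel())
print(manual.ravel())
```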

## 6. Fitting a logistic regression model to the train set
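<p>Predicting whether a credit card application will be approved is a classification problem. We model it with <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">logistic regression</a>.<br>
The dataset contains 690 credit card applications, of which 383 (55.5%) were denied and 307 (44.5%) were approved.</p>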
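These class counts give a useful reference point: a model that always predicts the majority class ("denied") would already be right a little more than half the time on the full dataset. A quick check of the arithmetic (the variable names are illustrative; the exact proportion in the test split may differ slightly):

```python
# Class counts quoted above for the full dataset
n_total = 690
n_denied = 383
n_approved = 307

# Accuracy of always predicting the majority class
baseline_accuracy = max(n_denied, n_approved) / n_total
print(round(baseline_accuracy, 3))
```

A trained model should do clearly better than this baseline to be worth using.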



In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Initiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)



LogisticRegression()

## 7. Making predictions and evaluating performance
<p>We evaluate the model on the test set using <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>, and we also look at its <a href="https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>.</p>

In [None]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier:  0.8552631578947368


array([[101,  24],
       [  9,  94]])

## 8. Grid searching and making the model perform better
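<p>The first element of the first row of the confusion matrix is the number of True Negatives (TN): applications the model <strong>predicted as denied</strong> that <strong>really were denied</strong>. The last element of the second row is the number of True Positives (TP): applications the model <strong>predicted as approved</strong> that <strong>really were approved</strong>.</p>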
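The other two cells are the False Positives (first row, second element) and the False Negatives (second row, first element). As a worked example, the sketch below takes the confusion matrix printed in section 7 and derives accuracy, precision, and recall from it. Precision and recall are not reported elsewhere in this notebook; they are added here only to illustrate how the four cells are used.

```python
import numpy as np

# Confusion matrix from section 7: rows = actual, columns = predicted
cm = np.array([[101, 24],
               [9, 94]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted approvals that are correct
recall = tp / (tp + fn)            # share of actual approvals that are found

print(round(accuracy, 4), round(precision, 3), round(recall, 3))
```

The accuracy computed this way, 195 / 228, matches the score printed by `logreg.score`.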

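<p>We run a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> over the logistic regression model's parameters to try to improve its predictive performance. To keep it simple, we search over just two parameters: <strong>tol</strong> and <strong>max_iter</strong>.</p>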

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200] 

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

## 9. Finding the best performing model
<p>Run the grid search and look at the best parameter setting and the best cross-validated score. Further reading: <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a></p>

In [None]:
# Initiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print("Accuracy of logistic regression classifier: ", best_model.score(rescaledX_test, y_test))

Best: 0.867906 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  0.8552631578947368
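In the search above the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. One way to keep the folds fully separate is to put the scaler and the classifier into a scikit-learn `Pipeline` and search over the pipeline. The sketch below uses synthetic data from `make_classification` instead of the credit card data, and the step names `scaler` and `logreg` are arbitrary choices, so its numbers say nothing about this project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the credit card dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

# Scaling happens inside the pipeline, so it is refitted on each CV training fold
pipe = Pipeline([("scaler", MinMaxScaler()), ("logreg", LogisticRegression())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid_pipe = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

search = GridSearchCV(pipe, param_grid=param_grid_pipe, cv=5)
search.fit(X_tr, y_tr)

print("Best CV score:", search.best_score_, "with", search.best_params_)
print("Test accuracy:", search.score(X_te, y_te))
```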
