## Supervised Learning
<p>This notebook provides an example of supervised learning with several different modeling techniques, compared in order to improve accuracy.</p>

## 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many of them get rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and many commercial banks do so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, much as real banks do.</p>
<p><img src="R.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository. The structure of this notebook is as follows:</p>
<ul>
<li>First, we will start off by loading and viewing the dataset.</li>
<li>We will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.</li>
<li>We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.</li>
<li>After our data is in good shape, we will do some exploratory data analysis to build our understanding.</li>
<li>Finally, we will build machine learning models that can predict if an individual's application for a credit card will be accepted. We will also see how to build a pipeline.</li>
</ul>
<p>First, loading and viewing the dataset. We find that since this data is confidential, the contributor of the dataset has anonymized the feature names.</p>

In [3]:
# We will be using pandas profile report so need to get this installed and then restart the kernel first
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Using cached https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Preparing metadata (setup.py) ... done


In [None]:
#restart the kernel
#the arrow will be red - it's normal
import os
os._exit(00)

In [1]:
#just about everything you need and more - let's import the rest of the libraries
#do you know the purpose of each of the imported modules?
import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
import collections
from collections import Counter

import sklearn
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA



In [7]:
#mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
# Load dataset
# if you are running this in Colab, be sure to select the folder icon in
# the side menu, then select the icon for "upload to session storage"
# and upload the cc_approvals.data file from your local drive
# you will need to alter the input statement as well to remove
# the local folder name!
# the example here uses that version - the commented version includes the folder
# name if you are running jupyter locally
cc_apps = pd.read_csv("cc_approvals.data", header=None)
#cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Inspect data
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [19]:
# Let's first change those column names  based on domain information
cc_apps.columns =['Male', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel', 'Ethnicity',
          'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense',
            'Citizen', 'ZipCode', 'Income', 'Approved']
cc_apps.head()

Unnamed: 0,Male,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [20]:
# Let's drop columns we will not use as features
# Drop the features DriversLicense and ZipCode
# We assume DriversLicense and ZipCode add little predictive value for approvals
cc_apps = cc_apps.drop(['DriversLicense', 'ZipCode'], axis=1)

## 2. Inspecting the features
<p>The output may appear a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code> (the first and last of these are named <code>Male</code> and <code>Approved</code> in the column names we assigned above). This gives us a pretty good starting point, and we can map these features to the columns in the output.</p>
<p>As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.</p>

### Let's Explore Our Data
We are going to use Pandas ProfileReport for some quick exploration!<br>
We have installed the necessary packages earlier and restarted the kernel.
You can learn more here:  <a href="https://www.linkedin.com/pulse/summarizing-exploring-datasets-using-jupyter-notebooks-joseph-true">Generating Reports for EDA using Pandas Profiling</a>


In [5]:
#run the report on cc_apps
profile = ProfileReport(cc_apps, title="Credit Card Data Profile Report")

In [None]:
# This code will save the report as a html file
# which you can view in any browser
#Assign to a string
html_data = profile.to_html()
# Save as a file
profile.to_file("creditcard_report.html")
# you can view the html file in a browser of your choice
# in colab you can find the file in the side menu by clicking on the folder icon

In [6]:
#this is even better
profile.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

## 3. Handling the missing values (part i)
<p>We've uncovered some issues that will affect the performance of our machine learning model(s) if they go unchanged:</p>
<ul>
<li>Our dataset contains both numeric and non-numeric data (of <code>float64</code>, <code>int64</code> and <code>object</code> types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Apart from these, we can get useful statistical information (like <code>mean</code>, <code>max</code>, and <code>min</code>) about the features that have numerical values. </li>
<li>Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.</li>
</ul>
<p>Now, let's temporarily replace these missing value question marks with NaN.</p>

In [21]:
# We have already Imported numpy
# import numpy as np

# Inspect missing values in the dataset
print(cc_apps.tail(17))

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)

# Inspect the missing values again
cc_apps.tail(17)

    Male    Age    Debt Married BankCustomer EducationLevel Ethnicity  \
673    ?  29.50   2.000       y            p              e         h   
674    a  37.33   2.500       u            g              i         h   
675    a  41.58   1.040       u            g             aa         v   
676    a  30.58  10.665       u            g              q         h   
677    b  19.42   7.250       u            g              m         v   
678    a  17.92  10.210       u            g             ff        ff   
679    a  20.08   1.250       u            g              c         v   
680    b  19.50   0.290       u            g              k         v   
681    b  27.83   1.000       y            p              d         h   
682    b  17.08   3.290       u            g              i         v   
683    b  36.42   0.750       y            p              d         v   
684    b  40.58   3.290       u            g              m         v   
685    b  21.08  10.085       y            p       

Unnamed: 0,Male,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,Citizen,Income,Approved
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,g,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,g,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,g,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,g,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,g,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,g,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,g,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,g,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,g,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,g,2,-
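
As an alternative to calling <code>replace()</code> after loading, pandas can treat the question marks as missing while the file is being read. The sketch below uses a made-up three-row sample (not the real file) and the <code>na_values</code> argument of <code>read_csv</code>. One thing to keep in mind: a numeric column that contains a <code>?</code>, such as <code>Age</code>, is read as <code>object</code> and stays that way after <code>replace()</code>, whereas with <code>na_values</code> it is parsed as a float column.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the first columns of cc_approvals.data
sample = io.StringIO("b,30.83,0.0,u\n?,29.50,2.0,y\na,?,2.5,u\n")

# na_values tells read_csv to parse '?' as NaN while loading
demo = pd.read_csv(sample, header=None, na_values="?")

print(demo.isna().sum())  # per-column count of missing values
print(demo.dtypes)        # column 1 is numeric because '?' became NaN
```

Whether this is preferable depends on the later steps: label encoding every <code>object</code> column treats an unparsed numeric column as categorical.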


## 4. Handling the missing values (part ii)
<p>We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.</p>
<p>An important question here is <em>why are we giving so much importance to missing values</em>? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model: the model may miss out on information in the dataset that would be useful for training. In addition, many models, such as LDA, cannot handle missing values implicitly.</p>
<p>One way to avoid this problem is to impute the missing values. There are many imputation methods and considerations.</p>
<p>We are going to see how many rows have missing values, then remove the rows if it doesn't get rid of too much of the data. This can be risky!</p>

In [22]:
# Calculate number of missing values in each row
missing_values = cc_apps.isnull().sum(axis=1)

# Calculate percentage of rows with at least one missing value
percentage_missing_rows = (sum(missing_values > 0) / len(cc_apps)) * 100

print(f"Percentage of rows with missing values: {percentage_missing_rows:.2f}%")

Percentage of rows with missing values: 4.49%
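
Before dropping rows it can also help to see where the missing values sit. The following is a minimal sketch on a made-up three-row frame (the column names are borrowed from this notebook, the values are invented); it computes the same row-level percentage as above and adds a per-column count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cc_apps
toy = pd.DataFrame({
    "Male": ["a", np.nan, "b"],
    "Age": [30.0, 25.0, np.nan],
    "Debt": [1.0, 2.0, 3.0],
})

# True for every row that has at least one missing value
rows_with_missing = toy.isna().any(axis=1)

# The mean of a boolean Series is the fraction of True values
pct_rows = rows_with_missing.mean() * 100

# Number of missing values in each column
per_column = toy.isna().sum()

print(f"Rows with missing values: {pct_rows:.2f}%")
print(per_column)
```

If most of the missing entries fall in one column, dropping or imputing only that column may lose less data than dropping every affected row.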


In [23]:
# Remove rows with missing values
cc_apps_clean = cc_apps.dropna()

# Count missing values after removal
print("\nAfter removal:")
print(cc_apps_clean.isnull().sum())

# Show cleaned DataFrame
print("\nCleaned DataFrame:")
print(cc_apps_clean)


After removal:
Male              0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
Approved          0
dtype: int64

Cleaned DataFrame:
    Male    Age    Debt Married BankCustomer EducationLevel Ethnicity  \
0      b  30.83   0.000       u            g              w         v   
1      a  58.67   4.460       u            g              q         h   
2      a  24.50   0.500       u            g              q         h   
3      b  27.83   1.540       u            g              w         v   
4      b  20.17   5.625       u            g              w         v   
..   ...    ...     ...     ...          ...            ...       ...   
685    b  21.08  10.085       y            p              e         h   
686    a  22.67   0.750       u            g              c         v   
687    a  25.2

## 5. Handling the missing values (part iii)
<p>We have successfully taken care of the missing values present in all columns. Other strategies are mean for numeric and mode for categorical or object features.</p>
<p>Here is some code that shows how you can impute these missing values with the most frequent values present in the respective columns. This is <a href="https://www.datacamp.com/community/tutorials/categorical-data">good practice</a> when it comes to imputing missing values for categorical data in general.</p>
<p>This can be changed to impute with the mean or other value as well and is very powerful.</p>
<pre><code># Iterate over each column of cc_apps
for col in cc_apps.columns:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute this column with its own most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps.isnull().sum())
</code></pre>
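
The mean and most-frequent strategies mentioned above can also be expressed with scikit-learn's <code>SimpleImputer</code>, which this notebook already imports. This is only a sketch on a made-up two-column frame; the names <code>toy</code>, <code>num_imputer</code> and <code>cat_imputer</code> are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric and one categorical column with gaps
toy = pd.DataFrame({
    "Age": [30.83, np.nan, 24.50, 27.83],
    "Married": ["u", "u", np.nan, "y"],
})

# Mean for the numeric column, most frequent value for the categorical one
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

toy[["Age"]] = num_imputer.fit_transform(toy[["Age"]])
toy[["Married"]] = cat_imputer.fit_transform(toy[["Married"]])

print(toy)
```

In a real workflow the imputers would be fitted on the training split only and then applied to the test split, for the same reason the scaler is.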

## 6. Preprocessing the data (part i)
<p>The missing values are now successfully handled.</p>
<p>There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:</p>
<ol>
<li>Convert the non-numeric data into numeric.</li>
<li>Split the data into train and test sets. </li>
<li>Scale the feature values to a uniform range.</li>
</ol>
<p>First, we will convert all the non-numeric values into numeric ones. We do this not only because it results in faster computation, but also because many machine learning models (XGBoost, for example, and especially those implemented in scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">label encoding</a>.</p>

In [24]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le=LabelEncoder()

# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning (cc_apps_clean came from dropna())
cc_apps_clean = cc_apps_clean.copy()

# Iterate over each column and check its dtype
for col in cc_apps_clean.columns:
    # Only object (non-numeric) columns need encoding
    if cc_apps_clean[col].dtypes == 'object':
        # Use LabelEncoder to do the numeric transformation
        cc_apps_clean[col] = le.fit_transform(cc_apps_clean[col])
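
To interpret the encoded values later (for example in a confusion matrix), it helps to know which integer each original symbol received. The small sketch below uses made-up labels in the style of the <code>Approved</code> column; <code>demo_le</code> and <code>codes</code> are illustrative names. <code>LabelEncoder</code> sorts the distinct values and numbers them from 0, and the fitted <code>classes_</code> attribute records that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same style as the Approved column
labels = ["+", "-", "-", "+"]

demo_le = LabelEncoder()
codes = demo_le.fit_transform(labels)

# classes_[i] is the original value that was mapped to integer i
print(demo_le.classes_)  # ['+' '-']
print(codes)             # [0 1 1 0]
```

So with this encoding <code>+</code> becomes 0 and <code>-</code> becomes 1. Because the loop above reuses one encoder for every column, <code>classes_</code> only reflects the last column fitted.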

## 7. Splitting the dataset into train and test sets
<p>We have successfully converted all the non-numeric values to numeric ones.</p>
<p>Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.</p>
<p>Features like <code>DriversLicense</code> and <code>ZipCode</code> are not as important as the other features in the dataset for predicting credit card approvals, which is why we dropped them earlier: the aim is to build our machine learning model on the best set of features. In Data Science literature, this is often referred to as <em>feature selection</em>. </p>

In [25]:
# save cc_apps as a data frame before we convert it to a numpy array
df = cc_apps_clean

In [26]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# now change to numpy array
cc_apps_clean = cc_apps_clean.to_numpy()

# Segregate features and labels into separate variables
X,y = cc_apps_clean[:,0:13] , cc_apps_clean[:,13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)
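
The split above is purely random. If we wanted the train and test sets to keep roughly the same share of each class, <code>train_test_split</code> accepts a <code>stratify</code> argument. The sketch below uses synthetic data (<code>X_demo</code> and <code>y_demo</code> are made up for illustration) with a 45/55 class mix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, 45 zeros and 55 ones as labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 45 + [1] * 55)

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

This is optional for a dataset as balanced as ours, but it makes the reported support in the test set less dependent on the random seed.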

## 8. Preprocessing the data (part ii)
<p>The data is now split into two separate sets - train and test sets respectively. We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data. </p>
<p>Now, let's try to understand what these scaled values mean in the real world. Let's use <code>CreditScore</code> as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a <code>CreditScore</code> of 1 is the highest since we're rescaling all the values to the range of 0-1.</p>

<p>Scaling is important for Logistic Regression</p>
<p>Scaling is not necessary for Decision Trees and Random Forest</p>

In [27]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
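
To make the rescaling concrete: min-max scaling maps each value x to (x - min) / (max - min), where min and max are taken from the training data only. The sketch below uses made-up numbers and a separate <code>scaler_demo</code> object so that it does not interfere with the <code>scaler</code> fitted above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: training values span 0 to 10
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# Learn min and max from the training data only
scaler_demo = MinMaxScaler(feature_range=(0, 1)).fit(train)

# Apply the same training min and max to the test data
scaled_test = scaler_demo.transform(test)
print(scaled_test.ravel())  # [0.25 1.2 ]
```

Note that a test value larger than the training maximum ends up above 1, so scaled test features are not guaranteed to stay inside the 0-1 range.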

## 9. Fitting a logistic regression model to the train set
<p>Essentially, predicting if a credit card application will be approved or not is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names">According to UCI</a>, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved. </p>
<p>This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.</p>
<p>Which model should we pick? A question to ask is: <em>are the features that affect the credit card approval decision process correlated with each other?</em> Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).</p>

In [28]:
# Import LogisticRegression
# note we are using the rescaled data
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train,y_train)

## 10. Making predictions and evaluating performance
<p>But how well does our model perform? </p>
<p>We will now evaluate our model on the test set with respect to <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>. We will also take a look at the model's <a href="http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>. In the case of predicting credit card applications, it is equally important to see whether our machine learning model predicts as denied the applications that originally got denied. If our model does not perform well in this respect, it might end up approving applications that should have been denied. The confusion matrix helps us view our model's performance from these aspects.  </p>

In [29]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test,y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test,y_pred)

Accuracy of logistic regression classifier:  0.8532110091743119


array([[ 86,   6],
       [ 26, 100]])

In [30]:
# print classification report
print("Classification report - \n", classification_report(y_test,y_pred))

Classification report - 
               precision    recall  f1-score   support

         0.0       0.77      0.93      0.84        92
         1.0       0.94      0.79      0.86       126

    accuracy                           0.85       218
   macro avg       0.86      0.86      0.85       218
weighted avg       0.87      0.85      0.85       218



## How do we read the classification report?
There are four ways to check if the predictions are right or wrong:<br>
1.TN / True Negative: the case was negative and predicted negative<br>
2.TP / True Positive: the case was positive and predicted positive<br>
3.FN / False Negative: the case was positive but predicted negative<br>
4.FP / False Positive: the case was negative but predicted positive<br><br>

### Precision
Precision — What percent of your predictions were correct?

Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class, it is defined as the ratio of true positives to the sum of true positives and false positives.

Precision:- Accuracy of positive predictions.

Precision = TP/(TP + FP)

### Recall
Recall — What percent of the positive cases did you catch?

Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.

Recall:- Fraction of positives that were correctly identified.

Recall = TP/(TP+FN)

### F1
F1 score: How well does the classifier balance precision and recall?

The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. Because it combines both measures, a classifier scores well on F1 only when its precision and recall are both high. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

### Support

Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the evaluation process.
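
As a worked example, the formulas above can be applied by hand to the logistic regression confusion matrix printed earlier, <code>[[86, 6], [26, 100]]</code>, treating class 1 as the positive class. The variable names are illustrative; the numbers are the ones from the notebook output.

```python
import numpy as np

# Confusion matrix from the logistic regression cell (rows: actual, columns: predicted)
cm = np.array([[86, 6],
               [26, 100]])

# scikit-learn's layout for two classes is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                          # 100 / 106
recall = tp / (tp + fn)                             # 100 / 126
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()                     # 186 / 218

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The results (0.94, 0.79, 0.86 and 0.8532) match the class 1.0 row of the classification report and the reported accuracy.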


## 11. Grid searching and making the model perform better
<p>Our model was pretty good! It was able to yield an accuracy score of about 85%.</p>
<p>For the confusion matrix, note that <code>LabelEncoder</code> assigns codes in sorted order, so <code>+</code> (approved) is encoded as 0 and <code>-</code> (denied) as 1. The first element of the first row of the confusion matrix therefore counts the class 0 instances (approved applications) predicted by the model correctly. And the last element of the second row counts the class 1 instances (denied applications) predicted by the model correctly.</p>
<p>Let's see if we can do better. We can perform a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> of the model parameters to improve the model's ability to predict credit card approvals.</p>
<p><a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">scikit-learn's implementation of logistic regression</a> exposes several hyperparameters, but we will grid search over the following two:</p>
<ul>
<li>tol</li>
<li>max_iter</li>
</ul>

<p>You can use grid search to try to improve other models as well! Grid Search is only applied to Logistic regression in this notebook.</p>

In [31]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001 ,0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are the corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

## 12. Finding the best performing model
<p>We have defined the grid of hyperparameter values and converted them into a single dictionary format which <code>GridSearchCV()</code> expects as one of its parameters. Now, we will begin the grid search to see which values perform best.</p>
<p>We will instantiate <code>GridSearchCV()</code> with our earlier <code>logreg</code> model with all the data we have. Instead of passing train and test sets separately, we will supply <code>X</code> (scaled version) and <code>y</code>. We will also instruct <code>GridSearchCV()</code> to perform a <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a> of five folds.</p>
<p>We'll finish this part by storing the best-achieved score and the respective best parameters.</p>
<p>While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as <strong>scaling</strong>, <strong>label encoding</strong>, and <strong>missing value handling</strong>. We finished with some <strong>machine learning</strong> to predict if a person's application for a credit card would get approved or not given some information about that person.</p>

<p>Grid search is a tuning technique that attempts to find the optimum values of hyperparameters. It is an exhaustive search performed over specified parameter values of a model, which is also known as an estimator. Automating this search can save us time, effort and resources.</p>

In [32]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.859033 using {'max_iter': 100, 'tol': 0.01}
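
The cell above rescales all of <code>X</code> before cross-validation, so each validation fold has already influenced the scaler. One way to avoid that, sketched here on synthetic data from <code>make_classification</code> rather than the credit card data, is to put the scaler and the model in a <code>Pipeline</code> and grid search over the pipeline. The step names <code>scaler</code> and <code>logreg</code> are arbitrary choices; parameters are addressed as <code>stepname__parameter</code>.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data, not the credit card dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# The scaler is re-fitted on the training part of every CV fold
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression()),
])

# Same grid as above, addressed through the pipeline step name
param_grid_demo = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

search = GridSearchCV(pipe, param_grid=param_grid_demo, cv=5)
search.fit(X_demo, y_demo)

print("Best: %f using %s" % (search.best_score_, search.best_params_))
```

The same pattern should work with the notebook's unscaled <code>X</code> and <code>y</code> in place of the synthetic arrays.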


## Let's Try Some Other Models for Supervised Learning

## Decision Tree
<p>Create an object of the class DecisionTreeClassifier and assign it to the variable dtree. Then fit this tree with X_train and y_train. Finally, print the statement Decision Tree Classifier Created after the decision tree is built.</p>
<p>Decision tree classification is largely unaffected by outliers in the data, because the data is split using scores calculated from the homogeneity of the resulting groups of data points. Decision trees and tree-based ensemble methods do not require feature scaling, as their splits depend only on the ordering of feature values, not on their scale.</p>

In [33]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate a DecisionTree classifier with default parameter values
dtree=DecisionTreeClassifier()

# Fit the decision tree to the training data (unscaled, since trees do not need scaling)
dtree.fit(X_train,y_train)

#print creation message
print('Decision Tree Classifier Created')




Decision Tree Classifier Created


In [34]:
# Predicting the values of test data
y_pred = dtree.predict(X_test)

# print classification report
print("Classification report - \n", classification_report(y_test,y_pred))

Classification report - 
               precision    recall  f1-score   support

         0.0       0.73      0.78      0.75        92
         1.0       0.83      0.79      0.81       126

    accuracy                           0.78       218
   macro avg       0.78      0.78      0.78       218
weighted avg       0.79      0.78      0.79       218



In [35]:
# Get the accuracy score of the Decision Tree model and print it
print("Accuracy of Decision Tree classifier: ", dtree.score(X_test,y_test))

# Print the confusion matrix of the Decision Tree model
confusion_matrix(y_test,y_pred)

Accuracy of Decision Tree classifier:  0.7844036697247706


array([[72, 20],
       [27, 99]])

### How did the Decision Tree classifier perform?
At about 78.4 percent accuracy, it is performing worse than the Logistic Regression model (about 85.3 percent).

### Showing the rules from the tree
Feature names that are shown in the rules are not optimal.  Can this be improved?

In [36]:
#Let's see the rules that the decision tree provides
from sklearn.tree import export_text
#get rules
tree_rules = export_text(dtree)
print(tree_rules)

|--- feature_8 <= 0.50
|   |--- feature_2 <= 0.19
|   |   |--- feature_6 <= 1.00
|   |   |   |--- class: 0.0
|   |   |--- feature_6 >  1.00
|   |   |   |--- feature_4 <= 1.00
|   |   |   |   |--- feature_2 <= 0.02
|   |   |   |   |   |--- class: 1.0
|   |   |   |   |--- feature_2 >  0.02
|   |   |   |   |   |--- feature_0 <= 0.50
|   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |--- feature_0 >  0.50
|   |   |   |   |   |   |--- class: 0.0
|   |   |   |--- feature_4 >  1.00
|   |   |   |   |--- class: 1.0
|   |--- feature_2 >  0.19
|   |   |--- feature_12 <= 52776.00
|   |   |   |--- feature_5 <= 12.50
|   |   |   |   |--- feature_1 <= 180.50
|   |   |   |   |   |--- feature_5 <= 11.00
|   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |--- feature_5 >  11.00
|   |   |   |   |   |   |--- feature_7 <= 1.25
|   |   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |   |--- feature_7 >  1.25
|   |   |   |   |   |   |   |--- feature_7 <= 1.58
|   |   |   |   |   |
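
The generic feature_N labels in the rules above can be replaced by passing the column names to `export_text` through its `feature_names` parameter. The sketch below uses the iris dataset as stand-in data so that it runs on its own; `demo_tree` is a hypothetical name, not the notebook's `dtree`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data so the sketch runs on its own; in the notebook you would use
# the fitted dtree and the names of its training columns instead
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_tree.fit(iris.data, iris.target)

# feature_names must list one name per training column, in column order
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

For the credit card tree, the same call would take the list of feature column names (for example Male, Age, Debt, and so on), assuming they are given in the same order as the columns of X_train.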

### Advantages and Disadvantages of Decision Tree
Advantages:<br>
1. Compared to many other algorithms, decision trees require less data preparation during pre-processing.<br>
2. A decision tree does not require normalization of the data.<br>
3. A decision tree does not require feature scaling either, because each split compares a single feature against a threshold.<br>
4. In principle, missing values need not prevent building a decision tree, although whether they are accepted depends on the implementation and library version, so check before skipping imputation.<br>
5. A decision tree model is intuitive and easy to explain to technical teams as well as stakeholders.<br><br>

Disadvantages:<br>
1. A small change in the data can cause a large change in the structure of the tree, which makes it unstable.<br>
2. The calculations for a decision tree can sometimes become far more complex than for other algorithms, especially when the tree grows deep.<br>
3. Training can be slow on large datasets, because every candidate split on every feature has to be evaluated at each node.<br>
4. That training cost grows with the number of rows, the number of features, and the depth of the tree.<br>
5. For regression, a decision tree predicts piecewise-constant values, so it approximates continuous targets coarsely and cannot extrapolate beyond the range seen in training.<br>


## Random Forest
<p>Learn more here: <a href="https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/">Random Forest Explanation</a></p>
<p>Random Forest often performs better than a single decision tree, although no method is best on every dataset.</p><p>Scaling is again not necessary.</p>

In [37]:
# Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate a Random Forest Classifier with 100 trees, a maximum depth of 20,
# and out-of-bag scoring enabled
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=20,
                                       n_estimators=100, oob_score=True)

# Fit the random forest classifier to the training data
classifier_rf.fit(X_train,y_train)

#print creation message
print('Random Forest Classifier Created')



Random Forest Classifier Created


In [38]:
# Predicting the values of test data
y_pred = classifier_rf.predict(X_test)

# print classification report
print("Random Forest Classification report - \n", classification_report(y_test,y_pred))

Random Forest Classification report - 
               precision    recall  f1-score   support

         0.0       0.79      0.91      0.84        92
         1.0       0.93      0.82      0.87       126

    accuracy                           0.86       218
   macro avg       0.86      0.87      0.86       218
weighted avg       0.87      0.86      0.86       218



In [39]:
# Get the accuracy score of Random Forest model and print it
print("Accuracy of Random Forest classifier: ", classifier_rf.score(X_test,y_test))

# Print the confusion matrix of the Random Forest model
confusion_matrix(y_test,y_pred)

Accuracy of Random Forest classifier:  0.8577981651376146


array([[ 84,   8],
       [ 23, 103]])

### Advantages and Disadvantages of Random Forest
<p>
    Advantages

1. It can be used for both classification and regression problems.

2. It reduces overfitting compared to a single tree, because the output is based on majority voting or averaging.

3. It can tolerate null/missing values in some implementations, although many libraries still expect them to be imputed first.

4. Each decision tree is built independently of the others, so training can be parallelized.

5. It is highly stable, because the answers of a large number of trees are averaged.

6. It maintains diversity among the trees, because not all attributes are considered at each split (this depends on the settings and does not hold in every case).

7. It copes comparatively well with high-dimensional data. Since each split considers only a subset of the attributes, the search space at each step is reduced.

8. It provides a built-in validation estimate: each tree is trained on a bootstrap sample, so on average about 37% of the rows are not seen by that tree (out-of-bag) and can be used to score it. A separate test set is still useful for comparing the forest with other models.

Disadvantages

1. A random forest is much harder to interpret than a single decision tree, where a decision can be explained by following one path through the tree.

2. Training takes longer than for a single tree because many trees have to be built. Prediction is also slower, because every tree has to generate an output for the given input.
</p>
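
The out-of-bag idea can be illustrated with a short sketch. It uses synthetic data from `make_classification` rather than the credit card dataset, and the names `X_demo`, `y_demo`, and `rf` are hypothetical. The fraction of rows a single bootstrap sample leaves out is computed from the formula (1 - 1/n)^n, which approaches 1/e as n grows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Probability that a given row is left out of one bootstrap sample of size n
n = len(X_demo)
oob_fraction = (1 - 1 / n) ** n   # approaches 1/e (about 0.368) as n grows

# oob_score=True asks the forest to score each row using only the trees
# that did not see that row during training
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_demo, y_demo)

print(f"Expected out-of-bag fraction per tree: {oob_fraction:.3f}")
print(f"Out-of-bag accuracy estimate: {rf.oob_score_:.3f}")
```

Because `classifier_rf` in this notebook was created with `oob_score=True`, its `oob_score_` attribute should be available after fitting and can be compared with the test accuracy.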

## Is there an Easier Way?  Pipelines (can also be used for preprocessing steps)
<p>The code below shows how to run several models in one loop.</p>
<p>With this approach, however, we are not using the scaled data, so keep in mind that scale-sensitive models such as KNeighborsClassifier, SVC, and SGDClassifier may underperform.</p><p>If you need additional preprocessing steps, you can add them to the pipeline as well.</p>
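
A preprocessing step can be placed in front of the classifier inside the pipeline. The sketch below is one possible arrangement, assuming synthetic stand-in data and hypothetical names (`X_demo`, `scaled_svc`); one feature is deliberately inflated to mimic a column on a much larger scale, such as Income.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic a large-scale column
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 10_000
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# The scaler is fitted on the training data only, then applied to the test data
scaled_svc = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="rbf")),
])
scaled_svc.fit(X_tr, y_tr)

score = scaled_svc.score(X_te, y_te)
print("SVC with scaling inside the pipeline: %.3f" % score)
```

Keeping the scaler inside the pipeline means the same fitted transformation is reused automatically at prediction time, so the test data never influences the scaling parameters.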

### Remember to learn more by reviewing the documentation on each model at http://www.scikit-learn.org



In [40]:
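# Imports repeated here so that this cell runs on its own
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = [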
    KNeighborsClassifier(n_neighbors = 5),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    SGDClassifier(),
    GaussianNB()
    ]

top_class = []

for classifier in classifiers:
    pipe = Pipeline(steps=[('classifier', classifier)])

    # training model
    pipe.fit(X_train, y_train)
    print(classifier)

    acc_score = pipe.score(X_test, y_test)
    print("model score: %.3f" % acc_score)

    # using the model to predict
    y_pred = pipe.predict(X_test)

    # target_names = [le_name_mapping[x] for x in le_name_mapping]
    print(classification_report(y_test, y_pred))


KNeighborsClassifier()
model score: 0.716
              precision    recall  f1-score   support

         0.0       0.71      0.55      0.62        92
         1.0       0.72      0.83      0.77       126

    accuracy                           0.72       218
   macro avg       0.71      0.69      0.70       218
weighted avg       0.71      0.72      0.71       218

SVC(C=0.025, probability=True)
model score: 0.578
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00        92
         1.0       0.58      1.00      0.73       126

    accuracy                           0.58       218
   macro avg       0.29      0.50      0.37       218
weighted avg       0.33      0.58      0.42       218

DecisionTreeClassifier()
model score: 0.789
              precision    recall  f1-score   support

         0.0       0.73      0.80      0.76        92
         1.0       0.84      0.78      0.81       126

    accuracy                           0.79   



RandomForestClassifier()
model score: 0.872
              precision    recall  f1-score   support

         0.0       0.82      0.89      0.85        92
         1.0       0.92      0.86      0.89       126

    accuracy                           0.87       218
   macro avg       0.87      0.87      0.87       218
weighted avg       0.88      0.87      0.87       218

AdaBoostClassifier()
model score: 0.839
              precision    recall  f1-score   support

         0.0       0.76      0.90      0.83        92
         1.0       0.92      0.79      0.85       126

    accuracy                           0.84       218
   macro avg       0.84      0.85      0.84       218
weighted avg       0.85      0.84      0.84       218

GradientBoostingClassifier()
model score: 0.858
              precision    recall  f1-score   support

         0.0       0.80      0.89      0.84        92
         1.0       0.91      0.83      0.87       126

    accuracy                           0.86       

## Which model performs the best?
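
One simple way to answer this is to collect each model's test accuracy and sort the results. The sketch below copies the scores from the pipeline output above; the `scores` dictionary and `ranking` list are illustrative names, not part of the original notebook.

```python
# Test accuracies copied from the pipeline output above; the SGDClassifier and
# GaussianNB results are not shown there, so they are left out
scores = {
    "KNeighborsClassifier": 0.716,
    "SVC": 0.578,
    "DecisionTreeClassifier": 0.789,
    "RandomForestClassifier": 0.872,
    "AdaBoostClassifier": 0.839,
    "GradientBoostingClassifier": 0.858,
}

# Sort from highest to lowest accuracy
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<28} {acc:.3f}")

best_name, best_acc = ranking[0]
```

Among the models whose output is shown, the Random Forest has the highest accuracy. These figures come from a single train/test split, so small differences between the top models should be interpreted with caution.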

#### Research Questions


1. Is there a correlation between Age, Income, Credit Score, and Debt levels and the credit approval status? Can this relationship be used to predict whether a person is granted credit? If yes, does the relationship indicate reasonable risk management strategies?<br>


2. Ethnicity is a protected status, and the decision to approve or deny an application cannot be based on the applicant’s ethnicity.
Is there a statistically significant difference in how credit is granted between ethnicities that could indicate bias or discrimination? Conversely, could the difference indicate a business opportunity?<br>



### Feature Selection
https://www.datacamp.com/community/tutorials/feature-selection-python<br><br>
https://www.datatechnotes.com/2021/02/seleckbest-feature-selection-example-in-python.html<br><br>
The importance of feature selection is clearest when you are dealing with a dataset that contains a vast number of features, often referred to as a high-dimensional dataset. High dimensionality brings several problems: it significantly increases the training time of your machine learning model, and it can make the model very complicated, which in turn may lead to overfitting.

A high-dimensional feature set often contains redundant features, meaning features that are little more than extensions of other, essential features. These redundant features do not contribute effectively to model training. So there is a clear need to extract the most important and most relevant features from a dataset in order to get the most effective predictive modeling performance.

"The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."


### Feature Selection and Dimensionality Reduction
Let's understand the difference between dimensionality reduction and feature selection.<br>

Feature selection is sometimes mistaken for dimensionality reduction, but the two are different. Both reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes (sometimes known as feature transformation), whereas feature selection methods include and exclude attributes present in the data without changing them.
<br><br>
Some examples of dimensionality reduction methods are Principal Component Analysis, Singular Value Decomposition, and Linear Discriminant Analysis.<br>
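
The difference can be seen in a small sketch. It uses the iris dataset as stand-in data and `f_classif` as the scoring function (both are choices made for this illustration only): the selector returns unchanged copies of two original columns, while PCA returns two new columns that are weighted combinations of all four.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = load_iris(return_X_y=True)

# Feature selection: keeps 2 of the 4 original columns unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)
kept = selector.get_support(indices=True)

# Dimensionality reduction: builds 2 new columns as combinations of all 4
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X_demo)

print("Selected column indices:", kept)
print("Selected columns are copies of the originals:",
      np.array_equal(X_selected, X_demo[:, kept]))
print("PCA component weights (one row per new feature):")
print(pca.components_.round(2))
```

Each row of `pca.components_` holds the weights that mix the original columns into one new feature, which is why the PCA outputs no longer correspond to any single named attribute.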

Importance of feature selection:
• It enables the machine learning algorithm to train faster.
• It reduces the complexity of a model and makes it easier to interpret.
• It can improve the accuracy of a model if the right subset is chosen.
• It reduces overfitting.

General feature selection methods fall into three types: filter methods, wrapper methods, and embedded methods. The SelectKBest approach used later in this notebook is a filter method.



### When to use Feature Selection
Important consideration

You may already appreciate the value of feature selection in a machine learning pipeline. It is just as important to understand exactly where in the pipeline it should be integrated.

Simply put, feature selection should happen just before the data is fed to the model for training, and this matters most when you use accuracy estimation methods such as cross-validation. Inside cross-validation, feature selection must be performed on the training portion of each fold right before the model is trained. Performing feature selection once on all of the data, and only then doing model selection and training on the selected features, is a mistake.

If you perform feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure has also been used to choose the features, and this tends to bias the performance estimate of your machine learning model upward.
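
One way to follow this advice is to make the selector a pipeline step and pass the whole pipeline to `cross_val_score`. The sketch below assumes synthetic data and uses `f_classif` instead of `chi2`, because the synthetic features contain negative values; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with 20 features, 4 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=4, random_state=0)

# Because the selector is a pipeline step, it is refitted on the training part
# of every fold and never sees that fold's held-out rows
pipe = Pipeline(steps=[
    ("select", SelectKBest(score_func=f_classif, k=6)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("Accuracy per fold:", fold_scores.round(3))
print("Mean accuracy: %.3f" % fold_scores.mean())
```

The cell that follows in this notebook fits SelectKBest on X_train only, which applies the same principle to a single train/test split.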


In [41]:
# Import the necessary libraries first
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from numpy import array

In [43]:
#use training X and y data
#select the top 6 features (chi2 requires non-negative feature values)
select = SelectKBest(score_func=chi2, k=6)
#z = select.fit_transform(x,y)
z = select.fit_transform(X_train, y_train)
print("After selecting best features:", z.shape)

After selecting best features: (441, 6)


In [44]:
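# Column positions of the selected features (indices=True returns integer
# positions instead of a boolean mask)
selected_idx = select.get_support(indices=True)
# df.columns still includes the target 'Approved' as its last entry; the
# positions line up assuming X was built from the preceding columns in order
features = array(df.columns)

print("All features:")
print(features)

print("Selected best:")
print(features[selected_idx])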


All features:
['Male' 'Age' 'Debt' 'Married' 'BankCustomer' 'EducationLevel' 'Ethnicity'
 'YearsEmployed' 'PriorDefault' 'Employed' 'CreditScore' 'Citizen'
 'Income' 'Approved']
Selected best:
['Age' 'Debt' 'YearsEmployed' 'PriorDefault' 'CreditScore' 'Income']


## Sources:
<p>The original code (from DataCamp) has been enhanced with code from Analytics Vidhya, Calmcode, and Dr. P. L. Thompson.</p>