# Predicting Credit Card Approvals
\* Based on a project presented at <a href="https://www.datacamp.com/projects/558"> Datacamp</a>, but with more analyzes performed

## Credit Card Applications

<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming. Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval Dataset</a> from the UCI Machine Learning Repository. The structure of this notebook is as follows:</p>
<ul>
<li>First, we will start off by loading and viewing the dataset.</li>
<li>We will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, plus that it contains a number of missing entries.</li>
<li>We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.</li>
<li>After our data is in good shape, we will do some exploratory data analysis to build our intuitions.</li>
<li>Finally, we will build some machine learning models that can predict if an individual application for a credit card will be accepted.</li>
</ul>

## 1. Loading and Viewing the Dataset

<p>First, loading and viewing the dataset. We find that since this data is confidential, the contributor of the dataset has anonymized the feature names.</p>

In [1]:
# Import pandas
import pandas as pd

# Loading dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Viewing the header of the dataset
print(cc_apps.head())

  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +


## 2. Inspecting the Application

<p>The output may appear a bit confusing at its first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect the privacy, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>. This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.   </p>

<p>As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.</p>

In [2]:
# Changing the columns names
cc_apps.columns = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel', 
                   'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 
                   'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']

# Inspect missing values in the dataset
print(cc_apps.tail(n=17))
print("\n")

# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)
print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

    Gender    Age    Debt Married BankCustomer EducationLevel Ethnicity  \
673      ?  29.50   2.000       y            p              e         h   
674      a  37.33   2.500       u            g              i         h   
675      a  41.58   1.040       u            g             aa         v   
676      a  30.58  10.665       u            g              q         h   
677      b  19.42   7.250       u            g              m         v   
678      a  17.92  10.210       u            g             ff        ff   
679      a  20.08   1.250       u            g              c         v   
680      b  19.50   0.290       u            g              k         v   
681      b  27.83   1.000       y            p              d         h   
682      b  17.08   3.290       u            g              i         v   
683      b  36.42   0.750       y            p              d         v   
684      b  40.58   3.290       u            g              m         v   
685      b  21.08  10.085

## 3. Handling the Missing Values (Part 1)

<p>We've uncovered some issues that will affect the performance of our machine learning models if they go unchanged:</p>
<ul>
<li>The dataset contains both numeric and non-numeric data (specifically data that are of <code>float64</code>, <code>int64</code> and <code>object</code> types). Specifically, the features <code>Debt</code>, <code>YearsEmployed</code>, <code>CreditScore</code> and <code>Income</code> contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Some features have a value range (min and max values) of 0 to 28 (<code>Debt</code> and <code>YearsEmployed</code>), some have a range of 2 to 67 (<code>CreditScore</code>) and some have a range of 1017 to 100000 (<code>Income</code>). Apart from these, we can get useful statistical information (like <code>mean</code>, <code>max</code>, and <code>min</code>) about the features that have numerical values. </li>
<li>Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?' and will be converted into more interesting values for machine learning algorithms.</li>
</ul>
<p>For now, let's temporarily replace these missing value question marks with NaN.</p>

In [3]:
# Import numpy
import numpy as np

# Inspect missing values in the dataset (column Gender have one example)
print(cc_apps.Gender.tail(n=17))
print("\n")

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.NaN)

# Inspect the missing values again
print(cc_apps.Gender.tail(n=17))

673    ?
674    a
675    a
676    a
677    b
678    a
679    a
680    b
681    b
682    b
683    b
684    b
685    b
686    a
687    a
688    b
689    b
Name: Gender, dtype: object


673    NaN
674      a
675      a
676      a
677      b
678      a
679      a
680      b
681      b
682      b
683      b
684      b
685      b
686      a
687      a
688      b
689      b
Name: Gender, dtype: object


## 4. Handling the Missing Values (Part 2)

<p>We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.</p>
<p>An important question that gets raised here is <em>why are we giving so much importance to missing values</em>? Can't they be just ignored? Ignoring missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training. Then, there are many models which cannot handle missing values implicitly. </p>
<p>So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.</p>

In [4]:
# Impute the missing (NaN) values with mean imputation
cc_apps.fillna(cc_apps.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
print(cc_apps.isnull().sum())

Gender            12
Age               12
Debt               0
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
ApprovalStatus     0
dtype: int64


<p>Note that numeric features columns (2, 7, 10 and 14) no longer have NaN values, but the remaining columns still need to be manipulated.</p>

## 5. Handling the Missing Values (Part 3)

<p>We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for non-numeric columns (0, 1, 3, 4, 5, 6 and 13). All of these columns contain non-numeric data and this why the mean imputation strategy would not work here. This needs a different treatment. </p>

<p>We are going to impute these missing values with the most frequent values as present in the respective columns. This is a <a href="https://www.datacamp.com/community/tutorials/categorical-data">good practice</a> when it comes to imputing missing values for categorical data in general.</p>

In [5]:
# Iterate over each column of cc_apps
for col in cc_apps.columns:
    # Check if the column it is a object type
    if cc_apps[col].dtypes == 'object':
        # Impute the missing (NaN) values with the most frequent value
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps.isnull().sum())

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
ApprovalStatus    0
dtype: int64


<p>Now the non-numeric features columns also don't have NaN values.</p>

## 6. Preprocessing the Data (Part 1)

<p>The missing values are now successfully handled.</p>
<p>There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:</p>
<ol>
<li>Convert the non-numeric data into numeric.</li>
<li>Split the data into train and test sets. </li>
<li>Scale the feature values to a uniform range.</li>
</ol>

<p>First, we will be converting all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">label encoding</a>.</p>

In [6]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

print(cc_apps.head())
print("\n")

# Iterate over all the values of each column and extract their dtypes
for col in cc_apps.columns.values:
    # Compare if the dtype is a object
    if cc_apps[col].dtype == 'object':
    # Use LabelEncoder to do the numeric transformation
        cc_apps[col] = le.fit_transform(cc_apps[col])
        
print(cc_apps.head())

  Gender    Age   Debt Married BankCustomer EducationLevel Ethnicity  \
0      b  30.83  0.000       u            g              w         v   
1      a  58.67  4.460       u            g              q         h   
2      a  24.50  0.500       u            g              q         h   
3      b  27.83  1.540       u            g              w         v   
4      b  20.17  5.625       u            g              w         v   

   YearsEmployed PriorDefault Employed  CreditScore DriversLicense Citizen  \
0           1.25            t        t            1              f       g   
1           3.04            t        t            6              f       g   
2           1.50            t        f            0              f       g   
3           3.75            t        t            5              t       g   
4           1.71            t        f            0              f       s   

  ZipCode  Income ApprovalStatus  
0   00202       0              +  
1   00043     560           

## 7. Splitting the Dataset into Train/Test Sets

<p>We have successfully converted all the non-numeric values to numeric ones.</p>

<p>Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.</p>

<p>Also, features like <code>DriversLicense</code> and <code>ZipCode</code> are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as <em>feature selection</em>. </p>

In [7]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features DriversLicense (11) and ZipCode (13) and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop(["DriversLicense", "ZipCode"], axis=1)
print(cc_apps.head())

# Convert the Dataframe to a Np.array
cc_apps = cc_apps.values

# Segregate features and labels into X and y (X = all rows and [0...13] columns, Y = all rows and 14 column)
X, y = cc_apps[:, 0:12] , cc_apps[:, 13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

   Gender  Age   Debt  Married  BankCustomer  EducationLevel  Ethnicity  \
0       1  156  0.000        2             1              13          8   
1       0  328  4.460        2             1              11          4   
2       0   89  0.500        2             1              11          4   
3       1  125  1.540        2             1              13          8   
4       1   43  5.625        2             1              13          8   

   YearsEmployed  PriorDefault  Employed  CreditScore  Citizen  Income  \
0           1.25             1         1            1        0       0   
1           3.04             1         1            6        0     560   
2           1.50             1         0            0        0     824   
3           3.75             1         1            5        0       3   
4           1.71             1         0            0        2       0   

   ApprovalStatus  
0               0  
1               0  
2               0  
3               0  
4   

## 8. Preprocessing the Data (Part 2)

<p>The data is now split into two separate sets - train and test sets respectively. We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data. </p>

<p>Now, let's try to understand what these scaled values mean in the real world. Let's use <code>CreditScore</code> as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a <code>CreditScore</code> of 1 is the highest since we're rescaling all the values to the range of 0-1.</p>

In [8]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler 

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

<p>Essentially, predicting if a credit card application will be approved or not is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names">According to UCI</a>, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved. </p>
<p>This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.</p>


## 9. Fitting a Logistic Regression Model to Making Predictions

<p>Which model should we pick? A question to ask is: <em>are the features that affect the credit card approval decision process correlated with each other?</em> Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).</p>

<p>We will now evaluate our model on the test set with respect to <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>. But we will also take a look the model's <a href="http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.  </p>

<p>For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.</p>

In [9]:
# Import LogisticRegression and confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression(solver='lbfgs')

# Fit logreg to the train set
logreg.fit(X_train, y_train)

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(X_test)

# Print the confusion matrix of the logreg model
print("The Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\n")

# Get the accuracy score of logreg model and print it
acc_lr = round(logreg.score(X_test, y_test) * 100, 2)
print("Accuracy of Logistic Regression: " + str(acc_lr) + "%")


The Confusion Matrix:
[[93 10]
 [27 98]]


Accuracy of Logistic Regression: 83.77%


## 10. Fitting a Support-Vector Clustering Model to Making Predictions


Let's try now with another machine learning modeling, the Support-vector Clustering (SVC).

In [10]:
# Import SVC 
from sklearn.svm import SVC

# Instantiate a SVC classifier with default parameter values
svc = SVC(gamma='scale')

# Fit SVC to the train set
svc.fit(X_train, y_train)

# Use SVC to predict instances from the test set and store it
y_pred = svc.predict(X_test)

# Get the accuracy score of SVC model and print it
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("Accuracy of SVC: " + str(acc_svc) + "%")

Accuracy of SVC: 84.21%


## 11. Fitting a  K-Nearest Neighbors Model to Making Predictions


Let's try now with another machine learning modeling, the K-Nearest Neighbors (KNN).

In [11]:
# Import KNN 
from sklearn.neighbors import KNeighborsClassifier

# Instantiate a KNN classifier with default parameter values
knn = KNeighborsClassifier(n_neighbors = 4)

# Fit KNN to the train set
knn.fit(X_train, y_train)

# Use KNN to predict instances from the test set and store it
y_pred = knn.predict(X_test)

# Get the accuracy score of KNN model and print it
acc_knn = round(knn.score(X_test, y_test) * 100, 2)
print("Accuracy of KNN: " + str(acc_knn) + "%")

Accuracy of KNN: 87.28%


## 12. Fitting a  Gaussian Naive Bayes Model to Making Predictions


Let's try now with another machine learning modeling, the Gaussian Naive Bayes (GNB).

In [12]:
# Import GaussianNB 
from sklearn.naive_bayes import GaussianNB

# Instantiate a GaussianNB classifier with default parameter values
gaussian = GaussianNB()

# Fit GaussianNB to the train set
gaussian.fit(X_train, y_train)

# Use GaussianNB to predict instances from the test set and store it
y_pred = gaussian.predict(X_test)

# Get the accuracy score of GaussianNB model and print it
acc_gnb = round(gaussian.score(X_test, y_test) * 100, 2)
print("Accuracy of GNB: " + str(acc_gnb) + "%")

Accuracy of GNB: 82.46%


## 13. Fitting a  Perceptron Model to Making Predictions


Let's try now with another machine learning modeling, the Perceptron.

In [13]:
# Import Perceptron 
from sklearn.linear_model import Perceptron

# Instantiate a Perceptron classifier with default parameter values
perceptron = Perceptron(tol=0.21, random_state=0, max_iter=1000)

# Fit Perceptron to the train set
perceptron.fit(X_train, y_train)

# Use Perceptron to predict instances from the test set and store it
y_pred = perceptron.predict(X_test)

# Get the accuracy score of Perceptron model and print it
acc_per = round(perceptron.score(X_test, y_test) * 100, 2)
print("Accuracy of Perceptron: " + str(acc_per) + "%")

Accuracy of Perceptron: 86.4%


## 14. Fitting a Linear SVC Model to Making Predictions


Let's try now with another machine learning modeling, the Linear SVC.

In [14]:
# Import LinearSVC 
from sklearn.svm import LinearSVC

# Instantiate a LinearSVC classifier with default parameter values
linear_svc = LinearSVC()

# Fit LinearSVC to the train set
linear_svc.fit(X_train, y_train)

# Use LinearSVC to predict instances from the test set and store it
y_pred = linear_svc.predict(X_test)

# Get the accuracy score of LinearSVC model and print it
acc_lsvc = round(linear_svc.score(X_test, y_test) * 100, 2)
print("Accuracy of LinearSVC: " + str(acc_lsvc) + "%")

Accuracy of LinearSVC: 84.65%


## 15. Fitting a Stochastic Gradient Descent Model to Making Predictions


Let's try now with another machine learning modeling, the Stochastic Gradient Descent.

In [15]:
# Import SGDClassifier 
from sklearn.linear_model import SGDClassifier

# Instantiate a SGDClassifier classifier with default parameter values
sgd = SGDClassifier(tol=0.21, random_state=0, max_iter=1000)

# Fit SGDClassifier to the train set
sgd.fit(X_train, y_train)

# Use SGDClassifier to predict instances from the test set and store it
y_pred = sgd.predict(X_test)

# Get the accuracy score of SGDClassifier model and print it
acc_sgd = round(sgd.score(X_test, y_test) * 100, 2)
print("Accuracy of SGDClassifier: " + str(acc_sgd) + "%")

Accuracy of SGDClassifier: 79.39%


## 16. Fitting a Grid Search Model to Making Preditctions

<p>We have defined the grid of hyperparameter values and converted them into a single dictionary format which <code>GridSearchCV()</code> expects as one of its parameters. Now, we will begin the grid search to see which values perform best.</p>
<p>We will instantiate <code>GridSearchCV()</code> with our earlier <code>logreg</code> model with all the data we have. Instead of passing train and test sets separately, we will supply <code>X</code> (scaled version) and <code>y</code>. We will also instruct <code>GridSearchCV()</code> to perform a <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a> of five folds.</p>
<p>We'll end the notebook by storing the best-achieved score and the respective best parameters.</p>

In [16]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol = tol, max_iter = max_iter)

# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
# print("Best: %f using %s" % (best_score, best_params))

acc_gscv = round(best_score * 100, 2)
print("Accuracy of GridSearchCV: " + str(acc_gscv) + "%")

Accuracy of GridSearchCV: 85.07%


## 17. Fitting a Decision Tree Model to Making Predictions


Let's try now with another machine learning modeling, the Decision Tree.

In [17]:
# Import DecisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier

# Instantiate a Decision Tree classifier with default parameter values
decision_tree = DecisionTreeClassifier(random_state=0)

# Fit Decision Tree to the train set
decision_tree.fit(X_train, y_train)

# Use Decision Tree to predict instances from the test set and store it
y_pred = decision_tree.predict(X_test)

# Get the accuracy score of Decision Tree model and print it
acc_dt = round(decision_tree.score(X_test, y_test) * 100, 2)
print("Accuracy of Decision Tree: " + str(acc_dt) + "%")

Accuracy of Decision Tree: 54.39%


## 18. Fitting a Random Forest Model to Making Predictions


Let's try now with another machine learning modeling, the Random Forest.

In [18]:
# Import RandomForestClassifier 
from sklearn.ensemble import RandomForestClassifier

# Instantiate a Random Forest classifier with default parameter values
random_forest = RandomForestClassifier(n_estimators=1000, random_state=0)

# Fit Random Forest to the train set
random_forest.fit(X_train, y_train)

# Use Random Forest to predict instances from the test set and store it
y_pred = random_forest.predict(X_test)

# Get the accuracy score of Random Forest model and print it
acc_rf = round(random_forest.score(X_test, y_test) * 100, 2)
print("Accuracy of Random Forest: " + str(acc_rf) + "%")

Accuracy of Random Forest: 85.53%


## 19. Model Evaluation

<p>While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as <strong>scaling</strong>, <strong>label encoding</strong>, and <strong>missing value imputation</strong>. We finished with some <strong>machine learning</strong> to predict if a person's application for a credit card would get approved or not given some information about that person.</p>

<p>We can now rank our evaluation of all the models to choose the best one for our problem. KNN has the highest accuracy, despite believing that if certain parameters are adjusted in different models, the results may be different.</p>



In [19]:
# Create a dict with all the results
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVC', 'KNN',
              'Naive Bayes', 'Perceptron', 'Linear SVC', 
              'Stochastic Gradient Decent', 'Grid Search',
              'Decision Tree', 'Random Forest'],
    'Score': [acc_lr, acc_svc, acc_knn, acc_gnb, 
              acc_per, acc_lsvc, acc_sgd, acc_gscv,
              acc_dt, acc_rf ]})

models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
2,KNN,87.28
4,Perceptron,86.4
9,Random Forest,85.53
7,Grid Search,85.07
5,Linear SVC,84.65
1,SVC,84.21
0,Logistic Regression,83.77
3,Naive Bayes,82.46
6,Stochastic Gradient Decent,79.39
8,Decision Tree,54.39
