## 1. Credit Card Applications
<p>Commercial banks receive a lot of applications for credit cards, and many of them are rejected for a variety of reasons. <br>
Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and many commercial banks do so nowadays. 
<br>
<br>
In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.</p>
<p><img src="https://cdn.pixabay.com/photo/2016/07/15/21/07/credit-card-1520400_960_720.jpg" alt="Credit card being held in hand"></p>
<br>
<p>
We'll use the Credit Card Approval dataset from the UCI Machine Learning Repository. The dataset is located at "..\Datasets\cc_approvals.data". </p>
<br>
Notebook Outline:  

- First, we will start off by loading and viewing the dataset.  

- We will see that the dataset has a mixture of numerical and non-numerical features, that its values span different ranges, and that it contains a number of missing entries. 

- We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.  
- After our data is in good shape, we will do some exploratory data analysis to build our intuitions.  
- Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.

**1. Loading and viewing the data:**

In [1]:
import pandas as pd

# raw string, so the backslashes in the Windows-style path are not read as escape sequences
url = r'..\Datasets\cc_approvals.data'
df = pd.read_csv(url, header=None)

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
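Missing entries in this file are marked with `?` (they get replaced with NaN later in the notebook). As an alternative, here is a minimal sketch that lets `read_csv` do that conversion at load time through its `na_values` parameter. The two inline rows are illustrative stand-ins in the same 16-column layout, and the `?` entries in them are made up for the demonstration.

```python
import io
import pandas as pd

# Two illustrative rows in the 16-column layout of the dataset; the '?'
# entries are hypothetical placeholders, not rows copied from the file.
sample = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "?,?,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,-\n"
)

# na_values='?' turns every '?' into NaN while parsing
demo = pd.read_csv(sample, header=None, na_values='?')

print(demo.isnull().sum().sum())  # total number of missing cells
print(demo[1].dtype)              # dtype of column 1 (Age)
```

Parsing the placeholders up front also lets pandas infer a numeric dtype for columns that would otherwise be read as text.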


**2. Inspecting the data:**  
The likely features in a typical credit card application are: <br>
- *Gender*,
- *Age*, 
- *Debt*, 
- *Married*, 
- ...
- *ApprovalStatus*  

Before preprocessing, let's learn a bit more about the dataset.

In [2]:
df.columns = ["Gender", "Age", "Debt", 
            "Married", "BankCustomer", 
            "EducationLevel", "Ethnicity", 
            "YearsEmployed", "PriorDefault", 
            "Employed", "CreditScore",
             "DriversLicense", "Citizen", 
             "ZipCode", "Income", "ApprovalStatus"]
df.columns

Index(['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
       'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
       'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus'],
      dtype='object')

In [3]:
# summary stats
df_describe = df.describe()
print(df_describe,'\n')

# info() prints its summary itself and returns None, so we call it directly
df.info()

df.tail()

             Debt  YearsEmployed  CreditScore         Income
count  690.000000     690.000000    690.00000     690.000000
mean     4.758725       2.223406      2.40000    1017.385507
std      4.978163       3.346513      4.86294    5210.102598
min      0.000000       0.000000      0.00000       0.000000
25%      1.000000       0.165000      0.00000       0.000000
50%      2.750000       1.000000      0.00000       5.000000
75%      7.207500       2.625000      3.00000     395.500000
max     28.000000      28.500000     67.00000  100000.000000 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    object 
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260,0,-
686,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200,394,-
687,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280,750,-
689,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0,0,-
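The `info()` summary lists *Age* as `object` rather than a float, which is what happens when a numeric column contains text placeholders. A small sketch of one way to recover a numeric column, assuming the non-numeric entries are `?` placeholders; the toy Series below is a stand-in, not data from the file.

```python
import pandas as pd

# Toy stand-in for the Age column: numbers stored as text plus a '?' placeholder
age = pd.Series(['30.83', '58.67', '?', '24.5'], name='Age')

# errors='coerce' converts entries that cannot be parsed (here '?') to NaN
age_numeric = pd.to_numeric(age, errors='coerce')

print(age_numeric.dtype)
print(age_numeric.isnull().sum())
```

Once the column is numeric, it can be imputed with a mean and scaled like the other numeric features instead of being one-hot encoded value by value.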


**3. Splitting the dataset:**  
We split the data into a training set and a test set, one for each phase of machine learning modelling: fitting the model and evaluating it.  
<br>
Also, features like *ZipCode* and *DriversLicense* are not as important as the other features for predicting approval, so we drop them.  

In [4]:
from sklearn.model_selection import train_test_split

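# drop the features we consider unimportant
df = df.drop(['DriversLicense', 'ZipCode'], axis=1)

# split the df; an integer test_size is an absolute row count,
# so test_size=3 holds out only 3 rows (pass a float such as 0.33 for a fraction)
df_train, df_test = train_test_split(df, test_size=3, random_state=1)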
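To make the meaning of `test_size` concrete, here is a short sketch on a toy frame (100 made-up rows, not the credit card data). An integer is read as a number of rows, while a float is read as a fraction of the dataset; the value 0.25 is only an illustrative choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows standing in for the real dataset
toy = pd.DataFrame({'x': range(100), 'y': [0, 1] * 50})

# Integer test_size: absolute number of test rows
train_a, test_a = train_test_split(toy, test_size=3, random_state=1)

# Float test_size: fraction of the rows (0.25 is an illustrative choice)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=1)

print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

A test set of only a few rows gives a very coarse accuracy estimate, so a fractional split is usually the safer default.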

**4. Handling missing values:**
- The missing values in the dataset are labeled with '?', so we first replace them with NaN and then impute them.  
- The dataset also contains values from several ranges: some features range over 0-28, some over 0-67, and some over 0-100000.  
Apart from these, the summary statistics above give us useful information (like *mean*, *median*, *max* etc.)

In [5]:
import numpy as np 

# replace the '?' placeholders with NaN in both sets
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df_train = df_train.replace('?', np.nan)
df_test = df_test.replace('?', np.nan)

df_train.tail()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,Citizen,Income,ApprovalStatus
144,b,27.25,1.665,u,g,cc,h,5.085,t,t,9,g,827,+
645,b,37.33,2.665,u,g,cc,v,0.165,f,f,0,g,501,-
72,a,38.58,5.0,u,g,cc,v,13.5,t,f,0,g,0,-
235,a,20.67,1.835,u,g,q,v,2.085,t,t,5,g,2503,+
37,a,23.0,11.75,u,g,x,h,0.5,t,t,2,g,551,+


In [14]:
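# Impute missing numeric values with the training-set mean
# (numeric_only=True skips the text columns; the test set reuses the training means)
df_train.fillna(df_train.mean(numeric_only=True), inplace=True)
df_test.fillna(df_train.mean(numeric_only=True), inplace=True)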

print(df_train.isnull().sum())
print(df_test.isnull().sum())

Gender            11
Age               12
Debt               0
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
Citizen            0
Income             0
ApprovalStatus     0
dtype: int64
Gender            1
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64




In [6]:
# Iterate over the columns
for col in df_train.columns:
    # Check if the column is of object type
    if df_train[col].dtypes == 'object':
        # Impute this column only, using its most frequent training value
        most_frequent = df_train[col].value_counts().index[0]
        df_train[col] = df_train[col].fillna(most_frequent)
        df_test[col] = df_test[col].fillna(most_frequent)

# Count the number of NaNs in the dataset and print the counts to verify
print(df_train.isnull().sum())
print(df_test.isnull().sum())

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
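The two imputation steps (mean for numeric columns, most frequent value for text columns) can also be sketched as one helper that always takes its statistics from the training set. The function name `impute_with_train_stats` and the toy frames are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def impute_with_train_stats(train, test):
    """Fill NaNs column by column using statistics computed on `train` only."""
    train, test = train.copy(), test.copy()
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill_value = train[col].mean()                    # numeric: mean
        else:
            fill_value = train[col].value_counts().index[0]   # text: most frequent
        train[col] = train[col].fillna(fill_value)
        test[col] = test[col].fillna(fill_value)
    return train, test

# Toy frames (made-up values) with one missing entry per column
toy_train = pd.DataFrame({'Gender': ['a', 'b', 'b', np.nan],
                          'Debt': [1.0, 3.0, np.nan, 5.0]})
toy_test = pd.DataFrame({'Gender': ['a', np.nan],
                         'Debt': [np.nan, 2.0]})

filled_train, filled_test = impute_with_train_stats(toy_train, toy_test)
print(filled_train)
print(filled_test)
```

Filling each column separately keeps one column's fill value from spilling into the others, and reusing the training statistics keeps information from the test set out of the preprocessing.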


**5. Preprocessing the data:**  
The missing values are now handled.  
<br>
Some minor preprocessing remains:  
1. Convert the non-numeric data into numeric.  
1. Scale the feature values to a uniform range.  

In [7]:
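# one-hot encode the train and test sets independently
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

# reindex the test columns to match the training columns;
# categories that never occur in the test set become all-zero columns
df_test = df_test.reindex(columns=df_train.columns, fill_value=0)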
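Why the reindex step matters is easiest to see on a toy example. In the sketch below (made-up values, with column names borrowed from the dataset), the test frame contains only one of the three categories seen in training, so encoding it on its own yields fewer columns.

```python
import pandas as pd

# Toy frames: the test set sees only one of the three training categories
toy_train = pd.DataFrame({'Citizen': ['g', 's', 'p'], 'Income': [0, 560, 824]})
toy_test = pd.DataFrame({'Citizen': ['g', 'g'], 'Income': [3, 0]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)
print(list(test_enc.columns))   # only the categories present in the test rows

# Align the test columns to the training layout; unseen categories become 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

After the reindex, both frames have identical columns in identical order, which is what the scaler and the model rely on.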

Now we are left with one final preprocessing step, scaling, before we can fit a machine learning model to the data.  
<br>
Let's try to understand what the scaled values mean in the real world, using CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. Since we're rescaling all the values to the range 0-1, a CreditScore of 1 is the highest.

In [9]:
from sklearn.preprocessing import MinMaxScaler

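# Segregate features and labels into separate variables.
# get_dummies splits ApprovalStatus into two columns; both are excluded from
# the features so the label does not leak into X
label_cols = ['ApprovalStatus_+', 'ApprovalStatus_-']
X_train, y_train = df_train.drop(columns=label_cols).values, df_train['ApprovalStatus_-'].values
X_test, y_test = df_test.drop(columns=label_cols).values, df_test['ApprovalStatus_-'].values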

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
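To see what the scaler does to a single feature, here is a small sketch that applies the min-max formula `(x - min) / (max - min)` by hand and compares it with `MinMaxScaler`. The four CreditScore values are illustrative picks; only the endpoints 0 and 67 come from the summary statistics earlier in the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few CreditScore values spanning the observed range 0-67
credit_score = np.array([[0.0], [1.0], [6.0], [67.0]])

# Manual min-max scaling: (x - min) / (max - min)
lo, hi = credit_score.min(), credit_score.max()
manual = (credit_score - lo) / (hi - lo)

# The same transformation via scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(credit_score)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The largest observed score maps to 1 and the smallest to 0, so a scaled value describes where an applicant sits within the observed range.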

**6. Fitting Machine Learning Model:**  
This is a classification task, so I'll use a classification model such as _Logistic Regression_.  
  
At this point you may be asking yourself: _Which model should I pick?_  
A good starting question is whether the features that affect the approval decision are correlated with each other. Although we can measure correlation, that is outside the scope of this notebook, so for now we'll rely on our intuition that they are indeed correlated. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [11]:
from sklearn.linear_model import LogisticRegression

# instantiate a LogisticRegression with default parameters.
logreg = LogisticRegression()

# fit the model; ravel() flattens the labels to the 1-D shape scikit-learn expects
logreg.fit(rescaledX_train, y_train.ravel())



**7. Making predictions and evaluating performance:**  
How well does our model perform?  
  
We will now evaluate our model on the test set with respect to [classification accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy).  
We will also take a look at the model's [confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/).

In [13]:
from sklearn.metrics import confusion_matrix

# make some predictions
y_pred = logreg.predict(rescaledX_test)

# get the accuracy score from logreg model
print(f'Accuracy of Logistic Regression Classifier:{logreg.score(rescaledX_test, y_test)}')

# show the confusion matrix of the predictions (rows: true class, columns: predicted class)
confusion_matrix(y_test,y_pred)

Accuracy of Logistic Regression Classifier:1.0


array([[1, 0],
       [0, 2]], dtype=int64)
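The accuracy can be read off the confusion matrix directly: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count. A minimal sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix as printed above: rows are true classes, columns are predictions
cm = np.array([[1, 0],
               [0, 2]])

# Correct predictions are on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Here the diagonal holds all 3 test samples, which matches the reported accuracy of 1.0. With a test set this small, the figure says little about how the model would perform on new applications.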