## Introduction
Numerous online lending platforms have emerged in recent years, offering loans to businesses much as banks
do. Like banks, they face the risk that borrowers will default, a risk that directly affects the
sustainability and healthy development of the platforms.
Calculating and predicting credit risk is therefore essential. Evaluating an individual’s financial
information and historical data is central to predicting whether they will default on a loan.


Credit scorecards are a common risk-control method in the financial industry. They use the personal information
and data submitted by credit card applicants to predict the probability of future defaults and credit card
borrowing. Based on that prediction, the bank or lending company can decide whether to issue a credit card to
the applicant. Credit scores thus quantify the magnitude of risk objectively.
This project focuses on predicting whether an applicant's credit card application will be approved.

## Introduction to the Dataset

We'll use the [Credit Approval](http://archive.ics.uci.edu/ml/datasets/credit+approval) dataset from the UCI Machine Learning Repository. Because the data is confidential, the contributor of the dataset anonymized the feature names.
The workflow of this notebook is as follows:

- First, we will load and view the dataset.
- Next, we will analyze the dataset.
- We will handle the missing entries.
- We will preprocess the dataset so that the machine learning model we choose can make good predictions.
- Once the data is cleaned and prepared, we will do some exploratory data analysis to improve our understanding.
- Finally, we will build a machine learning model that predicts whether an individual's credit card application will be approved.


In [1]:
import pandas as pd
import numpy as np


The features of this dataset have been anonymized to protect privacy, but this [blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) gives a good overview of the probable features. The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, and finally ApprovalStatus.

In [4]:
credit_card_app = pd.read_csv('Data/credit_approval.data', header=None)
credit_card_app.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
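
The integer column labels in the `info()` output come from loading the file with `header=None`. If the probable feature names listed earlier are accepted as a working assumption, they can be attached at load time. The sketch below is illustrative only: `probable_names`, `sample`, and `named_app` are hypothetical names, the two rows are made up to mimic the 16-column layout, and the mapping of names to columns is a guess based on the blog post, not something the dataset confirms.

```python
import io

import pandas as pd

# Probable feature names, in column order (an assumption: the real names are anonymized).
probable_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two made-up rows in the same header-less, 16-column layout (not real records).
sample = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,00202,0,+\n"
    "a,?,4.46,u,g,q,h,3.04,t,t,6,f,g,00043,560,-\n"
)

# With the real file, `sample` would be replaced by 'Data/credit_approval.data'.
named_app = pd.read_csv(sample, header=None, names=probable_names)

print(named_app.columns.tolist())
print(named_app[["Age", "Income", "ApprovalStatus"]])
```

Named columns make later steps easier to read, but the labels should be treated as convenient aliases rather than verified facts about the data.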


In [12]:
credit_card_app.describe(include='all')

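| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 690 | 690 | 690.0 | 690 | 690 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690.0 | 690 |
| unique | 3 | 350 | | 4 | 4 | 15 | 10 | | 2 | 2 | | 2 | 3 | 171.0 | | 2 |
| top | b | ? | | u | g | c | v | | t | f | | f | g | 0.0 | | - |
| freq | 468 | 12 | | 519 | 519 | 137 | 399 | | 361 | 395 | | 374 | 625 | 132.0 | | 383 |
| mean | | | 4.758725 | | | | | 2.223406 | | | 2.4 | | | | 1017.385507 | |
| std | | | 4.978163 | | | | | 3.346513 | | | 4.86294 | | | | 5210.102598 | |
| min | | | 0.0 | | | | | 0.0 | | | 0.0 | | | | 0.0 | |
| 25% | | | 1.0 | | | | | 0.165 | | | 0.0 | | | | 0.0 | |
| 50% | | | 2.75 | | | | | 1.0 | | | 0.0 | | | | 5.0 | |
| 75% | | | 7.2075 | | | | | 2.625 | | | 3.0 | | | | 395.5 | |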
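
One detail in this summary is worth noting: the most frequent value in column 1 is `?` (12 occurrences), even though `info()` reported 690 non-null entries for every column. That suggests missing entries may be encoded as the string `?` rather than as true nulls. The following is a minimal sketch of how such placeholders could be counted and converted, assuming `?` really is the missing-value marker. The frame `toy` is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Small made-up frame that mimics the layout: '?' stands in for a missing entry.
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.83", "?", "24.50", "?"],
    2: [0.0, 4.46, 0.5, 1.54],
})

# Count '?' placeholders per column; isna() alone would report zero here.
question_marks = (toy == "?").sum()
print(question_marks)

# Replace the placeholder with NaN so pandas recognizes the gaps.
toy_clean = toy.replace("?", np.nan)
print(toy_clean.isna().sum())

# With the placeholder gone, column 1 can be converted to a numeric dtype.
toy_clean[1] = pd.to_numeric(toy_clean[1])
print(toy_clean.dtypes)
```

The same pattern would apply to `credit_card_app`. Alternatively, `pd.read_csv` accepts an `na_values="?"` argument that performs the conversion while loading.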
