# <div style="text-align: center"> Machine Learning Project - Classify Income Basis in Mortgage Data </div>
### <div style="text-align: center"> Edward Sims </div>
<div style="text-align: center"> supervised, machine-learning, classification, time-series, mortgage </div>


------------------------

## Introduction 

Here's the situation. I have gained access to some UK mortgage data. In particular, there is a variable `income_basis` that tells us whether the mortgage was taken out by one person *(sole application)* or by more than one applicant *(joint application)*. It's a really useful indicator of housing affordability, and one I'd like to explore further in the data analysis. 

The data are a monthly time series running from Jan-2013 to Dec-2017. However, for unexplained reasons, a lot of variables are missing from the 2016 data, `income_basis` being one of them. 

The dataset contains lots of features, such as gross income, purchase price, loan amount, initial interest rate, coordinates and number of bedrooms. After tidying the dataset there are 2,000,000+ records in total. For the sake of the project, I've skipped explaining the lengthy data processing, wrangling and feature engineering sections so we can go straight to the fun part:

**In this project I attempt to create a model that can accurately classify the income basis (*sole or joint application*), and predict the income basis for the missing 2016 data.**

The project is beginner-friendly, and I will cover in detail a lot of the basics as I go through. I hope readers find it useful!

If you have any suggestions, corrections or opinions, do let me know :)

*PLEASE NOTE: DATA FOR THIS PROJECT ARE NOT AVAILABLE TO SHARE DUE TO MY AGREEMENT WITH THE DATA SOURCE.* I'm really sorry about this!

## Initial thoughts

It's useful to critically engage with the problem before diving into a solution too quickly. If time is limited, knowing the potential issues your model may face, and consequently which model will perform best, could make a huge difference. 

The first difficulty that comes to mind is the time-series element, because the date is a significant factor whose effect can be random and hard to quantify (e.g. seasonality of house purchases, weather on a given day). With this in mind, *overfitting* our model is a potential risk. Overfitting occurs when our model learns the noise in our data too well, and treats it as a genuine pattern. For example, a good summer might lead to people buying more expensive houses, but it does not mean that the next summer will be the same.

Also, there are lots of other unseen features (variables) that would be a factor in taking out a mortgage as a sole or joint applicant (e.g. inheritance, monetary gifts and family loans) that we do not have data for to include in the model. This could lead to *underfitting* the model. Underfitting occurs when the model is too simple, or is missing informative features, to capture the underlying pattern, so it performs poorly on the data it was trained on as well as on new data. 
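
To make the two failure modes concrete, here is a minimal sketch. It assumes scikit-learn is available and uses synthetic data with deliberately noisy labels (not the mortgage data); the tree depths are arbitrary choices for illustration. A very shallow tree tends to underfit, while a tree with no depth limit memorises the training rows and does noticeably worse on unseen rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the mortgage data). flip_y adds label noise
# that a good model should not try to learn.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until it memorises the data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train accuracy={results[depth][0]:.3f}, "
          f"test accuracy={results[depth][1]:.3f}")
```

A large gap between training and test accuracy is the usual sign of overfitting; low accuracy on both is the usual sign of underfitting.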

There is a danger of relying too much on income-related features to predict `income_basis` though - just because income/purchase price is higher, doesn't mean it is a joint application. The model will need to reconcile this; perhaps using features like number of bedrooms will help to do so. With this in mind, it is unlikely that there will be clear, defined clusters for sole and joint applications. It's more likely that the clusters will be overlapping. We'll see how our model deals with this.

So we can assume the following about our model so far: 
 1. The model will likely be non-linear.
 2. The model will likely be complex, rather than simple.
 
We're going to try a number of different classification machine learning methods and evaluate the best one for this particular problem. Let's take a look at our contenders. 
 
##  Classification models

Here are the options we'll consider for our project, with some basic explanation:

1. Naive Bayes (Generative Learning Model)
2. Logistic Regression (with Stochastic Gradient Descent)
3. Support Vector Classification (SVC)
4. Decision Trees (and Gradient Boosted Trees)
5. Random Forest
6. Nearest Neighbour
7. Neural Network

### 1. Naive Bayes
A simple but surprisingly powerful algorithm for predictive modeling. Based on Bayes' theorem, it performs particularly well with binary categorical input values. The Naive Bayes classifier assumes that the features are independent of one another within each class (hence "naive"), but despite this it can still be very effective even in complex predictions. 
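
As a rough sketch of the idea, the snippet below hand-rolls a Naive Bayes classifier for binary features using NumPy. The two features, the eight rows and the helper names (`fit_bernoulli_nb`, `predict_proba`) are all made up for illustration and are not taken from the mortgage data.

```python
import numpy as np

# Hypothetical binary features: [has_second_income, three_or_more_bedrooms]
X = np.array([[1, 1], [1, 0], [1, 1], [0, 1],   # joint applications
              [0, 0], [0, 1], [0, 0], [1, 0]])  # sole applications
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])          # 1 = joint, 0 = sole

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class), with Laplace smoothing."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    probs = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, priors, probs

def predict_proba(x, classes, priors, probs):
    """Bayes' theorem, multiplying per-feature likelihoods as if independent."""
    likelihood = np.prod(np.where(x == 1, probs, 1 - probs), axis=1)
    joint = priors * likelihood
    return joint / joint.sum()   # normalise so the class probabilities sum to 1

classes, priors, probs = fit_bernoulli_nb(X, y)
posterior = predict_proba(np.array([1, 1]), classes, priors, probs)
print(dict(zip(classes.tolist(), posterior.round(3).tolist())))
```

The "naive" step is the `np.prod` line: each feature's likelihood is simply multiplied in, ignoring any relationship between the features.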

### 2. Logistic Regression (with Stochastic Gradient Descent)
Logistic Regression is very often used for binary (two-class) classification problems. The method is easy to implement and understand, and so is a popular choice for many data scientists. Stochastic Gradient Descent (SGD) is the method that we use to optimise the regression model. This may be a strong performer for our specific problem, so long as the data is not too complex. 
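
To show what "optimising with SGD" means in practice, here is a minimal NumPy sketch. The data are synthetic, and the learning rate and number of epochs are arbitrary assumptions rather than tuned values. The key point is that the weights are nudged after looking at one row at a time, instead of after a pass over the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-feature data (illustration only, not the mortgage data)
n = 500
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
learning_rate = 0.1

for epoch in range(20):
    for i in rng.permutation(n):           # visit the rows in a random order
        p = sigmoid(X[i] @ w + b)          # predicted P(y = 1) for this one row
        error = p - y[i]                   # gradient of the log loss w.r.t. the score
        w -= learning_rate * error * X[i]  # update the weights using this row only
        b -= learning_rate * error

accuracy = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print(f"weights={w.round(2)}, bias={b:.2f}, training accuracy={accuracy:.3f}")
```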

### 3. Support Vector Classification (SVC)
Support Vector Classification plots each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that best separates the two classes. The reason this method is effective is that it uses only the points closest to the boundary between the classes (called "Support Vectors") to place the hyper-plane, ignoring noise from the rest of our data. This method can be particularly effective, but is generally outperformed by Gradient Boosted Trees or Random Forests on structured data. 
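
A small sketch of this, assuming scikit-learn and two synthetic, well-separated clusters (illustrative only). With a linear kernel, the fitted model exposes which training points ended up as support vectors, and there are typically far fewer of them than there are rows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic clusters centred at (-2, -2) and (2, 2)
X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=2.0, scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the points nearest the boundary are kept as support vectors
n_support = len(clf.support_vectors_)
print(f"{n_support} of {len(X)} points are support vectors")
print("hyper-plane weights:", clf.coef_[0].round(2),
      "intercept:", clf.intercept_[0].round(2))
```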

### 4. Decision Trees (and Gradient Boosted Trees)
Decision trees answer sequential questions that send us down a certain route of the tree given the answer. These questions are often binary, "Yes or No" type questions. You answer one question, and then get directed to the next question and so on, until you end up with your prediction. These are useful as they are simple, scalable and easy to understand/explain. On the other hand, they can be prone to overfitting. Gradient Boosted Trees (GBT) serve to improve this by producing an "ensemble" of weak predicting models. The trees are built one after another, each one fitted to correct the errors of the trees before it, and adding up the contributions of all these weak trees gives a much better result! At the start of this project, I would anticipate a GBT to be the best model for our problem. 
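
One way to see the ensemble effect is to track accuracy as trees are added. The sketch below assumes scikit-learn and synthetic data; the settings (`max_depth=1`, 100 trees, a learning rate of 0.1) are illustrative choices, not the configuration used later in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the mortgage data)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth=1 makes every tree a deliberately weak one-question "stump"
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                 learning_rate=0.1, random_state=0)
gbt.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_acc = [accuracy_score(y_test, pred) for pred in gbt.staged_predict(X_test)]
for n_trees in (1, 10, 100):
    print(f"{n_trees:>3} trees: test accuracy = {staged_acc[n_trees - 1]:.3f}")
```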

### 5. Random Forest
One of the most used algorithms due to its ease and simplicity. Essentially, random forests (a type of "ensemble", or "bagging" method) build numerous decision trees on random samples of the data, and then merge them all together to try and produce a more accurate prediction. This helps to reconcile the risks of overfitting that decision trees have. There are few parameters to tune manually, so it can be very time-effective. I'm expecting this to be a close second. 
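
The "build many trees, then merge them" idea can be sketched by hand. The code below is a simplified illustration that assumes scikit-learn's `DecisionTreeClassifier` and synthetic data; in practice scikit-learn's `RandomForestClassifier` packages these steps for you. The number of trees (25) is an arbitrary odd number so that votes cannot tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the mortgage data)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
all_votes = []
for seed in range(n_trees):
    # Bagging: sample training rows with replacement, so each tree sees different data
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X_train[rows], y_train[rows])
    all_votes.append(tree.predict(X_test))

# Merge the trees: each test row gets the class that most trees voted for
votes = np.array(all_votes)                        # shape (n_trees, n_test_rows)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = (votes[0] == y_test).mean()
forest_acc = (forest_pred == y_test).mean()
print(f"first tree alone: {single_acc:.3f}, majority vote of {n_trees}: {forest_acc:.3f}")
```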

### 6. Nearest Neighbour

Nearest Neighbour, or K-Nearest Neighbour, does what its name suggests. It looks at the classes of the K nearest points and predicts the class of a particular point by majority vote. This is a very simple, easy to understand method that doesn't make any assumptions about the data, which can be useful for non-linear data. It can be, however, computationally expensive and use a great deal of memory, meaning it may not be totally scalable. 
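
The whole method fits in a few lines. This is a minimal NumPy sketch using Euclidean distance and six made-up points; `knn_predict` is a hypothetical helper name, not a library function.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = np.bincount(y_train[nearest])                # count the votes for each class
    return int(np.argmax(votes))

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([2.0, 2.0]))
pred_b = knn_predict(X_train, y_train, np.array([6.0, 5.0]))
print(pred_a, pred_b)
```

Note that every prediction computes a distance to every training row, which is where the memory and compute cost comes from.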

### 7. Neural Network
Neural Networks have gained a lot of attention recently due to their ability to process large amounts of complex, unstructured data (like images or speech) and accurately make predictions. Designed to replicate the way that human brains make decisions, data is passed through multiple "layers" of neurons, all with different weights and biases, before they reach their output. These can be highly effective, but are suited to complex data problems and sometimes don't perform as well when this isn't the case. There are issues with scalability too, but these questions will be discussed more later in the project. 
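
To illustrate what "passing data through layers" looks like, here is a bare-bones forward pass in NumPy. The layer sizes are arbitrary and the weights are random and untrained, so the outputs are meaningless as predictions; the sketch only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialised (untrained) parameters: 4 inputs -> 8 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(X):
    """Pass each row through the hidden layer, then squash the output into (0, 1)."""
    hidden = relu(X @ W1 + b1)                 # weighted sum + bias, then activation
    return sigmoid(hidden @ W2 + b2).ravel()   # one probability-like score per row

X = rng.normal(size=(5, 4))   # five made-up rows with four features each
probs = forward(X)
print(probs.round(3))
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that these outputs match the known labels.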

## Performance metrics

So how do we know which model is the best one for us? There are a number of ways to measure the performance of a model, and we need to pick the most relevant one for our particular problem. Performance metrics for classification are covered in more detail [in this blog](https://medium.com/greyatom/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b), but we'll cover the basics. 

The first step we take for evaluating a classification model is creating a **confusion matrix**. All it does is place your predicted results against the actual results to see where your model was predicting well and where it wasn't. Each predicted value is either *Positive* or *Negative*, and it is then labelled *True* or *False* depending on whether it matches the actual value. 

Copying my favourite example of *The Boy Who Cried Wolf*, [from Google Developers](https://developers.google.com/machine-learning/crash-course/), the four outcomes are:

![Confusion Matrix 1](Confusion matrix 1.png)
*Source: [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative), Google Developers*

Another, more generalised, diagram of a confusion matrix is shown below: 

![Confusion Matrix 2](Confusion matrix 2.png)
*Source: [Sanyam Kapoor](https://www.sanyamkapoor.com/machine-learning/confusion-matrix-visualization/)*
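
The four cells of the matrix can be counted directly. The labels below are made up for illustration, with 1 meaning "wolf" (positive) and 0 meaning "no wolf" (negative).

```python
# Made-up example: what actually happened vs. what the boy said
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # wolf came, boy cried "wolf"
tn = sum(a == 0 and p == 0 for a, p in pairs)  # no wolf, boy stayed quiet
fp = sum(a == 0 and p == 1 for a, p in pairs)  # no wolf, boy cried "wolf"
fn = sum(a == 1 and p == 0 for a, p in pairs)  # wolf came, boy stayed quiet

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```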


### Accuracy

The go-to method to evaluate performance. Using our example, accuracy tells us **how many times the boy was correct, out of the total number of times he predicted anything**. This is a good general indicator when the target variable classes in the data are fairly balanced. We'll likely look at this measure most. 

$$\frac{TP + TN}{TP + TN + FP + FN}$$
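
A quick worked example of the formula, using made-up counts. The second case shows why balance matters: if the wolf comes on only 1 day in 100, a boy who never cries "wolf" is still 99% accurate.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Fairly balanced classes (hypothetical counts)
balanced = accuracy(tp=40, tn=45, fp=5, fn=10)

# Imbalanced classes: 1 wolf day in 100, and the boy never cries "wolf"
never_cries = accuracy(tp=0, tn=99, fp=0, fn=1)

print(balanced, never_cries)
```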

### Recall

Recall is the **proportion of times there was a wolf, and the boy cried "wolf"**. For this scenario, recall is a helpful measure as it puts emphasis on the false negatives (the times the wolf came and the boy did not say anything). Given that the downside risk of NOT crying "wolf" is super high (the village gets destroyed by a pack of wolves), you will want to pay attention to this metric. Here's the formula:

$$\frac{TP}{TP + FN}$$

The downside, though, is that it can give a biased view. Say the boy just yelled "wolf" all day. Even if the wolf turned up only once out of all those times, his recall would still be 100%. Thumbs up to him, but is 100% really a satisfactory reflection of how good his system is? The town would probably hate him, and not trust him every time he cried "wolf" (hence the moral of the story - Aesop was really telling a fable about machine learning).
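
Plugging the "yelling all day" scenario into the formula, assuming for illustration that the boy cries "wolf" 100 times and the wolf turns up for exactly one of them:

```python
def recall(tp, fn):
    """Of all the times there really was a wolf, how often did the boy cry "wolf"?"""
    return tp / (tp + fn)

# 100 cries of "wolf": 1 true positive, 99 false positives, no missed wolves
tp, fp, fn = 1, 99, 0
always_wolf_recall = recall(tp, fn)
print(always_wolf_recall)
```

The 99 false alarms never enter the calculation, which is why recall on its own can flatter a model.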

### Precision

Using our example again, this is the **proportion of times the boy cried "wolf" when there was actually a wolf**. Like in the above, if the boy yells "wolf" 100 times in one day, and the wolf only turns up one of those times, his precision will be 1%. Now this is probably a better way to assess his system, and would indicate that maybe it's not very good. But this is where we need to think about our objective. Is it worth trying to improve the precision of the model at the potential cost of our recall? Flip our example around - if the wolf came 100 times, and the boy shouted "wolf" on only one of those occasions, he would technically have 100% precision. This is even worse than before though, because while the boy can pat himself on the back with his 100% precision trophy, the rest of the villagers have been killed by wolves, so not ideal. 

This measure puts emphasis on the false positives. It can be a good performance metric for spam filters, where there is more downside risk to *wrongly* classifying an email as spam. Here's the formula:

$$\frac{TP}{TP + FP}$$
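
Putting both scenarios from the story side by side makes the trade-off visible. The counts are the illustrative ones used in the text.

```python
def precision(tp, fp):
    """Of all the times the boy cried "wolf", how often was there a wolf?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all the times there really was a wolf, how often did the boy cry "wolf"?"""
    return tp / (tp + fn)

scenarios = {
    "cries wolf 100 times, wolf comes once": {"tp": 1, "fp": 99, "fn": 0},
    "wolf comes 100 times, boy cries once":  {"tp": 1, "fp": 0, "fn": 99},
}

scores = {}
for name, c in scenarios.items():
    scores[name] = (precision(c["tp"], c["fp"]), recall(c["tp"], c["fn"]))
    print(f"{name}: precision={scores[name][0]:.2f}, recall={scores[name][1]:.2f}")
```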

To get the best picture of our model's performance, we'll use all three of these metrics. 

### Scalability

We mentioned how it's important to remember your objective when choosing a performance metric. As data scientists, we're usually doing our research to support business or decision-making, and so resources are limited (whether it's time, money, computing power or any others). **Scalability is one of the most important performance metrics to look at when choosing a model.** We can build the most accurate model possible, but if that model requires an obscene amount of GPU and takes days to process, then perhaps a faster, more cost-effective model is required. This trade-off is common, and it's up to you to decide which route to take. Whatever you choose, be prepared to justify your decision to your stakeholders.
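
A simple way to get a feel for scalability is to time the same operation on growing amounts of data. The sketch below times a brute-force nearest-neighbour lookup in NumPy on synthetic rows; the sizes are arbitrary and the timings will differ from machine to machine, so treat it as a measuring pattern rather than a benchmark.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=5)   # one made-up row we want a prediction for

timings = {}
for n_rows in (1_000, 10_000, 100_000):
    X_train = rng.normal(size=(n_rows, 5))
    start = time.perf_counter()
    # Brute-force nearest neighbour: one distance calculation per training row
    nearest = np.argmin(np.linalg.norm(X_train - query, axis=1))
    timings[n_rows] = time.perf_counter() - start
    print(f"{n_rows:>9,} rows: {timings[n_rows] * 1000:.3f} ms for one prediction")
```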

### Understandability 

For lack of a better word, understandability is often forgotten by many data scientists. While we may be pushing the frontiers of machine learning on behalf of our company, our stakeholders need to be able to understand what it is we're doing. Neural Networks may be a great choice for your model, but getting your stakeholders to understand how they work may be a different story. Stakeholder engagement is integral to success as a data scientist, so we'll think about this too when evaluating the best model.




In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

%matplotlib inline

os.chdir(path = "/Users/edwardsims/Data Science/Project 4 - Classifying Income Basis")

In [6]:
data = pd.read_csv("FTB data.csv", 
                  names = ['comp_date', 'purchse_price', 'gross_income',
                          'loan_amount', 'initial_gross_int_rate', 'deposit',
                          'mortgage_term', 'age_main', 'income_basis',
                          'pcd_no_space'])


In [7]:
data.columns.values

array(['comp_date', 'purchse_price', 'gross_income', 'loan_amount',
       'initial_gross_int_rate', 'deposit', 'mortgage_term', 'age_main',
       'income_basis', 'pcd_no_space'], dtype=object)