# <div style="text-align: center"> Machine Learning Project - Classify Income Basis in Mortgage Data </div>
### <div style="text-align: center"> *Edward Sims* </div>
<div style="text-align: center"> *supervised, machine-learning, classification, time-series, mortgage* </div>


------------------------

## Introduction 
Here's the situation. I have gained access to some mortgage data for the UK. In particular, there is a variable `income_basis` that tells us whether the mortgage was taken out by one person *(sole application)* or by more than one applicant *(joint application)*. It's a really useful indicator of housing affordability, and one I'd like to explore further in the data analysis.

The data are monthly time-series, going from Jan-2013 to Dec-2017. However, for unexplained reasons, loads of variables are missing from the 2016 data, `income_basis` being one of them.

The dataset contains loads of features, such as gross income, purchase price, loan amount, initial interest rate, coordinates, number of bedrooms etc. After tidying the dataset, there are ~800,000 records in total. For the sake of the project, I've skipped the lengthy data processing, wrangling and feature engineering sections so we can go straight to the fun part:

**In this project I attempt to create a model that can accurately classify the income basis (*sole or joint application*), to be applied to each month in the 2016 data.**


*PLEASE NOTE: DATA FOR THIS PROJECT ARE NOT AVAILABLE TO SHARE DUE TO MY AGREEMENT WITH THE DATA SOURCE.*

## Initial thoughts

The first difficulty that comes to mind is the time-series element, because date is a significant factor that is hard to quantify (e.g. seasonality of house purchases, weather on a given day). Also, there are other unseen features that would influence whether a mortgage is taken out as a sole or joint application (e.g. inheritance, monetary gifts and family loans), but we have no data on them to include in the model. With this in mind, overfitting our model is a potential risk. Finally, there is a danger of relying too much on income-related features to predict `income_basis` - just because income or purchase price is higher doesn't mean it is a joint application. The model will need to reconcile this.

So we can assume the following due to the nature of the dependent variable: 
 1. The model will likely be non-linear.
 2. The model will likely be slightly complex.
 
##  Classification models

Here are the options we'll consider for our project:

1. Naive Bayes (Generative Learning Model)
2. Support Vector Machine
3. Decision Trees
4. Random Forest
5. Neural Network
6. Nearest Neighbour
7. Ensemble method?

#### 1. Naive Bayes
A simple but surprisingly powerful algorithm for predictive modeling. Based on Bayes' theorem, it performs particularly well with binary categorical input values. The Naive Bayes classifier assumes that the features are independent (hence naive), but despite this it can still be very effective even in complex predictions.
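
To make the independence assumption concrete, here is a minimal pure-Python sketch of a Bernoulli Naive Bayes classifier with Laplace smoothing. It is not the model used in this project: the two binary features, the six rows and the `"sole"`/`"joint"` label strings are all made up for illustration.

```python
import math

def train_bernoulli_nb(rows, labels):
    """Estimate class priors and per-feature P(x_j = 1 | class) from 0/1 rows."""
    n_features = len(rows[0])
    priors, feature_probs = {}, {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        priors[c] = len(class_rows) / len(rows)
        # Laplace smoothing: add 1 to the count of ones and 2 to the total
        feature_probs[c] = [
            (sum(r[j] for r in class_rows) + 1) / (len(class_rows) + 2)
            for j in range(n_features)
        ]
    return priors, feature_probs

def predict_bernoulli_nb(row, priors, feature_probs):
    """Pick the class with the highest log posterior (up to a constant)."""
    best_class, best_log_prob = None, -math.inf
    for c, prior in priors.items():
        log_prob = math.log(prior)
        # "Naive" step: features are treated as independent, so logs just add up
        for x, p in zip(row, feature_probs[c]):
            log_prob += math.log(p if x == 1 else 1 - p)
        if log_prob > best_log_prob:
            best_class, best_log_prob = c, log_prob
    return best_class

# Hypothetical toy data: two made-up binary features per application
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["joint", "joint", "joint", "sole", "sole", "sole"]

priors, feature_probs = train_bernoulli_nb(rows, labels)
print(predict_bernoulli_nb((1, 1), priors, feature_probs))
print(predict_bernoulli_nb((0, 0), priors, feature_probs))
```

Working in log space avoids multiplying many small probabilities together, which matters once there are more than a handful of features.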

#### 2. Support Vector Machine
Support Vector Machines plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best separates the two classes.
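
As a rough illustration of the "which side of the hyperplane" idea, the sketch below classifies 2-D points with a hand-picked hyperplane. A real SVM would learn the weights and bias by maximising the margin between the classes; the values here (`weights`, `bias`) are assumptions chosen only so the arithmetic is easy to follow.

```python
def side_of_hyperplane(point, weights, bias):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point lies."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1

# Hypothetical hyperplane x1 - x2 = 0, i.e. the diagonal line in 2-D
weights = [1.0, -1.0]
bias = 0.0

print(side_of_hyperplane((3.0, 1.0), weights, bias))  # point below the diagonal
print(side_of_hyperplane((1.0, 3.0), weights, bias))  # point above the diagonal
```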

#### 3. Decision Trees
Decision trees answer sequential questions that send us down a certain route of the tree given the answer. These questions are often binary, such as "Male or Female?". You answer one question, get directed to the next question, and so on, until you end up with your prediction. These are useful as they are simple, scalable and easy to understand/explain. On the other hand, they can be prone to overfitting.
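
The "sequential questions" idea can be sketched as a hand-written tree. Every threshold below is invented purely for illustration, as are the `"sole"`/`"joint"` return values; a fitted decision tree would learn its own splits from the data.

```python
def toy_income_basis_tree(applicant):
    """Walk a hand-written tree; all thresholds are hypothetical."""
    if applicant["gross_income"] > 60000:       # question 1
        if applicant["age_main"] < 35:          # question 2a
            return "joint"
        return "sole"
    if applicant["loan_amount"] > 200000:       # question 2b
        return "joint"
    return "sole"

first = {"gross_income": 75000, "age_main": 29, "loan_amount": 250000}
second = {"gross_income": 30000, "age_main": 41, "loan_amount": 120000}

print(toy_income_basis_tree(first))
print(toy_income_basis_tree(second))
```

Each record follows exactly one path from the root to a leaf, which is why a single tree is so easy to explain.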

#### 4. Random Forest
One of the most used algorithms due to its ease and simplicity. Essentially, random forests build numerous decision trees and then merge them all together to try and produce a more accurate prediction. This helps to reduce the risk of overfitting that individual decision trees have.
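
The "merging" step for classification is usually a majority vote. The sketch below shows only that voting step, using three stub "trees" with made-up thresholds; in a real random forest each tree would be trained on a bootstrap sample of the data with a random subset of features.

```python
from collections import Counter

# Three hypothetical one-question "trees"; thresholds are invented
def tree_a(x):
    return "joint" if x["gross_income"] > 50000 else "sole"

def tree_b(x):
    return "joint" if x["loan_amount"] > 180000 else "sole"

def tree_c(x):
    return "joint" if x["deposit"] < 20000 else "sole"

def forest_predict(trees, x):
    """Return the class predicted by the most trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

applicant = {"gross_income": 58000, "loan_amount": 150000, "deposit": 15000}
prediction = forest_predict([tree_a, tree_b, tree_c], applicant)
print(prediction)  # two of the three trees vote "joint"
```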

#### 5. Neural Network
Neural Networks have gained a lot of attention recently due to their ability to process large amounts of complex data and accurately make predictions. Designed to replicate the way that human brains make decisions, data is passed through multiple "layers" of neurons, each with different weights and biases, before producing an output. These can be highly effective, but are suited to complex data problems and sometimes don't perform as well when this isn't the case. There are issues with scalability too, but these questions will be discussed further later in the project.
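
To show what "passing data through layers" means, here is a minimal forward pass through one hidden layer and one output neuron in plain Python. The weights, biases and inputs are arbitrary numbers chosen for the sketch, and no training happens here.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid, per neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

inputs = [0.5, -1.0]                                      # two made-up, scaled features
hidden = dense(inputs, [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])   # 2 hidden neurons
output = dense(hidden, [[0.7, -0.5]], [0.05])             # 1 output neuron

print(hidden, output)
```

The single output can be read as a score between 0 and 1 for one of the two classes; training would adjust the weights and biases so that this score matches the known labels.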

#### 6. Nearest Neighbour



## Performance metrics

There are a number of ways to measure the performance of a model, and we need to pick the most relevant one for our problem. Performance metrics for classification are covered in more detail [in this blog](https://medium.com/greyatom/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b), but we'll cover the basics.

The first step we take for evaluating a classification model is creating a **confusion matrix**. All it does is place your predicted results against the actual results to see what areas your model was predicting well in and where it wasn't. The predicted values are either *Positive* or *Negative*, and they are evaluated to *True* or *False*. 

Copying my favourite example of *The Boy Who Cried Wolf*, [from Google Developers](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative), the four outcomes are:

![Confusion Matrix 1](Confusion matrix 1.png)

A more generalised diagram of a confusion matrix is shown below: 

![Confusion Matrix 2](Confusion matrix 2.png)
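
The four cells of the confusion matrix can be counted directly from paired actual and predicted labels. The sketch below does this for six made-up nights of the wolf story; the function name `confusion_counts` and the data are illustrative only.

```python
def confusion_counts(actual, predicted, positive="wolf"):
    """Return (TP, TN, FP, FN) for two equal-length label sequences."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # cried wolf, wolf was there
        elif p != positive and a != positive:
            tn += 1          # stayed quiet, no wolf
        elif p == positive and a != positive:
            fp += 1          # cried wolf, no wolf
        else:
            fn += 1          # stayed quiet, wolf was there
    return tp, tn, fp, fn

# Hypothetical outcomes: was there a wolf, and what did the boy say?
actual    = ["wolf", "no wolf", "no wolf", "wolf",    "no wolf", "wolf"]
predicted = ["wolf", "wolf",    "no wolf", "no wolf", "no wolf", "wolf"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
```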


### Accuracy

The go-to method to evaluate performance. Using our example, accuracy tells us **how many times the boy was correct, out of the total number of times he predicted anything**. This is a good general indicator when the target variable classes in the data are fairly balanced. We'll likely look at this measure most.

$$\frac{TP + TN}{TP + TN + FP + FN}$$
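
A quick worked example of the formula, with made-up counts, also shows why balance matters: on a heavily imbalanced set, a model that never predicts the positive class can still score a high accuracy.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical, fairly balanced case: 50 positives, 50 negatives
balanced = accuracy(tp=40, tn=45, fp=5, fn=10)

# Hypothetical imbalanced case: 5 positives, 95 negatives,
# and a model that always predicts "negative"
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)

print(balanced, always_negative)
```

The second model misses every positive case, yet its accuracy is higher than the first model's, which is the reason accuracy alone is only trusted when the classes are roughly balanced.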

### Recall

Recall is the **proportion of times there was a wolf, and the boy cried "wolf"**. For this scenario, recall is a helpful measure as it puts emphasis on the false negatives (the times the wolf came and the boy did not say anything). Given that the downside risk of not crying "wolf" is super high (the village gets destroyed by a pack of wolves), you will want to pay attention to this metric. Here's the formula:

$$\frac{TP}{TP + FN}$$

The downside, though, is that it can give a biased view. Say the boy just cried "wolf" all day, every day. Yes, he would technically have a 100% recall rate, good for him, but is 100% really a satisfactory reflection of how good his system is? The town would probably hate him, and not trust him every time he cried "wolf" (hence the moral of the story - Aesop was really telling a fable about machine learning).
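
The "always cry wolf" case is easy to check numerically. In this sketch the counts are invented: 100 nights, a wolf on 5 of them, and a boy who cries "wolf" every night.

```python
def recall(tp, fn):
    """Share of actual positives that were caught."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical: he is right on the 5 wolf nights and wrong on the other 95
tp, fp, fn, tn = 5, 95, 0, 0

boy_recall = recall(tp, fn)
boy_accuracy = accuracy(tp, tn, fp, fn)
print(boy_recall, boy_accuracy)
```

Recall comes out perfect because there are no false negatives, while accuracy shows that he was wrong on 95 of the 100 nights.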

### Precision

Using our example, this is the **proportion of times the boy cried "wolf" when there was actually a wolf**. If the boy yells "wolf" once and is correct, then he will have 100% precision. Good for him, but say the wolf came another 30 times and ate all the sheep because he didn't yell "wolf". He can still tell people he technically had 100% precision, but it's clear he didn't do his job very well. This measure puts emphasis on the false positives. It can be a good performance metric for spam filters, as there isn't a huge downside cost to the false negatives (a spam email slipping into the inbox), whereas a false positive (a genuine email sent to spam) is costly.

$$\frac{TP}{TP + FP}$$
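
Plugging the scenario above into the formulas (one correct cry of "wolf", then 30 wolf visits with no warning) shows how precision and recall can disagree. The helper names are illustrative.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were caught."""
    return tp / (tp + fn)

# One true positive, no false alarms, 30 missed wolves
tp, fp, fn = 1, 0, 30

boy_precision = precision(tp, fp)
boy_recall = recall(tp, fn)
print(boy_precision, round(boy_recall, 3))
```

Precision is perfect because he never raised a false alarm, but recall is only 1 in 31, which matches the intuition that he didn't do his job very well.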



Note that crying wolf every single time earns the boy 100% recall, not 100% precision; his precision would be poor. Recall is a good measure when there is a significant cost associated with missing positives (FN), such as a model that classifies cancer patients.

The two assumptions made earlier (a non-linear, slightly complex model) will have to be reflected in our model, so before doing anything, we might expect a *Random Forest* classification model to be most successful. A *Neural Network* may be too complex and *Naive Bayes* or *Logistic Regression* may be too simple. In any case, we're going to try all classification methods to evaluate which one is most effective.

#### Machine learning classification methods include:

 1. Naive Bayes (Generative Learning Model)
 2. Logistic Regression (Predictive Learning Model)
 3. Decision Trees
 4. Random Forest
 5. Neural Network
 6. Nearest Neighbour

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

%matplotlib inline

os.chdir(path = "/Users/edwardsims/Data Science/Project 4 - Classifying Income Basis")

In [6]:
data = pd.read_csv("FTB data.csv", 
                  names = ['comp_date', 'purchse_price', 'gross_income',
                          'loan_amount', 'initial_gross_int_rate', 'deposit',
                          'mortgage_term', 'age_main', 'income_basis',
                          'pcd_no_space'])


In [7]:
data.columns.values

array(['comp_date', 'purchse_price', 'gross_income', 'loan_amount',
       'initial_gross_int_rate', 'deposit', 'mortgage_term', 'age_main',
       'income_basis', 'pcd_no_space'], dtype=object)
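
Since the goal is to fill in `income_basis` for 2016, one natural next step is to separate the rows that need predictions from the rows that can be used for training. The sketch below shows one way to do that on a tiny synthetic frame, because the real data cannot be shared. It assumes `comp_date` parses as a date and that labels look like `"sole"`/`"joint"`; neither is confirmed by the real file, and the names `toy`, `is_2016`, `to_predict` and `labelled` are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up)
toy = pd.DataFrame({
    "comp_date": ["2015-11-30", "2016-01-31", "2016-06-30", "2017-02-28"],
    "gross_income": [42000, 55000, 38000, 61000],
    "income_basis": ["sole", None, None, "joint"],
})

# Assumption: the date column is in a format pandas can parse
toy["comp_date"] = pd.to_datetime(toy["comp_date"])

# Rows from 2016 are the ones whose income basis we want to predict
is_2016 = toy["comp_date"].dt.year == 2016
to_predict = toy[is_2016]

# Everything else with a known label can be used to train and validate
labelled = toy[~is_2016 & toy["income_basis"].notna()]

print(len(labelled), len(to_predict))
```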