# <div style="text-align: center"> Machine Learning Case Study - Classifying Property Type in Residential Transaction Data </div>
### <div style="text-align: center"> Edward Sims </div>
<div style="text-align: center"> supervised, machine-learning, classification, time-series, mortgage </div>


------------------------

## 1. Introduction 

Here's the situation. I have gained access to some housing data and want to be able to predict the property type based on the features we have. The property type (`prop_type`) can be one of **four** possibilities: 
 - Flat/maisonette
 - Detached
 - Semi-detached
 - Terraced

This kind of model is useful for data where this variable is not available, or for records in our data that are missing this field. Furthermore, companies like Zoopla can use this kind of model to flag potentially erroneous values, where somebody has accidentally entered the incorrect property type. 

The dataset contains lots of *features* (another word for *variables* used in machine learning), such as purchase price, coordinates and number of bedrooms. After tidying the dataset there are ~23,000,000 records in total (~4GB). For the sake of the project, I've skipped explaining the lengthy data processing, wrangling and cleaning sections so we can go straight to the fun part: 

**In this project I attempt to create a model that can accurately classify the property type of a residential transaction.**

This report is beginner-friendly, very informal, and describes the life-cycle of a machine learning/data science project. We will use our `prop_type` problem to demonstrate the most common machine learning classification methods, evaluating their advantages and disadvantages as we go along with plenty of explanation. The aim of this report is to provide a basic overview of the most common machine learning classification methods, and to see them in action with a real world problem. 

If you have any suggestions, corrections or feedback, contact me on my GitHub, or by email at edwardsims30@yahoo.com.

## 2. Initial thoughts

It's useful to critically engage with the problem before diving into a solution too quickly. If time is limited, knowing the potential issues your model may face, and consequently which model will perform best, could make a huge difference.

What we know already:

 - This is a *classification* problem. Our prediction will be discrete (*flat/maisonette, detached, semi-detached or terraced*), so this kind of task is defined as a *machine learning classification* problem. The opposite to this would be *machine learning regression*, used for making continuous predictions. 
 
 - This is a *supervised machine learning* problem. This means we have labelled data that we can train our model on before conducting our test. We can train our model using the `prop_type` data that we **do** have (in the same way that you do practice papers from previous years to study for a test), before using the model on the data we **don't** have. This is what *supervised machine learning* is. The alternative is *unsupervised* learning, where we have no labelled training data and have to get the computer to find the clusters itself, but that will be for another day!
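
As a rough sketch of that split, assume the records arrive as a list of dictionaries in which a missing `prop_type` is stored as `None`. Only `prop_type` is a name from this report; the `bedrooms` and `price` fields and all the values are made up for illustration.

```python
# Toy records: only `prop_type` is a real field name from this report;
# the other fields and every value are invented for illustration.
records = [
    {"bedrooms": 1, "price": 150_000, "prop_type": "flat/maisonette"},
    {"bedrooms": 4, "price": 450_000, "prop_type": "detached"},
    {"bedrooms": 3, "price": 250_000, "prop_type": None},
    {"bedrooms": 2, "price": 180_000, "prop_type": "terraced"},
    {"bedrooms": 3, "price": 300_000, "prop_type": None},
]

# Labelled rows are what a supervised model learns from ...
labelled = [r for r in records if r["prop_type"] is not None]
# ... and unlabelled rows are the ones we want predictions for.
unlabelled = [r for r in records if r["prop_type"] is None]

print(len(labelled), "labelled,", len(unlabelled), "unlabelled")
```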

The first difficulty that comes to mind about our task is that even if a human were to look at the variables (like number of rooms, purchase price, location etc...) it'd be pretty hard for us to determine the property type! There can be really big flats as well as very small detached houses. With this in mind, I expect there won't be any clear clusters for all four property types, which means our model will have its work cut out for it. And there is a danger of relying too much on income-related features to predict `prop_type` - just because purchase price is higher, doesn't mean it is a detached house, for example. The model will need to reconcile this; perhaps using features like number of bedrooms or location could help to do so. 

In our housing data, there may be significant noise with regards to anomalous or above average purchase prices. With this in mind, we need to be careful not to *overfit* our data. *Overfitting* occurs when our model learns the noise from our data too well, and treats it as a genuine pattern in the data. One ridiculously expensive flat in London does not mean that flats are usually at that price level in the rest of the country. 

Also, there are lots of other unseen features (variables) that would be a factor in deciding what the property type is (e.g. number of floors, square feet etc.) that we do not have data for to include in the model. This could lead to *underfitting* the model. Underfitting occurs when the model cannot capture the underlying pattern in the data, so it performs poorly even on the data it was trained on, let alone on new data. 

So we can assume the following about our model so far: 
 - The model will likely be *non-linear*.
 - The model will likely be *complex*, rather than simple.
 
We're going to try a number of different classification machine learning methods and evaluate the best one for this particular problem. Let's take a look at our contenders. 
 
##  3. Classification models

Here are the options we'll consider for our project, with some basic explanation:

 - **Naive Bayes** (Generative Learning Model)
 - **Logistic Regression** (with Stochastic Gradient Descent)
 - **Support Vector Classification** (SVC)
 - **Decision Trees** (and Gradient Boosted Trees)
 - **Random Forest**
 - **Nearest Neighbour**
 - **Neural Network**

### Naive Bayes
A simple but surprisingly powerful algorithm for predictive modeling. Based on *Bayes' theorem*, it performs particularly well with binary categorical input values. The *Naive Bayes classifier* assumes that the features are independent (hence naive), but despite this it can still be very effective even in complex predictions. 
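
Written out as a sketch (the notation here is mine: $c$ is a candidate property type and $x_1, \dots, x_n$ are the feature values of one transaction), the independence assumption lets the classifier score each class as:

$$P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

The predicted class is simply whichever $c$ gives the largest score.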

### Logistic Regression (with Stochastic Gradient Descent)
*Logistic Regression* is mostly used for binary (two-class) classification problems. The method is easy to implement and understand, and so is a popular choice for many data scientists. *Stochastic Gradient Descent (SGD)* is the method that we use to optimise the regression model. In its basic form this model may not be appropriate for our problem, given that ours is not binary, although it can be extended to several classes (for example, by training one classifier per class). 
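
To make the idea concrete, here is a minimal sketch of binary logistic regression trained with SGD in plain Python. The one-feature data, the learning rate and the number of passes are all made up for illustration, and this is not the implementation used later in the report.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up one-feature data: negative x belongs to class 0, positive x to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0        # weight and bias, both starting at zero
learning_rate = 0.1    # arbitrary choice for this toy example

for epoch in range(200):
    # "Stochastic": update after every single sample, not after the whole batch.
    # (A real implementation would usually shuffle the samples each pass.)
    for x, y in zip(xs, ys):
        error = sigmoid(w * x + b) - y   # gradient of the log loss w.r.t. w*x + b
        w -= learning_rate * error * x
        b -= learning_rate * error

def predict_proba(x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(w * x + b)

print(round(predict_proba(2.0), 3), round(predict_proba(-2.0), 3))
```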

### Support Vector Classification (SVC)
*Support Vector Classification* plots each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that best separates the two classes. This method is effective because it uses extreme values from the edges of each class (called *"Support Vectors"*) to place the hyper-plane, ignoring noise from our data. This method can be particularly effective, but is generally outperformed by *Gradient Boosted Trees* or *Random Forests* on structured data. 

### Decision Trees (and Gradient Boosted Trees)
*Decision Trees* answer sequential questions that send us down a certain route of the tree given the answer. These questions are often binary, "Yes or No" type questions. You answer one question, and then get directed to the next question and so on, until you end up with your prediction. These are useful as they are simple, scalable and easy to understand/explain. On the other hand, they can be prone to overfitting. *Gradient Boosted Trees (GBT)* serve to improve this by producing an *"ensemble"* of weak predicting models, built one after another so that each new tree tries to correct the errors of the trees before it. Combining lots of weak *decision trees*, we get a much better result! At the start of this project, I would anticipate a *GBT* to be the best model for our problem. 
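
A single decision tree is really just nested yes/no questions. The sketch below is hand-written to show the shape of one: the function name, the two features and every threshold are invented, whereas a real tree learns its questions from the training data.

```python
def toy_tree(bedrooms, price):
    """Hand-written decision tree; the thresholds are invented, not learned."""
    if bedrooms <= 2:                 # Question 1: is it a small property?
        if price <= 200_000:          # Question 2a: is it also cheap?
            return "flat/maisonette"
        return "terraced"
    if price > 400_000:               # Question 2b: is it expensive?
        return "detached"
    return "semi-detached"

print(toy_tree(bedrooms=1, price=150_000))
print(toy_tree(bedrooms=4, price=500_000))
```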

### Random Forest
One of the most used algorithms due to its ease and simplicity. Essentially, *random forests* (a type of *"ensemble"*, or *"bagging"* method) build numerous, fully grown *decision trees*, and then merge them all together to try and produce a more accurate prediction. It's as if lots of people have made a prediction, and you average out all those predictions - you're more likely to get a better result than just one person's prediction. This helps to reconcile the risks of overfitting that *decision trees* have. There are very few parameters to tune manually, and so it can be very time-effective. *Random Forests* are indeed quite similar to *Gradient Boosted Decision Trees*, but the main difference to remember is that the former focuses on *fully grown* decision trees, while the latter focuses on *weak learners*. This difference will be described a little more later in the report. I'm expecting our *Random Forest* model to be a close second. 
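
For a classification problem like ours, "averaging the predictions" usually amounts to a majority vote across the trees. A minimal sketch of that final step, with the five tree predictions simply made up:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

# Pretend five trees each made a prediction for the same property.
tree_predictions = [
    "terraced", "semi-detached", "terraced", "flat/maisonette", "terraced",
]

print(majority_vote(tree_predictions))  # terraced
```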

### Nearest Neighbour

Or *k-Nearest Neighbour*, does what its name suggests. It looks at the class of a number of nearest points (the number of points being defined by *k*) and predicts the class of that particular point by majority vote of its "nearest neighbours". This is a very simple, easy to understand method that doesn't make any assumptions about the data, which can be useful for non-linear data. It can be, however, computationally expensive and use a great deal of memory, meaning it may not be totally scalable. 
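
Here is a minimal plain-Python sketch of the idea, using made-up points with two hypothetical features (bedrooms, and price in units of 100k). Notice that every prediction measures the distance to every training point, which is where the computational cost comes from on a dataset of our size.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # closest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (bedrooms, price in units of 100k) points and their labels.
points = [(1, 1.5), (1, 1.8), (2, 2.0), (4, 4.5), (5, 5.0), (4, 4.0)]
labels = ["flat/maisonette", "flat/maisonette", "flat/maisonette",
          "detached", "detached", "detached"]

print(knn_predict(points, labels, query=(2, 1.9)))
```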

### Neural Network
*Neural Networks* have gained a lot of attention recently due to their ability to process large amounts of complex, unstructured data (like images or speech) and accurately make predictions. Designed to replicate the way that human brains make decisions, data is passed through multiple *"layers"* of neurons, all with different *weights* and *biases*, before they reach their output. These can be highly effective, but are suited to complex data problems and sometimes don't perform as well when this isn't the case. There are issues with scalability too, but these questions will be discussed more later in the project. 

## 4. Performance metrics

So how will we know which model is the best one for us? There are a number of ways to measure the performance of a model, and we need to pick the most relevant one for our particular problem.

The first step we take for evaluating a classification model is creating a **confusion matrix**. All it does is place your predicted results against the actual results to see what areas your model was predicting well in and where it wasn't. The predicted values are either *Positive* or *Negative*, and they are evaluated to *True* or *False*. 

Copying my favourite example of *The Boy Who Cried Wolf*, [from Google Developers](https://developers.google.com/machine-learning/crash-course/), the four outcomes are:

![Confusion Matrix 1](Confusion matrix 1.png)
*Source: [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative), Google Developers*

Another, more generalised, diagram of a confusion matrix is shown below: 

![Confusion Matrix 2](Confusion matrix 2.png)
*Source: [Sanyam Kapoor](https://www.sanyamkapoor.com/machine-learning/confusion-matrix-visualization/)*
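
The four cells can be counted directly from a list of actual outcomes and a list of predictions. In this sketch `True` means "there was a wolf" in `actual` and "the boy cried wolf" in `predicted`; the values are made up.

```python
# Made-up outcomes for six nights in the village.
actual    = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

pairs = list(zip(actual, predicted))

tp = sum(a and p for a, p in pairs)            # wolf came, boy cried wolf
fn = sum(a and not p for a, p in pairs)        # wolf came, boy stayed quiet
fp = sum(not a and p for a, p in pairs)        # no wolf, boy cried wolf
tn = sum(not a and not p for a, p in pairs)    # no wolf, boy stayed quiet

print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```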


### Accuracy

The go-to method to evaluate performance. Using our example, accuracy tells us **how many times the boy was correct, out of the total amount of times he predicted anything**. This is a good general indicator when the target variable classes in the data are fairly balanced. We'll likely look at this measure most. 

$$\frac{TP + TN}{TP + TN + FP + FN}$$
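
As a quick sketch of the formula, and of why balance matters, the second call below uses made-up counts where the wolf came only once in 100 nights and the boy never cried "wolf" at all:

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Fairly balanced, made-up counts.
balanced = accuracy(tp=2, tn=2, fp=1, fn=1)

# Imbalanced case: 99 quiet nights, 1 wolf, and the boy never raised the alarm.
imbalanced = accuracy(tp=0, tn=99, fp=0, fn=1)

print(round(balanced, 3), imbalanced)
```

The second score looks excellent even though the one wolf was missed, which is why accuracy alone can mislead when the classes are unbalanced.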

### Recall

Recall is the **proportion of times there was a wolf, and the boy cried "wolf"**. For this scenario, recall is a helpful measure as it puts emphasis on the false negatives (the times the wolf came and the boy did not say anything). Given that the downside risk of NOT crying "wolf" is super high (the village gets destroyed by a pack of wolves), you will want to pay attention to this metric. Here's the formula:

$$\frac{TP}{TP + FN}$$

The downside though, is that it can give a biased view. Say if the boy just yelled "wolf" all day. Even if the wolf turned up only once out of all those times, his recall would still be 100%. Thumbs up to him, but is 100% really a satisfactory reflection of how good his system is? The town would probably hate him, and not trust him every time he cried "wolf" (hence the moral of the story - Aesop was really telling a fable about machine learning).
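
The same scenario in code, as a small sketch with the counts taken from the story above (100 cries of "wolf", one real wolf):

```python
def recall(tp, fn):
    """Share of actual wolves that the boy raised the alarm for."""
    return tp / (tp + fn)

# The boy yells "wolf" 100 times; the wolf really came once, and he caught it.
# The 99 false alarms never enter the formula.
print(recall(tp=1, fn=0))  # 1.0
```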

### Precision

Using our example again, this is the **proportion of times the boy cried "wolf" when there was actually a wolf**. Like in the above, if the boy yells "wolf" 100 times in one day, and the wolf only turns up one of those times, his precision will be 1%. Now this is probably a better way to assess his system, and would indicate that maybe it's not very good. But this is where we need to think about our objective. Is it worth trying to improve the precision of the model at the potential cost of our recall? Flip our example around - if the wolf came 100 times, and only one of those times the boy shouted "wolf", he would technically have 100% precision. This is even worse than before though, because while the boy can pat himself on the back with his 100% precision trophy, the rest of the villagers have been killed by wolves, so not ideal. 

This measure puts emphasis on the false positives. It can be a good performance metric for spam filters, where there is more downside risk to *wrongly* classifying an email as spam. Here's the formula:

$$\frac{TP}{TP + FP}$$
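
Both scenarios from the precision discussion can be checked with a short sketch (the counts come straight from the two stories):

```python
def precision(tp, fp):
    """Share of the boy's alarms that were real wolves."""
    return tp / (tp + fp)

# Scenario 1: 100 cries of "wolf", only one real wolf.
many_false_alarms = precision(tp=1, fp=99)

# Scenario 2: the wolf came 100 times, the boy cried "wolf" once (correctly).
# The 99 missed wolves are false negatives, which precision ignores.
one_correct_alarm = precision(tp=1, fp=0)

print(many_false_alarms, one_correct_alarm)  # 0.01 1.0
```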

To get the best picture of our model's performance, we'll actually use all three of these metrics. That way we can see them in action, and in future know which one will be most relevant to our model. There are two more that we'll look at too:

### Scalability

We mentioned how it's important to remember your objective when choosing a performance metric. As data scientists, we're usually doing our research to support business or decision-making, and so resources are limited (whether it's time, money, computing power or any others). **Scalability is one of the most important performance metrics to look at when choosing a model.** We can build the most accurate model possible, but if that model requires an obscene amount of GPU and takes days to process, then perhaps a faster, more cost-effective model is required. This trade-off is common, and it's up to you to decide which route to take. Whatever you choose, be prepared to justify your decision to your stakeholders.

### Understandability 

For lack of a better word, understandability is often forgotten by many data scientists. While we may be pushing the frontiers of machine learning on behalf of our company, our stakeholders need to be able to understand what it is we're doing. Neural Networks may be a great choice for your model, but getting your stakeholders to understand how they work may be a different story. Stakeholder engagement is integral to success as a data scientist, so we'll think about this too when evaluating the best model.

## 5. Data Exploration

After we are happy with our objectives and aims of the project, the next step is to import our data and take a good look at it. Before creating a model, we need to get a really good feel for our dataset by conducting a sufficient exploration on it. In this stage, we want to: 
 - Create visualisations
 - Look at mean/median/mode 
 - Minimums/maximums
 - Distributions
 - Frequencies
 - Identify noise and outliers
 - Summarise missingness
 - Answer specific questions we may have about our data. 
 
**How firm a grip you have on the data and its features almost always has a direct bearing on how well your supervised model performs.**
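
Several of the checks listed above are one-liners in pandas. The sketch below runs them on a tiny made-up frame rather than the real dataset; only `prop_type` is a column name from this report, and `bedrooms` and `price` are hypothetical.

```python
import pandas as pd

# Tiny made-up frame; only `prop_type` is a column name from this report.
toy = pd.DataFrame({
    "prop_type": ["detached", "terraced", "terraced", None, "flat/maisonette"],
    "bedrooms": [4, 2, 3, 3, 1],
    "price": [450_000, 180_000, 210_000, 250_000, 2_500_000],
})

summary = toy.describe()                        # count, mean, min/max, quartiles
frequencies = toy["prop_type"].value_counts()   # how often each class appears
missing_share = toy.isna().mean()               # fraction of missing values per column
median_price = toy["price"].median()            # less swayed by the one extreme price

print(summary, frequencies, missing_share, median_price, sep="\n\n")
```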

Let's go ahead and import our tidy data and start looking at it.

```python
import pandas as pd
import numpy as np

data = pd.read_csv('C:/Users/simse/Data Science/Classify Income Basis/Mortgage data - tidy.csv')
```

