# Motivation

Previously, we took an in-depth look at our data, cleaned it, and gained a rough understanding of which features might be important for predicting the sales price.

Now, we will build machine learning models to make our predictions.

Before we can fit a model to the data, we have to carry out some data preprocessing steps.

In this notebook, we explain why data preprocessing and data cleaning are important, and what machine learning is all about.

In [None]:
###########
# imports #
###########
import pickle # load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

##########
# models #
##########
# linear models
from sklearn.linear_model import (
    # Ordinary least squares Linear Regression
    LinearRegression
)
# tree model
from sklearn.tree import (
    # A decision tree regressor
    DecisionTreeRegressor,
    # Plot a decision tree
    plot_tree
)

In [None]:
# load the cleaned data
with open("../data/hdb_final", "rb") as f:
    df = pickle.load(f)

In [None]:
df.info()

# Take a step back

## what is machine learning

A machine learning model, in the simplest sense, predicts an outcome (y) from input data (X). There is nothing fancy or especially intelligent about this behaviour. Therefore, it is very important to fit the model with meaningful data points instead of random noise.

Let's illustrate the importance of data cleaning with a simple function: $y = 10 + 5*x$

In [None]:
# creating ground truth
_total_sample = 1000  # let's create 1000 samples
X = np.random.random(_total_sample) * 10
X = X.reshape(-1, 1)
y = 10 + 5 * X
y = y.reshape(-1, )

In [None]:
plt.plot(X, y)
plt.title("y = 10 + 5x");

Now, when x = 0, y = 10; when x = 10, y = 10 + 5 * 10 = 60. Our data is generated properly.

We can fit a simple linear regression to capture this relationship.

[sklearn](https://scikit-learn.org/stable/modules/classes.html) provides a very easy-to-use, high-level API for machine learning problems.

In [None]:
# initiate a linear regression model
lm = LinearRegression()
# fit the data
lm.fit(X, y)

We have built a "machine learning model" with 2 lines of code (provided we have very clean data). Changing the model is as simple as instantiating a different estimator class.

Let's fit a decision tree model next.

In [None]:
# initiate a tree regressor model
tree = DecisionTreeRegressor(
    # set seed to reproduce the result
    random_state=123,
    # The maximum depth of the tree
    max_depth=3
)
# fit the data
tree.fit(X, y)

Making a prediction is as simple as a single line of code.

In [None]:
# let Xi = 6, we expect the ground truth to be 10 + 5 * 6 = 40
_sample = [[6]]
print(f"Linear regression prediction: {lm.predict(_sample)}")
print(f"Tree regressor prediction: {tree.predict(_sample)}")

Both the linear regression and the tree model predict very well on this simple data. But what is really happening behind those models?

Investigating how a model arrives at its predictions is called `explainability`. Some models are easy to interpret (e.g. linear regression), while others cannot be interpreted fully (e.g. neural networks). As data scientists, we have to understand the trade-off between `explainability` and `predictive power`. There may be use cases where we care more about HOW the model predicts the outcome than HOW GOOD the model is at predicting.

For example, if we were to build a model that decides which patient receives medical care first, we might want a reason a human can understand (for ethical reasons) instead of trusting the model fully. On the other hand, if we are predicting stock market performance, we might not be concerned with why our model predicts that a stock will go up, as long as we are earning a profit.

**linear regression explainability**

In a simple sense, linear regression finds the best-fit line: the line that minimises an error function (for ordinary least squares, the sum of squared residuals).

Therefore, in the simple (single-feature) case, the prediction depends only on the estimated intercept and slope:

$\hat{y} = \hat{b_0} + \hat{b_1} * x$

In [None]:
lm.intercept_, lm.coef_

In our case, the true values are $b_0 = 10$ and $b_1 = 5$, and the estimated intercept and coefficient recover them almost exactly.
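To make the "best-fit line" idea concrete, here is a minimal sketch that computes the least-squares estimates by hand, using the standard closed-form solution for a single feature: $\hat{b}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$. The data below is a fresh re-creation of the toy example with its own seed (an assumption, not the exact `X` and `y` used above), and the variable names are illustrative only.

```python
import numpy as np

# Re-create toy data following y = 10 + 5x (assumed seed, for illustration only).
rng = np.random.default_rng(0)
x = rng.random(1000) * 10
y = 10 + 5 * x

# Closed-form ordinary least squares estimates for one feature.
x_mean, y_mean = x.mean(), y.mean()
b1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0_hat = y_mean - b1_hat * x_mean

print(f"intercept: {b0_hat:.4f}, slope: {b1_hat:.4f}")
```

Because the toy data contains no noise, these estimates should agree with the true intercept and slope up to floating-point error, which is what `lm.intercept_` and `lm.coef_` report as well.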

**decision tree explainability**

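In a simple sense, a decision tree works by repeatedly splitting the data according to a splitting criterion (for a regression tree, typically the reduction in squared error).

Therefore, what the tree model is basically doing is grouping data points that are similar to each other and predicting the same value for every point in a group.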

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
# draw the tree on the axes we just created
plot_tree(tree, ax=ax)
plt.show()

In [None]:
# to illustrate our point, any input between 4.976 and 6.217 (the split thresholds in this run)
# gives the same prediction of 38.11675092 (which is our sample prediction result)
tree.predict(np.arange(4.976, 6.217, step=0.1).reshape(-1, 1))
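As a rough illustration of how a single split might be chosen, the sketch below scans every candidate threshold and keeps the one that minimises the total squared error of the two resulting groups. This is a simplified, from-scratch version for intuition only, not sklearn's actual implementation; `best_split` is a hypothetical helper, and the data is a fresh re-creation of the toy example with an assumed seed.

```python
import numpy as np


def best_split(x, y):
    """Return the threshold on x that minimises the summed squared error of both sides."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    # Candidate thresholds are midpoints between consecutive sorted values.
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse


# Toy data following y = 10 + 5x (assumed seed, for illustration only).
rng = np.random.default_rng(0)
x = rng.random(200) * 10
y = 10 + 5 * x

threshold, sse = best_split(x, y)
left_mean = y[x <= threshold].mean()
right_mean = y[x > threshold].mean()
print(f"split at x <= {threshold:.3f}")
print(f"left group predicts {left_mean:.2f}, right group predicts {right_mean:.2f}")
```

A tree with `max_depth=3` repeats this kind of split up to three times along each branch, so it ends up with at most 8 groups, each predicting the mean of its own `y` values.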

## why is data preprocessing important?

Now, imagine that instead of numeric data, we have categorical data. The model will not be able to compute any mathematical result from the categorical data.

In [None]:
# sklearn will attempt to convert string object into numeric data,
# we prevent this behaviour by adding additional non numeric text
X_categorical = np.char.add("X = ", X.astype("str"))

In [None]:
try:
    lm.fit(X_categorical, y)
except ValueError as e:
    print(e)

In [None]:
try:
    tree.fit(X_categorical, y)
except ValueError as e:
    print(e)

Therefore, it is important to find a numeric representation of our data in order to train any machine learning (and deep learning) model.
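One common numeric representation for categorical data is one-hot encoding, where each category becomes its own 0/1 column. The sketch below uses `pd.get_dummies` on a small made-up table; the column names `size` and `price` and their values are hypothetical and are not taken from our dataset. Other encodings (for example ordinal encoding for ordered categories) may suit a given feature better.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: column names and values are made up for illustration.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small", "large", "medium"],
    "price": [100.0, 300.0, 200.0, 100.0, 300.0, 200.0],
})

# One-hot encode the categorical column: one 0/1 column per category.
X_encoded = pd.get_dummies(toy[["size"]], dtype=float)
print(X_encoded.columns.tolist())

# The model can now be fitted, because every input is numeric.
model = LinearRegression()
model.fit(X_encoded, toy["price"])
pred = model.predict(X_encoded)
print(pred)
```

With the encoded columns, the same `fit` call that raised a `ValueError` on raw strings runs without complaint.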

## why is data cleaning important?

Previously, we saw that both the linear regression and the tree model work well in our simple case.

Now, what if we add some noise to the data? (adding data that does not follow the relationship y = 10 + 5 * X)

In [None]:
def add_noise(x_clean, y_clean, percent_noise):
    """Add a certain percentage of noisy data."""
    data = np.concatenate([x_clean, y_clean.reshape(-1, 1)], axis=1)
    n_noise = round(percent_noise * data.shape[0])
    noise_x = np.random.random(n_noise).reshape(-1, 1)
    noise_y = np.random.random(n_noise).reshape(-1, 1)
    # scale each noise value by a random integer in [0, 100)
    noise_x *= np.random.randint(0, 100, size=(n_noise, 1))
    noise_y *= np.random.randint(0, 100, size=(n_noise, 1))
    noise = np.concatenate([noise_x, noise_y], axis=1)
    data_noise = np.concatenate([data, noise], axis=0)
    # shuffle the combined rows so the noise is mixed in with the clean data
    np.random.shuffle(data_noise)
    X_noise, y_noise = data_noise[:, 0], data_noise[:, 1]
    return X_noise.reshape(-1, 1), y_noise.reshape(-1, )

In [None]:
# now if we add 10% noise to the data,
# we see that most of the data points still follow the relationship we expected
# but there is some random noise in our data
X_noise, y_noise = add_noise(X, y, .1)
lm.fit(X_noise, y_noise)
tree.fit(X_noise, y_noise)
plt.scatter(X_noise, y_noise)

In [None]:
# let Xi = 6, we expect the ground truth to be 10 + 5 * 6 = 40
_sample = [[6]]
print(f"Linear regression prediction: {lm.predict(_sample)}")
print(f"Tree regressor prediction: {tree.predict(_sample)}")

We see that both models are less accurate than before.

In [None]:
# if we add 1000% noise to our data
X_noise, y_noise = add_noise(X, y, 10.)
lm.fit(X_noise, y_noise)
tree.fit(X_noise, y_noise)
plt.scatter(X_noise, y_noise)

In [None]:
# let Xi = 6, we expect the ground truth to be 10 + 5 * 6 = 40
_sample = [[6]]
print(f"Linear regression prediction: {lm.predict(_sample)}")
print(f"Tree regressor prediction: {tree.predict(_sample)}")

We see that both models move even further away from the ground truth.
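A single prediction at x = 6 only tells part of the story. As a rough sketch, we can measure the mean absolute error of each model against the true relationship over a grid of x values, for different amounts of noise. The helper `with_noise` below is a simplified, hypothetical stand-in for `add_noise` that uses its own seeded generator, so the numbers will not match the cells above exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # assumed seed, for reproducibility of this sketch

# Clean data following y = 10 + 5x.
X_clean = rng.random((1000, 1)) * 10
y_clean = 10 + 5 * X_clean.ravel()


def with_noise(X, y, fraction):
    """Append round(fraction * len(X)) random points (simplified stand-in for add_noise)."""
    n_noise = round(fraction * len(X))
    noise_X = rng.random((n_noise, 1)) * rng.integers(0, 100, size=(n_noise, 1))
    noise_y = rng.random(n_noise) * rng.integers(0, 100, size=n_noise)
    return np.concatenate([X, noise_X]), np.concatenate([y, noise_y])


# Evaluate against the true relationship on an evenly spaced grid.
X_grid = np.linspace(0, 10, 101).reshape(-1, 1)
y_grid = 10 + 5 * X_grid.ravel()

errors = {}
for fraction in (0.0, 0.1, 10.0):
    X_train, y_train = with_noise(X_clean, y_clean, fraction)
    reg_linear = LinearRegression().fit(X_train, y_train)
    reg_tree = DecisionTreeRegressor(random_state=123, max_depth=3).fit(X_train, y_train)
    errors[fraction] = (
        np.mean(np.abs(reg_linear.predict(X_grid) - y_grid)),
        np.mean(np.abs(reg_tree.predict(X_grid) - y_grid)),
    )
    print(f"noise {fraction:>5.0%}: linear MAE = {errors[fraction][0]:.2f}, "
          f"tree MAE = {errors[fraction][1]:.2f}")
```

Comparing the rows gives a quantitative view of how much each model degrades as noisy points are added to the training data.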

# Closing remarks

We often spend most of our time making sure our data is clean and relevant to the problem we are solving.

With the help of various high-level APIs, training a machine learning model is often the fastest step in a project.