# Machine Learning

Now that you have the basics of data manipulation and analysis, it's time to introduce you to Machine Learning algorithms!

Today we will:
- Have a brief introduction to different types of learning
- Use a DecisionTree to create a model that **predicts** home price values.
- Learn how to **measure** your **model quality**.
- Learn ways to **improve** this quality.

I encourage you to read the official documentation. You will find links in this Notebook, but the best way to learn is by being curious. Keep track of your questions and try to find answers on the internet.

# Vocabulary

In data science, we usually do not use the terms `rows` and `columns` to refer to a dataset. 
* The rows are called **samples**. For example, in our dataset of US States, each line describing a State is a sample.
* The columns are called **features**. Each feature describes a particular aspect of the sample. In the US States dataset, the State id, the State name, and the Geometry were all features. In other datasets (with different samples), it could have been the length of a sentence, the frequency of a word, the number of legs of an animal, a creation date, a GPS location, a True/False boolean...
* The **dimension** of a dataset is its number of features. If a dataset has 3 features, we can call it a *3-dimensional* dataset.
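
To make this vocabulary concrete, here is a minimal sketch on a small made-up table. The animals and the column names are invented for illustration and are not part of any dataset used in this Notebook.

```python
import pandas as pd

# Hypothetical toy dataset: each row is a sample, each column is a feature.
animals = pd.DataFrame({
    "name": ["cat", "spider", "sparrow", "snake"],
    "legs": [4, 8, 2, 0],
    "can_fly": [False, False, True, False],
})

# DataFrame.shape returns (number of rows, number of columns).
n_samples, n_features = animals.shape
print(f"{n_samples} samples, {n_features} features")
print(f"This is a {n_features}-dimensional dataset.")
```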


## Supervised Learning

In **supervised** learning, we already have a dataset containing the right answers, and we train a model to predict the right answer for new and unseen samples. 

The "right answers" are contained in a column that we call the **target**.

Supervised algorithms can be used for:
- Prediction
- Product recommendation
- Classification
    - Handwriting recognition
    - Speech recognition

## Unsupervised Learning

In **unsupervised** learning, we do not know the right answers beforehand. We have to use algorithms to explore the data, extract meaningful information, detect patterns, etc.

Unsupervised algorithms can be used for:
- Clustering (divide a dataset into `k` subgroups, called clusters)
- Pattern recognition
- Anomaly detection
- Dimensionality Reduction (e.g. PCA & LDA)
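
As an illustration of clustering, here is a minimal sketch using `sklearn.cluster.KMeans` on a handful of made-up 2D points. The points, the choice of `k = 2`, and the variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points forming two obvious groups. Note there is no target column.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [8.1, 7.9], [7.8, 8.0]])

# Ask for k = 2 clusters; random_state makes the run reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Each point gets a cluster id (0 or 1). The ids themselves are arbitrary.
print(labels)
```

The algorithm was never told which group each point belongs to; it only used the distances between points.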

Both types of learning can use methods based on Neural Networks, but we will dig into that later.

## Semi-supervised Learning

Semi-supervised learning is a related variant that makes use of both supervised and unsupervised techniques. 

# Decision Tree

A decision tree is a **supervised** machine learning algorithm for both **classification** and **regression** problems.

The goal is to create a **model** (a tree) that predicts a value of a *target* using selected *features*.

A decision tree builds upon iteratively asking questions (based on *features*) to partition data. 

In the Decision Tree example below, we have:
* A target: is giving a loan to this person risky or safe? (2 unique values)
* 2 features: one is credit history, the other is the income.

![Decision Tree Example](https://miro.medium.com/max/1362/1*lAzhAq7poWg2hUfOp6-rsQ.png)

What you ask at each step is a critical part and greatly influences the performance of a decision tree. 

Decision trees can be used to:
* Make decision making explicit.
* Describe / explain data.
* Determine the importance of each feature.
* Do classification (if the target is a category with a few unique values) or regression (if the target is a continuous number).
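
To see these uses in action, here is a minimal sketch of a classification tree on made-up loan data, loosely mirroring the picture above. The numbers and the encoding of `credit_history` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up loan data (illustration only).
# credit_history: 0 = bad, 1 = fair, 2 = excellent; income in thousands.
loans = pd.DataFrame({
    "credit_history": [2, 2, 1, 1, 0, 0, 1, 0],
    "income":         [30, 80, 90, 20, 85, 25, 70, 40],
    "risk": ["safe", "safe", "safe", "risky", "risky", "risky", "safe", "risky"],
})

X_loans = loans[["credit_history", "income"]]   # features
y_loans = loans["risk"]                         # target

loan_tree = DecisionTreeClassifier(random_state=0).fit(X_loans, y_loans)

# Classify a new, unseen applicant.
new_applicant = pd.DataFrame({"credit_history": [2], "income": [50]})
print(loan_tree.predict(new_applicant))

# Importance of each feature (the values sum to 1).
print(dict(zip(X_loans.columns, loan_tree.feature_importances_)))
```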


In [None]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.tree import DecisionTreeClassifier  
from sklearn.ensemble import BaggingClassifier
# `import sklearn` does not automatically import every submodule;
# you may have to import them explicitly, as above.

import matplotlib.pyplot as plt
import seaborn as sns

from jyquickhelper import add_notebook_menu
print("Table of Contents")
add_notebook_menu()


## Download the dataset

* Download the [Melbourne Housing dataset](https://www.kaggle.com/anthonypino/melbourne-housing-market)
* Unzip it in this folder.

## Load and clean

In [None]:
# TODO: Load file "Melbourne_housing_FULL.csv"
melbourne_data = _

Hint to remove NA values: use `DataFrame.dropna()`. [See official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
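
As a hedged illustration of how `dropna()` behaves, here is a sketch on a tiny made-up table (not the Melbourne data). The values are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up listings with some missing values (NaN).
listings = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, 990000.0],
    "YearBuilt": [1970.0, 1985.0, np.nan, 2001.0],
})

# By default, dropna() drops every row containing at least one NaN
# and returns a new DataFrame (the original is left untouched).
clean = listings.dropna()

print(len(listings), "rows before,", len(clean), "rows after")
print(clean.index.tolist())  # the original row labels are kept
```

Note that the remaining rows keep their original index labels, so the index is no longer a contiguous `0..n-1` range after cleaning.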

In [None]:
# TODO: Remove rows with N/A values.
melbourne_data = _

## Define the target and the features

By convention, we call the target variable `y`, and the features variable `X`.

In [None]:
y = melbourne_data.Price
y

In [None]:
print(melbourne_data.columns)

features = ['Rooms', 'Bathroom', 'Landsize', "Car", "YearBuilt"]
features

In [None]:
X = melbourne_data[features]
X

In [None]:
X.describe()

## Visualize the data in 2D

We will use PCA (Principal Component Analysis) to

1. Find the most important axes.
2. Plot the data using only those axes.
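
Here is a minimal sketch of how `sklearn.decomposition.PCA` is typically used, on synthetic data rather than on the housing features. The data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features with very different spreads.
data = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])

pca_demo = PCA(n_components=2)
projected = pca_demo.fit_transform(data)  # project onto the 2 main axes

print(projected.shape)
# One eigenvalue per principal component, sorted in decreasing order.
print(pca_demo.explained_variance_)
# Loadings: how much each original feature contributes to each component.
print(pca_demo.components_)
```

Because PCA is sensitive to the scale of each feature, features with large values (here the first one) tend to dominate the first component unless the data is standardized beforehand.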

In [None]:
# TODO: Fit the PCA on X values.
pca = _

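k = 2
# explained_variance_ holds one eigenvalue per principal component, not per
# original feature. To name each axis, we take the feature with the largest
# absolute loading on that component.
eig_pairs = []
for component, eigenvalue in zip(pca.components_, pca.explained_variance_):
    dominant_feature = features[np.argmax(np.abs(component))]
    eig_pairs.append((dominant_feature, eigenvalue))

for feature, eigenvalue in eig_pairs:
    print(f"{feature} eigenvalue: {eigenvalue:.2e}")

# Keep the first k distinct features, in order of decreasing eigenvalue.
selected_axis = list(dict.fromkeys(x[0] for x in eig_pairs))[:k]
selected_axis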

In [None]:
# This cell may take time to run since it plots and annotates 8887 points.

fig, ax = plt.subplots(figsize=(18, 12))

ax.scatter(X[selected_axis[0]],
           X[selected_axis[1]],
           marker="x")

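annotate_points = True  # set to False to speed up plotting
if annotate_points:
    # Iterate by position: after dropna(), the index labels are not contiguous,
    # so X[col][i] with a positional i would fail or pick the wrong row.
    for txt, x_point, y_point in zip(melbourne_data['Address'],
                                     X[selected_axis[0]],
                                     X[selected_axis[1]]):
        text = ax.annotate(txt, (x_point + .02, y_point))
        text.set_alpha(.5)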

plt.title('Houses 2D Visualization')

ax.set(xlabel=selected_axis[0],
       ylabel=selected_axis[1])

plt.show()

## Use DecisionTree model

Specifying a number for `random_state` ensures you get the same results in each run. This is considered a good practice. You can use any number.

In [None]:
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [None]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are:")

melbourne_model.predict(X.head())

### Visualize the tree

Using `sklearn.tree.plot_tree`. [See documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)

Note the `max_depth` parameter passed to `plot_tree`. A tree's depth is a measure of how many splits it makes before coming to a prediction. Here, we set it to 2 because we only want to display the first, most important splits. This only limits what is drawn; it does not change the fitted model.

In [None]:
fig, ax = plt.subplots(figsize=(18, 12))

sklearn.tree.plot_tree(melbourne_model,
                       ax=ax,
                       max_depth=2,
                       feature_names=features)
plt.show()

You can also
* Customize your tree
* Export it to a PNG file (like you can with every plot from matplotlib)

In [None]:
fig, ax = plt.subplots(figsize=(18, 12))

sklearn.tree.plot_tree(melbourne_model,
                       ax=ax,
                       max_depth=3,
                       feature_names=features,
                       proportion=True,
                       rounded=True,
                       filled=True, 
                       fontsize=10)

plt.savefig("beautiful_tree.png")
plt.show()

## Splitting in training and testing set

To check whether our predictive model works well enough, we need to try it on new samples that the model has not seen before.

To simulate that, we create a **training** set and a **testing** set. 

Usually, we keep 1/3 of our original data as the test set, and build our model using the 2/3 left. 

We could do the splitting ourselves, but sklearn provides a function to ease our work: `sklearn.model_selection.train_test_split()` [See documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(X_test)
print(y_test)

In [None]:
# TODO: Use Decision Tree on your training set to create your model.
# Hint: We did this earlier on the full data; now fit on X_train and y_train.

melbourne_model = DecisionTreeRegressor(random_state=1)
trained_tree = _

In [None]:
# TODO: Use your model to predict your testing set.
predictions = _
predictions

### Visualize the better trained tree

In [None]:
# TODO: Use `sklearn.tree.plot_tree` to visualize your tree.

fig, ax = plt.subplots(figsize=(18, 12))

sklearn.tree.plot_tree(trained_tree,
                       ax=ax,
                       max_depth=2,
                       feature_names=features)
plt.show()

## Validate the Model

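To measure the model quality, a task also named *model validation*, we use **metrics**.

One of the most common metrics is **accuracy**, but it applies to classification. Since we are predicting a number (a price), we need a regression metric.

There are many metrics to compare model predictions with actual values, but we'll start with one called **Mean Absolute Error** (also called MAE). 

For each house, the prediction error is:

$$ error = actual - predicted $$

The MAE takes the absolute value of each error and averages them over the $n$ houses:

$$ MAE = \frac{1}{n} \sum_{i=1}^{n} |actual_i - predicted_i| $$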
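
To make the computation concrete, here is a small sketch that computes the MAE by hand (difference, absolute value, then average) and compares it with `sklearn.metrics.mean_absolute_error`. The prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only.
actual = np.array([500_000, 750_000, 1_000_000])
predicted = np.array([450_000, 800_000, 1_200_000])

errors = actual - predicted             # signed error for each house
mae_by_hand = np.mean(np.abs(errors))   # average of the absolute errors

print(mae_by_hand)
print(mean_absolute_error(actual, predicted))
```

Taking the absolute value matters: without it, positive and negative errors would cancel each other out.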

In [None]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, predictions)
mae

The resulting output for me is ~405k.

*Note*: This value depends on the training and test sets, which were split randomly. Therefore, it is normal if the value you obtain differs from mine or your neighbour's (for example, if your cleaning steps or `random_state` differ).

This result can be read as follows:

**The average error is 405 000 dollars per house.**

Is it high? We can only know by comparing it to the average house price.

In [None]:
melbourne_data["Price"].describe()

In [None]:
mean_price = melbourne_data["Price"].mean()

print(f"The error with test data is ~{round(mae / mean_price * 100, 2)}% "
       "of average home value.")

## Improve the model

How can we make the model *more accurate*?

First, look into the [sklearn DecisionTreeRegressor Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). There are a lot of customizable parameters. 

One of them is `max_depth`, whose default is `None`, meaning the tree keeps splitting until it cannot split any further.

As you saw with the visualization, our tree can get deep really quickly. If the tree's depth is $n$, it can have up to $2^n$ leaves. With that many leaves, each leaf ends up containing only a few houses, so the predictions stick very closely to the training data.

This leads to a phenomenon called **overfitting**, where a model matches the training data almost perfectly, but does poorly on validation and other new data. It usually means the model is learning **noise**: details specific to the training samples that should have been discarded in the learning phase. 

On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups. It fails to capture the underlying trend of the data. This other extreme is called **underfitting**.
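
To illustrate both extremes, here is a hedged sketch on synthetic data (a noisy sine curve, not the housing data) that compares the training and testing MAE for a very shallow tree, a moderately deep tree, and an unrestricted tree. The data and the chosen depths are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data: a smooth trend plus random noise.
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

results = {}
for depth in [1, 4, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    test_mae = mean_absolute_error(y_te, tree.predict(X_te))
    results[depth] = (train_mae, test_mae, tree.get_n_leaves())
    print(f"max_depth={depth}: train MAE={train_mae:.3f}, "
          f"test MAE={test_mae:.3f}, leaves={tree.get_n_leaves()}")
```

A large gap between a very low training error and a much higher testing error is the typical signature of overfitting, while high errors on both sets suggest underfitting.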

Examples:

![Appropriate fitting](https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190523171258/overfitting_2.png)

Ironically, a perfect score on the training data is bad news, because it is often a sign of *overfitting*.

### How to avoid over- and under- fitting?

Underfitting can be avoided by:
* Using more data or more informative features.
* Using a more complex, non-linear algorithm.

Overfitting can be avoided by:
* **Early stopping**. In our tree case, it means limiting the tree's depth or its number of leaves. In other cases, it usually means reducing the number of learning iterations.
* **Pruning**. It means letting the tree grow to a certain depth, then cutting the branches or leaves that do not add much predictive power.
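
Both ideas can be sketched with `DecisionTreeRegressor`: `max_depth` stops the growth early, while `ccp_alpha` enables scikit-learn's cost-complexity pruning after the tree is grown. The synthetic data and the values `max_depth=3` and `ccp_alpha=0.01` below are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data: a smooth trend plus random noise.
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

# No limit: the tree grows until it cannot split any further.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
# Early stopping: the growth is capped at 3 levels.
early_stopped = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
# Pruning: the tree is grown, then weak branches are removed.
pruned = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X_demo, y_demo)

for name, tree in [("full", full_tree),
                   ("early stopping", early_stopped),
                   ("pruned", pruned)]:
    print(f"{name:15}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```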

But how do you know if you have a **good fit**?

Usually, the goal is to find the spot between overfitting and underfitting.

In [None]:
def get_mae_from_depth(max_depth):
    melbourne_model = DecisionTreeRegressor(max_depth=max_depth, random_state=1)
    trained_tree = melbourne_model.fit(X_train, y_train)
    predicted_home_prices = melbourne_model.predict(X_test)
    return mean_absolute_error(y_test, predicted_home_prices)

for max_depth in [5, 7, 10, 11, 12, 13, 14, 15, 20, 50, 100]:
    mae = get_mae_from_depth(max_depth)
    print(f"MAE for max_depth={max_depth:3}: {mae}")

Look at the printed values to find the `max_depth` that gives the lowest MAE. In the run this text was written from, it was around **12**; yours may differ.

Now, let's do something similar with the number of leaves.

In [None]:
def get_mae_from_leaves(max_leaf_nodes):
    melbourne_model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    trained_tree = melbourne_model.fit(X_train, y_train)
    predicted_home_prices = melbourne_model.predict(X_test)
    return mean_absolute_error(y_test, predicted_home_prices)

for max_leaf_nodes in [5, 10, 25, 50, 100, 150, 200, 250, 300, 350, 400, 500, 600, 700, 800, 900, 1000]:
    mae = get_mae_from_leaves(max_leaf_nodes)
    print(f"MAE for max_leaf_nodes={max_leaf_nodes:3}: {mae}")

We can see that the MAE fluctuates instead of decreasing steadily, so the minimum is hard to spot from the printed values.

### Visualize the error

Let's visualize it to understand it.

In [None]:
mae_scores = pd.DataFrame(range(20, 1000, 20), columns=["max_leaves"])
mae_scores[:10]

In [None]:
mae_scores["mae_leaves"] = mae_scores["max_leaves"].apply(get_mae_from_leaves)

In [None]:
mae_scores.head()

In [None]:
fig, ax = plt.subplots()
ax.scatter(x="max_leaves", y="mae_leaves", data=mae_scores)
plt.show()

Okay, great, but this graph is too small. Let's also show the `xticks` so we can see at which `x` the minimum MAE is.

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.set_xticks(mae_scores["max_leaves"])
ax.scatter(x="max_leaves", y="mae_leaves", data=mae_scores)

plt.xticks(rotation=45)
plt.show()

Now the minimum MAE is much easier to spot.

For me it's a MAE of ~324k for a `max_leaves` of 60.

In [None]:
mae_scores[mae_scores['max_leaves'] == 60]

Let's do the same for max_depth!

In [None]:
depth_score = pd.DataFrame(range(3, 50, 1), columns=["max_depth"])
depth_score["mae_depth"] = depth_score["max_depth"].apply(get_mae_from_depth)
depth_score.head()

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.set_xticks(depth_score["max_depth"])
ax.scatter(x="max_depth", y="mae_depth", data=depth_score, c='orange')

plt.xticks(rotation=45)
plt.show()

This time it's a MAE of ~317k for a `max_depth` of 7.

In [None]:
depth_score[depth_score['max_depth'] == 7]

Just for visual comparison, we can put them on the same axis. Keep in mind that the x-axis then mixes two different parameters (number of leaves and depth).

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.set_xticks(range(0, 1000, 20))

ax.scatter(x="max_leaves", y="mae_leaves", data=mae_scores)
ax.scatter(x="max_depth", y="mae_depth", data=depth_score, c='orange')

plt.xticks(rotation=45)
plt.show()

### Conclude

In [None]:
# TODO: Change max_depth and max_leaf_nodes according to your result
max_depth = 7
max_leaf_nodes = 60

# MAE with best max_depth
melbourne_model = DecisionTreeRegressor(max_depth=max_depth, 
                                        random_state=1)
trained_tree = melbourne_model.fit(X_train, y_train)
predicted_home_prices = melbourne_model.predict(X_test)
mae = mean_absolute_error(y_test, predicted_home_prices)
print(f"MAE for max_depth={max_depth:3}: {mae}")
print(f"The error with test data is ~{round(mae / mean_price * 100, 2)}% "
       "of average home value.")

# MAE with best max_leaf_nodes
mae = get_mae_from_leaves(max_leaf_nodes)
print(f"MAE for max_leaf_nodes={max_leaf_nodes:3}: {mae}")

print(f"The error with test data is ~{round(mae / mean_price * 100, 2)}% "
       "of average home value.")

As our visualizations suggested, we got the minimum error for a `max_depth` of **7**. 

It's still not perfect, but we reduced our original error from ~405k to ~316k, which represents a **22% error reduction**.
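
The relative reduction can be checked with a quick calculation. The two values below are the rounded MAEs quoted in the text; replace them with your own results.

```python
# Rounded values quoted in the text (replace with your own results).
baseline_mae = 405_000
tuned_mae = 316_000

# Relative reduction = (old error - new error) / old error
reduction = (baseline_mae - tuned_mae) / baseline_mae
print(f"Relative error reduction: {reduction:.1%}")
```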