# Exercise: Build a confusion matrix


* Build a model to predict object classes based on the scenario (we want to find hikers)
* NB: the data will be very biased. Don't try to solve this. The focus here is only on exploring the results with a confusion matrix
* See the content doc for context.

## Data visualization

Let's start this exercise by loading in and having a look at our data:

In [29]:
import pandas

#Import the data from the .csv file
dataset = pandas.read_csv('Data/snow_objects.csv', delimiter="\t")

#Let's have a look at the data
dataset

Unnamed: 0,size,roughness,color,motion,label
0,50.959361,1.318226,green,0.054290,tree
1,60.008521,0.554291,brown,0.000000,tree
2,20.530772,1.097752,white,1.380464,tree
3,28.092138,0.966482,grey,0.650528,tree
4,48.344211,0.799093,grey,0.000000,tree
...,...,...,...,...,...
2195,1.918175,1.182234,white,0.000000,animal
2196,1.000694,1.332152,black,4.041097,animal
2197,2.331485,0.734561,brown,0.961486,animal
2198,1.786560,0.707935,black,0.000000,animal


## Data Exploration

We can see that the dataset has both continuous (`size`, `roughness`, `motion`) and categorical data (`color` and `label`).
Let's do some quick data exploration and see what different labels we have and their respective counts:

In [30]:
import graphing # custom graphing code. See our GitHub repo for details

# Plot a histogram with counts for each label
graphing.multiple_histogram(dataset, label_x="label", label_group="label", title="Label distribution")

The histogram above makes it very easy to understand both the labels we have in the dataset and their distribution.
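The counts behind such a histogram can also be read directly from pandas. The snippet below is a minimal sketch on a tiny made-up frame (`toy` is a stand-in, not the real `snow_objects.csv` data); in the notebook the same call would be made on `dataset["label"]`.

```python
import pandas as pd

# Tiny made-up stand-in for `dataset` (illustration only, not the real data)
toy = pd.DataFrame(
    {"label": ["tree", "tree", "rock", "hiker", "tree", "animal", "rock"]}
)

# Number of rows per label, most frequent first
counts = toy["label"].value_counts()
print(counts)

# The same information expressed as a share of all rows
shares = toy["label"].value_counts(normalize=True)
print(shares.round(2))
```

A text summary like this is handy when exact numbers matter more than the shape of the distribution.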

We can do the same thing for the `color` feature:

In [31]:
# Plot a histogram with counts for each color
graphing.multiple_histogram(dataset, label_x="color", label_group="color", title="Color distribution")

From the plot above we can conclude that:

- We have `8` different color categories.
- The most predominant colors are `brown` and `white`
- The least predominant colors are `blue`, `orange` and `white`
- Our plotting algorithm is not smart enough to assign the correct colors to their respective names

Let's see what we can find about the other features:


In [32]:
graphing.box_and_whisker(dataset, label_y="size", title='Boxplot of "size" feature')

Above we can see that the majority of our samples are relatively small, with sizes ranging from `0` to `70`, but we have a few much bigger outliers.
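For readers wondering what counts as an "outlier" in a box plot, a common convention is to flag points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up numbers; it is an assumption that the custom `graphing` helper follows the same convention.

```python
import pandas as pd

# Made-up sizes (illustration only): mostly small values plus one large one
sizes = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 200.0])

# Quartiles and interquartile range (IQR)
q1 = sizes.quantile(0.25)
q3 = sizes.quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the box is an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sizes[(sizes < lower_fence) | (sizes > upper_fence)]

print(f"Box spans {q1} to {q3}; fences at {lower_fence} and {upper_fence}")
print(outliers.tolist())
```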

Let's take a look at the `roughness` feature:

In [33]:
graphing.box_and_whisker(dataset, label_y="roughness", title='Boxplot of "roughness" feature')

The values are centered around `1`, and there's not a lot of variation: values for `roughness` range from `0` to a little over `2`.

Finally, let's plot the `motion` feature:

In [34]:
graphing.box_and_whisker(dataset, label_y="motion", title='Boxplot of "motion" feature')

Most objects seem to be either static or moving very slowly. There is a smaller number of objects moving faster, with a couple of outliers going over `10`.

From the data above one could assume that the smaller and faster objects are likely hikers and animals, whereas the bigger, more static elements are trees and rocks.
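One way to check an assumption like this is to summarize the features per label. The snippet below is only a sketch: the `toy` frame holds made-up values chosen to mimic the assumed pattern, so its output says nothing about the real dataset. In the notebook, the same `groupby` call could be run on `dataset`.

```python
import pandas as pd

# Made-up rows (illustration only): big static objects, small moving ones
toy = pd.DataFrame(
    {
        "size": [50.0, 60.0, 45.0, 40.0, 1.8, 1.0, 2.0, 1.5],
        "motion": [0.0, 0.1, 0.0, 0.0, 3.0, 4.0, 1.0, 2.0],
        "label": ["tree", "tree", "rock", "rock",
                  "hiker", "hiker", "animal", "animal"],
    }
)

# Median size and motion for each label
summary = toy.groupby("label")[["size", "motion"]].median()
print(summary)
```

If the medians for the real data overlap heavily between two labels, that is an early hint that a model may struggle to tell those labels apart.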

## Building a classification model

--- FIRST TRY TO DO IT WITH RANDOM FOREST JUST AS AN EXAMPLE. SEE `2c - dataset generation.py` for an example.

Let's build and train a classification model using a random forest, to predict the class of an object based on the features in our dataset:


In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the dataset in a 70/30 train/test ratio.
train, test = train_test_split(dataset, test_size=0.3, random_state=2)
print(train.shape)
print(test.shape)

(1540, 5)
(660, 5)


Now we can train our model, using the `train` dataset we've just created:

In [36]:
# Create the model
random_forest = RandomForestClassifier(n_estimators=1, random_state=1, verbose=False)

# Define which features are to be used (leave color out for now)
features = ["size", "roughness", "motion"]

# Train the model
random_forest.fit(train[features], train.label)

print("Model trained!")

Model trained!


## Assessing our model

--- CONTENT - NOTE ACCURACY. EXPLAIN WHAT THIS IS WITHOUT TALKING ABOUT TP/FP/TN/FN 

We can now use our newly trained model to make predictions using the *test* set.

By comparing the values predicted to the actual labels (also called *true* values), we can measure the model's performance using different *metrics*, such as *accuracy*.

*Accuracy* is simply the number of correctly predicted labels divided by the total number of predictions made:

```sh
    Accuracy = Correct Predictions / Total Predictions
```

Let's see how this can be done in code:

In [37]:
# Import function that measures a model's accuracy
from sklearn.metrics import accuracy_score

# Calculate the model's accuracy on the TEST set
y_true = test.label
y_pred = random_forest.predict(test[features])

# This gives us the accuracy as a fraction
acc = accuracy_score(y_true, y_pred)

# This gives us the accuracy as a number of correct predictions
acc_norm = accuracy_score(y_true, y_pred, normalize=False)

print(f"The random forest model's accuracy on the test set is {acc:.4f}.")
print(f"It correctly predicted {acc_norm} labels in {len(test.label)} predictions.")

The random forest model's accuracy on the test set is 0.8924.
It correctly predicted 589 labels in 660 predictions.
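The two numbers in this output are tied together by the accuracy formula, which can be checked by hand. The sketch below only redoes that arithmetic with the counts printed above; the variable names are illustrative.

```python
# Counts taken from the printed output above
correct_predictions = 589
total_predictions = 660

# Accuracy = Correct Predictions / Total Predictions
accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.4f}")

# Everything that is not correct is an error of some kind
wrong_predictions = total_predictions - correct_predictions
print(f"Wrong predictions: {wrong_predictions}")
```

Accuracy tells us how many predictions were wrong, but not which classes were involved.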


--- NOTE HOW THAT MAKES IT HARD TO UNDERSTAND WHAT KIND OF ERRORS IT IS MAKING

Our model __seems__ to be doing quite well!

That intuition, however, can be misleading:

- Accuracy only counts correct predictions; it tells us nothing about which __wrong__ predictions the model makes

- It's also not very good at painting a clear picture in *class-imbalanced datasets*, like ours, where the samples are not evenly distributed across the classes (for example, we have 800 trees, 800 rocks, but only 200 animals)
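To make the class-imbalance point concrete, here is a small sketch with made-up labels (not the exercise data). A "model" that always predicts the majority class scores a high accuracy while never finding a single hiker.

```python
# Made-up, deliberately imbalanced labels (illustration only)
y_true = ["tree"] * 90 + ["hiker"] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = ["tree"] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many of the actual hikers were predicted as hikers?
hikers_found = sum(
    t == "hiker" and p == "hiker" for t, p in zip(y_true, y_pred)
)

print(f"Accuracy: {accuracy:.2f}")
print(f"Hikers found: {hikers_found} of {y_true.count('hiker')}")
```

An accuracy of 0.90 looks respectable here, yet the model is useless for the task we care about.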

## Building a confusion matrix

WALKTHROUGH ON HOW TO BUILD A CONFUSION MATRIX. YOU MIGHT NEED A FEW CELLS. SKLEARN PROVIDES A WAY TO DO THIS. IF THERE'S AN EASY WAY THAT USES PLOTLY THAT WOULD BE BETTER AS MATPLOT LIB NEEDS LOTS OF CODE. DON'T HIDE THIS CODE IN CODE BEHIND AS THEY NEED TO LEARN HOW TO DO IT

COMMENT ON WHAT EACH CELL MEANS. NOTE THEY HAVE COVERED THIS THEORY ALREADY IN CONTENT. SO, FOCUS ON WHAT IT MEANS FOR THIS SCENARIO (ESSENTIALLY, THE MODEL LOOKS TO DO REALLY WELL WITH ACCURACY, BUT WE CAN SEE THAT IT'S ONLY BECAUSE IT CAN PREDICT TREES AND ROCKS WELL. IT TENDS TO MUDDLE HUMANS WITH ANIMALS)
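As a starting point for this walkthrough, the sketch below builds a confusion matrix with scikit-learn's `confusion_matrix` and wraps it in a labelled pandas table. The labels are made up to mimic the pattern described in the note above (trees and rocks mostly right, hikers and animals mixed up); they are not the model's real results. In the notebook, `y_true` and `y_pred` would be `test.label` and `random_forest.predict(test[features])`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels (illustration only, not the real model output)
y_true = ["tree", "tree", "rock", "rock", "hiker", "hiker", "animal", "animal"]
y_pred = ["tree", "tree", "rock", "tree", "hiker", "animal", "animal", "hiker"]

# Fix the class order so rows and columns are predictable
labels = ["animal", "hiker", "rock", "tree"]

# Rows are the actual labels, columns are the predicted labels
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Wrap the raw array in a labelled table for easier reading
cm_table = pd.DataFrame(cm, index=labels, columns=labels)
cm_table.index.name = "actual"
cm_table.columns.name = "predicted"
print(cm_table)

# The diagonal holds the correct predictions, so accuracy falls out directly
accuracy = cm.diagonal().sum() / cm.sum()
print(f"Accuracy: {accuracy:.3f}")

# Share of each actual class that was predicted correctly
per_class = pd.Series(cm.diagonal() / cm.sum(axis=1), index=labels)
print(per_class)
```

Each row shows where the samples of one actual class ended up. In this toy example the `hiker` row has one hiker predicted as `animal`, which is exactly the kind of error a single accuracy figure hides. To draw the table as a heatmap, one option is plotly's `px.imshow(cm_table, text_auto=True)`; treat that as a suggestion rather than the approach the notebook settles on.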

## Summary

TODO