# CPSC 330 Lecture 2

# Lecture outline

- Wave hello
- **!! Turn on recording !!**
- Announcements (5 min)
- Cilantro dataset (5 min)
- Decision trees (30 min)
- Break (5 min)
- True/False questions (15 min)
- ML model parameters and hyperparameters (5 min)
- Overfitting (15 min)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16

from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

from plot_classifier import plot_classifier

In [2]:
import re
import graphviz
from sklearn.tree import export_graphviz

def display_tree(feature_names, tree):
    """ For binary classification only """
    dot = export_graphviz(tree, out_file=None, feature_names=feature_names, class_names=tree.classes_,impurity=False)
    # adapted from https://stackoverflow.com/questions/44821349/python-graphviz-remove-legend-on-nodes-of-decisiontreeclassifier
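    # raw strings avoid invalid-escape warnings for `\[`; `\\n` matches the literal backslash-n in the dot source
    dot = re.sub(r'(\\nsamples = [0-9]+)(\\nvalue = \[[0-9]+, [0-9]+\])(\\nclass = [A-Za-z0-9]+)', '', dot)
    dot = re.sub(r'(samples = [0-9]+)(\\nvalue = \[[0-9]+, [0-9]+\])\\n', '', dot)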
    return graphviz.Source(dot)

## Announcements (5 min)

- hw1 due tonight at 11:59pm
- hw2 will be released tomorrow, due Monday 11:59pm
  - See [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#groups) for instructions on working with a partner.
  - You are free to work alone or with a partner.
- On the usual schedule, hw will be due Mondays and released Tuesdays
- My evening office hour moved from Wed to Thu 
  - Note I have 30 min morning OH and 30 min evening OH.
- Update on the plan for the final exam:
  - We will **not** have a 2.5-hour exam in the regular way.
  - There will be a take-home, with a mix of coding and conceptual questions.
  - The time window will be 24-48 hours (exact time window TBD).
  - Open book.
- Update on the plan for the midterm:
  - We'll do it on Canvas during class time on Oct 22.
  - This will be the one time you'll need to operate in the middle of the night if you're in a far time zone (sorry).
  - Probably open book.
- Please monitor Piazza (especially pinned posts and instructor posts) for announcements.
- Sorry for the setup difficulties.

## Cilantro dataset (5 min)

Here's the dataset you generated last class!

In [None]:
df = pd.read_csv('data/330-students-cilantro.csv')
df.head()

- `df.head(n)` returns the first `n` rows (5 by default).

In [None]:
df.columns = ["meat", "grade", "cilantro"]
df.head()

- Assigning a list of names to `df.columns` renames the columns.

In [None]:
df.describe()

- `df.describe()` computes summary statistics (count, mean, std, min, quartiles, max) for the numeric columns.
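The three pandas calls used so far can be tried on a tiny made-up table. This is only a sketch: the rows and the long original column names below are invented for illustration and are not the class data.

```python
import pandas as pd

# Made-up survey rows (not the real class data); the long column names are hypothetical
toy = pd.DataFrame({
    "How often do you eat meat?": [0, 20, 50, 80, 100, 60],
    "Expected grade": [90, 85, 70, 75, 60, 88],
    "Do you like cilantro?": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
})

toy.columns = ["meat", "grade", "cilantro"]  # assigning to .columns renames every column
first_rows = toy.head()                      # first 5 rows by default
summary = toy.describe()                     # summary statistics, numeric columns only

print(first_rows)
print(summary)
```

Note that `cilantro` does not show up in the `describe()` output because it is not numeric.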

In [None]:
scatter = plt.scatter(df["meat"], df["grade"], c=df["cilantro"]=="Yes", cmap=plt.cm.coolwarm);
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");
plt.legend(scatter.legend_elements()[0], ["No", "Yes"]);

In [None]:
scatter.legend_elements()[0]

Can you find yourself on this plot?!

In [None]:
df["cilantro"].value_counts()

In [None]:
X = df[["meat", "grade"]]
X.head()

In [None]:
y = df["cilantro"]
y.head()

In [None]:
dc = DummyClassifier(strategy="prior")

- With `strategy="prior"`, `DummyClassifier` ignores the features and always predicts the most common class in the training labels.

In [None]:
dc.fit(X, y)
dc.score(X, y)

- `fit()` takes the data and does the learning; for `DummyClassifier`, that just means finding the most common class.
- `score()` shows how well the `DummyClassifier` does on our data.
- 0.72 here means that the `DummyClassifier` predicts correctly for 72% of the examples.

In [None]:
y.value_counts()/len(y)
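To see why the `DummyClassifier` score matches the fraction of the most common class, here is a minimal sketch on made-up labels (7 "Yes" and 3 "No", not the class data). The names `X_toy`, `y_toy` and `dummy` are just for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 7 "Yes" and 3 "No" (not the class data)
y_toy = np.array(["Yes"] * 7 + ["No"] * 3)
X_toy = np.zeros((10, 2))  # DummyClassifier ignores the feature values

dummy = DummyClassifier(strategy="prior")
dummy.fit(X_toy, y_toy)

majority_fraction = (y_toy == "Yes").mean()  # fraction of the most common class
dummy_score = dummy.score(X_toy, y_toy)      # accuracy of always predicting "Yes"

print(dummy.predict(X_toy[:3]), dummy_score, majority_fraction)
```

Since every prediction is the majority class, the accuracy is exactly the share of examples that belong to that class.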

## Decision trees (30 min)

- Our first approach to supervised learning: **decision trees**.
- Basic idea: ask a bunch of yes/no questions until you end up at a prediction.
- E.g. for our cilantro dataset,
  - If you eat meat <5% of the time, predict "Yes"
  - Otherwise, if you eat meat >95% of the time, predict "No"
  - Otherwise, if you expect to fail the course, predict "No"
  - Otherwise, predict "Yes"

- This "series of questions" approach can be drawn as a tree:

```
            Eats meat <5% of the time
            /          \
           / True       \  False
          /              \
         Yes           Eats meat >95% of the time
                        /      \
                  True /        \ False
                      /          \ 
                    No         Expects to fail the course (<50%)
                                 /           \
                                / True        \ False
                               /               \
                              No              Yes
```
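The same series of questions can also be written as nested `if` statements. This is a hand-written sketch of the example tree above, not a learned model; the function name `predict_cilantro` is made up for illustration.

```python
def predict_cilantro(meat, grade):
    """Hand-written version of the example tree (not a learned model).

    meat:  meat consumption, in % of days
    grade: expected course grade, in %
    """
    if meat < 5:
        return "Yes"
    if meat > 95:
        return "No"
    if grade < 50:  # expects to fail the course
        return "No"
    return "Yes"


print(predict_cilantro(2, 80))   # first question is True
print(predict_cilantro(99, 80))  # second question is True
print(predict_cilantro(50, 40))  # third question is True
print(predict_cilantro(50, 80))  # every question is False
```

Each call walks down exactly one path of the tree, from the root to a leaf.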

- The decision tree algorithm automatically learns a tree like this, based on the data set!
  - We won't go through **how** it does this - that's CPSC 340.
  - But it's worth noting that it supports two types of inputs:

1. Categorical (e.g., Yes/No or more options)
2. Numeric (a number)

In the numeric case, the decision tree algorithm also picks the _threshold_ (e.g. 5%, 50%, etc.)

In our case, both features (`meat` and `grade`) are numeric.
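To make the idea of a learned threshold concrete, here is a small sketch on made-up data with a single numeric feature. It reads the root node's split from the fitted tree's `tree_` attribute; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data with a single numeric feature (meat consumption, % of days)
X_toy = np.array([[10], [20], [60], [80]])
y_toy = np.array(["Yes", "Yes", "No", "No"])

stump = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy)

# tree_.feature and tree_.threshold describe each node; index 0 is the root
root_feature = stump.tree_.feature[0]
root_threshold = stump.tree_.threshold[0]

print(f"root question: feature {root_feature} <= {root_threshold}")
```

We never told the algorithm where to split; it picked a threshold between 20 and 60 because that separates the two classes.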

Let's apply a decision tree to our cilantro dataset.

In [None]:
tree1 = DecisionTreeClassifier(max_depth=1)

- Here, we create a `DecisionTreeClassifier` object from scikit-learn.
- We pass arguments to the constructor - these are called **hyperparameters** - in this case `max_depth=1`, which means the tree can have depth at most 1 (i.e., it can ask only one question before making a prediction).
- Next we fit to the data using `.fit()`.
- The semicolon is just cosmetic, otherwise some junk gets printed out.

In [None]:
tree1.fit(X, y);

In [None]:
display_tree(df.columns[:-1], tree1)

- This is a totally useless decision tree that predicts "Yes" for any feature values.
- This happens sometimes. Let's roll with it for the moment.

In [None]:
plot_classifier(X, y, tree1, ticks=True, vmin=0, vmax=1); # note to self: need to set vmin/vmax due to an issue with plot_classifier that always draws blue if all predictions are the same
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");

- The background colour shows our prediction ("Yes" here).
- We predict red (likes cilantro) for all feature values.
- We can get an accuracy score using `.score()` from sklearn

In [None]:
tree1.score(X, y)

- This is doing the same thing as `DummyClassifier` so we get the same score.
- We can verify this using `.predict()`

In [None]:
tree1.predict([[50, 50]])

- For someone who eats meat 50% of days and expects a grade of 50%, the tree predicts cilantro == 'Yes'.

In [None]:
tree1.predict([[99,99]])

In [None]:
tree1.predict(X)

- For everyone in the class (every example in the data set), the tree predicts 'Yes'.

etc.

- Let's make the tree deeper by increasing `max_depth`.

In [None]:
tree2 = DecisionTreeClassifier(max_depth=2)
tree2.fit(X, y);

In [None]:
display_tree(df.columns[:-1], tree2)

- `df.columns[:-1]` selects every column except the last one; here that leaves only the 'meat' and 'grade' columns.

In [None]:
plot_classifier(X, y, tree2, ticks=True, show_data=True);
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");

- Let's take a moment to make sure we can match the tree diagram to this plot - they are saying the same thing.

In [None]:
tree2.score(X, y)

- By the way, what does `.score()` do?
- It calls `predict` and then compares the predictions to the true labels.

In [None]:
(tree2.predict(X) == y).sum()/len(y)

Or, equivalently,

In [None]:
(tree2.predict(X) == y).mean()
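As a cross-check, scikit-learn also provides `accuracy_score` in `sklearn.metrics`, which computes the same fraction from labels and predictions. The sketch below uses made-up random data (not the cilantro data) and compares the three ways of getting the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up data: two numeric features, label depends on the first one
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(40, 2))
y_toy = np.where(X_toy[:, 0] > 50, "No", "Yes")

model = DecisionTreeClassifier(max_depth=2).fit(X_toy, y_toy)
predictions = model.predict(X_toy)

by_hand = (predictions == y_toy).mean()         # fraction of correct predictions
via_metric = accuracy_score(y_toy, predictions)  # same thing, via sklearn.metrics
via_score = model.score(X_toy, y_toy)            # predict + compare in one call

print(by_hand, via_metric, via_score)
```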

In [None]:
y

Moving on to `max_depth=None`, which lets it grow the tree as much as it wants.

In [None]:
tree = DecisionTreeClassifier(max_depth=None)
tree.fit(X, y);

In [None]:
tree.predict([[90, 90]])

In [None]:
display_tree(df.columns[:-1], tree)

In [None]:
plot_classifier(X, y, tree, ticks=True);
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");

In [None]:
tree.score(X, y)

The reason it's not getting 100% accuracy: some examples have identical feature values but different labels, and no tree can tell those apart.

In [None]:
# it's OK if you don't understand this line
df.loc[df.duplicated(subset=df.columns[:-1], keep=False)].sort_values(by=df.columns.values.tolist()).head(20)
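Here is a minimal sketch of the same effect on made-up data: two rows share identical features but have different labels, so even an unlimited-depth tree must get one of them wrong. The toy arrays are assumptions for illustration, not the class data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the first two rows have identical features but different labels
X_toy = np.array([[10, 80], [10, 80], [50, 70], [90, 60]])
y_toy = np.array(["Yes", "No", "Yes", "No"])

full_tree = DecisionTreeClassifier(max_depth=None).fit(X_toy, y_toy)
predictions = full_tree.predict(X_toy)
toy_score = full_tree.score(X_toy, y_toy)

# The two identical rows always land in the same leaf, so they get the same prediction
print(predictions, toy_score)
```

Whichever label the tree picks for the shared leaf, exactly one of the two conflicting rows is misclassified.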

If we remove the "duplicates" (rows where X is the same, regardless of y) then we can get 100% accuracy:

In [None]:
# it's OK if you don't understand this line
df_nodup = df.sort_values(by="cilantro").drop_duplicates(subset=df.columns[:-1]).reset_index(drop=True)

In [None]:
df_nodup.shape

In [None]:
X_nodup = df_nodup.iloc[:,:2]
y_nodup = df_nodup.iloc[:,-1]

In [None]:
tree_nodup = DecisionTreeClassifier() # default is max_depth=None

In [None]:
tree_nodup.fit(X_nodup, y_nodup);

In [None]:
tree_nodup.score(X_nodup, y_nodup)

In [None]:
plot_classifier(X_nodup, y_nodup, tree_nodup, ticks=True);
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");

Note: one would not actually remove the duplicates in a real scenario. This is just for illustration purposes.

## Break (5 min)

## True/False questions (15 min)

For each of the following, answer with `fit` or `predict`:

1. At least for decision trees, this is where most of the hard work is done. `fit`
2. Only takes `X` as an argument. `predict`
3. In scikit-learn, we can ignore its output (we don't need to grab and store the result). `fit` (`predict` returns the predictions, which we do want to keep.)
4. Is called first (before the other one). `fit`
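The answers above can be checked with a short sketch on made-up data (the arrays and names here are assumptions for illustration). In scikit-learn, `fit` returns the estimator itself, which is why its output can be ignored, whereas `predict` returns an array of predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data with one numeric feature
X_toy = np.array([[10], [20], [60], [80]])
y_toy = np.array(["Yes", "Yes", "No", "No"])

model = DecisionTreeClassifier(max_depth=1)
returned = model.fit(X_toy, y_toy)   # fit takes X and y, and returns the model itself
predictions = model.predict(X_toy)   # predict takes only X, and returns the predictions

print(returned is model)
print(predictions)
```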

<br><br><br><br><br><br>

## ML model parameters and hyperparameters (5 min)

- When you call `fit`, a bunch of values get set, like the split variables and split thresholds. 
- These are called **parameters**.
- But even before calling `fit` on a specific data set, we can set some "knobs" that control the learning, e.g. `max_depth`.
- These are called **hyperparameters**.

In scikit-learn, hyperparameters are set in the constructor:

In [None]:
tree = DecisionTreeClassifier(max_depth=3) 
tree.fit(X, y);

Here, `max_depth` is a hyperparameter. There are many, many more! See [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).



To summarize:

- **parameters** are automatically learned by the algorithm during training (`fit`)
- **hyperparameters** are specified by the human before `fit` (decided in advance), based on:
    - expert knowledge
    - heuristics, or 
    - systematic/automated optimization (more on that later on)
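The distinction can be seen directly on an estimator object. In this sketch (made-up data, illustrative names), `get_params()` lists the hyperparameters and works before `fit`, while the learned tree structure in `tree_` only exists after `fit`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data with one numeric feature
X_toy = np.array([[10], [20], [60], [80]])
y_toy = np.array(["Yes", "Yes", "No", "No"])

tree_toy = DecisionTreeClassifier(max_depth=3)

hyperparams = tree_toy.get_params()        # hyperparameters: available before fit
fitted_before = hasattr(tree_toy, "tree_")  # learned parameters do not exist yet

tree_toy.fit(X_toy, y_toy)
fitted_after = hasattr(tree_toy, "tree_")   # now the learned tree is stored

print(hyperparams["max_depth"], fitted_before, fitted_after)
print(tree_toy.tree_.threshold[0])          # a learned parameter: the root split threshold
```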

## Overfitting (15 min)

Important question: how does accuracy change as we increase `max_depth`?

In [None]:
# it would be good to understand this code, but not that urgent
# I am using a list comprehension but you might find it easier to understand with a `for` loop - post on Piazza for more info
max_depths = np.arange(1, 18)
scores = [DecisionTreeClassifier(max_depth=max_depth).fit(X_nodup, y_nodup).score(X_nodup, y_nodup) for max_depth in max_depths]
plt.plot(max_depths, scores);
plt.xlabel("max depth");
plt.ylabel("accuracy score");
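For anyone who prefers a `for` loop to the list comprehension, here is an equivalent sketch. It uses made-up random data rather than `X_nodup` and `y_nodup` so that it runs on its own, and it prints the scores instead of plotting them.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data (not the cilantro data): random features and random labels
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(60, 2))
y_toy = rng.choice(["No", "Yes"], size=60)

max_depths = np.arange(1, 18)
scores = []
for max_depth in max_depths:
    model = DecisionTreeClassifier(max_depth=max_depth)  # new tree for each depth
    model.fit(X_toy, y_toy)                               # train on the data
    scores.append(model.score(X_toy, y_toy))              # accuracy on the same data

for max_depth, score in zip(max_depths, scores):
    print(max_depth, round(score, 3))
```

Each pass through the loop does exactly what one element of the list comprehension does: build a tree, fit it, and record its training accuracy.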

- Why not just use a very deep decision tree for every supervised learning problem and get super high accuracy?
- Well, the goal of supervised learning is to predict unseen/new data...
  - The above decision tree has 100% accuracy on the training data **where we already know the answer**.
  - It perfectly labels the data we used to make the tree...
  - But we want to know how our model performs on data not used in training.
  - We will split our original dataset into two parts, one for "training" and one for "testing".
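To illustrate why training accuracy alone can be misleading, here is a sketch on made-up data where the labels are pure coin flips, so there is nothing real to learn. The data and names are assumptions for illustration; an unlimited-depth tree can still memorize the training examples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up data: random features, labels are coin flips unrelated to the features
X_train = rng.uniform(0, 100, size=(200, 2))
y_train = rng.choice(["No", "Yes"], size=200)
X_test = rng.uniform(0, 100, size=(200, 2))
y_test = rng.choice(["No", "Yes"], size=200)

deep_tree = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)

train_score = deep_tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_score = deep_tree.score(X_test, y_test)     # accuracy on data it has not seen

print(train_score, test_score)
```

The tree memorizes the training set, but that tells us nothing about how it does on new examples, which is what the test score measures.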

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_train, df_test = train_test_split(df_nodup)

- `train_test_split` splits the data set into training examples and test examples: 75% and 25% of the data set, respectively, by default.
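The default split sizes can be checked on a made-up table of 100 rows. This is a sketch with illustrative names; `random_state` is passed here only so that the shuffle is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with 100 rows (not the class data)
toy = pd.DataFrame({
    "meat": list(range(100)),
    "grade": [50 + i % 50 for i in range(100)],
})

# Default test_size is 0.25; random_state makes the shuffle repeatable
toy_train, toy_test = train_test_split(toy, random_state=123)

print(len(toy_train), len(toy_test))
print(set(toy_train.index) & set(toy_test.index))  # rows shared by both parts
```

Every row ends up in exactly one of the two parts, so no test example is ever seen during training.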

In [None]:
scatter = plt.scatter(df_train["meat"], df_train["grade"], c=df_train["cilantro"]=="Yes", cmap=plt.cm.coolwarm);
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");
plt.legend(scatter.legend_elements()[0], ["No", "Yes"]);

In [None]:
scatter = plt.scatter(df_test["meat"], df_test["grade"], c=df_test["cilantro"]=="Yes", cmap=plt.cm.coolwarm);
plt.xlabel("Meat consumption (% days)");
plt.ylabel("Expected grade (%)");
plt.xlim((0,100));
plt.ylim((0,100));
plt.legend(scatter.legend_elements()[0], ["No", "Yes"]);