![](img/330-banner.png)

# Lecture 2: Terminology, Baselines, Decision Trees

UBC 2021-22

Instructor: Varada Kolhatkar

## Imports

In [1]:
import glob
import os
import re
import sys
from collections import Counter, defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sys.path.append("code/.")
import graphviz
import IPython
import mglearn
from IPython.display import HTML, display
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from utils import *

plt.rcParams["font.size"] = 16
pd.set_option("display.max_colwidth", 200)

## Announcements 

- hw1 due tonight at 11:59pm
- hw2 will be released tomorrow, due Monday 11:59pm
  - See [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#groups) for instructions on working with a partner.
  - You are free to work alone or with a partner.
- On the usual schedule, hw will be due Mondays and released Tuesdays
- My evening office hour moved from Wed to Thu 
  - Note I have 30 min morning OH and 30 min evening OH.
- Update on the plan for the final exam:
  - We will **not** have a 2.5-hour exam in the usual format.
  - There will be a take-home, with a mix of coding and conceptual questions.
  - The time window will be 24-48 hours (exact time window TBD).
  - Open book.
- Update on the plan for the midterm:
  - We'll do it on Canvas during class time on Oct 22.
  - This will be the one time you'll need to operate in the middle of the night if you're in a far time zone (sorry).
  - Probably open book.
- Please monitor Piazza (especially pinned posts and instructor posts) for announcements.
- Sorry for the setup difficulties.

<br><br>

## Learning outcomes 
From this lecture, you will be able to 

- identify whether a given problem could be solved using supervised machine learning or not; 
- differentiate between supervised and unsupervised machine learning;
- explain machine learning terminology such as features, targets, predictions, training, and error;
- differentiate between classification and regression problems;
- use `DummyClassifier` and `DummyRegressor` as baselines for machine learning problems;
- explain the `fit` and `predict` paradigm and use the `score` method of ML models; 
- broadly describe how decision tree prediction works;
- use `DecisionTreeClassifier` and `DecisionTreeRegressor` to build decision trees using `scikit-learn`; 
- visualize decision trees; 
- explain the difference between parameters and hyperparameters; 
- explain the concept of decision boundaries;
- explain the relation between model complexity and decision boundaries.

<br><br>

## Big picture and datasets

### Toy datasets 

We will be working with three toy datasets 
- 

### 🤔 Eva's questions

![](img/eva-think.png)

At this point Eva is wondering about the following questions. 

- What might be the difference between spam filtering vs. Google News?   
- What might be the difference between predicting spam vs. predicting housing prices? 

Think about these questions on your own or discuss them with your friend/neighbour.

<br><br><br><br>

## Terminology

You will see a lot of varied terminology in machine learning and statistics. Let's familiarize ourselves with some of the basic terminology used in ML. 

I'll be using the following grade prediction toy dataset to demonstrate the terminology. Imagine that you are taking a course with four homework assignments and two quizzes. You and your friends are quite nervous about your quiz2 grades and you want to know how you will do based on your previous performance and some other attributes. So you decide to collect some data from your friends from last year and train a supervised machine learning model for quiz2 grade prediction. 

In [2]:
classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
print(classification_df.shape)
classification_df.head()

(21, 8)


Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,A+
1,1,0,94,90,80,83,91,not A+
2,0,0,78,85,83,80,80,not A+
3,0,1,91,94,92,91,89,A+
4,0,1,77,83,90,92,85,A+


### Recap: Supervised machine learning

<img src="img/sup-learning.png" height="800" width="800"> 

### Tabular data
In supervised machine learning, the input data is typically organized in a **tabular** format, where rows are **examples** and columns are **features**. One of the columns is typically the **target**. 

<img src="img/sup-ml-terminology.png" height="1000" width="1000"> 

**Features** 
: Features are relevant characteristics of the problem, usually suggested by experts. Features are typically denoted by $X$ and the number of features is usually denoted by $d$.  

**Target**
: Target is the feature we want to predict (typically denoted by $y$). 

**Example** 
: A row of feature values. When people refer to an example, it may or may not include the target corresponding to the feature values, depending upon the context. The number of examples is usually denoted by $n$. 

**Training**
: The process of learning the mapping between the features ($X$) and the target ($y$). 

#### Example: Tabular data for grade prediction

The tabular data usually contains both: the features (`X`) and the target (`y`). 

In [3]:
classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head()

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,A+
1,1,0,94,90,80,83,91,not A+
2,0,0,78,85,83,80,80,not A+
3,0,1,91,94,92,91,89,A+
4,0,1,77,83,90,92,85,A+


So the first step in training a supervised machine learning model is separating `X` and `y`. 

In [4]:
X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]
X.head()

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1
0,1,1,92,93,84,91,92
1,1,0,94,90,80,83,91
2,0,0,78,85,83,80,80
3,0,1,91,94,92,91,89
4,0,1,77,83,90,92,85


In [5]:
y.head()

0        A+
1    not A+
2    not A+
3        A+
4        A+
Name: quiz2, dtype: object

#### Example: Tabular data for the housing price prediction

Here is an example of tabular data for housing price prediction. You can download the data from [here](https://www.kaggle.com/harlfoxem/housesalesprediction). 

In [6]:
housing_df = pd.read_csv("data/kc_house_data.csv")
housing_df.drop(["id", "date"], axis=1, inplace=True)
HTML(housing_df.head().to_html(index=False))

price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650
538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639
180000.0,2,1.0,770,10000,1.0,0,0,3,6,770.0,0,1933,0,98028,47.7379,-122.233,2720,8062
604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050.0,910,1965,0,98136,47.5208,-122.393,1360,5000
510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503


In [7]:
X = housing_df.drop(columns=["price"])
y = housing_df["price"]
X.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,3,1.0,1180,5650,1.0,0,0,3,7,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650
1,3,2.25,2570,7242,2.0,0,0,3,7,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639
2,2,1.0,770,10000,1.0,0,0,3,6,770.0,0,1933,0,98028,47.7379,-122.233,2720,8062
3,4,3.0,1960,5000,1.0,0,0,5,7,1050.0,910,1965,0,98136,47.5208,-122.393,1360,5000
4,3,2.0,1680,8080,1.0,0,0,3,8,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503


In [8]:
y.head()

0    221900.0
1    538000.0
2    180000.0
3    604000.0
4    510000.0
Name: price, dtype: float64

```{admonition} Attention
:class: important
To a machine, column names (features) have no meaning. Only feature values and how they vary across examples mean something. 
```
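
The point about column names can be checked with a small sketch. The toy values below are made up for illustration (they are not the course dataset), and `random_state=0` is only there to make the two fits comparable. Renaming the columns to meaningless letters should leave the predictions unchanged, because the model only sees the values.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (values invented for illustration)
toy_df = pd.DataFrame(
    {
        "class_attendance": [1, 0, 0, 1, 1, 0],
        "quiz1": [92, 91, 80, 89, 85, 70],
        "quiz2": ["A+", "not A+", "not A+", "A+", "A+", "not A+"],
    }
)
X_toy = toy_df.drop(columns=["quiz2"])
y_toy = toy_df["quiz2"]

# Same values, but with column names that carry no meaning
X_renamed = X_toy.rename(columns={"class_attendance": "a", "quiz1": "b"})

preds_original = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy).predict(X_toy)
preds_renamed = (
    DecisionTreeClassifier(random_state=0).fit(X_renamed, y_toy).predict(X_renamed)
)

# The predictions agree because only the feature values matter
print((preds_original == preds_renamed).all())
```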

<br><br>

#### Alternative terminology for examples, features, targets, and training

- **examples** = rows = samples = records = instances 
- **features** = inputs = predictors = explanatory variables = regressors = independent variables = covariates
- **targets** = outputs = outcomes = response variable = dependent variable = labels (if categorical).
- **training** = learning = fitting

```{seealso} 
Check out [the MDS terminology document](https://ubc-mds.github.io/resources_pages/terminology/). 
```

<br><br>

### Supervised learning vs. Unsupervised learning

In **supervised learning**, training data comprises a set of features ($X$) and their corresponding targets ($y$). We wish to find a **model function $f$** that relates $X$ to $y$. Then use that model function **to predict the targets** of new examples. 

<img src="img/sup-learning.png" height="900" width="900">


In **unsupervised learning**, training data consists of observations ($X$) **without any corresponding targets**. Unsupervised learning could be used to **group similar things together** in $X$ or to provide a **concise summary** of the data. We'll learn more about this topic in later videos.

<img src="img/unsup-learning.png" alt="" height="900" width="900">
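
In code, the difference between the two settings shows up in what `fit` receives. The sketch below uses made-up 2-feature data and, as an assumption, `KMeans` as one example of an unsupervised algorithm (it is not covered in this lecture); the names `X_demo`, `y_demo`, `clf`, and `km` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: two well-separated groups of points
X_demo = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
)
y_demo = np.array(["low", "low", "low", "high", "high", "high"])

# Supervised: fit receives both the features and the targets
clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
print(clf.predict([[1.1, 0.9]]))

# Unsupervised: fit receives only the features and groups similar rows together
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
print(km.labels_)  # cluster ids are arbitrary integers, not the original labels
```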

<br><br>

### Classification vs. Regression 
In supervised machine learning, there are two main kinds of learning problems based on what they are trying to predict.
- **Classification problem**: predicting among two or more discrete classes
    - Example1: Predict whether a patient has a liver disease or not
    - Example2: Predict whether a student would get an A+ or not in quiz2.  
- **Regression problem**: predicting a continuous value
    - Example1: Predict housing prices 
    - Example2: Predict a student's score in quiz2.
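
The two quiz2 examples differ only in the type of the target. As a rough sketch, a numeric target can be turned into a class label by thresholding; the cutoff of 90 below is an assumption made for illustration, not something stated about how the course data was built.

```python
import pandas as pd

# Regression target: numeric quiz2 scores
scores = pd.Series([90, 84, 82, 92], name="quiz2")

# Classification target: discrete classes, here obtained with an assumed cutoff of 90
labels = scores.apply(lambda s: "A+" if s >= 90 else "not A+")

print(scores.dtype)     # a numeric dtype, so this is a regression target
print(labels.unique())  # a small set of classes, so this is a classification target
```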

<img src="img/classification-vs-regression.png" height="1500" width="1500"> 

In [11]:
# quiz2 classification toy data
classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head(4)

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,A+
1,1,0,94,90,80,83,91,not A+
2,0,0,78,85,83,80,80,not A+
3,0,1,91,94,92,91,89,A+


In [12]:
# quiz2 regression toy data
regression_df = pd.read_csv("data/quiz2-grade-toy-regression.csv")
regression_df.head(4)

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,90
1,1,0,94,90,80,83,91,84
2,0,0,78,85,83,80,80,82
3,0,1,91,94,92,91,89,92


### ❓❓ Questions for you

```{admonition} Exercise 2.1.1: $X$ and $y$  
1. How many examples and features are there in the housing price data above? You can use `df.shape` to get number of rows and columns in a dataframe. 
2. For each of the following examples what would be the relevant features and what would be the target?
    1. Credit risk assessment
    2. Sentiment analysis
    3. Fraud detection 
    4. Face recognition 
```    

<br><br>

```{admonition} Exercise 2.1.2: Supervised vs. unsupervised 

Which of these are examples of supervised learning?

1. Finding groups of similar properties in a real estate data set.
2. Predicting real estate prices based on house features like number of rooms, learning from past sales as examples.
3. Grouping articles on different topics from different news sources (something like Google News app). 
4. Detecting credit card fraud based on examples of fraudulent and non-fraudulent transactions.
```

```{admonition} Exercise 2.1.2: Solutions!
:class: tip, dropdown
1. Unsupervised 
2. Supervised
3. Unsupervised
4. Supervised
```

<br><br>

```{admonition} Exercise 2.1.3: Classification vs. Regression

Which of these are examples of classification and which ones are of regression?

1. Predicting the price of a house based on features such as number of bedrooms and the year built.
2. Predicting if a house will sell or not based on features like the price of the house, number of rooms, etc.
3. Predicting your grade in 571 based on past grades.
4. Predicting whether you should bicycle tomorrow or not based on the weather forecast.
```

```{admonition} Exercise 2.1.3: Solutions!
:class: tip, dropdown
1. Regression  
2. Classification 
3. Regression
4. Classification
```

<br><br>

## Baselines

### Supervised Learning (Reminder)

- Training data $\rightarrow$ Machine learning algorithm $\rightarrow$ ML model 
- Unseen test data + ML model $\rightarrow$ predictions


```{image} img/sup-learning.png
:alt: ML examples
:class: bg-primary mb-1
:width: 1200px
:align: center
```
<!-- <center>
<img src="img/sup-learning.png" height="1200" width="1200"> 
</center> -->

### Baseline intuition

- Baseline: A simple machine learning algorithm based on simple rules of thumb. For example, 
    - most frequent baseline: always predicts the most frequent label in the training set. 
- Baselines provide a way to sanity check your machine learning model.    
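
The most frequent baseline is simple enough to write by hand. Here is a minimal sketch in plain Python; the function name `most_frequent_baseline` and the toy labels are made up for illustration.

```python
from collections import Counter


def most_frequent_baseline(train_labels):
    """Return a predict function that always outputs the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]

    def predict(n_examples):
        # The features are never looked at; every example gets the same label
        return [most_common_label] * n_examples

    return predict


train_labels = ["A+", "not A+", "not A+", "A+", "not A+"]
predict = most_frequent_baseline(train_labels)
print(predict(3))
```

Any model worth keeping should do better than this on the same data.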

### `DummyClassifier` 

Let's look at an example of using `DummyClassifier`, `sklearn`'s baseline model for classification.  

Building any kind of machine learning model using `sklearn`, including the baseline model, has the following steps. 

1. Read the data
2. Create $X$ and $y$
3. Create a classifier object
4. `fit` the classifier
5. `predict` on new examples
6. `score` the model

#### Reading the data

Let's use our quiz2 grade prediction toy classification dataset as an example. 

In [None]:
classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head(4)

#### Create $X$ and $y$

- $X$ &rarr; Feature vectors
- $y$ &rarr; Target

In [None]:
X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]

#### Create a classifier object

- `import` the appropriate classifier 
- Create an object of the classifier 

In [None]:
# scikit-learn DummyClassifier
from sklearn.dummy import DummyClassifier

# Create a classifier object
dummy_clf = DummyClassifier(strategy="most_frequent")

#### `fit` the classifier

- The "learning" is carried out when we call `fit` on the classifier object. 

In [None]:
# fit the classifier
dummy_clf.fit(X, y);

#### `predict` the target of given examples

- We can predict the target of examples by calling `predict` on the classifier object. 

In [None]:
# predict using the learned classifier
dummy_clf.predict(X)

In [None]:
classification_df["quiz2"].value_counts()

#### `score` your model

- How do you know how well your model is doing?
- For classification problems, by default, `score` gives the **accuracy** of the model, i.e., proportion of correctly predicted targets.  

    $accuracy = \frac{\text{correct predictions}}{\text{total examples}}$
    
- Sometimes you will also see people reporting **error**, which is usually $1 - accuracy$ 
- `score` 
    - calls `predict` on `X` 
    - compares predictions with `y` (true targets)
    - returns the accuracy in case of classification.  

In [None]:
print("The accuracy of the model on the training data: %0.3f" % (dummy_clf.score(X, y)))

In [None]:
print(
    "The error of the model on the training data: %0.3f" % (1 - dummy_clf.score(X, y))
)
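
To see what `score` does under the hood for a classifier, the sketch below computes accuracy and error by hand and compares them with `score`. The data is a made-up stand-in for the quiz2 dataset, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: DummyClassifier ignores the feature values
X_demo = np.zeros((5, 2))
y_demo = np.array(["A+", "not A+", "not A+", "A+", "not A+"])

clf_demo = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

# accuracy = correct predictions / total examples
predictions = clf_demo.predict(X_demo)
manual_accuracy = (predictions == y_demo).mean()
manual_error = 1 - manual_accuracy

print(manual_accuracy, clf_demo.score(X_demo, y_demo))
print(manual_error)
```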

#### `fit`, `predict` , and `score` summary

Here is the general pattern when we build ML models using `sklearn`. 

In [None]:
# Create `X` and `y` from the given data
X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]

# Create a class object
clf = DummyClassifier(strategy="most_frequent")

# Train/fit the model
clf.fit(X, y)

# Assess the model
clf.score(X, y)

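# Predict on some new data using the trained model
# Each example needs one value per feature (7 here); the final 90 in the
# second example is an illustrative value.
new_examples = [[0, 1, 92, 90, 95, 93, 92], [1, 1, 92, 93, 94, 92, 90]]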
clf.predict(new_examples)

```{note} 
You'll be exploring dummy classifier in your lab!
```

### `DummyRegressor`

You can also do the same thing for regression problems using `DummyRegressor`. Let's build a regression baseline model using `sklearn`. The `fit` and `predict` paradigms are similar to classification. The `score` method in the context of regression returns something called the [$R^2$ score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score). (More on this in DSCI 573.)     

- The maximum $R^2$ is 1 for perfect predictions. 
- By default, `DummyRegressor` always predicts the mean of the training `y` values, which gives an $R^2$ of 0 on the training data.   

In [None]:
# scikit-learn DummyRegressor
from sklearn.dummy import DummyRegressor

# Create `X` and `y` from the given data
X = regression_df.drop(columns=["quiz2"])
y = regression_df["quiz2"]

# Create a class object
reg = DummyRegressor()

# Train/fit the model
reg.fit(X, y)

# Assess the model
reg.score(X, y)

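# Predict on some new data using the trained model
# Each example needs one value per feature (7 here); the final 90 in the
# second example is an illustrative value.
new_examples = [[0, 1, 92, 90, 95, 93, 92], [1, 1, 92, 93, 94, 92, 90]]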
reg.predict(new_examples)
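
As a quick check of what the regression baseline does, here is a small sketch on made-up targets: with its default strategy, `DummyRegressor` predicts the training mean for every example, and its $R^2$ on the training data is 0. The `_demo` names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical data: DummyRegressor ignores the feature values
X_demo = np.zeros((4, 1))
y_demo = np.array([90.0, 84.0, 82.0, 92.0])

reg_demo = DummyRegressor().fit(X_demo, y_demo)

print(y_demo.mean())                   # 87.0
print(reg_demo.predict(X_demo))        # the mean, repeated for every example
print(reg_demo.score(X_demo, y_demo))  # R^2 of the mean prediction on the training data
```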

### ❓❓ Questions for you

```{admonition} Exercise 2.2.1
1. Order the steps below to build ML models using `sklearn`. 
    a. `score` to evaluate the performance of a given model
    b. `predict` on new examples 
    c. Creating a model instance
    d. Creating `X` and `y` 
    e. `fit`
2. `predict` takes only `X` as argument whereas `fit` and `score` take both `X` and `y` as arguments. True or False. 
3. Have you ever played the [20-questions game](https://en.wikipedia.org/wiki/Twenty_questions)? If yes, think about how you decide which question to ask next. 

```

```{admonition} Exercise 2.2.1: Solutions!
:class: tip, dropdown

- Steps to build ML models using `sklearn`: d, c, e, a, b          
- True
```

<br><br><br><br>

## Decision trees

### Writing a traditional program to predict quiz2 grade

- Can we do better than the baseline? 
- Forget about ML for a second. If you are asked to write a program to predict whether a student gets an A+ or not in quiz2, how would you go about it?  
- For simplicity, let's binarize the feature values. 

```{image} img/quiz2-grade-toy.png
:alt: quiz2-grade-toy
:class: bg-primary mb-1
:width: 1000px
:align: center
```

<!-- <img src="img/quiz2-grade-toy.png" height="1000" width="1000">  -->

- Is there a pattern that distinguishes A+'s from not A+'s, and what does that pattern say about a new student? 
- How about a rule-based algorithm with a number of *if else* statements?  
    ```
    if class_attendance == 1 and quiz1 == 1:
        quiz2 = "A+"
    elif class_attendance == 1 and lab3 == 1 and lab4 == 1:
        quiz2 = "A+"
    ...
    ```
- How many possible rule combinations could there be with the given 7 binary features? 
    - Gets unwieldy pretty quickly 
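
One way to make "unwieldy" concrete (this is one interpretation of the question, not the only one) is to count the distinct inputs and the distinct ways of labelling them. With 7 binary features there are $2^7$ possible feature combinations, and each of them could be labelled "A+" or "not A+".

```python
from itertools import product

n_features = 7

# Every possible setting of 7 binary features
feature_combinations = list(product([0, 1], repeat=n_features))
n_combinations = len(feature_combinations)  # 2**7

# Each combination can be labelled in 2 ways, so there are 2**(2**7) complete rule sets
n_labelings = 2**n_combinations

print(n_combinations)
print(n_labelings)
```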

### Decision tree algorithm 

- A machine learning algorithm to derive such rules from data in a principled way.  
- Have you ever played [20-questions game](https://en.wikipedia.org/wiki/Twenty_questions)? Decision trees are based on the same idea! 
- Let's `fit` a decision tree using `scikit-learn` and `predict` with it.
- Recall that `scikit-learn` uses the term `fit` for training or learning and uses `predict` for prediction. 

### Building decision trees with `sklearn`

Let's binarize our toy dataset for simplicity. 

In [None]:
classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]

X_binary = X.copy()
columns = ["lab1", "lab2", "lab3", "lab4", "quiz1"]
for col in columns:
    X_binary[col] = X_binary[col].apply(lambda x: 1 if x >= 90 else 0)
X_binary.head()

In [None]:
y.head()

#### `DummyClassifier` on quiz2 grade prediction toy dataset 

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_binary, y)
dummy_clf.score(X_binary, y)

#### `DecisionTreeClassifier` on quiz2 grade prediction toy dataset 

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree
model = DecisionTreeClassifier()

# Fit a decision tree
model.fit(X_binary, y)

# Assess the model
model.score(X_binary, y)

The decision tree classifier gives much higher accuracy on the training data than the dummy classifier. That's good news! 

In [None]:
# Let's visualize the learned model
display_tree(X_binary.columns, model)

### Some terminology related to trees 

Here is some commonly used terminology in a typical representation of decision trees. 

**A root node**
: represents the first condition to check or question to ask

**A branch**
: connects a node (condition) to the next node (condition) in the tree. Each branch typically represents either true or false. 

**An internal node** 
: represents conditions within the tree

**A leaf node**
: represents the predicted class/value when the path from root to the leaf node is followed. 

**Tree depth**
: The number of edges on the path from the root node to the farthest away leaf node.

### How does `predict` work? 

In [None]:
new_example = np.array([[0, 1, 0, 0, 1, 1, 1]])
pd.DataFrame(data=new_example, columns=X.columns)

In [None]:
display_tree(X_binary.columns, model)

In [None]:
# What's the prediction for the new example?
model.predict(new_example)

In summary, given a learned tree and a test example, during prediction time,  
- Start at the top of the tree. Ask binary questions at each node and follow the appropriate path in the tree. Once you are at a leaf node, you have the prediction. 
- Note that the model only considers the features which are in the learned tree and ignores all other features. 
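
The traversal described above can be sketched by walking the fitted tree's arrays directly. This is an illustration on made-up binary data, not how you would normally predict (use `predict` for that). It relies on the `tree_` attribute of a fitted `sklearn` tree, where a child index of `-1` marks a leaf; the helper name `predict_one` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary data: the second feature determines the label
X_demo = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 0]])
y_demo = np.array(["A+", "A+", "not A+", "not A+", "A+", "not A+"])
tree_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)


def predict_one(model, x):
    """Follow one example from the root to a leaf and return the leaf's class."""
    tree = model.tree_
    node = 0  # start at the root
    while tree.children_left[node] != -1:  # -1 means the node is a leaf
        # Ask the node's question: is the feature value <= the threshold?
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]
        else:
            node = tree.children_right[node]
    # The leaf stores per-class values; predict the class with the largest one
    return model.classes_[np.argmax(tree.value[node][0])]


x_new = np.array([1, 1])
print(predict_one(tree_model, x_new), tree_model.predict([x_new])[0])
```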

### (optional) How does `fit` work? 

- Which features are most useful for classification? 
- Minimize **impurity** at each question
- Common criteria to minimize impurity: [gini index](https://scikit-learn.org/stable/modules/tree.html#classification-criteria), information gain, cross entropy

```{admonition} Warning 
:class: warning 
We won't go through **how** it does this; that's covered in CPSC 340. But it's worth noting that it supports two types of inputs: 

1. Categorical (e.g., Yes/No or more options, as shown in the tree above)
2. Numeric (a number)

In the numeric case, the decision tree algorithm also picks the _threshold_. 
```
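
To give a feel for what "minimizing impurity" means, here is a minimal sketch of the Gini impurity of a set of labels, $1 - \sum_k p_k^2$, and the weighted impurity of a candidate split. The function names and toy labels are made up for illustration; this is not `sklearn`'s implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity(left_labels, right_labels):
    """Impurity after a split: child impurities weighted by child sizes."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + (
        len(right_labels) / n
    ) * gini_impurity(right_labels)


parent = ["A+", "A+", "not A+", "not A+"]
print(gini_impurity(parent))  # 0.5: a 50/50 mix of two classes

# A question that separates the classes perfectly gives pure children
print(weighted_impurity(["A+", "A+"], ["not A+", "not A+"]))  # 0.0

# A question that leaves both children mixed does not help
print(weighted_impurity(["A+", "not A+"], ["A+", "not A+"]))  # 0.5
```

A good question to ask at a node is one that brings the weighted impurity of the children as low as possible.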

### Decision trees with continuous features

In [None]:
X.head()

In [None]:
# Trees with continuous features
model = DecisionTreeClassifier()
model.fit(X, y)
display_tree(X.columns, model)

### Decision tree Regressor <a name="5"></a>

- We can also use decision tree algorithm for regression. 
- Instead of gini, we use [some other criteria](https://scikit-learn.org/stable/modules/tree.html#mathematical-formulation) for splitting. A common one is mean squared error (MSE). (More on this in the next block.)
- `scikit-learn` supports regression using decision trees with `DecisionTreeRegressor` 
    - `fit` and `predict` paradigms similar to classification
    - `score` returns something called the [$R^2$ score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score). (More on this in 573.)     
        - The maximum $R^2$ is 1 for perfect predictions. 
        - It can be negative which is very bad (worse than `DummyRegressor`). 
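
The three cases for $R^2$ mentioned above (perfect, as good as predicting the mean, and worse than that) can be checked with a small sketch that computes $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ by hand and compares it with `sklearn.metrics.r2_score`. The helper name `r2_by_hand` and the toy numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]

print(r2_by_hand(y_true, [1.0, 2.0, 3.0]))  # 1.0: perfect predictions
print(r2_by_hand(y_true, [2.0, 2.0, 2.0]))  # 0.0: always predicting the mean
print(r2_by_hand(y_true, [3.0, 2.0, 1.0]))  # -3.0: worse than predicting the mean
```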


In [None]:
regression_df = pd.read_csv("data/quiz2-grade-toy-regression.csv")
regression_df.head()

In [None]:
X = regression_df.drop(["quiz2"], axis=1)
y = regression_df["quiz2"]

depth = 4
reg_model = DecisionTreeRegressor(max_depth=depth)
reg_model.fit(X, y)

# display_tree(X.columns, reg_model)

In [None]:
reg_model.predict(X)
regression_df["predicted_quiz2"] = reg_model.predict(X)
print("R^2 score on the training data: %0.3f\n\n" % (reg_model.score(X, y)))
regression_df.head()

### ❓❓ Questions for you to ponder on

```{admonition} Exercise 2.3.1
1. Should change in features (i.e., binarizing features above) change `DummyClassifier` predictions? 
2. When you play 20-questions game, how do you pick the next question to ask?
``` 

```{admonition} Exercise 2.3.2 True or False 
1. For the decision tree algorithm to work, the feature values must be numeric.  
2. For the decision tree algorithm to work, the target values must be numeric.
3. The decision tree algorithm creates balanced decision trees. 
``` 

<br><br><br><br>

## More terminology

### Parameters and hyperparameters

**Parameters**
: When you call `fit`, a bunch of values get set, like the features to split on and split thresholds. These are called **parameters**. These are automatically learned by the algorithm during training.

**Hyperparameters**
: Even before calling `fit` on a specific data set, we can set some "knobs" that control the learning. These are called **hyperparameters**. These are specified based on: expert knowledge, heuristics, or systematic/automated optimization (more on that in the coming lectures)    
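
The distinction can be seen on a fitted tree. In the sketch below (made-up one-feature data, illustrative names), `max_depth` is a hyperparameter we choose before `fit`, while the feature to split on and the threshold are parameters that only exist after `fit`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one numeric feature (e.g., a quiz score)
X_demo = np.array([[70], [80], [85], [91], [95], [99]])
y_demo = np.array(["not A+", "not A+", "not A+", "A+", "A+", "A+"])

# Hyperparameter: set by us in the constructor, before seeing any data
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
print(stump.get_params()["max_depth"])

# Parameters: learned from the data when we call fit
stump.fit(X_demo, y_demo)
root_feature = stump.tree_.feature[0]      # index of the feature used at the root
root_threshold = stump.tree_.threshold[0]  # learned split threshold at the root
print(root_feature, root_threshold)
```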

#### `max_depth` hyperparameter of decision trees

- `DecisionTreeClassifier` has a number of hyperparameters. 
- Let's explore the `max_depth` hyperparameter, which controls the **depth** of the decision tree, which is the length of the longest path from the tree root to a leaf. (You'll learn about the tree data structure in 512.)

```{admonition} Attention
:class: important
In `sklearn` hyperparameters are set in the constructor. 
```

In [None]:
classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]

In [None]:
model = DecisionTreeClassifier(max_depth=1)
model.fit(X, y)
display_tree(X.columns, model)

**Decision stump**
: A decision tree with only one split (depth=1) is called a **decision stump**. 

In [None]:
model = DecisionTreeClassifier(
    max_depth=4
)  # Let's try another value for the hyperparameter
model.fit(X, y)
display_tree(X.columns, model)

```{seealso}
See [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for other hyperparameters of a tree.
```

<br><br>

### Decision boundary 

What do we do with learned models? So far we have been using them to predict the class of a new instance. Another way to think about them is to ask: what sort of test examples will the model classify as positive, and what sort will it classify as negative? 

#### Example 1: quiz 2 grade prediction 

For visualization, let's consider a subset of the data with only two features. 

In [None]:
X_subset = X[["lab4", "quiz1"]]
X_subset.head()

##### Decision boundary for `max_depth=1` 

In [None]:
depth = 1  # decision stump
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X_subset, y)
plot_tree_decision_boundary_and_tree(
    model, X_subset, y, x_label="lab4", y_label="quiz1"
)

We assume a geometric view of the data. (More on this in lecture 3.) Here, the red region corresponds to the "not A+" class and the blue region corresponds to the "A+" class. The line separating the red region from the blue region is called the **decision boundary** of the model. In our current model, this decision boundary is created by asking one question: `lab4 <= 84.5`. 
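
A decision boundary plot is essentially the model's predictions over a grid of points. Here is a rough sketch of that idea on made-up two-feature data (not the course dataset and not the course's plotting helper): for a decision stump, the predictions change only along the feature that the stump splits on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical (lab4, quiz1) values where lab4 alone separates the classes
X_demo = np.array([[70, 95], [75, 65], [80, 90], [90, 70], [95, 92], [98, 60]])
y_demo = np.array(["not A+", "not A+", "not A+", "A+", "A+", "A+"])
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Build a coarse grid over the two features and predict at every grid point
lab4_values = np.linspace(60, 100, 5)   # 60, 70, 80, 90, 100
quiz1_values = np.linspace(60, 100, 5)
xx, yy = np.meshgrid(lab4_values, quiz1_values)
grid = np.c_[xx.ravel(), yy.ravel()]
grid_preds = stump.predict(grid).reshape(xx.shape)

# Each row is one quiz1 value; columns go from low to high lab4
print(grid_preds)
```

Every row of the printed grid is the same: the prediction flips from "not A+" to "A+" between two lab4 values, regardless of quiz1. That vertical line is the decision boundary.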

##### Decision boundary for `max_depth=3` 

In [None]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_subset, y)
plot_tree_decision_boundary_and_tree(
    model, X_subset, y, x_label="lab4", y_label="quiz1"
)

The decision boundary (i.e., the model) gets a bit more complicated. 

##### Decision boundary for `max_depth=5` 

In [None]:
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_subset, y)
plot_tree_decision_boundary_and_tree(
    model, X_subset, y, x_label="lab4", y_label="quiz1"
)

The decision boundary (i.e., the model) gets even more complicated with `max_depth=5`. 
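
The growth in complexity can also be seen in numbers rather than pictures. The sketch below uses synthetic data (randomly generated points with a made-up diagonal labelling rule, not the quiz2 data) and reports the number of leaves and the training accuracy for the same three depths.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two features in [60, 100) and an assumed diagonal labelling rule,
# which axis-aligned splits can only approximate
rng = np.random.default_rng(0)
X_demo = rng.uniform(60, 100, size=(200, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 160, "A+", "not A+")

results = []
for depth in [1, 3, 5]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_demo, y_demo)
    # Each leaf corresponds to one rectangular region of the decision boundary plot
    results.append((depth, tree.get_n_leaves(), tree.score(X_demo, y_demo)))

for depth, n_leaves, train_acc in results:
    print(f"max_depth={depth}: {n_leaves} leaves, training accuracy {train_acc:.3f}")
```

More leaves means more regions, which is what makes the boundary look more complicated as `max_depth` grows.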

<br><br>

#### Example 2: Predicting country using the longitude and latitude 

Imagine that you are given longitude and latitude of some border cities of USA and Canada along with which country they belong to. Using this training data, you are supposed to come up with a classification model to predict whether a given longitude and latitude combination is in the USA or Canada. 

In [None]:
### US Canada cities data
df = pd.read_csv("data/canada_usa_cities.csv")
df

In [None]:
X = df[["longitude", "latitude"]]

In [None]:
y = df["country"]
y_bin = y.replace("USA", 0)
y_bin = y_bin.replace("Canada", 1)

In [None]:
mglearn.discrete_scatter(X.iloc[:, 0], X.iloc[:, 1], y)
plt.xlabel("longitude")
plt.ylabel("latitude");

##### Real boundary between Canada and USA

In real life, we know where the boundary between the USA and Canada is. 

```{image} img/canada-us-border.jpg
:alt: canada-us-border
:class: bg-primary mb-1
:width: 500px
:align: center
```

<!-- <img src="img/canada-us-border.jpg" height="500" width="500">  -->

[Source](https://sovereignlimits.com/blog/u-s-canada-border-history-disputes)

Here we want to pretend that we do not know this boundary and we want to infer this boundary based on the limited training examples given to us. 

In [None]:
model = DecisionTreeClassifier(max_depth=1)
model.fit(X, y)
plot_tree_decision_boundary_and_tree(
    model,
    X,
    y,
    height=6,
    width=16,
    eps=10,
    x_label="longitude",
    y_label="latitude",
)

In [None]:
model = DecisionTreeClassifier(max_depth=2)
model.fit(X, y)
plot_tree_decision_boundary_and_tree(
    model,
    X,
    y,
    height=6,
    width=16,
    eps=10,
    x_label="longitude",
    y_label="latitude",
)

In [None]:
mglearn.plots.plot_tree_progressive()

<br><br><br><br>

## Final comments and summary

What did we learn today? 

- There is a lot of terminology and jargon used in ML. Some of the basic terminology includes:
    - Features, target, examples, training
    - Classification, regression    
    - Accuracy and error    
    - Parameters and hyperparameters
    - Decision boundary 
- Supervised vs. Unsupervised machine learning 
    - Supervised machine learning is about function approximation whereas unsupervised machine learning is about concisely describing the data.   
    - There are two major types of supervised machine learning problems: classification and regression.    
- Baselines
    - Baselines serve as reference points in ML workflow. 
- Decision trees    
    - are models that make predictions by sequentially looking at features and checking whether they are above/below a threshold
    - learn a hierarchy of if/else questions, similar to questions you might ask in a 20-questions game.       
    - learn axis-aligned decision boundaries (vertical and horizontal lines with 2 features)    
    - One way to control the complexity of decision tree models is by using the depth hyperparameter (`max_depth` in `sklearn`). 
    
- **Decision boundaries** provide a way to visualize what sort of examples will be classified as positive and negative.    

![](img/eva-logging-off.png)
