In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")

<a id="top"></a>

# Homework 10:  PCA and Decision Trees
## Due Monday, August 5th, 11:59 PM PT

In this assignment, we will use PCA to analyze a dataset on presidential elections, then we will build and evaluate decision trees on a dataset of NBA players.

As we are nearing the end of the summer, this assignment is an **optional, extra-credit homework**. Students skipping this assignment will not lose homework points, while students completing this assignment will receive extra credit equivalent to the point value of a normal homework assignment. Also note that for this assignment, **free response questions will be graded on completion and coherence**. Since we are approaching the end of the course and have limited time for grading, you will receive full credit on free response questions as long as your answers demonstrate effort.

You must submit this assignment to Gradescope by the on-time deadline, Monday, August 5th, 11:59 PM PT. Please read the syllabus for the grace period policy. No late submissions beyond the grace period will be accepted. **We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline.** This way, you will have ample time to reach out to staff for support if you encounter difficulties with submission. While course staff is happy to help guide you with submitting your assignment ahead of the deadline, we will not respond to last-minute requests for assistance (TAs need to sleep, after all!).

Please read the instructions carefully when submitting your work to Gradescope.

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you **write your solutions individually**. If you do discuss the assignments with others please **include their names** below.

**Collaborators**: *list collaborators here*

## Grading
Grading is broken down into autograded answers and free response. As noted above, for this assignment free response questions will be graded based on completion and coherence.

<!--
<details>
    <summary>[Click to Expand] <b>Scoring Breakdown</b></summary>-->
|Question|Points|
|---|---|
|1a | 1 |
|1b | 1 |
|1c | 1 |
|1d | 1 |
|1e | 1 |
|2a | 1 |
|2b | 1 |
|3a | 1 |
|3b | 1 |
|3c | 1 |
|4a | 1 |
|4b | 1 |
|4c | 1 |
|5a | 1 |
|5b | 1 |
|5c | 1 |
|5d | 1 |
|Total | 17 |

In [None]:
# Run this cell to set up your notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import sqlalchemy
from pathlib import Path

from matplotlib.colors import ListedColormap

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import tree

# You may get a warning from importing ensemble. It is OK to ignore this warning.
from sklearn import ensemble

plt.style.use('fivethirtyeight') # Use plt.style.available to see more styles
sns.set()
sns.set_context("talk")
np.set_printoptions(threshold=5) # avoid printing out big matrices
%matplotlib inline

<br/><br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# PCA: U.S. Presidential Elections By State

First, we'll start off with PCA on a real-world dataset. If you haven't worked with Principal Component Analysis before, we highly encourage you to take a look at the PCA lab first. PCA really shines on data where you have reason to believe that the data is relatively low in rank. 

We'll look at how states voted in presidential elections between 1972 and 2020. **Our ultimate goal is to show how 2D PCA scatterplots can allow us to identify clusters in a high dimensional dataset.** For this example, that means finding groups of states that vote similarly by plotting their 1st and 2nd principal components.

<br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 1

We explore a dataset on U.S. Presidential Elections by State since 1789, as taken from [Wikipedia](https://en.wikipedia.org/wiki/List_of_United_States_presidential_election_results_by_state).

In [None]:
df = pd.read_csv("data/presidential_elections.csv")
df.head(5)

The data in this table is pretty messy (missing records, inconsistent field naming, etc.), so let's create a clean version.

Running the cell below will produce a clean table, which contains exactly 51 rows (corresponding to the 50 states plus Washington DC) and 13 columns (one for each of the election years from 1972 to 2020). The index of this dataframe is the state name.

In [None]:
# just run this cell
df_clean = ( 
        df.iloc[:, -15:]    
        .drop(['Unnamed: 60'], axis = 1) 
        .rename(columns = {"2000 ‡": "2000", "2016 ‡": "2016", "State.1": "State"}) 
        .drop([51]) 
        .set_index("State")
)
df_clean

Side note: We produced the data cleaning function chain above by inspecting the CSV file. In your personal projects, you may be tempted to use Excel or Google Sheets (What You See Is What You Get, or WYSIWYG) to clean data. While sometimes more convenient, the downside is that you have no record of what you did, so if you have to redownload the data, you must redo the manual cleaning process.

<!-- BEGIN QUESTION -->

<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 1a

What does each row in `df_clean` represent?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 1b

To perform PCA, we need to convert our data into numerical values. Create a `df_numerical` dataframe that replaces all of the "D" characters in `df_clean` with the number 0, and all of the "R" characters with the number 1. 

*Hint:* Use `df.replace` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)).
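As a minimal sketch of how `replace` handles a dictionary of substitutions, the toy example below uses a made-up dataframe (`toy`, with hypothetical `"Y"`/`"N"` values rather than the election data). The `dtype=object` argument and the trailing `astype(int)` are added assumptions that keep the result's dtype explicit across pandas versions.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the election table
toy = pd.DataFrame({"a": ["Y", "N", "Y"], "b": ["N", "N", "Y"]},
                   index=["r1", "r2", "r3"], dtype=object)

# A dict maps several old values to new values in one call
toy_numeric = toy.replace({"Y": 1, "N": 0}).astype(int)
print(toy_numeric)
```

The index and column labels are untouched; only the cell values change.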

In [None]:
df_numerical = ...

In [None]:
grader.check("q1b")

<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 1c

Now **standardize the data**: Center the data so that each column's mean is 0, and scale the data so that each column's variance is 1. Store your result in `df_standardized`.
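To illustrate what centering and scaling do, here is a small sketch on a made-up dataframe (`toy` is hypothetical). It divides by the population standard deviation (`ddof=0`); that choice is an assumption, since pandas uses `ddof=1` by default and the text above does not say which convention is intended.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 10.0, 20.0, 20.0]})

# Center each column, then divide by its standard deviation.
# ddof=0 (population variance) is an assumption; pandas defaults to ddof=1.
toy_standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_standardized.mean().round(10))       # column means are ~0
print(toy_standardized.var(ddof=0).round(10))  # column variances are ~1
```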

In [None]:
df_standardized = ...

In [None]:
grader.check("q1c")

<br/>

We now have our data in a nice and tidy centered and scaled format. Phew! We are now ready to do PCA.

<br/>
<hr style="border: 1px solid #fdb515;" />

### Question 1d: SVD

In the following cell, compute the SVD of `df_standardized`:

$$\texttt{df}\_\texttt{standardized}=U S V^{T}$$


Store the $U$, $S$, and $V^T$ matrices in `u`, `s`, and `vt` respectively. This is one line of simple code (exactly like what we saw in lecture and what you did in lab) using the [`np.linalg.svd`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) function with the `full_matrices` argument set to `False`.
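For reference, the sketch below runs the compact SVD on a small random matrix (the matrix `X` and the seed are made up for illustration) and checks that the three factors multiply back to the original.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # hypothetical 6x3 data matrix
X = X - X.mean(axis=0)        # center the columns

# Compact SVD: u is 6x3, s holds the 3 singular values, vt is 3x3
u, s, vt = np.linalg.svd(X, full_matrices=False)
print(u.shape, s.shape, vt.shape)

# Multiplying the factors back together recovers X (up to rounding)
X_rebuilt = u @ np.diag(s) @ vt
print(np.allclose(X, X_rebuilt))
```

Note that `s` comes back as a 1-D array of singular values, not as a diagonal matrix.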

In [None]:
u, s, vt = ...
u, s, vt

In [None]:
grader.check("q1d")

<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 1e: Get Principal Components

Using your results from the previous part, create a **dataframe** ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)) named `first_2_pcs` that contains exactly the first two columns of the principal components matrix. Label the first column `pc1` and the second column `pc2`, and make sure the index is the default numerical range index (i.e. 0, 1, 2, ...).
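As a rough sketch, assuming the usual convention that the principal components matrix is $U S$ (equivalently $X V$), the toy example below builds it from a random matrix. The data, the seed, and the column names `first` and `second` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))   # hypothetical 5x3 data matrix
X = X - X.mean(axis=0)        # center the columns

u, s, vt = np.linalg.svd(X, full_matrices=False)

pcs = u * s          # same as u @ np.diag(s): scales column j of u by s[j]
pcs_alt = X @ vt.T   # projecting X onto the rows of vt gives the same matrix
print(np.allclose(pcs, pcs_alt))

# Wrap the first two columns in a dataframe; the index defaults to 0, 1, 2, ...
toy_pcs = pd.DataFrame(pcs[:, :2], columns=["first", "second"])
print(toy_pcs.shape)
```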

In [None]:
first_2_pcs = ...
first_2_pcs.head()

In [None]:
grader.check("q1e")

<br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 2

The cell below plots the 1st and 2nd principal components of our 50 states + Washington DC.

In [None]:
# just run this cell
sns.scatterplot(data = first_2_pcs, x = "pc1", y = "pc2");

<!-- BEGIN QUESTION -->

Unfortunately, we have two problems:

1. There is a lot of overplotting, with only 28 distinct dots (out of 51 points). This means that at least some states voted exactly alike in these elections.
2. We don't know which state is which because the points are unlabeled.

<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 2a: Jitter

Let's start by addressing problem 1. 

**In the cell below, create a new dataframe `first_2_pcs_jittered` with a small amount of random noise added to each principal component. In this same cell, create a scatterplot.**

To reduce overplotting, we **jitter** the first two principal components:
* Add a small amount of random, unbiased Gaussian noise to each value using `np.random.normal` ([documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html)) with mean 0 and standard deviation less than 1.
* Don't get caught up on the exact details of your noise generation; it's fine as long as your plot looks roughly the same as the original scatterplot, but without overplotting.
* The amount of noise you add *should not significantly affect* the appearance of the plot; it should simply serve to separate overlapping observations.
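The jittering idea can be sketched on made-up points as follows. The dataframe `pts`, the seed, and the noise scale of 0.05 are arbitrary illustrative choices, and the sketch uses NumPy's seeded `default_rng` generator rather than calling `np.random.normal` directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded generator so the sketch is repeatable

# Hypothetical points with exact duplicates (overplotting)
pts = pd.DataFrame({"x": [0.0, 0.0, 0.0, 2.0, 2.0],
                    "y": [1.0, 1.0, 1.0, -1.0, -1.0]})

# Add zero-mean Gaussian noise; scale=0.05 is an arbitrary small choice
noise = rng.normal(loc=0, scale=0.05, size=pts.shape)
pts_jittered = pts + noise

print(pts.drop_duplicates().shape[0])           # distinct points before jitter
print(pts_jittered.drop_duplicates().shape[0])  # distinct points after jitter
```

Because the noise is small relative to the spacing between the original points, the overall layout is preserved while duplicates separate.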

In [None]:
# first, jitter the data
first_2_pcs_jittered = ...

# then, create a scatter plot
...

<!-- END QUESTION -->

<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 2b

To address Problem 2, we can turn to Plotly. The below cell uses these Plotly guides ([example 1](https://plot.ly/python/text-and-annotations/), [example 2](https://plotly.com/python/hover-text-and-formatting/#disabling-or-customizing-hover-of-columns-in-plotly-express)) to create a scatter plot of the jittered principal components.

In [None]:
# just run this cell
import plotly.express as px

# get the state names from the standardized dataframe's index
first_2_pcs_jittered['state'] = df_standardized.index

# show state names on hover
fig = px.scatter(first_2_pcs_jittered, x="pc1", y="pc2",
                hover_data={"state": True}); 

fig.show(); 

<!-- BEGIN QUESTION -->

Analyze the above plot. In the below cell, address the following two points:
1. Give an example of a cluster of states that vote a similar way. Does the composition of this cluster surprise you? If you're not familiar with U.S. politics, it's fine to just say "No, I'm not surprised because I don't know anything about U.S. politics."
1. Include anything interesting that you observe. You will get credit for this as long as you write something reasonable that you can take away from the plot.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 3

We can also look at the contribution of each year's election results to the values of our principal components.

Below, we define the `plot_pc` function that plots and labels the rows of $V^T$. We then call this function to plot the 1st row of $V^T$, i.e., the row of $V^T$ that corresponds to `pc1`.

**Note**: If you get an error when running this cell, make sure you are properly assigning the `vt` variable from Question 1.

In [None]:
# just run this cell

def plot_pc(col_names, row_mat_vt, k):
    plt.bar(col_names, row_mat_vt[k, :], alpha=0.7)
    plt.xticks(col_names, rotation=90);
    
plt.figure(figsize=(12, 4)) # adjusts size of plot
plot_pc(list(df_standardized.columns), vt, 0);

<!-- BEGIN QUESTION -->


<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 3a

In the cell below, plot the 2nd row of $V^T$, i.e., the row of $V^T$ that corresponds to `pc2`.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->


<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 3b

Using the two above plots of the rows of $V^T$ as well as the original table, give a description of what it means for a point to be in the top-right quadrant of the 2-D scatter plot from Question 2.

In other words, what is generally true about a state with a relatively large positive value for `pc1` (right side of the 2-D scatter plot)? For a large positive value for `pc2` (top side of the 2-D scatter plot)?

Notes:
* `pc2` is pretty hard to interpret, and the staff doesn't really have a consensus on what it means either, so there is not necessarily a correct answer.
* Principal components beyond the first are often hard to interpret (but not always; see the lab).

_Type your answer here, replacing this text._

In [None]:
# feel free to use this cell for scratch work. If you need more scratch space, add cells *below* this one.

# Make sure to put your actual answer in the cell above where it says "Type your answer here, replacing this text"

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/>
<br/>

<hr style="border: 1px solid #fdb515;" />

### Question 3c

To get a better sense of whether our 2D scatterplot captures the whole story, create a **scree plot** for this data. In other words, plot the fraction of the total variance (y-axis) captured by the $i$-th principal component (x-axis).

*Hint:* Be sure to label your axes appropriately! You may find `plt.xticks()` ([documentation](https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.pyplot.xticks.html)) helpful for formatting. Also check out the lab for more on scree plots.
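As a sketch of the quantity a scree plot displays, the example below computes the fraction of variance captured by each component from the singular values of a made-up matrix (plotting is left out). It relies on the standard fact that the variance captured by component $i$ is proportional to the square of the $i$-th singular value; the data and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data whose columns have different spreads
X = rng.normal(size=(8, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)   # center the columns

s = np.linalg.svd(X, compute_uv=False)  # singular values only

# Variance captured by component i is proportional to s[i] ** 2
variance_fraction = s**2 / np.sum(s**2)
print(variance_fraction)
```

The fractions are non-increasing and sum to 1, so plotting them against the component number shows how quickly the captured variance falls off.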

In [None]:
...

<!-- END QUESTION -->

<br/>

From your scree plot above, you should see that the first two principal components capture a large portion of the variance in our cleaned data. It is partially for this reason that the 2D scatter plot of principal components was easy to interpret.

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Decision Trees

Now, we'll be switching gears to try out building decision trees on a different dataset. We will have you train a multi-class classifier with three different models (one-vs-rest logistic regression, decision trees, random forests) and compare the accuracies and decision boundaries created by each. 

<br/><br/>

## [Tutorial] Dataset, EDA, and Classification Task
We'll be looking at a dataset of per-game stats for all NBA players in the 2018-19 season. This dataset comes from [basketball-reference.com](https://www.basketball-reference.com/).

In [None]:
# just run this cell

nba_data = pd.read_csv("data/nba18-19.csv")
nba_data.head(5)

Our goal will be to predict a player's **position** given several other features. The 5 positions in basketball are PG, SG, SF, PF, and C (which stand for point guard, shooting guard, small forward, power forward, and center; [Wikipedia](https://en.wikipedia.org/wiki/Basketball_positions)).

This information is contained in the `Pos` column:

In [None]:
nba_data['Pos'].value_counts()

There are several features we could use to predict this position; check the [Basketball statistics](https://en.wikipedia.org/wiki/Basketball_statistics) page of Wikipedia for more details on the statistics themselves.

In [None]:
nba_data.columns

In this assignment, we will restrict our exploration to two inputs: [Rebounds](https://en.wikipedia.org/wiki/Rebound_(basketball)) (`TRB`) and [Assists](https://en.wikipedia.org/wiki/Assist_(basketball)) (`AST`). Two-input feature models will make our 2-D visualizations more straightforward.

<br/>

### 3-class classification

While we could set out to perform 5-class classification, the results (and visualizations) are slightly more interesting if we try to categorize players into 1 of 3 categories: **Guard**, **Forward**, and **Center**. The below code will take the `Pos` column of our dataframe and use it to create a new column `Pos3` that consists of the values `'G'`, `'F'`, and `'C'` (which stand for Guard, Forward, and Center).

In [None]:
# just run this cell
def basic_position(pos):
    if 'F' in pos:
        return 'F'
    elif 'G' in pos:
        return 'G'
    return 'C'

nba_data['Pos3'] = nba_data['Pos'].apply(basic_position)
nba_data['Pos3'].value_counts()

<br/><br/>

### Data Cleaning and Visualization

Furthermore, since there are **many** players in the NBA (in the 2018-19 season there were 530 unique players), our visualizations can get noisy and messy. Let's restrict our data to only contain rows for players who averaged more than 10 points per game.

In [None]:
# just run this cell
nba_data = nba_data[nba_data['PTS'] > 10]

Now, let's look at a scatterplot of Rebounds (`TRB`) vs. Assists (`AST`).

In [None]:
sns.scatterplot(data = nba_data, x = 'AST', y = 'TRB', hue = 'Pos3');

As you can see, when using just rebounds and assists as our features, we see pretty decent cluster separation. That is, Guards, Forwards, and Centers appear in different regions of the plot.

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 4: Evaluating Split Quality

We will explore different ways to evaluate split quality for classification trees in this question.

<br/>

---

### Question 4a: Entropy

In lecture we defined the entropy $S$ of a node as:

$$ S = -\sum_{C} p_C \log_{2} p_C $$

where $p_C$ is the proportion of data points in a node with label $C$. This function is a measure of the unpredictability of a node in a decision tree. 

Implement the `entropy` function, which outputs the entropy of a node with a given set of labels. The `labels` parameter is a list of labels in our dataset. For example, `labels` could be `['G', 'G', 'F', 'F', 'C', 'C']`.
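As a quick hand check of the formula, consider the example labels `['G', 'G', 'F', 'F', 'C', 'C']`. Each of the three classes has proportion $1/3$, so:

$$ S = -3 \cdot \frac{1}{3} \log_{2} \frac{1}{3} = \log_{2} 3 \approx 1.585 $$

This is the largest entropy possible with three classes. At the other extreme, a node containing only one label has $p_C = 1$ for that label and therefore entropy 0.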


In [None]:
def entropy(labels):
    ...

entropy(nba_data['Pos3'])

In [None]:
grader.check("q4a")

<br/>

---

### Question 4b: Gini impurity

Another metric for determining the quality of a split is **Gini impurity**. This is defined as the chance that a randomly chosen sample from the node would be misclassified if it were assigned a label at random according to the distribution of labels in that node. Gini impurity is a popular alternative to entropy for determining the best split at a node, and it is in fact the default criterion for scikit-learn's `DecisionTreeClassifier`.

We can calculate the Gini impurity of a node with the formula ($p_C$ is the proportion of data points in a node with label $C$):

$$ G = 1 - \sum_{C} {p_C}^2 $$

Note that no logarithms are involved in the calculation of Gini impurity, which can make it faster to compute compared to entropy.

Implement the `gini_impurity` function, which outputs the Gini impurity of a node with a given set of labels. The `labels` parameter is defined similarly to the previous part.
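As a hand check, take the labels `['G', 'G', 'F', 'F', 'C', 'C']` from the previous part. Each class again has proportion $1/3$, so:

$$ G = 1 - 3 \cdot \left(\frac{1}{3}\right)^2 = \frac{2}{3} \approx 0.667 $$

A node containing only one label has $G = 1 - 1^2 = 0$.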


In [None]:
def gini_impurity(labels):
    ...

gini_impurity(nba_data['Pos3'])

In [None]:
grader.check("q4b")

As an optional exercise in probability, try to think of a way to derive the formula for Gini impurity.

<br/>

---

### Question 4c: Weighted Metrics

In lecture, we used **weighted entropy** as a loss function to help us determine the best split. Recall that the weighted entropy is given by:

$$ L = \frac{N_1 S(X) + N_2 S(Y)}{N_1 + N_2} $$

$N_1$ is the number of samples in the left node $X$, and $N_2$ is the number of samples in the right node $Y$. This notion of a weighted average can be extended to other metrics such as Gini impurity simply by changing the $S$ (entropy) function to $G$ (Gini impurity).
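As a worked example on a made-up split (not taken from `nba_data`), suppose the left node is $X$ = `['G', 'G', 'G', 'F']` and the right node is $Y$ = `['C', 'C']`, so $N_1 = 4$ and $N_2 = 2$. Then:

$$ S(X) = -\left(\frac{3}{4} \log_{2} \frac{3}{4} + \frac{1}{4} \log_{2} \frac{1}{4}\right) \approx 0.811, \qquad S(Y) = 0 $$

$$ L = \frac{4 \cdot 0.811 + 2 \cdot 0}{4 + 2} \approx 0.541 $$

The pure right node contributes nothing, and the left node's entropy is weighted by its share of the samples.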

First, implement the `weighted_metric` function. The `left` parameter is a list of labels or values in the left node $X$, and the `right` parameter is a list of labels or values in the right node $Y$. The `metric` parameter is a function which can be `entropy` or `gini_impurity`. For `entropy` and `gini_impurity`, you may assume that `left` and `right` contain discrete labels.

Then, assign `we_pos3_age_30` to the weighted entropy (in the `Pos3` column) of a split that partitions `nba_data` into two groups: a group with players who are 30 years old or older and a group with players who are younger than 30 years old.


In [None]:
def weighted_metric(left, right, metric):
    ...

we_pos3_age_30 = ...
we_pos3_age_30

In [None]:
grader.check("q4c")

We will not go over the entire decision tree fitting process in this assignment, but you now have the basic tools to fit a decision tree. As an optional exercise, try to think about how you would extend these tools to fit a decision tree from scratch.

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 5: Classification

Before fitting any models, let's first split `nba_data` into a training set and test set.


In [None]:
# just run this cell
nba_train, nba_test = train_test_split(nba_data, test_size=0.25, random_state=100)
nba_train = nba_train.sort_values(by='Pos')
nba_test = nba_test.sort_values(by='Pos')

<br/><br/>

<hr style="border: 1px solid #fdb515;" />

## One-vs-Rest Logistic Regression

As we saw at the beginning of the Decision Trees lecture, binary logistic regression extends naturally to multiclass classification through one-vs-rest logistic regression. In essence, one-vs-rest logistic regression builds one binary logistic regression classifier for each of the $N$ classes (in this scenario $N = 3$). We then predict the class whose classifier gives the highest probability.
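To make the one-vs-rest idea concrete, the sketch below builds the binary classifiers by hand on made-up data. The blobs from `make_blobs`, their centers, and the seed are hypothetical, and this manual loop only illustrates the idea; it is not the way the model is set up in the question below.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Hypothetical 3-class toy data: three well-separated 2-D blobs
X, y = make_blobs(n_samples=150, centers=[(0, 0), (6, 0), (3, 6)],
                  cluster_std=1.0, random_state=0)
classes = np.unique(y)

# Fit one binary "class c vs. everything else" classifier per class
binary_models = [LogisticRegression().fit(X, (y == c).astype(int)) for c in classes]

# Column c holds each point's estimated probability of belonging to class c
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])

# Predict the class whose classifier is most confident
preds = classes[np.argmax(probs, axis=1)]
print((preds == y).mean())
```

Each row of `probs` need not sum to 1, because the three classifiers are fit independently; only the largest entry in each row matters for the prediction.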


### Question 5a

In the cell below, set `logistic_regression_model` to be a one-vs-rest logistic regression model. Then, fit that model using the `AST` and `TRB` columns (in that order) from `nba_train` as our features, and `Pos3` as our response variable.

Remember, `sklearn.linear_model.LogisticRegression` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)) has already been imported for you. There is an optional parameter **`multi_class`** you need to specify in order to make your model a multi-class one-vs-rest classifier. See the documentation for more details.


In [None]:
logistic_regression_model = ...
...

In [None]:
grader.check("q5a")

<br/><br/>

### [Tutorial] Visualizing Performance

To see our classifier in action, we can use `logistic_regression_model.predict` and see what it outputs.

In [None]:
# just run this cell
nba_train['Predicted (OVRLR) Pos3'] = logistic_regression_model.predict(nba_train[['AST', 'TRB']])
nba_train[['AST', 'TRB', 'Pos3', 'Predicted (OVRLR) Pos3']].head(15)

Our model does decently well here, as you can see in the table above. Below, we compute the training accuracy; remember that `model.score()` computes accuracy.
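As a small illustration of what `score` returns for a classifier, the sketch below compares it with the fraction of matching predictions computed by hand. The data and label names are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D toy data with two classes
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array(["a", "a", "a", "b", "b", "b"])

model = LogisticRegression().fit(X, y)

# Accuracy is the fraction of predictions that match the true labels
manual_accuracy = np.mean(model.predict(X) == y)
print(np.isclose(manual_accuracy, model.score(X, y)))
```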

In [None]:
lr_training_accuracy = logistic_regression_model.score(nba_train[['AST', 'TRB']], nba_train['Pos3'])
lr_training_accuracy

We can compute the test accuracy as well by looking at `nba_test` instead of `nba_train`:

In [None]:
lr_test_accuracy = logistic_regression_model.score(nba_test[['AST', 'TRB']], nba_test['Pos3'])
lr_test_accuracy

Now, let's draw the decision boundary for this logistic regression classifier, and see how the classifier performs on both the training and test data.

In [None]:
# Just run this cell to save the helper function.
def plot_decision_boundaries(model, nba_dataset, title=None, ax=None):
    sns_cmap = ListedColormap(np.array(sns.color_palette())[0:3, :])

    xx, yy = np.meshgrid(np.arange(0, 12, 0.02), np.arange(0, 16, 0.02))
    Z_string = model.predict(np.c_[xx.ravel(), yy.ravel()])
    categories, Z_int = np.unique(Z_string, return_inverse = True)
    Z_int = Z_int.reshape(xx.shape)
    
    if ax is None:
        plt.figure()
        ax = plt.gca()
        
    ax.contourf(xx, yy, Z_int, cmap = sns_cmap)
    
    sns.scatterplot(data = nba_dataset, x = 'AST', y = 'TRB', hue = 'Pos3', ax=ax)

    if title is not None:
        ax.set_title(title)

In [None]:
# Run this cell to suppress all UserWarnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
# Just run this cell
plot_decision_boundaries(logistic_regression_model, nba_train, "Logistic Regression on nba_train")

In [None]:
# Just run this cell
plot_decision_boundaries(logistic_regression_model, nba_test, "Logistic Regression on nba_test")

Our one-vs-rest logistic regression was able to find linear decision boundaries between the three classes. It generally classifies centers as players with a lot of rebounds, forwards as players with a medium number of rebounds and a low number of assists, and guards as players with a low number of rebounds.

Note: In practice we would use many more features – we only used 2 here just so that we could visualize the decision boundary.

<br/>
<br/>

<hr style="border: 1px solid #fdb515;" />

## Decision Trees

### Question 5b

Let's now create a decision tree classifier on the same training data `nba_train`, and look at the resulting decision boundary. 

In the following cell, first, use `tree.DecisionTreeClassifier` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)) to fit a model using the same features and response as above, and call this model `decision_tree_model`. Set the `random_state` and `criterion` parameters to 42 and `entropy`, respectively.

**Hint:** Your code will be mostly the same as the previous part.


In [None]:
decision_tree_model = ...
...

In [None]:
grader.check("q5b")

### [Tutorial] Decision Tree Performance

Now, let's draw the decision boundary for this decision tree classifier, and see how the classifier performs on both the training and test data.

In [None]:
# Just run this cell
plot_decision_boundaries(decision_tree_model, nba_train, "Decision Tree on nba_train")

In [None]:
# Just run this cell. You may get a UserWarning--it is OK to ignore this warning.
plot_decision_boundaries(decision_tree_model, nba_test, "Decision Tree on nba_test")

We compute the training and test accuracies of the decision tree model below.

In [None]:
dt_training_accuracy = decision_tree_model.score(nba_train[['AST', 'TRB']], nba_train['Pos3'])
dt_test_accuracy = decision_tree_model.score(nba_test[['AST', 'TRB']], nba_test['Pos3'])
dt_training_accuracy, dt_test_accuracy

<br/>
<br/>

<hr style="border: 1px solid #fdb515;" />

## Random Forests

### Question 5c

Let's now create a random forest classifier on the same training data `nba_train` and look at the resulting decision boundary. 

In the following cell, use `ensemble.RandomForestClassifier` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)) to fit a model using the same features and response as above, and call this model `random_forest_model`. Use 20 trees in your random forest classifier; set the `random_state` and `criterion` parameters to 42 and `entropy`, respectively.

**Hint:** Your code will be mostly the same as in the previous parts of this question.

**Hint:** Look at the `n_estimators` parameter of `ensemble.RandomForestClassifier`.


In [None]:
random_forest_model = ...
...

In [None]:
grader.check("q5c")

### [Tutorial] Random Forest Performance

Now, let's draw the decision boundary for this random forest classifier, and see how the classifier performs on both the training and test data.

In [None]:
# Just run this cell. You may get a UserWarning--it is OK to ignore this warning.
plot_decision_boundaries(random_forest_model, nba_train, "Random Forest on nba_train")

In [None]:
# Just run this cell. You may get a UserWarning--it is OK to ignore this warning.
plot_decision_boundaries(random_forest_model, nba_test, "Random Forest on nba_test")

We compute the training and test accuracies of the random forest model below.

In [None]:
# just run this cell
rf_train_accuracy = random_forest_model.score(nba_train[['AST', 'TRB']], nba_train['Pos3'])
rf_test_accuracy = random_forest_model.score(nba_test[['AST', 'TRB']], nba_test['Pos3'])
rf_train_accuracy, rf_test_accuracy

<br/>
<br/>

<hr style="border: 1px solid #fdb515;" />

## Compare/Contrast

How do the three models you created (multiclass one-vs-rest logistic regression, decision tree, random forest) compare to each other?

**Decision boundaries**: Run the below cell for your convenience. It overlays the decision boundaries for the train and test sets for each of the models you created.

In [None]:
# Just run this cell

fig, axs = plt.subplots(2, 3, figsize=(12, 6))
for j, (model, title) in enumerate([(logistic_regression_model, "Logistic Regression"),
                                    (decision_tree_model, "Decision Tree"),
                                    (random_forest_model, "Random Forest")]):
    axs[0, j].set_title(title)
    for i, nba_dataset in enumerate([nba_train, nba_test]):
        plot_decision_boundaries(model, nba_dataset, ax=axs[i, j])
        
# reset leftmost ylabels
axs[0, 0].set_ylabel("nba_train\nTRB")
axs[1, 0].set_ylabel("nba_test\nTRB")
fig.tight_layout()

**Performance Metrics**: Run the below cell for your convenience. It summarizes the train and test accuracies for the three models you created.

In [None]:
# Just run this cell
train_accuracy = [lr_training_accuracy, lr_test_accuracy, dt_training_accuracy, dt_test_accuracy, rf_train_accuracy, rf_test_accuracy]
index = ['OVR Logistic Regression', 'Decision Tree', 'Random Forest']
df = pd.DataFrame([(lr_training_accuracy, lr_test_accuracy), 
                   (dt_training_accuracy, dt_test_accuracy),
                   (rf_train_accuracy, rf_test_accuracy)], 
                  columns=['Training Accuracy', 'Test Accuracy'], index=index)
df.plot.bar();
plt.legend().remove() # remove legend from plot itself
plt.gcf().legend(loc='lower right') # and add legend to bottom of figure

---

### Question 5d

<!-- BEGIN QUESTION -->


Looking at the three models, which model performed the best on the training set, and which model performed the best on the test set? How are the training and test accuracy related for the three models, and how do the decision boundaries generated for each of the three models relate to the model's performance?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Congratulations on finishing Homework 10!

<img src="images/mitski.jpg" width="500"/>

Mitski the cat congratulates you!

### Course Content Feedback

If you have any feedback about this assignment or about any of our other weekly assignments, lectures, or discussions, please fill out the [Course Content Feedback Form](https://forms.gle/owfPCGgnrju1xQEA9). Your input is valuable in helping us improve the quality and relevance of our content to better meet your needs and expectations!

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Submit this file to the Homework 10 assignment on Gradescope. If you run into any issues when running this cell, [this section of the debugging guide](https://ds100.org/debugging-guide/autograder_gradescope/autograder_gradescope.html#why-does-grader.exportrun_teststrue-fail-if-all-previous-tests-passed) may help.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)