# Project Activity: Automated Decision Trees

## Group Names and Roles

- Partner 1 (Role)
- Partner 2 (Role)
- Partner 3 (Role)

## Introduction

Last time, you used `pandas` summary tables to make informed guesses about good *decision trees* -- flow-chart-like rules for guessing the species of a penguin. In this activity, we'll use `scikit-learn` to build better decision trees automatically.

Let's begin by importing all the libraries we'll need, and by downloading the penguins dataset:

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import tree, preprocessing, model_selection
import numpy as np

In [None]:
url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

### §1. Preparing your data

For this activity, we will use only the following columns: `"Species"`, `"Flipper Length (mm)"`, `"Body Mass (g)"`, `"Sex"`. (Index `penguins` with a list of these column names using square brackets, and assign the result back to `penguins`.)
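
As a rough illustration of list-based column selection, here is a minimal sketch on a small made-up table. The data frame `toy`, its values, and the list `cols` are assumptions for the example only; they are not part of the penguins data.

```python
import pandas as pd

# Toy stand-in for a larger table (values invented for illustration).
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap"],
    "Island": ["Torgersen", "Biscoe", "Dream"],
    "Body Mass (g)": [3750.0, 5000.0, 3500.0],
})

# Indexing with a *list* of column names returns a data frame
# containing only those columns, in the order listed.
cols = ["Species", "Body Mass (g)"]
toy = toy[cols]

print(toy.shape)
```

Note the double brackets that result when the list is written inline, as in `toy[["Species", "Body Mass (g)"]]`: the outer pair is the indexing operator and the inner pair is the list.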

In [None]:
# your code here


Next, inspect the `penguins` dataframe. You should have 344 rows and 4 columns.

In [None]:
# your code here


You might have noticed that your dataframe contains rows with `NaN` values. Calling `.dropna()` on the dataframe returns a new dataframe with these rows removed. Do this below, and reassign the result back to `penguins`.
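
To see what `.dropna()` does before applying it to the real data, here is a small sketch on an invented table. The names `toy` and `n_missing` are hypothetical and used only for this example.

```python
import numpy as np
import pandas as pd

# Toy table with missing entries in two different rows.
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "Sex": ["MALE", np.nan, "FEMALE", "FEMALE"],
    "Body Mass (g)": [3750.0, 5000.0, np.nan, 3400.0],
})

# Count the rows that contain at least one NaN.
n_missing = toy.isna().any(axis=1).sum()

# dropna() returns a new data frame; the original is unchanged
# unless the result is assigned back.
toy = toy.dropna()

print(n_missing, toy.shape)
```

By default a row is dropped if any of its entries is missing, which is why both incomplete rows disappear here.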

In [None]:
# your code here


Look at your dataframe once again. You should have 334 rows and 4 columns.

In [None]:
# your code here


Run the next cell. It seeds NumPy's random number generator, so that the random values your code generates are the same every time you run it.

In [None]:
np.random.seed(1000)
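
One way to convince yourself that seeding works is to reseed and draw again. This is only a sketch; the names `first` and `second` are made up for the example.

```python
import numpy as np

# Seed, then draw three random numbers.
np.random.seed(1000)
first = np.random.rand(3)

# Reseed with the same value and draw again.
np.random.seed(1000)
second = np.random.rand(3)

# The two draws are identical because the generator was reset
# to the same state before each one.
print(np.array_equal(first, second))
```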

Our goal is to build a model that predicts the species of a penguin based on the other features that you now have in the `penguins` dataframe. With this in mind, split your dataframe into `X` and `y` (predictor variables and target variable). Then split `X` and `y` into training and test data (80% of the rows for training, 20% for testing).
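
The two-stage split can be sketched on invented data as follows. This assumes `train_test_split` from `sklearn.model_selection` is the tool you want; the table `toy`, its column names, and the `_tr`/`_te` variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(1000)

# Toy table: one target column and two predictor columns.
toy = pd.DataFrame({
    "label": ["a", "b"] * 5,
    "feature_1": np.arange(10),
    "feature_2": np.arange(10) * 2.0,
})

# Stage 1: separate the predictors from the target.
X = toy.drop(columns=["label"])
y = toy["label"]

# Stage 2: hold out 20% of the rows as test data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

print(X_tr.shape, X_te.shape)
```

The rows are shuffled before splitting, so the test rows are a random sample rather than the last 20% of the table.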

In [None]:
# Use the names X_train, X_test, y_train, y_test in your code


Write a `clean_data` function to clean up your data. The function should take `X` and `y` as arguments; you will call it separately on the training data and on the test data. Your function should:

- make copies of the given inputs using the `.copy()` method

- encode the sex and the species of the penguins as integers

- return three elements: two `numpy` arrays containing the data from `X` and `y` respectively, and the names of the columns of `X`
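
The encoding step is probably the least familiar of these. One possible approach, sketched here on invented data, uses `preprocessing.LabelEncoder`; the names `toy_X`, `toy_y`, `le`, `X_array`, `y_array`, and `column_names` are assumptions for the example, and other encodings would work too.

```python
import pandas as pd
from sklearn import preprocessing

# Toy predictors and target (values invented for illustration).
toy_X = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", "FEMALE"],
    "Body Mass (g)": [3750.0, 3400.0, 5000.0],
})
toy_y = pd.Series(["Adelie", "Adelie", "Gentoo"], name="Species")

# Work on copies so the originals are left untouched.
X = toy_X.copy()
y = toy_y.copy()

# LabelEncoder assigns integers to the sorted unique labels,
# so here FEMALE -> 0 and MALE -> 1.
le = preprocessing.LabelEncoder()
X["Sex"] = le.fit_transform(X["Sex"])

# Refit the same encoder object on the target labels.
y_array = le.fit_transform(y)

# Convert the predictors to a numpy array and keep the column names.
X_array = X.values
column_names = list(X.columns)

print(X_array, y_array, column_names)
```

Because the encoder is fit on whatever labels it is given, encoding two datasets separately yields matching integer codes only when the same set of labels appears in both.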

In [None]:
def clean_data(X, y):
    # your code here
    pass

Now run the following cell.

In [None]:
X_train, y_train, labels = clean_data(X_train, y_train)
X_test,  y_test,  labels = clean_data(X_test, y_test)

To make sure that you understand what is going on, observe the output of the following:

In [None]:
print(X_train, y_train, labels)

### §2. Training a model

Using the training data you generated in the previous part, train a decision tree classification model `T` with a `max_depth` value of 20. Score your model **against the training data**. Then score your model again, this time **against the test data**. Print both scores and observe the output.
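
The fit-then-score pattern looks roughly like the sketch below, which runs on synthetic data rather than the penguins. The arrays `X_toy` and `y_toy`, the classifier name `clf`, and the depth of 5 are all assumptions chosen for the example.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic two-feature data: the label depends on the first
# feature plus some noise, so it cannot be predicted perfectly.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# max_depth caps how many questions the tree may ask in a row.
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_tr, y_tr)

# score() returns the fraction of correct predictions.
train_score = clf.score(X_tr, y_tr)
test_score = clf.score(X_te, y_te)

print(train_score, test_score)
```

The model is always fit on the training data only; the test data is used solely for the second call to `score()`.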

In [None]:
# your code here


Again, using the training data you generated in the previous part, train another model, also named `T`. This time, use a `max_depth` value of 3. Just as above, score your model **against the training data**, and again **against the test data**. Print both scores and observe the output.

In [None]:
# your code here


Discuss your observations in the next cell. Which model is better? Is a model better when it performs better against the training data or the test data? Why does one model perform better against the training data while the other performs better against the test data?

---

*your discussion here*

---

Run the next cell to visualize your model.

In [None]:
fig, ax = plt.subplots(1, figsize = (20, 20))
p = tree.plot_tree(T, filled = True, feature_names = labels)

### §3. Cross-validation

Now estimate the optimal tree depth using cross-validation, and plot the results, as follows.

Make an empty plot. The x-axis will be the tree depth, and the y-axis will be the cross-validation score. Label your axes.

Write a `for` loop over tree depths from 1 to 30. On each iteration, train a decision tree model with that `max_depth`, compute its cross-validation score on the training data, and add the point (depth, score) to your scatterplot. Then compare the score to the best score so far, and update the best score and the best max depth if the new score is higher.
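
The bookkeeping inside such a loop might look like the following sketch, which leaves out the plotting and uses synthetic data with a shorter depth range. It assumes `cross_val_score` from `sklearn.model_selection` with 5 folds, averaged into a single number; the names `X_toy`, `y_toy`, `cv_scores`, `best_score`, and `best_depth` are hypothetical.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a training set.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

best_score = -np.inf
best_depth = None
cv_scores = {}

for depth in range(1, 6):
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    # cross_val_score returns one score per fold; average them.
    score = cross_val_score(clf, X_toy, y_toy, cv=5).mean()
    cv_scores[depth] = score
    # Keep track of the best depth seen so far.
    if score > best_score:
        best_score, best_depth = score, depth

print(best_depth, best_score)
```

Using a strict `>` comparison means that ties go to the shallower tree, which is usually the simpler model to interpret.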

In [None]:
# your code here


Lastly, train a decision tree classification model `T` using the best max depth. Score the model against the test data. Print the score and observe the output.

In [None]:
# your code here


### §4. Decision Regions

If you have gotten this far and still have time, experiment with using different columns than the ones specified in §1. How accurate can your model be on test data?