# Quiz 3 Starter Code

You can use this notebook to answer the questions in Quiz 3, on Decision Trees and Random Forests. Some starter code and hints are also provided to keep you on the right track.

Details are given in the quiz about the provenance of the data we're using. To recap what the features are:

- `median_income` is the median income in that block group
- `pct_broadband` is the percentage with access to broadband
- `pct_white` and `pct_black` are the percentages of each Census block group's population that are White and Black, respectively
- `pop_density` is the population density of the block group (e.g. in units of population per square kilometer)
- `max_speed` is the maximum contractual downstream speed offered by any provider in each Census block group
- `num_isp` is the number of unique ISPs that offer service in each Census block group
- `num_broadband` is the number of unique ISPs that offer service at or above 25 Mbps downstream and 3 Mbps upstream in each Census block group (this is the FCC's definition of broadband Internet access, which you can read more about in the 2019 broadband deployment report)

First, we load the dataset:

In [2]:
import pandas as pd
df = pd.read_csv("../data/fcc_acs.csv")

### Question 1

Now, train a decision tree classifier to predict if a Census block group has broadband Internet access or not (i.e., at least one ISP that offers service at or above 25 Mbps downstream and 3 Mbps upstream). Tune your classifier with a hyperparameter grid and use k-fold cross validation. 

- Divide the dataset into training and testing sets, with `test_size=0.2`.
- Use `random_state=0` both when splitting your data and when training your classifiers.
- Preprocess your data by imputing missing values with the mean of the column from the training set. Note that (as you saw in the previous assignment) an administrative code, not NaN, is used for some missing values in `median_income`.
- Given the size of this dataset, it will be sufficient to forgo the validation set and only separate your data into training and testing sets (i.e., use `sklearn.model_selection.train_test_split`)
- Your classifier should use the following features:
    - The ACS population characteristics described above (percentage of population that is White and Black, median income, population density)
    - The number of ISPs serving that block group
- Tune your classifier with a hyperparameter grid, using the following hyperparameter options:
    - `criterion`: `gini` or `entropy`
    - `max_depth`: 1, 3, 5
    - `min_samples_split`: 2, 5, 10
- Use K-fold cross validation with `k = 10`, using accuracy as your scoring metric.
- For scoring, use accuracy, precision, and recall.

What is the best mean test accuracy?

In [2]:
# First, divide your data into train and test sets, using random_state = 0 and a split of .2
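
As a rough illustration of the split described above, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and its values are invented for the example and are not the quiz data; only `train_test_split`, `test_size`, and `random_state` come from the instructions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the quiz data (all values are made up).
toy = pd.DataFrame({
    "median_income": [52000, 61000, 48000, 75000, 58000,
                      66000, 43000, 81000, 59000, 70000],
    "num_broadband": [0, 2, 0, 3, 1, 2, 0, 4, 1, 2],
})

# 80/20 split; random_state=0 makes the shuffle reproducible.
train, test = train_test_split(toy, test_size=0.2, random_state=0)
print(len(train), len(test))  # 8 2
```

Splitting the whole DataFrame first, and selecting feature and target columns afterwards, keeps the later preprocessing steps simple because each set stays in one piece.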

In [3]:
# Next, begin to preprocess your data:

# First, for median income, replace the administrative code that indicates missing data with NaN
# Then, fill the missing values in each column with the mean of that column from the training set
# Hints: after you do this, the mean of the median_income column in the training set should be about 60906;
# in the testing set, the median_income column should have a mean of about 57194
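
The two preprocessing steps above might look something like the sketch below. The sentinel value `MISSING_CODE = -999` is a hypothetical placeholder chosen for the example; the actual administrative code has to be looked up in your data. The toy frames are also invented.

```python
import numpy as np
import pandas as pd

MISSING_CODE = -999  # hypothetical sentinel; find the real code in your data

train = pd.DataFrame({"median_income": [50000.0, MISSING_CODE, 70000.0]})
test = pd.DataFrame({"median_income": [MISSING_CODE, 40000.0]})

# 1. Turn the sentinel into NaN so it cannot distort the column mean.
train = train.replace(MISSING_CODE, np.nan)
test = test.replace(MISSING_CODE, np.nan)

# 2. Compute means on the training set only, then fill BOTH sets with them.
train_means = train.mean()
train = train.fillna(train_means)
test = test.fillna(train_means)

print(train["median_income"].tolist())  # [50000.0, 60000.0, 70000.0]
print(test["median_income"].tolist())   # [60000.0, 40000.0]
```

Filling the test set with training-set means avoids leaking information from the test data into preprocessing.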

In [4]:
# The final preprocessing step will be generating your target: you're predicting whether or not a block has broadband,
# not the number of unique ISPs with broadband, which is what your data currently has.
# Create a has_broadband column in the train and test sets.
# Assign 1 to it if num_broadband is greater than 0, and 0 otherwise.
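
One possible way to build the binary target, shown on an invented four-row frame (only the column names `num_broadband` and `has_broadband` come from the instructions above):

```python
import pandas as pd

# Made-up counts of broadband ISPs per block group.
train = pd.DataFrame({"num_broadband": [0, 1, 3, 0]})

# The comparison yields booleans; astype(int) maps True/False to 1/0.
train["has_broadband"] = (train["num_broadband"] > 0).astype(int)
print(train["has_broadband"].tolist())  # [0, 1, 1, 0]
```

The same line would be applied to the test set as well.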

In [5]:
# Finally, use GridSearch to tune a decision tree classifier, with has_broadband as the target
# Use 10 folds and the parameters above
# When creating your decision tree, remember to use random_state=0

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
k = 10 # Set the number of folds

params = { #Set the parameters to use for GridSearch
    'criterion': ('gini', 'entropy'),
    'max_depth': (1, 3, 5),
    'min_samples_split': (2, 5, 10)
}

# Your code to conduct your GridSearch here

# Afterwards, you can access the best score with .best_score_
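
For orientation, a grid search of this shape could be wired up roughly as follows. This is a sketch on synthetic data from `make_classification`, so the scores it prints say nothing about the quiz answer; the variable names `X_train`, `y_train`, and `search` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five quiz features and the binary target.
X_train, y_train = make_classification(n_samples=300, n_features=5, random_state=0)

params = {
    "criterion": ("gini", "entropy"),
    "max_depth": (1, 3, 5),
    "min_samples_split": (2, 5, 10),
}

# cv=10 gives 10-fold cross validation; every combination in params is tried.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=params,
    cv=10,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_score_)   # best mean cross-validated accuracy
print(search.best_params_)  # the combination that produced it
```

The grid has 2 x 3 x 3 = 18 combinations, so with 10 folds this fits 180 trees plus one final refit on the full training set.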

### Question 2

You should find that one of the best trees uses "entropy" as its splitting criterion, a max depth of 3, and a minimum samples split of 10.

For a tree trained with those parameters, what is the most important feature?

In [None]:
# Train a tree with the parameters above
# Then, find the feature with the highest importance
# You can access feature importances with the fitted tree's .feature_importances_ attribute (e.g. my_tree.feature_importances_)
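
A minimal sketch of reading off feature importances, again on synthetic data. The column names are borrowed from the feature list above purely as labels; because the values are random, the feature that wins here is unrelated to the quiz answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_names = ["pct_white", "pct_black", "median_income", "pop_density", "num_isp"]

# Synthetic values; only the column labels mirror the quiz features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=feature_names)

tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, min_samples_split=10, random_state=0
)
tree.fit(X, y)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
print(importances.idxmax())  # name of the most important feature
```

Wrapping the array in a `pd.Series` avoids having to match positions in `feature_importances_` to column names by hand.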

### Question 3

Now, you want to plot the confusion matrix for the test set for the tree with the hyperparameters above (split on entropy, max depth of 3, and min samples split of 10). Which of the following matrices corresponds to that tree?

In [2]:
# Plot the confusion matrix for the tree with the hyperparameters above
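# You can use sklearn.metrics.plot_confusion_matrix, which takes the tree, the test X, and the test y as parameters
# (it was removed in scikit-learn 1.2; in newer versions use sklearn.metrics.ConfusionMatrixDisplay.from_estimator with the same arguments)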
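
If you want the raw counts rather than a plot, something like the following sketch may help. It uses `sklearn.metrics.confusion_matrix` on synthetic data, so the numbers do not correspond to any of the answer choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the quiz features and target.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, min_samples_split=10, random_state=0
).fit(X_train, y_train)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, tree.predict(X_test))
tn, fp, fn, tp = cm.ravel()  # binary case: unpack the four cells
print(cm)
```

In scikit-learn's layout, the top-left cell is the true negatives and the bottom-right cell is the true positives, which is worth checking against the orientation of any plotted matrix.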

### Question 4

What is the F1 score for this confusion matrix?

<img src="../assets/man_calc.PNG">

Hint:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

In [None]:
# This asks you to calculate these metrics manually
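
The formulas in the hint translate directly into arithmetic. The counts below are hypothetical and are not the values in the image; substitute the ones you read off the confusion matrix.

```python
# Hypothetical counts, NOT the values from the quiz image.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # 40 / 50 = 0.8
recall = tp / (tp + fn)     # 40 / 60, about 0.667
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note that true negatives do not appear in any of the three formulas.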

### Question 5

Next, train a Random Forest classifier to predict broadband Internet deployment. As before, tune your classifier with a hyperparameter grid and use k-fold cross validation.

- Use scikit-learn's `GridSearchCV` class
- Continue to use `random_state=0` when both splitting your data and training your classifiers.
- Given the size of this dataset, it will be sufficient to forgo the validation set and only separate your data into training and testing sets (i.e., use `sklearn.model_selection.train_test_split`)
- Your classifier should use the following features:
    - The ACS population characteristics described above (percentage of population that is White and Black, median income, population density)
    - The number of ISPs serving that block group
- Tune your classifier with a hyperparameter grid, using the following hyperparameter options (note the differences for Random Forests):
    - `n_estimators`: 5, 50, 100
    - `criterion`: `gini` or `entropy`
    - `max_depth`: 1, 3, 5
    - `min_samples_split`: 2, 5, 10
- Use K-fold cross validation with `k = 10`, using accuracy as your scoring metric.

Note: In practice, you may want to use thousands of estimators. We only try a few here to save time--and this may still take a few minutes. You may want to test your code first on a sample of the larger dataset.

Which combination of hyperparameters gives the best score?

In [7]:
# Your process will be very similar to when you did Grid Search on a decision tree
k = 10 # Set the number of folds
params = { # Set the hyperparameters
    'n_estimators': (5,50,100),
    'criterion': ('gini', 'entropy'),
    'max_depth': (1, 3, 5),
    'min_samples_split': (2, 5, 10)
}

# Your code to conduct your Grid Search here

# Finally, find the parameters that produced the best score (you can access with .best_params_)
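
Following the suggestion to test on a sample first, a dry run could look roughly like this. Everything here is a sketch under stated assumptions: the data is synthetic, the column names `f0` to `f4` are invented, and `small_params` is a deliberately reduced grid (not the one required by the question) so the example finishes in seconds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
data = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
data["has_broadband"] = y

# Dry run on a 20% sample to check that the code works before the full search.
sample = data.sample(frac=0.2, random_state=0)
X_small = sample.drop(columns="has_broadband")
y_small = sample["has_broadband"]

# Reduced grid for the dry run only; swap in the full grid afterwards.
small_params = {
    "n_estimators": (5, 20),
    "criterion": ("gini", "entropy"),
    "max_depth": (1, 3),
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_params,
    cv=10,
    scoring="accuracy",
)
search.fit(X_small, y_small)
print(search.best_params_)
```

Once the dry run succeeds, the same code can be pointed at the full training set and the full hyperparameter grid.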

### Question 6

So far, we've been using GridSearch and Cross-Validation (with GridSearchCV) to find the best models. Internally, this method is running "predict" on each validation fold, computing the evaluation metrics (accuracy, precision, etc.) for each fold and returning the average for the metrics. Predict uses 0.5 as the default classification threshold, meaning every observation with a probability score above 0.5 is classified as True.

In many cases, this is not ideal. Instead, we might want to choose a classification threshold ourselves, rather than use the default 0.5 cutoff. The threshold might be a fixed value (say every observation with a probability above 0.7 receives a predicted label of True). Alternatively, the threshold might be based on a percentage of the total observations. Which approach to choose will depend on your problem. Here, we will look only at the former.
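
To make the fixed-threshold idea concrete, here is a small sketch that applies a cutoff to predicted probabilities by hand. The data is synthetic, the forest's settings are arbitrary, and 0.7 is just the example value from the paragraph above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the forest settings are arbitrary choices for the example.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
forest = RandomForestClassifier(
    n_estimators=20, max_depth=3, random_state=0
).fit(X, y)

# Column 1 of predict_proba holds the probability of the positive class.
proba = forest.predict_proba(X)[:, 1]

default_labels = (proba > 0.5).astype(int)  # the usual 0.5 cutoff
strict_labels = (proba > 0.7).astype(int)   # a stricter, hand-picked cutoff

print(default_labels.sum(), strict_labels.sum())
```

Raising the threshold can only shrink the set of predicted positives, which tends to trade recall for precision.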

One way to approach finding a good fixed threshold is to look at how different thresholds affect your evaluation metrics. You may want to choose one that optimizes for precision or recall, or one that is in a 'sweet spot' that nicely balances precision and recall.

One of the best Random Forest classifiers from the Grid Search in the previous question had the following parameters:
{'criterion': 'gini',
 'max_depth': 5,
 'min_samples_split': 5,
 'n_estimators': 100}

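Train a Random Forest classifier with the parameters above on your training sets, once again using `random_state=0`. Then plot a precision-recall curve using the test sets. You can use `sklearn.metrics.plot_precision_recall_curve` (or, in scikit-learn 1.2 and later, where that function was removed, `sklearn.metrics.PrecisionRecallDisplay.from_estimator`).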

Which combination of precision and recall seems like a good balance that, with the right threshold, would be achievable by your Random Forest classifier?

In [10]:
# First, train a Random Forest Classifier with the parameters above
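# Then, plot a precision recall curve. We import the plotting helper for you below
# (plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay.from_estimator replaces it)
from sklearn.metrics import PrecisionRecallDisplay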

# Eyeball where we get the best precision and the best recall
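
As a numeric complement to eyeballing the plot, the sketch below computes the points of the curve with `sklearn.metrics.precision_recall_curve` and picks the threshold with the highest F1. Using F1 as the notion of "balance" is an assumption made for this example, not something the quiz prescribes, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Noisy synthetic data so the curve is not trivially perfect.
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(
    criterion="gini", max_depth=5, min_samples_split=5,
    n_estimators=100, random_state=0,
).fit(X_train, y_train)

scores = forest.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)

# precision and recall have one more entry than thresholds: the final
# (precision=1, recall=0) point has no threshold, so it is dropped here.
p, r = precision[:-1], recall[:-1]
denom = p + r
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best = np.argmax(f1)
print(thresholds[best], p[best], r[best])
```

Each entry of `thresholds` corresponds to one (precision, recall) point on the plotted curve, so this is a way to recover the cutoff behind a point you like.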