

# Hyperparameter Search Lab

**Objective**: *Apply grid-search hyperparameter optimization to improve the performance of a model.*

In this lab, you will apply what you've learned in this lesson. When you have finished, use your answers to the exercises to complete the accompanying quiz in Coursera.

In [0]:
%run "../../Includes/Classroom-Setup"


## Exercise 1

In this exercise, you will create an enhanced user-level table to better predict whether each user averages at least 10,000 steps per day.

Fill in the blanks in the cell below to create the `adsda.ht_user_metrics_hs_lab` table.

**Hint:** Refer back to previous demos on how to create the `steps_10000` column.

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TABLE adsda.ht_user_metrics_hs_lab
USING DELTA LOCATION "/adsda/ht-user-metrics-hs-lab" AS (
  SELECT min(resting_heartrate) AS min_resting_heartrate,
         avg(resting_heartrate) AS avg_resting_heartrate,
         max(resting_heartrate) AS max_resting_heartrate,
         max(resting_heartrate) - min(resting_heartrate) AS resting_heartrate_change,
         min(active_heartrate) AS min_active_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         max(active_heartrate) AS max_active_heartrate,
         max(active_heartrate) - min(active_heartrate) AS active_heartrate_change,
         min(bmi) AS min_bmi,
         avg(bmi) AS avg_bmi,
         max(bmi) AS max_bmi,
         max(bmi) - min(bmi) AS bmi_change,
         min(vo2) AS min_vo2,
         avg(vo2) AS avg_vo2,
         max(vo2) AS max_vo2,
         max(vo2) - min(vo2) AS vo2_change,
         min(workout_minutes) AS min_workout_minutes,
         avg(workout_minutes) AS avg_workout_minutes,
         max(workout_minutes) AS max_workout_minutes,
         max(workout_minutes) - min(workout_minutes) AS workout_minutes_change,
         CASE WHEN avg(steps) >= 10000 THEN 1 ELSE 0 END AS steps_10000
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)



**Coursera Quiz:** How many users in `adsda.ht_user_metrics_hs_lab` take, on average, at least 10,000 steps per day?

In [0]:
%sql
SELECT steps_10000, count(*) FROM adsda.ht_user_metrics_hs_lab GROUP BY steps_10000

steps_10000,count(1)
1,1892
0,1108
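
The labelling logic in the query can also be sketched in pandas. This is only an illustration: the toy `daily` DataFrame and its values are invented, and only the `device_id`, `steps`, and `steps_10000` column names come from the lab's query.

```python
import pandas as pd

# Hypothetical daily records for three devices (values are made up)
daily = pd.DataFrame({
    "device_id": ["a", "a", "b", "b", "c", "c"],
    "steps": [12000, 9000, 4000, 6000, 10000, 10000],
})

# One row per user: average daily steps, then a 0/1 label
user_metrics = daily.groupby("device_id", as_index=False).agg(avg_steps=("steps", "mean"))
user_metrics["steps_10000"] = (user_metrics["avg_steps"] >= 10000).astype(int)

# Count users in each class, mirroring the GROUP BY steps_10000 query
label_counts = user_metrics["steps_10000"].value_counts().to_dict()
print(label_counts)
```

Because the label is computed from the per-user average, a user with some days below 10,000 steps can still be labelled 1.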


## Exercise 2

In this exercise, you will split your data into a training set (`train_df`), validation set (`val_df`), and test set (`test_df`).

Fill in the blanks below to split your data.

**Hint:** Refer to the previous demo for guidance.

In [0]:
# ANSWER
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics_hs_lab").toPandas()

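# Hold out 15% of the rows as the test set
train_val_df, test_df = train_test_split(ht_user_metrics_pd_df, train_size=0.85, test_size=0.15, random_state=42)
# Split the remaining 85% into 70% training and 30% validation
train_df, val_df = train_test_split(train_val_df, train_size=0.7, test_size=0.3, random_state=42)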

**Coursera Quiz:** How many rows are in the `val_df` DataFrame?

In [0]:
val_df.shape
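
To see how the two-stage split divides the data, here is a minimal sketch on a made-up DataFrame. The `toy_*` names and the 1,000-row size are assumptions for illustration, not part of the lab's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the user-level table: 1,000 rows, one feature, one label
toy_df = pd.DataFrame({
    "feature": range(1000),
    "steps_10000": [i % 2 for i in range(1000)],
})

# Stage 1: hold out 15% of the rows as the test set
toy_train_val, toy_test = train_test_split(
    toy_df, train_size=0.85, test_size=0.15, random_state=42
)
# Stage 2: split the remaining 85% into 70% training and 30% validation
toy_train, toy_val = train_test_split(
    toy_train_val, train_size=0.7, test_size=0.3, random_state=42
)

# Report each part's size and its share of the original rows
for name, part in [("train", toy_train), ("val", toy_val), ("test", toy_test)]:
    print(name, len(part), round(len(part) / len(toy_df), 3))
```

The second split applies to the 85% that remains, so the validation set is roughly 0.85 × 0.3 = 25.5% of the original rows, not 30%.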

## Exercise 3

In this exercise, you will prepare your random forest classifier.

Fill in the blanks below to complete the task.

In [0]:
# ANSWER
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

## Exercise 4

In this exercise, you will create a hyperparameter grid to use during the grid search process.

Use the following hyperparameter values:

1. `max_depth`: 2, 3, 5, 8, 10, 15
1. `n_estimators`: 5, 10, 25, 50, 100, 250
1. `min_samples_split`: 2, 3, 4
1. `min_impurity_decrease`: 0.0, 0.01, 0.05

Fill in the blanks below to create the grid.

In [0]:
# ANSWER
parameter_grid = {
  "max_depth": [2, 3, 5, 8, 10, 15],
  "n_estimators": [5, 10, 25, 50, 100, 250],
  "min_samples_split": [2, 3, 4],
  "min_impurity_decrease": [0.0, 0.01, 0.05]
}

**Coursera Quiz**: How many total unique combinations of hyperparameters are there in `parameter_grid`?

In [0]:
len(parameter_grid["max_depth"]) * len(parameter_grid["n_estimators"]) * len(parameter_grid["min_samples_split"]) * len(parameter_grid["min_impurity_decrease"])
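
The product of the list lengths counts every way of picking one value per hyperparameter. As a small sketch, the combinations can be enumerated explicitly with `itertools.product`. The `toy_grid` below is a hypothetical two-parameter grid, deliberately smaller than the lab's.

```python
from itertools import product

# Hypothetical two-parameter grid, much smaller than the lab's
toy_grid = {
    "max_depth": [2, 5],
    "n_estimators": [10, 50, 100],
}

# Every combination takes exactly one value from each list
names = list(toy_grid)
combinations = [dict(zip(names, values)) for values in product(*toy_grid.values())]

for combo in combinations:
    print(combo)
print(len(combinations))
```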

## Exercise 5

In this exercise, you will create a predefined split for your training set and your validation set.

Fill in the blanks below to create the PredefinedSplit.

In [0]:
# ANSWER
from sklearn.model_selection import PredefinedSplit

# Label each row of train_val_df: -1 keeps it in the training set, 0 assigns it to the single validation fold
split_index = [-1 if row in train_df.index else 0 for row in train_val_df.index]

# Create predefined split object
predefined_split = PredefinedSplit(test_fold=split_index)

**Coursera Quiz**: How many 0s are there in `split_index`?

In [0]:
split_index.count(0)
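
The following sketch shows how `PredefinedSplit` interprets such a list. The five-element `toy_split_index` is invented for illustration; rows marked `-1` are never used for validation, and rows marked `0` form validation fold 0.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels for five rows: -1 = always train, 0 = validation fold 0
toy_split_index = [-1, -1, 0, -1, 0]
toy_split = PredefinedSplit(test_fold=toy_split_index)

# Only one distinct non-negative fold label exists, so there is a single split
n_splits = toy_split.get_n_splits()

# Each split is a pair of positional index arrays: (training rows, validation rows)
folds = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in toy_split.split()]
print(n_splits, folds)
```

Since the list contains a single fold label besides `-1`, grid search evaluates each candidate on one fixed train/validation split rather than on several cross-validation folds.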

## Exercise 6

In this exercise, you will create the grid search object that you will use to optimize your hyperparameter values.

Fill in the blanks below to create the object.

In [0]:
# ANSWER
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=rfc, cv=predefined_split, param_grid=parameter_grid)

## Exercise 7

In this exercise, you will fit the grid search process.

Fill in the blanks below to perform the grid search process.

In [0]:
# ANSWER
grid_search.fit(train_val_df.drop("steps_10000", axis=1), train_val_df["steps_10000"])

**Coursera Quiz**: How many unique models are being trained by the grid search process?

* 324
* 765
* 325
* 766

**Hint:** Consider the number of unique hyperparameter combinations and the final retraining of the model on the training *and* validation sets.
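
One way to reason about the number of fits is to run a much smaller search and count. The sketch below uses synthetic data from `make_classification` and a hypothetical 2 × 2 grid; none of these values come from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic data standing in for the lab's table
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# First 140 rows are training rows, last 60 rows form the single validation fold
toy_split = PredefinedSplit(test_fold=[-1] * 140 + [0] * 60)
toy_grid = {"max_depth": [2, 3], "n_estimators": [5, 10]}

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    cv=toy_split,
    param_grid=toy_grid,
)
toy_search.fit(X, y)

# One fit per candidate per split, plus one final refit when refit=True (the default)
n_candidates = len(toy_search.cv_results_["params"])
n_search_fits = n_candidates * toy_search.n_splits_
n_total_fits = n_search_fits + int(toy_search.refit)
print(n_candidates, toy_search.n_splits_, n_total_fits)
```

The same arithmetic applies to the lab's grid: multiply the number of candidates by the number of splits, then add the final refit.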

## Exercise 8

In this exercise, you will identify the optimal hyperparameter values.

Fill in the blanks below to identify the optimal hyperparameter values.

In [0]:
# ANSWER
grid_search.best_params_

**Coursera Quiz:** What is the optimal hyperparameter value for `min_samples_split` according to the grid search process?

## Exercise 9

In this exercise, you will identify the validation accuracy that was achieved for the optimal hyperparameter values when trained on the training set.

Fill in the blanks below to identify the validation accuracy.

In [0]:
# ANSWER
grid_search.best_score_
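
As a rough illustration of where `best_params_` and `best_score_` come from, the sketch below looks up the winning row in `cv_results_` for a small search. The synthetic data and the tiny grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic data and a small hypothetical grid, not the lab's
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    cv=PredefinedSplit(test_fold=[-1] * 140 + [0] * 60),
    param_grid={"max_depth": [2, 3], "n_estimators": [5, 10]},
)
toy_search.fit(X, y)

# best_index_ points at the row of cv_results_ with the highest mean validation score
results = toy_search.cv_results_
best = toy_search.best_index_
print(results["params"][best], results["mean_test_score"][best])
```

With a single predefined split, the "mean" validation score is simply the score on that one validation set, measured before the final refit.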

**Coursera Quiz:** What is the best validation set accuracy?

## Exercise 10

In this exercise, you will identify the test accuracy achieved by the final, refit model.

Fill in the blanks below to identify the test accuracy.

In [0]:
# ANSWER
from sklearn.metrics import accuracy_score

accuracy_score(
  test_df["steps_10000"], 
  grid_search.predict(test_df.drop("steps_10000", axis=1))
)
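
With its default settings, `accuracy_score` returns the fraction of predictions that match the true labels. A minimal hand computation, using made-up labels and predictions, looks like this:

```python
# Hypothetical true labels and predictions for eight users
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy is the number of matching positions divided by the number of rows
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)
```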

**Coursera Quiz:** What is the test set accuracy?

Congrats! That concludes our lesson on hyperparameter optimization!

Be sure to submit your quiz answers to Coursera, and join us in the next module to learn how to improve the process even further using cross-validation.

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>