
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Feature Engineering Lab

**Objective**: *Apply feature engineering to a dataset to derive more meaningful features and improve predictions.*

In this lab, you will apply what you've learned in this lesson. When complete, please use the answers to the exercises to answer questions in the following quiz within Coursera.

In [0]:
%run "../../Includes/Classroom-Setup"

## Exercise 1

In this exercise, you will create a user-level table with the following columns:

1. `avg_resting_heartrate` - the average resting heart rate
1. `avg_active_heartrate` - the average active heart rate
1. `avg_bmi` - the average BMI
1. `avg_vo2` - the average oxygen volume
1. `avg_workout_minutes` - the average workout minutes
1. `avg_steps` - the average number of steps
1. `lifestyle` - the first recorded lifestyle label for the user

Fill in the blanks in the cell below to create the `adsda.ht_user_metrics_lifestyle` table.

In [0]:
%sql
-- TODO
CREATE OR REPLACE TABLE adsda.ht_user_metrics_lifestyle
USING DELTA LOCATION "/adsda/ht_user_metrics_lifestyle" AS (
  SELECT avg(resting_heartrate) AS avg_resting_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         avg(bmi) AS avg_bmi,
         avg(vo2) AS avg_vo2,
         avg(workout_minutes) AS avg_workout_minutes,
         avg(steps) AS avg_steps,
         first(lifestyle) AS lifestyle
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)



**Coursera Quiz:** Why did we run a `group by`?
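As a rough illustration of what the `GROUP BY` does, the sketch below runs the same kind of aggregation in pandas on a small made-up table. The values and the names `daily` and `user_metrics` are assumptions for illustration only; they are not the contents of `adsda.ht_daily_metrics`.

```python
import pandas as pd

# Hypothetical daily records: several rows per device
daily = pd.DataFrame({
    "device_id": [1, 1, 2, 2, 2],
    "resting_heartrate": [60.0, 62.0, 70.0, 72.0, 74.0],
    "steps": [8000, 10000, 3000, 4000, 5000],
    "lifestyle": ["Athlete", "Athlete", "Sedentary", "Sedentary", "Sedentary"],
})

# Collapse the daily rows into one row per device (user-level features)
user_metrics = (
    daily.groupby("device_id")
    .agg(
        avg_resting_heartrate=("resting_heartrate", "mean"),
        avg_steps=("steps", "mean"),
        lifestyle=("lifestyle", "first"),
    )
    .reset_index()
)

print(user_metrics)
```

Each device ends up with exactly one row, which is the shape a user-level model expects.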

In [0]:
import numpy as np
np.random.seed(0)
df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()
df.loc[df.sample(frac=0.18).index, 'avg_bmi'] = np.nan
df.shape

## Exercise 2

In this exercise, you will split your data into a training set and an inference set.

Fill in the blanks below to split the data.

In practice, you should use as much data as possible for your training set. An inference set will usually become available after the training process, rather than being split apart from your training set prior to the training of the model.

In [0]:
# TODO
from sklearn.model_selection import train_test_split

train_df, inference_df = train_test_split(df, train_size=0.85, test_size=0.15, random_state=42)

**Coursera Quiz:** How many rows have missing values in the `avg_bmi` column in the training set?
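One way to count missing values after a split is `isna().sum()`. The sketch below uses synthetic data, so its counts will not match the quiz answer; the `toy_*` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the user-level table
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"avg_bmi": rng.normal(25, 3, size=100)})

# Blank out 18% of the values, mirroring the setup cell
toy_df.loc[toy_df.sample(frac=0.18, random_state=0).index, "avg_bmi"] = np.nan

toy_train, toy_inference = train_test_split(
    toy_df, train_size=0.85, test_size=0.15, random_state=42
)

# Count the rows with a missing avg_bmi in each split
n_missing_train = int(toy_train["avg_bmi"].isna().sum())
n_missing_inference = int(toy_inference["avg_bmi"].isna().sum())
print(n_missing_train, n_missing_inference)
```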

## Exercise 3

In this exercise, you'll fill the missing values in `avg_bmi` with the median of the training set.

Fill in the blanks below to complete the task.

In [0]:
# TODO
import pandas as pd

avg_bmi_median = train_df['avg_bmi'].median()

train_df['avg_bmi'] = train_df['avg_bmi'].fillna(avg_bmi_median)
inference_df['avg_bmi'] = inference_df['avg_bmi'].fillna(avg_bmi_median)

avg_bmi_median

**Coursera Quiz:** What is the value of `avg_bmi_median` rounded to the nearest hundredth place?
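The same idea can be expressed with scikit-learn's `SimpleImputer`, which stores the statistic learned from the training set and reuses it on other data. This is an alternative sketch on made-up values, not the approach the lab asks for; the `toy_*` names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_train = pd.DataFrame({"avg_bmi": [20.0, 22.0, np.nan, 30.0]})
toy_inference = pd.DataFrame({"avg_bmi": [np.nan, 40.0]})

# Learn the median from the training set only
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(toy_train[["avg_bmi"]])

# Reuse the training median on the inference set
inference_filled = imputer.transform(toy_inference[["avg_bmi"]])

learned_median = float(imputer.statistics_[0])
print(learned_median)
```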

## Exercise 4

In this exercise, you will scale `avg_bmi`: fit the scaler on `train_df`, then use it to transform both `train_df` and `inference_df`.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
scaled_training_data = sc.fit_transform(train_df[['avg_bmi']])
train_df['avg_bmi_scaled'] = scaled_training_data

scaled_inference_data = sc.transform(inference_df[['avg_bmi']])
inference_df['avg_bmi_scaled'] = scaled_inference_data

**Coursera Quiz**: Using the `.min()` method on the `avg_bmi` and `avg_bmi_scaled` columns of `inference_df`, find the difference between the two minimums, rounded to the nearest tenth.

In [0]:
print(inference_df['avg_bmi'].min() - inference_df['avg_bmi_scaled'].min())
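To see what the scaler learns and applies, the sketch below fits a `StandardScaler` on a tiny made-up training column and reuses its mean and standard deviation on separate inference values. The numbers and `toy_*` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_train = pd.DataFrame({"avg_bmi": [20.0, 25.0, 30.0]})
toy_inference = pd.DataFrame({"avg_bmi": [25.0, 35.0]})

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
scaled_train = scaler.fit_transform(toy_train[["avg_bmi"]])

# transform reuses those training statistics; nothing is re-learned here
scaled_inference = scaler.transform(toy_inference[["avg_bmi"]])

print(scaler.mean_, scaler.scale_)
print(scaled_train.ravel(), scaled_inference.ravel())
```

Because the inference values are scaled with training statistics, they are not guaranteed to have zero mean or unit variance themselves.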

## Exercise 5

In this exercise, you will create one-hot encoded columns on the `lifestyle` column.

Fill in the blanks below to complete the task.

In [0]:
# TODO
train_df = pd.get_dummies(train_df, prefix='ohe', columns=['lifestyle'])
inference_df = pd.get_dummies(inference_df, prefix='ohe', columns=['lifestyle'])


**Coursera Quiz**: How many rows in our training set (`train_df`) have a value of 1 for the column `ohe_Weight Trainer`?

In [0]:
train_df[train_df['ohe_Weight Trainer'] > 0].shape[0]
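Calling `pd.get_dummies` separately on two DataFrames yields different columns if a category is absent from one of them. A minimal sketch of one way to guard against that, assuming made-up data and `toy_*` names, is to reindex the inference columns to match the training columns:

```python
import pandas as pd

toy_train = pd.DataFrame({"lifestyle": ["Sedentary", "Athlete", "Weight Trainer"]})
toy_inference = pd.DataFrame({"lifestyle": ["Sedentary", "Sedentary"]})

train_ohe = pd.get_dummies(toy_train, prefix="ohe", columns=["lifestyle"])
inference_ohe = pd.get_dummies(toy_inference, prefix="ohe", columns=["lifestyle"])

# Add any training-set columns missing from the inference set, filled with 0
inference_ohe = inference_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(train_ohe.columns))
print(list(inference_ohe.columns))
```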

## Exercise 6

Over the next series of exercises, you will fit a Logistic Regression model, utilizing several steps above and a few new ones. 

Our target here is the `lifestyle` column. The cell below resets our DataFrame and maps `lifestyle` to `0` for `Sedentary` and `1` for every other lifestyle, which gives us a binary classification task.

In [0]:
df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()
df.loc[df.sample(frac=0.18).index, 'avg_workout_minutes'] = np.nan
df['lifestyle'] = df['lifestyle'].map({'Sedentary':0, 'Weight Trainer':1, 'Athlete':1, 'Cardio Enthusiast':1})

**Coursera Quiz**: How many observations of the `Sedentary` class (encoded as `0`) do we have in the full dataset?

Write the code in the cell below to answer the question.

In [0]:
# TODO
df['lifestyle'].value_counts()
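As a small illustration of the mapping and the class count, the sketch below applies the same dictionary to a made-up series and reads the count for class `0` from `value_counts()`. The sample labels are assumptions, not the lab data.

```python
import pandas as pd

toy_lifestyles = pd.Series(
    ["Sedentary", "Athlete", "Sedentary", "Weight Trainer", "Cardio Enthusiast"]
)

# Same mapping as the setup cell: Sedentary -> 0, everything else -> 1
toy_binary = toy_lifestyles.map(
    {"Sedentary": 0, "Weight Trainer": 1, "Athlete": 1, "Cardio Enthusiast": 1}
)

toy_counts = toy_binary.value_counts()
n_sedentary = int(toy_counts.loc[0])
print(n_sedentary)
```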

## Exercise 7

In this exercise, you will train-test split the data, using `lifestyle` as the target. Set the test size to be `10%` and the random state to `3`.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.model_selection import train_test_split
X = df.drop('lifestyle', axis=1)
y = df['lifestyle']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=3)

## Exercise 8

In this exercise, you will clean up any missing values by imputing with the mean. 

Fill in the blanks below to complete the task.

**Hint:** Recall that we always want to learn values from the training set!

In [0]:
# TODO
avg_wo_minutes_mean = X_train['avg_workout_minutes'].mean()
X_train['avg_workout_minutes'] = X_train['avg_workout_minutes'].fillna(avg_wo_minutes_mean)
X_test['avg_workout_minutes'] = X_test['avg_workout_minutes'].fillna(avg_wo_minutes_mean)

## Exercise 9

In this exercise, you will scale *all* of the columns.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Exercise 10

In this exercise, you will fit a Logistic Regression model on our target: `lifestyle`.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train,y_train)

lr.score(X_test,y_test)

**Coursera Quiz**: What might account for our score?

**Answer:** One or more of our features has a high predictive value.
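The split, impute, scale, and fit steps from the exercises above can also be chained in a scikit-learn `Pipeline`, which fits every preprocessing step on the training data only. The sketch below uses synthetic data in which one feature determines the label by construction, so the high score is expected; the `toy_*` names and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features; the label depends only on avg_steps
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "avg_steps": rng.normal(8000, 2500, size=200),
    "avg_workout_minutes": rng.normal(30, 10, size=200),
})
toy_y = (toy_X["avg_steps"] > 8000).astype(int)

# Introduce missing values, as in the lab's setup cell
toy_X.loc[toy_X.sample(frac=0.18, random_state=0).index, "avg_workout_minutes"] = np.nan

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=3
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # mean learned from training data
    ("scale", StandardScaler()),                 # mean/std learned from training data
    ("model", LogisticRegression()),
])

pipeline.fit(toy_X_train, toy_y_train)
toy_score = pipeline.score(toy_X_test, toy_y_test)
print(toy_score)
```

Calling `pipeline.score` applies the already-fitted imputer and scaler to the test set, so test data never influences the learned statistics.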

Congrats! That concludes our lesson on feature engineering!

Be sure to submit your quiz answers to Coursera, and join us in the next lesson to learn about feature selection.

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>