
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Data Science Project

**Objective**: *Design, complete, and assess a common data science project.*

In this lab, you will use the data science process to design, build, and assess a common data science project.

In [0]:
%run "../../Includes/Classroom-Setup"

## Project Details

In recent months, our health tracker company has noticed that many customers drop out of the sign-up process when they have to self-identify their exercise lifestyle (`ht_users.lifestyle`) – this is especially true for those with a "Sedentary" lifestyle. As a result, the company is considering removing this step from the sign-up process. However, the company knows this data is valuable for targeting introductory exercises and they don't want to lose it for customers that sign up after the step is removed.

In this data science project, our business stakeholders are interested in identifying which customers have a sedentary lifestyle – specifically, they want to know if we can correctly identify whether somebody has a "Sedentary" lifestyle at least 95 percent of the time. If we can meet this objective, the organization will be able to remove the lifestyle-specification step of the sign-up process *without losing the valuable information provided by the data*.

There are no solutions provided for this project. You will need to complete it independently using the guidance detailed below and the previous labs from the project.

## Exercise 1

Summary: 
* Specify the data science process question. 
* Indicate whether this is framed as a supervised learning or unsupervised learning problem. 
* If it is supervised learning, indicate whether the problem is a regression problem or a classification problem.

**Hint:** When we are interested in predicting something, we are usually talking about a supervised learning problem.

* **Answer:** The data science question is: given a particular user's data, can we <span style="text-decoration: underline">correctly</span> classify them as having a "Sedentary" or "Non-Sedentary" lifestyle at least 95% of the time? Since we will be working with labeled data (the `lifestyle` column provides the label) and we are trying to assign a class to each user, this is a **supervised learning/classification** problem.

## Exercise 2

Summary: 

* Specify the data science objective. 
* Indicate which evaluation metric should be used to assess the objective.

**Hint:** Remember, the data science objective needs to be measurable.

* **Answer:** The data science objective is to train and test a machine learning classification model that takes user data as input and classifies each user's lifestyle as either "Sedentary" or "Non-Sedentary". The evaluation metric is accuracy: the model must reach at least 0.95 on both the training and test datasets.
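
As a minimal sketch of how this objective could be checked, accuracy is the share of predictions that match the true label. The labels below are made-up toy values, and `OBJECTIVE` is a hypothetical name for the 0.95 threshold; neither comes from the project data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical constant name; the 0.95 value comes from the project objective.
OBJECTIVE = 0.95

# Toy labels for illustration only (not real project data).
y_true = ["Sedentary", "Non-Sedentary", "Non-Sedentary", "Sedentary", "Non-Sedentary"]
y_pred = ["Sedentary", "Non-Sedentary", "Non-Sedentary", "Non-Sedentary", "Non-Sedentary"]

# Fraction of predictions that exactly match the true label.
accuracy = accuracy_score(y_true, y_pred)
meets_objective = accuracy >= OBJECTIVE
print(f"accuracy={accuracy:.2f}, meets objective: {meets_objective}")
```

Here four of the five toy predictions are correct, so the accuracy falls short of the threshold.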

## Exercise 3

Summary:
* Design a baseline solution.
* Develop a baseline solution – be sure to split data between training for development and test for assessment.
* Assess your baseline solution. Does it meet the project objective? If not, use it as a threshold for further development.

**Hint:** Recall that baseline solutions are meant to be easy to develop.

In [0]:
%python
import numpy as np

# Load the 'ht_users' table and convert it to a pandas DataFrame.
ht_users_spark_df = spark.read.table("ht_users")
ht_users_pandas_df = ht_users_spark_df.toPandas()

# Create 'lifestyle_recode', which collapses 'lifestyle' into
# either "Sedentary" or "Non-Sedentary".
ht_users_pandas_df['lifestyle_recode'] = np.where(
    ht_users_pandas_df['lifestyle'] == "Sedentary", "Sedentary", "Non-Sedentary"
)

# Drop 'first_name', 'last_name', and 'lifestyle'; they don't
# contribute anything to the analysis.
ht_users_pandas_df = ht_users_pandas_df.drop(columns=['first_name', 'last_name', 'lifestyle'])

# Class counts for the recoded label.
ht_users_pandas_df['lifestyle_recode'].value_counts()


In [0]:
# Create the X matrix and y vector...
X = ht_users_pandas_df[['device_id','country']]
y = ht_users_pandas_df['lifestyle_recode']

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)

# Check the baseline class percentages for 'y_train'.
# rename_axis / reset_index(name=...) yield the 'lifestyle' and 'freq' columns
# regardless of how the installed pandas version names value_counts() output.
lifestyle_summary = y_train.value_counts().rename_axis("lifestyle").reset_index(name="freq")
lifestyle_summary['pct'] = (lifestyle_summary['freq'] / lifestyle_summary['freq'].sum()) * 100
lifestyle_summary

| | lifestyle | freq | pct |
|---|---|---|---|
| 0 | Non-Sedentary | 2157 | 89.875 |
| 1 | Sedentary | 243 | 10.125 |


In [0]:
# Check the baseline class percentages for 'y_test'.
# rename_axis / reset_index(name=...) yield the 'lifestyle' and 'freq' columns
# regardless of how the installed pandas version names value_counts() output.
lifestyle_summary = y_test.value_counts().rename_axis("lifestyle").reset_index(name="freq")
lifestyle_summary['pct'] = (lifestyle_summary['freq'] / lifestyle_summary['freq'].sum()) * 100
lifestyle_summary

| | lifestyle | freq | pct |
|---|---|---|---|
| 0 | Non-Sedentary | 531 | 88.5 |
| 1 | Sedentary | 69 | 11.5 |
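
One easy baseline, offered here as an assumption rather than the course's official answer, is to predict the most frequent training class for every user. The sketch below computes the accuracy of that rule directly from the class counts in the two summaries above; `baseline_accuracy` is a hypothetical helper name.

```python
# Class counts copied from the y_train and y_test summaries above.
train_counts = {"Non-Sedentary": 2157, "Sedentary": 243}
test_counts = {"Non-Sedentary": 531, "Sedentary": 69}

# The baseline always predicts the most frequent class in the training data.
majority_class = max(train_counts, key=train_counts.get)

def baseline_accuracy(counts, predicted_class):
    """Accuracy obtained by predicting `predicted_class` for every row."""
    return counts[predicted_class] / sum(counts.values())

train_acc = baseline_accuracy(train_counts, majority_class)
test_acc = baseline_accuracy(test_counts, majority_class)
print(majority_class, train_acc, test_acc)
```

Always predicting "Non-Sedentary" is right for 531 of the 600 test users (88.5%), which is below the 95% objective, so this figure would serve as the threshold to beat. scikit-learn's `DummyClassifier(strategy="most_frequent")` implements the same rule as an estimator.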


## Exercise 4

Summary: 
* Design the machine learning solution, but do not yet develop it. 
* Indicate whether a machine learning model will be used. If so, indicate which machine learning model will be used and what the label/output variable will be.

**Hint:** Consider solutions that align with the framing you did in Exercise 1.

## Exercise 5

Summary: 
* Explore your data. 
* Specify which tables and columns will be used for your label/output variable and your feature variables.

**Hint:** Consider aggregating features from other tables.
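
The aggregation hint could look roughly like the sketch below, which collapses a daily-level table to one row per device. The `daily_metrics` table and the `steps` and `resting_heartrate` columns are hypothetical stand-ins with toy values; only `device_id` appears in the code earlier in this lab.

```python
import pandas as pd

# Hypothetical daily-level table with toy values; every name here except
# `device_id` is an assumption made for illustration.
daily_metrics = pd.DataFrame({
    "device_id": [1, 1, 2, 2, 2],
    "steps": [3000, 5000, 12000, 9000, 9000],
    "resting_heartrate": [70.0, 72.0, 55.0, 57.0, 56.0],
})

# Collapse to one row per device by averaging each daily metric.
user_features = (
    daily_metrics
    .groupby("device_id", as_index=False)
    .agg(
        avg_steps=("steps", "mean"),
        avg_resting_heartrate=("resting_heartrate", "mean"),
    )
)
print(user_features)
```

The resulting device-level features could then be joined to the user table on `device_id`.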

## Exercise 6

Summary: 
* Prepare your modeling data. 
* Create a customer-level modeling table with the correct output variable and features. 
* Finally, split your data between training and test sets. Make sure this split aligns with that of your baseline solution.

**Hint:** Consider how to make the data split reproducible.
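
One way to make the split reproducible, sketched here under assumptions, is to reuse the `test_size` and `random_state` values from the baseline split. The `modeling_df` table, its `avg_steps` column, and the `split` helper are hypothetical and hold toy values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy customer-level table standing in for the real modeling data.
modeling_df = pd.DataFrame({
    "device_id": list(range(10)),
    "avg_steps": [3000, 4000, 12000, 9000, 8000, 2500, 11000, 7000, 9500, 3500],
    "lifestyle_recode": [
        "Sedentary", "Sedentary", "Non-Sedentary", "Non-Sedentary", "Non-Sedentary",
        "Sedentary", "Non-Sedentary", "Non-Sedentary", "Non-Sedentary", "Sedentary",
    ],
})

def split(df):
    # Same test_size and random_state as the baseline split in Exercise 3.
    return train_test_split(df, test_size=0.2, random_state=42)

train_a, test_a = split(modeling_df)
train_b, test_b = split(modeling_df)

# With a fixed seed and identical input rows, both calls pick the same test rows.
same_split = test_a["device_id"].tolist() == test_b["device_id"].tolist()
print(len(train_a), len(test_a), same_split)
```

A fixed seed only reproduces the baseline partition if the modeling table has the same rows in the same order as the baseline data, so it is worth checking row order (for example by sorting on `device_id`) before splitting.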

## Exercise 7

Summary: 
* Build the model specified in your answer to Exercise 4. 
* Be sure to use an evaluation metric that aligns with your specified objective.

**Hint:** This evaluation metric should align with the one used in your baseline solution.
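
As an illustration only, the sketch below fits a decision tree and scores it with accuracy, the same metric used for the baseline. The choice of `DecisionTreeClassifier`, the feature names, and the synthetic data are all assumptions; your answer to Exercise 4 may call for a different model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; both feature names are hypothetical.
rng = np.random.default_rng(42)
n_users = 500
features = pd.DataFrame({
    "avg_steps": rng.normal(8000, 2500, n_users),
    "avg_resting_heartrate": rng.normal(65, 8, n_users),
})
# Synthetic labeling rule, used only so the example has something to learn.
labels = np.where(features["avg_steps"] < 5000, "Sedentary", "Non-Sedentary")

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# A shallow tree keeps the example fast and easy to inspect.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Score both splits with the same metric as the baseline.
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"train={train_accuracy:.3f}, test={test_accuracy:.3f}")
```

Because the synthetic label is a simple threshold on one feature, the scores here say nothing about how well a model will do on the real data.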

## Exercise 8

Summary: 
* Assess your model against the overall objective. 
* Be sure to use an evaluation metric that aligns with your specified objective.

**Hint:** Remember that we assess our models against our test data set to ensure that our solutions generalize.

**Hint:** If your solution doesn't meet the objective, consider tweaking the model and data used by the model until it does meet the objective.
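
The assessment step can be sketched as a comparison of the test-set accuracy against two reference points: the 0.95 objective and a baseline. The `assess` helper is hypothetical, and the 0.885 baseline assumes a majority-class baseline, taken from the share of "Non-Sedentary" users in the `y_test` summary.

```python
OBJECTIVE = 0.95           # from the project details
BASELINE_TEST_ACC = 0.885  # assumed majority-class baseline on the test set

def assess(test_accuracy, baseline=BASELINE_TEST_ACC, objective=OBJECTIVE):
    """Describe a test-set accuracy relative to the baseline and the objective."""
    if test_accuracy >= objective:
        return "meets objective"
    if test_accuracy > baseline:
        return "beats baseline, below objective"
    return "no better than baseline"

# Example accuracies chosen only to exercise each branch.
for acc in (0.97, 0.91, 0.88):
    print(acc, "->", assess(acc))
```

A result in the middle category suggests the model adds value over the baseline but still needs further tuning or better features before the sign-up step can be removed.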

After completing all of the above exercises, you should be ready to communicate your results. Move to the next video in the lesson for a description of that part of the project.

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>