
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Data Science Project

**Objective**: *Design, complete, and assess a common data science project.*

In this lab, you will use the data science process to design, build, and assess a common data science project.

In [0]:
dbutils.fs.rm("dbfs:/user/hive/warehouse/dsfda.db/ht_agg", recurse=True)

Out[4]: True

In [0]:
%run "../../Includes/Classroom-Setup"

Out[6]: DataFrame[]

## Project Details

In recent months, our health tracker company has noticed that many customers drop out of the sign-up process when they have to self-identify their exercise lifestyle (`ht_users.lifestyle`) – this is especially true for those with a "Sedentary" lifestyle. As a result, the company is considering removing this step from the sign-up process. However, the company knows this data is valuable for targeting introductory exercises and they don't want to lose it for customers that sign up after the step is removed.

In this data science project, our business stakeholders are interested in identifying which customers have a sedentary lifestyle – specifically, they want to know if we can correctly identify whether somebody has a "Sedentary" lifestyle at least 95 percent of the time. If we can meet this objective, the organization will be able to remove the lifestyle-specification step of the sign-up process *without losing the valuable information provided by the data*.


<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> There are no solutions provided for this project. You will need to complete it independently using the guidance detailed below and the previous labs from the project.


## Exercise 1

Summary: 
* Specify the data science process question. 
* Indicate whether this is framed as a supervised learning or unsupervised learning problem. 
* If it is supervised learning, indicate whether the problem is a regression problem or a classification problem.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** When we are interested in predicting something, we are usually talking about a supervised learning problem.

In [0]:
# Question:
# Can we identify people with a "Sedentary" lifestyle based on the other information given in the sign-up process?
# Framing: supervised learning (classification)
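
The stakeholders only ask about "Sedentary" versus everything else, so one possible way to frame the label is as a binary target. The sketch below is illustrative only: the sample list is hypothetical, the name `is_sedentary` is an assumption, and the notebook itself keeps the four-class `lifestyle` label.

```python
# Hypothetical sample of lifestyle labels (the four values that appear in ht_users.lifestyle).
lifestyles = ["Sedentary", "Weight Trainer", "Athlete", "Cardio Enthusiast", "Sedentary"]

# Binary target: 1 if the customer is "Sedentary", 0 for any other lifestyle.
is_sedentary = [int(label == "Sedentary") for label in lifestyles]

print(is_sedentary)  # [1, 0, 0, 0, 1]
```

Either framing is a classification problem; the binary version simply matches the stakeholder question more directly.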


## Exercise 2

Summary: 

* Specify the data science objective. 
* Indicate which evaluation metric should be used to assess the objective.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Remember, the data science objective needs to be measurable.

In [0]:
# Objective: create a model that predicts whether a user has a sedentary lifestyle, based on the information from the sign-up process, with at least 95% accuracy.
# Evaluation metric: accuracy
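
To make "measurable" concrete, accuracy is the fraction of predictions that match the true label. The following is a minimal sketch with hypothetical labels; the helper name `accuracy` and the constant `OBJECTIVE` are assumptions for illustration (scikit-learn's `accuracy_score`, used later in this notebook, computes the same quantity).

```python
OBJECTIVE = 0.95  # threshold taken from the project details

def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: three of the four predictions are correct.
y_true = ["Sedentary", "Athlete", "Sedentary", "Weight Trainer"]
y_pred = ["Sedentary", "Athlete", "Athlete", "Weight Trainer"]

score = accuracy(y_true, y_pred)
meets_objective = score >= OBJECTIVE
print(score, meets_objective)  # 0.75 False
```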


## Exercise 3

Summary:
* Design a baseline solution.
* Develop a baseline solution – be sure to split data between training for development and test for assessment.
* Assess your baseline solution. Does it meet the project objective? If not, use it as a threshold for further development.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Recall that baseline solutions are meant to be easy to develop.

In [0]:
import pandas as pd

ht_users_spark_df = spark.read.table("ht_users")
ht_users_pandas_df = ht_users_spark_df.toPandas()

print(len(ht_users_pandas_df))

3000


In [0]:
# Baseline solution: assume that every person has a "Sedentary" lifestyle
from sklearn.model_selection import train_test_split

# A fixed random_state keeps the split reproducible
ht_users_train_df, ht_users_test_df = train_test_split(ht_users_pandas_df, test_size=0.2, random_state=42)

# Check the size of the training and test sets
print(len(ht_users_train_df))
print(len(ht_users_test_df))

# Share of each lifestyle in the training set (percent); the "Sedentary" row is the baseline accuracy
(ht_users_train_df.groupby(['lifestyle'], dropna=False).count() / len(ht_users_train_df)) * 100


2400
600


lifestyle,device_id,first_name,last_name,country
Athlete,29.166667,29.166667,29.166667,29.166667
Cardio Enthusiast,35.791667,35.791667,35.791667,35.791667
Sedentary,10.125,10.125,10.125,10.125
Weight Trainer,24.916667,24.916667,24.916667,24.916667


In [0]:
#Baseline Solution: Assume that every person has a "Sedentary" lifestyle
from sklearn.model_selection import train_test_split

ht_users_train_df, ht_users_test_df = train_test_split(ht_users_pandas_df, test_size = 0.2, random_state = 42)

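# Lifestyle distribution (share of each class) in the training and test sets
train_lifestyle_distribution = ht_users_train_df['lifestyle'].value_counts(normalize=True)
test_lifestyle_distribution = ht_users_test_df['lifestyle'].value_counts(normalize=True)

# Baseline accuracy: predicting "Sedentary" for everyone is correct only for the "Sedentary" share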
print('The accuracy in the training set is: {}'.format(round(train_lifestyle_distribution.loc['Sedentary'], 2)))
print('The accuracy in the test set is: {}'.format(round(test_lifestyle_distribution.loc['Sedentary'], 2)))



The accuracy in the training set is: 0.1
The accuracy in the test set is: 0.12
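
The reason a class share can be read as an accuracy is that a constant prediction is correct exactly when the true label equals that constant. The sketch below checks this with the test-set class counts that `y_test.value_counts()` reports later in this notebook; the helper name `constant_baseline_accuracy` is an assumption for illustration.

```python
from collections import Counter

# Test-set class counts as reported by y_test.value_counts() in this notebook.
test_counts = Counter({
    "Cardio Enthusiast": 205,
    "Weight Trainer": 167,
    "Athlete": 159,
    "Sedentary": 69,
})

# Rebuild a list of true labels from the counts (order does not matter for accuracy).
y_true = [label for label, n in test_counts.items() for _ in range(n)]

def constant_baseline_accuracy(y_true, constant):
    """Accuracy of a model that predicts `constant` for every row."""
    return sum(t == constant for t in y_true) / len(y_true)

sedentary_baseline = constant_baseline_accuracy(y_true, "Sedentary")
print(sedentary_baseline)  # 0.115, which the notebook reports rounded as 0.12
```

The same helper can score any other constant prediction, which may be useful when choosing a threshold for further development.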



## Exercise 4

Summary: 
* Design the machine learning solution, but do not yet develop it. 
* Indicate whether a machine learning model will be used. If so, indicate which machine learning model will be used and what the label/output variable will be.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider solutions that align with the framing you did in Exercise 1.

In [0]:
#ML model -> Decision Tree
#Label variable -> lifestyle


## Exercise 5

Summary: 
* Explore your data. 
* Specify which tables and columns will be used for your label/output variable and your feature variables.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider aggregating features from other tables.

In [0]:
import pandas as pd
ht_agg_spark_df = spark.read.table("ht_agg")
ht_agg_pandas_df = ht_agg_spark_df.toPandas()
ht_agg_pandas_df.sort_values(by=['device_id'])

index,device_id,mean_bmi,mean_active_heartrate,mean_resting_heartrate,mean_vo2,mean_steps,lifestyle
1722,0003a6b8-e48b-11ea-8204-0242ac110002,22.398064,139.434875,82.683797,20.994012,5171.495890,Sedentary
696,0007a88a-e48b-11ea-8204-0242ac110002,25.150813,127.057153,77.732942,25.527475,7115.591781,Weight Trainer
1574,000b9c56-e48b-11ea-8204-0242ac110002,19.148256,147.315731,86.511629,19.448407,7257.693151,Weight Trainer
1439,000f916c-e48b-11ea-8204-0242ac110002,24.240376,129.577004,77.550541,21.401302,7129.690411,Weight Trainer
2857,00138330-e48b-11ea-8204-0242ac110002,30.726596,136.502687,68.933106,28.855230,6958.378082,Weight Trainer
...,...,...,...,...,...,...,...
557,fff00e82-e48a-11ea-8204-0242ac110002,21.432227,140.614679,82.056086,24.162064,7283.430137,Weight Trainer
2082,fff403a2-e48a-11ea-8204-0242ac110002,21.432304,144.578567,90.113362,18.346046,7110.720548,Weight Trainer
397,fff7e742-e48a-11ea-8204-0242ac110002,21.598411,148.378812,81.907737,22.797690,5153.890411,Sedentary
2127,fffbc5d8-e48a-11ea-8204-0242ac110002,26.134180,140.814891,69.910868,26.965661,5167.194521,Sedentary


In [0]:
ht_users_pandas_df.sort_values(by=['device_id'])

index,device_id,first_name,last_name,lifestyle,country
844,0003a6b8-e48b-11ea-8204-0242ac110002,Sydney,Pickett,Sedentary,United States
264,0007a88a-e48b-11ea-8204-0242ac110002,Echo,Preston,Weight Trainer,United States
1046,000b9c56-e48b-11ea-8204-0242ac110002,Camilla,Bishop,Weight Trainer,United States
2175,000f916c-e48b-11ea-8204-0242ac110002,Melanie,William,Weight Trainer,United States
2821,00138330-e48b-11ea-8204-0242ac110002,Courtney,Church,Weight Trainer,United States
...,...,...,...,...,...
225,fff00e82-e48a-11ea-8204-0242ac110002,Teagan,Bruce,Weight Trainer,United States
709,fff403a2-e48a-11ea-8204-0242ac110002,Kylynn,Long,Weight Trainer,United States
217,fff7e742-e48a-11ea-8204-0242ac110002,Anika,Mcgee,Sedentary,United States
419,fffbc5d8-e48a-11ea-8204-0242ac110002,Dacey,Vinson,Sedentary,United States


In [0]:
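# Inner join on the columns the two tables share (device_id and lifestyle)
merged_df = pd.merge(ht_users_pandas_df, ht_agg_pandas_df, on=['device_id', 'lifestyle'])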
merged_df.sort_values(by=['device_id'])

index,device_id,first_name,last_name,lifestyle,country,mean_bmi,mean_active_heartrate,mean_resting_heartrate,mean_vo2,mean_steps
844,0003a6b8-e48b-11ea-8204-0242ac110002,Sydney,Pickett,Sedentary,United States,22.398064,139.434875,82.683797,20.994012,5171.495890
264,0007a88a-e48b-11ea-8204-0242ac110002,Echo,Preston,Weight Trainer,United States,25.150813,127.057153,77.732942,25.527475,7115.591781
1046,000b9c56-e48b-11ea-8204-0242ac110002,Camilla,Bishop,Weight Trainer,United States,19.148256,147.315731,86.511629,19.448407,7257.693151
2175,000f916c-e48b-11ea-8204-0242ac110002,Melanie,William,Weight Trainer,United States,24.240376,129.577004,77.550541,21.401302,7129.690411
2821,00138330-e48b-11ea-8204-0242ac110002,Courtney,Church,Weight Trainer,United States,30.726596,136.502687,68.933106,28.855230,6958.378082
...,...,...,...,...,...,...,...,...,...,...
225,fff00e82-e48a-11ea-8204-0242ac110002,Teagan,Bruce,Weight Trainer,United States,21.432227,140.614679,82.056086,24.162064,7283.430137
709,fff403a2-e48a-11ea-8204-0242ac110002,Kylynn,Long,Weight Trainer,United States,21.432304,144.578567,90.113362,18.346046,7110.720548
217,fff7e742-e48a-11ea-8204-0242ac110002,Anika,Mcgee,Sedentary,United States,21.598411,148.378812,81.907737,22.797690,5153.890411
419,fffbc5d8-e48a-11ea-8204-0242ac110002,Dacey,Vinson,Sedentary,United States,26.134180,140.814891,69.910868,26.965661,5167.194521
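
When joining a user table to an aggregate table, it can help to name the join keys and let pandas confirm that each key appears once on both sides. The following is a minimal sketch on two hypothetical miniature tables (all values are made up); it assumes `device_id` and `lifestyle` are the shared columns, as in the tables shown above.

```python
import pandas as pd

# Hypothetical miniature versions of ht_users and ht_agg (values are made up).
users = pd.DataFrame({
    "device_id": ["a", "b", "c"],
    "lifestyle": ["Sedentary", "Athlete", "Weight Trainer"],
    "country": ["United States"] * 3,
})
agg = pd.DataFrame({
    "device_id": ["c", "a", "b"],
    "lifestyle": ["Weight Trainer", "Sedentary", "Athlete"],
    "mean_steps": [7100.0, 5100.0, 9800.0],
})

# Name the join keys explicitly and ask pandas to verify one row per key on each side.
merged = pd.merge(users, agg, on=["device_id", "lifestyle"],
                  how="inner", validate="one_to_one")

# Every user should survive the join, and the shared columns should appear only once.
print(len(merged) == len(users))
print(merged.columns.tolist())
```

With `validate="one_to_one"`, pandas raises a `MergeError` if a key is duplicated in either table, which guards against accidentally multiplying rows in the modeling table.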


## Exercise 6

Summary: 
* Prepare your modeling data. 
* Create a customer-level modeling table with the correct output variable and features. 
* Finally, split your data between training and test sets. Make sure this split aligns with that of your baseline solution.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider how to make the data split reproducible.

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Candidate feature sets to compare
X_1 = merged_df[['mean_active_heartrate', 'mean_resting_heartrate']]
X_2 = merged_df[['mean_steps']]
X_3 = merged_df[['mean_bmi', 'mean_steps']]
X_4 = merged_df[['mean_active_heartrate', 'mean_bmi', 'mean_vo2', 'mean_resting_heartrate']]

# Encode the lifestyle label as integers (classes are numbered in sorted order)
le = LabelEncoder()
y = le.fit_transform(merged_df['lifestyle'])

# Same test_size and random_state as the baseline, so every call selects the same rows
X_1_train, X_1_test, y_train, y_test = train_test_split(X_1, y, test_size = 0.2, random_state = 42)
X_2_train, X_2_test, y_train, y_test = train_test_split(X_2, y, test_size = 0.2, random_state = 42)
X_3_train, X_3_test, y_train, y_test = train_test_split(X_3, y, test_size = 0.2, random_state = 42)
X_4_train, X_4_test, y_train, y_test = train_test_split(X_4, y, test_size = 0.2, random_state = 42)

In [0]:
from sklearn.model_selection import train_test_split

# Four candidate feature sets, one per model
X_1 = merged_df[['mean_active_heartrate', 'mean_resting_heartrate']]
X_2 = merged_df[['mean_steps']]
X_3 = merged_df[['mean_bmi', 'mean_steps']]
X_4 = merged_df[['mean_active_heartrate', 'mean_bmi', 'mean_vo2', 'mean_resting_heartrate']]

# Keep the label as the original lifestyle strings; scikit-learn classifiers accept string labels
y = merged_df['lifestyle']

# Identical test_size and random_state in each call keep the four splits aligned with the baseline split
X_1_train, X_1_test, y_train, y_test = train_test_split(X_1, y, test_size = 0.2, random_state = 42)
X_2_train, X_2_test, y_train, y_test = train_test_split(X_2, y, test_size = 0.2, random_state = 42)
X_3_train, X_3_test, y_train, y_test = train_test_split(X_3, y, test_size = 0.2, random_state = 42)
X_4_train, X_4_test, y_train, y_test = train_test_split(X_4, y, test_size = 0.2, random_state = 42)
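
An alternative way to keep several feature sets aligned is to split the rows once and then select columns from the resulting frames. The sketch below shows this on a hypothetical ten-row table (values and the `demo_`/`X_a`/`X_b` names are made up); it is one possible arrangement, not the approach used in the cells above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row modeling table (values are made up for illustration).
demo_df = pd.DataFrame({
    "mean_bmi": [22.4, 25.2, 19.1, 24.2, 30.7, 21.4, 21.5, 21.6, 26.1, 23.3],
    "mean_steps": [5171, 7115, 7257, 7129, 6958, 7283, 7110, 5153, 5167, 9800],
    "lifestyle": ["Sedentary", "Weight Trainer", "Weight Trainer", "Weight Trainer",
                  "Weight Trainer", "Weight Trainer", "Weight Trainer", "Sedentary",
                  "Sedentary", "Athlete"],
})

# Split the rows once; a fixed random_state makes the split repeatable.
demo_train, demo_test = train_test_split(demo_df, test_size=0.2, random_state=42)

# Derive each feature set and the label from the same split, so they share row indices.
X_a_train, X_a_test = demo_train[["mean_steps"]], demo_test[["mean_steps"]]
X_b_train, X_b_test = demo_train[["mean_bmi", "mean_steps"]], demo_test[["mean_bmi", "mean_steps"]]
label_train, label_test = demo_train["lifestyle"], demo_test["lifestyle"]

# Re-running the split with the same settings selects the same rows.
train_again, test_again = train_test_split(demo_df, test_size=0.2, random_state=42)
print(demo_train.index.equals(train_again.index))  # True
```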

## Exercise 7

Summary: 
* Build the model specified in your answer to Exercise 4. 
* Be sure to use an evaluation metric that aligns with your specified objective.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** This evaluation metric should align with the one used in your baseline solution.

In [0]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


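# Create one decision tree per candidate feature set
# (default hyperparameters: no depth limit, so a tree can fit the training data exactly)
dt_1 = DecisionTreeClassifier()
dt_2 = DecisionTreeClassifier()
dt_3 = DecisionTreeClassifier()
dt_4 = DecisionTreeClassifier()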

#Train Decision Trees
dt_1.fit(X_1_train, y_train)
dt_2.fit(X_2_train, y_train)
dt_3.fit(X_3_train, y_train)
dt_4.fit(X_4_train, y_train)

#Generate predictions
y_train_1_predicted = dt_1.predict(X_1_train)
y_test_1_predicted = dt_1.predict(X_1_test)
y_train_2_predicted = dt_2.predict(X_2_train)
y_test_2_predicted = dt_2.predict(X_2_test)
y_train_3_predicted = dt_3.predict(X_3_train)
y_test_3_predicted = dt_3.predict(X_3_test)
y_train_4_predicted = dt_4.predict(X_4_train)
y_test_4_predicted = dt_4.predict(X_4_test)

In [0]:
train_1_accuracy = accuracy_score(y_train, y_train_1_predicted)
test_1_accuracy = accuracy_score(y_test, y_test_1_predicted)

train_2_accuracy = accuracy_score(y_train, y_train_2_predicted)
test_2_accuracy = accuracy_score(y_test, y_test_2_predicted)

train_3_accuracy = accuracy_score(y_train, y_train_3_predicted)
test_3_accuracy = accuracy_score(y_test, y_test_3_predicted)

train_4_accuracy = accuracy_score(y_train, y_train_4_predicted)
test_4_accuracy = accuracy_score(y_test, y_test_4_predicted)

print("model 1: training accuracy: ", train_1_accuracy)
print("model 1: test accuracy:     ", test_1_accuracy)
print(" ")
print("model 2: training accuracy: ", train_2_accuracy)
print("model 2: test accuracy:     ", test_2_accuracy)
print(" ")
print("model 3: training accuracy: ", train_3_accuracy)
print("model 3: test accuracy:     ", test_3_accuracy)
print(" ")
print("model 4: training accuracy: ", train_4_accuracy)
print("model 4: test accuracy:     ", test_4_accuracy)

model 1: training accuracy:  1.0
model 1: test accuracy:      0.5016666666666667
 
model 2: training accuracy:  1.0
model 2: test accuracy:      0.9233333333333333
 
model 3: training accuracy:  1.0
model 3: test accuracy:      0.9933333333333333
 
model 4: training accuracy:  1.0
model 4: test accuracy:      0.49333333333333335


In [0]:
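from sklearn.metrics import confusion_matrix

# Rows are true classes and columns are predicted classes, both in sorted label order:
# Athlete, Cardio Enthusiast, Sedentary, Weight Trainer
test_3_cm = confusion_matrix(y_test, y_test_3_predicted)
print(y_test.value_counts())
print("Model 3 Confusion Matrix:     ", test_3_cm)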

Cardio Enthusiast    205
Weight Trainer       167
Athlete              159
Sedentary             69
Name: lifestyle, dtype: int64
Model 3 Confusion Matrix:      [[158   1   0   0]
 [  2 203   0   0]
 [  0   0  69   0]
 [  1   0   0 166]]
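
Because the objective is specifically about "Sedentary" customers, it can be useful to collapse the four-class confusion matrix into a Sedentary-versus-rest view. The sketch below does this with the matrix printed above; it assumes the rows and columns follow scikit-learn's sorted label order (Athlete, Cardio Enthusiast, Sedentary, Weight Trainer), which is consistent with the row totals matching `y_test.value_counts()`.

```python
# Confusion matrix reported for model 3 (rows = true class, columns = predicted class).
labels = ["Athlete", "Cardio Enthusiast", "Sedentary", "Weight Trainer"]
cm = [
    [158, 1, 0, 0],
    [2, 203, 0, 0],
    [0, 0, 69, 0],
    [1, 0, 0, 166],
]

k = labels.index("Sedentary")
total = sum(sum(row) for row in cm)

# One-vs-rest counts for the "Sedentary" class.
tp = cm[k][k]                          # Sedentary predicted as Sedentary
fn = sum(cm[k]) - tp                   # Sedentary predicted as something else
fp = sum(row[k] for row in cm) - tp    # other classes predicted as Sedentary
tn = total - tp - fn - fp              # other classes predicted as non-Sedentary

overall_accuracy = sum(cm[i][i] for i in range(len(cm))) / total
sedentary_accuracy = (tp + tn) / total
sedentary_recall = tp / (tp + fn)
sedentary_precision = tp / (tp + fp)

print(overall_accuracy)    # 596 / 600, the four-class test accuracy of model 3
print(sedentary_accuracy)  # 1.0
```

On this test set every "Sedentary" customer is identified and no other customer is labeled "Sedentary"; the four misclassifications all fall among the other three classes.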


## Exercise 8

Summary: 
* Assess your model against the overall objective. 
* Be sure to use an evaluation metric that aligns with your specified objective.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Remember that we assess our models against our test data set to ensure that our solutions generalize.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** If your solution doesn't meet the objective, consider tweaking the model and data used by the model until it does meet the objective.
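
One way to carry out this assessment is to compare each model's test accuracy with both the 95 percent objective and the baseline. The sketch below uses the test accuracies and the baseline test accuracy printed earlier in this notebook; the variable names are assumptions, and the figures are four-class accuracies rather than a Sedentary-only metric.

```python
OBJECTIVE = 0.95               # threshold from the project details
baseline_test_accuracy = 0.12  # "always Sedentary" baseline on the test set

# Test accuracies printed for the four decision trees.
test_accuracy = {
    "model 1": 0.5016666666666667,
    "model 2": 0.9233333333333333,
    "model 3": 0.9933333333333333,
    "model 4": 0.49333333333333335,
}

for name, acc in test_accuracy.items():
    beats_baseline = acc > baseline_test_accuracy
    meets_objective = acc >= OBJECTIVE
    print(f"{name}: accuracy={acc:.4f} beats_baseline={beats_baseline} meets_objective={meets_objective}")

# Models that satisfy the objective on the test set.
passing = [name for name, acc in test_accuracy.items() if acc >= OBJECTIVE]
print(passing)  # ['model 3']
```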

After completing all of the above exercises, you should be ready to communicate your results. Move to the next video in the lesson for a description of that part of the project.


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>