# Exercise 2: Data Processing and Analysis

In [81]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
try:
    from gofer.ok import check
except:
    %pip install git+https://github.com/grading/gradememaybe.git
    from gofer.ok import check

Today we're working with simulated smartwatch health data. You'll notice when you read in the csv file that we have 7 columns: User ID, Heart Rate (BPM), Blood Oxygen Level (%), Step Count, Sleep Duration (hours), Activity Level, and Stress Level. We're going to go through and "clean up" the data so that we can calculate some basic statistics such as mean, median, minimum, and maximum for each variable. Run the cell below to read in the table saving it in the variable `smartwatch`. 

In [46]:
smartwatch = pd.read_csv("unclean_smartwatch_health_data.csv")
smartwatch

Let's start together with the Heart Rate column. Just looking at the preview table above we can see two things right away: (1) we have `NaN` values meaning data was not collected for those individuals, and (2) some of the heart rate values are abnormally high i.e. `247.803052`. Let's see what the entire range of values look like. 

In [47]:
print(smartwatch['Heart Rate (BPM)'].min(), smartwatch['Heart Rate (BPM)'].max())

40.0 296.5939695131042


**Question 1**: Set the variables below equal to the minimum heart rate and the maximum heart rate in the dataset. This is just practice for the method of checking answers as you go. Afterwards run the cell below to check your answer. 

In [48]:
minimumHeartRate = smartwatch['Heart Rate (BPM)'].min()
maximumHeartRate = smartwatch['Heart Rate (BPM)'].max()

In [49]:
check('tests/q1.py')

KeyError: 'image/svg+xml'

Notice that the maximum value of `296` beats per minimute is WAY above normal range of heart rates. In fact, according to [heart.org](https://www.heart.org/en/healthy-living/fitness/fitness-basics/target-heart-rates) the highest estimated heart rate based on age ranges from 150 to 200 for adults. This will vary between individuals, but this is a good starting point for us to think about outliers in the heart rate values in this dataset. Let's see how many rows have missing data or heart rates above 200. 

In [None]:
## select rows where Heart Rate is NaN or rows where the heart rate is above 200. 
# Emphasis on the usage of 'or' here, we want rows where either 
# scenario 1 (NaN) OR scenario 2 (>200) is true. 

outlierHeartRaterows = smartwatch[smartwatch['Heart Rate (BPM)'] > 200]
nullHeartRaterows = smartwatch[smartwatch['Heart Rate (BPM)'].isnull()]

print(len(outlierHeartRaterows) + len(nullHeartRaterows))

450


In [None]:
## Now let's get a table of all the other rows. 
heartrateRows = smartwatch[(~smartwatch['Heart Rate (BPM)'].isnull()) & (smartwatch['Heart Rate (BPM)'] <= 200)]

heartrateRows

print(heartrateRows)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
0      4174.0         58.939776               98.809650   5450.390578   
3      2294.0         40.000000               96.894213  13797.338044   
4      2130.0         61.950165               98.583797  15679.067648   
5      2095.0         96.285938               94.202910  10205.992256   
6      4772.0         47.272257               95.389760   3208.781177   
...       ...               ...                     ...           ...   
9994   1942.0         77.912299               98.640583  10061.145291   
9995   1524.0         78.819386               98.931927   2948.491953   
9996   4879.0         48.632659               95.773035   4725.623070   
9997   2624.0         73.834442               97.945874   2571.492060   
9999   4113.0         70.063864               98.475606    544.696104   

     Sleep Duration (hours) Activity Level Stress Level  
0         7.167235622316564  Highly Active            1  
3      

**Question 2:** 

Notice here that the length of the two tables (`450` and `9550`) add up to the total number rows (`n=10000`). This is a good sanity check as we manipulate the table. Now we have to decide how we deal with these missing values and outliers. One method to do this would be remove all the rows with null values or outlier values. Another method is use imputation - this can be done in several ways but below we're going to substitute the average heart rate for the missing and mismeasured values. Do we think this will change the mean?

In [None]:
HeartRateMean = heartrateRows['Heart Rate (BPM)'].mean()
print(HeartRateMean)


75.13268404820141


In [None]:
outlierHeartRaterows['Heart Rate (BPM)'] = HeartRateMean
nullHeartRaterows['Heart Rate (BPM)'] = HeartRateMean


print(outlierHeartRaterows)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
2      1860.0         75.132684               97.052954   2826.521994   
337    2369.0         75.132684               95.441773   2998.761919   
393    2443.0         75.132684               95.497181   1227.804688   
403    3200.0         75.132684               96.011492   9402.746140   
595    2129.0         75.132684               97.161853   2555.402184   
649    2008.0         75.132684               98.356789   2739.171166   
818    3156.0         75.132684                     NaN   7281.778831   
1195   3261.0         75.132684               99.652006   2867.872064   
1391   4621.0         75.132684               96.688083  20577.677290   
1602   4737.0         75.132684               95.095839  16072.283561   
2023      NaN         75.132684               99.032130  17620.765455   
2211   2711.0         75.132684               97.852781   1786.998129   
2212   4020.0         75.132684               95.28

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  outlierHeartRaterows['Heart Rate (BPM)'] = HeartRateMean
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nullHeartRaterows['Heart Rate (BPM)'] = HeartRateMean


In [None]:
smartwatch_hr = pd.concat([outlierHeartRaterows, nullHeartRaterows, heartrateRows])

fullTableHRMean = smartwatch_hr['Heart Rate (BPM)'].mean()
fullTableHRMean

In [None]:
check('tests/q2.py')

KeyError: 'image/svg+xml'

Notice how the mean doesn't change when you use the imputed mean as the substituted values. However, as a note this will change the distribution of values and has the potential to obscure the causes of missing values or outliers. 

In [None]:
sns.histplot(smartwatch['Heart Rate (BPM)'], label = 'With Missing Values and Outliers')
sns.histplot(smartwatch_hr['Heart Rate (BPM)'], label = 'With Mean as Imputed Value')
plt.legend()

NameError: name '_fetch_figure_metadata' is not defined

**Question 3**: How does the imputation method affect the median values?  Remember the table `smartwatch` remains unchanged and can be used to find the original median value. 

ANSWER: 

The median remains the same but it appears to be much lower after the imputation

In [None]:
## coding cell to check the difference in median values 

original_median = smartwatch['Heart Rate (BPM)'].median()
print(original_median)
imputed_median = smartwatch_hr['Heart Rate (BPM)'].median()
print(imputed_median)


75.22060125775644
75.13268404820141


Now let's repeat this process for the other columns as well. 

**Question 4**: Find the minimum, maximum, and mean Blood Oxygen Level. 

In [None]:
print(smartwatch_hr)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
2      1860.0         75.132684               97.052954   2826.521994   
337    2369.0         75.132684               95.441773   2998.761919   
393    2443.0         75.132684               95.497181   1227.804688   
403    3200.0         75.132684               96.011492   9402.746140   
595    2129.0         75.132684               97.161853   2555.402184   
...       ...               ...                     ...           ...   
9994   1942.0         77.912299               98.640583  10061.145291   
9995   1524.0         78.819386               98.931927   2948.491953   
9996   4879.0         48.632659               95.773035   4725.623070   
9997   2624.0         73.834442               97.945874   2571.492060   
9999   4113.0         70.063864               98.475606    544.696104   

     Sleep Duration (hours) Activity Level Stress Level  
2                     ERROR  Highly Active            5  
337    

In [None]:
minBloodO2 = smartwatch_hr['Blood Oxygen Level (%)'].min()
print(minBloodO2)
maxBloodO2 = smartwatch_hr['Blood Oxygen Level (%)'].max()
print(maxBloodO2)

90.79120814564097
100.0


In [None]:
minBloodO2 = smartwatch_hr['Blood Oxygen Level (%)'].min()

maxBloodO2 = smartwatch_hr['Blood Oxygen Level (%)'].max()


meanBloodO2 = smartwatch_hr['Blood Oxygen Level (%)'].mean()

print(minBloodO2, maxBloodO2, meanBloodO2)

90.79120814564097 100.0 97.84158102099076


In [None]:
check('tests/q4.py')

KeyError: 'image/svg+xml'

We can do some visualizations using a box plot to better decide if there's any outliers we'd like to remove. Maybe you decide that everything below 92.5 is an outlier should be removed or maybe you decide to keep all of the values or maybe you decide to remove all values under 94% since that falls outside of a normal, healthy range according to doctors (I would not suggest this last one as it would obscure quite a bit of data! But some analysts might consider it!). In question 5, you can make that decision and justify your answer. 

In [None]:
sns.histplot(smartwatch['Blood Oxygen Level (%)'], label = 'With Missing Values and Outliers')
sns.histplot(smartwatch_hr['Blood Oxygen Level (%)'], label = 'With Mean as Imputed Value')
plt.legend()

NameError: name '_fetch_figure_metadata' is not defined

In [None]:
print(smartwatch.head())

   User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
0   4174.0         58.939776               98.809650   5450.390578   
1      NaN               NaN               98.532195    727.601610   
2   1860.0        247.803052               97.052954   2826.521994   
3   2294.0         40.000000               96.894213  13797.338044   
4   2130.0         61.950165               98.583797  15679.067648   

  Sleep Duration (hours) Activity Level Stress Level  
0      7.167235622316564  Highly Active            1  
1      6.538239375570314  Highly_Active            5  
2                  ERROR  Highly Active            5  
3      7.367789630207228          Actve            3  
4                    NaN  Highly_Active            6  


In [None]:
sns.boxplot(smartwatch['Blood Oxygen Level (%)'])

NameError: name '_fetch_figure_metadata' is not defined

**Question 5**: Decide how you want to deal with missing and outlier values in the Blood Oxygen Level column. Set your final table equal to the variable `smartwatch_o2`. Use the space below to explain your decisions. 

Notes: Refer to [this article](https://pmc.ncbi.nlm.nih.gov/articles/PMC5548942/) on methods of handling these type of instances in data. Keep in mind, you might find that either there are no missing values or no outliers. That's okay, just indicate that in the written space below and update the table i.e. `smartwatch_o2 = smartwatch_hr`

ANSWER: 

....

In [None]:
missing_02 = (smartwatch_hr['Blood Oxygen Level (%)'].isnull().sum())
print(missing_02)

300


In [None]:
min_o2_threshold = 92.5
outliers_o2 = smartwatch_hr[smartwatch_hr["Blood Oxygen Level (%)"] < min_o2_threshold]
print(outliers_o2)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
389    4991.0         89.812696               92.003999   3197.915998   
548    1833.0         78.880652               92.109389   2274.291267   
773    3914.0         74.337231               92.288167   2482.815316   
859    2388.0         43.259383               92.482382    994.697587   
944    3703.0         84.058395               91.062167   9390.095074   
1179   3692.0         89.396977               91.507534   2790.919612   
1778   1395.0         74.038855               92.483740           NaN   
2263   1425.0         43.767314               92.282996   3069.225843   
2316   2265.0        102.533707               91.514026   1015.818664   
2443   3253.0         52.514029               92.368105   3488.384199   
2698   3113.0         78.345556               91.034463  15758.278107   
4048   2749.0         87.110612               92.192058  18406.286003   
4114   1258.0         97.254386               92.48

In [None]:
meanBloodO2
missing_02

In [None]:
smartwatch_hr.loc[missing_02,'Blood oxygen Level (%)']=meanBloodO2

In [None]:
## use this space or additional cells to deal with the missing/outlier values. 


smartwatch_o2 = smartwatch_hr

print(smartwatch_o2)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
2      1860.0         75.132684               97.052954   2826.521994   
337    2369.0         75.132684               95.441773   2998.761919   
393    2443.0         75.132684               95.497181   1227.804688   
403    3200.0         75.132684               96.011492   9402.746140   
595    2129.0         75.132684               97.161853   2555.402184   
...       ...               ...                     ...           ...   
9994   1942.0         77.912299               98.640583  10061.145291   
9995   1524.0         78.819386               98.931927   2948.491953   
9996   4879.0         48.632659               95.773035   4725.623070   
9997   2624.0         73.834442               97.945874   2571.492060   
9999   4113.0         70.063864               98.475606    544.696104   

     Sleep Duration (hours) Activity Level Stress Level  \
2                     ERROR  Highly Active            5   
337  

Moving onto the Step Count column. 

**Question 6**: Find the minimum, maximum, and mean step counts. 

In [None]:
minSteps = smartwatch_o2['Step Count'].min()


maxSteps = smartwatch_o2['Step Count'].max()


meanSteps = smartwatch_o2['Step Count'].mean()


print(minSteps, maxSteps, meanSteps)

0.9101380609604088 62486.690753464914 6985.685884992229


In [None]:
check('tests/q6.py')

KeyError: 'image/svg+xml'

**Question 7**: Decide how you want to deal with missing and outlier values in the Steps column. Set your final table equal to the variable `smartwatch_steps`. Use the space below to explain your decisions. 

Notes: Refer to [this article](https://pmc.ncbi.nlm.nih.gov/articles/PMC5548942/) on methods of handling these type of instances in data. Keep in mind, you might find that either there are no missing values or no outliers. That's okay, just indicate that in the written space below and update the table i.e. `smartwatch_steps = smartwatch_o2`

ANSWER: 

....

In [None]:
missing_step = (smartwatch_o2['Step Count'].isnull().sum())
print(missing_step)

100


In [None]:
min_step_threshold = 7000
outliers_step = smartwatch_o2[smartwatch_o2["Step Count"] < min_step_threshold]
print(outliers_step)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)   Step Count  \
2      1860.0         75.132684               97.052954  2826.521994   
337    2369.0         75.132684               95.441773  2998.761919   
393    2443.0         75.132684               95.497181  1227.804688   
595    2129.0         75.132684               97.161853  2555.402184   
649    2008.0         75.132684               98.356789  2739.171166   
...       ...               ...                     ...          ...   
9993   2184.0         69.669424               96.385686    94.980639   
9995   1524.0         78.819386               98.931927  2948.491953   
9996   4879.0         48.632659               95.773035  4725.623070   
9997   2624.0         73.834442               97.945874  2571.492060   
9999   4113.0         70.063864               98.475606   544.696104   

     Sleep Duration (hours) Activity Level Stress Level  \
2                     ERROR  Highly Active            5   
337       6.67062

In [None]:
smartwatch_o2.loc[missing_step,'Step Count']=meanSteps

In [None]:
## use this space or additional cells to address the missing or outlier data. 

smartwatch_steps = smartwatch_o2

print(smartwatch_steps)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
2      1860.0         75.132684               97.052954   2826.521994   
337    2369.0         75.132684               95.441773   2998.761919   
393    2443.0         75.132684               95.497181   1227.804688   
403    3200.0         75.132684               96.011492   9402.746140   
595    2129.0         75.132684               97.161853   2555.402184   
...       ...               ...                     ...           ...   
9994   1942.0         77.912299               98.640583  10061.145291   
9995   1524.0         78.819386               98.931927   2948.491953   
9996   4879.0         48.632659               95.773035   4725.623070   
9997   2624.0         73.834442               97.945874   2571.492060   
9999   4113.0         70.063864               98.475606    544.696104   

     Sleep Duration (hours) Activity Level Stress Level  \
2                     ERROR  Highly Active            5   
337  

Next onto the Sleep Duration column. 

**Question 8**: Try finding the minimum number of hours slept among participants. 

In [None]:
minSleep = smartwatch_steps['Sleep Duration (hours)'].min()

minSleep

TypeError: '<=' not supported between instances of 'str' and 'float'

In the error message, you should see the phrase: 

`TypeError: '<=' not supported between instances of 'str' and 'float'`

This means that the column is a combination of multiple data types. If you recall, we discussed computer readable data that columns HAVE to contain just a single data type. Having a combination of strings, characters, and numbers in a column will only cause more issues downstream. Let's try to figure out all the instances of non-numerical values in the column. 

First let's try to identify all the non-numerical values to make sure removing them doesn't obscure any important information. 

In [50]:
for value in smartwatch['Sleep Duration (hours)'].unique():  ## Loop through each unique item in the column
    try:
        numericVal = float(value)  ## first try to see if it can be converted to a number
    except:
        print(value)  ## if it can't be converted print it to screen 

ERROR


So, we find that the only non-numerical value is the string `ERROR`. We can fix this in two ways. First let's try fixing it by splitting the tables like we've done previously. 

In [51]:
errorTable = smartwatch_steps[smartwatch_steps['Sleep Duration (hours)'] == 'ERROR']
errorTable

Observe that we've created a table with the rows with `ERROR` in the sleep duration column. We can now replace with the `ERROR` value with our handy NaN value for doing numerical statistics. 

In [52]:
errorTable['Sleep Duration (hours)'] = np.nan

errorTable

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errorTable['Sleep Duration (hours)'] = np.nan


Now we could go through and put the tables back together, but another method to do this is to use a built-in function to pandas called `pd.to_numeric()`. Let's try that. 

In [53]:
## create a copy of the table 
smartwatch_sleep = smartwatch_steps

smartwatch_sleep['Sleep Duration (hours)'] = pd.to_numeric(smartwatch_sleep['Sleep Duration (hours)'], errors='coerce')

smartwatch_sleep

Now we can finish **Question 8** and find the minimum, maximum, and mean of the sleep duration column. 

In [55]:
minSleep = smartwatch_steps['Sleep Duration (hours)'].min()


maxSleep = smartwatch_steps['Sleep Duration (hours)'].max()


meanSleep = smartwatch_steps['Sleep Duration (hours)'].mean()


print(minSleep, maxSleep, meanSleep)

-0.1944527906201543 12.140232872862926 6.505462918406444


In [None]:
check('tests/q8.py')

KeyError: 'image/svg+xml'

**Question 9**: Decide how you want to deal with missing and outlier values in the sleep column. Set your final table equal to the variable `smartwatch_updated_sleep`. Use the space below to explain your decisions. 

*Remember to start with the `smartwatch_sleep` table that we just created.*

Notes: Refer to [this article](https://pmc.ncbi.nlm.nih.gov/articles/PMC5548942/) on methods of handling these type of instances in data. Keep in mind, you might find that either there are no missing values or no outliers. That's okay, just indicate that in the written space below and update the table i.e. `smartwatch_updated_sleep = smartwatch_sleep`

ANSWER: 

....

In [59]:
missing_sleep = (smartwatch_sleep['Sleep Duration (hours)'].isnull().sum())
print(missing_step)

100


In [60]:
max_sleep_threshold = 1
outliers_sleep = smartwatch_sleep[smartwatch_sleep["Sleep Duration (hours)"] < max_sleep_threshold]
print(outliers_sleep)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)   Step Count  \
2492   3936.0         73.018575                   100.0  3546.105365   
3366   1608.0         78.832837                   100.0  7122.866517   

      Sleep Duration (hours) Activity Level Stress Level  \
2492                0.589987          Actve            3   
3366               -0.194453      Sedentary           10   

      Blood oxygen Level (%)  
2492                     NaN  
3366                     NaN  


In [61]:
smartwatch_sleep.loc[missing_sleep,'Sleep Duration (hours)']=meanSleep

In [62]:
## use this space or additional cells to address the missing or outlier data. 


smartwatch_updated_sleep = smartwatch_sleep
smartwatch_updated_sleep

We're going to skip the `Activity Level` column for a minute and look at the `Stress Level` column. If we try getting the minimum, we'll find the same error as in the Sleep column where we have mixed data types (strings and numerical values). Let's use the same type of loop to make sure we don't obscure any data by forcing the strings to NaN values. 

In [63]:
for item in smartwatch_updated_sleep['Stress Level'].unique():
    try:
        int(item)
    except:
        print(item)

nan
Very High


In [65]:
print(smartwatch_updated_sleep)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
2      1860.0         75.132684               97.052954   2826.521994   
337    2369.0         75.132684               95.441773   2998.761919   
393    2443.0         75.132684               95.497181   1227.804688   
403    3200.0         75.132684               96.011492   9402.746140   
595    2129.0         75.132684               97.161853   2555.402184   
...       ...               ...                     ...           ...   
9994   1942.0         77.912299               98.640583  10061.145291   
9995   1524.0         78.819386               98.931927   2948.491953   
9996   4879.0         48.632659               95.773035   4725.623070   
9997   2624.0         73.834442               97.945874   2571.492060   
9999   4113.0         70.063864               98.475606    544.696104   

      Sleep Duration (hours) Activity Level Stress Level  \
2                        NaN  Highly Active            5   
337

**Question 10**: 

What might you decide to do to deal with the `Very High` value? 

Go ahead and do so below and give a brief case for doing so here. Assign your table to the variable `smartwatch_stress`. 

ANSWER: 
I would treat it as an outlier and replace it with the median or remove it from the data set

In [64]:
missing_stress = (smartwatch_updated_sleep['Stress Level'].isnull().sum())
print(missing_stress)

200


In [None]:

smartwatch.loc[smartwatch["Stress Level"] == "nan", "Stress Level"] = np.nan

In [70]:
smartwatch_updated_sleep = smartwatch_updated_sleep[smartwatch_updated_sleep["Stress Level"] != "Very High"]

In [72]:
## cell to deal with 'Very High' value


smartwatch_stress = smartwatch_updated_sleep

print(smartwatch_stress)


      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
2      1860.0         75.132684               97.052954   2826.521994   
337    2369.0         75.132684               95.441773   2998.761919   
393    2443.0         75.132684               95.497181   1227.804688   
403    3200.0         75.132684               96.011492   9402.746140   
595    2129.0         75.132684               97.161853   2555.402184   
...       ...               ...                     ...           ...   
9994   1942.0         77.912299               98.640583  10061.145291   
9995   1524.0         78.819386               98.931927   2948.491953   
9996   4879.0         48.632659               95.773035   4725.623070   
9997   2624.0         73.834442               97.945874   2571.492060   
9999   4113.0         70.063864               98.475606    544.696104   

      Sleep Duration (hours) Activity Level Stress Level  \
2                        NaN  Highly Active            5   
337

Finally, let's go back to the `Activity Level` column and investigate what types of values we find there. 

In [73]:
smartwatch['Activity Level'].unique()

**Question 12**: 

What do you notice? There are several values that could and should be combined because they represent the same information. Let's go ahead and do that. While combining these columns, let's also create a new column `NumActivity` where we give a numerical value to represent the activity level. Assign your final table to the variable `final_table`. 

`Highly Active` = `1`

`Active` = `2`

`Sedentary` = `3`



In [74]:
## Highly Active 
highlyActive = smartwatch_stress[(smartwatch_stress['Activity Level'] == "Highly Active") | (smartwatch_stress['Activity Level'] == "Highly Active")]  ## Pull out rows for two unique values matching highly active
highlyActive['Activity Level'] = "Highly Active"    ## Reset the text in the column to either 'Highly Active'
highlyActive['NumActivity'] = 1 ## set the number for the numerical value 


## Active
active = smartwatch_stress[(smartwatch_stress['Activity Level'] == "Active") | (smartwatch_stress['Activity Level'] == "Active")] 
active['Activity Level'] = "Active"  
active['NumActivity'] = 2  

## Sedentary 
sedentary = smartwatch_stress[smartwatch_stress['Activity Level'] == "Sedentary"]
sedentary['Activity Level'] = "Sedentary"  
sedentary['NumActivity'] = 3  

final_table = pd.concat([highlyActive, active, sedentary, smartwatch_stress[smartwatch_stress['Activity Level'].isnull()]])
final_table

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  highlyActive['Activity Level'] = "Highly Active"    ## Reset the text in the column to either 'Highly Active'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  highlyActive['NumActivity'] = 1 ## set the number for the numerical value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  active['Activity Leve

Let's check to make sure that we no longer have any missing values in each column (besides the Activity Level/NumActivity, Stress Level, and User ID columns). You likely either removed those rows or imputed a value to substitute the missing values. 

In [75]:
final_table.isnull().sum()

Then let's use the info function to make sure each column has the data type we're expecting. 

In [76]:
final_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5124 entries, 2 to 9991
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   User ID                 5022 non-null   float64
 1   Heart Rate (BPM)        5124 non-null   float64
 2   Blood Oxygen Level (%)  4980 non-null   float64
 3   Step Count              5073 non-null   float64
 4   Sleep Duration (hours)  4926 non-null   float64
 5   Activity Level          4924 non-null   object 
 6   Stress Level            5028 non-null   object 
 7   Blood oxygen Level (%)  0 non-null      float64
 8   NumActivity             4924 non-null   float64
dtypes: float64(7), object(2)
memory usage: 400.3+ KB


**Question 13**: 

Let's visualize two of the variables. Let's pick `Heart Rate (BPM)` and then you can select any other numerical variable. We're going to create a scatter plot using matplotlib.pyplot. Example code is: 

`plt.scatter(x, y)` where x and y are your columns of data such as df['label']

Also try including a size parameter to make your points smaller to better see patterns. We'll talk more about creating figures in python in a few weeks, but for now let's just look at the broad patterns. 

Example of including size parameter: 

`plt.scatter(x, y, s=1)` Try changing the `s` parameter to 10, 1, 0.5, and 0.1. 

What do you notice?

ANSWER:

 There is a loose positive correlation, where higher step counts sometimes correspond to higher heart rates. Some outliers with very high or low step counts. 

In [84]:
x = final_table["Heart Rate (BPM)"]  # Heart Rate (BPM) as x-axis
y = final_table["Step Count"]  # Step Count as y-axis

# Create a scatter plot with different size parameters
plt.figure(figsize=(8,5)).show()

# Try different `s` values to see density variations
plt.scatter(x, y, s=1)

NameError: name '_fetch_figure_metadata' is not defined

**Question 14**: 


Read in the CSV table where we kept all the NaN values instead of removing or imputing them. Repeat the exact same visualization as above with this data. 

Compare the two figures. What do you notice?

ANSWER:

Before Imputation missing data points reduce the density of the scatter plot. After Imputation the scatter plot looks more continuous without missing data.

In [79]:
nanTable = pd.read_csv('smartwatch_nan_vals.csv')
print(nanTable)

      User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
0      4670.0         70.659253               99.072904   6042.576181   
1      1726.0         91.127561              100.000000   4213.519341   
2      4627.0         74.776893               99.630704  12557.592821   
3      1556.0         91.216912               98.777090  50224.691117   
4      3320.0         66.331358               99.903851    819.769598   
...       ...               ...                     ...           ...   
9995   2597.0         80.728128               97.254023           NaN   
9996   2096.0         57.087738               98.961619           NaN   
9997   2577.0         65.201322               99.484801   2240.504798   
9998   3501.0         76.063875               96.130100  12510.840514   
9999   3895.0         78.398919              100.000000   2522.668511   

      Sleep Duration (hours) Activity Level  Stress Level  NumActivity  
0                   6.453973  Highly Active       

In [85]:
plt.scatter(x, y, s = 1)

NameError: name '_fetch_figure_metadata' is not defined

**Question 15**: 


Does the amount of steps signficantly differ between the different activity level groups? 

Hint: Try using a boxplot (`sns.boxplot`) to first visualize the problem. Then you can use scipy.stats to run an ANOVA. 

In [86]:
## Visualization 

plt.figure(figsize=(8,5))
sns.boxplot(x=final_table["Activity Level"], y=final_table["Step Count"])
plt.xlabel("Activity Level")
plt.ylabel("Step Count")
plt.show()


NameError: name '_fetch_figure_metadata' is not defined

In [87]:
from scipy.stats import f_oneway

## create a table for each activity level group
sed = final_table[final_table["Activity Level"] == "Sedentary"]  ## sedentary rows
act = final_table[final_table["Activity Level"] == "Active"] ## active rows
hact = final_table[final_table["Activity Level"] == "Highly Active"] ## highly active rows 

stat, pval = f_oneway(sed['Step Count'], act['Step Count'], hact['Step Count'])


print(f"ANOVA test statistic: {stat:.3f}, p-value: {pval:.3f}")


ANOVA test statistic: nan, p-value: nan


Is there significant difference between the groups' step counts?  

ANSWER:

Yes

**Question 16**

If you were to present your findings from this dataset to a broader audience (such as policymakers, healthcare providers, or the general public), how would you communicate key insights responsibly? What considerations would you take into account to avoid misrepresenting the data or reinforcing biases?

ANSWER:

I would use simple and non-technical language when explaining trends. Fow example; visualizations (charts, graphs, tables) should be easy to read and accurately represent the data.
I would also acknowledge missing data, outliers, and imputation methods and explain that imputing missing values (e.g., using the mean) may introduce bias into the results.

