# Assignment 2 - Individual Checkpoint 1  

**Driving Problem:** Do they avoid inactivity in at least 10 hours a day?
   
**Author:** Jingwei Lin  
  
**Group ID:** CC08-3

## Overview
This notebook explores step-count data for three people (IDs 1844505072, 1927972279 and 2022484408), using their daily and minute-level step records to determine whether each person is sufficiently active.  

## Assumptions and Predictions
Our primary assumption is that individuals are inactive during certain hours, particularly at night, and may have periods of inactivity during the day. We predict:
1. There may be some missing data.
2. Individuals may have distinct daily patterns in step counts.

## Loading Data
Start date: 15/09/2024  
End date: 15/09/2024  
  
Use the pandas `read_csv` function to read all three files.  
Filter each DataFrame to the three target users.  
Show the dimensions of the three filtered DataFrames along with sample rows.

In [19]:
import pandas as pd

daily = pd.read_csv('data/dailySteps_merged.csv')
hourly = pd.read_csv('data/hourlySteps_merged.csv')
minute = pd.read_csv('data/minuteStepsWide_merged.csv')

# Get unique Id values (in order of first appearance)
unique_ids = daily['Id'].unique()

# Select the 3 target users: positions 3 to 5 in that order
selected_ids = unique_ids[3:6]

# Filter each dataset to include only the 3 selected users
filtered_daily = daily[daily['Id'].isin(selected_ids)]
filtered_hourly = hourly[hourly['Id'].isin(selected_ids)]
filtered_minute = minute[minute['Id'].isin(selected_ids)]

# Display the dimensions and the first few rows of each dataset
print(f"Daily Data Dimension: {filtered_daily.shape[0]} x {filtered_daily.shape[1]}")
print("Daily Data Sample:")
display(filtered_daily.head())

print(f"Hourly Data Dimension: {filtered_hourly.shape[0]} x {filtered_hourly.shape[1]}")
print("Hourly Data Sample:")
display(filtered_hourly.head())

print(f"Minute Data Dimension: {filtered_minute.shape[0]} x {filtered_minute.shape[1]}")
print("Minute Data Sample:")
display(filtered_minute.head())

# Start date: 15/09/2024, End date: 15/09/2024 

Daily Data Dimension: 93 x 3
Daily Data Sample:


| | Id | ActivityDay | StepTotal |
|---|---|---|---|
| 92 | 1844505072 | 4/12/2016 | 6697 |
| 93 | 1844505072 | 4/13/2016 | 4929 |
| 94 | 1844505072 | 4/14/2016 | 7937 |
| 95 | 1844505072 | 4/15/2016 | 3844 |
| 96 | 1844505072 | 4/16/2016 | 3414 |


Hourly Data Dimension: 2203 x 3
Hourly Data Sample:


| | Id | ActivityHour | StepTotal |
|---|---|---|---|
| 2161 | 1844505072 | 4/12/2016 12:00:00 AM | 0 |
| 2162 | 1844505072 | 4/12/2016 1:00:00 AM | 0 |
| 2163 | 1844505072 | 4/12/2016 2:00:00 AM | 0 |
| 2164 | 1844505072 | 4/12/2016 3:00:00 AM | 0 |
| 2165 | 1844505072 | 4/12/2016 4:00:00 AM | 0 |


Minute Data Dimension: 2182 x 62
Minute Data Sample:


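| | Id | ActivityHour | Steps00 | Steps01 | Steps02 | Steps03 | Steps04 | Steps05 | Steps06 | Steps07 | ... | Steps50 | Steps51 | Steps52 | Steps53 | Steps54 | Steps55 | Steps56 | Steps57 | Steps58 | Steps59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2132 | 1844505072 | 4/13/2016 12:00:00 AM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2133 | 1844505072 | 4/13/2016 1:00:00 AM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2134 | 1844505072 | 4/13/2016 2:00:00 AM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2135 | 1844505072 | 4/13/2016 3:00:00 AM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2136 | 1844505072 | 4/13/2016 4:00:00 AM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |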


All three CSV files have an **Id** column identifying the user and a time column (**ActivityDay** or **ActivityHour**) giving when the data was recorded. **minuteStepsWide_merged.csv** also has the columns **Steps00, Steps01, ..., Steps59**, which hold the step count for each minute of that hour. 
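The wide minute layout can be awkward for time-based analysis, so the following is a minimal sketch of reshaping it into one row per minute. It runs on a small synthetic sample (only three of the sixty minute columns, with made-up step values), and the names `wide_demo`, `long_demo`, `MinuteCol`, `Minute` and `ActivityMinute` are illustrative assumptions rather than part of the dataset.

```python
import pandas as pd

# Synthetic sample in the same wide layout (3 of the 60 minute columns, made-up values).
wide_demo = pd.DataFrame({
    "Id": [1, 1],
    "ActivityHour": ["4/13/2016 12:00:00 AM", "4/13/2016 1:00:00 AM"],
    "Steps00": [0, 5],
    "Steps01": [0, 7],
    "Steps02": [3, 0],
})

# One row per (Id, hour, minute column) instead of one column per minute.
long_demo = wide_demo.melt(
    id_vars=["Id", "ActivityHour"], var_name="MinuteCol", value_name="Steps"
)

# "Steps07" -> 7: the two-digit suffix is the minute within the hour.
long_demo["Minute"] = long_demo["MinuteCol"].str[-2:].astype(int)

# Combine the hour timestamp and the minute offset into one timestamp per minute.
hour_start = pd.to_datetime(long_demo["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p")
long_demo["ActivityMinute"] = hour_start + pd.to_timedelta(long_demo["Minute"], unit="min")

long_demo = long_demo.sort_values(["Id", "ActivityMinute"]).reset_index(drop=True)
print(long_demo[["Id", "ActivityMinute", "Steps"]])
```

In this long form, each step count carries its own timestamp, which makes grouping by hour of day or by date a plain `groupby`.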

## Missing Values
Start date: 15/09/2024  
End date: 15/09/2024  
  
Before analyzing the data, we check for missing values to make sure all three datasets are complete.

In [20]:
# print the sum of null value for each column of all 3 dataframes
print("Daily Data:")
print(filtered_daily.isnull().sum())
print("-" * 40)  # Separator line
print("Hourly Data:")
print(filtered_hourly.isnull().sum())
print("-" * 40)  # Separator line
print("Minute Data:")
print(filtered_minute.isnull().sum())
print("-" * 40)  # Separator line

# Start date: 15/09/2024, End date: 15/09/2024 

Daily Data:
Id             0
ActivityDay    0
StepTotal      0
dtype: int64
----------------------------------------
Hourly Data:
Id              0
ActivityHour    0
StepTotal       0
dtype: int64
----------------------------------------
Minute Data:
Id              0
ActivityHour    0
Steps00         0
Steps01         0
Steps02         0
               ..
Steps55         0
Steps56         0
Steps57         0
Steps58         0
Steps59         0
Length: 62, dtype: int64
----------------------------------------


None of the three datasets contains missing values: every column reports 0 nulls.
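A null-cell check cannot detect hours that were never recorded at all. The filtered hourly table has 2203 rows, whereas three users with 31 complete days each would give 3 × 31 × 24 = 2232, so some hours may be absent (possibly just a partial first or last day). The sketch below shows one way to look for such gaps; it uses a synthetic sample, and `hourly_demo` and `missing_hours` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Synthetic hourly sample: the 2:00 AM row is absent rather than null.
hourly_demo = pd.DataFrame({
    "Id": [1, 1, 1],
    "ActivityHour": [
        "4/12/2016 12:00:00 AM",
        "4/12/2016 1:00:00 AM",
        "4/12/2016 3:00:00 AM",
    ],
    "StepTotal": [0, 0, 12],
})
hourly_demo["ActivityHour"] = pd.to_datetime(
    hourly_demo["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p"
)

def missing_hours(user_df):
    """Return hourly timestamps between the first and last record that have no row."""
    expected = pd.date_range(
        user_df["ActivityHour"].min(),
        user_df["ActivityHour"].max(),
        freq=pd.Timedelta(hours=1),
    )
    return expected.difference(pd.DatetimeIndex(user_df["ActivityHour"]))

# One DatetimeIndex of absent hours per user.
gaps = {user_id: missing_hours(group) for user_id, group in hourly_demo.groupby("Id")}

print("Null cells:", hourly_demo.isnull().sum().sum())
for user_id, absent in gaps.items():
    print(f"User ID {user_id} is missing {len(absent)} hourly rows: {list(absent)}")
```

This only finds gaps between a user's first and last record; hours missing before the first or after the last record would need an explicit expected date range.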

## Number of days of data for each person
Start date: 15/09/2024  
End date: 15/09/2024  
  
We need to check the number of days of data for all three people to ensure we have enough data for a complete and accurate analysis of their activity patterns.

In [21]:
# Filter the daily DataFrame by user and print the number of days of data
for user_id in filtered_daily['Id'].unique():
    user_days = filtered_daily[filtered_daily['Id'] == user_id]
    print(f"User ID {user_id} has {len(user_days)} days of data.")

# Start date: 15/09/2024, End date: 15/09/2024 

User ID 1844505072 has 31 days of data.
User ID 1927972279 has 31 days of data.
User ID 2022484408 has 31 days of data.


All three people have 31 days recorded, so the amount of daily data is consistent across users.
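Counting rows assumes each row is a distinct date. As a hedged alternative, the sketch below counts unique dates per user and compares that with the calendar span between the first and last date. The sample is synthetic (the values for user 2 are made up), and `daily_demo`, `coverage` and `span_days` are illustrative names.

```python
import pandas as pd

# Synthetic daily sample: user 2 has no row for 4/13/2016.
daily_demo = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "ActivityDay": ["4/12/2016", "4/13/2016", "4/14/2016", "4/12/2016", "4/14/2016"],
    "StepTotal": [6697, 4929, 7937, 0, 152],
})
daily_demo["ActivityDay"] = pd.to_datetime(daily_demo["ActivityDay"], format="%m/%d/%Y")

# Unique dates plus first and last date for each user.
coverage = daily_demo.groupby("Id")["ActivityDay"].agg(
    days="nunique", first_day="min", last_day="max"
)

# Calendar days spanned, inclusive; a value larger than `days` means some dates are absent.
coverage["span_days"] = (coverage["last_day"] - coverage["first_day"]).dt.days + 1
print(coverage)
```

When `days` equals `span_days` for a user, every date in that user's range has a row.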

## Daily Step Count Information
Start date: 15/09/2024  
End date: 16/09/2024  
  
Before interpreting the daily step counts, we summarize them for each user. The **calculate_daily_stats** function computes the average, maximum and minimum step count, plus the median as an additional observation.

In [22]:
def calculate_daily_stats(df):
    """
    Calculate daily statistics for the step count data.
    Returns the average, max, min, and median step count, each truncated to an integer.
    """
    avg_steps = int(df['StepTotal'].mean())
    max_steps = int(df['StepTotal'].max())
    min_steps = int(df['StepTotal'].min())
    median_steps = int(df['StepTotal'].median())
    
    return avg_steps, max_steps, min_steps, median_steps

daily_stats = []
# Calculate and print daily statistics for each user
for user_id in filtered_daily['Id'].unique():
    daily_data = filtered_daily[filtered_daily['Id'] == user_id]
    daily_avg, daily_max, daily_min, daily_median = calculate_daily_stats(daily_data)
    # Append the statistics as a dictionary to the list
    daily_stats.append({
        'User ID': user_id,
        'Average Steps': daily_avg,
        'Maximum Steps': daily_max,
        'Minimum Steps': daily_min,
        'Median': daily_median 
    })
    # print(f"User ID: {user_id}")
    # print(f"  Average Steps: {daily_avg}")
    # print(f"  Maximum Steps: {daily_max}")
    # print(f"  Minimum Steps: {daily_min}")
    # print(f"  Median: {daily_median}")  
    # print("-" * 40)  # Separator line

# Convert the list of dictionaries into a DataFrame
daily_stats_df = pd.DataFrame(daily_stats)

# Print the DataFrame
display(daily_stats_df)

# Start date: 15/09/2024, End date: 16/09/2024 

| | User ID | Average Steps | Maximum Steps | Minimum Steps | Median |
|---|---|---|---|---|---|
| 0 | 1844505072 | 2580 | 8054 | 0 | 2237 |
| 1 | 1927972279 | 916 | 3790 | 0 | 152 |
| 2 | 2022484408 | 11370 | 18387 | 3292 | 11548 |


The resulting DataFrame displays daily statistics for all three users. The columns are the **User ID**, **Average Steps**, **Maximum Steps**, **Minimum Steps**, and the **Median** steps taken by each user.  

The result table indicates that:
1. **User ID 1844505072:** This user has a relatively low average step count (2,580) with considerable variability: the maximum (8,054) is about three times the average, and the minimum is 0.

2. **User ID 1927972279:** This user has the lowest average step count (916). The median (152) is far below the average, so most days show almost no steps, and a few higher-activity days (up to 3,790) pull the average up.

3. **User ID 2022484408:** This user consistently has high step counts, with the highest average (11,370), maximum (18,387) and minimum (3,292), indicating a high level of daily activity.
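The same kind of per-user summary can be produced without an explicit loop. The following is a minimal sketch using `groupby` and `agg` on synthetic data (not the real step counts); the `zero_days` column is an added illustration of how many days recorded no steps at all, and the names `daily_demo` and `summary` are assumptions.

```python
import pandas as pd

# Synthetic daily totals: user 1 is mostly inactive with one busy day, user 2 is steadily active.
daily_demo = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2, 2],
    "StepTotal": [0, 100, 200, 3700, 9000, 11000, 12000, 14000],
})

# Mean, median, max and min of the daily totals for each user.
summary = daily_demo.groupby("Id")["StepTotal"].agg(["mean", "median", "max", "min"])

# Number of days per user with a total of exactly 0 steps.
summary["zero_days"] = daily_demo["StepTotal"].eq(0).groupby(daily_demo["Id"]).sum()

print(summary)
```

For the synthetic user 1, the median (150) sits far below the mean (1000), which is the same skew pattern described above for a user whose average is driven by a few active days.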

## Minute Step Count Information
Start date: 15/09/2024  
End date: 16/09/2024  
  
Next, we analyze the step counts at the minute level for each user. The **calculate_minute_stats** function computes the number of non-zero minutes, the amount of missing data, and the average, maximum and minimum steps, plus the standard deviation as an additional observation.

In [23]:
def calculate_minute_stats(df):
    """
    This function calculates minute-level statistics for the step count data.
    It returns the number of non-zero minutes, missing data, average, max, min, and standard deviation of steps.
    """
    step_columns = [col for col in df.columns if 'Steps' in col]
    # Each row is one hour, so this counts the hourly rows that contain at least
    # one step; it does not count the individual minutes with steps
    non_zero_minutes = (df[step_columns].sum(axis=1) > 0).sum()
    
    # Missing data is counted as null cells across all step columns
    missing_data = df[step_columns].isnull().sum().sum()
    
    # Statistics are computed per minute column and then combined;
    # int() truncates the result to a whole number
    avg_steps = int(df[step_columns].mean().mean())
    max_steps = int(df[step_columns].max().max())
    min_steps = int(df[step_columns].min().min())
    # Mean of the 60 per-column standard deviations
    std_steps = int(df[step_columns].std().mean())
    
    return non_zero_minutes, missing_data, avg_steps, max_steps, min_steps, std_steps

minute_stats = []

# Calculate minute-level statistics for each user
for user_id in filtered_minute['Id'].unique():
    minute_data = filtered_minute[filtered_minute['Id'] == user_id]
    minute_non_zero, minute_missing, minute_avg, minute_max, minute_min, minute_std = calculate_minute_stats(minute_data)
    
    # Append the statistics as a dictionary to the list
    minute_stats.append({
        'User ID': user_id,
        'Non-Zero Minutes': minute_non_zero,
        'Missing Data': minute_missing,
        'Average Steps': minute_avg,
        'Maximum Steps': minute_max,
        'Minimum Steps': minute_min,
        'Standard Deviation': minute_std
    })

    # print(f"User ID: {user_id}")
    # print(f"  Non-Zero Minutes: {minute_non_zero}")
    # print(f"  Missing Data: {minute_missing}")
    # print(f"  Average Steps: {minute_avg}")
    # print(f"  Maximum Steps: {minute_max}")
    # print(f"  Minimum Steps: {minute_min}")
    # print(f"  Standard Deviation: {minute_std}") 
    # print("-" * 40) 

# Convert the list of dictionaries into a DataFrame
minute_stats_df = pd.DataFrame(minute_stats)

# Display the DataFrame
display(minute_stats_df)

# Start date: 15/09/2024, End date: 16/09/2024 

| | User ID | Non-Zero Minutes | Missing Data | Average Steps | Maximum Steps | Minimum Steps | Standard Deviation |
|---|---|---|---|---|---|---|---|
| 0 | 1844505072 | 192 | 0 | 1 | 115 | 0 | 7 |
| 1 | 1927972279 | 123 | 0 | 0 | 117 | 0 | 5 |
| 2 | 2022484408 | 386 | 0 | 7 | 176 | 0 | 21 |


The resulting DataFrame summarizes minute-level statistics for all three users, with columns for **User ID**, **Non-Zero Minutes**, **Missing Data**, **Average Steps**, **Maximum Steps**, **Minimum Steps**, and **Standard Deviation**. As computed, **Non-Zero Minutes** counts the hourly rows that contain at least one step, and the averages and standard deviations are truncated to whole numbers. This summary helps us understand the distribution and consistency of each user's step counts across different minutes.  

The result table shows that:
1. **User ID 1844505072:** This user has a very low average of about 1 step per minute but a maximum of 115 steps in a single minute. The low average and relatively high maximum suggest occasional bursts of activity amidst generally low activity levels.

2. **User ID 1927972279:** This user's average truncates to 0 steps per minute, indicating minimal overall activity, with occasional counts reaching up to 117. These bursts are too rare to lift the overall average.

3. **User ID 2022484408:** This user has the highest average (7 steps per minute) and the greatest variability, with a standard deviation of 21 and a maximum of 176 steps. This suggests more frequent and more varied activity throughout the day than the other users.
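Because each row of the wide minute table is one hour, `(df[step_columns].sum(axis=1) > 0).sum()` in `calculate_minute_stats` counts hours with any steps rather than individual minutes. The sketch below contrasts the two counts on a synthetic sample (three minute columns, made-up values); `minute_demo`, `hours_with_steps` and `overall_mean` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Synthetic wide sample: 3 hourly rows, 3 of the 60 minute columns.
minute_demo = pd.DataFrame({
    "Id": [1, 1, 1],
    "ActivityHour": [
        "4/13/2016 12:00:00 AM",
        "4/13/2016 1:00:00 AM",
        "4/13/2016 2:00:00 AM",
    ],
    "Steps00": [0, 4, 0],
    "Steps01": [0, 9, 0],
    "Steps02": [0, 0, 2],
})

step_columns = [col for col in minute_demo.columns if col.startswith("Steps")]
steps = minute_demo[step_columns]

# Rows (hours) in which at least one minute has steps.
hours_with_steps = int((steps.sum(axis=1) > 0).sum())

# Individual minute cells with a step count above zero.
non_zero_minutes = int((steps > 0).sum().sum())

# Mean over every minute cell, kept as a float so small averages are not truncated to 0.
overall_mean = float(steps.to_numpy().mean())

print(f"Hours with steps: {hours_with_steps}")
print(f"Non-zero minutes: {non_zero_minutes}")
print(f"Mean steps per minute: {overall_mean:.2f}")
```

Keeping the mean as a float avoids the situation in the table above where an average below 1 step per minute is shown as 0.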

## Conclusion
Start date: 16/09/2024  
End date: 16/09/2024  
  
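From the data exploration, I found:
- No null values were observed in any of the three datasets.  
- Step counts vary widely among individuals, with distinct patterns emerging in both the daily and the minute data.
- All results are stored in DataFrames (`daily_stats_df` and `minute_stats_df`) for further analysis.  

**Relation to Driving Problem:** The analysis shows that activity levels differ markedly among the three users. User 2022484408 recorded at least 3,292 steps on every day, while users 1844505072 and 1927972279 each have days with 0 steps and are more prone to inactivity. These summaries do not yet count active hours per day, so whether each user avoids inactivity in at least 10 hours a day still has to be checked against the hourly data.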
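One possible way to test the driving problem directly is to count, for each user and day, the hours that contain any steps and compare that count with 10. The sketch below is only an illustration on synthetic hourly data. Treating an hour as active when `StepTotal > 0` is an assumption, as are the names `hourly_demo`, `MIN_ACTIVE_HOURS`, `active_hours` and `share_of_days`.

```python
import pandas as pd

# Synthetic hourly data for one user over two days.
# Day 1 (12 April): steps in hours 8 to 19, i.e. 12 active hours.
# Day 2 (13 April): steps in hours 9 to 13, i.e. 5 active hours.
hours = pd.date_range("2016-04-12", periods=48, freq=pd.Timedelta(hours=1))
step_totals = [
    100 if (ts.day == 12 and 8 <= ts.hour < 20) or (ts.day == 13 and 9 <= ts.hour < 14) else 0
    for ts in hours
]
hourly_demo = pd.DataFrame({"Id": 1, "ActivityHour": hours, "StepTotal": step_totals})

MIN_ACTIVE_HOURS = 10  # threshold taken from the driving problem

# Assumption: an hour counts as active if it has at least one step.
is_active = hourly_demo["StepTotal"] > 0

# Number of active hours per user and calendar day.
day = hourly_demo["ActivityHour"].dt.date.rename("Day")
active_hours = is_active.groupby([hourly_demo["Id"], day]).sum()

# True for days that reach the threshold, then the share of such days per user.
meets_target = active_hours >= MIN_ACTIVE_HOURS
share_of_days = meets_target.groupby(level="Id").mean()

print(active_hours)
print(share_of_days)
```

A stricter definition of an active hour, for example a minimum number of steps, would only change the `is_active` line.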