# Individual Checkpoint 1: Personal data exploration
Author: Erwann LE POUL
Group ID: CC08-03

Overview of the notebook:
In this notebook we analyse three individuals across the following datasets: dailySteps_merged, hourlySteps_merged and minuteStepsWide_merged. We explore the data for the chosen participants (user IDs 2873212765, 3372868164 and 3977333714) and report some statistical summaries. This data exploration step will help us complete assignment 2, whose driving problem is "Do they avoid inactivity in at least 10 hours a day?".

Initial assumptions and predictions:
1. Data integrity
2. Representativeness
3. Independence of observations
4. No external influences
5. Consistent measuring
6. Absence of outliers
7. Absence of null values

## Exploratory Data Analysis
### Loading the packages
Beginning of Work: 13th September 2024, End of Work: 13th September 2024

In [5]:
# Importing the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


### Loading the dataset, Cleaning the data and determining the Individuals for checkpoint
Beginning of Work: 15th September 2024, End of Work: 15th September 2024

In [8]:
# Load and inspect step count data across different time intervals
# The dataset includes step data collected at daily, hourly, and minute intervals

# Loading the daily, hourly, and minute datasets
df_day = pd.read_csv("dailySteps_merged.csv")
df_hour = pd.read_csv("hourlySteps_merged.csv")
df_minute = pd.read_csv("minuteStepsWide_merged.csv")

The 3 datasets are now loaded and we can move on to cleaning the data.

In [11]:
# Check for missing values in each dataset to ensure data quality
# isnull().any() will return True for columns with missing data
print("Checking for missing values in daily data:", df_day.isnull().any())
print("Checking for missing values in hourly data:", df_hour.isnull().any())
print("Checking for missing values in minute data:", df_minute.isnull().any())

Checking for missing values in daily data: Id             False
ActivityDay    False
StepTotal      False
dtype: bool
Checking for missing values in hourly data: Id              False
ActivityHour    False
StepTotal       False
dtype: bool
Checking for missing values in minute data: Id              False
ActivityHour    False
Steps00         False
Steps01         False
Steps02         False
                ...  
Steps55         False
Steps56         False
Steps57         False
Steps58         False
Steps59         False
Length: 62, dtype: bool


None of the loaded datasets contains null values, so we can move on to determining the individuals chosen for this checkpoint. Note: the individuals I chose for this checkpoint were agreed on during the Thursday tutorial.
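
The printed output of `isnull().any()` is truncated for the wide minute table, so a single count per dataset can be easier to read. The sketch below is one possible way to do this; `total_missing` and the `toy` table are hypothetical names and synthetic values used only for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def total_missing(df: pd.DataFrame) -> int:
    """Return the total number of null cells in a DataFrame."""
    # First sum() counts nulls per column, second sum() adds the columns up
    return int(df.isnull().sum().sum())


# Tiny synthetic stand-in for a step-count table (not the real data)
toy = pd.DataFrame({
    "Id": [1, 1, 2],
    "StepTotal": [100, np.nan, 250],
})

missing_in_toy = total_missing(toy)
print("Null cells in toy data:", missing_in_toy)
```

Applied to `df_day`, `df_hour` and `df_minute`, a result of 0 for each would match the output shown above.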

In [14]:
# Extracting unique user IDs to avoid duplicates in analysis
# This helps focus on distinct users
unique_day_ids = df_day["Id"].unique()
unique_hour_ids = df_hour["Id"].unique()
unique_minute_ids = df_minute["Id"].unique()

# Displaying three specific user IDs (index 9, 10, 11) from each dataset
# These IDs will be used for further analysis
print("Selected user IDs from daily data:", unique_day_ids[9], unique_day_ids[10], unique_day_ids[11])
print("Selected user IDs from hourly data:", unique_hour_ids[9], unique_hour_ids[10], unique_hour_ids[11])
print("Selected user IDs from minute data:", unique_minute_ids[9], unique_minute_ids[10], unique_minute_ids[11])

# Assigning selected user IDs to variables for ease of use in future steps
id9 = unique_day_ids[9]
id10 = unique_day_ids[10]
id11 = unique_day_ids[11]

# Now, id9, id10, and id11 represent the three unique user IDs across all datasets

Selected user IDs from daily data: 2873212765 3372868164 3977333714
Selected user IDs from hourly data: 2873212765 3372868164 3977333714
Selected user IDs from minute data: 2873212765 3372868164 3977333714
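
Picking positions 9, 10 and 11 from each `unique()` result only gives the same people if the three files list users in the same order. A more explicit check, sketched here on synthetic data, is to confirm that every selected ID occurs in every dataset. The helper name `ids_present_everywhere` and the toy tables are assumptions made for this example.

```python
import pandas as pd


def ids_present_everywhere(selected, *frames):
    """Return True if every selected Id occurs in the 'Id' column of every frame."""
    wanted = set(selected)
    return all(wanted <= set(frame["Id"].tolist()) for frame in frames)


# Synthetic stand-ins for the daily, hourly and minute tables
toy_day = pd.DataFrame({"Id": [10, 20, 30]})
toy_hour = pd.DataFrame({"Id": [10, 10, 20, 30]})
toy_minute = pd.DataFrame({"Id": [10, 20]})  # Id 30 deliberately missing

ok_two = ids_present_everywhere([10, 20], toy_day, toy_hour, toy_minute)
ok_three = ids_present_everywhere([10, 20, 30], toy_day, toy_hour, toy_minute)
print(ok_two, ok_three)
```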


### Daily Steps Analysis:
Beginning of Work: 13th September 2024, End of Work: 13th September 2024
We will now analyse the usage of the fitness watch by determining the number of days each individual used the device based on their unique IDs.


In [17]:
# Filtering activity data for each user by their ID
# The 'isin()' function helps to select rows where the 'Id' matches the specified user IDs
# Extracting 'ActivityDay' for each individual (id9, id10, id11)

id9_activity_day = df_day[df_day['Id'].isin([id9])]['ActivityDay']
id10_activity_day = df_day[df_day['Id'].isin([id10])]['ActivityDay']
id11_activity_day = df_day[df_day['Id'].isin([id11])]['ActivityDay']

# Determining the number of unique days each individual was active (used the device)
# We use 'unique()' on the 'ActivityDay' column to find the distinct days and then count them with len()
id9_nb_days = len(id9_activity_day.unique())
id10_nb_days = len(id10_activity_day.unique())
id11_nb_days = len(id11_activity_day.unique())

# Displaying the selected user IDs and their corresponding active days
print("The IDs of the individuals selected:", id9, id10, id11)
print("The number of active days for these individuals (in the same order):", id9_nb_days, id10_nb_days, id11_nb_days)

# Summary:
# id9_nb_days, id10_nb_days, and id11_nb_days now contain the number of unique days
# each individual used the device. This can provide insights into user engagement over time.

The IDs of the individuals selected: 2873212765 3372868164 3977333714
The number of active days for these individuals (in the same order): 31 20 30


We can see that some individuals (index 10 and 11) did not wear the device every day, but that should not have a major effect on the exploration of the data and the statistical summaries. We can now move on to the statistical summaries for this dataset.
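
To see which days are missing rather than only how many, the recorded dates can be compared against a full calendar range. This is a minimal sketch on synthetic dates. It assumes `ActivityDay` is stored as month/day/year text (the `date_format` default is a guess, not confirmed by the output above), and `missing_days` is a hypothetical helper. It only finds gaps between the first and last recorded day, so it would not detect a user who stopped early.

```python
import pandas as pd


def missing_days(activity_days: pd.Series, date_format: str = "%m/%d/%Y") -> pd.DatetimeIndex:
    """Return calendar days between the first and last record that have no row."""
    days = pd.to_datetime(activity_days, format=date_format)
    # Every calendar day from the first to the last recorded day
    full_range = pd.date_range(days.min(), days.max(), freq="D")
    # Keep only the days that never appear in the data
    return full_range.difference(pd.DatetimeIndex(days))


# Synthetic example: 14 April is skipped
toy_days = pd.Series(["4/12/2016", "4/13/2016", "4/15/2016"])
gaps = missing_days(toy_days)
print(list(gaps.strftime("%Y-%m-%d")))
```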

Beginning of Work: 15th September 2024, End of Work: 15th September 2024

In [20]:
# Statistical summary for step count data of selected users
# We aim to extract daily step totals and calculate key statistics (min, max, median, and mean) for three users

# Extracting daily step totals for each user (id9, id10, id11) from the dataset
id9_daily_step_total = df_day[df_day['Id'].isin([id9])]['StepTotal']
id10_daily_step_total = df_day[df_day['Id'].isin([id10])]['StepTotal']
id11_daily_step_total = df_day[df_day['Id'].isin([id11])]['StepTotal']

# Calculating the minimum number of steps taken each day by the selected users
# min() finds the lowest step count in the 'StepTotal' column for each user
id9_min_step = int(id9_daily_step_total.min())
id10_min_step = int(id10_daily_step_total.min())
id11_min_step = int(id11_daily_step_total.min())

# Calculating the maximum number of steps taken each day by the selected users
# max() finds the highest step count in the 'StepTotal' column for each user
id9_max_step = int(id9_daily_step_total.max())
id10_max_step = int(id10_daily_step_total.max())
id11_max_step = int(id11_daily_step_total.max())

# Determining the median step count for each user
# median() provides the middle value when the step counts are sorted, offering insight into typical daily activity
id9_median_step = int(id9_daily_step_total.median())
id10_median_step = int(id10_daily_step_total.median())
id11_median_step = int(id11_daily_step_total.median())

# Calculating the mean step count for each user
# mean() returns the average step count, representing the overall activity level for each user
id9_mean_step = int(id9_daily_step_total.mean())
id10_mean_step = int(id10_daily_step_total.mean())
id11_mean_step = int(id11_daily_step_total.mean())

# Organizing the statistical results into a dictionary for tabular display
# Each key represents a column (e.g., 'ID', 'Daily Min Steps'), and values are the statistics for the selected users
daily_stat = {
    'ID': [id9, id10, id11],
    'Daily Min Steps': [id9_min_step, id10_min_step, id11_min_step],
    'Daily Max Steps': [id9_max_step, id10_max_step, id11_max_step],
    'Daily Median Steps': [id9_median_step, id10_median_step, id11_median_step],
    'Daily Average Step': [id9_mean_step, id10_mean_step, id11_mean_step]
}

# Creating a DataFrame to display the statistics in a structured table format
table_day = pd.DataFrame(daily_stat)

# Displaying the table with the calculated statistics for each user
print(table_day)

           ID  Daily Min Steps  Daily Max Steps  Daily Median Steps  \
0  2873212765             2524             9685                7762   
1  3372868164             3077             9715                7150   
2  3977333714              746            16520               11604   

   Daily Average Step  
0                7555  
1                6861  
2               10984  


From the table given above we can now have a look at the daily minimum, maximum, median and average step count. Here is what we can extract from the table:

ID 2873212765:
- The daily step count ranges between 2,524 and 9,685 steps.
- The median step count is 7,762, indicating that half of the days this user took more than 7,762 steps.
- The average daily step count is 7,555, showing consistent activity levels close to the median.

ID 3372868164:
- This user's step count ranges from 3,077 to 9,715 steps.
- The median step count is 7,150, lower than the first user's, indicating somewhat less activity on half of the days.
- The average step count of 6,861 is lower than the median, suggesting some days with significantly fewer steps than others.

ID 3977333714:
- This individual has the most variation in step count, ranging from as low as 746 steps to as high as 16,520.
- A high median (11,604) and average (10,984) indicate that this user is consistently more active, despite occasional low activity days.

#### Conclusion:
The third individual (ID 3977333714) is the most active overall, with significantly higher median and average steps compared to the other two. The data suggests varying patterns of activity across all three users, with one user showing more consistent daily activity (ID 2873212765), while another (ID 3977333714) exhibits more fluctuation but generally higher step counts.
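
The same kind of summary can be produced more compactly with `groupby` and `agg`. The sketch below uses synthetic numbers and adds the standard deviation, which is not in the table above but gives one possible measure of the "consistency" discussed in the conclusion. The names `toy_day` and `selected_ids` are assumptions for this example.

```python
import pandas as pd

# Synthetic daily totals for two made-up users (not the real data)
toy_day = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "StepTotal": [1000, 3000, 8000, 500, 700, 900],
})
selected_ids = [1, 2]

# One row per user, one column per statistic
summary = (
    toy_day[toy_day["Id"].isin(selected_ids)]
    .groupby("Id")["StepTotal"]
    .agg(["min", "max", "median", "mean", "std"])
)
print(summary)
```

A smaller `std` relative to the mean would support describing a user as more consistent from day to day.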

### Hourly Steps Analysis:
Beginning of Work: 15th September 2024, End of Work: 15th September 2024
We will now analyse the hourly step data for each individual, identified by their unique IDs, and compute some statistical summaries.

In [25]:
# Exploratory Data Analysis: Hourly Step Count
# This section analyzes step count data on an hourly basis for three specific users

# Extracting hourly step data for the selected users (id9, id10, id11)
# The 'isin()' function is used to filter the dataframe for these users, focusing on 'StepTotal'
id9_hour_step_total = df_hour[df_hour['Id'].isin([id9])]['StepTotal']
id10_hour_step_total = df_hour[df_hour['Id'].isin([id10])]['StepTotal']
id11_hour_step_total = df_hour[df_hour['Id'].isin([id11])]['StepTotal']

# Calculating the minimum number of steps each user took in any given hour
# The min() function identifies the lowest value in the 'StepTotal' column
id9_min_step = int(id9_hour_step_total.min())
id10_min_step = int(id10_hour_step_total.min())
id11_min_step = int(id11_hour_step_total.min())

# Calculating the maximum number of steps taken in a single hour by each user
# max() provides the highest step count recorded within an hour
id9_max_step = int(id9_hour_step_total.max())
id10_max_step = int(id10_hour_step_total.max())
id11_max_step = int(id11_hour_step_total.max())

# Computing the median number of steps taken per hour for each user
# median() gives the middle value, which is useful for understanding typical hourly activity
id9_median_step = int(id9_hour_step_total.median())
id10_median_step = int(id10_hour_step_total.median())
id11_median_step = int(id11_hour_step_total.median())

# Determining the average hourly step count for each user
# mean() calculates the average steps taken per hour, providing a general activity level
id9_mean_step = int(id9_hour_step_total.mean())
id10_mean_step = int(id10_hour_step_total.mean())
id11_mean_step = int(id11_hour_step_total.mean())

# Compiling the calculated statistics into a dictionary for a clearer presentation in tabular form
hourly_stat = {
    'ID': [id9, id10, id11],
    'Hourly Min Steps': [id9_min_step, id10_min_step, id11_min_step],
    'Hourly Max Steps': [id9_max_step, id10_max_step, id11_max_step],
    'Hourly Median Steps': [id9_median_step, id10_median_step, id11_median_step],
    'Hourly Average Step': [id9_mean_step, id10_mean_step, id11_mean_step]
}

# Create the dataframe for printing as a table
table_hour = pd.DataFrame(hourly_stat)

# Display the table
print(table_hour)


           ID  Hourly Min Steps  Hourly Max Steps  Hourly Median Steps  \
0  2873212765                 0              4534                  125   
1  3372868164                 0              3084                  147   
2  3977333714                 0              5414                   79   

   Hourly Average Step  
0                  318  
1                  290  
2                  471  


The table provides a summary of hourly step count statistics for three individuals. Here are the key insights:

ID 2873212765:
- The hourly step count ranges from 0 to 4,534 steps.
- The median number of steps per hour is 125, meaning that in half the hours, this individual took fewer than 125 steps.
- On average, the individual takes 318 steps per hour, suggesting relatively moderate activity levels, with some hours showing high step counts.
  
ID 3372868164:
- This individual also has some inactive hours, with 0 steps recorded in certain hours and a maximum of 3,084 steps in the most active hour.
- The median hourly step count is 147, slightly higher than the first user.
- The average step count is 290 steps per hour, indicating a similar overall activity pattern to the first user but with slightly less variability.

ID 3977333714:
- The third individual exhibits more variability, with hourly steps ranging from 0 to 5,414 steps.
- However, the median hourly step count is only 79, meaning that in half of the hours this individual took fewer than about 79 steps, though bursts of high activity are evident in some hours.
- The average hourly step count is 471, which is significantly higher than the other two users, reflecting a more active lifestyle on average.

#### Conclusion:
The third individual (ID 3977333714) shows greater fluctuation in activity with higher overall average steps, while the first two users maintain more consistent but lower activity levels. All three individuals have periods of inactivity (0 steps), but the third individual tends to take more steps during active periods.
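
A minimum of 0 only shows that at least one inactive hour exists. The share of zero-step hours per user says how common they are. The following is a small sketch on synthetic data, with `toy_hour` and `zero_share` as hypothetical names.

```python
import pandas as pd

# Synthetic hourly totals for two made-up users (not the real data)
toy_hour = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2, 2],
    "StepTotal": [0, 0, 120, 900, 0, 50, 60, 70],
})

# Flag hours with no steps, then take the mean of the flag per user
zero_share = (
    toy_hour.assign(is_zero=toy_hour["StepTotal"].eq(0))
    .groupby("Id")["is_zero"]
    .mean()
)
print(zero_share)
```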

### Minute Steps Analysis:
Beginning of Work: 15th September 2024, End of Work: 15th September 2024
We will now analyse the minute-level step data for each individual, identified by their unique IDs, and compute some statistical summaries.

In [29]:
# Exploratory Data Analysis (EDA): Minute Step Count
# This analysis focuses on evaluating step counts at the minute level for the selected individuals.

# Statistical Summary for Minute Step Count:
# Defining a function to count the number of non-zero minutes (minutes with steps recorded)
def count_non_zero_minutes(df):
    # Identify columns starting with 'Steps' (minute step counts)
    step_columns = [col for col in df.columns if col.startswith('Steps')]
    
    # Count the cells greater than zero in each column, then add the column counts together
    total_non_zero = (df[step_columns] > 0).sum().sum()
    
    # Return the total count of non-zero step values
    return int(total_non_zero)

# Defining a function to count missing values in step columns
def count_missing_minutes(df):
    # Identify 'Steps' columns for each minute
    step_columns = [col for col in df.columns if col.startswith('Steps')]
    
    # Count missing values in these columns
    total_missing = df[step_columns].isnull().sum().sum()
    
    # Return the count of missing values
    return int(total_missing)

# Defining a function to calculate the average number of steps per minute
def calculate_avg_steps(df):
    # Identify 'Steps' columns for minute step counts
    step_columns = [col for col in df.columns if col.startswith('Steps')]
    
    # Calculate total step count and divide by the count of valid entries
    average_steps = df[step_columns].sum().sum() / df[step_columns].count().sum()
    
    # Return the average step count truncated to an integer (the decimal part is dropped)
    return int(average_steps)

# Defining a function to find the maximum and minimum steps in any minute
def calculate_max_min_steps(df):
    # Identify 'Steps' columns for minute step counts
    step_columns = [col for col in df.columns if col.startswith('Steps')]
    
    # Calculate maximum and minimum values across all step columns
    max_steps = df[step_columns].max().max()
    min_steps = df[step_columns].min().min()
    
    # Return both maximum and minimum step counts
    return int(max_steps), int(min_steps)

# Extracting minute-level step data for the selected users (ID9, ID10, ID11)
id9_minute_data = df_minute[df_minute['Id'].isin([id9])]
id10_minute_data = df_minute[df_minute['Id'].isin([id10])]
id11_minute_data = df_minute[df_minute['Id'].isin([id11])]

# Counting non-zero minutes for each user
id9_non_zero_minutes = count_non_zero_minutes(id9_minute_data)
id10_non_zero_minutes = count_non_zero_minutes(id10_minute_data)
id11_non_zero_minutes = count_non_zero_minutes(id11_minute_data)

# Counting missing minutes for each user
id9_missing_minutes = count_missing_minutes(id9_minute_data)
id10_missing_minutes = count_missing_minutes(id10_minute_data)
id11_missing_minutes = count_missing_minutes(id11_minute_data)

# Calculating the average steps per minute for each user
id9_avg_steps = calculate_avg_steps(id9_minute_data)
id10_avg_steps = calculate_avg_steps(id10_minute_data)
id11_avg_steps = calculate_avg_steps(id11_minute_data)

# Determining the maximum and minimum steps taken in any minute for each user
id9_max_steps, id9_min_steps = calculate_max_min_steps(id9_minute_data)
id10_max_steps, id10_min_steps = calculate_max_min_steps(id10_minute_data)
id11_max_steps, id11_min_steps = calculate_max_min_steps(id11_minute_data)

# Storing results in a dictionary to create a summary table
minute_summary = {
    'ID': [id9, id10, id11],
    'Non-Zero Minutes': [id9_non_zero_minutes, id10_non_zero_minutes, id11_non_zero_minutes],
    'Missing Minutes': [id9_missing_minutes, id10_missing_minutes, id11_missing_minutes],
    'Average Steps per Minute': [id9_avg_steps, id10_avg_steps, id11_avg_steps],
    'Max Steps per Minute': [id9_max_steps, id10_max_steps, id11_max_steps],
    'Min Steps per Minute': [id9_min_steps, id10_min_steps, id11_min_steps]
}

# Creating a DataFrame to display the results
table_minute = pd.DataFrame(minute_summary)

# Displaying the summary table for minute step count analysis
print(table_minute)


           ID  Non-Zero Minutes  Missing Minutes  Average Steps per Minute  \
0  2873212765              7052                0                         5   
1  3372868164              4517                0                         4   
2  3977333714              8033                0                         7   

   Max Steps per Minute  Min Steps per Minute  
0                   164                     0  
1                   164                     0  
2                   190                     0  


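The table provides a detailed summary of minute-level step data for the three individuals (ID9 = 2873212765, ID10 = 3372868164 and ID11 = 3977333714).
- Non-Zero Minutes: ID11 had the highest number of active minutes, with 8,033 minutes of recorded steps, followed by ID9 with 7,052 minutes. ID10 had the fewest active minutes at 4,517. This suggests that ID11 was the most consistently active, while ID10 had less overall movement. Note that ID10 also has fewer recorded days in the daily data (20, against 31 and 30), so part of this gap may reflect less wear time rather than less movement.
- Missing Minutes: None of the users had null values in their recorded rows. This only covers the hours that were recorded, so it does not rule out days on which the device was not worn.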
- Average Steps per Minute: ID11 also had the highest average step count per minute, with 7 steps per minute, followed by ID9 with 5 steps per minute, and ID10 with 4 steps per minute. This highlights ID11's higher activity intensity compared to the other two users.
- Max and Min Steps per Minute: The maximum step count in a single minute was 190 for ID11, slightly higher than ID9 and ID10, who both had a maximum of 164. The minimum steps per minute was 0 for all users, indicating periods of inactivity across all participants.

In summary, ID11 was the most active in terms of both the duration and intensity of activity, while ID10 had the lowest activity levels overall.
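
The wide minute table (one column per minute) can also be reshaped into a long table with one row per minute, which lets the usual `groupby` tools replace the custom helper functions. This is a sketch on synthetic data with only three minute columns. The column names `Minute` and `Steps` and the variables `toy_wide`, `toy_long` and `per_user` are chosen for this example. The mean is kept as a float here, so small differences between users are not hidden by truncation.

```python
import pandas as pd

# Synthetic wide table: two hours, three minute columns (the real file has 60)
toy_wide = pd.DataFrame({
    "Id": [1, 1],
    "ActivityHour": ["4/13/2016 12:00:00 AM", "4/13/2016 1:00:00 AM"],
    "Steps00": [0, 10],
    "Steps01": [5, 0],
    "Steps02": [0, 0],
})

step_columns = [c for c in toy_wide.columns if c.startswith("Steps")]

# One row per (Id, ActivityHour, minute column)
toy_long = toy_wide.melt(
    id_vars=["Id", "ActivityHour"],
    value_vars=step_columns,
    var_name="Minute",
    value_name="Steps",
)

# Per-user summary using named aggregation
per_user = toy_long.groupby("Id")["Steps"].agg(
    non_zero_minutes=lambda s: int((s > 0).sum()),
    mean_steps="mean",
    max_steps="max",
)
print(per_user)
```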

### Final Statement:
Final statement about what I learnt from EDA and how it relates to the driving problem:

From the Exploratory Data Analysis (EDA), I learned that individuals exhibit different patterns of physical activity based on step data at daily, hourly, and minute levels. Specifically:

1. Activity Consistency vs. Variability: ID9 and ID10 tend to have more consistent but lower step counts, both daily and hourly, compared to ID11, who shows higher variability but also higher activity levels overall. This suggests that while ID9 and ID10 might avoid long periods of inactivity, they may not achieve high-intensity activity levels as frequently as ID11.

2. Key Patterns of Inactivity: All users, particularly ID10, experienced periods of inactivity, as shown by zero-step minutes and lower step counts per minute. None of the users had null values in their recorded rows, although ID10 and ID11 have fewer recorded days than ID9 in the daily data (20 and 30 against 31). Both points matter when assessing physical activity patterns over time.

3. Relation to the Driving Problem: The driving problem focuses on understanding whether individuals avoid inactivity in at least 10 hours a day. Based on the analysis, ID11 is the most likely to avoid long periods of inactivity, with a high number of active minutes and consistent bursts of activity. ID9 also shows relatively high active minutes but to a lesser extent. In contrast, ID10 may face challenges in maintaining continuous activity due to a lower number of active minutes and a reduced average steps per minute.

This analysis gives a first indication of which individuals are likely to meet the threshold of avoiding inactivity and highlights the need for tailored interventions to increase physical activity, especially for those like ID10 who may require strategies to boost both activity levels and consistency.
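
One way the driving problem could be tested directly is to count, for each user and day, the hours with any recorded steps and check whether that count reaches 10. The sketch below does this on synthetic data. It rests on assumptions that are not part of the analysis above: an hour counts as "not inactive" when its step total is above `step_threshold` (0 by default), and `share_of_days_meeting_target` is a hypothetical helper. With the real hourly file, `ActivityHour` would first need to be parsed from text.

```python
import pandas as pd


def share_of_days_meeting_target(hourly, min_active_hours=10, step_threshold=0):
    """Per user, return the share of days with at least min_active_hours active hours."""
    data = hourly.copy()
    # Truncate each timestamp to midnight so that hours group by calendar day
    data["Day"] = pd.to_datetime(data["ActivityHour"]).dt.normalize()
    # Assumption: an hour is 'active' when its step total exceeds the threshold
    data["Active"] = data["StepTotal"] > step_threshold
    active_hours = data.groupby(["Id", "Day"])["Active"].sum()
    meets_target = active_hours >= min_active_hours
    return meets_target.groupby(level="Id").mean()


# Synthetic user: 12 active hours on day one, 6 active hours on day two
rows = []
for hour in range(24):
    rows.append({"Id": 1, "ActivityHour": pd.Timestamp(2016, 4, 12, hour),
                 "StepTotal": 100 if hour < 12 else 0})
    rows.append({"Id": 1, "ActivityHour": pd.Timestamp(2016, 4, 13, hour),
                 "StepTotal": 100 if hour < 6 else 0})
toy_hour = pd.DataFrame(rows)

result = share_of_days_meeting_target(toy_hour)
print(result)
```

A stricter definition of an active hour can be explored by raising `step_threshold`, since a handful of steps in an hour may not count as avoiding inactivity.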
