# 01 - Download and Raw Inspection

# Case Study 2: How Can a Wellness Technology Company Play It Smart?

This notebook documents the process of downloading and cleaning the Kaggle Fitbit Fitness Tracker Data (2016).

### DATA SOURCES & DESCRIPTION
This section details the origin, structure, and inherent characteristics of the dataset utilized for this analysis, addressing its suitability and limitations.

#### 2.1. Primary Data Source: Fitbit Fitness Tracker Data (2016)
- Name: "Fitbit Fitness Tracker Data (2016)"
- Source: Kaggle user Möbius (arashnic). The data was collected through a distributed survey conducted via Amazon Mechanical Turk.
- Link: https://www.kaggle.com/datasets/arashnic/fitbit
- Kaggle Usability Score: 9.41 (out of 10). This high score reflects the dataset's clear structure, consistent formatting, and ease of use for general data exploration within the Kaggle environment.
- Description: This dataset comprises personal fitness tracker data from 35 anonymous users, collected between March 12, 2016, and May 12, 2016. It includes daily records for activity (steps, distance, calories burned, and active minutes), sleep (minutes asleep and total time in bed), and weight logs across multiple CSV files.

#### 2.2. Data Storage & Organization
Organization: The extracted archive contains 29 CSV files in two export folders: 11 in the 3.12.16-4.11.16 export and 18 in the 4.12.16-5.12.16 export. Most files are organized in a long format, where each row represents a single observation (activity, sleep, or weight) for a unique user (Id) at a specific date or time. The files cover four temporal levels: day, hour, minute, and second.

Storage and Tools: The raw Fitbit dataset was downloaded from Kaggle as a ZIP file and extracted locally for inspection.

#### 2.3. ROCCC Analysis & Mitigation Strategy
This assessment evaluates the reliability and appropriateness of the public dataset using the ROCCC framework (Reliable, Original, Comprehensive, Current, Cited), specifically in the context of informing Bellabeat's marketing strategy.

##### ROCCC Analysis (Data Limitations)

- Reliable: Low. The small sample size (35 users) and the crowdsourced collection method (Amazon Mechanical Turk) significantly reduce its reliability for generalization.

- Original: Medium-Low. Data is a secondary aggregation from a third-party source (Möbius on Kaggle), not direct proprietary data from Bellabeat.

- Comprehensive: Low. The dataset lacks detailed demographic information and Bellabeat-specific metrics (menstrual cycle data, stress levels) that are crucial for targeting the female audience.

- Current: Very Low. The data was collected in 2016, making it outdated and unrepresentative of current smart device technology and wellness trends.

- Cited: Medium. The source is publicly available on Kaggle with stated collection methodology (Amazon Mechanical Turk), providing transparent traceability.

##### Mitigation Strategy for Limitations
Given the low scores in Reliability, Comprehensiveness, and Currency, the analysis will incorporate external data:

- External Validation (Context): Incorporate industry churn reports (public studies on wearable retention rates) to contextualize the findings on inconsistent user habits.

- Benchmarking (Strategy): Reference digital fitness benchmarks (analyses of UX and gamification metrics such as Active Calories and METs) used by leading competitors (e.g., Oura, Apple Fitness) to inform viable recommendations.

- Internal Validation (Technical): Use statistical analysis to confirm the internal consistency of the device's sensor data, strengthening the technical credibility of the measured activity patterns.

#### 2.4. Ethical Considerations

Licensing: The dataset is designated as CC0: Public Domain, meaning it is free to use, share, and build upon for any purpose, without restriction. This provides full legal clarity for its use in this analysis.

Privacy: The data is fully anonymized, containing only numerical Ids for users and no personally identifiable information (PII). This ensures user privacy is maintained, aligning with ethical data handling practices.

Security: Data is downloaded and stored locally for analysis. Because the data is anonymized and public, the risk of direct personal harm from a breach is minimal.

Accessibility: The data is highly accessible, being freely available for download in a standard .csv format from Kaggle, requiring no special permissions or tools beyond standard spreadsheet software or a programming environment.

#### 2.5. Data Integrity

Data Exclusion (Weight Logs): The weightLogInfo dataset contains only 34 records, with 14 belonging to a single user. Because the record count is so low and sparse, this file will be excluded from the analysis workflow to avoid generating statistically unreliable results.
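
The sparsity check behind this exclusion can be sketched as a per-user row count. The sketch below is a minimal illustration using only the standard library; the inline sample rows, the user Ids, and the `Date` and `WeightKg` column names are invented stand-ins, not values taken from weightLogInfo_merged.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for weightLogInfo_merged.csv; all values are invented.
SAMPLE_WEIGHT_CSV = """Id,Date,WeightKg
1001,4/12/2016 7:00:00 AM,52.6
1001,4/13/2016 7:00:00 AM,52.4
1001,4/14/2016 7:00:00 AM,52.5
1002,4/12/2016 8:00:00 AM,70.1
"""


def records_per_user(csv_text):
    """Count how many rows each Id contributes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Id"] for row in reader)


def largest_share(counts):
    """Return (Id, fraction of all rows) for the user with the most rows."""
    total = sum(counts.values())
    user, n = counts.most_common(1)[0]
    return user, n / total


weight_counts = records_per_user(SAMPLE_WEIGHT_CSV)
top_user, top_share = largest_share(weight_counts)
print(dict(weight_counts), top_user, top_share)
```

A high share for a single user, combined with a small total, is the pattern that motivates dropping the file rather than imputing values.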

Redundancy and Temporal Consistency: Most minute-level files were deemed redundant because their content is already consolidated into the hourly files. However, the minute-level sleep and METs files and the second-level heart rate file (minuteSleep, minuteMETsNarrow, heartrate_seconds) record data at a finer temporal granularity. They must therefore be aggregated to an hourly or daily metric to ensure structural consistency with the primary hourly and daily datasets.
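
One way to roll finer-grained readings up to the hour is to truncate each timestamp to its hour and reduce the values in each bucket. The following is a minimal sketch under stated assumptions: the timestamp layout in `TIMESTAMP_FORMAT` is assumed rather than confirmed against the files, the helper names are hypothetical, and the sample readings are invented. The choice of reducer depends on the metric (a sum suits additive counts, a mean suits rate-like readings such as heart rate).

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp layout, e.g. "4/12/2016 1:05:00 AM"; adjust if the files differ.
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def aggregate_to_hourly(rows, reducer=sum):
    """Group (user_id, timestamp, value) rows by user and hour, then reduce."""
    buckets = defaultdict(list)
    for user_id, timestamp, value in rows:
        moment = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        hour = moment.replace(minute=0, second=0, microsecond=0)
        buckets[(user_id, hour)].append(value)
    return {key: reducer(values) for key, values in buckets.items()}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Invented minute-level readings for one hypothetical user.
sample_rows = [
    ("1001", "4/12/2016 1:05:00 AM", 10),
    ("1001", "4/12/2016 1:45:00 AM", 14),
    ("1001", "4/12/2016 2:00:00 AM", 30),
]

hourly_totals = aggregate_to_hourly(sample_rows)       # additive metrics
hourly_means = aggregate_to_hourly(sample_rows, mean)  # rate-like metrics
```

The resulting keys pair a user Id with an hour, which mirrors the Id and ActivityHour keys of the hourly files.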

Hourly Structure Verification: The core hourly files were verified to have a consistent structure, sharing the common keys Id and ActivityHour, which confirms their readiness for unification.
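
A structure check of this kind can be sketched by reading only the header row of each file and intersecting the column names. In the sketch below the one-row samples are invented, and every column name other than Id and ActivityHour is an assumption made for illustration.

```python
import csv
import io


def header_of(csv_text):
    """Return the column names from the first line of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def shared_columns(headers):
    """Return the columns present in every header, in first-header order."""
    common = set(headers[0]).intersection(*headers[1:])
    return [name for name in headers[0] if name in common]


# Illustrative samples; value columns and rows are assumptions, not real data.
hourly_samples = {
    "hourlyCalories": "Id,ActivityHour,Calories\n1001,4/12/2016 1:00:00 AM,81\n",
    "hourlyIntensities": "Id,ActivityHour,TotalIntensity\n1001,4/12/2016 1:00:00 AM,20\n",
    "hourlySteps": "Id,ActivityHour,StepTotal\n1001,4/12/2016 1:00:00 AM,373\n",
}

headers = [header_of(text) for text in hourly_samples.values()]
join_keys = shared_columns(headers)
print(join_keys)
```

If the intersection is exactly the two key columns, the files can be joined on those keys without renaming.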

#### 2.6. Contribution to Answering the Business Question

The dataset is limited but crucial for answering the first guiding question: "What are some trends in smart device usage?". It provides insight into typical user behavior, daily activity rhythms, sleep consistency, and engagement levels, and it forms the foundation for Bellabeat to identify universal challenges and opportunities.



In [22]:
!kaggle datasets download -d arashnic/fitbit -p data_raw

Dataset URL: https://www.kaggle.com/datasets/arashnic/fitbit
License(s): CC0-1.0
Downloading fitbit.zip to data_raw





In [24]:
import zipfile

with zipfile.ZipFile("data_raw/fitbit.zip", "r") as z:
    z.extractall("data_raw/extracted")
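
Before extracting, it can be worth confirming that the download completed and the archive is readable. The sketch below is one possible check using `zipfile.is_zipfile` and `ZipFile.testzip`; the helper name `inspect_archive` is hypothetical, and the demo builds a tiny in-memory archive instead of touching data_raw/fitbit.zip.

```python
import io
import zipfile


def inspect_archive(source):
    """Return the member names if the archive is readable, else raise ValueError."""
    if not zipfile.is_zipfile(source):
        raise ValueError("not a valid ZIP archive (possibly an incomplete download)")
    with zipfile.ZipFile(source, "r") as archive:
        corrupt = archive.testzip()  # name of the first bad member, or None
        if corrupt is not None:
            raise ValueError(f"corrupt member: {corrupt}")
        return archive.namelist()


# Build a small in-memory archive so the sketch runs without the real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/dailyActivity_merged.csv", "Id,ActivityDate\n")

members = inspect_archive(buffer)
print(members)
```

In the notebook, the same helper could be called as `inspect_archive("data_raw/fitbit.zip")` ahead of `extractall`.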

In [36]:
import os

for root, dirs, files in os.walk("data_raw", topdown=True):
    print(root, files)


data_raw ['fitbit.zip']
data_raw\extracted []
data_raw\extracted\mturkfitbit_export_3.12.16-4.11.16 []
data_raw\extracted\mturkfitbit_export_3.12.16-4.11.16\Fitabase Data 3.12.16-4.11.16 ['dailyActivity_merged.csv', 'heartrate_seconds_merged.csv', 'hourlyCalories_merged.csv', 'hourlyIntensities_merged.csv', 'hourlySteps_merged.csv', 'minuteCaloriesNarrow_merged.csv', 'minuteIntensitiesNarrow_merged.csv', 'minuteMETsNarrow_merged.csv', 'minuteSleep_merged.csv', 'minuteStepsNarrow_merged.csv', 'weightLogInfo_merged.csv']
data_raw\extracted\mturkfitbit_export_4.12.16-5.12.16 []
data_raw\extracted\mturkfitbit_export_4.12.16-5.12.16\Fitabase Data 4.12.16-5.12.16 ['dailyActivity_merged.csv', 'dailyCalories_merged.csv', 'dailyIntensities_merged.csv', 'dailySteps_merged.csv', 'heartrate_seconds_merged.csv', 'hourlyCalories_merged.csv', 'hourlyIntensities_merged.csv', 'hourlySteps_merged.csv', 'minuteCaloriesNarrow_merged.csv', 'minuteCaloriesWide_merged.csv', 'minuteIntensitiesNarrow_merged.cs

In [38]:
import glob
csv_files = glob.glob("data_raw/extracted/**/*.csv", recursive=True)

In [40]:
print(len(csv_files))
csv_files[:5]


29


['data_raw/extracted\\mturkfitbit_export_3.12.16-4.11.16\\Fitabase Data 3.12.16-4.11.16\\dailyActivity_merged.csv',
 'data_raw/extracted\\mturkfitbit_export_3.12.16-4.11.16\\Fitabase Data 3.12.16-4.11.16\\heartrate_seconds_merged.csv',
 'data_raw/extracted\\mturkfitbit_export_3.12.16-4.11.16\\Fitabase Data 3.12.16-4.11.16\\hourlyCalories_merged.csv',
 'data_raw/extracted\\mturkfitbit_export_3.12.16-4.11.16\\Fitabase Data 3.12.16-4.11.16\\hourlyIntensities_merged.csv',
 'data_raw/extracted\\mturkfitbit_export_3.12.16-4.11.16\\Fitabase Data 3.12.16-4.11.16\\hourlySteps_merged.csv']
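
Both export folders contain files with the same names (for example dailyActivity_merged.csv), so it may help to group the paths by file name before loading them. The following is a minimal sketch, not the notebook's method: `group_by_table` is a hypothetical helper and the sample paths are shortened placeholders. `PureWindowsPath` is used because it splits on both `/` and `\`, which matches the mixed separators in the listing above regardless of the operating system running the code.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def group_by_table(paths):
    """Map each CSV file name to the list of paths that carry that name."""
    grouped = defaultdict(list)
    for path in paths:
        grouped[PureWindowsPath(path).name].append(path)
    return dict(grouped)


# Shortened placeholder paths; the folder names here are hypothetical.
sample_paths = [
    "data_raw/extracted\\export_a\\dailyActivity_merged.csv",
    "data_raw/extracted\\export_b\\dailyActivity_merged.csv",
    "data_raw/extracted\\export_b\\dailySteps_merged.csv",
]

tables = group_by_table(sample_paths)
for name, members in tables.items():
    print(name, len(members))
```

Each group can then be loaded and concatenated on its own, which keeps tables with different columns apart.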

In [42]:
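import pandas as pd

dfs = []

# Read every CSV and tag each row with the file it came from.
for file in csv_files:
    df = pd.read_csv(file)
    df['source_file'] = os.path.basename(file)
    dfs.append(df)

# The files have different columns, so the combined frame is wide and
# holds NaN wherever a source file lacks a given column.
fitbit_df = pd.concat(dfs, ignore_index=True)
fitbit_df.head(), fitbit_df.shape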


(           Id ActivityDate  TotalSteps  TotalDistance  TrackerDistance  \
 0  1503960366    3/25/2016     11004.0           7.11             7.11   
 1  1503960366    3/26/2016     17609.0          11.55            11.55   
 2  1503960366    3/27/2016     12736.0           8.53             8.53   
 3  1503960366    3/28/2016     13231.0           8.93             8.93   
 4  1503960366    3/29/2016     12041.0           7.85             7.85   
 
    LoggedActivitiesDistance  VeryActiveDistance  ModeratelyActiveDistance  \
 0                       0.0                2.57                      0.46   
 1                       0.0                6.92                      0.73   
 2                       0.0                4.66                      0.16   
 3                       0.0                3.19                      0.79   
 4                       0.0                2.16                      1.09   
 
    LightActiveDistance  SedentaryActiveDistance  ...  Steps54  Steps55  \
 0 