**This project focuses on exploring and analyzing personal fitness activity data collected through the Strava fitness application, including heart rate, sleep patterns, physical activity, and user behavior.**

**The goal is to derive meaningful insights about daily routines, physical intensity, sleep quality, and overall wellness trends using various EDA techniques and visualizations.**

FINAL DATASETS

1. daily_df: Summarized daily activity (steps, distance, active minutes, calories)

2. hourly_df: Hour-by-hour step and calorie data

3. minute_df: Minute-level data for calories, steps, METs, etc.

4. sleep_df: Sleep stats including total minutes asleep, efficiency, and long/short sleep breakdowns

5. heartrate_df: Continuous heart rate monitoring across days and time

6. weightlog_df: Weight and BMI data logged over time




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
import pandas as pd

# folder path in drive
folder_path = '/content/drive/MyDrive/Strava_data'

dataframes = {}

# Loop through each file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        df_name = filename.replace('.csv', '').lower().replace(' ', '_')
        df = pd.read_csv(file_path)
        dataframes[df_name] = df
        print(f"Loaded {df_name} with shape {df.shape}")



**Daily Granularity**

In [None]:
import pandas as pd

In [None]:
dailyActivity_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/dailyActivity_merged.csv')

In [None]:
dailyCalories_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/dailyCalories_merged.csv')

In [None]:
dailyIntensities_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/dailyIntensities_merged.csv')

In [None]:
dailySteps_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/dailySteps_merged.csv')

**Checking basic data information**

In [None]:
dailyActivity_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

In [None]:
dailyIntensities_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDay               940 non-null    object 
 2   SedentaryMinutes          940 non-null    int64  
 3   LightlyActiveMinutes      940 non-null    int64  
 4   FairlyActiveMinutes       940 non-null    int64  
 5   VeryActiveMinutes         940 non-null    int64  
 6   SedentaryActiveDistance   940 non-null    float64
 7   LightActiveDistance       940 non-null    float64
 8   ModeratelyActiveDistance  940 non-null    float64
 9   VeryActiveDistance        940 non-null    float64
dtypes: float64(4), int64(5), object(1)
memory usage: 73.6+ KB


In [None]:
dailySteps_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Id           940 non-null    int64 
 1   ActivityDay  940 non-null    object
 2   StepTotal    940 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 22.2+ KB


In [None]:
dailyCalories_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Id           940 non-null    int64 
 1   ActivityDay  940 non-null    object
 2   Calories     940 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 22.2+ KB


**Converting date-time column to proper format**

In [None]:
# standardizing date columns

dailyActivity_merged['ActivityDate'] = pd.to_datetime(dailyActivity_merged['ActivityDate'])
dailyIntensities_merged['ActivityDay'] = pd.to_datetime(dailyIntensities_merged['ActivityDay'])
dailySteps_merged['ActivityDay'] = pd.to_datetime(dailySteps_merged['ActivityDay'])
dailyCalories_merged['ActivityDay'] = pd.to_datetime(dailyCalories_merged['ActivityDay'])

In [None]:
# merging the datasets by daily granularity

merged_df = dailyActivity_merged.merge(
    dailyIntensities_merged,
    left_on=['Id', 'ActivityDate'],
    right_on= ['Id', 'ActivityDay'],
    how='left',
    suffixes=('', '_intensity')
)

merged_df = merged_df.merge(
    dailySteps_merged,
    left_on=['Id', 'ActivityDay'],
    right_on=['Id', 'ActivityDay'],
    how='left'
)

daily_df = merged_df.merge(
    dailyCalories_merged,
    left_on=['Id', 'ActivityDay'],
    right_on=['Id', 'ActivityDay'],
    how='left',
    suffixes=('', '_calories')
)
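The left merges above assume that every (Id, date) pair appears at most once in each table. A minimal sketch of how that assumption could be checked, using the `validate` and `indicator` parameters of `DataFrame.merge`; the two small frames and their values are made up for illustration and are not the project's data.

```python
import pandas as pd

# Toy stand-ins for two daily tables (values are made up).
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [13162, 10735, 8000],
})
steps = pd.DataFrame({
    "Id": [1, 1],
    "ActivityDay": pd.to_datetime(["2016-04-12", "2016-04-13"]),
    "StepTotal": [13162, 10735],
})

checked = activity.merge(
    steps,
    left_on=["Id", "ActivityDate"],
    right_on=["Id", "ActivityDay"],
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats a key
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

# Rows marked left_only found no partner in the right-hand table.
print(checked["_merge"].value_counts())
```

With `how="left"`, unmatched rows keep their left-hand values and get NaN on the right, so counting `left_only` rows is a quick way to see how much of the second table actually lined up.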



In [None]:
daily_df.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,...,SedentaryMinutes_intensity,LightlyActiveMinutes_intensity,FairlyActiveMinutes_intensity,VeryActiveMinutes_intensity,SedentaryActiveDistance_intensity,LightActiveDistance_intensity,ModeratelyActiveDistance_intensity,VeryActiveDistance_intensity,StepTotal,Calories_calories
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,...,728,328,13,25,0.0,6.06,0.55,1.88,13162,1985
1,1503960366,2016-04-13,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,...,776,217,19,21,0.0,4.71,0.69,1.57,10735,1797
2,1503960366,2016-04-14,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,...,1218,181,11,30,0.0,3.91,0.4,2.44,10460,1776
3,1503960366,2016-04-15,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,...,726,209,34,29,0.0,2.83,1.26,2.14,9762,1745
4,1503960366,2016-04-16,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,...,773,221,10,36,0.0,5.04,0.41,2.71,12669,1863


In [None]:
daily_df.columns

Index(['Id', 'ActivityDate', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories', 'ActivityDay',
       'SedentaryMinutes_intensity', 'LightlyActiveMinutes_intensity',
       'FairlyActiveMinutes_intensity', 'VeryActiveMinutes_intensity',
       'SedentaryActiveDistance_intensity', 'LightActiveDistance_intensity',
       'ModeratelyActiveDistance_intensity', 'VeryActiveDistance_intensity',
       'StepTotal', 'Calories_calories'],
      dtype='object')

**Hourly Granularity**

In [None]:
hourlyCalories_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/hourlyCalories_merged.csv')

In [None]:
hourlyIntensities_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/hourlyIntensities_merged.csv')

In [None]:
hourlySteps_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/hourlySteps_merged.csv')

In [None]:
hourlyCalories_merged

Unnamed: 0,Id,ActivityHour,Calories
0,1503960366,4/12/2016 12:00:00 AM,81
1,1503960366,4/12/2016 1:00:00 AM,61
2,1503960366,4/12/2016 2:00:00 AM,59
3,1503960366,4/12/2016 3:00:00 AM,47
4,1503960366,4/12/2016 4:00:00 AM,48
...,...,...,...
22094,8877689391,5/12/2016 10:00:00 AM,126
22095,8877689391,5/12/2016 11:00:00 AM,192
22096,8877689391,5/12/2016 12:00:00 PM,321
22097,8877689391,5/12/2016 1:00:00 PM,101


In [None]:
hourlyCalories_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   Calories      22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB


In [None]:
hourlyIntensities_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                22099 non-null  int64  
 1   ActivityHour      22099 non-null  object 
 2   TotalIntensity    22099 non-null  int64  
 3   AverageIntensity  22099 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 690.7+ KB


In [None]:
hourlySteps_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   StepTotal     22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB


In [None]:
# standardize the datetime columns

hourlySteps_merged['ActivityHour'] = pd.to_datetime(hourlySteps_merged['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
hourlyCalories_merged['ActivityHour'] = pd.to_datetime(hourlyCalories_merged['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
hourlyIntensities_merged['ActivityHour'] = pd.to_datetime(hourlyIntensities_merged['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
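Because `errors='coerce'` converts anything that does not match the format into `NaT` instead of raising, it can hide parsing problems. A small sketch of a follow-up check, assuming the same timestamp layout as `ActivityHour`; the sample strings, including the deliberately malformed one, are made up.

```python
import pandas as pd

# Made-up sample in the same "M/D/YYYY h:mm:ss AM/PM" layout as ActivityHour,
# plus one deliberately malformed value.
raw = pd.Series([
    "4/12/2016 12:00:00 AM",
    "4/12/2016 1:00:00 PM",
    "not a timestamp",
])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Unparseable strings became NaT, so count them before trusting the column.
n_failed = int(parsed.isna().sum())
print(parsed)
print(f"Unparseable values: {n_failed}")
```

In the notebook, the later `hourly_df.info()` output reports 22099 non-null `ActivityHour` values out of 22099 rows, which suggests nothing was coerced to `NaT` there.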

In [None]:
import pandas as pd

# Merge hourly datasets on Id and ActivityHour

hourly_df = hourlySteps_merged.merge(
    hourlyCalories_merged, on=["Id", "ActivityHour"], how="inner"
).merge(
   hourlyIntensities_merged, on=["Id", "ActivityHour"], how="inner"
)


In [None]:
hourly_df.head()


Unnamed: 0,Id,ActivityHour,StepTotal,Calories,TotalIntensity,AverageIntensity
0,1503960366,2016-04-12 00:00:00,373,81,20,0.333333
1,1503960366,2016-04-12 01:00:00,160,61,8,0.133333
2,1503960366,2016-04-12 02:00:00,151,59,7,0.116667
3,1503960366,2016-04-12 03:00:00,0,47,0,0.0
4,1503960366,2016-04-12 04:00:00,0,48,0,0.0


In [None]:
hourly_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Id                22099 non-null  int64         
 1   ActivityHour      22099 non-null  datetime64[ns]
 2   StepTotal         22099 non-null  int64         
 3   Calories          22099 non-null  int64         
 4   TotalIntensity    22099 non-null  int64         
 5   AverageIntensity  22099 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 1.0 MB


**Minute-wise Granularity**

In [None]:
minuteCaloriesNarrow_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteCaloriesNarrow_merged.csv')

In [None]:
minuteCaloriesWide_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteCaloriesWide_merged.csv')

In [None]:
minuteIntensitiesNarrow_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteIntensitiesNarrow_merged.csv')

In [None]:
minuteIntensitiesWide_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteIntensitiesWide_merged.csv')

In [None]:
minuteMETsNarrow_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteMETsNarrow_merged.csv')

In [None]:
minuteStepsNarrow_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteStepsNarrow_merged.csv')

In [None]:
minuteStepsWide_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteStepsWide_merged.csv')

**Basic info**

In [None]:
minuteCaloriesNarrow_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   Id              1325580 non-null  int64  
 1   ActivityMinute  1325580 non-null  object 
 2   Calories        1325580 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 30.3+ MB


In [None]:
minuteCaloriesWide_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Data columns (total 62 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            21645 non-null  int64  
 1   ActivityHour  21645 non-null  object 
 2   Calories00    21645 non-null  float64
 3   Calories01    21645 non-null  float64
 4   Calories02    21645 non-null  float64
 5   Calories03    21645 non-null  float64
 6   Calories04    21645 non-null  float64
 7   Calories05    21645 non-null  float64
 8   Calories06    21645 non-null  float64
 9   Calories07    21645 non-null  float64
 10  Calories08    21645 non-null  float64
 11  Calories09    21645 non-null  float64
 12  Calories10    21645 non-null  float64
 13  Calories11    21645 non-null  float64
 14  Calories12    21645 non-null  float64
 15  Calories13    21645 non-null  float64
 16  Calories14    21645 non-null  float64
 17  Calories15    21645 non-null  float64
 18  Calories16    21645 non-nu

In [None]:
minuteIntensitiesWide_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Data columns (total 62 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            21645 non-null  int64 
 1   ActivityHour  21645 non-null  object
 2   Intensity00   21645 non-null  int64 
 3   Intensity01   21645 non-null  int64 
 4   Intensity02   21645 non-null  int64 
 5   Intensity03   21645 non-null  int64 
 6   Intensity04   21645 non-null  int64 
 7   Intensity05   21645 non-null  int64 
 8   Intensity06   21645 non-null  int64 
 9   Intensity07   21645 non-null  int64 
 10  Intensity08   21645 non-null  int64 
 11  Intensity09   21645 non-null  int64 
 12  Intensity10   21645 non-null  int64 
 13  Intensity11   21645 non-null  int64 
 14  Intensity12   21645 non-null  int64 
 15  Intensity13   21645 non-null  int64 
 16  Intensity14   21645 non-null  int64 
 17  Intensity15   21645 non-null  int64 
 18  Intensity16   21645 non-null  int64 
 19  Inte

In [None]:
minuteStepsNarrow_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   Id              1325580 non-null  int64 
 1   ActivityMinute  1325580 non-null  object
 2   Steps           1325580 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 30.3+ MB


In [None]:
minuteStepsWide_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Data columns (total 62 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            21645 non-null  int64 
 1   ActivityHour  21645 non-null  object
 2   Steps00       21645 non-null  int64 
 3   Steps01       21645 non-null  int64 
 4   Steps02       21645 non-null  int64 
 5   Steps03       21645 non-null  int64 
 6   Steps04       21645 non-null  int64 
 7   Steps05       21645 non-null  int64 
 8   Steps06       21645 non-null  int64 
 9   Steps07       21645 non-null  int64 
 10  Steps08       21645 non-null  int64 
 11  Steps09       21645 non-null  int64 
 12  Steps10       21645 non-null  int64 
 13  Steps11       21645 non-null  int64 
 14  Steps12       21645 non-null  int64 
 15  Steps13       21645 non-null  int64 
 16  Steps14       21645 non-null  int64 
 17  Steps15       21645 non-null  int64 
 18  Steps16       21645 non-null  int64 
 19  Step

In [None]:
minuteMETsNarrow_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   Id              1325580 non-null  int64         
 1   ActivityMinute  1325580 non-null  datetime64[ns]
 2   METs            1325580 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 30.3 MB


In [None]:
minuteStepsNarrow_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   Id              1325580 non-null  int64         
 1   ActivityMinute  1325580 non-null  datetime64[ns]
 2   Steps           1325580 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 30.3 MB


**Date-time conversion**

In [None]:
import pandas as pd

minuteCaloriesNarrow_merged['ActivityMinute'] = pd.to_datetime(minuteCaloriesNarrow_merged['ActivityMinute'], format="%m/%d/%Y %I:%M:%S %p")


In [None]:
minuteCaloriesWide_merged['ActivityHour'] = pd.to_datetime(minuteCaloriesWide_merged['ActivityHour'], format="%m/%d/%Y %I:%M:%S %p")


In [None]:
minuteIntensitiesWide_merged['ActivityHour'] = pd.to_datetime(minuteIntensitiesWide_merged['ActivityHour'], format="%m/%d/%Y %I:%M:%S %p")


In [None]:
minuteIntensitiesNarrow_merged['ActivityMinute'] = pd.to_datetime(minuteIntensitiesNarrow_merged['ActivityMinute'], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
minuteMETsNarrow_merged['ActivityMinute'] = pd.to_datetime(minuteMETsNarrow_merged['ActivityMinute'], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
minuteStepsNarrow_merged['ActivityMinute'] = pd.to_datetime(minuteStepsNarrow_merged['ActivityMinute'], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
minuteStepsWide_merged['ActivityHour'] = pd.to_datetime(minuteStepsWide_merged['ActivityHour'], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
# merging all the narrow datasets

minute_narrow_df = (
    minuteCaloriesNarrow_merged
    .merge(minuteIntensitiesNarrow_merged, on=['Id', 'ActivityMinute'], how='inner')
    .merge(minuteStepsNarrow_merged, on=['Id', 'ActivityMinute'], how='inner')
    .merge(minuteMETsNarrow_merged, on=['Id', 'ActivityMinute'], how='inner')
)

In [None]:
# merging all the wide datasets

minute_wide_df = (
    minuteCaloriesWide_merged
    .merge(minuteIntensitiesWide_merged, on=['Id', 'ActivityHour'], how='inner')
    .merge(minuteStepsWide_merged, on=['Id', 'ActivityHour'], how='inner')
)


In [None]:
daily_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   Id                                  940 non-null    int64         
 1   ActivityDate                        940 non-null    datetime64[ns]
 2   TotalSteps                          940 non-null    int64         
 3   TotalDistance                       940 non-null    float64       
 4   TrackerDistance                     940 non-null    float64       
 5   LoggedActivitiesDistance            940 non-null    float64       
 6   VeryActiveDistance                  940 non-null    float64       
 7   ModeratelyActiveDistance            940 non-null    float64       
 8   LightActiveDistance                 940 non-null    float64       
 9   SedentaryActiveDistance             940 non-null    float64       
 10  VeryActiveMinutes         

In [None]:
hourly_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Id                22099 non-null  int64         
 1   ActivityHour      22099 non-null  datetime64[ns]
 2   StepTotal         22099 non-null  int64         
 3   Calories          22099 non-null  int64         
 4   TotalIntensity    22099 non-null  int64         
 5   AverageIntensity  22099 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 1.0 MB


In [None]:
minute_narrow_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 6 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   Id              1325580 non-null  int64         
 1   ActivityMinute  1325580 non-null  datetime64[ns]
 2   Calories        1325580 non-null  float64       
 3   Intensity       1325580 non-null  int64         
 4   Steps           1325580 non-null  int64         
 5   METs            1325580 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 60.7 MB


In [None]:
minute_wide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Columns: 182 entries, Id to Steps59
dtypes: datetime64[ns](1), float64(60), int64(121)
memory usage: 30.1 MB
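The wide tables hold one row per hour with 60 per-minute columns (for example `Calories00` to `Calories59`), while the narrow tables hold one row per minute. The sketch below shows one way the wide layout could be reshaped into the narrow one with `DataFrame.melt`. It is an illustration only: the toy frame uses three minute columns and made-up values, and `MinuteColumn` is a hypothetical helper name.

```python
import pandas as pd

# Toy wide table: one row per hour, one column per minute offset.
# Only three minute columns are used here to keep the example small.
wide = pd.DataFrame({
    "Id": [1, 1],
    "ActivityHour": pd.to_datetime(["2016-04-12 00:00:00", "2016-04-12 01:00:00"]),
    "Calories00": [0.8, 0.9],
    "Calories01": [0.8, 1.1],
    "Calories02": [0.7, 1.0],
})

# Stack the minute columns into rows.
narrow = wide.melt(
    id_vars=["Id", "ActivityHour"],
    var_name="MinuteColumn",
    value_name="Calories",
)

# "Calories02" -> 2, then add that many minutes to the hour stamp.
offset = narrow["MinuteColumn"].str[-2:].astype(int)
narrow["ActivityMinute"] = narrow["ActivityHour"] + pd.to_timedelta(offset, unit="min")

narrow = (
    narrow[["Id", "ActivityMinute", "Calories"]]
    .sort_values(["Id", "ActivityMinute"])
    .reset_index(drop=True)
)
print(narrow)
```

Working from the narrow form keeps the column count small; the merged wide frame carries 182 columns, which is harder to inspect and plot.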


**Sleep-wise data**

In [None]:
import pandas as pd

minuteSleep_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/minuteSleep_merged.csv')

In [None]:
sleepDay_merged = pd.read_csv('/content/drive/MyDrive/Strava_data/sleepDay_merged.csv')

In [None]:
minuteSleep_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188521 entries, 0 to 188520
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Id      188521 non-null  int64 
 1   date    188521 non-null  object
 2   value   188521 non-null  int64 
 3   logId   188521 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 5.8+ MB


In [None]:
sleepDay_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Id                  413 non-null    int64 
 1   SleepDay            413 non-null    object
 2   TotalSleepRecords   413 non-null    int64 
 3   TotalMinutesAsleep  413 non-null    int64 
 4   TotalTimeInBed      413 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 16.3+ KB


In [None]:
# Convert to datetime format

minuteSleep_merged['date'] = pd.to_datetime(minuteSleep_merged['date'], format="%m/%d/%Y %I:%M:%S %p", errors='coerce')
sleepDay_merged['SleepDay'] = pd.to_datetime(sleepDay_merged['SleepDay'], format="%m/%d/%Y %I:%M:%S %p", errors='coerce')

In [None]:
# merging the dataset

import pandas as pd

# removing the time part from the minuteSleep data for merging

minuteSleep_merged['SleepDate'] = minuteSleep_merged['date'].dt.date
sleepDay_merged['SleepDate'] = sleepDay_merged['SleepDay'].dt.date

# Merge on Id and SleepDate
sleep_df = pd.merge(
    minuteSleep_merged,
    sleepDay_merged,
    how='left',
    on=['Id', 'SleepDate']
)



In [None]:
sleep_df.head()

Unnamed: 0,Id,date,value,logId,SleepDate,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12 02:47:30,3,11380564589,2016-04-12,2016-04-12,1.0,327.0,346.0
1,1503960366,2016-04-12 02:48:30,2,11380564589,2016-04-12,2016-04-12,1.0,327.0,346.0
2,1503960366,2016-04-12 02:49:30,1,11380564589,2016-04-12,2016-04-12,1.0,327.0,346.0
3,1503960366,2016-04-12 02:50:30,1,11380564589,2016-04-12,2016-04-12,1.0,327.0,346.0
4,1503960366,2016-04-12 02:51:30,1,11380564589,2016-04-12,2016-04-12,1.0,327.0,346.0


In [None]:
sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190034 entries, 0 to 190033
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   Id                  190034 non-null  int64         
 1   date                190034 non-null  datetime64[ns]
 2   value               190034 non-null  int64         
 3   logId               190034 non-null  int64         
 4   SleepDate           190034 non-null  object        
 5   SleepDay            185923 non-null  datetime64[ns]
 6   TotalSleepRecords   185923 non-null  float64       
 7   TotalMinutesAsleep  185923 non-null  float64       
 8   TotalTimeInBed      185923 non-null  float64       
dtypes: datetime64[ns](2), float64(3), int64(3), object(1)
memory usage: 13.0+ MB
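The merged `sleep_df` has 190034 rows even though `minuteSleep_merged` has 188521. A left merge can only gain rows when a key occurs more than once on the right-hand side, so `sleepDay_merged` presumably contains repeated (Id, SleepDate) pairs. The toy example below reproduces the effect and shows one possible remedy; the data is made up, and whether the real repeats are exact duplicate rows is an assumption that would need checking.

```python
import pandas as pd

# Toy minute-level and day-level tables (made-up values).
minute = pd.DataFrame({
    "Id": [1, 1, 1],
    "SleepDate": ["2016-04-12"] * 3,
    "value": [3, 2, 1],
})
day = pd.DataFrame({
    "Id": [1, 1],  # the same key twice: a repeated daily summary row
    "SleepDate": ["2016-04-12", "2016-04-12"],
    "TotalMinutesAsleep": [327, 327],
})

# Each minute row matches both summary rows, so the row count doubles.
inflated = minute.merge(day, how="left", on=["Id", "SleepDate"])

# Dropping exact duplicate rows restores one summary per key;
# validate="many_to_one" would raise if repeated keys remained.
day_unique = day.drop_duplicates()
clean = minute.merge(
    day_unique, how="left", on=["Id", "SleepDate"], validate="many_to_one"
)

print(len(minute), len(inflated), len(clean))
```

If the repeated keys carried different values instead, `drop_duplicates()` would not remove them and the `validate` check would raise, which is a useful signal that the summaries need to be reconciled by hand.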


In [None]:
# drop SleepDate column

sleep_df.drop(columns=["SleepDate"], inplace=True)


In [None]:
# dropping the nulls in sleep_df

sleep_df.dropna(axis=0, inplace=True)

In [None]:
sleep_df.head()

Unnamed: 0,Id,date,value,logId,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12 02:47:30,3,11380564589,2016-04-12,1.0,327.0,346.0
1,1503960366,2016-04-12 02:48:30,2,11380564589,2016-04-12,1.0,327.0,346.0
2,1503960366,2016-04-12 02:49:30,1,11380564589,2016-04-12,1.0,327.0,346.0
3,1503960366,2016-04-12 02:50:30,1,11380564589,2016-04-12,1.0,327.0,346.0
4,1503960366,2016-04-12 02:51:30,1,11380564589,2016-04-12,1.0,327.0,346.0


In [None]:
# saving merged dataframe in csv format

daily_df.to_csv("daily_df.csv", index=False)

hourly_df.to_csv("hourly_df.csv", index=False)

minute_narrow_df.to_csv("minute_narrow_df.csv", index=False)

minute_wide_df.to_csv("minute_wide_df.csv", index=False)

sleep_df.to_csv('sleep_df.csv', index=False)

In [None]:
# downloading the merged dataframe

from google.colab import files

files.download("daily_df.csv")
files.download("hourly_df.csv")
files.download("minute_narrow_df.csv")
files.download("minute_wide_df.csv")
files.download("sleep_df.csv")




# Analysis on daily_df dataset

In [None]:
daily_df = pd.read_csv('/content/drive/MyDrive/Strava_data/daily_df.csv')
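A CSV file stores dates as text, so a frame that is saved and read back comes back with its date column as strings unless `read_csv` is told to parse it. A short sketch of the `parse_dates` option, using an in-memory CSV with two made-up rows in place of the saved file:

```python
import io
import pandas as pd

# Stand-in for a saved CSV (two made-up rows); io.StringIO avoids touching disk.
csv_text = "Id,ActivityDate,TotalSteps\n1,2016-04-12,13162\n1,2016-04-13,10735\n"

plain = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(io.StringIO(csv_text), parse_dates=["ActivityDate"])

print(plain["ActivityDate"].dtype)  # dates kept as text
print(typed["ActivityDate"].dtype)  # dates parsed to a datetime dtype
```

Parsing at read time is an alternative to calling `pd.to_datetime` on the column afterwards; both end with the same datetime column.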


In [None]:
daily_df.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,...,SedentaryMinutes_intensity,LightlyActiveMinutes_intensity,FairlyActiveMinutes_intensity,VeryActiveMinutes_intensity,SedentaryActiveDistance_intensity,LightActiveDistance_intensity,ModeratelyActiveDistance_intensity,VeryActiveDistance_intensity,StepTotal,Calories_calories
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,...,728,328,13,25,0.0,6.06,0.55,1.88,13162,1985
1,1503960366,2016-04-13,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,...,776,217,19,21,0.0,4.71,0.69,1.57,10735,1797
2,1503960366,2016-04-14,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,...,1218,181,11,30,0.0,3.91,0.4,2.44,10460,1776
3,1503960366,2016-04-15,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,...,726,209,34,29,0.0,2.83,1.26,2.14,9762,1745
4,1503960366,2016-04-16,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,...,773,221,10,36,0.0,5.04,0.41,2.71,12669,1863


In [None]:
daily_df.columns

Index(['Id', 'ActivityDate', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories', 'ActivityDay',
       'SedentaryMinutes_intensity', 'LightlyActiveMinutes_intensity',
       'FairlyActiveMinutes_intensity', 'VeryActiveMinutes_intensity',
       'SedentaryActiveDistance_intensity', 'LightActiveDistance_intensity',
       'ModeratelyActiveDistance_intensity', 'VeryActiveDistance_intensity',
       'StepTotal', 'Calories_calories'],
      dtype='object')

In [None]:
daily_df.shape

(940, 26)

In [None]:
# Checking duplicate columns

columns = [
    ("TotalSteps", "StepTotal"),
    ("Calories", "Calories_calories"),
    ("VeryActiveMinutes", "VeryActiveMinutes_intensity"),
    ("FairlyActiveMinutes", "FairlyActiveMinutes_intensity"),
    ("LightlyActiveMinutes", "LightlyActiveMinutes_intensity"),
    ("SedentaryMinutes", "SedentaryMinutes_intensity"),
    ("VeryActiveDistance", "VeryActiveDistance_intensity"),
    ("LightActiveDistance", "LightActiveDistance_intensity"),
    ("ModeratelyActiveDistance", "ModeratelyActiveDistance_intensity"),
    ("SedentaryActiveDistance", "SedentaryActiveDistance_intensity"),
    ("ActivityDate", "ActivityDay")
]

for col1, col2 in columns:
    if col1 in daily_df.columns and col2 in daily_df.columns:
        match = (daily_df[col1] == daily_df[col2]).all()
        print(f"{col1} and {col2} are {'identical' if match else 'different'}")
    else:
        print(f"{col1} or {col2} not found in DataFrame")


TotalSteps and StepTotal are identical
Calories and Calories_calories are identical
VeryActiveMinutes and VeryActiveMinutes_intensity are identical
FairlyActiveMinutes and FairlyActiveMinutes_intensity are identical
LightlyActiveMinutes and LightlyActiveMinutes_intensity are identical
SedentaryMinutes and SedentaryMinutes_intensity are identical
VeryActiveDistance and VeryActiveDistance_intensity are identical
LightActiveDistance and LightActiveDistance_intensity are identical
ModeratelyActiveDistance and ModeratelyActiveDistance_intensity are identical
SedentaryActiveDistance and SedentaryActiveDistance_intensity are identical
ActivityDate and ActivityDay are identical
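The `==` comparison used above treats a missing value as unequal to everything, including another missing value. That does not matter here because `daily_df` has no nulls, but a left merge can introduce them. Below is a hedged sketch of a NaN-tolerant variant; `columns_match` is a hypothetical helper and the demo frame is made up.

```python
import numpy as np
import pandas as pd

def columns_match(df: pd.DataFrame, left: str, right: str) -> bool:
    """Return True when two columns agree row by row, counting NaN vs NaN as a match."""
    a, b = df[left], df[right]
    both_missing = a.isna() & b.isna()
    return bool(((a == b) | both_missing).all())

# Made-up frame: the two columns agree, including one shared missing value.
demo = pd.DataFrame({
    "Calories": [1985.0, np.nan, 1776.0],
    "Calories_calories": [1985.0, np.nan, 1776.0],
})

plain_check = bool((demo["Calories"] == demo["Calories_calories"]).all())
print(plain_check)                                            # NaN == NaN is False
print(columns_match(demo, "Calories", "Calories_calories"))   # NaN pair counted as equal
```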


In [None]:
# dropping identical columns

columns_to_drop = [
    'StepTotal',
    'Calories_calories',
    'VeryActiveMinutes_intensity',
    'FairlyActiveMinutes_intensity',
    'LightlyActiveMinutes_intensity',
    'SedentaryMinutes_intensity',
    'VeryActiveDistance_intensity',
    'LightActiveDistance_intensity',
    'ModeratelyActiveDistance_intensity',
    'SedentaryActiveDistance_intensity',
    'ActivityDay'
]

daily_df = daily_df.drop(columns=columns_to_drop)


In [None]:
daily_df.columns

Index(['Id', 'ActivityDate', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories'],
      dtype='object')

In [None]:
daily_df.shape

(940, 15)

In [None]:
daily_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

In [None]:
daily_df.isnull().sum()

Unnamed: 0,0
Id,0
ActivityDate,0
TotalSteps,0
TotalDistance,0
TrackerDistance,0
LoggedActivitiesDistance,0
VeryActiveDistance,0
ModeratelyActiveDistance,0
LightActiveDistance,0
SedentaryActiveDistance,0


In [None]:
daily_df.duplicated().sum()


np.int64(0)

**Feature Engineering**

In [None]:
# Extract the day name and month name from ActivityDate

daily_df['ActivityDate'] = pd.to_datetime(daily_df['ActivityDate'])
daily_df['DayOfWeek'] = daily_df['ActivityDate'].dt.day_name()
daily_df['Month'] = daily_df['ActivityDate'].dt.month_name()


In [None]:
# Categorize Steps
daily_df['StepCategory'] = pd.cut(daily_df['TotalSteps'],
                                  bins=[-1, 5000, 10000, float('inf')],
                                  labels=['Low', 'Moderate', 'High'])
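
The bin edges in the step categorization above are right-inclusive by default in `pd.cut`, so a day with exactly 5000 steps is labelled Low and a day with exactly 10000 steps is labelled Moderate. The following is a minimal sketch of that boundary behavior; the step counts are hypothetical values chosen to sit on and around the edges, not rows from `daily_df`.

```python
import pandas as pd

# Hypothetical step counts placed on and around the bin edges
steps = pd.Series([0, 4999, 5000, 5001, 10000, 10001])

# Same bins and labels as the StepCategory cell
categories = pd.cut(steps,
                    bins=[-1, 5000, 10000, float('inf')],
                    labels=['Low', 'Moderate', 'High'])

# Show each step count next to the category it falls into
print(pd.DataFrame({'TotalSteps': steps, 'StepCategory': categories}))
```

If the intent is for 5000 and 10000 to open the next category instead, passing `right=False` with bins `[0, 5000, 10000, float('inf')]` would be one way to do it.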

In [None]:
# Compute Active Metrics

daily_df['TotalActiveMinutes'] = daily_df['VeryActiveMinutes'] + daily_df['FairlyActiveMinutes'] + daily_df['LightlyActiveMinutes']

daily_df['TotalMovementDistance'] = daily_df[['VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance']].sum(axis=1)

daily_df['PercentActiveTime'] = daily_df['TotalActiveMinutes'] / (daily_df['TotalActiveMinutes'] + daily_df['SedentaryMinutes'])

In [None]:
daily_df.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,...,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,DayOfWeek,Month,StepCategory,TotalActiveMinutes,TotalMovementDistance,PercentActiveTime
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,...,13,328,728,1985,Tuesday,April,High,366,8.49,0.334552
1,1503960366,2016-04-13,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,...,19,217,776,1797,Wednesday,April,High,257,6.97,0.24879
2,1503960366,2016-04-14,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,...,11,181,1218,1776,Thursday,April,High,222,6.75,0.154167
3,1503960366,2016-04-15,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,...,34,209,726,1745,Friday,April,Moderate,272,6.23,0.272545
4,1503960366,2016-04-16,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,...,10,221,773,1863,Saturday,April,High,267,8.16,0.256731


In [None]:
print(daily_df['ActivityDate'].apply(type).value_counts())


ActivityDate
<class 'pandas._libs.tslibs.timestamps.Timestamp'>    940
Name: count, dtype: int64


# Analysis on hourly_df dataset

In [None]:
import pandas as pd
hourly_df= pd.read_csv('/content/drive/MyDrive/Strava_data/hourly_df.csv')

In [None]:
hourly_df.head()

Unnamed: 0,Id,ActivityHour,StepTotal,Calories,TotalIntensity,AverageIntensity
0,1503960366,2016-04-12 00:00:00,373,81,20,0.333333
1,1503960366,2016-04-12 01:00:00,160,61,8,0.133333
2,1503960366,2016-04-12 02:00:00,151,59,7,0.116667
3,1503960366,2016-04-12 03:00:00,0,47,0,0.0
4,1503960366,2016-04-12 04:00:00,0,48,0,0.0


In [None]:
hourly_df.shape

(22099, 6)

In [None]:
hourly_df.isnull().sum()

Unnamed: 0,0
Id,0
ActivityHour,0
StepTotal,0
Calories,0
TotalIntensity,0
AverageIntensity,0


In [None]:
hourly_df.columns

Index(['Id', 'ActivityHour', 'StepTotal', 'Calories', 'TotalIntensity',
       'AverageIntensity'],
      dtype='object')

# Analysis on minute-wise dataset

In [None]:
import pandas as pd
minute_narrow_df= pd.read_csv('/content/drive/MyDrive/Strava_data/minute_narrow_df.csv')

In [None]:
minute_narrow_df.head()

Unnamed: 0,Id,ActivityMinute,Calories,Intensity,Steps,METs
0,1503960366,2016-04-12 00:00:00,0.7865,0,0,10
1,1503960366,2016-04-12 00:01:00,0.7865,0,0,10
2,1503960366,2016-04-12 00:02:00,0.7865,0,0,10
3,1503960366,2016-04-12 00:03:00,0.7865,0,0,10
4,1503960366,2016-04-12 00:04:00,0.7865,0,0,10


In [None]:
minute_narrow_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 6 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   Id              1325580 non-null  int64  
 1   ActivityMinute  1325580 non-null  object 
 2   Calories        1325580 non-null  float64
 3   Intensity       1325580 non-null  int64  
 4   Steps           1325580 non-null  int64  
 5   METs            1325580 non-null  int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 60.7+ MB


In [None]:
minute_narrow_df.isnull().sum()

Unnamed: 0,0
Id,0
ActivityMinute,0
Calories,0
Intensity,0
Steps,0
METs,0


In [None]:
# handling date-time column

minute_narrow_df['ActivityMinute'] = pd.to_datetime(minute_narrow_df['ActivityMinute'])

**Feature Engineering**

In [None]:
# time features

minute_narrow_df['Hour'] = minute_narrow_df['ActivityMinute'].dt.hour

minute_narrow_df['Minute'] = minute_narrow_df['ActivityMinute'].dt.minute

minute_narrow_df['DayOfWeek'] = minute_narrow_df['ActivityMinute'].dt.dayofweek
minute_narrow_df['IsWeekend'] = minute_narrow_df['DayOfWeek'].isin([5, 6]).astype(int)

minute_narrow_df['TimeOfDay'] = pd.cut(minute_narrow_df['Hour'],
                         bins=[-1, 6, 12, 17, 21, 24],
                         labels=['Night', 'Morning', 'Afternoon', 'Evening', 'Late Night'])
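
To see which hours end up in each `TimeOfDay` label, the same bins can be applied to the 24 possible hour values. This is only an illustrative check on a synthetic series of hours, not on `minute_narrow_df` itself.

```python
import pandas as pd

# All possible values of the Hour column
hours = pd.Series(range(24))

# Same bins and labels as the TimeOfDay cell
time_of_day = pd.cut(hours,
                     bins=[-1, 6, 12, 17, 21, 24],
                     labels=['Night', 'Morning', 'Afternoon', 'Evening', 'Late Night'])

# First and last hour assigned to each label
hour_ranges = hours.groupby(time_of_day, observed=True).agg(['min', 'max'])
print(hour_ranges)

# Plain list of labels, indexed by hour, for quick lookups
labels_by_hour = time_of_day.astype(str).tolist()
```

With these bins, Night covers hours 0 to 6, Morning 7 to 12, Afternoon 13 to 17, Evening 18 to 21, and Late Night 22 to 23.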

In [None]:
# steps and activity features

import numpy as np

minute_narrow_df['IsActive'] = (minute_narrow_df['Intensity'] > 0).astype(int)

minute_narrow_df['IsWalking'] = (minute_narrow_df['Steps'] > 0).astype(int)

minute_narrow_df['CaloriesPerStep'] = minute_narrow_df['Calories'] / minute_narrow_df['Steps'].replace(0, np.nan)

minute_narrow_df['HighCalorieMinute'] = (minute_narrow_df['Calories'] > minute_narrow_df['Calories'].quantile(0.95)).astype(int)

minute_narrow_df['IsHighIntensity'] = (minute_narrow_df['Intensity'] >= 3).astype(int)
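
The `CaloriesPerStep` line above replaces zero step counts with `NaN` before dividing. A small sketch of why that matters, using a hypothetical three-row frame (`demo` is not part of the dataset): without the replacement, minutes with no steps produce `inf`, which then distorts any later mean or plot.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level rows: one resting minute and two walking minutes
demo = pd.DataFrame({'Calories': [0.7865, 2.5, 4.0],
                     'Steps': [0, 25, 50]})

# Naive division: the zero-step minute becomes inf
naive = demo['Calories'] / demo['Steps']

# Division as in the cell above: the zero-step minute becomes NaN
safe = demo['Calories'] / demo['Steps'].replace(0, np.nan)

print(naive.mean())  # inf, because one row is inf
print(safe.mean())   # mean over the two walking minutes only (NaN is skipped)
```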

In [None]:
# METs Features

minute_narrow_df['IsHighMET'] = (minute_narrow_df['METs'] >= 6).astype(int)

# Analysis on Sleep Dataset

In [None]:
sleep_df = pd.read_csv('/content/drive/MyDrive/Strava_data/sleep_df.csv')

In [None]:
sleep_df.head()

Unnamed: 0,Id,date,value,logId,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12 02:47:30,3,11380564589,2016-04-12,1.0,327.0,346.0
1,1503960366,2016-04-12 02:48:30,2,11380564589,2016-04-12,1.0,327.0,346.0
2,1503960366,2016-04-12 02:49:30,1,11380564589,2016-04-12,1.0,327.0,346.0
3,1503960366,2016-04-12 02:50:30,1,11380564589,2016-04-12,1.0,327.0,346.0
4,1503960366,2016-04-12 02:51:30,1,11380564589,2016-04-12,1.0,327.0,346.0


In [None]:
sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185923 entries, 0 to 185922
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Id                  185923 non-null  int64  
 1   date                185923 non-null  object 
 2   value               185923 non-null  int64  
 3   logId               185923 non-null  int64  
 4   SleepDay            185923 non-null  object 
 5   TotalSleepRecords   185923 non-null  float64
 6   TotalMinutesAsleep  185923 non-null  float64
 7   TotalTimeInBed      185923 non-null  float64
dtypes: float64(3), int64(3), object(2)
memory usage: 11.3+ MB


In [None]:
sleep_df.isnull().sum()

Unnamed: 0,0
Id,0
date,0
value,0
logId,0
SleepDay,0
TotalSleepRecords,0
TotalMinutesAsleep,0
TotalTimeInBed,0


In [None]:
# handle date time columns

sleep_df['SleepDay'] = pd.to_datetime(sleep_df['SleepDay'])

sleep_df['date'] = pd.to_datetime(sleep_df['date'])



**Feature Engineering**

In [None]:
# date-time features

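# Note: SleepDay holds dates only, so SleepHour is 0 for every row;
# sleep_df['date'] carries the actual minute-level timestamps.
sleep_df['SleepHour'] = sleep_df['SleepDay'].dt.hour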
sleep_df['SleepDayOfWeek'] = sleep_df['SleepDay'].dt.dayofweek
sleep_df['IsWeekend'] = sleep_df['SleepDayOfWeek'].isin([5, 6]).astype(int)
sleep_df['SleepDate'] = sleep_df['SleepDay'].dt.date


In [None]:
# sleep quality features

# Sleep efficiency
sleep_df['SleepEfficiency'] = sleep_df['TotalMinutesAsleep'] / sleep_df['TotalTimeInBed']

# minutes spent in bed but not asleep
sleep_df['TimeAwake'] = sleep_df['TotalTimeInBed'] - sleep_df['TotalMinutesAsleep']

# short (< 6 hours) and long (> 9 hours) sleep flags
sleep_df['ShortSleep'] = (sleep_df['TotalMinutesAsleep'] < 360).astype(int)
sleep_df['LongSleep'] = (sleep_df['TotalMinutesAsleep'] > 540).astype(int)

# Average duration per sleep record

import numpy as np
sleep_df['AvgMinutesPerSleep'] = sleep_df['TotalMinutesAsleep'] / sleep_df['TotalSleepRecords'].replace(0, np.nan)
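
As the earlier `sleep_df.head()` output shows, each minute-level row repeats the same daily totals (`TotalMinutesAsleep`, `TotalTimeInBed`). Averages taken directly over these rows therefore weight each night by its number of minute records. The sketch below collapses the data to one row per user and night first. The `demo` frame and the `nightly` name are hypothetical, and the code assumes the totals are constant within each `Id`/`SleepDay` pair.

```python
import pandas as pd

# Hypothetical minute-level rows shaped like sleep_df: totals repeat within a night
demo = pd.DataFrame({
    'Id': [1, 1, 1, 2, 2],
    'SleepDay': pd.to_datetime(['2016-04-12'] * 5),
    'value': [3, 2, 1, 1, 1],
    'TotalMinutesAsleep': [327.0] * 3 + [400.0] * 2,
    'TotalTimeInBed': [346.0] * 3 + [420.0] * 2,
})

# Keep one row per user and night (assumes totals do not vary within a night)
nightly = (demo.drop_duplicates(subset=['Id', 'SleepDay'])
               [['Id', 'SleepDay', 'TotalMinutesAsleep', 'TotalTimeInBed']]
               .reset_index(drop=True))
nightly['SleepEfficiency'] = nightly['TotalMinutesAsleep'] / nightly['TotalTimeInBed']

# Row-level mean is weighted by minute records; nightly mean treats nights equally
row_level_mean = demo['TotalMinutesAsleep'].mean()
nightly_mean = nightly['TotalMinutesAsleep'].mean()
print(row_level_mean, nightly_mean)
```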


In [None]:
sleep_df.head()

Unnamed: 0,Id,date,value,logId,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed,SleepHour,SleepDayOfWeek,IsWeekend,SleepDate,SleepEfficiency,TimeAwake,ShortSleep,LongSleep,AvgMinutesPerSleep
0,1503960366,2016-04-12 02:47:30,3,11380564589,2016-04-12,1.0,327.0,346.0,0,1,0,2016-04-12,0.945087,19.0,1,0,327.0
1,1503960366,2016-04-12 02:48:30,2,11380564589,2016-04-12,1.0,327.0,346.0,0,1,0,2016-04-12,0.945087,19.0,1,0,327.0
2,1503960366,2016-04-12 02:49:30,1,11380564589,2016-04-12,1.0,327.0,346.0,0,1,0,2016-04-12,0.945087,19.0,1,0,327.0
3,1503960366,2016-04-12 02:50:30,1,11380564589,2016-04-12,1.0,327.0,346.0,0,1,0,2016-04-12,0.945087,19.0,1,0,327.0
4,1503960366,2016-04-12 02:51:30,1,11380564589,2016-04-12,1.0,327.0,346.0,0,1,0,2016-04-12,0.945087,19.0,1,0,327.0


# Analysis on HeartRate Dataset

In [None]:
import pandas as pd

heartrate_df= pd.read_csv('/content/drive/MyDrive/Strava_data/heartrate_seconds_merged.csv')

In [None]:
heartrate_df.head()

Unnamed: 0,Id,Time,Value
0,2022484408,4/12/2016 7:21:00 AM,97
1,2022484408,4/12/2016 7:21:05 AM,102
2,2022484408,4/12/2016 7:21:10 AM,105
3,2022484408,4/12/2016 7:21:20 AM,103
4,2022484408,4/12/2016 7:21:25 AM,101


In [None]:
heartrate_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483658 entries, 0 to 2483657
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Time    object
 2   Value   int64 
dtypes: int64(2), object(1)
memory usage: 56.8+ MB


In [None]:
heartrate_df.isnull().sum()

Unnamed: 0,0
Id,0
Time,0
Value,0


In [None]:
# handling date time column

import pandas as pd
heartrate_df['Time'] = pd.to_datetime(heartrate_df['Time'])

**Feature Engineering**

In [None]:
# date time features
heartrate_df['Date'] = heartrate_df['Time'].dt.date

heartrate_df['Hour'] = heartrate_df['Time'].dt.hour

heartrate_df['Day'] = heartrate_df['Time'].dt.day

heartrate_df['DayOfWeek'] = heartrate_df['Time'].dt.dayofweek

heartrate_df['IsWeekend'] = heartrate_df['DayOfWeek'].isin([5, 6]).astype(int)

heartrate_df['TimeOfDay'] = pd.cut(heartrate_df['Hour'],
                                   bins=[-1, 6, 12, 17, 21, 24],
                                   labels=['Night', 'Morning', 'Afternoon', 'Evening', 'Late Night'])


In [None]:
# heart rate bins

def hr_status(hr):
    if hr < 60:
        return 'Resting'
    elif hr < 100:
        return 'Normal'
    elif hr < 140:
        return 'Fat Burn'
    elif hr < 160:
        return 'Cardio'
    else:
        return 'Peak'

heartrate_df['HR_Status'] = heartrate_df['Value'].apply(hr_status)
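
Calling `apply` with a Python function runs once per row, which can be slow on roughly 2.5 million heart-rate readings. One possible alternative, sketched here on hypothetical values, is `pd.cut` with `right=False` so that each bin includes its lower edge, matching the `<` comparisons in `hr_status`. The snippet checks that both approaches agree on the boundary values.

```python
import pandas as pd

def hr_status(hr):
    # Same thresholds as the notebook's hr_status
    if hr < 60:
        return 'Resting'
    elif hr < 100:
        return 'Normal'
    elif hr < 140:
        return 'Fat Burn'
    elif hr < 160:
        return 'Cardio'
    else:
        return 'Peak'

# Hypothetical readings placed on and around each threshold
values = pd.Series([39, 59, 60, 99, 100, 139, 140, 159, 160, 200])

# Vectorized binning; right=False gives intervals of the form [low, high)
vectorized = pd.cut(values,
                    bins=[-float('inf'), 60, 100, 140, 160, float('inf')],
                    right=False,
                    labels=['Resting', 'Normal', 'Fat Burn', 'Cardio', 'Peak']).astype(str)

applied = values.apply(hr_status)
print((vectorized == applied).all())
```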


In [None]:
# unusual heartrates (less than 40 or more than 180)

(((heartrate_df['Value'] < 40) | (heartrate_df['Value'] > 180)).astype(int)).value_counts()


Unnamed: 0_level_0,count
Value,Unnamed: 1_level_1
0,2483391
1,267


In [None]:
heartrate_df.columns


Index(['Id', 'Time', 'Value', 'Date', 'Hour', 'Day', 'DayOfWeek', 'IsWeekend',
       'TimeOfDay', 'HR_Status'],
      dtype='object')

# Analysis on WeightLogInfo Dataset

In [None]:
import pandas as pd

weightlog_df= pd.read_csv('/content/drive/MyDrive/Strava_data/weightLogInfo_merged.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Strava_data/weightLogInfo_merged.csv'

In [None]:
weightlog_df.head()

In [None]:
weightlog_df.info()

In [None]:
weightlog_df.isnull().sum()

Unnamed: 0,0
Id,0
Date,0
WeightKg,0
WeightPounds,0
Fat,65
BMI,0
IsManualReport,0
LogId,0


In [None]:
weightlog_df.drop('Fat', axis=1, inplace=True)

In [None]:
weightlog_df['Date'] = pd.to_datetime(weightlog_df['Date'], format='mixed')


**Feature Engineering**

In [None]:
# date time features

weightlog_df['DayOfWeek'] = weightlog_df['Date'].dt.dayofweek
weightlog_df['Month'] = weightlog_df['Date'].dt.month
weightlog_df['Week'] = weightlog_df['Date'].dt.isocalendar().week
weightlog_df['IsWeekend'] = weightlog_df['DayOfWeek'].isin([5, 6]).astype(int)


In [None]:
# BMI categories

def bmi_category(bmi):

    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

weightlog_df['BMICategory'] = weightlog_df['BMI'].apply(bmi_category)


In [None]:
# weight categories

# The top bin is open-ended so that weights above 150 kg are not left as NaN
weightlog_df['WeightGroup'] = pd.cut(weightlog_df['WeightKg'], bins=[0, 60, 70, 80, 90, 100, float('inf')],
                           labels=['<60kg', '60-70kg', '70-80kg', '80-90kg', '90-100kg', '>100kg'])


In [None]:
weightlog_df.head()

Unnamed: 0,Id,Date,WeightKg,WeightPounds,BMI,IsManualReport,LogId,DayOfWeek,Month,Week,IsWeekend,BMICategory,WeightGroup
0,1503960366,2016-05-02 23:59:59,52.599998,115.963147,22.65,True,1462233599000,0,5,18,0,Normal,<60kg
1,1503960366,2016-05-03 23:59:59,52.599998,115.963147,22.65,True,1462319999000,1,5,18,0,Normal,<60kg
2,1927972279,2016-04-13 01:08:52,133.5,294.31712,47.540001,False,1460509732000,2,4,15,0,Obese,>100kg
3,2873212765,2016-04-21 23:59:59,56.700001,125.002104,21.450001,True,1461283199000,3,4,16,0,Normal,<60kg
4,2873212765,2016-05-12 23:59:59,57.299999,126.324875,21.690001,True,1463097599000,3,5,19,0,Normal,<60kg


In [None]:
# Convert Date to a 24-hour string format.
# Date was already parsed to datetime earlier, so only the formatting step is needed.

weightlog_df['Date'] = weightlog_df['Date'].dt.strftime('%Y-%m-%d %H:%M:%S')
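
The 12-hour to 24-hour conversion can be checked on a couple of raw strings. The inputs below are assumed examples in the `%m/%d/%Y %I:%M:%S %p` layout, not values read from the weight log file.

```python
import pandas as pd

# Assumed raw timestamps in 12-hour format with an AM/PM marker
raw = pd.Series(['5/2/2016 11:59:59 PM', '4/13/2016 1:08:52 AM'])

# Parse with an explicit format, then render as 24-hour strings
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')
as_text = parsed.dt.strftime('%Y-%m-%d %H:%M:%S')

print(as_text.tolist())
# ['2016-05-02 23:59:59', '2016-04-13 01:08:52']
```

Once `Date` has been converted to strings, the `.dt` accessor no longer works on it, so any date-based features need to be derived before this step.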


# Downloading the final datasets for further Analysis & Dashboarding

In [None]:
# saving the final data in csv

# daily_df.to_csv("daily_df.csv", index=False)
# hourly_df.to_csv("hourly_df.csv", index=False)
minute_narrow_df.to_csv("minute_narrow_df.csv", index=False)
# sleep_df.to_csv("sleep_df.csv", index=False)
# heartrate_df.to_csv("heartrate_df.csv", index=False)
# weightlog_df.to_csv("weightlog_df.csv", index=False)



In [None]:
from google.colab import files

# files.download ("daily_df.csv")
# files.download ("hourly_df.csv")
files.download ("minute_narrow_df.csv")
# files.download ("sleep_df.csv")
# files.download ("heartrate_df.csv")
# files.download ("weightlog_df.csv")


# Splitting large datasets

In [None]:
# splitting minute_narrow_df into two

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Strava_data/minute_narrow_df.csv')

half = len(df) // 2

# Split into two parts
df1 = df.iloc[:half]
df2 = df.iloc[half:]

# Save as two new CSV files
df1.to_csv('minute_narrow_df1.csv', index=False)
df2.to_csv('minute_narrow_df2.csv', index=False)


In [None]:
from google.colab import files

files.download ("minute_narrow_df1.csv")
files.download ("minute_narrow_df2.csv")



In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Strava_data/heartrate_df.csv')

In [None]:
# splitting heartrate_df into three

n = len(df)
part_size = n // 3

df1 = df.iloc[:part_size]
df2 = df.iloc[part_size:2*part_size]
df3 = df.iloc[2*part_size:]

df1.to_csv('heartrate_df1.csv', index=False)
df2.to_csv('heartrate_df2.csv', index=False)
df3.to_csv('heartrate_df3.csv', index=False)
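
The two-way and three-way splits follow the same pattern: equal-sized blocks, with the last block taking any remainder. A minimal sketch of a reusable helper is shown below. `split_frame` is a hypothetical name, and the demo frame is synthetic. The final check confirms that concatenating the parts reproduces the original rows.

```python
import pandas as pd

def split_frame(df, n_parts):
    """Split df into n_parts consecutive row blocks; the last block takes the remainder."""
    part_size = len(df) // n_parts
    parts = [df.iloc[i * part_size:(i + 1) * part_size] for i in range(n_parts - 1)]
    parts.append(df.iloc[(n_parts - 1) * part_size:])
    return parts

# Synthetic frame with a row count that does not divide evenly by 3
demo = pd.DataFrame({'Value': range(10)})
parts = split_frame(demo, 3)

part_lengths = [len(p) for p in parts]
print(part_lengths)  # [3, 3, 4]

# No rows are lost or duplicated by the split
rebuilt = pd.concat(parts)
print(rebuilt.equals(demo))
```

Each part could then be written out with `to_csv(..., index=False)` in a loop, as in the cells above.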




In [None]:
from google.colab import files

files.download ("heartrate_df1.csv")
files.download ("heartrate_df2.csv")
files.download ("heartrate_df3.csv")

