# Missingness Analysis

This notebook measures how prevalent missing rows are in the activity data of the athletes shortlisted for further analysis (df_athletes_final.csv). Specifically, we want to know how common missing rows are across the data as a whole and how the gaps are distributed by length: how often a single row is missing, two consecutive rows, three consecutive rows, and so on. The results will inform our approach to data cleaning and imputation before further analysis.
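To make the idea of gap lengths concrete, here is a minimal sketch on a toy recording. It assumes one row per second (1 Hz) and a `secs` column like the one used later in this notebook; the values themselves are made up for illustration.

```python
import pandas as pd

# Toy 1 Hz recording (hypothetical values): seconds 3, 6 and 7 were never logged.
toy = pd.DataFrame({"secs": [0, 1, 2, 4, 5, 8, 9]})

# A step of 1 s between consecutive rows means nothing is missing;
# a step of k s means k - 1 rows are missing in between.
gap_lengths = (toy["secs"].diff() - 1).dropna().astype(int)

# Count how often each gap length occurs (0 = no gap).
gap_counts = gap_lengths.value_counts().sort_index()
print(gap_counts.to_dict())  # {0: 4, 1: 1, 2: 1}
```

Here one single-row gap and one two-row gap are found among six intervals, which is the kind of distribution the rest of the notebook builds for real rides.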

## Downloading Data

In this section, we store the activity data of the selected athletes locally so that later steps can read it from disk instead of calling the API again.

In [1]:
# Importing requirements
import os
import pandas as pd
from opendata import OpenData

In [None]:
# Change working directory

os.chdir("..")
os.getcwd()

'c:\\Users\\karka\\Projects\\Golden-Cheetah'

In [None]:
# Load csv as dataframe

df = pd.read_csv(os.path.join("data", "interim", "df_athletes_final.csv"))

In [None]:
from concurrent.futures import ThreadPoolExecutor

od = OpenData()


def fetch_and_store(athlete_id):
    od.get_remote_athlete(athlete_id).store_locally()


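# Using threading to speed up API calls; list() consumes the results so that
# any exception raised inside a worker is re-raised here instead of being lost
with ThreadPoolExecutor() as executor:
    list(executor.map(fetch_and_store, df["id"]))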

## Analysing Missingness

This section will open 10 randomly selected bike rides from each shortlisted athlete and calculate missingness.

In [5]:
# Re-create the OpenData client so this section can run without repeating the download above
od = OpenData()

In [None]:
import pandas as pd
import random

# Setting the random seed for reproducibility
random.seed(42)

# Initialise a list to collect DataFrames
df_list = []

# Loop through each athlete ID
for athlete_id in df["id"]:
    # Retrieve athlete and their activities
    athlete = od.get_local_athlete(athlete_id)
    activities = list(athlete.activities())

    # Filter activities to include only those where the sport is 'Bike'
    cycling_activities = [
        activity for activity in activities if activity.metadata.get("sport") == "Bike"
    ]

    # Select up to 10 random bike rides (random.sample raises ValueError if
    # asked for more items than the list contains)
    sample_activities = random.sample(
        cycling_activities, min(10, len(cycling_activities))
    )

    # Analyze missingness in each ride in the sample
    for ride in sample_activities:
        data_df = ride.data

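        # Time step between consecutive rows, assuming 1 Hz recording: a step of
        # k seconds means k - 1 rows are missing, and a step of 0 (repeated
        # timestamp) gives -1. The first row has no predecessor, so it is NaN.
        data_df["deltaSecs"] = data_df["secs"].diff()
        data_df["missingRows"] = data_df["deltaSecs"] - 1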

        # Create a missingness DataFrame
        df_activity_missingness = (
            data_df["missingRows"].value_counts().sort_index().reset_index()
        )
        df_activity_missingness.columns = ["missingSeconds", "frequency"]

        # Add ride duration (last 'secs' value), athlete ID, and activity date to the DataFrame
        df_activity_missingness["totalSeconds"] = int(data_df["secs"].iloc[-1])
        df_activity_missingness["athleteID"] = athlete_id
        df_activity_missingness["activityDate"] = ride.metadata["date"]

        # Append the DataFrame to the list
        df_list.append(df_activity_missingness)

# Concatenate all DataFrames in the list into a single DataFrame
df_missingness = pd.concat(df_list, ignore_index=True)

In [11]:
df_missingness.head()

   missingSeconds  frequency  totalSeconds                             athleteID             activityDate
0             0.0       4149          4149  75119381-8969-4cfe-8c31-f21ce0f7ae3a  2019/11/02 12:45:00 UTC
1             0.0       6525          7442  75119381-8969-4cfe-8c31-f21ce0f7ae3a  2019/05/24 17:44:00 UTC
2            27.0          1          7442  75119381-8969-4cfe-8c31-f21ce0f7ae3a  2019/05/24 17:44:00 UTC
3            37.0          2          7442  75119381-8969-4cfe-8c31-f21ce0f7ae3a  2019/05/24 17:44:00 UTC
4            41.0          1          7442  75119381-8969-4cfe-8c31-f21ce0f7ae3a  2019/05/24 17:44:00 UTC


In [None]:
# Group by 'missingSeconds' and sum the 'frequency' column to create df_missingness_summary
df_missingness_summary = (
    df_missingness.groupby("missingSeconds")["frequency"].sum().reset_index()
)
df_missingness_summary.head()

   missingSeconds  frequency
0            -1.0        120
1             0.0    2413911
2             1.0       2582
3             2.0        161
4             3.0        129
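The raw counts are easier to compare as shares. The sketch below is only an illustration: it rebuilds the five rows shown above (the full summary contains more gap lengths) and normalises them. It also isolates the `-1.0` row, which follows from `missingRows = deltaSecs - 1`: a value of -1 means two consecutive rows carry the same timestamp, so it counts duplicated seconds rather than missing ones. The name `summary_head` is introduced here for the example.

```python
import pandas as pd

# The five rows shown above; the full summary has more rows than this.
summary_head = pd.DataFrame(
    {
        "missingSeconds": [-1.0, 0.0, 1.0, 2.0, 3.0],
        "frequency": [120, 2413911, 2582, 161, 129],
    }
)

# Share of recorded intervals at each gap length, among these five rows only.
summary_head["share"] = summary_head["frequency"] / summary_head["frequency"].sum()

# Intervals with a repeated timestamp (deltaSecs == 0, hence missingSeconds == -1).
duplicates = summary_head.loc[summary_head["missingSeconds"] < 0, "frequency"].sum()

print(summary_head.round(6))
print("duplicate-timestamp intervals:", duplicates)
```

Applied to the full `df_missingness_summary`, the same two lines would give the share of every gap length observed in the sample.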


In [None]:
df_missingness_summary.to_csv(
    os.path.join("data", "processed", "df_missingness.csv"), index=False
)

In [None]:
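# Total seconds analysed

# Keep one row per activity; athleteID is part of the key so that rides by
# different athletes with the same start time and duration are not merged
df_temp = df_missingness[
    ["athleteID", "activityDate", "totalSeconds"]
].drop_duplicates()
totalSeconds = int(df_temp["totalSeconds"].sum())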

In [25]:
totalSeconds

2757990
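One way to combine this total with the gap summary is to express the missing seconds as a fraction of the elapsed time. The helper below, `missing_share`, is a hypothetical name introduced for this sketch and is not part of the notebook; the toy numbers are made up. It weights each positive gap length by its frequency and ignores the `-1` and `0` rows, which represent no missing time.

```python
import pandas as pd


def missing_share(summary: pd.DataFrame, total_seconds: int) -> float:
    """Return missing seconds as a fraction of total elapsed seconds."""
    # Only positive gap lengths correspond to missing rows.
    gaps = summary[summary["missingSeconds"] > 0]
    # Each gap of length k contributes k missing seconds, frequency times.
    missing = (gaps["missingSeconds"] * gaps["frequency"]).sum()
    return float(missing) / total_seconds


# Toy summary (hypothetical values): three 1 s gaps and one 5 s gap in 100 s.
toy_summary = pd.DataFrame(
    {"missingSeconds": [-1.0, 0.0, 1.0, 5.0], "frequency": [2, 90, 3, 1]}
)
share = missing_share(toy_summary, total_seconds=100)
print(share)  # 0.08
```

In this notebook the equivalent call would be `missing_share(df_missingness_summary, totalSeconds)`, assuming both objects are still in memory.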