# SIADS521 Assignment 04

In [414]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
from folium.plugins import MarkerCluster
from datetime import datetime
import panel as pn
import hvplot.pandas

# Initialize Panel with material design and link to fonts
pn.extension(design='material', css_files=['https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500&family=Montserrat:wght@400;500;600&display=swap'])

# Add simplified CSS for font styling
pn.config.raw_css.append("""
body { font-family: 'Roboto', Helvetica, Arial, sans-serif; }
h1, h2, h3, h4, h5, h6 { font-family: 'Montserrat', Helvetica, Arial, sans-serif; }
""")

## Data Ingestion and Processing

### Data Ingestion and Inspection

In [415]:
df = pd.read_csv('strava.csv')
df.head()

Unnamed: 0,Air Power,Cadence,Form Power,Ground Time,Leg Spring Stiffness,Power,Vertical Oscillation,altitude,cadence,datafile,...,enhanced_speed,fractional_cadence,heart_rate,position_lat,position_long,speed,timestamp,unknown_87,unknown_88,unknown_90
0,,,,,,,,,0.0,activities/2675855419.fit.gz,...,0.0,0.0,68.0,,,0.0,2019-07-08 21:04:03,0.0,300.0,
1,,,,,,,,,0.0,activities/2675855419.fit.gz,...,0.0,0.0,68.0,,,0.0,2019-07-08 21:04:04,0.0,300.0,
2,,,,,,,,,54.0,activities/2675855419.fit.gz,...,1.316,0.0,71.0,,,1316.0,2019-07-08 21:04:07,0.0,300.0,
3,,,,,,,,3747.0,77.0,activities/2675855419.fit.gz,...,1.866,0.0,77.0,504432050.0,-999063637.0,1866.0,2019-07-08 21:04:14,0.0,100.0,
4,,,,,,,,3798.0,77.0,activities/2675855419.fit.gz,...,1.894,0.0,80.0,504432492.0,-999064534.0,1894.0,2019-07-08 21:04:15,0.0,100.0,


In [416]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40649 entries, 0 to 40648
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Air Power             17842 non-null  float64
 1   Cadence               17847 non-null  float64
 2   Form Power            17842 non-null  float64
 3   Ground Time           17847 non-null  float64
 4   Leg Spring Stiffness  17842 non-null  float64
 5   Power                 17847 non-null  float64
 6   Vertical Oscillation  17847 non-null  float64
 7   altitude              14905 non-null  float64
 8   cadence               40627 non-null  float64
 9   datafile              40649 non-null  object 
 10  distance              40649 non-null  float64
 11  enhanced_altitude     40598 non-null  float64
 12  enhanced_speed        40639 non-null  float64
 13  fractional_cadence    40627 non-null  float64
 14  heart_rate            38355 non-null  float64
 15  position_lat       

In [417]:
print(df.isnull().mean().sort_values(ascending=False) * 100)

altitude                63.332431
speed                   63.275849
Air Power               56.107161
Form Power              56.107161
Leg Spring Stiffness    56.107161
Cadence                 56.094861
Ground Time             56.094861
Power                   56.094861
Vertical Oscillation    56.094861
unknown_90              54.198135
heart_rate               5.643435
unknown_88               5.643435
position_lat             0.472336
position_long            0.472336
enhanced_altitude        0.125464
fractional_cadence       0.054122
cadence                  0.054122
unknown_87               0.054122
enhanced_speed           0.024601
distance                 0.000000
datafile                 0.000000
timestamp                0.000000
dtype: float64


#### Observation

After the above analysis, the following is evident regarding the Strava dataset:
- The dataset has 40649 rows and 22 columns.
- Multiple rows make up a single 'exercise', which is uniquely identified by the datafile (i.e. grouping by datafile yields all of the data for a single run or cycle).
- The exercises are either runs or cycles. These can be differentiated by (1) the average speed of the exercise and (2) whether or not the run-specific variables have been populated (i.e. Form Power, Ground Time and Vertical Oscillation).
- The data is very raw and 'messy', as it contains many missing values and inconsistently populated fields. This is likely due to multiple data sources and underlying inconsistencies.
- The percentage of null values within a given column ranges from acceptable (e.g. around 5% or less) to significantly compromising the data (e.g. upwards of 50%).
- One of the more relevant variables, speed, is underpopulated (about 63% null). To remedy this, the average speed can be calculated from the distance and the elapsed time (which can be derived from the timestamp variable).
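
As a minimal sketch of the last point, the average speed of one exercise can be derived from its total distance and elapsed time. The toy records and the helper name `average_speed_kmh` are hypothetical, and the sketch assumes that `distance` is cumulative and recorded in metres, which is how the processing code in the next section treats it.

```python
import pandas as pd

# Hypothetical records for a single exercise (illustrative values only)
toy = pd.DataFrame({
    "datafile": ["a.fit.gz"] * 3,
    "timestamp": pd.to_datetime([
        "2019-07-08 21:00:00",
        "2019-07-08 21:15:00",
        "2019-07-08 21:30:00",
    ]),
    "distance": [0.0, 2500.0, 5000.0],  # assumed cumulative metres
})


def average_speed_kmh(group):
    """Average speed in km/h from total distance and elapsed time."""
    elapsed = group["timestamp"].max() - group["timestamp"].min()
    hours = elapsed.total_seconds() / 3600
    if hours == 0:
        # A single-record exercise has no elapsed time to divide by
        return float("nan")
    return (group["distance"].max() / 1000) / hours


toy_speed = average_speed_kmh(toy)
print(toy_speed)
```

In this toy case, 5 km covered in half an hour gives 10 km/h.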

### Data Processing and Manipulation

In [418]:
# Convert the time stamp to Pandas DateTime
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [None]:
def get_exercise_groups(df):
    # Group by datafile to get individual exercises
    exercise_groups = df.groupby('datafile')
    
    # Process each exercise
    exercise_data = []
    for datafile, group in exercise_groups:
        # Extract timestamps for duration calculation
        timestamps = group['timestamp']
        if len(timestamps) > 0:
            start_time = timestamps.min()
            end_time = timestamps.max()
            duration_minutes = (end_time - start_time).total_seconds() / 60
            
            # Get distance (max value should be total distance)
            distance_km = group['distance'].max() / 1000 if 'distance' in group.columns else None
            
            # Calculate average speed
            avg_speed_kmh = distance_km / (duration_minutes / 60) if distance_km and duration_minutes else None
            
            # Check for running-specific metrics
            has_running_metrics = (
                ('Form Power' in group.columns and group['Form Power'].notna().any()) or
                ('Ground Time' in group.columns and group['Ground Time'].notna().any()) or
                ('Vertical Oscillation' in group.columns and group['Vertical Oscillation'].notna().any())
            )
            
            # Determine activity type
            # Running if: has running-specific metrics OR average speed is below 15 km/h (reasonable threshold to differentiate running from cycling)
            activity_type = 'Running' if has_running_metrics or (avg_speed_kmh and avg_speed_kmh < 15) else 'Cycling'
            
            # Check if there's position data
            has_position = (
                'position_lat' in group.columns and
                'position_long' in group.columns and
                group['position_lat'].notna().any() and
                group['position_long'].notna().any()
            )
            
            # Calculate heart rate stats if available
            hr_avg = group['heart_rate'].mean() if 'heart_rate' in group.columns and group['heart_rate'].notna().any() else None
            hr_max = group['heart_rate'].max() if 'heart_rate' in group.columns and group['heart_rate'].notna().any() else None
            
            # Add to the exercise list
            exercise_data.append({
                'datafile': datafile,
                'start_time': start_time,
                'duration_minutes': duration_minutes,
                'distance_km': distance_km,
                'avg_speed_kmh': avg_speed_kmh,
                'activity_type': activity_type,
                'has_position': has_position,
                'hr_avg': hr_avg,
                'hr_max': hr_max,
                'data': group
            })
    
    return pd.DataFrame(exercise_data)

exercise_df = get_exercise_groups(df)

#### Observations

Based on the inspection, the following was done to clean the data or improve its usability:
- The timestamp variable was changed to Pandas DateTime type.
- The data was grouped by datafile to extract individual exercises' data (once grouped there are 64 exercises).
- For each of the grouped exercises, the following was extracted: (1) the start and end time, (2) the duration, (3) the average speed, (4) the exercise type (i.e. run or cycle).
- Finally, an exercise Pandas DataFrame was created, with only the relevant variables included (this also ensured that all variables with significant percentages of null values, as determined above, were not included to maintain data integrity).
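
The same per-exercise summary can also be sketched with a named aggregation instead of an explicit loop. This is an illustrative alternative rather than the code used above: the toy data is hypothetical, and the classification here uses only the 15 km/h speed threshold, leaving out the check for running-specific metrics.

```python
import pandas as pd

# Hypothetical records for two exercises (illustrative values only)
toy = pd.DataFrame({
    "datafile": ["run.fit.gz", "run.fit.gz", "ride.fit.gz", "ride.fit.gz"],
    "timestamp": pd.to_datetime([
        "2019-07-08 21:00:00", "2019-07-08 21:30:00",
        "2019-07-09 08:00:00", "2019-07-09 09:00:00",
    ]),
    "distance": [0.0, 5000.0, 0.0, 25000.0],  # assumed cumulative metres
    "heart_rate": [120.0, 150.0, 110.0, 130.0],
})

# One row per exercise, built with pandas named aggregation
summary = toy.groupby("datafile").agg(
    start_time=("timestamp", "min"),
    end_time=("timestamp", "max"),
    distance_km=("distance", lambda s: s.max() / 1000),
    hr_avg=("heart_rate", "mean"),
    hr_max=("heart_rate", "max"),
)

# Derived columns: duration, average speed and a speed-only activity label
elapsed = summary["end_time"] - summary["start_time"]
summary["duration_minutes"] = elapsed.dt.total_seconds() / 60
summary["avg_speed_kmh"] = summary["distance_km"] / (summary["duration_minutes"] / 60)
summary["activity_type"] = summary["avg_speed_kmh"].lt(15).map(
    {True: "Running", False: "Cycling"}
)

print(summary[["distance_km", "avg_speed_kmh", "activity_type"]])
```

The loop-based version remains easier to extend with per-exercise checks such as `has_position`, which is one reason to keep it in the notebook.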

## Visualisation Functions

As per the assignment requirements, a combination of basic and advanced visualisation techniques is used. The basic techniques are bar graphs, box plots and scatter plots; the more advanced are a geographic map plot and a combined line plot. The basic plots and the combined line plot were created using hvplot, and the geographic map plot was created using Folium.

hvplot was selected for basic and combined line plots due to its powerful interactive visualization capabilities. The library provides seamless integration with Pandas, enabling easy creation of complex, interactive plots with minimal code. Its key strengths include intuitive pandas-compatible plotting, interactive features like zooming and panning, and the ability to create consistent, aesthetically pleasing visualizations across different plot types.

Folium was chosen for geographic mapping because it is built specifically for geospatial visualization. Unlike general plotting libraries, Folium wraps the Leaflet.js mapping library and renders interactive maps on OpenStreetMap tiles by default. It makes it straightforward to overlay lines and markers on a map, which allows exercise routes to be shown in their actual geographical context.

### Overview Plots

The overview plots will go on the first tab (the overview tab) of the dashboard. This section also includes functions for elements that are not strictly visualisations but are useful for data communication (i.e. tables and summary statistics). After each visualisation function below, there is a justification as to why the plot was chosen.

In [420]:
def create_summary_card(title, value, subtext=''):
    """
    Creates a styled HTML card for displaying summary statistics in the dashboard.

    Parameters:
    -----------
    title : str
        The label or name of the metric (appears at top of card)
    value : str
        The main value or statistic to display prominently
    subtext : str, optional
        Additional context or detail to display below the main value
        (default is empty string, which displays no subtext)

    Returns:
    --------
    panel.pane.HTML
        A Panel HTML pane containing the formatted card
    """
    # Construct HTML for the card using f-string templating
    card_html = f"""
    <div style="
        background-color: #f0f0f0; 
        border-radius: 8px; 
        padding: 15px; 
        text-align: center; 
        height: 100%;
    ">
        <!-- Card title in smaller, lighter text -->
        <p style="color: #666; margin-bottom: 5px; font-size: 0.9em;">{title}</p>
        
        <!-- Main value in large, bold text -->
        <h2 style="margin: 0; font-size: 2.5em; font-weight: bold;">{value}</h2>
        
        <!-- Optional subtext - only included if provided -->
        {"<p style='color: #666; font-size: 0.8em; margin-top: 5px;'>" + subtext + "</p>" if subtext else ""}
    </div>
    """
    
    # Return the HTML wrapped in a Panel pane with fixed height
    return pn.pane.HTML(card_html, height=150, sizing_mode='stretch_width')

#### Observations
<img src="/Users/natashasoldin/Documents/Education/Masters/First Semester/Session 03/SIADS 521/Assignments/Assignment 04/Dashboard Screenshots/Summary Tabs.png" width="800">

The total activities summary provides a critical overview of Professor Brooks' exercise journey during the summer of 2019. The breakdown of 64 total activities, with 51 running and 13 cycling activities, reveals a clear preference for running while maintaining a complementary cycling routine. This summary immediately contextualizes the entire dataset, showing a committed approach to maintaining regular physical activity across two different exercise modalities.


In [421]:
def create_activity_table(exercises_df):
   """
   Creates an interactive, formatted table displaying activity details.
   
   This function processes the exercise DataFrame to create a user-friendly
   table of activities with consistent formatting, alignment, and display units.
   It handles data cleaning, column renaming, and styling to ensure the 
   table is readable and professional.
   
   Parameters:
   -----------
   exercises_df : pandas.DataFrame
       DataFrame containing processed exercise data with columns for
       datafile, start_time, activity_type, distance_km, avg_speed_kmh, and hr_avg
   
   Returns:
   --------
   panel.widgets.DataFrame
       An interactive Panel DataFrame widget with custom formatting
   """
   # Create a copy to avoid modifying the original DataFrame
   table_df = exercises_df[['datafile', 'start_time', 'activity_type', 'distance_km', 
                          'avg_speed_kmh', 'hr_avg']].copy()
   
   # Format numeric columns to display 2 decimal places (only for non-null values)
   for col in ['distance_km', 'avg_speed_kmh', 'hr_avg']:
       table_df[col] = table_df[col].apply(lambda x: round(x, 2) if pd.notna(x) else x)
   
   # Format date to a more readable format (day/month/year)
   table_df['start_time'] = table_df['start_time'].dt.strftime('%d/%m/%Y')
   
   # Rename columns to more user-friendly display names
   table_df = table_df.rename(columns={
       'start_time': 'Date',
       'activity_type': 'Type',
       'distance_km': 'Distance (km)',
       'avg_speed_kmh': 'Avg Speed (km/h)',
       'hr_avg': 'Avg Heart Rate'
   })
   
   # Replace NaN values with a dash for cleaner display
   table_df = table_df.fillna('-')
   
   # Remove the internal ID column which isn't useful to end users
   table_df = table_df.drop(columns=['datafile'])
   
   # Define custom number formatters for numeric columns to ensure consistent
   # decimal places and left alignment
   formatters = {
       col: pn.widgets.tables.NumberFormatter(format='0.00', text_align='left')
       for col in table_df.columns 
       if table_df[col].dtype in [float, int, 'float64', 'int64']
   }
   
   # Create the interactive Panel DataFrame widget with appropriate styling
   table = pn.widgets.DataFrame(
       table_df,               # Processed data
       width=1200,             # Wide enough for all columns
       height=500,             # Tall enough to show multiple activities
       show_index=False,       # Hide the DataFrame index
       sizing_mode='stretch_width',  # Responsive width
       formatters=formatters,  # Apply custom number formatting
       text_align='left'       # Consistent text alignment
   )
   
   return table

#### Observation

<img src="/Users/natashasoldin/Documents/Education/Masters/First Semester/Session 03/SIADS 521/Assignments/Assignment 04/Dashboard Screenshots/Activity Table.png" width="1000">

The table offers a summary of all of the exercises and their distinct variables (i.e. date, exercise type, distance, average speed and average heart rate). The table also allows for filtering by exercise type using a selection bar at the top.
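
The filtering behind such a selection bar can be sketched in plain pandas. The helper name `filter_by_type` and the `'All'` option are assumptions made for illustration; the dashboard's actual widget wiring is not shown in this section.

```python
import pandas as pd


def filter_by_type(exercises, activity_type="All"):
    """Return only the rows matching activity_type, or every row for 'All'."""
    if activity_type == "All":
        return exercises
    return exercises[exercises["activity_type"] == activity_type]


# Hypothetical exercise summary (illustrative values only)
toy_exercises = pd.DataFrame({
    "activity_type": ["Running", "Cycling", "Running"],
    "distance_km": [5.0, 20.0, 8.0],
})

running_only = filter_by_type(toy_exercises, "Running")
print(running_only)
```

The filtered frame could then be passed to `create_activity_table` each time the selection changes.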

In [422]:
def create_monthly_activity_plot(exercises_df):
   """
   Creates a bar chart showing the count of activities grouped by month.
   
   This visualization helps identify activity patterns and trends over time,
   allowing users to see which months had higher or lower activity levels.
   
   Parameters:
   -----------
   exercises_df : pandas.DataFrame
       DataFrame containing exercise data with a 'start_time' column 
       that has datetime values
   
   Returns:
   --------
   holoviews.element.Bars
       An interactive bar chart showing activity counts by month
   
   Notes:
   ------
   - Months are formatted as 'YYYY-MM' for clear chronological ordering
   - Each bar represents the total number of activities recorded in that month
   """
   # Create a copy to avoid modifying the original DataFrame
   exercises_df = exercises_df.copy()
   
   # Extract month and year from timestamp and create a new column
   # Format as 'YYYY-MM' for chronological sorting
   exercises_df['month_year'] = exercises_df['start_time'].dt.strftime('%Y-%m')
   
   # Group by month-year and count the number of activities in each period
   monthly_counts = exercises_df.groupby('month_year').size().reset_index()
   
   # Rename the columns for clarity ('size' becomes 'count')
   monthly_counts.columns = ['month_year', 'count']
   
   # Create an interactive bar chart using hvplot
   bar_chart = monthly_counts.hvplot.bar(
       x='month_year',          # Month-year on x-axis
       y='count',               # Activity count on y-axis
       title='Activities per Month',
       xlabel='Month',
       ylabel='Number of Activities',
       height=300,              # Fixed height 
       width=600                # Fixed width
   )
   
   return bar_chart

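# Render the plot directly from exercise_df; reassigning df here would
# overwrite the raw record-level DataFrame loaded earlier
bar_plot = create_monthly_activity_plot(exercise_df)
bar_plot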


#### Observations
The activities per month visualization offers a compelling narrative of exercise consistency and potential seasonal variations. The plot shows a peak in activities during July 2019, with a gradual decline in August and September. This pattern might indicate factors such as summer availability, training intensity, or potential external constraints affecting exercise frequency. The visualization provides immediate insights into the rhythm of Professor Brooks' exercise routine, highlighting the importance of tracking not just individual activities, but their distribution over time.
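
To check whether a monthly peak is driven by one activity type, the counts can be split by type with a cross-tabulation. This is a small sketch on hypothetical data, not output from the Strava dataset.

```python
import pandas as pd

# Hypothetical exercise summary (illustrative values only)
toy = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2019-07-01", "2019-07-15", "2019-08-03", "2019-09-10"]
    ),
    "activity_type": ["Running", "Cycling", "Running", "Running"],
})

# Same 'YYYY-MM' key as the bar chart, cross-tabulated against activity type
month = toy["start_time"].dt.strftime("%Y-%m")
counts = pd.crosstab(month, toy["activity_type"])
print(counts)
```

Months with no activities of a given type show a count of zero rather than being dropped.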

In [None]:
def create_distance_boxplot(exercises_df):
   """
   Creates a box plot showing the distribution of distances by activity type.
   
   This visualization helps users understand the typical distance ranges for
   different activity types (Running vs Cycling) and identify outliers.
   Box plots display the median, quartiles, and outliers in the distance data.
   
   Parameters:
   -----------
   exercises_df : pandas.DataFrame
       DataFrame containing exercise data with 'distance_km' and 'activity_type' columns
   
   Returns:
   --------
   holoviews.element.BoxWhisker
       An interactive box plot showing distance distributions by activity type
   
   Notes:
   ------
   - The box shows the interquartile range (IQR) with a line for the median
   - Whiskers typically extend to the most extreme points within 1.5 * IQR of the quartiles
   - Points beyond the whiskers are considered outliers
   """
   # Create the box plot using hvplot
   box_plot = exercises_df.hvplot.box(
       y='distance_km',         # Distance values for the box plot distribution
       by='activity_type',      # Group and compare by activity type (e.g., Running vs Cycling)
       title='Distance Distribution by Activity Type',
       xlabel='Activity Type',  # Categories on x-axis
       ylabel='Distance (km)',  # Distance measurement on y-axis
       height=300,              # Fixed height
       width=400                # Fixed width
   )
   
   return box_plot

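# Render the plot directly from exercise_df; reassigning df here would
# overwrite the raw record-level DataFrame loaded earlier
box_plot = create_distance_boxplot(exercise_df)
box_plot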

#### Observations
This box plot presents a nuanced comparison between running and cycling activities, revealing the distinctive characteristics of each exercise type. The plot demonstrates significant variations in distance distributions, with cycling activities showing a wider range and potentially longer distances compared to running. This visualization goes beyond simple numerical comparison, offering insights into the different nature of running and cycling as exercise modalities. It highlights the variability within each activity type and provides a comprehensive view of Professor Brooks' exercise diversity.
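
The quantities a box plot summarises can be computed directly, which helps when reading the chart. The sketch below uses hypothetical distances and the common 1.5 * IQR fence convention; the plotting backend may compute quartiles slightly differently.

```python
import numpy as np

# Hypothetical distances in km for one activity type (illustrative values only)
distances_km = np.array([3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 20.0])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(distances_km, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are usually drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = distances_km[(distances_km < lower_fence) | (distances_km > upper_fence)]

print(median, iqr, outliers)
```

Here the 20 km value falls above the upper fence, so it would appear as an individual outlier point rather than stretching the whisker.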

In [424]:
def create_speed_trend_plot(exercises_df):
   """
   Creates a scatter plot showing speed trends over time, with different colors for 
   running and cycling activities.
   
   This visualization helps identify patterns in performance over time, seasonal
   variations, and improvements in speed for different activity types.
   
   Parameters:
   -----------
   exercises_df : pandas.DataFrame
       DataFrame containing exercise data with columns for:
       'start_time', 'activity_type', and 'avg_speed_kmh'
   
   Returns:
   --------
   holoviews.core.overlay.Overlay, holoviews.element.Scatter or panel.pane.HTML
       An interactive scatter plot showing speed trends over time
       (an overlay when both activity types are present),
       or an HTML message if no data is available
   
   Notes:
   ------
   - Running activities are shown in blue
   - Cycling activities are shown in orange
   - Points higher on the y-axis indicate faster speeds
   """
   # Sort activities chronologically to properly show progression over time
   exercises_df = exercises_df.sort_values('start_time')
   
   # Split data by activity type to apply different styling
   running_df = exercises_df[exercises_df['activity_type'] == 'Running'].copy()
   cycling_df = exercises_df[exercises_df['activity_type'] == 'Cycling'].copy()
   
   # Create scatter plot for running activities with full configuration
   running_plot = running_df.hvplot.scatter(
       x='start_time',         # Time on x-axis
       y='avg_speed_kmh',      # Speed on y-axis
       title='Speed Trends Over Time',
       xlabel='Date',
       ylabel='Average Speed (km/h)',
       label='Running',        # For legend
       color='blue',           # Running activities in blue
       height=400,
       width=800
   )
   
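   # Create scatter plot for cycling activities
   # Title, labels and size are repeated so the plot is complete when shown alone
   cycling_plot = cycling_df.hvplot.scatter(
       x='start_time',
       y='avg_speed_kmh',
       title='Speed Trends Over Time',
       xlabel='Date',
       ylabel='Average Speed (km/h)',
       label='Cycling',        # For legend
       color='orange',         # Cycling activities in orange
       height=400,
       width=800
   )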
   
   # Combine plots or handle cases with missing data
   if len(running_df) > 0 and len(cycling_df) > 0:
       # Both activity types exist - overlay them
       combined_plot = running_plot * cycling_plot
   elif len(running_df) > 0:
       # Only running activities exist
       combined_plot = running_plot
   elif len(cycling_df) > 0:
       # Only cycling activities exist
       combined_plot = cycling_plot
   else:
       # No speed data available - return a placeholder message
       combined_plot = pn.pane.HTML(
           "<div style='height: 400px; display: flex; align-items: center; justify-content: center;'>"
           "<h3>No speed data available</h3></div>"
       )
   
   return combined_plot

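# Render the plot directly from exercise_df; reassigning df here would
# overwrite the raw record-level DataFrame loaded earlier
scatter_plot = create_speed_trend_plot(exercise_df)
scatter_plot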

#### Observations
The speed trends visualization captures the dynamic nature of exercise performance over time. By plotting speed across different dates, the graph reveals patterns in exercise intensity and potential performance improvements. The scatter plot shows variations between running (blue points) and cycling (orange points) speeds, with cycling activities generally demonstrating higher speed ranges. This visualization is particularly useful for showing the progression and variability of exercise intensity, providing insights into training consistency and potential areas of improvement.

In [425]:
def add_overview_plots(exercises_df):
   """
   Creates and organizes a comprehensive dashboard section with multiple visualizations
   of exercise activity data.
   
   This function combines three different visualizations into a well-structured,
   responsive layout:
   1. Monthly activity count bar chart
   2. Distance distribution box plot by activity type
   3. Speed trends over time scatter plot
   
   Parameters:
   -----------
   exercises_df : pandas.DataFrame
       DataFrame containing processed exercise data with all necessary columns
       for the three visualization functions
   
   Returns:
   --------
   panel.layout.Column
       A Panel layout containing all three visualizations arranged in a
       user-friendly, responsive manner
   
   Notes:
   ------
   - The layout is responsive and will adjust to different screen sizes
   - The first two plots are arranged side-by-side
   - The third plot spans the full width below the first two
   """
   # Generate the three visualization components
   monthly_activity_plot = create_monthly_activity_plot(exercises_df)
   distance_boxplot = create_distance_boxplot(exercises_df)
   speed_trend_plot = create_speed_trend_plot(exercises_df)
   
   # Create the first row with two plots side-by-side
   # Using stretch_width makes this responsive to window size
   row1 = pn.Row(
       monthly_activity_plot,     # Left plot: activity counts by month
       distance_boxplot,          # Right plot: distance distributions
       sizing_mode='stretch_width',  # Make row responsive
       align='center',               # Center-align the plots vertically
       styles={
           'padding': '20px',        # Add space around the row
           'justify-content': 'center'  # Center the plots horizontally
       }
   )
   
   # Create the second row with a single full-width plot
   row2 = pn.Row(
       speed_trend_plot,          # Bottom plot: speed trends over time
       sizing_mode='stretch_width',
       align='center',
       styles={
           'padding': '20px',
           'justify-content': 'center'
       }
   )
   
   # Combine the rows into a single layout with a heading
   plots_panel = pn.Column(
       # Add a section heading
       pn.pane.Markdown(
           '## Activity Visualizations', 
           sizing_mode='stretch_width',
           styles={
               'text-align': 'center',    # Center the heading
               'font-size': '24px',       # Large, readable font
               'margin-bottom': '20px'    # Space between heading and plots
           }
       ),
       row1,     # First row (monthly activity and distance box plots)
       row2,     # Second row (speed trends)
       sizing_mode='stretch_width',  # Make the entire section responsive
       align='center',               # Center contents vertically
       styles={
           'margin': '30px 0'        # Add vertical spacing around the section
       }
   )
   
   return plots_panel

### Activity Plots

The activity plots will go on the second tab (the activity-specific tab) of the dashboard. After each visualisation function below, there is a justification as to why the plot was chosen.

In [426]:
def create_route_map(exercise_data, activity_type):
    """
    Creates an interactive map visualization of an exercise route using Folium.
    
    This function takes GPS data from a workout and generates a map showing the route,
    start/end points, and distance markers at each kilometer.
    
    Parameters:
    -----------
    exercise_data : pandas.DataFrame
        DataFrame containing workout data with position_lat and position_long columns.
        These values are expected to be in the Garmin semicircles format.
    
    activity_type : str
        Type of activity ('Running' or 'Cycling'), used to determine route color.
    
    Returns:
    --------
    folium.Map or str
        Interactive Folium map object if position data exists, otherwise an error message.
    """
    # Filter out rows with missing GPS coordinates to ensure valid route plotting
    position_data = exercise_data[exercise_data['position_lat'].notna() & exercise_data['position_long'].notna()]
    
    # Return early if no GPS data is available
    if len(position_data) == 0:
        return "No position data available for this activity."
    
    # Convert from Garmin's semicircles format to standard GPS degrees
    # Garmin devices store coordinates in semicircles where 2^31 semicircles = 180 degrees
    lat_degrees = position_data['position_lat'] * (180 / 2**31)
    lon_degrees = position_data['position_long'] * (180 / 2**31)
    
    # Create a list of (latitude, longitude) coordinate pairs for the route
    coordinates = list(zip(lat_degrees, lon_degrees))
    
    # Initialize the map centered at the starting point with a reasonable zoom level
    m = folium.Map(location=coordinates[0], zoom_start=14)
    
    # Set route color based on activity type for visual distinction
    color = 'blue' if activity_type == 'Running' else 'orange'
    
    # Add the main route line to the map
    folium.PolyLine(
        coordinates, 
        color=color, 
        weight=4,           # Line thickness
        opacity=0.7         # Semi-transparent line
    ).add_to(m)
    
    # Add a marker at the starting point with a play icon
    folium.Marker(
        coordinates[0],
        icon=folium.Icon(color='green', icon='play', prefix='fa'),
        popup='Start'
    ).add_to(m)
    
    # Add a marker at the ending point with a stop icon
    folium.Marker(
        coordinates[-1],
        icon=folium.Icon(color='red', icon='stop', prefix='fa'),
        popup='End'
    ).add_to(m)
    
    # Calculate and add kilometer markers along the route
    cumulative_distance = 0
    prev_point = None
    
    for i, (lat, lon) in enumerate(coordinates):
        if prev_point:
            # Approximate the distance from the previous point with an
            # equirectangular projection, which is adequate for short segments.
            # 1 degree of latitude ≈ 111 km; a degree of longitude shrinks by cos(latitude)
            dlat = lat - prev_point[0]
            dlon = (lon - prev_point[1]) * np.cos(np.radians(lat))
            dist = np.sqrt(dlat**2 + dlon**2) * 111  # Convert degrees to kilometers
            cumulative_distance += dist
            
            # Add a marker at each whole kilometer
            # int() truncates to whole number, so this detects crossing kilometer boundaries
            if int(cumulative_distance) > int(cumulative_distance - dist):
                folium.CircleMarker(
                    [lat, lon],
                    radius=5,           # Size of the circle
                    color=color,        # Match route color
                    fill=True,          # Fill the circle
                    fill_color=color,   # Same color for fill
                    popup=f"{int(cumulative_distance)} km"  # Show distance in popup
                ).add_to(m)
        
        # Save current point to calculate distance in next iteration
        prev_point = (lat, lon)
    
    return m

#### Observation
The Strava activity map transforms exercise data into a geographical narrative. Situated in Ann Arbor, the map provides context to the exercise routes, showing the specific urban and natural landscapes traversed during workouts. This visualization moves beyond numerical data, connecting the abstract metrics to a concrete physical space. It demonstrates how exercise is not just about numbers, but about exploration, movement, and interaction with one's environment.
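
The coordinate handling behind the map can be sketched with the standard library alone. The semicircle conversion matches the one used in `create_route_map`; the haversine helper is an illustrative alternative for segment distances and is not part of the function above. The sample coordinates are taken from the `df.head()` output earlier in the notebook.

```python
import math

# 2**31 semicircles correspond to 180 degrees
SEMICIRCLE_TO_DEG = 180 / 2**31


def semicircles_to_degrees(value):
    """Convert a semicircle coordinate to decimal degrees."""
    return value * SEMICIRCLE_TO_DEG


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# First two recorded positions from the head of the dataset
lat_a = semicircles_to_degrees(504432050.0)
lon_a = semicircles_to_degrees(-999063637.0)
lat_b = semicircles_to_degrees(504432492.0)
lon_b = semicircles_to_degrees(-999064534.0)

segment_km = haversine_km(lat_a, lon_a, lat_b, lon_b)
print(lat_a, lon_a, segment_km)
```

The converted point lands at roughly 42.28° N, 83.74° W, consistent with Ann Arbor. At that latitude a degree of longitude covers noticeably less ground than a degree of latitude, which is why treating both as 111 km overstates east-west distances.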

In [427]:
def create_workout_time_series(exercise_data, selected_variables):
    """
    Create a time series plot for selected variables of a specific workout
    
    Parameters:
    -----------
    exercise_data : DataFrame
        The data for the selected exercise
    selected_variables : list
        List of variable names to plot
        
    Returns:
    --------
    HoloViews plot object
    """
    
    if exercise_data is None or len(exercise_data) == 0 or not selected_variables:
        # Return empty plot if no data or variables selected
        return pn.pane.HTML("<div style='height: 300px; display: flex; align-items: center; justify-content: center;'>"
                           "<h3>Select metrics to display time series</h3></div>")
    
    # Make sure timestamp is sorted
    exercise_data = exercise_data.sort_values('timestamp')
    
    # Normalize timestamps for x-axis (minutes from start)
    start_time = exercise_data['timestamp'].min()
    exercise_data = exercise_data.copy()
    exercise_data['minutes'] = (exercise_data['timestamp'] - start_time).dt.total_seconds() / 60
    
    # Create a base plot
    base_plot = None
    colors = ['blue', 'red', 'green', 'orange', 'purple', 'cyan', 'magenta']
    
    # Define nice labels for the metrics
    metric_labels = {
        'speed': 'Speed',
        'heart_rate': 'Heart Rate',
        'altitude': 'Altitude',
        'cadence': 'Cadence',
        'temperature': 'Temperature',
        'power': 'Power'
    }
    
    # Define y-axis units for each variable
    y_axis_units = {
        'speed': 'km/h',
        'heart_rate': 'bpm',
        'altitude': 'm',
        'cadence': 'spm',
        'temperature': '°C',
        'power': 'watts'
    }
    
    # Add each selected variable to the plot
    for i, var in enumerate(selected_variables):
        # Extract the variable name if it's a tuple or otherwise formatted
        if isinstance(var, tuple):
            var = var[1]  # Get the second item in the tuple (the value)
        elif ',' in str(var) and "'" in str(var):
            # It might be a string representation of a tuple
            var = var.split(",")[1].strip().strip("'").strip(")").strip("'")
            
        
        if var in exercise_data.columns:
            # Skip variables whose values are all NaN
            if exercise_data[var].isna().all():
                continue
                
            # Create plot for this variable
            color = colors[i % len(colors)]
            var_label = metric_labels.get(var, var.replace('_', ' ').title())
            units = y_axis_units.get(var, '')
            
            current_plot = exercise_data.hvplot.line(
                x='minutes',
                y=var,
                title='Workout Metrics Over Time',
                xlabel='Minutes from Start',
                ylabel='',
                label=f"{var_label} ({units})" if units else var_label,
                color=color,
                height=450,
                width=800,
                fontscale=1.2,
                legend='top_right',
                line_width=2
            )
            
            if base_plot is None:
                base_plot = current_plot
            else:
                base_plot = base_plot * current_plot
        else:
            print(f"Variable {var} not found in columns")
    
    # If no valid variables were plotted
    if base_plot is None:
        return pn.pane.HTML("<div style='height: 300px; display: flex; align-items: center; justify-content: center;'>"
                           "<h3>No data available for selected metrics</h3></div>")
    
    return base_plot

#### Observation
The workout metrics graph offers a multidimensional view of a single exercise session, showcasing how different metrics interact over time. The plot demonstrates variations in speed and altitude, providing a granular look at the complexities of a single workout. The dramatic changes in the graph – particularly the speed dips and altitude variations – illustrate the dynamic nature of exercise, revealing moments of intensity, potential rest periods, or changes in terrain.
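
The x-axis of this plot is elapsed minutes rather than wall-clock time, which keeps workouts recorded at different times of day comparable. The sketch below illustrates the same normalization with the standard library only; the notebook itself does this with pandas inside `create_workout_time_series`. The function name `minutes_from_start` is hypothetical, and the third timestamp is an illustrative value, not a row from the data.

```python
from datetime import datetime


def minutes_from_start(timestamps):
    """Sort timestamp strings and return minutes elapsed since the earliest one."""
    parsed = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    start = parsed[0]
    return [(t - start).total_seconds() / 60 for t in parsed]


# First two values appear in strava.csv; the last one is made up for illustration
sample = ["2019-07-08 21:04:03", "2019-07-08 21:04:07", "2019-07-08 21:05:33"]
elapsed = minutes_from_start(sample)
```

Sorting before subtracting mirrors the `sort_values('timestamp')` step in the function, so the first point always sits at zero minutes.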

## Dashboard Creation

In [None]:
def create_strava_dashboard(exercises_df):
    """
    Create a comprehensive Strava dashboard with both summary and activity detail views.
    
    Parameters:
    exercises_df (pandas.DataFrame): DataFrame containing exercise data
    
    Returns:
    panel.Tabs: A tabbed dashboard with summary and activity detail views
    """
    import pandas as pd
    import panel as pn
    
    # ----- SUMMARY DASHBOARD COMPONENTS -----
    
    # Calculate summary statistics
    total_activities = len(exercises_df)
    running_count = len(exercises_df[exercises_df['activity_type'] == 'Running'])
    cycling_count = len(exercises_df[exercises_df['activity_type'] == 'Cycling'])
    
    # Calculate total distance (excluding NaN values)
    total_distance = exercises_df['distance_km'].sum(skipna=True)
    
    # Calculate total duration
    total_duration_minutes = exercises_df['duration_minutes'].sum(skipna=True)
    hours = int(total_duration_minutes // 60)
    minutes = int(total_duration_minutes % 60)
    total_duration = f'{hours}h {minutes}m'
    
    # Estimate calories (basic calculation)
    # Rough estimate: 500 calories per hour for running, 400 for cycling
    running_minutes = exercises_df[exercises_df['activity_type'] == 'Running']['duration_minutes'].sum(skipna=True)
    cycling_minutes = exercises_df[exercises_df['activity_type'] == 'Cycling']['duration_minutes'].sum(skipna=True)
    running_calories = (running_minutes / 60) * 500
    cycling_calories = (cycling_minutes / 60) * 400
    total_calories = int(running_calories + cycling_calories)
    
    # Create Summary Cards
    total_activities_card = create_summary_card(
        'Total Activities',
        f'{total_activities}',
        f'Running: {running_count}, Cycling: {cycling_count}'
    )
    total_distance_card = create_summary_card('Total Distance', f'{total_distance:.2f} km')
    total_duration_card = create_summary_card('Total Duration', total_duration)
    total_calories_card = create_summary_card('Est. Calories Burned', f'{total_calories}')
    
    # Activity Type Filter
    summary_activity_type_filter = pn.widgets.RadioButtonGroup(
        name='Activity Type',
        options=['All', 'Running', 'Cycling'],
        value='All'
    )
    
    # Create a container for the table
    table_container = pn.Column(create_activity_table(exercises_df), sizing_mode='stretch_width')
    
    # Filtered Table Function
    def on_filter_change(event):
        filtered_df = exercises_df.copy()
        if event.new != 'All':
            filtered_df = filtered_df[filtered_df['activity_type'] == event.new]
        # Update the table in the container
        table_container[0] = create_activity_table(filtered_df)
    
    summary_activity_type_filter.param.watch(on_filter_change, 'value')
    
    # Create overview plots
    overview_plots = add_overview_plots(exercises_df)
    
    # Create a container for each card with fixed width
    card1 = pn.Column(total_activities_card, width=220)
    card2 = pn.Column(total_distance_card, width=220)
    card3 = pn.Column(total_duration_card, width=220)
    card4 = pn.Column(total_calories_card, width=220)
    
    # Build the Summary Dashboard Layout
    summary_dashboard = pn.Column(
        pn.pane.Markdown('# Professor Brooks\' Exercise Dashboard', sizing_mode='stretch_width'),
        pn.Row(
            card1, card2, card3, card4,
            sizing_mode='stretch_width',
            align='center'
        ),
        pn.pane.Markdown('## Activity Explorer', sizing_mode='stretch_width'),
        summary_activity_type_filter,
        table_container,
        overview_plots,  # Add the overview plots section
        sizing_mode='stretch_width'
    )
    
    # ----- ACTIVITY DETAIL DASHBOARD COMPONENTS -----
    
    # Create widgets
    detail_activity_type_selector = pn.widgets.RadioButtonGroup(
        options=['All', 'Running', 'Cycling'],
        value='All',
        name='Activity Type'
    )
    
    # Function to update the exercise selector based on activity type
    def update_exercise_options(activity_type):
        if activity_type == 'All':
            filtered_exercises = exercises_df
        else:
            filtered_exercises = exercises_df[exercises_df['activity_type'] == activity_type]
        
        # Format options for the selector
        options = {}
        for _, ex in filtered_exercises.iterrows():
            date_str = ex['start_time'].strftime('%Y-%m-%d')
            dist_str = f"{ex['distance_km']:.2f}" if pd.notna(ex['distance_km']) else "?"
            dur_str = f"{ex['duration_minutes']:.0f}" if pd.notna(ex['duration_minutes']) else "?"
            label = f"{date_str}: {dist_str} km, {dur_str} min"
            options[label] = ex['datafile']
        
        return options
    
    # Initial options
    exercise_options = update_exercise_options('All')
    exercise_selector = pn.widgets.Select(
        name='Select Exercise',
        options=exercise_options
    )
    
    # Create panels for the map 
    map_panel = pn.pane.HTML("""
    <div style="width: 100%; background-color: #f0f0f0; display: flex; align-items: center; justify-content: center;">
    <h2>Select an exercise to display the route map</h2>
    </div>
    """, width=800)
    
    stats_panel = pn.pane.HTML("""
    <div style="padding: 10px; background-color: #f8f8f8; border-radius: 5px;">
    <h3>Activity Details</h3>
    <p>Select an exercise to view details</p>
    </div>
    """, height=250)
    
    # Create the variable selector for the time series plot
    variable_selector = pn.widgets.CheckBoxGroup(
        name='Select Metrics to Plot',
        options=[],  # Will be populated when a workout is selected
        value=[],
        inline=True
    )
    
    # Create container for the time series plot
    time_series_container = pn.Column(
        pn.pane.HTML("<div></div>"),
        sizing_mode='stretch_width'
    )
    
    # We'll create a single HTML container that will hold both the map and time series
    combined_view = pn.pane.HTML("""
    <div id="combined-container" style="display: flex; flex-direction: column; width: 100%; padding: 0; margin: 0; border: none;">
        <div id="map-placeholder" style="width: 100%; min-height: 600px; padding: 0; margin: 0; border: none;">
            <!-- Map will be injected here -->
        </div>
        <div id="metrics-container" style="width: 100%; padding: 0; margin: 0; border: none;">
            <!-- Metrics and time series will be injected here -->
        </div>
    </div>
    """, width=800, height=900, sizing_mode='stretch_width')
    
    # Define metric labels and mapping dictionaries at a level accessible to all functions
    metric_labels = {
        'speed': 'Speed',
        'heart_rate': 'Heart Rate',
        'altitude': 'Altitude',
        'cadence': 'Cadence',
        'temperature': 'Temperature',
        'power': 'Power'
    }
    
    # Create mappings between labels and variable names
    labels_to_vars = {v: k for k, v in metric_labels.items()}
    
    # Function to create the map and update plots when an exercise is selected
    def create_map(event):
        if event.new:
            # Get the selected exercise
            selected_exercise = exercises_df[exercises_df['datafile'] == event.new].iloc[0]
            exercise_data = selected_exercise['data']
            
            # Create the map
            route_map = create_route_map(exercise_data, selected_exercise['activity_type'])
            
            # Render the map to HTML once and reuse it below
            map_html = route_map._repr_html_()
            
            # Keep the standalone map panel in sync (not part of the layout; kept for reference)
            map_panel.object = map_html
            
            # Update the stats panel
            start_time = selected_exercise['start_time'].strftime('%Y-%m-%d %H:%M:%S')
            distance = f"{selected_exercise['distance_km']:.2f} km" if pd.notna(selected_exercise['distance_km']) else "Unknown"
            duration = f"{selected_exercise['duration_minutes']:.1f} min" if pd.notna(selected_exercise['duration_minutes']) else "Unknown"
            speed = f"{selected_exercise['avg_speed_kmh']:.2f} km/h" if pd.notna(selected_exercise['avg_speed_kmh']) else "Unknown"
            hr_avg = f"{selected_exercise['hr_avg']:.0f} bpm" if pd.notna(selected_exercise['hr_avg']) else "Unknown"
            hr_max = f"{selected_exercise['hr_max']:.0f} bpm" if pd.notna(selected_exercise['hr_max']) else "Unknown"
            
            stats_panel.object = f"""
            <div style="padding: 10px;">
            <h3>Activity Details</h3>
            <p><strong>Date:</strong> {start_time}</p>
            <p><strong>Type:</strong> {selected_exercise['activity_type']}</p>
            <p><strong>Distance:</strong> {distance}</p>
            <p><strong>Duration:</strong> {duration}</p>
            <p><strong>Avg Speed:</strong> {speed}</p>
            <p><strong>Avg HR:</strong> {hr_avg}</p>
            <p><strong>Max HR:</strong> {hr_max}</p>
            </div>
            """
            
            # Update available metrics based on what's present in this workout
            available_vars = []
            
            # Create options list with just the display labels
            option_list = []
            for var in ['speed', 'heart_rate', 'altitude', 'cadence', 'temperature', 'power']:
                if var in exercise_data.columns and not exercise_data[var].isna().all():
                    available_vars.append(var)
                    option_list.append(metric_labels[var])  # Just append the display label
            
            # Update the variable selector with display labels only
            variable_selector.options = option_list
            
            # Don't auto-select anything
            variable_selector.value = []
            
            # Initialize with an empty message
            time_series_container[0] = pn.pane.HTML("<div style='height: 300px; display: flex; align-items: center; justify-content: center;'>"
                            "<h3>Select metrics to display time series</h3></div>")
            
            # Update the combined view with both map and time series
            combined_view.object = f"""
            <div id="combined-container" style="display: flex; flex-direction: column; width: 100%; padding: 0; margin: 0; border: none;">
                <div id="map-container" style="width: 100%; padding: 0; margin: 0; border: none;">
                    {map_html}
                </div>
                <div id="metrics-title" style="width: 100%; padding: 30px 0 0 0; margin: 0; text-align: center;">
                    <h2>Workout Metrics</h2>
                </div>
                <div id="metrics-controls" style="width: 100%; padding: 0; margin: 0;">
                    <!-- Variable selector will be shown separately -->
                </div>
                <div id="metrics-plot" style="width: 100%; padding: 0; margin: 0;">
                    <!-- Time series will be injected elsewhere -->
                </div>
            </div>
            """
    
    # Function to update time series when variables are changed
    def update_time_series(exercise_data=None, selected_vars=None):
        if exercise_data is None:
            # No exercise selected yet, keep default message
            return
        
        # If no selected vars, use the current value from the selector
        if selected_vars is None:
            selected_vars = variable_selector.value
        
        # Create the time series plot
        time_series_plot = create_workout_time_series(exercise_data, selected_vars)
        
        # Update the container
        time_series_container[0] = time_series_plot
    
    # Connect the exercise selector to the map update function
    exercise_selector.param.watch(create_map, 'value')
    
    # Connect the variable selector to update the time series plot
    def on_variable_change(event):
        # Only update if an exercise is selected and variables are selected
        if exercise_selector.value:
            if event.new:  # If there are selected variables
                selected_exercise = exercises_df[exercises_df['datafile'] == exercise_selector.value].iloc[0]
                # Convert the display labels back to variable names
                selected_vars = [labels_to_vars[label] for label in event.new]
                update_time_series(selected_exercise['data'], selected_vars)
            else:  # If all variables are deselected
                time_series_container[0] = pn.pane.HTML("<div style='height: 300px; display: flex; align-items: center; justify-content: center;'>"
                            "<h3>Select metrics to display time series</h3></div>")
    
    variable_selector.param.watch(on_variable_change, 'value')
    
    # Update exercise selector when activity type changes
    def update_selector(event):
        exercise_selector.options = update_exercise_options(event.new)
    
    detail_activity_type_selector.param.watch(update_selector, 'value')
    
    # Build Activity Detail Dashboard
    controls_and_stats = pn.Column(
        pn.pane.Markdown("# Strava Activity Dashboard"),
        pn.pane.Markdown("## Filter and select an activity"),
        detail_activity_type_selector,
        exercise_selector,
        pn.layout.Divider(),
        stats_panel,
        width=350  # Fixed width for the left column
    )
    
    time_series_section = pn.Column(
        variable_selector,
        pn.Row(  # Row for horizontal centering
            pn.Spacer(),
            time_series_container,
            pn.Spacer(),
            sizing_mode='stretch_width'
        ),
        sizing_mode='stretch_width',
        margin=(0,0,0,0)
    )
    
    # Use a simpler layout approach
    # Put the combined view on the right and the original time series below it
    activity_detail_dashboard = pn.Row(
        # Left side controls
        controls_and_stats,
        
        # Right side - using the combined HTML view
        pn.Column(
            combined_view,  # This has both map and title
            time_series_section,  # This has the selector and plot
            sizing_mode='stretch_width'
        ),
        
        sizing_mode='stretch_width'
    )
    
    # ----- CREATE THE COMBINED TABBED DASHBOARD -----
    
    # Create a tabbed dashboard
    tabs = pn.Tabs(
        ('Summary Dashboard', summary_dashboard),
        ('Activity Details', activity_detail_dashboard)
    )
    
    return tabs
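
The summary cards rest on two small pieces of arithmetic: formatting total minutes as hours and minutes, and a rough calorie estimate. The sketch below isolates both so they can be checked on their own. The per-hour rates (500 for running, 400 for cycling) are the rough constants used in `create_strava_dashboard`, not measured values; the function names `format_duration` and `estimate_calories` and the sample inputs are hypothetical.

```python
def format_duration(total_minutes):
    """Format a duration in minutes as 'Hh Mm', truncating fractional minutes."""
    hours = int(total_minutes // 60)
    minutes = int(total_minutes % 60)
    return f"{hours}h {minutes}m"


def estimate_calories(running_minutes, cycling_minutes, running_rate=500, cycling_rate=400):
    """Rough calorie estimate from per-hour rates (assumed constants, not measurements)."""
    running_calories = (running_minutes / 60) * running_rate
    cycling_calories = (cycling_minutes / 60) * cycling_rate
    return int(running_calories + cycling_calories)


# Illustrative inputs only
duration_label = format_duration(135.5)
calories = estimate_calories(running_minutes=90, cycling_minutes=60)
```

Because the rates ignore body weight, pace, and terrain, the result is best read as an order-of-magnitude figure, which is why the card is labelled "Est. Calories Burned".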

In [430]:
# Create the combined dashboard
dashboard = create_strava_dashboard(exercise_df)

# Display the dashboard
dashboard




#### Observations

The dashboard is more than a collection of visualizations – it's a comprehensive narrative of Professor Brooks' fitness journey during the summer of 2019. By carefully curating metrics, designing intuitive layouts, and ensuring visual coherence, the dashboard transforms raw data into a meaningful story. It speaks not just to the numbers, but to the human experience of athletic progression, commitment, and personal growth. The design philosophy draws deeply from established scientific visualization principles, ensuring that each visual element is not just aesthetically pleasing, but informationally rich. The dashboard becomes a tool for reflection, allowing Professor Brooks to see his exercise journey from multiple perspectives, understanding not just what happened, but the deeper patterns and potential insights.
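
One design detail behind the Activity Details tab is that the checkbox group shows display labels ("Heart Rate") while the plotting function expects column names (`heart_rate`). The sketch below reproduces that round trip without Panel, using the same `metric_labels` dictionary as the dashboard. The helper names `available_labels` and `to_variables` are hypothetical.

```python
metric_labels = {
    'speed': 'Speed',
    'heart_rate': 'Heart Rate',
    'altitude': 'Altitude',
    'cadence': 'Cadence',
    'temperature': 'Temperature',
    'power': 'Power',
}

# Inverting the mapping is only safe because every display label is unique
labels_to_vars = {label: var for var, label in metric_labels.items()}


def available_labels(columns_with_data):
    """Display labels for the metrics that actually have data in a workout."""
    return [metric_labels[var] for var in metric_labels if var in columns_with_data]


def to_variables(selected_labels):
    """Translate the labels chosen in the widget back to column names."""
    return [labels_to_vars[label] for label in selected_labels]


options = available_labels({'speed', 'heart_rate', 'altitude'})
selected = to_variables(['Heart Rate', 'Altitude'])
```

Keeping the translation in one dictionary means a new metric only has to be added in one place for both the widget options and the plot to pick it up.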

Adherence to Rule et al.'s Ten Principles of Computational Analyses
- Rule 1 - Reproducibility: in this exercise data analysis, reproducibility is achieved through meticulous documentation of every data processing step, from initial data import to final visualization. Each transformation is explicitly coded and explained, allowing another researcher to exactly recreate the analysis, understand the methodological choices, and verify the resulting insights about Professor Brooks' exercise journey.
- Rule 2 - Documentation: comprehensive documentation transforms raw code into a narrative of scientific discovery. The notebook goes beyond mere technical instructions, providing rich context about data sources, preprocessing decisions, and analytical rationale. Markdown cells explain the purpose behind each visualization, the challenges of working with multi-device sensor data, and the insights derived from each analytical step, creating a transparent and comprehensible research document.
- Rule 3 - Project Organization: effective computational research requires a clear, logical structure that guides the reader through the analytical process. The notebook is carefully organized into distinct, logical sections: data import, preprocessing, visualization functions, and dashboard creation. This structured approach allows for easy navigation, understanding of the analytical workflow, and potential future modifications or extensions of the research.
- Rule 4 - Data Management: managing complex, real-world data requires careful, thoughtful approaches to cleaning, transforming, and interpreting information. The exercise dataset, described as "authentically messy and unclear", is transformed through sophisticated preprocessing techniques. Multiple device inputs are reconciled, inconsistent measurements are standardized, and potential data quality issues are explicitly addressed, turning raw sensor data into meaningful insights.
- Rule 5 - Collaborative Sharing: scientific work should be accessible and meaningful to both technical experts and broader audiences. The dashboard and accompanying analysis bridge technical complexity and narrative clarity, presenting exercise data in ways that are both analytically rigorous and intuitively understandable. Visualizations are designed to tell a story, not just display numbers, making the research accessible to Professor Brooks and other stakeholders.
- Rule 10 - Dissemination: the ultimate goal of computational research is to communicate insights effectively. The dashboard goes beyond mere data presentation, creating an interactive, narrative-driven exploration of Professor Brooks' exercise journey. By combining technical rigor with storytelling, the analysis transforms raw sensor data into a meaningful, accessible narrative about personal fitness, performance, and commitment.