# NYC Marathon Insights Dash App #

This Dash app provides an advanced data analysis and visualization platform for NYC Marathon Results. The project involves data cleaning, feature engineering, outlier detection, and a user-friendly interface for visual exploration. The main features include:
1. Data Cleaning and Preprocessing:
    - Conversion of time-related columns (`overallTime`, `pace`, `ageGradeTime`) into seconds for better numerical analysis.
    - Creation of new features like `pace_to_age_ratio`, `speed` (meters per second), and `gender_age_rank` to provide deeper insights.
    - Binning ages into defined age groups for categorical analysis.
2. Outlier Detection and Removal:
    - Removal of extreme outliers from the `pace_sec` column: rows lying more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile are dropped.
3. Interactive Dash App:
    - A dropdown menu allows users to select various visualizations:
        - Violin Plot: Distribution of pace by age group and gender.
        - Scatter Plot: Relationship between overall time and age.
        - Histogram: Distribution of pace values.
        - Scatter Plot: Speed vs. Age.
        - Bar Plot: Average speed by age group.
    - Users can save any generated visualization as an HTML file by clicking a dedicated button.
4. Code Enhancements and User Guidance:
    - Extensive inline comments explain each step of the process for clarity and learning purposes.
    - Configured for offline use with customizable data paths.

This app is ideal for understanding marathon performance patterns and extracting actionable insights from the NYC Marathon dataset.

### 1. Import libraries and Load the dataset ###

In [1]:
import dash
from dash import dcc, html, Input, Output
import plotly.express as px
import pandas as pd
import numpy as np
import plotly.io as pio

In [2]:
data = pd.read_csv("NYC Marathon Results, 2024 - Marathon Runner Results.csv")
data

Unnamed: 0,runnerId,firstName,bib,age,gender,city,countryCode,stateProvince,iaaf,overallPlace,overallTime,pace,genderPlace,ageGradeTime,ageGradePlace,ageGradePercent,racesCount
0,41771195,Abdi,7.0,35,M,Nijmegen,NLD,,NED,1,2:07:39,4:53,1,6:57,1,96.86,4
1,41775746,Evans,3.0,35,M,Kapsabet,KEN,-,KEN,2,2:07:45,4:53,2,7:03,2,96.79,2
2,41766254,Albert,2.0,30,M,Kapkitony,KEN,,KEN,3,2:08:00,4:53,3,8:00,3,96.06,5
3,41763160,Tamirat,1.0,33,M,Addis Ababa,ETH,,ETH,4,2:08:12,4:54,4,8:02,4,96.03,4
4,41757406,Geoffrey,6.0,31,M,Kapchorwa District,KEN,-,KEN,5,2:08:50,4:55,5,8:50,6,95.44,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55519,41802120,Lay,66580.0,73,W,Auckland,NZL,Auckland,GBR,55520,11:23:55,26:06:00,24707,12:47,24396,31.29,1
55520,41753795,Guillermo,12699.0,71,M,Garza-Garcia,MEX,Nuevo Leon,MEX,55521,11:38:21,26:39:00,30695,28:49:00,30674,24.16,37
55521,41751479,Thomas,12713.0,82,M,Randolph,USA,NJ,USA,55522,11:42:21,26:48:00,30696,56:59:00,30501,29.49,35
55522,41770749,Jill,14443.0,43,W,Chicago,USA,IL,USA,55523,11:43:07,26:50:00,24708,12:19,24709,20.14,1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55524 entries, 0 to 55523
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   runnerId         55524 non-null  int64  
 1   firstName        55524 non-null  object 
 2   bib              55512 non-null  float64
 3   age              55524 non-null  int64  
 4   gender           55512 non-null  object 
 5   city             55462 non-null  object 
 6   countryCode      55512 non-null  object 
 7   stateProvince    55231 non-null  object 
 8   iaaf             55511 non-null  object 
 9   overallPlace     55524 non-null  int64  
 10  overallTime      55524 non-null  object 
 11  pace             55524 non-null  object 
 12  genderPlace      55524 non-null  int64  
 13  ageGradeTime     55524 non-null  object 
 14  ageGradePlace    55524 non-null  int64  
 15  ageGradePercent  55524 non-null  float64
 16  racesCount       55524 non-null  int64  
dtypes: float64(2

### 2. Data cleaning ###

Convert time columns `overallTime`, `pace`, `ageGradeTime` to seconds for analysis

In [4]:
# Function to convert time strings (MM:SS or HH:MM:SS) into seconds
def time_to_seconds(time_str):
    try:
        parts = list(map(int, time_str.split(':')))
    except (AttributeError, ValueError):
        # Non-string input (e.g. NaN) or non-numeric parts
        return None

    # for format MM:SS
    if len(parts) == 2:
        return parts[0] * 60 + parts[1]

    # for format HH:MM:SS
    if len(parts) == 3:
        return parts[0] * 3600 + parts[1] * 60 + parts[2]

    # Any other format is treated as unparseable
    return None
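
As a quick check of the converter, the sketch below redefines it in a self-contained form and runs it on two values taken from the table above. The two malformed inputs (`"n/a"` and `None`) are made up for illustration, to show the `None` fallback.

```python
def time_to_seconds(time_str):
    """Convert 'MM:SS' or 'HH:MM:SS' to seconds; return None if unparseable."""
    try:
        parts = [int(p) for p in time_str.split(':')]
    except (AttributeError, ValueError):
        # Non-string input or non-numeric fields
        return None
    if len(parts) == 2:
        return parts[0] * 60 + parts[1]
    if len(parts) == 3:
        return parts[0] * 3600 + parts[1] * 60 + parts[2]
    return None

# First two values come from the results table; the last two are hypothetical
samples = ["2:07:39", "4:53", "n/a", None]
converted = [time_to_seconds(s) for s in samples]
print(converted)  # [7659, 293, None, None]
```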

Apply the time conversion function to relevant columns

In [5]:
# Convert overall time to seconds
data['overallTime_sec'] = data['overallTime'].apply(time_to_seconds)

# Convert pace to seconds per mile
data['pace_sec'] = data['pace'].apply(time_to_seconds)

# Convert age-graded time to seconds
data['ageGradeTime_sec'] = data['ageGradeTime'].apply(time_to_seconds)

In [6]:
data.head()

Unnamed: 0,runnerId,firstName,bib,age,gender,city,countryCode,stateProvince,iaaf,overallPlace,overallTime,pace,genderPlace,ageGradeTime,ageGradePlace,ageGradePercent,racesCount,overallTime_sec,pace_sec,ageGradeTime_sec
0,41771195,Abdi,7.0,35,M,Nijmegen,NLD,,NED,1,2:07:39,4:53,1,6:57,1,96.86,4,7659,293,417
1,41775746,Evans,3.0,35,M,Kapsabet,KEN,-,KEN,2,2:07:45,4:53,2,7:03,2,96.79,2,7665,293,423
2,41766254,Albert,2.0,30,M,Kapkitony,KEN,,KEN,3,2:08:00,4:53,3,8:00,3,96.06,5,7680,293,480
3,41763160,Tamirat,1.0,33,M,Addis Ababa,ETH,,ETH,4,2:08:12,4:54,4,8:02,4,96.03,4,7692,294,482
4,41757406,Geoffrey,6.0,31,M,Kapchorwa District,KEN,-,KEN,5,2:08:50,4:55,5,8:50,6,95.44,5,7730,295,530
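
One caveat is visible at the bottom of the full table shown earlier: the slowest runners have `pace` values such as `26:06:00`. Dividing the matching `overallTime` of `11:23:55` by 26.2 miles gives roughly 26 minutes per mile, so these strings appear to be `MM:SS` values with a spurious `:00` suffix, and the generic converter reads them as 26 hours (93,960 seconds). The sketch below is one possible workaround and is not part of the original notebook. `pace_to_seconds` is a hypothetical helper, and it assumes no pace in this dataset reaches one hour per mile.

```python
def pace_to_seconds(pace_str):
    """Parse a pace string as minutes:seconds per mile.

    Assumption: a three-part value such as '26:06:00' is really MM:SS with a
    trailing ':00', because a pace of 26 hours per mile is implausible.
    """
    try:
        parts = [int(p) for p in pace_str.split(':')]
    except (AttributeError, ValueError):
        # Non-string input or non-numeric fields
        return None
    if len(parts) in (2, 3):
        # Use only the first two fields, as minutes and seconds
        return parts[0] * 60 + parts[1]
    return None

print(pace_to_seconds("4:53"))      # 293
print(pace_to_seconds("26:06:00"))  # 1566
```

If the generic converter is kept instead, the mis-parsed rows end up with very large `pace_sec` values, which are likely among the rows the IQR filter in the next step removes.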


Outlier Detection and Removal for `pace_sec`

In [7]:
# Calculate the interquartile range (IQR)
# 25th percentile
q1 = data['pace_sec'].quantile(0.25)
# 75th percentile
q3 = data['pace_sec'].quantile(0.75)
iqr = q3 - q1

# Define the lower and upper bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

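# Filter out outliers; .copy() makes the result an independent DataFrame,
# so later column assignments do not trigger SettingWithCopyWarning
data = data[(data['pace_sec'] >= lower_bound) & (data['pace_sec'] <= upper_bound)].copy()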
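
To make the IQR rule concrete, here is a small sketch on made-up pace values (not taken from the dataset). The last value mimics a pace that was parsed as hours instead of minutes.

```python
import pandas as pd

# Toy pace values in seconds per mile; the last one mimics a mis-parsed entry
pace = pd.Series([400, 480, 500, 520, 540, 560, 600, 93960])

q1 = pace.quantile(0.25)  # 495.0
q3 = pace.quantile(0.75)  # 570.0
iqr = q3 - q1             # 75.0

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = pace[(pace >= lower_bound) & (pace <= upper_bound)]
print(lower_bound, upper_bound)  # 382.5 682.5
print(kept.tolist())             # [400, 480, 500, 520, 540, 560, 600]
```

Note that the filter is two-sided: unusually fast paces below `lower_bound` would be removed just like unusually slow ones.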

Create age groups for analysis

In [8]:
bins = [10, 20, 30, 40, 50, 60, 70, 80, 100]
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-100']
data.loc[:, 'age_group'] = pd.cut(data['age'], bins=bins, labels=labels, right=False)
data.head()


Unnamed: 0,runnerId,firstName,bib,age,gender,city,countryCode,stateProvince,iaaf,overallPlace,...,pace,genderPlace,ageGradeTime,ageGradePlace,ageGradePercent,racesCount,overallTime_sec,pace_sec,ageGradeTime_sec,age_group
0,41771195,Abdi,7.0,35,M,Nijmegen,NLD,,NED,1,...,4:53,1,6:57,1,96.86,4,7659,293,417,30-40
1,41775746,Evans,3.0,35,M,Kapsabet,KEN,-,KEN,2,...,4:53,2,7:03,2,96.79,2,7665,293,423,30-40
2,41766254,Albert,2.0,30,M,Kapkitony,KEN,,KEN,3,...,4:53,3,8:00,3,96.06,5,7680,293,480,30-40
3,41763160,Tamirat,1.0,33,M,Addis Ababa,ETH,,ETH,4,...,4:54,4,8:02,4,96.03,4,7692,294,482,30-40
4,41757406,Geoffrey,6.0,31,M,Kapchorwa District,KEN,-,KEN,5,...,4:55,5,8:50,6,95.44,5,7730,295,530,30-40
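
Because the binning uses `right=False`, each bin includes its left edge and excludes its right edge, so a 30-year-old lands in `30-40` (as in row 2 above). The sketch below applies the same bins and labels to a few made-up ages to show the boundary behavior, including an age of exactly 100, which falls outside the last bin.

```python
import pandas as pd

bins = [10, 20, 30, 40, 50, 60, 70, 80, 100]
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-100']

# Hypothetical ages chosen to sit on or near bin edges
ages = pd.Series([19, 20, 35, 99, 100])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

for age, group in zip(ages, groups):
    print(age, group)
# 19 -> '10-20', 20 -> '20-30', 35 -> '30-40', 99 -> '80-100', 100 -> NaN
```

Ages outside the half-open range from 10 (inclusive) to 100 (exclusive) become `NaN` rather than raising an error, so they drop out of any later group-by on `age_group`.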


Create new features for advanced analysis

In [9]:
# Calculate the ratio of pace (in seconds) to age
data.loc[:, 'pace_to_age_ratio'] = data['pace_sec'] / data['age']

# Calculate speed in meters per second (1 mile = 1609.34 meters)
data.loc[:, 'speed'] = 1609.34 / data['pace_sec']

# Rank runners by gender within each age group based on overall time
data.loc[:, 'gender_age_rank'] = data.groupby(['age_group', 'gender'], observed=False)['overallTime_sec'].rank()

data.head()

Unnamed: 0,runnerId,firstName,bib,age,gender,city,countryCode,stateProvince,iaaf,overallPlace,...,ageGradePlace,ageGradePercent,racesCount,overallTime_sec,pace_sec,ageGradeTime_sec,age_group,pace_to_age_ratio,speed,gender_age_rank
0,41771195,Abdi,7.0,35,M,Nijmegen,NLD,,NED,1,...,1,96.86,4,7659,293,417,30-40,8.371429,5.492628,1.0
1,41775746,Evans,3.0,35,M,Kapsabet,KEN,-,KEN,2,...,2,96.79,2,7665,293,423,30-40,8.371429,5.492628,2.0
2,41766254,Albert,2.0,30,M,Kapkitony,KEN,,KEN,3,...,3,96.06,5,7680,293,480,30-40,9.766667,5.492628,3.0
3,41763160,Tamirat,1.0,33,M,Addis Ababa,ETH,,ETH,4,...,4,96.03,4,7692,294,482,30-40,8.909091,5.473946,4.0
4,41757406,Geoffrey,6.0,31,M,Kapchorwa District,KEN,-,KEN,5,...,6,95.44,5,7730,295,530,30-40,9.516129,5.45539,5.0
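
The following sketch reproduces the `speed` and `gender_age_rank` calculations on a tiny made-up frame. The first row reuses the pace and time of the winner from the table above; the remaining rows are illustrative and include a tie, to show that `rank()` assigns tied runners the average of their positions by default.

```python
import pandas as pd

# First row matches the table above; the other rows are hypothetical
df = pd.DataFrame({
    'age_group': ['30-40', '30-40', '30-40', '30-40'],
    'gender': ['M', 'M', 'M', 'W'],
    'pace_sec': [293, 300, 300, 330],
    'overallTime_sec': [7659, 7860, 7860, 8640],
})

# Speed in meters per second (1 mile = 1609.34 meters)
df['speed'] = 1609.34 / df['pace_sec']

# Rank within each (age_group, gender); ties share the average rank by default
df['gender_age_rank'] = (
    df.groupby(['age_group', 'gender'])['overallTime_sec'].rank()
)

print(round(df['speed'].iloc[0], 6))   # 5.492628
print(df['gender_age_rank'].tolist())  # [1.0, 2.5, 2.5, 1.0]
```

If whole-number ranks are preferred, `rank(method='min')` is one alternative; that is a design choice, not something the original notebook does.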


### 3. Dash app ###

Initialize Dash app

In [10]:
app = dash.Dash(__name__)

Define the layout of the Dash app

In [11]:
app.layout = html.Div([
    html.H1("NYC Marathon Advanced Analysis", style={'textAlign': 'center'}),
    
    # Dropdown for selecting the type of visualization
    html.Div([
        html.Label('Select Visualization:'),
        dcc.Dropdown(
            id='viz-type',
            options = [
                {'label': 'Violin Plot (Pace by Age Group)', 'value': 'violin'},
                {'label': 'Scatter Plot (Overall Time vs Age)', 'value': 'scatter'},
                {'label': 'Histogram (Pace Distribution)', 'value': 'histogram'},
                {'label': 'Scatter Plot (Speed vs Age)', 'value': 'speed_scatter'},
                {'label': 'Bar Plot (Average Speed by Age Group)', 'value': 'bar_speed'}
            ],
            value='violin'  # Default visualization 
        )
    ]),
    # Graph output placeholder
    dcc.Graph(id='graph-output'),

    # Button to save the current graph as HTML
    html.Button('Save Graph as HTML', id='save-button', n_clicks=0),
    html.Div(id='save-confirmation', style={'marginTop': '10px'})
])

# Callback to update the graph based on selected visualization type
@app.callback(
    Output('graph-output', 'figure'),
    [Input('viz-type', 'value')]
)
def update_graph(viz_type):
    # Violin plot: Analyze pace distribution by age group and gender
    if viz_type == 'violin':
        fig = px.violin(
            data,
            y='pace_sec',
            x='age_group',
            color='gender',
            box=True,  # Include box plot overlay
            points="all",  # Show all data points
            hover_data=['firstName', 'city', 'countryCode'],
            title='Pace Distribution by Age Group and Gender',
            labels={'pace_sec': 'Pace (seconds per mile)', 'age_group': 'Age Group'}
        )
        
    # Scatter plot: Relationship between overall time and age
    elif viz_type == 'scatter':
        fig = px.scatter(
            data,
            x='age',
            y='overallTime_sec',
            color='gender',
            title='Overall Time vs Age',
            labels={'overallTime_sec': 'Overall Time (seconds)', 'age': 'Age'},
            hover_data=['firstName', 'city', 'countryCode']
        )
    # Histogram: Distribution of pace for all runners
    elif viz_type == 'histogram':
        fig = px.histogram(
            data,
            x='pace_sec',
            nbins=50,
            color='gender',
            title='Pace Distribution',
            labels={'pace_sec': 'Pace (seconds per mile)'}
        )
    # Scatter plot: Speed vs Age
    elif viz_type == 'speed_scatter':
        fig = px.scatter(
            data,
            x='age',
            y='speed',
            color='gender',
            title='Speed vs Age',
            labels={'speed': 'Speed (m/s)', 'age': 'Age'},
            hover_data=['firstName', 'city', 'countryCode']
        )
    # Bar plot: Average speed by age group
    elif viz_type == 'bar_speed':
        # observed=False keeps every age-group category and matches the earlier groupby
        avg_speed = data.groupby('age_group', observed=False)['speed'].mean().reset_index()
        fig = px.bar(
            avg_speed,
            x='age_group',
            y='speed',
            title='Average Speed by Age Group',
            labels={'speed': 'Average Speed (m/s)', 'age_group': 'Age Group'},
        )
    else:
        # Empty scatter plot as fallback
        fig = px.scatter()

    # Apply a consistent visual style
    fig.update_layout(template="plotly_white")

    return fig

# Callback to save the current graph as an HTML file
@app.callback(
    Output('save-confirmation', 'children'),
    [Input('save-button', 'n_clicks')],
    # State (not Input): read the current figure without re-running this
    # callback, and re-saving the file, every time the graph changes
    [dash.State('graph-output', 'figure')]
)
def save_graph_as_html(n_clicks, figure):
    if n_clicks > 0:
        # Save the current figure to an HTML file
        pio.write_html(figure, file='graph_output.html', auto_open=False)
        return 'Graph saved as graph_output.html.'
    return ''
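
The `bar_speed` branch of `update_graph` relies on a group-by over the categorical `age_group` column. The sketch below, on made-up speeds, shows what that aggregation returns when a category has no runners: with `observed=False` the empty group is kept and its mean is `NaN`, so the bar chart would show a gap for it.

```python
import pandas as pd

labels = ['10-20', '20-30', '30-40']

# Hypothetical data: nobody falls in the '10-20' group
df = pd.DataFrame({
    'age_group': pd.Categorical(['20-30', '20-30', '30-40'], categories=labels),
    'speed': [3.0, 4.0, 2.5],
})

avg_speed = (
    df.groupby('age_group', observed=False)['speed'].mean().reset_index()
)
print(avg_speed)
# Three rows: '10-20' with NaN, '20-30' with 3.5, '30-40' with 2.5
```

Passing `observed=True` instead would drop the empty group from the result entirely.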

Run the Dash app locally

In [12]:
if __name__ == '__main__':
    # app.run replaces the deprecated app.run_server in recent Dash releases
    app.run(debug=True)  # Set debug=True for development mode