# Introduction

## Exploratory data analysis and visualization of World Marathon Majors dataset using Pandas and Altair Viz.

This Jupyter notebook performs exploratory data analysis on the World Marathon Majors dataset from Kaggle (https://www.kaggle.com/emmanuelleai/world-marathons-majors).

## Motivation

Instead of the more popular visualization libraries such as Matplotlib, Seaborn or Plotly, I decided to use Altair for the exploratory charts in this notebook.

There are many kernels using Plotly and other libraries, but I rarely see Altair used, even for small datasets. So I took that thought and created this notebook. Hope you will like it 😊

## Import libraries

In [None]:
!pip install pandas
!pip install altair

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
from datetime import datetime

import altair as alt
import pandas as pd

## Load data

In [None]:
df = pd.read_csv('/kaggle/input/world-marathons-majors/world_marathon_majors.csv', encoding='ISO-8859-1')
df.head()

## Dataframe size

In [None]:
df.shape

## Datatype checks

In [None]:
df.info()

## NA checks

In [None]:
df.isna().sum()

## Converting total time from hh:mm:ss to minutes (object -> float)

In [None]:
def str_to_time(x):
    """
    hh:mm:ss => total minutes
    e.g. 2:05:30 => 125.5
    """
    t = datetime.strptime(x, "%H:%M:%S")
    return t.hour * 60 + t.minute + (t.second / 60)


# test
str_to_time("02:05:30")

In [None]:
df['total_minutes'] = df['time'].apply(str_to_time)
df.head()

In [None]:
df.total_minutes.dtype
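
As an alternative to applying `str_to_time` row by row, the same conversion can be sketched in vectorized form with `pd.to_timedelta`. The values below are made up for illustration and assume zero-padded `hh:mm:ss` strings; `sample_times` and `sample_minutes` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical finishing times in hh:mm:ss format (not taken from the dataset)
sample_times = pd.Series(["02:05:30", "02:10:00", "02:25:15"])

# Parse the strings as durations, then convert total seconds to minutes
sample_minutes = pd.to_timedelta(sample_times).dt.total_seconds() / 60
print(sample_minutes.tolist())
```

On the real dataframe this would likely read `pd.to_timedelta(df['time']).dt.total_seconds() / 60`, assuming every value in `time` parses as a duration.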

## EDA (Exploratory Data Analysis) and Visualizations

### 1. Total marathons held at each city

In [None]:
total_marathons = pd.DataFrame(df.groupby("marathon")["year"].agg("count")).reset_index()
total_marathons.columns = ['City', 'Total marathons']
total_marathons

In [None]:
alt.Chart(
    total_marathons, title="Number of marathons held at each City"
).mark_bar().encode(
    x="City:O", y="Total marathons:Q", tooltip=["City", "Total marathons"],
).properties(
    width=550, height=300
).configure_axis(
    labelFontSize=15, titleFontSize=18, labelAngle=0
).configure_title(
    fontSize=21
)

**Boston accounts for the largest number of winner records in the dataset.**
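
To check how large Boston's share actually is, the counts can be turned into proportions. This is a minimal sketch on made-up rows; `toy` and its values are hypothetical and only mimic the `marathon` column of the dataset.

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's `marathon` column
toy = pd.DataFrame(
    {"marathon": ["Boston", "Boston", "Boston", "London", "Tokyo", "Berlin"]}
)

# normalize=True returns each city's fraction of all rows instead of raw counts
shares = toy["marathon"].value_counts(normalize=True)
print(shares)

# A city holds a true majority only if its share exceeds 0.5
top_city = shares.idxmax()
has_majority = bool(shares.max() > 0.5)
print(top_city, has_majority)
```

Running the same two lines on `df["marathon"]` would show whether the top city holds more than half of the records or only the largest share.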

### 2. Total wins by Country

In [None]:
country_counts = pd.DataFrame(df.groupby("country")["year"].agg("count")).reset_index()
country_counts.columns = ["Country", "Wins"]
country_counts.head()

**Drag the box-select tool across the bars to see the mean of the selected countries.**

**While the average number of wins across all countries is ~15, the top three winning nations are Kenya, the United States and Ethiopia.**
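
The mean and the top three can also be read directly from the table instead of the chart. The sketch below uses placeholder country labels and invented win counts; `toy_counts` is a hypothetical stand-in for `country_counts`.

```python
import pandas as pd

# Hypothetical win counts with placeholder country labels (not real figures)
toy_counts = pd.DataFrame(
    {"Country": ["A", "B", "C", "D", "E"], "Wins": [40, 30, 20, 6, 4]}
)

# Average number of wins per country
mean_wins = toy_counts["Wins"].mean()

# The three rows with the highest win counts, in descending order
top_three = toy_counts.nlargest(3, "Wins")["Country"].tolist()
print(mean_wins, top_three)
```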

In [None]:
# The mean line updates when you use the box-select tool
max_wins = country_counts.Wins.max()

brush = alt.selection(type="interval", encodings=["x"])

bar_counts = (
    alt.Chart()
    .mark_bar()
    .encode(
        x="Country:O",
        y="Wins:Q",
        opacity=alt.condition(brush, alt.OpacityValue(1), alt.OpacityValue(0.7)),
        tooltip=["Country", "Wins"],
        color=alt.condition(
            alt.datum.Wins == max_wins, alt.value("orange"), alt.value("steelblue")
        ),
    )
    .add_selection(brush)
)

mean_line = (
    alt.Chart()
    .mark_rule(color="red",)
    .encode(y="mean(Wins):Q", size=alt.SizeValue(4))
    .transform_filter(brush)
)

final = alt.layer(bar_counts, mean_line, data=country_counts)
final.properties(width=850, height=500).configure_axis(
    labelFontSize=15, titleFontSize=20
)

### 3. Overall Female & Male winners ratio

In [None]:
gender_counts = pd.DataFrame(df.gender.value_counts()).reset_index()
gender_counts.columns = ["Gender", "Wins"]
gender_counts

**Male winners outnumber female winners in the dataset.**

**One point to note is that women were not officially allowed to enter the Boston Marathon until 1972 (http://www.marathonguide.com/history/olympicmarathons/chapter25.cfm), which partly explains the imbalance.**
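
One way to see how much the earlier years contribute to the imbalance is to compare the gender split over all rows with the split from 1972 onward. This is only a sketch on invented rows; `toy` is hypothetical and mimics the `year` and `gender` columns.

```python
import pandas as pd

# Hypothetical rows mimicking the `year` and `gender` columns
toy = pd.DataFrame(
    {
        "year": [1960, 1965, 1970, 1980, 1980, 1990, 1990],
        "gender": ["Male", "Male", "Male", "Male", "Female", "Male", "Female"],
    }
)

# Share of each gender over all rows
overall = toy["gender"].value_counts(normalize=True)

# Share of each gender restricted to 1972 and later
since_1972 = toy.loc[toy["year"] >= 1972, "gender"].value_counts(normalize=True)
print(overall["Male"], since_1972["Male"])
```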

In [None]:
bar_chart = alt.Chart(gender_counts).mark_bar().encode(
    x="Gender:O", y="Wins:Q", tooltip=["Gender", "Wins"], color=alt.Color("Gender")
).properties(width=300, height=350).configure_axis(
    labelFontSize=15, titleFontSize=20, labelAngle=0
).configure_legend(
    strokeColor="gray", labelFontSize=15, padding=10, cornerRadius=5, orient="right"
)

bar_chart

### 4. Wins by Place and Gender

In [None]:
place_gender_counts = (
    df.groupby(["marathon", "gender"])["year"].agg(["count"]).reset_index()
)
place_gender_counts

In [None]:
alt.Chart(
    place_gender_counts,
    title="Total number of wins by gender grouped by marathon city",
).mark_bar().encode(
    x="gender:O", y="count:Q", tooltip=["count:Q"], color="gender", column="marathon:O",
).properties(
    width=150, height=250
).configure_axis(
    labelFontSize=15, labelAngle=0
).configure_legend(
    strokeColor="gray", labelFontSize=15, padding=10, cornerRadius=5, orient="right"
).configure_title(
    fontSize=20, offset=10, orient="top", anchor="middle"
).configure_header(
    labelFontSize=18
)

### 5. Histogram of marathon competitions

In [None]:
base_chart = alt.Chart(df, title="Distribution of female and male winners")

hist_chart = base_chart.mark_bar().encode(
    x=alt.X("year:Q", bin=True, axis=None),
    y="count()",
    color="gender:N",
    tooltip=["count()"],
)

mean_line = base_chart.mark_rule(color="red").encode(
    x="mean(year):Q", size=alt.value(5)
)

(hist_chart + mean_line).configure_legend(
    strokeColor="gray", labelFontSize=15, padding=10, cornerRadius=5, orient="right"
).configure_axis(
    labelFontSize=15, labelAngle=0
).configure_title(
    fontSize=20, offset=10, orient="top", anchor="middle"
)

### 6. Overview of total minutes taken by the marathon winners

In [None]:
year_avg_minutes = df.groupby("year")["total_minutes"].agg("mean").reset_index()
year_avg_minutes.head()

**Use the box-select tool on the left chart to zoom the right chart into a range of years and inspect the average winning time there.**

In [None]:
interval = alt.selection_interval(mark=alt.BrushConfig(fill="green"))
brush = alt.selection(type="interval", encodings=["x"])

base = (
    alt.Chart(year_avg_minutes)
    .mark_area()
    .encode(x="year", y="total_minutes:Q",)
    .properties(width=600, height=250, selection=interval)
)

minutes_chart = base.encode(
    alt.X("year", scale=alt.Scale(domain=brush)), tooltip="total_minutes"
)

chart_selector = base.properties(width=600, height=120).add_selection(brush)

chart_selector | minutes_chart

### 7. Time taken by Male and Female winners (boxplots)

In [None]:
alt.Chart(df, title="Time taken by the winners").mark_boxplot(
    size=50, extent=0.5
).encode(
    x="gender",
    y=alt.Y("total_minutes:Q", scale=alt.Scale(zero=False)),
    color=alt.Color("gender"),
).properties(
    width=450, height=300
).configure_axis(
    labelFontSize=16, titleFontSize=16, labelAngle=0
).configure_legend(
    strokeColor="gray", labelFontSize=15, padding=10, cornerRadius=5, orient="right"
).configure_title(
    fontSize=20, offset=10, orient="top", anchor="middle"
)

### 8. Average time at each marathon location

In [None]:
marathon_avg_minutes = df.groupby("marathon")["total_minutes"].agg("mean").reset_index()
marathon_avg_minutes['total_minutes'] = round(marathon_avg_minutes['total_minutes'], 2)
marathon_avg_minutes.head()

**The London and Tokyo marathons seem to have the lowest average winning times.**
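
Because these averages pool male and female winners, a city's figure may partly reflect its mix of records rather than its course. A hedged sketch of splitting the average by gender is shown below; the rows in `toy` are invented and only mimic the `marathon`, `gender` and `total_minutes` columns.

```python
import pandas as pd

# Hypothetical rows mimicking the relevant columns of the dataset
toy = pd.DataFrame(
    {
        "marathon": ["Boston", "Boston", "London", "London"],
        "gender": ["Male", "Female", "Male", "Female"],
        "total_minutes": [130.0, 150.0, 126.0, 140.0],
    }
)

# Mean winning time per city and gender, with one column per gender
avg_by_city_gender = (
    toy.groupby(["marathon", "gender"])["total_minutes"].mean().unstack("gender")
)
print(avg_by_city_gender)
```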

In [None]:
alt.Chart(
    marathon_avg_minutes,
    title="Average time taken to complete the marathon at each City",
).mark_bar().encode(
    x="marathon", y="total_minutes", tooltip="total_minutes", color="marathon"
).properties(
    width=450, height=300
).configure_axis(
    labelFontSize=15, labelAngle=0
).configure_legend(
    strokeColor="gray", labelFontSize=15, padding=10, cornerRadius=5, orient="right"
).configure_title(
    fontSize=20, offset=10, orient="top", anchor="middle"
)

### 9. Number of wins by participant

In [None]:
winner_counts = df.groupby(['winner', 'gender', 'country'])['winner'].agg(['count'])
winner_counts = winner_counts[winner_counts['count'] >= 2].sort_values(by='count', ascending=False)
winner_counts.head()

**The filter above keeps only participants who have won at least two marathons.**

- Grete Waitz (Female) from Norway is the top contestant, with a total of 11 marathon wins
- Bill Rodgers (Male) from the United States and Ingrid Kristiansen (Female) from Norway have won a total of 8 marathons each

In [None]:
line = (
    alt.Chart(winner_counts.reset_index()).mark_line().encode(x="winner", y="count:Q",)
)

point = (
    alt.Chart(winner_counts.reset_index())
    .mark_point(filled=True)
    .encode(
        x="winner",
        y="count:Q",
        tooltip=["winner", "gender", "country", "count"],
        size="count",
        color="gender",
    )
)

(line + point).properties(width=1650, height=300).configure_axis(
    labelFontSize=14, titleFontSize=16
).configure_legend(
    strokeColor="gray", labelFontSize=15, padding=10, cornerRadius=5, orient="right"
)

### 10. Fastest and average winning times of the top 5 winners

In [None]:
# Index entries are (winner, gender, country) tuples, sorted by win count
for name, gender, _country in winner_counts.index[:5]:
    # All winning times recorded for this participant
    times = df.loc[df['winner'] == name, 'total_minutes']
    print('Winner:', name, ',', gender)
    print('Lowest time taken to win a marathon:', round(times.min(), 2))
    print('Average minutes taken to win in the marathons:', round(times.mean(), 2))
    print('\n')
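
The same figures can be collected into a single table with a grouped aggregation rather than printed in a loop. The sketch below uses invented runners and times; `toy` and the column names `wins`, `fastest` and `average` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical winners and winning times in minutes (not from the dataset)
toy = pd.DataFrame(
    {
        "winner": ["Runner A", "Runner A", "Runner A", "Runner B", "Runner B", "Runner C"],
        "total_minutes": [150.0, 148.0, 146.0, 130.0, 132.0, 128.0],
    }
)

# Named aggregation: one row per winner with win count, fastest and mean time
summary = (
    toy.groupby("winner")["total_minutes"]
    .agg(wins="count", fastest="min", average="mean")
    .sort_values("wins", ascending=False)
    .head(5)
)
print(summary.round(2))
```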

**Thank you 😊😊😊.**