# Lab 7 Tasks - Solution

In this notebook we analyse a dataset from an Irish triathlon using the Pandas library. Each row of the dataset represents an athlete, described by the following features:

- *Number:* The athlete's race bib number
- *Place:* The place in which the athlete finished the race
- *Age:* The athlete's age
- *Gender:* The gender that the athlete declared ('M' or 'F')
- *Province:* The Irish province where the athlete comes from (Leinster, Munster, Connacht, Ulster)
- *Swim:* The time taken for the swimming segment of the event (in seconds)
- *T1:* The time taken for the first transition of the event, from swimming to cycling (in seconds)
- *Cycle:* The time taken for the cycling segment of the event (in seconds)
- *T2:* The time taken for the second transition of the event, from cycling to running (in seconds)
- *Run:* The time taken for the running segment of the event (in seconds)

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

## Task 1 - Data Loading and Preparation

Use Python to download a file containing the triathlon dataset in CSV format from the URL:

http://mlg.ucd.ie/modules/COMP41680/triathlon.csv

Load the dataset into a Pandas DataFrame, where the row index will be given by the athlete's bib number. Display the first 20 rows of the DataFrame.

In [None]:
# we could either download the file using urllib, or we can use Pandas to do this for us
df = pd.read_csv("http://mlg.ucd.ie/modules/COMP41680/triathlon.csv", index_col="Number")

In [None]:
# display first 20 rows
df.head(20)
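
To see what `index_col` does without touching the network, here is a minimal sketch on a few made-up rows. The values are invented for illustration and are not taken from the real file; only the column names follow the dataset description.

```python
import io
import pandas as pd

# Made-up rows laid out like the triathlon file (values are invented)
csv_text = """Number,Place,Age,Gender,Swim
101,2,34,F,1500
102,1,28,M,1400
103,3,45,M,1650
"""

# index_col="Number" turns the bib number into the row index
toy = pd.read_csv(io.StringIO(csv_text), index_col="Number")

# rows can now be looked up directly by bib number
print(toy.loc[102, "Place"])
```

Because the bib number becomes the index, it no longer appears among the ordinary columns of the DataFrame.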

The dataset might contain missing values. For instance, some athletes may have registered for the race but never actually started. Other athletes might have started the race but not completed all segments of the triathlon.

From the DataFrame, identify the number of missing values in each column. Then remove any rows which contain missing values (i.e. athletes who did not fully complete the race). How many rows remain?

In [None]:
# count number of missing values per column
df.isna().sum()

In [None]:
# number of rows currently
print("Data contains %d rows" % len(df))
# remove the rows with missing values
df = df.dropna()
# how many rows left?
print("Data now contains %d rows" % len(df))
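
The behaviour of `isna()` and `dropna()` can be checked on a small example. The sketch below uses three invented athletes (not real data): one who finished, one who never started, and one who stopped after the swim. It also shows the `subset` argument, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Invented example: athlete 2 never started, athlete 3 stopped after the swim
toy = pd.DataFrame(
    {
        "Swim": [1500.0, np.nan, 1600.0],
        "Cycle": [4000.0, np.nan, np.nan],
        "Run": [2500.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3], name="Number"),
)

# number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows with no missing values at all
complete = toy.dropna()
print("Removed %d of %d rows" % (len(toy) - len(complete), len(toy)))

# subset= only looks at the listed columns: keep anyone with a swim time
started = toy.dropna(subset=["Swim"])
print(list(started.index))
```

For this task the plain `dropna()` is the right call, since only athletes who completed every segment should remain.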

Add a new column to the DataFrame, called *Finish*, which is the total time taken for the race for each athlete (i.e. Swim + T1 + Cycle + T2 + Run).

In [None]:
# add the times for the different segments together
df["Finish"] = df["Swim"] + df["T1"] + df["Cycle"] + df["T2"] + df["Run"]

To verify the step above, sort the DataFrame, based on the *Finish* time, fastest to slowest. Display the top 10 fastest athletes overall:

In [None]:
df.sort_values(by="Finish").head(10)
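
A further check is to compare the ranking implied by *Finish* with the recorded *Place*. The sketch below does this on invented rows, under the assumption that *Place* was assigned purely by total time; the `Rank` column and the `segments` list are names introduced here for illustration.

```python
import pandas as pd

# Invented rows; in the real data "Place" comes from the race results
toy = pd.DataFrame(
    {
        "Place": [2, 1, 3],
        "Swim": [1500, 1400, 1650],
        "T1": [120, 100, 150],
        "Cycle": [4000, 3900, 4300],
        "T2": [90, 80, 110],
        "Run": [2500, 2400, 2800],
    },
    index=pd.Index([101, 102, 103], name="Number"),
)

segments = ["Swim", "T1", "Cycle", "T2", "Run"]
# row-wise sum over the five timed parts of the race
toy["Finish"] = toy[segments].sum(axis=1)

# rank athletes by finish time (1 = fastest) and compare with the recorded place
toy["Rank"] = toy["Finish"].rank(method="min").astype(int)
mismatches = toy[toy["Rank"] != toy["Place"]]
print("Rows where rank and place disagree: %d" % len(mismatches))
```

Note that `sum(axis=1)` skips missing values by default, whereas adding columns with `+` propagates them, so the two approaches only agree once incomplete rows have been removed.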

## Task 2 - Data Analysis

What is the average finishing time for athletes? What is the slowest finishing time?

In [None]:
# average time (all times are in seconds)
print("Mean finish time = %.2f seconds" % df["Finish"].mean())
# slowest time
print("Slowest finish time = %.2f seconds" % df["Finish"].max())
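
Times in seconds are hard to read at this scale. As an optional extra, the sketch below converts a number of seconds to hours, minutes and seconds; `format_seconds` is a hypothetical helper written for this example, not part of Pandas.

```python
def format_seconds(total_seconds):
    """Format a duration given in seconds as H:MM:SS (hypothetical helper)."""
    # round to the nearest whole second before splitting
    total_seconds = int(round(total_seconds))
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%d:%02d:%02d" % (hours, minutes, seconds)

# 8210 seconds is 2 hours, 16 minutes and 50 seconds
print(format_seconds(8210))
```

It could then be applied to the summary values, for example `format_seconds(df["Finish"].mean())`.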

On average, which segment of the race took the longest: swimming, cycling or running?

In [None]:
# calculate the average time for each of the three segments
segment_means = df[["Swim", "Cycle", "Run"]].mean()
print(segment_means)
# idxmax returns the label of the segment with the largest mean
print("%s took longest on average" % segment_means.idxmax())

How many female and male athletes competed in the race? How many athletes from each Irish province competed in the race? 

In [None]:
# create a frequency table for gender
df["Gender"].value_counts()

In [None]:
# create a frequency table for province
df["Province"].value_counts()

Create a new column 'AgeCategory' that divides the ages into age categories: 16-19, 20-29, 30-39, 40-49, 50-65.

In [None]:
# with right=False each bin includes its left edge and excludes its right edge,
# so the last edge is 66 to keep athletes aged 65 in the 50-65 category
bins = [16, 20, 30, 40, 50, 66]
df["AgeCategory"] = pd.cut(df["Age"], bins=bins, right=False)
df.head()
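
How `pd.cut` treats the bin edges is easy to get wrong, so here is a minimal sketch on a handful of boundary ages. The `labels` list and the final edge of 66 are choices made for this illustration.

```python
import pandas as pd

ages = pd.Series([16, 19, 20, 49, 50, 65])

# With right=False each bin is [left, right): the left edge is included and
# the right edge is excluded, so a final edge of 66 is needed to keep age 65.
bins = [16, 20, 30, 40, 50, 66]
labels = ["16-19", "20-29", "30-39", "40-49", "50-65"]
categories = pd.cut(ages, bins=bins, right=False, labels=labels)
print(categories.tolist())

# With a final edge of 65, age 65 falls outside every bin and becomes NaN
too_short = pd.cut(ages, bins=[16, 20, 30, 40, 50, 65], right=False)
print(too_short.isna().sum())
```

Passing `labels` is optional; without it the categories are shown as intervals such as `[16, 20)`.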

How many female and male athletes were from each age category? How many female and male athletes were from each of the 4 provinces?

In [None]:
# apply a cross-tabulation on age category and gender
pd.crosstab(df["AgeCategory"], df["Gender"])

In [None]:
# apply a cross-tabulation on these two columns
pd.crosstab(df["Province"], df["Gender"])
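
Cross-tabulations can also report totals and proportions. The sketch below shows the `margins` and `normalize` arguments of `pd.crosstab` on six invented athletes; the counts say nothing about the real dataset.

```python
import pandas as pd

# Invented athletes, only to show the shape of the output
toy = pd.DataFrame(
    {
        "Province": ["Leinster", "Leinster", "Munster", "Ulster", "Munster", "Leinster"],
        "Gender": ["F", "M", "M", "F", "M", "M"],
    }
)

# margins=True appends an "All" row and column holding the totals
counts = pd.crosstab(toy["Province"], toy["Gender"], margins=True)
print(counts)

# normalize="index" converts each row to proportions that sum to 1
shares = pd.crosstab(toy["Province"], toy["Gender"], normalize="index")
print(shares.round(2))
```

Row proportions make provinces of different sizes easier to compare than raw counts.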

What were the average times for the three segments, per age category?

In [None]:
# first aggregate by age category (observed=False keeps empty categories)
groups = df.groupby("AgeCategory", observed=False)
# now calculate the means for the three segments only
groups[["Swim", "Cycle", "Run"]].mean()
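
A group mean is only as reliable as the number of athletes behind it. As a sketch of one way to keep both in view, `agg` can compute the mean and the group size together. The rows below are invented and the category labels are plain strings chosen for the example.

```python
import pandas as pd

# Invented rows, five athletes in two age categories
toy = pd.DataFrame(
    {
        "AgeCategory": ["20-29", "20-29", "30-39", "30-39", "30-39"],
        "Swim": [1400, 1500, 1600, 1700, 1800],
        "Run": [2400, 2600, 2500, 2700, 2900],
    }
)

# select the columns of interest first, then aggregate each group
summary = toy.groupby("AgeCategory")[["Swim", "Run"]].agg(["mean", "count"])
print(summary)
```

The result has a two-level column index, so a single value is read with a tuple, for example `summary.loc["30-39", ("Run", "mean")]`.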

## Task 3 - Data Visualisation

Use bar charts to visualise:
1. The number of athletes per age category
2. The number of athletes per province

In [None]:
# get the frequency counts, ordered by age category rather than by frequency
age_counts = df["AgeCategory"].value_counts().sort_index()
# plot the frequencies
ax = age_counts.plot(kind="bar", fontsize=13, color="navy", figsize=(6, 5))
plt.ylabel("Number of Athletes", fontsize=13);

In [None]:
# get the frequency counts
province_counts = df["Province"].value_counts()
# plot the frequencies
ax = province_counts.plot(kind="bar", fontsize=13, color="teal", figsize=(6, 5))
plt.ylabel("Number of Athletes", fontsize=13);

Produce a visualisation of the distribution of finish times:

In [None]:
ax = df["Finish"].hist(bins=10, figsize=(8,5), color='green', grid=False, rwidth=0.9)
plt.ylabel("Number of Athletes", fontsize=13)
plt.xlabel("Finish Time (seconds)", fontsize=13);

Repeat the above, but this time produce a visualisation of the distribution of finish times for female athletes only:

In [None]:
# filter based on gender first
df2 = df[df["Gender"]=="F"]
# then generate the histogram
ax = df2["Finish"].hist(bins=10, figsize=(8,5), color='green', grid=False, rwidth=0.9)
plt.ylabel("Number of Athletes", fontsize=13)
plt.xlabel("Finish Time (seconds)", fontsize=13);

Produce three plots which show: 
1. The relationship between the time taken for the swimming and cycling segments.
2. The relationship between the time taken for the swimming and running segments.
3. The relationship between the time taken for the cycling and running segments.

In [None]:
ax = df.plot(kind="scatter", figsize=(8,6), color='darkblue', s=40, fontsize=13, x="Swim", y="Cycle")
plt.xlabel("Swimming Segment (seconds)", fontsize=13)
plt.ylabel("Cycling Segment (seconds)", fontsize=13);

In [None]:
ax = df.plot(kind="scatter", figsize=(8,6), color='darkorange', s=40, fontsize=13, x="Swim", y="Run")
plt.xlabel("Swimming Segment (seconds)", fontsize=13)
plt.ylabel("Running Segment (seconds)", fontsize=13);

In [None]:
ax = df.plot(kind="scatter", figsize=(8,6), color='darkred', s=40, fontsize=13, x="Cycle", y="Run")
plt.xlabel("Cycling Segment (seconds)", fontsize=13)
plt.ylabel("Running Segment (seconds)", fontsize=13);
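
The scatter plots give a visual impression of how the segments relate. A numeric complement is the Pearson correlation between each pair of segment times, which can be sketched as follows on invented values (the real coefficients will differ).

```python
import pandas as pd

# Invented segment times for five athletes (seconds)
toy = pd.DataFrame(
    {
        "Swim": [1400, 1500, 1600, 1700, 1800],
        "Cycle": [3900, 4000, 4200, 4300, 4600],
        "Run": [2400, 2600, 2500, 2900, 2800],
    }
)

# Pearson correlation between every pair of segments;
# values near 1 indicate a strong positive linear relationship
corr = toy.corr()
print(corr.round(2))
```

On the lab data the same idea would be `df[["Swim", "Cycle", "Run"]].corr()`.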