This notebook comes from https://github.com/chauvinSimon/tri_stats.

## SETUP

Only needed **once**:
- [Create a Google account](https://accounts.google.com/signin). You can use an existing account, but because this code stores files in the account's Google Drive and needs access to it, creating a dedicated account for this project is recommended.
- [Create a key for the World Triathlon API](https://apps.api.triathlon.org/register) and paste it into the next cell. You can register with the Google address you just created.

Needed **each time**:
- Run each cell by clicking the "play" button on its left. The first cell starts the runtime, so it may take a little longer.

In [1]:
YOUR_API_KEY = "04417c38342c9e66bd68bf420b341bc0"
# YOUR_API_KEY = "2649776ef9ece4c391003b521cbfce7a"  # example only!

In [2]:
from pathlib import Path
from google.colab import drive

ModuleNotFoundError: No module named 'google'

In [3]:
drive_dir = Path("/content/drive")
drive_nb_dir = drive_dir / "MyDrive/Colab Notebooks"
project_dir = drive_nb_dir / "tri_stats"

repo_url = "https://github.com/chauvinSimon/tri_stats.git"

A pop-up window should open:
> _"Permit this notebook to access your Google Drive files?"_

You should:
- Click `"Connect to Google Drive"`.
- Select your Google account.
- Click `Continue` on `Sign in to Google Drive for desktop`.
- Click `Select all` on `Select what Google Drive for desktop can access`. _(You can revoke this access later in your Google account settings: look for `Google Drive for desktop` under `Data from apps and services you use`.)_
- Scroll down and validate with `Continue`.

In [4]:
if not drive_dir.exists():
  print("mounting drive")
  drive.mount('/content/drive')

mounting drive
Mounted at /content/drive


In [5]:
if not project_dir.exists():
    # Convert the path to a string and quote it for the shell command, because "Colab Notebooks" contains a space.
    quoted_project_dir = f'"{project_dir}"'
    print(f"cloning repo from: {repo_url}")
    !git clone {repo_url} {quoted_project_dir}

cloning repo from: https://github.com/chauvinSimon/tri_stats.git
Cloning into '/content/drive/MyDrive/Colab Notebooks/tri_stats'...
remote: Enumerating objects: 562, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 562 (delta 4), reused 10 (delta 4), pack-reused 551 (from 1)
Receiving objects: 100% (562/562), 67.76 MiB | 11.32 MiB/s, done.
Resolving deltas: 100% (281/281), done.
Updating files: 100% (91/91), done.
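
The quoting step above matters because `Colab Notebooks` contains a space, which the shell would otherwise split into two arguments. As a minimal sketch of an alternative to adding double quotes by hand, the standard-library `shlex.quote` can do the quoting. The variable `command` is only for illustration and is not part of the notebook:

```python
import shlex
from pathlib import PurePosixPath

# Same path and URL as in the notebook; PurePosixPath keeps forward slashes on any OS.
project_dir = PurePosixPath("/content/drive/MyDrive/Colab Notebooks/tri_stats")
repo_url = "https://github.com/chauvinSimon/tri_stats.git"

# shlex.quote wraps the string in single quotes when it contains characters
# (such as spaces) that a POSIX shell would otherwise treat specially.
quoted_project_dir = shlex.quote(str(project_dir))
command = f"git clone {repo_url} {quoted_project_dir}"
print(command)

# Splitting the command the way a POSIX shell would keeps the path as one argument.
arguments = shlex.split(command)
print(arguments[-1])
```

In a notebook cell, `quoted_project_dir` could then be passed to `!git clone {repo_url} {quoted_project_dir}` exactly as before.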


In [6]:
%cd {project_dir}
!git status

/content/drive/MyDrive/Colab Notebooks/tri_stats
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean


In [7]:
# show local changes, then discard them before pulling
!git diff
!git checkout .
!git status

Updated 0 paths from the index
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean


In [8]:
%cd {project_dir}
!git pull origin main

/content/drive/MyDrive/Colab Notebooks/tri_stats
From https://github.com/chauvinSimon/tri_stats
 * branch            main       -> FETCH_HEAD
Already up to date.


In [9]:
%cd {project_dir}/scripts
assert Path().resolve() == project_dir / "scripts"
%ls

/content/drive/MyDrive/Colab Notebooks/tri_stats/scripts
main_athlete_dimensions.py          main_events.py           utils_itu.py
main_athlete_season.py              main_t1_with_wetsuit.py  utils.py
main_birth_month.py                 utils_countries.py       utils_rankings.py
main_birth_month_united_nations.py  utils_events.py


In [10]:
api_key_path = project_dir / "api_key.txt"
if (not api_key_path.exists()) or (api_key_path.read_text() != YOUR_API_KEY):
    print(f"Writing key to local file: {YOUR_API_KEY}")
    api_key_path.write_text(YOUR_API_KEY)

Writing key to local file: 04417c38342c9e66bd68bf420b341bc0
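
The cell above rewrites the key file only when it is missing or holds a different key. A minimal sketch of that "write only if changed" pattern as a reusable helper is shown here. The function name `write_if_changed` and the placeholder strings are hypothetical, and the sketch deliberately does not print the key:

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write `content` to `path` unless the file already holds exactly that text.

    Returns True when the file was written, False when it was left untouched.
    """
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True


# Demonstration in a temporary directory, with placeholder values instead of a real key.
with tempfile.TemporaryDirectory() as tmp_dir:
    key_path = Path(tmp_dir) / "api_key.txt"
    first = write_if_changed(key_path, "placeholder-key")       # file is missing: written
    second = write_if_changed(key_path, "placeholder-key")      # same content: skipped
    third = write_if_changed(key_path, "another-placeholder")   # content differs: rewritten
    stored = key_path.read_text()

print(first, second, third)
```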


## USAGE

At this point, you are ready to:
- Collect data from the World Triathlon API.
- Process and clean it up.
- Format it to a table.
- Export it to a csv file.

In [6]:
import sys
import os

# Build the absolute path to the 'scripts' folder.
# '.' is the current directory, where the notebook lives.
# '..' is the parent directory.
caminho_scripts = os.path.abspath('scripts')

# Add the path to sys.path if it is not already there
if caminho_scripts not in sys.path:
    sys.path.append(caminho_scripts)
    print(f"✅ Directory added to sys.path: {caminho_scripts}")

# Confirm that the imports should work now
print("Try running the 'from utils import...' cell now.")

✅ Directory added to sys.path: /home/usuario/Documentos/GitHub/tri-data-stuff/scripts
Try running the 'from utils import...' cell now.


In [12]:
from utils import load_config
from utils_events import get_events_df

In [13]:
export_dir = project_dir / "ignored" / "exports"
export_dir.mkdir(parents=True, exist_ok=True)

NameError: name 'project_dir' is not defined

In [14]:
config = load_config()
events_config = config["events"]

In [15]:
# just for quick test: querying a narrow range of dates
events_config["query"]["start_date"] = "2024-05-01"
events_config["query"]["end_date"] = "2024-05-31"

In [16]:
# set the min number of results
events_config["cleaning"]["n_results_min"] = 25

The next cell makes requests to the API.
- It takes time the first time.
- It is much faster on later runs, because the results of the requests are saved.
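
To illustrate why saved results make later runs faster, here is a minimal sketch of a disk cache for API responses. It is not the repository's implementation: `cached_request`, `fake_fetch`, the cache layout, and the `example.invalid` URL are all assumptions made for the example, and the network call is replaced by a stand-in function.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_request(url: str, params: dict, cache_dir: Path,
                   fetch: Callable[[str, dict], dict]) -> dict:
    """Return the cached response for (url, params), calling `fetch` only on a miss."""
    # Build a stable file name from the URL and the key-sorted parameters.
    raw_key = json.dumps([url, params], sort_keys=True).encode("utf-8")
    cache_file = cache_dir / f"{hashlib.sha256(raw_key).hexdigest()}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = fetch(url, params)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result


calls = []


def fake_fetch(url: str, params: dict) -> dict:
    """Stand-in for a real HTTP request; records each call so misses can be counted."""
    calls.append((url, params))
    return {"url": url, "n_params": len(params)}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_dir = Path(tmp_dir) / "cache"
    query = {"start_date": "2024-05-01", "end_date": "2024-05-31"}
    first = cached_request("https://example.invalid/events", query, cache_dir, fake_fetch)
    second = cached_request("https://example.invalid/events", query, cache_dir, fake_fetch)

# The second call is served from disk, so the fetch function ran only once.
print(len(calls))
```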

In [17]:
df = get_events_df(events_config)


### ### ###
spec_name = 'Triathlon' (spec_id = 357), cat_name = 'Major Games' (cat_id = 343): len(res) = 0
### ### ###

### ### ###
spec_name = 'Triathlon' (spec_id = 357), cat_name = 'Recognised Event' (cat_id = 345): len(res) = 0
### ### ###

### ### ###
spec_name = 'Triathlon' (spec_id = 357), cat_name = 'Recognised Games' (cat_id = 346): len(res) = 0
### ### ###

### ### ###
spec_name = 'Triathlon' (spec_id = 357), cat_name = 'World Championship Finals' (cat_id = 624): len(res) = 0
### ### ###

### ### ###
spec_name = 'Triathlon' (spec_id = 357), cat_name = 'World Championship Series' (cat_id = 351): len(res) = 2
### ### ###
2024 World Triathlon Championship Series Yokohama (183763): spec_id = 357 cat_id = 351
2024 World Triathlon Championship Series Yokohama (183763)
	627954 Elite Men
	627955 Elite Women
2024 World Triathlon Championship Series Cagliari (183764): spec_id = 357 cat_id = 351
2024 World Triathlon Championship Series Cagliari (183764)
	627959 Elite Men
	627960 Elite 

The next cell shows the created table (or a section of it).

In [18]:
df

| Unnamed: 0 | event_id | event_title | event_venue | event_listing | event_country_noc | event_date_m | prog_notes_m | event_category_ids_m | level_m | swim_mean_m | ... | wetsuit_w | prog_distance_category | swim_diff | bike_diff | run_diff | swim_diff_percent | bike_diff_percent | run_diff_percent | event_category | event_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 183763 | 2024 World Triathlon Championship Series Yokohama | Yokohama | https://www.triathlon.org/events/2023-world-tr... | JPN | 2024-05-11 | technical Delegate: Adele Cheah Lynn-Li/MAS.\r... | [351] | 16.522727 | 1064.8 | ... | True | standard | 36.8 | 373.0 | 243.0 | 0.03456 | 0.116621 | 0.136257 | wcs | 2024 |
| 1 | 183769 | 2024 World Triathlon Cup Samarkand | Samarkand | https://www.triathlon.org/events/2024-world-tr... | UZB | 2024-05-18 | Technical Delegate: Kyungsook Kim/KOR. | [349] | 45.227273 | 1093.6 | ... | False | standard | 102.0 | 367.4 | 241.0 | 0.09327 | 0.120167 | 0.126443 | world-cup | 2024 |
| 2 | 183770 | 2024 World Triathlon Cup Huatulco | Huatulco | https://www.triathlon.org/events/2024-world-tr... | MEX | 2024-05-18 | Technical Delegate: Paul Brandt/USA. | [349] | 44.886364 | 559.2 | ... | False | sprint | 78.6 | 187.8 | 123.2 | 0.140558 | 0.096516 | 0.134468 | world-cup | 2024 |
| 3 | 183764 | 2024 World Triathlon Championship Series Cagliari | Cagliari | https://www.triathlon.org/events/2024-world-tr... | ITA | 2024-05-25 | Technical Delegate: Dag Oliver/NOR.\r\nAthlete... | [351] | 19.340909 | 1115.2 | ... | True | standard | 25.6 | 177.2 | 225.8 | 0.022956 | 0.057901 | 0.125319 | wcs | 2024 |


In [18]:
list(df.columns)

['event_id',
 'event_title',
 'event_venue',
 'event_listing',
 'event_country_noc',
 'event_date_m',
 'prog_notes_m',
 'event_category_ids_m',
 'level_m',
 'swim_mean_m',
 'swim_std_m',
 'swim_all_m',
 'swim_mean_m_last',
 'swim_std_m_last',
 't1_mean_m',
 't1_std_m',
 't1_all_m',
 't1_mean_m_last',
 't1_std_m_last',
 'bike_mean_m',
 'bike_std_m',
 'bike_all_m',
 'bike_mean_m_last',
 'bike_std_m_last',
 't2_mean_m',
 't2_std_m',
 't2_all_m',
 't2_mean_m_last',
 't2_std_m_last',
 'run_mean_m',
 'run_std_m',
 'run_all_m',
 'run_mean_m_last',
 'run_std_m_last',
 'age_mean_m',
 'age_std_m',
 'n_finishers_m',
 'pack_size_m',
 'is_winner_in_front_pack_m',
 'is_best_runner_in_front_pack_m',
 'best_runner_wins_m',
 'second_delay_m',
 'winner_m',
 'winner_country_m',
 'second_m',
 'second_country_m',
 'air_temperature_m',
 'water_temperature_m',
 'wetsuit_m',
 'event_date_w',
 'prog_notes_w',
 'event_category_ids_w',
 'level_w',
 'swim_mean_w',
 'swim_std_w',
 'swim_all_w',
 'swim_mean_w_last'

### Examples of filters:

In [None]:
# retrieve events where women and men used different swim equipment (wetsuit vs. no wetsuit). Group by WCS/WC.
df_different_wetsuit = df[
    (df["wetsuit_m"] != df["wetsuit_w"])
]

columns_to_show = ["event_year", "event_venue", "prog_distance_category", "wetsuit_w", "wetsuit_m", "swim_diff_percent", "swim_all_w", "swim_all_m"]

# grouping by a single column name (not a one-element list) yields a plain string key on every pandas version
for event_category, category_df in df_different_wetsuit.groupby("event_category"):
    print(event_category)
    display(category_df[columns_to_show])

In [None]:
# list repeating locations that have had varying wetsuits over the years
venue_groups = df.groupby("event_venue")

# Iterate through wetsuit types ('w' and 'm')
for suffix in ["w", "m"]:
    print(f"\n### ### ###\n### Events with varying `wetsuit_{suffix}` values:\n### ### ###")

    # Iterate through each group based on 'event_venue'
    for event_venue, venue_group in venue_groups:
        # Filter rows where the wetsuit_{suffix} column is not null
        wetsuit_group = venue_group[venue_group[f"wetsuit_{suffix}"].notna()]

        # Check if there are multiple unique values for wetsuit_{suffix}
        unique_wetsuit_values = wetsuit_group[f"wetsuit_{suffix}"].unique()
        if len(unique_wetsuit_values) > 1:
            print(f"\n{event_venue}:")

            total_events = len(wetsuit_group)
            valid_wetsuit_count = len(wetsuit_group[wetsuit_group[f'wetsuit_{suffix}']])
            valid_percentage = (valid_wetsuit_count / total_events) * 100

            print(f"\t{total_events} total events, {valid_wetsuit_count} with wetsuit_{suffix} ({valid_percentage:.1f}%)")

            # Print details for each event in the inconsistent group
            for row in wetsuit_group.itertuples(index=False):  # Exclude index from tuples
                print(f"\t{row.event_year} ({row.event_id}) - wetsuit_{suffix}: {getattr(row, f'wetsuit_{suffix}')}")


In [None]:
if not df.empty:
    len_df = len(df)

    # count True values directly: value_counts()[True] raises a KeyError when no event had a wetsuit swim
    n_wetsuit_w_true = int(df['wetsuit_w'].eq(True).sum())
    print(f"wetsuit_w in {n_wetsuit_w_true}/{len_df} events: {100 * n_wetsuit_w_true / len_df:.1f}%")

    n_wetsuit_m_true = int(df['wetsuit_m'].eq(True).sum())
    print(f"wetsuit_m in {n_wetsuit_m_true}/{len_df} events: {100 * n_wetsuit_m_true / len_df:.1f}%")

## EXPORT TABLE

The next cell saves the table (`df`) to your Drive:
- Go to https://drive.google.com/drive/my-drive.
- You should find the saved `events.csv` under `My Drive / Colab Notebooks / tri_stats / ignored / exports`.
- You may need to refresh the page.
- Before downloading the .csv, you can have a look: `Open with` -> `Google Sheets`.

In [None]:
df.to_csv(export_dir / "events.csv")
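
By default, `DataFrame.to_csv` also writes the row index as a first column without a header. When such a file is read back with pandas, that column is named `Unnamed: 0`. The sketch below, which uses a small made-up frame `df_small` rather than the real table, shows the effect and two common ways to avoid it. Whether to keep the index in the exported file is a matter of preference.

```python
import io

import pandas as pd

# Small illustrative frame; the values are taken from the events shown earlier.
df_small = pd.DataFrame(
    {"event_id": [183763, 183764], "event_venue": ["Yokohama", "Cagliari"]}
)

# Default export: the index is written as a first column with an empty header.
csv_text = df_small.to_csv()

# Reading it back naively turns the index into an "Unnamed: 0" column.
reloaded_default = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell read_csv that the first column is the index.
reloaded_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: do not write the index at all.
reloaded_no_index = pd.read_csv(io.StringIO(df_small.to_csv(index=False)))

print(list(reloaded_default.columns))
print(list(reloaded_indexed.columns))
```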

## QUIT

- Press `Ctrl+S` to save your changes to the notebook.
- Next time, you can open the notebook directly at https://drive.google.com/drive/my-drive, under `My Drive / Colab Notebooks / Copy of main.ipynb`.