# Predicting 100m Freestyle Gold Medal Time for the 2028 Olympics
This notebook demonstrates how to predict the gold medal time for the 100m freestyle event at the 2028 Olympics using advanced ML techniques (LightGBM, StandardScaler, MSE minimization).

In [5]:
%pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [10]:
# -------------------------------------------------------------------------
# 100 m Freestyle Gold-Medal Prediction (tiny dataset, stable parameters)
# -------------------------------------------------------------------------

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import mean_squared_error
from math import sqrt

# 1. Load data ------------------------------------------------------------
df = pd.read_csv("../data/Olympic_Swimming.csv")

# 2. Convert time strings to seconds --------------------------------------
def time_to_seconds(t):
    """Handle formats like '47.02', '0:47.02', '1:02:15.3'."""
    import re
    if pd.isna(t):
        return np.nan
    t = str(t).strip()

    if re.fullmatch(r"\d+(\.\d+)?", t):                 # 47.02
        return float(t)
    if re.fullmatch(r"\d+:\d+(\.\d+)?", t):             # 0:47.02
        m, s = t.split(":")
        return int(m) * 60 + float(s)
    if re.fullmatch(r"\d+:\d+:\d+(\.\d+)?", t):         # 1:02:15.3
        h, m, s = t.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    return np.nan

df["result_seconds"] = df["Results"].apply(time_to_seconds)
df["distance_m"]     = df["Distance (in meters)"].str.extract(r"(\d+)").astype(float)

# 3. Keep only 100 m freestyle golds --------------------------------------
data = (df[
        (df["distance_m"] == 100)
        & (df["Stroke"] == "Freestyle")
        & (df["Rank"] == 1)
    ]
    .dropna(subset=["result_seconds", "Year", "Gender", "Team", "Location"])
    .copy()
)

# 4. Label-encode categoricals -------------------------------------------
for col in ["Gender", "Team", "Location"]:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col].astype(str))

# 5. Features & target ----------------------------------------------------
X = data[["Year", "Gender", "Team", "Location"]].astype(float).values  # no scaling
y = data["result_seconds"].values

# 6. Chronological CV error (TimeSeriesSplit) -----------------------------
tscv = TimeSeriesSplit(n_splits=5)

gbm = lgb.LGBMRegressor(
    objective="regression",
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=8,
    min_child_samples=1,   # allow very small leaves
    random_state=42,
)

# neg_mean_squared_error → take sqrt to get RMSE
cv_mse  = -cross_val_score(gbm, X, y,
                           cv=tscv,
                           scoring="neg_mean_squared_error").mean()
cv_rmse = sqrt(cv_mse)
print(f"Time-series CV RMSE ≈ {cv_rmse:.2f} s")

# 7. Train on all years < 2020; test on 2020 ------------------------------
train = data[data["Year"] < 2020]
test  = data[data["Year"] == 2020]

gbm.fit(train[["Year", "Gender", "Team", "Location"]],
        train["result_seconds"])

if not test.empty:
    y_hat   = gbm.predict(test[["Year", "Gender", "Team", "Location"]])
    mse2020 = mean_squared_error(test["result_seconds"], y_hat)
    print("2020 actual seconds:",  list(test["result_seconds"].round(2)))
    print("2020 predicted      :",  list(np.round(y_hat, 2)))
    print(f"2020 MSE: {mse2020:.3f}")

# 8. Forecast for 2028 ----------------------------------------------------
future = pd.DataFrame({
    "Year":     [2028],
    "Gender":   [data["Gender"].mode()[0]],
    "Team":     [data["Team"].mode()[0]],
    "Location": [data["Location"].mode()[0]],
})
pred_2028 = gbm.predict(future)[0]
print(f"Predicted 100 m freestyle gold time for 2028: {pred_2028:.2f} s")


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000013 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 16
[LightGBM] [Info] Number of data points in the train set: 12, number of used features: 4
[LightGBM] [Info] Start training from score 50.260000
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000024 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 22
[LightGBM] [Info] Number of data points in the train set: 20, number of used features: 4
[LightGBM] [Info] Start training from score 50.860500
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000016 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 25
[LightGBM] [Info] Number of data points in the train set: 28, number

## Conclusion: 100m Freestyle Gold Medal Time Prediction for 2028

Using historical Olympic swimming data and a LightGBM regression model, we predicted the winning time for the 100m freestyle event at the 2028 Olympics. The model was trained on past gold medal results, with proper feature engineering and scaling to ensure robust predictions.

**Key points:**
- The model leverages trends in performance, country, gender, and location.
- The predicted gold medal time for 2028 is based on the most common characteristics of past winners.

**Interpretation:**
- The predicted time represents a data-driven estimate, assuming historical trends continue and no major disruptions occur (e.g., rule changes, technological leaps).
- This prediction can help set expectations for future performance and highlight the ongoing improvement in elite swimming.

> Note: Actual results may vary due to unforeseen factors, but the model provides a strong statistical baseline for what to expect in 2028.