ATP Match Prediction — Preprocessing & Feature Engineering

This notebook prepares a clean, leakage-safe dataset for predicting tennis match outcomes.

Outputs:

data/processed/final_features.parquet containing engineered features, RESULT (label), and TOURNEY_DATE for time-based splitting.

Imports

Import core Python libraries for data processing, feature engineering, and plotting.

In [1]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import defaultdict, deque
from statistics import mean
import matplotlib.pyplot as plt

Load raw match data

Load the yearly ATP match CSV files (1992 through 2024) and concatenate them into one dataset.

In [4]:
# Check if running in Google Colab or locally
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Setup data directory and download if needed
if IN_COLAB:
    print("Running in Google Colab - downloading data...")
    # Navigate to /content directory
    if os.getcwd() != "/content":
        os.chdir("/content")

    # Clone tennis_atp repository if not already present
    if not os.path.exists("tennis_atp"):
        print("Cloning tennis_atp repository...")
        get_ipython().system("git clone https://github.com/JeffSackmann/tennis_atp")
    else:
        print("tennis_atp repository already exists")

    DATA_DIR = "/content/tennis_atp"
else:
    print("Running locally - using local data directory...")
    # For local environment, use relative path from project root
    # Assumes notebook is in: Tennies_prediction/notebook/
    # Data should be in: Tennies_prediction/data/raw/tennis_atp/
    DATA_DIR = "../data/raw/tennis_atp"

    # Check if data directory exists
    if not os.path.exists(DATA_DIR):
        print(f"\n⚠️  WARNING: Data directory not found at {DATA_DIR}")
        print("Please download the tennis_atp data:")
        print("  git clone https://github.com/JeffSackmann/tennis_atp.git ../data/raw/tennis_atp")
        print("Or manually download CSV files to that location.")
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR}")
    else:
        print(f"✓ Data directory found: {DATA_DIR}")

print(f"\nDATA_DIR set to: {DATA_DIR}")

Running locally - using local data directory...
✓ Data directory found: ../data/raw/tennis_atp

DATA_DIR set to: ../data/raw/tennis_atp


In [5]:
# Reuse DATA_DIR from the previous cell. Hardcoding the Colab path here
# breaks local runs with a FileNotFoundError.
all_data = []
for year in range(1992, 2025):  # seasons 1992 through 2024
    path = f"{DATA_DIR}/atp_matches_{year}.csv"
    all_data.append(pd.read_csv(path))


all_data = pd.concat(all_data, ignore_index=True)
all_data.shape

Chronological ordering (prevents leakage)

All rolling features (form, H2H, Elo) must be computed using only past matches. We enforce chronological ordering using tourney_date and match_num.

In [None]:
all_data = all_data.sort_values(["tourney_date", "match_num"]).reset_index(drop=True)
all_data[["tourney_date", "match_num"]].head()
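
As a quick check of the ordering step, the sketch below sorts a toy frame the same way and confirms that dates never decrease and that match numbers are ordered within each date. The rows are made up for illustration; only the column names tourney_date and match_num come from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values) in shuffled order.
toy = pd.DataFrame(
    {
        "tourney_date": [20200113, 20200106, 20200113, 20200106],
        "match_num": [2, 5, 1, 3],
    }
)

toy_sorted = toy.sort_values(["tourney_date", "match_num"]).reset_index(drop=True)

# Dates must be non-decreasing after sorting.
dates_ok = toy_sorted["tourney_date"].is_monotonic_increasing

# Within a single date, match_num must also be non-decreasing.
nums_ok = (
    toy_sorted.groupby("tourney_date")["match_num"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)
print(dates_ok, nums_ok)
```

The same two checks could be run on all_data after the sort, assuming both columns are free of NaNs.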

Data filtering

Keep only rows with the minimum required fields for feature engineering (IDs, height, age, ranks/points, surface, match format, and serve/break-point stats). Rows with a missing value in any of these columns are dropped to avoid NaNs during rolling calculations.

In [None]:
needed = [
    "winner_id",
    "loser_id",
    "winner_ht",
    "loser_ht",
    "winner_age",
    "loser_age",
    "w_ace",
    "w_df",
    "w_svpt",
    "w_1stIn",
    "w_1stWon",
    "w_2ndWon",
    "w_SvGms",
    "l_ace",
    "l_df",
    "l_svpt",
    "l_1stIn",
    "l_1stWon",
    "l_2ndWon",
    "l_SvGms",
    "l_bpSaved",
    "l_bpFaced",
    "w_bpSaved",
    "w_bpFaced",
    "winner_rank_points",
    "loser_rank_points",
    "winner_rank",
    "loser_rank",
    "surface",
    "best_of",
    "draw_size",
    "tourney_date",
    "match_num",
]


df = all_data.dropna(subset=needed).reset_index(drop=True)
df.shape
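
Before dropping rows it can help to see which columns cause the losses. The sketch below does this on a toy frame; the values and the shortened required list are hypothetical, and the same pattern would apply to all_data with the needed list.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (hypothetical values).
toy = pd.DataFrame(
    {
        "winner_ht": [185.0, np.nan, 190.0, 178.0],
        "w_ace": [5.0, 3.0, np.nan, 7.0],
        "surface": ["Hard", "Clay", "Grass", None],
    }
)
required = ["winner_ht", "w_ace", "surface"]

# Number of missing values per required column, worst first.
missing_per_col = toy[required].isna().sum().sort_values(ascending=False)

# A row survives only if every required column is present.
kept = toy.dropna(subset=required)

print(missing_per_col)
print(f"kept {len(kept)} of {len(toy)} rows")
```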

Sanity checks

Verify basic properties: unique surfaces, date range, and a quick null scan.

In [None]:
print("date range:", df.tourney_date.min(), df.tourney_date.max())
print("surfaces:", df.surface.dropna().unique())
print("nulls in needed cols:", df[needed].isna().sum().sum())

Baseline difference features

Create signed difference features as (winner − loser). Because every such row implicitly lists the winner first, we later randomize player order and flip the signs to avoid label leakage.

In [None]:
base = pd.DataFrame(
    {
        "WINNER_ID": df["winner_id"].astype(int),
        "LOSER_ID": df["loser_id"].astype(int),
        "ATP_POINT_DIFF": df["winner_rank_points"] - df["loser_rank_points"],
        "ATP_RANK_DIFF": df["winner_rank"] - df["loser_rank"],
        "AGE_DIFF": df["winner_age"] - df["loser_age"],
        "HEIGHT_DIFF": df["winner_ht"] - df["loser_ht"],
        "BEST_OF": df["best_of"].astype(int),
        "DRAW_SIZE": df["draw_size"].astype(int),
        "SURFACE": df["surface"].astype(str),
        "TOURNEY_DATE": df["tourney_date"].astype(int),
    }
)
base.head()
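
To illustrate why the sign flip matters, here is a minimal sketch on made-up rank differences. It assumes a single feature column and an arbitrary seed; the point is only that flipping the sign together with the label keeps each row consistent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the sketch

# Winner-minus-loser rank differences (hypothetical values).
toy = pd.DataFrame({"ATP_RANK_DIFF": [-10.0, 5.0, -3.0, -40.0]})
original = toy["ATP_RANK_DIFF"].copy()

# Before any swap the winner is always listed first, so the label would be
# constant. Swapping about half the rows makes the label informative.
swap = rng.random(len(toy)) < 0.5
toy["RESULT"] = np.where(swap, 0, 1)
toy.loc[swap, "ATP_RANK_DIFF"] *= -1

# Undoing the flip with the label recovers the original differences.
sign = np.where(toy["RESULT"] == 1, 1.0, -1.0)
recovered = toy["ATP_RANK_DIFF"] * sign
print(toy)
```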

Time-aware feature builder (single pass)

We compute all features using only history before each match, then update the state after the match.

Features:

Head-to-head overall and surface-specific

Rolling win-rate differences

Serve performance rolling differences

Elo overall + surface Elo + Elo gradients (momentum)
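
The read-then-update pattern described above can be sketched on a toy sequence. This is an illustration with a hypothetical win/loss list and a window of 3, not the notebook's actual state containers: the feature for each match is read from the window before that match's outcome is appended.

```python
from collections import deque
from statistics import mean

window = deque(maxlen=3)  # keeps only the 3 most recent outcomes
results = [1.0, 0.0, 1.0, 1.0, 0.0]  # hypothetical win/loss sequence

features = []
for outcome in results:
    # Read the feature BEFORE seeing this match's outcome.
    features.append(mean(window) if window else 0.0)
    # Only then record the outcome for use by later matches.
    window.append(outcome)

print(features)
print(list(window))
```

Reversing the two steps inside the loop would let each feature include the result it is meant to predict.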

In [None]:
serve_windows = [3, 5, 10, 20, 50, 100, 200, 300, 2000]
win_windows = [3, 5, 10, 25, 50, 100]
elo_grad_windows = [5, 10, 20, 35, 50, 100, 250]


def safe_mean(x):
    return mean(x) if len(x) else 0.0


# rolling state containers
serve_hist = {
    k: defaultdict(lambda: defaultdict(lambda: deque(maxlen=k)))
    for k in serve_windows
}
win_hist = {k: defaultdict(lambda: deque(maxlen=k)) for k in win_windows}


matches_played = defaultdict(int)
h2h_wins = defaultdict(int)
h2h_wins_surface = defaultdict(int)


ELO_BASE = 1500.0
K_BASE = 32.0
elo_overall = defaultdict(lambda: ELO_BASE)
elo_surface = defaultdict(lambda: defaultdict(lambda: ELO_BASE))
elo_delta_hist = {k: defaultdict(lambda: deque(maxlen=k)) for k in elo_grad_windows}


rng = np.random.default_rng(42)


def expected_score(a, b):
    return 1.0 / (1.0 + 10.0 ** ((b - a) / 400.0))


def k_factor(best_of):
    return K_BASE * (1.15 if best_of == 5 else 1.0)


def update_serve(pid, ace, df_, svpt, firstIn, firstWon, secondWon, bpSaved, bpFaced):
    if svpt and (svpt - firstIn):
        p_ace = 100.0 * ace / svpt
        p_df = 100.0 * df_ / svpt
        p_1stIn = 100.0 * firstIn / svpt
        p_2ndWon = 100.0 * secondWon / (svpt - firstIn)
        for k in serve_windows:
            serve_hist[k][pid]["p_ace"].append(p_ace)
            serve_hist[k][pid]["p_df"].append(p_df)
            serve_hist[k][pid]["p_1stIn"].append(p_1stIn)
            serve_hist[k][pid]["p_2ndWon"].append(p_2ndWon)

    if firstIn:
        p_1stWon = 100.0 * firstWon / firstIn
        for k in serve_windows:
            serve_hist[k][pid]["p_1stWon"].append(p_1stWon)

    if bpFaced:
        p_bpSaved = 100.0 * bpSaved / bpFaced
        for k in serve_windows:
            serve_hist[k][pid]["p_bpSaved"].append(p_bpSaved)


rows = []


for r in tqdm(df.itertuples(index=False), total=len(df)):
    w = int(r.winner_id)
    l = int(r.loser_id)
    surface = str(r.surface)

    # ----- PRE MATCH FEATURES -----
    rec = {
        "PLAYER_1": w,
        "PLAYER_2": l,
        "ATP_POINT_DIFF": float(r.winner_rank_points - r.loser_rank_points),
        "ATP_RANK_DIFF": float(r.winner_rank - r.loser_rank),
        "AGE_DIFF": float(r.winner_age - r.loser_age),
        "HEIGHT_DIFF": float(r.winner_ht - r.loser_ht),
        "BEST_OF": int(r.best_of),
        "DRAW_SIZE": int(r.draw_size),
        "H2H_DIFF": float(h2h_wins[(w, l)] - h2h_wins[(l, w)]),
        "H2H_SURFACE_DIFF": float(
            h2h_wins_surface[(w, l, surface)] - h2h_wins_surface[(l, w, surface)]
        ),
        "DIFF_N_GAMES": float(matches_played[w] - matches_played[l]),
        "TOURNEY_DATE": int(r.tourney_date),
    }

    for k in win_windows:
        rec[f"WIN_LAST_{k}_DIFF"] = safe_mean(win_hist[k][w]) - safe_mean(
            win_hist[k][l]
        )

    for k in serve_windows:
        rec[f"P_ACE_LAST_{k}_DIFF"] = safe_mean(
            serve_hist[k][w]["p_ace"]
        ) - safe_mean(serve_hist[k][l]["p_ace"])
        rec[f"P_DF_LAST_{k}_DIFF"] = safe_mean(serve_hist[k][w]["p_df"]) - safe_mean(
            serve_hist[k][l]["p_df"]
        )
        rec[f"P_1ST_IN_LAST_{k}_DIFF"] = safe_mean(
            serve_hist[k][w]["p_1stIn"]
        ) - safe_mean(serve_hist[k][l]["p_1stIn"])
        rec[f"P_1ST_WON_LAST_{k}_DIFF"] = safe_mean(
            serve_hist[k][w]["p_1stWon"]
        ) - safe_mean(serve_hist[k][l]["p_1stWon"])
        rec[f"P_2ND_WON_LAST_{k}_DIFF"] = safe_mean(
            serve_hist[k][w]["p_2ndWon"]
        ) - safe_mean(serve_hist[k][l]["p_2ndWon"])
        rec[f"P_BP_SAVED_LAST_{k}_DIFF"] = safe_mean(
            serve_hist[k][w]["p_bpSaved"]
        ) - safe_mean(serve_hist[k][l]["p_bpSaved"])

    # Elo features
    rec["ELO_DIFF"] = float(elo_overall[w] - elo_overall[l])
    rec["ELO_SURFACE_DIFF"] = float(elo_surface[w][surface] - elo_surface[l][surface])

    for k in elo_grad_windows:
        rec[f"ELO_GRAD_{k}_DIFF"] = safe_mean(elo_delta_hist[k][w]) - safe_mean(
            elo_delta_hist[k][l]
        )

    # ----- POST MATCH STATE UPDATE -----
    # Update serve stats
    update_serve(
        w,
        r.w_ace,
        r.w_df,
        r.w_svpt,
        r.w_1stIn,
        r.w_1stWon,
        r.w_2ndWon,
        r.w_bpSaved,
        r.w_bpFaced,
    )
    update_serve(
        l,
        r.l_ace,
        r.l_df,
        r.l_svpt,
        r.l_1stIn,
        r.l_1stWon,
        r.l_2ndWon,
        r.l_bpSaved,
        r.l_bpFaced,
    )

    # Update win history
    for k in win_windows:
        win_hist[k][w].append(1.0)
        win_hist[k][l].append(0.0)

    # Update H2H
    h2h_wins[(w, l)] += 1
    h2h_wins_surface[(w, l, surface)] += 1

    # Update Elo
    K = k_factor(int(r.best_of))
    ew = expected_score(elo_overall[w], elo_overall[l])
    delta = K * (1.0 - ew)
    elo_overall[w] += delta
    elo_overall[l] -= delta

    ews = expected_score(elo_surface[w][surface], elo_surface[l][surface])
    delta_s = K * (1.0 - ews)
    elo_surface[w][surface] += delta_s
    elo_surface[l][surface] -= delta_s

    # Record Elo gradients
    for k in elo_grad_windows:
        elo_delta_hist[k][w].append(delta)
        elo_delta_hist[k][l].append(-delta)

    # Update match count
    matches_played[w] += 1
    matches_played[l] += 1

    # Randomize player order to prevent label leakage
    if rng.random() < 0.5:
        # swap players
        rec["PLAYER_1"], rec["PLAYER_2"] = rec["PLAYER_2"], rec["PLAYER_1"]
        rec["RESULT"] = 0
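        # Flip every signed difference. DIFF_N_GAMES starts with "DIFF_" rather
        # than ending in "_DIFF", so match it explicitly as well.
        for key in rec:
            if key.endswith("_DIFF") or key.startswith("DIFF_"):
                rec[key] = -rec[key]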
    else:
        rec["RESULT"] = 1

    rows.append(rec)


final_features = pd.DataFrame(rows)
final_features.head()
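
As a worked example of the Elo update in the loop above, the sketch below reuses the same expected-score formula with K = 32. The ratings are made up; they show that an expected win moves ratings very little while an upset moves them a lot.

```python
def expected_score(a, b):
    # Probability that a player rated `a` beats a player rated `b`.
    return 1.0 / (1.0 + 10.0 ** ((b - a) / 400.0))


K = 32.0

# Equal ratings: expected score 0.5, so the winner gains K / 2.
even_gain = K * (1.0 - expected_score(1500.0, 1500.0))

# Favorite rated 400 points higher: expected score 10/11.
fav_expected = expected_score(1900.0, 1500.0)
fav_gain = K * (1.0 - fav_expected)

# Underdog wins instead: expected score 1/11, so the gain is much larger.
upset_gain = K * (1.0 - expected_score(1500.0, 1900.0))

print(round(even_gain, 3), round(fav_gain, 3), round(upset_gain, 3))
```

In a best-of-5 match the notebook scales K by 1.15, so each of these gains would be 15% larger.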

Save processed feature matrix

Export the engineered dataset so training and comparison can be done in standalone .py scripts.

In [None]:
OUT_PATH = "data/processed/final_features.parquet"
os.makedirs(os.path.dirname(OUT_PATH), exist_ok=True)
final_features.to_parquet(OUT_PATH, index=False)
print("saved:", OUT_PATH, "shape:", final_features.shape)

Plot: class balance

Confirm RESULT is roughly balanced because we randomized player order.

In [None]:
plt.figure()
final_features["RESULT"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("RESULT (1 = PLAYER_1 wins)")
plt.ylabel("Count")
plt.title("Class Balance")
plt.show()

Plot: feature distributions

Visualize key engineered features to verify reasonable ranges and detect extreme outliers.

In [None]:
cols = ["ATP_POINT_DIFF", "ATP_RANK_DIFF", "ELO_DIFF", "H2H_DIFF", "WIN_LAST_50_DIFF"]
for c in cols:
    plt.figure()
    plt.hist(final_features[c], bins=60)
    plt.title(f"Distribution: {c}")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()

Plot: correlation among key features

Check the correlation structure among high-level features (Elo, rank/points, win form). Tree-based models tolerate correlated inputs, but the heatmap helps identify redundant features.

In [None]:
corr_cols = [
    "ATP_POINT_DIFF",
    "ATP_RANK_DIFF",
    "ELO_DIFF",
    "ELO_SURFACE_DIFF",
    "WIN_LAST_10_DIFF",
    "WIN_LAST_50_DIFF",
    "H2H_DIFF",
    "H2H_SURFACE_DIFF",
]
C = final_features[corr_cols].corr().values


plt.figure()
plt.imshow(C, aspect="auto")
plt.xticks(range(len(corr_cols)), corr_cols, rotation=45, ha="right")
plt.yticks(range(len(corr_cols)), corr_cols)
plt.title("Correlation Heatmap (selected features)")
plt.colorbar()
plt.tight_layout()
plt.show()
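
A heatmap is easier to read alongside a ranked list of the most correlated pairs. The sketch below builds one on synthetic columns (A, B, C are hypothetical names, with B constructed as a noisy copy of A); the same steps could be applied to final_features[corr_cols].

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "A": x,
        "B": x + 0.1 * rng.normal(size=500),  # nearly a copy of A
        "C": rng.normal(size=500),  # independent of A and B
    }
)

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

print(pairs)
```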

Player ID → Name mapping

Load player metadata so we can select famous players and plot their Elo trajectories.

In [None]:
players_path = f"{DATA_DIR}/atp_players.csv"
players = pd.read_csv(players_path)
players["name"] = (
    players["name_first"].fillna("") + " " + players["name_last"].fillna("")
)
player_name = dict(zip(players["player_id"], players["name"]))
players.head()

Elo time-series extraction

Select the players to track here; the next cell re-runs a lightweight chronological pass that records overall and surface-specific Elo after each of their matches.

In [None]:
selected_names = ["Novak Djokovic", "Rafael Nadal", "Roger Federer", "Andy Murray"]
selected_ids = [pid for pid, nm in player_name.items() if nm in selected_names]
selected_ids

Build Elo history table

Construct a long-format dataset of Elo points over time for each selected player and each surface.

In [None]:
from collections import defaultdict


ELO_BASE = 1500.0
K_BASE = 32.0


elo_overall_ts = defaultdict(list)  # pid -> list of (date, elo)
elo_surface_ts = defaultdict(
    lambda: defaultdict(list)
)  # pid -> surface -> list of (date, elo)


elo_o = defaultdict(lambda: ELO_BASE)
elo_s = defaultdict(lambda: defaultdict(lambda: ELO_BASE))


def expected_score(a, b):
    return 1.0 / (1.0 + 10.0 ** ((b - a) / 400.0))


def k_factor(best_of):
    return K_BASE * (1.15 if best_of == 5 else 1.0)


for r in tqdm(df.itertuples(index=False), total=len(df)):
    w = int(r.winner_id)
    l = int(r.loser_id)
    surface = str(r.surface)
    date = int(r.tourney_date)

    K = k_factor(int(r.best_of))

    ew = expected_score(elo_o[w], elo_o[l])
    delta = K * (1.0 - ew)
    elo_o[w] += delta
    elo_o[l] -= delta

    ews = expected_score(elo_s[w][surface], elo_s[l][surface])
    delta_s = K * (1.0 - ews)
    elo_s[w][surface] += delta_s
    elo_s[l][surface] -= delta_s

    # record if selected
    if w in selected_ids:
        elo_overall_ts[w].append((date, float(elo_o[w])))
        elo_surface_ts[w][surface].append((date, float(elo_s[w][surface])))
    if l in selected_ids:
        elo_overall_ts[l].append((date, float(elo_o[l])))
        elo_surface_ts[l][surface].append((date, float(elo_s[l][surface])))


# convert to dataframe
rows_ts = []
for pid, pts in elo_overall_ts.items():
    for d, e in pts:
        rows_ts.append(
            {
                "player_id": pid,
                "name": player_name.get(pid, str(pid)),
                "date": d,
                "elo": e,
                "type": "overall",
            }
        )


for pid, bys in elo_surface_ts.items():
    for surf, pts in bys.items():
        for d, e in pts:
            rows_ts.append(
                {
                    "player_id": pid,
                    "name": player_name.get(pid, str(pid)),
                    "date": d,
                    "elo": e,
                    "type": f"surface:{surf}",
                }
            )


elo_df = pd.DataFrame(rows_ts)
elo_df.head()

Plot: Surface-specific Elo

Compare how a single player performs across different surfaces by plotting surface Elo time-series.

In [None]:
surfaces = ["Hard", "Clay", "Grass"]


for name in selected_names:
    pid = [p for p, nm in player_name.items() if nm == name][0]
    plt.figure()
    for s in surfaces:
        sub = elo_df[
            (elo_df["player_id"] == pid) & (elo_df["type"] == f"surface:{s}")
        ].sort_values("date")
        if len(sub) == 0:
            continue
        plt.plot(sub["date"].values, sub["elo"].values, label=s)
    plt.title(f"Surface Elo: {name}")
    plt.xlabel("Date (YYYYMMDD)")
    plt.ylabel("Surface Elo")
    plt.legend()
    plt.show()
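
One caveat with plotting YYYYMMDD integers directly is that the x-axis spacing is uneven: the numeric gap across a year boundary is far larger than the gap between two weeks in the same month. A possible remedy, sketched here on made-up dates, is to convert the integers to datetimes before plotting.

```python
import pandas as pd

# Three consecutive Mondays spanning a year boundary (hypothetical dates).
dates_int = pd.Series([20191230, 20200106, 20200113])

# Parse the integers as calendar dates.
dates = pd.to_datetime(dates_int.astype(str), format="%Y%m%d")

# Gaps as the plot would see them with raw integers vs. real dates.
gaps_int = dates_int.diff().dropna().tolist()
gaps_days = dates.diff().dt.days.dropna().tolist()

print(gaps_int)
print(gaps_days)
```

With this conversion, sub["date"] could be parsed the same way before the plt.plot call, and the axis label would no longer need the YYYYMMDD note.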

Quick baseline (sklearn tree)

Train a quick baseline model inside the notebook to validate the pipeline. The full training workflow lives in src/train_sklearn.py.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


split_date = 20221231
train_mask = final_features["TOURNEY_DATE"].values <= split_date


y = final_features["RESULT"].astype(int).values
X = final_features.drop(columns=["RESULT", "TOURNEY_DATE"]).values


X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]


model = DecisionTreeClassifier(
    max_depth=6, min_samples_split=200, min_samples_leaf=100, random_state=42
)
model.fit(X_train, y_train)


pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
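
The split above is by date rather than at random, so the model is always evaluated on matches played after everything it was trained on. A minimal sketch of that property on a toy frame (the dates and labels are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "TOURNEY_DATE": [20211101, 20220815, 20221226, 20230102, 20230619],
        "RESULT": [1, 0, 1, 0, 1],
    }
)

split_date = 20221231
train = toy[toy["TOURNEY_DATE"] <= split_date]
test = toy[toy["TOURNEY_DATE"] > split_date]

# Every training match precedes every test match.
no_overlap = train["TOURNEY_DATE"].max() < test["TOURNEY_DATE"].min()
print(len(train), len(test), no_overlap)
```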