# Introduction
- This project builds a cleaned, feature‑rich ATP singles match dataset (1992–2024), enriches it with ELO ratings, and performs EDA and modeling.

### Key visuals: 
- 2nd‑serve impact
- top players by surface
- yearly top ELO
- top‑5 ELO trajectories.
- age vs performance

### ML goal: 
 - Predict match winners (balance labels, clean outliers, train & evaluate classifiers).

# Changes since the proposal
We swapped the stock project for ATP match analysis after chatting with the professor. Now we’re building a cleaned dataset (1992–2024) with ELOs, doing quick EDA (2nd‑serve impact, surface specialists, yearly top ELO, top‑5 ELO trends) and training models to predict match winners.

# Dataset Construction:
We build final_data by taking each match (winner vs loser) and converting raw match fields into player‑pair difference features (PLAYER_1 − PLAYER_2).
This yields direct comparative predictors for modeling match outcomes.
## Key Derived Columns:
- *Ranking/Points:* ATP_POINT_DIFF = winner_rank_points - loser_rank_points, ATP_RANK_DIFF = winner_rank - loser_rank.
- *Demographics:* AGE_DIFF, HEIGHT_DIFF.
- *Match Context:* BEST_OF, DRAW_SIZE.
- *Head‑to‑Head:* H2H_DIFF and H2H_SURFACE_DIFF computed from prior matches (overall and surface‑specific).
- *Rolling Performance:* For k ∈ {3,5,10,20,50,100,200,300,2000}, percent‑based serve/defense differences (e.g., P_ACE_LAST_k_DIFF, P_1ST_WON_LAST_k_DIFF) are computed with per‑player rolling deques and sampled before each match.
- *Recent Form:* WIN_LAST_k_DIFF = recent win‑rate difference using rolling windows (k ∈ {3,5,10,20,50,100,200,300,2000}).
- *ELO:* Merged atp_elo_ratings.csv (name normalization) to compute ELO_DIFF and SURF_ELO_DIFF (surface‑specific ELO when available).

## Import packages here

In [1]:
import pandas as pd

## Combine data of all years

In [2]:
all_data = pd.DataFrame()
for year in range(1992, 2025):
    file = "./Data/SingleMatches/atp_matches_"+str(year)+'.csv'

    year_data = pd.read_csv(file)

    all_data = pd.concat([all_data, year_data], axis = 0)

all_data

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,1992-339,Adelaide,Hard,32,A,19911230,1,101964,1.0,,...,34.0,23.0,6.0,9.0,0.0,3.0,16.0,,80.0,
1,1992-339,Adelaide,Hard,32,A,19911230,2,101924,,,...,65.0,39.0,9.0,10.0,8.0,12.0,65.0,,63.0,
2,1992-339,Adelaide,Hard,32,A,19911230,3,101195,,,...,68.0,45.0,22.0,16.0,8.0,12.0,62.0,,730.0,
3,1992-339,Adelaide,Hard,32,A,19911230,4,101820,,,...,49.0,34.0,16.0,14.0,1.0,4.0,60.0,,42.0,
4,1992-339,Adelaide,Hard,32,A,19911230,5,100870,,,...,95.0,65.0,15.0,18.0,5.0,7.0,68.0,,32.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3071,2024-M-DC-2024-WG2-PO-URU-MDA-01,Davis Cup WG2 PO: URU vs MDA,Clay,4,D,20240203,5,212051,,,...,30.0,17.0,7.0,6.0,8.0,14.0,1109.0,8.0,740.0,34.0
3072,2024-M-DC-2024-WG2-PO-VIE-RSA-01,Davis Cup WG2 PO: VIE vs RSA,Hard,4,D,20240202,1,122533,,,...,41.0,25.0,6.0,9.0,1.0,4.0,554.0,67.0,748.0,32.0
3073,2024-M-DC-2024-WG2-PO-VIE-RSA-01,Davis Cup WG2 PO: VIE vs RSA,Hard,4,D,20240202,2,144748,,,...,51.0,25.0,7.0,11.0,5.0,12.0,416.0,109.0,,
3074,2024-M-DC-2024-WG2-PO-VIE-RSA-01,Davis Cup WG2 PO: VIE vs RSA,Hard,4,D,20240202,4,122533,,,...,51.0,32.0,17.0,14.0,5.0,9.0,554.0,67.0,416.0,109.0


In [3]:
all_data.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points'],
      dtype='object')

## Filter out missing data

In [4]:
filtered_data = all_data.dropna(subset=[
        'winner_id', 'winner_name', 'loser_id', 'loser_name', 'winner_ht', 'winner_age', 'loser_ht', 'loser_age',
       'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms',
        'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt', 'l_1stIn', 'l_1stWon', 
        'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced', 'winner_rank', 'winner_rank_points',
        'loser_rank', 'loser_rank_points', 'surface'])

filtered_data = filtered_data.reset_index(drop=True)            #starts index back from zero

filtered_data

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,1992-301,Auckland,Hard,32,A,19920106,1,101797,,,...,50.0,40.0,24.0,12.0,4.0,4.0,110.0,333.0,8.0,1599.0
1,1992-301,Auckland,Hard,32,A,19920106,2,101205,,,...,58.0,28.0,10.0,9.0,5.0,8.0,78.0,462.0,101.0,378.0
2,1992-301,Auckland,Hard,32,A,19920106,3,101368,,,...,28.0,19.0,11.0,9.0,4.0,8.0,82.0,436.0,1059.0,3.0
3,1992-301,Auckland,Hard,32,A,19920106,4,100772,,WC,...,46.0,31.0,16.0,12.0,3.0,6.0,171.0,201.0,52.0,607.0
4,1992-301,Auckland,Hard,32,A,19920106,5,101532,4.0,,...,57.0,39.0,21.0,14.0,9.0,12.0,30.0,837.0,99.0,383.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92328,2024-M-DC-2024-WG2-PO-TOG-INA-01,Davis Cup WG2 PO: TOG vs INA,Hard,4,D,20240203,4,207134,,,...,35.0,19.0,5.0,8.0,5.0,9.0,569.0,64.0,819.0,24.0
92329,2024-M-DC-2024-WG2-PO-TUN-CRC-01,Davis Cup WG2 PO: TUN vs CRC,Hard,4,D,20240202,1,121411,,,...,30.0,17.0,9.0,8.0,2.0,5.0,279.0,205.0,900.0,18.0
92330,2024-M-DC-2024-WG2-PO-URU-MDA-01,Davis Cup WG2 PO: URU vs MDA,Clay,4,D,20240203,1,208364,,,...,52.0,24.0,18.0,12.0,8.0,16.0,616.0,55.0,740.0,34.0
92331,2024-M-DC-2024-WG2-PO-URU-MDA-01,Davis Cup WG2 PO: URU vs MDA,Clay,4,D,20240203,4,105430,,,...,66.0,33.0,6.0,8.0,6.0,11.0,136.0,489.0,616.0,55.0


### Calculating Final Data and Features

In [5]:
final_data = pd.DataFrame()
final_data['WINNER_ID'] = filtered_data['winner_id']
final_data['WINNER_NAME'] = filtered_data['winner_name']
final_data['LOSER_ID'] = filtered_data['loser_id']
final_data['LOSER_NAME'] = filtered_data['loser_name']
final_data['ATP_POINT_DIFF'] = filtered_data['winner_rank_points'] - filtered_data['loser_rank_points']
final_data['ATP_RANK_DIFF'] = filtered_data['winner_rank'] - filtered_data['loser_rank']
final_data['AGE_DIFF'] = filtered_data['winner_age'] - filtered_data['loser_age']
final_data['HEIGHT_DIFF'] = filtered_data['winner_ht'] - filtered_data['loser_ht']
final_data['BEST_OF'] = filtered_data['best_of']
final_data['DRAW_SIZE'] = filtered_data['draw_size']

final_data

Unnamed: 0,WINNER_ID,WINNER_NAME,LOSER_ID,LOSER_NAME,ATP_POINT_DIFF,ATP_RANK_DIFF,AGE_DIFF,HEIGHT_DIFF,BEST_OF,DRAW_SIZE
0,101797,Jacco Eltingh,101120,Karel Novacek,-1266.0,102.0,-5.4,-2.0,3,32
1,101205,Grant Connell,101767,Lars Jonsson,84.0,-23.0,4.6,-3.0,3,32
2,101368,Christian Miniussi,102536,James Greenhalgh,433.0,-977.0,7.7,2.0,3,32
3,100772,Kelly Evernden,101746,Renzo Furlan,-406.0,119.0,8.6,0.0,3,32
4,101532,Francisco Clavet,101119,Marian Vajda,454.0,-69.0,-3.5,10.0,3,32
...,...,...,...,...,...,...,...,...,...,...
92328,207134,Fitriadi M Rifqi,133933,Thomas Yaka Kofi Setodji,40.0,-250.0,-3.2,-8.0,3,4
92329,121411,Moez Echargui,132374,Jesse Flores,187.0,-621.0,2.2,-10.0,3,4
92330,208364,Franco Roncadelli,209943,Ilya Snitari,21.0,-124.0,2.1,-3.0,3,4
92331,105430,Radu Albot,208364,Franco Roncadelli,434.0,-480.0,10.3,-10.0,3,4


### Head-to-Head Feature Calculation

Calculated two features:
- **H2H_DIFF:** Overall head-to-head win difference between the two players before the current match.  
- **H2H_SURFACE_DIFF:** Head-to-head win difference on the same surface.  

Both are computed by tracking previous match results for each player pair and updating after every match.


In [6]:
# Calculate H2H and H2H on that surface
from collections import defaultdict
from tqdm import tqdm

h2h_surface_dict = defaultdict(lambda: defaultdict(int))
h2h_dict = defaultdict(int)
total_h2h_surface = []
total_h2h = []

for idx, (w_id, l_id, surface) in enumerate(tqdm(zip(filtered_data['winner_id'],
                                                     filtered_data['loser_id'],
                                                     filtered_data['surface']),
                                                 total=len(filtered_data))):
    wins = h2h_dict[(w_id, l_id)]
    loses = h2h_dict[(l_id, w_id)]

    wins_surface = h2h_surface_dict[surface][(w_id, l_id)]
    loses_surface = h2h_surface_dict[surface][(l_id, w_id)]

    total_h2h.append(wins - loses)
    total_h2h_surface.append(wins_surface - loses_surface)

    h2h_dict[(w_id, l_id)] += 1
    h2h_surface_dict[surface][(w_id, l_id)] += 1

final_data["H2H_DIFF"] = total_h2h
final_data["H2H_SURFACE_DIFF"] = total_h2h_surface

final_data


100%|██████████| 92333/92333 [00:00<00:00, 709986.29it/s]


Unnamed: 0,WINNER_ID,WINNER_NAME,LOSER_ID,LOSER_NAME,ATP_POINT_DIFF,ATP_RANK_DIFF,AGE_DIFF,HEIGHT_DIFF,BEST_OF,DRAW_SIZE,H2H_DIFF,H2H_SURFACE_DIFF
0,101797,Jacco Eltingh,101120,Karel Novacek,-1266.0,102.0,-5.4,-2.0,3,32,0,0
1,101205,Grant Connell,101767,Lars Jonsson,84.0,-23.0,4.6,-3.0,3,32,0,0
2,101368,Christian Miniussi,102536,James Greenhalgh,433.0,-977.0,7.7,2.0,3,32,0,0
3,100772,Kelly Evernden,101746,Renzo Furlan,-406.0,119.0,8.6,0.0,3,32,0,0
4,101532,Francisco Clavet,101119,Marian Vajda,454.0,-69.0,-3.5,10.0,3,32,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
92328,207134,Fitriadi M Rifqi,133933,Thomas Yaka Kofi Setodji,40.0,-250.0,-3.2,-8.0,3,4,0,0
92329,121411,Moez Echargui,132374,Jesse Flores,187.0,-621.0,2.2,-10.0,3,4,0,0
92330,208364,Franco Roncadelli,209943,Ilya Snitari,21.0,-124.0,2.1,-3.0,3,4,0,0
92331,105430,Radu Albot,208364,Franco Roncadelli,434.0,-480.0,10.3,-10.0,3,4,0,0


###  Rolling Serve/Performance Feature Calculation

For each match and window size *k* ∈ {3,5,10,20,50,100,200,300,2000}, we compute PLAYER_1 − PLAYER_2 differences of recent averages for: %Aces, %Double Faults, %1st Serve In, %1st Serve Won, %2nd Serve Won, and %Break Points Saved.  
We iterate chronologically, use per-player deques (maxlen = k) to track last-k percentages, read pre-match means, then update with the current match stats.


In [7]:
from collections import defaultdict, deque
import numpy as np


def smean(seq):
    """Safe mean -> 0.0 when empty."""
    return float(np.mean(seq)) if seq else 0.0

all_data_filtered = filtered_data.sort_values(
    ['tourney_date', 'match_num'], kind='mergesort'
).reset_index(drop=True)


for k in [3, 5, 10, 20, 50, 100, 200, 300, 2000]:
  
    last_k_matches = defaultdict(lambda: defaultdict(lambda: deque(maxlen=k)))

    
    p_ace_k      = []
    p_df_k       = []
    p_1stIn_k    = []
    p_1stWon_k   = []
    p_2ndWon_k   = []
    p_bpSaved_k  = []

   
    for row in all_data_filtered.itertuples(index=False):
        
        w_id, l_id = row.winner_id, row.loser_id

       
        w_ace,   l_ace   = row.w_ace,   row.l_ace
        w_df,    l_df    = row.w_df,    row.l_df
        w_svpt,  l_svpt  = row.w_svpt,  row.l_svpt
        w_1stIn, l_1stIn = row.w_1stIn, row.l_1stIn
        w_1stWon,l_1stWon= row.w_1stWon,row.l_1stWon
        w_2ndWon,l_2ndWon= row.w_2ndWon,row.l_2ndWon
        w_SvGms,l_SvGms  = row.w_SvGms, row.l_SvGms
        w_bpSaved,l_bpSaved = row.w_bpSaved, row.l_bpSaved
        w_bpFaced,l_bpFaced = row.w_bpFaced, row.l_bpFaced

        
        p_ace_k.append( smean(last_k_matches[w_id]["p_ace"]) - smean(last_k_matches[l_id]["p_ace"]))
        p_df_k.append( smean(last_k_matches[w_id]["p_df"]) - smean(last_k_matches[l_id]["p_df"]))
        p_1stIn_k.append( smean(last_k_matches[w_id]["p_1stIn"]) - smean(last_k_matches[l_id]["p_1stIn"]))
        p_1stWon_k.append( smean(last_k_matches[w_id]["p_1stWon"]) - smean(last_k_matches[l_id]["p_1stWon"]))
        p_2ndWon_k.append( smean(last_k_matches[w_id]["p_2ndWon"]) - smean(last_k_matches[l_id]["p_2ndWon"]))
        p_bpSaved_k.append( smean(last_k_matches[w_id]["p_bpSaved"]) - smean(last_k_matches[l_id]["p_bpSaved"]))

        
        # Winner percentages
        if w_svpt != 0:
            last_k_matches[w_id]["p_ace"].append(100.0 * (w_ace / w_svpt))
            last_k_matches[w_id]["p_df"].append(100.0 * (w_df / w_svpt))
            last_k_matches[w_id]["p_1stIn"].append(100.0 * (w_1stIn / w_svpt))
        if w_1stIn != 0:
            last_k_matches[w_id]["p_1stWon"].append(100.0 * (w_1stWon / w_1stIn))
        if (w_svpt - w_1stIn) != 0:
            last_k_matches[w_id]["p_2ndWon"].append(100.0 * (w_2ndWon / (w_svpt - w_1stIn)))
        if w_bpFaced != 0:
            last_k_matches[w_id]["p_bpSaved"].append(100.0 * (w_bpSaved / w_bpFaced))

        # Loser percentages
        if l_svpt != 0:
            last_k_matches[l_id]["p_ace"].append(100.0 * (l_ace / l_svpt))
            last_k_matches[l_id]["p_df"].append(100.0 * (l_df / l_svpt))
            last_k_matches[l_id]["p_1stIn"].append(100.0 * (l_1stIn / l_svpt))
        if l_1stIn != 0:
            last_k_matches[l_id]["p_1stWon"].append(100.0 * (l_1stWon / l_1stIn))
        if (l_svpt - l_1stIn) != 0:
            last_k_matches[l_id]["p_2ndWon"].append(100.0 * (l_2ndWon / (l_svpt - l_1stIn)))
        if l_bpFaced != 0:
            last_k_matches[l_id]["p_bpSaved"].append(100.0 * (l_bpSaved / l_bpFaced))

    final_data[f"P_ACE_LAST_{k}_DIFF"]      = p_ace_k
    final_data[f"P_DF_LAST_{k}_DIFF"]       = p_df_k
    final_data[f"P_1ST_IN_LAST_{k}_DIFF"]   = p_1stIn_k
    final_data[f"P_1ST_WON_LAST_{k}_DIFF"]  = p_1stWon_k
    final_data[f"P_2ND_WON_LAST_{k}_DIFF"]  = p_2ndWon_k
    final_data[f"P_BP_SAVED_LAST_{k}_DIFF"] = p_bpSaved_k


In [8]:
final_data

Unnamed: 0,WINNER_ID,WINNER_NAME,LOSER_ID,LOSER_NAME,ATP_POINT_DIFF,ATP_RANK_DIFF,AGE_DIFF,HEIGHT_DIFF,BEST_OF,DRAW_SIZE,...,P_1ST_IN_LAST_300_DIFF,P_1ST_WON_LAST_300_DIFF,P_2ND_WON_LAST_300_DIFF,P_BP_SAVED_LAST_300_DIFF,P_ACE_LAST_2000_DIFF,P_DF_LAST_2000_DIFF,P_1ST_IN_LAST_2000_DIFF,P_1ST_WON_LAST_2000_DIFF,P_2ND_WON_LAST_2000_DIFF,P_BP_SAVED_LAST_2000_DIFF
0,101797,Jacco Eltingh,101120,Karel Novacek,-1266.0,102.0,-5.4,-2.0,3,32,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,101205,Grant Connell,101767,Lars Jonsson,84.0,-23.0,4.6,-3.0,3,32,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,101368,Christian Miniussi,102536,James Greenhalgh,433.0,-977.0,7.7,2.0,3,32,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,100772,Kelly Evernden,101746,Renzo Furlan,-406.0,119.0,8.6,0.0,3,32,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,101532,Francisco Clavet,101119,Marian Vajda,454.0,-69.0,-3.5,10.0,3,32,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92328,207134,Fitriadi M Rifqi,133933,Thomas Yaka Kofi Setodji,40.0,-250.0,-3.2,-8.0,3,4,...,6.622831,-13.411429,0.291809,2.718619,-4.685967,-1.811381,6.622831,-13.411429,0.291809,2.718619
92329,121411,Moez Echargui,132374,Jesse Flores,187.0,-621.0,2.2,-10.0,3,4,...,1.269346,1.862017,2.029024,0.848561,-3.006825,0.054561,1.269346,1.862017,2.029024,0.848561
92330,208364,Franco Roncadelli,209943,Ilya Snitari,21.0,-124.0,2.1,-3.0,3,4,...,0.126691,-5.910131,-1.772576,5.050027,-1.635691,-0.493191,0.126691,-5.910131,-1.772576,5.050027
92331,105430,Radu Albot,208364,Franco Roncadelli,434.0,-480.0,10.3,-10.0,3,4,...,5.002503,0.485732,6.268410,-5.179438,1.322220,-1.550933,5.002503,0.485732,6.268410,-5.179438


### Recent Form (Win Rate) Feature Calculation

For each player, we tracked their win rate over the last *k* matches where *k* ∈ {3, 5, 10, 25, 50, 100}.  
Before every match, we computed the difference in recent win percentages between PLAYER_1 and PLAYER_2, then updated the rolling records (1 = win, 0 = loss).  
Finally, a `RESULT` column was added as 1 since each row represents the winner’s perspective.


In [9]:
from collections import defaultdict, deque
import numpy as np

for k in [3, 5, 10, 25, 50, 100]:
    recent_results = defaultdict(lambda: deque(maxlen=k))
    win_last_k = []

    for row in filtered_data.itertuples(index=False):
        w_id, l_id = row.winner_id, row.loser_id

        # Past win % before current match
        w_winrate = np.mean(recent_results[w_id]) if recent_results[w_id] else 0
        l_winrate = np.mean(recent_results[l_id]) if recent_results[l_id] else 0
        win_last_k.append(w_winrate - l_winrate)

        # Update running results (1 = win, 0 = loss)
        recent_results[w_id].append(1)
        recent_results[l_id].append(0)

    final_data[f'WIN_LAST_{k}_DIFF'] = win_last_k
    
final_data['RESULT'] = 1  # winner always 1 since each row is from winner’s perspective



In [10]:
final_data.columns

Index(['WINNER_ID', 'WINNER_NAME', 'LOSER_ID', 'LOSER_NAME', 'ATP_POINT_DIFF',
       'ATP_RANK_DIFF', 'AGE_DIFF', 'HEIGHT_DIFF', 'BEST_OF', 'DRAW_SIZE',
       'H2H_DIFF', 'H2H_SURFACE_DIFF', 'P_ACE_LAST_3_DIFF', 'P_DF_LAST_3_DIFF',
       'P_1ST_IN_LAST_3_DIFF', 'P_1ST_WON_LAST_3_DIFF',
       'P_2ND_WON_LAST_3_DIFF', 'P_BP_SAVED_LAST_3_DIFF', 'P_ACE_LAST_5_DIFF',
       'P_DF_LAST_5_DIFF', 'P_1ST_IN_LAST_5_DIFF', 'P_1ST_WON_LAST_5_DIFF',
       'P_2ND_WON_LAST_5_DIFF', 'P_BP_SAVED_LAST_5_DIFF', 'P_ACE_LAST_10_DIFF',
       'P_DF_LAST_10_DIFF', 'P_1ST_IN_LAST_10_DIFF', 'P_1ST_WON_LAST_10_DIFF',
       'P_2ND_WON_LAST_10_DIFF', 'P_BP_SAVED_LAST_10_DIFF',
       'P_ACE_LAST_20_DIFF', 'P_DF_LAST_20_DIFF', 'P_1ST_IN_LAST_20_DIFF',
       'P_1ST_WON_LAST_20_DIFF', 'P_2ND_WON_LAST_20_DIFF',
       'P_BP_SAVED_LAST_20_DIFF', 'P_ACE_LAST_50_DIFF', 'P_DF_LAST_50_DIFF',
       'P_1ST_IN_LAST_50_DIFF', 'P_1ST_WON_LAST_50_DIFF',
       'P_2ND_WON_LAST_50_DIFF', 'P_BP_SAVED_LAST_50_DIFF',
       

##  Feature Description 

###  Player Identifiers
| Column | Description |
|---------|--------------|
| **PLAYER_1** | ID of the first player (winner in original data or perspective player after balancing). |
| **PLAYER_2** | ID of the opponent (loser in original data or the other player in balanced version). |

---

###  Basic Match & Ranking Differences
| Column | Description |
|---------|--------------|
| **ATP_POINT_DIFF** | Difference in ATP ranking points between PLAYER_1 and PLAYER_2 *(PLAYER_1 − PLAYER_2)*. |
| **ATP_RANK_DIFF** | Difference in ATP ranking positions *(PLAYER_1 − PLAYER_2)* — lower means better rank. |
| **AGE_DIFF** | Difference in age between PLAYER_1 and PLAYER_2 (in years). |
| **HEIGHT_DIFF** | Height difference between PLAYER_1 and PLAYER_2 (in cm). |
| **BEST_OF** | Number of sets played in the match (e.g., 3 or 5). |
| **DRAW_SIZE** | Tournament draw size (e.g., 32, 64, 128 players). |

---

###  Head-to-Head Performance
| Column | Description |
|---------|--------------|
| **H2H_DIFF** | Overall win–loss record difference between PLAYER_1 and PLAYER_2 before this match. |
| **H2H_SURFACE_DIFF** | Win–loss difference between PLAYER_1 and PLAYER_2 *on the same surface* (Hard/Clay/Grass) before this match. |

---

###  Serve & Performance Statistics (Rolling Windows)
Each feature represents the **difference** in averages between PLAYER_1 and PLAYER_2 over the **last *k* matches**.

| Category | Example Columns | Description |
|-----------|------------------|--------------|
| **Aces** | `P_ACE_LAST_k_DIFF` | Difference in % of points won by aces in the last *k* matches. |
| **Double Faults** | `P_DF_LAST_k_DIFF` | Difference in % of double faults made in the last *k* matches. |
| **First Serve In** | `P_1ST_IN_LAST_k_DIFF` | Difference in % of first serves successfully landed in play. |
| **First Serve Won** | `P_1ST_WON_LAST_k_DIFF` | Difference in % of points won on first serve. |
| **Second Serve Won** | `P_2ND_WON_LAST_k_DIFF` | Difference in % of points won on second serve. |
| **Break Points Saved** | `P_BP_SAVED_LAST_k_DIFF` | Difference in % of break points saved (defensive success under pressure). |

Where *k ∈ {3, 5, 10, 20, 50, 100, 200, 300, 2000}* — representing the rolling window size.

---

### Form / Momentum Features
| Column | Description |
|---------|--------------|
| **WIN_LAST_3_DIFF** | Difference in win rate between PLAYER_1 and PLAYER_2 over their last 3 matches. |
| **WIN_LAST_5_DIFF** | Difference in win rate over last 5 matches. |
| **WIN_LAST_10_DIFF** | Difference in win rate over last 10 matches. |
| **WIN_LAST_25_DIFF** | Difference in win rate over last 25 matches. |
| **WIN_LAST_50_DIFF** | Difference in win rate over last 50 matches. |
| **WIN_LAST_100_DIFF** | Difference in win rate over last 100 matches. |

These features capture short-term and long-term momentum or player form.

---

###  Target Variable
| Column | Description |
|---------|--------------|
| **RESULT** | Match outcome label from PLAYER_1’s perspective — `1` if PLAYER_1 won, `0` if lost (after balancing). Always `1` in the original winner-only dataset. |

---



In [11]:
final_data.head()

Unnamed: 0,WINNER_ID,WINNER_NAME,LOSER_ID,LOSER_NAME,ATP_POINT_DIFF,ATP_RANK_DIFF,AGE_DIFF,HEIGHT_DIFF,BEST_OF,DRAW_SIZE,...,P_1ST_WON_LAST_2000_DIFF,P_2ND_WON_LAST_2000_DIFF,P_BP_SAVED_LAST_2000_DIFF,WIN_LAST_3_DIFF,WIN_LAST_5_DIFF,WIN_LAST_10_DIFF,WIN_LAST_25_DIFF,WIN_LAST_50_DIFF,WIN_LAST_100_DIFF,RESULT
0,101797,Jacco Eltingh,101120,Karel Novacek,-1266.0,102.0,-5.4,-2.0,3,32,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,101205,Grant Connell,101767,Lars Jonsson,84.0,-23.0,4.6,-3.0,3,32,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,101368,Christian Miniussi,102536,James Greenhalgh,433.0,-977.0,7.7,2.0,3,32,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,100772,Kelly Evernden,101746,Renzo Furlan,-406.0,119.0,8.6,0.0,3,32,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,101532,Francisco Clavet,101119,Marian Vajda,454.0,-69.0,-3.5,10.0,3,32,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [12]:
import pandas as pd, numpy as np, unicodedata

def norm_name(s):
    if pd.isna(s): return s
    s = unicodedata.normalize("NFKD", str(s))
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = s.lower().strip().replace(".", "").replace("-", " ")
    while "  " in s: s = s.replace("  ", " ")
    return s


elo = pd.read_csv("atp_elo_ratings.csv")
elo.columns = [c.strip().lower().replace(" ", "_") for c in elo.columns]
elo = elo[["player","elo","helo","celo","gelo"]].dropna(subset=["player"]).drop_duplicates("player")
elo["player_key"] = elo["player"].map(norm_name)


df = final_data.copy()
if "WINNER_NAME" not in df or "LOSER_NAME" not in df:
    raise KeyError("final_data must include WINNER_NAME and LOSER_NAME")
df["WINNER_KEY"] = df.get("WINNER_KEY", df["WINNER_NAME"].map(norm_name))
df["LOSER_KEY"]  = df.get("LOSER_KEY",  df["LOSER_NAME"].map(norm_name))


if "surface" not in df and "filtered_data" in globals() and "surface" in filtered_data.columns:
    df["surface"] = filtered_data["surface"]


need_cols = {"w_elo","l_elo","w_helo","l_helo","w_celo","l_celo","w_gelo","l_gelo"}
if not need_cols.issubset(df.columns):
    w = elo.add_prefix("w_") 
    l = elo.add_prefix("l_")
    df = df.merge(w, left_on="WINNER_KEY", right_on="w_player_key", how="left") \
           .merge(l, left_on="LOSER_KEY",  right_on="l_player_key",  how="left")


df["ELO_DIFF"] = df["w_elo"] - df["l_elo"]

if "surface" in df.columns:
    s = df["surface"].str.lower().fillna("")
    is_h, is_c, is_g = s.str.startswith("h"), s.str.startswith("c"), s.str.startswith("g")
    w_surf = np.select([is_h,            is_c,            is_g           ],
                       [df["w_helo"],    df["w_celo"],    df["w_gelo"]   ],
                       default=df["w_elo"])
    l_surf = np.select([is_h,            is_c,            is_g           ],
                       [df["l_helo"],    df["l_celo"],    df["l_gelo"]   ],
                       default=df["l_elo"])
    df["SURF_ELO_DIFF"] = w_surf - l_surf
else:
    df["SURF_ELO_DIFF"] = np.nan


if "RESULT" in df.columns:
    cols = df.columns.tolist()
    for c in ["ELO_DIFF","SURF_ELO_DIFF"]:
        if c in cols: cols.remove(c)
    ridx = cols.index("RESULT")
    df = df[cols[:ridx] + ["ELO_DIFF","SURF_ELO_DIFF"] + cols[ridx:]]

final_data = df



In [13]:
final_data.columns

Index(['WINNER_ID', 'WINNER_NAME', 'LOSER_ID', 'LOSER_NAME', 'ATP_POINT_DIFF',
       'ATP_RANK_DIFF', 'AGE_DIFF', 'HEIGHT_DIFF', 'BEST_OF', 'DRAW_SIZE',
       'H2H_DIFF', 'H2H_SURFACE_DIFF', 'P_ACE_LAST_3_DIFF', 'P_DF_LAST_3_DIFF',
       'P_1ST_IN_LAST_3_DIFF', 'P_1ST_WON_LAST_3_DIFF',
       'P_2ND_WON_LAST_3_DIFF', 'P_BP_SAVED_LAST_3_DIFF', 'P_ACE_LAST_5_DIFF',
       'P_DF_LAST_5_DIFF', 'P_1ST_IN_LAST_5_DIFF', 'P_1ST_WON_LAST_5_DIFF',
       'P_2ND_WON_LAST_5_DIFF', 'P_BP_SAVED_LAST_5_DIFF', 'P_ACE_LAST_10_DIFF',
       'P_DF_LAST_10_DIFF', 'P_1ST_IN_LAST_10_DIFF', 'P_1ST_WON_LAST_10_DIFF',
       'P_2ND_WON_LAST_10_DIFF', 'P_BP_SAVED_LAST_10_DIFF',
       'P_ACE_LAST_20_DIFF', 'P_DF_LAST_20_DIFF', 'P_1ST_IN_LAST_20_DIFF',
       'P_1ST_WON_LAST_20_DIFF', 'P_2ND_WON_LAST_20_DIFF',
       'P_BP_SAVED_LAST_20_DIFF', 'P_ACE_LAST_50_DIFF', 'P_DF_LAST_50_DIFF',
       'P_1ST_IN_LAST_50_DIFF', 'P_1ST_WON_LAST_50_DIFF',
       'P_2ND_WON_LAST_50_DIFF', 'P_BP_SAVED_LAST_50_DIFF',
       

## Example: Viewing Elo Data

In [14]:
carlos_matches = final_data[
    (final_data["WINNER_NAME"].str.contains("Carlos Alcaraz", case=False, na=False)) |
    (final_data["LOSER_NAME"].str.contains("Carlos Alcaraz", case=False, na=False))
]

carlos_matches.sample(5, random_state=42)


Unnamed: 0,WINNER_ID,WINNER_NAME,LOSER_ID,LOSER_NAME,ATP_POINT_DIFF,ATP_RANK_DIFF,AGE_DIFF,HEIGHT_DIFF,BEST_OF,DRAW_SIZE,...,w_helo,w_celo,w_gelo,w_player_key,l_player,l_elo,l_helo,l_celo,l_gelo,l_player_key
86453,208029,Holger Rune,207989,Carlos Alcaraz,-4739.0,17.0,0.1,5.0,3,64,...,1874.5,1878.7,1763.4,holger rune,Carlos Alcaraz,2268.4,2178.3,2214.5,2137.6,carlos alcaraz
86973,207989,Carlos Alcaraz,126523,Bernabe Zapata Miralles,6028.0,-72.0,-6.3,0.0,3,32,...,2178.3,2214.5,2137.6,carlos alcaraz,Bernabe Zapata Miralles,1433.4,1390.2,1419.9,1433.6,bernabe zapata miralles
92153,207989,Carlos Alcaraz,126094,Andrey Rublev,3050.0,-5.0,-5.5,-5.0,3,8,...,2178.3,2214.5,2137.6,carlos alcaraz,Andrey Rublev,1875.2,1831.3,1845.8,1787.1,andrey rublev
90893,207733,Jack Draper,207989,Carlos Alcaraz,-7209.0,29.0,1.3,10.0,3,32,...,1890.1,1816.6,1753.5,jack draper,Carlos Alcaraz,2268.4,2178.3,2214.5,2137.6,carlos alcaraz
84598,207989,Carlos Alcaraz,200175,Miomir Kecmanovic,1358.0,-32.0,-3.7,0.0,3,128,...,2178.3,2214.5,2137.6,carlos alcaraz,Miomir Kecmanovic,1729.3,1684.4,1685.7,1626.4,miomir kecmanovic


# Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np

# select numeric columns
numeric_df = final_data.select_dtypes(include='number').copy()

# OPTIONAL: keep a copy of the original numeric frame if you need it later
# numeric_df_full = numeric_df.copy()

# Remove any numeric columns that contain missing values
numeric_df = numeric_df.dropna(axis=1)  # axis=1 drops columns with any NaN

# Compute mode as a 1-D Series (pick first mode per column if multiple)
modes_df = numeric_df.mode()
if modes_df.empty:
    mode_series = pd.Series(np.nan, index=numeric_df.columns)
else:
    mode_series = modes_df.iloc[0]

# Build EDA summary
eda_stats = pd.DataFrame({
    'Count': numeric_df.count(),
    'Missing': numeric_df.isna().sum(),
    'Mean': numeric_df.mean(),
    'Median': numeric_df.median(),
    'Mode': mode_series,
    'StdDev': numeric_df.std(),
    'Min': numeric_df.min(),
    'Max': numeric_df.max(),
    'Range': numeric_df.max() - numeric_df.min(),
    'Variance': numeric_df.var(),
    '25th Percentile (Q1)': numeric_df.quantile(0.25),
    '50th Percentile (Q2)': numeric_df.quantile(0.50),
    '75th Percentile (Q3)': numeric_df.quantile(0.75)
})

# Display
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(eda_stats)

                           Count  Missing           Mean         Median  \
WINNER_ID                  92333        0  108717.718302  103819.000000   
LOSER_ID                   92333        0  108548.673075  103786.000000   
ATP_POINT_DIFF             92333        0     578.082213     249.000000   
ATP_RANK_DIFF              92333        0     -31.749472     -20.000000   
AGE_DIFF                   92333        0      -0.175471      -0.200000   
HEIGHT_DIFF                92333        0       0.604204       0.000000   
BEST_OF                    92333        0       3.363922       3.000000   
DRAW_SIZE                  92333        0      58.670919      32.000000   
H2H_DIFF                   92333        0       0.207380       0.000000   
H2H_SURFACE_DIFF           92333        0       0.116513       0.000000   
P_ACE_LAST_3_DIFF          92333        0       0.726385       0.589023   
P_DF_LAST_3_DIFF           92333        0      -0.130373      -0.123243   
P_1ST_IN_LAST_3_DIFF     

## Observations
- Largest H2H gap: Djokovic vs Gael Monfils (H2H_DIFF = 20) — a strong rivalry signal.
- ATP rank difference: Mean = -31.75, Median = -20 — winners generally have better (smaller) ranks; distribution has large outliers.
- Age difference: Mean ≈ -0.18 (median ≈ -0.2) — winners are on average slightly younger, but the effect is very small.