In [1]:
import pandas as pd

# Replace 'your_file.csv' with the path to your CSV file
df = pd.read_csv('data\\tennis_atp/atp_matches_1968.csv')

# To display the first few rows
df.columns


Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points'],
      dtype='object')

### Explained columns:

### 🎾 Tournament Info

- **tourney_id:** Unique id for tournament
- **tourney_name:** Tournament name
- **surface:** Court surface: 'Hard', 'Clay', 'Grass', 'Carpet'
- **draw:** Total Number of players in the tounrament
- **tounrey_level:** Tournament level: 'G' = Grand Slam, 'M' = Masters, 'A' = ATP 500/250, 'C' = Challenger, 'F' = Futures
- **tourney_date**: Start date of tournament (format: yyyymmdd)

### 🔢 Match Info

- **match_num**: Match number within the tournament
- **round**: Round of the match (e.g. 'F', 'SF', 'QF', 'R32')
- **best_of**:Number of sets in match (3 or 5)
- **score**: Final score as string (e.g. '6-3 3-6 7-6')
- **minutes**: Duration of match in minutes

### 🧑‍🎾 Winner Info

- **winner_id**: Unique player id
- **winner_seed**: seed in tournament (if seeded)
- **winner_entry**: Entry type: 'WC' = Wildcard, 'Q' = Qualifier, 'LL' = Lucky Loser, 'PR' = Protected Ranking
- **winner_name**: Full name of winner
- **winner_hand**: Playing hand: 'R' = Right, 'L' = Left, 'U' = Unknown
- **winner_ht**: height in cm
- **winner_ioc**: Country code (e.g. 'USA')
- **winner_age**: Age at the time of match
- **winner_rank**: Ranking points before the match

### 🧍‍♂️ Loser Info (same structure)
Just replace “winner” with “loser”:
loser_id, loser_seed, loser_entry, etc.


### 📊 Match Stats
(Prefix w_ = winner, l_ = loser)


- **w_ace**: Number of aces
- **w_df**: Double faults
- **w_svpt**: Serve points played
- **w_1stIn**:  First serves in
- **w_1stWon**: Points won on 1st serve
- **w_2ndWon**: Points won on 2nd serve
- **w_SvGms**: Service games played
- **w_bpSaved**: Break points saved
- **w_bpFaced**: Break points faced



### What files will be kept?

- ✅atp_matches -> High -quality data from ATP main tour. Includes top players and rich match features
- ❌atp_matches_futures -> ITF level future matches, entry level
- ❌atp_matches_quall_chall -> Covers Challengers and qualifying rounds—bridge between ITF and ATP
- ❌atp_matches_amateur -> Old, pre open Era, incosistent structure

### Merge all atp_mathces * files into a single

In [10]:
import pandas as pd
import glob

# 1. Combine all atp_matches_*.csv (excluding futures and qual_chall)
atp_main_files = glob.glob('data/tennis_atp/atp_matches_*.csv')
atp_main_files = [f for f in atp_main_files if 'futures' not in f and 'qual_chall' not in f]
df_atp = pd.concat([pd.read_csv(f) for f in atp_main_files], ignore_index=True)

# Save to csv file
df_atp.to_csv("data/atp_matches_combined.csv")

# 2. Combine all futures
futures_files = glob.glob('data/tennis_atp/atp_matches_futures_*.csv')
df_futures = pd.concat([pd.read_csv(f) for f in futures_files], ignore_index=True)

# 3. Combine all qual_chall
qual_chall_files = glob.glob('data/tennis_atp/atp_matches_qual_chall_*.csv')
df_qual_chall = pd.concat([pd.read_csv(f) for f in qual_chall_files], ignore_index=True)


### Print merged csv

In [13]:
df = pd.read_csv('data/atp_matches_combined.csv')
df

  df = pd.read_csv('data/atp_matches_combined.csv')


Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,1968-2029,Dublin,Grass,32.0,A,19680708,270,112411,,...,,,,,,,,,,
1,1,1968-2029,Dublin,Grass,32.0,A,19680708,271,126914,,...,,,,,,,,,,
2,2,1968-2029,Dublin,Grass,32.0,A,19680708,272,209523,,...,,,,,,,,,,
3,3,1968-2029,Dublin,Grass,32.0,A,19680708,273,100084,,...,,,,,,,,,,
4,4,1968-2029,Dublin,Grass,32.0,A,19680708,274,100132,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193332,193332,2024-M-DC-2024-WG2-PO-URU-MDA-01,Davis Cup WG2 PO: URU vs MDA,Clay,4.0,D,20240203,5,212051,,...,30.0,17.0,7.0,6.0,8.0,14.0,1109.0,8.0,740.0,34.0
193333,193333,2024-M-DC-2024-WG2-PO-VIE-RSA-01,Davis Cup WG2 PO: VIE vs RSA,Hard,4.0,D,20240202,1,122533,,...,41.0,25.0,6.0,9.0,1.0,4.0,554.0,67.0,748.0,32.0
193334,193334,2024-M-DC-2024-WG2-PO-VIE-RSA-01,Davis Cup WG2 PO: VIE vs RSA,Hard,4.0,D,20240202,2,144748,,...,51.0,25.0,7.0,11.0,5.0,12.0,416.0,109.0,,
193335,193335,2024-M-DC-2024-WG2-PO-VIE-RSA-01,Davis Cup WG2 PO: VIE vs RSA,Hard,4.0,D,20240202,4,122533,,...,51.0,32.0,17.0,14.0,5.0,9.0,554.0,67.0,416.0,109.0


### Augmented data

- rank diff: loser_rank - winner_rank
- age diff: loser_age - winner_age
- ft_diff: loser_ht - winner ht
- server_adavantage: (w_1stWon + w_2ndWon) - (l_1stWon + l_2ndWon)
- bp_effectiveness: (w_bpSaved / w_bpFaced) -> Break-point mental strenght
- total_points_played: w_svpt + l_svpt
- match_efficiency: minutes/total_points_played

-------------

- Elo rating
- Past results
- Recent form
- Tournament history
- Surface preference (win% on each surface)



Notes:
All numerical data will be normalized
Categorical ft will be encoded
