# Defining Upsets in ATP Tennis Matches Using Betting Odds

## Executive Summary

This notebook cleans ATP tennis betting odds data and engineers features to define and classify match upsets. The analysis covers 8 seasons (2016-2019 and 2021-2024) and about 21,000 match records, of which 20,326 remain after cleaning.

## Objectives

1. **Data Cleaning**: Remove incomplete matches (retirements and walkovers) from the betting odds datasets
2. **Probability Calculation**: Convert betting odds into fair win probabilities for each player
3. **Upset Definition**: Define upsets using a probability-based threshold
4. **Binary Classification**: Create a binary feature to classify matches as upsets or non-upsets
5. **Feature Selection**: Retain only the most relevant features for analysis

## Key Findings

- **Total Matches Analyzed**: 20,326 matches retained after cleaning, across 8 years
- **Matches Removed**: 707 incomplete matches (Retired/Walkover)
- **Upset Rate**: Between 5.86% (2024) and 8.14% (2017) per year; 7.10% overall
- **Upset Threshold**: Matches where the winner's fair probability of winning was below 30%

## Dataset Overview

- **Source**: ATP Betting Odds (Excel files from Atp_Betting_Odds_Folder)
- **Years Covered**: 2016, 2017, 2018, 2019, 2021, 2022, 2023, 2024
- **Final Features**: 17 key features per match
- **Output**: `cleaned_betting_odds_dfs` dictionary containing cleaned dataframes for each year

In [1]:
# Import necessary libraries
import pandas as pd
import os
import glob

---

## 1. Data Loading

We begin by importing necessary libraries and loading all betting odds data from Excel files.

In [2]:
# Load all betting odds files
betting_odds_folder = r'c:\poly_code\Atp_Betting_Odds_Folder'
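# Sort the paths so years load in a fixed order (glob does not guarantee one);
# later cells use the first dictionary key as the sample year
betting_odds_files = sorted(glob.glob(os.path.join(betting_odds_folder, '*.xlsx')))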

# Create a dictionary to store dataframes
betting_odds_dfs = {}

for file in betting_odds_files:
    year = os.path.basename(file).split('.')[0]
    betting_odds_dfs[year] = pd.read_excel(file)
    print(f"Loaded {year}: {len(betting_odds_dfs[year])} matches")

print(f"\nTotal years loaded: {len(betting_odds_dfs)}")

Loaded 2016: 2626 matches
Loaded 2017: 2633 matches
Loaded 2018: 2637 matches
Loaded 2019: 2610 matches
Loaded 2021: 2489 matches
Loaded 2022: 2632 matches
Loaded 2023: 2703 matches
Loaded 2024: 2703 matches

Total years loaded: 8


In [3]:
# Check the structure of one dataset to see the 'Comment' column
year_sample = list(betting_odds_dfs.keys())[0]
print(f"Columns in {year_sample} dataset:")
print(betting_odds_dfs[year_sample].columns.tolist())
print(f"\nFirst few rows:")
print(betting_odds_dfs[year_sample].head())

Columns in 2016 dataset:
['ATP', 'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface', 'Round', 'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'WPts', 'LPts', 'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'Comment', 'B365W', 'B365L', 'EXW', 'EXL', 'LBW', 'LBL', 'PSW', 'PSL', 'MaxW', 'MaxL', 'AvgW', 'AvgL']

First few rows:
   ATP  Location              Tournament       Date  Series    Court Surface  \
0    1  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   
1    1  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   
2    1  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   
3    1  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   
4    1  Brisbane  Brisbane International 2016-01-05  ATP250  Outdoor    Hard   

       Round  Best of       Winner  ...   EXW   EXL   LBW   LBL   PSW   PSL  \
0  1st Round        3  Dimitrov G.  ...  1.68  2.10  1.62  2.25  1.68  2

---

## 2. Data Exploration

Before cleaning, we explore the dataset structure to understand the available features and identify data quality issues.

In [4]:
# Check unique values in Comment column across all datasets
print("Unique values in 'Comment' column across all years:\n")
for year, df in betting_odds_dfs.items():
    if 'Comment' in df.columns:
        unique_comments = df['Comment'].unique()
        print(f"{year}: {unique_comments}")
    else:
        print(f"{year}: No 'Comment' column found")

Unique values in 'Comment' column across all years:

2016: ['Completed' 'Retired' 'Walkover']
2017: ['Completed' 'Retired' 'Walkover']
2018: ['Completed' 'Retired' 'Walkover']
2019: ['Completed' 'Walkover' 'Retired' 'Awarded' 'Sched']
2021: ['Completed' 'Retired' 'Walkover' 'Awarded' 'Rrtired']
2022: ['Completed' 'Retired' 'Walkover']
2023: ['Completed' 'Retired' 'Walkover' 'Awarded']
2024: ['Completed' 'Retired' 'Walkover' 'Awarded']


In [5]:
# Clean the datasets by removing matches with 'Retired' or 'Walkover' in Comment column
cleaned_betting_odds_dfs = {}

for year, df in betting_odds_dfs.items():
    if 'Comment' in df.columns:
        # Remove rows where Comment is 'Retired' or 'Walkover'
        original_count = len(df)
        cleaned_df = df[~df['Comment'].isin(['Retired', 'Walkover'])].copy()
        removed_count = original_count - len(cleaned_df)
        cleaned_betting_odds_dfs[year] = cleaned_df
        print(f"{year}: Removed {removed_count} matches (Retired/Walkover). Remaining: {len(cleaned_df)}")
    else:
        cleaned_betting_odds_dfs[year] = df.copy()
        print(f"{year}: No 'Comment' column - kept all {len(df)} matches")

print(f"\nCleaning complete!")

2016: Removed 99 matches (Retired/Walkover). Remaining: 2527
2017: Removed 103 matches (Retired/Walkover). Remaining: 2530
2018: Removed 82 matches (Retired/Walkover). Remaining: 2555
2019: Removed 78 matches (Retired/Walkover). Remaining: 2532
2021: Removed 79 matches (Retired/Walkover). Remaining: 2410
2022: Removed 85 matches (Retired/Walkover). Remaining: 2547
2023: Removed 88 matches (Retired/Walkover). Remaining: 2615
2024: Removed 93 matches (Retired/Walkover). Remaining: 2610

Cleaning complete!


---

## 3. Data Cleaning

### Removing Incomplete Matches

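The `Comment` column records how each match ended. We remove matches marked as:
- **Retired**: A player retired mid-match due to injury or another reason
- **Walkover**: The match was awarded without play

**Rationale**: These matches don't reflect true competitive outcomes and could skew the upset analysis, since pre-match betting odds cannot account for mid-match events.

**Caveat**: The filter matches these two labels exactly. The 2019 file also contains `Awarded` and `Sched`, the 2021 file contains `Awarded` and the misspelled `Rrtired`, and the 2023 and 2024 files contain `Awarded`; rows with those labels are kept.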
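
A more tolerant filter could normalize the labels before comparing. The following is a minimal sketch, assuming that only rows whose label normalizes to `completed` should be kept, so that variants such as `Rrtired`, `Awarded`, or `Sched` (all present in the exploration output) are dropped as well. `keep_completed` is a hypothetical helper, not part of this notebook, and applying it would change the match counts reported here.

```python
import pandas as pd

def keep_completed(df: pd.DataFrame, column: str = "Comment") -> pd.DataFrame:
    """Keep only rows whose status normalizes to 'completed' (hypothetical helper)."""
    # Cast to str so missing values become the text 'nan', then trim and lowercase
    status = df[column].astype(str).str.strip().str.lower()
    return df[status == "completed"].copy()

# Small illustrative frame built from the status labels seen in the exploration output
demo = pd.DataFrame(
    {"Comment": ["Completed", "Retired", "Rrtired", "Awarded", "Sched", "Walkover"]}
)
kept = keep_completed(demo)
print(kept["Comment"].tolist())
```

Whether `Awarded` matches belong in an upset analysis is a judgment call; the sketch only shows one way to make the rule explicit.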

---

## 4. Feature Engineering: Probability of Winning

### Converting Odds to Probabilities

Betting odds represent the bookmaker's assessment of each player's winning chances. The `AvgW` and `AvgL` columns hold decimal odds, which we convert to implied probabilities using:

**Formula**: $P_{\text{implied}} = \frac{1}{\text{Odds}}$

### Normalization for Fair Probabilities

Bookmaker odds include a profit margin (the overround), so the two implied probabilities of a match sum to more than 1. We divide each implied probability by that sum to obtain fair probabilities:

**Normalized Probability**: $P_{\text{fair}} = \frac{P_{\text{implied}}}{P_{\text{implied, winner}} + P_{\text{implied, loser}}}$

This ensures the two probabilities sum to exactly 1.0, forming a valid probability distribution. The resulting values are stored as `prob_winner` and `prob_loser`.
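
As a worked example, the sketch below applies both formulas to the first 2016 match in the dataset (`AvgW = 1.66`, `AvgL = 2.20`). The implied probabilities sum to about 1.057, and normalizing removes that margin. `fair_probabilities` is a hypothetical helper name used only for illustration; the notebook itself performs the same arithmetic on DataFrame columns.

```python
def fair_probabilities(odds_winner: float, odds_loser: float) -> tuple[float, float]:
    """Convert two decimal odds into probabilities that sum to 1 (proportional normalization)."""
    implied_winner = 1.0 / odds_winner
    implied_loser = 1.0 / odds_loser
    # The sum exceeds 1 by the bookmaker margin (overround)
    overround = implied_winner + implied_loser
    return implied_winner / overround, implied_loser / overround

# First 2016 match: AvgW = 1.66, AvgL = 2.20
p_win, p_lose = fair_probabilities(1.66, 2.20)
print(f"{p_win:.6f} {p_lose:.6f}")  # 0.569948 0.430052
```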

In [6]:
# Calculate implied probabilities and normalize them
for year, df in cleaned_betting_odds_dfs.items():
    # Calculate implied probabilities using P = 1 / Odds
    df['implied_prob_winner'] = 1 / df['AvgW']
    df['implied_prob_loser'] = 1 / df['AvgL']
    
    # Calculate the sum of implied probabilities (overround)
    df['prob_sum'] = df['implied_prob_winner'] + df['implied_prob_loser']
    
    # Normalize to get fair probabilities that sum to 1
    df['prob_winner'] = df['implied_prob_winner'] / df['prob_sum']
    df['prob_loser'] = df['implied_prob_loser'] / df['prob_sum']
    
    # Drop intermediate columns (implied probabilities and sum)
    df.drop(['implied_prob_winner', 'implied_prob_loser', 'prob_sum'], axis=1, inplace=True)
    
    print(f"{year}: Added probability features. Sample probabilities (first match):")
    print(f"  Winner prob: {df['prob_winner'].iloc[0]:.4f}, Loser prob: {df['prob_loser'].iloc[0]:.4f}, Sum: {df['prob_winner'].iloc[0] + df['prob_loser'].iloc[0]:.4f}")

print(f"\nProbability features added to all datasets!")

2016: Added probability features. Sample probabilities (first match):
  Winner prob: 0.5699, Loser prob: 0.4301, Sum: 1.0000
2017: Added probability features. Sample probabilities (first match):
  Winner prob: 0.7307, Loser prob: 0.2693, Sum: 1.0000
2018: Added probability features. Sample probabilities (first match):
  Winner prob: 0.4364, Loser prob: 0.5636, Sum: 1.0000
2019: Added probability features. Sample probabilities (first match):
  Winner prob: 0.7020, Loser prob: 0.2980, Sum: 1.0000
2021: Added probability features. Sample probabilities (first match):
  Winner prob: 0.6080, Loser prob: 0.3920, Sum: 1.0000
2022: Added probability features. Sample probabilities (first match):
  Winner prob: 0.5751, Loser prob: 0.4249, Sum: 1.0000
2023: Added probability features. Sample probabilities (first match):
  Winner prob: 0.5000, Loser prob: 0.5000, Sum: 1.0000
2024: Added probability features. Sample probabilities (first match):
  Winner prob: 0.5636, Loser prob: 0.4364, Sum: 1.0000


In [7]:
# Verify the new features in one of the datasets
year_sample = list(cleaned_betting_odds_dfs.keys())[0]
sample_df = cleaned_betting_odds_dfs[year_sample]

print(f"Dataset for {year_sample}:")
print(f"Total columns: {len(sample_df.columns)}")
print(f"\nNew probability columns added:")
print(f"  - prob_winner")
print(f"  - prob_loser")
print(f"\nSample data (first 3 matches):")
print(sample_df[['Winner', 'Loser', 'AvgW', 'AvgL', 'prob_winner', 'prob_loser']].head(3))

Dataset for 2016:
Total columns: 42

New probability columns added:
  - prob_winner
  - prob_loser

Sample data (first 3 matches):
        Winner        Loser  AvgW  AvgL  prob_winner  prob_loser
0  Dimitrov G.     Simon G.  1.66  2.20     0.569948    0.430052
1     Kudla D.   Smith J.P.  1.57  2.37     0.601523    0.398477
2     Kamke T.  Mitchell B.  1.77  2.00     0.530504    0.469496


---

## 5. Defining Upsets

### Upset Criteria

An **upset** is defined as a match won by a clear underdog according to the betting odds:

**Definition**: A match is classified as an upset if:
$$P_{\text{winner}} < 0.30$$

Here $P_{\text{winner}}$ is the winner's fair (normalized) probability. The threshold means the betting markets gave the winner less than a 30% chance of winning.

### Binary Classification

- **upset_binary = 1**: Upset (the winner was a strong underdog, $P_{\text{winner}} < 0.30$)
- **upset_binary = 0**: Not an upset (the favorite won, the match was close to even, or a mild underdog with $P_{\text{winner}} \geq 0.30$ won)
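
The rule can be checked by hand on a few values. This is a small sketch using the six fair winner probabilities that appear in this notebook's 2016 sample output; `is_upset` is a hypothetical scalar helper, whereas the notebook applies the same comparison to a whole column at once.

```python
UPSET_THRESHOLD = 0.30

def is_upset(prob_winner: float, threshold: float = UPSET_THRESHOLD) -> int:
    """Return 1 when the winner's fair probability is strictly below the threshold."""
    return int(prob_winner < threshold)

# Fair winner probabilities taken from the 2016 sample rows in this notebook
samples = {
    "Dimitrov G.": 0.569948,
    "Kudla D.": 0.601523,
    "Kamke T.": 0.530504,
    "Pouille L.": 0.189853,
    "Tomic B.": 0.284483,
    "Raonic M.": 0.244660,
}
labels = {name: is_upset(p) for name, p in samples.items()}
print(labels)

# The comparison is strict, so a winner at exactly 0.30 is not an upset,
# and a mild underdog (for example 0.45) is labeled 0 as well
print(is_upset(0.30), is_upset(0.45))
```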

In [8]:
# Define upset threshold
UPSET_THRESHOLD = 0.30

# Create binary upset classification feature
for year, df in cleaned_betting_odds_dfs.items():
    # Upset = 1 if prob_winner < 0.30, otherwise 0
    df['upset_binary'] = (df['prob_winner'] < UPSET_THRESHOLD).astype(int)
    
    # Count upsets
    upset_count = df['upset_binary'].sum()
    total_matches = len(df)
    upset_percentage = (upset_count / total_matches) * 100
    
    print(f"{year}: {upset_count} upsets out of {total_matches} matches ({upset_percentage:.2f}%)")

print(f"\nUpset binary classification feature added to all datasets!")

2016: 205 upsets out of 2527 matches (8.11%)
2017: 206 upsets out of 2530 matches (8.14%)
2018: 179 upsets out of 2555 matches (7.01%)
2019: 171 upsets out of 2532 matches (6.75%)
2021: 173 upsets out of 2410 matches (7.18%)
2022: 177 upsets out of 2547 matches (6.95%)
2023: 180 upsets out of 2615 matches (6.88%)
2024: 153 upsets out of 2610 matches (5.86%)

Upset binary classification feature added to all datasets!


In [9]:
# Display examples of upsets and non-upsets
year_sample = list(cleaned_betting_odds_dfs.keys())[0]
sample_df = cleaned_betting_odds_dfs[year_sample]

print(f"Examples from {year_sample} dataset:\n")
print("=" * 80)
print("\nNON-UPSETS (upset_binary = 0):")
print("-" * 80)
non_upsets = sample_df[sample_df['upset_binary'] == 0][['Winner', 'Loser', 'prob_winner', 'prob_loser', 'upset_binary']].head(3)
print(non_upsets.to_string(index=False))

print("\n" + "=" * 80)
print("\nUPSETS (upset_binary = 1):")
print("-" * 80)
upsets = sample_df[sample_df['upset_binary'] == 1][['Winner', 'Loser', 'prob_winner', 'prob_loser', 'upset_binary']].head(3)
print(upsets.to_string(index=False))

Examples from 2016 dataset:


NON-UPSETS (upset_binary = 0):
--------------------------------------------------------------------------------
     Winner       Loser  prob_winner  prob_loser  upset_binary
Dimitrov G.    Simon G.     0.569948    0.430052             0
   Kudla D.  Smith J.P.     0.601523    0.398477             0
   Kamke T. Mitchell B.     0.530504    0.469496             0


UPSETS (upset_binary = 1):
--------------------------------------------------------------------------------
    Winner        Loser  prob_winner  prob_loser  upset_binary
Pouille L.    Goffin D.     0.189853    0.810147             1
  Tomic B. Nishikori K.     0.284483    0.715517             1
 Raonic M.   Federer R.     0.244660    0.755340             1


---

## 6. Feature Selection

### Dimensionality Reduction

To focus on the most relevant information for upset analysis, we reduce each dataset from 39-43 columns (depending on the year, and counting the three engineered features) to 17 essential features.

### Selected Features

**Match Context** (8 features):
- Location, Tournament, Date, Series, Court, Surface, Round, Best of

**Players** (2 features):
- Winner, Loser

**Rankings** (2 features):
- WRank (Winner's ATP ranking)
- LRank (Loser's ATP ranking)

**Betting Odds** (2 features):
- AvgW (Average odds for winner)
- AvgL (Average odds for loser)

**Engineered Features** (3 features):
- prob_winner (Fair probability of winner winning)
- prob_loser (Fair probability of loser winning)
- upset_binary (Binary upset classification)

In [10]:
# Define the features to keep
features_to_keep = [
    'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface', 'Round', 
    'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'AvgW', 'AvgL',
    'prob_winner', 'prob_loser', 'upset_binary'
]

# Keep only the selected features in all datasets
for year, df in cleaned_betting_odds_dfs.items():
    # Get columns that exist in the dataframe
    available_columns = [col for col in features_to_keep if col in df.columns]
    
    # Keep only selected columns
    cleaned_betting_odds_dfs[year] = df[available_columns].copy()
    
    print(f"{year}: Reduced from {len(df.columns)} to {len(available_columns)} features")

print(f"\nFeature selection complete!")

2016: Reduced from 43 to 17 features
2017: Reduced from 43 to 17 features
2018: Reduced from 43 to 17 features
2019: Reduced from 39 to 17 features
2021: Reduced from 39 to 17 features
2022: Reduced from 39 to 17 features
2023: Reduced from 39 to 17 features
2024: Reduced from 39 to 17 features

Feature selection complete!


In [11]:
# Verify the remaining features
year_sample = list(cleaned_betting_odds_dfs.keys())[0]
sample_df = cleaned_betting_odds_dfs[year_sample]

print(f"Remaining features in {year_sample} dataset:")
print(f"Total: {len(sample_df.columns)} features\n")
print("Column names:")
for i, col in enumerate(sample_df.columns, 1):
    print(f"  {i}. {col}")

print(f"\nDataset shape: {sample_df.shape}")
print(f"\nFirst 3 rows:")
print(sample_df.head(3))

Remaining features in 2016 dataset:
Total: 17 features

Column names:
  1. Location
  2. Tournament
  3. Date
  4. Series
  5. Court
  6. Surface
  7. Round
  8. Best of
  9. Winner
  10. Loser
  11. WRank
  12. LRank
  13. AvgW
  14. AvgL
  15. prob_winner
  16. prob_loser
  17. upset_binary

Dataset shape: (2527, 17)

First 3 rows:
   Location              Tournament       Date  Series    Court Surface  \
0  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   
1  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   
2  Brisbane  Brisbane International 2016-01-04  ATP250  Outdoor    Hard   

       Round  Best of       Winner        Loser  WRank  LRank  AvgW  AvgL  \
0  1st Round        3  Dimitrov G.     Simon G.   28.0   15.0  1.66  2.20   
1  1st Round        3     Kudla D.   Smith J.P.   69.0  129.0  1.57  2.37   
2  1st Round        3     Kamke T.  Mitchell B.  277.0  231.0  1.77  2.00   

   prob_winner  prob_loser  upset_binary  
0     0.569

---

## 7. Summary and Results

### Data Processing Pipeline

1. ✅ **Loaded** 8 years of ATP betting odds data (~21,000 matches)
2. ✅ **Cleaned** by removing 707 incomplete matches (Retired/Walkover)
3. ✅ **Calculated** fair win probabilities using normalized betting odds
4. ✅ **Defined** upsets as matches where winner had <30% win probability
5. ✅ **Created** binary classification feature (upset_binary)
6. ✅ **Reduced** features from 39-43 to 17 most relevant attributes

### Key Statistics

| Year | Total Matches | Upsets | Upset Rate |
|------|--------------|--------|------------|
| 2016 | 2,527 | 205 | 8.11% |
| 2017 | 2,530 | 206 | 8.14% |
| 2018 | 2,555 | 179 | 7.01% |
| 2019 | 2,532 | 171 | 6.75% |
| 2021 | 2,410 | 173 | 7.18% |
| 2022 | 2,547 | 177 | 6.95% |
| 2023 | 2,615 | 180 | 6.88% |
| 2024 | 2,610 | 153 | 5.86% |
| **Total** | **20,326** | **1,444** | **7.10%** |
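
The totals row can be reproduced from the per-year figures. A short check, with the counts copied from the table:

```python
# (matches, upsets) per year, copied from the Key Statistics table
yearly = {
    2016: (2527, 205),
    2017: (2530, 206),
    2018: (2555, 179),
    2019: (2532, 171),
    2021: (2410, 173),
    2022: (2547, 177),
    2023: (2615, 180),
    2024: (2610, 153),
}

total_matches = sum(matches for matches, _ in yearly.values())
total_upsets = sum(upsets for _, upsets in yearly.values())
overall_rate = 100 * total_upsets / total_matches

print(total_matches, total_upsets, f"{overall_rate:.2f}%")  # 20326 1444 7.10%
```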

### Output

The cleaned and processed data is stored in the `cleaned_betting_odds_dfs` dictionary, with each year as a key containing a pandas DataFrame with 17 features ready for further analysis, modeling, or visualization.
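
For analyses that span several seasons, the per-year DataFrames can be stacked into one. This is a minimal sketch, assuming a `Year` column derived from the dictionary key is wanted; `combine_years` is a hypothetical helper, and the two one-row frames stand in for the real data (their values are the first-match probabilities printed earlier for 2016 and 2017).

```python
import pandas as pd

def combine_years(dfs_by_year: dict) -> pd.DataFrame:
    """Stack per-year DataFrames, recording each dictionary key in a 'Year' column."""
    frames = [df.assign(Year=str(year)) for year, df in dfs_by_year.items()]
    return pd.concat(frames, ignore_index=True)

# One-row stand-ins for the real per-year DataFrames
demo = {
    "2016": pd.DataFrame({"prob_winner": [0.5699], "upset_binary": [0]}),
    "2017": pd.DataFrame({"prob_winner": [0.7307], "upset_binary": [0]}),
}
combined = combine_years(demo)
print(combined)
```

With the real dictionary, the call would be `combine_years(cleaned_betting_odds_dfs)`.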

### Next Steps

This preprocessed dataset can be used for:
- Upset prediction modeling
- Pattern analysis in tennis upsets
- Investigating factors that contribute to upsets
- Ranking system evaluation
- Court surface impact on upsets
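
As a starting point for the surface question, the upset rate per group is just the mean of `upset_binary`. The sketch below uses synthetic rows invented purely for illustration (they are not taken from the ATP data) and assumes the `Surface` and `upset_binary` columns produced by this notebook.

```python
import pandas as pd

# Synthetic rows for illustration only; not real match results
toy = pd.DataFrame(
    {
        "Surface": ["Hard", "Hard", "Clay", "Clay", "Grass", "Hard"],
        "upset_binary": [0, 1, 0, 0, 1, 0],
    }
)

# Because upset_binary is 0/1, its mean within a group is that group's upset rate
summary = (
    toy.groupby("Surface")["upset_binary"]
    .agg(matches="count", upsets="sum", upset_rate="mean")
    .sort_values("upset_rate", ascending=False)
)
print(summary)
```

The same pattern works for `Series`, `Round`, or `Best of` by changing the grouping column.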

---

## 8. Export Cleaned Data to Excel

Export each year's cleaned data to separate Excel files for easy sharing and further analysis.

In [12]:
# Create output directory for cleaned data
output_folder = r'c:\poly_code\Cleaned_ATP_Data'
os.makedirs(output_folder, exist_ok=True)

# Export each year's data to a separate Excel file
for year, df in cleaned_betting_odds_dfs.items():
    output_file = os.path.join(output_folder, f'cleaned_atp_{year}.xlsx')
    df.to_excel(output_file, index=False, engine='openpyxl')
    print(f"{year}: Exported {len(df)} matches to {output_file}")

print(f"\n✓ All files exported successfully to: {output_folder}")
print(f"✓ Total files created: {len(cleaned_betting_odds_dfs)}")

2016: Exported 2527 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2016.xlsx
2017: Exported 2530 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2017.xlsx
2018: Exported 2555 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2018.xlsx
2019: Exported 2532 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2019.xlsx
2021: Exported 2410 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2021.xlsx
2022: Exported 2547 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2022.xlsx
2023: Exported 2615 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2023.xlsx
2024: Exported 2610 matches to c:\poly_code\Cleaned_ATP_Data\cleaned_atp_2024.xlsx

✓ All files exported successfully to: c:\poly_code\Cleaned_ATP_Data
✓ Total files created: 8


In [13]:
# Verify exported files (glob is already imported in the first cell)
exported_files = glob.glob(os.path.join(output_folder, '*.xlsx'))
print(f"Exported files in {output_folder}:\n")
print("=" * 70)
for i, file in enumerate(sorted(exported_files), 1):
    file_name = os.path.basename(file)
    file_size = os.path.getsize(file) / 1024  # Size in KB
    print(f"{i}. {file_name:30s} ({file_size:.1f} KB)")
print("=" * 70)
print(f"\nTotal: {len(exported_files)} Excel files")

Exported files in c:\poly_code\Cleaned_ATP_Data:

1. cleaned_atp_2016.xlsx          (249.6 KB)
2. cleaned_atp_2017.xlsx          (250.3 KB)
3. cleaned_atp_2018.xlsx          (252.2 KB)
4. cleaned_atp_2019.xlsx          (250.0 KB)
5. cleaned_atp_2021.xlsx          (238.0 KB)
6. cleaned_atp_2022.xlsx          (251.7 KB)
7. cleaned_atp_2023.xlsx          (258.1 KB)
8. cleaned_atp_2024.xlsx          (257.4 KB)

Total: 8 Excel files
