# COGS 108 - Data Checkpoint

## Authors

- Minghao Xu: Project administration, Methodology, Analysis, Experimental investigation
- Jerry Chen: Analysis, Software, Data curation, Methodology 
- Eli Liang:  Visualization, Conceptualization, Writing - review & editing
- William Wu: Analysis, Background research, Writing - original draft
- Weder Qin: Software, Visualization, Writing - review & editing

## Research Question

To what degree does the court surface (Clay, Grass, Hard) affect first-serve and second-serve win percentages in men’s singles matches (all matches) at the 2024 Grand Slam tournaments?

## Background and Prior Work

Court surface can change how much advantage a server gets because bounce height, skid, and pace affect the returner’s reaction time and how quickly rallies “reset” after the serve. In Grand Slams, this creates a natural comparison across hard (Australian Open / US Open), clay (French Open), and grass (Wimbledon) to test whether first-serve win% and second-serve win% differ by surface in recent elite matches (2023–2025).

Prior work motivates our focus on first- and second-serve point outcomes. A widely shared modeling project shows that first-serve points won and second-serve points won are among the most informative serving statistics for predicting match outcomes, supporting their use as core response variables in our study.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Meanwhile, serve-focused research using Australian Open data shows that outcomes like aces depend on multiple serve attributes (not just raw speed), reinforcing why we measure overall point win% rather than relying on aces alone.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Consistent with that, tennis commentary analyses caution that aces are only one small component of winning and recommend broader serve-point measures as better indicators of serve dominance.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Building on these ideas, our project will estimate how much surface (clay/grass/hard) shifts first- and second-serve win% using match-level Grand Slam data from 2023–2025 (e.g., open datasets scraped from Slam sites).<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Kokta, M. (Medium). *Predicting ATP Tennis Match Outcomes Using Serving Statistics.* https://medium.com/swlh/predicting-atp-tennis-match-outcomes-using-serving-statistics-cde03d99f410  
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Whiteside, D., et al. (2017). *Spatial characteristics of professional tennis serves with implications for serving aces: A machine learning approach.* *Journal of Sports Sciences.* https://www.tandfonline.com/doi/full/10.1080/02640414.2016.1183805  
3. <a name="cite_note-3"></a> [^](#cite_ref-3) TalkingTennis.net (Jan 24, 2025). *How important are aces in winning tennis matches?* https://talkingtennis.net/blog-posts/how-important-are-aces-in-winning-tennis-matches  
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Sackmann, J. (ongoing). *Grand Slam point-by-point data (scraped from Slam websites).* https://github.com/JeffSackmann/tennis_slam_pointbypoint


## Hypothesis


We predict that court surface significantly influences serve win percentages: first-serve win percentage will be highest on grass and lowest on clay, while second-serve win percentage will remain relatively stable across surfaces. First serves are usually hit for speed and second serves for reliability, so surface should matter more for first serves than for second serves. Court Pace Rating (CPR) is highest on grass and lowest on clay, so we expect grass to give the largest boost to first-serve win rate.
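
One way this prediction could be checked, as a rough sketch: average each serve metric by surface, then compare the grass-minus-clay gap for first serves against the same gap for second serves. The hypothesis expects the first-serve gap to be positive and clearly larger than the second-serve gap. The rows below are made-up illustrative values, and the column names `first_win_pct` / `second_win_pct` and the helper `surface_gap` are assumptions rather than columns in the raw file.

```python
import pandas as pd

# Made-up player-match rows for illustration only (not real results).
serve = pd.DataFrame({
    "surface": ["Clay", "Clay", "Grass", "Grass", "Hard", "Hard"],
    "first_win_pct": [0.68, 0.70, 0.76, 0.78, 0.72, 0.74],
    "second_win_pct": [0.50, 0.52, 0.52, 0.54, 0.51, 0.53],
})


def surface_gap(df: pd.DataFrame, metric: str, high: str = "Grass", low: str = "Clay") -> float:
    """Difference in the mean of `metric` between two surfaces (high minus low)."""
    means = df.groupby("surface")[metric].mean()
    return float(means[high] - means[low])


# Mean of each serve metric per surface.
surface_means = serve.groupby("surface")[["first_win_pct", "second_win_pct"]].mean()
first_gap = surface_gap(serve, "first_win_pct")
second_gap = surface_gap(serve, "second_win_pct")

print(surface_means)
print(f"Grass - Clay gap, first serve:  {first_gap:.3f}")
print(f"Grass - Clay gap, second serve: {second_gap:.3f}")
```

A comparison of means like this only describes the data; a formal test across the three surfaces would still be needed to judge whether any gap is statistically significant.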

## Data

### Data overview

- **Dataset #1**
  - **Dataset Name:** ATP Men’s Tennis Match Results + Match Statistics (match-level dataset)
  - **Link to the dataset:** *https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_2024.csv*
  - **Number of observations:** 3,076 match observations
  - **Number of variables:**  49 columns covering tournament, players, outcomes, and stats 
  - **Description of the variables most relevant to this project:**
    - **Tournament information:** `tourney_name`, `surface`, `tourney_level`, `tourney_date`  
      Identifies where/when the match happened and key context like **surface** (Hard/Clay/Grass) and tournament tier.
    - **Player information:** `winner_name`, `loser_name`, `winner_hand`, `loser_hand`, `winner_ht`, `loser_ht`, `winner_age`, `loser_age`  
      Demographics/attributes used to compare players and control for factors like age/height/handedness.
    - **Match outcome:** `score`, `round`, `minutes`, `best_of`  
      Outcome and competitiveness proxies (round reached, duration, format best-of-3 vs best-of-5).
    - **Serve statistics:** `w_ace`, `l_ace`, `w_df`, `l_df`, `w_svpt`, `l_svpt`, `w_1stIn`, `l_1stIn`, `w_1stWon`, `l_1stWon`, `w_2ndWon`, `l_2ndWon`  
      Core performance indicators (aces, double faults, service points played, first serves in, and first- and second-serve points won).
    - **Rankings:** `winner_rank`, `loser_rank`, `winner_rank_points`, `loser_rank_points`  
      Pre-match strength indicators to study expectations, upsets, and performance vs rank.
  - **Descriptions of any shortcomings this dataset has with respect to the project:**
    - **Missing data (~7% overall):** especially **seed** (missing for many matches) and **entry type**, plus some missing match statistics and duration.
    - **Incomplete matches:** retirements (RET) and walkovers (W/O) make some match stats unreliable or absent; these matches can bias analyses of performance metrics like aces/DFs/minutes.
    - **Missing duration for a non-trivial subset:** some matches lack `minutes`, which limits analyses involving match length/competitiveness.
    - **Limited context:** no injury info, weather, court-speed variation within surfaces, motivation effects, etc., so interpretation is limited to what’s captured in match stats.
    - **Potential minor inconsistencies:** some static attributes (height/age) occasionally missing or inconsistently recorded across events.

**Combining datasets:** This project uses a single dataset, so no merging is required.
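
As a minimal sketch of how the serve columns map onto the two response variables in our research question: first-serve win% is first-serve points won divided by first serves in, and second-serve win% is second-serve points won divided by second-serve points played (service points minus first serves in). The columns `w_1stWon` / `w_2ndWon` and their `l_` counterparts come from the same file. The helper name `add_serve_win_pct`, the output column names, and the toy rows are illustrative assumptions, not our final pipeline.

```python
import numpy as np
import pandas as pd


def add_serve_win_pct(matches: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with first- and second-serve win% for winner (w_) and loser (l_)."""
    out = matches.copy()
    for side in ("w", "l"):
        first_in = out[f"{side}_1stIn"]
        # Second-serve points played = all service points minus first serves in
        # (double faults count as second-serve points lost).
        second_pts = out[f"{side}_svpt"] - first_in
        # Replace zero denominators with NaN so the ratio is NaN instead of inf.
        out[f"{side}_1st_win_pct"] = out[f"{side}_1stWon"] / first_in.replace(0, np.nan)
        out[f"{side}_2nd_win_pct"] = out[f"{side}_2ndWon"] / second_pts.replace(0, np.nan)
    return out


# Toy rows (made-up numbers, not taken from the dataset).
toy = pd.DataFrame({
    "w_svpt": [100, 80], "w_1stIn": [60, 50], "w_1stWon": [45, 40], "w_2ndWon": [20, 15],
    "l_svpt": [90, 70], "l_1stIn": [54, 0], "l_1stWon": [27, 0], "l_2ndWon": [18, 35],
})
result = add_serve_win_pct(toy)
print(result[["w_1st_win_pct", "w_2nd_win_pct", "l_1st_win_pct", "l_2nd_win_pct"]])
```

Guarding the denominators matters for walkovers and very short retirements, where a player may have zero recorded service points.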


In [5]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [1]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    {
        "url": "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2024.csv",
        "filename": "atp_matches_2024.csv",
    }
]

get_data.get_raw(datafiles, destination_directory="data/00-raw/")



Successfully downloaded: atp_matches_2024.csv





### ATP Men’s Tennis Matches – 2024 Season 

The ATP Matches 2024 dataset contains detailed match-level information for all ATP men’s professional tennis matches played between January 1 and December 31, 2024. The data comes from Jeff Sackmann’s tennis_atp repository and includes 3,076 matches and 49 variables. Each row represents a single tennis match and each column represents one variable, meaning the dataset is already in tidy format.

The dataset includes several important categories of variables. Tournament information includes surface (Hard, Clay, Grass), tournament level (Grand Slam, Masters 1000, ATP 250/500), and date. Player information includes winner and loser names, handedness, height (cm), and age (years). Match outcome variables include score, round, best_of (3 or 5 sets), and match duration in minutes.
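
Because each row holds both players' statistics (`w_` columns for the winner, `l_` columns for the loser), a per-player analysis may be easier after stacking the two sides into one row per player per match. The sketch below shows one possible way to do that; the helper name `to_player_rows`, the `won_match` flag, and the toy values are assumptions for illustration.

```python
import pandas as pd

STATS = ["svpt", "1stIn", "1stWon", "2ndWon"]


def to_player_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """Stack winner (w_) and loser (l_) serve stats into one row per player per match."""
    frames = []
    for prefix, name_col, won in (("w", "winner_name", True), ("l", "loser_name", False)):
        # Map e.g. "w_svpt" -> "svpt" so both sides share the same column names.
        cols = {f"{prefix}_{stat}": stat for stat in STATS}
        side = (
            matches[["surface", name_col, *cols]]
            .rename(columns={name_col: "player", **cols})
            .assign(won_match=won)
        )
        frames.append(side)
    return pd.concat(frames, ignore_index=True)


# Toy matches with made-up names and numbers.
toy = pd.DataFrame({
    "surface": ["Grass", "Clay"],
    "winner_name": ["Player A", "Player C"],
    "loser_name": ["Player B", "Player D"],
    "w_svpt": [90, 100], "w_1stIn": [60, 65], "w_1stWon": [48, 45], "w_2ndWon": [16, 18],
    "l_svpt": [95, 105], "l_1stIn": [58, 70], "l_1stWon": [40, 44], "l_2ndWon": [15, 16],
})
players = to_player_rows(toy)
print(players)
```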

Several performance metrics are particularly important:

- **Aces (`w_ace`, `l_ace`):** Count per match of legal serves that the returner does not touch, so the server wins the point outright. Professional players typically average 5–15 aces per match, and higher counts generally indicate stronger serving.
- **Double faults (`w_df`, `l_df`):** Count per match of service points lost because both the first and second serve missed. Professionals usually aim to stay under 3–5 per match, since each double fault gives the opponent a point.
- **Match duration (`minutes`):** Length of the match in minutes. Best-of-three matches typically last 90–150 minutes; best-of-five matches can last 3–5 hours or more.
- **ATP ranking (`winner_rank`, `loser_rank`):** Numerical ranking (1 = best) based on a rolling 52-week points system. Lower numbers indicate stronger players.

Dataset Concerns:

Approximately 7% of all values are missing (10,563 of 150,724 cells), and the missingness is not random. Most of it (9,058 values) sits in the seed and entry columns, which are blank for unseeded players and for players who entered the draw directly. Missing match statistics and durations are strongly associated with incomplete matches (retirements and walkovers): 101 matches are marked as walkover (W/O) or retirement (RET), 238 matches are missing duration data, and around 60 matches are missing serve statistics such as aces and double faults. These gaps are systematic and expected, because incomplete matches do not generate full statistics.

There are also extreme match duration values (e.g., 0 minutes for walkovers and matches exceeding 220 minutes), but these represent legitimate real-world scenarios and should not automatically be removed.
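
Because the file covers the whole 2024 tour while our question concerns Grand Slams, one possible filtering step is sketched below. It assumes that Grand Slam rows are coded `tourney_level == "G"` in this dataset and treats any score containing `RET` or `W/O` as incomplete. The helper name `select_complete_slam_matches` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

SERVE_COLS = ["w_svpt", "w_1stIn", "l_svpt", "l_1stIn"]


def select_complete_slam_matches(matches: pd.DataFrame) -> pd.DataFrame:
    """Keep Grand Slam rows that were played to completion and have serve stats."""
    is_slam = matches["tourney_level"].eq("G")  # assumption: "G" marks Grand Slams
    incomplete = matches["score"].str.contains(r"RET|W/O", na=False)
    has_stats = matches[SERVE_COLS].notna().all(axis=1)
    return matches[is_slam & ~incomplete & has_stats].copy()


# Toy rows (made-up values) covering each exclusion reason.
toy = pd.DataFrame({
    "tourney_level": ["G", "G", "G", "M", "G"],
    "surface": ["Grass", "Clay", "Hard", "Hard", "Hard"],
    "score": ["6-4 6-4 6-4", "6-3 2-0 RET", "W/O", "7-6(5) 6-4", "6-2 6-2 6-2"],
    "w_svpt": [90, 30, np.nan, 70, np.nan],
    "w_1stIn": [60, 20, np.nan, 45, np.nan],
    "l_svpt": [95, 28, np.nan, 72, np.nan],
    "l_1stIn": [58, 17, np.nan, 40, np.nan],
})

kept = select_complete_slam_matches(toy)
print(kept[["tourney_level", "surface", "score"]])
print("Rows kept:", len(kept))
```

Filtering on flags instead of dropping rows from the saved file keeps the excluded matches available for later checks, in line with the point that extreme values should not be removed automatically.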

In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

# ----------------------------
# Locate and Load Raw Data
# ----------------------------
raw_path = Path("data/00-raw/atp_matches_2024.csv")

if not raw_path.exists():
    hits = list(Path(".").rglob("atp_matches_2024.csv"))
    if not hits:
        raise FileNotFoundError(
            "Could not find atp_matches_2024.csv. "
            "Make sure you ran the download cell or placed the file in data/00-raw/."
        )
    raw_path = hits[0]

df = pd.read_csv(raw_path)

print("Loaded from:", raw_path)
print("Dataset shape (rows, columns):", df.shape)


# ----------------------------
# Check Tidiness
# ----------------------------
print("\nDuplicate rows:", df.duplicated().sum())


# ----------------------------
# Missing Data Analysis
# ----------------------------
missing_counts = df.isna().sum()
missing_percent = (missing_counts / len(df)) * 100

missing_summary = pd.DataFrame({
    "Missing_Count": missing_counts,
    "Missing_Percent": missing_percent
}).sort_values(by="Missing_Percent", ascending=False)

print("\nTop Missing Columns:")
print(missing_summary.head(15))

print("\nTotal Missing Values:", df.isna().sum().sum())


# ----------------------------
# Demonstrate Systematic Missingness
# ----------------------------
df["score_ret_wo"] = df["score"].str.contains("RET|W/O", na=False)

print("\nCross-tab: Missing Minutes vs RET/W/O")
print(pd.crosstab(df["minutes"].isna(), df["score_ret_wo"], margins=True))


# ----------------------------
# Flag Incomplete Matches
# ----------------------------
stat_cols = ["minutes", "w_ace", "l_ace", "w_df", "l_df"]

df["incomplete_match"] = (
    df["score"].str.contains("RET|W/O", na=False) |
    df[stat_cols].isna().any(axis=1)
)

print("\nNumber of Incomplete Matches:", df["incomplete_match"].sum())


# ----------------------------
# Outlier Detection (Match Duration IQR)
# ----------------------------
minutes_clean = df["minutes"].dropna()

Q1 = minutes_clean.quantile(0.25)
Q3 = minutes_clean.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[
    (df["minutes"] < lower_bound) |
    (df["minutes"] > upper_bound)
]

print("\nMatch Duration IQR Bounds:", lower_bound, upper_bound)
print("Number of Duration Outliers:", len(outliers))


# ----------------------------
# Save Processed Dataset
# ----------------------------
processed_path = Path("data/02-processed/atp_matches_2024_processed.csv")
processed_path.parent.mkdir(parents=True, exist_ok=True)

df.to_csv(processed_path, index=False)

print("\nProcessed dataset saved to:", processed_path)


# ----------------------------
# Summary Statistics
# ----------------------------
print("\nMatch Duration Summary:")
print(df["minutes"].describe())

print("\nWinner Aces Summary:")
print(df["w_ace"].describe())

print("\nWinner Rank Summary:")
print(df["winner_rank"].describe())

Loaded from: data/00-raw/atp_matches_2024.csv
Dataset shape (rows, columns): (3076, 49)

Duplicate rows: 0

Top Missing Columns:
              Missing_Count  Missing_Percent
winner_entry           2599        84.492848
loser_entry            2358        76.657997
loser_seed             2319        75.390117
winner_seed            1782        57.932380
minutes                 238         7.737321
l_SvGms                  61         1.983095
w_SvGms                  61         1.983095
l_svpt                   60         1.950585
w_svpt                   60         1.950585
w_df                     60         1.950585
w_ace                    60         1.950585
w_2ndWon                 60         1.950585
w_bpSaved                60         1.950585
w_bpFaced                60         1.950585
l_ace                    60         1.950585

Total Missing Values: 10563

Cross-tab: Missing Minutes vs RET/W/O
score_ret_wo  False  True   All
minutes                        
False          2750

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> This project uses publicly available professional tennis match statistics, so no direct human subjects are involved and informed consent is not required.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Our data only includes men’s singles Grand Slam main-draw matches, which may bias results toward elite players and exclude lower-ranked or qualifying-round competitors.
> 
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> We only use publicly reported match-level performance statistics and player identifiers already in the public domain, without collecting any additional personal or sensitive information.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> All datasets are stored locally in course-restricted environments and consist solely of publicly available sports statistics, minimizing security and privacy risks.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
       
> The data will be retained only for the duration of the course project and deleted afterward since it is not needed beyond the academic analysis.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
       
> Our analysis focuses on quantitative match outcomes and does not incorporate player, coaching, or contextual perspectives, which may limit interpretation of causal mechanisms.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We acknowledge potential biases arising from surface-specific tournament conditions, player specialization, and unequal match counts across surfaces and years.
 
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Visualizations and summary statistics are designed to accurately reflect observed serve win percentages without exaggerating surface differences.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> No private or sensitive information is used or displayed in the analysis beyond public professional match statistics.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> All data cleaning and analysis steps are documented in reproducible notebooks to allow future review and verification of results.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We carefully select first-serve and second-serve win percentages as core metrics, recognizing they capture serve effectiveness but not all aspects of match performance.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We explicitly communicate that our findings are correlational and limited to recent men’s Grand Slam matches, not causal or universally generalizable.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 


Communication & Response Time: We will use WeChat for daily updates and GitHub for technical work. We expect everyone to reply to messages within 12 hours usually, and faster if a deadline is close.

Meetings: We will meet once a week (in-person or virtual). If online, we keep cameras on to stay engaged. If a member can't make it, they need to tell the group 24 hours in advance and post an update so the team isn't blocked.

Decision Making: We aim for consensus. If we disagree on a technical choice, we will vote. If it's a tie or a member isn't responding during a deadline rush (after 6 hours), the active members will make the decision to keep the project moving.

Fair Contribution: To get an "A", everyone must work on all parts of the project (data, analysis, coding, and writing). No one will just "write the report" or just "write the code." We will review each other's work to ensure everyone understands the whole project.

Managing Tasks: We track everything on GitHub Issues. A task isn't "started" until it's assigned on GitHub, and it's not "done" until the Pull Request is reviewed and merged.

Getting Stuck (The 2.5-Hour Rule): If a member is stuck on a problem for more than 2.5 hours, they must tell the team on WeChat immediately instead of waiting for the next meeting. We encourage asking for help early rather than hiding the issue until the deadline.

Tone & Conflict: We will keep a polite tone. We critique the code and ideas, not the person. If we have a conflict, we will get on a call to resolve it instead of arguing over text.

Accountability: If a member misses a deadline or "ghosts" the team, we will send a written warning requiring a specific fix within 1 week. If they still don't contribute, we will escalate the issue to the professor by Week 7.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2/24 | Wed 5 PM | Perform individual EDA (visualizations, distributions, correlations); Push code to GitHub. | Review EDA results together; Refine hypothesis based on data; Discuss analysis strategy. |
| 3/3 | Wed 5 PM | Finalize EDA graphs and analysis; Draft Checkpoint #2 text. | Submit EDA Checkpoint #2 (Due 3/4); Plan final Machine Learning/Statistical Analysis approach. |
| 3/10 | Wed 5 PM | Complete core analysis/modeling; Draft "Results" and "Discussion" sections. | Code review of analysis; Peer edit writing; Discuss "Ethics" and "Conclusion" sections. |
| 3/15 | Sun 5 PM | Complete full draft of Final Project Notebook; Run "Restart & Run All" to check for errors. | Final polish; Check for typos/grammar; Record Video Presentation (if required); Submit Final Project (Due 3/18). |
| 3/20 | Before 11:59 PM | NA | Turn in final project & course survey |