# COGS 108 - Project Proposal

## Authors

- Minghao Xu: Project administration, Methodology, Analysis, Experimental investigation
- Jerry Chen: Analysis, Software, Data curation, Methodology
- Eli Liang: Visualization, Conceptualization, Writing - review & editing
- William Wu: Analysis, Background research, Writing - original draft
- Weder Qin: Software, Visualization, Writing - review & editing

## Research Question

To what degree does court surface (clay, grass, or hard) affect first-serve and second-serve win percentages in men's singles Grand Slam main-draw matches from 2023 to 2025?

## Background and Prior Work

Court surface can change how much advantage a server gets because bounce height, skid, and pace affect the returner’s reaction time and how quickly rallies “reset” after the serve. In Grand Slams, this creates a natural comparison across hard (Australian Open / US Open), clay (French Open), and grass (Wimbledon) to test whether first-serve win% and second-serve win% differ by surface in recent elite matches (2023–2025).

Prior work motivates our focus on first- and second-serve point outcomes. A widely shared modeling project shows that first-serve points won and second-serve points won are among the most informative serving statistics for predicting match outcomes, supporting their use as core response variables in our study.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Meanwhile, serve-focused research using Australian Open data shows that outcomes like aces depend on multiple serve attributes (not just raw speed), reinforcing why we measure overall point win% rather than relying on aces alone.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Consistent with that, tennis commentary analyses caution that aces are only one small component of winning and recommend broader serve-point measures as better indicators of serve dominance.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Building on these ideas, our project will estimate how much surface (clay/grass/hard) shifts first- and second-serve win% using match-level Grand Slam data from 2023–2025 (e.g., open datasets scraped from Slam sites).<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Kokta, M. (Medium). *Predicting ATP Tennis Match Outcomes Using Serving Statistics.* https://medium.com/swlh/predicting-atp-tennis-match-outcomes-using-serving-statistics-cde03d99f410  
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Whiteside, D., et al. (2017). *Spatial characteristics of professional tennis serves with implications for serving aces: A machine learning approach.* *Journal of Sports Sciences.* https://www.tandfonline.com/doi/full/10.1080/02640414.2016.1183805  
3. <a name="cite_note-3"></a> [^](#cite_ref-3) TalkingTennis.net (Jan 24, 2025). *How important are aces in winning tennis matches?* https://talkingtennis.net/blog-posts/how-important-are-aces-in-winning-tennis-matches  
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Sackmann, J. (ongoing). *Grand Slam point-by-point data (scraped from Slam websites).* https://github.com/JeffSackmann/tennis_slam_pointbypoint


## Hypothesis


We predict that court surface significantly influences serve win percentages, with the highest first-serve win percentages occurring on grass and the lowest on clay, while second-serve win percentages will remain relatively stable across all surfaces. Because first serves usually prioritize speed while second serves prioritize consistency, we expect surface to affect first-serve outcomes more than second-serve outcomes. Court Pace Rating (CPR) is generally highest on grass courts and lowest on clay courts, so we expect grass to give servers the largest boost in first-serve win percentage.
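One way this hypothesis could be checked is with a permutation test on the difference in mean first-serve win percentage between two surfaces. The sketch below is only an illustration under stated assumptions: the proposal does not commit to a specific test, the function name `permutation_test` is ours, and the numbers in `grass` and `clay` are made-up placeholders rather than real match data.

```python
import random
from statistics import mean


def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means between a and b.

    Returns the observed difference (mean(a) - mean(b)) and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Shuffle surface labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Placeholder first-serve win percentages (NOT real data).
grass = [78.0, 74.5, 80.1, 76.3, 79.2]
clay = [70.2, 68.9, 73.4, 71.0, 69.5]

observed_diff, p_value = permutation_test(grass, clay)
print(f"grass - clay difference: {observed_diff:.2f} points, p = {p_value:.4f}")
```

The same function could be applied to second-serve win percentages to see whether they are as stable across surfaces as we expect. Rows from the same player are not independent, so a real analysis would likely need to account for repeated players.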

## Data

To answer our question of how court surface (clay, grass, or hard) affects serve efficiency (first-serve and second-serve win %) in men’s singles Grand Slams (2023–2025), the ideal dataset is match-level and restricted to the main draw.

Key variables (per match, per player):
- Tournament, year, round, surface
- Player IDs, match outcome, score, match flags (RET/W/O)
- Serve efficiency stats: 1st serve in % (or counts), 1st-serve points won %, 2nd-serve points won %, plus optional aces/double faults
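
As a minimal sketch of how the serve-efficiency percentages could be derived when a source provides counts rather than percentages, the example below computes them from four counts. The class and field names (`ServeLine`, `serve_points`, `first_in`, `first_won`, `second_won`) are hypothetical and would need to be mapped to whatever columns the chosen dataset actually uses. It assumes the common convention that every service point with a missed first serve is a second-serve point, so double faults count as second-serve points lost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeLine:
    """One player's serve counts in one match (field names are hypothetical)."""
    serve_points: int  # total points served
    first_in: int      # first serves that landed in
    first_won: int     # points won behind a first serve
    second_won: int    # points won behind a second serve


def pct(won: int, played: int) -> Optional[float]:
    """Return won/played as a percentage, or None when no points were played."""
    return 100.0 * won / played if played > 0 else None


def serve_win_pcts(line: ServeLine) -> dict:
    """Compute first-serve-in %, first-serve win %, and second-serve win %."""
    # Second-serve points are all service points where the first serve missed,
    # so double faults are included as second-serve points lost.
    second_played = line.serve_points - line.first_in
    return {
        "first_in_pct": pct(line.first_in, line.serve_points),
        "first_won_pct": pct(line.first_won, line.first_in),
        "second_won_pct": pct(line.second_won, second_played),
    }


example = ServeLine(serve_points=100, first_in=60, first_won=45, second_won=20)
print(serve_win_pcts(example))
```

Returning `None` for an empty denominator (for example, a walkover with no points played) keeps such rows from silently becoming 0% and makes them easy to filter out later.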

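How many observations:
All main-draw men’s singles matches from the 2023 and 2024 Grand Slams plus the 2025 Australian Open. Each 128-player main draw produces 127 matches, so these nine tournaments give up to 1,143 matches, or roughly 2,286 player-match rows before retirements and walkovers are removed.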

Collection & organization:
We will collect the data by scraping or exporting Tennis Abstract-style match tables, excluding qualifying rounds. All raw data will be stored as tidy CSV files.



Potential real datasets we can use:
1) Jeff Sackmann — Grand Slam point-by-point (GitHub)
Available on GitHub as the `tennis_slam_pointbypoint` repository. We can parse the point-level records and aggregate them to one row per player per match to compute first- and second-serve points won by surface. Its main advantage is that the data are already structured.

2) Tennis Abstract — free results/stats databases
Tennis Abstract has free ATP/WTA results and stats databases.
We will need to clean and join data across files, and some stats may be easier to scrape than to export directly. Almost all of the core variables we need are included.

3) Kaggle — ATP matches (2000–2019) for prototyping
A public Kaggle dataset can help us prototype cleaning and analysis quickly.
Access only requires a Kaggle account, and the data download as CSV files. The dataset includes tournament metadata and match statistics fields that we can map to serve-efficiency measures, but it does not cover 2023–2025, so we would use it only for prototyping.
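
For the point-by-point option (dataset 1), the aggregation step could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's actual schema: the record keys (`match_id`, `slam`, `server`, `winner`, `serve_number`) and the slam labels in `SURFACE` are hypothetical and would need to be mapped to the real column names and file names. It assumes `serve_number` is 1 when the point was played on the first serve and 2 otherwise, with a double fault recorded as a second-serve point won by the returner.

```python
from collections import defaultdict

# Surface of each Grand Slam, as described in the Background section.
# The keys are hypothetical tournament labels.
SURFACE = {
    "ausopen": "Hard",
    "usopen": "Hard",
    "frenchopen": "Clay",
    "wimbledon": "Grass",
}


def aggregate_points(points):
    """Collapse point-level rows into one serve summary per (match, server)."""
    totals = defaultdict(
        lambda: {"first_played": 0, "first_won": 0,
                 "second_played": 0, "second_won": 0}
    )
    for p in points:
        key = (p["match_id"], p["slam"], p["server"])
        prefix = "first" if p["serve_number"] == 1 else "second"
        totals[key][prefix + "_played"] += 1
        if p["winner"] == p["server"]:
            totals[key][prefix + "_won"] += 1
    # One output row per player per match, tagged with the court surface.
    return [
        {"match_id": match_id, "surface": SURFACE[slam], "server": server, **t}
        for (match_id, slam, server), t in totals.items()
    ]


# Tiny made-up example (NOT real data): four points from one Wimbledon match.
points = [
    {"match_id": "m1", "slam": "wimbledon", "server": 1, "winner": 1, "serve_number": 1},
    {"match_id": "m1", "slam": "wimbledon", "server": 1, "winner": 2, "serve_number": 2},
    {"match_id": "m1", "slam": "wimbledon", "server": 2, "winner": 2, "serve_number": 1},
    {"match_id": "m1", "slam": "wimbledon", "server": 1, "winner": 1, "serve_number": 1},
]
rows = aggregate_points(points)
for row in rows:
    print(row)
```

In the project notebook the same logic would probably be written as a pandas `groupby`, but the plain-Python version makes the counting rule explicit.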
  

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> This project uses publicly available professional tennis match statistics, so no direct human subjects are involved and informed consent is not required.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Our data include only men’s singles Grand Slam main-draw matches, which may bias results toward elite players and exclude lower-ranked or qualifying-round competitors.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> We only use publicly reported match-level performance statistics and player identifiers already in the public domain, without collecting any additional personal or sensitive information.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> All datasets are stored locally in course-restricted environments and consist solely of publicly available sports statistics, minimizing security and privacy risks.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
       
> The data will be retained only for the duration of the course project and deleted afterward since it is not needed beyond the academic analysis.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
       
> Our analysis focuses on quantitative match outcomes and does not incorporate player, coaching, or contextual perspectives, which may limit interpretation of causal mechanisms.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We acknowledge potential biases arising from surface-specific tournament conditions, player specialization, and unequal match counts across surfaces and years.
 
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Visualizations and summary statistics are designed to accurately reflect observed serve win percentages without exaggerating surface differences.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> No private or sensitive information is used or displayed in the analysis beyond public professional match statistics.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> All data cleaning and analysis steps are documented in reproducible notebooks to allow future review and verification of results.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We carefully select first-serve and second-serve win percentages as core metrics, recognizing they capture serve effectiveness but not all aspects of match performance.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 


Communication & Response Time: We will use WeChat for daily updates and GitHub for technical work. We expect everyone to reply to messages within 12 hours in general, and sooner when a deadline is close.

Meetings: We will meet once a week (in-person or virtual). If online, we keep cameras on to stay engaged. If a member can't make it, they need to tell the group 24 hours in advance and post an update so the team isn't blocked.

Decision Making: We aim for consensus. If we disagree on a technical choice, we will vote. If the vote is tied, or if a member has not responded within 6 hours during a deadline rush, the active members will make the decision to keep the project moving.

Fair Contribution: To get an "A", everyone must work on all parts of the project (data, analysis, coding, and writing). No one will just "write the report" or just "write the code." We will review each other's work to ensure everyone understands the whole project.

Managing Tasks: We track everything on GitHub Issues. A task isn't "started" until it's assigned on GitHub, and it's not "done" until the Pull Request is reviewed and merged.

Getting Stuck (The 2.5-Hour Rule): If a member is stuck on a problem for more than 2.5 hours, they must tell the team on WeChat immediately instead of waiting for the next meeting. We encourage asking for help early rather than hiding the issue until the deadline.

Tone & Conflict: We will use a polite tone. We critique the code and ideas, not the person. If we have a conflict, we will get on a call to resolve it instead of arguing over text.

Accountability: If a member misses a deadline or "ghosts" the team, we will send a written warning requiring a specific fix within 1 week. If they still don't contribute, we will escalate the issue to the professor by Week 7.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 1/27 | Wed 5 PM | Read COGS 108 expectations; Brainstorm 3 potential topics; Search for initial datasets. | Decide on final Research Question; Vote on topic; Assign background research tasks. |
| 2/3 | Wed 5 PM | Draft Hypothesis and Dataset description; Complete Project Proposal drafts. | Finalize & Submit Project Proposal (Due 2/4); Assign data search & initial cleaning roles. |
| 2/10 | Wed 5 PM | Download datasets; Begin basic data cleaning (handling missing values); Set up GitHub Repo structure. | Review data quality; Discuss cleaning challenges; Plan for Checkpoint #1 (Data). |
| 2/17 | Wed 5 PM | Finalize Data Cleaning & Wrangling code; Draft Checkpoint #1 text. | Submit Data Checkpoint #1 (Due 2/18); Assign Exploratory Data Analysis (EDA) tasks to all members. |
| 2/24 | Wed 5 PM | Perform individual EDA (visualizations, distributions, correlations); Push code to GitHub. | Review EDA results together; Refine hypothesis based on data; Discuss analysis strategy. |
| 3/3 | Wed 5 PM | Finalize EDA graphs and analysis; Draft Checkpoint #2 text. | Submit EDA Checkpoint #2 (Due 3/4); Plan final Machine Learning/Statistical Analysis approach. |
| 3/10 | Wed 5 PM | Complete core analysis/modeling; Draft "Results" and "Discussion" sections. | Code review of analysis; Peer edit writing; Discuss "Ethics" and "Conclusion" sections. |
| 3/15 | Sun 5 PM | Complete full draft of Final Project Notebook; Run "Restart & Run All" to check for errors. | Final polish; Check for typos/grammar; Record Video Presentation (if required); Submit Final Project (Due 3/18). |
| 3/20 | Before 11:59 PM | NA | Turn in final project & course survey |