# COGS 108 - Project Proposal

## Authors

- **Oscar Khaing**: Conceptualization, Background research, Methodology, Writing – original draft, Project administration  
- **Camila**: Data curation, Software  
- **Carlos**: Analysis, Visualization  
- **Abigail Chang**: Background research, Writing – review & editing  
- **Alyssa**: Experimental investigation, Writing – review & editing  

## Research Question

**How does court surface (Hard, Clay, Grass) affect (1) upset probability and (2) the predictive power of player rankings in professional tennis matches?**

Specifically, we will examine:

- Whether the probability of an upset (lower-ranked player defeating higher-ranked player) differs significantly across surfaces.

- Whether ranking difference predicts match outcomes equally well on all surfaces.

**Key variables** include:

- Surface (categorical)

- Rank_1, Rank_2 (used to compute ranking difference)

- Winner (binary outcome)

- Odd_1, Odd_2 (betting odds, used as a secondary comparison baseline)

**Controls**
* **Tournament Type** (Grand Slams, Masters 1000, ATP 500/250): controlling for tournament tier helps isolate whether upsets are driven by the court surface itself or by differing levels of pressure and player motivation.

* **Match Format** (Best of 3 vs. Best of 5): Grand Slams use best-of-5 matches, while standard ATP events are best-of-3, so we will include the `Best of` column as a control variable. Longer matches typically reduce the probability of an upset by giving the higher-ranked player more time to recover from a slow start.

This project combines statistical inference (testing differences in upset rates across surfaces) with a prediction task (logistic regression models estimating match outcomes from ranking differences, with and without surface information). Model performance metrics such as accuracy and AUC will be used to quantify how much predictive power surface adds.
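As a rough sketch of this comparison (not our final pipeline), the snippet below fits one logistic regression on ranking difference alone and a second one that allows a separate ranking-difference slope for each surface, then reports accuracy and AUC for both. The data are synthetic placeholders generated in the code, the per-surface slopes are arbitrary values chosen only so the example runs, and the names `rank_diff`, `p1_won`, `make_synthetic_matches`, and `evaluate` are hypothetical. Encoding surface as an interaction with ranking difference is one possible design we are assuming here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic_matches(n=4000, seed=0):
    """Generate placeholder matches; the slopes are arbitrary, not estimates."""
    rng = np.random.default_rng(seed)
    surface = rng.choice(["Hard", "Clay", "Grass"], size=n)
    rank_diff = rng.integers(-150, 151, size=n)  # Rank_2 - Rank_1
    slope = pd.Series(surface).map(
        {"Hard": 0.02, "Clay": 0.01, "Grass": 0.03}
    ).to_numpy()
    p1_win_prob = 1.0 / (1.0 + np.exp(-slope * rank_diff))
    p1_won = (rng.random(n) < p1_win_prob).astype(int)
    return pd.DataFrame(
        {"Surface": surface, "rank_diff": rank_diff, "p1_won": p1_won}
    )


def evaluate(X, y, seed=0):
    """Fit a logistic regression and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, prob),
    }


matches = make_synthetic_matches()
y = matches["p1_won"]

# Model 1: ranking difference only.
X_rank_only = matches[["rank_diff"]]

# Model 2: one ranking-difference column per surface (surface-specific slopes).
surface_dummies = pd.get_dummies(matches["Surface"], dtype=float)
X_rank_by_surface = surface_dummies.mul(matches["rank_diff"], axis=0)

rank_only = evaluate(X_rank_only, y)
rank_by_surface = evaluate(X_rank_by_surface, y)
print("rank only:      ", rank_only)
print("rank by surface:", rank_by_surface)
```

On the real data, the gap between the two sets of metrics would be our measure of how much surface adds. Any numbers printed from this synthetic example carry no meaning about tennis.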


## Background and Prior Work

Professional tennis is played across multiple court surfaces—primarily hard, clay, and grass—and these surfaces materially change how matches unfold. The International Tennis Federation (ITF) formalizes these differences through its *Court Pace Rating* system, which classifies surfaces based on measured ball speed and bounce characteristics. These classifications reflect meaningful physical differences in gameplay, such as rally length and serve effectiveness, and provide an official basis for treating surface as a structural variable in match outcomes.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Sports science research supports the idea that surface type alters match dynamics. A recent large-scale match analysis study found statistically significant differences in rally structure, point duration, and serve impact across clay, grass, and hard courts, demonstrating that player performance profiles shift depending on surface conditions.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) These findings suggest that competitive balance may vary by surface, motivating investigation into whether certain environments produce greater outcome variability.

From a predictive modeling perspective, Klaassen and Magnus introduced probabilistic approaches for forecasting tennis match winners, showing that match outcomes can be effectively modeled using player performance indicators.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Their work provides a foundation for treating tennis outcomes as a statistical inference and prediction task, although it does not explicitly focus on surface-specific effects.

More directly related to our project goals, Kovalchik evaluated multiple tennis prediction methods and demonstrated that model accuracy depends strongly on how player strength is represented, highlighting limitations of relying on a single global ranking signal.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) This motivates examining whether ranking differences carry equal predictive meaning across different surfaces, rather than assuming rankings summarize player ability uniformly.

Applied analytics platforms have operationalized this idea through surface-aware rating systems. Tennis Abstract maintains separate Elo ratings for hard, clay, and grass courts and reports improved forecasting performance compared to a single universal rating.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) Additionally, Tennis Abstract has shown that match predictability varies by surface, with some surfaces consistently producing higher variance outcomes.<a name="cite_ref-6"></a>[<sup>6</sup>](#cite_note-6) These findings align closely with our project’s focus on comparing upset rates and ranking predictability across surfaces.

Building on this prior work, our project integrates descriptive statistics and interpretable modeling to examine (1) whether upset probabilities differ by surface and (2) whether player ranking differences predict match outcomes equally well across environments. Unlike many existing analyses that emphasize either descriptive trends or pure prediction, we aim to combine statistical testing with model-based evaluation to better understand how surface mediates competitive structure in professional tennis.

### References

1. <a name="cite_note-1"></a> [^](#cite_ref-1) International Tennis Federation. *Classified Court Surfaces*.  
https://www.itftennis.com/en/about-us/tennis-tech/classified-surfaces/

2. <a name="cite_note-2"></a> [^](#cite_ref-2) González-Rodríguez et al. (2022). *Match analysis across tennis surfaces*. International Journal of Environmental Research and Public Health.  
https://www.mdpi.com/1660-4601/19/13/7955

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Klaassen, F. & Magnus, J. (2003). *Forecasting the winner of a tennis match*. European Journal of Operational Research.  
https://www.sciencedirect.com/science/article/abs/pii/S0377221702006823

4. <a name="cite_note-4"></a> [^](#cite_ref-4) Kovalchik, S. (2016). *Searching for the GOAT of tennis win prediction*. Journal of Quantitative Analysis in Sports.  
https://vuir.vu.edu.au/34652/1/jqas-2015-0059.pdf

5. <a name="cite_note-5"></a> [^](#cite_ref-5) Tennis Abstract. *An Introduction to Tennis Elo*.  
https://www.tennisabstract.com/blog/2019/12/03/an-introduction-to-tennis-elo/

6. <a name="cite_note-6"></a> [^](#cite_ref-6) Tennis Abstract. *Unpredictable Bounces, Predictable Results*.  
https://www.tennisabstract.com/blog/2017/06/23/unpredictable-bounces-predictable-results/




## Hypothesis


We hypothesize that:

1. Upset rates will be significantly higher on clay courts than on hard or grass courts.

2. Ranking difference will be less predictive of match outcomes on clay compared to hard and grass surfaces.

Our reasoning is that clay courts slow down play and reduce the advantage of powerful serves, allowing lower-ranked players to remain competitive through longer rallies and defensive consistency. This should increase variance in outcomes and weaken the predictive signal of rankings. Conversely, grass and hard courts are expected to amplify skill and power differences, making rankings more informative predictors.
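One way we could test the first hypothesis is a chi-square test of independence between surface and upset status. The sketch below assumes a match table with a `Surface` column and a binary `upset` column (a hypothetical derived column), and it runs on synthetic data in which the upset rate is the same on every surface. The surface proportions and the upset rate are made-up placeholders, and the helper name `compare_upset_rates` is ours.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def compare_upset_rates(df):
    """Cross-tabulate surface against upset and test for independence."""
    table = pd.crosstab(df["Surface"], df["upset"])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    rates = df.groupby("Surface")["upset"].mean()
    return {
        "table": table,
        "rates": rates,
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
    }


# Synthetic placeholder data: NOT real match results.
rng = np.random.default_rng(0)
n = 3000
demo = pd.DataFrame(
    {
        "Surface": rng.choice(["Hard", "Clay", "Grass"], size=n, p=[0.5, 0.3, 0.2]),
        "upset": (rng.random(n) < 0.35).astype(int),
    }
)

result = compare_upset_rates(demo)
print(result["rates"])
print("chi2 =", result["chi2"], "p =", result["p_value"], "dof =", result["dof"])
```

With three surfaces and a binary outcome the test has two degrees of freedom. If the overall test is significant on the real data, pairwise comparisons (for example clay vs. hard) with a multiple-comparison correction would be a natural follow-up.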

## Data

#### Ideal Dataset

The ideal dataset would include:

1. **Variables:**
   - Match outcome (winner)
   - Player rankings
   - Surface type
   - Betting odds
   - Match score
   - Tournament metadata (round, court type)
   - Match Duration (minutes)

2. **Observations:**
   - At least 20,000 professional matches across multiple seasons to ensure statistical power.

3. **Collection:**
   - Official ATP match records combined with bookmaker odds.

4. **Storage:**
   - A structured tabular format (CSV or relational database), one row per match.

---

#### Real Dataset

We will use the ATP Tennis Matches dataset from Kaggle:

Source: https://www.kaggle.com/datasets/dissfya/atp-tennis-2000-2023daily-pull

This dataset contains 60,000+ professional men’s tennis matches from 2000–2023 with the following 17 columns:

Tournament, Date, Series, Court, Surface, Round, Best of, Player_1, Player_2, Winner, Rank_1, Rank_2, Pts_1, Pts_2, Odd_1, Odd_2, Score.

Key variables for our project include Surface, Rank_1, Rank_2, Winner, and the betting odds (Odd_1, Odd_2). The dataset is publicly available and requires no special permissions. We will compute derived features such as ranking difference and upset indicators during preprocessing.
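A minimal sketch of how the derived features might be computed with pandas is shown below. It assumes that `Winner` holds the winning player's name (matching either `Player_1` or `Player_2`) and that missing or non-positive ranks indicate an unranked player. Both assumptions need to be verified against the actual file. The new column names (`rank_diff`, `p1_won`, `upset`), the helper `add_derived_features`, and the toy rows are hypothetical.

```python
import pandas as pd


def add_derived_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Add rank_diff, p1_won, and upset columns to a one-row-per-match table."""
    df = matches.copy()
    for col in ("Rank_1", "Rank_2"):
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Assumption: ranks must be positive; equal ranks have no defined favorite.
    valid = (df["Rank_1"] > 0) & (df["Rank_2"] > 0) & (df["Rank_1"] != df["Rank_2"])
    df = df.loc[valid].copy()

    # Positive rank_diff means Player_1 is the better-ranked (lower number) player.
    df["rank_diff"] = df["Rank_2"] - df["Rank_1"]
    df["p1_won"] = (df["Winner"] == df["Player_1"]).astype(int)

    # An upset occurs when the better-ranked player loses.
    p1_is_favorite = df["rank_diff"] > 0
    df["upset"] = (p1_is_favorite != (df["p1_won"] == 1)).astype(int)
    return df


# Toy rows with placeholder player names, for illustration only.
toy = pd.DataFrame(
    {
        "Player_1": ["A", "C", "E", "G"],
        "Player_2": ["B", "D", "F", "H"],
        "Winner": ["A", "D", "E", "H"],
        "Rank_1": [5, 10, 80, -1],
        "Rank_2": [40, 60, 12, 30],
        "Surface": ["Hard", "Clay", "Grass", "Hard"],
    }
)

features = add_derived_features(toy)
print(features[["Surface", "rank_diff", "p1_won", "upset"]])
```

In the toy table, the fourth row is dropped because of its non-positive rank, and the second and third rows are flagged as upsets because the better-ranked player lost.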

---

#### Ideal VS Real

* **Comparison** - Our real dataset is well suited to the project because it contains tens of thousands of match results with player rankings. However, it differs from our ideal dataset because the Kaggle data do not include match duration. Because this detail is missing, we will instead rely on court surface and match format as our main variables for studying how different surfaces affect upset probability.


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 > The dataset includes only professional men’s ATP matches, which introduces a selection bias toward elite male athletes and excludes women’s tennis and lower-level competitions. This limits generalizability of our findings. We will explicitly state this limitation and avoid making claims beyond this population.
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
 > We recognize that if this data were used by scouts, it could create a bias against players who have lower overall rankings but higher win rates on specific surfaces.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 > Since this data comes from public ATP records, the risk to individual privacy is low, but we will delete any processed personal identifiers upon request from the data source.
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
 > The dataset will be deleted locally after project completion unless required for future coursework.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 > Our analysis focuses on quantitative match outcomes and does not capture player experiences, coaching strategies, or injury status. We acknowledge that these unobserved factors may influence results and will avoid over-interpreting causal relationships.
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 > We will examine class imbalance across surfaces and ranking differences and report these distributions explicitly. We also recognize omitted variables such as player fitness, weather, and recent form that may confound results.
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 > We will check whether our model is as accurate for younger players as it is for veterans. This helps confirm that any differences we observe reflect the surface itself rather than an older player's greater experience on certain courts.
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 > As this is an academic proposal, we have no current plans for deployment. If we were to deploy, we would provide a README file clearly stating that the model is intended for exploring patterns only and is not accurate enough for financial decisions or gambling.
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 > If we were ever to deploy this repository, we would provide a contact email in the README so that anyone can report errors or harms they see in our analysis.
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 > Since this will be hosted on GitHub, we can "roll back" by deleting the repository or making it private if we find that the model is being misused or contains significant errors.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
 > Although this project is academic and not deployed, we acknowledge that predictive sports models can be misused for gambling purposes. We will frame our results as exploratory and educational rather than as actionable betting advice.


## Team Expectations

* Team Expectation 1 – Communication & Responsiveness
  > All members agree to communicate primarily through a shared messaging platform (e.g., Discord or Slack) and respond to messages within 24 hours whenever possible. Meeting times will be respected, and advance notice will be given if someone cannot attend.

* Team Expectation 2 – Accountability & Task Ownership
  > Each member is responsible for completing their assigned tasks by agreed-upon deadlines. If someone anticipates difficulties meeting a deadline, they will notify the group early so responsibilities can be adjusted.

* Team Expectation 3 – Respectful Collaboration
  > We will treat each other with respect, listen to differing viewpoints, and provide constructive feedback. In the event of conflict or disagreement, we will address concerns directly and professionally, prioritizing project success and team well-being.

* Team Expectation 4 – Shared Quality Standards
  > All members agree to contribute meaningfully to analysis, documentation, and presentation quality. Draft work will be reviewed by at least one other teammate before final submission.

By including each member’s name in the team list and submission, we affirm that we have read the COGS108 Team Policies and commit to upholding these expectations.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| Feb 4 | 7 PM | Finalize proposal | Submit proposal; divide tasks for wrangling, EDA, modeling |
| Feb 18 | Before 11:59 PM | Data wrangling & cleaning | Data Checkpoint Due |
| Feb 22 | 7 PM | Initial EDA (distributions, surface breakdowns, upset rates) | Review EDA; refine analysis plan |
| Mar 4 | Before 11:59 PM | Complete EDA visualizations | EDA Checkpoint Due |
| Mar 7 | 7 PM | Begin modeling (logistic regression baseline + surface models) | Review model performance; iterate |
| Mar 12 | 7 PM | Finalize analysis; draft results & discussion | Edit full project; prepare video |
| Mar 18 | Before 11:59 PM | NA | Final Project + Video Due |

No special resources beyond standard Python data science libraries (pandas, seaborn, scikit-learn) are anticipated.
