# COGS 108 - Project Proposal

## Authors

Khushi: Conceptualization, Background research, Writing – original draft  
Valmik: Data curation, Methodology, Analysis  
Aanya: Visualization, Writing – review & editing  
Alexa: Project administration, Software

## Research Question

This project investigates whether including court surface type improves the prediction of match outcomes in professional tennis beyond what player ranking difference alone provides.
We will examine this relationship using key variables including match outcome, player ranking difference, and court surface (hard, clay, grass).
This research question is answerable using observational match-level data and statistical analysis.
Understanding this relationship is important because it helps determine whether playing conditions meaningfully affect match predictability, which can improve performance analysis, forecasting models, and strategic decision-making in professional tennis.

## Background and Prior Work

Tennis is one of the world's most popular and competitive sports, with professional matches played across diverse environmental conditions that can significantly influence player performance and match outcomes. Among these conditions, court surface type—primarily hard courts, clay courts, and grass courts—has long been recognized as a fundamental factor that shapes playing styles, strategic approaches, and competitive advantages. Understanding how court surface affects match outcomes is not only important for players, coaches, and analysts seeking to optimize performance and develop effective strategies, but also for improving the accuracy of predictive models used in sports analytics, broadcasting, and tournament planning. As professional tennis continues to generate vast amounts of data and increasingly relies on statistical modeling for decision-making, investigating whether court surface type meaningfully contributes to match outcome prediction represents a critical area of research with practical applications across multiple stakeholders in the sport.

Prior research has established that court surface type significantly influences match outcomes and prediction accuracy in professional tennis.<a name="cite_ref-1"></a><sup>1</sup> McHale and Morton (2011) demonstrated that models incorporating surface information alongside match scores and play data achieved higher forecasting accuracy than ranking-based models alone, though they did not account for dependencies between factors.<a name="cite_ref-2"></a><sup>2</sup> More recent studies have developed surface-specific player ratings using network analysis and machine learning approaches, with models trained on over 83,000 matches showing that surface-specific features improve prediction accuracy beyond classical approaches.<a name="cite_ref-3"></a><sup>3</sup> The physical mechanisms underlying these effects are well-documented: grass courts produce fast, low bounces that favor aggressive serve-and-volley play and shorter rallies; clay courts generate slower, higher bounces that extend rally length and favor baseline players with strong defensive skills; and hard courts provide medium-speed conditions with predictable bounces that accommodate diverse playing styles.<a name="cite_ref-4"></a><sup>4</sup> Research using dynamic models has shown that elite players like Rafael Nadal and Roger Federer exhibit measurably different strength levels across surface types, with prediction models achieving approximately 70% accuracy when incorporating surface-specific player abilities.<a name="cite_ref-5"></a><sup>5</sup> However, recent work has found that model performance varies considerably by surface, with prediction accuracy notably lower on grass courts compared to hard and clay courts, suggesting that surface-specific characteristics create unique challenges for match outcome modeling.<a name="cite_ref-6"></a><sup>6</sup>


Despite the extensive body of research on tennis match prediction, several important limitations and gaps remain. First, many existing studies focus primarily on elite players or Grand Slam tournaments, potentially limiting the generalizability of findings to the broader professional tennis ecosystem that includes lower-tier tournaments and emerging players. Second, while surface-specific models have been developed, there is inconsistency in how researchers operationalize and measure surface effects—some studies treat surface as a categorical variable while others develop continuous surface-specific rating systems, making cross-study comparisons difficult. Third, most prediction models achieve moderate accuracy levels (typically 65-75%), suggesting that important predictive factors may be missing or that the inherent variability in tennis matches limits deterministic prediction. Fourth, there is limited research examining whether the predictive value of court surface has changed over time as equipment technology, training methods, and player athleticism have evolved, potentially reducing historical surface-specific advantages. Finally, few studies have systematically compared the relative importance of court surface against other environmental and contextual factors such as tournament prestige, match format, and player fatigue across large, diverse datasets.


Our project addresses these gaps by conducting a comprehensive analysis of whether court surface type improves match outcome prediction across a large, diverse dataset of professional tennis matches from both ATP and WTA tours. Unlike previous studies that often focus on specific tournaments or player subsets, we will examine matches across multiple tournament levels and time periods to assess the robustness and generalizability of surface effects. By systematically comparing prediction models with and without surface information while controlling for player ranking differences, we can directly quantify the added predictive value that surface type provides beyond baseline player quality measures. This approach will help determine whether the widely assumed importance of court surface in tennis translates into meaningful improvements in statistical prediction, or whether other factors dominate match outcomes. Furthermore, by analyzing both men's and women's professional tennis, we can explore whether surface effects operate similarly across different competitive contexts. The findings from this research will have practical value for sports analysts developing forecasting models, players and coaches making strategic decisions about tournament selection and preparation, and researchers seeking to understand the relative importance of environmental factors in athletic performance prediction.


Works Cited

<a name="cite_note-1"></a>[^](#cite_ref-1) Buhamra, N., Groll, A., & Brunner, S. (2024). Modeling and prediction of tennis matches at Grand Slam tournaments. Journal of Sports Analytics.

<a name="cite_note-2"></a>[^](#cite_ref-2) McHale, I., & Morton, A. (2011). A Bradley-Terry type model for forecasting tennis match outcomes. International Journal of Forecasting.

<a name="cite_note-3"></a>[^](#cite_ref-3) Bayram, F., Garbarino, D., & Barla, A. (2021). Predicting tennis match outcomes with network analysis and machine learning. In SOFSEM 2021: Theory and Practice of Computer Science (pp. 526-541). Springer.

<a name="cite_note-4"></a>[^](#cite_ref-4) Barnett, T., & Pollard, G. (2007). How the tennis court surface affects player performance and injuries. Medicine & Science in Sports & Exercise.

<a name="cite_note-5"></a>[^](#cite_ref-5) Kovalchik, S., Ingram, M., & Gorgi, P. (2019). Analysis and forecasting of tennis matches by using a high dimensional dynamic model. Journal of the Royal Statistical Society: Series A, 182(4), 1393-1425.

<a name="cite_note-6"></a>[^](#cite_ref-6) Liu, J., et al. (2024). Momentum prediction models of tennis match based on CatBoost regression and random forest algorithms. Scientific Reports, 14, Article 18552.

## Hypothesis


We hypothesize that including court surface type will improve the accuracy of predicting match outcomes in professional tennis, relative to a model that uses player ranking difference alone.
This expectation is based on prior research suggesting that different surfaces systematically affect ball speed, rally length, and serve effectiveness, which in turn influence player performance and match results.
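
One way this hypothesis could be tested is to fit two models to the same matches, one using ranking difference alone and one that also uses surface, and compare their held-out log loss. The sketch below is a minimal illustration on synthetic data, not a result: the row keys (`rank_diff`, `surface`, `a_wins`), the simulated slopes, and the hand-written gradient-descent logistic regression are all assumptions made for the example. Because the ordering of the two players in a row is arbitrary, surface enters as an interaction with ranking difference rather than as a standalone term.

```python
import math
import random

SURFACES = ["hard", "clay", "grass"]

def sigmoid(z):
    # Clamp z so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def features(row, use_surface):
    """Bias, scaled ranking difference, and optional surface interactions."""
    x = [1.0, row["rank_diff"] / 100.0]
    if use_surface:
        # Hard court is the reference level; clay and grass each get their
        # own adjustment to the ranking-difference slope.
        for s in SURFACES[1:]:
            x.append(x[1] if row["surface"] == s else 0.0)
    return x

def fit(rows, use_surface, lr=0.5, epochs=300):
    """Fit logistic regression weights by batch gradient descent."""
    w = [0.0] * len(features(rows[0], use_surface))
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for r in rows:
            x = features(r, use_surface)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - r["a_wins"]
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def log_loss(rows, w, use_surface):
    """Mean negative log-likelihood; lower is better."""
    eps = 1e-12
    total = 0.0
    for r in rows:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, features(r, use_surface))))
        p = min(1 - eps, max(eps, p))
        total -= r["a_wins"] * math.log(p) + (1 - r["a_wins"]) * math.log(1 - p)
    return total / len(rows)

def simulate(n, seed=0):
    """Synthetic matches; the per-surface slopes are arbitrary assumptions."""
    rng = random.Random(seed)
    slope = {"hard": 2.0, "clay": 2.5, "grass": 1.0}
    rows = []
    for _ in range(n):
        s = rng.choice(SURFACES)
        d = rng.uniform(-100, 100)
        p = sigmoid(slope[s] * d / 100.0)
        rows.append({"rank_diff": d, "surface": s, "a_wins": int(rng.random() < p)})
    return rows

data = simulate(600)
train, test = data[:400], data[400:]
results = {}
for use_surface in (False, True):
    w = fit(train, use_surface)
    results[use_surface] = log_loss(test, w, use_surface)
    print("with surface" if use_surface else "ranking only", round(results[use_surface], 4))
```

In the actual analysis, a library implementation with cross-validation would likely replace this hand-written fit; the point of the sketch is only the with/without comparison on identical held-out matches.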

## Data

The ideal dataset would include variables such as match outcome (winner/loser), player rankings, ranking difference, court surface type (hard, clay, grass), tournament round, tournament level, match date, and player identifiers. A dataset with at least 50,000 match observations would be sufficient to answer the research question and allow for robust statistical modeling and validation. These data would be collected from professional tennis match records from the ATP and WTA tours using official match result databases and publicly available archives. All data would be stored in structured formats such as CSV files, with players referenced by ID rather than by name where possible.

Dataset 1 is publicly available at https://www.kaggle.com/datasets/taylorbrownlow/atpwta-tennis-data.
It contains variables such as match outcome (winner/loser), player rankings, player names/IDs, tournament name, tournament round, court surface type, match date, and match scores. This dataset is relevant because it allows us to analyze whether including court surface type improves the prediction of match outcomes beyond using player ranking differences alone, which directly supports our research question.

Dataset 2 is publicly available at https://www.kaggle.com/datasets/hakeem/atp-and-wta-tennis-data.
It includes variables such as match results, player rankings, surface type, tournament level, match dates, and additional match-level statistics across multiple seasons. This dataset is useful because it provides additional coverage of professional tennis matches over time, allowing us to validate our findings, increase sample size, and test whether the relationship between surface type and match outcome prediction is consistent across different tours and seasons.
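
Match records of this kind are typically stored in winner/loser form, so every row describes a win and there is no binary outcome to predict until the two players are placed in a neutral order. The sketch below shows one possible way to derive a ranking difference and an outcome label. The keys `winner_rank`, `loser_rank`, and `surface` are hypothetical; the actual column names in the Kaggle files may differ and would need to be checked.

```python
import random

def build_rows(matches, seed=0):
    """Convert winner/loser records into player A / player B rows with a label."""
    rng = random.Random(seed)
    rows = []
    for m in matches:
        # Skip matches with a missing ranking rather than guessing a value.
        if m["winner_rank"] is None or m["loser_rank"] is None:
            continue
        # Randomly decide whether the winner is listed as player A, so the
        # outcome label is not always 1.
        a_is_winner = rng.random() < 0.5
        if a_is_winner:
            rank_a, rank_b = m["winner_rank"], m["loser_rank"]
        else:
            rank_a, rank_b = m["loser_rank"], m["winner_rank"]
        rows.append({
            # A lower ranking number is better, so this is positive when
            # player A is the better-ranked player.
            "rank_diff": rank_b - rank_a,
            "surface": m["surface"].strip().lower(),
            "a_wins": int(a_is_winner),
        })
    return rows

# Made-up records for illustration only.
sample = [
    {"winner_rank": 3, "loser_rank": 40, "surface": "Clay"},
    {"winner_rank": 55, "loser_rank": 12, "surface": "Grass"},
    {"winner_rank": None, "loser_rank": 7, "surface": "Hard"},
]
rows = build_rows(sample)
for row in rows:
    print(row)
```

Fixing the seed keeps the player ordering reproducible across runs, which matters if the derived table is shared between team members.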
  

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 
 > We have considered potential collection bias in the existing datasets. Professional tennis data may overrepresent top-ranked players and major tournaments while underrepresenting lower-tier events. We will document which tours and tournament levels are included and acknowledge any limitations in generalizability. We'll also examine whether certain surfaces or time periods are overrepresented in the data.

 
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 
 > The datasets contain player names, which are public figures in professional sports. However, we will minimize PII exposure by using player IDs or anonymized identifiers in our analysis wherever possible. We will not collect or display any personal information beyond what is publicly available in professional tennis records (names, rankings, match results).

 
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

 > While our primary focus is on court surface and rankings, we will examine whether our model performs differently across various player demographics if such information is available. We'll document any potential biases in prediction accuracy and ensure our analysis doesn't perpetuate unfair advantages or disadvantages for particular player groups.


### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 
 > We will store all data files locally on password-protected devices and use secure cloud storage (e.g., Google Drive with restricted access) for team collaboration. Access will be limited to team members only. While the data is public, we will follow best practices for data management throughout the project lifecycle.
 
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

 > We will retain the data only for the duration of the course project (approximately 10 weeks). After final grades are submitted and we no longer need the data for academic purposes, we will delete all local copies. The original datasets will remain available in their public repositories.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 
 > We will consult with tennis subject matter experts (e.g., coaches, analysts) if possible to validate our assumptions about how surface types affect play. We'll also review tennis analytics literature to ensure we're not missing important contextual factors that might affect match outcomes.
 
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

 > We will examine the data for potential biases including: temporal bias (whether recent matches are overrepresented), surface bias (whether certain surfaces have more data), and competitive level bias (whether Grand Slams vs. smaller tournaments are balanced). We'll document these biases and consider stratifying our analysis or using appropriate weighting techniques.
 
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

 > All visualizations and statistical summaries will accurately represent the underlying data without manipulation. We will include confidence intervals, report limitations, and avoid cherry-picking results. Any data exclusions or filtering decisions will be transparently documented and justified.
 
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

 > Our analysis focuses on aggregate patterns and statistical relationships rather than individual player performance. While player names may appear in examples, we will not conduct individual-level analysis that could be used to unfairly characterize specific players' abilities.
 
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

 > All analysis will be conducted in Jupyter notebooks with well-documented code, clear comments, and version control. We will maintain a reproducible workflow so that others can verify our methods and results. All data preprocessing steps, modeling decisions, and statistical tests will be explicitly documented.
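
As an illustration of the confidence intervals mentioned under C.3, the sketch below computes a Wilson score interval for a prediction accuracy. The choice of the Wilson interval is an assumption for this example, not a committed method, and the counts (70 correct out of 100) are made up.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Made-up counts for illustration only.
low, high = wilson_interval(correct=70, total=100)
print(f"accuracy 0.70, approximate 95% interval ({low:.3f}, {high:.3f})")
```

Reporting an interval per surface would make it easier to see whether an apparent accuracy gap between, say, grass and clay is larger than the sampling noise.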

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

 > Our model focuses on court surface type and ranking differences, which are legitimate competitive factors in tennis. We will ensure that we do not inadvertently use variables that serve as proxies for protected characteristics (e.g., nationality, age, gender) in discriminatory ways.
 
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

 > We will test whether our model's prediction accuracy varies across different player groups (if demographic data is available) and report any disparities. Since this is an analytical project rather than a deployed system, we'll document but not necessarily resolve all fairness issues, while acknowledging them as limitations.
 
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

 > We will use multiple evaluation metrics (e.g., accuracy, precision, recall, AUC-ROC) to assess model performance and will consider whether optimizing for one metric might have unintended consequences. We'll justify our choice of primary metric and report all relevant metrics.
 
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

 > We will use interpretable modeling approaches and provide clear explanations of how court surface and ranking differences contribute to predictions. Feature importance analysis will help explain which factors most strongly influence match outcome predictions.
 
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

 > Our final report will include a dedicated limitations section discussing: data coverage gaps, potential biases, model assumptions, generalizability constraints, and uncertainty in predictions. We will communicate these in clear, non-technical language where appropriate.
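
The metrics named under D.3 can be computed directly from true labels and predicted probabilities. The following is a minimal sketch with made-up inputs; the 0.5 decision threshold and the pairwise AUC calculation are assumptions chosen for clarity rather than speed.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, and AUC-ROC for binary labels."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # AUC as the share of (positive, negative) pairs in which the positive
    # example receives the higher score; ties count as half.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc": wins / (len(pos) * len(neg)) if pos and neg else float("nan"),
    }

# Made-up labels and probabilities for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

The pairwise AUC loop is quadratic in the number of matches, so a library routine would be preferable on the full dataset.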


### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

 > While we do not plan to deploy this model, we will include a statement in our report discouraging use of the model for gambling or other unintended purposes. We'll emphasize that this is an exploratory academic analysis and not a validated prediction system for real-world betting or decision-making.

## Team Expectations 

* Our team will communicate through messages and we plan to meet in person at least once a week.
* Each team member will complete their assigned tasks by the agreed-upon deadlines and will communicate proactively if they encounter any obstacles or need additional time.
* If conflicts or misunderstandings come up, we will address them as a group and involve course staff if we are unable to resolve the issues on our own.
* We agree to be accountable to one another and to uphold the standards of academic integrity.
* All team members will come to meetings prepared, having completed their individual work and reviewed relevant materials beforehand.
* We will respect each other's time by starting and ending meetings on schedule and staying focused on agenda items.
* Team members will provide constructive feedback on each other's work in a respectful and supportive manner.
* If a team member cannot attend a scheduled meeting, they must notify the team at least 24 hours in advance and review meeting notes afterward.
* We will make decisions democratically, with all team members having equal input and voting rights on major project decisions.
* All work will be shared through our designated Google Drive folder with clear version control and file naming conventions.
* We commit to distributing work equitably across all team members and will reassess workload balance regularly to ensure fairness.
* Team members will respond to messages within 24 hours during weekdays to maintain efficient communication and project momentum.

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/28  |  2 PM | Read through COGS 108 project guidelines individually; come up with possible topics and research questions; skim a few past projects for ideas  | Decide how we’ll communicate and run meetings; settle on a final project topic and research question; start outlining background research; divide up tasks for the proposal; wrap up the previous project review submission | 
| 2/4 |  2 PM |  Keep working on background research; look for possible datasets and think about any ethical issues | Go over the project proposal and make final revisions; assign responsibilities for the data checkpoint (including describing and cleaning the data); talk through how the data fits our project and how we’ll wrangle it | 
| 2/11  | 2 PM  | Read and reflect on feedback for the proposal; continue cleaning and preparing the data  | Decide who will handle which revisions based on feedback; check in on data progress and adjust the timeline if needed   |
| 2/18  | 2 PM  | Finish data cleaning; start some early exploratory analysis to check data quality; review the EDA checkpoint instructions | Review and polish the data cleaning and early EDA; finalize the data checkpoint submission; assign tasks for the full EDA stage   |
| 2/25   | 2 PM  | Work individually on EDA tasks (including coding and writing interpretations); review feedback from the data checkpoint | Share progress on EDA work; discuss feedback on the data; decide how to implement any required changes |
| 3/4  | 2 PM  | Complete the main analysis; draft the results section; update the data section based on feedback| Finalize edits to the EDA section; plan and assign tasks for the project video |
| 3/11  | 2 PM  | Review feedback on the EDA; continue making revisions as assigned| Go over EDA edits; check progress on the final video; review the full project using the grading rubric |
| 3/18 | Before 11:59 PM  | Finish all remaining edits; record and edit the final video | Turn in Final Project & Group Project Surveys |