# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Sam Lau: Data 
- Arav Patel: Fixed hypothesis, added proposal elements back 
- Brandon Covas: 
- Scarlett Jeffries: 
- Calvin Anderson: 

## Research Question

Which biological factors and physical attributes (including height, age, and weight) most influence win rate among the top global professional tennis players?


## Background and Prior Work

As multiple people in our group play tennis, we decided to focus our project on which biological factors influence the win rate of professional tennis players. By looking at factors such as height, weight, dominant hand, and wingspan, we hope to determine which attributes most heavily influence win rates for top players. While we recognize that coaching, years of experience, court type, and more also influence a player's performance, we aim to focus on which physical characteristics of the players influence performance.

Winning in tennis is attributed to an interaction among training regimen, technical skill, and psychological and biological factors. Of these, biological factors are an especially interesting area of research because they are the least malleable. Genetic predispositions such as leg length could positively or adversely affect reach and balance. Handedness illustrates how much biology can matter in tennis: about 10% of the general population is left-handed, but among the top 100 professional players that share rises to 20%<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1).

Further, one study published in the Scientific Review of Physical Culture analyzes how player height and career win percentage relate to serve and return effectiveness on different court surfaces. The authors found that shorter players tend to have a higher return rate on clay courts, while taller players have a higher serve percentage on all court types<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). The study leaves open two questions that motivate ours: how does height affect a player's performance on the remaining shots in a point, and how much do serve and return rates contribute to overall match win rate? Our study aims to look at how height affects win rate regardless of court type or serve and return percentages, and to determine whether other biological factors affect player performance as well. The study was also helpful in showing how an analysis similar to ours was constructed and which metrics were used; our question covers more biological factors but a single outcome, win rate.

As discussed in the paper “Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned,” published in the journal Machine Learning, modern sports analytics faces many challenges<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). Data science and ML modeling have many uses in sports, including player recruitment, preparation, and predicting team performance. This paper helped us contextualize the restrictions and implications of tennis analytics. For example, the paper notes that outcome measurements in sports data are often binary: a ‘win’ or a ‘loss’ does not indicate how well a player performed, since the match could have been extremely close or very one-sided. The article also points out that better players often have more chances to compete. In tennis, higher-ranked players automatically qualify for tournaments, whereas new or lower-ranked players must first qualify. Because higher-ranked players may have won more matches simply by having more opportunities to play, we look at win rate as a proportion of the matches a player has competed in. Overall, this paper gave helpful insight into common pitfalls of sports analytics while also showing how insightful sports analytics can be.



1. <a name="cite_note-1"></a> [^](#cite_ref-1)  Holtzen, David W. “Handedness and professional tennis.” *International Journal of Neuroscience*, vol. 105, no. 1–4, Jan. 2000, pp. 101–119, https://doi.org/10.3109/00207450009003270.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2)  Pawel, Bieniek, and Kwater Klaudia. “Body Height and Career Win Percentage in Relation to Serve and Return Games Effectiveness in Elite Tennis Players.” *Scientific Review of Physical Culture*, Volume 4, Issue 3, 2014, www.researchgate.net/publication/277721035_Body_height_and_career_win_percentage_in_relation_to_serve_and_return_games_effectiveness_in_elite_tennis_players.html
3. <a name="cite_note-3"></a> [^](#cite_ref-3)  Davis, Jesse & Bransen, Lotte & Devos, Laurens & Jaspers, Arne & Meert, Wannes & Robberechts, Pieter & Van Haaren, Jan & Roy, Maaike. (2024). Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned. *Machine Learning*. 113. 1-34. 10.1007/s10994-024-06585-0.html

## Hypothesis


We predict that both height and left-handedness are strongly and positively associated with win rate among professional players: as height increases, so does win rate, and left-handed players win at a higher rate than right-handed players. Taller players have a higher reach when serving and hitting, which can give the ball a steeper angle and greater velocity. Based on preliminary research, left-handedness is less common in tennis, which gives left-handed players an advantage because opponents are less accustomed to their playstyle.
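
Once a player-level table exists, the hypothesis above could be checked along the following lines. This is only a sketch: the table is invented and deliberately linear so the output is easy to verify, and the column names `height_cm`, `hand`, and `win_rate` are hypothetical rather than taken from our data.

```python
import pandas as pd

# Made-up player-level table; the values are placeholders, not real measurements
players = pd.DataFrame({
    'height_cm': [170, 175, 180, 185, 190],
    'hand':      ['R', 'L', 'R', 'R', 'L'],
    'win_rate':  [0.40, 0.45, 0.50, 0.55, 0.60],
})

# Pearson correlation between height and win rate (ranges from -1 to 1)
height_corr = players['height_cm'].corr(players['win_rate'])

# Mean win rate for left- and right-handed players
win_rate_by_hand = players.groupby('hand')['win_rate'].mean()

print(f"Pearson r (height vs. win rate): {height_corr:.2f}")
print(win_rate_by_hand)
```

On real data the correlation would be far from perfect, and a correlation alone would not show that height causes a higher win rate.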


## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_2024.csv', 'filename':'atp_matches_2024.csv'},
    { 'url': 'https://raw.githubusercontent.com/JeffSackmann/tennis_wta/refs/heads/master/wta_matches_2024.csv', 'filename':'wta_matches_2024.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"

B. 
Dataset #1: The most glaring problem with this dataset is that it contains exclusively professional ATP and WTA tour-level matches from 2024. This means it only represents elite players who qualified for major tournaments, which introduces survivorship bias: lower-ranked players are not sampled or included in the dataset. Additionally, tennis is an expensive sport, so participation is globally uneven and the sample overrepresents players from wealthier countries that can support stronger training for their competing athletes. Because the dataset covers only one season, short-term injuries and scheduling differences in 2024 can distort predictions of long-term performance. Lastly, contextual variables such as injuries, coaching, and mental and physical fatigue are missing. Therefore, any findings from this dataset apply only to elite professional tennis players in 2024 and must be generalized to broader tennis populations with care.

Dataset #2: The dataset consists only of WTA tour-level matches from the 2024 season, so it shares the single-season limitation of the first dataset.

3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
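
For step 3.4 above, one possible way to probe whether missingness is systematic is to compare another variable across rows with and without the missing value. The sketch below does this for height and ranking on a made-up table that borrows the `winner_rank` and `winner_ht` column names from the match files; the `height_status` column is a hypothetical helper and the numbers are placeholders.

```python
import pandas as pd

# Made-up rows standing in for a match table; None becomes NaN (a missing height)
matches = pd.DataFrame({
    'winner_rank': [5, 12, 150, 300, 40, 220],
    'winner_ht':   [180.0, 175.0, None, None, 183.0, None],
})

# Share of missing values in each column
missing_share = matches.isna().mean()
print(missing_share)

# Label each row by whether its height is missing
matches['height_status'] = matches['winner_ht'].isna().map(
    {True: 'missing', False: 'present'}
)

# Compare the median rank of the two groups.
# A large gap suggests height is not missing at random with respect to rank.
median_rank = matches.groupby('height_status')['winner_rank'].median()
print(median_rank)
```

If the two medians were close, dropping rows with missing height would be less likely to bias the sample toward higher-ranked players.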


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.
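
One possible way to combine the two match files is to stack them and label each row with its tour. The sketch below uses two tiny made-up frames in place of the real CSVs, and the `tour` column is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Tiny stand-ins for the ATP and WTA match tables (made-up rows)
atp = pd.DataFrame({'winner_name': ['Player A'], 'loser_name': ['Player B'], 'winner_ht': [188.0]})
wta = pd.DataFrame({'winner_name': ['Player C'], 'loser_name': ['Player D'], 'winner_ht': [175.0]})

# Tag each table with its tour, then stack them into one table
combined = pd.concat(
    [atp.assign(tour='ATP'), wta.assign(tour='WTA')],
    ignore_index=True,
)

# Analyses can then be run within each tour rather than across them
mean_height_by_tour = combined.groupby('tour')['winner_ht'].mean()
print(mean_height_by_tour)
```

Stacking works here because both files share the same column layout; keeping the `tour` label makes it possible to compare players only within their own tour.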

In [None]:
import pandas as pd

# A. Load data
wta = pd.read_csv('data/00-raw/wta_matches_2024.csv')

# B. Tidy 
# Select columns of interest
wta_data = wta[['winner_name', 'winner_hand', 'winner_ht', 'winner_age', 
                'loser_name', 'loser_hand', 'loser_ht', 'loser_age',
                'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points']]

# C. Size of dataset
print(f"Size of the Women's Tennis Association (WTA) dataset: {wta_data.shape}\n")

# D. Missing data
print('Missing data per column:')
missing = wta_data.isnull().sum()
print(missing)
print('From the table above we can see that most of the missing data comes from the height measurements.')
print(f"There are {missing['winner_ht'] + missing['loser_ht']} missing height measurements, which we will have to exclude in order to conduct our analysis.\n")

# E. Outliers
print('Summary statistics of the data to identify outliers:')
print(wta_data.describe())
print(wta_data.describe(include='object'))

print('From this, we can see that there are 3 unique values for dominant hand preference.')
print(f"The unique values for the dominant hand preference are: {wta_data['winner_hand'].unique()}")
print("We want to look at strictly Right (R) or Left (L) handed players, so we will remove matches with any other value.\n")
# 'U' is the third value; we understand it to mark an unknown hand rather than an ambidextrous player.
# Drop every match where either the winner or the loser is coded 'U'.
unknown_hand = (wta_data['winner_hand'] == 'U') | (wta_data['loser_hand'] == 'U')
wta_data = wta_data[~unknown_hand]


# F. Clean data
# We are dropping all entries with missing values since as noted above, most missing data comes from the height measurements.
# Since height is a key biological factor in our analysis, we are only using entries including height data.
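# drop=True discards the old row labels instead of keeping them as an extra 'index' column
wta_data_clean = wta_data.dropna(how='any').reset_index(drop=True)

# G. Write data
# index=False avoids writing the row index as an extra unnamed column
wta_data_clean.to_csv('data/02-processed/wta_matches_2024.csv', index=False)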

# Look at data
print('Take a look at the cleaned data:')
wta_data_clean.head()
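
Because each row of the cleaned table describes a match rather than a player, a player-level analysis needs one row per player per match. A minimal sketch of that reshaping follows, using a made-up two-row table with the same `winner_*` / `loser_*` column pattern. The helper `side` and the output column `won` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Made-up matches that mimic the winner_*/loser_* column layout
matches = pd.DataFrame({
    'winner_name': ['Player A', 'Player B'],
    'winner_hand': ['R', 'L'],
    'winner_ht':   [180.0, 172.0],
    'loser_name':  ['Player B', 'Player C'],
    'loser_hand':  ['L', 'R'],
    'loser_ht':    [172.0, 168.0],
})

def side(df, prefix, won):
    """Return one side of every match with the prefix stripped from column names."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    out = df[cols].rename(columns=lambda c: c[len(prefix):])
    return out.assign(won=won)

# Stack winners and losers so each row is one player's appearance in one match
long_form = pd.concat(
    [side(matches, 'winner_', 1), side(matches, 'loser_', 0)],
    ignore_index=True,
)

# Player-level table: physical attributes plus win rate (wins / matches played)
players = long_form.groupby('name').agg(
    hand=('hand', 'first'),
    ht=('ht', 'first'),
    win_rate=('won', 'mean'),
)
print(players)
```

Taking the mean of the 0/1 `won` column gives wins divided by matches played, so players who simply appear in more matches are not favored.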

## Ethics

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?  
This is not completely relevant to our project. Because the players we are observing are professionals, all of the data we will be using is public knowledge. We will only use data that anyone can find; we are simply compiling and analyzing it.
 - [x] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?  
  The sources we are using are recognized by The International Tennis Federation. The data is collected and reported directly following matches. The best way we can avoid bias in our data is to use more recent data, as it is more likely to be reported accurately.
 - [x] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?  
  We can keep our dataset anonymous by not including the names of the players involved in our study. Personal identifiability is not a large concern for our team because all of the information we use is available online.
 - [x] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?  
  Gender is public knowledge, since the ATP and WTA tours are separated by gender. This allows us to conduct our analysis within each tour: we will compare women's statistics with other women's and men's with other men's, which reduces built-in bias.
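
If we do decide to leave names out, as discussed under A.3, one way to do it while still being able to follow a player across matches is to swap each name for a stable ID. The sketch below shows the idea on a made-up table; the `P001`-style ID format and the `winner_id` / `loser_id` column names are hypothetical.

```python
import pandas as pd

# Made-up matches with placeholder player names
matches = pd.DataFrame({
    'winner_name': ['Player A', 'Player B', 'Player A'],
    'loser_name':  ['Player B', 'Player C', 'Player C'],
})

# Build one shared name -> ID mapping so a player gets the same ID in both columns
all_names = pd.unique(matches[['winner_name', 'loser_name']].to_numpy().ravel())
id_for = {name: f"P{i:03d}" for i, name in enumerate(sorted(all_names), start=1)}

# Replace the name columns with ID columns
anonymized = matches.assign(
    winner_id=matches['winner_name'].map(id_for),
    loser_id=matches['loser_name'].map(id_for),
).drop(columns=['winner_name', 'loser_name'])
print(anonymized)
```

This is pseudonymization rather than full anonymization, since match results could still be cross-referenced with public records.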


### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?  
       As we are basing our project on publicly available data associated with professional players, we don't anticipate data security being an issue. 
 - [x] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?  
       No, we don't have a mechanism for this and recognize that this will be difficult to implement. As we are using public data from tennis organizations and other sources, we will rely on their methods for personal information removal for this. 
 - [x] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?  
    No, our project doesn't anticipate a need for deletion of data. 

### C. Analysis
 - [x] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?  
    We looked for blind spots by considering how assumptions about tennis performance might oversimplify a player. We acknowledge that quantitative match statistics do not capture the full picture, which includes player injuries, mental health, coaching changes, weather conditions, and even travel fatigue. To mitigate this, we avoid making definitive claims and frame our analysis as descriptive. Another blind spot is that the data reflects seasoned professional athletes rather than junior players. 
 - [x] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?  
    We will examine the dataset for sources of bias, including imbalanced representation across gender, ranking tiers, tournament levels, and surfaces. We noted that higher-ranked and more successful players appear more frequently in the data, which has the potential to skew our results. We will be cautious not to generalize findings beyond the scope of the dataset. We also checked for missing values and avoided drawing conclusions from incomplete records.
 - [x] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?  
    Our visualizations and summary statistics will be designed to reflect the data honestly. Misleading visualizations often have unlabeled axes or scales that are skewed to make the data appear to prove a point. We will make sure that axes are clearly labeled and that scales are not manipulated to exaggerate differences. 
 - [x] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?  
       The dataset contains publicly available professional tennis data and does not include personally identifiable information beyond player names, which are necessary for contextualizing match outcomes. No private or sensitive information was used or displayed, and we avoided linking the data to external sources that could introduce privacy concerns.
 - [x] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?  
       Our data cleaning, analysis, and visualization steps will be fully documented in the notebook, with explanations of the choices we made as a team. Code will be organized, and all intermediate steps and reasoning will be written down to make sure our results are reproducible. 

### D. Modeling
 - [x] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?  
       We have considered that this model might be unfair if we use certain factors such as country of origin, as this could be a proxy for access to resources or funding. We will weigh carefully whether to use such metrics in our study, or avoid them altogether. 
 - [x] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?  
       When we do the modelling, we expect that there will be differences across the two gender groups represented. We intend to minimize the bias by stratification across gender. This will allow us to consider the differences between the groups and make a fair conclusion. Additionally, we recognize that our pool of tennis players will have a built in racial bias that will reflect the current diversity of professional tennis players. We are unsure how to mitigate this bias in our pool, so we must clearly acknowledge this as a limitation in our study.
 - [x] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?  
       As we are generally modelling biological factors and in-game statistics, we don't expect that our metrics need optimization or are optimized to create any negative effects. We will need extra consideration if we bring in metrics such as country of origin, which could serve as a proxy for other variables such as access to resources.  
 - [x] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?  
       Yes, we believe that we will be able to explain why certain factors influence success in tennis, as we can refer to the biomechanics of playing tennis. We will also be able to analyze the correlation between biological factors and in-game statistics, such as between height and ball speed. This will give us an understanding of which part of the game is most affected by each factor.
 - [x] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?  
       Yes, we will communicate the limitations of our study, which are mostly due to the population pool of professional tennis players. We need to emphasize that this study will not be applicable to the average tennis player, for whom there is more variance in other factors that are critical for success in tennis, such as experience or training. Because we will limit our study to concrete tennis courts, our conclusions will also be limited by surface. Additionally, our conclusions should not be viewed as a predictor of success in tennis; we only want to better understand which biological factors, if any, influence playing tennis at a high level. We do not want our conclusions to discourage people from pursuing tennis, and we need to state these limitations clearly in order to achieve this. 
       
### E. Deployment
 - [x] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?  
       While the model is based on preexisting biological and win-rate data, we can update the model with relevant data from the upcoming tennis season. We can follow the upcoming season and the players' stats to see whether the attributes of winning, highly ranked players align with the biological factors we identified. We can audit sample predictions by comparing the outcomes of various tournaments, such as the U.S. Open and Wimbledon, with what our model concluded. 
 - [x] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?  
       While we do not predict that our analysis will cause distress to users, we will adapt if necessary. We can add a disclaimer that our findings are purely a data analysis of winning statistics and uncontrollable biological factors. We can further reiterate that many other aspects contribute to the win rate of professional tennis players, such as coaching, training, and years of experience. Our analysis is not intended to be a conclusive account of why a match was won; we simply wish to explore the relationship between biological factors and win rates.
 - [x] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?  
       Yes, as the project is all in the group repository, we can change the status to private if the model is causing harm or having unintended effects. Then, only contributors will be able to see the model and either change it or leave it as private, depending on the circumstances.
 - [x] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?  
       Yes, if the model is being used in unintended ways then we will implement aspects of our roll back plan to reevaluate, then continue to monitor from there.


## Team Expectations 

Team Expectation 1: Tone and communication methods
Our team has agreed to communicate both virtually and in person. We will primarily use iMessage, which we agreed upon collectively. We have also agreed to a relaxed expectation on response time to accommodate everybody’s individual schedules. Through text and in-person discussion, we concluded that a casual tone would best encourage constructive feedback and encouragement.

Team Expectation 2: Task management
All team members are expected to contribute equally. Labor will be divided by skill and interest to the best of our collective ability. If a task cannot be split according to individual interest, we will divide the work equally. We will track tasks using two metrics: timeline and progress. We can track progress through the contributions recorded in git pushes and through labelled work in Google Docs. To manage our timeline, we will review the shared document to see what each person has accomplished and keep each other on track.

Team Expectation 3: Handling Conflict
Team members will handle conflicts of interest through open communication. We are expected to lead with respect and address the issue brought to the group rather than focus on the individual. We will listen to all perspectives and work through conflicts by discussing them together. If we are unable to resolve a conflict ourselves, we will contact course staff for help. 


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them