# COGS 108 - Data Checkpoint



Team list and credits:

- Alexis Menor: Conceptualization, Background research, Writing – original draft, Data curation
- Camdon Dreisbach: Methodology, Software, Data curation
- Ivan Li: Analysis, Visualization
- Joseph Tuazon: Project administration, Writing – review & editing, Data curation
- Yuna Yeom: Analysis, Background research, Visualization

## Research Question

How does each player's position-normalized cumulative workload, measured as actions per match (TotalAttacks, Digs, BlockAssists), affect hitting efficiency (HitPct) in the 4th or 5th set compared to the 1st set of NCAA Division 1 matches in each season from 2020 to 2024? Positions considered are OH (Outside Hitter), MB (Middle Blocker), OPP (Opposite Hitter), S (Setter), L (Libero), and L/DS (Libero/Defensive Specialist).
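As a minimal sketch of what "position-normalized workload" could mean in practice, the example below sums the three workload actions and z-scores the total within each position. The toy values, the `Position` column name, and the choice of a within-position z-score are assumptions for illustration, not the project's final method.

```python
import pandas as pd

# Hypothetical toy rows: the action columns follow the research question,
# but the values and the `Position` column name are assumptions.
df = pd.DataFrame({
    "Position": ["OH", "OH", "OH", "MB", "MB", "MB"],
    "TotalAttacks": [30, 40, 50, 10, 15, 20],
    "Digs": [10, 12, 14, 2, 3, 4],
    "BlockAssists": [1, 2, 3, 4, 6, 8],
})

# Raw workload: total counted actions per player-match.
df["Workload"] = df[["TotalAttacks", "Digs", "BlockAssists"]].sum(axis=1)

# Position-normalized workload: z-score within each position group,
# so an outside hitter is compared only with other outside hitters.
grouped = df.groupby("Position")["Workload"]
df["WorkloadZ"] = (df["Workload"] - grouped.transform("mean")) / grouped.transform("std")

print(df[["Position", "Workload", "WorkloadZ"]])
```

Normalizing within position keeps a middle blocker's naturally lower dig count from being read as a low workload relative to a libero.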

## Background and Prior Work

College athletes undergo demanding training and heavy competition schedules while simultaneously managing academic responsibilities. As a result, fatigue has emerged as a critical factor influencing athletic performance, recovery, and mental health. In sports such as volleyball, where matches are structured into sequential sets, fatigue may accumulate as competition progresses, potentially leading to declines in performance during later sets of a match. Prior research suggests that excessive physical and cognitive workload can reduce volleyball players' performance quality, including declines in visual perception, concentration, and reaction time <a id="cite_ref_1"></a><sup><a href="#cite_note_1">1</a></sup>.

Because volleyball’s gameplay is varied, fatigue can differ across positions according to the sport-specific demands of each role. Players engage in hitting, setting, defense, and blocking, which differ in actions such as jumping, lateral movement, reaction speed, and quick directional changes. Because each position carries different demands, players' exhaustion can differ accordingly. Previous research suggests that fatigue can be position-specific and can localize in different areas of the body <a id="cite_ref_2"></a><sup><a href="#cite_note_2">2</a></sup>. Hitters and blockers jump more frequently than setters and defensive specialists, so differences in physical fatigue between these groups are expected. Thus, it is important to consider how different positions need to recover during and after a strenuous game. Additionally, the change from pre-season preparation to in-season competition, then to the off-season, can impact an athlete’s recovery and performance <a id="cite_ref_3"></a><sup><a href="#cite_note_3">3</a></sup>. The shift from pre-season, where training is moderate and games are infrequent, to in-season, where training increases and games are frequent, can also be a relevant factor in a player's performance.

Existing literature primarily focuses on physiological measures of fatigue or subjective survey-based assessments collected during training and competition. However, there is comparatively limited research examining how fatigue translates into measurable statistical performance changes during competitive play, particularly at the collegiate level. 

This project aims to address these gaps by analyzing how fatigue relates to in-game statistical performance across sets and positions. By comparing early-set performance to late-set performance, we can provide quantitative insight into how the cumulative intensity of playing time affects gameplay outcomes.

1. <p id="cite_note_1">
  <a href="#cite_ref_1">^</a>
Yu, Y., Zhang, L., Cheng, M.-Y., Liang, Z., Zhang, M., & Qi, F. (2025). <i>The effects of different fatigue types on action anticipation and physical performance in high-level volleyball players </i>. Journal of Sports Sciences, 43(4), 323–335. https://doi.org/10.1080/02640414.2025.2456399 </p>

2. <p id="cite_note_2">
  <a href="#cite_ref_2">^</a> 
Ungureanu, A. N., Lupo, C., Boccia, G., & Brustio, P. R. (2021). <i>Internal Training Load Affects Day-After-Pretraining Perceived Fatigue in Female Volleyball Players</i>. International Journal of Sports Physiology and Performance, 16(12), 1844–1850. Retrieved Feb 5, 2026, from https://doi.org/10.1123/ijspp.2020-0829
</p>

3. <p id="cite_note_3">
  <a href="#cite_ref_3">^</a>
Rebelo, A., Pereira, J.R., Cunha, P. et al. <i>Training stress, neuromuscular fatigue and well-being in volleyball: a systematic review</i>. BMC Sports Sci Med Rehabil 16, 17 (2024). https://doi.org/10.1186/s13102-024-00807-7 
</p>




## Hypothesis


*UPDATED*

As position-normalized and gender-normalized workload (TotalAttacks, Digs, BlockAssists) increases, NCAA Division 1 players will exhibit a statistically significant decrease in mean HitPct during the 4th and 5th sets. Furthermore, high-workload conditions will increase performance variance, shifting the population distribution toward the lower quartile of hitting efficiency compared to 1st-set baselines. This will provide a consistency metric among players, revealing how fatigue resilience differs across gender and positional categories.
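The comparison in the hypothesis can be sketched as a paired, per-player difference between late-set and 1st-set hitting efficiency. The example assumes the standard volleyball definition HitPct = (Kills - Errors) / TotalAttacks; the `Player`, `Set`, `Kills`, and `Errors` column names and all values are hypothetical, and the real dataset may already provide HitPct directly.

```python
import pandas as pd

# Hypothetical per-player, per-set rows (values invented for illustration).
sets = pd.DataFrame({
    "Player": ["A", "A", "B", "B", "C", "C"],
    "Set": [1, 4, 1, 5, 1, 4],
    "Kills": [5, 3, 4, 2, 6, 5],
    "Errors": [1, 2, 0, 2, 2, 1],
    "TotalAttacks": [10, 10, 8, 10, 10, 10],
})

# Hitting efficiency per row; rows with zero attacks would need to be
# filtered out first to avoid division by zero.
sets["HitPct"] = (sets["Kills"] - sets["Errors"]) / sets["TotalAttacks"]

# Keep only the sets named in the hypothesis and label them by phase.
sets = sets[sets["Set"].isin([1, 4, 5])].copy()
sets["Phase"] = sets["Set"].map({1: "early", 4: "late", 5: "late"})

# One row per player with early and late HitPct side by side.
wide = sets.pivot_table(index="Player", columns="Phase", values="HitPct", aggfunc="mean")
wide["Change"] = wide["late"] - wide["early"]

mean_change = wide["Change"].mean()  # negative means efficiency dropped late
spread = wide[["early", "late"]].var()  # compare variance across phases
print(wide)
print(mean_change)
```

A paired test such as `scipy.stats.ttest_rel` could then be applied to the `early` and `late` columns; whether that test's assumptions hold for this data would still need to be checked.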

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
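As a rough illustration of steps 3 to 6 above, the sketch below reports dataset size, counts missing values per column, flags suspicious entries, and drops incomplete rows. The toy frame stands in for a file under `data/00-raw/`; its values and the `Position` column name are assumptions. The range check relies on HitPct, defined as (Kills - Errors) / TotalAttacks, always falling between -1 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a raw file loaded with pd.read_csv.
raw = pd.DataFrame({
    "Position": ["OH", "MB", None, "S"],
    "TotalAttacks": [30, 12, 25, 3],
    "HitPct": [0.250, np.nan, 1.400, -0.333],
})

# Size of the dataset: (rows, columns).
shape = raw.shape

# How much is missing, and in which columns.
missing_per_column = raw.isna().sum()

# Flag suspicious entries: HitPct cannot lie outside [-1, 1].
suspicious = raw[(raw["HitPct"] < -1) | (raw["HitPct"] > 1)]

# One possible cleaning choice: drop rows missing the fields the analysis
# needs, then keep only rows with a valid HitPct.
clean = raw.dropna(subset=["Position", "HitPct"])
clean = clean[clean["HitPct"].between(-1, 1)]

print(shape)
print(missing_per_column)
print(suspicious)
print(clean)
```

Dropping rows is only one option; whichever approach is chosen, the justification should note whether missingness clusters in particular positions, teams, or seasons.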


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

*Group Names* : Joseph Tuazon, Ivan Li, Alexis Menor, Camdon Dreisbach, Yuna Yeom 

* *Team Expectation 1* : Communicate well over Discord and try to respond to messages as soon as possible. We will try to meet once a week, but we understand that people have other commitments, so occasionally missing a meeting is okay.
* *Team Expectation 2* : Be nice to each other in the group. When disagreeing with an idea, explain why, but also try to understand the other person's point of view.
* *Team Expectation 3* : When making decisions, listen to other people's proposals and provide feedback if possible.
* *Team Expectation 4* : Try to complete each assigned task before its due date.

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3  |  7 PM | Think about the project proposal and ideas for the actual project  | Talked about the project idea and split up the work assignment that is due on Wednesday | 
| 2/10  |  7 PM |  Do background research on topic |  Detail the workload for each group member, what we need to work on, and how we can improve on the criteria we are lacking. | 
| 2/17  |  7 PM  | Look over the datasets and find features that can help with our project  | TBA |
| 2/24 | 7 PM  | Import the data into the code and clean it up (getting rid of useless data) | TBA   |
| 3/3 | 7 PM  | Beginning the analysis of the data and writing down specifics about the research | TBA |
| 3/10  | 7 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| TBA|
| 3/17 | 7 PM  | TBA | TBA |