# COGS 108 - Data Checkpoint



Team list and credits:

- Alexis Menor: Conceptualization, Background research, Writing – original draft, Data curation
- Camdon Dreisbach: Methodology, Software, Data curation
- Ivan Li: Analysis, Visualization
- Joseph Tuazon: Project administration, Writing – review & editing,  Data curation
- Yuna Yeom: Analysis, Background research, Visualization

## Research Question

How does each player's position-normalized cumulative workload per match, measured by actions (TotalAttacks, Digs, BlockAssists), affect hitting efficiency (HitPct) in the later 4th or 5th sets compared to the 1st set in NCAA Division 1 matches in each season from 2020 to 2024? The positions considered are OH (Outside Hitter), MB (Middle Blocker), OPP (Opposite Hitter), S (Setter), L (Libero), and L/DS (Libero/Defensive Specialist).
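
One way to make "position-normalized workload" concrete is sketched below. It assumes that workload is the unweighted sum of `TotalAttacks`, `Digs`, and `BlockAssists` per match and that normalization means a z-score within each position (`P`). Both choices, along with the function name `add_position_normalized_workload` and the output column names, are illustrative assumptions rather than a settled definition.

```python
import pandas as pd

WORKLOAD_COLS = ["TotalAttacks", "Digs", "BlockAssists"]

def add_position_normalized_workload(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with raw and position-normalized workload columns.

    Assumption: workload is the unweighted sum of the three action counts,
    and normalization is a z-score computed within each position (column P).
    """
    out = df.copy()
    out["Workload"] = out[WORKLOAD_COLS].sum(axis=1)
    by_position = out.groupby("P")["Workload"]
    mean = by_position.transform("mean")
    std = by_position.transform("std")  # sample standard deviation (ddof=1)
    # positions with a single row or no spread get NaN instead of a z-score
    out["Workload_z"] = (out["Workload"] - mean) / std.where(std > 0)
    return out

# toy data, not taken from the dataset
demo = pd.DataFrame({
    "P": ["OH", "OH", "OH", "L", "L", "L"],
    "TotalAttacks": [30, 40, 50, 0, 0, 0],
    "Digs": [5, 5, 5, 10, 15, 20],
    "BlockAssists": [1, 1, 1, 0, 0, 0],
})
demo = add_position_normalized_workload(demo)
print(demo[["P", "Workload", "Workload_z"]])
```

In the toy data the liberos have far lower raw workloads than the outside hitters, yet both groups receive the same z-scores, which is the point of normalizing within position.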


## Background and Prior Work

College athletes undergo demanding training and heavy competition schedules while simultaneously managing academic responsibilities. As a result, fatigue has emerged as a critical factor influencing athletic performance, recovery, and mental health. In sports such as volleyball, where matches are structured into sequential sets, fatigue may accumulate as competition progresses, potentially leading to declines in performance during later sets of a match. Prior research suggests that excessive physical and cognitive workload can reduce volleyball players' performance quality, including declines in visual perception, concentration, and reaction time <a id="cite_ref_1"></a><sup><a href="#cite_note_1">1</a></sup>.

Because volleyball positions carry different sport-specific demands, fatigue can vary across positions. Players engage in hitting, setting, defense, and blocking, which differ in how much jumping, lateral movement, reaction speed, and quick directional change they require. Previous research suggests that fatigue can be position-specific and can localize in different areas of the body <a id="cite_ref_2"></a><sup><a href="#cite_note_2">2</a></sup>. Hitters and blockers jump more frequently than setters and defensive specialists, so differences in physical fatigue between these groups are expected. Thus, it is important to consider how different positions need to recover during and after a strenuous game. Additionally, the change from pre-season preparation to in-season competition, then to the off-season, can impact an athlete's recovery and performance <a id="cite_ref_3"></a><sup><a href="#cite_note_3">3</a></sup>. The shift from pre-season, where training is moderate and games are infrequent, to in-season, where training intensifies and games are frequent, is also a relevant factor in a player's performance.

Existing literature primarily focuses on physiological measures of fatigue or subjective survey-based assessments collected during training and competition. However, there is comparatively limited research examining how fatigue translates into measurable statistical performance changes during competitive play, particularly at the collegiate level. 

This project aims to address these gaps by analyzing how fatigue relates to in-game statistical performance across game sets and positions. By comparing early-set performance to late-set performance, we can provide quantitative insight into how cumulative intensity of playing-time affects gameplay outcomes. 

1. <p id="cite_note_1">
  <a href="#cite_ref_1">^</a>
Yu, Y., Zhang, L., Cheng, M.-Y., Liang, Z., Zhang, M., & Qi, F. (2025). <i>The effects of different fatigue types on action anticipation and physical performance in high-level volleyball players</i>. Journal of Sports Sciences, 43(4), 323–335. https://doi.org/10.1080/02640414.2025.2456399 </p>

2. <p id="cite_note_2">
  <a href="#cite_ref_2">^</a> 
Ungureanu, A. N., Lupo, C., Boccia, G., & Brustio, P. R. (2021). <i>Internal Training Load Affects Day-After-Pretraining Perceived Fatigue in Female Volleyball Players</i>. International Journal of Sports Physiology and Performance, 16(12), 1844–1850. Retrieved Feb 5, 2026, from https://doi.org/10.1123/ijspp.2020-0829 
</p>

3. <p id="cite_note_3">
  <a href="#cite_ref_3">^</a>
Rebelo, A., Pereira, J. R., Cunha, P., et al. (2024). <i>Training stress, neuromuscular fatigue and well-being in volleyball: a systematic review</i>. BMC Sports Science, Medicine and Rehabilitation, 16, 17. https://doi.org/10.1186/s13102-024-00807-7 
</p>


## Hypothesis


*UPDATED*

As position-normalized and gender-normalized workload (TotalAttacks, Digs, BlockAssists) increases, NCAA Division 1 players will exhibit a statistically significant decrease in mean HitPct during the 4th and 5th sets. Furthermore, high-workload conditions will increase performance variance, shifting the population distribution toward the lower quartile of hitting efficiency compared to 1st-set baselines. This will provide a consistency metric among players, revealing how fatigue resilience differs across gender and positional categories.
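
The comparison in this hypothesis could be summarized along the following lines. The sketch assumes a per-row workload score is already available (here a hypothetical `Workload_z` column), splits rows at the median into "low" and "high" workload groups, and reports mean `HitPct`, its variance, and the share of rows falling in the bottom quartile of `HitPct`. The median split, the function name `summarize_by_workload`, and the toy numbers are illustrative assumptions, and a descriptive summary like this does not replace a significance test.

```python
import numpy as np
import pandas as pd

def summarize_by_workload(df, workload_col="Workload_z", perf_col="HitPct"):
    """Compare hitting efficiency between low- and high-workload rows."""
    data = df[[workload_col, perf_col]].dropna()
    # bottom-quartile cutoff of hitting efficiency over all rows
    q1 = data[perf_col].quantile(0.25)
    median_load = data[workload_col].median()
    group = np.where(data[workload_col] > median_load, "high", "low")
    summary = (
        data.assign(group=group, in_bottom_quartile=data[perf_col] <= q1)
            .groupby("group")
            .agg(mean_hitpct=(perf_col, "mean"),
                 var_hitpct=(perf_col, "var"),
                 bottom_quartile_share=("in_bottom_quartile", "mean"),
                 n=(perf_col, "count"))
    )
    return summary

# toy data, not taken from the dataset
toy = pd.DataFrame({
    "Workload_z": [-1.5, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 1.5],
    "HitPct":     [0.30, 0.35, 0.25, 0.30, 0.20, 0.40, 0.00, 0.10],
})
print(summarize_by_workload(toy))
```

If the hypothesis holds, the "high" row would show a lower mean, a larger variance, and a larger bottom-quartile share than the "low" row, as it does in this constructed example.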

## Data

### Data overview

We collected data from the ncaavolleyballr GitHub repository, which provides condensed match statistics of NCAA D1 volleyball games for men and women during the 2020-2024 seasons. This dataset contains information about Total Attacks, Hitting Percentage (Hit Pct), Assists, Digs, Aces, Receiving Errors (RErr), Blocking Errors (BErr), Service Errors (SErr), Block Assists, and Block Solos.

- Dataset #1
  - Dataset Name: D1 MVB & WVB playermatch stats 2020-2024
  - Link to the dataset: https://github.com/JeffreyRStevens/ncaavolleyballr/tree/main/data-csv
  - Number of observations: 254655
  - Number of variables: 26
  - Description of the variables most relevant to this project
  - This dataset contains each player's Total Attacks, Hitting Percentage (Hit Pct), Aces, Assists, Digs, Block Assists, Errors, and Block Solos for each match. It also records how many sets they played, what day they played on, and which season they were in.


In [25]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [1]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 

   




import pandas as pd

datafiles = [
    {'url': 'https://github.com/COGS108/Group035_WI26/tree/master/data', 'filename':"wvb_playermatch_div1_2020.txt"},
    {'url': 'https://github.com/COGS108/Group035_WI26/tree/master/data', 'filename':"wvb_playermatch_div1_2021.txt"},
    {'url': 'https://github.com/COGS108/Group035_WI26/tree/master/data', 'filename':"wvb_playermatch_div1_2022.txt"},
    {'url': 'https://github.com/COGS108/Group035_WI26/tree/master/data', 'filename':"wvb_playermatch_div1_2023.txt"},
    {'url': 'https://github.com/COGS108/Group035_WI26/tree/master/data', 'filename':"wvb_playermatch_div1_2024.txt"},
]

get_data.get_raw(datafiles, destination_directory='data/00-raw/')



Successfully downloaded: wvb_playermatch_div1_2020.txt
Successfully downloaded: wvb_playermatch_div1_2021.txt
Successfully downloaded: wvb_playermatch_div1_2022.txt
Successfully downloaded: wvb_playermatch_div1_2023.txt
Successfully downloaded: wvb_playermatch_div1_2024.txt
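
Every entry in `datafiles` above uses the same URL, which is a GitHub directory ("tree") page rather than a link to an individual file. A minimal sketch of how per-file raw URLs could be derived is shown below. It assumes the files are stored in that directory under the listed filenames, which is not confirmed here; the helper `github_raw_url` and the list name `datafiles_raw` are hypothetical.

```python
def github_raw_url(tree_url: str, filename: str) -> str:
    """Turn a GitHub directory ("tree") URL plus a filename into a raw-file URL.

    Input form:  https://github.com/OWNER/REPO/tree/BRANCH/some/dir
    Output form: https://raw.githubusercontent.com/OWNER/REPO/BRANCH/some/dir/FILE
    Assumes the branch name is a single path segment (no "/" in it).
    """
    prefix = "https://github.com/"
    if not tree_url.startswith(prefix):
        raise ValueError(f"not a github.com URL: {tree_url}")
    parts = tree_url[len(prefix):].strip("/").split("/")
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a GitHub tree URL: {tree_url}")
    owner, repo, _, branch, *path = parts
    return "/".join(
        ["https://raw.githubusercontent.com", owner, repo, branch, *path, filename]
    )

tree_url = "https://github.com/COGS108/Group035_WI26/tree/master/data"
datafiles_raw = [
    {"url": github_raw_url(tree_url, f"wvb_playermatch_div1_{year}.txt"),
     "filename": f"wvb_playermatch_div1_{year}.txt"}
    for year in range(2020, 2025)
]
print(datafiles_raw[0]["url"])
```

A list built this way has the same `url`/`filename` shape that `get_data.get_raw` is called with above, so it could be passed in place of `datafiles` if the assumption about file locations holds.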





### Women's Volleyball Player Statistics, D1 2020-2024 

This dataset details statistics for every D1 women's volleyball player within the NCAA from 2020-2024. Variables to consider are:

- Team: team that the player is associated with
- Season: season the statistics are from, covering 2020 through 2024 (stored in a form such as `2021-2022`)
- Date: date of the match
- Conference: conference that the team is in
- Opponent_Team: team that the player's team played against
- Opponent_Conference: conference that the opponent is in
- Location: whether the match was played at the team's home court or at the opponent's court
- Player: name of the player
- P: position of the player (e.g., MB, OH, S, L)
- S: number of sets the player played in
- Kills: attacks that directly resulted in a point
- Errors: attack errors, i.e., attacks that resulted in a point for the other team
- TotalAttacks: total attack attempts
- HitPct: hitting percentage, (Kills - Errors) / TotalAttacks
- Assists: number of sets that resulted in a kill
- Aces: serves that directly resulted in a point
- SErr: service errors
- Digs: number of times the player received the ball after an opponent's attack
- RErr: reception errors
- BlockSolos: blocks of an opponent's attack made by one player alone
- BlockAssists: blocks of an opponent's attack made by two or more players together
- BErr: blocking errors
- TB: total blocks
- PTS: total points credited to the player (block assists count as half a point, so values such as 3.5 occur)
- BHE: ball handling errors (violations for mishandling the ball, e.g., a 'double' or 'lift')
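
The `HitPct` definition above can be written out as a small function. This is a sketch, not the source's own computation: returning NaN when a player has no attack attempts is an assumption, and the function name `hitting_pct` is hypothetical.

```python
import pandas as pd

def hitting_pct(kills, errors, total_attacks):
    """(kills - errors) / total_attacks, with NaN where there were no attempts."""
    kills = pd.Series(kills, dtype="float64")
    errors = pd.Series(errors, dtype="float64")
    attempts = pd.Series(total_attacks, dtype="float64")
    # attempts of 0 become NaN so the division never divides by zero
    return (kills - errors) / attempts.where(attempts > 0)

example = hitting_pct(kills=[10, 3, 0], errors=[2, 5, 0], total_attacks=[20, 10, 0])
print(example.round(3).tolist())
```

A player with 10 kills and 2 errors on 20 attempts hits .400, while a player with more errors than kills has a negative hitting percentage.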

We will identify the key variables to focus on to analyze fatigue and player performance. Most likely, points made or assisted (Kills, Aces, Assists, TB) will be compared to the error variables. One potential concern with this dataset is unequal playing time, which may obscure what truly contributes to low or high stats. Additionally, we must consider the context of these games, such as opponent strength and the effect of home vs. away games.
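
One way to reduce the unequal-playing-time concern is to express counting stats per set played. The sketch below assumes `S` holds the number of sets the player appeared in for that match; the function name `add_per_set_rates` and the `_per_set` column suffix are hypothetical.

```python
import pandas as pd

def add_per_set_rates(df, count_cols=("TotalAttacks", "Digs", "BlockAssists")):
    """Return a copy of df with each counting stat divided by sets played (S)."""
    out = df.copy()
    sets_played = pd.to_numeric(out["S"], errors="coerce")
    sets_played = sets_played.where(sets_played > 0)  # 0 sets becomes NaN
    for col in count_cols:
        out[f"{col}_per_set"] = out[col] / sets_played
    return out

# toy data, not taken from the dataset
matches = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "S": [5, 2, 0],
    "TotalAttacks": [40, 16, 0],
    "Digs": [10, 4, 0],
    "BlockAssists": [5, 1, 0],
})
rates = add_per_set_rates(matches)
print(rates[["Player", "TotalAttacks_per_set", "Digs_per_set", "BlockAssists_per_set"]])
```

Players A and B have very different raw attack totals but the same attack rate per set, which is the kind of difference that raw counts would hide.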



In [49]:
## Load, clean, and tidy the women's player-match data
from pathlib import Path
import pandas as pd
import numpy as np

# one text file per season; sorting keeps the seasons in order
data_dir = Path('data/')
files = sorted(data_dir.glob("wvb_playermatch_div1_*.txt"))

dfs = [pd.read_csv(f) for f in files]
wvb_df = pd.concat(dfs, ignore_index=True)

# standardize column names: trim whitespace and replace spaces with underscores
wvb_df.columns = (
        wvb_df.columns
            .str.strip()
            .str.replace(' ', '_')
)
# drop columns whose names are exact duplicates
# (the match is case-sensitive, so 'team' and 'Team' are both kept)
wvb_df = wvb_df.loc[:, ~wvb_df.columns.duplicated()]

wvb_df["Date"] = pd.to_datetime(wvb_df["Date"])

# counting stats: coerce to numeric first, then treat missing values as 0
numeric_cols = [
    "SErr", "Digs", "RErr",
    "BlockSolos", "BlockAssists", "BErr", "TB"
]
for col in numeric_cols:
    if col in wvb_df.columns:
        wvb_df[col] = pd.to_numeric(wvb_df[col], errors="coerce").fillna(0)

# missing values remaining per column (displayed only if this is the cell's last expression)
wvb_df.isna().sum().sort_values(ascending=False).head(10)

wvb_df = wvb_df.drop_duplicates()
## drop match sets and jersey number since we are not interested in them
wvb_df = wvb_df.drop(columns=['MS', 'Number'])

## combines total errors (currently disabled)
#wvb_df["Total_Errors"] = wvb_df["BErr"] + wvb_df["RErr"] + wvb_df["SErr"]

## sort each player's matches chronologically
wvb_df = wvb_df.sort_values(['Player', 'Date'])
## drop columns we may not use later, if present (missing names are ignored)
wvb_df = wvb_df.drop(columns=["BHe", "RetAtt"], errors="ignore")





Files being loaded: ['wvb_playermatch_div1_2020.txt', 'wvb_playermatch_div1_2021.txt', 'wvb_playermatch_div1_2022.txt', 'wvb_playermatch_div1_2023.txt', 'wvb_playermatch_div1_2024.txt']

Suspicious rows found: 0
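
The cell above does not show how "suspicious rows" were defined. A minimal sketch of one possible consistency check follows. The three rules (negative counts, more kills plus errors than attempts, and a stored `HitPct` that disagrees with the formula) and the name `find_suspicious_rows` are assumptions for illustration, not the check that produced the count above.

```python
import pandas as pd

def find_suspicious_rows(df, tol=0.001):
    """Flag rows whose attack statistics are internally inconsistent."""
    kills, errors, attempts = df["Kills"], df["Errors"], df["TotalAttacks"]
    negative = (kills < 0) | (errors < 0) | (attempts < 0)
    too_many_outcomes = (kills + errors) > attempts
    expected = (kills - errors) / attempts.where(attempts > 0)
    # a missing stored or recomputed value compares as False, so it is not flagged
    mismatch = (df["HitPct"] - expected).abs() > tol
    return df[negative | too_many_outcomes | mismatch]

# toy data, not taken from the dataset
check = pd.DataFrame({
    "Kills":        [10, 5, 3],
    "Errors":       [2, 1, 4],
    "TotalAttacks": [20, 4, 10],
    "HitPct":       [0.400, 1.000, 0.250],
})
suspicious = find_suspicious_rows(check)
print("Suspicious rows found:", len(suspicious))
```

The tolerance of 0.001 allows for a stored `HitPct` that has been rounded to three decimals.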




In [52]:
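from pathlib import Path
out_path = Path('data/02-processed/wvb_playermatch_combined.csv')
out_path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing
wvb_df.to_csv(out_path, index=False)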

In [51]:
print("Number of observations:", wvb_df.shape[0])
print("Number of variables:", wvb_df.shape[1])
print("\nVariable names:")
print(wvb_df.columns.tolist())

wvb_df.head()

Number of observations: 254655
Number of variables: 26

Variable names:
['team', 'Season', 'Date', 'Team', 'Conference', 'Opponent_Team', 'Opponent_Conference', 'Location', 'Player', 'P', 'S', 'Kills', 'Errors', 'TotalAttacks', 'HitPct', 'Assists', 'Aces', 'SErr', 'Digs', 'RErr', 'BlockSolos', 'BlockAssists', 'BErr', 'TB', 'PTS', 'BHE']


| | team | Season | Date | Team | Conference | Opponent_Team | Opponent_Conference | Location | Player | P | ... | Aces | SErr | Digs | RErr | BlockSolos | BlockAssists | BErr | TB | PTS | BHE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 73824 | Grambling | 2021-2022 | 2021-08-27 | Grambling | SWAC | South Dakota St. | Summit League | Home | A'Lexus Everett | MB | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.0 | 3.5 | 0 |
| 73837 | Grambling | 2021-2022 | 2021-08-27 | Grambling | SWAC | McNeese | Southland | Home | A'Lexus Everett | MB | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 | 2.0 | 0 |
| 73852 | Grambling | 2021-2022 | 2021-09-07 | Grambling | SWAC | Centenary (LA) | SCAC | Home | A'Lexus Everett | MB | ... | 0 | 0 | 0 | 0 | 1 | 4 | 1 | 0.0 | 7.0 | 0 |
| 73863 | Grambling | 2021-2022 | 2021-09-08 | Grambling | SWAC | Northwestern St. | Southland | Home | A'Lexus Everett | MB | ... | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0.0 | 7.5 | 0 |
| 73876 | Grambling | 2021-2022 | 2021-09-14 | Grambling | SWAC | Jarvis Christian | | Away | A'Lexus Everett | MB | ... | 0 | 0 | 0 | 0 | 3 | 1 | 2 | 0.0 | 9.5 | 0 |



## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

*Group Names* : Joseph Tuazon, Ivan Li, Alexis Menor, Camdon Dreisbach, Yuna Yeom 

* *Team Expectation 1* : Communicate well over Discord and try to respond to messages as soon as possible. We will try to meet once a week, but we understand that people have other things in their lives, so missing a meeting occasionally is okay.
* *Team Expectation 2* : Be nice to each other in the group. When disagreeing with an idea, explain why, but also try to understand the other person's point of view.
* *Team Expectation 3* : When making decisions, listen to other people's proposals and provide feedback if possible.
* *Team Expectation 4* : Try to complete each assigned task before its due date.

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3  |  7 PM | Think about the project proposal and ideas for the actual project  | Talk about the project idea and split up the work for the assignment that is due on Wednesday | 
| 2/10  |  7 PM |  Do background research on topic |  Detail the workload for each group member, what we need to work on, and how we can improve on the criteria we are lacking | 
| 2/17  |  7 PM  | Look over the data sets/find special features about that data set that can help with our project  | TBA |
| 2/24 | 7 PM  | Import the data into the code and clean it up (getting rid of useless data) | TBA   |
| 3/3 | 7 PM  | Beginning the analysis of the data and writing down specifics about the research | TBA |
| 3/10  | 7 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| TBA|
| 3/17 | 7 PM  | TBA | TBA |