# COGS 108 - Data Checkpoint

- Aaron Soekiatno: Conceptualization, Software, Visualization, Analysis
- Ezra Hong: Background research, Analysis, Data curation
- Dylan Dwight: Project administration, Experimental investigation, Writing - original draft
- Andrew Chon: Software, Methodology, Data curation
- Mai Tamura: Background research, Writing - original draft, Writing - review & editing

## Research Question

Research Question: Do NCAA women's volleyball teams that win the first set have a significantly higher chance of winning a match that goes beyond three sets? And when the away team wins the first set, is the match more competitive (smaller point differentials)?

Metrics/Variables: Points, sets won and lost, point differentials, fans in attendance, and home or away court.

Outcome: Match result (win/loss) 
Key predictor: First set result (win/loss) 
Additional variables: Second set result, set score margin, match format (best of 5)

Analysis Type: We will calculate match win probabilities for teams based on whether they won or lost the first set. Using bar charts, pie charts, and conditional probability tables, we will visualize how first-set winners perform in the overall match. We will extend the analysis to see if the away team winning the first set leads to a more competitive match.
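
As a minimal sketch of the conditional-probability step described above, the snippet below derives the first-set winner from the set 1 scores and computes how often that team also won the match. It assumes the dataset's column names (`team1`, `team2`, `s1_1`, `s2_1`, `winner`); the toy rows and the helper name `first_set_winner_rate` are made up for illustration and are not results from the real data.

```python
import pandas as pd

# Toy matches using the dataset's column names; teams and scores are invented.
toy = pd.DataFrame({
    "team1":  ["A", "B", "C", "D"],
    "team2":  ["E", "F", "G", "H"],
    "s1_1":   [25, 20, 25, 23],   # team 1 points in set 1
    "s2_1":   [21, 25, 23, 25],   # team 2 points in set 1
    "winner": ["A", "B", "G", "H"],
})

def first_set_winner_rate(matches: pd.DataFrame) -> float:
    """Share of matches in which the first-set winner also won the match."""
    # Keep team1 where team 1 outscored team 2 in set 1, otherwise take team2.
    first_set_winner = matches["team1"].where(
        matches["s1_1"] > matches["s2_1"], matches["team2"]
    )
    return float(first_set_winner.eq(matches["winner"]).mean())

rate = first_set_winner_rate(toy)
print(f"P(win match | won first set) = {rate:.2f}")
print(f"P(win match | lost first set) = {1 - rate:.2f}")
```

Because every match has exactly one first-set winner and one first-set loser, the two conditional probabilities sum to one, so a single rate is enough to fill the conditional probability table.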

## Background and Prior Work

Match competitiveness is a key theme in volleyball analytics. Recent work has emphasized measurable indicators of competitiveness, such as point differentials, number of sets played, and home-court effects, to understand how closely contested matches turn out.

Because matches are played as best-of-five sets in certain leagues, early outcomes do not guarantee final results, especially in matches extending beyond three sets. Research in collegiate athletics has also examined the role of home-court advantage in shaping competitive outcomes. Studies across NCAA sports have found that home teams benefit not only in win probability but also in scoring margins. In volleyball specifically, home advantage has been linked to factors such as crowd support, familiarity with court conditions, and reduced travel. However, less research has focused on how home or away status affects match competitiveness, particularly in matches that extend beyond three sets.

Additionally, sports analytics literature increasingly uses point differential as a primary indicator of competitiveness. Smaller scoring margins across sets are widely interpreted as signals of evenly matched teams. Analyses in basketball, soccer, and volleyball have demonstrated that attendance and external factors can influence these margins, potentially increasing or decreasing competitive balance.

Our project builds on this prior work by focusing strictly on the competitiveness and results of matches that extend beyond three sets. Specifically, we examine whether winning the first set meaningfully predicts match outcomes in extended matches, and whether matches in which the away team wins the first set are more competitive, as measured by smaller point differentials, than matches in which the home team wins it. By concentrating on measurable indicators of competitive intensity, our project aims to provide a clearer understanding of how early advantages shape the competitiveness of NCAA women's volleyball matches, especially for away teams.

## Hypothesis


Hypothesis: In the 2019 NCAA women's volleyball dataset, among matches that extend past three sets, the team that won the first set will have only a slightly higher chance of winning the match. Additionally, when the away team wins the first set, the match will have a smaller overall point differential, indicating greater competitiveness.

Reasoning: Winning the first set as the away team will likely increase competitiveness and put pressure on the opposing team. Additionally, the team that wins the first set has demonstrated early that it can outperform its opponent, which may reflect a skill advantage that carries through the rest of the match.
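
One way to make "only a slightly higher chance" checkable is to put a confidence interval around the first-set winner's match win rate and see whether it sits just above 0.5. The sketch below uses a Wilson score interval, which is our choice for illustration rather than a method fixed by the project; the counts passed in are placeholders, not results from the dataset.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Placeholder counts: 60 first-set winners going on to win out of 100 matches.
low, high = wilson_interval(wins=60, n=100)
print(f"95% interval for the win rate: ({low:.3f}, {high:.3f})")
```

If the whole interval lies above 0.5 but close to it, that would be consistent with the hypothesis of a small first-set advantage. Note that the interval treats matches as independent, which the repeated appearance of teams only approximately supports.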

## Data

Dataset #1
- NCAA Division I Women's Volleyball 2019 Match Results
- Link: https://github.com/larget/stat-240-case-studies/blob/master/data/vb-division1-2019-all-matches.csv
- Number of observations: 4,958 matches (one row per match, excluding the header row).
- Number of variables: 19 (columns).
- Description of variables: The variables most relevant to this project include: team1/team2 (the two teams competing in the match), site (location of the match, which indicates home, away, or neutral), s1_1 to s1_5 (points scored by Team 1 in sets 1-5), s2_1 to s2_5 (points scored by Team 2 in sets 1-5), sets_1 (total number of sets won by Team 1), sets_2 (total number of sets won by Team 2), winner (team that won the match), loser (team that lost the match), and attendance (total match attendance).
- Description of shortcomings: The dataset only includes the 2019 season and does not have a specific match competitiveness variable. Some rows are missing attendance and site values. Teams appear multiple times in different matches, so observations are not fully independent. The dataset also lacks factors like team rankings, strength, conference differences, and postseason status, which could influence both first-set outcomes and overall match competitiveness.

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/larget/stat-240-case-studies/refs/heads/master/data/vb-division1-2019-all-matches.csv', 'filename':'vball_matches.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: vball_matches.csv





### Women's NCAA Volleyball Matches Dataset #1

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


2.
    This dataset contains match data from NCAA Division I women's volleyball. Each row represents a single match and includes the points scored in each set, the number of sets won and lost by each team, the match site (home, away, or neutral), and fan attendance. Points are recorded per set, and the total point differential across a match serves as our measure of competitiveness, with smaller point differentials indicating more competitive matches. Because a team must win three sets to win a match, matches that go to four or five sets are generally more competitive than three-set matches, while attendance (number of fans) and site provide context for potential home-court effects.
    The dataset only includes the 2019 season, so the results may not generalize to other seasons. It also doesn't have a specific match competitiveness variable, which we will have to compute ourselves from the point and set differentials. Some rows are missing attendance and site values, which reduces the amount of data available for analysis. Teams appear multiple times in different matches, so observations are not fully independent, and team factors such as rankings, strength, and postseason status are absent even though they may affect match competitiveness.
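
Since the dataset has no ready-made competitiveness variable, the sketch below shows one way the point differentials could be derived from the per-set score columns (`s1_1` to `s1_5` and `s2_1` to `s2_5`). The two sample rows are copied from the first two rows of the `df.head()` output in this section; the helper name and the new column names `total_point_diff` and `mean_set_margin` are our own assumptions for illustration.

```python
import numpy as np
import pandas as pd

T1_COLS = [f"s1_{i}" for i in range(1, 6)]  # team 1 points in sets 1-5
T2_COLS = [f"s2_{i}" for i in range(1, 6)]  # team 2 points in sets 1-5

def add_competitiveness(matches: pd.DataFrame) -> pd.DataFrame:
    """Add point-differential columns; sets that were not played (NaN) are ignored."""
    out = matches.copy()
    margins = out[T1_COLS].to_numpy(dtype=float) - out[T2_COLS].to_numpy(dtype=float)
    # Absolute difference in total points scored over the whole match.
    out["total_point_diff"] = np.abs(np.nansum(margins, axis=1))
    # Average absolute margin per set actually played.
    out["mean_set_margin"] = np.nanmean(np.abs(margins), axis=1)
    return out

sample = pd.DataFrame({
    "s1_1": [25, 17], "s1_2": [25, 20], "s1_3": [12, 18],
    "s1_4": [18, np.nan], "s1_5": [17, np.nan],
    "s2_1": [21, 25], "s2_2": [20, 25], "s2_3": [25, 25],
    "s2_4": [25, np.nan], "s2_5": [19, np.nan],
})

scored = add_competitiveness(sample)
print(scored[["total_point_diff", "mean_set_margin"]])
```

The per-set average is included because a five-set match has more points in play than a three-set match, so raw totals are not directly comparable across match lengths.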

In [10]:
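# Load, clean, and wrangle the match data
import pandas as pd

# Load the raw match results downloaded by get_data.get_raw above
df = pd.read_csv('data/00-raw/vball_matches.csv')
# Drop matches with no recorded site, since site is needed for the home/away analysis
df = df.dropna(subset=['site'])
# Total sets played in the match (3, 4, or 5 in a best-of-five format)
df['total_sets'] = df['sets_1'] + df['sets_2']
# Keep only matches that went beyond three sets; copy so later edits don't touch df
df_competitive = df[df['total_sets'] >= 4].copy()
df.head()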

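| Unnamed: 0 | date | team1 | team2 | site | s1_1 | s1_2 | s1_3 | s1_4 | s1_5 | sets_1 | s2_1 | s2_2 | s2_3 | s2_4 | s2_5 | sets_2 | winner | loser | attendance | total_sets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2019-08-30 | South Carolina St. | Texas Southern | @Montgomery, Ala. | 25.0 | 25.0 | 12.0 | 18.0 | 17.0 | 2 | 21.0 | 20.0 | 25.0 | 25.0 | 19.0 | 3 | Texas Southern | South Carolina St. | 0.0 | 5 |
| 3 | 2019-08-30 | SIUE | UC Riverside | @DeKalb, IL | 17.0 | 20.0 | 18.0 | NaN | NaN | 0 | 25.0 | 25.0 | 25.0 | NaN | NaN | 3 | UC Riverside | SIUE | 75.0 | 3 |
| 4 | 2019-08-30 | Temple | Fairfield | @Baltimore, MD | 25.0 | 25.0 | 25.0 | NaN | NaN | 3 | 20.0 | 15.0 | 13.0 | NaN | NaN | 0 | Temple | Fairfield | 0.0 | 3 |
| 6 | 2019-08-30 | Niagara | Stetson | @Boca Raton, FL | 17.0 | 14.0 | 16.0 | NaN | NaN | 0 | 25.0 | 25.0 | 25.0 | NaN | NaN | 3 | Stetson | Niagara | 0.0 | 3 |
| 7 | 2019-08-30 | Howard | Hartford | @Washington, DC | 25.0 | 25.0 | 25.0 | NaN | NaN | 3 | 23.0 | 23.0 | 20.0 | NaN | NaN | 0 | Howard | Hartford | 0.0 | 3 |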
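
The missing values in the set 4 and set 5 columns above look structural rather than random: a match decided in three sets has no fourth or fifth set to record. A minimal sketch of how that could be checked is to compute the share of missing values per column, grouped by `total_sets`. The sample rows mirror a few columns of the five rows shown above, and the helper name `missing_share_by_length` is hypothetical.

```python
import numpy as np
import pandas as pd

# A few columns from the five rows displayed above.
sample = pd.DataFrame({
    "s1_3": [12.0, 18.0, 25.0, 16.0, 25.0],
    "s1_4": [18.0, np.nan, np.nan, np.nan, np.nan],
    "s1_5": [17.0, np.nan, np.nan, np.nan, np.nan],
    "attendance": [0.0, 75.0, 0.0, 0.0, 0.0],
    "total_sets": [5, 3, 3, 3, 3],
})

def missing_share_by_length(matches: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Fraction of missing values in each column, grouped by number of sets played."""
    return matches[cols].isna().groupby(matches["total_sets"]).mean()

table = missing_share_by_length(sample, ["s1_3", "s1_4", "s1_5", "attendance"])
print(table)
```

Missingness that lines up perfectly with match length does not call for dropping or filling rows. Separately, several of the displayed rows report an attendance of 0.0, which may be a placeholder rather than a true count; that is an assumption worth checking against the full data before attendance is used in the analysis.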


In [11]:
df_competitive.shape

(531, 20)
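
To flag suspicious entries, one option is to check each recorded set score against the scoring format: sets one through four are played to 25 and a deciding fifth set to 15, with a two-point winning margin required. The sketch below is a plain-Python illustration of that check; the function names are hypothetical, and the first example uses the set scores from the first row of the `df.head()` output above.

```python
def is_valid_set(a: int, b: int, target: int) -> bool:
    """True if (a, b) is a legal final score for a set played to `target`, win by two."""
    high, low = max(a, b), min(a, b)
    if high < target:
        return False              # nobody reached the target score
    if high == target:
        return low <= target - 2  # won at the target with at least a two-point gap
    return high - low == 2        # extended set must end exactly two points apart

def invalid_sets(scores: list[tuple[int, int]]) -> list[int]:
    """Return the 1-based set numbers whose scores fail the check."""
    bad = []
    for set_number, (a, b) in enumerate(scores, start=1):
        target = 15 if set_number == 5 else 25
        if not is_valid_set(a, b, target):
            bad.append(set_number)
    return bad

# Set scores from the South Carolina St. vs. Texas Southern row shown above.
print(invalid_sets([(25, 21), (25, 20), (12, 25), (18, 25), (17, 19)]))
# Invented scores that should be flagged: 25-24 and 30-25 are not legal finals.
print(invalid_sets([(25, 24), (30, 25), (25, 20)]))
```

Rows flagged this way are candidates for manual review rather than automatic removal, since a failed check could reflect a data-entry error or a forfeit recorded with unusual scores.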

### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is to demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

A. Data Collection
A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
B. Data Storage
B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?
C. Analysis
C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
D. Modeling
D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
D.5 Communicate limitations: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
E. Deployment
E.1 Monitoring and evaluation: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
E.2 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
E.3 Roll back: Is there a way to turn off or roll back the model in production if necessary?
E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

## Team Expectations 

Team Expectation 1: For communication, we will mainly use Instagram, and we expect responses within about 24 hours.

Team Expectation 2: We expect to give and receive polite feedback among group members.

Team Expectation 3: We divided tasks evenly for the proposal, but the division is subject to change as we work on the project.

Team Expectation 4: If anyone is falling behind or can't meet a deadline, we all agreed that they need to notify the group at least a day in advance.

## Project Timeline Proposal

PROJECT MEETING TIMELINE

------------------------------------------------------------
Date: January 20
Time: 1:00 PM
Completed Before Meeting:
- Read and reflect on COGS 108 expectations
- Brainstorm potential project topics and research questions

Discuss at Meeting:
- Determine primary form of group communication
------------------------------------------------------------

Date: February 4
Time: 5:00 PM
Completed Before Meeting:
- Decide on final project topic
- Develop hypothesis
- Begin background research
- Discuss ideal datasets and ethical considerations
- Complete project proposal

Discuss at Meeting:
- Finalize and submit proposal
------------------------------------------------------------

Date: February 11
Time: 10:00 AM
Completed Before Meeting:
- Search for relevant datasets

Discuss at Meeting:
- Discuss data wrangling strategy
- Identify possible analytical approaches
------------------------------------------------------------

Date: February 14
Time: 6:00 PM
Completed Before Meeting:
- Import and wrangle data
- Conduct initial exploratory data analysis (EDA)

Discuss at Meeting:
- Review and edit wrangling/EDA
- Refine analysis plan
------------------------------------------------------------

Date: February 23
Time: 12:00 PM
Completed Before Meeting:
- Finalize wrangling and EDA
- Begin formal analysis

Discuss at Meeting:
- Review and edit analysis
- Complete project check-in
------------------------------------------------------------

Date: March 13
Time: 12:00 PM
Completed Before Meeting:
- Complete analysis
- Draft results, conclusion, and discussion

Discuss at Meeting:
- Review and edit full project draft
------------------------------------------------------------

Date: March 20
Time: Before 11:59 PM
Completed Before Meeting:
- N/A

Discuss at Meeting:
- Submit final project
- Complete group project surveys
------------------------------------------------------------