## Authors

Erika Tong: Writing, Conceptualization, Fixed feedback areas, Review/Editing 

Jason Wilkens: Writing, Conceptualization, Data cleaning 

Elaine Sun: Writing, Background research, Fixed feedback areas 

Mohamed Adem: Writing, Data Curation, Data cleaning 

Timothy Kim: Review and Editing, Data cleaning, Writing 

## Research Question

Does playing back-to-back NBA games affect team performance? Using data from 2000 to the present, how do key performance metrics—such as points scored, field goal percentage, turnovers, and win/loss outcome—differ between games played on consecutive days and games with at least one day of rest?

## Background and Prior Work

Basketball is a high-intensity, physically demanding sport played professionally in the National Basketball Association (NBA), where teams compete in an 82-game regular season typically running from October to April. During this period, each team plays several games per week, often traveling long distances between cities. Over a season, this results in periods of “schedule congestion”, including back-to-back games, in which a team plays on consecutive days, as well as games with varying amounts of rest and recovery in between. Performance in these games is usually measured using team-level statistics such as points scored, field goal percentage (FG%), turnovers, and the win/loss outcome, as well as more nuanced metrics like net efficiency, effective field goal percentage (an adjusted shooting measure accounting for three-point shots), and pace of play. These metrics provide insight into both offensive and defensive performance and allow comparison across games and conditions.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1), <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

To determine which statistics most strongly relate to team success, prior empirical research has analyzed thousands of NBA games to identify performance indicators associated with winning outcomes. One large-scale study examining nearly 4,000 NBA regular and postseason games found that field goal percentage and overall shooting efficiency were among the most influential variables distinguishing winning and losing teams.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Defensive rebounding and shooting efficiency accounted for a substantial portion of the explained variance in game outcomes. The study also observed that teams tend to adopt more conservative styles of play under higher-pressure conditions, resulting in fewer field goal attempts, assists, and turnovers. These findings establish that performance metrics such as FG%, scoring efficiency, and turnovers are not arbitrary statistics but empirically supported indicators of team success. At the same time, the authors acknowledge that these relationships are statistical associations and do not imply that any single metric alone determines game outcomes, highlighting the multifactorial nature of basketball performance.

In addition to peer-reviewed research, prior student-led data science projects have explored NBA performance statistics using structures and approaches that are closely related to the project proposed here. A university capstone project<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) analyzed an NBA Player Performance Statistics dataset sourced from Kaggle. The dataset contains player-level and team-level performance metrics such as points scored, shooting percentages, rebounds, assists, and turnovers across multiple seasons. Using the publicly available NBA data, the project investigated relationships between different offensive and performance variables, including patterns of playing style, the association between two-point field goal production and player experience, and the relationship between age and three-point scoring. Using exploratory data analysis and visualization techniques, the authors found that player experience is positively associated with two-point scoring output, while the relationship between age and three-point production is not strictly linear. Importantly, they emphasized that NBA performance is influenced by multiple interacting variables, highlighting the value of examining several performance metrics together rather than relying on a single statistic. While this prior project focused on player-level characteristics, its methodology and use of NBA performance statistics are relevant to our work. Our project adopts a similar data-driven and exploratory approach but shifts the unit of analysis to the team and game level, examining how performance metrics differ under different rest conditions, specifically back-to-back games versus games with at least one day of rest, along with other possible contextual factors that we might discover. In this way, our analysis builds directly on similar analytical techniques while addressing a distinct and complementary research question within NBA analytics.

While game-level statistics capture how teams perform during competition, additional research suggests that performance may also be influenced by contextual factors outside the game itself. One such factor is fatigue resulting from workload and limited recovery time, which commonly arises from scheduling and back-to-back games, defined as games played by the same team on consecutive days without a full day of rest in between. In sports science, fatigue is understood to affect both physical and cognitive performance, including reaction time, shooting accuracy, and decision-making, all of which are central to basketball success. Back-to-back games are therefore often viewed as particularly challenging: players have limited recovery time, which contributes to accumulated physical and cognitive fatigue. A recent CMU research paper<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) examining the impact of load management on NBA player performance analyzed how rest days, cumulative minutes played, and opponent strength relate to player-level outcomes such as plus/minus. Using linear models, mixed-effects models, and decision tree–based approaches, the study found that rest days alone were not statistically significant predictors of player plus/minus. Instead, workload-related variables such as cumulative minutes and previous game minutes were more influential predictors. Furthermore, differences between linear and non-linear modeling approaches suggested that the relationship between rest and performance may be complex and potentially non-linear. The authors also identified several limitations in their analysis, including reliance on plus/minus as the primary response variable and limited inclusion of specific offensive or defensive efficiency metrics. They suggested that alternative performance measures, such as shooting efficiency or role-specific statistics, could yield additional insights into how rest affects particular aspects of performance. This perspective underscores the importance of examining multiple team-level performance indicators rather than relying on a single metric when evaluating the impact of scheduling factors.

Altogether, existing literature establishes a few key points. First, empirically validated performance metrics like field goal percentage, shooting efficiency, turnovers, and rebounding are strongly associated with winning outcomes in the NBA. Second, contextual factors such as fatigue, workload, travel demands, and scheduling density may influence how teams perform, although the magnitude and mechanism of these effects remain debated and may depend on modeling choices and metric selection. Third, demographic and experience-related factors (age, minutes played, and games played) have been shown to correlate with certain aspects of performance, particularly shooting efficiency and scoring output. While some studies suggest that rest days alone may not strongly predict aggregate measures such as plus/minus, they also indicate that performance is shaped by multiple interacting variables and that fatigue-related effects may manifest differently across specific components of play. This body of research highlights both the importance of established performance metrics and the complexity of isolating the role of external scheduling factors.

Despite this foundation, explicit comparisons of team-level performance across different rest conditions remain relatively underexplored using a consistent set of empirically supported game statistics. Much of the prior work focuses either on identifying which performance metrics predict winning or on player-level characteristics, rather than isolating rest condition, workload management, or other external factors as contextual variables affecting team-level game outcomes. Given that back-to-back games are a routine and structurally significant feature of the NBA schedule, understanding how rest availability relates to team-level performance, by comparing the same core performance metrics between games played on consecutive days and games played with additional rest, represents a meaningful extension of existing research. In this study, we treat rest condition, specifically back-to-back games versus games with at least one full day of rest, as the primary contextual variable, while also exploring how other team-level performance metrics relate to game outcomes. This approach builds on existing research by focusing directly on scheduling-related factors, while still allowing for exploratory analysis of other relevant performance metrics. Overall, this project represents a logical next step in NBA performance analysis and helps connect prior work on empirically validated performance metrics, scheduling, and other external factors to a data-driven research question: how rest availability may relate to team-level performance outcomes in the NBA.



1. <a name="cite_note-1"></a> [^](#cite_ref-1) The Hidden Metrics That Matter Most for Predicting NBA Outcomes - NBAstuffer. (2025, September 19). NBAstuffer. https://www.nbastuffer.com/nba-metrics-for-outcome-predictions/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Basketball Reference. (2019). Glossary | Basketball-Reference.com. Basketball-Reference.com. https://www.basketball-reference.com/about/glossary.html
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Cabarkapa, D., Deane, M. A., Fry, A. C., Jones, G. T., Cabarkapa, D. V., Philipp, N. M., & Yu, D. (2022). Game statistics that discriminate winning and losing at the NBA level of basketball competition. PLOS ONE, 17(8), e0273427. https://doi.org/10.1371/journal.pone.0273427
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Yi, B., Balusu, N., Gogineni, R., & Gorji, Z. (2023). NBA Performance Stats Data Visualization. Cmu.edu. https://www.stat.cmu.edu/capstoneresearch/spring2023/315files_s23/team8.html
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Chen, D., Huang, E., Ou, C., & Parikh, S. (2025). Impact of Load Management on NBA Player Performance. https://www.stat.cmu.edu/capstoneresearch/460files_s25/team15.pdf

## Hypothesis


We predict that NBA teams playing back-to-back games will show decreased performance, as a result of natural fatigue and minimal time to recover, compared to games played with at least one day of rest. Specifically, we expect lower points scored and field goal percentages, higher turnovers, and a lower probability of winning in back-to-back games. However, observed differences may also be influenced by opponent strength or player absences due to injuries, rest, or suspensions.
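
As a rough illustration of how this hypothesis could be examined, the sketch below compares the mean of each metric between rested and back-to-back games. It assumes a boolean `back_to_back` column has already been derived from game dates (it is not part of the raw data); the function name `compare_by_rest` and the toy rows are hypothetical and are not real game results.

```python
import pandas as pd

def compare_by_rest(df, metrics):
    """Mean of each metric for rested vs. back-to-back games, plus the gap."""
    # ASSUMPTION: `back_to_back` is a derived boolean column, not a raw variable.
    is_b2b = df["back_to_back"].astype(bool)
    summary = pd.DataFrame({
        "rested": df.loc[~is_b2b, metrics].mean(),
        "back_to_back": df.loc[is_b2b, metrics].mean(),
    })
    # Negative values mean the metric is lower in back-to-back games.
    summary["difference"] = summary["back_to_back"] - summary["rested"]
    return summary

# Made-up rows for illustration only; these are not real game results.
toy_games = pd.DataFrame({
    "back_to_back": [True, True, False, False],
    "teamScore": [95, 99, 104, 106],
    "turnovers": [16, 18, 13, 15],
    "win": [0, 1, 1, 1],
})
rest_summary = compare_by_rest(toy_games, ["teamScore", "turnovers", "win"])
print(rest_summary)
```

A difference in means alone would not establish the hypothesis; a formal test (for example, Welch's t-test for the continuous metrics) and some control for opponent strength would still be needed.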

## Data

### Data overview
Dataset #1: NBA Database (1947–Today)

https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores 

- Number of observations: 145,548 (raw), 71,312 after wrangling/cleaning (71,316 rows before dropping 4 rows with missing values)
- Number of variables: 48 (raw), 12 after wrangling/cleaning
- Relevant Variables:
  - gameDateTimeEst: Date of game
  - teamId: Team identifier
  - teamScore: Points scored by team
  - opponentScore: Points allowed
  - fieldGoalsPercentage: Shooting efficiency
  - turnovers: Number of turnovers
  - win: Binary indicator (1 = win, 0 = loss)
  - home: Binary indicator (1 = home, 0 = away)
- Shortcomings:
  - Rest/back-to-back status is not directly provided and must be computed from dates.
  - No contextual variables, such as the strength of the opponent, are included.
  - Many advanced variables (quarter points, fast break points, bench points, etc.) are heavily missing for early seasons. However, these variables are not central to our research question and are excluded from analysis.

The dataset includes team-level statistics for each NBA game, such as game date, team identifiers, points scored, field goal percentage, and turnovers, which are relevant for analyzing performance in back-to-back games. 
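
Because rest is not provided directly, it has to be derived from the game dates. The following is a minimal sketch of one way to do that, assuming the cleaned columns `teamId` and `gameDateTimeEst`. The function name `add_rest_columns` and the new columns `days_since_last_game` and `back_to_back` are hypothetical names introduced here, and the toy rows are made up.

```python
import pandas as pd

def add_rest_columns(df):
    """Add days since each team's previous game and a back-to-back flag."""
    out = df.copy()
    out["gameDateTimeEst"] = pd.to_datetime(out["gameDateTimeEst"])
    out = out.sort_values(["teamId", "gameDateTimeEst"])
    # Date of the same team's previous game (NaT for a team's first game).
    previous_game = out.groupby("teamId")["gameDateTimeEst"].shift(1)
    out["days_since_last_game"] = (out["gameDateTimeEst"] - previous_game).dt.days
    # A back-to-back game is one played exactly one day after the previous game.
    out["back_to_back"] = out["days_since_last_game"] == 1
    return out

# Made-up schedule for two hypothetical teams.
toy_schedule = pd.DataFrame({
    "teamId": [1, 1, 1, 2, 2],
    "gameDateTimeEst": ["2000-01-02", "2000-01-03", "2000-01-06",
                        "2000-01-02", "2000-01-04"],
})
toy_with_rest = add_rest_columns(toy_schedule)
print(toy_with_rest)
```

Note that this sketch does not reset at season boundaries, so the first game of each season would show a very long rest gap; those rows would need separate handling.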

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 

# Problem: get_data.get_raw is supposed to download each file from its URL into data/00-raw/,
# but Kaggle requires API credentials, so the direct download does not work.
# For now, TeamStatistics.csv is placed manually in the data/00-raw/ folder.
datafiles = [
    {'url': 'https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores?select=TeamStatistics.csv', 'filename': 'TeamStatistics.csv'},
]

# Left commented out until the Kaggle download is set up:
# get_data.get_raw(datafiles, destination_directory='data/00-raw/')
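
Since the raw file currently has to be placed by hand, a small check before loading can make a missing file easier to diagnose. This is only a sketch: the helper `missing_raw_files` is a hypothetical name and is not part of the `get_data` module.

```python
from pathlib import Path

def missing_raw_files(filenames, raw_dir="data/00-raw"):
    """Return the expected raw files that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in filenames if not (raw_path / name).is_file()]

missing = missing_raw_files(["TeamStatistics.csv"])
if missing:
    # Remind the user to download the file from Kaggle and place it manually.
    print(f"Place these files in data/00-raw/ before loading: {missing}")
```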

### Team Statistics

This dataset contains a complete record of team-level box scores for every game played in the National Basketball Association (NBA) from 1947 to the present day. Each row represents one team's performance in a single game. The most important metrics for our analysis are teamScore and opponentScore, the points scored by the team and by its opponent; the team with the higher score wins. Points are earned from two-point field goals, three-point field goals, and free throws. In recent seasons, most teams score roughly 95-125 points per game; low scores typically indicate poor offensive performance, while high scores usually indicate a fast-paced game or overtime play. Another important variable is fieldGoalsPercentage, the fraction of attempted field goals that were made, stored as a decimal (for example, 0.455 for 45.5%). A field goal is any successful basket worth two or three points made during live play, excluding free throws. The average in our cleaned data is about 45.6%, with higher values indicating efficient shooting and lower values suggesting poor shooting. The turnovers variable is a count per game. A turnover occurs when a team loses possession of the ball before attempting a shot, either through a steal by the other team or through a violation. On average, a team commits 12-16 turnovers per game, with values over 20 suggesting poor performance. Lastly, the variable home is a binary indicator (1 = home game, 0 = away game). Historically, teams win around 55-60% of games played at home, giving the home team a higher likelihood of winning.

There are several concerns with this dataset. First, it includes games outside the regular season: playoff games are included, and they are usually played more intensely. Additionally, there are some missing variables, such as total number of shot attempts, free throw percentage, missing players, injuries, rebounds, etc. Also, because these records date back to the late 1940s, the way information was recorded, stored, and reviewed has most likely changed drastically. Furthermore, human or machine errors made when recording data are a potential source of misleading values within the dataset.
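
One possible way to separate playoff games from regular-season games is to read the game type from `gameId`. The sketch below rests on an assumption that should be verified against the raw data before use: that the IDs follow the common NBA convention in which the digit before the two-digit season and five-digit game number encodes the game type. The mapping `GAME_TYPES`, the function `game_type_from_id`, and the second toy ID are hypothetical.

```python
import pandas as pd

# ASSUMPTION: gameId = <type digit><2-digit season><5-digit game number>,
# e.g. 29900427 -> type 2, season 99. Verify this against the raw data.
GAME_TYPES = {1: "preseason", 2: "regular season", 3: "all-star", 4: "playoffs"}

def game_type_from_id(game_id):
    """Infer the game type from the leading type digit of a game ID."""
    type_digit = (int(game_id) // 10_000_000) % 10
    return GAME_TYPES.get(type_digit, "unknown")

# The first ID appears in the cleaned data; the second is a made-up example.
toy_ids = pd.Series([29900427, 49900401])
toy_types = toy_ids.apply(game_type_from_id)
print(toy_types.tolist())
```

If the raw file already contains a column describing the game type, using that column directly would be safer than decoding IDs.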

In [2]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
import pandas as pd
import numpy as np

# Load and assign the data frame to tstats_df 
tstats_df = pd.read_csv('data/00-raw/TeamStatistics.csv')

# Removes entries with NaN in "teamCity" column, effectively removing any
# non-regular NBA games, like All-Star games.
tstats_df = tstats_df.dropna(subset=["teamCity"])

# Want to begin looking at data from 2000's onwards, so we filter out any games before 2000.
# To help, we eliminate the time component of the "gameDateTimeEst" column, which is not needed for our analysis.
tstats_df["gameDateTimeEst"] = pd.to_datetime(tstats_df["gameDateTimeEst"]).dt.date

# Filter out games before 2000
tstats_df = tstats_df[tstats_df["gameDateTimeEst"].apply(lambda x: x.year) >= 2000]

# Keep only variables relevant to back-to-back analysis
columns_to_keep = [
    'gameId',
    'gameDateTimeEst',
    'teamId',
    'teamName',
    'teamCity',
    'opponentTeamId',
    'home',
    'win',
    'teamScore',
    'opponentScore',
    'fieldGoalsPercentage',
    'turnovers'
]

# Make tstats_df a subset of the original dataframe, keeping only the columns in columns_to_keep
tstats_df = tstats_df[columns_to_keep]

# Tail shows the last 10 rows of the dataframe, which should be the oldest games in the dataset.
tstats_df.tail(10)

|  | gameId | gameDateTimeEst | teamId | teamName | teamCity | opponentTeamId | home | win | teamScore | opponentScore | fieldGoalsPercentage | turnovers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 71314 | 29900427 | 2000-01-03 | 1610612753 | Magic | Orlando | 1610612765 | 1 | 0.0 | 106 | 118 | 0.506 | 27.0 |
| 71315 | 29900427 | 2000-01-03 | 1610612765 | Pistons | Detroit | 1610612753 | 0 | 1.0 | 118 | 106 | 0.455 | 14.0 |
| 71316 | 29900424 | 2000-01-03 | 1610612738 | Celtics | Boston | 1610612739 | 1 | 1.0 | 105 | 98 | 0.462 | 12.0 |
| 71317 | 29900424 | 2000-01-03 | 1610612739 | Cavaliers | Cleveland | 1610612738 | 0 | 0.0 | 98 | 105 | 0.438 | 9.0 |
| 71318 | 29900426 | 2000-01-03 | 1610612744 | Warriors | Golden State | 1610612764 | 0 | 0.0 | 87 | 99 | 0.455 | 20.0 |
| 71319 | 29900425 | 2000-01-03 | 1610612749 | Bucks | Milwaukee | 1610612755 | 0 | 0.0 | 120 | 124 | 0.448 | 20.0 |
| 71320 | 29900425 | 2000-01-03 | 1610612755 | 76ers | Philadelphia | 1610612749 | 1 | 1.0 | 124 | 120 | 0.523 | 14.0 |
| 71321 | 29900426 | 2000-01-03 | 1610612764 | Wizards | Washington | 1610612744 | 1 | 1.0 | 99 | 87 | 0.476 | 15.0 |
| 71322 | 29900423 | 2000-01-02 | 1610612748 | Heat | Miami | 1610612753 | 1 | 1.0 | 111 | 103 | 0.409 | 14.0 |
| 71323 | 29900423 | 2000-01-02 | 1610612753 | Magic | Orlando | 1610612748 | 0 | 0.0 | 103 | 111 | 0.361 | 21.0 |


In [3]:
# size of data frame
print(tstats_df.shape)

(71316, 12)


In [4]:
# Check whether there are missing entries in each column
tstats_df.isnull().sum()

gameId                  0
gameDateTimeEst         0
teamId                  0
teamName                0
teamCity                0
opponentTeamId          0
home                    0
win                     0
teamScore               0
opponentScore           0
fieldGoalsPercentage    4
turnovers               4
dtype: int64

In [5]:
# Remove rows with missing values (4 rows missing fieldGoalsPercentage or turnovers)
tstats_df = tstats_df.dropna()
tstats_df.isnull().sum()

gameId                  0
gameDateTimeEst         0
teamId                  0
teamName                0
teamCity                0
opponentTeamId          0
home                    0
win                     0
teamScore               0
opponentScore           0
fieldGoalsPercentage    0
turnovers               0
dtype: int64

In [7]:
tstats_df['fieldGoalsPercentage'].describe()

count    71312.000000
mean         0.455972
std          0.056593
min          0.239000
25%          0.418000
50%          0.455000
75%          0.494000
max          0.689000
Name: fieldGoalsPercentage, dtype: float64

In [8]:
tstats_df['turnovers'].describe()

count    71312.000000
mean        14.465504
std          4.064114
min          0.000000
25%         12.000000
50%         14.000000
75%         17.000000
max         38.000000
Name: turnovers, dtype: float64
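
The summaries above show a few extreme values (for example, a minimum of 0 turnovers), which may be worth inspecting before analysis. The sketch below flags rows outside chosen bounds so they can be reviewed rather than silently dropped. The function name `flag_out_of_range`, the bounds, and the toy rows are hypothetical choices for illustration.

```python
import pandas as pd

def flag_out_of_range(df, bounds):
    """Return rows where any listed column falls outside its (low, high) bounds."""
    mask = pd.Series(False, index=df.index)
    for col, (low, high) in bounds.items():
        # between() is inclusive on both ends; missing values are also flagged.
        mask |= ~df[col].between(low, high)
    return df[mask]

# Made-up rows; the bounds are illustrative and not derived from league rules.
toy_stats = pd.DataFrame({
    "fieldGoalsPercentage": [0.455, 0.95, 0.40],
    "turnovers": [14, 12, 0],
})
toy_bounds = {"fieldGoalsPercentage": (0.2, 0.75), "turnovers": (1, 40)}
flagged = flag_out_of_range(toy_stats, toy_bounds)
print(flagged)
```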

In [7]:
# Write the wrangled data frame to data/02-processed
tstats_df.to_csv("data/02-processed/TeamStatistics_Cleaned.csv", index = False)

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> The data used in this project come from publicly available records of professional NBA games. Professional players expect their performance statistics and related information to be recorded, published, and analyzed. For this reason, individual informed consent from each player was not obtained for this analysis.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The data used in this project should contain little collection bias because the dataset is a complete record of team-level box scores for every game rather than a sample.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Player names and basic information are publicly reported as part of professional sports and the dataset includes these and does not include sensitive information. Our analysis is limited to player/team statistics and does not attempt to expose players outside of what is already available.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> All data will be collected and uploaded to this GitHub repository, which can only be edited by those who have access, namely the graders and the members of our group.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> Professional player information is publicly available, and we do not include any sensitive information in our analysis of their statistics aside from their name and the team they play for.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> Any plans to remove data can be discussed and agreed on with the group's consent, but by default, we plan to leave the data as it is.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

 > Our analysis could overlook important context like player injuries, travel schedules, team strategies, or opponent team strength. We could check these by talking to experts and finding data on injuries and travel schedules.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

 > Dataset bias may be present because the data include games from 2000-2025, and older seasons may contain missing or inconsistent records. Comparisons across eras can also be inconsistent due to changes in the league, rules, skill levels, and recording methods. Additionally, player conditions such as injuries are not reflected in the dataset. We aim to mitigate these issues by checking that the data for each season are recorded consistently and by checking whether outliers in performance coincide with injuries.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
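
For item C.2, one way to check whether metrics are recorded consistently across eras is to summarize them per season. The sketch below assumes a season can be labeled by the calendar year in which it began, using October as the cutoff month; that rule, the function name `season_start_year`, and the toy rows are assumptions for illustration.

```python
import pandas as pd

def season_start_year(dates):
    """Label each game with the calendar year in which its season began."""
    dates = pd.to_datetime(dates)
    # ASSUMPTION: seasons start in October. This mislabels seasons with an
    # unusual calendar, so affected seasons would need to be handled manually.
    return dates.dt.year.where(dates.dt.month >= 10, dates.dt.year - 1)

# Made-up rows for illustration only.
toy_games = pd.DataFrame({
    "gameDateTimeEst": ["2000-01-03", "2000-11-01", "2001-03-10"],
    "fieldGoalsPercentage": [0.45, 0.44, 0.46],
})
toy_games["season"] = season_start_year(toy_games["gameDateTimeEst"])
per_season = toy_games.groupby("season")["fieldGoalsPercentage"].agg(["mean", "count"])
print(per_season)
```

Large jumps in the per-season mean or count could point to recording changes or missing games rather than real changes in play.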

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
       
 > Some teams may be disproportionately affected by back-to-back games if they have injuries or older players, which could reinforce stereotypes and assumptions. We could test whether some teams are more or less affected by back-to-back games and make note of disparities.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
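
For item D.2, a per-team comparison is one way to look for disparities. The sketch below computes each team's win rate in rested and back-to-back games and the gap between them. It assumes a derived boolean `back_to_back` column (not part of the raw data); the function name `win_rate_gap_by_team` and the toy rows are hypothetical.

```python
import pandas as pd

def win_rate_gap_by_team(df):
    """Win rate per team in rested vs. back-to-back games, and the gap."""
    # ASSUMPTION: `back_to_back` is a derived boolean column.
    is_b2b = df["back_to_back"].astype(bool)
    rates = pd.DataFrame({
        "rested": df[~is_b2b].groupby("teamId")["win"].mean(),
        "back_to_back": df[is_b2b].groupby("teamId")["win"].mean(),
    })
    # Negative gap: the team wins less often in back-to-back games.
    rates["gap"] = rates["back_to_back"] - rates["rested"]
    return rates.sort_values("gap")

# Made-up results for two hypothetical teams.
toy_results = pd.DataFrame({
    "teamId": [1, 1, 1, 1, 2, 2, 2, 2],
    "back_to_back": [True, True, False, False, True, True, False, False],
    "win": [0, 0, 1, 1, 1, 0, 1, 0],
})
team_gaps = win_rate_gap_by_team(toy_results)
print(team_gaps)
```

Teams with few back-to-back games would have noisy gaps, so sample sizes should be reported alongside any disparity.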

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

 > Our analysis could be used as justification for more back-to-back games and more demanding schedules if our hypothesis turns out to be wrong. We will emphasize that correlation does not imply causation and that our results may not reflect all variables that contribute to team performance.

## Team Expectations 

* Communication platform: text messages via phone number
* Response time: respond to messages same day unless messages were sent late at night
* Open communication: let the group know if you’ll be busy and slow to respond or if you need help that week with your task
* Majority vote decisions: if someone doesn't respond within half a day, we'll make the decision without them unless it's an urgent decision
* Distribution of tasks: tasks would mostly be self-assigned to allow members to select whatever they're most comfortable/specialized in while also ensuring that all members are contributing relatively equally in terms of effort
* Assignment deadlines: aim to complete assignments a day before the deadline to allow time to review each other's work
* Conflict prevention/resolution: define and clarify goals/tasks to be accomplished to ensure everyone is on the same page, monitor group progress to eliminate blockers ahead of time, and maintain open and effective communication throughout

## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/31  |  10 AM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/1  |  10 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/4  | 6 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/15  | Before 11:59 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/22  | Before 11:59 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/13  | Before 11:59 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/16  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |