# NBA Game Outcome Prediction Project

## Introduction

For my project, I wanted to analyze NBA game data to build predictive models that can determine game outcomes based on in-game statistics. The dataset I'm using provides detailed "snapshots" of games at regular time intervals, tracking the score, time remaining, and other metrics throughout each game. 

As a basketball fan, I've always been fascinated by the concept of win probability and how it fluctuates during games. Commentators often discuss how certain moments are "turning points" in games, but I wanted to explore this analytically. My goal is to determine how accurately we can predict the final outcome of an NBA game based on the state of the game at various time points, and to identify which factors are most important in making these predictions.

## Primary Goals

1. Build machine learning models that can accurately predict which team will win an NBA game based on in-game statistics like score differential, time remaining, period, and other features in my dataset.

2. Analyze how win probability evolves throughout games and determine at what point in games predictions become reliable.

3. Compare the performance of different machine learning approaches (logistic regression, random forests, gradient boosting, etc.) to see which works best for this prediction task.

4. Identify key moments or thresholds during games that have the strongest correlation with final outcomes - essentially finding the statistical "point of no return" in NBA games.

I believe this analysis will provide interesting insights into basketball game dynamics and potentially reveal patterns that aren't obvious to casual observers of the game.
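
As a reference point for goals 1 and 2, here is a minimal sketch of a naive baseline: predict that whichever team is leading at a snapshot wins the game, then measure how often that holds in each period. It uses the `PERIOD`, `HOME_SCORE`, `AWAY_SCORE`, and `HOME_TEAM_WON` columns from my dataset. The function name `leader_baseline_accuracy`, the toy scores, and the assumption that `HOME_TEAM_WON` is coded as 1/0 are illustrative only.

```python
import pandas as pd


def leader_baseline_accuracy(snapshots: pd.DataFrame) -> pd.Series:
    """Per-period accuracy of the rule 'whoever leads now wins the game'."""
    # Recompute the margin from the raw scores so no sign convention is assumed.
    margin = snapshots["HOME_SCORE"] - snapshots["AWAY_SCORE"]
    # Tied snapshots give no prediction under this rule, so skip them.
    decided = snapshots[margin != 0].copy()
    decided["predicted_home_win"] = (margin[margin != 0] > 0).astype(int)
    # Assumption: HOME_TEAM_WON is 1 when the home team won, 0 otherwise.
    decided["correct"] = decided["predicted_home_win"] == decided["HOME_TEAM_WON"]
    return decided.groupby("PERIOD")["correct"].mean()


# Toy data: two made-up games with one snapshot per period (not real results).
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "PERIOD": [1, 2, 3, 4, 1, 2, 3, 4],
    "HOME_SCORE": [20, 45, 70, 100, 30, 50, 75, 98],
    "AWAY_SCORE": [25, 50, 68, 95, 22, 48, 80, 101],
    "HOME_TEAM_WON": [1, 1, 1, 1, 0, 0, 0, 0],
})
accuracy_by_period = leader_baseline_accuracy(toy)
print(accuracy_by_period)
```

Any trained model should beat this rule to be worth the added complexity, and the period at which the baseline itself becomes accurate gives a rough first look at the "point of no return" question.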

![JAMES WITH THE SLAM!](../../images/lebron-dunk-lebron-james.gif)

## Data Source and Citation
For my project, I collected NBA game data using the nba_api Python package, which provides direct access to NBA.com's official statistics. I collected the raw data myself and configured the resulting dataset to contain a season's worth of NBA game snapshots with numerous features (see codebook.md for details).


### Formal Citation:
Swar, S. (2018). nba_api: An API Client package to access NBA.com endpoints. Available from https://github.com/swar/nba_api

Link to source: https://github.com/swar/nba_api


### Importing & Cleaning Data

In [None]:
import pandas as pd

# Load the 2022-23 game snapshots
games_2022_2023 = pd.read_csv("../../data/revised_nba_2022_23_all_games_snapshots.csv")

# Report missing values per column before any filtering
for column in games_2022_2023.columns:
    print(f"Column {column} has {games_2022_2023[column].isna().sum()} missing values")

# Find every game that reached a 5th period (overtime) and drop all of its snapshots
overtime_games = games_2022_2023[games_2022_2023['PERIOD'] == 5]['GAME_ID'].unique()
df_filtered = games_2022_2023[~games_2022_2023['GAME_ID'].isin(overtime_games)]

# Verify the result
print(f"Original dataset size: {len(games_2022_2023)}")
print(f"Number of overtime games removed: {len(overtime_games)}")
print(f"Filtered dataset size: {len(df_filtered)}")

# Re-check missing values on the filtered data
for column in df_filtered.columns:
    print(f"Column {column} has {df_filtered[column].isna().sum()} missing values")


Column seconds_elapsed has 0 missing values
Column GAME_ID has 0 missing values
Column PERIOD has 0 missing values
Column AWAY_SCORE has 1792 missing values
Column HOME_SCORE has 1792 missing values
Column SCORE_DIFF has 1792 missing values
Column IS_HOME_LEADING has 1792 missing values
Column HOME_TEAM_WIN_PCT has 1792 missing values
Column AWAY_TEAM_WIN_PCT has 1792 missing values
Column HOME_TEAM_WON has 1792 missing values
Column HOME_TEAM has 1792 missing values
Column AWAY_TEAM has 1792 missing values
Column PCTIMESTRING has 0 missing values
Original dataset size: 174200
Number of overtime games removed: 79
Filtered dataset size: 159500
Column seconds_elapsed has 0 missing values
Column GAME_ID has 0 missing values
Column PERIOD has 0 missing values
Column AWAY_SCORE has 0 missing values
Column HOME_SCORE has 0 missing values
Column SCORE_DIFF has 0 missing values
Column IS_HOME_LEADING has 0 missing values
Column HOME_TEAM_WIN_PCT has 0 missing values
Column AWAY_TEAM_WIN_PCT ha

## Handling Missing Values

Upon initial exploration, I identified 1,792 missing values in each of nine columns, including all of the scoring-related features. Further investigation revealed a systematic pattern: these missing values occurred exclusively in overtime periods (specifically, when games reached a 5th period).
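
One way to confirm that kind of pattern is to count missing values per period. The following is a minimal sketch on a small made-up frame, not the actual investigation code; the helper name `missing_by_period` and the toy values are assumptions for illustration, while `PERIOD` and `HOME_SCORE` are columns from my dataset.

```python
import numpy as np
import pandas as pd


def missing_by_period(df: pd.DataFrame, column: str) -> pd.Series:
    """Count missing values of `column` within each PERIOD."""
    return df[column].isna().groupby(df["PERIOD"]).sum()


# Toy frame: gaps appear only in period 5 (values are made up).
toy = pd.DataFrame({
    "PERIOD": [1, 2, 3, 4, 5, 5, 5],
    "HOME_SCORE": [25, 50, 78, 101, np.nan, np.nan, 110],
})
counts = missing_by_period(toy, "HOME_SCORE")
print(counts)

# Periods that contain at least one missing value
periods_with_gaps = counts[counts > 0].index.tolist()
print(periods_with_gaps)
```

On the real data, the same call would be `missing_by_period(games_2022_2023, "HOME_SCORE")`, and a result with non-zero counts only for overtime periods would match what I observed.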
The missing data spans from the 12:00 mark to the 5:30 mark (minutes and seconds remaining), or equivalently from second 0 to second 390 of the overtime period. This pattern exists because NBA overtime periods are only 5 minutes long, rather than the standard 12 minutes of regular quarters, so a snapshot grid laid out for a 12-minute period has nothing to record before the overtime clock starts at 5:00.

Since overtime games constitute a relatively small proportion of my dataset (79 games, accounting for 14,700 of the 174,200 snapshots), there aren't enough unique overtime period snapshots to enable reliable model training for these scenarios. Therefore, to maintain data consistency and model reliability, I decided to exclude all overtime games from my analysis. This approach allows me to focus on regulation-time game dynamics while eliminating the missing value issue entirely, rather than implementing imputation methods that might introduce bias into the overtime period data.
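
The equivalence between the game clock and elapsed seconds mentioned above (12:00 to 5:30 remaining corresponding to seconds 0 to 390) can be checked with a short conversion. This is a sketch that assumes the clock is a `MM:SS` time-remaining string, as I understand `PCTIMESTRING` to be, measured against a 12-minute period; the function name `clock_to_seconds_elapsed` is hypothetical.

```python
def clock_to_seconds_elapsed(clock: str, period_minutes: int = 12) -> int:
    """Convert a 'MM:SS' time-remaining string to seconds elapsed in the period."""
    minutes, seconds = clock.split(":")
    remaining = int(minutes) * 60 + int(seconds)
    # Elapsed time is the full period length minus what is left on the clock.
    return period_minutes * 60 - remaining


print(clock_to_seconds_elapsed("12:00"))  # 0
print(clock_to_seconds_elapsed("5:30"))   # 390
```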