# Predicting NBA Game Outcomes with Machine Learning
## Author: Kishore Annambhotla

### Introduction and Methods


### Setting Up

In order to predict NBA game outcomes, we will need to analyze many past games and see what statistics help or hurt a team's chances of winning. We will also need a robust predictive model capable of making a decision based on all of this game data. For this project, I used `nba_api` to retrieve up-to-date NBA game data, `pandas` and `numpy` for data manipulation, and `scikit-learn` to build an ML model. All of the project's necessary imports will be included in the code block below.

In [1]:
from nba_api.stats.static import teams
from nba_api.stats.endpoints import leaguegamefinder
import pandas as pd

pd.options.mode.chained_assignment = (
    None  # prevents warnings on reformatted columns later
)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

### Importing Data

We will base our model off of games played by all NBA teams. As shown, this game data will include categorical information about the game, teams, and season, as well as quantitative data regarding each team's offense and defense, as well as the winner of the game.

First, we will use the `nba_api` to get games played by each NBA team. Note that we set `season_type_nullable` to only include regular season games, since those are the games we will be predicting with this model. This could easily be changed to train a model better for predicting playoff or preseason games.

In [2]:
nba_teams = teams.get_teams()
team_abbr_to_id = {team["abbreviation"]: team["id"] for team in nba_teams}
all_games = pd.DataFrame()

for team in nba_teams:
    team_id = team["id"]
    gamefinder = leaguegamefinder.LeagueGameFinder(
        team_id_nullable=team_id, season_type_nullable="Regular Season"
    )
    games = gamefinder.get_data_frames()[0]
    all_games = pd.concat([all_games, games], ignore_index=True)

In [3]:
print(all_games.columns)
print(all_games.sample(n=5))

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')
      SEASON_ID     TEAM_ID TEAM_ABBREVIATION           TEAM_NAME     GAME_ID  \
53788     22005  1610612753               ORL       Orlando Magic  0020501145   
33356     22016  1610612747               LAL  Los Angeles Lakers  0021600059   
2090      21999  1610612737               ATL       Atlanta Hawks  0029900459   
58184     21987  1610612754               IND      Indiana Pacers  0028700357   
21408     21994  1610612743               DEN      Denver Nuggets  0029400847   

        GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  DREB   REB  \
53788  2006-04-09    ORL @ MIA  W  240   93  ...   0.700   4.0  21.0  25.0   
33356  2016-11-02    LAL @ ATL  W  240  123  ..

While this data is good, it may be too much for our model. Over the last forty seasons, NBA basketball has greatly changed. Many rules have been added and removed, and aspects of "perfect" offense and defense have completely shifted. As a result, older data is increasingly unreliable for predicting current-day NBA games. For this model, I chose to only use data from the 2018-2019 season until now. This time period is most reflective of modern basketball, and ignores plenty of outdated game data.

In [4]:
# Get games where season is in modern_seasons
games_modern = all_games[
    (all_games.SEASON_ID.str[-4:] == "2018")
    | (all_games.SEASON_ID.str[-4:] == "2019")
    | (all_games.SEASON_ID.str[-4:] == "2020")
    | (all_games.SEASON_ID.str[-4:] == "2021")
    | (all_games.SEASON_ID.str[-4:] == "2022")
    | (all_games.SEASON_ID.str[-4:] == "2023")
    | (all_games.SEASON_ID.str[-4:] == "2024")
]
print(games_modern.sample(n=5))

      SEASON_ID     TEAM_ID TEAM_ABBREVIATION             TEAM_NAME  \
77        22023  1610612737               ATL         Atlanta Hawks   
69004     22022  1610612758               SAC      Sacramento Kings   
33090     22019  1610612747               LAL    Los Angeles Lakers   
10404     22022  1610612740               NOP  New Orleans Pelicans   
3713      22020  1610612738               BOS        Boston Celtics   

          GAME_ID   GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  \
77     0022300690  2024-02-02  ATL vs. PHX  W  240  129  ...   0.778  12.0   
69004  0022200491  2022-12-23  SAC vs. WAS  L  239  111  ...   0.771  13.0   
33090  1521900056  2019-07-10    LAL @ NYK  L  199   96  ...   0.774  11.0   
10404  0022200632  2023-01-13    NOP @ DET  W  241  116  ...   0.700   8.0   
3713   0022000990  2021-05-05    BOS @ ORL  W  238  132  ...   0.750   9.0   

       DREB   REB  AST  STL  BLK  TOV  PF  PLUS_MINUS  
77     31.0  43.0   29  7.0    4    9  13       

### Cleaning Data

We have plenty of good data, but it's still unrefined. The next step is to clean up our columns and reformat the data. This will make training, testing, and evaluating our model much easier in the future.

First, we will reformat some of our columns to make working with the data easier.

In [5]:
# Convert GAME_DATE to pandas datetime
# Also order by date earliest to latest to make working with stats easier later
games_modern["GAME_DATE"] = pd.to_datetime(games_modern["GAME_DATE"])
games_modern.sort_values(by="GAME_DATE", inplace=True)


# Add binary "WIN" column, remove categorical WL column
games_modern["WIN"] = games_modern["WL"].apply(lambda x: 1 if x == "W" else 0)


# Convert int stat columns to float type for accurate data analysis
games_modern["MIN"] = games_modern["MIN"].astype(float)  # minutes
games_modern["PTS"] = games_modern["PTS"].astype(float)  # points
games_modern["FGM"] = games_modern["FGM"].astype(float)  # field goals made
games_modern["FGA"] = games_modern["FGA"].astype(float)  # field goals attempted
games_modern["FG3M"] = games_modern["FG3M"].astype(float)  # 3s made
games_modern["FG3A"] = games_modern["FG3A"].astype(float)  # 3s attempted
games_modern["FTM"] = games_modern["FTM"].astype(float)  # free throws made
games_modern["FTA"] = games_modern["FTA"].astype(float)  # free throws attempted
games_modern["AST"] = games_modern["AST"].astype(float)  # assists
games_modern["BLK"] = games_modern["BLK"].astype(float)  # blocks
games_modern["TOV"] = games_modern["TOV"].astype(float)  # turnovers
games_modern["PF"] = games_modern["PF"].astype(float)  # personal fouls


# Add opponent id as column


def get_opponent_id(matchup, team_abbr_to_id, team_id):
    if "@" in matchup:

        opponent_abbr = matchup.split(" @ ")[-1]

    else:

        opponent_abbr = matchup.split(" vs. ")[-1]

    return team_abbr_to_id.get(opponent_abbr, team_id)


games_modern["OPP_TEAM_ID"] = games_modern.apply(
    lambda row: get_opponent_id(row["MATCHUP"], team_abbr_to_id, row["TEAM_ID"]), axis=1
)

Next, we want to define a few new statistics that will be useful in evaluating a team. These statistics will prove essential in predicting the outcomes of new games. For many of these statistics, we will only want to create them based on the most recent data to avoid inaccuracies. Because teams can undergo vast changes in the span of two or three years, we will only define these statistics using data over the last two seasons.

In [6]:
# Define 'HGA' (Home Game Advantage)
games_modern["HGA"] = games_modern["MATCHUP"].apply(lambda x: 0 if "@" in x else 1)

# Define 'LAST_GAME_OUTCOME'
games_modern["LAST_GAME_OUTCOME"] = (
    games_modern.groupby("TEAM_ID")["WIN"].shift(1).fillna(0)
)

# Note: per game statistics indicate team per game statistics using rolling average of last 10
# Define 'PPG' (Points Per Game)
# games_modern["PPG"] = (
#    games_modern.groupby("TEAM_ID")["PTS"]
#    .rolling(window=10, closed="left", min_periods=1)
#    .mean()
#    .reset_index(0, drop=True)
# )

# Define 'EFG%' (Effective Field Goal Percentage)
games_modern["EFG%"] = (
    games_modern["FGM"] + (0.5 * games_modern["FG3M"])
) / games_modern["FGA"]

# Define 'TOV%' (Turnover Percentage)
games_modern["TOV%"] = games_modern["TOV"] / (
    games_modern["FGA"] + 0.44 * games_modern["FTA"] + games_modern["TOV"]
)


# Define 'ORB%' (Offensive Rebound Percentage)
def get_opponent_dreb(game_id):
    opponent_dreb = games_modern.loc[games_modern["GAME_ID"] == game_id, "DREB"].values[
        0
    ]
    return opponent_dreb


games_modern["ORB%"] = games_modern["OREB"] / (
    games_modern["OREB"] + get_opponent_dreb(games_modern["GAME_ID"])
)

# Define 'FTR' (Free Throw Attempt Rate)
games_modern["FTR"] = games_modern["FTA"] / games_modern["FGA"]

# Define 'TS%' (True Shooting Percentage)
games_modern["TS%"] = games_modern["PTS"] / (
    2 * (games_modern["FGA"] + (0.44 * games_modern["FTA"]))
)

Finally, we want to use a `LabelEncoder` to transform a few of our categorical data columsn into numerical values that can be understood by our model.

In [7]:
# LabelEncode data
le = LabelEncoder()
games_modern["TEAM_ID"] = le.fit_transform(games_modern["TEAM_ID"])
games_modern["OPP_TEAM_ID"] = le.fit_transform(games_modern["OPP_TEAM_ID"])
games_modern.sample(n=5)

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,PLUS_MINUS,WIN,OPP_TEAM_ID,HGA,LAST_GAME_OUTCOME,EFG%,TOV%,ORB%,FTR,TS%
10244,22024,3,NOP,New Orleans Pelicans,22400384,2024-12-21,NOP vs. NYK,L,240.0,93.0,...,-11.0,0,15,1,0.0,0.481928,0.172164,0.290909,0.228916,0.508975
39542,22019,12,MIL,Milwaukee Bucks,21900071,2019-11-01,MIL @ ORL,W,238.0,123.0,...,32.0,1,16,0,0.0,0.596774,0.114115,0.22,0.193548,0.609394
10543,22020,3,NOP,New Orleans Pelicans,22001020,2021-05-09,NOP @ CHA,W,240.0,112.0,...,2.0,1,29,0,0.0,0.53125,0.131492,0.290909,0.229167,0.529902
79422,22020,24,TOR,Toronto Raptors,22000831,2021-04-14,TOR vs. SAS,W,239.0,117.0,...,5.0,1,22,1,0.0,0.545455,0.108108,0.25,0.284091,0.590909
19493,22018,6,DEN,Denver Nuggets,1521800066,2018-07-13,DEN vs. MIN,L,198.0,71.0,...,-9.8,0,13,1,0.0,0.426667,0.199442,0.264151,0.16,0.442202


### Defining Features

In our model, `X` includes features of the dataset that will be used to predict `y`. Here, `y` is obviously `WIN`, since we are just predicting and classifying each game as a win (1) or loss (0). `X` could include a number of relevant statistics, and all the ones used will be included in the code segment.

In [24]:
features = [
    "TEAM_ID",
    "OPP_TEAM_ID",
    "PTS",
    "OREB",
    "DREB",
    "REB",
    "AST",
    "STL",
    "BLK",
    "TOV",
    "EFG%",
    "TOV%",
    "FTR",
    "TS%",
    "HGA",
    "LAST_GAME_OUTCOME"
]
X = games_modern[features]

y = games_modern[["WIN"]]


# Use sklearn train_test_split with 80% training, 20% testing

X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size=0.2, random_state=42

)

In [25]:
games_modern = games_modern[
    [
        "GAME_DATE",
        "TEAM_ID",
        "OPP_TEAM_ID",
        "PTS",
        "OREB",
        "DREB",
        "REB",
        "AST",
        "STL",
        "BLK",
        "TOV",
        "EFG%",
        "TOV%",
        "ORB%",
        "FTR",
        "TS%",
        "HGA",
        "LAST_GAME_OUTCOME",
        "WIN",
    ]
]

### Creating Initial Model

`scikit-learn` offers many options for ML classification models. A simple one good for this problem would be a `RandomForestClassifier`, which uses many decision trees to analyze data and settle on a choice. Random forests are known for accuracy and efficiency.

In [26]:
rf_model = RandomForestClassifier(n_estimators=150, max_depth=20, random_state=42)
rf_model.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


In [27]:
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8170134638922889
              precision    recall  f1-score   support

           0       0.80      0.83      0.82      1591
           1       0.83      0.81      0.82      1677

    accuracy                           0.82      3268
   macro avg       0.82      0.82      0.82      3268
weighted avg       0.82      0.82      0.82      3268



### Predicting Games

In [34]:
# Query nba.live.endpoints.scoreboard and  list games in localTimeZone
from datetime import datetime, timezone
from dateutil import parser
from nba_api.live.nba.endpoints import scoreboard

f = "{gameId}: {awayTeam} vs. {homeTeam} @ {gameTimeLTZ}"

board = scoreboard.ScoreBoard()
print("ScoreBoardDate: " + board.score_board_date)
games = board.games.get_dict()
for game in games:
    gameTimeLTZ = (
        parser.parse(game["gameTimeUTC"])
        .replace(tzinfo=timezone.utc)
        .astimezone(tz=None)
    )
    print(
        f.format(
            gameId=game["gameId"],
            awayTeam=game["awayTeam"]["teamName"],
            homeTeam=game["homeTeam"]["teamName"],
            gameTimeLTZ=gameTimeLTZ,
        )
    )

ScoreBoardDate: 2025-01-12
0022400539: Bucks vs. Knicks @ 2025-01-12 15:00:00-05:00
0022400540: Nuggets vs. Mavericks @ 2025-01-12 15:00:00-05:00
0022400541: Kings vs. Bulls @ 2025-01-12 15:30:00-05:00
0022400542: Pelicans vs. Celtics @ 2025-01-12 18:00:00-05:00
0022400543: Pacers vs. Cavaliers @ 2025-01-12 18:00:00-05:00
0022400544: 76ers vs. Magic @ 2025-01-12 18:00:00-05:00
0022400545: Thunder vs. Wizards @ 2025-01-12 18:00:00-05:00
0022400546: Nets vs. Jazz @ 2025-01-12 20:00:00-05:00
0022400547: Hornets vs. Suns @ 2025-01-12 21:00:00-05:00


In [13]:
from nba_api.stats.endpoints import leaguedashteamstats

advancedGamefinder = leaguedashteamstats.LeagueDashTeamStats(
    season='2024-25',
    per_mode_detailed='PerGame'
    )

team_stats = advancedGamefinder.get_data_frames()[0]

# Drop all columns related to ranks (not helpful for database)
keyword = 'RANK'
columns_to_drop = [col for col in team_stats.columns if keyword in col]
team_stats.drop(columns=columns_to_drop, inplace=True)
print(team_stats.columns)
print(games_modern.columns)
print(team_stats.head(5))

Index(['TEAM_ID', 'TEAM_NAME', 'GP', 'W', 'L', 'W_PCT', 'MIN', 'FGM', 'FGA',
       'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB',
       'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS',
       'PLUS_MINUS'],
      dtype='object')
Index(['TEAM_ID', 'OPP_TEAM_ID', 'PTS', 'OREB', 'DREB', 'REB', 'AST', 'STL',
       'BLK', 'TOV', 'EFG%', 'TOV%', 'ORB%', 'FTR', 'TS%', 'HGA',
       'LAST_GAME_OUTCOME', 'WIN'],
      dtype='object')
      TEAM_ID          TEAM_NAME  GP   W   L  W_PCT   MIN   FGM   FGA  FG_PCT  \
0  1610612737      Atlanta Hawks  38  19  19  0.500  48.3  42.8  92.1   0.465   
1  1610612738     Boston Celtics  39  28  11  0.718  48.4  41.7  90.8   0.459   
2  1610612751      Brooklyn Nets  39  13  26  0.333  48.4  38.0  85.1   0.447   
3  1610612766  Charlotte Hornets  36   8  28  0.222  48.4  38.3  90.0   0.425   
4  1610612741      Chicago Bulls  39  18  21  0.462  48.1  43.4  92.3   0.471   

   ...   REB   AST   TOV   STL  BLK  

In [32]:
# Cleaning data in team_stats and matching to games_modern

# Define three of four factors (EFG%, TOV%, FTR)

# Define 'EFG%' (Effective Field Goal Percentage)
team_stats["EFG%"] = (
    team_stats["FGM"] + (0.5 * team_stats["FG3M"])
) / team_stats["FGA"]

# Define 'TOV%' (Turnover Percentage)
team_stats["TOV%"] = team_stats["TOV"] / (
    team_stats["FGA"] + 0.44 * team_stats["FTA"] + team_stats["TOV"]
)

# Define 'FTR' (Free Throw Attempt Rate)
team_stats["FTR"] = team_stats["FTA"] / team_stats["FGA"]

# Define 'TS%' (True Shooting Percentage)
team_stats["TS%"] = team_stats["PTS"] / (
    2 * (team_stats["FGA"] + (0.44 * team_stats["FTA"]))
)

print(team_stats.head(5))


      TEAM_ID          TEAM_NAME  GP   W   L  W_PCT   MIN   FGM   FGA  FG_PCT  \
0  1610612737      Atlanta Hawks  38  19  19  0.500  48.3  42.8  92.1   0.465   
1  1610612738     Boston Celtics  39  28  11  0.718  48.4  41.7  90.8   0.459   
2  1610612751      Brooklyn Nets  39  13  26  0.333  48.4  38.0  85.1   0.447   
3  1610612766  Charlotte Hornets  36   8  28  0.222  48.4  38.3  90.0   0.425   
4  1610612741      Chicago Bulls  39  18  21  0.462  48.1  43.4  92.3   0.471   

   ...  BLK  BLKA    PF   PFD    PTS  PLUS_MINUS      EFG%      TOV%  \
0  ...  5.3   5.6  18.5  20.0  117.2        -2.6  0.535831  0.134989   
1  ...  5.7   4.2  16.1  18.8  118.0         9.3  0.557819  0.107063   
2  ...  3.7   6.0  21.2  19.7  107.3        -6.4  0.529965  0.140945   
3  ...  4.9   5.6  20.5  19.1  106.3        -6.2  0.505000  0.139426   
4  ...  4.7   4.9  18.1  16.4  118.2        -2.7  0.559588  0.129085   

        FTR       TS%  
0  0.257329  0.571551  
1  0.232379  0.589505  
2  0.250

In [37]:
# Fetch data to predict upcoming games
for game in games:

    # Get home team (team) and away team (opponent) ids
    team_id = game["homeTeam"]["teamId"]
    opp_team_id = game["awayTeam"]["teamId"]

    # Get relevant home team statistics for features
    home_condition = (team_stats["TEAM_ID"] == team_id)
    pts = team_stats.loc[home_condition, "PTS"]
    oreb = team_stats.loc[home_condition, "OREB"]
    dreb = team_stats.loc[home_condition, "DREB"]
    reb = team_stats.loc[home_condition, "REB"]
    ast = team_stats.loc[home_condition, "AST"]
    stl = team_stats.loc[home_condition, "STL"]
    blk = team_stats.loc[home_condition, "BLK"]
    tov = team_stats.loc[home_condition, "TOV"]
    efg_pct = team_stats.loc[home_condition, "EFG%"]
    tov_pct = team_stats.loc[home_condition, "TOV%"]
    ftr = team_stats.loc[home_condition, "FTR"]
    ts_pct = team_stats.loc[home_condition, "TS%"]

    # Get home game advantage (always 1 due to how we format data)
    hga = 1

    # Get last game outcome from games_modern
    filtered_games_modern = games_modern.loc[games_modern["TEAM_ID"] == team_id]
    filtered_games_modern.sort_values(by="GAME_DATE", inplace=True)
    last_game_outcome = filtered_games_modern.iloc[0]["WIN"]



KeyError: 'GAME_DATE'