# Predicting NBA Game Outcomes with Machine Learning
## Author: Kishore Annambhotla

### Introduction and Methods


### Setting Up

In order to predict NBA game outcomes, we will need to analyze many past games and see what statistics help or hurt a team's chances of winning. We will also need a robust predictive model capable of making a decision based on all of this game data. For this project, I used `nba_api` to retrieve up-to-date NBA game data, `pandas` and `numpy` for data manipulation, and `scikit-learn` to build an ML model. All of the project's necessary imports will be included in the code block below.

In [1]:
from nba_api.stats.static import teams
from nba_api.stats.endpoints import leaguegamefinder
import pandas as pd

pd.options.mode.chained_assignment = (
    None  # prevents warnings on reformatted columns later
)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

### Importing Data

We will base our model off of games played by all NBA teams. As shown, this game data will include categorical information about the game, teams, and season, as well as quantitative data regarding each team's offense and defense, as well as the winner of the game.

First, we will use the `nba_api` to get games played by each NBA team. Note that we set `season_type_nullable` to only include regular season games, since those are the games we will be predicting with this model. This could easily be changed to train a model better for predicting playoff or preseason games.

In [2]:
nba_teams = teams.get_teams()
team_abbr_to_id = {team["abbreviation"]: team["id"] for team in nba_teams}
all_games = pd.DataFrame()

for team in nba_teams:
    team_id = team["id"]
    gamefinder = leaguegamefinder.LeagueGameFinder(
        team_id_nullable=team_id, season_type_nullable="Regular Season"
    )
    games = gamefinder.get_data_frames()[0]
    all_games = pd.concat([all_games, games], ignore_index=True)

In [3]:
print(all_games.columns)
print(all_games.sample(n=5))

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')
      SEASON_ID     TEAM_ID TEAM_ABBREVIATION          TEAM_NAME     GAME_ID  \
3556      22022  1610612738               BOS     Boston Celtics  0022200921   
12235     22023  1610612741               CHI      Chicago Bulls  0022300320   
64208     21997  1610612756               PHX       Phoenix Suns  0029701156   
95110     22014  1610612766               CHA  Charlotte Hornets  0021400827   
4133      22015  1610612738               BOS     Boston Celtics  0021501153   

        GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  DREB   REB  \
3556   2023-02-27    BOS @ NYK  L  242   94  ...   0.786   7.0  28.0  35.0   
12235  2023-12-14    CHI @ MIA  W  240  124  ...   0.

While this data is good, it may be too much for our model. Over the last forty seasons, NBA basketball has greatly changed. Many rules have been added and removed, and aspects of "perfect" offense and defense have completely shifted. As a result, older data is increasingly unreliable for predicting current-day NBA games. For this model, I chose to only use data from the 2018-2019 season until now. This time period is most reflective of modern basketball, and ignores plenty of outdated game data.

In [4]:
# Get games where season is in modern_seasons
games_modern = all_games[
    (all_games.SEASON_ID.str[-4:] == "2018")
    | (all_games.SEASON_ID.str[-4:] == "2019")
    | (all_games.SEASON_ID.str[-4:] == "2020")
    | (all_games.SEASON_ID.str[-4:] == "2021")
    | (all_games.SEASON_ID.str[-4:] == "2022")
    | (all_games.SEASON_ID.str[-4:] == "2023")
    | (all_games.SEASON_ID.str[-4:] == "2024")
]
print(games_modern.sample(n=5))

      SEASON_ID     TEAM_ID TEAM_ABBREVIATION              TEAM_NAME  \
15910     22020  1610612742               DAL       Dallas Mavericks   
85403     22019  1610612763               MEM      Memphis Grizzlies   
81630     22023  1610612762               UTA              Utah Jazz   
3869      22018  1610612738               BOS         Boston Celtics   
22918     22018  1610612744               GSW  Golden State Warriors   

          GAME_ID   GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  \
15910  0022000071  2021-01-01  DAL vs. MIA  W  240   93  ...   0.739   4.0   
85403  0021900920  2020-03-04    MEM @ BKN  W  240  118  ...   0.857  18.0   
81630  0022301086  2024-03-31    UTA @ SAC  L  240  106  ...   0.786  11.0   
3869   0021801087  2019-03-23    BOS @ CHA  L  240  117  ...   0.667   7.0   
22918  0021800147  2018-11-05  GSW vs. MEM  W  240  117  ...   0.957   8.0   

       DREB   REB  AST  STL  BLK  TOV  PF  PLUS_MINUS  
15910  45.0  49.0   18  9.0    3   16  25 

### Cleaning Data

We have plenty of good data, but it's still unrefined. The next step is to clean up our columns and reformat the data. This will make training, testing, and evaluating our model much easier in the future.

First, we will reformat some of our columns to make working with the data easier.

In [5]:
# Convert GAME_DATE to pandas datetime
# Also order by date earliest to latest to make working with stats easier later
games_modern["GAME_DATE"] = pd.to_datetime(games_modern["GAME_DATE"])
games_modern.sort_values(by="GAME_DATE", inplace=True)


# Add binary "WIN" column, remove categorical WL column
games_modern["WIN"] = games_modern["WL"].apply(lambda x: 1 if x == "W" else 0)


# Convert int stat columns to float type for accurate data analysis
games_modern["MIN"] = games_modern["MIN"].astype(float)  # minutes
games_modern["PTS"] = games_modern["PTS"].astype(float)  # points
games_modern["FGM"] = games_modern["FGM"].astype(float)  # field goals made
games_modern["FGA"] = games_modern["FGA"].astype(float)  # field goals attempted
games_modern["FG3M"] = games_modern["FG3M"].astype(float)  # 3s made
games_modern["FG3A"] = games_modern["FG3A"].astype(float)  # 3s attempted
games_modern["FTM"] = games_modern["FTM"].astype(float)  # free throws made
games_modern["FTA"] = games_modern["FTA"].astype(float)  # free throws attempted
games_modern["AST"] = games_modern["AST"].astype(float)  # assists
games_modern["BLK"] = games_modern["BLK"].astype(float)  # blocks
games_modern["TOV"] = games_modern["TOV"].astype(float)  # turnovers
games_modern["PF"] = games_modern["PF"].astype(float)  # personal fouls


# Add opponent id as column


def get_opponent_id(matchup, team_abbr_to_id, team_id):
    if "@" in matchup:

        opponent_abbr = matchup.split(" @ ")[-1]

    else:

        opponent_abbr = matchup.split(" vs. ")[-1]

    return team_abbr_to_id.get(opponent_abbr, team_id)


games_modern["OPP_TEAM_ID"] = games_modern.apply(
    lambda row: get_opponent_id(row["MATCHUP"], team_abbr_to_id, row["TEAM_ID"]), axis=1
)

Next, we want to define a few new statistics that will be useful in evaluating a team. These statistics will prove essential in predicting the outcomes of new games. For many of these statistics, we will only want to create them based on the most recent data to avoid inaccuracies. Because teams can undergo vast changes in the span of two or three years, we will only define these statistics using data over the last two seasons.

In [6]:
# Define 'HGA' (Home Game Advantage)
games_modern["HGA"] = games_modern["MATCHUP"].apply(lambda x: 0 if "@" in x else 1)

# Define 'LAST_GAME_OUTCOME'
games_modern["LAST_GAME_OUTCOME"] = (
    games_modern.groupby("TEAM_ID")["WIN"].shift(1).fillna(0)
)

# Note: per game statistics indicate team per game statistics using rolling average of last 10
# Define 'PPG' (Points Per Game)
# games_modern["PPG"] = (
#    games_modern.groupby("TEAM_ID")["PTS"]
#    .rolling(window=10, closed="left", min_periods=1)
#    .mean()
#    .reset_index(0, drop=True)
# )

# Define 'EFG%' (Effective Field Goal Percentage)
games_modern["EFG%"] = (
    games_modern["FGM"] + (0.5 * games_modern["FG3M"])
) / games_modern["FGA"]

# Define 'TOV%' (Turnover Percentage)
games_modern["TOV%"] = games_modern["TOV"] / (
    games_modern["FGA"] + 0.44 * games_modern["FTA"] + games_modern["TOV"]
)


# Define 'ORB%' (Offensive Rebound Percentage)
def get_opponent_dreb(game_id):
    opponent_dreb = games_modern.loc[games_modern["GAME_ID"] == game_id, "DREB"].values[
        0
    ]
    return opponent_dreb


games_modern["ORB%"] = games_modern["OREB"] / (
    games_modern["OREB"] + get_opponent_dreb(games_modern["GAME_ID"])
)

# Define 'FTR' (Free Throw Attempt Rate)
games_modern["FTR"] = games_modern["FTA"] / games_modern["FGA"]

# Define 'TS%' (True Shooting Percentage)
games_modern["TS%"] = games_modern["PTS"] / (
    2 * (games_modern["FGA"] + (0.44 * games_modern["FTA"]))
)

Finally, we want to use a `LabelEncoder` to transform a few of our categorical data columsn into numerical values that can be understood by our model.

In [7]:
# LabelEncode data
le = LabelEncoder()
games_modern["TEAM_ID"] = le.fit_transform(games_modern["TEAM_ID"])
games_modern["OPP_TEAM_ID"] = le.fit_transform(games_modern["OPP_TEAM_ID"])
games_modern.sample(5)

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,PLUS_MINUS,WIN,OPP_TEAM_ID,HGA,LAST_GAME_OUTCOME,EFG%,TOV%,ORB%,FTR,TS%
94674,22019,29,CHA,Charlotte Hornets,21900813,2020-02-12,CHA @ MIN,W,241.0,115.0,...,7.0,1,13,0,1.0,0.543478,0.112926,0.266667,0.25,0.563063
72793,22018,22,SAS,San Antonio Spurs,21800674,2019-01-18,SAS @ MIN,W,240.0,116.0,...,3.0,1,13,0,1.0,0.567073,0.144456,0.214286,0.353659,0.612073
65776,22020,20,POR,Portland Trail Blazers,22000599,2021-03-14,POR @ MIN,L,241.0,112.0,...,-2.0,0,13,0,1.0,0.533708,0.100696,0.214286,0.235955,0.570033
10289,22023,3,NOP,New Orleans Pelicans,22301138,2024-04-07,NOP @ PHX,W,241.0,113.0,...,8.0,1,19,0,0.0,0.564706,0.113895,0.195122,0.223529,0.605184
81997,22019,25,UTA,Utah Jazz,21900574,2020-01-10,UTA vs. CHA,W,239.0,109.0,...,17.0,1,29,1,1.0,0.620253,0.14976,0.214286,0.177215,0.639972


In [8]:
# Will be helpful to have dictionary keying teams to encoded id values
ENCODED_TEAM_IDS = {
    "Atlanta Hawks": 0,
    "Hawks": 0,
    "Boston Celtics": 1,
    "Celtics": 1,
    "Cleveland Cavaliers": 2,
    "Cavaliers": 2,
    "New Orleans Pelicans": 3,
    "Pelicans": 3,
    "Chicago Bulls": 4,
    "Bulls": 4,
    "Dallas Mavericks": 5,
    "Mavericks": 5,
    "Denver Nuggets": 6,
    "Nuggets": 6,
    "Golden State Warriors": 7,
    "Warriors": 7,
    "Houston Rockets": 8,
    "Rockets": 8,
    "LA Clippers": 9,
    "Clippers": 9,
    "Los Angeles Lakers": 10,
    "Lakers": 10,
    "Miami Heat": 11,
    "Heat": 11,
    "Milwaukee Bucks": 12,
    "Bucks": 12,
    "Minnesota Timberwolves": 13,
    "Timberwolves": 13,
    "Brooklyn Nets": 14,
    "Nets": 14,
    "New York Knicks": 15,
    "Knicks": 15,
    "Orlando Magic": 16,
    "Magic": 16,
    "Indiana Pacers": 17,
    "Pacers": 17,
    "Philadelphia 76ers": 18,
    "76ers": 18,
    "Phoenix Suns": 19,
    "Suns": 19,
    "Portland Trail Blazers": 20,
    "Trail Blazers": 20,
    "Sacramento Kings": 21,
    "Kings": 21,
    "San Antonio Spurs": 22,
    "Spurs": 22,
    "Oklahoma City Thunder": 23,
    "Thunder": 23,
    "Toronto Raptors": 24,
    "Raptors": 24,
    "Utah Jazz": 25,
    "Jazz": 25,
    "Memphis Grizzlies": 26,
    "Grizzlies": 26,
    "Washington Wizards": 27,
    "Wizards": 27,
    "Detroit Pistons": 28,
    "Pistons": 28,
    "Charlotte Hornets": 29,
    "Hornets": 29,
}

### Defining Features

In our model, `X` includes features of the dataset that will be used to predict `y`. Here, `y` is obviously `WIN`, since we are just predicting and classifying each game as a win (1) or loss (0). `X` could include a number of relevant statistics, and all the ones used will be included in the code segment.

In [36]:
features = [
    "TEAM_ID",
    "OPP_TEAM_ID",
    "PTS",
    "OREB",
    "DREB",
    "REB",
    "AST",
    "STL",
    "BLK",
    "TOV",
    "EFG%",
    "TOV%",
    "FTR",
    "TS%",
    "HGA",
]
X = games_modern[features]

y = games_modern[["WIN"]]


# Use sklearn train_test_split with 80% training, 20% testing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [37]:
games_modern = games_modern[
    [
        "TEAM_NAME",
        "TEAM_ID",
        "OPP_TEAM_ID",
        "GAME_DATE",
        "PTS",
        "OREB",
        "DREB",
        "REB",
        "AST",
        "STL",
        "BLK",
        "TOV",
        "EFG%",
        "TOV%",
        "ORB%",
        "FTR",
        "TS%",
        "HGA",
        "LAST_GAME_OUTCOME",
        "WIN",
    ]
]

### Creating Initial Model

`scikit-learn` offers many options for ML classification models. A simple one good for this problem would be a `RandomForestClassifier`, which uses many decision trees to analyze data and settle on a choice. Random forests are known for accuracy and efficiency.

In [54]:
rf_model = RandomForestClassifier(n_estimators=175, max_depth=25, random_state=42)
rf_model.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


In [55]:
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8317073170731707
              precision    recall  f1-score   support

           0       0.83      0.84      0.84      1660
           1       0.84      0.82      0.83      1620

    accuracy                           0.83      3280
   macro avg       0.83      0.83      0.83      3280
weighted avg       0.83      0.83      0.83      3280



### Predicting Games

In [40]:
# Query nba.live.endpoints.scoreboard and  list games in localTimeZone
from datetime import datetime, timezone
from dateutil import parser
from nba_api.live.nba.endpoints import scoreboard

f = "{gameId}: {awayTeam} vs. {homeTeam} @ {gameTimeLTZ}"

board = scoreboard.ScoreBoard()
print("ScoreBoardDate: " + board.score_board_date)
games = board.games.get_dict()
game_teams = []
for game in games:
    gameTimeLTZ = (
        parser.parse(game["gameTimeUTC"])
        .replace(tzinfo=timezone.utc)
        .astimezone(tz=None)
    )
    print(
        f.format(
            gameId=game["gameId"],
            awayTeam=game["awayTeam"]["teamName"],
            homeTeam=game["homeTeam"]["teamName"],
            gameTimeLTZ=gameTimeLTZ,
        )
    )
    game_teams.append([game["homeTeam"]["teamName"], game["awayTeam"]["teamName"]])

ScoreBoardDate: 2025-01-16
0022400575: Clippers vs. Trail Blazers @ 2025-01-16 22:00:00-05:00
0022400576: Rockets vs. Kings @ 2025-01-16 22:00:00-05:00
0022400572: Pacers vs. Pistons @ 2025-01-16 19:00:00-05:00
0022400573: Suns vs. Wizards @ 2025-01-16 19:00:00-05:00
0022400574: Cavaliers vs. Thunder @ 2025-01-16 19:30:00-05:00


In [41]:
from nba_api.stats.endpoints import leaguedashteamstats

advancedGamefinder = leaguedashteamstats.LeagueDashTeamStats(
    season="2024-25", per_mode_detailed="PerGame"
)

team_stats = advancedGamefinder.get_data_frames()[0]

# Drop all columns related to ranks (not helpful for database)
keyword = "RANK"
columns_to_drop = [col for col in team_stats.columns if keyword in col]
team_stats.drop(columns=columns_to_drop, inplace=True)

In [42]:
# Cleaning data in team_stats and matching to games_modern

# Define three of four factors (EFG%, TOV%, FTR)

# Define 'EFG%' (Effective Field Goal Percentage)
team_stats["EFG%"] = (team_stats["FGM"] + (0.5 * team_stats["FG3M"])) / team_stats[
    "FGA"
]

# Define 'TOV%' (Turnover Percentage)
team_stats["TOV%"] = team_stats["TOV"] / (
    team_stats["FGA"] + 0.44 * team_stats["FTA"] + team_stats["TOV"]
)

# Define 'FTR' (Free Throw Attempt Rate)
team_stats["FTR"] = team_stats["FTA"] / team_stats["FGA"]

# Define 'TS%' (True Shooting Percentage)
team_stats["TS%"] = team_stats["PTS"] / (
    2 * (team_stats["FGA"] + (0.44 * team_stats["FTA"]))
)

# Label encode team ids
team_stats["TEAM_ID"] = le.fit_transform(team_stats["TEAM_ID"])

In [46]:
# Fetch data to predict upcoming games
games = board.games.get_dict()
# Empty list to store each prediction dataframe
prediction_dfs = []
for game in games:

    # Get home team (team) and away team (opponent) encoded ids
    team_id = ENCODED_TEAM_IDS.get(game["homeTeam"]["teamName"])
    opp_team_id = ENCODED_TEAM_IDS.get(game["awayTeam"]["teamName"])

    # Get relevant home team statistics for features

    home_condition = team_stats["TEAM_ID"] == team_id

    pts = team_stats.loc[home_condition, "PTS"].values[0]


    oreb = team_stats.loc[home_condition, "OREB"].values[0]

    dreb = team_stats.loc[home_condition, "DREB"].values[0]

    reb = team_stats.loc[home_condition, "REB"].values[0]

    ast = team_stats.loc[home_condition, "AST"].values[0]

    stl = team_stats.loc[home_condition, "STL"].values[0]


    blk = team_stats.loc[home_condition, "BLK"].values[0]

    tov = team_stats.loc[home_condition, "TOV"].values[0]


    efg_pct = team_stats.loc[home_condition, "EFG%"].values[0]

    tov_pct = team_stats.loc[home_condition, "TOV%"].values[0]

    ftr = team_stats.loc[home_condition, "FTR"].values[0]


    ts_pct = team_stats.loc[home_condition, "TS%"].values[0]


    # Get home game advantage (always 1 due to how we format data)
    hga = 1.0

    # With all of the data, construct a dataframe with the game's information.
    prediction_data = {
        "TEAM_ID": [team_id],
        "OPP_TEAM_ID": [opp_team_id],
        "PTS": [pts],
        "OREB": [oreb],
        "DREB": [dreb],
        "REB": [reb],
        "AST": [ast],
        "STL": [stl],
        "BLK": [blk],
        "TOV": [tov],
        "EFG%": [efg_pct],
        "TOV%": [tov_pct],
        "FTR": [ftr],
        "TS%": [ts_pct],
        "HGA": [hga],
    }
    game_prediction_df = pd.DataFrame(prediction_data)
    prediction_dfs.append(game_prediction_df)

# Will all game dataframes, concatenate into a single dataframe for easy simultaneous viewing    
all_game_predictions = pd.concat(prediction_dfs)

In [49]:
all_game_predictions

Unnamed: 0,TEAM_ID,OPP_TEAM_ID,PTS,OREB,DREB,REB,AST,STL,BLK,TOV,EFG%,TOV%,FTR,TS%,HGA
0,20,9,108.4,12.5,30.8,43.3,23.2,8.1,5.6,16.4,0.517338,0.142778,0.230425,0.550455,1.0
0,21,8,116.1,10.8,33.9,44.7,26.4,8.2,4.8,13.2,0.542952,0.116205,0.240088,0.578233,1.0
0,28,17,112.1,11.1,33.8,44.9,25.7,7.3,5.1,15.9,0.54382,0.139896,0.223596,0.573366,1.0
0,27,19,108.7,10.8,33.2,44.0,25.1,7.8,5.4,16.3,0.512693,0.140668,0.225166,0.545814,1.0
0,23,2,116.6,10.0,33.7,43.7,26.0,11.6,5.9,12.0,0.548422,0.106769,0.210011,0.580724,1.0


In [50]:
for i in range(0, len(prediction_dfs)):
    print("Home Team:", game_teams[i][0])
    print("Away Team:", game_teams[i][1])
    prediction = rf_model.predict(prediction_dfs[i])
    # 0 = loss for home team/win for away team
    # 1 = win for home team/loss for away team
    if prediction == 0:
        print(game_teams[i][1] + " predicted to win!")
    if prediction == 1:
        print(game_teams[i][0] + " predicted to win!")
    print("")

Home Team: Trail Blazers
Away Team: Clippers
Clippers predicted to win!

Home Team: Kings
Away Team: Rockets
Kings predicted to win!

Home Team: Pistons
Away Team: Pacers
Pacers predicted to win!

Home Team: Wizards
Away Team: Suns
Suns predicted to win!

Home Team: Thunder
Away Team: Cavaliers
Thunder predicted to win!

