# Predicting NBA Game Outcomes with Machine Learning
## Author: Kishore Annambhotla

### Introduction and Methods


### Setting Up

In order to predict NBA game outcomes, we will need to analyze many past games and see what statistics help or hurt a team's chances of winning. We will also need a robust predictive model capable of making a decision based on all of this game data. For this project, I used `nba_api` to retrieve up-to-date NBA game data, `pandas` and `numpy` for data manipulation, and `scikit-learn` to build an ML model. All of the project's necessary imports will be included in the code block below.

In [1]:
from nba_api.stats.static import teams
from nba_api.stats.endpoints import leaguegamefinder
import pandas as pd

pd.options.mode.chained_assignment = (
    None  # prevents warnings on reformatted columns later
)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

### Importing Data

We will base our model off of games played by all NBA teams. As shown, this game data will include categorical information about the game, teams, and season, as well as quantitative data regarding each team's offense and defense, as well as the winner of the game.

First, we will use the `nba_api` to get games played by each NBA team. Note that we set `season_type_nullable` to only include regular season games, since those are the games we will be predicting with this model. This could easily be changed to train a model better for predicting playoff or preseason games.

In [2]:
nba_teams = teams.get_teams()
team_abbr_to_id = {team["abbreviation"]: team["id"] for team in nba_teams}
all_games = pd.DataFrame()

for team in nba_teams:
    team_id = team["id"]
    gamefinder = leaguegamefinder.LeagueGameFinder(
        team_id_nullable=team_id, season_type_nullable="Regular Season"
    )
    games = gamefinder.get_data_frames()[0]
    all_games = pd.concat([all_games, games], ignore_index=True)

In [None]:
print(all_games.columns)
print(all_games.sample(n=5))

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')
      SEASON_ID     TEAM_ID TEAM_ABBREVIATION        TEAM_NAME     GAME_ID  \
79797     22016  1610612761               TOR  Toronto Raptors  0021600362   
62183     22022  1610612756               PHX     Phoenix Suns  0022200220   
55569     22019  1610612754               IND   Indiana Pacers  0021900474   
82177     22017  1610612762               UTA        Utah Jazz  1521700009   
42013     21989  1610612749               MIL  Milwaukee Bucks  0028900161   

        GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  DREB   REB  \
79797  2016-12-12  TOR vs. MIL  W  240  122  ...   0.857  14.0  32.0  46.0   
62183  2022-11-16  PHX vs. GSW  W  241  130  ...   0.840   8.0  3

While this data is good, it may be too much for our model. Over the last forty seasons, NBA basketball has greatly changed. Many rules have been added and removed, and aspects of "perfect" offense and defense have completely shifted. As a result, older data is increasingly unreliable for predicting current-day NBA games. For this model, I chose to only use data from the 2018-2019 season until now. This time period is most reflective of modern basketball, and ignores plenty of outdated game data.

In [4]:
# Get games where season is in modern_seasons
games_modern = all_games[
    (all_games.SEASON_ID.str[-4:] == "2018")
    | (all_games.SEASON_ID.str[-4:] == "2019")
    | (all_games.SEASON_ID.str[-4:] == "2020")
    | (all_games.SEASON_ID.str[-4:] == "2021")
    | (all_games.SEASON_ID.str[-4:] == "2022")
    | (all_games.SEASON_ID.str[-4:] == "2023")
    | (all_games.SEASON_ID.str[-4:] == "2024")
]
print(games_modern.sample(n=5))

      SEASON_ID     TEAM_ID TEAM_ABBREVIATION               TEAM_NAME  \
76189     22018  1610612760               OKC   Oklahoma City Thunder   
55381     22021  1610612754               IND          Indiana Pacers   
43000     22018  1610612750               MIN  Minnesota Timberwolves   
69063     22021  1610612758               SAC        Sacramento Kings   
33069     22019  1610612747               LAL      Los Angeles Lakers   

          GAME_ID   GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  \
76189  0021800717  2019-01-24  OKC vs. NOP  W  240  122  ...   0.500  14.0   
55381  0022101053  2022-03-18    IND @ HOU  W  239  121  ...   0.684  16.0   
43000  0021800705  2019-01-22    MIN @ PHX  W  242  118  ...   0.684  20.0   
69063  0022101010  2022-03-12    SAC @ UTA  L  239  125  ...   0.857   8.0   
33069  0021900342  2019-12-08  LAL vs. MIN  W  238  142  ...   0.900  10.0   

       DREB   REB  AST  STL  BLK  TOV  PF  PLUS_MINUS  
76189  43.0  57.0   33  8.0    2   1

### Cleaning Data

We have plenty of good data, but it's still unrefined. The next step is to clean up our columns and reformat the data. This will make training, testing, and evaluating our model much easier in the future.

First, we will reformat some of our columns to make working with the data easier.

In [5]:
# Convert GAME_DATE to pandas datetime
# Also order by date earliest to latest to make working with stats easier later
games_modern["GAME_DATE"] = pd.to_datetime(games_modern["GAME_DATE"])
games_modern.sort_values(by="GAME_DATE", inplace=True)


# Add binary "WIN" column, remove categorical WL column
games_modern["WIN"] = games_modern["WL"].apply(lambda x: 1 if x == "W" else 0)


# Convert int stat columns to float type for accurate data analysis
games_modern["MIN"] = games_modern["MIN"].astype(float)  # minutes
games_modern["PTS"] = games_modern["PTS"].astype(float)  # points
games_modern["FGM"] = games_modern["FGM"].astype(float)  # field goals made
games_modern["FGA"] = games_modern["FGA"].astype(float)  # field goals attempted
games_modern["FG3M"] = games_modern["FG3M"].astype(float)  # 3s made
games_modern["FG3A"] = games_modern["FG3A"].astype(float)  # 3s attempted
games_modern["FTM"] = games_modern["FTM"].astype(float)  # free throws made
games_modern["FTA"] = games_modern["FTA"].astype(float)  # free throws attempted
games_modern["AST"] = games_modern["AST"].astype(float)  # assists
games_modern["BLK"] = games_modern["BLK"].astype(float)  # blocks
games_modern["TOV"] = games_modern["TOV"].astype(float)  # turnovers
games_modern["PF"] = games_modern["PF"].astype(float)  # personal fouls


# Add opponent id as column


def get_opponent_id(matchup, team_abbr_to_id, team_id):
    if "@" in matchup:

        opponent_abbr = matchup.split(" @ ")[-1]

    else:

        opponent_abbr = matchup.split(" vs. ")[-1]

    return team_abbr_to_id.get(opponent_abbr, team_id)


games_modern["OPP_TEAM_ID"] = games_modern.apply(
    lambda row: get_opponent_id(row["MATCHUP"], team_abbr_to_id, row["TEAM_ID"]), axis=1
)

Next, we want to define a few new statistics that will be useful in evaluating a team. These statistics will prove essential in predicting the outcomes of new games. For many of these statistics, we will only want to create them based on the most recent data to avoid inaccuracies. Because teams can undergo vast changes in the span of two or three years, we will only define these statistics using data over the last two seasons.

In [6]:
# Define 'HGA' (Home Game Advantage)
games_modern["HGA"] = games_modern["MATCHUP"].apply(lambda x: 0 if "@" in x else 1)

# Define 'LAST_GAME_OUTCOME'
games_modern["LAST_GAME_OUTCOME"] = (
    games_modern.groupby("TEAM_ID")["WIN"].shift(1).fillna(0)
)

# Note: per game statistics indicate team per game statistics using rolling average of last 10
# Define 'PPG' (Points Per Game)
# games_modern["PPG"] = (
#    games_modern.groupby("TEAM_ID")["PTS"]
#    .rolling(window=10, closed="left", min_periods=1)
#    .mean()
#    .reset_index(0, drop=True)
# )

# Define 'EFG%' (Effective Field Goal Percentage)
games_modern["EFG%"] = (
    games_modern["FGM"] + (0.5 * games_modern["FG3M"])
) / games_modern["FGA"]

# Define 'TOV%' (Turnover Percentage)
games_modern["TOV%"] = games_modern["TOV"] / (
    games_modern["FGA"] + 0.44 * games_modern["FTA"] + games_modern["TOV"]
)


# Define 'ORB%' (Offensive Rebound Percentage)
def get_opponent_dreb(game_id):
    opponent_dreb = games_modern.loc[games_modern["GAME_ID"] == game_id, "DREB"].values[
        0
    ]
    return opponent_dreb


games_modern["ORB%"] = games_modern["OREB"] / (
    games_modern["OREB"] + get_opponent_dreb(games_modern["GAME_ID"])
)

# Define 'FTR' (Free Throw Attempt Rate)
games_modern["FTR"] = games_modern["FTA"] / games_modern["FGA"]

# Define 'TS%' (True Shooting Percentage)
games_modern["TS%"] = games_modern["PTS"] / (
    2 * (games_modern["FGA"] + (0.44 * games_modern["FTA"]))
)

Finally, we want to use a `LabelEncoder` to transform a few of our categorical data columsn into numerical values that can be understood by our model.

In [60]:
# LabelEncode data
le = LabelEncoder()
games_modern["TEAM_ID"] = le.fit_transform(games_modern["TEAM_ID"])
games_modern["OPP_TEAM_ID"] = le.fit_transform(games_modern["OPP_TEAM_ID"])
games_modern.sample(n=5)

Unnamed: 0,TEAM_NAME,TEAM_ID,OPP_TEAM_ID,GAME_DATE,PTS,OREB,DREB,REB,AST,STL,BLK,TOV,EFG%,TOV%,ORB%,FTR,TS%,HGA,LAST_GAME_OUTCOME,WIN
58945,Philadelphia 76ers,18,10,2020-03-03,107.0,12.0,30.0,42.0,24.0,10.0,1.0,15.0,0.541176,0.137868,0.307692,0.235294,0.570362,0,0.0,0
52676,Orlando Magic,16,6,2019-12-18,104.0,8.0,27.0,35.0,25.0,9.0,5.0,12.0,0.488506,0.110865,0.228571,0.241379,0.540316,0,0.0,0
12404,Chicago Bulls,4,29,2021-11-29,133.0,5.0,36.0,41.0,35.0,3.0,5.0,10.0,0.674157,0.093145,0.15625,0.213483,0.683032,1,0.0,1
52567,Orlando Magic,16,26,2021-05-01,112.0,6.0,31.0,37.0,20.0,10.0,5.0,10.0,0.511765,0.093179,0.181818,0.329412,0.575421,1,0.0,1
29352,LA Clippers,9,19,2023-04-09,119.0,14.0,39.0,53.0,22.0,3.0,5.0,10.0,0.515,0.082946,0.341463,0.24,0.538169,0,1.0,1


### Defining Features

In our model, `X` includes features of the dataset that will be used to predict `y`. Here, `y` is obviously `WIN`, since we are just predicting and classifying each game as a win (1) or loss (0). `X` could include a number of relevant statistics, and all the ones used will be included in the code segment.

In [61]:
features = [
    "TEAM_ID",
    "OPP_TEAM_ID",
    "PTS",
    "OREB",
    "DREB",
    "REB",
    "AST",
    "STL",
    "BLK",
    "TOV",
    "EFG%",
    "TOV%",
    "FTR",
    "TS%",
    "HGA",
    "LAST_GAME_OUTCOME",
]
X = games_modern[features]

y = games_modern[["WIN"]]


# Use sklearn train_test_split with 80% training, 20% testing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [9]:
games_modern = games_modern[
    [
        "TEAM_NAME",
        "TEAM_ID",
        "OPP_TEAM_ID",
        "GAME_DATE",
        "PTS",
        "OREB",
        "DREB",
        "REB",
        "AST",
        "STL",
        "BLK",
        "TOV",
        "EFG%",
        "TOV%",
        "ORB%",
        "FTR",
        "TS%",
        "HGA",
        "LAST_GAME_OUTCOME",
        "WIN",
    ]
]

### Creating Initial Model

`scikit-learn` offers many options for ML classification models. A simple one good for this problem would be a `RandomForestClassifier`, which uses many decision trees to analyze data and settle on a choice. Random forests are known for accuracy and efficiency.

In [62]:
rf_model = RandomForestClassifier(n_estimators=150, max_depth=20, random_state=42)
rf_model.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


In [63]:
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8254356465912565
              precision    recall  f1-score   support

           0       0.81      0.85      0.83      1621
           1       0.84      0.80      0.82      1650

    accuracy                           0.83      3271
   macro avg       0.83      0.83      0.83      3271
weighted avg       0.83      0.83      0.83      3271



### Predicting Games

In [64]:
# Query nba.live.endpoints.scoreboard and  list games in localTimeZone
from datetime import datetime, timezone
from dateutil import parser
from nba_api.live.nba.endpoints import scoreboard

f = "{gameId}: {awayTeam} vs. {homeTeam} @ {gameTimeLTZ}"

board = scoreboard.ScoreBoard()
print("ScoreBoardDate: " + board.score_board_date)
games = board.games.get_dict()
for game in games:
    gameTimeLTZ = (
        parser.parse(game["gameTimeUTC"])
        .replace(tzinfo=timezone.utc)
        .astimezone(tz=None)
    )
    print(
        f.format(
            gameId=game["gameId"],
            awayTeam=game["awayTeam"]["teamName"],
            homeTeam=game["homeTeam"]["teamName"],
            gameTimeLTZ=gameTimeLTZ,
        )
    )

ScoreBoardDate: 2025-01-13
0022400548: Timberwolves vs. Wizards @ 2025-01-13 19:00:00-05:00
0022400549: Pistons vs. Knicks @ 2025-01-13 19:30:00-05:00
0022400550: Warriors vs. Raptors @ 2025-01-13 19:30:00-05:00
0022400551: Grizzlies vs. Rockets @ 2025-01-13 20:00:00-05:00
0022400552: Spurs vs. Lakers @ 2025-01-13 22:30:00-05:00
0022400553: Heat vs. Clippers @ 2025-01-13 22:30:00-05:00


In [65]:
from nba_api.stats.endpoints import leaguedashteamstats

advancedGamefinder = leaguedashteamstats.LeagueDashTeamStats(
    season="2024-25", per_mode_detailed="PerGame"
)

team_stats = advancedGamefinder.get_data_frames()[0]

# Drop all columns related to ranks (not helpful for database)
keyword = "RANK"
columns_to_drop = [col for col in team_stats.columns if keyword in col]
team_stats.drop(columns=columns_to_drop, inplace=True)

In [76]:
# Cleaning data in team_stats and matching to games_modern

# Define three of four factors (EFG%, TOV%, FTR)

# Define 'EFG%' (Effective Field Goal Percentage)
team_stats["EFG%"] = (team_stats["FGM"] + (0.5 * team_stats["FG3M"])) / team_stats[
    "FGA"
]

# Define 'TOV%' (Turnover Percentage)
team_stats["TOV%"] = team_stats["TOV"] / (
    team_stats["FGA"] + 0.44 * team_stats["FTA"] + team_stats["TOV"]
)

# Define 'FTR' (Free Throw Attempt Rate)
team_stats["FTR"] = team_stats["FTA"] / team_stats["FGA"]

# Define 'TS%' (True Shooting Percentage)
team_stats["TS%"] = team_stats["PTS"] / (
    2 * (team_stats["FGA"] + (0.44 * team_stats["FTA"]))
)

# Label encode team ids
team_stats["TEAM_ID"] = le.fit_transform(team_stats["TEAM_ID"])
#team_stats.sample(n=5)
print(pd.concat([team_stats["TEAM_ID"], team_stats["TEAM_NAME"]], axis=1))

    TEAM_ID               TEAM_NAME
0         0           Atlanta Hawks
1         1          Boston Celtics
2        14           Brooklyn Nets
3        29       Charlotte Hornets
4         4           Chicago Bulls
5         2     Cleveland Cavaliers
6         5        Dallas Mavericks
7         6          Denver Nuggets
8        28         Detroit Pistons
9         7   Golden State Warriors
10        8         Houston Rockets
11       17          Indiana Pacers
12        9             LA Clippers
13       10      Los Angeles Lakers
14       26       Memphis Grizzlies
15       11              Miami Heat
16       12         Milwaukee Bucks
17       13  Minnesota Timberwolves
18        3    New Orleans Pelicans
19       15         New York Knicks
20       23   Oklahoma City Thunder
21       16           Orlando Magic
22       18      Philadelphia 76ers
23       19            Phoenix Suns
24       20  Portland Trail Blazers
25       21        Sacramento Kings
26       22       San Antoni

In [None]:
# Will be helpful to define a dictionary containing team encoded ids keyed to team name values
encoded_team_ids = {
    "Atlanta Hawks": 0,
    "Boston Celtics": 1,
    
}

In [72]:
# Fetch data to predict upcoming games
games = board.games.get_dict()

for game in games:

    # Get home team (team) and away team (opponent) ids

    team_id = game["homeTeam"]["teamId"]
    print(team_id)
    print("Home Team: " + game["homeTeam"]["teamName"])
    print("Away Team: " + game["awayTeam"]["teamName"])
    encoded_team_id = le.fit_transform([team_id])
    print(encoded_team_id[0])

    opp_team_id = game["awayTeam"]["teamId"]


    # Get relevant home team statistics for features

    home_condition = team_stats["TEAM_ID"] == encoded_team_id[0]

    pts = team_stats.loc[home_condition, "PTS"]


    oreb = team_stats.loc[home_condition, "OREB"]

    dreb = team_stats.loc[home_condition, "DREB"]

    reb = team_stats.loc[home_condition, "REB"]

    ast = team_stats.loc[home_condition, "AST"]

    stl = team_stats.loc[home_condition, "STL"]


    blk = team_stats.loc[home_condition, "BLK"]

    tov = team_stats.loc[home_condition, "TOV"]


    efg_pct = team_stats.loc[home_condition, "EFG%"]

    tov_pct = team_stats.loc[home_condition, "TOV%"]

    ftr = team_stats.loc[home_condition, "FTR"]


    ts_pct = team_stats.loc[home_condition, "TS%"]


    # Get home game advantage (always 1 due to how we format data)

    hga = 1


    # Get last game outcome from games_modern


    filtered_games_modern = games_modern.loc[
        games_modern["TEAM_ID"] == encoded_team_id[0]
    ]


    filtered_games_modern.sort_values(by="GAME_DATE", inplace=True)
    last_game_outcome = filtered_games_modern.iloc[0]["WIN"]

    # With all of the data, construct a dataframe with the game's information.
    prediction_data = {
        "TEAM_ID": [team_id],
        "OPP_TEAM_ID": [opp_team_id],
        "PTS": [pts],
        "OREB": [oreb],
        "DREB": [dreb],
        "REB": [reb],
        "AST": [ast],
        "STL": [stl],
        "BLK": [blk],
        "TOV": [tov],
        "EFG%": [efg_pct],
        "TOV%": [tov_pct],
        "FTR": [ftr],
        "TS%": [ts_pct],
        "HGA": [hga],
        "LAST_GAME_OUTCOME": [last_game_outcome],
        "WIN": [0],
    }
    game_prediction_df = pd.DataFrame(prediction_data)

1610612764
Home Team: Wizards
Away Team: Timberwolves
0
1610612752
Home Team: Knicks
Away Team: Pistons
0
1610612761
Home Team: Raptors
Away Team: Warriors
0
1610612745
Home Team: Rockets
Away Team: Grizzlies
0
1610612747
Home Team: Lakers
Away Team: Spurs
0
1610612746
Home Team: Clippers
Away Team: Heat
0
