# Predicting NBA Game Outcomes with Machine Learning
## Author: Kishore Annambhotla

### Introduction and Methods


### Setting Up

In order to predict NBA game outcomes, we will need to analyze many past games and see what statistics help or hurt a team's chances of winning. We will also need a robust predictive model capable of making a decision based on all of this game data. For this project, I used `nba_api` to retrieve up-to-date NBA game data, `pandas` and `numpy` for data manipulation, and `scikit-learn` to build an ML model. All of the project's necessary imports will be included in the code block below.

In [2]:
from nba_api.stats.static import teams
from nba_api.stats.endpoints import leaguegamefinder
import pandas as pd

pd.options.mode.chained_assignment = (
    None  # prevents warnings on reformatted columns later
)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

### Importing Data

We will base our model off of games played by all NBA teams. As shown, this game data will include categorical information about the game, teams, and season, as well as quantitative data regarding each team's offense and defense, as well as the winner of the game.

First, we will use the `nba_api` to get games played by each NBA team. Note that we set `season_type_nullable` to only include regular season games, since those are the games we will be predicting with this model. This could easily be changed to train a model better for predicting playoff or preseason games.

In [3]:
nba_teams = teams.get_teams()
team_abbr_to_id = {team["abbreviation"]: team["id"] for team in nba_teams}
all_games = pd.DataFrame()

for team in nba_teams:
    team_id = team["id"]
    gamefinder = leaguegamefinder.LeagueGameFinder(
        team_id_nullable=team_id, season_type_nullable="Regular Season"
    )
    games = gamefinder.get_data_frames()[0]
    all_games = pd.concat([all_games, games], ignore_index=True)

In [4]:
print(all_games.columns)
print(all_games.sample(n=5))

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')
      SEASON_ID     TEAM_ID TEAM_ABBREVIATION               TEAM_NAME  \
85237     22021  1610612763               MEM       Memphis Grizzlies   
16831     22009  1610612742               DAL        Dallas Mavericks   
55325     22022  1610612754               IND          Indiana Pacers   
28035     21997  1610612745               HOU         Houston Rockets   
44297     22003  1610612750               MIN  Minnesota Timberwolves   

          GAME_ID   GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  \
85237  0022100978  2022-03-08  MEM vs. NOP  W  240  132  ...   0.833  15.0   
16831  0020900286  2009-12-05  DAL vs. ATL  L  240   75  ...   0.708   7.0   
55325  0022200963

While this data is good, it may be too much for our model. Over the last forty seasons, NBA basketball has greatly changed. Many rules have been added and removed, and aspects of "perfect" offense and defense have completely shifted. As a result, older data is increasingly unreliable for predicting current-day NBA games. For this model, I chose to only use data from the 2018-2019 season until now. This time period is most reflective of modern basketball, and ignores plenty of outdated game data.

In [5]:
# Get games where season is in modern_seasons
games_modern = all_games[
    (all_games.SEASON_ID.str[-4:] == "2018")
    | (all_games.SEASON_ID.str[-4:] == "2019")
    | (all_games.SEASON_ID.str[-4:] == "2020")
    | (all_games.SEASON_ID.str[-4:] == "2021")
    | (all_games.SEASON_ID.str[-4:] == "2022")
    | (all_games.SEASON_ID.str[-4:] == "2023")
    | (all_games.SEASON_ID.str[-4:] == "2024")
]
print(games_modern.sample(n=5))

      SEASON_ID     TEAM_ID TEAM_ABBREVIATION      TEAM_NAME     GAME_ID  \
460       22018  1610612737               ATL  Atlanta Hawks  0021801088   
29488     22021  1610612746               LAC    LA Clippers  0022100698   
36202     22023  1610612748               MIA     Miami Heat  0022300204   
81916     22020  1610612762               UTA      Utah Jazz  0022000609   
45        22023  1610612737               ATL  Atlanta Hawks  0022301188   

        GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  DREB   REB  \
460    2019-03-23  ATL vs. PHI  W  240  129  ...   0.762  15.0  34.0  49.0   
29488  2022-01-23    LAC @ NYK  L  240  102  ...   0.667   6.0  34.0  40.0   
36202  2023-11-18    MIA @ CHI  L  240   97  ...   0.826   5.0  29.0  34.0   
81916  2021-03-16    UTA @ BOS  W  240  117  ...   0.917  10.0  28.0  38.0   
45     2024-04-14    ATL @ IND  L  241  115  ...   0.926   9.0  23.0  32.0   

       AST  STL  BLK  TOV  PF  PLUS_MINUS  
460     28  8.0    7   12  25 

### Cleaning Data

We have plenty of good data, but it's still unrefined. The next step is to clean up our columns and reformat the data. This will make training, testing, and evaluating our model much easier in the future.

First, we will reformat some of our columns to make working with the data easier.

In [6]:
# Convert GAME_DATE to pandas datetime
# Also order by date earliest to latest to make working with stats easier later
games_modern["GAME_DATE"] = pd.to_datetime(games_modern["GAME_DATE"])
games_modern.sort_values(by="GAME_DATE", inplace=True)


# Add binary "WIN" column, remove categorical WL column
games_modern["WIN"] = games_modern["WL"].apply(lambda x: 1 if x == "W" else 0)


# Convert int stat columns to float type for accurate data analysis
games_modern["MIN"] = games_modern["MIN"].astype(float)  # minutes
games_modern["PTS"] = games_modern["PTS"].astype(float)  # points
games_modern["FGM"] = games_modern["FGM"].astype(float)  # field goals made
games_modern["FGA"] = games_modern["FGA"].astype(float)  # field goals attempted
games_modern["FG3M"] = games_modern["FG3M"].astype(float)  # 3s made
games_modern["FG3A"] = games_modern["FG3A"].astype(float)  # 3s attempted
games_modern["FTM"] = games_modern["FTM"].astype(float)  # free throws made
games_modern["FTA"] = games_modern["FTA"].astype(float)  # free throws attempted
games_modern["AST"] = games_modern["AST"].astype(float)  # assists
games_modern["BLK"] = games_modern["BLK"].astype(float)  # blocks
games_modern["TOV"] = games_modern["TOV"].astype(float)  # turnovers
games_modern["PF"] = games_modern["PF"].astype(float)  # personal fouls


# Add opponent id as column


def get_opponent_id(matchup, team_abbr_to_id, team_id):
    if "@" in matchup:

        opponent_abbr = matchup.split(" @ ")[-1]

    else:

        opponent_abbr = matchup.split(" vs. ")[-1]

    return team_abbr_to_id.get(opponent_abbr, team_id)


games_modern["OPP_TEAM_ID"] = games_modern.apply(
    lambda row: get_opponent_id(row["MATCHUP"], team_abbr_to_id, row["TEAM_ID"]), axis=1
)

Next, we want to define a few new statistics that will be useful in evaluating a team. These statistics will prove essential in predicting the outcomes of new games. For many of these statistics, we will only want to create them based on the most recent data to avoid inaccuracies. Because teams can undergo vast changes in the span of two or three years, we will only define these statistics using data over the last two seasons.

In [7]:
# Define 'HGA' (Home Game Advantage)
games_modern["HGA"] = games_modern["MATCHUP"].apply(lambda x: 0 if "@" in x else 1)

# Define 'LAST_GAME_OUTCOME'
games_modern["LAST_GAME_OUTCOME"] = (
    games_modern.groupby("TEAM_ID")["WIN"].shift(1).fillna(0)
)

# Note: per game statistics indicate team per game statistics using rolling average of last 10
# Define 'PPG' (Points Per Game)
# games_modern["PPG"] = (
#    games_modern.groupby("TEAM_ID")["PTS"]
#    .rolling(window=10, closed="left", min_periods=1)
#    .mean()
#    .reset_index(0, drop=True)
# )

# Define 'EFG%' (Effective Field Goal Percentage)
games_modern["EFG%"] = (
    games_modern["FGM"] + (0.5 * games_modern["FG3M"])
) / games_modern["FGA"]

# Define 'TOV%' (Turnover Percentage)
games_modern["TOV%"] = games_modern["TOV"] / (
    games_modern["FGA"] + 0.44 * games_modern["FTA"] + games_modern["TOV"]
)


# Define 'ORB%' (Offensive Rebound Percentage)
def get_opponent_dreb(game_id):
    opponent_dreb = games_modern.loc[games_modern["GAME_ID"] == game_id, "DREB"].values[
        0
    ]
    return opponent_dreb


games_modern["ORB%"] = games_modern["OREB"] / (
    games_modern["OREB"] + get_opponent_dreb(games_modern["GAME_ID"])
)

# Define 'FTR' (Free Throw Attempt Rate)
games_modern["FTR"] = games_modern["FTA"] / games_modern["FGA"]

# Define 'TS%' (True Shooting Percentage)
games_modern["TS%"] = games_modern["PTS"] / (
    2 * (games_modern["FGA"] + (0.44 * games_modern["FTA"]))
)

Finally, we want to use a `LabelEncoder` to transform a few of our categorical data columsn into numerical values that can be understood by our model.

In [52]:
# LabelEncode data
le = LabelEncoder()
games_modern["TEAM_ID"] = le.fit_transform(games_modern["TEAM_ID"])
games_modern["OPP_TEAM_ID"] = le.fit_transform(games_modern["OPP_TEAM_ID"])
games_modern.sample(5)

Unnamed: 0,TEAM_NAME,TEAM_ID,OPP_TEAM_ID,GAME_DATE,PTS,OREB,DREB,REB,AST,STL,BLK,TOV,EFG%,TOV%,ORB%,FTR,TS%,HGA,LAST_GAME_OUTCOME,WIN
79638,Toronto Raptors,24,12,2019-01-05,123.0,4.0,34.0,38.0,28.0,9.0,6.0,11.0,0.597561,0.103151,0.142857,0.378049,0.643036,0,0.0,1
91301,Detroit Pistons,28,14,2019-11-02,113.0,5.0,36.0,41.0,21.0,8.0,7.0,8.0,0.494505,0.071582,0.172414,0.318681,0.544526,1,0.0,1
22692,Golden State Warriors,7,9,2021-10-21,115.0,10.0,43.0,53.0,27.0,5.0,4.0,21.0,0.608434,0.184729,0.294118,0.26506,0.620414,1,1.0,1
59119,Philadelphia 76ers,18,4,2018-10-18,127.0,11.0,44.0,55.0,30.0,10.0,8.0,13.0,0.537634,0.10906,0.314286,0.322581,0.597928,1,0.0,1
22769,Golden State Warriors,7,28,2020-12-29,116.0,4.0,36.0,40.0,21.0,8.0,8.0,19.0,0.592105,0.171418,0.142857,0.473684,0.631533,0,1.0,1


In [58]:
# Will be helpful to have dictionary keying teams to encoded id values
ENCODED_TEAM_IDS = {
    "Atlanta Hawks": 0,
    "Hawks": 0,
    "Boston Celtics": 1,
    "Celtics": 1,
    "Cleveland Cavaliers": 2,
    "Cavaliers": 2,
    "New Orleans Pelicans": 3,
    "Pelicans": 3,
    "Chicago Bulls": 4,
    "Bulls": 4,
    "Dallas Mavericks": 5,
    "Mavericks": 5,
    "Denver Nuggets": 6,
    "Nuggets": 6,
    "Golden State Warriors": 7,
    "Warriors": 7,
    "Houston Rockets": 8,
    "Rockets": 8,
    "LA Clippers": 9,
    "Clippers": 9,
    "Los Angeles Lakers": 10,
    "Lakers": 10,
    "Miami Heat": 11,
    "Heat": 11,
    "Milwaukee Bucks": 12,
    "Bucks": 12,
    "Minnesota Timberwolves": 13,
    "Timberwolves": 13,
    "Brooklyn Nets": 14,
    "Nets": 14,
    "New York Knicks": 15,
    "Knicks": 15,
    "Orlando Magic": 16,
    "Magic": 16,
    "Indiana Pacers": 17,
    "Pacers": 17,
    "Philadelphia 76ers": 18,
    "76ers": 18,
    "Phoenix Suns": 19,
    "Suns": 19,
    "Portland Trail Blazers": 20,
    "Trail Blazers": 20,
    "Sacramento Kings": 21,
    "Kings": 21,
    "San Antonio Spurs": 22,
    "Spurs": 22,
    "Oklahoma City Thunder": 23,
    "Thunder": 23,
    "Toronto Raptors": 24,
    "Raptors": 24,
    "Utah Jazz": 25,
    "Jazz": 25,
    "Memphis Grizzlies": 26,
    "Grizzlies": 26,
    "Washington Wizards": 27,
    "Wizards": 27,
    "Detroit Pistons": 28,
    "Pistons": 28,
    "Charlotte Hornets": 29,
    "Hornets": 29,
}

### Defining Features

In our model, `X` includes features of the dataset that will be used to predict `y`. Here, `y` is obviously `WIN`, since we are just predicting and classifying each game as a win (1) or loss (0). `X` could include a number of relevant statistics, and all the ones used will be included in the code segment.

In [10]:
features = [
    "TEAM_ID",
    "OPP_TEAM_ID",
    "PTS",
    "OREB",
    "DREB",
    "REB",
    "AST",
    "STL",
    "BLK",
    "TOV",
    "EFG%",
    "TOV%",
    "FTR",
    "TS%",
    "HGA",
    "LAST_GAME_OUTCOME",
]
X = games_modern[features]

y = games_modern[["WIN"]]


# Use sklearn train_test_split with 80% training, 20% testing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [11]:
games_modern = games_modern[
    [
        "TEAM_NAME",
        "TEAM_ID",
        "OPP_TEAM_ID",
        "GAME_DATE",
        "PTS",
        "OREB",
        "DREB",
        "REB",
        "AST",
        "STL",
        "BLK",
        "TOV",
        "EFG%",
        "TOV%",
        "ORB%",
        "FTR",
        "TS%",
        "HGA",
        "LAST_GAME_OUTCOME",
        "WIN",
    ]
]

### Creating Initial Model

`scikit-learn` offers many options for ML classification models. A simple one good for this problem would be a `RandomForestClassifier`, which uses many decision trees to analyze data and settle on a choice. Random forests are known for accuracy and efficiency.

In [12]:
rf_model = RandomForestClassifier(n_estimators=150, max_depth=20, random_state=42)
rf_model.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


In [13]:
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8145210494203783
              precision    recall  f1-score   support

           0       0.81      0.84      0.82      1691
           1       0.82      0.79      0.80      1587

    accuracy                           0.81      3278
   macro avg       0.82      0.81      0.81      3278
weighted avg       0.81      0.81      0.81      3278



### Predicting Games

In [14]:
# Query nba.live.endpoints.scoreboard and  list games in localTimeZone
from datetime import datetime, timezone
from dateutil import parser
from nba_api.live.nba.endpoints import scoreboard

f = "{gameId}: {awayTeam} vs. {homeTeam} @ {gameTimeLTZ}"

board = scoreboard.ScoreBoard()
print("ScoreBoardDate: " + board.score_board_date)
games = board.games.get_dict()
for game in games:
    gameTimeLTZ = (
        parser.parse(game["gameTimeUTC"])
        .replace(tzinfo=timezone.utc)
        .astimezone(tz=None)
    )
    print(
        f.format(
            gameId=game["gameId"],
            awayTeam=game["awayTeam"]["teamName"],
            homeTeam=game["homeTeam"]["teamName"],
            gameTimeLTZ=gameTimeLTZ,
        )
    )

ScoreBoardDate: 2025-01-16
0022400572: Pacers vs. Pistons @ 2025-01-16 19:00:00-05:00
0022400573: Suns vs. Wizards @ 2025-01-16 19:00:00-05:00
0022400574: Cavaliers vs. Thunder @ 2025-01-16 19:30:00-05:00
0022400575: Clippers vs. Trail Blazers @ 2025-01-16 22:00:00-05:00
0022400576: Rockets vs. Kings @ 2025-01-16 22:00:00-05:00


In [15]:
from nba_api.stats.endpoints import leaguedashteamstats

advancedGamefinder = leaguedashteamstats.LeagueDashTeamStats(
    season="2024-25", per_mode_detailed="PerGame"
)

team_stats = advancedGamefinder.get_data_frames()[0]

# Drop all columns related to ranks (not helpful for database)
keyword = "RANK"
columns_to_drop = [col for col in team_stats.columns if keyword in col]
team_stats.drop(columns=columns_to_drop, inplace=True)

In [68]:
# Cleaning data in team_stats and matching to games_modern

# Define three of four factors (EFG%, TOV%, FTR)

# Define 'EFG%' (Effective Field Goal Percentage)
team_stats["EFG%"] = (team_stats["FGM"] + (0.5 * team_stats["FG3M"])) / team_stats[
    "FGA"
]

# Define 'TOV%' (Turnover Percentage)
team_stats["TOV%"] = team_stats["TOV"] / (
    team_stats["FGA"] + 0.44 * team_stats["FTA"] + team_stats["TOV"]
)

# Define 'FTR' (Free Throw Attempt Rate)
team_stats["FTR"] = team_stats["FTA"] / team_stats["FGA"]

# Define 'TS%' (True Shooting Percentage)
team_stats["TS%"] = team_stats["PTS"] / (
    2 * (team_stats["FGA"] + (0.44 * team_stats["FTA"]))
)

# Label encode team ids
team_stats["TEAM_ID"] = le.fit_transform(team_stats["TEAM_ID"])

In [112]:
# Fetch data to predict upcoming games
games = board.games.get_dict()
# Empty list to store each prediction dataframe
prediction_dfs = []
for game in games:

    # Get home team (team) and away team (opponent) encoded ids
    team_id = ENCODED_TEAM_IDS.get(game["homeTeam"]["teamName"])
    opp_team_id = ENCODED_TEAM_IDS.get(game["awayTeam"]["teamName"])

    # Get relevant home team statistics for features

    home_condition = team_stats["TEAM_ID"] == team_id

    pts = team_stats.loc[home_condition, "PTS"].values[0]


    oreb = team_stats.loc[home_condition, "OREB"].values[0]

    dreb = team_stats.loc[home_condition, "DREB"].values[0]

    reb = team_stats.loc[home_condition, "REB"].values[0]

    ast = team_stats.loc[home_condition, "AST"].values[0]

    stl = team_stats.loc[home_condition, "STL"].values[0]


    blk = team_stats.loc[home_condition, "BLK"].values[0]

    tov = team_stats.loc[home_condition, "TOV"].values[0]


    efg_pct = team_stats.loc[home_condition, "EFG%"].values[0]

    tov_pct = team_stats.loc[home_condition, "TOV%"].values[0]

    ftr = team_stats.loc[home_condition, "FTR"].values[0]


    ts_pct = team_stats.loc[home_condition, "TS%"].values[0]


    # Get home game advantage (always 1 due to how we format data)

    hga = 1.0


    # Get last game outcome from games_modern


    filtered_games_modern = games_modern.loc[
        games_modern["TEAM_ID"] == team_id
    ]


    filtered_games_modern.sort_values(by="GAME_DATE", inplace=True)
    last_game_outcome = filtered_games_modern.iloc[0]["WIN"]

    # With all of the data, construct a dataframe with the game's information.
    prediction_data = {
        "TEAM_ID": [team_id],
        "OPP_TEAM_ID": [opp_team_id],
        "PTS": [pts],
        "OREB": [oreb],
        "DREB": [dreb],
        "REB": [reb],
        "AST": [ast],
        "STL": [stl],
        "BLK": [blk],
        "TOV": [tov],
        "EFG%": [efg_pct],
        "TOV%": [tov_pct],
        "FTR": [ftr],
        "TS%": [ts_pct],
        "HGA": [hga],
        "LAST_GAME_OUTCOME": [last_game_outcome],
    }
    game_prediction_df = pd.DataFrame(prediction_data)
    prediction_dfs.append(game_prediction_df)

# Will all game dataframes, concatenate into a single dataframe for easy simultaneous viewing    
all_game_predictions = pd.concat(prediction_dfs)

In [113]:
all_game_predictions


Unnamed: 0,TEAM_ID,OPP_TEAM_ID,PTS,OREB,DREB,REB,AST,STL,BLK,TOV,EFG%,TOV%,FTR,TS%,HGA,LAST_GAME_OUTCOME
0,28,17,112.4,11.0,33.6,44.6,25.8,7.4,5.1,15.8,0.546119,0.139261,0.223847,0.575489,1.0,0
0,27,19,108.3,10.8,33.2,43.9,25.0,7.7,5.4,16.3,0.510486,0.140775,0.222958,0.544287,1.0,0
0,23,2,116.2,10.1,33.7,43.8,26.0,11.6,5.9,12.2,0.545852,0.10856,0.212882,0.579956,1.0,0
0,20,9,108.4,12.5,30.8,43.3,23.2,8.1,5.6,16.4,0.517338,0.142778,0.230425,0.550455,1.0,1
0,21,8,116.1,10.8,33.9,44.7,26.4,8.2,4.8,13.2,0.542952,0.116205,0.240088,0.578233,1.0,1


In [124]:
for i in range(0, len(prediction_dfs)):
    print(rf_model.predict(prediction_dfs[i]))

[0]
[0]
[1]
[0]
[1]
