<a href="https://colab.research.google.com/github/Tcberck1/Trenton-Berck/blob/main/NBA_Sports_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NBA Game Prediction Model**

This project predicts the outcomes of NBA games with machine learning. It collects team statistics such as field goal percentage, rebounds, turnovers, steals, blocks, and efficiency ratings, then trains a Random Forest model to estimate the expected score differential and, from its sign, the winner. The model trained in this notebook uses each team's net rating and winning percentage as features.

The ultimate goal is to develop an accurate and reliable model that can be used for sports analytics and potential betting applications. This notebook will walk through the data collection, preprocessing, feature engineering, model training, and evaluation steps required to build the prediction system.

The first step is to install the necessary Python libraries for data extraction, manipulation, and statistical analysis.

In [None]:
!pip install nba-api
!pip install pandas
!pip install statsmodels

Collecting nba-api
  Downloading nba_api-1.7.0-py3-none-any.whl.metadata (5.5 kB)
Downloading nba_api-1.7.0-py3-none-any.whl (280 kB)
Installing collected packages: nba-api
Successfully installed nba-api-1.7.0


## Fetching and Organizing NBA Game Data

This section retrieves NBA game data for a specified date using the nba-api library. The ScoreboardV2 endpoint is used to gather live or historical game details, including team names, locations, and final scores.

- The game date is set to '2025-02-09'.
- The ScoreboardV2 endpoint fetches all games played on that date.
- The LineScore data is extracted into a pandas DataFrame with one row per team per game.
- A dictionary (games) stores each game. The first row seen for a game is labeled home and the second away; this ordering is an assumption, since LineScore does not mark home and away (GameHeader's HOME_TEAM_ID and VISITOR_TEAM_ID do).
- The results are printed, showing each team's abbreviation, city, and final score.

This structured game data serves as the foundation for further analysis and model training.

In [None]:
from nba_api.stats.endpoints import ScoreboardV2

# Specify the game date
game_date = '2025-02-09'

# Fetch the game data for the specified date
scoreboard = ScoreboardV2(game_date=game_date)

# Extract the LineScore data
line_score_df = scoreboard.line_score.get_data_frame()

# Dictionary to store games
games = {}

# Iterate over the rows and group teams by GAME_ID
for _, row in line_score_df.iterrows():
    game_id = row["GAME_ID"]

    if game_id not in games:
        games[game_id] = {"home": None, "away": None}

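    # LineScore does not mark home or away. Assumption: the first row seen
    # for a game is the home team. Check against GameHeader's HOME_TEAM_ID /
    # VISITOR_TEAM_ID before relying on these labels.
    if games[game_id]["home"] is None:
        games[game_id]["home"] = {
            "team": row["TEAM_ABBREVIATION"],
            "city": row["TEAM_CITY_NAME"],
            "points": row["PTS"]
        }
    else:
        # The second row for the same game is treated as the away team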
        games[game_id]["away"] = {
            "team": row["TEAM_ABBREVIATION"],
            "city": row["TEAM_CITY_NAME"],
            "points": row["PTS"]
        }

# Display the results
for game_id, teams in games.items():
    print(f"Game ID: {game_id}")
    print(f"Home Team: {teams['home']['team']} ({teams['home']['city']}) - {teams['home']['points']} PTS")
    print(f"Away Team: {teams['away']['team']} ({teams['away']['city']}) - {teams['away']['points']} PTS")
    print("-" * 40)


Game ID: 0022400752
Home Team: CHA (Charlotte) - 102 PTS
Away Team: DET (Detroit) - 112 PTS
----------------------------------------
Game ID: 0022400753
Home Team: TOR (Toronto) - 87 PTS
Away Team: HOU (Houston) - 94 PTS
----------------------------------------
Game ID: 0022400754
Home Team: PHI (Philadelphia) - 127 PTS
Away Team: MIL (Milwaukee) - 135 PTS
----------------------------------------
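
The cell above labels the first `LineScore` row of each game as the home team. That ordering is an assumption; the endpoint may list the visiting team first. `ScoreboardV2` also returns a `GameHeader` table with `HOME_TEAM_ID` and `VISITOR_TEAM_ID` columns, which can settle the assignment. A minimal sketch, assuming both tables have been converted to lists of dicts (for example with `DataFrame.to_dict("records")`); the helper name `assign_home_away` and the sample rows are illustrative and not part of nba-api:

```python
def assign_home_away(header_rows, line_score_rows):
    """Pair each game's two line-score rows using GameHeader's team IDs."""
    # Index line-score rows by (GAME_ID, TEAM_ID) for direct lookup
    by_game_team = {(r["GAME_ID"], r["TEAM_ID"]): r for r in line_score_rows}

    games = {}
    for h in header_rows:
        home = by_game_team.get((h["GAME_ID"], h["HOME_TEAM_ID"]))
        away = by_game_team.get((h["GAME_ID"], h["VISITOR_TEAM_ID"]))
        if home is None or away is None:
            continue  # skip games with an incomplete line score
        games[h["GAME_ID"]] = {
            "home": {"team": home["TEAM_ABBREVIATION"], "points": home["PTS"]},
            "away": {"team": away["TEAM_ABBREVIATION"], "points": away["PTS"]},
        }
    return games


# Illustrative rows (hypothetical IDs and scores), with the visitor listed first
header = [{"GAME_ID": "G1", "HOME_TEAM_ID": 2, "VISITOR_TEAM_ID": 1}]
line_score = [
    {"GAME_ID": "G1", "TEAM_ID": 1, "TEAM_ABBREVIATION": "AAA", "PTS": 101},
    {"GAME_ID": "G1", "TEAM_ID": 2, "TEAM_ABBREVIATION": "BBB", "PTS": 95},
]
games = assign_home_away(header, line_score)
print(games["G1"]["home"]["team"])  # BBB, regardless of row order
```

Because the lookup is keyed on team ID, the result does not depend on the order in which rows arrive.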


## Storing NBA Team PPG (Points Per Game) in a Database

This section retrieves and stores NBA team statistics in an SQLite database, specifically focusing on Points Per Game (PPG) for the 2024-25 season.

**Database Setup:**

- Connects to an SQLite database (nba_stats.db).
- Drops the existing team_ppg table (if it exists) to prevent schema mismatches.
- Creates a new table with columns for team ID, team abbreviation, and PPG.

**Fetching and Processing Data:**

- Uses nba-api to fetch team statistics for the 2024-25 season.
- Extracts total points (PTS) and games played (GP).
- Calculates PPG (PTS / GP) for each team.
- Looks up each team's abbreviation in the NBA API's static team data.

**Storing Data in SQLite:**

- Iterates through the team stats and inserts the computed PPG into the database.
- Commits the inserts and prints a confirmation message for each team.

The stored data can feed feature engineering for the prediction model, giving it a measure of each team's offensive output.

In [None]:
import pandas as pd
import sqlite3
from nba_api.stats.endpoints import leaguedashteamstats
from nba_api.stats.static import teams

# Connect to SQLite database
conn = sqlite3.connect("nba_stats.db")
cursor = conn.cursor()

# Drop existing table if needed (to avoid schema mismatches)
cursor.execute("DROP TABLE IF EXISTS team_ppg")
conn.commit()

# Create a fresh table for storing PPG
cursor.execute("""
CREATE TABLE IF NOT EXISTS team_ppg (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    team_id INTEGER,
    team_abbr TEXT,
    ppg REAL
)
""")
conn.commit()

# Fetch season totals (not per-game averages) so that PTS / GP yields PPG
team_stats = leaguedashteamstats.LeagueDashTeamStats(
    season="2024-25",
    per_mode_detailed="Totals"
).get_data_frames()[0]

# Keep only team ID, total points, and games played
team_stats = team_stats[['TEAM_ID', 'PTS', 'GP']]

# Build the ID -> abbreviation lookup once instead of once per row
abbr_by_id = {team["id"]: team["abbreviation"] for team in teams.get_teams()}

# Insert data into the database
for _, row in team_stats.iterrows():
    # Cast NumPy scalars to built-in types so sqlite3 stores them as INTEGER/REAL
    team_id = int(row["TEAM_ID"])
    games_played = int(row["GP"])
    total_points = float(row["PTS"])

    # Compute PPG, guarding against teams with no games played
    ppg = total_points / games_played if games_played > 0 else 0.0

    # Find the team abbreviation for this team_id
    team_abbr = abbr_by_id.get(team_id, "N/A")

    cursor.execute("""
        INSERT INTO team_ppg (team_id, team_abbr, ppg)
        VALUES (?, ?, ?)
    """, (team_id, team_abbr, ppg))

    print(f"Stored Team ID {team_id} ({team_abbr}) data: PPG={ppg:.2f}")

# Commit all inserts at once, then close the database connection
conn.commit()
conn.close()
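
Later steps can read the stored averages back with an ordinary query. The sketch below is self-contained: it uses an in-memory database and made-up rows (the team IDs, abbreviations, and PPG values are illustrative), inserts them in one `executemany` call, and looks up two teams. The helper name `load_ppg` is hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves nba_stats.db untouched
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE team_ppg ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, team_id INTEGER, team_abbr TEXT, ppg REAL)"
)

# Illustrative rows only: (team_id, team_abbr, ppg)
sample_rows = [
    (1, "AAA", 110.5),
    (2, "BBB", 104.0),
]
conn.executemany(
    "INSERT INTO team_ppg (team_id, team_abbr, ppg) VALUES (?, ?, ?)",
    sample_rows,
)
conn.commit()  # one commit for the whole batch


def load_ppg(connection, team_abbrs):
    """Return {team_abbr: ppg} for the requested teams."""
    placeholders = ", ".join("?" for _ in team_abbrs)
    query = f"SELECT team_abbr, ppg FROM team_ppg WHERE team_abbr IN ({placeholders})"
    return dict(connection.execute(query, list(team_abbrs)).fetchall())


ppg = load_ppg(conn, ["AAA", "BBB"])
ppg_gap = ppg["AAA"] - ppg["BBB"]  # a simple offensive-strength feature
print(ppg, ppg_gap)
conn.close()
```

Only the placeholders are interpolated into the SQL string; the team abbreviations themselves are passed as bound parameters.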


## Collecting and Storing NBA Game Outcomes

This section retrieves NBA game results for the 2024-25 season and stores them in a CSV file for further analysis. The dataset will be used to train the prediction model by providing historical game outcomes.

**Process Overview:**

1. Initialize Data Storage:
   - Defines the output columns: game date, home and away team IDs, and final scores.
2. Define the Season Date Range:
   - Loops through all dates from October 24, 2024, to February 10, 2025.
3. Fetch Game Data from nba-api:
   - Uses ScoreboardV2 to retrieve daily game results.
   - Extracts team abbreviations, scores, and game IDs.
   - Maps abbreviations to team IDs for consistency.
4. Store Completed Game Results:
   - Ensures both teams have scores before storing results.
   - Adds each day's completed games to the accumulated results.
5. Save Data for Future Use:
   - Exports the final dataset to game_outcomes.csv for model training.

**Additional Features:**

- Implements error handling to catch API failures.
- Introduces a 1-second delay between requests to avoid rate limits.

This dataset supplies the historical results and score differentials used to train the prediction model.
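
The date loop and the error handling can be separated from the API call itself. A minimal sketch, assuming a caller-supplied `fetch` function (hypothetical; in this notebook it would wrap `ScoreboardV2`) and the illustrative helper names `date_range` and `fetch_with_retry`. Unlike the cell below, which logs a failed date and moves on, this variant retries a few times with a growing delay before giving up:

```python
import time
from datetime import date, timedelta


def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def fetch_with_retry(fetch, date_str, retries=3, base_delay=1.0):
    """Call fetch(date_str); on failure wait base_delay, then 2x, 4x, and retry."""
    for attempt in range(retries):
        try:
            return fetch(date_str)
        except Exception:  # broad on purpose: network and parsing errors alike
            if attempt == retries - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Stand-in for the real API call: fails once, then succeeds
calls = []


def flaky_fetch(date_str):
    calls.append(date_str)
    if len(calls) == 1:
        raise ConnectionError("temporary failure")
    return f"scoreboard for {date_str}"


days = [d.isoformat() for d in date_range(date(2024, 10, 24), date(2024, 10, 26))]
result = fetch_with_retry(flaky_fetch, days[0], base_delay=0.01)
print(days)
print(result)
```

Whether to retry or skip a failed date is a design choice: skipping keeps a long crawl moving, while retrying avoids silent gaps in the dataset.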

In [None]:
import time
from datetime import datetime, timedelta

import pandas as pd
from nba_api.stats.endpoints import ScoreboardV2
from nba_api.stats.static import teams

# Columns of the output dataset
columns = ['game_date', 'home_team_id', 'away_team_id', 'home_score', 'away_score']

# Date range to collect: Oct 24, 2024 through Feb 10, 2025 (inclusive)
start_date = datetime(2024, 10, 24)
end_date = datetime(2025, 2, 10)

# Map team abbreviations to team IDs
team_ids = {team["abbreviation"]: team["id"] for team in teams.get_teams()}

# Collect one dict per completed game; the DataFrame is built once at the end
rows = []

# Loop through every date in the range
current_date = start_date
while current_date <= end_date:
    date_str = current_date.strftime("%Y-%m-%d")  # Format date

    try:
        # Fetch game data for the date
        scoreboard = ScoreboardV2(game_date=date_str)
        line_score_df = scoreboard.line_score.get_data_frame()

        # Group the two LineScore rows of each game by GAME_ID
        games = {}
        for _, row in line_score_df.iterrows():
            game = games.setdefault(row["GAME_ID"], {"home": None, "away": None})
            # Assumption: the first row seen for a game is the home team.
            # LineScore does not mark home/away; GameHeader's HOME_TEAM_ID does.
            side = "home" if game["home"] is None else "away"
            game[side] = {"team_abbr": row["TEAM_ABBREVIATION"], "points": row["PTS"]}

        # Named `matchup` so the imported `teams` module is not shadowed
        for matchup in games.values():
            home, away = matchup["home"], matchup["away"]
            if home is None or away is None:
                continue

            # PTS is missing (None/NaN) until a game has been played
            if pd.isna(home["points"]) or pd.isna(away["points"]):
                continue

            home_team_id = team_ids.get(home["team_abbr"])
            away_team_id = team_ids.get(away["team_abbr"])
            home_score = int(home["points"])
            away_score = int(away["points"])

            rows.append({
                'game_date': date_str,
                'home_team_id': home_team_id,
                'away_team_id': away_team_id,
                'home_score': home_score,
                'away_score': away_score
            })

            print(f"Stored Team ID {home_team_id} vs Team ID {away_team_id}: {home_score}-{away_score} on {date_str}")

    except Exception as e:
        print(f"Error processing {date_str}: {e}")

    # Add a delay of 1 second between requests to avoid rate limiting
    time.sleep(1)

    # Move to the next day
    current_date += timedelta(days=1)

# Build the DataFrame once and save it to a CSV file
game_outcomes_df = pd.DataFrame(rows, columns=columns)
game_outcomes_df.to_csv('game_outcomes.csv', index=False)

print("Game outcomes have been saved to 'game_outcomes.csv'.")


Stored Team ID 1610612738 vs Team ID 1610612764: 122-102 on 2024-10-24
Stored Team ID 1610612759 vs Team ID 1610612742: 109-120 on 2024-10-24
Stored Team ID 1610612760 vs Team ID 1610612743: 102-87 on 2024-10-24
Stored Team ID 1610612750 vs Team ID 1610612758: 117-115 on 2024-10-24
Stored Team ID 1610612751 vs Team ID 1610612753: 101-116 on 2024-10-25
Stored Team ID 1610612755 vs Team ID 1610612761: 107-115 on 2024-10-25
Stored Team ID 1610612766 vs Team ID 1610612737: 120-125 on 2024-10-25
Stored Team ID 1610612765 vs Team ID 1610612739: 101-113 on 2024-10-25
Stored Team ID 1610612754 vs Team ID 1610612752: 98-123 on 2024-10-25
Stored Team ID 1610612763 vs Team ID 1610612745: 108-128 on 2024-10-25
Stored Team ID 1610612741 vs Team ID 1610612749: 133-122 on 2024-10-25
Stored Team ID 1610612744 vs Team ID 1610612762: 127-86 on 2024-10-25
Stored Team ID 1610612756 vs Team ID 1610612747: 116-123 on 2024-10-25
Stored Team ID 1610612740 vs Team ID 1610612757: 105-103 on 2024-10-25
Stored Te

## Fetching and Saving NBA Team Statistics

This section retrieves both basic and advanced team statistics for the 2024-25 NBA season and saves them to a CSV file for use in the prediction model.

**Process Overview:**

1. Fetch Basic Team Stats:
   - Uses the LeagueDashTeamStats endpoint to collect per-game averages.
   - Extracts key metrics:
     - Field Goal Percentage (FG_PCT)
     - Offensive and Defensive Rebounds (OREB, DREB)
     - Turnovers (TOV), Steals (STL), Blocks (BLK)
     - Winning Percentage (W_PCT)
2. Fetch Advanced Team Stats:
   - Retrieves Offensive Rating (OFF_RATING), Defensive Rating (DEF_RATING), and Net Rating (NET_RATING) to measure team efficiency.
3. Merge Data:
   - Combines basic and advanced statistics by matching TEAM_ID.
4. Save to CSV:
   - Exports the final dataset to team_stats.csv for model training.

This dataset summarizes each team's season performance and supplies the features used for game predictions.
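
An inner merge on `TEAM_ID` silently drops teams that appear in only one table, so it is worth checking the result. A minimal sketch with two tiny made-up frames (the IDs and values are illustrative); `validate="one_to_one"` asks pandas to raise an error if a `TEAM_ID` is duplicated on either side:

```python
import pandas as pd

# Illustrative stand-ins for the base and advanced stats tables
base = pd.DataFrame({"TEAM_ID": [1, 2, 3], "W_PCT": [0.700, 0.500, 0.300]})
advanced = pd.DataFrame({"TEAM_ID": [1, 2, 3], "NET_RATING": [6.1, 0.2, -5.4]})

merged = pd.merge(base, advanced, on="TEAM_ID", how="inner", validate="one_to_one")

# Every team from the base table should survive the merge
missing = set(base["TEAM_ID"]) - set(merged["TEAM_ID"])
print(merged.shape, missing)
```

For the real data, an empty `missing` set and one row per NBA team would indicate that both endpoints returned the same set of teams.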

In [None]:
from nba_api.stats.endpoints import LeagueDashTeamStats
import pandas as pd

# Fetch base stats
team_base_stats = LeagueDashTeamStats(
    season='2024-25',
    measure_type_detailed_defense='Base',
    per_mode_detailed='PerGame'
).get_data_frames()[0]

team_base_stats = team_base_stats[['TEAM_ID', 'TEAM_NAME', 'FG_PCT', 'OREB', 'DREB', 'TOV', 'STL', 'BLK', 'W_PCT']]

# Fetch advanced stats
team_advanced_stats = LeagueDashTeamStats(
    season='2024-25',
    measure_type_detailed_defense='Advanced',
    per_mode_detailed='PerGame'
).get_data_frames()[0]

team_advanced_stats = team_advanced_stats[['TEAM_ID', 'OFF_RATING', 'DEF_RATING', 'NET_RATING']]

# Merge datasets on TEAM_ID
merged_team_stats = pd.merge(team_base_stats, team_advanced_stats, on='TEAM_ID')

# Save the merged data
merged_team_stats.to_csv('team_stats.csv', index=False)


## Training and Evaluating the NBA Score Prediction Model

This section builds a Random Forest regression model to predict NBA game score differentials based on team statistics.

**Process Overview:**

1. Load and Merge Data:
   - Reads game outcomes (game_outcomes.csv) and team stats (team_stats.csv).
   - Merges home and away team stats into the dataset.
   - Computes SCORE_DIFFERENCE (home score - away score).
2. Preprocessing:
   - Selects key predictive features:
     - HOME_TEAM_NET_RATING, AWAY_TEAM_NET_RATING
     - HOME_W_PCT, AWAY_W_PCT
   - Standardizes feature values with StandardScaler. Random forests are insensitive to feature scaling, so this step is optional here.
3. Train-Test Split:
   - Splits data into 80% training and 20% testing sets.
4. Hyperparameter Tuning with GridSearchCV:
   - Searches for the best combination of hyperparameters using 5-fold cross-validation.
   - Tunes the number of trees, tree depth, split and leaf sizes, and the number of features considered per split.
5. Train and Evaluate Model:
   - Uses the best Random Forest found by GridSearchCV, refit on the full training set.
   - Clips predictions to ±20 points to keep score margins realistic.
   - Evaluates using:
     - Mean Absolute Error (MAE)
     - Prediction Accuracy (correct winner %)

**Outcome:** This step tunes the Random Forest and reports how closely its predicted score differentials match held-out games and how often it picks the correct winner.
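
The two evaluation metrics are simple to state: MAE averages the absolute gap between predicted and actual margins, and winner accuracy checks whether both margins have the same sign (a positive margin means the home team won). A minimal sketch in plain Python on made-up margins; the function names are illustrative:

```python
def clip(values, low, high):
    """Limit each value to the range [low, high]."""
    return [max(low, min(high, v)) for v in values]


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted margins."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def winner_accuracy(y_true, y_pred):
    """Share of games where the predicted margin picks the actual winner."""
    hits = sum((t > 0) == (p > 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Illustrative margins (home score minus away score)
actual = [5, -3, 12, -8]
predicted = clip([2.0, 1.5, 25.0, -4.0], -20, 20)  # 25.0 becomes 20.0

mae = mean_abs_error(actual, predicted)
acc = winner_accuracy(actual, predicted)
print(mae, acc)  # 4.875 0.75
```

Clipping affects MAE but never winner accuracy, because it cannot change the sign of a prediction.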

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Read game outcomes and team stats data
game_outcomes_df = pd.read_csv('game_outcomes.csv')
team_stats_df = pd.read_csv('team_stats.csv')

# Merge datasets for home and away teams
game_outcomes_df = game_outcomes_df.merge(
    team_stats_df,
    left_on='home_team_id',
    right_on='TEAM_ID',
    how='left'
).rename(columns={
    'NET_RATING': 'HOME_TEAM_NET_RATING',
    'W_PCT': 'HOME_W_PCT'
})

game_outcomes_df = game_outcomes_df.merge(
    team_stats_df,
    left_on='away_team_id',
    right_on='TEAM_ID',
    how='left'
).rename(columns={
    'NET_RATING': 'AWAY_TEAM_NET_RATING',
    'W_PCT': 'AWAY_W_PCT'
})

# Drop unnecessary columns
game_outcomes_df.drop(columns=['TEAM_ID_x', 'TEAM_ID_y'], inplace=True)

# Create SCORE_DIFFERENCE column
game_outcomes_df['SCORE_DIFFERENCE'] = game_outcomes_df['home_score'] - game_outcomes_df['away_score']

# Drop rows with missing values
game_outcomes_df.dropna(inplace=True)

# Prepare features and target
X = game_outcomes_df[['HOME_TEAM_NET_RATING', 'AWAY_TEAM_NET_RATING', 'HOME_W_PCT', 'AWAY_W_PCT']]
y = game_outcomes_df['SCORE_DIFFERENCE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training set only so test-set statistics do not leak in
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the model
rf_model = RandomForestRegressor(random_state=42)

# Hyperparameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees
    'max_depth': [10, 20, None],  # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum samples required to be at a leaf node
    'max_features': [1.0, 'sqrt', 'log2']  # Features considered per split; 1.0 means all features and replaces 'auto', which scikit-learn 1.3 removed
}

# GridSearchCV with cross-validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters from GridSearchCV
print("Best parameters found by GridSearchCV:", grid_search.best_params_)

# Fit the best Random Forest model found by GridSearchCV
best_rf_model = grid_search.best_estimator_

# Make predictions
y_pred = best_rf_model.predict(X_test)
y_pred = y_pred.clip(-20, 20)  # Cap predictions to realistic score margins

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
accuracy = (y_test > 0).astype(int) == (y_pred > 0).astype(int)
accuracy_percentage = accuracy.mean() * 100

print(f"Improved MAE: {mae:.2f}")
print(f"Improved Accuracy: {accuracy_percentage:.2f}%")


405 fits failed out of a total of 1215.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
405 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
s

Best parameters found by GridSearchCV: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
Improved MAE: 11.42
Improved Accuracy: 64.10%
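
The failure count in the warning can be reproduced by counting grid points. The recorded run searched three values for each of five hyperparameters, including `max_features='auto'`; scikit-learn removed that option for random forests in version 1.3, so every fit using it is rejected during parameter validation. A minimal sketch of the arithmetic (no scikit-learn needed):

```python
from itertools import product

# The grid as it stood when the output above was recorded
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["auto", "sqrt", "log2"],
}
cv_folds = 5

# Enumerate every hyperparameter combination in the grid
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
failed_fits = sum(c["max_features"] == "auto" for c in candidates) * cv_folds
print(total_fits, failed_fits)  # 1215 405
```

Swapping `'auto'` for `1.0` (all features, which is what `'auto'` meant for regressors) keeps the same search space without the rejected fits.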
