<a href="https://colab.research.google.com/github/Duppal147/Hackathon2025/blob/main/hackathon_pre_game_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

LINK: https://www.kaggle.com/datasets/nathanlauga/nba-games

To build a machine learning model for predicting NBA game outcomes using historical Kaggle data, follow these steps:

Data Collection and Preparation
Obtain the NBA dataset from Kaggle, which includes comprehensive information on teams, players, and games.

Clean and preprocess the data:

Remove any irrelevant or redundant features

Handle missing values

Encode categorical variables

Normalize numerical features
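
The four cleaning steps above can be sketched on a small made-up table. This is only an illustration: the `toy` frame and its values are invented, and the choice to drop rows (rather than impute) and to one-hot encode team IDs is an assumption, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a few rows of games.csv (values are made up for illustration)
toy = pd.DataFrame({
    "GAME_STATUS_TEXT": ["Final", "Final", "Final", "Final"],
    "HOME_TEAM_ID": [1, 2, 1, 3],
    "PTS_home": [101.0, np.nan, 95.0, 110.0],
    "FG_PCT_home": [0.45, 0.50, np.nan, 0.48],
    "HOME_TEAM_WINS": [1, 0, 0, 1],
})

# 1. Remove a feature that carries no signal (same value in every row)
cleaned = toy.drop(columns=["GAME_STATUS_TEXT"])

# 2. Handle missing values: here, drop rows with missing box-score stats
cleaned = cleaned.dropna(subset=["PTS_home", "FG_PCT_home"]).reset_index(drop=True)

# 3. Encode the categorical team ID as one-hot indicator columns
cleaned = pd.get_dummies(cleaned, columns=["HOME_TEAM_ID"], prefix="home_team")

# 4. Normalize numerical features to zero mean and unit variance
numeric_cols = ["PTS_home", "FG_PCT_home"]
cleaned[numeric_cols] = StandardScaler().fit_transform(cleaned[numeric_cols])

print(cleaned)
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that test statistics do not influence training.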

Create relevant features:

Calculate team performance metrics (e.g., average points scored, rebounds, assists)

Compute player statistics

Generate features based on recent performance (e.g., last 10 games)
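
A recent-performance feature could look roughly like the sketch below. The `toy_games` frame, the single `TEAM_ID`/`PTS` layout, and the `PTS_last_10_avg` column name are all hypothetical. The key detail is `shift(1)`, which keeps the current game out of its own average so the feature only uses information available before tip-off.

```python
import pandas as pd

# Hypothetical long-format table: one row per team per game
toy_games = pd.DataFrame({
    "GAME_DATE_EST": pd.to_datetime(
        ["2022-01-01", "2022-01-03", "2022-01-05", "2022-01-02", "2022-01-04"]
    ),
    "TEAM_ID": [1, 1, 1, 2, 2],
    "PTS": [100, 110, 120, 90, 95],
})

# Rolling windows only make sense in chronological order within each team
toy_games = toy_games.sort_values(["TEAM_ID", "GAME_DATE_EST"]).reset_index(drop=True)

WINDOW = 10  # "last 10 games"

# Average of up to WINDOW previous games; NaN for a team's first game
toy_games["PTS_last_10_avg"] = (
    toy_games.groupby("TEAM_ID")["PTS"]
    .transform(lambda s: s.shift(1).rolling(WINDOW, min_periods=1).mean())
)

print(toy_games)
```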

Feature Selection
Identify key performance indicators that influence game outcomes:

Field goal percentage

Defensive rebounds

Turnovers

Assists

Three-point shooting percentage

Use feature importance techniques like correlation analysis or SHAP (SHapley Additive exPlanations) to select the most relevant features.
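
As a rough illustration of the correlation-analysis option, the sketch below ranks features by the absolute value of their correlation with the target. The data is synthetic and the feature names (`fg_pct_diff`, `ast_diff`, `noise`) are placeholders chosen for this example. SHAP is not shown because it needs an extra library.

```python
import numpy as np
import pandas as pd

# Synthetic data: two features carry signal about the outcome, one is pure noise
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "fg_pct_diff": signal + rng.normal(scale=0.5, size=n),
    "ast_diff": 0.5 * signal + rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["HOME_TEAM_WINS"] = (signal > 0).astype(int)

# Absolute Pearson correlation of each feature with the target, strongest first
correlations = (
    toy.corr()["HOME_TEAM_WINS"]
    .drop("HOME_TEAM_WINS")
    .abs()
    .sort_values(ascending=False)
)
print(correlations)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is best treated as a first filter rather than a final selection.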

Model Selection and Training
Split the data into training and testing sets (e.g., 80:20 ratio).

Choose and implement machine learning algorithms:

Logistic Regression

Random Forest Classifier

XGBoost Classifier

Support Vector Classifier

Gaussian Naïve Bayes

Train the models using the training data.

Perform hyperparameter tuning using techniques like grid search or Bayesian optimization.
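
The split, train, and tune steps above might be sketched as follows. The data comes from scikit-learn's `make_classification` as a stand-in for real game features, only three of the listed algorithms are shown, and the `max_depth` grid is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a feature matrix and a home-win target
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train several candidate models and record their test accuracy
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gaussian_nb": GaussianNB(),
}
test_accuracy = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    test_accuracy[name] = clf.score(X_test, y_test)
print(test_accuracy)

# Grid search over one hyperparameter, scored by 5-fold CV on the training set
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```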

Model Evaluation
Evaluate model performance using metrics such as:

Accuracy

Precision

Recall

F1 Score

AUC (Area Under the Curve)

Use cross-validation techniques (e.g., 10-fold cross-validation) to ensure robust performance assessment.
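
One way to compute all five metrics under 10-fold cross-validation is scikit-learn's `cross_validate`, sketched here on synthetic data. The logistic regression model is just a placeholder for whichever classifier is being evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data standing in for game features
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Scorer names understood by scikit-learn for the metrics listed above
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=10, scoring=scoring
)

# Average each metric over the 10 folds
mean_scores = {metric: cv_results[f"test_{metric}"].mean() for metric in scoring}
for metric, value in mean_scores.items():
    print(f"{metric}: {value:.3f}")
```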

Model Refinement
Analyze feature importance to understand which factors contribute most to the predictions.

Consider ensemble methods or stacking to combine multiple models for improved performance.

Implement techniques like rolling averages or time-based features to capture recent team performance.
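
For the stacking idea, a minimal sketch with scikit-learn's `StackingClassifier` is shown below. The base models, the logistic regression meta-learner, and the synthetic data are assumptions made for illustration, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for game features and outcomes
X, y = make_classification(n_samples=400, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models make out-of-fold predictions; the final estimator learns to combine them
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked accuracy: {stack_accuracy:.3f}")
```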

Deployment and Prediction
Select the best-performing model based on evaluation metrics.

Implement the model in a production environment.

Use the model to predict outcomes of upcoming NBA games by inputting the latest team and player statistics.

Continuously monitor and update the model with new data to maintain its accuracy over time.
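
A very small version of the save, reload, and predict cycle could look like the sketch below. It assumes a single `ranking_diff`-style input, uses `joblib` for persistence, and writes to a temporary directory. The file name `pre_game_model.joblib` and the training data are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: home team tends to win when its win-ratio edge is positive
rng = np.random.default_rng(0)
ranking_diff = rng.uniform(-0.5, 0.5, size=(200, 1))
home_win = (ranking_diff[:, 0] + rng.normal(scale=0.2, size=200) > 0).astype(int)

# Bundle scaling and the model so both are saved and reloaded together
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(ranking_diff, home_win)

# Persist the fitted pipeline, then load it back as a serving process would
model_path = os.path.join(tempfile.mkdtemp(), "pre_game_model.joblib")
joblib.dump(pipeline, model_path)
loaded = joblib.load(model_path)

# Predict an upcoming game where the home team has a 0.25 win-ratio advantage
upcoming = np.array([[0.25]])
home_win_probability = loaded.predict_proba(upcoming)[0, 1]
print(f"Predicted home win probability: {home_win_probability:.2f}")
```

Saving the whole pipeline rather than the bare model avoids having to re-create the scaler separately at prediction time.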

By following these steps, you can build a machine learning model that predicts NBA game outcomes from historical Kaggle data. Keep in mind that the accuracy of such models typically ranges from 65% to 70%, so they can provide useful insight but are not perfect predictors of game results.




In [3]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report
import os

In [2]:
games = pd.read_csv('https://raw.githubusercontent.com/Duppal147/Hackathon2025/refs/heads/main/games.csv')
rankings = pd.read_csv('https://raw.githubusercontent.com/Duppal147/Hackathon2025/refs/heads/main/ranking.csv')
teams = pd.read_csv('https://raw.githubusercontent.com/Duppal147/Hackathon2025/refs/heads/main/teams.csv')

In [5]:
games.columns


Index(['GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT', 'HOME_TEAM_ID',
       'VISITOR_TEAM_ID', 'SEASON', 'TEAM_ID_home', 'PTS_home', 'FG_PCT_home',
       'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home', 'TEAM_ID_away',
       'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away',
       'REB_away', 'HOME_TEAM_WINS'],
      dtype='object')

The target variable y is HOME_TEAM_WINS, the outcome the model is trying to predict.

Field G

In [6]:
# Check for missing values in each dataframe
print("Missing values in games.csv:\n", games.isnull().sum())


Missing values in games.csv:
 GAME_DATE_EST        0
GAME_ID              0
GAME_STATUS_TEXT     0
HOME_TEAM_ID         0
VISITOR_TEAM_ID      0
SEASON               0
TEAM_ID_home         0
PTS_home            99
FG_PCT_home         99
FT_PCT_home         99
FG3_PCT_home        99
AST_home            99
REB_home            99
TEAM_ID_away         0
PTS_away            99
FG_PCT_away         99
FT_PCT_away         99
FG3_PCT_away        99
AST_away            99
REB_away            99
HOME_TEAM_WINS       0
dtype: int64


In [7]:
columns_to_check = [
    'PTS_home', 'FG_PCT_home', 'FG3_PCT_home', 'FT_PCT_home', 'AST_home', 'REB_home',
    'FG_PCT_away', 'FG3_PCT_away', 'FT_PCT_away', 'AST_away', 'REB_away', 'PTS_away'
]

games = games.dropna(subset=columns_to_check)


In [8]:
games.isnull().sum()

GAME_DATE_EST       0
GAME_ID             0
GAME_STATUS_TEXT    0
HOME_TEAM_ID        0
VISITOR_TEAM_ID     0
SEASON              0
TEAM_ID_home        0
PTS_home            0
FG_PCT_home         0
FT_PCT_home         0


In [9]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26552 entries, 0 to 26650
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   GAME_DATE_EST     26552 non-null  object 
 1   GAME_ID           26552 non-null  int64  
 2   GAME_STATUS_TEXT  26552 non-null  object 
 3   HOME_TEAM_ID      26552 non-null  int64  
 4   VISITOR_TEAM_ID   26552 non-null  int64  
 5   SEASON            26552 non-null  int64  
 6   TEAM_ID_home      26552 non-null  int64  
 7   PTS_home          26552 non-null  float64
 8   FG_PCT_home       26552 non-null  float64
 9   FT_PCT_home       26552 non-null  float64
 10  FG3_PCT_home      26552 non-null  float64
 11  AST_home          26552 non-null  float64
 12  REB_home          26552 non-null  float64
 13  TEAM_ID_away      26552 non-null  int64  
 14  PTS_away          26552 non-null  float64
 15  FG_PCT_away       26552 non-null  float64
 16  FT_PCT_away       26552 non-null  float64
 17

In [10]:

print("Initial Data Info:")
print(rankings.info())
print("\nFirst 5 rows:")
# rankings = rankings.drop(['CONFERENCE', 'RETURNTOPLAY'], axis=1)
# Already removed CONFERENCE and RETURNTOPLAY
print(rankings.head())

# TODO: convert HOME_RECORD and ROAD_RECORD (object dtype) to integers
print(games)

# TODO: keep STANDINGSDATE and TEAM in the frame, but exclude them from the model features

Initial Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210342 entries, 0 to 210341
Data columns (total 13 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   TEAM_ID        210342 non-null  int64  
 1   LEAGUE_ID      210342 non-null  int64  
 2   SEASON_ID      210342 non-null  int64  
 3   STANDINGSDATE  210342 non-null  object 
 4   CONFERENCE     210342 non-null  object 
 5   TEAM           210342 non-null  object 
 6   G              210342 non-null  int64  
 7   W              210342 non-null  int64  
 8   L              210342 non-null  int64  
 9   W_PCT          210342 non-null  float64
 10  HOME_RECORD    210342 non-null  object 
 11  ROAD_RECORD    210342 non-null  object 
 12  RETURNTOPLAY   3990 non-null    float64
dtypes: float64(2), int64(6), object(5)
memory usage: 20.9+ MB
None

First 5 rows:
           TEAM_ID  LEAGUE_ID  SEASON_ID STANDINGSDATE CONFERENCE  \
0       1610612743          0      22022   
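
The note above about turning HOME_RECORD and ROAD_RECORD into integers could be handled along these lines. This sketch assumes the records are stored as "wins-losses" strings such as "19-4", which is an assumption about the column format, not something shown in the output above. The `toy_rankings` frame and the new column names (`HOME_W`, `HOME_L`, `ROAD_W`, `ROAD_L`) are hypothetical.

```python
import pandas as pd

# Hypothetical rows, assuming records are "wins-losses" strings
toy_rankings = pd.DataFrame({
    "TEAM_ID": [1, 2],
    "HOME_RECORD": ["19-4", "10-12"],
    "ROAD_RECORD": ["15-8", "7-16"],
})

for col in ["HOME_RECORD", "ROAD_RECORD"]:
    prefix = col.split("_")[0]  # "HOME" or "ROAD"
    # Split "19-4" into two columns and cast both to integers
    wins_losses = toy_rankings[col].str.split("-", expand=True).astype(int)
    toy_rankings[f"{prefix}_W"] = wins_losses[0]
    toy_rankings[f"{prefix}_L"] = wins_losses[1]

print(toy_rankings)
```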

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Parse dates so each game can be matched to the standings that preceded it
games['GAME_DATE_EST'] = pd.to_datetime(games['GAME_DATE_EST'])
rankings['STANDINGSDATE'] = pd.to_datetime(rankings['STANDINGSDATE'])

# Create a simple feature: win ratio from the standings (NaN when W + L == 0)
rankings['win_ratio'] = rankings['W'] / (rankings['W'] + rankings['L'])

# Merge ranking data with games data. rankings has one row per team per date,
# so merging on TEAM_ID alone would duplicate every game many times. Instead,
# take each team's most recent standings strictly before the game date
# (assumption: standings on a date may already include that day's games).
standings = (rankings[['TEAM_ID', 'STANDINGSDATE', 'win_ratio']]
             .drop_duplicates(['TEAM_ID', 'STANDINGSDATE'])
             .sort_values('STANDINGSDATE'))
games = games.sort_values('GAME_DATE_EST')
for side, id_col in [('home', 'HOME_TEAM_ID'), ('away', 'VISITOR_TEAM_ID')]:
    side_standings = standings.rename(columns={
        'TEAM_ID': id_col,
        'STANDINGSDATE': f'standings_date_{side}',
        'win_ratio': f'win_ratio_{side}',
    })
    games = pd.merge_asof(games, side_standings,
                          left_on='GAME_DATE_EST', right_on=f'standings_date_{side}',
                          by=id_col, allow_exact_matches=False)

# Calculate ranking difference
games['ranking_diff'] = games['win_ratio_home'] - games['win_ratio_away']

# Define target variable (home team win)
games['home_win'] = (games['PTS_home'] > games['PTS_away']).astype(int)

# Post-game differentials: these describe the game being predicted, so they
# are kept for analysis only and must not be used as pre-game features
games['score_diff'] = games['PTS_home'] - games['PTS_away']  # Difference in points scored
games['fg_pct_diff'] = games['FG_PCT_home'] - games['FG_PCT_away']  # Field Goal % difference
games['ft_pct_diff'] = games['FT_PCT_home'] - games['FT_PCT_away']  # Free Throw % difference
games['fg3_pct_diff'] = games['FG3_PCT_home'] - games['FG3_PCT_away']  # 3-Point % difference

# Rolling averages over each team's previous 5 home (or away) games;
# shift(1) excludes the current game so the value is known before tip-off
for stat in ['PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB']:
    for team_type in ['home', 'away']:
        games[f'{stat}_{team_type}_last_5_avg'] = (
            games.groupby(f'TEAM_ID_{team_type}')[f'{stat}_{team_type}']
            .transform(lambda s: s.shift(1).rolling(window=5, min_periods=1).mean())
        )

# Select features and target, dropping games that have no prior standings
model_data = games.dropna(subset=['ranking_diff'])
X = model_data[['ranking_diff']]  # Use only ranking difference as a feature
y = model_data['home_win']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Training
model = LogisticRegression()
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


In [None]:
print("Columns in games DataFrame:")
print(games.columns)

print("\nColumns in rankings DataFrame:")
print(rankings.columns)