## Abstract:
This project aims to predict NBA game outcomes using machine learning techniques. We collect and preprocess a comprehensive dataset spanning multiple seasons, including game statistics and player performance metrics. Through exploratory analysis and feature engineering, we identify the most influential factors in determining game results. We develop and evaluate various models, such as logistic regression, support vector machines, and XGBoost, using cross-validation and time-series splitting. Our approach incorporates SHAP for model interpretability and understanding feature importance. The ultimate goal is to provide accurate predictions and valuable insights to support decision-making and strategy in the NBA. The project's findings have implications for coaches, managers, analysts, and fans, offering a data-driven perspective on the factors that drive success in professional basketball.

## Introduction:

In this project, we apply machine learning to predict NBA game outcomes. Our goal is an accurate, robust model that gives coaches, managers, and analysts usable insight. Using historical game data, we examine how statistical indicators, in-season changes, and home-court advantage affect game results. The work proceeds through data preprocessing, feature engineering, and model evaluation.

In [1]:
### Libraries
import warnings
warnings.filterwarnings('ignore')


import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import time

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, TimeSeriesSplit, StratifiedKFold, cross_val_score, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from xgboost import XGBClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor
import shap
import os

ModuleNotFoundError: No module named 'shap'

In [None]:
# List of GitHub raw URLs
urls = [
    'https://raw.githubusercontent.com/JetendraMulinti/DAV-6150---DataScience/main/FinalProject-NBA_Prediction/Data/season_-2020.csv',
    'https://raw.githubusercontent.com/JetendraMulinti/DAV-6150---DataScience/main/FinalProject-NBA_Prediction/Data/season_-2021.csv',
    'https://raw.githubusercontent.com/JetendraMulinti/DAV-6150---DataScience/main/FinalProject-NBA_Prediction/Data/season_-2022.csv',
    'https://raw.githubusercontent.com/JetendraMulinti/DAV-6150---DataScience/main/FinalProject-NBA_Prediction/Data/season_-2023.csv'
]

all_cleaned_dataframes = []

for url in urls:
    try:
        # Load the season's CSV directly from its raw GitHub URL
        dataframe = pd.read_csv(url)
        
        # Filter out columns that start with 'Unnamed:'
        dataframe = dataframe.loc[:, ~dataframe.columns.str.startswith('Unnamed:')]

        # Drop all columns that are entirely NA
        dataframe = dataframe.dropna(axis=1, how='all')

        # Add the cleaned dataframe to the list
        all_cleaned_dataframes.append(dataframe)
        
        print(f"Processed data from {url}")

    except Exception as e:
        print(f"Error processing {url}: {e}")

# Concatenate all dataframes into one
df = pd.concat(all_cleaned_dataframes, ignore_index=True)


### delete some more columns
columns_to_delete = ['OT', 'OT_opp', '2OT', '3OT', '2OT_opp', '3OT_opp',
                     ## '4OT', '4OT_opp',
                    'mp_total_opp','bpm_max','bpm_max_opp']

# Drop the specified columns from the dataframe
df.drop(columns=columns_to_delete, inplace=True)


print("No of duplicate rows: ",df.duplicated().sum())

### Drop duplicates
df = df.drop_duplicates().reset_index(drop=True)

print("No of duplicate rows after dropping duplicates: ",df.duplicated().sum())

#### rename columns
df.rename(columns = {'mp_total':'mp'}, inplace=True)

#### Creating Season column
df['date'] = pd.to_datetime(df['date'])  # Convert 'date' column to datetime if it's not already

# Function to determine the season year based on the month
def get_season_year(row):
    if row['date'].month >= 10:
        return row['date'].year
    else:
        return row['date'].year - 1

# Apply the function to create a new 'season' column
df['season'] = df.apply(get_season_year, axis=1)


print("data shape:", df.shape)

columns_format = list(df.columns)

df.head(2)

## Data Cleaning

In [None]:
##### Abbreviate the team names
team_df = pd.read_csv('https://raw.githubusercontent.com/JetendraMulinti/DAV-6150---DataScience/main/FinalProject-NBA_Prediction/Data/Team_full-forms.csv')
team_df['team'] = team_df['team'].str.strip()
team_df['team1'] = team_df['team1'].str.strip()


##### Merge and delete the columns
df = pd.merge(team_df, df, on = ['team'], how='inner')
del df['team']
df.rename(columns = {'team1':'team'}, inplace=True)

team_df.rename(columns = {'team':'team_opp'}, inplace=True)
df = pd.merge(team_df, df, on = ['team_opp'], how='inner')
del df['team_opp']
df.rename(columns = {'team1':'team_opp'}, inplace=True)

print("data shape:", df.shape)

df = df[columns_format]

## ordering with date
df['date'] = pd.to_datetime(df['date']).dt.date
df = df.sort_values(by = ['date'], ascending=True).reset_index(drop=True)

df.head()

In [None]:
df['season'].value_counts()

In [None]:
### Check whether the target classes are balanced or imbalanced

df['won'].value_counts()

Checking Null values and dropping columns and rows

In [None]:
### Checking null values

null_columns = df.isnull().sum()
null_columns[null_columns > 0]

In [None]:
### delete some more columns
more_columns_to_delete = ['index_opp']

# Drop the specified columns from the dataframe
df.drop(columns=more_columns_to_delete, inplace=True)

## as we have only 1 null row (match) we will drop it
df = df.dropna()

null_columns = df.isnull().sum()
null_columns[null_columns > 0]

In [None]:
## re-ordering on date

## ordering with date
df['date'] = pd.to_datetime(df['date']).dt.date
df = df.sort_values(by = ['date'], ascending=True).reset_index(drop=True)

print("data shape:", df.shape)


print("no of rows before: ", len(df))

#### Each game appears twice (once from each team's perspective); keep a single row per game
# Create a sorted string that combines team names and game date
df['game_id'] = df.apply(lambda x: '_'.join(sorted([x['team'], x['team_opp']])) + '_' + str(x['date']), axis=1)

# Keep only one entry per game based on the alphabetical order of team names
df = df.sort_values(by=['team', 'team_opp']).drop_duplicates(subset='game_id', keep='first').reset_index(drop=True)

print("no of rows After: ", len(df))

df.head()

## Exploratory Data Analysis

In [None]:
# Generate descriptive statistics for key metrics
key_metrics = ['fg_total', 'fga_total', 'fg%_total', '3p_total', '3pa_total', '3p%_total', 'ft_total',
               'fta_total', 'ft%_total', 'total_opp', '+/-_max']

# Selecting the key metrics and generating descriptive statistics
key_stats_summary = df[key_metrics].describe()

# Display the descriptive statistics for key metrics
key_stats_summary

In [None]:
# Generate descriptive statistics for key metrics
key_metrics_opp = ['fg_total_opp', 'fga_total_opp', 'fg%_total_opp', '3p_total_opp', '3pa_total_opp', '3p%_total_opp', 'ft_total_opp',
               'fta_total_opp', 'ft%_total_opp', '+/-_max_opp']

# Selecting the key metrics and generating descriptive statistics
key_stats_summary_opp = df[key_metrics_opp].describe()

# Display the descriptive statistics for key metrics
key_stats_summary_opp

1. Field Goals Made and Attempted (fg_total, fga_total): Teams make an average of 40 field goals per game from 87 attempts, translating to an average field goal percentage of 46.2%.
2. Three-Point Shots (3p_total, 3pa_total, 3p%_total): On average, teams successfully make 11 three-point shots per game from 31 attempts, achieving a three-point shooting percentage of 35.7%.
3. Free Throws (ft_total, fta_total, ft%_total): Teams typically make 17 free throws per game from 23 attempts, with an average success rate of 77.2%.
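
The field goal percentage reported by `describe()` is the mean of the per-game percentages, which need not equal average makes divided by average attempts. A minimal sketch of the difference, using three made-up games (the numbers below are hypothetical and are not taken from the dataset):

```python
# Hypothetical (made, attempted) field goals for three games; not real data.
games = [(30, 90), (45, 85), (45, 86)]

# Mean of the per-game percentages (what describe() reports for 'fg%_total').
mean_of_ratios = sum(made / att for made, att in games) / len(games)

# Ratio of total makes to total attempts (equivalently, avg makes / avg attempts).
ratio_of_means = sum(made for made, _ in games) / sum(att for _, att in games)

print(f"mean of per-game FG%:     {mean_of_ratios:.4f}")
print(f"avg makes / avg attempts: {ratio_of_means:.4f}")
```

The two figures are close but not identical, so small gaps between a reported average percentage and one recomputed from average counts are expected.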

In [None]:

fig, axes = plt.subplots(5, 3, figsize=(15, 20))  # 5x3 grid of subplots

# Plotting field goals, three-point shots, and free throws
sns.histplot(df['fg_total'], bins=30, kde=True, ax=axes[0, 0]).set_title('Field Goals Made')
sns.histplot(df['3p_total'], bins=30, kde=True, ax=axes[0, 1]).set_title('Three-Points Made')
sns.histplot(df['ft_total'], bins=30, kde=True, ax=axes[0, 2]).set_title('Free Throws Made')

# Plotting percentages for field goals, three-point shots, and free throws
sns.histplot(df['fg%_total'], bins=30, kde=True, ax=axes[1, 0]).set_title('Field Goal Percentage')
sns.histplot(df['3p%_total'], bins=30, kde=True, ax=axes[1, 1]).set_title('Three-Point Percentage')
sns.histplot(df['ft%_total'], bins=30, kde=True, ax=axes[1, 2]).set_title('Free Throw Percentage')

# Plotting field goals, three-point shots, and free throws made by opponents
sns.histplot(df['fg_total_opp'], bins=30, kde=True, ax=axes[2, 0]).set_title('Field Goals Made by Opp')
sns.histplot(df['3p_total_opp'], bins=30, kde=True, ax=axes[2, 1]).set_title('Three-Points Made by Opp')
sns.histplot(df['ft_total_opp'], bins=30, kde=True, ax=axes[2, 2]).set_title('Free Throws Made by Opp')

# Plotting percentages for field goals, three-point shots, and free throws made by opponents
sns.histplot(df['fg%_total_opp'], bins=30, kde=True, ax=axes[3, 0]).set_title('Field Goal Percentage by Opp')
sns.histplot(df['3p%_total_opp'], bins=30, kde=True, ax=axes[3, 1]).set_title('Three-Point Percentage by Opp')
sns.histplot(df['ft%_total_opp'], bins=30, kde=True, ax=axes[3, 2]).set_title('Free Throw Percentage by Opp')

# Plotting games per season and the win distribution
sns.countplot(x='season', data=df, ax=axes[4, 0]).set_title('Games per Season')
sns.countplot(x='won', data=df, ax=axes[4, 1]).set_title('Win Distribution')
axes[4, 2].axis('off')  # hide the unused panel in the last row

plt.tight_layout()
plt.show()


1. The distributions of Field Goals Made and Three-Points Made are each concentrated in a narrow range, indicating that scoring volume is similar across games.
2. The percentage metrics for Field Goals, Three-Points, and Free Throws are roughly bell-shaped, indicating a consistent level of efficiency across games.
3. The Games per Season distribution shows consistency in the number of games played, which supports analyses over multiple seasons.

In [None]:
def season_trend(column):
    # Check if the column data looks like percentages (values between 0 and 1)
    if df[column].max() <= 1:
        # If so, convert to percentage by multiplying by 100
        seasonal_averages = df.groupby('season')[column].mean() * 100
        ylabel = f'Average {column} (%)'
    else:
        # Otherwise, use the values as is
        seasonal_averages = df.groupby('season')[column].mean()
        ylabel = f'Average {column}'
    
    # Plotting the time series
    plt.figure(figsize=(14, 7))
    seasonal_averages.plot(kind='line', marker='o')
    plt.title(f'Average {column} by NBA Season')
    plt.xlabel('NBA Season')
    plt.ylabel(ylabel)
    plt.grid(True)
    plt.xticks(ticks=seasonal_averages.index, labels=seasonal_averages.index)
    plt.tight_layout()
    plt.show()

In [None]:
##  Field goal percentage (field goals made divided by field goal attempts)
season_trend('fg%_total')

In [None]:
## A statistic that measures the point differential when a player or team is on the court, indicating the impact on the game's score; a positive value means the team outscored opponents, while a negative value indicates being outscored.

season_trend('+/-_max_opp')

In [None]:
def plot_team_performance(metric):
    # Determine if the metric is a percentage (between 0 and 1)
    percentage_scale = df[metric].max() <= 1
    
    # Calculate the average metric for each team per season
    seasonal_team_averages = df.groupby(['season', 'team'])[metric].mean().unstack()

    # Scale up if the metric is a percentage
    if percentage_scale:
        seasonal_team_averages *= 100
        ylabel = f'Average {metric} (%)'
    else:
        ylabel = f'Average {metric}'

    # Identify the top and bottom 5 performing teams
    top_teams = seasonal_team_averages.mean(axis=0).sort_values(ascending=False).head(5).index
    bottom_teams = seasonal_team_averages.mean(axis=0).sort_values(ascending=True).head(5).index

    # Create subplots for the top and bottom performing teams
    fig, axs = plt.subplots(2, 1, figsize=(15, 10), sharex=True)

    # Top 5 performing teams plot
    for team in top_teams:
        axs[0].plot(seasonal_team_averages.index, seasonal_team_averages.loc[:, team], marker='o', label=team)
    axs[0].set_title(f'Top 5 Performing Teams by {metric}')
    axs[0].set_ylabel(ylabel)
    axs[0].grid(True)
    axs[0].legend()

    # Bottom 5 performing teams plot
    for team in bottom_teams:
        axs[1].plot(seasonal_team_averages.index, seasonal_team_averages.loc[:, team], marker='o', label=team)
    axs[1].set_title(f'Bottom 5 Performing Teams by {metric}')
    axs[1].set_ylabel(ylabel)
    axs[1].grid(True)
    axs[1].legend()

    # Set common X label
    plt.xlabel('NBA Season')
    plt.tight_layout()
    plt.show()

In [None]:
plot_team_performance('fg%_total') 

In [None]:
plot_team_performance('+/-_max_opp')

In [None]:


# List of column names correctly passed to the correlation matrix plotting
columns_to_include = [
 'mp', 'fg_total', 'fga_total', 'fg%_total', '3p_total', '3pa_total', '3p%_total',
 'ft_total', 'fta_total', 'ft%_total', 'orb_total', 'drb_total', 'trb_total',
 'ast_total', 'stl_total', 'blk_total', 'tov_total', 'pf_total', 'pts_total',
 'ts%_total', 'efg%_total', '3par_total', 'ftr_total', 'orb%_total', 'drb%_total',
 'trb%_total', 'ast%_total', 'stl%_total', 'blk%_total', 'tov%_total',
 'usg%_total', 'ortg_total', 'drtg_total', 'home', 'won'  # Including only relevant columns
]

# Ensure that all these columns exist in df before using them
if set(columns_to_include).issubset(df.columns):
    fig, ax = plt.subplots(figsize=(25, 25))
    correlation_matrix = df[columns_to_include].corr()
    sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', ax=ax)
    ax.set_title('Correlation Matrix of Selected Metrics with Target')

    # Save the plot to a file
    plt.savefig('CorrelationMatrix.png')
    
    plt.show()
else:
    print("Some columns are missing in the DataFrame. Please check the column names.")

Multicollinearity Check Using Variance Inflation Factor (VIF)
Purpose: Assessing multicollinearity among predictive features to ensure that the model is not unduly influenced by highly correlated independent variables. This helps in refining the model to improve prediction accuracy and reliability.
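
To make the quantity concrete: the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below computes it with NumPy on synthetic data. The helper name `vif` and the generated columns are illustrative assumptions, not part of the project, and unlike the statsmodels call used in this notebook the helper always adds an intercept to each regression.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of a 2-D array X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b

independent_scores = vif(np.column_stack([a, b]))
collinear_scores = vif(np.column_stack([a, b, c]))
print("independent:", independent_scores.round(2))
print("collinear:  ", collinear_scores.round(2))
```

Adding the near-redundant column `c` inflates the VIF of all three features, which is the pattern the thresholds in the next cell are meant to flag.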

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Function to calculate VIF for each feature and provide specific suggestions
def calculate_vif(data):
    vif_df = pd.DataFrame()
    vif_df["variables"] = data.columns
    vif_df["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    
    # Defining suggestions based on VIF values
    def vif_suggestions(vif_value):
        if vif_value <= 1:
            return "Keep: Not correlated"
        elif 1 < vif_value < 5:
            return "Keep: Moderately correlated"
        elif 5 <= vif_value < 10:
            return "Consider reviewing: Highly correlated"
        else:
            return "Remove or transform: Very high correlation"

    vif_df['Suggestion'] = vif_df['VIF'].apply(vif_suggestions)
    return vif_df

# Selecting numeric features for VIF calculation
numeric_cols = df.select_dtypes(include=[np.number]).columns
vif_data = calculate_vif(df[numeric_cols].dropna())

# Display VIF scores
vif_data = vif_data.sort_values('VIF', ascending=False).reset_index(drop=True)
vif_data

In [None]:
vif_data['Suggestion'].value_counts()

Lagged Features for Dynamic Team Performance
Purpose: Creating lagged features to assess how past game performances (e.g., points scored, rebounds) influence the outcome of future games. This analysis helps in understanding team momentum or fatigue, which can be crucial for predicting outcomes of future games.
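
A lagged rolling average has to be computed within each team and in date order, and it must exclude the current game so that the feature uses only past information. A minimal sketch on a toy frame (the teams, dates, and scores are made up, and `pts_lag2` is an illustrative column name):

```python
import pandas as pd

# Hypothetical games for two teams; not real data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03",
                            "2023-01-01", "2023-01-03", "2023-01-05"]),
    "team": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "pts_total": [100, 110, 120, 90, 95, 130],
})

# Order each team's games chronologically before shifting.
toy = toy.sort_values(["team", "date"]).reset_index(drop=True)

# shift(1) drops the current game; the rolling mean then averages up to the
# two previous games. transform keeps the computation inside each team.
toy["pts_lag2"] = (
    toy.groupby("team")["pts_total"]
       .transform(lambda s: s.shift(1).rolling(2, min_periods=1).mean())
)
print(toy)
```

Each team's first game has no history, so its lag value is missing; later rows average only that team's earlier games.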

In [None]:

# Define the rolling windows you want to test
rolling_windows = [1, 3, 5, 7, 10]

# Create lagged features for each window size.
# Rows are ordered by date so that shift(1) refers to the team's previous game,
# and the rolling mean is computed within each team rather than across teams.
df_by_date = df.sort_values('date')
for window in rolling_windows:
    df[f'pts_scored_lag{window}'] = (
        df_by_date.groupby('team')['pts_total']
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )


# Drop rows with a missing target; the lag columns themselves are NaN for each team's first game
df.dropna(subset=['won'], inplace=True)

# Prepare the figure layout
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 18))
axes = axes.flatten()  # Flatten the array of axes for easier iteration

# Plot the impact of each lagged feature on the target
for i, window in enumerate(rolling_windows):
    sns.boxplot(x='won', y=f'pts_scored_lag{window}', data=df, ax=axes[i])
    axes[i].set_title(f'Impact of Average Points in Last {window} Games on Winning the Current Game')
    axes[i].set_xlabel('Game Won')
    axes[i].set_ylabel(f'Points Scored in Last {window} Games')

# Adjust layout
plt.tight_layout()

# Save the plot to a file
plt.savefig('Impact_of_Lagged_Features_on_Winning.png')

# Show plot
plt.show()


From visual inspection, Lag 1 and Lag 3 (points in the last game and average points over the last 3 games) appear to offer the best balance between capturing enough history to inform a prediction and averaging over so many games that the signal is diluted. The median is slightly higher for games that were won, which suggests that a three-game average is a reasonable window here.

In [None]:


# Plotting the impact of home-court advantage
plt.figure(figsize=(6, 4))
ax = sns.barplot(x='home', y='won', data=df, ci=None)  # ci=None to remove the confidence interval bars
plt.title("Impact of Home Court on Winning")
plt.xlabel("Home Game (1 = Home, 0 = Away)")
plt.ylabel("Probability of Winning")

# Calculate the mean probabilities for annotations
home_winning_probabilities = df.groupby('home')['won'].mean()

# Annotate the bars with the calculated probabilities
for i, p in enumerate(ax.patches):  # access the bars
    ax.annotate(format(home_winning_probabilities.iloc[i], '.2f'),  # format the probability
                (p.get_x() + p.get_width() / 2., p.get_height()),  # position for the text
                ha = 'center', va = 'center',  # center alignment
                xytext = (0, 9),  # position text slightly above the bar
                textcoords = 'offset points')

plt.show()


## Data Preparation

In [None]:
null_columns = df.isnull().sum()
null_columns[null_columns > 0]

In [None]:
### Delete the lag columns, which contain nulls (ignored if already dropped)

df.drop(columns=['pts_scored_lag1','pts_scored_lag3','pts_scored_lag5',
                 'pts_scored_lag7','pts_scored_lag10'], inplace=True, errors='ignore')

In [None]:
null_columns = df.isnull().sum()
null_columns[null_columns > 0]

Checking whether the target column is repeated
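
One caveat for this kind of check: `Series.equals` also compares dtypes, so a copy of the target stored as integers would not match a boolean target. The sketch below shows a looser, value-based comparison on a toy frame. The helper names `as_float` and `columns_matching_target` and the column `win_flag` are hypothetical and are not part of the project.

```python
import pandas as pd

def as_float(series):
    """Return the series as float64, or None if it cannot be converted."""
    try:
        return series.astype("float64")
    except (TypeError, ValueError):
        return None

def columns_matching_target(frame, target):
    """List columns whose values equal the target after numeric conversion."""
    target_values = as_float(frame[target])
    matches = []
    for col in frame.columns:
        if col == target:
            continue
        candidate = as_float(frame[col])
        if candidate is not None and (candidate == target_values).all():
            matches.append(col)
    return matches

# Hypothetical data: 'win_flag' repeats the target with a different dtype.
toy = pd.DataFrame({
    "won": [True, False, True],
    "win_flag": [1, 0, 1],
    "pts_total": [100, 90, 100],
    "team": ["AAA", "BBB", "CCC"],
})

strict = [c for c in toy.columns if c != "won" and toy[c].equals(toy["won"])]
loose = columns_matching_target(toy, "won")
print("dtype-strict matches:", strict)
print("value-based matches: ", loose)
```

The strict check misses `win_flag`, while the value-based comparison reports it.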

In [None]:
target_column = 'won'  

# Find any columns that have the same values as the target column
similar_columns = []
for column in df.columns:
    if column != target_column and df[column].equals(df[target_column]):
        similar_columns.append(column)

if similar_columns:
    print("Columns with identical values to the target column '{}':".format(target_column), similar_columns)
else:
    print("No columns have identical values to the target column '{}'.".format(target_column))

In [None]:
df['season'] = df['season'].astype(str)  # Convert 'season' to string type

# Select columns of data type 'category', 'object', and 'bool'
categorical_bool_columns = df.select_dtypes(include=['category', 'object', 'bool']).columns

# Convert to list and remove 'game_id' if it exists in the list
categorical_bool_columns = list(categorical_bool_columns)
if 'game_id' in categorical_bool_columns:
    categorical_bool_columns.remove('game_id')

categorical_bool_columns + ['home']

In [None]:
vif_data['Suggestion'].value_counts()

In [None]:
vif_safeColumns =  list(vif_data[vif_data['Suggestion'].isin(['Keep: Moderately correlated',
                                                              'Keep: Not correlated'])]['variables']) + categorical_bool_columns

df1 = df[vif_safeColumns].reset_index(drop=True)

def handle_duplicate_columns(df):
    """ Remove duplicate columns by keeping the first occurrence """
    df = df.loc[:, ~df.columns.duplicated()]
    return df


## remove duplicates columns
df1 = handle_duplicate_columns(df1)

df1['home'] = df['home']

print(df1.info())

df1.head(2)

In [None]:

# Data preparation function
def prepare_data(data):
    X = data.drop(columns=['won'])
    y = data['won'].astype(int)

    categorical_cols = X.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
    numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', MinMaxScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_cols),
            ('cat', categorical_transformer, categorical_cols)])

    return X, y, preprocessor, numeric_cols, categorical_cols

# Train and evaluate function
def train_and_evaluate(X, y, preprocessor, numeric_cols, categorical_cols, top_features=20, model_type='logistic'):
    if model_type == 'logistic':
        classifier = LogisticRegression(solver='liblinear', random_state=42)
    elif model_type == 'svm':
        classifier = SVC(probability=True, random_state=42)
    else:
        raise ValueError(f"Unsupported model type: {model_type}")

    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])

    # Stratified 5-fold CV keeps the class balance in each fold; because folds are
    # shuffled, game order is ignored, so this does not guard against temporal leakage
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(clf, X, y, cv=skf, scoring='accuracy')
    print(f"Cross-Validation Scores: {cv_scores}")
    print(f"Mean CV Accuracy: {np.mean(cv_scores)}")

    # Fit the model on the whole dataset
    clf.fit(X, y)

    # Access the preprocessor step
    preprocessor_fitted = clf.named_steps['preprocessor']
    X_transformed = preprocessor_fitted.transform(X)

    # Feature names: the numeric column names, which match the transformed
    # columns only when there are no categorical features to one-hot encode
    feature_names = numeric_cols
    
    # Initialize SHAP explainer and compute SHAP values
    explainer = shap.Explainer(clf.named_steps['classifier'], X_transformed)
    shap_values = explainer.shap_values(X_transformed)

    # Summarize the SHAP values to find the top features
    shap_sum = np.abs(shap_values).mean(axis=0)
    feature_importance = pd.DataFrame(list(zip(feature_names, shap_sum)), columns=['feature', 'shap_importance']).sort_values(by='shap_importance', ascending=False)

    top_features_df = feature_importance.head(top_features).reset_index(drop=True)
    print("Top Features Based on SHAP values:\n", top_features_df)

    # Plotting SHAP values for top features
    shap.summary_plot(shap_values, X_transformed, feature_names=feature_names)

    return clf, explainer, shap_values, top_features_df



In [None]:
selected_features = [
    '+/-_max', '+/-_max_opp',
    'ortg_max', 'drtg_max',
    'usg%_total', 'usg%_total_opp',
    'home',
    'fg%_max', 'fg%_max_opp',
    '3p%_max', '3p%_max_opp',
    'ft%_max', 'ft%_max_opp',
    'fga_max', 'fga_max_opp',
    '3pa_max', '3pa_max_opp',
    'orb_max', 'orb_max_opp',
    'drb%_max', 'drb%_max_opp',
    'blk_max', 'blk_max_opp',
    'tov_max', 'tov_max_opp',
    'stl_max', 'stl_max_opp',
    'pf_total', 'pf_total_opp',
    'ast_max', 'ast_max_opp',
    'ast%_max', 'ast%_max_opp',
    '3par_max', '3par_max_opp',
    'ftr_max', 'ftr_max_opp',
    'blk%_max', 'blk%_max_opp',
    'stl%_max', 'stl%_max_opp',
]

# Keep the hand-picked features plus the target
df_selected = df[selected_features + ['won']]

# Fit the logistic-regression pipeline and rank features by mean |SHAP| value
X, y, preprocessor, numeric_cols, categorical_cols = prepare_data(df_selected)
model, explainer, shap_values, top_features_df = train_and_evaluate(X, y, preprocessor, numeric_cols, categorical_cols, top_features=20, model_type='logistic')


In [None]:
df_selected.info()

Features shortlisted from domain knowledge

1. usg%_max (Usage Rate Max): Measures the percentage of team plays a player uses while on the floor.
2. trb%_max (Total Rebound Percentage Max): Indicates a player's or team's efficiency in grabbing available rebounds.
3. ts%_max (True Shooting Percentage Max): An overall measure of shooting efficiency.
4. pts_max (Points Max): Highest points scored in a game.
5. ast_total (Total Assists): Total assists provided in a game.

In [None]:
### Keep the top SHAP features plus identifier columns and the domain-shortlisted features

df2 = df[list(top_features_df['feature']) + ['home',"season", "date", "won", "team",
                                                      "team_opp"] + ['usg%_max','trb%_max','ts%_max', 'pts_max', 'ast_total']] ## Domain 
df2.head(2)

In [None]:
list(top_features_df['feature'])

## Prepped Data Review

In [None]:
# Generate descriptive statistics for key metrics
key_metrics = ['+/-_max', '+/-_max_opp', 'pf_total', 'pf_total_opp','fg%_max_opp','usg%_max','trb%_max','ts%_max', 'pts_max', 'ast_total']

# Selecting the key metrics and generating descriptive statistics
key_stats_summary = df2[key_metrics].describe()

# Display the descriptive statistics for key metrics
key_stats_summary

In [None]:
# Generate descriptive statistics for key metrics
key_metrics_opp = ['+/-_max', '+/-_max_opp', 'pf_total', 'pf_total_opp','fg%_max_opp','usg%_max','trb%_max','ts%_max', 'pts_max', 'ast_total']

# Selecting the key metrics and generating descriptive statistics
key_stats_summary_opp = df2[key_metrics_opp].describe()

# Display the descriptive statistics for key metrics
key_stats_summary_opp

1. Plus/Minus (+/-_max, +/-_max_opp):
The average of the per-game maximum plus/minus is similar for teams (13.45) and their opponents (13.09), so neither side of the pairing is systematically stronger.
2. Usage and Rebounds (usg%_max, trb%_max):
The per-game maxima average 35.14% for usage rate and 25.11% for total rebound percentage, indicating how heavily key players are involved in scoring and rebounding opportunities.
3. Shooting Efficiency (ts%_max):
The per-game maximum true shooting percentage (which accounts for field goals, 3-pointers, and free throws) averages 96.38%. Since this is a maximum, it likely reflects individual players on few attempts rather than team-wide scoring efficiency.
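
A value near 100% is easier to read with the standard true shooting formula, TS% = PTS / (2 × (FGA + 0.44 × FTA)). The sketch below assumes `ts%_max` is the highest single-player value in each game; that is inferred from the `_max` suffix and is not confirmed by the data source. The two stat lines are made up.

```python
def true_shooting(points, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)), returned as a fraction."""
    attempts = fga + 0.44 * fta
    if attempts == 0:
        return float("nan")
    return points / (2 * attempts)

# Hypothetical stat lines; not real data.
starter = true_shooting(points=25, fga=18, fta=6)  # high-volume scorer
reserve = true_shooting(points=3, fga=1, fta=0)    # a single made three-pointer

print(f"starter TS%: {starter:.1%}")
print(f"reserve TS%: {reserve:.1%}")
```

A player with one made three-pointer exceeds 100%, so a per-game maximum is easily dominated by low-volume players and should not be read as team efficiency.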

## Machine Learning Models

Approach 1: 
1. What statistical indicators are most influential in determining the winning team?
2. Can historical NBA game statistics be utilized to predict the outcome of a game? 
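
Because games are ordered in time, the models in this section are evaluated with scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. The sketch below shows the folds on a toy array of 12 ordered rows and, as an assumption about good practice rather than a description of this notebook's code, fits the scaler on the training rows only. The array `X_demo` is illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler

# Twelve rows standing in for games already sorted by date (toy data).
X_demo = np.arange(12, dtype=float).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_demo))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # Fit the scaler on past rows only, then apply it to the future rows.
    scaler = MinMaxScaler().fit(X_demo[train_idx])
    X_test_scaled = scaler.transform(X_demo[test_idx])
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()} "
          f"scaled test range=({X_test_scaled.min():.2f}, {X_test_scaled.max():.2f})")
```

Every test index comes after every training index, and the scaled test values can fall outside [0, 1] because the scaler has not seen them.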

In [None]:

def train_and_evaluate_models(X, y, scaler, Approach, models):
    """Train and evaluate each model with TimeSeriesSplit and return per-model metrics."""
    model_metrics = {}

    ### Scaling the data
    # Note: the scaler is fit on all rows before splitting, so each test fold's
    # value range influences the scaling of its training fold.
    X_scaled = scaler.fit_transform(X)
    X_filtered = pd.DataFrame(X_scaled, columns=X.columns)

    tscv = TimeSeriesSplit(n_splits=5)
    for name, model in models.items():
        scores = []
        detailed_reports = []
        for train_index, test_index in tscv.split(X_filtered):
            X_train, X_test = X_filtered.iloc[train_index], X_filtered.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]

            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            score = accuracy_score(y_test, y_pred)
            scores.append(score)
            report = classification_report(y_test, y_pred, output_dict=True)
            detailed_reports.append(report)

        avg_score = np.mean(scores)
        model_metrics[name] = {
            'Model': model,
            'CV Scores': scores,
            'Average CV Score': avg_score,
            'Classification Reports': detailed_reports
        }
        print(f"{name} Model Metrics:")
        print("Cross-Validation Scores:", scores)
        print("Average CV Score:", avg_score)
        print("Classification Reports for the last split:")
        print(classification_report(y_test, y_pred))

        model_directory = 'models'
        os.makedirs(model_directory, exist_ok=True)
        model_path = f"{model_directory}/{name}_Approach{Approach}.pkl"
        with open(model_path, 'wb') as file:
            pickle.dump(model, file)


    return model_metrics, X_filtered, y




In [None]:
def plot_learning_curve(estimator, title, X, y, cv):
    """Plot and save the learning curve for the given estimator."""
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 5))
    
    # Calculate means and standard deviations for the training and test sets
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.ylim(0.4, 1.1)
    plt.yticks(np.arange(0.4, 1.1, 0.1))
    plt.grid()

    # Plot the standard deviation as a shaded area around the mean
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    
    # Plot the mean score for training and test sets
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    
    plt.legend(loc="best")
    plt.savefig(f'{title}.png')
    plt.show()



def format_classification_report(reports):
    """Formats and prints average classification report from multiple splits."""
    print("Average Classification Metrics:")
    for key in ['precision', 'recall', 'f1-score']:
        avg_metric = np.mean([report['macro avg'][key] for report in reports])
        print(f"{key.capitalize()}: {avg_metric:.2f}")

def select_and_plot_best_model(model_metrics, X, y):
    """Select the best model based on CV scores and plot its learning curve."""
    best_model_name = max(model_metrics, key=lambda x: model_metrics[x]['Average CV Score'])
    best_model = model_metrics[best_model_name]['Model']

    print(f"Selected Best Model: {best_model_name}")
    print("Cross-Validation Scores:", model_metrics[best_model_name]['CV Scores'])
    print(f"Average CV Score: {model_metrics[best_model_name]['Average CV Score']}")
    format_classification_report(model_metrics[best_model_name]['Classification Reports'])

    plot_learning_curve(best_model, f"Learning Curve for {best_model_name}", X, y, cv=StratifiedKFold(n_splits=5))
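
`select_and_plot_best_model` passes `StratifiedKFold(n_splits=5)` to the learning curve. Because the rows are sorted by date, a chronological splitter is one possible alternative, so that each validation fold only contains games played after the training games. The sketch below is not part of the original pipeline; `X_toy` is a made-up array used only to show how scikit-learn's `TimeSeriesSplit` orders its folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 toy rows, assumed to be sorted by date already (hypothetical data)
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_toy))

for train_idx, val_idx in splits:
    # Every training index precedes every validation index
    print(f"train: rows 0..{train_idx.max()}  validate: rows {val_idx.min()}..{val_idx.max()}")
```

If this behaviour is wanted, `TimeSeriesSplit(n_splits=5)` could be passed as the `cv` argument in place of `StratifiedKFold(n_splits=5)`; whether that changes the reported scores would have to be checked on the real data.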


In [None]:
# checking the datatypes of the X values
df_approach1 = df2.copy()
df_approach1['date'] = pd.to_datetime(df_approach1['date'])
df_approach1.sort_values('date', inplace=True)
predictors = [col for col in df_approach1.columns if col not in ["season", "date", "won", "team", "team_opp"]]
test = df_approach1[predictors]
test.head(2)

In [None]:
test.info()

In [None]:
# Usage example
df_approach1 = df2.copy()
df_approach1['date'] = pd.to_datetime(df_approach1['date'])
df_approach1.sort_values('date', inplace=True)
predictors = [col for col in df_approach1.columns if col not in ["season", "date", "won", "team", "team_opp"]]
X = df_approach1[predictors]
y = df_approach1['won']
scaler = MinMaxScaler()

In [None]:


models = {
    'logistic_regression': LogisticRegression(
        penalty='l2',  # liblinear also supports 'l1' (but not 'elasticnet')
        C=0.01,  # Inverse regularization strength (smaller values mean stronger regularization)
        max_iter=1000,
        solver='liblinear'  # Supports both 'l1' and 'l2' penalties
    ),
    'svm': SVC(
        probability=True,  # Enables predict_proba
        C=0.05,  # Inverse regularization strength (smaller values mean stronger regularization)
        kernel='rbf',  # Commonly used kernel with SVMs
        gamma='scale'  # Kernel coefficient
    ),
    'xgboost': XGBClassifier(
        use_label_encoder=False,
        eval_metric='logloss',  # Binary target ('won'), so log loss rather than multiclass 'mlogloss'
        alpha=0.05,  # L1 regularization (XGBoost default is 0)
        reg_lambda=0.05,  # L2 regularization (below the XGBoost default of 1)
        max_depth=4,
        min_child_weight=5,
        learning_rate=0.05
    )
}

In [None]:
# Start time

from datetime import datetime
import time

start_time = time.time()
start_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Started at = ", start_dt_string)

###
approach1_model_metrics, approach1_X_filtered, approach1_y = train_and_evaluate_models(X, y, scaler, Approach = 1, models = models)

# End time
end_time = time.time()
end_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Ended at = ", end_dt_string)
print(f"Total processing time: {(end_time - start_time) / 60:.2f} minutes.")
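
The start/end timing pattern above is repeated in several cells. As a sketch only, it could be wrapped in a small context manager; `timed` is a hypothetical helper name that does not appear elsewhere in this notebook, and the summation inside the `with` block is a placeholder for a real call such as `train_and_evaluate_models(...)`.

```python
import time
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def timed(label="step"):
    """Print start/end timestamps and elapsed minutes around a block of code."""
    start = time.time()
    print(f"[{label}] Started at = {datetime.now().strftime('%d/%m/%Y %H:%M:%S')}")
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so the end time is always reported
        elapsed_min = (time.time() - start) / 60
        print(f"[{label}] Ended at = {datetime.now().strftime('%d/%m/%Y %H:%M:%S')}")
        print(f"[{label}] Total processing time: {elapsed_min:.2f} minutes.")

# Placeholder workload standing in for a training call
with timed("demo"):
    total = sum(range(1000))
```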

In [None]:

# Start time
start_time = time.time()
start_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Started at = ", start_dt_string)

#### 
select_and_plot_best_model(approach1_model_metrics, approach1_X_filtered, approach1_y)

# End time
end_time = time.time()
end_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Ended at = ", end_dt_string)
print(f"Total processing time: {(end_time - start_time) / 60:.2f} minutes.")

Approach 2:

3. Can the prediction model adapt dynamically to in-season changes such as player performance trends?
4. How does the home-court advantage factor into the predictive model?

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assume approach2_df2 is your DataFrame already loaded and cleaned
approach2_df2 = df2.copy()
approach2_df2['date'] = pd.to_datetime(approach2_df2['date'])
approach2_df2.sort_values(by=['team', 'date'], inplace=True)

# Define predictors excluding non-numeric and irrelevant columns
predictors = [col for col in approach2_df2.columns if col not in ["season", "date", "won", "team", "team_opp"]]

# Rolling mean over each team's 3 most recent games (the window includes the current game)
for predictor in predictors:
    approach2_df2[f'{predictor}_rolling_avg'] = approach2_df2.groupby('team')[predictor].transform(lambda x: x.rolling(window=3, min_periods=1).mean())

# Convert the 'home' column to an integer type if it's not already
approach2_df2['home_game'] = approach2_df2['home'].astype(int)

# Select the rolling average features and the home game feature
feature_columns = [f'{predictor}_rolling_avg' for predictor in predictors] + ['home_game']
approach2_X = approach2_df2[feature_columns]
approach2_y = approach2_df2['won'].astype(int)  # Ensure target variable 'won' is integer

# Normalize the features
scaler = MinMaxScaler()
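
The rolling-average step above can be illustrated on a toy frame. The data below is made up, and `pts` is a hypothetical column name. The first feature follows the same `groupby(...).transform(...)` pattern as the cell above, where the window ends at the current game. The second is a variant, not used in this notebook, that shifts by one game first so each row only sees earlier games.

```python
import pandas as pd

# Made-up games for two teams, sorted the same way as approach2_df2
toy = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2023-01-01", "2023-01-03", "2023-01-05", "2023-01-07",
        "2023-01-02", "2023-01-04",
    ]),
    "pts": [100, 110, 90, 120, 95, 105],
}).sort_values(["team", "date"])

# Same pattern as above: the 3-game window includes the current game
toy["pts_rolling_avg"] = toy.groupby("team")["pts"].transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Variant: shift by one game so the window covers only previous games
toy["pts_prev_avg"] = toy.groupby("team")["pts"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(toy)
```

With the shifted variant, each team's first game has no history and yields `NaN`, so those rows would need to be dropped or imputed before scaling.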

In [None]:
df2.head(2)

In [None]:
approach2_X.head(2)

In [None]:
# Start time
start_time = time.time()
start_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Started at = ", start_dt_string)

###
approach2_model_metrics, approach2_X_filtered, approach2_y = train_and_evaluate_models(approach2_X, approach2_y, scaler, Approach = 2, models = models)

# End time
end_time = time.time()
end_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Ended at = ", end_dt_string)
print(f"Total processing time: {(end_time - start_time) / 60:.2f} minutes.")

Approach 3 (Approaches 1 & 2 combined):

In [None]:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

approach3_df2 = df2.copy()
approach3_df2['date'] = pd.to_datetime(approach3_df2['date'])
approach3_df2.sort_values(by=['team', 'date'], inplace=True)

# Define predictors excluding non-numeric and irrelevant columns
predictors = [col for col in approach3_df2.columns if col not in ["season", "date", "won", "team", "team_opp"]]

# Add suffix '_original' to distinguish base features
original_predictors = [f'{col}_original' for col in predictors]
approach3_df2.rename(columns=dict(zip(predictors, original_predictors)), inplace=True)

# Rolling mean over each team's 3 most recent games (the window includes the current game)
for predictor in original_predictors:
    approach3_df2[f'{predictor}_rolling_avg'] = approach3_df2.groupby('team')[predictor].transform(lambda x: x.rolling(window=3, min_periods=1).mean())

# 'home' was renamed to 'home_original' above, so read it from df2 (rows align on the index)
approach3_df2['home_game'] = df2['home'].astype(int)

# Select both original features and rolling average features, including 'home_game'
combined_features = original_predictors + [f'{predictor}_rolling_avg' for predictor in original_predictors] + ['home_game']
approach3_X = approach3_df2[combined_features]
approach3_y = approach3_df2['won'].astype(int)  # Ensure target variable 'won' is integer

# Normalize the features
scaler = MinMaxScaler()

In [None]:
approach3_X.head(2)

In [None]:
df2.head(2)

In [None]:
# Start time
start_time = time.time()
start_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Started at = ", start_dt_string)

###
approach3_model_metrics, approach3_X_filtered, approach3_y = train_and_evaluate_models(approach3_X, approach3_y, scaler, Approach = 3, models = models)

# End time
end_time = time.time()
end_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Ended at = ", end_dt_string)
print(f"Total processing time: {(end_time - start_time) / 60:.2f} minutes.")

## Model Selection

In [None]:
# Start time
start_time = time.time()
start_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Started at = ", start_dt_string)

#### 
select_and_plot_best_model(approach3_model_metrics, approach3_X_filtered, approach3_y)

# End time
end_time = time.time()
end_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Ended at = ", end_dt_string)
print(f"Total processing time: {(end_time - start_time) / 60:.2f} minutes.")

## Ensemble Model

In [None]:

import os  # needed by ensemble_approach below (os.makedirs)

# Define ensemble models
ensemble_models = {
    'stacking': StackingClassifier(
        estimators=[
            ('lr', LogisticRegression(
                penalty='l2', C=0.01, max_iter=1000, solver='liblinear'
            )),
            ('svm', SVC(
                probability=True, C=0.05, kernel='rbf', gamma='scale'
            )),
            ('xgb', XGBClassifier(
                use_label_encoder=False, eval_metric='logloss',
                alpha=0.05, reg_lambda=0.05, max_depth=4,
                min_child_weight=5, learning_rate=0.05
            ))
        ],
        final_estimator=LogisticRegression(penalty='l2', C=0.01, max_iter=1000, solver='liblinear')
    ),
    'voting': VotingClassifier(
        estimators=[
            ('lr', LogisticRegression(
                penalty='l2', C=0.01, max_iter=1000, solver='liblinear'
            )),
            ('svm', SVC(
                probability=True, C=0.05, kernel='rbf', gamma='scale'
            )),
            ('xgb', XGBClassifier(
                use_label_encoder=False, eval_metric='logloss',
                alpha=0.05, reg_lambda=0.05, max_depth=4,
                min_child_weight=5, learning_rate=0.05
            ))
        ],
        voting='soft'
    )
}

# Utility function to load a model from a file
def load_model(path):
    with open(path, 'rb') as file:
        return pickle.load(file)

# Utility function to evaluate a given model using cross-validation
def evaluate_model(model, X, y, model_name):
    cv = StratifiedKFold(n_splits=5)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f'{model_name} - Accuracy: {np.mean(scores):.2f} ± {np.std(scores):.2f}')
    return np.mean(scores), scores

# Function to plot and save the learning curve for the given estimator
def plot_learning_curve(estimator, title, X, y, cv):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.ylim(0.0, 1.1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    plt.savefig(f'{title}.png')
    plt.show()

# Function to evaluate and compare ensemble models (Stacking and Voting)
def ensemble_approach(X, y, ensemble_models):
    # Create directory for saving models
    os.makedirs('models', exist_ok=True)

    for name, model in ensemble_models.items():
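        # Fit on all rows (this fitted model is saved and used for the report below);
        # cross_val_score clones the estimator, so the CV scores do not depend on this fit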
        model.fit(X, y)
        model_mean_score, model_scores = evaluate_model(model, X, y, f"{name.capitalize()} Model")

        # Save the ensemble model
        model_path = f"models/{name.capitalize()}Model.pkl"
        with open(model_path, 'wb') as f:
            pickle.dump(model, f)

        # Plot learning curve for the ensemble
        plot_learning_curve(model, f"Learning Curve for {name.capitalize()} Model", X, y, StratifiedKFold(n_splits=5))

        # Print classification report for the ensemble (in-sample: same rows it was fit on)
        y_pred = model.predict(X)
        print(f"Classification Report for {name.capitalize()} Model (training data):")
        print(classification_report(y, y_pred))

In [None]:

# Start time
start_time = time.time()
start_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Started at = ", start_dt_string)


ensemble_approach(approach3_X_filtered, approach3_y, ensemble_models)

# End time
end_time = time.time()
end_dt_string = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
print("Ended at = ", end_dt_string)
print(f"Total processing time: {(end_time - start_time) / 60:.2f} minutes.")
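
The classification report inside `ensemble_approach` is computed on the same rows the model was fit on, so it tends to look better than the cross-validated accuracy. One possible way to get an out-of-fold report is scikit-learn's `cross_val_predict`. The sketch below uses synthetic data and a plain `LogisticRegression` as a stand-in for the ensemble; neither is taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary problem standing in for (approach3_X_filtered, approach3_y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Stand-in estimator; a StackingClassifier or VotingClassifier could be passed instead
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# Each prediction comes from a model that never saw that row during fitting
y_oof = cross_val_predict(model, X_demo, y_demo, cv=cv)

report = classification_report(y_demo, y_oof, output_dict=True)
print(f"Out-of-fold accuracy: {report['accuracy']:.2f}")
```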

## Conclusions:

Of the two ensemble methods, stacking outperformed voting for this use case. The Stacking Model achieved a higher cross-validation accuracy (0.77 ± 0.03) than the Voting Model (0.73 ± 0.03), and its learning curve was well balanced, which suggests better generalization and less overfitting.

Model Performance:
XGBoost: Achieved an average cross-validation accuracy of 74%.
Ensemble Model: Combining Logistic Regression, SVM, and XGBoost via stacking improved accuracy to 77% (±0.03).

Feature Importance:
SHAP Analysis: Revealed that the most influential features include ± max, ± max_opp, pf_total, pf_total_opp, and 3p_total, emphasizing shooting efficiency and offensive production.

In-Season Adaptability:
Rolling Averages: Incorporating rolling averages for key metrics helped the model adapt to in-season changes and capture player/team performance dynamics.

Home-Court Advantage:
Impact on Winning: Home teams showed a higher probability of winning than away teams.
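
A comparison like this can be checked directly by grouping on the home flag. The sketch below uses a made-up six-row frame with the same `home` and `won` column names as `df2`; the values are illustrative and are not results from this project.

```python
import pandas as pd

# Made-up rows: home = 1 for home games, won = 1 for wins
games = pd.DataFrame({
    "home": [1, 0, 1, 0, 1, 0],
    "won":  [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column is the win rate within each group
win_rate = games.groupby("home")["won"].mean()
print(win_rate)
```

On the real data, the same line applied to `df2` would give the empirical home and away win rates.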