# NBA Games Analysis
The goal of this project is to build a handful of different models to make a predictive model for NBA games. The dataset used has four seasons of data from 2014-2018, not including playoff or preseason games. This data is available publicly via Kaggle. 

### Load Data
Data is loaded from local file that was downloaded from Kaggle.

In [1]:
# Import libraries
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import keras
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc, mean_squared_error
import datetime

%matplotlib inline 

Using TensorFlow backend.


In [2]:
# Read in data
nba = pd.read_csv("nba.games.stats.csv")

In [31]:
# Read in ELO data (from 538)
nba_elo = pd.read_csv("nba_elo.csv")

## Data Cleaning
Next steps will be to clean up the data and create some new features that may be useful for future predcitions.

In [None]:
# Print out data information
nba.info()

In [4]:
# Print out ELO info
nba_elo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69635 entries, 0 to 69634
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            69635 non-null  object 
 1   season          69635 non-null  int64  
 2   neutral         69635 non-null  int64  
 3   playoff         4362 non-null   object 
 4   team1           69635 non-null  object 
 5   team2           69635 non-null  object 
 6   elo1_pre        69635 non-null  float64
 7   elo2_pre        69635 non-null  float64
 8   elo_prob1       69635 non-null  float64
 9   elo_prob2       69635 non-null  float64
 10  elo1_post       69224 non-null  float64
 11  elo2_post       69224 non-null  float64
 12  carm-elo1_pre   6478 non-null   float64
 13  carm-elo2_pre   6478 non-null   float64
 14  carm-elo_prob1  6478 non-null   float64
 15  carm-elo_prob2  6478 non-null   float64
 16  carm-elo1_post  6067 non-null   float64
 17  carm-elo2_post  6067 non-null  

In [5]:
# Most of the above are ints or floats, but a few objects should be categories.
# Next, convert these to categories.
nba['Team'] = nba['Team'].astype('category')
nba['Home'] = nba['Home'].astype('category')
nba['Opponent'] = nba['Opponent'].astype('category')
nba['WINorLOSS'] = nba['WINorLOSS'].astype('category')
# Convert date to a date object
nba['Date'] = pd.to_datetime(nba['Date'])

In [6]:
# Confirm the changes worked
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9840 entries, 0 to 9839
Data columns (total 41 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Unnamed: 0                9840 non-null   int64         
 1   Team                      9840 non-null   category      
 2   Game                      9840 non-null   int64         
 3   Date                      9840 non-null   datetime64[ns]
 4   Home                      9840 non-null   category      
 5   Opponent                  9840 non-null   category      
 6   WINorLOSS                 9840 non-null   category      
 7   TeamPoints                9840 non-null   int64         
 8   OpponentPoints            9840 non-null   int64         
 9   FieldGoals                9840 non-null   int64         
 10  FieldGoalsAttempted       9840 non-null   int64         
 11  FieldGoals.               9840 non-null   float64       
 12  X3PointShots        

In [32]:
# Filter out ELO data < 2015 and > 2018
nba_elo = nba_elo[((nba_elo['season'] >= 2015) & (nba_elo['season'] < 2019))]

In [33]:
# Subset to only a few needed columns
nba_elo = nba_elo[['date','team1','team2','elo1_pre','elo2_pre']]

In [34]:
nba_elo['date'] = pd.to_datetime(nba_elo['date'])

### Create season variable
There are four distinct seasons, with no games in a season coming after May 1st of that year. Let's categorize every game for particular seasons.

In [22]:
# Define function
def getSeason(x):
    if x < pd.to_datetime('2015-5-1'):
        return '14-15'
    elif x < pd.to_datetime('2016-5-1'):
        return '15-16'
    elif x < pd.to_datetime('2017-5-1'):
        return '16-17'
    else:
        return '17-18'

nba['season'] = nba['Date'].apply(getSeason)

In [23]:
# Convert to category
nba['season'] = nba['season'].astype('category')
# Check results
nba['season'].value_counts()

17-18    2460
16-17    2460
15-16    2460
14-15    2460
Name: season, dtype: int64

### Get running total of wins & losses
Need to calculate the running total of wins and losses for the home and away teams to calculate winning pct for a certain point in time.

Logical approach seems to be create a separate dataframe with each teams wins, losses and win pct for each date and then join this table (twice) to the main table. One join would bring in the Home teams info and the second join would tie in the Away teams info.

In [None]:
display(nba)

In [24]:
# Create copy of nba dataframe
temp = nba
# Define a numeric W/L column for cumulative sum
temp['win_loss'] = np.where(temp['WINorLOSS'] == 'W', 1, 0)
# Create four subsets - one for each season
# Might be a better way to do this but sticking with simplicity for now
s1 = temp[temp['season'] == '14-15']
s2 = temp[temp['season'] == '15-16']
s3 = temp[temp['season'] == '16-17']
s4 = temp[temp['season'] == '17-18']

In [25]:
# Function to do all of the below
def getRollingData(df):
    # Create a temporary dataframe
    temp = df
    # Get rolling win total
    df_wins = df.groupby(['Team','Game','Date']).win_loss.sum().groupby(level=[0]).cumsum() # Get rolling win total
    df_wins = pd.DataFrame(df_wins).reset_index() # Removes multi index from grouping
    df_wins = df_wins[np.isfinite(df_wins['win_loss'])] # Remove NaN rows
    opp_wins = df_wins.rename(columns = {'win_loss': 'opp_wins', 'Game': 'opp_games'}) # Opponent data
    df_wins = df_wins.rename(columns = {'win_loss': 'wins','Game': 'home_game'}) # Rename cols for merge
    df = df.merge(df_wins, on = ['Team','Date'], how = 'left') # Join back to original season data
    df = df.merge(opp_wins, on =['Team','Date'], how = 'left') # Join in opponent info
    # Only keep a select few columns
    df = df[['Team','Game','Date','Home','Opponent','WINorLOSS','wins','home_game','opp_wins','opp_games']]
    # Get rolling averages
    temp = temp.drop(columns = {'Home','Opponent','WINorLOSS','season','win_loss','Date'})
    temp = temp.groupby(['Team']).expanding().mean().reset_index() # Gets cumulative avg of numeric cols
    temp['Game'] = temp['level_1'] % 82 + 2 # This fixes the game column (for joining). Index offset which means the avg for a game is one behind the game which is ideal
    df = df.merge(temp, on = ['Team', 'Game'], how = 'left')
    df = df.drop(columns = ['level_1','Unnamed: 0'])
    return df

In [26]:
# Get the seasons data
season1 = getRollingData(s1)
season2 = getRollingData(s2)
season3 = getRollingData(s3)
season4 = getRollingData(s4)

In [27]:
# Combine all the data
nba = pd.concat([season1, season2, season3, season4])

In [29]:
nba_elo.columns

Index(['date', 'team1', 'team2', 'elo1_pre', 'elo2_pre'], dtype='object')

In [35]:
# Rename ELO columns for join
nba_elo = nba_elo.rename(columns={'date': 'Date', 'team1': 'Team', 'team2': 'Opponent', 
                                  'elo1_pre': 'elo', 'elo2_pre': 'elo_opp'})

In [38]:
# Join in ELO ratings
temp = pd.merge(nba, nba_elo, how='left', on = ['Date','Team','Opponent'])

In [40]:
temp.head(10)

Unnamed: 0,Team,Game,Date,Home,Opponent,WINorLOSS,wins,home_game,opp_wins,opp_games,...,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls,elo,elo_opp
0,ATL,1,2014-10-29,Away,TOR,L,0.0,1,0.0,1,...,,,,,,,,,,
1,ATL,2,2014-11-01,Home,IND,W,1.0,2,1.0,2,...,0.818,16.0,48.0,26.0,13.0,9.0,9.0,22.0,1485.8804,1524.3203
2,ATL,3,2014-11-05,Away,SAS,L,1.0,3,1.0,3,...,0.8375,13.5,46.0,25.5,9.0,7.0,13.5,24.0,,
3,ATL,4,2014-11-07,Away,CHO,L,1.0,4,1.0,4,...,0.795333,12.666667,47.333333,25.333333,8.333333,7.666667,15.333333,21.0,,
4,ATL,5,2014-11-08,Home,NYK,W,2.0,5,2.0,5,...,0.78175,12.25,48.25,26.75,7.75,7.5,16.25,23.25,1489.2977,1491.152
5,ATL,6,2014-11-10,Away,NYK,W,3.0,6,3.0,6,...,0.7708,12.4,47.4,26.6,6.6,7.2,16.0,24.4,,
6,ATL,7,2014-11-12,Home,UTA,W,4.0,7,4.0,7,...,0.7395,12.166667,46.166667,26.0,6.166667,6.333333,15.833333,24.666667,1505.4128,1383.9275
7,ATL,8,2014-11-14,Home,MIA,W,5.0,8,5.0,8,...,0.705286,11.571429,43.857143,26.285714,7.0,6.571429,15.142857,23.571429,1507.4873,1575.7856
8,ATL,9,2014-11-15,Away,CLE,L,5.0,9,5.0,9,...,0.732125,10.75,42.375,26.375,7.375,6.125,15.0,23.125,,
9,ATL,10,2014-11-18,Home,LAL,L,5.0,10,5.0,10,...,0.736222,10.888889,42.888889,27.777778,7.888889,5.666667,14.777778,22.111111,1499.0448,1383.3768


#### Remove a win/game
A quick note - we need to remove a win from each of the `wins` and `opp_wins` columns. Why? As of right now, it gives us where the team stood *after* that game but to make predictions, we need to know how the team was doing heading *into* the game. If we left it as is, we would be predicting for a game while using data that already included how the game actually went.

In [None]:
# We need to also account for cases when wins = 0, in which case we leave as it is.
nba['wins'] = nba['wins'].apply(lambda x: x-1 if x > 0 else x)
nba['opp_wins'] = nba['opp_wins'].apply(lambda x: x-1 if x > 0 else x)
nba['home_game'] = nba['home_game'].apply(lambda x: x-1 if x > 0 else x)
nba['opp_games'] = nba['opp_games'].apply(lambda x: x-1 if x > 0 else x)

In [None]:
display(nba)

#### Create winning pct column
Get the team's current winning pct.

In [None]:
nba['win_pct'] = nba['wins'] / nba['home_game']
nba['opp_win_pct'] = nba['opp_wins'] / nba['opp_games']

In [None]:
nba.columns

### Drop columns
We can drop some of the columns we no longer will need. These columns were only created for getting to a different result, such as creating `win_loss` to get the cumulative `wins`.

In [None]:
nba = nba.drop(columns = ['home_game', 'opp_games'])
display(nba)

Now we have a "final" dataset to use. Future feature engineering could be done to include a handful of other things, such as:
* Travel distance for team
* If a team had to change timezones
* How many games back-to-back a team had played

In [None]:
# Finally we drop the columns with NaN for either team's winning pct because we need games with data on both teams
nba = nba[(nba['win_pct'].notna()) & (nba['opp_win_pct'].notna())]

In [None]:
# Print final data info
nba.info()

## Exploratory Data Analysis
Some brief exploration of the data now that it has been properly formatted and is in a usable situation.

In [None]:
import pandas_profiling

In [None]:
#pandas_profiling.ProfileReport(nba)

## Model Building
The aim of this is to predict whether or not a team will win or loss. This is found in the `WINorLOSS` column for each game. We have a total of four seasons worth of data and can use all of ths information available since we "backlogged" the season averages to not include the game they are from. This means we are attempting to predict the winner of a game based on the averages of the Home and Away teams and their respective winning percentages heading into the game. Let's start with a basic logistic regression model.

In [None]:
# Converting WINorLOSS into a numeric column
nba['WINorLOSS'] = np.where(nba['WINorLOSS'] == 'W', 1, 0)

### Logistic Regression

To do logistic regression, we need to convert all of our categorical variables to 'dummy' variables using the `get_dummies` function.

In [None]:
# Create dummy data
nba = nba.drop(columns = {'Date'}) # Remove date column - not relevant for predictions
#nba_temp = nba.set_index('Team') # Set the index to the team we are predicting for (preserving original data)
nba_dummies = pd.get_dummies(nba, drop_first=True) # Drop first to avoid multicollinearity
nba_dummies.head()

In [None]:
nba_dummies.columns

Using the `scikit learn` library, we will now split our data into training and testing sets.

In [None]:
# Setup temporary dataframes for features and labels
X = nba_dummies.drop(columns = ['WINorLOSS'])
y = nba_dummies['WINorLOSS']

In [None]:
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 7)

In [None]:
# Print out dims for each data set
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

#### Standardize the data
Need to standardize the data before building the model

In [None]:
# Standardize
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

#### Create and fit the model

In [None]:
# Using LogisticRegressionCV function for cross validation
# Define regularization parameters
#reg_params = np.arange(0.5, 0.001, -0.001)
# Create & fit the CV version of Logistic Regression
logRegCVSD = LogisticRegressionCV(Cs=500, penalty='l1', cv=5, solver='liblinear')
logRegCVSD.fit(X_train_std, y_train)

In [None]:
logRegCVSD.C_

In [None]:
fig, axs = plt.subplots(1, 1, figsize=(30, 9))
axs.plot(logRegCVSD.Cs_, logRegCVSD.scores_[1].mean(axis=0), marker='o')
axs.plot(logRegCVSD.C_, logRegCVSD.scores_[1].mean(axis=0).max(), marker='x',
         markersize=15)
# axs.set_xscale('log')
axs.set_title('')
axs.set_xlabel('C Values')
axs.set_ylabel('Accuracy')
axs.grid()

# print max accuracy
print(logRegCVSD.scores_[1].mean(axis=0).max())

In [None]:
# Calculate accuracy of the model
# Make predictions using the training data
train_preds = logRegCVSD.predict(X_train)
# Score those predictions
train_acc = accuracy_score(train_preds, y_train)

# Make predictions on the testing data
test_preds = logRegCVSD.predict(X_test)
# Score those predictions
test_acc = accuracy_score(test_preds, y_test)

# Print out the accuracy for each
print("Train Accuracy: {:.2f}%".format(train_acc*100))
print("Test Accuracy: {:.2f}%".format(test_acc*100))

### Decision Tree

#### Fit original model

In [None]:
# Define model
clf = DecisionTreeClassifier(random_state=7, min_samples_leaf=25)
# Fit model
clf.fit(X_train, y_train)

#### Find accuracy for train/test data

In [None]:
# Find the accuracy on the training data
train_data_accuracy = accuracy_score(clf.predict(X_train), y_train)

# Find the accuracy on the testing data
test_data_accuracy = accuracy_score(clf.predict(X_test) , y_test)

print('The accuracy on the training data is {:.0f}%'.format(train_data_accuracy*100))
print('The accuracy on the testing data is {:.0f}%'.format(test_data_accuracy*100))

#### Tree model using GridSearch approach

In [None]:
# Redo tree but with GridSearchCV for fine-tuning the model
clf = DecisionTreeClassifier(random_state=7)
# Create a dictionary of parameters & ranges to test against
parameters = {'min_samples_leaf': range(50,500, 50), 'max_depth': range(3,50)}
# Create GridSearchCV
gsCV = GridSearchCV(clf, parameters, cv=5, return_train_score=True)
# Train the model
gsCV.fit(X_train, y_train)
# Get the results of the grid search approach
grid_results = pd.DataFrame(gsCV.cv_results_)
# Order the results on best test score and print out the top ten
grid_results.sort_values('rank_test_score').head(10)

### Neural Network Approach
Attempting to build a simple neural network using Tensorflow & Keras framework for modeling.

In [None]:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Flatten, LeakyReLU
from keras.callbacks import TensorBoard, EarlyStopping

In [None]:
# Data has already been preprocessed
# Setup early stopping
callback = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=3)

# Define input shape
input_shape = X_train.shape

In [None]:
# Build the model
model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(1000, input_shape=(69,), activation='relu'),
  tf.keras.layers.Dense(300, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
# Fit the model
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1, validation_split=0.3, callbacks=[callback])

In [None]:
# Evaluate the model
model.evaluate(X_test,  y_test, verbose=2)

In [None]:
# Attempt to iteratively repeat this process
# Define vector to hold test scores
test_scores = []

# Loop through criteria in two different loops
for i in range(100, 500, 10):
    for j in range(50, 150, 10):
        # Build the model
        model = tf.keras.models.Sequential([
          tf.keras.layers.Dense(i, input_shape=(69,), activation='relu'),
          tf.keras.layers.Dense(j, activation='relu'),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(1, activation='sigmoid')
        ])
        # Compile the model
        model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
        # Fit the model
        model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0, validation_split=0.3, 
                  callbacks=[callback])
        results = model.evaluate(X_test, y_test, verbose=2)
        test_scores.append(results[1])