# Creating our Final Models for NBA Predictions

**Objective:** The objective of this project is to predict the 2024 NBA Champion. This Jupyter notebook contains the models used to estimate each team's probability of becoming the Champion. To develop these predictions, we use three classifiers: Logistic Regression, Random Forest, and XGBoost. Once we have the probabilities from all three classifiers, we average them for each team, and that average is our final result.

* Note: At this point all of the data has been gathered in Excel sheets and prepared in other Jupyter Notebooks. The data loaded in this notebook is the final version of the training and testing datasets.

To build the three classifiers, we need to import the following packages.

* **Pandas** --> Used for Data Manipulation
* **Numpy** --> Used for working with Arrays
* **Matplotlib** --> Used for Plotting
* **Sklearn - Logistic Regression** --> 1 of 3 classifiers used to develop our predictions
* **Sklearn - Random Forest Classifier** --> 2 of 3 classifiers used to develop our predictions
* **XGBoost** --> 3 of 3 classifiers used to develop our predictions
* **Sklearn - OneHotEncoder** --> Encodes categorical data as numerical columns
* **warnings** --> Suppresses the red warning output

### Imports

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None, 'display.max_columns', None)

# Preparing the Training Data

In [51]:
# Pulling in our Training Data
training_data = pd.read_excel('NBA_Stats_2000.xlsx')

In [52]:
# Getting an idea of what our training data looks like
training_data[['Team', 'Opp Field Goal Percentage', 'Year', 'Accolades', 'Longest Win Streak', 'Champion', 
                   'Off 3-Pointer Percentage', 'W/L%']].head()

Unnamed: 0,Team,Opp Field Goal Percentage,Year,Accolades,Longest Win Streak,Champion,Off 3-Pointer Percentage,W/L%
0,San Antonio Spurs,0.425,2000,5,7,No,0.374,0.646
1,New York Knicks,0.424,2000,1,5,No,0.375,0.61
2,Portland Trail Blazers,0.431,2000,1,11,No,0.361,0.72
3,Miami Heat,0.422,2000,3,7,No,0.371,0.634
4,Utah Jazz,0.446,2000,3,9,No,0.385,0.671


In [44]:
# Our training data has 715 rows and 66 variables
training_data.shape

(715, 66)

In [53]:
# Converting the "Net Rating", "Pace", and "Attendance" variables to numeric types
training_data['Pace'] = training_data['Pace'].astype(float)
training_data['Net Rating'] = training_data['Net Rating'].astype(float)

# Attendance: remove the ',' and '.' separators, then convert to whole numbers
training_data['Attendance'] = (
    training_data['Attendance']
    .astype(str)
    .str.replace(',', '', regex = False)
    .str.replace('.', '', regex = False)
    .astype(int)
)
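
To illustrate what the `Attendance` cleanup does, here is a minimal sketch on made-up values (the numbers and the helper name `clean_attendance` are hypothetical, not taken from the dataset). It also shows a caveat: because the decimal point is stripped, a value stored as a float gains an extra digit, so the approach assumes the column holds comma-formatted strings or whole numbers.

```python
import pandas as pd

def clean_attendance(values: pd.Series) -> pd.Series:
    """Strip ',' and '.' from each value and convert it to an integer."""
    as_text = values.astype(str)
    as_text = as_text.str.replace(',', '', regex=False)
    as_text = as_text.str.replace('.', '', regex=False)
    return as_text.astype(int)

# Hypothetical comma-formatted attendance strings
sample = pd.Series(['18,203', '754,321'])
cleaned = clean_attendance(sample)
print(cleaned.tolist())  # [18203, 754321]

# Caveat: a float keeps its trailing '.0', which becomes an extra zero
as_float = clean_attendance(pd.Series([18203.0]))
print(as_float.tolist())  # [182030]
```

If the real column ever arrives as floats, a quick look at `training_data['Attendance'].head()` before and after the conversion would reveal the inflated values.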

In [55]:
# Creating new variables: "Assist to Turnover Ratio" and "Margin of Victory"
training_data['Assist_to_Turnover_Ratio'] = training_data['Off Assists'] / training_data['Off Turnovers']
training_data['Margin_of_Victory'] = training_data['Off Points'] - training_data['Opp Points']

In [56]:
# One Hot Encoding the Training Data
# Let's get the Categorical Variables
categorical_train = training_data[['Champion', 'Top 3 Conference', 'Division', 'Conference', 'MVP']]
numerical_train = training_data.drop(['Champion', 'Top 3 Conference', 'Division', 'Conference', 'Team', 'MVP'], axis = 1)

# Creating the One Hot Encoder
encoder = OneHotEncoder(drop = 'if_binary', sparse_output = False).set_output(transform = 'pandas')
# Encoding the Categorical Data
encoded_data = encoder.fit_transform(categorical_train)
# Having the finalized dataset
train_hot_encoded = numerical_train.join(encoded_data)

In [57]:
# Creating our Feature Vector and Target Vector
x_train = train_hot_encoded.drop(['Champion_Yes'], axis = 1)
x_train = x_train[['Top 3 Conference_Yes', 'MVP_Yes', 'Longest Win Streak', 'W/L%', 'Accolades', 'Net Rating', 'SRS', 
                  'Off Field Goal Percentage', 'Mean Exp', 'Off Blocks', 'Off 2-Pointer Percentage', 'Off Assists',
                  'Off 3-Pointer Percentage', 'Off Defensive Rebounds', 'Assist_to_Turnover_Ratio']]
y_train = train_hot_encoded['Champion_Yes']

# Preparing the Testing Data

In [58]:
# Reading in our Testing Data, this will be used to determine who will become the winner
testing_data = pd.read_excel('Testing Data.xlsx')
# Creating New Variables for the testing data such as "Assist to Turnover Ratio" and "Margin of Victory"
testing_data['Assist_to_Turnover_Ratio'] = testing_data['Off Assists'] / testing_data['Off Turnovers']
testing_data['Margin_of_Victory'] = testing_data['Off Points'] - testing_data['Opp Points']

In [59]:
# One Hot Encoding the Testing Data
# Let's split up the Categorical Variables and Numeric Variables
categorical_test = testing_data[['Champion', 'Top 3 Conference', 'Division', 'Conference', 'MVP']]
numerical_test = testing_data.drop(['Champion', 'Top 3 Conference', 'Division', 'Conference', 'Team', 'MVP'], axis = 1)

# Creating the One Hot Encoder
encoder = OneHotEncoder(drop = 'if_binary', sparse_output = False).set_output(transform = 'pandas')
# Encoding the Testing Categorical Data
encoded_data = encoder.fit_transform(categorical_test)
# Having the finalized test dataset
test_hot_encoded = numerical_test.join(encoded_data)

In [60]:
# Dropping the Championship column since this will be our target variable in what we are predicting
x_test = test_hot_encoded.drop(['Champion_No'], axis = 1)
# Using all of the variables we think will be best in predicting the Champion
x_test = x_test[['Top 3 Conference_Yes', 'MVP_Yes', 'Longest Win Streak', 'W/L%', 'Accolades', 'Net Rating', 'SRS', 
                  'Off Field Goal Percentage', 'Mean Exp', 'Off Blocks', 'Off 2-Pointer Percentage', 'Off Assists',
                  'Off 3-Pointer Percentage', 'Off Defensive Rebounds', 'Assist_to_Turnover_Ratio']]
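
Because the test set is encoded with its own freshly fitted encoder, the resulting column names depend on which categories appear in the test data. For example, a column holding only "No" would presumably be emitted as `..._No` rather than `..._Yes`. A quick sanity check, sketched below on made-up frames (the helper `compare_features` is hypothetical), is to confirm that the test features carry the same names as the training features before predicting.

```python
import pandas as pd

def compare_features(train: pd.DataFrame, test: pd.DataFrame):
    """Return (missing, extra): columns only in train, and columns only in test."""
    missing = [col for col in train.columns if col not in test.columns]
    extra = [col for col in test.columns if col not in train.columns]
    return missing, extra

# Hypothetical frames where the test encoding produced a different column name
toy_train = pd.DataFrame({'MVP_Yes': [1.0, 0.0], 'W/L%': [0.72, 0.61]})
toy_test = pd.DataFrame({'MVP_No': [1.0], 'W/L%': [0.65]})

missing, extra = compare_features(toy_train, toy_test)
print(missing, extra)  # ['MVP_Yes'] ['MVP_No']
```

In this notebook the call would be `compare_features(x_train, x_test)`, and both lists should come back empty.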

# Creating our 3 Classifier Models

Here we develop our three classifier models: Logistic Regression, Random Forest, and XGBoost. Each model estimates the probability that a team becomes the Champion.

### 1.) Logistic Regression

In [62]:
# Creating a Logistic Regression Object with default Hyperparameters
log_model = LogisticRegression(random_state = 2024)
# Training our Logistic Regression Model
log_model.fit(x_train, y_train)

In [63]:
# Getting the probabilities of each team becoming a Champion
probabilities = log_model.predict_proba(x_test)
# Turning those probabilities into a Dataframe
probabilities_df = pd.DataFrame(data = probabilities)
# Normalizing each column so the probabilities sum to 1 across all teams
normalized_probabilities_df = probabilities_df.div(probabilities_df.sum(axis=0), axis=1)
# Creating a dataframe that neatly shows the probability corresponding to each team
final_df1 = testing_data[['Team', 'Year']]
final_df1 = final_df1.join(normalized_probabilities_df)
final_df1 = final_df1.rename({1: 'Logisitic Regression'}, axis = 1)
final_df1 = final_df1.drop([0], axis = 1)
final_df1.head()

Unnamed: 0,Team,Year,Logisitic Regression
0,Minnesota Timberwolves,2024,0.064855
1,New York Knicks,2024,0.038753
2,Orlando Magic,2024,0.002989
3,Miami Heat,2024,0.001766
4,Boston Celtics,2024,0.185752
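
The normalization step divides each column of the `predict_proba` output by that column's total, so the champion column (column 1) sums to 1 across all teams. A minimal sketch with made-up probabilities for three hypothetical teams:

```python
import numpy as np
import pandas as pd

# Hypothetical predict_proba output: column 0 = not champion, column 1 = champion
raw = pd.DataFrame(np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.6, 0.4],
]))

# Divide every column by its own sum (the Series of sums is aligned on column labels)
normalized = raw.div(raw.sum(axis=0), axis=1)

print(normalized[1].round(3).tolist())  # [0.125, 0.375, 0.5]
```

The raw champion probabilities sum to 0.8 here, so each one is scaled up by the same factor and their ranking is unchanged.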


### 2.) Random Forest

In [64]:
# Creating our Random Forest Classifier with default hyperparameters
rnd_clf = RandomForestClassifier(random_state = 42)
# Training our Random Forest Model
rnd_clf.fit(x_train, y_train)

In [65]:
# Getting the probabilities of each team becoming a Champion
probabilities = rnd_clf.predict_proba(x_test)
# Turning those probabilities into a Dataframe
probabilities_df = pd.DataFrame(data = probabilities)
# Normalizing each column so the probabilities sum to 1 across all teams
normalized_probabilities_df = probabilities_df.div(probabilities_df.sum(axis=0), axis=1)
# Creating a dataframe that joins the Random Forest probabilities to the Logistic Regression ones
final_df2 = final_df1.join(normalized_probabilities_df)
final_df2 = final_df2.drop([0], axis = 1)
final_df2 = final_df2.rename({1:'Random Forest'}, axis = 1)
final_df2.head()

Unnamed: 0,Team,Year,Logisitic Regression,Random Forest
0,Minnesota Timberwolves,2024,0.064855,0.037975
1,New York Knicks,2024,0.038753,0.0
2,Orlando Magic,2024,0.002989,0.0
3,Miami Heat,2024,0.001766,0.006329
4,Boston Celtics,2024,0.185752,0.196203


### 3.) XGBoost

In [66]:
# Developing an XGBoost Classifier Model
xgb_model = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss')
# Training our Model
xgb_model.fit(x_train, y_train)

In [67]:
# Getting the probabilities of each team becoming a Champion
probabilities_xgb = xgb_model.predict_proba(x_test)
# Convert the array of probabilities to a DataFrame and normalize them
probabilities_xgb_df = pd.DataFrame(data = probabilities_xgb)
normalized_probabilities_xgb_df = probabilities_xgb_df.div(probabilities_xgb_df.sum(axis=0), axis=1)
# Creating a dataframe that will join the XGBoost Probabilities to the current other probabilities
final_df3 = final_df2.join(normalized_probabilities_xgb_df)
final_df3 = final_df3.drop([0], axis = 1)
final_df3 = final_df3.rename({1:'XG Boost'}, axis = 1)
final_df3.head()

Unnamed: 0,Team,Year,Logisitic Regression,Random Forest,XG Boost
0,Minnesota Timberwolves,2024,0.064855,0.037975,0.015813
1,New York Knicks,2024,0.038753,0.0,0.001584
2,Orlando Magic,2024,0.002989,0.0,0.000759
3,Miami Heat,2024,0.001766,0.006329,0.000806
4,Boston Celtics,2024,0.185752,0.196203,0.5701
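
The three model sections repeat the same predict, normalize, and join steps. One possible way to factor this out is sketched below. The helper `champion_shares` and the `StubModel` class are hypothetical names introduced only for illustration, and the sketch assumes column 1 of `predict_proba` is the champion class, as in the cells above.

```python
import numpy as np
import pandas as pd

def champion_shares(model, features: pd.DataFrame, name: str) -> pd.Series:
    """Return champion-class probabilities rescaled to sum to 1 across teams."""
    champion_prob = model.predict_proba(features)[:, 1]
    return pd.Series(champion_prob / champion_prob.sum(), index=features.index, name=name)

class StubModel:
    """Stand-in for a fitted classifier; returns fixed made-up probabilities."""
    def predict_proba(self, features):
        p = np.array([0.1, 0.3, 0.4])[: len(features)]
        return np.column_stack([1 - p, p])

# Hypothetical feature frame for three teams
toy_features = pd.DataFrame({'W/L%': [0.55, 0.68, 0.78]})
shares = champion_shares(StubModel(), toy_features, 'Stub')
print(shares.round(3).tolist())  # [0.125, 0.375, 0.5]
```

With the fitted models above, a call such as `champion_shares(rnd_clf, x_test, 'Random Forest')` would return a named column that can be joined to `testing_data[['Team', 'Year']]`.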


### Getting our Final Predictions

In [68]:
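# Averaging the three normalized probabilities for each team
final_df3['Mean'] = final_df3[['Logisitic Regression', 'Random Forest', 'XG Boost']].mean(axis=1)
# Showing the five teams with the highest average probability
final_df3.sort_values('Mean', ascending = False)[['Team', 'Mean']][:5]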

Unnamed: 0,Team,Mean
4,Boston Celtics,0.317352
5,Denver Nuggets,0.278632
10,Oklahoma City Thunder,0.105251
20,Milwaukee Bucks,0.057393
26,Indiana Pacers,0.049045
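
Because each model's column already sums to 1 across teams, the row-wise mean also sums to 1, so the `Mean` column can be read as a combined title probability. A small sketch with made-up numbers for three hypothetical teams and placeholder model names:

```python
import pandas as pd

# Hypothetical normalized probabilities; each model column sums to 1
toy = pd.DataFrame({
    'Team': ['A', 'B', 'C'],
    'Model 1': [0.5, 0.3, 0.2],
    'Model 2': [0.6, 0.3, 0.1],
    'Model 3': [0.7, 0.2, 0.1],
})

# Equal-weight average of the three models for each team
toy['Mean'] = toy[['Model 1', 'Model 2', 'Model 3']].mean(axis=1)
ranked = toy.sort_values('Mean', ascending=False)[['Team', 'Mean']]

print(ranked['Team'].tolist())      # ['A', 'B', 'C']
print(round(toy['Mean'].sum(), 6))  # 1.0
```

An equal-weight mean treats the three models as equally reliable, which is the choice made in this notebook.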
