# 207 Final Project - The Impact of LeBron James

In [1]:
# Import needed libraries (not all are used)
import random

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt

import tensorflow as tf
from tensorflow import keras
from keras import metrics
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

tf.get_logger().setLevel('INFO')

# Overview

LeBron James is arguably the most iconic basketball player to ever grace the court. Emerging straight from high school into the NBA in 2003, LeBron has accumulated a staggering array of accolades that remain unparalleled by any other player:

- 4x NBA MVP
- 4x NBA Champion
- 4x Finals MVP
- 2x Olympic Gold Medalist
- 19x All-Star Selection
- 18x All-NBA Team Member
- All-time leading scorer in NBA history
- 8 consecutive NBA Finals appearances (surpassing some entire franchises)
- Statistically recognized as the most clutch player in the history of the game

As the NBA playoffs are now in full swing, we sought to investigate the impact that LeBron James has on his team's performance. To achieve this, we constructed various machine learning models that predict the outcome of a game—win or loss—based on LeBron's performance. Our dataset encompasses every game LeBron has participated in from 2003 until 2023, offering a comprehensive analysis of numerous performance variables.

Dataset: https://www.basketball-reference.com/players/j/jamesle01/gamelog/2004 <br> 
Data Dictionary: https://www.basketball-reference.com/about/glossary.html

### Table of Contents:

1) Data Cleaning & EDA <br>
2) Training, Validation, and Test Split (Time-Series Data) <br>
3) Make X and Y for each respective set (train, val, and test) <br>
4) ML Models Overview <br>
5) Logistic Regression <br>
6) Decision Tree <br>
7) Random Forest <br>
8) Neural Network <br>
9) Results <br>
10) Conclusion <br>

## Data Cleaning & EDA

Our dataset was, for the most part, well-organized and tidy. Some minor adjustments were required, such as column parsing, verifying arithmetic calculations, and generating a few additional columns. However, the integrity of the data itself was impressive, allowing for a seamless exploration and analysis process.

### Cleaning & EDA Specifics:

- Our dataset consists of 1421 records, covering games up to April 9th, 2023, which marks the end of the 2022-2023 regular season.
- We separated the 'result' column (for example, "W (+25)") into two distinct columns: 'win_or_loss' and 'spread'.
- The number of games per season varies because the dataset only includes games LeBron actually played.
- We verified the field goal percentages ('fgp') by recomputing them from 'fg' and 'fga'.
- The 'age' column, stored as years-days (for example, "18-303"), was parsed and converted to a decimal 'decimal_age' column.
- We added a 'year_num' column to facilitate grouping games by season.
- The minutes played column ('mp') was transformed from a minutes:seconds string into decimal minutes.
- All column data types were converted to numeric types compatible with our machine learning models.
- We created one-hot encodings for LeBron's team and the opposing team to represent these categorical variables.
- NaN values in the percentage columns 'threep' (three-point percentage) and 'ftp' (free-throw percentage), which occur in games with no attempts, were replaced with 0.
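
The string-parsing steps in this list can be sketched on a two-row toy frame. This is a minimal illustration, not the notebook's own code: the frame `raw` and the regular expression are assumptions, and the values are copied from the first and last games displayed in this notebook.

```python
import pandas as pd

# Toy rows in the same string formats as the raw game log (hypothetical frame).
raw = pd.DataFrame({
    "result": ["L (-14)", "W (+25)"],
    "age": ["18-303", "38-093"],
    "mp": ["42:50:00", "29:21:00"],
})

# "W (+25)" -> win_or_loss "W", spread 25
parsed = raw["result"].str.extract(
    r"^(?P<win_or_loss>[WL]) \((?P<spread>[+-]\d+)\)$"
)
parsed["spread"] = parsed["spread"].astype(int)

# "18-303" -> 18 years plus 303 days, expressed in decimal years
age = raw["age"].str.split("-", expand=True).astype(int)
parsed["decimal_age"] = (age[0] + age[1] / 365).round(3)

# "42:50:00" -> 42 minutes plus 50 seconds (the trailing ":00" is ignored)
mp = raw["mp"].str.split(":", expand=True).astype(int)
parsed["mp"] = (mp[0] + mp[1] / 60).round(2)

print(parsed)
```

Using `str.extract` with an anchored pattern has the side benefit that any row not matching the expected format comes back as NaN instead of being silently mis-sliced.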



In [2]:
## Read in CSV

lbj_df = pd.read_csv("lebron_full_career.csv")

In [3]:
## Look at head

lbj_df.head()

Unnamed: 0,game,date,age,team,opp,result,mp,fg,fga,fgp,...,orb,drb,trb,ast,stl,blk,tov,pts,game_score,minus_plus
0,1,10/29/2003,18-303,CLE,SAC,L (-14),42:50:00,12,20,0.6,...,2,4,6,9,4,0,2,25,24.7,-9.0
1,2,10/30/2003,18-304,CLE,PHO,L (-9),40:21:00,8,17,0.471,...,2,10,12,8,1,0,7,21,14.7,-3.0
2,3,11/1/2003,18-306,CLE,POR,L (-19),39:10:00,3,12,0.25,...,0,4,4,6,2,0,2,8,5.0,-21.0
3,4,11/5/2003,18-310,CLE,DEN,L (-4),41:06:00,3,11,0.273,...,2,9,11,7,2,3,2,7,11.2,-3.0
4,5,11/7/2003,18-312,CLE,IND,L (-1),43:44:00,8,18,0.444,...,0,5,5,3,0,0,7,23,9.0,-7.0


In [4]:
## Look at tail

lbj_df.tail()

Unnamed: 0,game,date,age,team,opp,result,mp,fg,fga,fgp,...,orb,drb,trb,ast,stl,blk,tov,pts,game_score,minus_plus
1416,51,4/2/2023,38-093,LAL,HOU,W (+25),29:21:00,8,18,0.444,...,2,8,10,11,0,1,1,18,19.4,23.0
1417,52,4/4/2023,38-095,LAL,UTA,W (+2),38:28:00,14,27,0.519,...,0,5,5,6,1,1,5,37,25.3,-7.0
1418,53,4/5/2023,38-096,LAL,LAC,L (-7),35:06:00,13,20,0.65,...,0,8,8,7,1,1,6,33,26.4,-10.0
1419,54,4/7/2023,38-098,LAL,PHO,W (+14),29:21:00,6,19,0.316,...,0,6,6,6,0,0,5,16,5.3,11.0
1420,55,4/9/2023,38-100,LAL,UTA,W (+11),33:13:00,13,25,0.52,...,1,5,6,6,1,1,2,36,29.8,20.0


In [5]:
## Summary stats

lbj_df.describe()

Unnamed: 0,game,fg,fga,fgp,three,threeatt,threep,ft,fta,ftp,orb,drb,trb,ast,stl,blk,tov,pts,game_score,minus_plus
count,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1381.0,1421.0,1421.0,1407.0,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1420.0
mean,36.796622,9.959184,19.735398,0.505616,1.591133,4.618578,0.312681,5.691063,7.741027,0.72897,1.172414,6.334272,7.506685,7.332864,1.538353,0.755102,3.494722,27.200563,22.242646,5.08662
std,21.664897,3.099936,4.780131,0.110395,1.501044,2.572508,0.240515,3.368388,4.118038,0.192436,1.170277,2.719287,3.016059,2.96215,1.281823,0.898659,1.819536,7.839582,7.742163,12.676835
min,1.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,-0.1,-39.0
25%,18.0,8.0,16.0,0.435,0.0,3.0,0.125,3.0,5.0,0.615,0.0,4.0,5.0,5.0,1.0,0.0,2.0,22.0,17.0,-3.0
50%,36.0,10.0,20.0,0.5,1.0,4.0,0.333,5.0,7.0,0.75,1.0,6.0,7.0,7.0,1.0,1.0,3.0,27.0,22.5,6.0
75%,54.0,12.0,23.0,0.579,2.0,6.0,0.5,8.0,10.0,0.857,2.0,8.0,9.0,9.0,2.0,1.0,5.0,32.0,27.3,14.0
max,82.0,23.0,36.0,0.929,9.0,14.0,1.0,24.0,28.0,1.0,7.0,17.0,19.0,19.0,7.0,5.0,11.0,61.0,53.2,39.0


In [6]:
##Dtypes

lbj_df.dtypes

game            int64
date           object
age            object
team           object
opp            object
result         object
mp             object
fg              int64
fga             int64
fgp           float64
three           int64
threeatt        int64
threep        float64
ft              int64
fta             int64
ftp           float64
orb             int64
drb             int64
trb             int64
ast             int64
stl             int64
blk             int64
tov             int64
pts             int64
game_score    float64
minus_plus    float64
dtype: object

In [7]:
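## Check NA
## Note: .isna().count() returns the number of rows per column (1421 here), not the
## number of missing values; .isna().sum() gives the per-column NaN counts.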

lbj_df.isna().count()

game          1421
date          1421
age           1421
team          1421
opp           1421
result        1421
mp            1421
fg            1421
fga           1421
fgp           1421
three         1421
threeatt      1421
threep        1421
ft            1421
fta           1421
ftp           1421
orb           1421
drb           1421
trb           1421
ast           1421
stl           1421
blk           1421
tov           1421
pts           1421
game_score    1421
minus_plus    1421
dtype: int64
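
The difference between counting rows and counting missing values can be seen on a small example. This is a sketch on a made-up frame (`toy` is hypothetical), not the LeBron data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing three-point percentage.
toy = pd.DataFrame({"threep": [0.5, np.nan, 0.25], "pts": [25, 21, 8]})

# count() tallies non-null entries of the boolean mask, and a mask has no nulls,
# so every column reports the total number of rows.
row_counts = toy.isna().count()

# sum() treats True as 1, so it reports the number of NaNs per column.
nan_counts = toy.isna().sum()

print(row_counts)
print(nan_counts)
```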

In [8]:
## Parse the result column to get W/L and spread separately

lbj_df['win_or_loss'] = lbj_df['result'].str[0]
lbj_df['spread'] = lbj_df['result'].str[3:-1]

In [9]:
## Check if it worked

lbj_df

Unnamed: 0,game,date,age,team,opp,result,mp,fg,fga,fgp,...,trb,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread
0,1,10/29/2003,18-303,CLE,SAC,L (-14),42:50:00,12,20,0.600,...,6,9,4,0,2,25,24.7,-9.0,L,-14
1,2,10/30/2003,18-304,CLE,PHO,L (-9),40:21:00,8,17,0.471,...,12,8,1,0,7,21,14.7,-3.0,L,-9
2,3,11/1/2003,18-306,CLE,POR,L (-19),39:10:00,3,12,0.250,...,4,6,2,0,2,8,5.0,-21.0,L,-19
3,4,11/5/2003,18-310,CLE,DEN,L (-4),41:06:00,3,11,0.273,...,11,7,2,3,2,7,11.2,-3.0,L,-4
4,5,11/7/2003,18-312,CLE,IND,L (-1),43:44:00,8,18,0.444,...,5,3,0,0,7,23,9.0,-7.0,L,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,38-093,LAL,HOU,W (+25),29:21:00,8,18,0.444,...,10,11,0,1,1,18,19.4,23.0,W,+25
1417,52,4/4/2023,38-095,LAL,UTA,W (+2),38:28:00,14,27,0.519,...,5,6,1,1,5,37,25.3,-7.0,W,+2
1418,53,4/5/2023,38-096,LAL,LAC,L (-7),35:06:00,13,20,0.650,...,8,7,1,1,6,33,26.4,-10.0,L,-7
1419,54,4/7/2023,38-098,LAL,PHO,W (+14),29:21:00,6,19,0.316,...,6,6,0,0,5,16,5.3,11.0,W,+14


In [10]:
## Drop the result column

lbj_df.drop('result', axis=1, inplace=True)

In [11]:
## Check it dropped

lbj_df

Unnamed: 0,game,date,age,team,opp,mp,fg,fga,fgp,three,...,trb,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread
0,1,10/29/2003,18-303,CLE,SAC,42:50:00,12,20,0.600,0,...,6,9,4,0,2,25,24.7,-9.0,L,-14
1,2,10/30/2003,18-304,CLE,PHO,40:21:00,8,17,0.471,1,...,12,8,1,0,7,21,14.7,-3.0,L,-9
2,3,11/1/2003,18-306,CLE,POR,39:10:00,3,12,0.250,0,...,4,6,2,0,2,8,5.0,-21.0,L,-19
3,4,11/5/2003,18-310,CLE,DEN,41:06:00,3,11,0.273,0,...,11,7,2,3,2,7,11.2,-3.0,L,-4
4,5,11/7/2003,18-312,CLE,IND,43:44:00,8,18,0.444,1,...,5,3,0,0,7,23,9.0,-7.0,L,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,38-093,LAL,HOU,29:21:00,8,18,0.444,1,...,10,11,0,1,1,18,19.4,23.0,W,+25
1417,52,4/4/2023,38-095,LAL,UTA,38:28:00,14,27,0.519,3,...,5,6,1,1,5,37,25.3,-7.0,W,+2
1418,53,4/5/2023,38-096,LAL,LAC,35:06:00,13,20,0.650,4,...,8,7,1,1,6,33,26.4,-10.0,L,-7
1419,54,4/7/2023,38-098,LAL,PHO,29:21:00,6,19,0.316,3,...,6,6,0,0,5,16,5.3,11.0,W,+14


In [12]:
## Check unique values for W and L, matches the total records in table

lbj_df['win_or_loss'].value_counts()

W    924
L    497
Name: win_or_loss, dtype: int64

In [13]:
## Double check the fgp is correct based on the fg and fga

lbj_df['Calculated Field Goal Percentage'] = (lbj_df['fg'] / lbj_df['fga']).round(3)

In [14]:
## See if it added the column

lbj_df

Unnamed: 0,game,date,age,team,opp,mp,fg,fga,fgp,three,...,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread,Calculated Field Goal Percentage
0,1,10/29/2003,18-303,CLE,SAC,42:50:00,12,20,0.600,0,...,9,4,0,2,25,24.7,-9.0,L,-14,0.600
1,2,10/30/2003,18-304,CLE,PHO,40:21:00,8,17,0.471,1,...,8,1,0,7,21,14.7,-3.0,L,-9,0.471
2,3,11/1/2003,18-306,CLE,POR,39:10:00,3,12,0.250,0,...,6,2,0,2,8,5.0,-21.0,L,-19,0.250
3,4,11/5/2003,18-310,CLE,DEN,41:06:00,3,11,0.273,0,...,7,2,3,2,7,11.2,-3.0,L,-4,0.273
4,5,11/7/2003,18-312,CLE,IND,43:44:00,8,18,0.444,1,...,3,0,0,7,23,9.0,-7.0,L,-1,0.444
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,38-093,LAL,HOU,29:21:00,8,18,0.444,1,...,11,0,1,1,18,19.4,23.0,W,+25,0.444
1417,52,4/4/2023,38-095,LAL,UTA,38:28:00,14,27,0.519,3,...,6,1,1,5,37,25.3,-7.0,W,+2,0.519
1418,53,4/5/2023,38-096,LAL,LAC,35:06:00,13,20,0.650,4,...,7,1,1,6,33,26.4,-10.0,L,-7,0.650
1419,54,4/7/2023,38-098,LAL,PHO,29:21:00,6,19,0.316,3,...,6,0,0,5,16,5.3,11.0,W,+14,0.316


In [15]:
##Check

lbj_df['Calculated Field Goal Percentage'] == lbj_df['fgp']

0       True
1       True
2       True
3       True
4       True
        ... 
1416    True
1417    True
1418    True
1419    True
1420    True
Length: 1421, dtype: bool
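
Because the displayed output only shows the first and last rows, a single summary value can be easier to read. The sketch below is one possible way to do that check, using a tolerance of half a unit in the third decimal; the frame `toy` is hypothetical and holds the first three games shown above.

```python
import numpy as np
import pandas as pd

# First three games from the table above (hypothetical standalone frame).
toy = pd.DataFrame({
    "fg": [12, 8, 3],
    "fga": [20, 17, 12],
    "fgp": [0.6, 0.471, 0.25],
})

# Recompute the percentage without rounding, then compare within rounding error.
recomputed = toy["fg"] / toy["fga"]
matches = np.isclose(recomputed, toy["fgp"], rtol=0, atol=0.0005)

# One boolean for the whole column instead of a truncated Series display.
all_match = bool(matches.all())
print(all_match)
```

Comparing with a tolerance avoids false mismatches when the source rounds differently from `Series.round`.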

In [16]:
## The math lines up (any differences are only slight rounding), so drop the helper column.

lbj_df.drop('Calculated Field Goal Percentage', axis=1, inplace=True)

In [17]:
# Check

lbj_df.head()

Unnamed: 0,game,date,age,team,opp,mp,fg,fga,fgp,three,...,trb,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread
0,1,10/29/2003,18-303,CLE,SAC,42:50:00,12,20,0.6,0,...,6,9,4,0,2,25,24.7,-9.0,L,-14
1,2,10/30/2003,18-304,CLE,PHO,40:21:00,8,17,0.471,1,...,12,8,1,0,7,21,14.7,-3.0,L,-9
2,3,11/1/2003,18-306,CLE,POR,39:10:00,3,12,0.25,0,...,4,6,2,0,2,8,5.0,-21.0,L,-19
3,4,11/5/2003,18-310,CLE,DEN,41:06:00,3,11,0.273,0,...,11,7,2,3,2,7,11.2,-3.0,L,-4
4,5,11/7/2003,18-312,CLE,IND,43:44:00,8,18,0.444,1,...,5,3,0,0,7,23,9.0,-7.0,L,-1


In [18]:
# check for inconsistencies between the "win_or_loss" and "spread" columns

inconsistent_rows = ((lbj_df['spread'].str.contains('-') & (lbj_df['win_or_loss'] != 'L')) |
                     (lbj_df['spread'].str.contains(r'\+') & (lbj_df['win_or_loss'] != 'W')))

inconsistent_rows.to_frame()

# display the inconsistent rows

lbj_df[inconsistent_rows]

Unnamed: 0,game,date,age,team,opp,mp,fg,fga,fgp,three,...,trb,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread


In [19]:
## AGE COLUMN

## Parse the age column
age_parts = lbj_df['age'].str.split('-', expand=True)
years = age_parts[0].astype(int)
days = age_parts[1].astype(int)

# calculate the decimal age by dividing the days out of 365 rounded to third decimal
lbj_df['decimal_age'] = (years + (days / 365)).round(3)

In [20]:
lbj_df

Unnamed: 0,game,date,age,team,opp,mp,fg,fga,fgp,three,...,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread,decimal_age
0,1,10/29/2003,18-303,CLE,SAC,42:50:00,12,20,0.600,0,...,9,4,0,2,25,24.7,-9.0,L,-14,18.830
1,2,10/30/2003,18-304,CLE,PHO,40:21:00,8,17,0.471,1,...,8,1,0,7,21,14.7,-3.0,L,-9,18.833
2,3,11/1/2003,18-306,CLE,POR,39:10:00,3,12,0.250,0,...,6,2,0,2,8,5.0,-21.0,L,-19,18.838
3,4,11/5/2003,18-310,CLE,DEN,41:06:00,3,11,0.273,0,...,7,2,3,2,7,11.2,-3.0,L,-4,18.849
4,5,11/7/2003,18-312,CLE,IND,43:44:00,8,18,0.444,1,...,3,0,0,7,23,9.0,-7.0,L,-1,18.855
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,38-093,LAL,HOU,29:21:00,8,18,0.444,1,...,11,0,1,1,18,19.4,23.0,W,+25,38.255
1417,52,4/4/2023,38-095,LAL,UTA,38:28:00,14,27,0.519,3,...,6,1,1,5,37,25.3,-7.0,W,+2,38.260
1418,53,4/5/2023,38-096,LAL,LAC,35:06:00,13,20,0.650,4,...,7,1,1,6,33,26.4,-10.0,L,-7,38.263
1419,54,4/7/2023,38-098,LAL,PHO,29:21:00,6,19,0.316,3,...,6,0,0,5,16,5.3,11.0,W,+14,38.268


In [21]:
## Drop age column
lbj_df.drop('age', axis=1, inplace=True)

In [22]:
## Check it dropped
lbj_df

Unnamed: 0,game,date,team,opp,mp,fg,fga,fgp,three,threeatt,...,ast,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread,decimal_age
0,1,10/29/2003,CLE,SAC,42:50:00,12,20,0.600,0,2,...,9,4,0,2,25,24.7,-9.0,L,-14,18.830
1,2,10/30/2003,CLE,PHO,40:21:00,8,17,0.471,1,5,...,8,1,0,7,21,14.7,-3.0,L,-9,18.833
2,3,11/1/2003,CLE,POR,39:10:00,3,12,0.250,0,1,...,6,2,0,2,8,5.0,-21.0,L,-19,18.838
3,4,11/5/2003,CLE,DEN,41:06:00,3,11,0.273,0,2,...,7,2,3,2,7,11.2,-3.0,L,-4,18.849
4,5,11/7/2003,CLE,IND,43:44:00,8,18,0.444,1,2,...,3,0,0,7,23,9.0,-7.0,L,-1,18.855
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,LAL,HOU,29:21:00,8,18,0.444,1,7,...,11,0,1,1,18,19.4,23.0,W,+25,38.255
1417,52,4/4/2023,LAL,UTA,38:28:00,14,27,0.519,3,10,...,6,1,1,5,37,25.3,-7.0,W,+2,38.260
1418,53,4/5/2023,LAL,LAC,35:06:00,13,20,0.650,4,6,...,7,1,1,6,33,26.4,-10.0,L,-7,38.263
1419,54,4/7/2023,LAL,PHO,29:21:00,6,19,0.316,3,7,...,6,0,0,5,16,5.3,11.0,W,+14,38.268


In [23]:
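# Create list of date ranges to group games by NBA season
# Note: 'date' is still an M/D/YYYY string at this point, so these comparisons are
# lexicographic rather than chronological; convert with pd.to_datetime first to get
# true season buckets (the year_num summary in cell 27 shows values piling up at 1 and 20).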
year_ranges = [
    (lbj_df['date'] < '2004-08-01'),
    (lbj_df['date'] >= '2004-08-01') & (lbj_df['date'] < '2005-08-01'),
    (lbj_df['date'] >= '2005-08-01') & (lbj_df['date'] < '2006-08-01'),
    (lbj_df['date'] >= '2006-08-01') & (lbj_df['date'] < '2007-08-01'),
    (lbj_df['date'] >= '2007-08-01') & (lbj_df['date'] < '2008-08-01'),
    (lbj_df['date'] >= '2008-08-01') & (lbj_df['date'] < '2009-08-01'),
    (lbj_df['date'] >= '2009-08-01') & (lbj_df['date'] < '2010-08-01'),
    (lbj_df['date'] >= '2010-08-01') & (lbj_df['date'] < '2011-08-01'),
    (lbj_df['date'] >= '2011-08-01') & (lbj_df['date'] < '2012-08-01'),
    (lbj_df['date'] >= '2012-08-01') & (lbj_df['date'] < '2013-08-01'),
    (lbj_df['date'] >= '2013-08-01') & (lbj_df['date'] < '2014-08-01'),
    (lbj_df['date'] >= '2014-08-01') & (lbj_df['date'] < '2015-08-01'),
    (lbj_df['date'] >= '2015-08-01') & (lbj_df['date'] < '2016-08-01'),
    (lbj_df['date'] >= '2016-08-01') & (lbj_df['date'] < '2017-08-01'),
    (lbj_df['date'] >= '2017-08-01') & (lbj_df['date'] < '2018-08-01'),
    (lbj_df['date'] >= '2018-08-01') & (lbj_df['date'] < '2019-08-01'),
    (lbj_df['date'] >= '2019-08-01') & (lbj_df['date'] < '2020-11-01'),
    (lbj_df['date'] >= '2020-11-01') & (lbj_df['date'] < '2021-08-01'),
    (lbj_df['date'] >= '2021-08-01') & (lbj_df['date'] < '2022-08-01'),
    (lbj_df['date'] >= '2022-08-01')
     ]

# Create array
year_vals = np.arange(1,21)

In [24]:
# Add year_num column to assign each game to the corresponding NBA season
lbj_df['year_num'] = np.select(year_ranges, year_vals)
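
Season bucketing of this kind depends on the dates being compared as dates. The following is a hedged sketch, not the notebook's method: it reuses the same cutoff dates as the ranges above but parses the strings first and assigns seasons with `np.searchsorted`. The frame `toy` and its dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical games in the same M/D/YYYY string format as the raw data.
toy = pd.DataFrame({"date": [
    "10/29/2003", "1/15/2004", "4/9/2005", "8/10/2020", "12/22/2020", "4/9/2023",
]})

# Comparing the raw strings is lexicographic: "12/22/2020" sorts before "2004-08-01".
string_cmp = (toy["date"] < "2004-08-01").tolist()

# Parse to real timestamps before comparing.
dates = pd.to_datetime(toy["date"], format="%m/%d/%Y")

# Same season boundaries as the ranges above: August 1 each year, except the
# 2019-20 season, which is extended to November 1, 2020.
cutoffs = pd.to_datetime(
    [f"{year}-08-01" for year in range(2004, 2020)]
    + ["2020-11-01", "2021-08-01", "2022-08-01"]
)

# Number of cutoffs at or before each date, plus one, gives the season number (1 to 20).
toy["year_num"] = np.searchsorted(cutoffs.values, dates.values, side="right") + 1

print(string_cmp)
print(toy)
```

`side="right"` makes a game played exactly on a cutoff date belong to the new season, matching the `>=` conditions in the original ranges.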

In [25]:
# Check 
lbj_df

Unnamed: 0,game,date,team,opp,mp,fg,fga,fgp,three,threeatt,...,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread,decimal_age,year_num
0,1,10/29/2003,CLE,SAC,42:50:00,12,20,0.600,0,2,...,4,0,2,25,24.7,-9.0,L,-14,18.830,1
1,2,10/30/2003,CLE,PHO,40:21:00,8,17,0.471,1,5,...,1,0,7,21,14.7,-3.0,L,-9,18.833,1
2,3,11/1/2003,CLE,POR,39:10:00,3,12,0.250,0,1,...,2,0,2,8,5.0,-21.0,L,-19,18.838,1
3,4,11/5/2003,CLE,DEN,41:06:00,3,11,0.273,0,2,...,2,3,2,7,11.2,-3.0,L,-4,18.849,1
4,5,11/7/2003,CLE,IND,43:44:00,8,18,0.444,1,2,...,0,0,7,23,9.0,-7.0,L,-1,18.855,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,LAL,HOU,29:21:00,8,18,0.444,1,7,...,0,1,1,18,19.4,23.0,W,+25,38.255,20
1417,52,4/4/2023,LAL,UTA,38:28:00,14,27,0.519,3,10,...,1,1,5,37,25.3,-7.0,W,+2,38.260,20
1418,53,4/5/2023,LAL,LAC,35:06:00,13,20,0.650,4,6,...,1,1,6,33,26.4,-10.0,L,-7,38.263,20
1419,54,4/7/2023,LAL,PHO,29:21:00,6,19,0.316,3,7,...,0,0,5,16,5.3,11.0,W,+14,38.268,20


In [26]:
#Check dtypes again

lbj_df.dtypes

game             int64
date            object
team            object
opp             object
mp              object
fg               int64
fga              int64
fgp            float64
three            int64
threeatt         int64
threep         float64
ft               int64
fta              int64
ftp            float64
orb              int64
drb              int64
trb              int64
ast              int64
stl              int64
blk              int64
tov              int64
pts              int64
game_score     float64
minus_plus     float64
win_or_loss     object
spread          object
decimal_age    float64
year_num         int32
dtype: object

In [27]:
lbj_df.describe()

Unnamed: 0,game,fg,fga,fgp,three,threeatt,threep,ft,fta,ftp,...,trb,ast,stl,blk,tov,pts,game_score,minus_plus,decimal_age,year_num
count,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1381.0,1421.0,1421.0,1407.0,...,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,1420.0,1421.0,1421.0
mean,36.796622,9.959184,19.735398,0.505616,1.591133,4.618578,0.312681,5.691063,7.741027,0.72897,...,7.506685,7.332864,1.538353,0.755102,3.494722,27.200563,22.242646,5.08662,27.947809,6.187896
std,21.664897,3.099936,4.780131,0.110395,1.501044,2.572508,0.240515,3.368388,4.118038,0.192436,...,3.016059,2.96215,1.281823,0.898659,1.819536,7.839582,7.742163,12.676835,5.613479,8.467953
min,1.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.0,-0.1,-39.0,18.83,1.0
25%,18.0,8.0,16.0,0.435,0.0,3.0,0.125,3.0,5.0,0.615,...,5.0,5.0,1.0,0.0,2.0,22.0,17.0,-3.0,23.093,1.0
50%,36.0,10.0,20.0,0.5,1.0,4.0,0.333,5.0,7.0,0.75,...,7.0,7.0,1.0,1.0,3.0,27.0,22.5,6.0,27.97,1.0
75%,54.0,12.0,23.0,0.579,2.0,6.0,0.5,8.0,10.0,0.857,...,9.0,9.0,2.0,1.0,5.0,32.0,27.3,14.0,32.819,20.0
max,82.0,23.0,36.0,0.929,9.0,14.0,1.0,24.0,28.0,1.0,...,19.0,19.0,7.0,5.0,11.0,61.0,53.2,39.0,38.274,20.0


In [28]:
# Convert the minutes played column to decimal minutes instead of "minutes:seconds"

def convert_to_decimal(time_str):
    # Values look like "42:50:00" (minutes:seconds plus a spurious ":00") or "42:50".
    # Only the first two fields are used, so both shapes are handled the same way.
    if isinstance(time_str, str) and ':' in time_str:
        parts = time_str.split(':')
        minutes, seconds = int(parts[0]), int(parts[1])
        return round(minutes + seconds / 60, 2)
    return np.nan

lbj_df['mp'] = lbj_df['mp'].apply(convert_to_decimal)


In [29]:
lbj_df

Unnamed: 0,game,date,team,opp,mp,fg,fga,fgp,three,threeatt,...,stl,blk,tov,pts,game_score,minus_plus,win_or_loss,spread,decimal_age,year_num
0,1,10/29/2003,CLE,SAC,42.83,12,20,0.600,0,2,...,4,0,2,25,24.7,-9.0,L,-14,18.830,1
1,2,10/30/2003,CLE,PHO,40.35,8,17,0.471,1,5,...,1,0,7,21,14.7,-3.0,L,-9,18.833,1
2,3,11/1/2003,CLE,POR,39.17,3,12,0.250,0,1,...,2,0,2,8,5.0,-21.0,L,-19,18.838,1
3,4,11/5/2003,CLE,DEN,41.10,3,11,0.273,0,2,...,2,3,2,7,11.2,-3.0,L,-4,18.849,1
4,5,11/7/2003,CLE,IND,43.73,8,18,0.444,1,2,...,0,0,7,23,9.0,-7.0,L,-1,18.855,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,4/2/2023,LAL,HOU,29.35,8,18,0.444,1,7,...,0,1,1,18,19.4,23.0,W,+25,38.255,20
1417,52,4/4/2023,LAL,UTA,38.47,14,27,0.519,3,10,...,1,1,5,37,25.3,-7.0,W,+2,38.260,20
1418,53,4/5/2023,LAL,LAC,35.10,13,20,0.650,4,6,...,1,1,6,33,26.4,-10.0,L,-7,38.263,20
1419,54,4/7/2023,LAL,PHO,29.35,6,19,0.316,3,7,...,0,0,5,16,5.3,11.0,W,+14,38.268,20


In [30]:
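# Convert all the datatypes to numeric types the ML models can consume.
# Note: casting a float column to int64 truncates the fractional part, so
# 'decimal_age', 'mp', and 'game_score' lose their decimals, and 'fgp'
# (maximum 0.929 in this dataset) becomes 0 in every row.

lbj_df['spread'] = lbj_df['spread'].astype('int64')
lbj_df['year_num'] = lbj_df['year_num'].astype('int64')
lbj_df['decimal_age'] = lbj_df['decimal_age'].astype('int64')
lbj_df['mp'] = lbj_df['mp'].astype('int64')
lbj_df['fgp'] = lbj_df['fgp'].astype('int64')
lbj_df['game_score'] = lbj_df['game_score'].astype('int64')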
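
The effect of an integer cast on fractional columns can be checked on a few values. This is an illustrative sketch on a hypothetical frame `toy`; keeping such columns as `float64` is one alternative, since scikit-learn and Keras both accept float inputs.

```python
import pandas as pd

# Hypothetical percentages and decimal minutes in the same ranges as the dataset.
toy = pd.DataFrame({
    "fgp": [0.6, 0.471, 0.929, 1.0],
    "mp": [42.83, 40.35, 39.17, 29.35],
})

# Casting to int64 drops everything after the decimal point.
truncated = toy.astype("int64")

# Casting to float64 keeps the information.
as_float = toy.astype("float64")

print(truncated)
print(as_float)
```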



In [31]:
## Apply One-Hot Encodings to Categorical Variables
## This adds a ton of variables

lbj_df = pd.get_dummies(lbj_df, columns=['team', 'opp'])
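
As a small illustration of what this call produces, the sketch below one-hot encodes a three-row toy frame (`toy` is hypothetical). The explicit `dtype="uint8"` is an assumption added here so the indicator columns are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical rows with the two categorical columns used in the notebook.
toy = pd.DataFrame({
    "team": ["CLE", "MIA", "LAL"],
    "opp": ["SAC", "BOS", "SAC"],
    "pts": [25, 30, 18],
})

# Each category becomes its own indicator column, prefixed by the source column name.
encoded = pd.get_dummies(toy, columns=["team", "opp"], dtype="uint8")

print(encoded.columns.tolist())
print(encoded)
```

Every distinct team abbreviation adds one column, which is why the full dataset grows to 66 columns after this step.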


In [32]:
#Convert more, this time the date column
lbj_df['date'] = pd.to_datetime(lbj_df['date'])


lbj_df['year'] = lbj_df['date'].dt.year
lbj_df['month'] = lbj_df['date'].dt.month
lbj_df['day'] = lbj_df['date'].dt.day

lbj_df = lbj_df.drop(columns=['date'])

In [33]:
# Calculate the count of NaN values in each column
nan_count = lbj_df.isna().sum()

# Filter and display only the columns that have NaN values
nan_count_filtered = nan_count[nan_count > 0]
print(nan_count_filtered)

threep        40
ftp           14
minus_plus     1
dtype: int64


In [34]:
# Replace NaN values in 'threep' column with 0
lbj_df['threep'] = lbj_df['threep'].fillna(0)

# Replace NaN values in 'ftp' column with 0
lbj_df['ftp'] = lbj_df['ftp'].fillna(0)

In [35]:
## Look at the set

lbj_df

Unnamed: 0,game,mp,fg,fga,fgp,three,threeatt,threep,ft,fta,...,opp_POR,opp_SAC,opp_SAS,opp_SEA,opp_TOR,opp_UTA,opp_WAS,year,month,day
0,1,42,12,20,0,0,2,0.000,1,3,...,0,1,0,0,0,0,0,2003,10,29
1,2,40,8,17,0,1,5,0.200,4,7,...,0,0,0,0,0,0,0,2003,10,30
2,3,39,3,12,0,0,1,0.000,2,2,...,1,0,0,0,0,0,0,2003,11,1
3,4,41,3,11,0,0,2,0.000,1,1,...,0,0,0,0,0,0,0,2003,11,5
4,5,43,8,18,0,1,2,0.500,6,7,...,0,0,0,0,0,0,0,2003,11,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,29,8,18,0,1,7,0.143,1,1,...,0,0,0,0,0,0,0,2023,4,2
1417,52,38,14,27,0,3,10,0.300,6,6,...,0,0,0,0,0,1,0,2023,4,4
1418,53,35,13,20,0,4,6,0.667,3,5,...,0,0,0,0,0,0,0,2023,4,5
1419,54,29,6,19,0,3,7,0.429,1,2,...,0,0,0,0,0,0,0,2023,4,7


In [36]:
# Check the dtypes one more time
lbj_df.dtypes

game       int64
mp         int64
fg         int64
fga        int64
fgp        int64
           ...  
opp_UTA    uint8
opp_WAS    uint8
year       int64
month      int64
day        int64
Length: 66, dtype: object

In [37]:
# IGNORE Calculate Lebron's Off the Court Minus_Plus 

# lbj_df['off_court_minus_plus'] = lbj_df['spread'] - lbj_df['minus_plus']

In [38]:
# IGNORE Check that it worked

# lbj_df['off_court_minus_plus']

In [39]:
# Drop spread (it directly indicates win or loss, so we don't want it as a model input)

lbj_df = lbj_df.drop(columns=['spread'])

In [40]:
lbj_df.columns.to_list()

['game',
 'mp',
 'fg',
 'fga',
 'fgp',
 'three',
 'threeatt',
 'threep',
 'ft',
 'fta',
 'ftp',
 'orb',
 'drb',
 'trb',
 'ast',
 'stl',
 'blk',
 'tov',
 'pts',
 'game_score',
 'minus_plus',
 'win_or_loss',
 'decimal_age',
 'year_num',
 'team_CLE',
 'team_LAL',
 'team_MIA',
 'opp_ATL',
 'opp_BOS',
 'opp_BRK',
 'opp_CHA',
 'opp_CHI',
 'opp_CHO',
 'opp_CLE',
 'opp_DAL',
 'opp_DEN',
 'opp_DET',
 'opp_GSW',
 'opp_HOU',
 'opp_IND',
 'opp_LAC',
 'opp_LAL',
 'opp_MEM',
 'opp_MIA',
 'opp_MIL',
 'opp_MIN',
 'opp_NJN',
 'opp_NOH',
 'opp_NOK',
 'opp_NOP',
 'opp_NYK',
 'opp_OKC',
 'opp_ORL',
 'opp_PHI',
 'opp_PHO',
 'opp_POR',
 'opp_SAC',
 'opp_SAS',
 'opp_SEA',
 'opp_TOR',
 'opp_UTA',
 'opp_WAS',
 'year',
 'month',
 'day']

# Training, Validation, and Test Split

As our dataset comprises time-series data, our primary objective with machine learning is to make accurate predictions about future outcomes. To achieve this, we considered several methodologies for splitting our data but ultimately settled on the following approach:

- We chose a chronological split for our training, validation, and test sets, as it aligns with the nature of time-series data and ensures the model can learn from past events to predict future outcomes more effectively.
- We allocated a 70/15/15 distribution for our train, validation, and test sets.
- We were diligent in avoiding any overlap between the sets, as demonstrated below.
- Each set (train, validation, and test) was further divided into X and Y components, with X representing all input variables and Y representing the outcome (in our case, Win or Loss).

By using this chronological splitting method, we were able to maintain the integrity of the time-series data, ensuring that our models are better equipped to predict future game outcomes based on LeBron's performance.
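
A chronological 70/15/15 split can be sketched with purely positional slicing. The helper `chronological_split` below is hypothetical and assumes the rows are already sorted by date; because the test set takes whatever remains after rounding, its sizes can differ by one row from the sets built in this notebook.

```python
import pandas as pd

def chronological_split(df, train_frac=0.70, val_frac=0.15):
    """Split rows in their existing order; the test set takes the remainder."""
    n_train = int(len(df) * train_frac)
    n_val = int(len(df) * val_frac)
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:]
    return train, val, test

# Stand-in for the 1421 games, ordered from oldest to newest.
toy = pd.DataFrame({"game_index": range(1421)})
train, val, test = chronological_split(toy)

# Every training game precedes every validation game, which precedes every test game.
no_overlap = (
    train["game_index"].max() < val["game_index"].min()
    and val["game_index"].max() < test["game_index"].min()
)
print(len(train), len(val), len(test), no_overlap)
```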

In [41]:
### Calculate the total length

n_total = len(lbj_df)
n_train = int(n_total * 0.70)
n_val = int(n_total * 0.15)
n_test = int(n_total * 0.15)

In [42]:
## Double check the lengths
print('dataset:', n_total)
print('train:', n_train)
print('validation:', n_val)
print('test:', n_test)

dataset: 1421
train: 994
validation: 213
test: 213


In [43]:
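## Create the actual sets themselves (positional slices in chronological order)
train_df = lbj_df.iloc[:n_train]
val_df = lbj_df.iloc[n_train:n_train + n_val + 1]  # 214 rows: includes the row left over by int() rounding
test_df = lbj_df.iloc[n_train + n_val + 1:]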

In [44]:
## Check it worked
train_df

Unnamed: 0,game,mp,fg,fga,fgp,three,threeatt,threep,ft,fta,...,opp_POR,opp_SAC,opp_SAS,opp_SEA,opp_TOR,opp_UTA,opp_WAS,year,month,day
0,1,42,12,20,0,0,2,0.000,1,3,...,0,1,0,0,0,0,0,2003,10,29
1,2,40,8,17,0,1,5,0.200,4,7,...,0,0,0,0,0,0,0,2003,10,30
2,3,39,3,12,0,0,1,0.000,2,2,...,1,0,0,0,0,0,0,2003,11,1
3,4,41,3,11,0,0,2,0.000,1,1,...,0,0,0,0,0,0,0,2003,11,5
4,5,43,8,18,0,1,2,0.500,6,7,...,0,0,0,0,0,0,0,2003,11,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
989,3,39,7,16,0,2,6,0.333,7,11,...,0,0,0,0,0,0,0,2016,10,29
990,4,35,6,12,0,1,3,0.333,6,10,...,0,0,0,0,0,0,0,2016,11,1
991,5,36,12,22,0,1,4,0.250,5,5,...,0,0,0,0,0,0,0,2016,11,3
992,6,35,9,23,0,1,5,0.200,6,7,...,0,0,0,0,0,0,0,2016,11,5


In [45]:
## Check it worked
val_df

Unnamed: 0,game,mp,fg,fga,fgp,three,threeatt,threep,ft,fta,...,opp_POR,opp_SAC,opp_SAS,opp_SEA,opp_TOR,opp_UTA,opp_WAS,year,month,day
994,8,38,9,18,0,3,4,0.750,6,9,...,0,0,0,0,0,0,1,2016,11,11
995,9,38,8,21,0,2,5,0.400,1,2,...,0,0,0,0,0,0,0,2016,11,13
996,10,38,10,15,0,2,5,0.400,6,10,...,0,0,0,0,1,0,0,2016,11,15
997,11,28,9,14,0,1,3,0.333,2,3,...,0,0,0,0,0,0,0,2016,11,18
998,12,37,11,21,0,2,3,0.667,7,8,...,1,0,0,0,0,0,0,2016,11,23
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1203,6,37,8,23,0,0,6,0.000,5,10,...,0,0,1,0,0,0,0,2019,11,3
1204,7,35,10,19,0,2,6,0.333,8,9,...,0,0,0,0,0,0,0,2019,11,5
1205,8,36,10,19,0,4,7,0.571,1,2,...,0,0,0,0,0,0,0,2019,11,8
1206,9,35,5,15,0,0,2,0.000,3,6,...,0,0,0,0,1,0,0,2019,11,10


In [46]:
## Check it worked
test_df

Unnamed: 0,game,mp,fg,fga,fgp,three,threeatt,threep,ft,fta,...,opp_POR,opp_SAC,opp_SAS,opp_SEA,opp_TOR,opp_UTA,opp_WAS,year,month,day
1208,11,26,11,21,0,1,5,0.200,0,1,...,0,0,0,0,0,0,0,2019,11,13
1209,12,38,10,20,0,2,7,0.286,7,7,...,0,1,0,0,0,0,0,2019,11,15
1210,13,33,13,21,0,6,10,0.600,1,1,...,0,0,0,0,0,0,0,2019,11,17
1211,14,37,10,21,0,2,5,0.400,3,5,...,0,0,0,0,0,0,0,2019,11,19
1212,15,35,9,20,0,1,3,0.333,4,6,...,0,0,0,0,0,0,0,2019,11,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,51,29,8,18,0,1,7,0.143,1,1,...,0,0,0,0,0,0,0,2023,4,2
1417,52,38,14,27,0,3,10,0.300,6,6,...,0,0,0,0,0,1,0,2023,4,4
1418,53,35,13,20,0,4,6,0.667,3,5,...,0,0,0,0,0,0,0,2023,4,5
1419,54,29,6,19,0,3,7,0.429,1,2,...,0,0,0,0,0,0,0,2023,4,7


# Make X and Y for each respective set (train, val, and test)

In this section, we prepared our data for the machine learning models by creating X and Y components for each respective set (train, validation, and test). The X components include all input variables, while the Y components represent the outcomes, i.e., Win or Loss.

For each set, we carried out the following steps:

- Dropped the 'win_or_loss' column to create the X component.
- Retained the 'win_or_loss' column as the Y component.
- Verified that the lengths of X and Y components match.
- Checked the shapes of all X and Y components to ensure consistency.

We also examined the data for any NaN values and addressed them accordingly. In the case of our test set, we found one NaN value in the 'minus_plus' column, which we filled with a zero. This process helped maintain data integrity and ensured that our machine learning models receive clean and consistent input.

In [47]:
### Create X and Y's for each respective set (train)

x_train = train_df.drop('win_or_loss', axis=1)
y_train = train_df[['win_or_loss']]

In [48]:
## Check lengths match, and check proper columns were dropped
print(len(x_train))
print(len(y_train))

994
994


In [49]:
### Create X and Y's for each respective set (validation)

x_val = val_df.drop('win_or_loss', axis=1)
y_val = val_df[['win_or_loss']]

In [50]:
## Check lengths match, and check proper columns were dropped

print(len(x_val))
print(len(y_val))

214
214


In [51]:
### Create X and Y's for each respective set (test)

x_test = test_df.drop('win_or_loss', axis=1)
y_test = test_df[['win_or_loss']]

In [52]:
## Check lengths match, and check proper columns were dropped

print(len(x_test))
print(len(y_test))

213
213


In [53]:
## Check the shapes

print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

print(x_test.shape)
print(y_test.shape)


(994, 64)
(994, 1)
(214, 64)
(214, 1)
(213, 64)
(213, 1)


In [54]:
x_train.columns.to_list()

['game',
 'mp',
 'fg',
 'fga',
 'fgp',
 'three',
 'threeatt',
 'threep',
 'ft',
 'fta',
 'ftp',
 'orb',
 'drb',
 'trb',
 'ast',
 'stl',
 'blk',
 'tov',
 'pts',
 'game_score',
 'minus_plus',
 'decimal_age',
 'year_num',
 'team_CLE',
 'team_LAL',
 'team_MIA',
 'opp_ATL',
 'opp_BOS',
 'opp_BRK',
 'opp_CHA',
 'opp_CHI',
 'opp_CHO',
 'opp_CLE',
 'opp_DAL',
 'opp_DEN',
 'opp_DET',
 'opp_GSW',
 'opp_HOU',
 'opp_IND',
 'opp_LAC',
 'opp_LAL',
 'opp_MEM',
 'opp_MIA',
 'opp_MIL',
 'opp_MIN',
 'opp_NJN',
 'opp_NOH',
 'opp_NOK',
 'opp_NOP',
 'opp_NYK',
 'opp_OKC',
 'opp_ORL',
 'opp_PHI',
 'opp_PHO',
 'opp_POR',
 'opp_SAC',
 'opp_SAS',
 'opp_SEA',
 'opp_TOR',
 'opp_UTA',
 'opp_WAS',
 'year',
 'month',
 'day']

In [55]:
y_train.columns.to_list()

['win_or_loss']

In [56]:
# Check Nans

nan_count_test = x_test.isnull().sum()
print(nan_count_test)

game       0
mp         0
fg         0
fga        0
fgp        0
          ..
opp_UTA    0
opp_WAS    0
year       0
month      0
day        0
Length: 64, dtype: int64


In [57]:
# Check NaN count

nan_count_filtered_test = nan_count_test[nan_count_test > 0]
print(nan_count_filtered_test)

minus_plus    1
dtype: int64


In [58]:
# Find the record

nan_rows = x_test[x_test['minus_plus'].isnull()]
print(nan_rows)

      game  mp  fg  fga  fgp  three  threeatt  threep  ft  fta  ...  opp_POR  \
1415    50  32   7   19    0      1         3   0.333   3    4  ...        0   

      opp_SAC  opp_SAS  opp_SEA  opp_TOR  opp_UTA  opp_WAS  year  month  day  
1415        0        0        0        0        0        0  2023      3   31  

[1 rows x 64 columns]


In [59]:
# Fill it with 0

x_test['minus_plus'] = x_test['minus_plus'].fillna(0)

# IGNORE x_test['off_court_minus_plus'] = x_test['off_court_minus_plus'].fillna(0)

# ML Models Overview

Considering that we are dealing with a binary outcome, we identified four suitable models to effectively address our objective:

- Logistic Regression (both non-TensorFlow and TensorFlow implementations)
- Decision Tree
- Random Forest
- Neural Network

These models were chosen as they are well-suited for handling binary classification problems and offer diverse approaches to understanding and predicting the impact of LeBron James' performance on game outcomes.

In the sections below, we present the logic and output for each respective model applied to our LeBron James dataset. Furthermore, we delve into the feature importance analysis for each model, which will be discussed in detail within our results section and conclusion. By utilizing these models, we aim to uncover valuable insights into LeBron's contributions and their influence on his team's success.

# Logistic Regression

Simple logistic regression using scikit-learn (not TensorFlow)

In [60]:
## Standardize the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train)
X_val_scaled = scaler.transform(x_val)
X_test_scaled = scaler.transform(x_test)

In [61]:
## Run the logistic regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, np.ravel(y_train))

In [62]:
y_val_pred = log_reg.predict(X_val_scaled)

# Calculate accuracy
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation accuracy: {val_accuracy}")

# Print classification report
print(classification_report(y_val, y_val_pred))

Validation accuracy: 0.8504672897196262
              precision    recall  f1-score   support

           L       0.84      0.76      0.80        83
           W       0.86      0.91      0.88       131

    accuracy                           0.85       214
   macro avg       0.85      0.83      0.84       214
weighted avg       0.85      0.85      0.85       214



In [63]:
# Make predictions on the test set
y_test_pred = log_reg.predict(X_test_scaled)

# Calculate the accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test accuracy:", test_accuracy)

# Print the classification report
print(classification_report(y_test, y_test_pred))

Test accuracy: 0.784037558685446
              precision    recall  f1-score   support

           L       0.76      0.67      0.72        86
           W       0.80      0.86      0.83       127

    accuracy                           0.78       213
   macro avg       0.78      0.77      0.77       213
weighted avg       0.78      0.78      0.78       213



In [64]:
# Get coefficients from the logistic regression model
log_reg_coefficients = log_reg.coef_[0]

# Create a DataFrame with the feature names and their coefficients
coefficients_df = pd.DataFrame(
    {"feature": x_train.columns, "coefficient": log_reg_coefficients}
)

# Sort the DataFrame by the absolute values of the coefficients in descending order
coefficients_df_sorted = coefficients_df.reindex(
    coefficients_df.coefficient.abs().sort_values(ascending=False).index
)

# Display the sorted DataFrame
print(coefficients_df_sorted)

       feature  coefficient
20  minus_plus     3.919856
6     threeatt    -0.582375
1           mp    -0.450482
30     opp_CHI    -0.425536
59     opp_UTA    -0.406391
..         ...          ...
26     opp_ATL     0.007534
42     opp_MIA    -0.005041
5        three     0.003616
24    team_LAL     0.000000
4          fgp     0.000000

[64 rows x 2 columns]


#### Logistic Regression Using Tensorflow

In [65]:
# Encode the labels as integers
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train.values.ravel())
y_val_encoded = encoder.transform(y_val.values.ravel())
y_test_enc = encoder.transform(y_test.values.ravel())

# Set the seed for reproducibility
seed_value = 30
np.random.seed(seed_value)
random.seed(seed_value)
tf.random.set_seed(seed_value)


# Create the logistic regression model using TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(x_train.shape[1],))
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model on the training data
history = model.fit(x_train, y_train_enc, epochs=40, batch_size=32, validation_data=(x_val, y_val_encoded), verbose=2)

# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(x_test, y_test_enc)
print(f"Test accuracy (Logistic Regression with TensorFlow): {test_accuracy}")



Epoch 1/40
32/32 - 1s - loss: 78.8225 - accuracy: 0.6700 - val_loss: 79.0704 - val_accuracy: 0.6121 - 526ms/epoch - 16ms/step
Epoch 2/40
32/32 - 0s - loss: 55.7469 - accuracy: 0.6700 - val_loss: 51.8185 - val_accuracy: 0.6121 - 52ms/epoch - 2ms/step
Epoch 3/40
32/32 - 0s - loss: 32.9462 - accuracy: 0.6700 - val_loss: 25.5469 - val_accuracy: 0.6121 - 50ms/epoch - 2ms/step
Epoch 4/40
32/32 - 0s - loss: 11.5397 - accuracy: 0.6388 - val_loss: 4.9148 - val_accuracy: 0.5607 - 50ms/epoch - 2ms/step
Epoch 5/40
32/32 - 0s - loss: 5.1980 - accuracy: 0.5292 - val_loss: 4.8509 - val_accuracy: 0.5841 - 49ms/epoch - 2ms/step
Epoch 6/40
32/32 - 0s - loss: 4.4559 - accuracy: 0.5644 - val_loss: 4.3161 - val_accuracy: 0.5841 - 51ms/epoch - 2ms/step
Epoch 7/40
32/32 - 0s - loss: 4.0445 - accuracy: 0.5835 - val_loss: 3.7410 - val_accuracy: 0.5748 - 49ms/epoch - 2ms/step
Epoch 8/40
32/32 - 0s - loss: 3.5216 - accuracy: 0.5815 - val_loss: 3.8203 - val_accuracy: 0.5888 - 52ms/epoch - 2ms/step
Epoch 9/40
32/3

### Results

The logistic regression model without TensorFlow yielded a validation accuracy of 0.8505 and a test accuracy of 0.7840, while the TensorFlow-based model achieved a validation accuracy of ~0.8411 and a test accuracy of 0.8122 after 40 epochs of training. Although the TensorFlow model had a slightly lower validation accuracy, it outperformed the non-TensorFlow model on the test set, so we consider it the better of the two on held-out data. On a 213-game test set, that gap corresponds to roughly six games.
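
To put these accuracies in context, it helps to compare them with a majority-class baseline, that is, a model that always predicts a win. The sketch below is not part of the original analysis; it only uses the support counts printed in the classification reports above, and the helper name `majority_baseline` is our own.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Support counts taken from the classification reports above
val_support = {"L": 83, "W": 131}
test_support = {"L": 86, "W": 127}

val_baseline = majority_baseline(val_support)
test_baseline = majority_baseline(test_support)

print(f"Validation baseline: {val_baseline:.4f}")
print(f"Test baseline:       {test_baseline:.4f}")
```

The validation baseline matches the 0.6121 validation accuracy logged during the first TensorFlow epochs, which suggests the model was initially predicting a single class for every game.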

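A key finding from the analysis was that the 'minus_plus' feature, representing the team's point differential while LeBron is on the court, was by far the strongest predictor of his team's victory: its standardized coefficient (3.92) is more than six times larger in magnitude than the next largest (threeatt, -0.58). This suggests that how the team performs during LeBron's minutes matters more to the outcome than any single box-score statistic, consistent with his impact extending beyond his individual numbers.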
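
One way to read these coefficients is as odds ratios. Because the features were standardized, exponentiating a coefficient gives the factor by which the odds of a win change for a one-standard-deviation increase in that feature, holding the others fixed. The following is a minimal sketch using three of the coefficients printed above; the helper `odds_ratio` is our own name.

```python
import math

# Standardized coefficients copied from the output above
coefficients = {
    "minus_plus": 3.919856,
    "threeatt": -0.582375,
    "mp": -0.450482,
}

def odds_ratio(coef):
    """Multiplicative change in the odds of a win per +1 standard deviation."""
    return math.exp(coef)

for name, coef in coefficients.items():
    print(f"{name}: odds of a win multiply by {odds_ratio(coef):.2f} per +1 SD")
```

A ratio above 1 raises the odds of a win and a ratio below 1 lowers them, so the sign of each coefficient carries over directly.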

A few of the key hyperparameters for this model are as follows:

Learning rate (0.001): This is the Keras default for the Adam optimizer, which we used unchanged. A small learning rate helps the model converge gradually without overshooting the optimal weights.

Batch size (32): A moderate batch size balances computational efficiency against generalization. Smaller batches produce noisier gradient estimates, which can help generalization, while larger batches train faster per epoch but may generalize less well.

Epochs (40): The number of times the model iterates through the entire dataset. We chose 40 epochs as a balance between sufficient training and avoiding overfitting. It's important to monitor the validation accuracy to ensure that the model doesn't start overfitting as epochs increase.
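
The interaction between batch size and epochs can be checked with a little arithmetic. The sketch below assumes the final partial batch is kept, which is consistent with the `32/32` step counter in the training log above; `steps_per_epoch` is our own helper name.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)

n_train = 994      # rows in x_train
batch_size = 32
epochs = 40

steps = steps_per_epoch(n_train, batch_size)
total_updates = steps * epochs

print(f"Steps per epoch: {steps}")
print(f"Total weight updates over {epochs} epochs: {total_updates}")
```

The same calculation with a batch size of 20 gives 50 steps per epoch, matching the `50/50` counter in the neural network log later in the notebook.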

# Decision Tree

Run a simple decision tree

In [85]:
# Set the seed for reproducibility
seed = 42

# Create the Decision Tree model with the seed
dtree = DecisionTreeClassifier(max_depth=1, random_state=seed)

# Train the model on the training set
dtree.fit(X_train_scaled, y_train)

# Make predictions on the validation set
y_val_pred_dtree = dtree.predict(X_val_scaled)

# Calculate the accuracy
val_accuracy_dtree = accuracy_score(y_val, y_val_pred_dtree)
print("Validation accuracy (Decision Tree):", val_accuracy_dtree)

# Print the classification report
print(classification_report(y_val, y_val_pred_dtree))

Validation accuracy (Decision Tree): 0.8738317757009346
              precision    recall  f1-score   support

           L       0.84      0.83      0.84        83
           W       0.89      0.90      0.90       131

    accuracy                           0.87       214
   macro avg       0.87      0.87      0.87       214
weighted avg       0.87      0.87      0.87       214



In [86]:
# Make predictions on the test set
y_test_pred_dtree = dtree.predict(X_test_scaled)

# Calculate the accuracy
test_accuracy_dtree = accuracy_score(y_test, y_test_pred_dtree)
print("Test accuracy (Decision Tree):", test_accuracy_dtree)

# Print the classification report
print(classification_report(y_test, y_test_pred_dtree))

Test accuracy (Decision Tree): 0.8169014084507042
              precision    recall  f1-score   support

           L       0.81      0.72      0.76        86
           W       0.82      0.88      0.85       127

    accuracy                           0.82       213
   macro avg       0.81      0.80      0.81       213
weighted avg       0.82      0.82      0.81       213



In [87]:
# Get feature importances from the Decision Tree model
feature_importances = dtree.feature_importances_

# Create a DataFrame with the feature names and their importance scores
feature_importance_df = pd.DataFrame(
    {"feature": x_train.columns, "importance": feature_importances}
)

# Sort the DataFrame by importance scores in descending order
feature_importance_df_sorted = feature_importance_df.sort_values(
    by="importance", ascending=False
)

# Display the sorted DataFrame
print(feature_importance_df_sorted)

       feature  importance
20  minus_plus         1.0
0         game         0.0
33     opp_DAL         0.0
35     opp_DET         0.0
36     opp_GSW         0.0
..         ...         ...
27     opp_BOS         0.0
28     opp_BRK         0.0
29     opp_CHA         0.0
30     opp_CHI         0.0
63         day         0.0

[64 rows x 2 columns]


### Results

Our decision tree model displayed promising results, achieving a validation accuracy of ~0.8738 and a test accuracy of ~0.8169. Decision trees recursively partition the feature space on feature values, in principle until each region contains data points from a single class or a stopping rule such as a depth limit is reached. They offer simplicity and interpretability, and can handle both numerical and categorical data. After experimenting with several hyperparameters, the only one that seemed to help was max_depth, which performed best at 1. A depth-1 tree makes a single split, so exactly one feature can receive a nonzero importance: here "minus_plus" received an importance score of 1.0 and every other feature received 0. In summary, a single threshold on "minus_plus" separated wins from losses well enough to generalize to unseen data.
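
To illustrate what a depth-1 tree does, the sketch below searches for the single threshold on one feature that minimizes the weighted Gini impurity, which is the default split criterion in scikit-learn's `DecisionTreeClassifier`. This is a simplified illustration, not the library's implementation, and the eight data points are hypothetical values chosen for the example rather than rows from our dataset.

```python
def gini(labels):
    """Gini impurity of a list of 'W'/'L' labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_win = sum(1 for y in labels if y == "W") / n
    return 1.0 - p_win ** 2 - (1.0 - p_win) ** 2

def best_stump(values, labels):
    """Return (threshold, weighted_gini) for the best single split on one feature."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical plus-minus values and outcomes, for illustration only
plus_minus = [-12, -7, -3, -1, 2, 5, 9, 14]
outcome = ["L", "L", "L", "L", "W", "L", "W", "W"]

threshold, impurity = best_stump(plus_minus, outcome)
print(f"Best threshold: {threshold}, weighted Gini: {impurity:.4f}")
```

A full tree would repeat this search over every feature and then recurse into each side of the split; with max_depth=1 the search stops after the first split.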

# Random Forest

In [69]:
# Set the seed for reproducibility
seed = 42

# Create the Random Forest model with the seed
rf = RandomForestClassifier(min_samples_split=4, n_estimators=100, random_state=seed)

# Train the model on the training set
rf.fit(X_train_scaled, y_train.values.ravel())

# Make predictions on the validation set
y_val_pred_rf = rf.predict(X_val_scaled)

# Calculate the accuracy
val_accuracy_rf = accuracy_score(y_val, y_val_pred_rf)
print("Validation accuracy (Random Forest):", val_accuracy_rf)

# Print the classification report
print(classification_report(y_val, y_val_pred_rf))

Validation accuracy (Random Forest): 0.8785046728971962
              precision    recall  f1-score   support

           L       0.90      0.77      0.83        83
           W       0.87      0.95      0.91       131

    accuracy                           0.88       214
   macro avg       0.88      0.86      0.87       214
weighted avg       0.88      0.88      0.88       214



In [70]:
# Make predictions on the test set
y_test_pred_rf = rf.predict(X_test_scaled)

# Calculate the accuracy
test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)
print("Test accuracy (Random Forest):", test_accuracy_rf)

# Print the classification report
print(classification_report(y_test, y_test_pred_rf))


Test accuracy (Random Forest): 0.8262910798122066
              precision    recall  f1-score   support

           L       0.86      0.69      0.76        86
           W       0.81      0.92      0.86       127

    accuracy                           0.83       213
   macro avg       0.83      0.80      0.81       213
weighted avg       0.83      0.83      0.82       213



In [71]:
# Get the feature importances
importances = rf.feature_importances_

# Get the column names from the input dataset
feature_names = x_train.columns

# Create a dictionary of feature names and their importances
feature_importances = dict(zip(feature_names, importances))

# Sort the dictionary by importance score in descending order
sorted_feature_importances = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)

# Print the sorted feature importances
for feature, importance in sorted_feature_importances:
    print(f"{feature}: {importance}")

minus_plus: 0.4129913200597449
game_score: 0.058358479027977396
mp: 0.03851677041489027
fga: 0.030672452032309224
ast: 0.02857400407156052
pts: 0.027114384877615406
decimal_age: 0.025401084611770167
year: 0.025291838964842724
game: 0.025246148275027438
ftp: 0.024204507543174315
tov: 0.02211705224597309
fg: 0.022095775205857423
drb: 0.02094001271410278
trb: 0.020259088934534607
threeatt: 0.020159656687059307
day: 0.01993866430177415
fta: 0.019554395414955718
ft: 0.0174783271495241
threep: 0.015529790655759322
stl: 0.013519901375034928
orb: 0.011743507378445428
three: 0.00962355668163217
blk: 0.00934945167072192
month: 0.009313342384678364
opp_CHI: 0.006790915926492717
opp_UTA: 0.003931800712040648
opp_PHI: 0.003219445755511916
opp_NYK: 0.0030396794667972467
opp_ATL: 0.0028534608599681227
year_num: 0.00277039665685679
opp_DEN: 0.002737100428012352
team_MIA: 0.0027035008155969387
team_CLE: 0.0025335959316940887
opp_BRK: 0.002371539546484823
opp_IND: 0.0023680344656915333
opp_LAL: 0.002336

### Results

Our random forest model demonstrated promising results, with a validation accuracy of approximately 0.8785 and a test accuracy of roughly 0.8263. We chose random forests as they are an ensemble learning method that fits multiple decision trees to different bootstrap samples of the training data and combines their predictions. This reduces overfitting and enhances the model's ability to generalize to unseen data. In our model, the most important feature was "minus_plus," with an importance score of 0.413. scikit-learn's impurity-based importances are normalized to sum to 1, so this score means that "minus_plus" accounted for roughly 41% of the total impurity reduction across the forest, about seven times the share of the next feature, "game_score" (0.058). Unlike the depth-1 decision tree, the forest also drew on other features, but "minus_plus" remained the dominant factor in predicting the game outcome.
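
The two ideas behind the ensemble, bootstrap sampling and combining predictions, can be sketched in a few lines. This is a simplified illustration with helper names of our own: it uses a hard majority vote, whereas scikit-learn's `RandomForestClassifier` averages the trees' class probabilities, and the per-tree predictions shown are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as done for each tree's training set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
game_ids = list(range(10))              # stand-in for training rows
sample = bootstrap_sample(game_ids, rng)
print("Bootstrap sample:", sample)

# Hypothetical predictions from five trees for a single game
tree_predictions = ["W", "W", "L", "W", "L"]
print("Ensemble prediction:", majority_vote(tree_predictions))
```

Because each tree sees a slightly different sample, individual trees make different errors, and combining them tends to cancel some of that variance.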

# Neural Network

In [72]:
# For encoding W/L to 1 and 0 for the Neural Network

encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train.values.ravel())
y_val_encoded = encoder.transform(y_val.values.ravel())
y_test_encoded = encoder.transform(y_test.values.ravel())

In [73]:
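# Set seeds (NumPy's seed alone does not control TensorFlow's weight initialization or dropout)
np.random.seed(42)
tf.random.set_seed(42)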

# Define the neural network architecture
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # Binary output class (W, L)
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model on the training set
history = model.fit(X_train_scaled, y_train_encoded, epochs=15, batch_size=20, validation_data=(X_val_scaled, y_val_encoded), verbose=2, callbacks=[early_stopping])

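# Get the validation accuracy from the last epoch that ran
# (with restore_best_weights=True, the restored weights may come from an earlier epoch)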
val_accuracy = history.history['val_accuracy'][-1]
print("Validation accuracy (Neural Network):", val_accuracy)


Epoch 1/15
50/50 - 1s - loss: 0.7165 - accuracy: 0.6247 - val_loss: 0.6142 - val_accuracy: 0.6168 - 874ms/epoch - 17ms/step
Epoch 2/15
50/50 - 0s - loss: 0.6329 - accuracy: 0.6549 - val_loss: 0.5980 - val_accuracy: 0.6215 - 111ms/epoch - 2ms/step
Epoch 3/15
50/50 - 0s - loss: 0.5982 - accuracy: 0.6982 - val_loss: 0.5646 - val_accuracy: 0.6636 - 100ms/epoch - 2ms/step
Epoch 4/15
50/50 - 0s - loss: 0.5486 - accuracy: 0.7294 - val_loss: 0.5178 - val_accuracy: 0.6869 - 91ms/epoch - 2ms/step
Epoch 5/15
50/50 - 0s - loss: 0.5231 - accuracy: 0.7425 - val_loss: 0.4751 - val_accuracy: 0.7523 - 106ms/epoch - 2ms/step
Epoch 6/15
50/50 - 0s - loss: 0.4711 - accuracy: 0.7847 - val_loss: 0.4735 - val_accuracy: 0.7290 - 97ms/epoch - 2ms/step
Epoch 7/15
50/50 - 0s - loss: 0.3867 - accuracy: 0.8189 - val_loss: 0.4505 - val_accuracy: 0.7710 - 91ms/epoch - 2ms/step
Epoch 8/15
50/50 - 0s - loss: 0.4045 - accuracy: 0.8129 - val_loss: 0.4389 - val_accuracy: 0.7991 - 89ms/epoch - 2ms/step
Epoch 9/15
50/50 - 

In [74]:
# Scale the test set
x_test_scaled = scaler.transform(x_test)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test_scaled, y_test_encoded, verbose=2)
print("Test accuracy (Neural Network):", test_accuracy)


7/7 - 0s - loss: 0.5326 - accuracy: 0.7793 - 22ms/epoch - 3ms/step
Test accuracy (Neural Network): 0.7793427109718323


In [88]:
#Create Summary

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_1 (Dense)             (None, 128)               8320      
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dropout_2 (Dropout)         (None, 32)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                

### Results

The neural network model we implemented did not perform as well as the random forest or decision tree models, with a validation accuracy of approximately 0.8505 and a test accuracy of around 0.7793. Despite adjusting the hyperparameters, this was the best prediction we could achieve. The network consists of three dense hidden layers of 128, 64, and 32 units with 'relu' activations, each followed by a dropout layer with a rate of 0.5, and a final single-unit 'sigmoid' layer for the binary output.
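
The size of this network can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. The sketch below applies that formula to the layer sizes defined above; `dense_params` is our own helper name.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a dense layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# 64 input features, then the dense layers defined in the model above
layer_sizes = [64, 128, 64, 32, 1]

param_counts = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(param_counts)

print("Parameters per dense layer:", param_counts)
print("Total trainable parameters:", total_params)
```

With fewer than a thousand training games, a network of this size has many more parameters than examples, which is one reason heavy dropout and early stopping were used.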

We used the Adam optimizer and binary cross-entropy loss function to compile the model. Early stopping was employed with a patience of 3 epochs to halt training when the model ceased improving on the validation loss. We trained the model for a maximum of 15 epochs with a batch size of 20.
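
The early stopping rule can be sketched as a small loop over validation losses. This is a simplified illustration of the patience logic under default settings, not the Keras implementation; the function name and the loss values are hypothetical.

```python
def early_stopping_epochs(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) as zero-based indices.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, for illustration only
losses = [0.70, 0.62, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
stop_epoch, best_epoch = early_stopping_epochs(losses, patience=3)
print(f"Stopped after epoch index {stop_epoch}; best weights from epoch index {best_epoch}")
```

In this example training halts before the final, lower loss is ever reached, which shows the trade-off of a short patience: it saves time but can stop during a temporary plateau.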

Considering that the relationship between our predictors and the outcome variable is not complex, with "minus_plus" being the primary indicator of the outcome, it is understandable that the neural network model did not perform better. Neural networks excel at modeling complex relationships, but in this case, the relationship is relatively simple, and hence, other models like random forest and decision tree provide better predictions. Although we experimented with the model's hyperparameters, we did not dive too deep into its architecture or optimization, which could potentially have led to better performance.

# Conclusion

In conclusion, our analysis of various machine learning models, including random forest, decision tree, logistic regression, and neural network, has provided valuable insights into the factors that contribute to the outcome of basketball games featuring LeBron James. The random forest model emerged as the top performer, achieving a validation accuracy of approximately 0.88 and a test accuracy of around 0.83. However, the decision tree and logistic regression models also exhibited comparable performance.
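
For reference, the test accuracies reported in the sections above can be ranked and converted back into correctly predicted games on the 213-game test set. This is a small summary sketch using only figures already stated in this notebook, rounded to four decimals.

```python
# Test accuracies as reported in the results sections above
reported_test_accuracy = {
    "Logistic regression (scikit-learn)": 0.7840,
    "Logistic regression (TensorFlow)": 0.8122,
    "Decision tree": 0.8169,
    "Random forest": 0.8263,
    "Neural network": 0.7793,
}
n_test_games = 213

ranked = sorted(reported_test_accuracy.items(), key=lambda kv: kv[1], reverse=True)
correct_games = {
    name: round(acc * n_test_games) for name, acc in reported_test_accuracy.items()
}

for name, acc in ranked:
    print(f"{name}: {acc:.4f} ({correct_games[name]} of {n_test_games} games)")
```

Seen this way, the gap between the random forest and the decision tree is two games, so the ranking among the top models should be read with some caution.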

A key finding across the three models for which we examined feature importance (logistic regression, decision tree, and random forest) was the overwhelming importance of the "minus_plus" feature in predicting game outcomes. This metric represents the team's point differential when LeBron is on the court and consistently outperformed other features by a significant margin. In the random forest, the next closest predictor was "game_score," which is a player-specific efficiency rating, suggesting that LeBron's individual performance can also influence the game's result, although not to the same extent as "minus_plus."

Interestingly, we found that traditional individual statistics such as points, assists, rebounds, and blocks had little to no impact on predicting game outcomes. This observation aligns with the perception of LeBron James as a player who can score prolifically yet still experience losses, particularly in recent years. For instance, LeBron's career average of 26.5 points per game is notably higher than that of most players, yet he has encountered his fair share of losses.

Taking all of our findings into account, it becomes evident that how the team fares during LeBron James' minutes on the court is closely tied to whether it wins. This analysis highlights the significance of considering more comprehensive metrics, such as "minus_plus," to better understand the dynamics of a basketball game and the factors that contribute to its outcome. By focusing on these indicators, we can gain a more accurate and nuanced perspective of the game and the impact that star players like LeBron James have on their teams' success.