In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
df1 = pd.read_csv('2012-18_playerBoxScore_diff.csv')
df = pd.read_csv('NBAPlayer_dataset.csv')

In [2]:
feature_names = ['gmDayofyear', 'playPTS_diff', 'playAST_diff', 'playTO_diff',
'playSTL_diff', 'playBLK_diff', 'playPF_diff', 'playFGA_diff',
'play3PA_diff', 'playDRB_diff', 'playFT%_diff', 'IsStarter_diff',
'playMin_diff']

In [3]:
df2 = df1[feature_names]

In [23]:
df2.head()

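| | gmDayofyear | playPTS_diff | playAST_diff | playTO_diff | playSTL_diff | playBLK_diff | playPF_diff | playFGA_diff | play3PA_diff | playDRB_diff | playFT%_diff | IsStarter_diff | playMin_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 74 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 89 | 6.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.0 | 1.0 | 2.0 | 0.0 | 1.0 | 6.0 |
| 4 | 91 | -16.0 | 1.0 | 2.0 | -2.0 | 0.0 | 0.0 | -10.0 | -4.0 | -1.0 | -0.5 | 0.0 | -9.0 |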


## Compare to Original Random Forest Model

### Original Random Forest Model

#### with original data:

In [4]:
X = np.array(df.drop(['playMin','playDispNm'], axis=1))
y = np.array(df['playMin'])

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [6]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=6, max_features=0.5).fit(X_train, y_train)
rfr.score(X_test, y_test)

0.919902675307746

#### with game difference data transformed:

In [7]:
X1 = np.array(df1.drop(['playMin_diff','playDispNm','Unnamed: 0'], axis=1))
y1 = np.array(df1['playMin_diff'])

In [8]:
from sklearn.model_selection import train_test_split

X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, random_state=1)

In [9]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=6, max_features=0.5).fit(X_train1, y_train1)
rfr.score(X_test1, y_test1)

0.7752257058913371

### Interpreted Random Forest Model

In [10]:
X2 = np.array(df2.drop(['playMin_diff'], axis=1))
y2 = np.array(df2['playMin_diff'])

In [11]:
from sklearn.model_selection import train_test_split

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=1)

In [12]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=6, max_features=0.5).fit(X_train2, y_train2)
rfr.score(X_test2, y_test2)

0.7986206018773632

To predict playMin, the minutes a player spends on court in one game, the original random forest model trained on the original data reaches a very high score of **0.920**. However, that score is inflated by data leakage: aggregate values such as playPTS (points scored by the player) come from the same game and grow with time on court, so this model is not a good one.

To predict playMin better, I decided to model the change from the player's previous game instead. After data cleaning, this gave the game-difference dataset. Applying the original random forest model to this dataset brought the score down to **0.775**. That score looks more reasonable, but the model still uses too many features.
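The cleaning code that produced the game-difference file is not shown here, so the following is only a minimal sketch of how per-game differences could be computed with pandas. The toy rows, the player labels, the `gmDate` ordering column, and the choice to fill each player's first game with 0 are all assumptions made for illustration.

```python
import pandas as pd

# Toy box-score rows (made up); 'gmDate' is a hypothetical column used only for ordering.
games = pd.DataFrame({
    'playDispNm': ['Player A', 'Player A', 'Player A', 'Player B', 'Player B'],
    'gmDate': pd.to_datetime(['2017-01-01', '2017-01-03', '2017-01-05',
                              '2017-01-01', '2017-01-04']),
    'playPTS': [10, 16, 0, 20, 25],
    'playMin': [30, 36, 27, 18, 24],
})

# Order each player's games chronologically before differencing.
games = games.sort_values(['playDispNm', 'gmDate'])

stat_cols = ['playPTS', 'playMin']
# diff() within each player gives "this game minus previous game";
# a player's first game has no predecessor, so it is filled with 0 (an assumption).
diffs = games.groupby('playDispNm')[stat_cols].diff().fillna(0.0)
diffs.columns = [c + '_diff' for c in stat_cols]

games_diff = pd.concat([games[['playDispNm', 'gmDate']], diffs], axis=1)
print(games_diff)
```

Grouping by player before calling `diff()` matters: without it, the first game of one player would be differenced against the last game of another.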

After eliminating the highly correlated features, selecting the highest-ranking features, and generating per-game difference features, the score is **0.799**. We can take this as the model's accuracy (strictly, its R² on the test set).
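The elimination step itself is not shown in this section. One common way to do it is to drop any feature whose absolute correlation with an earlier feature exceeds a threshold; the sketch below illustrates that idea on synthetic data. The helper name `drop_correlated`, the 0.9 threshold, and the `playFGM_diff` column are hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.9):
    """Drop every column that correlates above `threshold` with an earlier column."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Synthetic data: 'playFGM_diff' (hypothetical) is almost a rescaled copy of 'playFGA_diff'.
rng = np.random.default_rng(0)
fga = rng.normal(size=200)
demo = pd.DataFrame({
    'playFGA_diff': fga,
    'playFGM_diff': 0.5 * fga + 0.01 * rng.normal(size=200),
    'playBLK_diff': rng.normal(size=200),
})

reduced, dropped = drop_correlated(demo)
print(dropped)
```

The threshold is a judgment call; a lower value removes more features at the risk of discarding useful signal.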

### Compare to Linear Regression:

In [100]:
X3 = np.array(df2.drop(['playMin_diff'], axis=1))
y3 = np.array(df2['playMin_diff'])

In [101]:
from sklearn.model_selection import train_test_split

X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.2, random_state=1)

In [102]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train3,y_train3)
model.score(X_test3, y_test3)

0.5409800416917555
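For scikit-learn regressors, `score` returns the coefficient of determination R², not a classification accuracy. A minimal NumPy sketch of that formula follows; the target values are made up and the helper name `r2_manual` is hypothetical.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [0.0, 6.0, -9.0, 3.0]                 # made-up minute differences
perfect = r2_manual(y_true, y_true)            # predictions equal the targets
baseline = r2_manual(y_true, [np.mean(y_true)] * len(y_true))  # always predict the mean

print(perfect, baseline)
```

A score of 1 means perfect predictions and a score of 0 means the model does no better than always predicting the mean, so 0.54 and 0.80 are best read on that scale.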

In [103]:
%%time
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

selectors = []

# Try every feature count from 11 down to 1.
for idx in range(11, 0, -1):
    estimator = LinearRegression(n_jobs=-1)

    # Recursively eliminate features until idx remain.
    rfe = RFE(estimator, n_features_to_select=idx, step=1)

    X_train_rfe = rfe.fit_transform(X_train3, y_train3)
    X_test_rfe = rfe.transform(X_test3)

    # Refit on the selected features and score on the matching test split.
    estimator.fit(X_train_rfe, y_train3)
    score = estimator.score(X_test_rfe, y_test3)
    selectors.append((rfe, score, idx))

CPU times: user 9.31 s, sys: 716 ms, total: 10 s
Wall time: 1.84 s


In [104]:
# The loop ran idx = 11, 10, ..., 1, so selectors[-5] is the selector that kept 5 features.
top_selector = selectors[-5]
top_rfe = top_selector[0]
top_list = np.array(df2.drop(columns='playMin_diff').columns[top_rfe.support_])
top_list

array(['playBLK_diff', 'playFGA_diff', 'playDRB_diff', 'playFT%_diff',
       'IsStarter_diff'], dtype=object)
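The random forest's own top-5 ranking is referred to in the comparison but is not computed in this section. One common way to obtain such a ranking is the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names, and it is not necessarily how the ranking in this notebook was produced.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only f0 and f2 actually drive the target (an assumption for the demo).
rng = np.random.default_rng(0)
cols = ['f0', 'f1', 'f2', 'f3', 'f4']  # placeholder feature names
X_demo = pd.DataFrame(rng.normal(size=(500, 5)), columns=cols)
y_demo = 5 * X_demo['f0'] + 3 * X_demo['f2'] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them to get a ranking.
importances = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
top_two = list(importances.index[:2])
print(importances)
```

Applied to the notebook's data, the same pattern would use the fitted `rfr` and the column names of `df2` without `playMin_diff`, taking the first five entries of the sorted series.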

#### How did the random forest model compare to the linear regression model?

1. There is a significant difference in score between the linear regression and random forest models on the game-difference dataset (0.54 versus 0.80).

   One possible reason is that random forests tend to outperform linear regression in some settings, for example with categorical features. Because the relationship between y and x in this dataset is not truly linear, the scores of the two models differ greatly.


2. The two models share 2 of their top 5 features: **'playFGA_diff', 'playDRB_diff'**. However, while linear regression chose **'playBLK_diff', 'playFT%_diff', 'IsStarter_diff'**, the random forest model selected **'playAST_diff', 'playPTS_diff'** and **'playPF_diff'**.


3. Even though the linear regression model is easier to explain and understand, I think the random forest model is generally the better choice for this task. For example, when we apply the random forest model to this NBA player box-score dataset, we can use a decision tree to see that playing minutes are affected differently for two types of player: offensive players and defensive players. Two features lie behind this split. The first is playFGA_diff and the second relates to playBLK, the number of blocks by the player. Both sit under the top-level feature, playPTS_diff (the change in points a player scores from one game to the next).

   Linear regression may run much faster, but the random forest model makes more accurate predictions.
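The overlap described in point 2 can be checked mechanically with set operations. The feature names below are copied from that point; the variable names are chosen only for this sketch.

```python
# Top-5 features reported for each model in point 2.
lr_top5 = {'playBLK_diff', 'playFGA_diff', 'playDRB_diff', 'playFT%_diff', 'IsStarter_diff'}
rf_top5 = {'playFGA_diff', 'playDRB_diff', 'playAST_diff', 'playPTS_diff', 'playPF_diff'}

shared = sorted(lr_top5 & rf_top5)    # chosen by both models
only_lr = sorted(lr_top5 - rf_top5)   # chosen only by linear regression
only_rf = sorted(rf_top5 - lr_top5)   # chosen only by the random forest

print(shared)   # ['playDRB_diff', 'playFGA_diff']
print(only_lr)  # ['IsStarter_diff', 'playBLK_diff', 'playFT%_diff']
print(only_rf)  # ['playAST_diff', 'playPF_diff', 'playPTS_diff']
```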




The reason I chose to predict playMin (player minutes on the floor) is that players' salaries are nowadays closely tied to their advertising exposure time. The article cited below discusses minutes played from another angle: it is not always better to play 48 minutes per game, and our model is one way to estimate how much time a player should play.

Citation:

“NBA: How Important Are Minutes Played?” DraftKings Playbook, DraftKings, https://www.draftkings.com/playbook/nba/nba-how-important-are-minutes-played.