# Capstone Project: NBA Player Analysis and Hall of Fame Predictions
By Chan Song Yuan SG-DSI-14

## Notebook 3. Modeling, Results and Conclusions

## Table of Contents

- [1. Import Data](#1.-Import-Data)<br>
- [2. Data Train Test Spliting](#2.-Data-Train-Test-Spliting)<br>
- [3. Modeling](#3.-Modeling)<br>
    - [3.1 Modeling With Hyperparameter Tuning](#3.1-Modeling-With-Hyperparameter-Tuning)<br>
    - [3.2 Comparing Models Results With Best Hyperparameters](#3.2-Comparing-Models-Results-With-Best-Hyperparameters)<br>
- [4. Conclusion & Recommendation](#4.-Conclusion-&-Recommendation)<br>

In [53]:
import numpy as np
import pandas as pd

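import os
import chart_studio
# Read the Chart Studio credentials from the environment instead of hard-coding them in the notebook
# (CHART_STUDIO_USERNAME and CHART_STUDIO_API_KEY are assumed variable names)
chart_studio.tools.set_credentials_file(username=os.environ['CHART_STUDIO_USERNAME'],
                                        api_key=os.environ['CHART_STUDIO_API_KEY'])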
import chart_studio.plotly as py
import plotly.graph_objects as go
import plotly.express as px

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

pd.set_option('display.max_columns',None) 
pd.set_option('display.max_rows',100)

import os

%matplotlib inline

## 1. Import Data

In [26]:
df_career = pd.read_csv('../datasets/df_career.csv')

## 2. Data Train Test Spliting

In [3]:
# Drop 'Player' column for modeling
df_career = df_career.drop(['Player'], axis=1)

In [4]:
# Train/test splitting
# Train data: players whose last season was 1997 or earlier (cutoff taken from Michael Jordan's retirement year),
#             plus every player who is already a Hall of Famer
# Test data: players whose last season was after 1997 and who are not Hall of Famers yet

train = df_career[(df_career['Last_Season'] <= 1997) | (df_career['HOF'] == 1)]
test = df_career[(df_career['Last_Season'] > 1997) & (df_career['HOF'] == 0)]
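
The split rule above can be tried on a toy table. This is a minimal sketch with invented rows (the `toy` frame and its values are assumptions, not the real career data); it shows that the test mask is simply the complement of the train mask when `HOF` only takes the values 0 and 1.

```python
import pandas as pd

# Toy career table; the values are invented purely for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Last_Season": [1990, 1997, 2005, 2016, 2010],
    "HOF": [0, 1, 1, 0, 0],
})

CUTOFF = 1997  # same cutoff year as the notebook

# Training rows: careers that ended by the cutoff, plus every Hall of Famer.
is_train = (toy["Last_Season"] <= CUTOFF) | (toy["HOF"] == 1)
toy_train = toy[is_train]

# Test rows are the complement: later careers with no Hall of Fame flag yet.
toy_test = toy[~is_train]

print(toy_train["id"].tolist())  # [1, 2, 3]
print(toy_test["id"].tolist())   # [4, 5]
```

Because every Hall of Famer goes to the training side, the test side contains only `HOF == 0` rows by construction.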

In [5]:
train['HOF'].value_counts()

0.0    312
1.0     68
Name: HOF, dtype: int64
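
With 312 negatives and 68 positives the training set is imbalanced, so it can help to know what accuracy a model gets by always predicting "not HOF". The sketch below rebuilds only the label counts shown above (the feature matrix is a placeholder of zeros) and uses scikit-learn's `DummyClassifier` as that baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts copied from the value_counts() output above.
n_not_hof, n_hof = 312, 68
y = np.array([0] * n_not_hof + [1] * n_hof)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# Always predict the most frequent class (here: not HOF).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)

print(round(baseline_acc, 4))  # 0.8211
```

Any train score reported later is easier to interpret against this 312 / 380 reference point.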

In [6]:
test.head()

Unnamed: 0.1,Unnamed: 0,id,PER,TS%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,FG%,3P%,2P%,eFG%,FT%,PPG,APG,RPG,SPG,BPG,HOF,MVP_Counts,Champions_Winners,season,Last_Season,Total_PTS
1,2,58,18.526316,0.581474,5.873684,1.757895,7.642105,0.149263,4.205263,-1.247368,2.952632,0.451842,0.398,0.488579,0.532158,0.896526,19.124945,3.452787,4.118728,1.137763,0.185133,0.0,0,2,18,2014.0,24505.0
2,3,146,13.8375,0.53725,2.3,1.1625,3.4375,0.0945,0.8125,-2.625,-1.8375,0.3955,0.368,0.412,0.47575,0.857125,10.214707,4.143085,1.836157,0.607843,0.035776,0.0,0,0,9,2017.0,6214.0
3,5,204,13.145455,0.544455,1.663636,1.954545,3.627273,0.098909,0.463636,0.818182,1.290909,0.436545,0.333455,0.525455,0.515545,0.754636,8.902484,1.966323,4.905326,0.800149,0.533846,0.0,0,1,14,2017.0,7589.0
5,9,454,14.68,0.538,2.74,2.82,5.54,0.1344,0.26,1.8,2.04,0.4936,0.2468,0.5182,0.5058,0.6616,9.659802,2.125249,3.340584,1.480879,0.298635,0.0,0,0,8,2012.0,3940.0
6,11,490,12.625,0.51375,1.175,1.6,2.75,0.088,-0.65,-0.45,-1.125,0.41125,0.342,0.44125,0.463,0.77125,8.495222,1.86741,3.220451,0.637647,0.149215,0.0,0,2,8,2008.0,3352.0


In [7]:
test['HOF'].value_counts()

0.0    533
Name: HOF, dtype: int64

In [8]:
print('Train data:', train.shape)
print('Test data:', test.shape)

Train data: (380, 27)
Test data: (533, 27)


In [9]:
# Target Variable = HOF, so drop column HOF from X_train & X_test

X_train = train.drop('HOF', axis=1)
X_test = test.drop('HOF', axis =1)
y_train = train['HOF']
y_test = test['HOF']

In [10]:
print('X_train:', X_train.shape)
print('X_test:', X_test.shape)
print('y_train:', y_train.shape)

X_train: (380, 26)
X_test: (533, 26)
y_train: (380,)


## 3. Modeling 

### 3.1 Modeling With Hyperparameter Tuning

In [11]:
model_dict = {
    
    'ss' : StandardScaler(),
    'dt': DecisionTreeClassifier(random_state=42),
    'rf': RandomForestClassifier(random_state=42),
    'ada': AdaBoostClassifier(random_state=42),
    'xg' : XGBClassifier(random_state=42,eval_metric='auc')
}

In [12]:
model_lib = {
    
    'ss' : 'StandardScaler',
    'rf': 'Random Forest',
    'dt': 'Decision Tree',
    'ada' : 'AdaBoost Classifier',
    'xg' : 'XGBoost Classifier'
}

In [13]:
# Create dictionary for each model and classifier hyperparameters

model_params ={
    
    'ss':{},
    'rf': {
        'rf__n_estimators': [1_000,10_000],
        'rf__max_depth': [1,2,3],
        'rf__min_samples_leaf': [1,2,3],
        'rf__max_leaf_nodes': [2,3]
    },
    'dt': {
        'dt__max_depth': [5,10,15],
        'dt__min_samples_split': [5,10,15,20],
        'dt__min_samples_leaf': [2,3,4]
    },
    'ada': {
        'ada__n_estimators': [1_000,10_000],
        'ada__learning_rate': [0.7,0.9,1.0]
    },
    'xg': {
        'xg__learning_rate': [0.6,0.7,0.8]
    }
}
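
The keys in the grid above follow scikit-learn's `<step name>__<parameter name>` convention for pipelines. A minimal sketch of how that naming can be checked before running a search, using a two-step pipeline like the notebook's (the `grid` and `unknown` names are only illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Every tunable setting is exposed as '<step name>__<parameter name>'.
available = pipe.get_params()
grid = {"rf__max_depth": [1, 2, 3], "rf__max_leaf_nodes": [2, 3]}

# A misspelled key would make the grid search fail, so check up front.
unknown = [key for key in grid if key not in available]
print(unknown)  # []

# The same naming works for setting a single value directly.
pipe.set_params(rf__max_depth=2)
print(pipe.named_steps["rf"].max_depth)  # 2
```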

In [14]:
# Define pipeline of models
def pipelines (models_list):
    
    pipe_model = [(i,model_dict[i]) for i in models_list]
    return Pipeline(pipe_model)

In [15]:
# Add the hyperparameter grid of one pipeline step to the GridSearch parameter dict
def parameter(name, param_grid):
    param_grid.update(model_params[name])
    return param_grid

In [16]:
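# Grid-search a two-step pipeline (scaler + classifier) with 5-fold CV, print the training
# accuracy and best parameters of the refitted best estimator, and return its predictions for X_test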
def gs(method,model,X_train=X_train,y_train=y_train,X_test=X_test):
    pipe_param = {}
    pipe_param = parameter(method,pipe_param)
    pipe_param = parameter(model,pipe_param)
    pipe = pipelines([method,model])
    g_search = GridSearchCV(pipe,param_grid=pipe_param, cv =5)
    g_search.fit(X_train,y_train)
    print(f'{model_lib[model]} with {model_lib[method]}:')
    print(f'Train Score : {round(g_search.best_estimator_.score(X_train,y_train),4)}')
    print(f'Parameters: {g_search.best_params_}')
    return g_search.best_estimator_.predict(X_test)
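
`gs()` prints the accuracy of the best estimator on the rows it was fitted on. `GridSearchCV` also stores the mean held-out score of the winning parameters in `best_score_`, which may be a more conservative number to report. The sketch below uses synthetic data from `make_classification` as a stand-in for the career table, and a decision tree to keep it quick; both are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the career table (not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("dt", DecisionTreeClassifier(random_state=42)),
])
search = GridSearchCV(pipe, param_grid={"dt__max_depth": [2, 5, 10]}, cv=5)
search.fit(X_demo, y_demo)

# Accuracy on the rows the model was fitted on (what gs() prints).
train_score = search.best_estimator_.score(X_demo, y_demo)
# Mean accuracy over the 5 held-out folds for the winning parameters.
cv_score = search.best_score_

print(f"train={train_score:.4f}  cv={cv_score:.4f}")
print(search.best_params_)
```

A large gap between the two numbers is one common sign that a train score of 1.0 reflects memorisation rather than generalisation.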

**Random Forest Classifier + Grid Search**

In [17]:
rf_pred = gs('ss', 'rf')

Random Forest with StandardScaler:
Train Score : 0.9105
Parameters: {'rf__max_depth': 2, 'rf__max_leaf_nodes': 3, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 10000}


**AdaBoost Classifier + Grid Search**

In [18]:
ada_pred = gs('ss', 'ada')

AdaBoost Classifier with StandardScaler:
Train Score : 1.0
Parameters: {'ada__learning_rate': 0.7, 'ada__n_estimators': 10000}


**XGBoost Classifier + Grid Search**

In [19]:
xg_pred = gs('ss', 'xg')

XGBoost Classifier with StandardScaler:
Train Score : 1.0
Parameters: {'xg__learning_rate': 0.6}


### 3.2 Comparing Models Results With Best Hyperparameters

In [20]:
ss = StandardScaler()

In [21]:
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
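
The scaler is fitted on the training rows only and then reused for the test rows. A small sketch with invented numbers (the `demo_*` arrays are assumptions) shows what that means: the test column is shifted and scaled with the training mean and standard deviation, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented numbers standing in for a single feature column.
demo_train = np.array([[10.0], [20.0], [30.0]])
demo_test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
demo_train_sc = scaler.fit_transform(demo_train)  # learns mean and std from train
demo_test_sc = scaler.transform(demo_test)        # reuses the train statistics

print(scaler.mean_)          # [20.]
print(demo_test_sc.ravel())  # first value is 0 because 20 equals the train mean
```

Tree-based models split on thresholds, so they are largely insensitive to this kind of rescaling.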

**Random Forest Classifier**

In [22]:
# Instantiate Random Forest
rf = RandomForestClassifier(n_estimators = 10000,max_depth=2, max_leaf_nodes=3,min_samples_leaf=1)
rf.fit(X_train, y_train)
prediction = rf.predict(X_test)
rf_score = rf.score(X_train, y_train)
pred_proba = rf.predict_proba(X_test)
print('Random Forest Model Score:',round(rf_score,4))
print('------------------------------')
print(pred_proba)

Random Forest Model Score: 0.9105
------------------------------
[[0.26266822 0.73733178]
 [0.81824982 0.18175018]
 [0.80650486 0.19349514]
 ...
 [0.6845739  0.3154261 ]
 [0.68560469 0.31439531]
 [0.86466161 0.13533839]]


In [23]:
# Keep the second column of predict_proba: the predicted probability of HOF = 1 for each test player
y_test = pred_proba[:, 1]
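
The step above keeps the second column of `predict_proba`. As a sketch on synthetic data (the `X_demo`, `y_demo` and `clf` names are assumptions), the column order follows `clf.classes_`, and slicing the array gives the same values as looping over the rows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data; labels 0/1 play the role of the HOF flag.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])  # shape (5, 2), columns follow clf.classes_
pos_col = list(clf.classes_).index(1)  # column holding P(label == 1)
positive_proba = proba[:, pos_col]

# Same values as an explicit row-by-row loop.
looped = np.asarray([row[1] for row in proba])
print(np.allclose(positive_proba, looped))  # True
```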


In [24]:
results = pd.DataFrame({
    "id": test["id"],
    "HOF": y_test
    })

In [27]:
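# 'Player' was dropped from df_career before modeling, so reload the id -> Player lookup from the CSV
players = pd.read_csv('../datasets/df_career.csv')[['id', 'Player']]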
results = players.merge(results, on='id')

In [28]:
results  = results.sort_values(by='HOF', ascending=False)
results = results.set_index('id')
results.head(20)

| id | Player | HOF |
|------|--------|---------|
| 2943 | LeBron James | 0.829534 |
| 2625 | Dirk Nowitzki | 0.818191 |
| 3219 | Kevin Durant | 0.807564 |
| 2455 | Kobe Bryant | 0.796548 |
| 2534 | Tim Duncan | 0.794496 |
| 2400 | Kevin Garnett | 0.789097 |
| 3095 | Chris Paul | 0.752076 |
| 2629 | Paul Pierce | 0.751899 |
| 2554 | Tracy McGrady | 0.749821 |
| 3324 | Russell Westbrook | 0.739049 |


**AdaBoost Classifier**

In [29]:
# Instantiate AdaBoost Classifier
ada = AdaBoostClassifier(learning_rate= 0.7, n_estimators = 10000)
ada.fit(X_train, y_train)
prediction = ada.predict(X_test)
ada_score = ada.score(X_train, y_train)
pred_proba = ada.predict_proba(X_test)
print('AdaBoost Model Score:',round(ada_score,4))
print('------------------------------')
print(pred_proba)

AdaBoost Model Score: 1.0
------------------------------
[[0.46296092 0.53703908]
 [0.4893291  0.5106709 ]
 [0.50839544 0.49160456]
 ...
 [0.50522556 0.49477444]
 [0.5052286  0.4947714 ]
 [0.49889649 0.50110351]]


In [30]:
# Keep the second column of predict_proba: the predicted probability of HOF = 1 for each test player
y_test = pred_proba[:, 1]


In [31]:
results = pd.DataFrame({
    "id": test["id"],
    "HOF": y_test
    })

In [32]:
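# 'Player' was dropped from df_career before modeling, so reload the id -> Player lookup from the CSV
players = pd.read_csv('../datasets/df_career.csv')[['id', 'Player']]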
results = players.merge(results, on='id')

In [33]:
results  = results.sort_values(by='HOF', ascending=False)
results = results.set_index('id')
results.head(20)

| id | Player | HOF |
|------|--------|---------|
| 2534 | Tim Duncan | 0.620322 |
| 2400 | Kevin Garnett | 0.602696 |
| 2943 | LeBron James | 0.600571 |
| 3219 | Kevin Durant | 0.589252 |
| 3342 | Stephen Curry | 0.581682 |
| 3342 | Willie Reed | 0.581682 |
| 2455 | Kobe Bryant | 0.572167 |
| 2999 | Dwight Howard | 0.568905 |
| 2625 | Dirk Nowitzki | 0.568744 |
| 3095 | Chris Paul | 0.558395 |
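
The table above lists id 3342 twice, once for Stephen Curry and once for Willie Reed, with an identical probability. One possible explanation is that `id` is not unique in the `players` lookup, so the merge copies a single score onto both names; that is an assumption, not something the notebook confirms. The sketch below, on invented rows, shows two ways such a case could be surfaced with pandas: `duplicated` to list shared ids, and `merge(validate=...)` to make the merge fail loudly.

```python
import pandas as pd

# Tiny invented lookup in which one id is shared by two players.
players_demo = pd.DataFrame({
    "id": [1, 2, 2],
    "Player": ["Player A", "Player B", "Player C"],
})
scores_demo = pd.DataFrame({"id": [1, 2], "HOF": [0.10, 0.58]})

# List every row whose id appears more than once.
dupes = players_demo[players_demo["id"].duplicated(keep=False)]
print(dupes["Player"].tolist())  # ['Player B', 'Player C']

# validate= makes merge raise instead of silently duplicating scores.
try:
    players_demo.merge(scores_demo, on="id", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)  # False
```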


**XGBoost Classifier**

In [48]:
# Instantiate XGBoost Classifier
xg = XGBClassifier(eta = 0,scale_pos_weight = 0.4,learning_rate=0.6)
xg.fit(X_train, y_train)
prediction = xg.predict(X_test)
xg_score = xg.score(X_train, y_train)
pred_proba = xg.predict_proba(X_test)
print('XGBoost Model Score:',round(xg_score,4))
print('------------------------------')
print(pred_proba)

XGBoost Model Score: 1.0
------------------------------
[[5.4012537e-03 9.9459875e-01]
 [9.8654753e-01 1.3452464e-02]
 [9.0997899e-01 9.0020984e-02]
 ...
 [9.9273723e-01 7.2627435e-03]
 [9.9937916e-01 6.2081299e-04]
 [9.9988508e-01 1.1490992e-04]]


In [49]:
# Keep the second column of predict_proba: the predicted probability of HOF = 1 for each test player
y_test = pred_proba[:, 1]


results = pd.DataFrame({
    "id": test["id"],
    "HOF": y_test
    })

In [51]:
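# 'Player' was dropped from df_career before modeling, so reload the id -> Player lookup from the CSV
players = pd.read_csv('../datasets/df_career.csv')[['id', 'Player']]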
results = players.merge(results, on='id')

In [52]:
results  = results.sort_values(by='HOF', ascending=False)
results = results.set_index('id')
results.head(20)

| id | Player | HOF |
|------|--------|---------|
| 2534 | Tim Duncan | 0.994822 |
| 58 | Ray Allen | 0.994599 |
| 2625 | Dirk Nowitzki | 0.994424 |
| 2807 | Pau Gasol | 0.992963 |
| 2943 | LeBron James | 0.992784 |
| 3219 | Kevin Durant | 0.988898 |
| 2336 | Jason Kidd | 0.984888 |
| 2400 | Kevin Garnett | 0.984877 |
| 2455 | Kobe Bryant | 0.984721 |
| 2629 | Paul Pierce | 0.977102 |


| Model | Best Params | Train Score | Description |
|--------|------|---------|--------------|
| Random Forest | 'rf__max_depth': 2, 'rf__max_leaf_nodes': 3, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 10000 | 0.9105 | Lowest train score of the three. The ranking looks acceptable, but the predicted probabilities are lower than XGBoost's (about 0.74 to 0.83 for the top 10 players). |
| AdaBoost | 'ada__learning_rate': 0.7, 'ada__n_estimators': 10000 | 1.0 | The train score is 1.0, but every predicted probability is below 0.7 (the highest is 0.62). |
| XGBoost | 'xg__learning_rate': 0.6 | 1.0 | The train score is 1.0 and the top 10 players all score above 0.97, so XGBoost separates the likely Hall of Famers more decisively than the other 2 models. |


Comparing the training accuracy and the predicted probabilities of each model, XGBoost appears to perform best: it reaches a training accuracy of 1.0 and its results look the most plausible of the 3 models.
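
Every row of the test set has `HOF = 0`, so no accuracy or AUC can be computed on it, and the comparison rests on training accuracy. One alternative, sketched here under assumptions, is to compare models by cross-validated ROC AUC on the training data. The example uses synthetic data from `make_classification` and small estimator counts so it runs quickly; it is not a rerun of the notebook's models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (not the NBA table).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the positive/negative ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_auc = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {cv_auc[name]:.3f}")
```

ROC AUC is computed from how the models rank held-out players, which matches how the probability tables above are read.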

In [71]:
# Limit to the top 50 players so the scatter plot stays readable

results50 = results.head(50)
results50.to_csv('../datasets/results50.csv')

In [70]:
fig = go.Figure()

fig.add_trace(go.Scatter(x = results50['Player'], y= results50['HOF'], hovertext=results50['Player'],
                         mode = 'markers', marker_symbol = 'star-dot', marker_size=12, marker_color = 'blue'))

fig.update_layout(title_text="Player Nominated As Hall Of Fame Probability", height = 600, width = 1000)

fig.show()


## 4. Conclusion & Recommendation

**Conclusion**

After completing the EDA, I chose 20-22 features for modeling, with the aim of predicting which NBA players have a higher probability of being nominated as Hall of Famers in the near future. The train/test split was done manually instead of with `train_test_split`.

Train data: Players who had retired by 1997, or players who are already Hall of Famers<br>
Test data: Players who were still playing after 1997 and are not Hall of Famers yet.<br>

Of the 3 classification models, XGBoost appears to perform better than the other 2, with the highest training accuracy (tied with AdaBoost) and the most plausible results.

**Recommendation**

The datasets only run through 2017, so obtaining newer data should give better results and predictions from the models.


