## Predicting the NBA Finals MVP with ML
#### *by Noah Ford*


#### Importing Packages and Helper Functions

I've moved most of the imports and helper functions into an accompanying notebook, [helper_funcs.ipynb](helper_funcs.ipynb).

In [114]:
%run helper_funcs.ipynb
print('success!')

success!


#### Getting Set Up

Here we'll set up the paths to the data we will access throughout.

In [115]:
DIR = 'series'
DIR2 = 'csvs'

The jumping-off point for this notebook is the playoff history.  We will use [basketball-reference.com](https://www.basketball-reference.com) for all of our data. \
The first page we access lists every playoff series: winner, loser, Finals MVP, hyperlinks to more stats, etc.

## Data Scraping

In [116]:
BASE = "https://www.basketball-reference.com"
url = "https://www.basketball-reference.com/playoffs/series.html"
# save (from helper_funcs) takes a URL and a folder, and returns the page's html
text = save(url,DIR)
bs = BeautifulSoup(text, 'html.parser')
table = bs.find(id = 'div_playoffs_series')

The function below wraps the code above into a single callable unit.

In [117]:
def get_html_table():
    url = "https://www.basketball-reference.com/playoffs/series.html"
    text = save(url,DIR)
    bs = BeautifulSoup(text, 'html.parser')
    table = bs.find(id = 'div_playoffs_series')
    return table
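
Both versions of this step lean on `save`, which is defined in helper_funcs.ipynb and not shown here. As a rough sketch of the caching idea such a helper could use, assuming it writes each page to disk once and reads the local copy afterwards, something like the following would work. The names `cached_fetch` and `fetch` are hypothetical, and the real helper may differ.

```python
import hashlib
import os
import tempfile


def cached_fetch(url, folder, fetch):
    """Return the text for `url`, calling `fetch(url)` only on a cache miss.

    Hypothetical stand-in for the notebook's `save` helper; `fetch` is any
    callable that downloads a page and returns its HTML as a string.
    """
    os.makedirs(folder, exist_ok=True)
    # Hash the URL so the cache file name is always filesystem-safe.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(folder, name)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Demo with a fake downloader so the sketch runs without network access.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html><body>stub page</body></html>"


demo_dir = tempfile.mkdtemp()
first = cached_fetch("https://example.com/page", demo_dir, fake_fetch)
second = cached_fetch("https://example.com/page", demo_dir, fake_fetch)
```

The second call is served from disk, so the fake downloader runs only once. Caching like this keeps repeated notebook runs from hitting the site again.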

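playoff_history is a thin wrapper around the pandas function read_html, which takes html and returns a list of dataframes, one per table; we keep the first.  The result is cached as a csv so that later calls can skip the download.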

In [118]:
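from io import StringIO

def playoff_history(url,header_col=False):
    # Cache each table as a csv so the page is only fetched once
    p = os.path.join(DIR2, name_csv(url))
    if not os.path.exists(p):
        text = save(url,DIR2)
        bs = BeautifulSoup(text, 'html.parser')
        # Parse the html we already have instead of requesting the url a second time
        df = pd.read_html(StringIO(str(bs)))[0]
        # Two-row headers come back as a MultiIndex; keep only the second row
        if header_col: df.columns = df.columns.get_level_values(1)
        df.to_csv(p,index=True)
    else:
        df = pd.read_csv(p,index_col=0)
    return df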
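
The `header_col` flag exists because a table with a two-row header is read by pandas as a `MultiIndex` on the columns. Here is a small illustration of what `get_level_values(1)` keeps. The top-level labels `Info` and `Winner` are made up for the example; the real page's group labels may differ.

```python
import pandas as pd

# Made-up two-row header: a group label on top, the real column name below.
columns = pd.MultiIndex.from_tuples([
    ("Info", "Yr"),
    ("Info", "Series"),
    ("Winner", "Team"),
    ("Winner", "W"),
])
df = pd.DataFrame([[2024, "Finals", "Boston Celtics (1)", 4]], columns=columns)

# Keep only the second header row, as playoff_history does.
df.columns = df.columns.get_level_values(1)
flat_names = list(df.columns)
```

After flattening, columns can be addressed by plain names such as `df['Yr']`.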

In [119]:
url = "https://www.basketball-reference.com/playoffs/series.html"
t = playoff_history(url,True)

We'll just do a bit of cleaning to drop some unneeded columns.  Additionally, we only care about finals history.

In [120]:
def playoff_history_cleaned():
    url = "https://www.basketball-reference.com/playoffs/series.html"
    df = playoff_history(url,True)
    df.drop(df.columns[[4,7,10,-2,-1]], axis=1, inplace=True)
    df['Yr'] = pd.to_numeric(df['Yr'], errors='coerce').fillna(0).astype(int)
    df = df[df['Lg'].str.contains('NBA', na=False)].reset_index(drop=True)
    df = df[df['Yr']>1968].reset_index(drop=True)
    df = df[~df['Series'].str.contains('Conf', na=True)].reset_index(drop=True)
    df = df[~df['Series'].str.contains('Semi', na=False)].reset_index(drop=True)
    df = df[~df['Series'].str.contains('Div', na=False)].reset_index(drop=True)
    return df 

In [121]:
playoff_history_cleaned().head(5)

Unnamed: 0,Yr,Lg,Series,Unnamed: 3_level_1,Team,W,Team.1,W.1
0,2024,NBA,Finals,"Jun 6 - Jun 17, 2024",Boston Celtics (1),4,Dallas Mavericks (5),1
1,2023,NBA,Finals,"Jun 1 - Jun 12, 2023",Denver Nuggets (1),4,Miami Heat (8),1
2,2022,NBA,Finals,"Jun 2 - Jun 16, 2022",Golden State Warriors (3),4,Boston Celtics (2),2
3,2021,NBA,Finals,"Jul 6 - Jul 20, 2021",Milwaukee Bucks (3),4,Phoenix Suns (2),2
4,2020,NBA,Finals,"Sep 30 - Oct 11, 2020",Los Angeles Lakers (1),4,Miami Heat (5),2
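
The `na` argument in the `str.contains` filters decides what happens to rows whose `Series` value is missing. A toy example (the rows are invented for illustration) shows how `na=True` combined with `~` drops missing rows, while `na=False` keeps them:

```python
import pandas as pd

toy = pd.DataFrame({
    "Series": ["Finals", "Eastern Conf Finals", None, "Western Conf Semifinals"],
})

# na=True treats a missing value as a match, so ~ filters it out.
drop_missing = toy[~toy["Series"].str.contains("Conf", na=True)]

# na=False treats a missing value as a non-match, so ~ keeps it.
keep_missing = toy[~toy["Series"].str.contains("Conf", na=False)]
```

This is why only the `Conf` filter in `playoff_history_cleaned` also removes rows with no series name.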


I've already stored the Finals MVP history in a csv, so we can load it with the same playoff_history function by passing that file path instead of a url.

In [122]:
url = 'https://www.basketball-reference.com/awards/finals_mvp.html'
df = playoff_history('csvs/finals_mvp.csv',True)
df.head(5)

Unnamed: 0,Season,Lg,Player,Age,Tm,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%
0,2023-24,NBA,Jaylen Brown,27,BOS,5,38.6,20.8,5.4,5.0,1.6,0.8,0.44,0.235,0.733
1,2022-23,NBA,Nikola Jokic,27,DEN,5,41.2,30.2,14.0,7.2,0.8,1.4,0.583,0.421,0.838
2,2021-22,NBA,Stephen Curry,33,GSW,6,37.5,31.2,6.0,5.0,2.0,0.2,0.482,0.437,0.857
3,2020-21,NBA,Giannis Antetokounmpo,26,MIL,6,39.8,35.2,13.2,5.0,1.2,1.8,0.618,0.2,0.659
4,2019-20,NBA,LeBron James,35,LAL,6,39.3,29.8,11.8,8.5,1.2,0.5,0.591,0.417,0.667


Now we would like to merge three things: the finals history dataframe, the finals MVP dataframe, and a dataframe of hyperlinks to each series' summary page.

That last one is the trickiest to get, because read_html keeps only the text of each cell and drops the links.  So, as a workaround, we read the links directly from the html.

In [123]:
def add_mvp():
    df = pd.read_csv('csvs/finals_mvp.csv',index_col=0)
    return df

In [124]:
def hyperlink_table():
    table = get_html_table()
    series = [tag for tag in table.find_all('a') if "vs" in tag['href']]
    recent = [tag for tag in series if int(tag['href'].split("/")[2].split("-")[0]) > 1968]
    hrefs = [BASE + a['href'] for a in recent]
    col = [tag for tag in hrefs if 'nba-finals' in tag]
    col = pd.DataFrame(col, columns=["url"])
    return col
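
To make the filtering in `hyperlink_table` concrete, here is a standalone sketch that applies the same three tests (contains `vs`, year after 1968, contains `nba-finals`) to a small hand-written HTML fragment. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs with no extra installs. The sample links are written to resemble the site's URL pattern for illustration, and the names `HrefCollector` and `finals_links` are hypothetical.

```python
from html.parser import HTMLParser

BASE = "https://www.basketball-reference.com"

# Hand-written sample; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/playoffs/2024-nba-finals-mavericks-vs-celtics.html">Finals</a></td></tr>
  <tr><td><a href="/playoffs/2024-nba-eastern-conference-finals-pacers-vs-celtics.html">ECF</a></td></tr>
  <tr><td><a href="/playoffs/1965-nba-finals-lakers-vs-celtics.html">Finals</a></td></tr>
  <tr><td><a href="/teams/BOS/2024.html">Boston Celtics</a></td></tr>
</table>
"""


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def finals_links(html, min_year=1968):
    """Apply hyperlink_table's three filters to every link in `html`."""
    parser = HrefCollector()
    parser.feed(html)
    # Series pages have "vs" in the path; team pages do not.
    series = [h for h in parser.hrefs if "vs" in h]
    # The path segment after /playoffs/ starts with the year.
    recent = [h for h in series if int(h.split("/")[2].split("-")[0]) > min_year]
    return [BASE + h for h in recent if "nba-finals" in h]


links = finals_links(SAMPLE_HTML)
```

Only the 2024 Finals link survives: the conference finals link lacks `nba-finals`, the 1965 link is too old, and the team link has no `vs`.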

In [125]:
def merged_table():
    df = playoff_history_cleaned()
    col = hyperlink_table()
    # All three frames share a default 0..n-1 index, so concat lines them up row by row
    full = pd.concat([df,add_mvp()],axis = 1)
    full = pd.concat([full,col],axis = 1)
    return full

In [126]:
FULL_DF = merged_table()
FULL_DF.columns

Index(['Yr', 'Lg', 'Series', 'Unnamed: 3_level_1', 'Team', 'W', 'Team.1',
       'W.1', 'Season', 'Lg', 'Player', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB',
       'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'url'],
      dtype='object')
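
One caveat with `pd.concat(..., axis=1)`: it lines rows up by index label, not by content, so `merged_table` is only correct if all three frames list the seasons in the same order. The sketch below uses a few rows from the tables above to show a silent misalignment and one possible guard, an explicit year comparison. Reading the Finals year out of `Season` this way is an assumption about its format (`2023-24` meaning the 2024 Finals).

```python
import pandas as pd

finals = pd.DataFrame({"Yr": [2024, 2023, 2022]})
mvps = pd.DataFrame({
    "Season": ["2023-24", "2022-23", "2021-22"],
    "Player": ["Jaylen Brown", "Nikola Jokic", "Stephen Curry"],
})

merged = pd.concat([finals, mvps], axis=1)

# Guard: the season's starting year plus one should equal the Finals year.
season_end = merged["Season"].str.slice(0, 4).astype(int) + 1
rows_aligned = bool((season_end == merged["Yr"]).all())

# If one frame were sorted the other way, concat would still run without error.
reversed_mvps = mvps.iloc[::-1].reset_index(drop=True)
shuffled = pd.concat([finals, reversed_mvps], axis=1)
shuffled_end = shuffled["Season"].str.slice(0, 4).astype(int) + 1
shuffled_aligned = bool((shuffled_end == shuffled["Yr"]).all())
```

A check like `rows_aligned` is cheap to run once after merging. Merging on an explicit year key would be a sturdier alternative.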

In [127]:
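def winner_abbrev(url):
    bs = BeautifulSoup(save(url,DIR), 'html.parser')
    # Take the first link of the form /teams/<ABBREV>/....html; skip <a> tags with no href
    ref = [link['href'] for link in bs.find_all('a', href=True) if 'teams' in link['href'] and '.html' in link['href']][0]
    return ref.split('/')[2]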

In [128]:
winner_abbrev(FULL_DF['url'][0])

'BOS'

We enter some dicey territory here because, for some reason, Basketball Reference keeps the playoff series stats tables inside an html comment.  That means we have to parse the comment into usable html before we can proceed.

We wrote the winner_abbrev function above because the box scores are ordered not by who won, but by who was home.  So when we search the page, we need logic to jump to the correct box score.  We get there by pulling the winning team's abbreviation, which appears in the id of that team's box score.

In [129]:
def save_table(url):
    p = os.path.join(DIR2, name_csv(url))
    if not os.path.exists(p):
        # The winner's box score sits in the element with id 'all_<ABBREV>'
        text = save_tag(url,DIR,f'all_{winner_abbrev(url)}')
        bs = BeautifulSoup(text, 'html.parser')
        # The stats table is wrapped in an html comment, so collect every comment
        comments = bs.find_all(string=lambda s: isinstance(s, Comment))
        # and keep the first one long enough to hold the table
        table_c = [c for c in comments if len(c) > 10000][0]
        comment_soup = BeautifulSoup(table_c, 'html.parser')
        table = comment_soup.find('table')
        df = pd.read_html(str(table))[0]
        df.columns = df.columns.get_level_values(1)
        df.drop(df.columns[[0]], axis=1, inplace=True)
        # to_csv writes the file itself; given a path it returns None
        df.to_csv(p)
    else:
        df = pd.read_csv(p)
        df.drop(df.columns[[0]], axis=1, inplace=True)
    return df
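
Stripped of caching, the core trick in `save_table` is that the stats table sits inside an html comment and has to be parsed a second time. Here is a dependency-free sketch of that step on a made-up fragment. It uses the standard library's `HTMLParser.handle_comment` where the notebook uses BeautifulSoup's `Comment`, and it picks the comment by looking for `<table` instead of by length. That is an alternative to the `len(c) > 10000` test, not the notebook's method. The class names are hypothetical.

```python
from html.parser import HTMLParser

# Made-up page fragment: the table is hidden inside the second comment.
PAGE = """
<div id="all_BOS">
  <!-- short note -->
  <!--
  <table id="BOS">
    <tr><th>Player</th><th>PTS</th></tr>
    <tr><td>Player A</td><td>104</td></tr>
  </table>
  -->
</div>
"""


class CommentCollector(HTMLParser):
    """Gather the raw text of every html comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class CellCollector(HTMLParser):
    """Gather the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


# First pass: find the comment that contains a table.
collector = CommentCollector()
collector.feed(PAGE)
hidden = [c for c in collector.comments if "<table" in c][0]

# Second pass: parse the comment's text as html in its own right.
cells = CellCollector()
cells.feed(hidden)
rows = cells.rows
```

Selecting on `<table` avoids depending on how long the comment happens to be, at the cost of assuming the target is the first commented-out table.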

To speed up later retrieval, I'll store each table as a csv, which is quicker to read back than the raw html.

In [130]:
def get_table(url):
    p = os.path.join(DIR2, name_csv(url))
    df = pd.read_csv(p)
    df.drop(df.columns[[0]], axis=1, inplace=True)
    return df
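
Both `save_table` and `get_table` drop the first column after `read_csv`. That column is the dataframe index, which `to_csv` writes by default and `read_csv` reads back as an ordinary unnamed column. A quick round trip on invented data shows this, along with the `index_col=0` alternative:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": [30, 12]})
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Given a path, to_csv writes the file and returns None.
returned = df.to_csv(path)

# Without index_col, the saved index comes back as a column named 'Unnamed: 0'.
raw = pd.read_csv(path)

# With index_col=0, the index is restored and no extra column appears.
restored = pd.read_csv(path, index_col=0)
```

Either approach works; `index_col=0` just removes the need for the follow-up `drop`.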

In [131]:
for link in tqdm(FULL_DF['url']):
    try:
        save_table(link)
    except Exception:
        # Retry once before giving up on this series
        save_table(link)

100%|██████████| 56/56 [00:00<00:00, 1131.84it/s]


In [132]:
url = FULL_DF['url'][2]
get_table(url).head(5)

Unnamed: 0,Player,Age,G,MP,FG,FGA,3P,3PA,FT,FTA,...,PTS,FG%,3P%,FT%,MP.1,PTS.1,TRB.1,AST.1,STL.1,BLK.1
0,Stephen Curry,33.0,6,225,66,137,31,71,24,28,...,187,0.482,0.437,0.857,37.5,31.2,6.0,5.0,2.0,0.2
1,Andrew Wiggins,26.0,6,235,45,101,11,37,9,13,...,110,0.446,0.297,0.692,39.2,18.3,8.8,2.2,1.5,1.5
2,Klay Thompson,31.0,6,230,36,101,20,57,10,10,...,102,0.356,0.351,1.0,38.3,17.0,3.0,2.0,1.3,0.5
3,Jordan Poole,22.0,6,125,27,62,15,39,10,11,...,79,0.435,0.385,0.909,20.8,13.2,1.8,1.8,0.5,0.2
4,Draymond Green,31.0,6,217,14,42,2,16,7,12,...,37,0.333,0.125,0.583,36.2,6.2,8.0,6.2,1.7,0.7


## Data Cleaning

We're almost there!  We just need to clean the data into integers and floats that the machine learning models can read.

In [133]:
def clean_table(url):
    t = get_table(url)
    mvp = mvp_from_url(url)
    # Drop the attempt columns (FGA, 3PA, FTA) and the six per-game columns at the end
    t.drop(t.columns[[5,7,9,-6,-5,-4,-3,-2,-1]], axis=1, inplace=True)
    # Drop the final row
    t = t.drop(t.index[-1])
    # Missing or non-numeric cells become 0
    for col in ['3P','STL','BLK','ORB','DRB','TOV','Age']:
        t[col] = pd.to_numeric(t[col], errors='coerce').fillna(0).astype(int)
    for col in ['FG%','3P%','FT%']:
        t[col] = pd.to_numeric(t[col], errors='coerce').fillna(0).astype(float)
    # Label the row whose player name matches the Finals MVP
    t['mvp'] = t['Player'] == mvp
    return t
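
The conversion pattern used throughout `clean_table` is worth seeing on its own. In this small example with invented values, anything that cannot be parsed as a number becomes `NaN`, which is then replaced by 0 before the cast to `int`:

```python
import pandas as pd

raw = pd.Series(["12", "", None, "n/a", "7"])

# Unparseable entries become NaN, NaN becomes 0, then the column is cast to int.
clean = pd.to_numeric(raw, errors="coerce").fillna(0).astype(int)
```

One side effect to keep in mind: a stat that was never recorded becomes indistinguishable from a real zero, so after ranking, every player ties on that column.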

A big-picture ML decision I've made on this project is to describe each player's statistics by their rank relative to their teammates.  I'm less interested in learning that a player who put up 30/6/5 should win MVP, because if that player had a teammate who put up 35/12/9, we'd have to reconsider. \
Modeling the raw numbers instead is a potential follow-up to this project.

In [138]:
def rank_table(t):
    # G through FT%: every stat column between Age and mvp
    cols = t.columns[2:19]
    for i,col in enumerate(cols):
        new_name = col + '!'
        # Positions 11 and 12 are TOV and PF, where fewer is better, so rank those ascending
        ascending_bool = (i == 11) or (i == 12)
        t[new_name] = t[col].rank(ascending=ascending_bool, method='min').astype(int)
        t.drop(columns = [col],inplace=True)
    return t
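
To see what `rank` with `method='min'` does to ties, and why the direction flips for turnovers, here is a toy team with invented numbers:

```python
import pandas as pd

team = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "PTS": [30, 25, 25, 10],
    "TOV": [3, 1, 2, 1],
})

# More points is better, so rank descending; tied players share the best rank.
pts_rank = team["PTS"].rank(ascending=False, method="min").astype(int)

# Fewer turnovers is better, so rank ascending.
tov_rank = team["TOV"].rank(ascending=True, method="min").astype(int)
```

Players B and C tie for 2nd in points and nobody is ranked 3rd, which is the behavior `method='min'` gives. A rank of 1 always means "best on the team" for that stat.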

top_table keeps just the top X rows of the table, since we don't care to predict Brian Scalabrine's chance at winning MVP.

In [139]:
def top_table(url,top):
    df = clean_table(url)
    df = df.drop(df.index[top:])
    df = rank_table(df)
    yr = year_from_url(url)
    df.insert(0,'Year',yr)
    return df

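Let's pull one series' url to see what the cleaned table looks like.  Index 55 is the last row, the 1969 Finals.  Every mvp value is False here because that year's Finals MVP, Jerry West, played for the losing Lakers.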

In [136]:
clean_table(FULL_DF['url'][55])

Unnamed: 0,Player,Age,G,MP,FG,3P,FT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,mvp
0,John Havlicek,28,7,336,74,0,50,0,0,77,31,0,0,0,25,198,0.457,0.0,0.847,False
1,Sam Jones,35,7,211,56,0,19,0,0,25,16,0,0,0,23,131,0.471,0.0,0.826,False
2,Larry Siegfried,29,7,181,36,0,26,0,0,18,20,0,0,0,30,98,0.391,0.0,0.897,False
3,Don Nelson,28,7,141,32,0,19,0,0,41,8,0,0,0,24,83,0.421,0.0,0.792,False
4,Em Bryant,30,7,233,31,0,15,0,0,35,19,0,0,0,27,77,0.403,0.0,0.882,False
5,Bailey Howell,32,7,193,31,0,12,0,0,37,4,0,0,0,34,74,0.333,0.0,0.6,False
6,Bill Russell,34,7,336,25,0,14,0,0,148,36,0,0,0,29,64,0.397,0.0,0.583,False
7,Tom Sanders,30,5,39,6,0,2,0,0,6,1,0,0,0,13,14,0.462,0.0,1.0,False
8,Don Chaney,22,2,10,0,0,2,0,0,1,0,0,0,0,4,2,0.0,0.0,0.667,False


This is the dataframe our learning will be based on: the eight highest-scoring players on each NBA championship team.

In [151]:
# full top 8
FULL_TOP_8 = top_table(FULL_DF['url'][0],8)
for i in tqdm(range(len(FULL_DF['url'])-1)):
    FULL_TOP_8 = pd.concat([FULL_TOP_8,top_table(FULL_DF['url'][i+1],8)],axis=0)
# df.reset_index(drop=True).to_csv('top8_full.csv')
FULL_TOP_8 = FULL_TOP_8.reset_index(drop=True)

100%|██████████| 55/55 [00:00<00:00, 99.11it/s] 


In [152]:
FULL_TOP_8.sample(5)

Unnamed: 0,Year,Player,Age,mvp,G!,MP!,FG!,3P!,FT!,ORB!,...,TRB!,AST!,STL!,BLK!,TOV!,PF!,PTS!,FG%!,3P%!,FT%!
69,2016,Richard Jefferson,35,False,1,5,6,6,5,3,...,4,7,4,7,5,4,6,2,6,7
330,1983,Julius Erving,32,False,1,2,2,1,3,2,...,2,3,4,1,5,5,3,5,1,2
165,2004,Corliss Williamson,30,False,1,8,6,6,5,6,...,7,8,8,6,4,1,6,5,6,3
295,1988,Tony Campbell,25,False,8,8,8,4,8,8,...,8,8,7,7,1,1,8,1,4,1
189,2001,Horace Grant,35,False,1,6,6,7,4,2,...,3,8,7,2,1,2,6,8,7,5


## Machine Learning

Ok, now we get into the ML classification.

We will be working with scikit-learn to perform our analysis.

In [156]:
X = FULL_TOP_8.drop(['Year','Player','mvp'],axis=1)
y = FULL_TOP_8['mvp']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    stratify=y, random_state=2022)
smote = SMOTE(sampling_strategy='minority')
X_train_SMOTE, y_train_SMOTE = smote.fit_resample(X_train,y_train)
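
SMOTE balances the training set by creating synthetic minority-class rows (here, MVPs) instead of duplicating real ones. The sketch below is a simplified, pure-Python illustration of the interpolation idea only. It is not imbalanced-learn's implementation, and the function name `smote_like` and the toy points are made up.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbours of the chosen point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random spot on the segment between the two points.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Three toy minority rows with two features each.
mvp_rows = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
new_rows = smote_like(mvp_rows, n_new=5)
```

Because each new row is interpolated, synthetic players can have fractional ranks. Note also that the notebook resamples only the training split, so the test set keeps its real class balance.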

In [157]:
pd.DataFrame(y_test).value_counts()

mvp  
False    118
True      17
dtype: int64

In [None]:
y_train_SMOTE.value_counts()

False    275
True     275
Name: mvp, dtype: int64

In [158]:
logistic_classifier = LogisticRegression(max_iter=200)
# Fit on the original (imbalanced) training split; swap in the SMOTE line below to compare
logistic_classifier.fit(X_train, y_train)
# logistic_classifier.fit(X_train_SMOTE, y_train_SMOTE)
y_pred = logistic_classifier.predict(X_test)
# Score the same training data the model was fitted on
y_train_pred_proba = logistic_classifier.predict_proba(X_train)[:, 1]
y_test_pred_proba = logistic_classifier.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[116   2]
 [  7  10]]
              precision    recall  f1-score   support

       False       0.94      0.98      0.96       118
        True       0.83      0.59      0.69        17

    accuracy                           0.93       135
   macro avg       0.89      0.79      0.83       135
weighted avg       0.93      0.93      0.93       135
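
The True-class row of the report can be rebuilt by hand from the confusion matrix, which is a useful check that the matrix is being read the right way round (scikit-learn puts actual classes on rows and predicted classes on columns):

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
tn, fp = 116, 2   # actual non-MVPs: correctly rejected, wrongly flagged
fn, tp = 7, 10    # actual MVPs: missed, correctly flagged

precision = tp / (tp + fp)   # of the players flagged as MVP, how many were
recall = tp / (tp + fn)      # of the real MVPs, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Accuracy of a baseline that predicts "not MVP" for everyone.
baseline_accuracy = (tn + fp) / (tp + tn + fp + fn)
```

Rounded to two places these give 0.83, 0.59, 0.69, and 0.93, matching the report. The always-False baseline already reaches about 0.87 accuracy, so recall on the True class is the more telling number here.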



In [159]:
test_and_pred_and_percent = pd.DataFrame({'mvp': y_test,'pred':y_pred,'prob':y_test_pred_proba.round(3)})
y_rounded = test_and_pred_and_percent.sort_index()
# Join the predictions back onto the player rows; only test-set rows survive the inner join
output = FULL_TOP_8.drop(columns=['mvp']).join(y_rounded, how="inner").sort_values(by=['prob'],ascending=False)
winners = output[(output['Year']==1978)]
winners

Unnamed: 0,Year,Player,Age,G!,MP!,FG!,3P!,FT!,ORB!,DRB!,...,BLK!,TOV!,PF!,PTS!,FG%!,3P%!,FT%!,mvp,pred,prob
375,1978,Larry Wright,23,1,8,8,1,8,8,8,...,7,1,1,8,8,1,1,False,False,0.0
374,1978,Wes Unseld,31,1,3,7,1,6,2,1,...,6,2,6,7,2,1,8,True,False,0.0
372,1978,Charles Johnson,28,1,6,4,1,7,6,7,...,7,3,1,5,6,1,3,False,False,0.0


In [164]:
def dirty(url):
    # Quick-and-dirty baseline: pick the player with the most combined points, rebounds, and assists
    df = clean_table(url)
    df['dirty'] = df['PTS'] + df['TRB'] + df['AST']
    player_max = df['dirty'].idxmax()
    return df.loc[player_max, 'Player']
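
The selection step inside `dirty` can be tried on a toy box score. The players and totals below are invented for illustration:

```python
import pandas as pd

box = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PTS": [150, 120, 60],
    "TRB": [20, 70, 30],
    "AST": [30, 25, 50],
})

# Combined counting stats per player: 200, 215, 140.
combined = box["PTS"] + box["TRB"] + box["AST"]

# idxmax returns the index label of the largest value; use it to look up the name.
pick = box.loc[combined.idxmax(), "Player"]
```

Here the top scorer (Player A) is not the pick, because Player B's rebounds push the combined total higher. A heuristic like this gives a simple reference point for judging the classifier.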