# NBA Standings


## Table of Contents

- Introduction
- Data: Collection, Cleaning and Preprocessing
- Machine Learning
- The Game
- Conclusion


### Introduction

This project stems from a [ proposed project ] developed for a Machine Learning course, using Supervised Learning. I will attempt to get better results and try out different models to establish a well-defined tournament.
Previously, I used features intended to capture what helps win a game, such as Blocks, Steals, Rebounds, etc. Logistic and linear regression were the only models used to make these win predictions. <br/>
For this project, I will add more features and experiment with more regression models, using more advanced data to create an efficient model that predicts the winner of team matchups. The data was extracted from the [NBA Site](https://www.nba.com).

<font color=red>Note:</font> [Selenium](https://selenium-python.readthedocs.io/installation.html) was used to retrieve the source data. <br/>
This notebook contains interactive graphs, which may require you to install a few dependencies. Graph images are displayed below their respective cells so they can be viewed without running the program.


### Data


#### Collection [Selenium]

Check [ ]() for Selenium Code


In [1]:
import pandas as pd
import numpy as np
import random

#### Cleaning & Processing

Now we're going to:

- Read data from the previously exported CSV file
- Normalize the data to condense the distribution
- Update team abbreviations to their latest versions as of today
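
The abbreviation update can be sketched with a plain dictionary applied only to the abbreviation columns. This is a minimal illustration on a toy frame, not the notebook's own cell: `toy_df` and `ABBR_UPDATES` are hypothetical names, and the pairs mirror the legacy codes used further down in this notebook.

```python
import pandas as pd

# Legacy abbreviation -> current abbreviation (same pairs as in the notebook).
ABBR_UPDATES = {
    "NOH": "NOP", "NOK": "NOP", "VAN": "MEM",
    "CHH": "CHA", "SEA": "OKC", "NJN": "BKN",
}

# Hypothetical toy rows standing in for the real box-score data.
toy_df = pd.DataFrame({
    "TEAM ABBR": ["SEA", "UTA", "NJN"],
    "OPP ABBR": ["UTA", "SEA", "VAN"],
    "MATCH UP": ["SEA @ UTA", "UTA vs. SEA", "NJN vs. VAN"],
})

abbr_cols = ["TEAM ABBR", "OPP ABBR"]
# Replace whole-cell matches in the abbreviation columns only,
# leaving free-text columns such as MATCH UP untouched.
toy_df[abbr_cols] = toy_df[abbr_cols].replace(ABBR_UPDATES)
print(toy_df)
```

Restricting the replacement to named columns avoids accidentally rewriting an unrelated cell that happens to hold the same string.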


In [33]:
# Read data
full_df = pd.read_csv('nba_box_scores.csv')

In [34]:
# Uncomment and run below if notebook did not run
# full_df.loc[(full_df['SEASON'] < '2004-05') & (full_df['TEAM'] == 'NEW ORLEANS PELICANS'), 'CONFERENCE'] = 'WESTERN'


In [35]:
full_df

Unnamed: 0,TEAM ABBR,TEAM,OPP ABBR,OPPONENT,SEASON,MATCH UP,GAME DATE,W/L,MIN,PTS,...,DREB,REB,AST,TOV,STL,BLK,PF,+/-,CONFERENCE,REGION
0,NYK,NEW YORK KNICKS,CHI,CHICAGO BULLS,2023-24,NYK vs. CHI,4/14/24,W,53,120,...,37,53,27,21.0,7,6,17,1.0,EASTERN,ATLANTIC
1,CHA,CHARLOTTE HORNETS,CLE,CLEVELAND CAVALIERS,2023-24,CHA @ CLE,4/14/24,W,48,120,...,37,47,36,10.0,10,9,11,10.0,EASTERN,SOUTHEAST
2,POR,PORTLAND TRAIL BLAZERS,SAC,SACRAMENTO KINGS,2023-24,POR @ SAC,4/14/24,L,48,82,...,31,54,18,18.0,11,2,19,-39.0,WESTERN,NORTHWEST
3,HOU,HOUSTON ROCKETS,LAC,LA CLIPPERS,2023-24,HOU @ LAC,4/14/24,W,48,116,...,42,59,31,18.0,7,8,13,11.0,WESTERN,SOUTHWEST
4,MIL,MILWAUKEE BUCKS,ORL,ORLANDO MAGIC,2023-24,MIL @ ORL,4/14/24,L,48,88,...,27,34,16,17.0,10,4,18,-25.0,EASTERN,CENTRAL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66167,PHI,PHILADELPHIA 76ERS,MIL,MILWAUKEE BUCKS,1996-97,PHI vs. MIL,11/1/96,L,48,103,...,26,40,25,14.0,6,4,29,-8.0,EASTERN,ATLANTIC
66168,UTA,UTAH JAZZ,SEA,OKLAHOMA CITY THUNDER,1996-97,UTA vs. SEA,11/1/96,W,48,99,...,32,43,26,9.0,7,6,27,8.0,WESTERN,NORTHWEST
66169,NYK,NEW YORK KNICKS,TOR,TORONTO RAPTORS,1996-97,NYK @ TOR,11/1/96,W,48,107,...,33,44,23,24.0,15,5,29,8.0,EASTERN,ATLANTIC
66170,SEA,OKLAHOMA CITY THUNDER,UTA,UTAH JAZZ,1996-97,SEA @ UTA,11/1/96,L,48,91,...,34,46,16,12.0,6,6,26,-8.0,WESTERN,NORTHWEST


In [36]:
# Prepare name changes
old_abbrs = [['NOH','NOK'], 'VAN', 'CHH', 'SEA', 'NJN' ]
new_abbrs = ['NOP', 'MEM', 'CHA', 'OKC', 'BKN']

for idx,abbr in enumerate(old_abbrs):
    full_df = full_df.replace(abbr,new_abbrs[idx])

team_abbr = full_df['TEAM ABBR'].unique()
team_name = full_df['TEAM']
teams = {}

# Recreate teams Dictionary
for each in team_abbr:
    
    indicies = list(np.where(full_df['TEAM ABBR'] == each))[0]
    teams[each] = full_df['TEAM'].iloc[indicies[0]]    


##### Statistics Below


In [37]:
print(f"Size of Dataframe {full_df.shape}")


Size of Dataframe (66172, 30)


In [38]:
# Sorted Team List for easier read
from collections import OrderedDict

sorted_teamsDict = OrderedDict(sorted(teams.items()))

distinct_teams = sorted(full_df['TEAM'].unique())
wins = []
losses = []
total = []
win_rate = []

for each in sorted_teamsDict:

    win = sum(full_df[full_df['TEAM ABBR'] == each]['W/L'] == 'W') # CHANGE back to W
    loss = sum(full_df[full_df['TEAM ABBR'] == each]['W/L'] == 'L') # CHANGE  back to L
    wins.append(win)
    losses.append(loss)
    total.append(win+loss)
    win_rate.append(round(win/(win+loss),2))

bar_chart_df = pd.DataFrame(data={'WINS':wins, 'LOSSES':losses, 'TOTAL': total, 'WIN RATE':win_rate}, index=sorted_teamsDict.values())
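
The same win/loss tally can also be computed without an explicit loop. The following is a small sketch on hypothetical toy data (`toy_df` is not the real box-score frame), using `pd.crosstab` to count results per team.

```python
import pandas as pd

# Hypothetical toy results; the real frame has one row per team per game.
toy_df = pd.DataFrame({
    "TEAM ABBR": ["BOS", "BOS", "BOS", "NYK", "NYK"],
    "W/L": ["W", "W", "L", "L", "W"],
})

# Count wins and losses per team in a single pass.
record = pd.crosstab(toy_df["TEAM ABBR"], toy_df["W/L"])
record = record.rename(columns={"W": "WINS", "L": "LOSSES"})
record["TOTAL"] = record["WINS"] + record["LOSSES"]
record["WIN RATE"] = (record["WINS"] / record["TOTAL"]).round(2)
print(record)
```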

In [39]:
# Bar Chart of Results from SEASONS 1996-2024
bar_chart_df

Unnamed: 0,WINS,LOSSES,TOTAL,WIN RATE
ATLANTA HAWKS,1043,1180,2223,0.47
BROOKLYN NETS,994,1234,2228,0.45
BOSTON CELTICS,1212,1015,2227,0.54
CHARLOTTE HORNETS,902,1155,2057,0.44
CHICAGO BULLS,1067,1154,2221,0.48
CLEVELAND CAVALIERS,1079,1142,2221,0.49
DALLAS MAVERICKS,1256,975,2231,0.56
DENVER NUGGETS,1138,1091,2229,0.51
DETROIT PISTONS,1044,1178,2222,0.47
GOLDEN STATE WARRIORS,1091,1130,2221,0.49


In [40]:
# Bar Chart of Wins and Losses
from plotly import express as px

fig = px.bar(bar_chart_df, x= bar_chart_df.index, y= ['LOSSES','WINS'], barmode = 'stack', text_auto = True,
             height=400)

fig.update_layout(
    title= 'Wins & Losses of Teams [1996-2024]',
    xaxis_title='Team',
    yaxis_title='Total Games Played',
    legend_title='Result',
)
fig.show()



In [41]:
# Bar Chart of Overall Winning Rate
fig1 = px.bar(bar_chart_df, x= bar_chart_df.index, y='WIN RATE')

fig1.update_layout(
    title= 'Teams\' Historical Performance',
    xaxis_title='Team',
    yaxis_title='Win Rate',
)
fig1.show()

In [42]:
# Condense the distribution [Normalization] ** Terrible practice **

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

encoder = LabelEncoder()
scalar = MinMaxScaler()

full_df["W/L"] = encoder.fit_transform(full_df["W/L"]) # WIN: 1 , LOSE: 0

num_cols = full_df.select_dtypes(include='number')

norm_df = pd.DataFrame(scalar.fit_transform(num_cols), columns = num_cols.columns)
norm_df = pd.concat([full_df.drop(columns=num_cols.columns),norm_df], axis=1)

norm_df = norm_df.sort_values(by=['SEASON']).reset_index(drop=True)
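
For reference, min-max scaling with the default range maps each numeric column to [0, 1] via (x - min) / (max - min). The sketch below works this out by hand on a hypothetical three-row frame, so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical values for two box-score columns.
toy = pd.DataFrame({"PTS": [82, 101, 120], "REB": [34, 47, 59]})

# Column-wise min-max scaling: the smallest value becomes 0, the largest 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Here the middle `PTS` value becomes (101 - 82) / (120 - 82) = 0.5.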


### Machine Learning

- Splitting the data 80/10/10 into train/val/test
- Experimenting with DecisionTreeRegressor
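
One common way to use an 80/10/10 split is to fit on the training portion only and treat the validation and test portions purely as held-out data for scoring. The sketch below shows that pattern on synthetic data. The arrays and the choice of `LogisticRegression` are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 80% train, then split the remaining 20% evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Fit once on the training split; the other splits are only scored.
model = LogisticRegression().fit(X_train, y_train)
scores = {
    "train": model.score(X_train, y_train),
    "val": model.score(X_val, y_val),
    "test": model.score(X_test, y_test),
}
print(scores)
```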


In [43]:
# Prepare data for ML
qual_features = norm_df.select_dtypes(['category', 'object'])  # qualitative (text) columns
quant_features = norm_df.select_dtypes(['number'])             # quantitative (numeric) columns

X = quant_features.drop(columns='W/L')
y = quant_features['W/L']

In [44]:
# Regression Models
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.linear_model import LogisticRegression

dtreg = DTR()
logreg = LogisticRegression()

X_train, X_remain, y_train, y_remain = train_test_split(X, y, test_size=0.2, random_state=42) # For training (Take 1)
X_val, X_test, y_val, y_test = train_test_split(X_remain, y_remain, test_size=0.5, random_state=42)  # For Validating and Testing (Take 2)


regression = [DTR(), LogisticRegression()]

y_pred = [] # For training
y_pred1 = [] # For validating and testing

scores = []

for each in regression:
    print(each)
    # Training
    each.fit(X_train, y_train) 
    y_pred.append(each.predict(X_remain))

    if "Tree" in str(each):
        # print('in hereee')
        feature_importances = each.feature_importances_
    
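    # Validating - Testing
    # NOTE: this refits the model on the validation split, so the scores
    # printed below come from the refit model, not the training fit above.
    each.fit(X_val, y_val) 
    y_pred1.append(each.predict(X_test))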

    train_score = round(each.score(X_train,y_train), 2)
    val_score = round(each.score(X_val,y_val), 2)
    test_score = round(each.score(X_test,y_test), 2)

    print('Train: ', train_score)
    print('Val: ', val_score)
    print('Test: ', test_score)  

    scores.append([ train_score, val_score, test_score]) 



DecisionTreeRegressor()
Train:  1.0
Val:  1.0
Test:  1.0
LogisticRegression()
Train:  0.96
Val:  0.97
Test:  0.97


In [45]:
score_df = pd.DataFrame({"DecisionTreeReg": scores[0] , "LogisticReg" : scores[1]}, index= ['train', 'val', 'test'])
score_df

Unnamed: 0,DecisionTreeReg,LogisticReg
train,1.0,0.96
val,1.0,0.97
test,1.0,0.97


First, I wanted to experiment with Decision Tree Regression. After analyzing the model, I noticed that it had a 100% prediction accuracy. Seeing this for the first time, I assumed it had to be wrong and that there was an error somewhere. 
I added a simple Logistic Regression to compare with the DTR, and saw another high performance. 
*\*Phewww*\* 😮‍💨

In [46]:
# Finding important features in the dataframe
fi_char = pd.DataFrame( {'Events' : X.columns, 'Importance': feature_importances})

In [47]:
# Feature importance graph below
fi_bar = px.bar( fi_char, y= 'Events', x='Importance', orientation='h')

fi_bar.update_layout(
    title= 'Importance of  Each Feature',
    height = 500,
    width = 1000
) 
fi_bar.show()


The above visual shows that one feature, Plus-Minus, is highly relevant to predicting the target variable. This is expected: a team's plus-minus for a game is its final point differential, so its sign alone tells you who won. A good rule of thumb is not to fill your model with too many columns, as this may lead to issues such as overfitting or features that correlate with one another, which some algorithms do not handle well. <br/>

Since there are not a lot of columns used here, we will still add these to the model.
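
To illustrate why a point-differential feature can dominate, here is a toy sketch on synthetic data (not the NBA data). It assumes the target is a win exactly when the margin is positive, and it uses `DecisionTreeClassifier` rather than the `DecisionTreeRegressor` from this notebook, since the target is a 0/1 label.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic nonzero point differentials in [-30, 30] (hypothetical data).
margin = rng.integers(1, 31, size=2000) * rng.choice([-1, 1], size=2000)
noise = rng.normal(size=2000)          # an uninformative second feature
X = np.column_stack([margin, noise])
y = (margin > 0).astype(int)           # win exactly when the margin is positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
importances = tree.feature_importances_
print(accuracy, importances)
```

A single split on the margin separates the classes, so the tree has no reason to use any other feature. Dropping such a column is one way to check how much the remaining features contribute on their own.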

In [48]:
# Model Performance for DTR and Logistic Regression

from sklearn.metrics import mean_squared_error, classification_report, ConfusionMatrixDisplay, accuracy_score

# Mean Squared Error
mse_d = mean_squared_error(y_remain,y_pred[0]) # Gets the average loss between actual and predicted value of the target
mse_l =  mean_squared_error(y_remain,y_pred[1])

# Accuracy Score
acc_d = accuracy_score(y_test,y_pred1[0])
acc_l = accuracy_score(y_test,y_pred1[1])

perform_df = pd.DataFrame({'DecisionTree':[mse_d,acc_d],'Logistic':[mse_l,acc_l]}, index=['mse','acc'])
display(perform_df)

# Classification Report for Logistic Regression
print('Classification Report for Logistic Regression')
print(classification_report(y_test,y_pred1[1],target_names=['loss','win'] )) 


Unnamed: 0,DecisionTree,Logistic
mse,0.0,0.008311
acc,1.0,0.965549


Classification Report for Logistic Regression
              precision    recall  f1-score   support

        loss       0.97      0.96      0.97      3288
         win       0.96      0.97      0.97      3330

    accuracy                           0.97      6618
   macro avg       0.97      0.97      0.97      6618
weighted avg       0.97      0.97      0.97      6618
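
A note on reading the two metrics together: when both the labels and the predictions are 0/1, every wrong prediction contributes exactly 1 to the squared error, so the MSE equals the misclassification rate (1 - accuracy) on the same data. In the table above, the MSE and the accuracy were computed on different splits (`y_remain` and `y_test`), so they are not expected to sum to 1. The sketch below uses hypothetical labels to show the identity.

```python
# Hypothetical 0/1 labels and predictions (1 = win, 0 = loss).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Two of the eight predictions are wrong, so mse is 2/8 and accuracy is 6/8.
print(mse, accuracy)
```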



# WORK ON REMAINDER

- Create Team Tournament [Regions & Conferences]
  - Probably add penalty (positive/negative) to the current match up


### Summary

In [49]:

summary_teams = {}
teams_historical = {}

for each in distinct_teams:
    ind_team = full_df[full_df['TEAM'] == each]

    prev_ssn_records = ind_team[(ind_team['SEASON'] == '2022-23')][ind_team.columns[8:28]]
    tie_break_stats = ind_team[(ind_team['SEASON'] <= '2021-22')][ind_team.columns[8:28]]

    ind_team = np.mean(prev_ssn_records, axis=0)
    tie_breaker_sum = np.mean(tie_break_stats, axis=0)

    summary_teams[each] = ind_team
    teams_historical[each] = tie_breaker_sum

# Create the Average for each team from last Season
summary_dict = pd.DataFrame(summary_teams)
teams_his = pd.DataFrame(teams_historical)
# summary_dict = summary_dict.T

### THE GAME

<u>Group play - 30 teams</u>

- All 30 teams are split into 6 groups; "random draw of last season records"
  - 3 groups of 5 teams in the East, and 3 in the West

- Each team plays every other team in its group once, for a total of 4 games each
  - Teams with the best group-play records after those 4 games move on: 3 from the East and 3 from the West.

- The team with the next-best record in each conference is claimed as the wild card (1 per conference)
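
The group-play schedule described above is a single round robin. As a quick sketch, `itertools.combinations` enumerates every pairing for one group. The five abbreviations below are an arbitrary example group, not a drawn one.

```python
from itertools import combinations

# Hypothetical five-team group.
group = ["BOS", "NYK", "MIL", "CLE", "CHI"]

# Every unordered pair of teams meets exactly once.
fixtures = list(combinations(group, 2))

# Count how many games each team appears in.
games_per_team = {team: sum(team in pair for pair in fixtures) for team in group}

print(len(fixtures))
print(games_per_team)
```

A group of 5 yields 10 games, with each team playing 4, which matches the format above.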


<u>Knockout round - 8 teams</u>

- Single-elimination match. Each team plays 1 game against another team in its conference
  - 4 teams remaining

<u>Semifinals - 4 teams</u>

- 2 teams from East 2 teams West

<u>Final - 2 teams</u>

- 1 team from East & 1 team from West

In [50]:
east_conf = full_df[full_df['CONFERENCE'] == 'EASTERN']
west_conf = full_df[full_df['CONFERENCE'] == 'WESTERN']

east_teams = list(east_conf['TEAM'].unique())
west_teams = list(west_conf['TEAM'].unique())

In [51]:
# Split Teams: split remainder teams within a Conference

def split_teams(teams, num_groups):

    random.shuffle(teams)
    sub_group = []
    groups = []

    group_size = len(teams) / num_groups
    print(" Group Size: ", group_size)

    for idx,each in enumerate(teams):
        
        if (idx + 1) % group_size != 0:
            sub_group.append(each)
        else:
            sub_group.append(each)
            groups.append(sub_group)
            sub_group = []

    return groups
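
Note that `split_teams` shuffles the list it receives in place. As an alternative sketch, the hypothetical helper `split_into_groups` below shuffles a copy and deals teams out by striding, which also copes with team counts that do not divide evenly. It is an illustration, not a drop-in replacement.

```python
import random

def split_into_groups(teams, num_groups, seed=None):
    """Shuffle a copy of `teams` and deal it into `num_groups` lists."""
    rng = random.Random(seed)
    shuffled = list(teams)      # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    # Striding deals teams out one at a time, so group sizes differ by at most one.
    return [shuffled[i::num_groups] for i in range(num_groups)]

all_teams = [f"TEAM {i}" for i in range(15)]
groups = split_into_groups(all_teams, 3, seed=1)
print(groups)
```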

In [52]:
# THE GAME: Determine the Winner of the match

def the_game(team1, team2):
    
    names = [team1.name, team2.name]

    comp_stats = pd.concat([team1,team2], axis=1).transpose()

    winner = regression[0].predict(comp_stats).astype('int64').tolist()
    # print("Results: ", winner)
    

   # Randomly choose the winner if tied
    if all(map(lambda x: x == winner[0], winner)):
        selected = random.choices(comp_stats.index)[0]
        print('Winner is: ', selected)
        return selected

    else:
        index = winner.index(max(winner))
        print('Winner is: ', names[index])
        return names[index]
    




In [53]:
# MATCH UP: Prepare teams for matchups
def match_up(conf, team):

    team_gp = dict.fromkeys(team, 0)

    for gp in conf:

        for i in range(len(gp)):

            for j in range(i+1, len(gp)):
                curr_team = gp[i]
                opp_team = gp[j]
            
                print(f'{curr_team} vs. {opp_team}')

                print('TYPE: ', type(summary_dict[curr_team]))
                # PLAY THE GAME
                winner = the_game(summary_dict[curr_team], summary_dict[opp_team])

                # win_count[winner] += 1 # Referenced in next cell as global variable
                team_gp[winner] += 1

        
        # # Base case: gets the last iterator 
        if len(gp) == 1:
        
            curr_team = conf[0][0]
            opp_team = conf[1][0]

            # print(conf)
            print(f'{curr_team} vs. {opp_team}')

            # PLAY THE GAME

            winner = the_game(summary_dict[curr_team], summary_dict[opp_team])

            # win_count[winner] += 1 # Referenced in next cell as global variable
            team_gp[winner] += 1
            break
    
    print(team_gp)
    return team_gp


In [54]:
# Last Standing: Balance each conference size

def last_standing(east, west, knockout, wild):

    # Necessary for next step

    series_east = pd.Series(east)
    series_west = pd.Series(west)

    east_ko_teams = list(series_east[series_east == max(series_east)].index)
    west_ko_teams = list(series_west[series_west == max(series_west)].index)

    # Balance Teams
    
    if knockout:
        for i in range(len(east_ko_teams), 3):
            new_east = {x:x_val for (x,x_val) in zip(east.keys(),east.values()) if x not in east_ko_teams}
            new_east_ser = pd.Series(new_east)
            # print(type(new_east_ser))
            rndm_team = list(new_east_ser[new_east_ser == max(new_east_ser)].index)
            east_ko_xtra = random.choices(rndm_team)
            east_ko_teams.extend(east_ko_xtra)

        for i in range(len(west_ko_teams), 3):
            new_west = {x:x_val for (x,x_val) in zip(west.keys(),west.values()) if x not in west_ko_teams}
            new_west_ser = pd.Series(new_west)
            # print(type(new_west_ser))
            rndm_team = list(new_west_ser[new_west_ser == max(new_west_ser)].index)
            west_ko_xtra = random.choices(rndm_team)
            west_ko_teams.extend(west_ko_xtra)
        
    
     
    print('East: ',len(east_ko_teams))
    print('West: ',len(west_ko_teams))


    if wild:
        # WILD CARDS
        east_wcgp = list(series_east[series_east == max(series_east)- 1].index)
        west_wcgp = list(series_west[series_west == max(series_west)- 1].index)
        
        
        east_wc = random.choices(east_wcgp)
        west_wc = random.choices(west_wcgp)

        east_ko_teams.extend(east_wc)
        west_ko_teams.extend(west_wc)
    

    
    return east_ko_teams, west_ko_teams
# get teams with max score

# return last standing teams
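
The idea of advancing the teams with the most wins and breaking ties at random can also be expressed compactly. The sketch below is a simplified alternative, not the logic of `last_standing` itself. The helper name `top_k_random_ties` and the win counts are hypothetical.

```python
import random

def top_k_random_ties(win_counts, k, seed=None):
    """Return the k teams with the most wins, breaking ties at random."""
    rng = random.Random(seed)
    teams = list(win_counts)
    # Randomize the order first; the stable sort below then keeps that
    # random order among teams with equal win counts.
    rng.shuffle(teams)
    teams.sort(key=lambda team: win_counts[team], reverse=True)
    return teams[:k]

wins = {"BOS": 4, "NYK": 3, "MIL": 3, "CLE": 2, "CHI": 0}
qualified = top_k_random_ties(wins, 2, seed=7)
print(qualified)
```

With these counts, the first slot always goes to the clear leader and the second is drawn from the two teams tied on three wins.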


In [55]:
# SPLITTED FOR GROUP PLAY
splitted_east = split_teams(east_teams, 3)
splitted_west = split_teams(west_teams, 3)


# GROUP PLAY

win_count_gp = dict.fromkeys(distinct_teams, 0)

east_gp = match_up(splitted_east, east_teams)
west_gp = match_up(splitted_west, west_teams)

win_count_gp.update(east_gp)
win_count_gp.update(west_gp)


 Group Size:  5.0
 Group Size:  5.0
MILWAUKEE BUCKS vs. CHICAGO BULLS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  CHICAGO BULLS
MILWAUKEE BUCKS vs. WASHINGTON WIZARDS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  MILWAUKEE BUCKS
MILWAUKEE BUCKS vs. CLEVELAND CAVALIERS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  MILWAUKEE BUCKS
MILWAUKEE BUCKS vs. DETROIT PISTONS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  MILWAUKEE BUCKS
CHICAGO BULLS vs. WASHINGTON WIZARDS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  CHICAGO BULLS
CHICAGO BULLS vs. CLEVELAND CAVALIERS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  CLEVELAND CAVALIERS
CHICAGO BULLS vs. DETROIT PISTONS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  CHICAGO BULLS
WASHINGTON WIZARDS vs. CLEVELAND CAVALIERS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  CLEVELAND CAVALIERS
WASHINGTON WIZARDS vs. DETROIT PISTONS
TYPE:  <class 'pandas.core.series.Series'>
Winner is: 

In [56]:
# KNOCK OUT ROUND

# Prepare contestants...

east_ko_teams, west_ko_teams = last_standing(east_gp,west_gp, True,True)

# BEGIN KNOCKOUT

ko_teams = east_ko_teams + west_ko_teams

win_count_ko = dict.fromkeys(ko_teams, 0)

print(east_ko_teams)

splitted_east = split_teams(east_ko_teams, 2)
splitted_west = split_teams(west_ko_teams, 2)

print(splitted_east)
east_ko = match_up(splitted_east, east_ko_teams)
west_ko = match_up(splitted_west, west_ko_teams)

win_count_ko.update(east_ko)
win_count_ko.update(west_ko)

East:  3
West:  3
['BOSTON CELTICS', 'NEW YORK KNICKS', 'MILWAUKEE BUCKS', 'CLEVELAND CAVALIERS']
 Group Size:  2.0
 Group Size:  2.0
[['MILWAUKEE BUCKS', 'BOSTON CELTICS'], ['NEW YORK KNICKS', 'CLEVELAND CAVALIERS']]
MILWAUKEE BUCKS vs. BOSTON CELTICS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  BOSTON CELTICS
NEW YORK KNICKS vs. CLEVELAND CAVALIERS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  NEW YORK KNICKS
{'MILWAUKEE BUCKS': 0, 'BOSTON CELTICS': 1, 'NEW YORK KNICKS': 1, 'CLEVELAND CAVALIERS': 0}
LOS ANGELES LAKERS vs. OKLAHOMA CITY THUNDER
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  LOS ANGELES LAKERS
PHOENIX SUNS vs. DENVER NUGGETS
TYPE:  <class 'pandas.core.series.Series'>
Winner is:  DENVER NUGGETS
{'LOS ANGELES LAKERS': 1, 'OKLAHOMA CITY THUNDER': 0, 'PHOENIX SUNS': 0, 'DENVER NUGGETS': 1}


In [57]:
# SEMI FINALS

# Prepare contestants...

east_semi, west_semi = last_standing(east_ko,west_ko, False, False)

# BEGIN SEMI FINALS

semi_teams = east_semi + west_semi


splitted_east = split_teams(east_semi, 2)
splitted_west = split_teams(west_semi, 2)

print(east_semi)
print(splitted_east)
win_count_semif = dict.fromkeys(semi_teams, 0)

east_sf = match_up(splitted_east, east_semi)
west_sf = match_up(splitted_west, west_semi)


win_count_semif.update(east_sf)
win_count_semif.update(west_sf)

win_count_semif

East:  2
West:  2
 Group Size:  1.0
 Group Size:  1.0
['BOSTON CELTICS', 'NEW YORK KNICKS']
[['BOSTON CELTICS'], ['NEW YORK KNICKS']]
BOSTON CELTICS vs. NEW YORK KNICKS
Winner is:  BOSTON CELTICS
{'BOSTON CELTICS': 1, 'NEW YORK KNICKS': 0}
LOS ANGELES LAKERS vs. DENVER NUGGETS
Winner is:  LOS ANGELES LAKERS
{'LOS ANGELES LAKERS': 1, 'DENVER NUGGETS': 0}


{'BOSTON CELTICS': 1,
 'NEW YORK KNICKS': 0,
 'LOS ANGELES LAKERS': 1,
 'DENVER NUGGETS': 0}

In [58]:
# FINAL - CHAMPIONSHIP CUP

# Prepare contestants...

east_final, west_final = last_standing(east_sf,west_sf, False, False)

# BEGIN FINALS

final_teams = east_final + west_final

east_final = east_final[0]
west_final = west_final[0]

win_count_final = dict.fromkeys(final_teams, 0)

final_match = the_game(summary_dict[east_final],summary_dict[west_final])
win_count_final[final_match] += 1


East:  1
West:  1
Winner is:  BOSTON CELTICS


In [59]:
from collections import Counter
total_count = Counter(win_count_gp) + Counter(win_count_ko) + Counter(win_count_semif) + Counter(win_count_final)
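
One detail worth knowing when summing the per-round tallies: adding `Counter` objects keeps only positive totals, so teams that never won a game drop out of the result, although looking them up still returns 0. A small sketch with hypothetical tallies:

```python
from collections import Counter

# Hypothetical win tallies for two rounds.
group_stage = {"BOS": 3, "NYK": 2, "CHI": 0}
knockout = {"BOS": 1, "NYK": 0}

total = Counter(group_stage) + Counter(knockout)

print(total)          # CHI is absent because its total is zero
print(total["CHI"])   # a missing key in a Counter reads as 0
```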

In [60]:
print(f'CHAMPIONSHIPS: {final_match} winning {total_count[final_match]} games')


CHAMPIONSHIPS: BOSTON CELTICS winning 7 games
