<a href="https://colab.research.google.com/github/BARATZL/march-madness-supML/blob/main/NCAAMB_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the Outcome of March Madness Basketball Games

I love watching college basketball, particularly in March. My alma mater has never been in the "big dance", but the tournament has nevertheless always been very entertaining to me.

However, I do not know much about the sport itself. When it comes time to join in the office bracket pool, my bracket's success largely hinges on my luck, or the rushed google searches made just before filling out my predictions.

The above is the main motivation behind this project. Using Machine Learning concepts I have taken in during the fall semester, can I improve upon my bracket predictions from last year (where I more or less guessed)?

## Defining success




A simple way to define success of my model is to perform better than my predictions last year. I correctly guessed 42 out of 64 of the games, about 65% in total.

That sounds pretty decent for guessing, but my predicitions got worse after round 1 and 2 of the bracket. Equally weighting my predictions by each round, my accuracy looks something like this:



---



$\frac{1}{6}(R1 acc.+ R2 acc.+ R3acc...)$

Or,

$\frac{1}{6} (\frac{24}{32} + \frac{11}{16} + \frac{3}{8} + \frac{2}{4} + \frac{2}{2} + \frac{0}{1})$ = ~.55



---



So the standards I will initially aim for is a total accuracy higher than 65%, with an average accuracy across rounds higher than 55%.

# Data Sourcing and Formatting

There are two methods I have thought of that can be appropriately formatted for a model. The tabular data should be organized as follows:

Game | Team 1 | Team 1 Season Stat 1 | ... | Team 2 | Team 2 Season Stat 1 | ... | Team 1 Score | Team 2 Score | Team 1 Win (0 for no, 1 for yes)
----|----|----|----|----|----|----|----|----|---|
Purdue v. UConn | Purdue | x | ... | UConn | y | ... | 60 | 75 | 0
UConn v. Alabama | UConn | y | ... | Alabama | z | ... | 86 | 72 | 1
...|...|...|...|...|...|...|...|...|...

With this format, our model can either:

**1**. Predict the values that Team 1 and 2 score, with a subsequent function that confirms the outcome prediction the model is making.

**2**. Predict whether or not Team 1 wins.

We can assess both, but first we need to assemble our data. First, we need to create a table of season statistics.

Assembling this data across all years will be difficult, but should be possible through extracting data from [Sports Reference](https://www.sports-reference.com/cbb/). The below code begins this process.

In [27]:
from bs4 import BeautifulSoup, Comment
import numpy as np
import requests
import pandas as pd
url = 'https://www.sports-reference.com/cbb/seasons/men/2024-school-stats.html'
test = requests.get(url)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
pretty_html = soup.prettify()
tables = soup.find_all('table')  # finding table in webpage
headers = soup.find_all('th')

In [28]:
columnheaders = []
for i in headers:
 columnheaders.append(i.text)
columnheaders = columnheaders[13:50]
ncaa2324 = pd.DataFrame(columns = columnheaders)  # setting table based on webpage table columns
ncaa2324.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 37 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   School  0 non-null      object
 1   G       0 non-null      object
 2   W       0 non-null      object
 3   L       0 non-null      object
 4   W-L%    0 non-null      object
 5   SRS     0 non-null      object
 6   SOS     0 non-null      object
 7           0 non-null      object
 8   W       0 non-null      object
 9   L       0 non-null      object
 10          0 non-null      object
 11  W       0 non-null      object
 12  L       0 non-null      object
 13          0 non-null      object
 14  W       0 non-null      object
 15  L       0 non-null      object
 16          0 non-null      object
 17  Tm.     0 non-null      object
 18  Opp.    0 non-null      object
 19          0 non-null      object
 20  MP      0 non-null      object
 21  FG      0 non-null      object
 22  FGA     0 non-null      object
 23  FG

In [29]:
rows = soup.find_all('tr')
for row in rows:
  cells = row.find_all('td')
  if cells == []:
    continue
  columndata = [col.text.strip() for col in cells]
  if "NCAA" in columndata[0]:
    ncaa2324.loc[len(ncaa2324)]=columndata  # extracting everyone who made it into the NCAA tournament.

In [30]:
ncaa2324.head(5)  # checking things here

Unnamed: 0,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Akron NCAA,35,24,11,0.686,2.77,-2.08,,13,5,...,467,642,0.727,363,1278,455,197,100,394,583
1,Alabama NCAA,37,25,12,0.676,20.69,11.8,,13,5,...,650,842,0.772,472,1467,587,256,162,438,734
2,Arizona NCAA,36,27,9,0.75,24.54,9.45,,15,5,...,605,844,0.717,471,1533,665,300,133,430,590
3,Auburn NCAA,35,27,8,0.771,22.46,7.66,,13,5,...,609,812,0.75,393,1323,622,258,215,374,678
4,Baylor NCAA,35,24,11,0.686,19.5,10.71,,11,7,...,579,791,0.732,399,1229,514,236,110,421,577


This table contains some of the statistics we would like to see for our team season data when we compile games from March Madness tournaments.

However, there's an issue with this webscraping method: the statistics listed include tournament games. This is problematic because we want the model to be useful prior to the tournament takes place.

If I train a model on data partially from tournaments, there's a chance that it will negatively affect the model when it is needed before teams even have a chance to compile tournament statistics.

One way to avoid this is to make sure that I only include rate/percentage statistics.

In [31]:
ncaa2324.drop(columns=['W','L'], inplace=True)

In [32]:
columns_to_drop = ncaa2324.columns[5:9]
columns_to_drop
ncaa2324.drop(columns=columns_to_drop,inplace=True)

In [33]:
ncaa2324.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)

In [34]:
ncaa2324.iloc[:,1:] = ncaa2324.iloc[:,1:].astype(float)
column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
for col in column_ops:
  ncaa2324[col] = ncaa2324[col].astype(float)
  ncaa2324[col] = ncaa2324[col]/(ncaa2324['G'].astype(float))  # setting up per game ratios

ncaa2324.rename(columns={'Tm.':'PPG',
                         'Opp.':'PAPG',
                         'ORB':'ORBPG',
                         'TRB':'TRBPG',
                         'AST':'ASTPG',
                         'STL':'STLPG',
                         'BLK':'BLKPG',
                         'TOV':'TOVPG',
                         'PF':'PFPG'},inplace=True)

In [35]:
ncaa2324.drop(columns=['G','W-L%'],inplace=True)

Now I've eliminated most of the obvious indicators of postseason success that would not be available if we try to practically apply a model next year. Next, to create game data, and stitch these seasonal statistics onto the table with game data. I plan to pull game data again from Sports Reference.

In [36]:
# first, making sure school names are uniform with the second table. This way, there's a seamless merge.
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('NCAA$', '', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('Brigham Young$', 'BYU', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('Connecticut$', 'UConn', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('North Carolina$', 'UNC', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Mary's (CA)", "Saint Mary's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Peter's", "St. Peter's", regex=False)
print(ncaa2324['School'].unique())

['Akron' 'Alabama' 'Arizona' 'Auburn' 'Baylor' 'Boise State' 'BYU'
 'Clemson' 'Colgate' 'College of Charleston' 'Colorado' 'Colorado State'
 'UConn' 'Creighton' 'Dayton' 'Drake' 'Duke' 'Duquesne' 'Florida'
 'Florida Atlantic' 'Gonzaga' 'Grambling' 'Grand Canyon' 'Houston'
 'Howard' 'Illinois' 'Iowa State' 'James Madison' 'Kansas' 'Kentucky'
 'Long Beach State' 'Longwood' 'Marquette' 'McNeese State'
 'Michigan State' 'Mississippi State' 'Montana State' 'Morehead State'
 'NC State' 'Nebraska' 'Nevada' 'New Mexico' 'UNC' 'Northwestern'
 'Oakland' 'Oregon' 'Purdue' "Saint Mary's" "St. Peter's" 'Samford'
 'San Diego State' 'South Carolina' 'South Dakota State' 'Stetson' 'TCU'
 'Tennessee' 'Texas' 'Texas A&M' 'Texas Tech' 'UAB' 'Utah State' 'Vermont'
 'Virginia' 'Wagner' 'Washington State' 'Western Kentucky' 'Wisconsin'
 'Yale']


In [37]:
mm2324 = pd.DataFrame(columns=['T1 Seed','T1 Name','T1 Score','T2 Seed','T2 Name','T2 Score','Round'])  # making combined table template

In [38]:
mm2324

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round


In [39]:
url2 = 'https://www.sports-reference.com/cbb/postseason/men/2024-ncaa.html'
test = requests.get(url2)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
i = 0
for comment in comments:
    if "game" in comment:  # the html had <--game--> comments wherever they placed bracket games.
        next_element = comment.find_next_sibling()
        try:
          seed1 = next_element.find('span').text.strip()
        except:
          break
        name1 = next_element.find('a', href=True).text.strip()
        try:
          score1 = next_element.find_all('a',href=True)[1].text.strip()
        except:
          continue
        third_element = next_element.find_next_sibling()
        seed2 = third_element.find('span').text.strip()
        name2 = third_element.find('a', href=True).text.strip()
        score2 = third_element.find_all('a',href=True)[1].text.strip()
        i += 1
        if i < 33:
          round = 1
        elif i >= 33 and i < 49:
          round = 2
        elif i >= 49 and i < 57:
          round = 3
        elif i >= 57 and i < 61:
          round = 4
        elif i >= 61 and i < 63:
          round = 5
        elif i >= 63:
          round = 6
        mm2324.loc[len(mm2324)] = [seed1,name1,score1,seed2,name2,score2,round]
mm2324.loc[len(mm2324)] = [1,'UConn',75,1,'Purdue',60,6]  # for some reason the championship game still didn't have the score. manually inputting.

In [40]:
mm2324['Game'] = mm2324['T1 Name'] + f' ('+mm2324['T1 Seed'].astype(str)+')' ' v. ' + mm2324['T2 Name'] + ' ('+mm2324['T2 Seed'].astype(str)+')'
mm2324  # the above will be the index. Important for future so we can still determine which game the model is predicting on

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round,Game
0,1,UConn,91,16,Stetson,52,1,UConn (1) v. Stetson (16)
1,8,Florida Atlantic,65,9,Northwestern,77,1,Florida Atlantic (8) v. Northwestern (9)
2,5,San Diego State,69,12,UAB,65,1,San Diego State (5) v. UAB (12)
3,4,Auburn,76,13,Yale,78,1,Auburn (4) v. Yale (13)
4,6,BYU,67,11,Duquesne,71,1,BYU (6) v. Duquesne (11)
...,...,...,...,...,...,...,...,...
58,6,Clemson,77,2,Arizona,72,4,Clemson (6) v. Arizona (2)
59,4,Alabama,89,6,Clemson,82,4,Alabama (4) v. Clemson (6)
60,1,UConn,86,4,Alabama,72,5,UConn (1) v. Alabama (4)
61,1,Purdue,63,11,NC State,50,5,Purdue (1) v. NC State (11)


Now that the game table has been created, the seasonal stats for the teams can be inserted into the table.

In [41]:
ncaa2324['School'] = ncaa2324['School'].str.strip()
mm2324['T1 Name'] = mm2324['T1 Name'].str.strip()
mm2324['T2 Name'] = mm2324['T2 Name'].str.strip()
comb2324 = pd.merge(mm2324, ncaa2324,left_on='T1 Name',right_on='School',how='left')
comb2324.rename(columns={'SRS':'T1 SRS',
                         'SOS':'T1 SOS',
                         'PPG':'T1 PPG',
                         'PAPG':'T1 PAPG',
                         'ORBPG':'T1 ORBPG',
                         'TRBPG':'T1 TRBPG',
                         'ASTPG':'T1 ASTPG',
                         'FG%':'T1 FG%',
                         '3P%':'T1 3P%',
                         'FT%':'T1 FT%',
                         'STLPG':'T1 STLPG',
                         'BLKPG':'T1 BLKPG',
                         'TOVPG':'T1 TOVPG',
                         'PFPG':'T1 PFPG'},inplace=True)
comb2324 = pd.merge(comb2324, ncaa2324, left_on='T2 Name',right_on='School',how='left')
comb2324.rename(columns={'SRS':'T2 SRS',
                         'SOS':'T2 SOS',
                         'PPG':'T2 PPG',
                         'PAPG':'T2 PAPG',
                         'ORBPG':'T2 ORBPG',
                         'TRBPG':'T2 TRBPG',
                         'ASTPG':'T2 ASTPG',
                         'FG%':'T2 FG%',
                         '3P%':'T2 3P%',
                         'FT%':'T2 FT%',
                         'STLPG':'T2 STLPG',
                         'BLKPG':'T2 BLKPG',
                         'TOVPG':'T2 TOVPG',
                         'PFPG':'T2 PFPG'}, inplace=True)
comb2324 = comb2324.drop(columns=['School_x','School_y'])
comb2324.head(5)

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round,Game,T1 SRS,T1 SOS,...,T2 FG%,T2 3P%,T2 FT%,T2 ORBPG,T2 TRBPG,T2 ASTPG,T2 STLPG,T2 BLKPG,T2 TOVPG,T2 PFPG
0,1,UConn,91,16,Stetson,52,1,UConn (1) v. Stetson (16),26.7,8.7,...,0.463,0.365,0.766,9.685714,34.742857,13.285714,5.342857,3.085714,10.542857,14.742857
1,8,Florida Atlantic,65,9,Northwestern,77,1,Florida Atlantic (8) v. Northwestern (9),13.26,4.61,...,0.455,0.39,0.752,8.529412,31.411765,15.588235,7.029412,3.117647,8.823529,17.588235
2,5,San Diego State,69,12,UAB,65,1,San Diego State (5) v. UAB (12),14.68,8.37,...,0.449,0.327,0.746,12.885714,37.971429,13.514286,6.742857,4.542857,11.6,15.857143
3,4,Auburn,76,13,Yale,78,1,Auburn (4) v. Yale (13),22.46,7.66,...,0.467,0.351,0.707,9.878788,36.515152,15.0,6.181818,3.181818,9.575758,15.121212
4,6,BYU,67,11,Duquesne,71,1,BYU (6) v. Duquesne (11),19.33,7.86,...,0.436,0.34,0.72,10.513514,34.810811,13.432432,7.486486,4.324324,11.702703,17.135135


In [42]:
comb2324.set_index('Game',inplace=True)

In [53]:
combi2324 = comb2324.iloc[:,[6,0,1,7,8,9,10,11,12,13,14,15,16,17,18,19,3,4,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,2,5]]  # rearranging.
combi2324.info()  # null check

<class 'pandas.core.frame.DataFrame'>
Index: 63 entries, UConn (1) v. Stetson (16) to UConn (1) v. Purdue (1)
Data columns (total 35 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Round     63 non-null     int64  
 1   T1 Seed   63 non-null     object 
 2   T1 Name   63 non-null     object 
 3   T1 SRS    63 non-null     object 
 4   T1 SOS    63 non-null     object 
 5   T1 PPG    63 non-null     float64
 6   T1 PAPG   63 non-null     float64
 7   T1 FG%    63 non-null     object 
 8   T1 3P%    63 non-null     object 
 9   T1 FT%    63 non-null     object 
 10  T1 ORBPG  63 non-null     float64
 11  T1 TRBPG  63 non-null     float64
 12  T1 ASTPG  63 non-null     float64
 13  T1 STLPG  63 non-null     float64
 14  T1 BLKPG  63 non-null     float64
 15  T1 TOVPG  63 non-null     float64
 16  T2 Seed   63 non-null     object 
 17  T2 Name   63 non-null     object 
 18  T1 PFPG   63 non-null     float64
 19  T2 SRS    63 non-null     object 

In [54]:
combi2324.head(1)

Unnamed: 0_level_0,Round,T1 Seed,T1 Name,T1 SRS,T1 SOS,T1 PPG,T1 PAPG,T1 FG%,T1 3P%,T1 FT%,...,T2 FT%,T2 ORBPG,T2 TRBPG,T2 ASTPG,T2 STLPG,T2 BLKPG,T2 TOVPG,T2 PFPG,T1 Score,T2 Score
Game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
UConn (1) v. Stetson (16),1,1,UConn,26.7,8.7,81.4,63.4,0.497,0.358,0.743,...,0.766,9.685714,34.742857,13.285714,5.342857,3.085714,10.542857,14.742857,91,52


In [55]:
col_list = combi2324.columns
for col in col_list:
  try:
    combi2324[col] = combi2324[col].astype(float)
  except:
    continue
combi2324.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63 entries, UConn (1) v. Stetson (16) to UConn (1) v. Purdue (1)
Data columns (total 35 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Round     63 non-null     float64
 1   T1 Seed   63 non-null     float64
 2   T1 Name   63 non-null     object 
 3   T1 SRS    63 non-null     float64
 4   T1 SOS    63 non-null     float64
 5   T1 PPG    63 non-null     float64
 6   T1 PAPG   63 non-null     float64
 7   T1 FG%    63 non-null     float64
 8   T1 3P%    63 non-null     float64
 9   T1 FT%    63 non-null     float64
 10  T1 ORBPG  63 non-null     float64
 11  T1 TRBPG  63 non-null     float64
 12  T1 ASTPG  63 non-null     float64
 13  T1 STLPG  63 non-null     float64
 14  T1 BLKPG  63 non-null     float64
 15  T1 TOVPG  63 non-null     float64
 16  T2 Seed   63 non-null     float64
 17  T2 Name   63 non-null     object 
 18  T1 PFPG   63 non-null     float64
 19  T2 SRS    63 non-null     float64

This is the ideal table setup we're looking for. Before iterating on the past two-ish decades of March Madness basketball, I'd like to see how the model does with a train-test set of 2024.

In [56]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import MultiTaskLasso

X = combi2324.drop(columns=['T1 Name', 'T2 Name', 'T1 Score', 'T2 Score'])
y = combi2324[['T1 Score', 'T2 Score']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
preprocessing = Pipeline([('scaler', StandardScaler())])
lasso_pipe = Pipeline([('preprocessing', preprocessing),('lasso',MultiTaskLasso(alpha=0.1,random_state=1))])
lasso_pipe.fit(X_train,y_train)

In [57]:
predictions = lasso_pipe.predict(X_test)

In [58]:
yt_copy = y_test.copy()
yt_copy['T1 Win'] = yt_copy['T1 Score'] > yt_copy['T2 Score']
yt_copy['Prediction 1'] = [row[0] for row in predictions]
yt_copy['Prediction 2'] = [row[1] for row in predictions]
yt_copy['Predict T1 Win'] = yt_copy['Prediction 1'] > yt_copy['Prediction 2']
yt_copy['Prediction Correct?'] = yt_copy['Predict T1 Win'] == yt_copy['T1 Win']
yt_copy

Unnamed: 0_level_0,T1 Score,T2 Score,T1 Win,Prediction 1,Prediction 2,Predict T1 Win,Prediction Correct?
Game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Gonzaga (5) v. Kansas (4),89.0,68.0,True,81.662369,43.541411,True,True
James Madison (12) v. Duke (4),55.0,93.0,False,62.917434,88.763237,False,True
Dayton (7) v. Nevada (10),63.0,60.0,True,65.644312,60.434751,True,True
Purdue (1) v. Gonzaga (5),80.0,68.0,True,93.865815,84.544111,True,True
Purdue (1) v. NC State (11),63.0,50.0,True,78.019095,71.433421,True,True
San Diego State (5) v. UAB (12),69.0,65.0,True,84.079367,72.294425,True,True
Texas (7) v. Colorado State (10),56.0,44.0,True,74.703053,58.523031,True,True
Houston (1) v. Duke (4),51.0,54.0,False,69.934826,54.642119,True,False
Colorado (10) v. Marquette (2),77.0,81.0,False,54.525323,54.36356,True,False
Duke (4) v. NC State (11),64.0,76.0,False,72.604166,62.741328,True,False


With a really small sample size, the model predicts 9 outcomes correctly out of 13 (69% accuracy). This is surprisingly good to me! But there are a few caveats:

- The model does miss notable upsets of the test set. Duke v.s. NC State, Houston v.s. Duke, and Kentucky v.s. Oakland, with the last game's predicted scores being way off the mark.
- Even though it predicts the outcome correctly on a decent portion of the games, the predicted scores are often not close to the true score.
- This is applying all different rounds of tournament basketball at once, which we won't be able to do when using it practically. When tested with more data later, we will test it mirroring the process we would in March (Round 1 predictions, then Round 2, and so on).

What features does lasso view as important?


In [59]:
lasso_model = lasso_pipe.named_steps['lasso']
lasso_model.coef_

array([[ -1.28632685,  -8.09314741,  -9.05436266,   2.80581931,
          3.51173394,   3.05264702,  -0.68090375,   1.59732375,
          0.02866434,   0.29348371,   0.59317387,   2.24410698,
          2.58865319,   2.93368135,  -2.64913593,   1.80330937,
         -0.40469613,  -3.83008419,  -0.13338225,  -1.2513049 ,
          2.04078305,   5.20388974,  -1.70265308,   1.22984616,
          6.04349709,  -1.53834124,   2.02271848,  -0.23500458,
         -0.17048525,  -0.66802763,  -1.6039443 ],
       [ -0.57259656,  -4.35916996,  -3.04444809,  -3.06864803,
          5.65789046,   5.5239296 , -11.0978735 ,   0.08618596,
         -3.69707465,  -0.19684973,  -5.03981465,   7.84490407,
         -2.36999219,  -1.24634203,   2.8409219 ,   1.5681161 ,
         -0.82016547,   0.43126844,  -0.26111831,   2.99121665,
         -0.53961886,   2.21525164,   2.20052199,   5.24467308,
          4.90402835,   1.95657142,  -0.46814232,   3.20451099,
         -1.73331925,  -3.08187072,  -2.33982227]])

The two arrays above raise an issue with using any form of linear regression method for multi target regression. Because linear regression seeks to minimize error on each target, a team will be predicted to score a different value simply by virtue of being on the opposite side of the table I created. I'm not sure if this is something I want in my model.

What if I tried a Random Forest Regressor?



In [60]:
from sklearn.ensemble import RandomForestRegressor
rf_pipe = Pipeline([('preprocessing',preprocessing),('rf',RandomForestRegressor(n_estimators=100,random_state=1))])
rf_pipe.fit(X_train,y_train)

In [61]:
rf_predictions = rf_pipe.predict(X_test)
yt_rfcopy = y_test.copy()
yt_rfcopy['T1 Win'] = yt_rfcopy['T1 Score'] > yt_rfcopy['T2 Score']
yt_rfcopy['Prediction 1'] = [row[0] for row in rf_predictions]
yt_rfcopy['Prediction 2'] = [row[1] for row in rf_predictions]
yt_rfcopy['Predict T1 Win'] = yt_rfcopy['Prediction 1'] > yt_rfcopy['Prediction 2']
yt_rfcopy['Prediction Correct?'] = yt_rfcopy['Predict T1 Win'] == yt_rfcopy['T1 Win']
yt_rfcopy

Unnamed: 0_level_0,T1 Score,T2 Score,T1 Win,Prediction 1,Prediction 2,Predict T1 Win,Prediction Correct?
Game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Gonzaga (5) v. Kansas (4),89.0,68.0,True,78.2,68.94,True,True
James Madison (12) v. Duke (4),55.0,93.0,False,69.17,72.37,False,True
Dayton (7) v. Nevada (10),63.0,60.0,True,69.36,69.63,False,False
Purdue (1) v. Gonzaga (5),80.0,68.0,True,86.08,64.99,True,True
Purdue (1) v. NC State (11),63.0,50.0,True,82.4,65.32,True,True
San Diego State (5) v. UAB (12),69.0,65.0,True,74.99,74.29,True,True
Texas (7) v. Colorado State (10),56.0,44.0,True,66.43,68.15,False,False
Houston (1) v. Duke (4),51.0,54.0,False,76.85,63.67,True,False
Colorado (10) v. Marquette (2),77.0,81.0,False,70.36,69.53,True,False
Duke (4) v. NC State (11),64.0,76.0,False,75.13,67.13,True,False


In [52]:
rf_model = rf_pipe.named_steps['rf']
rf_model.feature_importances_


array([0.06295068, 0.00777161, 0.11029013, 0.03662097, 0.03733891,
       0.0977812 , 0.02682331, 0.00782585, 0.00813894, 0.02315277,
       0.02092493, 0.02264632, 0.01957728, 0.01371243, 0.03219846,
       0.01820401, 0.03437072, 0.03625745, 0.03579809, 0.01195948,
       0.10134685, 0.04240205, 0.01847401, 0.0109591 , 0.01530276,
       0.0281096 , 0.0425722 , 0.01115343, 0.0267725 , 0.01541695,
       0.02314703])

Random Forest does slightly worse, but the lack of data should be noted. I think it's time to expand the data we are training on. We'll use code implemented earlier as functions to iterate on lots of Sports Reference pages.

In [64]:
def get_season_stats(url_name):  # only works with Sports Reference!!
  test = requests.get(url)

  html_content = test.text  # setting up the html
  soup = BeautifulSoup(html_content, 'html.parser')
  pretty_html = soup.prettify()

  headers = soup.find_all('th')  # establishing headers
  columnheaders = []
  for i in headers:
   columnheaders.append(i.text)
  columnheaders = columnheaders[13:50]
  new_season_df = pd.DataFrame(columns = columnheaders)  # setting table based on webpage table columns

  rows = soup.find_all('tr')  # inputting the seasonal team data
  for row in rows:
    cells = row.find_all('td')
    if cells == []:
      continue
    columndata = [col.text.strip() for col in cells]
    if "NCAA" in columndata[0]:
      new_season_df.loc[len(new_season_df)]=columndata  # extracting everyone who made it into the NCAA tournament.

  new_season_df.drop(columns=['W','L'], inplace=True)  # standardizing data in ratios for model's sake
  columns_to_drop = new_season_df.columns[5:9]
  new_season_df.drop(columns=columns_to_drop,inplace=True)
  new_season_df.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)
  new_season_df.iloc[:,1:] = new_season_df.iloc[:,1:].astype(float)
  column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
  for col in column_ops:
    new_season_df[col] = new_season_df[col].astype(float)
    new_season_df[col] = new_season_df[col]/(new_season_df['G'].astype(float))  # setting up per game ratios

  new_season_df.rename(columns={'Tm.':'PPG',
                               'Opp.':'PAPG',
                               'ORB':'ORBPG',
                               'TRB':'TRBPG',
                               'AST':'ASTPG',
                               'STL':'STLPG',
                               'BLK':'BLKPG',
                               'TOV':'TOVPG',
                               'PF':'PFPG'}, inplace=True)
  new_season_df.drop(columns=['G','W-L%'],inplace=True)

  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('NCAA$', '', regex=True)  # changing/standardizing school names.
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('Brigham Young$', 'BYU', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('Connecticut$', 'UConn', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('North Carolina$', 'UNC', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Mary's (CA)", "Saint Mary's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Peter's", "St. Peter's", regex=False)

  return new_season_df


Whew. Let's see if this works.

In [65]:
get_season_stats('https://www.sports-reference.com/cbb/seasons/men/2023-school-stats.html')

Unnamed: 0,School,SRS,SOS,PPG,PAPG,FG%,3P%,FT%,ORBPG,TRBPG,ASTPG,STLPG,BLKPG,TOVPG,PFPG
0,Akron,2.77,-2.08,73.628571,66.171429,0.453,0.324,0.727,10.371429,36.514286,13.000000,5.628571,2.857143,11.257143,16.657143
1,Alabama,20.69,11.8,90.135135,81.243243,0.476,0.373,0.772,12.756757,39.648649,15.864865,6.918919,4.378378,11.837838,19.837838
2,Arizona,24.54,9.45,87.138889,72.055556,0.484,0.366,0.717,13.083333,42.583333,18.472222,8.333333,3.694444,11.944444,16.388889
3,Auburn,22.46,7.66,83.114286,68.314286,0.476,0.352,0.75,11.228571,37.800000,17.771429,7.371429,6.142857,10.685714,19.371429
4,Baylor,19.5,10.71,80.400000,71.114286,0.483,0.389,0.732,11.400000,35.114286,14.685714,6.742857,3.142857,12.028571,16.485714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63,Wagner,-10.64,-8.58,63.727273,63.121212,0.396,0.327,0.704,11.181818,35.848485,12.757576,5.909091,2.303030,9.878788,15.363636
64,Washington State,14.02,7.19,73.542857,66.714286,0.463,0.339,0.704,11.171429,37.628571,12.428571,5.171429,4.914286,10.942857,16.628571
65,Western Kentucky,1.19,-2.91,80.235294,74.441176,0.466,0.34,0.72,10.617647,39.941176,13.352941,7.647059,3.235294,13.823529,18.382353
66,Wisconsin,16.01,11.35,74.666667,70.000000,0.461,0.349,0.755,10.055556,34.000000,12.722222,6.222222,1.638889,9.944444,16.500000


Nice! I need to adjust this code such that the year of the season is here as well, I'll need to merge to both team name and year now that I am going across multiple years.

Things for tomorrow me to do : )