<a href="https://colab.research.google.com/github/BARATZL/march-madness-supML/blob/main/NCAAMB_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the Outcome of March Madness Basketball Games

I love watching college basketball, particularly in March. My alma mater has never been in the "big dance", but the tournament has nevertheless always been very entertaining to me.

However, I do not know much about the sport itself. When it comes time to join the office bracket pool, my bracket's success hinges largely on luck, or on the rushed Google searches I make just before filling out my predictions.

That is the main motivation behind this project. Using the machine learning concepts I learned during the fall semester, can I improve on my bracket predictions from last year (when I more or less guessed)?

## Defining success




A simple way to define success for my model is to outperform my own predictions from last year. I correctly guessed 42 out of 64 of the games, about 65% in total.

That sounds pretty decent for guessing, but my predictions got worse after rounds 1 and 2 of the bracket. Weighting each round equally, my accuracy looks something like this:



---



$\frac{1}{6}\left(\text{R1 acc.} + \text{R2 acc.} + \text{R3 acc.} + \dots\right)$

Or,

$\frac{1}{6} \left(\frac{24}{32} + \frac{11}{16} + \frac{3}{8} + \frac{2}{4} + \frac{2}{2} + \frac{0}{1}\right) \approx 0.55$



---



So the standards I will initially aim for are a total accuracy higher than 65% and an average accuracy across rounds higher than 55%.
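The round-averaged figure can be checked with a few lines of Python. This is only a minimal sketch: the names `round_results` and `round_averaged_accuracy` are illustrative and do not appear elsewhere in this notebook.

```python
# Correct picks and games played in each round, copied from the fractions above.
round_results = [(24, 32), (11, 16), (3, 8), (2, 4), (2, 2), (0, 1)]


def round_averaged_accuracy(results):
    """Average the per-round accuracies so that every round carries equal weight."""
    per_round = [correct / games for correct, games in results]
    return sum(per_round) / len(per_round)


baseline = round_averaged_accuracy(round_results)
print(f"Round-averaged accuracy: {baseline:.2f}")  # prints 0.55
```

The same function could later be applied to a model's per-round results to compare it against this baseline on equal terms.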

# Data Sourcing and Formatting

I have thought of two ways to frame the prediction problem, and both can be trained on the same table. The tabular data should be organized as follows:

Game | Team 1 | Team 1 Season Stat 1 | ... | Team 2 | Team 2 Season Stat 1 | ... | Team 1 Score | Team 2 Score | Team 1 Win (0 for no, 1 for yes)
----|----|----|----|----|----|----|----|----|---|
Purdue v. UConn | Purdue | x | ... | UConn | y | ... | 60 | 75 | 0
UConn v. Alabama | UConn | y | ... | Alabama | z | ... | 86 | 72 | 1
...|...|...|...|...|...|...|...|...|...

With this format, our model can either:

**1**. Predict the points that Team 1 and Team 2 score, then apply a function that turns the two predicted scores into a predicted outcome.

**2**. Predict whether or not Team 1 wins.
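Both options lead to the same `Team 1 Win` label. As a rough sketch of the "subsequent function" in option 1, the helper below (a hypothetical name, `scores_to_win`) converts a pair of score columns into a 0/1 outcome. It uses the two example rows from the table above; with a real model the inputs would be predicted scores rather than actual ones.

```python
import pandas as pd

# The two example rows from the table above.
games = pd.DataFrame({
    "Team 1": ["Purdue", "UConn"],
    "Team 2": ["UConn", "Alabama"],
    "Team 1 Score": [60, 86],
    "Team 2 Score": [75, 72],
})


def scores_to_win(team1_score, team2_score):
    """Return 1 where Team 1 outscores Team 2, otherwise 0."""
    return (team1_score > team2_score).astype(int)


games["Team 1 Win"] = scores_to_win(games["Team 1 Score"], games["Team 2 Score"])
print(games)
```

Option 2 skips this step and trains a classifier on `Team 1 Win` directly.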

We can assess both, but first we need to assemble our data, starting with a table of season statistics.

Assembling this data across all years will be difficult, but it should be possible by extracting data from [Sports Reference](https://www.sports-reference.com/cbb/). The code below begins this process.

In [234]:
from bs4 import BeautifulSoup, Comment
import numpy as np
import requests
import pandas as pd

# Season-level school statistics for 2023-24 from Sports Reference.
url = 'https://www.sports-reference.com/cbb/seasons/men/2024-school-stats.html'
test = requests.get(url)
test.raise_for_status()  # stop early if the request failed

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
pretty_html = soup.prettify()
tables = soup.find_all('table')  # all <table> elements on the page
headers = soup.find_all('th')    # all header cells, used below for column names

In [235]:
# Header cells 13 through 49 hold the 37 column names of the stats table.
columnheaders = [header.text for header in headers][13:50]
ncaa2324 = pd.DataFrame(columns=columnheaders)
ncaa2324.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 37 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   School  0 non-null      object
 1   G       0 non-null      object
 2   W       0 non-null      object
 3   L       0 non-null      object
 4   W-L%    0 non-null      object
 5   SRS     0 non-null      object
 6   SOS     0 non-null      object
 7           0 non-null      object
 8   W       0 non-null      object
 9   L       0 non-null      object
 10          0 non-null      object
 11  W       0 non-null      object
 12  L       0 non-null      object
 13          0 non-null      object
 14  W       0 non-null      object
 15  L       0 non-null      object
 16          0 non-null      object
 17  Tm.     0 non-null      object
 18  Opp.    0 non-null      object
 19          0 non-null      object
 20  MP      0 non-null      object
 21  FG      0 non-null      object
 22  FGA     0 non-null      object
 23  FG

In [236]:
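rows = soup.find_all('tr')
for row in rows:
  cells = row.find_all('td')
  if not cells:  # skip rows without data cells, such as header rows
    continue
  columndata = [col.text.strip() for col in cells]
  # Keep only schools whose name carries the "NCAA" marker.
  if "NCAA" in columndata[0]:
    ncaa2324.loc[len(ncaa2324)] = columndata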

In [237]:
ncaa2324.head(20)

Unnamed: 0,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Akron NCAA,35,24,11,0.686,2.77,-2.08,,13,5,...,467,642,0.727,363,1278,455,197,100,394,583
1,Alabama NCAA,37,25,12,0.676,20.69,11.8,,13,5,...,650,842,0.772,472,1467,587,256,162,438,734
2,Arizona NCAA,36,27,9,0.75,24.54,9.45,,15,5,...,605,844,0.717,471,1533,665,300,133,430,590
3,Auburn NCAA,35,27,8,0.771,22.46,7.66,,13,5,...,609,812,0.75,393,1323,622,258,215,374,678
4,Baylor NCAA,35,24,11,0.686,19.5,10.71,,11,7,...,579,791,0.732,399,1229,514,236,110,421,577
5,Boise State NCAA,33,22,11,0.667,13.31,7.35,,13,5,...,505,686,0.736,384,1231,404,189,78,366,550
6,Brigham Young NCAA,34,23,11,0.676,19.33,7.86,,10,8,...,406,547,0.742,405,1343,629,202,101,364,602
7,Clemson NCAA,36,24,12,0.667,16.2,10.04,,11,9,...,507,650,0.78,326,1308,533,165,138,363,585
8,Colgate NCAA,35,25,10,0.714,-0.31,-5.55,,16,2,...,368,540,0.681,286,1245,526,227,122,368,468
9,College of Charleston NCAA,35,27,8,0.771,4.09,-2.38,,15,3,...,452,626,0.722,457,1392,536,219,108,348,558


This table contains some of the statistics we would like to see for our team season data when we compile games from March Madness tournaments.

However, there's an issue with this web-scraping method: the statistics listed include tournament games. This is problematic because we want the model to be useful before the tournament takes place.

If I train a model on data that partly comes from tournament games, it may perform worse when it is actually needed, which is before teams have had a chance to compile any tournament statistics.

One way to limit this is to make sure that I only include rate/percentage statistics.
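To illustrate why rates help, here is a small sketch with made-up totals for two hypothetical teams (`Early Exit U` and `Deep Run U`, not real schools). Both produce the same amount per game, but one played six more games. The raw totals differ, while the per-game values do not.

```python
import pandas as pd

# Made-up season totals: identical per-game output, different number of games played.
totals = pd.DataFrame({
    "School": ["Early Exit U", "Deep Run U"],
    "G": [32, 38],
    "AST": [480, 570],
    "TOV": [384, 456],
})

count_cols = ["AST", "TOV"]
per_game = totals[["School"]].copy()
for col in count_cols:
    # Dividing by games played removes the effect of a longer season.
    per_game[col + "PG"] = totals[col] / totals["G"]

print(per_game)
```

Per-game figures still include whatever happened in the tournament games themselves, so this reduces the problem rather than removing it entirely.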

In [238]:
# Remove the trailing "NCAA" marker, then rename schools to match the names used in the game data.
ncaa2324['School'] = ncaa2324['School'].str.replace(r'\s*NCAA$', '', regex=True).str.strip()
ncaa2324['School'] = ncaa2324['School'].str.replace('Brigham Young$', 'BYU', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.replace('Connecticut$', 'UConn', regex=True)
ncaa2324.head(14)

Unnamed: 0,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Akron,35,24,11,0.686,2.77,-2.08,,13,5,...,467,642,0.727,363,1278,455,197,100,394,583
1,Alabama,37,25,12,0.676,20.69,11.8,,13,5,...,650,842,0.772,472,1467,587,256,162,438,734
2,Arizona,36,27,9,0.75,24.54,9.45,,15,5,...,605,844,0.717,471,1533,665,300,133,430,590
3,Auburn,35,27,8,0.771,22.46,7.66,,13,5,...,609,812,0.75,393,1323,622,258,215,374,678
4,Baylor,35,24,11,0.686,19.5,10.71,,11,7,...,579,791,0.732,399,1229,514,236,110,421,577
5,Boise State,33,22,11,0.667,13.31,7.35,,13,5,...,505,686,0.736,384,1231,404,189,78,366,550
6,BYU,34,23,11,0.676,19.33,7.86,,10,8,...,406,547,0.742,405,1343,629,202,101,364,602
7,Clemson,36,24,12,0.667,16.2,10.04,,11,9,...,507,650,0.78,326,1308,533,165,138,363,585
8,Colgate,35,25,10,0.714,-0.31,-5.55,,16,2,...,368,540,0.681,286,1245,526,227,122,368,468
9,College of Charleston,35,27,8,0.771,4.09,-2.38,,15,3,...,452,626,0.722,457,1392,536,219,108,348,558


In [239]:
ncaa2324.drop(columns=['W','L'], inplace=True)

In [240]:
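# After the previous drop, positions 5-8 are the blank-named columns.
# Dropping by label removes every column that shares that blank name.
columns_to_drop = ncaa2324.columns[5:9]
ncaa2324.drop(columns=columns_to_drop,inplace=True)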

In [241]:
ncaa2324.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)

In [242]:
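# Convert every statistic column to float. Assigning by column label (rather than
# through .iloc) replaces the object-dtype columns instead of writing into them.
stat_cols = ncaa2324.columns[1:]
ncaa2324[stat_cols] = ncaa2324[stat_cols].astype(float)
# Turn season totals into per-game averages.
column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
ncaa2324[column_ops] = ncaa2324[column_ops].div(ncaa2324['G'], axis=0)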

ncaa2324.rename(columns={'Tm.':'PPG',
                         'Opp.':'PAPG',
                         'ORB':'ORBPG',
                         'TRB':'TRBPG',
                         'AST':'ASTPG',
                         'STL':'STLPG',
                         'BLK':'BLKPG',
                         'TOV':'TOVPG',
                         'PF':'PFPG'},inplace=True)

In [243]:
ncaa2324.drop(columns=['G','W-L%'],inplace=True)

Now I've eliminated most of the obvious indicators of postseason success that would not be available if we try to apply a model in practice next year. The next step is to create the game data and stitch these seasonal statistics onto it. I plan to pull the game data from Sports Reference as well.

In [244]:
mm2324 = pd.DataFrame(columns=['T1 Seed','T1 Name','T1 Score','T2 Seed','T2 Name','T2 Score'])

In [245]:
mm2324

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score


In [246]:
url2 = 'https://www.sports-reference.com/cbb/postseason/men/2024-ncaa.html'
test = requests.get(url2)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
# Look for HTML comments containing "game"; the element after each one is read
# as team 1 and that element's next sibling as team 2.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if "game" in comment:
        next_element = comment.find_next_sibling()
        try:
            seed1 = next_element.find('span').text.strip()
        except AttributeError:  # no seed found: stop scanning
            break
        name1 = next_element.find('a', href=True).text.strip()
        try:
            score1 = next_element.find_all('a', href=True)[1].text.strip()
        except IndexError:  # no score link: skip this game
            continue
        third_element = next_element.find_next_sibling()
        seed2 = third_element.find('span').text.strip()
        name2 = third_element.find('a', href=True).text.strip()
        score2 = third_element.find_all('a', href=True)[1].text.strip()
        mm2324.loc[len(mm2324)] = [seed1, name1, score1, seed2, name2, score2]
# Add the championship game by hand.
mm2324.loc[len(mm2324)] = [1,'UConn',75,1,'Purdue',60]

In [247]:
mm2324['Game'] = (mm2324['T1 Name'] + ' (' + mm2324['T1 Seed'].astype(str) + ')'
                  + ' v. '
                  + mm2324['T2 Name'] + ' (' + mm2324['T2 Seed'].astype(str) + ')')
#mm2324.set_index('Game',inplace=True)
mm2324

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Game
0,1,UConn,91,16,Stetson,52,UConn (1) v. Stetson (16)
1,8,Florida Atlantic,65,9,Northwestern,77,Florida Atlantic (8) v. Northwestern (9)
2,5,San Diego State,69,12,UAB,65,San Diego State (5) v. UAB (12)
3,4,Auburn,76,13,Yale,78,Auburn (4) v. Yale (13)
4,6,BYU,67,11,Duquesne,71,BYU (6) v. Duquesne (11)
...,...,...,...,...,...,...,...
58,6,Clemson,77,2,Arizona,72,Clemson (6) v. Arizona (2)
59,4,Alabama,89,6,Clemson,82,Alabama (4) v. Clemson (6)
60,1,UConn,86,4,Alabama,72,UConn (1) v. Alabama (4)
61,1,Purdue,63,11,NC State,50,Purdue (1) v. NC State (11)


Now that the game table has been created, the seasonal stats for the teams can be inserted into the table.

In [249]:
ncaa2324['School'] = ncaa2324['School'].str.strip()
mm2324['T1 Name'] = mm2324['T1 Name'].str.strip()
mm2324['T2 Name'] = mm2324['T2 Name'].str.strip()
pd.merge(mm2324, ncaa2324,left_on='T1 Name',right_on='School',how='left')


Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Game
0,1,UConn,91,16,Stetson,52,UConn (1) v. Stetson (16)
1,8,Florida Atlantic,65,9,Northwestern,77,Florida Atlantic (8) v. Northwestern (9)
2,5,San Diego State,69,12,UAB,65,San Diego State (5) v. UAB (12)
3,4,Auburn,76,13,Yale,78,Auburn (4) v. Yale (13)
4,6,BYU,67,11,Duquesne,71,BYU (6) v. Duquesne (11)
...,...,...,...,...,...,...,...
58,6,Clemson,77,2,Arizona,72,Clemson (6) v. Arizona (2)
59,4,Alabama,89,6,Clemson,82,Alabama (4) v. Clemson (6)
60,1,UConn,86,4,Alabama,72,UConn (1) v. Alabama (4)
61,1,Purdue,63,11,NC State,50,Purdue (1) v. NC State (11)
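The merge above attaches season statistics for Team 1 only. One possible way to attach both teams' statistics and add the `Team 1 Win` label from the table layout described earlier is sketched below. The helper `attach_stats` is a hypothetical name, and the `PPG` and `TOVPG` values are placeholders rather than scraped figures. Only the team names and scores come from the game table.

```python
import pandas as pd

# Small stand-ins for ncaa2324 (season stats) and mm2324 (games).
season = pd.DataFrame({
    "School": ["UConn", "Alabama", "Purdue", "NC State"],
    "PPG": [80.0, 90.0, 83.0, 76.0],    # placeholder values
    "TOVPG": [9.0, 12.0, 11.0, 10.0],   # placeholder values
})
games = pd.DataFrame({
    "T1 Name": ["UConn", "Purdue"],
    "T1 Score": ["86", "63"],
    "T2 Name": ["Alabama", "NC State"],
    "T2 Score": ["72", "50"],
})


def attach_stats(games, season, side):
    """Left-join season stats for one side ('T1' or 'T2'), prefixing the stat columns."""
    stats = season.add_prefix(f"{side} ").rename(columns={f"{side} School": f"{side} Name"})
    # validate raises an error if a school appears more than once in the stats table.
    return games.merge(stats, on=f"{side} Name", how="left", validate="many_to_one")


full = attach_stats(attach_stats(games, season, "T1"), season, "T2")

# Scores were scraped as strings, so convert before comparing.
full["T1 Win"] = (full["T1 Score"].astype(int) > full["T2 Score"].astype(int)).astype(int)

# Rows with missing stats point to a school name that did not match between tables.
unmatched = full[full[["T1 PPG", "T2 PPG"]].isna().any(axis=1)]
print(full)
print(f"Unmatched games: {len(unmatched)}")
```

A left join keeps every game even when a name fails to match, so checking `unmatched` is a cheap way to find schools that still need renaming, as was done for BYU and UConn.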
