<a href="https://colab.research.google.com/github/BARATZL/march-madness-supML/blob/main/NCAAMB_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the Outcome of March Madness Basketball Games

I love watching college basketball, particularly in March. My alma mater has never been in the "big dance", but the tournament has nevertheless always been very entertaining to me.

However, I do not know much about the sport itself. When it comes time to join the office bracket pool, my bracket's success largely hinges on luck, or on the rushed Google searches I make just before filling out my predictions.

That is the main motivation behind this project. Using the machine learning concepts I picked up during the fall semester, can I improve on my bracket predictions from last year (where I more or less guessed)?

## Defining success




A simple way to define success for my model is to beat my own predictions from last year. I correctly guessed 42 of the 63 games across the six rounds of the bracket, about 67% in total.

That sounds pretty decent for guessing, but my predictions got worse after rounds 1 and 2 of the bracket. Weighting each round equally, my accuracy looks something like this:

---

$\frac{1}{6}(\text{R1 acc.} + \text{R2 acc.} + \dots + \text{R6 acc.})$

Or,

$\frac{1}{6} \left(\frac{24}{32} + \frac{11}{16} + \frac{3}{8} + \frac{2}{4} + \frac{2}{2} + \frac{0}{1}\right) \approx 0.55$

---

So the standards I will initially aim for are a total accuracy higher than 67% and an average accuracy across rounds higher than 55%.
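
Both numbers can be recomputed directly from the per-round counts in the formula above. This is a minimal sketch; the function names are mine, chosen only for illustration.

```python
# Per-round (correct, total) counts, copied from the formula above.
rounds = [(24, 32), (11, 16), (3, 8), (2, 4), (2, 2), (0, 1)]


def overall_accuracy(rounds):
    """Share of all games predicted correctly, ignoring which round they were in."""
    correct = sum(c for c, _ in rounds)
    total = sum(t for _, t in rounds)
    return correct / total


def round_averaged_accuracy(rounds):
    """Mean of the per-round accuracies, so every round counts equally."""
    return sum(c / t for c, t in rounds) / len(rounds)


print(f"overall: {overall_accuracy(rounds):.3f}")
print(f"round-averaged: {round_averaged_accuracy(rounds):.3f}")
```

The round-averaged figure punishes misses in the late rounds much more heavily, since a single championship game carries the same weight as all 32 first-round games.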

# Data Sourcing and Formatting

I have thought of two modeling approaches, and both can work from the same table format. The tabular data should be organized as follows:

Game | Team 1 | Team 1 Season Stat 1 | ... | Team 2 | Team 2 Season Stat 1 | ... | Team 1 Score | Team 2 Score | Team 1 Win (0 for no, 1 for yes)
----|----|----|----|----|----|----|----|----|---|
Purdue v. UConn | Purdue | x | ... | UConn | y | ... | 60 | 75 | 0
UConn v. Alabama | UConn | y | ... | Alabama | z | ... | 86 | 72 | 1
...|...|...|...|...|...|...|...|...|...

With this format, our model can either:

**1**. Predict the points that Team 1 and Team 2 score, then use a follow-up function to turn the two predicted scores into a predicted winner.

**2**. Predict whether or not Team 1 wins.
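
The two options are closely related: the label for option 2 is just a comparison of the two scores that option 1 predicts. A small sketch of that conversion, using the two example rows from the table above (`t1_win_from_scores` is a hypothetical helper name):

```python
def t1_win_from_scores(t1_score, t2_score):
    """Return 1 if Team 1 outscored Team 2, else 0 (the 'Team 1 Win' column)."""
    return int(t1_score > t2_score)


# (Team 1, Team 2, Team 1 Score, Team 2 Score), taken from the example table.
games = [
    ("Purdue", "UConn", 60, 75),
    ("UConn", "Alabama", 86, 72),
]

labels = [t1_win_from_scores(s1, s2) for _, _, s1, s2 in games]
print(labels)
```

The same function can be applied to predicted scores, which is how a score-regression model ends up producing a win/loss prediction.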

We can assess both, but first we need to assemble our data, starting with a table of season statistics.

Assembling this data across all years will be difficult, but it should be possible by extracting data from [Sports Reference](https://www.sports-reference.com/cbb/). The code below begins this process.

In [None]:
from bs4 import BeautifulSoup, Comment
import numpy as np
import requests
import pandas as pd
url = 'https://www.sports-reference.com/cbb/seasons/men/2024-school-stats.html'
test = requests.get(url)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
pretty_html = soup.prettify()
tables = soup.find_all('table')  # finding table in webpage
headers = soup.find_all('th')

In [None]:
columnheaders = []
for i in headers:
 columnheaders.append(i.text)
columnheaders = columnheaders[13:50]
ncaa2324 = pd.DataFrame(columns = columnheaders)  # setting table based on webpage table columns
ncaa2324.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 37 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   School  0 non-null      object
 1   G       0 non-null      object
 2   W       0 non-null      object
 3   L       0 non-null      object
 4   W-L%    0 non-null      object
 5   SRS     0 non-null      object
 6   SOS     0 non-null      object
 7           0 non-null      object
 8   W       0 non-null      object
 9   L       0 non-null      object
 10          0 non-null      object
 11  W       0 non-null      object
 12  L       0 non-null      object
 13          0 non-null      object
 14  W       0 non-null      object
 15  L       0 non-null      object
 16          0 non-null      object
 17  Tm.     0 non-null      object
 18  Opp.    0 non-null      object
 19          0 non-null      object
 20  MP      0 non-null      object
 21  FG      0 non-null      object
 22  FGA     0 non-null      object
 23  FG

In [None]:
rows = soup.find_all('tr')
for row in rows:
  cells = row.find_all('td')
  if cells == []:
    continue
  columndata = [col.text.strip() for col in cells]
  if "NCAA" in columndata[0]:
    ncaa2324.loc[len(ncaa2324)]=columndata  # extracting everyone who made it into the NCAA tournament.

In [None]:
ncaa2324.head(5)  # checking things here

Unnamed: 0,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Akron NCAA,35,24,11,0.686,2.77,-2.08,,13,5,...,467,642,0.727,363,1278,455,197,100,394,583
1,Alabama NCAA,37,25,12,0.676,20.69,11.8,,13,5,...,650,842,0.772,472,1467,587,256,162,438,734
2,Arizona NCAA,36,27,9,0.75,24.54,9.45,,15,5,...,605,844,0.717,471,1533,665,300,133,430,590
3,Auburn NCAA,35,27,8,0.771,22.46,7.66,,13,5,...,609,812,0.75,393,1323,622,258,215,374,678
4,Baylor NCAA,35,24,11,0.686,19.5,10.71,,11,7,...,579,791,0.732,399,1229,514,236,110,421,577


This table contains some of the statistics we would like to see for our team season data when we compile games from March Madness tournaments.

However, there's an issue with this web-scraping method: the statistics listed include tournament games. This is problematic because we want the model to be useful before the tournament takes place.

If I train a model on statistics that partly come from tournament games, it may perform worse when it is actually needed, before teams have had any chance to compile tournament statistics.

One way to limit this is to include only rate and per-game statistics, which a handful of extra games shifts far less than season totals or win counts.
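
To make the idea concrete, here is a minimal sketch that turns a few of Akron's season totals from the table above into per-game rates. The `per_game` helper is a hypothetical name for illustration. Note that these rates still include tournament games; they are just less dominated by them than raw totals.

```python
# Season totals for Akron, copied from the table above.
akron = {"G": 35, "TRB": 1278, "AST": 455, "TOV": 394}


def per_game(totals, games_key="G"):
    """Divide every counting stat by games played and suffix the name with 'PG'."""
    games = totals[games_key]
    return {f"{stat}PG": value / games
            for stat, value in totals.items()
            if stat != games_key}


rates = per_game(akron)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```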

In [None]:
ncaa2324.drop(columns=['W','L'], inplace=True)

In [None]:
columns_to_drop = ncaa2324.columns[5:9]
columns_to_drop
ncaa2324.drop(columns=columns_to_drop,inplace=True)

In [None]:
ncaa2324.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)

In [None]:
ncaa2324.iloc[:,1:] = ncaa2324.iloc[:,1:].astype(float)
column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
for col in column_ops:
  ncaa2324[col] = ncaa2324[col].astype(float)
  ncaa2324[col] = ncaa2324[col]/(ncaa2324['G'].astype(float))  # setting up per game ratios

ncaa2324.rename(columns={'Tm.':'PPG',
                         'Opp.':'PAPG',
                         'ORB':'ORBPG',
                         'TRB':'TRBPG',
                         'AST':'ASTPG',
                         'STL':'STLPG',
                         'BLK':'BLKPG',
                         'TOV':'TOVPG',
                         'PF':'PFPG'},inplace=True)

In [None]:
ncaa2324.drop(columns=['G','W-L%'],inplace=True)

Now I've eliminated most of the obvious indicators of postseason success that would not be available if we tried to apply a model in practice next year. Next, I need to create game data and stitch these seasonal statistics onto it. I plan to pull the game data from Sports Reference as well.

## Pulling and formatting '23 - '24 data

In [None]:
# first, making sure school names are uniform with the second table. This way, there's a seamless merge.
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('NCAA$', '', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('Brigham Young$', 'BYU', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('Connecticut$', 'UConn', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('North Carolina$', 'UNC', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Mary's (CA)", "Saint Mary's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Peter's", "St. Peter's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Louisiana State", "LSU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Southern California", "USC", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Southern Methodist", "SMU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Mississippi", "Ole Miss", regex=False)  #how to adjust?
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Ole Miss State", "Mississippi State", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Ole Miss Valley State", "Mississippi Valley State", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Southern Ole Miss", "Southern Miss", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Pittsburgh", "Pitt", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Pennsylvania", "Penn", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Central Connecticut State", "Central Connecticut", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Florida International", "FIU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Nevada-Las Vegas","UNLV", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("UC Santa Barbara", "UCSB", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Virginia Commonwealth", "VCU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Joseph's", "St. Joseph's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Maryland-Baltimore County","UMBC", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Illinois-Chicago", "UIC", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Massachusetts", "UMass", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("East Tennessee State", "ETSU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("IU Indy", "IUPUI", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("UC Irvine", "UC-Irvine", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("UC Davis", "UC-Davis", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Long Island University", "LIU", regex=False)

ncaa2324['Year'] = url[49:53]
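
The `#how to adjust?` note above points at a real hazard: substring replacement turns every name containing "Mississippi" into "Ole Miss ...", which then has to be undone for Mississippi State and the others. One possible alternative, sketched here with only a handful of the names, is exact matching through a dictionary. `NAME_MAP` and `normalize_school` are hypothetical names and are not part of the notebook.

```python
import re

# Excerpt only: full Sports Reference name -> name used on the bracket page.
NAME_MAP = {
    "Brigham Young": "BYU",
    "Connecticut": "UConn",
    "North Carolina": "UNC",
    "Mississippi": "Ole Miss",
    "Saint Mary's (CA)": "Saint Mary's",
}


def normalize_school(raw, name_map=NAME_MAP):
    """Strip the trailing 'NCAA' marker, then rename only on an exact match."""
    name = re.sub(r"\s*NCAA$", "", raw.strip())
    return name_map.get(name, name)


for raw in ["Mississippi NCAA", "Mississippi State NCAA", "Akron NCAA"]:
    print(raw, "->", normalize_school(raw))
```

Because the lookup is on the whole name, "Mississippi State" is never touched, so no follow-up corrections are needed.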


In [None]:
mm2324 = pd.DataFrame(columns=['T1 Seed','T1 Name','T1 Score','T2 Seed','T2 Name','T2 Score','Round'])  # making combined table template

In [None]:
mm2324

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round


In [None]:
url2 = 'https://www.sports-reference.com/cbb/postseason/men/2024-ncaa.html'
test = requests.get(url2)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
i = 0
j = 1
for comment in comments:
    if "game" in comment:  # the html has "game" comments (<!-- ... -->) wherever bracket games are placed.
        next_element = comment.find_next_sibling()
        try:
          seed1 = next_element.find('span').text.strip()
        except AttributeError:  # no seed found for this entry: stop scanning
          break
        name1 = next_element.find('a', href=True).text.strip()
        try:
          score1 = next_element.find_all('a',href=True)[1].text.strip()
        except IndexError:  # no score link for this game: skip it
          continue
        third_element = next_element.find_next_sibling()
        seed2 = third_element.find('span').text.strip()
        name2 = third_element.find('a', href=True).text.strip()
        score2 = third_element.find_all('a',href=True)[1].text.strip()
        i += 1

        if j < 5:
            if i <= 8:
              round = 1
            elif i <= 12:
              round = 2
            elif i <= 14:
              round = 3
            elif i == 15:
              round = 4
              j += 1  # move to the next quadrant
              i = 0
        elif j == 5:
            if i <= 2:
                round = 5
            elif i == 3:
                round = 6

        mm2324.loc[len(mm2324)] = [seed1,name1,score1,seed2,name2,score2,round]
mm2324.loc[len(mm2324)] = [1,'UConn',75,1,'Purdue',60,6]  # for some reason the championship game still didn't have the score. manually inputting.

In [None]:
mm2324['Game'] = mm2324['T1 Name'] + ' (' + mm2324['T1 Seed'].astype(str) + ') v. ' + mm2324['T2 Name'] + ' (' + mm2324['T2 Seed'].astype(str) + ')'
mm2324.drop_duplicates(['T1 Name', 'T2 Name'],inplace=True)
mm2324  # the above will be the index. Important for future so we can still determine which game the model is predicting on

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round,Game
0,1,UConn,91,16,Stetson,52,1,UConn (1) v. Stetson (16)
1,8,Florida Atlantic,65,9,Northwestern,77,1,Florida Atlantic (8) v. Northwestern (9)
2,5,San Diego State,69,12,UAB,65,1,San Diego State (5) v. UAB (12)
3,4,Auburn,76,13,Yale,78,1,Auburn (4) v. Yale (13)
4,6,BYU,67,11,Duquesne,71,1,BYU (6) v. Duquesne (11)
...,...,...,...,...,...,...,...,...
58,6,Clemson,77,2,Arizona,72,3,Clemson (6) v. Arizona (2)
59,4,Alabama,89,6,Clemson,82,4,Alabama (4) v. Clemson (6)
60,1,UConn,86,4,Alabama,72,5,UConn (1) v. Alabama (4)
61,1,Purdue,63,11,NC State,50,5,Purdue (1) v. NC State (11)


Now that the game table has been created, the seasonal stats for the teams can be inserted into the table.
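
The idea is a pair of left merges, one per side of the matchup, with each team's stats prefixed by `T1` or `T2`. Here is a minimal sketch of that pattern on a tiny frame, using two games and two stat rows from the tables above. Clemson is deliberately left out of the stats so that an unmatched name shows up as `NaN`, which is what a null check afterwards is meant to catch. `attach_stats` is a hypothetical helper, not the code used below.

```python
import pandas as pd

games = pd.DataFrame({
    "T1 Name": ["Clemson", "Alabama"],
    "T2 Name": ["Arizona", "Clemson"],
})

# SRS and SOS for two teams, copied from the season table above.
stats = pd.DataFrame({
    "School": ["Alabama", "Arizona"],
    "SRS": [20.69, 24.54],
    "SOS": [11.8, 9.45],
})


def attach_stats(games, stats, side):
    """Left-merge season stats onto one side ('T1' or 'T2') of each game."""
    prefixed = stats.set_index("School").add_prefix(f"{side} ")
    return games.merge(prefixed, left_on=f"{side} Name",
                       right_index=True, how="left")


combined = attach_stats(attach_stats(games, stats, "T1"), stats, "T2")
print(combined)
```

Prefixing before the merge avoids the `_x`/`_y` suffixes and the long rename dictionaries, at the cost of one extra helper.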

In [None]:
ncaa2324['School'] = ncaa2324['School'].str.strip()
mm2324['T1 Name'] = mm2324['T1 Name'].str.strip()
mm2324['T2 Name'] = mm2324['T2 Name'].str.strip()
comb2324 = pd.merge(mm2324, ncaa2324,left_on='T1 Name',right_on='School',how='left')
comb2324.rename(columns={'SRS':'T1 SRS',
                         'SOS':'T1 SOS',
                         'PPG':'T1 PPG',
                         'PAPG':'T1 PAPG',
                         'ORBPG':'T1 ORBPG',
                         'TRBPG':'T1 TRBPG',
                         'ASTPG':'T1 ASTPG',
                         'FG%':'T1 FG%',
                         '3P%':'T1 3P%',
                         'FT%':'T1 FT%',
                         'STLPG':'T1 STLPG',
                         'BLKPG':'T1 BLKPG',
                         'TOVPG':'T1 TOVPG',
                         'PFPG':'T1 PFPG'},inplace=True)
comb2324 = pd.merge(comb2324, ncaa2324, left_on='T2 Name',right_on='School',how='left')
comb2324.rename(columns={'SRS':'T2 SRS',
                         'SOS':'T2 SOS',
                         'PPG':'T2 PPG',
                         'PAPG':'T2 PAPG',
                         'ORBPG':'T2 ORBPG',
                         'TRBPG':'T2 TRBPG',
                         'ASTPG':'T2 ASTPG',
                         'FG%':'T2 FG%',
                         '3P%':'T2 3P%',
                         'FT%':'T2 FT%',
                         'STLPG':'T2 STLPG',
                         'BLKPG':'T2 BLKPG',
                         'TOVPG':'T2 TOVPG',
                         'PFPG':'T2 PFPG'}, inplace=True)
comb2324 = comb2324.drop(columns=['School_x','School_y','Year_y'])
comb2324.rename(columns={'Year_x':'Year'},inplace=True)
comb2324.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 37 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   T1 Seed   63 non-null     object 
 1   T1 Name   63 non-null     object 
 2   T1 Score  63 non-null     object 
 3   T2 Seed   63 non-null     object 
 4   T2 Name   63 non-null     object 
 5   T2 Score  63 non-null     object 
 6   Round     63 non-null     int64  
 7   Game      63 non-null     object 
 8   T1 SRS    63 non-null     object 
 9   T1 SOS    63 non-null     object 
 10  T1 PPG    63 non-null     float64
 11  T1 PAPG   63 non-null     float64
 12  T1 FG%    63 non-null     object 
 13  T1 3P%    63 non-null     object 
 14  T1 FT%    63 non-null     object 
 15  T1 ORBPG  63 non-null     float64
 16  T1 TRBPG  63 non-null     float64
 17  T1 ASTPG  63 non-null     float64
 18  T1 STLPG  63 non-null     float64
 19  T1 BLKPG  63 non-null     float64
 20  T1 TOVPG  63 non-null     float64


In [None]:
combi2324 = comb2324.iloc[:,[22,6,7,0,1,8,9,10,11,12,13,14,15,16,17,18,19,20,21,3,4,23,24,25,26,27,28,29,30,31,32,33,34,35,36,2,5]]  # rearranging.
combi2324.info()  # null check

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 37 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      63 non-null     object 
 1   Round     63 non-null     int64  
 2   Game      63 non-null     object 
 3   T1 Seed   63 non-null     object 
 4   T1 Name   63 non-null     object 
 5   T1 SRS    63 non-null     object 
 6   T1 SOS    63 non-null     object 
 7   T1 PPG    63 non-null     float64
 8   T1 PAPG   63 non-null     float64
 9   T1 FG%    63 non-null     object 
 10  T1 3P%    63 non-null     object 
 11  T1 FT%    63 non-null     object 
 12  T1 ORBPG  63 non-null     float64
 13  T1 TRBPG  63 non-null     float64
 14  T1 ASTPG  63 non-null     float64
 15  T1 STLPG  63 non-null     float64
 16  T1 BLKPG  63 non-null     float64
 17  T1 TOVPG  63 non-null     float64
 18  T1 PFPG   63 non-null     float64
 19  T2 Seed   63 non-null     object 
 20  T2 Name   63 non-null     object 


In [None]:
combi2324.set_index('Game',inplace=True)

In [None]:
col_list = combi2324.columns  ### NEEDS FIXING
for col in col_list:
    if col == 'Year' or col == 'Round':
      combi2324[col] = combi2324[col].astype(int)
    else:
      try:
        combi2324[col] = combi2324[col].astype(float)
      except:
        continue
combi2324.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63 entries, UConn (1) v. Stetson (16) to UConn (1) v. Purdue (1)
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      63 non-null     int64  
 1   Round     63 non-null     int64  
 2   T1 Seed   63 non-null     float64
 3   T1 Name   63 non-null     object 
 4   T1 SRS    63 non-null     float64
 5   T1 SOS    63 non-null     float64
 6   T1 PPG    63 non-null     float64
 7   T1 PAPG   63 non-null     float64
 8   T1 FG%    63 non-null     float64
 9   T1 3P%    63 non-null     float64
 10  T1 FT%    63 non-null     float64
 11  T1 ORBPG  63 non-null     float64
 12  T1 TRBPG  63 non-null     float64
 13  T1 ASTPG  63 non-null     float64
 14  T1 STLPG  63 non-null     float64
 15  T1 BLKPG  63 non-null     float64
 16  T1 TOVPG  63 non-null     float64
 17  T1 PFPG   63 non-null     float64
 18  T2 Seed   63 non-null     float64
 19  T2 Name   63 non-null     object 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combi2324[col] = combi2324[col].astype(int)

This is the ideal table setup we're looking for. Before iterating on the past two-ish decades of March Madness basketball, I'd like to see how a model does with a train-test split of the 2024 data alone.
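
One loose end first: the `SettingWithCopyWarning` printed above most likely appears because `combi2324` was created by selecting columns with `.iloc`, so pandas cannot tell whether later assignments are meant to reach the original frame. A minimal sketch of one common fix, taking an explicit `.copy()` before converting dtypes, on a two-row toy frame built from the first two games in the bracket table (the names `raw` and `subset` are mine, for illustration):

```python
import pandas as pd

# Scraped values arrive as strings, as in the game table above.
raw = pd.DataFrame({
    "T1 Name": ["UConn", "Florida Atlantic"],
    "T1 Seed": ["1", "8"],
    "T1 Score": ["91", "65"],
})

# An explicit copy makes `subset` an independent frame, so assigning
# to its columns is unambiguous.
subset = raw.iloc[:, [0, 1, 2]].copy()

for col in subset.columns:
    try:
        subset[col] = subset[col].astype(float)
    except ValueError:
        # Non-numeric columns such as team names are left as they are.
        pass

print(subset.dtypes)
```

The original frame keeps its string columns, while the copy ends up with numeric seeds and scores.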

## ML Test Run

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import MultiTaskLasso

X = combi2324.drop(columns=['T1 Name', 'T2 Name', 'T1 Score', 'T2 Score'])
y = combi2324[['T1 Score', 'T2 Score']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
preprocessing = Pipeline([('scaler', StandardScaler())])
lasso_pipe = Pipeline([('preprocessing', preprocessing),('lasso',MultiTaskLasso(alpha=0.1,random_state=1))])
lasso_pipe.fit(X_train,y_train)

In [None]:
predictions = lasso_pipe.predict(X_test)

In [None]:
yt_copy = y_test.copy()
yt_copy['T1 Win'] = yt_copy['T1 Score'] > yt_copy['T2 Score']
yt_copy['Prediction 1'] = [row[0] for row in predictions]
yt_copy['Prediction 2'] = [row[1] for row in predictions]
yt_copy['Predict T1 Win'] = yt_copy['Prediction 1'] > yt_copy['Prediction 2']
yt_copy['Prediction Correct?'] = yt_copy['Predict T1 Win'] == yt_copy['T1 Win']
yt_copy

Unnamed: 0_level_0,T1 Score,T2 Score,T1 Win,Prediction 1,Prediction 2,Predict T1 Win,Prediction Correct?
Game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Gonzaga (5) v. Kansas (4),89.0,68.0,True,78.705772,43.390901,True,True
James Madison (12) v. Duke (4),55.0,93.0,False,63.193435,88.406209,False,True
Dayton (7) v. Nevada (10),63.0,60.0,True,66.574662,60.674839,True,True
Purdue (1) v. Gonzaga (5),80.0,68.0,True,94.443444,84.581761,True,True
Purdue (1) v. NC State (11),63.0,50.0,True,80.886146,69.471964,True,True
San Diego State (5) v. UAB (12),69.0,65.0,True,84.074132,72.177932,True,True
Texas (7) v. Colorado State (10),56.0,44.0,True,74.898804,58.057505,True,True
Houston (1) v. Duke (4),51.0,54.0,False,69.663298,55.74264,True,False
Colorado (10) v. Marquette (2),77.0,81.0,False,53.721962,54.085654,False,True
Duke (4) v. NC State (11),64.0,76.0,False,71.890411,59.885521,True,False


With a really small sample size, the model predicts 10 of 13 outcomes correctly (about 77% accuracy). This is surprisingly good to me! But there are a few caveats:

- The model misses the notable upsets in the test set: Duke vs. NC State, Houston vs. Duke, and Kentucky vs. Oakland, with the last game's predicted scores being way off the mark.
- Even when it predicts the outcome correctly, the predicted scores are often not close to the true scores.
- This evaluation mixes all rounds of the tournament at once, which we won't be able to do in practice. When testing with more data later, we will mirror the process we would follow in March (Round 1 predictions, then Round 2, and so on).
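
For that later round-by-round evaluation, accuracy has to be tracked per round rather than pooled. A minimal sketch of the bookkeeping follows; the `records` list is made-up placeholder data, not real results, and `accuracy_by_round` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (round, prediction_correct) pairs, for illustration only.
records = [(1, True), (1, False), (1, True), (2, True), (2, False), (3, True)]


def accuracy_by_round(records):
    """Return {round: fraction of games in that round predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rnd, correct in records:
        totals[rnd] += 1
        hits[rnd] += int(correct)
    return {rnd: hits[rnd] / totals[rnd] for rnd in sorted(totals)}


by_round = accuracy_by_round(records)
print(by_round)
```

Averaging the values of this dictionary gives the round-weighted accuracy used as the second success criterion.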

What features does lasso view as important?


In [None]:
lasso_model = lasso_pipe.named_steps['lasso']
lasso_model.coef_

array([[  0.        ,   0.18968562,  -7.90138937,  -8.68511282,
          2.69186365,   4.00038837,   3.04772834,  -1.44519617,
          1.04515361,  -0.52893422,   0.16803906,  -0.19707116,
          2.8636212 ,   2.3270009 ,   2.9518573 ,  -2.20339699,
         -0.82773822,   0.6618601 ,  -5.76945669,  -0.22828046,
         -1.22715471,   1.15900413,   6.23542036,  -2.3080677 ,
          2.12410524,   6.97350114,  -2.03994218,   1.84398979,
          0.04185568,  -0.37596015,  -1.27988767,  -1.39528945],
       [  0.        ,  -1.38209025,  -4.2856943 ,  -3.38528387,
         -2.51232115,   6.44915053,   4.91302739, -11.25167354,
         -0.2054643 ,  -3.84210442,  -0.04642817,  -5.15071968,
          8.07429539,  -2.25639257,  -1.11588826,   2.68150977,
         -1.21657376,   0.31096353,   0.55280259,  -0.64612286,
          2.64145534,  -0.3531427 ,   2.368977  ,   2.21318565,
          5.34115537,   4.96337399,   2.0562435 ,  -0.5692651 ,
          3.15113207,  -1.74017287,  -3

The two arrays above raise an issue with using any linear regression method for multi-target regression here. The model fits a separate set of coefficients for each target and for each side of the table, so the same team can be predicted to score a different value simply because it is listed as Team 2 rather than Team 1. I'm not sure this is something I want in my model.
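
A tiny worked example of that asymmetry, with one feature per team and made-up coefficients (these are not the fitted values above, just numbers chosen so the arithmetic is easy to follow):

```python
# Hypothetical coefficients for a two-target linear model:
# each target has its own bias and its own weight for each side's feature.
COEF = {
    "T1 Score": {"bias": 70.0, "t1": 0.75, "t2": -0.25},
    "T2 Score": {"bias": 68.0, "t1": -0.5, "t2": 0.5},
}


def predict(t1_feature, t2_feature):
    """Return (predicted T1 score, predicted T2 score)."""
    return tuple(c["bias"] + c["t1"] * t1_feature + c["t2"] * t2_feature
                 for c in COEF.values())


team_a, team_b = 20.0, 10.0          # one feature value per team

a_listed_first = predict(team_a, team_b)   # team A's score is element 0
a_listed_second = predict(team_b, team_a)  # team A's score is element 1

print("A as Team 1:", a_listed_first[0])
print("A as Team 2:", a_listed_second[1])
```

The matchup is identical in both calls, yet team A's predicted score changes with the slot it occupies. The two would agree only if the coefficients happened to be mirror images of each other across the two targets.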

What if I tried a Random Forest Regressor?



In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_pipe = Pipeline([('preprocessing',preprocessing),('rf',RandomForestRegressor(n_estimators=100,random_state=1))])
rf_pipe.fit(X_train,y_train)

In [None]:
rf_predictions = rf_pipe.predict(X_test)
yt_rfcopy = y_test.copy()
yt_rfcopy['T1 Win'] = yt_rfcopy['T1 Score'] > yt_rfcopy['T2 Score']
yt_rfcopy['Prediction 1'] = [row[0] for row in rf_predictions]
yt_rfcopy['Prediction 2'] = [row[1] for row in rf_predictions]
yt_rfcopy['Predict T1 Win'] = yt_rfcopy['Prediction 1'] > yt_rfcopy['Prediction 2']
yt_rfcopy['Prediction Correct?'] = yt_rfcopy['Predict T1 Win'] == yt_rfcopy['T1 Win']
yt_rfcopy

Unnamed: 0_level_0,T1 Score,T2 Score,T1 Win,Prediction 1,Prediction 2,Predict T1 Win,Prediction Correct?
Game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Gonzaga (5) v. Kansas (4),89.0,68.0,True,78.32,70.81,True,True
James Madison (12) v. Duke (4),55.0,93.0,False,69.87,72.66,False,True
Dayton (7) v. Nevada (10),63.0,60.0,True,68.97,70.83,False,False
Purdue (1) v. Gonzaga (5),80.0,68.0,True,85.82,65.47,True,True
Purdue (1) v. NC State (11),63.0,50.0,True,81.59,63.12,True,True
San Diego State (5) v. UAB (12),69.0,65.0,True,75.65,72.0,True,True
Texas (7) v. Colorado State (10),56.0,44.0,True,67.31,68.21,False,False
Houston (1) v. Duke (4),51.0,54.0,False,77.37,63.53,True,False
Colorado (10) v. Marquette (2),77.0,81.0,False,70.04,70.05,False,True
Duke (4) v. NC State (11),64.0,76.0,False,75.93,67.79,True,False


In [None]:
rf_model = rf_pipe.named_steps['rf']
rf_model.feature_importances_


array([0.        , 0.00737733, 0.06162588, 0.1074935 , 0.032775  ,
       0.04439269, 0.0854742 , 0.02568427, 0.01468159, 0.00791005,
       0.01937261, 0.01875131, 0.02730401, 0.02166131, 0.01376976,
       0.03424782, 0.03141211, 0.01594199, 0.04126105, 0.03322993,
       0.01221171, 0.10052826, 0.04371441, 0.02528439, 0.01131888,
       0.0114804 , 0.03479599, 0.04133326, 0.01109922, 0.02928155,
       0.01363706, 0.02094847])

Random Forest does slightly worse, though with so little data the comparison is not very meaningful. I think it's time to expand the data we are training on. We'll wrap the code from earlier in functions so it can be run across many Sports Reference pages.

## Web functions

In [None]:
def get_season_stats(url_name):  # only works with Sports Reference!!
  test = requests.get(url_name)

  html_content = test.text  # setting up the html
  soup = BeautifulSoup(html_content, 'html.parser')
  pretty_html = soup.prettify()

  headers = soup.find_all('th')  # establishing headers
  columnheaders = []
  for i in headers:
   columnheaders.append(i.text)
  columnheaders = columnheaders[13:50]
  new_season_df = pd.DataFrame(columns = columnheaders)  # setting table based on webpage table columns

  rows = soup.find_all('tr')  # inputting the seasonal team data
  for row in rows:
    cells = row.find_all('td')
    if cells == []:
      continue
    columndata = [col.text.strip() for col in cells]
    if "NCAA" in columndata[0]:
      new_season_df.loc[len(new_season_df)]=columndata  # extracting everyone who made it into the NCAA tournament.
  new_season_df.drop(columns=['W','L'], inplace=True)  # standardizing data in ratios for model's sake
  columns_to_drop = new_season_df.columns[5:9]
  new_season_df.drop(columns=columns_to_drop,inplace=True)
  new_season_df.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)
  new_season_df = new_season_df.replace('', np.nan)
  new_season_df.iloc[:,1:] = new_season_df.iloc[:,1:].astype(float)
  column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
  for col in column_ops:
    new_season_df[col] = new_season_df[col].astype(float)
    new_season_df[col] = new_season_df[col]/(new_season_df['G'].astype(float))  # setting up per game ratios

  new_season_df.rename(columns={'Tm.':'PPG',
                               'Opp.':'PAPG',
                               'ORB':'ORBPG',
                               'TRB':'TRBPG',
                               'AST':'ASTPG',
                               'STL':'STLPG',
                               'BLK':'BLKPG',
                               'TOV':'TOVPG',
                               'PF':'PFPG'}, inplace=True)
  new_season_df.drop(columns=['G','W-L%'],inplace=True)

  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('NCAA$', '', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('Brigham Young$', 'BYU', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('Connecticut$', 'UConn', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('North Carolina$', 'UNC', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Mary's (CA)", "Saint Mary's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Peter's", "St. Peter's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Louisiana State", "LSU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Southern California", "USC", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Southern Methodist", "SMU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Mississippi", "Ole Miss", regex=False)  #how to adjust?
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Ole Miss State", "Mississippi State", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Ole Miss Valley State", "Mississippi Valley State", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Southern Ole Miss", "Southern Miss", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Pittsburgh", "Pitt", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Pennsylvania", "Penn", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Central Connecticut State", "Central Connecticut", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Florida International", "FIU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Nevada-Las Vegas","UNLV", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("UC Santa Barbara", "UCSB", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Virginia Commonwealth", "VCU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Joseph's", "St. Joseph's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Maryland-Baltimore County","UMBC", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Illinois-Chicago", "UIC", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Massachusetts", "UMass", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("East Tennessee State", "ETSU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("IU Indy", "IUPUI", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("UC Irvine", "UC-Irvine", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("UC Davis", "UC-Davis", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Long Island University", "LIU", regex=False)

  new_season_df['Year'] = url_name[49:53]
  for col in new_season_df.columns:
    if col == 'Year':
      new_season_df[col] = new_season_df[col].astype(int)
    else:
      try:
        new_season_df[col] = new_season_df[col].astype(float)
      except (ValueError, TypeError):
        continue  # leave non-numeric columns such as School untouched
  print(f"No errors with {url_name[49:53]}")
  return new_season_df
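
The chain of `str.replace` calls above could also be driven by a single mapping, which keeps the alias list in one place. A minimal sketch, assuming a hypothetical helper `normalize_school` and showing only a handful of the aliases from the function above. It works on plain strings rather than on the `School` column:

```python
# Hypothetical helper, not part of the notebook: applies the same literal
# substring replacements as the chained str.replace calls, in insertion order.
SCHOOL_ALIASES = {
    "Pittsburgh": "Pitt",
    "Pennsylvania": "Penn",
    "Virginia Commonwealth": "VCU",
    "Massachusetts": "UMass",
    "UC Irvine": "UC-Irvine",
}

def normalize_school(name, aliases=SCHOOL_ALIASES):
    """Strip whitespace, then apply each alias as a literal substring replacement."""
    name = name.strip()
    for old, new in aliases.items():
        name = name.replace(old, new)
    return name

print(normalize_school("  Pittsburgh "))
print(normalize_school("Virginia Commonwealth"))
```

Because the replacements are substring-based, order can matter when one alias contains another. On a DataFrame column, the same loop would call `.str.replace(old, new, regex=False)` for each pair.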


Whew. Let's see if this works.

In [None]:
x = get_season_stats('https://www.sports-reference.com/cbb/seasons/men/2000-school-stats.html')


No errors with 2000


Nice! Now it's time to iterate and retrieve season data for every year from 2000 through 2024.

In [None]:
team_data = pd.DataFrame(columns=['School','SRS','SOS','PPG','PAPG','FG%','3P%','FT%','ORBPG','TRBPG','ASTPG','STLPG','BLKPG','TOVPG','PFPG','Year'])
for i in range(0,25):
  # {i:02d} zero-pads single-digit years, so 0 gives "2000" and 9 gives "2009"
  x = get_season_stats(f'https://www.sports-reference.com/cbb/seasons/men/20{i:02d}-school-stats.html')
  team_data = pd.concat([team_data,x])
team_data.head()

No errors with 2000


No errors with 2001
No errors with 2002
No errors with 2003
No errors with 2004
No errors with 2005
No errors with 2006
No errors with 2007
No errors with 2008
No errors with 2009
No errors with 2010
No errors with 2011
No errors with 2012
No errors with 2013
No errors with 2014
No errors with 2015
No errors with 2016
No errors with 2017
No errors with 2018
No errors with 2019
No errors with 2020
No errors with 2021
No errors with 2022
No errors with 2023
No errors with 2024


| | School | SRS | SOS | PPG | PAPG | FG% | 3P% | FT% | ORBPG | TRBPG | ASTPG | STLPG | BLKPG | TOVPG | PFPG | Year |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | Appalachian State | 2.49 | -3.75 | 79.0625 | 69.625 | 0.486 | 0.388 | 0.709 | 11.6875 | 35.96875 | 16.6875 | 10.15625 | 3.96875 | 15.8125 | 19.59375 | 2000 |
| 1 | Arizona | 18.96 | 9.7 | 76.441176 | 67.176471 | 0.457 | 0.322 | 0.73 | 12.117647 | 38.411765 | 15.794118 | 7.5 | 5.647059 | 15.0 | 14.588235 | 2000 |
| 2 | Arkansas | 12.37 | 7.76 | 74.382353 | 69.764706 | 0.429 | 0.351 | 0.603 | 13.264706 | 34.088235 | 13.470588 | 11.147059 | 3.529412 | 14.529412 | 20.735294 | 2000 |
| 3 | Auburn | 13.59 | 8.2 | 71.264706 | 64.352941 | 0.413 | 0.327 | 0.655 | 15.676471 | 39.264706 | 12.882353 | 7.264706 | 3.235294 | 13.235294 | 17.117647 | 2000 |
| 4 | Ball State | 7.84 | 3.17 | 74.193548 | 68.967742 | 0.453 | 0.4 | 0.612 | 12.83871 | 36.709677 | 13.806452 | 8.322581 | 4.225806 | 13.548387 | 17.096774 | 2000 |
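
A possible alternative to growing `team_data` inside the loop is to build the season URLs first, collect the per-season frames in a list, and concatenate once at the end. This is only a sketch: the helper names `season_urls` and `collect_seasons` are hypothetical, and the `fetch` argument is stubbed with a fake function so the example runs offline. In the notebook, `fetch` would be `get_season_stats`.

```python
import pandas as pd

BASE = "https://www.sports-reference.com/cbb/seasons/men/{year}-school-stats.html"

def season_urls(first_year, last_year):
    """Return one stats URL per season, inclusive of both end years."""
    return [BASE.format(year=year) for year in range(first_year, last_year + 1)]

def collect_seasons(urls, fetch):
    """Call fetch(url) for every URL and concatenate the resulting frames once."""
    frames = [fetch(url) for url in urls]
    return pd.concat(frames, ignore_index=True)

# Offline stand-in for get_season_stats: returns a one-row frame per URL,
# reading the year from the same character positions the notebook uses.
def fake_fetch(url):
    return pd.DataFrame({"School": ["Example U"], "Year": [int(url[49:53])]})

urls = season_urls(2000, 2024)
team_data_sketch = collect_seasons(urls, fake_fetch)
print(len(urls), team_data_sketch["Year"].min(), team_data_sketch["Year"].max())
```

When scraping for real, a pause between requests (as in the game-data loop later on) would still be needed.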


In [None]:
team_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1598 entries, 0 to 67
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   School  1598 non-null   object 
 1   SRS     1598 non-null   float64
 2   SOS     1598 non-null   float64
 3   PPG     1598 non-null   float64
 4   PAPG    1598 non-null   float64
 5   FG%     1598 non-null   float64
 6   3P%     1598 non-null   float64
 7   FT%     1598 non-null   float64
 8   ORBPG   1595 non-null   float64
 9   TRBPG   1598 non-null   float64
 10  ASTPG   1598 non-null   float64
 11  STLPG   1598 non-null   float64
 12  BLKPG   1598 non-null   float64
 13  TOVPG   1597 non-null   float64
 14  PFPG    1598 non-null   float64
 15  Year    1598 non-null   object 
dtypes: float64(14), object(2)
memory usage: 276.8+ KB


In [None]:
team_data["Year"] = team_data["Year"].astype(int)
team_data["Year"].unique()

array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
       2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022,
       2023, 2024])

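Yay! There are a few null values (in ORBPG and TOVPG), but nothing that can't be handled. Note that 2020 is missing from the years above, which is expected because that tournament was canceled. Now to do the same thing for the game data.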

In [None]:
tourney_df = pd.DataFrame(columns=['T1 Seed','T1 Name','T1 Score','T2 Seed','T2 Name','T2 Score','Round','Year'])
def get_game_data(url_name):
  """Scrape one tournament bracket page and append its games to tourney_df."""
  response = requests.get(url_name)
  html_content = response.text
  soup = BeautifulSoup(html_content, 'html.parser')
  comments = soup.find_all(string=lambda text: isinstance(text, Comment))
  i = 0
  j = 1
  for comment in comments:
    if "game" in comment:  # the html had <--game--> comments wherever they placed bracket games.
        next_element = comment.find_next_sibling()
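        try:
          seed1 = next_element.find('span').text.strip()
        except AttributeError:  # no seed found: stop scanning this page
          break
        name1 = next_element.find('a', href=True).text.strip()
        try:
          score1 = next_element.find_all('a',href=True)[1].text.strip()
        except IndexError:  # no score link for this game: skip it
          continue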
        third_element = next_element.find_next_sibling()
        seed2 = third_element.find('span').text.strip()
        name2 = third_element.find('a', href=True).text.strip()
        score2 = third_element.find_all('a',href=True)[1].text.strip()
        i += 1

        if j < 5:
            if i <= 8:
              round = 1
            elif i <= 12:
              round = 2
            elif i <= 14:
              round = 3
            elif i == 15:
              round = 4
              j += 1  # move to the next quadrant
              i = 0
        elif j == 5:
            if i <= 2:
                round = 5
            elif i == 3:
                round = 6

            # Append data to DataFrame
        tourney_df.loc[len(tourney_df)] = [seed1, name1, score1, seed2, name2, score2, round, url_name[52:56]]
        if round == 6:
          break
  print(f"No issues with {url_name[52:56]}")
  return tourney_df
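
The `i`/`j` counters in `get_game_data` rely on each region's 15 games appearing in round order: eight first-round games, then four, two, and one. That mapping can be sketched on its own as a small function, which makes it easy to check without scraping anything. The name `region_round` is hypothetical, and the sketch assumes the same page ordering the function above depends on:

```python
def region_round(game_number):
    """Map the 1-based position of a game within a region (1..15) to its round."""
    if not 1 <= game_number <= 15:
        raise ValueError("a region has exactly 15 games")
    if game_number <= 8:
        return 1   # Round of 64
    if game_number <= 12:
        return 2   # Round of 32
    if game_number <= 14:
        return 3   # Sweet Sixteen
    return 4       # Elite Eight (regional final)

rounds = [region_round(n) for n in range(1, 16)]
print(rounds)
```

A game that is skipped shifts every later position in that region by one, which is how a single missing game can leave several later games labeled with the wrong round.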

In [None]:
import time
for i in range(0,25):
  if i == 20:
    continue  # no 2020 bracket: the tournament was canceled
  game_data = get_game_data(f'https://www.sports-reference.com/cbb/postseason/men/20{i:02d}-ncaa.html')
  game_data.drop_duplicates(["T1 Name", "T2 Name","Year"],ignore_index=True,inplace = True)
  time.sleep(15)  # to prevent a 429 response. don't want to overwhelm the website

No issues with 2000
No issues with 2001
No issues with 2002
No issues with 2003
No issues with 2004
No issues with 2005
No issues with 2006
No issues with 2007
No issues with 2008
No issues with 2009
No issues with 2010
No issues with 2011
No issues with 2012
No issues with 2013
No issues with 2014
No issues with 2015
No issues with 2016
No issues with 2017
No issues with 2018
No issues with 2019
No issues with 2021
No issues with 2022
No issues with 2023
No issues with 2024


In [None]:
game_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1511 entries, 0 to 1510
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   T1 Seed   1511 non-null   object
 1   T1 Name   1511 non-null   object
 2   T1 Score  1511 non-null   object
 3   T2 Seed   1511 non-null   object
 4   T2 Name   1511 non-null   object
 5   T2 Score  1511 non-null   object
 6   Round     1511 non-null   int64 
 7   Year      1511 non-null   object
dtypes: int64(1), object(7)
memory usage: 94.6+ KB


Looks like the function worked! One small problem: assuming 63 games every year for 24 years, there should be 1512 rows. The `info()` output above lists 1511 (indices 0 through 1510).

In 2021, VCU forfeited a game due to COVID, so that makes a few 2021 games in the dataframe receive the incorrect round. I'll need to address that.

Sports-Reference does not yet have the UConn-Purdue final score inputted from last year, so I'll input that manually.
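
One way to locate a missing game is to count rows per year and flag any year that does not have 63. A small sketch using made-up year labels (the toy list and the name `years_off_count` are assumptions for illustration; in the notebook the input would be `game_data["Year"]`):

```python
from collections import Counter

def years_off_count(years, expected=63):
    """Return {year: game count} for every year whose count differs from expected."""
    counts = Counter(years)
    return {year: n for year, n in sorted(counts.items()) if n != expected}

# Toy data: 2019 has the full 63 games, 2021 is one short.
toy_years = [2019] * 63 + [2021] * 62
print(years_off_count(toy_years))
```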

In [None]:
game_data.iloc[1312,6] = 2  # index of game w/ wrong round-- associating with proper round
game_data.iloc[1316,6] = 3
game_data.iloc[1318,6] = 4
game_data.iloc[1319,6] = 5
game_data.iloc[1321,6] = 6  # true championship game

In [None]:
#game_data.loc[len(game_data)] = [1,'UConn',75,1,'Purdue',60,6,2024]  # adding 2k24 finals

Now to organize the game data a little bit more before merging.

In [None]:
game_data["Game"] = game_data["T1 Name"] + f' ('+game_data["T1 Seed"].astype(str)+')' + ' v. ' + game_data["T2 Name"] + ' ('+game_data["T2 Seed"].astype(str)+')' + ', ' + game_data["Year"].astype(str)
game_data.set_index('Game', inplace=True)  # so i can identify the matches.
team_data["Year"] = team_data["Year"].astype(int)
game_data["Year"] = game_data["Year"].astype(int)  # to ensure that columns we are merging on are identical.
game_data["T1 Score"] = game_data["T1 Score"].astype(int)
game_data["T2 Score"] = game_data["T2 Score"].astype(int)
game_data["T1 Seed"] = game_data["T1 Seed"].astype(int)
game_data["T2 Seed"] = game_data["T2 Seed"].astype(int)

In [None]:
game_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1511 entries, Duke (1) v. Lamar (16), 2000 to UConn (1) v. Purdue (1), 2024
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   T1 Seed   1511 non-null   int64 
 1   T1 Name   1511 non-null   object
 2   T1 Score  1511 non-null   int64 
 3   T2 Seed   1511 non-null   int64 
 4   T2 Name   1511 non-null   object
 5   T2 Score  1511 non-null   int64 
 6   Round     1511 non-null   int64 
 7   Year      1511 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 106.2+ KB


In [None]:
team_data['School'] = team_data['School'].str.strip()
game_data['T1 Name'] = game_data['T1 Name'].str.strip()
game_data['T2 Name'] = game_data['T2 Name'].str.strip()
ncaa2k = pd.merge(game_data, team_data,left_on=['T1 Name','Year'],right_on=['School','Year'],how='left')
ncaa2k.rename(columns={'SRS':'T1 SRS',
                         'SOS':'T1 SOS',
                         'PPG':'T1 PPG',
                         'PAPG':'T1 PAPG',
                         'ORBPG':'T1 ORBPG',
                         'TRBPG':'T1 TRBPG',
                         'ASTPG':'T1 ASTPG',
                         'FG%':'T1 FG%',
                         '3P%':'T1 3P%',
                         'FT%':'T1 FT%',
                         'STLPG':'T1 STLPG',
                         'BLKPG':'T1 BLKPG',
                         'TOVPG':'T1 TOVPG',
                         'PFPG':'T1 PFPG'},inplace=True)
ncaa2k = pd.merge(ncaa2k, team_data, left_on=['T2 Name','Year'],right_on=['School','Year'],how='left')
ncaa2k.rename(columns={'SRS':'T2 SRS',
                         'SOS':'T2 SOS',
                         'PPG':'T2 PPG',
                         'PAPG':'T2 PAPG',
                         'ORBPG':'T2 ORBPG',
                         'TRBPG':'T2 TRBPG',
                         'ASTPG':'T2 ASTPG',
                         'FG%':'T2 FG%',
                         '3P%':'T2 3P%',
                         'FT%':'T2 FT%',
                         'STLPG':'T2 STLPG',
                         'BLKPG':'T2 BLKPG',
                         'TOVPG':'T2 TOVPG',
                         'PFPG':'T2 PFPG'}, inplace=True)
ncaa2k = ncaa2k.drop(columns=['School_x','School_y'])
ncaa2k.rename(columns={'Year_x':'Year'},inplace=True)
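
The two merges and rename dictionaries above could be condensed by prefixing the stat columns before each merge. A sketch on toy data, assuming the same column layout as `team_data` and `game_data`. `attach_team_stats` is a hypothetical helper, only two stat columns are shown, and the values are made up:

```python
import pandas as pd

def attach_team_stats(games, teams, side):
    """Left-merge season stats onto games for one side ('T1' or 'T2')."""
    # Prefix every stat column, e.g. SRS -> 'T1 SRS', but keep the join keys as-is.
    stats = teams.rename(
        columns=lambda c: c if c in ("School", "Year") else f"{side} {c}")
    merged = games.merge(stats, left_on=[f"{side} Name", "Year"],
                         right_on=["School", "Year"], how="left")
    return merged.drop(columns="School")

teams = pd.DataFrame({"School": ["Duke", "Lamar"], "Year": [2000, 2000],
                      "SRS": [25.0, -5.0], "SOS": [10.0, -4.0]})
games = pd.DataFrame({"T1 Name": ["Duke"], "T2 Name": ["Lamar"], "Year": [2000]})

out = attach_team_stats(attach_team_stats(games, teams, "T1"), teams, "T2")
print(out.columns.tolist())
```

Checking `out.isna().sum()` after the merge is a quick way to spot school names that failed to match between the two sources.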

In [None]:
ncaa2k.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1511 entries, 0 to 1510
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   T1 Seed   1511 non-null   int64  
 1   T1 Name   1511 non-null   object 
 2   T1 Score  1511 non-null   int64  
 3   T2 Seed   1511 non-null   int64  
 4   T2 Name   1511 non-null   object 
 5   T2 Score  1511 non-null   int64  
 6   Round     1511 non-null   int64  
 7   Year      1511 non-null   int64  
 8   T1 SRS    1511 non-null   float64
 9   T1 SOS    1511 non-null   float64
 10  T1 PPG    1511 non-null   float64
 11  T1 PAPG   1511 non-null   float64
 12  T1 FG%    1511 non-null   float64
 13  T1 3P%    1511 non-null   float64
 14  T1 FT%    1511 non-null   float64
 15  T1 ORBPG  1510 non-null   float64
 16  T1 TRBPG  1511 non-null   float64
 17  T1 ASTPG  1511 non-null   float64
 18  T1 STLPG  1511 non-null   float64
 19  T1 BLKPG  1511 non-null   float64
 20  T1 TOVPG  1511 non-null   floa

In [None]:
ncaa2k.to_csv('ncaa2k.csv')

In [None]:
ncaa2k["Game"] = ncaa2k["T1 Name"] + f' ('+ncaa2k["T1 Seed"].astype(str)+')' + ' v. ' + ncaa2k["T2 Name"] + ' ('+ncaa2k["T2 Seed"].astype(str)+')'+', ' + ncaa2k["Year"].astype(str)
ncaa2k.set_index('Game',inplace=True)

# Machine Learning Applied

Now it's time to apply machine learning! I will first train and test the conventional way, with a random split, before switching to the approach needed for applying the model in a bracket format. In each case, I will omit the most recent year, because I want to eventually use it as a validation set.

### Regression Test

In [None]:
ncaa2k23 = ncaa2k[ncaa2k['Year'] != 2024]
ncaa2k23.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1448 entries, Duke (1) v. Lamar (16), 2000 to San Diego State (5) v. UConn (4), 2023
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   T1 Seed   1448 non-null   int64  
 1   T1 Name   1448 non-null   object 
 2   T1 Score  1448 non-null   int64  
 3   T2 Seed   1448 non-null   int64  
 4   T2 Name   1448 non-null   object 
 5   T2 Score  1448 non-null   int64  
 6   Round     1448 non-null   int64  
 7   Year      1448 non-null   int64  
 8   T1 SRS    1448 non-null   float64
 9   T1 SOS    1448 non-null   float64
 10  T1 PPG    1448 non-null   float64
 11  T1 PAPG   1448 non-null   float64
 12  T1 FG%    1448 non-null   float64
 13  T1 3P%    1448 non-null   float64
 14  T1 FT%    1448 non-null   float64
 15  T1 ORBPG  1447 non-null   float64
 16  T1 TRBPG  1448 non-null   float64
 17  T1 ASTPG  1448 non-null   float64
 18  T1 STLPG  1448 non-null   float64
 19  T1 BLKPG  144

First, I build a pipeline that scales the data and fills in the few missing values.

In [None]:
from sklearn.impute import SimpleImputer

X = ncaa2k23.drop(columns=['T1 Name', 'T2 Name', 'T1 Score', 'T2 Score'])
y = ncaa2k23[['T1 Score', 'T2 Score']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

real_pipe = Pipeline([('scaler', StandardScaler()),('impute', SimpleImputer(strategy='mean'))])
X_train_prepd = real_pipe.fit_transform(X_train)
X_test_prepd = real_pipe.transform(X_test)  # reuse the training-set statistics rather than refitting on test data
rf_pipe2 = Pipeline([('preprocessing',real_pipe),('rf',RandomForestRegressor(n_estimators=100,random_state=1))])
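
When preprocessing is applied outside a model pipeline, the usual convention is to fit the scaler and imputer on the training split only and reuse those fitted statistics on the test split. A minimal sketch on synthetic data (the array values are made up, and the step names simply mirror the pipeline above):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=70.0, scale=5.0, size=(100, 3))
X_test_demo = rng.normal(loc=70.0, scale=5.0, size=(20, 3))
X_test_demo[0, 1] = np.nan  # one missing value, similar to the gaps in ORBPG

prep = Pipeline([("scaler", StandardScaler()),
                 ("impute", SimpleImputer(strategy="mean"))])

train_prepd = prep.fit_transform(X_train_demo)  # learn mean/std from training rows
test_prepd = prep.transform(X_test_demo)        # reuse them; no refit on test rows

print(train_prepd.mean(axis=0).round(6))
print(np.isnan(test_prepd).any())
```

Wrapping the preprocessing and the model in one `Pipeline`, as `rf_pipe2` does, gives the same guarantee automatically when `fit` and `predict` are called on the raw splits.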

In [None]:
rf_pipe2.fit(X_train,y_train)

In [None]:
predictions = rf_pipe2.predict(X_test)

The predictions are in. Let's see how the model does.

In [None]:
y_test

Game | T1 Score | T2 Score
----|----|----
VCU (8) v. UCF (9), 2019 | 58 | 73
Kansas State (11) v. Wisconsin (3), 2008 | 55 | 72
Florida (7) v. Virginia Tech (10), 2021 | 75 | 70
Michigan State (2) v. Robert Morris (15), 2009 | 77 | 62
Cincinnati (2) v. UNC Wilmington (15), 2000 | 64 | 47
... | ... | ...
Indiana (5) v. Chattanooga (12), 2016 | 99 | 74
Pitt (3) v. Oklahoma State (2), 2004 | 51 | 63
Kansas State (3) v. Montana State (14), 2023 | 77 | 65
Georgia (3) v. Murray State (14), 2002 | 85 | 68


In [None]:
test_scores = y_test.copy()
test_scores["T1 Win"] = test_scores["T1 Score"] > test_scores["T2 Score"]
test_scores["Prediction 1"] = [row[0] for row in predictions]
test_scores["Prediction 2"] = [row[1] for row in predictions]
test_scores["Predict T1 Win"] = test_scores["Prediction 1"] > test_scores["Prediction 2"]
test_scores["Prediction Correct?"] = test_scores["Predict T1 Win"] == test_scores["T1 Win"]
test_scores

Game | T1 Score | T2 Score | T1 Win | Prediction 1 | Prediction 2 | Predict T1 Win | Prediction Correct?
----|----|----|----|----|----|----|----
VCU (8) v. UCF (9), 2019 | 58 | 73 | False | 68.11 | 66.58 | True | False
Kansas State (11) v. Wisconsin (3), 2008 | 55 | 72 | False | 65.48 | 67.25 | False | True
Florida (7) v. Virginia Tech (10), 2021 | 75 | 70 | True | 69.50 | 69.89 | False | False
Michigan State (2) v. Robert Morris (15), 2009 | 77 | 62 | True | 74.83 | 56.10 | True | True
Cincinnati (2) v. UNC Wilmington (15), 2000 | 64 | 47 | True | 69.73 | 55.24 | True | True
... | ... | ... | ... | ... | ... | ... | ...
Indiana (5) v. Chattanooga (12), 2016 | 99 | 74 | True | 77.03 | 67.49 | True | True
Pitt (3) v. Oklahoma State (2), 2004 | 51 | 63 | False | 61.79 | 65.81 | False | True
Kansas State (3) v. Montana State (14), 2023 | 77 | 65 | True | 73.04 | 62.93 | True | True
Georgia (3) v. Murray State (14), 2002 | 85 | 68 | True | 81.19 | 69.48 | True | True


In [None]:
test_scores["Prediction Correct?"].value_counts()

Prediction Correct? | count
----|----
True | 209
False | 81


This is an overall accuracy of 72% (209 of 290 test games), which is pretty good but could use improvement. I now turn my attention to tuning the hyperparameters.
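
Overall accuracy can hide weak later rounds, which is why the success criteria at the top also average accuracy across rounds. A sketch of both metrics computed from per-round tallies. The tallies below are last year's hand-picked bracket from the introduction, and `summarize` is a hypothetical helper:

```python
def summarize(correct_by_round, games_by_round):
    """Return (overall accuracy, accuracy averaged equally across rounds)."""
    overall = sum(correct_by_round) / sum(games_by_round)
    per_round = [c / g for c, g in zip(correct_by_round, games_by_round)]
    return overall, sum(per_round) / len(per_round)

correct = [24, 11, 3, 2, 2, 0]   # correct picks in rounds 1 through 6
played = [32, 16, 8, 4, 2, 1]    # games in each round
overall, round_avg = summarize(correct, played)
print(f"overall: {overall:.3f}, round average: {round_avg:.3f}")
```

Applying the same function to the model's test predictions would require grouping the `Prediction Correct?` column by `Round` first.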

In [None]:
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'max_depth': np.arange(1, 20),
    'n_estimators': np.arange(100, 500, 100),
    'min_samples_leaf': np.arange(1, 20)
}

random_search = RandomizedSearchCV(RandomForestRegressor(),param_dist,n_iter=10,cv=2,random_state=1,n_jobs=-1)
random_search.fit(X_train_prepd,y_train)

In [None]:
opt_pred = random_search.best_estimator_.predict(X_test_prepd)

sec_test_scores = y_test.copy()
sec_test_scores["T1 Win"] = sec_test_scores["T1 Score"] > sec_test_scores["T2 Score"]
sec_test_scores["Prediction 1"] = [row[0] for row in opt_pred]
sec_test_scores["Prediction 2"] = [row[1] for row in opt_pred]
sec_test_scores["Predict T1 Win"] = sec_test_scores["Prediction 1"] > sec_test_scores["Prediction 2"]
sec_test_scores["Prediction Correct?"] = sec_test_scores["Predict T1 Win"] == sec_test_scores["T1 Win"]
sec_test_scores

Game | T1 Score | T2 Score | T1 Win | Prediction 1 | Prediction 2 | Predict T1 Win | Prediction Correct?
----|----|----|----|----|----|----|----
VCU (8) v. UCF (9), 2019 | 58 | 73 | False | 66.564332 | 67.618109 | False | True
Kansas State (11) v. Wisconsin (3), 2008 | 55 | 72 | False | 66.624620 | 67.312647 | False | True
Florida (7) v. Virginia Tech (10), 2021 | 75 | 70 | True | 69.273204 | 68.915836 | True | True
Michigan State (2) v. Robert Morris (15), 2009 | 77 | 62 | True | 75.532696 | 56.569820 | True | True
Cincinnati (2) v. UNC Wilmington (15), 2000 | 64 | 47 | True | 69.614880 | 57.523422 | True | True
... | ... | ... | ... | ... | ... | ... | ...
Indiana (5) v. Chattanooga (12), 2016 | 99 | 74 | True | 76.489558 | 67.394710 | True | True
Pitt (3) v. Oklahoma State (2), 2004 | 51 | 63 | False | 62.174079 | 66.208572 | False | True
Kansas State (3) v. Montana State (14), 2023 | 77 | 65 | True | 73.282230 | 61.575429 | True | True
Georgia (3) v. Murray State (14), 2002 | 85 | 68 | True | 83.875114 | 68.583317 | True | True


In [None]:
sec_test_scores["Prediction Correct?"].value_counts()

Prediction Correct? | count
----|----
True | 210
False | 80


Simple hyperparameter tuning yields only a marginal improvement: 210 correct picks instead of 209, still about 72%. Earlier in the notebook, I mentioned two options: predicting scores, or simply predicting outcomes. Now, I'll adjust the dataframe to see if simply predicting ones or zeroes helps my model.


### Classification

In [None]:
ncaa2kclass = ncaa2k.copy()

In [None]:
ncaa2kclass["T1 Win"] = (ncaa2kclass["T1 Score"] > ncaa2kclass["T2 Score"]).astype(int)  # 1 if Team 1 won, else 0
ncaa2kclass["T1 Win"]


Game | T1 Win
----|----
Duke (1) v. Lamar (16), 2000 | 1
Kansas (8) v. DePaul (9), 2000 | 1
Florida (5) v. Butler (12), 2000 | 1
Illinois (4) v. Penn (13), 2000 | 1
Indiana (6) v. Pepperdine (11), 2000 | 0
... | ...
Clemson (6) v. Arizona (2), 2024 | 1
Alabama (4) v. Clemson (6), 2024 | 1
UConn (1) v. Alabama (4), 2024 | 1
Purdue (1) v. NC State (11), 2024 | 1


In [None]:
ncaa2kclass.drop(columns=["T1 Score","T2 Score"],inplace=True)
ncaa2k23class = ncaa2kclass[ncaa2kclass["Year"] != 2024]
ncaa2k23class.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1448 entries, Duke (1) v. Lamar (16), 2000 to San Diego State (5) v. UConn (4), 2023
Data columns (total 35 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   T1 Seed   1448 non-null   int64  
 1   T1 Name   1448 non-null   object 
 2   T2 Seed   1448 non-null   int64  
 3   T2 Name   1448 non-null   object 
 4   Round     1448 non-null   int64  
 5   Year      1448 non-null   int64  
 6   T1 SRS    1448 non-null   float64
 7   T1 SOS    1448 non-null   float64
 8   T1 PPG    1448 non-null   float64
 9   T1 PAPG   1448 non-null   float64
 10  T1 FG%    1448 non-null   float64
 11  T1 3P%    1448 non-null   float64
 12  T1 FT%    1448 non-null   float64
 13  T1 ORBPG  1447 non-null   float64
 14  T1 TRBPG  1448 non-null   float64
 15  T1 ASTPG  1448 non-null   float64
 16  T1 STLPG  1448 non-null   float64
 17  T1 BLKPG  1448 non-null   float64
 18  T1 TOVPG  1448 non-null   float64
 19  T1 PFPG   144

In [None]:
from sklearn.ensemble import RandomForestClassifier
X = ncaa2k23class.drop(columns=["T1 Name", "T2 Name","T1 Win"])
y = ncaa2k23class["T1 Win"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1)
preprocessing = Pipeline([('scale', StandardScaler()),('impute',SimpleImputer(strategy='mean'))])
rf_pipe = Pipeline([('preprocessing', preprocessing),('rf',RandomForestClassifier(n_estimators=100,
                                                                              random_state=1))])
rf_pipe.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, rf_pipe.predict(X_test))


0.7241379310344828

72% again, so the type of prediction does not greatly affect the accuracy of my model's predictions.

I'm uneasy about leaving the model at 72%. While it may seem good, that figure comes from predicting a random batch of games, regardless of round. In March, we must first predict all of the round 1 games, then round 2, and so on.

This means we can expect only about 72% accuracy in the first round, and errors carry forward. If the model misses a first-round game and then picks that eliminated team to win in the second round, it is predicting an outcome that can no longer happen.

I'd like a stronger initial success rate to "weather the storm" through the first round.
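
To put a rough number on that concern: if each pick were correct with probability 0.72 independently of the others (a simplifying assumption, since real games are neither independent nor equally hard to call), the chance that a team picked to win in a given round has actually survived all earlier rounds shrinks geometrically. `still_alive_probability` is a hypothetical name used only for this sketch:

```python
def still_alive_probability(per_game_accuracy, round_number):
    """Chance that every earlier-round pick on one team's path was correct."""
    return per_game_accuracy ** (round_number - 1)

for rnd in range(1, 7):
    p = still_alive_probability(0.72, rnd)
    print(f"Round {rnd}: {p:.3f}")
```

Under these assumptions, even a few extra points of per-game accuracy compound noticeably by the later rounds.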

What features does the model use most for its predictions?

## RF Feature Importance

In [None]:
rf_model = rf_pipe.named_steps['rf']
importances = rf_model.feature_importances_

feature_names = X_train.columns
feature_importances = list(zip(feature_names, importances))
sorted_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

for feature, importance in sorted_importances:
    print(f"{feature}: {importance:.4f}")


T1 SRS: 0.0819
T2 SRS: 0.0818
T2 SOS: 0.0607
T1 Seed: 0.0436
T1 SOS: 0.0376
T2 Seed: 0.0375
T1 STLPG: 0.0327
T2 PAPG: 0.0297
T2 TOVPG: 0.0281
T1 PPG: 0.0280
T2 PFPG: 0.0279
T2 STLPG: 0.0278
T2 BLKPG: 0.0276
T1 3P%: 0.0275
T1 PFPG: 0.0275
T1 TOVPG: 0.0264
T1 BLKPG: 0.0264
T2 ASTPG: 0.0262
T1 ASTPG: 0.0260
T1 ORBPG: 0.0260
T2 PPG: 0.0252
T1 TRBPG: 0.0252
T2 FG%: 0.0247
T2 TRBPG: 0.0237
T1 PAPG: 0.0235
T1 FT%: 0.0235
T1 FG%: 0.0234
T2 ORBPG: 0.0224
T2 3P%: 0.0218
T2 FT%: 0.0214
Round: 0.0175
Year: 0.0166


The above suggests to me that it may be better if I combine the T1 & T2 statistics.

I'm interested in taking the difference between the T1 and T2 versions of the same statistic. Perhaps fewer features can make the model more focused.

# Data Adjustments

In [None]:
newclass = ncaa2kclass.copy()
lst = list(newclass.columns)
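# Columns 6-19 hold the T1 stats; the matching T2 stats sit 14 positions later (columns 20-33).
for i in range(6,20):
  column_name = lst[i][3:] + ' Diff'  # strip the 'T1 ' prefix, e.g. 'T1 SRS' -> 'SRS Diff'
  newclass[column_name] = newclass[lst[i]] - newclass[lst[i+14]]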
newclass.drop(columns=['T1 Seed', 'T1 Name', 'T2 Seed', 'T2 Name', 'T1 SRS',
       'T1 SOS', 'T1 PPG', 'T1 PAPG', 'T1 FG%', 'T1 3P%', 'T1 FT%', 'T1 ORBPG',
       'T1 TRBPG', 'T1 ASTPG', 'T1 STLPG', 'T1 BLKPG', 'T1 TOVPG', 'T1 PFPG',
       'T2 SRS', 'T2 SOS', 'T2 PPG', 'T2 PAPG', 'T2 FG%', 'T2 3P%', 'T2 FT%',
       'T2 ORBPG', 'T2 TRBPG', 'T2 ASTPG', 'T2 STLPG', 'T2 BLKPG', 'T2 TOVPG',
       'T2 PFPG'], inplace=True)
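
The positional loop above depends on the T1 and T2 stat columns sitting exactly 14 positions apart. A name-based variant avoids that coupling. This is a sketch on a toy frame with two statistics. `add_stat_diffs` is a hypothetical helper and the values are made up:

```python
import pandas as pd

def add_stat_diffs(df, stats):
    """Return a copy with '<stat> Diff' = T1 stat - T2 stat, dropping the originals."""
    out = df.copy()
    for stat in stats:
        out[f"{stat} Diff"] = out[f"T1 {stat}"] - out[f"T2 {stat}"]
    originals = [f"{side} {stat}" for side in ("T1", "T2") for stat in stats]
    return out.drop(columns=originals)

toy = pd.DataFrame({"Round": [1], "T1 SRS": [18.96], "T2 SRS": [2.49],
                    "T1 PPG": [76.4], "T2 PPG": [79.1]})
print(add_stat_diffs(toy, ["SRS", "PPG"]))
```

Because the helper looks columns up by name, it keeps working if the column order changes or new statistics are added.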

In [None]:
newclass.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1511 entries, Duke (1) v. Lamar (16), 2000 to UConn (1) v. Purdue (1), 2024
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Round       1511 non-null   int64  
 1   Year        1511 non-null   int64  
 2   T1 Win      1511 non-null   int64  
 3   SRS Diff    1510 non-null   float64
 4   SOS Diff    1510 non-null   float64
 5   PPG Diff    1510 non-null   float64
 6   PAPG Diff   1510 non-null   float64
 7   FG% Diff    1510 non-null   float64
 8   3P% Diff    1510 non-null   float64
 9   FT% Diff    1510 non-null   float64
 10  ORBPG Diff  1507 non-null   float64
 11  TRBPG Diff  1510 non-null   float64
 12  ASTPG Diff  1510 non-null   float64
 13  STLPG Diff  1510 non-null   float64
 14  BLKPG Diff  1510 non-null   float64
 15  TOVPG Diff  1509 non-null   float64
 16  PFPG Diff   1510 non-null   float64
dtypes: float64(14), int64(3)
memory usage: 244.8+ KB


There are far fewer features here, and they capture similar information. Let's see if the result changes with these features.

In [None]:
newclass23 = newclass[newclass["Year"] != 2024]

X = newclass23.drop(columns = ["T1 Win"])
y = newclass23["T1 Win"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
rf_pipe.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, rf_pipe.predict(X_test))

0.7517241379310344

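Oddly enough, cutting the features in half improved accuracy by about 3 percentage points (72.4% to 75.2%)! This is great news, with one caveat: this split uses `random_state=0` rather than 1, so part of the gain may come from the different test set.

The same information is now captured in far fewer columns, leaving room to introduce more informative features that can push the model further.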

In [None]:
param_dist = {
    'max_depth': np.arange(1, 20),
    'n_estimators': np.arange(100, 500, 100),
    'min_samples_leaf': np.arange(1, 20)
}

X_train_prepd = real_pipe.fit_transform(X_train)
X_test_prepd = real_pipe.transform(X_test)  # transform only: keep the test set out of the fitted statistics

hp_search = RandomizedSearchCV(RandomForestClassifier(),
                               param_dist,
                               n_iter=11,
                               cv=2,
                               random_state=1)
hp_search.fit(X_train_prepd,y_train)
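
An alternative to preprocessing the splits by hand is to search over the whole pipeline, so that scaling and imputation are refit inside each cross-validation fold. A sketch on synthetic data, using scikit-learn's `<step>__<parameter>` naming for pipeline parameters. The small search space and `n_iter=3` are assumptions chosen only to keep the example quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the difference features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("impute", SimpleImputer(strategy="mean")),
                 ("rf", RandomForestClassifier(random_state=1))])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_dist = {"rf__max_depth": np.arange(1, 10),
              "rf__n_estimators": [20, 40],
              "rf__min_samples_leaf": np.arange(1, 10)}

search = RandomizedSearchCV(pipe, param_dist, n_iter=3, cv=2, random_state=1)
search.fit(X_tr, y_tr)

# best_estimator_ is the full pipeline, so raw (unscaled) rows can be passed in.
print(search.best_params_)
print(search.score(X_te, y_te))
```

With this layout, any new matchup row can be handed straight to `search.predict` without a separate scaling step.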

In [None]:
predictions = hp_search.best_estimator_.predict(X_test_prepd)
accuracy_score(y_test, predictions)

0.7827586206896552

78%!

I'm really pleased with this accuracy score considering the basic statistics we are using.

To elevate the model, we will need to include more informative statistics. For example, how many players on a team are returning from the year before?

Another task that will need to be addressed is applying this model in a bracket context. R1 games, then R2 games... and so on.

In [None]:
teams25 = get_season_stats('https://www.sports-reference.com/cbb/seasons/men/2025-school-stats.html')
# teams25.to_csv('teams25.csv')

No errors with 2025


In [None]:
east = ['Duke', "Mount St. Mary's", "Mississippi State", "Baylor",
        "Oregon","Liberty","Arizona","Akron", "BYU", "VCU",
        "Wisconsin","Montana","Saint Mary's","Vanderbilt","Alabama",
        "Robert Morris"]
midwest = ["Houston", "SIU Edwardsville", "Gonzaga", "Georgia",
           "Clemson", "McNeese State", "Purdue", "High Point","Illinois",
           "Xavier","Kentucky","Troy","UCLA","Utah State", "Tennessee",
           "Wofford"]
south = ['Auburn', "Alabama State", "Louisville","Creighton","Michigan",
         "UC San Diego", "Texas A&M", "Yale", "Ole Miss", "UNC",
         "Iowa State", "Lipscomb", "Marquette", "New Mexico",
         "Michigan State", "Bryant"]
west = ['Florida', 'Norfolk State', 'UConn', 'Oklahoma', 'Memphis',
        'Colorado State', 'Maryland', "Grand Canyon", "Missouri", "Drake",
        "Texas Tech", "UNC Wilmington", "Kansas", "Arkansas",
        "St. John's (NY)", "Omaha"]

In [None]:
year = 2025
def get_new_pred(east, midwest, south, west):
  quads = [south, west, east, midwest]
  winners = []
  for quad in quads:
    i = 0
    for x in range(0,8):
      srs_diff = teams25[teams25['School'] == quad[i]]['SRS'].values[0] - teams25[teams25['School'] == quad[i+1]]['SRS'].values[0]
      sos_diff = teams25[teams25['School'] == quad[i]]['SOS'].values[0] - teams25[teams25['School'] == quad[i+1]]['SOS'].values[0]
      ppg_diff = teams25[teams25['School'] == quad[i]]['PPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['PPG'].values[0]
      papg_diff = teams25[teams25['School'] == quad[i]]['PAPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['PAPG'].values[0]
      fg_diff = teams25[teams25['School'] == quad[i]]['FG%'].values[0] - teams25[teams25['School'] == quad[i+1]]['FG%'].values[0]
      three_diff = teams25[teams25['School'] == quad[i]]['3P%'].values[0] - teams25[teams25['School'] == quad[i+1]]['3P%'].values[0]
      ft_diff = teams25[teams25['School'] == quad[i]]['FT%'].values[0] - teams25[teams25['School'] == quad[i+1]]['FT%'].values[0]
      orbpg_diff = teams25[teams25['School'] == quad[i]]['ORBPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['ORBPG'].values[0]
      trbpg_diff = teams25[teams25['School'] == quad[i]]['TRBPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['TRBPG'].values[0]
      astpg_diff = teams25[teams25['School'] == quad[i]]['ASTPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['ASTPG'].values[0]
      stlpg_diff = teams25[teams25['School'] == quad[i]]['STLPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['STLPG'].values[0]
      blkpg_diff = teams25[teams25['School'] == quad[i]]['BLKPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['BLKPG'].values[0]
      tovpg_diff = teams25[teams25['School'] == quad[i]]['TOVPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['TOVPG'].values[0]
      pfpg_diff = teams25[teams25['School'] == quad[i]]['PFPG'].values[0] - teams25[teams25['School'] == quad[i+1]]['PFPG'].values[0]
      round = 1
      games = pd.DataFrame(columns=newclass.columns)
      games = games.drop(columns=['T1 Win'])
      games.loc[len(games)] = [round, year, srs_diff, sos_diff,
                               ppg_diff, papg_diff, fg_diff, three_diff,
                               ft_diff, orbpg_diff, trbpg_diff, astpg_diff,
                               stlpg_diff, blkpg_diff, tovpg_diff, pfpg_diff]
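      # the model was trained on features passed through real_pipe, so scale the new row the same way
      prediction = hp_search.best_estimator_.predict(real_pipe.transform(games))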
      if prediction == 1:
        print(f'The model predicts {quad[i]} will win against {quad[i+1]}')
        winners.append(quad[i])
      else:
        print(f'The model predicts {quad[i+1]} will win against {quad[i]}')
        winners.append(quad[i+1])
      i += 2
  return winners
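
The round-by-round structure can be captured generically: pair adjacent teams, keep each predicted winner, and repeat until one team remains. A minimal sketch, where `pick_winner` is a hypothetical stand-in for the model call (here it simply prefers the higher made-up rating) and `simulate_bracket` is not part of the notebook:

```python
def simulate_bracket(teams, pick_winner):
    """Advance adjacent pairs round by round; return the list of winners per round."""
    if len(teams) < 2 or len(teams) & (len(teams) - 1):
        raise ValueError("number of teams must be a power of two")
    rounds = []
    current = list(teams)
    while len(current) > 1:
        # Teams at positions 0-1, 2-3, ... play each other in this round.
        current = [pick_winner(current[i], current[i + 1])
                   for i in range(0, len(current), 2)]
        rounds.append(current)
    return rounds

# Made-up ratings purely for illustration.
ratings = {"A": 20.0, "B": 3.0, "C": 12.0, "D": 15.0}
results = simulate_bracket(["A", "B", "C", "D"],
                           lambda t1, t2: t1 if ratings[t1] >= ratings[t2] else t2)
print(results)
```

Because each round consumes the previous round's winners in order, the input list has to follow bracket order, as the regional lists above do.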

In [None]:
import warnings
warnings.filterwarnings("ignore")
round2 = get_new_pred(east, midwest, south, west)

The model predicts Auburn will win against Alabama State
The model predicts Louisville will win against Creighton
The model predicts Michigan will win against UC San Diego
The model predicts Texas A&M will win against Yale
The model predicts Ole Miss will win against UNC
The model predicts Iowa State will win against Lipscomb
The model predicts Marquette will win against New Mexico
The model predicts Michigan State will win against Bryant
The model predicts Florida will win against Norfolk State
The model predicts UConn will win against Oklahoma
The model predicts Memphis will win against Colorado State
The model predicts Maryland will win against Grand Canyon
The model predicts Missouri will win against Drake
The model predicts Texas Tech will win against UNC Wilmington
The model predicts Kansas will win against Arkansas
The model predicts St. John's (NY) will win against Omaha
The model predicts Duke will win against Mount St. Mary's
The model predicts Baylor will win against Mississ

In [None]:
round2

['Auburn',
 'Louisville',
 'Michigan',
 'Texas A&M',
 'Ole Miss',
 'Iowa State',
 'Marquette',
 'Michigan State',
 'Florida',
 'UConn',
 'Memphis',
 'Maryland',
 'Missouri',
 'Texas Tech',
 'Kansas',
 "St. John's (NY)",
 'Duke',
 'Baylor',
 'Oregon',
 'Arizona',
 'BYU',
 'Wisconsin',
 "Saint Mary's",
 'Alabama',
 'Houston',
 'Gonzaga',
 'Clemson',
 'Purdue',
 'Illinois',
 'Kentucky',
 'UCLA',
 'Tennessee']
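
Each later round pairs adjacent entries of the previous round's winners list (index 0 plays 1, 2 plays 3, and so on), so the whole bracket could in principle be driven by one loop. The sketch below shows that idea on toy data. Everything named here is hypothetical: `play_round`, `simulate_bracket`, `toy_stats`, and `toy_rule` are not part of this notebook, the stats are made up, and `toy_rule` is only a stand-in for the trained model's `predict`. It also assumes the number of teams is a power of two.

```python
import pandas as pd

def play_round(teams, stats, round_num, predict_t1_wins):
    """Pair adjacent teams (0 vs 1, 2 vs 3, ...) and return each predicted winner."""
    winners = []
    for t1, t2 in zip(teams[0::2], teams[1::2]):
        diffs = stats.loc[t1] - stats.loc[t2]  # Team 1 minus Team 2, per stat
        winners.append(t1 if predict_t1_wins(round_num, diffs) else t2)
    return winners

def simulate_bracket(teams, stats, predict_t1_wins, first_round=1):
    """Play rounds until one team remains; return the winners of every round."""
    rounds = []
    round_num = first_round
    while len(teams) > 1:
        teams = play_round(teams, stats, round_num, predict_t1_wins)
        rounds.append(teams)
        round_num += 1
    return rounds

# Toy season stats, invented for illustration (not real team data)
toy_stats = pd.DataFrame(
    {"SRS": [20.0, 5.0, 12.0, 15.0], "PPG": [80.0, 70.0, 75.0, 78.0]},
    index=["A", "B", "C", "D"],
)

def toy_rule(round_num, diffs):
    """Stand-in for the model: Team 1 wins when its SRS is higher."""
    return diffs["SRS"] > 0

bracket = simulate_bracket(["A", "B", "C", "D"], toy_stats, toy_rule)
print(bracket)  # [['A', 'D'], ['A']]
```

To use something like this with the real model, `predict_t1_wins` would need to build the same feature row (round, year, then the stat differences) that the per-round functions in this notebook pass to `hp_search.best_estimator_.predict`.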

In [None]:
year = 2025
def get_round2_pred(winners):
    # Season stats compared between the two teams, in the same order
    # as the feature row built below
    stat_cols = ['SRS', 'SOS', 'PPG', 'PAPG', 'FG%', '3P%', 'FT%', 'ORBPG',
                 'TRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG']
    round_num = 2  # tournament round, passed to the model as a feature
    winners2 = []
    # Adjacent entries play each other: (0, 1), (2, 3), ... for 16 games
    for i in range(0, 32, 2):
        team1 = teams25[teams25['School'] == winners[i]]
        team2 = teams25[teams25['School'] == winners[i+1]]
        # Team 1 minus Team 2 for every stat
        diffs = [team1[col].values[0] - team2[col].values[0] for col in stat_cols]
        games = pd.DataFrame(columns=newclass.columns)
        games = games.drop(columns=['T1 Win'])
        games.loc[len(games)] = [round_num, year] + diffs
        prediction = hp_search.best_estimator_.predict(games)
        if prediction[0] == 1:
            print(f'The model predicts {winners[i]} will win against {winners[i+1]}')
            winners2.append(winners[i])
        else:
            print(f'The model predicts {winners[i+1]} will win against {winners[i]}')
            winners2.append(winners[i+1])
    return winners2

In [None]:
r2_winners = get_round2_pred(round2)

The model predicts Auburn will win against Louisville
The model predicts Michigan will win against Texas A&M
The model predicts Iowa State will win against Ole Miss
The model predicts Michigan State will win against Marquette
The model predicts Florida will win against UConn
The model predicts Maryland will win against Memphis
The model predicts Texas Tech will win against Missouri
The model predicts Kansas will win against St. John's (NY)
The model predicts Duke will win against Baylor
The model predicts Arizona will win against Oregon
The model predicts Wisconsin will win against BYU
The model predicts Alabama will win against Saint Mary's
The model predicts Houston will win against Gonzaga
The model predicts Purdue will win against Clemson
The model predicts Illinois will win against Kentucky
The model predicts Tennessee will win against UCLA


In [None]:
year = 2025
def get_round3_pred(winners):
    # Season stats compared between the two teams, in the same order
    # as the feature row built below
    stat_cols = ['SRS', 'SOS', 'PPG', 'PAPG', 'FG%', '3P%', 'FT%', 'ORBPG',
                 'TRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG']
    round_num = 3  # tournament round, passed to the model as a feature
    winners3 = []
    # Adjacent entries play each other: (0, 1), (2, 3), ... for 8 games
    for i in range(0, 16, 2):
        team1 = teams25[teams25['School'] == winners[i]]
        team2 = teams25[teams25['School'] == winners[i+1]]
        # Team 1 minus Team 2 for every stat
        diffs = [team1[col].values[0] - team2[col].values[0] for col in stat_cols]
        games = pd.DataFrame(columns=newclass.columns)
        games = games.drop(columns=['T1 Win'])
        games.loc[len(games)] = [round_num, year] + diffs
        prediction = hp_search.best_estimator_.predict(games)
        if prediction[0] == 1:
            print(f'The model predicts {winners[i]} will win against {winners[i+1]}')
            winners3.append(winners[i])
        else:
            print(f'The model predicts {winners[i+1]} will win against {winners[i]}')
            winners3.append(winners[i+1])
    return winners3

In [None]:
r3_winners = get_round3_pred(r2_winners)

The model predicts Auburn will win against Michigan
The model predicts Iowa State will win against Michigan State
The model predicts Florida will win against Maryland
The model predicts Texas Tech will win against Kansas
The model predicts Duke will win against Arizona
The model predicts Alabama will win against Wisconsin
The model predicts Houston will win against Purdue
The model predicts Tennessee will win against Illinois


In [None]:
year = 2025
def get_round4_pred(winners):
    # Season stats compared between the two teams, in the same order
    # as the feature row built below
    stat_cols = ['SRS', 'SOS', 'PPG', 'PAPG', 'FG%', '3P%', 'FT%', 'ORBPG',
                 'TRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG']
    round_num = 4  # tournament round, passed to the model as a feature
    winners4 = []
    # Adjacent entries play each other: (0, 1), (2, 3), ... for 4 games
    for i in range(0, 8, 2):
        team1 = teams25[teams25['School'] == winners[i]]
        team2 = teams25[teams25['School'] == winners[i+1]]
        # Team 1 minus Team 2 for every stat
        diffs = [team1[col].values[0] - team2[col].values[0] for col in stat_cols]
        games = pd.DataFrame(columns=newclass.columns)
        games = games.drop(columns=['T1 Win'])
        games.loc[len(games)] = [round_num, year] + diffs
        prediction = hp_search.best_estimator_.predict(games)
        if prediction[0] == 1:
            print(f'The model predicts {winners[i]} will win against {winners[i+1]}')
            winners4.append(winners[i])
        else:
            print(f'The model predicts {winners[i+1]} will win against {winners[i]}')
            winners4.append(winners[i+1])
    return winners4

In [None]:
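# Despite the variable name, these are the four winners of the Elite Eight games,
# i.e. the teams that advance to the Final Four.
elite_eight = get_round4_pred(r3_winners)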

The model predicts Auburn will win against Iowa State
The model predicts Florida will win against Texas Tech
The model predicts Duke will win against Alabama
The model predicts Houston will win against Tennessee


In [None]:
year = 2025
def get_final_four_pred(winners):
    # Season stats compared between the two teams, in the same order
    # as the feature row built below
    stat_cols = ['SRS', 'SOS', 'PPG', 'PAPG', 'FG%', '3P%', 'FT%', 'ORBPG',
                 'TRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG']
    round_num = 5  # tournament round, passed to the model as a feature
    finalists = []
    # Adjacent entries play each other: (0, 1) and (2, 3) for the 2 semifinals
    for i in range(0, 4, 2):
        team1 = teams25[teams25['School'] == winners[i]]
        team2 = teams25[teams25['School'] == winners[i+1]]
        # Team 1 minus Team 2 for every stat
        diffs = [team1[col].values[0] - team2[col].values[0] for col in stat_cols]
        games = pd.DataFrame(columns=newclass.columns)
        games = games.drop(columns=['T1 Win'])
        games.loc[len(games)] = [round_num, year] + diffs
        prediction = hp_search.best_estimator_.predict(games)
        if prediction[0] == 1:
            print(f'The model predicts {winners[i]} will win against {winners[i+1]}')
            finalists.append(winners[i])
        else:
            print(f'The model predicts {winners[i+1]} will win against {winners[i]}')
            finalists.append(winners[i+1])
    return finalists

In [None]:
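# Despite the variable name, these are the two semifinal winners,
# i.e. the teams that meet in the championship game.
semifinalists = get_final_four_pred(elite_eight)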

The model predicts Auburn will win against Florida
The model predicts Duke will win against Houston


In [None]:
year = 2025
def get_champ_pred(winners):
    # Season stats compared between the two teams, in the same order
    # as the feature row built below
    stat_cols = ['SRS', 'SOS', 'PPG', 'PAPG', 'FG%', '3P%', 'FT%', 'ORBPG',
                 'TRBPG', 'ASTPG', 'STLPG', 'BLKPG', 'TOVPG', 'PFPG']
    round_num = 6  # tournament round, passed to the model as a feature
    champion = []
    # Only one game: winners[0] against winners[1]
    for i in range(0, 2, 2):
        team1 = teams25[teams25['School'] == winners[i]]
        team2 = teams25[teams25['School'] == winners[i+1]]
        # Team 1 minus Team 2 for every stat
        diffs = [team1[col].values[0] - team2[col].values[0] for col in stat_cols]
        games = pd.DataFrame(columns=newclass.columns)
        games = games.drop(columns=['T1 Win'])
        games.loc[len(games)] = [round_num, year] + diffs
        prediction = hp_search.best_estimator_.predict(games)
        if prediction[0] == 1:
            print(f'The model predicts {winners[i]} will win against {winners[i+1]}')
            champion.append(winners[i])
        else:
            print(f'The model predicts {winners[i+1]} will win against {winners[i]}')
            champion.append(winners[i+1])
    # Returned as a one-element list, matching the other round functions
    return champion

In [None]:
get_champ_pred(semifinalists)

The model predicts Auburn will win against Duke


['Auburn']
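
Once the real results are in, the bracket can be checked against the two targets set at the start of this notebook: overall accuracy above 65%, and accuracy averaged equally across rounds above 55%. The sketch below is one possible way to compute both. `score_bracket` is a hypothetical helper, and the predicted and actual winners in the example are invented toy values, not tournament results.

```python
def score_bracket(predicted_rounds, actual_rounds):
    """Return (overall accuracy, accuracy averaged equally across rounds).

    Each argument is a list with one list of winners per round.
    A pick counts as correct if that team really won a game in that round.
    """
    correct = [len(set(p) & set(a)) for p, a in zip(predicted_rounds, actual_rounds)]
    games = [len(a) for a in actual_rounds]
    per_round = [c / g for c, g in zip(correct, games)]
    overall = sum(correct) / sum(games)
    round_avg = sum(per_round) / len(per_round)
    return overall, round_avg

# Toy 4-team bracket: both lists are made up for illustration
predicted = [["A", "D"], ["A"]]
actual = [["A", "C"], ["A"]]

overall, round_avg = score_bracket(predicted, actual)
print(f"overall accuracy: {overall:.3f}")
print(f"round-averaged accuracy: {round_avg:.3f}")
```

In the toy example, one of two first-round picks and the single final pick are right, so the overall accuracy is 2/3 while the round average is (0.5 + 1.0) / 2 = 0.75. The gap shows why the two numbers are tracked separately: later rounds have few games but carry equal weight in the round average.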