<a href="https://colab.research.google.com/github/BARATZL/march-madness-supML/blob/main/NCAAMB_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the Outcome of March Madness Basketball Games

I love watching college basketball, particularly in March. My alma mater has never been in the "big dance", but the tournament has nevertheless always been very entertaining to me.

However, I do not know much about the sport itself. When it comes time to join the office bracket pool, my bracket's success largely hinges on luck, or on the rushed Google searches I make just before filling out my predictions.

The above is the main motivation behind this project. Using machine learning concepts I learned during the fall semester, can I improve upon my bracket predictions from last year (where I more or less guessed)?

## Defining success




A simple way to define success of my model is to perform better than my predictions last year. I correctly guessed 42 out of 64 of the games, about 65% in total.

That sounds pretty decent for guessing, but my predictions got worse after Rounds 1 and 2 of the bracket. Weighting each round equally, my accuracy looks something like this:



---



$\frac{1}{6}(\text{R1 acc.} + \text{R2 acc.} + \text{R3 acc.} + \dots + \text{R6 acc.})$

Or,

$\frac{1}{6} \left(\frac{24}{32} + \frac{11}{16} + \frac{3}{8} + \frac{2}{4} + \frac{2}{2} + \frac{0}{1}\right) \approx 0.55$



---



So the standards I will initially aim for are a total accuracy higher than 65% and an average accuracy across rounds higher than 55%.
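As a quick check of the round-averaged figure, here is a minimal sketch that recomputes it from the per-round counts in the formula above. The variable names are my own and are not used elsewhere in the notebook.

```python
# Correct picks and games played per round, copied from the formula above.
correct = [24, 11, 3, 2, 2, 0]
played = [32, 16, 8, 4, 2, 1]

# Accuracy within each round.
round_acc = [c / n for c, n in zip(correct, played)]

# Round-averaged accuracy: every round counts equally, regardless of size.
avg_round_acc = sum(round_acc) / len(round_acc)

print([round(a, 3) for a in round_acc])
print(f"round-averaged accuracy: {avg_round_acc:.2f}")  # 0.55
```

Because each round gets the same weight, a single miss in the championship game costs as much as sixteen misses in Round 1, which is why this metric is much harsher than the overall hit rate.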

# Data Sourcing and Formatting

I have thought of two modeling approaches, and both can work from the same tabular format. The data should be organized as follows:

Game | Team 1 | Team 1 Season Stat 1 | ... | Team 2 | Team 2 Season Stat 1 | ... | Team 1 Score | Team 2 Score | Team 1 Win (0 for no, 1 for yes)
----|----|----|----|----|----|----|----|----|---|
Purdue v. UConn | Purdue | x | ... | UConn | y | ... | 60 | 75 | 0
UConn v. Alabama | UConn | y | ... | Alabama | z | ... | 86 | 72 | 1
...|...|...|...|...|...|...|...|...|...

With this format, our model can either:

**1**. Predict the scores of Team 1 and Team 2, with a subsequent function that turns the two predicted scores into a predicted outcome.

**2**. Predict whether or not Team 1 wins.
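For the first option, the "subsequent function" can be very small. The sketch below is one way it might look; `team1_wins` is a hypothetical helper name, and treating a tie in predicted scores as a Team 1 loss is an arbitrary assumption on my part.

```python
def team1_wins(team1_score: float, team2_score: float) -> int:
    """Return 1 if Team 1's (predicted) score is higher, else 0.

    Equal predicted scores are counted as a Team 1 loss (an arbitrary choice).
    """
    return int(team1_score > team2_score)


# Scores from the two example rows in the table above.
print(team1_wins(60, 75))  # Purdue v. UConn -> 0
print(team1_wins(86, 72))  # UConn v. Alabama -> 1
```

The second option skips this step entirely, since the `Team 1 Win` column is the target itself.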

We can assess both, but first we need to assemble our data, starting with a table of season statistics.

Assembling this data across all years will be difficult, but should be possible by extracting data from [Sports Reference](https://www.sports-reference.com/cbb/). The code below begins this process.

In [1]:
from bs4 import BeautifulSoup, Comment
import numpy as np
import requests
import pandas as pd
url = 'https://www.sports-reference.com/cbb/seasons/men/2024-school-stats.html'
test = requests.get(url)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
pretty_html = soup.prettify()
tables = soup.find_all('table')  # finding table in webpage
headers = soup.find_all('th')

In [2]:
columnheaders = []
for i in headers:
 columnheaders.append(i.text)
columnheaders = columnheaders[13:50]
ncaa2324 = pd.DataFrame(columns = columnheaders)  # setting table based on webpage table columns
ncaa2324.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 37 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   School  0 non-null      object
 1   G       0 non-null      object
 2   W       0 non-null      object
 3   L       0 non-null      object
 4   W-L%    0 non-null      object
 5   SRS     0 non-null      object
 6   SOS     0 non-null      object
 7           0 non-null      object
 8   W       0 non-null      object
 9   L       0 non-null      object
 10          0 non-null      object
 11  W       0 non-null      object
 12  L       0 non-null      object
 13          0 non-null      object
 14  W       0 non-null      object
 15  L       0 non-null      object
 16          0 non-null      object
 17  Tm.     0 non-null      object
 18  Opp.    0 non-null      object
 19          0 non-null      object
 20  MP      0 non-null      object
 21  FG      0 non-null      object
 22  FGA     0 non-null      object
 23  FG

In [3]:
rows = soup.find_all('tr')
for row in rows:
  cells = row.find_all('td')
  if cells == []:
    continue
  columndata = [col.text.strip() for col in cells]
  if "NCAA" in columndata[0]:
    ncaa2324.loc[len(ncaa2324)]=columndata  # extracting everyone who made it into the NCAA tournament.

In [4]:
ncaa2324.head(5)  # checking things here

Unnamed: 0,School,G,W,L,W-L%,SRS,SOS,Unnamed: 8,W.1,L.1,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,Akron NCAA,35,24,11,0.686,2.77,-2.08,,13,5,...,467,642,0.727,363,1278,455,197,100,394,583
1,Alabama NCAA,37,25,12,0.676,20.69,11.8,,13,5,...,650,842,0.772,472,1467,587,256,162,438,734
2,Arizona NCAA,36,27,9,0.75,24.54,9.45,,15,5,...,605,844,0.717,471,1533,665,300,133,430,590
3,Auburn NCAA,35,27,8,0.771,22.46,7.66,,13,5,...,609,812,0.75,393,1323,622,258,215,374,678
4,Baylor NCAA,35,24,11,0.686,19.5,10.71,,11,7,...,579,791,0.732,399,1229,514,236,110,421,577


This table contains some of the statistics we would like to see for our team season data when we compile games from March Madness tournaments.

However, there's an issue with this web scraping method: the statistics listed include tournament games. This is problematic because we want the model to be useful before the tournament takes place.

If I train a model on statistics that partly reflect tournament games, it may perform worse when it is actually needed, which is before teams have had a chance to compile any tournament statistics.

One way to limit this is to make sure that I only include rate/percentage statistics, which shift far less than season totals when a few extra games are added.

In [5]:
ncaa2324.drop(columns=['W','L'], inplace=True)

In [6]:
columns_to_drop = ncaa2324.columns[5:9]
columns_to_drop
ncaa2324.drop(columns=columns_to_drop,inplace=True)

In [7]:
ncaa2324.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)

In [8]:
ncaa2324.iloc[:,1:] = ncaa2324.iloc[:,1:].astype(float)
column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
for col in column_ops:
  ncaa2324[col] = ncaa2324[col].astype(float)
  ncaa2324[col] = ncaa2324[col]/(ncaa2324['G'].astype(float))  # setting up per game ratios

ncaa2324.rename(columns={'Tm.':'PPG',
                         'Opp.':'PAPG',
                         'ORB':'ORBPG',
                         'TRB':'TRBPG',
                         'AST':'ASTPG',
                         'STL':'STLPG',
                         'BLK':'BLKPG',
                         'TOV':'TOVPG',
                         'PF':'PFPG'},inplace=True)

In [9]:
ncaa2324.drop(columns=['G','W-L%'],inplace=True)

Now I've eliminated most of the obvious indicators of postseason success that would not be available if we try to practically apply a model next year. The next step is to create game data and stitch these seasonal statistics onto it. I plan to pull the game data from Sports Reference as well.

## Tidying school names

In [10]:
# first, making sure school names are uniform with the second table. This way, there's a seamless merge.
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('NCAA$', '', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('Brigham Young$', 'BYU', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('Connecticut$', 'UConn', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace('North Carolina$', 'UNC', regex=True)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Mary's (CA)", "Saint Mary's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Peter's", "St. Peter's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Louisiana State", "LSU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Southern California", "USC", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Southern Methodist", "SMU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Mississippi", "Ole Miss", regex=False)  #how to adjust?
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Ole Miss State", "Mississippi State", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Ole Miss Valley State", "Mississippi Valley State", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Southern Ole Miss", "Southern Miss", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Pittsburgh", "Pitt", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Pennsylvania", "Penn", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Central Connecticut State", "Central Connecticut", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Florida International", "FIU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Nevada-Las Vegas","UNLV", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("UC Santa Barbara", "UCSB", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Virginia Commonwealth", "VCU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Saint Joseph's", "St. Joseph's", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Maryland-Baltimore County","UMBC", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Illinois-Chicago", "UIC", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Massachusetts", "UMass", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("East Tennessee State", "ETSU", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("IU Indy", "IUPUI", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("UC Irvine", "UC-Irvine", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("UC Davis", "UC-Davis", regex=False)
ncaa2324['School'] = ncaa2324['School'].str.strip().str.replace("Long Island University", "LIU", regex=False)

ncaa2324['Year'] = url[49:53]
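The `#how to adjust?` comment above points at a real hazard: substring replacement turns "Mississippi State" into "Ole Miss State", which then has to be patched back. A possible alternative, sketched here under my own assumptions, is to keep the renames in a dictionary and replace whole values only. `NAME_MAP` is a hypothetical name holding just a few of the entries from the cell above, and the sample Series is made up for illustration.

```python
import pandas as pd

# Hypothetical subset of the renames above, keyed by the full school name.
NAME_MAP = {
    "Brigham Young": "BYU",
    "Connecticut": "UConn",
    "Mississippi": "Ole Miss",
    "North Carolina": "UNC",
    "Saint Mary's (CA)": "Saint Mary's",
}

# Made-up sample of raw names, some carrying the tournament marker.
schools = pd.Series(["Mississippi NCAA", "Mississippi State NCAA",
                     "Connecticut NCAA", "Central Connecticut State"])

# Drop the trailing tournament marker and surrounding whitespace first.
cleaned = schools.str.replace(r"\s*NCAA$", "", regex=True).str.strip()

# Series.replace with a dict matches whole values only, so
# "Mississippi State" is not touched by the "Mississippi" entry.
cleaned = cleaned.replace(NAME_MAP)
print(cleaned.tolist())
# ['Ole Miss', 'Mississippi State', 'UConn', 'Central Connecticut State']
```

With whole-value matching, the three "Ole Miss ..." repair lines would no longer be needed, and the same dictionary could be shared between this cell and any later function.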


In [11]:
mm2324 = pd.DataFrame(columns=['T1 Seed','T1 Name','T1 Score','T2 Seed','T2 Name','T2 Score','Round'])  # making combined table template

In [12]:
mm2324

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round


In [13]:
url2 = 'https://www.sports-reference.com/cbb/postseason/men/2024-ncaa.html'
test = requests.get(url2)

html_content = test.text
soup = BeautifulSoup(html_content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
i = 0
j = 1
for comment in comments:
    if "game" in comment:  # the html had <--game--> comments wherever they placed bracket games.
        next_element = comment.find_next_sibling()
        try:
          seed1 = next_element.find('span').text.strip()
        except:
          break
        name1 = next_element.find('a', href=True).text.strip()
        try:
          score1 = next_element.find_all('a',href=True)[1].text.strip()
        except:
          continue
        third_element = next_element.find_next_sibling()
        seed2 = third_element.find('span').text.strip()
        name2 = third_element.find('a', href=True).text.strip()
        score2 = third_element.find_all('a',href=True)[1].text.strip()
        i += 1

        if j < 5:
            if i <= 8:
              round = 1
            elif i <= 12:
              round = 2
            elif i <= 14:
              round = 3
            elif i == 15:
              round = 4
              j += 1  # move to the next quadrant
              i = 0
        elif j == 5:
            if i <= 2:
                round = 5
            elif i == 3:
                round = 6

        mm2324.loc[len(mm2324)] = [seed1,name1,score1,seed2,name2,score2,round]
mm2324.loc[len(mm2324)] = [1,'UConn',75,1,'Purdue',60,6]  # for some reason the championship game still didn't have the score. manually inputting.

In [14]:
mm2324['Game'] = mm2324['T1 Name'] + ' (' + mm2324['T1 Seed'].astype(str) + ') v. ' + mm2324['T2 Name'] + ' (' + mm2324['T2 Seed'].astype(str) + ')'
mm2324.drop_duplicates(['T1 Name', 'T2 Name'],inplace=True)
mm2324  # the above will be the index. Important for future so we can still determine which game the model is predicting on

Unnamed: 0,T1 Seed,T1 Name,T1 Score,T2 Seed,T2 Name,T2 Score,Round,Game
0,1,UConn,91,16,Stetson,52,1,UConn (1) v. Stetson (16)
1,8,Florida Atlantic,65,9,Northwestern,77,1,Florida Atlantic (8) v. Northwestern (9)
2,5,San Diego State,69,12,UAB,65,1,San Diego State (5) v. UAB (12)
3,4,Auburn,76,13,Yale,78,1,Auburn (4) v. Yale (13)
4,6,BYU,67,11,Duquesne,71,1,BYU (6) v. Duquesne (11)
...,...,...,...,...,...,...,...,...
58,6,Clemson,77,2,Arizona,72,3,Clemson (6) v. Arizona (2)
59,4,Alabama,89,6,Clemson,82,4,Alabama (4) v. Clemson (6)
60,1,UConn,86,4,Alabama,72,5,UConn (1) v. Alabama (4)
61,1,Purdue,63,11,NC State,50,5,Purdue (1) v. NC State (11)


Now that the game table has been created, the seasonal stats for the teams can be inserted into the table.

In [15]:
ncaa2324['School'] = ncaa2324['School'].str.strip()
mm2324['T1 Name'] = mm2324['T1 Name'].str.strip()
mm2324['T2 Name'] = mm2324['T2 Name'].str.strip()
comb2324 = pd.merge(mm2324, ncaa2324,left_on='T1 Name',right_on='School',how='left')
comb2324.rename(columns={'SRS':'T1 SRS',
                         'SOS':'T1 SOS',
                         'PPG':'T1 PPG',
                         'PAPG':'T1 PAPG',
                         'ORBPG':'T1 ORBPG',
                         'TRBPG':'T1 TRBPG',
                         'ASTPG':'T1 ASTPG',
                         'FG%':'T1 FG%',
                         '3P%':'T1 3P%',
                         'FT%':'T1 FT%',
                         'STLPG':'T1 STLPG',
                         'BLKPG':'T1 BLKPG',
                         'TOVPG':'T1 TOVPG',
                         'PFPG':'T1 PFPG'},inplace=True)
comb2324 = pd.merge(comb2324, ncaa2324, left_on='T2 Name',right_on='School',how='left')
comb2324.rename(columns={'SRS':'T2 SRS',
                         'SOS':'T2 SOS',
                         'PPG':'T2 PPG',
                         'PAPG':'T2 PAPG',
                         'ORBPG':'T2 ORBPG',
                         'TRBPG':'T2 TRBPG',
                         'ASTPG':'T2 ASTPG',
                         'FG%':'T2 FG%',
                         '3P%':'T2 3P%',
                         'FT%':'T2 FT%',
                         'STLPG':'T2 STLPG',
                         'BLKPG':'T2 BLKPG',
                         'TOVPG':'T2 TOVPG',
                         'PFPG':'T2 PFPG'}, inplace=True)
comb2324 = comb2324.drop(columns=['School_x','School_y','Year_y'])
comb2324.rename(columns={'Year_x':'Year'},inplace=True)
comb2324.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 37 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   T1 Seed   63 non-null     object 
 1   T1 Name   63 non-null     object 
 2   T1 Score  63 non-null     object 
 3   T2 Seed   63 non-null     object 
 4   T2 Name   63 non-null     object 
 5   T2 Score  63 non-null     object 
 6   Round     63 non-null     int64  
 7   Game      63 non-null     object 
 8   T1 SRS    63 non-null     object 
 9   T1 SOS    63 non-null     object 
 10  T1 PPG    63 non-null     float64
 11  T1 PAPG   63 non-null     float64
 12  T1 FG%    63 non-null     object 
 13  T1 3P%    63 non-null     object 
 14  T1 FT%    63 non-null     object 
 15  T1 ORBPG  63 non-null     float64
 16  T1 TRBPG  63 non-null     float64
 17  T1 ASTPG  63 non-null     float64
 18  T1 STLPG  63 non-null     float64
 19  T1 BLKPG  63 non-null     float64
 20  T1 TOVPG  63 non-null     float64
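The two long rename dictionaries in the cell above could probably be avoided by prefixing the stat columns before each merge. The sketch below shows the idea on toy tables; the frames `games` and `stats` and all the numbers in them are made up for illustration, and only `add_prefix` and `merge` are real pandas methods.

```python
import pandas as pd

# Toy stand-ins for the game table and the season-stats table (made-up values).
games = pd.DataFrame({"T1 Name": ["UConn", "Purdue"],
                      "T2 Name": ["Alabama", "NC State"]})
stats = pd.DataFrame({"School": ["UConn", "Purdue", "Alabama", "NC State"],
                      "SRS": [1.0, 2.0, 3.0, 4.0],
                      "PPG": [10.0, 20.0, 30.0, 40.0]})

combined = games
for side in ("T1", "T2"):
    # Prefix every column, e.g. "SRS" -> "T1 SRS" and "School" -> "T1 School".
    side_stats = stats.add_prefix(f"{side} ")
    combined = combined.merge(side_stats, left_on=f"{side} Name",
                              right_on=f"{side} School", how="left",
                              validate="many_to_one")
    # The prefixed school column duplicates the name column, so drop it.
    combined = combined.drop(columns=f"{side} School")

print(combined.columns.tolist())
# ['T1 Name', 'T2 Name', 'T1 SRS', 'T1 PPG', 'T2 SRS', 'T2 PPG']
```

The `validate="many_to_one"` argument makes the merge fail loudly if a school appears twice in the stats table, which would otherwise silently duplicate games.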


In [16]:
combi2324 = comb2324.iloc[:,[22,6,7,0,1,8,9,10,11,12,13,14,15,16,17,18,19,20,21,3,4,23,24,25,26,27,28,29,30,31,32,33,34,35,36,2,5]]  # rearranging.
combi2324.info()  # null check

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 37 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      63 non-null     object 
 1   Round     63 non-null     int64  
 2   Game      63 non-null     object 
 3   T1 Seed   63 non-null     object 
 4   T1 Name   63 non-null     object 
 5   T1 SRS    63 non-null     object 
 6   T1 SOS    63 non-null     object 
 7   T1 PPG    63 non-null     float64
 8   T1 PAPG   63 non-null     float64
 9   T1 FG%    63 non-null     object 
 10  T1 3P%    63 non-null     object 
 11  T1 FT%    63 non-null     object 
 12  T1 ORBPG  63 non-null     float64
 13  T1 TRBPG  63 non-null     float64
 14  T1 ASTPG  63 non-null     float64
 15  T1 STLPG  63 non-null     float64
 16  T1 BLKPG  63 non-null     float64
 17  T1 TOVPG  63 non-null     float64
 18  T1 PFPG   63 non-null     float64
 19  T2 Seed   63 non-null     object 
 20  T2 Name   63 non-null     object 


In [17]:
combi2324.set_index('Game',inplace=True)

In [18]:
combi2324 = combi2324.copy()  # work on an explicit copy; the iloc selection above otherwise triggers SettingWithCopyWarning
for col in combi2324.columns:
    if col in ('Year', 'Round'):
      combi2324[col] = combi2324[col].astype(int)
    else:
      try:
        combi2324[col] = combi2324[col].astype(float)
      except ValueError:  # team-name columns are text and stay as object
        continue
combi2324.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63 entries, UConn (1) v. Stetson (16) to UConn (1) v. Purdue (1)
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      63 non-null     int64  
 1   Round     63 non-null     int64  
 2   T1 Seed   63 non-null     float64
 3   T1 Name   63 non-null     object 
 4   T1 SRS    63 non-null     float64
 5   T1 SOS    63 non-null     float64
 6   T1 PPG    63 non-null     float64
 7   T1 PAPG   63 non-null     float64
 8   T1 FG%    63 non-null     float64
 9   T1 3P%    63 non-null     float64
 10  T1 FT%    63 non-null     float64
 11  T1 ORBPG  63 non-null     float64
 12  T1 TRBPG  63 non-null     float64
 13  T1 ASTPG  63 non-null     float64
 14  T1 STLPG  63 non-null     float64
 15  T1 BLKPG  63 non-null     float64
 16  T1 TOVPG  63 non-null     float64
 17  T1 PFPG   63 non-null     float64
 18  T2 Seed   63 non-null     float64
 19  T2 Name   63 non-null     object 

This is the ideal table setup we're looking for. Before iterating over the past two-ish decades of March Madness basketball, I'd like to see how a model does with a train-test split of the 2024 games alone.

## ML Test Run

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import MultiTaskLasso

X = combi2324.drop(columns=['T1 Name', 'T2 Name', 'T1 Score', 'T2 Score'])
y = combi2324[['T1 Score', 'T2 Score']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
preprocessing = Pipeline([('scaler', StandardScaler())])
lasso_pipe = Pipeline([('preprocessing', preprocessing),('lasso',MultiTaskLasso(alpha=0.1,random_state=1))])
lasso_pipe.fit(X_train,y_train)

In [20]:
predictions = lasso_pipe.predict(X_test)

In [21]:
yt_copy = y_test.copy()
yt_copy['T1 Win'] = yt_copy['T1 Score'] > yt_copy['T2 Score']
yt_copy['Prediction 1'] = [row[0] for row in predictions]
yt_copy['Prediction 2'] = [row[1] for row in predictions]
yt_copy['Predict T1 Win'] = yt_copy['Prediction 1'] > yt_copy['Prediction 2']
yt_copy['Prediction Correct?'] = yt_copy['Predict T1 Win'] == yt_copy['T1 Win']
yt_copy

Game,T1 Score,T2 Score,T1 Win,Prediction 1,Prediction 2,Predict T1 Win,Prediction Correct?
Gonzaga (5) v. Kansas (4),89.0,68.0,True,78.705772,43.390901,True,True
James Madison (12) v. Duke (4),55.0,93.0,False,63.193435,88.406209,False,True
Dayton (7) v. Nevada (10),63.0,60.0,True,66.574662,60.674839,True,True
Purdue (1) v. Gonzaga (5),80.0,68.0,True,94.443444,84.581761,True,True
Purdue (1) v. NC State (11),63.0,50.0,True,80.886146,69.471964,True,True
San Diego State (5) v. UAB (12),69.0,65.0,True,84.074132,72.177932,True,True
Texas (7) v. Colorado State (10),56.0,44.0,True,74.898804,58.057505,True,True
Houston (1) v. Duke (4),51.0,54.0,False,69.663298,55.74264,True,False
Colorado (10) v. Marquette (2),77.0,81.0,False,53.721962,54.085654,False,True
Duke (4) v. NC State (11),64.0,76.0,False,71.890411,59.885521,True,False


With a really small sample size, the model predicts 10 outcomes correctly out of 13 (about 77% accuracy). This is surprisingly good to me! But there are a few caveats:

- The model misses notable upsets in the test set: Duke v. NC State, Houston v. Duke, and Kentucky v. Oakland, with the last game's predicted scores being way off the mark.
- Even though it predicts the outcome correctly on a decent portion of the games, the predicted scores are often not close to the true scores.
- The train-test split mixes all rounds of the tournament together, which we won't be able to do when using the model practically. When testing with more data later, we will mirror the process we would follow in March (Round 1 predictions, then Round 2, and so on).
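Once predictions are evaluated round by round, the round-averaged metric from "Defining success" can be computed directly from a results table. This is only a sketch: the `results` frame below is made up, and I am assuming a `Round` column is kept alongside the `Prediction Correct?` column used above.

```python
import pandas as pd

# Made-up results table: one row per test game, with its round and whether
# the predicted winner matched the actual winner.
results = pd.DataFrame({
    "Round": [1, 1, 1, 1, 2, 2, 3],
    "Prediction Correct?": [True, True, False, True, True, False, True],
})

# Accuracy within each round (the mean of a boolean column is the hit rate).
per_round = results.groupby("Round")["Prediction Correct?"].mean()

# Unweighted mean across rounds, so late rounds count as much as Round 1.
round_avg = per_round.mean()

print(per_round)
print("round-averaged accuracy:", round_avg)
```

In this toy table the per-round accuracies are 0.75, 0.5, and 1.0, so the round-averaged accuracy is 0.75 even though the pooled hit rate is 5 out of 7.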

What features does lasso view as important?


In [22]:
lasso_model = lasso_pipe.named_steps['lasso']
lasso_model.coef_

array([[  0.        ,   0.18968562,  -7.90138937,  -8.68511282,
          2.69186365,   4.00038837,   3.04772834,  -1.44519617,
          1.04515361,  -0.52893422,   0.16803906,  -0.19707116,
          2.8636212 ,   2.3270009 ,   2.9518573 ,  -2.20339699,
         -0.82773822,   0.6618601 ,  -5.76945669,  -0.22828046,
         -1.22715471,   1.15900413,   6.23542036,  -2.3080677 ,
          2.12410524,   6.97350114,  -2.03994218,   1.84398979,
          0.04185568,  -0.37596015,  -1.27988767,  -1.39528945],
       [  0.        ,  -1.38209025,  -4.2856943 ,  -3.38528387,
         -2.51232115,   6.44915053,   4.91302739, -11.25167354,
         -0.2054643 ,  -3.84210442,  -0.04642817,  -5.15071968,
          8.07429539,  -2.25639257,  -1.11588826,   2.68150977,
         -1.21657376,   0.31096353,   0.55280259,  -0.64612286,
          2.64145534,  -0.3531427 ,   2.368977  ,   2.21318565,
          5.34115537,   4.96337399,   2.0562435 ,  -0.5692651 ,
          3.15113207,  -1.74017287,  -3
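The raw coefficient array is hard to read without feature names. A minimal sketch of how the weights might be labeled follows; the `feature_names` list and `coef` matrix are hypothetical stand-ins with the same layout as `MultiTaskLasso.coef_` (one row per target, one column per feature), not values from the fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: three feature names and a coefficient matrix of
# shape (2 targets, 3 features), laid out like MultiTaskLasso.coef_.
feature_names = ["T1 SRS", "T2 SRS", "Round"]
coef = np.array([[4.0, -2.5, 0.0],
                 [-3.0, 5.0, 0.5]])

# Transpose so each row is a feature and each column is a target.
coef_table = pd.DataFrame(coef.T, index=feature_names,
                          columns=["T1 Score", "T2 Score"])

# Order features by their largest absolute weight on either target.
order = coef_table.abs().max(axis=1).sort_values(ascending=False).index
print(coef_table.loc[order])
```

In the notebook itself, `lasso_model.coef_` and `X.columns` would presumably take the place of `coef` and `feature_names`. Since the features pass through `StandardScaler`, the weights are on a comparable scale.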

The two arrays above raise an issue with using any linear regression method for multi-target regression. Because the model fits a separate set of coefficients for each target, a team will be predicted to score a different value simply by virtue of being listed as Team 1 rather than Team 2 in the table I created. I'm not sure this is something I want in my model.
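One possible way to reduce this side-dependence, which this notebook does not do, is to add a mirrored copy of every game with the Team 1 and Team 2 columns swapped, so that each team appears on both sides of the table. The sketch below assumes the `T1 ` / `T2 ` column prefixes used above; `mirror_games` is a hypothetical helper name.

```python
import pandas as pd


def mirror_games(df: pd.DataFrame) -> pd.DataFrame:
    """Return df plus a copy of every game with the T1/T2 columns swapped."""
    swap = {}
    for col in df.columns:
        if col.startswith("T1 "):
            swap[col] = "T2 " + col[3:]
        elif col.startswith("T2 "):
            swap[col] = "T1 " + col[3:]
    # rename applies the whole mapping at once, so the swap cannot collide;
    # indexing by df.columns restores the original column order.
    mirrored = df.rename(columns=swap)[df.columns]
    return pd.concat([df, mirrored], ignore_index=True)


# One toy game, using the seeds and scores of UConn (1) v. Alabama (4).
games = pd.DataFrame({"T1 Seed": [1], "T2 Seed": [4],
                      "T1 Score": [86], "T2 Score": [72]})
print(mirror_games(games))
```

Note that `ignore_index=True` discards the `Game` index, so the index would need to be rebuilt for the mirrored rows. Mirroring also has to happen after the train-test split, otherwise a game and its mirror could land on opposite sides of the split.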

What if I tried a Random Forest Regressor?



In [23]:
from sklearn.ensemble import RandomForestRegressor
rf_pipe = Pipeline([('preprocessing',preprocessing),('rf',RandomForestRegressor(n_estimators=100,random_state=1))])
rf_pipe.fit(X_train,y_train)

In [24]:
rf_predictions = rf_pipe.predict(X_test)
yt_rfcopy = y_test.copy()
yt_rfcopy['T1 Win'] = yt_rfcopy['T1 Score'] > yt_rfcopy['T2 Score']
yt_rfcopy['Prediction 1'] = [row[0] for row in rf_predictions]
yt_rfcopy['Prediction 2'] = [row[1] for row in rf_predictions]
yt_rfcopy['Predict T1 Win'] = yt_rfcopy['Prediction 1'] > yt_rfcopy['Prediction 2']
yt_rfcopy['Prediction Correct?'] = yt_rfcopy['Predict T1 Win'] == yt_rfcopy['T1 Win']
yt_rfcopy

Game,T1 Score,T2 Score,T1 Win,Prediction 1,Prediction 2,Predict T1 Win,Prediction Correct?
Gonzaga (5) v. Kansas (4),89.0,68.0,True,78.32,70.81,True,True
James Madison (12) v. Duke (4),55.0,93.0,False,69.87,72.66,False,True
Dayton (7) v. Nevada (10),63.0,60.0,True,68.97,70.83,False,False
Purdue (1) v. Gonzaga (5),80.0,68.0,True,85.82,65.47,True,True
Purdue (1) v. NC State (11),63.0,50.0,True,81.59,63.12,True,True
San Diego State (5) v. UAB (12),69.0,65.0,True,75.65,72.0,True,True
Texas (7) v. Colorado State (10),56.0,44.0,True,67.31,68.21,False,False
Houston (1) v. Duke (4),51.0,54.0,False,77.37,63.53,True,False
Colorado (10) v. Marquette (2),77.0,81.0,False,70.04,70.05,False,True
Duke (4) v. NC State (11),64.0,76.0,False,75.93,67.79,True,False


In [25]:
rf_model = rf_pipe.named_steps['rf']
rf_model.feature_importances_


array([0.        , 0.00737733, 0.06162588, 0.1074935 , 0.032775  ,
       0.04439269, 0.0854742 , 0.02568427, 0.01468159, 0.00791005,
       0.01937261, 0.01875131, 0.02730401, 0.02166131, 0.01376976,
       0.03424782, 0.03141211, 0.01594199, 0.04126105, 0.03322993,
       0.01221171, 0.10052826, 0.04371441, 0.02528439, 0.01131888,
       0.0114804 , 0.03479599, 0.04133326, 0.01109922, 0.02928155,
       0.01363706, 0.02094847])

Random Forest does slightly worse, but the lack of data should be noted. I think it's time to expand the data we are training on. We'll use code implemented earlier as functions to iterate on lots of Sports Reference pages.

## Web functions

In [77]:
def get_season_stats(url_name):  # only works with Sports Reference!!
  test = requests.get(url_name)
  test.raise_for_status()  # stop early on a failed request instead of parsing an error page

  html_content = test.text  # setting up the html
  soup = BeautifulSoup(html_content, 'html.parser')

  headers = soup.find_all('th')  # establishing headers
  columnheaders = []
  for i in headers:
   columnheaders.append(i.text)
  columnheaders = columnheaders[13:50]
  new_season_df = pd.DataFrame(columns = columnheaders)  # setting table based on webpage table columns

  rows = soup.find_all('tr')  # inputting the seasonal team data
  for row in rows:
    cells = row.find_all('td')
    if cells == []:
      continue
    columndata = [col.text.strip() for col in cells]
    if "NCAA" in columndata[0]:
      new_season_df.loc[len(new_season_df)]=columndata  # extracting everyone who made it into the NCAA tournament.
  new_season_df.drop(columns=['W','L'], inplace=True)  # standardizing data in ratios for model's sake
  columns_to_drop = new_season_df.columns[5:9]
  new_season_df.drop(columns=columns_to_drop,inplace=True)
  new_season_df.drop(columns=['MP','FG','FGA','FT','FTA','3P','3PA'],inplace=True)
  new_season_df = new_season_df.replace('', np.nan)
  new_season_df.iloc[:,1:] = new_season_df.iloc[:,1:].astype(float)
  column_ops = ['Tm.','Opp.','ORB','TRB','AST','STL','BLK','TOV','PF']
  for col in column_ops:
    new_season_df[col] = new_season_df[col].astype(float)
    new_season_df[col] = new_season_df[col]/(new_season_df['G'].astype(float))  # setting up per game ratios

  new_season_df.rename(columns={'Tm.':'PPG',
                               'Opp.':'PAPG',
                               'ORB':'ORBPG',
                               'TRB':'TRBPG',
                               'AST':'ASTPG',
                               'STL':'STLPG',
                               'BLK':'BLKPG',
                               'TOV':'TOVPG',
                               'PF':'PFPG'}, inplace=True)
  new_season_df.drop(columns=['G','W-L%'],inplace=True)

  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('NCAA$', '', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('Brigham Young$', 'BYU', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('Connecticut$', 'UConn', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace('North Carolina$', 'UNC', regex=True)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Mary's (CA)", "Saint Mary's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Peter's", "St. Peter's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Louisiana State", "LSU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Southern California", "USC", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Southern Methodist", "SMU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Mississippi", "Ole Miss", regex=False)  #how to adjust?
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Ole Miss State", "Mississippi State", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Ole Miss Valley State", "Mississippi Valley State", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Southern Ole Miss", "Southern Miss", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Pittsburgh", "Pitt", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Pennsylvania", "Penn", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Central Connecticut State", "Central Connecticut", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Florida International", "FIU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Nevada-Las Vegas","UNLV", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("UC Santa Barbara", "UCSB", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Virginia Commonwealth", "VCU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Saint Joseph's", "St. Joseph's", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Maryland-Baltimore County","UMBC", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Illinois-Chicago", "UIC", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Massachusetts", "UMass", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("East Tennessee State", "ETSU", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("IU Indy", "IUPUI", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("UC Irvine", "UC-Irvine", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("UC Davis", "UC-Davis", regex=False)
  new_season_df['School'] = new_season_df['School'].str.strip().str.replace("Long Island University", "LIU", regex=False)

  new_season_df['Year'] = url_name[49:53]
  for col in new_season_df.columns:
    if col == 'Year':
      new_season_df[col] = new_season_df[col].astype(int)
    else:
      try:
        new_season_df[col] = new_season_df[col].astype(float)
      except (ValueError, TypeError):  # non-numeric columns such as School are left as they are
        continue
  print(f"No errors with {url_name[49:53]}")
  return new_season_df

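The long chain of `.str.replace` calls above can also be driven by a single mapping. The sketch below is a pure-Python illustration, not the notebook's actual code: `NAME_FIXES` holds only a few of the pairs listed above, and `normalize_school` is a hypothetical helper that applies them in order as literal substring replacements, mirroring `regex=False`.

```python
# A few of the replacement pairs from the cell above (not the full list).
# normalize_school is a hypothetical helper, not part of the notebook.
NAME_FIXES = {
    "Pittsburgh": "Pitt",
    "Pennsylvania": "Penn",
    "Virginia Commonwealth": "VCU",
    "Massachusetts": "UMass",
    "UC Irvine": "UC-Irvine",
}

def normalize_school(name, fixes=NAME_FIXES):
    """Strip whitespace, then apply each literal substring replacement in order."""
    name = name.strip()
    for old, new in fixes.items():
        name = name.replace(old, new)
    return name

print(normalize_school("  Virginia Commonwealth "))  # VCU
print(normalize_school("Duke"))                      # Duke
```

Because these are substring replacements, a pair such as `"Massachusetts" -> "UMass"` also rewrites any longer name that contains it, so the order and wording of the pairs matter.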

Whew. Let's see if this works.

In [63]:
x = get_season_stats('https://www.sports-reference.com/cbb/seasons/men/2000-school-stats.html')


No errors with 2000

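The slice `url_name[49:53]` works because every season URL shares the same prefix length, but it would silently return the wrong characters if the URL pattern ever changed. A minimal sketch of a less position-dependent alternative, assuming the year is always the four digits that start the page's file name (`year_from_url` is a hypothetical helper, not part of the notebook):

```python
import re

def year_from_url(url_name):
    """Return the 4-digit year that starts the last path segment of the URL."""
    match = re.search(r"/(\d{4})-[^/]*$", url_name)
    if match is None:
        raise ValueError(f"no year found in {url_name!r}")
    return int(match.group(1))

season_url = "https://www.sports-reference.com/cbb/seasons/men/2000-school-stats.html"
bracket_url = "https://www.sports-reference.com/cbb/postseason/men/2024-ncaa.html"
print(year_from_url(season_url))   # 2000
print(year_from_url(bracket_url))  # 2024
```

The same helper covers both the season pages and the bracket pages, which otherwise need different slice offsets.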

Nice! Now it's time to iterate and retrieve seasonal data for every season from 2000 through 2024.

In [78]:
team_data = pd.DataFrame(columns=['School','SRS','SOS','PPG','PAPG','FG%','3P%','FT%','ORBPG','TRBPG','ASTPG','STLPG','BLKPG','TOVPG','PFPG','Year'])
for i in range(0,25):  # seasons 2000 through 2024
  # {i:02d} zero-pads single digits, so 0 becomes "00" and the URL reads 2000
  x = get_season_stats(f'https://www.sports-reference.com/cbb/seasons/men/20{i:02d}-school-stats.html')
  team_data = pd.concat([team_data,x])
team_data.head()

No errors with 2000



No errors with 2001
No errors with 2002
No errors with 2003
No errors with 2004
No errors with 2005
No errors with 2006
No errors with 2007
No errors with 2008
No errors with 2009
No errors with 2010
No errors with 2011
No errors with 2012
No errors with 2013
No errors with 2014
No errors with 2015
No errors with 2016
No errors with 2017
No errors with 2018
No errors with 2019
No errors with 2020
No errors with 2021
No errors with 2022
No errors with 2023
No errors with 2024


Unnamed: 0,School,SRS,SOS,PPG,PAPG,FG%,3P%,FT%,ORBPG,TRBPG,ASTPG,STLPG,BLKPG,TOVPG,PFPG,Year
0,Appalachian State,2.49,-3.75,79.0625,69.625,0.486,0.388,0.709,11.6875,35.96875,16.6875,10.15625,3.96875,15.8125,19.59375,2000
1,Arizona,18.96,9.7,76.441176,67.176471,0.457,0.322,0.73,12.117647,38.411765,15.794118,7.5,5.647059,15.0,14.588235,2000
2,Arkansas,12.37,7.76,74.382353,69.764706,0.429,0.351,0.603,13.264706,34.088235,13.470588,11.147059,3.529412,14.529412,20.735294,2000
3,Auburn,13.59,8.2,71.264706,64.352941,0.413,0.327,0.655,15.676471,39.264706,12.882353,7.264706,3.235294,13.235294,17.117647,2000
4,Ball State,7.84,3.17,74.193548,68.967742,0.453,0.4,0.612,12.83871,36.709677,13.806452,8.322581,4.225806,13.548387,17.096774,2000


In [65]:
team_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1598 entries, 0 to 67
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   School  1598 non-null   object 
 1   SRS     1598 non-null   float64
 2   SOS     1598 non-null   float64
 3   PPG     1598 non-null   float64
 4   PAPG    1598 non-null   float64
 5   FG%     1598 non-null   float64
 6   3P%     1598 non-null   float64
 7   FT%     1598 non-null   float64
 8   ORBPG   1595 non-null   float64
 9   TRBPG   1598 non-null   float64
 10  ASTPG   1598 non-null   float64
 11  STLPG   1598 non-null   float64
 12  BLKPG   1598 non-null   float64
 13  TOVPG   1597 non-null   float64
 14  PFPG    1598 non-null   float64
 15  Year    1598 non-null   object 
dtypes: float64(14), object(2)
memory usage: 276.8+ KB


In [66]:
team_data["Year"] = team_data["Year"].astype(int)
team_data["Year"].unique()

array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
       2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022,
       2023, 2024])

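Yay! There are a few null values (three in ORBPG and one in TOVPG), but nothing that can't be handled. 2020 is absent from the list of years; the tournament was canceled that season. Now to do the same thing for the game data.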

In [67]:
tourney_df = pd.DataFrame(columns=['T1 Seed','T1 Name','T1 Score','T2 Seed','T2 Name','T2 Score','Round','Year'])
def get_game_data(url_name):
  # Scrape one bracket page and append each game to the global tourney_df.
  response = requests.get(url_name)
  soup = BeautifulSoup(response.text, 'html.parser')
  comments = soup.find_all(string=lambda text: isinstance(text, Comment))
  i = 0  # games recorded so far in the current section of the bracket
  j = 1  # section of the bracket: 1-4 are the regions, 5 is the Final Four
  for comment in comments:
    if "game" in comment:  # the html has a "game" comment wherever a bracket game is placed.
      next_element = comment.find_next_sibling()
      try:
        seed1 = next_element.find('span').text.strip()
      except AttributeError:  # no seed to read, so stop parsing this page
        break
      name1 = next_element.find('a', href=True).text.strip()
      try:
        score1 = next_element.find_all('a', href=True)[1].text.strip()
      except IndexError:  # no score link for this game, so skip it
        continue
      third_element = next_element.find_next_sibling()
      seed2 = third_element.find('span').text.strip()
      name2 = third_element.find('a', href=True).text.strip()
      score2 = third_element.find_all('a', href=True)[1].text.strip()
      i += 1

      if j < 5:  # regional games: 8 + 4 + 2 + 1 per region
        if i <= 8:
          round_num = 1
        elif i <= 12:
          round_num = 2
        elif i <= 14:
          round_num = 3
        elif i == 15:
          round_num = 4
          j += 1  # move to the next quadrant
          i = 0
      elif j == 5:  # two national semifinals, then the championship
        if i <= 2:
          round_num = 5
        elif i == 3:
          round_num = 6

      # Append data to DataFrame
      tourney_df.loc[len(tourney_df)] = [seed1, name1, score1, seed2, name2, score2, round_num, url_name[52:56]]
      if round_num == 6:
        break
  print(f"No issues with {url_name[52:56]}")
  return tourney_df

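The round bookkeeping inside `get_game_data` relies on the page listing each region's 15 games in order (8, then 4, then 2, then 1), followed by the three Final Four games. The same mapping can be sketched as a standalone function that is easy to check in isolation. `round_for_game` is a hypothetical helper, and the ordering assumption is read off the counters in the code above rather than verified against the site.

```python
def round_for_game(section, position):
    """Map a game's place on the bracket page to its tournament round.

    section:  1-4 for the four regions, 5 for the Final Four block.
    position: 1-based index of the game within that section.
    """
    if position >= 1:
        if 1 <= section <= 4:
            # Each region holds 8 + 4 + 2 + 1 = 15 games.
            for round_num, last_position in enumerate((8, 12, 14, 15), start=1):
                if position <= last_position:
                    return round_num
        elif section == 5:
            # Two national semifinals, then the championship game.
            if position <= 2:
                return 5
            if position == 3:
                return 6
    raise ValueError(f"unexpected bracket position: section={section}, position={position}")

region_rounds = [round_for_game(1, p) for p in range(1, 16)]
print(region_rounds)
# [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4]
```

Across all five sections that is 4 × 15 + 3 = 63 games per tournament.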
In [68]:
import time
for i in range(0,25):  # tournaments 2000 through 2024
  if i == 20:
    continue  # no tournament was played in 2020
  game_data = get_game_data(f'https://www.sports-reference.com/cbb/postseason/men/20{i:02d}-ncaa.html')
  game_data.drop_duplicates(["T1 Name", "T2 Name", "Year"], ignore_index=True, inplace=True)
  time.sleep(15)  # to prevent a 429 response. don't want to overwhelm the website

No issues with 2000
No issues with 2001
No issues with 2002
No issues with 2003
No issues with 2004
No issues with 2005
No issues with 2006
No issues with 2007
No issues with 2008
No issues with 2009
No issues with 2010
No issues with 2011
No issues with 2012
No issues with 2013
No issues with 2014
No issues with 2015
No issues with 2016
No issues with 2017
No issues with 2018
No issues with 2019
No issues with 2021
No issues with 2022
No issues with 2023
No issues with 2024

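The fixed 15-second pause keeps the request rate low. If a 429 comes back anyway, one option is to wait and try again. The sketch below is an illustration under stated assumptions rather than the notebook's code: `fetch` stands in for any callable that returns an object with a `status_code` attribute (such as `requests.get`), and the retry count and delays are arbitrary choices.

```python
import time

def get_with_retry(fetch, url, retries=3, base_delay=15, sleep=time.sleep):
    """Call fetch(url); on HTTP 429, wait and try again with a doubled delay.

    fetch is any callable returning an object with a .status_code attribute,
    e.g. requests.get. sleep is injectable so the logic can be tested quickly.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt < retries:
            sleep(delay)
            delay *= 2
    return response  # still rate-limited after all retries; the caller decides what to do
```

In the loop above, `requests.get` would be passed as `fetch`; the scraping function itself would need a small change to accept the response, which is not shown here.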

In [69]:
game_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1510 entries, 0 to 1509
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   T1 Seed   1510 non-null   object
 1   T1 Name   1510 non-null   object
 2   T1 Score  1510 non-null   object
 3   T2 Seed   1510 non-null   object
 4   T2 Name   1510 non-null   object
 5   T2 Score  1510 non-null   object
 6   Round     1510 non-null   int64 
 7   Year      1510 non-null   object
dtypes: int64(1), object(7)
memory usage: 94.5+ KB


Looks like the function worked! One small problem: assuming 63 games a year for 24 years, there should be 1512 rows, and we're sitting at 1510.

In 2021, VCU forfeited a game due to COVID, which throws off the game counter and leaves a few 2021 games in the dataframe with the wrong round. I'll need to address that.

Sports-Reference does not yet have the final score of last year's UConn-Purdue championship game, so I'll enter that manually.

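To see which tournaments fall short of the expected 63 games, the rows can be tallied by year. A minimal sketch using `collections.Counter` on a plain list of years so that it runs on its own; with the notebook's data the input would be `game_data['Year']`. The sample input is made up purely for illustration, and `short_years` is a hypothetical helper.

```python
from collections import Counter

EXPECTED_GAMES = 63  # 32 + 16 + 8 + 4 + 2 + 1 games in a 64-team bracket

def short_years(years, expected=EXPECTED_GAMES):
    """Return {year: games_found} for every year with fewer games than expected."""
    counts = Counter(years)
    return {year: n for year, n in sorted(counts.items()) if n < expected}

# Made-up input: one complete year and one year missing a single game.
sample_years = ["2019"] * 63 + ["2021"] * 62
print(short_years(sample_years))  # {'2021': 62}
print(EXPECTED_GAMES * 24)        # 1512
```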
In [70]:
game_data.iloc[1312,6] = 2  # index of game w/ wrong round-- associating with proper round
game_data.iloc[1316,6] = 3
game_data.iloc[1318,6] = 4
game_data.iloc[1319,6] = 5
game_data.iloc[1321,6] = 6  # true championship game

In [71]:
game_data.loc[len(game_data)] = [1,'UConn',75,1,'Purdue',60,6,2024]  # adding 2k24 finals

Now to organize the game data a little bit more before merging.

In [72]:
game_data["Game"] = game_data["T1 Name"] + ' (' + game_data["T1 Seed"].astype(str) + ')' + ' v. ' + game_data["T2 Name"] + ' (' + game_data["T2 Seed"].astype(str) + ')' + ', ' + game_data["Year"].astype(str)
game_data.set_index('Game', inplace=True)  # so i can identify the matches.
team_data["Year"] = team_data["Year"].astype(int)
game_data["Year"] = game_data["Year"].astype(int)  # to ensure that columns we are merging on are identical.
game_data["T1 Score"] = game_data["T1 Score"].astype(int)
game_data["T2 Score"] = game_data["T2 Score"].astype(int)
game_data["T1 Seed"] = game_data["T1 Seed"].astype(int)
game_data["T2 Seed"] = game_data["T2 Seed"].astype(int)

In [73]:
game_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1511 entries, Duke (1) v. Lamar (16), 2000 to UConn (1) v. Purdue (1), 2024
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   T1 Seed   1511 non-null   int64 
 1   T1 Name   1511 non-null   object
 2   T1 Score  1511 non-null   int64 
 3   T2 Seed   1511 non-null   int64 
 4   T2 Name   1511 non-null   object
 5   T2 Score  1511 non-null   int64 
 6   Round     1511 non-null   int64 
 7   Year      1511 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 106.2+ KB


In [74]:
team_data['School'] = team_data['School'].str.strip()
game_data['T1 Name'] = game_data['T1 Name'].str.strip()
game_data['T2 Name'] = game_data['T2 Name'].str.strip()
# season stats to attach for each team; prefixed with T1/T2 after each merge
stat_cols = ['SRS','SOS','PPG','PAPG','FG%','3P%','FT%','ORBPG','TRBPG','ASTPG','STLPG','BLKPG','TOVPG','PFPG']
ncaa2k = pd.merge(game_data, team_data, left_on=['T1 Name','Year'], right_on=['School','Year'], how='left')
ncaa2k.rename(columns={col: f'T1 {col}' for col in stat_cols}, inplace=True)
ncaa2k = pd.merge(ncaa2k, team_data, left_on=['T2 Name','Year'], right_on=['School','Year'], how='left')
ncaa2k.rename(columns={col: f'T2 {col}' for col in stat_cols}, inplace=True)
ncaa2k = ncaa2k.drop(columns=['School_x','School_y'])  # the two merges each bring in a School column
ncaa2k.rename(columns={'Year_x':'Year'},inplace=True)

In [75]:
ncaa2k.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1511 entries, 0 to 1510
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   T1 Seed   1511 non-null   int64  
 1   T1 Name   1511 non-null   object 
 2   T1 Score  1511 non-null   int64  
 3   T2 Seed   1511 non-null   int64  
 4   T2 Name   1511 non-null   object 
 5   T2 Score  1511 non-null   int64  
 6   Round     1511 non-null   int64  
 7   Year      1511 non-null   int64  
 8   T1 SRS    1511 non-null   float64
 9   T1 SOS    1511 non-null   float64
 10  T1 PPG    1511 non-null   float64
 11  T1 PAPG   1511 non-null   float64
 12  T1 FG%    1511 non-null   float64
 13  T1 3P%    1511 non-null   float64
 14  T1 FT%    1511 non-null   float64
 15  T1 ORBPG  1510 non-null   float64
 16  T1 TRBPG  1511 non-null   float64
 17  T1 ASTPG  1511 non-null   float64
 18  T1 STLPG  1511 non-null   float64
 19  T1 BLKPG  1511 non-null   float64
 20  T1 TOVPG  1511 non-null   floa

## Tidying school names, pt. 2

In [76]:
unmatched_T1 = game_data[~game_data['T1 Name'].isin(team_data['School'])]
unmatched_T2 = game_data[~game_data['T2 Name'].isin(team_data['School'])]
print(unmatched_T1[['T1 Name']])
print(unmatched_T2[['T2 Name']])

Empty DataFrame
Columns: [T1 Name]
Index: []
                                     T2 Name
Game                                        
UNC (2) v. LIU (15), 2011                LIU
Michigan State (1) v. LIU (16), 2012     LIU

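The `isin` checks above compare names only, so a school that appears in `team_data` for some other season would still count as matched. A sketch of a stricter check that matches on name and year together, using the `indicator=True` option of `merge`; the two small frames are toy stand-ins with placeholder team names, not the scraped data.

```python
import pandas as pd

# Toy stand-ins for game_data and team_data (placeholder names, made-up rows).
games = pd.DataFrame({"T2 Name": ["Team A", "Team B", "Team B"],
                      "Year": [2011, 2011, 2012]})
teams = pd.DataFrame({"School": ["Team A", "Team B"],
                      "Year": [2011, 2012]})

# A left merge with indicator=True adds a "_merge" column recording whether
# each game row found a partner in teams for the same name and year.
checked = games.merge(teams, left_on=["T2 Name", "Year"],
                      right_on=["School", "Year"], how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", ["T2 Name", "Year"]]
print(unmatched)
```

Here "Team B" exists in `teams`, but only for 2012, so its 2011 game is flagged even though a names-only `isin` check would let it through.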

In [59]:
unmatched_school = team_data[~team_data['School'].isin(game_data['T2 Name'])]
unmatched_school

Unnamed: 0,School,SRS,SOS,PPG,PAPG,FG%,3P%,FT%,ORBPG,TRBPG,ASTPG,STLPG,BLKPG,TOVPG,PFPG,Year
48,Saint Joseph's,12.73,4.46,79.545455,71.272727,0.482,0.356,0.693,11.151515,37.606061,17.878788,6.181818,3.878788,14.515152,19.757576,2001
1,Alcorn State,-11.27,-14.17,79.774194,76.580645,0.465,0.378,0.644,14.612903,40.096774,16.0,8.322581,4.032258,17.354839,18.322581,2002
21,Illinois-Chicago,1.46,1.43,73.617647,70.588235,0.441,0.405,0.658,12.441176,34.705882,13.588235,5.735294,1.411765,12.794118,20.5,2002
16,East Tennessee State,-0.9,-2.94,82.032258,75.225806,0.465,0.352,0.731,14.096774,38.741935,14.451613,11.419355,5.032258,18.096774,19.032258,2003
22,IU Indy,-2.43,-1.8,71.264706,70.764706,0.449,0.343,0.7,11.882353,32.882353,13.441176,8.5,1.794118,16.088235,19.235294,2003
43,Saint Joseph's,14.61,3.71,70.366667,59.466667,0.443,0.363,0.686,11.3,35.333333,13.9,7.233333,5.533333,12.9,18.766667,2003
12,East Tennessee State,4.19,-3.15,79.333333,69.424242,0.459,0.347,0.689,12.454545,38.454545,15.30303,11.575758,5.818182,15.575758,16.878788,2004
19,Illinois-Chicago,6.0,-0.47,69.46875,63.0,0.451,0.385,0.636,13.21875,35.1875,14.625,7.71875,2.65625,12.65625,17.875,2004
43,Saint Joseph's,20.03,4.97,77.375,62.3125,0.475,0.404,0.702,9.625,32.59375,16.25,8.90625,3.46875,11.5625,17.0625,2004
1,Alabama A&M,-11.96,-12.86,69.65625,67.78125,0.401,0.302,0.613,13.21875,37.8125,13.3125,10.5625,3.875,15.25,19.34375,2005
