## This Is March
by Chancellor Tang

For this project, my goal was to see if it were possible to create a model that could predict the mens NCAA March Madness tournament. To do this, I retrieved 2 datasets. ncaam.csv, which has statistics for teams from tournament paticipants from the past five tournaments (2015-2019) With a column labeled "POSTSEASON", which stated how far each team went in the tournament. cbb21.csv hold team statistics from the 2021 season.

Link: https://www.kaggle.com/andrewsundberg/college-basketball-dataset?select=cbb.csv

In [185]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [186]:
ncaam = pd.read_csv('data/cbb.csv')

In [187]:
ncaam21 = pd.read_csv('data/test/cbb21t.csv')

In [188]:
ncaam22 = pd.read_csv('data/test/cbb22T.csv')

In [189]:
ncaam23 = pd.read_csv('data/test/cbb23t.csv')

In [190]:
ncaam24 = pd.read_csv('data/test/cbb24t.csv')

**Data Cleansing**

The first step I took to cleanse was to drop all the schools not in the tournament out of the 2021 DataFrame. To do this, I created a condition to filter out any team with less than a 17 seed because 17 seeds do not exist in this tournament (yet). To make it easier for a later process, I also rearranged the columns so that the numeric values were grouped together. From there, for each DataFrame. I filled all null values with 0 and changed the seed number to an int. After restructuring the data, I divided the past tournaments DataFrame into their respective years.

In [191]:
ncaam21 = ncaam21[ncaam21['SEED'] < 17] 

In [192]:
def power_conf(df):
    test_list = []
    for x in df.CONF:
        if x in ["B10", "B12", "SEC", "P12", "BE", "ACC"]:
            test_list.append(1)
        else:
            test_list.append(0)
    return test_list

In [193]:
new_columns = ['TEAM',
 'CONF',
 'POSTSEASON',
 'G',
 'W',
 "WIN_PER",
 'ADJOE',
 'ADJDE',
 'BARTHAG',
 'EFG_O',
 'EFG_D',
 'TOR',
 'TORD',
 'ORB',
 'DRB',
 'FTR',
 'FTRD',
 '2P_O',
 '2P_D',
 '3P_O',
 '3P_D',
 'ADJ_T',
 'WAB',
 'SEED',
 "POWER",
 'YEAR']

In [194]:
def format_ncaa_df(df, year = None):
    df = df.fillna(0)
    df.SEED = df.SEED.astype(int) 
    df["WIN_PER"] = df["W"]/df["G"]
    df["POWER"] = power_conf(df)
    df = df.reindex(columns=new_columns)
    if year is not None:
        df["YEAR"] = year
    return df

In [195]:
ncaam = format_ncaa_df(ncaam)

In [196]:
ncaam21 = format_ncaa_df(ncaam21, 2021)
ncaam22 = format_ncaa_df(ncaam22, 2022)
ncaam23 = format_ncaa_df(ncaam23, 2023)
ncaam24 = format_ncaa_df(ncaam24, 2024)

In [197]:
#ncaam21 = ncaam[ncaam['YEAR']==2021]
ncaam19 = ncaam[ncaam['YEAR']==2019]
ncaam18 = ncaam[ncaam['YEAR']==2018]
ncaam17 = ncaam[ncaam['YEAR']==2017]
ncaam16 = ncaam[ncaam['YEAR']==2016]
ncaam15 = ncaam[ncaam['YEAR']==2015]
ncaam14 = ncaam[ncaam['YEAR']==2014]
ncaam13 = ncaam[ncaam['YEAR']==2013]

In [198]:
ncaam13.head()

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,YEAR
6,Michigan,B10,2ND,38,30,0.789474,121.5,93.7,0.9522,54.6,...,22.7,53.4,47.6,37.9,32.6,64.8,6.2,4,1,2013
13,Louisville,BE,Champions,40,35,0.875,115.9,84.5,0.9743,50.6,...,34.9,50.8,43.4,33.3,31.8,67.1,9.0,1,1,2013
38,Ohio St.,B10,E8,37,29,0.783784,113.6,89.4,0.9406,50.6,...,29.5,49.4,43.6,35.6,32.4,65.3,7.2,2,1,2013
39,Duke,ACC,E8,36,30,0.833333,118.4,91.5,0.9507,53.9,...,32.7,50.8,46.2,39.9,29.0,67.8,7.5,2,1,2013
40,Marquette,BE,E8,35,26,0.742857,113.0,93.2,0.902,49.6,...,31.7,51.6,44.9,29.6,32.3,64.6,4.5,3,1,2013


**Manual Input**


One of the drawbacks from this dataframe is that it does not tell you which leg of the bracket each team came from. This causes an issue with setting up matchups. To regroup the data, I had to manually create lists for each leg (4) of each tournament (6).

In [199]:
import json

with open("data/regions.json", "r") as file:
    regions = json.load(file)

**Creating the Dataframes**

For creating each leg's dataframe, I created three functions. f(x) works like a case function. It looks up the value of the column "POSTSEASON" and returns a list of ones and zeros to act as dummy variables. The dummy(z) function user f(x) to create a dataframe of of all these dummy variables. Finally, create_df(x,y) creates a final dataframe where it merges (the region list with their year's respective nccam dataframe to import the data. This also uses the dummy(z) function to add on the dummy variables.

To create the dataframe, I used a for loop that creates the dataframes based on their region and their name.

In [200]:
def round_assign(x):
    if x['POSTSEASON'] == "R64" : return [1,0,0,0,0,0,0]
    elif x['POSTSEASON'] == "R32" : return [1,1,0,0,0,0,0]
    elif x['POSTSEASON'] == "S16" : return [1,1,1,0,0,0,0]
    elif x['POSTSEASON'] == 'E8': return [1,1,1,1,0,0,0]
    elif x['POSTSEASON'] == 'F4': return [1,1,1,1,1,0,0]
    elif x['POSTSEASON'] == '2ND' or x['POSTSEASON'] == 'C2': return [1,1,1,1,1,1,0]
    elif x['POSTSEASON'] == 'Champions': return [1,1,1,1,1,1,1]
    else: return [0,0,0,0,0,0,0]

In [201]:
def dummy(z):
    a = []
    for x in range(0,len(z)):
        a.append(round_assign(z.iloc[x]))
    a = pd.DataFrame(a)
    a = a.rename(columns={0: "R64", 1: "R32", 2: "S16",3: "E8", 4:"F4",5:"C2",6:"Champions"})
    return a

In [202]:
def region_df(x,y):
    v = pd.DataFrame(x)
    v = v.rename(columns={0: "TEAM"})
    v = v.merge(y, on = 'TEAM', how='left')
    v = v.join(dummy(v))
    return v

In [203]:
master_df = pd.DataFrame()

all_years = [13,14,15,16,17,18,19,21,22,23,24]
legs = ["east",'south', 'midwest', 'west']
for x in all_years:
    for y in legs:
        z = str( 2000 + x)
        globals()[y + '%s' % x + '%s' %"_df"] = region_df(regions[z][y],  globals()['ncaam' + '%s' % x])
        master_df = pd.concat([master_df,globals()[y + '%s' % x + '%s' %"_df"]], ignore_index=True)

In [204]:
master_df['games_won'] = master_df[['R64', 'R32', 'S16', 'E8', 'F4', 'C2', 'Champions']].sum(axis=1)

In [205]:
master_df.to_csv("master_ncaam.csv", index=False)

**Determining an outcome**

The main issue regarding the matchups was how to create a dataframe that can be trained to predict the outcome to a game. While prediciting likeliness of what round each team would reach is easier, this would not give us the same head-to-head chaos that March Madness embodies. The best way to predict the outcome of each induvidual game is to create a dataframe that shows head-to-head matchups. 

To do this, I created a function diff(df,u) that takes a region dataframe (df) and creates the head-to-head matchups. This function repeats for the number of matchups in the region for that round (u). The function take the first team in the dataframe and the last team in the dataframe, which are the 1 and 16 seed, and works their way in. So in the next matchup, it would be the 2 seed agains the second to last seed (15), and the for loop would keep going until the last matchup (8 vs 9) is created by moving its way from the outside in. For each match set by the for loop, both are run through the get_upset_differences(x,y), which subtracts the numeric statistics of y (the lower seed) by x (the higher seed) to create a difference. This difference is returned as  list of differences which is then appended to a its own list (listDF) in the diff(df,u) function. Once all the matchup differences are created, the function creates a dataframe using listDF and returns that dataframe. The predictive model will us this format of dataframe (the differences) as its training data.

In [206]:
df_headers = list(ncaam.columns)

In [207]:
train_years = [13,14,15,16,17,18,19,21,22,23]

In [305]:
df_past = []
for x in range(0,len(legs)):
    for y in train_years:
        df_past.append(globals()[legs[x] + '%s' % y + '%s' %"_df"])

In [306]:
def get_upset_differences(a,b):
    """
    Takes two rows from one of the region databases and finds the difference between the two columns 

    a: higher seed
    b: lower seed'

    output: list of db difference rows
    """
    
    listA = []
    for x in range(3,25):
        diff = a.iloc[x] - b.iloc[x]
        listA.append(diff)
    return listA

In [316]:
def get_target_variable(df, nxt_round_num, rnd_name):
    """
    This function used to get the target variable attributes from the correct column
    The list is reversed because we want to see if the lower seed wins

    df: dataframe used
    nxt_round_num: number of teams in the next round
    rnd_name: the column name of the dummy round variable of the next round

    output: list of target variable attributes (will be its own column)
    """
    
    num = int(len(df)/2)
    listB = list(df[rnd_name])
    listB.reverse()
    listB = listB[0:num]
    return listB

In [308]:
def create_training_record(df, matchup_num, reseed = True):
    """
    This function takes the dataframe, and creates a new dataset of differences that can be trained
    
    df: dataframe of the round being processed
    matchup_num: the number of matchups
    """
    
    listDF = []
    for y in range(0,matchup_num):
        test_upsetH = df.iloc[y]
        test_upsetL = df.iloc[-(y+1)]
        listDF.append(get_upset_differences(test_upsetH,test_upsetL))
    if reseed:
        return pd.DataFrame(listDF, columns = df_headers[3:25]).sort_values(by = ["SEED"])
    else:
        return pd.DataFrame(listDF, columns = df_headers[3:25]).sort_values(by = ["SEED"])

In [322]:
df_past64[1]

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,SEED,POWER,YEAR,R64,R32,S16,E8,F4,C2,Champions
0,Virginia,ACC,S16,37,30,0.810811,114.6,89.5,0.9449,50.8,...,1,1,2014,1,1,1,0,0,0,0
1,Villanova,BE,R32,34,29,0.852941,115.2,92.5,0.926,53.6,...,2,1,2014,1,1,0,0,0,0,0
2,Iowa St.,B12,S16,36,28,0.777778,118.6,98.8,0.8903,54.2,...,3,1,2014,1,1,1,0,0,0,0
3,Michigan St.,B10,E8,38,29,0.763158,119.6,95.3,0.9319,54.5,...,4,1,2014,1,1,1,1,0,0,0
4,Cincinnati,Amer,R64,34,27,0.794118,108.7,91.5,0.8783,47.7,...,5,0,2014,1,0,0,0,0,0,0
5,North Carolina,ACC,R32,34,24,0.705882,113.4,94.7,0.8883,49.9,...,6,1,2014,1,1,0,0,0,0,0
6,Connecticut,Amer,Champions,40,32,0.8,112.5,91.3,0.9171,51.5,...,7,0,2014,1,1,1,1,1,1,1
7,Memphis,Amer,R32,33,23,0.69697,112.6,97.0,0.8479,51.9,...,8,0,2014,1,1,0,0,0,0,0
8,George Washington,A10,R64,33,24,0.727273,110.2,96.8,0.8151,51.5,...,11,0,2014,1,0,0,0,0,0,0
9,Saint Joseph's,A10,R64,34,24,0.705882,113.8,97.5,0.8546,53.9,...,10,0,2014,1,0,0,0,0,0,0


In [321]:
creation(df_past64[1], 'R32')

  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)


(    G   W   WIN_PER  ADJOE  ADJDE  BARTHAG  EFG_O  EFG_D  TOR  TORD  ...  \
 0   6  12  0.230166   16.3  -12.5   0.5503    2.8   -3.3 -3.7   0.1  ...   
 1   0   9  0.264706   10.9  -11.4   0.4137    4.5   -2.8 -2.6   1.5  ...   
 2   4   6  0.090278   13.9   -3.7   0.3298    2.2    0.1 -6.4  -2.6  ...   
 3   3   4  0.048872    8.9  -10.7   0.3082    4.1   -3.6  4.0   0.9  ...   
 4   3   1 -0.044592   -3.2   -3.7   0.0142   -4.4   -1.4 -0.2   1.3  ...   
 5  -2   1  0.066993   -4.7    1.3  -0.0491    0.1    0.5  0.1   2.4  ...   
 6   6   8  0.094118   -1.3   -6.2   0.0625   -2.4   -3.2 -1.4   4.3  ...   
 7   0  -1 -0.030303    2.4    0.2   0.0328    0.4    1.2 -0.4   0.7  ...   
 
    FTRD  2P_O  2P_D  3P_O  3P_D  ADJ_T   WAB  SEED  POWER  TRAIN  
 0  -7.1   1.4  -4.6   4.3  -0.1   -6.3  15.9   -15      1      0  
 1   2.1   4.7  -7.4   2.9   4.3   -0.2  14.4   -13      1      0  
 2  -4.0   0.7  -1.8   3.2   2.6    4.6  10.4   -11      1      0  
 3   2.3   2.6  -6.1   4.5   0.5 

In [323]:
def creation(df, next_rounds):
    """
    This function takes the current round's dataframe and the next round string name to use the previous functions to build the training df

    df: current round's dataframe
    next_rounds
    """
    
    y = pd.DataFrame(columns = new_columns[3:25])
    asd = []
    nxt_round_num = int(len(df)/2)
    upset= get_target_variable(df,nxt_round_num, next_rounds)
    for x in range(0,nxt_round_num):
        h = df.iloc[x]
        l = df.iloc[(-x-1)]
        if upset[x] == 1:
            asd.append(l)
        if upset[x] == 0:
            asd.append(h)
    next_df = pd.DataFrame(asd)
    y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
    y["TRAIN"] = upset
    y.TRAIN = y.TRAIN.astype(int)
    return y, next_df

In [324]:
df_past64 = []
for x in range(0,len(legs)):
    for y in train_years:
        df_past64.append(globals()[legs[x] + '%s' % y + '%s' %"_df"])

In [325]:
def create_train(a, next_round):
    df = pd.DataFrame(columns = new_columns[3:25])
    df_next = []
    for x in a:
        train, next_df = creation(x,next_round)
        df = pd.concat([df,train], ignore_index = True)
        df_next.append(next_df)
    return df, df_next

In [364]:
round_name = ["R32", "S16", "E8", "F4"]
train_master = pd.DataFrame(columns = new_columns[3:25])
train_w1 = pd.DataFrame(columns = new_columns[3:25])
train_w2 = pd.DataFrame(columns = new_columns[3:25])

for x in range(0,len(round_name)):
    a = 2**(6-x)
    b = 2**(5-x)
    holder = create_train(globals()["df_past" + '%s' % a], round_name[x])
    globals()["train" + '%s' % a] = holder[0]
    globals()["df_past" + '%s' % b] = holder[1]
    train_master = pd.concat([train_master,holder[0]], ignore_index = True)
    print("end")
    if x <= 1:
        train_w1 = pd.concat([train_w1,holder[0]], ignore_index = True)
    else:
        train_w2 = pd.concat([train_w2,holder[0]], ignore_index = True)

  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nx

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat(

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nx

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nx

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat(

### Final 4 Training DF

fin4 = df of all teams in the final 4

In [369]:
df_pastF4 = []
num = len(train_years)

for p in range(num):
    df_list = []
    for i in range(4):
        df_list.append(df_past4[(i * num) + p])
    df = pd.concat(df_list)
    df_pastF4.append(df)

In [370]:
new_dfs = []
for x in df_past:
    y = x[x["F4"]==1]
    new_dfs.append(y)

In [371]:
fin4 = pd.concat(new_dfs)

In [372]:
for y in train_years:
    year = y + 2000
    globals()["fin4_" + '%s' % y] = fin4[fin4["YEAR"]== year]["TEAM"]

In [373]:
fin4_13 = fin4_13.sort_values(ascending = True)

fin4_14 = fin4_14.sort_values(ascending = True)
fin4_14 = fin4_14.reindex([6,7,1,0])

fin4_15 = fin4_15.sort_values(ascending = True)
fin4_15 = fin4_15.reset_index(drop = True)
fin4_15 = fin4_15.reindex([0,3,1,2])

fin4_16 = fin4_16.sort_values(ascending = True)
fin4_16 = fin4_16.reset_index(drop = True)
fin4_16 = fin4_16.reindex([0,3,1,2])
                          
fin4_17 = fin4_17.sort_values(ascending = True)
fin4_18 = fin4_18.sort_values(ascending = True)       
fin4_19 = fin4_19.sort_values(ascending = True)

fin4_21 = fin4_21.sort_values(ascending = True)
fin4_21 = fin4_21.reset_index(drop = True)
fin4_21 = fin4_21.reindex([0,3,1,2])

fin4_21 = fin4_21.sort_values(ascending = True)
fin4_21 = fin4_21.reset_index(drop = True)
fin4_21 = fin4_21.reindex([0,3,1,2])

fin4_22 = fin4_22.sort_values(ascending = True)
fin4_22 = fin4_22.reset_index(drop = True)
fin4_22 = fin4_22.reindex([0,3,1,2])

fin4_23 = fin4_23.sort_values(ascending = True)
fin4_23 = fin4_23.reset_index(drop = True)
fin4_23 = fin4_23.reindex([0,3,1,2])

In [374]:
df_past4 = []
for x in train_years:
    y = train_years.index(x)
    df = pd.DataFrame(globals()["fin4_" + '%s' % x ])
    year = 2000 + x
    df2 = df_pastF4[y]
    df = df.merge(df2, left_on='TEAM', right_on='TEAM')
    df_past4.append(df)

In [375]:
train4, df_past2 = create_train(df_past4, "C2")
train2, df_past1 = create_train(df_past2, "Champions")

  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True

In [378]:
train_ff = pd.DataFrame(columns = new_columns[3:25])


for x in [train4,train2]:
    train_master = pd.concat([train_master,x], ignore_index = True)
    train_ff = pd.concat([train_ff, x], ignore_index = True)

  train_ff = pd.concat([train_ff, x], ignore_index = True)


### Scaling

In [327]:
train_neg = train_master[train_master["SEED"]<=0]
train_pos = train_master[train_master["SEED"]>0]

In [328]:
train_pos = - train_pos
train_pos["TRAIN"] = train_pos["TRAIN"] + 1

In [329]:
train_master = pd.concat([train_pos, train_neg])

In [330]:
train_master = train_master.reset_index(drop = True)

from sklearn.preprocessing import StandardScaler  

scaler = StandardScaler()
scaler.fit(train_master.drop(columns = ["TRAIN"]))

def scale(df):
    m = pd.DataFrame(scaler.transform(df), columns = df.columns)
    return m

In [331]:
seed_cutoff_low = -7
seed_cutoff_high = -4
big_upset = train_master[train_master["SEED"] < seed_cutoff_low]
little_upset = train_master[train_master["SEED"] > seed_cutoff_high]
competative = train_master[(train_master["SEED"] <= seed_cutoff_high) & (train_master["SEED"] >= seed_cutoff_low)]

## Model Training

In [332]:
train_master

Unnamed: 0,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,EFG_D,TOR,TORD,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,TRAIN
0,6,0,-0.133333,8.0,6.8,0.0034,-0.8,-0.0,-2.4,-11.1,...,-14.2,-1.6,-3.2,0.5,4.1,-2.7,3.5,-8,1,0.0
1,2,1,-0.011583,12.1,-6.7,0.2720,-4.3,-0.0,-0.7,4.9,...,15.2,-6.7,-3.3,-0.6,3.0,-4.7,7.1,-8,1,0.0
2,3,-6,-0.346154,10.5,2.8,0.1373,-0.9,5.3,-2.8,-9.1,...,-14.5,-0.2,3.1,-1.0,6.0,-5.3,2.1,-3,1,1.0
3,0,11,0.379310,11.4,1.0,0.2182,5.1,3.9,-3.7,0.1,...,-16.4,7.6,6.0,-0.1,0.7,0.1,9.4,-8,0,1.0
4,-1,11,0.373992,6.9,-21.2,0.6295,3.5,-9.6,0.3,-3.0,...,-7.9,2.9,-11.2,2.9,-4.3,-0.6,16.1,-7,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,2,2,0.005263,11.6,6.7,0.0058,1.3,2.6,-4.1,-4.8,...,-14.9,1.5,2.0,0.5,4.0,-7.4,2.3,-1,0,0.0
596,0,1,0.027027,0.2,2.1,-0.0163,-2.0,1.7,-2.2,2.5,...,3.7,4.4,0.8,-7.8,2.5,-1.8,-1.3,-1,0,1.0
597,1,13,0.317139,2.2,-11.6,0.1015,4.7,-10.5,-2.2,-0.4,...,-6.5,4.3,-12.5,3.7,-4.4,3.1,6.1,-10,-1,0.0
598,-1,2,0.076102,8.2,4.7,0.0048,5.5,1.2,-2.8,-3.8,...,-10.7,8.6,1.5,-0.2,0.7,4.5,0.0,-2,-1,1.0


In [333]:
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [334]:
def formatStuff(df):
    y = df["TRAIN"]
    x = scale(df.drop(columns = ["TRAIN"]))
    return [x,y]

In [335]:
train_df = formatStuff(train_master)
train_big = formatStuff(big_upset)
train_little = formatStuff(little_upset)
train_comp = formatStuff(competative)

In [336]:
for p in [train_df, train_big, train_little, train_comp]:
    print(sum(p[1])/len(p[0]))
    print(len(p[0]), "\n")

0.31333333333333335
600 

0.1828793774319066
257 

0.40782122905027934
179 

0.4146341463414634
164 



In [337]:
#MLP Classifier
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

iterations = 1000
alpha = 3
 
mlp_big = MLPClassifier(max_iter= iterations, alpha=alpha, random_state = 69)
mlp_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(mlp_big.score(X,Y)))

mlp_little = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(mlp_little.score(x, y)))

mlp_comp = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(mlp_comp.score(train_x, train_y)))

Accuracy on training set: 0.864
Accuracy on training set: 0.966
Accuracy on training set: 0.945


In [338]:
#FOREST
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

est = 6
depth = 5
 
forest_big = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(
    forest_big.score(X, Y)))

forest_little = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(
    forest_little.score(x, y)))

forest_comp = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(forest_comp.score(train_x, train_y)))

Accuracy on training set: 0.911
Accuracy on training set: 0.922
Accuracy on training set: 0.872


In [339]:
#SVM
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

svc_big = SVC(random_state = 69, C = 1)
svc_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(svc_big.score(X, Y)))

svc_little = SVC(random_state = 69, C = 1)
svc_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(svc_little.score(x, y)))

svc_comp = SVC(random_state = 69, C = 1)
svc_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(svc_comp.score(train_x, train_y)))

Accuracy on training set: 0.829
Accuracy on training set: 0.905
Accuracy on training set: 0.841


In [340]:
#Log Regressor
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

clf_big =  LogisticRegression(random_state=69, C =1)
clf_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(clf_big.score(X, Y)))

clf_little = LogisticRegression(random_state=69, C = 5)
clf_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(clf_little.score(x, y)))

clf_comp = LogisticRegression(random_state=69, C = 20)
clf_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(clf_comp.score(train_x, train_y)))

Accuracy on training set: 0.848
Accuracy on training set: 0.849
Accuracy on training set: 0.732


In [341]:
#K-Nearest
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

neighbors = 5

knn_big = KNeighborsClassifier(n_neighbors=neighbors)
knn_big.fit(X, Y)
knn_bscore = knn_big.score(X, Y)

knn_little = KNeighborsClassifier(n_neighbors=neighbors)
knn_little.fit(x, y)
knn_lscore = knn_little.score(x, y)

knn_comp = KNeighborsClassifier(n_neighbors=neighbors)
knn_comp.fit(train_x, train_y)
knn_cscore = knn_comp.score(train_x, train_y)

knn_mean =  sum( [knn_bscore, knn_lscore, knn_cscore])/ len( [knn_bscore, knn_lscore, knn_cscore])

print("Accuracy on training set: {:.3f}".format(knn_bscore))
print("Accuracy on training set: {:.3f}".format(knn_lscore))
print("Accuracy on training set: {:.3f}".format(knn_cscore))

Accuracy on training set: 0.840
Accuracy on training set: 0.782
Accuracy on training set: 0.774


In [342]:
# Naive Bayes
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

gnb_big = GaussianNB()
gnb_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(gnb_big.score(X, Y)))

gnb_little = GaussianNB()
gnb_little.fit(x,y)
print("Accuracy on training set: {:.3f}".format(gnb_little.score(x, y)))

gnb_comp = GaussianNB()
gnb_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(gnb_comp.score(train_x, train_y)))

Accuracy on training set: 0.743
Accuracy on training set: 0.743
Accuracy on training set: 0.671


In [343]:
# Decision tree
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

tree_depth = 7

DT_big = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(DT_big.score(X, Y)))

DT_little = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(DT_little.score(x,y)))

DT_comp = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(DT_comp.score(train_x, train_y)))

Accuracy on training set: 0.922
Accuracy on training set: 0.894
Accuracy on training set: 0.848


## Model Training

In [344]:
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [345]:
def formatStuff(df):
    y = df["TRAIN"]
    x = scale(df.drop(columns = ["TRAIN"]))
    return [x,y]

In [346]:
train_df = formatStuff(train_master)
train_big = formatStuff(big_upset)
train_little = formatStuff(little_upset)
train_comp = formatStuff(competative)

In [347]:
for p in [train_df, train_big, train_little, train_comp]:
    print(sum(p[1])/len(p[0]))
    print(len(p[0]), "\n")

0.31333333333333335
600 

0.1828793774319066
257 

0.40782122905027934
179 

0.4146341463414634
164 



In [348]:
#MLP Classifier
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

iterations = 1000
alpha = 3
 
mlp_big = MLPClassifier(max_iter= iterations, alpha=alpha, random_state = 69)
mlp_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(mlp_big.score(X,Y)))

mlp_little = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(mlp_little.score(x, y)))

mlp_comp = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(mlp_comp.score(train_x, train_y)))

Accuracy on training set: 0.864
Accuracy on training set: 0.966
Accuracy on training set: 0.945


In [349]:
#FOREST
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

est = 6
depth = 5
 
forest_big = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(
    forest_big.score(X, Y)))

forest_little = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(
    forest_little.score(x, y)))

forest_comp = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(forest_comp.score(train_x, train_y)))

Accuracy on training set: 0.911
Accuracy on training set: 0.922
Accuracy on training set: 0.872


In [350]:
#SVM
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

svc_big = SVC(random_state = 69, C = 1)
svc_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(svc_big.score(X, Y)))

svc_little = SVC(random_state = 69, C = 1)
svc_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(svc_little.score(x, y)))

svc_comp = SVC(random_state = 69, C = 1)
svc_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(svc_comp.score(train_x, train_y)))

Accuracy on training set: 0.829
Accuracy on training set: 0.905
Accuracy on training set: 0.841


In [351]:
#Log Regressor
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

clf_big =  LogisticRegression(random_state=69, C =1)
clf_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(clf_big.score(X, Y)))

clf_little = LogisticRegression(random_state=69, C = 5)
clf_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(clf_little.score(x, y)))

clf_comp = LogisticRegression(random_state=69, C = 20)
clf_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(clf_comp.score(train_x, train_y)))

Accuracy on training set: 0.848
Accuracy on training set: 0.849
Accuracy on training set: 0.732


In [352]:
#K-Nearest
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

neighbors = 5

knn_big = KNeighborsClassifier(n_neighbors=neighbors)
knn_big.fit(X, Y)
knn_bscore = knn_big.score(X, Y)

knn_little = KNeighborsClassifier(n_neighbors=neighbors)
knn_little.fit(x, y)
knn_lscore = knn_little.score(x, y)

knn_comp = KNeighborsClassifier(n_neighbors=neighbors)
knn_comp.fit(train_x, train_y)
knn_cscore = knn_comp.score(train_x, train_y)

knn_mean =  sum( [knn_bscore, knn_lscore, knn_cscore])/ len( [knn_bscore, knn_lscore, knn_cscore])

print("Accuracy on training set: {:.3f}".format(knn_bscore))
print("Accuracy on training set: {:.3f}".format(knn_lscore))
print("Accuracy on training set: {:.3f}".format(knn_cscore))

Accuracy on training set: 0.840
Accuracy on training set: 0.782
Accuracy on training set: 0.774


In [353]:
# Naive Bayes
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

gnb_big = GaussianNB()
gnb_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(gnb_big.score(X, Y)))

gnb_little = GaussianNB()
gnb_little.fit(x,y)
print("Accuracy on training set: {:.3f}".format(gnb_little.score(x, y)))

gnb_comp = GaussianNB()
gnb_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(gnb_comp.score(train_x, train_y)))

Accuracy on training set: 0.743
Accuracy on training set: 0.743
Accuracy on training set: 0.671


In [354]:
# Decision tree
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

tree_depth = 7

DT_big = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(DT_big.score(X, Y)))

DT_little = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(DT_little.score(x,y)))

DT_comp = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(DT_comp.score(train_x, train_y)))

Accuracy on training set: 0.922
Accuracy on training set: 0.894
Accuracy on training set: 0.848


**Running the Predictions**

To do this, I create dataframes for each round of the tournament. From there, I created three appendable variables: test_df as an x_test, y_pred as a y_test, and matchup list to keep track of which teams faced off as a visual check. The for loop takes all four regions and sets it equal to the round64 variable. Using the rounds list, which had all the dataframes up until the final four, the for loop would take each round and simulate each matchup by using the lists function to create those differences that were used in the x_train dataframe. Using the result from the lists function (holder), I used the predict function for each ML library to take the values and see if the game was an upset or not. For each game, it would print the matchup, and once the prediction was made, it would print the winner and append the winner to the wins list. Using the wins list, a dataframe would be created (rounds[x+1]). After creating the datframe, it would append the prediction to y_pred and the matchup to matchup_list. After all the rounds how gone, each dataframe created for the round is appended with data from the rounds[x+1].

For each predictive model, it has to be input in the variable q because I have yet to create a loop that can run all these libraries concurrently. "q" is the type of model the prediction runs (SVC, MLP, regressor).

In [355]:
r64 = pd.DataFrame(columns = df_headers)
r32 = pd.DataFrame(columns = df_headers)
s16 = pd.DataFrame(columns = df_headers)
e8 = pd.DataFrame(columns = df_headers)
f4 = pd.DataFrame(columns = df_headers)
c2= pd.DataFrame(columns = df_headers)
winner= pd.DataFrame(columns = df_headers)

In [356]:
#FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
b = "knn"
l = "gnb"
c = "clf"

big = globals()[b + '%s' % "_big"]
little = globals()[l + '%s' % "_little"]
comp= globals()[c + '%s' % "_comp"]

In [357]:
test_df = pd.DataFrame(columns = df_headers[3:25])
y_pred = []
matchup_list = []
test_regions = [east24_df, midwest24_df, south24_df, west24_df]
#print('\033[1m' + str(learn) + '\033[0m' + "\n")
for x in test_regions: 
    r64 = pd.concat([r64,x], ignore_index = True)
    round64 = x
    rounds = [round64, r32, s16, e8, f4]
    for r in range (0,len(rounds)-1):
        wins = []
        y = len(rounds[r])/2
        y = int(y)
        for x in range(0, y):
            h = rounds[r].iloc[x]
            l = rounds[r].iloc[-x-1]
            holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
            if holder.iloc[0]["SEED"] > 0: 
                holder = -holder
            scaled = scale(holder)
            if holder.iloc[0]["SEED"] < seed_cutoff_high:
                ups = big.predict(scaled)
                #print(big.predict_proba(scaled))
            elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
                ups = little.predict(scaled)
                #print(little.predict_proba(scaled))
            else:
                ups = comp.predict(scaled)
                #print(comp.predict_proba(scaled))
            ups = list(ups)[0]
            print(h['SEED'], h['TEAM'], ' vs. ', l['SEED'], l['TEAM'])
            if ups == 0: 
                wins.append(h)
                print("Winner:", h['SEED'], h['TEAM'])
            if ups == 1:
                wins.append(l)
                print("Winner:", l['SEED'], l['TEAM'])
            test_df = pd.concat([test_df,holder], ignore_index = True)
            matchup = h['TEAM'] + ' vs. ' + l['TEAM']
            matchup_list.append(matchup)
            y_pred.append(ups)
        rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
    print("_" * 40)

    r32 = pd.concat([r32,rounds[1]], ignore_index = True)
    s16 = pd.concat([s16,rounds[2]], ignore_index = True)
    e8 = pd.concat([e8,rounds[3]], ignore_index = True)
    f4 = pd.concat([f4,rounds[4]], ignore_index = True)

  r64 = pd.concat([r64,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index = True)


1 Connecticut  vs.  16 Stetson
Winner: 1 Connecticut
2 Iowa St.  vs.  15 North Dakota St.
Winner: 2 Iowa St.
3 Illinois  vs.  14 Morehead St.
Winner: 3 Illinois
4 Auburn  vs.  13 Yale
Winner: 4 Auburn
5 San Diego St.  vs.  12 UAB
Winner: 5 San Diego St.
6 BYU  vs.  11 Duquesne
Winner: 6 BYU
7 Washington St.  vs.  10 Drake
Winner: 10 Drake
8 Florida Atlantic  vs.  9 Northwestern
Winner: 9 Northwestern
1 Connecticut  vs.  9 Northwestern
Winner: 1 Connecticut
2 Iowa St.  vs.  10 Drake
Winner: 2 Iowa St.
3 Illinois  vs.  6 BYU
Winner: 6 BYU
4 Auburn  vs.  5 San Diego St.
Winner: 4 Auburn
1 Connecticut  vs.  4 Auburn
Winner: 1 Connecticut
2 Iowa St.  vs.  6 BYU
Winner: 2 Iowa St.
1 Connecticut  vs.  2 Iowa St.
Winner: 1 Connecticut
________________________________________
1 Purdue  vs.  16 Grambling St.
Winner: 1 Purdue
2 Tennessee  vs.  15 Saint Peter's
Winner: 2 Tennessee
3 Creighton  vs.  14 Akron
Winner: 3 Creighton
4 Kansas  vs.  13 Samford
Winner: 13 Samford
5 Gonzaga  vs.  12 McNeese

  r32 = pd.concat([r32,rounds[1]], ignore_index = True)
  s16 = pd.concat([s16,rounds[2]], ignore_index = True)
  e8 = pd.concat([e8,rounds[3]], ignore_index = True)
  f4 = pd.concat([f4,rounds[4]], ignore_index = True)


1 Purdue  vs.  9 TCU
Winner: 1 Purdue
2 Tennessee  vs.  7 Texas
Winner: 2 Tennessee
3 Creighton  vs.  11 Oregon
Winner: 3 Creighton
13 Samford  vs.  5 Gonzaga
Winner: 5 Gonzaga
1 Purdue  vs.  5 Gonzaga
Winner: 1 Purdue
2 Tennessee  vs.  3 Creighton
Winner: 3 Creighton
1 Purdue  vs.  3 Creighton
Winner: 1 Purdue
________________________________________
1 Houston  vs.  16 Longwood
Winner: 1 Houston
2 Marquette  vs.  15 Western Kentucky
Winner: 2 Marquette
3 Kentucky  vs.  14 Oakland
Winner: 3 Kentucky
4 Duke  vs.  13 Vermont
Winner: 4 Duke
5 Wisconsin  vs.  12 James Madison
Winner: 12 James Madison
6 Texas Tech  vs.  11 North Carolina St.
Winner: 6 Texas Tech
7 Florida  vs.  10 Colorado
Winner: 10 Colorado
8 Nebraska  vs.  9 Texas A&M
Winner: 8 Nebraska
1 Houston  vs.  8 Nebraska
Winner: 1 Houston
2 Marquette  vs.  10 Colorado
Winner: 2 Marquette
3 Kentucky  vs.  6 Texas Tech
Winner: 3 Kentucky
4 Duke  vs.  12 James Madison
Winner: 12 James Madison
1 Houston  vs.  12 James Madison
Winner

## FINAL 4 SIM

In [358]:
final_four = [f4, c2, winner]
for r in range (0,len(final_four)-1):
    wins = []
    y = len(final_four[r])/2
    y = int(y)
    for x in range(0, y):
        h = final_four[r].iloc[x]
        l = final_four[r].iloc[-x-1]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
        
        scaled = scale(holder)
        if holder.iloc[0]["SEED"] < seed_cutoff_high:
            ups = big.predict(scaled)
        elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
            ups = little.predict(scaled)
        else:
            ups = comp.predict(scaled)
        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            wins.append(h)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            wins.append(l)
            print("Winner:", l['SEED'], l['TEAM'])
        test_df = pd.concat([test_df,holder], ignore_index = True)
        matchup_list.append(matchup)
        y_pred.append(ups)
    final_four[r+1] = pd.DataFrame(data = wins, columns = df_headers)
df_c2 = final_four[1]
df_winner = final_four[2]

1 Connecticut vs. 2 Arizona
Winner: 1 Connecticut
1 Purdue vs. 1 Houston
Winner: 1 Houston
1 Connecticut vs. 1 Houston
Winner: 1 Houston


In [359]:
test_df['UPSET'] = y_pred

In [360]:
test_df["MATCHUP"] = matchup_list

In [361]:
df_winner

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,YEAR
2,Houston,B12,,31,28,0.903226,121.5,87.8,0.9768,50.1,...,39.6,48.8,44.0,34.8,30.2,63.7,9.6,1,1,2024


## Model Training

In [362]:
def formatStuff(df):
    y = df["TRAIN"]
    x = scale(df.drop(columns = ["TRAIN"]))
    return [x,y]

In [379]:
train_r1 = formatStuff(train_w1)
train_r2 = formatStuff(train_w2)
train_r3 = formatStuff(train_ff)

In [380]:
for p in [train_r1,train_r2,train_r3]:
    print(sum(p[1])/len(p[0]))
    print(len(p[0]), "\n")

0.30833333333333335
480 

0.38333333333333336
120 

0.3333333333333333
30 



In [381]:
# Training DFs
x1 = train_r1[0]
y1 = list(train_r1[1])

x2 = train_r2[0]
y2 = list(train_r2[1])

x3 = train_r3[0]
y3 = list(train_r3[1])

In [382]:
#MLP Classifier

iterations = 1000
alpha = 4
 
mlp1 = MLPClassifier(max_iter= iterations, alpha=alpha, random_state = 69)
mlp1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(mlp1.score(x1,y1)))

mlp2 = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 52)
mlp2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(mlp2.score(x2, y2)))

mlp3 = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(mlp3.score(x3, y3)))

Accuracy on training set: 0.783
Accuracy on training set: 0.933
Accuracy on training set: 0.900


In [383]:
#FOREST
est = 6
depth = 6
 
forest1 = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(
    forest1.score(x1, y1)))

forest2 = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(
    forest2.score(x2, y2)))

forest3 = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(forest3.score(x3, y3)))

Accuracy on training set: 0.885
Accuracy on training set: 0.925
Accuracy on training set: 0.967


In [384]:
#SVM

svc1 = SVC(random_state = 69, C = 1)
svc1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(svc1.score(x1, y1)))

svc2 = SVC(random_state = 69, C = 1)
svc2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(svc2.score(x2, y2)))

svc3 = SVC(random_state = 69, C = 1)
svc3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(svc3.score(x3, y3)))

Accuracy on training set: 0.823
Accuracy on training set: 0.858
Accuracy on training set: 0.800


In [385]:
#Log Regressor

clf1 =  LogisticRegression(random_state=69, C =5)
clf1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(clf1.score(x1, y1)))

clf2 = LogisticRegression(random_state=69, C = 5)
clf2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(clf2.score(x2, y2)))

clf3 = LogisticRegression(random_state=69, C = 3)
clf3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(clf3.score(x3, y3)))

Accuracy on training set: 0.783
Accuracy on training set: 0.867
Accuracy on training set: 0.900


In [386]:
#K-Nearest
neighbors = 5

knn1 = KNeighborsClassifier(n_neighbors=neighbors)
knn1.fit(x1, y1)
knn1_score = knn_big.score(x1, y1)

knn2 = KNeighborsClassifier(n_neighbors=neighbors)
knn2.fit(x2, y2)
knn2_score = knn_little.score(x2, y2)

knn3 = KNeighborsClassifier(n_neighbors=neighbors)
knn3.fit(x3, y3)
knn3_score = knn_comp.score(x3, y3)

knn_mean =  sum( [knn1_score, knn2_score, knn3_score])/ len( [knn1_score, knn2_score, knn3_score])

print("Accuracy on training set: {:.3f}".format(knn1_score))
print("Accuracy on training set: {:.3f}".format(knn2_score))
print("Accuracy on training set: {:.3f}".format(knn3_score))

Accuracy on training set: 0.692
Accuracy on training set: 0.733
Accuracy on training set: 0.633


In [387]:
# Naive Bayes

gnb1 = GaussianNB()
gnb1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(gnb1.score(x1, y1)))

gnb2 = GaussianNB()
gnb2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(gnb2.score(x2, y2)))

gnb3 = GaussianNB()
gnb3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(gnb3.score(x3, y3)))

Accuracy on training set: 0.706
Accuracy on training set: 0.742
Accuracy on training set: 0.867


In [388]:
# Decision tree

tree_depth = 7

DT1 = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(DT1.score(x1, y1)))

DT2 = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(DT2.score(x2, y2)))

DT3 = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(DT3.score(x3, y3)))

Accuracy on training set: 0.854
Accuracy on training set: 0.900
Accuracy on training set: 0.867


# Simulation: Round Group Simulation

In [None]:
r64_r = pd.DataFrame(columns = df_headers)
r32_r = pd.DataFrame(columns = df_headers)
s16_r = pd.DataFrame(columns = df_headers)
e8_r = pd.DataFrame(columns = df_headers)
f4_r = pd.DataFrame(columns = df_headers)
c2_r= pd.DataFrame(columns = df_headers)
winner_r= pd.DataFrame(columns = df_headers)

In [None]:
#FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
b = "clf"
l = 'clf'
c = "clf"

w1 = globals()[b + '%s' % "1"]
w2 = globals()[l + '%s' % "2"]
w3 = globals()[c + '%s' % "3"]

In [None]:
test_df = pd.DataFrame(columns = df_headers[3:25])
y_pred = []
matchup_list = []
test_regions = [east24_df, midwest24_df, south24_df, west24_df]
#print('\033[1m' + str(learn) + '\033[0m' + "\n")
for x in test_regions: 
    r64_r = r64_r.append(x)
    round64 = x
    rounds = [round64, r32_r, s16_r, e8_r, f4_r]
    for r in range (0,len(rounds)-1):
        wins = []
        y = len(rounds[r])/2
        y = int(y)
        for x in range(0, y):
            h = rounds[r].iloc[x]
            l = rounds[r].iloc[-x-1]
            holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
            if holder.iloc[0]["SEED"] > 0: 
                holder = -holder
            scaled = scale(holder)
            if r < 2:
                ups = w1.predict(scaled)
            else:
                ups = w2.predict(scaled)
            ups = list(ups)[0]
            print(h['SEED'], h['TEAM'], ' vs. ', l['SEED'], l['TEAM'])
            if ups == 0: 
                wins.append(h)
                print("Winner:", h['SEED'], h['TEAM'])
            if ups == 1:
                wins.append(l)
                print("Winner:", l['SEED'], l['TEAM'])
            test_df = test_df.append(holder)
            matchup = h['TEAM'] + ' vs. ' + l['TEAM']
            matchup_list.append(matchup)
            y_pred.append(ups)
        rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
    print("_" * 40)

    r32_r = r32_r.append(rounds[1])
    s16_r = s16_r.append(rounds[2])
    e8_r = e8_r.append(rounds[3])
    f4_r = f4_r.append(rounds[4])

## FINAL 4 SIM

In [None]:
final_four = [f4_r, c2_r, winner_r]
for r in range (0,len(final_four)-1):
    wins = []
    y = len(final_four[r])/2
    y = int(y)
    for x in range(0, y):
        h = final_four[r].iloc[x]
        l = final_four[r].iloc[-x-1]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
        scaled = scale(holder)

        ups = w3.predict(scaled)

        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            wins.append(h)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            wins.append(l)
            print("Winner:", l['SEED'], l['TEAM'])
        test_df = test_df.append(holder)
        matchup_list.append(matchup)
        y_pred.append(ups)
    final_four[r+1] = pd.DataFrame(data = wins, columns = df_headers)
df_c2 = final_four[1]
df_winner = final_four[2]

In [None]:
test_df['UPSET'] = y_pred

In [None]:
test_df["MATCHUP"] = matchup_list

In [None]:
df_winner

## Single Matchup Test

In [None]:
single = region_df(["Iowa St.", "Michigan St."],ncaam24)

In [None]:
learners = ["knn", "DT", "forest", "mlp", "clf", "gnb", "svc"]

In [None]:
single

In [None]:
for j in learners:
    h = single.iloc[0]
    l = single.iloc[1]
    holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
    scaled = scale(holder)
    
    pred = globals()[j + '%s' % "3"] #1
    ups = pred.predict(scaled)

    ups = list(ups)[0]
    matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
    print(matchup)
    if ups == 0: 
        wins.append(h)
        print("Winner:", h['SEED'], h['TEAM'])
    if ups == 1:
        wins.append(l)
        print("Winner:", l['SEED'], l['TEAM'])
    print("_"*40)

In [None]:
for j in learners:
    h = single.iloc[1]
    l = single.iloc[0]
    holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
    scaled = scale(holder)
    
    pred = globals()[j + '%s' % "_comp"] #1
    ups = pred.predict(scaled)

    ups = list(ups)[0]
    matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
    print(matchup)
    if ups == 0: 
        wins.append(h)
        print("Winner:", h['SEED'], h['TEAM'])
    if ups == 1:
        wins.append(l)
        print("Winner:", l['SEED'], l['TEAM'])
    print("_"*40)

### Conference Tournament

In [None]:
def bye_split(team_cnt):
    x = 0
    while 2**x < team_cnt:
        binary_rnd = 2**x 
        x += 1
    
    second_round = binary_rnd + (binary_rnd - team_cnt)
    
    return second_round

In [None]:
def first_rnd_bye(df):
    teams = bye_split(len(df))
    bye_teams = df.iloc[:teams]
    first_round = df.iloc[teams:]
    
    return bye_teams, first_round
    

In [None]:
big_east = ['Connecticut',
 'Creighton',
 'Marquette',
 'Seton Hall',
 "St. John's",
 'Villanova',
 'Providence',
 'Butler',
 'Xavier',
 'Georgetown',
 'DePaul']

In [None]:
def region_df(x,y):
    v = pd.DataFrame(x)
    v = v.rename(columns={0: "TEAM"})
    v = v.merge(y, on = 'TEAM', how='left')
    return v

In [None]:
be_tourn = region_df(big_east, ncaam24)

In [None]:
be_tourn.SEED = be_tourn.index + 1

In [None]:
r2, r1 = first_rnd_bye(be_tourn)

In [None]:
if len(r1) > 0:
    for x in range(int(len(be24[1])/2)):
        h = r1.iloc[x]
        l = r1.iloc[(-x-1)]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
        scaled = scale(holder)
        

        
        pred = forest_comp
        ups = pred.predict(scaled)

        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            r2 = r2.append(h, ignore_index=True)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            r2 = r2.append(l, ignore_index=True)
            print("Winner:", l['SEED'], l['TEAM'])
        print("_"*40)
  

In [None]:
r = 2
while len(globals()["r" + '%s' % str(r)]) != 1:
    curr_round = globals()["r" + '%s' % str(r)]
    
    globals()["r" + '%s' % str((r+1))] = pd.DataFrame(columns = df_headers)
    for x in range(int(len(curr_round)/2)):
        h = curr_round.iloc[x]
        l = curr_round.iloc[(-x-1)]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
        scaled = scale(holder)
        
        if holder.iloc[0]["SEED"] < seed_cutoff_high:
            pred = mlp_big
            #print(big.predict_proba(scaled))
        elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
            pred = mlp_little
            #print(little.predict_proba(scaled))
        else:
            pred = mlp_comp
            #print(comp.predict_proba(scaled))
        
        ups = pred.predict(scaled)

        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            globals()["r" + '%s' % str((r+1))] = globals()["r" + '%s' % str((r+1))].append(h, ignore_index=True)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            globals()["r" + '%s' % str((r+1))] = globals()["r" + '%s' % str((r+1))].append(l, ignore_index=True)
            print("Winner:", l['SEED'], l['TEAM'])
        print("_"*40)
    
    
    r += 1

In [None]:
learners = ["knn", "DT", "forest", "mlp", "clf", "gnb", "svc"]

### COEF Importance

In [None]:
#Lasso

X = train_big[0]
Y = train_big[1]

x = train_little[0]
y = train_little[1]

train_x = train_comp[0]
train_y = train_comp[1]

from sklearn.linear_model import Lasso
lasso_big = Lasso(alpha=0.005)
lasso_big.fit(X,Y)

lasso_little = Lasso(alpha=0.005)
lasso_little.fit(x,y)

lasso_comp = Lasso(alpha=0.005)
lasso_comp.fit(train_x,train_y)

pd.DataFrame([lasso_big.coef_, lasso_little.coef_, lasso_comp.coef_], columns  = new_columns[3:25], index=["big_upset", "little_upset", "competative"])

### Simulated Final Fours

In [None]:
scenarios = ''

In [None]:
all_scen = []
learning = []
upset_count = []
winners= []

In [None]:
for f in range(7):
    for g in range(7):

        r64 = pd.DataFrame(columns = df_headers)
        r32 = pd.DataFrame(columns = df_headers)
        s16 = pd.DataFrame(columns = df_headers)
        e8 = pd.DataFrame(columns = df_headers)
        f4 = pd.DataFrame(columns = df_headers)
        c2= pd.DataFrame(columns = df_headers)
        winner= pd.DataFrame(columns = df_headers)

        #FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
        learners = ["knn", "DT", "forest", "mlp", "clf", "gnb", "svc"]
        b = learners[f]
        l = learners[g]

        w1 = globals()[b + '%s' % "1"]
        w2 = globals()[l + '%s' % "2"]

        test_df = pd.DataFrame(columns = df_headers[3:25])
        y_pred = []
        matchup_list = []
        test_regions = [east24_df, midwest24_df, south24_df, west24_df]

        for x in test_regions: 
            r64 = r64.append(x)
            round64 = x
            rounds = [round64, r32, s16, e8, f4]
            for r in range (0,len(rounds)-1):
                wins = []
                y = len(rounds[r])/2
                y = int(y)
                for x in range(0, y):
                    base_count = 0

                    h = rounds[r].iloc[x]
                    l = rounds[r].iloc[-x-1]
                    holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
                    if holder.iloc[0]["SEED"] > 0: 
                        holder = -holder
                    scaled = scale(holder)
                    if r < 2:
                        ups = w1.predict(scaled)
                    else:
                        ups = w2.predict(scaled)
#                    if holder.iloc[0]["SEED"] < seed_cutoff_high:
#                       ups = big.predict(scaled)
#                        #print(big.predict_proba(scaled))
#                    elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
#                        ups = little.predict(scaled)
#                        #print(little.predict_proba(scaled))
#                    else:
#                        ups = comp.predict(scaled)
#                        #print(comp.predict_proba(scaled))
                    
                    ups = list(ups)[0]
                    if ups == 0: 
                        wins.append(h)
                    if ups == 1:
                        wins.append(l)
                        base_count += 1
                    test_df = test_df.append(holder)
                    matchup = h['TEAM'] + ' vs. ' + l['TEAM']
                    matchup_list.append(matchup)
                    y_pred.append(ups)

                    upset_count.append(base_count)
                rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
                
                
            r32 = r32.append(rounds[1])
            s16 = s16.append(rounds[2])
            e8 = e8.append(rounds[3])
            f4 = f4.append(rounds[4])
            winners.append(rounds[1:])
        
        learning.append(learners[f] + " " + learners[g])
        all_scen.append(list(f4.TEAM))
        

### Final Four Proportions

In [None]:
len(upset_count)/15/49

In [None]:
all_scen1 = []
for j in all_scen:
    all_scen1.append(list(j))

In [None]:
scenarios = pd.DataFrame(all_scen1, columns = ["East", "Midwest", "West", "South"])

In [None]:
scenarios.index = learning

In [None]:
max_east = max(scenarios.groupby(["East"]).count()["South"])
scenarios.groupby(["East"]).count()["South"].sort_values(ascending = False)/max_east*100

In [None]:
max_south = max(scenarios.groupby(["South"]).count()["East"])
scenarios.groupby(["South"]).count()["East"].sort_values(ascending = False)/max_south*100

In [None]:
max_midwest = max(scenarios.groupby(["Midwest"]).count()["East"])
scenarios.groupby(["Midwest"]).count()["East"].sort_values(ascending = False)/max_midwest*100

In [None]:
max_west = max(scenarios.groupby(["West"]).count()["East"])
scenarios.groupby(["West"]).count()["East"].sort_values(ascending = False)/max_west*100

In [None]:
scenarios.to_csv('Sims/sims24.csv')

### Round Breakdown

In [None]:
len(winners)

In [None]:
r32all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])
s16all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])
e8all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])
f4all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])

In [None]:
for x in winners:
    r32all = r32all.append(x[0][['TEAM','SEED','CONF']])
    s16all = s16all.append(x[1][['TEAM','SEED','CONF']])
    e8all = e8all.append(x[2][['TEAM','SEED','CONF']])
    f4all = f4all.append(x[3][['TEAM','SEED','CONF']])

In [None]:
def region_setter(number):
    regions = ["east", "midwest", "south", "west"]
    n = int(number/ 196)
    
    region_col = []
    for i in regions:
        x1 = n * [i]
        region_col.extend(x1)
    return region_col * 49

In [None]:
def learner_setter(number):
    scens = list(scenarios.index)
    n = int(number/ 49)
    
    learner_col = []
    for i in scens:
        x1 = n * [i]
        learner_col.extend(x1)
    return learner_col

In [None]:
r32all["REGION"] = region_setter(len(r32all))
r32all["LEARNER"] = learner_setter(len(r32all))

s16all["REGION"] = region_setter(len(s16all))
s16all["LEARNER"] = learner_setter(len(s16all))

e8all["REGION"] = region_setter(len(e8all))
e8all["LEARNER"] = learner_setter(len(e8all))

["REGION"] = region_setter(len(f4all))
f4all["LEARNER"] = learner_setter(len(f4all))

In [None]:
max_f4 = max(f4all.groupby(["TEAM"]).count()["LEARNER"])
f4all.groupby(["TEAM", "REGION"]).count()["LEARNER"].sort_values(ascending = False)/max_f4*100

In [None]:
max_SEED32 = max(r32all.groupby(["SEED"]).count()["LEARNER"])
r32all.groupby(["SEED"]).count()["LEARNER"].sort_values(ascending = False)/max_SEED32*100

In [None]:
r32all[r32all['SEED'] >= 11].groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
s16all[s16all['SEED'] >= 6].groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
f4all.groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
r32all[r32all['REGION'] == "midwest"].groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
scenarios1 = scenarios[scenarios["East"] == "Tennessee"]

In [None]:
east_seed = []
for x in scenarios['East']:
    east_seed.append(east23.index(x)+1)

In [None]:
south_seed = []
for x in scenarios['South']:
    south_seed.append(south23.index(x)+1)

In [None]:
west_seed = []
for x in scenarios['West']:
    west_seed.append(west23.index(x)+1)

In [None]:
midwest_seed = []
for x in scenarios['Midwest']:
    midwest_seed.append(midwest23.index(x)+1)

In [None]:
scenarios['East_seed'] = east_seed
scenarios['west_seed'] = west_seed
scenarios['midwest_seed'] = midwest_seed

scenarios['south_seed'] = south_seed

In [None]:
scenarios['seed_sum'] = scenarios['East_seed'] + scenarios['south_seed'] + scenarios['west_seed'] + scenarios['midwest_seed']

In [None]:
scenarios

In [None]:
learning

In [None]:
region_upset_sum = []
for i in range (0, len(upset_count), 15):
    x = i
    region_upset_sum.append(upset_count[x:x+15])

In [None]:
len(region_upset_sum[1])

In [None]:
leg_split = []
for y in region_upset_sum:
    ind_list = []
    ind_list.append(sum(y[0:7]))
    ind_list.append(sum(y[8:11]))
    ind_list.append(sum(y[12:13]))
    ind_list.append(y[14])
    leg_split.append(ind_list)

In [None]:
upset_col = [
    "east64","east32","east16","east8", 
    "south64","south32","south16","south8",
    'midwest64','midwest32','midwest16','midwest8',
    'west64','west32','west16','west8']

In [None]:
ncaam[ncaam['POSTSEASON'] in  ['2ND', 'F4', 'Champions'] ]

In [None]:
sum(y1)/len(y1)