## This Is March
by Chancellor Tang

For this project, my goal was to see if it were possible to create a model that could predict the mens NCAA March Madness tournament. To do this, I retrieved 2 datasets. ncaam.csv, which has statistics for teams from tournament paticipants from the past five tournaments (2015-2019) With a column labeled "POSTSEASON", which stated how far each team went in the tournament. cbb21.csv hold team statistics from the 2021 season.

Link: https://www.kaggle.com/andrewsundberg/college-basketball-dataset?select=cbb.csv

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
ncaam = pd.read_csv('data/cbb.csv')

In [3]:
ncaam21 = pd.read_csv('data/test/cbb21t.csv')

In [4]:
ncaam22 = pd.read_csv('data/test/cbb22T.csv')

In [5]:
ncaam23 = pd.read_csv('data/test/cbb23t.csv')

In [6]:
ncaam24 = pd.read_csv('data/cbb24.csv')

In [7]:
ncaam25 = pd.read_csv('data/cbb25.csv')

**Data Cleansing**

The first step I took to cleanse was to drop all the schools not in the tournament out of the 2021 DataFrame. To do this, I created a condition to filter out any team with less than a 17 seed because 17 seeds do not exist in this tournament (yet). To make it easier for a later process, I also rearranged the columns so that the numeric values were grouped together. From there, for each DataFrame. I filled all null values with 0 and changed the seed number to an int. After restructuring the data, I divided the past tournaments DataFrame into their respective years.

In [8]:
ncaam21 = ncaam21[ncaam21['SEED'] < 17] 

In [9]:
def power_conf(df):
    test_list = []
    for x in df.CONF:
        if x in ["B10", "B12", "SEC", "P12", "BE", "ACC"]:
            test_list.append(1)
        else:
            test_list.append(0)
    return test_list

In [10]:
new_columns = ['TEAM',
 'CONF',
 'POSTSEASON',
 'G',
 'W',
 "WIN_PER",
 'ADJOE',
 'ADJDE',
 'BARTHAG',
 'EFG_O',
 'EFG_D',
 'TOR',
 'TORD',
 'ORB',
 'DRB',
 'FTR',
 'FTRD',
 '2P_O',
 '2P_D',
 '3P_O',
 '3P_D',
 'ADJ_T',
 'WAB',
 'SEED',
 "POWER",
 'YEAR']

In [11]:
def format_ncaa_df(df, year = None):
    df = df.fillna(0)
    df.SEED = df.SEED.astype(int) 
    df["WIN_PER"] = df["W"]/df["G"]
    df["POWER"] = power_conf(df)
    df = df.reindex(columns=new_columns)
    if year is not None:
        df["YEAR"] = year
    return df

In [12]:
ncaam = format_ncaa_df(ncaam)

In [13]:
ncaam21 = format_ncaa_df(ncaam21, 2021)
ncaam22 = format_ncaa_df(ncaam22, 2022)
ncaam23 = format_ncaa_df(ncaam23, 2023)
ncaam24 = format_ncaa_df(ncaam24, 2024)
ncaam25 = format_ncaa_df(ncaam25, 2025)

In [14]:
#ncaam21 = ncaam[ncaam['YEAR']==2021]
ncaam19 = ncaam[ncaam['YEAR']==2019]
ncaam18 = ncaam[ncaam['YEAR']==2018]
ncaam17 = ncaam[ncaam['YEAR']==2017]
ncaam16 = ncaam[ncaam['YEAR']==2016]
ncaam15 = ncaam[ncaam['YEAR']==2015]
ncaam14 = ncaam[ncaam['YEAR']==2014]
ncaam13 = ncaam[ncaam['YEAR']==2013]

In [15]:
ncaam24.head()

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,YEAR
0,Houston,B12,S16,31,28,0.903226,121.5,87.8,0.9768,50.1,...,39.6,48.8,44.0,34.8,30.2,63.7,9.6,1,1,2024
1,Connecticut,BE,Champions,31,28,0.903226,127.1,95.0,0.966,56.7,...,33.4,58.0,43.9,36.6,31.7,64.7,9.3,1,1,2024
2,Purdue,B10,2ND,31,28,0.903226,128.1,96.8,0.9615,56.4,...,23.5,53.5,47.7,41.1,32.0,67.9,10.7,1,1,2024
3,North Carolina,ACC,S16,31,25,0.806452,118.0,94.3,0.9292,51.4,...,28.2,50.1,45.7,35.8,31.2,70.6,6.4,1,1,2024
4,Arizona,P12,S16,31,24,0.774194,123.8,96.3,0.9471,55.3,...,26.3,54.9,47.9,37.4,33.7,72.6,4.9,2,1,2024


**Manual Input**


One of the drawbacks from this dataframe is that it does not tell you which leg of the bracket each team came from. This causes an issue with setting up matchups. To regroup the data, I had to manually create lists for each leg (4) of each tournament (6).

In [16]:
import json

with open("data/regions.json", "r") as file:
    regions = json.load(file)

In [17]:
regions["2025"] =  {
    "east": [
      "Duke",
      "Alabama",
      "Wisconsin",
      "Arizona",
      "Oregon",
      "BYU",
      "Saint Mary’s",
      "Mississippi St.",
      "Baylor",
      "Vanderbilt",
      "VCU",
      "Liberty",
      "Akron",
      "Montana",
      "Robert Morris",
      "Mount St. Mary’s" #"American" #
    ],
    "west": [
      "Florida",
      "St. John’s",
      "Texas Tech",
      "Maryland",
      "Memphis",
      "Missouri",
      "Kansas",
      "Connecticut",
      "Oklahoma",
      "Arkansas",
      "Drake",
      "Colorado St.",
      "Grand Canyon",
      "UNC Wilmington",
      "Nebraska Omaha",
      "Norfolk St."
    ],
    "south": [
      "Auburn",
      "Michigan St.",
      "Iowa St.",
      "Texas A&M",
      "Michigan",
      "Mississippi",
      "Marquette",
      "Louisville",
      "Creighton",
      "New Mexico",
      "North Carolina", #/",  San Diego St.
      "UC San Diego",
      "Yale",
      "Lipscomb",
      "Bryant",
      "Alabama St." #/Saint Francis"
    ],
    "midwest": [
      "Houston",
      "Tennessee",
      "Kentucky",
      "Purdue",
      "Clemson",
      "Illinois",
      "UCLA",
      "Gonzaga",
      "Georgia",
      "Utah St",
      "Xavier",   #"Texas", 
      "McNeese St.",
      "High Point",
      "Troy",
      "Wofford",
      "SIU Edwardsville"
    ]
  }


**Creating the Dataframes**

For creating each leg's dataframe, I created three functions. f(x) works like a case function. It looks up the value of the column "POSTSEASON" and returns a list of ones and zeros to act as dummy variables. The dummy(z) function user f(x) to create a dataframe of of all these dummy variables. Finally, create_df(x,y) creates a final dataframe where it merges (the region list with their year's respective nccam dataframe to import the data. This also uses the dummy(z) function to add on the dummy variables.

To create the dataframe, I used a for loop that creates the dataframes based on their region and their name.

In [18]:
def round_assign(x):
    if x['POSTSEASON'] == "R64" : return [1,0,0,0,0,0,0]
    elif x['POSTSEASON'] == "R32" : return [1,1,0,0,0,0,0]
    elif x['POSTSEASON'] == "S16" : return [1,1,1,0,0,0,0]
    elif x['POSTSEASON'] == 'E8': return [1,1,1,1,0,0,0]
    elif x['POSTSEASON'] == 'F4': return [1,1,1,1,1,0,0]
    elif x['POSTSEASON'] == '2ND' or x['POSTSEASON'] == 'C2': return [1,1,1,1,1,1,0]
    elif x['POSTSEASON'] == 'Champions': return [1,1,1,1,1,1,1]
    else: return [0,0,0,0,0,0,0]

In [19]:
def dummy(z):
    a = []
    for x in range(0,len(z)):
        a.append(round_assign(z.iloc[x]))
    a = pd.DataFrame(a)
    a = a.rename(columns={0: "R64", 1: "R32", 2: "S16",3: "E8", 4:"F4",5:"C2",6:"Champions"})
    return a

In [20]:
def region_df(x,y):
    v = pd.DataFrame(x)
    v = v.rename(columns={0: "TEAM"})
    v = v.merge(y, on = 'TEAM', how='left')
    v = v.join(dummy(v))
    return v

In [21]:
master_df = pd.DataFrame()

all_years = [13,14,15,16,17,18,19,21,22,23,24,25]
legs = ["east",'south', 'midwest', 'west']
for x in all_years:
    for y in legs:
        z = str( 2000 + x)
        globals()[y + '%s' % x + '%s' %"_df"] = region_df(regions[z][y],  globals()['ncaam' + '%s' % x])
        master_df = pd.concat([master_df,globals()[y + '%s' % x + '%s' %"_df"]], ignore_index=True)

In [22]:
east24_df

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,SEED,POWER,YEAR,R64,R32,S16,E8,F4,C2,Champions
0,Connecticut,BE,Champions,31,28,0.903226,127.1,95.0,0.966,56.7,...,1,1,2024,1,1,1,1,1,1,1
1,Iowa St.,B12,S16,31,24,0.774194,113.1,89.5,0.9365,51.8,...,2,1,2024,1,1,1,0,0,0,0
2,Illinois,B10,E8,31,23,0.741935,125.3,101.8,0.9159,53.9,...,3,1,2024,1,1,1,1,0,0,0
3,Auburn,SEC,R64,31,24,0.774194,121.5,94.6,0.9467,54.0,...,4,1,2024,1,0,0,0,0,0,0
4,San Diego St.,MWC,S16,29,22,0.758621,112.5,95.0,0.8739,50.4,...,5,0,2024,1,1,1,0,0,0,0
5,BYU,B12,R64,31,22,0.709677,121.7,100.3,0.9024,55.3,...,6,1,2024,1,0,0,0,0,0,0
6,Washington St.,P12,R32,31,23,0.741935,114.3,99.2,0.8358,52.1,...,7,1,2024,1,1,0,0,0,0,0
7,Florida Atlantic,Amer,R64,31,24,0.774194,118.8,104.6,0.8131,55.1,...,8,0,2024,1,0,0,0,0,0,0
8,Northwestern,B10,R32,31,21,0.677419,118.8,101.2,0.8641,53.4,...,9,1,2024,1,1,0,0,0,0,0
9,Drake,MVC,R64,33,28,0.848485,116.2,101.5,0.825,54.9,...,10,0,2024,1,0,0,0,0,0,0


In [23]:
master_df['games_won'] = master_df[['R64', 'R32', 'S16', 'E8', 'F4', 'C2', 'Champions']].sum(axis=1)

In [24]:
master_df.to_csv("master_ncaam.csv", index=False)

**Determining an outcome**

The main issue regarding the matchups was how to create a dataframe that can be trained to predict the outcome to a game. While prediciting likeliness of what round each team would reach is easier, this would not give us the same head-to-head chaos that March Madness embodies. The best way to predict the outcome of each induvidual game is to create a dataframe that shows head-to-head matchups. 

To do this, I created a function diff(df,u) that takes a region dataframe (df) and creates the head-to-head matchups. This function repeats for the number of matchups in the region for that round (u). The function take the first team in the dataframe and the last team in the dataframe, which are the 1 and 16 seed, and works their way in. So in the next matchup, it would be the 2 seed agains the second to last seed (15), and the for loop would keep going until the last matchup (8 vs 9) is created by moving its way from the outside in. For each match set by the for loop, both are run through the get_upset_differences(x,y), which subtracts the numeric statistics of y (the lower seed) by x (the higher seed) to create a difference. This difference is returned as  list of differences which is then appended to a its own list (listDF) in the diff(df,u) function. Once all the matchup differences are created, the function creates a dataframe using listDF and returns that dataframe. The predictive model will us this format of dataframe (the differences) as its training data.

In [25]:
df_headers = list(ncaam.columns)

In [26]:
train_years = [13,14,15,16,17,18,19,21,22,23,24]

In [27]:
df_past = []
for x in range(0,len(legs)):
    for y in train_years:
        df_past.append(globals()[legs[x] + '%s' % y + '%s' %"_df"])

In [28]:
def get_upset_differences(a,b):
    """
    Takes two rows from one of the region databases and finds the difference between the two columns 

    a: higher seed
    b: lower seed'

    output: list of db difference rows
    """
    
    listA = []
    for x in range(3,25):
        diff = a.iloc[x] - b.iloc[x]
        listA.append(diff)
    return listA

In [29]:
def get_target_variable(df, nxt_round_num, rnd_name):
    """
    This function used to get the target variable attributes from the correct column
    The list is reversed because we want to see if the lower seed wins

    df: dataframe used
    nxt_round_num: number of teams in the next round
    rnd_name: the column name of the dummy round variable of the next round

    output: list of target variable attributes (will be its own column)
    """
    
    num = int(len(df)/2)
    listB = list(df[rnd_name])
    listB.reverse()
    listB = listB[0:num]
    return listB

In [30]:
def create_training_record(df, matchup_num, reseed = True):
    """
    This function takes the dataframe, and creates a new dataset of differences that can be trained
    
    df: dataframe of the round being processed
    matchup_num: the number of matchups
    """
    
    listDF = []
    for y in range(0,matchup_num):
        test_upsetH = df.iloc[y]
        test_upsetL = df.iloc[-(y+1)]
        listDF.append(get_upset_differences(test_upsetH,test_upsetL))
    if reseed:
        return pd.DataFrame(listDF, columns = df_headers[3:25]).sort_values(by = ["SEED"])
    else:
        return pd.DataFrame(listDF, columns = df_headers[3:25]).sort_values(by = ["SEED"])

In [31]:
def creation(df, next_rounds):
    """
    This function takes the current round's dataframe and the next round string name to use the previous functions to build the training df

    df: current round's dataframe
    next_rounds
    """
    
    y = pd.DataFrame(columns = new_columns[3:25])
    asd = []
    nxt_round_num = int(len(df)/2)
    upset= get_target_variable(df,nxt_round_num, next_rounds)
    for x in range(0,nxt_round_num):
        h = df.iloc[x]
        l = df.iloc[(-x-1)]
        if upset[x] == 1:
            asd.append(l)
        if upset[x] == 0:
            asd.append(h)
    next_df = pd.DataFrame(asd)
    y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
    y["TRAIN"] = upset
    y.TRAIN = y.TRAIN.astype(int)
    return y, next_df

In [32]:
df_past64 = []
for x in range(0,len(legs)):
    for y in train_years:
        df_past64.append(globals()[legs[x] + '%s' % y + '%s' %"_df"])

In [33]:
def create_train(a, next_round):
    df = pd.DataFrame(columns = new_columns[3:25])
    df_next = []
    for x in a:
        train, next_df = creation(x,next_round)
        df = pd.concat([df,train], ignore_index = True)
        df_next.append(next_df)
    return df, df_next

In [34]:
round_name = ["R32", "S16", "E8", "F4"]
train_master = pd.DataFrame(columns = new_columns[3:25])
train_w1 = pd.DataFrame(columns = new_columns[3:25])
train_w2 = pd.DataFrame(columns = new_columns[3:25])

for x in range(0,len(round_name)):
    a = 2**(6-x)
    b = 2**(5-x)
    holder = create_train(globals()["df_past" + '%s' % a], round_name[x])
    globals()["train" + '%s' % a] = holder[0]
    globals()["df_past" + '%s' % b] = holder[1]
    train_master = pd.concat([train_master,holder[0]], ignore_index = True)
    print("end")
    if x <= 1:
        train_w1 = pd.concat([train_w1,holder[0]], ignore_index = True)
    else:
        train_w2 = pd.concat([train_w2,holder[0]], ignore_index = True)

  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nx

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat(

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat(

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat(

end


  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)


### Final 4 Training DF

fin4 = df of all teams in the final 4

In [35]:
df_pastF4 = []
num = len(train_years)

for p in range(num):
    df_list = []
    for i in range(4):
        df_list.append(df_past4[(i * num) + p])
    df = pd.concat(df_list)
    df_pastF4.append(df)

In [36]:
new_dfs = []
for x in df_past:
    y = x[x["F4"]==1]
    new_dfs.append(y)

In [37]:
fin4 = pd.concat(new_dfs)

In [38]:
for y in train_years:
    year = y + 2000
    globals()["fin4_" + '%s' % y] = fin4[fin4["YEAR"]== year]["TEAM"]

In [39]:
fin4_13 = fin4_13.sort_values(ascending = True)

fin4_14 = fin4_14.sort_values(ascending = True)
fin4_14 = fin4_14.reindex([6,7,1,0])

fin4_15 = fin4_15.sort_values(ascending = True)
fin4_15 = fin4_15.reset_index(drop = True)
fin4_15 = fin4_15.reindex([0,3,1,2])

fin4_16 = fin4_16.sort_values(ascending = True)
fin4_16 = fin4_16.reset_index(drop = True)
fin4_16 = fin4_16.reindex([0,3,1,2])
                          
fin4_17 = fin4_17.sort_values(ascending = True)
fin4_18 = fin4_18.sort_values(ascending = True)       
fin4_19 = fin4_19.sort_values(ascending = True)

fin4_21 = fin4_21.sort_values(ascending = True)
fin4_21 = fin4_21.reset_index(drop = True)
fin4_21 = fin4_21.reindex([0,3,1,2])

fin4_21 = fin4_21.sort_values(ascending = True)
fin4_21 = fin4_21.reset_index(drop = True)
fin4_21 = fin4_21.reindex([0,3,1,2])

fin4_22 = fin4_22.sort_values(ascending = True)
fin4_22 = fin4_22.reset_index(drop = True)
fin4_22 = fin4_22.reindex([0,3,1,2])

fin4_23 = fin4_23.sort_values(ascending = True)
fin4_23 = fin4_23.reset_index(drop = True)
fin4_23 = fin4_23.reindex([0,3,1,2])

In [40]:
df_past4 = []
for x in train_years:
    y = train_years.index(x)
    df = pd.DataFrame(globals()["fin4_" + '%s' % x ])
    year = 2000 + x
    df2 = df_pastF4[y]
    df = df.merge(df2, left_on='TEAM', right_on='TEAM')
    df_past4.append(df)

In [41]:
train4, df_past2 = create_train(df_past4, "C2")
train2, df_past1 = create_train(df_past2, "Champions")

  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  df = pd.concat([df,train], ignore_index = True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nxt_round_num)], ignore_index=True)
  y = pd.concat([y, create_training_record(df,nx

In [42]:
train_ff = pd.DataFrame(columns = new_columns[3:25])


for x in [train4,train2]:
    train_master = pd.concat([train_master,x], ignore_index = True)
    train_ff = pd.concat([train_ff, x], ignore_index = True)

  train_ff = pd.concat([train_ff, x], ignore_index = True)


### Scaling

In [43]:
train_neg = train_master[train_master["SEED"]<=0]
train_pos = train_master[train_master["SEED"]>0]

In [44]:
train_pos = - train_pos
train_pos["TRAIN"] = train_pos["TRAIN"] + 1

In [45]:
train_master = pd.concat([train_pos, train_neg])

In [46]:
train_master = train_master.reset_index(drop = True)

from sklearn.preprocessing import StandardScaler  

scaler = StandardScaler()
scaler.fit(train_master.drop(columns = ["TRAIN"]))

def scale(df):
    m = pd.DataFrame(scaler.transform(df), columns = df.columns)
    return m

In [47]:
seed_cutoff_low = -7
seed_cutoff_high = -4
big_upset = train_master[train_master["SEED"] < seed_cutoff_low]
little_upset = train_master[train_master["SEED"] > seed_cutoff_high]
competative = train_master[(train_master["SEED"] <= seed_cutoff_high) & (train_master["SEED"] >= seed_cutoff_low)]

## Model Training

In [48]:
train_master

Unnamed: 0,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,EFG_D,TOR,TORD,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,TRAIN
0,6,0,-0.133333,8.0,6.8,0.0034,-0.8,-0.0,-2.4,-11.1,...,-14.2,-1.6,-3.2,0.5,4.1,-2.7,3.5,-8,1,0.0
1,2,1,-0.011583,12.1,-6.7,0.2720,-4.3,-0.0,-0.7,4.9,...,15.2,-6.7,-3.3,-0.6,3.0,-4.7,7.1,-8,1,0.0
2,3,-6,-0.346154,10.5,2.8,0.1373,-0.9,5.3,-2.8,-9.1,...,-14.5,-0.2,3.1,-1.0,6.0,-5.3,2.1,-3,1,1.0
3,0,11,0.379310,11.4,1.0,0.2182,5.1,3.9,-3.7,0.1,...,-16.4,7.6,6.0,-0.1,0.7,0.1,9.4,-8,0,1.0
4,-1,11,0.373992,6.9,-21.2,0.6295,3.5,-9.6,0.3,-3.0,...,-7.9,2.9,-11.2,2.9,-4.3,-0.6,16.1,-7,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
688,0,3,0.075000,14.0,3.7,0.0350,5.6,0.8,1.0,-1.3,...,-3.3,4.3,2.2,4.9,-1.5,3.7,3.7,-2,0,0.0
689,0,4,0.105263,7.8,4.7,0.0040,1.7,1.7,-3.0,-5.3,...,-10.3,-0.3,3.8,3.0,-0.8,-6.8,4.1,-2,0,0.0
690,-2,-4,-0.083333,-2.2,4.7,-0.0243,-3.5,1.6,1.5,4.3,...,5.8,-9.9,1.3,5.3,1.5,-5.8,-1.9,0,1,0.0
691,2,0,-0.052083,5.6,1.9,0.0197,2.7,-2.8,1.1,-0.2,...,9.0,3.4,-4.7,0.8,-0.2,1.1,-0.3,-1,1,0.0


In [49]:
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [50]:
def formatStuff(df):
    y = df["TRAIN"]
    x = scale(df.drop(columns = ["TRAIN"]))
    return [x,y]

In [51]:
train_df = formatStuff(train_master)
train_big = formatStuff(big_upset)
train_little = formatStuff(little_upset)
train_comp = formatStuff(competative)

In [52]:
for p in [train_df, train_big, train_little, train_comp]:
    print(sum(p[1])/len(p[0]))
    print(len(p[0]), "\n")

0.3116883116883117
693 

0.17708333333333334
288 

0.39631336405529954
217 

0.42021276595744683
188 



In [53]:
#MLP Classifier
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

iterations = 1000
alpha = 3
 
mlp_big = MLPClassifier(max_iter= iterations, alpha=alpha, random_state = 69)
mlp_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(mlp_big.score(X,Y)))

mlp_little = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(mlp_little.score(x, y)))

mlp_comp = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(mlp_comp.score(train_x, train_y)))

Accuracy on training set: 0.861
Accuracy on training set: 0.797
Accuracy on training set: 0.920


In [54]:
#FOREST
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

est = 6
depth = 5
 
forest_big = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(
    forest_big.score(X, Y)))

forest_little = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(
    forest_little.score(x, y)))

forest_comp = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(forest_comp.score(train_x, train_y)))

Accuracy on training set: 0.889
Accuracy on training set: 0.903
Accuracy on training set: 0.888


In [55]:
#SVM
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

svc_big = SVC(random_state = 69, C = 1)
svc_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(svc_big.score(X, Y)))

svc_little = SVC(random_state = 69, C = 1)
svc_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(svc_little.score(x, y)))

svc_comp = SVC(random_state = 69, C = 1)
svc_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(svc_comp.score(train_x, train_y)))

Accuracy on training set: 0.826
Accuracy on training set: 0.880
Accuracy on training set: 0.798


In [56]:
#Log Regressor
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

clf_big =  LogisticRegression(random_state=69, C =1)
clf_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(clf_big.score(X, Y)))

clf_little = LogisticRegression(random_state=69, C = 5)
clf_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(clf_little.score(x, y)))

clf_comp = LogisticRegression(random_state=69, C = 20)
clf_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(clf_comp.score(train_x, train_y)))

Accuracy on training set: 0.861
Accuracy on training set: 0.816
Accuracy on training set: 0.723


In [57]:
#K-Nearest
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

neighbors = 5

knn_big = KNeighborsClassifier(n_neighbors=neighbors)
knn_big.fit(X, Y)
knn_bscore = knn_big.score(X, Y)

knn_little = KNeighborsClassifier(n_neighbors=neighbors)
knn_little.fit(x, y)
knn_lscore = knn_little.score(x, y)

knn_comp = KNeighborsClassifier(n_neighbors=neighbors)
knn_comp.fit(train_x, train_y)
knn_cscore = knn_comp.score(train_x, train_y)

knn_mean =  sum( [knn_bscore, knn_lscore, knn_cscore])/ len( [knn_bscore, knn_lscore, knn_cscore])

print("Accuracy on training set: {:.3f}".format(knn_bscore))
print("Accuracy on training set: {:.3f}".format(knn_lscore))
print("Accuracy on training set: {:.3f}".format(knn_cscore))

Accuracy on training set: 0.851
Accuracy on training set: 0.788
Accuracy on training set: 0.777


In [58]:
# Naive Bayes
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

gnb_big = GaussianNB()
gnb_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(gnb_big.score(X, Y)))

gnb_little = GaussianNB()
gnb_little.fit(x,y)
print("Accuracy on training set: {:.3f}".format(gnb_little.score(x, y)))

gnb_comp = GaussianNB()
gnb_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(gnb_comp.score(train_x, train_y)))

Accuracy on training set: 0.726
Accuracy on training set: 0.760
Accuracy on training set: 0.644


In [59]:
# Decision tree
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

tree_depth = 7

DT_big = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(DT_big.score(X, Y)))

DT_little = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(DT_little.score(x,y)))

DT_comp = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(DT_comp.score(train_x, train_y)))

Accuracy on training set: 0.903
Accuracy on training set: 0.894
Accuracy on training set: 0.872


## Model Training

In [60]:
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [61]:
def formatStuff(df):
    y = df["TRAIN"]
    x = scale(df.drop(columns = ["TRAIN"]))
    return [x,y]

In [62]:
train_df = formatStuff(train_master)
train_big = formatStuff(big_upset)
train_little = formatStuff(little_upset)
train_comp = formatStuff(competative)

In [63]:
for p in [train_df, train_big, train_little, train_comp]:
    print(sum(p[1])/len(p[0]))
    print(len(p[0]), "\n")

0.3116883116883117
693 

0.17708333333333334
288 

0.39631336405529954
217 

0.42021276595744683
188 



In [64]:
#MLP Classifier
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

iterations = 1000
alpha = 3
 
mlp_big = MLPClassifier(max_iter= iterations, alpha=alpha, random_state = 69)
mlp_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(mlp_big.score(X,Y)))

mlp_little = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(mlp_little.score(x, y)))

mlp_comp = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(mlp_comp.score(train_x, train_y)))

Accuracy on training set: 0.861
Accuracy on training set: 0.797
Accuracy on training set: 0.920


In [65]:
#FOREST
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

est = 6
depth = 5
 
forest_big = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(
    forest_big.score(X, Y)))

forest_little = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(
    forest_little.score(x, y)))

forest_comp = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(forest_comp.score(train_x, train_y)))

Accuracy on training set: 0.889
Accuracy on training set: 0.903
Accuracy on training set: 0.888


In [66]:
#SVM
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

svc_big = SVC(random_state = 69, C = 1)
svc_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(svc_big.score(X, Y)))

svc_little = SVC(random_state = 69, C = 1)
svc_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(svc_little.score(x, y)))

svc_comp = SVC(random_state = 69, C = 1)
svc_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(svc_comp.score(train_x, train_y)))

Accuracy on training set: 0.826
Accuracy on training set: 0.880
Accuracy on training set: 0.798


In [67]:
#Log Regressor
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

clf_big =  LogisticRegression(random_state=69, C =1)
clf_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(clf_big.score(X, Y)))

clf_little = LogisticRegression(random_state=69, C = 5)
clf_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(clf_little.score(x, y)))

clf_comp = LogisticRegression(random_state=69, C = 20)
clf_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(clf_comp.score(train_x, train_y)))

Accuracy on training set: 0.861
Accuracy on training set: 0.816
Accuracy on training set: 0.723


In [68]:
#K-Nearest
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

neighbors = 5

knn_big = KNeighborsClassifier(n_neighbors=neighbors)
knn_big.fit(X, Y)
knn_bscore = knn_big.score(X, Y)

knn_little = KNeighborsClassifier(n_neighbors=neighbors)
knn_little.fit(x, y)
knn_lscore = knn_little.score(x, y)

knn_comp = KNeighborsClassifier(n_neighbors=neighbors)
knn_comp.fit(train_x, train_y)
knn_cscore = knn_comp.score(train_x, train_y)

knn_mean =  sum( [knn_bscore, knn_lscore, knn_cscore])/ len( [knn_bscore, knn_lscore, knn_cscore])

print("Accuracy on training set: {:.3f}".format(knn_bscore))
print("Accuracy on training set: {:.3f}".format(knn_lscore))
print("Accuracy on training set: {:.3f}".format(knn_cscore))

Accuracy on training set: 0.851
Accuracy on training set: 0.788
Accuracy on training set: 0.777


In [69]:
# Naive Bayes
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

gnb_big = GaussianNB()
gnb_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(gnb_big.score(X, Y)))

gnb_little = GaussianNB()
gnb_little.fit(x,y)
print("Accuracy on training set: {:.3f}".format(gnb_little.score(x, y)))

gnb_comp = GaussianNB()
gnb_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(gnb_comp.score(train_x, train_y)))

Accuracy on training set: 0.726
Accuracy on training set: 0.760
Accuracy on training set: 0.644


In [70]:
# Decision tree
X = train_big[0]
Y = list(train_big[1])

x = train_little[0]
y = list(train_little[1])

train_x = train_comp[0]
train_y = list(train_comp[1])

tree_depth = 7

DT_big = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_big.fit(X, Y)
print("Accuracy on training set: {:.3f}".format(DT_big.score(X, Y)))

DT_little = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_little.fit(x, y)
print("Accuracy on training set: {:.3f}".format(DT_little.score(x,y)))

DT_comp = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT_comp.fit(train_x, train_y)
print("Accuracy on training set: {:.3f}".format(DT_comp.score(train_x, train_y)))

Accuracy on training set: 0.903
Accuracy on training set: 0.894
Accuracy on training set: 0.872


**Running the Predictions**

To do this, I create dataframes for each round of the tournament. From there, I created three appendable variables: test_df as an x_test, y_pred as a y_test, and matchup list to keep track of which teams faced off as a visual check. The for loop takes all four regions and sets it equal to the round64 variable. Using the rounds list, which had all the dataframes up until the final four, the for loop would take each round and simulate each matchup by using the lists function to create those differences that were used in the x_train dataframe. Using the result from the lists function (holder), I used the predict function for each ML library to take the values and see if the game was an upset or not. For each game, it would print the matchup, and once the prediction was made, it would print the winner and append the winner to the wins list. Using the wins list, a dataframe would be created (rounds[x+1]). After creating the datframe, it would append the prediction to y_pred and the matchup to matchup_list. After all the rounds how gone, each dataframe created for the round is appended with data from the rounds[x+1].

For each predictive model, it has to be input in the variable q because I have yet to create a loop that can run all these libraries concurrently. "q" is the type of model the prediction runs (SVC, MLP, regressor).

In [220]:
r64 = pd.DataFrame(columns = df_headers)
r32 = pd.DataFrame(columns = df_headers)
s16 = pd.DataFrame(columns = df_headers)
e8 = pd.DataFrame(columns = df_headers)
f4 = pd.DataFrame(columns = df_headers)
c2= pd.DataFrame(columns = df_headers)
winner= pd.DataFrame(columns = df_headers)

In [221]:
#FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
#probas not good svc

b = "mlp"
l = "mlp"
c = "mlp"

big = globals()[b + '%s' % "_big"]
little = globals()[l + '%s' % "_little"]
comp= globals()[c + '%s' % "_comp"]

In [222]:
test_df = pd.DataFrame(columns = df_headers[3:25])
y_pred = []
matchup_list = []
test_regions = [east25_df, west25_df, south25_df, midwest25_df]
#print('\033[1m' + str(learn) + '\033[0m' + "\n")
for x in test_regions: 
    r64 = pd.concat([r64,x], ignore_index = True)
    round64 = x
    rounds = [round64, r32, s16, e8, f4]
    for r in range (0,len(rounds)-1):
        wins = []
        y = len(rounds[r])/2
        y = int(y)
        for x in range(0, y):
            random = np.random.rand() **2
            h = rounds[r].iloc[x]
            l = rounds[r].iloc[-x-1]
            holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
            if holder.iloc[0]["SEED"] > 0: 
                holder = -holder
                static = h
                h = l
                l = static
            scaled = scale(holder)
            if holder.iloc[0]["SEED"] < seed_cutoff_low:
                ups = big.predict_proba(scaled) #big.predict(scaled)
                #print(big.predict_proba(scaled))
            elif  holder.iloc[0]["SEED"] > seed_cutoff_high:
                ups = little.predict_proba(scaled)
                #print(little.predict_proba(scaled))
            else:
                ups = comp.predict_proba(scaled) #comp.predict(scaled)
                #print(comp.predict_proba(scaled))
            #ups = list(ups)[0]
            ups=ups[0][0]
            print(ups)
            print(h['SEED'], h['TEAM'], ' vs. ', l['SEED'], l['TEAM'])
            if ups >= random: 
                wins.append(h)
                print("Winner:", h['SEED'], h['TEAM'])
            if ups < random:
                wins.append(l)
                print("Winner:", l['SEED'], l['TEAM'])
            test_df = pd.concat([test_df,holder], ignore_index = True)
            matchup = h['TEAM'] + ' vs. ' + l['TEAM']
            matchup_list.append(matchup)
            y_pred.append(ups)
        rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
        print('\n')
    print("_" * 40)

    r32 = pd.concat([r32,rounds[1]], ignore_index = True)
    s16 = pd.concat([s16,rounds[2]], ignore_index = True)
    e8 = pd.concat([e8,rounds[3]], ignore_index = True)
    f4 = pd.concat([f4,rounds[4]], ignore_index = True)

  r64 = pd.concat([r64,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index = True)
  r32 = pd.concat([r32,rounds[1]], ignore_index = True)
  s16 = pd.concat([s16,rounds[2]], ignore_index = True)
  e8 = pd.concat([e8,rounds[3]], ignore_index = True)
  f4 = pd.concat([f4,rounds[4]], ignore_index = True)


0.975634274142808
1 Duke  vs.  16 Mount St. Mary’s
Winner: 1 Duke
0.9243164285569763
2 Alabama  vs.  15 Robert Morris
Winner: 2 Alabama
0.7816623737140768
3 Wisconsin  vs.  14 Montana
Winner: 3 Wisconsin
0.6791978062472721
4 Arizona  vs.  13 Akron
Winner: 4 Arizona
0.15690249036664228
5 Oregon  vs.  12 Liberty
Winner: 12 Liberty
0.25053576524793686
6 BYU  vs.  11 VCU
Winner: 6 BYU
0.6902944979677
7 Saint Mary’s  vs.  10 Vanderbilt
Winner: 7 Saint Mary’s
0.44656774534490395
8 Mississippi St.  vs.  9 Baylor
Winner: 8 Mississippi St.


0.5385042209603896
1 Duke  vs.  8 Mississippi St.
Winner: 1 Duke
0.2682996105926694
2 Alabama  vs.  7 Saint Mary’s
Winner: 2 Alabama
0.45123941988425165
3 Wisconsin  vs.  6 BYU
Winner: 3 Wisconsin
0.5501928242160044
4 Arizona  vs.  12 Liberty
Winner: 4 Arizona


0.8239029303173311
1 Duke  vs.  4 Arizona
Winner: 1 Duke
0.42081927023936505
2 Alabama  vs.  3 Wisconsin
Winner: 2 Alabama


0.8329607358571086
1 Duke  vs.  2 Alabama
Winner: 1 Duke


______________

## FINAL 4 SIM

In [215]:
final_four = [f4, c2, winner]
for r in range (0,len(final_four)-1):
    wins = []
    y = len(final_four[r])/2
    y = int(y)
    for x in range(0, y):
        h = final_four[r].iloc[x]
        l = final_four[r].iloc[-x-1]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
    
        scaled = scale(holder)
        if holder.iloc[0]["SEED"] < seed_cutoff_high:
            ups = big.predict(scaled)
        elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
            ups = little.predict(scaled)
        else:
            ups = comp.predict(scaled)
        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            wins.append(h)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            wins.append(l)
            print("Winner:", l['SEED'], l['TEAM'])
        test_df = pd.concat([test_df,holder], ignore_index = True)
        matchup_list.append(matchup)
        y_pred.append(ups)
    final_four[r+1] = pd.DataFrame(data = wins, columns = df_headers)
df_c2 = final_four[1]
df_winner = final_four[2]

1 Duke vs. 8 Gonzaga
Winner: 1 Duke
1 Florida vs. 1 Auburn
Winner: 1 Florida
1 Duke vs. 1 Florida
Winner: 1 Duke


In [184]:
test_df['UPSET'] = y_pred

In [185]:
test_df["MATCHUP"] = matchup_list

In [186]:
df_winner

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,YEAR
0,Duke,ACC,,34,31,0.911765,128.4,91.3,0.9807,57.4,...,25.4,58.0,43.4,37.7,30.9,65.7,9.6,1,1,2025


## Model Training

In [78]:
def formatStuff(df):
    y = df["TRAIN"]
    x = scale(df.drop(columns = ["TRAIN"]))
    return [x,y]

In [79]:
train_r1 = formatStuff(train_w1)
train_r2 = formatStuff(train_w2)
train_r3 = formatStuff(train_ff)

In [80]:
for p in [train_r1,train_r2,train_r3]:
    print(sum(p[1])/len(p[0]))
    print(len(p[0]), "\n")

0.3087121212121212
528 

0.3939393939393939
132 

0.3333333333333333
33 



In [81]:
# Training DFs
x1 = train_r1[0]
y1 = list(train_r1[1])

x2 = train_r2[0]
y2 = list(train_r2[1])

x3 = train_r3[0]
y3 = list(train_r3[1])

In [82]:
#MLP Classifier

iterations = 1000
alpha = 4
 
mlp1 = MLPClassifier(max_iter= iterations, alpha=alpha, random_state = 69)
mlp1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(mlp1.score(x1,y1)))

mlp2 = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 52)
mlp2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(mlp2.score(x2, y2)))

mlp3 = MLPClassifier(max_iter= iterations, alpha= alpha, random_state = 69)
mlp3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(mlp3.score(x3, y3)))

Accuracy on training set: 0.780
Accuracy on training set: 0.909
Accuracy on training set: 0.879


In [83]:
#FOREST
est = 6
depth = 6
 
forest1 = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(
    forest1.score(x1, y1)))

forest2 = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(
    forest2.score(x2, y2)))

forest3 = RandomForestClassifier(n_estimators=est, max_depth = depth, random_state = 69)
forest3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(forest3.score(x3, y3)))

Accuracy on training set: 0.869
Accuracy on training set: 0.970
Accuracy on training set: 0.970


In [84]:
#SVM

svc1 = SVC(random_state = 69, C = 1)
svc1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(svc1.score(x1, y1)))

svc2 = SVC(random_state = 69, C = 1)
svc2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(svc2.score(x2, y2)))

svc3 = SVC(random_state = 69, C = 1)
svc3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(svc3.score(x3, y3)))

Accuracy on training set: 0.818
Accuracy on training set: 0.848
Accuracy on training set: 0.848


In [85]:
#Log Regressor

clf1 =  LogisticRegression(random_state=69, C =5)
clf1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(clf1.score(x1, y1)))

clf2 = LogisticRegression(random_state=69, C = 5)
clf2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(clf2.score(x2, y2)))

clf3 = LogisticRegression(random_state=69, C = 3)
clf3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(clf3.score(x3, y3)))

Accuracy on training set: 0.763
Accuracy on training set: 0.833
Accuracy on training set: 0.909


In [86]:
#K-Nearest
neighbors = 5

knn1 = KNeighborsClassifier(n_neighbors=neighbors)
knn1.fit(x1, y1)
knn1_score = knn_big.score(x1, y1)

knn2 = KNeighborsClassifier(n_neighbors=neighbors)
knn2.fit(x2, y2)
knn2_score = knn_little.score(x2, y2)

knn3 = KNeighborsClassifier(n_neighbors=neighbors)
knn3.fit(x3, y3)
knn3_score = knn_comp.score(x3, y3)

knn_mean =  sum( [knn1_score, knn2_score, knn3_score])/ len( [knn1_score, knn2_score, knn3_score])

print("Accuracy on training set: {:.3f}".format(knn1_score))
print("Accuracy on training set: {:.3f}".format(knn2_score))
print("Accuracy on training set: {:.3f}".format(knn3_score))

Accuracy on training set: 0.703
Accuracy on training set: 0.727
Accuracy on training set: 0.606


In [87]:
# Naive Bayes

gnb1 = GaussianNB()
gnb1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(gnb1.score(x1, y1)))

gnb2 = GaussianNB()
gnb2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(gnb2.score(x2, y2)))

gnb3 = GaussianNB()
gnb3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(gnb3.score(x3, y3)))

Accuracy on training set: 0.703
Accuracy on training set: 0.712
Accuracy on training set: 0.848


In [88]:
# Decision tree

tree_depth = 7

DT1 = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT1.fit(x1, y1)
print("Accuracy on training set: {:.3f}".format(DT1.score(x1, y1)))

DT2 = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT2.fit(x2, y2)
print("Accuracy on training set: {:.3f}".format(DT2.score(x2, y2)))

DT3 = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = tree_depth)
DT3.fit(x3, y3)
print("Accuracy on training set: {:.3f}".format(DT3.score(x3, y3)))

Accuracy on training set: 0.877
Accuracy on training set: 0.894
Accuracy on training set: 0.879


# Simulation: Round Group Simulation

In [89]:
r64_r = pd.DataFrame(columns = df_headers)
r32_r = pd.DataFrame(columns = df_headers)
s16_r = pd.DataFrame(columns = df_headers)
e8_r = pd.DataFrame(columns = df_headers)
f4_r = pd.DataFrame(columns = df_headers)
c2_r= pd.DataFrame(columns = df_headers)
winner_r= pd.DataFrame(columns = df_headers)

In [90]:
#FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
b = "forest"
l = 'forest'
c = "forest"

w1 = globals()[b + '%s' % "1"]
w2 = globals()[l + '%s' % "2"]
w3 = globals()[c + '%s' % "3"]

In [91]:
test_df = pd.DataFrame(columns = df_headers[3:25])
y_pred = []
matchup_list = []
test_regions = [east25_df, west25_df, south25_df, midwest25_df]
#print('\033[1m' + str(learn) + '\033[0m' + "\n")
for x in test_regions: 
    r64_r = pd.concat([r64_r,x], ignore_index = True)
    round64 = x
    rounds = [round64, r32_r, s16_r, e8_r, f4_r]
    for r in range (0,len(rounds)-1):
        wins = []
        y = len(rounds[r])/2
        y = int(y)
        for x in range(0, y):
            h = rounds[r].iloc[x]
            l = rounds[r].iloc[-x-1]
            holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
            if holder.iloc[0]["SEED"] > 0: 
                holder = -holder
                static = h
                h = l
                l = static
            scaled = scale(holder)
            if r < 2:
                ups = w1.predict(scaled)
            else:
                ups = w2.predict(scaled)
            ups = list(ups)[0]
            print(h['SEED'], h['TEAM'], ' vs. ', l['SEED'], l['TEAM'])
            if ups == 0: 
                wins.append(h)
                print("Winner:", h['SEED'], h['TEAM'])
            if ups == 1:
                wins.append(l)
                print("Winner:", l['SEED'], l['TEAM'])
            test_df = pd.concat([test_df,holder], ignore_index = True)
            matchup = h['TEAM'] + ' vs. ' + l['TEAM']
            matchup_list.append(matchup)
            y_pred.append(ups)
        rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
        print("\n")
    print("_" * 40)

    r32_r = pd.concat([r32_r,rounds[1]], ignore_index = True)
    s16_r = pd.concat([s16_r,rounds[2]], ignore_index = True)
    e8_r = pd.concat([e8_r,rounds[3]], ignore_index = True)
    f4_r = pd.concat([f4_r,rounds[4]], ignore_index = True)

  r64_r = pd.concat([r64_r,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index = True)


1 Duke  vs.  16 Mount St. Mary’s
Winner: 1 Duke
2 Alabama  vs.  15 Robert Morris
Winner: 2 Alabama
3 Wisconsin  vs.  14 Montana
Winner: 3 Wisconsin
4 Arizona  vs.  13 Akron
Winner: 4 Arizona
5 Oregon  vs.  12 Liberty
Winner: 5 Oregon
6 BYU  vs.  11 VCU
Winner: 11 VCU
7 Saint Mary’s  vs.  10 Vanderbilt
Winner: 7 Saint Mary’s
8 Mississippi St.  vs.  9 Baylor
Winner: 8 Mississippi St.


1 Duke  vs.  8 Mississippi St.
Winner: 1 Duke
2 Alabama  vs.  7 Saint Mary’s
Winner: 2 Alabama
3 Wisconsin  vs.  11 VCU
Winner: 3 Wisconsin
4 Arizona  vs.  5 Oregon
Winner: 4 Arizona


1 Duke  vs.  4 Arizona
Winner: 1 Duke
2 Alabama  vs.  3 Wisconsin
Winner: 3 Wisconsin


1 Duke  vs.  3 Wisconsin
Winner: 1 Duke


________________________________________
1 Florida  vs.  16 Norfolk St.
Winner: 1 Florida


  r32_r = pd.concat([r32_r,rounds[1]], ignore_index = True)
  s16_r = pd.concat([s16_r,rounds[2]], ignore_index = True)
  e8_r = pd.concat([e8_r,rounds[3]], ignore_index = True)
  f4_r = pd.concat([f4_r,rounds[4]], ignore_index = True)


2 St. John’s  vs.  15 Nebraska Omaha
Winner: 2 St. John’s
3 Texas Tech  vs.  14 UNC Wilmington
Winner: 3 Texas Tech
4 Maryland  vs.  13 Grand Canyon
Winner: 4 Maryland
5 Memphis  vs.  12 Colorado St.
Winner: 12 Colorado St.
6 Missouri  vs.  11 Drake
Winner: 6 Missouri
7 Kansas  vs.  10 Arkansas
Winner: 7 Kansas
8 Connecticut  vs.  9 Oklahoma
Winner: 8 Connecticut


1 Florida  vs.  8 Connecticut
Winner: 1 Florida
2 St. John’s  vs.  7 Kansas
Winner: 2 St. John’s
3 Texas Tech  vs.  6 Missouri
Winner: 6 Missouri
4 Maryland  vs.  12 Colorado St.
Winner: 12 Colorado St.


1 Florida  vs.  12 Colorado St.
Winner: 1 Florida
2 St. John’s  vs.  6 Missouri
Winner: 6 Missouri


1 Florida  vs.  6 Missouri
Winner: 1 Florida


________________________________________
1 Auburn  vs.  16 Alabama St.
Winner: 1 Auburn
2 Michigan St.  vs.  15 Bryant
Winner: 2 Michigan St.
3 Iowa St.  vs.  14 Lipscomb
Winner: 3 Iowa St.
4 Texas A&M  vs.  13 Yale
Winner: 4 Texas A&M
5 Michigan  vs.  12 UC San Diego
Winner: 12

## FINAL 4 SIM

In [92]:
final_four = [f4_r, c2_r, winner_r]
for r in range (0,len(final_four)-1):
    wins = []
    y = len(final_four[r])/2
    y = int(y)
    for x in range(0, y):
        h = final_four[r].iloc[x]
        l = final_four[r].iloc[-x-1]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
        scaled = scale(holder)

        ups = w3.predict(scaled)

        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            wins.append(h)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            wins.append(l)
            print("Winner:", l['SEED'], l['TEAM'])
        test_df = pd.concat([test_df,holder], ignore_index=True)
        matchup_list.append(matchup)
        y_pred.append(ups)
    final_four[r+1] = pd.DataFrame(data = wins, columns = df_headers)
df_c2 = final_four[1]
df_winner = final_four[2]

1 Duke vs. 5 Clemson
Winner: 1 Duke
1 Florida vs. 12 UC San Diego
Winner: 1 Florida
1 Duke vs. 1 Florida
Winner: 1 Duke


In [93]:
test_df['UPSET'] = y_pred

In [94]:
test_df["MATCHUP"] = matchup_list

In [95]:
df_winner

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,POWER,YEAR
0,Duke,ACC,,34,31,0.911765,128.4,91.3,0.9807,57.4,...,25.4,58.0,43.4,37.7,30.9,65.7,9.6,1,1,2025


## Single Matchup Test

In [229]:
single = region_df(["Mississippi St.","Baylor"],ncaam25) #["Alabama St.", "Saint Francis"]

In [230]:
learners = ["knn", "DT", "forest", "mlp", "clf", "gnb"] #, "svc"]

In [231]:
single

Unnamed: 0,TEAM,CONF,POSTSEASON,G,W,WIN_PER,ADJOE,ADJDE,BARTHAG,EFG_O,...,SEED,POWER,YEAR,R64,R32,S16,E8,F4,C2,Champions
0,Mississippi St.,SEC,,33,21,0.636364,118.3,98.8,0.8888,51.7,...,8,1,2025,0,0,0,0,0,0,0
1,Baylor,B12,,33,19,0.575758,120.5,99.3,0.9028,51.5,...,9,1,2025,0,0,0,0,0,0,0


In [234]:
for j in learners:
    h = single.iloc[0]
    l = single.iloc[1]
    holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
    scaled = scale(holder)
    if holder.iloc[0]["SEED"] > 0: 
        holder = -holder
        static = h
        h = l
        l = static 

    
    pred = globals()[j + '%s' % "_little"] #3
    ups = pred.predict(scaled)
    print(pred.predict_proba(scaled))


    ups = list(ups)[0]
    matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
    print(matchup)
    if ups == 0: 
        wins.append(h)
        print("Winner:", h['SEED'], h['TEAM'])
    if ups == 1:
        wins.append(l)
        print("Winner:", l['SEED'], l['TEAM'])
    print("_"*40)

[[0.6 0.4]]
8 Mississippi St. vs. 9 Baylor
Winner: 8 Mississippi St.
________________________________________
[[0. 1.]]
8 Mississippi St. vs. 9 Baylor
Winner: 9 Baylor
________________________________________
[[0.65087432 0.34912568]]
8 Mississippi St. vs. 9 Baylor
Winner: 8 Mississippi St.
________________________________________
[[0.44656775 0.55343225]]
8 Mississippi St. vs. 9 Baylor
Winner: 9 Baylor
________________________________________
[[0.27947836 0.72052164]]
8 Mississippi St. vs. 9 Baylor
Winner: 9 Baylor
________________________________________
[[0.27590438 0.72409562]]
8 Mississippi St. vs. 9 Baylor
Winner: 9 Baylor
________________________________________


In [233]:
for j in learners:
    h = single.iloc[1]
    l = single.iloc[0]
    holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25]).sort_values(by = "SEED")
    scaled = scale(holder)
    
    pred = globals()[j + '%s' % "1"] #1
    ups = pred.predict(scaled)

    ups = list(ups)[0]
    matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
    print(matchup)
    if ups == 0: 
        wins.append(h)
        print("Winner:", h['SEED'], h['TEAM'])
    if ups == 1:
        wins.append(l)
        print("Winner:", l['SEED'], l['TEAM'])
    print("_"*40)

9 Baylor vs. 8 Mississippi St.
Winner: 9 Baylor
________________________________________
9 Baylor vs. 8 Mississippi St.
Winner: 9 Baylor
________________________________________
9 Baylor vs. 8 Mississippi St.
Winner: 8 Mississippi St.
________________________________________
9 Baylor vs. 8 Mississippi St.
Winner: 8 Mississippi St.
________________________________________
9 Baylor vs. 8 Mississippi St.
Winner: 9 Baylor
________________________________________
9 Baylor vs. 8 Mississippi St.
Winner: 8 Mississippi St.
________________________________________


### Conference Tournament

In [None]:
def bye_split(team_cnt):
    x = 0
    while 2**x < team_cnt:
        binary_rnd = 2**x 
        x += 1
    
    second_round = binary_rnd + (binary_rnd - team_cnt)
    
    return second_round

In [None]:
def first_rnd_bye(df):
    teams = bye_split(len(df))
    bye_teams = df.iloc[:teams]
    first_round = df.iloc[teams:]
    
    return bye_teams, first_round
    

In [None]:
big_east = ['Connecticut',
 'Creighton',
 'Marquette',
 'Seton Hall',
 "St. John's",
 'Villanova',
 'Providence',
 'Butler',
 'Xavier',
 'Georgetown',
 'DePaul']

In [None]:
def region_df(x,y):
    v = pd.DataFrame(x)
    v = v.rename(columns={0: "TEAM"})
    v = v.merge(y, on = 'TEAM', how='left')
    return v

In [None]:
be_tourn = region_df(big_east, ncaam24)

In [None]:
be_tourn.SEED = be_tourn.index + 1

In [None]:
r2, r1 = first_rnd_bye(be_tourn)

In [None]:
if len(r1) > 0:
    for x in range(int(len(be24[1])/2)):
        h = r1.iloc[x]
        l = r1.iloc[(-x-1)]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
        scaled = scale(holder)
        

        
        pred = forest_comp
        ups = pred.predict(scaled)

        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            r2 = r2.append(h, ignore_index=True)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            r2 = r2.append(l, ignore_index=True)
            print("Winner:", l['SEED'], l['TEAM'])
        print("_"*40)
  

In [None]:
r = 2
while len(globals()["r" + '%s' % str(r)]) != 1:
    curr_round = globals()["r" + '%s' % str(r)]
    
    globals()["r" + '%s' % str((r+1))] = pd.DataFrame(columns = df_headers)
    for x in range(int(len(curr_round)/2)):
        h = curr_round.iloc[x]
        l = curr_round.iloc[(-x-1)]
        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
        scaled = scale(holder)
        
        if holder.iloc[0]["SEED"] < seed_cutoff_high:
            pred = mlp_big
            #print(big.predict_proba(scaled))
        elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
            pred = mlp_little
            #print(little.predict_proba(scaled))
        else:
            pred = mlp_comp
            #print(comp.predict_proba(scaled))
        
        ups = pred.predict(scaled)

        ups = list(ups)[0]
        matchup = str(h['SEED']) + " "+ h['TEAM']+  ' vs. '+ str(l['SEED'])+ " "+ l['TEAM']
        print(matchup)
        if ups == 0: 
            globals()["r" + '%s' % str((r+1))] = globals()["r" + '%s' % str((r+1))].append(h, ignore_index=True)
            print("Winner:", h['SEED'], h['TEAM'])
        if ups == 1:
            globals()["r" + '%s' % str((r+1))] = globals()["r" + '%s' % str((r+1))].append(l, ignore_index=True)
            print("Winner:", l['SEED'], l['TEAM'])
        print("_"*40)
    
    
    r += 1

In [None]:
learners = ["knn", "DT", "forest", "mlp", "clf", "gnb", "svc"]

### COEF Importance

In [None]:
#Lasso

X = train_big[0]
Y = train_big[1]

x = train_little[0]
y = train_little[1]

train_x = train_comp[0]
train_y = train_comp[1]

from sklearn.linear_model import Lasso
lasso_big = Lasso(alpha=0.005)
lasso_big.fit(X,Y)

lasso_little = Lasso(alpha=0.005)
lasso_little.fit(x,y)

lasso_comp = Lasso(alpha=0.005)
lasso_comp.fit(train_x,train_y)

pd.DataFrame([lasso_big.coef_, lasso_little.coef_, lasso_comp.coef_], columns  = new_columns[3:25], index=["big_upset", "little_upset", "competative"])

### Simulated Final Fours

In [96]:
scenarios = ''

In [97]:
all_scen = []
s16_scens = []
learning = []
upset_count = []
winners= []

In [98]:
for f in range(7):
    for g in range(7):

        r64 = pd.DataFrame(columns = df_headers)
        r32 = pd.DataFrame(columns = df_headers)
        s16 = pd.DataFrame(columns = df_headers)
        e8 = pd.DataFrame(columns = df_headers)
        f4 = pd.DataFrame(columns = df_headers)
        c2= pd.DataFrame(columns = df_headers)
        winner= pd.DataFrame(columns = df_headers)

        #FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
        learners = ["knn", "DT", "forest", "mlp", "clf", "gnb", "svc"]
        b = learners[f]
        l = learners[g]

        w1 = globals()[b + '%s' % "1"]
        w2 = globals()[l + '%s' % "2"]

        test_df = pd.DataFrame(columns = df_headers[3:25])
        y_pred = []
        matchup_list = []
        test_regions = [east25_df, west25_df, south25_df, midwest25_df]

        for x in test_regions: 
            r64 = pd.concat([r64,x], ignore_index = True)
            round64 = x
            rounds = [round64, r32, s16, e8, f4]
            for r in range (0,len(rounds)-1):
                wins = []
                y = len(rounds[r])/2
                y = int(y)
                for x in range(0, y):
                    base_count = 0

                    h = rounds[r].iloc[x]
                    l = rounds[r].iloc[-x-1]
                    holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
                    if holder.iloc[0]["SEED"] > 0: 
                        holder = -holder
                        static = l
                        l = h
                        h = static
                    scaled = scale(holder)
                    if r < 2:
                        ups = w1.predict(scaled)
                    else:
                        ups = w2.predict(scaled)
#                    if holder.iloc[0]["SEED"] < seed_cutoff_high:
#                       ups = big.predict(scaled)
#                        #print(big.predict_proba(scaled))
#                    elif  holder.iloc[0]["SEED"] > seed_cutoff_low:
#                        ups = little.predict(scaled)
#                        #print(little.predict_proba(scaled))
#                    else:
#                        ups = comp.predict(scaled)
#                        #print(comp.predict_proba(scaled))
                    
                    ups = list(ups)[0]
                    if ups == 0: 
                        wins.append(h)
                    if ups == 1:
                        wins.append(l)
                        base_count += 1
                    test_df = pd.concat([test_df,holder], ignore_index=True)
                    matchup = h['TEAM'] + ' vs. ' + l['TEAM']
                    matchup_list.append(matchup)
                    y_pred.append(ups)

                    upset_count.append(base_count)
                rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
                
                
            r32 = pd.concat([r32,rounds[1]], ignore_index=True)
            s16 = pd.concat([s16,rounds[2]], ignore_index=True)
            e8 = pd.concat([e8,rounds[3]], ignore_index=True)
            f4 = pd.concat([f4,rounds[4]], ignore_index=True)
            winners.append(rounds[1:])
        
        learning.append(learners[f] + " " + learners[g])
        all_scen.append(list(f4.TEAM))
        s16_scens.append(list(s16.TEAM))
        

  r64 = pd.concat([r64,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index=True)
  r32 = pd.concat([r32,rounds[1]], ignore_index=True)
  s16 = pd.concat([s16,rounds[2]], ignore_index=True)
  e8 = pd.concat([e8,rounds[3]], ignore_index=True)
  f4 = pd.concat([f4,rounds[4]], ignore_index=True)
  r64 = pd.concat([r64,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index=True)
  r32 = pd.concat([r32,rounds[1]], ignore_index=True)
  s16 = pd.concat([s16,rounds[2]], ignore_index=True)
  e8 = pd.concat([e8,rounds[3]], ignore_index=True)
  f4 = pd.concat([f4,rounds[4]], ignore_index=True)
  r64 = pd.concat([r64,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index=True)
  r32 = pd.concat([r32,rounds[1]], ignore_index=True)
  s16 = pd.concat([s16,rounds[2]], ignore_index=True)
  e8 = pd.concat([e8,rounds[3]], ignore_index=True)
  f4 = pd.concat([f4,rounds[4]], ignore_index=True)
  r64 = pd.concat([r64,x], ignore_index = T

### Final Four Proportions

In [99]:
len(upset_count)/15/49

4.0

In [100]:
all_scen1 = []
for j in all_scen:
    all_scen1.append(list(j))

In [101]:
scenarios = pd.DataFrame(all_scen1, columns = ["East", "Midwest", "West", "South"])

In [102]:
scenarios.index = learning

In [103]:
max_east = max(scenarios.groupby(["East"]).count()["South"])
scenarios.groupby(["East"]).count()["South"].sort_values(ascending = False)/max_east*100

East
Duke    100.000000
VCU       2.083333
Name: South, dtype: float64

In [104]:
max_south = max(scenarios.groupby(["South"]).count()["East"])
scenarios.groupby(["South"]).count()["East"].sort_values(ascending = False)/max_south*100

South
Houston      100.000000
Tennessee     35.483871
Purdue        12.903226
Clemson        6.451613
Kentucky       3.225806
Name: East, dtype: float64

In [105]:
max_midwest = max(scenarios.groupby(["Midwest"]).count()["East"])
scenarios.groupby(["Midwest"]).count()["East"].sort_values(ascending = False)/max_midwest*100

Midwest
Florida    100.0
Name: East, dtype: float64

In [106]:
max_west = max(scenarios.groupby(["West"]).count()["East"])
scenarios.groupby(["West"]).count()["East"].sort_values(ascending = False)/max_west*100

West
Auburn            100.000000
Louisville         13.888889
UC San Diego       11.111111
North Carolina      8.333333
Michigan            2.777778
Name: East, dtype: float64

In [107]:
scenarios.to_csv('Sims/sims25.csv')

In [108]:
region_16 = (['east']*4 + ['midwest']*4 + ['west']*4  + ['south'] *4) * 49

In [109]:
s16_scens1 =  [j for i in s16_scens for j in i]

In [110]:
scenarios16 = pd.DataFrame({"Team" : s16_scens1, "Region": region_16}, columns = ["Team", "Region"])

In [111]:

#sc.groupby(["East"]).count()["South"].sort_values(ascending = False)

In [112]:
for region in ['east','midwest', 'west', 'south']:
    r16_sim = list(scenarios16[scenarios16["Region"] == region].Team)
    
    counts = {}
    
    for item in r16_sim:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    sorted_counts = sorted(counts.items(), key=lambda item: item[1], reverse=True)

    for item, count in sorted_counts:
        print(f"{item}: {count}")

    print("-" * 50)

Duke: 49
Alabama: 42
Liberty: 35
Wisconsin: 35
BYU: 7
Vanderbilt: 7
Oregon: 7
Arizona: 7
VCU: 7
--------------------------------------------------
Florida: 49
St. John’s: 42
Texas Tech: 35
Maryland: 28
Colorado St.: 21
Missouri: 14
Kansas: 7
--------------------------------------------------
Auburn: 42
UC San Diego: 35
Michigan St.: 28
North Carolina: 21
Marquette: 14
Mississippi: 14
Lipscomb: 7
Texas A&M: 7
Louisville: 7
Iowa St.: 7
New Mexico: 7
Michigan: 7
--------------------------------------------------
Houston: 49
Tennessee: 42
Kentucky: 42
Purdue: 28
Clemson: 14
UCLA: 7
Xavier: 7
McNeese St.: 7
--------------------------------------------------


### Round Breakdown

In [113]:
len(winners)

196

In [114]:
r32all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])
s16all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])
e8all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])
f4all = pd.DataFrame(columns = ['TEAM','SEED','CONF'])

In [115]:
for x in winners:
    r32all = pd.concat([r32all,x[0][['TEAM','SEED','CONF']]], ignore_index=True)
    s16all = pd.concat([s16all, x[1][['TEAM','SEED','CONF']]] , ignore_index=True)
    e8all = pd.concat([e8all, x[2][['TEAM','SEED','CONF']]] , ignore_index=True)
    f4all = pd.concat([f4all, x[3][['TEAM','SEED','CONF']]] , ignore_index=True)

In [116]:
def region_setter(number):
    regions = ["east", "midwest", "south", "west"]
    n = int(number/ 196)
    
    region_col = []
    for i in regions:
        x1 = n * [i]
        region_col.extend(x1)
    return region_col * 49

In [117]:
def learner_setter(number):
    scens = list(scenarios.index)
    n = int(number/ 49)
    
    learner_col = []
    for i in scens:
        x1 = n * [i]
        learner_col.extend(x1)
    return learner_col

In [118]:
r32all

Unnamed: 0,TEAM,SEED,CONF
0,Duke,1,ACC
1,Alabama,2,SEC
2,Wisconsin,3,B10
3,Arizona,4,B12
4,Liberty,12,CUSA
...,...,...,...
1563,Purdue,4,B10
1564,Clemson,5,ACC
1565,Illinois,6,B10
1566,Utah St,10,MWC


In [119]:
r32all["REGION"] = region_setter(len(r32all))
r32all["LEARNER"] = learner_setter(len(r32all))

s16all["REGION"] = region_setter(len(s16all))
s16all["LEARNER"] = learner_setter(len(s16all))

e8all["REGION"] = region_setter(len(e8all))
e8all["LEARNER"] = learner_setter(len(e8all))

f4all["REGION"] = region_setter(len(f4all))
f4all["LEARNER"] = learner_setter(len(f4all))

In [120]:
max_f4 = max(f4all.groupby(["TEAM"]).count()["LEARNER"])
f4all.groupby(["TEAM", "REGION"]).count()["LEARNER"].sort_values(ascending = False)/max_f4*100

TEAM            REGION 
Florida         midwest    100.000000
Duke            east        97.959184
Auburn          south       73.469388
Houston         west        63.265306
Tennessee       west        22.448980
Louisville      south       10.204082
Purdue          west         8.163265
UC San Diego    south        8.163265
North Carolina  south        6.122449
Clemson         west         4.081633
Kentucky        west         2.040816
Michigan        south        2.040816
VCU             east         2.040816
Name: LEARNER, dtype: float64

In [None]:
max_SEED32 = max(r32all.groupby(["SEED"]).count()["LEARNER"])
r32all.groupby(["SEED"]).count()["LEARNER"].sort_values(ascending = False)/max_SEED32*100

In [None]:
r32all[r32all['SEED'] >= 11].groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
s16all[s16all['SEED'] >= 6].groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
f4all.groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
r32all[r32all['REGION'] == "midwest"].groupby(['TEAM',"SEED","REGION"]).count()["LEARNER"].sort_values(ascending = False)

In [None]:
scenarios1 = scenarios[scenarios["East"] == "Tennessee"]

In [None]:
east_seed = []
for x in scenarios['East']:
    east_seed.append(east23.index(x)+1)

In [None]:
south_seed = []
for x in scenarios['South']:
    south_seed.append(south23.index(x)+1)

In [None]:
west_seed = []
for x in scenarios['West']:
    west_seed.append(west23.index(x)+1)

In [None]:
midwest_seed = []
for x in scenarios['Midwest']:
    midwest_seed.append(midwest23.index(x)+1)

In [None]:
scenarios['East_seed'] = east_seed
scenarios['west_seed'] = west_seed
scenarios['midwest_seed'] = midwest_seed

scenarios['south_seed'] = south_seed

In [None]:
scenarios['seed_sum'] = scenarios['East_seed'] + scenarios['south_seed'] + scenarios['west_seed'] + scenarios['midwest_seed']

In [None]:
scenarios

In [None]:
learning

In [None]:
region_upset_sum = []
for i in range (0, len(upset_count), 15):
    x = i
    region_upset_sum.append(upset_count[x:x+15])

In [None]:
len(region_upset_sum[1])

In [None]:
leg_split = []
for y in region_upset_sum:
    ind_list = []
    ind_list.append(sum(y[0:7]))
    ind_list.append(sum(y[8:11]))
    ind_list.append(sum(y[12:13]))
    ind_list.append(y[14])
    leg_split.append(ind_list)

In [None]:
upset_col = [
    "east64","east32","east16","east8", 
    "south64","south32","south16","south8",
    'midwest64','midwest32','midwest16','midwest8',
    'west64','west32','west16','west8']

In [None]:
ncaam[ncaam['POSTSEASON'] in  ['2ND', 'F4', 'Champions'] ]

In [None]:
sum(y1)/len(y1)

### PROBS OF THE ROUND OF 64

In [121]:
probs = []
r32_teams = []


In [122]:


for f in range(6):


    #FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
    learners = ["knn", "DT", "forest", "mlp", "clf", "gnb"]
    b = learners[f]

    w1 = globals()[b + '%s' % "1"]

    test_df = pd.DataFrame(columns = df_headers[3:25])
    y_pred = []
    matchup_list = []
    test_regions = [east25_df, west25_df, south25_df, midwest25_df]

    for x in test_regions: 
        r64 = pd.concat([r64,x], ignore_index = True)
        round64 = x
        wins = []
        for x in range(0, 8):

            h = round64.iloc[x]
            l = round64.iloc[-x-1]
            holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
            if holder.iloc[0]["SEED"] > 0: 
                holder = -holder
            scaled = scale(holder)
            ups = w1.predict(scaled)

            r32_teams.append(l)
            prob = w1.predict_proba(scaled)
            probs.append(prob[0][1])

            
            ups = list(ups)[0]
            if ups == 0: 
                wins.append(h)
            if ups == 1:
                wins.append(l)
                

            
            matchup = h['TEAM'] + ' vs. ' + l['TEAM']
            matchup_list.append(matchup)
            y_pred.append(ups)

        

In [123]:
len(r32_teams) == len(probs)

True

In [124]:
r32_probs = pd.DataFrame(data=r32_teams, columns = df_headers)
r32_probs["probs"] = probs


In [125]:
def region_setter_probs():
    regions = ["east", "west", "south", "midwest"]
    
    region_col = []
    for i in regions:
        x1 = 8 * [i]
        region_col.extend(x1)
    return region_col * 6

In [126]:
r32_probs["REGION"]=region_setter_probs()

In [127]:
r32_probs_short = r32_probs[['TEAM','SEED','REGION',"probs"]]

In [128]:
r32_probs_short.groupby(["SEED"])["probs"].mean().sort_values(ascending=False)

SEED
12    0.603416
11    0.493771
9     0.470982
10    0.433632
13    0.207944
14    0.173680
15    0.106959
16    0.027754
Name: probs, dtype: float64

In [166]:
r32_probs_short[r32_probs_short['TEAM']=="Bryant"]

Unnamed: 0,TEAM,SEED,REGION,probs
14,Bryant,15,south,0.4
14,Bryant,15,south,0.0
14,Bryant,15,south,0.287468
14,Bryant,15,south,0.203462
14,Bryant,15,south,0.142668
14,Bryant,15,south,0.00011


In [129]:
r32_probs_short.groupby(['TEAM', "SEED", "REGION"])["probs"].mean().sort_values(ascending=False)

TEAM              SEED  REGION 
Colorado St.      12    west       0.759212
UC San Diego      12    south      0.663870
Liberty           12    east       0.641629
Baylor            9     east       0.576509
Oklahoma          9     west       0.569594
VCU               11    east       0.563480
North Carolina    11    south      0.527848
Utah St           10    midwest    0.516280
Xavier            11    midwest    0.467331
Arkansas          10    west       0.447335
Vanderbilt        10    east       0.443437
Drake             11    west       0.416424
Creighton         9     south      0.386461
Georgia           9     midwest    0.351364
McNeese St.       12    midwest    0.348953
High Point        13    midwest    0.345605
New Mexico        10    south      0.327476
Lipscomb          14    south      0.302714
UNC Wilmington    14    west       0.193387
Akron             13    east       0.190216
Yale              13    south      0.184255
Bryant            15    south      0.172285


### Randomness

In [130]:
import numpy as np

In [131]:
r64_prob = pd.DataFrame(columns = df_headers)
r32_prob = pd.DataFrame(columns = df_headers)
s16_prob = pd.DataFrame(columns = df_headers)
e8_prob = pd.DataFrame(columns = df_headers)
f4_prob = pd.DataFrame(columns = df_headers)
#c2= pd.DataFrame(columns = df_headers)
#winner= pd.DataFrame(columns = df_headers)
model_type = ['knn', 'DT', 'forest', 'mlp', 'clf', 'gnb']

In [132]:
#FINE TUNE [knn, DT, forest, mlp, clf, gnb, svc]
#probas not good svc

b = "mlp"
l = "mlp"
c = "mlp"

big = globals()[b + '%s' % "_big"]
little = globals()[l + '%s' % "_little"]
comp= globals()[c + '%s' % "_comp"]

In [133]:
for c in model_type:
    comp= globals()[c + '%s' % "_comp"]
    
    for l in model_type:      
        little = globals()[l + '%s' % "_little"]
        
        for b in model_type:
            big = globals()[b + '%s' % "_big"]
            
            test_df = pd.DataFrame(columns = df_headers[3:25])
            y_pred = []
            matchup_list = []
            test_regions = [east25_df, midwest25_df, south25_df, west25_df]
            #print('\033[1m' + str(learn) + '\033[0m' + "\n")
            for x in test_regions: 
                r64 = pd.concat([r64_prob,x], ignore_index = True)
                round64 = x
                rounds = [round64, r32_prob, s16_prob, e8_prob, f4_prob]
                for r in range (0,len(rounds)-1):
                    wins = []
                    y = len(rounds[r])/2
                    y = int(y)
                    for x in range(0, y):
                        random = np.random.rand()**2
                        h = rounds[r].iloc[x]
                        l = rounds[r].iloc[-x-1]
                        holder = pd.DataFrame([get_upset_differences(h,l)], columns = df_headers[3:25])
                        if holder.iloc[0]["SEED"] > 0: 
                            holder = -holder
                            static = h
                            h = l
                            l = static
                        scaled = scale(holder)
                        if holder.iloc[0]["SEED"] < seed_cutoff_low:
                            ups = big.predict_proba(scaled)
                        elif  holder.iloc[0]["SEED"] > seed_cutoff_high:
                            ups = little.predict_proba(scaled)
                        else:
                            ups = comp.predict_proba(scaled)
                        ups = ups[0][0]
                    
                        #print(ups,random)
                        #print(h['SEED'], h['TEAM'], ' vs. ', l['SEED'], l['TEAM'])
                        if ups > random: 
                            wins.append(h)
                            #print("Winner:", h['SEED'], h['TEAM'])
                        if ups <= random:
                            wins.append(l)
                            #print("Winner:", l['SEED'], l['TEAM'])
                        test_df = pd.concat([test_df,holder], ignore_index = True)
                        matchup = h['TEAM'] + ' vs. ' + l['TEAM']
                        matchup_list.append(matchup)
                        y_pred.append(ups)
                    rounds[r+1] = pd.DataFrame(data = wins, columns = df_headers)
                    #print('\n')
                #print("_" * 80)
            
                r32_prob = pd.concat([r32_prob,rounds[1]], ignore_index = True)
                s16_prob = pd.concat([s16_prob,rounds[2]], ignore_index = True)
                e8_prob = pd.concat([e8_prob,rounds[3]], ignore_index = True)
                f4_prob = pd.concat([f4_prob,rounds[4]], ignore_index = True)
                

  r64 = pd.concat([r64_prob,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index = True)
  r32_prob = pd.concat([r32_prob,rounds[1]], ignore_index = True)
  s16_prob = pd.concat([s16_prob,rounds[2]], ignore_index = True)
  e8_prob = pd.concat([e8_prob,rounds[3]], ignore_index = True)
  f4_prob = pd.concat([f4_prob,rounds[4]], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  test_df = pd.concat([test_df,holder], ignore_index = True)
  r64 = pd.concat([r64_prob,x], ignore_index = True)
  r64 = pd.concat([r64_prob

In [140]:
s16_out = s16_prob[["TEAM","SEED"]]
s16_out['REGION'] = (["east"]*4 + ["midwest"]*4 + ["south"]*4 + ["west"]*4) * 216

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  s16_out['REGION'] = (["east"]*4 + ["midwest"]*4 + ["south"]*4 + ["west"]*4) * 216


In [150]:
s16_out[s16_out["REGION"] == "east"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM              SEED
Duke              1       0.814815
Arizona           4       0.638889
Wisconsin         3       0.532407
Alabama           2       0.527778
Saint Mary’s      7       0.458333
BYU               6       0.324074
Liberty           12      0.157407
Oregon            5       0.157407
Mississippi St.   8       0.148148
VCU               11      0.134259
Akron             13      0.046296
Baylor            9       0.032407
Montana           14      0.009259
Robert Morris     15      0.009259
Mount St. Mary’s  16      0.004630
Vanderbilt        10      0.004630
Name: REGION, dtype: float64

In [151]:
s16_out[s16_out["REGION"] == "south"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM            SEED
Michigan St.    2       0.787037
Auburn          1       0.773148
Iowa St.        3       0.680556
Texas A&M       4       0.560185
Michigan        5       0.217593
UC San Diego    12      0.185185
Marquette       7       0.166667
Mississippi     6       0.166667
Louisville      8       0.152778
North Carolina  11      0.111111
Creighton       9       0.069444
Lipscomb        14      0.041667
New Mexico      10      0.041667
Yale            13      0.037037
Alabama St.     16      0.004630
Bryant          15      0.004630
Name: REGION, dtype: float64

In [152]:
s16_out[s16_out["REGION"] == "midwest"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM         SEED
Houston      1       0.787037
Kentucky     3       0.680556
Tennessee    2       0.652778
Purdue       4       0.518519
Clemson      5       0.240741
UCLA         7       0.226852
Illinois     6       0.212963
Gonzaga      8       0.194444
High Point   13      0.138889
McNeese St.  12      0.101852
Utah St      10      0.097222
Xavier       11      0.097222
Wofford      15      0.023148
Georgia      9       0.018519
Troy         14      0.009259
Name: REGION, dtype: float64

In [153]:
s16_out[s16_out["REGION"] == "west"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM            SEED
Florida         1       0.777778
Texas Tech      3       0.708333
St. John’s      2       0.652778
Maryland        4       0.532407
Colorado St.    12      0.208333
Memphis         5       0.208333
Missouri        6       0.199074
Connecticut     8       0.194444
Nebraska Omaha  15      0.162037
Kansas          7       0.152778
Drake           11      0.083333
Grand Canyon    13      0.050926
Arkansas        10      0.032407
Oklahoma        9       0.018519
Norfolk St.     16      0.009259
UNC Wilmington  14      0.009259
Name: REGION, dtype: float64

In [164]:
s16_out[s16_out["SEED"] >= 9].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM              SEED
Colorado St.      12      0.208333
UC San Diego      12      0.185185
Nebraska Omaha    15      0.162037
Liberty           12      0.157407
High Point        13      0.138889
VCU               11      0.134259
North Carolina    11      0.111111
McNeese St.       12      0.101852
Xavier            11      0.097222
Utah St           10      0.097222
Drake             11      0.083333
Creighton         9       0.069444
Grand Canyon      13      0.050926
Akron             13      0.046296
Lipscomb          14      0.041667
New Mexico        10      0.041667
Yale              13      0.037037
Baylor            9       0.032407
Arkansas          10      0.032407
Wofford           15      0.023148
Oklahoma          9       0.018519
Georgia           9       0.018519
Norfolk St.       16      0.009259
Montana           14      0.009259
Robert Morris     15      0.009259
Troy              14      0.009259
UNC Wilmington    14      0.009259
Alabama St.       16      0.0046

In [156]:
f4_out = f4_prob[["TEAM","SEED"]]
f4_out['REGION'] = ["east","midwest","south", "west"] * 216

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  f4_out['REGION'] = ["east","midwest","south", "west"] * 216


In [157]:
f4_out[f4_out["REGION"] == "east"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM             SEED
Duke             1       0.680556
Alabama          2       0.069444
Wisconsin        3       0.064815
BYU              6       0.060185
Arizona          4       0.041667
Saint Mary’s     7       0.032407
Liberty          12      0.013889
Mississippi St.  8       0.013889
Oregon           5       0.013889
VCU              11      0.009259
Name: REGION, dtype: float64

In [162]:
f4_out[f4_out["REGION"] == "south"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM            SEED
Auburn          1       0.620370
Michigan St.    2       0.120370
Iowa St.        3       0.097222
UC San Diego    12      0.041667
Louisville      8       0.037037
Texas A&M       4       0.032407
Mississippi     6       0.018519
North Carolina  11      0.018519
Michigan        5       0.009259
Marquette       7       0.004630
Name: REGION, dtype: float64

In [161]:
f4_out[f4_out["REGION"] == "midwest"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM         SEED
Houston      1       0.574074
Tennessee    2       0.194444
Kentucky     3       0.069444
Clemson      5       0.060185
Purdue       4       0.037037
Gonzaga      8       0.027778
Illinois     6       0.018519
Utah St      10      0.009259
McNeese St.  12      0.004630
UCLA         7       0.004630
Name: REGION, dtype: float64

In [163]:
f4_out[f4_out["REGION"] == "west"].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM            SEED
Florida         1       0.560185
St. John’s      2       0.194444
Texas Tech      3       0.115741
Missouri        6       0.037037
Maryland        4       0.032407
Memphis         5       0.023148
Drake           11      0.009259
Kansas          7       0.009259
Colorado St.    12      0.004630
Connecticut     8       0.004630
Grand Canyon    13      0.004630
Nebraska Omaha  15      0.004630
Name: REGION, dtype: float64

In [165]:
f4_out[f4_out["SEED"] >= 5].groupby(['TEAM',"SEED"]).count()["REGION"].sort_values(ascending = False)/216

TEAM             SEED
BYU              6       0.060185
Clemson          5       0.060185
UC San Diego     12      0.041667
Louisville       8       0.037037
Missouri         6       0.037037
Saint Mary’s     7       0.032407
Gonzaga          8       0.027778
Memphis          5       0.023148
Illinois         6       0.018519
North Carolina   11      0.018519
Mississippi      6       0.018519
Mississippi St.  8       0.013889
Oregon           5       0.013889
Liberty          12      0.013889
Kansas           7       0.009259
Utah St          10      0.009259
Drake            11      0.009259
Michigan         5       0.009259
VCU              11      0.009259
Nebraska Omaha   15      0.004630
Grand Canyon     13      0.004630
McNeese St.      12      0.004630
Connecticut      8       0.004630
Colorado St.     12      0.004630
UCLA             7       0.004630
Marquette        7       0.004630
Name: REGION, dtype: float64