# Part 4 - Stage 2 Predictions

The purpose of this notebook is to load existing stage 2 data, create a final model, make predictions, and simulate a tournament. 

## Library Imports

Python is an incredibly flexible language, partly because of how modular it is. We can extend its basic functionality by importing third-party libraries.

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pkg_resources

from binaryTree import Node
from PIL import Image, ImageDraw

import random

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

cwd = os.getcwd()

## Train the Model

In [2]:
training_set = pd.read_csv("training_set.csv")
training_set_stage2 = pd.read_csv("training_set_stage2.csv")

If you are unfamiliar with these statistics, here is a quick description.

Field descriptions:  
Seed: team's tournament seed  
WinPct: team's winning percentage  
PointsFor: average points scored per game  
PointsAgainst: average points scored against the team per game  
FGM: field goals made per game  
FGA: field goals attempted per game  
FGM3: three-point field goals made per game  
FGA3: three-point field goals attempted per game  
FTM: free throws made per game  
FTA: free throws attempted per game  
OR: offensive rebounds per game  
DR: defensive rebounds per game  
Ast: assists per game  
TO: turnovers per game  
Stl: steals per game  
Blk: blocks per game  
PF: personal fouls per game  

Select the features you want to use for the final model. All available features are shown in blue. Add your desired features to the "cols" list.

Remember, some features will count twice!

In [3]:
# cols = ['deltaSeed', 'deltaWinPct','deltaPointsFor','deltaPointsAgainst','deltaFGM','deltaFGA','deltaFGM3','deltaFGA3','deltaFTM',
#         'deltaFTA','deltaOR','deltaDR','deltaAst','deltaTO','deltaStl','deltaBlk','deltaPF']
cols = ['deltaSeed', 'deltaFGM', 'deltaAst', 'deltaAst', 'deltaBlk']
cols

['deltaSeed', 'deltaFGM', 'deltaAst', 'deltaAst', 'deltaBlk']
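
The `cols` list above includes `deltaAst` twice, so that column would appear twice when `cols` is used to select features. If the repeat is deliberate, leave it as is. If it is not, one way to drop repeats while keeping the original order is sketched below; `unique_cols` is an illustrative name and not part of the notebook.

```python
cols = ['deltaSeed', 'deltaFGM', 'deltaAst', 'deltaAst', 'deltaBlk']

# dict.fromkeys keeps the first occurrence of each name and preserves order
unique_cols = list(dict.fromkeys(cols))
print(unique_cols)
```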

Now define the training features and labels based on the cols variable.

In [4]:
X_train = training_set[cols]
y_train = training_set['Result']

We now have our training set.
Next, decide which model you want to use and run the cell for that model (uncomment it first if it is commented out).

### Random Forest Classifier

In [5]:
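# Fit a small random forest on the selected features
model = RandomForestClassifier(n_estimators = 10)
model.fit(X_train, y_train)
# Predict on the stage 2 matchups and keep the probability of the second class
X_pred = training_set_stage2[cols]
pred = model.predict_proba(X_pred)[:,1]
training_set_stage2['Pred'] = pred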

In [6]:
training_set_stage2.head()

Unnamed: 0,ID,Pred,Season,Team1,Team2,deltaSeed,deltaWinPct,deltaPointsFor,deltaPointsAgainst,deltaFGM,...,deltaFGA3,deltaFTM,deltaFTA,deltaOR,deltaDR,deltaAst,deltaTO,deltaStl,deltaBlk,deltaPF
0,2021_1101_1104,0.1,2021,1101,1104,12,0.026087,-3.262319,-8.027536,-0.626087,...,-8.834783,0.526087,1.62029,-0.791304,-3.14058,4.04058,-0.398551,0.714493,-1.333333,0.805797
1,2021_1101_1111,0.4,2021,1101,1111,-2,0.28442,8.137681,-6.51087,3.96558,...,-3.601449,0.28442,1.17029,0.733696,0.78442,6.84058,2.309783,1.806159,-0.166667,-2.344203
2,2021_1101_1116,0.0,2021,1101,1116,11,0.040373,-6.088509,-8.939441,-2.290373,...,-1.613354,-1.781056,-1.091615,-0.962733,-3.31677,3.245342,0.613354,1.312112,-2.142857,1.703416
3,2021_1101_1124,0.3,2021,1101,1124,13,-0.09058,-8.070652,-4.677536,-3.90942,...,-3.143116,1.951087,3.04529,-1.724638,0.451087,1.132246,1.268116,0.389493,-0.75,0.48913
4,2021_1101_1140,0.1,2021,1101,1140,8,0.066087,-1.255652,-7.10087,-1.226087,...,-1.514783,1.666087,3.566957,1.288696,-5.453913,2.013913,0.434783,4.667826,0.16,1.13913


In [20]:
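# Look up a single matchup by game ID. `gid` is only defined by the simulation cell further down, so run that cell first.
training_set_stage2[training_set_stage2['ID']==gid]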

Unnamed: 0,ID,Pred,Season,Team1,Team2,deltaSeed,deltaWinPct,deltaPointsFor,deltaPointsAgainst,deltaFGM,...,deltaFGA3,deltaFTM,deltaFTA,deltaOR,deltaDR,deltaAst,deltaTO,deltaStl,deltaBlk,deltaPF


### Linear Regression

In [43]:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
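
Note that `LinearRegression` has no `predict_proba` method, so the prediction lines from the random forest cell would not work unchanged with this model. Its `predict` method returns unbounded real numbers instead. A minimal sketch, assuming you want to treat those values as win probabilities, is to clamp them into the valid range. The helper name `clip_to_probability` and the sample values are illustrative only.

```python
def clip_to_probability(raw_values, low=0.0, high=1.0):
    """Clamp each raw regression output into the range [low, high]."""
    return [min(high, max(low, float(v))) for v in raw_values]

# Hypothetical raw outputs, standing in for model.predict(X_pred)
raw = [-0.12, 0.35, 0.5, 1.08]
clipped = clip_to_probability(raw)
print(clipped)
```

Clamping keeps the values usable as a `Pred` column, but it does not turn a linear model into a calibrated probability model.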

### Logistic Regression

In [None]:
model = linear_model.LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

### Neural Network

In [14]:
parameters = {'max_iter': [20,40,60], 'hidden_layer_sizes':[5,10,15,20]}

scaler = StandardScaler() 
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
model = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)
model.fit(X_train_scaled, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_iter=200,
                                     momentum=0.9, n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=None, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'hidden_layer_sizes': [5, 10, 15, 20],
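
Because the network is fit on `X_train_scaled`, any data passed to it for prediction should go through the same fitted `scaler` first. Otherwise the model sees inputs on a different scale than it was trained on. A minimal sketch on synthetic data follows; the `_demo` names and the random data are placeholders, not the tournament data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train, y_train and X_pred
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(60, 3))
y_train_demo = (X_train_demo[:, 0] > 0).astype(int)
X_pred_demo = rng.normal(size=(5, 3))

# Fit the scaler on the training data only
scaler_demo = StandardScaler().fit(X_train_demo)

model_demo = MLPClassifier(hidden_layer_sizes=(5,), max_iter=200, random_state=0)
model_demo.fit(scaler_demo.transform(X_train_demo), y_train_demo)

# Reuse the already fitted scaler on the prediction data (transform, not fit)
pred_demo = model_demo.predict_proba(scaler_demo.transform(X_pred_demo))[:, 1]
```

In the notebook this would correspond to calling `scaler.transform` on `training_set_stage2[cols]` before `predict_proba`.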

## Submission - Part 1

For the official Kaggle competition, the submission file contains just a game ID and a prediction: the predicted probability that team 1 beats team 2.

In [6]:
training_set_stage2[['ID', 'Pred']].to_csv('submission_predictions_part1.csv', index=False)
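
The `ID` values follow the pattern `season_team1_team2` (for example `2021_1101_1104` in the table above), and the simulation cell later in this notebook builds the same key with the lower team ID first. A small sketch of that convention is shown here; `make_game_id` and `prob_team_wins` are hypothetical helpers, not functions from the notebook.

```python
def make_game_id(season, team_a, team_b):
    """Build the game ID with the lower team ID listed first."""
    t1, t2 = sorted((team_a, team_b))
    return '{}_{}_{}'.format(season, t1, t2)

def prob_team_wins(pred, team, team_a, team_b):
    """pred is P(lower team ID wins); flip it when `team` has the higher ID."""
    return pred if team == min(team_a, team_b) else 1 - pred

gid_demo = make_game_id(2021, 1104, 1101)
print(gid_demo)
print(prob_team_wins(0.1, 1104, 1101, 1104))
```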

## Submission - Part 2

The file submission_predictions_part1.csv is the submission for the official Kaggle competition, but we want to take it a step further. Now we will load that file and use its prediction data to simulate the full tournament, predicting a winner for each game.

Run the following cells and check the SimulatedBracket.png file in the binder directory. That is your simulated bracket, labeled with the predicted win probability of each game's winner.

In [12]:
# enableUpsets = False
enableUpsets = True
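
The `enableUpsets` flag controls how each game is decided in the simulation cell. With upsets disabled, the team with the higher predicted probability always advances. With upsets enabled, a random number is drawn and team 1 advances only if the draw is at or below its predicted probability, so underdogs sometimes win. A standalone sketch of that rule (the `pick_winner` helper is illustrative, not the notebook's code):

```python
import random

def pick_winner(pred, enable_upsets, rng=random):
    """Return 'team1' or 'team2' given pred = P(team1 wins)."""
    if enable_upsets:
        # Sample the outcome: team1 wins with probability pred
        return 'team1' if rng.random() <= pred else 'team2'
    # Deterministic: the favorite always wins
    return 'team1' if pred >= 0.5 else 'team2'

# Seeded generator so the demo is reproducible
rng = random.Random(0)
draws = [pick_winner(0.7, True, rng) for _ in range(10000)]
share_team1 = draws.count('team1') / len(draws)
print(share_team1)  # close to 0.7
```

With upsets enabled, every run of the simulation can produce a different bracket unless the random generator is seeded.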

In [14]:
# Constants
__version__ = '0.2.0'
ID = 'id'
PRED = 'pred'
SEASON = 'season'
TEAM = 'teamname'

year=2021

# Imports
import os
from binaryTree import Node
import pandas as pd
from PIL import Image, ImageDraw

# Define Paths
cwd = os.getcwd()

outputPath = os.path.join(cwd, 'SimulatedBracket.png')
teamsPath = os.path.join(cwd, 'data_stage2', 'MTeams.csv')
seedsPath = os.path.join(cwd, 'data_stage2', '2021TourneySeeds.csv')
slotsPath = os.path.join(cwd, 'data_stage2', 'MNCAATourneySlots.csv')
submissionPath = os.path.join(cwd, 'submission_predictions_part1.csv')
resultsPath=None

slot_coordinates = {
    2021: {1: (372, 32),# First four
         2: (372, 50),
         3: (30, 328),
         4: (30, 346),
         5: (695, 325),
         6: (695, 343),
         7: (370, 642),
         8: (370, 659),
         9:  (30, 532),# W1
         10: (30, 514),
         11: (30, 567),
         12: (30, 550),
         13: (30, 604),
         14: (30, 586),
         15: (30, 640),
         16: (30, 622),
         17: (30, 496),
         18: (30, 478),
         19: (30, 460),
         20: (30, 442),
         21: (30, 424),
         22: (30, 406),
         23: (30, 388),
         24: (30, 370),
         25: (30, 199),# X1
         26: (30, 182),
         27: (30, 236),
         28: (30, 218),
         29: (30, 272),
         30: (30, 254),
         31: (30, 308),
         32: (30, 290),
         33: (30, 164),
         34: (30, 146),
         35: (30, 128),
         36: (30, 110),
         37: (30, 92),
         38: (30, 74),
         39: (30, 55),
         40: (30, 38),
         41: (815, 532),# Y1
         42: (815, 514),
         43: (815, 567),
         44: (815, 550),
         45: (815, 604),
         46: (815, 586),
         47: (815, 640),
         48: (815, 622),
         49: (815, 496),
         50: (815, 478),
         51: (815, 460),
         52: (815, 442),
         53: (815, 424),
         54: (815, 406),
         55: (815, 388),
         56: (815, 370),
         57: (815, 199),# Z1
         58: (815, 182),
         59: (815, 236),
         60: (815, 218),
         61: (815, 272),
         62: (815, 254),
         63: (815, 308),
         64: (815, 290),
         65: (815, 164),
         66: (815, 146),
         67: (815, 128),
         68: (815, 110),
         69: (815, 92),
         70: (815, 74),
         71: (815, 55),
         72: (815, 38),
         73: (155, 523),# W2
         74: (155, 559),
         75: (155, 595),
         76: (155, 631),
         77: (155, 487),
         78: (155, 451),
         79: (155, 415),
         80: (155, 379),
         81: (155, 191),# X2
         82: (155, 227),
         83: (155, 263),
         84: (155, 299),
         85: (155, 155),
         86: (155, 119),
         87: (155, 83),
         88: (155, 47),
         89: (735, 523),# Y2
         90: (735, 559),
         91: (735, 595),
         92: (735, 631),
         93: (735, 487),
         94: (735, 451),
         95: (735, 415),
         96: (735, 379),
         97: (735, 191),# Z2
         98: (735, 227),
         99: (735, 263),
         100: (735, 299),
         101: (735, 155),
         102: (735, 119),
         103: (735, 83),
         104: (735, 47),
         105: (232, 541),# W3
         106: (232, 613),
         107: (232, 469),
         108: (232, 397),
         109: (232, 209),# X3
         110: (232, 281),
         111: (232, 137),
         112: (232, 65),
         113: (668, 541),# Y3
         114: (668, 613),
         115: (668, 469),
         116: (668, 397),
         117: (668, 209),# Z3
         118: (668, 281),
         119: (668, 137),
         120: (668, 65),
         121: (298, 576),# W4
         122: (298, 432),
         123: (298, 244),# X4
         124: (298, 100),
         125: (601, 576),# Y4
         126: (601, 432),
         127: (601, 244),# Z4
         128: (601, 100),
         129: (358, 504),# W5
         130: (358, 172),# X5
         131: (540, 504),# Y5
         132: (540, 172),# Z5
         133: (420, 457),# WX6
         134: (435, 219),# YZ6
         135: (435, 339)# CH
    }
}

# Define classes and functions
class extNode(Node):
    def __init__(self, value, left=None, right=None, parent=None):
        Node.__init__(self, value, left=left, right=right)
        if parent is not None and isinstance(parent, extNode):
            self.__setattr__('parent', parent)
        else:
            self.__setattr__('parent', None)

    def __setattr__(self, name, value):
        # Magically set the parent to self when a child is created
        if (name in ['left', 'right']
                and value is not None
                and isinstance(value, extNode)):
            value.parent = self
        object.__setattr__(self, name, value)

def clean_col_names(df):
    return df.rename(columns={col: col.lower().replace('_', '') for col in df.columns})

def get_team_id(seedMap):
    # Return (bracket node value, team ID) for the team currently assigned to this node
    return (seedMap, df[df['seed'] == seed_slot_map[seedMap]]['teamid'].values[0])

def get_team_ids_and_gid(slot1, slot2):
    team1 = get_team_id(slot1)
    team2 = get_team_id(slot2)
    # Game IDs always list the lower team ID first
    if team2[1] < team1[1]:
        team1, team2 = team2, team1
    gid = '{season}_{t1}_{t2}'.format(season=year, t1=team1[1], t2=team2[1])
    return team1, team2, gid


# initialize variables
submit = clean_col_names(pd.read_csv(submissionPath))
teams_df = clean_col_names(pd.read_csv(teamsPath))
seeds_df = clean_col_names(pd.read_csv(seedsPath))
slots_df = clean_col_names(pd.read_csv(slotsPath))

df = seeds_df.merge(teams_df, left_on='teamid', right_on='teamid')

df = df.drop(['firstd1season','lastd1season'], axis=1)

s = slots_df[slots_df['season'] == year]
seed_slot_map = {0: 'R6CH'}
bkt = extNode(0)

# Begin by creating an empty tournament bracket using the modified binary tree class defined above.
# Populate the initial games using seed slot data.
counter = 1
current_nodes = [bkt]
current_id = -1
current_index = 0

while current_nodes:
    next_nodes = []
    current_index = 0
    while current_index < len(current_nodes):
        node = current_nodes[current_index]
        if len(s[s['slot'] == seed_slot_map[node.value]].index) > 0:
            node.left = extNode(counter)
            node.right = extNode(counter + 1)
            seed_slot_map[counter] = s[s['slot'] == seed_slot_map[node.value]].values[0][2]
            seed_slot_map[counter + 1] = s[s['slot'] == seed_slot_map[node.value]].values[0][3]
            next_nodes.append(node.left)
            next_nodes.append(node.right)
            counter += 2
        current_index += 1
        current_id += 1
    current_nodes = next_nodes
    
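# Create an empty dataframe of known game results (none are loaded because resultsPath is None)
results_df = pd.DataFrame({"id": [], "pred": []})

# Initialize a predictions map. This will be used to label each slot when the bracket is drawn
pred_map = {}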

## Simulate the Tournament #############################
# Cycle through each round of the tournament
for level in list(reversed(bkt.levels)):    
    # cycle through each game of the round
    for ix, node in enumerate(level[0: len(level) // 2]):
        # extract teams and id's 
        team1, team2, gid = get_team_ids_and_gid(level[ix * 2].value, level[ix * 2 + 1].value)
        # lookup the prediction result from the submission values
        pred = submit[submit['id'] == gid]['pred'].values[0]
        # if an actual result for this game is already known (results_df is empty here, so this branch is skipped)
        if gid in list(results_df.id):
            # look up the recorded outcome (1 means team1 won, 0 means team2 won)
            game_outcome = results_df[results_df[ID] == gid][PRED].values[0]
            
            # this determines the prediction percent label only, not the logic of who wins the match
            team = team1 if game_outcome == 1 else team2
            if (game_outcome == 1 and pred > 0.5):
                # outcome agrees with prediction, team1 wins
                pred_label = pred
            elif (game_outcome == 0 and pred > 0.5):
                # outcome differs from prediction, team2 wins
                pred_label = 1 - pred
            elif (game_outcome == 0 and pred <= 0.5):
                # outcome agrees with prediction, team2 wins
                pred_label = 1 - pred
            elif (game_outcome == 1 and pred <= 0.5):
                # outcome differs from prediction, team1 wins
                pred_label = pred
            else:
                raise ValueError("team not found")

        # This assigns the winner based on the prediction

        if enableUpsets:
            # Upsets enabled: sample the winner, team1 wins with probability pred
            randNumber = random.random()

            if randNumber <= pred:
                team = team1
                pred_label = pred
            else:
                team = team2
                pred_label = 1 - pred
        else:
            # No upsets: the team with the higher predicted probability always wins
            if pred >= 0.5:
                team = team1
                pred_label = pred
            else:
                team = team2
                pred_label = 1 - pred
        # Set the winner to the next game
        level[ix * 2].parent.value = team[0]
        # record the winner and slot information in the prediction map
        pred_map[gid] = (team[0], seed_slot_map[team[0]], pred_label)


## Draw the bracket ##################################
slotdata = []
# cycle through the binary tree
for ix, key in enumerate([b for a in bkt.levels for b in a]):
    xy = slot_coordinates[year][max(slot_coordinates[year].keys()) - ix]
    pred = ''
    gid = ''
    if key.parent is not None:
        team1, team2, gid = get_team_ids_and_gid(key.parent.left.value, key.parent.right.value)
    
    # Format the predicted value by looking it up in the pred_map
    if gid != '' and pred_map[gid][1] == seed_slot_map[key.value]:
        pred = "{:.2f}%".format(pred_map[gid][2] * 100)
    
    # Format the string to be written on the image
    st = '{teamid} {teamname} {pred_label}'.format(
        teamid=df[df['seed'] == seed_slot_map[key.value]]['teamid'].values[0],
        teamname=df[df['seed'] == seed_slot_map[key.value]]['teamname'].values[0],
        pred_label = pred
    )
    
    # Append the coordinates, label string, and node value to the slotdata list
    slotdata.append((xy, st, key.value))

# open the blank tournament bracket image and prepare to draw on it
img = Image.open('2019.jpg')
draw = ImageDraw.Draw(img)

# cycle through the simulated tournament and plot the formatted string in the proper location on the bracket image
for slot in slotdata:
    draw.text(slot[0], str(slot[1]), (0, 0, 0))

# save the bracket image
img.save(outputPath)


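# Collect the coordinates, label string, and node value for every slot
predictionsCSV = []
for slot in slotdata:
    predictionsCSV.append([slot[0], str(slot[1]), slot[2]])

# Save the simulated bracket as a CSV file (a new name avoids overwriting the seeds/teams df)
bracket_df = pd.DataFrame(predictionsCSV)
bracket_df.columns = ['Coordinates', 'Predicted Team', 'Index']
bracket_df.to_csv('submission_bracket_part2.csv')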
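
The simulation cell above works through the bracket round by round, deciding each game from the submitted probability and promoting the winner to the next slot. The same idea on a four-team bracket is sketched below as a self-contained example. The team numbers and probabilities are made up, and `simulate_bracket` is a hypothetical helper rather than the notebook's code.

```python
import random

def simulate_bracket(teams, win_prob, enable_upsets=False, rng=None):
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    rng = rng or random.Random()
    current = list(teams)
    while len(current) > 1:
        winners = []
        # Adjacent entries play each other in every round
        for i in range(0, len(current), 2):
            a, b = sorted((current[i], current[i + 1]))
            p = win_prob[(a, b)]  # probability that the lower ID (a) beats b
            if enable_upsets:
                winner = a if rng.random() <= p else b
            else:
                winner = a if p >= 0.5 else b
            winners.append(winner)
        current = winners
    return current[0]

# Made-up probabilities for every possible pairing of teams 1 to 4
toy_probs = {
    (1, 2): 0.1, (3, 4): 0.8,
    (1, 3): 0.4, (1, 4): 0.3,
    (2, 3): 0.6, (2, 4): 0.7,
}
champion = simulate_bracket([1, 2, 3, 4], toy_probs)
print(champion)  # 2 when upsets are disabled
```

Calling `simulate_bracket` many times with `enable_upsets=True` and counting the champions would give a rough estimate of each team's title chances under the model.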