# IDM Engineering - March Madness Machine Learning - 2020

Welcome to the first IDM Engineering March Madness Machine Learning lunch and learn! Thanks for attending!

The point of this lunch and learn is to teach the process around machine learning, not how to code in Python. The activity is presented in a Jupyter notebook, set up so that you can simply run the full notebook and get a result. Or you can follow along and customize your algorithms as you see fit.

## Table of Contents:
* Jupyter Notebooks
* Library Imports 
* Data Manipulation
* Data Analysis
* Model Exploration
    * Linear - Ordinary Least Squares
    * Linear - Logistic Regression
    * Random Forest Classifier
    * Neural Network 
        * Scaled Data
        * Grid Search CV
* Build the Final Model
* Load Submission Data
* Make Predictions
* Simulate Tournament

## Jupyter Notebooks

Quick note about Jupyter notebooks. Jupyter allows you to execute individual snippets of code within one kernel. With a cell selected, hit the run button to run that cell, or press Shift-Enter.

If a cell gets stuck, hit the stop button next to the run button. If your kernel crashes, use Kernel > Restart to get a fresh Python instance. Note that restarting clears every variable in memory, so you will need to re-run your cells.

Comment out lines of code in a cell with the $ \# $ character, or press Ctrl-/ to toggle comments on the selected lines.

## Library Imports

Python is an incredibly flexible language, partially due to how modular it is. We can extend its basic functionality by importing third-party libraries.

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pkg_resources

from binaryTree import Node
from PIL import Image, ImageDraw

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

In [2]:
cwd = os.getcwd()

## Data Manipulation

The intention of the data manipulation section is to create a dataframe of our target variable (result) and the given factors.

First, let's see what format our current data is in.

For this section, we are going to read in our training sets and records. You could start the notebook right here if you got lost at any point. 
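
A quick way to check what format a table is in is to look at its shape, column types, and first rows, then separate the target from the factors. The sketch below uses a tiny made-up frame that borrows the column names `Result`, `deltaSeed`, `deltaFGM`, and `deltaAst` from this notebook; the values are invented for illustration and are not taken from `training_set.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the training data: real column names, invented values.
toy = pd.DataFrame({
    "Result": [1, 0, 1, 0],
    "deltaSeed": [-3, 5, -1, 8],
    "deltaFGM": [1.2, -0.7, 0.4, -2.1],
    "deltaAst": [0.9, -1.5, 0.3, -0.2],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column
print(toy.head())   # first few rows

# Separate the target variable from the factors.
target = toy["Result"]
factors = toy.drop(columns=["Result"])
```

The same three inspection calls work unchanged on the frames loaded with `pd.read_csv` below.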

### Train the Model

Remember some will count twice!

In [10]:
training_set = pd.read_csv("training_set.csv")
training_set_stage2 = pd.read_csv("training_set_stage2.csv")

In [11]:
# cols = ['deltaSeed', 'deltaWinPct','deltaPointsFor','deltaPointsAgainst','deltaFGM','deltaFGA','deltaFGM3','deltaFGA3','deltaFTM',
#         'deltaFTA','deltaOR','deltaDR','deltaAst','deltaTO','deltaStl','deltaBlk','deltaPF']
cols = ['deltaSeed', 'deltaFGM', 'deltaAst']
cols

['deltaSeed', 'deltaFGM', 'deltaAst']

Now define the training features and target based on the cols variable.

In [12]:
X_train = training_set[cols]
y_train = training_set['Result']
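
The notebook imports `train_test_split` but trains on every row. If you want a rough check of how a model does on games it has not seen, one option is to hold back part of the training data. This is a sketch on synthetic data, not the notebook's actual workflow: the frame `demo`, the rule used to generate `Result`, and the 25% hold-out size are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for training_set (invented values, real column names).
demo = pd.DataFrame({
    "deltaSeed": rng.integers(-15, 16, size=n),
    "deltaFGM": rng.normal(size=n),
    "deltaAst": rng.normal(size=n),
})
# Invented labelling rule, only so that there is something to learn.
demo["Result"] = (demo["deltaSeed"] < 0).astype(int)

cols = ["deltaSeed", "deltaFGM", "deltaAst"]

# Keep 25% of the rows aside for validation; random_state makes the split repeatable.
X_fit, X_val, y_fit, y_val = train_test_split(
    demo[cols], demo["Result"], test_size=0.25, random_state=0
)
print(X_fit.shape, X_val.shape)
```

Fit on `X_fit` and `y_fit`, then score on `X_val` and `y_val` to compare models before committing to one.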

In [13]:
training_set_stage2

Unnamed: 0,ID,Pred,Season,Team1,Team2,deltaSeed,deltaWinPct,deltaPointsFor,deltaPointsAgainst,deltaFGM,...,deltaFGA3,deltaFTM,deltaFTA,deltaOR,deltaDR,deltaAst,deltaTO,deltaStl,deltaBlk,deltaPF
0,2019_1101_1113,0.5,2019,1101,1113,4,0.105603,-6.088362,-8.165948,-1.248922,...,-2.353448,-3.581897,-6.837284,-3.087284,-4.915948,1.026940,-1.938578,1.781250,-0.667026,-0.768319
1,2019_1101_1120,0.5,2019,1101,1120,10,0.057809,-7.158215,-3.691684,-1.684584,...,-11.074037,0.381339,0.333671,-2.666329,0.955375,0.208925,-0.491886,-1.294118,-2.212982,0.755578
2,2019_1101_1124,0.5,2019,1101,1124,6,0.199353,0.067888,-2.290948,-0.155172,...,-4.478448,1.074353,0.193966,-4.306034,-1.697198,0.776940,-1.626078,1.875000,-2.198276,0.356681
3,2019_1101_1125,0.5,2019,1101,1125,4,-0.040230,-15.142529,-9.770115,-6.321839,...,-9.070115,0.626437,1.168966,0.168966,-6.770115,-4.979310,0.055172,1.333333,-1.248276,3.437931
4,2019_1101_1133,0.5,2019,1101,1133,0,0.217346,5.360502,-0.285266,2.314525,...,0.138976,0.399164,-0.294671,0.038662,-2.194357,2.560084,-0.890282,2.848485,-1.205852,1.531870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2273,2019_1449_1459,0.5,2019,1449,1459,2,-0.101961,-11.376471,-3.150980,-4.741176,...,-4.545098,1.501961,2.425490,-0.829412,-2.315686,-3.023529,2.119608,2.866667,2.801961,0.911765
2274,2019_1449_1463,0.5,2019,1449,1463,-5,0.014706,-11.069328,-9.331933,-5.262605,...,0.766807,-0.228992,0.701681,0.792017,-7.703782,-5.323529,0.102941,3.250000,1.413866,1.411765
2275,2019_1458_1459,0.5,2019,1458,1459,-2,-0.169697,-12.139394,-6.109091,-3.421212,...,-6.678788,-1.551515,-0.936364,-2.148485,2.627273,-1.942424,-1.687879,-1.012121,1.278788,-2.439394
2276,2019_1458_1463,0.5,2019,1458,1463,-9,-0.053030,-11.832251,-12.290043,-3.942641,...,-1.366883,-3.282468,-2.660173,-0.527056,-2.760823,-4.242424,-3.704545,-0.628788,-0.109307,-1.939394


Ok, so now we have our training set.
The next thing you need to do is determine what model you want to use. Uncomment whichever model you would like to use. 

### Random Forest Classifier

In [14]:
model = RandomForestClassifier(n_estimators = 10)
model.fit(X_train, y_train)
X_pred = training_set_stage2[cols]
pred = model.predict_proba(X_pred)[:,1]
training_set_stage2['Pred'] = pred
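
The `[:,1]` in the cell above selects one column of `predict_proba`. The sketch below, on synthetic data with invented labels, shows that the columns follow the order of `classes_`, so column 1 is the probability of class 1 when the labels are 0 and 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # synthetic features
y = (X[:, 0] > 0).astype(int)        # invented 0/1 labels

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])
print(clf.classes_)   # column order used by predict_proba
print(proba)          # one row per sample, one column per class

# Look up the column for class 1 explicitly instead of hard-coding the index.
win_col = list(clf.classes_).index(1)
p_win = proba[:, win_col]
```

With only 10 trees, each fully grown tree typically votes 0 or 1, so the averaged probabilities tend to fall on multiples of 0.1, which is consistent with values such as 0.9 and 0.8 in the pred_map output later in this notebook. Increasing `n_estimators` gives finer-grained probabilities.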

### Linear Regression

In [None]:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
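
Unlike the classifiers, `LinearRegression` has no `predict_proba` method, so the random forest prediction line does not carry over directly. One possible workaround, shown here on synthetic data as an assumption rather than the notebook's own approach, is to call `predict` and clip the output into the range 0 to 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # synthetic features
y = (X[:, 0] > 0).astype(int)        # invented 0/1 labels

reg = LinearRegression().fit(X, y)

raw = reg.predict(X)                 # unbounded: can fall below 0 or above 1
pred = np.clip(raw, 0.0, 1.0)        # force the values into a valid probability range
print(raw.min(), raw.max())
print(pred.min(), pred.max())
```

Clipping keeps the values usable as a `Pred` column, but they are not calibrated probabilities.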

### Logistic Regression

In [None]:
model = linear_model.LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
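
Logistic regression turns a linear score into a probability with the sigmoid function. The sketch below, on synthetic data with invented labels, recomputes the class 1 probability by hand from `decision_function` and compares it with `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)   # invented noisy labels

logit = LogisticRegression(solver="lbfgs").fit(X, y)

score = logit.decision_function(X)          # linear score: w . x + b
p_manual = 1.0 / (1.0 + np.exp(-score))     # sigmoid of the score
p_model = logit.predict_proba(X)[:, 1]      # probability of class 1 from the model

print(np.allclose(p_manual, p_model))
```

Because the output is already a probability, `predict_proba(...)[:, 1]` can fill the `Pred` column the same way as in the random forest cell.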

### Neural Network

In [14]:
parameters = {'max_iter': [20,40,60], 'hidden_layer_sizes':[5,10,15,20]}

scaler = StandardScaler() 
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
model = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)
model.fit(X_train_scaled, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_iter=200,
                                     momentum=0.9, n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=None, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'hidden_layer_sizes': [5, 10, 15, 20],
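
Because the network is trained on scaled features, any data you predict on must be scaled the same way. One way to make that hard to forget is to put the scaler and the network in a scikit-learn `Pipeline` and grid-search over the pipeline. This is a swapped-in variation rather than the cell above: the synthetic data, the reduced grid, and `cv=3` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(120, 3))   # synthetic, deliberately unscaled
y = (X[:, 0] > 5.0).astype(int)                     # invented 0/1 labels

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(), MLPClassifier(random_state=0))
grid = {
    "mlpclassifier__max_iter": [20, 40],
    "mlpclassifier__hidden_layer_sizes": [(5,), (10,)],
}

search = GridSearchCV(pipe, grid, cv=3)
search.fit(X, y)   # small max_iter values may emit convergence warnings

print(search.best_params_)

# The fitted pipeline scales new rows before they reach the network.
proba = search.predict_proba(X[:5])[:, 1]
```

If you keep the separate `scaler` from the cell above instead, remember to call `scaler.transform` on the prediction features before `predict_proba`.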

## Create Submission File

In [15]:
training_set_stage2[['ID', 'Pred']].to_csv('submission.csv', index=False)
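
Before using the file, it can be worth checking it for obvious problems. The sketch below runs three checks on a two-row stand-in: IDs follow the `season_team1_team2` pattern with the lower team ID first, predictions lie between 0 and 1, and no game appears twice. The IDs are copied from the stage 2 table above; the `Pred` values are invented.

```python
import pandas as pd

# Stand-in for submission.csv: real ID format, invented predictions.
sub = pd.DataFrame({
    "ID": ["2019_1101_1113", "2019_1101_1120"],
    "Pred": [0.9, 0.3],
})

# Split "season_team1_team2" into three integer columns.
parts = sub["ID"].str.split("_", expand=True).astype(int)
parts.columns = ["Season", "Team1", "Team2"]

ids_ordered = bool((parts["Team1"] < parts["Team2"]).all())   # lower team ID first
preds_valid = bool(sub["Pred"].between(0, 1).all())           # valid probabilities
no_duplicates = not sub["ID"].duplicated().any()              # one row per game

print(ids_ordered, preds_valid, no_duplicates)
```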

## Simulate Tournament

Submission.csv is the submission for the official tournament, but we want to take it a step further. Now we will load submission.csv and use its prediction data to simulate the full tournament, predicting a winner for each game.

Run the following cell and check the output.png file in the binder directory. That is your simulated bracket with a percentage score of who will win each game. 

In [29]:
# Constants
__version__ = '0.2.0'
ID = 'id'
PRED = 'pred'
SEASON = 'season'
TEAM = 'teamname'

year=2019

# Imports
import os
from binaryTree import Node
import pandas as pd
from PIL import Image, ImageDraw

# Define Paths
cwd = os.getcwd()

outputPath = os.path.join(cwd, 'output.png')
teamsPath = os.path.join(cwd, 'data', 'Teams.csv')
seedsPath = os.path.join(cwd, 'data', '2019TourneySeeds.csv')
slotsPath = os.path.join(cwd, 'data', 'MNCAATourneySlots.csv')
submissionPath = os.path.join(cwd, 'submission.csv')
resultsPath=None

slot_coordinates = {
    2019: {1: (372, 32),# First four
         2: (372, 50),
         3: (30, 328),
         4: (30, 346),
         5: (695, 325),
         6: (695, 343),
         7: (370, 642),
         8: (370, 659),
         9:  (30, 532),# W1
         10: (30, 514),
         11: (30, 567),
         12: (30, 550),
         13: (30, 604),
         14: (30, 586),
         15: (30, 640),
         16: (30, 622),
         17: (30, 496),
         18: (30, 478),
         19: (30, 460),
         20: (30, 442),
         21: (30, 424),
         22: (30, 406),
         23: (30, 388),
         24: (30, 370),
         25: (30, 199),# X1
         26: (30, 182),
         27: (30, 236),
         28: (30, 218),
         29: (30, 272),
         30: (30, 254),
         31: (30, 308),
         32: (30, 290),
         33: (30, 164),
         34: (30, 146),
         35: (30, 128),
         36: (30, 110),
         37: (30, 92),
         38: (30, 74),
         39: (30, 55),
         40: (30, 38),
         41: (815, 532),# Y1
         42: (815, 514),
         43: (815, 567),
         44: (815, 550),
         45: (815, 604),
         46: (815, 586),
         47: (815, 640),
         48: (815, 622),
         49: (815, 496),
         50: (815, 478),
         51: (815, 460),
         52: (815, 442),
         53: (815, 424),
         54: (815, 406),
         55: (815, 388),
         56: (815, 370),
         57: (815, 199),# Z1
         58: (815, 182),
         59: (815, 236),
         60: (815, 218),
         61: (815, 272),
         62: (815, 254),
         63: (815, 308),
         64: (815, 290),
         65: (815, 164),
         66: (815, 146),
         67: (815, 128),
         68: (815, 110),
         69: (815, 92),
         70: (815, 74),
         71: (815, 55),
         72: (815, 38),
         73: (155, 523),# W2
         74: (155, 559),
         75: (155, 595),
         76: (155, 631),
         77: (155, 487),
         78: (155, 451),
         79: (155, 415),
         80: (155, 379),
         81: (155, 191),# X2
         82: (155, 227),
         83: (155, 263),
         84: (155, 299),
         85: (155, 155),
         86: (155, 119),
         87: (155, 83),
         88: (155, 47),
         89: (735, 523),# Y2
         90: (735, 559),
         91: (735, 595),
         92: (735, 631),
         93: (735, 487),
         94: (735, 451),
         95: (735, 415),
         96: (735, 379),
         97: (735, 191),# Z2
         98: (735, 227),
         99: (735, 263),
         100: (735, 299),
         101: (735, 155),
         102: (735, 119),
         103: (735, 83),
         104: (735, 47),
         105: (232, 541),# W3
         106: (232, 613),
         107: (232, 469),
         108: (232, 397),
         109: (232, 209),# X3
         110: (232, 281),
         111: (232, 137),
         112: (232, 65),
         113: (668, 541),# Y3
         114: (668, 613),
         115: (668, 469),
         116: (668, 397),
         117: (668, 209),# Z3
         118: (668, 281),
         119: (668, 137),
         120: (668, 65),
         121: (298, 576),# W4
         122: (298, 432),
         123: (298, 244),# X4
         124: (298, 100),
         125: (601, 576),# Y4
         126: (601, 432),
         127: (601, 244),# Z4
         128: (601, 100),
         129: (358, 504),# W5
         130: (358, 172),# X5
         131: (540, 504),# Y5
         132: (540, 172),# Z5
         133: (420, 457),# WX6
         134: (435, 219),# YZ6
         135: (435, 339)# CH
    }
}

# Define classes and functions
class extNode(Node):
    def __init__(self, value, left=None, right=None, parent=None):
        Node.__init__(self, value, left=left, right=right)
        if parent is not None and isinstance(parent, extNode):
            self.__setattr__('parent', parent)
        else:
            self.__setattr__('parent', None)

    def __setattr__(self, name, value):
        # Magically set the parent to self when a child is created
        if (name in ['left', 'right']
                and value is not None
                and isinstance(value, extNode)):
            value.parent = self
        object.__setattr__(self, name, value)

def clean_col_names(df):
    return df.rename(columns={col: col.lower().replace('_', '') for col in df.columns})

def get_team_id(seedMap):
    # Look up the team id for a bracket node value using the module-level seed_slot_map and df
    return (seedMap, df[df['seed'] == seed_slot_map[seedMap]]['teamid'].values[0])

def get_team_ids_and_gid(slot1, slot2):
    team1 = get_team_id(slot1)
    team2 = get_team_id(slot2)
    if team2[1] < team1[1]:
        # Game ids list the lower team id first, so swap if needed
        team1, team2 = team2, team1
    gid = '{season}_{t1}_{t2}'.format(season=year, t1=team1[1], t2=team2[1])
    return team1, team2, gid


# initialize variables
submit = clean_col_names(pd.read_csv(submissionPath))
teams_df = clean_col_names(pd.read_csv(teamsPath))
seeds_df = clean_col_names(pd.read_csv(seedsPath))
slots_df = clean_col_names(pd.read_csv(slotsPath))

df = seeds_df.merge(teams_df, left_on='teamid', right_on='teamid')

df = df.drop(['firstd1season','lastd1season'], axis=1)

s = slots_df[slots_df['season'] == year]
seed_slot_map = {0: 'R6CH'}
bkt = extNode(0)

# Begin by creating an empty tournament bracket using the modified binary tree class defined above.
# Populate the initial games using the seed slot data.
counter = 1
current_nodes = [bkt]
current_id = -1
current_index = 0

while current_nodes:
    next_nodes = []
    current_index = 0
    while current_index < len(current_nodes):
        node = current_nodes[current_index]
        if len(s[s['slot'] == seed_slot_map[node.value]].index) > 0:
            node.left = extNode(counter)
            node.right = extNode(counter + 1)
            seed_slot_map[counter] = s[s['slot'] == seed_slot_map[node.value]].values[0][2]
            seed_slot_map[counter + 1] = s[s['slot'] == seed_slot_map[node.value]].values[0][3]
            next_nodes.append(node.left)
            next_nodes.append(node.right)
            counter += 2
        current_index += 1
        current_id += 1
    current_nodes = next_nodes
    
# Create a results dataframe     
results_df = pd.DataFrame({"id": [], "pred": []})
   
# Initialize a predictions map: game id -> (winner's node value, winner's seed, prediction label)
pred_map = {}

## Simulate the Tournament #############################
# Cycle through each round of the tournament
for level in list(reversed(bkt.levels)):    
    # cycle through each game of the round
    for ix, node in enumerate(level[0: len(level) // 2]):
        # extract teams and id's 
        team1, team2, gid = get_team_ids_and_gid(level[ix * 2].value, level[ix * 2 + 1].value)
        # lookup the prediction result from the submission values
        pred = submit[submit['id'] == gid]['pred'].values[0]
        # if an actual result for this game has already been recorded in results_df
        if gid in list(results_df.id):
            # use the recorded outcome (1 means team1 won, 0 means team2 won)
            game_outcome = results_df[results_df[ID] == gid][PRED].values[0]
            
            # this is determining the prediction percent label only. Not the logic of who wins the match
            team = team1 if game_outcome == 1 else team2
            if (game_outcome == 1 and pred > 0.5):
                # outcome agrees with prediction, team1 wins
                pred_label = pred
            elif (game_outcome == 0 and pred > 0.5):
                # outcome different than prediction, team2 wins
                pred_label = 1 - pred
            elif (game_outcome == 0 and pred <= 0.5):
                # outcome agrees with prediction, team2 wins
                pred_label = 1 - pred
            elif (game_outcome == 1 and pred <= 0.5):
                # outcome different than prediction, team1 wins
                pred_label = pred
            else:
                raise ValueError("team not found")

        # This assigns the winner based on prediction       
        elif pred >= 0.5:
            team = team1
            pred_label = pred
        else:
            team = team2
            pred_label = 1 - pred

        # Set the winner to the next game
        level[ix * 2].parent.value = team[0]
        # record the winner and slot information in the prediction map
        pred_map[gid] = (team[0], seed_slot_map[team[0]], pred_label)


## Draw the bracket ##################################
slotdata = []
# cycle through the binary tree
for ix, key in enumerate([b for a in bkt.levels for b in a]):
    xy = slot_coordinates[2019][max(slot_coordinates[2019].keys()) - ix]
    pred = ''
    gid = ''
    if key.parent is not None:
        team1, team2, gid = get_team_ids_and_gid(key.parent.left.value, key.parent.right.value)
    
    # Format the predicted value by looking it up in the pred_map
    if gid != '' and pred_map[gid][1] == seed_slot_map[key.value]:
        pred = "{:.2f}%".format(pred_map[gid][2] * 100)
    
    # Format the string to be written on the image
    st = '{teamid} {teamname} {pred_label}'.format(
        teamid=df[df['seed'] == seed_slot_map[key.value]]['teamid'].values[0],
        teamname=df[df['seed'] == seed_slot_map[key.value]]['teamname'].values[0],
        pred_label = pred
    )
    
    # Append the coordinates, label string, and node value to the slotdata list
    slotdata.append((xy, st, key.value))

# open the blank tournament bracket image and prepare to draw on it
img = Image.open('2019.jpg')
draw = ImageDraw.Draw(img)

# cycle through the simulated tournament and plot the formatted string in the proper location on the bracket image
for slot in slotdata:
    draw.text(slot[0], str(slot[1]), (0, 0, 0))

# save the bracket image
img.save(outputPath)


predictionsCSV= []
for slot in slotdata:
    predictionsCSV.append([slot[0],str(slot[1]), slot[2]])
    

df = pd.DataFrame(predictionsCSV)
df.columns = ['Coordinates', 'Predicted Team', 'Index']
df.to_csv('bracket_predictions_csv.csv')
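
The core rule of the simulation cell above can be isolated from the bracket and drawing code. In that cell, a game ID lists the lower team ID first, `pred` is read as the probability that this first team wins, and the winner is team 1 when `pred >= 0.5`. The helper names `game_id` and `pick_winner` below are hypothetical and do not appear in the notebook; this is a minimal sketch of the rule, not a replacement for the cell.

```python
def game_id(season, team_a, team_b):
    """Build the submission ID for a game, with the lower team id first."""
    low, high = sorted((team_a, team_b))
    return "{}_{}_{}".format(season, low, high)


def pick_winner(team1_id, team2_id, pred):
    """Return (winner_id, label), where pred is P(team1 wins) and team1_id < team2_id.

    The label is the predicted probability for whichever team is picked,
    mirroring the pred_label logic in the simulation cell.
    """
    if team1_id >= team2_id:
        raise ValueError("team1_id must be the lower team id")
    if pred >= 0.5:
        return team1_id, pred
    return team2_id, 1 - pred


gid = game_id(2019, 1300, 1295)
favourite = pick_winner(1295, 1300, 0.9)
upset = pick_winner(1295, 1300, 0.2)
print(gid, favourite, upset)
```

A prediction of exactly 0.5 goes to team 1, which is why an untrained submission filled with 0.5 always advances the lower team ID.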

In [24]:
pred_map

{'2019_1295_1300': (127, 'W16a', 0.9),
 '2019_1125_1396': (130, 'W11b', 0.9),
 '2019_1192_1341': (131, 'X16a', 0.8),
 '2019_1113_1385': (133, 'X11a', 0.9),
 '2019_1181_1295': (63, 'W01', 1.0),
 '2019_1416_1433': (66, 'W09', 0.7),
 '2019_1387_1439': (67, 'W04', 0.9),
 '2019_1251_1280': (70, 'W12', 0.5),
 '2019_1133_1277': (71, 'W02', 0.9),
 '2019_1257_1278': (73, 'W07', 1.0),
 '2019_1261_1463': (75, 'W03', 0.6),
 '2019_1268_1396': (130, 'W11b', 0.6),
 '2019_1192_1211': (79, 'X01', 1.0),
 '2019_1124_1393': (82, 'X09', 1.0),
 '2019_1199_1436': (83, 'X04', 0.5),
 '2019_1266_1293': (85, 'X05', 1.0),
 '2019_1276_1285': (87, 'X02', 1.0),
 '2019_1196_1305': (89, 'X07', 0.8),
 '2019_1297_1403': (91, 'X03', 0.7),
 '2019_1113_1138': (93, 'X06', 0.8),
 '2019_1233_1314': (95, 'Y01', 0.9),
 '2019_1429_1449': (98, 'Y09', 0.7),
 '2019_1242_1318': (99, 'Y04', 0.9),
 '2019_1120_1308': (101, 'Y05', 0.8),
 '2019_1101_1246': (103, 'Y02', 0.8),
 '2019_1371_1459': (106, 'Y10', 0.7),
 '2019_1209_1222': (107, 