# Models

In this notebook we build all of our models. We saved only the one model that produced the most accurate predictions.

In [35]:
# Import necessary libraries 

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import scipy as sc
from sklearn import neural_network

%matplotlib inline

Here we read in all the dataframes that we need to train and test our models.

In [36]:
diff1314 = pd.read_csv('./preprocessed/final/diff1314.csv', sep=',')
diff1415 = pd.read_csv('./preprocessed/final/diff1415.csv', sep=',')
diff1516 = pd.read_csv('./preprocessed/final/diff1516.csv', sep=',')
diff1617 = pd.read_csv('./preprocessed/final/diff1617.csv', sep=',')

diff1718 = pd.read_csv('./preprocessed/final/diff1718.csv', sep=',')

diff_norm1314 = pd.read_csv('./preprocessed/final/diff_norm1314.csv', sep=',')
diff_norm1415 = pd.read_csv('./preprocessed/final/diff_norm1415.csv', sep=',')
diff_norm1516 = pd.read_csv('./preprocessed/final/diff_norm1516.csv', sep=',')
diff_norm1617 = pd.read_csv('./preprocessed/final/diff_norm1617.csv', sep=',')

diff_norm1718 = pd.read_csv('./preprocessed/final/diff_norm1718.csv', sep=',')

win1314 = pd.read_csv('./preprocessed/final/winner1314.csv', sep=',')
win1415 = pd.read_csv('./preprocessed/final/winner1415.csv', sep=',')
win1516 = pd.read_csv('./preprocessed/final/winner1516.csv', sep=',')
win1617 = pd.read_csv('./preprocessed/final/winner1617.csv', sep=',')
win1718 = pd.read_csv('./preprocessed/final/winner1718.csv', sep =',')

Here we delete two columns, 'Unnamed: 0' and '1', that serve no purpose. We are not sure why they remained in our dataframes, so we remove them from all of the difference dataframes.

In [37]:
arr = [diff1314, diff1415, diff1516, diff1617, diff_norm1314, diff_norm1415, diff_norm1516, diff_norm1617, diff1718, diff_norm1718]

for j in arr:
    del j['Unnamed: 0']
    del j['1']
    
diff1314.shape

(15, 45)
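A column named 'Unnamed: 0' usually appears when a dataframe is written with `to_csv` using its default `index=True` and then read back without telling pandas that the first column is the index. The sketch below reproduces that on a tiny made-up frame (the column names `PTS` and `AST` are placeholders, not our real features). It does not explain the '1' column, whose origin we have not traced.

```python
import io
import pandas as pd

# Hypothetical miniature frame standing in for one of the diff tables.
original = pd.DataFrame({"PTS": [3.5, -1.2], "AST": [0.4, 2.1]})

# With the default index=True, to_csv writes the row index as an unnamed first column.
buffer = io.StringIO()
original.to_csv(buffer)

# Reading the file back without extra arguments turns that column into 'Unnamed: 0'.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores the first column as the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

If this is indeed the cause, writing the preprocessed files with `index=False` or reading them with `index_col=0` would make the `del` step for 'Unnamed: 0' unnecessary.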

Here we collect the rows of the non-normalized and the normalized training dataframes into two lists, data and data_norm, that the models can be trained on.

In [38]:
arr2 = [diff1314, diff1415, diff1516, diff1617]
arr3 = [diff_norm1314, diff_norm1415, diff_norm1516, diff_norm1617]
data = []
data_norm = []

for df in arr2:
    for row in df.iterrows():
        index, stat = row
        data.append(stat.tolist())
        
for df in arr3:
    for row in df.iterrows():
        index, stat = row
        data_norm.append(stat.tolist())
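The row-by-row loops above can probably be replaced by a single `pd.concat` call. The sketch below compares both approaches on two small made-up frames (`season_a` and `season_b` are placeholders for our per-season tables) and assumes that every frame has the same columns in the same order, as ours should after preprocessing.

```python
import pandas as pd

# Hypothetical stand-ins for two seasons of difference tables.
season_a = pd.DataFrame({"PTS": [1.0, 2.0], "AST": [0.5, -0.5]})
season_b = pd.DataFrame({"PTS": [3.0], "AST": [1.5]})

# Row-by-row version, mirroring the cell above.
rows_loop = []
for df in [season_a, season_b]:
    for _, stat in df.iterrows():
        rows_loop.append(stat.tolist())

# One-step version: stack the frames, then convert to a list of row lists.
rows_concat = pd.concat([season_a, season_b], ignore_index=True).values.tolist()

print(rows_loop == rows_concat)
```

Note that `pd.concat` aligns columns by name, so if two seasons had different column orders the result would differ from the positional loop.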

Again, we need to delete a column that has no important data. 

In [39]:
del win1314["Unnamed: 0"]
del win1415["Unnamed: 0"]
del win1516["Unnamed: 0"]
del win1617["Unnamed: 0"]
del win1718["Unnamed: 0"]

Here we collect the winners for the past four seasons into a single list, y, which we use as the training labels.

In [40]:
y = []

arr2 = [win1314, win1415, win1516, win1617]

for i in arr2:
    for row in i.iterrows():
        index, winner = row
        y.append(winner.tolist()[0])

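Here is our first model, the most accurate one, which is the model we saved. It is an MLP neural network trained on non-normalized data. The fit call is commented out because we load the already trained model from disk further down.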

In [41]:
model = neural_network.MLPClassifier(solver = 'lbfgs', random_state = 0, hidden_layer_sizes = [10,10])
#model.fit(data, y)

Here we save the model. The dump call is commented out so that re-running the notebook does not overwrite the saved file.

In [42]:
import pickle

filename = './finalized_model_1.sav'

#pickle.dump(model, open(filename, 'wb'))

For the purpose of running this notebook, we get the saved model so that we can predict from it. 

In [43]:
loaded_model_1 = pickle.load(open(filename, 'rb'))
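The save and load cells pass `open(...)` directly to pickle, which leaves the file handle to be closed by the garbage collector. A slightly safer pattern uses `with` blocks. The sketch below round-trips a plain dictionary as a stand-in for the fitted model (any picklable object behaves the same way), and the temporary path is made up so that it does not touch `finalized_model_1.sav`.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; this is not the real model.
model_stub = {"hidden_layer_sizes": [10, 10], "solver": "lbfgs"}

# Hypothetical path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "model_stub.sav")

# The with-block closes the file even if pickling raises an error.
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

Keep in mind that a pickled scikit-learn model is only reliably loadable with the same scikit-learn version that created it, and that pickle files from untrusted sources should never be loaded.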

Here we predict the outcomes for this season. The array below shows the predicted winner of each series, encoded as 0 or 1 for team a or team b. We have a better visualization in the other notebook.

In [44]:
y_predict = loaded_model_1.predict(diff1718)

In [45]:
y_predict

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

We want to compare our predictions with the actual winners. Note that the predicted array is longer because it also covers the last three series, which have no result yet. Notice that our predictions are exactly the same as the actual winners for the first 12 matchups.

In [46]:
# Series.reshape is deprecated, so reshape the underlying NumPy array instead
winners1718 = win1718['Winner'].values.reshape(-1, 1)

actual_win = []
for x in winners1718:
    temp = x[0]
    actual_win.append(temp)

In [47]:
actual_win

[0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
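Rather than comparing the two arrays by eye, the match can be checked programmatically. The sketch below copies the values from the outputs above and lists the positions where the prediction and the actual winner disagree. Only the series that already have a result are scored.

```python
# Values copied from the outputs of cells 45 and 47 above.
y_predict = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
actual_win = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Score only the series for which an actual winner is known.
n_played = len(actual_win)
mismatches = [i for i in range(n_played) if y_predict[i] != actual_win[i]]

print(n_played, mismatches)
```

An empty mismatch list confirms that all 12 completed series were predicted correctly.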

Here we repeat the process with normalized data. Notice that these predictions contain more errors: three of the first 12 differ from the actual winners above.

In [48]:
model_norm = neural_network.MLPClassifier(solver = 'lbfgs', random_state = 0, hidden_layer_sizes = [10,10])

In [49]:
model_norm.fit(data_norm, y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=[10, 10], learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=0, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [50]:
predictions_norm = model_norm.predict(diff_norm1718)

In [51]:
predictions_norm

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0])

Here we fit decision tree classifiers on both the non-normalized and the normalized data. They are not as accurate as the MLP neural network, so we decided not to save these models.

In [52]:
from sklearn import tree

model_dt = tree.DecisionTreeClassifier()

In [53]:
model_dt.fit(data, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [54]:
predictions_dt = model_dt.predict(diff1718)

In [55]:
predictions_dt

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0])

We want to do the same process with normalized data. 

In [56]:
model_dt_norm = tree.DecisionTreeClassifier()

In [57]:
model_dt_norm.fit(data_norm, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [58]:
predictions_dt_norm = model_dt_norm.predict(diff_norm1718)

In [59]:
predictions_dt_norm

array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])
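To put the four models side by side, the sketch below computes the fraction of the 12 completed series that each one predicted correctly, using the arrays printed above. The `accuracy` helper and the dictionary labels are our own names for illustration. Because the decision trees were created without a `random_state`, re-running the notebook may produce different tree predictions than the ones copied here.

```python
# Actual winners and predictions copied from the cell outputs above.
actual_win = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predictions = {
    "MLP": [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    "MLP (normalized)": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
    "Decision tree": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
    "Decision tree (normalized)": [0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
}


def accuracy(pred, actual):
    """Fraction of the scored series that were predicted correctly."""
    scored = pred[:len(actual)]  # ignore series without a known result
    correct = sum(p == a for p, a in zip(scored, actual))
    return correct / len(actual)


scores = {name: accuracy(pred, actual_win) for name, pred in predictions.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With only 12 scored series, these numbers are a rough comparison rather than a reliable estimate of how each model would generalize.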