This notebook queries the MySQL database that includes the results from import_game_pbp_data.py. It then trains a neural network on the data using sklearn's MLPClassifier.

In [1]:
import numpy as np
import sklearn as skl
import MySQLdb as mdb
import getpass

In [3]:
rootpass = getpass.getpass('Enter the database root password: ')
con = mdb.connect('localhost', 'root', rootpass, 'NFL_Offensive_Plays_2015')
cur = con.cursor()

Enter the database root password: ········


The below cell is the bulk of the notebook: it queries the database, trains the neural network on the data, makes predictions, and calculates the fraction of correctly predicted plays.

In [33]:
best_frac_true_cv = []
best_model = []
best_frac_true_train = []
frac_true_test = []

# A list of the team names.
teams = ['Baltimore Ravens', 'Cincinnati Bengals', 'Cleveland Browns', 'Pittsburgh Steelers', 'Houston Texans', 'Indianapolis Colts', 'Jacksonville Jaguars', 'Tennessee Titans', 'Buffalo Bills', 'Miami Dolphins', 'New England Patriots', 'New York Jets', 'Denver Broncos', 'Kansas City Chiefs', 'Oakland Raiders', 'San Diego Chargers', 'Chicago Bears', 'Detroit Lions', 'Green Bay Packers', 'Minnesota Vikings', 'Atlanta Falcons', 'Carolina Panthers', 'New Orleans Saints', 'Tampa Bay Buccaneers', 'Dallas Cowboys', 'New York Giants', 'Philadelphia Eagles', 'Washington Redskins', 'Arizona Cardinals', 'St. Louis Rams', 'San Francisco 49ers', 'Seattle Seahawks']

for team in teams:
    #Get the training data
    cur.execute("SELECT `Time Remaining`, Down, `To Go`, `Field Position`, `Score Differential` FROM `" + team + "` WHERE Week <= 8")
    train_in_prelim = cur.fetchall()
    cur.execute("SELECT IsPass FROM `" + team + "` WHERE Week <= 8")
    train_out_prelim = cur.fetchall()

    #Get the cross-validation data
    cur.execute("SELECT `Time Remaining`, Down, `To Go`, `Field Position`, `Score Differential` FROM `" + team + "` WHERE Week > 8 AND Week <= 12")
    cv_in_prelim = cur.fetchall()
    cur.execute("SELECT IsPass FROM `" + team + "` WHERE Week > 8 AND Week <= 12")
    cv_out_prelim = cur.fetchall()

    #Get the test data
    cur.execute("SELECT `Time Remaining`, Down, `To Go`, `Field Position`, `Score Differential` FROM `" + team + "` WHERE Week > 12")
    test_in_prelim = cur.fetchall()
    cur.execute("SELECT IsPass FROM `" + team + "` WHERE Week > 12")
    test_out_prelim = cur.fetchall()

    """The fetchall() command returns a triply-nested tuple, which the functions from 
    sklearn do not like. In the next set of lines I convert the outputs to lists of lists,
    which work as inputs to sklearn.StandardScaler and sklearn.MLPClassifier."""
    train_in = []
    train_out = []
    i = 0
    while i < len(train_out_prelim):
        train_out.append(float(train_out_prelim[i][0]))
        train_in.append([float(train_in_prelim[i][0]), float(train_in_prelim[i][1]), float(train_in_prelim[i][2]), float(train_in_prelim[i][3]), float(train_in_prelim[i][4])])
        i+=1
    
    cv_in = []
    cv_out = []
    i = 0
    while i < len(cv_out_prelim):
        cv_out.append(float(cv_out_prelim[i][0]))
        cv_in.append([float(cv_in_prelim[i][0]), float(cv_in_prelim[i][1]), float(cv_in_prelim[i][2]), float(cv_in_prelim[i][3]), float(cv_in_prelim[i][4])])
        i+=1
    
    test_in = []
    test_out = []
    i = 0
    while i < len(test_out_prelim):
        test_out.append(float(test_out_prelim[i][0]))
        test_in.append([float(test_in_prelim[i][0]), float(test_in_prelim[i][1]), float(test_in_prelim[i][2]), float(test_in_prelim[i][3]), float(test_in_prelim[i][4])])
        i+=1
    
    # Normalize the features.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(train_in)
    train_in = scaler.transform(train_in)
    cv_in = scaler.transform(cv_in)
    test_in = scaler.transform(test_in)

    """Here, I define the classifier. I used GridSearchCV to help find the values for each hyperparameter.
    If I add more neurons to the hidden layer, the optimal regularization parameter (alpha) becomes large
    (>= 1), suggesting that I have a high variance problem if I use more neurons without a lot of 
    regularization. If I have only one neuron in the hidden layer, the optimal regularization parameter  
    seems to be zero, suggesting that I have a high bias problem in that case. Thus, I chose 2 neurons in
    the hidden layer. The optimal regularization parameter in this case is 0.1."""
    
    from sklearn.neural_network import MLPClassifier as mlpc
    classifier = mlpc(solver='lbfgs', hidden_layer_sizes=(2,), activation='logistic', alpha=0.1, max_iter=float("inf"))

    """For each team, I will run the classifier 100 times, trying to ensure that I get the optimal result.
    I define the 'optimal' model as the one that does best on the cross-validation set, and I will save
    the best models and the fraction of true predictions on the cross-validation and training sets."""
    i = 0
    best_frac_true_cv.append(0.0)
    best_model.append(None)
    best_frac_true_train.append(0.0)
    while i < 100:
        model = classifier.fit(train_in,train_out)
        train_predictions = classifier.predict(train_in)
        j = 0
        count = 0
        while j < len(train_out):
            if train_predictions[j] == train_out[j]:
                count += 1
            j += 1
        frac_true_train = float(count)/float(len(train_predictions))
    
        cv_predictions = classifier.predict(cv_in)
        j = 0
        count = 0
        while j < len(cv_out):
            if cv_predictions[j] == cv_out[j]:
                count += 1
            j += 1
        frac_true_cv = float(count)/float(len(cv_predictions))
    
        if frac_true_cv > best_frac_true_cv[-1]:
            best_model[-1] = model
            best_frac_true_train[-1] = frac_true_train
            best_frac_true_cv[-1] = frac_true_cv
        i+=1
    
    # Here, I use the best model from above to predict the output of the test set.
    test_predictions = best_model[-1].predict(test_in)
    i = 0
    count = 0
    while i < len(test_predictions):
        if test_predictions[i] == test_out[i]:
            count += 1
        i += 1
    frac_true_test.append(float(count)/float(len(test_out)))

The below cell defines frac_true_logit, which is a list of fractions of correct predictions from the logistic regression model in NFL_Off_Plays_Logit. I put it here so that I can compare the results of the two models.

In [10]:
frac_true_logit = [0.6789667896678967,0.68,0.655511811023622,0.5580952380952381,0.6311787072243346,0.6360078277886497,0.6703296703296703,0.6610486891385767,0.6648550724637681,0.6338797814207651,0.6303972366148531,0.6766256590509666,0.6454388984509466,0.5803571428571429,0.6289752650176679,0.6679389312977099,0.6696588868940754,0.6257309941520468,0.5921875,0.7102803738317757,0.6750972762645915,0.6473594548551959,0.6863117870722434,0.6433566433566433,0.6819047619047619,0.6383763837638377,0.5942028985507246,0.5959780621572212,0.6400778210116731,0.7038461538461539,0.658008658008658,0.6227897838899804]

The below cell calculates the difference between the fraction of correctly predicted plays by the model in this notebook and that in NFL_Off_Plays_Logit. In fact, there is little difference for any team, and the mean difference is slightly negative, favoring the logistic regression model. Since the logistic regression model is less computationally expensive to train, this suggests that logistic regression is a more appropriate way to predict play calls with this data. The fact that such a highly-biased model does as well as a neural network indicates that there is not enough data here to make a neural network worth the trouble. In particular, I imagine that having the personnel and formation used on each play would make a positive difference. Information on what the defense is doing pre-snap (are they showing man-to-man or zone coverage, etc.) would also be useful.

In [41]:
frac_true_diff = np.array(frac_true_test) - np.array(frac_true_logit)
frac_true_diff, np.mean(frac_true_diff)

(array([ 0.01088828, -0.07864865, -0.03012481,  0.06259442, -0.03714886,
        -0.00322094,  0.01738963, -0.02674448, -0.03344482, -0.02331873,
        -0.01597717, -0.00271262,  0.03148418, -0.02564016, -0.00316881,
        -0.00621786, -0.05452731,  0.01297868, -0.02446416, -0.0161223 ,
        -0.00738568, -0.06115256, -0.02259497,  0.04335222, -0.01882232,
         0.02126217,  0.0225118 ,  0.02940893,  0.02985682, -0.01749929,
         0.018559  ,  0.01550809]), -0.006035696316248125)