## CSE 5243 Intro To Data Mining Final Project: Predicting NFL Outcomes

# Intro

The objective of this project is to build a reasonable model for predicting the outcome of NFL games and to compare its predictions with Vegas' betting lines. The model calculates an expected point differential and compares it with the Vegas spread to determine whether the home team or the away team is the better bet to "cover" the spread. Given the complexity of the NFL and how many moving pieces and variables exist, I am honestly not sure how well this model will perform. I intend to submit the completed model whether the results are good or bad, and will probably tweak it further in the future.

# Data Pre-Processing

In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_rows', None)
np.set_printoptions(suppress=True, precision=4)
import math
from sklearn import preprocessing

In [3]:
games = pd.read_csv('gameData.csv')

In [4]:
games[['wins_home','losses_home','wins_away','losses_away']] = preprocessing.MinMaxScaler().fit_transform(games[['wins_home','losses_home','wins_away','losses_away']])

# Principal Component Analysis

In [5]:
gamesDummied = pd.get_dummies(games).reset_index(drop=True)
gamesDummied.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5324 entries, 0 to 5323
Columns: 14481 entries, losses_away to team_home_Washington Redskins
dtypes: float64(5), int64(4), uint8(14472)
memory usage: 73.8 MB


First, in order to handle the categorical variables, all categorical columns are one-hot encoded ("dummied"). Unfortunately, because every player and team name is a unique value in both the home and away columns, the dataset ends up with more than 14,000 features. To save computation time if this were run on a weekly basis, PCA is used to reduce the features to a more manageable number without losing too much variance.
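To make the column explosion concrete, here is a minimal sketch on a made-up frame. The team and player values (`A`, `p1`, and so on) and the name `toy` are placeholders, not taken from `gameData.csv`.

```python
import pandas as pd

# Toy stand-in for the game data; all names here are invented for illustration.
toy = pd.DataFrame({
    'team_home': ['A', 'B', 'C', 'A', 'B', 'C'],
    'team_away': ['B', 'C', 'A', 'C', 'A', 'B'],
    'qb_home':   ['p1', 'p2', 'p3', 'p1', 'p4', 'p3'],
    'wins_home': [0.0, 0.2, 0.1, 0.5, 0.4, 0.3],
})

# Each distinct string value becomes its own indicator column;
# numeric columns pass through unchanged.
toy_dummied = pd.get_dummies(toy)

print(toy.shape, '->', toy_dummied.shape)
print(list(toy_dummied.columns))
```

Four input columns already turn into eleven here (one numeric column plus 3 + 3 + 4 indicators), which is the same mechanism that produces thousands of columns once every rostered player is a category.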

In [6]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.99999, svd_solver='full')
outputs = gamesDummied[['vegas_expected_home_win_margin','score_home','score_away','schedule_season','home_bet']]
gameComponents = pca.fit_transform(gamesDummied.drop(columns=['vegas_expected_home_win_margin','score_home','score_away','home_bet','schedule_season']))

In [7]:
print(np.sum(pca.explained_variance_ratio_))
pcaGames = pd.DataFrame(data = gameComponents, columns = map(lambda x: 'princomp' + str(x), range(0,len(gameComponents[0]))))

0.9999917212007765


I've arbitrarily decided that a threshold of 99.999% retained variance would be acceptable for the principal component analysis. 
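One way to make the threshold choice less arbitrary is to look at how many components different thresholds would require. The sketch below uses synthetic data (not the game data) and assumes only that `explained_variance_ratio_` is sorted in decreasing order, which scikit-learn guarantees.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 30 columns driven by 5 latent factors plus a little noise.
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.01 * rng.normal(size=(200, 30))

# Fit with all components, then accumulate the explained variance ratios.
full = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

needed = {}
for threshold in (0.90, 0.99, 0.99999):
    # Smallest component count whose cumulative ratio reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    needed[threshold] = min(k, len(cumulative))

print(needed)
```

Running the same loop on the real feature matrix would show how much each extra "9" in the threshold costs in retained components.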

# Regression Models (Support Vector Machine and Multilayer Perceptron Model)

In [10]:
pcaGames[['vegas_expected_home_win_margin','score_home','score_away','schedule_season', 'home_bet']] = outputs[['vegas_expected_home_win_margin','score_home','score_away','schedule_season','home_bet']]

train = pcaGames[pcaGames.schedule_season <= 2018]
test = pcaGames[pcaGames.schedule_season >= 2019]

Since the point of the model is to predict the outcomes of future games, the two most recent seasons are used as a test set while all previous seasons since 2000 are used as a training set. This means the split is no longer random, but it would not make sense to use future games to predict those in the past.

## Poisson Regressor

### More Data Prep

For the SVM regressor, the 'schedule_season' column is first used to create a weight value for each datapoint. More recent games are given higher weights so that they are more relevant when training the model. This is necessary since the performance of older versions of teams should have little bearing on more recent games. A customized exponential function is used here to weight games in a mostly arbitrary manner; the least recent season is given a weight of '1' while the most recent is given a weight of about '5'. A wider range of weights was considered, but since some of this effect is already accounted for by including team rosters for each season (which will be more similar to other nearby seasons), this final weight function was decided on.
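The weighting cell itself is not shown, so the following is only a sketch of one exponential function with the stated endpoints (1 for the oldest season, about 5 for the newest). The name `season_weights` and the `max_weight` parameter are hypothetical.

```python
import numpy as np

def season_weights(seasons, max_weight=5.0):
    """Hypothetical weighting: 1.0 for the oldest season, max_weight for the newest."""
    seasons = np.asarray(seasons, dtype=float)
    span = seasons.max() - seasons.min()
    if span == 0:
        # Only one season present: weight everything equally.
        return np.ones_like(seasons)
    # Choose the growth rate so that exp(rate * span) == max_weight.
    rate = np.log(max_weight) / span
    return np.exp(rate * (seasons - seasons.min()))

example_weights = season_weights([2000, 2009, 2018])
print(example_weights)
```

An array built this way from `train['schedule_season']` could be passed as `sample_weight` to an estimator's `fit`, which is how the `weights` variable is used in the SVR cell.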

In [17]:
trainX = train.drop(columns=['vegas_expected_home_win_margin','score_home','score_away','home_bet','schedule_season'])
trainYHome = train['score_home']
trainYAway = train['score_away']

Our training input features end up being all of the dummied variables (players, head coaches, team names, stadium locations) as well as the normalized wins and losses each team has coming into a game. The output feature is the point differential relative to the home team. The Vegas spread is dropped since it is essentially a regression prediction in and of itself, and the goal of the project is to 'out-predict' it. The 'schedule_season' is represented in the sample weights and the 'class' will be used to judge our betting classifier, so these too are dropped.

### Hypertuning

In [54]:
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_squared_error

testX = test.drop(columns=['vegas_expected_home_win_margin','score_home','score_away','home_bet','schedule_season'])
testYHome = test['score_home']
testYAway = test['score_away']

clfHome = PoissonRegressor(alpha=alpha,max_iter=iteration)
clfHome.fit(trainX, trainYHome)

predYHome = clfHome.predict(testX)
print('Home score MSE is',mean_squared_error(testYHome,predYHome))

clfAway = PoissonRegressor(alpha=alpha,max_iter=iteration)
clfAway.fit(trainX, trainYAway)

predYAway = clfAway.predict(testX)
print('Away score MSE is',mean_squared_error(testYAway,predYAway))

Home score MSE is 95.16670109079493
Away score MSE is 101.53970141552658


In [50]:
predictions = pd.DataFrame(data=np.column_stack((predYHome,predYAway)),columns=['Home','Away'])
predictions['home_point_lead'] = predictions['Home'] - predictions['Away']
predictions['vegas_spread'] = test['vegas_expected_home_win_margin'].to_numpy()

def calculateHomeBet(row):
    if (row['home_point_lead'] > row['vegas_spread']):
        row['home_bet'] = 1
    else:
        row['home_bet'] = 0
    return row

predictions = predictions.apply(calculateHomeBet, axis=1)
predictions

| | Home | Away | home_point_lead | vegas_spread | home_bet |
| --- | --- | --- | --- | --- | --- |
| 0 | 23.665 | 20.36 | 3.306 | 3.0 | 1.0 |
| 1 | 20.91 | 20.812 | 0.097 | -3.0 | 1.0 |
| 2 | 21.836 | 22.318 | -0.482 | -2.0 | 1.0 |
| 3 | 20.79 | 19.518 | 1.272 | 5.5 | 0.0 |
| 4 | 23.832 | 21.437 | 2.395 | 7.0 | 0.0 |
| 5 | 22.595 | 23.639 | -1.044 | -3.5 | 1.0 |
| 6 | 23.755 | 20.539 | 3.216 | -6.0 | 1.0 |
| 7 | 20.448 | 21.044 | -0.596 | -7.0 | 1.0 |
| 8 | 23.685 | 19.569 | 4.115 | 3.5 | 1.0 |
| 9 | 25.538 | 19.001 | 6.537 | 5.5 | 1.0 |
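The row-wise `calculateHomeBet` rule can also be written as a single vectorized comparison. This sketch reuses the first five rows of the predictions shown above; the frame name `sample` is just for illustration.

```python
import pandas as pd

# First five rows of the predictions output above.
sample = pd.DataFrame({
    'home_point_lead': [3.306, 0.097, -0.482, 1.272, 2.395],
    'vegas_spread':    [3.0, -3.0, -2.0, 5.5, 7.0],
})

# Bet on the home team (1) when the predicted lead exceeds the Vegas margin,
# otherwise bet on the away team (0).
sample['home_bet'] = (sample['home_point_lead'] > sample['vegas_spread']).astype(int)
print(sample)
```

On a frame this size the difference is cosmetic, but the comparison avoids a Python-level function call per row when the rule is applied to every game.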


In [51]:
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(test['home_bet'], predictions['home_bet'])
matrix = matrix / np.sum(matrix)
matrix

array([[0.2622, 0.2959],
       [0.176 , 0.2659]])

In [52]:
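# confusion_matrix puts actual classes on rows and predicted classes on columns,
# so matrix[0][0] is the share of correctly predicted away bets (class 0).
# Precision and recall below are therefore reported for class 0.
print("Accuracy:", (matrix[0][0] + matrix[1][1]))
print("Precision:", (matrix[0][0] / (matrix[0][0] + matrix[1][0])))
print("Recall:", (matrix[0][0] / (matrix[0][0] + matrix[0][1])))

Accuracy: 0.5280898876404494
Precision: 0.5982905982905983
Recall: 0.46979865771812074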
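As a cross-check on how these figures are read, the sketch below uses made-up labels to show scikit-learn's layout: rows are actual classes, columns are predicted classes, and `precision_score` / `recall_score` report on class 1 by default. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels: 1 = home bet, 0 = away bet.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# For binary labels the flattened order is: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = cm.ravel()

precision_home = tp / (tp + fp)   # of predicted home bets, how many were right
recall_home = tp / (tp + fn)      # of actual home covers, how many were found

print(cm)
print(precision_home, precision_score(y_true, y_pred))
print(recall_home, recall_score(y_true, y_pred))
```

The same index arithmetic works on a normalized matrix, since dividing every cell by the total leaves the ratios unchanged.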


This hypertuning code is commented out because it was only run to find the best model parameters. It determined that a C value of 10 and a gamma value of 0.001 returned the lowest mean squared error (194.681), meaning that these parameters (among those tested) let the model predict the point differential of the games most accurately.
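The tuning loop itself is not included, so the following is only a sketch of what such a search could look like. The grid values, the synthetic data, and the names `X_fit`, `X_val`, and `grid_results` are all assumptions for illustration, not the grid that was actually run.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic regression problem standing in for the PCA features.
coef = np.array([3.0, -2.0, 0.5, 0.0])
X_fit, X_val = rng.normal(size=(80, 4)), rng.normal(size=(40, 4))
y_fit = X_fit @ coef + rng.normal(scale=0.5, size=80)
y_val = X_val @ coef + rng.normal(scale=0.5, size=40)

grid_results = []
for C in (0.1, 1, 10):                 # illustrative values only
    for gamma in (0.001, 0.01, 0.1):   # illustrative values only
        model = SVR(C=C, gamma=gamma).fit(X_fit, y_fit)
        mse = mean_squared_error(y_val, model.predict(X_val))
        grid_results.append((C, gamma, mse))

# Keep the parameter pair with the lowest held-out MSE.
best_C, best_gamma, best_mse = min(grid_results, key=lambda r: r[2])
print(best_C, best_gamma, best_mse)
```

Scoring candidates on a split that is separate from the final test seasons keeps the reported test MSE from being tuned to that particular set of games.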

### Fitting, Predicting, and Evaluating

In [41]:
from sklearn.svm import SVR

clf = SVR(C=10, gamma=0.001)
clf.fit(trainX, trainY, sample_weight=weights)

predY = clf.predict(testX)
MSE = mean_squared_error(testY,predY)
MSE

194.68108949915006

In [42]:
dfSVM = pd.DataFrame(data={'actual': testY,'pred': predY,'vegas_expected_home_win_margin': test['vegas_expected_home_win_margin']})
dfSVM = dfSVM.apply(calculateClass, axis=1, args=('pred',))
dfSVM

| | actual | pred | vegas_expected_home_win_margin | class |
| --- | --- | --- | --- | --- |
| 5057 | -7.0 | 2.693 | 3.0 | 0.0 |
| 5058 | 0.0 | -0.855 | -3.0 | 1.0 |
| 5059 | -3.0 | -1.306 | -2.0 | 1.0 |
| 5060 | -30.0 | 1.739 | 5.5 | 0.0 |
| 5061 | 18.0 | 3.458 | 7.0 | 0.0 |
| 5062 | -14.0 | -3.409 | -3.5 | 1.0 |
| 5063 | 6.0 | 1.146 | -6.0 | 1.0 |
| 5064 | -49.0 | -2.074 | -7.0 | 1.0 |
| 5065 | 16.0 | 5.891 | 3.5 | 1.0 |
| 5066 | 30.0 | 7.866 | 5.5 | 1.0 |


The model is trained and tested using the optimal parameters and the sample weights defined above. Afterwards, the actual score differentials, the model's predictions, the Vegas spread, and the newly calculated class are shown in a dataframe.

In [43]:
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(test['class'], dfSVM['class'])
matrix = matrix / np.sum(matrix)
matrix

array([[0.2809, 0.2772],
       [0.191 , 0.2509]])

In [44]:
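# confusion_matrix puts actual classes on rows and predicted classes on columns,
# so matrix[0][0] is the share of correctly predicted away bets (class 0).
# Precision and recall below are therefore reported for class 0.
print("Accuracy:", (matrix[0][0] + matrix[1][1]))
print("Precision:", (matrix[0][0] / (matrix[0][0] + matrix[1][0])))
print("Recall:", (matrix[0][0] / (matrix[0][0] + matrix[0][1])))

Accuracy: 0.5318352059925093
Precision: 0.5952380952380952
Recall: 0.5033557046979866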


The predicted class is then compared with the actual class. Again, a '0' or '-' means that the model predicts a bet should be made on the away team to cover the spread. A '1' or a '+' means the model predicts a bet should be made on the home team to cover the spread. The model is 'correct' if its predicted bet is validated by the actual point differential falling above or below the Vegas spread. Accuracy, precision, and recall values are also shown and are unfortunately lower than I had hoped for. Since this model did not work out, I will also try a perceptron regressor to make sure my SVM model was not simply prepared incorrectly.
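The notebook's own `calculateClass` helper is not shown, so the sketch below uses a hypothetical stand-in, `cover_class`, to illustrate the comparison on the first five rows of the SVM results above. Treating an exact tie with the spread (a push) as class 0 is an assumption of this sketch.

```python
import pandas as pd

def cover_class(row, column):
    """Hypothetical helper: 1 if the value in `column` beats the Vegas margin, else 0."""
    return 1 if row[column] > row['vegas_expected_home_win_margin'] else 0

# First five rows of the SVM results table above.
bets = pd.DataFrame({
    'actual': [-7.0, 0.0, -3.0, -30.0, 18.0],
    'pred': [2.693, -0.855, -1.306, 1.739, 3.458],
    'vegas_expected_home_win_margin': [3.0, -3.0, -2.0, 5.5, 7.0],
})

# The same rule applied to the prediction and to the real outcome.
bets['predicted_class'] = bets.apply(cover_class, axis=1, args=('pred',))
bets['actual_class'] = bets.apply(cover_class, axis=1, args=('actual',))
bets['correct'] = bets['predicted_class'] == bets['actual_class']

print(bets)
print('Hit rate on these rows:', bets['correct'].mean())
```

A bet counts as correct only when the prediction and the real result fall on the same side of the spread, regardless of how far off the predicted margin was.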

## Multilayer Perceptron Regressor

### Hypertuning

In [52]:
from sklearn.neural_network import MLPRegressor

trainX = train.drop(columns=['vegas_expected_home_win_margin','home_point_lead','class'])
testX = test.drop(columns=['vegas_expected_home_win_margin','home_point_lead','class'])

alphas = [0.0001,0.001,0.01,0.1]
iters = [100,200,400,600]

results = []

for alpha in alphas:
    # n_iter avoids shadowing Python's built-in iter()
    for n_iter in iters:
        regressor = MLPRegressor(hidden_layer_sizes=(600,300),learning_rate_init=alpha,max_iter=n_iter)
        regressor.fit(trainX,trainY)

        predY = regressor.predict(testX)
        MSE = mean_squared_error(testY,predY)

        results.append([alpha,n_iter,MSE])

resultsFrame = pd.DataFrame(data=results,columns=['Alpha','Iterations','MSE'])
resultsFrame

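| | Alpha | Iterations | MSE |
| --- | --- | --- | --- |
| 0 | 0.0 | 100 | 199.421 |
| 1 | 0.0 | 200 | 230.789 |
| 2 | 0.0 | 400 | 288.758 |
| 3 | 0.0 | 600 | 206.537 |
| 4 | 0.001 | 100 | 194.506 |
| 5 | 0.001 | 200 | 201.525 |
| 6 | 0.001 | 400 | 238.462 |
| 7 | 0.001 | 600 | 240.624 |
| 8 | 0.01 | 100 | 228.684 |
| 9 | 0.01 | 200 | 251.171 |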


Like the SVM model, the MLP regressor is hypertuned to determine the best parameters. This block of code is also commented out, since its compute time is unnecessary once the tests have been run. The hypertuning determined that a learning rate (alpha) of 0.001 and a max iteration count of 100 returned the lowest mean squared error (194.506).
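Picking the winning row can be done programmatically rather than by eye. This sketch copies the first eight rows of the results shown above; the learning rates for the first four rows are taken from the `alphas` list (0.0001), since the printed output rounds them to 0.0. The frame name `tuning` is just for illustration.

```python
import pandas as pd

# First eight rows of the tuning results above.
tuning = pd.DataFrame(
    [[0.0001, 100, 199.421], [0.0001, 200, 230.789],
     [0.0001, 400, 288.758], [0.0001, 600, 206.537],
     [0.001, 100, 194.506], [0.001, 200, 201.525],
     [0.001, 400, 238.462], [0.001, 600, 240.624]],
    columns=['Alpha', 'Iterations', 'MSE'])

# idxmin returns the index label of the lowest MSE; loc fetches that row.
best = tuning.loc[tuning['MSE'].idxmin()]
print(best)
```

Note also that the MSE does not fall steadily as the iteration cap rises, which suggests a fair amount of run-to-run variation between fits.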

In [53]:
regressor = MLPRegressor(hidden_layer_sizes=(600,300),learning_rate_init=0.001,max_iter=100)
regressor.fit(trainX, trainY)

predY = regressor.predict(testX)
MSE = mean_squared_error(testY,predY)


In [54]:
dfMLR = pd.DataFrame(data={'actual': testY,'pred': predY,'vegas_expected_home_win_margin': test['vegas_expected_home_win_margin']})
dfMLR = dfMLR.apply(calculateClass, axis=1, args=('pred',))
dfMLR

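| | actual | pred | vegas_expected_home_win_margin | class |
| --- | --- | --- | --- | --- |
| 5057 | -7.0 | 6.997 | 3.0 | 1.0 |
| 5058 | 0.0 | 1.499 | -3.0 | 1.0 |
| 5059 | -3.0 | 1.982 | -2.0 | 1.0 |
| 5060 | -30.0 | 5.48 | 5.5 | 0.0 |
| 5061 | 18.0 | 5.966 | 7.0 | 0.0 |
| 5062 | -14.0 | 0.423 | -3.5 | 1.0 |
| 5063 | 6.0 | 5.856 | -6.0 | 1.0 |
| 5064 | -49.0 | 2.233 | -7.0 | 1.0 |
| 5065 | 16.0 | 7.381 | 3.5 | 1.0 |
| 5066 | 30.0 | 10.407 | 5.5 | 1.0 |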


In [55]:
matrix = confusion_matrix(test['class'], dfMLR['class'])
matrix = matrix / np.sum(matrix)
matrix

array([[0.1199, 0.4382],
       [0.0787, 0.3633]])

In [56]:
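# confusion_matrix puts actual classes on rows and predicted classes on columns,
# so matrix[0][0] is the share of correctly predicted away bets (class 0).
# Precision and recall below are therefore reported for class 0.
print("Accuracy:", (matrix[0][0] + matrix[1][1]))
print("Precision:", (matrix[0][0] / (matrix[0][0] + matrix[1][0])))
print("Recall:", (matrix[0][0] / (matrix[0][0] + matrix[0][1])))

Accuracy: 0.48314606741573035
Precision: 0.6037735849056605
Recall: 0.21476510067114093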


Like the SVM model, the predicted class is then compared with the actual class. Our evaluation metrics for the derived classification are once again lower than expected.

In [57]:
MSE = mean_squared_error(testY,test['vegas_expected_home_win_margin'])
MSE

170.9765917602996

Here, the MSE between the actual point differentials and the Vegas spread is shown. Since the spread is essentially another regression prediction, it serves as a good baseline for our models' performance. Unfortunately, at this time I was not able to outperform the Vegas models. In fact, my methods underperformed considerably (an MSE roughly 24 higher). The gambling industry most likely has much better resources, datasets, and modeling procedures than I was able to apply within the scope of this project. In the future, I would like to continue to tweak this project to come closer to outperforming the Vegas models.
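Because MSE is in squared points, the gap is easier to interpret after taking square roots. The short sketch below converts the three MSE values reported in this notebook to RMSE; the dictionary names are just for illustration.

```python
import math

# MSE values reported earlier in this notebook (point differential, squared points).
mse_values = {'Vegas spread': 170.977, 'SVR': 194.681, 'MLP': 194.506}

# RMSE is in the same unit as the point differential itself.
rmse = {name: math.sqrt(mse) for name, mse in mse_values.items()}

for name, value in rmse.items():
    print(f'{name}: RMSE = {value:.2f} points')
print(f"SVR minus Vegas: {rmse['SVR'] - rmse['Vegas spread']:.2f} points")
```

On that scale both models miss by roughly 14 points per game against roughly 13 for the spread, so the gap to Vegas is under one point of typical error.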

## Conclusion and Wrap Up

The project, as a whole, was a lot of fun to map out and execute. It was very interesting to think about the game of football as a math equation with hundreds of variables of differing influence. While the results were not exactly what I was hoping for, I do plan to keep tweaking the project, most likely by gathering more relevant data and structuring it better. I would also consider using Principal Component Analysis on the data, since it currently has far too many features. This caused the hypertuning of my models to run for far too long, wasting a lot of time while I waited for them to execute. Overall, this project taught me the many complexities and hardships of data preparation and modeling. Despite being very challenging, it was also very engaging, and I would be happy to carry on with it in the future.

### Custom Programming and Bibliography

* http://www.jt-sw.com/football/pro/rosters.nsf used to get all team rosters since the year 2000
* https://www.kaggle.com/tobycrabtree/nfl-scores-and-betting-data#spreadspoke_scores.csv used to get game information
* Custom programming work done includes most of the preprocessing code. This includes the web scraping block, calculating wins and losses for each game, adding the players to the game data entries, and calculating the class.