# **Logistic Regression vs Random Forest to Predict if a Punt will be Returned**

In this report, we ask whether we can predict if a punt will be returned, and we compare two classification methods for the task: Logistic Regression and Random Forest.

# **1. Import libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **2. Bringing in data / Cleaning data**

In [None]:
#import plays data
plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/plays.csv')

In [None]:
#filter out punt plays from all plays
punts = plays[plays['specialTeamsPlayType'] == 'Punt']

In [None]:
#import PFF data
PFF = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/PFFScoutingData.csv')

In [None]:
#combine punt plays with PFF data
PuntPFF = PFF.merge(punts, how='inner',on=['gameId','playId'])
PuntPFF.sort_values(by=['gameId'], inplace=True)

In [None]:
#drop some variables
PuntPlays = PuntPFF.drop(['passResult', 'penaltyCodes', 'penaltyJerseyNumbers', 'gameClock','preSnapHomeScore','preSnapVisitorScore','kickerId','returnerId','kickBlockerId','down','specialTeamsPlayType','specialTeamsSafeties','vises','tackler','kickoffReturnFormation','gunners','puntRushers','missedTackler','assistTackler'], axis=1)


We began by trimming the data provided down to only punt plays. We removed unnecessary variables to make the data more manageable and allow for our model to run cleanly later. The variables that we decided to continue studying heavily focused on the punt’s context rather than the actions of the players. To restate, the purpose of our model is to predict whether a punt will be returned based on game factors including the quarter in which the play is occurring, the yardline from which the ball is punted, the length of the kick, and other variables.
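
The filter-and-merge step described here can be illustrated on a tiny example. This is only a sketch: the rows below are hypothetical stand-ins for `plays.csv` and `PFFScoutingData.csv`, and the names `toy_plays`, `toy_pff`, `toy_punts`, and `toy_merged` are ours, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the two competition files
toy_plays = pd.DataFrame({
    'gameId': [1, 1, 2],
    'playId': [10, 11, 10],
    'specialTeamsPlayType': ['Punt', 'Kickoff', 'Punt'],
    'quarter': [1, 1, 3],
})
toy_pff = pd.DataFrame({
    'gameId': [1, 1, 2],
    'playId': [10, 11, 10],
    'hangTime': [4.5, 3.9, 4.1],
})

# Keep only punts, then attach the scouting columns on the (gameId, playId) key
toy_punts = toy_plays[toy_plays['specialTeamsPlayType'] == 'Punt']
toy_merged = toy_pff.merge(toy_punts, how='inner', on=['gameId', 'playId'])

print(toy_merged[['gameId', 'playId', 'quarter', 'hangTime']])
```

Because the merge is an inner join, the kickoff row disappears: it has scouting data but no matching punt play.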

This led us to our final data frame of punt plays.

In [None]:
#make punt dataframe and drop more variables
PuntDF = PuntPlays.drop(['returnDirectionIntended','returnDirectionActual','kickContactType'], axis=1)
PuntDF

# **3. Data Manipulation**

### Changing Categorical Variables to Binary

To better manipulate our variables, we adjusted them from text to boolean and numerical values. 

In [None]:
# Create 'returned' boolean column to indicate whether a punt was returned or not
PuntDF.loc[(PuntDF['specialTeamsResult'] == 'Return'), 'returned'] = 1
PuntDF['returned'] = PuntDF['returned'].fillna(0)
PuntDF.reset_index(inplace=True)

# Create 'totalYardline' variable (yards from the punting team's own end zone), which accounts for teams punting from beyond the 50 yard line
PuntDF.loc[(PuntDF['possessionTeam'] == PuntDF['yardlineSide']), 'totalYardline'] = PuntDF['yardlineNumber']
PuntDF.loc[(PuntDF['possessionTeam'] != PuntDF['yardlineSide']), 'totalYardline'] = ((50 - PuntDF['yardlineNumber']) + 50)

# Recode kickDirectionActual as numeric: L = 0, C = 1, R = 2
PuntDF.loc[(PuntDF['kickDirectionActual'] == 'L'), 'kickDirectionActual'] = 0
PuntDF.loc[(PuntDF['kickDirectionActual'] == 'C'), 'kickDirectionActual'] = 1
PuntDF.loc[(PuntDF['kickDirectionActual'] == 'R'), 'kickDirectionActual'] = 2

# Recode kickDirectionIntended as numeric: L = 0, C = 1, R = 2
PuntDF.loc[(PuntDF['kickDirectionIntended'] == 'L'), 'kickDirectionIntended'] = 0
PuntDF.loc[(PuntDF['kickDirectionIntended'] == 'C'), 'kickDirectionIntended'] = 1
PuntDF.loc[(PuntDF['kickDirectionIntended'] == 'R'), 'kickDirectionIntended'] = 2

# Recode kickType into two groups: normal = 0, non-normal (Aussie/rugby) = 1
# Note: rows with a missing kickType also satisfy != 'N' and are coded 1
PuntDF.loc[(PuntDF['kickType'] != 'N'), 'kickType'] = 1
PuntDF.loc[(PuntDF['kickType'] == 'N'), 'kickType'] = 0

# Recode snapDetail into two groups: 0 = OK, 1 = left/right/high/low
# Note: rows with a missing snapDetail also satisfy != 'OK' and are coded 1
PuntDF.loc[(PuntDF['snapDetail'] != 'OK'), 'snapDetail'] = 1
PuntDF.loc[(PuntDF['snapDetail'] == 'OK'), 'snapDetail'] = 0
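
As a small worked example of the recoding above, the sketch below computes `totalYardline` and the direction codes on three made-up rows. The team names and values are hypothetical, and using `None` for a missing `yardlineSide` is our assumption about how a snap with no recorded side would appear. `np.where` and `Series.map` are used here as compact equivalents of the `.loc` assignments, not as the notebook's own method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'possessionTeam': ['BUF', 'BUF', 'KC'],
    'yardlineSide': ['BUF', 'KC', None],  # None: hypothetical missing side
    'yardlineNumber': [30, 40, 50],
    'kickDirectionActual': ['L', 'C', 'R'],
})

# Own side of the field: keep the yardline number.
# Otherwise measure from the punting team's own end zone: 100 - yardlineNumber.
own_side = toy['possessionTeam'] == toy['yardlineSide']
toy['totalYardline'] = np.where(own_side,
                                toy['yardlineNumber'],
                                100 - toy['yardlineNumber'])

# Same L/C/R -> 0/1/2 coding as in the cell above
toy['kickDirectionActual'] = toy['kickDirectionActual'].map({'L': 0, 'C': 1, 'R': 2})

print(toy[['yardlineNumber', 'totalYardline', 'kickDirectionActual']])
```

A punt from the opponent's 40 becomes 60 yards from the punting team's own end zone, which is what `(50 - yardlineNumber) + 50` computes.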



The data frame used for analysis, with the recoded variables:

In [None]:
PuntDF

### Check how evenly weighted the two classes are

Unreturned punts, represented by 0.0, totaled 3,705. Returned punts, represented by 1.0, totaled 2,286.

In [None]:
# Check how evenly weighted the two classes are
PuntDF['returned'].value_counts()
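
The class counts quoted above give a useful reference point: the accuracy of a model that always predicts "not returned". A minimal sketch of that arithmetic follows. The variable names are ours, and the counts are for the full `PuntDF`, so the figure is only approximate for any later subset of rows.

```python
# Class counts quoted in the text above
not_returned = 3705
returned = 2286

total = not_returned + returned
# Accuracy of always predicting the majority class (0 = not returned)
baseline_accuracy = not_returned / total

print(f'Total punts: {total}')
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')  # 0.618
```

Any classifier's accuracy is easier to interpret next to this baseline of roughly 0.618.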

In [None]:
#change type of variables
PuntDF['snapDetail'] = PuntDF['snapDetail'].astype('int64')
PuntDF['kickType'] = PuntDF['kickType'].astype('int64')

# **4. Data Visualization**

### Checking Correlation

In [None]:
# Correlation matrix

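# Restrict to numeric columns; recent pandas versions raise an error on text columns
corr = PuntDF.select_dtypes(include='number').corr()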

plt.rcParams["figure.figsize"] = (15,10)
sns.heatmap(corr, annot=True)
plt.show()

We found that kickReturnYardage and playResult have a strong negative correlation and that playResult and penaltyYards are also noticeably correlated. No other variable shows a remarkable level of correlation with playResult.

### Create target df with features and outcome variable

In [None]:
# df with features + target variable

PuntDF_Target = PuntDF[['snapTime','operationTime','hangTime','kickDirectionActual','kickDirectionIntended','kickType','snapDetail','quarter','yardsToGo','kickLength','totalYardline','returned']]
PuntDF_Target = PuntDF_Target.dropna()

PuntDF_Target['kickDirectionIntended'] = PuntDF_Target['kickDirectionIntended'].astype('int64')
PuntDF_Target['kickDirectionActual'] = PuntDF_Target['kickDirectionActual'].astype('int64')
print(PuntDF_Target.shape[0])

# check correlation between variables

plt.figure(figsize=(12,10))
cor_target = PuntDF_Target.corr()
sns.heatmap(cor_target, annot=True, cmap=plt.cm.Reds)
plt.show()

kickDirectionIntended and kickDirectionActual are very highly correlated. totalYardline and kickType are highly correlated as well. No other pair of variables stands out.

In [None]:
# Drop kickDirectionIntended

PuntDF_Target = PuntDF_Target.drop('kickDirectionIntended', axis=1)

kickDirectionIntended is dropped because of its high correlation with kickDirectionActual and our belief that kickDirectionActual has more influence on a punt return.
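
Highly correlated pairs can also be listed programmatically rather than read off the heatmap. The sketch below does this on synthetic data. The columns `a`, `b`, `c` and the `threshold` of 0.8 are hypothetical choices for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.05, size=200),  # near-duplicate of 'a'
    'c': rng.normal(size=200),                     # independent noise
})

threshold = 0.8  # hypothetical cutoff for "high" correlation
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

Which member of a flagged pair to drop remains a judgment call, as it was with the two kick direction variables.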

### Let's explore the distribution of the data

In [None]:
# Pairplot of the target dataframe

sns.pairplot(PuntDF_Target, hue='returned')
plt.show()

### Zooming in on a few of the most revealing plots

In [None]:
# Scatter plot of operationTime vs. kickLength with returned punts colored orange

sns.scatterplot(x='operationTime', y='kickLength', hue='returned', data=PuntDF)
plt.title('Operation Time vs. Kick Length of Punts')
plt.show()

In [None]:
# Scatter plot of hangTime vs. kickLength with returned punts colored in orange

sns.scatterplot(x='hangTime', y='kickLength', hue='returned', data=PuntDF)
plt.title('Hang Time vs. Kick Length of Punts')
plt.show()

In [None]:
# Break down totalYardline into returned punts and non-returned punts. Check distribution of where those punts are being
# kicked from on the field

yardlineNotReturned = PuntDF[PuntDF['returned'] == 0]['totalYardline']
yardlineReturned = PuntDF[PuntDF['returned'] == 1]['totalYardline']

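# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(yardlineNotReturned, kde=True, stat='density', color='C0', label='No Return')
sns.histplot(yardlineReturned, kde=True, stat='density', color='C1', label='Return')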
plt.xlabel('Yards from own endzone to LOS')
plt.title('Location of LOS for Returned and Non-returned Punts')
plt.legend()
plt.show()

# **5. Building Models**

### Logistic Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(PuntDF_Target[['snapTime','operationTime','hangTime','kickDirectionActual','kickType','snapDetail','quarter','yardsToGo','kickLength','totalYardline']], PuntDF_Target[['returned']] , test_size=0.2, random_state=123)

logreg = LogisticRegression(max_iter=500)
logreg.fit(X_train, y_train.values.ravel())
predictions = logreg.predict(X_test)
score = logreg.score(X_test, y_test)
print(score)

The logistic regression model achieves moderate accuracy on the held-out test set.

In [None]:
# Classification report for logistic regression model

print(metrics.classification_report(y_test, predictions))

We found that the logistic regression model had a strong recall rate for 0 (punts not returned). The model was less successful predicting 1 (punts that are returned).

In [None]:
# Heat map for actual and predicted values for logistic regression

cm = metrics.confusion_matrix(y_test, predictions)
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='Blues_r');  # counts are integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);

As the confusion matrix shows, this model is effective for punts not returned (many correct predictions). It is weaker at identifying punts that are actually returned.
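
The link between a confusion matrix and the per-class recall and precision in a classification report can be made explicit. The labels below are made up for illustration and are not the notebook's results.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 0 = not returned, 1 = returned
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

# Rows are actual classes, columns are predicted classes
toy_cm = metrics.confusion_matrix(y_true, y_pred)

# Recall: correct predictions divided by the row total (all actual members of the class)
recall_per_class = toy_cm.diagonal() / toy_cm.sum(axis=1)
# Precision: correct predictions divided by the column total (all predictions of the class)
precision_per_class = toy_cm.diagonal() / toy_cm.sum(axis=0)

print(toy_cm)
print('Recall per class:   ', recall_per_class.round(3))
print('Precision per class:', precision_per_class.round(3))
```

In this toy case, class 0 has high recall while class 1 has a recall of only 0.5, the same qualitative pattern described for the logistic regression model.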

In [None]:
# ROC curve for logistic regression

import scikitplot as skplt

plt.rcParams['figure.figsize'] = [10, 10]

predicted_probas = logreg.predict_proba(X_test)

skplt.metrics.plot_roc(y_test, predicted_probas)
plt.title('ROC Curves: Logistic Regression Classifier')
plt.show()
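
If `scikitplot` is not available, the numbers behind a ROC curve can be computed with scikit-learn alone. The sketch below uses synthetic data from `make_classification`. The names `X_demo`, `y_demo`, and `demo_model` are hypothetical and separate from the punt data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the punt data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_model = LogisticRegression(max_iter=500).fit(Xd_train, yd_train)

# ROC analysis needs the predicted probability of the positive class
pos_scores = demo_model.predict_proba(Xd_test)[:, 1]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(yd_test, pos_scores)
auc = roc_auc_score(yd_test, pos_scores)

print(f'Points on the ROC curve: {len(fpr)}')
print(f'AUC: {auc:.3f}')
```

Plotting `fpr` against `tpr` with `plt.plot` would reproduce the curve for the positive class.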

### Random Forest Classification

In [None]:
# Random Forest classification

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=123)
rf.fit(X_train, y_train.values.ravel())

In [None]:
# OOB score and accuracy score of RF model

predicted = rf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')

In [None]:
# Classification report for RF model

print(metrics.classification_report(y_test, predicted))

We found that the Random Forest classifier was much better at predicting punt returns than the logistic regression model. It was also slightly better at predicting punts not returned.

In [None]:
cm = metrics.confusion_matrix(y_test, predicted)
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='Blues_r');  # counts are integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(accuracy)
plt.title(all_sample_title, size = 15);

As noted above, Random Forest remains good at predicting punts not returned and is better at correctly predicting punt returns. Its overall accuracy is 0.043 higher than that of our logistic regression model.

In [None]:
# ROC curve for RF model

plt.rcParams['figure.figsize'] = [10, 10]

predicted_probas = rf.predict_proba(X_test)

skplt.metrics.plot_roc(y_test, predicted_probas)
plt.title('ROC Curves: Random Forest Classifier')
plt.show()

We decided here to use Random Forest for the rest of our analysis, as it predicts more accurately. Our next step was to determine the feature importance of our variables.
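
A single train/test split can favor one model by chance, so one way to double-check a comparison like this is k-fold cross-validation. The sketch below shows the idea on synthetic data. The 5-fold setting and the names `candidates` and `cv_means` are our assumptions, and the notebook itself does not run this step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not the punt data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    'logreg': LogisticRegression(max_iter=500),
    'rf': RandomForestClassifier(n_estimators=100, random_state=123),
}

cv_means = {}
for name, model in candidates.items():
    # Accuracy on each of 5 held-out folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```

Applied to the punt features, the same loop would show whether the accuracy gap between the two models holds across folds.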

# **6. Feature Importance**

In [None]:
# RF feature importances using built-in functionality
rf.feature_importances_

In [None]:
# Feature importance for RF model

feature_names = np.array(['snapTime','operationTime','hangTime','kickDirectionActual','kickType','snapDetail','quarter','yardsToGo','kickLength','totalYardline'])
sorted_idx = rf.feature_importances_.argsort()
plt.barh(feature_names[sorted_idx], rf.feature_importances_[sorted_idx], color='midnightblue')
plt.xlabel("Random Forest Feature Importance")
plt.show()

Two variables stood out immediately: snapDetail and totalYardline. snapDetail is very low in importance, while totalYardline is exceedingly important. The rest of our variables fall between the two.

In [None]:
# Permutation importance values
# Similar to the in-model importances with a few exceptions

from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(rf, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx], color='midnightblue')
plt.xlabel("Permutation Importance")
plt.show()

When looking at permutation importance, we found that kickLength took the top spot, but totalYardline is still highly important. snapTime and operationTime are least important.

We noticed that snapDetail ranks low on both importance checks. We decided that it may be worth dropping next time.
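
The two importance measures can rank features differently because they measure different things. The built-in importances come from impurity reductions during training. Permutation importance measures how much the held-out score drops when one feature is shuffled. The sketch below puts them side by side on synthetic data. The names `demo_rf`, `impurity_imp`, and `perm` are hypothetical, and `n_repeats=10` is our choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 2 informative features plus 3 noise features
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_rf = RandomForestClassifier(n_estimators=100, random_state=123)
demo_rf.fit(Xd_train, yd_train)

# Impurity-based importances: computed from the training data, sum to 1
impurity_imp = demo_rf.feature_importances_

# Permutation importances: mean drop in test accuracy when a feature is shuffled
perm = permutation_importance(demo_rf, Xd_test, yd_test,
                              n_repeats=10, random_state=123)

for i in range(X_demo.shape[1]):
    print(f'feature {i}: impurity={impurity_imp[i]:.3f}, '
          f'permutation={perm.importances_mean[i]:.3f}')
```

Permutation importances can be close to zero or slightly negative for features the model does not rely on.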

In [None]:
# SHAP Value estimating how much each feature contributes to the prediction

import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

In [None]:
shap.summary_plot(shap_values, X_test, plot_type="bar")

# **7. Summary**

Ultimately, of the two models we set out to create, we found that the Random Forest was better at predicting whether or not a punt will be returned. The variables with the most influence on the returner's decision are totalYardline, kickLength, and hangTime (in no particular order). To further polish our model, we could have removed some variables to focus more squarely on the factors that drive that decision. We were happy to close with a model that has shown success in predicting the outcome of a given punt.