# Exploring NFL Play-By-Play Data

### Ian Johnson, Derek Phanekham, Travis Siems

## Introduction

The NFL (National Football League) has 32 teams split into two conferences, the AFC and NFC. Each of the 32 teams plays 16 games during the regular season (non-playoff season) every year. Because American football draws a large viewership and fantasy football is pervasive, a great deal of data about the game is collected. During the 2015-2016 season, information about every play from each game was logged. All of that data was consolidated into a single data set, which is analyzed throughout this report.

## Business Understanding

* [10 points] Give an overview of the dataset and the analyses you will be performing. What is your plan for analyzing the data and why? This section is easiest to write as a planning phase for the assignment. 

### Motivations and Intended Analyses

The data being used for analysis is a table of 63 attributes for 46,129 rows (plays). The data will be analyzed to identify two potential insights. The first goal, motivated by the prevalence of fantasy football, is to identify players who perform exceptionally well, and specifically to identify in what situations a player excels. The second goal, motivated by the need for coaching insights, is to produce situationally-aware metrics for the potential success of a play. For example: given a field location, score differential, team, and time, identify what type of play is most likely to be successful.

#### Player Performance Insights

Two forms of player performance analysis are relevant for fantasy football and general player performance evaluation. The first is a novel analysis, wherein all players of a certain position are ranked based on their performance at that position. This analysis can provide insight into identifying which players are most valuable for a fantasy team. The second is player-to-player comparison. Fantasy players are often faced with a decision of which player to play on their fantasy team in any given week. They must choose between players based on their individual player performances, as well as their matchups for the week. Consider a situation where player A is individually superior to player B, but player B is facing a team whose defense is very weak, while player A is facing a team whose defense is strong. Which player is expected to outperform the other? This question can be answered by analyzing the performance of each individual player against their respective opponents.

#### Play-Calling Optimization

Offensive play-calling is a very difficult task, and is often a cause of error for teams and coaches. Providing a data-informed situational understanding of the probable outcomes of certain types of plays could help inform coaches' play-calling. Analyzing the statistical outcomes of play-calls can be done on a league-wide, per-team, or per-matchup basis. As the analysis becomes more specific (narrowing down to a specific team, or a specific matchup of two teams), the relevance of the analysis increases, but so does the margin of error, because fewer plays are available in each group.
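The trade-off between specificity and margin of error can be sketched with a small grouped summary. This is only an illustration on made-up rows: the column names follow the data set, but the `"Run"`/`"Pass"` values of `PlayType`, the team codes, and the `summarize` helper are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Toy play-by-play rows (hypothetical values, not taken from the data set)
plays = pd.DataFrame({
    "posteam":       ["DAL", "DAL", "DAL", "DAL", "NYG", "NYG", "NYG", "NYG"],
    "DefensiveTeam": ["NYG", "NYG", "PHI", "PHI", "DAL", "DAL", "PHI", "PHI"],
    "PlayType":      ["Run", "Pass", "Run", "Pass", "Run", "Pass", "Run", "Pass"],
    "Yards.Gained":  [4, 12, 2, 0, 7, 5, 1, 20],
})

def summarize(df, keys):
    """Mean yards, sample size, and standard error of the mean per group."""
    out = df.groupby(keys)["Yards.Gained"].agg(["mean", "count", "std"])
    out["sem"] = out["std"] / np.sqrt(out["count"])
    return out

league = summarize(plays, ["PlayType"])                                  # broadest grouping
per_team = summarize(plays, ["posteam", "PlayType"])                     # narrower
per_matchup = summarize(plays, ["posteam", "DefensiveTeam", "PlayType"]) # narrowest
```

In this toy example every per-matchup group contains a single play, so its standard error is undefined (`NaN`), which is the extreme case of the widening margin of error.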

* [10 points] Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Why is this data important and how will you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good learning algorithm in terms of its business use? Be specific.

### Data Purpose and Performance Metrics

A vast amount of money, pride, and time is invested in NFL football, which is why the play-by-play data was gathered in the first place. The intent of analyzing the data is to identify trends or statistics which can meaningfully influence the decisions made by coaches and fantasy football players. It is important to define a metric by which the results of any analyses will be measured. Since two main forms of analysis will occur, two performance metrics must be defined.

#### Metrics for Player Performance Insights

Any meaningful player performance analysis must include a novel look at season-long player performance. For a running back, for example, total carries, yards, and touchdowns must be calculated. However, this novel analysis is simply a baseline. In order for a player performance analysis to be considered effective or meaningful, specific trends must be identified for that player which do not appear during routine stat summaries. For example, for a running back, a meaningful and effective analysis may conclude that the player in question performs significantly better when playing against teams whose defenses are very strong against passing plays, or that he performs significantly better when playing away, as opposed to at home.
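The baseline season totals for a running back could be computed along these lines. This is a minimal sketch on made-up rows: the player names and the `"Run"` value of `PlayType` are assumptions, and the real data set may encode rushing plays differently.

```python
import pandas as pd

# Toy rows (hypothetical players and values)
plays = pd.DataFrame({
    "PlayType":     ["Run", "Run", "Pass", "Run", "Run"],
    "Rusher":       ["A.Back", "A.Back", None, "B.Runner", "A.Back"],
    "Yards.Gained": [5, -2, 14, 8, 31],
    "Touchdown":    [0, 0, 0, 0, 1],
})

# Keep rushing plays only, then total carries, yards, and touchdowns per rusher
runs = plays[plays["PlayType"] == "Run"]
totals = runs.groupby("Rusher").agg(
    carries=("Yards.Gained", "size"),
    yards=("Yards.Gained", "sum"),
    touchdowns=("Touchdown", "sum"),
)
totals["yards_per_carry"] = totals["yards"] / totals["carries"]
```

Adding a second grouping key, such as `DefensiveTeam`, would turn the same summary into the opponent-specific breakdown described above.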

#### Metrics for Play-Calling Optimization

In order to effectively inform offensive play-calling, play-call analysis must discover trends which identify, for a given game scenario, play calls which have statistically significantly higher probable yardage outcomes than other play calls. For example, given a scenario where an offense is down by 14 points in the 3rd quarter, on their own 35 yard line, an effective play-call analysis would be one that identified that a run play would produce statistically significantly more yardage than a passing play.
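One possible way to compare run and pass yardage within a scenario is Welch's two-sample t-statistic, which does not assume equal variances. The report does not name a specific test, so this choice is an assumption, as are the `welch_t` helper and the yardage samples below.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-statistic and approximate degrees of freedom for two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof

# Hypothetical yards gained in one game scenario
run_yards = [3, 5, 4, 6, 2, 5]
pass_yards = [0, 12, 0, 7, 0, 15]
t_stat, dof = welch_t(run_yards, pass_yards)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom to decide whether the difference is significant.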

Play-calling optimization could also be effective in a generalized scenario. For example, an effective analysis may reveal that offenses have the most success with running up the middle of the offensive line when near the goal line, but have more success with runs to the outside when nearer to the middle of the field. 
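A generalized comparison like this could be sketched by binning field position and averaging yards per run location. The rows are made up, and the `RunLocation` values (`"left"`, `"middle"`, `"right"`) and the bin edges are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy rushing plays (hypothetical values)
runs = pd.DataFrame({
    "yrdline100":   [2, 3, 1, 4, 45, 50, 55, 48],
    "RunLocation":  ["middle", "middle", "left", "right",
                     "left", "right", "middle", "left"],
    "Yards.Gained": [2, 1, 0, -1, 8, 6, 3, 9],
})

# Bin the absolute yard line into coarse field zones
runs["zone"] = pd.cut(
    runs["yrdline100"],
    bins=[0, 10, 40, 60, 100],
    labels=["0-10", "11-40", "41-60", "61-100"],
    include_lowest=True,
).astype(str)

# Average yards gained for each (zone, run location) pair
avg_yards = runs.groupby(["zone", "RunLocation"])["Yards.Gained"].mean()
```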

## Data Understanding

* [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. 

### Data Attributes

The following are descriptions of the data attributes from the play-by-play data which will be considered in the analysis of the dataset.

* **GameID** (*nominal*): A unique integer which identifies each game played 
* **Drive** (*ordinal*): The number of the drive during a game when the play occurred (indexed at one, so the first drive of the game has Drive 1 and the nth drive has Drive n)
* **qtr** (*ordinal*): The quarter of the game when the play occurred
* **down** (*ordinal*): The down when the play occurred (1st, 2nd, 3rd, or 4th)
* **TimeSecs** (*interval*): The remaining game time, in seconds, when the play occurred
* **PlayTimeDiff**: Description
* **SideofField** (*nominal*): What side of the field the play started on (the 2-or-3 character code for the team whose defensive end zone is nearest to the ball at the start of the play)
* **yrdln** (*continuous*): The yard-line on the field where the play started (from 0-50)
* **yrdline100** (*continuous*): The absolute yard-line on the field where the play started (from 0 to 100, where 0 is the defensive end zone and 100 is the offensive end zone of the team with the ball)
* **ydstogo** (*continuous*): The number of yards from the line of scrimmage to the first-down line
* **ydsnet** (*continuous*): The number of yards from the beginning of the drive to the current line of scrimmage
* **GoalToGo** (*nominal*): A binary attribute whose value is 1 if there is no first down line (the end-zone is the first down line) or 0 if there is a normal first down line
* **FirstDown** (*nominal*): A binary attribute whose value is 1 if a first down was gained on the play, or 0 if no first down occurred
* **posteam** (*nominal*): A 2-or-3 character code representing the team on offense
* **DefensiveTeam** (*nominal*): A 2-or-3 character code representing the team on defense
* **desc** (*nominal*): A plain-English text description of the play
* **Yards.Gained** (*continuous*): The number of yards gained on the play
* **sp** (*nominal*): A binary attribute whose value is 1 if the play was a scoring play, or 0 if the play was not a scoring play
* **Touchdown** (*nominal*): A binary attribute whose value is 1 if a touchdown was scored on the play, or 0 if a touchdown was not scored on the play
* **ExPointResult** (*nominal*): A binary attribute whose value is 1 if an extra point was scored on the play, or 0 if an extra point was not scored on the play
* **TwoPointConv**: Description
* **DefTwoPoint**: Description
* **Safety**: Description
* **PlayType**: Description
* **Passer**: Description
* **PassOutcome**: Description
* **PassLength**: Description
* **PassLocation**: Description
* **InterceptionThrown**: Description
* **Interceptor**: Description
* **Rusher**: Description
* **RunLocation**: Description
* **RunGap**: Description
* **Receiver**: Description
* **Reception**: Description
* **ReturnResult**: Description
* **Returner**: Description
* **Tackler1**: Description
* **Tackler2**: Description
* **FieldGoalResult**: Description
* **FieldGoalDistance**: Description
* **Fumble**: Description
* **RecFumbTeam**: Description
* **RecFumbPlayer**: Description
* **Sack**: Description
* **Challenge.Replay**: Description
* **ChalReplayResult**: Description
* **Accepted.Penalty**: Description
* **PenalizedTeam**: Description
* **PenaltyType**: Description
* **PenalizedPlayer**: Description
* **Penalty.Yards**: Description
* **ScoreDiff**: Description
* **AbsScoreDiff**: Description

* [15 points] Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods (elimination or imputation).

### Data Quality

#### Loading the Data

Before data quality can be assessed, the data must be loaded into memory and extraneous attributes must be removed.

In [62]:
#Libraries used for data analysis
import pandas as pd
import numpy as np

df = pd.read_csv('data/data.csv') # read in the csv file

#List of attributes which aren't going to be used for analysis
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder', 
                     'PosTeamScore', 'PassAttempt', 'RushAttempt', 
                     'DefTeamScore', 'Season', 'PlayAttempted']

#Drop the columns we don't want (columns not present in df are skipped)
df = df.drop(columns=columns_to_delete, errors='ignore')



#### Missing Data
Missing data needs to be identified and either removed or imputed.
For many columns, data is intentionally missing (for example, the "Interceptor" column is N/A when no interception was thrown).
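Intentional gaps can be separated from genuine ones by checking a column against the attribute that explains its absence. The sketch below uses made-up rows and a hypothetical player name; it assumes `InterceptionThrown` is a 0/1 flag, which the attribute list does not yet confirm.

```python
import numpy as np
import pandas as pd

# Toy rows: the last one has an interception but no recorded interceptor
plays = pd.DataFrame({
    "InterceptionThrown": [0, 1, 0, 1],
    "Interceptor":        [np.nan, "J.Doe", np.nan, np.nan],
})

# Missing by design: no interception was thrown on the play
expected_missing = plays["InterceptionThrown"] == 0
actually_missing = plays["Interceptor"].isna()

# Rows that are missing even though an interception was thrown
unexpected = plays[actually_missing & ~expected_missing]
```

Only the rows in `unexpected` would be candidates for removal or imputation.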

In order to help identify missing data, the attributes will be labeled as continuous, ordinal, binary, or categorical, and each scale of data will be imputed on its own.

In [63]:
#Defining list of column names of each of the scales of variables being used.
#Interval and Ratio features are grouped together, and binary features are separated from other ordinal features
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
                       'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
                       'ScoreDiff', 'AbsScoreDiff']
ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
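#Everything that is not continuous, ordinal, or binary is treated as categorical
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features).difference(binary_features)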

##### Missing Data for Continuous Features

First, the continuous features will be examined for missing data.

In [72]:
#Replace missing values with "missing"
#This will change lots of things to "objects." They will be switched back later.
df = df.replace(to_replace=np.nan, value = "missing")

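| | TimeSecs | PlayTimeDiff | yrdln | yrdline100 | ydstogo | ydsnet | Yards.Gained | Penalty.Yards | ScoreDiff | AbsScoreDiff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3600 | 0 | 35 | 35 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3600 | 0 | 20 | 80 | 10 | 18 | 18 | 0 | 0 | 0 |
| 2 | 3561 | 39 | 38 | 62 | 10 | 31 | 9 | 0 | 0 | 0 |
| 3 | 3544 | 17 | 47 | 53 | 1 | 31 | 4 | 0 | 0 | 0 |
| 4 | 3506 | 38 | 49 | 49 | 10 | 45 | 14 | 0 | 0 | 0 |
| 5 | 3462 | 44 | 35 | 35 | 10 | 56 | 11 | 0 | 0 | 0 |
| 6 | 3425 | 37 | 24 | 24 | 10 | 48 | -8 | 0 | 0 | 0 |
| 7 | 3380 | 45 | 32 | 32 | 18 | 54 | 4 | 10 | 0 | 0 |
| 8 | 3353 | 27 | 42 | 42 | 28 | 54 | 6 | 0 | 0 | 0 |
| 9 | 3328 | 25 | 36 | 36 | 22 | 54 | 10 | 0 | 0 | 0 |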
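A quick way to quantify missing values in the continuous features is to count `NaN` entries per column. This is a sketch on a made-up frame with only three of the continuous columns; note that the count has to run while missing values are still `NaN`, since a string placeholder such as `"missing"` is not detected by `isna()`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries (hypothetical values)
toy = pd.DataFrame({
    "TimeSecs":      [3600, 3561, np.nan, 3506],
    "ydstogo":       [0, 10, 10, 1],
    "Penalty.Yards": [0, np.nan, np.nan, 0],
})
continuous_features = ["TimeSecs", "ydstogo", "Penalty.Yards"]

missing_counts = toy[continuous_features].isna().sum()   # NaN count per column
missing_share = toy[continuous_features].isna().mean()   # fraction missing per column
```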


Additionally, data columns should be coerced to the correct underlying data type for their scales. 

In [2]:


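#Coercing the data columns to the correct types
#Non-numeric placeholders (e.g. "missing") are coerced back to NaN first
numeric_features = continuous_features + ordinal_features + binary_features
df[numeric_features] = df[numeric_features].apply(pd.to_numeric, errors='coerce')
df[continuous_features] = df[continuous_features].astype(np.float64)
#Nullable integer dtypes can hold missing values, unlike np.int64 and np.int8
df[ordinal_features] = df[ordinal_features].astype('Int64')
df[binary_features] = df[binary_features].astype('Int8')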
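NumPy integer dtypes cannot represent missing values, so integer coercion fails for any column that still contains `NaN`. The sketch below reproduces the problem on a made-up `down` column and shows one possible workaround, assuming a pandas version (0.24 or later) that provides the nullable `"Int64"` dtype.

```python
import numpy as np
import pandas as pd

# Toy "down" column with one missing value (hypothetical data)
down = pd.Series([1.0, 2.0, np.nan, 4.0])

# Casting to a NumPy integer dtype fails because NaN has no integer form
try:
    down.astype(np.int64)
    numpy_cast_ok = True
except (ValueError, TypeError):
    numpy_cast_ok = False

# The nullable integer dtype keeps the gap as <NA> instead of raising
nullable_down = down.astype("Int64")
```

Imputing or dropping the missing rows before casting is the alternative if plain NumPy integers are required downstream.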

## Data Visualization

* [10 points] Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful or potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are  meaningful for the attribute. 

* [15 points] Visualize relationships between interesting attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships. Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate.

* [15 points] Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

* [5 points] Are there other features that could be added to the data or created from existing features?  Which ones? 