# NFL Defense Data Nunnelee Notebook

This competition uses NFL’s Next Gen Stats data, which includes the position and speed of every player on the field during each play. We'll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays. There are several different directions for participants to utilize —which may require levels of football savvy, data aptitude, and creativity. As examples:

* What are coverage schemes (man, zone, etc) that the defense employs? What coverage options tend to be better performing?
* Which players are the best at closely tracking receivers as they try to get open?
* Which players are the best at closing on receivers when the ball is in the air?
* Which players are the best at defending pass plays when the ball arrives?
* Is there any way to use player tracking data to predict whether or not certain penalties – for example, defensive pass interference – will be called?
* Who are the NFL’s best players against the pass?
* How does a defense react to certain types of offensive plays?
* Is there anything about a player – for example, their height, weight, experience, speed, or position – that can be used to predict their performance on defense?
* What does data tell us about defending the pass play?

# Evaulation
The challenge is to generate actionable, practical, and novel insights from player tracking data that corresponds to defensive backs. Suggestions made here represent some of the approaches that football coaches are currently thinking of, but there undoubtedly several others.

An entry to the competition consists of a Notebook submission that is evaluated on the following five components, where 0 is the low score and 10 is the high score.

Note: All notebooks submitted must be made public on or before the submission deadline to be eligible.

Open Competition: The first aim takes on what an NFL defense does once a quarterback drops back to pass. This includes coverage schemes (typically man versus zone), how players (often termed “secondary” defenders) disrupt and prevent the offense from completing passes, and how, once the ball is in the air, the defense works to ensure that a pass falls incomplete.

## Big Data Bowl 2021 scoring sheet
Submissions will be judged by the NFL based on how well they address:

Innovation:

Are the proposed findings actionable?
Is this a way of looking at tracking data that is novel?
Is this project creative?
Accuracy:

Is the work correct?
Are claims backed up by data?
Are the statistical models appropriate given the data?
Relevance:

Would NFL teams (or the league office) be able to use these results on a week-to-week basis?
Does the analysis account for variables that make football data complex?
Clarity:

Evaluate the writing with respect to how clear the writer(s) make findings.
Data visualization/tables:

Are the charts and tables provided accessible, interesting, visually appealing, and accurate?

Notebooks should consist of no more than 2,000 words and no more than 7 tables/figures. Submissions will not be penalized for any number of words or figures under this limit. Participants are encouraged to show statistical code if it helps readers better understand their analyses; most, if not all code, however, should be hidden in the Appendix.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nfl-big-data-bowl-2021/week4.csv
/kaggle/input/nfl-big-data-bowl-2021/week6.csv
/kaggle/input/nfl-big-data-bowl-2021/week5.csv
/kaggle/input/nfl-big-data-bowl-2021/week17.csv
/kaggle/input/nfl-big-data-bowl-2021/week9.csv
/kaggle/input/nfl-big-data-bowl-2021/week2.csv
/kaggle/input/nfl-big-data-bowl-2021/week11.csv
/kaggle/input/nfl-big-data-bowl-2021/week1.csv
/kaggle/input/nfl-big-data-bowl-2021/week3.csv
/kaggle/input/nfl-big-data-bowl-2021/games.csv
/kaggle/input/nfl-big-data-bowl-2021/week13.csv
/kaggle/input/nfl-big-data-bowl-2021/week10.csv
/kaggle/input/nfl-big-data-bowl-2021/week16.csv
/kaggle/input/nfl-big-data-bowl-2021/players.csv
/kaggle/input/nfl-big-data-bowl-2021/week15.csv
/kaggle/input/nfl-big-data-bowl-2021/week8.csv
/kaggle/input/nfl-big-data-bowl-2021/plays.csv
/kaggle/input/nfl-big-data-bowl-2021/week12.csv
/kaggle/input/nfl-big-data-bowl-2021/week14.csv
/kaggle/input/nfl-big-data-bowl-2021/week7.csv


# Preparing the tools
We're going to use Matplotlib, as well as Numpy and Pandas

In [2]:
# Import all the tools we need

# Regular EDA (exploratory data analysis) and plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline 

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve

# Load the data
There is a lot of data bases. Player information, plays, games and plays for weeks 1 - 17. 

Let's start by consolidating weeks 1-16. This will create a large database, but create more relevance. Week 17 will be our test data set.

In [3]:
df_wk1 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week1.csv")
df_wk2 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week2.csv")
df_wk3 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week3.csv")
df_wk4 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week4.csv")
df_wk5 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week5.csv")
df_wk6 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week6.csv")
df_wk7 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week7.csv")
df_wk8 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week8.csv")
df_wk9 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week9.csv")
df_wk10 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week10.csv")
df_wk11 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week11.csv")
df_wk12 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week12.csv")
df_wk13 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week13.csv")
df_wk14 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week14.csv")
df_wk15 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week15.csv")
df_wk16 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week16.csv")
df_test = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week17.csv")

In [4]:
# Let's add the databases together
df_wks = df_wk1 + df_wk2 + df_wk3 + df_wk4 + df_wk5 + df_wk6 + df_wk7 + df_wk8 + df_wk9 + df_wk10 + df_wk11 + df_wk12 + df_wk13 + df_wk14 + df_wk15 + df_wk16
df_wks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231793 entries, 0 to 1231792
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   time           932240 non-null  object 
 1   x              932240 non-null  float64
 2   y              932240 non-null  float64
 3   s              932240 non-null  float64
 4   a              932213 non-null  float64
 5   dis            932240 non-null  float64
 6   o              300591 non-null  float64
 7   dir            300593 non-null  float64
 8   event          932240 non-null  object 
 9   nflId          300679 non-null  float64
 10  displayName    932240 non-null  object 
 11  jerseyNumber   300679 non-null  float64
 12  position       300679 non-null  object 
 13  frameId        932240 non-null  float64
 14  team           932240 non-null  object 
 15  gameId         932240 non-null  float64
 16  playId         932240 non-null  float64
 17  playDirection  932240 non-n

In [5]:
df_wks.head().T

Unnamed: 0,0,1,2,3,4
time,2018-09-07T01:07:14.599Z2018-09-14T00:23:24.70...,2018-09-07T01:07:14.599Z2018-09-14T00:23:24.70...,2018-09-07T01:07:14.599Z2018-09-14T00:23:24.70...,2018-09-07T01:07:14.599Z2018-09-14T00:23:24.70...,2018-09-07T01:07:14.599Z2018-09-14T00:23:24.70...
x,920.41,926.87,930.06,935.2,913.94
y,439.25,410.94,410.62,358.89,429.44
s,9.29,5.13,5.17,3.87,4.85
a,4.57,7.5,6.95,4.34,4.18
dis,0.97,0.54,0.56,0.42,0.51
o,2537.03,2994.25,2967.54,3134.6,2872.6
dir,3172.14,2906.35,2519.04,3453.64,2905.3
event,NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNo...,NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNo...,NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNo...,NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNo...,NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNo...
nflId,1.09489e+07,2.6309e+07,3.62639e+07,4.0419e+07,4.05791e+07


## Let's see what the other data holds

In [6]:
df_games = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/games.csv")
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253 entries, 0 to 252
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   gameId           253 non-null    int64 
 1   gameDate         253 non-null    object
 2   gameTimeEastern  253 non-null    object
 3   homeTeamAbbr     253 non-null    object
 4   visitorTeamAbbr  253 non-null    object
 5   week             253 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 12.0+ KB


In [7]:
df_games.head().T

Unnamed: 0,0,1,2,3,4
gameId,2018090600,2018090901,2018090902,2018090903,2018090900
gameDate,09/06/2018,09/09/2018,09/09/2018,09/09/2018,09/09/2018
gameTimeEastern,20:20:00,13:00:00,13:00:00,13:00:00,13:00:00
homeTeamAbbr,PHI,CLE,IND,MIA,BAL
visitorTeamAbbr,ATL,PIT,CIN,TEN,BUF
week,1,1,1,1,1


In [8]:
df_players = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/players.csv")
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   nflId        1303 non-null   int64 
 1   height       1303 non-null   object
 2   weight       1303 non-null   int64 
 3   birthDate    1303 non-null   object
 4   collegeName  1303 non-null   object
 5   position     1303 non-null   object
 6   displayName  1303 non-null   object
dtypes: int64(2), object(5)
memory usage: 71.4+ KB


In [9]:
df_players.head().T

Unnamed: 0,0,1,2,3,4
nflId,2539334,2539653,2543850,2555162,2555255
height,72,70,69,73,75
weight,190,186,186,227,232
birthDate,1990-09-10,1988-11-01,1991-12-18,1994-11-04,1993-07-01
collegeName,Washington,Southeastern Louisiana,Purdue,Louisiana State,Minnesota
position,CB,CB,SS,MLB,OLB
displayName,Desmond Trufant,Robert Alford,Ricardo Allen,Deion Jones,De'Vondre Campbell


In [10]:
df_plays = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/plays.csv")
df_plays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19239 entries, 0 to 19238
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   gameId                  19239 non-null  int64  
 1   playId                  19239 non-null  int64  
 2   playDescription         19239 non-null  object 
 3   quarter                 19239 non-null  int64  
 4   down                    19239 non-null  int64  
 5   yardsToGo               19239 non-null  int64  
 6   possessionTeam          19239 non-null  object 
 7   playType                19239 non-null  object 
 8   yardlineSide            18985 non-null  object 
 9   yardlineNumber          19239 non-null  int64  
 10  offenseFormation        19098 non-null  object 
 11  personnelO              19210 non-null  object 
 12  defendersInTheBox       19177 non-null  float64
 13  numberOfPassRushers     18606 non-null  float64
 14  personnelD              19210 non-null

In [11]:
df_plays.head().T

Unnamed: 0,0,1,2,3,4
gameId,2018090600,2018090600,2018090600,2018090600,2018090600
playId,75,146,168,190,256
playDescription,(15:00) M.Ryan pass short right to J.Jones pus...,(13:10) M.Ryan pass incomplete short right to ...,(13:05) (Shotgun) M.Ryan pass incomplete short...,(13:01) (Shotgun) M.Ryan pass deep left to J.J...,(10:59) (Shotgun) M.Ryan pass incomplete short...
quarter,1,1,1,1,1
down,1,1,2,3,3
yardsToGo,15,10,10,10,1
possessionTeam,ATL,ATL,ATL,ATL,ATL
playType,play_type_pass,play_type_pass,play_type_pass,play_type_pass,play_type_pass
yardlineSide,ATL,PHI,PHI,PHI,PHI
yardlineNumber,20,39,39,39,1


# Data Analysis
The results per play is in the Plays data.

Let's break down the play results first. Succesful Completion (C) and Defensive Penalties (DPI, DH, ICT) = Positive for offense. All other results are positives for defense.

In [12]:
df_plays.tail().T

Unnamed: 0,19234,19235,19236,19237,19238
gameId,2018122200,2018122200,2018122201,2018122201,2018122201
playId,2300,3177,566,1719,2649
playDescription,(7:53) J.Johnson pass incomplete short left [K...,(6:53) (Shotgun) B.Gabbert pass incomplete sho...,(5:32) (Shotgun) P.Rivers pass deep right to K...,(1:08) P.Rivers pass incomplete deep middle to...,(7:16) (Shotgun) L.Jackson pass incomplete sho...
quarter,3,4,1,2,3
down,2,3,3,3,1
yardsToGo,5,7,4,1,10
possessionTeam,WAS,TEN,LAC,LAC,BAL
playType,play_type_unknown,play_type_unknown,play_type_unknown,play_type_unknown,play_type_unknown
yardlineSide,WAS,WAS,LAC,LAC,LAC
yardlineNumber,31,37,49,48,49


## Question: What is relevant? Narrow the focus.
### First, let's identify the relevant info:
* playResult: Was the offense succesful in completing a pass without an offensive penalty?
* passResult: Outcome of the passing play (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, text)
* And what worked for the defense?

### Second, let's assume this is the only relevant info in determining the results for our questions.
* down
* yardsToGo
* personnelIO
* personnelID
* defendersInTheBox
* penalties? Maybe, maybe not.

### Third, let's see if we can seperate the passResult Completions from all other passResults, and create a database with only the relevant information.

In [13]:
# Lets look at our current database
df_plays.head().T

Unnamed: 0,0,1,2,3,4
gameId,2018090600,2018090600,2018090600,2018090600,2018090600
playId,75,146,168,190,256
playDescription,(15:00) M.Ryan pass short right to J.Jones pus...,(13:10) M.Ryan pass incomplete short right to ...,(13:05) (Shotgun) M.Ryan pass incomplete short...,(13:01) (Shotgun) M.Ryan pass deep left to J.J...,(10:59) (Shotgun) M.Ryan pass incomplete short...
quarter,1,1,1,1,1
down,1,1,2,3,3
yardsToGo,15,10,10,10,1
possessionTeam,ATL,ATL,ATL,ATL,ATL
playType,play_type_pass,play_type_pass,play_type_pass,play_type_pass,play_type_pass
yardlineSide,ATL,PHI,PHI,PHI,PHI
yardlineNumber,20,39,39,39,1


In [22]:
# Let's drop the assumed irrelevant info
df_plays_rel = df_plays.drop(["gameId", "playId", "playDescription", "quarter","possessionTeam", "playType",
                             "yardlineSide","yardlineNumber", "offenseFormation", "numberOfPassRushers",
                             "typeDropback", "preSnapVisitorScore", "preSnapHomeScore", "gameClock", 
                             "absoluteYardlineNumber","penaltyCodes", "penaltyJerseyNumbers","offensePlayResult",
                             "epa"], axis=1)
df_plays_rel.head().T

Unnamed: 0,0,1,2,3,4
quarter,1,1,1,1,1
down,1,1,2,3,3
yardsToGo,15,10,10,10,1
possessionTeam,ATL,ATL,ATL,ATL,ATL
playType,play_type_pass,play_type_pass,play_type_pass,play_type_pass,play_type_pass
yardlineSide,ATL,PHI,PHI,PHI,PHI
yardlineNumber,20,39,39,39,1
offenseFormation,I_FORM,SINGLEBACK,SHOTGUN,SHOTGUN,SHOTGUN
personnelO,"2 RB, 1 TE, 2 WR","1 RB, 1 TE, 3 WR","2 RB, 1 TE, 2 WR","1 RB, 1 TE, 3 WR","2 RB, 3 TE, 0 WR"
defendersInTheBox,7,7,6,6,8
