# NFL Defense Data Nunnelee Notebook

This competition uses the NFL’s Next Gen Stats data, which includes the position and speed of every player on the field during each play. We'll work with player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measuring defensive performance on these plays. Participants can take several different directions, which may require varying levels of football savvy, data aptitude, and creativity. As examples:

* What are coverage schemes (man, zone, etc) that the defense employs? What coverage options tend to be better performing?
* Which players are the best at closely tracking receivers as they try to get open?
* Which players are the best at closing on receivers when the ball is in the air?
* Which players are the best at defending pass plays when the ball arrives?
* Is there any way to use player tracking data to predict whether or not certain penalties – for example, defensive pass interference – will be called?
* Who are the NFL’s best players against the pass?
* How does a defense react to certain types of offensive plays?
* Is there anything about a player – for example, their height, weight, experience, speed, or position – that can be used to predict their performance on defense?
* What does data tell us about defending the pass play?

# Evaluation
The challenge is to generate actionable, practical, and novel insights from player tracking data that corresponds to defensive backs. Suggestions made here represent some of the approaches that football coaches are currently thinking of, but there are undoubtedly several others.

An entry to the competition consists of a Notebook submission that is evaluated on the following five components, where 0 is the low score and 10 is the high score.

Note: All notebooks submitted must be made public on or before the submission deadline to be eligible.

Open Competition: The first aim takes on what an NFL defense does once a quarterback drops back to pass. This includes coverage schemes (typically man versus zone), how players (often termed “secondary” defenders) disrupt and prevent the offense from completing passes, and how, once the ball is in the air, the defense works to ensure that a pass falls incomplete.

## Big Data Bowl 2021 scoring sheet
Submissions will be judged by the NFL based on how well they address:

Innovation:

* Are the proposed findings actionable?
* Is this a way of looking at tracking data that is novel?
* Is this project creative?

Accuracy:

* Is the work correct?
* Are claims backed up by data?
* Are the statistical models appropriate given the data?

Relevance:

* Would NFL teams (or the league office) be able to use these results on a week-to-week basis?
* Does the analysis account for variables that make football data complex?

Clarity:

* Evaluate the writing with respect to how clear the writer(s) make findings.

Data visualization/tables:

* Are the charts and tables provided accessible, interesting, visually appealing, and accurate?

Notebooks should consist of no more than 2,000 words and no more than 7 tables/figures. Submissions will not be penalized for any number of words or figures under this limit. Participants are encouraged to show statistical code if it helps readers better understand their analyses; most, if not all code, however, should be hidden in the Appendix.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Preparing the tools
We're going to use Matplotlib and Seaborn for plotting, NumPy and pandas for data handling, and scikit-learn for modeling.

In [None]:
# Import all the tools we need

# Regular EDA (exploratory data analysis) and plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline 

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay

# Load the data
There are several data files: player information, games, plays, and tracking data for weeks 1-17.

Let's start by consolidating weeks 1-16. This creates one large DataFrame, but gives us more plays to learn from. Week 17 will be our test data set.

In [None]:
df_wk1 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week1.csv")
#df_wk2 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week2.csv")
#df_wk3 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week3.csv")
#df_wk4 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week4.csv")
#df_wk5 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week5.csv")
#df_wk6 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week6.csv")
#df_wk7 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week7.csv")
#df_wk8 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week8.csv")
#df_wk9 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week9.csv")
#df_wk10 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week10.csv")
#df_wk11 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week11.csv")
#df_wk12 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week12.csv")
#df_wk13 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week13.csv")
#df_wk14 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week14.csv")
#df_wk15 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week15.csv")
#df_wk16 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week16.csv")
#df_test = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week17.csv")

In [None]:
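# Let's stack the weekly DataFrames into one
# (pd.concat appends rows; `+` would add the values element-wise instead)
#df_wks = pd.concat([df_wk1, df_wk2, df_wk3, df_wk4, df_wk5, df_wk6, df_wk7, df_wk8,
#                    df_wk9, df_wk10, df_wk11, df_wk12, df_wk13, df_wk14, df_wk15, df_wk16],
#                   ignore_index=True)
#df_wks.info()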
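
One way to stack the weekly files is `pd.concat`, which appends rows; adding DataFrames with `+` would instead try to add their values cell by cell. Here is a minimal sketch under a few assumptions: `load_weeks` is a hypothetical helper (not part of the competition kit), the files are assumed to be named `week1.csv`, `week2.csv`, and so on as in the paths above, and the extra `week` column is our own addition so every row can be traced back to its file.

```python
from pathlib import Path

import pandas as pd


def load_weeks(data_dir, weeks):
    """Read weekN.csv for each N in `weeks` and stack the rows into one DataFrame."""
    frames = []
    for week in weeks:
        frame = pd.read_csv(Path(data_dir) / f"week{week}.csv")
        frame["week"] = week  # our own column: remember which file each row came from
        frames.append(frame)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Possible usage on Kaggle (weeks 1-16 for training, week 17 held out):
# df_wks = load_weeks("/kaggle/input/nfl-big-data-bowl-2021", range(1, 17))
# df_test = load_weeks("/kaggle/input/nfl-big-data-bowl-2021", [17])
```

Reading the files in a loop also means we only hold one list of frames instead of sixteen separate variables.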

In [None]:
df_wk1.head().T

## Let's see what the other data holds

In [None]:
df_games = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/games.csv")
df_games.info()

In [None]:
df_games.head().T

In [None]:
df_players = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/players.csv")
df_players.info()

In [None]:
df_players.head().T

In [None]:
df_plays = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/plays.csv")
df_plays.info()

In [None]:
df_plays.head().T

# Data Analysis
The result of each play is in the plays data.

Let's break down the play results first. A successful completion (C) or a defensive penalty (DPI, DH, ICT) is a positive for the offense. All other results are positives for the defense.
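
The rule above can be sketched as a boolean label. This is only an illustration: `offense_positive` and `DEFENSIVE_PENALTIES` are hypothetical names, and the code assumes that `penaltyCodes` stores one or more codes as text separated by `;`, which should be checked against the real data before relying on it.

```python
import pandas as pd

# Penalty codes treated as wins for the offense (taken from the rule above)
DEFENSIVE_PENALTIES = {"DPI", "DH", "ICT"}


def offense_positive(plays: pd.DataFrame) -> pd.Series:
    """True where the play counts as a positive for the offense."""
    completed = plays["passResult"].eq("C")
    # Assumption: several codes on one play are joined with ';'
    codes = plays["penaltyCodes"].fillna("").str.split(";")
    penalized = codes.apply(
        lambda play_codes: any(code.strip() in DEFENSIVE_PENALTIES for code in play_codes)
    )
    return completed | penalized


# Possible usage:
# df_plays["offensePositive"] = offense_positive(df_plays)
```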

In [None]:
df_plays.tail().T

## Question: What is relevant? Narrow the focus.
### First, let's identify the relevant info:
* playResult: Net yards gained by the offense, including penalty yardage. Was the offense successful?
* passResult: Outcome of the passing play (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, text)
* And what worked for the defense?

### Second, let's assume this is the only relevant info in determining the results for our questions.
* down
* yardsToGo
* personnelO
* personnelD
* defendersInTheBox
* penalties? Maybe, maybe not.

### Third, let's see if we can separate the passResult completions from all other passResults, and create a DataFrame with only the relevant information.

In [None]:
# Let's look at our current DataFrame
df_plays.head().T

In [None]:
# Let's drop the assumed irrelevant info
df_plays_rel = df_plays.drop(["gameId", "playId", "playDescription", "quarter","possessionTeam", "playType",
                             "yardlineSide","yardlineNumber", "offenseFormation", "numberOfPassRushers",
                             "typeDropback", "preSnapVisitorScore", "preSnapHomeScore", "gameClock", 
                             "absoluteYardlineNumber","penaltyCodes", "penaltyJerseyNumbers","offensePlayResult",
                             "epa"], axis=1)
df_plays_rel.head().T

## I need to separate completed passResults from the rest.
How do I do that?

In [None]:
pass_result_completed = df_plays_rel[df_plays_rel["passResult"] == "C"]
pass_result_completed.head().T

In [None]:
pass_result_completed.info()

## Cool. Let's look at the positive defensive data

In [None]:
pass_result_defense = df_plays_rel[df_plays_rel["passResult"] != "C"]
pass_result_defense.head().T

In [None]:
pass_result_defense.info()
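
The two filters can also be written as a single mask and its negation. The sketch below makes two choices of its own that are not in the cells above: it drops rows where `passResult` is missing (otherwise `!= "C"` quietly counts them as defensive wins), and it calls `.copy()` so later column assignments act on independent frames. `split_by_completion` is a hypothetical helper name.

```python
import pandas as pd


def split_by_completion(plays: pd.DataFrame):
    """Return (completed, not_completed) plays as two independent DataFrames."""
    # Our own choice: ignore plays with no recorded passResult
    known = plays.dropna(subset=["passResult"])
    is_complete = known["passResult"].eq("C")
    # .copy() avoids chained-assignment warnings when the frames are edited later
    return known[is_complete].copy(), known[~is_complete].copy()


# Possible usage:
# pass_result_completed, pass_result_defense = split_by_completion(df_plays_rel)
```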

# Now that we've separated our data, let's change our data into machine language. Yeah, that's right. Numbers!!!
We'll manipulate the offense first.

Then we'll negotiate the defense. Football jargon.

First, we'll eliminate the playResult, because we're only concerned with whether the pass was completed.

In [None]:
# Dropping the playResult
pass_result_comp = pass_result_completed.drop("playResult", axis=1)
pass_result_comp.info()

In [None]:
# Now for the defense data
pass_result_def = pass_result_defense.drop("playResult", axis=1)
pass_result_def.info()

## Let's change the data

Convert strings into categories

One way we can turn all of our data into numbers is by converting them into pandas categories.

In [None]:
# Find the columns which contain strings
for label, content in pass_result_comp.items():
    if pd.api.types.is_string_dtype(content):
        pass_result_comp[label] = content.astype("category").cat.as_ordered()
        
for label, content in pass_result_def.items():
    if pd.api.types.is_string_dtype(content):
        pass_result_def[label] = content.astype("category").cat.as_ordered()

In [None]:
pass_result_comp.info()

In [None]:
pass_result_def.info()

In [None]:
# Check the missing data
pass_result_comp.isnull().sum()/len(pass_result_comp)

In [None]:
pass_result_comp.isna().sum()

## Fill the missing value first
Fill numerical values first

In [None]:
# defendersInTheBox is missing 20 data points
# Let's fill the numeric rows with the median
for label, content in pass_result_comp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            pass_result_comp[label] = content.fillna(content.median())

In [None]:
# let's check if defendersInTheBox is filled
pass_result_comp.isna().sum()

# Now we'll fill and turn the categorical variables (personnelO and personnelD) into numbers

In [None]:
# Check for columns which aren't numeric, turn their categories into numbers, and add 1
# (pandas codes missing values as -1, so adding 1 makes them 0)
for label, content in pass_result_comp.items():
    if not pd.api.types.is_numeric_dtype(content):
        pass_result_comp[label] = pd.Categorical(content).codes+1
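
To see what the `+1` does, here is a small sketch on a made-up Series. The personnel strings are illustrative values, not rows taken from the data, and `lookup` is our own addition for mapping the numbers back to their labels.

```python
import pandas as pd

# Made-up personnel groupings, including one missing value
personnel = pd.Series(
    ["1 RB, 1 TE, 3 WR", "1 RB, 2 TE, 2 WR", None, "1 RB, 1 TE, 3 WR"]
)

categorical = pd.Categorical(personnel)

# Raw codes start at 0 and use -1 for missing values,
# so adding 1 moves missing values to 0 and real categories to 1, 2, ...
codes = categorical.codes + 1

# Keep a way back from the number to the original label
lookup = dict(enumerate(categorical.categories, start=1))

print(codes.tolist())  # [1, 2, 0, 1]
print(lookup)          # {1: '1 RB, 1 TE, 3 WR', 2: '1 RB, 2 TE, 2 WR'}
```

Keep in mind that these numbers only reflect alphabetical order of the labels, so a model may read an ordering into them that has no football meaning.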

In [None]:
pass_result_comp.info()

In [None]:
pass_result_comp.head().T

In [None]:
pass_result_comp.isna().sum()
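
So far only `pass_result_comp` has been filled and encoded. One way to give `pass_result_def` the same treatment is to wrap the two steps in a function. This is a sketch, with `fill_and_encode` as a hypothetical helper name; it mirrors the median fill and the `codes + 1` encoding used above and returns a new DataFrame instead of changing the input.

```python
import pandas as pd


def fill_and_encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Median-fill numeric columns and turn every other column into integer codes."""
    out = frame.copy()
    for label, content in frame.items():
        if pd.api.types.is_numeric_dtype(content):
            if content.isna().any():
                out[label] = content.fillna(content.median())
        else:
            # Missing values get code -1, so +1 maps them to 0
            out[label] = pd.Categorical(content).codes + 1
    return out


# Possible usage:
# pass_result_def = fill_and_encode(pass_result_def)
# pass_result_def.isna().sum()
```

Note that encoding the two frames separately can give the same label different numbers in each frame, so if they are ever compared or combined it is safer to encode them together.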