# 2020 NFL Big Data Bowl

![](https://operations.nfl.com/media/3606/big-data-bowl-new-logo_750.png?mode=max&width=200)

In this notebook I will attempt to provide a basic overview of the data given in the NFL Big Data Bowl Kaggle challenge. We will try to better understand each variable provided to us in the `train.csv` data file.

From the [competition overview](http://www.kaggle.com/c/nfl-big-data-bowl-2020/overview):

*In this competition, you will develop a model to predict how many yards a team will gain on given rushing plays as they happen. You'll be provided game, play, and player-level data, including the position and speed of players as provided in the NFL’s Next Gen Stats data. And the best part - you can see how your model performs from your living room, as the leaderboard will be updated week after week on the current season’s game data as it plays out.*

In [1]:
import pandas as pd
import numpy as np

# pd.set_option('max_columns', 100) # So we can see more columns

# Read in the training data
train = pd.read_csv('../input/train.csv', low_memory=False)

In [2]:
# https://stackoverflow.com/questions/30228069/how-to-display-the-value-of-the-bar-on-each-bar-with-pyplot-barh
def label_bars(ax, bars, text_format, **kwargs):
    """
    Attaches a label on every bar of a regular or horizontal bar chart
    """
    ys = [bar.get_y() for bar in bars]
    y_is_constant = all(y == ys[0] for y in ys)  # -> regular bar chart, since all bars start on the same y level (0)

    if y_is_constant:
        _label_bar(ax, bars, text_format, **kwargs)
    else:
        _label_barh(ax, bars, text_format, **kwargs)


def _label_bar(ax, bars, text_format, **kwargs):
    """
    Attach a text label to each bar displaying its y value
    """
    max_y_value = ax.get_ylim()[1]
    inside_distance = max_y_value * 0.05
    outside_distance = max_y_value * 0.01

    for bar in bars:
        text = text_format.format(bar.get_height())
        text_x = bar.get_x() + bar.get_width() / 2

        is_inside = bar.get_height() >= max_y_value * 0.15
        if is_inside:
            color = "white"
            text_y = bar.get_height() - inside_distance
        else:
            color = "black"
            text_y = bar.get_height() + outside_distance

        ax.text(text_x, text_y, text, ha='center', va='bottom', color=color, **kwargs)


def _label_barh(ax, bars, text_format, **kwargs):
    """
    Attach a text label to each horizontal bar displaying its x value (the bar width)
    Note: label always outside. otherwise it's too hard to control as numbers can be very long
    """
    max_x_value = ax.get_xlim()[1]
    distance = max_x_value * 0.0025

    for bar in bars:
        text = text_format.format(bar.get_width())

        text_x = bar.get_width() + distance
        text_y = bar.get_y() + bar.get_height() / 2

        ax.text(text_x, text_y, text, va='center', **kwargs)

## Data Description
- Each row represents a player at a given moment in time.
- Each of the 22 players participating in a given play has a row.

From the official description:
```
Each row in the file corresponds to a single player's involvement in a single play.
The dataset was intentionally joined (i.e. denormalized) to make the API simple.
All the columns are contained in one large dataframe which is grouped and provided by PlayId.
```
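
A quick sanity check of that layout is to count the rows per `PlayId`. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the competition data) so that it runs on its own; the same `groupby` should carry over to `train`.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: 2 plays x 22 players each.
toy = pd.DataFrame({
    'PlayId': [1] * 22 + [2] * 22,
    'NflId': list(range(22)) * 2,
})

# Count player rows per play; the layout described above implies 22 everywhere.
rows_per_play = toy.groupby('PlayId').size()
all_plays_have_22 = bool((rows_per_play == 22).all())

print(rows_per_play)
print('Every play has 22 rows:', all_plays_have_22)
```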

## Yards *The target we are trying to predict*
It's always smart to take a close look at the variable we are trying to predict.

In [3]:
train.groupby('PlayId').first()['Yards'] 

PlayId
20170907000118    8
20170907000139    3
20170907000189    5
20170907000345    2
20170907000395    7
                 ..
20191125003419    1
20191125003440    1
20191125003496    1
20191125003768    1
20191125003789    4
Name: Yards, Length: 31007, dtype: int64
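
Because `Yards` is repeated on every player row of a play, summary statistics are best computed on one row per play. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: 3 plays, 2 player rows each (real plays have 22).
toy = pd.DataFrame({
    'PlayId': [1, 1, 2, 2, 3, 3],
    'Yards':  [8, 8, 3, 3, -2, -2],
})

# Keep one row per play so that each play counts once in the statistics.
yards_per_play = toy.drop_duplicates('PlayId').set_index('PlayId')['Yards']

# A few summary statistics of the target.
summary = yards_per_play.agg(['mean', 'median', 'min', 'max'])
print(summary)
```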

## Yards gained by Down

In [4]:
# fig, axes = plt.subplots(4, 1, figsize=(15, 8), sharex=True)
# n = 0
# for i, d in train.groupby('Down'):
#     d['Yards'].plot(kind='hist',
#                     bins=30,
#                    color=color_pal[n],
#                    ax=axes[n],
#                    title=f'Yards Gained on down {i}')
#     n+=1

## Yards gained by Distance-to-Gain
We can see that there appears to be an increase in the average yards gained as the distance to gain increases. We can also see that as the distance increases, the distribution of `Yards` moves from roughly normal to bimodal. This could be because data are sparse for the extremely large distance-to-gain values.

In [5]:
# fig, ax = plt.subplots(figsize=(20, 5))
# sns.violinplot(x='Distance-to-Gain',
#                y='Yards',
#                data=train.rename(columns={'Distance':'Distance-to-Gain'}),
#                ax=ax)
# plt.ylim(-10, 20)
# plt.title('Yards vs Distance-to-Gain')
# plt.show()
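
One way to check both claims numerically (the rising average and the thin data at large distances) is to tabulate the mean of `Yards` next to the number of plays for each distance. This is only a sketch on a made-up frame; the values in `toy` are invented.

```python
import pandas as pd

# Hypothetical toy data: 4 plays, 2 player rows each.
toy = pd.DataFrame({
    'PlayId':   [1, 1, 2, 2, 3, 3, 4, 4],
    'Distance': [1, 1, 10, 10, 10, 10, 20, 20],
    'Yards':    [1, 1, 4, 4, 6, 6, 12, 12],
})

# One row per play, then the play count and mean yards for each distance.
plays = toy.drop_duplicates('PlayId')
by_distance = plays.groupby('Distance')['Yards'].agg(['count', 'mean'])
print(by_distance)
```

A small `count` next to a large `mean` is a hint that the average for that distance rests on very few plays.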

## GameId and PlayId - `a unique game identifier`
Below we count the unique games and plays, and then look at the number of plays provided for each `GameId`.
- 688 Games
- 31007 Plays

In [6]:
print('Unique game data provided: {}'.format(train['GameId'].nunique()))
print('Unique play data provided: {}'.format(train['PlayId'].nunique()))

Unique game data provided: 688
Unique play data provided: 31007


(Thanks @arnabbiswas1 for pointing out an error in this plot that I've now fixed.)

In [7]:
train.groupby('GameId')['PlayId'] \
    .nunique()

GameId
2017090700    52
2017091000    44
2017091001    38
2017091002    63
2017091003    33
              ..
2019112408    50
2019112409    50
2019112410    45
2019112411    42
2019112500    45
Name: PlayId, Length: 688, dtype: int64

## Down and Distance
- We can see the majority of running plays occur on first down. This is not unexpected as running plays are much more common in earlier downs.

In [8]:
# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# sns.boxplot(data=train.groupby('PlayId').first()[['Distance','Down']],
#             x='Down', y='Distance', ax=ax1)
# ax1.set_title('Distance-to-Gain by Down')
# sns.boxplot(data=train.groupby('PlayId').first()[['Yards','Down']],
#             x='Down', y='Yards', ax=ax2)
# ax2.set_title('Yards Gained by Down')
# plt.show()
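
The share of rushing plays on each down can be computed directly with `value_counts(normalize=True)` on one row per play. The frame below is a made-up example, not the competition data.

```python
import pandas as pd

# Hypothetical toy data: 4 plays, 2 player rows each.
toy = pd.DataFrame({
    'PlayId': [1, 1, 2, 2, 3, 3, 4, 4],
    'Down':   [1, 1, 1, 1, 2, 2, 3, 3],
})

# Fraction of plays on each down, counting every play once.
down_share = (toy.drop_duplicates('PlayId')['Down']
              .value_counts(normalize=True)
              .sort_index())
print(down_share)
```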

## Distance to gain is commonly 10 yards

In [9]:
train['Distance']

0         2
1         2
2         2
3         2
4         2
         ..
682149    9
682150    9
682151    9
682152    9
682153    9
Name: Distance, Length: 682154, dtype: int64
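
Printing the raw column does not show which distance is most common. A minimal sketch of how that could be checked, again on an invented `toy` frame with one row per play:

```python
import pandas as pd

# Hypothetical toy data, already reduced to one row per play.
toy = pd.DataFrame({
    'PlayId':   [1, 2, 3, 4, 5],
    'Distance': [10, 10, 10, 2, 5],
})

# Number of plays per distance-to-gain, most frequent first.
distance_counts = toy['Distance'].value_counts()
most_common_distance = distance_counts.idxmax()

print(distance_counts)
print('Most common distance to gain:', most_common_distance)
```

On the real data, apply `drop_duplicates('PlayId')` first so that each play is counted once rather than 22 times.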

## Speed, Acceleration, and Distance
We are provided with the speed, acceleration, and distance each player has traveled since the previous point.

In [10]:
# fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 4))
# train['S'].plot(kind='hist', ax=ax1,
#                 title='Distribution of Speed',
#                 bins=20,
#                 color=color_pal[0])
# train['A'].plot(kind='hist',
#                 ax=ax2,
#                 title='Distribution of Acceleration',
#                 bins=20,
#                 color=color_pal[1])
# train['Dis'].plot(kind='hist',
#                   ax=ax3,
#                   title='Distribution of Distance',
#                   bins=20,
#                   color=color_pal[2])
# plt.show()

In [11]:

train.query("NflIdRusher == NflId")['S'] 
train.query("NflIdRusher == NflId")['A'] 
train.query("NflIdRusher == NflId")['Dis'] 

18        0.38
40        0.34
62        0.60
84        0.46
98        0.44
          ... 
682052    0.42
682074    0.43
682096    0.43
682118    0.47
682140    0.52
Name: Dis, Length: 31007, dtype: float64

## Do Speed, Acceleration, and Distance of the running back have a relationship with yards gained?
Let's look and see if the speed of the running back correlates with the yardage gained. The color shows the different defensive personnel groupings in each run.

It's not immediately clear if these features have a meaningful relationship with the yards gained.
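
A simple numeric complement to the plots is the correlation of each rusher feature with `Yards`. The sketch below assumes the rusher rows are selected as elsewhere in this notebook (`NflIdRusher == NflId`); the `toy` values are invented, and `S` is deliberately built as a linear function of `Yards` so the result is easy to verify.

```python
import pandas as pd

# Hypothetical toy data: 4 plays, a rusher row and one other player per play.
toy = pd.DataFrame({
    'NflId':       [10, 11, 20, 21, 30, 31, 40, 41],
    'NflIdRusher': [10, 10, 20, 20, 30, 30, 40, 40],
    'S':     [2.0, 1.0, 4.0, 1.5, 3.0, 0.5, 5.0, 2.5],
    'A':     [1.0, 0.3, 0.5, 0.9, 2.0, 0.1, 1.5, 0.7],
    'Dis':   [0.2, 0.1, 0.5, 0.2, 0.3, 0.1, 0.4, 0.3],
    'Yards': [1, 1, 5, 5, 3, 3, 7, 7],
})

# Keep only the ball carrier's row for each play.
rusher = toy.query('NflIdRusher == NflId')

# Pearson correlation of each feature with the target.
corr_with_yards = rusher[['S', 'A', 'Dis']].corrwith(rusher['Yards'])
print(corr_with_yards)
```

Pearson correlation only captures linear relationships, so a value near zero on the real data would not rule out a more complicated dependence.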

# OffensePersonnel / DefensePersonnel
Let's see what the top personnel groupings are for the offense and defense.

In [12]:
train.groupby('PlayId') \
    .first() \
    .groupby('OffensePersonnel') \
    .count()['GameId'] \
    .sort_values(ascending=False) \
    .head(15) \
    .sort_values() 
train.groupby('PlayId') \
    .first() \
    .groupby('DefensePersonnel') \
    .count()['GameId'] \
    .sort_values(ascending=False) \
    .head(15) \
    .sort_values()

DefensePersonnel
6 DL, 3 LB, 2 DB      64
3 DL, 5 LB, 3 DB      76
5 DL, 4 LB, 2 DB      76
1 DL, 4 LB, 6 DB     100
5 DL, 3 LB, 3 DB     146
3 DL, 2 LB, 6 DB     239
4 DL, 4 LB, 3 DB     295
5 DL, 2 LB, 4 DB     322
4 DL, 1 LB, 6 DB     475
2 DL, 3 LB, 6 DB     788
3 DL, 3 LB, 5 DB    3406
2 DL, 4 LB, 5 DB    3699
3 DL, 4 LB, 4 DB    5019
4 DL, 3 LB, 4 DB    7875
4 DL, 2 LB, 5 DB    8054
Name: GameId, dtype: int64

## Defensive Personnel's impact on yards gained
We can see that there are about 5 common defensive packages that are used. How does the way the defense is aligned correlate with the offensive production (yards gained)?

What stands out at first glance is that the `4DL - 4LB - 3DB` Defense shows a different distribution in yards gained.

Per wikipedia: https://en.wikipedia.org/wiki/4%E2%80%934_defense

*Originally seen as a passing defense against the spread, modern versions of the 4-4 are attacking defenses stocked with multiple blitz packages that can easily be concealed and altered.*

In [13]:
top_10_defenses = train.groupby('DefensePersonnel')['GameId'] \
    .count() \
    .sort_values(ascending=False).index[:10] \
    .tolist()

In [14]:
train_play = train.groupby('PlayId').first()
train_top10_def = train_play.loc[train_play['DefensePersonnel'].isin(top_10_defenses)]
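
To compare the packages numerically, the per-play frame can be grouped by `DefensePersonnel`. This is a sketch on invented data; the package labels follow the format seen in the output above, but the yardage values are made up.

```python
import pandas as pd

# Hypothetical toy data: 4 plays, 2 player rows each.
toy = pd.DataFrame({
    'PlayId': [1, 1, 2, 2, 3, 3, 4, 4],
    'DefensePersonnel': ['4 DL, 3 LB, 4 DB'] * 4 + ['4 DL, 4 LB, 3 DB'] * 4,
    'Yards': [4, 4, 6, 6, 1, 1, 3, 3],
})

# One row per play, then play count and median yards per defensive package.
plays = toy.groupby('PlayId').first()
yards_by_defense = (plays.groupby('DefensePersonnel')['Yards']
                    .agg(['count', 'median'])
                    .sort_values('median'))
print(yards_by_defense)
```

The median is used here because a few long runs can pull the mean upward.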

## Running strategies change as the game goes on...

How are the yards gained impacted by the time in the game? Teams often run the ball at the end of the game when they are ahead, in order to run out the game clock and win. In these situations the run is expected more and defenses can scheme against it.

It doesn't look like the quarter has a huge impact on the running yards gained.

In [15]:
# fig, ax = plt.subplots(figsize=(15, 5))
# ax.set_ylim(-10, 60)
# ax.set_title('Yards vs Quarter')
# sns.boxenplot(x='Quarter',
#             y='Yards',
#             data=train.sample(5000),
#             ax=ax)
# plt.show()

# Defenders In The "Box"

The number of defenders in the box is an important part of stopping the running game. Typically defenses will add more players to this area of the field when they really want to stop a run. This comes at a cost: wide receivers are left less covered.

![](https://i0.wp.com/www.footballzebras.com/wp-content/uploads/2019/02/Slide1.jpg?resize=596%2C317)

Wow! This plot shows a big difference in yards gained when looking at the number of defenders in the box. If you've got 8+ defenders in the box you're looking to stop the run big time! And you can see the average rush yardage is lower. Conversely, having 3 men in the box (maybe because the defense is in a prevent formation for a long distance to gain) allows for an average gain of about 10 yards!

In [16]:
# fig, ax = plt.subplots(figsize=(15, 5))
# ax.set_ylim(-10, 60)
# sns.boxenplot(x='DefendersInTheBox',
#                y='Yards',
#                data=train.query('DefendersInTheBox > 2'),
#                ax=ax)
# plt.title('Yards vs Defenders in the Box')
# plt.show()
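
The averages behind the plot can be tabulated together with the number of plays, which helps to judge how much weight rare box counts deserve. A minimal sketch on invented data; `min_plays` is a hypothetical threshold, not something taken from the competition.

```python
import pandas as pd

# Hypothetical toy data, already reduced to one row per play.
toy = pd.DataFrame({
    'PlayId':            [1, 2, 3, 4, 5],
    'DefendersInTheBox': [6, 6, 8, 8, 3],
    'Yards':             [5, 3, 2, 2, 10],
})

# Play count and mean yards for each number of defenders in the box.
box_summary = toy.groupby('DefendersInTheBox')['Yards'].agg(['count', 'mean'])

# Keep only box counts backed by at least `min_plays` plays (assumed cutoff).
min_plays = 2
reliable = box_summary[box_summary['count'] >= min_plays]

print(box_summary)
print(reliable)
```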

# Distribution of Yards gained vs Defenders in the Box
We can clearly see some variation in yards gained depending on the number of defenders in the box. 

In [17]:
# fig, axes = plt.subplots(3, 2, constrained_layout=True, figsize=(15 , 10))
# #fig.tight_layout()
# ax_idx = 0
# ax_idx2 = 0
# for i in range(4, 10):
#     this_ax = axes[ax_idx2][ax_idx]
#     #print(ax_idx, ax_idx2)
#     sns.distplot(train.query('DefendersInTheBox == @i')['Yards'],
#                 ax=this_ax,
#                 color=color_pal[ax_idx2])
#     this_ax.set_title(f'{i} Defenders in the box')
#     this_ax.set_xlim(-10, 20)
#     ax_idx += 1
#     if ax_idx == 2:
#         ax_idx = 0
#         ax_idx2 += 1
# plt.show()

# What Ball Carriers stand out?
> Let's now look at ball carriers (the players who are typically handed the ball) and see if any individual players stand out. We will only look at players with more than 100 plays. Then we can plot the top and bottom 10 players.

In [18]:
train.query("NflIdRusher == NflId") \
    .groupby('DisplayName')['Yards'] \
    .agg(['count','mean']) \
    .query('count > 100') \
    .sort_values('mean', ascending=True) \
    .tail(10)['mean'] 
train.query("NflIdRusher == NflId") \
    .groupby('DisplayName')['Yards'] \
    .agg(['count','mean']) \
    .query('count > 100') \
    .sort_values('mean', ascending=True) \
    .head(10)['mean']

DisplayName
DeAndre Washington    3.106061
Elijah McGuire        3.260606
David Montgomery      3.326923
Chris Ivory           3.376106
Jonathan Stewart      3.403941
Alfred Blue           3.410138
Ameer Abdullah        3.439306
Samaje Perine         3.452514
Mike Gillislee        3.550000
Kerwynn Williams      3.584746
Name: mean, dtype: float64

In [19]:
# Create the DL-LB combos
train['DL_LB'] = train['DefensePersonnel'] \
    .str[:10] \
    .str.replace(' DL, ','-') \
    .str.replace(' LB','') # Clean up and convert to DL-LB combo
top_5_dl_lb_combos = train.groupby('DL_LB').count()['GameId'] \
    .sort_values() \
    .tail(10).index.tolist()
ax = train.loc[train['DL_LB'].isin(top_5_dl_lb_combos)] \
    .groupby('DL_LB')
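
The slicing approach above relies on the personnel string having a fixed width. As an alternative sketch (not the method used in this notebook), a regular expression with `str.extract` pulls the two counts out by pattern instead; the sample strings follow the format shown in the `DefensePersonnel` output earlier.

```python
import pandas as pd

# Sample personnel strings in the same format as the DefensePersonnel column.
personnel = pd.Series(['4 DL, 3 LB, 4 DB',
                       '3 DL, 4 LB, 4 DB',
                       '4 DL, 2 LB, 5 DB'])

# Capture the DL and LB counts as named groups (both come back as strings).
parts = personnel.str.extract(r'(?P<DL>\d+) DL, (?P<LB>\d+) LB')

# Join them into the same "DL-LB" label used above.
dl_lb = parts['DL'] + '-' + parts['LB']
print(dl_lb.tolist())
```

Rows that do not match the pattern come back as `NaN`, which makes unexpected formats easy to spot.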

## Let's plot some defensive schemes
Using some of the additional code created by the great SRK (@sudalairajkumar) in this kernel: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-nfl

Note that we are given the player positions at the time the ball is handed off, so the player formation isn't as clean as in the diagrams above.

In [20]:
def create_football_field(linenumbers=True,
                          endzones=True,
                          highlight_line=False,
                          highlight_line_number=50,
                          highlighted_name='Line of Scrimmage',
                          fifty_is_los=False,
                          figsize=(12*2, 6.33*2)):
    """
    Function that plots the football field for viewing plays.
    Allows for showing or hiding endzones.
    """
    # rect = patches.Rectangle((0, 0), 120, 53.3, linewidth=0.1,
    #                          edgecolor='r', facecolor='darkgreen', zorder=0)

    # fig, ax = plt.subplots(1, figsize=figsize)
    # ax.add_patch(rect)

    # plt.plot([10, 10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60, 70, 70, 80,
    #           80, 90, 90, 100, 100, 110, 110, 120, 0, 0, 120, 120],
    #          [0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3,
    #           53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 53.3, 0, 0, 53.3],
    #          color='white')
    # if fifty_is_los:
    #     plt.plot([60, 60], [0, 53.3], color='gold')
    #     plt.text(62, 50, '<- Player Yardline at Snap', color='gold')
    # # Endzones
    # if endzones:
    #     ez1 = patches.Rectangle((0, 0), 10, 53.3,
    #                             linewidth=0.1,
    #                             edgecolor='r',
    #                             facecolor='blue',
    #                             alpha=0.2,
    #                             zorder=0)
    #     ez2 = patches.Rectangle((110, 0), 120, 53.3,
    #                             linewidth=0.1,
    #                             edgecolor='r',
    #                             facecolor='blue',
    #                             alpha=0.2,
    #                             zorder=0)
    #     ax.add_patch(ez1)
    #     ax.add_patch(ez2)
    # plt.xlim(0, 120)
    # plt.ylim(-5, 58.3)
    # plt.axis('off')
    if linenumbers:
        for x in range(20, 110, 10):
            numb = x
            if x > 50:
                numb = 120 - x
            # plt.text(x, 5, str(numb - 10),
            #          horizontalalignment='center',
            #          fontsize=20,  # fontname='Arial',
            #          color='white')
            # plt.text(x - 0.95, 53.3 - 5, str(numb - 10),
            #          horizontalalignment='center',
            #          fontsize=20,  # fontname='Arial',
            #          color='white', rotation=180)
    if endzones:
        hash_range = range(11, 110)
    else:
        hash_range = range(1, 120)

    # for x in hash_range:
    #     ax.plot([x, x], [0.4, 0.7], color='white')
    #     ax.plot([x, x], [53.0, 52.5], color='white')
    #     ax.plot([x, x], [22.91, 23.57], color='white')
    #     ax.plot([x, x], [29.73, 30.39], color='white')

    # if highlight_line:
    #     hl = highlight_line_number + 10
    #     plt.plot([hl, hl], [0, 53.3], color='yellow')
    #     plt.text(hl + 2, 50, '<- {}'.format(highlighted_name),
    #              color='yellow')
    # return fig, ax

import math
def get_dx_dy(angle, dist):
    cartesianAngleRadians = (450-angle)*math.pi/180.0
    dx = dist * math.cos(cartesianAngleRadians)
    dy = dist * math.sin(cartesianAngleRadians)
    return dx, dy
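
The `450 - angle` term in `get_dx_dy` treats the input as a compass-style bearing: 0 degrees points along +y and the angle grows clockwise, whereas `math.cos` and `math.sin` expect an angle measured counterclockwise from +x. The short check below repeats the function so that it runs on its own and evaluates it at the four cardinal bearings. Whether the `Dir` column follows exactly this convention is an assumption to verify against the competition's data description.

```python
import math


def get_dx_dy(angle, dist):
    # Same formula as in the cell above, repeated so this block is standalone.
    radians = (450 - angle) * math.pi / 180.0
    return dist * math.cos(radians), dist * math.sin(radians)


# Unit vectors for the four cardinal bearings.
for bearing in (0, 90, 180, 270):
    dx, dy = get_dx_dy(bearing, 1.0)
    print(f'bearing {bearing:3d}: dx={dx:+.3f}, dy={dy:+.3f}')
```

Up to floating point rounding, bearing 0 maps to (0, 1), 90 to (1, 0), 180 to (0, -1), and 270 to (-1, 0).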

In [21]:
play_id = train.query("DL_LB == '3-4'")['PlayId'].reset_index(drop=True)[500]
train.query("PlayId == @play_id and Team == 'away'") 
train.query("PlayId == @play_id and Team == 'home'") 
train.query("PlayId == @play_id and NflIdRusher == NflId") 
rusher_row = train.query("PlayId == @play_id and NflIdRusher == NflId")
yards_covered = rusher_row["Yards"].values[0]

x = rusher_row["X"].values[0]
y = rusher_row["Y"].values[0]
rusher_dir = rusher_row["Dir"].values[0]
rusher_speed = rusher_row["S"].values[0]
dx, dy = get_dx_dy(rusher_dir, rusher_speed)
yards_gained = train.query("PlayId == @play_id")['Yards'].tolist()[0]

In [22]:
play_id = train.query("DL_LB == '4-3'")['PlayId'].reset_index(drop=True)[500]
train.query("PlayId == @play_id and Team == 'away'") 
train.query("PlayId == @play_id and Team == 'home'")
train.query("PlayId == @play_id and NflIdRusher == NflId")
rusher_row = train.query("PlayId == @play_id and NflIdRusher == NflId")
yards_covered = rusher_row["Yards"].values[0]

x = rusher_row["X"].values[0]
y = rusher_row["Y"].values[0]
rusher_dir = rusher_row["Dir"].values[0]
rusher_speed = rusher_row["S"].values[0]
dx, dy = get_dx_dy(rusher_dir, rusher_speed)
yards_gained = train.query("PlayId == @play_id")['Yards'].tolist()[0]

In [23]:
play_id = train.query("DL_LB == '4-2'")['PlayId'].reset_index(drop=True)[500]
train.query("PlayId == @play_id and Team == 'away'")
train.query("PlayId == @play_id and Team == 'home'") 
train.query("PlayId == @play_id and NflIdRusher == NflId")
rusher_row = train.query("PlayId == @play_id and NflIdRusher == NflId")
yards_covered = rusher_row["Yards"].values[0]

x = rusher_row["X"].values[0]
y = rusher_row["Y"].values[0]
rusher_dir = rusher_row["Dir"].values[0]
rusher_speed = rusher_row["S"].values[0]
dx, dy = get_dx_dy(rusher_dir, rusher_speed)
yards_gained = train.query("PlayId == @play_id")['Yards'].tolist()[0]

# Snap to Handoff Time
Different types of designed runs develop differently. One way to understand the play design is by looking at the time it takes the quarterback to hand the ball off to the rusher. Let's take a look at the distribution of seconds taken.

In [24]:
train['SnapHandoffSeconds'] = (pd.to_datetime(train['TimeHandoff']) - \
                               pd.to_datetime(train['TimeSnap'])).dt.total_seconds()

(train.groupby('SnapHandoffSeconds').count() / 22 )['GameId']

SnapHandoffSeconds
0.0      255.0
1.0    22248.0
2.0     8447.0
3.0       44.0
4.0        8.0
5.0        3.0
7.0        2.0
Name: GameId, dtype: float64

It looks like this feature might cause some issues. Due to the lack of precision (whole seconds only), we don't have much detail about the snap-to-handoff time. Additionally, the data are sparse for values other than 1 or 2 seconds, which causes the average `Yards` for those values to have a large variance.

In [25]:
train.groupby('SnapHandoffSeconds')['Yards'].mean()

SnapHandoffSeconds
0.0    4.898039
1.0    4.173768
2.0    4.347579
3.0    5.000000
4.0    3.875000
5.0   -2.000000
7.0    5.000000
Name: Yards, dtype: float64
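
Showing the play count next to the mean makes the sparsity problem visible in a single table. The sketch below runs on invented timestamps; the ISO-style string format is an assumption about how `TimeSnap` and `TimeHandoff` are stored.

```python
import pandas as pd

# Hypothetical toy data, one row per play, with assumed timestamp format.
toy = pd.DataFrame({
    'PlayId': [1, 2, 3, 4, 5],
    'TimeSnap': ['2017-09-08T00:44:05.000Z', '2017-09-08T00:45:10.000Z',
                 '2017-09-08T00:50:00.000Z', '2017-09-08T00:55:30.000Z',
                 '2017-09-08T01:02:00.000Z'],
    'TimeHandoff': ['2017-09-08T00:44:06.000Z', '2017-09-08T00:45:11.000Z',
                    '2017-09-08T00:50:02.000Z', '2017-09-08T00:55:31.000Z',
                    '2017-09-08T01:02:05.000Z'],
    'Yards': [4, 2, 6, 3, -2],
})

# Seconds between snap and handoff, computed as in the cell above.
toy['SnapHandoffSeconds'] = (pd.to_datetime(toy['TimeHandoff'])
                             - pd.to_datetime(toy['TimeSnap'])).dt.total_seconds()

# Play count and mean yards per value; means with a tiny count are noisy.
by_seconds = toy.groupby('SnapHandoffSeconds')['Yards'].agg(['count', 'mean'])
print(by_seconds)
```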

## Ideas of what I should look into next? Let me know in the comments.