<h1 style="font-size:300%; color:blue;"> 🦆 NFL Starter EDA 🦆 </h1>

<h2> This is a starter EDA aimed at understanding the dataset itself. I hope it can serve as a guide for anyone who, like me, has just entered the competition and is having trouble understanding the data.

There will be no fancy visualizations or plots; the focus is on understanding each `csv` file.</h2>

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Index</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#🚀-Environment-Setting" role="tab" aria-controls="profile">🚀 Environment Setting<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#🚀-Helper-functions" role="tab" aria-controls="messages">🚀 Helper functions<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#🚀-EDA-(Exploratory-Data-Analysis)" role="tab" aria-controls="settings">🚀EDA (Exploratory Data Analysis)<span class="badge badge-primary badge-pill">3</span></a>


# 🚀 Environment Setting

In [None]:
import os
import pandas as pd
import numpy as np
import cv2
from PIL import Image, ImageDraw

import matplotlib.pyplot as plt
import seaborn as sns
import plotly

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
ENV_DIR = '../input'
DATA_DIR = f'{ENV_DIR}/nfl-health-and-safety-helmet-assignment'

In [None]:
os.listdir(DATA_DIR)

In [None]:
# Training data
# -----------------------------------------------------------------------
# Player information is included

# Bounding Box
train_df = pd.read_csv(f'{DATA_DIR}/train_labels.csv')

# Tracking Information using Sensor
train_tracking_df = pd.read_csv(f'{DATA_DIR}/train_player_tracking.csv')
test_tracking_df = pd.read_csv(f'{DATA_DIR}/test_player_tracking.csv')

# images/
# -----------------------------------------------------------------------
# A baseline helmet detector was trained on images/ with image_labels.csv and then run on the train and test videos
# Its predictions are stored in [train/test]_baseline_helmets.csv
# No player information is included

# Labels for the still images (no player information)
image_df = pd.read_csv(f'{DATA_DIR}/image_labels.csv')

# Baseline Prediction - Trained by images inside folder images/
train_predict_df = pd.read_csv(f'{DATA_DIR}/train_baseline_helmets.csv')

test_predict_df = pd.read_csv(f'{DATA_DIR}/test_baseline_helmets.csv')

# 🚀 Helper functions
- Get Image by frame
- Draw Bounding Box
- Video play
- Football Animation

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Dealing with image and video</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Get-Image-by-frame" role="tab" aria-controls="profile">Get Image by frame<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Draw-Bounding-Box" role="tab" aria-controls="messages">Draw Bounding Box<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Video-Play" role="tab" aria-controls="settings">Video Play<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Football-Animation" role="tab" aria-controls="settings">Football Animation<span class="badge badge-primary badge-pill">4</span></a> 

<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#Index" role="tab" aria-controls="settings">☝🏻 Back to Index</a>
</div>

## Get Image by frame
> Get frame image from video

Code is based on https://www.kaggle.com/coldfir3/eda-helmet-keypoint-tracking-data-comparison

In [None]:
# reference : https://www.kaggle.com/coldfir3/eda-helmet-keypoint-tracking-data-comparison
def get_frame_from_video(video_path, frame):
    video_path = f"{DATA_DIR}/train/{video_path}"
    frame = frame - 1
    
    !ffmpeg \
        -hide_banner \
        -loglevel fatal \
        -nostats \
        -i $video_path -vf "select=eq(n\,$frame)" -vframes 1 frame.png
    
    img = Image.open('frame.png')
    os.remove('frame.png')
    return img

In [None]:
get_frame_from_video('57583_000082_Endzone.mp4', 1)

## Draw Bounding Box
> Just a simple bounding box drawing without label for EDA purpose

In [None]:
def draw_rect(image, bbox_df):
    new_image = image.copy()
    draw = ImageDraw.Draw(new_image)
    for _, (left, width, top, height) in bbox_df[['left', 'width', 'top', 'height']].iterrows():
        draw.rectangle(((left, top), (left + width, top + height)), outline=(255, 0, 0), width=2)
    
    return new_image

In [None]:
def frame_bbox(df, video_frame):
    video_name = '_'.join(video_frame.split('_')[:3]) + '.mp4'
    frame = int(video_frame.split('_')[-1])
    
    image = get_frame_from_video(video_name, frame)
    bbox_df = df.query('video_frame == @video_frame')
    
    bbox_image = draw_rect(image, bbox_df)
    
    return bbox_image

In [None]:
frame_bbox(train_df, '57583_000082_Endzone_1')
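The `video_frame` key used by `frame_bbox` packs the video name and the frame number into one string. Below is a minimal sketch of splitting it, assuming the `<gameKey>_<playID>_<view>_<frame>` layout seen in `57583_000082_Endzone_1`. `split_video_frame` is a hypothetical helper introduced only for illustration.

```python
def split_video_frame(video_frame):
    """Split '<gameKey>_<playID>_<view>_<frame>' into (video file name, frame number)."""
    # Split once from the right so only the trailing frame number is cut off
    video_stem, frame = video_frame.rsplit("_", 1)
    return f"{video_stem}.mp4", int(frame)


video_name, frame = split_video_frame("57583_000082_Endzone_1")
print(video_name, frame)
```

Splitting once from the right does not depend on how many underscores the video name contains.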

## Video Play
> Let's enjoy the video and see what the game looks like.

In [None]:
from IPython.display import Video, display

def video(video_path, ratio=0.7):
    nfl_video = Video(f"{DATA_DIR}/train/{video_path}",
                      embed=True,
                      height=int(720 * ratio),
                      width=int(1280 * ratio))
    return nfl_video
    
video('57583_000082_Endzone.mp4')

## Football Animation

> football animation made by `ammarnassanalhajali`

**Reference**
- https://www.kaggle.com/ammarnassanalhajali/nfl-big-data-bowl-2021-animating-players
- https://www.kaggle.com/robikscube/nfl-helmet-assignment-getting-started-guide

In [None]:
# Reference : https://www.kaggle.com/robikscube/nfl-helmet-assignment-getting-started-guide
def add_track_features(tracks, fps=59.94, snap_frame=10):
    """
    Add column features helpful for syncing with video data.
    """
    tracks = tracks.copy()
    tracks["game_play"] = (
        tracks["gameKey"].astype("str")
        + "_"
        + tracks["playID"].astype("str").str.zfill(6)
    )
    tracks["time"] = pd.to_datetime(tracks["time"])
    
    # The time when snap happened
    snap_dict = (
        tracks.query('event == "ball_snap"')
        .groupby("game_play")["time"]
        .first()
        .to_dict()
    )
    tracks["snap"] = tracks["game_play"].map(snap_dict)
    tracks["isSnap"] = tracks["snap"] == tracks["time"]
    tracks["team"] = tracks["player"].str[0].replace("H", "Home").replace("V", "Away")
    # total_seconds() returns float seconds regardless of the pandas version
    tracks["snap_offset"] = (tracks["time"] - tracks["snap"]).dt.total_seconds()
    # Estimated video frame
    tracks["est_frame"] = (
        ((tracks["snap_offset"] * fps) + snap_frame).round().astype("int")
    )
    return tracks

train_tracking_df = add_track_features(train_tracking_df)
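The frame estimate in `add_track_features` is a linear mapping from seconds relative to the snap to a video frame number. Here is a small worked sketch of that arithmetic, using the function's default values (`fps=59.94`, `snap_frame=10`). `estimate_frame` is a hypothetical stand-alone helper, and the offsets are made-up examples.

```python
FPS = 59.94       # default fps in add_track_features
SNAP_FRAME = 10   # default snap_frame in add_track_features


def estimate_frame(snap_offset_s, fps=FPS, snap_frame=SNAP_FRAME):
    """Map seconds relative to the snap to an estimated video frame number."""
    return int(round(snap_offset_s * fps + snap_frame))


for offset in (-0.1, 0.0, 1.0):
    print(offset, estimate_frame(offset))
```

With these defaults the snap itself lands on frame 10, and one second after the snap lands about 60 frames later.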

In [None]:
import plotly.express as px
import plotly.graph_objects as go
import plotly


def add_plotly_field(fig):
    # Reference https://www.kaggle.com/ammarnassanalhajali/nfl-big-data-bowl-2021-animating-players
    fig.update_traces(marker_size=20)
    
    fig.update_layout(paper_bgcolor='#29a500', plot_bgcolor='#29a500', font_color='white',
        width = 800,
        height = 600,
        title = "",
        
        xaxis = dict(
        nticks = 10,
        title = "",
        visible=False
        ),
        
        yaxis = dict(
        scaleanchor = "x",
        title = "Temp",
        visible=False
        ),
        showlegend= True,
  
        annotations=[
       dict(
            x=-5,
            y=26.65,
            xref="x",
            yref="y",
            text="ENDZONE",
            font=dict(size=16,color="#e9ece7"),
            align='center',
            showarrow=False,
            yanchor='middle',
            textangle=-90
        ),
        dict(
            x=105,
            y=26.65,
            xref="x",
            yref="y",
            text="ENDZONE",
            font=dict(size=16,color="#e9ece7"),
            align='center',
            showarrow=False,
            yanchor='middle',
            textangle=90
        )]  
        ,
        legend=dict(
        traceorder="normal",
        font=dict(family="sans-serif",size=12),
        title = "",
        orientation="h",
        yanchor="bottom",
        y=1.00,
        xanchor="center",
        x=0.5
        ),
    )
    ####################################################
        
    fig.add_shape(type="rect", x0=-10, x1=0,  y0=0, y1=53.3,line=dict(color="#c8ddc0",width=3),fillcolor="#217b00" ,layer="below")
    fig.add_shape(type="rect", x0=100, x1=110, y0=0, y1=53.3,line=dict(color="#c8ddc0",width=3),fillcolor="#217b00" ,layer="below")
    for x in range(0, 100, 10):
        fig.add_shape(type="rect", x0=x,   x1=x+10, y0=0, y1=53.3,line=dict(color="#c8ddc0",width=3),fillcolor="#29a500" ,layer="below")
    for x in range(0, 100, 1):
        fig.add_shape(type="line",x0=x, y0=1, x1=x, y1=2,line=dict(color="#c8ddc0",width=2),layer="below")
    for x in range(0, 100, 1):
        fig.add_shape(type="line",x0=x, y0=51.3, x1=x, y1=52.3,line=dict(color="#c8ddc0",width=2),layer="below")
    
    for x in range(0, 100, 1):
        fig.add_shape(type="line",x0=x, y0=20.0, x1=x, y1=21,line=dict(color="#c8ddc0",width=2),layer="below")
    for x in range(0, 100, 1):
        fig.add_shape(type="line",x0=x, y0=32.3, x1=x, y1=33.3,line=dict(color="#c8ddc0",width=2),layer="below")
    
    
    fig.add_trace(go.Scatter(
    x=[2,10,20,30,40,50,60,70,80,90,98], y=[5,5,5,5,5,5,5,5,5,5,5],
    text=["G","1 0","2 0","3 0","4 0","5 0","4 0","3 0","2 0","1 0","G"],
    mode="text",
    textfont=dict(size=20,family="Arial"),
    showlegend=False,
    ))
    
    fig.add_trace(go.Scatter(
    x=[2,10,20,30,40,50,60,70,80,90,98], y=[48.3,48.3,48.3,48.3,48.3,48.3,48.3,48.3,48.3,48.3,48.3],
    text=["G","1 0","2 0","3 0","4 0","5 0","4 0","3 0","2 0","1 0","G"],
    mode="text",
    textfont=dict(size=20,family="Arial"),
    showlegend=False,
    ))
    
    return fig

In [None]:
train_tracking_df

In [None]:
def football_animation(game_play):
    train_tracking_df["track_time_count"] = (
        train_tracking_df.sort_values("time")
        .groupby("game_play")["time"]
        .rank(method="dense")
        .astype("int")
    )

    fig = px.scatter(
        train_tracking_df.query("game_play == @game_play"),
        x="x",
        y="y",
        range_x=[-10, 110],
        range_y=[-10, 53.3],
        hover_data=["player", "s", "a", "dir"],
        color="team",
        animation_frame="track_time_count",
        text="player",
        title=f"Animation of NGS data for game_play {game_play}",
    )

    fig.update_traces(textfont_size=10)
    fig = add_plotly_field(fig)
    fig.show()

In [None]:
football_animation('57583_000082')

# 🚀 EDA (Exploratory Data Analysis)

> We will focus on understanding the videos and csv files

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Videos and CSV files</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#🏈-Train/Test-Video" role="tab" aria-controls="profile">🏈 Train/Test Video<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#🏈-train_labels.csv" role="tab" aria-controls="messages">🏈 train_labels.csv<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#🏈-[-Train-/-Test-]-player-tracking.csv" role="tab" aria-controls="settings">🏈 [ Train / Test ] player tracking.csv<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#🏈-image_labels.csv" role="tab" aria-controls="settings">🏈 image_labels.csv<span class="badge badge-primary badge-pill">4</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#🏈-[-Train-/-Test-]-baseline-helmets.csv" role="tab" aria-controls="settings">🏈 [ Train / Test ] baseline helmets.csv<span class="badge badge-primary badge-pill">5</span></a>
    
<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#Index" role="tab" aria-controls="settings">☝🏻 Back to Index</a>
</div>

# 🏈 Train/Test Video
> Let's understand the videos inside the `train` and `test` folders

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Questions</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Is-test-videos-subset-of-train-videos?" role="tab" aria-controls="profile">Is test videos subset of train videos?<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-train-videos-are-there?" role="tab" aria-controls="messages">How many train videos are there?<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Does-Sideline-and-Endzone-Video-has-same-frame?" role="tab" aria-controls="settings">Does Sideline and Endzone Video has same frame?<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Does-video-frame-matches-with-the-frame-recorded-in-the-train_labels.csv?" role="tab" aria-controls="settings">Does video frame matches with the frame recorded in the train_labels.csv?<span class="badge badge-primary badge-pill">4</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Is-there-a-frame-0?" role="tab" aria-controls="settings">Is there a frame 0?<span class="badge badge-primary badge-pill">5</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#test_labels.csv-for-submission-testing" role="tab" aria-controls="settings">test_labels.csv for submission testing<span class="badge badge-primary badge-pill">6</span></a>
    

<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#🚀-EDA-(Exploratory-Data-Analysis)" role="tab" aria-controls="settings">🚀 Back to EDA</a>
</div>

## Is test videos subset of train videos?
> Yes!

In [None]:
train_videos = os.listdir(f'{DATA_DIR}/train')
test_videos = os.listdir(f'{DATA_DIR}/test')

len(train_videos), len(test_videos)

In [None]:
set(test_videos).issubset(set(train_videos))

## How many train videos are there?
> 120 in total! There is a Sideline and Endzone video pair for every play!

- train 120 videos
    - Endzone 60
    - Sideline 60

In [None]:
end_count = 0
side_count = 0
endzone_list = []
sideline_list = []
for train_video in train_videos:
    name = train_video.split('.')[0]
    video_id, play_id, view = name.split('_')
    
    if view == "Endzone":
        endzone_list.append('_'.join([video_id, play_id]))
        end_count += 1
    else:
        sideline_list.append('_'.join([video_id, play_id]))
        side_count += 1

print(end_count, side_count)

In [None]:
# Every play has both an Endzone and a Sideline video
len(set(endzone_list)), len(set(sideline_list)), set(endzone_list) == set(sideline_list)

In [None]:
# videos inside folder matches the video list inside the train_labels.csv
set(train_videos) == set(train_df.video.unique())

## Does Sideline and Endzone Video has same frame?
> No. For 25 of the 60 plays the frame counts don't match. The difference is mostly 1 frame, but there are also differences of 7 frames.

In [None]:
not_match_video = []

for play_id in train_df.playID.unique():
    end_frame_n = train_df.query('playID == @play_id and view == "Endzone"').frame.max()
    side_frame_n = train_df.query('playID == @play_id and view == "Sideline"').frame.max()
    
    if end_frame_n != side_frame_n:
        not_match_video.append(play_id)
        print(f'Not same at playID {play_id} endzone [{end_frame_n}] sideline [{side_frame_n}] difference [{abs(end_frame_n - side_frame_n)}]')

In [None]:
len(not_match_video)
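The same comparison can be done without a loop by putting the two views side by side. This is only a sketch on a toy table with the same `playID`, `view` and `frame` columns as `train_df`; the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train_df: one row per (playID, view, frame)
toy_df = pd.DataFrame({
    "playID": [82, 82, 82, 82, 82, 336, 336, 336, 336],
    "view": ["Endzone", "Endzone", "Sideline", "Sideline", "Sideline",
             "Endzone", "Endzone", "Sideline", "Sideline"],
    "frame": [1, 2, 1, 2, 3, 1, 2, 1, 2],
})

# Last frame per play and view, with one column per view
last_frame = toy_df.groupby(["playID", "view"])["frame"].max().unstack("view")
last_frame["difference"] = (last_frame["Endzone"] - last_frame["Sideline"]).abs()

# Keep only the plays whose two views end on different frames
mismatched = last_frame[last_frame["difference"] > 0]
print(mismatched)
```

The resulting table keeps the per-view frame counts and the difference together, which makes it easy to sort or plot the mismatches.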

## Does video frame matches with the frame recorded in the train_labels.csv?
> Yes! Frame matches exactly! It's clean!

In [None]:
def get_total_frame(video_path):
    # Read the frame count from the video metadata
    cap = cv2.VideoCapture(f"{DATA_DIR}/train/{video_path}")
    length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()  # free the video handle

    return length

In [None]:
play2frame = train_df.groupby('video').frame.max().to_dict()

In [None]:
for video_name, label_frame_n in play2frame.items():
    video_frame_n = get_total_frame(video_name)
    if video_frame_n != label_frame_n:
        print('Not Match!')

## Is there a frame 0?
> Yes, there is 1 row with frame 0, and it seems to be mislabeled, so it can simply be dropped.

In [None]:
train_df.query('frame == 0')

In [None]:
frame_df = train_df.query('video == "57584_000336_Sideline.mp4"')
frame_df

In [None]:
frame_df.frame.max()

In [None]:
get_total_frame("57584_000336_Sideline.mp4")

## test_labels.csv for submission testing
> We can build an additional `test_labels.csv` to test our submission code. Because the publicly available test videos are a subset of the train videos, the score on them should be high; if it is not, something is wrong with the submission code.

In [None]:
test_df = train_df.query("video in @test_videos").reset_index().copy()
test_df

# 🏈 train_labels.csv
> The most important file for training. It includes the player label, bounding box, and impact information for each video

- 50 games
- 60 plays
    - The videos in the train folder are divided based on play 
- 52142 frames (frames means images)

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Questions</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-game,-play,-frame?" role="tab" aria-controls="profile">How many game, play, frame?<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-plays-per-game?" role="tab" aria-controls="messages">How many plays per game?<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#How-many-players-per-game?" role="tab" aria-controls="settings">How many players per game?<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-players-exist?" role="tab" aria-controls="settings">How many players exist?<span class="badge badge-primary badge-pill">4</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Does-all-plays-has-sideline-player-included?" role="tab" aria-controls="settings">Does all plays has sideline player included?<span class="badge badge-primary badge-pill">5</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#How-many-players-are-included-in-the-plays-that-has-sideline-player-included?" role="tab" aria-controls="settings">How many players are included in the plays that has sideline player included?<span class="badge badge-primary badge-pill">6</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#What-is-the-helmet-size-inside-the-video?" role="tab" aria-controls="settings">What is the helmet size inside the video?<span class="badge badge-primary badge-pill">7</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Is-Definitive-Impact-related-to-Impact-type?" role="tab" aria-controls="settings">Is Definitive Impact related to Impact type?<span class="badge badge-primary badge-pill">8</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#How-often-Definitive-Impact-happens?" role="tab" aria-controls="settings">How often Definitive Impact happens?<span class="badge badge-primary badge-pill">9</span></a>
     
<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#🚀-EDA-(Exploratory-Data-Analysis)" role="tab" aria-controls="settings">🚀 Back to EDA</a>
</div>

In [None]:
train_df.head(5)

## How many game, play, frame?
> `gameKey` and `playID` are unique IDs
- 50 games
- 60 plays
- 52142 frames (frame means image)

In [None]:
train_df.gameKey.nunique()

In [None]:
train_df.playID.nunique()

In [None]:
train_df.groupby('gameKey')['playID'].unique()

In [None]:
# 300 ~ 600 images per play
train_df.groupby(['gameKey', 'playID', 'view'])['frame'].nunique()

In [None]:
sns.displot(train_df.groupby(['gameKey', 'playID', 'view'])['frame'].nunique().values);

In [None]:
train_df.groupby(['gameKey', 'playID', 'view'])['frame'].nunique().sum()

## How many plays per game?
- 41 games have only 1 play
- 8 games have 2 plays
- 1 game has 3 plays

In [None]:
play_per_game = train_df.groupby('gameKey')['playID'].nunique().reset_index().groupby('playID')['gameKey'].unique().to_dict()
play_per_game
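If only the counts are needed rather than the list of games, `value_counts` on the per-game play count gives them directly. Here is a sketch on a toy table with the same `gameKey` and `playID` columns; the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train_df with three games
toy_df = pd.DataFrame({
    "gameKey": [1, 1, 2, 2, 2, 3],
    "playID": [10, 10, 20, 21, 21, 30],
})

# Number of distinct plays in each game
plays_per_game = toy_df.groupby("gameKey")["playID"].nunique()

# How many games have 1 play, 2 plays, and so on
games_by_play_count = plays_per_game.value_counts().sort_index()
print(games_by_play_count)
```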

## How many players per game?
> Mostly 22 players per play, but some plays include only 16 players

In [None]:
train_df.groupby(['gameKey', 'playID', 'view'])['label'].nunique().value_counts()

In [None]:
# Which plays have only 16 players?
train_df.groupby(['gameKey', 'playID', 'view'])['label'].nunique().reset_index().query('label == 16')

In [None]:
# the play which has 16 players
video('57680_002206_Endzone.mp4')

In [None]:
football_animation('57680_002206')

## How many players exist?
> 196 player labels exist. Not sure whether they are all unique players
- Home has 98 jersey numbers; numbers 1 and 45 appear only for Home
- Visitor has 98 jersey numbers; numbers 9 and 43 appear only for Visitor

In [None]:
train_df.label.unique()

In [None]:
len(train_df.label.unique())

In [None]:
train_df['isHome'] = train_df.label.str[0]
train_df['jersey'] = train_df.label.str[1:].astype(int)
train_df[['isHome', 'jersey']]

In [None]:
home_players = train_df.query('isHome == "H"').jersey.unique()
visitor_players = train_df.query('isHome == "V"').jersey.unique()

home_players.sort(), visitor_players.sort()
home_players, visitor_players

In [None]:
len(home_players), len(visitor_players)

In [None]:
set(home_players) - set(visitor_players)

In [None]:
set(visitor_players) - set(home_players)

## Does all plays has sideline player included?
> 25 out of 60 plays include sideline player

In [None]:
sideline_play = train_df.query('isSidelinePlayer == True').playID.unique()
sideline_play

In [None]:
len(sideline_play)

## How many players are included in the plays that has sideline player included?
> Mostly 23 players are shown. Sideline players appear in 25 Sideline videos and 5 Endzone videos. It's easy to see sideline players when the camera is on the sideline ;)

In [None]:
# Select several columns with a list; tuple-style selection fails on recent pandas
sideline_df = train_df.groupby(['gameKey', 'playID', 'view'])[['label', 'isSidelinePlayer']].nunique().query('isSidelinePlayer == 2')
sideline_df

In [None]:
sideline_df.reset_index().view.value_counts()

In [None]:
sideline_df.label.value_counts()

## What is the helmet size inside the video?
- The biggest helmet size is 5928 (width × height, in pixels); it occurs when the camera zooms in.
- The smallest helmet size is 9.
- Most helmet sizes are around 150.

In [None]:
train_df['helmet_size'] = train_df.width * train_df.height
train_df.helmet_size.hist();

In [None]:
# The 10 most common helmet sizes
train_df.helmet_size.value_counts()[:10]

In [None]:
# The 10 rarest helmet sizes: large helmets are uncommon
train_df.helmet_size.value_counts()[-10:]

In [None]:
# The biggest size of helmet is 5928
train_df.helmet_size.max(), train_df.helmet_size.min()

In [None]:
train_df.query('helmet_size == 5928')

In [None]:
# When does the helmet appear biggest?
get_frame_from_video('57686_002546_Endzone.mp4', 429)

In [None]:
# check through the video
video('57686_002546_Endzone.mp4')

In [None]:
train_df.query('helmet_size == 9')

In [None]:
# When does the helmet appear smallest?
get_frame_from_video('57680_002206_Sideline.mp4', 149)

In [None]:
# check through the video
video('57680_002206_Sideline.mp4')

## Is Definitive Impact related to Impact type?
> Yes, definitive impacts are a subset of impacts, and every impact type can be a definitive impact

In [None]:
train_df.impactType.value_counts()

In [None]:
impact_index = train_df.query('impactType != "None"').index
impact_index

In [None]:
definite_impact_index = train_df.query('isDefinitiveImpact == True').index
definite_impact_index

In [None]:
# Definitive impact is subset of Impacts
set(impact_index).issubset(set(definite_impact_index)), set(definite_impact_index).issubset(set(impact_index))
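The subset check can also be written with boolean masks instead of index sets. The sketch below uses a toy table and assumes, as in the query above, that rows without an impact carry the string `"None"` in `impactType`; the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train_df
toy_df = pd.DataFrame({
    "impactType": ["None", "Helmet", "Shoulder", "None", "Body"],
    "isDefinitiveImpact": [False, True, False, False, True],
})

is_impact = toy_df["impactType"] != "None"
is_definitive = toy_df["isDefinitiveImpact"]

# Definitive rows are a subset of impact rows if no definitive row lacks an impact type
definitive_is_subset = not (is_definitive & ~is_impact).any()
# The reverse holds only if every impact row is also definitive
impact_is_subset = not (is_impact & ~is_definitive).any()
print(definitive_is_subset, impact_is_subset)
```

The masks avoid building two Python sets of row indices, which matters little here but keeps the check vectorized.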

In [None]:
# All types of impact could be definitive impact
train_df.query('isDefinitiveImpact == True').impactType.value_counts()

## How often Definitive Impact happens?
- Definitive impacts are rare: rows flagged as a definitive impact are about 500 times less frequent than rows that are not

In [None]:
train_df.isDefinitiveImpact = train_df.isDefinitiveImpact.astype(int)
train_df.isDefinitiveImpact.head(1)

In [None]:
train_df.isDefinitiveImpact.value_counts()

In [None]:
950198 / 1889
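The ratio can be computed from the counts instead of typing the numbers by hand. This sketch assumes the two figures in the division above are the `value_counts` results for `isDefinitiveImpact` equal to 0 and 1; the Series is rebuilt here so the example runs on its own.

```python
import pandas as pd

# Counts taken from the division above (0 = not definitive, 1 = definitive)
counts = pd.Series({0: 950198, 1: 1889}, name="isDefinitiveImpact")

ratio = counts[0] / counts[1]     # non-definitive rows per definitive row
share = counts[1] / counts.sum()  # fraction of all rows that are definitive
print(f"ratio {ratio:.1f}, share {share:.4%}")
```

On the real data, `counts` would be `train_df.isDefinitiveImpact.value_counts()`.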

# 🏈 [ Train / Test ] player tracking.csv
> Tracking information of the players that is used with videos to map the helmet label

**The associated `test_player_tracking.csv` is available to your model when submitting.** So we will not look at `test_player_tracking.csv` here.

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Questions</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-games,-play,-frame?" role="tab" aria-controls="profile">How many games, play, frame?<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Is-ball-snap-is-the-starting-point-of-the-game?" role="tab" aria-controls="messages">Is ball snap is the starting point of the game??<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Does-track-information-always-track-22-players-every-moment?" role="tab" aria-controls="settings">Does track information always track 22 players every moment?<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Why-only-11-players-are-tracked?" role="tab" aria-controls="settings">Why only 11 players are tracked?<span class="badge badge-primary badge-pill">4</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Does-tracking-information-always-recorded-longer-than-video?" role="tab" aria-controls="settings">Does tracking information always recorded longer than video?<span class="badge badge-primary badge-pill">5</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#When-event-happens-were-there-always-22-players?" role="tab" aria-controls="settings"> When event happens were there always 22 players?<span class="badge badge-primary badge-pill">6</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Does-all-trackings-recorded-before-the-ball-snap?" role="tab" aria-controls="settings">Does all trackings recorded before the ball snap?<span class="badge badge-primary badge-pill">7</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#What-plays-has-unconsistent-player-numbers-while-beeing-tracked?" role="tab" aria-controls="settings">What plays has unconsistent player numbers while beeing tracked?<span class="badge badge-primary badge-pill">8</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#Does-player-number-unconsistent-even-we-only-consider-the-time-period-of-train-videos?" role="tab" aria-controls="settings">Does player number unconsistent even we only consider the time period of train videos?<span class="badge badge-primary badge-pill">9</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#test_tracking_df-is-not-shown-until-submission" role="tab" aria-controls="settings">test_tracking_df is not shown until submission?<span class="badge badge-primary badge-pill">10</span></a>   
    
<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#🚀-EDA-(Exploratory-Data-Analysis)" role="tab" aria-controls="settings">🚀 Back to EDA</a>
</div>

## How many games, play, frame?

- 50 games
- 60 plays
- 15180 tracking time steps (unique timestamps, not video frames)
    - min 113
    - max 456

In [None]:
train_tracking_df.gameKey.nunique()

In [None]:
train_tracking_df.playID.nunique()

In [None]:
# 113 ~ 456 time steps per play
frame_df = train_tracking_df.groupby(['gameKey', 'playID'])['time'].nunique().reset_index()
frame_df

In [None]:
sns.displot(frame_df.time);

In [None]:
# 15180 tracking time steps in total
frame_df.time.sum()

In [None]:
frame_df.time.min(), frame_df.time.max()

## Is ball snap is the starting point of the game?
> Yes! It occurs exactly once per play
- Before snap players change their position a bit
- After snap they start to run!


In [None]:
# Divide by 22 because the snap event is recorded once for each of the 22 players
len(train_tracking_df.query('event == "ball_snap"')) / 22

In [None]:
train_tracking_df.query('event == "ball_snap"').playID.value_counts() / 22

In [None]:
train_tracking_df.query('playID == 82 & event == "ball_snap"').head(1)

In [None]:
# The player barely moves during the roughly 15 seconds before the snap
train_tracking_df.query('playID == 82 and snap_offset < 0 and player == "H97"')

In [None]:
before_snap_df = train_tracking_df.query('playID == 82 and snap_offset < 0 and player == "H97"')
before_snap_df.s.min(), before_snap_df.s.max()

In [None]:
# Before the ball snap, the speed is low
sns.displot(before_snap_df.s);

In [None]:
after_snap_df = train_tracking_df.query('playID == 82 and snap_offset > 0 and player == "H97"')
after_snap_df.s.min(), after_snap_df.s.max()

In [None]:
# After the ball snap, the speed is high
sns.displot(after_snap_df.s);

## Does track information always track 22 players every moment?
> No! Mostly 22 players are tracked but in some frames less than 22 are tracked

In [None]:
# Mostly 22 players are tracked but not all
train_tracking_df.groupby(['playID', 'time']).count()['player'].value_counts()

## Why only 11 players are tracked?
> There were 22 players until the last time step, when the count suddenly dropped to 11. This looks wrong, but the 11-player record comes after the video has ended, so it can be ignored.

In [None]:
player_n_df = train_tracking_df.groupby(['playID', 'time']).count()['player'].reset_index()
player_n_df

In [None]:
player_n_df.query('player == 11')

In [None]:
# There are 11 players at the last frame
train_tracking_df.query('time == "2018-10-29 02:22:44.099000+00:00"')

In [None]:
# There are 22 players right before the last frame!
train_tracking_df.query('time == "2018-10-29 02:22:33.099000+00:00"')

In [None]:
video("57686_002546_Sideline.mp4")

In [None]:
football_animation("57686_002546")

## Does tracking information always recorded longer than video?
> Yes, every play's tracking record is longer than its train video, mostly about 3 times longer. Some tracking records are approximately 6 times longer!

In [None]:
time_df = ((train_df.groupby(['playID', 'view']).frame.max() - train_df.groupby(['playID', 'view']).frame.min()) / 59.94).reset_index()
time_df = time_df.rename(columns={'frame': 'time_cost'})
time_df

In [None]:
track_time_dict = (train_tracking_df.groupby('playID').snap_offset.max() - train_tracking_df.groupby('playID').snap_offset.min()).to_dict()

In [None]:
time_df['track_time_cost'] = time_df.playID.map(track_time_dict)

In [None]:
# every tracking record is longer than its train video
all(time_df['time_cost'] < time_df['track_time_cost'])

In [None]:
# It's mostly 3 times longer than train videos
sns.displot(time_df['track_time_cost'] / time_df['time_cost']);
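
The same duration comparison can be written as one merge instead of a dictionary lookup. This is only a sketch on invented values: the 59.94 frame rate is the one used above, while the toy frames and the `ratio` column are assumptions.

```python
import pandas as pd

FPS = 59.94  # frame rate used above to convert frame counts to seconds

# Toy labels: one video covering frames 1..600
toy_labels = pd.DataFrame({
    "playID": [1, 1],
    "view": ["Endzone", "Endzone"],
    "frame": [1, 600],
})
# Toy tracking: snap_offset in seconds, from -10 s to +20 s
toy_track = pd.DataFrame({"playID": [1, 1], "snap_offset": [-10.0, 20.0]})

# Video length in seconds per (play, view)
frame_span = toy_labels.groupby(["playID", "view"])["frame"].agg(["min", "max"])
video_seconds = ((frame_span["max"] - frame_span["min"]) / FPS).rename("time_cost").reset_index()

# Tracking length in seconds per play
offset_span = toy_track.groupby("playID")["snap_offset"].agg(["min", "max"])
track_seconds = (offset_span["max"] - offset_span["min"]).rename("track_time_cost").reset_index()

# Join both durations and compute how many times longer the tracking is
durations = video_seconds.merge(track_seconds, on="playID", how="left")
durations["ratio"] = durations["track_time_cost"] / durations["time_cost"]
print(durations)
```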

## When event happens were there always 22 players?
> Yes! But it seems to be just luck that all 22 players were on the field when each event happened.

In [None]:
event_df = (train_tracking_df.groupby(['playID', 'event'])['time'].count()/22).reset_index()
event_df

In [None]:
event_df.event.value_counts()

In [None]:
event_df.time.value_counts()

## Does all trackings recorded before the ball snap?
> Yes! Every play has tracking data recorded before the ball snap.

In [None]:
len(train_tracking_df.query('snap_offset < 0').playID.unique())

## What plays has unconsistent player numbers while beeing tracked?
> There are 6 plays that are not consistent.
- The play IDs are 109, 336, 350, 1242, 2546, 4152

In [None]:
player_df = train_tracking_df.groupby(['playID', 'time']).count().player.reset_index()
player_df

In [None]:
history_df = player_df.groupby('playID').apply(lambda r: r['player'].values).reset_index()
history_df = history_df.rename(columns={0: 'history'})
history_df.head()

In [None]:
not_consistent = []
for _, (play_id, history) in history_df.iterrows():
    start_player_n = history[0]
    all_same = all(history == start_player_n)
    
    if not all_same:
        not_consistent.append(play_id)

In [None]:
# There are 6 plays that are not consistent
not_consistent
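
The loop above can also be expressed without iterating over rows: a play is inconsistent exactly when its per-timestamp player count takes more than one distinct value. The sketch below uses an invented toy frame shaped like `player_df` (the counts are made up) and is an alternative formulation, not the method used above.

```python
import pandas as pd

# Toy version of player_df: number of tracked players per (play, timestamp)
toy_counts = pd.DataFrame({
    "playID": [82, 82, 82, 109, 109, 109],
    "time":   ["t0", "t1", "t2", "t0", "t1", "t2"],
    "player": [22, 22, 22, 22, 21, 22],
})

# Number of distinct player counts seen in each play
n_distinct = toy_counts.groupby("playID")["player"].nunique()

# More than one distinct count means the play is not consistent
not_consistent_ids = n_distinct[n_distinct > 1].index.tolist()
print(not_consistent_ids)
```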

## Does player number unconsistent even we only consider the time period of train videos?
> It becomes more consistent when considering only the time period of the train videos, but the tracking information of 3 plays is still not consistent. **We need to think about how to impute this missing data.**
- Considering the whole tracking - 109, 336, 350, 1242, 2546, 4152
- Only for the video time period - 109, 336, 4152

Within the video time period, 21 players are always tracked and 1 player is missing at some moments.
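
One possible way to impute the missing player is to create the missing `(time, player)` rows and interpolate positions linearly between the neighbouring samples. This is a sketch under assumptions, not a recommendation: linear interpolation is a choice made here for illustration, the toy values are invented, and `x`/`y` are assumed to be the position columns.

```python
import pandas as pd

# Toy tracking: player H2 has no row at time 1
toy = pd.DataFrame({
    "time":   [0, 0, 1, 2, 2],
    "player": ["H1", "H2", "H1", "H1", "H2"],
    "x":      [10.0, 50.0, 11.0, 12.0, 54.0],
    "y":      [20.0, 30.0, 20.5, 21.0, 30.0],
})

# Every (time, player) combination that should exist
full_index = pd.MultiIndex.from_product(
    [sorted(toy["time"].unique()), sorted(toy["player"].unique())],
    names=["time", "player"],
)

# Reindexing inserts NaN rows for the missing combinations
full = toy.set_index(["time", "player"]).reindex(full_index).sort_index()

# Interpolate each player's coordinates over time, only between known samples
full[["x", "y"]] = full.groupby(level="player")[["x", "y"]].transform(
    lambda s: s.interpolate(method="linear", limit_area="inside")
)

imputed = full.reset_index()
print(imputed)
```

Note that `interpolate(method="linear")` treats rows as equally spaced, so this only makes sense when the timestamps are sampled at a regular interval.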

In [None]:
min_dict = train_df.groupby('playID').frame.min().to_dict()
max_dict = train_df.groupby('playID').frame.max().to_dict()

In [None]:
train_tracking_df['min_frame'] = train_tracking_df.playID.map(min_dict)
train_tracking_df['max_frame'] = train_tracking_df.playID.map(max_dict)

filter_df = train_tracking_df[['playID', 'time', 'player', 'est_frame', 'min_frame', 'max_frame']].copy()
filter_df

In [None]:
filter_df['inVideo'] = (filter_df.est_frame >= filter_df.min_frame) & (filter_df.est_frame <= filter_df.max_frame)

In [None]:
filter_df = filter_df.query("inVideo == True")
filter_df

In [None]:
player_df = filter_df.groupby(['playID', 'time']).count().player.reset_index()
player_df

In [None]:
history_df = player_df.groupby('playID').apply(lambda r: r['player'].values).reset_index()
history_df = history_df.rename(columns={0: 'history'})
history_df.head()

In [None]:
not_consistent = []
for _, (play_id, history) in history_df.iterrows():
    start_player_n = history[0]
    all_same = all(history == start_player_n)
   
    if not all_same:
        not_consistent.append(play_id)

In [None]:
# Only 3 plays are still not consistent within the video time period
not_consistent

In [None]:
history_df.query('playID == 109').history.values

In [None]:
history_df.query('playID == 336').history.values

In [None]:
history_df.query('playID == 4152').history.values

In [None]:
video('57586_004152_Sideline.mp4')

In [None]:
football_animation('57586_004152')

## test_tracking_df is not shown until submission
> As the data description says, **test_player_tracking.csv are available to your model when submitting**.

In [None]:
test_tracking_df.gameKey.nunique()

In [None]:
test_tracking_df

# 🏈 image_labels.csv
> `image_labels.csv` is a supplementary dataset covering a wide variety of plays, mostly with only 1 frame per play. Only 2 frames match the train set. It is unclear whether it can be used for training the full task, but it can certainly be used for training bbox prediction.

Comparison between `image_labels.csv` and `train_labels.csv`
- 9947 supplementary images vs 52142 train images
    - about 20% additional data, for bbox prediction only
- 41 games match and only 1 play matches
- only 2 frames match

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Comparison Questions</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-games-match?" role="tab" aria-controls="profile">How many games match?<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-plays-match?" role="tab" aria-controls="messages">How many plays match?<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#How-many-frames-match?" role="tab" aria-controls="settings">How many frames match?<span class="badge badge-primary badge-pill">3</span></a>

<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#🚀-EDA-(Exploratory-Data-Analysis)" role="tab" aria-controls="settings">🚀 Back to EDA</a>
</div>

In [None]:
# total number of bounding boxes (rows) across all frames in image_df
len(image_df)

In [None]:
# 9947 images
image_df.image.nunique()

In [None]:
# This feature doesn't seem useful to our competition
image_df.label.hist();

In [None]:
image_df.groupby('image').count().head(5)

In [None]:
# remove "frame", ".jpg" to compare with the whole frame
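# regex=False: treat the patterns literally (otherwise '.' in '.jpg' could match any character)
image_df.image = image_df.image.str.replace('frame', '', regex=False)
image_df.image = image_df.image.str.replace('.jpg', '', regex=False)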
image_df.head(5)

In [None]:
# additional, auxiliary images
aux_images = image_df.image

## How many games match?
> 41 games match.
- 38 of the matching games have one play
- 3 of the matching games have two plays

In [None]:
aux_games_set = set(aux_images.str.split('_').str[0])
train_games_set = set(train_df.gameKey.astype(str))

In [None]:
# aux game ID 57502 ~ 58176
# train game ID 57583 ~ 58107
len(aux_games_set), len(train_games_set)

In [None]:
common_games = list(aux_games_set & train_games_set)
print(common_games)

In [None]:
# 38 of the common games are games with only one play
one_play_only = set([str(game) for game in play_per_game[1]])
len(set(common_games).intersection(one_play_only))

In [None]:
# 3 of the common games are games with two plays
two_play_only = set([str(game) for game in play_per_game[2]])
len(set(common_games).intersection(two_play_only))

## How many plays match?
> Only 1 matches: `58000_001306`. Besides that, `image_labels.csv` has lots of different plays!

In [None]:
train_df

In [None]:
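# strip the trailing "_<frame>" suffix; raw string + regex=True so the pattern is applied as a regex
aux_play_set = set(aux_images.str.replace(r'_\d*$', '', regex=True))
train_play_set = set(train_df.video_frame.str.replace(r'_\d*$', '', regex=True))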

In [None]:
len(aux_play_set), len(train_play_set)

In [None]:
aux_play_set & train_play_set
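
Instead of stripping suffixes with `str.replace`, the key can be split into its parts once and reused for the game, play, and frame comparisons. The sketch below assumes every key follows the `gameKey_playID_view_frame` layout seen in this notebook; the frame numbers in the example strings and the `game_play` / `video` helper columns are made up for illustration.

```python
import pandas as pd

# Example keys in the video_frame layout (frame numbers are invented)
keys = pd.Series(["57583_000082_Endzone_1", "58000_001306_Sideline_245"])

# Named groups become the column names of the extracted frame
pattern = r"^(?P<gameKey>\d+)_(?P<playID>\d+)_(?P<view>[A-Za-z]+)_(?P<frame>\d+)$"
parts = keys.str.extract(pattern)

# Rebuild the identifiers needed for each level of comparison
parts["game_play"] = parts["gameKey"] + "_" + parts["playID"]
parts["video"] = parts["game_play"] + "_" + parts["view"]
print(parts)
```

All extracted columns are strings, so `frame` needs an explicit `astype(int)` before any numeric comparison.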

## How many frames match?
> Only 2 frames match. `image_labels.csv` mostly has one image per play.

In [None]:
aux_frame_set = set(aux_images)
train_frame_set = set(train_df.video_frame)

In [None]:
len(aux_frame_set), len(train_frame_set)

In [None]:
# neither set is a subset of the other
aux_frame_set.issubset(train_frame_set), train_frame_set.issubset(aux_frame_set)

In [None]:
aux_frame_set & train_frame_set

# 🏈 [ Train / Test ] baseline helmets.csv
> A quick check that the inference was really run on the videos in our train/test folders. And yes, it was! After training on the supplementary images, the model was used to predict the bboxes of the train/test videos.
- If the bbox prediction is already good with just the supplementary images, how great will it be when we also train on our train set?

`test_predict_df` (`test_baseline_helmets.csv`) is not shown until submission
> As the data description says, **test_baseline_helmets.csv are available to your model when submitting**.

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Questions</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#How-many-bbox-was-predicted-for-1-frame?" role="tab" aria-controls="profile">How many bbox was predicted for 1 frame?<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#What-if-we-remove-prediction-with-lower-confidence?" role="tab" aria-controls="messages">What if we remove prediction with lower confidence?<span class="badge badge-primary badge-pill">2</span></a>
    
<div class="alert alert-info" role="alert">
    <a data-toggle="list" href="#🚀-EDA-(Exploratory-Data-Analysis)" role="tab" aria-controls="settings">🚀 Back to EDA</a>
</div>

In [None]:
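# strip the trailing "_<frame>" suffix to get the video name (raw string + regex=True)
train_predict_df['video'] = train_predict_df.video_frame.str.replace(r'_\d*$', '', regex=True)
test_predict_df['video'] = test_predict_df.video_frame.str.replace(r'_\d*$', '', regex=True)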

In [None]:
train_predict_df.head(5)

In [None]:
# prediction is for the train set that we are using
set(train_predict_df.video.unique()) == set(train_df.video_frame.str.replace(r'_\d*$', '', regex=True))

In [None]:
set(test_predict_df.video.unique())

In [None]:
test_videos.sort()
test_videos

## How many bbox was predicted for 1 frame?
> The model used for predicting the bounding boxes also predicts the heads of people on the sideline as helmets

- at most 82 bboxes were predicted for 1 frame
- at least 2 bboxes were predicted for 1 frame

In [None]:
# at most 82 bboxes were predicted for 1 frame
train_predict_df.groupby('video_frame').count()['left'].max()

In [None]:
# at least 2 bboxes were predicted for 1 frame
train_predict_df.groupby('video_frame').count()['left'].min()

In [None]:
# bbox number doesn't seem quite right
sns.displot(train_predict_df.groupby('video_frame').count()['left']);

In [None]:
train_predict_df.groupby('video_frame').count().query('left == 82')

In [None]:
frame_bbox(train_predict_df, '58094_002819_Sideline_169')

In [None]:
train_predict_df.groupby('video_frame').count().query('left == 2')

In [None]:
frame_bbox(train_predict_df, '57584_000336_Sideline_454')

## What if we remove prediction with lower confidence?
> The max bbox count shrinks to 35, and the false positives on sideline people look better than before! But there is still a long way to go!

In [None]:
filter_df = train_predict_df.copy().query('conf > 0.75')
filter_df

In [None]:
# after filtering, at most 35 bboxes are predicted for 1 frame
filter_df.groupby('video_frame').count()['left'].max()

In [None]:
frame_bbox(filter_df, '58094_002819_Sideline_169')

In [None]:
frame_bbox(filter_df, '57584_000336_Sideline_454')
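
Rather than checking a single cut-off, it can be useful to see how the maximum number of boxes per frame changes across several confidence thresholds. This is a minimal sketch on an invented frame: only the column names `video_frame` and `conf` follow the baseline file, and the thresholds other than 0.75 are arbitrary choices.

```python
import pandas as pd

# Toy predictions: two frames with a mix of high and low confidence boxes
toy_pred = pd.DataFrame({
    "video_frame": ["a_1", "a_1", "a_1", "a_1", "a_2", "a_2"],
    "conf":        [0.95, 0.90, 0.40, 0.20, 0.85, 0.60],
})

rows = []
for threshold in (0.0, 0.5, 0.75):
    # Keep only boxes above the current confidence threshold
    kept = toy_pred[toy_pred["conf"] > threshold]
    per_frame = kept.groupby("video_frame").size()
    rows.append({
        "threshold": threshold,
        "max_boxes": int(per_frame.max()),      # largest box count in any frame
        "frames_left": int(per_frame.size),     # frames that still have a box
    })

summary = pd.DataFrame(rows)
print(summary)
```

Tracking `frames_left` alongside `max_boxes` shows whether a stricter threshold starts removing every box from some frames.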