# Kaggle : NFL Big Data Bowl 2020 
## Introduction

In this competition we predict how many yards a team will gain on a rushing play in an NFL regular-season game. The evaluation loops through a series of rushing plays; for each play we receive the position, velocity, orientation, and more for all 22 players on the field at the moment of the handoff to the rusher, along with many other features such as teams, stadium, and weather conditions. We use this information to predict the yards gained on the play as a cumulative distribution function.

Mathematically, this amounts to estimating the conditional distribution of $Y$ given $X$ from observed pairs $(Y, X)$, where $Y \in [-99, 99]$ is the number of yards gained and $X$ is a high-dimensional feature vector.
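
To make the target concrete: a prediction for one play is a vector of 199 values, $P(Y \le n)$ for $n = -99, \dots, 99$. The sketch below builds the step-function CDF of an observed outcome and scores a prediction against it with the continuous ranked probability score (CRPS), the metric this competition used. The names `YARD_GRID`, `outcome_cdf`, and `crps` are illustrative and not part of the competition API.

```python
import numpy as np

YARD_GRID = np.arange(-99, 100)  # the 199 yardage values, -99..99


def outcome_cdf(yards):
    """Step-function CDF of an observed outcome: 0 below `yards`, 1 from it on."""
    return (YARD_GRID >= yards).astype(float)


def crps(pred_cdfs, true_yards):
    """Mean squared gap between predicted CDFs and the observed step functions."""
    pred_cdfs = np.atleast_2d(pred_cdfs)
    truth = np.vstack([outcome_cdf(y) for y in np.atleast_1d(true_yards)])
    # np.mean averages over both plays and the 199 grid points.
    return float(np.mean((pred_cdfs - truth) ** 2))


# A perfect prediction for a 3-yard gain scores 0.
perfect_score = crps(outcome_cdf(3), 3)
```

Lower is better: every grid point where the predicted CDF disagrees with the observed step function adds to the score.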

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from kaggle.competitions import nflrush

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
env = nflrush.make_env()

In [None]:
train_df = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2020/train.csv', low_memory=False)
train_df.shape

In [None]:
train_df.info()

In [None]:
train_df.head()

In [None]:
sns.distplot(train_df.Yards)

- Benchmark: neural network
- Benchmark: kernel density estimate with a single covariate, the YardLine
- Benchmark: regression and other methods for estimating $E[Y \mid X]$
- Benchmark: kernel density estimate adding non-player factors
- Idea: find a projection (or something similar) to measure the distance between plays
- Idea: neural net
- The problem can also be framed as a multiclass classification with 199 classes
- It can also be seen as a regression whose output must be truncated
- Boosting looks difficult with the <4 requirement
- Metric: sum of the minimum unique pairwise distances for each player
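
For the multiclass framing in the notes above, the step from class probabilities to the required CDF is a cumulative sum. The following is a minimal sketch, assuming a hypothetical classifier that outputs one probability per yardage class from -99 to 99; `probs_to_cdf` is an illustrative name.

```python
import numpy as np


def probs_to_cdf(class_probs):
    """Turn probabilities over the 199 yardage classes (-99..99) into a CDF."""
    class_probs = np.asarray(class_probs, dtype=float)
    class_probs = class_probs / class_probs.sum()  # guard against unnormalised input
    cdf = np.cumsum(class_probs)
    return np.clip(cdf, 0.0, 1.0)  # absorb floating-point overshoot


# Hypothetical classifier output: all mass on a 4-yard gain (class index 4 + 99).
probs = np.zeros(199)
probs[4 + 99] = 1.0
cdf = probs_to_cdf(probs)
```

Because the probabilities are non-negative, the cumulative sum is non-decreasing by construction, which is what a valid CDF needs.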

In [None]:
sns.kdeplot(train_df.YardLine)

In [None]:
sns.kdeplot(train_df.Distance)

In [None]:
# Select Yards before aggregating so non-numeric columns are not involved.
sns.kdeplot(train_df.groupby('GameId').Yards.max())

In [None]:
# Select Yards before aggregating so non-numeric columns are not involved.
sns.kdeplot(train_df.groupby('GameId').Yards.min())

In [None]:
# Select Yards first: the median of the string columns is undefined.
sns.kdeplot(train_df.groupby('GameId').Yards.median())

In [None]:
# Select Yards first: the mean of the string columns is undefined.
sns.kdeplot(train_df.groupby('GameId').Yards.mean())

In [None]:
# FacetGrid creates its own figure, so no plt.figure() call is needed.
a1 = sns.FacetGrid(train_df, col="FieldPosition", row="PlayDirection")
a1 = a1.map(sns.distplot, 'Yards')

In [None]:
train_df.groupby('FieldPosition').PlayDirection.value_counts()

In [None]:
# FacetGrid creates its own figure, so no plt.figure() call is needed.
a1 = sns.FacetGrid(train_df, col="OffenseFormation", col_wrap=4)
a1 = a1.map(sns.distplot, 'Yards')

In [None]:
# One row per play; drop=True avoids keeping the old index as an extra column.
mini_df = train_df[['PlayId','YardLine','Yards','FieldPosition','PlayDirection']].drop_duplicates('PlayId').reset_index(drop=True)

In [None]:
mini_df.info()

In [None]:
mini_df.head()

In [None]:
endog = list(range(-99, 100))  # the 199 yardage values at which the CDF is evaluated

In [None]:
from statsmodels.nonparametric.kernel_density import KDEMultivariateConditional as KDE

In [None]:
gene_dens = KDE(endog = mini_df.Yards, exog = mini_df.YardLine, dep_type='c', indep_type='c', bw='normal_reference')
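
To show what a conditional kernel CDF computes, here is a small NumPy sketch with Gaussian kernels: each training play is weighted by how close its covariate is to the query value, and the weighted, smoothed indicators of $Y_i \le y$ are summed. As far as I understand, this is roughly the estimator that `KDEMultivariateConditional.cdf` evaluates, but the code is an illustration and not the statsmodels implementation. The function name and the bandwidths `h_y` and `h_x` are assumptions.

```python
import math

import numpy as np

# Standard normal CDF, applied elementwise.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))


def conditional_cdf(y_grid, x0, y_train, x_train, h_y, h_x):
    """Estimate P(Y <= y | X = x0) for each y in y_grid with Gaussian kernels."""
    y_train = np.asarray(y_train, dtype=float)
    x_train = np.asarray(x_train, dtype=float)
    # Weight each training play by how close its covariate is to x0.
    weights = np.exp(-0.5 * ((x_train - x0) / h_x) ** 2)
    weights = weights / weights.sum()
    # Smoothed indicator of Y_i <= y: one row per grid point, one column per play.
    z = (np.asarray(y_grid, dtype=float)[:, None] - y_train[None, :]) / h_y
    return _norm_cdf(z) @ weights


# Toy data: two plays near YardLine 10 and two near YardLine 50.
toy_cdf = conditional_cdf(
    y_grid=np.arange(-99, 100), x0=10.0,
    y_train=[0, 2, 4, 6], x_train=[10, 10, 50, 50],
    h_y=0.5, h_x=1.0,
)
```

In this sketch, if no training play lies close to `x0` relative to `h_x`, the weights can underflow to zero and the normalisation divides by zero, so the result is NaN.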

In [None]:
iter_test = env.iter_test()

In [None]:
def to_valid_cdf(values):
    # Clip to [0, 1] and force the last point, P(Yards <= 99), to 1.
    values = np.clip(np.asarray(values, dtype=float), 0, 1)
    values[-1] = 1
    return values

for (test_df, sample_prediction_df) in iter_test:
    # All rows of a play share the same play-level features, so keep one row.
    mini_test = test_df[['PlayId','YardLine','FieldPosition','PlayDirection']].drop_duplicates('PlayId')
    yard_line = mini_test.YardLine.iloc[0]
    field_position = mini_test.FieldPosition.iloc[0]
    play_direction = mini_test.PlayDirection.iloc[0]
    # Fit a conditional KDE on training plays with the same field position and
    # play direction, reusing the bandwidth estimated on the full training set.
    train_mini_df = mini_df[(mini_df.FieldPosition == field_position) & (mini_df.PlayDirection == play_direction)]
    dens = KDE(endog=train_mini_df.Yards, exog=train_mini_df.YardLine, dep_type='c', indep_type='c', bw=gene_dens.bw)
    pred_value = dens.cdf(endog, [yard_line] * 199)
    # Fall back to the model fitted on all plays if the subset model returns NaN.
    if np.isnan(pred_value).any():
        pred_value = gene_dens.cdf(endog, [yard_line] * 199)
    sample_prediction_df.iloc[0] = to_valid_cdf(pred_value)
    env.predict(sample_prediction_df)
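
The loop above clips the predicted values to $[0, 1]$ and pins the last one to 1, but that alone does not guarantee a non-decreasing vector. A possible extra safeguard, sketched here under the assumption that the input contains no NaN, is to take a running maximum; `sanitize_cdf` is an illustrative helper, not something the notebook uses.

```python
import numpy as np


def sanitize_cdf(values):
    """Return a valid CDF vector: within [0, 1], non-decreasing, ending at 1."""
    cdf = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    cdf = np.maximum.accumulate(cdf)  # running max removes any dips
    cdf[-1] = 1.0  # all probability mass lies at or below the last grid point
    return cdf


# A slightly invalid vector: below 0, a dip, above 1.
cleaned = sanitize_cdf([-0.1, 0.3, 0.2, 1.2, 0.9])
```

The running maximum only ever raises values, so it leaves an already valid CDF unchanged.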



In [None]:
env.write_submission_file()

In [None]:
import os
# List the CSV files written to the working directory (the submission file).
print([filename for filename in os.listdir('/kaggle/working') if filename.endswith('.csv')])