In [None]:
import numpy as np
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
sb.set_style('whitegrid')
sb.set(font_scale = 1.4)

In [None]:
#games = '../input/nfl-big-data-bowl-2022/games.csv'
players = '../input/nfl-big-data-bowl-2022/players.csv'
#plays = '../input/nfl-big-data-bowl-2022/plays.csv'
tracking_2018 = '../input/nfl-big-data-bowl-2022/tracking2018.csv'
#tracking_2019 = '../input/nfl-big-data-bowl-2022/tracking2019.csv'
#tracking_2020 = '../input/nfl-big-data-bowl-2022/tracking2020.csv'

## Towards ranking players 

This notebook focuses on the players and tracking data and will hopefully help anyone working on ranking players and their performance. Some new features are derived from the tracking data and merged into the players data frame. Finally, a few visualisations and some observations can be found towards the end.

Good luck to all 🏈



In [None]:
players_df = pd.read_csv(players)
players_df

'birthDate' is not useful as it is, so it is converted to each player's age in years. Also, the height feature is a string with inconsistent formatting (i.e. not always <feet>-<inches>). I am converting it to inches for now.
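
One thing to keep in mind is that an age measured against `pd.Timestamp.now()` drifts with the day the notebook is run. A minimal sketch of an alternative is to measure age on a fixed reference date. The helper name `age_on`, the toy birth dates and the reference date below are all assumptions made for illustration, not part of the dataset:

```python
import pandas as pd

def age_on(birth_dates: pd.Series, reference: pd.Timestamp) -> pd.Series:
    """Age in completed years on a fixed reference date (hypothetical helper)."""
    born = pd.to_datetime(birth_dates)
    years = reference.year - born.dt.year
    # Has the birthday already happened in the reference year?
    had_birthday = (born.dt.month < reference.month) | (
        (born.dt.month == reference.month) & (born.dt.day <= reference.day)
    )
    return years - (~had_birthday).astype(int)

# Toy birth dates and an arbitrary illustrative reference date.
toy_births = pd.Series(["1990-05-19", "1995-12-31"])
toy_ages = age_on(toy_births, pd.Timestamp("2018-09-06"))
print(toy_ages.tolist())
```

With a fixed reference date the derived ages stay the same no matter when the notebook is re-run.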

In [None]:
# convert birthday to years old
players_df['birthDate'] = pd.to_datetime(players_df['birthDate'], infer_datetime_format = True) 
players_df['birthDate'] = np.round((pd.Timestamp.now() - players_df['birthDate']).dt.days/365)
players_df = players_df.rename(columns={'birthDate': 'age'})

# fix and convert 'height' to a number (inches)
def height_to_inches(h):
    h = str(h)
    if '-' in h:
        # '<feet>-<inches>' format, e.g. '6-2' or '5-11'
        feet, inches = h.split('-')
        return int(feet)*12 + int(inches)
    # assumption: a value without '-' is already the total height in inches
    return int(h)

players_df['height'] = players_df['height'].apply(height_to_inches)

In [None]:
players_df

Now, let's take a look at the tracking data for 2018:

In [None]:
tracking_2018_df = pd.read_csv(tracking_2018)
tracking_2018_df

It appears that the tracker data is sampled once per second (an assumption that the row counts below rely on); from that we can derive how much each player played. Follow the code below for a step-by-step exploration of the data:
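
Since everything that follows treats one tracking row as one second, it may be worth checking the sampling interval directly from the `time` column. The sketch below does this on a made-up frame; the helper `median_frame_interval` and the toy timestamps are illustrative assumptions and say nothing about the real interval in `tracking2018.csv`:

```python
import pandas as pd

def median_frame_interval(tracking: pd.DataFrame) -> pd.Timedelta:
    """Median gap between consecutive frames within a play (hypothetical helper)."""
    frames = (
        tracking.assign(time=pd.to_datetime(tracking["time"]))
        # every tracked player shares the same timestamps, so keep one row per frame
        .drop_duplicates(["gameId", "playId", "time"])
        .sort_values(["gameId", "playId", "time"])
    )
    gaps = frames.groupby(["gameId", "playId"])["time"].diff().dropna()
    return gaps.median()

# Toy data: two players, three frames of a single play.
toy_tracking = pd.DataFrame({
    "gameId": [1] * 6,
    "playId": [10] * 6,
    "nflId": [100, 200, 100, 200, 100, 200],
    "time": [
        "2018-09-07T01:07:14.600", "2018-09-07T01:07:14.600",
        "2018-09-07T01:07:14.700", "2018-09-07T01:07:14.700",
        "2018-09-07T01:07:14.800", "2018-09-07T01:07:14.800",
    ],
})
interval = median_frame_interval(toy_tracking)
print(interval)
```

If the interval on the real data turns out not to be one second, the row counts need to be scaled by it before being read as seconds or minutes.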

In [None]:
df1 = tracking_2018_df.groupby(["nflId",'gameId'])
# how many seconds did each player play in each game?
ans = df1['time'].count().to_frame()
ans

In order to create a feature that can be merged into the 'players' data frame, we aggregate the above:
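
As a compact illustration of this aggregation, the sketch below derives both per-player features in one pass with named aggregation. The toy frame and the column names `rows`, `rowsTracked`, `gamesPlayed` and `minutesPlayed` are assumptions for the example, and the conversion to minutes keeps the notebook's one-row-per-second assumption:

```python
import pandas as pd

# Toy tracking rows: player 1 appears in two games, player 2 in one.
toy_tracking = pd.DataFrame({
    "nflId":  [1, 1, 1, 1, 2, 2],
    "gameId": [10, 10, 11, 11, 10, 10],
    "time":   ["t0", "t1", "t0", "t1", "t0", "t1"],
})

# Rows tracked per player per game.
rows_per_game = (
    toy_tracking.groupby(["nflId", "gameId"])["time"]
    .count()
    .rename("rows")
    .reset_index()
)

# One row per player: number of games and total tracked rows.
playing_time = rows_per_game.groupby("nflId").agg(
    gamesPlayed=("gameId", "nunique"),
    rowsTracked=("rows", "sum"),
)
# Assumes one row per second, as in the rest of the notebook.
playing_time["minutesPlayed"] = (playing_time["rowsTracked"] / 60).round(1)
print(playing_time)
```

The resulting frame is indexed by `nflId`, so it can be merged into the players data frame in a single step.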

In [None]:
# In how many games did each player play in total (in a given year)?
ans.groupby('nflId').count()

In [None]:
# merge with players_df
players_df = pd.merge(players_df, ans.groupby('nflId').count(), on=['nflId'])
players_df = players_df.rename(columns={'time': 'gamesPlayed2018'})

In [None]:
# how many minutes did each player play in total (in a given year)?
np.round(ans.groupby('nflId').sum()/60, decimals=1)

In [None]:
# merge with players_df
players_df = pd.merge(players_df, np.round(ans.groupby('nflId').sum()/60, decimals=1), on=['nflId'])
players_df = players_df.rename(columns={'time': 'minutesPlayed2018'})

In [None]:
# which plays did he play in each game?  -not using this for now..
df1['playId'].unique().to_frame()

Next, we will extract some useful features regarding the performance of the players in the game. We assume that: 

 * The average speed of a player is a good indicator of how much distance he covers during his play time, i.e. how mobile he is overall.
 * The 95th percentile of a player's acceleration is a good indicator of his explosiveness and agility on the field - we don't want to consider only the max() value, to avoid being influenced by a single extreme outlier measurement. Nevertheless, we are interested only in the tail of the distribution (q = 0.95), as it is sensible that explosive moves (= high acceleration) will occur over a small fraction of the tracked period.
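
To illustrate why a high quantile is preferred over max(), here is a small sketch on made-up acceleration values containing a single spurious spike. The numbers are invented for the example and are not taken from the tracking data:

```python
import numpy as np

# 99 plausible-looking acceleration samples plus one extreme outlier (all made up).
toy_acc = np.concatenate([np.linspace(0.0, 5.0, 99), [40.0]])

peak = toy_acc.max()               # dominated by the single outlier
q95 = np.quantile(toy_acc, 0.95)   # stays within the bulk of the samples

print(f"max = {peak:.1f}, 95th percentile = {q95:.2f}")
```

The maximum reports the outlier itself, while the 95th percentile stays within the range of the regular samples.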

In [None]:
# What was the average speed (yd/s) of each player in each game?
df1[['s']].mean()

We can see that, more or less, the average speed of the players stays the same in each game.
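
One way to put a number on this observation is the standard deviation of each player's per-game average speed. The sketch below runs on invented values; the toy frame only reuses the `nflId`, `gameId` and `s` column names:

```python
import pandas as pd

# Toy speeds: player 1 is steady across games, player 2 is not.
toy_tracking = pd.DataFrame({
    "nflId":  [1, 1, 1, 1, 2, 2, 2, 2],
    "gameId": [10, 10, 11, 11, 10, 10, 11, 11],
    "s":      [2.0, 4.0, 3.0, 3.0, 1.0, 1.0, 5.0, 5.0],
})

# Average speed per player per game, then its spread across games.
per_game_mean = toy_tracking.groupby(["nflId", "gameId"])["s"].mean()
spread = per_game_mean.groupby(level="nflId").std()
print(spread)
```

A spread that is small relative to the mean supports averaging the per-game values into a single feature per player.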

In [None]:
# What was the average speed (yd/s) of each player in 2018?
df1[['s']].mean().groupby('nflId').mean()

In [None]:
# merge with players_df
players_df = pd.merge(players_df, df1[['s']].mean().groupby('nflId').mean(), on=['nflId'])
players_df = players_df.rename(columns={'s': 'averageSpeed2018'})

In [None]:
# What was the top (95th percentile) speed (yd/s) and acceleration (yd/s^2) of each player in each game in 2018?
df1[['s','a']].quantile(q=0.95)

In [None]:
# What was the top (95th percentile) speed (yd/s) and acceleration (yd/s^2) of each player in 2018, averaged over games?
df1[['s','a']].quantile(q=0.95).groupby('nflId').mean()

In [None]:
# merge with players_df
players_df = pd.merge(players_df, df1[['s']].quantile(q=0.95).groupby('nflId').mean(), on=['nflId'])
players_df = players_df.rename(columns={'s': 'averageTopSpeed2018'})

players_df = pd.merge(players_df, df1[['a']].quantile(q=0.95).groupby('nflId').mean(), on=['nflId'])
players_df = players_df.rename(columns={'a': 'averageTopAcc2018'})

We see that some of the top speeds are impressive (1 yard ≈ 0.91 m). For reference, Usain Bolt reached an astounding top speed of about 12.4 meters per second (44.72 km/h) during his world-record 100 m run.
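
For such comparisons the tracked speeds have to be converted from yards per second to meters per second. A minimal sketch, where the function name and the sample value are illustrative:

```python
YARD_IN_METERS = 0.9144  # exact by definition of the international yard

def yards_per_sec_to_mps(speed_yd_s: float) -> float:
    """Convert a speed from yards per second to meters per second."""
    return speed_yd_s * YARD_IN_METERS

# Example with a made-up speed of 10 yd/s.
example_mps = yards_per_sec_to_mps(10.0)
print(f"10 yd/s = {example_mps:.3f} m/s")
```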

In [None]:
players_df

...and now some visualizations - feel free to mix and match features to discover new insights!

In [None]:
sb.relplot(x="weight", y="height", size="age",
            sizes=(50, 1000), alpha=0.4, palette="muted",
            height=12, data=players_df)

* At least 4 body-type clusters can be identified. Players have different roles in the course of the game and their physical qualities are optimised for them. There seems to be a distinct cluster of short/light players with well-defined height/weight requirements. Most of the players are found in the central area, and we can see a distinct group of taller players distributed over a large range of weights.
* Age seems to be distributed over all body types.

In [None]:
sb.relplot(x="gamesPlayed2018", y="minutesPlayed2018", size="weight",
            sizes=(50, 2000), alpha=0.4, palette="muted",
            height=12, data=players_df)

sb.relplot(x="minutesPlayed2018", y="weight", size="age",
            sizes=(50, 3060), alpha=0.3, palette="muted",
            height=12, data=players_df)

sb.relplot(x="minutesPlayed2018", y="height", size="age",
            sizes=(50, 3060), alpha=0.3, palette="muted",
            height=12, data=players_df)

The first figure might seem obvious, since the total time spent in the game should be proportional to the number of games played, but I had to make sure that it is.
* The total time spent in game grows linearly with the number of games played, as expected.
* Heavier players spend less time in the game, but weight does not limit how many games they play.
* Taller players also tend to spend less time on the field.

In [None]:
sb.relplot(x="averageTopSpeed2018", y="averageTopAcc2018", size="weight",
            sizes=(50, 300), alpha=1, palette="muted",
            height=12, data=players_df)

* The most agile and explosive players also tend to be the fastest players on the field.
* The majority of players seem to be concentrated in two regions. Players tend to be (relatively) either fast and explosive, or slow. There are not a lot of players in between those two clusters.
* As expected, heavy players tend to be on the slow end.
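
These visual impressions can also be checked numerically with a correlation matrix. The sketch below uses a handful of invented players, so the values only demonstrate the call and are not results from the real data:

```python
import pandas as pd

# Toy players (made-up numbers): lighter players are faster and more explosive.
toy_players = pd.DataFrame({
    "averageTopSpeed2018": [5.0, 6.0, 8.0, 9.0],
    "averageTopAcc2018":   [2.0, 2.5, 3.5, 4.0],
    "weight":              [320, 300, 210, 190],
})

# Pearson correlation between every pair of columns.
corr = toy_players.corr()
print(corr.round(2))
```

On the real `players_df`, a positive speed/acceleration entry and a negative speed/weight entry would back up the observations above.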