<font color='#962205'>Analysis objective</font>

The objective of this analysis is to observe and highlight injury risks associated with the field surface.

<font color='#962205'>Analyzed groups</font>

Four groups of plays are compared in this analysis.
The groups are analyzed side by side to capture differences in play by field surface type (Natural, Synthetic) and by injury status; the analyzed groups are listed below:

(1a) Injured on Natural turf | (1b) Injured on Synthetic turf

(2a) Non Injured - Natural turf | (2b) Non Injured - Synthetic turf


Aggregated metrics are computed at play level, so influences tied to particular players, game styles or individual performance should not bias the analysis or its conclusions.


In [None]:
# load necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from random import sample
import gc

import seaborn as sns
import matplotlib.patches as patches
sns.set_style("whitegrid")

In [None]:
play_df = pd.read_csv("../input/nfl-playing-surface-analytics/PlayList.csv")
track_df = pd.read_csv("../input/nfl-playing-surface-analytics/PlayerTrackData.csv")
injury_df = pd.read_csv("../input/nfl-playing-surface-analytics/InjuryRecord.csv")

> Preprocessing on play data:

In [None]:
# recodes PLAY data (stadium and weather)
play_df['Stadium_type'] = 'Indoor'
play_df.loc[play_df['StadiumType'].isin(['Outdoors','Oudoor','Open','Outdoor',
                                         'Outdoor Retr Roof-Open','Ourdoor','Bowl','Outddors',
                                        'Retr. Roof-Open','Domed, Open','Domed, open','Heinz Field',
                                        'Cloudy','Retr. Roof - Open','Outdor','Outside']),'Stadium_type'] = 'Outdoor'

play_df.loc[play_df['Weather'].isin(['Controlled Climate', 'Indoor', 'Indoors']), 'weather_type'] = 'Controlled climate'
play_df.loc[play_df['Weather'].isin(['Sunny', 'Mostly sunny', 'Mostly Sunny', 'Fair', 'Clear', 'Clear Skies', 'Clear and warm', 'Clear skies']), 'weather_type'] = 'Sunny'
play_df.loc[play_df['Weather'].isin(['Mostly cloudy', 'Partly Cloudy', 'Sun & clouds', 'Coudy', 'Cloudy', 'Cloudy and Cool']), 'weather_type'] = 'Cloudy'
play_df.loc[play_df['Weather'].isin(['Light Rain', 'Rain', 'Rain shower','Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.', 'Cloudy, 50% change of rain', 'Cold']), 'weather_type'] = 'Rain'
play_df.loc[play_df['Weather'].isna(),'weather_type'] = 'NA'
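
The weather recoding above can also be expressed as a lookup table. The sketch below runs on a small made-up sample; `WEATHER_MAP` and `recode_weather` are hypothetical names, and only a few of the raw labels are listed.

```python
import pandas as pd

# Hypothetical lookup: raw label -> weather group (only a few labels shown).
WEATHER_MAP = {
    "Indoor": "Controlled climate",
    "Indoors": "Controlled climate",
    "Sunny": "Sunny",
    "Clear skies": "Sunny",
    "Coudy": "Cloudy",  # misspelling kept as it appears in the raw data
    "Light Rain": "Rain",
}

def recode_weather(raw: pd.Series) -> pd.Series:
    """Map raw labels to groups; missing values become 'NA', unlisted labels stay NaN."""
    grouped = raw.map(WEATHER_MAP)
    return grouped.mask(raw.isna(), "NA")

toy = pd.Series(["Sunny", "Coudy", None, "Snow"])
result = recode_weather(toy)
print(result.tolist())
```

As in the cell above, labels that are not listed in any group are left unassigned rather than forced into a category.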


In [None]:
# add play info to injury dataset
play_and_inj = play_df.merge(injury_df, how='inner', on='PlayKey')

In [None]:
# cross table - percent function
def t_percent(df=injury_df,row='BodyPart', uniq_count='PlayerKey', cross=False, cross_var='Surface'):
    t=(df.groupby(row)[row].count() / df.count()[cross_var]) \
        .sort_values()
    if (cross==True):
        t=pd.crosstab(df[row],df[cross_var]) / df.groupby(cross_var)[cross_var].count() #df.count()[cross_var]
    return(t)
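
With `cross=True`, `t_percent` returns the share of each row category within every column of the cross variable. A minimal sketch on made-up data, assuming neither column has missing values, shows that `pd.crosstab(..., normalize='columns')` should give the same column shares:

```python
import pandas as pd

# Made-up injury records, for illustration only.
toy = pd.DataFrame({
    "BodyPart": ["Knee", "Ankle", "Knee", "Foot"],
    "Surface": ["Natural", "Natural", "Synthetic", "Synthetic"],
})

# Share of each body part within each surface (every column sums to 1).
shares = pd.crosstab(toy["BodyPart"], toy["Surface"], normalize="columns")
print(shares)
```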

Select the list of plays for each analyzed group:

In [None]:
################ natural vs synthetic injured groups ###############################################
inj_syn_list = injury_df.loc[(injury_df['Surface']=='Synthetic') & (injury_df['PlayKey'].notna()),'PlayKey'].tolist()
inj_nat_list = injury_df.loc[(injury_df['Surface']=='Natural') & (injury_df['PlayKey'].notna()),'PlayKey'].tolist()
inj_list = injury_df.loc[(injury_df['PlayKey'].notna()),'PlayKey'].tolist()
# sample 2000 play keys with play info which are non-injured
not_inj_list = play_df.loc[(play_df['PlayKey'].notna()) & (~play_df['PlayKey'].isin(inj_list)),'PlayKey'].tolist()
not_inj_list_samp = sample(not_inj_list,2000)

# no play number
# print(injury_df.loc[injury_df['PlayKey'].isna(),:])

# get track data for specific plays: injured and non-injured sample #############################
track_inj = track_df.query('PlayKey in @inj_list').copy()  # copy, so the column assignment below does not act on a slice
track_inj.loc[:,'dir_vs_o'] = track_inj['dir'] / track_inj['o']

track_non_inj = track_df.query('PlayKey in @not_inj_list_samp').copy()
track_non_inj.loc[:,'dir_vs_o'] = track_non_inj['dir'] / track_non_inj['o']
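
The non-injured sample drawn with `sample(not_inj_list, 2000)` changes on every run, so the figures reported further down can shift slightly between runs. A minimal sketch of a reproducible draw follows; `seeded_sample` and the seed value are assumptions, not part of the original notebook.

```python
from random import Random

def seeded_sample(keys, k, seed=42):
    """Draw k keys without replacement using a private, seeded generator."""
    return Random(seed).sample(list(keys), k)

play_keys = [f"play-{i}" for i in range(100)]  # made-up play keys
first = seeded_sample(play_keys, 5)
second = seeded_sample(play_keys, 5)
print(first == second)
```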

Calculate movement metrics on track data:

In [None]:
gc.collect()
#################### track selection
trk_syn = track_inj.query('PlayKey in @inj_syn_list').copy()
trk_nat = track_inj.query('PlayKey in @inj_nat_list').copy()

In [None]:
# calculate distance and velocity based on x, y coordinates
trk_nat.loc[:,'dist'] = np.sqrt((trk_nat['x'] - trk_nat['x'].shift(1)) ** 2 + (trk_nat['y'] - trk_nat['y'].shift(1)) ** 2)
trk_nat.loc[trk_nat.time==0,'dist'] = 0
trk_nat.loc[:,'velocity'] = trk_nat['dist'] / 0.1

trk_syn.loc[:,'dist'] = np.sqrt((trk_syn['x'] - trk_syn['x'].shift(1)) ** 2 + (trk_syn['y'] - trk_syn['y'].shift(1)) ** 2)
trk_syn.loc[trk_syn.time==0,'dist'] = 0
trk_syn.loc[:,'velocity'] = trk_syn['dist'] / 0.1

track_non_inj.loc[:,'dist'] = np.sqrt((track_non_inj['x'] - track_non_inj['x'].shift(1)) ** 2 + (track_non_inj['y'] - track_non_inj['y'].shift(1)) ** 2)
track_non_inj.loc[track_non_inj.time==0,'dist'] = 0
track_non_inj.loc[:,'velocity'] = track_non_inj['dist'] / 0.1
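
The distance calculation above uses `shift(1)` over the whole frame and then resets the rows where `time == 0`. An alternative sketch, on made-up coordinates, takes the differences within each play so that the first row of a play never borrows from the previous play. The 0.1 s step is the one used in the cell above.

```python
import numpy as np
import pandas as pd

# Two made-up plays sampled every 0.1 s.
toy = pd.DataFrame({
    "PlayKey": ["A", "A", "A", "B", "B"],
    "time": [0.0, 0.1, 0.2, 0.0, 0.1],
    "x": [0.0, 3.0, 3.0, 10.0, 10.0],
    "y": [0.0, 4.0, 4.0, 10.0, 11.0],
})

# Per-play step in x and y; the first row of each play has no predecessor (NaN).
step = toy.groupby("PlayKey")[["x", "y"]].diff()
toy["dist"] = np.sqrt(step["x"] ** 2 + step["y"] ** 2).fillna(0.0)
toy["velocity"] = toy["dist"] / 0.1
print(toy)
```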

In [None]:
# Calculate acceleration based on velocity
track_non_inj.loc[:,'a'] = (track_non_inj['velocity'] - track_non_inj['velocity'].shift(1)) / (track_non_inj['time'] - track_non_inj['time'].shift(1))
track_non_inj.loc[track_non_inj.time==0,'a'] = 0
track_non_inj.loc[track_non_inj['a']>0,'acc']=1
track_non_inj.loc[track_non_inj['a']<0,'dcc']=1

trk_syn.loc[:,'a'] = (trk_syn['velocity'] - trk_syn['velocity'].shift(1)) / (trk_syn['time'] - trk_syn['time'].shift(1))
trk_syn.loc[trk_syn.time==0,'a'] = 0
trk_syn.loc[trk_syn['a']>0,'acc']=1
trk_syn.loc[trk_syn['a']<0,'dcc']=1

trk_nat.loc[:,'a'] = (trk_nat['velocity'] - trk_nat['velocity'].shift(1)) / (trk_nat['time'] - trk_nat['time'].shift(1))
trk_nat.loc[trk_nat.time==0,'a'] = 0
trk_nat.loc[trk_nat['a']>0,'acc']=1
trk_nat.loc[trk_nat['a']<0,'dcc']=1

# union all track dfs (DataFrame.append was removed in pandas 2.0)
trk_all = pd.concat([track_non_inj, trk_nat, trk_syn])

# calculate direction shifts
trk_all.loc[:,'dir_shift'] = (trk_all['dir'] - trk_all['dir'].shift(1))
trk_all.loc[trk_all.time==0,'dir_shift'] = 0
trk_all.loc[(trk_all['dir_shift']>0) & (trk_all['dir_shift']<=30),'dir_shift_deg']='+[0-30]'
trk_all.loc[(trk_all['dir_shift']>30) & (trk_all['dir_shift']<=60),'dir_shift_deg']='+[30-60]'
trk_all.loc[(trk_all['dir_shift']>60) & (trk_all['dir_shift']<=90),'dir_shift_deg']='+[60-90]'
trk_all.loc[(trk_all['dir_shift']>90) & (trk_all['dir_shift']<=120),'dir_shift_deg']='+[90-120]'
trk_all.loc[(trk_all['dir_shift']>120) & (trk_all['dir_shift']<=150),'dir_shift_deg']='+[120-150]'
trk_all.loc[(trk_all['dir_shift']>150) & (trk_all['dir_shift']<=180),'dir_shift_deg']='+[150-180]'
trk_all.loc[(trk_all['dir_shift']>180) & (trk_all['dir_shift']<=210),'dir_shift_deg']='+[180-210]'
trk_all.loc[(trk_all['dir_shift']>210) & (trk_all['dir_shift']<=240),'dir_shift_deg']='+[210-240]'
trk_all.loc[(trk_all['dir_shift']>240) & (trk_all['dir_shift']<=270),'dir_shift_deg']='+[240-270]'
trk_all.loc[(trk_all['dir_shift']>270) & (trk_all['dir_shift']<=300),'dir_shift_deg']='+[270-300]'
trk_all.loc[(trk_all['dir_shift']>300) & (trk_all['dir_shift']<=330),'dir_shift_deg']='+[300-330]'
trk_all.loc[(trk_all['dir_shift']>330) & (trk_all['dir_shift']<=360),'dir_shift_deg']='+[330-360]'

trk_all.loc[(trk_all['dir_shift']<0) & (trk_all['dir_shift']>=-30),'dir_shift_deg']='-[0-30]'
trk_all.loc[(trk_all['dir_shift']<-30) & (trk_all['dir_shift']>=-60),'dir_shift_deg']='-[30-60]'
trk_all.loc[(trk_all['dir_shift']<-60) & (trk_all['dir_shift']>=-90),'dir_shift_deg']='-[60-90]'
trk_all.loc[(trk_all['dir_shift']<-90) & (trk_all['dir_shift']>=-120),'dir_shift_deg']='-[90-120]'
trk_all.loc[(trk_all['dir_shift']<-120) & (trk_all['dir_shift']>=-150),'dir_shift_deg']='-[120-150]'
trk_all.loc[(trk_all['dir_shift']<-150) & (trk_all['dir_shift']>=-180),'dir_shift_deg']='-[150-180]'
trk_all.loc[(trk_all['dir_shift']<-180) & (trk_all['dir_shift']>=-210),'dir_shift_deg']='-[180-210]'
trk_all.loc[(trk_all['dir_shift']<-210) & (trk_all['dir_shift']>=-240),'dir_shift_deg']='-[210-240]'
trk_all.loc[(trk_all['dir_shift']<-240) & (trk_all['dir_shift']>=-270),'dir_shift_deg']='-[240-270]'
trk_all.loc[(trk_all['dir_shift']<-270) & (trk_all['dir_shift']>=-300),'dir_shift_deg']='-[270-300]'
trk_all.loc[(trk_all['dir_shift']<-300) & (trk_all['dir_shift']>=-330),'dir_shift_deg']='-[300-330]'
trk_all.loc[(trk_all['dir_shift']<-330) & (trk_all['dir_shift']>=-360),'dir_shift_deg']='-[330-360]'


# Play level bins for direction shifts (by 30 degrees, + and -)
trk_all['dir_shift'].describe()
trk_all['dir_shift_deg'].value_counts()
trk_all['dir_shift_deg'].value_counts() / trk_all['dir_shift_deg'].notna().sum()

tst_dirshift = pd.crosstab(trk_all['PlayKey'],trk_all['dir_shift_deg']) #.count() #/ trk_syn.groupby('PlayKey')['PlayKey'].notna().sum()
tst_dirshift.loc[:,'total_counts_deg'] = tst_dirshift.sum(axis=1)

tst_dirshift_perc = tst_dirshift.iloc[:,0:-1].div(tst_dirshift.total_counts_deg, axis=0)
#print(tst_dirshift_perc.head())
#print(tst_dirshift_perc.describe())
#print(tst_dirshift_perc.columns)

tst_dirshift_counts = tst_dirshift.iloc[:,0:-1]
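
The raw difference `dir - dir.shift(1)` treats a turn from 359° to 1° as a shift of -358°, which lands in the outermost bins. Assuming `dir` is an angle in degrees between 0 and 360, the sketch below wraps the difference into the smallest signed rotation. `wrapped_dir_shift` is a hypothetical helper and is not used in this notebook.

```python
import numpy as np

def wrapped_dir_shift(prev_deg, curr_deg):
    """Signed smallest rotation from prev_deg to curr_deg, in [-180, 180)."""
    raw = np.asarray(curr_deg, dtype=float) - np.asarray(prev_deg, dtype=float)
    return (raw + 180.0) % 360.0 - 180.0

shifts = wrapped_dir_shift([359.0, 10.0, 90.0], [1.0, 350.0, 120.0])
print(shifts)
```

With this wrapping only the bins between -180 and +180 would be populated, so the shares reported below would not be directly comparable.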

Compute metrics on play level:

In [None]:
# synthetic injury - compute metrics on play level
trk_syn_agg = trk_syn.groupby('PlayKey').agg({'time': 'max',
                        'dis': 'sum','dist': 'sum', 'event': 'count', 'dir_vs_o':'mean', 's':'mean',
                                             'x':['min', 'max','mean'], 'y':['min', 'max','mean'],
                                             'acc': 'count', 'dcc': 'count'})
trk_syn_agg.loc[:,'type'] = 'Synthetic'

# natural injury - compute metrics on play level
trk_nat_agg = trk_nat.groupby('PlayKey').agg({'time': 'max',
                        'dis': 'sum','dist': 'sum', 'event': 'count', 'dir_vs_o':'mean', 's':'mean',
                                             'x':['min', 'max','mean'], 'y':['min', 'max','mean'],
                                             'acc': 'count', 'dcc': 'count'})
trk_nat_agg.loc[:,'type'] = 'Natural'

# non injured random sample (2,000 plays) - compute metrics on play level
trk_non_inj_agg = track_non_inj.groupby('PlayKey').agg({'time': 'max',
                        'dis': 'sum','dist': 'sum', 'event': 'count', 'dir_vs_o':'mean', 's':'mean',
                                                       'x':['min', 'max','mean'], 'y':['min', 'max','mean'],
                                                       'acc': 'count', 'dcc': 'count'})
trk_non_inj_agg.loc[:,'type'] = 'Non-Injured'


trk_inj_agg = pd.concat([trk_syn_agg, trk_nat_agg])
trk_inj_ninj_agg = pd.concat([trk_inj_agg, trk_non_inj_agg])
#print("++++++++++++++++")
new_cols = [''.join(tups) for tups in  trk_inj_ninj_agg.columns]
#print(new_cols)
trk_inj_ninj_agg.columns = new_cols
#print(trk_inj_ninj_agg.columns)

# add dir shift bins
trk_inj_ninj_agg_1 = trk_inj_ninj_agg.merge(tst_dirshift_perc, how='inner', on='PlayKey')

#print("++++++++++++++++")
#print(trk_inj_ninj_agg_1.head())

# add play info
df_fin = trk_inj_ninj_agg_1.merge(play_df, how='inner', on='PlayKey')
df_fin['inj_noninj'] = df_fin['type']
df_fin.loc[df_fin['type'].isin(['Natural','Synthetic']),'inj_noninj'] = 'Injured'

# injured and non injured by turf type
df_fin.loc[(df_fin['FieldType']=='Natural') &
                    (df_fin['type']=='Non-Injured'),'type_turf'] = 'Non Injured on Natural turf'    
df_fin.loc[(df_fin['FieldType']=='Synthetic') &
                    (df_fin['type']=='Non-Injured'),'type_turf'] = 'Non Injured on Synthetic turf'    
df_fin.loc[df_fin['type']=='Natural','type_turf'] = 'Injured on Natural turf'    
df_fin.loc[df_fin['type']=='Synthetic','type_turf'] = 'Injured on Synthetic turf'


# stadium & weather tables - by injury/non-injury and field type 
stad_t = t_percent(df=df_fin,row='Stadium_type', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()
weat_t = t_percent(df=df_fin,row='weather_type', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()
stad_t2 = t_percent(df=df_fin,row='Stadium_type', uniq_count='PlayKey',cross=True, cross_var='inj_noninj').transpose()
weat_t2 = t_percent(df=df_fin,row='weather_type', uniq_count='PlayKey',cross=True, cross_var='inj_noninj').transpose()


In [None]:
trk_all_type = trk_all.merge(df_fin[['PlayKey','type_turf']], on='PlayKey')
trk_cnt_type = tst_dirshift_counts.merge(df_fin[['PlayKey','type_turf']], on='PlayKey')

In [None]:
degr_cnt = trk_cnt_type.groupby('type_turf')[['-[0-30]','-[30-60]','-[60-90]','-[90-120]','-[120-150]','-[150-180]','-[180-210]','-[210-240]',
               '-[240-270]','-[270-300]','-[300-330]','-[330-360]',
                                        '+[0-30]','+[30-60]','+[60-90]','+[90-120]','+[120-150]','+[150-180]','+[180-210]','+[210-240]',
               '+[240-270]','+[270-300]','+[300-330]','+[330-360]']].sum()

degr_cnt.loc[:,'total_counts_deg'] = degr_cnt.sum(axis=1)
degr_perc = degr_cnt.iloc[:,0:-1].div(degr_cnt.total_counts_deg, axis=0)


Visualize direction shifts and velocity for injured plays.

In [None]:
lst_pos_degr = ['+[0-30]','+[30-60]','+[60-90]','+[90-120]','+[120-150]','+[150-180]','+[180-210]','+[210-240]',
               '+[240-270]','+[270-300]','+[300-330]','+[330-360]']
lst_neg_degr = ['-[0-30]','-[30-60]','-[60-90]','-[90-120]','-[120-150]','-[150-180]','-[180-210]','-[210-240]',
               '-[240-270]','-[270-300]','-[300-330]','-[330-360]']

fig, ax = plt.subplots(figsize=(10,8))
ax = sns.scatterplot(x='dir_shift',y='a', hue='type_turf'
                    ,data=trk_all_type.loc[trk_all_type['type_turf'].isin(['Injured on Synthetic turf','Injured on Natural turf']),:],
                       ax=ax) 
ax.set(ylabel='Acceleration', xlabel='Direction shifts')
ax.set_xticks(np.arange(-360,360,step=60))

In [None]:
# direction shifts intervals
radar_agg = df_fin.groupby('type_turf')[lst_pos_degr].mean()
radar_neg_agg = df_fin.groupby('type_turf')[lst_neg_degr].mean()
dat_1a = pd.DataFrame(data={'varx':radar_agg.loc[radar_agg.index=='Injured on Natural turf',:].values[0],
                           'vary':radar_neg_agg.loc[radar_neg_agg.index=='Injured on Natural turf',:].values[0],
                           'labels':lst_pos_degr})

radar_degree = radar_agg.merge(radar_neg_agg, on = radar_agg.index)
fig, ax = plt.subplots(figsize=(10,8)) 
ax = sns.scatterplot(x='+[0-30]',y='-[0-30]', data=radar_degree, 
                     hue='key_0', size='key_0')
ax.set(xlabel='+[0-30] Direction shifts', ylabel='-[0-30] Direction shifts')
for line in range(0,radar_degree.shape[0]):
     fig.text(radar_degree['+[0-30]'][line]+0.2, radar_degree['-[0-30]'][line], radar_degree.key_0[line],
             horizontalalignment='left', size='medium', color='black', weight='semibold')

Create intervals for analyzed indicators and other metrics:

In [None]:
# temperature
df_fin.loc[df_fin['Temperature']>0,'Temp'] = df_fin['Temperature']

df_fin['Temp_groups']=pd.qcut(df_fin['Temp'], q=5)
df_fin.groupby('Temp_groups')['Temp_groups'].count()

# direction vs orientation
df_fin['dir_o_groups']=pd.qcut(df_fin['dir_vs_omean'], q=6)
df_fin.groupby('dir_o_groups')['dir_o_groups'].count()

# distance
df_fin['dist_groups']=pd.qcut(df_fin['distsum'], q=6)
df_fin.groupby('dist_groups')['dist_groups'].count()

#time
df_fin['time_groups']=pd.qcut(df_fin['timemax'], q=6)

# acceleration
df_fin['acc_groups']=pd.qcut(df_fin['acccount'], q=6)

# deceleration
df_fin['dcc_groups']=pd.qcut(df_fin['dcccount'], q=6)

# number of events
df_fin['event_groups']=pd.qcut(df_fin['eventcount'], q=6)

# area 
df_fin['area'] = (df_fin['xmax'] - df_fin['xmin']) * (df_fin['ymax'] - df_fin['ymin'])

# area vs distance ratio
df_fin.loc[:,'area_dist'] = df_fin['area'] / df_fin['distsum']
df_fin['area_dist_groups']=pd.qcut(df_fin['area_dist'], 5)

df_fin['area_groups']=pd.qcut(df_fin['area'], 5)
#plt.hist(df_fin['area'], bins=10, range=[0,150])
#print(df_fin.loc[25:35,['xmax','xmin','ymax','ymin','area', 'area_groups']])

#print(df_fin.dtypes)
temp_t = t_percent(df=df_fin,row='Temp_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

diro_t = t_percent(df=df_fin,row='dir_o_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

area_t = t_percent(df=df_fin,row='area_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

dist_t = t_percent(df=df_fin,row='dist_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

time_t = t_percent(df=df_fin,row='time_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

acc_t = t_percent(df=df_fin,row='acc_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

dcc_t = t_percent(df=df_fin,row='dcc_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

event_t = t_percent(df=df_fin,row='event_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()

area_dist_t = t_percent(df=df_fin,row='area_dist_groups', uniq_count='PlayKey',cross=True, cross_var='type_turf').transpose()
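
The intervals above come from `pd.qcut`, which cuts at sample quantiles, so each group holds roughly the same number of plays and the interval edges depend on the data. A minimal sketch on made-up values:

```python
import pandas as pd

values = pd.Series(range(1, 13), dtype=float)  # 12 made-up measurements
groups = pd.qcut(values, q=4)                  # four quantile-based intervals
counts = groups.value_counts().sort_index()
print(counts)
```

Because the edges are estimated from the sampled plays, a different non-injured sample can move the thresholds quoted in the comments below.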


In [None]:
# compare two categories with a lollipop chart
def lolipop(df,v1,v2,tity, tit,c1,c2):
    # Reorder it following the values of the first value:
    ordered_df = df.sort_values(by=v1)
    my_range=range(1,len(df.index)+1)
    
    fig, ax = plt.subplots(figsize=(8,6)) 
    # The horizontal connectors between the two values are drawn with plt.hlines
    plt.hlines(y=my_range, xmin=ordered_df[v1], xmax=ordered_df[v2], color='#d6d1c9', 
               alpha=0.4, linewidth=4)
    plt.scatter(ordered_df[v1], my_range, color=c1, alpha=1, label=v1, s=100)
    plt.scatter(ordered_df[v2], my_range, color=c2, alpha=0.7 , label=v2, s=100)
    plt.legend()

    # Add title and axis names
    plt.yticks(my_range, ordered_df.index)
    plt.title(tit, loc='center')
    plt.xlabel(' ')
    plt.ylabel(tity)
    #ax.set(xlabel=xlab, ylabel=ylab)
    ax.set_ylabel(tity,fontsize=15, fontweight='normal')
    ax.set_yticklabels(ordered_df.index,  fontsize=12, fontweight='bold')
    #ax.set_xticklabels(wrapped_labels,  fontsize=12, fontweight='bold')
    ax.set_title(tit, fontstyle='italic', fontsize=14)
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
    plt.yticks(rotation=0, wrap=True)
    plt.xticks(rotation=0, wrap=True)


**Weather type for injured vs non-injured in play**
* Increased injury risk when raining.
* The controlled climate is associated with injuries on synthetic fields (~90% of controlled climate games are played on synthetic turf) 

In [None]:
lolipop(df=weat_t2.transpose().loc[weat_t2.transpose().index!='NA',:], v1='Injured', v2='Non-Injured', 
        tity='Weather type', tit='Weather type by condition (injured/not injured)\n',
       c1='#f0690e',c2='#407818')

In [None]:
# distribution by weather and field type:
t_percent(df=df_fin,row='FieldType', uniq_count='PlayKey',cross=True, cross_var='weather_type').transpose()

**Weather and field type for injuries during play**
* Injuries on synthetic turf are associated with cloudy weather (this could be related to fine water particles or a certain degree of humidity in the atmosphere).
* Injuries on natural turf are more likely to happen in sunny weather or in rain.
* The controlled climate setting is, as expected, associated only with injuries on synthetic turf.


In [None]:
lolipop(df=weat_t.transpose().loc[weat_t.transpose().index!='NA',['Injured on Natural turf','Injured on Synthetic turf']], 
        v1='Injured on Natural turf', v2='Injured on Synthetic turf',
        tity='Weather type', tit='Weather type by field type for injuries\n',
       c1='#11bf08', c2='#5d1280')

In [None]:
lolipop(df=stad_t ,v1='Indoor', v2='Outdoor',
        tity='Stadium type', tit='Stadium type by field type for injuries\n',
       c1='#11bf08', c2='#5d1280')

In [None]:
wrapped_labels=['Injured \nNatural\nturf','Injured \nSynthetic\nturf',
               'Not Injured \nNatural\nturf','Not Injured \nSynthetic\nturf']

In [None]:
from math import isinf
# create intervals function (y axis)
def create_intervals(df):
    interv_lst = []
    for i in df.columns:
        if (isinf(i.right)):
            right_val=i.right
        else:
            right_val=int(i.right.round(0))
        interv_lst.append(pd.Interval(left=int(i.left.round(0)), 
                                      right=right_val, 
                                      closed=i.closed))
    df_1 = df
    df_1.columns=interv_lst
    return(df_1)
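
Note that `create_intervals` relabels the columns of the frame it receives, because `df_1 = df` is an alias and not a copy. A sketch of a non-mutating variant follows; `rounded_interval_columns` is a hypothetical name, and unlike the function above it does not handle infinite edges.

```python
import pandas as pd

def rounded_interval_columns(df):
    """Return a copy of df whose Interval column labels have integer edges."""
    out = df.copy()
    out.columns = [
        pd.Interval(int(round(float(c.left))), int(round(float(c.right))), closed=c.closed)
        for c in df.columns
    ]
    return out

# Made-up one-row table with interval-labelled columns.
toy = pd.DataFrame([[0.4, 0.6]], columns=pd.IntervalIndex.from_breaks([0.4, 10.6, 20.2]))
labelled = rounded_interval_columns(toy)
print(list(labelled.columns))
```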

In [None]:
# heatmap function
def heatmap_f(df, ylab, xlab, ann, tit):
    df_tt = df.transpose()  # better with target variable on y axis
    fig, ax1 = plt.subplots(figsize=(8,16)) 
    ax = sns.heatmap(df_tt, annot=ann.values*100, 
                     annot_kws={"size": 14, "weight": "bold"}, fmt = '.0f', cmap='YlGnBu',
                    cbar=False, linewidths=.5, square=True) # PuOr RdGy BrBG RdYlGn
    for t in ax.texts: t.set_text(t.get_text() + "%")
    ax.set(xlabel=xlab, ylabel=ylab)
    ax.set_ylabel(ylab,fontsize=16)
    ax.set_xticklabels(wrapped_labels,  fontsize=16,fontweight='bold')
    ax.set_title(tit, fontstyle='italic', fontsize=16)
    plt.yticks(rotation=0, wrap=True, fontsize=16, fontweight='bold')
    plt.xticks(rotation=0, wrap=True)
    
#cbar = i.collections[0].colorbar
#cbar.set_ticks([0,0.1, .2, 0.3,0.4,0.5,0.6,0.7, 0.8,0.9, 1])
#cbar.set_ticklabels(['0%','10%', '20%','30%','40%','50%','60%','70%','80%','90%','100%'])

**Temperature registered during play**
* More than one third of all injuries on natural turf have happened above 78 °F
* Injuries tend to become more frequent as the temperature rises (on both natural and synthetic turf)

In [None]:
heatmap_f(temp_t,ylab='Temperature intervals (F)', xlab=' ', 
          ann=create_intervals(temp_t).transpose(), 
          tit='\nTemperature during play\nby surface type and condition (injured/not injured) \n')

**Total Area covered on play**
* The total area covered by a player during a play is calculated as the area of the rectangle bounding the player's trajectory, using x,y coordinates: (x max - x min) * (y max - y min)
* Injuries in general tend to occur when the player covers a bigger area, in particular injuries on synthetic turf: 65% of synthetic turf injuries covered more than 200 square yards.
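
The area metric can be illustrated on a made-up trajectory. `bounding_box_area` is a hypothetical helper that mirrors the formula above; coordinates are assumed to be in yards.

```python
import pandas as pd

def bounding_box_area(track, x="x", y="y"):
    """Area of the axis-aligned rectangle enclosing all (x, y) points."""
    return (track[x].max() - track[x].min()) * (track[y].max() - track[y].min())

# Made-up positions: 10 yards wide and 20 yards deep.
toy = pd.DataFrame({"x": [10.0, 14.0, 20.0], "y": [5.0, 25.0, 15.0]})
area = bounding_box_area(toy)
print(area)
```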

In [None]:
heatmap_f(area_t, ylab='Covered area intervals', xlab=' ',
         ann=create_intervals(area_t).transpose(),
         tit='\nTotal Area covered on play\nby surface type and condition (injured/not injured) \n')

**Area vs distance ratio**
* The area vs distance indicator is computed as total area covered / total distance during play
* Larger numbers indicate that the player covered a wide area relative to the distance travelled (moving on a diagonal trajectory can maximize this metric); smaller numbers indicate that the player travelled a long distance within a small area
* 42% of synthetic turf injuries cover a wide area with short movements
* We also observe a small difference (+3pp) between surfaces on no-injury plays when the ratio between area and distance is highest.
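
To make the indicator concrete, the sketch below compares two made-up trajectories that span the same rectangle: a straight diagonal run and a run that doubles back. `area_distance_ratio` is a hypothetical helper following the definitions above.

```python
import numpy as np

def area_distance_ratio(xs, ys):
    """Bounding-rectangle area divided by the total path length."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    area = (xs.max() - xs.min()) * (ys.max() - ys.min())
    dist = np.sqrt(np.diff(xs) ** 2 + np.diff(ys) ** 2).sum()
    return area / dist

diagonal = area_distance_ratio([0, 3, 6], [0, 4, 8])              # straight run
back_and_forth = area_distance_ratio([0, 3, 0, 3, 6], [0, 4, 0, 4, 8])  # doubles back once
print(diagonal, back_and_forth)
```

Both runs cover the same area, but the second travels twice the distance, so its ratio is half as large.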


In [None]:
heatmap_f(area_dist_t, ylab='Area vs distance ratio - intervals', xlab=' ',
         ann=create_intervals(area_dist_t).transpose(),
         tit='Area vs distance ratio on play\nby surface type and condition (injured/not injured) \n')

**Total distance covered on play**
* Injuries on synthetic turf: 30% happened when the player had covered more than 55 yards.
* Injuries on natural turf usually happened after the player had covered more than 30 yards.

In [None]:
heatmap_f(dist_t, ylab='Distance covered on play intervals', xlab=' ',
         ann=create_intervals(dist_t).transpose(),
         tit='Distance covered on play\nby surface type and condition (injured/not injured) \n')

**Time spent during play**
* A quarter of all injuries on synthetic turf have happened between 30-36 seconds after play start.
* Plays with no injuries on synthetic turf are usually shorter: plays lasting less than 20 seconds account for 20% of all plays on synthetic turf (+5pp vs natural turf at the same threshold).

In [None]:
heatmap_f(time_t, ylab='Time spent on play - intervals', xlab=' ',
         ann=create_intervals(time_t).transpose(),
         tit='Total Time spent on play\nby surface type and condition (injured/not injured) \n')

**Direction vs Orientation ratio**
* The indicator is the per-play mean of dir / o (direction of movement divided by orientation); a value close to 1 means the player moves in the direction they are facing
* Higher numbers indicate that the two angles differ (the higher the number, the larger the direction angle relative to the orientation angle)
* Synthetic turf injuries tend to happen when the difference between the two angles is higher (45% have a ratio above 4).
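
Because the indicator is a ratio of two angles, it also depends on where the angles sit on the 0-360 scale. The sketch below, on made-up angles, sets the ratio next to the absolute angular gap between direction and orientation. `heading_gap` is a hypothetical alternative indicator and is not what the notebook computes.

```python
import numpy as np

def heading_gap(dir_deg, o_deg):
    """Absolute smallest angle between direction and orientation, 0 to 180 degrees."""
    gap = np.abs(np.asarray(dir_deg, dtype=float) - np.asarray(o_deg, dtype=float)) % 360.0
    return np.minimum(gap, 360.0 - gap)

dir_deg = np.array([90.0, 350.0, 180.0])  # made-up direction angles
o_deg = np.array([90.0, 10.0, 45.0])      # made-up orientation angles

ratio = dir_deg / o_deg          # the notebook's dir_vs_o
gap = heading_gap(dir_deg, o_deg)
print(ratio, gap)
```

In this toy data the second sample has the largest ratio but only a small angular gap, so the two indicators may rank plays differently.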

In [None]:
heatmap_f(diro_t, ylab='Direction vs Orientation ratio intervals', xlab=' ',
         ann=create_intervals(diro_t).transpose(),
         tit='Direction vs Orientation during play\nby surface type and condition (injured/not injured) \n')

In [None]:
heatmap_f(acc_t, ylab='# Acceleration - intervals', xlab=' ',
         ann=create_intervals(acc_t).transpose(),
         tit='Acceleration during play by surface type and condition (injured/not injured) \n')

In [None]:
heatmap_f(dcc_t, ylab='# Deceleration - intervals', xlab=' ',
         ann=create_intervals(dcc_t).transpose(),
         tit='Deceleration during play by surface type and condition (injured/not injured) \n')

In [None]:
heatmap_f(event_t, ylab='# Events - intervals', xlab=' ',
         ann=create_intervals(event_t).transpose(),
         tit='Number of events during play\nby surface type and condition (injured/not injured) \n')

The main differences between the analyzed groups are highlighted in the comments above.
Only analyses that revealed differences between groups were kept in this notebook.

I hope these insights can be used to further research the factors associated with injuries.

Thank you for making available the data sets and for organizing this competition.