![](https://1ycbx02rgnsa1i87hd1i7v1r-wpengine.netdna-ssl.com/wp-content/uploads/2019/01/nfl.png)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
sns.set_style("whitegrid")

# set additional display options for report
pd.set_option("display.max_columns", 100)
th_props = [('font-size', '13px'), ('background-color', 'white'), 
            ('color', '#666666')]
td_props = [('font-size', '15px'), ('background-color', 'white')]
styles = [dict(selector="td", props=td_props), dict(selector="th", 
            props=th_props)]

In this notebook, I use the provided data (the Injury Record, the Play List covering 267,005 plays, and the Player Track movement data) to derive observations and recommendations that may help curb the growing trend of injuries as more games are played on synthetic turf. My recommendations, based on the analysis documented below in the notebook, are:

  - ***The Athletic Shoe in Football:*** Optimal shoe sizing needs to be defined, along with the effect that shoe design has on mechanical load and how cleat properties, including pattern and structure, interact with the variety of playing surfaces. Structural aspects of shoe design that affect the way an athlete engages with the playing surface, such as biomechanical compliance, cleat and turf interaction, and shoe sizing/fit, deserve particular attention on synthetic turf.

  - ***Friction on Artificial Turf:*** The soil fields of natural surfaces become slippery after rainfall, which causes sprains, muscle strains, and related types of trauma. Artificial turf fields increase the friction between the ground and the players’ spikes, which improves braking and acceleration. As a result, players run faster and experience more powerful impacts when they collide, which leads to a higher potential for trauma. Inserting additional rubber chips into the artificial turf can help mitigate this problem, and the chips should be replaced at least every 2 years.


<p style="margin-top:40px"> 
My analysis is thoroughly documented in the following sections:  </p>

  - <a href='#bg'>Background and Conclusion</a>
  - <a href='#mt'>Methodology</a>
  - <a href='#an'>Analysis</a>
  - <a href='#tm'>Qualifications</a>

<p style="margin-top:40px">

<a id='bg'></a>
<div class="h2">  Background and Conclusion</div>

<p style="margin-top: 20px">The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC). The NFL is one of the four major professional sports leagues in North America and the highest professional level of American football in the world.

Recent investigations of lower limb injuries among football athletes have indicated significantly higher injury rates on synthetic turf compared with natural turf (Mack et al., 2018; Loughran et al., 2019). In conjunction with the epidemiologic investigations, biomechanical studies of football cleat-surface interactions have shown that synthetic turf surfaces do not release cleats as readily as natural turf and may contribute to the incidence of non-contact lower limb injuries (Kent et al., 2015).

Some key observations reported by Turfgrass Producers International, based on games played during the 2012-16 seasons:

* 1,280 NFL games (213,935 distinct plays) were played during the 2012-16 seasons.
* 4,801 lower body injuries occurred during the study sample, affecting 2,032 NFL players.
* Synthetic turf resulted in a 27% increase in non-contact lower body injuries.
* The rate of knee/ankle/foot injuries resulting in time lost was 56% higher on synthetic turf, and the rate of such injuries resulting in more than 8 days lost was 67% higher.
* The rate of ankle injuries resulting in any time lost was 68% higher on synthetic turf, and the rate of ankle injuries resulting in more than 8 days lost was 103% higher.
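
To make the percentages above concrete, the sketch below shows how a relative increase in injury rate is typically computed from injury counts and exposure (number of plays). The counts are made-up illustrative numbers, not figures from the study, and `rate_per_1000` and `relative_increase` are hypothetical helper names.

```python
def rate_per_1000(injuries, plays):
    """Injuries per 1,000 plays (exposure-normalised rate)."""
    return 1000.0 * injuries / plays


def relative_increase(rate_a, rate_b):
    """Percentage by which rate_a exceeds rate_b."""
    return 100.0 * (rate_a - rate_b) / rate_b


# Hypothetical counts, chosen only for illustration
synthetic_rate = rate_per_1000(injuries=127, plays=100_000)
natural_rate = rate_per_1000(injuries=100, plays=100_000)

print(f"Synthetic: {synthetic_rate:.2f} injuries per 1,000 plays")
print(f"Natural:   {natural_rate:.2f} injuries per 1,000 plays")
print(f"Relative increase: {relative_increase(synthetic_rate, natural_rate):.0f}%")
```

Normalising by exposure matters because raw injury counts alone cannot separate a riskier surface from a surface that simply hosts more plays.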

<a id='mt'></a>
<div class="h2">  Methodology </div>
<p style="margin-top: 20px">I followed an in-depth statistical approach and tested several samples of the given data to check that the observations are not biased. I also tried classification models (SVM and Random Forest); however, they did not reach an acceptable level of model significance on the data provided, so I dropped them and proceeded with Exploratory Data Analysis. I have tried to keep the analysis simple yet detailed, so that it can be easily understood by anyone reading it. </p>

The entire analysis is built on sample comparison to control for the bias introduced by having far fewer injury cases than non-injury cases. Every conclusion drawn was validated repeatedly on multiple samples.
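
One way to carry out this kind of repeated sample comparison is sketched below. It is only an illustration under stated assumptions: the data is randomly generated, the column names `injured` and `synthetic` and the helper `balanced_sample` are hypothetical, and the balanced 1:1 undersampling differs from the fixed-fraction sampling used later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical play-level data: 1 = injury play, 0 = non-injury play
n_plays = 20_000
demo = pd.DataFrame({
    "injured": (rng.random(n_plays) < 0.005).astype(int),
    "synthetic": rng.integers(0, 2, size=n_plays),
})


def balanced_sample(df, target, seed):
    """Keep every positive row and draw an equally sized random set of negatives."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg])


# Repeat the comparison on several independent samples and inspect the spread
gaps = []
for seed in range(20):
    sample = balanced_sample(demo, "injured", seed)
    share = sample.groupby("injured")["synthetic"].mean()
    gaps.append(share[1] - share[0])

print(f"mean gap in synthetic share: {np.mean(gaps):+.3f}")
print(f"std across samples:          {np.std(gaps):.3f}")
```

If a difference between injured and non-injured plays keeps its sign and rough size across the repeated samples, it is less likely to be an artifact of one particular draw.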

<div class="h3"> External Inputs Considered </div>
To understand the problem better, I went through multiple external resources and tried to relate them to my results. Some of the external sources considered:

  - NFL Official Rules and history
  - Videos and game highlights
  - NFL press releases, articles and general statistics
  - NFL Health and Safety website 
  - Articles and research papers on sports-related concussions 

<a id='an'></a>
<div class="h2">  Analysis</div>

In [None]:
#Memory Reducer Code

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


#def import_data(file):
 #   """create a dataframe and optimize its memory usage"""
  #  playtrk = pd.read_csv('../input/nfl-playing-surface-analytics/PlayerTrackData.csv', parse_dates=True, keep_date_col=True)
   # playtrk = reduce_mem_usage(playtrk)
    #return playtrk


In [None]:
#Import available data

injrec = pd.read_csv('../input/nfl-playing-surface-analytics/InjuryRecord.csv')
playlist = pd.read_csv('../input/nfl-playing-surface-analytics/PlayList.csv')
playtrk = pd.read_csv('../input/nfl-playing-surface-analytics/PlayerTrackData.csv')
playtrk = reduce_mem_usage(playtrk)

In [None]:
# Data Preparation - Cleaning the Weather and StadiumType values

def clean_weather(row):
    cloudy = ['Cloudy 50% change of rain', 'Hazy', 'Cloudy.', 'Overcast', 'Mostly Cloudy','Cloudy, fog started developing in 2nd quarter', 'Partly Cloudy','Mostly cloudy', 'Rain Chance 40%',' Partly cloudy', 'Party Cloudy','Rain likely, temps in low 40s', 'Partly Clouidy', 'Cloudy, 50% change of rain','Mostly Coudy', '10% Chance of Rain','Cloudy, chance of rain', '30% Chance of Rain', 'Cloudy, light snow accumulating 1-3"','cloudy', 'Coudy', 'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.','Cloudy fog started developing in 2nd quarter', 'Cloudy light snow accumulating 1-3"','Cloudywith periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.','Cloudy 50% change of rain', 'Cloudy and cold','Cloudy and Cool', 'Partly cloudy','Cloudy.','Cold']
    clear = ['Clear, Windy',' Clear to Cloudy', 'Clear, highs to upper 80s','Clear and clear','Partly sunny',
             'Clear, Windy', 'Clear skies', 'Sunny', 'Partly Sunny', 'Mostly Sunny', 'Clear Skies',
             'Sunny Skies', 'Partly clear', 'Fair', 'Sunny, highs to upper 80s', 'Sun & clouds', 'Mostly sunny','Sunny, Windy',
             'Mostly Sunny Skies', 'Clear and Sunny', 'Clear and sunny','Clear to Partly Cloudy', 'Clear Skies',
            'Clear and cold', 'Clear and warm', 'Clear and Cool', 'Sunny and cold', 'Sunny and warm', 'Sunny and clear','Heat Index 95']
    
    rainy = ['Rainy', 'Scattered Showers', 'Showers', 'Cloudy Rain', 'Light Rain', 'Rain shower', 'Rain likely, temps in low 40s.', 'Cloudy, Rain']
    
    snow = ['Heavy lake effect snow']
    
    indoor = ['Controlled Climate', 'Indoors', 'N/A Indoor', 'N/A (Indoors)']
        
    if row.Weather in cloudy:
        return 'Cloudy'
    
    if row.Weather in indoor:
        return 'Indoor'
    
    if row.Weather in clear:
        return 'Clear'
    
    if row.Weather in rainy:
        return 'Rain'
    
    if row.Weather in snow:
        return 'Snow'
    
    return row.Weather

def clean_stadiumtype(row):
    if row.StadiumType in ['Bowl', 'Heinz Field', 'Cloudy']:
        return np.nan
    else:
        return row.StadiumType

def clean_play_df(playlist):
    playlist_cleaned = playlist.copy()
    
    # clean StadiumType
    # regex=True is required: recent pandas versions treat the pattern as a literal string by default
    playlist_cleaned['StadiumType'] = playlist_cleaned['StadiumType'].str.replace(r'Oudoor|Outdoors|Ourdoor|Outddors|Outdor|Outside', 'Outdoor', regex=True)
    playlist_cleaned['StadiumType'] = playlist_cleaned['StadiumType'].str.replace(r'Indoors|Indoor, Roof Closed|Indoor, Open Roof', 'Indoor', regex=True)
    playlist_cleaned['StadiumType'] = playlist_cleaned['StadiumType'].str.replace(r'Closed Dome|Domed, closed|Domed, Open|Domed, open|Dome, closed|Domed', 'Dome', regex=True)
    playlist_cleaned['StadiumType'] = playlist_cleaned['StadiumType'].str.replace(r'Retr. Roof-Closed|Outdoor Retr Roof-Open|Retr. Roof - Closed|Retr. Roof-Open|Retr. Roof - Open|Retr. Roof Closed', 'Retractable Roof', regex=True)
    playlist_cleaned['StadiumType'] = playlist_cleaned.apply(lambda row: clean_stadiumtype(row), axis=1)
    
    # clean Weather
    playlist_cleaned['Weather'] = playlist_cleaned.apply(lambda row: clean_weather(row), axis=1)
    
    return playlist_cleaned

playlist_cleaned = clean_play_df(playlist)

#Deleting the raw PlayList frame now that a cleaned copy exists
import gc
del playlist
gc.collect()
playlist=pd.DataFrame()

# New column to track the minimum number of days missed due to injury

def label(row):
    # The DM_M* flags are treated as cumulative, so their sum identifies the highest threshold reached
    flags = row['DM_M42'] + row['DM_M28'] + row['DM_M7'] + row['DM_M1']
    if flags == 4:
        return 42
    if flags == 3:
        return 28
    if flags == 2:
        return 7
    if flags == 1:
        return 1
    return 0

injrec['Minm_Days_Game_Missed'] = injrec.apply(label, axis=1)

injrec['BodyPart_Surface']= injrec['Surface'] + '-' + injrec['BodyPart']

In [None]:
# Which body part is the most affected?

injrec.groupby('BodyPart').count()['PlayerKey'] \
    .sort_values() \
    .plot(kind='bar', figsize=(10, 3), title='Count of injuries by Body Part')
plt.show()

The plot clearly shows that Knee and Ankle injuries are the most frequent; together they constitute ~85% of the injury cases.
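
The share quoted above can be read off directly with `value_counts(normalize=True)`. The sketch below uses a small made-up frame (`demo_inj` is hypothetical); in this notebook the same call would be applied to `injrec['BodyPart']`.

```python
import pandas as pd

# Hypothetical injury records, for illustration only
demo_inj = pd.DataFrame({
    "BodyPart": ["Knee", "Knee", "Ankle", "Ankle", "Knee", "Foot", "Ankle", "Toes"],
})

# Percentage of injuries per body part
share = demo_inj["BodyPart"].value_counts(normalize=True).mul(100).round(1)
print(share)

# Combined share of the two most frequent body parts
knee_ankle = share.reindex(["Knee", "Ankle"]).sum()
print(f"Knee + Ankle share: {knee_ankle:.1f}%")
```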

In [None]:
# Look at the distribution of injuries by body part across turf types (Natural and Synthetic). Does the surface have any impact?
injrec.groupby(['BodyPart','Surface']) \
    .count() \
    .unstack('BodyPart')['PlayerKey'] \
    .T.sort_values('Natural').T \
    .sort_values('Ankle') \
    .plot(kind='bar', figsize=(15, 5), title='Injury Body Part by Turf Type')
plt.show()

Although the overall injury counts do not differ much between turf types in the given injury data, it is important to note that Ankle and Knee injuries are noticeably more frequent on Synthetic turf than on Natural turf, and that Knee injuries are common overall.
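
A cross-tabulation makes this comparison easier to read than raw bars, because it can show both counts and within-body-part shares. The sketch below uses made-up records (`demo_inj` is hypothetical); with the real data the same calls would take `injrec['BodyPart']` and `injrec['Surface']`.

```python
import pandas as pd

# Hypothetical injury records, for illustration only
demo_inj = pd.DataFrame({
    "BodyPart": ["Knee", "Knee", "Ankle", "Ankle", "Ankle", "Foot", "Knee", "Ankle"],
    "Surface": ["Natural", "Synthetic", "Synthetic", "Natural",
                "Synthetic", "Natural", "Synthetic", "Synthetic"],
})

# Raw counts of injuries per body part and surface
counts = pd.crosstab(demo_inj["BodyPart"], demo_inj["Surface"])

# Share of each body part's injuries that happened on each surface (rows sum to 1)
row_share = pd.crosstab(demo_inj["BodyPart"], demo_inj["Surface"], normalize="index")

print(counts)
print(row_share.round(2))
```

Note that these are shares of injuries only; they say nothing about risk until they are related to how many games or plays took place on each surface.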

In [None]:
# To study the impact of other factors driving injuries, merge the injury data with the PlayList data, which holds the remaining attributes.

playlist_eval = playlist_cleaned.loc[:, ['PlayerKey','GameID','FieldType']]
playlist_eval=playlist_eval.drop_duplicates()
# Note: the join key is PlayerKey, so every game of an injured player is labelled with that injury
playlist_eval_1=playlist_eval.merge(injrec[['PlayerKey','BodyPart']],on='PlayerKey',how='left').fillna("NoInjury")

del [[playlist_eval]]
gc.collect()
playlist_eval=pd.DataFrame()

In [None]:
#This plot only helps in getting a better understanding of Injury Vs Non Injury data points. No significant inference from the plot.

playlist_eval_1.groupby(['BodyPart','FieldType']) \
    .count() \
    .unstack('FieldType')['GameID'] \
    .sort_values('Natural') \
    .T.sort_values('Ankle').T \
    .plot(kind='bar', figsize=(15, 5), title='Overall Injury Snapshot')
plt.show()

Additional check: evaluate the number of injury cases by surface type against the overall base of games played.
Comment: The percentage of injury cases is higher on the Synthetic surface on the overall base as well, which suggests that some key drivers associated with the playing surface are also affecting injuries.
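
The percentage comparison described above can be sketched as an injury rate per 100 player-games for each field type. Everything in the block is illustrative: the frames `games` and `injuries` are made up, and, unlike the merge in this notebook, the sketch joins on `GameID` so that only the game in which the injury occurred is flagged.

```python
import pandas as pd

# Hypothetical player-game records, for illustration only
games = pd.DataFrame({
    "PlayerKey": [1, 1, 2, 2, 3, 3, 4, 4],
    "GameID": ["1-1", "1-2", "2-1", "2-2", "3-1", "3-2", "4-1", "4-2"],
    "FieldType": ["Natural", "Synthetic", "Natural", "Synthetic",
                  "Natural", "Natural", "Synthetic", "Synthetic"],
})
# Hypothetical injuries, keyed by the game in which they occurred
injuries = pd.DataFrame({
    "GameID": ["1-2", "4-1", "3-1"],
    "BodyPart": ["Knee", "Ankle", "Knee"],
})

# indicator=True adds a _merge column that marks which games matched an injury
merged = games.merge(injuries, on="GameID", how="left", indicator=True)
merged["injured"] = (merged["_merge"] == "both").astype(int)

rate = merged.groupby("FieldType")["injured"].agg(games="count", injuries="sum")
rate["injuries_per_100_games"] = 100 * rate["injuries"] / rate["games"]
print(rate)
```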

In [None]:
injrec.groupby(['BodyPart_Surface','Minm_Days_Game_Missed']) \
    .count() \
    .unstack('BodyPart_Surface')['PlayerKey'] \
    .sort_values('Minm_Days_Game_Missed') \
    .plot(kind='bar', figsize=(15, 5), title='How Surface and Body Part Injury Occurred')
plt.show()

The exploratory analysis above shows that Knee injuries have the longest impact: the most games are missed when a player has a Knee injury.
With respect to surface, Knee and Ankle injuries that occur on a Synthetic surface tend to be more severe than the same injuries on a Natural surface.
What we need to understand next is whether player movement or playing pattern changes with the surface (Synthetic), and whether multiple other factors are driving injuries along with it.
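
The severity comparison can also be summarised numerically. The sketch below assumes, as the `label` function in this notebook does, that the `DM_M1`, `DM_M7`, `DM_M28` and `DM_M42` flags are cumulative. The records, the 28-day cut-off, and the column names `min_days_missed` and `long_absence` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical injury records, for illustration only
demo = pd.DataFrame({
    "BodyPart": ["Knee", "Knee", "Ankle", "Ankle", "Knee", "Ankle"],
    "Surface": ["Synthetic", "Natural", "Synthetic", "Natural", "Synthetic", "Synthetic"],
    "DM_M1": [1, 1, 1, 1, 1, 1],
    "DM_M7": [1, 1, 1, 0, 1, 1],
    "DM_M28": [1, 0, 1, 0, 1, 0],
    "DM_M42": [1, 0, 0, 0, 1, 0],
})

flags = ["DM_M1", "DM_M7", "DM_M28", "DM_M42"]

# With cumulative flags, the number of set flags identifies the highest threshold reached
demo["min_days_missed"] = demo[flags].sum(axis=1).map({0: 0, 1: 1, 2: 7, 3: 28, 4: 42})

# Share of injuries with at least 28 days missed, per body part and surface
demo["long_absence"] = demo["min_days_missed"] >= 28
summary = demo.groupby(["BodyPart", "Surface"])["long_absence"].mean().unstack("Surface")
print(summary)
```

With only a few injuries in each cell, such shares are noisy, so they are better read as a descriptive summary than as evidence of a surface effect.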

In [None]:
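#Merging Injury data with Play List Data (the only shared column, and hence the join key, is PlayKey)
playlist_cleaned=playlist_cleaned.merge(injrec[['PlayKey','BodyPart','Minm_Days_Game_Missed']],how='left')
# Assign the result back instead of calling fillna(inplace=True) on an attribute, which may not update the frame
playlist_cleaned['BodyPart'] = playlist_cleaned['BodyPart'].fillna('')
playlist_cleaned['Minm_Days_Game_Missed'] = playlist_cleaned['Minm_Days_Game_Missed'].fillna('')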

In [None]:
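#Created a separate dataset to ensure that results are not biased due to the small number of injured events
model_data=playlist_cleaned.copy()

# Knee, Ankle and Foot injuries -> 1, no injury -> 0; any other body part maps to NaN and drops out of both subsets
model_data['BodyPart'] = model_data['BodyPart'].map({'Knee': 1, 'Ankle': 1, 'Foot': 1, '': 0})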
model_data_inj=model_data.loc[model_data['BodyPart'] == 1]
rest_data=model_data.loc[model_data['BodyPart'] == 0]
model_data_noninj=rest_data.sample(frac=0.001)


frames = [model_data_inj, model_data_noninj]
model_data_final = pd.concat(frames)

model_data_final['BodyPart'].value_counts()


In [None]:
# Compute the correlation matrix (numeric columns only)
corr = model_data_final.corr(numeric_only=True)

# Generate a mask for the upper triangle (np.bool was removed from NumPy, so use the builtin bool)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
#Checking the relationship of each Categorical Variable with the Target Variable (Injury/Non Injury) to find out any significant pattern
categorical = [
  'StadiumType', 'Weather', 'PlayType', 'Position', 'PositionGroup']
fig, ax = plt.subplots(3, 2, figsize=(35, 20))
for var, subplot in zip(categorical, ax.flatten()):
    sns.boxplot(x=var, y='BodyPart', data=model_data_final, ax=subplot)

In [None]:
#Using Player Track Data to find out whether injury has any correlation with player's movement
playtrk.head()
playtrk=playtrk.merge(model_data[['PlayKey','BodyPart']],how='left')
playtrk['BodyPart'].value_counts()

In [None]:
# Create Football Field

def create_football_field(linenumbers=True,
                          endzones=True,
                          highlight_line=False,
                          highlight_line_number=50,
                          highlighted_name='Line of Scrimmage',
                          fifty_is_los=False,
                          figsize=(12, 6.33)):
    """
    Function that plots the football field for viewing plays.
    Allows for showing or hiding endzones.
    """
    rect = patches.Rectangle((0, 0), 120, 53.3, linewidth=0.1,
                             edgecolor='r', facecolor='darkgreen', zorder=0)

    fig, ax = plt.subplots(1, figsize=figsize)
    ax.add_patch(rect)

    plt.plot([10, 10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60, 70, 70, 80,
              80, 90, 90, 100, 100, 110, 110, 120, 0, 0, 120, 120],
             [0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3,
              53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 53.3, 0, 0, 53.3],
             color='white')
    if fifty_is_los:
        plt.plot([60, 60], [0, 53.3], color='gold')
        plt.text(62, 50, '<- Player Yardline at Snap', color='gold')
    # Endzones
    if endzones:
        ez1 = patches.Rectangle((0, 0), 10, 53.3,
                                linewidth=0.1,
                                edgecolor='r',
                                facecolor='blue',
                                alpha=0.2,
                                zorder=0)
        ez2 = patches.Rectangle((110, 0), 10, 53.3,
                                linewidth=0.1,
                                edgecolor='r',
                                facecolor='blue',
                                alpha=0.2,
                                zorder=0)
        ax.add_patch(ez1)
        ax.add_patch(ez2)
    plt.xlim(0, 120)
    plt.ylim(-5, 58.3)
    plt.axis('off')
    if linenumbers:
        for x in range(20, 110, 10):
            numb = x
            if x > 50:
                numb = 120 - x
            plt.text(x, 5, str(numb - 10),
                     horizontalalignment='center',
                     fontsize=20,  # fontname='Arial',
                     color='white')
            plt.text(x - 0.95, 53.3 - 5, str(numb - 10),
                     horizontalalignment='center',
                     fontsize=20,  # fontname='Arial',
                     color='white', rotation=180)
    if endzones:
        hash_range = range(11, 110)
    else:
        hash_range = range(1, 120)

    for x in hash_range:
        ax.plot([x, x], [0.4, 0.7], color='white')
        ax.plot([x, x], [53.0, 52.5], color='white')
        ax.plot([x, x], [22.91, 23.57], color='white')
        ax.plot([x, x], [29.73, 30.39], color='white')

    if highlight_line:
        hl = highlight_line_number + 10
        plt.plot([hl, hl], [0, 53.3], color='yellow')
        plt.text(hl + 2, 50, '<- {}'.format(highlighted_name),
                 color='yellow')
    return fig, ax

create_football_field()
plt.show()

In [None]:
#inj_cases=model_data_inj.sample(frac=0.10)
#non_inj_cases=rest_data.sample(frac=0.05)
# image utils
#from PIL import Image
#ax = Image.open('../input/nfl-utils/nfl_coordinates.png')
#ax = ax.resize((1200,533))
  # show background
 #   fig = plt.figure(figsize=figsize)
  #  plt.imshow(np.array(background).transpose(0,1,2), origin='lower')
    
def plot_example_plays(play_keys, title):
    """Overlay the tracked x/y positions of the given plays on one field."""
    fig, ax = create_football_field()
    colors = ['orange', 'red', 'yellow', 'blue', 'black']
    for play_key, color in zip(play_keys, colors):
        playtrk.query('PlayKey == @play_key')\
        .plot(kind='scatter', x='x', y='y', title=title, color=color, ax=ax)
    plt.show()

#Movement of Injured Players (first five injured plays)
plot_example_plays(model_data_inj['PlayKey'].values[:5], 'Players Movement - Injured Plays')

#Movement of Non Injured Players (first five non-injured plays)
plot_example_plays(rest_data['PlayKey'].values[:5], 'Players Movement - Non Injured Plays')


In [None]:
playtrk.head()

# How the distance covered per tracking step ('dis', a proxy for speed) varies along the field in Injured vs Non Injured cases

#Injured Cases
inj_list = model_data_inj['PlayKey'].tolist()
playtrk.query('PlayKey == @inj_list')\
.plot(kind='scatter', x='x', y='dis', title='Speed Across Distance Travelled - Injured Plays',color='orange')
plt.show()


#Non Injured Cases
noninj_list = rest_data['PlayKey'].tolist()
playtrk.query('PlayKey == @noninj_list').sample(22195)\
.plot(kind='scatter', x='x', y='dis', title='Speed Across Distance Travelled - Non Injured Plays',color='red')
plt.show()


In [None]:
# At which orientation and direction of movement the injuries occur

#Injured Cases
inj_list = model_data_inj['PlayKey'].tolist()
playtrk.query('PlayKey == @inj_list')\
.plot(kind='scatter', x='o', y='dir', title='Orientation vs Direction - Injured Plays',color='orange')
plt.show()

<a id='tm'></a>
<div class="h2">  Qualifications</div>

<p style="margin-top: 20px"> I am currently a Principal Data Scientist at ITC Infotech Ltd, with a strong math background and 9+ years of experience using predictive modeling, data processing, and data mining algorithms to solve challenging business problems in the Retail/CPG/BFSI domains. I am involved in the R/Python open source community and passionate about deep reinforcement learning and other advanced ML algorithms.