First of all, why is the growing application of data science in sports so important? I want to share a short reflection here. As a newcomer to the Kaggle environment, I think this is a good opportunity to present the motivations that led me to focus on the data science field, especially its applications to sports. I think that the main goal of data science is to improve people's quality of life. At least, that is my vision. So a natural step is to use data in professional sport: with data, we can study the injuries that end the promising careers of athletes and how to avoid them; how to improve their performance on the field and, consequently, their satisfaction and self-esteem (mental health has a great impact on the effectiveness of these athletes too); in short, how to improve their safety and well-being.  

Let's focus now on the task at hand. The main questions are: what conclusions can we draw from this data? What can we improve? These are the basic steps of a sound data analysis:

A)Preprocessing the data.
* What type is each attribute?
* Are there any NaN values?
* Do we need to normalize any attribute?
* ...

B)Visualization of the data.
* Using Seaborn or pandas.

C)Studying the data. 
* Form your hypotheses: what phenomena seem to be happening? Is there any relationship between attributes?

D)"Play with the data".
* Create your own attributes, make predictions, etc.

E)Conclusions.
* The main goal of this study is to draw useful conclusions that could help improve quality of life!

Let's get started:

In [None]:
#Importing necessary libraries
import numpy as np 
import pandas as pd 
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



**A)Dataset InjuryRecord**

1-Reading and preprocessing the data.

In [None]:
injuryRecord=pd.read_csv('../input/nfl-playing-surface-analytics/InjuryRecord.csv')


In [None]:
#First we are going to inspect the dataset InjuryRecord
injuryRecord.info() #info() prints its own report (and returns None); shows the type of each attribute



In [None]:
print(injuryRecord.shape) #returns (rows,columns)

In [None]:
print(injuryRecord.describe()) #descriptive statistics about the dataset

In [None]:
#Let's have a peek at the dataset
injuryRecord.head(5)

Let's check if there is any NaN value.



In [None]:
injuryRecord.isnull().sum()

As info() showed, the only attribute with NaN values is PlayKey. PlayKey uniquely identifies a player's plays within a game (in sequential order), so it is an identifier rather than a measurement. It would make no sense to fill these gaps with a technique such as using the mean of the other values in the column. Instead, let's look at the rows where PlayKey is missing.



In [None]:
injuryRecordWithoutPlayKey=injuryRecord[injuryRecord['PlayKey'].isnull()]

In [None]:
#count() skips NaN, so counting PlayKey in these rows would always give 0; size() counts the rows per player instead
injuryRecordWithoutPlayKey.groupby('PlayerKey').size()

2-Visualizing the data. Let's ask some questions about this dataset:

A-Which kinds of injuries are the most frequent?


In [None]:
sns.catplot(x="BodyPart", kind="count", palette="ch:.25", data=injuryRecord);

B-Which kind of surface seems to be more related to injuries?


In [None]:
sns.catplot(x="Surface", kind="count", hue="BodyPart", palette="ch:.25", data=injuryRecord);

* The total number of injuries seems to be lower on natural surfaces, although foot injuries are more frequent there.
* There appear to be no heel injuries on synthetic surfaces.
* On synthetic surfaces, ankle injuries are slightly more common than knee injuries.
* Toe injuries appear more often on synthetic surfaces.
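
Raw counts can hide differences when one surface simply has more records than the other. A minimal sketch of how the comparison could be made with proportions instead, using made-up toy rows (not the real InjuryRecord data) and the same column names:

```python
import pandas as pd

# Toy rows, invented for illustration only; the real data comes from InjuryRecord.csv
toy = pd.DataFrame({
    "Surface": ["Natural", "Natural", "Natural",
                "Synthetic", "Synthetic", "Synthetic", "Synthetic"],
    "BodyPart": ["Knee", "Ankle", "Heel", "Knee", "Ankle", "Ankle", "Toes"],
})

# Raw number of injuries per surface and body part
counts = pd.crosstab(toy["Surface"], toy["BodyPart"])

# Share of each body part within a surface (every row sums to 1)
shares = pd.crosstab(toy["Surface"], toy["BodyPart"], normalize="index")

print(counts)
print(shares.round(2))
```

With the real dataset, replacing `toy` with `injuryRecord` should give the same kind of table for the plot above.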
 

C-Which kind of injury makes the player miss more days?
We are going to check the DM_M42 attribute, a binary (0/1) indicator of 42 or more days missed due to the injury. DM_M28, DM_M7 and DM_M1 work the same way for 28, 7 and 1 or more days.

In [None]:
sns.catplot(x='BodyPart', kind="count", hue="DM_M42" ,palette="ch:.25", data=injuryRecord);

In [None]:
sns.catplot(x='BodyPart', kind="count", hue="DM_M28" ,palette="ch:.25", data=injuryRecord);

In [None]:
sns.catplot(x='BodyPart', kind="count", hue="DM_M7" ,palette="ch:.25", data=injuryRecord);

In [None]:
sns.catplot(x='BodyPart', kind="count", hue="DM_M1" ,palette="ch:.25", data=injuryRecord);

* Knee injuries are the ones that make players miss the most days.
* Every injury in this dataset makes the player miss 1 or more days.
* Most knee and ankle injuries make the player miss 7 or more days, typically between 7 and 28 days.
* Foot injuries always seem to make the player miss 28 or more days. Furthermore, they will probably make the player miss 42 days or more.
* Heel injuries make the player miss 7 or more days, but no more than 27.
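
The four plots above could also be condensed into a single one. This is only a sketch on invented rows: it assumes the DM_M flags are cumulative (a 1 in DM_M42 implies a 1 in DM_M28, DM_M7 and DM_M1), and the `DaysMissed` column and its labels are names I made up for the example.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "BodyPart": ["Knee", "Ankle", "Foot", "Heel"],
    "DM_M1":  [1, 1, 1, 1],
    "DM_M7":  [1, 0, 1, 1],
    "DM_M28": [0, 0, 1, 0],
    "DM_M42": [0, 0, 1, 0],
})

flags = ["DM_M1", "DM_M7", "DM_M28", "DM_M42"]

# Assumption: flags are cumulative, so the number of flags set identifies the bracket
labels = {1: "1-6 days", 2: "7-27 days", 3: "28-41 days", 4: "42+ days"}
toy["DaysMissed"] = toy[flags].sum(axis=1).map(labels)

print(toy[["BodyPart", "DaysMissed"]])
```

A single `sns.catplot(x="BodyPart", kind="count", hue="DaysMissed", data=...)` on such a column would then show all brackets at once.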




**B)PlayList Dataset**

1-Reading and preprocessing the data.

In [None]:
playList=pd.read_csv('../input/nfl-playing-surface-analytics/PlayList.csv')

In [None]:
print(playList.shape)
playList.info() #info() prints its own report and returns None
print(playList.describe())

In [None]:
#Let's have a peek at the dataset
playList.head(5)

In [None]:
#NaN values?
#According to playList.info(), there are NaN values in StadiumType, Weather and PlayType. 
#Let's check 
playList.isnull().sum()

These three attributes are categorical. We are going to fill their NaN values with the mode of each column.


In [None]:
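#Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
for column in ['StadiumType','Weather','PlayType']:
    playList[column] = playList[column].fillna(playList[column].mode()[0])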

In [None]:
playList.isnull().sum()

In [None]:
playList['Weather'].head()

2-Visualizing the data. Let's ask some questions about this dataset:

A)Does the roster position of the player match the position played in each play?

The attribute Position has these values:

In [None]:
sorted(list(set(playList['Position'].values)))

In [None]:
sns.catplot(x="Position", kind="count",hue="RosterPosition",data=playList)

We can't see anything in this plot, so I'm going to split the dataset into smaller subsets, one per roster position.

In [None]:
dataset1=playList[(playList['RosterPosition'] == "Quarterback")]

In [None]:
dataset1.info()

In [None]:

sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset1);

In [None]:
dataset2=playList[(playList['RosterPosition'] == "Wide Receiver")]

In [None]:
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset2);

In [None]:
dataset3=playList[(playList['RosterPosition'] == "Linebacker")]

In [None]:
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset3);

In [None]:
dataset4=playList[(playList['RosterPosition'] == "Running Back")]

In [None]:
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset4);

In [None]:
dataset5=playList[(playList['RosterPosition'] == "Defensive Lineman")]
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset5);

In [None]:
dataset6=playList[(playList['RosterPosition'] == "Tight End")]
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset6);

In [None]:
dataset7=playList[(playList['RosterPosition'] == "Safety")]
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset7);

In [None]:
dataset8=playList[(playList['RosterPosition'] == "Cornerback")]
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset8);

In [None]:
dataset9=playList[(playList['RosterPosition'] == "Offensive Lineman")]
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset9);

In [None]:
dataset10=playList[(playList['RosterPosition'] == "Kicker")]
sns.catplot(x="Position", kind="count",hue="PlayType",data=dataset10);

* It seems that linebackers get many more plays as outside linebackers (OLB).
* Defensive linemen usually play at the DE and DT positions.
* Safeties usually play at the FS and SS positions.
* Offensive linemen play at the T, G and C positions, in that order of frequency.
* Kickers usually stay at the K position, but they can play at the P position too.
* The most typical plays for every type of player are Pass and Rush, except for kickers. Kickers participate in Extra Point, Kickoff, Field Goal, Kickoff Not Returned and Kickoff Returned plays.
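
The same question can be answered without one plot per roster position. A small sketch with a cross-tabulation, on invented rows that reuse the column names of PlayList.csv (`position_table` and `most_common` are names chosen for this example):

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "RosterPosition": ["Linebacker", "Linebacker", "Linebacker",
                       "Safety", "Safety", "Kicker"],
    "Position": ["OLB", "OLB", "MLB", "FS", "SS", "K"],
})

# Rows: roster position, columns: position actually played, values: number of plays
position_table = pd.crosstab(toy["RosterPosition"], toy["Position"])

# Most frequent played position for each roster position (ties go to the first column)
most_common = position_table.idxmax(axis=1)

print(position_table)
print(most_common)
```

On the full dataset the table is small enough to read directly, which avoids the unreadable combined plot.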

B)Relation between FieldType and StadiumType


In [None]:
sorted(list(set(playList['StadiumType'].values)))

I'm going to treat stadiums with an open roof as outdoors.

In [None]:
playList['StadiumType']=playList['StadiumType'].replace(['Bowl','Cloudy','Domed, Open','Domed, open','Heinz Field','Indoor, Open Roof','Open','Oudoor','Ourdoor','Outddors','Outdoor','Outdoor Retr Roof-Open','Outdor','Outside','Retr. Roof - Open','Retr. Roof-Open'],'Outdoors')
playList['StadiumType']=playList['StadiumType'].replace(['Closed Dome','Dome','Dome, closed','Domed','Domed, closed','Indoor','Indoor, Roof Closed','Retr. Roof - Closed','Retr. Roof Closed','Retr. Roof-Closed','Retractable Roof'],'Indoors')

In [None]:
sns.catplot(x="FieldType", kind="count",hue="StadiumType",data=playList);

Now I'm going to merge the two datasets InjuryRecord and PlayList to study possible relationships between parameters such as type of injury and roster position, type of injury and weather, etc.
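
One thing to keep in mind when merging on PlayerKey only: every injury row is paired with all the plays of that player, so later count plots count plays of injured players rather than injuries. A toy sketch of the effect (all values invented; `by_player` and `by_play` are names made up for the example):

```python
import pandas as pd

# Toy data, invented for illustration only
plays = pd.DataFrame({
    "PlayerKey": [1, 1, 1, 2, 2],
    "PlayKey": ["1-1-1", "1-1-2", "1-2-1", "2-1-1", "2-1-2"],
    "PlayType": ["Pass", "Rush", "Pass", "Pass", "Rush"],
})
injuries = pd.DataFrame({
    "PlayerKey": [1],
    "PlayKey": ["1-1-2"],
    "BodyPart": ["Knee"],
})

# Merging on the player pairs the single injury with each of the player's 3 plays
by_player = pd.merge(plays, injuries, on="PlayerKey")

# Merging on player and play keeps only the play where the injury happened
by_play = pd.merge(plays, injuries, on=["PlayerKey", "PlayKey"])

print(len(by_player), list(by_player.columns))
print(len(by_play), list(by_play.columns))
```

Merging on the play is stricter, but injury rows with a missing PlayKey would be dropped by it, so which option fits depends on the question being asked.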

1-Preprocessing the data

In [None]:

#playList and injuryRecord
datasetInjuryPlaylist = pd.merge(playList, injuryRecord, on='PlayerKey')


In [None]:
datasetInjuryPlaylist.info()

In [None]:
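#Assign the result back instead of using fillna(inplace=True) on a selected column
for column in ['StadiumType','Weather','PlayType']:
    datasetInjuryPlaylist[column] = datasetInjuryPlaylist[column].fillna(datasetInjuryPlaylist[column].mode()[0])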

In [None]:
datasetInjuryPlaylist.info()

In [None]:
datasetInjuryPlaylist.head()

2-Visualization of the data

A-Which kinds of injuries do defensive linemen usually get? And offensive linemen? And safeties?

In [None]:
datasetDefensiveLineman=datasetInjuryPlaylist[(datasetInjuryPlaylist['RosterPosition'] == "Defensive Lineman")]
datasetDefensiveLineman.head()

In [None]:
sns.catplot(x="Position", kind="count",hue="BodyPart",data=datasetDefensiveLineman);

In [None]:
datasetOffensiveLineman=datasetInjuryPlaylist[(datasetInjuryPlaylist['RosterPosition'] == "Offensive Lineman")]
sns.catplot(x="Position", kind="count",hue="BodyPart",data=datasetOffensiveLineman);

In [None]:
datasetSafety=datasetInjuryPlaylist[(datasetInjuryPlaylist['RosterPosition'] == "Safety")]
sns.catplot(x="Position", kind="count",hue="BodyPart",data=datasetSafety);

* It seems that defensive linemen get more injuries playing as DE (defensive end), especially knee and foot injuries.
* Offensive linemen get more ankle injuries playing as C (center), but more knee injuries playing as T (offensive tackle).
* Safeties get more injuries playing as FS (free safety), especially ankle injuries.

B-Which weather seems to be associated with more injuries?

Let's see the possible values of Weather.

In [None]:
sorted(list(set(datasetInjuryPlaylist['Weather'].values)))

In [None]:
#Let's clean a bit.


In [None]:
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Clear','Clear skies','Fair'],'Clear Skies')
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Indoor','Indoors','N/A Indoor','N/A (Indoors)'],'Controlled Climate')
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Light Rain','Rain shower','Rainy','Showers','Scattered Showers','Rain shower','Cloudy, Rain','Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.'],'Rain')

datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Clear and warm','Sunny, highs to upper 80s','Sunny and warm','Heat Index 95'],'Warm')
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Partly clear','Clear to Partly Cloudy','Cloudy','Cloudy, chance of rain','Cloudy, fog started developing in 2nd quarter','Coudy','Hazy','Mostly Cloudy','Mostly Coudy','Mostly cloudy','Overcast','Partly Cloudy','Partly Clouidy','Party Cloudy','Partly cloudy','cloudy'],'Cloudy')
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Partly Sunny','Partly sunny','Clear and Sunny','Clear and sunny','Mostly Sunny','Mostly Sunny Skies','Mostly sunny','Sunny Skies','Sunny and clear','Sunny, Windy'],'Sunny')
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Cloudy, light snow accumulating 1-3"','Heavy lake effect snow'],'Snow')
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Clear and cold','Clear and Cool','Cloudy and cold','Sunny and cold','Cloudy and Cool'],'Cold')


In [None]:
datasetInjuryPlaylist['Weather']=datasetInjuryPlaylist['Weather'].replace(['Cloudy, 50% change of rain','10% Chance of Rain','30% Chance of Rain','Rain Chance 40%','Rain likely, temps in low 40s.'],'RainChance')

In [None]:
sorted(list(set(datasetInjuryPlaylist['Weather'].values)))

In [None]:
sns.catplot(x="BodyPart", kind="count",hue="Weather",data=datasetInjuryPlaylist)

I don't think there is a real relationship between cloudy weather and a higher number of injuries.
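
A possible way to check this kind of doubt is to compare injuries with exposure, that is, with how many plays took place under each weather. The numbers below are completely made up and only illustrate the calculation; they are not results from the dataset.

```python
import pandas as pd

# Made-up numbers, for illustration only
plays_per_weather = pd.Series({"Cloudy": 600, "Sunny": 300, "Rain": 100})
injuries_per_weather = pd.Series({"Cloudy": 6, "Sunny": 3, "Rain": 1})

# Injuries per 1000 plays: a rate that does not depend on how common the weather is
injuries_per_1000_plays = injuries_per_weather / plays_per_weather * 1000

print(injuries_per_1000_plays)
```

In this toy case cloudy weather has the most injuries in absolute terms, yet the rate is the same for every weather, because cloudy games simply have more plays.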

3-Relation between type of play and type of injury.

In [None]:
sns.catplot(x="BodyPart", kind="count",hue="PlayType",data=datasetInjuryPlaylist)

* The type of play associated with the most injuries is the pass play, especially for knee and ankle injuries.

4-Relation between Temperature and number of injuries.

In [None]:
sorted(list(set(datasetInjuryPlaylist['Temperature'].values)))

Since a temperature of -999 °F is impossible (it looks like a placeholder for missing readings), I'm going to exclude that value.
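
An alternative to dropping the rows, sketched on invented values, is to turn the placeholder into NaN. It assumes that -999 really marks a missing reading; pandas statistics then skip it and the rest of the row is kept.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only
toy = pd.DataFrame({"Temperature": [68, -999, 75, -999, 54]})

# Assumption: -999 is a placeholder for "no reading", so mark it as missing
toy["Temperature"] = toy["Temperature"].replace(-999, np.nan)

print(toy["Temperature"].isna().sum())  # number of readings now marked as missing
print(toy["Temperature"].mean())        # mean computed over the valid readings only
```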

In [None]:
newTemperatureInjuryPlaylist=datasetInjuryPlaylist[datasetInjuryPlaylist.Temperature >-999]

In [None]:
g = sns.catplot(x="BodyPart", y="Temperature", kind="violin", inner=None, data=newTemperatureInjuryPlaylist)


Higher temperatures seem to be associated with more injuries.

**C)PlayerTrackData**

In [None]:
playerTrackData=pd.read_csv('../input/nfl-playing-surface-analytics/PlayerTrackData.csv')

In [None]:
print(playerTrackData.shape)
playerTrackData.info()
print(playerTrackData.describe())

I'm going to merge this dataset with InjuryRecord.csv.

In [None]:
totalDataset1 = pd.merge(injuryRecord, playerTrackData, on='PlayKey')

In [None]:
totalDataset1.info()

The "event" column has a lot of NaN values. Let's fill these empty values with the mode of the column.

In [None]:

#Assign the result back instead of using fillna(inplace=True) on a selected column
totalDataset1["event"] = totalDataset1["event"].fillna(totalDataset1["event"].mode()[0])

In [None]:
totalDataset1['event']

1-Which event is associated with the largest number of injuries?

In [None]:
#First let's check the values that the attribute "event" can take
sorted(list(set(totalDataset1['event'].values)))

In [None]:
#let's make this attribute simpler
totalDataset1['event']=totalDataset1['event'].replace(['fumble_defense_recovered','fumble','fumble_offense_recovered'],'fumbleRelated')
totalDataset1['event']=totalDataset1['event'].replace(['pass_arrived','pass_forward','pass_outcome_caught','pass_outcome_incomplete','pass_outcome_interception'],'passRelated')

In [None]:
totalDataset1['event']=totalDataset1['event'].replace(['punt','punt_downed','punt_fake','punt_land','punt_muffed','punt_play','punt_received'],'puntRelated')
totalDataset1['event']=totalDataset1['event'].replace(['kick_received','kickoff','kickoff_play','onside_kick'],'kickRelated')

In [None]:
totalDataset1['event']=totalDataset1['event'].replace(['huddle_break_offense','huddle_start_offense'],'huddle')

In [None]:
sorted(list(set(totalDataset1['event'].values)))

In [None]:
sns.catplot(x="BodyPart", kind="count",hue="event",data=totalDataset1)

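Injuries seem to be associated mostly with the ball_snap event. Keep in mind, though, that the NaN values were filled with the mode of the column, which inflates the count of the most frequent event.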
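
To see how much the fill can weigh on this kind of conclusion, here is a small sketch on an invented series. The `no_event` label is a name I made up for the example; it is not a value from the dataset.

```python
import pandas as pd

# Toy series, invented for illustration only: most tracking rows carry no event
events = pd.Series(["ball_snap", "ball_snap", None, None, None,
                    "tackle", "pass_forward"])

# Counts of the events that were actually recorded (NaN is ignored)
raw_counts = events.value_counts()

# Filling with the mode adds every missing row to the most frequent event
filled_counts = events.fillna(events.mode()[0]).value_counts()

# Filling with an explicit label keeps the missing rows separate
labelled_counts = events.fillna("no_event").value_counts()

print(raw_counts)
print(filled_counts)
print(labelled_counts)
```

In the toy series ball_snap goes from 2 rows to 5 after the mode fill, while the explicit label leaves the recorded events untouched.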