# Data Cleaning and EDA

Hi all! Kyra here! I made this file for us to do our data cleaning and EDA in. I thought it would be helpful to have a separate notebook to work on this stuff since it can get kinda long and messy. Then we can either paste the necessary cells into checkpoint/final/etc OR we can just save pictures of the outputs and redirect readers to view this notebook if they want to see the code (I have done this in the past and it makes the final much more readable and nice to look at). 

Anyhow, I have just added the dataset for injuries to our repo and I'm going to do some things to get a quick look at it, but I wanted y'all to be able to see what I was doing and possibly do some exploration of your own too. (3/2)

In [2]:
#setup
import numpy as np
import pandas as pd

In [4]:
#read injury data from csv
injuries = pd.read_csv('injuries_2010-2020.csv')
injuries.head()

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
0,2010-10-03,Bulls,,Carlos Boozer,fractured bone in right pinky finger (out inde...
1,2010-10-06,Pistons,,Jonas Jerebko,torn right Achilles tendon (out indefinitely)
2,2010-10-06,Pistons,,Terrico White,broken fifth metatarsal in right foot (out ind...
3,2010-10-08,Blazers,,Jeff Ayres,torn ACL in right knee (out indefinitely)
4,2010-10-08,Nets,,Troy Murphy,strained lower back (out indefinitely)


In [10]:
injuries.shape #(27105, 5)

(27105, 5)

In [45]:
#Number of non "NaN" entries in acquired column
injuries[injuries['Acquired'].apply(lambda x: isinstance(x, str))].shape #(9452, 5)
#Number of non "NaN" entries in Relinquished column
injuries[injuries['Relinquished'].apply(lambda x: isinstance(x, str))].shape #(17560, 5)

#NOTE: The NaN entries are plain float NaN (np.nan is itself just float("NaN")). NaN never compares equal to anything,
#including itself, so an == np.nan filter always comes back False. isna()/notna() are the checks to use instead.

(17560, 5)
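
A quick sketch of the NaN point from the note in the cell above, on a tiny made-up frame (the `demo` frame and the "Player A/B/C" names are placeholders, not rows from our data). It only shows that `notna()` gives the same counts as the `isinstance` filter and that equality checks against NaN cannot work.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the injuries frame (not real data)
demo = pd.DataFrame({
    'Acquired': [np.nan, 'Player A', np.nan],
    'Relinquished': ['Player B', np.nan, 'Player C'],
})

# np.nan is an ordinary Python float, and NaN never compares equal to itself
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = (np.nan == np.nan)

# notna() flags the filled-in entries without any equality or type checks
n_acquired = int(demo['Acquired'].notna().sum())
n_relinquished = int(demo['Relinquished'].notna().sum())

# Same count as the isinstance-based filter
n_acquired_isinstance = int(
    demo['Acquired'].apply(lambda x: isinstance(x, str)).sum()
)

print(nan_is_float, nan_equals_itself, n_acquired, n_relinquished)
```

On the real frame, that would be `injuries['Acquired'].notna().sum()` and `injuries['Relinquished'].notna().sum()`.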

In [8]:
#Number of unique players we have injuries recorded for
injuries['Relinquished'].nunique() #1156
#Number of unique players who came back after injuries
injuries['Acquired'].nunique() #1111

1156

There are 27,105 rows in our original dataframe. Rows refer to a player either being acquired (9,452 rows total) or relinquished (17,560). Those two counts only add up to 27,012, so at least 93 rows have a name in neither column, and we should take a look at those. There are 1,156 unique players in the relinquished column and 1,111 unique players in the acquired column, both far below the row counts. Thus, players repeat, which is to be expected since someone can be injured multiple times.
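
One way to cross-check the acquired and relinquished counts against the total is to sort every row into one of four buckets. This is only a sketch on a made-up frame (`demo` and the player names are placeholders); the same masks should work on `injuries` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the injuries frame (names and rows are made up)
demo = pd.DataFrame({
    'Acquired': [np.nan, 'Player A', np.nan, np.nan],
    'Relinquished': ['Player B', np.nan, 'Player C', np.nan],
})

has_acquired = demo['Acquired'].notna()
has_relinquished = demo['Relinquished'].notna()

# Every row falls into exactly one bucket, so the counts sum to len(demo)
counts = {
    'acquired only': int((has_acquired & ~has_relinquished).sum()),
    'relinquished only': int((~has_acquired & has_relinquished).sum()),
    'both': int((has_acquired & has_relinquished).sum()),
    'neither': int((~has_acquired & ~has_relinquished).sum()),
}
print(counts)
```

Rows that land in "both" or "neither" are the ones worth reading by hand before we decide how to treat them.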

However, taking player "Jonas Jerebko" as an example (see below), we can see that we sometimes have multiple entries for the same injury (look at rows 1, 5, and 75, where 5 reports surgery to address the injury reported in 1 and 75 moves him from out indefinitely to IL as he recovers from this surgery). 

My gut reaction would be to just combine these three rows into one entry based on the year (all of this happens in 2010), but later, in 2017, he has two separate illnesses, one in January and one in March. We could get around this by rephrasing our question, or by defining our ground truth as whether the player was placed on IL or was out indefinitely at some point during the year, but that changes things a little bit. We'll need to make a firm decision, since I doubt we have the time to do this kind of investigation for each of the 1156 unique players in the dataset.

Also, there's no "acquired" entry after Jerebko's recovery from surgery, so we'll have to take a look at that.

In [51]:
injuries[(injuries['Relinquished'] == 'Jonas Jerebko') | (injuries['Acquired'] == 'Jonas Jerebko')]

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
1,2010-10-06,Pistons,,Jonas Jerebko,torn right Achilles tendon (out indefinitely)
5,2010-10-08,Pistons,,Jonas Jerebko,surgery to repair torn right Achilles tendon
75,2010-10-27,Pistons,,Jonas Jerebko,placed on IL recovering from surgery to repair...
15112,2016-03-16,Celtics,,Jonas Jerebko,placed on IL with sore left ankle
15198,2016-03-21,Celtics,Jonas Jerebko,,activated from IL
17397,2017-01-21,Celtics,,Jonas Jerebko,placed on IL with illness
17455,2017-01-24,Celtics,Jonas Jerebko,,activated from IL
18026,2017-03-06,Celtics,,Jonas Jerebko,flu (DTD)
18051,2017-03-08,Celtics,Jonas Jerebko,,returned to lineup
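
As one possible alternative to combining rows by year, here is a sketch that groups a player's rows into "episodes": a new episode starts at a relinquished row that follows a return (or a long gap), and everything up to the next return belongs to it. The rows are Jerebko's from the output above with Notes left out. `MAX_GAP_DAYS`, `IsOut`, and `Episode` are names and values I made up for illustration, and the 180-day gap is an arbitrary assumption, not something we have decided on.

```python
import numpy as np
import pandas as pd

# Jonas Jerebko's rows as shown in the output above (Notes omitted)
rows = [
    ('2010-10-06', np.nan, 'Jonas Jerebko'),
    ('2010-10-08', np.nan, 'Jonas Jerebko'),
    ('2010-10-27', np.nan, 'Jonas Jerebko'),
    ('2016-03-16', np.nan, 'Jonas Jerebko'),
    ('2016-03-21', 'Jonas Jerebko', np.nan),
    ('2017-01-21', np.nan, 'Jonas Jerebko'),
    ('2017-01-24', 'Jonas Jerebko', np.nan),
    ('2017-03-06', np.nan, 'Jonas Jerebko'),
    ('2017-03-08', 'Jonas Jerebko', np.nan),
]
demo = pd.DataFrame(rows, columns=['Date', 'Acquired', 'Relinquished'])
demo['Date'] = pd.to_datetime(demo['Date'])

MAX_GAP_DAYS = 180  # assumption: a longer silence means a new injury

# One player column plus a flag for "this row takes the player out"
demo['Player'] = demo['Relinquished'].fillna(demo['Acquired'])
demo['IsOut'] = demo['Relinquished'].notna()
demo = demo.sort_values(['Player', 'Date']).reset_index(drop=True)

by_player = demo.groupby('Player')
prev_is_out = by_player['IsOut'].shift(fill_value=False).astype(bool)
gap_days = (demo['Date'] - by_player['Date'].shift()).dt.days

# A new episode starts at an "out" row that follows a return, the start of
# the player's history, or a gap longer than MAX_GAP_DAYS
new_episode = demo['IsOut'] & (~prev_is_out | (gap_days > MAX_GAP_DAYS))
demo['Episode'] = new_episode.astype(int).groupby(demo['Player']).cumsum()

episodes = demo.groupby(['Player', 'Episode']).agg(
    start=('Date', 'min'),
    end=('Date', 'max'),
    n_rows=('Date', 'size'),
    returned=('IsOut', lambda s: bool((~s).any())),
).reset_index()
print(episodes)
```

On these rows the three 2010 entries collapse into one episode with no recorded return, and the 2016 and 2017 entries come out as three separate episodes. Without the gap rule, the 2016 ankle injury would be merged into the 2010 Achilles episode because no "acquired" row separates them.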


In [59]:
#It seemed that a lot of the acquired messages were very similar when I was scrolling. I was right.
#Most are either activated from IL or returned to lineup (~ 97 %)

injuries[injuries['Acquired'].apply(lambda x: isinstance(x, str))]['Notes'].value_counts(normalize = True)

#In cleaning, we should make them all the same, 
#except for the few that might warrant keeping distinct (ex. 'torn ACL in right knee (out for season)')

#Below I have code for checking out one such example, but I think a Google might help too.
injuries[injuries['Notes'] == 'surgery on right knee to repair torn lateral meniscus (out for season)']
injuries[(injuries['Acquired'] == 'Russell Westbrook') | (injuries['Relinquished'] == 'Russell Westbrook')]
#side note: This poor guy! 

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
7405,2013-04-26,Thunder,,Russell Westbrook,torn lateral meniscus in right knee (out indef...
7413,2013-04-27,Thunder,,Russell Westbrook,placed on IL with torn lateral meniscus in rig...
7414,2013-04-27,Thunder,Russell Westbrook,,surgery on right knee to repair torn lateral m...
7498,2013-10-01,Thunder,,Russell Westbrook,arthroscopic surgery on right knee (out indefi...
7585,2013-10-30,Thunder,,Russell Westbrook,placed on IL recovering from arthroscopic surg...
7640,2013-11-03,Thunder,Russell Westbrook,,activated from IL
7957,2013-11-24,Thunder,,Russell Westbrook,placed on IL with knee injury
8004,2013-11-27,Thunder,Russell Westbrook,,activated from IL
8480,2013-12-27,Thunder,,Russell Westbrook,arthroscopic surgery on right knee (out indefi...
8482,2013-12-27,Thunder,,Russell Westbrook,placed on IL recovering from arthroscopic surg...
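
For the cleaning idea in the cell above (making the routine "acquired" notes all the same while keeping the unusual ones), a minimal sketch could look like this. The note strings come from the outputs above; the mix of rows, the `ROUTINE_RETURNS` set, the `'returned'` label, and the `NotesClean` column are my own placeholders.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the note strings appear in the outputs above
demo = pd.DataFrame({
    'Acquired': ['Jonas Jerebko', 'Jonas Jerebko', 'Russell Westbrook', np.nan],
    'Relinquished': [np.nan, np.nan, np.nan, 'Jonas Jerebko'],
    'Notes': [
        'activated from IL',
        'returned to lineup',
        'surgery on right knee to repair torn lateral meniscus (out for season)',
        'flu (DTD)',
    ],
})

ROUTINE_RETURNS = {'activated from IL', 'returned to lineup'}  # assumed list
RETURN_LABEL = 'returned'  # hypothetical unified label

is_acquired = demo['Acquired'].notna()
is_routine = demo['Notes'].str.strip().isin(ROUTINE_RETURNS)

# Overwrite only routine return notes; everything else keeps its original text
demo['NotesClean'] = demo['Notes'].mask(is_acquired & is_routine, RETURN_LABEL)

# Acquired rows with a non-routine note are the ones to inspect by hand
unusual = demo[is_acquired & ~is_routine]
print(demo[['Notes', 'NotesClean']])
print(unusual)
```

Writing to a new column instead of overwriting `Notes` keeps the original text around in case we change our minds about which notes count as routine.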


**To Do:** Learn basketball jargon for referring to injuries. Specifically, this acquired/relinquished business, "IL", "DTD", and "out indefinitely" (cuz a lot of these "out indefinitely" guys come back not too long after).

In [62]:
#Getting bored/overwhelmed so switching focus to teams
print(injuries['Team'].value_counts())
injuries['Team'].nunique()

#Bullets is sus (with only one entry)
#Bobcats were renamed the Hornets (Charlotte) in 2014, and the New Orleans Hornets became the Pelicans in 2013,
#so "Hornets" in this data covers two different franchises depending on the date

Spurs           1163
Bucks           1068
Warriors        1060
Rockets         1058
Raptors         1044
Celtics         1040
Nets            1024
Heat            1023
Cavaliers       1001
Mavericks        992
Hawks            975
Nuggets          966
Lakers           959
Knicks           943
76ers            910
Grizzlies        875
Wizards          875
Timberwolves     860
Jazz             841
Magic            834
Pacers           831
Bulls            791
Suns             733
Kings            728
Hornets          719
Clippers         718
Thunder          717
Pistons          714
Blazers          695
Pelicans         576
Bobcats          369
Bullets            1
Name: Team, dtype: int64


32
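
If we end up wanting one label per franchise, the Bobcats/Hornets/Pelicans overlap could be handled with a date-based remap along these lines. This is a sketch under assumptions: the rows are made up, the `Franchise` column is a name I picked, and `CUTOFF` is just a date that falls between the two renamings (no team used the Hornets name in that window). The single "Bullets" row is deliberately left alone here, since it probably needs a look by hand.

```python
import pandas as pd

# Hypothetical rows covering the labels involved (not real data)
demo = pd.DataFrame({
    'Date': pd.to_datetime([
        '2011-01-10', '2012-02-01', '2014-03-05', '2015-11-20', '2016-01-15',
    ]),
    'Team': ['Hornets', 'Bobcats', 'Pelicans', 'Hornets', 'Spurs'],
})

# Assumed cutoff between the 2013 and 2014 renamings
CUTOFF = pd.Timestamp('2013-10-01')

demo['Franchise'] = demo['Team']

# "Hornets" before the cutoff is the New Orleans team, now the Pelicans
old_new_orleans = (demo['Team'] == 'Hornets') & (demo['Date'] < CUTOFF)
demo.loc[old_new_orleans, 'Franchise'] = 'Pelicans'

# "Bobcats" is the Charlotte team, now the Hornets
demo.loc[demo['Team'] == 'Bobcats', 'Franchise'] = 'Hornets'

print(demo)
print(demo['Franchise'].nunique())
```

Whether to map onto current names like this or keep the historical names is a judgment call; the main thing is that "Hornets" should not be treated as one team across the whole date range.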