<a id='start'></a>
# Wrangling

This notebook walks through the main pandas methods for exploring and manipulating a dataset. <br>
<br>
The notebook is divided into the following sections:<br>
- [Summary Functions](#section1); <br>
- [Grouping and Sorting](#section2); <br>
- [Data Types](#section3); <br>
- [Wrangling Data](#section4)<br>
    - [Missing Value](#section5)<br>


We import the Titanic dataset:

In [3]:
import pandas as pd

titanic = pd.read_csv("train_dataset_titanic.csv")

<a id='section1'></a>
## Summary Function

In [4]:
# Let's look at the first 5 rows of the dataset (head() returns 5 rows by default)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


You can inspect the dimensions of a dataset through the **shape** attribute: the first value is the number of rows and the second is the number of columns:

In [5]:
titanic.shape

(891, 12)

In [6]:
print("Train Titanic dataset:", titanic.shape[0], "rows and", titanic.shape[1], "columns")

Train Titanic dataset: 891 rows and 12 columns


You can get a quick statistical summary of the numeric fields of a dataset via the **describe** method:

In [7]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The output of the **describe** method depends on the type of the field being described. For a string field, for example, you get the count, the number of unique values, the most frequent value and its frequency:

In [8]:
titanic.Embarked.describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

Quick calculations that are often useful when analyzing a field are the *average* (**mean**) and the *median* (**median**). Both methods are built into pandas.

In [9]:
print("The mean age on the Titanic was:", round(titanic.Age.mean(),0), "years")
print("The median of age on the Titanic was:", round(titanic.Age.median(),0), "years")

The mean age on the Titanic was: 30.0 years
The median of age on the Titanic was: 28.0 years


For a field containing strings, a very useful method for seeing the distinct values in the field is **unique**:

In [10]:
titanic.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

If, in addition to the list of unique values, we want to know how often each one occurs, we can use the **value_counts** method (missing values are excluded by default):

In [11]:
titanic.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64
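
The counts above add up to 889 rather than 891 because the two missing values are left out. As a small sketch of how to include them, the following uses a hypothetical toy Series (not the real Embarked column) together with the `dropna` and `normalize` parameters of `value_counts`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for titanic.Embarked
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'], name='Embarked')

# Default behaviour: missing values are excluded from the counts
counts = embarked.value_counts()

# dropna=False adds an entry for the missing values
counts_with_nan = embarked.value_counts(dropna=False)

# normalize=True returns relative frequencies instead of absolute counts
shares = embarked.value_counts(normalize=True)

print(counts_with_nan)
print(shares)
```

The relative frequencies are computed over the non-missing values only, unless `dropna=False` is passed as well.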

<a id='section2'></a>
## Grouping and Sorting

Often we want to group our data and then compute something for each group. <br>
To do this, we can use the **groupby** method.
<br>
<br>
An example makes this clearer: suppose we want to count how many males and how many females there are in our Titanic dataset.

In [12]:
titanic.groupby('Sex').Sex.count()

Sex
female    314
male      577
Name: Sex, dtype: int64

The interesting thing is that *groupby* also lets us answer more complex questions, such as: <br>
*What is the maximum age within the group of males and within the group of females?*

In [13]:
titanic.groupby('Sex').Age.max()

Sex
female    63.0
male      80.0
Name: Age, dtype: float64

It is also possible to calculate several metrics on the same selection, thanks to the **agg** method, as in the following example:

In [14]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [15]:
# For each (Pclass, Sex) group, compute the maximum age and the number of rows
titanic.groupby(['Pclass', 'Sex']).Age.agg(['max', len])

Unnamed: 0_level_0,Unnamed: 1_level_0,max,len
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1
1,female,63.0,94.0
1,male,80.0,122.0
2,female,57.0,76.0
2,male,70.0,108.0
3,female,63.0,144.0
3,male,74.0,347.0
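
The same kind of summary can be written with named aggregation, which gives each output column an explicit name. This is only a sketch on a hypothetical toy DataFrame that reuses the Titanic column names; the output names `max_age`, `n_rows` and `n_ages` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic dataset
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'male', 'male', 'female', 'male', 'male'],
    'Age': [38.0, 54.0, np.nan, 26.0, 22.0, 35.0],
})

summary = toy.groupby(['Pclass', 'Sex']).Age.agg(
    max_age='max',   # oldest passenger in the group
    n_rows='size',   # number of rows, missing ages included
    n_ages='count',  # number of non-missing ages
)
print(summary)
```

Comparing `n_rows` with `n_ages` shows how many ages are missing in each group, which `len` alone does not reveal.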


You can sort the dataset with the **sort_values** method:

In [16]:
titanic.sort_values(by='Sex').head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
383,384,1,1,"Holverson, Mrs. Alexander Oskar (Mary Aline To...",female,35.0,1,0,113789,52.0,,S
218,219,1,1,"Bazzani, Miss. Albina",female,32.0,0,0,11813,76.2917,D15,C
609,610,1,1,"Shutes, Miss. Elizabeth W",female,40.0,0,0,PC 17582,153.4625,C125,S
216,217,1,3,"Honkanen, Miss. Eliina",female,27.0,0,0,STON/O2. 3101283,7.925,,S
215,216,1,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C


By default, **sort_values** sorts the dataset in ascending order; to sort it in descending order, pass *ascending=False*:

In [17]:
titanic.sort_values(by='Age', ascending=False).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q


You can also sort by several columns at once:

In [18]:
titanic.sort_values(by=['Age', 'Sex'], ascending=False).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
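
When sorting by several columns, `ascending` can also take one flag per column. A minimal sketch on a hypothetical toy DataFrame (the names A to D are placeholders, not real passengers):

```python
import pandas as pd

# Hypothetical toy data reusing the Titanic column names
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [30.0, 30.0, 22.0, 45.0],
})

# Sex ascending (female first), then Age descending within each sex
ordered = toy.sort_values(by=['Sex', 'Age'], ascending=[True, False])
print(ordered)
```

The order of the names in `by` matters: the first column is the primary sort key, and later columns only break ties.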


<a id='section3'></a>
## Data Types

The type of data contained in a dataset field is available through the **dtype** attribute.

In [17]:
titanic.Age.dtype

dtype('float64')

To see the data type of every column of the dataset at once, use **dtypes** instead:

In [18]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

You can convert a column from one type to another with the **astype** method. The target type must be consistent with the values in the field: for example, a string such as 'abc' cannot be converted to a number, whereas '42' can.

In [None]:
pd.set_option('display.max_rows', 5) # limit the number of rows displayed in output to 5

titanic.Survived.astype('float64') # Convert the Survived field from int64 to float64
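
To illustrate the consistency requirement, here is a small sketch with two hypothetical Series of strings: one that holds only numeric text and one that does not.

```python
import pandas as pd

# Hypothetical example columns, both stored as text
numeric_text = pd.Series(['1', '2', '3'])
mixed_text = pd.Series(['1', 'two', '3'])

# Strings that look like integers convert cleanly
as_int = numeric_text.astype('int64')

# A value such as 'two' cannot be parsed, so astype raises a ValueError
try:
    mixed_text.astype('int64')
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_int.dtype, conversion_failed)
```

When some values may be unparseable, `pd.to_numeric(..., errors='coerce')` is a more forgiving alternative, since it replaces them with NaN instead of raising.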

The **astype** method is also useful for categorical fields whose values have a logical order. In this case you can define a *mapping* that assigns each category a number, where the number gives the category's position in that order. <br>
For example:

In [None]:
# We create a list in which we indicate a customer's satisfaction
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy']

# We create a DataFrame of possible satisfactions
df = pd.DataFrame({'satisfaction':['Bad', 'Happy', 'Unhappy', 'Neutral']})
df

In [39]:
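# Keyword arguments to astype("category", ...) are no longer accepted; pass a CategoricalDtype instead
df.satisfaction = df.satisfaction.astype(pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)).cat.codes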



In [40]:
df

Unnamed: 0,satisfaction
0,-1
1,3
2,1
3,2


The 'Bad' record was mapped to -1 because it does not appear in our list of categories (ordered_satisfaction).<br>
This technique is useful whenever we have a categorical field that we want to order according to our own logic; in our example the logic was given by the list *ordered_satisfaction*.
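
Keeping the column as an ordered categorical, rather than replacing it with its codes, lets pandas sort by the category order directly. The sketch below reuses the `ordered_satisfaction` list from the example; the variable names `satisfaction_dtype` and `as_category` are introduced here for illustration only:

```python
import pandas as pd

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy']
satisfaction_dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)

df = pd.DataFrame({'satisfaction': ['Bad', 'Happy', 'Unhappy', 'Neutral']})
as_category = df.satisfaction.astype(satisfaction_dtype)

# Values outside the category list become missing, which .cat.codes reports as -1
codes = as_category.cat.codes

# An ordered categorical sorts by category order, not alphabetically
sorted_values = as_category.sort_values()

print(codes.tolist())
print(sorted_values.tolist())
```

Note that 'Bad' ends up as a missing value and is placed last by `sort_values`.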

Besides ordering a categorical field, you can also transform it into a set of indicator columns, one for each distinct value of the field. <br>
For example, returning to the Titanic dataset, we might want to encode the Sex field numerically by removing the categorical Sex column and replacing it with two columns, Sex_female and Sex_male. Each column holds a 1 in the rows that belong to that category and a 0 otherwise. To do this we use the function **get_dummies**.<br>
Let's look at the following example:

In [41]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [42]:
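# dtype=int keeps the indicator columns as 0/1 (recent pandas versions return True/False by default)
titanic = pd.get_dummies(titanic, columns=['Sex'], dtype=int)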
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,0,1
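
Since Sex_female and Sex_male carry the same information, one of the two columns is redundant. A short sketch, on a hypothetical toy DataFrame, of how the `drop_first` parameter of `get_dummies` keeps only one of them:

```python
import pandas as pd

# Hypothetical toy data reusing the Titanic column names
toy = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Sex': ['male', 'female', 'female']})

# One indicator column per distinct value, stored as 0/1 integers
both = pd.get_dummies(toy, columns=['Sex'], dtype=int)

# drop_first=True drops the first category (female), leaving only Sex_male
single = pd.get_dummies(toy, columns=['Sex'], dtype=int, drop_first=True)

print(both)
print(single)
```

Whether to drop a column depends on what the data is used for; some models are sensitive to perfectly correlated inputs, while for plain exploration both columns can be convenient.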


<a id='section4'></a>
## Wrangling Data

If we want to delete a column or a particular row from our dataset we can use the **drop** method.

In [43]:
# I delete the Ticket column from the Titanic dataset
titanic = titanic.drop(labels=['Ticket'], axis = 1)
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,C85,C,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,30.0000,C148,C,0,1
890,891,0,3,"Dooley, Mr. Patrick",32.0,0,0,7.7500,,Q,0,1
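
The example above removes a column; rows are removed the same way by index label. A minimal sketch on a hypothetical toy DataFrame, using the `columns=` and `index=` keywords of `drop` as an alternative to `labels=` with `axis=`:

```python
import pandas as pd

# Hypothetical toy data; names and ticket codes are placeholders
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Ticket': ['t1', 't2', 't3'],
    'Fare': [7.25, 71.28, 7.92],
})

# columns=[...] is equivalent to labels=[...], axis=1
no_ticket = toy.drop(columns=['Ticket'])

# index=[...] removes rows by their index label
no_first_row = toy.drop(index=[0])

print(no_ticket)
print(no_first_row)
```

`drop` returns a new DataFrame and leaves `toy` unchanged, which is why the notebook assigns the result back to `titanic`.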


We can also remove duplicates from our dataset with the **drop_duplicates** method. By default pandas compares all columns; the *subset* parameter restricts the check to the columns you specify.

In [44]:
# For each (Pclass, Age) combination we keep the first record and delete the others
titanic_2 = titanic.drop_duplicates(subset=['Pclass', 'Age'])
titanic_2

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,C85,C,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
843,844,0,3,"Lemberopolous, Mr. Peter L",34.5,0,0,6.4375,,C,0,1
851,852,0,3,"Svensson, Mr. Johan",74.0,0,0,7.7750,,S,0,1


Deleting duplicates leaves "holes" in the row index. To fix this, it is useful to reset the index with the **reset_index** method.

In [45]:
titanic_2 = titanic_2.reset_index(drop=True)
titanic_2

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,C85,C,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
183,844,0,3,"Lemberopolous, Mr. Peter L",34.5,0,0,6.4375,,C,0,1
184,852,0,3,"Svensson, Mr. Johan",74.0,0,0,7.7750,,S,0,1
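
Which duplicate survives is controlled by the `keep` parameter, and `ignore_index=True` renumbers the result in the same call. The following is a sketch on a hypothetical toy DataFrame, not on the Titanic data:

```python
import pandas as pd

# Hypothetical toy data: rows A, C and D share the same Pclass and Age
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pclass': [3, 1, 3, 3],
    'Age': [22.0, 38.0, 22.0, 22.0],
})

# keep='first' (the default) retains row A; keep='last' retains row D
first_kept = toy.drop_duplicates(subset=['Pclass', 'Age'])
last_kept = toy.drop_duplicates(subset=['Pclass', 'Age'], keep='last')

# ignore_index=True renumbers the rows, like a follow-up reset_index(drop=True)
renumbered = toy.drop_duplicates(subset=['Pclass', 'Age'], keep='last', ignore_index=True)

print(last_kept)
print(renumbered)
```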


Pandas automatically infers the type of data contained in each *Series* (column) of a *DataFrame*. However, when loading a dataset, especially one taken from the web (via *read_html*), pandas may not infer the types correctly. <br>
For example, let's try to work on the NHL player statistics table for the 2015 season (http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2):

In [20]:
pd.set_option('display.max_rows', 10) # limit the number of rows displayed in output to 10

# Let's load the table
tb_df = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header=1)[0]
print(tb_df)

     RK              PLAYER TEAM  GP   G   A PTS  +/- PIM PTS/G  SOG   PCT  \
0     1      Jamie Benn, LW  DAL  82  35  52  87    1  64  1.06  253  13.8   
1     2     John Tavares, C  NYI  82  38  48  86    5  46  1.05  278  13.7   
2     3    Sidney Crosby, C  PIT  77  28  56  84    5  47  1.09  237  11.8   
3     4   Alex Ovechkin, LW  WSH  81  53  28  81   10  58  1.00  395  13.4   
4   NaN   Jakub Voracek, RW  PHI  82  22  59  81    1  78  0.99  221  10.0   
..  ...                 ...  ...  ..  ..  ..  ..  ...  ..   ...  ...   ...   
41  NaN  Jaden Schwartz, LW  STL  75  28  35  63   13  16  0.84  184  15.2   
42  NaN   Filip Forsberg, C  NSH  82  26  37  63   15  24  0.77  237  11.0   
43  NaN   Jordan Eberle, RW  EDM  81  24  39  63  -16  24  0.78  183  13.1   
44  NaN    Ondrej Palat, LW   TB  75  16  47  63   31  24  0.84  139  11.5   
45   40     Zach Parise, LW  MIN  74  33  29  62   21  41  0.84  259  12.7   

   GWG G.1 A.1 G.2 A.2  
0    6  10  13   2   3  
1    8  13  1

In [21]:
# We rename the columns
head_table = ['RK', 'PLAYER', 'TEAM', 'GP', 'G', 'A', 'PTS', 'PLUS_MINUS', 'PIM', 'PTS_G', 'SOG', 'PCT', 'GWG', 'PP_G', 'PP_A', 'SH_G', 'SH_A']
tb_df.columns = head_table
print(tb_df)

     RK              PLAYER TEAM  GP   G   A PTS PLUS_MINUS PIM PTS_G  SOG  \
0     1      Jamie Benn, LW  DAL  82  35  52  87          1  64  1.06  253   
1     2     John Tavares, C  NYI  82  38  48  86          5  46  1.05  278   
2     3    Sidney Crosby, C  PIT  77  28  56  84          5  47  1.09  237   
3     4   Alex Ovechkin, LW  WSH  81  53  28  81         10  58  1.00  395   
4   NaN   Jakub Voracek, RW  PHI  82  22  59  81          1  78  0.99  221   
..  ...                 ...  ...  ..  ..  ..  ..        ...  ..   ...  ...   
41  NaN  Jaden Schwartz, LW  STL  75  28  35  63         13  16  0.84  184   
42  NaN   Filip Forsberg, C  NSH  82  26  37  63         15  24  0.77  237   
43  NaN   Jordan Eberle, RW  EDM  81  24  39  63        -16  24  0.78  183   
44  NaN    Ondrej Palat, LW   TB  75  16  47  63         31  24  0.84  139   
45   40     Zach Parise, LW  MIN  74  33  29  62         21  41  0.84  259   

     PCT GWG PP_G PP_A SH_G SH_A  
0   13.8   6   10   13    2 

In [22]:
tb_df = tb_df.drop(labels=['PLUS_MINUS', 'PIM', 'PTS_G', 'SOG', 'PCT', 'GWG', 'PP_G', 'PP_A', 'SH_G', 'SH_A'], axis=1)
tb_df

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82,35,52,87
1,2,"John Tavares, C",NYI,82,38,48,86
2,3,"Sidney Crosby, C",PIT,77,28,56,84
3,4,"Alex Ovechkin, LW",WSH,81,53,28,81
4,,"Jakub Voracek, RW",PHI,82,22,59,81
...,...,...,...,...,...,...,...
41,,"Jaden Schwartz, LW",STL,75,28,35,63
42,,"Filip Forsberg, C",NSH,82,26,37,63
43,,"Jordan Eberle, RW",EDM,81,24,39,63
44,,"Ondrej Palat, LW",TB,75,16,47,63


In [23]:
# Let's check which data type pandas has inferred for each column
tb_df.dtypes

RK        object
PLAYER    object
TEAM      object
GP        object
G         object
A         object
PTS       object
dtype: object

As the output above shows, pandas did not infer the correct type for each field during loading: every column was read as *object*. We must therefore convert each field to the correct type.

In [24]:
tb_df.GP = pd.to_numeric(tb_df.GP, errors='coerce')
tb_df.G = pd.to_numeric(tb_df.G, errors='coerce')
tb_df.A = pd.to_numeric(tb_df.A, errors='coerce')
tb_df.PTS = pd.to_numeric(tb_df.PTS, errors='coerce')
tb_df.dtypes

RK         object
PLAYER     object
TEAM       object
GP        float64
G         float64
A         float64
PTS       float64
dtype: object
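
The four conversions above can also be written in one step by applying `pd.to_numeric` to a list of columns. This sketch uses a small hypothetical table in which a repeated header row is mixed into the data, as can happen with scraped tables; the names `raw`, `numeric_cols` and `clean` are illustrative:

```python
import pandas as pd

# Hypothetical scraped table: every column arrives as text,
# and a repeated header row is mixed in with the data
raw = pd.DataFrame({
    'PLAYER': ['Jamie Benn, LW', 'PLAYER', 'John Tavares, C'],
    'GP': ['82', 'GP', '82'],
    'PTS': ['87', 'PTS', '86'],
})

numeric_cols = ['GP', 'PTS']

# errors='coerce' turns unparseable text (the repeated header) into NaN
raw[numeric_cols] = raw[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Rows where every numeric column failed to parse are header repeats; drop them
clean = raw.dropna(subset=numeric_cols, how='all').reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

The columns become float64 rather than int64 because NaN cannot be stored in a plain integer column.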

In [25]:
tb_df.head()

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1.0,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2.0,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3.0,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4.0,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
4,,"Jakub Voracek, RW",PHI,82.0,22.0,59.0,81.0


<a id='section5'></a>
### Missing Value

There are several methods to detect, remove and replace null values in a dataset: <br>

- **isnull()**  : returns a Boolean mask that is True wherever a value is missing <br>
- **notnull()** : the opposite of *isnull()* <br>
- **dropna()**  : returns a copy of the dataset without the rows (or columns) that contain missing values <br>
- **fillna()**  : returns a copy of the dataset, with the missing values replaced by a value chosen by the user 

We use the NHL player dataset to take advantage of the methods described above.

In [26]:
# We identify the elements of the dataset that contain missing values
tb_df.isnull()

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...
41,True,False,False,False,False,False,False
42,True,False,False,False,False,False,False
43,True,False,False,False,False,False,False
44,True,False,False,False,False,False,False
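
A full Boolean mask is hard to read on a large table, so it is common to summarise it. A minimal sketch on a hypothetical toy DataFrame (player names P1 to P4 are placeholders) that counts missing values per column and flags the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    'RK': ['1', '2', np.nan, np.nan],
    'PLAYER': ['P1', 'P2', 'P3', 'P4'],
    'PTS': [87.0, 86.0, 81.0, np.nan],
})

# True counts as 1, so summing the mask gives the number of missing values per column
missing_per_column = toy.isnull().sum()

# any(axis=1) is True for each row that has at least one missing value
rows_with_missing = toy.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```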


In [87]:
# Rows where there are no missing values in the RK column of the tb_df dataset
tb_df.RK.notnull()

0      True
1      True
2      True
3      True
4     False
      ...  
41    False
42    False
43    False
44    False
45     True
Name: RK, Length: 46, dtype: bool

In [27]:
# We display all the rows of the dataset where there is no missing value in the RK column
tb_df[tb_df.RK.notnull()]

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
5,6,"Nicklas Backstrom, C",WSH,82.0,18.0,60.0,78.0
...,...,...,...,...,...,...,...
29,26,"Pavel Datsyuk, C",DET,63.0,26.0,39.0,65.0
31,28,"Nikita Kucherov, RW",TB,82.0,28.0,36.0,64.0
35,RK,PLAYER,TEAM,,,,
40,35,"Radim Vrbata, RW",VAN,79.0,31.0,32.0,63.0


In [100]:
# Copy the tb_df dataset into the temp variable, then delete the rows that contain missing values
temp = tb_df.copy()
temp.dropna()

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
5,6,"Nicklas Backstrom, C",WSH,82.0,18.0,60.0,78.0
...,...,...,...,...,...,...,...
26,23,"Jonathan Toews, C",CHI,81.0,28.0,38.0,66.0
29,26,"Pavel Datsyuk, C",DET,63.0,26.0,39.0,65.0
31,28,"Nikita Kucherov, RW",TB,82.0,28.0,36.0,64.0
40,35,"Radim Vrbata, RW",VAN,79.0,31.0,32.0,63.0


As you can see above, by default *dropna* deletes every row in which at least one column has a missing value. To remove instead every column in which at least one value is missing, use the parameter *axis=1*.

In [102]:
# Let's drop the columns that contain at least one missing value
temp.dropna(axis=1)

Unnamed: 0,PLAYER,TEAM
0,"Jamie Benn, LW",DAL
1,"John Tavares, C",NYI
2,"Sidney Crosby, C",PIT
3,"Alex Ovechkin, LW",WSH
4,"Jakub Voracek, RW",PHI
...,...,...
41,"Jaden Schwartz, LW",STL
42,"Filip Forsberg, C",NSH
43,"Jordan Eberle, RW",EDM
44,"Ondrej Palat, LW",TB


To avoid deleting too many rows/columns from the dataset you can use the *how* and *thresh* parameters. With *how='all'* a row/column is deleted only if all of its values are missing, while *thresh* sets the minimum number of non-missing values a row/column must have in order to be kept.

In [104]:
# We keep only the rows that have at least 3 non-missing values
temp.dropna(thresh=3)

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
4,,"Jakub Voracek, RW",PHI,82.0,22.0,59.0,81.0
...,...,...,...,...,...,...,...
41,,"Jaden Schwartz, LW",STL,75.0,28.0,35.0,63.0
42,,"Filip Forsberg, C",NSH,82.0,26.0,37.0,63.0
43,,"Jordan Eberle, RW",EDM,81.0,24.0,39.0,63.0
44,,"Ondrej Palat, LW",TB,75.0,16.0,47.0,63.0
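
The difference between the default behaviour, `how='all'` and `thresh` is easier to see on a tiny table. The sketch below uses a hypothetical toy DataFrame whose rows have 3, 2, 1 and 0 non-missing values respectively:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 3 is entirely missing
toy = pd.DataFrame({
    'RK': [1.0, np.nan, np.nan, np.nan],
    'GP': [82.0, 82.0, np.nan, np.nan],
    'PTS': [87.0, 81.0, 63.0, np.nan],
})

any_missing = toy.dropna()            # default how='any': keeps only row 0
all_missing = toy.dropna(how='all')   # drops only row 3, where every value is missing
at_least_two = toy.dropna(thresh=2)   # keeps rows with 2 or more non-missing values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
```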


Sometimes we do not want to delete too many rows from the dataset. In that case, you can replace the missing values of one or more columns with the **fillna()** method.

In [105]:
# Replace the missing values with 0
temp.fillna(0)

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
4,0,"Jakub Voracek, RW",PHI,82.0,22.0,59.0,81.0
...,...,...,...,...,...,...,...
41,0,"Jaden Schwartz, LW",STL,75.0,28.0,35.0,63.0
42,0,"Filip Forsberg, C",NSH,82.0,26.0,37.0,63.0
43,0,"Jordan Eberle, RW",EDM,81.0,24.0,39.0,63.0
44,0,"Ondrej Palat, LW",TB,75.0,16.0,47.0,63.0


In [115]:
# Let's replace the missing values with the previous non-missing ones (forward fill)
temp.ffill()

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
4,4,"Jakub Voracek, RW",PHI,82.0,22.0,59.0,81.0
...,...,...,...,...,...,...,...
41,35,"Jaden Schwartz, LW",STL,75.0,28.0,35.0,63.0
42,35,"Filip Forsberg, C",NSH,82.0,26.0,37.0,63.0
43,35,"Jordan Eberle, RW",EDM,81.0,24.0,39.0,63.0
44,35,"Ondrej Palat, LW",TB,75.0,16.0,47.0,63.0


In [116]:
# Let's replace the missing values with the next non-missing ones (backward fill)
temp.bfill()

Unnamed: 0,RK,PLAYER,TEAM,GP,G,A,PTS
0,1,"Jamie Benn, LW",DAL,82.0,35.0,52.0,87.0
1,2,"John Tavares, C",NYI,82.0,38.0,48.0,86.0
2,3,"Sidney Crosby, C",PIT,77.0,28.0,56.0,84.0
3,4,"Alex Ovechkin, LW",WSH,81.0,53.0,28.0,81.0
4,6,"Jakub Voracek, RW",PHI,82.0,22.0,59.0,81.0
...,...,...,...,...,...,...,...
41,40,"Jaden Schwartz, LW",STL,75.0,28.0,35.0,63.0
42,40,"Filip Forsberg, C",NSH,82.0,26.0,37.0,63.0
43,40,"Jordan Eberle, RW",EDM,81.0,24.0,39.0,63.0
44,40,"Ondrej Palat, LW",TB,75.0,16.0,47.0,63.0
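
A single replacement value rarely suits every column. As a sketch, `fillna` also accepts a dict that maps each column to its own replacement; the toy DataFrame and the choice of 'unknown' and the column mean are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing team and one missing score
toy = pd.DataFrame({
    'TEAM': ['DAL', np.nan, 'PIT'],
    'PTS': [87.0, np.nan, 84.0],
})

# A dict maps each column to its own replacement value
filled = toy.fillna({'TEAM': 'unknown', 'PTS': toy['PTS'].mean()})

# ffill() and bfill() propagate the previous or the next non-missing value
forward = toy['PTS'].ffill()
backward = toy['PTS'].bfill()

print(filled)
```

Forward and backward filling only make sense when neighbouring rows are related, for example in a sorted ranking or a time series.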


**Useful Links:**
- Data Wrangling CheatSheet: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
- GroupBy: https://pandas.pydata.org/pandas-docs/stable/groupby.html
- Sorting: https://pandas.pydata.org/pandas-docs/stable/basics.html#sorting
- DataType Introduction: https://pandas.pydata.org/pandas-docs/stable/dsintro.html
- Missing Data: https://pandas.pydata.org/pandas-docs/stable/missing_data.html <br>
    - https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html

[Click here to go to index](#start)<a id='start'></a>