# Python: Handling missing values

**Goal**: Clean and organise your data!


## Introduction to dataset

In this chapter, we will clean and analyze a dataset describing the ``passengers of the Titanic`` and whether each of them survived.

In [1]:
import pandas as pd

In [2]:
titanic_survival = pd.read_csv("titanic_survival.csv")
titanic_survival.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


Here is a brief presentation of some columns. The column ``pclass`` is the passenger's class, which goes from 1 to 3, with 1 being the highest class. The column ``survived`` takes the value 1 if the passenger survived and 0 otherwise. The ``fare`` column is the amount the passenger paid for the ticket. The column ``embarked`` is the port where the passenger boarded; it takes 3 values: C, Q and S.

Note that several columns in this dataset, such as ``age`` and ``sex``, have missing values. These ``missing values`` can cause numerical errors in our calculations, so we have to deal with them before starting any analysis. Learning how to handle missing values is what we will do throughout this chapter.

## Find missing values

In this section, we will ``look for missing values``. As a reminder, missing values come in several forms. Python's ``None`` indicates the absence of a value, and ``NaN`` (``not a number``) is the value pandas uses to mark a missing entry. In general, we can treat both None and NaN as ``null values``. Pandas provides the ``isnull()`` function, which returns True for every value that is null and False otherwise.

In [3]:
# example with sex column
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
sex_is_null

0       False
1       False
2       False
3       False
4       False
        ...  
1305    False
1306    False
1307    False
1308    False
1309     True
Name: sex, Length: 1310, dtype: bool

In [4]:
sex_null = sex[sex_is_null]
sex_null

1309    NaN
Name: sex, dtype: object

In [5]:
# example with age column
age_null = titanic_survival["age"][pd.isnull(titanic_survival.age)]
age_null

15     NaN
37     NaN
40     NaN
46     NaN
59     NaN
        ..
1297   NaN
1302   NaN
1303   NaN
1305   NaN
1309   NaN
Name: age, Length: 264, dtype: float64

The output shows that the age column has 264 missing values, which we can confirm with the ``len()`` function.

In [6]:
count_age_null = len(age_null)
count_age_null

264
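As a small, self-contained illustration of the same idea, the sketch below uses a made-up series (``ages`` is hypothetical and not taken from the Titanic file). It suggests that ``pd.isnull()`` flags both ``None`` and ``NaN``, and that summing the boolean mask gives the same count as calling ``len()`` on the filtered series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five entries are missing.
ages = pd.Series([29.0, None, 2.0, np.nan, 25.0])

# True where the value is None or NaN, False elsewhere.
mask = pd.isnull(ages)

# Same approach as in the cells above: filter, then measure the length.
count_with_len = len(ages[mask])

# Alternative: True counts as 1 and False as 0, so the sum is the count.
count_with_sum = int(mask.sum())

print(mask.tolist())
print(count_with_len, count_with_sum)
```

Both counts should agree, so either form can be used to check the number of missing values in a column.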

## Problems with missing values

We have seen previously that there are columns with missing values. In this section, we will show what ``problems`` these missing values ``cause``.

In [7]:
# example of problem
mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])
mean_age

nan

The example above illustrates that a computation involving a missing value returns a missing value: Python's built-in ``sum()`` adds the values one by one, so a single NaN makes the whole result NaN. It is therefore necessary to ``filter`` out the missing values before proceeding with the computations.

In [8]:
# keep only the values of the age column that are not null
age_not_null = titanic_survival["age"][pd.notnull(titanic_survival["age"])]
age_not_null

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1301    45.5000
1304    14.5000
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1046, dtype: float64

In [9]:
# compute the mean of age column without nan values
mean_age = sum(age_not_null) / len(age_not_null)
mean_age

29.8811345124283

For ``information``, pandas offers a simpler way to compute the average of a column: the ``mean()`` method, which ignores missing values and directly computes the average of a numerical series.

In [10]:
# example with age column
mean_age = titanic_survival["age"].mean()
mean_age

29.8811345124283
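To make the contrast explicit, here is a minimal sketch on a made-up three-value series (``values`` is hypothetical). It compares the built-in ``sum()`` approach with ``Series.mean()``, and uses the ``skipna`` parameter of ``mean()`` to show that skipping missing values is a default that can be switched off.

```python
import math
import pandas as pd

# Hypothetical toy data with one missing value.
values = pd.Series([10.0, float("nan"), 20.0])

# Built-in sum() adds element by element, so the NaN propagates.
builtin_mean = sum(values) / len(values)

# Series.mean() skips missing values by default (skipna=True).
pandas_mean = values.mean()

# With skipna=False, pandas propagates the NaN as well.
strict_mean = values.mean(skipna=False)

print(math.isnan(builtin_mean), pandas_mean, math.isnan(strict_mean))
```

Note that ``pandas_mean`` divides by the number of non-missing values (2 here), which is why it matches the manual computation on the filtered series.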

### Training

In this exercise, follow these steps:

* create an empty dictionary that we will name fares_by_class
* create the list passenger_classes which contains the elements [1,2,3]
* use a for loop to browse the passenger_classes list:
    * select just the rows of titanic_survival for which the column pclass is equal to the temporary variable (the iterator) of the for loop, i.e. corresponding to the class number (1, 2 or 3)
    * select only the fare column for this subset of rows (corresponding to the class)
    * use the series.mean() method to calculate the average of this subset
    * add this calculated average of the class to the fares_by_class dictionary with the class number as key (and thus the average fare as value)
* once the loop is completed, the dictionary fares_by_class should have 1,2 and 3 as keys with the corresponding average values
* display the result

In [11]:
fares_by_class = {}
passenger_classes = [1, 2, 3]

for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class

In [12]:
fares_by_class

{1: 87.50899164086687, 2: 21.1791963898917, 3: 13.302888700564957}

## Introduction to pivot table

In this section we will look at pivot tables, which group the rows of a dataframe by the values of a column and then perform a computation on each group, for example an average. The idea of a pivot table is to group and then apply a function. The ``pivot_table()`` method of pandas performs both operations in one call. The following example shows how to carry out our previous task using only the ``pivot_table()`` method.

In [13]:
import numpy as np

In [14]:
fares_by_pclass = titanic_survival.pivot_table(index="pclass",
                                               values="fare",
                                               aggfunc=np.mean)
fares_by_pclass

pclass,fare
1.0,87.508992
2.0,21.179196
3.0,13.302889


The first parameter ``index`` indicates the column to group by. The second parameter ``values`` indicates the column (or list of columns) on which we want to apply a function. The last parameter ``aggfunc`` indicates the function (sum, average, etc.) to apply to the ``values`` columns. With the same ``pivot_table()`` method, we can also perform calculations on several columns at once by passing a list to ``values``.
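The three parameters can be tried out on a small made-up dataframe (``df`` below is hypothetical, not the Titanic file). This sketch passes a list to ``values`` and, as an assumption for brevity, uses the string alias ``"mean"`` instead of ``np.mean``. It also writes the same group-then-aggregate idea with ``groupby()`` for comparison.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 80.0, 20.0, 8.0, 12.0],
    "survived": [1, 0, 1, 0, 0],
})

# Group by pclass, then average two columns at once.
by_pivot = df.pivot_table(index="pclass",
                          values=["fare", "survived"],
                          aggfunc="mean")

# The same idea written as an explicit group-then-aggregate.
by_groupby = df.groupby("pclass")[["fare", "survived"]].mean()

print(by_pivot)
print(by_groupby)
```

On this toy data, both results have one row per class and one column per aggregated variable.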

### Training

In this exercise, follow these steps:

* make a pivot table that calculates the total money collected ("fare") and the total number of survivors ("survived") for each port of embarkation ("embarked") using the numpy.sum function
* assign the result to the variable port_stats
* display the result

In [15]:
port_stats = titanic_survival.pivot_table(index="embarked",
                                          values=["fare", "survived"],
                                          aggfunc=np.sum)
port_stats

embarked,fare,survived
C,16830.7922,150.0
Q,1526.3085,44.0
S,25033.3862,304.0


## Remove missing values

To ``remove`` missing values directly from a dataframe, we use the ``dropna()`` method. This method ``deletes`` every row (``axis=0``) or every column (``axis=1``) that contains at least one missing value.

In [16]:
drop_na_rows = titanic_survival.dropna(axis=0)
drop_na_rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest


In [17]:
drop_na_columns = titanic_survival.dropna(axis=1)
drop_na_columns

0
1
2
3
4
...
1305
1306
1307
1308
1309


The examples above show that every row and every column in our dataset has at least one missing value: in both cases nothing is left after the call. It is also possible to restrict the check to a ``subset of variables`` using the ``subset`` parameter.

In [18]:
titanic_survival.shape

(1310, 14)

In [19]:
name_drop_na_rows = titanic_survival.dropna(axis=0, subset=["name"])
name_drop_na_rows.shape

(1309, 14)

This last example shows that exactly one row has a missing value in the ``name`` column: the dataframe goes from 1310 rows to 1309.
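The effect of ``axis`` and ``subset`` is easier to follow on a tiny example. The sketch below uses a made-up dataframe (``df`` and its column values are hypothetical) in which every column has at least one missing value but one row is complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the last row has no missing value.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dan"],
    "age": [30.0, np.nan, 40.0, 22.0],
    "cabin": [np.nan, np.nan, np.nan, "C2"],
})

# Keep only the rows without any missing value.
rows_complete = df.dropna(axis=0)

# Keep only the columns without any missing value (none here).
cols_complete = df.dropna(axis=1)

# Only look at the "name" column when deciding which rows to drop.
rows_with_name = df.dropna(axis=0, subset=["name"])

# Several columns can be checked together.
rows_name_age = df.dropna(axis=0, subset=["name", "age"])

print(rows_complete.shape, cols_complete.shape)
print(rows_with_name.shape, rows_name_age.shape)
```

``dropna()`` returns a new dataframe each time, so ``df`` itself is left unchanged.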

## Iloc to access rows

In this section we will see the ``iloc`` method for accessing rows. Whereas the ``loc`` method selects rows by their index label, the ``iloc`` method selects them by integer position, that is, by where they sit in the dataframe. The two give the same result as long as the index is 0, 1, 2, ... in order, but they differ as soon as the rows are reordered, for example after sorting.

In [20]:
unordered_titanic_survival = titanic_survival.sort_values(
    by="age", ascending=False)
unordered_titanic_survival.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


In [21]:
unordered_titanic_survival.loc[0]

pclass                                 1.0
survived                               1.0
name         Allen, Miss. Elisabeth Walton
sex                                 female
age                                   29.0
sibsp                                  0.0
parch                                  0.0
ticket                               24160
fare                              211.3375
cabin                                   B5
embarked                                 S
boat                                     2
body                                   NaN
home.dest                     St Louis, MO
Name: 0, dtype: object

In [22]:
unordered_titanic_survival.iloc[0]

pclass                                        1.0
survived                                      1.0
name         Barkworth, Mr. Algernon Henry Wilson
sex                                          male
age                                          80.0
sibsp                                         0.0
parch                                         0.0
ticket                                      27042
fare                                         30.0
cabin                                         A23
embarked                                        S
boat                                            B
body                                          NaN
home.dest                           Hessle, Yorks
Name: 14, dtype: object

In [23]:
unordered_titanic_survival.loc[0:5]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest


In [24]:
unordered_titanic_survival.iloc[0:5]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


The examples above show that the ``iloc`` method ignores the index labels and only looks at the positions of the rows in the dataframe. After sorting, ``iloc[0:5]`` returns the first five rows as displayed, whereas ``loc[0:5]`` looks for the rows located between label 0 and label 5 and returns nothing here, because label 0 appears after label 5 in the sorted dataframe. So when you slice a dataframe by position, you should use the ``iloc`` method; slicing with ``loc`` only behaves like a positional slice when the index is still in ascending order. To access a single element, use the ``loc`` method if you know its index label and the ``iloc`` method if you know its position in the dataframe.
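The difference between labels and positions can be reproduced on a four-row example. This is only a sketch on made-up data (``df`` is hypothetical); the comments give the index order after sorting so that each selection can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2, 3.
df = pd.DataFrame({"age": [29, 80, 2, 71]})

# Sorting keeps the labels attached to their rows: index order is 1, 3, 0, 2.
by_age = df.sort_values(by="age", ascending=False)

# loc uses the label: the row labelled 0 is the one with age 29.
label_zero = by_age.loc[0, "age"]

# iloc uses the position: the first row after sorting has age 80.
first_row = by_age.iloc[0, 0]

# Positional slice: positions 0 and 1, the end position is excluded.
first_two = by_age.iloc[0:2]

# Label slice: from label 1 to label 0 as they appear, both ends included.
label_slice = by_age.loc[1:0]

print(label_zero, first_row)
print(list(first_two.index), list(label_slice.index))
```

Note the two slicing conventions: ``iloc`` excludes the end position, while ``loc`` includes the end label.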

## Column indexes

As we have seen in the previous chapters, columns are also indexed with the ``loc`` and ``iloc`` methods, using the second value inside the brackets. With the ``loc`` method, we use the name of the column, and with the ``iloc`` method, we use an integer that corresponds to the position of the column in the dataframe, counting from 0.

In [25]:
# example
unordered_titanic_survival.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


In [26]:
unordered_titanic_survival.iloc[0, 2]

'Barkworth, Mr. Algernon Henry Wilson'

In [27]:
unordered_titanic_survival.loc[14, "name"]

'Barkworth, Mr. Algernon Henry Wilson'
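When only the column name is known, its position can be looked up rather than counted by hand. The sketch below works on a made-up two-row dataframe (``df`` and its values are hypothetical) and uses ``Index.get_loc()`` on ``df.columns`` to connect the two ways of addressing the same cell.

```python
import pandas as pd

# Hypothetical toy data with non-default row labels 14 and 61.
df = pd.DataFrame(
    {"pclass": [1, 3], "survived": [1, 0], "name": ["A", "B"]},
    index=[14, 61],
)

# Integer position of the "name" column (columns are counted from 0).
name_position = df.columns.get_loc("name")

# Row position 0, column position 2.
by_position = df.iloc[0, name_position]

# Row label 14, column label "name": the same cell.
by_label = df.loc[14, "name"]

print(name_position, by_position, by_label)
```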

## Re-index rows in a dataframe

In this section, we will see how to re-index the rows of a dataframe. We have seen that sorting changes the order of a dataframe but keeps the indexes of each row. However, it can sometimes be useful to re-index a dataframe by starting the indexes at 0. To do this, we use the ``reset_index()`` method of pandas. 

In [28]:
# example
unordered_titanic_survival.reset_index().head()

Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,14,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
1,61,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
2,1235,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
3,135,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
4,9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


In the example above, the re-indexed dataframe has kept the old index as a new ``index`` column. To discard it, we set the ``drop`` parameter to ``True``.

In [29]:
unordered_titanic_survival.reset_index(drop=True).head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0.0,0.0,27042,30.0,A23,S,B,,"Hessle, Yorks"
1,1.0,1.0,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1.0,0.0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
2,3.0,0.0,"Svensson, Mr. Johan",male,74.0,0.0,0.0,347060,7.775,,S,,,
3,1.0,0.0,"Goldschmidt, Mr. George B",male,71.0,0.0,0.0,PC 17754,34.6542,A5,C,,,"New York, NY"
4,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
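The two behaviours of ``reset_index()`` can be compared side by side on a small made-up dataframe (``df`` is hypothetical). The sketch also suggests that the method returns a new dataframe and leaves the sorted one untouched.

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2.
df = pd.DataFrame({"age": [29, 80, 2]})

# After sorting, the index order is 1, 0, 2.
by_age = df.sort_values(by="age", ascending=False)

# The old labels are moved to a new column named "index".
kept = by_age.reset_index()

# The old labels are discarded.
dropped = by_age.reset_index(drop=True)

print(list(kept.columns), kept["index"].tolist())
print(list(dropped.columns), list(dropped.index))
print(list(by_age.index))
```

To keep the new index, assign the result to a variable, as done with ``kept`` and ``dropped`` here.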


## Apply functions to a dataframe

In this section, we will see how to apply functions to a dataframe using the ``apply()`` method. On a series, this method applies a function to each element. On a dataframe, it applies a function to each column by default.

In [30]:
# example 1
fare = titanic_survival["fare"]
fare

0       211.3375
1       151.5500
2       151.5500
3       151.5500
4       151.5500
          ...   
1305     14.4542
1306      7.2250
1307      7.2250
1308      7.8750
1309         NaN
Name: fare, Length: 1310, dtype: float64

In [31]:
squared_fare = fare.apply(lambda x: x**2)
squared_fare

0       44663.538906
1       22967.402500
2       22967.402500
3       22967.402500
4       22967.402500
            ...     
1305      208.923898
1306       52.200625
1307       52.200625
1308       62.015625
1309             NaN
Name: fare, Length: 1310, dtype: float64

By setting the ``axis`` parameter of the ``apply()`` method to ``1`` (it is ``0`` by default), we can apply a function to each row of the dataframe instead of each column.

In [32]:
# example 2
def is_minor(row):
    # a missing age gives False, because NaN < 18 evaluates to False
    return row["age"] < 18

In [33]:
minors = titanic_survival.apply(is_minor, axis=1)
minors

0       False
1        True
2        True
3       False
4       False
        ...  
1305    False
1306    False
1307    False
1308    False
1309    False
Length: 1310, dtype: bool

In [34]:
# example 3
def which_class(row):

    pclass = row["pclass"]

    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"

In [35]:
classes = titanic_survival.apply(which_class, axis=1)
classes

0       First Class
1       First Class
2       First Class
3       First Class
4       First Class
           ...     
1305    Third Class
1306    Third Class
1307    Third Class
1308    Third Class
1309        Unknown
Length: 1310, dtype: object
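Row-wise ``apply()`` calls a Python function once per row, so for simple rules like examples 2 and 3 a column-wise formulation is often shorter. The sketch below is one possible alternative on made-up data (``df`` and the ``labels`` dictionary are hypothetical), not the approach used in this chapter. One difference to keep in mind: ``which_class`` labels every non-null value other than 1 and 2 as "Third Class", whereas the mapping below labels any value missing from the dictionary as "Unknown".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age and one missing class.
df = pd.DataFrame({
    "age": [29.0, 0.9167, np.nan, 30.0],
    "pclass": [1.0, 2.0, 3.0, np.nan],
})

# Comparison on the whole column; NaN < 18 evaluates to False.
minors = df["age"] < 18

# Hypothetical lookup table from class number to label.
labels = {1.0: "First Class", 2.0: "Second Class", 3.0: "Third Class"}

# map() returns NaN for values absent from the dictionary, including NaN itself.
classes = df["pclass"].map(labels).fillna("Unknown")

print(minors.tolist())
print(classes.tolist())
```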

### Training

In this exercise, follow these steps:

* write a function that counts the number of missing values of a series object
* use the DataFrame.apply() method to apply your function to titanic_survival
* assign the result to the column_null_value_count variable
* display the result

In [36]:
def null_value_count(column):
    null = column[pd.isnull(column)]
    null_value = len(null)
    return null_value

In [37]:
column_null_value_count = titanic_survival.apply(null_value_count)
column_null_value_count

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64

All of the above code can be reduced to a single line by using two built-in pandas methods: ``isnull()``, which flags each null value, and ``sum()``, which counts the flagged values in each column, as shown in the following code.

In [38]:
titanic_survival.isnull().sum()

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
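Raw counts are hard to compare without the total number of rows, so it can help to express them as a share of each column. The following sketch on made-up data (``df`` is hypothetical) relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with four rows.
df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.25],
    "sex": ["female", "male", "female", "male"],
})

# Number of missing values per column.
missing_count = df.isnull().sum()

# Percentage of missing values per column.
missing_share = df.isnull().mean() * 100

print(missing_count)
print(missing_share)
```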

## Practice: compute the percentage of survival by classe group

In this practical case, we will use a pivot table to compute the survival rate for each class group. Follow these steps:

* add a column "classe_labels" to the dataframe titanic_survival containing the variable classes that we created in example 3 above
* create a pivot table that calculates the average chance of survival (column "survived") for each class (column "classe_labels") of the dataframe titanic_survival
* assign the resulting dataframe to the classe_group_survival variable
* display the result

In [39]:
titanic_survival["classe_labels"] = classes
classe_group_survival = titanic_survival.pivot_table(
    index="classe_labels", values="survived")
classe_group_survival

classe_labels,survived
First Class,0.619195
Second Class,0.429603
Third Class,0.255289
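The pivot table returns the mean of a 0/1 column, that is, a fraction between 0 and 1. To read it as a percentage, it can be multiplied by 100. The sketch below shows this on made-up data (``df`` is hypothetical) and relies on the mean being the default ``aggfunc`` of ``pivot_table()``.

```python
import pandas as pd

# Hypothetical toy data: one survivor out of two in first class, none in third.
df = pd.DataFrame({
    "classe_labels": ["First Class", "First Class", "Third Class", "Third Class"],
    "survived": [1.0, 0.0, 0.0, 0.0],
})

# No aggfunc is given, so the mean of "survived" is computed per group.
rate = df.pivot_table(index="classe_labels", values="survived")

# Convert the fraction into a percentage rounded to one decimal.
percent = (rate["survived"] * 100).round(1)

print(percent)
```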
