# Working with missing data



## Data set

We'll clean and analyze data on passenger survival from the Titanic. Each row contains information for a specific Titanic passenger.

Let's take a closer look at a few of the key columns:

* pclass -- The passenger's cabin class from 1 to 3 where 1 was the highest class
* survived -- 1 if the passenger survived, and 0 if they did not.
* sex -- The passenger's gender
* age -- The passenger's age
* fare -- The amount the passenger paid for their ticket
* embarked -- Either C, Q, or S, to indicate which port the passenger boarded the ship from.

Many of the columns, such as age and sex, have missing values.

Because **missing values can cause errors in numerical functions**, we'll need to deal with them before we can analyze the data. For instance, naively summing or averaging a column that contains a missing value does not give a usable result, because a missing value has no numeric meaning. Addressing missing values will let us perform calculations on the entire data set.



In [2]:
# Import the data into a pandas DataFrame
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_survival.csv")

print(titanic_survival)



      pclass  survived                                               name  \
0        1.0       1.0                      Allen, Miss. Elisabeth Walton   
1        1.0       1.0                     Allison, Master. Hudson Trevor   
2        1.0       0.0                       Allison, Miss. Helen Loraine   
3        1.0       0.0               Allison, Mr. Hudson Joshua Creighton   
4        1.0       0.0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
5        1.0       1.0                                Anderson, Mr. Harry   
6        1.0       1.0                  Andrews, Miss. Kornelia Theodosia   
7        1.0       0.0                             Andrews, Mr. Thomas Jr   
8        1.0       1.0      Appleton, Mrs. Edward Dale (Charlotte Lamson)   
9        1.0       0.0                            Artagaveytia, Mr. Ramon   
10       1.0       0.0                             Astor, Col. John Jacob   
11       1.0       1.0  Astor, Mrs. John Jacob (Madeleine Talmadge Force)   

## Missing data
Missing data can take a few different forms:

* In Python, the `None` keyword (the sole value of `NoneType`) indicates no value.
* The pandas library uses `NaN`, which stands for "not a number", to indicate a missing value.

In general terms, both `NaN` and `None` can be called null values.

If we want to see which values are NaN, we can use the `pandas.isnull()` function, which takes a pandas Series and returns a Series of True and False values, the same way that NumPy did when we compared arrays.
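
As a minimal sketch of how `pd.isnull()` behaves, the snippet below uses a small made-up Series (the values are hypothetical and not taken from the Titanic file). It shows that both `None` and `NaN` are reported as null, and that the resulting boolean Series can be used as a mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
ages = pd.Series([22.0, np.nan, 35.0, None])

# pd.isnull() returns a boolean Series: True wherever a value is missing.
is_missing = pd.isnull(ages)
print(is_missing.tolist())

# A boolean Series can be used as a mask to keep only the missing entries.
missing_only = ages[is_missing]
print(len(missing_only))
```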

In [15]:
titanic_sex = titanic_survival["sex"]
#print(titanic_sex)

titanic_sex_missing = pd.isnull(titanic_sex)
#print(titanic_sex_missing)

sex_null_true = titanic_sex[titanic_sex_missing]
print(sex_null_true)

1309    NaN
Name: sex, dtype: object


In [16]:
age = titanic_survival["age"]
print(age.loc[10:20])

# Extract all entries where age is NaN
age_null_true = age[pd.isnull(age)]

# Count the entries in the resulting Series
age_null_count = age_null_true.shape[0]
print(age_null_count)

10    47.0
11    18.0
12    24.0
13    26.0
14    80.0
15     NaN
16    24.0
17    50.0
18    32.0
19    36.0
20    37.0
Name: age, dtype: float64
264


## Calculating Mean

If we calculated the mean of this age Series by hand, for example by dividing Python's built-in `sum()` of the values by their count, the resulting `mean_age` would be `nan`. This is because **any calculations we do with a null value also result in a null value**. This makes sense when you think about it: how can you add a null value to a known value?

Hence, we have to filter out the missing values before we calculate the mean.
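
The propagation described above can be sketched on a tiny, made-up Series (hypothetical values, not the Titanic data). The first calculation sums every element, including the `NaN`; the second filters the nulls out first.

```python
import math

import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])  # hypothetical toy data

# Python's built-in sum() adds every element, so the NaN propagates.
naive_mean = sum(ages) / len(ages)
print(naive_mean)

# Keeping only the non-null values first gives a usable result.
known_ages = ages[pd.notnull(ages)]
filtered_mean = sum(known_ages) / len(known_ages)
print(filtered_mean)
```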

In [31]:
age_is_null = pd.isnull(titanic_survival["age"])
print(age_is_null)

# Keep only the ages that are not null
age_not_null = titanic_survival["age"][pd.notnull(titanic_survival["age"])]

# Alternative: invert the boolean mask computed above
age_not_null = titanic_survival["age"][~age_is_null]

correct_mean_age = age_not_null.mean()
print(correct_mean_age)

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15       True
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1280    False
1281    False
1282     True
1283     True
1284     True
1285    False
1286    False
1287    False
1288    False
1289    False
1290    False
1291     True
1292     True
1293     True
1294    False
1295    False
1296    False
1297     True
1298    False
1299    False
1300    False
1301    False
1302     True
1303     True
1304    False
1305     True
1306    False
1307    False
1308    False
1309     True
Name: age, Length: 1310, dtype: bool
29.8811345124


Luckily, missing data is so common that many pandas methods automatically filter for it. For example, if we use the `Series.mean()` method to calculate the mean of a column, missing values will not be included in the calculation.

In [33]:
# Same result as the code above: 
correct_mean_age = titanic_survival["age"].mean()
print(correct_mean_age)

29.8811345124
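
The automatic filtering in `Series.mean()` is controlled by its `skipna` parameter, which defaults to `True`. The following sketch, again on hypothetical toy values, contrasts the default with `skipna=False`.

```python
import math

import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])  # hypothetical toy data

# Default behaviour: NaN entries are skipped.
mean_skipping_nan = ages.mean()
print(mean_skipping_nan)

# skipna=False turns the skipping off, so the NaN propagates.
mean_with_nan = ages.mean(skipna=False)
print(mean_with_nan)
```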


In [46]:
passenger_classes = [1, 2, 3]
fares_by_class = {}

for pc in passenger_classes:
    fares_by_class[pc] = titanic_survival["fare"][titanic_survival["pclass"] == pc].mean()

print(fares_by_class)

{1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}


## Pivot tables

Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean. The concept of pivot tables was popularized by the introduction of the 'PivotTable' feature in Microsoft Excel in the mid-1990s.

Pivot tables first group and then apply a calculation. In the previous example, we actually made a pivot table manually by grouping by the column "pclass" and then calculating the mean of the "fare" column for each class.

Luckily, we can use the `DataFrame.pivot_table()` method instead, which simplifies the kind of work we did in the previous cell. To produce the same data, we need only one line.

In [53]:
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)
print(passenger_class_fares)

             fare
pclass           
1.0     87.508992
2.0     21.179196
3.0     13.302889
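
To check that a pivot table really matches the manual filter-then-average approach, here is a small sketch on a made-up DataFrame (the fares are hypothetical). It passes the string `"mean"` as `aggfunc`, which pandas also accepts in place of a function such as `np.mean`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3],
    "fare": [100.0, 80.0, 20.0, 30.0, 10.0],
})

# Manual version: filter by class, then average.
manual = {pc: df["fare"][df["pclass"] == pc].mean() for pc in [1, 2, 3]}

# Pivot-table version: group by "pclass" and average "fare".
pivot = df.pivot_table(index="pclass", values="fare", aggfunc="mean")

print(manual)
print(pivot["fare"].to_dict())
```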


In [55]:
# calculate the mean age for each passenger class 
passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
print(passenger_age)


              age
pclass           
1.0     39.159918
2.0     29.506705
3.0     24.816367


We can use the `DataFrame.pivot_table()` method to **perform even more advanced tasks**. If we pass a list of column names to the values parameter instead of a single value, we can perform calculations **on multiple columns at once**.

We can also specify which calculation to apply. For instance, if we pass `np.sum` (the function itself, without calling it) to the `aggfunc` parameter, it will total the values in each column.
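
A minimal sketch of aggregating several columns at once, using a made-up DataFrame (hypothetical values) and the string `"sum"` for `aggfunc`. Each column listed in `values` becomes a column of totals in the result.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "Q"],
    "fare": [50.0, 30.0, 10.0, 20.0, 8.0],
    "survived": [1, 0, 1, 1, 0],
})

# One row per port, one column of totals per entry in `values`.
totals = df.pivot_table(index="embarked", values=["fare", "survived"], aggfunc="sum")
print(totals)
```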

In [78]:
# Pivot table that calculates the total fares collected ("fare") and the total number of survivors ("survived") for each embarkation port ("embarked").
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc=np.sum)
print(port_stats)

                fare  survived
embarked                      
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0


In [83]:
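# Total survivors for each combination of embarkation port and passenger class.
# "sex" is a text column that cannot be meaningfully summed, so only "survived" is aggregated.
port_stats2 = titanic_survival.pivot_table(index=["embarked", "pclass"], values=["survived"], aggfunc=np.sum)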
print(port_stats2)

                 survived
embarked pclass          
C        1.0         97.0
         2.0         16.0
         3.0         37.0
Q        1.0          2.0
         2.0          2.0
         3.0         40.0
S        1.0         99.0
         2.0        101.0
         3.0        104.0


## Remove missing values in a DataFrame

We can use the `DataFrame.dropna()` method to remove missing values from a pandas DataFrame. By default, the method drops any rows that contain missing values.

The `dropna()` method takes an `axis` parameter, which indicates whether you would like to drop rows or columns:

* Specifying `axis=0` or `axis='index'` will drop any rows that have null values.
* Specifying `axis=1` or `axis='columns'` will drop any columns that have null values.

We will use 0 and 1 since they're more commonly used, but you can use either.
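
The effect of `axis` and `subset` is easiest to see on a small frame. The sketch below uses a made-up three-row DataFrame (hypothetical values) and compares the shapes of the results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [30.0, np.nan, 25.0],
    "cabin": [np.nan, np.nan, "C85"],
})

rows_dropped = df.dropna(axis=0)                # keep only rows with no nulls
cols_dropped = df.dropna(axis=1)                # keep only columns with no nulls
age_known = df.dropna(axis=0, subset=["age"])   # only "age" must be non-null

print(rows_dropped.shape, cols_dropped.shape, age_known.shape)
```

Restricting the check with `subset` keeps far more rows than a blanket `dropna()` when some columns are mostly empty.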

In [91]:
# drop all rows in titanic_survival that have null values.
drop_na_rows = titanic_survival.dropna(axis=0)
#print(drop_na_rows)

# Drop all columns in titanic_survival that have missing values and assign the result to drop_na_columns.
drop_na_columns = titanic_survival.dropna(axis=1)
#print(drop_na_columns)

# Drop all rows in titanic_survival where the columns "age" or "sex" have missing values
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age","sex"])
print(new_titanic_survival)

      pclass  survived                                               name  \
0        1.0       1.0                      Allen, Miss. Elisabeth Walton   
1        1.0       1.0                     Allison, Master. Hudson Trevor   
2        1.0       0.0                       Allison, Miss. Helen Loraine   
3        1.0       0.0               Allison, Mr. Hudson Joshua Creighton   
4        1.0       0.0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
5        1.0       1.0                                Anderson, Mr. Harry   
6        1.0       1.0                  Andrews, Miss. Kornelia Theodosia   
7        1.0       0.0                             Andrews, Mr. Thomas Jr   
8        1.0       1.0      Appleton, Mrs. Edward Dale (Charlotte Lamson)   
9        1.0       0.0                            Artagaveytia, Mr. Ramon   
10       1.0       0.0                             Astor, Col. John Jacob   
11       1.0       1.0  Astor, Mrs. John Jacob (Madeleine Talmadge Force)   


### Select the first n rows of a sorted DataFrame
If we want to select the first five rows of an already sorted DataFrame, we can **use the `DataFrame.iloc[]` indexer to select by position**. An easy way to remember which is which is that `iloc[]` stands for integer location: you use integer positions, not labels, to select the data.

`DataFrame.loc[]`, by contrast, selects by index label.
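
The difference between position and label only shows up once the rows have been reordered. A minimal sketch on a made-up DataFrame (hypothetical names and ages):

```python
import pandas as pd

# Hypothetical toy data with the default index labels 0, 1, 2.
df = pd.DataFrame({"name": ["A", "B", "C"], "age": [30, 80, 55]})
by_age = df.sort_values("age", ascending=False)

# After sorting, the index labels travel with their rows.
print(by_age.index.tolist())

first_by_position = by_age.iloc[0]   # first row in the sorted order
row_with_label_0 = by_age.loc[0]     # row whose index label is 0

print(first_by_position["name"], row_with_label_0["name"])
```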

In [27]:
new_titanic_survival = titanic_survival.sort_values("age", axis=0, ascending=False, inplace=False)

example_loc = new_titanic_survival.loc[14]
example_iloc = new_titanic_survival.iloc[14]

print(example_loc)
print(example_iloc)


pclass                                          1
survived                                        1
name         Barkworth, Mr. Algernon Henry Wilson
sex                                          male
age                                            80
sibsp                                           0
parch                                           0
ticket                                      27042
fare                                           30
cabin                                         A23
embarked                                        S
boat                                            B
body                                          NaN
home.dest                           Hessle, Yorks
Name: 14, dtype: object
pclass                                                       1
survived                                                     1
name         Crosby, Mrs. Edward Gifford (Catherine Elizabe...
sex                                                     female
age                     

In [28]:
# Assign the first ten rows from new_titanic_survival to first_ten_rows.
first_ten_rows = new_titanic_survival.iloc[0:10]

# Assign the fifth row from new_titanic_survival to row_position_fifth.
row_position_fifth = new_titanic_survival.iloc[4]

# Assign the row with index label 25 from new_titanic_survival to row_index_25.
row_index_25 = new_titanic_survival.loc[25]



We can also **index columns using both `loc[]` and `iloc[]`**.

- With `loc[]`, we specify the column label strings, as we have in the earlier examples.
- With `iloc[]`, we use the integer position of the column, counting from the left-most column, which is 0.

Similar to indexing with NumPy arrays, you separate the row and columns with a comma, and can use a colon to specify a range or as a wildcard.
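
The row-and-column syntax can be sketched on a small made-up DataFrame. The index labels 10, 20, 30 are hypothetical and chosen so that labels and positions differ. Note that `loc[]` slices include both endpoints, while `iloc[]` slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"pclass": [1, 3, 2], "survived": [1, 0, 1], "age": [29.0, 40.0, 18.0]},
    index=[10, 20, 30],  # hypothetical non-default row labels
)

top_left = df.iloc[0, 0]                          # first row, first column, by position
first_two_cols = df.iloc[:, 0:2]                  # all rows, columns at positions 0 and 1
age_of_row_20 = df.loc[20, "age"]                 # row label 20, column label "age"
label_slice = df.loc[10:20, "pclass":"survived"]  # label slices include both ends

print(top_left, age_of_row_20, label_slice.shape)
```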

In [32]:
first_row_first_column = new_titanic_survival.iloc[0, 0]  # by position
all_rows_first_three_columns = new_titanic_survival.iloc[:, 0:3]  # by position

row_index_83_age = new_titanic_survival.loc[83, "age"]  # by label
row_index_766_pclass = new_titanic_survival.loc[766, "pclass"]  # by label

row_index_1100_age = new_titanic_survival.loc[1100, "age"]  # by label
row_index_25_survived = new_titanic_survival.loc[25, "survived"]  # by label
five_rows_three_cols = new_titanic_survival.iloc[0:5, 0:3]  # by position

Sometimes it's useful to **reindex, starting from 0**. We can use the **`DataFrame.reset_index()`** method to do this. By default, the method retains the old index by adding an extra column to the DataFrame with the old index values; passing `drop=True` discards the old index instead.
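
A short sketch of both behaviours of `reset_index()`, using a made-up single-column DataFrame (hypothetical ages):

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 80, 55]})  # hypothetical toy data
by_age = df.sort_values("age", ascending=False)

kept = by_age.reset_index()              # old labels move into an "index" column
dropped = by_age.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist(), kept["index"].tolist())
print(dropped.columns.tolist(), dropped.index.tolist())
```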

In [33]:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # drop=True discards the old index
print(titanic_reindexed.loc[0:5])

   pclass  survived                                               name  \
0     1.0       1.0               Barkworth, Mr. Algernon Henry Wilson   
1     1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...   
2     3.0       0.0                                Svensson, Mr. Johan   
3     1.0       0.0                          Goldschmidt, Mr. George B   
4     1.0       0.0                            Artagaveytia, Mr. Ramon   
5     3.0       0.0                               Connors, Mr. Patrick   

      sex   age  sibsp  parch    ticket     fare cabin embarked boat   body  \
0    male  80.0    0.0    0.0     27042  30.0000   A23        S    B    NaN   
1  female  76.0    1.0    0.0     19877  78.8500   C46        S    6    NaN   
2    male  74.0    0.0    0.0    347060   7.7750   NaN        S  NaN    NaN   
3    male  71.0    0.0    0.0  PC 17754  34.6542    A5        C  NaN    NaN   
4    male  71.0    0.0    0.0  PC 17609  49.5042   NaN        C  NaN   22.0   
5    ma

## DataFrame.apply

To perform a complex calculation across pandas objects, we'll need to learn about the `DataFrame.apply()` method. By default, `DataFrame.apply()` will **iterate through each column in a DataFrame and call a function on each one**. When we write that function, we give it one parameter; the `apply()` method passes each column to that parameter as a pandas Series.

The result from the function will be **combined with all of the other results**, and placed **into a new series**. The function results will have the same position as the column or row we generated them from. Let's look at a simple example:

In [34]:
# This function returns the hundredth item from a series
def hundredth_elem(column):
    return column.iloc[99]

print(titanic_survival.apply(hundredth_elem))

pclass                                                       1
survived                                                     1
name         Duff Gordon, Lady. (Lucille Christiana Sutherl...
sex                                                     female
age                                                         48
sibsp                                                        1
parch                                                        0
ticket                                                   11755
fare                                                      39.6
cabin                                                      A16
embarked                                                     C
boat                                                         1
body                                                       NaN
home.dest                                       London / Paris
dtype: object


In [47]:
def count_null_values(column):
    # Use the boolean mask from pd.isnull() to keep only the null entries, then count them
    return len(column[pd.isnull(column)])

column_null_count = titanic_survival.apply(count_null_values)
print(column_null_count)

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64


By passing in the `axis=1` argument, we can use the DataFrame.apply() method to **iterate over rows instead of columns**.
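
To make the two directions concrete, the sketch below applies one function per column and another per row on a made-up DataFrame. The column names mirror `sibsp` and `parch` from the data set, but the values and the helper names `largest` and `relatives` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: siblings/spouses and parents/children aboard.
df = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 2, 1]})

def largest(column):
    # Receives one column at a time as a Series.
    return column.max()

def relatives(row):
    # Receives one row at a time as a Series; columns are accessed by label.
    return row["sibsp"] + row["parch"]

per_column = df.apply(largest)          # one result per column
per_row = df.apply(relatives, axis=1)   # one result per row

print(per_column.to_dict())
print(per_row.tolist())
```

The column-wise result is indexed by column name, while the row-wise result shares the DataFrame's row index.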


In [58]:
def is_minor(row):
    # A comparison with NaN evaluates to False, so passengers with a missing age are reported as not minors
    return row["age"] < 18

minors = titanic_survival.apply(is_minor, axis=1)
print(minors.head())

0    False
1     True
2     True
3    False
4    False
dtype: bool


In [57]:
# Label if passengers are minors or not

def minor_label(row):
    age_obj = row["age"]
    #print(type(age))
    if pd.isnull(age_obj):
        return "unknown"
    elif age_obj >= 18:
        return "adult"
    else:
        return "minor"

age_labels = titanic_survival.apply(minor_label, axis=1)
print(age_labels.head())





0    adult
1    minor
2    minor
3    adult
4    adult
dtype: object
