## 5.1 What Is a NaN Value?
The `NaN` value in Pandas comes from `numpy`. Missing values may be used or displayed in a few ways in Python—`NaN`, `NAN`, or `nan`—but they are all equivalent.

Missing values are different from other types of data in that they don’t really equal anything. The data is missing, so there is no concept of equality. `NaN` is not equivalent to `0` or an empty string, `''`.

In [4]:
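# this import requires NumPy 1.x; the NaN alias was removed in NumPy 2.0, where nan is the spelling to use
from numpy import NaN, NAN, nan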

NaN == True
NaN == False
NaN == 0
NaN == ""
NaN == NaN
NaN == nan
NaN == NAN

False

False

False

False

False

False

False

In [5]:
# to test for missing value
import pandas as pd
pd.isnull(NaN)
pd.isnull(nan)
pd.isnull(NAN)

# to test for non-missing value
pd.notnull(NaN)
pd.notnull(nan)
pd.notnull(NAN)

True

True

True

False

False

False
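
Because `NaN` never compares equal to anything, an equality test cannot locate missing values inside a series. The following is a minimal sketch of that point, using a made-up series named `readings` (a hypothetical name, not part of the survey data) and the lowercase `np.nan` spelling:

```python
import numpy as np
import pandas as pd

# hypothetical data for illustration only
readings = pd.Series([4.35, np.nan, 0.1, np.nan])

# comparing with == never matches, even where the value is missing
eq_mask = readings == np.nan

# isnull tests each element and returns a boolean series
null_mask = readings.isnull()

print(eq_mask.tolist())    # [False, False, False, False]
print(null_mask.tolist())  # [False, True, False, True]
```

The boolean series returned by `isnull` can be used directly as a filter, for example `readings[readings.isnull()]`.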

## 5.2 Where do Missing Values Come From?
### 5.2.1 Loading Data
The survey data we used in Chapter 4 included a data set, `visited`, that contained missing data. When we loaded the data, Pandas automatically found the missing data cell and gave us a dataframe with the `NaN` value in the appropriate cell. In the `read_csv` function, three parameters relate to reading in missing values: `na_values`, `keep_default_na`, and `na_filter`.

The `na_values` parameter allows you to specify additional missing or `NaN` values. You can pass in either a Python `str` or a list-like object to be automatically coded as missing values when the file is read. Of course, default missing values, such as `NA`, `NaN`, or `nan`, are already available, which is why this parameter is not always used. Some health data may code `99` as a missing value; to specify the use of this value, you would set `na_values=[99]`.

The `keep_default_na` parameter is a `bool` that specifies whether the default missing-value markers are kept. It is `True` by default, meaning that any values passed to `na_values` are appended to the default list of missing values. Setting `keep_default_na=False` uses only the missing values specified in `na_values`.

Lastly, `na_filter` is a `bool` that specifies whether any values will be read as missing. The default value of `na_filter=True` means that missing values will be coded as `NaN`. If we assign `na_filter=False`, then nothing will be recoded as missing. This parameter can be thought of as a means to turn off all the parameters set for `na_values` and `keep_default_na`, but it is more likely to be used when you want to achieve a performance boost by loading in data without missing values.
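
The three parameters can be compared side by side on a small in-memory file. This is only a sketch: the CSV text, its column names, and the use of `99` as a missing-value code are made up for illustration and are not part of the survey data.

```python
import io
import pandas as pd

# hypothetical CSV contents; 99 stands in for a missing-value code
csv_text = "id,age,site\n1,34,DR-1\n2,99,NA\n3,51,\n"

# defaults: 'NA' and empty fields become NaN, 99 stays a number
default = pd.read_csv(io.StringIO(csv_text))

# na_values adds 99 to the default list of missing markers
with_99 = pd.read_csv(io.StringIO(csv_text), na_values=[99])

# keep_default_na=False: only the values in na_values count as missing
only_99 = pd.read_csv(io.StringIO(csv_text), na_values=[99],
                      keep_default_na=False)

# na_filter=False: nothing is recoded as missing
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)

# missing values per column (id, age, site) for each variant
print(default.isnull().sum().tolist())     # [0, 0, 2]
print(with_99.isnull().sum().tolist())     # [0, 1, 2]
print(only_99.isnull().sum().tolist())     # [0, 1, 0]
print(unfiltered.isnull().sum().tolist())  # [0, 0, 0]
```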

In [6]:
visited_file = '../data/survey_visited.csv'
visited = pd.read_csv(visited_file, keep_default_na=False)
visited.head(6)

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,


In [7]:
visited = pd.read_csv(visited_file, na_values=[''], keep_default_na=False)
visited.head(6)

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,


In [8]:
visited = pd.read_csv(visited_file)
visited.head(6)

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,


### 5.2.2 Merging Data

In [9]:
visited = pd.read_csv('../data/survey_visited.csv')
survey = pd.read_csv('../data/survey_survey.csv')
vs = visited.merge(survey, left_on='ident', right_on='taken')
vs.iloc[10:20,:]

Unnamed: 0,ident,site,dated,taken,person,quant,reading
10,751,DR-3,1930-02-26,751,pb,rad,4.35
11,751,DR-3,1930-02-26,751,pb,temp,-18.5
12,751,DR-3,1930-02-26,751,lake,sal,0.1
13,752,DR-3,,752,lake,rad,2.19
14,752,DR-3,,752,lake,sal,0.09
15,752,DR-3,,752,lake,temp,-16.0
16,752,DR-3,,752,roe,sal,41.6
17,837,MSK-4,1932-01-14,837,lake,rad,1.46
18,837,MSK-4,1932-01-14,837,lake,sal,0.21
19,837,MSK-4,1932-01-14,837,roe,sal,22.5
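
A merge introduces new missing values when a key in one table has no match in the other and the unmatched rows are kept. A minimal sketch of that case, using two made-up tables (`sites` and `readings` are hypothetical names and values) and a left merge:

```python
import pandas as pd

# hypothetical tables for illustration only
sites = pd.DataFrame({'ident': [619, 622, 734],
                      'site': ['DR-1', 'DR-1', 'DR-3']})
readings = pd.DataFrame({'taken': [619, 734],
                         'reading': [9.82, 8.41]})

# a left merge keeps every row of sites; ident 622 has no match in
# readings, so its columns from readings are filled with NaN
merged = sites.merge(readings, how='left',
                     left_on='ident', right_on='taken')
print(merged)
```

With the default `how='inner'`, the unmatched row would be dropped instead, and no new missing values would appear.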


### 5.2.3 User-Input Data

In [23]:
# missing value in a series
num_legs = pd.Series({'goat': 4, 'amoeba': nan})
num_legs

# missing value in a dataframe
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'missing': [NaN, nan]})

scientists

amoeba    NaN
goat      4.0
dtype: float64

Unnamed: 0,Born,Died,Name,Occupation,missing
0,1920-07-25,1958-04-16,Rosaline Franklin,Chemist,
1,1876-06-13,1937-10-16,William Gosset,Statistician,


In [24]:
del scientists['missing']
# assign a column of missing values
scientists['missing'] = nan
scientists

Unnamed: 0,Born,Died,Name,Occupation,missing
0,1920-07-25,1958-04-16,Rosaline Franklin,Chemist,
1,1876-06-13,1937-10-16,William Gosset,Statistician,


### 5.2.4 Reindex Dataframe

Another way to introduce missing values into your data is to reindex your dataframe. This is useful when you want to add new indices to your dataframe, but still want to retain its original values. A common usage is when the index represents some time interval, and you want to add more dates.

In [10]:
gapminder = pd.read_csv('../data/gapminder.tsv', sep='\t')
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
life_exp

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

In [11]:
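# pandas 1.0 and later raise a KeyError here for missing labels; life_exp.reindex(range(2000, 2010)) gives this output
life_exp.loc[range(2000, 2010), ]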

year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64

In [12]:
y2000 = life_exp[life_exp.index > 2000]
y2000.reindex(range(2000, 2010))

year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64
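
The date use case mentioned at the start of this section can be sketched the same way. The series `counts` and its dates below are made up for illustration; they are not taken from the Gapminder data.

```python
import pandas as pd

# hypothetical daily counts with two days never recorded
counts = pd.Series(
    [10, 12, 15],
    index=pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-05']),
)

# build the complete range of dates, then reindex against it;
# dates absent from the original index are filled with NaN
all_days = pd.date_range('2015-01-01', '2015-01-05', freq='D')
full = counts.reindex(all_days)

print(full)
```

Note that the integer values become floats after reindexing, because `NaN` is a floating-point value.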

## 5.3 Working with Missing Data
### 5.3.1 Find and Count Missing Data

In [13]:
ebola = pd.read_csv('../data/country_timeseries.csv')
# count the non-missing values in each column
ebola.count()

# subtract from the number of rows to get the missing values per column
num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count()
num_missing

Date                   122
Day                    122
Cases_Guinea            93
Cases_Liberia           83
Cases_SierraLeone       87
Cases_Nigeria           38
Cases_Senegal           25
Cases_UnitedStates      18
Cases_Spain             16
Cases_Mali              12
Deaths_Guinea           92
Deaths_Liberia          81
Deaths_SierraLeone      87
Deaths_Nigeria          38
Deaths_Senegal          22
Deaths_UnitedStates     18
Deaths_Spain            16
Deaths_Mali             12
dtype: int64

Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64

If you want to count the total number of missing values in your data, or count the number of missing values for a particular column, you can use the `count_nonzero` function from `numpy` in conjunction with the `isnull` method.

In [14]:
import numpy as np

np.count_nonzero(ebola.isnull())
np.count_nonzero(ebola['Cases_Guinea'].isnull())

1214

29
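
The same counts can also be obtained with Pandas alone by summing the boolean result of `isnull`. This is a sketch on a small made-up dataframe (`df` and its columns are hypothetical), not on the Ebola data:

```python
import numpy as np
import pandas as pd

# hypothetical data for illustration only
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

# True counts as 1 when summed, so this gives missing values per column
per_column = df.isnull().sum()

# summing the per-column counts gives the total for the whole dataframe
total = per_column.sum()

print(per_column.tolist())  # [1, 2]
print(total)                # 3
```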

Another way to get missing data counts is to use the `value_counts` method on a series. This will print a frequency table of values. If you set the `dropna` parameter to `False`, you can also get a missing value count.

In [15]:
ebola.Cases_Guinea.value_counts(dropna=False).head()

NaN       29
 86.0      3
 495.0     2
 112.0     2
 390.0     2
Name: Cases_Guinea, dtype: int64

### 5.3.2 Cleaning Missing Data
#### 5.3.2.1 Recode/Replace
We can use the `fillna` method to recode the missing values to another value. For example, suppose we wanted the missing values to be recoded as `0`.

In [16]:
ebola.fillna(0).iloc[0:10, 0:5]

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone
0,1/5/2015,289,2776.0,0.0,10030.0
1,1/4/2015,288,2775.0,0.0,9780.0
2,1/3/2015,287,2769.0,8166.0,9722.0
3,1/2/2015,286,0.0,8157.0,0.0
4,12/31/2014,284,2730.0,8115.0,9633.0
5,12/28/2014,281,2706.0,8018.0,9446.0
6,12/27/2014,280,2695.0,0.0,9409.0
7,12/24/2014,277,2630.0,7977.0,9203.0
8,12/21/2014,273,2597.0,0.0,9004.0
9,12/20/2014,272,2571.0,7862.0,8939.0


When we use `fillna`, we can recode the missing values to a specific value. If you look at the documentation, you will discover that `fillna`, like many other Pandas functions, has an `inplace` parameter. Setting `inplace=True` means that the underlying data will be changed directly; you do not need to create a new copy with the changes. You may want to use this parameter when your data gets larger and you want your code to be more memory efficient.
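
As a rough illustration of recoding, the sketch below fills a small made-up dataframe (`df` and its columns are hypothetical) once with a single value and once with a different value per column. Neither call uses `inplace`, so the original dataframe is left unchanged.

```python
import numpy as np
import pandas as pd

# hypothetical data for illustration only
df = pd.DataFrame({
    'cases': [10.0, np.nan, 30.0],
    'deaths': [np.nan, 2.0, np.nan],
})

# a single value replaces every NaN and returns a new dataframe
zero_filled = df.fillna(0)

# a dict maps column names to their own replacement values
per_column = df.fillna({'cases': -1, 'deaths': 0})

# the original still holds its three missing values
print(int(df.isnull().sum().sum()))    # 3
print(zero_filled['cases'].tolist())   # [10.0, 0.0, 30.0]
print(per_column['cases'].tolist())    # [10.0, -1.0, 30.0]
print(per_column['deaths'].tolist())   # [0.0, 2.0, 0.0]
```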

#### 5.3.2.2 Fill Forward
We can use built-in methods to fill forward or backward. When we fill data forward, the last known value is used for the next missing value. In this way, missing values are replaced with the last known/recorded value; missing values at the start of the data have no earlier value and remain missing.

In [17]:
# fillna(method='ffill') is deprecated as of pandas 2.1; ebola.ffill() is the equivalent call
ebola.fillna(method='ffill').iloc[0:10, 0:5]

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone
0,1/5/2015,289,2776.0,,10030.0
1,1/4/2015,288,2775.0,,9780.0
2,1/3/2015,287,2769.0,8166.0,9722.0
3,1/2/2015,286,2769.0,8157.0,9722.0
4,12/31/2014,284,2730.0,8115.0,9633.0
5,12/28/2014,281,2706.0,8018.0,9446.0
6,12/27/2014,280,2695.0,8018.0,9409.0
7,12/24/2014,277,2630.0,7977.0,9203.0
8,12/21/2014,273,2597.0,7977.0,9004.0
9,12/20/2014,272,2571.0,7862.0,8939.0


#### 5.3.2.3 Fill Backward
We can also have Pandas fill data backward. When we fill data backward, the next recorded value (the one in the rows that follow) is used to replace the missing data. Missing values at the end of the data have no following value and remain missing.

In [18]:
# fillna(method='bfill') is deprecated as of pandas 2.1; ebola.bfill() is the equivalent call
ebola.fillna(method='bfill').iloc[:, 0:5].tail()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone
117,3/27/2014,5,103.0,8.0,6.0
118,3/26/2014,4,86.0,,
119,3/25/2014,3,86.0,,
120,3/24/2014,2,86.0,,
121,3/22/2014,0,49.0,,
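
The direction of each fill is easier to see on a short series. The sketch below uses made-up values and the `ffill` and `bfill` methods, which do the same work as `fillna` with `method='ffill'` or `method='bfill'`:

```python
import numpy as np
import pandas as pd

# hypothetical series with gaps at the start, middle, and end
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# forward fill copies the previous value down; the leading NaN has
# nothing before it, so it stays missing
forward = s.ffill()

# backward fill copies the following value up; the trailing NaN has
# nothing after it, so it stays missing
backward = s.bfill()

print(forward.tolist())   # [nan, 1.0, 1.0, 3.0, 3.0]
print(backward.tolist())  # [1.0, 1.0, 3.0, 3.0, nan]
```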


#### 5.3.2.4 Interpolate
Interpolation uses existing values to fill in missing values. Although there are many ways to fill in missing values, the interpolation in Pandas fills in missing values linearly by default. Specifically, it treats the missing values as if they should be equally spaced apart.

In [19]:
ebola.interpolate().iloc[0:10, 0:5]

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone
0,1/5/2015,289,2776.0,,10030.0
1,1/4/2015,288,2775.0,,9780.0
2,1/3/2015,287,2769.0,8166.0,9722.0
3,1/2/2015,286,2749.5,8157.0,9677.5
4,12/31/2014,284,2730.0,8115.0,9633.0
5,12/28/2014,281,2706.0,8018.0,9446.0
6,12/27/2014,280,2695.0,7997.5,9409.0
7,12/24/2014,277,2630.0,7977.0,9203.0
8,12/21/2014,273,2597.0,7919.5,9004.0
9,12/20/2014,272,2571.0,7862.0,8939.0
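
The equal spacing is easiest to check on a gap of more than one value. A minimal sketch with a made-up series:

```python
import numpy as np
import pandas as pd

# hypothetical series with a two-value gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# the default method='linear' spaces the filled values evenly
# between the known neighbours
filled = s.interpolate()

print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```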


#### 5.3.2.5 Drop Missing Values
The last way to work with missing data is to drop observations or variables with missing data. Depending on how much data is missing, keeping only complete case data can leave you with a useless data set. Perhaps the missing data is not random, so that dropping missing values will leave you with a biased data set, or perhaps keeping only complete data will leave you with insufficient data to run your analysis.

We can use the `dropna` method to drop missing data, and specify parameters to this method that control how data are dropped. For instance, the `how` parameter lets you specify whether a row (or column) is dropped when `'any'` or `'all'` of the data is missing. The `thresh` parameter lets you specify the minimum number of non-`NaN` values a row or column must have in order to be kept.

In [20]:
ebola.shape

# dropping missing values will leave us with just 1 row of data
ebola.dropna().shape

(122, 18)

(1, 18)
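
The effect of `how` and `thresh` can be sketched on a small made-up dataframe (`df` and its values are hypothetical) in which one row is complete, one row has a single missing value, and one row is entirely missing:

```python
import numpy as np
import pandas as pd

# hypothetical data for illustration only
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, np.nan],
})

# how='any' (the default) drops a row if any value is missing
any_rows = df.dropna(how='any')

# how='all' drops a row only if every value is missing
all_rows = df.dropna(how='all')

# thresh=3 keeps only rows with at least 3 non-missing values
thresh_rows = df.dropna(thresh=3)

# axis='columns' applies the rule to columns: 'a' has only one
# non-missing value, so thresh=2 drops it
thresh_cols = df.dropna(axis='columns', thresh=2)

print(any_rows.shape)     # (1, 3)
print(all_rows.shape)     # (2, 3)
print(thresh_rows.shape)  # (1, 3)
print(thresh_cols.shape)  # (3, 2)
```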

### 5.3.3 Calculations with Missing Data
Calculations with missing values will typically return a missing value, unless the function or method called has a means to ignore missing values in its calculations.

Examples of built-in methods that can ignore missing values include `mean` and `sum`. These methods typically have a `skipna` parameter that still calculates a value by skipping over the missing values.

In [21]:
ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + ebola['Cases_SierraLeone']
ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_multiple']]
ebola_subset.head(n=10)

Unnamed: 0,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_multiple
0,2776.0,,10030.0,
1,2775.0,,9780.0,
2,2769.0,8166.0,9722.0,20657.0
3,,8157.0,,
4,2730.0,8115.0,9633.0,20478.0
5,2706.0,8018.0,9446.0,20170.0
6,2695.0,,9409.0,
7,2630.0,7977.0,9203.0,19810.0
8,2597.0,,9004.0,
9,2571.0,7862.0,8939.0,19372.0


In [22]:
# skipping missing values is True by default
ebola.Cases_Guinea.sum(skipna=True)
ebola.Cases_Guinea.sum(skipna=False)

84729.0

nan
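
The difference between arithmetic operators and methods with `skipna` can be seen on a small made-up dataframe (`df` and its columns are hypothetical). This is a sketch of the behavior described above, not a recommendation for how the Ebola counts should be combined:

```python
import numpy as np
import pandas as pd

# hypothetical data for illustration only
df = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [10.0, 20.0, np.nan],
})

# the + operator propagates NaN: any missing operand gives NaN
added = df['x'] + df['y']

# sum across columns skips NaN by default (skipna=True)
row_sum = df.sum(axis='columns')

# mean of a column with and without skipping missing values
mean_skip = df['x'].mean()
mean_keep = df['x'].mean(skipna=False)

print(added.tolist())    # [11.0, nan, nan]
print(row_sum.tolist())  # [11.0, 20.0, 3.0]
print(mean_skip)         # 2.0
print(mean_keep)         # nan
```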