## Comparing missing values

- pandas uses NumPy's NaN object (`np.nan`) to represent a missing value
- This is an unusual object, as it is not equal to itself
- By contrast, even Python's `None` object evaluates as `True` when compared to itself:

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

In [2]:
np.nan == np.nan

False

In [3]:
None == None

True

- All other comparisons against `np.nan` also return `False`, except not equal to (`!=`):

In [4]:
np.nan > 5

False

In [5]:
5 > np.nan

False

In [6]:
np.nan != 5

True
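- Because equality can never detect `np.nan`, a scalar has to be tested with a dedicated check. The sketch below is illustrative only (the variable name `value` is made up) and uses `math.isnan`, `np.isnan`, and `pd.isna`:

```python
import math

import numpy as np
import pandas as pd

value = np.nan  # hypothetical scalar that may be missing

# Equality cannot detect NaN, because NaN is not equal to anything, itself included
print(value == np.nan)    # False

# Dedicated checks detect it reliably
print(math.isnan(value))  # True
print(np.isnan(value))    # True
print(pd.isna(value))     # True

# pd.isna also treats None as missing
print(pd.isna(None))      # True
```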

- Series and DataFrame use the equals operator, `==`, to make element-by-element comparisons that return an object of the same size
- This recipe shows you how to use the equals operator, which is very different from the `equals` method

In [7]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')

- Compare each element to a scalar value

In [8]:
college_ugds_ == .0019

| INSTNM | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | UGDS_AIAN | UGDS_NHPI | UGDS_2MOR | UGDS_NRA | UGDS_UNKN |
|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | False | False | False | True | False | True | False | False | False |
| University of Alabama at Birmingham | False | False | False | False | False | False | False | False | False |
| Amridge University | False | False | False | False | False | False | False | False | False |
| University of Alabama in Huntsville | False | False | False | False | False | False | False | False | False |
| Alabama State University | False | False | False | True | False | False | False | False | False |
| The University of Alabama | False | False | False | False | False | False | False | False | False |
| Central Alabama Community College | False | False | False | False | False | False | False | False | True |
| Athens State University | False | False | False | False | False | False | False | False | False |
| Auburn University at Montgomery | False | False | False | False | False | False | False | False | False |
| Auburn University | False | False | False | False | False | False | False | False | False |


- This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values
- This same equals operator may be used to compare two DataFrames with one another on an element-by-element basis

In [9]:
college_self_compare = college_ugds_ == college_ugds_
college_self_compare.head()

| INSTNM | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | UGDS_AIAN | UGDS_NHPI | UGDS_2MOR | UGDS_NRA | UGDS_UNKN |
|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | True | True | True | True | True | True | True | True | True |
| University of Alabama at Birmingham | True | True | True | True | True | True | True | True | True |
| Amridge University | True | True | True | True | True | True | True | True | True |
| University of Alabama in Huntsville | True | True | True | True | True | True | True | True | True |
| Alabama State University | True | True | True | True | True | True | True | True | True |


- At first glance, all the values appear to be equal. However, using the `all` method to determine whether each column contains only `True` values yields an unexpected result

In [10]:
college_self_compare.all()

UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

- Missing values do not compare as equal to one another
- Trying to count missing values by using the equals operator and summing the boolean columns results in zero for every column

In [12]:
(college_ugds_ == np.nan).sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64
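- The same behavior can be reproduced on a tiny frame. The sketch below uses a made-up DataFrame `df` as a stand-in for `college_ugds_` (its values are invented for illustration). It also shows that, since `np.nan` is the only value not equal to itself, the not-equal operator flags exactly the missing cells:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for college_ugds_, with three missing values
df = pd.DataFrame({'a': [0.1, np.nan, 0.3], 'b': [np.nan, np.nan, 0.6]})

# Comparing against np.nan never matches, so every count is zero
eq_counts = (df == np.nan).sum()
print(eq_counts.tolist())         # [0, 0]

# NaN is the only value that differs from itself, so != marks the missing cells
self_diff_counts = (df != df).sum()
print(self_diff_counts.tolist())  # [1, 2]
```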

- The primary way to count missing values uses the `isnull` method:

In [13]:
college_ugds_.isnull().sum()

UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64
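- A few common variations on this count, sketched on a made-up frame `df` (not the college data): `isna` is the same method as `isnull` under another name, and chaining a second `sum` gives the total for the whole DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'a' and two in 'b'
df = pd.DataFrame({'a': [0.1, np.nan, 0.3], 'b': [np.nan, np.nan, 0.6]})

# Missing values per column
per_column = df.isnull().sum()
print(per_column.tolist())                # [1, 2]

# isna and isnull produce identical boolean frames
print(df.isna().equals(df.isnull()))      # True

# Total number of missing values in the whole frame
total = df.isnull().sum().sum()
print(total)                              # 3

# Rows that contain at least one missing value
rows_with_missing = df.isnull().any(axis='columns')
print(rows_with_missing.tolist())         # [True, True, False]
```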

- The correct way to compare two entire DataFrames with one another is not with the equals operator but with the `equals` method:

In [14]:
college_ugds_.equals(college_ugds_)

True
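- The difference between the operator and the method shows up on small, made-up frames (all names below are hypothetical). The `equals` method treats missing values in the same location as equal, but it also requires matching dtypes, so it can return `False` where `==` finds every element equal:

```python
import numpy as np
import pandas as pd

# Two identical frames that contain a missing value
left = pd.DataFrame({'x': [1.0, np.nan]})
right = left.copy()

print((left == right).all().all())  # False, the NaN cell compares as unequal
print(left.equals(right))           # True, NaNs in the same location count as equal

# Same values, different dtypes (int64 versus float64)
as_int = pd.DataFrame({'x': [1, 2]})
as_float = pd.DataFrame({'x': [1.0, 2.0]})

print((as_int == as_float).all().all())  # True, the values match element-wise
print(as_int.equals(as_float))           # False, the column dtypes differ
```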

## There's more...

- The `eq` DataFrame method does element-by-element comparison, just like the equals operator
- The `eq` method is not at all the same as the `equals` method

In [15]:
college_ugds_.eq(.0019) # same as college_ugds_ == .0019

| INSTNM | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | UGDS_AIAN | UGDS_NHPI | UGDS_2MOR | UGDS_NRA | UGDS_UNKN |
|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | False | False | False | True | False | True | False | False | False |
| University of Alabama at Birmingham | False | False | False | False | False | False | False | False | False |
| Amridge University | False | False | False | False | False | False | False | False | False |
| University of Alabama in Huntsville | False | False | False | False | False | False | False | False | False |
| Alabama State University | False | False | False | True | False | False | False | False | False |
| The University of Alabama | False | False | False | False | False | False | False | False | False |
| Central Alabama Community College | False | False | False | False | False | False | False | False | True |
| Athens State University | False | False | False | False | False | False | False | False | False |
| Auburn University at Montgomery | False | False | False | False | False | False | False | False | False |
| Auburn University | False | False | False | False | False | False | False | False | False |
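- One practical reason to reach for `eq` instead of `==` is its `axis` parameter, which controls how a Series is lined up against the DataFrame. A minimal sketch on invented data (the frame `df` and the Series `expected` are hypothetical):

```python
import pandas as pd

# Hypothetical frame and a Series holding one expected value per row
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 5, 3]}, index=['r1', 'r2', 'r3'])
expected = pd.Series([1, 2, 3], index=['r1', 'r2', 'r3'])

# By default a Series is matched against the column labels;
# axis='index' matches it against the row labels instead
by_row = df.eq(expected, axis='index')

print(by_row['a'].tolist())  # [True, True, True]
print(by_row['b'].tolist())  # [True, False, True]
```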


- The `pandas.testing` sub-package contains a function that developers can use when writing unit tests
- The `assert_frame_equal` function raises an `AssertionError` if two DataFrames are not equal
- It returns `None` if the two passed frames are equal

In [16]:
from pandas.testing import assert_frame_equal
assert_frame_equal(college_ugds_, college_ugds_)
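- The failing case is easier to see on small, made-up frames (the names `expected`, `same`, and `different` are hypothetical). This sketch catches the `AssertionError` and also shows the `check_dtype` parameter, which relaxes the dtype requirement:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({'x': [1.0, np.nan]})
same = expected.copy()
different = pd.DataFrame({'x': [1.0, 2.0]})

# Passes silently and returns None; NaNs in matching positions count as equal
result = assert_frame_equal(expected, same)
print(result)  # None

# Raises AssertionError when the frames differ
try:
    assert_frame_equal(expected, different)
except AssertionError:
    failed = True
else:
    failed = False
print(failed)  # True

# check_dtype=False lets int64 and float64 columns with equal values pass
assert_frame_equal(
    pd.DataFrame({'x': [1, 2]}),
    pd.DataFrame({'x': [1.0, 2.0]}),
    check_dtype=False,
)
```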