**Handling Missing Data**
Missing data in a table or DataFrame is typically represented with one of two strategies: a mask that globally indicates missing values, or a *sentinel value* that indicates a missing entry.

The mask might be an entirely separate Boolean array, or it may involve appropriating one bit in the data representation to locally indicate a null value.

In the sentinel approach, the value can be a data-specific convention, such as indicating a missing integer with -9999 or a rare bit pattern, or it can be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number).

Each approach has trade-offs. A separate mask array requires allocating an additional Boolean array, which adds overhead in both computation and storage. A sentinel value reduces the range of valid values that can be represented, and may require extra logic in CPU and GPU arithmetic.
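
The two strategies can be sketched side by side in NumPy. This is only an illustration: the readings, the variable names, and the -9999 convention are assumptions chosen for the example, and `np.ma.masked_array` stands in for "a separate Boolean mask".

```python
import numpy as np

# Hypothetical readings; the third one is missing.
raw = np.array([12, 7, 0, 31], dtype=np.int64)

# Strategy 1: a separate Boolean mask (True marks a missing entry).
missing = np.array([False, False, True, False])
masked = np.ma.masked_array(raw, mask=missing)
mask_total = masked.sum()            # masked entries are skipped

# Strategy 2: a sentinel value stored in the data itself.
SENTINEL = -9999                     # assumed convention, not a NumPy constant
with_sentinel = raw.copy()
with_sentinel[missing] = SENTINEL
sentinel_total = with_sentinel[with_sentinel != SENTINEL].sum()

print(mask_total, sentinel_total)
print("extra bytes for the mask:", missing.nbytes)
```

The mask costs one extra byte per element here, while the sentinel version needs no extra storage but can no longer represent -9999 as a real reading and has to filter it out before every aggregation.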

*Missing Data in Pandas*

Pandas handles missing values by relying on NumPy, which doesn't have a built-in notion of NA values for non-floating-point data types.

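NumPy supports fourteen basic integer types once available precisions, signedness, and endianness of the encoding are taken into account, so reserving a bit pattern as a sentinel in every one of them would be unwieldy. NumPy does support masked arrays (the numpy.ma module), but Pandas does not build on them because of the overhead in storage, computation, and code maintenance. Instead, Pandas uses sentinels for missing data, and chose two null values that already exist in Python: the floating-point NaN value and the None object.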

*None: Pythonic missing data*
The first sentinel value used by Pandas is *None*, a Python object that is often used for missing values in Python code. Because it is a Python object, None can only be used in arrays with data type 'object' (that is, arrays of Python objects).

In [1]:
import numpy as np
import pandas as pd

In [3]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. This kind of object array is useful for some purposes, but any operations on the data will be done at the Python level, with much more overhead than the fast operations seen for arrays with native types:

In [4]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
37.2 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.45 ms ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
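
The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. Note the assumptions: this version uses a smaller array and times only the `sum()` call rather than the array creation, so the numbers will not match the ones above and will vary by machine.

```python
import timeit

import numpy as np

n = 100_000          # smaller than 1E6 to keep the sketch quick
timings = {}
totals = {}

for dtype in ['object', 'int']:
    arr = np.arange(n, dtype=dtype)
    totals[dtype] = arr.sum()
    # Total time in seconds for 20 repeated calls to arr.sum()
    timings[dtype] = timeit.timeit(arr.sum, number=20)
    print(f"dtype = {dtype}: {timings[dtype]:.4f} s for 20 sums")
```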



The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error:

In [5]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

This reflects the fact that addition between an integer and None is undefined.

*NaN: Missing numerical data*
The other missing data representation, NaN (Not a Number), is different;
it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [6]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

NumPy chose a native floating-point type for this array, which means that unlike the object array from before, it supports fast operations pushed into compiled code. *Regardless of the operation, the result of arithmetic with NaN will be another NaN:*

In [7]:
1 + np.nan

nan

In [8]:
0 * np.nan

nan

This means that aggregates over the values are well defined (they don't result in an error), but the result is not always useful:

In [9]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [10]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)

NaN is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
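
A small sketch of what that restriction means in practice: assigning NaN into an integer array fails, while the same data converted to a floating-point dtype accepts it. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)

try:
    ints[0] = np.nan             # integers have no bit pattern for NaN
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Converting to a floating-point dtype first makes room for NaN.
floats = ints.astype(np.float64)
floats[0] = np.nan

print(error_message)
print(floats)
```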

**NaN and None in Pandas**
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [16]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [17]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to NaN. 
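
The conversion depends on what else the data contains. The sketch below builds three small Series (the dictionary keys are just labels for this example) and reports the dtype Pandas infers and how the missing entry ends up being stored with the default dtypes.

```python
import numpy as np
import pandas as pd

# How the missing value is stored depends on the other values present.
examples = {
    "floating": pd.Series([1.5, None]),
    "integer": pd.Series([1, None]),
    "boolean": pd.Series([True, None]),
}

for kind, series in examples.items():
    stored = series.iloc[1]
    print(f"{kind:>8}: dtype={series.dtype}, missing stored as {stored!r}")
```

Floating-point and integer data end up as float64 with NaN, while Boolean data falls back to the object dtype and keeps the None.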

**Operating on Null Values**
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. There are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

isnull() : Generate a Boolean mask indicating missing values

notnull() : Opposite of isnull()

dropna() : Return a filtered version of the data

fillna() : Return a copy of the data with missing values filled or imputed

**Detecting null values**

Pandas data structures have useful methods for detecting null data: isnull() and notnull(). Either will return a Boolean mask over the data. 

In [18]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [20]:
data[data.notnull()]

0        1
2    hello
dtype: object

The isnull() and notnull() methods produce similar Boolean results for DataFrames.
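
As a brief sketch of the DataFrame case, the Boolean mask can be summed or reduced to count nulls per column or flag rows that contain any null. The frame and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, np.nan, 6.0]})

null_mask = frame.isnull()                  # Boolean DataFrame, same shape as frame
nulls_per_column = null_mask.sum()          # each True counts as 1
rows_with_any_null = null_mask.any(axis=1)  # True where a row has at least one null

print(nulls_per_column)
print(rows_with_any_null)
```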

**Dropping null values**

In addition to the masking used above, there are the convenience methods dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:

In [21]:
data.dropna()

0        1
2    hello
dtype: object

For a DataFrame, there are more options. Consider the following DataFrame:

In [22]:
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame.

By default, dropna() will drop all rows in which any null value is present:

In [23]:
df.dropna()

     0    1  2
1  2.0  3.0  5


Alternatively, you can drop NA values along a different axis; axis=1 (or axis='columns') drops all columns containing a null value:

In [24]:
df.dropna(axis='columns')

   2
0  2
1  5
2  6


But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the *how* or *thresh* parameters, which allow fine control of the number of nulls to allow through.

The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:

In [25]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [26]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [27]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
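
The same threshold idea can be applied to columns, and dropna() also accepts a subset argument that restricts which labels are inspected. The sketch below rebuilds the four-column DataFrame from this section so that it runs on its own; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df[3] = np.nan

# Keep only columns that have at least two non-null values.
cols_kept = df.dropna(axis='columns', thresh=2)

# Only inspect columns 0 and 1 when deciding which rows to drop.
rows_kept = df.dropna(subset=[0, 1])

print(cols_kept.columns.tolist())
print(rows_kept.index.tolist())
```

Here the all-null column 3 is the only column dropped by the threshold, and only row 1 has values in both column 0 and column 1.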


**Filling null values**

Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.

This can be done in place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.

Consider the following Series:

In [28]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

We can fill NA entries with a single value, such as zero:

In [29]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

We can specify a forward-fill to propagate the previous value forward:

In [30]:
# forward-fill (fillna(method='ffill') is deprecated in recent pandas releases)
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:

In [31]:
# back-fill (fillna(method='bfill') is deprecated in recent pandas releases)
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

For DataFrames, the options are similar, but we can also specify an axis along which the fills take place:

In [32]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [33]:
# forward-fill along each row (fillna(method='ffill') is deprecated in recent pandas releases)
df.ffill(axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


If a previous value is not available during a forward fill, the NA value remains.
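
The other replacement strategies mentioned at the start of this section (a mask-based fill, and imputation or interpolation from the good values) can be sketched on the same Series. The result names are illustrative, and filling with the mean is just one simple imputation choice among many.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Mask-based fill on a copy; gives the same result as data.fillna(0).
by_mask = data.copy()
by_mask[by_mask.isnull()] = 0

# Simple imputation: replace nulls with the mean of the non-null values.
by_mean = data.fillna(data.mean())

# Linear interpolation between neighboring non-null values.
by_interp = data.interpolate()

print(by_mean.tolist())
print(by_interp.tolist())
```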