# Missing Data in Pandas

## `None`: Pythonic missing data

In [1]:
import numpy as np
import pandas as pd

In [2]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

The `dtype=object` means that the best common type NumPy could infer for the contents of the array is the generic Python object. Object arrays are useful for some purposes, but any operation on the data is done at the Python level, with much more overhead than the typically fast operations on arrays with native types:

In [3]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
59.7 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.75 ms ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



Using Python objects also raises an error if you try to perform aggregations like `sum()` or `min()` on an array that contains a `None` value:

In [4]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

This shows that addition between an integer and `None` is undefined.
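One way to work around this, sketched here under the assumption that `None` entries should simply be skipped, is to build a boolean mask with `pd.isnull` and aggregate only the values that are present. The variable names `mask` and `total` are illustrative.

```python
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])  # dtype=object because of the None

# Summing directly fails: int + None is undefined.
try:
    vals1.sum()
    sum_failed = False
except TypeError:
    sum_failed = True

# pd.isnull returns a boolean array marking the None entries.
mask = pd.isnull(vals1)

# Aggregate only over the entries that are present.
total = vals1[~mask].sum()
print(sum_failed, mask, total)
```

The array is still of `object` dtype, so this remains slow compared with a native numeric array.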

## `NaN`: Missing numerical data

`NaN` stands for *Not a Number* and is another way to represent missing data. It is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [6]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

Beware that `NaN` is like a virus: it infects every other value it interacts with. Whatever the operation, arithmetic involving `NaN` yields another `NaN`. Observe:

In [7]:
1 + np.nan

nan

In [8]:
0 + np.nan

nan

In [9]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

For this reason, NumPy provides some special aggregations that ignore missing values:

In [12]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)

**NOTE** that `NaN` is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
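A short sketch of two consequences of `NaN` being an IEEE floating-point value: it never compares equal to anything (itself included), so it has to be located with `np.isnan` rather than `==`, and an array containing it is stored as floats. The variable names here are illustrative.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

# NaN is never equal to anything, including itself.
nan_equals_itself = (np.nan == np.nan)

# So an equality test cannot find NaN entries, but np.isnan can.
eq_mask = (vals2 == np.nan)
nan_mask = np.isnan(vals2)

# Without NaN the array stays integer; with NaN it is stored as float64.
int_kind = np.array([1, 3, 4]).dtype.kind
print(nan_equals_itself, eq_mask, nan_mask, int_kind, vals2.dtype)
```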

## NaN and None in Pandas

In [14]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

When a type has no sentinel value for missing data and we set one of the elements to NA, Pandas automatically upcasts the type to accommodate the NA:

In [15]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [17]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

Notice that the `dtype` of `x` changed from `int64` to `float64`, and that `None` was converted to `NaN`.
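The same upcasting can be seen at construction time. The following is a minimal sketch, assuming the default (non-nullable) NumPy-backed dtypes; the Series names are illustrative.

```python
import pandas as pd

int_series = pd.Series([1, 2])          # no missing values: integer dtype
with_missing = pd.Series([1, None])     # None becomes NaN, so the ints become floats
bool_missing = pd.Series([True, None])  # booleans have no NaN, so pandas falls back to object

print(int_series.dtype, with_missing.dtype, bool_missing.dtype)
```

Integers are upcast to `float64` because `NaN` is a float, while booleans are stored as generic Python objects.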

# Operating on Null Values

In [23]:
v = pd.Series(vals2)
v

0    1.0
1    NaN
2    3.0
3    4.0
dtype: float64

In [24]:
v.isnull()

0    False
1     True
2    False
3    False
dtype: bool

In [25]:
v.notnull()

0     True
1    False
2     True
3     True
dtype: bool

In [26]:
v.dropna()

0    1.0
2    3.0
3    4.0
dtype: float64

In [31]:
v.fillna(0.1213)

0    1.0000
1    0.1213
2    3.0000
3    4.0000
dtype: float64

## Detecting null values

In [32]:
data = pd.Series([1, np.nan, 'hello', None])

In [33]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [34]:
data[data.notnull()]

0        1
2    hello
dtype: object
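Because the masks are boolean, they can also be summed to count missing and present entries. This is a small sketch on the same `data` Series; the names `n_missing` and `n_present` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

# True counts as 1, so summing a mask counts the matching entries.
n_missing = data.isnull().sum()
n_present = data.notnull().sum()
print(n_missing, n_present)
```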

## Dropping null values
#### Series

In [35]:
data.dropna()

0        1
2    hello
dtype: object

#### DataFrame

In [36]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


For a `DataFrame`, we cannot drop a single value; we can only drop full rows or full columns. `dropna()` provides a number of options for this. By default, it drops all rows that contain any null value:

In [37]:
df.dropna()

     0    1  2
1  2.0  3.0  5


In [38]:
df.dropna(axis='columns')

   2
0  2
1  5
2  6


In [39]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [40]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [41]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [47]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN


The first and last rows are dropped because they contain only two *non-null* values.
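To make the `thresh` rule concrete, the non-null count per row can be computed by hand and compared with what `dropna` keeps. This is a sketch for illustration only; `non_null_per_row` and `manual` are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Count the non-null entries in each row.
non_null_per_row = df.notnull().sum(axis=1)

# Keep only rows whose count reaches the threshold, mirroring dropna(thresh=3).
manual = df[non_null_per_row >= 3]
print(non_null_per_row.tolist())
print(manual)
```

Only the middle row has three non-null values, so it is the only row that survives.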

## Filling null values

In [48]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [49]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [50]:
# forward-fill
data.fillna(method='ffill')

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [51]:
# back-fill
data.fillna(method='bfill')

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
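In recent pandas releases the `method` argument of `fillna` is deprecated in favor of the dedicated `ffill()` and `bfill()` methods. The sketch below shows the equivalent calls on the same `data` Series; the names `forward` and `backward` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()   # propagate the previous valid value forward
backward = data.bfill()  # propagate the next valid value backward
print(forward.tolist(), backward.tolist())
```

Both methods also accept an `axis` argument when called on a `DataFrame`.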

In [52]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


For a `DataFrame` the options are similar, but we can also specify an `axis` along which the fills take place:

In [53]:
df.fillna(method='ffill', axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


In [56]:
df.fillna(method='bfill', axis=1)

     0    1    2   3
0  1.0  2.0  2.0 NaN
1  2.0  3.0  5.0 NaN
2  4.0  4.0  6.0 NaN


In [57]:
df.fillna(method='bfill', axis=1).fillna(method='ffill', axis=1)

     0    1    2    3
0  1.0  2.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  4.0  4.0  6.0  6.0


In [55]:
df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  4.0  4.0  6.0  6.0
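The two chained fills leave no nulls, but they do not give the same result: the order decides which neighbor fills a gap in the middle of a row. The sketch below checks this on the same data, built with `dtype=float` (an assumption made here to keep all columns numeric) and using the `ffill()` and `bfill()` methods instead of `fillna(method=...)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]], dtype=float)
df[3] = np.nan

# Fill along each row in both orders.
ff_then_bf = df.ffill(axis=1).bfill(axis=1)
bf_then_ff = df.bfill(axis=1).ffill(axis=1)

# The gap at row 0, column 1 sits between 1.0 (left) and 2.0 (right).
print(ff_then_bf.iloc[0, 1], bf_then_ff.iloc[0, 1])
```

Forward-filling first takes the value to the left of the gap, while back-filling first takes the value to the right.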
