# Chapter 16. Handling Missing Data
**In this chapter, we will discuss some general considerations for missing
data, look at how Pandas chooses to represent it, and explore some built-in
Pandas tools for handling missing data in Python.**

In [1]:
import numpy as np
import pandas as pd

## None as a Sentinel Value
For some data types, Pandas uses None as a sentinel value. **None** is a
Python object, which means that any array containing None must have
*dtype=object*—that is, it must be a sequence of Python objects.

In [2]:
vals1 = np.array([1,None,2,3])
vals1

array([1, None, 2, 3], dtype=object)

**NB**
#### The downside of using None in this way is that operations on the data are carried out at the Python level, with much more overhead than the typically fast operations seen for arrays with native types.

In [3]:
%timeit np.arange(1E6,dtype=int).sum()

2.13 ms ± 344 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
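The timing above covers only the native integer array. As a rough sketch of the comparison the note describes, the snippet below times the same sum on an object array and on an integer array with the standard-library `timeit` module. The array size `n` and the repeat count are arbitrary choices made for this illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np

n = 100_000   # illustrative size, smaller than the 1E6 used above
repeats = 10  # illustrative repeat count

timings = {}
for dtype in (object, int):
    arr = np.arange(n, dtype=dtype)
    # Total time for `repeats` calls of arr.sum(), averaged per call
    timings[np.dtype(dtype).name] = timeit.timeit(arr.sum, number=repeats) / repeats

for name, seconds in timings.items():
    print(f"dtype={name}: {seconds * 1e3:.3f} ms per sum")
```

On most systems the object array should come out noticeably slower, because every addition goes through Python objects rather than compiled code.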


### Further, because Python does not support arithmetic operations with None, aggregations like sum or min will generally lead to an error.
For this reason, Pandas does not use None as a sentinel in its numerical
arrays.
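A minimal sketch of the error described above, reusing the same values as `vals1`. The exception is caught only so that the snippet runs to completion; the variable name `message` is introduced for this illustration.

```python
import numpy as np

vals1 = np.array([1, None, 2, 3])  # dtype=object because of the None

message = None
try:
    vals1.sum()  # adds the elements one by one at the Python level
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```

The sum fails as soon as an integer is added to None, which is why None cannot serve as a sentinel in numerical aggregations.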

## NaN: Missing Numerical Data
The other missing data sentinel, NaN, is different: it is a special
floating-point value recognized by all systems that use the standard IEEE
floating-point representation.
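Because NaN is an ordinary IEEE floating-point value, it follows IEEE comparison rules, which can be surprising. The short sketch below illustrates two consequences: `np.nan` is a plain Python float, and it never compares equal to anything, including itself, so a dedicated check such as `math.isnan` or `np.isnan` is needed to detect it.

```python
import math

import numpy as np

nan_type = type(np.nan)          # np.nan is a regular Python float
equal_to_self = np.nan == np.nan  # IEEE rules: NaN is not equal to itself

print(nan_type)
print(equal_to_self)
print(math.isnan(np.nan), bool(np.isnan(np.nan)))
```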

In [4]:
vals2 = np.array([1, np.nan, 3, 4])
vals2

array([ 1., nan,  3.,  4.])

**NB**
#### Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. Keep in mind that NaN is a bit like a data virus—it infects any other object it touches.

In [5]:
1+np.nan

nan

In [6]:
0*np.nan

nan

In [7]:
vals2.sum()

np.float64(nan)

In [8]:
vals2.min(),vals2.max()

(np.float64(nan), np.float64(nan))

**NumPy does provide NaN-aware versions of aggregations that
will ignore these missing values.**

In [10]:
np.nansum(vals2),np.nanmax(vals2),np.nanmin(vals2)

(np.float64(8.0), np.float64(4.0), np.float64(1.0))

**The main downside of NaN is that it is specifically a floating-point value;
there is no equivalent NaN value for integers, strings, or other types.**
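To make this limitation concrete, the following sketch tries to store NaN in an integer NumPy array. The names `ints`, `error`, and `mixed` are introduced here for illustration only.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=int)

error = None
try:
    ints[0] = np.nan  # an integer array has no slot that can hold NaN
except ValueError as err:
    error = str(err)
    print("ValueError:", error)

# When NaN is present at construction time, NumPy picks a float dtype instead
mixed = np.array([1, np.nan, 3])
print(mixed.dtype)
```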

## NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two
of them nearly interchangeably, converting between them where
appropriate.

In [11]:
pd.Series([1,np.nan,2,None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

| Typeclass | Conversion when storing NAs | NA sentinel value |
|-----------|-----------------------------|-------------------|
| floating  | No change                   | np.nan            |
| object    | No change                   | None or np.nan    |
| integer   | Cast to float64             | np.nan            |
| boolean   | Cast to object              | None or np.nan    |

**Keep in mind that in Pandas, string data has traditionally been stored
with an object dtype.**
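The conversions in the table can be checked directly by comparing the dtype Pandas infers with and without a missing value. This is a small sketch; the variable names are chosen for illustration.

```python
import pandas as pd

int_clean = pd.Series([1, 2, 3])
int_missing = pd.Series([1, 2, None])          # integers are cast to float64
bool_clean = pd.Series([True, False])
bool_missing = pd.Series([True, False, None])  # booleans are cast to object

for name, s in [("int", int_clean), ("int + NA", int_missing),
                ("bool", bool_clean), ("bool + NA", bool_missing)]:
    print(f"{name:10s} -> {s.dtype}")
```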

## Pandas Nullable Dtypes
Relying on NaN and None as the only missing-data sentinels introduced a
difficulty with implicit type casting: for example, there
was no way to represent a true integer array with missing data.  
**To address this difficulty, Pandas later added nullable dtypes**, which are
distinguished from regular dtypes by capitalization of their names (e.g.,
the Pandas dtype 'Int32', also available as pd.Int32Dtype(), versus
np.int32). For backward compatibility, these
nullable dtypes are only used if specifically requested.

In [12]:
pd.Series([1,np.nan,2,None,pd.NA],dtype='Int32')

0       1
1    <NA>
2       2
3    <NA>
4    <NA>
dtype: Int32
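As a brief sketch of why this matters, the example below shows that arithmetic on a nullable integer Series keeps the integer dtype and propagates the missing entry, while `sum` skips it by default. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, None, 2], dtype="Int32")

plus_one = s + 1          # stays Int32; the missing entry stays missing
total = s.sum()           # missing values are skipped by default
mask = s.isna().tolist()  # Boolean mask of the missing positions

print(plus_one)
print(total)
print(mask)
```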

## Operating on Null Values
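Pandas treats None, NaN, and NA as essentially
interchangeable for indicating missing or null values. To facilitate this
convention, Pandas provides several methods for detecting, removing, and
replacing null values in Pandas data structures: isnull and notnull
generate Boolean masks, dropna returns a filtered copy of the data, and
fillna returns a copy with the null values filled in.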

In [13]:
data = pd.Series([1,np.nan,'hello',None])

In [14]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [15]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

In [16]:
data[data.notnull()]

0        1
2    hello
dtype: object
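Since the masks are Boolean, they can also be summed to count missing entries. The sketch below does this for the same `data` Series and also checks that `isna` returns the same mask as `isnull`; the names `n_missing` and `same_mask` are introduced for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

n_missing = int(data.isnull().sum())            # True counts as 1
same_mask = data.isna().equals(data.isnull())   # isna is an alias of isnull

print(n_missing)
print(same_mask)
```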

In [17]:
data.dropna()

0        1
2    hello
dtype: object

In [18]:
data.fillna(1)

0        1
1        1
2    hello
3        1
dtype: object
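Filling with a constant is only one option. As a hedged sketch, the example below uses a separate numeric Series (not the `data` Series above, and with labels chosen for illustration) to show a constant fill next to forward fill and backward fill, which propagate the previous or next valid value.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

filled_zero = numbers.fillna(0)  # replace every null with a constant
filled_fwd = numbers.ffill()     # carry the previous valid value forward
filled_bwd = numbers.bfill()     # pull the next valid value backward

print(filled_zero.tolist())
print(filled_fwd.tolist())
print(filled_bwd.tolist())
```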

In [19]:
df = pd.DataFrame([[1,np.nan,2],
                   [2,3,5],
                   [np.nan,4,7]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  7


In [20]:
df.dropna()  # drops every row that contains a null value

     0    1  2
1  2.0  3.0  5


In [21]:
df.dropna(axis='columns')

   2
0  2
1  5
2  7


In [22]:
df.dropna(axis=1)  # axis=1 is equivalent to axis='columns': drops every column containing a null value

   2
0  2
1  5
2  7


In [23]:
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  7


In [24]:
df.dropna(axis=1,how='all')  # drops only columns that are entirely null; there are none here, so df is unchanged

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  7
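The effect of `how='all'` is easier to see when one column really is empty. The sketch below rebuilds the same frame, adds an all-NaN column labeled `3` (an addition made purely for this illustration), and also shows the `thresh` parameter, which keeps only rows or columns with at least that many non-null values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 7]])
df[3] = np.nan  # hypothetical extra column containing only nulls

# Drop only the columns in which every value is null
no_empty_cols = df.dropna(axis='columns', how='all')

# Keep only the rows that have at least 3 non-null values
enough_data = df.dropna(axis='index', thresh=3)

print(no_empty_cols)
print(enough_data)
```

Here `how='all'` removes only column `3`, and `thresh=3` keeps only the row labeled `1`, the one row with three valid entries.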
