## Missing data basics

### When / why does data become missing?

In [1]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame(np.random.randn(5,3), index=['a','b','c','d','e'], columns=['one','two','three'])

In [7]:
df

Unnamed: 0,one,two,three
a,0.76022,0.639695,0.026678
b,-0.502284,1.153093,1.192418
c,-0.386687,-0.944764,-0.84853
d,-1.363746,-0.197183,0.145
e,-0.644312,1.56139,-1.199459


In [8]:
df['four'] = 'bar'

In [11]:
df['five'] = df['one'] > 0

In [59]:
df

Unnamed: 0,one,two,three,four,five
a,0.76022,0.639695,0.026678,bar,True
b,-0.502284,1.153093,1.192418,bar,False
c,-0.386687,-0.944764,-0.84853,bar,False
d,-1.363746,-0.197183,0.145,bar,False
e,-0.644312,1.56139,-1.199459,bar,False


In [13]:
df2 = df.reindex(['a','b','c','d','e','f','g','h'])

In [14]:
df2

Unnamed: 0,one,two,three,four,five
a,0.76022,0.639695,0.026678,bar,True
b,-0.502284,1.153093,1.192418,bar,False
c,-0.386687,-0.944764,-0.84853,bar,False
d,-1.363746,-0.197183,0.145,bar,False
e,-0.644312,1.56139,-1.199459,bar,False
f,,,,,
g,,,,,
h,,,,,


In [15]:
pd.isnull(df2['one'])

a    False
b    False
c    False
d    False
e    False
f     True
g     True
h     True
Name: one, dtype: bool

In [16]:
pd.isnull(df2)

Unnamed: 0,one,two,three,four,five
a,False,False,False,False,False
b,False,False,False,False,False
c,False,False,False,False,False
d,False,False,False,False,False
e,False,False,False,False,False
f,True,True,True,True,True
g,True,True,True,True,True
h,True,True,True,True,True
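
The boolean masks above can also be reduced to counts or used as filters. The following is a minimal sketch on a small hypothetical frame (`frame` and its values are made up for illustration, not taken from the cells above); it assumes a pandas version where `isna` and `notna` are available as aliases of `isnull` and `notnull`.

```python
import pandas as pd

# Hypothetical two-row frame; reindexing adds rows 'c' and 'd' filled with NaN.
frame = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]}, index=["a", "b"])
frame = frame.reindex(["a", "b", "c", "d"])

# Number of missing cells in each column.
missing_per_column = frame.isna().sum()

# True for every row that contains at least one missing value.
rows_with_missing = frame.isna().any(axis=1)

# Keep only the rows where column 'one' is present.
present_rows = frame[frame["one"].notna()]

print(missing_per_column)
print(rows_with_missing)
print(present_rows)
```

Summing a boolean mask works because `True` counts as 1, so the result is the number of missing entries per column.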


## Datetimes 

For datetime64[ns] columns, NaT represents missing values. It is a pseudo-native sentinel value that NumPy can represent within a single dtype (datetime64[ns]). pandas objects treat NaT and NaN compatibly: assigning np.nan into a datetime column stores NaT, and pd.isnull reports both as missing.
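
As a rough illustration of that compatibility, the sketch below builds a hypothetical frame (`mixed`, with made-up values) that holds NaN in a float column and NaT in a datetime column, then checks that the same detection and counting calls handle both.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a float column with NaN and a datetime column with NaT.
mixed = pd.DataFrame({
    "value": [1.5, np.nan, 3.0],
    "when": pd.to_datetime(["2012-01-01", None, "2012-01-03"]),
})

# isna() flags NaN and NaT alike.
mask = mixed.isna()

# count() skips missing entries of either kind.
non_missing = mixed.count()

print(mixed)
print(mask)
print(non_missing)
```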



In [17]:
df2 = df.copy()

In [20]:
df2['timestamp'] = pd.Timestamp('20120101')

In [21]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,0.76022,0.639695,0.026678,bar,True,2012-01-01
b,-0.502284,1.153093,1.192418,bar,False,2012-01-01
c,-0.386687,-0.944764,-0.84853,bar,False,2012-01-01
d,-1.363746,-0.197183,0.145,bar,False,2012-01-01
e,-0.644312,1.56139,-1.199459,bar,False,2012-01-01


In [23]:
# .ix was removed in pandas 1.0; .loc performs the same label-based selection
df2.loc[['a','c','e'],['one','timestamp']] = np.nan

In [24]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,0.639695,0.026678,bar,True,NaT
b,-0.502284,1.153093,1.192418,bar,False,2012-01-01
c,,-0.944764,-0.84853,bar,False,NaT
d,-1.363746,-0.197183,0.145,bar,False,2012-01-01
e,,1.56139,-1.199459,bar,False,NaT


In [25]:
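# get_dtype_counts() was removed in pandas 1.0; count the dtypes by name instead
df2.dtypes.astype(str).value_counts().sort_index()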

bool              1
datetime64[ns]    1
float64           3
object            1
dtype: int64

## Inserting missing data

In [26]:
s = pd.Series([1,2,3])

In [28]:
s.loc[0] = None

In [29]:
s

0   NaN
1     2
2     3
dtype: float64

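Numeric containers always use NaN, whichever missing value is assigned, which is why the integer Series above is upcast to float64 when None is inserted. Likewise, datetime containers always use NaT.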
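
A minimal sketch of both rules, using two hypothetical Series (`numbers` and `dates` are illustrative names, not part of the cells above). The numeric Series starts as float64 so that the assignment does not trigger a dtype change.

```python
import numpy as np
import pandas as pd

# Numeric container: None and np.nan are both stored as NaN.
numbers = pd.Series([1.0, 2.0, 3.0])
numbers.loc[0] = None
numbers.loc[1] = np.nan

# Datetime container: None and np.nan are both stored as NaT.
dates = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]))
dates.loc[0] = None
dates.loc[1] = np.nan

print(numbers)
print(dates)
```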

For object containers, pandas will use the value given:

In [31]:
s = pd.Series(["a","b","c"])

In [33]:
s.loc[1] = np.nan

In [35]:
s.loc[1] = None

In [36]:
s

0       a
1    None
2       c
dtype: object

## Calculations with missing data

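Missing values propagate through arithmetic operations between pandas objects. The operands are first aligned on their row and column labels, and any label present in only one operand produces NaN in the result. Here a and b share no column names, so every cell of the sum is NaN.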

In [62]:
a = df[['one','two']]

In [63]:
b = df2[['three','four']]

In [64]:
a + b

Unnamed: 0,four,one,three,two
a,,,,
b,,,,
c,,,,
d,,,,
e,,,,
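
The all-NaN result above is the extreme case of label alignment. The sketch below uses two hypothetical Series (`left` and `right`, with made-up values) whose labels only partly overlap, and also shows the `fill_value` argument of `Series.add`, which substitutes a value for an operand that is missing on one side only.

```python
import numpy as np
import pandas as pd

# Hypothetical operands with partially overlapping labels.
left = pd.Series([1.0, 2.0, np.nan], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Plain addition: NaN wherever a label or a value is missing on either side.
total = left + right

# With fill_value=0, a value missing on one side is treated as 0.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

Only label "b" has a value in both operands, so it is the only non-missing entry of `total`.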
