## 1 Handling Missing Data

In [1]:
import numpy as np

In [2]:
import pandas as pd

Python's built-in `None` value is also treated as NA in object arrays.

 <img src='img/7_1_1.png'>
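As a small illustration of the point above, the sketch below builds a Series that mixes strings with `None` and `np.nan` and checks which entries pandas reports as missing. The variable names `string_data` and `na_mask` are made up for this example.

```python
import numpy as np
import pandas as pd

# A Series holding strings plus two kinds of missing markers
string_data = pd.Series(["aardvark", None, np.nan, "avocado"])

# isna() flags both None and NaN as missing
na_mask = string_data.isna()
print(na_mask.tolist())
```

Both the `None` entry and the `NaN` entry are flagged, while the two strings are not.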

### 1.1 Filtering Out Missing Data

Use the `dropna` method to filter missing data out of a Series:

In [3]:
from numpy import nan as NA

In [4]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [5]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

It's equivalent to:

In [7]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64
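The equivalence can be checked directly. This is a minimal sketch that rebuilds the same Series under a different name (`values`, chosen here only for illustration) and compares the two results with `Series.equals`. It uses `notna`, which is an alias of `notnull`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Filter once with dropna and once with a boolean mask
via_dropna = values.dropna()
via_mask = values[values.notna()]

# equals() compares values, index, and dtype
same_result = via_dropna.equals(via_mask)
print(same_result)
```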

With DataFrame objects, you can drop rows or columns based on how many NA values they contain.

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                   [NA, NA, NA], [NA, 6.5, 3.]])

By default, `dropna` drops every row that contains at least one NA value (`how='any'`):

In [9]:
cleaned = data.dropna()

In [10]:
data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


In [11]:
cleaned

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |


Passing `how='all'` drops only the rows in which every value is NA:

In [12]:
data.dropna(how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


To drop columns in the same way, pass `axis=1`:

In [13]:
data[4] = NA

In [14]:
data

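|   | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |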


In [16]:
data.dropna(axis=1, how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
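To see how `axis` and `how` interact, the following sketch rebuilds the same frame (including the all-NA column `4`) and records the shape of each result. The names `frame` and `shapes` are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])
frame[4] = np.nan  # add a column that is entirely NA

shapes = {
    # every row has an NA in column 4, so no rows survive
    "rows_any": frame.dropna().shape,
    # only row 2 is entirely NA
    "rows_all": frame.dropna(how="all").shape,
    # every column contains at least one NA, so no columns survive
    "cols_any": frame.dropna(axis=1).shape,
    # only column 4 is entirely NA
    "cols_all": frame.dropna(axis=1, how="all").shape,
}
print(shapes)
```

Note that the default `how='any'` can be very aggressive: a single all-NA column is enough to remove every row.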


Use the `thresh` argument to keep only rows that contain at least a given number of non-NA values:

In [17]:
df = pd.DataFrame(np.random.randn(7, 3))

In [18]:
df.iloc[:4, 1] = NA

In [19]:
df.iloc[:2, 2] = NA

In [20]:
df

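|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.135889 | NaN | NaN |
| 1 | 1.621497 | NaN | NaN |
| 2 | -1.03463 | NaN | 0.388766 |
| 3 | 1.066616 | NaN | -1.146681 |
| 4 | -0.715149 | -1.541511 | 1.531231 |
| 5 | -0.452483 | -1.696787 | 1.457432 |
| 6 | 0.956908 | -0.193155 | -0.353949 |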


In [21]:
df.dropna(thresh=2)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 2 | -1.03463 | NaN | 0.388766 |
| 3 | 1.066616 | NaN | -1.146681 |
| 4 | -0.715149 | -1.541511 | 1.531231 |
| 5 | -0.452483 | -1.696787 | 1.457432 |
| 6 | 0.956908 | -0.193155 | -0.353949 |
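Because `thresh` counts non-NA values rather than NA values, it can help to spell the rule out by hand. The sketch below uses a small deterministic frame (`small`, a made-up name with made-up column labels) and shows that `dropna(thresh=2)` matches a manual mask built from the per-row count of non-NA entries.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                      "b": [np.nan, np.nan, 5.0],
                      "c": [np.nan, 6.0, 7.0]})

# Number of non-NA values in each row: 1, 2, 3
non_na_counts = small.notna().sum(axis=1)

# Keep rows with at least two non-NA values, two ways
by_thresh = small.dropna(thresh=2)
by_mask = small[non_na_counts >= 2]

print(by_thresh.index.tolist())
```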


### 1.2 Filling In Missing Data

In [22]:
df.fillna(0)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.135889 | 0.0 | 0.0 |
| 1 | 1.621497 | 0.0 | 0.0 |
| 2 | -1.03463 | 0.0 | 0.388766 |
| 3 | 1.066616 | 0.0 | -1.146681 |
| 4 | -0.715149 | -1.541511 | 1.531231 |
| 5 | -0.452483 | -1.696787 | 1.457432 |
| 6 | 0.956908 | -0.193155 | -0.353949 |


To use a different fill value for each column, pass a dict that maps column labels to fill values:

In [23]:
df.fillna({1:0.4, 2:0})

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.135889 | 0.4 | 0.0 |
| 1 | 1.621497 | 0.4 | 0.0 |
| 2 | -1.03463 | 0.4 | 0.388766 |
| 3 | 1.066616 | 0.4 | -1.146681 |
| 4 | -0.715149 | -1.541511 | 1.531231 |
| 5 | -0.452483 | -1.696787 | 1.457432 |
| 6 | 0.956908 | -0.193155 | -0.353949 |
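One detail worth illustrating is that columns missing from the dict are left untouched. The following sketch uses a hypothetical three-column frame (`partial`, with invented labels `x`, `y`, `z`) and fills only two of its columns.

```python
import numpy as np
import pandas as pd

partial = pd.DataFrame({"x": [1.0, np.nan],
                        "y": [np.nan, 2.0],
                        "z": [np.nan, 3.0]})

# Only "x" and "y" get fill values; "z" is not in the dict
filled = partial.fillna({"x": 0.0, "y": -1.0})

print(filled["z"].isna().tolist())
```

Column `z` still contains its original NA, which makes the dict form convenient when only some columns have a sensible default.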


`fillna` returns a new object by default, but you can modify the existing object in place by passing `inplace=True`:

In [24]:
df.fillna(0, inplace=True)

In [25]:
df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.135889 | 0.0 | 0.0 |
| 1 | 1.621497 | 0.0 | 0.0 |
| 2 | -1.03463 | 0.0 | 0.388766 |
| 3 | 1.066616 | 0.0 | -1.146681 |
| 4 | -0.715149 | -1.541511 | 1.531231 |
| 5 | -0.452483 | -1.696787 | 1.457432 |
| 6 | 0.956908 | -0.193155 | -0.353949 |
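The difference between the default behavior and `inplace=True` can be made visible by counting the remaining NA values before and after. This is a rough sketch on a one-column frame named `scratch`, a name invented for this example.

```python
import numpy as np
import pandas as pd

scratch = pd.DataFrame({"a": [1.0, np.nan]})

# Default call: returns a filled copy, the original keeps its NA
filled_copy = scratch.fillna(0)
remaining_before = int(scratch["a"].isna().sum())

# In-place call: the original object itself is modified
scratch.fillna(0, inplace=True)
remaining_after = int(scratch["a"].isna().sum())

print(remaining_before, remaining_after)
```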


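The same interpolation methods available for reindexing, such as forward fill (`'ffill'`), can be used with `fillna` as well. The optional `limit` argument caps how many consecutive NA values are filled.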

In [26]:
df = pd.DataFrame(np.random.randn(6, 3))

In [28]:
df.iloc[2:, 1] = NA

In [33]:
df.iloc[4:, 2] = NA

In [37]:
df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.420402 | 1.313372 | 0.234094 |
| 1 | -0.24041 | 1.13984 | -0.367249 |
| 2 | -0.585492 | NaN | -0.012703 |
| 3 | 0.544509 | NaN | 1.452135 |
| 4 | -0.345364 | NaN | NaN |
| 5 | 0.378364 | NaN | NaN |


In [38]:
df.fillna(method='ffill')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.420402 | 1.313372 | 0.234094 |
| 1 | -0.24041 | 1.13984 | -0.367249 |
| 2 | -0.585492 | 1.13984 | -0.012703 |
| 3 | 0.544509 | 1.13984 | 1.452135 |
| 4 | -0.345364 | 1.13984 | 1.452135 |
| 5 | 0.378364 | 1.13984 | 1.452135 |


In [39]:
df.fillna(method='ffill', limit=2)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.420402 | 1.313372 | 0.234094 |
| 1 | -0.24041 | 1.13984 | -0.367249 |
| 2 | -0.585492 | 1.13984 | -0.012703 |
| 3 | 0.544509 | 1.13984 | 1.452135 |
| 4 | -0.345364 | NaN | 1.452135 |
| 5 | 0.378364 | NaN | 1.452135 |
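Recent pandas releases (2.1 and later) deprecate the `method` argument of `fillna` in favor of the dedicated `ffill` and `bfill` methods. The sketch below shows the same forward-fill behavior, with and without `limit`, written with `ffill` on a small hand-made Series; the name `gappy` is an assumption of this example.

```python
import numpy as np
import pandas as pd

gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Propagate the last valid value forward without restriction
filled_all = gappy.ffill()

# Fill at most two consecutive NA values after each valid value
filled_limited = gappy.ffill(limit=2)

print(filled_all.tolist())
print(filled_limited.isna().tolist())
```

With `limit=2`, the third NA in the run after `1.0` is left unfilled, mirroring the `limit=2` output above.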


## 2 Data Transformation

### 2.1 Removing Duplicates

In [40]:
data = pd.DataFrame({'k1':['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

In [41]:
data

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |
| 6 | two | 4 |


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate (that is, whether it has already appeared in an earlier row):

In [42]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

`drop_duplicates` returns a DataFrame containing only the rows where the `duplicated` array is `False`:

In [43]:
data.drop_duplicates()

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |
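The relationship between the two methods can be verified with a boolean mask. This sketch rebuilds the same frame under the illustrative name `pairs` and compares `drop_duplicates()` against filtering with the negated `duplicated()` mask.

```python
import pandas as pd

pairs = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                      "k2": [1, 1, 2, 3, 3, 4, 4]})

# Keep the rows that are NOT flagged as duplicates
via_mask = pairs[~pairs.duplicated()]
via_method = pairs.drop_duplicates()

print(via_mask.equals(via_method))
```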


Both methods consider all of the columns by default. Alternatively, you can specify any subset of columns to use when detecting duplicates:

In [44]:
data['v1'] = range(7)

In [45]:
data.drop_duplicates(['k1'])

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |


Both methods keep the first observed value combination by default. Passing `keep='last'` keeps the last one instead:

In [46]:
data.drop_duplicates(['k1', 'k2'], keep='last')

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |
| 2 | one | 2 | 2 |
| 3 | two | 3 | 3 |
| 4 | one | 3 | 4 |
| 6 | two | 4 | 6 |
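Comparing the surviving index labels makes the effect of `keep` easier to see. The sketch below rebuilds the frame as `records` (an illustrative name) and, besides `'first'` and `'last'`, also tries `keep=False`, an option not used above that drops every member of a duplicated group.

```python
import pandas as pd

records = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                        "k2": [1, 1, 2, 3, 3, 4, 4],
                        "v1": range(7)})
keys = ["k1", "k2"]

# Rows 5 and 6 share the combination ("two", 4)
kept_first = records.drop_duplicates(keys).index.tolist()
kept_last = records.drop_duplicates(keys, keep="last").index.tolist()
kept_none = records.drop_duplicates(keys, keep=False).index.tolist()

print(kept_first, kept_last, kept_none)
```

Which row to keep matters here because `v1` differs between rows 5 and 6 even though the key columns match.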


### 2.2 Transforming Data Using a Function or Mapping