In [1]:
import pandas as pd
import numpy as np

**Loading, cleaning, transforming and rearranging** data are often reported to take up 80% or more of an analyst's time.

Sometimes the way data is stored in files or databases is not the right format for a particular task. Many researchers do ad hoc processing with a general-purpose programming language such as Python, Perl, R or Java to convert data from one form to another.

# 1. Handling Missing Data

For numeric data, pandas uses the floating-point value **NaN** (Not a Number) as a *sentinel value* to represent missing data.
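A minimal sketch of why NaN needs dedicated detection functions. The variable names are illustrative only; the point is that NaN is an ordinary float that never compares equal to itself, so `==` cannot be used to find missing values.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float, which is why it fits in a float64 array.
nan_is_float = isinstance(np.nan, float)

# NaN never compares equal to anything, including itself,
# so an equality test cannot locate missing values.
nan_equals_itself = (np.nan == np.nan)

# pd.isna (alias: pd.isnull) is the reliable test; it also treats None as missing.
flags = pd.isna(pd.Series([1.0, np.nan, 3.0])).tolist()
none_is_missing = pd.isna(None)

print(nan_is_float, nan_equals_itself, none_is_missing)
print(flags)
```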

In [3]:
string_data = pd.Series(['a','bb','ccc',np.nan, 'dddd'])
string_data

0       a
1      bb
2     ccc
3     NaN
4    dddd
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2    False
3     True
4    False
dtype: bool

In [5]:
string_data[0] = None # the built-in Python None is also treated as missing
string_data.isnull()

0     True
1    False
2    False
3     True
4    False
dtype: bool

* **dropna**: Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.  
* **fillna**: Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.  
* **isnull**: Return boolean values indicating which values are missing/NA.  
* **notnull**: Negation of isnull.  
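The four methods listed above can be tried side by side on one small Series. This is only an illustrative sketch; the Series contents and the `'missing'` placeholder are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'ccc', None])

# Booleans sum as 0/1, so summing the mask counts the missing entries.
n_missing = int(s.isnull().sum())
n_present = int(s.notnull().sum())

# dropna removes the missing entries; fillna replaces them.
kept = s.dropna().tolist()
filled = s.fillna('missing').tolist()

print(n_missing, n_present)
print(kept)
print(filled)
```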

In [6]:
string_data.fillna(0)

0       0
1      bb
2     ccc
3       0
4    dddd
dtype: object

In [11]:
string_data.ffill() # forward fill: each missing value takes the previous valid value; index 0 has no predecessor, so it stays missing

0    None
1      bb
2     ccc
3     ccc
4    dddd
dtype: object

In [12]:
string_data.bfill() # backward fill: each missing value takes the next valid value

0      bb
1      bb
2     ccc
3    dddd
4    dddd
dtype: object

In [16]:
string_data.dropna()

1      bb
2     ccc
4    dddd
dtype: object

## Filtering Out Missing Data

In [18]:
from numpy import nan as NA

In [19]:
data = pd.Series([1, NA, 3, 4, NA, 5])

In [20]:
data.dropna()

0    1.0
2    3.0
3    4.0
5    5.0
dtype: float64

In [21]:
data[data.isnull()]

1   NaN
4   NaN
dtype: float64

In [22]:
data[data.notnull()] # Equivalent to dropna().

0    1.0
2    3.0
3    4.0
5    5.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. 

In [24]:
data = pd.DataFrame([[1., 2., 3.], 
                     [4., NA, NA],
                     [NA, NA, NA], 
                     [NA, 11., 12.]])
data

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,
2,,,
3,,11.0,12.0


> *dropna() by default drops any row containing a missing value*

In [25]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,2.0,3.0


> *passing **how='all'** will only drop rows that are all NaN.*

In [26]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,
3,,11.0,12.0


In [29]:
data[4] = NA  # integer label, consistent with the existing columns 0, 1, 2

In [30]:
data

Unnamed: 0,0,1,2,4
0,1.0,2.0,3.0,
1,4.0,,,
2,,,,
3,,11.0,12.0,


> To drop columns in the same way, pass **axis=1**.

In [33]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,
2,,,
3,,11.0,12.0


To keep only rows containing at least a certain number of non-missing observations, pass the **thresh** argument.

In [35]:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.270694,,
1,-0.309529,,
2,0.920824,,-0.568802
3,1.084386,,2.343295
4,1.253037,-0.08833,1.838749
5,-0.210464,-0.982843,0.814549
6,-0.642116,-0.066487,1.310579


In [36]:
df.dropna() # drop every row that contains a NaN

Unnamed: 0,0,1,2
4,1.253037,-0.08833,1.838749
5,-0.210464,-0.982843,0.814549
6,-0.642116,-0.066487,1.310579


In [50]:
df[3] = NA  # new all-NaN column; assigning NaN already gives dtype float64
df

Unnamed: 0,0,1,2,3
0,-0.270694,,,
1,-0.309529,,,
2,0.920824,,-0.568802,
3,1.084386,,2.343295,
4,1.253037,-0.08833,1.838749,
5,-0.210464,-0.982843,0.814549,
6,-0.642116,-0.066487,1.310579,


In [51]:
df.dropna(thresh=3) # keep only the rows with at least 3 non-NA values

Unnamed: 0,0,1,2,3
4,1.253037,-0.08833,1.838749,
5,-0.210464,-0.982843,0.814549,
6,-0.642116,-0.066487,1.310579,


## Filling In Missing Data

In [52]:
df

Unnamed: 0,0,1,2,3
0,-0.270694,,,
1,-0.309529,,,
2,0.920824,,-0.568802,
3,1.084386,,2.343295,
4,1.253037,-0.08833,1.838749,
5,-0.210464,-0.982843,0.814549,
6,-0.642116,-0.066487,1.310579,


In [53]:
df.fillna(0)

Unnamed: 0,0,1,2,3
0,-0.270694,0.0,0.0,0.0
1,-0.309529,0.0,0.0,0.0
2,0.920824,0.0,-0.568802,0.0
3,1.084386,0.0,2.343295,0.0
4,1.253037,-0.08833,1.838749,0.0
5,-0.210464,-0.982843,0.814549,0.0
6,-0.642116,-0.066487,1.310579,0.0


> By passing a dictionary, you can use a different fill value for each column.

In [59]:
df.fillna({1: 0.9999, 2: 0.8888, 3:0.7777 })

Unnamed: 0,0,1,2,3
0,-0.270694,0.9999,0.8888,0.7777
1,-0.309529,0.9999,0.8888,0.7777
2,0.920824,0.9999,-0.568802,0.7777
3,1.084386,0.9999,2.343295,0.7777
4,1.253037,-0.08833,1.838749,0.7777
5,-0.210464,-0.982843,0.814549,0.7777
6,-0.642116,-0.066487,1.310579,0.7777
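The per-column mapping does not have to be a literal dictionary. As a sketch, a Series indexed by column label (here the column means, computed with `DataFrame.mean()`) should work the same way; the frame below is invented for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 10.0, 20.0],
})

# frame.mean() skips NaN and returns one value per column,
# so each column's gaps are filled with that column's own mean.
column_means = frame.mean()
filled = frame.fillna(column_means)

print(column_means.to_dict())
print(filled)
```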


> fillna returns a new object by default; passing inplace=True modifies the existing object instead.

In [68]:
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2,3
0,-0.270694,0.0,0.0,0.0
1,-0.309529,0.0,0.0,0.0
2,0.920824,0.0,-0.568802,0.0
3,1.084386,0.0,2.343295,0.0
4,1.253037,-0.08833,1.838749,0.0
5,-0.210464,-0.982843,0.814549,0.0
6,-0.642116,-0.066487,1.310579,0.0
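The difference between the two modes can be checked directly. This is a small sketch with a made-up one-column frame: the default call leaves the original untouched, while `inplace=True` changes it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan]})

# Default behaviour: a new object is returned, the original keeps its NaN.
result = frame.fillna(0)
original_still_missing = bool(frame['x'].isnull().any())

# With inplace=True the original frame itself is modified.
frame.fillna(0, inplace=True)
original_after = frame['x'].tolist()

print(result['x'].tolist(), original_still_missing, original_after)
```

Reassigning the result (`frame = frame.fillna(0)`) has the same effect as `inplace=True` and tends to be easier to follow in longer pipelines.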


In [69]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,0.643743,0.226103,1.36282
1,-1.470429,1.508246,0.805031
2,0.89445,,0.895556
3,-0.527436,,0.777485
4,-0.529495,,
5,-0.873979,,


In [74]:
df.ffill() # forward fill down each column; equivalent to the older fillna(method='ffill'), which recent pandas versions deprecate

Unnamed: 0,0,1,2
0,0.643743,0.226103,1.36282
1,-1.470429,1.508246,0.805031
2,0.89445,1.508246,0.895556
3,-0.527436,1.508246,0.777485
4,-0.529495,1.508246,0.777485
5,-0.873979,1.508246,0.777485


* **value**: Scalar value or dict-like object to use to fill missing values
* **method**: Interpolation; by default 'ffill' if function called with no other arguments
* **axis**: Axis to fill on; default axis=0
* **inplace**: Modify the calling object without producing a copy
* **limit**: For forward and backward filling, maximum number of consecutive periods to fill
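The `limit` and `axis` arguments are not exercised in the cells above, so here is a hedged sketch using `DataFrame.ffill`, which accepts both. The frame is invented for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, 6.0, np.nan, np.nan],
})

# limit=1: fill at most one consecutive missing value after each valid one.
limited = frame.ffill(limit=1)
missing_after_limit = limited.isnull().sum().to_dict()

# axis=1: fill along each row (left to right) instead of down each column.
across = frame.ffill(axis=1)
b_after_across = across['b'].tolist()

print(limited)
print(across)
```

With `limit=1`, the second of two consecutive gaps stays missing; with `axis=1`, a gap in `b` is filled from `a` in the same row when `a` has a value.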

# 2. Data Transformation