# Data Cleaning and Preparation

## Handling Missing Data
### Filtering data
- You can filter out missing data with `dropna`.

In [3]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from numpy import nan as NA

In [5]:
data = pd.Series([1,2,NA,3])
data.dropna()

0    1.0
1    2.0
3    3.0
dtype: float64

For a `DataFrame`, `dropna` by default drops every row that contains at least one NA value.
Pass `how='all'`, as in `data.dropna(how='all')`, to drop only the rows that consist entirely of NA values.

To drop columns instead of rows, pass `axis=1`.
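
The row and column behaviour of `dropna` can be sketched on a small frame. The frame `df` and its column names are made up for illustration; only `dropna`, `how` and `axis` come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: row 1 is entirely NA, column "c" is entirely NA.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

# Default (how='any'): drop every row that contains at least one NA.
# Here every row has an NA in column "c", so nothing is left.
any_na_rows = df.dropna()

# how='all': drop only rows in which every value is NA (row 1).
all_na_rows = df.dropna(how="all")

# axis=1 applies the same rule to columns: only column "c" is all NA.
all_na_cols = df.dropna(axis=1, how="all")

print(all_na_rows)
print(all_na_cols)
```

Note that the default `how='any'` can be quite aggressive: a single sparse column is enough to empty the frame.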

To keep only the rows that contain at least `N` non-NA values, pass a threshold with `thresh=N`:

```
df.dropna(thresh=2)
```
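
Since `thresh` counts non-NA values rather than NA values, a small example may help. This is a minimal sketch on a made-up frame; the data and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 3 non-NA values, row 1 has 1, row 2 has 0.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
    "c": [7.0, np.nan, np.nan],
})

# Keep rows that have at least 2 non-NA values; rows 1 and 2 are dropped.
kept = df.dropna(thresh=2)
print(kept)
```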


### Filling in missing data

`fillna` is the function to reach for in most cases.
You can pass it a scalar value, or a dict that maps each column to its own fill value.
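
As a rough sketch, filling a `DataFrame` with a single scalar and with a per-column dict might look like this. The frame and the fill values are arbitrary, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NA in each column.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# A scalar replaces every NA with the same value.
filled_scalar = df.fillna(0)

# A dict keyed by column name gives each column its own fill value.
filled_by_col = df.fillna({"a": -1, "b": 99})

print(filled_scalar)
print(filled_by_col)
```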

The scalar value can also be a value calculated on the entire set. For example:

In [8]:
s = Series([1,2,3,NA,5,6,NA,8,9,NA])
s.fillna(s.mean())

0    1.000000
1    2.000000
2    3.000000
3    4.857143
4    5.000000
5    6.000000
6    4.857143
7    8.000000
8    9.000000
9    4.857143
dtype: float64

In [10]:
s.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas versions

0    1.0
1    2.0
2    3.0
3    3.0
4    5.0
5    6.0
6    6.0
7    8.0
8    9.0
9    9.0
dtype: float64
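
Forward filling repeats the last valid value for as long as the gap lasts. When that is too much, the `limit` parameter caps the number of consecutive fills, and `bfill` fills backwards from the next valid value instead. A minimal sketch on a made-up series with a two-element gap:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive missing values.
gap = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill at most one consecutive NA; the second one stays missing.
capped = gap.ffill(limit=1)

# Backward fill: each NA takes the next valid value (4.0).
back = gap.bfill()

print(capped)
print(back)
```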