# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a *significant amount of time* is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up *80% or more of an analyst’s time*.



## A. Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.
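As a minimal sketch of that default (the Series `values` is a made-up example, not data from this notebook), the reductions below skip NaN unless `skipna=False` is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with two missing entries.
values = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# By default (skipna=True) the NaN entries are ignored.
mean_default = values.mean()      # average of 1.0, 3.5 and 7.0
count_non_null = values.count()   # number of non-missing entries

# With skipna=False, a single NaN makes the result NaN.
mean_strict = values.mean(skipna=False)

print(mean_default, count_non_null, mean_strict)
```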

The way that missing data is represented in pandas objects is somewhat imperfect,
but it is functional for a lot of users. For numeric data, pandas uses the floating-point
value NaN (Not a Number) to represent missing data.

pandas treats NaN as a sentinel value that is easy to detect with `isnull`:

In [3]:
import pandas as pd
import numpy as np

In [6]:
string_data = pd.Series(['Kolkata', 'Delhi', np.nan, 'Bangalore'])
string_data

0      Kolkata
1        Delhi
2          NaN
3    Bangalore
dtype: object

In [7]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [8]:
# The built-in Python None value is also treated as NA in object arrays:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

### NA handling methods
- `dropna` Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
- `fillna` Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
- `isnull` Return boolean values indicating which values are missing/NA.
- `notnull` Negation of `isnull`: returns True for values that are present.
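The four methods can be tried side by side on a small Series. This is only a sketch: the Series `s` is invented for illustration, and `Series.ffill()` is used here as the method form of forward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

missing_mask = s.isnull()     # True where a value is missing
present_mask = s.notnull()    # element-wise negation of isnull
dropped = s.dropna()          # keeps only index labels 0, 2 and 4
filled_zero = s.fillna(0)     # replace each NaN with a constant
filled_forward = s.ffill()    # carry the last valid value forward

print(filled_forward)
```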

### Filtering Out Missing Data
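While you always have the option to filter by hand using `pandas.isnull` and boolean indexing, the `dropna` method is usually more convenient. On a Series, it returns a new Series containing only the non-null values and their index labels.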

In [9]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [10]:
# Dropping the missing values
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [11]:
# This is equivalent to:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop only the rows
or columns that are entirely NA, or every row or column that contains at least one NA.
By default, `dropna` drops **any row containing a missing value**:

In [13]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
cleaned

     0    1    2
0  1.0  6.5  3.0
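For comparison, the same row filtering can be written by hand with `notnull` and boolean indexing. This is a sketch, not the way pandas implements `dropna`; the names `frame`, `complete_rows`, and `not_all_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame above, rebuilt so the block is self-contained.
frame = pd.DataFrame([[1., 6.5, 3.],
                      [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.]])

# Keep rows where every column is present: matches the default dropna().
complete_rows = frame[frame.notnull().all(axis=1)]

# Keep rows with at least one present value: matches dropna(how='all').
not_all_missing = frame[frame.notnull().any(axis=1)]

print(complete_rows)
print(not_all_missing)
```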


In [14]:
# Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


In [15]:
# Add a column that is entirely NA, to demonstrate dropping columns (axis=1):
data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [17]:
# Drop the columns that are all NA by passing axis=1
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
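The "varying thresholds" mentioned in the method list correspond to the `thresh` argument of `dropna`, which keeps only the rows that have at least that many non-NA values. A minimal sketch on the same four-row layout (the name `df` is illustrative):

```python
import numpy as np
import pandas as pd

# Same values as the original three-column DataFrame above.
df = pd.DataFrame([[1., 6.5, 3.],
                   [1., np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.]])

# Keep rows with at least two non-NA values.
# Non-NA counts per row are 3, 1, 0 and 2, so rows 0 and 3 survive.
at_least_two = df.dropna(thresh=2)

print(at_least_two)
```

Choosing `thresh` is a judgment call: a low value keeps sparse rows that may need filling later, while a high value discards more data up front.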


### Filling In Missing Data