<h1><center> Data Cleaning </center></h1>

<div class="alert alert-success">
'Data Cleaning' is the process of finding and either removing or fixing 'bad data', where 'bad data' typically refers to corrupt and/or inaccurate data points. 
</div>

In [56]:
# Imports
import numpy as np
import pandas as pd

## Missing Values
<br>
<div class="alert alert-success">
Missing values are data points that were never recorded or have been lost. They may be literally empty, encoded as a special value (Python's None type, or numpy's NaN, 'not a number'), or indicated by an arbitrarily chosen placeholder value. Missing values usually need to be dealt with before any analysis.
</div>

### Python - None Type

In [57]:
# Python has the special value 'None', which can encode a missing, or null value
dat_none = None

In [58]:
# None is actually its own type
print(type(None))

<class 'NoneType'>


In [59]:
# Note that 'None' is 'falsy': it evaluates to False in a boolean context, so this assertion fails
assert dat_none

AssertionError: 

In [60]:
# None does not support arithmetic, so basic operations will fail when None is in the data
dat_lst = [1, 2, 3, None]
sum(dat_lst) / len(dat_lst)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
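
One way around this error is to filter out the None entries before doing any arithmetic. The snippet below is a minimal sketch that reuses the example list from the cell above; the names `valid` and `avg` are introduced here for illustration only.

```python
# Same example data as the failing cell above
dat_lst = [1, 2, 3, None]

# Keep only the entries that are actually present (not None)
valid = [val for val in dat_lst if val is not None]

# Average over the remaining values only
avg = sum(valid) / len(valid)
print(avg)
```

Note that this silently changes the denominator from 4 to 3, which is itself a decision about how to treat the missing point.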

### Numpy - NaN

In [62]:
# Numpy also has a special value for 'not a number' - NaN
dat_nan = np.nan

In [63]:
# It's actually a special float value
type(dat_nan)

float

In [64]:
# NaN is 'truthy': unlike None, it evaluates to True in a boolean context, so this assertion passes
assert dat_nan

In [65]:
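# Older numpy versions have several aliases for NaN, which all refer to the same object.
# Note: the np.NaN alias was removed in numpy 2.0, where this line raises an AttributeError; np.nan is the canonical spelling.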
np.nan is np.NaN is np.NAN

True

In [69]:
# NaN values won't fail (unlike None) but they will return undefined (NaN) answers
dat_a = np.array([1, 2, 3, np.nan])
print(np.mean(dat_a))

nan


In [71]:
# Numpy can ignore NaN values in calculations, but only if you explicitly ask it to (here with nanmean)
np.nanmean(np.array([1, 2, 3, np.nan]))

2.0
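
Before ignoring NaN values it is often useful to know where they are and how many there are. Because NaN is not equal to anything, including itself, an equality check cannot find it; `np.isnan` can. The following is a minimal sketch, with `nan_mask`, `n_missing` and `dat_clean` as illustrative names.

```python
import numpy as np

dat_a = np.array([1, 2, 3, np.nan])

# NaN never compares equal, even to itself, so '==' cannot detect it
print(np.nan == np.nan)

# np.isnan returns a boolean mask that is True wherever the value is NaN
nan_mask = np.isnan(dat_a)

# Count the missing entries
n_missing = int(nan_mask.sum())

# Boolean indexing with the inverted mask keeps only the non-NaN values
dat_clean = dat_a[~nan_mask]
print(n_missing, dat_clean)
```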

Dealing with missing data is a decision point: what do you do?
- Do you drop that data point? 
- Do you keep it, but ignore it in any calculations?
- Do you recode that data point?
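
The three options above could look roughly like this in pandas. This is a sketch on made-up data: the `age` column and its values are hypothetical, and filling with the column mean is only one of many possible recoding choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
df = pd.DataFrame({'age': [25.0, 32.0, np.nan, 41.0]})

# Option 1: drop the rows where 'age' is missing
df_dropped = df.dropna(subset=['age'])

# Option 2: keep the row, but ignore it in calculations
# (pandas reductions such as mean() skip NaN by default)
mean_age = df['age'].mean()

# Option 3: recode the missing value, here with the column mean
df_filled = df.assign(age=df['age'].fillna(mean_age))

print(len(df_dropped), mean_age)
print(df_filled)
```

Which option is appropriate depends on the analysis: dropping loses data, ignoring changes the sample size per calculation, and recoding introduces values that were never observed.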

## Impossible Values

Data cleaning includes checking for and dealing with impossible values. Impossible values can occur due to encoding or data entry errors. 

Be aware that datasets may also encode missing data as a special value, for example using '-999' for a missing age. 

Such values have to be dealt with, or they will skew your results: a single -999 in an age column drags the mean far below any plausible value.
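
As a rough illustration, sentinel codes and out-of-range values can be recoded as NaN so that they are treated like any other missing value. The data, the `age` column, and the plausibility bounds of 0 to 120 are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: -999 is a 'missing' code, 215 is a data entry error
df = pd.DataFrame({'age': [34, -999, 27, 215, 51]})

# Work with floats so the column can hold NaN
df['age'] = df['age'].astype(float)

# Recode the sentinel value as a proper missing value
df['age'] = df['age'].replace(-999, np.nan)

# Flag values outside an assumed plausible range and recode them too
impossible = (df['age'] < 0) | (df['age'] > 120)
df.loc[impossible, 'age'] = np.nan

# The mean now reflects only the plausible values
print(df['age'].mean())
```

Recoding to NaN, rather than deleting rows outright, keeps the decision about what to do with the missing points separate from the step that identifies them.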

## Data Cleaning in Pandas

<div class="alert alert-info">
Bad Data Guide (from Quartz): https://github.com/Quartz/bad-data-guide
</div>

<div class="alert alert-info">
Cleaning up messy data (from the pandas cookbook): http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb
</div>
