# Data Cleaning and Preparation

**Data prep** (loading, cleaning, transforming and rearranging) is often reported to take up 80% or more of an analyst's time.  
  
Chapter 7 covers the following topics:
- Tools for missing data
- Duplicate data
- String manipulation
- Analytical data transformation

## Handling missing data
For numeric data, pandas uses the floating-point value **NaN** (Not a Number) to represent missing data.  
  
**NA** (not available) data may either be data that does not exist or data that exists but was not observed (for example, through problems with data collection).
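
As a small illustration of the NaN sentinel, the sketch below shows that a `None` placed among numeric values is stored as the float NaN, which is why the resulting dtype is `float64`. The Series `numeric_data` is a made-up example, not part of the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: integers with one missing entry
numeric_data = pd.Series([1, None, 3])

# None is converted to the float NaN, so the whole Series becomes float64
print(numeric_data.dtype)

# NaN never compares equal to itself, so test with isnull() rather than ==
print(numeric_data.isnull().tolist())
print(np.nan == np.nan)
```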
  
When cleaning up data for analysis, it is often important to do **analysis on the missing data itself** to identify data collection problems or potential biases in the data caused by missing data.
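
One simple way to look at the missing data itself is to count how many values are missing in each column. The following is a minimal sketch; the `survey` DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps in two columns
survey = pd.DataFrame({
    'age': [34, np.nan, 29, np.nan],
    'city': ['Oslo', 'Lima', None, 'Pune'],
    'score': [7.5, 8.0, 6.5, 9.0],
})

# Number of missing entries per column
missing_counts = survey.isnull().sum()

# Fraction of rows that are missing in each column
missing_share = survey.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column with a much higher share of missing values than the others is a reasonable place to start looking for a collection problem.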

In [20]:
import pandas as pd
import numpy as np

In [21]:
string_data = pd.Series([None, 'arthichoke', np.nan, 'avocado'])
string_data

0          None
1    arthichoke
2           NaN
3       avocado
dtype: object

In [22]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Filter out missing data using ***notnull()*** and **boolean indexing**.  

*notnull()* is the negation of *isnull()*: it returns True where a value is present.

In [23]:
string_data[string_data.notnull()]

1    arthichoke
3       avocado
dtype: object
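
The same filtering can also be done with the Series method `dropna()`. The sketch below rebuilds `string_data` so that it runs on its own and checks that both approaches give the same result.

```python
import numpy as np
import pandas as pd

string_data = pd.Series([None, 'arthichoke', np.nan, 'avocado'])

# dropna() removes the missing entries and keeps the original index labels
dropped = string_data.dropna()

# Same result as boolean indexing with notnull()
filtered = string_data[string_data.notnull()]

print(dropped.equals(filtered))
```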

## Removing Duplicates
A row is considered a **duplicate** if an identical row has already appeared earlier in the DataFrame.

In [24]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 1, 1, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   1
3  two   1
4  one   3
5  two   4
6  two   4


The DataFrame method ***duplicated()*** returns a boolean Series indicating whether each row is a duplicate or not:

In [25]:
data.duplicated()

0    False
1    False
2     True
3     True
4    False
5    False
6     True
dtype: bool
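
To actually remove the flagged rows, pandas provides the related DataFrame method `drop_duplicates()`. The sketch below rebuilds the same `data` frame so that it runs on its own. The variable names `deduped` and `last_per_k1` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 1, 1, 3, 4, 4]})

# Keep the first occurrence of each distinct row (the default)
deduped = data.drop_duplicates()

# Judge duplicates on k1 only, and keep the last occurrence instead
last_per_k1 = data.drop_duplicates(subset=['k1'], keep='last')

print(deduped)
print(last_per_k1)
```

By default both `duplicated()` and `drop_duplicates()` consider all columns and keep the first observed combination. The `subset` and `keep` arguments change that behavior.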

## Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.  

In [26]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Start with a mapping from each distinct meat type to the kind of animal:

In [27]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

First, convert each value to lowercase, since some of the meats in the data are capitalized while the mapping keys are all lowercase:

In [28]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

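Then **map** the lowercased Series through the dict. The Series ***map()*** method accepts a function or a dict-like object containing a mapping: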

In [29]:
data['animal'] = lowercased.map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
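
Because `map()` also accepts a function, the lowercasing and the lookup can be combined into one step. The following is a minimal sketch on a shortened version of the data. The `'tofu'` row is a hypothetical addition, included only to show what happens when a food has no entry in the mapping.

```python
import pandas as pd

# Shortened example data; 'tofu' is a made-up row with no mapping entry
data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'nova lox', 'tofu'],
                     'ounces': [4, 6, 6, 5]})
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# The function lowercases each food and looks it up in the dict;
# dict.get returns None for keys that are absent, such as 'tofu'
data['animal'] = data['food'].map(lambda x: meat_to_animal.get(x.lower()))

print(data)
```

Foods without a match end up as missing values in the new column, so a quick `isnull()` check on `data['animal']` is one way to spot gaps in the mapping.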
