# Data Cleaning and Preparation

A significant amount of time in data analysis is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up `80%` or more of an analyst’s time.

In this lesson I discuss tools for **missing data**, **duplicate data**, **string manipulation**,
and some other analytical data transformations. 


In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Handling Missing Data

- All of the descriptive statistics on pandas objects exclude missing data by default.
- For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

![](assets/na-methods.png)

In [25]:
string_data = pd.Series([None, 'aardvark', 'artichoke', np.nan, 'avocado'])

In [1]:
# count the missing values with the isnull() method
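One possible way to complete this cell, as a minimal sketch: `isnull` returns a boolean mask, and summing that mask counts the missing entries. The names `missing_mask` and `n_missing` are my own and not part of the lesson.

```python
import numpy as np
import pandas as pd

string_data = pd.Series([None, 'aardvark', 'artichoke', np.nan, 'avocado'])

# isnull() flags both None and np.nan as missing
missing_mask = string_data.isnull()

# True counts as 1, so summing the mask gives the number of missing values
n_missing = missing_mask.sum()
print(n_missing)
```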


### Filtering Out Missing Data

In [2]:
# filter out the missing data (first approach)


In [3]:
# filter out the missing data (second approach)
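The two approaches asked for above could look like the following sketch. I am assuming the first approach means `dropna` and the second means boolean indexing with `notnull`; the lesson does not say which is which, and the names `cleaned_a` and `cleaned_b` are illustrative.

```python
import numpy as np
import pandas as pd

string_data = pd.Series([None, 'aardvark', 'artichoke', np.nan, 'avocado'])

# Approach 1 (assumed): dropna() returns only the non-missing values
cleaned_a = string_data.dropna()

# Approach 2 (assumed): select with a boolean mask of the non-missing entries
cleaned_b = string_data[string_data.notnull()]

print(cleaned_a)
```

Both versions keep the original index labels of the surviving rows, so the result is not renumbered.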


With DataFrame objects, things are a bit more complex. You may want to drop **rows**
or **columns** that are **all** `NA` or only those containing **any** `NAs`.

In [33]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [4]:
# Drop any row containing a missing value


In [5]:
# Drop any row with all values missing
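A possible solution for the two cells above, as a sketch: by default `dropna` drops every row that contains at least one missing value, while `how='all'` drops only rows in which every value is missing. The result names are my own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Default how='any': a single missing value is enough to drop the row
rows_without_any_na = data.dropna()

# how='all': only rows that are entirely missing are dropped
rows_not_all_na = data.dropna(how='all')

print(rows_without_any_na)
print(rows_not_all_na)
```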


In [36]:
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [6]:
# drop the columns that have all values missing 
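To drop columns instead of rows, the same call can be pointed at the column axis. This is a sketch of one way to do it; `trimmed` is an illustrative name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan  # a column that is entirely missing

# axis=1 works on columns; how='all' drops only columns with no valid value
trimmed = data.dropna(axis=1, how='all')
print(trimmed)
```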


In [38]:
df = pd.DataFrame(np.random.randint(0, 10, (7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df.iloc[4,:] = np.nan
df


Unnamed: 0,0,1,2
0,9.0,,
1,5.0,,
2,9.0,,4.0
3,1.0,,0.0
4,,,
5,5.0,1.0,2.0
6,3.0,8.0,6.0


In [7]:
# drop all rows that have any missing values


In [9]:
# drop all rows that have 2 or more missing values


In [10]:
# drop all rows that have a missing value at column 2
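The three cells above could be solved along these lines. Because the lesson builds `df` from random integers, this sketch types in the displayed values by hand so that the result is reproducible. With three columns, "2 or more missing values" is the same as "fewer than 2 valid values", which is what `thresh=2` expresses. The result names are my own.

```python
import numpy as np
import pandas as pd

# Same values as the frame displayed above, entered manually
df = pd.DataFrame([[9, np.nan, np.nan],
                   [5, np.nan, np.nan],
                   [9, np.nan, 4],
                   [1, np.nan, 0],
                   [np.nan, np.nan, np.nan],
                   [5, 1, 2],
                   [3, 8, 6]], dtype=float)

# Drop rows that contain any missing value
no_missing = df.dropna()

# Keep rows with at least 2 valid values, i.e. drop rows with 2 or more NaN
at_most_one_missing = df.dropna(thresh=2)

# Drop rows only when the value in column 2 is missing
col2_present = df.dropna(subset=[2])

print(no_missing)
print(at_most_one_missing)
print(col2_present)
```

For this particular frame the last two calls happen to keep the same rows, although they test different conditions.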


### Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways.

For most purposes, the `fillna` method is the workhorse function to use.

In [11]:
# replace all missing data with 0


Calling `fillna` with a **dict**, you can use a different fill value for each column:


In [12]:
# replace missing values of column 1 by 0.5 and of column 2 by 0
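A sketch of the two `fillna` calls asked for above, using hand-entered values that mirror the frame shown earlier (the lesson's own `df` is random). The names `filled_zero` and `filled_per_column` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[9, np.nan, np.nan],
                   [5, np.nan, np.nan],
                   [9, np.nan, 4],
                   [1, np.nan, 0],
                   [np.nan, np.nan, np.nan],
                   [5, 1, 2],
                   [3, 8, 6]], dtype=float)

# A scalar fills every missing value with the same constant
filled_zero = df.fillna(0)

# A dict maps column label -> fill value; columns not listed are left untouched
filled_per_column = df.fillna({1: 0.5, 2: 0})

print(filled_zero)
print(filled_per_column)
```

Note that column 0 is not in the dict, so its missing value in row 4 stays `NaN`.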


`fillna` returns a **new object** by default, but you can modify the existing object in place by passing `inplace=True`.

In [13]:
# fill the values in-place
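One way to fill in place, as a sketch on hand-entered data: pass `inplace=True` so that `df` itself is changed. Reassigning the result (`df = df.fillna(...)`) is an equally valid alternative that some readers may find easier to follow.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[9, np.nan, np.nan],
                   [5, np.nan, np.nan],
                   [9, np.nan, 4],
                   [1, np.nan, 0],
                   [np.nan, np.nan, np.nan],
                   [5, 1, 2],
                   [3, 8, 6]], dtype=float)

# inplace=True modifies df directly, so the result is not assigned to a new name
df.fillna({1: 0.5, 2: 0}, inplace=True)
print(df)
```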


In [48]:
df

Unnamed: 0,0,1,2
0,9.0,0.5,0.0
1,5.0,0.5,0.0
2,9.0,0.5,4.0
3,1.0,0.5,0.0
4,,0.5,0.0
5,5.0,1.0,2.0
6,3.0,8.0,6.0


In [49]:
df = pd.DataFrame(np.random.randint(0, 10, (6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df


Unnamed: 0,0,1,2
0,4,3.0,2.0
1,3,9.0,1.0
2,8,,7.0
3,5,,7.0
4,3,,
5,6,,


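The **method** parameter of `fillna` is a powerful option: for example, `method='ffill'` propagates the last valid value forward. Newer pandas releases (2.1 and later, as far as I am aware) deprecate this parameter in favor of the dedicated `ffill` and `bfill` methods.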

In [14]:
# fill each missing value with the value that precedes it


In [15]:
# fill each missing value with the value that precedes it, filling at most 2 consecutive missing values
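A sketch of forward filling for the two cells above. It uses the `ffill` method rather than `fillna(method='ffill')`, which is my choice and not necessarily what the lesson intends; both express the same idea. The values are typed in to match the frame displayed earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 3, 2],
                   [3, 9, 1],
                   [8, np.nan, 7],
                   [5, np.nan, 7],
                   [3, np.nan, np.nan],
                   [6, np.nan, np.nan]])

# Propagate the last valid value downward into every gap
forward = df.ffill()

# Same, but fill at most 2 consecutive missing values per gap
limited = df.ffill(limit=2)

print(forward)
print(limited)
```

With `limit=2`, column 1 is filled only in rows 2 and 3, and rows 4 and 5 remain missing.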


**check**: fill the missing value with the mean

In [52]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [16]:
# fill the missing value with the average
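A minimal sketch for this cell: `mean` skips missing values by default, so its result can be passed straight to `fillna`. The name `filled` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# mean() ignores NaN, so this is the average of 1.0, 3.5 and 7.0
average = data.mean()

filled = data.fillna(average)
print(filled)
```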


![](assets/fillna-args.png)

## Data Transformation
So far in this lesson we’ve been concerned with handling missing data. Filtering, cleaning,
and other transformations are another class of important operations.

### Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an
example:

In [55]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method `duplicated` returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:

In [17]:
# which row(s) are duplicated


In [18]:
# count the duplicated row(s)
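The two cells above might be completed as follows. This is a sketch; `dup_mask` and `n_duplicates` are my own names.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# True for each row that repeats an earlier row across all columns
dup_mask = data.duplicated()

# Summing the boolean mask counts the duplicated rows
n_duplicates = dup_mask.sum()

print(dup_mask)
print(n_duplicates)
```

Only the last row, `('two', 4)`, repeats an earlier row, so it is the only one flagged.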


Relatedly, `drop_duplicates` returns a DataFrame containing only the rows for which
`duplicated` is `False`:

In [19]:
# show the rows that are not duplicated
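A sketch of one solution, with `unique_rows` as an illustrative name:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Keep the first occurrence of each distinct row and drop later repeats
unique_rows = data.drop_duplicates()
print(unique_rows)
```

The same rows can also be selected with `data[~data.duplicated()]`.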


Both of these methods by default consider **all** of the columns; alternatively, you can
specify any **subset** of them to detect duplicates.

In [64]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [20]:
# drop rows with duplicated values at column k1
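Restricting the comparison to one column can be sketched with the `subset` argument. The name `first_per_k1` is my own.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Only k1 is compared, so one row survives per distinct k1 value
first_per_k1 = data.drop_duplicates(subset=['k1'])
print(first_per_k1)
```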


`duplicated` and `drop_duplicates` by default keep the first observed value combination. Passing `keep='last'` keeps the last one instead.

In [21]:
# drop rows with duplicated values at columns k1 and k2, keeping the last observation
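A sketch combining `subset` with `keep='last'`; `last_per_pair` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Rows 5 and 6 share ('two', 4); keep='last' retains row 6 and drops row 5
last_per_pair = data.drop_duplicates(subset=['k1', 'k2'], keep='last')
print(last_per_pair)
```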


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame.

In [29]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food
came from.

In [35]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

The `map` method on a Series accepts a **function** or **dict-like** object containing a mapping,
but here we have a small problem: **some** of the meats are **capitalized** and
others are not. Thus, we need to convert each value to lowercase using the `str.lower`
Series method.

In [32]:
# convert all strings in 'food' column to lower case and assign it to a variable
x = data["food"].str.lower()
x

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [36]:
# add new column 'animal' to the dataframe and get the matching value from the dict 'meat_to_animal'

data["animal"] = x.map(meat_to_animal)
data


Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [37]:
# approach 2, pass a function that does all the work

def to_animal(f):
    # lowercase inside the function, so it can be applied to the raw 'food' column
    return meat_to_animal[f.lower()]

data["animal"] = data["food"].map(to_animal)
data



Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [38]:
# approach 3, do the same with a lambda function
data["animal"] = data["food"].map(lambda f: meat_to_animal[f.lower()])
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
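The three approaches give the same result here, but they differ when a value has no entry in the mapping. The sketch below illustrates this with a made-up `'tofu'` entry and a shortened dict, both of which are assumptions for the example and not part of the lesson data.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# 'tofu' is a hypothetical value with no key in the mapping
food = pd.Series(['bacon', 'Pastrami', 'tofu'])

# Passing the dict to map: values without a matching key become NaN
via_dict = food.str.lower().map(meat_to_animal)

# Indexing the dict inside a function raises KeyError for the unknown value
try:
    food.map(lambda f: meat_to_animal[f.lower()])
    raised = False
except KeyError:
    raised = True

# dict.get with a default avoids the error and labels unknown values explicitly
via_get = food.map(lambda f: meat_to_animal.get(f.lower(), 'unknown'))

print(via_dict)
print(raised)
print(via_get)
```

Which behavior is preferable depends on the dataset: a silent `NaN` is convenient for later cleaning with `fillna`, while an error surfaces unexpected values early.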
