# Data Cleaning and Preparation

A significant amount of time in data analysis is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up `80%` or more of an analyst’s time.

In this lesson I discuss tools for **missing data**, **duplicate data**, **string manipulation**,
and some other analytical data transformations. 


In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Interacting with Databases

In a business setting, most data may not be stored in text or Excel files. SQL-based
relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use,
and many alternative databases have become quite popular.

In [1]:
# Load employees data from sqlite database 'hr.db' using a SQL query


In [2]:
# Load employees data from sqlite database 'hr.db' using read table
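
One possible way to approach these two cells is sketched below. Since `hr.db` is not available here, the example builds a small in-memory SQLite database as a stand-in, and the `employees` columns (`name`, `salary`) are hypothetical. It loads the table with `pd.read_sql_query` over a standard-library `sqlite3` connection.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for 'hr.db'; the table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, salary) VALUES (?, ?)",
    [("Ada", 5200.0), ("Grace", 6100.0), ("Linus", None)],
)
conn.commit()

# Run a SQL query and collect the result set into a DataFrame.
employees = pd.read_sql_query("SELECT * FROM employees", conn)
conn.close()
print(employees)
```

`pd.read_sql_table` loads a whole table by name, but it expects a SQLAlchemy connectable rather than a plain `sqlite3` connection, so it is left out of this sketch.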


## Handling Missing Data

- All of the descriptive statistics on pandas objects exclude missing data by default.
- For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

In [9]:
string_data = pd.Series([None, 'aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
# check the missing values with isnull() function
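
A minimal sketch of the two points above: `isnull` flags both `None` and `NaN` as missing, and a descriptive statistic such as `mean` skips missing values by default. The `numeric` Series is an extra illustration, not part of the lesson data.

```python
import numpy as np
import pandas as pd

string_data = pd.Series([None, 'aardvark', 'artichoke', np.nan, 'avocado'])

# True where the value is missing (None or NaN), False elsewhere.
mask = string_data.isnull()
print(mask)

# Descriptive statistics ignore NaN: the mean is taken over 1.0 and 3.0 only.
numeric = pd.Series([1.0, np.nan, 3.0])
print(numeric.mean())
```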



![](assets/na-methods.png)

### Filtering Out Missing Data

In [6]:
# filter out the missing data (first approach)
data = pd.Series([1, np.nan, 3.5, np.nan, 7])



In [7]:
# filter out the missing data (second approach)
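
The two approaches could look like the sketch below: `dropna` removes missing entries directly, while boolean indexing with `notnull` selects the non-missing ones. Both should give the same result.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# First approach: let pandas drop the missing entries.
first = data.dropna()

# Second approach: keep only the entries where notnull() is True.
second = data[data.notnull()]

print(first)
print(first.equals(second))
```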



With DataFrame objects, things are a bit more complex. You may want to drop **rows**
or **columns** that are **all** `NA` or only those containing **any** `NAs`.

In [24]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


In [8]:
# Drop any row containing a missing value


In [9]:
# Drop any row with all values missing
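
As a sketch of the difference between the two cells: `dropna()` with no arguments drops every row that contains at least one missing value, while `how='all'` drops only rows in which every value is missing.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Only row 0 is complete, so it is the only one that survives.
any_dropped = data.dropna()

# Only row 2 is entirely missing, so it is the only one removed.
all_dropped = data.dropna(how='all')

print(any_dropped)
print(all_dropped)
```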



In [27]:
data[4] = np.nan
data

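|   | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |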


In [10]:
# drop the columns that have all values missing 
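
Dropping columns works the same way as dropping rows once `axis=1` is passed. A minimal sketch, rebuilding the frame from the cells above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan  # a column that is missing everywhere

# axis=1 targets columns; how='all' removes only fully missing ones.
cols_dropped = data.dropna(axis=1, how='all')
print(cols_dropped)
```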



In [29]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df


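|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.654345 | NaN | NaN |
| 1 | 0.804067 | NaN | NaN |
| 2 | 0.508458 | NaN | 0.283504 |
| 3 | 0.594399 | NaN | -0.666805 |
| 4 | 0.923111 | -0.136595 | -0.085718 |
| 5 | 0.106951 | -0.313042 | 0.140816 |
| 6 | -0.793298 | -0.696201 | -0.775991 |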


In [11]:
# drop all rows that have any missing values


In [12]:
# drop all rows that have 2 or more missing values
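
One way to express "2 or more missing values" is the `thresh` argument, which states the minimum number of non-missing values a row needs in order to be kept. With three columns, `thresh=2` drops the rows that have two or more missing values. The sketch uses a seeded generator (an assumption, for reproducibility) instead of `np.random.randn`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Rows 0-3 contain at least one NaN, so only rows 4-6 remain.
complete_rows = df.dropna()

# Keep rows with at least 2 non-missing values: rows 0 and 1 are dropped.
at_least_two = df.dropna(thresh=2)

print(complete_rows)
print(at_least_two)
```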


### Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways.

For most purposes, the `fillna` method is the workhorse function to use.

In [13]:
# replace all missing data with 0


Calling `fillna` with a **dict**, you can use a different fill value for each column:


In [14]:
# replace missing values of column 1 by 0.5 and of column 2 by 0
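
A minimal sketch of both calls, again on a seeded random frame (the seed is an assumption for reproducibility). The dict keys are the column labels, here the integers `1` and `2`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# A scalar fills every missing value with the same constant.
filled_zero = df.fillna(0)

# A dict maps column label -> fill value for that column.
filled_by_col = df.fillna({1: 0.5, 2: 0})

print(filled_zero)
print(filled_by_col)
```

Both calls return new objects, so `df` itself still contains its missing values afterwards.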



`fillna` returns a **new object**, but you can modify the existing object in-place by passing `inplace=True`:

In [15]:
# fill the values in-place
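
An in-place fill might look like the sketch below; the seeded random frame is an assumption for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# inplace=True modifies df itself instead of producing a filled copy.
df.fillna(0, inplace=True)
print(df)
```

Reassigning the result (`df = df.fillna(0)`) has the same effect on the name `df` and is often easier to follow in longer pipelines.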


In [36]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.555184 | -1.508079 | 0.873148 |
| 1 | -0.844023 | -0.750435 | -0.933205 |
| 2 | 0.361957 | NaN | 0.664888 |
| 3 | 0.472315 | NaN | 0.933841 |
| 4 | -0.470906 | NaN | NaN |
| 5 | 0.337988 | NaN | NaN |


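The `method` parameter of `fillna` fills gaps by propagating neighboring values, for example `method='ffill'` for a forward fill. Recent pandas releases deprecate this parameter in favor of the dedicated `ffill` and `bfill` methods.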

In [16]:
# fill each missing value with the value that precedes it



In [18]:
# fill each missing value with the value that precedes it, filling at most 2 consecutive values
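
A sketch of forward filling, with and without a limit. It uses the `ffill` method rather than `fillna(method='ffill')`, since the latter is deprecated in recent pandas versions; the seeded random frame is an assumption for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan

# Propagate the last valid value in each column downwards.
filled = df.ffill()

# Same, but fill at most 2 consecutive missing values per gap.
limited = df.ffill(limit=2)

print(filled)
print(limited)
```

With `limit=2`, column `1` still has missing values in rows 4 and 5 because its gap is four rows long.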



**check**: fill the missing value with the mean

In [19]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
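
One way to do this check is to pass the Series mean to `fillna`. Because `mean` skips missing values, the fill value is the average of the three observed numbers.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# mean() ignores NaN, so this is (1 + 3.5 + 7) / 3.
mean_value = data.mean()
filled = data.fillna(mean_value)
print(filled)
```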


![](assets/fillna-args.png)

## Data Transformation
So far in this lesson we’ve been concerned with handling missing data. Filtering, cleaning,
and other transformations are another class of important operations.

### Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an
example:

In [39]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |
| 6 | two | 4 |


The DataFrame method `duplicated` returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:

In [20]:
# which row(s) are duplicated


Relatedly, `drop_duplicates` returns a DataFrame containing only the rows for which the
`duplicated` array is `False`:

In [21]:
# show the rows that are not duplicated
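
A minimal sketch of both methods on the example frame. Only the last row repeats an earlier `(k1, k2)` combination, so it is the only one flagged and removed.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Boolean Series: True where the row repeats an earlier row.
is_dup = data.duplicated()

# DataFrame with the repeated rows removed.
deduped = data.drop_duplicates()

print(is_dup)
print(deduped)
```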


Both of these methods by default consider **all** of the columns; alternatively, you can
specify any **subset** of them to detect duplicates.

In [42]:
data['v1'] = range(7)
data

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |
| 2 | one | 2 | 2 |
| 3 | two | 3 | 3 |
| 4 | one | 3 | 4 |
| 5 | two | 4 | 5 |
| 6 | two | 4 | 6 |


In [22]:
# drop rows with duplicated values at column k1


`duplicated` and `drop_duplicates` by default keep the first observed value combination. Passing `keep='last'` keeps the last one instead:

In [23]:
# drop rows with duplicated values in columns k1 and k2, keeping the last observation
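
The subset and `keep` options could be used as in the sketch below. Restricting the comparison to `k1` leaves one row per distinct key, and `keep='last'` retains row 6 instead of row 5 for the repeated `('two', 4)` pair.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Compare rows on column k1 only; the first occurrence of each key is kept.
by_k1 = data.drop_duplicates(['k1'])

# Compare rows on k1 and k2; the last occurrence of each pair is kept.
by_k1_k2_last = data.drop_duplicates(['k1', 'k2'], keep='last')

print(by_k1)
print(by_k1_k2_last)
```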



### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame.

In [45]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

|   | food | ounces |
|---|---|---|
| 0 | bacon | 4.0 |
| 1 | pulled pork | 3.0 |
| 2 | bacon | 12.0 |
| 3 | Pastrami | 6.0 |
| 4 | corned beef | 7.5 |
| 5 | Bacon | 8.0 |
| 6 | pastrami | 3.0 |
| 7 | honey ham | 5.0 |
| 8 | nova lox | 6.0 |


Suppose you wanted to add a column indicating the type of animal that each food
came from.

In [47]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

The `map` method on a Series accepts a **function** or **dict-like** object containing a mapping,
but here we have a small problem in that **some** of the meats are **capitalized** and
others are not. Thus, we need to convert each value to lowercase using the `str.lower`
Series method:

In [24]:
# convert all strings in 'food' column to lower case and assign it to a variable



In [25]:
# add new column 'animal' to the dataframe and get the matching value from the dict 'meat_to_animal'


In [26]:
# approach 2: pass a function that does all the work
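
Both approaches could be sketched as follows. The first lowercases the column and then maps it through the dict; the second passes a single function to `map` that lowercases and looks up each value. The column name `animal2` is only there to compare the two results.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
    'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
    'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon',
}

# Approach 1: normalize the case first, then map through the dict.
lowercased = data['food'].str.lower()
data['animal'] = lowercased.map(meat_to_animal)

# Approach 2: one function that lowercases and looks up each value.
data['animal2'] = data['food'].map(lambda food: meat_to_animal[food.lower()])

print(data)
```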



### Replacing Values
Filling in missing data with the `fillna` method is a special case of more general value
replacement. As you’ve already seen, `map` can be used to modify a subset of values in
an object, but `replace` provides a simpler and more flexible way to do so.

Let’s consider this Series:

In [51]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [27]:
# replace the value of -999 with np.nan


If you want to replace multiple values at once, you instead pass a list and then the
substitute value:

In [28]:
# replace both the values of -999 and -1000 with np.nan


To use a different replacement for each value, pass a list of substitutes or a mapping dict:


In [29]:
# replace -999 with np.nan and -1000 with 0, using a list of substitutes



In [30]:
# replace -999 with np.nan and -1000 with 0, using a mapping dict
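
The four `replace` variants described in this section could be sketched as follows. Each call returns a new Series and leaves `data` unchanged.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# One sentinel value -> NaN.
single = data.replace(-999, np.nan)

# Several sentinel values -> the same substitute.
several = data.replace([-999, -1000], np.nan)

# A different substitute per value, given as parallel lists.
by_list = data.replace([-999, -1000], [np.nan, 0])

# The same mapping written as a dict.
by_dict = data.replace({-999: np.nan, -1000: 0})

print(by_list)
```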



### Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping 
of some form to produce new, differently labeled objects. You can also modify
the axes in-place without creating a new data structure.

In [57]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

|   | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| New York | 8 | 9 | 10 | 11 |


Like a Series, the axis indexes have a `map` method:

In [31]:
# convert the indices to upper case


If you want to create a transformed version of a dataset without modifying the original, a useful method is `rename`:

In [32]:
# convert the indices and column names to upper case using rename method



In [33]:
# rename a specific index label and column


In [34]:
# rename index in-place
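
A sketch of the four renaming cells. The replacement labels `'INDIANA'` and `'peekaboo'` are arbitrary placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Index.map applies a function to every label and returns a new Index.
upper_index = data.index.map(lambda label: label.upper())

# rename accepts functions for either axis and returns a new DataFrame.
renamed = data.rename(index=str.upper, columns=str.upper)

# Dicts rename only the listed labels.
partial = data.rename(index={'Ohio': 'INDIANA'},
                      columns={'three': 'peekaboo'})

# inplace=True changes data itself.
data.rename(index={'Ohio': 'INDIANA'}, inplace=True)

print(renamed)
print(partial)
print(data.index)
```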


### Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group
them into discrete age buckets:

In [37]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, use the pandas function `cut`:

In [39]:
# divide the ages into 4 classes using the bins
bins = [18, 25, 35, 60, 100]


The object pandas returns is a special **Categorical object**. The output you see
describes the bins computed by `pandas.cut`. 

You can treat it like an **array of strings** indicating the bin name; internally it contains a **categories array** specifying the distinct category names along with a labeling for the ages data in the **codes attribute**:

In [40]:
# print the ages codes


In [41]:
# print the distinct categories


In [42]:
# print the number of occurrences of each category
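
The cells above could be filled in along these lines. The name `cats` matches the one used in the surrounding text.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Assign each age to one of the intervals (18, 25], (25, 35], ...
cats = pd.cut(ages, bins)

# Integer position of each age's bin within cats.categories.
print(cats.codes)

# The distinct bins, as an index of intervals.
print(cats.categories)

# Number of ages that fall into each bin.
print(cats.value_counts())
```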


**Note** that `cats.value_counts()` are the bin counts for the result of `pandas.cut`.
Consistent with mathematical notation for intervals, a **parenthesis** means that the side
is **open**, while the **square bracket** means it is **closed (inclusive)**. 

You can change which side is closed by passing `right=False`:

In [43]:
# categorize the ages with right open intervals



You can also pass your own bin names by passing a list or array to the `labels` option:

In [44]:
# categorize the ages with the labels below
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
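
A sketch of both options. With `right=False` the intervals become `[18, 25)`, `[25, 35)` and so on, so an age of exactly 25 moves from the first bin to the second; with `labels`, each bin is reported under the given name instead of its interval.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Left-closed, right-open intervals.
left_closed = pd.cut(ages, bins, right=False)

# Default intervals, but with custom names for the bins.
named = pd.cut(ages, bins, labels=group_names)

print(left_closed)
print(named)
```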


If you pass an integer number of bins to `cut` instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.

In [45]:
data = np.random.randint(0, 20, 12)
data

array([10,  2, 11,  8, 10, 19, 16,  7, 16, 10, 12,  8])

In [48]:
# divide the data into 4 groups


In [49]:
# count the number of values in each group
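
A minimal sketch using the array printed above in place of a fresh random draw, so the result is reproducible. The values span 2 to 19, so four equal-length bins are each 4.25 wide.

```python
import numpy as np
import pandas as pd

data = np.array([10, 2, 11, 8, 10, 19, 16, 7, 16, 10, 12, 8])

# An integer asks cut to compute that many equal-length bins.
groups = pd.cut(data, 4)

# Count how many values landed in each bin.
counts = groups.value_counts()
print(counts)
```

The bins have equal width but clearly unequal counts, which is the behavior that `qcut` addresses.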


A closely related function, `qcut`, bins the data based on sample quantiles. 

Depending on the distribution of the data, using `cut` will not usually result in each bin having the
same number of data points. 

Since `qcut` uses sample quantiles instead, by definition you will obtain **roughly equal-size** bins:

In [50]:
data = np.random.randn(1000)  # Normally distributed
# Cut into quartiles and count the values
cats = pd.qcut(data, 4)  
cats
pd.Series(cats).value_counts()

(0.737, 3.93]                    250
(-0.0404, 0.737]                 250
(-0.727, -0.0404]                250
(-3.0709999999999997, -0.727]    250
dtype: int64

## Independent Practice
Using the `coffee-preferences.csv` data set:
- Show the duplicated rows.
- Show the names of the customers who rated every coffee.
- Show the names of the coffees that were rated by every customer.
- Which coffee got the highest average rating?
- Replace the missing values with zero.
- Convert the names to upper case.
- Convert the ratings to percentages.