# Data Cleaning

* By common estimates, anywhere from 50% to 80% of data science work is data cleaning
    * of course I hear 70% of statistics are made up on the spot
* Dealing with dirty data is a fact of life when doing data-intensive research
* Especially if you are collecting or creating the data yourself
* Fortunately, Pandas is excellent at data cleaning and once you get the hang of it you might even enjoy it!


In [None]:
# load the necessary libraries
import pandas as pd
import numpy as np


## Missing Values 

* One of the challenges you may face when working with messy data is *missing* or **null** values
* There are multiple conventions for representing null values when doing data science in Python
* There is a Pythonic way using the `None` object
* There is a Numpy/Pandas-y way using `NaN`

### None - Pythonic Missing Data

* `None` is the standard way of representing nothing in plain Python
* It is useful, but it is also a full Python object rather than a lightweight numeric value
* It can be used in numeric and programmatic contexts

In [None]:
# create a numpy array of numbers and a null value represented by None
some_numbers = np.array([1,None,3,4])
some_numbers

* Because numpy arrays (and pandas series/columns) all have to be the same data type, numpy falls back to the most expressive and least efficient data type for the array: `object`
    * Note: Pandas will automatically convert `None` to `NaN`, so we use `np.array` here
* This means any operations running over the array/column/series are going to run slower than they could if the data type was numeric
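
As a quick check of the point above, this minimal sketch compares the data type numpy picks for an array containing `None` with the one Pandas picks for the same values. The variable names `with_none` and `as_series` are made up for illustration.

```python
import numpy as np
import pandas as pd

# numpy keeps None as a Python object, so the whole array becomes dtype object
with_none = np.array([1, None, 3, 4])
print(with_none.dtype)

# Pandas converts None to NaN, which lets the Series stay numeric (float64)
as_series = pd.Series([1, None, 3, 4])
print(as_series.dtype)
print(as_series.isna().tolist())
```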

In [None]:
# create an array of objects and an array of integers
# compute their sums and time how long each takes
for dtype in ['object','int']:
    print("data type = ", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

* Notice the integer array was ***a lot*** faster than the object array
* Also, the vectorized math operations don't like `None`

In [None]:
some_numbers.sum()
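
The cell above is expected to fail. The sketch below shows what happens and one possible workaround: asking numpy for a float array so that `None` is stored as `NaN`. The names `failed` and `as_floats` are only for illustration.

```python
import numpy as np

some_numbers = np.array([1, None, 3, 4])

# adding an int and None is not defined, so the vectorized sum raises TypeError
try:
    some_numbers.sum()
    failed = False
except TypeError as err:
    failed = True
    print("sum failed:", err)

# requesting a float dtype stores None as NaN instead of as a Python object
as_floats = np.array([1, None, 3, 4], dtype=float)
print(as_floats.dtype)
print(as_floats)
```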

### NaN - Numpy/Pandas-y Missing Numeric data

* The Numpy third-party library has a mechanism for representing missing numeric values
* Under the hood, NaNs are standards-compliant (IEEE 754) floating-point numbers
    * Note for R users: There is no `Null` only `NaN`
* This means you can use them with other numeric arrays for fast computations

In [None]:
# Create a numeric Pandas Series with a missing value
nanny = pd.Series([1, np.nan, 3, 4])
nanny.dtype

* Now we can use all the fast and easy computations in Pandas without worrying about missing values

In [None]:
# compute the sum of all the numbers in the Series
nanny.sum()
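
One detail worth knowing here: Pandas skips `NaN` by default when aggregating, while plain numpy propagates it. The sketch below illustrates the difference on the same values. `skipna` is a parameter of `Series.sum()` and `np.nansum` is numpy's NaN-ignoring sum; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

values = np.array([1, np.nan, 3, 4])
nanny = pd.Series(values)

numpy_total = values.sum()               # NaN propagates through the plain numpy sum
numpy_nan_total = np.nansum(values)      # treats NaN as missing and ignores it
pandas_total = nanny.sum()               # Pandas skips NaN by default
pandas_strict = nanny.sum(skipna=False)  # opt back in to NaN propagation

print(numpy_total, numpy_nan_total, pandas_total, pandas_strict)
```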

## Operating on Null Values

* There are four functions in Pandas that are useful for working with missing data
* The examples below operate on Series, but they can work on Dataframes as well


### Null value functions

* `isna()` - Generate a boolean mask of the missing values (can also use `isnull()`)
* `notna()` - Do the opposite of `isna()` (can also use `notnull()`)
* `dropna()` - Create a filtered copy of the data with no null values
* `fillna(value)` - Create a copy of the data with null values filled in

In [None]:
# display the Series
nanny

In [None]:
# what values are null
nanny.isna()

In [None]:
# what values are not null
nanny.notna()

* These masks can be used to filter the data and select only the missing or non-missing values

In [None]:
# not super useful in a Series, but handy with Dataframes
nanny[nanny.isna()]

* Besides filtering with a mask, we can create *copies* of the data with the null values removed or filled in

In [None]:
# Just get rid of all the null values
no_null_nanny = nanny.dropna()
no_null_nanny

In [None]:
# fill in the null values with zero
fill_null_nanny = nanny.fillna(0)
fill_null_nanny

In [None]:
# fill in the null values with a different value
fill_null_nanny = nanny.fillna(999)
fill_null_nanny

In [None]:
# The original nanny Series remains untouched #noreboot
# Fran Drescher frowns with disappointment
nanny

* These functions work with dataframes as well
* But you will need to pay closer attention to what they are doing

In [None]:
df_nanny = pd.DataFrame([[1, np.nan, 2],
                        [2, 3, 5],
                        [np.nan, 4, 6]])
df_nanny

* Dropping null values with `dropna()` removes the entire axis (row or column) and returns a new copy of the dataframe
* You can specify dropping rows or columns with the axis parameter

In [None]:
# dropna gets rid of rows by default
df_nanny.dropna() # axis="rows" or axis=0

In [None]:
# use the axis="columns" or axis=1 to drop columns
df_nanny.dropna(axis="columns")

* There are a couple of other parameters that let you specify other behaviors
* Like only dropping rows/columns with all null values or setting a threshold
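
The two behaviors just mentioned correspond to the `how` and `thresh` parameters of `dropna()`. Here is a minimal sketch on a small made-up dataframe (`df_example` is not part of the lesson data):

```python
import numpy as np
import pandas as pd

# row 0 has one null, row 1 has none, row 2 is entirely null
df_example = pd.DataFrame([[1, np.nan, 2],
                           [2, 3, 5],
                           [np.nan, np.nan, np.nan]])

# how="all" only drops rows in which every value is null
all_null_dropped = df_example.dropna(how="all")

# thresh=3 keeps only rows with at least 3 non-null values
at_least_three = df_example.dropna(thresh=3)

print(all_null_dropped)
print(at_least_three)
```

Both calls return new dataframes and leave `df_example` unchanged.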

## Working with null values in real data

* Here is an example of some real data, the diabetes data from week 2

In [None]:
# Import data file into a Pandas dataframe
df = pd.read_csv("../2 - data python two/diabetes.csv")

# Display the first 5 rows of the data
df.head()

In [None]:
# Display the metadata about the data, including the non-null count for each column
df.info() 

* If we look closely at this information we can see there are a few null values in this dataset
* There are 403 rows, but some columns have fewer than 403 non-null values
* Now let's check which values in the dataset are missing

In [None]:
# Create a boolean mask where True indicates a null value
df.isna().head()

* Gak! Too much data, how can we just get a quick count of the null values?
* What if we combined `isna()` with the `sum()` function?

In [None]:
# Use the sum function to count the True values in the boolean mask
df.isna().sum()
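
Summing works because `True` counts as 1 and `False` as 0. The same trick with `mean()` gives the share of missing values per column. This is a small sketch on made-up numbers: the `toy` dataframe and its `weight` column are hypothetical, not the diabetes data.

```python
import numpy as np
import pandas as pd

# hypothetical values, only to show the mechanics
toy = pd.DataFrame({"chol": [203.0, np.nan, 228.0, np.nan],
                    "weight": [121.0, 218.0, np.nan, 119.0]})

null_counts = toy.isna().sum()   # number of nulls in each column
null_share = toy.isna().mean()   # fraction of nulls in each column

print(null_counts)
print(null_share)
```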

* If we wanted to look at a specific column we can do the same operation 
* These functions work with Series as well as DataFrames

In [None]:
# How many null values in the chol column
df["chol"].isnull().sum()

* Now let's deal with missing values
* Solution 1: Remove rows with empty values
* If there are only a few null values and you know that deleting values will not cause adverse effects on your result, remove them from your DataFrame
* Make sure to save the new dataframe to a new variable!

In [None]:
# Display missing value counts
print("Missing values before dropping rows: ")
print(df.isnull().sum())


# Display new dataset
mod_df = df.dropna() # make a copy of the dataframe with null values removed
print("Missing values after dropping rows: ")
print(mod_df.isnull().sum())
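
If only one column matters for the analysis, `dropna()` also accepts a `subset` parameter so that rows are dropped only when that column is null. A minimal sketch on made-up data follows; `toy` and its `weight` column are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical values, only to show the mechanics
toy = pd.DataFrame({"chol": [203.0, np.nan, 228.0],
                    "weight": [121.0, 218.0, np.nan]})

# drop rows only where chol is missing; nulls in other columns are kept
only_chol = toy.dropna(subset=["chol"])

print(only_chol)
print(len(toy), "rows in the original,", len(only_chol), "rows after dropping")
```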


### EXERCISE

A reviewer of the article you submitted to the most prestigious journal in your field loves your analysis but doesn't like the fact that you dropped rows with missing cholesterol values. You can't drop them and you can't just put in zero, so you need to identify a technique to deal with those missing values: some kind of *imputation* that *fills in* a new value in place of the null values. Hopefully it won't drastically change the interpretation!

1. Create a new `filler_value` by deriving a number (mean, median or something else) from the column of cholesterol values (`df['chol']`)
2. Use the `fillna()` function to fill in the missing values of the cholesterol column


In [None]:
# Put your code here

## Create a filler value
filler_value = ???




### Solution

* One quick and easy way is to fill in missing values with the mean value of a given column

In [None]:
# Find the mean
filler_value = df["chol"].mean()
filler_value

In [None]:
# Fill missing values with a mean (average) value of a given column
# Note that we assign the result back to the column - that means that we are
# overwriting the data in the existing dataset
df["chol"] = df["chol"].fillna(filler_value)
df.isnull().sum()

* No more null values in the `chol` column

## Vectorized String Operations

* If you are dealing with textual or categorical data, you often have to clean strings
* Pandas has a set of *Vectorized String Operations* that are much faster and easier than the Python equivalents 
* Especially handling bad data!

In [None]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* But like above, this breaks very easily with missing values

In [None]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* The Pandas library has *vectorized string operations* that handle missing data

In [None]:
# convert our list into a Series
names = pd.Series(data)
names

In [None]:
# Use the string vector function to capitalize everything
names.str.capitalize()

* Look ma! No errors!
* Pandas includes a bunch of methods for doing things to strings.

|  Functions  |. |.  |. |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |
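
As an illustration of how a few of the methods in the table behave when a value is missing, here is a minimal sketch on the names used earlier. The variable names `upper`, `swapped` and `lengths` are made up for this example.

```python
import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# each method is applied element by element; the missing value stays missing
upper = names.str.upper()
swapped = names.str.swapcase()
lengths = names.str.len()

print(upper)
print(swapped)
print(lengths)
```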

### Exercise

* In the cells below, try three of the string operations listed above on the Pandas Series `monte`
* Remember, you can hit tab to autocomplete and shift-tab to see documentation

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

In [None]:
# First
monte.str.


In [None]:
# Second
monte.str.




In [None]:
# Third
monte.str.




### String Vector Operations with Real Data

* Let's try some string vector operations using real data!

In [None]:
# open the chipotle data and look at the first 5 rows
orders = pd.read_csv("../4 - data management one/chipotle.tsv", sep="\t")
orders.head()

We have loaded the tab-separated data file into a dataframe.

In [None]:
# get the rows and columns of the dataframe
orders.shape

* We see there are 4,622 rows and 5 columns.
* Let's take a look at the fifth row (position 4) to see what textual information we have:

In [None]:
# display the row at position 4 of the DataFrame
orders.iloc[4]

* We can use Vectorized String Operations to explore the textual data

In [None]:
# Summarize the length of the choice_description string
orders['choice_description'].str.len().describe()

In [None]:
# which row has the longest ingredients string
orders['choice_description'].str.len().idxmax()

In [None]:
# use iloc to fetch that specific row from the dataframe
orders.iloc[3659]

In [None]:
# use iloc to fetch the max row automatically
orders.iloc[orders['choice_description'].str.len().idxmax()]

In [None]:
# only look at the description string
orders.iloc[orders['choice_description'].str.len().idxmax()]['choice_description']

* WOW! That is a lot of ingredients! It looks like that string is semi-structured, I wonder if we can do something with it...
* We could start by doing some string matching

In [None]:
# How many orders contain salsa
orders['choice_description'].str.contains('Salsa').sum()

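* Note, you can use dot notation with column names, as long as the name is a valid Python identifier and does not clash with an existing DataFrame attribute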
* This is useful because then you can use autocomplete with the string vector functions

In [None]:
# How many orders contain salsa
orders.choice_description.str.contains('Salsa').sum()

In [None]:
# How many Burritos
orders.item_name.str.contains("Burrito").sum()

In [None]:
# How many burritos...capitalization matters!
orders.item_name.str.contains("burrito").sum()
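
If capitalization should not matter, `str.contains()` accepts a `case` parameter, and its `na` parameter controls what a missing value counts as. The sketch below uses a small made-up Series (`items` is hypothetical, not the Chipotle data):

```python
import pandas as pd

# hypothetical item names, including one missing value
items = pd.Series(["Chicken Burrito", "burrito bowl", "Chips", None])

# case-sensitive match; na=False treats the missing value as "no match"
exact = items.str.contains("Burrito", na=False).sum()

# case-insensitive match
any_case = items.str.contains("burrito", case=False, na=False).sum()

print(exact, any_case)
```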

* Let's find the burrito with the most items in it

In [None]:
# create a boolean mask of the rows whose item name contains "Burrito"
burrito_mask = orders.item_name.str.contains("Burrito")
burrito_mask

In [None]:
# get the id of the burrito with the longest description
max_burrito_id = orders[burrito_mask]["choice_description"].str.len().idxmax()
max_burrito_id

In [None]:
# get the description column of the row with the max_burrito_id
orders.loc[max_burrito_id, "choice_description"]  # idxmax returns an index label, so use loc

* That is a LOADED BURRITO!
* This data is interesting, but not very useful because it is one big string
* But we can probably do more with that `choice_description` column
* Let's pretend [it doesn't look like Python code](https://stackoverflow.com/questions/33281450/right-way-to-use-eval-statement-in-pandas-dataframe-map-function) and instead treat it as a comma separated list
* What string function could we use?

|  Functions  |. |.  |. |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |


In [None]:
# Use the split function to break up the different ingredients
orders.choice_description.str.split(",")

* But what about those pesky brackets! Let's get rid of them!

In [None]:
# remove the left brackets
orders.choice_description.str.replace("[", "", regex=False)

In [None]:
# remove the left and right brackets
orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False)

In [None]:
# remove the left and right brackets and split on commas
orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(",")

* Wait what!? The brackets are back!(*@&#^$
* Yes, but now they indicate Python lists instead of `[` and `]` characters (confusing yes I know)
* How can we grab items from those lists of ingredients?

In [None]:
# remove the left and right brackets and split on commas and grab the first element
orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(",").str[0]

In [None]:
# remove the left and right brackets and split on commas and grab the last element
orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(",").str[-1]

In [None]:
# remove the left and right brackets and split on commas and grab the first 3 elements
orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(",").str[0:3]

In [None]:
# Put the split descriptions into a new Series
split_description = orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.split(",")
split_description
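
The clean-and-split chain used above can be sketched on a small made-up Series, which makes each step easier to inspect. The strings below are hypothetical examples in the same bracketed style, not rows from the Chipotle file. Passing `regex=False` makes the brackets literal characters regardless of Pandas version, and splitting on `", "` (comma plus space) is one way to avoid leading spaces in the resulting items.

```python
import pandas as pd

# hypothetical descriptions in the same bracketed style, plus one missing value
descriptions = pd.Series(["[Fresh Tomato Salsa, [Rice, Black Beans]]",
                          "[Clementine]",
                          None])

# remove every bracket character, treating "[" and "]" literally
cleaned = (descriptions.str.replace("[", "", regex=False)
                       .str.replace("]", "", regex=False))

# split into Python lists, then use .str to index into and measure each list
ingredient_lists = cleaned.str.split(", ")
first_ingredient = ingredient_lists.str[0]
ingredient_counts = ingredient_lists.str.len()

print(ingredient_lists)
print(first_ingredient)
print(ingredient_counts)
```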

In [None]:
# look at the element at position 4604 of the split_description series
split_description.iloc[4604]

* Every non-null item in the series is a list

In [None]:
# Count how many items are in each description list
split_description.str.len()

In [None]:
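# Count how often each list length occurs (lists themselves are unhashable, so count their lengths)
split_description.str.len().value_counts()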
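
To count individual ingredients rather than list lengths, one possible approach (not part of the walkthrough above) is `Series.explode()`, which gives every list element its own row. The sketch below uses a tiny made-up Series of lists; `lists` and `ingredient_totals` are hypothetical names.

```python
import pandas as pd

# hypothetical ingredient lists, including one missing value
lists = pd.Series([["Rice", "Black Beans"], ["Rice"], None])

# explode puts each list element on its own row; value_counts then tallies them
# (missing values are left out of the tally by default)
ingredient_totals = lists.explode().value_counts()

print(ingredient_totals)
```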