### Preface 

In class I talked a bit about mapping out the steps for executing a project. They look like this: identify a question and the dataset(s) that may answer it; import the data; manipulate the data; and then try to answer the question. The question part is hard, but it is conceptual, not coding. The manipulation part is where coding skills are helpful: specifically, cleaning, merging, and shaping the data so that the data set is usable to answer the question at hand.

### Cleaning and String Methods on Dataframes

This notebook works through some cleaning examples that will probably help you in your project. Here we describe features of Pandas that allow us to clean data that, for reasons beyond our control, comes in a form that's not immediately amenable to analysis. This is the first of several such notebooks.

#### The Question (or want)...

We need to know what we're trying to do---what we want the data to look like. To borrow a phrase from our friend Tom Sargent, we say that we apply the want operator. Some problems we've run across that ask to be solved:

- We have too much data, would prefer to choose a subset.
- Row and column labels are contaminated.
- Numerical data is contaminated by commas (marking thousands); dollar signs; other non-numerical values, etc.
- Missing values are marked erratically.

What we want in each case is the opposite of what we have: we want nicely formatted numbers, clean row and column labels, and so on.

In [1]:
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for pandas 

### Example: Chipotle data

This data comes from a New York Times story about the number of calories in a typical order at Chipotle. The topic doesn't particularly excite us, but the data raises a number of issues that come up repeatedly. We adapt some code written by Daniel Forsyth.

In [2]:
url = "https://raw.githubusercontent.com/mwaugh0328/Data_Bootcamp_Fall_2017/master/data_bootcamp_1106/orders_dirty.csv"
#path = "C://data_bootcamp//Data_Bootcamp_Fall_2017//data_bootcamp_1106//orders_dirty.csv"
# Double forward slashes for windows machines.

chp = pd.read_csv(url)  

print("Variable dtypes:\n", chp.dtypes, sep='')
# Let's check out the datatypes that we have... are they what you expect?

chp.head()
#chp.tail()
#chp.shape

Variable dtypes:
order store id 1        object
quantity 2               int64
item name 3             object
choice description 4    object
item price 5            object
dtype: object


Unnamed: 0,order store id 1,quantity 2,item name 3,choice description 4,item price 5
0,1 Bucks County,1,Chips and Fresh Tomato Salsa,,$2.39
1,1 Bucks County,1,Izze,[Clementine],$3.39
2,1 Bucks County,1,Nantucket Nectar,[Apple],$3.39
3,1 Bucks County,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2 Bucks County,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [3]:
chp.tail()

Unnamed: 0,order store id 1,quantity 2,item name 3,choice description 4,item price 5
4617,1833 Bucks County,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
4618,1833 Bucks County,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
4619,1834 Bucks County,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
4620,1834 Bucks County,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75
4621,1834 Bucks County,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$8.75


### Issue #1: We have too much data, want to work with a subset.

OK, so this is not really an issue here. The data set has about 5000 rows and only a few columns, so it is not huge. But let's imagine that it was huge and we don't want to deal with continually manipulating a big data set. We already know how to do this: we just use the `nrows` argument when we read in the dataset.

In [4]:
chp = pd.read_csv(url, nrows = 500)   

print("Variable dtypes:\n", chp.dtypes, sep='')
# Let's check out the datatypes that we have... are they what you expect?

chp.head()

chp.tail()

chp.shape

Variable dtypes:
order store id 1        object
quantity 2               int64
item name 3             object
choice description 4    object
item price 5            object
dtype: object


(500, 5)

Now the shape indicates that we only have 500 rows. Just as we specified. This was easy. 

One strategy is to write and test your code on only a subset of the data. Again, the upside is that the code may run faster and the data is easier to look at and analyze. Then, once you have everything sorted out, you simply change the code above and scale it up.

**Here is the issue to be mindful of: the subset may not be "representative" of the entire data set.** For example, there may be issues in, say, row 1458 (e.g. missing values, different data types) that will only arise when the full data set is imported. Moreover, your results (graphics, statistics, etc.) may not be the same once the entire data set is read in. This is just something to be mindful of when pursuing this approach.
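
To see how a subset can hide a problem, here is a minimal sketch using a small made-up CSV (not the Chipotle file) in which a stray text entry only shows up in a later row. The names `toy_csv`, `subset`, and `full` are hypothetical.

```python
import io
import pandas as pd

# A made-up CSV: the first three prices are clean, the fourth is not.
toy_csv = "order,price\n1,2.39\n2,3.39\n3,8.75\n4,gift card\n"

subset = pd.read_csv(io.StringIO(toy_csv), nrows=3)   # only the clean rows
full = pd.read_csv(io.StringIO(toy_csv))              # every row

print(subset.dtypes)   # price is read as a float
print(full.dtypes)     # price is read as text, not as a number
```

Code that assumes `price` is numeric would pass on `subset` and then break on `full`, which is exactly the situation to watch for when scaling up.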

---

### Issue #2: Row and column labels are contaminated.

Return to the head and the `dtypes` and look at the variable names...

In [5]:
chp = pd.read_csv(url, nrows = 500)  

print("Variable dtypes:\n", chp.dtypes, sep='')
# Let's check out the datatypes that we have... are they what you expect?

chp.head()

#chp["order store id 1"].unique()

Variable dtypes:
order store id 1        object
quantity 2               int64
item name 3             object
choice description 4    object
item price 5            object
dtype: object


Unnamed: 0,order store id 1,quantity 2,item name 3,choice description 4,item price 5
0,1 Bucks County,1,Chips and Fresh Tomato Salsa,,$2.39
1,1 Bucks County,1,Izze,[Clementine],$3.39
2,1 Bucks County,1,Nantucket Nectar,[Apple],$3.39
3,1 Bucks County,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2 Bucks County,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


Here we see several issues that may slow us down; fixing them will make things easier.

- Notice how the variable names contain spaces and end with a number (as if the person constructing the data wanted to help us by telling us the column number). We could simply slice the data set by position, or we could change the column names to something simpler. Let's follow the latter approach.

- Second, notice that each "order store id 1" value gives us an order number (note how one order has several entries) followed by a store id. This could be cumbersome for many reasons. Let's explore this series using `unique()` and `value_counts()`. The code is below...

In [13]:
unique_values = pd.DataFrame(chp["order store id 1"].unique())
# This will grab the unique values and create a new dataframe out of them...

In [16]:
unique_values.shape

(209, 1)

Now here is an important observation: there are 500 rows, but only 209 unique values, so some orders have multiple entries. Another way to see what is going on is to check the value counts associated with each unique value.

In [19]:
chp["order store id 1"].value_counts().head()

205 Bucks County    12
195 Bucks County     8
149 Bucks County     6
103 Bucks County     5
184 Bucks County     5
Name: order store id 1, dtype: int64

Let's now see what is up with order 205...

In [21]:
chp[chp["order store id 1"]== "205 Bucks County"]

Unnamed: 0,order store id 1,quantity 2,item name 3,choice description 4,item price 5
478,205 Bucks County,1,Carnitas Burrito,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$9.25
479,205 Bucks County,1,Veggie Burrito,"[Roasted Chili Corn Salsa, [Fajita Vegetables,...",$11.25
480,205 Bucks County,1,Chicken Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$8.75
481,205 Bucks County,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Rice, Black Beans...",$9.25
482,205 Bucks County,1,Chicken Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Chees...",$11.25
483,205 Bucks County,1,Chicken Bowl,"[Fresh Tomato Salsa, [Rice, Black Beans, Chees...",$11.25
484,205 Bucks County,1,Chicken Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$8.75
485,205 Bucks County,1,Barbacoa Crispy Tacos,"[Fresh Tomato Salsa, Guacamole]",$11.75
486,205 Bucks County,1,Chicken Burrito,"[Fresh Tomato Salsa, Cheese]",$8.75
487,205 Bucks County,1,Chicken Burrito,"[Fresh Tomato Salsa, [Rice, Cheese, Lettuce]]",$8.75


What we learned is that every entry is for the same county (Bucks County). Thus it provides no information at all. Let's also change the entries in that column to remove it.

**First step: Fix the column names.**

In [22]:
# One way to fix the names is just to rename them by hand like this...

#new_name_list = ["order_id", "quantity", "item_name", "choice_desc", "item_price"]

#chp.columns = new_name_list

In [31]:
# Another way is to use string methods on the column names and create something more usable.
# Here is a test run, what does this do?

test = "order store id 1"

test.rsplit(maxsplit=1)[0].replace(" ","_")

# So rsplit splits the string into a list, working from the right.
# maxsplit=1 means split at most once, so only the trailing number is separated:
# ["order store id", "1"].
# Then the bracket says, take the first entry.
# Then the next part says replace the space with an underscore,
# this will help us call a column name more easily.

# What if we did not have max split?


'order_store_id'

In [24]:
# Now let's fix this all up for the dataframe
new_name_list = []

for var in chp.columns:
    new_name_list.append(var.rsplit(maxsplit=1)[0].replace(" ","_"))
    
# How would you do this in list comprehension format...
    
# Then rename everything...

chp.columns = new_name_list

chp.head()    

Unnamed: 0,order_store_id,quantity,item_name,choice_description,item_price
0,1 Bucks County,1,Chips and Fresh Tomato Salsa,,$2.39
1,1 Bucks County,1,Izze,[Clementine],$3.39
2,1 Bucks County,1,Nantucket Nectar,[Apple],$3.39
3,1 Bucks County,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2 Bucks County,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


Great work!
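
One possible answer to the list-comprehension question in the cell above, sketched on a small stand-in dataframe so that it runs on its own. The dataframe `toy` is hypothetical; its column names are copied from the Chipotle file.

```python
import pandas as pd

# Stand-in dataframe with the same messy column names as the Chipotle data.
toy = pd.DataFrame(columns=["order store id 1", "quantity 2", "item name 3",
                            "choice description 4", "item price 5"])

# Same logic as the loop: drop the trailing number, swap spaces for underscores.
new_name_list = [var.rsplit(maxsplit=1)[0].replace(" ", "_") for var in toy.columns]

toy.columns = new_name_list
print(list(toy.columns))
```

The comprehension does the same work as the loop in a single line; which one reads better is largely a matter of taste.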

**Second step: Change the individual column entries.**

So this fixed some issues with the column names. Let's use the same idea to fix the issue with the order store id and get the "Bucks County" out of there.

In [35]:
# Again, lets test this out...

# Step one, pull off the number...

test = "1 Bucks County"
test2 = test.rsplit()[0] # same idea, don't use the max split option....

print(test2)
print(type(test2)) # I want this numerical, but it's not...

# Step two, convert to floating point...

#test2 = float(test2)
#print(type(test2))

1
<class 'str'>


This gives a general idea for fixing the order numbers. Here is the problem: we need to perform this operation on every single entry of a particular column. This is different from just editing the column names. To perform this operation, we need to use **Pandas string methods.**

We can do the same thing to all the observations of a variable with so-called string methods. We append `.str` to a variable in a DataFrame and then apply the string method of our choice. If this is part of converting a number-like entry that has mistakenly been given `dtype` object, we then convert its `dtype` with the `astype` method.

**Aside** Below we will see several examples of string methods on the dataframe. Here is a link to a resource with a more comprehensive treatment of string methods in pandas:

[Strings in Pandas](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb)

In [36]:
chp.head()
chp.columns
chp.order_store_id.head()

# Just to verify we are doing what we think we are...

chp.order_store_id = chp.order_store_id.str.rsplit().str[0].astype(int)

# Note that we need two str's here: one to do the split, the other to extract the first element.
# Then the last part of the code, `astype(int)`, converts it to an integer...
# note nothing changes unless we reassign everything.

In [37]:
chp.head(20)

Unnamed: 0,order_store_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


In [38]:
print("Variable dtypes:\n", chp.dtypes, sep='')

Variable dtypes:
order_store_id         int32
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object


Great work. We now have a numerical value for each order number. The key lesson from this was using `.str` on a dataframe column to apply string methods to individual entries.
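
If the store name had carried information, we might have wanted to keep it rather than throw it away. Here is a minimal sketch of that variation on a hypothetical toy series: split once from the left and expand the pieces into separate columns. The column names `order_id` and `store` are made up for illustration.

```python
import pandas as pd

# Toy entries in the same "<order number> <store>" format as the data.
s = pd.Series(["1 Bucks County", "2 Bucks County", "205 Bucks County"])

# n=1 splits at the first space only; expand=True returns a dataframe.
parts = s.str.split(n=1, expand=True)
parts.columns = ["order_id", "store"]          # hypothetical column names
parts["order_id"] = parts["order_id"].astype(int)

print(parts)
```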

---

### Issue #3: Numerical data is contaminated by commas (marking thousands); dollar signs; other non-numerical values, etc.

We sorted out the issues with labels on the rows and columns. We still have the issue that the item price is not a numerical value. Check above: the type of `item_price` is object, not float. If we want to do some kind of numerical calculation on this, then we need to convert it.

**Why is `item_price` not a numerical value?** It's those damn dollar signs. Someone put them there thinking they were being helpful, but it is giving us a headache. **How do we fix it?** Dude, in a way very similar to what we did above.

#### Exercise: Can you use the methods above to...
 
 - Remove the dollar sign
 
 - Check the type
 
 - Convert the type to a float. Note: if it's not working, you are probably doing it right. Can you figure out what the issue is?

---


#### Replacing corrupted entries with missing values

The issue that we faced in the exercise above is that while we did remove the dollar sign, we could not convert the column to a floating point number because some entries in the column are not numbers (e.g. the gift card values). So Python/Pandas kicks back an error. How do we deal with this? The natural way is to replace all these entries with a `NaN` value.
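
The failure, and one possible way around it, can be reproduced on a toy series (made-up values, not the actual column). Note that `pd.to_numeric` with `errors="coerce"` is a different route from the `replace` approach used in this notebook: it turns anything it cannot parse into `NaN` in one step, which is convenient but also hides what the offending entries were.

```python
import pandas as pd

# Made-up prices, including one entry that is not a number at all.
prices = pd.Series(["$2.39", "$16.98", "gift card"])

# Strip the dollar sign from each entry (regex=False treats "$" literally).
stripped = prices.str.replace("$", "", regex=False)

try:
    stripped.astype(float)            # fails because of "gift card"
except ValueError as err:
    print("Conversion failed:", err)

# Alternative: coerce anything unparseable to NaN.
as_numbers = pd.to_numeric(stripped, errors="coerce")
print(as_numbers)
```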

Below is another method, one that replaces whole entries and assigns them a missing value. (This will set us up for the next issue.)

In [47]:
chp.item_price = chp.item_price.replace(to_replace=["gift card"], value=[np.nan])
# So let's walk through what this does: it takes the column, then uses the replace
# method with to_replace = ["what we want to replace"], then the value
# that we want to replace it with. We are going to use the numpy NaN value,
# which the dataframe will properly recognize as not a number.
# We assign the result back to the column rather than using inplace=True on
# chp.item_price, which newer versions of pandas may not apply to chp itself.

# Note this could be a huge pain if there were differing random
# strings floating around.

chp.item_price.unique() # similar, but just reports the unique occurrences
chp.item_price.astype?

In [49]:
chp.item_price = chp.item_price.astype(float)
# Now convert it to a floating point number.

print("Variable dtypes:\n", chp.dtypes, sep='')

Variable dtypes:
order_store_id          int32
quantity                int64
item_name              object
choice_description     object
item_price            float64
dtype: object


### Important Comment

Unlike the string methods we described earlier, this use of `replace` affects **complete entries**, not **elements of string entries**. For example, suppose we tried to use `replace` to get rid of the dollar signs. It would not work, because `replace` looks for entries that consist of nothing but `$`.
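
A small illustration of the difference, using made-up entries rather than the actual column:

```python
import pandas as pd

prices = pd.Series(["$2.39", "$", "gift card"])   # made-up entries

# Series.replace matches whole entries: only the lone "$" is changed.
whole = prices.replace("$", "")

# Series.str.replace works inside each string: every "$" is removed.
inside = prices.str.replace("$", "", regex=False)

print(whole.tolist())
print(inside.tolist())
```

Only the `.str` version strips the dollar sign from `"$2.39"`; plain `replace` leaves that entry untouched.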

---

### Issue #4: Missing values are marked erratically.

It's important to label missing values, so that Pandas doesn't interpret entries as strings. Pandas is also smart enough to ignore things labeled missing when it does calculations or graphs. If we compute, for example, the mean of a variable, the default is to ignore missing values.

We've seen that we can label certain entries as missing values in read statements: `read_csv`, `read_excel`, and so on. Moreover, in the operations above, we showed how to take entries that were hard to make sense of and mark them as missing values using the `replace` method and `np.nan`.
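
As a reminder of the read-statement route, here is a minimal sketch with a made-up CSV (not the Chipotle file) in which the text `gift card` stands in for a missing price. The `na_values` argument of `read_csv` lists extra strings to treat as missing.

```python
import io
import pandas as pd

# Made-up CSV in which missing prices are marked with the text "gift card".
toy_csv = "order,price\n1,2.39\n2,gift card\n3,8.75\n"

# na_values lists extra strings that should be read in as NaN.
toy = pd.read_csv(io.StringIO(toy_csv), na_values=["gift card"])

print(toy)
print(toy.dtypes)   # price is numeric because the text entry became NaN
```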

**Working with missing values** Here are some operations we can do...

In [48]:
chp.order_store_id[chp.item_price.isnull()]
# These are the order numbers with null values

448    195
449    195
450    195
451    195
452    195
453    195
454    195
455    195
Name: order_store_id, dtype: int32

The next method of use is `.dropna`. One thing to note is that Pandas automatically drops missing values when it computes things or plots. So here is an example: the mean with the NaNs dropped first and the mean with them left in. They are the same.

In [17]:
print(chp.item_price.dropna().mean())
print(chp.item_price.mean())

7.454735772357705
7.454735772357705
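
The two numbers agree because `mean` skips missing values by default. A short sketch on made-up values shows what happens when that default is switched off with the `skipna` argument:

```python
import numpy as np
import pandas as pd

prices = pd.Series([2.0, 4.0, np.nan])     # made-up values

print(prices.mean())               # NaN is skipped by default
print(prices.mean(skipna=False))   # NaN is kept, so the result is NaN
print(len(prices.dropna()))        # number of entries left after dropping NaN
```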


-----

### Some Analysis

Now that we have our data set clean, let's just do a couple of things to check it out.


In [37]:
has_guac = chp[chp.item_name == "Chicken Burrito"].choice_description

has_guac = pd.DataFrame(has_guac)

list(has_guac.loc[16])

#chp[chp.item_name == "Chicken Burrito"][has_guac].item_price.mean() 

['[Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Cheese, Sour Cream]]']
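
One way the guacamole question might be pursued is with the `contains` string method followed by `groupby`. The sketch below uses a made-up dataframe `toy` whose descriptions only mimic the format of `choice_description`; the prices are invented, so the output says nothing about the real data.

```python
import pandas as pd

# Made-up orders; descriptions mimic the format of choice_description.
toy = pd.DataFrame({
    "item_name": ["Chicken Burrito", "Chicken Burrito", "Chicken Bowl", "Side of Chips"],
    "choice_description": ["[Fresh Tomato Salsa, [Rice, Guacamole]]",
                           "[Fresh Tomato Salsa, [Rice, Cheese]]",
                           "[Fresh Tomato Salsa, [Guacamole]]",
                           None],
    "item_price": [11.25, 8.75, 11.25, 1.69],
})

# True where the description mentions guacamole; missing descriptions count as False.
toy["has_guac"] = toy.choice_description.str.contains("Guacamole", na=False)

# Average price with and without guacamole.
avg_price = toy.groupby("has_guac").item_price.mean()
print(avg_price)
```

Passing `na=False` matters here: without it, rows with a missing description would produce a missing value instead of `False`.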

### Summary
We learned how to clean data while dealing with several key issues: (i) too much data, (ii) rows, columns, or specific entries with contaminated data, (iii) contaminated numerical values, and (iv) missing values. Then we quickly analyzed the Chipotle data and practiced the `groupby` command and `contains` string method. Great work!

- **For practice:** What if you did the same analysis on the whole data set? Is this as easy as simply changing `nrows = 500` and running it again? Why or why not?