# Week 6 in class

In [1]:
import numpy as np
import pandas as pd

## Cleaning Data

For many data projects, a [significant proportion of
time](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#74d447456f63)
is spent collecting and cleaning the data — not performing the analysis.

This non-analysis work is often called “data cleaning”.

pandas provides very powerful data cleaning tools, which we
will demonstrate using the following dataset.

In [2]:
df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
                   "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
                   "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"]})
df

Unnamed: 0,numbers,nums,colors
0,#23,23,green
1,#24,24,red
2,#18,18,yellow
3,#14,14,orange
4,#12,,purple
5,#10,XYZ,blue
6,#35,35,pink


What would happen if we wanted to try and compute the mean of
`numbers`?

In [3]:
df["numbers"].mean()

TypeError: Could not convert #23#24#18#14#12#10#35 to numeric

It throws an error!

Can you figure out why? Hint: As always, when looking at error messages, start at the very
bottom.

The final error says, `TypeError: Could not convert #23#24... to numeric`.
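
The underlying reason is that the column holds strings, not numbers. A minimal sketch on a small made-up Series (the name `s` is only for illustration) shows that pandas stores such values with an object dtype, and that summing them concatenates the strings. That is where the long `#23#24...` text in the error message comes from.

```python
import pandas as pd

# Hypothetical stand-in for the first entries of df["numbers"]
s = pd.Series(["#23", "#24", "#18"], dtype=object)

print(s.dtype)  # object: the entries are generic Python objects (here, strings)

# Adding strings concatenates them, so the "sum" is one long string ...
total = s.sum()
print(total)

# ... which cannot be divided by the number of rows to obtain a mean.
```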

---------

**Exercise 1**

Use the method `replace` to convert the string below into a number.

In [None]:
c2n = "#39"
c2n = int(c2n.replace('#', ''))  # strip the "#" and convert the remaining digits to an integer
c2n

### String Methods

One way to make this change to every element of a column would be to loop through all elements of the column and apply the desired string methods… One significantly faster (and easier) method is to apply a string method to an entire column of data.

Most methods that are available to a Python string are also available to a pandas Series that has `dtype` object. We access them by doing `s.str.method_name` where `method_name` is the name of the method. When we apply the method to a Series `s`, it is applied to all rows in the Series in one shot!

For example, we can check whether the colors contain blue, as below:

In [None]:
df["colors"].str.contains("blue")
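
To make the comparison with an explicit loop concrete, here is a small sketch on a made-up Series (the names `s`, `looped` and `vectorized` are illustrative only). Both approaches give the same values, but the `.str` version needs no loop.

```python
import pandas as pd

s = pd.Series(["green", "red", "blue"], dtype=object)

# Explicit loop: apply the Python string method to each element in turn
looped = pd.Series([x.upper() for x in s], dtype=object)

# Vectorized: the same string method applied to the whole Series at once
vectorized = s.str.upper()

print(looped.tolist())
print(vectorized.tolist())
```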

**Exercise 2**

Make a new column called `numbers_str` that contains the elements of
`numbers` but without `"#"`. Afterwards, show the data types in the DataFrame.

In [4]:
df['numbers_str'] = df['numbers'].str.replace("#", '')
df.head()

Unnamed: 0,numbers,nums,colors,numbers_str
0,#23,23.0,green,23
1,#24,24.0,red,24
2,#18,18.0,yellow,18
3,#14,14.0,orange,14
4,#12,,purple,12


### Type Conversions

The `dtype` of the `numbers_str` column shows that pandas still treats it as a string even after we have removed the `"#"`.

We need to convert this column to numbers. The best way to do this is using the `pd.to_numeric` function.

This function attempts to convert whatever is stored in a Series into numeric values. For example, once the `"#"` has been removed, the entries of `"numbers_str"` are ready to be converted to actual numbers, as below.

In [5]:
df["numbers_numeric"] = pd.to_numeric(df["numbers_str"])
df.dtypes

numbers            object
nums               object
colors             object
numbers_str        object
numbers_numeric     int64
dtype: object

We can convert to other types as well. Using the `astype` method, we can convert to any of the supported pandas `dtypes`. For example, we can convert our new variable from integers to floats, as below.

In [6]:
df["numbers_numeric"] = df["numbers_numeric"].astype(float)
df.dtypes

numbers             object
nums                object
colors              object
numbers_str         object
numbers_numeric    float64
dtype: object

**Exercise 3**

Convert the column `"nums"` to a numeric type and save it to the DataFrame as `"nums_tonumeric"`.

*Hint:* Notice that there is a missing value, and a value that is not a number. Look at the documentation for `pd.to_numeric` and think about how to overcome this.

Why could your solution be a bad idea if used without knowing what your data looks like? 

*Hint:* Think about what happens when you apply it to the `"numbers"` column before replacing the `"#"`.

In [7]:
# Replace the non-numeric entry and the missing value with "0" so that every entry can be parsed
df['nums_str'] = df['nums'].str.replace("XYZ", '0').replace(np.nan, '0')
df["nums_tonumeric"] = pd.to_numeric(df["nums_str"])
df

Unnamed: 0,numbers,nums,colors,numbers_str,numbers_numeric,nums_str,nums_tonumeric
0,#23,23,green,23,23.0,23,23
1,#24,24,red,24,24.0,24,24
2,#18,18,yellow,18,18.0,18,18
3,#14,14,orange,14,14.0,14,14
4,#12,,purple,12,12.0,0,0
5,#10,XYZ,blue,10,10.0,0,0
6,#35,35,pink,35,35.0,35,35


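**Answer**

Replacing missing or non-numeric entries with 0 can be a bad idea because the zeros are then treated as real observations. If the column recorded, say, the number of drinks per day, the zeros would pull the mean down and distort other statistics. Unless you know what the invalid entries mean, it is usually safer to leave them as missing or to drop those rows.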
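
One option described in the `pd.to_numeric` documentation is `errors="coerce"`, which turns anything that cannot be parsed into `NaN` instead of raising an error. The sketch below uses made-up Series that mirror the two columns (the variable names are illustrative). It also illustrates the risk raised in the hint: applied to values that still carry `"#"`, it silently turns every entry into `NaN`.

```python
import numpy as np
import pandas as pd

nums = pd.Series(["23", "24", np.nan, "XYZ", "35"], dtype=object)
numbers = pd.Series(["#23", "#24", "#18"], dtype=object)

# Unparseable entries ("XYZ") become NaN; the existing NaN stays NaN
nums_coerced = pd.to_numeric(nums, errors="coerce")
print(nums_coerced)

# With the "#" still attached nothing can be parsed,
# so the whole column is lost without any error being raised
numbers_coerced = pd.to_numeric(numbers, errors="coerce")
print(numbers_coerced.isna().all())
```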

**Exercise 4**

Convert the column `"numbers_numeric"` back into integers.

In [8]:
df["numbers_numeric"] = df["numbers_numeric"].astype(int)
df.dtypes

numbers            object
nums               object
colors             object
numbers_str        object
numbers_numeric     int64
nums_str           object
nums_tonumeric      int64
dtype: object

### Missing data

**Exercise 5**

Looking at the other variables, you notice that the missing item should be 12. Replace the missing item with 12.

In [9]:
df['nums'] = df['nums'].fillna('12')  # fill the missing entry in row 4; "XYZ" in row 5 is left unchanged
df

Unnamed: 0,numbers,nums,colors,numbers_str,numbers_numeric,nums_str,nums_tonumeric
0,#23,23,green,23,23,23,23
1,#24,24,red,24,24,24,24
2,#18,18,yellow,18,18,18,18
3,#14,14,orange,14,14,14,14
4,#12,12,purple,12,12,0,0
5,#10,XYZ,blue,10,10,0,0
6,#35,35,pink,35,35,35,35


### Boolean selection

**Exercise 6**

Often you need to select data based on conditions met by the data itself. 

Which colors remain if you only include data for which the numbers are above 18?

In [10]:
x = df[df['numbers_numeric']>18]
x['colors']

0    green
1      red
6     pink
Name: colors, dtype: object
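
The selection works in two steps: the comparison produces a Series of `True`/`False` values, and indexing with that Series keeps only the `True` rows. A minimal sketch on a made-up DataFrame (the names `small` and `mask` are illustrative) also shows `.loc`, which selects rows and a column in one step.

```python
import pandas as pd

small = pd.DataFrame({"numbers_numeric": [23, 24, 18, 14],
                      "colors": ["green", "red", "yellow", "orange"]})

# Step 1: the comparison returns one boolean per row
mask = small["numbers_numeric"] > 18
print(mask.tolist())

# Step 2: .loc keeps the rows where the mask is True and picks the "colors" column
selected = small.loc[mask, "colors"]
print(selected.tolist())
```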

### Case study

Recall the Chipotle data from this week's homework.

In [11]:
chipotle = pd.read_csv("chipotle_raw.csv")
chipotle.head(20)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


**Exercise 7**

We'd like you to use this data to answer the following questions.

- What is the average price of an item with chicken?  
- What is the average price of an item with steak?  
- Did chicken or steak produce more revenue (total)? 

*Hint:* You may need to use one of the string methods shown above, and don't forget to use the variable `quantity`.

In [38]:
# .copy() creates an independent DataFrame, which avoids pandas' SettingWithCopyWarning
chipotle_chicken = chipotle[chipotle["item_name"].str.contains("Chicken")].copy()
# regex=False treats "$" as a literal character rather than as a regular-expression anchor
chipotle_chicken['no$_price'] = chipotle_chicken['item_price'].str.replace("$", '', regex=False)
chipotle_chicken.head()


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,no$_price
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98,16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98,10.98
11,6,1,Chicken Crispy Tacos,"[Roasted Chili Corn Salsa, [Fajita Vegetables,...",$8.75,8.75
12,6,1,Chicken Soft Tacos,"[Roasted Chili Corn Salsa, [Rice, Black Beans,...",$8.75,8.75
13,7,1,Chicken Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$11.25,11.25
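
One possible way to carry the exercise through is sketched below on a tiny made-up table with the same columns (the rows, the helper `summarize` and the column `price` are illustrative, not part of the homework data). The sketch assumes that `item_price` is the total for the row, so the price per item is `item_price / quantity` and revenue is the sum of `item_price`. If `item_price` were a unit price instead, revenue would be `price * quantity`.

```python
import pandas as pd

# Made-up rows in the same layout as chipotle_raw.csv
orders = pd.DataFrame({
    "quantity": [2, 1, 1, 1],
    "item_name": ["Chicken Bowl", "Chicken Soft Tacos", "Steak Burrito", "Side of Chips"],
    "item_price": ["$16.00", "$9.00", "$11.00", "$2.00"],
})

# Strip the literal "$" and convert the remaining text to floats
orders["price"] = orders["item_price"].str.replace("$", "", regex=False).astype(float)

def summarize(df, keyword):
    """Return (average price per item, total revenue) for items whose name contains keyword."""
    subset = df[df["item_name"].str.contains(keyword)]
    # Assumption: price is the row total, so divide by quantity for a per-item price
    avg_price = subset["price"].sum() / subset["quantity"].sum()
    revenue = subset["price"].sum()
    return avg_price, revenue

chicken_avg, chicken_revenue = summarize(orders, "Chicken")
steak_avg, steak_revenue = summarize(orders, "Steak")
print(chicken_avg, chicken_revenue)
print(steak_avg, steak_revenue)
```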


## Writing DataFrames

Let’s now talk about saving a DataFrame to a file.

As a general rule of thumb, if we have a DataFrame `df` and we would like to save it as a file of type `FOO`, then we would call the method named `df.to_FOO(...)`.

We will show you how this can be done and try to highlight some issues. But, we will not cover all possible options and features — we feel it is best to learn these as you need them by consulting the appropriate documentation.

Let’s show `df.to_csv` as an example.

Without any additional arguments, the `df.to_csv` function will return a string containing the csv form of the DataFrame:

In [None]:
print(chipotle.head().to_csv())

If we do pass an argument, the first argument will be used as the file name. By default, it ends up in the same folder as your notebook (but you can also specify another path).

In [None]:
chipotle.to_csv("chipotle.csv")

You can see above and in the file that the CSV output contains the index, which will appear as an extra column once you read the CSV file back into pandas. You can prevent this with `index=False`, as below.

In [None]:
chipotle.to_csv("chipotle.csv", index=False)
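
The effect of the index can be checked without touching the disk by writing to a string and reading it back. The sketch below uses a small made-up DataFrame and `io.StringIO` (both are illustrative choices, not part of the original notebook).

```python
import io
import pandas as pd

small = pd.DataFrame({"item_name": ["Izze", "Chicken Bowl"], "quantity": [1, 2]})

# Default: the index is written as an unnamed first column ...
with_index = pd.read_csv(io.StringIO(small.to_csv()))
print(with_index.columns.tolist())

# ... while index=False writes only the actual columns
without_index = pd.read_csv(io.StringIO(small.to_csv(index=False)))
print(without_index.columns.tolist())
```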

**Exercise 8**

Analogous to the above, export the `chipotle` DataFrame as an Excel file `chipotle.xlsx`.

In [None]:
chipotle.to_excel("chipotle.xlsx")

## Aggregations and `datetime`

Below we will use unemployment data by US state at a monthly frequency, as contained in the file `state_unemployment.csv`. The pandas `read_csv` function will determine most datatypes of the underlying columns. The exception here is that we need to give pandas a hint so it can load up the `Date` column as a Python datetime type, as below.

In [39]:
unemp_raw = pd.read_csv("state_unemployment.csv", parse_dates=["Date"])
print(unemp_raw.dtypes)
unemp_raw.head()

Date                datetime64[ns]
state                       object
LaborForce                 float64
UnemploymentRate           float64
dtype: object


Unnamed: 0,Date,state,LaborForce,UnemploymentRate
0,2000-01-01,Alabama,2142945.0,4.7
1,2000-01-01,Alaska,319059.0,6.3
2,2000-01-01,Arizona,2499980.0,4.1
3,2000-01-01,Arkansas,1264619.0,4.4
4,2000-01-01,California,16680246.0,5.0


**Exercise 9**

One of the reasons people are concerned about business cycles is that unemployment can increase a lot during recessions. 

One way to look at unemployment changes is to study the variance of unemployment over time. Which states are relatively volatile? Compute the variance of unemployment for each state.

*Hint:* Use `var` and `groupby`.

In [40]:
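# Select the numeric columns explicitly so that `var` is not applied to the `Date` column
vars_of_states = unemp_raw.groupby('state')[["LaborForce", "UnemploymentRate"]].var()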
vars_of_states.head()

Unnamed: 0_level_0,LaborForce,UnemploymentRate
state,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama,777524000.0,4.548286
Alaska,258795000.0,0.257506
Arizona,47295590000.0,4.179021
Arkansas,1568688000.0,1.957209
California,510388600000.0,6.039162


**Exercise 10**

Instead of using a built-in aggregation like `var`, it is also possible to write your own aggregation, which you can call with `agg`, as you have seen in DataCamp.

Create a function `high_or_low` that takes a pandas Series as argument. The function should return `"High"` if the variance of the series is equal to or above 2.5, and should return `"Low"` if the variance of the series is below 2.5.

Apply your function to find out which states have volatile unemployment, and which don't.
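
A minimal sketch of such a custom aggregation, shown on made-up data rather than the real file (the DataFrame `toy` and its values are hypothetical). The function returns its label so that `agg` can collect one value per state.

```python
import pandas as pd

def high_or_low(s):
    """Label a Series "High" if its variance is at least 2.5, otherwise "Low"."""
    return "High" if s.var() >= 2.5 else "Low"

# Made-up unemployment rates for two hypothetical states
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "UnemploymentRate": [3.0, 6.0, 9.0, 4.0, 4.1, 4.2],
})

# agg calls high_or_low once per state, passing that state's rates as a Series
labels = toy.groupby("state")["UnemploymentRate"].agg(high_or_low)
print(labels)
```

On the real data, the same call would presumably be `unemp_raw.groupby("state")["UnemploymentRate"].agg(high_or_low)`.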

**Exercise 11**

Create a DataFrame `unemp_all` containing the unemployment rates, with the US states as columns and `Date` as index.

In [43]:
# Passing a single column name (not a list) to `values` keeps the columns a plain index of state names
unemp_all = unemp_raw.pivot(index='Date', columns='state', values='UnemploymentRate')
unemp_all.head()

Date,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,Florida,Georgia,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
2000-01-01,4.7,6.3,4.1,4.4,5.0,2.8,2.8,3.5,3.7,3.7,...,2.4,3.7,4.6,3.1,2.7,2.6,4.9,5.8,3.2,4.1
2000-02-01,4.7,6.3,4.1,4.3,5.0,2.8,2.7,3.6,3.7,3.6,...,2.4,3.7,4.6,3.1,2.6,2.5,4.9,5.6,3.2,3.9
2000-03-01,4.6,6.3,4.0,4.3,5.0,2.7,2.6,3.6,3.7,3.6,...,2.4,3.8,4.5,3.1,2.6,2.4,5.0,5.5,3.3,3.9
2000-04-01,4.6,6.3,4.0,4.3,5.1,2.7,2.5,3.7,3.7,3.7,...,2.4,3.8,4.4,3.1,2.7,2.4,5.0,5.4,3.4,3.8
2000-05-01,4.5,6.3,4.0,4.2,5.1,2.7,2.4,3.7,3.7,3.7,...,2.4,3.9,4.3,3.2,2.7,2.3,5.1,5.4,3.5,3.8


**Exercise 12**

One of the advantages of a `DatetimeIndex` is that it is easy to slice.

Consider the list of states below. Create a DataFrame `unemp` that contains unemployment rates for only those states between January 2006 and December 2015 (inclusive).

In [49]:
states = ["Arizona", "California", "Florida", "Illinois",
    "Michigan", "New York", "Texas"]
# Slice the rows with start:stop inside .loc (both endpoints are included) and keep only the listed states
unemp = unemp_all.loc["2006-01-01":"2015-12-01", states]
unemp.head()
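
The slicing idea can be tried on a small made-up DataFrame with a monthly `DatetimeIndex` (the name `toy` and its values are illustrative). With `.loc`, a `start:stop` slice of date strings includes both endpoints.

```python
import pandas as pd

# First day of six consecutive months, starting in November 2005
dates = pd.date_range("2005-11-01", periods=6, freq="MS")
toy = pd.DataFrame({"Texas": [5.1, 5.0, 4.9, 4.9, 4.8, 4.8],
                    "Ohio": [5.9, 5.8, 5.6, 5.5, 5.4, 5.3]}, index=dates)

# Rows from January 2006 through March 2006 (both ends included), one selected column
subset = toy.loc["2006-01-01":"2006-03-01", ["Texas"]]
print(subset)
```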

### Transforms

Many analytical operations do not necessarily involve an aggregation. The output of a function applied to a Series might need to be a new Series.

For example,
- Compute the difference in unemployment from month to month (`diff`).
- Compute the percentage change in unemployment from month to month (`pct_change`).
- Calculate the cumulative sum of elements in each column (`cumsum`).

As usual, tab completion is helpful when trying to find such functions.

As an example of the use of transforms, the code below shows which state had the largest percentage increase in unemployment. Try to understand how this code arrives at this answer.

In [None]:
unemp.pct_change().max().idxmax()
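
Breaking the chain into its steps on a tiny made-up DataFrame can help (the names and values are illustrative). `pct_change` gives the relative change from one row to the next in each column, `max` picks the largest change per column, and `idxmax` returns the column label where that maximum is largest.

```python
import pandas as pd

toy = pd.DataFrame({"A": [4.0, 5.0, 5.0], "B": [4.0, 4.0, 6.0]})

changes = toy.pct_change()   # first row is NaN because there is no previous row
largest = changes.max()      # largest relative increase per column (NaN is skipped)
winner = largest.idxmax()    # label of the column with the biggest value

print(changes)
print(largest)
print(winner)
```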

**Exercise 13**

The DataFrame `unemp` starts at the height of the boom before the Great Recession and ends 10 years later.

Which state had the smallest increase (or largest decrease) in unemployment over this period?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()