# Week 6 in class

In [1]:
import numpy as np
import pandas as pd

## Cleaning Data

For many data projects, a [significant proportion of
time](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#74d447456f63)
is spent collecting and cleaning the data — not performing the analysis.

This non-analysis work is often called “data cleaning”.

pandas provides very powerful data cleaning tools, which we
will demonstrate using the following dataset.

In [2]:
df = pd.DataFrame({"numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
                   "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
                   "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"]})
df

| | numbers | nums | colors |
|---|---|---|---|
| 0 | #23 | 23 | green |
| 1 | #24 | 24 | red |
| 2 | #18 | 18 | yellow |
| 3 | #14 | 14 | orange |
| 4 | #12 | NaN | purple |
| 5 | #10 | XYZ | blue |
| 6 | #35 | 35 | pink |


What would happen if we wanted to try and compute the mean of
`numbers`?

In [3]:
df["numbers"].mean()

TypeError: Could not convert #23#24#18#14#12#10#35 to numeric

It throws an error!

Can you figure out why? Hint: As always, when looking at error messages, start at the very
bottom.

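The final error says, `TypeError: Could not convert #23#24... to numeric`. The `numbers` column holds strings, so pandas cannot treat its values as numbers until the `"#"` is removed and the strings are converted.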

---------

**Exercise 1**

Use the method `replace` to convert the string below into a number.

In [None]:
c2n = "#39"
# Strip the "#" and convert the remaining digits to an integer
c2n = int(c2n.replace('#', ''))
c2n

### String Methods

One way to make this change to every element of a column would be to loop through all elements of the column and apply the desired string methods… One significantly faster (and easier) method is to apply a string method to an entire column of data.

Most methods that are available to a Python string are also available to a pandas Series that has `dtype` object. We access them by doing `s.str.method_name` where `method_name` is the name of the method. When we apply the method to a Series `s`, it is applied to all rows in the Series in one shot!

For example, we can check whether the colors contain blue, as below:

In [None]:
df["colors"].str.contains("blue")
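A few other string methods work the same way through the `.str` accessor. The sketch below uses a small made-up Series (not the class dataset) to show that each call is applied to every element at once.

```python
import pandas as pd

# Toy Series, made up for illustration only.
colors = pd.Series(["green", "red", "yellow", "blue"])

upper = colors.str.upper()                   # upper-case every element
lengths = colors.str.len()                   # number of characters per element
starts_with_y = colors.str.startswith("y")   # boolean Series, one entry per row

print(upper.tolist())
print(lengths.tolist())
print(starts_with_y.tolist())
```

Each result is a new Series with the same index as `colors`, so it can be stored as a new column or used directly as a boolean mask.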

**Exercise 2**

Make a new column called `numbers_str` that contains the elements of
`numbers` but without `"#"`. Afterwards, show the data types in the DataFrame.

In [None]:
# regex=False makes it explicit that "#" is matched as a literal character
df['numbers_str'] = df['numbers'].str.replace("#", '', regex=False)
df.dtypes

### Type Conversions

The `dtype` of the `numbers_str` column shows that pandas still treats it as a string even after we have removed the `"#"`.

We need to convert this column to numbers. The best way to do this is using the `pd.to_numeric` function.

This function attempts to convert whatever is stored in a Series into numeric values. For example, once the `"#"` has been removed, the values in the `numbers_str` column are ready to be converted to actual numbers, as below.

In [None]:
df["numbers_numeric"] = pd.to_numeric(df["numbers_str"])
df.dtypes

We can convert to other types as well. Using the `astype` method, we can convert to any of the supported pandas `dtypes`. For example, we can convert our new variable from integers to floats, as below.

In [None]:
df["numbers_numeric"] = df["numbers_numeric"].astype(float)
df.dtypes

**Exercise 3**

Convert the column `"nums"` to a numeric type and save it to the DataFrame as `"nums_tonumeric"`.

*Hint:* Notice that there is a missing value, and a value that is not a number. Look at the documentation for `pd.to_numeric` and think about how to overcome this.

Why could your solution be a bad idea if used without knowing what your data looks like? 

*Hint:* Think about what happens when you apply it to the `"numbers"` column before replacing the `"#"`.

In [None]:
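# Replace the non-numeric entry and the missing value with "0", then convert
df['nums_str'] = df['nums'].str.replace("XYZ", '0', regex=False).fillna('0')
df["nums_tonumeric"] = pd.to_numeric(df["nums_str"])
df

#Answer
Replacing unknown values with 0 can be a bad idea because the zeros are then treated as real observations. If the column recorded, say, the number of drinks per day, the zeros would pull the mean down. Unless the correct values are known, it is usually safer to leave such entries missing or to drop those rows.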
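The hint for Exercise 3 points to the documentation of `pd.to_numeric`. One option described there is `errors="coerce"`, which turns every value that cannot be parsed into `NaN`. The sketch below, on made-up Series that mimic the `nums` and `numbers` columns, shows both the convenience and the risk.

```python
import numpy as np
import pandas as pd

# Toy data mimicking the "nums" column (one missing value, one non-number).
nums = pd.Series(["23", "24", np.nan, "XYZ", "35"])

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(nums, errors="coerce")
print(coerced)
print(coerced.mean())  # NaN entries are skipped when computing the mean

# The risk: on a column like "numbers" (before removing "#"),
# every single value is silently turned into NaN.
numbers = pd.Series(["#23", "#24", "#18"])
all_missing = pd.to_numeric(numbers, errors="coerce")
print(all_missing.isna().all())
```

Because no error is raised, it is worth counting the missing values before and after coercing to see how much data was lost.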

**Exercise 4**

Convert the column `"numbers_numeric"` back into integers.

In [None]:
df["numbers_numeric"] = df["numbers_numeric"].astype(int)
df.dtypes

### Missing data

**Exercise 5**

Looking at the other variables, you notice that the missing item should be 12. Replace the missing item with 12.

In [None]:
# The missing item is the NaN entry (row 4, where `numbers` is "#12"), so fill it with "12"
df['nums'] = df['nums'].fillna('12')
df
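Before filling in a value, it often helps to check where the missing entries are. The following is a minimal sketch on a made-up DataFrame (the name `toy` is hypothetical) using `isna`, `fillna` and `dropna`.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with one missing entry, for illustration only.
toy = pd.DataFrame({"nums": ["23", np.nan, "35"],
                    "colors": ["green", "purple", "pink"]})

n_missing = toy["nums"].isna().sum()    # count missing entries
filled = toy["nums"].fillna("12")       # replace missing entries with a known value
dropped = toy.dropna(subset=["nums"])   # or drop the rows where "nums" is missing

print(n_missing)
print(filled.tolist())
print(dropped["colors"].tolist())
```

Filling is appropriate when the true value is known, as in the exercise; dropping is the more cautious choice when it is not.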

### Boolean selection

**Exercise 6**

Often you need to select data based on conditions met by the data itself. 

Which colors remain if you only include data for which the numbers are above 18?

In [None]:
x = df[df['numbers_numeric']>18]
x['colors']
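Conditions can also be combined. The sketch below, on a made-up DataFrame, assumes the usual pandas convention of `&` for "and" and `|` for "or", with each condition wrapped in parentheses.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({"value": [23, 24, 18, 14, 35],
                    "colors": ["green", "red", "yellow", "orange", "pink"]})

# Both conditions must hold: value above 18 AND color is not red.
mask = (toy["value"] > 18) & (toy["colors"] != "red")
selected = toy.loc[mask, "colors"]

# At least one condition must hold: value below 15 OR color is pink.
either = toy.loc[(toy["value"] < 15) | (toy["colors"] == "pink"), "colors"]

print(selected.tolist())
print(either.tolist())
```

Using `.loc[mask, column]` selects rows and a column in one step, which avoids creating an intermediate DataFrame.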

### Case study

Recall the Chipotle data from this week's homework.

In [None]:
chipotle = pd.read_csv("chipotle_raw.csv")
chipotle.head(20)

**Exercise 7**

We'd like you to use this data to answer the following questions.

- What is the average price of an item with chicken?  
- What is the average price of an item with steak?  
- Did chicken or steak produce more revenue (total)? 

*Hint:* You may need to use one of the string methods shown above, and don't forget to use the variable `quantity`.

In [None]:
#Answer 1
# .copy() gives an independent DataFrame, so adding columns does not trigger a SettingWithCopyWarning
chipotle_chicken = chipotle[chipotle["item_name"].str.contains("Chicken")].copy()
# regex=False makes it explicit that "$" is matched as a literal character
chipotle_chicken['no$_price'] = chipotle_chicken['item_price'].str.replace("$", '', regex=False)
chipotle_chicken["no$_price"] = chipotle_chicken["no$_price"].astype(float)
print(chipotle_chicken['no$_price'].mean())
chipotle_chicken.head()
#Answer 2
chipotle_steak = chipotle[chipotle["item_name"].str.contains("Steak")].copy()
chipotle_steak['no$_price_s'] = chipotle_steak['item_price'].str.replace("$", '', regex=False)
chipotle_steak["no$_price_s"] = chipotle_steak["no$_price_s"].astype(float)
print(chipotle_steak['no$_price_s'].mean())
#Answer 3
chipotle_chicken['total_spent'] = chipotle_chicken['quantity'] * chipotle_chicken['no$_price']
chipotle_steak['total_spent_s'] = chipotle_steak['quantity'] * chipotle_steak['no$_price_s']

In [None]:
chipotle_chicken
print('The mean of chicken dishes is: ' + str(chipotle_chicken['no$_price'].mean()))
print('The mean of steak dishes is: ' + str(chipotle_steak['no$_price_s'].mean()))
print('The revenue of chicken dishes is: ' + str(chipotle_chicken['total_spent'].sum()))
print('The revenue of steak dishes is: ' + str(chipotle_steak['total_spent_s'].sum()))

In [None]:
#Alternative -> A cleaner way of doing it:
# regex=False makes it explicit that "$" is matched as a literal character
chipotle['price_numeric'] = pd.to_numeric(chipotle['item_price'].str.replace('$', '', regex=False))
chipotle['revenue'] = chipotle['quantity'] * chipotle['price_numeric']
chipotle.head()

#Answer
# na=False counts a missing description as "does not contain", so the mask stays strictly boolean
chicken = chipotle[chipotle['choice_description'].str.contains('Chicken', na=False) | chipotle['item_name'].str.contains('Chicken', na=False)]
print(chicken['price_numeric'].mean())
steak = chipotle[chipotle['choice_description'].str.contains('Steak', na=False) | chipotle['item_name'].str.contains('Steak', na=False)]
print(steak['price_numeric'].mean())
print(chicken['revenue'].sum())
print(steak['revenue'].sum())


## Writing DataFrames

Let’s now talk about saving a DataFrame to a file.

As a general rule of thumb, if we have a DataFrame `df` and we would like to save it as a file of type `FOO`, then we would call the method named `df.to_FOO(...)`.

We will show you how this can be done and try to highlight some issues. But, we will not cover all possible options and features — we feel it is best to learn these as you need them by consulting the appropriate documentation.

Let’s show `df.to_csv` as an example.

Without any arguments, the `df.to_csv` method returns a string containing the CSV form of the DataFrame:

In [None]:
print(chipotle.head().to_csv())

If we do pass an argument, the first argument will be used as the file name. By default, it ends up in the same folder as your notebook (but you can also specify another path).

In [None]:
chipotle.to_csv("chipotle.csv")

You can see above and in the file that the CSV output contains the index, which will appear as an extra column once you read the CSV file back into pandas. You can prevent this with `index=False`, as below.

In [None]:
chipotle.to_csv("chipotle.csv", index = False)
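The effect of `index=False` can be checked without touching the disk. The sketch below is an illustration on a made-up DataFrame: it writes the CSV text to a string and reads it back through `io.StringIO`.

```python
import io
import pandas as pd

# Toy DataFrame, for illustration only.
toy = pd.DataFrame({"item": ["Chicken Bowl", "Steak Burrito"],
                    "quantity": [2, 1]})

# Round trip with the index written out (the default).
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Round trip with the index left out.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

In the first round trip the old index comes back as a column named `Unnamed: 0`; in the second the columns match the original DataFrame.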

**Exercise 8**

Analogous to above, export the `chipotle` DataFrame as an Excel file `chipotle.xlsx`.

In [None]:
# Writing .xlsx files requires an Excel engine such as openpyxl to be installed
chipotle.to_excel("chipotle.xlsx", index=False)

## Aggregations and `datetime`

Below we will use unemployment data by US state at a monthly frequency, as contained in the file `state_unemployment.csv`. The pandas `read_csv` function will determine most datatypes of the underlying columns. The exception here is that we need to give pandas a hint so it can load up the `Date` column as a Python datetime type, as below.

In [5]:
unemp_raw = pd.read_csv("state_unemployment.csv", parse_dates=["Date"])
print(unemp_raw.dtypes)
unemp_raw.head()

Date                datetime64[ns]
state                       object
LaborForce                 float64
UnemploymentRate           float64
dtype: object


| | Date | state | LaborForce | UnemploymentRate |
|---|---|---|---|---|
| 0 | 2000-01-01 | Alabama | 2142945.0 | 4.7 |
| 1 | 2000-01-01 | Alaska | 319059.0 | 6.3 |
| 2 | 2000-01-01 | Arizona | 2499980.0 | 4.1 |
| 3 | 2000-01-01 | Arkansas | 1264619.0 | 4.4 |
| 4 | 2000-01-01 | California | 16680246.0 | 5.0 |
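To see what the `parse_dates` hint changes, the sketch below reads the same small CSV text twice, once without and once with the hint. The CSV content is made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Illustrative CSV text; not the real state_unemployment.csv.
csv_text = (
    "Date,state,UnemploymentRate\n"
    "2000-01-01,Alabama,4.7\n"
    "2000-02-01,Alabama,4.6\n"
)

plain = pd.read_csv(io.StringIO(csv_text))                         # Date stays text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])  # Date becomes datetime

print(plain["Date"].dtype)
print(parsed["Date"].dtype)

# Datetime columns expose date parts through the .dt accessor.
print(parsed["Date"].dt.month.tolist())
```

Only the parsed version supports date-based operations such as extracting the month or slicing by year.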


**Exercise 9**

One of the reasons people are concerned about business cycles is that unemployment can increase a lot during recessions. 

One way to look at unemployment changes is to study the variance of unemployment over time. Which states are relatively volatile? Compute the variance of unemployment for each state.

*Hint:* Use `var` and `groupby`.

In [6]:
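# numeric_only=True restricts the variance to numeric columns (the Date column is skipped)
vars_of_states = unemp_raw.groupby('state').var(numeric_only=True)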
vars_of_states.head()

| state | LaborForce | UnemploymentRate |
|---|---|---|
| Alabama | 777524000.0 | 4.548286 |
| Alaska | 258795000.0 | 0.257506 |
| Arizona | 47295590000.0 | 4.179021 |
| Arkansas | 1568688000.0 | 1.957209 |
| California | 510388600000.0 | 6.039162 |


**Exercise 10**

Instead of using a built-in aggregation like `var`, it is also possible to write your own aggregation, which you can call with `agg`, as you have seen in Datacamp.

Create a function `high_or_low` that takes a pandas Series as argument. The function should return `"High"` if the variance of the series is equal to or above 2.5, and should return `"Low"` if the variance of the series is below 2.5.

Apply your function to find out which states have volatile unemployment, and which don't.

In [9]:
def high_or_low(s):
    # Classify a Series by its variance: below 2.5 is 'Low', otherwise 'High'
    if s.var() < 2.5:
        out = 'Low'
    else:
        out = 'High'
    # Return at function level (not inside the else branch) so both cases yield a label
    return out

unemp_raw.groupby('state')['UnemploymentRate'].agg(high_or_low)

state
Alabama           High
Alaska             Low
Arizona           High
Arkansas           Low
California        High
Colorado          High
Connecticut       High
Delaware          High
Florida           High
Georgia           High
Hawaii             Low
Idaho             High
Illinois          High
Indiana           High
Iowa               Low
Kansas             Low
Kentucky          High
Louisiana          Low
Maine             High
Maryland           Low
Massachusetts      Low
Michigan          High
Minnesota          Low
Mississippi       High
Missouri          High
Montana            Low
Nebraska           Low
Nevada            High
New Hampshire      Low
New Mexico         Low
New York           Low
New jersey        High
North Carolina    High
North Dakota       Low
Ohio              High
Oklahoma           Low
Oregon            High
Pennsylvania       Low
Rhode island      High
South Carolina    High
South Dakota       Low
Tennessee         High
Texas              Low
Utah 
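Any function that takes a Series and returns a single value can be passed to `agg`. The sketch below is an illustration on made-up data; `value_range` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy data with two groups, for illustration only.
toy = pd.DataFrame({"state": ["A", "A", "A", "B", "B", "B"],
                    "rate": [4.0, 6.0, 8.0, 5.0, 5.1, 5.2]})

def value_range(s):
    """Hypothetical custom aggregation: spread between largest and smallest value."""
    return s.max() - s.min()

# One custom aggregation per group.
ranges = toy.groupby("state")["rate"].agg(value_range)

# Built-in (by name) and custom aggregations can be combined in a list.
summary = toy.groupby("state")["rate"].agg(["var", value_range])

print(ranges)
print(summary)
```

When a list is passed, the result is a DataFrame with one column per aggregation; the custom function's column is named after the function.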

**Exercise 11**

Create a DataFrame `unemp_all` containing the unemployment rates, with the US states as columns and `Date` as index.

In [18]:
unemp_all = unemp_raw.pivot_table(index='Date', columns='state', values='UnemploymentRate')
unemp_all.tail()

| Date | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | Georgia | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-08-01 | 4.0 | 7.2 | 4.7 | 3.7 | 4.6 | 2.9 | 4.5 | 4.6 | 4.0 | 4.5 | ... | 3.4 | 3.4 | 4.0 | 3.2 | 3.0 | 3.7 | 4.8 | 5.2 | 3.3 | 4.1 |
| 2017-09-01 | 3.9 | 7.2 | 4.7 | 3.7 | 4.5 | 3.0 | 4.5 | 4.5 | 3.9 | 4.5 | ... | 3.4 | 3.3 | 4.0 | 3.2 | 2.9 | 3.6 | 4.7 | 5.3 | 3.3 | 4.1 |
| 2017-10-01 | 3.8 | 7.2 | 4.7 | 3.7 | 4.5 | 3.0 | 4.5 | 4.5 | 3.9 | 4.5 | ... | 3.4 | 3.3 | 3.9 | 3.2 | 2.9 | 3.6 | 4.7 | 5.4 | 3.2 | 4.2 |
| 2017-11-01 | 3.8 | 7.2 | 4.7 | 3.7 | 4.5 | 3.0 | 4.5 | 4.5 | 3.9 | 4.5 | ... | 3.4 | 3.3 | 3.9 | 3.2 | 2.9 | 3.6 | 4.7 | 5.4 | 3.2 | 4.2 |
| 2017-12-01 | 3.8 | 7.2 | 4.7 | 3.7 | 4.5 | 3.0 | 4.5 | 4.5 | 3.9 | 4.5 | ... | 3.4 | 3.3 | 4.0 | 3.2 | 2.9 | 3.6 | 4.7 | 5.4 | 3.2 | 4.1 |


**Exercise 12**

One of the advantages of a `DatetimeIndex` is that it is easy to slice.

Consider the list of states below. Create a DataFrame `unemp` that contains unemployment rates for only those states between January 2006 and December 2015 (inclusive).

In [19]:
states = ["Arizona", "California", "Florida", "Illinois",
    "Michigan", "New York", "Texas"]
unemp = unemp_all.loc["2006-01":"2015-12", states]
unemp

| Date | Arizona | California | Florida | Illinois | Michigan | New York | Texas |
|---|---|---|---|---|---|---|---|
| 2006-01-01 | 4.5 | 5.0 | 3.2 | 5.1 | 6.8 | 4.8 | 5.2 |
| 2006-02-01 | 4.4 | 5.0 | 3.2 | 4.9 | 6.9 | 4.8 | 5.1 |
| 2006-03-01 | 4.4 | 4.9 | 3.1 | 4.8 | 6.9 | 4.7 | 5.1 |
| 2006-04-01 | 4.4 | 4.9 | 3.2 | 4.6 | 7.0 | 4.7 | 5.1 |
| 2006-05-01 | 4.3 | 4.9 | 3.2 | 4.6 | 7.0 | 4.7 | 5.1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-08-01 | 6.0 | 6.0 | 5.3 | 5.9 | 5.2 | 5.0 | 4.4 |
| 2015-09-01 | 5.9 | 5.9 | 5.3 | 5.9 | 5.1 | 5.0 | 4.4 |
| 2015-10-01 | 5.8 | 5.8 | 5.2 | 5.9 | 5.0 | 4.9 | 4.4 |
| 2015-11-01 | 5.8 | 5.7 | 5.1 | 6.0 | 4.9 | 4.9 | 4.5 |
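Slicing a `DatetimeIndex` with partial date strings can be tried on a small made-up example. The sketch below builds a monthly index with `pd.date_range` and selects a whole year and a range of months; both ends of a string slice are included.

```python
import pandas as pd

# Toy monthly data starting in January 2005, for illustration only.
idx = pd.date_range("2005-01-01", periods=36, freq="MS")
toy = pd.DataFrame({"rate": list(range(36))}, index=idx)

# A year string selects every row in that year.
year_2006 = toy.loc["2006"]

# A slice of month strings includes both endpoints.
window = toy.loc["2006-03":"2006-05"]

print(len(year_2006))
print(window)
```

No datetime objects need to be constructed by hand; pandas interprets the strings against the index.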


### Transforms

Many analytical operations do not necessarily involve an aggregation. The output of a function applied to a Series might need to be a new Series.

For example,
- Compute the difference in unemployment from month to month (`diff`).
- Compute the percentage change in unemployment from month to month (`pct_change`).
- Calculate the cumulative sum of elements in each column (`cumsum`).

As usual, tab completion is helpful when trying to find such functions.
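The three transforms listed above can be compared on a short made-up Series. Each returns a Series of the same length as its input, which is what distinguishes a transform from an aggregation.

```python
import pandas as pd

# Toy Series, for illustration only.
rates = pd.Series([4.0, 5.0, 4.0, 6.0])

diff = rates.diff()         # change from the previous element; first entry is NaN
pct = rates.pct_change()    # relative change from the previous element; first entry is NaN
cumsum = rates.cumsum()     # running total

print(diff.tolist())
print(pct.tolist())
print(cumsum.tolist())
```

The first entry of `diff` and `pct_change` is missing because there is no earlier value to compare with.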

As an example of the use of transforms, the code below shows which state had the largest percentage increase in unemployment. Try to understand how this code arrives at this answer.

In [27]:
unemp.pct_change().max().idxmax()

'Texas'
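One way to understand the chained expression is to run it step by step. The sketch below does so on a made-up two-column DataFrame; the intermediate names are chosen here for illustration.

```python
import pandas as pd

# Toy data: columns play the role of states, rows the role of months.
toy = pd.DataFrame({"A": [4.0, 5.0, 5.5],
                    "B": [2.0, 4.0, 4.0]})

changes = toy.pct_change()   # transform: month-to-month relative change per column
largest = changes.max()      # aggregation: largest change within each column
winner = largest.idxmax()    # label of the column with the largest value

print(changes)
print(largest)
print(winner)
```

The first step keeps the shape of the data, the second reduces each column to one number, and the third returns the column label rather than the value.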

**Exercise 13**

The DataFrame `unemp` contains dates starting at the height of the boom before the Great Recession, and ends 10 years later.

Which state had the smallest increase (or largest decrease) in unemployment over this period?

In [30]:
unemp.diff().sum().idxmin()

'Michigan'
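Summing the monthly differences gives the overall change, because the intermediate values cancel out. The sketch below checks this on made-up data, assuming the columns have no missing values.

```python
import pandas as pd

# Toy data without missing values, for illustration only.
toy = pd.DataFrame({"A": [4.0, 6.0, 5.0],
                    "B": [7.0, 8.0, 6.5]})

via_diff = toy.diff().sum()                   # sum of the period-to-period changes
via_endpoints = toy.iloc[-1] - toy.iloc[0]    # last row minus first row

print(via_diff)
print(via_endpoints)
print(via_diff.idxmin())
```

Both approaches give the same numbers, so `idxmin` on either one identifies the column with the smallest increase (or largest decrease).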