# Extra Credit: Final Exercise on Pandas

The last few notebooks introduced a lot of new material and gave you a chance to practice it. Practice is what builds confidence with these methods and helps them stick.
So here is one more exercise on Pandas, which asks you to apply the newest tools we covered: methods for combining records, `pd.cut`, and pivot tables.



We'll be working again with the `wine` datasets that are located in the `data` folder. 

1) Read the `winequality-red.csv` data into a `DataFrame`, and the `winequality-white.csv` into another `DataFrame`.

2) Double check that you've read them in correctly by using some of the attributes and methods available on `DataFrames` for getting a general sense of your data. 

3) I've decided that this month I want to stay away from wines with relatively high alcohol content. To do that, I'm going to avoid any wines that have a greater alcohol content than the mean alcohol content, and you're going to help me do this. To achieve this, let's do the following: 

  * Find the mean alcohol content, separately, for reds and whites.  
  * Create a `Series` that holds whether each row in each `DataFrame` (red wines and white wines) has a higher alcohol content than the mean. 
  * Merge this `Series` onto the `DataFrame`. I can imagine doing this with either a `.join()` or using `pd.concat()`. For practice, do it with both. Note: merges with `Series` work the same way that they work with `DataFrames`.  
  * Return all those rows that will help me stay away from those wines with a higher alcohol content.  
   
   
4) Let's say that I want to get started on cutting back next month. This time, though, I want to focus on staying away from those wines with a high acidity. Specifically, I want to stay away from those wines that are in the highest bin of fixed acidity (highest bin out of 5). You're now going to help me to do this. To achieve this, let's do the following: 

 * Separate the rows in each `DataFrame` into 5 equal-width bins (not the same as quintiles) based on their fixed acidity. 
 * Merge the resulting `Series` holding these 5 bins onto the original `DataFrame`. This can again be done with either `.join()` or `pd.concat()`. Try both for practice. 
 * Return all those rows that are **not in** the top bin in terms of fixed acidity. 

5) Let's say that I now want to know how much my decision to avoid those wines with higher `alcohol` content is going to limit the `quality` of wines that I can drink. To figure this out, I want to know a couple of things: 

 * The average `alcohol` content for those reds above the mean `alcohol` level, by quality.
 * The average `alcohol` content for those whites above the mean `alcohol` level, by quality. 
 
 Use a `pivot table` to solve this. 
 
6) Now, do the same for my decision to avoid wines with a high acidity next month: 

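 * Find the average `alcohol` content for reds, by `quality` and `fixed acidity` bin (the five equal-width bins from step 4, which are not true quintiles). 
 * Find the average `alcohol` content for whites, by `quality` and `fixed acidity` bin (again the five equal-width bins from step 4). 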
 
  Use a `pivot table` to solve this. 

In [None]:
import pandas as pd

In [None]:
# 1 Read the `winequality-red.csv` data into a `DataFrame`, and the `winequality-white.csv` into another `DataFrame`.

df_red = pd.read_csv("data/winequality-red.csv", sep=';')
df_white = pd.read_csv("data/winequality-white.csv", sep=';')

In [None]:
# 2 Double check that you've read them in correctly ...
# (... .head(), .shape, ...)
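A minimal sketch of the kind of checks step 2 has in mind. It runs on a small hypothetical stand-in frame (`toy`) with made-up values, since the real CSV files are not assumed to be available here. The same attributes and methods apply to `df_red` and `df_white`.

```python
import pandas as pd

# Hypothetical stand-in for df_red; the values are made up for illustration.
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 6.0],
    "alcohol": [9.4, 9.8, 12.0, 10.5],
    "quality": [5, 5, 6, 7],
})

shape = toy.shape                    # (number of rows, number of columns)
dtypes = toy.dtypes                  # one dtype per column
n_missing = toy.isna().sum().sum()   # total count of missing values

print(toy.head())                    # first few rows
print(shape)
print(dtypes)
print(toy.describe())                # summary statistics for numeric columns
```

If every value ends up in a single column, the separator passed to `pd.read_csv` is the first thing to check.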

In [None]:
# 3.1 Find the mean alcohol content, separately, for reds and whites.

mean_red = df_red.alcohol.mean()
mean_white = df_white.alcohol.mean()

In [None]:
# 3.2 Create a `Series` that holds whether each row in each `DataFrame` 
# (red wines and white wines) has a higher alcohol content than the mean.

# Reuse the means computed in 3.1 instead of recomputing them.
alc_red = df_red.alcohol > mean_red
alc_white = df_white.alcohol > mean_white

In [None]:
# 3.3 Merge this `Series` onto the `DataFrame`. 
# I can imagine doing this with either a `.join()` or using `pd.concat()`. 
# For practice, do it with both. Note: merges with `Series` work the same way that they work with `DataFrames`.   

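# Each Series is also named 'alcohol', so rsuffix turns the new column into 'alcohol_above_mean'.
df_red_joined = df_red.join(alc_red, rsuffix='_above_mean')
df_white_joined = df_white.join(alc_white, rsuffix='_above_mean')

# A notebook cell only shows its last expression, so display both results explicitly.
display(df_red_joined, df_white_joined)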
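The exercise also asks for a `pd.concat()` version of this step. The sketch below shows one way to do it on a small hypothetical frame (`toy`, with made-up values). Renaming the boolean `Series` first is a choice made here so that the new column does not clash with the existing `alcohol` column.

```python
import pandas as pd

# Hypothetical stand-in for one of the wine DataFrames; values are made up.
toy = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0, 14.0], "quality": [5, 6, 6, 7]})

# Boolean Series: True where alcohol is strictly above the mean of the column.
above_mean = (toy["alcohol"] > toy["alcohol"].mean()).rename("alcohol_above_mean")

# axis=1 places the Series next to the existing columns, aligned on the index.
toy_concat = pd.concat([toy, above_mean], axis=1)

# The same result with .join(); no suffix is needed because the names differ.
toy_join = toy.join(above_mean)

print(toy_concat)
```

Both approaches align on the index, so they only agree with a row-by-row comparison when the `Series` and the `DataFrame` share the same index.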

In [None]:
# 3.4 Return all those rows that will help me stay away from those wines with a higher alcohol content. 

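# Keep the rows at or below the mean; assign both results so neither is silently discarded.
df_red_low_alc = df_red_joined.query('not alcohol_above_mean')
df_white_low_alc = df_white_joined.query('not alcohol_above_mean')
display(df_red_low_alc, df_white_low_alc)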

In [None]:
# 4.1 Separate the rows in each `DataFrame` into 5 equal width bins (not equal to quintiles) 
# based off their fixed acidity.

acidity_labels = ['very low', 'low', 'medium', 'high', 'very high']
acidity_bins = pd.cut(df_red_joined['fixed acidity'], bins=5, labels=acidity_labels)
acidity_bins
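The step above stresses that equal-width bins are not quintiles. The following sketch illustrates the difference on a small made-up `Series` (`acidity`, hypothetical values chosen to include one outlier): `pd.cut` splits the value range into equal-width intervals, while `pd.qcut` splits the rows into groups of roughly equal size.

```python
import pandas as pd

# Made-up values with one large outlier to make the contrast visible.
acidity = pd.Series(
    [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 24.0],
    name="fixed acidity",
)
labels = ["very low", "low", "medium", "high", "very high"]

# Equal-width bins: the range 4 to 24 is cut into 5 intervals of width 4.
equal_width = pd.cut(acidity, bins=5, labels=labels)

# Quintiles: bin edges are placed at the 20%, 40%, 60% and 80% quantiles.
quintiles = pd.qcut(acidity, q=5, labels=labels)

# With equal widths most rows land in the two lowest bins;
# with quintiles every bin holds the same number of rows.
print(equal_width.value_counts(sort=False))
print(quintiles.value_counts(sort=False))
```

With skewed data, the top equal-width bin can therefore contain very few wines, which is worth keeping in mind when interpreting step 4.3.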

In [None]:
# 4.2 Merge the resulting `Series` holding these 5 bins onto the original `DataFrame`. 
# I can imagine also doing this with either `.join()` or using `pd.concat()`. 
# Try doing it with both for practice.

df_red_acid_joined = df_red_joined.join(acidity_bins, rsuffix='_bins')
df_red_acid_joined

In [None]:
# 4.3 Return back to me all those rows that are **not in** the top bin in terms of fixed acidity.
df_red_keepdrinking = df_red_acid_joined[(df_red_acid_joined['fixed acidity_bins'] != "very high")]
df_red_keepdrinking

In [None]:
# 5.1 The average `alcohol` content for those reds above the mean `alcohol` level, by quality.
# Use a `pivot table` to solve this.

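# The default aggfunc of pivot_table is the mean, so no aggfunc argument is needed.
red_above_mean = df_red_acid_joined.query('alcohol_above_mean')
pd.pivot_table(red_above_mean, values='alcohol', index='quality')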
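To see what this filter-then-pivot pattern computes, here is a minimal sketch on a hypothetical frame (`toy`, made-up values). It also checks the pivot table against an equivalent `groupby`, which is one way to sanity-check the result.

```python
import pandas as pd

# Hypothetical data; the mean alcohol content of these six rows is 11.0.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 12.0, 13.0, 14.0, 8.0],
    "quality": [5, 5, 6, 6, 7, 7],
})
toy["alcohol_above_mean"] = toy["alcohol"] > toy["alcohol"].mean()

# Keep only the rows above the mean, then average alcohol per quality level.
above = toy.query("alcohol_above_mean")
table = pd.pivot_table(above, values="alcohol", index="quality")

# Equivalent computation with groupby, used here as a cross-check.
check = above.groupby("quality")["alcohol"].mean()

print(table)
```

Quality levels with no wine above the mean simply do not appear in the resulting index.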

In [None]:
# 5.2 The average `alcohol` content for those whites above the mean `alcohol` level, by quality.
# Use a `pivot table` to solve this.

#(...)

In [None]:
# 6.1 Find the average `alcohol` content for reds, by `quality` and `fixed acidity` bin (the equal-width bins from 4.1).
# Use a `pivot table` to solve this.

pd.pivot_table(df_red_acid_joined, values='alcohol', index='quality', columns='fixed acidity_bins')
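A two-way pivot table like the one above can be easier to read on a tiny example. This sketch uses a hypothetical frame (`toy`) with a made-up `acidity_bin` column of plain strings; in the exercise the corresponding column holds the categories produced by `pd.cut`.

```python
import pandas as pd

# Hypothetical data: two quality levels and two acidity bins.
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6],
    "acidity_bin": ["low", "high", "low", "low", "high"],
    "alcohol": [9.0, 10.0, 11.0, 13.0, 12.0],
})

# Rows are quality levels, columns are acidity bins, cells are mean alcohol.
table = pd.pivot_table(
    toy,
    values="alcohol",
    index="quality",
    columns="acidity_bin",
    aggfunc="mean",
)

print(table)
```

Combinations of quality and bin that have no rows show up as `NaN` in the table rather than as zero.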


In [None]:
# 6.2 Find the average `alcohol` content for whites, by `quality` and `fixed acidity` bin (the equal-width bins from 4.1).
# Use a `pivot table` to solve this.

#(...)