In [4]:
from IPython.display import HTML
HTML("""
<style>
code {
    padding:2px 4px !important;
    color: #c7254e !important;
    font-size: 90%;
    background-color: #f9f2f4 !important;
    border-radius: 4px !important;
    color: rgb(138, 109, 59);
}
mark {
    color: rgb(138, 109, 59) !important;
    font-weight: bold !important;
}
.container { width: 90% !important; }
table { font-size:15px !important; }
</style>
""")

In [5]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt 
from IPython.display import Image, display

%matplotlib inline  

# Introduction

This notebook is the master report presenting the analysis results for the CiBO data science exercise. The exercise is built around a beer review dataset available [online](https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz), for which the following questions must be explored:
> 1. Which brewery produces the strongest beers by ABV%?
> 2. If you had to pick 3 beers to recommend using only this data, which would you pick?
> 3. Which of the factors (aroma, taste, appearance, palate) are most important in determining the overall quality of a beer?
> 4. Lastly, if I enjoy a beer's aroma and appearance, which beer style should I try?

## Initial data exploration

The starting point of this analysis is to get a sense of what the data looks like and how it is organized; notebook [1.0](../explore/1.0_initial_look.ipynb) covers this initial investigation. The input data has the following characteristics and caveats:
- The dataset is composed of a series of beer reviews containing scores for numerous categories. Each data entry also has an associated brewery and beer style.
- The following attribute fields have special characters (e.g. accents) in them: `brewery_name`, `beer_style`, `beer_name`.
- Several attributes have missing values: `brewery_name`, `review_profilename` and `beer_abv`. We aren't too concerned about `brewery_name` since each brewery also has an associated `brewery_id`. However, `review_profilename` will be needed if we want to compare across reviewers. A total of 17,043 beers have missing values for `beer_abv`.
- The distribution of reviews per beer is not normal: most beers have only a couple of reviews, yet on average each beer is reviewed about 24 times. <img src='figures/1.0_initial_look-0.svg'></img>
- Similarly, breweries are not all represented by the same number of beers: the average number of beers reviewed for the 5,840 breweries in the dataset is ~11, whereas the median is 5. <img src='figures/1.0_initial_look-1.svg'></img>
- Of the 66,055 beers represented, the most highly represented styles are *American IPA* and *American Pale Ale (APA)*. <img src='figures/1.0_initial_look-2.svg'></img>
- The distribution of alcohol by volume percentage (ABV%) is roughly normal, with numerous outliers. <img src='figures/1.0_initial_look-4.svg'></img>
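
The summary figures quoted in the list above (missing values, mean versus median reviews per beer) could be computed along the following lines. This is a minimal sketch on a toy table, not the code from notebook 1.0; only the column names `beer_beerid`, `brewery_id` and `beer_abv` come from the dataset, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the review table (assumption: one row per review).
reviews = pd.DataFrame({
    "beer_beerid": [1, 1, 1, 1, 1, 1, 2, 2, 3],
    "brewery_id":  [10, 10, 10, 10, 10, 10, 10, 10, 20],
    "beer_abv":    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, None, None, 7.2],
})

# Number of missing values in each column (counted per review).
missing_per_column = reviews.isna().sum()

# Reviews per beer: a long right tail pulls the mean above the median.
reviews_per_beer = reviews.groupby("beer_beerid").size()
mean_reviews = reviews_per_beer.mean()
median_reviews = reviews_per_beer.median()

# Beers (not reviews) that lack an ABV% value.
beers = reviews.drop_duplicates("beer_beerid")
beers_missing_abv = int(beers["beer_abv"].isna().sum())

print(mean_reviews, median_reviews, beers_missing_abv)
```

Counting missing ABV% values per beer rather than per review matters here, since a heavily reviewed beer would otherwise be counted many times.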

## Q1: Which brewery produces the strongest beers by ABV%?

### Introduction

The notebook [2.0](../explore/2.0_brewery_highest_abv.ipynb) explores this question.

There are several issues with the ABV% data contained in the beer review data set:
1. As mentioned previously, numerous beers (17,043) have missing values for ABV%; this represents about 25% of the total beers available in the dataset. It is unclear why this data is missing. One potential way around this would be to construct a model that predicts ABV% from the scored review features (e.g. `review_taste`) as well as the `beer_style`. This was not done as part of this analysis; instead, beers without an ABV% were ignored.
2. Some breweries have many more beers associated with them than others (as shown above in the initial data exploration), so it is unclear whether having only a single strong beer qualifies a brewery as having the "strongest beers". <mark>For this analysis, we will assume that a brewery must have at least 5 beers to qualify for this question.</mark>
3. The initial data exploration also shows the inherent noise in the ABV% metric, especially in the high-ABV% range; this will have to be taken into account when selecting beers.
4. Because several of the breweries do not have any associated names, we will identify breweries by the `brewery_id` attribute throughout the analysis.

### Dealing with noisy ABV%

We begin by naively looking at which brewery has the highest ABV% beers, without removing any noisy data, in order to get a general sense of the data. To do that, we generate a *beer dataset* which contains only the metadata associated with each beer; this is done by grouping the starting dataset on `beer_beerid` and simply grabbing the first entry of each group. From this *beer dataset* we group on `brewery_id`, aggregate with *max*, and sort the breweries by max ABV%; the plot below shows the top 15 breweries with the highest ABV% beer.

<img src='figures/2.0_brewery_highest_abv-0.svg'/>
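
The grouping steps described above can be sketched as follows. This is an illustrative reconstruction on invented toy data rather than the code from notebook 2.0; the variable names `beers`, `max_abv` and `top` are assumptions.

```python
import pandas as pd

# Toy review table; values are invented for illustration.
reviews = pd.DataFrame({
    "beer_beerid":  [1, 1, 2, 3, 3, 4],
    "brewery_id":   [10, 10, 10, 20, 20, 30],
    "beer_abv":     [5.0, 5.0, 9.5, 12.0, 12.0, 4.2],
    "review_taste": [4.0, 3.5, 4.5, 3.0, 4.0, 2.5],
})

# Beer dataset: one row per beer. GroupBy.first() takes the first
# non-null value of each column within a group.
beers = reviews.groupby("beer_beerid", as_index=False).first()

# Highest ABV% per brewery, strongest first.
max_abv = (
    beers.groupby("brewery_id")["beer_abv"]
    .max()
    .sort_values(ascending=False)
)

# Top 15 breweries (the toy data only has three).
top = max_abv.head(15)
print(top)
```

Because this ranking uses the maximum, a single mislabelled ABV% value is enough to put a brewery at the top, which is why the outlier handling that follows is needed.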

Several interesting insights can be seen in this plot:
- Noisy high-ABV% beers push several breweries toward the top of the ranking.
- Brewery 6513 seems to be a general outlier when compared to the remaining breweries. In general it has very high ABV% beers; its median beer ABV% is higher than nearly all other beers represented. It is possible that this entire brewery represents some sort of anomaly.

In order to remove the noise, we use Tukey's method for detecting outliers: any beer with an ABV% value greater than Q3 + 1.5 * IQR or less than Q1 - 1.5 * IQR is considered an outlier, where IQR is the interquartile range (Q3 - Q1), Q1 is the first quartile, and Q3 is the third quartile.
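
As a minimal sketch of Tukey's fences, the lower bound Q1 - 1.5 * IQR and the upper bound Q3 + 1.5 * IQR can be computed with pandas quantiles. The helper name `tukey_fences` and the sample values are hypothetical, and the quartiles use pandas' default linear interpolation, which may differ slightly from the quartile definition used in the original notebook.

```python
import pandas as pd

def tukey_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Invented ABV% values with one implausibly high entry.
abv = pd.Series([4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 8.0, 40.0])

lower, upper = tukey_fences(abv)

# Keep only the values that fall inside the fences.
inliers = abv[(abv >= lower) & (abv <= upper)]
print(lower, upper, len(inliers))
```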

After removing outliers, we can replot the same top 15 breweries:

<img src='figures/2.0_brewery_highest_abv-1.svg'/>

- We can see that the ABV% values for all of the breweries are much more concentrated.
- We can also see that several breweries with very few beers enter the list. A brewery with very few beers is arguably not representative of one that consistently produces strong beers, so we set a threshold of at least 5 beers for a brewery to be considered in this analysis. The disadvantage is that 5 is a relatively arbitrary cutoff.
- Brewery 6513 did not contain any outliers; however, it now looks even more anomalous when compared to the remaining breweries. <mark>We therefore elect to call brewery 6513 erroneous and remove it from consideration.</mark>
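
The two filtering decisions above (at least 5 beers per brewery, and brewery 6513 excluded) could be applied as in the sketch below. The constants `MIN_BEERS` and `EXCLUDED_BREWERIES` and the toy table are assumptions made for illustration; only the threshold of 5 and the brewery id 6513 come from the text.

```python
import pandas as pd

MIN_BEERS = 5                 # threshold stated in the text
EXCLUDED_BREWERIES = [6513]   # brewery treated as erroneous in the text

# Toy beer dataset (one row per beer); values are invented.
beers = pd.DataFrame({
    "beer_beerid": list(range(1, 13)),
    "brewery_id":  [1] * 5 + [2] * 2 + [6513] * 5,
    "beer_abv":    [8.0, 9.0, 10.0, 11.0, 12.0, 6.0, 14.0,
                    20.0, 25.0, 30.0, 35.0, 40.0],
})

# Number of beers per brewery, aligned with each row of the beer table.
beers_per_brewery = beers.groupby("brewery_id")["beer_beerid"].transform("count")

# Keep breweries with enough beers that are not explicitly excluded.
enough_beers = beers_per_brewery >= MIN_BEERS
not_excluded = ~beers["brewery_id"].isin(EXCLUDED_BREWERIES)
kept = beers[enough_beers & not_excluded]
print(kept["brewery_id"].unique())
```

Keeping the threshold in a named constant makes it easy to rerun the analysis with a different cutoff and check how sensitive the ranking is to that choice.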

With both the noisy beers and brewery 6513 removed, we arrive at the following distribution of breweries:

<img src='figures/2.0_brewery_highest_abv-3.svg'/>

From the plot above, we can begin to see several candidates for highest beer ABV% brewery: 2097, 11031, 10796 and 13307:
- 10796 has two beers with the absolute highest ABV%
- 2097, 11031 and 13307 each have a cluster of high ABV% beers

### Choosing a brewery

If one were to choose a brewery by its single highest-ABV% beer, the winner would be brewery 10796; however, if one were to consider multiple high-ABV% beers together, a different brewery would be chosen. In the data presented above, the 95th percentile of ABV% is 14.48; counting the number of beers above this percentile for each brewery gives:

In [12]:
cutoff = 14.48  # 95th percentile of ABV% in the filtered data
dat = pd.read_csv('../../data/interim/high_beer_ABV_breweries.csv')
# Count and summarize the beers above the cutoff for each brewery
beer_counts = (dat[dat.beer_abv > cutoff]
               .groupby('brewery_id')[['beer_abv']]
               .agg(['count', 'median', 'mean']))
beer_counts.sort_values(('beer_abv', 'count'), ascending=False).head(5)

| brewery_id | beer_abv count | beer_abv median | beer_abv mean |
|---|---|---|---|
| 2097 | 9 | 15.0 | 15.388889 |
| 11031 | 8 | 16.3 | 16.15 |
| 13307 | 7 | 15.0 | 15.131429 |
| 10796 | 3 | 18.0 | 16.833333 |
| 15732 | 3 | 15.0 | 15.166667 |


The breweries 2097, 11031, and 13307 all have a high number of beers (9, 8, and 7, respectively) above the 95th percentile of ABV% among only those breweries that produce high ABV% beers. Since these counts are quite similar, we use the median and mean ABV% to decide which brewery tends to brew numerous high ABV% beers: **11031**, which has the highest median (16.3) and mean (16.15) of the three.
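
The cutoff of 14.48 is described above as the 95th percentile. As a hedged sketch of how such a cutoff could be derived rather than hard-coded, the example below uses `Series.quantile` on invented values; the real input would be the `beer_abv` column of the filtered beer data.

```python
import pandas as pd

# Invented ABV% values 1.0 .. 100.0, used only to illustrate the call.
abv = pd.Series([float(v) for v in range(1, 101)])

# 95th percentile, using pandas' default linear interpolation.
cutoff = abv.quantile(0.95)

# Number of beers strictly above the cutoff.
n_above = int((abv > cutoff).sum())
print(cutoff, n_above)
```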

### Summary

We have determined that brewery **11031 (Brouwerij De Molen)** produces more of the strongest ABV% beers than any other brewery. This is contingent on the following assumptions:
- brewery 6513 is an anomaly.
- a brewery must produce at least 5 beers to be considered.
- a single very high ABV% beer does not automatically select the given brewery since precedence is given to those breweries that produce multiple strong beers (i.e. 1 very strong beer is not better than 10 slightly less strong beers).

## Q2: If you had to pick 3 beers to recommend using only this data, which would you pick?