#### Group Members

- A10673470
- A12351940
- A12138344
- A91112629
- A91018689
- A12450393

## Introduction

**Research Question**

What are the most common styles of beer by region in the United States?

**Hypothesis**

From our general background knowledge as consumers, we believe India Pale Ales (IPAs) are probably the most commonly produced style overall. However, we think that California especially favors IPAs and that other states will generally show more variety, with production spread more evenly across other styles.

## Background and Prior Work

**Why is this question of interest, what background information led you to your hypothesis, and why is this important?**

Our team likes beer, and we also travel. It’s interesting to us how different the varieties of available beer are in other countries and how America’s newfound love for beer has made the craft brewing scene evolve differently than it did in other regions of the world. Since America is a large country with different cultures within it, we’re wondering whether different parts of the United States have differing preferences for style of beer. We know IPAs are huge on the West Coast, but is that true in other states as well?

We found inspiration in a study by the Brewers Association that reports the number of craft breweries by state, along with general statistics on how popular craft beer is. We also found an interactive map made by a team at the University of Kentucky that measures the popularity of types of beer by scraping tweets. The issue with that tool is that it’s based on social media hype, which does not necessarily indicate what’s available on the market. Additionally, it isn’t easy to identify which styles are most prominent, since the available tags include styles such as “Weiss” but also specific brands like “Corona.”

**References:**
https://www.brewersassociation.org/statistics/by-state/
http://newmapsplus.uky.edu/gallery/beer-tweets/#

## Data Description

- **Name:** beers.csv
- **Description:** This is a user-generated, administrator-maintained dataset of user-reported beers from the RateBeer website. We initially used a different dataset that was 100x smaller and only had canned beers. We then emailed the director of RateBeer so that we could get more comprehensive data and luckily were given his data.
- **Link:** The dataset is not publicly available, but the data itself is all from https://ratebeer.com
- **Observations:** 285995
- **Features:** 
 - BeerName - commercial name of the beer
 - BeerStyleName - style of beer
 - Entered - date the beer was added to the database
 - RateCount - number of times that the beer was reviewed by a user
 - BrewerCity - city where the beer originates
 - Abbrev - state where the beer originates
 - BrewerZIPCode - ZIP code of where the beer originates

## Data Cleaning/Pre-Processing

### Imports

In [1]:
import pandas as pd
import numpy as np
import folium
import geocoder
import zipcode
import plotly
import plotly.plotly as py
from plotly.graph_objs import *
from folium.plugins import FastMarkerCluster
from folium.plugins import HeatMap
from us_state_abbrev import us_state_abbrev
import sys

First, let's import the dataset and clean it.

In [2]:
df = pd.read_csv('ucsd-sansdescrip03162018.csv', sep="|")

df.BeerStyleName.replace(to_replace="India Pale Ale &#40;IPA&#41;",value="India Pale Ale", inplace=True)
df.BeerStyleName.replace(to_replace="Czech Pilsner (Sv&#283;tlý)",value="Czech Pilsner (Světlý)", inplace=True)

df = df.drop([285995])

df.to_csv('beers_grouped.csv')
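The two `replace` calls above each fix one HTML character reference by hand. As a hedged alternative, the standard library's `html.unescape` decodes such references generically. The sketch below runs on a toy Series, not the RateBeer file, and the variable names are illustrative only.

```python
import html

import pandas as pd

# Toy stand-in for the BeerStyleName column (strings copied from the cell above)
styles = pd.Series([
    "India Pale Ale &#40;IPA&#41;",
    "Czech Pilsner (Sv&#283;tlý)",
    "Saison",
])

# html.unescape decodes numeric character references such as &#40; and &#283;
decoded = styles.map(html.unescape)
print(decoded.tolist())
```

Note that this yields "India Pale Ale (IPA)" and not the shortened "India Pale Ale" used in the notebook, so the renaming step would still be needed if the shorter label is preferred.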

The dataframe below groups the beer styles into RateBeer's broader categories, since there are so many individual styles.

In [3]:
styles_df = df.drop(axis=1, labels=['Entered','BrewerCity','Abbrev','BrewerZIPCode'])

styles_df['BeerStyleGroup'] = styles_df.BeerStyleName

# "India Pale Ale" is listed explicitly because the cleaning step above renamed the IPA style to that value
styles_df.BeerStyleGroup.replace(to_replace=["Altbier","Amber Ale","American Pale Ale","American Strong Ale","American Strong Ale ","Barley Wine","Bitter","Brown Ale","Cream Ale","English Pale Ale","English Strong Ale","Golden Ale/Blond Ale","Imperial IPA","India Pale Ale","India Pale Ale (IPA)","India Pale Ale &#40;IPA&#41;","Irish Ale","Kölsch","Mild Ale","Old Ale","Premium Bitter/ESB","Scotch Ale","Scottish Ale","Session IPA"], value="Anglo-American Ales",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Amber Lager/Vienna","California Common","Czech Pilsner (Světlý)","Czech Pilsner (Sv&#283;tlý)","Doppelbock","Dortmunder/Helles","Dunkel/Tmavý","Dunkler Bock","Eisbock","Heller Bock","Imperial Pils/Strong Pale Lager","India Style Lager","Malt Liquor","Oktoberfest/Märzen","Pale Lager","Pilsener","Polotmavý","Premium Lager","Radler/Shandy","Schwarzbier","Zwickel/Keller/Landbier"], value="Lagers",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Abbey Dubbel","Abbey Tripel","Abt/Quadrupel","Belgian Ale","Belgian Strong Ale","Bière de Garde","Saison"], value="Belgian-Style Ales",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Baltic Porter","Black IPA","Dry Stout","Foreign Stout","Imperial Porter","Imperial Stout","Porter","Stout","Sweet Stout"], value="Stout and Porter",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Dunkelweizen","German Hefeweizen","German Kristallweizen","Grodziskie/Gose/Lichtenhainer","Weizenbock","Wheat Ale","Witbier"], value="Wheat Beer",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Berliner Weisse","Lambic Style - Faro","Lambic Style - Fruit","Lambic Style - Gueuze","Lambic Style - Unblended","Sour Red/Brown","Sour/Wild Ale"], value="Sour Beer",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Fruit Beer","Low Alcohol","Sahti/Gotlandsdricke/Koduõlu","Smoked","Specialty Grain","Spice/Herb/Vegetable","Traditional Ale"], value="Other",inplace=True)
styles_df.BeerStyleGroup.replace(to_replace=["Cider","Ice Cider/Ice Perry","Mead","Perry","Saké - Daiginjo","Saké - Futsu-shu","Saké - Genshu","Saké - Ginjo","Saké - Honjozo","Saké - Infused","Saké - Junmai","Saké - Koshu","Saké - Namasaké","Saké - Nigori","Saké - Taru","Saké - Tokubetsu"], value="Cider, Mead, Sake",inplace=True)
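The grouping above can also be expressed as a single lookup table. The following is a minimal sketch on toy data, assuming an abbreviated, hypothetical `GROUPS` dictionary that lists only a few styles per group; it is not the full RateBeer grouping.

```python
import pandas as pd

# Hypothetical, abbreviated lookup: each group lists only a few of its styles
GROUPS = {
    "Anglo-American Ales": ["India Pale Ale", "American Pale Ale", "Brown Ale"],
    "Stout and Porter": ["Porter", "Imperial Stout"],
    "Sour Beer": ["Sour/Wild Ale", "Berliner Weisse"],
}

# Invert to style -> group so that one map() call performs the lookup
style_to_group = {
    style: group for group, members in GROUPS.items() for style in members
}

toy = pd.DataFrame(
    {"BeerStyleName": ["India Pale Ale", "Porter", "Sour/Wild Ale", "Unlisted Style"]}
)

# Styles missing from the lookup keep their own name, as they do with replace()
toy["BeerStyleGroup"] = (
    toy["BeerStyleName"].map(style_to_group).fillna(toy["BeerStyleName"])
)
print(toy)
```

Keeping the mapping in one dictionary makes it easier to spot a style that was left out of every group.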

## Data Analysis and Visualization

Now, we will figure out which beer styles are most popular overall and see how that changes by region.

In [4]:
df.BeerStyleName.value_counts()[:15]

India Pale Ale          33411
American Pale Ale       16908
Imperial IPA            14052
Saison                  13546
Sour/Wild Ale           11981
Imperial Stout          11800
Porter                   9883
Stout                    8603
Spice/Herb/Vegetable     8368
Brown Ale                8088
Fruit Beer               7700
Amber Ale                7206
Golden Ale/Blond Ale     5903
Belgian Ale              5485
Sweet Stout              5286
Name: BeerStyleName, dtype: int64

Nationally, IPAs and pale ales in general overwhelm other styles by production. Let's see if any particular states are the culprits for this pale ale craze.

In [5]:
states = df.Abbrev.value_counts().index

states_df = pd.DataFrame.from_dict(states)

states_df.columns = ['states']
states_df.sort_values(by=['states'],inplace=True)
states_df = states_df.reset_index(drop=True)  # reset_index returns a new frame, so assign it back

def state_style(state):

    return df.loc[df['Abbrev'] == state].BeerStyleName.value_counts().index[:10].values

mostCommonByState = pd.Series()

for row in states_df.itertuples():
    mostCommonStyles10 = state_style(row.states)
    mostCommonByState[row.states] = mostCommonStyles10[0]
    
mostCommonByState.value_counts()

India Pale Ale       48
American Pale Ale     1
Saison                1
Imperial Stout        1
dtype: int64
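The per-state tally above can also be sketched with a `groupby`, which avoids the explicit loop. This example uses invented toy rows with the same column names as the real frame, so its counts are illustrative only.

```python
import pandas as pd

# Toy data with the same column names as the real frame
toy = pd.DataFrame({
    "Abbrev": ["CA", "CA", "CA", "VT", "VT", "VT"],
    "BeerStyleName": ["India Pale Ale", "India Pale Ale", "Porter",
                      "Saison", "Saison", "India Pale Ale"],
})

# Most frequent style within each state
top_style = toy.groupby("Abbrev")["BeerStyleName"].agg(
    lambda s: s.value_counts().idxmax()
)

# Number of states led by each style
leaders = top_style.value_counts()
print(top_style)
print(leaders)
```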

IPA is the most common style in 48 of the 51 regions tallied (the 50 states plus DC). It is beaten in only three, and in two of those the leader (American Pale Ale or Saison) is still a style we group with the pale ales in the next step. Let's see which beers would be most popular if we ignored pale ales.

In [6]:
for row in states_df.itertuples():
    mostCommonStyles10 = state_style(row.states)
    mostCommonStyles3 = []
    i = 0
    while (len(mostCommonStyles3) < 3):
        if (mostCommonStyles10[i] not in ['India Pale Ale', 'American Pale Ale', 'Imperial IPA', 'Saison']):
            mostCommonStyles3.append(mostCommonStyles10[i])
        i += 1
    # DataFrame.set_value was removed in pandas 1.0; .loc performs the same label-based assignment
    states_df.loc[row.Index, 'most_common'] = mostCommonStyles3[0]
    states_df.loc[row.Index, 'second_most'] = mostCommonStyles3[1]
    states_df.loc[row.Index, 'third_most'] = mostCommonStyles3[2]
    
states_df.most_common.value_counts()

Sour/Wild Ale           13
Imperial Stout          12
Porter                   8
Mead                     4
Spice/Herb/Vegetable     3
Amber Ale                3
Stout                    3
Fruit Beer               2
Barley Wine              1
Golden Ale/Blond Ale     1
Cider                    1
Name: most_common, dtype: int64

This ranking looks quite similar to the results of the national style ranking (excluding pale ales), suggesting that there may not be too much variation in style production by region.
Let's see if a map can tell us something different.

In [7]:
def add_index_by_most_common(style):
    d = {
        'Sour/Wild Ale': 11,
        'Imperial Stout': 10,
        'Porter': 9,
        'Mead': 8,
        'Stout': 7,
        'Spice/Herb/Vegetable': 6,
        'Amber Ale': 5,
        'Fruit Beer': 4,
        'Barley Wine': 3,
        'Cider': 2,
        'Golden Ale/Blond Ale': 1   
    }
    return d[style.strip()]

states_df['states'] = states_df.apply(lambda x: x['states'].strip(), axis=1)
states_df = states_df[states_df['states'] != 'DC']
states_df = states_df.sort_values(['states']).reset_index()

states_df['style_index'] = states_df.apply(lambda x: add_index_by_most_common(x['most_common']), axis=1)

import plotly
import plotly.plotly as py
import plotly.figure_factory as ff
import numpy as np

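import os

# Load the Plotly API key from an environment variable instead of hard-coding it.
# PLOTLY_API_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ['PLOTLY_API_KEY']
plotly.tools.set_credentials_file(username='semendez', api_key=API_KEY)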

scl = [[0.0, '#665687'],[0.1, '#FBB13C'],[0.2, '#DAFF7D'],\
            [0.3, '#B2EF9B'],[0.4, '#FFF05A'],[0.5, '#FF785A'],\
              [0.6, '#B2EF9B'],[0.7, '#3B3561'], [0.8, '#5DD9C1'],\
              [0.9, '#ACFCD9'],[1.0, '#8F2D56']]

states_df['text'] = states_df['states'] + '<br>' +\
    'Most Common Beer: '+states_df['most_common']
    
data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = states_df['states'],
        z = states_df['style_index'].astype(float),
        locationmode = 'USA-states',
        text = states_df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Beer Type")
        ) ]

layout = dict(
        title = '',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
py.iplot( fig, filename='d3-cloropleth-map' )

In [11]:
styles_hype = {}

for value in styles_df.BeerStyleName.value_counts().index:
    styles_hype[value] = 0
    
for row in styles_df.itertuples():
    styles_hype[row.BeerStyleName] += row.RateCount

hype_df = pd.DataFrame(columns=['hype'])

hype_df.hype = pd.Series(styles_hype).sort_values(ascending=False)

# Share of all ratings that each style receives
hype_df['normal_hype'] = (hype_df['hype'] / hype_df['hype'].sum()).round(3)

styles_counts = df.BeerStyleName.value_counts()

# Average number of ratings per beer within each style
# (vectorized replacement for DataFrame.set_value, which was removed in pandas 1.0)
hype_df['relative_hype'] = (hype_df['hype'] / styles_counts).round(3)
    
median_hype = hype_df.relative_hype.median()    
hype_df['relative_hype'] = hype_df['relative_hype'].apply(lambda x: x - median_hype)

hype_df.relative_hype.sort_values(ascending=False)[:10]

Lambic Style - Gueuze    30.234
Malt Liquor              17.910
Barley Wine              16.200
Low Alcohol              15.895
Pale Lager               15.481
Imperial Stout           13.237
American Strong Ale      11.754
Saké - Futsu-shu         11.698
Old Ale                  11.584
English Strong Ale       11.171
Name: relative_hype, dtype: float64

Even though the IPA has by far the most beers entered into the RateBeer database, it does not have the most reviews relative to the number of beers. The ten styles with the most ratings per beer, measured against the median style, are shown above. One possible explanation is that the IPA craze started well after RateBeer launched, so these other styles have had longer to accumulate ratings.
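The ratings-per-beer measure discussed above can be illustrated on a small example. This is a sketch with invented numbers: total ratings per style are divided by the number of beers in that style, and the median across styles is then subtracted.

```python
import pandas as pd

# Invented toy data; the column names match the real frame
toy = pd.DataFrame({
    "BeerStyleName": ["India Pale Ale", "India Pale Ale", "India Pale Ale",
                      "Malt Liquor", "Porter", "Porter"],
    "RateCount": [10, 20, 30, 90, 20, 40],
})

# Total ratings ("sum") and number of beers ("count") per style
per_style = toy.groupby("BeerStyleName")["RateCount"].agg(["sum", "count"])

# Average ratings per beer, then the offset from the median style
per_style["ratings_per_beer"] = per_style["sum"] / per_style["count"]
per_style["relative_hype"] = (
    per_style["ratings_per_beer"] - per_style["ratings_per_beer"].median()
)
print(per_style.sort_values("relative_hype", ascending=False))
```

In this toy case India Pale Ale has the most beers but the fewest ratings per beer, which is the same pattern the notebook reports for the real data.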

### More Visualizations

In [14]:
df_beers = pd.read_csv('ucsd-sansdescrip03162018.csv', sep="|")
df_beers = df_beers.rename(index=str, columns={'BeerName': 'name', 'BeerStyleName': 'style', 'Entered': 'entry', 'RateCount': 'rate_count', 'BrewerCity': 'city', 'Abbrev':'state', 'BrewerZIPCode':'zipcode'})

df_beers['name'] = df_beers['name'].str.strip()
df_beers['style'] = df_beers['style'].str.strip()
df_beers['city'] = df_beers['city'].str.strip()
df_beers['state'] = df_beers['state'].str.strip()
df_beers['zipcode'] = df_beers['zipcode'].str.strip()


In [15]:
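def get_latitude(z):
    # Return NaN when the ZIP code cannot be looked up
    try:
        return zipcode.isequal(z).lat
    except Exception:
        return np.nan

def get_longitude(z):
    # Return NaN when the ZIP code cannot be looked up
    try:
        return zipcode.isequal(z).lon
    except Exception:
        return np.nan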

In [16]:
# Setting lat/lng columns to beers df
df_beers['lat'] = df_beers.apply(lambda x: get_latitude(x['zipcode']), axis=1)
df_beers['lng'] = df_beers.apply(lambda x: get_longitude(x['zipcode']), axis=1)

In [17]:
# Counting beers by state
df_beers_by_state = df_beers.groupby(df_beers['state'].str.strip()).size().reset_index()
df_beers_by_state = df_beers_by_state.rename(index=str, columns={0:'beers_count'})

df_beers_by_state.head()

assert sum(df_beers_by_state['beers_count']) == (len(df_beers) - 1)  # one row has no state value, so groupby leaves it out
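One plausible reason for the `- 1` in the assertion above, which the notebook does not state explicitly, is that pandas `groupby` drops rows whose key is missing. The sketch below shows that behavior on toy data. The `dropna=False` option is available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

# Four toy rows, one of which has no state
toy = pd.DataFrame({"state": ["CA", "CA", "VT", np.nan]})

# By default, rows with a missing key are excluded from every group
counts = toy.groupby("state").size()

# With dropna=False the missing key gets a group of its own
counts_all = toy.groupby("state", dropna=False).size()

print(counts.sum(), counts_all.sum(), len(toy))
```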

In [18]:
df_beers_by_city = df_beers.groupby([df_beers['state'].str.upper().str.strip(), df_beers['city'].str.capitalize().str.strip()]).size().reset_index()
df_beers_by_city = df_beers_by_city.rename(index=str, columns={0:'beers_count'})

def get_zipcode(city, state):
    # Take the first matching ZIP code by position; return NaN when there is no match
    try:
        return df_beers['zipcode'][(state == df_beers['state']) & (city == df_beers['city'])].iloc[0]
    except Exception:
        return np.nan

# adding zipcode column
df_beers_by_city['zipcode'] = df_beers_by_city.apply(lambda x: get_zipcode(x['city'], x['state']), axis=1)

# Removing NaN
df_beers_by_city = df_beers_by_city[~pd.isnull(df_beers_by_city['zipcode'])]

# Cleaning zip code so that they are of len 5
df_beers_by_city['zipcode'] = df_beers_by_city.apply(lambda x: x['zipcode'][:5], axis=1)

# Setting lat/lng columns to df_beers_by_city
df_beers_by_city['lat'] = df_beers_by_city.apply(lambda x: get_latitude(x['zipcode']), axis=1)
df_beers_by_city['lng'] = df_beers_by_city.apply(lambda x: get_longitude(x['zipcode']), axis=1)

# Removing NaN
df_beers_by_city = df_beers_by_city[~pd.isnull(df_beers_by_city['zipcode'])]

# Sorting data by beers_count
df_beers_by_city = df_beers_by_city.sort_values(by=['beers_count'], ascending=False)

df_beers_by_city.head()

| | state | city | beers_count | zipcode | lat | lng |
|---|---|---|---|---|---|---|
| 433 | CO | Denver | 5414 | 80205 | 39.76 | -104.87 |
| 2613 | OR | Portland | 5324 | 97214 | 45.51 | -122.64 |
| 889 | IL | Chicago | 5301 | 60614 | NaN | NaN |
| 3368 | WA | Seattle | 4760 | 98105 | 47.66 | -122.29 |
| 1631 | MN | Minneapolis | 3225 | 55454 | 44.96 | -93.26 |


In [19]:
# Population Estimates by state
df_population_estimates = pd.read_csv('raw_data/us_states_population_estimates.csv')
df_population_estimates = df_population_estimates.rename(index=str, columns={'State': 'state', 'Population Estimate': 'population_estimate', 'Year': 'year'})

# Keeping only the latest data (2017 population estimates), dropping the year column, and removing Puerto Rico
df_population_estimates = df_population_estimates[df_population_estimates['year'] == 2017]
df_population_estimates = df_population_estimates.drop('year', axis=1)
df_population_estimates = df_population_estimates[df_population_estimates['state'] != 'Puerto Rico']

assert len(df_population_estimates) == 51

# Converting full state name to its abbreviation
df_population_estimates['state'] = df_population_estimates['state'].apply((lambda s: us_state_abbrev[s]))

df_population_estimates.head()

| | state | population_estimate |
|---|---|---|
| 364 | AL | 4874747 |
| 365 | AK | 739795 |
| 366 | AZ | 7016270 |
| 367 | AR | 3004279 |
| 368 | CA | 39536653 |


In [20]:
# Merging population estimates and beers by state
df_beers_per_capita = pd.merge(df_population_estimates, df_beers_by_state, on='state')

# Beers per capita (beers per 100,000 inhabitants)
df_beers_per_capita['beers_per_capita'] = df_beers_per_capita.apply((lambda r: int((r['beers_count']/r['population_estimate']) * 100000)), axis=1)

# Dropping beer count and population estimates for Map visualization
df_beers_per_capita = df_beers_per_capita.drop('population_estimate', axis=1)
df_beers_per_capita = df_beers_per_capita.drop('beers_count', axis=1)

df_beers_per_capita.sort_values('beers_per_capita', ascending=False).head(10)

| | state | beers_per_capita |
|---|---|---|
| 45 | VT | 564 |
| 5 | CO | 348 |
| 37 | OR | 327 |
| 1 | AK | 247 |
| 26 | MT | 225 |
| 29 | NH | 205 |
| 50 | WY | 190 |
| 47 | WA | 187 |
| 19 | ME | 179 |
| 23 | MN | 158 |
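The per-capita figures above come from merging the beer counts with the population estimates and scaling to 100,000 inhabitants. Here is a minimal sketch of that calculation. The numbers are invented round figures, not the real estimates, and it uses integer floor division in place of the notebook's `int()` truncation of a float.

```python
import pandas as pd

# Invented round numbers for illustration only
population = pd.DataFrame({
    "state": ["VT", "CA"],
    "population_estimate": [600_000, 40_000_000],
})
beers = pd.DataFrame({
    "state": ["CA", "VT"],
    "beers_count": [20_000, 3_000],
})

# The inner merge keeps only states present in both tables
merged = population.merge(beers, on="state")

# Beers per 100,000 inhabitants, computed in integer arithmetic
merged["beers_per_100k"] = (
    merged["beers_count"] * 100_000 // merged["population_estimate"]
)
print(merged.sort_values("beers_per_100k", ascending=False))
```

A small state with few beers in absolute terms can still rank first once the count is scaled by population, which is the effect seen for Vermont in the table above.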


In [21]:
df_beers = df_beers[~np.isnan(df_beers['lat'])]

df_beers_CA = df_beers[df_beers['state'] == 'CA']
df_beers_VT = df_beers[df_beers['state'] == 'VT']

df_beers_CA.head()

df_beers_VT.head()

| | name | style | entry | rate_count | city | state | zipcode | lat | lng |
|---|---|---|---|---|---|---|---|---|---|
| 48 | Catamount Pale Ale | American Pale Ale | 2001-02-13 21:34:07.000 | 31.0 | White River Junction | VT | 5001 | 43.65 | -72.32 |
| 125 | Catamount Wassail | Spice/Herb/Vegetable | 2001-11-16 12:42:07.000 | 8.0 | White River Junction | VT | 5001 | 43.65 | -72.32 |
| 224 | Three Needs West Coast Pale Ale | American Pale Ale | 2002-02-06 13:07:12.000 | 1.0 | Burlington | VT | 5401 | 44.48 | -73.22 |
| 279 | Vermont Pub Burly Irish Red | Irish Ale | 2000-07-24 22:40:09.000 | 68.0 | Burlington | VT | 5401 | 44.48 | -73.22 |
| 285 | Vermont Pub Curacao Trippel | Abbey Tripel | 2001-06-28 21:49:37.000 | 10.0 | Burlington | VT | 5401 | 44.48 | -73.22 |


#### Beers by State (Choropleth)

In [22]:
state_geo = r'data/us-states.json'

beers_state_map = folium.Map(location=[48, -102], zoom_start=3)
beers_state_map.choropleth(
    geo_data=state_geo,
    data=df_beers_by_state,
    threshold_scale=[6000, 12000, 18000, 24000, 30000, 36000],
    key_on='feature.id',
    columns=['state', 'beers_count'],
    fill_color='OrRd', fill_opacity=1, line_opacity=0.1,
    )

beers_state_map

In [24]:
# Beer production per capita
state_geo = r'data/us-states.json'

beers_per_capita_map = folium.Map(location=[48, -102], zoom_start=3)
beers_per_capita_map.choropleth(
    geo_data=state_geo,
    data=df_beers_per_capita,
    threshold_scale=[100, 200, 300, 400, 500, 600],
    key_on='feature.id',
    columns=['state', 'beers_per_capita'],
    fill_color='OrRd', fill_opacity=1, line_opacity=0.1,
    )

beers_per_capita_map

## Ethics and Privacy

Information regarding beer is considered public information in the United States, and Data.gov provides datasets on beer sales and production. In our project we mostly look at the characteristics of different beers in different states. The datasets we acquired are legal, and we have permission to collect and analyze the information. Explicit permission has been granted under an MIT License to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software” and the data provided.

Regarding privacy, brewing companies publicly release information about their beers for marketing or health regulation purposes. Therefore, there is no privacy concern, as brewing companies must make this information public and available.

Part of our goal is to collect information on as many beers as possible, as this will help us eliminate biases. We promise not to introduce personal biases about beers or to exclude beers for any reason that would affect the ethics of this project. Any impact and issues that result from our analysis will not be made public and are for the purposes of this class and ourselves only. Any issues we identify during the project will be kept within the group and not made public until we have researched how to handle them. We must clarify that we do not endorse the consumption of alcohol, nor do we intend to harm the reputation of any specific beer. The data analyzed is for informative purposes, for beer lovers, and for science.


## Discussion

Our analysis shows that IPAs are popular all over the United States and are by far the most common style in nearly every state, but that Sour Ales and Imperial Stouts are also doing quite well and show regional differences. In particular, Imperial Stouts are popular in the Midwest, while sour beers are doing well in cosmopolitan areas.