## Web Scraping BeerAdvocate.com Using Custom API

In this notebook we scrape data from www.beeradvocate.com using custom HTML parsing methods. The scraped data populates two classes: a Beer class, which stores the stats, info, and reviews of a particular beer, and a User class, which stores the same kind of data for a specific user. We hope to use this data to make novel product recommendations for online users.

This method has some advantages over the first method, which uses $\texttt{wget}$ and $\texttt{grep}$. First, it parses the entire HTML text of a page in one pass to collect all of the data, instead of calling $\texttt{wget}$ several times, which improves efficiency. Second, it extracts much more data with less tedious work, because the parser is built to read all targeted values from any HTML file.
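As a rough illustration of the single-pass idea, the sketch below uses Python's standard `html.parser` module to collect both the text and the links inside a targeted element in one pass over a page. It is not the package's actual parser: the class name `ClassTextParser`, its `targets` argument, and the sample HTML are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextParser(HTMLParser):
    """Collect text and hrefs inside elements whose id or class equals a target value."""

    def __init__(self, targets):
        super().__init__()
        self.targets = set(targets)
        self.depth = 0    # > 0 while we are inside a targeted element
        self.lines = []   # text fragments found inside targeted elements
        self.urls = []    # href values found inside targeted elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth > 0:
            self.depth += 1
        elif attrs.get("id") in self.targets or attrs.get("class") in self.targets:
            # Exact match only; multi-class attributes are not split in this sketch.
            self.depth = 1
        if self.depth > 0 and tag == "a" and "href" in attrs:
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.lines.append(text)


# Hypothetical page fragment, loosely modeled on a beer listing.
sample_html = """
<div id="ba-content">
  <a href="/beer/profile/23222/78820/"><b>Kentucky Brunch Brand Stout</b></a>
  <span>Toppling Goliath Brewing Company</span><br>
</div>
<div id="footer"><a href="/about/">About</a></div>
"""

parser = ClassTextParser(["ba-content"])
parser.feed(sample_html)
print(parser.lines)
print(parser.urls)
```

A single call to `feed` fills both `lines` and `urls`, which is the property that saves repeated downloads of the same page.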

In [1]:
#download custom API for beeradvocate.com
#!pip install Beer_Advocate_API



For full documentation of this package go to https://rawgit.com/aaronplesset/beer_advocate_api/master/genindex.html

In [12]:
import numpy as np
import pandas as pd
#import custom API
import BA
from BA import Beer
from BA import User

In [68]:
#run to inspect contents of custom package
?? BA.Beer

In [21]:
#use scrape_burger from other notebook for easy scraping
def scrape_burger(url = '', top_bun = '', bottom_bun = '', patty = '', napkins=False):
    """
    Scrapes a website's html for text located between the 'top bun' (left side) and 'bottom bun' (right side). 
    Patty argument is for fine tuning the scraping using regex. Lastly, napkins are for cleaning up the mess.
    """
    url = '\''+ url + '\''
    bottom_bun = ')\'' if bottom_bun=='' else '?)(?=' + bottom_bun + ')\''
    burger = '\'(?<=' + top_bun + ')' + patty + '(.*'+ bottom_bun
    if napkins:
        cleanup = '\'(?<=' + napkins + ')' + '(.*)\'' 
        x = !wget -qO - {url} | grep -oP $burger | grep -oP $cleanup
    else:
        x= !wget -qO - {url} | grep -oP $burger 
    #if regex search fails return space character
    if len(x)==0:
        return " "
    return x
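
The same lookbehind and lookahead pattern can be tried in pure Python, without $\texttt{wget}$ or $\texttt{grep}$. This is a minimal sketch, not a drop-in replacement: `extract_between` is a hypothetical helper, it works on a string that is already in memory, and unlike scrape_burger it escapes both buns so they are matched literally rather than as regex. The sample string is made up for the example.

```python
import re


def extract_between(text, top_bun, bottom_bun):
    """Return every shortest run of text found between top_bun and bottom_bun."""
    # Lookbehind for the top bun, lazy match, lookahead for the bottom bun:
    # the same shape of pattern that scrape_burger assembles for grep -oP.
    pattern = "(?<=" + re.escape(top_bun) + ")(.*?)(?=" + re.escape(bottom_bun) + ")"
    return re.findall(pattern, text)


# Hypothetical line of HTML for illustration.
sample = '<a href="/beer/profile/23222/78820/"><b>Kentucky Brunch Brand Stout</b></a>'

beer_ids = extract_between(sample, "/beer/profile/", '"><b>')
beer_names = extract_between(sample, '/"><b>', "</b>")
print(beer_ids)
print(beer_names)
```

Because `.` does not match a newline by default, each match stays on one line, which is close to how $\texttt{grep}$ processes its input line by line.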

---

## Creating Ratings Matrix:

In this chunk we extract many user ratings and create a ratings matrix.

In [3]:
#function is used to convert scraped data to dictionaries
def convert_reviews(reviews):
    reviews=[review for review in reviews if len(review)==7]
    review_dict = {
        'date':[review[0] for review in reviews],
        'beers': [review[1] for review in reviews],
        'brewery':[review[2] for review in reviews],
        'beertype':[review[3] for review in reviews],
        'abv': [review[4] for review in reviews],
        'ratings': [review[5] for review in reviews]
    }
    return review_dict
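
The row-to-column conversion above can also be written as a transposition with `zip`. The sketch below assumes the same layout as convert_reviews (seven fields per complete review, of which the first six are kept); the name `reviews_to_columns` and the sample rows are hypothetical.

```python
FIELDS = ["date", "beers", "brewery", "beertype", "abv", "ratings"]


def reviews_to_columns(reviews, n_fields=7):
    """Turn row-wise review records into a dict of column lists."""
    # Keep only complete records, as convert_reviews does.
    complete = [r for r in reviews if len(r) == n_fields]
    if not complete:
        return {field: [] for field in FIELDS}
    # zip(*rows) transposes rows into columns.
    columns = [list(col) for col in zip(*complete)]
    # Pairing with FIELDS stops after six names, so the seventh column is dropped.
    return dict(zip(FIELDS, columns))


# Made-up rows; the last one is incomplete and gets filtered out.
rows = [
    ["May 27 2018", "Heady Topper", "The Alchemist", "IPA", "8.00%", "4.5", "extra"],
    ["May 20 2018", "Fat Tire Amber Ale", "New Belgium Brewing", "Amber", "5.20%", "3.75", "extra"],
    ["incomplete", "row"],
]
cols = reviews_to_columns(rows)
print(cols["beers"])
print(cols["ratings"])
```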

In [7]:
#extracting users from this beer page 'https://www.beeradvocate.com/beer/profile/23222/78820/'
stout = Beer('/beer/profile/23222/78820/')
users = stout.get_reviews()['usr']
print(f"testing {len(users)} users")

testing 688 users


For each user in the list scraped in the chunk above, we scrape their review data and store it in a large dictionary called all_reviews. This dictionary is later used to create the ratings matrix for the recommendation systems.

In [8]:
%%time
all_reviews={}
#only test certain number of users so code does not run as long
num_users_tested=25
tested_users=users[:num_users_tested]
good_tested_users=tested_users.copy()
for usr in tested_users:
    #print(usr)
    temp_user = User(usr)
    if temp_user.info!='Sorry, either this user is private or does not exist':
        temp_reviews=temp_user.get_reviews(100)
        if len(temp_reviews)==0 or len(temp_reviews[0])!=7:
            good_tested_users.remove(usr)
            print(usr)
        else:
            all_reviews[usr]=convert_reviews(temp_reviews)
    else:
        good_tested_users.remove(usr)

CPU times: user 5.12 s, sys: 100 ms, total: 5.22 s
Wall time: 29 s


In [9]:
all_ratings=all_reviews.copy()

In [13]:
ratings_keys=[k for k in all_ratings.keys()]
isSameLength=all([len(all_ratings[k]['beers'])==len(all_ratings[k]['ratings']) for k in ratings_keys])
print(f"all beer and associated ratings lists have the same length (T/F): {isSameLength}")
print(f"number of users scraped is {len(ratings_keys)}")
numRatings_vec=[len(all_ratings[k]['beers']) for k in ratings_keys]
print(f"total number of ratings scraped is {np.sum(numRatings_vec)}")
print(f"average ratings/user is {np.mean(numRatings_vec)}")

all beer and associated ratings lists have the same length (T/F): True
number of users scraped is 22
total number of ratings scraped is 1510
average ratings/user is 68.63636363636364


In [14]:
#function used to see how many users rated each beer from all_ratings data
def sortedBeerList(all_ratings):
    allbeers=[]
    ratings_keys=[k for k in all_ratings.keys()]
    for k in ratings_keys:
        allbeers+=all_ratings[k]['beers']
    x1,x2=np.unique(allbeers,return_counts=True)
    return sorted(zip(x2,x1),reverse=True)

In [15]:
#function finds n most rated beers from all_ratings data
def findNmostPopularBeers(all_ratings,n=0):
    sortedbeers=sortedBeerList(all_ratings)
    if n==0 or n>=len(sortedbeers):
        n=len(sortedbeers)
    popbeers=np.array([sortedbeers[k][1] for k in range(0,n)])
    counts=np.array([sortedbeers[k][0] for k in range(0,n)])
    return popbeers,counts
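
Counting how often each beer was rated can also be done with `collections.Counter` from the standard library. This is a sketch under the assumption that all_ratings has the nested shape built earlier (user name mapped to a dictionary with a 'beers' list); `most_rated_beers` and the toy data are hypothetical.

```python
from collections import Counter


def most_rated_beers(all_ratings, n=None):
    """Return (beer, count) pairs for the n most rated beers, or all beers if n is None."""
    counts = Counter()
    for user_data in all_ratings.values():
        counts.update(user_data["beers"])
    return counts.most_common(n)


# Toy data in the same shape as all_ratings.
toy_ratings = {
    "user_a": {"beers": ["Heady Topper", "Mornin' Delight"], "ratings": ["4.5", "4.75"]},
    "user_b": {"beers": ["Heady Topper"], "ratings": ["4.25"]},
}
top_one = most_rated_beers(toy_ratings, 1)
print(top_one)
```

Ties are ordered differently here than in sortedBeerList: `most_common` keeps first-seen order among equal counts, whereas the sorted tuples above fall back to reverse alphabetical order.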

In [16]:
#function is used to create ratings matrix that is used for recommendation algorithm

#all_ratings is the user ratings we scraped from beeradvocate.com
#n_beers is the number of beers to use as rows in the matrix (automatically uses most rated beers).
#sparsity_threshold is the minimum number of the n most rated beers a user must have rated to be included.
#increasing sparsity_threshold reduces matrix sparsity, but fewer users are included in the matrix

def create_ratings_matrix(all_ratings,n_beers,sparsity_threshold=1):
    df=pd.DataFrame()
    ratings_keys=[k for k in all_ratings.keys()]
    popular_beers,_=findNmostPopularBeers(all_ratings,n_beers)
    for j in range(0,len(ratings_keys)):
        column=np.array([0.00 for k in range(0,len(popular_beers))])
        for k in range(0,len(all_ratings[ratings_keys[j]]['beers'])):
            if all_ratings[ratings_keys[j]]['beers'][k] in popular_beers:
                beer=all_ratings[ratings_keys[j]]['beers'][k]
                rating=float(all_ratings[ratings_keys[j]]['ratings'][k])
                index=np.where(popular_beers==all_ratings[ratings_keys[j]]['beers'][k])
                column[index]=rating
        if sum(1 for k in column if k>0)>sparsity_threshold-1:
            df[ratings_keys[j]]=column
    df.index=popular_beers
    return df
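
For comparison, the same matrix can be assembled with a pandas pivot instead of nested loops. This is a sketch of an alternative, not the notebook's method: `ratings_matrix_from_dict` and its `min_rated` parameter are hypothetical names, and the toy input only imitates the shape of all_ratings.

```python
import pandas as pd


def ratings_matrix_from_dict(all_ratings, n_beers, min_rated=1):
    """Build a beers x users matrix of ratings, with 0.0 for unrated beers."""
    # Flatten the nested dictionary into one (user, beer, rating) record per rating.
    records = [
        (user, beer, float(rating))
        for user, data in all_ratings.items()
        for beer, rating in zip(data["beers"], data["ratings"])
    ]
    long_df = pd.DataFrame(records, columns=["user", "beer", "rating"])
    # The n_beers most frequently rated beers become the rows.
    top_beers = long_df["beer"].value_counts().head(n_beers).index
    matrix = (
        long_df[long_df["beer"].isin(top_beers)]
        .pivot_table(index="beer", columns="user", values="rating", aggfunc="last")
        .reindex(top_beers)
        .fillna(0.0)
    )
    # Keep only users who rated at least min_rated of the selected beers.
    keep = (matrix > 0).sum(axis=0) >= min_rated
    return matrix.loc[:, keep]


toy_ratings = {
    "user_a": {"beers": ["Heady Topper", "Mornin' Delight"], "ratings": ["4.5", "4.75"]},
    "user_b": {"beers": ["Heady Topper"], "ratings": ["4.25"]},
    "user_c": {
        "beers": ["Heady Topper", "Mornin' Delight", "Fat Tire Amber Ale"],
        "ratings": ["4.0", "4.5", "3.5"],
    },
}
toy_matrix = ratings_matrix_from_dict(toy_ratings, n_beers=2, min_rated=2)
print(toy_matrix)
```

In the toy run, user_b rated only one of the two selected beers and is dropped, which mirrors the role of sparsity_threshold.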

___

Below is the resulting ratings matrix, which can be used for our recommendation algorithms. Reading in more users makes this dataframe larger, but the scraping can be quite time intensive. From our testing, about 70 reviews are scraped per second on average.

In [17]:
df=create_ratings_matrix(all_ratings,n_beers=600,sparsity_threshold=5)
print(df.shape)
df.head()

(600, 21)


Unnamed: 0,Bugsmcl,Humbolt9,Jimmeekrek,StonedTrippin,jsearley3364,Chadlossie,WOLFGANG,fossage78,Brent_B,FocalBanged,...,HoldenDurden,Eidel18,GlennF,Narkee,Zach_Attack,ManBearPat,AMCimpi,DFrisselll,rtaps,Nicholas-Drinks
Kentucky Brunch Brand Stout,5.0,5.0,4.81,0.0,4.91,5.0,5.0,5.0,4.92,5.0,...,4.5,5.0,5.0,4.93,4.93,4.67,4.58,4.25,4.84,5.0
Heady Topper,0.0,0.0,0.0,0.0,4.81,0.0,0.0,0.0,0.0,4.31,...,0.0,0.0,4.79,4.85,4.34,4.69,0.0,0.0,4.5,4.52
Mornin' Delight,4.56,0.0,4.63,0.0,4.91,0.0,0.0,0.0,0.0,4.5,...,0.0,5.0,4.56,0.0,4.58,4.51,0.0,0.0,0.0,0.0
Hunahpu's Imperial Stout - Double Barrel Aged,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,4.75,...,0.0,5.0,5.0,0.0,4.81,3.83,4.79,0.0,0.0,0.0
Barrel-Aged Abraxas,4.91,0.0,5.0,0.0,0.0,5.0,0.0,4.25,0.0,0.0,...,0.0,5.0,4.58,0.0,4.97,0.0,0.0,0.0,4.99,0.0
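
The sparsity that sparsity_threshold is meant to control can be measured directly as the fraction of zero entries. The helper name `sparsity` is hypothetical, and the toy frame reuses the first three rows of the first two user columns shown above.

```python
import pandas as pd


def sparsity(ratings_df):
    """Fraction of entries equal to 0, i.e. beers a user has not rated."""
    return float((ratings_df == 0).to_numpy().mean())


# First three beers for the first two users in the output above.
toy_df = pd.DataFrame(
    {"Bugsmcl": [5.0, 0.0, 4.56], "Humbolt9": [5.0, 0.0, 0.0]},
    index=["Kentucky Brunch Brand Stout", "Heady Topper", "Mornin' Delight"],
)
toy_sparsity = sparsity(toy_df)
print(toy_sparsity)
```

Three of the six toy entries are zero, so the result is 0.5; applying the same function to the full dataframe would show how the threshold changes the share of missing ratings.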


In [18]:
#df.to_csv(path_or_buf="test_ratings_df_6.2.18")

---

## Scraping Popular Beers:

In this chunk we extract info on the most popular beers and store it in a dataframe.

In [19]:
#Extract many beer links from top beer page (or new beer page)
#then use beer links to extract info on many beers and save it

In [22]:
top_beer_url='https://www.beeradvocate.com/lists/top/'
new_beers_url='https://www.beeradvocate.com/beer/new/'
URL_base='/beer/profile/'

beer_scrape=scrape_burger(url=top_beer_url,top_bun='/beer/profile/',bottom_bun='"><b>')
beer_names=scrape_burger(url=top_beer_url,top_bun='/"><b>',bottom_bun='</b></a><div id="extendedInfo">')
top_beer_page_urls=[URL_base+beer_scrape[k] for k in range(0,len(beer_scrape)) if k%2==0]
len(beer_names)==len(top_beer_page_urls)

True

In [23]:
top_beer_page_urls[0]

'/beer/profile/23222/78820/'

In [24]:
beer_names[0]

'Kentucky Brunch Brand Stout'

In [28]:
def createBeerDF(beer_page_urls,topindex=False):
    if topindex==False:
        topindex=len(beer_page_urls)
    all_info={}
    for k in range(0,topindex):
        temp_beer=Beer(url=beer_page_urls[k])
        all_info[temp_beer.get_name()]=temp_beer.info
    all_info_keys=all_info.keys()

    #create one row of data frame at a time
    df=pd.DataFrame(columns=['brewery','state','country','website','style',
                                'abv','availability','description','ranking','num_reviews','num_ratings','num_wants'])
    for k in all_info_keys:
        try:
            if len(all_info[k]['Notes  Commercial Description'])==0:
                notes=''
            else:
                notes=all_info[k]['Notes  Commercial Description'][0]
            if len(all_info[k]['Brewed by'])==3:
                state=''
                country=all_info[k]['Brewed by'][1]
                website=all_info[k]['Brewed by'][2]
            elif len(all_info[k]['Brewed by'])==4:
                state=all_info[k]['Brewed by'][1]
                country=all_info[k]['Brewed by'][2]
                website=all_info[k]['Brewed by'][3]
            else:
                state=''
                country=all_info[k]['Brewed by'][1]
                website=''

            row=[all_info[k]['Brewed by'][0], state,
                country, website, all_info[k]['Style'][0],
                all_info[k]['Alcohol by volume (ABV)'][0], all_info[k]['Availability'][0],
                notes, all_info[k]['Ranking'][0], all_info[k]['Reviews'][0],
                all_info[k]['Ratings'][0], all_info[k]['Wants'][0]
                ]
            df.loc[k]=row 
        except Exception:
            print(f"Not enough info scraped for beer: {k}")
    return df
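
The branching on the length of the 'Brewed by' list can be isolated in a small helper, which makes the three cases easier to test on their own. This is a sketch that mirrors the branches in createBeerDF; `split_brewed_by` is a hypothetical name and the three-element example is made up.

```python
def split_brewed_by(brewed_by):
    """Map a 'Brewed by' list onto named fields, mirroring the cases in createBeerDF."""
    brewery = brewed_by[0]
    if len(brewed_by) == 4:
        # brewery, state, country, website
        _, state, country, website = brewed_by
    elif len(brewed_by) == 3:
        # brewery, country, website (no state listed)
        _, country, website = brewed_by
        state = ""
    else:
        # Fall back to brewery and country only.
        state, country, website = "", brewed_by[1], ""
    return {"brewery": brewery, "state": state, "country": country, "website": website}


# Four-element list as it appears in Beer.info for Kentucky Brunch Brand Stout.
with_state = split_brewed_by(
    ["Toppling Goliath Brewing Company", "Iowa", "United States", "tgbrews.com"]
)
# Hypothetical three-element list for a brewery without a state.
without_state = split_brewed_by(["Some Brewery", "Belgium", "example.com"])
print(with_state)
print(without_state)
```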

In [29]:
%%time
df2=createBeerDF(top_beer_page_urls,50)

CPU times: user 4.41 s, sys: 91 ms, total: 4.5 s
Wall time: 16.6 s


___

This dataframe can be used to perform exploratory data analysis on popular beers on the website. A variety of interesting information is stored in it. For instance, one could group by beer style and examine the most popular words in the beer descriptions.
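
A minimal sketch of that grouping idea follows, assuming a dataframe with 'style' and 'description' columns like the one built above. The function name `top_words_by_style` and the toy descriptions are hypothetical, and no stop-word filtering is attempted.

```python
import re
from collections import Counter

import pandas as pd


def top_words_by_style(df, n=3):
    """Return the n most common description words for each beer style."""
    result = {}
    for style, group in df.groupby("style"):
        words = Counter()
        for text in group["description"]:
            # Lowercase and keep runs of letters and apostrophes as words.
            words.update(re.findall(r"[a-z']+", text.lower()))
        result[style] = words.most_common(n)
    return result


# Made-up descriptions for illustration.
toy_beers = pd.DataFrame(
    {
        "style": ["Stout", "Stout", "IPA"],
        "description": ["coffee and chocolate", "coffee barrel aged", "hops hops citrus"],
    }
)
top_words = top_words_by_style(toy_beers, n=1)
print(top_words)
```

On real descriptions, common words such as "the" and "and" would dominate, so a stop-word list would likely be needed before the counts become informative.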

In [30]:
print(df2.shape)
df2.head()

(50, 12)


Unnamed: 0,brewery,state,country,website,style,abv,availability,description,ranking,num_reviews,num_ratings,num_wants
Kentucky Brunch Brand Stout,Toppling Goliath Brewing Company,Iowa,United States,tgbrews.com,American Double Imperial Stout,12.00%,Rotating,This beer is the real McCoy. Barrel aged and c...,#1,132,689,3700
Heady Topper,The Alchemist Brewery and Visitors Center,Vermont,United States,alchemistbeer.com,American Double Imperial IPA,8.00%,Year-round,"""An American Double IPA"" 75 IBU 8.0% ABV. ""Dri...",#2,2459,14098,9439
Barrel-Aged Abraxas,Perennial Artisan Ales,Missouri,United States,perennialbeer.com,American Double Imperial Stout,11.00%,Rotating,Imperial Stout aged Twelve months in Rittenhou...,#3,142,1412,2537
Marshmallow Handjee,3 Floyds Brewing Co.,Indiana,United States,3floyds.com,Russian Imperial Stout,15.00%,Spring,Dark Lord Russian Imperial Stout aged in a var...,#4,317,1589,4491
Hunahpu's Imperial Stout - Double Barrel Aged,Cigar City Brewing,Florida,United States,cigarcitybrewing.com,American Double Imperial Stout,11.00%,Rotating,Stout aged on Peruvian cacao nibs ancho and pa...,#5,154,1562,1708
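
All of the scraped values are strings, so numeric analysis needs a conversion step first. The sketch below assumes the column names used by createBeerDF; `clean_numeric_columns` is a hypothetical helper, and stripping commas is a precaution rather than something the output above shows to be necessary.

```python
import pandas as pd


def clean_numeric_columns(df):
    """Return a copy with abv, ranking, and the count columns converted to numbers."""
    out = df.copy()
    # ' 12.00%' -> 12.0
    out["abv"] = pd.to_numeric(out["abv"].str.strip().str.rstrip("%"))
    # '#1' -> 1
    out["ranking"] = pd.to_numeric(out["ranking"].str.strip().str.lstrip("#"))
    for col in ["num_reviews", "num_ratings", "num_wants"]:
        out[col] = pd.to_numeric(out[col].str.replace(",", "", regex=False))
    return out


# Two rows copied from the output above, kept as strings the way they are scraped.
raw = pd.DataFrame(
    {
        "abv": [" 12.00%", "8.00%"],
        "ranking": ["#1", "#2"],
        "num_reviews": ["132", "2459"],
        "num_ratings": ["689", "14098"],
        "num_wants": ["3700", "9439"],
    },
    index=["Kentucky Brunch Brand Stout", "Heady Topper"],
)
cleaned = clean_numeric_columns(raw)
print(cleaned.dtypes)
```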


In [31]:
df2.iloc[6]['description']

'Pliny the Younger the man was Pliny the Elder’s nephew and adopted son. They lived nearly 2000 years ago! Pliny the Elder is our Double IPA so we felt it was fitting to name our Triple IPA after his son. It is almost a true Triple IPA with triple the amount of hops as a regular I.P.A. That said it is extremely difficult time and space consuming and very expensive to make. And that is why we don’t make it more often! This beer is very full-bodied with tons of hop character in the nose and throughout. It is also deceptively well-balanced and smooth.'

In [32]:
#df2.to_csv(path_or_buf="top_beer_df_6.2.18")

---

## Scraping Beers from All Styles:

In this chunk we find popular beers from each style and build a large dataframe of their info. This is similar to the section above, except that we now extract beers from every listed style instead of from the overall popular beer page.

In [33]:
#loop through beer styles on https://www.beeradvocate.com/beer/style/
#then scrape each beer link on each beer style page

In [46]:
#findBeers goes through all beer styles and extracts the first 50 most popular beers
#returns dictionary where each beer style is linked with list of beer urls
import requests

def findBeers(search_url='https://www.beeradvocate.com/beer/style/'):
    #maps each beer style name to its list of beer urls
    style_to_urls={}
    
    #parse the style index page to collect the link to every style page
    main_html=requests.get(search_url).text
    parser=BA.beer.ba_parser(vals=['ba-content'],save_urls=True)
    parser.feed(main_html)
    parser.clean_lines()
    
    base_url='https://www.beeradvocate.com'
    style_urls=[base_url+end for end in parser.urls]
    for url in style_urls:
        style_html=requests.get(url).text
        style_parser=BA.beer.ba_parser(attributes=['id','class'],vals=['titleBar','ba-content'],save_urls=True)
        style_parser.feed(style_html)
        style_parser.clean_lines()
        info=style_parser.lines
        #extract name as beer_style
        beer_style=info[0]
        #extract beer_urls, keeping every other profile link
        beer_urls=[end for end in style_parser.urls if 'profile' in end]
        beer_urls=[beer_urls[k] for k in range(0,len(beer_urls)) if k%2==0]
        style_to_urls[beer_style]=beer_urls
    return style_to_urls
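
Selecting every other profile link depends on the order in which links appear on the page. A pattern-based filter is one possible alternative, sketched below. It rests on an assumption the notebook does not confirm: that the skipped links are brewery pages of the form `/beer/profile/<brewery_id>/`, while beer pages carry two numeric segments as in the URLs shown elsewhere in this notebook. `keep_beer_urls` is a hypothetical helper, and unlike the original it also removes duplicates.

```python
import re

# Assumed shape of a beer page link: /beer/profile/<brewery_id>/<beer_id>/
BEER_URL = re.compile(r"^/beer/profile/\d+/\d+/$")


def keep_beer_urls(urls):
    """Keep links that look like beer pages, preserving order and dropping repeats."""
    seen = set()
    kept = []
    for url in urls:
        if BEER_URL.match(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# The brewery-style links here are assumed for illustration.
mixed_urls = [
    "/beer/profile/192/607/",
    "/beer/profile/192/",
    "/beer/profile/694/15881/",
    "/beer/profile/694/",
    "/beer/profile/192/607/",
]
beer_only = keep_beer_urls(mixed_urls)
print(beer_only)
```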

In [47]:
%%time
beer_dict=findBeers()

CPU times: user 8.42 s, sys: 149 ms, total: 8.57 s
Wall time: 29.3 s


In [48]:
beer_dict_keys=[key for key in beer_dict.keys()]
all_beer_urls=[]
some_beer_urls=[]
for key in beer_dict_keys:
    all_beer_urls+=beer_dict[key]
    some_beer_urls+=beer_dict[key][:10]
print(len(all_beer_urls))
print(len(some_beer_urls))
all_beer_urls[:3]

5171
1040


['/beer/profile/192/607/',
 '/beer/profile/694/15881/',
 '/beer/profile/2743/35732/']

In [49]:
%%time
#can be time intensive based on how many beers are scraped
df3=createBeerDF(some_beer_urls)

CPU times: user 1min 33s, sys: 1.5 s, total: 1min 34s
Wall time: 6min 59s


In [50]:
print(df3.shape)
df3.head()

(1028, 12)


Unnamed: 0,brewery,state,country,website,style,abv,availability,description,ranking,num_reviews,num_ratings,num_wants
Fat Tire Amber Ale,New Belgium Brewing,Colorado,United States,newbelgium.com,American Amber Red Ale,5.20%,Year-round,No notes at this time.,#40195,2080,8898,248
Nugget Nectar,Tröegs Brewing Company,Pennsylvania,United States,troegs.com,American Amber Red Ale,7.50%,Spring,Squeeze those hops for all they're worth! Nugg...,#451,2573,8657,1430
Hop Head Red Ale,Green Flash Brewing Co.,California,United States,greenflashbrew.com,American Amber Red Ale,8.10%,Year-round,In 2011 the recipe was altered to bump the IBU...,#5041,965,3422,178
Amber Ale,Bell's Brewery - Eccentric Café & General Store,Michigan,United States,bellsbeer.com,American Amber Red Ale,5.80%,Year-round,The beer that helped build our brewery; Bell’s...,#12977,993,3192,168
Hopback Amber Ale,Tröegs Brewing Company,Pennsylvania,United States,troegs.com,American Amber Red Ale,6.00%,Year-round,Standing 12 ft. tall at the center of the brew...,#5046,1176,3124,133


In [51]:
#to save dataframe for analysis

#df3.to_csv(path_or_buf="large_beer_df_6.2.18")

---

## Testing Chunk:

Below we test out different User/Beer class functionalities. Feel free to test them out yourself!

In [77]:
test = Beer('/beer/profile/23222/78820/')
test.info

{'Brewed by': ['Toppling Goliath Brewing Company',
  'Iowa',
  'United States',
  'tgbrews.com'],
 'Style': ['American Double  Imperial Stout'],
 'Alcohol by volume (ABV)': [' 12.00%'],
 'Availability': [' Rotating'],
 'Notes  Commercial Description': ['This beer is the real McCoy. Barrel aged and crammed with coffee none other will stand in it’s way. Sought out for being delicious it is notoriously difficult to track down. If you can find one shoot to kill because it is definitely wanted... dead or alive.',
  'Added by siradmiralnelson on 02-26-2012'],
 'Ranking': ['#1'],
 'Reviews': ['132'],
 'Ratings': ['689'],
 'Bros Score': ['0'],
 'Wants': ['3701'],
 'Gots': ['103'],
 'Trade': ['5']}

In [78]:
test.get_name()

'Kentucky Brunch Brand Stout'

In [61]:
%%time
iron_rat_stout = Beer('/beer/profile/23222/78820/')
reviews = iron_rat_stout.get_reviews()

CPU times: user 2.57 s, sys: 54.2 ms, total: 2.62 s
Wall time: 10.1 s


In [67]:
reviews['review'][:3]

[['5',
  '5',
  'rDev ',
  '+3.3%',
  'look: 5 | smell: 5 | taste: 5 | feel: 5 |  overall: 5',
  'Jmonah3',
  'Yesterday at 09:19 PM',
  '5',
  '5',
  'rDev ',
  '+3.3%',
  'look: 5 | smell: 5 | taste: 5 | feel: 5 |  overall: 5'],
 ['4.85',
  '5',
  'rDev ',
  '+0.2%',
  'look: 5 | smell: 5 | taste: 4.75 | feel: 4.5 |  overall: 5'],
 ['5',
  '5',
  'rDev ',
  '+3.3%',
  'look: 5 | smell: 5 | taste: 5 | feel: 5 |  overall: 5',
  'A buddy of mine called me one night and said that I needed to come over and try this new beer he found. Man am I glad I did because it was and is the best stout I have ever had!',
  '177 characters']]
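
Each review list above contains a line of sub-scores such as 'look: 5 | smell: 5 | taste: 4.75 | feel: 4.5 |  overall: 5'. The sketch below turns such a line into a dictionary of floats. Both helper names are hypothetical, and the code assumes the line always starts with 'look:' as it does in the three reviews shown.

```python
def parse_subscores(line):
    """Parse 'look: 5 | smell: 5 | ...' into a dict of floats."""
    scores = {}
    for part in line.split("|"):
        name, _, value = part.partition(":")
        scores[name.strip()] = float(value)
    return scores


def find_subscores(review):
    """Return the parsed sub-scores of one review list, or None if no such line exists."""
    for item in review:
        if item.startswith("look:"):
            return parse_subscores(item)
    return None


# Second review from the output above.
review = [
    "4.85",
    "5",
    "rDev ",
    "+0.2%",
    "look: 5 | smell: 5 | taste: 4.75 | feel: 4.5 |  overall: 5",
]
subscores = find_subscores(review)
print(subscores)
```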

In [71]:
User('mattsmith413').info

{'Last Activity': 'May 27 2018 at 7:32 PM',
 'May 27 2018 at 7:32 P': 'Joined:',
 'Joined': 'Jul 21 2009',
 'Beer Karma': '445',
 'Beers': '759',
 'Places': '17',
 'Posts': '29',
 'Likes Received': '5'}

In [84]:
??Beer

In [80]:
??BA.tools.ba_parser