## Web Scraping BeerAdvocate via wget and grep:

In this notebook we scrape data from www.beeradvocate.com using $\texttt{wget}$ and $\texttt{grep}$. The scraped data populates a Beer class and a User class. The Beer class stores the stats, info, and ratings of a particular beer; the User class stores the same kind of data for a specific user. We hope to use this data to make novel product recommendations for online users.

In [18]:
#import the required packages
import numpy as np
import pandas as pd

___

## The All-Mighty Burger

The $\text{scrape_burger}$( ) function is the core of this web scraping method. It takes a url and searches the page's html text for whatever sits between two chosen markers. For example, say one line in the html file is

$\text{"<dl><dt>Birthday:</dt> <dd>Jan 29, 1977 (Age: 41)</dd></dl>"}$

and we are interested in extracting the user's birthday. To extract this info we can call the burger function with the following parameters:

scrape_burger(url='someurl' , top_bun='Birthday', bottom_bun = '$\text{</dd>}$', napkins='$\text{<dd>}$')

The top_bun and bottom_bun parameters specify that any text in between these buns should be scraped. This leads to the following string

"$\text{:</dt> <dd>Jan 29, 1977 (Age: 41)}$"

The napkins parameter is then used to "clean up" the string above so that only the text after the napkins flag is kept. This leads to the final string, which is what we hoped to extract:


"$\text{Jan 29, 1977 (Age: 41)}$"
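
The two steps above can be reproduced offline with Python's `re` module. This is only a sketch of the pattern that scrape_burger assembles for `grep -oP`; the helper name `burger_regex` is hypothetical, and Python's lookbehind is stricter than PCRE's (it must be fixed-width), so not every top_bun used later in this notebook would carry over unchanged.

```python
import re

def burger_regex(top_bun, bottom_bun="", patty=""):
    """Build the lookbehind/lookahead pattern described above (hypothetical helper)."""
    if bottom_bun == "":
        # no bottom bun: keep everything after the top bun
        return "(?<=" + top_bun + ")" + patty + "(.*)"
    # lazy match so we stop at the first bottom bun
    return "(?<=" + top_bun + ")" + patty + "(.*?)(?=" + bottom_bun + ")"

line = "<dl><dt>Birthday:</dt> <dd>Jan 29, 1977 (Age: 41)</dd></dl>"

# step 1: text between the buns
between_buns = re.search(burger_regex("Birthday", "</dd>"), line).group(0)
# step 2: napkins keep only what follows the '<dd>' flag
cleaned = re.search("(?<=<dd>)(.*)", between_buns).group(0)

print(between_buns)  # :</dt> <dd>Jan 29, 1977 (Age: 41)
print(cleaned)       # Jan 29, 1977 (Age: 41)
```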

We call this function in many different ways to scrape the data of interest from the website. The method has some downsides. First, it can be tedious to find the parameters in the html text that isolate the data of interest. Second, it is somewhat inefficient, since we run $\texttt{wget}$ and $\texttt{grep}$ on the same webpage multiple times to extract different pieces of data; it would be quicker to download and parse the html only once. Third, the method relies on beer and user pages on BeerAdvocate having a uniform structure. If pages differ even slightly, the data could be scraped incorrectly. However, this issue has not come up in our testing so far.
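
One way to avoid the repeated downloads would be to fetch a page once and run every pattern against the cached text. The sketch below is not what this notebook does: `fetch_html` and `scrape_many` are hypothetical helpers, the sample html is made up for illustration, and the buns are treated as literal strings (via `re.escape`) rather than as raw regex fragments.

```python
import re
from urllib.request import Request, urlopen

def fetch_html(url):
    """Download a page a single time (hypothetical helper, not used by the notebook)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

def scrape_many(html, specs):
    """Apply several (top_bun, bottom_bun) pairs to one already-downloaded page."""
    results = {}
    for name, (top_bun, bottom_bun) in specs.items():
        pattern = "(?<=" + re.escape(top_bun) + ")(.*?)(?=" + re.escape(bottom_bun) + ")"
        results[name] = re.findall(pattern, html)
    return results

# made-up html fragment; on a real page this would be fetch_html(url)
sample_html = '<span class="ba-ravg">4.84</span> <span class="ba-ratings">688</span>'
fields = scrape_many(sample_html, {
    "avg_rating": ('"ba-ravg">', "<"),
    "num_ratings": ('"ba-ratings">', "<"),
})
print(fields)
```

With this layout each additional field costs one regex search instead of one extra download.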

In [2]:
def scrape_burger(url = '', top_bun = '', bottom_bun = '', patty = '', napkins=False):
    """
    Scrapes a website's html for text located between the 'top bun' (left side) and 'bottom bun' (right side). 
    Patty argument is for fine tuning the scraping using regex. Lastly, napkins are for cleaning up the mess.
    """
    url = '\''+ url + '\''
    bottom_bun = ')\'' if bottom_bun=='' else '?)(?=' + bottom_bun + ')\''
    burger = '\'(?<=' + top_bun + ')' + patty + '(.*'+ bottom_bun
    if napkins:
        cleanup = '\'(?<=' + napkins + ')' + '(.*)\'' 
        x = !wget -qO - {url} | grep -oP $burger | grep -oP $cleanup
    else:
        x= !wget -qO - {url} | grep -oP $burger 
    #if regex search fails return space character
    if len(x)==0:
        return " "
    return x

---

## Beer Class

The Beer class has the following attributes and functions:

Beer.url - the associated url of the beer (passed in as a parameter to initialize the class)

Beer.avg_rating - the average rating of the beer on a 5-point scale

Beer.stats - relevant beer stats including its ranking, number of reviews, and number of ratings

Beer.info - relevant beer information including its brewery, style, ABV (alcohol by volume), and availability

Beer.last_page - an attribute used to find the last page of reviews

Beer.get_ratings() - scrape specified number of ratings (default is all ratings) for beer


In [17]:
class Beer:
    def __init__(self, url):
        self.url = url
        self.avg_rating = self.get_score()
        self.stats = self.get_stats()
        #round the rating count up to a multiple of 25 (ratings per page) so the final partial page is included
        n_ratings = int(self.stats['NumRatings'].replace(",",""))
        self.last_page = -(-n_ratings // 25) * 25
        self.info=self.get_info()
        
    def get_score(self):
        score = scrape_burger(self.url,
                             top_bun = '"ba-ravg">',
                             bottom_bun = '<')
        return float(score[0])
    def get_stats(self):
        stats = {
            'Ranking':scrape_burger(self.url, 
                                    top_bun = '<dd>#',
                                    bottom_bun = '<')[0],
            'NumReviews': scrape_burger(self.url, 
                                    top_bun = 'ba-reviews">',
                                    bottom_bun = '<')[0],
            'NumRatings': scrape_burger(self.url, 
                                    top_bun = '"ba-ratings">',
                                    bottom_bun = '<')[0],
            'Pdev': scrape_burger(self.url, 
                                    top_bun = 'psDev: ',
                                    bottom_bun = '%')[0],
        }
        return stats 
    def get_info(self):
        info={
            'BrewedBy':scrape_burger(self.url, 
                              top_bun = 'itemprop="title">',
                             bottom_bun = '</span>',
                             patty='(?!<)(?!Beers)')[0],
            'Style':scrape_burger(self.url, 
                              top_bun = 'Style:<',
                             bottom_bun = '</b>',
                             napkins='<b>')[0],
            'ABV':scrape_burger(self.url, 
                              top_bun = '\(ABV\):</b> ',
                             bottom_bun = '%')[0],
            'Availability':scrape_burger(self.url, 
                              top_bun = 'Availability:</b> ',
                             bottom_bun = '')[0]
        }
        return info
    def get_ratings(self, n_most_recent = 0):
        """
        Returns a python dictionary of all usernames of users who have rated the beer and their corresponding rating
        """ 
        n_most_recent = (n_most_recent//25)*25 if n_most_recent != 0 else self.last_page
        names = []
        ratings = []
        for page_num in range(0,n_most_recent, 25):
            names += scrape_burger(self.url + '?view=beer&sort=&start=' + str(page_num),
                              top_bun = 'username">',
                              patty = '(?!<)(?!Place Admin)',
                              bottom_bun = '<')
            ratings += scrape_burger(self.url + '?view=beer&sort=&start=' + str(page_num),
                                 top_bun = 'BAscore_norm">',
                                 bottom_bun = '<')

        all_ratings = {'usr': names, 'rating': ratings}
        return all_ratings
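
The paging arithmetic behind get_ratings can be checked without touching the website. A minimal sketch, assuming each beer page lists 25 ratings and that the `start` query parameter is the zero-based offset of the first rating on a page (both inferred from the code above); `page_starts` is a hypothetical helper, not part of the Beer class.

```python
def page_starts(num_ratings_text, per_page=25):
    """Offsets of every page needed to cover all ratings (hypothetical helper)."""
    # the scraped count is a string and may contain thousands separators, e.g. '1,250'
    n_ratings = int(num_ratings_text.replace(",", ""))
    return list(range(0, n_ratings, per_page))

starts = page_starts("688")
print(len(starts), starts[-1])  # 28 pages, the last one starting at offset 675
```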

---

## User Class

The User class has the following attributes and functions:

User.url - the associated url of the user (passed in as a parameter to initialize the class)

User.username - the user's username

User.stats - relevant user stats including his/her beer karma, number of ratings, number of posts, and number of likes

User.info - relevant user information including the user's gender, birthday, and location

User.last_page - an attribute used to find the last page of reviews

User.get_ratings() - scrape specified number of ratings (default is all ratings) for user

In [4]:
class User:
    def __init__(self, url,noInitialize=False):
        self.url = url
        self.username=self.get_username()
        #in case profile is private and info cannot be accessed
        if self.username==" " or noInitialize:
            return
        self.info=self.get_info()
        self.stats=self.get_stats()
        #round the rating count up to a multiple of 50 (ratings per page) so the final partial page is included
        n_ratings = int(self.stats['NumRatings'].replace(",",""))
        self.last_page = -(-n_ratings // 50) * 50
    def get_username(self):
        username = scrape_burger(self.url,
                             top_bun = 'class="username">',
                             bottom_bun = '</h1',
                           )[0]
        return username
    def get_info(self):
        info={
            'Gender':scrape_burger(self.url, 
                              top_bun = 'Gender',
                             bottom_bun = '</dd>',
                             napkins='d>')[0],
            'Birthday':scrape_burger(self.url, 
                              top_bun = 'Birthday',
                             bottom_bun = '</dd>',
                             napkins='d>')[0],
            'Location':scrape_burger(self.url, 
                              top_bun = '<dt>Location',
                             bottom_bun = '" target',
                             napkins='location='       )[0],
        }
        return info
    def get_stats(self):
        stats = {
            'BeerKarma':scrape_burger(self.url, 
                                    top_bun = 'Beer Karma',
                                    bottom_bun = '</b></a>',
                                    napkins='<b>' )[0],
            'NumRatings':scrape_burger(self.url, 
                                    top_bun = 'Beers:</b></dt>',
                                    bottom_bun = ' \|',
                                    napkins='<b>' )[0],
            'NumPosts':scrape_burger(self.url, 
                                    top_bun = '<dd>Posts: ',
                                    bottom_bun = ' \|')[0],
            'NumLikes':scrape_burger(self.url, 
                                    top_bun = '\| Likes Received: ',
                                    bottom_bun = '')[0]
        }
        return stats
    def get_ratings(self, n_most_recent = 0):
        """
        Returns a python dictionary of all beers and corresponding rating. Use 'beers' or 'ratings' to index
        this data from the dictionary.
        """ 
        n_most_recent = (n_most_recent//50)*50 if (n_most_recent != 0 and n_most_recent>=50) else self.last_page
        beers = []
        ratings = []
        urlbase='https://www.beeradvocate.com/user/beers/?start='
        urlend='&ba='+self.username+'&order=dateD&view=R'
        for page_num in range(0,n_most_recent, 50):
            beers += scrape_burger(urlbase + str(page_num) + urlend,
                              top_bun = 'review"><b>',
                              bottom_bun = '</b>')
            ratings += scrape_burger(urlbase + str(page_num) + urlend,
                                 top_bun = '#F7F7F7"><b>|valign="top" ><b>',
                                 #patty = '(?!<)(?!Place Admin)',
                                 bottom_bun = '</b>')

        all_ratings = {'beers': beers, 'ratings': ratings}
        return all_ratings

---

## Data Extraction

Now that the Beer and User classes are defined, we use them to extract the data we want to analyze: a large set of user ratings from which a ratings matrix can be built.

find_userurls() extracts a large list of user urls from multiple beer pages. Each url can be used to create a User object, and from these User objects we can then extract a large amount of rating data.

In [19]:
def find_userurls(beerurls,n_recent_from_beer=0):
    all_userurls=[]
    for beerurl in beerurls:
        beer=Beer(beerurl)
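        #number of ratings to scan for this beer: a multiple of 25, or all ratings when 0 is passed
        #(kept in a local variable so one beer's last_page is not reused for the next beer)
        n_recent = (n_recent_from_beer//25)*25 if n_recent_from_beer != 0 else beer.last_page
        userurls = []
        for page_num in range(0,n_recent, 25):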
            userurls += scrape_burger(beerurl + '?view=beer&sort=&start=' + str(page_num),
                              top_bun = 'community/members/',
                              bottom_bun = '" class="username"')
        userurls=['https://www.beeradvocate.com/community/members/'+userurls[k] for k in range(0,len(userurls)) if k%2==0]
        all_userurls+=userurls
    return all_userurls
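
The `k%2==0` filter in find_userurls suggests that each member link is matched twice in a row on a ratings page. The sketch below illustrates that step on a made-up list; the doubled entries are an assumption about the page layout, and the order-preserving de-duplication with `dict.fromkeys` is an alternative that does not depend on the matches arriving in pairs.

```python
base = "https://www.beeradvocate.com/community/members/"

# made-up scrape result in which every member link appears twice in a row
scraped = ["jmonah3.887293/", "jmonah3.887293/",
           "bugsmcl.1140386/", "bugsmcl.1140386/"]

# what the notebook does: keep every other match
every_other = [base + s for s in scraped[::2]]

# alternative: drop repeats while preserving first-seen order
deduped = [base + s for s in dict.fromkeys(scraped)]

print(every_other == deduped)
```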

In [22]:
beerurls=['https://www.beeradvocate.com/beer/profile/23222/78820/',
         'https://www.beeradvocate.com/beer/profile/46317/16814/']
user_urls=find_userurls(beerurls,n_recent_from_beer=50)
print(f"number of user urls is {len(user_urls)}")

number of user urls is 100


In [23]:
user_urls[:5]

['https://www.beeradvocate.com/community/members/jmonah3.887293/',
 'https://www.beeradvocate.com/community/members/bugsmcl.1140386/',
 'https://www.beeradvocate.com/community/members/aerizel.957865/',
 'https://www.beeradvocate.com/community/members/humbolt9.1232683/',
 'https://www.beeradvocate.com/community/members/jimmeekrek.716589/']

In the chunk below we automatically scrape a certain number of beer ratings from each user in the user_urls list. This is the most time-consuming part, as it scrapes the most data. The scraped data is stored in the dictionary all_ratings. Note that some users are not scraped because their profiles are private or removed.

In [8]:
%%time
num_ratings_from_user=200
all_ratings={}
for userurl in user_urls:
    person=User(userurl,noInitialize=True)
    if person.username!=" ":
        ratings=person.get_ratings(num_ratings_from_user)
        all_ratings[person.username]=ratings

CPU times: user 2.23 s, sys: 3.03 s, total: 5.26 s
Wall time: 2min 58s


In [9]:
ratings_keys=[k for k in all_ratings.keys()]
isSameLength=all([len(all_ratings[k]['beers'])==len(all_ratings[k]['ratings']) for k in ratings_keys])
print(f"all beer and associated ratings lists have the same length (T/F): {isSameLength}")
print(f"number of users scraped is {len(ratings_keys)}")
numRatings_vec=[len(all_ratings[k]['beers']) for k in ratings_keys]
print(f"total number of ratings scraped is {np.sum(numRatings_vec)}")
print(f"average ratings/user is {np.mean(numRatings_vec)}")

all beer and associated ratings lists have the same length (T/F): True
number of users scraped is 94
total number of ratings scraped is 9045
average ratings/user is 96.22340425531915


Now that we have successfully scraped the ratings data, we construct the ratings matrix that will be used for analysis. We first count how many users rated each beer and sort the beers by that count, because building the matrix from the most rated beers reduces sparsity. Finally, we construct the ratings matrix from a specified number of the most rated beers.

In [11]:
#function used to see how many users rated each beer from all_ratings data
def sortedBeerList(all_ratings):
    allbeers=[]
    ratings_keys=[k for k in all_ratings.keys()]
    for k in ratings_keys:
        allbeers+=all_ratings[k]['beers']
    x1,x2=np.unique(allbeers,return_counts=True)
    return sorted(zip(x2,x1),reverse=True)

In [12]:
#function finds n most rated beers from all_ratings data
def findNmostPopularBeers(all_ratings,n=0):
    sortedbeers=sortedBeerList(all_ratings)
    if n==0 or n>=len(sortedbeers):
        n=len(sortedbeers)-1
    popbeers=np.array([sortedbeers[k][1] for k in range(0,n+1)])
    counts=np.array([sortedbeers[k][0] for k in range(0,n+1)])
    counts=np.delete(counts,np.where(popbeers==' '))
    popbeers=np.delete(popbeers,np.where(popbeers==' '))
    return popbeers,counts
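
The counting step can also be illustrated on a toy dictionary with the same layout as all_ratings. This sketch uses `collections.Counter` instead of `np.unique`; the users and ratings are made up, and the `' '` entry stands in for the placeholder that scrape_burger returns when a search fails.

```python
from collections import Counter

# toy data in the same {'beers': [...], 'ratings': [...]} layout as all_ratings
toy_ratings = {
    "user_a": {"beers": ["Heady Topper", "Julius"], "ratings": ["4.5", "4.75"]},
    "user_b": {"beers": ["Heady Topper", "Focal Banger"], "ratings": ["4.25", "4.5"]},
    "user_c": {"beers": ["Heady Topper", "Julius", " "], "ratings": ["5", "4", " "]},
}

# count how many users rated each beer, skipping the ' ' placeholder
counts = Counter(beer
                 for user in toy_ratings.values()
                 for beer in user["beers"]
                 if beer != " ")

top_two = counts.most_common(2)
print(top_two)  # [('Heady Topper', 3), ('Julius', 2)]
```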

In [13]:
#function is used to create ratings matrix that is used for recommendation algorithm

#all_ratings is the user ratings we scraped from beeradvocate.com
#n_beers is the number of beers to use as rows in the matrix (automatically uses most rated beers).
#sparsity_threshold is the minimum number of the n most rated beers a user must have rated to be included.
#increasing sparsity_threshold reduces matrix sparsity, but fewer users are included in the matrix

def create_ratings_matrix(all_ratings,n_beers,sparsity_threshold=1):
    df=pd.DataFrame()
    ratings_keys=[k for k in all_ratings.keys()]
    popular_beers,_=findNmostPopularBeers(all_ratings,n_beers)
    for j in range(0,len(ratings_keys)):
        column=np.array([0.00 for k in range(0,len(popular_beers))])
        for k in range(0,len(all_ratings[ratings_keys[j]]['beers'])):
            if all_ratings[ratings_keys[j]]['beers'][k] in popular_beers:
                beer=all_ratings[ratings_keys[j]]['beers'][k]
                rating=float(all_ratings[ratings_keys[j]]['ratings'][k])
                index=np.where(popular_beers==all_ratings[ratings_keys[j]]['beers'][k])
                column[index]=rating
        if sum(1 for k in column if k>0)>sparsity_threshold-1:
            df[ratings_keys[j]]=column
    df.index=popular_beers
    return df
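
To see what sparsity_threshold does, here is a pure-Python sketch of the same filtering logic on toy data. `build_matrix` is a hypothetical stand-in for create_ratings_matrix that returns a plain dictionary of columns, and the users and ratings are made up.

```python
def build_matrix(all_ratings, beers, sparsity_threshold=1):
    """Return {username: [rating per beer]}, keeping users with enough nonzero ratings."""
    matrix = {}
    for user, data in all_ratings.items():
        # beer -> rating string for this user; unrated beers default to 0.0
        lookup = dict(zip(data["beers"], data["ratings"]))
        column = [float(lookup.get(beer, 0.0)) for beer in beers]
        if sum(1 for value in column if value > 0) >= sparsity_threshold:
            matrix[user] = column
    return matrix

toy_ratings = {
    "user_a": {"beers": ["Heady Topper", "Julius"], "ratings": ["4.5", "4.75"]},
    "user_b": {"beers": ["Heady Topper", "Focal Banger"], "ratings": ["4.25", "4.5"]},
}
beers = ["Heady Topper", "Julius"]

loose = build_matrix(toy_ratings, beers, sparsity_threshold=1)
strict = build_matrix(toy_ratings, beers, sparsity_threshold=2)
print(loose)
print(strict)
```

With a threshold of 2, user_b is dropped because only one of the two selected beers was rated. Passing the result to `pd.DataFrame(strict, index=beers)` should give the same beers-by-users layout used in this notebook.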

In [14]:
df=create_ratings_matrix(all_ratings,n_beers=600,sparsity_threshold=5)
print(f"shape of dataframe is {df.shape}")
df.head()

(600, 81)


| | Jmonah3 | Bugsmcl | Humbolt9 | Jimmeekrek | jsearley3364 | Chadlossie | WOLFGANG | fossage78 | Brent_B | FocalBanged | ... | MGoeltz | NRMeadmore | JeannieVolpe | TheBeardedMeemo | justinv29 | jsr1961 | Beerman811 | goalie35 | HippieDave56 | dclott |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Heady Topper | 0.0 | 0.0 | 0.0 | 0.0 | 4.81 | 4.5 | 0.0 | 0.0 | 0.0 | 4.31 | ... | 4.78 | 4.89 | 1.21 | 4.81 | 4.91 | 4.71 | 4.16 | 4.68 | 4.46 | 3.47 |
| Kentucky Brunch Brand Stout | 5.0 | 5.0 | 5.0 | 4.81 | 4.91 | 5.0 | 5.0 | 5.0 | 4.92 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Julius | 4.88 | 0.0 | 0.0 | 4.87 | 4.35 | 4.5 | 0.0 | 0.0 | 0.0 | 4.5 | ... | 0.0 | 4.58 | 0.0 | 0.0 | 5.0 | 4.48 | 0.0 | 0.0 | 0.0 | 4.8 |
| Focal Banger | 4.75 | 4.49 | 0.0 | 4.6 | 0.0 | 4.5 | 4.33 | 4.5 | 0.0 | 0.0 | ... | 0.0 | 4.52 | 0.0 | 0.0 | 4.62 | 4.41 | 0.0 | 0.0 | 0.0 | 0.0 |
| KBS (Kentucky Breakfast Stout) | 4.75 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5 | 0.0 | 0.0 | 0.0 | 4.19 | ... | 4.96 | 0.0 | 0.0 | 4.76 | 4.71 | 0.0 | 4.43 | 4.5 | 4.25 | 0.0 |


In [48]:
#in order to save dataframe to csv for easier testing

#df.to_csv(path_or_buf="test_ratings_df")

In [148]:
#to read csv file in

#df2=pd.read_csv('test_ratings_df',index_col=0)

___

Below we perform the same data extraction and matrix creation performed earlier, just in a single function. Note that this function is time consuming, mostly because of the data extraction.

In [23]:
#overall function with the following parameters:
#beerurls - list of beer page urls from which user urls are scraped (at least length one)
#n_recent_from_beers - how many usernames to scrape from each beer page (default=all)
#n_recent_ratings - how many ratings to extract from each user when scraping (default=all)
#n_beers - how many beers should be used in the ratings matrix (default=all)
#sparsity_constraint - how many of the popular beers a user must have rated in order to be included in the matrix (default=0)

#function returns the ratings matrix according to the specified parameters

In [27]:
def extract_and_create_ratings_df(beerurls,n_recent_from_beers=0,n_recent_ratings=0,n_beers=0,sparsity_constraint=0):
    #find_userurls is the helper defined earlier in this notebook
    user_urls=find_userurls(beerurls,n_recent_from_beer=n_recent_from_beers)
    all_ratings={}
    for userurl in user_urls:
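        #get_ratings needs last_page when n_recent_ratings is 0 or below 50, so only skip the full profile scrape otherwise
        person=User(userurl,noInitialize=(n_recent_ratings>=50))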
        if person.username!=" ":
            ratings=person.get_ratings(n_recent_ratings)
            all_ratings[person.username]=ratings
    ratings_keys=[k for k in all_ratings.keys()]
    is_same_length=all([len(all_ratings[k]['beers'])==len(all_ratings[k]['ratings']) for k in ratings_keys])
    if not is_same_length:
        return "Data not extracted properly. Try different beerurl's."
    df=create_ratings_matrix(all_ratings,n_beers,sparsity_constraint)
    return df

In [31]:
%%time
beerurls=['https://www.beeradvocate.com/beer/profile/23222/78820/',
         'https://www.beeradvocate.com/beer/profile/46317/16814/']
df=extract_and_create_ratings_df(beerurls,
                                 n_recent_from_beers=50,
                                 n_recent_ratings=400,
                                 n_beers=100)

CPU times: user 4.57 s, sys: 6.61 s, total: 11.2 s
Wall time: 4min 48s


In [None]:
print(df.shape)
df.head()

## More Testing

Below we test out different User/Beer class functionalities. Feel free to test them out yourself!

In [15]:
#urls for manually testing specific users
URL1='https://www.beeradvocate.com/community/members/stonedtrippin.601042/'
URL2='https://www.beeradvocate.com/community/members/wolfgang.4062/'
URL3='https://www.beeradvocate.com/community/members/narkee.737932/'
URL4='https://www.beeradvocate.com/community/members/zach_attack.239886/'

In [12]:
test_user=User(URL1)

In [14]:
test_user.info

{'Gender': 'Male',
 'Birthday': 'May 21, 1986 (Age: 32)',
 'Location': 'Colorado'}

In [8]:
%%time
iron_rat_stout = Beer('https://www.beeradvocate.com/beer/profile/23222/78820/')

CPU times: user 23.6 ms, sys: 40.2 ms, total: 63.8 ms
Wall time: 2.26 s


In [9]:
print(iron_rat_stout.info)
print(iron_rat_stout.stats)
print(iron_rat_stout.avg_rating)

{'BrewedBy': 'Toppling Goliath Brewing Company', 'Style': 'American Double / Imperial Stout', 'ABV': '12.00', 'Availability': 'Rotating'}
{'Ranking': '1', 'NumReviews': '132', 'NumRatings': '688', 'Pdev': '6.2'}
4.84


In [317]:
%%time
reviews=iron_rat_stout.get_ratings(n_most_recent=1000)
print(f"num reviews is {len(reviews['rating'])}")

num reviews is 700
CPU times: user 239 ms, sys: 320 ms, total: 559 ms
Wall time: 16.8 s


In [316]:
len(reviews['usr']) == len(reviews['rating'])

True