# Clean Downloaded BeerAdvocate Pages

## Cache pages from www.beeradvocate.com locally for further scraping

By: Mike Beaumier -- Fellow, Insight Data Science

 - Email: [michael.beaumier@gmail.com](mailto:michael.beaumier@gmail.com)
 - Twitter: [@jollyhrothgar](https://twitter.com/jollyhrothgar)
 - **LinkedIn**: [Add me w/ 1 line message about our connection](https://www.linkedin.com/in/michaelbeaumier)

Repositories on [github](https://www.github.com/Jollyhrothgar)

This notebook cleans the downloaded beer data and generates a database and a CSV file.


In [21]:
## Scraping, Processing, Cleaning Libraries
import os
import re 
import sys
import csv
from time import sleep
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

## Yelp API (25,000 requests per day)
## Installed via: https://github.com/gfairchild/yelpapi
from yelpapi import YelpAPI 

stop_words = stopwords.words('english')
stop_words.append('beer')
stop_words.append('beers')
stop_words.append('thing')
stop_words.append('bottle')
stop_words.append('glass')
stop_words.append('cup')

In [7]:
def sanitize_string(text):
    # First, we turn every non-alphabetic character (digits, punctuation,
    # whitespace) into a space.  A regular expression would also work here.
    text = ''.join(c if c.isalpha() else ' ' for c in text)

    # Next, we split the string on whitespace, ignoring leading and trailing
    # whitespace.
    words = text.split()

    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces, digits,
        # and/or punctuation.  There is nothing to keep, so we return None.
        return None
    elif numWords == 1:
        # If there was only one word, it is returned unchanged (it is neither
        # lowercased nor checked against the stop words).
        return words[0]
    else:
        # Finally, if there were multiple words, we drop the stop words,
        # lowercase the remaining words, and join them with single spaces.
        # The stop-word check runs before lowercasing, so it is
        # case-sensitive.
        return ' '.join(w.lower() for w in words if w not in stop_words)
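
The comment in `sanitize_string` mentions that a regular expression could do the same job. A minimal sketch of that idea follows. It is written for Python 3 (the notebook itself runs on Python 2), the name `sanitize_string_regex` and the small `EXAMPLE_STOP_WORDS` set are hypothetical stand-ins for the NLTK-based `stop_words` list, and it differs from the original in three ways: it matches ASCII letters only, it lowercases before the stop-word check, and it filters single-word inputs too.

```python
import re

# Hypothetical stand-in for the notebook's NLTK-based stop_words list.
EXAMPLE_STOP_WORDS = {"the", "a", "and", "of", "beer", "beers", "bottle", "glass"}

def sanitize_string_regex(text, stop_words=EXAMPLE_STOP_WORDS):
    """Lowercase text, keep only runs of ASCII letters, and drop stop words."""
    # findall returns every maximal run of a-z, so digits and punctuation
    # act as separators, much like the character-by-character version.
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in stop_words]
    # Mirror the original's convention of returning None when nothing is left.
    return " ".join(kept) if kept else None

print(sanitize_string_regex("Poured a GOLDEN, hazy color; 22 oz bottle!"))
# prints: poured golden hazy color oz
```

Lowercasing first makes the stop-word check case-insensitive, which may or may not be what the downstream classifier expects.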

In [8]:
sanitize_string("type 22 oz bottleglass karl strauss 26th anniversary snifter glassfrom barons market in alpine calif price 5 49purchased sept 27 2015consumed sept 30 2015misc n apoured a golden hazy color with two fingers of rocky white head lots of bubbles coming up from the bottom of the glass at various spots minimal spotty lacing on the sides of the glass below average retention sight 3 75 hops really dominated the smell with little sour aroma which doesn t quite make sense with the name got dank citra grapefruit tangerine orange zest pineapple lemon pine lime and floral probably overrated this smell since it s supposed to be a sour but this was nice on the nose smell 4 50 taste was more balanced than the nose a good mix probably 65 35 of hops to sour got grapefruit citra lemon lime pine pineapple more tropical grapefruit pulp and citra rind taste 4 00 light body oily texture average carbonation long juicy finish feel 4 25 i really enjoyed this beer more hoppy than sour but just a really nice combination and for a great price too overall 4 00 4 13 92 a")

'type oz bottleglass karl strauss th anniversary snifter glassfrom barons market alpine calif price purchased sept consumed sept misc n apoured golden hazy color two fingers rocky white head lots bubbles coming bottom various spots minimal spotty lacing sides average retention sight hops really dominated smell little sour aroma doesn quite make sense name got dank citra grapefruit tangerine orange zest pineapple lemon pine lime floral probably overrated smell since supposed sour nice nose smell taste balanced nose good mix probably hops sour got grapefruit citra lemon lime pine pineapple tropical grapefruit pulp citra rind taste light body oily texture average carbonation long juicy finish feel really enjoyed hoppy sour really nice combination great price overall'

In [9]:
def clean_brewery_summary_data(file_name):
    '''
    Takes file_name and mines for tables containing data about beers.
    Note that the unique brewery ID must be part of the brewery's file name. 
    This step is accomplished during scraping.
    '''
    ## Extract Brewery ID
    matches = re.search(r'brewery_(\d+)\.html',file_name)
    if not matches:
        print "Problem extracting brewery key. Bailing out!"
        return
    brewery_key = matches.group(1)
    f = open(file_name,'rU')
    soup = BeautifulSoup(f.read().decode('utf-8'),'lxml')
    unicode_page = soup.prettify()
    #debug
    #print 'found ',len(soup.findAll('table')),'tables.'
    
    ### Overview Information ###
    brewery_dict = {}
    
    title = soup.title.string
    title_list = title.split('|')
    brewery_name = title_list[0].strip()
    city_state_list = title_list[1].split(',')
    brewery_city = city_state_list[0].strip()
    brewery_state = city_state_list[1].strip()
    
    brewery_dict["brewery_key"] = brewery_key
    brewery_dict["brewery_name"] = brewery_name.encode('utf_8')
    brewery_dict["city"] = brewery_city.encode('utf_8')
    brewery_dict["state"] = brewery_state.encode('utf_8')
    
    tables = soup.findAll('table')
    overview_table = tables[0]
    beer_table = tables[-1]
    
    ### Get Phone Number ###
    # Default to -1 (no phone found); keep the first match and stop looking,
    # so later rows without a phone number cannot overwrite it.
    brewery_dict['phone'] = -1
    brewery_dict['phone_key'] = -1
    for row in overview_table.findAll('tr'):
        #print ">",row.text
        matches = re.search(r'phone: (\((\d{3})\) (\d{3})-(\d{4}))',row.text)
        if matches:
            brewery_dict['phone'] = matches.group(1).encode('utf_8')
            brewery_dict['phone_key'] = str(matches.group(2)+matches.group(3)+matches.group(4)).encode('utf_8')
            break

    ## Call To Yelp API
    CONSUMER_KEY = 'ORSXdOCKxhchhVsWm_cu_Q'
    CONSUMER_SECRET = 'RDGuDq0zh1IZkyUnui60Izb00KY'
    TOKEN = '0g3ixyY4GYGaGDVgVfvV9lAmZdDQTEaB'
    TOKEN_SECRET = '7591YIL6f6Ou7HS0t58TFHfJdGk'

    yelp_api = YelpAPI(CONSUMER_KEY, CONSUMER_SECRET, TOKEN, TOKEN_SECRET)
    yelp_results_dict = {}
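    # phone_key is either -1 (not found) or a string of digits, so compare
    # against the sentinel instead of relying on str/int ordering.
    if brewery_dict['phone_key'] != -1:
        yelp_results_dict = yelp_api.phone_search_query(phone=int(brewery_dict['phone_key']))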
    try: 
        brewery_dict['yelp_rating'] = yelp_results_dict['businesses'][0]['rating']
        brewery_dict['yelp_review_count'] = yelp_results_dict['businesses'][0]['review_count']
        brewery_dict['latittude'] = yelp_results_dict['businesses'][0]['location']['coordinate']['latitude']
        brewery_dict['longitude'] = yelp_results_dict['businesses'][0]['location']['coordinate']['longitude'] 
    except:
        brewery_dict['yelp_rating'] = -1
        brewery_dict['yelp_review_count'] = -1
        brewery_dict['latittude'] = ''
        brewery_dict['longitude'] = ''
        
    ### Getting the beer ratings, styles, etc ###
    beers = []
    beer_dicts = []
    for row in beer_table.findAll('tr'):
        beer = []
        #beer_key contains beer style key and beer name key encoded in links.
        beer_key = row.findAll('a')
        for col in row.findAll('td'):
            value = col.text
            beer.append(value)
        beers.append(beer)
        if len(beer) == 6 and 'Style' not in beer:
            #print ">>>",beer_key
            # extract beer name key
            key_matches = re.search('/beer/profile/(\d+)/(\d+)',beer_key[0]['href'])
            
            # extract beer style key
            style_matches = re.search('/beer/style/(\d+)',beer_key[1]['href'])            
            beer_key = -1
            beer_style_key = -1
            if key_matches:
                if brewery_dict['brewery_key'] != key_matches.group(1):
                    print "Brewery key mismatch, bailing out!"
                    print "Good Key 1:", brewery_dict['brewery_key']
                    print " Bad Key 2:", key_matches.group(1)
                    return
                beer_key = key_matches.group(2)
            if style_matches:
                beer_style_key = style_matches.group(1)
            
            ## Now we can fill our beer info!
            beer_dict = {}
            beer_dict["beer_name"] = beer[0].encode('utf_8')
            beer_dict["style_name"] = beer[1].encode('utf_8')
            beer_dict["style_key"] = beer_style_key
            beer_dict["beer_name_key"] = beer_key
            beer_dict["abv"] = beer[2].encode('utf_8')
            beer_dict["avg_score"] = beer[3].encode('utf_8')
            rating = beer[4].encode('utf_8')
            clean_rating = ''.join(e for e in rating if e.isdigit() or e == '.')
            beer_dict["ratings_count"] = clean_rating
            beer_dict["bros_score"] = beer[5].encode('utf_8')
            if beer_dict["bros_score"] == '-':
                beer_dict["bros_score"] = ''
            if beer_dict['abv'] == '?':
                beer_dict['abv'] = ''
            if beer_dict['avg_score'] == '-':
                beer_dict['avg_score'] = ''
            beer_dicts.append(beer_dict)
    # debug
    # print "Brewery Summary: ",brewery_dict['brewery_name']
    
    # debug
    #for k,v in brewery_dict.iteritems():
    #    print k,v
    
    return_dict_list = []
    
    for beer in beer_dicts:
        one_beer_dict = dict(brewery_dict)
        for k,v in beer.iteritems():
            if k not in one_beer_dict:
                one_beer_dict[k] = v
        return_dict_list.append(one_beer_dict)
    
    #print "Processed:",len(return_dict_list),"beers. Brewery key:",brewery_dict["brewery_key"]
    f.close()
    return return_dict_list
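
The two regular expressions used in `clean_brewery_summary_data` can be tried in isolation. The sketch below is Python 3 and uses hypothetical helper names (`extract_brewery_key`, `extract_phone`); the patterns follow the ones in the function, with the dot before `html` escaped. The sample inputs are taken from the example call and output shown in this notebook.

```python
import re

def extract_brewery_key(file_name):
    """Return the digits in 'brewery_<digits>.html' as a string, or None."""
    match = re.search(r"brewery_(\d+)\.html", file_name)
    return match.group(1) if match else None

def extract_phone(text):
    """Return (display_phone, digits_only) for the first
    'phone: (ddd) ddd-dddd' found in text, or None if there is none."""
    match = re.search(r"phone: (\((\d{3})\) (\d{3})-(\d{4}))", text)
    if not match:
        return None
    # Group 1 is the formatted number; groups 2-4 are its digit blocks.
    return match.group(1), match.group(2) + match.group(3) + match.group(4)

print(extract_brewery_key("./rescrape_data/11580/brewery_11580.html"))
# prints: 11580
print(extract_phone("phone: (831) 425-4900"))
# prints: ('(831) 425-4900', '8314254900')
```

Returning `None` on a failed match avoids comparing a string key against an integer sentinel.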

In [10]:
clean_brewery_summary_data('./rescrape_data/11580/brewery_11580.html')

[{'abv': '5.40',
  'avg_score': '2.78',
  'beer_name': 'Amber Ale',
  'beer_name_key': '27980',
  'brewery_key': '11580',
  'brewery_name': 'Santa Cruz Mountain Brewing',
  'bros_score': '',
  'city': 'Santa Cruz',
  'latittude': 36.9589386,
  'longitude': -122.0484772,
  'phone': '(831) 425-4900',
  'phone_key': '8314254900',
  'ratings_count': '21',
  'state': 'CA',
  'style_key': '128',
  'style_name': 'American Amber / Red Ale',
  'yelp_rating': 4.0,
  'yelp_review_count': 246},
 {'abv': '7.10',
  'avg_score': '3.88',
  'beer_name': 'Black IPA',
  'beer_name_key': '61868',
  'brewery_key': '11580',
  'brewery_name': 'Santa Cruz Mountain Brewing',
  'bros_score': '',
  'city': 'Santa Cruz',
  'latittude': 36.9589386,
  'longitude': -122.0484772,
  'phone': '(831) 425-4900',
  'phone_key': '8314254900',
  'ratings_count': '4',
  'state': 'CA',
  'style_key': '175',
  'style_name': 'American Black Ale',
  'yelp_rating': 4.0,
  'yelp_review_count': 246},
 {'abv': '6.00',
  'avg_score':

In [11]:
def clean_beer_review_data(filename):
    return_dict_list = []
    
    #print 'cleaning:',filename
    
    # The path must look like <brewery_key>/beer_<beer_key>_<page>.html
    matches = re.search(r'(\d+)/beer_(\d+)_\d+\.html',filename)
    if not matches:
        print "couldn't match keys:",filename
        return
    brewery_key = matches.group(1)
    beer_key = matches.group(2)
    f = open(filename,'rU')
    soup = BeautifulSoup(f.read().decode('utf-8'),'lxml')
    unicode_page = soup.prettify()
    if re.search('No Reviews',unicode_page):
        return 
    
    reviews = soup.find_all("div", id="rating_fullview_content_2")
    #print reviews[0]
    for review in reviews:
        username = review.div.text.split(',')[0]
        date = ''.join(review.div.text.split(',')[1:]).strip()
        BA_score = review.text.split('/')[0]
        rDev = "-1.0"
        character_num = int(''.join(review.text.split('★'.decode("utf-8"))[1].split()[0].split(',')))
        review_dict = {}
        look = "-1.0"
        smell = "-1.0"
        taste = "-1.0"
        feel = "-1.0"
        overall = "-1.0"
        review_text = 'N/A'
        
        try:
            ratings = review.find("span", { "class" : "muted"}).text.split('|')
            rDev = review.text.split('%')[0].split('rDev')[-1].strip()
            look = ratings[0].split(':')[1].strip()
            smell = ratings[1].split(':')[1].strip()
            taste = ratings[2].split(':')[1].strip()
            feel = ratings[3].split(':')[1].strip()
            overall_s = ratings[4].split(':')[1].strip()
            overall = overall_s               
            review_text = review.text.split('★'.decode("utf-8"))[0].split('overall: ' + overall_s)[-1]
        except:
            # Single Review Beers should not be weighted by individual reviewers; we already have this data from
            # the brewery page. Note that this return abandons the whole file, including any reviews
            # already collected from it.
            return
        review_dict["look"]=look.encode('utf-8')
        review_dict["smell"]=smell.encode('utf-8')
        review_dict["taste"]=taste.encode('utf-8')
        review_dict["feel"]=feel.encode('utf-8')
        review_dict["overall"]=overall.encode('utf-8')
        review_dict["review_text"]=review_text.encode('utf-8')
        review_dict["username"]=username.encode('utf-8')
        review_dict["date"]=date.encode('utf-8')
        review_dict["ba_score"]=BA_score.encode('utf-8')
        review_dict["rdev"]=rDev.encode('utf-8')
        review_dict["brewery_key"]=brewery_key.encode('utf-8')
        review_dict["beer_key"]=beer_key.encode('utf-8')
        
        return_dict_list.append(review_dict)
        #print ">>>",review_text
    return return_dict_list
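
The sub-score parsing inside `clean_beer_review_data` splits the muted rating line on `|` and then on `:`. A minimal Python 3 sketch of that step is shown below. The helper name `parse_subscores` is hypothetical, and the example string is made up to match the shape the parsing code implies, not copied from a real page.

```python
def parse_subscores(rating_text):
    """Parse 'look: 2.5 | smell: 3 | ...' into a dict of name -> score string."""
    scores = {}
    for part in rating_text.split("|"):
        name, sep, value = part.partition(":")
        if not sep:
            # Skip fragments that do not have a "name: value" shape.
            continue
        scores[name.strip()] = value.strip()
    return scores

# Illustrative input only; real pages may format this line differently.
example = "look: 2.5 | smell: 3 | taste: 2 | feel: 3.5 | overall: 2"
print(parse_subscores(example)["feel"])
# prints: 3.5
```

Looking scores up by name rather than by position means a missing or reordered field shows up as a missing key instead of a silently wrong value.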

In [12]:
clean_beer_review_data('/home/mjbeaumier/Programming/brewery_project/ale_trail_codebase/rescrape_data/11580/beer_27981_2.html')

[{'ba_score': '3.75',
  'beer_key': '27981',
  'brewery_key': '11580',
  'date': 'Jan 25 2015',
  'feel': '3.75',
  'look': '3.75',
  'overall': '3.75',
  'rdev': '+41.5',
  'review_text': "Shocked to see the level this beer is being reviewed at. Wondering if there's any way this isn't the same beer, because the beer I had today was a Good to Very Good IPA, nicely balanced by a warm background sweetness framing the hop bitters, and nice flavor. Should be reviewing in the mid-eighties, not the high sixties. Maybe it's a new batch? Did they tweak it recently? I don't know. All I can say is, based on what I just drank, this beer is underrated.",
  'smell': '3.75',
  'taste': '3.75',
  'username': 'ronricorossi'},
 {'ba_score': '2.38',
  'beer_key': '27981',
  'brewery_key': '11580',
  'date': 'Jul 20 2012',
  'feel': '3.5',
  'look': '2.5',
  'overall': '2',
  'rdev': '-10.2',
  'review_text': 'Consumed June 30th, 2012 at the brewery.Beer is a relatively light yellow. Slightly cloudy with

In [13]:
def process_beers_merge_classifier(bayes_csv_classified_beer):
    '''
    Load the classifier output CSV and index its rows first by username,
    then by beer_key. The first row seen for a (username, beer_key) pair wins.
    '''
    with open(bayes_csv_classified_beer) as csv_f:
        bayes_goodness = list(csv.DictReader(csv_f, skipinitialspace=True))
    bayes_username_dict = {}
    for review in bayes_goodness:
        user_reviews = bayes_username_dict.setdefault(review['username'], {})
        user_reviews.setdefault(review['beer_key'], review)
    return bayes_username_dict
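
The two-level lookup built by `process_beers_merge_classifier` can be illustrated without a file on disk. The Python 3 sketch below reads a small in-memory CSV; the rows are made up, and only the column names `username`, `beer_key` and `hoppiness` come from this notebook. The function name `index_by_user_and_beer` is hypothetical.

```python
import csv
import io

# Made-up rows for illustration; the second "alice" row is a duplicate key.
SAMPLE_CSV = """username, beer_key, hoppiness
alice, 100, 0.91
alice, 100, 0.10
bob, 200, 0.25
"""

def index_by_user_and_beer(csv_file):
    """Index CSV rows as index[username][beer_key] -> row (first row wins)."""
    index = {}
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        # setdefault only stores a value when the key is not present yet.
        index.setdefault(row["username"], {}).setdefault(row["beer_key"], row)
    return index

index = index_by_user_and_beer(io.StringIO(SAMPLE_CSV))
print(index["alice"]["100"]["hoppiness"])
# prints: 0.91
```

All values stay strings, as `csv.DictReader` returns them, so beer keys are looked up as `"100"` rather than `100`.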
            

In [14]:
test = process_beers_merge_classifier('test.csv')
print test['SensorySupernova']['127083']['hoppiness']

0.996533479596


In [15]:
def process_reviews():
    ### Process Beers
    file_list = load_brewery_file_list('./lists/beer_list_rescrape.txt')
    bayes_classified = process_beers_merge_classifier('test.csv')
    review_data_list = []
    counter = 0
    uniq_user_beer_dict = {}
    for review_file in file_list:
        reviews = clean_beer_review_data(review_file)
        counter += 1
        if counter % 50 == 0:
            print "processed:",counter,"beers"
        if reviews:
            for review in reviews:
                ## Remove duplicate reviews - probably a feature of the scraping, 
                ## where some pages printed out were duplicates
                if review['username'] not in uniq_user_beer_dict:
                    #print review['username']
                    uniq_user_beer_dict[review['username']] = []
                    uniq_user_beer_dict[review['username']].append(review['beer_key'])
                else:
                    if review['beer_key'] not in uniq_user_beer_dict[review['username']]:
                        uniq_user_beer_dict[review['username']].append(review['beer_key'])
                    else:
                        continue
                ## Now clean up the review data
                review_text = sanitize_string(review['review_text'])
                
                # sanitize_string returns None when no words survive cleaning
                words = review_text.split() if review_text else []
                ## Stem the words to generalize the review
                stemmed_review_text = []
                try:
                    review['hoppiness'] = bayes_classified[review['username']][review['beer_key']]['hoppiness']
                except:
                    review['hoppiness'] = ''
                if words:
                    for w in words:
                        stemmed_review_text.append(ps.stem(w))
                    stemmed_review = ' '.join(stemmed_review_text)
                    review['stemmed_review_text'] = stemmed_review
                else:
                    review['stemmed_review_text'] = ""
                review['review_text'] = review_text
                review_data_list.append(review)
        #if counter == 20:
        #    break
    keys = review_data_list[0].keys()
    with open('./clean_data_csv/beer_review_information_rescrape.csv','wb') as output_file:
        dict_writer = csv.DictWriter(output_file,keys)
        dict_writer.writeheader()
        dict_writer.writerows(review_data_list)
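
The de-duplication and CSV export in `process_reviews` can also be sketched with a set of `(username, beer_key)` pairs. This is a Python 3 illustration under stated assumptions: the helper names and sample reviews are hypothetical, the CSV is written to an in-memory buffer rather than to `./clean_data_csv/`, and the columns are sorted so the header order is predictable.

```python
import csv
import io

def drop_duplicate_reviews(reviews):
    """Keep the first review for each (username, beer_key) pair, in order."""
    seen = set()
    unique = []
    for review in reviews:
        pair = (review["username"], review["beer_key"])
        if pair in seen:
            continue
        seen.add(pair)
        unique.append(review)
    return unique

def reviews_to_csv(reviews):
    """Serialize review dicts to CSV text; return '' for an empty list."""
    if not reviews:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(reviews[0].keys()))
    writer.writeheader()
    writer.writerows(reviews)
    return buffer.getvalue()

# Made-up reviews; the second entry repeats the first user/beer pair.
sample = [
    {"username": "alice", "beer_key": "100", "overall": "4"},
    {"username": "alice", "beer_key": "100", "overall": "4"},
    {"username": "bob", "beer_key": "100", "overall": "3.5"},
]
print(reviews_to_csv(drop_duplicate_reviews(sample)))
```

The empty-list guard matters because indexing `reviews[0]` on an empty result would otherwise raise an `IndexError`.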

In [16]:
def process_breweries():
    ### Process Breweries
    file_list = load_brewery_file_list('./lists/brewery_list_rescrape.txt')
    beer_data_list = []
    counter = 0
    for beer_file in file_list:
        beers = clean_brewery_summary_data(beer_file)        
        counter += 1
        if counter % 100 == 0:
            print "processed",counter,"breweries"
        if beers:
            for beer in beers:
                beer_data_list.append(beer)
    keys = beer_data_list[0].keys()
    with open('./clean_data_csv/brewery_information_rescrape.csv','wb') as output_file:
        dict_writer = csv.DictWriter(output_file,keys)
        dict_writer.writeheader()
        dict_writer.writerows(beer_data_list)

In [17]:
def load_brewery_file_list(filelist):
    '''
    Given a file of newline-separated file names, read each name and store it in a list.
    Return this list.
    '''
    # The with block closes the file even if reading fails.
    with open(filelist,'rU') as f:
        files = [line.rstrip('\n\r') for line in f]
    print "loaded",len(files),"breweries"
    return files

In [18]:
def main():
    print "called main!"
    #process_breweries()
    process_reviews()
    print "all finished!"
    return

In [19]:
if __name__=='__main__':
    print "Boilerplate call to main"
    main()

Boilerplate call to main
called main!
loaded 26701 breweries
processed: 50 beers
processed: 100 beers
processed: 150 beers
processed: 200 beers
processed: 250 beers
processed: 300 beers
processed: 350 beers
processed: 400 beers
processed: 450 beers
processed: 500 beers
processed: 550 beers
processed: 600 beers
processed: 650 beers
processed: 700 beers
processed: 750 beers
processed: 800 beers
processed: 850 beers
processed: 900 beers
processed: 950 beers
processed: 1000 beers
processed: 1050 beers
processed: 1100 beers
processed: 1150 beers
processed: 1200 beers
processed: 1250 beers
processed: 1300 beers
processed: 1350 beers
processed: 1400 beers
processed: 1450 beers
processed: 1500 beers
processed: 1550 beers
processed: 1600 beers
processed: 1650 beers
processed: 1700 beers
processed: 1750 beers
processed: 1800 beers
processed: 1850 beers
processed: 1900 beers
processed: 1950 beers
processed: 2000 beers
processed: 2050 beers
processed: 2100 beers
processed: 2150 beers
processed: 22

# To Do

In [20]:
bayes_dict = process_beers_merge_classifier("test.csv")
# bayes_dict is keyed by username, then by beer_key, so index it by name rather than by position
print bayes_dict['SensorySupernova']

In [101]:
shit_string = "I really liked,,,, thi.s \n?? b\teeer but %##$$Toomuchmoney"
print shit_string
print sanitize_string(shit_string)

I really liked,,,, thi.s 
?? b	eeer but %##$$Toomuchmoney
i really liked thi s b eeer but toomuchmoney
