# Beer Reviews Exploration

The data is split across two files:
 + Beeradvocate.txt
 + Ratebeer.txt

In [3]:
DATA = open( 'sample_beeradvocate.txt', 'r', encoding='ISO-8859-1', errors='replace' ).read().split( '\n\n' ) \
     + open( 'sample_ratebeer.txt'    , 'r', encoding='ISO-8859-1', errors='replace' ).read().split( '\n\n' )

for line_num, line in enumerate( DATA[3].split('\n') ):
    print( '{:2}| {}'.format( line_num, ' '.join( line.split() ) ) )

 0| beer/name: Sausa Pils
 1| beer/beerId: 47969
 2| beer/brewerId: 10325
 3| beer/ABV: 5.00
 4| beer/style: German Pilsener
 5| review/appearance: 3.5
 6| review/aroma: 3
 7| review/palate: 2.5
 8| review/taste: 3
 9| review/overall: 3
10| review/time: 1234725145
11| review/profileName: stcules
12| review/text: Golden yellow color. White, compact foam, quite creamy. Good appearance. Fresh smell, with good hop. Quite dry, with a good grassy note. Hay. Fresh and pleasant. More sweet in the mouth, with honey. The hop comes back in the end, and in the aftertaste. Not bad, but a bit too sweet for a pils. In the end some vanilla and camomile note. In the aftertaste, too. Though the hop, a bit too sweet. Honest.


## What data do we get to work with?

Each record consists of statistics about the beer, as reported by the producer, as well as a customer review.

    beer/
        name        : The beer's name
        beerId      : Identifier for a specific beer
        brewerId    : Identifier for a specific producer
        ABV         : Alcohol By Volume percentage
        style       : Style of beer ( e.g. IPA, Hefeweizen, Pale Ale )
        
    review/
        appearance  : Rating [0-5] stars
        aroma       : Rating [0-5] stars
        palate      : Rating [0-5] stars
        taste       : Rating [0-5] stars
        overall     : Rating [0-5] stars
        time        : When the review was written
        profileName : Who wrote the review
        text        : A description of the beer and its drinking
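
As a rough illustration of the record layout above, here is a minimal sketch that turns one record into a dictionary. The helper `parse_record` and the `RATING_FIELDS` set are hypothetical names introduced for this example, and the sample string is a shortened, made-up record in the same shape as the one printed earlier.

```python
# Hypothetical helper (not part of the original notebook): parse one
# blank-line-delimited record into a {field: value} dictionary.
RATING_FIELDS = {'appearance', 'aroma', 'palate', 'taste', 'overall'}

def parse_record(record):
    """Map lines like 'beer/name: Sausa Pils' to {'name': 'Sausa Pils', ...}."""
    parsed = {}
    for line in record.strip().split('\n'):
        # Split on the first ':' only, so colons inside review text survive.
        key, sep, value = line.partition(':')
        if not sep:
            continue  # skip lines that are not 'key: value'
        field = key.strip().split('/')[-1]  # drop the 'beer/' or 'review/' prefix
        value = ' '.join(value.split())     # collapse repeated whitespace
        parsed[field] = float(value) if field in RATING_FIELDS else value
    return parsed

# Shortened, made-up record for demonstration.
sample = (
    "beer/name: Sausa Pils\n"
    "beer/ABV: 5.00\n"
    "review/taste: 3\n"
    "review/text: Not bad: a bit sweet."
)
record = parse_record(sample)
print(record)
```

Only the five rating fields are converted to floats here; the other fields stay as strings, which is a choice made for this sketch rather than something the data requires.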
        
The reviews originally came from a text file, but I converted them to a format better suited to data mining: a **.csv**.



For this I used the following parser code:

In [17]:
import csv
from itertools import islice

ATTRIBUTES = [ 'name',  'beerId',     'brewerId', 'ABV', 
               'style', 'appearance', 'aroma',    'palate', 
               'taste', 'overall',    'time',    'profileName', 
               'text' ]

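def parseReviews( txt_filename, csv_filename ):
    # Each record is 13 'key: value' lines followed by one blank line (14 lines total).
    # newline='' keeps the csv module from writing blank rows on Windows;
    # utf-8 matches the default encoding that pandas.read_csv expects.
    with open( txt_filename, 'r', encoding='ISO-8859-1', errors='replace' ) as txt_f, \
         open( csv_filename, 'w', newline='', encoding='utf-8' ) as csv_f:
        writer = csv.writer( csv_f )
        writer.writerow( ATTRIBUTES )
        while True:
            lines = [ line.strip() for line in islice( txt_f, 14 ) ]
            if not lines: # EOF
                break
            # Keep everything after the first ':' and collapse repeated whitespace.
            lines_data = [ ' '.join( line.partition(':')[2].split() ) for line in lines[:-1] ]
            writer.writerow( lines_data )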
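
The parser above relies on every record being exactly 14 lines long. As a hedged alternative (not the approach used in this notebook), the sketch below groups lines by blank-line separators and looks fields up by name, so a record with a missing field would leave an empty cell instead of shifting columns. The names `FIELDS`, `iter_records`, and `convert` are hypothetical.

```python
import csv
from itertools import groupby

FIELDS = ['name', 'beerId', 'brewerId', 'ABV', 'style', 'appearance', 'aroma',
          'palate', 'taste', 'overall', 'time', 'profileName', 'text']

def iter_records(lines):
    """Yield one {field: value} dict per blank-line-delimited group of lines."""
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            continue
        record = {}
        for line in group:
            key, sep, value = line.partition(':')
            if sep:
                # 'review/taste' -> 'taste'; collapse repeated whitespace in the value.
                record[key.strip().split('/')[-1]] = ' '.join(value.split())
        if record:
            yield record

def convert(txt_filename, csv_filename):
    """Stream the text file into a CSV without loading it all into memory."""
    with open(txt_filename, 'r', encoding='ISO-8859-1', errors='replace') as txt_f, \
         open(csv_filename, 'w', newline='', encoding='utf-8') as csv_f:
        # restval fills missing fields; extrasaction drops unexpected keys.
        writer = csv.DictWriter(csv_f, fieldnames=FIELDS, restval='',
                                extrasaction='ignore')
        writer.writeheader()
        for record in iter_records(txt_f):
            writer.writerow(record)

# Small in-memory demonstration with made-up lines.
demo_lines = ["beer/name: A\n", "review/taste: 4\n", "\n", "\n", "beer/name: B\n"]
records = list(iter_records(demo_lines))
print(records)
```

The trade-off is a little more work per line in exchange for not depending on a fixed record length.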
            

## Working with our Data

I used **pandas**, the Python data analysis library, specifically for its versatile DataFrame object.


In [18]:
import pandas

Because we converted our data file to csv format, we can use the pandas **.read_csv(** *filename* **)** function to load it into a DataFrame object.

In [19]:
parseReviews( 'sample_beeradvocate.txt', 'sample_output.csv' )

sample_df = pandas.read_csv( 'sample_output.csv' )

print( sample_df.iloc[0] )

name                                                Sausa Weizen
beerId                                                     47986
brewerId                                                   10325
ABV                                                            5
style                                                 Hefeweizen
appearance                                                   2.5
aroma                                                          2
palate                                                       1.5
taste                                                        1.5
overall                                                      1.5
time                                                  1234817823
profileName                                              stcules
text           A lot of foam. But a lot. In the smell some ba...
Name: 0, dtype: object
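
The `time` values in the output look like Unix timestamps. Assuming they are seconds since 1970-01-01 UTC (an assumption, since the data files do not say so explicitly), a minimal sketch of converting them to readable dates could look like this. The two-row CSV text is a made-up stand-in for `sample_output.csv`.

```python
import io
import pandas

# Made-up two-row CSV standing in for 'sample_output.csv'.
csv_text = (
    "name,beerId,taste,time\n"
    "Sausa Pils,47969,3,1234725145\n"
    "Sausa Weizen,47986,1.5,1234817823\n"
)

sample = pandas.read_csv(io.StringIO(csv_text))

# Assumption: 'time' holds seconds since the Unix epoch.
sample['time'] = pandas.to_datetime(sample['time'], unit='s')
print(sample[['name', 'time']])
```

With real dates in place, reviews can be sorted or grouped by year or month using the `.dt` accessor.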


#### Now to read in the real data

I aggregated the reviews from both Beer Advocate and Rate Beer and created a single csv containing all the data.

In [63]:
df = pandas.read_csv( 'beer_reviews.csv' )



Now we get to ask some questions. We know that reviews contain ratings for a variety of qualities in the beers. A natural question is: which beers lead each of these categories?

In [103]:
df['reviews'] = 1 # This will be helpful for counting reviews
# Assumes the ID column is named 'bearId', matching the index label in the output.
beerId_gdf = df.groupby( 'bearId' )

df_taste = beerId_gdf.agg( {
    'name'    : 'first', 
    'taste'   : 'mean' , 
    'reviews' : 'count'} ).sort_values( 'taste', ascending=False )

df_taste[:5]

| bearId | taste | name | reviews |
|--------|-------|------|---------|
| 44076 | 5 | Speak No Evil Belgian IPA | 2 |
| 10196 | 5 | Belgian Tripel | 2 |
| 59584 | 5 | Double Kilt-Sickle Reserve | 2 |
| 16206 | 5 | Satsuma Ale | 2 |
| 59566 | 5 | Red, Wheat, And Blue | 2 |
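
The cell above ranks beers by `taste` only. As a sketch of how the same question might be asked for every rating category at once, the example below averages each rating per beer and picks the top beer per column with `idxmax`. The `toy` frame is made-up data, and it groups by name only for brevity; on the real data, grouping by the ID column is safer because names can collide.

```python
import pandas

RATING_COLUMNS = ['appearance', 'aroma', 'palate', 'taste', 'overall']

# Made-up reviews: two for Alpha, two for Bravo, one for Charlie.
toy = pandas.DataFrame({
    'name':       ['Alpha', 'Alpha', 'Bravo', 'Bravo', 'Charlie'],
    'appearance': [4.0, 5.0, 3.0, 3.0, 4.0],
    'aroma':      [3.0, 3.0, 4.5, 4.5, 4.0],
    'palate':     [4.0, 4.0, 3.5, 3.5, 5.0],
    'taste':      [4.0, 4.5, 4.0, 5.0, 3.0],
    'overall':    [4.0, 4.0, 4.5, 4.0, 3.5],
})

# Mean rating per beer for every category.
means = toy.groupby('name')[RATING_COLUMNS].mean()

# For each category, the beer with the highest mean rating.
leaders = means.idxmax()
print(leaders)
```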


But the number of reviews behind each average varies from beer to beer. A top score could come from a single person rating the beer 5/5. We need to incorporate support into our rankings. We could choose a fixed minimum number of reviews, but we could also do something a little more dynamic, such as using the mean number of reviews per beer as the threshold.

In [104]:
average_reviews = round( beerId_gdf.reviews.sum().mean() )
print( "Average number of reviews: {}".format( average_reviews ) )

df_taste[ df_taste.reviews > average_reviews ]

Average number of reviews: 48


| bearId | taste | name | reviews |
|--------|-------|------|---------|
| 24273 | 4.857143 | M Belgian-Style Barleywine | 56 |
| 63649 | 4.848485 | Rare D.O.S. | 66 |
| 62397 | 4.767068 | Rare Bourbon County Stout | 498 |
| 44910 | 4.743590 | Dirty Horse | 78 |
| 68548 | 4.730769 | Armand'4 Oude Geuze Lente (Spring) | 130 |
| 21690 | 4.724590 | Pliny The Younger | 1220 |
| 1545 | 4.718553 | Trappist Westvleteren 12 | 2544 |
| 42664 | 4.714286 | Kaggen! Stormaktsporter | 168 |
| 42349 | 4.710526 | Vanilla Bean Aged Dark Lord | 304 |
| 47658 | 4.697017 | Founders CBS Imperial Stout | 1274 |
