# Beer Reviews 

This example analyzes beer reviews to find the most common words used in positive and negative reviews.
The original example can be found [here](https://medium.com/rapids-ai/real-data-has-strings-now-so-do-gpus-994497d55f8e).
The sample file `reviews_sample.csv` is 23.1 MB.
The full dataset is available at `s3://bodo-example-data/beer/reviews.csv` and is 2.2 GB.

## Start an IPyParallel cluster
Run the following code in a cell to start an IPyParallel cluster. This example uses up to 8 physical cores.

In [1]:
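import os
# Start a local cluster only if BODO_PLATFORM_WORKSPACE_UUID is unset.
if os.environ.get("BODO_PLATFORM_WORKSPACE_UUID", "NA") == "NA":
    import ipyparallel as ipp
    import psutil
    # Cap at 8 physical cores; cpu_count() can return None if undetermined.
    n = min(psutil.cpu_count(logical=False) or 1, 8)
    rc = ipp.Cluster(engines="mpi", n=n).start_and_connect_sync(activate=True)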

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|██████████| 8/8 [00:06<00:00,  1.17engine/s]


In [2]:
%%px
import pandas as pd
import time
import bodo

## Preprocessing
1. Create lists of the stopwords and punctuation marks to remove.
2. Define the regular expressions used to remove the punctuation and stopwords from the reviews.
3. Use the `lower` and `strip` string methods to convert all letters to lowercase and trim leading and trailing whitespace.
4. Remove stopwords and punctuation.
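
The four steps above can be sketched on a single string with the standard-library `re` module. This is only an illustration, not the notebook's Bodo code: the shortened `STOPWORDS` list, the `preprocess_text` helper, and the sample sentence are all hypothetical, and the notebook itself applies the same patterns to a pandas Series.

```python
import re

# Hypothetical, shortened stopword list for illustration only;
# the notebook loads the full list from nltk-stopwords.txt.
STOPWORDS = ["the", "a", "is", "and"]
PUNCT_LIST = [r"\.", r"\-", r"\?", ":", "!", "&", "'", ","]

# One alternation per list: any punctuation mark, or any whole-word stopword.
punc_regex = "|".join(f"({p})" for p in PUNCT_LIST)
stopword_regex = "|".join(f"\\b({s})\\b" for s in STOPWORDS)


def preprocess_text(text):
    """Lowercase, trim, then delete punctuation and stopwords (hypothetical helper)."""
    text = text.lower().strip()
    text = re.sub(punc_regex, "", text)
    text = re.sub(stopword_regex, "", text)
    return text


cleaned = preprocess_text("  The head is thick, and the aroma is GREAT!  ")
tokens = cleaned.split()
print(tokens)  # ['head', 'thick', 'aroma', 'great']
```

The `\b` word boundaries keep the stopword `a` from being stripped out of words such as `aroma`. Removing stopwords leaves runs of spaces behind, which `split()` with no arguments collapses.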

In [3]:
%%px
with open("nltk-stopwords.txt", "r") as fh:
    STOPWORDS = list(map(str.strip, fh.readlines()))


# Raw strings avoid invalid-escape warnings; the duplicate colon entry is dropped.
PUNCT_LIST = [r"\.", r"\-", r"\?", ":", "!", "&", "'", ","]
punc_regex = "|".join([f"({p})" for p in PUNCT_LIST])
stopword_regex = "|".join([f"\\b({s})\\b" for s in STOPWORDS])

In [4]:
%%px
@bodo.jit
def preprocess(reviews):
    # lowercase and strip
    reviews = reviews.str.lower()
    reviews = reviews.str.strip()

    # remove punctuation and stopwords
    reviews = reviews.str.replace(punc_regex, "", regex=True)
    reviews = reviews.str.replace(stopword_regex, "", regex=True)
    return reviews

## Find the Most Common Words

In [5]:
%%px
@bodo.jit
def find_top_words(review_filename):
    # Load in the data
    t_start = time.time()
    df = pd.read_csv(review_filename, parse_dates=[2])
    print("read time", time.time() - t_start)

    score = df.score
    reviews = df.text

    t1 = time.time()
    reviews = preprocess(reviews)
    print("preprocess time", time.time() - t1)

    t1 = time.time()
    # create low and high score series
    low_threshold = 1.5
    high_threshold = 4.95
    high_reviews = reviews[score > high_threshold]
    low_reviews = reviews[score <= low_threshold]
    high_reviews = high_reviews.dropna()
    low_reviews = low_reviews.dropna()

    high_colsplit = high_reviews.str.split()
    low_colsplit = low_reviews.str.split()
    print("high/low time", time.time() - t1)

    t1 = time.time()
    high_words = high_colsplit.explode()
    low_words = low_colsplit.explode()

    top_words = high_words.value_counts().head(25)
    low_words = low_words.value_counts().head(25)
    print("value_counts time", time.time() - t1)
    print("total time", time.time() - t_start)
    return top_words, low_words

    
top_words, low_words = find_top_words("s3://bodo-example-data/beer/reviews_sample.csv")
if bodo.get_rank() == 0:
    print(top_words)
    print(low_words)


[stdout:0] read time 2.7194855730003837
preprocess time 15.966832514000089
high/low time 0.2035854699997799
value_counts time 0.019346405999385752
total time 18.91044466400035
beer         333
one          158
taste        140
head         119
like         117
best         102
chocolate     90
dark          90
great         86
perfect       80
good          79
sweet         77
smell         73
bottle        72
ive           70
flavor        68
glass         65
well          65
ever          65
aroma         64
nice          64
malt          63
hops          62
bourbon       62
beers         62
Name: text, dtype: int64
beer           239
like           109
taste          104
head            69
one             65
light           65
smell           57
bad             53
bottle          52
really          49
good            41
would           40
get             38
water           35
flavor          33
smells          32
carbonation     32
much            32
beers           32
glass        

%px: 100%|██████████| 8/8 [00:53<00:00,  6.65s/tasks]
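
The filter, split, explode, and count steps inside `find_top_words` can be illustrated on toy data with plain Python. This sketch swaps `collections.Counter` in for the pandas `explode` and `value_counts` calls, so it is an analogue of the notebook code and not the code itself. The `rows` data is hypothetical; the thresholds are the ones used above.

```python
from collections import Counter

# Hypothetical toy data: (score, preprocessed review text or None).
rows = [
    (5.0, "great dark chocolate beer"),
    (5.0, "best beer ever"),
    (1.0, "bad watery beer"),
    (1.5, None),            # missing text, dropped like dropna()
    (3.5, "decent beer"),   # neither high nor low, ignored
]

low_threshold = 1.5
high_threshold = 4.95

# Keep reviews strictly above the high threshold, or at or below the low one.
high_reviews = [t for s, t in rows if s > high_threshold and t is not None]
low_reviews = [t for s, t in rows if s <= low_threshold and t is not None]

# Splitting each review and flattening plays the role of str.split() + explode();
# Counter plays the role of value_counts().
high_counts = Counter(w for t in high_reviews for w in t.split())
low_counts = Counter(w for t in low_reviews for w in t.split())

print(high_counts.most_common(1))  # [('beer', 2)]
print(dict(low_counts))
```

As in the notebook, the high group uses a strict comparison (`>`) while the low group is inclusive (`<=`), so a review scored exactly 1.5 counts as low.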
