## Project 1: Analytics on Glassdoor Reviews and Yelp Category Data

### University of California, Santa Barbara  
### PSTAT 135/235  
### Last Updated: Nov 2, 2018

### OBJECTIVE  
#### In this assignment, you will perform some basic analytics on review and category data.
#### This will entail performing operations on *RDDs*, and using *list comprehensions*.
#### Read in the dataset and perform the steps requested below.

#### TOTAL POINTS = 10

### Config Setup

In [29]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("review_and_category_analytics") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.cores.max", "4") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()

sc = spark.sparkContext

Read in the dataset  
The dataset is saved in the *data* folder. Note that the path below is relative: there is NO leading forward slash in front of *data*.

In [32]:
df = sc.textFile("data/reviews_and_categories.csv")

In [33]:
df.take(3)

['index,review,categories',
 '0,,"[\'point of interest\', \'mexican\', \'establishment\', \'food\', \'restaurant\']"',
 '1,,[]']

In [13]:
header = df.first()

In [14]:
header

'index,review,categories'

Get the non-header records

In [16]:
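# Drop the header line, then split each remaining line on commas.
# Note: this naive split also breaks on commas inside quoted fields.
data = df.filter(lambda r: r != header) \
        .map(lambda row: [e for e in row.split(',')])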

Print the first 2 records (note: exclude the header in all calculations)

In [34]:
data.take(2)

[['0',
  '',
  '"[\'point of interest\'',
  " 'mexican'",
  " 'establishment'",
  " 'food'",
  ' \'restaurant\']"'],
 ['1', '', '[]']]
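The output above shows that splitting on every comma breaks the quoted categories field into several pieces, and the same happens to any review that contains a comma. As a minimal sketch of an alternative (not part of the assignment's solution), Python's standard `csv` module can parse one line while honoring double quotes. The helper name `parse_line` is hypothetical, and the approach assumes no quoted field spans more than one line.

```python
import csv

def parse_line(line):
    """Parse one CSV line into fields, keeping double-quoted fields intact."""
    return next(csv.reader([line]))

# First data record from the file, as shown by df.take(3)
sample = "0,,\"['point of interest', 'mexican', 'establishment', 'food', 'restaurant']\""

naive = sample.split(',')    # splits inside the quoted list as well
fields = parse_line(sample)  # index, review, categories

print(len(naive))   # 7
print(len(fields))  # 3
print(fields[2])
```

Under these assumptions, a function like this could replace the `split(',')` call inside `map`, so that the review and the categories each arrive as a single field.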

**1) get a record count (2 POINTS)**

In [35]:
data.count()

1305

Store the records with a non-empty *review* field

In [41]:
data_has_review = data.map(lambda r: (r[0],r[1],r[2:])) \
                    .filter(lambda r: r[1] != '')

**2) get a count of records with non-missing reviews (2 POINTS)**

In [42]:
data_has_review.count()

305

In [43]:
data_has_review.take(3)

[('3',
  '"Some franchise owners dock hours. Pros Good discounts on the food. Cons The location where I was working',
  [' in North Fresno near Riverpark Mall',
   ' was ran by the owner s father who treated the female staff with contempt and derision. Would yell at the staff',
   ' in front of the guests',
   ' if they didn t exactly follow his formula for making the sandwiches (even when the staff were trying to fulfill the special requests of the guests). He would clock out the closers (with or without their knowledge) before they were done with their tasks',
   ' and ask employees to stay an hour or two past the end of their shift',
   ' but would not pay them for their time."',
   '"[\'lunch\'',
   " 'best sandwich'",
   " 'entertainment'",
   " 'restaurants'",
   " 'sub'",
   " 'arizona'",
   " 'quick'",
   " 'social networks'",
   " 'washington'",
   " 'catering reno'",
   " 'establishment'",
   " 'nevada'",
   " 'restaurant'",
   " 'wraps'",
   " 'qsr'",
   " 'small business'",

**3) Return the count of records where review contains the word *awesome*  (2 POINTS)**

In [44]:
awesome_records = data_has_review.map(lambda r: r[1]) \
                .filter(lambda r: 'awesome' in r)

In [45]:
awesome_records.count()

3

Print the records where review contains the word *awesome*

In [46]:
awesome_records.collect()

['Very awesome. Pros They allow the use of flexible schedule. Cons There are too many hostesses.',
 'Not a bad place to work Pros The couple that owned the one I worked at were very big on school and very flexible with my school schedule Being a delivery driver is great because the tips are awesome Cons The owner was kind of crazy I m kind of a health freak and just don t like working with pizza',
 '"Cashier Play Attendant Pros The kids are awesome it feels like a family everyone is willing to help awesome food discount Cons needs new equipment like a new espresso machine']

Lowercase all reviews, then return the count of records where the review contains the word *awesome*

In [47]:
awesome_records_lower = data_has_review.map(lambda r: r[1].lower()) \
                .filter(lambda r: 'awesome' in r)

In [48]:
awesome_records_lower.collect()

['very awesome. pros they allow the use of flexible schedule. cons there are too many hostesses.',
 'not a bad place to work pros the couple that owned the one i worked at were very big on school and very flexible with my school schedule being a delivery driver is great because the tips are awesome cons the owner was kind of crazy i m kind of a health freak and just don t like working with pizza',
 '"cashier play attendant pros the kids are awesome it feels like a family everyone is willing to help awesome food discount cons needs new equipment like a new espresso machine',
 'busboy pros flexible schedule decent tips awesome coworkers 50 employee discount on meals cons physically demanding long hours rude and or impatient customers',
 '"awesome place to work pros management is laid back and the work environment is relaxed. everyone that works there is willing to help when you need anything. lots of college kids work there because it is a college town. it s a tremendous job for anyone i

In [49]:
awesome_records_lower.count()

5
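The two counts differ (3 versus 5) because `'awesome' in r` is a case-sensitive substring test. It would also match longer words that merely contain the letters, such as "awesomeness". The sketch below shows one possible stricter variant using a case-insensitive, whole-word regular expression. The name `mentions_awesome` and the short sample reviews are illustrative assumptions, not data from the file.

```python
import re

# Whole-word, case-insensitive match for "awesome"
AWESOME = re.compile(r"\bawesome\b", re.IGNORECASE)

def mentions_awesome(review):
    """Return True if the review contains 'awesome' as a standalone word."""
    return AWESOME.search(review) is not None

# Hypothetical sample reviews for illustration only
reviews = [
    "Very awesome. Pros flexible schedule.",
    "Awesome place to work",
    "decent tips, long hours",
    "awesomeness overload",
]

matches = [r for r in reviews if mentions_awesome(r)]
print(len(matches))  # 2
```

A predicate like this could be passed to `filter` on the review field, which would make the separate lowercasing step unnecessary.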

**4) Return the top 10 most frequent categories  (4 POINTS)**  

Preprocess the categories as follows:  
* strip the characters `[`, `]`, `'`, and `"`  
* trim spaces before and after each term  
* lowercase the result

NOTE: Be sure to keep terms together, for example 'jet skiing' should not become 'jet', 'skiing'
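The preprocessing steps above can be sketched as a small helper applied with a list comprehension. The function name `clean_token` is hypothetical; the sample tokens are taken from the first record shown earlier, plus one made-up mixed-case term to illustrate that multi-word terms stay together.

```python
def clean_token(token):
    """Remove [, ], ' and " from a token, then trim and lowercase it."""
    for ch in "[]'\"":
        token = token.replace(ch, "")
    return token.strip().lower()

# Raw tokens as produced by splitting a line on commas
raw = ['"[\'point of interest\'', " 'mexican'", ' \'restaurant\']"', " 'Jet Skiing'"]

cleaned = [clean_token(t) for t in raw]
print(cleaned)
# ['point of interest', 'mexican', 'restaurant', 'jet skiing']
```

Because the function only removes punctuation and outer whitespace, the space inside a term such as 'jet skiing' is preserved.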

In [24]:
# Keep everything after the index and review columns
cats = data.map(lambda r: r[2:])

In [25]:
cats.take(3)

[['"[\'point of interest\'',
  " 'mexican'",
  " 'establishment'",
  " 'food'",
  ' \'restaurant\']"'],
 ['[]'],
 ['"[\'other\'', ' \'food & beverages\']"']]
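The fragments above are pieces of what was originally a single field holding a Python-style list. As an alternative sketch (a different technique from the character stripping this assignment asks for), `ast.literal_eval` can turn such a field into a real list, provided the field has been kept whole, for example by a CSV-aware parser. The name `parse_categories` is hypothetical, and the sketch assumes every field is a valid Python list literal.

```python
import ast

def parse_categories(field):
    """Convert a field like "['other', 'food & beverages']" into a list of clean terms."""
    terms = ast.literal_eval(field)  # safely evaluates the list literal
    return [t.strip().lower() for t in terms]

print(parse_categories("['other', 'food & beverages']"))
# ['other', 'food & beverages']
print(parse_categories("[]"))
# []
```

With this approach an empty list yields no terms at all, instead of an empty string.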

In [26]:
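# Clean every token, flatten to one token per element, count each token,
# then swap to (count, token) and sort by count in descending order
cats_flat = cats.map(lambda row: [token.replace('[', '')
                                       .replace(']', '')
                                       .replace("'", '')
                                       .replace('"', '')
                                       .strip()
                                       .lower() for token in row]) \
                .flatMap(lambda x: x) \
                .map(lambda x: (x, 1)) \
                .reduceByKey(lambda x, y: x + y) \
                .map(lambda x: (x[1], x[0])) \
                .sortByKey(False)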

In [27]:
cats_flat.take(10)

[(718, 'point of interest'),
 (718, 'establishment'),
 (717, 'food'),
 (660, 'restaurant'),
 (497, 'price'),
 (483, 'other'),
 (439, ''),
 (332, 'credit cards'),
 (311, 'menus'),
 (292, 'eating places')]
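The empty string with 439 occurrences in the result above most likely comes from records whose categories field is `[]`, which cleans down to `''`. The sketch below reproduces the counting logic locally with `collections.Counter` and drops empty tokens before counting. The sample rows and the helper `clean_token` are illustrative assumptions, not the full dataset.

```python
from collections import Counter

def clean_token(token):
    """Remove [, ], ' and " from a token, then trim and lowercase it."""
    for ch in "[]'\"":
        token = token.replace(ch, "")
    return token.strip().lower()

# A few rows shaped like the output of cats.take(3), plus two made-up rows
rows = [
    ['"[\'point of interest\'', " 'food'", ' \'restaurant\']"'],
    ['[]'],
    ['"[\'other\'', ' \'food & beverages\']"'],
    ['"[\'food\'', ' \'restaurant\']"'],
    ['"[\'food\']"'],
]

# Flatten, clean, and skip tokens that are empty after cleaning
counts = Counter(t for row in rows for t in map(clean_token, row) if t)
print(counts.most_common(2))
# [('food', 3), ('restaurant', 2)]
```

In the Spark pipeline, the analogous change would be a `.filter(lambda x: x != '')` step after `flatMap`, if empty categories should be excluded from the ranking.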