## Lab Assignment: Analytics on Glassdoor Reviews and Yelp Category Data

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023


#### Jiaxing (Joy) Qiu
#### JQ2UW
---

### OBJECTIVE  

In this assignment, you will perform basic analytics on review and category data.  
This entails operating on *RDDs* and using *list comprehensions*.   

As this assignment covers RDDs, do not use DataFrames.

Read in the dataset and perform the steps requested below.

#### TOTAL POINTS = 10

---

### Config Setup

In [31]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("review_and_category_analytics") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'1g') \
    .getOrCreate()

sc = spark.sparkContext

Read in the dataset  

In [32]:
df = sc.textFile("reviews_and_categories.csv")
#df.take(10)

Get non-header records  

In [33]:
header = df.first()
df = df.filter(lambda x: x != header)

Print the first 2 records (note: exclude the header in all calculations)

In [34]:
df.take(2)

['0,,"[\'point of interest\', \'mexican\', \'establishment\', \'food\', \'restaurant\']"',
 '1,,[]']

**1) Get a record count (2 POINTS)**

In [35]:
df.count()

1302

**2) Get a count of records with non-missing reviews (2 POINTS)**

In [49]:
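import csv

# Parse each line with csv.reader so that commas inside the quoted category
# field do not split the row, then keep rows whose review (column 1) is non-empty.
df1 = df.map(lambda str_row: list(csv.reader([str_row], delimiter=',', quotechar='"'))[0])\
    .filter(lambda str_list_row: str_list_row[1] != '')
df1.count()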

305
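
The parsing step above relies on `csv.reader` honoring the double-quoted category field. The following is a minimal plain-Python sketch (no Spark involved) that uses the two sample records printed earlier as input and shows why a naive `split(',')` would not work. The helper name `parse_row` is hypothetical and is introduced only for illustration.

```python
import csv

def parse_row(str_row):
    """Parse one CSV line into a list of string fields (hypothetical helper)."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([str_row], delimiter=',', quotechar='"'))

# The two sample records printed earlier in this notebook.
rows = [
    '0,,"[\'point of interest\', \'mexican\', \'establishment\', \'food\', \'restaurant\']"',
    '1,,[]',
]

parsed = [parse_row(r) for r in rows]

# Each row has three fields: id, review, categories.
print([len(p) for p in parsed])

# A naive split breaks the quoted category list into several pieces.
print(len(rows[0].split(',')))

# Keep only rows with a non-missing review (column 1); both samples have none.
with_review = [p for p in parsed if p[1] != '']
print(len(with_review))
```

The same per-row logic is what the `map` and `filter` calls apply to every element of the RDD.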

In [50]:
df1.take(5)

[['3',
  'Some franchise owners dock hours. Pros Good discounts on the food. Cons The location where I was working, in North Fresno near Riverpark Mall, was ran by the owner s father who treated the female staff with contempt and derision. Would yell at the staff, in front of the guests, if they didn t exactly follow his formula for making the sandwiches (even when the staff were trying to fulfill the special requests of the guests). He would clock out the closers (with or without their knowledge) before they were done with their tasks, and ask employees to stay an hour or two past the end of their shift, but would not pay them for their time.',
  "['lunch', 'best sandwich', 'entertainment', 'restaurants', 'sub', 'arizona', 'quick', 'social networks', 'washington', 'catering reno', 'establishment', 'nevada', 'restaurant', 'wraps', 'qsr', 'small business', 'meal takeaway', 'hospitality', 'sandwich', 'franchise', 'seminars', 'deli', 'point of interest', 'sandwiches', 'port', 'other', 'fo

**3) Return the count of records where the review contains the word *flexible* (1 POINT)**

In [51]:
df2 = df1.filter(lambda x: 'flexible' in x[1])

df2.count()

36

Print the records where the review contains the word *flexible*

**4) Lowercase all reviews, then return the count of records where the review contains the word *flexible* (1 POINT)**

In [52]:
df3 = df1.filter(lambda x: 'flexible' in x[1].lower())
df3.count()

44
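
The gap between the two counts (36 versus 44) comes from reviews that spell the word with different casing, for instance at the start of a sentence. A small plain-Python sketch shows the effect; the review strings are made up for illustration and are not taken from the dataset.

```python
# Made-up review snippets for illustration only; they are not from the dataset.
reviews = [
    "Pros flexible schedule",
    "Flexible hours and good pay",
    "FLEXIBLE management",
    "Cons long shifts",
]

# Case-sensitive substring match, as in question 3.
case_sensitive = [r for r in reviews if 'flexible' in r]

# Lowercase each review before matching, as in question 4.
case_insensitive = [r for r in reviews if 'flexible' in r.lower()]

print(len(case_sensitive), len(case_insensitive))
```

Both variants are substring matches, so a word such as "inflexible" would also be counted.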

**5) Return the top 10 most frequent categories  (4 POINTS)**  

Preprocess the categories by:  
* stripping the characters `[`, `]`, `'`, and `"`  
* trimming spaces before and after words  
* lowercasing  
* removing blank categories

NOTE: Be sure to keep multi-word terms together; for example, 'jet skiing' should not become 'jet', 'skiing'.
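
The preprocessing steps listed above can be sketched in plain Python on a single raw category string before applying them to the RDD. This is one possible approach under the assumption that terms are separated by commas; the helper name `clean_categories` and the sample strings are hypothetical.

```python
import re

def clean_categories(raw):
    """Turn a raw category string into a list of cleaned terms (hypothetical helper)."""
    # Remove the bracket and quote characters: [ ] ' "
    stripped = re.sub(r"[\[\]'\"]", "", raw)
    # Split on commas only, so multi-word terms such as 'jet skiing' stay intact.
    terms = [t.strip().lower() for t in stripped.split(",")]
    # Drop blank terms, which arise from empty lists like '[]'.
    return [t for t in terms if t != ""]

print(clean_categories("['Point of Interest', ' jet skiing ', 'food']"))
print(clean_categories("[]"))
```

One caveat of splitting on commas: a term that itself contains a comma, such as 'dinner, lunch & more' in the category output of this notebook, is split into two terms. Parsing the string with `ast.literal_eval` might avoid that, at the cost of departing from the character-stripping steps the assignment describes.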

In [53]:
df = sc.textFile("reviews_and_categories.csv")
header = df.first()
df = df.filter(lambda x: x != header)


In [97]:
# Parse each row and keep only the category column (index 2)
df4 = df.map(lambda str_row: list(csv.reader([str_row], delimiter=',', quotechar='"'))[0])\
    .map(lambda x: x[2])

df4.take(10)

["['point of interest', 'mexican', 'establishment', 'food', 'restaurant']",
 '[]',
 "['other', 'food & beverages']",
 "['lunch', 'best sandwich', 'entertainment', 'restaurants', 'sub', 'arizona', 'quick', 'social networks', 'washington', 'catering reno', 'establishment', 'nevada', 'restaurant', 'wraps', 'qsr', 'small business', 'meal takeaway', 'hospitality', 'sandwich', 'franchise', 'seminars', 'deli', 'point of interest', 'sandwiches', 'port', 'other', 'food', 'party trays reno', 'service', 'entrepeneur', 'franchises', 'fast food', 'grillers', 'griller', 'salad', 'management', 'businesses', 'self employed', 'wrap', 'submarine', 'delis', 'lake tahoe', 'boss', 'salads', 'trade shows', 'eating places', 'franchising', 'reno', 'subs', 'phoenix']",
 '[]',
 "['point of interest', 'establishment', 'food', 'restaurant']",
 "['dining options', 'credit cards', 'no reservations', 'no outdoor seating', 'price', 'huaraches', 'baja tacos', 'menus', 'dinner, lunch & more', 'mexican restaurant', 'sal

In [98]:
import re
# Strip the characters [ ] ' " then split on commas, trim, and lowercase each term
df4 = df4.map(lambda x: re.sub(r"[\[\]'\"]", '', x))\
    .map(lambda x: x.split(','))\
    .flatMap(lambda x: [i.strip().lower() for i in x])

df4.take(10)

['point of interest',
 'mexican',
 'establishment',
 'food',
 'restaurant',
 '',
 'other',
 'food & beverages',
 'lunch',
 'best sandwich']

In [101]:
# Count each term, swap to (count, term), and sort by count in descending order
df5_w_empty = df4.map(lambda x: (x,1))\
    .reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey(False)

df5_w_empty.take(10)

[(717, 'point of interest'),
 (717, 'establishment'),
 (716, 'food'),
 (659, 'restaurant'),
 (496, 'price'),
 (482, 'other'),
 (435, ''),
 (331, 'credit cards'),
 (311, 'menus'),
 (291, 'eating places')]

In [102]:
# Same count and sort, after removing blank categories as the task requires
df5_wo_empty = df4.filter(lambda x: x != '')\
    .map(lambda x: (x,1))\
    .reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey(False)

df5_wo_empty.take(10)

[(717, 'point of interest'),
 (717, 'establishment'),
 (716, 'food'),
 (659, 'restaurant'),
 (496, 'price'),
 (482, 'other'),
 (331, 'credit cards'),
 (311, 'menus'),
 (291, 'eating places'),
 (274, 'dining options')]
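
As a local sanity check of the count-then-sort logic, the same top-N computation can be sketched in plain Python with `collections.Counter`. This is only a cross-check on a small made-up sample, not a replacement for the RDD solution the assignment requires.

```python
from collections import Counter

# Made-up cleaned category terms for illustration; not taken from the dataset.
terms = ['food', 'restaurant', 'food', '', 'price', 'food', 'restaurant', '']

# Drop blanks, then count occurrences of each term.
counts = Counter(t for t in terms if t != '')

# most_common(n) returns (term, count) pairs sorted by descending count.
top2 = counts.most_common(2)
print(top2)
```

Within the RDD API, `takeOrdered` with a `key` function may be an alternative to swapping the pair and calling `sortByKey`, since it returns only the requested number of elements.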