## Lab Assignment: Analytics on Glassdoor Reviews and Yelp Category Data

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---

### Justin Lee
### jgh2xh@virginia.edu
---

### OBJECTIVE  

In this assignment, you will perform basic analytics on review and category data.  
This entails operating on *RDDs* and using *list comprehensions*.   

As this assignment covers RDDs, do not use DataFrames.

Read in the dataset and perform the steps requested below.

#### TOTAL POINTS = 10

---

### Config Setup

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("review_and_category_analytics") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'1g') \
    .getOrCreate()

sc = spark.sparkContext

/opt/conda/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/08/31 15:32:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Read in the dataset  

In [2]:
df = sc.textFile("reviews_and_categories.csv")

Get non-header records  

In [3]:
import re

header = re.compile('index,review_emp_txt,categories')
df_noheader = df.filter(lambda x: not re.match(header, x))
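
To see what this filter does without a Spark session, here is a minimal sketch that applies the same pattern to a plain Python list with a list comprehension. The `lines` list is a hypothetical stand-in for the RDD, built from the header and two records from this dataset. `re.match` only matches at the start of a string, so only the header line is dropped.

```python
import re

# Hypothetical stand-in for the RDD: the header plus two records from the file
lines = [
    'index,review_emp_txt,categories',
    '0,,"[\'point of interest\', \'mexican\', \'establishment\', \'food\', \'restaurant\']"',
    '1,,[]',
]

header = re.compile('index,review_emp_txt,categories')

# re.match anchors at the start of the string, so only the header line matches
records = [line for line in lines if not header.match(line)]

print(len(records))  # 2
```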

Print the first 2 records (note: exclude the header in all calculations)

In [4]:
df_noheader.take(2)

['0,,"[\'point of interest\', \'mexican\', \'establishment\', \'food\', \'restaurant\']"',
 '1,,[]']

In [5]:
# Closure (arguably over-engineered) so the regex pattern is compiled only once
def preprocess_closure():
    # raw string so that \[ and \] are regex escapes, not string escapes
    cat_pattern = re.compile(r'\[(.*?)\]')
    
    # text preprocessing function: x is one raw CSV line (str);
    # the returned tuple holds:
    # [0] the index (int)
    # [1] the review (str)
    # [2] the categories (list of str)
    def _preprocess(x):
        idx_str = x.split(',')[0]
        idx = int(idx_str)
        len_idx = len(idx_str)

        # categories are surrounded by square brackets
        a = re.findall(cat_pattern, x)
        cat_str = a[0]
        if len(cat_str) > 0:
            # remove single quotation marks
            # and split categories
            # and remove leading/trailing whitespace
            cat = [c.strip().lower() for c in cat_str.replace('\'', '').split(',')]
        else:
            cat = []

        # review is between idx and categories
        rev_start = len_idx + 1  # add 1 for comma
        # trailing part: 1 for comma, 2 for square brackets, plus 2 for the
        # double quotation marks that are only present when the list was quoted
        rev_end = len(cat_str) + (5 if x.endswith('"') else 3)
        rev = x[rev_start:-rev_end].replace('"', '')  # remove double quotation marks

        return (idx, rev, cat)

    return _preprocess
    
preprocess = preprocess_closure()

df_preprocessed = df_noheader.map(preprocess)
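
As a cross-check on the hand-rolled slicing above, the same three fields can be recovered with the standard library's `csv` and `ast` modules. This is a sketch of an alternative, not the approach used in this notebook. `parse_line` is a hypothetical helper, and it assumes that every record sits on a single line and that the category column is a valid Python list literal.

```python
import ast
import csv

def parse_line(line):
    """Parse one raw CSV line into (index, review, categories)."""
    # csv.reader handles the double quotes around fields that contain commas
    idx, review, cats = next(csv.reader([line]))
    # literal_eval turns the text "['a', 'b']" into a real list of strings
    categories = [c.strip().lower() for c in ast.literal_eval(cats) if c.strip()]
    return int(idx), review, categories

# The two records printed earlier in this notebook
sample = [
    '0,,"[\'point of interest\', \'mexican\', \'establishment\', \'food\', \'restaurant\']"',
    '1,,[]',
]

parsed = [parse_line(line) for line in sample]
for row in parsed:
    print(row)
```

If those assumptions hold for the whole file, a function like this could be passed to `df_noheader.map` and its output compared against `df_preprocessed`.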

**1) get a record count (2 POINTS)**

In [6]:
df_preprocessed.count()

1302

**2) get a count of records with non-missing reviews (2 POINTS)**

In [7]:
df_preprocessed.filter(lambda x: len(x[1]) > 0).count()

305

**3) Return the count of records where review contains the word *flexible*  (1 POINT)**

In [8]:
flexible = re.compile('flexible')
df_preprocessed.filter(lambda x: re.search(flexible, x[1])).count()

36

Print the records where review contains the word *flexible*

In [9]:
df_preprocessed.filter(lambda x: re.search(flexible, x[1])).map(lambda x: (x[0], x[1])).collect()

[(25,
  'Nice to work for but no room to advance Pros Gave me lots of hours and very flexible with my college school schedule. Always allowed me to work as many hours as I could. Cons When asked for a raise, there was no question about it, the answer was no. I feel like this is a dead end job where you can t advance to become something better'),
 (30,
  'Great, Very flexible and understanding. Pros Good food, great people. flexible hours. Cons Limited work hours every week.'),
 (31,
  'High demand and stress with very low reward. Pros Schedule is very flexible with opportunities to work out of store for percentage of profit. Free custard! Cons Unorganized management. Low pay (min wage 10 h). Often times catering is very behind. Orders for catering change constantly. Unreliable employees.'),
 (44,
  'Decent Employer Employees Need Improvement Pros The Managers were usually very fair, level headed, and flexible Customers were usually cool Job not very difficult Cons Some employees were m

**4) Lowercase all reviews, then return the count of records where review contains the word *flexible* (1 POINT)**

In [10]:
df_preprocessed.filter(lambda x: re.search(flexible, x[1].lower())).count()

44
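
The gap between the counts in 3) and 4) comes from reviews in which the word only appears with capital letters, for example at the start of a sentence. The sketch below illustrates this on a plain list. The `reviews` strings are made up for illustration, and the whole-word pattern is an extra variant that the assignment does not ask for; it shows that a plain substring search also counts words such as *inflexible*.

```python
import re

# Made-up reviews for illustration only (not rows from the dataset)
reviews = [
    "Very flexible with my school schedule",   # lowercase occurrence
    "Flexible hours and great people",         # capitalised occurrence only
    "Management was inflexible about shifts",  # substring of a longer word
    "",                                        # missing review
]

substring = re.compile("flexible")                        # as in step 3
ignore_case = re.compile("flexible", re.IGNORECASE)       # same effect as lowercasing first
whole_word = re.compile(r"\bflexible\b", re.IGNORECASE)   # stricter: whole word only

n_substring = sum(1 for r in reviews if substring.search(r))
n_ignore_case = sum(1 for r in reviews if ignore_case.search(r))
n_whole_word = sum(1 for r in reviews if whole_word.search(r))

print(n_substring, n_ignore_case, n_whole_word)  # 2 3 2
```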

**5) Return the top 10 most frequent categories  (4 POINTS)**  

Preprocess the categories by:  
* stripping the characters `[`, `]`, `'`, and `"`  
* trimming spaces before and after words  
* lowercasing
* removing blank categories

NOTE: Be sure to keep multi-word terms together; for example, 'jet skiing' should not become 'jet', 'skiing'.

In [11]:
from operator import add

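# flatten the category lists, drop blank categories, count each category,
# then sort by count in descending order
cat_counts = df_preprocessed.flatMap(lambda x: x[2]) \
                            .filter(lambda c: c) \
                            .map(lambda c: (c, 1)) \
                            .reduceByKey(add) \
                            .sortBy(lambda kv: kv[1], ascending=False)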

cat_counts.take(10)

[('point of interest', 717),
 ('establishment', 717),
 ('food', 716),
 ('restaurant', 659),
 ('price', 496),
 ('other', 482),
 ('credit cards', 331),
 ('menus', 311),
 ('eating places', 291),
 ('dining options', 274)]
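
The category count above can be mirrored on a small in-memory list, which is handy for checking the logic before running it on the RDD. Here is a minimal sketch using a list comprehension and `collections.Counter`. The `records` tuples are made-up examples in the `(index, review, categories)` shape, not rows from the dataset.

```python
from collections import Counter

# Made-up records in the (index, review, categories) shape
records = [
    (0, "", ["point of interest", "food", "restaurant"]),
    (1, "", []),
    (2, "great food", ["food", "jet skiing", ""]),
]

# Flatten all category lists into one list, skipping blank categories;
# multi-word terms such as 'jet skiing' stay intact
all_cats = [c for _, _, cats in records for c in cats if c]

counts = Counter(all_cats)
print(counts.most_common(1))  # [('food', 2)]
```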