# 0. Installing PySpark in Google Colab

Install the dependencies (this must be done each time you re-open this notebook):

1.   Java 8
2.   Apache Spark with Hadoop
3.   Findspark (used to locate Spark on the system)

In [None]:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

# extract the spark archive into the current folder
!tar xf spark-3.5.3-bin-hadoop3.tgz

# point the JAVA_HOME and SPARK_HOME environment variables at the Java and Spark folders
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"

# install findspark using pip
!pip install -q findspark
import findspark
findspark.init()


- Mount your Google Drive folder to access files
- Needs to be done once each time you restart your runtime

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

# 1. Download Amazon Review Dataset

- Run the following cell to download the dataset files and copy them to your Google Drive, so that you keep a permanent copy and don't need to re-download them every time
- You only need to run it the first time
- We will be using the following two files for this task. Details of the dataset can be found at https://cseweb.ucsd.edu/%7Ejmcauley/datasets/amazon_v2/#subsets
  - Grocery_and_Gourmet_Food_5.json.gz is from 5-core review data
  - meta_Grocery_and_Gourmet_Food.json.gz is from metadata under Per-category data

In [None]:
# download and decompress the files
!wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFilesSmall/Grocery_and_Gourmet_Food_5.json.gz
!gzip -d Grocery_and_Gourmet_Food_5.json.gz

!wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Grocery_and_Gourmet_Food.json.gz
!gzip -d meta_Grocery_and_Gourmet_Food.json.gz

# Copy files from colab machine to your google drive
# You should see files "meta_Grocery_and_Gourmet_Food.json" and "Grocery_and_Gourmet_Food_5.json" in your CS5344_AY2425Sem2_Lab folder
!cp /content/Grocery_and_Gourmet_Food_5.json /content/gdrive/My\ Drive/CS5344_AY2425Sem2_Lab/Grocery_and_Gourmet_Food_5.json
!cp /content/meta_Grocery_and_Gourmet_Food.json /content/gdrive/My\ Drive/CS5344_AY2425Sem2_Lab/meta_Grocery_and_Gourmet_Food.json


--2024-11-29 02:13:28--  https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFilesSmall/Grocery_and_Gourmet_Food_5.json.gz
Resolving datarepo.eng.ucsd.edu (datarepo.eng.ucsd.edu)... 132.239.8.30
Connecting to datarepo.eng.ucsd.edu (datarepo.eng.ucsd.edu)|132.239.8.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 146631394 (140M) [application/x-gzip]
Saving to: ‘Grocery_and_Gourmet_Food_5.json.gz’


2024-11-29 02:13:31 (50.1 MB/s) - ‘Grocery_and_Gourmet_Food_5.json.gz’ saved [146631394/146631394]

--2024-11-29 02:13:37--  https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Grocery_and_Gourmet_Food.json.gz
Resolving datarepo.eng.ucsd.edu (datarepo.eng.ucsd.edu)... 132.239.8.30
Connecting to datarepo.eng.ucsd.edu (datarepo.eng.ucsd.edu)|132.239.8.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 109586529 (105M) [application/x-gzip]
Saving to: ‘meta_Grocery_and_Gourmet_Food.json.gz’


2024-11-29 0

# 2. Amazon Review Analysis

**Task 1: Write a Spark program to find the top 15 products based on their number of reviews and average ratings**

Step 1. Calculate the number of reviews and the average rating for each asin. Use an RDD with the reduceByKey and map functions to accomplish this step. Your RDD should have the product asin as the key, and a tuple of (#reviews, average_rating) as the value.
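
One possible way to approach this step is sketched below. It assumes each line of the review file is a JSON object with `asin` and `overall` fields (as described on the dataset page), and that `sc` is an existing SparkContext. The helper names (`parse_review`, `merge_counts`, `to_average`, `review_stats`) are illustrative only. The per-record functions are exercised on a small in-memory sample so they can be checked without Spark.

```python
import json
from functools import reduce

def parse_review(line):
    """Turn one JSON line into (asin, (1, rating))."""
    record = json.loads(line)
    return record["asin"], (1, float(record["overall"]))

def merge_counts(a, b):
    """Combine two (count, rating_sum) partial results."""
    return a[0] + b[0], a[1] + b[1]

def to_average(count_and_sum):
    """Convert (count, rating_sum) into (count, average_rating)."""
    count, rating_sum = count_and_sum
    return count, rating_sum / count

def review_stats(sc, path):
    """Hypothetical wiring: sc is a SparkContext, path is the review file."""
    return (sc.textFile(path)
              .map(parse_review)
              .reduceByKey(merge_counts)
              .map(lambda kv: (kv[0], to_average(kv[1]))))

# Local check of the per-record logic (no Spark required)
sample = ['{"asin": "A1", "overall": 5.0}', '{"asin": "A1", "overall": 4.0}']
pairs = [parse_review(line) for line in sample]
total = reduce(merge_counts, (value for _, value in pairs))
print(to_average(total))  # (2, 4.5)
```

Carrying the pair (count, rating_sum) through `reduceByKey` and dividing only at the end avoids averaging partial averages, which would give a wrong result when partitions hold different numbers of reviews.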


Step 2. Using the metadata file, create an RDD where the key is the product asin and the value is the brand name of the product. Remove any duplicate entries from the RDD.
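
A minimal sketch of this step, assuming each metadata line is a JSON object with `asin` and `brand` fields and that `sc` is an existing SparkContext. The names `parse_brand` and `brand_pairs` are hypothetical, and treating a missing brand as an empty string is an assumption, not a requirement of the task.

```python
import json

def parse_brand(line):
    """Turn one metadata JSON line into (asin, brand)."""
    record = json.loads(line)
    # Assumption: a missing brand is represented as an empty string
    return record["asin"], record.get("brand", "")

def brand_pairs(sc, path):
    """Hypothetical wiring: sc is a SparkContext, path is the metadata file."""
    return sc.textFile(path).map(parse_brand).distinct()

# Local check: a set plays the role of distinct() on a small sample
sample = ['{"asin": "A1", "brand": "KIND"}', '{"asin": "A1", "brand": "KIND"}']
print(sorted(set(parse_brand(line) for line in sample)))  # [('A1', 'KIND')]
```

Note that `distinct()` only removes identical (asin, brand) pairs. If the same asin appeared with two different brand strings, both pairs would remain.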

Step 3. Join the pair RDD obtained in Step 1 and the RDD created in Step 2 by the product asin.
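
The join can be sketched as follows, assuming `stats_rdd` holds (asin, (#reviews, average_rating)) pairs and `brand_rdd` holds (asin, brand) pairs. `RDD.join` produces nested values of the form (asin, ((#reviews, average_rating), brand)); the hypothetical helper `flatten_joined` unpacks them into a flat tuple, which is optional but can make later steps easier to read.

```python
def flatten_joined(entry):
    """Unpack (asin, ((n_reviews, avg_rating), brand)) into a flat tuple."""
    asin, ((n_reviews, avg_rating), brand) = entry
    return asin, n_reviews, avg_rating, brand

def join_stats_with_brand(stats_rdd, brand_rdd):
    """Hypothetical wiring: both arguments are pair RDDs keyed by asin."""
    return stats_rdd.join(brand_rdd).map(flatten_joined)

# Local check of the unpacking on one illustrative joined record
print(flatten_joined(("B00BUKL666", ((7387, 4.58), "KIND"))))
# ('B00BUKL666', 7387, 4.58, 'KIND')
```

`join` is an inner join, so a product that has reviews but no metadata entry would not appear in the result.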

Step 4. Take the top 15 entries, sorted by number of reviews in descending order. If multiple entries have the same number of reviews, sort them by average rating in descending order.
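
A sketch of the two-level ordering, assuming each entry is a flat tuple (asin, #reviews, average_rating, brand); adjust the unpacking if your joined RDD is nested differently. The names `rank_key` and `top_products` are illustrative. Negating both fields turns the ascending order used by `takeOrdered` into a descending one.

```python
def rank_key(entry):
    """Sort key: more reviews first, then higher average rating."""
    asin, n_reviews, avg_rating, brand = entry
    return (-n_reviews, -avg_rating)

def top_products(joined_rdd, n=15):
    """Hypothetical wiring: returns a Python list of the n best entries."""
    return joined_rdd.takeOrdered(n, key=rank_key)

# Local check: sorted() with the same key mimics takeOrdered
sample = [("A", 10, 4.0, "x"), ("B", 12, 3.0, "y"), ("C", 10, 4.5, "z")]
print([entry[0] for entry in sorted(sample, key=rank_key)])  # ['B', 'C', 'A']
```

Entries that tie on both the number of reviews and the average rating have no order defined by this key, so their relative position may vary between runs.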


Step 5. Save your RDD into a file. Print the product asin, #reviews, average rating and brand of your 15 entries. The expected output is:
```
[0] ASIN: B00BUKL666, #reviews: 7387, avg rating: 4.585352646541221, brand: KIND
[1] ASIN: B008QMX2SG, #reviews: 6228, avg rating: 4.573538856775851, brand: KIND
[2] ASIN: B00D3M2QP4, #reviews: 6221, avg rating: 4.573701977174088, brand: KIND
[3] ASIN: B00R7PWK7W, #reviews: 3387, avg rating: 4.568644818423383, brand: KIND
[4] ASIN: B000X3TPHS, #reviews: 3030, avg rating: 4.756435643564356, brand: YumEarth
[5] ASIN: B000F4DKAI, #reviews: 2922, avg rating: 4.623545516769336, brand: Twinings
[6] ASIN: B0001LO3FG, #reviews: 2922, avg rating: 4.623545516769336, brand: Twinings
[7] ASIN: B00KSN9TME, #reviews: 2637, avg rating: 4.583996966249526, brand: KIND
[8] ASIN: B000U0OUP6, #reviews: 2560, avg rating: 4.55859375, brand: Planters
[9] ASIN: B000E1FZHS, #reviews: 2555, avg rating: 4.55812133072407, brand: Planters
[10] ASIN: B00542YXFW, #reviews: 2455, avg rating: 4.374745417515275, brand: Davidson's Tea
[11] ASIN: B00RW0MZ6S, #reviews: 2168, avg rating: 4.550738007380073, brand: Planters
[12] ASIN: B000Z93FQC, #reviews: 2064, avg rating: 4.724806201550388, brand: YS Royal Jelly/Honey Bee
[13] ASIN: B00XA8XWGS, #reviews: 2053, avg rating: 4.590842669264491, brand: Twinings
[14] ASIN: B00XOORKRK, #reviews: 1980, avg rating: 4.543939393939394, brand: Planters
```
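
The output format above could be produced along these lines. This is a sketch that assumes `top_entries` is a Python list of flat tuples (asin, #reviews, average_rating, brand) and that `sc` is an existing SparkContext; `format_entry`, `save_and_print` and `out_dir` are hypothetical names.

```python
def format_entry(index, entry):
    """Render one entry in the same layout as the expected output."""
    asin, n_reviews, avg_rating, brand = entry
    return (f"[{index}] ASIN: {asin}, #reviews: {n_reviews}, "
            f"avg rating: {avg_rating}, brand: {brand}")

def save_and_print(sc, top_entries, out_dir):
    """Hypothetical wiring: write the lines as a one-partition RDD, then print them."""
    lines = [format_entry(i, entry) for i, entry in enumerate(top_entries)]
    sc.parallelize(lines, 1).saveAsTextFile(out_dir)
    for line in lines:
        print(line)

# Local check of the formatting
print(format_entry(8, ("B000U0OUP6", 2560, 4.55859375, "Planters")))
# [8] ASIN: B000U0OUP6, #reviews: 2560, avg rating: 4.55859375, brand: Planters
```

`saveAsTextFile` writes a directory of part files rather than a single file, and it raises an error if the output directory already exists.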






**Task 2: Write a Spark program to compute word counts and find common words in the reviews of each product**

Step 1: Read the reviews from the review file and preprocess the text. Eliminate punctuation and special characters, convert all words to lowercase, and tokenize the text into individual words. Your RDD should have the product asin as key and the list of words as value. You can remove entries with missing reviews.
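
A possible preprocessing sketch, assuming the review text is stored in a `reviewText` field and that `sc` is an existing SparkContext. The regular expression, which deletes every character other than lowercase letters, digits and whitespace, is one interpretation of "punctuation and special characters" and may need adjusting to reproduce the reference output exactly. The helper names are illustrative.

```python
import json
import re

# Assumption: anything that is not a lowercase letter, digit or whitespace is dropped
NON_WORD = re.compile(r"[^a-z0-9\s]")

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return NON_WORD.sub("", text.lower()).split()

def parse_review_words(line):
    """Return (asin, [words]) or None when the review text is missing."""
    record = json.loads(line)
    text = record.get("reviewText")
    if not text:
        return None
    return record["asin"], tokenize(text)

def review_words(sc, path):
    """Hypothetical wiring: sc is a SparkContext, path is the review file."""
    return (sc.textFile(path)
              .map(parse_review_words)
              .filter(lambda entry: entry is not None))

# Local check of the tokenizer
print(tokenize("Great TEA, love it!"))  # ['great', 'tea', 'love', 'it']
```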

Step 2: Calculate the frequency of each word for each product asin. Your RDD should have the tuple (product_asin, word) as key and the count of that word for that product as value. You can use reduceByKey() to accomplish this step.
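
The counting step can be sketched as a `flatMap` followed by `reduceByKey`, assuming `words_rdd` holds (asin, [words]) pairs. The names `to_word_pairs` and `word_counts` are hypothetical. A `Counter` stands in for `reduceByKey` in the local check.

```python
from collections import Counter
from operator import add

def to_word_pairs(entry):
    """Expand (asin, [words]) into one ((asin, word), 1) pair per word."""
    asin, words = entry
    return [((asin, word), 1) for word in words]

def word_counts(words_rdd):
    """Hypothetical wiring: sum the 1s for every (asin, word) key."""
    return words_rdd.flatMap(to_word_pairs).reduceByKey(add)

# Local check: accumulate the pairs the way reduceByKey(add) would
local_counts = Counter()
for key, one in to_word_pairs(("A1", ["tea", "good", "tea"])):
    local_counts[key] += one
print(dict(local_counts))  # {('A1', 'tea'): 2, ('A1', 'good'): 1}
```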

Step 3: Find the top 10 words in the reviews of each product. Your RDD should have product_asin as key and a list of (word, count) tuples as value.
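
One way to sketch this step, assuming `counts_rdd` holds ((asin, word), count) pairs: re-key each record by asin, group the (word, count) pairs per product, and keep the 10 largest counts. Using `groupByKey` with `heapq.nlargest` is a design choice for this sketch, not the only option, and the helper names are illustrative.

```python
import heapq

def rekey(entry):
    """Move the word from the key into the value: (asin, (word, count))."""
    (asin, word), count = entry
    return asin, (word, count)

def top_words(pairs, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    return heapq.nlargest(n, pairs, key=lambda word_count: word_count[1])

def top_words_per_product(counts_rdd, n=10):
    """Hypothetical wiring: group by asin and keep the top n words of each group."""
    return (counts_rdd.map(rekey)
                      .groupByKey()
                      .mapValues(lambda pairs: top_words(pairs, n)))

# Local check of the selection
print(top_words([("a", 1), ("b", 3), ("c", 2)], n=2))  # [('b', 3), ('c', 2)]
```

Words with equal counts keep the order in which they arrive, which Spark does not guarantee, so ties may be ordered differently from the reference output.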

Step 4: Save your RDD into a file. For your reference, the first 10 entries, ordered by product asin, are as follows:
```
[0] 4639725043: [('tea', 73), ('i', 59), ('the', 52), ('it', 39), ('a', 36), ('this', 34), ('and', 33), ('is', 31), ('of', 30), ('to', 22)]
[1] 4639725183: [('the', 30), ('i', 25), ('tea', 24), ('is', 13), ('a', 12), ('it', 12), ('this', 12), ('and', 12), ('lipton', 9), ('to', 9)]
[2] 5463213682: [('the', 12), ('i', 10), ('and', 6), ('coffee', 4), ('not', 4), ('supreme', 3), ('love', 3), ('cafe', 3), ('sugar', 3), ('it', 3)]
[3] 9742356831: [('i', 189), ('the', 179), ('and', 158), ('a', 156), ('it', 140), ('to', 106), ('curry', 93), ('this', 86), ('is', 77), ('in', 72)]
[4] B00004S1C5: [('i', 30), ('the', 29), ('to', 20), ('and', 19), ('a', 15), ('is', 14), ('this', 14), ('it', 14), ('for', 14), ('are', 12)]
[5] B00004W4VD: [('jerky', 9), ('of', 6), ('the', 5), ('and', 5), ('for', 5), ('to', 4), ('a', 4), ('meat', 4), ('you', 3), ('is', 3)]
[6] B000052X2S: [('the', 11), ('i', 11), ('of', 11), ('to', 8), ('a', 8), ('drops', 8), ('in', 7), ('is', 6), ('and', 6), ('for', 6)]
[7] B000052Y74: [('i', 29), ('it', 20), ('gum', 19), ('to', 18), ('and', 18), ('of', 18), ('a', 16), ('the', 16), ('you', 12), ('mouth', 12)]
[8] B00005344V: [('i', 91), ('the', 76), ('it', 63), ('and', 57), ('tea', 53), ('this', 50), ('to', 44), ('a', 41), ('my', 33), ('of', 30)]
[9] B00005BPQ9: [('the', 59), ('i', 51), ('a', 49), ('and', 37), ('to', 34), ('for', 31), ('in', 25), ('they', 22), ('of', 22), ('milk', 20)]
```
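
The listing above could be generated roughly as follows, assuming `top_rdd` holds (asin, [(word, count), ...]) pairs. `format_product`, `save_and_preview` and `out_dir` are hypothetical names used for this sketch.

```python
def format_product(index, entry):
    """Render one entry in the same layout as the reference listing."""
    asin, words = entry
    return f"[{index}] {asin}: {words}"

def save_and_preview(top_rdd, out_dir, n=10):
    """Hypothetical wiring: sort by asin, save, then print the first n entries."""
    ordered = top_rdd.sortByKey()
    ordered.saveAsTextFile(out_dir)
    for index, entry in enumerate(ordered.take(n)):
        print(format_product(index, entry))

# Local check of the formatting
print(format_product(0, ("4639725043", [("tea", 73), ("i", 59)])))
# [0] 4639725043: [('tea', 73), ('i', 59)]
```

Because asins are strings, `sortByKey` orders them lexicographically, which places the purely numeric asins before those starting with `B`, as in the reference listing.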




