# Part 1: RDDs

Setup Spark

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ChiSquaredRDD").getOrCreate()
sc = spark.sparkContext

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/05/25 10:04:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/25 10:04:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/05/25 10:04:15 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/05/25 10:04:15 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
23/05/25 10:04:15 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
23/05/25 10:04:1

## Load reviews as RDD

In [2]:
# review_path = "hdfs:///user/dic23_shared/amazon-reviews/full/reviewscombined.json"
review_path = "hdfs:///user/dic23_shared/amazon-reviews/full/reviews_devset.json"
# review_path = "hdfs:///user/e11809642/reviews/reduced_devset.json"
input_rdd = sc.textFile(review_path)

## Obtain stopwords

Load stopwords into (local) memory (Note: file contains duplicates, so convert to set)

In [3]:
stopwords_path = 'stopwords.txt'

def load_unique_lines(filename):
    lines = set()

    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()  # Remove leading/trailing whitespace and newline characters
            lines.add(line)

    return lines


stopwords_local = load_unique_lines(stopwords_path)

Broadcast stopword data 'into Spark' so that it can be used in other RDDs (broadcasting should be fine as data easily fits into memory)

In [4]:
stopwords = sc.broadcast(stopwords_local)
# values stored in broadcast variables are accessible via value prop (for this case use stopwords.value)

## Parse JSON strings, extract the category + review text

In [5]:
import json
category_review_rdd = input_rdd \
    .map(lambda json_str: json.loads(json_str)) \
    .map(lambda json_obj: (json_obj['category'], json_obj['reviewText']))

## Compute total number of documents

In [35]:
review_count = category_review_rdd.count()

                                                                                

## Compute number of documents per category

define RDD with required transformations

In [37]:
category_counts_rdd = category_review_rdd \
    .map(lambda pair: (pair[0], 1)) \
    .reduceByKey(lambda x, y: x + y)

Next, collect the values of the RDD into a local list. This list is really small and will easily fit into memory on the datanodes. Apparently, when using it within a `map()` or other transformations applied on an RDD, the data will automatically be broadcast across all data nodes with as many values as there are categories and therefore really small) and broadcast them

In [45]:
category_counts = category_counts_rdd.collect()

# alternative approach - for explanation of difference see: https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
# create broadcast variable
#category_counts = sc.broadcast(category_counts_rdd.collect())
# use .value to access the value that is now broadcast across all data nodes
#category_counts.value

# Obtain terms via tokenization followed by stopword removal, compute number of occurrences of each term in each category across all reviews

In [44]:
# define pattern for splitting/tokenizing
import re
pattern = re.compile(r"[^a-zA-Z<>^|]+")

# obtain terms and occurrences per category
term_category_occurrences_rdd = category_review_rdd \
    .map(lambda pair: ((pair[0], pair[1]), set(pattern.split(pair[1])))) \
    .flatMap(lambda pair: (((term.lower(), pair[0][0]), 1) for term in pair[1] if
                           term.lower() not in stopwords_local and len(term) >= 2)) \
    .reduceByKey(lambda x, y: x + y)

## Compute the number of occurrences of each term across all reviews

In [10]:
term_occurrences_rdd = term_category_occurrences_rdd \
    .map(lambda pair: (pair[0][0], pair[1])) \
    .reduceByKey(lambda x, y: x + y)

In [47]:
term_occurrences_rdd.take(3)

[('scripture', 119), ('verse', 73), ('tremendously', 58)]

In [46]:
term_category_occurrences_rdd.take(3)

                                                                                

[(('mic', 'Musical_Instrument'), 25),
 (('setups', 'Musical_Instrument'), 1),
 (('tape', 'Musical_Instrument'), 5)]

In [34]:
term_occurrences_rdd.union(term_category_occurrences).collect()[-1]

                                                                                

(('colada', 'Health_and_Personal_Care'), 1)

In [11]:
def calculate_chi_square(term, category):
    """
    Computes the chi-squared value for a given term and category
    """
    a = term_category_occurrences_rdd[(term, category)]
    b = term_occurrences_rdd[term] - a
    c = category_counts_rdd[category] - a
    d = review_count - a - b - c
    return review_count * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

In [20]:
term_category_occurrences_rdd[('mic', 'Musical_Instrument')]

TypeError: 'PipelinedRDD' object is not subscriptable

In [13]:
term_occurrences_rdd.take(5)

                                                                                

[('scripture', 119),
 ('verse', 73),
 ('tremendously', 58),
 ('chapter', 887),
 ('friendly', 310)]

In [14]:
term_category_occurrences.take(5)

                                                                                

[(('mic', 'Musical_Instrument'), 25),
 (('setups', 'Musical_Instrument'), 1),
 (('tape', 'Musical_Instrument'), 5),
 (('clip', 'Musical_Instrument'), 5),
 (('altering', 'Musical_Instrument'), 1)]

In [18]:
len(term_category_occurrences.collect())

272560

In [25]:
term_category_occurrences.take(5)[0]

(('horizons', 'Patio_Lawn_and_Garde'), 1)

In [12]:
# Compute the chi-squared value for each unique term and category pair
# (term, category) -> chi-square
term_category_chi_squared_rdd = term_category_occurrences_rdd \
    .map(lambda pair: (pair[0][0], (pair[0][1], calculate_chi_square(pair[0][0], pair[0][1])))) \
    .groupByKey()

Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/usr/lib/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 262, in __getnewargs__
    raise RuntimeError(
RuntimeError: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.


PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

In [11]:
# Select the top 75 tokens with the highest chi-square value for each category
# (category) -> [(token, chi-square)]
chi_square_rdd = term_category_chi_squared_rdd \
    .map(lambda pair: (pair[0], sorted(pair[1], key=lambda x: x[1], reverse=True)[:75]))

NameError: name 'term_category_chi_squared_rdd' is not defined

In [None]:
# Select all unique tokens from the top 75 tokens with the highest chi-square value for each category
tokens = chi_square_rdd \
    .flatMap(lambda pair: (token for token, chi_square in pair[1])) \
    .distinct() \
    .collect()

In [None]:
# Sort the tokens in alphabetical order
tokens.sort()

In [None]:
chi_square_rdd = chi_square_rdd.sortByKey()

# Save the top 75 tokens with the highest chi-square value for each category to a file in the local file system
# in the format: "<category> term1:chi_squared1 term2:chi_squared2 ... term75:chi_squared75" for each line and append the list of tokens to the end of the file
with open("chi_squared.txt", "a") as file:
    for pair in chi_square_rdd.collect():
        file.write("<%s>" % pair[0] + " ")
        for token, chi_square in pair[1]:
            file.write("%s:%f" % (token, chi_square) + " ")
        file.write("\n")
    file.write(" ".join(tokens) + "\n")

In [None]:
sc.stop()
