This is the Jupyter Notebook for Part 1 of exercise 2

In the first code cell we import the necessary libraries and set up the Spark application using
`SparkConf` and `SparkContext` from the `pyspark` library.

In [None]:
import re
import os
import json
from pyspark import SparkConf, SparkContext

# Configure the Spark application and create a SparkContext
conf = SparkConf().setMaster("local").setAppName("ChiSquareJob")
sc = SparkContext(conf=conf)



Next we load the input data from the given HDFS path into an RDD named `input_data`.
We also load the stopwords from `stopwords.txt`, which is located in the same directory as the notebook, and collect them into a Python set.

In [2]:
# Get the path to the directory containing the Jupyter notebook
notebook_path = os.path.abspath("")

# Load the input data and stopwords
input_data = sc.textFile("hdfs:///user/dic23_shared/amazon-reviews/full/reviews_devset.json")
stopwords = set(sc.textFile(f"file://{notebook_path}/stopwords.txt").collect())

                                                                                

In the following cell we implement a function that tokenizes a review. It applies the preprocessing requested in exercise 1: the text is split on whitespace, digits and punctuation using a regular expression, converted to lowercase, and filtered to exclude stopwords and one-letter tokens. Because the chi-squared value only depends on whether a term occurs in a document, not on how often, the function returns the set of unique tokens of the review.

In [3]:
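# Compile the delimiter pattern once instead of on every call
TOKENIZER = re.compile(r'[\s\d\(\)\[\]\{\}\.,!?\-,;:+\=_"\'`~#@&*%€$§\\/]+')

def tokenize(review):
    # Split on whitespace, digits and punctuation, then lowercase
    tokens = [token.lower() for token in TOKENIZER.split(review)]
    # Keep each token once; drop stopwords and one-letter (or empty) tokens
    unique_tokens = set(token for token in tokens if token not in stopwords and len(token) > 1)
    return unique_tokens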
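
To illustrate what the tokenizer returns, here is a minimal sketch that runs without Spark. The names `demo_stopwords`, `DELIMITERS` and `tokenize_demo` and the sample sentence are assumptions made for this example only; the notebook itself uses the stopwords loaded from `stopwords.txt`.

```python
import re

# Hypothetical stopword set for illustration; the notebook loads the real one from stopwords.txt
demo_stopwords = {"the", "is", "and", "it"}

# Same delimiter pattern as in the tokenize cell
DELIMITERS = re.compile(r'[\s\d\(\)\[\]\{\}\.,!?\-,;:+\=_"\'`~#@&*%€$§\\/]+')

def tokenize_demo(text, stopwords):
    """Split on delimiters, lowercase, drop stopwords and one-letter tokens, deduplicate."""
    tokens = (token.lower() for token in DELIMITERS.split(text))
    return {token for token in tokens if token not in stopwords and len(token) > 1}

demo_tokens = tokenize_demo("The battery is GREAT, and it lasts 2 days!", demo_stopwords)
print(sorted(demo_tokens))  # ['battery', 'days', 'great', 'lasts']
```

The digit `2` acts as a delimiter, and the empty string left over after the trailing `!` is removed by the length filter.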

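Next we implement a `calculate_chi_squared` function based on the formula given in the lecture.
The function returns a dictionary containing the chi-squared value of the provided token for each category in which it occurs. It reads the global variables `category_counts` and `N`, which are computed in the cell after it, so that cell must be executed before the function is called.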

In [None]:
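def calculate_chi_squared(token, counts, total_word_documents):
    """Return {category: chi-squared value} for a single token.

    counts maps each category in which the token occurs to the number of
    documents of that category containing it; total_word_documents is the
    number of documents containing the token overall. The token argument is
    not used in the computation. Relies on the globals category_counts and N.
    """
    chi_squared_values = {}
    for category, count in counts.items():
        A = count                            # documents in c which contain t
        B = total_word_documents - A         # documents not in c which contain t
        C = category_counts[category] - A    # documents in c without t
        D = N - A - B - C                    # documents not in c without t
        upper_part = N * (A * D - B * C) ** 2
        lower_part = (A + B) * (A + C) * (B + D) * (C + D)
        # Guard against a zero denominator (e.g. a token present in every document)
        chi_squared = upper_part / lower_part if lower_part else 0.0
        chi_squared_values[category] = chi_squared
    return chi_squared_values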
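
As a worked example of the formula, the following sketch computes the chi-squared value for a single 2x2 contingency table. The function name `chi_squared` and all counts are hypothetical and chosen only so that the arithmetic is easy to follow. Returning 0.0 for a zero denominator is an assumption of this sketch.

```python
def chi_squared(A, B, C, D):
    """Chi-squared value of a 2x2 term/category contingency table."""
    N = A + B + C + D
    denominator = (A + B) * (A + C) * (B + D) * (C + D)
    # Assumption: treat a degenerate table (empty row or column) as 0.0
    if denominator == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denominator

# Hypothetical corpus: 100 documents, 40 of them in category c;
# the term t occurs in 30 documents, 25 of which are in c.
A = 25                 # in c, contains t
B = 30 - A             # not in c, contains t   -> 5
C = 40 - A             # in c, without t        -> 15
D = 100 - A - B - C    # not in c, without t    -> 55

value = chi_squared(A, B, C, D)
print(round(value, 2))  # 33.53
```

Here the numerator is 100 * (25 * 55 - 5 * 15)^2 = 169,000,000 and the denominator is 30 * 40 * 60 * 70 = 5,040,000, which gives a value of about 33.53.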

In the following code snippet we compute the category counts from the input data. First, the data is loaded as JSON. Next the categories are extracted, and then the categories are reduced to their counts. The result is an RDD containing the category counts, which is then converted to a dictionary. The total number of documents (N) is calculated as the sum of the category counts.

In [4]:
# Compute the category counts from the input data
category_counts_rdd = input_data.map(json.loads).map(lambda review: (review['category'], 1)).reduceByKey(lambda a, b: a + b)
category_counts = dict(category_counts_rdd.collect())
N = sum(category_counts.values())
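
The same counting logic can be sketched in plain Python, which may help to check the expected result on a small sample. The three JSON lines and the `demo_` names below are made up for illustration and are not taken from the dataset.

```python
import json
from collections import Counter

# Hypothetical JSON lines standing in for the review file
demo_lines = [
    '{"category": "Book", "reviewText": "A great read"}',
    '{"category": "Book", "reviewText": "Too long"}',
    '{"category": "Toy", "reviewText": "Fun for kids"}',
]

# Equivalent of map(json.loads) -> map to (category, 1) -> reduceByKey(add)
demo_category_counts = Counter(json.loads(line)["category"] for line in demo_lines)

# Total number of documents
demo_N = sum(demo_category_counts.values())

print(dict(demo_category_counts), demo_N)  # {'Book': 2, 'Toy': 1} 3
```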

                                                                                

In the following code we apply a chain of transformations to our `input_data` RDD. Overall, this pipeline computes the chi-squared value for each token-category pair. Since all steps are transformations, Spark evaluates them lazily once an action is called on the result. Below we describe the steps.

1. Using `map(json.loads)` we parse each input line as JSON into a Python dictionary.
2. In `flatMap(...)` the review text is tokenized. Stopwords and words with length 1 are excluded. The output has the format `((token, category), 1)`.
3. We sum the counts of each `(token, category)` pair by calling `reduceByKey(...)`.
4. Using `map(...)` we rearrange the tuples to the format `(token, (category, count))`.
5. Using `groupByKey()` we group the counts by token, resulting in `(token, [(category_1, count_1), ...])`. This is needed because the chi-squared formula requires the total number of documents containing the token across all categories.
6. The next `map(...)` function converts the list of category and count pairs to a dictionary.
7. Now we calculate the chi-squared values for each token and category pair using the `calculate_chi_squared` function in `map(...)`.
8. To get the format `((token, category), chi_squared)` we flatten the chi-squared values into tuples using `flatMap(...)`.

Through this pipeline we obtain an RDD, `chi_squared_values`, which holds the chi-squared value for each token and category pair.

In [10]:
# Chain the transformations on the RDD (evaluated lazily, once an action is called)
chi_squared_values = input_data \
    .map(json.loads) \
    .flatMap(lambda review: [((token, review['category']), 1) for token in tokenize(review['reviewText'])]) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[0][0], (x[0][1], x[1]))) \
    .groupByKey() \
    .map(lambda token_counts: (token_counts[0], {category: count for category, count in token_counts[1]})) \
    .map(lambda token_counts: (token_counts[0], calculate_chi_squared(token_counts[0], token_counts[1], sum(token_counts[1].values())))) \
    .flatMap(lambda token_chi_squared: [((token_chi_squared[0], category), chi_squared) for category, chi_squared in token_chi_squared[1].items()])
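
To make the data flow of the eight steps easier to follow, here is a sketch of the same computation in plain Python on a toy corpus. The reviews, which are assumed to be tokenized already, and all `demo_` names are hypothetical. The zero-denominator fallback is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical, already tokenized reviews: (category, set of unique tokens)
demo_reviews = [
    ("Book", {"plot", "great"}),
    ("Book", {"plot", "author"}),
    ("Toy", {"great", "fun"}),
    ("Toy", {"fun", "kids"}),
]

demo_category_counts = Counter(category for category, _ in demo_reviews)
demo_N = len(demo_reviews)

# Steps 2-3: number of documents per (token, category) pair
pair_counts = Counter(
    (token, category) for category, tokens in demo_reviews for token in tokens
)

# Steps 4-6: regroup as token -> {category: count}
per_token = defaultdict(dict)
for (token, category), count in pair_counts.items():
    per_token[token][category] = count

# Steps 7-8: chi-squared value per (token, category) pair
demo_chi_squared = {}
for token, counts in per_token.items():
    total = sum(counts.values())  # documents containing the token overall
    for category, A in counts.items():
        B = total - A
        C = demo_category_counts[category] - A
        D = demo_N - A - B - C
        denominator = (A + B) * (A + C) * (B + D) * (C + D)
        # Assumption: use 0.0 when the denominator vanishes
        score = demo_N * (A * D - B * C) ** 2 / denominator if denominator else 0.0
        demo_chi_squared[(token, category)] = score

print(demo_chi_squared[("plot", "Book")])   # 4.0
print(demo_chi_squared[("great", "Book")])  # 0.0
```

In this toy corpus "plot" occurs in both Book reviews and nowhere else, so it receives the maximum score, while "great" is spread evenly over both categories and scores 0. As in the Spark pipeline, a pair only appears in the result if the token occurs at least once in that category.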


The following pipeline takes the `chi_squared_values` RDD computed in the previous cell, groups the chi-squared values by category, and extracts the top 75 terms per category.

1.  We rearrange the tuples to the format `(category, (token, chi_squared))` using the `map(...)` function. 
2.  Next, we group the chi-squared values by category using `groupByKey()`. This results in `(category, [(token_1, chi_squared_1), ...])`.
3.  Lastly, `mapValues(...)` is used to sort the list of token and chi-squared pairs in descending order based on the chi-squared value. Here, we only keep the top 75 most discriminative terms for each category.

This pipeline leads to an RDD which contains the top 75 terms for each category with their chi-squared values.

In [11]:
# Group the chi-squared values by category and keep the top 75 terms per category
merged_chi_squared = chi_squared_values \
    .map(lambda x: (x[0][1], (x[0][0], x[1]))) \
    .groupByKey() \
    .mapValues(lambda chi_squared_values: sorted(chi_squared_values, key=lambda x: x[1], reverse=True)[:75])
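
The top-k selection in the last step can also be written with `heapq.nlargest` from the standard library, which avoids sorting the full list of terms for a category. This is only an alternative sketch, not what the cell above does; the helper name `top_k_terms` and the scores are made up for illustration.

```python
import heapq

def top_k_terms(term_scores, k=75):
    """Return the k (term, score) pairs with the highest score, in descending order."""
    return heapq.nlargest(k, term_scores, key=lambda pair: pair[1])

# Hypothetical (term, chi-squared) pairs of one category
demo_scores = [("plot", 4.0), ("great", 0.0), ("author", 1.33), ("pages", 2.5)]
demo_top = top_k_terms(demo_scores, k=2)

print(demo_top)  # [('plot', 4.0), ('pages', 2.5)]
```

Such a helper could be passed to `mapValues(...)` in place of the `sorted(...)[:75]` expression, assuming the function is defined in the notebook before the pipeline runs.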

In the last cell the results are written to a file named "results_part1.txt". As specified in the exercise, each category name is followed by its top 75 terms with their chi-squared values in descending order, one category per line. Finally, the merged dictionary is written as the last line. It contains all unique terms across all categories in alphabetical order.

In [14]:
with open("results_part1.txt", "w") as f:
    top_terms_per_category = sorted(merged_chi_squared.collect(), key=lambda x: x[0])
    merged_dictionary = sorted(merged_chi_squared.flatMap(lambda kv: [term for term, _ in kv[1]]).distinct().collect())

    for category, top_terms in top_terms_per_category:
        f.write(f"{category} {' '.join(f'{term}:{chi_squared:.4f}' for term, chi_squared in top_terms)}\n")
    f.write(' '.join(merged_dictionary))
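
For reference, the layout of the output file can be sketched on toy data without Spark. The helper `format_category_line` and the values in `demo_results` are hypothetical and only mirror the formatting used in the cell above.

```python
def format_category_line(category, top_terms):
    """Format one output line: the category followed by term:score pairs."""
    terms = " ".join(f"{term}:{score:.4f}" for term, score in top_terms)
    return f"{category} {terms}"

# Hypothetical top terms per category, already sorted by descending score
demo_results = {
    "Toy": [("fun", 4.0), ("kids", 1.3333333)],
    "Book": [("plot", 4.0), ("author", 1.3333333)],
}

# One line per category, in alphabetical order of the category name
demo_output_lines = [format_category_line(c, demo_results[c]) for c in sorted(demo_results)]

# Merged dictionary: all unique terms in alphabetical order, space separated
demo_dictionary = " ".join(sorted({term for terms in demo_results.values() for term, _ in terms}))

print("\n".join(demo_output_lines))
print(demo_dictionary)
```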
