## Part 2) Datasets/DataFrames: Spark ML and Pipelines

This is the Jupyter notebook for Part 2 of Exercise 2.

In the first code cell, we import the necessary libraries and create a `SparkSession` object with the application name 'Ex2_Part2', using `SparkSession` from the pyspark library.

In [1]:
import os
import pyspark
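# Import only the feature stages used below; wildcard imports shadow Python builtins
from pyspark.ml.feature import (
    RegexTokenizer, StopWordsRemover, CountVectorizer,
    IDF, StringIndexer, ChiSqSelector,
)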
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

# Create a SparkSession
spark = SparkSession \
    .builder \
    .appName("Ex2_Part2") \
    .master("local[*]") \
    .getOrCreate()

### Loading the Dataset and stopwords.txt

Next we load the input data from the given HDFS path in a DataFrame called df.
We also print the schema of the DataFrame as well as show the data in df.

In [3]:
# Load the input data as DataFrame
df = spark.read.json("hdfs:///user/dic23_shared/amazon-reviews/full/reviews_devset.json")
df.printSchema()
df.show()

root
 |-- asin: string (nullable = true)
 |-- category: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

+----------+--------------------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin|            category| helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+--------------------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|0981850006|Patio_Lawn_and_Garde|  [6, 7]|    5.0|This was a gift f...| 12

We also load the stopwords from the txt file into a list called 'stopwords'.

In [4]:
# Get the path to the directory containing the Jupyter notebook
notebook_path = os.path.abspath("")

# Load the stopwords into a list (one word per line); the file is closed automatically
with open(os.path.join(notebook_path, "stopwords.txt"), "r") as stopwords_file:
    stopwords = stopwords_file.read().splitlines()

print(stopwords)

['a', 'aa', 'able', 'about', 'above', 'absorbs', 'accord', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', 'ain', 'album', 'album', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'app', 'appear', 'appreciate', 'appropriate', 'are', 'aren', 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'baby', 'bb', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'bibs', 'bike', 'book', 'books', 'both', 'brief', 'bulbs', 'but', 'by', 'c', 'came', 'camera', 'can', 'cannot', 'cant', 'car', 'case', 'cause', 'causes', 'cd', 'certain', 'certainly', 'changes', 'clearly', 'co', 'coffee',

### Pipeline

In the following cells, we create the PySpark ML pipeline stages required as preparation for the classification task in Part 3.

First, we create a `RegexTokenizer` object which extracts tokens from the 'reviewText' column based on the specifications provided in the exercise description: it splits at whitespaces, tabs, digits, and the delimiter characters ()[]{}.!?,;:+=-_"'\`~#@&*%€$§\/. Because the `gaps` parameter defaults to `True`, the pattern describes the separators between tokens rather than the tokens themselves.
`RegexTokenizer` has a parameter `toLowercase`, which is set to `True` by default and converts all characters to lowercase before tokenizing. Hence, case folding is also taken care of by this regex-based tokenizer.
The resulting tokens are saved in the column 'tokens', as specified by `outputCol`.

In [None]:
tokenizer = RegexTokenizer(pattern=r'[\s\d\(\)\[\]\{\}\.,!?\-,;:+\=_"\'`~#@&*%€$§\\/]+',
                           inputCol="reviewText",
                           outputCol="tokens")
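
The splitting behaviour can be checked without a Spark session. The following is a minimal sketch, assuming that Python's `re` module interprets this character class the same way as the Java regex engine used by Spark; the helper `tokenize` and the sample sentence are made up for illustration and are not part of the pipeline.

```python
import re

# Same character class as the RegexTokenizer pattern above.
PATTERN = re.compile(r'[\s\d\(\)\[\]\{\}\.,!?\-,;:+\=_"\'`~#@&*%€$§\\/]+')


def tokenize(text):
    """Lowercase the text, split at runs of delimiters, and drop empty tokens."""
    return [token for token in PATTERN.split(text.lower()) if token]


print(tokenize("This was a GIFT, for my 2 kids!"))
# ['this', 'was', 'a', 'gift', 'for', 'my', 'kids']
```

Note that the digit `2` disappears entirely, because digits are treated as delimiters rather than as token characters.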

We then create a `StopWordsRemover` object called `SW_remover`, a feature transformer that filters stop words out of tokenized input. It removes the words loaded from stopwords.txt into the 'stopwords' list from the already tokenized text column.
`StopWordsRemover` therefore takes the output column of the tokenizer, obtained via `tokenizer.getOutputCol()`, as its `inputCol`.
The filtered tokens are saved in a new column 'filtered_tokens'.

In [None]:
SW_remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered_tokens", stopWords=stopwords)
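
Conceptually, this stage is a per-row list filter. A rough plain-Python sketch follows; the function `remove_stopwords` and the five-word stop list are hypothetical, and the lowercase comparison is meant to mirror the default `caseSensitive=False` setting.

```python
def remove_stopwords(tokens, stopwords):
    """Keep the tokens that are not stop words, preserving their order."""
    # A set gives constant-time membership tests for a long stop word list.
    stopword_set = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in stopword_set]


sample_stopwords = ["a", "was", "for", "my", "this"]  # made-up subset
tokens = ["this", "was", "a", "gift", "for", "my", "kids"]
print(remove_stopwords(tokens, sample_stopwords))
# ['gift', 'kids']
```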

Next, we create the TF-IDF stages of the pipeline, consisting of the `CountVectorizer` stage and the `IDF` stage.

The `CountVectorizer` object uses the filtered tokens resulting from the StopWordsRemover (`SW_remover.getOutputCol()`) and converts them into a bag-of-words representation. The resulting features are word counts, where each feature represents the number of times a word appears in a document. The `minDF` parameter specifies the minimum number of documents in which a word must appear in order to be included in the vocabulary. It is set to its default value of 1.0.

The `IDF` stage then takes the output of the CountVectorizer (`count_vectorizer.getOutputCol()`) and applies Inverse Document Frequency (IDF) weighting to those bag-of-words features. That is, it rescales each feature so that terms appearing in many documents are down-weighted relative to terms appearing in few. The `minDocFreq` parameter specifies the minimum number of documents in which a term must appear in order to be included in the IDF calculation. It is set to its default value of 0.

In [None]:
count_vectorizer = CountVectorizer(inputCol=SW_remover.getOutputCol(), outputCol="cv_features", minDF=1.0)
idf = IDF(inputCol=count_vectorizer.getOutputCol(), outputCol="tfidf_features", minDocFreq=0)
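
To make the weighting concrete, here is a small worked sketch in plain Python. It assumes the smoothed formula given in the Spark MLlib documentation, idf(t) = ln((m + 1) / (df(t) + 1)) with m the number of documents, and uses a made-up three-document corpus. The vocabulary is sorted alphabetically here for readability, whereas `CountVectorizer` orders terms by frequency.

```python
import math
from collections import Counter

# Made-up corpus of already tokenized, stop-word-filtered documents.
docs = [
    ["gift", "kids", "gift"],
    ["kids", "book"],
    ["kids", "camera"],
]

vocabulary = sorted({term for doc in docs for term in doc})

# Term counts per document: a dense version of the CountVectorizer output.
counts = [[Counter(doc)[term] for term in vocabulary] for doc in docs]

# Document frequency: in how many documents each term occurs at least once.
num_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[i] > 0) for i in range(len(vocabulary))]

# Smoothed IDF weights and the resulting TF-IDF vectors.
idf_weights = [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]
tfidf = [[tf * weight for tf, weight in zip(row, idf_weights)] for row in counts]

for term, weight in zip(vocabulary, idf_weights):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "kids" occurs in every document, so its weight is ln(4/4) = 0 and it contributes nothing to any TF-IDF vector, while "gift" keeps a positive weight.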

In the final stage, we create a `ChiSqSelector` object called chi_sq_selector to select the top features, here tokens, based on a chi-squared statistical test between each token and the target label for the subsequent classification task, here the different product categories.

As `ChiSqSelector` requires numeric labels as inputs, we first use `StringIndexer` to map our string 'category' column to a numerical index which is saved in the column 'label'.

The chi_sq_selector then takes as input the TF-IDF feature vectors, i.e. the output of the idf stage (`idf.getOutputCol()`), as well as the labels contained in the 'label' column, and selects the top 2000 features (across all classes) based on a chi-squared statistical test between each feature and the label. The selected features are stored in a new column called 'selected_features'.

In [None]:
cat_Indexer = StringIndexer(inputCol="category", outputCol="label")
chi_sq_selector = ChiSqSelector(numTopFeatures=2000, featuresCol=idf.getOutputCol(), outputCol="selected_features", labelCol=cat_Indexer.getOutputCol())
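
The statistic behind this ranking can be illustrated on a small contingency table. The sketch below computes Pearson's chi-squared statistic in plain Python. It is a simplification: it assumes a term is reduced to present/absent and uses two hypothetical categories, whereas the selector treats every distinct feature value as its own category. The function `chi_squared` is made up for illustration and assumes every row and column total is non-zero.

```python
def chi_squared(observed):
    """Pearson's chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if term occurrence and category were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic


# Rows: term present / term absent. Columns: two hypothetical categories.
print(chi_squared([[10, 0], [0, 10]]))  # 20.0, term perfectly separates the categories
print(chi_squared([[5, 5], [5, 5]]))    # 0.0, term carries no information
```

A larger statistic indicates a stronger dependence between term and category, which is why such terms rank higher in the selection.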

Lastly, we combine the above defined stages into a single pipeline, with the order of the stages resembling the order of the cells above.

In [15]:
pipeline = Pipeline(stages=[tokenizer, SW_remover, count_vectorizer, idf, cat_Indexer, chi_sq_selector])

Now that the pipeline is defined, we can train the pipeline model on the given data stored in the DataFrame df.

In [None]:
# Train the pipeline model
pipeline_model = pipeline.fit(df)
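
What fitting a pipeline means can be sketched in a few lines of plain Python. This is a toy model of the idea only, not Spark's implementation: all class and function names below are hypothetical. Stages with a `fit` method play the role of estimators (like `CountVectorizer` or `IDF`), the others are plain transformers (like the tokenizer), and each stage sees the data as transformed by the stages before it.

```python
class Lowercase:
    """Transformer: nothing to learn, only transform()."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyBuilder:
    """Estimator: fit() learns a vocabulary and returns a fitted model."""

    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocabulary)


class VocabularyModel:
    """Fitted model: turns each row into a vector of term counts."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [[row.split().count(term) for term in self.vocabulary] for row in rows]


def fit_pipeline(stages, rows):
    """Fit the stages in order, passing each one the output of its predecessors."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


fitted_stages = fit_pipeline([Lowercase(), VocabularyBuilder()], ["Good book", "good Camera"])
print(fitted_stages[1].vocabulary)
# ['book', 'camera', 'good']
```

The fitted stages keep the order of the original stage list, which is what makes it possible to look up a fitted stage by its position afterwards.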

In order to extract the top 2000 terms selected by the `ChiSqSelector` we define the following function `save_selected_terms()`. It takes a pipeline model and an output txt file and writes the terms to the file in the following steps:

1. Extract the CountVectorizer and ChiSqSelector models from the pipeline model using the stages attribute and indexing.
2. From the CountVectorizerModel, extract the list of terms in the same order as the corresponding indices in the feature vectors using the model's `vocabulary` attribute.
3. Obtain a list of indices of the 2000 selected features using the `selectedFeatures` attribute of the ChiSqSelectorModel.
4. Create the list of selected terms from the selected indices and the corresponding term in the vocabulary list and sort it alphabetically.
5. Write the list of selected terms to the output file, separated by spaces.

In [16]:
def save_selected_terms(pipeline_model, output_file):
    count_vectorizer_model = pipeline_model.stages[2]
    chi_sq_selector_model = pipeline_model.stages[-1]

    # Extract the list of terms (position in the list = feature index)
    vocabulary = count_vectorizer_model.vocabulary
    # Obtain the list of selected feature indices
    selected_indices = chi_sq_selector_model.selectedFeatures

    # Create a list of selected terms
    selected_terms = [vocabulary[index] for index in selected_indices]
    selected_terms.sort()
    print('# terms selected:', len(selected_terms))

    # Write the list to the output file
    with open(output_file, "w") as f:
        for term in selected_terms:
            f.write(f"{term} ")

23/05/05 10:47:20 WARN DAGScheduler: Broadcasting large task binary with size 1233.6 KiB
23/05/05 10:47:20 WARN DAGScheduler: Broadcasting large task binary with size 1235.7 KiB
23/05/05 10:47:24 WARN DAGScheduler: Broadcasting large task binary with size 1238.7 KiB

In this last cell the results are written to a file called "output_ds.txt" using the above function. As specified in the exercise description, it contains the top 2000 terms across all sentiment classes (in our case all product categories) in alphabetical order.

In [None]:
output_file = "output_ds.txt"
save_selected_terms(pipeline_model, output_file)
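
A quick sanity check on the written file could look like the sketch below. The helper `check_terms_file` is hypothetical and not part of the exercise; it assumes the terms are separated by whitespace and compares against Python's default string ordering, which is also what `list.sort()` uses in `save_selected_terms()`. The demo runs on a temporary three-term file so that it works without the real output.

```python
import os
import tempfile


def check_terms_file(path, expected_count):
    """Return True if the file holds expected_count whitespace-separated terms in sorted order."""
    with open(path, "r", encoding="utf-8") as f:
        terms = f.read().split()
    return len(terms) == expected_count and terms == sorted(terms)


# Demo on a throwaway file with made-up terms.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_terms.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("book camera gift ")
    demo_ok = check_terms_file(demo_path, 3)

print(demo_ok)  # True
```

For the actual result, the corresponding call would be `check_terms_file("output_ds.txt", 2000)`.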

In [18]:
# Stop the Spark session
spark.stop()