# DIC EX2 - part 2

## Setup

### Initialize Spark context

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("DIC EX 2 - group 36") \
    .getOrCreate()

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4048. Attempting port 4049.
25/05/06 14:40:13 WARN Utils: Service 'SparkUI' could not bind on port 4049. Attempting port 4050.
25/05/06 1

### Set path variables

In [2]:
data_path = "hdfs:///user/dic25_shared/amazon-reviews/full/reviews_devset.json"
stopwords_path = "stopwords.txt"
output_path = "output_ds.txt"

### Load data

In [3]:
df = spark.read.json(data_path)
df.printSchema()

[Stage 0:>                                                          (0 + 2) / 2]

root
 |-- asin: string (nullable = true)
 |-- category: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)



                                                                                

## Build pipeline

### Tokenize using regex

In [4]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

tokenizer = RegexTokenizer(inputCol="reviewText", outputCol="tokens", pattern="[\s\t\d\(\)\[\]\{\}\.\!\?\,\;\:\+\=\-\_\"\'`\~\#\@\&\*\%\€\$\§\\\/]+")
tokenized = tokenizer.transform(df)
tokenized.select("reviewText", "tokens").show()

[Stage 1:>                                                          (0 + 1) / 1]

+--------------------+--------------------+
|          reviewText|              tokens|
+--------------------+--------------------+
|This was a gift f...|[this, was, a, gi...|
|This is a very ni...|[this, is, a, ver...|
|The metal base wi...|[the, metal, base...|
|For the most part...|[for, the, most, ...|
|This hose is supp...|[this, hose, is, ...|
|This tool works v...|[this, tool, work...|
|This product is a...|[this, product, i...|
|I was excited to ...|[i, was, excited,...|
|I purchased the L...|[i, purchased, th...|
|Never used a manu...|[never, used, a, ...|
|Good price. Good ...|[good, price, goo...|
|I have owned the ...|[i, have, owned, ...|
|I had "won" a sim...|[i, had, won, a, ...|
|The birds ate all...|[the, birds, ate,...|
|Bought last summe...|[bought, last, su...|
|I knew I had a mo...|[i, knew, i, had,...|
|I was a little wo...|[i, was, a, littl...|
|I have used this ...|[i, have, used, t...|
|I actually do not...|[i, actually, do,...|
|Just what I  expe...|[just, wha

                                                                                

### Remove stopwords

In [6]:
from pyspark.ml.feature import StopWordsRemover

def load_stopwords(path: str) -> list[str]:
    """
    Load stopwords from a file efficiently.
    """
    stopwords = set()
    with open(path, "r", encoding="utf-8") as f:
        stopwords = set(line.strip() for line in f if line.strip())
    return list(stopwords)

remover = StopWordsRemover(inputCol="tokens", outputCol="tokens_filtered", stopWords=load_stopwords(stopwords_path))
removed = remover.transform(tokenized)
removed.select("tokens", "tokens_filtered").show()

[Stage 2:>                                                          (0 + 1) / 1]

+--------------------+--------------------+
|              tokens|     tokens_filtered|
+--------------------+--------------------+
|[this, was, a, gi...|[gift, husband, m...|
|[this, is, a, ver...|[nice, spreader, ...|
|[the, metal, base...|[metal, base, hos...|
|[for, the, most, ...|[part, works, pre...|
|[this, hose, is, ...|[hose, supposed, ...|
|[this, tool, work...|[tool, works, cut...|
|[this, product, i...|[typical, usable,...|
|[i, was, excited,...|[excited, ditch, ...|
|[i, purchased, th...|[purchased, leaf,...|
|[never, used, a, ...|[manual, lawnmowe...|
|[good, price, goo...|[good, price, goo...|
|[i, have, owned, ...|[owned, flowtron,...|
|[i, had, won, a, ...|[similar, family,...|
|[the, birds, ate,...|[birds, ate, blue...|
|[bought, last, su...|[bought, summer, ...|
|[i, knew, i, had,...|[knew, mouse, bas...|
|[i, was, a, littl...|[worried, reading...|
|[i, have, used, t...|[brand, long, tim...|
|[i, actually, do,...|[current, model, ...|
|[just, what, i, e...|[expected,

                                                                                

### Calculate token counts and idf

In [7]:
from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer

countModel = CountVectorizer(inputCol="tokens_filtered", outputCol="token_counts").fit(removed)
featurizedData = countModel.transform(removed)

idfModel = IDF(inputCol="token_counts", outputCol="features").fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("tokens_filtered", "token_counts", "features").show()

                                                                                

25/05/06 14:43:16 WARN DAGScheduler: Broadcasting large task binary with size 1063.2 KiB


                                                                                

25/05/06 14:43:27 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
+--------------------+--------------------+--------------------+
|     tokens_filtered|        token_counts|            features|
+--------------------+--------------------+--------------------+
|[gift, husband, m...|(96130,[2,3,7,8,3...|(96130,[2,3,7,8,3...|
|[nice, spreader, ...|(96130,[0,1,3,21,...|(96130,[0,1,3,21,...|
|[metal, base, hos...|(96130,[4,10,29,1...|(96130,[4,10,29,1...|
|[part, works, pre...|(96130,[1,3,4,9,1...|(96130,[1,3,4,9,1...|
|[hose, supposed, ...|(96130,[12,32,42,...|(96130,[12,32,42,...|
|[tool, works, cut...|(96130,[0,3,4,8,1...|(96130,[0,3,4,8,1...|
|[typical, usable,...|(96130,[18,63,122...|(96130,[18,63,122...|
|[excited, ditch, ...|(96130,[6,21,35,3...|(96130,[6,21,35,3...|
|[purchased, leaf,...|(96130,[3,4,5,6,4...|(96130,[3,4,5,6,4...|
|[manual, lawnmowe...|(96130,[6,8,41,87...|(96130,[6,8,41,87...|
|[good, price, goo...|(96130,[1,13,95,2...|(96130,[1,13,95,2...|
|[ow

### Calculate chi square values and select top 75 features

In [8]:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

labels = {}

def label_fn(x: str) -> int:
    if x not in labels:
        labels[x] = len(labels)
    return labels[x]

udfCategoryToLabel = udf(label_fn, IntegerType())
labeled = rescaledData.withColumn("label", udfCategoryToLabel("category"))

result = ChiSqSelector(numTopFeatures=75, featuresCol="features", outputCol="selectedFeatures", labelCol="label").fit(labeled).transform(labeled)

result.select("category", "reviewText", "tokens_filtered", "selectedFeatures").show()

25/05/06 14:43:31 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


                                                                                

25/05/06 14:43:33 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


                                                                                

25/05/06 14:43:42 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


                                                                                

25/05/06 14:44:04 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
+--------------------+--------------------+--------------------+--------------------+
|            category|          reviewText|     tokens_filtered|    selectedFeatures|
+--------------------+--------------------+--------------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|[gift, husband, m...|(75,[2,3,5,30],[5...|
|Patio_Lawn_and_Garde|This is a very ni...|[nice, spreader, ...|(75,[0,1,3,18,65]...|
|Patio_Lawn_and_Garde|The metal base wi...|[metal, base, hos...|(75,[4,7],[2.4430...|
|Patio_Lawn_and_Garde|For the most part...|[part, works, pre...|(75,[1,3,4,6,7,15...|
|Patio_Lawn_and_Garde|This hose is supp...|[hose, supposed, ...|(75,[9],[2.627002...|
|Patio_Lawn_and_Garde|This tool works v...|[tool, works, cut...|(75,[0,3,4,8,15,3...|
|Patio_Lawn_and_Garde|This product is a...|[typical, usable,...|(75,[15],[8.09593...|
|Patio_Lawn_and_Garde|I was excited to ...|[excited, d

### Get top tokens

In [9]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def extract_indices(sparse_vector):
    return sparse_vector.indices.tolist()

extract_indices_udf = udf(extract_indices, ArrayType(IntegerType()))

df_with_indices = result.withColumn("indices", extract_indices_udf(result["selectedFeatures"]))
indices = df_with_indices.select("indices").rdd.flatMap(lambda row: row.indices).distinct().collect()

vocab = countModel.vocabulary
words = [vocab[index] for index in indices]
print(sorted(words))

25/05/06 14:44:05 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


                                                                                

['amazon', 'author', 'back', 'bad', 'big', 'bit', 'bought', 'buy', 'character', 'characters', 'day', 'easy', 'end', 'enjoyed', 'excellent', 'family', 'feel', 'find', 'fit', 'found', 'give', 'good', 'great', 'happy', 'hard', 'high', 'highly', 'interesting', 'job', 'light', 'long', 'lot', 'love', 'loved', 'made', 'make', 'makes', 'man', 'money', 'music', 'nice', 'part', 'people', 'perfect', 'pretty', 'price', 'problem', 'purchase', 'purchased', 'put', 'quality', 'quot', 'reading', 'real', 'recommend', 'review', 'series', 'set', 'size', 'small', 'sound', 'thing', 'things', 'thought', 'time', 'times', 'wanted', 'watch', 'work', 'works', 'world', 'worth', 'written', 'year', 'years']


### Write tokens to file

In [10]:
with open(output_path, "w") as f:
    f.write(" ".join(sorted(words)))