Intitializing Setup and Loading Data

In [1]:
import org.apache.spark.sql.SparkSession
import scala.util.parsing.json.JSON
import scala.util.matching.Regex

var start = System.nanoTime()
var total = System.nanoTime()

//paths
val inputPath = "hdfs:///user/dic25_shared/amazon-reviews/full/reviews_devset.json"
//val inputPath = "hdfs:///user/dic25_shared/amazon-reviews/full/reviewscombined.json"
val stopwordPath = "Exercise_1/assets/stopwords.txt"

val TOP_K = 75
val DELIMS = "[()\\[\\]{}.!?,;:+=\\-_\"'`~#@&*%€§\\\\/0-9]+"

// === LOAD AND BRODCAST STOPWORDS ===
val stopwords = sc.textFile(stopwordPath).collect().toSet
val stopwordsBroadcast = sc.broadcast(stopwords)

// faster parsing the file by defineing the structure of the json
import org.apache.spark.sql.types._

//define sructure
val reviewSchema = StructType(Seq(
  StructField("reviewerID"    , StringType),     // e.g. "A2SUAM1J3GNN3B"
  StructField("asin"          , StringType),     // product ID
  StructField("reviewerName"  , StringType),
  StructField("helpful"       , ArrayType(IntegerType)), // [a,b]
  StructField("reviewText"    , StringType),     // full body
  StructField("overall"       , DoubleType),     // rating 1-5 (float in source)
  StructField("summary"       , StringType),     // review title
  StructField("unixReviewTime", LongType),
  StructField("reviewTime"    , StringType),
  StructField("category"      , StringType)      // our label
))

//parse file
val reviews = spark.read
  .schema(reviewSchema)                 
  .option("mode","DROPMALFORMED")       // skip corrupt lines
  .json(inputPath)
  .filter($"category".isNotNull)        
  .select("category","reviewText","summary") 
  .cache()
val parsed = reviews.as[(String,String,String)].rdd
// At this point the summary field will be omitted. If it is relevant, we need to merge it to the text.

var end = System.nanoTime()
var durationMs = (end - start) / 1e6
println(f"Runtime Parsing: $durationMs%.2f ms")

Intitializing Scala interpreter ...

Spark Web UI available at http://lbdmg01.datalab.novalocal:9999/proxy/application_1745308556449_5514
SparkContext available as 'sc' (version = 3.3.4, master = yarn, app id = application_1745308556449_5514)
SparkSession available as 'spark'


Runtime Parsing: 8918.06 ms


import org.apache.spark.sql.SparkSession
import scala.util.parsing.json.JSON
import scala.util.matching.Regex
start: Long = 12150565117071384
total: Long = 12150565117074115
inputPath: String = hdfs:///user/dic25_shared/amazon-reviews/full/reviews_devset.json
stopwordPath: String = Exercise_1/assets/stopwords.txt
TOP_K: Int = 75
DELIMS: String = [()\[\]{}.!?,;:+=\-_"'`~#@&*%€§\\/0-9]+
stopwords: scala.collection.immutable.Set[String] = Set(serious, latterly, absorbs, looks, particularly, used, e, printer, down, regarding, entirely, regardless, moreover, please, read, ourselves, able, behind, for, despite, s, maybe, viz, further, corresponding, x, any, wherein, across, name, allows, this, instead, in, taste, ought, myself, have, your, off, once, are, is, mon, his, oh, why, rd, knows, bul...


Token & Clean

In [2]:
// === TOKENIZER ===
start = System.nanoTime()

//Define tokenize method
//removes stopwords, replaces all delims with a whitespace and splits at ehitespaces after. Also transformes to lower case
def tokenize(text: String, stopwords: Set[String]): Set[String] = {
  if (text == null) return Set.empty
  val cleaned = text.toLowerCase.replaceAll(DELIMS, " ")
  cleaned.split("\\s+").filter(t => t.length > 1 && !stopwords.contains(t)).toSet
}

// call tokenize on parse json and stopwords
val tokenized = parsed.map {
  case (category, text, summary) =>
    val tokens = tokenize(text, stopwordsBroadcast.value)
    (category, tokens)
}

end = System.nanoTime()
durationMs = (end - start) / 1e6
println(f"Runtime tokenizer: $durationMs%.2f ms")

Runtime tokenizer: 197.32 ms


start: Long = 12150575032839522
tokenize: (text: String, stopwords: Set[String])Set[String]
tokenized: org.apache.spark.rdd.RDD[(String, Set[String])] = MapPartitionsRDD[13] at map at <console>:48
end: Long = 12150575230160339
durationMs: Double = 197.320817


# Counting and cacluating chi²

At this time this is the bottleneck, i.e the counting of the totalDocs etc. need to be adjusted fot the large dataset (>1h)

In [3]:
start = System.nanoTime()

val tokenCatAndDocStats = tokenized.flatMap {
  case (cat, tokens) =>
    val tokenSet = tokens.toSet
    val tokenPairs = tokenSet.map(token => ((token, cat), 1))
    val docMarker = Seq((("!DOC_COUNT", cat), 1))
    tokenPairs.toSeq ++ docMarker
}.reduceByKey(_ + _)

val docCounts = tokenCatAndDocStats
  .filter(_._1._1 == "!DOC_COUNT")
  .map { case ((_, cat), count) => (cat, count) }
  .collectAsMap()

val totalDocs = docCounts.values.sum
val docCountsBroadcast = sc.broadcast(docCounts)
val totalDocsBroadcast = sc.broadcast(totalDocs)

val tokenCatCounts = tokenCatAndDocStats
  .filter(_._1._1 != "!DOC_COUNT")

val tokenTotals = tokenCatCounts
  .map { case ((token, _), count) => (token, count) }
  .reduceByKey(_ + _)
  .collectAsMap()
val tokenTotalsBroadcast = sc.broadcast(tokenTotals)


end = System.nanoTime()
durationMs = (end - start) / 1e6
println(f"Count Runtime: $durationMs%.2f ms")
start = System.nanoTime()

start = System.nanoTime()
// === CHI-SQUARE CALCULATION ===
val N = totalDocsBroadcast.value.toDouble
println("Chi-sq")
val chi2Scores = tokenCatCounts.map {
  case ((token, cat), aCount) =>
    val A = aCount.toDouble
    val T = tokenTotalsBroadcast.value.getOrElse(token, 0).toDouble
    val C = docCountsBroadcast.value.getOrElse(cat, 0).toDouble


    val B = T - A
    val D = N - C - B - A
    val denom = (A + B) * (C + D) * (A + C) * (B + D)
    val chi2 = if (denom == 0) 0.0 else N * math.pow((A * D - B * C), 2) / denom
    (cat, (token, chi2))
}

end = System.nanoTime()
durationMs = (end - start) / 1e6
println(f"CHi Runtime: $durationMs%.2f ms")


Count Runtime: 13123.87 ms
Chi-sq
CHi Runtime: 14.19 ms


start: Long = 12150589488375187
tokenCatAndDocStats: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[15] at reduceByKey at <console>:42
docCounts: scala.collection.Map[String,Int] = Map(Kindle_Store -> 3205, Electronic -> 7825, Automotive -> 1374, Pet_Supplie -> 1235, Clothing_Shoes_and_Jewelry -> 5749, Baby -> 916, Grocery_and_Gourmet_Food -> 1297, Musical_Instrument -> 500, Movies_and_TV -> 4607, Book -> 22507, Tools_and_Home_Improvement -> 1926, Sports_and_Outdoor -> 3269, CDs_and_Vinyl -> 3749, Home_and_Kitche -> 4254, Apps_for_Android -> 2638, Office_Product -> 1243, Digital_Music -> 836, Health_and_Personal_Care -> 2982, Cell_Phones_and_Accessorie -> 3447, Beauty -> 2023, Toys_and_Game -> 2253, Patio_Lawn_and_Garde -> 994)
totalDocs: Int = 78829
docCountsBroadcast:...


In [4]:
//get top k elements
val topTokensPerCategory = chi2Scores
  .groupByKey()
  .mapValues(iter => iter.toSeq.sortBy(-_._2).take(TOP_K))



topTokensPerCategory: org.apache.spark.rdd.RDD[(String, Seq[(String, Double)])] = MapPartitionsRDD[23] at mapValues at <console>:34


Export Output

In [5]:
import scala.reflect.io.File

//format output as: <category> [term:chi2]
val output = topTokensPerCategory.map { case (category, terms) =>
  val formattedTerms = terms.map { case (term, chi2) =>
    s"$term:$chi2"
  }.mkString(" ")
  s"<$category> $formattedTerms"
}

//create and save to output file
val file = File("output_rdd.txt")
file.writeAll(output.collect().mkString("\n"))

// merging vocabluary
val mergedVocab = topTokensPerCategory.flatMap(_._2.map(_._1)).distinct().collect().sorted

// Append the sorted vocab to file
file.appendAll("\n" + mergedVocab.mkString(" "))

end = System.nanoTime()
durationMs = (end - total) / 1e6
println(f"Total Runtime: $durationMs%.2f ms")

25/05/09 23:48:49 WARN DAGScheduler: Broadcasting large task binary with size 1544.4 KiB
25/05/09 23:48:51 WARN DAGScheduler: Broadcasting large task binary with size 1544.8 KiB
Total Runtime: 29367.07 ms


import scala.reflect.io.File
output: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at map at <console>:36
file: scala.reflect.io.File = output_rdd.txt
mergedVocab: Array[String] = Array(acdelco, acne, acoustic, acre, acted, acting, action, actor, actors, actress, acura, adapter, addario, addicted, addicting, addictive, adjustment, adorable, ads, adventure, aftertaste, aired, airsoft, akai, albums, almonds, alpha, alternator, altima, ammo, amp, amplitube, android, animated, animation, anime, answering, antenna, ants, appetite, apple, apps, aquarium, ar, arch, arnley, aroma, arrangements, articulation, artisan, artist, artists, asus, atv, audio, author, authors, avent, avery, awesome, babies, back, backpacking, bag, bait, baking, ball, ballad, ballads, ballasts, balls, band, ban...
