# Job 2

Job 2 computes, for each year, state and business category, the average rating within each price range, and assigns a suggestion label to it.

In detail:

- for each business, the average review rating is computed, grouping the reviews by year;
- aggregating by business category, state and price range, the mean of those per-business average ratings is computed;
- based on the resulting average rating, an additional attribute, "business suggestion", is derived to label each group as follows:
  - average rating ≤ 2: "Not recommended"
  - 2 < average rating ≤ 3.5: "Discreet"
  - 3.5 < average rating ≤ 4.5: "Recommended"
  - average rating > 4.5: "Highly recommended"
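
As a rough illustration of the three steps above, the following sketch runs them on a handful of in-memory records with plain Scala collections instead of Spark. The data and every name in it (`reviews`, `businesses`, `suggestion`, `perBusiness`, `perGroup`, `labelled`) are hypothetical and only meant to show the shape of the computation; NaN handling is left out.

```scala
// Hypothetical toy reviews: (gmap_id, year, rating)
val reviews = Seq(
  ("b1", 2020, 5.0), ("b1", 2020, 4.0),
  ("b2", 2020, 2.0),
  ("b1", 2021, 3.0)
)

// Hypothetical toy metadata: gmap_id -> (category, state, price)
val businesses = Map(
  "b1" -> (("Cafe", "WA", "**")),
  "b2" -> (("Cafe", "WA", "**"))
)

// Same thresholds as the list above (NaN is not handled in this sketch)
def suggestion(avg: Double): String =
  if (avg <= 2.0) "Not recommended"
  else if (avg <= 3.5) "Discreet"
  else if (avg <= 4.5) "Recommended"
  else "Highly recommended"

// Step 1: average rating per (business, year)
val perBusiness: Map[(String, Int), Double] =
  reviews
    .groupBy { case (id, year, _) => (id, year) }
    .map { case (key, rs) => (key, rs.map(_._3).sum / rs.size) }

// Step 2: mean of the per-business averages per (category, state, price, year)
val perGroup: Map[(String, String, String, Int), Double] =
  perBusiness.toSeq
    .map { case ((id, year), avg) =>
      val (category, state, price) = businesses(id)
      ((category, state, price, year), avg)
    }
    .groupBy(_._1)
    .map { case (key, avgs) => (key, avgs.map(_._2).sum / avgs.size) }

// Step 3: attach the suggestion label
val labelled: Map[(String, String, String, Int), (Double, String)] =
  perGroup.map { case (key, avg) => (key, (avg, suggestion(avg))) }
```

In this toy data the 2020 group averages the two business means 4.5 and 2.0 to 3.25 ("Discreet"), whereas pooling the three 2020 reviews directly would give about 3.67 ("Recommended"). Averaging the averages gives every business the same weight regardless of how many reviews it has.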



### Schema definitions

In [None]:
import org.apache.spark

In [None]:
import org.apache.spark.sql.types._
import java.sql.Timestamp

val reviewSchema = StructType(
  Seq(
    StructField("user_id",  StringType,            nullable = true),
    StructField("name",     StringType,            nullable = true),
    StructField("time",     LongType,              nullable = false),
    StructField("rating",   DoubleType,            nullable = true),
    StructField("text",     StringType,            nullable = true),
    StructField("pics",     ArrayType(StringType), nullable = true),
    StructField("resp",     StructType(
      Seq(
        StructField("time", LongType,              nullable = false),
        StructField("text", StringType,            nullable = true)
      )
    ),                                             nullable = true),
    StructField("gmap_id",  StringType,            nullable = false),
  )
)

case class Response(time: Timestamp, text: Option[String])

case class Review(
  user_id: Option[String],
  name: Option[String],
  time: Timestamp,
  rating: Option[Double],
  text: Option[String],
  pics: Seq[String],
  resp: Option[Response],
  gmap_id: String
)

In [None]:
val metadataSchema = StructType(
  Seq(
    StructField("name",             StringType,                                 nullable = true),
    StructField("address",          StringType,                                 nullable = true),
    StructField("gmap_id",          StringType,                                 nullable = false),
    StructField("description",      StringType,                                 nullable = true),
    StructField("latitude",         DoubleType,                                 nullable = false),
    StructField("longitude",        DoubleType,                                 nullable = false),
    StructField("category",         ArrayType(StringType),                      nullable = true),
    StructField("avg_rating",       DoubleType,                                 nullable = false),
    StructField("num_of_reviews",   IntegerType,                                nullable = false),
    StructField("price",            StringType,                                 nullable = true),
    StructField("hours",            ArrayType(ArrayType(StringType)),           nullable = true),
    StructField("MISC",             MapType(StringType, ArrayType(StringType)), nullable = true),
    StructField("state",            StringType,                                 nullable = true),
    StructField("relative_results", ArrayType(StringType),                      nullable = true),
    StructField("url",              StringType,                                 nullable = false),
  )
)

case class Metadata(
  name: Option[String],
  address: Option[String],
  gmap_id: String,
  description: Option[String],
  latitude: Double,
  longitude: Double,
  category: Seq[String],
  avg_rating: Double,
  num_of_reviews: Int,
  price: Option[String],
  hours: Seq[Seq[String]],
  MISC: Map[String, Seq[String]],
  state: Option[String],
  relative_results: Seq[String],
  url: String
)

### Dataset load and parse

In [None]:
import java.nio.file.Paths

val projectDir: String = Paths.get(System.getProperty("user.dir")).getParent.getParent.getParent.toString
val reviewsPath = s"$projectDir/dataset/sample-reviews.ndjson"
val metadataPath = s"$projectDir/dataset/metadata.ndjson"

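import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._ // when, col, array, struct, lit, typedLit, from_unixtime

val spark = SparkSession.builder()
  .appName("NDJSON Reader")
  .master("local[*]") // Needed in local mode
  .getOrCreate()

import spark.implicits._ // encoders for .as[...] and .toDF(...)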

val reviewsDf = spark.read
  .schema(reviewSchema)
  .json(reviewsPath)
  .withColumn("pics", when (col("pics") isNull, array()) otherwise col("pics"))
  .withColumn("time", from_unixtime(col("time") / 1000).cast("timestamp"))
  .withColumn("resp", 
    when (
      col("resp") isNotNull, 
      struct(
        from_unixtime(col("resp.time") / 1000).cast("timestamp").alias("time"),
        col("resp.text").cast(StringType).alias("text")
      )
    ) otherwise lit(null)
  )
  .as[Review]

val metadataDf = spark.read
  .schema(metadataSchema)
  .json(metadataPath)
  .withColumn("category", when (col("category") isNull, array()) otherwise col("category"))
  .withColumn("hours", when (col("hours") isNull, array()) otherwise col("hours"))
  .withColumn("relative_results", when (col("relative_results") isNull, array()) otherwise col("relative_results"))
  .withColumn("MISC",
    when (
      col("MISC") isNotNull,
      col("MISC").cast(MapType(StringType, ArrayType(StringType)))
    ) otherwise typedLit(Map.empty[String, Seq[String]])
  )
  .as[Metadata]

reviewsDf.printSchema()
metadataDf.printSchema()

// Unfortunately, it seems that Spark does not support case classes in RDDs here: it throws ArrayStoreException
// when trying to collect the RDD (see https://github.com/adtech-labs/spylon-kernel/issues/40),
// so the rows are converted to plain tuples.
val reviewsRdd = reviewsDf.rdd
  .map(Review.unapply)
  .map(_.get)
  .map { case review @ (_, _, _, _, _, _, resp, _) => review.copy(_7 = resp.map(Response.unapply(_).get)) }
val metaRdd = metadataDf.rdd.map(Metadata.unapply).map(_.get)

In [None]:
/** This regex captures the state abbreviation between a comma and the ZIP code. */
private val StateRegex = """,\s*([A-Z]{2})\s+\d{5}""".r

/** The map of states that are considered for the analysis. */
val consideredStates = Map(
  "Alabama" -> "AL",
  "Mississippi" -> "MS",
  "New Hampshire" -> "NH",
  "New Mexico" -> "NM",
  "Washington" -> "WA",
)

/** Regex to match state names in the address. */
val StateNameRegex = s"""\\b(${consideredStates.keys.mkString("|")})\\b""".r

/** Regex to match state abbreviations in the address. */
val StateAbbrevRegex = s"""\\b(${consideredStates.values.mkString("|")})\\b""".r

/**
 * Extracts the state from the given address.
 * @param address
 *   the optional address string
 * @return
 *   the state abbreviation or "Unknown" if no valid state is found
 * @see
 *   [[consideredStates]]
 */
def toState(address: Option[String]): String = address
  .flatMap { addr =>
    StateRegex
      .findFirstMatchIn(addr)
      .map(_.group(1))
      .orElse(StateNameRegex.findFirstMatchIn(addr).map(stateName => consideredStates(stateName.group(1))))
      .orElse(StateAbbrevRegex.findFirstMatchIn(addr).map(_.group(1)))
  }
  .filter(consideredStates.values.toSeq.contains)
  .getOrElse("Unknown")

In [None]:
/** Replaces every character of the price string with `symbol`, e.g. "$$" becomes "**" when the symbol is "*". */
def withSymbol(price: Option[String], symbol: String): Option[String] =
  price.map(s => List.fill(s.length)(symbol).mkString)

### Job2 - Computation

---

**Metadata**: (name, address, <ins>gmap_id</ins>, description, latitude, longitude, category, avg_rating, num_of_reviews, price, hours, MISC, state, relative_results, url)

**Review**: (user_id, name, time, rating, text, pics, resp, <ins>gmap_id</ins>)

---

In [None]:
// Maps an average rating to its suggestion label (NaN falls through to "Undefined")
def ratingToSuggestion(rating: Double): String =
  rating match {
    case r if r <= 2.0 => "Not recommended"
    case r if r > 2.0 && r <= 3.5 => "Discreet"
    case r if r > 3.5 && r <= 4.5 => "Recommended"
    case r if r > 4.5 => "Highly recommended"
    case _ => "Undefined"
  }

In [None]:
for ((k,v) <- sc.getPersistentRDDs) {
  v.unpersist()
}

Input dataset sizes:

- Metadata: 292.5 MiB
- Reviews: 9.8 GiB

The basic (non-optimized) job has 5 stages (cf. `Job2 Basic`):

| Stage | Input | Output | Shuffle read | Shuffle write | Duration | Partitions |
|-------|-------|--------|--------------|---------------|----------|------------|
| 0: load metadata | 292.5 MiB | | | 16.1 MiB | 48 s | 3 |
| 1: load reviews | 9.8 GiB | | | 206.9 MiB | 1.6 min | 79 |
| 2: join datasets | | | 223.0 MiB | 107.4 MiB | 38 s | 79 |
| 3: group by key | | | 107.4 MiB | 21.6 MiB | 11 s | 79 |
| 4: save | | 2.1 MiB | 21.6 MiB | | 8 s | 1 |

**Total time: 2.5 minutes**

The optimized job also has 5 stages (cf. `Job2 Optimized`). The following optimizations were applied:

- both RDDs are repartitioned with the same partitioner before the join by key, to reduce shuffle writes. The number of partitions was calculated from the cluster configuration (4 cores per executor);
- the join is performed as late as possible, after filtering and aggregation;
- `reduceByKey` is used instead of `groupByKey`, since it is less expensive (it combines values on the map side before shuffling).

| Stage | Input | Output | Shuffle read | Shuffle write | Duration | Partitions |
|-------|-------|--------|--------------|---------------|----------|------------|
| 0: load metadata | 292.5 MiB | | | 3.6 MiB | 44 s | 3 |
| 1: load reviews | 9.8 GiB | | | **54.8 MiB** | 1.5 min | 79 |
| 2: reduce by key | | | **54.8 MiB** | **50.1 MiB** | 4 s | 79 |
| 3: join datasets | | | **53.7 MiB** | **6.1 MiB** | 3 s | **18** |
| 4: save | | **1738.2 KiB** | **6.1 MiB** | | 3 s | 1 |

**Total time: 1.8 minutes**

After optimization, the execution time drops from 2.5 min to 1.8 min, and the amount of shuffled data decreases as well.

**The optimized job takes about 72% of the original time, i.e. a speedup of about 1.4× (roughly 28% less time).**
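
The figures can be cross-checked from the two totals and the shuffle-write columns reported in the tables above. The snippet below is only that arithmetic; the value names are hypothetical, and the numbers are copied from the tables rather than measured again.

```scala
// Total run times reported above, in minutes
val basicMinutes = 2.5
val optimizedMinutes = 1.8

// Ratio of old to new run time (about 1.39)
val speedup = basicMinutes / optimizedMinutes

// Fraction of the run time that was removed (about 0.28)
val timeSaved = 1.0 - optimizedMinutes / basicMinutes

// Shuffle write per stage in MiB, copied from the two tables
val basicShuffleWrite = Seq(16.1, 206.9, 107.4, 21.6)
val optimizedShuffleWrite = Seq(3.6, 54.8, 50.1, 6.1)

// Fraction of shuffle-written data that was removed (about 0.67)
val shuffleSaved = 1.0 - optimizedShuffleWrite.sum / basicShuffleWrite.sum
```

Expressed as a share of the original run time, the optimized job needs 1.8 / 2.5 = 72% of it, which is a different quantity from the speedup ratio.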


In [None]:
val partitions = 18
val partitioner = new org.apache.spark.HashPartitioner(partitions)
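
To see why sharing one partitioner helps the join, here is a small model in plain Scala (no Spark). `partitionOf` reproduces the rule `HashPartitioner` applies to non-null keys, a non-negative modulo of the key's `hashCode`; the keys and every other name are hypothetical stand-ins for `gmap_id` values.

```scala
// Non-negative modulo of hashCode, as HashPartitioner does for non-null keys
def partitionOf(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val numPartitions = 18 // same count as in the notebook

// Hypothetical keys standing in for gmap_id values on each side of the join
val reviewKeys = Seq("b1", "b2", "b3", "b1")
val metadataKeys = Seq("b3", "b1", "b2")

// Bucket each dataset independently, but with the same function
val reviewBuckets: Map[Int, Seq[String]] =
  reviewKeys.groupBy(k => partitionOf(k, numPartitions))
val metadataBuckets: Map[Int, Seq[String]] =
  metadataKeys.groupBy(k => partitionOf(k, numPartitions))

// Every key present on both sides lands at the same partition index,
// so a join only needs to pair bucket i of one side with bucket i of the other.
val coLocated: Boolean = metadataKeys.forall { k =>
  val p = partitionOf(k, numPartitions)
  reviewBuckets.getOrElse(p, Seq.empty).contains(k) &&
  metadataBuckets.getOrElse(p, Seq.empty).contains(k)
}
```

Because the partition index depends only on the key and the partition count, two RDDs partitioned this way can be joined without moving records between partitions again.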

In [None]:
val businessAvgRating = reviewsRdd
  .filter(_._4.isDefined)
  .map{ case r => 
    val rating = r._4.get
    val gmap_id = r._8
    val year = r._3.toLocalDateTime.getYear
    (gmap_id, year) -> (1, rating)
  }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // count reviews and sum rating value
  .map{ case ((gmap_id, year), c) => gmap_id -> (year, c._2/c._1) }
  .partitionBy(partitioner)
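
The `(count, sum)` pairs used above are what lets `reduceByKey` shrink the data before the shuffle. The following plain-Scala model (no Spark; all names and data are hypothetical) compares how many records would leave two map-side partitions with and without that per-partition combining step.

```scala
// Hypothetical input, already split into two map-side partitions: (key, rating)
val mapPartitions: Seq[Seq[(String, Double)]] = Seq(
  Seq(("b1", 5.0), ("b1", 4.0), ("b2", 2.0)),
  Seq(("b1", 3.0), ("b2", 4.0), ("b2", 3.0))
)

// groupByKey-like behaviour: every record crosses the shuffle unchanged
val shuffledWithoutCombine: Int = mapPartitions.map(_.size).sum

// reduceByKey-like behaviour: each partition first folds its records
// into a single (count, sum) pair per key
val combined: Seq[Map[String, (Int, Double)]] = mapPartitions.map { part =>
  part.groupBy(_._1).map { case (k, rs) => (k, (rs.size, rs.map(_._2).sum)) }
}
val shuffledWithCombine: Int = combined.map(_.size).sum

// Reduce side: merge the partial (count, sum) pairs and derive the mean per key
val means: Map[String, Double] = combined.flatten
  .groupBy(_._1)
  .map { case (k, parts) =>
    val (n, total) = parts.map(_._2).foldLeft((0, 0.0)) {
      case ((c, s), (c2, s2)) => (c + c2, s + s2)
    }
    (k, total / n)
  }
```

With this toy input, six records shrink to four partial pairs before the shuffle. The gap grows with the number of reviews per key in each partition.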
  

In [None]:
val results = metaRdd
  .filter(r => r._10.isDefined) // keep only businesses with a price
  .flatMap(r =>
    r._7.map(category => r._3 -> (category, toState(r._2), withSymbol(r._10, "*")))
  ) // (gmap_id, (category, state, price))
  .partitionBy(partitioner)
  .join(businessAvgRating) // both sides use the same partitioner
  .map { case (_, ((category, state, price), (year, avgRate))) =>
    (category, state, price, year) -> (avgRate, 1)
  }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // sum the averages and count the businesses
  .map { case (key, r) =>
    val avgRate = r._1 / r._2
    (key, f"${avgRate}%.2f", ratingToSuggestion(avgRate))
  }

### Save output result

In [None]:
val outputDirPath = s"$projectDir/output"
val outputPath = s"$outputDirPath/job2-output"

In [None]:
import org.apache.spark.sql.SaveMode
results
  .map { case ((category, state, price, year), avgRate, suggestion) =>
    (category, state, price, year, avgRate, suggestion)
  }
  .coalesce(1)
  .toDF("category", "state", "price", "year", "avg_rating", "business suggestion")
  .write
  .format("csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite)
  .save(s"file://$outputPath")