# Job 1

The goal of this job is to understand, year by year, whether responding to reviews more frequently has an impact on the average rating received.

Specifically:

- For each year and business, the average review rating, the response rate, and the average response time are computed;
- From the response rate and the average response time, an additional attribute, "response strategy", is derived that assigns the business in a given year to one of four categories ("Rapid and frequent", "Slow but frequent", "Occasional", or "Rare or none");
- The results are aggregated by "response strategy", year, and state to obtain the average rate and the number of businesses within each category.
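
As a rough illustration of the first two steps, here is a minimal sketch on plain Scala collections rather than Spark. Everything in it is an assumption made for the example: the trimmed-down `ReviewEvent` record, the millisecond timestamps, and above all the thresholds (`FrequentRate`, `OccasionalRate`, `RapidHours`), which are not taken from this job and only show the shape of the classification.

```scala
// Hypothetical, simplified review record: only the fields the metrics need.
final case class ReviewEvent(
  gmapId: String,
  year: Int,
  rating: Double,
  reviewTime: Long,           // assumed to be epoch milliseconds
  responseTime: Option[Long]  // None when the business never replied
)

// Per (business, year) metrics described in the first bullet.
final case class YearlyStats(
  avgRating: Double,
  responseRate: Double,
  avgResponseHours: Option[Double]
)

object ResponseStrategy {
  // Assumed thresholds, chosen only for illustration.
  val FrequentRate = 0.5
  val OccasionalRate = 0.1
  val RapidHours = 48.0

  private val MillisPerHour = 3600000.0

  // Group reviews by (business, year) and compute the three metrics.
  def stats(reviews: Seq[ReviewEvent]): Map[(String, Int), YearlyStats] =
    reviews.groupBy(r => (r.gmapId, r.year)).map { case (key, rs) =>
      val delays = rs.flatMap(r => r.responseTime.map(t => (t - r.reviewTime) / MillisPerHour))
      val avgDelay = if (delays.isEmpty) None else Some(delays.sum / delays.size)
      key -> YearlyStats(
        avgRating = rs.map(_.rating).sum / rs.size,
        responseRate = delays.size.toDouble / rs.size,
        avgResponseHours = avgDelay
      )
    }

  // Map the metrics onto the four categories named above.
  def classify(s: YearlyStats): String = s.avgResponseHours match {
    case Some(h) if s.responseRate >= FrequentRate && h <= RapidHours => "Rapid and frequent"
    case Some(_) if s.responseRate >= FrequentRate                    => "Slow but frequent"
    case Some(_) if s.responseRate >= OccasionalRate                  => "Occasional"
    case _                                                            => "Rare or none"
  }
}
```

Calling `ResponseStrategy.stats(reviews).map { case (k, s) => k -> ResponseStrategy.classify(s) }` would then yield one category per business and year, which is the input the final aggregation step needs.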

In [1]:
import org.apache.spark

Intitializing Scala interpreter ...

Spark Web UI available at http://macbookpro.homenet.telecomitalia.it:4041
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1753776066807)
SparkSession available as 'spark'


import org.apache.spark


### Schema definitions

In [2]:
import org.apache.spark.sql.types._

val reviewSchema = StructType(
  Seq(
    StructField("user_id",  StringType,            nullable = true),
    StructField("name",     StringType,            nullable = true),
    StructField("time",     LongType,              nullable = false),
    StructField("rating",   DoubleType,            nullable = true),
    StructField("text",     StringType,            nullable = true),
    StructField("pics",     ArrayType(StringType), nullable = true),
    StructField("resp",     StructType(
      Seq(
        StructField("time", LongType,              nullable = false),
        StructField("text", StringType,            nullable = true)
      )
    ),                                             nullable = true),
    StructField("gmap_id",  StringType,            nullable = false),
  )
)

case class Response(time: Long, text: Option[String])

case class Review(
  user_id: Option[String],
  name: Option[String],
  time: Long,
  rating: Option[Double],
  text: Option[String],
  pics: Seq[String],
  resp: Option[Response],
  gmap_id: String
)

import org.apache.spark.sql.types._
reviewSchema: org.apache.spark.sql.types.StructType = StructType(StructField(user_id,StringType,true),StructField(name,StringType,true),StructField(time,LongType,false),StructField(rating,DoubleType,true),StructField(text,StringType,true),StructField(pics,ArrayType(StringType,true),true),StructField(resp,StructType(StructField(time,LongType,false),StructField(text,StringType,true)),true),StructField(gmap_id,StringType,false))
defined class Response
defined class Review


In [3]:
val metadataSchema = StructType(
  Seq(
    StructField("name",             StringType,                                 nullable = true),
    StructField("address",          StringType,                                 nullable = true),
    StructField("gmap_id",          StringType,                                 nullable = false),
    StructField("description",      StringType,                                 nullable = true),
    StructField("latitude",         DoubleType,                                 nullable = false),
    StructField("longitude",        DoubleType,                                 nullable = false),
    StructField("category",         ArrayType(StringType),                      nullable = true),
    StructField("avg_rating",       DoubleType,                                 nullable = false),
    StructField("num_of_reviews",   IntegerType,                                nullable = false),
    StructField("price",            StringType,                                 nullable = false),
    StructField("hours",            ArrayType(ArrayType(StringType)),           nullable = true),
    StructField("MISC",             MapType(StringType, ArrayType(StringType)), nullable = false),
    StructField("state",            StringType,                                 nullable = true),
    StructField("relative_results", ArrayType(StringType),                      nullable = true),
    StructField("url",              StringType,                                 nullable = false),
  )
)

case class Metadata(
  name: Option[String],
  address: Option[String],
  gmap_id: String,
  description: Option[String],
  latitude: Double,
  longitude: Double,
  category: Seq[String],
  avg_rating: Double,
  num_of_reviews: Int,
  price: String,
  hours: Seq[Seq[String]],
  MISC: Map[String, Seq[String]],
  state: Option[String],
  relative_results: Seq[String],
  url: String
)

metadataSchema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true),StructField(address,StringType,true),StructField(gmap_id,StringType,false),StructField(description,StringType,true),StructField(latitude,DoubleType,false),StructField(longitude,DoubleType,false),StructField(category,ArrayType(StringType,true),true),StructField(avg_rating,DoubleType,false),StructField(num_of_reviews,IntegerType,false),StructField(price,StringType,false),StructField(hours,ArrayType(ArrayType(StringType,true),true),true),StructField(MISC,MapType(StringType,ArrayType(StringType,true),true),false),StructField(state,StringType,true),StructField(relative_results,ArrayType(StringType,true),true),StructField(url,StringType,false))
defined class Metadata


### Dataset load and parse

In [4]:
import java.nio.file.Paths

// The project root is three directories above the working directory.
val projectDir = Paths.get(System.getProperty("user.dir")).getParent.getParent.getParent
val reviewsPath = s"$projectDir/dataset/sample-reviews.ndjson"
val metadataPath = s"$projectDir/dataset/metadata.ndjson"

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, when}

val spark = SparkSession.builder()
  .appName("NDJSON Reader")
  .master("local[*]") // Needed in local mode
  .getOrCreate()

import spark.implicits._ // Encoders required by .as[Review] and .as[Metadata]

// Null arrays are replaced with empty ones so they fit the non-optional Seq fields.
val reviewsDf = spark.read
  .schema(reviewSchema)
  .json(reviewsPath)
  .withColumn("pics", when(col("pics").isNull, array()).otherwise(col("pics")))
  .as[Review]

val metadataDf = spark.read
  .schema(metadataSchema)
  .json(metadataPath)
  .withColumn("category", when(col("category").isNull, array()).otherwise(col("category")))
  .withColumn("hours", when(col("hours").isNull, array()).otherwise(col("hours")))
  .withColumn("relative_results", when(col("relative_results").isNull, array()).otherwise(col("relative_results")))
  .as[Metadata]

reviewsDf.printSchema()
metadataDf.printSchema()

// Convert the case-class instances to plain tuples for the RDD-based job.
val reviewsRdd = reviewsDf.rdd.map(Review.unapply).map(_.get)
val metaRdd = metadataDf.rdd.map(Metadata.unapply).map(_.get)

root
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- time: long (nullable = true)
 |-- rating: double (nullable = true)
 |-- text: string (nullable = true)
 |-- pics: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- resp: struct (nullable = true)
 |    |-- time: long (nullable = true)
 |    |-- text: string (nullable = true)
 |-- gmap_id: string (nullable = true)

root
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- gmap_id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- avg_rating: double (nullable = true)
 |-- num_of_reviews: integer (nullable = true)
 |-- price: string (nullable = true)
 |-- hours: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (co

import java.nio.file.Paths
projectDir: java.nio.file.Path = /Users/lucatassi/Projects/big-data/big-data-project
reviewsPath: String = /Users/lucatassi/Projects/big-data/big-data-project/dataset/sample-reviews.ndjson
metadataPath: String = /Users/lucatassi/Projects/big-data/big-data-project/dataset/metadata.ndjson
import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@13dc2f6c
reviewsDf: org.apache.spark.sql.Dataset[Review] = [user_id: string, name: string ... 6 more fields]
metadataDf: org.apache.spark.sql.Dataset[Metadata] = [name: string, address: string ... 13 more fields]
reviewsRdd: org.apache.spark.rdd.RDD[(Option[String], Option[String], Long, Option[Double], Option[String], Seq[String], Option[Response], String)] = M...


---

**Metadata**: (name, address, <ins>gmap_id</ins>, description, latitude, longitude, category, avg_rating, num_of_reviews, price, hours, MISC, state, relative_results, url)

**Review**: (user_id, name, time, rating, text, pics, resp, <ins>gmap_id</ins>)
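
The underlined `gmap_id` appears in both records, so it is presumably the key on which reviews are matched with their business (for example to pick up `state`). The sketch below shows that matching on plain Scala collections. `Business`, `Rated`, and `GmapJoin.withState` are hypothetical names introduced only for this example, and the inner-join behavior (dropping reviews with no matching business) is an assumption, not something stated in the notebook.

```scala
// Hypothetical, trimmed-down records: only the fields used by the join.
final case class Business(gmapId: String, state: Option[String])
final case class Rated(gmapId: String, rating: Double)

object GmapJoin {
  // Inner join on gmapId: reviews without a matching business are dropped.
  def withState(reviews: Seq[Rated], businesses: Seq[Business]): Seq[(Rated, Option[String])] = {
    // Build a lookup table keyed by gmapId, then resolve each review against it.
    val stateById: Map[String, Option[String]] =
      businesses.map(b => b.gmapId -> b.state).toMap
    reviews.flatMap(r => stateById.get(r.gmapId).map(state => (r, state)))
  }
}
```

On the tuple RDDs built above, the same idea could be expressed with `keyBy` on the `gmap_id` position of each tuple followed by `join`, at the cost of a shuffle instead of an in-memory lookup.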

---