# Job 2

Job 2 computes, for each year, state, and business category, the average rating in each price range and assigns a verdict to it.

In detail:

- for each business, the average review rating is computed, grouping its reviews by year;
- aggregating by business category, state, and price range, the mean of these per-business average ratings is computed;
- based on that average rating, an additional attribute, "business suggestion", is derived to give a verdict on the business categories, as follows:
  - average rating < 2: "Not recommended"
  - average rating 2–3.5: "Discreet"
  - average rating 3.5–4.5: "Recommended"
  - average rating > 4.5: "Highly recommended"
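
The steps above can be sketched on plain Scala collections, without Spark. This is only an illustration under stated assumptions: `RatedReview`, `GroupKey`, and `Job2Sketch` are hypothetical names, each row is assumed to be already joined with its business metadata and to carry a single category, and the boundary values (2, 3.5, 4.5) are assigned by assumption, since the ranges above do not say which band they belong to.

```scala
// Hypothetical, simplified input: one row per review, already joined with the
// business metadata and carrying a single category per row.
final case class RatedReview(
  gmapId: String, state: String, category: String,
  price: String, year: Int, rating: Double
)

// Hypothetical grouping key for the final aggregation.
final case class GroupKey(year: Int, state: String, category: String, price: String)

object Job2Sketch {
  // Maps an average rating to the "business suggestion" label.
  // Assumption: 2 and 3.5 fall into the higher band, 4.5 stays "Recommended".
  def businessSuggestion(avg: Double): String =
    if (avg < 2.0) "Not recommended"
    else if (avg < 3.5) "Discreet"
    else if (avg <= 4.5) "Recommended"
    else "Highly recommended"

  private def mean(xs: Iterable[Double]): Double = xs.sum / xs.size

  def job2(reviews: Seq[RatedReview]): Map[GroupKey, (Double, String)] = {
    // Step 1: average rating per business and year.
    // The grouped map is turned into a Seq so that businesses sharing the
    // same GroupKey are all kept for step 2.
    val perBusiness: Seq[(GroupKey, Double)] = reviews
      .groupBy(r => (r.gmapId, GroupKey(r.year, r.state, r.category, r.price)))
      .toSeq
      .map { case ((_, key), rs) => (key, mean(rs.map(_.rating))) }

    // Step 2: mean of the per-business averages, plus the derived label.
    perBusiness
      .groupBy(_._1)
      .map { case (key, rows) =>
        val avg = mean(rows.map(_._2))
        (key, (avg, businessSuggestion(avg)))
      }
  }
}
```

Because step 2 averages the per-business averages, every business weighs the same regardless of how many reviews it has. For example, one business with ratings 5 and 4 (average 4.5) and another with a single rating of 3 in the same group yield 3.75 ("Recommended"), not the plain review mean of 4.0.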


In [None]:
import org.apache.spark

### Schema and case class definitions for the datasets

In [None]:
import org.apache.spark.sql.types._

val reviewSchema = StructType(
  Seq(
    StructField("user_id", StringType, nullable = false),
    StructField("name", StringType, nullable = true),
    StructField("time", LongType, nullable = false),
    StructField("rating", DoubleType, nullable = false),
    StructField("text", StringType, nullable = true),
    StructField("pics", ArrayType(StringType), nullable = true),
    StructField("resp", StructType(Seq(
      StructField("time", LongType, nullable = false),
      StructField("text", StringType, nullable = true)
    ))),
    StructField("gmap_id", StringType, nullable = false),
  )
)

case class Resp(
  time: Long,
  text: Option[String]
)

case class Review(
  user_id: String,
  name: String,
  time: Long,
  rating: Option[Double],
  text: Option[String],
  pics: Option[Seq[String]],
  resp: Option[Resp],
  gmap_id: String
)
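
Job 2 groups reviews by year, while `Review.time` is a `Long`. A minimal sketch of the conversion, assuming `time` is a Unix timestamp in milliseconds and that years are taken in UTC (neither is stated in this notebook); `ReviewTime.yearOf` is a hypothetical helper name:

```scala
import java.time.{Instant, ZoneOffset}

object ReviewTime {
  // Assumption: `timeMillis` is a Unix timestamp in milliseconds, read in UTC.
  def yearOf(timeMillis: Long): Int =
    Instant.ofEpochMilli(timeMillis).atZone(ZoneOffset.UTC).getYear
}
```

Under that assumption, `ReviewTime.yearOf(1609459200000L)` is 2021, because that timestamp is 2021-01-01T00:00:00Z. If the timestamps turn out to be in seconds, `Instant.ofEpochSecond` would be the call to use instead.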

In [None]:
val metadataSchema = StructType(
  Seq(
    StructField("name", StringType, false),
    StructField("address", StringType, false),
    StructField("gmap_id", StringType, false),
    StructField("description", StringType, true),
    StructField("latitude", DoubleType, false),
    StructField("longitude", DoubleType, false),
    StructField("category", ArrayType(StringType), false),
    StructField("avg_rating", DoubleType, false),
    StructField("num_of_reviews", IntegerType, false),
    StructField("price", StringType, false),
    StructField("hours", ArrayType(ArrayType(StringType)), true),
    StructField("MISC", MapType(StringType, ArrayType(StringType)), false),
    StructField("state", StringType, false),
    StructField("relative_results", ArrayType(StringType), false),
    StructField("url", StringType, false)
  )
)

case class Metadata(
  name: String,
  address: String,
  gmap_id: String,
  description: Option[String],
  latitude: Double,
  longitude: Double,
  category: Seq[String],
  avg_rating: Double,
  num_of_reviews: Int,
  price: String,
  hours: Option[Seq[Seq[String]]],
  MISC: Map[String, Seq[String]],
  state: String,
  relative_results: Seq[String],
  url: String
)


### Loading the datasets

In [None]:
val reviewsPath = "/Users/teo/Universita/Magistrale/BIG_DATA/bd-project25/big-data-dataset-exam/sample-reviews.ndjson"
val metadataPath = "/Users/teo/Universita/Magistrale/BIG_DATA/bd-project25/big-data-dataset-exam/metadata.ndjson"

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NDJSON Reader")
  .master("local[*]")  // Needed in local mode
  .getOrCreate()

val reviewsDf = spark.read
  .schema(reviewSchema)
  .json(reviewsPath)

val metadataDf = spark.read
  .schema(metadataSchema)
  .json(metadataPath)

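// The encoders required by .as[...] come from the session's implicits.
import spark.implicits._

// Typed Datasets (Dataset[Review] and Dataset[Metadata]), despite the Rdd suffix.
val reviewsRdd = reviewsDf.as[Review]
val metaRdd = metadataDf.as[Metadata]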

reviewsDf.printSchema()
metaRdd.printSchema()


In [None]:
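reviewsRdd
  // name is nullable in reviewSchema, so guard against null before calling contains
  .filter(r => r.name != null && r.name.contains("Hossein"))
  .collect()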


In [None]:
metaRdd
  .filter(_.gmap_id == "0x80dcdbd91ac0ff97:0x40cb80cf24283e4d")
  .collect()