# Project objective
After exploring the data, we want to look more closely at the correlation found between distance and price; this defines the objective of the project. The goal is to check whether prices are seasonal, that is, whether they are much higher in some months than in others, and whether the month-to-month price variation differs across distance bands.

## Description of the proposed job
Since only a single *.csv* file is available, the job is built around a *self-join* pattern:

- **First aggregation**: group by each combination of departure and destination airport (*startingAirport* and *destinationAirport*) to obtain the average travel distance (*totalTravelDistance*). From the average distance, derive a new column giving the distance band of the flight (short, medium, long);

- **Join**: join the original dataset with the result of the first aggregation;

- **Second aggregation**: group by distance band and month (extracted from *flightDate*) to obtain the average price for each combination.

### Loading the Spark library

First, the Spark library is imported to start a `spark-shell`; the output then shows the link for accessing the Spark web UI.

In [1]:
import org.apache.spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.9:4040
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1737667953668)
SparkSession available as 'spark'


import org.apache.spark


In [None]:
// DO NOT EXECUTE - this is needed just to avoid showing errors in the following cells
val sc = spark.SparkContext.getOrCreate()

### Parser for the .csv file

The cell below implements a `case class Flight` whose parameters are all the columns(*) of the .csv file described in the notebook [data-exploration.ipynb](./data-exploration.ipynb), and a `FlightParser` that extracts the information needed to populate the Spark RDD.

(*) only some of the columns are used to solve the proposed `job`.

In [2]:
import java.text.SimpleDateFormat
import java.util.Calendar

/**
 * Flight case class.
 */
case class Flight(
    legId: String,
    searchMonth: Int,
    flightMonth: Int,
    startingAirport: String,
    destinationAirport: String,
    fareBasisCode: String,
    travelDuration: String,
    elapsedDays: Int,
    isBasicEconomy: Boolean,
    isRefundable: Boolean,
    isNonStop: Boolean,
    baseFare: Double,
    totalFare: Double,
    seatsRemaining: Int,
    totalTravelDistance: Double,
    segmentsDepartureTimeEpochSeconds: String,
    segmentsDepartureTimeRaw: String,
    segmentsArrivalTimeEpochSeconds: String,
    segmentsArrivalTimeRaw: String,
    segmentsArrivalAirportCode: String,
    segmentsDepartureAirportCode: String,
    segmentsAirlineName: String,
    segmentsAirlineCode: String,
    segmentsEquipmentDescription: String,
    segmentsDurationInSeconds: String,
    segmentsDistance: String,
    segmentsCabinCode: String
) extends Serializable

/**
 * Flight parser.
 */
object FlightParser extends Serializable {

  val comma = ","

  /**
   * Convert from date (String) to month (Int).
   * @param dateString the date
   * @return the month
   */
  def monthFromDate(dateString: String): Int = {
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val date = sdf.parse(dateString.trim)
    val cal = Calendar.getInstance()
    cal.setTime(date)
    cal.get(Calendar.MONTH) + 1
  }

  /**
   * Function to parse flights records.
   * @param line that has to be parsed
   * @return Flight object, None in case of input errors
   */
  def parseFlightLine(line: String): Option[Flight] = {
    try {
      val columns = line.split(comma)
      Some(
        Flight(
          legId = columns(0).trim,
          searchMonth = monthFromDate(columns(1)),
          flightMonth = monthFromDate(columns(2)),
          startingAirport = columns(3).trim,
          destinationAirport = columns(4).trim,
          fareBasisCode = columns(5).trim,
          travelDuration = columns(6).trim,
          elapsedDays = columns(7).trim.toInt,
          isBasicEconomy = columns(8).trim.toBoolean,
          isRefundable = columns(9).trim.toBoolean,
          isNonStop = columns(10).trim.toBoolean,
          baseFare = columns(11).trim.toDouble,
          totalFare = columns(12).trim.toDouble,
          seatsRemaining = columns(13).trim.toInt,
          totalTravelDistance = columns(14).trim.toDouble,
          segmentsDepartureTimeEpochSeconds = columns(15).trim,
          segmentsDepartureTimeRaw = columns(16).trim,
          segmentsArrivalTimeEpochSeconds = columns(17).trim,
          segmentsArrivalTimeRaw = columns(18).trim,
          segmentsArrivalAirportCode = columns(19).trim,
          segmentsDepartureAirportCode = columns(20).trim,
          segmentsAirlineName = columns(21).trim,
          segmentsAirlineCode = columns(22).trim,
          segmentsEquipmentDescription = columns(23).trim,
          segmentsDurationInSeconds = columns(24).trim,
          segmentsDistance = columns(25).trim,
          segmentsCabinCode = columns(26).trim
        )
      )
    } catch {
      case e: Exception =>
        // println(s"Error while parsing line '$line': ${e.getMessage}")
        None
    }
  }
}

import java.text.SimpleDateFormat
import java.util.Calendar
defined class Flight
defined object FlightParser
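
As a side note, the month extraction done by `monthFromDate` can also be sketched with `java.time`, which parses ISO `yyyy-MM-dd` dates directly. The helper name `monthFromIsoDate` is hypothetical and not part of the notebook; the sketch returns an `Option` so that a malformed date becomes `None` instead of an exception.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical helper (not in the notebook): month (1-12) of an ISO
// yyyy-MM-dd date, or None when the string cannot be parsed.
def monthFromIsoDate(dateString: String): Option[Int] =
  Try(LocalDate.parse(dateString.trim).getMonthValue).toOption

println(monthFromIsoDate("2022-04-17"))   // Some(4)
println(monthFromIsoDate(" 2022-11-02 ")) // Some(11)
println(monthFromIsoDate("not-a-date"))   // None
```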


### Loading the data

The following cell loads the file *itineraries-sample\<N\>.csv*, where N is the percentage of data sampled from the original 31.09 GB file.

The available files can be downloaded from the [OneDrive](https://liveunibo-my.sharepoint.com/:f:/g/personal/giulia_nardicchia_studio_unibo_it/Ei2686kRO3JFrY-4LnImGpwBtge9FRErDnIgvT2h2QB-Pg?e=VrufWl) folder, with sampling percentages `02`, `16` and `33`.

In [3]:
val datasetsPath = "../../../datasets/big/"
val fileName = "itineraries-sample02.csv"

val rawData = sc.textFile(datasetsPath + fileName)

datasetsPath: String = ../../../datasets/big/
fileName: String = itineraries-sample02.csv
rawData: org.apache.spark.rdd.RDD[String] = ../../../datasets/big/itineraries-sample02.csv MapPartitionsRDD[1] at textFile at <console>:30


The RDD of raw lines (*rawData*) is transformed into an RDD of `Flight` objects. `FlightParser.parseFlightLine` parses each line and returns an `Option[Flight]`; `flatMap` flattens the results, automatically discarding the invalid lines (those parsed as `None`).

In [4]:
val rddFlights = rawData.flatMap(FlightParser.parseFlightLine)

rddFlights: org.apache.spark.rdd.RDD[Flight] = MapPartitionsRDD[2] at flatMap at <console>:28
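
The way `flatMap` drops failed parses can be seen on a plain Scala list, without Spark. This is only a minimal sketch: `parseIntLine` is a hypothetical toy parser standing in for `FlightParser.parseFlightLine`.

```scala
import scala.util.Try

// Hypothetical toy parser: Some(n) for a valid integer line, None otherwise.
def parseIntLine(line: String): Option[Int] =
  Try(line.trim.toInt).toOption

val lines = List("1", "two", " 3 ", "")

// flatMap unwraps every Some and silently skips every None.
val parsed = lines.flatMap(parseIntLine)
println(parsed) // List(1, 3)
```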


To check that parsing raised no problems, the next cell runs an action. The `count()` function returns the number of valid rows.

In [5]:
rddFlights.count()

res0: Long = 1520662


### First aggregation

First, `map` drops all the columns not needed by the proposed job and reshapes the data into (key, value) pairs. The data is then aggregated by each combination of departure and destination airport (*startingAirport* and *destinationAirport*) to obtain the average travel distance (*totalTravelDistance*).

In [6]:
val avgDistances = rddFlights
  .map(flight => ((flight.startingAirport, flight.destinationAirport), flight.totalTravelDistance))
  .aggregateByKey((0.0, 0))(
    (acc, travelDistance) => (acc._1 + travelDistance, acc._2 + 1),
    (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
  )
  .mapValues { case (sumDistance, count) => sumDistance / count }

avgDistances: org.apache.spark.rdd.RDD[((String, String), Double)] = MapPartitionsRDD[5] at mapValues at <console>:33
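
The two functions passed to `aggregateByKey` can be tried out on plain Scala lists. The sketch below uses made-up distances and two lists standing in for two partitions of the same route; it shows why the accumulator carries a (sum, count) pair rather than a running average.

```scala
// seqOp: fold one distance into a (sum, count) accumulator within a partition.
def seqOp(acc: (Double, Int), distance: Double): (Double, Int) =
  (acc._1 + distance, acc._2 + 1)

// combOp: merge the partial accumulators of two partitions.
def combOp(a: (Double, Int), b: (Double, Int)): (Double, Int) =
  (a._1 + b._1, a._2 + b._2)

// Made-up distances for one route, split over two pretend partitions.
val partition1 = List(400.0, 420.0)
val partition2 = List(410.0)

val acc1 = partition1.foldLeft((0.0, 0))(seqOp) // (820.0, 2)
val acc2 = partition2.foldLeft((0.0, 0))(seqOp) // (410.0, 1)

val (sum, count) = combOp(acc1, acc2)
val avg = sum / count
println(avg) // 410.0
```

Keeping the sum and the count separate until the final `mapValues` gives the correct mean even when partitions hold different numbers of rows.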


In [7]:
avgDistances.collect()

res1: Array[((String, String), Double)] = Array(((BOS,LGA),406.6977958842578), ((IAD,ORD),841.525204359673), ((EWR,PHL),1039.5994575045208), ((DTW,LGA),694.1526669795088), ((OAK,DFW),2123.889001864091), ((ATL,DEN),1513.575124745888), ((IAD,CLT),587.9657102869139), ((DEN,LGA),1804.57909562639), ((DTW,EWR),736.2682451253482), ((LGA,DFW),1455.155069582505), ((OAK,JFK),3126.575707702436), ((DEN,DTW),1578.6580700623254), ((JFK,IAD),703.4175354183374), ((ORD,MIA),1521.1334047682828), ((IAD,DFW),1363.3681891954557), ((DEN,PHL),1852.40172900494), ((OAK,DEN),1419.471807628524), ((BOS,JFK),261.94046744083494), ((SFO,JFK),2652.746982695943), ((DTW,MIA),1462.8436163714111), ((PHL,OAK),2949.3589503280223), ((CLT,LGA),665.3443708609271), ((DTW,JFK),842.9147381242387), ((ATL,IAD),647.0074156470153), (...


From the average distance, a new column is derived giving the distance band of the flight (short, medium, long).

Since *hard-coded* numeric thresholds are a *bad practice*, the width of the distance bands is computed dynamically from the minimum, the maximum and the number of classes.

In [8]:
val (minDistance, maxDistance) = avgDistances
    .aggregate((Double.MaxValue, Double.MinValue))(
        (acc, value) => (math.min(acc._1, value._2), math.max(acc._2, value._2)),
        (acc1, acc2) => (math.min(acc1._1, acc2._1), math.max(acc1._2, acc2._2))
    )

minDistance: Double = 185.0
maxDistance: Double = 3366.947416137806


In [9]:
val numClasses = 3
val range = (maxDistance - minDistance) / numClasses

numClasses: Int = 3
range: Double = 1060.649138712602


To obtain equal-width bands, the following limits are used:
- **Short**: the average distance is below *minimum + range*;

- **Medium**: the average distance lies in *[minimum + range; minimum + 2 * range)*;

- **Long**: the average distance is greater than or equal to *minimum + (number of classes - 1) * range*.
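
As a worked example, with the values printed above (minimum 185.0, maximum about 3366.947, three classes) and rounding to three decimals, the limits come out as:

$$
\begin{aligned}
\text{range} &= \frac{3366.947 - 185.0}{3} \approx 1060.649 \\
\text{short} &: \; d < 185.0 + 1060.649 = 1245.649 \\
\text{medium} &: \; 1245.649 \le d < 185.0 + 2 \cdot 1060.649 = 2306.298 \\
\text{long} &: \; d \ge 2306.298
\end{aligned}
$$

For instance, the OAK-DFW route, with an average distance of about 2123.889, falls in the medium band.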

In [10]:
val classifiedDistances = avgDistances.mapValues { avgDistance =>
  // Half-open bands: [min, min + range) short, [min + range, min + 2 * range) medium, the rest long
  if (avgDistance < minDistance + range) "short"
  else if (avgDistance < minDistance + (numClasses - 1) * range) "medium"
  else "long"
}

classifiedDistances: org.apache.spark.rdd.RDD[((String, String), String)] = MapPartitionsRDD[6] at mapValues at <console>:30


In [11]:
classifiedDistances.collect()

res2: Array[((String, String), String)] = Array(((BOS,LGA),short), ((IAD,ORD),short), ((EWR,PHL),short), ((DTW,LGA),short), ((OAK,DFW),medium), ((ATL,DEN),medium), ((IAD,CLT),short), ((DEN,LGA),medium), ((DTW,EWR),short), ((LGA,DFW),medium), ((OAK,JFK),long), ((DEN,DTW),medium), ((JFK,IAD),short), ((ORD,MIA),medium), ((IAD,DFW),medium), ((DEN,PHL),medium), ((OAK,DEN),medium), ((BOS,JFK),short), ((SFO,JFK),long), ((DTW,MIA),medium), ((PHL,OAK),long), ((CLT,LGA),short), ((DTW,JFK),short), ((ATL,IAD),short), ((ATL,MIA),short), ((DTW,IAD),short), ((OAK,LGA),long), ((SFO,EWR),long), ((IAD,SFO),long), ((CLT,SFO),long), ((BOS,ATL),short), ((LAX,DEN),short), ((DEN,JFK),medium), ((BOS,LAX),long), ((SFO,IAD),long), ((DTW,DEN),medium), ((ORD,LGA),short), ((ATL,OAK),long), ((MIA,CLT),short), ((EWR,...


### Join + second aggregation

The original dataset is joined with the result obtained; aggregating by distance band and month (extracted from *flightDate*) then yields the average price for each combination.

In [12]:
val resultJob = rddFlights
  .map(flight => ((flight.startingAirport, flight.destinationAirport), (flight.flightMonth, flight.totalFare)))
  .join(classifiedDistances)
  .map {
    case (_, ((flightMonth, totalFare), classification)) => ((flightMonth, classification), (totalFare, 1))
  }
  .reduceByKey((acc, totalFare) => (acc._1 + totalFare._1, acc._2 + totalFare._2))
  .map {
    case ((flightMonth, classification), (sumTotalFare, count)) => (flightMonth, classification, sumTotalFare / count)
  }

resultJob: org.apache.spark.rdd.RDD[(Int, String, Double)] = MapPartitionsRDD[13] at map at <console>:35
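
The shape of the join and of the second aggregation can be followed on a toy example with plain Scala collections. All routes, bands and fares below are made up for illustration; `groupBy` followed by `reduce` stands in for Spark's `reduceByKey`.

```scala
// Toy result of the first aggregation: route -> distance band.
val bands = Map(("BOS", "LGA") -> "short", ("SFO", "JFK") -> "long")

// Toy flights: (route, (month, totalFare)).
val flights = List(
  (("BOS", "LGA"), (4, 100.0)),
  (("BOS", "LGA"), (4, 200.0)),
  (("SFO", "JFK"), (4, 500.0)),
  (("SFO", "JFK"), (5, 700.0))
)

// Join on the route, then re-key by (month, band) with a (fare, 1) value.
val keyed = flights.flatMap { case (route, (month, fare)) =>
  bands.get(route).map(band => ((month, band), (fare, 1)))
}

// Sum fares and counts per key, then divide to get the average fare.
val avgFare: Map[(Int, String), Double] = keyed
  .groupBy(_._1)
  .map { case (key, rows) =>
    val (sum, count) = rows.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    key -> sum / count
  }

println(avgFare((4, "short"))) // 150.0
println(avgFare((4, "long")))  // 500.0
println(avgFare((5, "long")))  // 700.0
```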


In [13]:
resultJob.collect()

res3: Array[(Int, String, Double)] = Array((9,long,405.4846256093827), (11,short,222.00556413449337), (11,long,382.2554902106178), (5,long,533.1691284153799), (7,short,290.6924974546775), (9,short,256.1263844227748), (4,long,480.8690961538476), (9,medium,293.22341323056406), (11,medium,254.47777319902355), (8,medium,322.54506223357464), (7,long,554.1280932802313), (6,long,598.8304861754743), (8,short,265.2235958197947), (5,medium,354.6517011517386), (10,short,259.31215532219113), (7,medium,382.5014410856043), (10,long,411.65239396939035), (5,short,278.2944923340169), (8,long,462.04696232908327), (6,short,296.05174334065094), (4,short,306.4335681610265), (6,medium,395.04683677556426), (10,medium,294.67349021285935), (4,medium,331.0029844885145))


## Job Not Optimized

Starting from the code written in the previous cells, the implementation was *refactored*; the non-optimized *job* follows.

In [14]:
val numClassesNO = 3

val rddFlightsNO = rawData.flatMap(FlightParser.parseFlightLine)
    // (k,v) => ((startingAirport, destinationAirport), (totalTravelDistance, flightMonth, totalFare))
    .map(flight => ((flight.startingAirport, flight.destinationAirport),
                    (flight.totalTravelDistance, flight.flightMonth, flight.totalFare)))

val avgDistancesNO = rddFlightsNO
    .aggregateByKey((0.0, 0))(
        (acc, travelDistance) => (acc._1 + travelDistance._1, acc._2 + 1),
        (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
    )
    // (k,v) => ((startingAirport, destinationAirport), avgDistance)
    .mapValues { case (sumDistance, count) => sumDistance / count }

val (minDistanceNO, maxDistanceNO) = avgDistancesNO
    .aggregate((Double.MaxValue, Double.MinValue))(
        (acc, avgDistance) => (math.min(acc._1, avgDistance._2), math.max(acc._2, avgDistance._2)),
        (acc1, acc2) => (math.min(acc1._1, acc2._1), math.max(acc1._2, acc2._2))
    )

val rangeNO = (maxDistanceNO - minDistanceNO) / numClassesNO

val resultJobNotOptimized = avgDistancesNO
    .mapValues {
      case d if d < minDistanceNO + rangeNO => "short"
      case d if d < minDistanceNO + (numClassesNO - 1) * rangeNO => "medium"
      case _ => "long"
    } // (k,v) => ((startingAirport, destinationAirport), classification)
    .join(rddFlightsNO)
    .map { case (_, (classification, (_, month, totalFare))) => ((month, classification), (totalFare, 1)) }
    .reduceByKey((acc, totalFare) => (acc._1 + totalFare._1, acc._2 + totalFare._2))
    .map { case ((month, classification), (sumTotalFare : Double, count: Int)) => (month, classification, sumTotalFare / count) }

numClassesNO: Int = 3
rddFlightsNO: org.apache.spark.rdd.RDD[((String, String), (Double, Int, Double))] = MapPartitionsRDD[15] at map at <console>:32
avgDistancesNO: org.apache.spark.rdd.RDD[((String, String), Double)] = MapPartitionsRDD[17] at mapValues at <console>:41
minDistanceNO: Double = 185.0
maxDistanceNO: Double = 3366.947416137806
rangeNO: Double = 1060.649138712602
resultJobNotOptimized: org.apache.spark.rdd.RDD[(Int, String, Double)] = MapPartitionsRDD[24] at map at <console>:60


In [15]:
resultJobNotOptimized.collect()

res4: Array[(Int, String, Double)] = Array((9,long,405.4846256093826), (11,short,222.00556413449337), (11,long,382.2554902106179), (5,long,533.1691284153799), (7,short,290.69249745467766), (9,short,256.1263844227749), (4,long,480.8690961538476), (9,medium,293.22341323056395), (11,medium,254.4777731990232), (8,medium,322.545062233574), (7,long,554.1280932802313), (6,long,598.8304861754746), (8,short,265.223595819795), (5,medium,354.65170115173936), (10,short,259.3121553221906), (7,medium,382.50144108560403), (10,long,411.65239396939035), (5,short,278.294492334017), (8,long,462.04696232908344), (6,short,296.05174334065117), (4,short,306.4335681610265), (6,medium,395.0468367755642), (10,medium,294.67349021286043), (4,medium,331.0029844885145))


### Considerations on the optimizations

- Cache/Persist
- Repartition/PartitionBy
- Broadcast variable
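
To give an idea of what `partitionBy` with a hash partitioner does, here is a minimal Spark-free sketch of the assignment rule used by Spark's `HashPartitioner`: the key's `hashCode` modulo the number of partitions, made non-negative. The function name `partitionOf` and the sample routes are assumptions made for illustration.

```scala
// Hypothetical re-implementation of the hash-partitioning rule:
// null keys go to partition 0, other keys to a non-negative hashCode modulo.
def partitionOf(key: Any, partitions: Int): Int = key match {
  case null => 0
  case _ =>
    val raw = key.hashCode % partitions
    if (raw < 0) raw + partitions else raw
}

val partitions = 24
val routes = List(("BOS", "LGA"), ("SFO", "JFK"), ("BOS", "LGA"))

// The same route always maps to the same partition index.
val assigned = routes.map(route => route -> partitionOf(route, partitions))
assigned.foreach(println)
```

Because every record of a given route lands in the same partition, an aggregation or a join keyed on the route can, in principle, reuse that layout instead of shuffling the flights again. The rule pays no attention to how many records each key has, so partitions may end up with uneven sizes.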

In [6]:
rddFlightsNO.partitioner

res1: Option[org.apache.spark.Partitioner] = None


In [7]:
rddFlightsNO.partitions.length

res2: Int = 19


In [None]:
rddFlightsNO.mapPartitionsWithIndex((index, iter) => Iterator((index, iter.size))).collect().foreach(println)

In [19]:
java.lang.Runtime.getRuntime.availableProcessors

res13: Int = 8


In [29]:
sc.defaultParallelism

res22: Int = 8


In [47]:
sc.getConf.getAll.foreach { case (key, value) =>
  println(s"$key: $value")
}

spark.eventLog.enabled: true
spark.driver.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
spark.driver.cores: 4
spark.app.submitTime: 1737629015318
spark.eventLog.dir: fi

In [46]:
sc.getConf.get("spark.driver.cores")

res37: String = 4


In [None]:
sc.getConf.get("spark.driver.memory")

## Job Optimized

Once the optimization techniques worth adopting to improve performance, in terms of both time and computation, had been identified, the *main job* was rewritten, giving the following code.

In [16]:
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.HashPartitioner

val numClassesO = 3
val numPartitions = 24
val p = new HashPartitioner(numPartitions)

val rddFlightsO = rawData.flatMap(FlightParser.parseFlightLine)
    // (k,v) => ((startingAirport, destinationAirport), (totalTravelDistance, flightMonth, totalFare))
    .map(flight => ((flight.startingAirport, flight.destinationAirport), 
                    (flight.totalTravelDistance, flight.flightMonth, flight.totalFare)))
    .partitionBy(p)
    //.persist(MEMORY_AND_DISK_SER)
    .cache()

val avgDistancesO = rddFlightsO
    .aggregateByKey((0.0, 0))(
    (acc, travelDistance) => (acc._1 + travelDistance._1, acc._2 + 1),
    (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
    )
    // (k,v) => ((startingAirport, destinationAirport), avgDistance)
    .mapValues { case (sumDistance, count) => sumDistance / count }
    //.persist(MEMORY_AND_DISK_SER)
    .cache()

val (minDistanceO, maxDistanceO) = sc.broadcast(avgDistancesO
    .aggregate((Double.MaxValue, Double.MinValue))(
        (acc, avgDistance) => (Math.min(acc._1, avgDistance._2), Math.max(acc._2, avgDistance._2)),
        (acc1, acc2) => (Math.min(acc1._1, acc2._1), Math.max(acc1._2, acc2._2))
    )
).value

val rangeO = (maxDistanceO - minDistanceO) / numClassesO

val resultJobOptimized = avgDistancesO
    .mapValues {
      case d if d < minDistanceO + rangeO => "short"
      case d if d < minDistanceO + (numClassesO - 1) * rangeO => "medium"
      case _ => "long"
    } // (k,v) => ((startingAirport, destinationAirport), classification)
    .join(rddFlightsO)
    .map { case (_, (classification, (_, month, totalFare))) => ((month, classification), (totalFare, 1)) }
    .reduceByKey((acc, totalFare) => (acc._1 + totalFare._1, acc._2 + totalFare._2))
    .map { case ((month, classification), (sumTotalFare, count)) => (month, classification, sumTotalFare / count) }

import org.apache.spark.storage.StorageLevel._
import org.apache.spark.HashPartitioner
numClassesO: Int = 3
numPartitions: Int = 24
p: org.apache.spark.HashPartitioner = org.apache.spark.HashPartitioner@18
rddFlightsO: org.apache.spark.rdd.RDD[((String, String), (Double, Int, Double))] = ShuffledRDD[27] at partitionBy at <console>:43
avgDistancesO: org.apache.spark.rdd.RDD[((String, String), Double)] = MapPartitionsRDD[29] at mapValues at <console>:53
minDistanceO: Double = 185.0
maxDistanceO: Double = 3366.947416137806
rangeO: Double = 1060.649138712602
resultJobOptimized: org.apache.spark.rdd.RDD[(Int, String, Double)] = MapPartitionsRDD[36] at map at <console>:75


In [10]:
rddFlightsO.partitions.length

res4: Int = 24


In [17]:
resultJobOptimized.collect()

res5: Array[(Int, String, Double)] = Array((4,medium,331.0029844885144), (6,short,296.05174334064077), (8,short,265.2235958197836), (11,medium,254.47777319902426), (6,medium,395.04683677555823), (9,short,256.12638442276415), (5,long,533.1691284153735), (8,medium,322.5450622335694), (11,long,382.2554902106189), (4,short,306.43356816102596), (10,short,259.31215532217783), (8,long,462.04696232908634), (10,long,411.65239396938625), (5,short,278.29449233400754), (4,long,480.86909615384786), (11,short,222.0055641344965), (9,medium,293.2234132305583), (6,long,598.8304861754714), (7,long,554.1280932802309), (10,medium,294.6734902128512), (9,long,405.4846256093821), (7,short,290.69249745466715), (7,medium,382.50144108559874), (5,medium,354.65170115173464))


In [12]:
rddFlightsO.mapPartitionsWithIndex((index, iter) => Iterator((index, iter.size))).collect().foreach(println)

(0,69496)
(1,53223)
(2,49455)
(3,87341)
(4,78586)
(5,76991)
(6,49778)
(7,58925)
(8,64250)
(9,72252)
(10,71614)
(11,110965)
(12,44238)
(13,52329)
(14,50096)
(15,73514)
(16,58015)
(17,62059)
(18,77677)
(19,60100)
(20,79107)
(21,27779)
(22,59050)
(23,33822)


In [18]:
resultJobNotOptimized.sortBy(_._3, ascending = true).collect()

res6: Array[(Int, String, Double)] = Array((11,short,222.00556413449337), (11,medium,254.4777731990232), (9,short,256.1263844227749), (10,short,259.3121553221906), (8,short,265.223595819795), (5,short,278.294492334017), (7,short,290.69249745467766), (9,medium,293.22341323056395), (10,medium,294.67349021286043), (6,short,296.05174334065117), (4,short,306.4335681610265), (8,medium,322.545062233574), (4,medium,331.0029844885145), (5,medium,354.65170115173936), (11,long,382.2554902106179), (7,medium,382.50144108560403), (6,medium,395.0468367755642), (9,long,405.4846256093826), (10,long,411.65239396939035), (8,long,462.04696232908344), (4,long,480.8690961538476), (5,long,533.1691284153799), (7,long,554.1280932802313), (6,long,598.8304861754746))


In [19]:
resultJobOptimized.sortBy(_._3, ascending = true).collect()

res7: Array[(Int, String, Double)] = Array((11,short,222.0055641344965), (11,medium,254.47777319902426), (9,short,256.12638442276415), (10,short,259.31215532217783), (8,short,265.2235958197836), (5,short,278.29449233400754), (7,short,290.69249745466715), (9,medium,293.2234132305583), (10,medium,294.6734902128512), (6,short,296.05174334064077), (4,short,306.43356816102596), (8,medium,322.5450622335694), (4,medium,331.0029844885144), (5,medium,354.65170115173464), (11,long,382.2554902106189), (7,medium,382.50144108559874), (6,medium,395.04683677555823), (9,long,405.4846256093821), (10,long,411.65239396938625), (8,long,462.04696232908634), (4,long,480.86909615384786), (5,long,533.1691284153735), (7,long,554.1280932802309), (6,long,598.8304861754714))
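
The two sorted outputs list the same (month, band) combinations, and the averages differ only in their last digits. This is consistent with floating-point sums being accumulated in a different order under a different partitioning. A minimal sketch of a tolerance-based comparison follows, using three rows copied from the outputs above; the tolerance value is an assumption chosen for illustration.

```scala
// Three rows copied from the sorted outputs of the two jobs.
val notOptimized = List(
  (11, "short", 222.00556413449337),
  (4, "medium", 331.0029844885145),
  (6, "long", 598.8304861754746)
)
val optimized = List(
  (11, "short", 222.0055641344965),
  (4, "medium", 331.0029844885144),
  (6, "long", 598.8304861754714)
)

val tolerance = 1e-9 // assumed tolerance, not taken from the notebook

// Keys must match exactly; averages only up to the tolerance.
val sameKeys = notOptimized.map(t => (t._1, t._2)) == optimized.map(t => (t._1, t._2))
val closeValues = notOptimized.zip(optimized).forall { case (a, b) =>
  math.abs(a._3 - b._3) <= tolerance
}

println(notOptimized == optimized) // false: exact equality fails
println(sameKeys && closeValues)   // true
```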


This cell frees the memory held by the persisted RDDs; it was run only when needed, for *debugging* purposes.

In [20]:
sc.getPersistentRDDs.foreach(_._2.unpersist())

## Saving the results to file

In [10]:
import org.apache.spark.sql.SaveMode

val jobNotOptimized = "../../../../output/jobNotOptimized"
val jobOptimized = "../../../../output/jobOptimized"

import org.apache.spark.sql.SaveMode
jobNotOptimized: String = ../../../../output/jobNotOptimized
jobOptimized: String = ../../../../output/jobOptimized


In [None]:
resultJobNotOptimized
  .coalesce(1)
  .toDF().write.format("csv").mode(SaveMode.Overwrite).save(jobNotOptimized)

In [None]:
resultJobOptimized
  .coalesce(1)
  .toDF().write.format("csv").mode(SaveMode.Overwrite).save(jobOptimized)