# Project objective
After analysing and understanding the data, we want to study more closely the correlation found between distance and price, which defines the objective of the project. The goal is to check whether prices are seasonal, that is, whether they are much higher in some months than in others, and whether the price variation across months differs between distance ranges.

## Description of the proposed job
Since only a single *.csv* file is available, the job uses a *self-join* pattern:
-	**First aggregation**: aggregate by each combination of departure and destination airport (*startingAirport* and *destinationAirport*) to obtain the average travel distance (*totalTravelDistance*). From the average distance, derive a new column giving the distance range of the flight (short, medium or long distance);
-	**Join**: join the original dataset with the result of the first aggregation;
-	**Second aggregation**: aggregate by distance range and month (the month is derived from *flightDate*) to obtain the average price for each combination.
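The three steps above can be sketched on plain Scala collections before moving to Spark. This is only an illustration: the `Leg` records, the airport codes and the single 1000.0 threshold are hypothetical placeholders, not values from the dataset.

```scala
object SelfJoinSketch {
  // Hypothetical toy record; the real job uses the Flight case class defined later.
  final case class Leg(from: String, to: String, month: Int, fare: Double, distance: Double)

  val legs: Seq[Leg] = Seq(
    Leg("AAA", "BBB", 5, 100.0, 200.0),
    Leg("AAA", "BBB", 6, 140.0, 220.0),
    Leg("AAA", "CCC", 5, 300.0, 2000.0),
    Leg("AAA", "CCC", 6, 500.0, 2000.0)
  )

  // Step 1: average distance per route, then a distance class per route.
  // The 1000.0 threshold is an arbitrary placeholder for this sketch.
  val routeClass: Map[(String, String), String] =
    legs.groupBy(l => (l.from, l.to)).map { case (route, ls) =>
      val avg = ls.map(_.distance).sum / ls.size
      route -> (if (avg < 1000.0) "short" else "long")
    }

  // Step 2: join every original record with the class of its route.
  val joined: Seq[(Int, String, Double)] =
    legs.map(l => (l.month, routeClass((l.from, l.to)), l.fare))

  // Step 3: average fare per (month, distance class).
  val avgFare: Map[(Int, String), Double] =
    joined.groupBy { case (month, cls, _) => (month, cls) }.map { case (key, rows) =>
      key -> rows.map(_._3).sum / rows.size
    }
}
```

The join is a "self-join" because both sides come from the same records: the per-route aggregate is computed from `legs` and then attached back to `legs`.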

### Loading the Spark library

First, the Spark library is imported, which starts a `spark-shell`; the output then shows the link to the Spark web UI.

In [1]:
import org.apache.spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.9:4040
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1736970020996)
SparkSession available as 'spark'


import org.apache.spark


In [4]:
// DO NOT EXECUTE - this is needed just to avoid showing errors in the following cells
val sc = spark.SparkContext.getOrCreate()

sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6382c5d2


### Parser for the .csv file

The cell below implements a *parser* for the .csv file described in the notebook [data-exploration.ipynb](./data-exploration.ipynb); it extracts the information needed to populate the Spark RDD.

In [2]:
import java.text.SimpleDateFormat
import java.util.Calendar

object FlightParser {
  
  val commaRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
  val comma = ","

  /**
   * Convert from date (String) to month (Int).
   * @param dateString the date
   * @return the month
   */
  def monthFromDate(dateString: String): Int = {
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val date = sdf.parse(dateString.trim)
    val cal = Calendar.getInstance()
    cal.setTime(date)
    cal.get(Calendar.MONTH) + 1
  }

  case class Flight(
     legId: String,
     searchDate: Int,
     flightDate: Int,
     startingAirport: String,
     destinationAirport: String,
     fareBasisCode: String,
     travelDuration: String,
     elapsedDays: Int,
     isBasicEconomy: Boolean,
     isRefundable: Boolean,
     isNonStop: Boolean,
     baseFare: Double,
     totalFare: Double,
     seatsRemaining: Int,
     totalTravelDistance: Double,
     segmentsDepartureTimeEpochSeconds: String,
     segmentsDepartureTimeRaw: String,
     segmentsArrivalTimeEpochSeconds: String,
     segmentsArrivalTimeRaw: String,
     segmentsArrivalAirportCode: String,
     segmentsDepartureAirportCode: String,
     segmentsAirlineName: String,
     segmentsAirlineCode: String,
     segmentsEquipmentDescription: String,
     segmentsDurationInSeconds: String,
     segmentsDistance: String,
     segmentsCabinCode: String
  )

  /**
   * Function to parse flights records.
   * @param line that has to be parsed
   * @return Flight object, None in case of input errors
   */
  def parseFlightLine(line: String): Option[Flight] = {
    try {
      val columns = line.split(comma)
      Some(
        Flight(
          legId = columns(0).trim,
          searchDate = monthFromDate(columns(1)),
          flightDate = monthFromDate(columns(2)),
          startingAirport = columns(3).trim,
          destinationAirport = columns(4).trim,
          fareBasisCode = columns(5).trim,
          travelDuration = columns(6).trim,
          elapsedDays = columns(7).trim.toInt,
          isBasicEconomy = columns(8).trim.toBoolean,
          isRefundable = columns(9).trim.toBoolean,
          isNonStop = columns(10).trim.toBoolean,
          baseFare = columns(11).trim.toDouble,
          totalFare = columns(12).trim.toDouble,
          seatsRemaining = columns(13).trim.toInt,
          totalTravelDistance = columns(14).trim.toDouble,
          segmentsDepartureTimeEpochSeconds = columns(15).trim,
          segmentsDepartureTimeRaw = columns(16).trim,
          segmentsArrivalTimeEpochSeconds = columns(17).trim,
          segmentsArrivalTimeRaw = columns(18).trim,
          segmentsArrivalAirportCode = columns(19).trim,
          segmentsDepartureAirportCode = columns(20).trim,
          segmentsAirlineName = columns(21).trim,
          segmentsAirlineCode = columns(22).trim,
          segmentsEquipmentDescription = columns(23).trim,
          segmentsDurationInSeconds = columns(24).trim,
          segmentsDistance = columns(25).trim,
          segmentsCabinCode = columns(26).trim
        )
      )
    } catch {
      case e: Exception =>
        // println(s"Error while parsing line '$line': ${e.getMessage}")
        None
    }
  }
}

import java.text.SimpleDateFormat
import java.util.Calendar
defined object FlightParser
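`FlightParser` defines `commaRegex` but `parseFlightLine` splits on the plain `comma`. The two behave differently only when a field is wrapped in double quotes and contains a comma. Whether the dataset contains such fields is not stated here, so the line below is a hypothetical input used only to show the difference.

```scala
object SplitDemo {
  // Same pattern as FlightParser.commaRegex: a comma followed by an even number of double quotes.
  val commaRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

  // Hypothetical line whose second field is quoted and contains a comma.
  val line = "a,\"b,c\",d"

  // Plain split: also cuts inside the quoted field.
  val naive: Array[String] = line.split(",")

  // Quote-aware split: the comma inside the quotes is left alone.
  val quoteAware: Array[String] = line.split(commaRegex)
}
```

With the plain split the quoted field is broken in two and every later column index shifts by one; with `commaRegex` the line keeps three fields.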


In [3]:
val rawData = sc.textFile("../../../../datasets/big/itineraries-sample02.csv")

rawData: org.apache.spark.rdd.RDD[String] = ../../../../datasets/big/itineraries-sample02.csv MapPartitionsRDD[1] at textFile at <console>:27


In [4]:
val rddFlights = rawData.flatMap(FlightParser.parseFlightLine)

rddFlights: org.apache.spark.rdd.RDD[FlightParser.Flight] = MapPartitionsRDD[2] at flatMap at <console>:28
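Because `parseFlightLine` returns an `Option[Flight]`, passing it to `flatMap` keeps only the rows that parsed and silently drops the rest; a header row, if the file has one, would be dropped the same way because its date column cannot be parsed. A minimal sketch of that behaviour on a plain `Seq`, where `parseFare` is a hypothetical stand-in for the real parser:

```scala
import scala.util.Try

object OptionFlatMapDemo {
  // Hypothetical stand-in for parseFlightLine: Some(value) on success, None on failure.
  def parseFare(s: String): Option[Double] = Try(s.trim.toDouble).toOption

  // A header-like token, two valid values and an empty string.
  val raw: Seq[String] = Seq("totalFare", "248.60", " 99.5 ", "")

  // flatMap unwraps every Some and discards every None.
  val fares: Seq[Double] = raw.flatMap(parseFare)
}
```

The trade-off is that malformed rows disappear without a trace; comparing the count before and after parsing is one way to see how many were discarded.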


In [5]:
rddFlights.count()

res0: Long = 1520662


In [6]:
val flightsDistances = rddFlights
  .map(x => ((x.startingAirport, x.destinationAirport), x.totalTravelDistance))
  .aggregateByKey((0.0, 0.0))(
    (acc, distance) => (acc._1 + distance, acc._2 + 1), // Add the distance to the running sum and increment the count
    (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2) // Combine partial results
  )
  .mapValues { case (sum, count) => (sum / count, count) } // Compute the average and the number of occurrences

flightsDistances: org.apache.spark.rdd.RDD[((String, String), (Double, Double))] = MapPartitionsRDD[5] at mapValues at <console>:33
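The two functions passed to `aggregateByKey` play different roles: the first folds one value into a partition-local accumulator, the second merges accumulators coming from different partitions. The sketch below mimics this for a single key with plain collections; the distances and the split into two "partitions" are hypothetical.

```scala
object AggregateSketch {
  type Acc = (Double, Double) // (running sum, running count)

  // seqOp: fold one distance into a partition-local accumulator.
  def seqOp(acc: Acc, distance: Double): Acc = (acc._1 + distance, acc._2 + 1)

  // combOp: merge the accumulators of two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Hypothetical distances for one route, spread over two partitions.
  val partitions: Seq[Seq[Double]] = Seq(Seq(100.0, 200.0), Seq(300.0))

  val total: Acc = partitions
    .map(_.foldLeft((0.0, 0.0))(seqOp)) // per-partition pass, starting from the zero value
    .reduce(combOp)                     // cross-partition merge

  val mean: Double = total._1 / total._2
}
```

Carrying `(sum, count)` instead of a running mean is what makes the merge step correct: partial means cannot be averaged directly when partitions hold different numbers of rows.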


In [7]:
flightsDistances.collect()

res1: Array[((String, String), (Double, Double))] = Array(((BOS,LGA),(406.6977958842578,9573.0)), ((IAD,ORD),(841.525204359673,4404.0)), ((EWR,PHL),(1039.5994575045208,5530.0)), ((DTW,LGA),(694.1526669795088,6393.0)), ((OAK,DFW),(2123.889001864091,5901.0)), ((ATL,DEN),(1513.575124745888,5411.0)), ((IAD,CLT),(587.9657102869139,2858.0)), ((DEN,LGA),(1804.57909562639,6745.0)), ((DTW,EWR),(736.2682451253482,3590.0)), ((LGA,DFW),(1455.155069582505,9054.0)), ((OAK,JFK),(3126.575707702436,3038.0)), ((DEN,DTW),(1578.6580700623254,4653.0)), ((JFK,IAD),(703.4175354183374,3741.0)), ((ORD,MIA),(1521.1334047682828,7466.0)), ((IAD,DFW),(1363.3681891954557,4313.0)), ((DEN,PHL),(1852.40172900494,5668.0)), ((OAK,DEN),(1419.471807628524,4824.0)), ((BOS,JFK),(261.94046744083494,6803.0)), ((SFO,JFK),(2652....


In [8]:
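// Note: min() uses the implicit tuple Ordering, which compares the (start, destination) key first,
// so this is the average distance of the first route in key order, not the smallest average distance.
val minDistance = flightsDistances.min()._2._1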

minDistance: Double = 993.9075896762905


In [9]:
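// Note: max() also compares the route key first, so this is the average distance of the last route in key order.
val maxDistance = flightsDistances.max()._2._1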

maxDistance: Double = 2738.66144486692
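The collected output above lists BOS to JFK with an average distance of about 261.94, which is lower than the reported `minDistance` of about 993.91. The reason is that `min()` and `max()` on an RDD of tuples rely on the default tuple ordering, which compares the route key before the values. The sketch below reproduces the effect on a plain `Seq` with hypothetical numbers and shows one possible alternative, projecting to the average distance before taking the extremes.

```scala
object TupleMinDemo {
  // Hypothetical (route, (avgDistance, count)) pairs shaped like flightsDistances.
  val routes: Seq[((String, String), (Double, Double))] = Seq(
    (("ATL", "BOS"), (950.0, 10.0)),
    (("BOS", "JFK"), (260.0, 20.0)),
    (("SFO", "JFK"), (2650.0, 5.0))
  )

  // Default tuple ordering compares the key first, so this picks the ATL-BOS route.
  val minByKey: Double = routes.min._2._1

  // Projecting to the average distance first gives the actual extremes.
  val minByDistance: Double = routes.map(_._2._1).min
  val maxByDistance: Double = routes.map(_._2._1).max
}
```

On the RDD the same idea would read `flightsDistances.map(_._2._1).min()`; note that changing it would also change the thresholds and every result computed afterwards in this notebook.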


In [10]:
val numClassi = 3

val range = (maxDistance - minDistance) / numClassi

numClassi: Int = 3
range: Double = 581.5846183968765


In [11]:
val classifiedDistances = flightsDistances.map {
  case ((startingAirport, destinationAirport), (avgDistance, _)) =>
    val classification = if (avgDistance < minDistance + range) "Breve"
    else if (avgDistance <= minDistance + (numClassi - 1) * range ) "Media"
    else "Lunga"
    ((startingAirport, destinationAirport), classification)
}

classifiedDistances: org.apache.spark.rdd.RDD[((String, String), String)] = MapPartitionsRDD[6] at map at <console>:30
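The classification is an equal-width binning of the interval between `minDistance` and `maxDistance`. With the values printed above, the two thresholds are approximately 993.91 + 581.58 = 1575.49 and 993.91 + 2 × 581.58 = 2157.08. The sketch below wraps the same comparisons in a function so they can be tried on arbitrary inputs; the function name and its parameters are illustrative, not part of the notebook.

```scala
object BinningSketch {
  // Same thresholds as the cell above, with min, max and the class count as parameters.
  def classify(avg: Double, min: Double, max: Double, numClasses: Int = 3): String = {
    val width = (max - min) / numClasses
    if (avg < min + width) "Breve"                           // first bin
    else if (avg <= min + (numClasses - 1) * width) "Media"  // everything up to the last bin
    else "Lunga"                                             // last bin
  }
}
```

For example, with a hypothetical range from 0.0 to 3000.0 the bins are 1000.0 wide, so `BinningSketch.classify(500.0, 0.0, 3000.0)` falls in "Breve" and `BinningSketch.classify(2500.0, 0.0, 3000.0)` in "Lunga". A value exactly on the first threshold goes to "Media", and one exactly on the second threshold stays in "Media".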


In [None]:
//val rddClassifiedDistances = sc.parallelize(classifiedDistances)

In [12]:
val result = rddFlights
  .map(x => ((x.startingAirport, x.destinationAirport), (x.flightDate, x.totalFare)))
  .join(classifiedDistances)
  .map {
    case ((startingAirport, destinationAirport), ((month, totalFare), classification)) => ((month, classification), (totalFare, 1))
  }
  .reduceByKey { case ((sumFare1, count1), (sumFare2, count2)) =>
    (sumFare1 + sumFare2, count1 + count2)
  }
  .mapValues { case (sum, count) => (sum / count, count) } // Compute the average price and the total number of flights
  .map {
    case ((month, classification), (avgFare, count)) => (month, classification, avgFare)
  }

/*result2.collect().foreach {
  case (classification, (avgTotalFare, numFlights)) =>
    println(s"Fascia: $classification, Prezzo Medio: $avgTotalFare, Numero Voli: $numFlights")
}*/


result: org.apache.spark.rdd.RDD[(Int, String, Double)] = MapPartitionsRDD[14] at map at <console>:38


In [13]:
result.collect()

res2: Array[(Int, String, Double)] = Array((5,Breve,283.8788881446023), (10,Media,325.7054848013336), (5,Media,413.6171160589015), (8,Lunga,461.10027818585945), (6,Breve,304.4254653392809), (11,Media,272.2007247531067), (11,Breve,222.67424202278337), (7,Lunga,553.1568957980675), (5,Lunga,531.5929277671767), (9,Media,326.2977310691745), (9,Breve,255.8687988591393), (6,Media,458.49081160607454), (11,Lunga,383.3642073607308), (7,Breve,299.0520991699244), (10,Lunga,410.4446781736263), (7,Media,432.4863157403677), (8,Media,359.8989032783995), (4,Breve,305.31267008117635), (6,Lunga,597.482866518744), (8,Breve,268.6789576042566), (10,Breve,259.3999845308579), (4,Media,356.7572332890661), (4,Lunga,480.69310774341034), (9,Lunga,404.9251304499109))
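The collected array comes back in no particular order, which makes the monthly trend hard to read. One way to inspect it is to sort by month and then by distance class. The sketch below does this on a few tuples copied from the output above; the `classRank` map is an assumption introduced here so that classes sort by distance rather than alphabetically.

```scala
object SortResultDemo {
  // A few tuples copied from the result above: (month, distance class, average fare).
  val sample: Seq[(Int, String, Double)] = Seq(
    (5, "Breve", 283.8788881446023),
    (10, "Media", 325.7054848013336),
    (5, "Media", 413.6171160589015),
    (5, "Lunga", 531.5929277671767),
    (10, "Lunga", 410.4446781736263),
    (10, "Breve", 259.3999845308579)
  )

  // Explicit rank so that classes sort by distance, not alphabetically.
  val classRank: Map[String, Int] = Map("Breve" -> 0, "Media" -> 1, "Lunga" -> 2)

  // Sort by month first, then by distance class.
  val sorted: Seq[(Int, String, Double)] =
    sample.sortBy { case (month, cls, _) => (month, classRank(cls)) }
}
```

In this sample the average fare in month 5 is higher than in month 10 for every distance class, which is the kind of comparison the second aggregation is meant to support.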


In [14]:
import org.apache.spark.sql.SaveMode

val aggregatedFlights = "../../../../output/aggregatedFlights"

import org.apache.spark.sql.SaveMode
aggregatedFlights: String = ../../../../output/aggregatedFlights


In [15]:
result
  .coalesce(1)
  .toDF().write.format("csv").mode(SaveMode.Overwrite).save(aggregatedFlights)