* Master DAC - BDLE
* Author: Mohamed-Amine Baazizi
* Affiliation: LIP6 - Faculté des Sciences - Sorbonne Université
* Email: mohamed-amine.baazizi@lip6.fr
* October 2024

FERCHE Adelin-Flaviu \
N°Etudiant: 3800655 \
M2 SAR

# Building an effective data preparation pipeline for ML


## Outline

This homework is about building an effective data preparation pipeline.
It covers the following topics from the session:

* ingest raw data, curate it, transform it
* load the data into delta tables to enforce constraints and allow updates
* build an ML pipeline for training a decision tree model and run cross validation

It is based on raw data about car prices crawled from a public source.
Start by running some data exploration queries to decide which columns to select or discard, based on a general understanding of the data.








## Prerequisite

### System setup

In [1]:
%%capture
!pip install -q pyspark
!pip install -q delta-spark
!pip install pyngrok

In [2]:
!pip list|grep spark

delta-spark                        3.2.1
pyspark                            3.5.3


In [3]:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

local = "local[*]"
appName = "ADIA certificate - Delta Lake "
localConfig = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "8G").\
  set("spark.driver.memory","8G").\
  set("spark.sql.catalogImplementation","in-memory").\
  set("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension").\
  set("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").\
  set("spark.jars.packages","io.delta:delta-spark_2.12:3.2.1").\
  set("spark.databricks.delta.schema.autoMerge.enabled","true")


spark = SparkSession.builder.config(conf = localConfig).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

In [4]:
spark

### Data import

In [5]:
%%capture
!wget --no-verbose https://nuage.lip6.fr/s/89BG8HD9r3iE693/download/MLData.tgz -O /tmp/MLData.tgz
!tar -xzvf /tmp/MLData.tgz  --directory /tmp/

In [6]:
!ls -hal /tmp/MLData

total 73M
drwxr-xr-x 2  501 staff 4.0K Jan  6  2022 .
drwxrwxrwt 1 root root  4.0K Oct 24 12:24 ..
-rw-r--r-- 1  501 staff  66M Jan  6  2022 autos.csv
-rw-r--r-- 1  501 staff  176 Jan  6  2022 ._loan.csv
-rw-r--r-- 1  501 staff 6.8M Jan  6  2022 loan.csv


In [7]:
query = """
CREATE TABLE IF NOT EXISTS raw_vehiculePrices
USING csv
OPTIONS (
  header "true",
  path "/tmp/MLData/autos.csv",
  inferSchema "true"
)
"""
spark.sql(query)

DataFrame[]

## Phase 0: Understanding the data

In this part, you are invited to get to know the data by reading its schema and extracting some basic statistics about the columns you find interesting.

In [8]:
query = """
DESCRIBE raw_vehiculePrices
"""
spark.sql(query).show()

+-------------------+---------+-------+
|           col_name|data_type|comment|
+-------------------+---------+-------+
|        dateCrawled|timestamp|   NULL|
|               name|   string|   NULL|
|             seller|   string|   NULL|
|          offerType|   string|   NULL|
|              price|      int|   NULL|
|             abtest|   string|   NULL|
|        vehicleType|   string|   NULL|
| yearOfRegistration|      int|   NULL|
|            gearbox|   string|   NULL|
|            powerPS|      int|   NULL|
|              model|   string|   NULL|
|          kilometer|      int|   NULL|
|monthOfRegistration|      int|   NULL|
|           fuelType|   string|   NULL|
|              brand|   string|   NULL|
|  notRepairedDamage|   string|   NULL|
|        dateCreated|timestamp|   NULL|
|       nrOfPictures|      int|   NULL|
|         postalCode|      int|   NULL|
|           lastSeen|timestamp|   NULL|
+-------------------+---------+-------+



In [9]:
query = """
SELECT * FROM raw_vehiculePrices TABLESAMPLE (5 ROWS);
"""
spark.sql(query).show()


+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|        dateCrawled|                name|seller|offerType|price|abtest|vehicleType|yearOfRegistration|  gearbox|powerPS|model|kilometer|monthOfRegistration|fuelType|     brand|notRepairedDamage|        dateCreated|nrOfPictures|postalCode|           lastSeen|
+-------------------+--------------------+------+---------+-----+------+-----------+------------------+---------+-------+-----+---------+-------------------+--------+----------+-----------------+-------------------+------------+----------+-------------------+
|2016-03-24 11:52:17|          Golf_3_1.6|privat|  Angebot|  480|  test|       NULL|              1993|  manuell|      0| golf|   150000|                  0|  benzin|volkswagen|             NULL|2016-03-24 00:00:00|     

In [10]:
query = """
SELECT  min(yearOfRegistration), max(yearOfRegistration),
          avg(yearOfRegistration), median(yearOfRegistration)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+-----------------------+-----------------------+-----------------------+--------------------------+
|min(yearOfRegistration)|max(yearOfRegistration)|avg(yearOfRegistration)|median(yearOfRegistration)|
+-----------------------+-----------------------+-----------------------+--------------------------+
|                   1000|                   9999|     2004.5767206439623|                    2003.0|
+-----------------------+-----------------------+-----------------------+--------------------------+



In [11]:
query = """
SELECT  yearOfRegistration, count(*)
FROM raw_vehiculePrices
GROUP BY yearOfRegistration
order by 1 desc,2 desc
"""
spark.sql(query).show()

+------------------+--------+
|yearOfRegistration|count(1)|
+------------------+--------+
|              9999|      27|
|              9996|       1|
|              9450|       1|
|              9229|       1|
|              9000|       5|
|              8888|       2|
|              8500|       1|
|              8455|       1|
|              8200|       1|
|              8000|       2|
|              7800|       1|
|              7777|       1|
|              7500|       2|
|              7100|       1|
|              7000|       4|
|              6500|       1|
|              6200|       1|
|              6000|       6|
|              5911|       2|
|              5900|       1|
+------------------+--------+
only showing top 20 rows



In [12]:
query = """
SELECT  min(price), max(price),
          avg(price), median(price)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+----------+----------+------------------+-------------+
|min(price)|max(price)|        avg(price)|median(price)|
+----------+----------+------------------+-------------+
|         0|2147483647|17286.338865535483|       2950.0|
+----------+----------+------------------+-------------+



In [13]:
query = """
SELECT  min(kilometer), max(kilometer),
          avg(kilometer), median(kilometer)
FROM raw_vehiculePrices
"""
spark.sql(query).show()

+--------------+--------------+------------------+-----------------+
|min(kilometer)|max(kilometer)|    avg(kilometer)|median(kilometer)|
+--------------+--------------+------------------+-----------------+
|          5000|        150000|125618.56044408226|         150000.0|
+--------------+--------------+------------------+-----------------+



In [14]:
%%capture
!pip install ydata-profiling

In [15]:
spark_df = spark.sql("SELECT * FROM raw_vehiculePrices")

pandas_df = spark_df.toPandas()


from ydata_profiling import ProfileReport

profile = ProfileReport(pandas_df, title = "Vehicule Prices Profiling Report")
profile.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

## Phase 1: Cleaning the data and selecting relevant columns

In this part you are invited to decide which columns are useful for your analysis and to clean the data by removing outliers (e.g. records with implausible values in a specific column).
The result of your cleaning and selection should be stored in a table called `phase1`.
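
One possible way to choose cut-off values, as an alternative to thresholds picked by hand, is the interquartile-range (IQR) rule. The sketch below is only an illustration: `iqr_bounds` and `column_bounds` are hypothetical helpers, the quartiles in the example are made up, and the factor 1.5 is the usual convention rather than something required by the assignment.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (low, high) fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def column_bounds(df, column, k=1.5, rel_error=0.01):
    """Estimate the fences for a numeric column of a Spark DataFrame."""
    # approxQuantile returns one value per requested probability
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], rel_error)
    return iqr_bounds(q1, q3, k)


# Quartiles below are invented for illustration, not taken from the dataset.
low, high = iqr_bounds(1000.0, 7000.0)
print(low, high)
```

For heavily skewed columns such as `price`, the IQR fences can be too strict, so they are best treated as a starting point to compare against domain knowledge.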

In [16]:
# We dropped the columns dateCrawled, seller, offerType, postalCode, lastSeen and nrOfPictures
# because they are not relevant to a buyer.
# The remaining columns are those that describe the vehicle.

raw_vehiculePrices_df =  spark.sql("SELECT * from raw_vehiculePrices")

useful_columns = ['name', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'kilometer', 'monthOfRegistration',
                  'fuelType', 'brand', 'notRepairedDamage', 'dateCreated']

select_df = raw_vehiculePrices_df.select(useful_columns)

In [17]:
# Drop every row that contains a null value in at least one column
clean_df = select_df.dropna()

# Remove rows whose values are strange or unrealistic.
clean_df = clean_df.filter((clean_df['price'] >= 1000) & (clean_df['price'] <= 25000000))

clean_df = clean_df.filter((clean_df['yearOfRegistration'] >= 1900) & (clean_df['yearOfRegistration'] <= 2024))

clean_df = clean_df.filter((clean_df['powerPS'] >= 10) & (clean_df['powerPS'] <= 2000))

clean_df = clean_df.filter(clean_df['monthOfRegistration'] > 0)

# Save the result
clean_df.write.format("delta").mode("overwrite").save("/tmp/MLData/phase1")

In [18]:
query = """
CREATE TABLE IF NOT EXISTS phase1
USING delta
LOCATION '/tmp/MLData/phase1'
"""

spark.sql(query)

DataFrame[]

Give a brief summary of your choices

## Phase 2: Organizing the data

In this part you are invited to load the data into Delta tables, where you will define meaningful constraints and conditions to be fulfilled by any future incoming data.
The result of this phase should be a Delta table called `deltaPrices`.
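
Once the constraints of this phase are in place, one way to check that they really guard the table is to attempt an insert that should fail. The sketch below assumes a Spark session like the notebook's; `insert_is_rejected` and the sample row are hypothetical. It catches a generic `Exception` because the exact class raised for a CHECK violation depends on the Delta version, which also means a malformed statement would be reported as rejected.

```python
def insert_is_rejected(spark, insert_sql):
    """Run an INSERT and report whether it raised an error."""
    try:
        spark.sql(insert_sql)
    except Exception as err:  # exact exception class varies across Delta versions
        print(f"rejected: {type(err).__name__}")
        return True
    return False


# Hypothetical row: the price of 0 is meant to violate a CHECK on price.
bad_row_sql = """
INSERT INTO deltaPrices VALUES
('Test car', 0, 'test', 'kombi', 2010, 'manuell', 100, 'golf', 50000, 6,
 'benzin', 'volkswagen', 'nein', CAST('2016-03-01' AS timestamp))
"""
# In the notebook: insert_is_rejected(spark, bad_row_sql)
```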

In [19]:
delta_df = spark.sql("SELECT * FROM phase1")
delta_df.write.format("delta").mode("overwrite").save("/tmp/MLData/deltaPrices")

In [20]:
spark.sql("""
CREATE TABLE IF NOT EXISTS deltaPrices
USING delta
LOCATION '/tmp/MLData/deltaPrices'
""")

DataFrame[]

In [21]:
# The price must be reasonable and realistic
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_price CHECK (price >= 1000 AND price <= 25000000)
""")

DataFrame[]

In [22]:
# The year must be between 1900 and the current year (2024 at the time of writing). I assume there were essentially no cars before 1900.
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_yearOfRegistration CHECK (yearOfRegistration >= 1900 AND yearOfRegistration <= 2024)
""")

DataFrame[]

In [23]:
# The power must be reasonable and realistic
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_powerPS CHECK (powerPS >= 10 AND powerPS <= 2000)
""")

DataFrame[]

In [24]:
# The month must be between 1 and 12, otherwise absurd values could slip in
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_monthOfRegistration CHECK (monthOfRegistration >= 1 AND monthOfRegistration <= 12)
""")

DataFrame[]

In [25]:
# We assume every vehicle type already appears in the table, so the value must be one of the existing ones.
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_vehicleType CHECK (vehicleType IN ('suv', 'limousine', 'kleinwagen', 'kombi', 'bus', 'cabrio', 'coupe', 'andere'))
""")

DataFrame[]

In [26]:
# A car can only be manual or automatic
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_gearbox CHECK (gearbox IN ('manuell', 'automatik'))
""")

DataFrame[]

In [27]:
# Prevent negative mileage values
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_kilometer CHECK (kilometer >= 0)
""")

DataFrame[]

In [28]:
# We assume every fuel type already appears in the table, so the value must be one of the existing ones
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_fuelType CHECK (fuelType IN ('benzin', 'diesel', 'lpg', 'cng', 'hybrid', 'andere', 'elektro'))
""")

DataFrame[]

In [29]:
# The value can only be 'control' or 'test'
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_abtest CHECK (abtest IN ('test', 'control'))
""")

DataFrame[]

In [30]:
# The value can only be yes ('ja') or no ('nein')
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT valid_notRepairedDamage CHECK (notRepairedDamage IN ('nein', 'ja'))
""")

DataFrame[]

Since the columns we kept are all useful to a buyer, every one of them must be filled in when a vehicle is registered.
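
The NOT NULL constraints that follow all have the same form, so they could be generated in a loop instead of one cell per column. This is a minimal sketch of that idea; `not_null_statements` is a hypothetical helper, and it is meant as a replacement for the individual cells, not an addition, since adding a constraint whose name already exists fails.

```python
REQUIRED_COLUMNS = [
    "price", "name", "vehicleType", "yearOfRegistration", "gearbox", "powerPS",
    "model", "kilometer", "monthOfRegistration", "fuelType", "brand",
    "dateCreated", "abtest", "notRepairedDamage",
]


def not_null_statements(table, columns):
    """Return one ALTER TABLE ... ADD CONSTRAINT statement per column."""
    return [
        f"ALTER TABLE {table} ADD CONSTRAINT not_null_{col} CHECK ({col} IS NOT NULL)"
        for col in columns
    ]


statements = not_null_statements("deltaPrices", REQUIRED_COLUMNS)
print(len(statements))
# In the notebook:
# for stmt in statements:
#     spark.sql(stmt)
```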

In [31]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_price CHECK (price IS NOT NULL)
""")

DataFrame[]

In [32]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_name CHECK (name IS NOT NULL)
""")

DataFrame[]

In [33]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_vehicleType CHECK (vehicleType IS NOT NULL)
""")

DataFrame[]

In [34]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_yearOfRegistration CHECK (yearOfRegistration IS NOT NULL)
""")

DataFrame[]

In [35]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_gearbox CHECK (gearbox IS NOT NULL)
""")

DataFrame[]

In [36]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_powerPS CHECK (powerPS IS NOT NULL)
""")

DataFrame[]

In [37]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_model CHECK (model IS NOT NULL)
""")

DataFrame[]

In [38]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_kilometer CHECK (kilometer IS NOT NULL)
""")

DataFrame[]

In [39]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_monthOfRegistration CHECK (monthOfRegistration IS NOT NULL)
""")

DataFrame[]

In [40]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_fuelType CHECK (fuelType IS NOT NULL)
""")

DataFrame[]

In [41]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_brand CHECK (brand IS NOT NULL)
""")

DataFrame[]

In [42]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_dateCreated CHECK (dateCreated IS NOT NULL)
""")

DataFrame[]

In [43]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_abtest CHECK (abtest IS NOT NULL)
""")

DataFrame[]

In [44]:
spark.sql("""
ALTER TABLE deltaPrices
ADD CONSTRAINT not_null_notRepairedDamage CHECK (notRepairedDamage IS NOT NULL)
""")

DataFrame[]

In [45]:
# Check the constraints
spark.sql("SHOW TBLPROPERTIES deltaPrices").show(truncate=False)

+----------------------------------------------+-------------------------------------------------------------------------------------+
|key                                           |value                                                                                |
+----------------------------------------------+-------------------------------------------------------------------------------------+
|delta.constraints.not_null_abtest             |abtest IS NOT NULL                                                                   |
|delta.constraints.not_null_brand              |brand IS NOT NULL                                                                    |
|delta.constraints.not_null_datecreated        |dateCreated IS NOT NULL                                                              |
|delta.constraints.not_null_fueltype           |fuelType IS NOT NULL                                                                 |
|delta.constraints.not_null_gearbox            |gearbox

Comment on the constraints you added

....

## Phase 3: Analysing the data and ensuring query evaluation efficiency

Suggest 2 or 3 meaningful queries as described above and suggest a data organization scheme for optimizing one such query of your choice.

In [46]:
# Average price of a car by brand and vehicle type
query_1 = """
SELECT brand, vehicleType, avg(price) as avg_price
FROM deltaPrices
GROUP BY brand, vehicleType
ORDER BY avg_price DESC
"""

spark.sql(query_1).show()

+-------------+-----------+------------------+
|        brand|vehicleType|         avg_price|
+-------------+-----------+------------------+
|      porsche|      coupe| 82556.94207317074|
|      porsche|  limousine| 59279.97435897436|
|       jaguar|     cabrio| 49362.88636363636|
|      porsche|     cabrio| 43440.29674796748|
|       jaguar|      coupe|37468.431818181816|
|      porsche|     andere|37248.333333333336|
|          bmw|        bus|29552.978260869564|
|      porsche|        suv|27486.664864864866|
|    chevrolet|     cabrio|         26964.375|
|        mazda|        bus|26100.731759656654|
|         audi|        suv|25413.964646464647|
|   land_rover|        suv|19331.881469115193|
|         mini|        suv|19179.666666666668|
|        rover|        suv|           19037.5|
|mercedes_benz|        suv| 18241.90448625181|
|    chevrolet|      coupe|18007.545454545456|
|         audi|      coupe| 17848.20835751596|
|          bmw|        suv|17206.907203907205|
|        rove

In [47]:
# Number of cars registered per year and month
query_2 = """
SELECT yearOfRegistration, monthOfRegistration, count(*) as nb_cars
FROM deltaPrices
GROUP BY yearOfRegistration, monthOfRegistration
ORDER BY nb_cars DESC
"""

spark.sql(query_2).show()

+------------------+-------------------+-------+
|yearOfRegistration|monthOfRegistration|nb_cars|
+------------------+-------------------+-------+
|              2006|                  6|   1862|
|              2003|                  3|   1753|
|              2006|                  3|   1628|
|              2004|                  4|   1550|
|              2007|                  3|   1545|
|              2005|                  5|   1541|
|              2005|                  3|   1518|
|              2005|                  6|   1507|
|              2006|                  4|   1435|
|              2006|                  5|   1413|
|              2007|                  7|   1393|
|              2009|                  6|   1385|
|              2006|                 12|   1379|
|              2004|                  3|   1379|
|              2006|                 11|   1375|
|              2004|                  6|   1374|
|              2008|                  4|   1372|
|              2006|

In [48]:
# Average power for each vehicle type
query_3 = """
SELECT vehicleType, avg(powerPS) as avg_powerPS
FROM deltaPrices
GROUP BY vehicleType
ORDER BY avg_powerPS DESC
"""

spark.sql(query_3).show()

+-----------+------------------+
|vehicleType|       avg_powerPS|
+-----------+------------------+
|      coupe|194.52297838270616|
|        suv|176.07880338863396|
|     cabrio|156.98327485380116|
|      kombi|148.76299451669914|
|  limousine|146.72253270773615|
|        bus|121.71627360621667|
|     andere|       113.1484375|
| kleinwagen| 78.79787690057742|
+-----------+------------------+



To optimize query_1, we can first redefine the deltaPrices table, partitioning it by 'brand'. Each brand is then stored in its own set of files, so Spark only has to read the files for the brands a query actually needs, which should speed up execution.

Next, we Z-order by 'vehicleType' and 'price', which groups rows with similar values together instead of leaving them scattered across all files.

With these two techniques, query_1 should run faster.
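
To make the Z-ordering idea concrete, here is a small pure-Python sketch of a Z-order (Morton) curve on two integer columns. It only illustrates the general principle of interleaving bits so that rows close in both dimensions end up close in the sort order; Delta's actual implementation differs in detail, and `z_value` is a hypothetical name.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


# Sort a 4x4 grid of (x, y) points along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:4])
```

The first four points in the ordering form one 2x2 block of the grid, which is the locality property that lets a file written in this order cover a narrow range of both columns.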

In [49]:
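# Rewrite the data, partitioned by brand, at a new location
delta_df.write.format("delta").partitionBy("brand").mode("overwrite").save("/tmp/delta/deltaPrices")

# Note: deltaPrices already exists (created in In [20] at /tmp/MLData/deltaPrices),
# so IF NOT EXISTS makes this statement a no-op and the table keeps its original location.
spark.sql("""
CREATE TABLE IF NOT EXISTS deltaPrices
USING DELTA
LOCATION '/tmp/delta/deltaPrices'
""")

# OPTIMIZE therefore runs on the existing, unpartitioned deltaPrices table
spark.sql("""
OPTIMIZE deltaPrices
ZORDER BY (vehicleType, price)
""")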

DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bigint>,de

## Ingesting new data and rerunning analytics

In this part you are invited to suggest the insertion of fictitious new data that conforms to the schema established in phase 2 and to rerun some queries of phase 3 to see how the results evolve. Ideally, write a query that compares an aggregation value in two different versions of the data by exploiting the Delta history feature.
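
A comparison between two versions can also be written as a join that places the old and new aggregate side by side and keeps only the groups that changed. The sketch below builds such a query as a string; `compare_versions_query` and `latest_version` are hypothetical helpers, and the version numbers 25 and 26 are simply the ones this notebook's history reports before and after the insert.

```python
def compare_versions_query(table, group_cols, agg_expr, old_version, new_version):
    """Build a query that puts one aggregate from two table versions side by side."""
    keys = ", ".join(group_cols)
    # <=> is Spark SQL's null-safe equality operator
    join_cond = " AND ".join(f"o.{c} <=> n.{c}" for c in group_cols)
    coalesced = ", ".join(f"coalesce(o.{c}, n.{c}) AS {c}" for c in group_cols)
    return f"""
SELECT {coalesced}, o.value AS old_value, n.value AS new_value
FROM (SELECT {keys}, {agg_expr} AS value FROM {table} VERSION AS OF {old_version} GROUP BY {keys}) o
FULL OUTER JOIN
     (SELECT {keys}, {agg_expr} AS value FROM {table} VERSION AS OF {new_version} GROUP BY {keys}) n
ON {join_cond}
WHERE NOT (o.value <=> n.value)
"""


def latest_version(spark, table):
    """Read the most recent version number from the table history."""
    return spark.sql(f"DESCRIBE HISTORY {table}").selectExpr("max(version)").first()[0]


query = compare_versions_query("deltaPrices", ["brand", "vehicleType"], "avg(price)", 25, 26)
print(query)
# In the notebook: spark.sql(query).show()
```

Looking up the latest version with `latest_version` avoids hard-coding version numbers, which change whenever a cell is rerun.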

In [50]:
spark.sql("""
INSERT INTO deltaPrices
VALUES
('Audi 1', 80000, 'control', 'bus', 1929, 'automatik', 100, '1', '100000', 1, 'elektro', 'audi', 'nein', CAST('2016-01-29' AS timestamp)),
('Audi 2', 5000, 'test', 'andere', 1929, 'manuell', 400, '2', '5000', 1, 'hybrid', 'audi', 'ja', CAST('2010-10-12' AS timestamp)),
('Audi 3', 200000, 'control', 'coupe', 1934, 'manuell', 1000, '3', '0', 1, 'benzin', 'audi', 'nein', CAST('2000-05-14' AS timestamp)),
('Alfa Roméo 1', 10000, 'control', 'cabrio', 1936, 'manuell', 2000, '1', '20000', 5, 'diesel', 'alfa_romeo', 'nein', CAST('2015-04-02' AS timestamp)),
('Alfa Roméo 2', 1000, 'control', 'andere', 1936, 'automatik', 10, '2', '80000', 7, 'lpg', 'alfa_romeo', 'nein', CAST('2015-04-02' AS timestamp)),
('Alfa Roméo 3', 25000000, 'test', 'andere', 1931, 'manuell', 1000, '3', '200000', 6, 'diesel', 'alfa_romeo', 'ja', CAST('2015-04-02' AS timestamp)),
('Alfa Roméo 4', 10000, 'test', 'kombi', 1931, 'manuell', 1000, '1', '1000', 6, 'diesel', 'alfa_romeo', 'ja', CAST('2015-04-02' AS timestamp))
""")

DataFrame[]

In [51]:
query = """
DESCRIBE HISTORY deltaPrices
"""
spark.sql(query).show()

+-------+--------------------+------+--------+--------------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+------------+--------------------+
|version|           timestamp|userId|userName|     operation| operationParameters| job|notebook|clusterId|readVersion|   isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+--------------------+------+--------+--------------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+------------+--------------------+
|     26|2024-10-24 12:32:...|  NULL|    NULL|         WRITE|{mode -> Append, ...|NULL|    NULL|     NULL|         25|     Serializable|         true|{numFiles -> 2, n...|        NULL|Apache-Spark/3.5....|
|     25|2024-10-24 12:32:...|  NULL|    NULL|      OPTIMIZE|{predicate -> [],...|NULL|    NULL|     NULL|         24|SnapshotIsolation|        false|{numRemovedFiles ...|     

In [53]:
query_1 = """
WITH previous_data AS (
  SELECT brand, vehicleType, avg(price) AS avg_price, 'previous' AS version
  FROM deltaPrices VERSION AS OF 25
  GROUP BY brand, vehicleType
),
current_data AS (
  SELECT brand,vehicleType, avg(price) AS avg_price, 'current' AS version
  FROM deltaPrices VERSION AS OF 26
  GROUP BY brand, vehicleType
)

SELECT brand, vehicleType, avg_price, version
FROM previous_data
UNION ALL
SELECT brand, vehicleType, avg_price, version
FROM current_data
ORDER BY brand, vehicleType, version
"""

spark.sql(query_1).show()

+----------+-----------+------------------+--------+
|     brand|vehicleType|         avg_price| version|
+----------+-----------+------------------+--------+
|alfa_romeo|     andere|         4167991.5| current|
|alfa_romeo|     andere|           1737.25|previous|
|alfa_romeo|     cabrio|10022.173333333334| current|
|alfa_romeo|     cabrio|10022.322147651006|previous|
|alfa_romeo|      coupe| 7626.726775956284| current|
|alfa_romeo|      coupe| 7626.726775956284|previous|
|alfa_romeo| kleinwagen| 5971.662337662337| current|
|alfa_romeo| kleinwagen| 5971.662337662337|previous|
|alfa_romeo|      kombi| 4803.656140350877| current|
|alfa_romeo|      kombi| 4785.359154929577|previous|
|alfa_romeo|  limousine| 4301.490054249548| current|
|alfa_romeo|  limousine| 4301.490054249548|previous|
|      audi|     andere| 8319.923076923076| current|
|      audi|     andere| 8385.019607843138|previous|
|      audi|        bus|          13181.25| current|
|      audi|        bus| 3635.714285714286|pre

In [54]:
query_2 = """
WITH previous_data AS (
  SELECT yearOfRegistration, monthOfRegistration, count(*) AS nb_cars, 'previous' AS version
  FROM deltaPrices VERSION AS OF 25
  GROUP BY yearOfRegistration, monthOfRegistration
),
current_data AS (
  SELECT yearOfRegistration, monthOfRegistration, count(*) AS nb_cars, 'current' AS version
  FROM deltaPrices VERSION AS OF 26
  GROUP BY yearOfRegistration, monthOfRegistration
)

SELECT yearOfRegistration, monthOfRegistration, nb_cars, version
FROM previous_data
UNION ALL
SELECT yearOfRegistration, monthOfRegistration, nb_cars, version
FROM current_data
ORDER BY yearOfRegistration, monthOfRegistration, version
"""

spark.sql(query_2).show()

+------------------+-------------------+-------+--------+
|yearOfRegistration|monthOfRegistration|nb_cars| version|
+------------------+-------------------+-------+--------+
|              1929|                  1|      3| current|
|              1929|                  1|      1|previous|
|              1930|                  1|      1| current|
|              1930|                  1|      1|previous|
|              1930|                  7|      2| current|
|              1930|                  7|      2|previous|
|              1931|                  1|      1| current|
|              1931|                  1|      1|previous|
|              1931|                  6|      3| current|
|              1931|                  6|      1|previous|
|              1931|                  7|      1| current|
|              1931|                  7|      1|previous|
|              1932|                  2|      1| current|
|              1932|                  2|      1|previous|
|             

In [55]:
query_3 = """
WITH previous_data AS (
  SELECT vehicleType, avg(powerPS) AS avg_powerPS, 'previous' AS version
  FROM deltaPrices VERSION AS OF 25
  GROUP BY vehicleType
),
current_data AS (
  SELECT vehicleType, avg(powerPS) AS avg_powerPS, 'current' AS version
  FROM deltaPrices VERSION AS OF 26
  GROUP BY vehicleType
)

SELECT vehicleType, avg_powerPS, version
FROM previous_data
UNION ALL
SELECT vehicleType, avg_powerPS, version
FROM current_data
ORDER BY vehicleType, version
"""

spark.sql(query_3).show()

+-----------+------------------+--------+
|vehicleType|       avg_powerPS| version|
+-----------+------------------+--------+
|     andere|114.07532467532468| current|
|     andere|       113.1484375|previous|
|        bus|121.71529252315338| current|
|        bus|121.71627360621667|previous|
|     cabrio|157.09104730717502| current|
|     cabrio|156.98327485380116|previous|
|      coupe| 194.5874629733408| current|
|      coupe|194.52297838270616|previous|
| kleinwagen| 78.79787690057742| current|
| kleinwagen| 78.79787690057742|previous|
|      kombi| 148.7822816358899| current|
|      kombi|148.76299451669914|previous|
|  limousine|146.72253270773615| current|
|  limousine|146.72253270773615|previous|
|        suv|176.07880338863396| current|
|        suv|176.07880338863396|previous|
+-----------+------------------+--------+

