# ETL

Extract, transform, load (ETL) is a common concept in (big) data applications. It describes the life cycle of the data within your application.
The concept consists of three steps:
* extract: locating the data, reading it in, and validating it
* transform: processing the data: data cleaning, aggregation, grouping, filtering, ...
* load: storing the transformed data in a file, database, data warehouse, data lake, ...
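
As a rough illustration of how the three steps fit together, here is a minimal sketch in plain Python (not Spark). The sample data and the function names `extract`, `transform`, and `load` are made up for this example and are not part of the notebook's dataset or code.

```python
import csv
import io
import json

# Made-up sample input, not taken from the notebook's dataset.
RAW_CSV = "brand,price\nAudi,12000\nKia,\nAudi,8000\n"


def extract(text):
    """Extract: parse the CSV text and validate that the expected columns exist."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"brand", "price"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def transform(rows):
    """Transform: skip rows without a price, cast to int, keep the max per brand."""
    max_price = {}
    for row in rows:
        if not row["price"]:
            continue  # data cleaning: skip incomplete rows
        price = int(row["price"])
        max_price[row["brand"]] = max(price, max_price.get(row["brand"], price))
    return max_price


def load(result):
    """Load: serialise the result; a real pipeline would write to a file or database."""
    return json.dumps(result, sort_keys=True)


output = load(transform(extract(RAW_CSV)))
print(output)
```

Each step only depends on the output of the previous one, which is the same shape the Spark code in this notebook follows.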

In the rest of this notebook we look at how to carry out these steps with Spark.
We use a CSV file as the source.

## Extract

This directory contains a zip file that holds the CSV.
First unzip the file and upload the CSV to HDFS.

In [29]:
import zipfile
import pydoop.hdfs as hdfs

with zipfile.ZipFile('cars.zip', 'r') as zipFile:
    zipFile.extractall()
    
localFs = hdfs.hdfs(host='')
clientFs = hdfs.hdfs(host='localhost', port=9000)

if not clientFs.exists('ETL'):
    clientFs.create_directory('ETL')
    
for f in clientFs.list_directory('ETL'):
    clientFs.delete(f['name'], True)

localFs.copy('cars.csv', clientFs, 'ETL/cars.csv')

0

Now create a local SparkSession and read the file.

In [30]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[2]').appName('ETL_cars').getOrCreate()

# extract step
df = spark.read.csv('ETL/cars.csv', header=True, sep=',')
print('Total rows = {}'.format(df.count()))
df.show(1)

Total rows = 38531
+-----------------+----------+------------+------+--------------+-------------+-----------+--------------+-----------+---------------+---------+------------+-----+----------+---------+---------------+---------------+----------------+----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------------+
|manufacturer_name|model_name|transmission| color|odometer_value|year_produced|engine_fuel|engine_has_gas|engine_type|engine_capacity|body_type|has_warranty|state|drivetrain|price_usd|is_exchangeable|location_region|number_of_photos|up_counter|feature_0|feature_1|feature_2|feature_3|feature_4|feature_5|feature_6|feature_7|feature_8|feature_9|duration_listed|
+-----------------+----------+------------+------+--------------+-------------+-----------+--------------+-----------+---------------+---------+------------+-----+----------+---------+---------------+---------------+----------------+----------+---------+-----

The data structure of the CSV is as follows:

In [31]:
df.printSchema()

root
 |-- manufacturer_name: string (nullable = true)
 |-- model_name: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- color: string (nullable = true)
 |-- odometer_value: string (nullable = true)
 |-- year_produced: string (nullable = true)
 |-- engine_fuel: string (nullable = true)
 |-- engine_has_gas: string (nullable = true)
 |-- engine_type: string (nullable = true)
 |-- engine_capacity: string (nullable = true)
 |-- body_type: string (nullable = true)
 |-- has_warranty: string (nullable = true)
 |-- state: string (nullable = true)
 |-- drivetrain: string (nullable = true)
 |-- price_usd: string (nullable = true)
 |-- is_exchangeable: string (nullable = true)
 |-- location_region: string (nullable = true)
 |-- number_of_photos: string (nullable = true)
 |-- up_counter: string (nullable = true)
 |-- feature_0: string (nullable = true)
 |-- feature_1: string (nullable = true)
 |-- feature_2: string (nullable = true)
 |-- feature_3: string (nullable = true)
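
Every column shows up as `string` because `spark.read.csv` does not infer types unless asked to. As an alternative to casting by hand, the reader's `inferSchema` option lets Spark guess the types at the cost of an extra pass over the data. The sketch below wraps that call; `read_csv_with_types` is a hypothetical helper name, and the inferred types may differ from the casts chosen in the Transform step.

```python
def read_csv_with_types(spark, path):
    """Read a CSV with a header row and let Spark infer the column types."""
    return spark.read.csv(path, header=True, sep=',', inferSchema=True)
```

With the session from this notebook it could be used as `read_csv_with_types(spark, 'ETL/cars.csv').printSchema()` to compare the inferred schema with the one above.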


## Transform

The transform step is the most complex of the three and can consist of a wide variety of operations, such as:
* Changing data formats
* Translating text
* Recoding encoded values: 0/1 vs. true/false or m/f vs. male/female
* All kinds of data-cleaning steps
* Encoding (ordinal or one-hot) categorical columns
* Grouping data
* Performing calculations
* ...

First write the code yourself to carry out the following steps:
* Cast these columns to integer: odometer_value, year_produced, engine_capacity, price_usd, number_of_photos, up_counter, duration_listed
* Cast these columns to boolean: engine_has_gas, has_warranty, is_exchangeable, feature_0 through feature_9
* Count the number of null and NaN values per column

In [32]:
df_backup = df

In [33]:
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType, BooleanType

df = df_backup

cols = ['odometer_value', 'year_produced', 'engine_capacity', 'price_usd', 'number_of_photos', 'up_counter', 'duration_listed']
for c in cols:
    #df = df.withColumn(c, f.col(c).cast(IntegerType()))
    df = df.withColumn(c, f.col(c).cast('int'))
    
df.printSchema()

root
 |-- manufacturer_name: string (nullable = true)
 |-- model_name: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- color: string (nullable = true)
 |-- odometer_value: integer (nullable = true)
 |-- year_produced: integer (nullable = true)
 |-- engine_fuel: string (nullable = true)
 |-- engine_has_gas: string (nullable = true)
 |-- engine_type: string (nullable = true)
 |-- engine_capacity: integer (nullable = true)
 |-- body_type: string (nullable = true)
 |-- has_warranty: string (nullable = true)
 |-- state: string (nullable = true)
 |-- drivetrain: string (nullable = true)
 |-- price_usd: integer (nullable = true)
 |-- is_exchangeable: string (nullable = true)
 |-- location_region: string (nullable = true)
 |-- number_of_photos: integer (nullable = true)
 |-- up_counter: integer (nullable = true)
 |-- feature_0: string (nullable = true)
 |-- feature_1: string (nullable = true)
 |-- feature_2: string (nullable = true)
 |-- feature_3: string (nullable = 

In [34]:
cols = ['engine_has_gas', 'has_warranty', 'is_exchangeable', 'feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9']
for c in cols:
    #df = df.withColumn(c, f.col(c).cast(BooleanType()))
    df = df.withColumn(c, f.col(c).cast('boolean'))
    
df.printSchema()

root
 |-- manufacturer_name: string (nullable = true)
 |-- model_name: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- color: string (nullable = true)
 |-- odometer_value: integer (nullable = true)
 |-- year_produced: integer (nullable = true)
 |-- engine_fuel: string (nullable = true)
 |-- engine_has_gas: boolean (nullable = true)
 |-- engine_type: string (nullable = true)
 |-- engine_capacity: integer (nullable = true)
 |-- body_type: string (nullable = true)
 |-- has_warranty: boolean (nullable = true)
 |-- state: string (nullable = true)
 |-- drivetrain: string (nullable = true)
 |-- price_usd: integer (nullable = true)
 |-- is_exchangeable: boolean (nullable = true)
 |-- location_region: string (nullable = true)
 |-- number_of_photos: integer (nullable = true)
 |-- up_counter: integer (nullable = true)
 |-- feature_0: boolean (nullable = true)
 |-- feature_1: boolean (nullable = true)
 |-- feature_2: boolean (nullable = true)
 |-- feature_3: boolean (null

In [35]:
nulls = df.select([f.count(f.when(f.col(c).isNull(), 1)).alias(c) for c in df.columns])  # the 1 only gives count() a non-null value to count
nulls.show()

+-----------------+----------+------------+-----+--------------+-------------+-----------+--------------+-----------+---------------+---------+------------+-----+----------+---------+---------------+---------------+----------------+----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------------+
|manufacturer_name|model_name|transmission|color|odometer_value|year_produced|engine_fuel|engine_has_gas|engine_type|engine_capacity|body_type|has_warranty|state|drivetrain|price_usd|is_exchangeable|location_region|number_of_photos|up_counter|feature_0|feature_1|feature_2|feature_3|feature_4|feature_5|feature_6|feature_7|feature_8|feature_9|duration_listed|
+-----------------+----------+------------+-----+--------------+-------------+-----------+--------------+-----------+---------------+---------+------------+-----+----------+---------+---------------+---------------+----------------+----------+---------+---------+---------+-------
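
The cell above counts null values only, while the exercise also mentions NaN. NaN can only occur in `float` or `double` columns, so after the casts in this notebook (integer, boolean, string) both counts coincide. For data that does contain floating-point columns, a minimal sketch could build one Spark SQL expression per column and add an `isnan` check where it applies. The helper name `null_or_nan_count_exprs` and the `ratio` column are hypothetical.

```python
def null_or_nan_count_exprs(dtypes):
    """Build one Spark SQL expression per column that counts null (and NaN) values.

    `dtypes` is a list of (column_name, type_name) pairs, as returned by `df.dtypes`.
    """
    exprs = []
    for name, dtype in dtypes:
        condition = f"`{name}` IS NULL"
        if dtype in ("float", "double"):
            # NaN only exists in floating-point columns
            condition += f" OR isnan(`{name}`)"
        exprs.append(f"count(CASE WHEN {condition} THEN 1 END) AS `{name}`")
    return exprs


# `ratio` is a made-up double column, used only to show the NaN branch.
example = null_or_nan_count_exprs([("price_usd", "int"), ("ratio", "double")])
print(example)
```

On a real DataFrame the expressions can be passed to `selectExpr`, for example `df.selectExpr(*null_or_nan_count_exprs(df.dtypes)).show()`.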

The output above shows that the dataset contains only a few null values.
These can be filled in with an [imputer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Imputer.html).
For simplicity, however, we just drop these rows here:

In [36]:
df = df.na.drop()
df.count()  # we started with 38531 rows, so 10 were indeed dropped

38521

You can do the exercise of filling in the values with an imputer (e.g. using the mean) below.

In [37]:
# exercise
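
One possible approach to this exercise, as a sketch rather than a reference solution, uses `pyspark.ml.feature.Imputer` with the `mean` strategy. The helper name `impute_mean` and the `_imputed` suffix for the output columns are hypothetical choices. The columns are cast to `double` first so the sketch does not depend on how a given Spark version handles integer input.

```python
def impute_mean(df, cols):
    """Return `df` with extra `<col>_imputed` columns where nulls are replaced by the mean."""
    # Imported inside the function so that defining the helper does not require Spark.
    from pyspark.ml.feature import Imputer

    for c in cols:
        # Cast to double so the sketch does not rely on integer support in Imputer.
        df = df.withColumn(c, df[c].cast('double'))

    imputer = Imputer(
        strategy='mean',
        inputCols=cols,
        outputCols=[f'{c}_imputed' for c in cols],  # hypothetical output names
    )
    # fit() computes the column means, transform() fills in the missing values.
    return imputer.fit(df).transform(df)
```

To see any effect it has to be applied to the DataFrame as it was before `df.na.drop()`, since that call already removed the rows with missing values.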

Now compute the following from the available data:
* Number of cars per manufacturer
* Which transmission types are there?
* Market share (percentage) of the different engine types
* Maximum price for each manufacturer
* What are the five cheapest vehicles with an automatic transmission?

In [38]:
# cars per manufacturer
df.groupby('manufacturer_name').count().show()

+-----------------+-----+
|manufacturer_name|count|
+-----------------+-----+
|       Volkswagen| 4243|
|            Lexus|  213|
|           Jaguar|   53|
|            Rover|  235|
|           Lancia|   92|
|             Jeep|  107|
|       Mitsubishi|  887|
|              ГАЗ|  200|
|              Kia|  912|
|             Mini|   68|
|        Chevrolet|  435|
|            Volvo|  721|
|            Lifan|   47|
|          Hyundai| 1116|
|             LADA|  146|
|        SsangYong|   79|
|             Audi| 2468|
|             Seat|  303|
|         Cadillac|   43|
|          Pontiac|   42|
+-----------------+-----+
only showing top 20 rows



In [39]:
# transmission types
df.groupby('transmission').count().show()
df.select('transmission').distinct().show()

+------------+-----+
|transmission|count|
+------------+-----+
|   automatic|12888|
|  mechanical|25633|
+------------+-----+

+------------+
|transmission|
+------------+
|   automatic|
|  mechanical|
+------------+



In [40]:
# market share
aantal_autos = df.count()
df_aandeel = df.groupby('engine_type').count()
df_aandeel.withColumn('marktaandeel', f.col('count')/aantal_autos).show()

+-----------+-----+-------------------+
|engine_type|count|       marktaandeel|
+-----------+-----+-------------------+
|   gasoline|25647| 0.6657926845097479|
|     diesel|12874|0.33420731549025207|
+-----------+-----+-------------------+
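
The exercise asks for a percentage, while the `marktaandeel` column holds a fraction between 0 and 1. The arithmetic for the conversion, using the two counts from the table above, is sketched here in plain Python; the variable names are chosen for this example.

```python
# Counts copied from the table above.
counts = {'gasoline': 25647, 'diesel': 12874}
total = sum(counts.values())  # row count after dropping the rows with nulls

# Multiply the fraction by 100 and round to two decimals.
share_pct = {engine: round(100 * n / total, 2) for engine, n in counts.items()}
print(share_pct)
```

In Spark the same conversion can be written as `f.round(f.col('count') / aantal_autos * 100, 2)` inside the `withColumn` call.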



In [41]:
# maximum price per manufacturer
df.groupby('manufacturer_name').max('price_usd').show()

+-----------------+--------------+
|manufacturer_name|max(price_usd)|
+-----------------+--------------+
|       Volkswagen|         43999|
|            Lexus|         48610|
|           Jaguar|         50000|
|            Rover|          9900|
|           Lancia|          9500|
|             Jeep|         43000|
|       Mitsubishi|         31400|
|              ГАЗ|         30000|
|              Kia|         44700|
|             Mini|         39456|
|        Chevrolet|         49900|
|            Volvo|         48200|
|            Lifan|         15750|
|          Hyundai|         45954|
|             LADA|         13800|
|        SsangYong|         15900|
|             Audi|         46750|
|             Seat|         18350|
|         Cadillac|         25750|
|          Pontiac|         10000|
+-----------------+--------------+
only showing top 20 rows



In [42]:
# cheapest vehicles with an automatic transmission
df_cheapest = df.filter(f.col('transmission') == 'automatic').sort(f.col('price_usd').asc()).limit(5)

df_cheapest.show()

+-----------------+----------+------------+------+--------------+-------------+-----------+--------------+-----------+---------------+---------+------------+---------+----------+---------+---------------+---------------+----------------+----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------------+
|manufacturer_name|model_name|transmission| color|odometer_value|year_produced|engine_fuel|engine_has_gas|engine_type|engine_capacity|body_type|has_warranty|    state|drivetrain|price_usd|is_exchangeable|location_region|number_of_photos|up_counter|feature_0|feature_1|feature_2|feature_3|feature_4|feature_5|feature_6|feature_7|feature_8|feature_9|duration_listed|
+-----------------+----------+------------+------+--------------+-------------+-----------+--------------+-----------+---------------+---------+------------+---------+----------+---------+---------------+---------------+----------------+----------+---------+---------+--

## Load

In this step we assume that we only want to keep the 5 cheapest cars.
Below, write the code needed to save the information about these cars as JSON.

In [43]:
df_cheapest.write.format('json').save('ETL/result.json')
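
Two details of this write are easy to miss. Spark creates a directory named `ETL/result.json` containing one or more part files, not a single JSON file. The default save mode also raises an error when the target path already exists, so re-running the cell on its own fails. A small sketch that sets the save mode explicitly follows; `save_as_json` is a hypothetical helper name.

```python
def save_as_json(df, path):
    """Write `df` as JSON, replacing any earlier output at `path`."""
    # Spark writes a directory at `path` that contains one or more part files.
    df.write.mode('overwrite').json(path)
```

Called as `save_as_json(df_cheapest, 'ETL/result.json')`, the result can be read back with `spark.read.json('ETL/result.json')` to check what was stored.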

This is an example in which the results are saved to a file.
Another option is to store them in a SQL database.
Demo code for this can be found [here](https://kontext.tech/column/spark/395/save-dataframe-to-sql-databases-via-jdbc-in-pyspark).
Later in this course we will also look at NoSQL databases.
At that point we will see how to store the results in that type of database management system (DBMS).
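
For the SQL option mentioned above, a minimal sketch of a write through Spark's JDBC data source could look as follows. The helper name `save_to_sql` is hypothetical, and the example values in the comments are made up. The matching JDBC driver jar is assumed to be available to Spark; this notebook does not set that up.

```python
def save_to_sql(df, url, table, user, password, driver):
    """Append the rows of `df` to a SQL table through Spark's JDBC data source."""
    (df.write
       .format('jdbc')
       .option('url', url)            # e.g. 'jdbc:postgresql://localhost:5432/cars' (made up)
       .option('dbtable', table)      # target table name
       .option('user', user)
       .option('password', password)
       .option('driver', driver)      # e.g. 'org.postgresql.Driver'
       .mode('append')                # add rows instead of failing when the table exists
       .save())
```

Passing the connection details as parameters keeps credentials out of the notebook code itself.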