# Spark Exercises
1. Load the imports
2. Start Jupyter Spark and read the Twitter streams from Avro
3. ETL pipeline: read the Avro data and write it back out as a Delta table
4. Complete the analysis exercises
5. Run the history analysis
6. **Shut down the Spark app**

## Important notes
1. Run all instructions in the given order. The code cells build on one another.
2. **When you stop working with the data, always shut down the Spark application with the last command in the notebook, `spark.stop()`.**
3. You can bring the notebook back up at any time by running steps 1 and 2 (loading the imports, then starting Jupyter Spark with its configuration).
4. Press **"Ctrl" + "Enter"** to run a single cell directly.
5. Use **"Insert"** in the top menu bar to add more cells for your own test functions.

## 1. Loading the imports
This cell imports all libraries needed for this lab.

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import Row
from pyspark.sql.functions import explode
from pyspark.sql.functions import lower, col
import pyspark.sql.functions as f

from delta import *


from datetime import datetime
import json


# use the full screen width for Jupyter cells
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important; }</style>"))



## 2. Starting Jupyter Spark & configuration
This cell configures and starts the jupyter-spark app, which runs all of the following steps.

In [2]:
appName="jupyter-spark"

conf = SparkConf()

# CLUSTER MANAGER
################################################################################
# set Kubernetes Master as Cluster Manager(“k8s://https://” is NOT a typo, this is how Spark knows the “provider” type).
conf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")

# CONFIGURE KUBERNETES
################################################################################
# set the namespace that will be used for running the driver and executor pods.
conf.set("spark.kubernetes.namespace","frontend")
# set the docker image from which the Worker pods are created
conf.set("spark.kubernetes.container.image", "thinkportgmbh/workshops:spark-3.3.2")
conf.set("spark.kubernetes.container.image.pullPolicy", "Always")

# set service account to be used
conf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
# authentication for service account(required to create worker pods):
conf.set("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
conf.set("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")


# CONFIGURE SPARK
################################################################################
conf.set("spark.sql.session.timeZone", "Europe/Berlin")
# set driver host. In this case the ingress service for the spark driver
# find name of the driver service with 'kubectl get services' or in the helm chart configuration
conf.set("spark.driver.host", "jupyter-spark-driver.frontend.svc.cluster.local")
# set the driver port; if this port is busy, Spark tries to bind to another port.
conf.set("spark.driver.port", "29413")
# add the spark-avro jar to the session
conf.set("spark.jars", "/opt/spark/jars/spark-avro_2.12-3.3.2.jar")
conf.set("spark.executor.extraClassPath","/opt/spark/jars/spark-avro_2.12-3.3.2.jar")
#conf.set("spark.executor.extraLibrary","/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.3.1.jar, /opt/spark/jars/kafka-clients-3.3.1.jar")
#conf.set("spark.driver.extraClassPath","/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.3.1.jar, /opt/spark/jars/kafka-clients-3.3.1.jar, /opt/spark/jars/spark-avro_2.12-3.3.1.jar")

# CONFIGURE S3 CONNECTOR
conf.set("spark.hadoop.fs.s3a.endpoint", "minio.minio.svc.cluster.local:9000")
conf.set("spark.hadoop.fs.s3a.access.key", "trainadm")
conf.set("spark.hadoop.fs.s3a.secret.key", "train@thinkport")
conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")

#conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions, org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# CONFIGURE WORKER (Customize based on workload)
################################################################################
# set number of worker pods
conf.set("spark.executor.instances", "1")
# set memory of each worker pod
conf.set("spark.executor.memory", "1G")
# set cpu of each worker pod
conf.set("spark.executor.cores", "2")
# Number of possible tasks = cores * executors

## Deltalake
# conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")

# SPARK SESSION
################################################################################
# and last, create the spark session and pass it the config object

spark = SparkSession\
    .builder\
    .config(conf=conf) \
    .config('spark.sql.session.timeZone', 'Europe/Berlin') \
    .appName(appName)\
    .getOrCreate()

# also get the spark context
sc=spark.sparkContext
# change the log level to ERROR, to see less output
sc.setLogLevel('ERROR')

# get the configuration object to check the configuration the session was started with
for entry in sc.getConf().getAll():
        if entry[0] in ["spark.app.name","spark.kubernetes.namespace","spark.executor.memory","spark.executor.cores","spark.driver.host","spark.master"]:
            print(entry[0],"=",entry[1])
            
spark

22/12/08 22:38:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


spark.kubernetes.namespace = frontend
spark.master = k8s://https://kubernetes.default.svc.cluster.local:443
spark.app.name = jupyter-spark
spark.executor.memory = 1G
spark.executor.cores = 2
spark.driver.host = jupyter-spark-driver.frontend.svc.cluster.local


## 3. Reading and writing data

### 3.1 Reading the data from our S3 bucket
Load the data from our bucket at "s3a://twitter/avro" into a DataFrame so that we can work with it.

In [3]:
df_avro=(spark
    .read.format("avro")
    # path to the bucket
    .load("s3a://twitter/avro")
    # repartition to 20 partitions to make better use of the few available CPUs
    .repartition(20)
   ).cache()


# keep only tweets with the hashtag BigData
df = df_avro.filter(f.array_contains(f.col("hashtags"),"BigData"))

print("Total number of tweets: ",df_avro.count())
print("Number of tweets with BigData: ",df.count())

22/12/08 22:38:22 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
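
As a rough illustration of what the `array_contains` filter in the cell above keeps, here is a plain-Python sketch that runs without Spark. The sample rows and the helper `has_hashtag` are hypothetical; only the `hashtags` field mirrors the real schema. The point it tries to show is that the match is exact and case-sensitive, and that rows without a hashtag array are dropped.

```python
# Hypothetical sample rows; only the "hashtags" field mirrors the real schema.
tweets = [
    {"tweet_id": "1", "hashtags": ["BigData", "AI"]},
    {"tweet_id": "2", "hashtags": ["bigdata"]},   # different case: no match
    {"tweet_id": "3", "hashtags": None},          # missing array: no match
    {"tweet_id": "4", "hashtags": ["Python", "BigData"]},
]


def has_hashtag(row, tag):
    """Exact, case-sensitive membership test, similar in spirit to array_contains."""
    tags = row["hashtags"]
    return tags is not None and tag in tags


bigdata_tweets = [row for row in tweets if has_hashtag(row, "BigData")]
print([row["tweet_id"] for row in bigdata_tweets])  # ['1', '4']
```

Because the match is case-sensitive, tweets tagged only with "bigdata" are not part of `df`; the later hashtag exercises lowercase the tags for the same reason.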


Take a quick look at what is in the data.

In [4]:
df.show()

                                                                                

+-------------------+-------------------+--------------------+-------------+---------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|    user_name|  user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+-------------+---------------+-------------------+------------------+-------------+--------+--------------------+
|1600981511217459200|2022-12-08 23:31:51|Lead Data Enginee...|    vinay_145|           null|              19009|              3419|            0|      en|[LeadDataEngineer...|
|1600982807937155072|2022-12-08 23:37:00|RT @gp_pulipaka: ...| phillip4real|  Cleveland, OH|                830|               696|            0|      en|[Maths, BigData, ...|
|1600982603145924643|2022-12-08 23:36:12|RT @HacBrain247: ...|DevHighlights|           null|              11094|        

### 3.2 Writing the data in Delta format
Here the data is converted to Delta format and written to the S3 bucket "s3a://twitter/delta". This step matters because it stores the data in a form that Trino can read.

In [5]:
writer_delta=(df
                .write.partitionBy("language")
                .mode("overwrite")
                .format("delta")
                .option("overwriteSchema", "true")
                .option("userMetadata", "Initial Ladung")
                .save("s3a://twitter/delta")
             )

                                                                                

## 4. Analysis exercises


### 4.1 Inspecting the tweets and the structure of the DataFrame
Take a close look at the dataset. Which columns does it have? Which data types occur?

In [6]:
df.show()

+-------------------+-------------------+--------------------+-------------+---------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|    user_name|  user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+-------------+---------------+-------------------+------------------+-------------+--------+--------------------+
|1600981511217459200|2022-12-08 23:31:51|Lead Data Enginee...|    vinay_145|           null|              19009|              3419|            0|      en|[LeadDataEngineer...|
|1600982807937155072|2022-12-08 23:37:00|RT @gp_pulipaka: ...| phillip4real|  Cleveland, OH|                830|               696|            0|      en|[Maths, BigData, ...|
|1600982603145924643|2022-12-08 23:36:12|RT @HacBrain247: ...|DevHighlights|           null|              11094|        

### 4.2 Displaying the schema of the dataset
<br>
<code>df.printSchema()</code> prints the schema of the dataset.

In [7]:
df.printSchema()

root
 |-- tweet_id: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- tweet_message: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- user_location: string (nullable = true)
 |-- user_follower_count: integer (nullable = true)
 |-- user_friends_count: integer (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- language: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)



### 4.3 Counting tweets per hour
Write a query that counts **the number of tweets per hour**.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_hourly=(df
            .withColumn("hour", f.hour(f.col("created_at")))
            .groupBy("hour")
            .count()
            .withColumnRenamed("count","total")
            .sort("hour")
          )
df_hourly.show(20)</code>
</p>
</details>


In [8]:
df_hourly=(df  
            .withColumn("hour", f.hour(f.col("created_at")))
            .groupBy("hour")
            .count()
            .withColumnRenamed("count","total")
            .sort("hour")
          )

df_hourly.show(20)



+----+-----+
|hour|total|
+----+-----+
|  23|   14|
+----+-----+
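
The grouping above uses the hour of day (0 to 23), so tweets from different days that fall into the same hour are counted together. The following plain-Python sketch, on hypothetical timestamps, contrasts that with grouping by the full date and hour. In Spark, the second variant would presumably be written with `f.date_trunc("hour", ...)` instead of `f.hour(...)`; treat that as a suggestion rather than as part of the exercise solution.

```python
from collections import Counter
from datetime import datetime

# Hypothetical created_at values spanning two days.
created_at = [
    datetime(2022, 12, 8, 22, 59, 10),
    datetime(2022, 12, 8, 23, 31, 51),
    datetime(2022, 12, 8, 23, 37, 0),
    datetime(2022, 12, 9, 23, 5, 0),
]

# Variant 1: hour of day only, as in the exercise (days are merged).
per_hour_of_day = sorted(Counter(ts.hour for ts in created_at).items())

# Variant 2: truncate each timestamp to the hour (days stay separate).
per_day_and_hour = sorted(
    Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in created_at).items()
)

print(per_hour_of_day)        # [(22, 1), (23, 3)]
print(len(per_day_and_hour))  # 3
```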



                                                                                

### 4.4 Top 10 users by tweet count
Write a query that returns the **top users** by their **number of tweets**. Remember to limit the output to **10** entries.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_top_user=(df
                .groupBy("user_name")
                .agg(
                    f.count("user_name").alias("numberOfTweets")
                    )
                .orderBy(f.col("numberOfTweets").desc())
                .limit(10)
                .withColumnRenamed("user_name","user")
                )
df_top_user.show()</code>
</p>
</details>

In [9]:
df_top_user=(df
                .groupBy("user_name")
                .agg(
                    f.count("user_name").alias("numberOfTweets")
                    )
                .orderBy(f.col("numberOfTweets").desc())
                .limit(10)
                .withColumnRenamed("user_name","user")
                )
df_top_user.show()




+-------------+--------------+
|         user|numberOfTweets|
+-------------+--------------+
|  HacBrain247|             4|
|Stanleyhacks2|             3|
|    vinay_145|             1|
| phillip4real|             1|
|DevHighlights|             1|
| PythonRoboto|             1|
| greentechdon|             1|
|     Richack_|             1|
| Anasalmana55|             1|
+-------------+--------------+



                                                                                

### 4.5 Working with arrays
The following exercises require the <code>explode</code> function. Write a query that splits the hashtag array with <code>explode</code>. Output the columns "user_name", "tweet_id" and the exploded "hashtags" column, limited to 20 rows.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_hash=(df
         .withColumn("hashtags",explode("hashtags"))
        .limit(20)
        .select("user_name", "tweet_id", "hashtags")
        )
df_hash.show()</code>
</p>
</details>

In [10]:
df_hash=(df
         .withColumn("hashtags",explode("hashtags"))
         .limit(20)
         .select("user_name", "tweet_id", "hashtags")
        )
df_hash.show()



+------------+-------------------+-----------------+
|   user_name|           tweet_id|         hashtags|
+------------+-------------------+-----------------+
|   vinay_145|1600981511217459200| LeadDataEngineer|
|   vinay_145|1600981511217459200|     DataEngineer|
|   vinay_145|1600981511217459200|           Sydney|
|   vinay_145|1600981511217459200|              NSW|
|   vinay_145|1600981511217459200|        Australia|
|   vinay_145|1600981511217459200|CareerOpportunity|
|   vinay_145|1600981511217459200|        HiringNow|
|   vinay_145|1600981511217459200|         TechJobs|
|   vinay_145|1600981511217459200|   DataManagement|
|   vinay_145|1600981511217459200|          BigData|
|phillip4real|1600982807937155072|            Maths|
|phillip4real|1600982807937155072|          BigData|
|phillip4real|1600982807937155072|        Analytics|
|phillip4real|1600982807937155072|      DataScience|
|phillip4real|1600982807937155072|               AI|
|phillip4real|1600982807937155072|  MachineLea
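
To make the effect of `explode` more concrete, here is a minimal plain-Python sketch on hypothetical rows (the user names and ids are made up). Each array element becomes its own row, and rows whose array is empty or missing produce no output rows at all, which is also how Spark's `explode` behaves.

```python
# Hypothetical rows shaped like the columns used in the exercise.
rows = [
    {"user_name": "user_a", "tweet_id": "1", "hashtags": ["BigData", "AI"]},
    {"user_name": "user_b", "tweet_id": "2", "hashtags": []},    # empty array
    {"user_name": "user_c", "tweet_id": "3", "hashtags": None},  # missing array
]

# One output row per array element; empty or missing arrays yield nothing.
exploded = [
    {"user_name": row["user_name"], "tweet_id": row["tweet_id"], "hashtags": tag}
    for row in rows
    for tag in (row["hashtags"] or [])
]

for row in exploded:
    print(row)
```

Only `user_a` appears in the result, once per hashtag.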

                                                                                

### 4.6 Top 5 hashtags of the top 10 users
Write a query that returns the **top 5 hashtags** of the **10 users** with the **most tweets**.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_top5_per_user=(df_top_user
            # filter via join
            .join(df,[df_top_user.user==df.user_name],how="left")
            # explode the hashtags array into one row per entry
            .withColumn("hashtags",explode("hashtags"))
            # lowercase the hashtags so that differently cased duplicates are merged
            .withColumn("hashtags", lower(col('hashtags')))
            # group and count by hashtag
            .groupBy("hashtags").agg(f.count("hashtags"))
            # sort in descending order
            .sort(f.col("count(hashtags)").desc())
            # select the top 5
            .limit(5) 
                 )
df_top5_per_user.show()</code>
</p>
</details>

In [11]:
df_top5_per_user=(df_top_user
            # filter via join
            .join(df,[df_top_user.user==df.user_name],how="left")
            # explode the hashtags array into one row per entry
            .withColumn("hashtags",explode("hashtags"))
            # lowercase the hashtags so that differently cased duplicates are merged
            .withColumn("hashtags", lower(col('hashtags')))
            # group and count by hashtag
            .groupBy("hashtags").agg(f.count("hashtags"))      
            # sort in descending order
            .sort(f.col("count(hashtags)").desc())
            # select the top 5
            .limit(5)
                 )
df_top5_per_user.show()



+-------------+---------------+
|     hashtags|count(hashtags)|
+-------------+---------------+
|      bigdata|             14|
|  datascience|             12|
|       python|             11|
|    analytics|              9|
|cybersecurity|              9|
+-------------+---------------+
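
The lowercasing step matters because the same hashtag often appears in several spellings. A small plain-Python sketch with a hypothetical list of hashtags shows how the counts change once the tags are normalized:

```python
from collections import Counter

# Hypothetical exploded hashtags with mixed casing.
hashtags = ["BigData", "bigdata", "Python", "python", "BIGDATA", "AI"]

# Without normalization every spelling is its own group.
raw_counts = Counter(hashtags)

# With lowercasing, spellings of the same tag are merged before counting.
normalized_counts = Counter(tag.lower() for tag in hashtags)

top_2 = normalized_counts.most_common(2)
print(raw_counts["BigData"])  # 1
print(top_2)                  # [('bigdata', 3), ('python', 2)]
```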



                                                                                

### 4.7 Top 10 influencers (users with #BigData tweets and the most followers)
Write a query that lists the **top 10 influencers**, that is, the users with the **most followers**, sorted by follower count.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_top_influencer=(df
                .groupBy("user_name")
                .agg(
                    f.max("user_follower_count").alias("follower")
                    )
                .orderBy(f.col("follower").desc())
                )
df_top_influencer.show(10)</code>
</p>
</details>

In [12]:
df_top_influencer=(df
                .groupBy("user_name")
                .agg(
                    f.max("user_follower_count").alias("follower")
                    )
                .orderBy(f.col("follower").desc())
                   
                )
df_top_influencer.show(10)



+-------------+--------+
|    user_name|follower|
+-------------+--------+
|    vinay_145|   19009|
|DevHighlights|   11094|
| greentechdon|    5182|
| PythonRoboto|    3043|
|     Richack_|    1464|
| phillip4real|     830|
|  HacBrain247|     467|
| Anasalmana55|     305|
|Stanleyhacks2|     258|
+-------------+--------+
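
The solution aggregates with `f.max` because a user's follower count can differ from tweet to tweet. In the outputs of this notebook, for example, HacBrain247 appears with 466 followers in individual rows and with 467 in the table above. A plain-Python sketch of the same "group by user, keep the maximum" logic, using those observed values as sample data:

```python
# (user_name, user_follower_count) pairs; the values are taken from the
# notebook outputs, the pairing into one list is only for illustration.
observations = [
    ("HacBrain247", 466),
    ("HacBrain247", 467),
    ("vinay_145", 19009),
]

# Group by user and keep the largest follower count seen for each one.
max_followers = {}
for user, count in observations:
    max_followers[user] = max(count, max_followers.get(user, count))

# Sort users by follower count, highest first.
ranking = sorted(max_followers.items(), key=lambda item: item[1], reverse=True)
print(ranking)  # [('vinay_145', 19009), ('HacBrain247', 467)]
```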



                                                                                

### 4.8 Top 10 influencers and their number of tweets
Write a query that returns the **top 10 influencers**, their followers and the **number of their tweets**. The result should also be sorted by the number of followers.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_withRetweets=(df_top_user
            # filter via join on the top 10 influencers
            .join(df_top_influencer, [df_top_influencer.user_name==df_top_user.user],how="left")
            .orderBy(f.col("follower").desc())
            .limit(10)
            .drop("user_name")
            .select("user","follower","numberOfTweets")
    )
df_withRetweets.show()</code>
</p>
</details>

In [13]:
df_withRetweets=(df_top_user
            # filter via join on the top 10 influencers
            .join(df_top_influencer, [df_top_influencer.user_name==df_top_user.user],how="left") 
            .orderBy(f.col("follower").desc())
            .limit(10)
            .drop("user_name")   
            .select("user","follower","numberOfTweets")
            
    )

df_withRetweets.show()

                                                                                

+-------------+--------+--------------+
|         user|follower|numberOfTweets|
+-------------+--------+--------------+
|    vinay_145|   19009|             1|
|DevHighlights|   11094|             1|
| greentechdon|    5182|             1|
| PythonRoboto|    3043|             1|
|     Richack_|    1464|             1|
| phillip4real|     830|             1|
|  HacBrain247|     467|             4|
| Anasalmana55|     305|             1|
|Stanleyhacks2|     258|             3|
+-------------+--------+--------------+
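
The left join above starts from `df_top_user`, so the result can only contain users that are already among the top 10 by tweet count. With the nine users in this sample that makes no difference, but on a larger dataset it may. The following plain-Python sketch, with hypothetical users and numbers, mimics the left join and the descending sort, including rows that find no match on the right side (Spark sorts such nulls last when ordering descending).

```python
# Hypothetical left side: (user, numberOfTweets), already limited to the top users.
top_users = [("user_a", 4), ("user_b", 1), ("user_c", 3)]

# Hypothetical right side: follower counts; user_c is deliberately missing.
followers = {"user_a": 467, "user_b": 19009}

# Left join: keep every left row, fill in None where the right side has no match.
joined = [
    {"user": user, "follower": followers.get(user), "numberOfTweets": tweets}
    for user, tweets in top_users
]

# Sort by follower count descending, with missing values last.
joined.sort(key=lambda row: (row["follower"] is None, -(row["follower"] or 0)))

print([row["user"] for row in joined])  # ['user_b', 'user_a', 'user_c']
```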



### Bonus exercise: the top 10 locations and their top hashtag
Write a query that returns the **10 most frequent locations** together with the **second most used hashtag** at each one. We use the second most used hashtag because every tweet in our data contains the hashtag #BigData.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df3=(df
    .select("user_location")
    .where(~f.col("user_location").isin("","null","REMOTE","Earth"))
    .groupBy("user_location")
    .count()
    .withColumnRenamed("count","location_total")
    .orderBy(f.col("location_total").desc())
    .limit(10)
    )</code>
    
<code>df4=(df
    .select("user_location","hashtags")
    .withColumn("singletag",f.explode(f.col("hashtags")))
    .groupBy("user_location","singletag")
    .count()
    .withColumnRenamed("count","tags_total")
        )</code>
    
<code>df5=(df3.alias("a")
    .join(f.broadcast(df4.alias("b")),[df3.user_location==df4.user_location],how="left")
    .select("a.user_location","a.location_total","b.singletag","b.tags_total")      
    .withColumn("rank",f.row_number().over(Window.partitionBy("a.user_location")
    .orderBy(f.col("b.tags_total").desc())))
    .filter(f.col("rank")==2)
    .sort(f.col("location_total").desc())
    .limit(10)
    )
df5.show()</code>
</p>
</details>

In [14]:
df3=(df
    .select("user_location")
    .where(~f.col("user_location").isin("","null","REMOTE","Earth"))
    .groupBy("user_location")
    .count()
    .withColumnRenamed("count","location_total")
    .orderBy(f.col("location_total").desc())
    .limit(10)
    )

df3.show()

+---------------+--------------+
|  user_location|location_total|
+---------------+--------------+
|London, Ontario|             4|
|  United States|             3|
|  Cleveland, OH|             1|
|       Virginia|             1|
|    Chicago, IL|             1|
+---------------+--------------+



                                                                                

In [15]:
df4=(df
    .select("user_location","hashtags")
    .withColumn("singletag",f.explode(f.col("hashtags")))
    .groupBy("user_location","singletag")
    .count()
    .withColumnRenamed("count","tags_total")
    )

df4.show()



+-------------+-----------------+----------+
|user_location|        singletag|tags_total|
+-------------+-----------------+----------+
|         null|CareerOpportunity|         1|
|         null|     DataEngineer|         1|
|         null|           Sydney|         1|
|         null|        HiringNow|         1|
|         null|          BigData|         4|
|         null|   DataManagement|         1|
|         null|         TechJobs|         1|
|         null|        Australia|         1|
|         null| LeadDataEngineer|         1|
|         null|              NSW|         1|
|Cleveland, OH|           Python|         1|
|Cleveland, OH|      DataScience|         1|
|Cleveland, OH|          BigData|         1|
|Cleveland, OH|           RStats|         1|
|Cleveland, OH|             IIoT|         1|
|Cleveland, OH|        Analytics|         1|
|Cleveland, OH|            Maths|         1|
|Cleveland, OH|  MachineLearning|         1|
|Cleveland, OH|              IoT|         1|
|Cleveland

                                                                                

In [16]:
df5=(df3.alias("a")
    .join(f.broadcast(df4.alias("b")),[df3.user_location==df4.user_location],how="left")
    .select("a.user_location","a.location_total","b.singletag","b.tags_total")
    .withColumn("rank",f.row_number().over(Window.partitionBy("a.user_location").orderBy(f.col("b.tags_total").desc())))
    .filter(f.col("rank")==2)
    .sort(f.col("location_total").desc())
    .limit(10)
    )
df5.show()

                                                                                

+---------------+--------------+---------------+----------+----+
|  user_location|location_total|      singletag|tags_total|rank|
+---------------+--------------+---------------+----------+----+
|London, Ontario|             4|        BigData|         4|   2|
|  United States|             3|MachineLearning|         3|   2|
|    Chicago, IL|             1|       WhatsApp|         1|   2|
|  Cleveland, OH|             1|            IoT|         1|   2|
|       Virginia|             1|    DataScience|         1|   2|
+---------------+--------------+---------------+----------+----+
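
Note that #BigData itself shows up at rank 2 for "London, Ontario" in the output above. `row_number` assigns distinct ranks even when counts are tied, and the order within a tie is not defined, so #BigData can end up at rank 2. The plain-Python sketch below reproduces this on hypothetical counts and also shows one possible alternative, offered only as a suggestion: drop #BigData before ranking and take rank 1.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (user_location, singletag, tags_total) rows with a tie at the top.
tag_counts = [
    ("London, Ontario", "Analytics", 4),
    ("London, Ontario", "BigData", 4),
    ("London, Ontario", "AI", 1),
    ("Virginia", "BigData", 1),
    ("Virginia", "DataScience", 1),
]


def ranked(rows):
    """Number the rows per location by descending count, like row_number()."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))  # stable: ties keep input order
    result = []
    for location, group in groupby(ordered, key=itemgetter(0)):
        for rank, (_, tag, total) in enumerate(group, start=1):
            result.append((location, tag, total, rank))
    return result


# Approach from the exercise: take rank 2. A tie can put BigData there.
second = [r for r in ranked(tag_counts) if r[3] == 2]

# Alternative: exclude BigData first, then take rank 1.
without_bigdata = [r for r in tag_counts if r[1] != "BigData"]
top_other = [r for r in ranked(without_bigdata) if r[3] == 1]

print(second)
print(top_other)
```

In this sketch the tie is resolved by input order because Python's sort is stable; Spark gives no such guarantee.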



## 5. Delta History and Time Travel
Run the following code to write a new version of the Delta data. If you want to see several versions, re-run this write cell several times, a few minutes apart.

In [17]:
writer_delta=(df
                #.filter(f.array_contains(f.col("hashtags"),"DataScience")==True)
                .write.partitionBy("language")
                .mode("overwrite")
                .format("delta")
                .option("overwriteSchema", "true")
                .option("userMetadata", "Update Ladung")
                .save("s3a://twitter/delta")
             )

                                                                                

### 5.1 Displaying the Delta table
Load the Delta table and display the first 2 entries.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code># Load Delta file in s3 into Delta Table Object
dt = DeltaTable.forPath(spark, "s3a://twitter/delta")
dt.toDF().show(2)</code>
</p>
</details>

In [18]:
# Load Delta file in s3 into Delta Table Object
dt = DeltaTable.forPath(spark, "s3a://twitter/delta")
dt.toDF().show(2)



+-------------------+-------------------+--------------------+-----------+---------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|  user_name|  user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+-----------+---------------+-------------------+------------------+-------------+--------+--------------------+
|1600981933282168832|2022-12-08 23:33:32|SQL Cheat Sheet ?...|HacBrain247|London, Ontario|                466|                41|            0|      en|[BigData, Analyti...|
|1600982339454001152|2022-12-08 23:35:09|Any hacking servi...|HacBrain247|London, Ontario|                466|                41|            0|      en|[MachineLearning,...|
+-------------------+-------------------+--------------------+-----------+---------------+-------------------+------------------+-

                                                                                

### 5.2 Reading the history from the metadata
1. Run the write to Delta several times and check how new entries are added to the history.

In [20]:
# get the metadata for the full history of the table
fullHistoryDF = dt.history()    

# get the metadata for the last operation
lastOperationDF = dt.history(1) 

fullHistoryDF.select("version","readVersion","timestamp","userId","operation","operationParameters","operationMetrics").show()

+-------+-----------+-------------------+------+---------+--------------------+--------------------+
|version|readVersion|          timestamp|userId|operation| operationParameters|    operationMetrics|
+-------+-----------+-------------------+------+---------+--------------------+--------------------+
|      1|          0|2022-12-08 23:42:16|  null|    WRITE|{mode -> Overwrit...|{numFiles -> 14, ...|
|      0|       null|2022-12-08 23:40:12|  null|    WRITE|{mode -> Overwrit...|{numFiles -> 14, ...|
+-------+-----------+-------------------+------+---------+--------------------+--------------------+



### 5.3 Loading a version
Load one of the versions and display all `language` values (via distinct().show()).
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df = spark.read.format("delta").load("s3a://twitter/delta")
df.select("language").distinct().show()</code>
</p>
</details>

In [26]:
# load latest delta version
df = spark.read.format("delta").load("s3a://twitter/delta")
df.select("language").distinct().show()
df.show()

                                                                                

+--------+
|language|
+--------+
|      en|
|     und|
|     qht|
|      es|
|      it|
|      ar|
|      sv|
|     qme|
|      fr|
|      pl|
|      pt|
|      in|
|      tr|
|      de|
|      no|
|      ro|
|      tl|
|      hu|
|      ca|
|      ht|
+--------+
only showing top 20 rows



                                                                                

+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|      user_name|       user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|1600475534689026049|2022-12-07 14:01:17|AIM spoke to @Ree...|Analyticsindiam|    Bengaluru, India|              15350|               483|            0|      en|[quantumcomputing...|
|1600475603202981889|2022-12-07 14:01:33|Micro Focus and J...|     jenna_loup|         Houston, TX|                 97|               102|            0|      en|[sustainability, ...|
|1600475677743980544|2022-12-07 14:01:51|RT @Khulood_Alman...| Khulood_Almani|Kingdom

### 5.4 Loading an older version
Load an older version and confirm that all the data is still there.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_timetravel_old = spark.read.format("delta").option("versionAsOf", 2).load("s3a://twitter/delta")
df_timetravel_old.select("language").distinct().show()
df_timetravel_old.show()</code>
</p>
</details>

In [27]:
#load specific historic version
df_timetravel_old = spark.read.format("delta").option("versionAsOf", 2).load("s3a://twitter/delta")
df_timetravel_old.select("language").distinct().show()
df_timetravel_old.show()

                                                                                

+--------+
|language|
+--------+
|      en|
|     und|
|     qht|
|      es|
|      it|
|      ar|
|      sv|
|     qme|
|      fr|
|      pl|
|      pt|
|      in|
|      tr|
|      de|
|      no|
|      ro|
|      tl|
|      hu|
|      ca|
|      ht|
+--------+
only showing top 20 rows



                                                                                

+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|      user_name|       user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|1600475534689026049|2022-12-07 14:01:17|AIM spoke to @Ree...|Analyticsindiam|    Bengaluru, India|              15350|               483|            0|      en|[quantumcomputing...|
|1600475603202981889|2022-12-07 14:01:33|Micro Focus and J...|     jenna_loup|         Houston, TX|                 97|               102|            0|      en|[sustainability, ...|
|1600475677743980544|2022-12-07 14:01:51|RT @Khulood_Alman...| Khulood_Almani|Kingdom

### 5.5 Overwriting the newer version
Now overwrite the latest version with the older one.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>df_pasttopresent = (spark
                   .read.format("delta").option("versionAsOf", 0).load("s3a://twitter/delta")
                   .write.partitionBy("language").mode("overwrite").format("delta").save("s3a://twitter/delta")
                   )
df = spark.read.format("delta").load("s3a://twitter/delta")
df.show()</code>
</p>
</details>

In [28]:
# write old version back as latest
df_pasttopresent = (spark
                   .read.format("delta").option("versionAsOf", 0).load("s3a://twitter/delta")
                   .write.partitionBy("language").mode("overwrite").format("delta").save("s3a://twitter/delta")
                   )
df = spark.read.format("delta").load("s3a://twitter/delta")
df.show()

                                                                                

+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|      user_name|       user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|1600475534689026049|2022-12-07 14:01:17|AIM spoke to @Ree...|Analyticsindiam|    Bengaluru, India|              15350|               483|            0|      en|[quantumcomputing...|
|1600475603202981889|2022-12-07 14:01:33|Micro Focus and J...|     jenna_loup|         Houston, TX|                 97|               102|            0|      en|[sustainability, ...|
|1600475677743980544|2022-12-07 14:01:51|RT @Khulood_Alman...| Khulood_Almani|Kingdom

### 5.6 Back to the future
Return to the latest state by using `timestampAsOf` instead of `versionAsOf`, passing a current timestamp instead of the version number.
<br>
<br>
<details>
<summary> &#8964; Solution </summary>
<p>
<code>f_b2future = (spark
                .read.format("delta").option("timestampAsOf", "&lt;latest timestamp&gt;").load("s3a://twitter/delta")
               )
f_b2future.show()</code>
</p>
</details>

In [29]:
f_b2future = (spark
                .read.format("delta").option("timestampAsOf", "2022-12-08 16:37:54").load("s3a://twitter/delta")
               )
f_b2future.show()

                                                                                

+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|           tweet_id|         created_at|       tweet_message|      user_name|       user_location|user_follower_count|user_friends_count|retweet_count|language|            hashtags|
+-------------------+-------------------+--------------------+---------------+--------------------+-------------------+------------------+-------------+--------+--------------------+
|1600589057381306401|2022-12-07 21:32:23|@EstelaMandela @b...| Khulood_Almani|Kingdom of Saudi ...|              43586|              2261|            0|      en|[AI, Python, Data...|
|1600525331085295616|2022-12-07 17:19:09|Free Udemy Certif...|       mikejo_m|                null|                 12|                 4|            0|      en|[Developers, DEVC...|
|1600501676586270721|2022-12-07 15:45:10|Why We Migrated F...|   dataclaudius|       
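
Conceptually, `timestampAsOf` selects the last version that was committed at or before the given timestamp. The plain-Python sketch below illustrates that lookup using the two commit timestamps from the history output in 5.2. The function `version_as_of` is a hypothetical helper, not part of the Delta API, and the sketch does not model how Delta treats timestamps later than the latest commit (Delta may reject those).

```python
from bisect import bisect_right
from datetime import datetime

# (version, commit timestamp) pairs, oldest first, as shown by dt.history() in 5.2.
history = [
    (0, datetime(2022, 12, 8, 23, 40, 12)),
    (1, datetime(2022, 12, 8, 23, 42, 16)),
]


def version_as_of(history, ts):
    """Return the last version committed at or before ts (hypothetical helper)."""
    commit_times = [commit_time for _, commit_time in history]
    index = bisect_right(commit_times, ts)
    if index == 0:
        raise ValueError("timestamp is before the first commit")
    return history[index - 1][0]


print(version_as_of(history, datetime(2022, 12, 8, 23, 41, 0)))   # 0
print(version_as_of(history, datetime(2022, 12, 8, 23, 42, 16)))  # 1
```

When picking a timestamp for the exercise, it may help to copy one directly from the `timestamp` column of the history table.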

## 6. Shutting down the Spark app
**When you have finished working on the exercises, please shut down the Spark app with the following command, `spark.stop()`.**

In [30]:
spark.stop()


22/12/08 16:32:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
