# Structured Streaming

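* Source -> Processing (monitoring, transforming) -> Sink
    * Source: where the data comes from
    * Sink: where the processed stream is written
* Micro-batch: the stream is processed as a series of small batch jobs
    * Output modes: Append (only new rows), Update (only changed rows), Complete (whole result table)
    * Trigger: decides when the next micro-batch starts
        * Default (as soon as the previous micro-batch ends)
        * Time (fixed processing-time interval)
        * Once (a single batch, then the query stops)
        * Continuous (experimental low-latency mode, no micro-batches)
    * Checkpoint directory (checkpointLocation)
        * Directory where the query saves its progress, so it can resume after a restart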
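
The pieces listed above (source, transformation, sink, output mode, trigger, checkpoint directory) can be put together in one small query. This is only a sketch under assumptions: it uses Spark's built-in rate source and console sink instead of real data, and the helper name `start_rate_query` and the checkpoint path are hypothetical.

```python
def start_rate_query(spark, checkpoint_dir):
    """Start a toy streaming query: rate source -> transform -> console sink."""
    # Source: the built-in "rate" source emits (timestamp, value) rows.
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # Transformation: add a derived column, then keep only even values.
    evens = (
        events
        .selectExpr("timestamp", "value", "value * 2 AS doubled")
        .where("value % 2 = 0")
    )

    # Sink: print each micro-batch to the console.
    # Append mode fits here because the query has no aggregation.
    return (
        evens.writeStream
        .format("console")
        .outputMode("append")
        .trigger(processingTime="10 seconds")  # "Time" trigger
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

With an existing `spark` session, `query = start_rate_query(spark, "/tmp/rate_checkpoint")` starts the stream and `query.stop()` ends it. Swapping the trigger for `trigger(once=True)` would run a single batch instead.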

# Partitioning

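* Shuffle - redistribution of data between partitions (and therefore between executors); expensive
* Partitioning - default (chosen by Spark); explicit on disk (partitionBy); in memory (repartition(), coalesce())
* Bucketing - rows are hashed on a column into a fixed number of buckets (suited to high-cardinality columns, where partitionBy would create one directory per value)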
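
To make the partitioning vs. bucketing difference concrete, here is a plain-Python sketch (no Spark involved). Everything in it is an assumption for illustration: the rows are made up, `CustomerId` is a hypothetical high-cardinality column, and CRC32 is only a stand-in for the hash Spark actually uses.

```python
import zlib
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Map a key to a bucket id in [0, num_buckets)."""
    # Stand-in hash for illustration only; Spark uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def partition_by(rows, column):
    """Group rows by distinct value: one group per value, like partitionBy."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def bucket_by(rows, column, num_buckets):
    """Group rows by hashed value: at most num_buckets groups, like bucketBy."""
    groups = defaultdict(list)
    for row in rows:
        groups[bucket_for(row[column], num_buckets)].append(row)
    return dict(groups)


# Hypothetical data: 1000 distinct customer ids spread over three countries.
countries = ["France", "Spain", "Germany"]
rows = [{"CustomerId": i, "Geography": countries[i % 3]} for i in range(1000)]

print(len(partition_by(rows, "Geography")))   # one group per country
print(len(partition_by(rows, "CustomerId")))  # one group per customer id
print(len(bucket_by(rows, "CustomerId", 3)))  # never more than 3 groups
```

Grouping by distinct value grows with the number of distinct values, while hashing into buckets keeps the group count bounded. That is why bucketing is the better fit for high-cardinality columns.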

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingTest").getOrCreate()

churn = spark.read.csv("/home/ubuntu/obenkyo/raw_data/spark_course_udemy/Churn.csv", inferSchema=True, sep=';', header=True)

spark.sql("create database temp_churn").show()

In [2]:
spark.sql("use temp_churn").show()

++
||
++
++



In [8]:
# Explicit partitioning: one sub-directory per distinct Geography value
churn.write.partitionBy("Geography").saveAsTable("churn_table")

In [10]:
# Bucketing: rows are hashed on Geography into 3 buckets
churn.write.bucketBy(3, "Geography").saveAsTable("churn_table2")
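
Once both tables exist, their layouts can be inspected. A minimal sketch, assuming the tables were saved in temp_churn as in the cells above and that "France" is one of the Geography values (an assumption, since the data is not shown here). The helper name `inspect_layouts` is hypothetical.

```python
def inspect_layouts(spark, database="temp_churn", country="France"):
    """Show how the partitioned and the bucketed table are laid out."""
    # Partitioned table: a filter on the partition column lets Spark skip
    # the other directories; look for PartitionFilters in the printed plan.
    partitioned = spark.table(f"{database}.churn_table")
    partitioned.where(f"Geography = '{country}'").explain()

    # Bucketed table: the bucket count and bucket column are stored in the
    # table metadata and appear in the extended description.
    spark.sql(f"DESCRIBE EXTENDED {database}.churn_table2").show(100, truncate=False)
```

Calling `inspect_layouts(spark)` prints the physical plan for the filtered read, followed by the metadata rows of the bucketed table.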


## Caching

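* StorageLevel
    * MEMORY_ONLY - partitions are kept in memory; those that do not fit are recomputed when needed
    * MEMORY_AND_DISK - partitions that do not fit in memory are spilled to disk
    * DISK_ONLY - partitions are stored only on disk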

In [11]:
from pyspark import StorageLevel

In [16]:
spark.sql("use temp_churn")

DataFrame[]

In [None]:
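# Load the partitioned table saved earlier
df = spark.sql("select * from temp_churn.churn_table")

# Current storage level (nothing cached yet)
df.storageLevel

# cache() uses the default storage level for DataFrames
df.cache()

# persist() on an already-cached DataFrame keeps the existing level,
# so release the cache before requesting a different one
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)

# Release the cached data
df.unpersist()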
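
Both cache() and persist() are lazy: nothing is stored until an action runs on the DataFrame. The helper below is a small sketch of switching storage levels safely. The name `persist_as` is hypothetical, and count() is just one possible action to materialize the cache.

```python
def persist_as(df, level):
    """Persist df at the given storage level and materialize the cache."""
    # Drop any existing cache entry first, so the new level takes effect.
    if df.is_cached:
        df.unpersist()

    # Mark the DataFrame for caching at the requested level (lazy).
    df.persist(level)

    # Run an action so the data is actually computed and stored.
    df.count()

    # Report the level now in effect.
    return df.storageLevel
```

For example, `persist_as(df, StorageLevel.DISK_ONLY)` returns the level now in effect, which can be compared with `df.storageLevel` from the earlier cell.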