### CONFIGURATIONS

Let's start by setting up some configuration values.

In [1]:
# Kafka broker details
kafkaBootstrapServers = "localhost:9092" 
kafkaUser = "MY_USER"
kafkaSecret = "MY_SECRET"

# Schema Registry details
schemaRegistryApiKey = "MY_API_KEY"
schemaRegistrySecret = "MY_SECRET"
schemaRegistryUrl = "..."

# Topic to consume
topicName = "gps"

### SCHEMA REGISTRY
- Data coming from the Kafka topic is Avro-serialized, so we have to retrieve the Avro schema in order to deserialize the Avro stream.

In [1]:
# Retrieve the GPS schema from the Schema Registry

from confluent_kafka.schema_registry import SchemaRegistryClient

schema_registry_conf = {
    'url': schemaRegistryUrl,
    'basic.auth.user.info': '{}:{}'.format(schemaRegistryApiKey, schemaRegistrySecret)}

# Instantiate a new Schema registry client with the authentication details
schema_registry_client = SchemaRegistryClient(schema_registry_conf)

# Get the latest version of the schema (the subject is the topic name suffixed with "-value")
gps_schema_response = schema_registry_client.get_latest_version(topicName + "-value").schema
gps_schema = gps_schema_response.schema_str
# gps_schema now holds the JSON description of our Avro schema.

NameError: name 'schemaRegistryUrl' is not defined
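
Since gps_schema is a plain JSON string, it can be inspected with the standard json module before it is handed to the deserializer. The sketch below uses a hypothetical stand-in schema: the field names are the ones selected later in this notebook, but the record name and the field types are assumptions, not the actual registry contents.

```python
import json

# Hypothetical stand-in for the string returned by the Schema Registry.
# Field names match the columns selected later; record name and types are assumed.
example_gps_schema = json.dumps({
    "type": "record",
    "name": "Gps",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "deviceId", "type": "string"},
        {"name": "latitude", "type": "double"},
        {"name": "longitude", "type": "double"},
        {"name": "altitude", "type": "double"},
        {"name": "speed", "type": "double"},
    ],
})


def avro_field_names(schema_str):
    """Return the top-level field names of an Avro record schema given as JSON."""
    schema = json.loads(schema_str)
    if not isinstance(schema, dict) or schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    return [field["name"] for field in schema["fields"]]


print(avro_field_names(example_gps_schema))
```

Running avro_field_names(gps_schema) on the real schema is a quick way to confirm that the columns we plan to select actually exist.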

### CONFIGURE THE SINK


In [2]:
# path_to_bucket = "/mnt/10ac-batch-5/week9/g3"
bucket = "/mnt/10ac-batch-5/week9/g3/speech-to-text-delta"


delta_location = bucket + "/delta-table"
checkpoint_location = bucket + "/checkpoints"
schema_location = bucket + "/kafka_schema.json"



In [None]:
## Adjust this configuration to match the directory you are mounting

if not any(mount.mountPoint == '/mnt/datalake' for mount in dbutils.fs.mounts()):
  try:
    dbutils.fs.mount(
      source = bucket,
      mount_point = "/mnt/datalake",
    )
  except Exception as e:
    print(e)
    print("already mounted. Try to unmount first")

### CREATE THE STREAM

For the configuration, it is important to note that:
- The option startingOffsets controls how we consume the topic. It only applies when a new query starts; a query restarted from its checkpoint resumes where it left off.
  1. latest: we consume only the events that arrive in the topic after the stream starts.
  2. earliest: we consume all the events already present in the topic, followed by the new ones.

Also, each record consumed from Kafka will have the following schema:

- key: Record key (bytes)
- value: Record value (bytes)
- topic: Kafka topic the record was in
- partition: Topic partition the record was in
- offset: Offset value of the record
- timestamp: Timestamp associated with the record
- timestampType: Enum for the timestamp type
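
Besides latest and earliest, the startingOffsets option also accepts a JSON string giving an offset per partition, where -2 stands for earliest and -1 for latest. Here is a minimal sketch of building that string. The helper name and the partition numbers are hypothetical, and the map should cover the partitions the topic actually has.

```python
import json

EARLIEST = -2  # Spark's marker for "earliest available offset" in the JSON form
LATEST = -1    # Spark's marker for "latest offset" in the JSON form


def starting_offsets_json(topic, partition_offsets):
    """Build the JSON string accepted by the startingOffsets option.

    partition_offsets maps a partition number to a starting offset
    (or to EARLIEST / LATEST).
    """
    per_partition = {str(p): int(o) for p, o in partition_offsets.items()}
    return json.dumps({topic: per_partition})


# Hypothetical layout: start partition 0 at offset 23 and partition 1 from the beginning.
offsets = starting_offsets_json("gps", {0: 23, 1: EARLIEST})
print(offsets)
```

The resulting string would be passed as .option("startingOffsets", offsets) in place of "latest".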

In [3]:
gpsDF = ( 
  spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .option("subscribe", topicName)
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()
)

NameError: name 'spark' is not defined
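
Note that kafkaUser and kafkaSecret from the configuration cell are not passed to the reader above. If the broker requires authentication, the credentials have to be supplied as Kafka options. The sketch below assumes SASL_SSL with the PLAIN mechanism, which this notebook does not confirm; the helper name is hypothetical. On Databricks the login module class is usually referenced with a kafkashaded. prefix, so the module name is left as a parameter.

```python
def kafka_sasl_options(user, secret,
                       jaas_module="org.apache.kafka.common.security.plain.PlainLoginModule"):
    """Return Kafka source options for SASL/PLAIN authentication (assumed setup)."""
    jaas_config = '{} required username="{}" password="{}";'.format(
        jaas_module, user, secret
    )
    return {
        "kafka.security.protocol": "SASL_SSL",  # assumption: TLS + SASL
        "kafka.sasl.mechanism": "PLAIN",        # assumption: PLAIN mechanism
        "kafka.sasl.jaas.config": jaas_config,
    }


# Placeholder credentials, same as in the configuration cell.
sasl_options = kafka_sasl_options("MY_USER", "MY_SECRET")
for name in sorted(sasl_options):
    print(name)
```

These options could then be added to the reader with .options(**sasl_options) before .load().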

### DESERIALIZATION
Now we have to deserialize the ```value``` column of the DataFrame created above.
- We use the from_avro function from pyspark.sql.avro.functions to deserialize the record.
- The first 5 bytes of each value are the magic byte (0) followed by the 4-byte schema ID, so we only keep the data after them.

- We also pass this function the GPS Avro schema retrieved from the Schema Registry. The last argument is a configuration that controls the behavior of the stream when deserialization fails: with mode set to PERMISSIVE, records that cannot be deserialized come back as null instead of failing the query.
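
To make the 5-byte header concrete, here is a small pure-Python sketch that splits one message the same way the SQL expression substring(value, 6, length(value)-5) does (SQL positions are 1-based, so position 6 is the first byte after the header). The function name and the sample bytes are hypothetical.

```python
import struct


def split_confluent_frame(message):
    """Split a Confluent-framed message into (schema_id, avro_payload)."""
    if len(message) < 5:
        raise ValueError("message is shorter than the 5-byte header")
    # Big-endian: 1 unsigned byte (magic) followed by a 4-byte unsigned int (schema ID).
    magic, schema_id = struct.unpack(">BI", message[:5])
    if magic != 0:
        raise ValueError("unexpected magic byte: {}".format(magic))
    return schema_id, message[5:]


# Hypothetical message: magic byte 0, schema ID 42, then the Avro-encoded body.
frame = b"\x00" + (42).to_bytes(4, "big") + b"avro-bytes"
schema_id, payload = split_confluent_frame(frame)
print(schema_id, payload)
```

The schema ID in the header identifies the writer schema in the Schema Registry; this notebook ignores it and always decodes with the latest version of the subject.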





In [None]:
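from pyspark.sql import functions as fn
from pyspark.sql.avro.functions import from_avro

# PERMISSIVE: records that cannot be deserialized become null instead of failing the stream
from_avro_options = {"mode": "PERMISSIVE"}

structuredGpsDf = (
  gpsDF
  # Skip the 5-byte header (magic byte + schema ID) before decoding the Avro payload
  .select(from_avro(fn.expr("substring(value, 6, length(value)-5)"), gps_schema, from_avro_options).alias("value"))
  .selectExpr("value.timestamp", "value.deviceId", "value.latitude", "value.longitude", "value.altitude", "value.speed")
)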

display(structuredGpsDf)

### WRITING INTO THE DELTA TABLE

In [None]:
(
  structuredGpsDf.writeStream
  .format("delta")
  .outputMode("append")
  .option("mergeSchema", "true")
  .option("checkpointLocation", checkpoint_location)
  .start(delta_location)
)
