### Utility notebook to install Kafka Server
This notebook:
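1. Installs and starts a Kafka service (ZooKeeper plus a single broker) locally on your `driver` node.
2. Creates a topic called `retail_events`. The notebook never creates it explicitly; the broker auto-creates it on the first write, assuming the default `auto.create.topics.enable=true`.
3. Loads sample retail data into the topic.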

In [0]:
%sh
wget https://archive.apache.org/dist/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz

In [0]:
%sh tar -xvf kafka*.tgz

In [0]:
%sh 
cd kafka_2.12-0.10.2.1
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties

In [0]:
%sh 
nc -vz localhost 2181
nc -vz localhost 9092
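
The same port check can be done from a Python cell, which is handy if `nc` is not installed on the driver. This is a minimal sketch using only the standard library; the helper name `port_is_open` is hypothetical, and the ports are the ZooKeeper and broker ports checked above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 2181 is ZooKeeper and 9092 is the Kafka broker, as in the `nc` cell.
    for name, port in (("ZooKeeper", 2181), ("Kafka broker", 9092)):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"{name} port {port}: {state}")
```

The daemons can take a few seconds to bind their ports, so a "closed" result right after startup may only mean the check ran too early.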

In [0]:
%sh 
cat logs/stderr

In [0]:
#%sh kafka_2.12-0.10.2.1/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordcount < logs/stderr

In [0]:
csv_df = (spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/FileStore/shared_uploads/online_retail/online_retail_II.csv")
)
retail_df = csv_df.selectExpr(
  "Invoice as InvoiceNo", "StockCode", "Description", "Quantity",
  "InvoiceDate", "Price as UnitPrice", "`Customer ID` as CustomerID", "Country"
)
retail_df.show()

In [0]:
from pyspark.sql.functions import to_json, struct, from_json, monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, StringType

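# The Kafka sink expects `key` and `value` columns.
# key: a unique (not necessarily consecutive) ID; value: the whole row serialized as JSON.
kafka_df = (retail_df
  .withColumn("key", monotonically_increasing_id().cast("STRING"))
  .withColumn("value", to_json(struct([retail_df[x] for x in retail_df.columns])).cast("STRING"))
)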

jsonSchema = StructType([ StructField("eventName", StringType(), True),
                          StructField("eventParams", StringType(), True)
                        ])
kafka_df.select("key", "value").show()
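
To make the shape of each record concrete, the plain-Python sketch below approximates what the `key` and `value` columns hold for a single row. The column names come from `retail_df`; the sample values and the helper `to_kafka_record` are invented for illustration, and Spark's exact formatting of dates and numbers may differ.

```python
import json

# Column names match `retail_df`; the values are made up for illustration.
sample_row = {
    "InvoiceNo": "489434",
    "StockCode": "85048",
    "Description": "SAMPLE ITEM",
    "Quantity": 12,
    "InvoiceDate": "2009-12-01 07:45:00",
    "UnitPrice": 6.95,
    "CustomerID": None,
    "Country": "United Kingdom",
}


def to_kafka_record(row_id: int, row: dict) -> dict:
    """Build a {key, value} pair resembling one row of `kafka_df`."""
    # Assumption: like Spark's to_json with default settings, drop null fields.
    non_null = {k: v for k, v in row.items() if v is not None}
    return {
        "key": str(row_id),
        "value": json.dumps(non_null, separators=(",", ":")),
    }


record = to_kafka_record(0, sample_row)
print(record["key"])
print(record["value"])
```

Both fields are plain strings, which is what the Kafka sink needs; a consumer has to parse `value` back into columns itself.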

Write the retail events to the `retail_events` Kafka topic.

In [0]:
(kafka_df.select("key", "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "retail_events")
  .save()
)
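
To confirm that the write succeeded, the topic can be read back as a batch DataFrame. This is a sketch rather than part of the original notebook: the helpers `kafka_read_options` and `read_topic` are hypothetical names, and it assumes the broker from the earlier cells is still running on `localhost:9092`.

```python
def kafka_read_options(topic: str, servers: str = "localhost:9092") -> dict:
    """Options for a batch read of `topic` with Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topic,
        # Read from the beginning of the topic.
        "startingOffsets": "earliest",
    }


def read_topic(spark, topic: str):
    """Batch-read a topic and decode the binary key/value columns to strings."""
    return (spark.read
            .format("kafka")
            .options(**kafka_read_options(topic))
            .load()
            .selectExpr("CAST(key AS STRING) AS key",
                        "CAST(value AS STRING) AS value"))


# In the notebook, where `spark` is predefined:
# read_topic(spark, "retail_events").show(5, truncate=False)
```

The Kafka source returns `key` and `value` as binary, hence the casts. Re-running the write cell appends the same rows again, so the row count read back grows with each run.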