# **Project: Kafka and Spark Streaming (Wikipedia data)**

We will keep using the wikichange data to answer the following:
<li>Write a query that counts messages grouped by the message's bot attribute. Run the query repeatedly and show how the results change.</li>
<li>For the above, start from Colab the producer coded in the BD06-Kafka-Spark notebook. Important: the received data is stored in the consumer's main memory.</li>
<li>Write a query that shows how the bot count changes over time, this time using sliding windows. Use the message's change_timestamp attribute as the window's time column.</li>
<li>Propose 3 more queries that show how the data changes over time.</li>
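
To make the sliding-window item concrete, here is a minimal pure-Python sketch of how a single event timestamp falls into several overlapping windows. It assumes windows aligned to the Unix epoch, inclusive at the start and exclusive at the end, which is how Spark's `window` function is documented to behave. The helper name `sliding_windows`, the sample timestamp, and the 60-second window with a 30-second slide are illustrative choices, not part of the assignment.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(ts, window, slide):
    """Return every [start, end) window that contains ts, oldest first."""
    # Latest window start at or before ts, aligned to multiples of `slide`.
    start = ts - (ts - EPOCH) % slide
    windows = []
    # Walk backwards for as long as the window still covers ts.
    while start + window > ts:
        windows.append((start, start + window))
        start -= slide
    return windows[::-1]


# Hypothetical event time, used only for illustration.
event = datetime(2021, 1, 1, 0, 1, 10, tzinfo=timezone.utc)
for start, end in sliding_windows(event, timedelta(seconds=60), timedelta(seconds=30)):
    print(start.time(), "->", end.time())
```

With a window twice as long as the slide, each event is counted in two windows, which is why sliding-window totals add up to more than the number of events.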

In [1]:
!spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 MyPythonScript.py

:: loading settings :: url = jar:file:/usr/local/lib/python3.7/dist-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-36a2361a-313f-4985-93de-343c5669284d;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.0.1 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.0.1 in central
	found org.apache.kafka#kafka-clients;2.4.1 in central
	found com.github.luben#zstd-jni;1.4.4-3 in central
	found org.lz4#lz4-java;1.7.1 in central
	found org.xerial.snappy#snappy-java;1.1.7.5 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.commons#commons-pool2;2.6.2 in central
:: resolution report :: resolve 959ms :: artifacts dl 43

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
from pyspark import SparkContext  # required by the SparkContext('local') call below
from pyspark.sql import SparkSession
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
import time

In [4]:
sc = SparkContext('local')

In [5]:
spark = SparkSession(sc)

In [6]:
wikiStream = (spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers","ec2-18-118-112-10.us-east-2.compute.amazonaws.com:9092") # kafka server
  .option("subscribe", "wiki") # topic
  .option("startingOffsets", "earliest") # start from beginning 
  .load())

AnalysisException: ignored
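
The traceback is truncated to `ignored`, so its cause cannot be recovered from the output. One common cause of an `AnalysisException` when loading `format("kafka")` is that the Kafka connector is not on the session's classpath: the `--packages` flag in cell 1 applies only to that `spark-submit` run, not to the notebook's own session. The sketch below, under that assumption, builds the connector coordinate and passes it through `spark.jars.packages`. The helper names `kafka_package` and `build_session`, the app name, and the Scala suffix `2.12` are assumptions; the connector version should match the installed PySpark.

```python
def kafka_package(spark_version, scala_version="2.12"):
    """Maven coordinate of the Structured Streaming Kafka connector."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


def build_session(app_name="wiki-stream"):
    """Create a local session that downloads the Kafka connector on startup."""
    # Imported here so kafka_package() can be used without Spark installed.
    import pyspark
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .master("local")
            .appName(app_name)
            # Only takes effect if set before the JVM starts,
            # that is, before any SparkContext exists.
            .config("spark.jars.packages", kafka_package(pyspark.__version__))
            .getOrCreate())


print(kafka_package("3.0.1"))
```

In a fresh runtime, `spark = build_session()` would take the place of cells 4 and 5. If a `SparkContext` is already running, `getOrCreate()` reuses it and the package setting has no effect.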

In [None]:
wikiStream

In [None]:
wikiStream.isStreaming

In [None]:
from pyspark.sql.types import StringType

# Convert binary to string key and value
wikiStream = (wikiStream
    .withColumn("key", wikiStream["key"].cast(StringType()))
    .withColumn("value", wikiStream["value"].cast(StringType())))

In [None]:
wikiStream

In [None]:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, BooleanType, LongType, IntegerType

# Event data schema
schema_wiki = StructType(
    [StructField("$schema",StringType(),True),
     StructField("bot",BooleanType(),True),
     StructField("comment",StringType(),True),
     StructField("id",StringType(),True),
     StructField("length",
                 StructType(
                     [StructField("new",IntegerType(),True),
                      StructField("old",IntegerType(),True)]),True),
     StructField("meta",
                 StructType(
                     [StructField("domain",StringType(),True),
                      StructField("dt",StringType(),True),
                      StructField("id",StringType(),True),
                      StructField("offset",LongType(),True),
                      StructField("partition",LongType(),True),
                      StructField("request_id",StringType(),True),
                      StructField("stream",StringType(),True),
                      StructField("topic",StringType(),True),
                      StructField("uri",StringType(),True)]),True),
     StructField("minor",BooleanType(),True),
     StructField("namespace",IntegerType(),True),
     StructField("parsedcomment",StringType(),True),
     StructField("patrolled",BooleanType(),True),
     StructField("revision",
                 StructType(
                     [StructField("new",IntegerType(),True),
                      StructField("old",IntegerType(),True)]),True),
     StructField("server_name",StringType(),True),
     StructField("server_script_path",StringType(),True),
     StructField("server_url",StringType(),True),
     StructField("timestamp",StringType(),True),
     StructField("title",StringType(),True),
     StructField("type",StringType(),True),
     StructField("user",StringType(),True),
     StructField("wiki",StringType(),True)])

# Create dataframe setting schema for event data
df_wiki = (wikiStream
           # Sets schema for event data
           .withColumn("value", from_json("value", schema_wiki))
          )

In [None]:
df_wiki

In [None]:
df_wiki.isStreaming

In [None]:
from pyspark.sql.functions import col, from_unixtime, to_date, to_timestamp

# Transform into tabular 
# Convert unix timestamp to timestamp
# Create partition column (change_timestamp_date)
df_wiki_formatted = (df_wiki.select(
    col("key").alias("event_key")
    ,col("topic").alias("event_topic")
    ,col("timestamp").alias("event_timestamp")
    ,col("value.$schema").alias("schema")
    ,"value.bot"
    ,"value.comment"
    ,"value.id"
    ,col("value.length.new").alias("length_new")
    ,col("value.length.old").alias("length_old")
    ,"value.minor"
    ,"value.namespace"
    ,"value.parsedcomment"
    ,"value.patrolled"
    ,col("value.revision.new").alias("revision_new")
    ,col("value.revision.old").alias("revision_old")
    ,"value.server_name"
    ,"value.server_script_path"
    ,"value.server_url"
    ,to_timestamp(from_unixtime(col("value.timestamp"))).alias("change_timestamp")
    ,to_date(from_unixtime(col("value.timestamp"))).alias("change_timestamp_date")
    ,"value.title"
    ,"value.type"
    ,"value.user"
    ,"value.wiki"
    ,col("value.meta.domain").alias("meta_domain")
    ,col("value.meta.dt").alias("meta_dt")
    ,col("value.meta.id").alias("meta_id")
    ,col("value.meta.offset").alias("meta_offset")
    ,col("value.meta.partition").alias("meta_partition")
    ,col("value.meta.request_id").alias("meta_request_id")
    ,col("value.meta.stream").alias("meta_stream")
    ,col("value.meta.topic").alias("meta_topic")
    ,col("value.meta.uri").alias("meta_uri")
))

In [None]:
df_wiki_formatted

In [None]:
df_wiki_formatted.isStreaming

In [None]:
query = df_wiki_formatted.writeStream.format("memory").queryName("wikiTable").outputMode("append").start()

In [None]:
type(query)

In [None]:
print(query.name)
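
With the memory sink registered under that name, the "3 more queries" item could be approached with plain SQL over `wikiTable`. The three statements below are suggestions only, built from columns selected earlier (`wiki`, `type`, `length_new`, `length_old`, `change_timestamp`); the names `EXTRA_QUERIES` and `show_all` are hypothetical.

```python
EXTRA_QUERIES = {
    # Which wikis receive the most events so far.
    "events_per_wiki": """
        SELECT wiki, COUNT(*) AS n_events
        FROM wikiTable
        GROUP BY wiki
        ORDER BY n_events DESC
        LIMIT 10
    """,
    # Distribution of event types (column name quoted to avoid keyword clashes).
    "events_per_type": """
        SELECT `type`, COUNT(*) AS n_events
        FROM wikiTable
        GROUP BY `type`
        ORDER BY n_events DESC
    """,
    # Average change in page length per minute of change_timestamp.
    "avg_size_change_per_minute": """
        SELECT date_trunc('minute', change_timestamp) AS minute_start,
               AVG(length_new - length_old) AS avg_delta
        FROM wikiTable
        WHERE length_new IS NOT NULL AND length_old IS NOT NULL
        GROUP BY date_trunc('minute', change_timestamp)
        ORDER BY minute_start
    """,
}


def show_all(spark, queries=EXTRA_QUERIES):
    """Run each query against the in-memory table and print its result."""
    for name, sql in queries.items():
        print(name)
        spark.sql(sql).show(truncate=False)
```

Calling `show_all(spark)` several times while the stream is running shows how each result evolves as new messages arrive.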

In [None]:
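# Poll the in-memory table once per second to watch new rows arrive
for x in range(10):
  DF = spark.sql("select event_topic,bot,user from wikiTable")
  DF.show()  # show() prints the rows itself and returns None
  time.sleep(1)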
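
The loop above lists raw rows; for the first item of the assignment, a grouped count can be polled the same way. A minimal sketch, assuming the memory table is still named `wikiTable`; the function name `poll_bot_counts` and its parameters are illustrative.

```python
import time

BOT_COUNT_SQL = (
    "SELECT bot, COUNT(*) AS n_events "
    "FROM wikiTable GROUP BY bot ORDER BY bot"
)


def poll_bot_counts(spark, rounds=10, pause_seconds=1.0):
    """Re-run the grouped count so successive outputs show the totals changing."""
    for i in range(rounds):
        print(f"poll {i + 1}/{rounds}")
        spark.sql(BOT_COUNT_SQL).show()
        time.sleep(pause_seconds)
```

Running `poll_bot_counts(spark)` while the streaming query is active prints one small table per round, one row per value of `bot`.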

In [None]:
print(spark.streams.active)
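
While the query is still listed as active, the sliding-window item can be sketched with Spark SQL's `window` function over `change_timestamp`. The 1-minute window sliding every 30 seconds is an arbitrary choice, and `window_sql` is a hypothetical helper that only builds the statement.

```python
def window_sql(window="1 minute", slide="30 seconds", table="wikiTable"):
    """Build a bot count per sliding window of change_timestamp."""
    return (
        "SELECT window.start AS window_start, window.end AS window_end, "
        "bot, COUNT(*) AS n_events "
        f"FROM {table} "
        f"GROUP BY window(change_timestamp, '{window}', '{slide}'), bot "
        "ORDER BY window_start, bot"
    )


print(window_sql())
# Usage in the notebook (assumes `spark` and the active memory query):
# spark.sql(window_sql()).show(truncate=False)
```

Each call queries a snapshot of the memory table, so it can be repeated in a loop like the earlier cells. A streaming aggregation defined on `df_wiki_formatted` with `withWatermark` would be the alternative if the windows should be maintained by the stream itself.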

In [None]:
query.stop()