Given time-series clickstream hit data of user activity, enrich the data with a session ID and a visit number.

A session ends after 30 minutes of inactivity and lasts at most 2 hours.
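The session rule can be sketched in plain Python for a single user's hits. This is only an illustration under stated assumptions: the name `assign_visit_numbers` is hypothetical, both limits are treated as strict (a gap of exactly 30 minutes or a span of exactly 2 hours does not yet start a new session), and a session ID could then be formed by combining the user with the visit number.

```python
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=30)   # longer gaps start a new session
MAX_SESSION_LENGTH = timedelta(hours=2)  # a session may not span more than this


def assign_visit_numbers(timestamps):
    """Return a 1-based visit number for each hit of one user, in time order."""
    visit_numbers = []
    visit = 0
    session_start = None
    previous = None
    for ts in sorted(timestamps):
        starts_new_session = (
            previous is None
            or ts - previous > INACTIVITY_GAP
            or ts - session_start > MAX_SESSION_LENGTH
        )
        if starts_new_session:
            visit += 1
            session_start = ts
        visit_numbers.append(visit)
        previous = ts
    return visit_numbers
```

In a Spark job the same rule would be applied per user after ordering by timestamp; the 2-hour cap depends on where the current session started, so it cannot be derived from the gap to the previous hit alone.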

For the batch use case, the source and sink are Hive tables. Read the data from Hive and use Spark batch (Scala) for the computation.

Do not use Spark SQL directly; save the enriched results as Parquet.

For the real-time use case, the source and sink are Kafka (JSON).
Read the real-time stream from Kafka, process it, and add two fields: session ID and visit number. Write the stream back to Kafka.
Use of the latest Spark version is recommended. Code should be well formatted and documented.
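For the real-time case, the per-message enrichment might look like the sketch below. It is plain Python rather than a Kafka or Spark job, and several things are assumptions: the JSON field names `user` and `timestamp`, the ISO-8601 timestamp format, the `session_id` layout, in-order arrival per user, and the `state` dictionary that stands in for the streaming engine's keyed state.

```python
import json
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=30)
MAX_SESSION_LENGTH = timedelta(hours=2)


def enrich(message, state):
    """Add session_id and visit_number to one JSON message.

    `state` maps user -> (visit_number, session_start, last_seen) and is
    updated in place.
    """
    record = json.loads(message)
    user = record["user"]            # assumed field name
    ts = datetime.fromisoformat(record["timestamp"])  # assumed field and format
    visit, session_start, last_seen = state.get(user, (0, None, None))
    if (
        last_seen is None
        or ts - last_seen > INACTIVITY_GAP
        or ts - session_start > MAX_SESSION_LENGTH
    ):
        # First hit for this user, or one of the two session limits was exceeded.
        visit += 1
        session_start = ts
    state[user] = (visit, session_start, ts)
    record["visit_number"] = visit
    record["session_id"] = "{}-{}".format(user, visit)  # hypothetical ID layout
    return json.dumps(record)
```

A real pipeline would also have to handle late or out-of-order messages and expire state for idle users, which this sketch leaves out.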

Some sample records are shown below.
Feel free to create your own data set to test the code and data pipeline.


In [1]:
import subprocess

# `last -FR` lists logins with full timestamps (-F) and without the hostname column (-R).
# check_output returns bytes on Python 3, so decode before splitting into lines.
result = subprocess.check_output(['last', '-FR']).decode("utf-8")

In [2]:
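# Drop the three trailing entries: the blank separator, the "wtmp begins ..." footer,
# and the empty string left after the final newline.
users = result.split("\n")[:-3]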

In [3]:
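import re
import hashlib

def get_features(user):
    """Return [md5 hex digest of the user name, login time as 'MM dd HH mm ss yyyy']."""
    # Replace every non-word character with a space, then split into tokens.
    wordList = re.sub(r"[^\w]", " ", user).split()
    mystring = wordList[0]
    hash_object = hashlib.md5(mystring.encode())
    # Tokens 4..9 are month name, day, hour, minute, second, year.
    date = wordList[4:10]
    # Hard-code the month as 10 (October); this assumes every record is from October.
    date[0] = '10'
    #return [mystring, str(hash_object.hexdigest()), " ".join(wordList[3:10])]
    return [str(hash_object.hexdigest()), " ".join(date)]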
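As an alternative that avoids hard-coding the month, the login time could be parsed directly with `datetime.strptime`. This is a sketch, not the notebook's method: `parse_last_line` is a hypothetical name, and it assumes the `user tty weekday month day time year` layout seen in the sample output and an English locale for the weekday and month names.

```python
import hashlib
from datetime import datetime


def parse_last_line(line):
    """Return (md5 hex digest of the user name, login datetime) for one line."""
    fields = line.split()
    user = fields[0]
    # Fields 2..6 hold the login time, e.g. "Thu Oct 11 12:55:11 2018".
    login_time = datetime.strptime(" ".join(fields[2:7]), "%a %b %d %H:%M:%S %Y")
    return hashlib.md5(user.encode("utf-8")).hexdigest(), login_time


sample = "annapurn pts/21       Thu Oct 11 12:55:11 2018   still logged in"
digest, login_time = parse_last_line(sample)
print(login_time)  # 2018-10-11 12:55:11
```

Lines with a different layout, such as reboot entries, would need separate handling.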

In [4]:
users

['annapurn pts/21       Thu Oct 11 12:55:11 2018   still logged in                      ',
 'roshanku pts/15       Thu Oct 11 12:52:22 2018   still logged in                      ',
 'katukuri pts/11       Thu Oct 11 12:44:33 2018   still logged in                      ',
 'chandrar pts/8        Thu Oct 11 12:37:56 2018   still logged in                      ',
 'deepikag pts/22       Thu Oct 11 12:37:28 2018   still logged in                      ',
 'learning pts/9        Thu Oct 11 12:37:24 2018   still logged in                      ',
 'chandrar pts/10       Thu Oct 11 12:37:21 2018   still logged in                      ',
 'katukuri pts/19       Thu Oct 11 12:36:54 2018   still logged in                      ',
 'katukuri pts/15       Thu Oct 11 12:36:44 2018 - Thu Oct 11 12:42:50 2018  (00:06)    ',
 'choudhar pts/6        Thu Oct 11 12:28:12 2018   still logged in                      ',
 'mailprad pts/6        Thu Oct 11 12:06:59 2018 - Thu Oct 11 12:18:58 2018  (00:11)    ',

In [5]:
hashed_logins = list(map(get_features,users))

In [6]:
hashed_logins[:3]

[['99c674fd9305f1135a09fc1e15d2d2e4', '10 11 12 55 11 2018'],
 ['ef8a53e6c0990b0373f88e9be5c4b303', '10 11 12 52 22 2018'],
 ['bf8dfd7f7703c257e4652567b4a8db21', '10 11 12 44 33 2018']]

In [7]:
len(hashed_logins)

2653

In [8]:
import os
path = os.getenv("HOME") +"/data/mmt_data/"
spark_home = "/usr/hdp/current/spark2-client"
mode = "yarn"

In [9]:
print(path)
print(spark_home)

/home/kranthidr/data/mmt_data/
/usr/hdp/current/spark2-client


In [10]:
import findspark
findspark.init(spark_home)
findspark.find()

'/usr/hdp/current/spark2-client'

In [11]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master(mode).appName("userSessionsDataPrep").enableHiveSupport().getOrCreate()

In [12]:
spark

In [13]:
sc = spark.sparkContext

In [14]:
import pandas as pd
import numpy as np

The minimum supported version is 1.0.0



In [15]:
# Convert the list of [user, login] pairs to an array once, then slice its columns.
logins = np.array(hashed_logins)
pdf = pd.DataFrame(data={"user": logins[:, 0], "login": logins[:, 1]})

In [16]:
df = spark.createDataFrame(pdf)

In [17]:
df.show()

+-------------------+--------------------+
|              login|                user|
+-------------------+--------------------+
|10 11 12 55 11 2018|99c674fd9305f1135...|
|10 11 12 52 22 2018|ef8a53e6c0990b037...|
|10 11 12 44 33 2018|bf8dfd7f7703c257e...|
|10 11 12 37 56 2018|712411069b55d05e0...|
|10 11 12 37 28 2018|4dcc5847371dd581a...|
|10 11 12 37 24 2018|25a9ac406aceb47a0...|
|10 11 12 37 21 2018|712411069b55d05e0...|
|10 11 12 36 54 2018|bf8dfd7f7703c257e...|
|10 11 12 36 44 2018|bf8dfd7f7703c257e...|
|10 11 12 28 12 2018|a7ac434f70bdb667c...|
|10 11 12 06 59 2018|58e8127d35f92a64b...|
|10 11 11 50 19 2018|fff60b5b6c9385873...|
|10 11 11 47 53 2018|a01a345ab591afe2b...|
|10 11 11 47 30 2018|fff60b5b6c9385873...|
|10 11 11 42 57 2018|c185ddac8b5a8f5aa...|
|10 11 11 35 46 2018|4dcc5847371dd581a...|
|10 11 11 35 13 2018|8a2504276794a7d23...|
|10 11 11 34 34 2018|c185ddac8b5a8f5aa...|
|10 11 11 32 21 2018|8a2504276794a7d23...|
|10 11 11 29 03 2018|c185ddac8b5a8f5aa...|
+----------

In [18]:
df.count()

2653

In [19]:
df.groupBy("user").count().count()

207

In [20]:
from pyspark.sql.functions import to_timestamp

In [21]:
df = df.withColumn("login_time", to_timestamp("login",'MM dd HH mm ss yyyy'))
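For reference, the pattern `'MM dd HH mm ss yyyy'` reads month, day, hour, minute, second, and year in that order. The same string can be checked locally with Python's `datetime.strptime`, whose directives are `%m %d %H %M %S %Y`; this is only a quick sanity check, not part of the Spark pipeline.

```python
from datetime import datetime

login = "10 11 12 55 11 2018"
# %m=month, %d=day, %H=hour (24h), %M=minute, %S=second, %Y=four-digit year
parsed = datetime.strptime(login, "%m %d %H %M %S %Y")
print(parsed)  # 2018-10-11 12:55:11
```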

In [22]:
df.show()

+-------------------+--------------------+-------------------+
|              login|                user|         login_time|
+-------------------+--------------------+-------------------+
|10 11 12 55 11 2018|99c674fd9305f1135...|2018-10-11 12:55:11|
|10 11 12 52 22 2018|ef8a53e6c0990b037...|2018-10-11 12:52:22|
|10 11 12 44 33 2018|bf8dfd7f7703c257e...|2018-10-11 12:44:33|
|10 11 12 37 56 2018|712411069b55d05e0...|2018-10-11 12:37:56|
|10 11 12 37 28 2018|4dcc5847371dd581a...|2018-10-11 12:37:28|
|10 11 12 37 24 2018|25a9ac406aceb47a0...|2018-10-11 12:37:24|
|10 11 12 37 21 2018|712411069b55d05e0...|2018-10-11 12:37:21|
|10 11 12 36 54 2018|bf8dfd7f7703c257e...|2018-10-11 12:36:54|
|10 11 12 36 44 2018|bf8dfd7f7703c257e...|2018-10-11 12:36:44|
|10 11 12 28 12 2018|a7ac434f70bdb667c...|2018-10-11 12:28:12|
|10 11 12 06 59 2018|58e8127d35f92a64b...|2018-10-11 12:06:59|
|10 11 11 50 19 2018|fff60b5b6c9385873...|2018-10-11 11:50:19|
|10 11 11 47 53 2018|a01a345ab591afe2b...|2018-10-11 11

In [23]:
spark.sql("""
CREATE DATABASE IF NOT EXISTS Kranthidr_db
LOCATION '/user/kranthidr/Kranthidr_db'
""").show()

++
||
++
++



In [24]:
spark.sql("""
USE Kranthidr_db
""").show()

++
||
++
++



In [25]:
spark.sql("""
SHOW tables
""").show()

+------------+-------------------+-----------+
|    database|          tableName|isTemporary|
+------------+-------------------+-----------+
|kranthidr_db|            flights|      false|
|kranthidr_db|flights_from_select|      false|
|kranthidr_db|       hive_flights|      false|
|kranthidr_db|     hive_flights_2|      false|
|kranthidr_db|        nested_data|      false|
|kranthidr_db|partitioned_flights|      false|
|kranthidr_db|      user_sessions|      false|
+------------+-------------------+-----------+



In [26]:
new_df = spark.sql("""
select * from user_sessions
""")

In [27]:
to_store = df.select("user","login_time")

In [28]:
to_store = to_store.union(new_df)

In [29]:
#to_store.write.mode("append").saveAsTable("Kranthidr_db.user_sessions")

In [30]:
spark.sql("""
show tables
""").show()

+------------+-------------------+-----------+
|    database|          tableName|isTemporary|
+------------+-------------------+-----------+
|kranthidr_db|            flights|      false|
|kranthidr_db|flights_from_select|      false|
|kranthidr_db|       hive_flights|      false|
|kranthidr_db|     hive_flights_2|      false|
|kranthidr_db|        nested_data|      false|
|kranthidr_db|partitioned_flights|      false|
|kranthidr_db|      user_sessions|      false|
+------------+-------------------+-----------+



In [31]:
to_store.orderBy("login_time").show()

+--------------------+-------------------+
|                user|         login_time|
+--------------------+-------------------+
|3c889f0520978d740...|2018-10-01 03:48:47|
|3c889f0520978d740...|2018-10-01 03:48:47|
|d0b0efd87408e9c1a...|2018-10-01 03:53:28|
|d0b0efd87408e9c1a...|2018-10-01 03:53:28|
|d0b0efd87408e9c1a...|2018-10-01 03:53:35|
|d0b0efd87408e9c1a...|2018-10-01 03:53:35|
|d0b0efd87408e9c1a...|2018-10-01 03:53:36|
|d0b0efd87408e9c1a...|2018-10-01 03:53:36|
|1a2f8a2300b200cce...|2018-10-01 03:55:41|
|1a2f8a2300b200cce...|2018-10-01 03:55:41|
|3c889f0520978d740...|2018-10-01 04:03:55|
|3c889f0520978d740...|2018-10-01 04:03:55|
|3c889f0520978d740...|2018-10-01 04:04:18|
|3c889f0520978d740...|2018-10-01 04:04:18|
|28b10381690c5876b...|2018-10-01 04:04:28|
|28b10381690c5876b...|2018-10-01 04:04:28|
|ef8a53e6c0990b037...|2018-10-01 04:31:49|
|ef8a53e6c0990b037...|2018-10-01 04:31:49|
|d0b0efd87408e9c1a...|2018-10-01 04:33:13|
|d0b0efd87408e9c1a...|2018-10-01 04:33:13|
+----------

In [32]:
from pyspark.sql.functions import first, count, col, expr

In [33]:
to_store.count()

5291

In [34]:
# Keep one row per login_time. Note that this also collapses different users
# who logged in during the same second into a single row.
new_to_store = to_store.groupBy("login_time")\
.agg(first(col("user")).alias("user"))\
.select("user","login_time")

In [35]:
to_store.groupBy("login_time")\
.agg(first(col("user")).alias("user"), count(col("login_time")).alias("count"))\
.select("user","login_time").count()

2641

In [36]:
to_store.groupBy("login_time")\
.agg(first(col("user")).alias("user"), count(col("login_time")).alias("count"))\
.groupBy("count").count().show()

+-----+-----+
|count|count|
+-----+-----+
|    1|    3|
|    3|   12|
|    2| 2626|
+-----+-----+
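Grouping on `login_time` alone keeps one user per second, so two different users who log in during the same second collapse into one row. The plain-Python sketch below, on made-up rows, contrasts that with removing only exact `(user, login_time)` duplicates; the user names are hypothetical.

```python
rows = [
    ("user_a", "2018-10-01 03:48:47"),
    ("user_a", "2018-10-01 03:48:47"),  # exact duplicate
    ("user_b", "2018-10-01 03:48:47"),  # different user, same second
    ("user_b", "2018-10-01 03:53:28"),
]

# Keying on the timestamp alone keeps the first user seen for each second.
by_time = {}
for user, login_time in rows:
    by_time.setdefault(login_time, user)

# Keying on the (user, timestamp) pair removes only exact duplicates.
by_pair = sorted(set(rows))

print(len(by_time))  # 2
print(len(by_pair))  # 3
```

In PySpark, the pair-based variant would correspond to `dropDuplicates(["user", "login_time"])`; whether that is preferable depends on whether same-second logins by different users occur in the data.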



In [37]:
spark.sql("""
select * from user_sessions
""").count()

2638

In [38]:
new_to_store.count()

2641

In [39]:
# spark.sql("""
# TRUNCATE table user_sessions
# """).show()

In [40]:
# Write the deduplicated rows to a staging table; it replaces user_sessions in the next cells.
new_to_store.write.mode("append").saveAsTable("Kranthidr_db.user_sessions1")

In [41]:
spark.sql("""
DROP TABLE Kranthidr_db.user_sessions
""").show()

++
||
++
++



In [42]:
spark.sql("""
ALTER TABLE Kranthidr_db.user_sessions1 RENAME TO Kranthidr_db.user_sessions
""").show()

++
||
++
++



In [43]:
spark.sql("""
select * from user_sessions
""").count()

2641

In [53]:
spark.sql("""
select * from user_sessions
""").orderBy("login_time").show(truncate=False)

+--------------------------------+-------------------+
|user                            |login_time         |
+--------------------------------+-------------------+
|3c889f0520978d740bef43fb35a4af84|2018-10-01 03:48:47|
|d0b0efd87408e9c1ae1d63e7f4c49f77|2018-10-01 03:53:28|
|d0b0efd87408e9c1ae1d63e7f4c49f77|2018-10-01 03:53:35|
|d0b0efd87408e9c1ae1d63e7f4c49f77|2018-10-01 03:53:36|
|1a2f8a2300b200cce8e554ed46398e5a|2018-10-01 03:55:41|
|3c889f0520978d740bef43fb35a4af84|2018-10-01 04:03:55|
|3c889f0520978d740bef43fb35a4af84|2018-10-01 04:04:18|
|28b10381690c5876be785ea4bc2ab240|2018-10-01 04:04:28|
|ef8a53e6c0990b0373f88e9be5c4b303|2018-10-01 04:31:49|
|d0b0efd87408e9c1ae1d63e7f4c49f77|2018-10-01 04:33:13|
|d0b0efd87408e9c1ae1d63e7f4c49f77|2018-10-01 04:33:16|
|d0b0efd87408e9c1ae1d63e7f4c49f77|2018-10-01 04:33:18|
|c9fa205bc34be4bc90571ea8c52670e1|2018-10-01 04:37:21|
|99c674fd9305f1135a09fc1e15d2d2e4|2018-10-01 05:01:53|
|7998a932738a6b28738e61d13d6eee0e|2018-10-01 05:15:16|
|a9562eefd