Given time-series clickstream hit data of user activity, enrich each record with a session id and a visit number.

A session ends after 30 minutes of inactivity and lasts at most 2 hours.
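A minimal plain-Python sketch of this rule for one user's events, not a definitive implementation. It assumes a new visit starts when the gap since the previous event exceeds 30 minutes, or when the event falls more than 2 hours after the first event of the current visit. The boundary handling (strict `>`) and the names `assign_sessions` and `session_id` are illustrative assumptions, as is the reading of "visit number" as the ordinal of the session per user.

```python
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=30)  # from the task statement
MAX_SESSION = timedelta(hours=2)        # from the task statement


def assign_sessions(timestamps):
    """Return (timestamp, visit_number) pairs for one user's events.

    Assumption: a new visit starts when the gap since the previous event
    exceeds INACTIVITY_GAP, or when the event is more than MAX_SESSION
    after the first event of the current visit.
    """
    result = []
    visit_number = 0
    session_start = None
    previous = None
    for ts in sorted(timestamps):
        new_session = (
            previous is None
            or ts - previous > INACTIVITY_GAP
            or ts - session_start > MAX_SESSION
        )
        if new_session:
            visit_number += 1
            session_start = ts
        result.append((ts, visit_number))
        previous = ts
    return result


def session_id(user, visit_number):
    # Hypothetical id scheme: user key plus visit number.
    return "{}-{}".format(user, visit_number)


base = datetime(2018, 10, 22, 5, 0, 0)
# Events every 20 minutes up to minute 140, then one after a 60-minute gap.
events = [base + timedelta(minutes=m) for m in (0, 20, 40, 60, 80, 100, 120, 140, 200)]
visits = [visit for _, visit in assign_sessions(events)]
# visits == [1, 1, 1, 1, 1, 1, 1, 2, 3]
```

The event at minute 140 opens visit 2 because of the 2-hour cap, and the event at minute 200 opens visit 3 because of the inactivity gap.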

For the batch use case, the source and sink are Hive tables. Read the data from Hive and use Spark batch (Scala) for the computation.

Please don't use Spark SQL directly, and save the enriched results as Parquet.

For the real-time use case, the source and sink are Kafka (JSON).
Read the real-time stream from Kafka, process it, and add two fields: session id and visit number. Write the stream back to Kafka.
Using the latest Spark version is recommended. Code should be well formatted and documented.
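A rough plain-Python sketch of the per-message enrichment for the real-time case, leaving out Kafka and Spark entirely. The JSON field names `user` and `login_time`, the timestamp layout, the `enrich` function and the in-memory `state` dict are all assumptions for illustration. A real stream would keep this as managed keyed state and would also have to handle late or out-of-order events.

```python
import json
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=30)  # from the task statement
MAX_SESSION = timedelta(hours=2)        # from the task statement
TS_FORMAT = "%Y-%m-%d %H:%M:%S"         # assumed timestamp layout


def enrich(message, state):
    """Add session_id and visit_number to one JSON message.

    `state` maps user -> (session_start, last_seen, visit_number) and is
    updated in place. Events are assumed to arrive in order per user.
    """
    event = json.loads(message)
    user = event["user"]
    ts = datetime.strptime(event["login_time"], TS_FORMAT)
    start, last_seen, visit = state.get(user, (None, None, 0))
    starts_new_visit = (
        last_seen is None
        or ts - last_seen > INACTIVITY_GAP
        or ts - start > MAX_SESSION
    )
    if starts_new_visit:
        start, visit = ts, visit + 1
    state[user] = (start, ts, visit)
    event["visit_number"] = visit
    event["session_id"] = "{}-{}".format(user, visit)  # hypothetical id scheme
    return json.dumps(event, sort_keys=True)


state = {}
first = json.loads(enrich('{"user": "u1", "login_time": "2018-10-22 05:00:00"}', state))
second = json.loads(enrich('{"user": "u1", "login_time": "2018-10-22 05:20:00"}', state))
third = json.loads(enrich('{"user": "u1", "login_time": "2018-10-22 06:00:00"}', state))
```

Here the third message arrives 40 minutes after the second, so it opens a new visit for `u1`.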

Please see some sample records below.
Feel free to create your own data set to test the code and data pipeline.


In [1]:
import subprocess
# Capture the login history; decode so the result is str (not bytes) on Python 3,
# otherwise result.split("\n") below fails.
result = subprocess.check_output(['last', '-FR']).decode("utf-8")

In [2]:
users = result.split("\n")[:-3]

In [3]:
import re
import hashlib
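def get_features(user):
    # Replace every non-word character with a space, then split into tokens.
    wordList = re.sub(r"[^\w]", " ", user).split()
    # Hash the login name so the raw user name is not stored.
    mystring = wordList[0]
    hash_object = hashlib.md5(mystring.encode())
    # For 'pts/N' lines, tokens 4..9 are month, day, hour, minute, second, year.
    date = wordList[4:10]
    # The month name is replaced with a hard-coded '10' (October).
    date[0] = '10'
    return [hash_object.hexdigest(), " ".join(date)]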
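
A possible alternative to `get_features` that avoids the hard-coded month by parsing the start time directly. This is a sketch only: `parse_last_line` is a hypothetical helper, and it assumes the layout in the sample output (user, terminal, then a five-token start time such as `Mon Oct 22 06:04:32 2018`). The `%a` and `%b` codes depend on the locale, so English day and month names are assumed.

```python
import hashlib
from datetime import datetime


def parse_last_line(line):
    """Return (hashed_user, login_datetime) for one `last -FR` line.

    Assumes whitespace-separated fields: user, terminal, then the start
    time as weekday, month, day, HH:MM:SS, year.
    """
    fields = line.split()
    hashed_user = hashlib.md5(fields[0].encode("utf-8")).hexdigest()
    login = datetime.strptime(" ".join(fields[2:7]), "%a %b %d %H:%M:%S %Y")
    return hashed_user, login


closed = 'siddhant pts/13       Mon Oct 22 06:04:32 2018 - Mon Oct 22 06:05:25 2018  (00:00)    '
still_open = 'macksv17 pts/31       Mon Oct 22 06:01:03 2018   still logged in                      '
closed_user, closed_login = parse_last_line(closed)
open_user, open_login = parse_last_line(still_open)
```

Lines with a different layout (for example a terminal field made of two words) would need separate handling before this parser applies.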

In [4]:
users

['siddhant pts/13       Mon Oct 22 06:04:32 2018 - Mon Oct 22 06:05:25 2018  (00:00)    ',
 'siddhant pts/32       Mon Oct 22 06:02:21 2018 - Mon Oct 22 06:02:27 2018  (00:00)    ',
 'macksv17 pts/31       Mon Oct 22 06:01:03 2018   still logged in                      ',
 'bdeepika pts/30       Mon Oct 22 06:00:20 2018   still logged in                      ',
 'loganath pts/22       Mon Oct 22 05:50:00 2018   still logged in                      ',
 'dksriniv pts/22       Mon Oct 22 05:47:54 2018 - Mon Oct 22 05:48:44 2018  (00:00)    ',
 'sachinji pts/21       Mon Oct 22 05:46:44 2018   still logged in                      ',
 'macksv17 pts/18       Mon Oct 22 05:45:23 2018   still logged in                      ',
 'roshanku pts/15       Mon Oct 22 05:41:07 2018   still logged in                      ',
 'roshanku pts/12       Mon Oct 22 05:33:08 2018   still logged in                      ',
 'botlagun pts/4        Mon Oct 22 05:30:55 2018   still logged in                      ',

In [5]:
hashed_logins = list(map(get_features,users))

In [6]:
hashed_logins[:3]

[['e5bf515039cdf685df68445a1dac27af', '10 22 06 04 32 2018'],
 ['e5bf515039cdf685df68445a1dac27af', '10 22 06 02 21 2018'],
 ['50861eb650930b5974df9f9c7019acc0', '10 22 06 01 03 2018']]

In [7]:
len(hashed_logins)

4853

In [8]:
import os
path = os.getenv("HOME") +"/data/mmt_data/"
spark_home = "/usr/hdp/current/spark2-client"
mode = "yarn"

In [9]:
print(path)
print(spark_home)

/home/kranthidr/data/mmt_data/
/usr/hdp/current/spark2-client


In [10]:
import findspark
findspark.init(spark_home)
findspark.find()

'/usr/hdp/current/spark2-client'

In [11]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master(mode).appName("userSessionsDataPrep").enableHiveSupport().getOrCreate()

In [12]:
spark

In [13]:
sc = spark.sparkContext

In [14]:
import pandas as pd
import numpy as np


In [15]:
pdf = pd.DataFrame(data={"user": np.array(hashed_logins)[:,0], "login":np.array(hashed_logins)[:,1]})

In [16]:
df = spark.createDataFrame(pdf)

In [17]:
df.show()

+-------------------+--------------------+
|              login|                user|
+-------------------+--------------------+
|10 22 06 04 32 2018|e5bf515039cdf685d...|
|10 22 06 02 21 2018|e5bf515039cdf685d...|
|10 22 06 01 03 2018|50861eb650930b597...|
|10 22 06 00 20 2018|6e2c9ea8686ee0ae4...|
|10 22 05 50 00 2018|1a2f8a2300b200cce...|
|10 22 05 47 54 2018|52cc29303ded84a63...|
|10 22 05 46 44 2018|3fc5f9aee6aebdfbb...|
|10 22 05 45 23 2018|50861eb650930b597...|
|10 22 05 41 07 2018|ef8a53e6c0990b037...|
|10 22 05 33 08 2018|ef8a53e6c0990b037...|
|10 22 05 30 55 2018|38418c737cf158499...|
|10 22 05 29 09 2018|3fc5f9aee6aebdfbb...|
|10 22 05 28 10 2018|38418c737cf158499...|
|10 22 05 22 01 2018|03f935be1987fb24c...|
|10 22 05 20 18 2018|3fc5f9aee6aebdfbb...|
|10 22 05 11 46 2018|3fc5f9aee6aebdfbb...|
|10 22 04 59 50 2018|4cff6f777e1796ffe...|
|10 22 04 52 33 2018|99c674fd9305f1135...|
|10 22 04 52 24 2018|4cff6f777e1796ffe...|
|10 22 04 50 40 2018|99c674fd9305f1135...|
+----------

In [18]:
df.count()

4853

In [19]:
df.groupBy("user").count().count()

288

In [20]:
from pyspark.sql.functions import to_timestamp

In [21]:
df = df.withColumn("login_time", to_timestamp("login",'MM dd HH mm ss yyyy'))

In [22]:
df.show()

+-------------------+--------------------+-------------------+
|              login|                user|         login_time|
+-------------------+--------------------+-------------------+
|10 22 06 04 32 2018|e5bf515039cdf685d...|2018-10-22 06:04:32|
|10 22 06 02 21 2018|e5bf515039cdf685d...|2018-10-22 06:02:21|
|10 22 06 01 03 2018|50861eb650930b597...|2018-10-22 06:01:03|
|10 22 06 00 20 2018|6e2c9ea8686ee0ae4...|2018-10-22 06:00:20|
|10 22 05 50 00 2018|1a2f8a2300b200cce...|2018-10-22 05:50:00|
|10 22 05 47 54 2018|52cc29303ded84a63...|2018-10-22 05:47:54|
|10 22 05 46 44 2018|3fc5f9aee6aebdfbb...|2018-10-22 05:46:44|
|10 22 05 45 23 2018|50861eb650930b597...|2018-10-22 05:45:23|
|10 22 05 41 07 2018|ef8a53e6c0990b037...|2018-10-22 05:41:07|
|10 22 05 33 08 2018|ef8a53e6c0990b037...|2018-10-22 05:33:08|
|10 22 05 30 55 2018|38418c737cf158499...|2018-10-22 05:30:55|
|10 22 05 29 09 2018|3fc5f9aee6aebdfbb...|2018-10-22 05:29:09|
|10 22 05 28 10 2018|38418c737cf158499...|2018-10-22 05

In [23]:
spark.sql("""
CREATE DATABASE IF NOT EXISTS Kranthidr_db
LOCATION '/user/kranthidr/Kranthidr_db'
""").show()

++
||
++
++



In [24]:
spark.sql("""
USE Kranthidr_db
""").show()

++
||
++
++



In [25]:
spark.sql("""
SHOW tables
""").show()

+------------+-------------------+-----------+
|    database|          tableName|isTemporary|
+------------+-------------------+-----------+
|kranthidr_db|            flights|      false|
|kranthidr_db|flights_from_select|      false|
|kranthidr_db|       hive_flights|      false|
|kranthidr_db|     hive_flights_2|      false|
|kranthidr_db|        nested_data|      false|
|kranthidr_db|partitioned_flights|      false|
|kranthidr_db|      user_sessions|      false|
+------------+-------------------+-----------+



In [26]:
hive_df = spark.sql("""
select * from user_sessions
""").cache()

In [27]:
to_store = df.select("user","login_time")

In [28]:
hive_df.count()

4811

In [29]:
to_store.count()

4853

In [30]:
added_df = to_store.withColumnRenamed("user", "user_now").join(
    hive_df.withColumnRenamed("user", "user_old"), on="login_time", how="left_outer")

In [31]:
added_df.count()

4867
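
The joined count (4867) is higher than the left side (4853). A left outer join emits one row per match, so this means some `login_time` values match more than one row in the Hive table. The plain-Python sketch below illustrates the effect on toy data and shows how a composite key of `user` and `login_time` avoids it. `left_join` and the sample rows are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict


def left_join(left, right, key):
    """Left outer join of two lists of dicts on `key` (a tuple of field names)."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in key)].append(row)
    joined = []
    for row in left:
        matches = index.get(tuple(row[k] for k in key))
        if matches:
            # One output row per matching right-hand row.
            for match in matches:
                joined.append((row, match))
        else:
            joined.append((row, None))
    return joined


new_rows = [
    {"user": "a", "login_time": "05:00:00"},
    {"user": "b", "login_time": "05:00:00"},  # same second as user "a"
    {"user": "c", "login_time": "06:00:00"},
]
old_rows = [
    {"user": "a", "login_time": "05:00:00"},
    {"user": "b", "login_time": "05:00:00"},
]

by_time = left_join(new_rows, old_rows, ("login_time",))
by_user_and_time = left_join(new_rows, old_rows, ("user", "login_time"))
```

With the time-only key the three input rows become five; with the composite key the count stays at three. In PySpark the equivalent would be passing a list of column names to `on`, which would require keeping the `user` column name the same on both sides.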

In [32]:
from pyspark.sql.functions import desc, asc

In [33]:
added_df.orderBy(desc("login_time")).show()

+-------------------+--------------------+--------------------+
|         login_time|            user_now|            user_old|
+-------------------+--------------------+--------------------+
|2018-10-22 06:04:32|e5bf515039cdf685d...|                null|
|2018-10-22 06:02:21|e5bf515039cdf685d...|                null|
|2018-10-22 06:01:03|50861eb650930b597...|                null|
|2018-10-22 06:00:20|6e2c9ea8686ee0ae4...|6e2c9ea8686ee0ae4...|
|2018-10-22 05:50:00|1a2f8a2300b200cce...|1a2f8a2300b200cce...|
|2018-10-22 05:47:54|52cc29303ded84a63...|52cc29303ded84a63...|
|2018-10-22 05:46:44|3fc5f9aee6aebdfbb...|3fc5f9aee6aebdfbb...|
|2018-10-22 05:45:23|50861eb650930b597...|50861eb650930b597...|
|2018-10-22 05:41:07|ef8a53e6c0990b037...|ef8a53e6c0990b037...|
|2018-10-22 05:33:08|ef8a53e6c0990b037...|ef8a53e6c0990b037...|
|2018-10-22 05:30:55|38418c737cf158499...|38418c737cf158499...|
|2018-10-22 05:29:09|3fc5f9aee6aebdfbb...|3fc5f9aee6aebdfbb...|
|2018-10-22 05:28:10|38418c737cf158499..

In [73]:
from pyspark.sql.functions import isnan, when, count, col, current_timestamp, lit, to_date

In [35]:
added_df.select([count(when(col(c).isNull(), c)).alias("cn_" + c)
                 for c in added_df.columns]).show(5, False)

+-------------+-----------+-----------+
|cn_login_time|cn_user_now|cn_user_old|
+-------------+-----------+-----------+
|25           |0          |28         |
+-------------+-----------+-----------+
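
The 28 null `user_old` values are consistent with the 25 rows whose `login_time` is null (null join keys never match) plus the 3 rows newer than anything in the Hive table. One possible cause of the null timestamps, not confirmed by the source, is a `last` line whose terminal field has no slash: the tokens then shift by one and the string no longer fits the `MM dd HH mm ss yyyy` pattern. The sketch below reproduces that shift in plain Python; the `tty1` line is a made-up example and `parse_or_none` only mimics a parser that returns null on bad input.

```python
import re
from datetime import datetime


def extract_date_tokens(line):
    # Same tokenization as get_features: non-word characters become spaces.
    words = re.sub(r"[^\w]", " ", line).split()
    date = words[4:10]
    date[0] = "10"
    return " ".join(date)


def parse_or_none(text):
    # Mimics a parser that yields null instead of raising on bad input.
    try:
        return datetime.strptime(text, "%m %d %H %M %S %Y")
    except ValueError:
        return None


pts_line = 'siddhant pts/13       Mon Oct 22 06:04:32 2018 - Mon Oct 22 06:05:25 2018  (00:00)    '
tty_line = 'someuser tty1         Mon Oct 22 06:04:32 2018   still logged in'  # hypothetical

pts_result = parse_or_none(extract_date_tokens(pts_line))
tty_result = parse_or_none(extract_date_tokens(tty_line))
```

The `pts/13` line splits into two terminal tokens and parses cleanly, while the `tty1` line yields `'10 06 04 32 2018 still'` and fails to parse.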



In [44]:
# Most recent login_time already stored in Hive; used as the cut-off for new rows.
latest_entry = hive_df.orderBy(desc("login_time")).take(1)[0].login_time

In [74]:
(to_store.withColumn("diff", col("login_time") > latest_entry)
         .where("diff")
         .withColumn("date", to_date("login_time"))
         .show())

+--------------------+-------------------+----+----------+
|                user|         login_time|diff|      date|
+--------------------+-------------------+----+----------+
|e5bf515039cdf685d...|2018-10-22 06:04:32|true|2018-10-22|
|e5bf515039cdf685d...|2018-10-22 06:02:21|true|2018-10-22|
|50861eb650930b597...|2018-10-22 06:01:03|true|2018-10-22|
+--------------------+-------------------+----+----------+
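
The filter above is a simple high-water-mark selection. A plain-Python sketch of the same idea follows, assuming rows are `(user, login_time)` tuples. `rows_after_watermark` is a hypothetical helper, and the watermark value is an assumption read off the join output, where 06:00:20 is the newest row that has a match.

```python
from datetime import datetime


def rows_after_watermark(rows, watermark):
    """Keep (user, login_time) rows strictly newer than `watermark`.

    Rows without a timestamp are dropped, as a null comparison would
    also exclude them in a SQL-style filter.
    """
    return [row for row in rows if row[1] is not None and row[1] > watermark]


rows = [
    ("u1", datetime(2018, 10, 22, 6, 4, 32)),
    ("u1", datetime(2018, 10, 22, 6, 2, 21)),
    ("u2", datetime(2018, 10, 22, 6, 1, 3)),
    ("u3", datetime(2018, 10, 22, 6, 0, 20)),  # equal to the watermark
    ("u4", None),                              # unparsed timestamp
]
watermark = datetime(2018, 10, 22, 6, 0, 20)
fresh = rows_after_watermark(rows, watermark)
```

Because the comparison is strict, a new row that shares its timestamp with the latest stored entry would be skipped, so a key-based anti-join may be safer when timestamps can collide.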



In [None]:
to_store.write.mode("append").saveAsTable("Kranthidr_db.user_sessions2")

In [None]:
spark.sql("""
DROP TABLE Kranthidr_db.user_sessions
""").show()

In [None]:
spark.sql("""
ALTER TABLE Kranthidr_db.user_sessions2 RENAME TO Kranthidr_db.user_sessions
""").show()

In [None]:
spark.sql("""
select * from user_sessions
""").count()

In [None]:
spark.sql("""
select * from user_sessions
""").orderBy("login_time").show(truncate=False)