<h2 style='text-align: center;'>AI Lab 2 -  Capstone Project</h2>

<h3 style='text-align: center;'>Author - Aloy Banerjee</h3>
<h3 style='text-align: center;'>Roll No. CH22M503</h3>

**Part 1: Kafka**
> For this part, you need to complete the following tasks:

    1. Install and Set-up Kafka on the Google Colab Notebook
    2. Stream data from the given dataset and break the data into batches of 10 records that are written to Kafka, separated by a sleep time of 10 seconds until 100 records are written.
    3. Use the Kafka consumer to read every 5 seconds from the producer. (Note: The dataset you will use for this part is YELP-review.json.)
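
Step 2 amounts to slicing the selected records into fixed-size batches and pausing between them. Here is a minimal sketch of that loop with no Kafka involved. `make_batches`, `stream_batches`, and the `send` callback are hypothetical names used only for illustration, and the pause is injectable so the example runs instantly.

```python
import time
from typing import Callable, Iterator, List, Sequence


def make_batches(records: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


def stream_batches(records, batch_size, send: Callable, pause_seconds=10.0, sleep=time.sleep):
    """Hand each batch to `send`, sleeping between batches (not after the last one)."""
    batches = list(make_batches(records, batch_size))
    for index, batch in enumerate(batches, start=1):
        send(index, batch)
        if index < len(batches):
            sleep(pause_seconds)
    return len(batches)


# Stand-in data: 100 fake records instead of Yelp reviews (assumption for the demo).
fake_records = [{"review_id": n} for n in range(100)]
sent = []
batch_count = stream_batches(
    fake_records,
    batch_size=10,
    send=lambda index, batch: sent.append((index, len(batch))),
    sleep=lambda seconds: None,  # skip the real 10-second pause in this demo
)
print(batch_count, sent[0], sent[-1])
```

In a real run, `send` would wrap the producer call and `sleep` would stay at its default of `time.sleep`.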

In [None]:
import os
from google.colab import drive

# Check if Google Drive is already mounted
if not os.path.exists('/content/drive'):
    # Mount Google Drive if it's not already mounted
    drive.mount('/content/drive')
else:
    print("Google Drive is already mounted.")



Google Drive is already mounted.


**Install Kafka and Zookeeper**

In [None]:
!curl -sSOL https://downloads.apache.org/kafka/3.5.0/kafka_2.12-3.5.0.tgz
!tar -xzf kafka_2.12-3.5.0.tgz

In [None]:
!echo "Starting ZooKeeper service..."
!./kafka_2.12-3.5.0/bin/zookeeper-server-start.sh -daemon ./kafka_2.12-3.5.0/config/zookeeper.properties

!echo "Starting Kafka service..."
!./kafka_2.12-3.5.0/bin/kafka-server-start.sh -daemon ./kafka_2.12-3.5.0/config/server.properties

!echo "Waiting for 10 secs until Kafka and ZooKeeper services are up and running..."

!sleep 10

!ps -ef | grep kafka

Starting ZooKeeper service...
Starting Kafka service...
Waiting for 10 secs until Kafka and ZooKeeper services are up and running...
root        4119       1  0 10:52 ?        00:00:04 java -Xmx512M -Xms512M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xlog:gc*:file=/content/kafka_2.12-3.5.0/bin/../logs/zookeeper-gc.log:time,tags:filecount=10,filesize=100M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/content/kafka_2.12-3.5.0/bin/../logs -Dlog4j.configuration=file:./kafka_2.12-3.5.0/bin/../config/log4j.properties -cp /content/kafka_2.12-3.5.0/bin/../libs/activation-1.1.1.jar:/content/kafka_2.12-3.5.0/bin/../libs/aopalliance-repackaged-2.6.1.jar:/content/kafka_2.12-3.5.0/bin/../libs/argparse4j-0.7.0.jar:/content/kafka_2.12-3.5.0/bin/../libs/audience-annotations-0.13.0.jar:/cont

**Create the `yelp_reviews` topic on the broker at port 9092**

In [None]:
!./kafka_2.12-3.5.0/bin/kafka-topics.sh --create --bootstrap-server 127.0.0.1:9092 --replication-factor 1 --partitions 1 --topic yelp_reviews

Error while executing topic command : Topic 'yelp_reviews' already exists.
[2023-07-30 11:15:15,123] ERROR org.apache.kafka.common.errors.TopicExistsException: Topic 'yelp_reviews' already exists.
 (kafka.admin.TopicCommand$)


**Describe the topic**

In [None]:
!./kafka_2.12-3.5.0/bin/kafka-topics.sh --describe --bootstrap-server 127.0.0.1:9092 --topic yelp_reviews

Topic: yelp_reviews	TopicId: tP2TAYRwSTW4ymD52UvVwg	PartitionCount: 1	ReplicationFactor: 1	Configs: 
	Topic: yelp_reviews	Partition: 0	Leader: 0	Replicas: 0	Isr: 0


**Install OpenJDK**

In [None]:
!echo "Installing OpenJDK 8 JDK..."
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Installing OpenJDK 8 JDK...


**Install Kafka's client**

In [None]:
!pip install kafka-python



**Importing Library**

In [None]:
import json
import time
import numpy as np
import pandas as pd
import random
from pandas import Timestamp
from datetime import datetime
import threading
from kafka import KafkaProducer, KafkaConsumer
from kafka.errors import KafkaError

**Common Variables**

In [None]:
# Define the Kafka topic
topic = "yelp_reviews"
# Define the google drive path
drive_link = '/content/drive/MyDrive/CapstoneProjectData/yelp_academic_dataset_review.json'


**Common Method**

In [None]:
def read_json_data(file_path, num_lines=None):
    """
    Reads data from a specified JSON file and converts it to a list of dictionaries.
    Each dictionary corresponds to a line in the JSON file. If num_lines is not specified,
    the function will attempt to read all lines in the file.

    Args:
        file_path (str): The path of the JSON file to read.
        num_lines (int, optional): The maximum number of lines to read from the file.
            Defaults to None, which means all lines will be read.

    Returns:
        list: A list of dictionaries representing the JSON data read from the file.

    Raises:
        IOError: If the file cannot be read.
        json.JSONDecodeError: If there is an error decoding the JSON data.
    """
    json_data = []
    try:
        with open(file_path, 'r') as f:
            for i, line in enumerate(f):
                if num_lines is not None and i >= num_lines:
                    break
                record = json.loads(line)
                json_data.append(record)
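    except IOError as e:
        raise IOError(f"Could not read file at path: {file_path}") from e
    except json.JSONDecodeError as e:
        # JSONDecodeError requires (msg, doc, pos), so re-raise with the original details
        raise json.JSONDecodeError(f"Error decoding JSON data in file {file_path}: {e.msg}", e.doc, e.pos) from e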

    return json_data

def send_to_kafka(producer, topic, data, iteration):
    """
    Sends data to Kafka topic using the specified producer.

    Parameters:
        producer (KafkaProducer): The Kafka producer instance.
        topic (str): The name of the Kafka topic to send data to.
        data (list): A list of dictionaries representing the data to send.
        iteration(int) : Iteration Number

    Returns:
        None
    """
    try:
        print(f'yelp data length : {len(data)}')
        for record in data:
            if 'timestamp' in record and isinstance(record['timestamp'], str):
                record['timestamp'] = datetime.strptime(record['timestamp'], "%Y-%m-%d %H:%M:%S")
            # default=str keeps json.dumps from failing on values such as datetime
            message = json.dumps(record, default=str).encode("utf-8")
            print(message)
            producer.send(topic, value=message)
        print(f"Iteration:{iteration} data sent successfully to Kafka.")
    except KafkaError as e:
        print(f"Error sending data to Kafka: {e}")

def consume_messages(consumer):
    """
    Consumes messages from Kafka in a separate thread.

    Parameters:
        consumer (KafkaConsumer): The Kafka consumer instance.

    Returns:
        None
    """
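    try:
        while True:  # Run until the consumer is closed or an error occurs
            # Iterating over the consumer blocks until a message arrives when no
            # consumer_timeout_ms is set, so the sleep below would never run.
            # poll() returns whatever has arrived within the timeout instead.
            batches = consumer.poll(timeout_ms=1000)
            for records in batches.values():
                for message in records:
                    record = message.value
                    print(record)
                    # Perform further processing of the record here if needed
            time.sleep(5)  # Sleep for 5 seconds before reading more messages
    except Exception as e:
        print(f"Error occurred while consuming messages: {e}")
    finally:
        consumer.close()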
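
A worker thread that reads on a fixed interval is easier to stop when its loop checks a `threading.Event`. The sketch below shows that pattern with a made-up in-memory source standing in for a consumer. `FakeSource` and `poll_loop` are hypothetical names for illustration and are not part of kafka-python.

```python
import threading
import time
from collections import deque


class FakeSource:
    """Hypothetical stand-in for a message source; hands out queued items in chunks."""

    def __init__(self, items):
        self._items = deque(items)

    def poll(self, max_records=5):
        chunk = []
        while self._items and len(chunk) < max_records:
            chunk.append(self._items.popleft())
        return chunk


def poll_loop(source, stop_event, handle, interval_seconds=5.0):
    """Poll `source` until `stop_event` is set, waiting `interval_seconds` between polls."""
    while not stop_event.is_set():
        for item in source.poll():
            handle(item)
        # Event.wait acts as an interruptible sleep: it returns early once the event is set.
        stop_event.wait(interval_seconds)


received = []
stop_event = threading.Event()
worker = threading.Thread(
    target=poll_loop,
    args=(FakeSource(range(12)), stop_event, received.append),
    kwargs={"interval_seconds": 0.01},  # short interval so the demo finishes quickly
)
worker.start()

# Wait (with a deadline) until everything has been read, then signal the worker to stop.
deadline = time.monotonic() + 5
while len(received) < 12 and time.monotonic() < deadline:
    time.sleep(0.01)
stop_event.set()
worker.join(timeout=2)
print(len(received), worker.is_alive())
```

Because the stop signal is checked by the worker itself, the main thread does not need to close shared resources from outside to end the loop.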

### **Kafka Producer**

**Calling Kafka Producer**

In [None]:
# Create a Kafka producer
producer = KafkaProducer(bootstrap_servers="localhost:9092")

try:
    # Load the Yelp data and randomly select 100 records to send to Kafka
    yelp_reviews_data = read_json_data(drive_link, num_lines=1000)
    num_lines_to_select = 100
    yelp_reviews_data_ops = random.sample(yelp_reviews_data, num_lines_to_select)
    # Send data to Kafka in batches of 10 records with a 10-second sleep time between batches
    batch_size = 10
    for i in range(0, len(yelp_reviews_data_ops), batch_size):
        batch = yelp_reviews_data_ops[i:i + batch_size]
        iteration = (i // batch_size) + 1
        send_to_kafka(producer, topic, batch, iteration)
        time.sleep(10)

except Exception as e:
    print(f"Error: {e}")
finally:
    producer.close()

yelp data length : 10
b'{"review_id": "2EjmDJnE2z5_lAWyV1rd4g", "user_id": "7Ie0VmQtnGYUVq2YW4dTVw", "business_id": "xruWHK8Z5N0JWyQubLHjgA", "stars": 5.0, "useful": 1, "funny": 0, "cool": 0, "text": "I cannot even believe someone gave this place one star. Here\'s the scoop. They have the BEST cannoli in the city. Termini\'s are no where NEAR as good, and Isgro\'s is far too hyped up for the quality. Potito\'s is my go to for cannoli, i wouldn\'t even THINK about going anywhere else. I\'m there every xmas eve, and will be for the rest of my days. I stood in line for isgros for TWO HOURS this year for a tirimisu, i walked right into potito\'s and picked up my bambino cannoli stuffed cannoli. it\'s as good as it sounds, trust me.", "date": "2011-02-14 16:56:15"}'
b'{"review_id": "FTax2GLIJRbHbEBUE82_pA", "user_id": "JggphOM7FIbvUyPcsfcNTw", "business_id": "9gObo5ltOMo6UgsaXaHPWA", "stars": 5.0, "useful": 5, "funny": 2, "cool": 2, "text": "Second stop for Center City Restaurant Week:  R2L
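
Kafka transports raw bytes, which is why each record above is printed as a `b'...'` literal: it is encoded before sending and has to be decoded after reading. Below is a small sketch of that round trip using only the standard library. The `serialize` and `deserialize` helpers and the sample record are made up for illustration, and `default=str` is one assumed way to handle values such as `datetime` that `json.dumps` cannot encode natively.

```python
import json
from datetime import datetime


def serialize(record: dict) -> bytes:
    """Encode a record as UTF-8 JSON; non-JSON types (e.g. datetime) fall back to str()."""
    return json.dumps(record, default=str).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into a dictionary."""
    return json.loads(payload.decode("utf-8"))


# Made-up record shaped like a review; not taken from the dataset.
sample = {"review_id": "example-id", "stars": 5.0, "date": datetime(2011, 2, 14, 16, 56, 15)}
payload = serialize(sample)
restored = deserialize(payload)
print(type(payload).__name__, restored["date"])
```

kafka-python's `KafkaProducer` accepts a `value_serializer` callable and `KafkaConsumer` a `value_deserializer`, so helpers like these could be passed there instead of encoding by hand.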

### **Kafka Consumer**

**View from the Topics**

In [None]:
!./kafka_2.12-3.5.0/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic yelp_reviews --from-beginning --max-messages 90

{"review_id": "2EjmDJnE2z5_lAWyV1rd4g", "user_id": "7Ie0VmQtnGYUVq2YW4dTVw", "business_id": "xruWHK8Z5N0JWyQubLHjgA", "stars": 5.0, "useful": 1, "funny": 0, "cool": 0, "text": "I cannot even believe someone gave this place one star. Here's the scoop. They have the BEST cannoli in the city. Termini's are no where NEAR as good, and Isgro's is far too hyped up for the quality. Potito's is my go to for cannoli, i wouldn't even THINK about going anywhere else. I'm there every xmas eve, and will be for the rest of my days. I stood in line for isgros for TWO HOURS this year for a tirimisu, i walked right into potito's and picked up my bambino cannoli stuffed cannoli. it's as good as it sounds, trust me.", "date": "2011-02-14 16:56:15"}
{"review_id": "FTax2GLIJRbHbEBUE82_pA", "user_id": "JggphOM7FIbvUyPcsfcNTw", "business_id": "9gObo5ltOMo6UgsaXaHPWA", "stars": 5.0, "useful": 5, "funny": 2, "cool": 2, "text": "Second stop for Center City Restaurant Week:  R2L, dinner for $35 each. \n\nOn the 3

In [None]:
# Create a Kafka consumer
consumer = KafkaConsumer(topic, bootstrap_servers='localhost:9092', auto_offset_reset='earliest')

# Start consuming messages in a separate thread
consumer_thread = threading.Thread(target=consume_messages, args=(consumer,))
consumer_thread.start()

try:
    # Main thread sleeps for 30 seconds to allow message consumption
    time.sleep(30)
except KeyboardInterrupt:
    print("Received keyboard interrupt. Stopping consumer gracefully.")
finally:
    # Close the consumer so the worker thread stops, then wait briefly for it to exit
    consumer.close()
    consumer_thread.join(timeout=10)

print("Exiting the main thread.")



b'{"review_id": "2EjmDJnE2z5_lAWyV1rd4g", "user_id": "7Ie0VmQtnGYUVq2YW4dTVw", "business_id": "xruWHK8Z5N0JWyQubLHjgA", "stars": 5.0, "useful": 1, "funny": 0, "cool": 0, "text": "I cannot even believe someone gave this place one star. Here\'s the scoop. They have the BEST cannoli in the city. Termini\'s are no where NEAR as good, and Isgro\'s is far too hyped up for the quality. Potito\'s is my go to for cannoli, i wouldn\'t even THINK about going anywhere else. I\'m there every xmas eve, and will be for the rest of my days. I stood in line for isgros for TWO HOURS this year for a tirimisu, i walked right into potito\'s and picked up my bambino cannoli stuffed cannoli. it\'s as good as it sounds, trust me.", "date": "2011-02-14 16:56:15"}'
b'{"review_id": "FTax2GLIJRbHbEBUE82_pA", "user_id": "JggphOM7FIbvUyPcsfcNTw", "business_id": "9gObo5ltOMo6UgsaXaHPWA", "stars": 5.0, "useful": 5, "funny": 2, "cool": 2, "text": "Second stop for Center City Restaurant Week:  R2L, dinner for $35 each.

Streaming output truncated to the last 5000 lines.


**Part 2: Pyspark Streaming**

> Consider a scenario where you are a data engineer working for a ride-sharing company called "PyRides." The company has a fleet of drivers who continuously send GPS location data and trip information during their shifts. As a data engineer, you process this real-time streaming data and gain insights into driver performance.

> **The data stream contains the following fields:**

        1.   **driver_id:** The unique identifier of the driver.
        2.   **timestamp:** The timestamp of the GPS location or trip event.
        3.   **latitude:** The latitude of the driver's location.
        4.   **longitude:** The longitude of the driver's location.
        5.   **trip_distance:** The distance covered by the driver during the trip (if it's a trip event).

> **Task 1: Count of Unique Drivers**
Your task is to implement a PySpark Streaming code to calculate the count of unique drivers within a sliding window of 10 minutes, updated every 5 minutes. Display the results for each window update.

> **Task 2: Average Trip Duration**
Your task is to implement a PySpark Streaming code to calculate the average trip duration for each driver within a tumbling window of 15 minutes. Display the results for each window update.

> **Task 3: Idle Time Detection**
Your task is to implement a PySpark Streaming code to detect idle time for each driver using session windows. Consider it an idle session if the driver's location remains unchanged for more than 30 minutes. Display the start and end timestamps of each idle session.

In [49]:
!pip install pyspark



**Importing Libraries**

In [337]:
import os
import numpy as np
import pandas as pd
import warnings
from google.colab import drive
from pyspark.sql import SparkSession
# Wildcard import for brevity; note that it shadows Python built-ins such as sum, min and max.
from pyspark.sql.functions import * # window, col, countDistinct, avg, lag, when, approx_count_distinct, sum, min, max,from_unixtime, unix_timestamp,expr
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType, LongType
from pyspark.sql.window import Window
from pyspark.streaming import StreamingContext
from pyspark import SparkConf, SparkContext

In [328]:
# Check if Google Drive is already mounted
if not os.path.exists('/content/drive'):
    # Mount Google Drive if it's not already mounted
    drive.mount('/content/drive')
else:
    print("Google Drive is already mounted.")

Google Drive is already mounted.


**Common Variable Declaration**

In [419]:
# Create a SparkSession
spark_session = SparkSession.builder.master("local").appName("PyRidesDriverPerformance").config('spark.ui.port', '4050').getOrCreate()
# Directory containing the JSON data files
file_path = '/content/drive/MyDrive/CapstoneProjectData/'
# Define the sliding window duration and slide duration
window_duration_part1, slide_duration_part1 = "10 minutes", "5 minutes"
# Define the tumbling window of 15 minutes
window_duration_part2 = "15 minutes"
# Define the session window of 30 minutes
session_gap_duration = "30 minutes"
# Define a threshold for idle session detection (30 minutes in seconds)
idle_threshold_seconds = 1800
# Define the schema for the streaming data
schema = StructType([
    StructField("driver_id", StringType(), nullable=False),
    StructField("timestamp", TimestampType(), nullable=False),
    StructField("latitude", DoubleType(), nullable=False),
    StructField("longitude", DoubleType(), nullable=False),
    StructField("trip_distance", DoubleType(), nullable=True),  # nullable=True because GPS events carry no trip_distance
    StructField("event_type", StringType(), nullable=True)
])

**Loading the streaming data**

In [331]:
# Stream the JSON files under file_path into a streaming DataFrame using the specified schema.
ride_information_stream = spark_session.readStream.format("json").option('multiline', True).schema(schema).json(file_path)
# Display the schema of the streaming DataFrame.
ride_information_stream.printSchema()
ride_information_dataframe = ride_information_stream.select("*")
# Write the stream to an in-memory table, refreshed every 5 seconds, so it can be queried with Spark SQL.
ride_information_query = ride_information_dataframe.writeStream.format("memory").outputMode("append").queryName("ride_information_using_stream").trigger(processingTime='5 seconds').start()

root
 |-- driver_id: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- event_type: string (nullable = true)



In [332]:
ride_information = spark_session.sql("select * from ride_information_using_stream")

In [333]:
ride_information.show()

+---------+-------------------+------------------+-------------------+------------------+----------+
|driver_id|          timestamp|          latitude|          longitude|     trip_distance|event_type|
+---------+-------------------+------------------+-------------------+------------------+----------+
|     null|               null|              null|               null|              null|      null|
|     D001|2023-07-30 05:55:48|37.538849579360246|-121.22748988885533| 3.423076573047031|      Trip|
|     D001|2023-07-30 07:10:08| 37.93476356947273| -121.0215678805398| 5.072880200441588|      Trip|
|     D001|2023-07-30 07:11:01|37.795197133875064|  -121.920222434928|              null|       GPS|
|     D001|2023-07-30 05:20:21| 37.46871545044007|-121.05888928225791|1.5685949947600415|      Trip|
|     D001|2023-07-30 05:27:29| 37.67930257604861| -121.7174686646489|              null|       GPS|
|     D001|2023-07-30 06:20:49| 37.57858135955766|-121.45726820764473|              null|  

**Task 1: Count of Unique Drivers**

In [359]:
ride_information = ride_information.orderBy('driver_id', 'timestamp')
unique_drivers_count = ride_information.groupBy(window(col("timestamp"), window_duration_part1, slide_duration_part1)).agg(countDistinct("driver_id").alias("unique_drivers_count"))

In [361]:
print('Task 1: Count of Unique Drivers : ')
ordered_unique_drivers_data = unique_drivers_count.orderBy('window')
ordered_unique_drivers_data.show(ordered_unique_drivers_data.count(), False)

Task 1: Count of Unique Drivers : 
+------------------------------------------+--------------------+
|window                                    |unique_drivers_count|
+------------------------------------------+--------------------+
|{2023-07-30 05:05:00, 2023-07-30 05:15:00}|47                  |
|{2023-07-30 05:10:00, 2023-07-30 05:20:00}|50                  |
|{2023-07-30 05:15:00, 2023-07-30 05:25:00}|50                  |
|{2023-07-30 05:20:00, 2023-07-30 05:30:00}|50                  |
|{2023-07-30 05:25:00, 2023-07-30 05:35:00}|50                  |
|{2023-07-30 05:30:00, 2023-07-30 05:40:00}|50                  |
|{2023-07-30 05:35:00, 2023-07-30 05:45:00}|50                  |
|{2023-07-30 05:40:00, 2023-07-30 05:50:00}|50                  |
|{2023-07-30 05:45:00, 2023-07-30 05:55:00}|49                  |
|{2023-07-30 05:50:00, 2023-07-30 06:00:00}|50                  |
|{2023-07-30 05:55:00, 2023-07-30 06:05:00}|50                  |
|{2023-07-30 06:00:00, 2023-07-30 06:10:0
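
With a 10-minute window sliding every 5 minutes, each event falls into two overlapping windows, which is why consecutive rows above share half of their time range. The sketch below reproduces that assignment in plain Python. `sliding_windows` is a hypothetical helper, the sample events are made up, and windows are assumed to be aligned to the Unix epoch, which is Spark's default when no start offset is given.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_windows(ts, window=timedelta(minutes=10), slide=timedelta(minutes=5)):
    """Return the [start, end) windows that contain `ts`, earliest first."""
    # Latest epoch-aligned window start that is <= ts.
    start = ts - (ts - EPOCH) % slide
    windows = []
    # A window [start, start + window) contains ts as long as start > ts - window.
    while start > ts - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


# Made-up events: (driver_id, event time).
events = [
    ("D001", datetime(2023, 7, 30, 5, 12, 56)),
    ("D002", datetime(2023, 7, 30, 5, 13, 10)),
    ("D001", datetime(2023, 7, 30, 5, 16, 8)),
    ("D003", datetime(2023, 7, 30, 5, 21, 0)),
]

# Collect the distinct drivers seen in each window.
drivers_per_window = defaultdict(set)
for driver_id, ts in events:
    for win in sliding_windows(ts):
        drivers_per_window[win].add(driver_id)

for (start, end), drivers in sorted(drivers_per_window.items()):
    print(start.time(), end.time(), len(drivers))
```

Using a set per window mirrors what `countDistinct` does: a driver who reports twice inside the same window is still counted once.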

**Task 2: Average Trip Duration**

In [469]:
eventwise_ride = ride_information.select("driver_id", "timestamp", "event_type")
eventwise_ride = eventwise_ride.dropna(how="all")
only_trip_data = eventwise_ride.filter(eventwise_ride.event_type == "Trip")
only_trip_data.show()

+---------+-------------------+----------+
|driver_id|          timestamp|event_type|
+---------+-------------------+----------+
|     D001|2023-07-30 05:12:56|      Trip|
|     D001|2023-07-30 05:16:08|      Trip|
|     D001|2023-07-30 05:17:46|      Trip|
|     D001|2023-07-30 05:17:51|      Trip|
|     D001|2023-07-30 05:20:21|      Trip|
|     D001|2023-07-30 05:20:57|      Trip|
|     D001|2023-07-30 05:23:03|      Trip|
|     D001|2023-07-30 05:29:41|      Trip|
|     D001|2023-07-30 05:29:56|      Trip|
|     D001|2023-07-30 05:32:20|      Trip|
|     D001|2023-07-30 05:32:34|      Trip|
|     D001|2023-07-30 05:34:54|      Trip|
|     D001|2023-07-30 05:41:40|      Trip|
|     D001|2023-07-30 05:43:05|      Trip|
|     D001|2023-07-30 05:43:44|      Trip|
|     D001|2023-07-30 05:43:55|      Trip|
|     D001|2023-07-30 05:44:40|      Trip|
|     D001|2023-07-30 05:47:29|      Trip|
|     D001|2023-07-30 05:47:41|      Trip|
|     D001|2023-07-30 05:48:13|      Trip|
+---------+

In [482]:
windows_specs = Window.partitionBy('driver_id').orderBy('timestamp')
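# Alternative: restart the lag at every 15-minute window boundary. The sample output of the
# next cell appears to come from this variant, since prev_timestamp is null at 05:16:08 and 05:32:20.
#windows_specs = Window.partitionBy("driver_id", window("timestamp", "15 minutes")).orderBy("timestamp")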
trip_data_within_window = only_trip_data.withColumn("prev_timestamp",lag('timestamp').over(windows_specs))
trip_data_within_window = trip_data_within_window.withColumn("trip_duration",(col("timestamp").cast('long') - col('prev_timestamp').cast('long')))

In [479]:
trip_data_within_window.show()

+---------+-------------------+----------+-------------------+-------------+
|driver_id|          timestamp|event_type|     prev_timestamp|trip_duration|
+---------+-------------------+----------+-------------------+-------------+
|     D001|2023-07-30 05:12:56|      Trip|               null|         null|
|     D001|2023-07-30 05:16:08|      Trip|               null|         null|
|     D001|2023-07-30 05:17:46|      Trip|2023-07-30 05:16:08|           98|
|     D001|2023-07-30 05:17:51|      Trip|2023-07-30 05:17:46|            5|
|     D001|2023-07-30 05:20:21|      Trip|2023-07-30 05:17:51|          150|
|     D001|2023-07-30 05:20:57|      Trip|2023-07-30 05:20:21|           36|
|     D001|2023-07-30 05:23:03|      Trip|2023-07-30 05:20:57|          126|
|     D001|2023-07-30 05:29:41|      Trip|2023-07-30 05:23:03|          398|
|     D001|2023-07-30 05:29:56|      Trip|2023-07-30 05:29:41|           15|
|     D001|2023-07-30 05:32:20|      Trip|               null|         null|

In [480]:
average_trip_duration = trip_data_within_window.groupBy('driver_id', window('timestamp', window_duration_part2)).agg(avg('trip_duration').alias('average trip duration'))
average_trip_duration = average_trip_duration.filter(col('average trip duration').isNotNull())

In [481]:
average_trip_duration.show(average_trip_duration.count(),truncate = False)

+---------+------------------------------------------+---------------------+
|driver_id|window                                    |average trip duration|
+---------+------------------------------------------+---------------------+
|D005     |{2023-07-30 05:15:00, 2023-07-30 05:30:00}|82.5                 |
|D015     |{2023-07-30 05:45:00, 2023-07-30 06:00:00}|103.0                |
|D042     |{2023-07-30 05:30:00, 2023-07-30 05:45:00}|221.33333333333334   |
|D002     |{2023-07-30 06:00:00, 2023-07-30 06:15:00}|148.6                |
|D004     |{2023-07-30 05:45:00, 2023-07-30 06:00:00}|102.71428571428571   |
|D024     |{2023-07-30 05:00:00, 2023-07-30 05:15:00}|44.333333333333336   |
|D043     |{2023-07-30 06:45:00, 2023-07-30 07:00:00}|79.375               |
|D044     |{2023-07-30 07:00:00, 2023-07-30 07:15:00}|110.8                |
|D013     |{2023-07-30 06:00:00, 2023-07-30 06:15:00}|91.75                |
|D023     |{2023-07-30 06:00:00, 2023-07-30 06:15:00}|250.33333333333334   |
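
A tumbling window is the special case where the slide equals the window length, so every event lands in exactly one 15-minute bucket. A plain-Python sketch of the per-driver, per-bucket average follows. As in the cells above, the "duration" is taken to be the gap in seconds between consecutive Trip events of the same driver, which is a working assumption rather than a field in the data. `bucket_start` and `average_gap_per_bucket` are hypothetical names, and the lag here is computed per driver without restarting at window boundaries.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, width=timedelta(minutes=15)):
    """Start of the epoch-aligned tumbling window that contains `ts`."""
    return ts - (ts - EPOCH) % width


def average_gap_per_bucket(events, width=timedelta(minutes=15)):
    """Average gap (seconds) between consecutive events, keyed by (driver, window start)."""
    by_driver = defaultdict(list)
    for driver_id, ts in events:
        by_driver[driver_id].append(ts)

    gaps = defaultdict(list)
    for driver_id, stamps in by_driver.items():
        stamps.sort()
        # Pair each event with its predecessor, like lag() over a per-driver ordering.
        for previous, current in zip(stamps, stamps[1:]):
            key = (driver_id, bucket_start(current, width))
            gaps[key].append((current - previous).total_seconds())

    return {key: sum(values) / len(values) for key, values in gaps.items()}


# Illustrative Trip events for one driver.
trips = [
    ("D001", datetime(2023, 7, 30, 5, 12, 56)),
    ("D001", datetime(2023, 7, 30, 5, 16, 8)),
    ("D001", datetime(2023, 7, 30, 5, 17, 46)),
    ("D001", datetime(2023, 7, 30, 5, 32, 20)),
]
averages = average_gap_per_bucket(trips)
for (driver_id, start), value in sorted(averages.items()):
    print(driver_id, start.time(), value)
```

The first event of each driver has no predecessor and contributes no gap, which corresponds to the null `trip_duration` rows that `avg` ignores.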

**Task 3: Idle Time Detection**

In [446]:
eventwise_ride = ride_information.select("driver_id", "timestamp", "event_type")
eventwise_ride = eventwise_ride.dropna(how="all")
only_GPS_data = eventwise_ride.filter(eventwise_ride.event_type == "GPS")
only_GPS_data.show()

+---------+-------------------+----------+
|driver_id|          timestamp|event_type|
+---------+-------------------+----------+
|     D001|2023-07-30 05:15:57|       GPS|
|     D001|2023-07-30 05:21:07|       GPS|
|     D001|2023-07-30 05:23:27|       GPS|
|     D001|2023-07-30 05:27:29|       GPS|
|     D001|2023-07-30 05:31:50|       GPS|
|     D001|2023-07-30 05:32:08|       GPS|
|     D001|2023-07-30 05:36:21|       GPS|
|     D001|2023-07-30 05:38:09|       GPS|
|     D001|2023-07-30 05:38:53|       GPS|
|     D001|2023-07-30 05:41:03|       GPS|
|     D001|2023-07-30 05:42:55|       GPS|
|     D001|2023-07-30 05:43:06|       GPS|
|     D001|2023-07-30 05:45:38|       GPS|
|     D001|2023-07-30 05:48:31|       GPS|
|     D001|2023-07-30 05:52:03|       GPS|
|     D001|2023-07-30 05:56:52|       GPS|
|     D001|2023-07-30 05:57:14|       GPS|
|     D001|2023-07-30 06:02:19|       GPS|
|     D001|2023-07-30 06:08:42|       GPS|
|     D001|2023-07-30 06:09:15|       GPS|
+---------+

In [447]:
windows_specs = Window.partitionBy('driver_id').orderBy('timestamp')
GPS_data_within_window = only_GPS_data.withColumn("prev_timestamp",lag('timestamp').over(windows_specs))

In [448]:
GPS_data_within_window.show()

+---------+-------------------+----------+-------------------+
|driver_id|          timestamp|event_type|     prev_timestamp|
+---------+-------------------+----------+-------------------+
|     D001|2023-07-30 05:15:57|       GPS|               null|
|     D001|2023-07-30 05:21:07|       GPS|2023-07-30 05:15:57|
|     D001|2023-07-30 05:23:27|       GPS|2023-07-30 05:21:07|
|     D001|2023-07-30 05:27:29|       GPS|2023-07-30 05:23:27|
|     D001|2023-07-30 05:31:50|       GPS|2023-07-30 05:27:29|
|     D001|2023-07-30 05:32:08|       GPS|2023-07-30 05:31:50|
|     D001|2023-07-30 05:36:21|       GPS|2023-07-30 05:32:08|
|     D001|2023-07-30 05:38:09|       GPS|2023-07-30 05:36:21|
|     D001|2023-07-30 05:38:53|       GPS|2023-07-30 05:38:09|
|     D001|2023-07-30 05:41:03|       GPS|2023-07-30 05:38:53|
|     D001|2023-07-30 05:42:55|       GPS|2023-07-30 05:41:03|
|     D001|2023-07-30 05:43:06|       GPS|2023-07-30 05:42:55|
|     D001|2023-07-30 05:45:38|       GPS|2023-07-30 05

In [456]:
GPS_data_within_window = GPS_data_within_window.withColumn("idle_duration",(col("timestamp").cast('long') - col('prev_timestamp').cast('long')))

In [457]:
GPS_data_within_window.show()

+---------+-------------------+----------+-------------------+-------------+
|driver_id|          timestamp|event_type|     prev_timestamp|idle_duration|
+---------+-------------------+----------+-------------------+-------------+
|     D001|2023-07-30 05:15:57|       GPS|               null|         null|
|     D001|2023-07-30 05:21:07|       GPS|2023-07-30 05:15:57|          310|
|     D001|2023-07-30 05:23:27|       GPS|2023-07-30 05:21:07|          140|
|     D001|2023-07-30 05:27:29|       GPS|2023-07-30 05:23:27|          242|
|     D001|2023-07-30 05:31:50|       GPS|2023-07-30 05:27:29|          261|
|     D001|2023-07-30 05:32:08|       GPS|2023-07-30 05:31:50|           18|
|     D001|2023-07-30 05:36:21|       GPS|2023-07-30 05:32:08|          253|
|     D001|2023-07-30 05:38:09|       GPS|2023-07-30 05:36:21|          108|
|     D001|2023-07-30 05:38:53|       GPS|2023-07-30 05:38:09|           44|
|     D001|2023-07-30 05:41:03|       GPS|2023-07-30 05:38:53|          130|

In [460]:
# Flag idle sessions: idle_duration becomes 1 when the gap exceeds the threshold, otherwise 0
idle_sessions = GPS_data_within_window.withColumn("idle_duration", when(col("idle_duration") > idle_threshold_seconds, 1).otherwise(0))
idle_sessions.show(idle_sessions.count())

+---------+-------------------+----------+-------------------+-------------+
|driver_id|          timestamp|event_type|     prev_timestamp|idle_duration|
+---------+-------------------+----------+-------------------+-------------+
|     D001|2023-07-30 05:15:57|       GPS|               null|            0|
|     D001|2023-07-30 05:21:07|       GPS|2023-07-30 05:15:57|            0|
|     D001|2023-07-30 05:23:27|       GPS|2023-07-30 05:21:07|            0|
|     D001|2023-07-30 05:27:29|       GPS|2023-07-30 05:23:27|            0|
|     D001|2023-07-30 05:31:50|       GPS|2023-07-30 05:27:29|            0|
|     D001|2023-07-30 05:32:08|       GPS|2023-07-30 05:31:50|            0|
|     D001|2023-07-30 05:36:21|       GPS|2023-07-30 05:32:08|            0|
|     D001|2023-07-30 05:38:09|       GPS|2023-07-30 05:36:21|            0|
|     D001|2023-07-30 05:38:53|       GPS|2023-07-30 05:38:09|            0|
|     D001|2023-07-30 05:41:03|       GPS|2023-07-30 05:38:53|            0|

In [464]:
print('Drivers idle for more than 30 minutes')
idle_sessions = idle_sessions.filter(col("idle_duration") > 0)
idle_sessions.show(idle_sessions.count(), truncate = False)

Drivers idle for more than 30 minutes
+---------+---------+----------+--------------+-------------+
|driver_id|timestamp|event_type|prev_timestamp|idle_duration|
+---------+---------+----------+--------------+-------------+
+---------+---------+----------+--------------+-------------+
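
The filter above returns no rows for this data. The task statement defines idleness by an unchanged location, so another possible reading is to group consecutive GPS fixes at the same coordinates and report runs that last longer than 30 minutes. The sketch below illustrates that reading in plain Python. `find_idle_sessions` is a hypothetical helper, the sample fixes are made up, and exact coordinate equality is an assumption (real GPS data would likely need a distance tolerance).

```python
from datetime import datetime, timedelta


def find_idle_sessions(fixes, threshold=timedelta(minutes=30)):
    """Return (driver_id, start, end) for runs of identical coordinates longer than `threshold`.

    `fixes` is an iterable of (driver_id, timestamp, latitude, longitude).
    """
    sessions = []
    current = {}  # driver_id -> (position, first_seen, last_seen)

    def close(driver_id):
        _, first_seen, last_seen = current[driver_id]
        if last_seen - first_seen > threshold:
            sessions.append((driver_id, first_seen, last_seen))

    for driver_id, ts, lat, lon in sorted(fixes, key=lambda fix: (fix[0], fix[1])):
        position = (lat, lon)
        if driver_id in current and current[driver_id][0] == position:
            # Same place as the previous fix: extend the running session.
            current[driver_id] = (position, current[driver_id][1], ts)
        else:
            # Location changed (or first fix for this driver): close the old run, start a new one.
            if driver_id in current:
                close(driver_id)
            current[driver_id] = (position, ts, ts)

    # Close the runs that were still open at the end of the data.
    for driver_id in list(current):
        close(driver_id)
    return sorted(sessions)


# Made-up GPS fixes: D001 stays put for 45 minutes, D002 for only 10.
fixes = [
    ("D001", datetime(2023, 7, 30, 5, 0, 0), 37.50, -121.20),
    ("D001", datetime(2023, 7, 30, 5, 20, 0), 37.50, -121.20),
    ("D001", datetime(2023, 7, 30, 5, 45, 0), 37.50, -121.20),
    ("D001", datetime(2023, 7, 30, 5, 50, 0), 37.60, -121.30),
    ("D002", datetime(2023, 7, 30, 5, 0, 0), 37.70, -121.40),
    ("D002", datetime(2023, 7, 30, 5, 10, 0), 37.70, -121.40),
]
idle = find_idle_sessions(fixes)
for driver_id, start, end in idle:
    print(driver_id, start, end)
```

On the DataFrame side, PySpark 3.2 and later provide `pyspark.sql.functions.session_window` for gap-based session windows, which could play a similar role.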



In [465]:
unique_idle_drivers_df = GPS_data_within_window.orderBy("driver_id",window("timestamp","30 minutes"), ascending=False).dropDuplicates(["driver_id"])

In [466]:
unique_idle_drivers_df.show(unique_idle_drivers_df.count(), truncate = False)

+------------------------------------------+---------+-------------------+----------+-------------------+-------------+
|window                                    |driver_id|timestamp          |event_type|prev_timestamp     |idle_duration|
+------------------------------------------+---------+-------------------+----------+-------------------+-------------+
|{2023-07-30 07:00:00, 2023-07-30 07:30:00}|D001     |2023-07-30 07:01:01|GPS       |2023-07-30 06:58:53|128          |
|{2023-07-30 07:00:00, 2023-07-30 07:30:00}|D002     |2023-07-30 07:00:27|GPS       |2023-07-30 06:57:22|185          |
|{2023-07-30 07:00:00, 2023-07-30 07:30:00}|D003     |2023-07-30 07:00:49|GPS       |2023-07-30 06:59:02|107          |
|{2023-07-30 07:00:00, 2023-07-30 07:30:00}|D004     |2023-07-30 07:00:20|GPS       |2023-07-30 06:53:18|422          |
|{2023-07-30 07:00:00, 2023-07-30 07:30:00}|D005     |2023-07-30 07:02:12|GPS       |2023-07-30 06:56:44|328          |
|{2023-07-30 07:00:00, 2023-07-30 07:30: