Step 1: Create a folder named “mobile” in DBFS on Databricks. Place the files ‘file1.json’, ‘file2.json’, and ‘file3.json’ in this folder. Copy the path to the folder.
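
Before starting the stream, it can help to confirm that the files actually landed in the folder. The sketch below is one possible check, assuming it is called from a Databricks notebook where `dbutils` is available; `list_json_files` is a hypothetical helper name, and `dbutils` is passed in as an argument so the function does not depend on the notebook's globals.

```python
def list_json_files(dbutils, folder):
    """Return the sorted names of the .json files directly inside `folder`."""
    # dbutils.fs.ls returns one entry per file, each with a `.name` attribute
    return sorted(
        entry.name
        for entry in dbutils.fs.ls(folder)
        if entry.name.endswith(".json")
    )

# In the notebook (not run here):
# print(list_json_files(dbutils, "dbfs:/FileStore/mobile"))
```

If any of the three files is missing from the printed list, the upload should be repeated before running the streaming code.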

Step 2: In a notebook, add the following code (be sure to substitute the correct path to the mobile
folder).

Step 3: Run the code to start a streaming Spark job. ( 2 marks )

In [0]:
from pyspark.sql.types import StructType, StringType, TimestampType
from pyspark.sql.functions import window

mobile_folder_path = 'dbfs:/FileStore/mobile'

# Define the schema
mobile_data_schema = StructType() \
  .add("id", StringType(), False) \
  .add("action", StringType(), False) \
  .add("ts", TimestampType(), False)

# Create a streaming DataFrame from JSON files in the specified directory
mobile_ss_df = spark.readStream \
  .schema(mobile_data_schema) \
  .json(mobile_folder_path)

# Check if the DataFrame is a streaming DataFrame
print(mobile_ss_df.isStreaming) # Should print True

# Group by a 10-minute window and action, then count occurrences
action_count_df = mobile_ss_df.groupBy(window("ts", "10 minutes"),"action").count()

# Write the output to the console
mobile_console_sq = action_count_df.writeStream \
  .format("console") \
  .option("truncate", "false") \
  .outputMode("complete") \
  .start()

True
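
To make the `window("ts", "10 minutes")` grouping more concrete, here is a rough plain-Python sketch of how a single timestamp falls into a 10-minute tumbling window. It is not Spark code; `tumbling_window` is a hypothetical helper, the sample timestamp is made up for illustration, and the sketch assumes windows are aligned to the Unix epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

def tumbling_window(ts, minutes=10):
    """Return the (start, end) of the tumbling window that contains `ts`.

    `ts` must be a timezone-aware datetime. The window is the half-open
    interval [start, end), with boundaries aligned to the Unix epoch.
    """
    size = minutes * 60
    epoch_seconds = int(ts.timestamp())
    # Round down to the nearest multiple of the window size
    start_seconds = epoch_seconds - (epoch_seconds % size)
    start = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    return start, start + timedelta(seconds=size)

# Hypothetical event time, chosen only for illustration
event_time = datetime(2018, 3, 2, 10, 2, 33, tzinfo=timezone.utc)
start, end = tumbling_window(event_time)
print(start.isoformat(), "to", end.isoformat())
```

Every event whose timestamp lands in the same window and has the same action contributes to the same row of `action_count_df`.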


4. In a new code cell, display the results of the streaming job by calling
“display(action_count_df)”. ( 1 mark )


In [0]:
display(action_count_df)

window,action,count
"List(2018-03-02T10:00:00.000+0000, 2018-03-02T10:10:00.000+0000)",open,4
"List(2018-03-02T11:00:00.000+0000, 2018-03-02T11:10:00.000+0000)",crash,1
"List(2018-03-02T10:10:00.000+0000, 2018-03-02T10:20:00.000+0000)",open,1
"List(2018-03-02T10:00:00.000+0000, 2018-03-02T10:10:00.000+0000)",close,3
"List(2018-03-02T11:10:00.000+0000, 2018-03-02T11:20:00.000+0000)",swipe,1


5. Add an additional file to the ‘mobile’ folder on DBFS. The file to add is named ‘newaction.json’.

6. After adding the file, in a new cell, display the results of the streaming job by calling
‘display(action_count_df)’. ( 1 mark)

In [0]:
display(action_count_df)

window,action,count
"List(2018-03-02T10:00:00.000+0000, 2018-03-02T10:10:00.000+0000)",open,4
"List(2018-03-02T11:00:00.000+0000, 2018-03-02T11:10:00.000+0000)",crash,1
"List(2018-03-02T10:10:00.000+0000, 2018-03-02T10:20:00.000+0000)",open,1
"List(2018-03-02T10:00:00.000+0000, 2018-03-02T10:10:00.000+0000)",close,3
"List(2018-03-02T11:10:00.000+0000, 2018-03-02T11:20:00.000+0000)",swipe,1


7. How has the input changed between step 4 and step 6? ( 2 marks)

The input changed because newaction.json was added to the folder. I had left the display cell from step 4 running, so it updated at the same time and the two outputs look the same. With the newaction file added, the counts would be higher, and any new action in the file would add an additional row to the Databricks table.
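
One way to check whether the new file was picked up is to look at the streaming query's progress report. The sketch below assumes `mobile_console_sq` is still running; `rows_in_last_batch` is a hypothetical helper name, and it reads `lastProgress` with dictionary-style access, which may behave differently across Spark versions.

```python
def rows_in_last_batch(query):
    """Return the input row count of the query's most recent progress report.

    Returns None if the query has not reported any progress yet.
    """
    progress = query.lastProgress  # None until the first progress report
    if progress is None:
        return None
    return progress["numInputRows"]

# In the notebook (not run here):
# print(rows_in_last_batch(mobile_console_sq))
```

A value of 0 may only mean that the latest trigger found nothing new; `mobile_console_sq.recentProgress` keeps a short history of earlier reports.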

8. Stop the streaming job by calling ‘mobile_console_sq.stop()’. This stops the
streaming query job. ( 1 mark)

In [0]:
mobile_console_sq.stop()

9. For the streaming job setup in step 2, describe the following properties:

&nbsp;&nbsp;&nbsp;&nbsp;a) Data Source ( 2 marks )

The data source for the streaming job is the 'mobile' directory in DBFS, which the job reads JSON files from using spark.readStream. Because it is a streaming file source, Spark watches the folder and, when a new file is added, processes the new data in the next micro-batch.

&nbsp;&nbsp;&nbsp;&nbsp;b) Output Mode ( 2 marks )

Output mode defines how the processed data is written to the output after every trigger. Using complete mode in this job means that the entire result of the aggregation (the count of actions per window) is written out every time new data is processed, not just the rows that changed.
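
The difference between output modes can be illustrated with a small plain-Python simulation. This is not Spark code: `run_batches` is a hypothetical function, the batches are made-up lists of actions, and windows are ignored to keep the sketch short. "update" is included only as a contrast to the "complete" mode used in this job.

```python
from collections import Counter

def run_batches(batches, mode="complete"):
    """Simulate what a counting aggregation emits after each micro-batch."""
    state = Counter()   # running counts kept across batches
    outputs = []
    for batch in batches:
        changed = set()
        for action in batch:
            state[action] += 1
            changed.add(action)
        if mode == "complete":
            emitted = dict(state)                         # whole result table
        elif mode == "update":
            emitted = {a: state[a] for a in changed}      # only changed rows
        else:
            raise ValueError(f"unsupported mode: {mode}")
        outputs.append(emitted)
    return outputs

# Hypothetical input: two micro-batches of actions
batches = [["open", "open", "close"], ["open", "swipe"]]
print(run_batches(batches, "complete"))
print(run_batches(batches, "update"))
```

In complete mode the second output still contains the unchanged "close" row, which is why the console shows the full table after every batch.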

&nbsp;&nbsp;&nbsp;&nbsp;c) Trigger Type ( 2 marks )

In a streaming job, data typically flows into the system continuously. The trigger determines when Spark checks the source for new data and processes it. The code in step 2 does not set a trigger, so Spark uses its default: micro-batch processing, where each new micro-batch starts as soon as the previous one finishes.
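
For comparison, a trigger can also be set explicitly. The sketch below is not part of the lab code: `start_console_query` is a hypothetical wrapper, and the 30-second interval is an arbitrary choice. It assumes `df` is a streaming aggregation such as `action_count_df`.

```python
def start_console_query(df, interval="30 seconds"):
    """Start a console-sink query that runs a micro-batch every `interval`."""
    return (
        df.writeStream
        .format("console")
        .option("truncate", "false")
        .outputMode("complete")
        .trigger(processingTime=interval)  # fixed-interval micro-batches
        .start()
    )

# In the notebook (not run here):
# query = start_console_query(action_count_df, "30 seconds")
```

With a fixed interval, a file added to the folder would only show up in the output at the next interval boundary rather than as soon as possible.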

&nbsp;&nbsp;&nbsp;&nbsp;d) Data Sink ( 2 marks )

The data sink in the streaming job is the destination where the processed data is written. In our case the sink is the console, set by format("console"). Separately, calling display(action_count_df) in Databricks renders the same aggregation as a table in the notebook, which is how the results were viewed.


10. Describe the processing which is being done by the streaming job
( summarize in a paragraph what the streaming job does) ( 4 marks)

This streaming job continuously reads JSON files from the 'mobile' folder set up in DBFS. Each JSON file contains records with an id, an action, and a timestamp (ts). The job parses this data, groups it into 10-minute time windows, and counts the number of occurrences of each action type within each window. The processed results are then displayed through the Databricks console and table view. As new files are added to the 'mobile' folder, the job processes the new additions and updates the counts dynamically. The streaming job ensures that any new actions are processed and reported shortly after they arrive, giving a good example of real-time data monitoring, collection, and analysis.