### Spark notebook ###

This notebook will only work in a Jupyter session running on `mathmadslinux2p`.

You can start your own Jupyter session on `mathmadslinux2p` and open this notebook in Chrome on the MADS Windows server by following the steps below.

**Steps**

1. Log in to the MADS Windows server using https://mathportal.canterbury.ac.nz/.
2. Download or copy this notebook to your home directory.
3. Open PowerShell and run `ssh mathmadslinux2p`.
4. Run `start_pyspark_notebook` or `/opt/anaconda3/bin/jupyter-notebook --ip 132.181.129.68 --port $((8000 + $((RANDOM % 999))))`.
5. Copy and paste the URL provided in the shell window into Chrome on the MADS Windows server.
6. Open the notebook from the Jupyter root directory (which is your home directory).
7. Run `start_spark()` to start a Spark session in the notebook.
8. Run `stop_spark()` before closing the notebook, or kill your Spark application by hand using the link in the Spark UI.

In [1]:
# Run this cell to import pyspark and to define start_spark() and stop_spark()

import findspark

findspark.init()

import getpass
import pandas
import pyspark
import random
import re

from IPython.display import display, HTML
from pyspark import SparkContext
from pyspark.sql import SparkSession


# Functions used below

def username():
    """Get username with any domain information removed.
    """

    return re.sub('@.*', '', getpass.getuser())


def dict_to_html(d):
    """Convert a Python dictionary into a two column table for display.
    """

    html = []

    html.append(f'<table width="100%" style="width:100%; font-family: monospace;">')
    for k, v in d.items():
        html.append(f'<tr><td style="text-align:left;">{k}</td><td>{v}</td></tr>')
    html.append(f'</table>')

    return ''.join(html)


def show_as_html(df, n=20):
    """Leverage existing pandas jupyter integration to show a spark dataframe as html.
    
    Args:
        n (int): number of rows to show (default: 20)
    """

    display(df.limit(n).toPandas())

    
def display_spark():
    """Display the status of the active Spark session if one is currently running.
    """
    
    if 'spark' in globals() and 'sc' in globals():

        name = sc.getConf().get("spark.app.name")
        
        html = [
            f'<p><b>Spark</b></p>',
            f'<p>The spark session is <b><span style="color:green">active</span></b>, look for <code>{name}</code> under the running applications section in the Spark UI.</p>',
            f'<ul>',
            f'<li><a href="http://mathmadslinux2p.canterbury.ac.nz:8080/" target="_blank">Spark UI</a></li>',
            f'<li><a href="{sc.uiWebUrl}" target="_blank">Spark Application UI</a></li>',
            f'</ul>',
            f'<p><b>Config</b></p>',
            dict_to_html(dict(sc.getConf().getAll())),
            f'<p><b>Notes</b></p>',
            f'<ul>',
            f'<li>The spark session <code>spark</code> and spark context <code>sc</code> global variables have been defined by <code>start_spark()</code>.</li>',
            f'<li>Please run <code>stop_spark()</code> before closing the notebook or restarting the kernel or kill <code>{name}</code> by hand using the link in the Spark UI.</li>',
            f'</ul>',
        ]
        display(HTML(''.join(html)))
        
    else:
        
        html = [
            f'<p><b>Spark</b></p>',
            f'<p>The spark session is <b><span style="color:red">stopped</span></b>, confirm that <code>{username() + " (jupyter)"}</code> is under the completed applications section in the Spark UI.</p>',
            f'<ul>',
            f'<li><a href="http://mathmadslinux2p.canterbury.ac.nz:8080/" target="_blank">Spark UI</a></li>',
            f'</ul>',
        ]
        display(HTML(''.join(html)))


# Functions to start and stop spark

def start_spark(executor_instances=2, executor_cores=1, worker_memory=1, master_memory=1):
    """Start a new Spark session and define globals for SparkSession (spark) and SparkContext (sc).
    
    Args:
        executor_instances (int): number of executors (default: 2)
        executor_cores (int): number of cores per executor (default: 1)
        worker_memory (float): worker memory (default: 1)
        master_memory (float): master memory (default: 1)
    """

    global spark
    global sc

    user = username()
    
    cores = executor_instances * executor_cores
    partitions = cores * 4
    port = 4000 + random.randint(1, 999)

    spark = (
        SparkSession.builder
        .master("spark://masternode2:7077")
        .config("spark.driver.extraJavaOptions", f"-Dderby.system.home=/tmp/{user}/spark/")
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.instances", str(executor_instances))
        .config("spark.executor.cores", str(executor_cores))
        .config("spark.cores.max", str(cores))
        .config("spark.executor.memory", f"{worker_memory}g")
        .config("spark.driver.memory", f"{master_memory}g")
        .config("spark.driver.maxResultSize", "0")
        .config("spark.sql.shuffle.partitions", str(partitions))
        .config("spark.ui.port", str(port))
        .appName(user + " (jupyter)")
        .getOrCreate()
    )
    sc = SparkContext.getOrCreate()
    
    display_spark()

    
def stop_spark():
    """Stop the active Spark session and delete globals for SparkSession (spark) and SparkContext (sc).
    """

    global spark
    global sc

    if 'spark' in globals() and 'sc' in globals():

        spark.stop()

        del spark
        del sc

    display_spark()


# Make css changes to improve spark output readability

html = [
    '<style>',
    'pre { white-space: pre !important; }',
    'table.dataframe td { white-space: nowrap !important; }',
    'table.dataframe thead th:first-child, table.dataframe tbody th { display: none; }',
    '</style>',
]
display(HTML(''.join(html)))

In [2]:
# Run this cell to start a spark session in this notebook

start_spark(executor_instances=2, executor_cores=1, worker_memory=1, master_memory=1)

spark.app.name,kda115 (jupyter)
spark.dynamicAllocation.enabled,false
spark.master,spark://masternode2:7077
spark.driver.port,40717
spark.executor.id,driver
spark.driver.memory,1g
spark.driver.host,mathmadslinux2p.canterbury.ac.nz
spark.sql.warehouse.dir,file:/users/home/kda115/Spark/Assignment/Supplementary_Material_1/Processing_Notebook/spark-warehouse
spark.app.startTime,1726369120221
spark.app.id,app-20240915145841-1087


## Processing - Daily 

In [3]:
# Import the pyspark API used to define data types
from pyspark.sql import functions as F
from pyspark.sql.types import *

### A. Determine the default block size. What are the sizes of the 2023 and 2024 files in HDFS? How many blocks does each file use? What are the individual block sizes for 2023?

In [5]:
# The default block size of HDFS
! hdfs getconf -confKey "dfs.blocksize"

134217728


In [6]:
# How many blocks does the 2024 daily climate summary file use?
! hdfs fsck /data/ghcnd/daily/2024.csv.gz -files -blocks 

Connecting to namenode via http://masternode2:9870/fsck?ugi=kda115&files=1&blocks=1&path=%2Fdata%2Fghcnd%2Fdaily%2F2024.csv.gz
FSCK started by kda115 (auth:SIMPLE) from /192.168.40.11 for path /data/ghcnd/daily/2024.csv.gz at Sun Aug 25 10:38:45 NZST 2024

/data/ghcnd/daily/2024.csv.gz 88831735 bytes, replicated: replication=8, 1 block(s):  OK
0. BP-700027894-132.181.129.68-1626517177804:blk_1074220563_479763 len=88831735 Live_repl=8


Status: HEALTHY
 Number of data-nodes:	32
 Number of racks:		1
 Total dirs:			0
 Total symlinks:		0

Replicated Blocks:
 Total size:	88831735 B
 Total files:	1
 Total blocks (validated):	1 (avg. block size 88831735 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	4
 Average block replication:	8.0
 Missing blocks:		0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Blocks queued for replication:	0


2024 Climate Summary:
1. File Size: 88,831,735 bytes (~88.8 MB)
2. Number of Blocks: 1 block.
3. Block Size: Since the file size is smaller than the default block size, it fits entirely within a single block.

In [7]:
# How many blocks does the 2023 daily climate summary file use?
! hdfs fsck /data/ghcnd/daily/2023.csv.gz -files -blocks 

Connecting to namenode via http://masternode2:9870/fsck?ugi=kda115&files=1&blocks=1&path=%2Fdata%2Fghcnd%2Fdaily%2F2023.csv.gz
FSCK started by kda115 (auth:SIMPLE) from /192.168.40.11 for path /data/ghcnd/daily/2023.csv.gz at Sun Aug 25 10:40:08 NZST 2024

/data/ghcnd/daily/2023.csv.gz 168357302 bytes, replicated: replication=8, 2 block(s):  OK
0. BP-700027894-132.181.129.68-1626517177804:blk_1074220535_479735 len=134217728 Live_repl=8
1. BP-700027894-132.181.129.68-1626517177804:blk_1074220536_479736 len=34139574 Live_repl=8


Status: HEALTHY
 Number of data-nodes:	32
 Number of racks:		1
 Total dirs:			0
 Total symlinks:		0

Replicated Blocks:
 Total size:	168357302 B
 Total files:	1
 Total blocks (validated):	2 (avg. block size 84178651 B)
 Minimally replicated blocks:	2 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	4
 Average block replication:	8.0
 Missing blo

2023 Climate Summary:
1. File Size: 168,357,302 bytes (~168.4 MB)
2. Number of Blocks: 2 blocks.
3. Block Sizes:
    Block 1: 134,217,728 bytes (128 MiB).
    Block 2: 34,139,574 bytes (~34.1 MB).
4. Explanation: The first block is filled to the default block size of 134,217,728 bytes (128 MiB), and the remaining 34,139,574 bytes (~34.1 MB) are stored in a second block.

Conclusion: 
1. 2024 Summary: The 2024 file is small enough to fit within a single block, using 1 block.
2. 2023 Summary: The 2023 file exceeds the block size, so it uses 2 blocks: one full block and one partial block.
3. Neither file is large. A block only holds as many bytes as the data requires, so the last (or only) block of each file is partially filled, and the 2023 file needs a second block solely because it is larger than 134,217,728 bytes.
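The block counts above follow directly from dividing each file size by the default block size. A minimal sketch of that arithmetic, using the sizes reported by `hdfs fsck` (the helper name `hdfs_block_sizes` is hypothetical and not part of any HDFS API):

```python
# Default block size reported by `hdfs getconf -confKey "dfs.blocksize"` (128 MiB).
BLOCK_SIZE = 134_217_728


def hdfs_block_sizes(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block needed to store a file.

    Hypothetical helper for illustration: every block is full except
    possibly the last one, which holds the remainder.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


print(hdfs_block_sizes(88_831_735))   # 2024.csv.gz -> [88831735]
print(hdfs_block_sizes(168_357_302))  # 2023.csv.gz -> [134217728, 34139574]
```

The results match the `len=` values printed by `hdfs fsck` for both files.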

### B. Load and count the number of observations in 2023 and then separately in 2024

In [4]:
# Define the schema of the daily data explicitly using pyspark.sql.types
schema_daily = StructType([
    StructField("Station_ID", StringType(), False),
    StructField("DATE",  StringType(), True),
    StructField("Element", StringType(), True),
    StructField("VALUE", IntegerType(), True),
    StructField("Measurement_Flag", StringType(), True),
    StructField("Quality_Flag", StringType(), True),
    StructField("Source_Flag", StringType(), True),
    StructField("Observation_Time", StringType(), True)
])

In [19]:
# Load the 2023 daily dataset 
daily_2023 = spark.read.csv("hdfs:///data/ghcnd/daily/2023.csv.gz", schema=schema_daily, header=False)

# Count the number of observations in 2023
count_daily_2023 = daily_2023.count()

print(f"The number of observations in daily 2023 dataset is {count_daily_2023}")

The number of observations in daily 2023 dataset is 37867272


In [20]:
# Load the 2024 daily dataset
daily_2024 = spark.read.csv("hdfs:///data/ghcnd/daily/2024.csv.gz", schema=schema_daily, header=False)

# Count the number of observations in 2024 
count_daily_2024 = daily_2024.count()

print(f"The number of observations in daily 2024 data is {count_daily_2024}")

The number of observations in daily 2024 data is 19720790


#### B.1 How many tasks were executed by each stage of each job?

In [18]:
# Print out the number of partitions
num_partitions_2023 = daily_2023.rdd.getNumPartitions()
print(f"Number of partitions: {num_partitions_2023}")

Number of partitions: 1


Two jobs were executed, one per `count()` call:
1. Read and count the 2023 daily data (Job 1):
    - 2 stages: 1 for reading the data and 1 for counting.
2. Read and count the 2024 daily data (Job 2):
    - 2 stages: 1 for reading the data and 1 for counting.

In summary:
1. Number of Jobs: There are 2 jobs, one for loading and counting the 2023 data and one for loading and counting the 2024 data.
2. Number of Stages: Each job has 2 stages, one for reading the data and one for counting, making a total of 4 stages across both jobs.
3. Number of Tasks: Each stage ran only 1 task, because each dataset (2023 and 2024) has only 1 partition.

#### B.2 Did the number of tasks executed correspond to the number of blocks in each input? 

The number of tasks does not always correspond to the number of blocks in each input. For example, the 2023 file is stored across 2 HDFS blocks (134,217,728 bytes and 34,139,574 bytes). With a block size of 128 MiB and a splittable format, this data could typically be processed by 2 tasks, one per block. However, the file is compressed with Gzip, which does not support splitting: a Gzip stream can only be decoded from its beginning, so reading from the start of the second block would produce corrupted chunks. Spark must therefore read the entire file as a single unit, without parallelizing the read across blocks, regardless of how many HDFS blocks the file spans (Apache Spark and Data Compression, n.d.). As a result, Spark creates only 1 task to process the entire file.

Reference: Apache Spark and data compression. (n.d.).
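The reason a Gzip file cannot be handed to two tasks can be illustrated outside Spark with Python's standard `gzip` module. This is only a sketch: the rows are synthetic stand-ins rather than real GHCN data, and cutting the bytes at the midpoint merely imitates an HDFS block boundary.

```python
import gzip
import zlib


def can_decompress(chunk):
    """Return True if `chunk` decodes as a complete gzip stream on its own."""
    try:
        gzip.decompress(chunk)
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Synthetic rows standing in for a daily CSV file (not real observations).
rows = "".join(f"STATION{i:05d},20230101,TMAX,{i % 400}\n" for i in range(50_000))
compressed = gzip.compress(rows.encode("utf-8"))

# Cut the compressed bytes in two, the way HDFS cuts a file at a block boundary.
midpoint = len(compressed) // 2
first_part, second_part = compressed[:midpoint], compressed[midpoint:]

print(can_decompress(compressed))   # True: the whole file decodes
print(can_decompress(first_part))   # False: the stream is truncated
print(can_decompress(second_part))  # False: no gzip header, starts mid-stream
```

Neither half is usable by itself, which is consistent with Spark reading the 2-block 2023 file in a single task.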

### C. Load and count the total number of observations in the years 2014 to 2023 (inclusive)?

- Glob patterns allow you to specify multiple files or directories using wildcard characters. 
- For example, to load data for the years 2014 through 2023, you can use `{2014,2015,2016,...,2023}` in the file path.
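Typing ten years by hand is error-prone, so the brace pattern can also be generated. A minimal sketch, assuming the same directory layout as above (`year_glob` is a hypothetical helper, not a Spark or Hadoop function):

```python
def year_glob(first_year, last_year, base="hdfs:///data/ghcnd/daily"):
    """Build a brace glob covering first_year..last_year inclusive.

    Hypothetical helper: `base` defaults to the daily directory used in
    this notebook and each file is assumed to be named <year>.csv.gz.
    """
    if first_year > last_year:
        raise ValueError("first_year must not be greater than last_year")
    years = ",".join(str(year) for year in range(first_year, last_year + 1))
    # Doubled braces produce literal { and } inside an f-string.
    return f"{base}/{{{years}}}.csv.gz"


print(year_glob(2014, 2023))
```

The returned string can be passed to `spark.read.csv` in place of a hand-written path.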

In [16]:
# Read the daily dataset from 2014 to 2023 using a glob pattern
daily_data_2014_2023 = spark.read.csv(
    "hdfs:///data/ghcnd/daily/{2014,2015,2016,2017,2018,2019,2020,2021,2022,2023}.csv.gz",
    schema=schema_daily,
    header=False,
)


# Count the number of observations from 2014 - 2023 
total_count_2014_2023 = daily_data_2014_2023.count()

# Print the result 
print(f"Total number of observations from 2014 to 2023: {total_count_2014_2023}")

Total number of observations from 2014 to 2023: 370803270


#### C.1 How many tasks were executed by each stage and how does this number correspond to your input?

- This operation involves only one job, which loads and counts the observations for the years 2014 to 2023. The result is 370,803,270 observations.
- The job has 2 stages, shown as stages 10 and 11 in Figure 5.
- Stage 10 processed an input of 1581.1 MiB, which was split into 10 tasks (Figure tasks). The input consists of 10 Gzip files, one per year from 2014 to 2023. Because a Gzip file cannot be split, each file became one partition, and each partition was processed by a separate task. The number of tasks therefore corresponds to the number of input files rather than to the input size. Hence, there are 10 tasks.
- Stage 11 performed the final count, which required only 1 task because it only combines the partial counts produced by stage 10 and needs no further partitioning.
- In conclusion, there is 1 job with 2 stages and a total of 11 tasks: 10 tasks in stage 10 and 1 task in stage 11.

### D. How many tasks do you think would run in parallel when loading and applying transformations to all years in daily? 

#### D.1 Can you think of any practical way you could increase this, either in Spark or by changing how the data is stored in HDFS?

- Change in HDFS: The 2023 and 2024 daily files are stored in HDFS as Gzip files, which cannot be split, so Spark must read each file in a single task without parallel processing. By storing the data in a format that supports splitting, such as uncompressed CSV or plain text, Spark can divide each file into multiple partitions and process them in parallel. This change would reduce the processing time when loading and counting the data.

- Change in Spark: Parallelism is limited when working with Gzip files because each file is processed as a single partition. However, we can still speed up reading and counting by making effective use of the cluster's resources, that is, by processing as many Gzip files concurrently as possible. With 4 executors and 2 cores per executor, 8 cores are available for task execution, so Spark can run up to 8 tasks in parallel. Since each Gzip file is processed as a single task, 8 Gzip files can be processed concurrently. According to Figure 7, Spark launched 8 of the 10 tasks in parallel across the 8 available cores at 10:16:53. Once tasks from this first batch completed, the remaining 2 files were processed on the same cores, launching at 10:17:09. By keeping all 8 cores busy, we reduce the overall processing time.
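The 8-then-2 launch pattern described above can be approximated with a simple batch model. This is an idealized sketch, not Spark's scheduler: real tasks start as soon as any core frees up rather than in strict batches, and `task_waves` is a hypothetical helper.

```python
def task_waves(num_tasks, executor_instances, executor_cores):
    """Split num_tasks into batches that fit the available cores.

    Idealized model: assumes one task per core and that a new batch
    starts only when cores become free.
    """
    slots = executor_instances * executor_cores
    if slots <= 0:
        raise ValueError("at least one core is required")
    waves = []
    remaining = num_tasks
    while remaining > 0:
        batch = min(slots, remaining)
        waves.append(batch)
        remaining -= batch
    return waves


# 10 Gzip files (one task each) on 4 executors with 2 cores each.
print(task_waves(10, executor_instances=4, executor_cores=2))  # [8, 2]

# The same 10 files with the notebook defaults of 2 executors and 1 core each.
print(task_waves(10, executor_instances=2, executor_cores=1))  # [2, 2, 2, 2, 2]
```

Under this model the number of tasks running at once is the smaller of the number of Gzip files and the number of cores, so adding cores beyond the number of files gives no further benefit unless the files are stored in a splittable format.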

In [5]:
# Run this cell before closing the notebook or kill your spark application by hand using the link in the Spark UI

stop_spark()