# PySpark data exploration

In [2]:
import os

from pyspark.sql import SparkSession

In [3]:
spark = (
    SparkSession.builder
    # Name that identifies your application in the cluster's dashboard:
    .appName("big data course")
    # Cluster manager to connect to. "local[*]" runs Spark on this machine with
    # all available cores; use your cluster's master URL to run on a real cluster.
    .master("local[*]")
    # Upper limit on the total size of serialized results returned to the driver per action:
    .config("spark.driver.maxResultSize", "4g")
    # Memory allocated to the driver (master/orchestrator):
    .config("spark.driver.memory", "1g")
    # Number of executors requested for the application:
    .config("spark.executor.instances", "5")
    # Alternatively, let Spark add executors under heavy load and release them when idle:
    # .config("spark.dynamicAllocation.enabled", True)
    # .config("spark.dynamicAllocation.minExecutors", 1)
    # .config("spark.dynamicAllocation.maxExecutors", 4)
    # Memory allocated to each executor:
    .config("spark.executor.memory", "4g")
    # Number of CPU cores per executor, which sets how many tasks it can run in parallel:
    # .config("spark.executor.cores", 4)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/26 23:31:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
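
One thing worth checking in the configuration above: `spark.driver.maxResultSize` is set to `4g` while `spark.driver.memory` is only `1g`, so a large collected result could exhaust the driver's memory before the result-size limit is ever reached. The sketch below compares the two values in plain Python. `parse_size` and `result_limit_fits_driver` are hypothetical helpers written for this illustration, not part of PySpark, and the parser only handles the `k`/`m`/`g`/`t` suffixes.

```python
import re

# Binary multipliers for the size suffixes used in this notebook's config values.
_UNIT_BYTES = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def parse_size(value):
    """Convert a size string such as "4g" or "512m" to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_BYTES[unit]

def result_limit_fits_driver(max_result_size, driver_memory):
    """Return True when the result-size limit does not exceed driver memory."""
    return parse_size(max_result_size) <= parse_size(driver_memory)

# Values copied from the session configuration above.
print(result_limit_fits_driver("4g", "1g"))  # False
```

If the check returns `False`, either lowering the result-size limit or raising the driver memory would make the two settings consistent; which one is appropriate depends on your cluster.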


## Clarification on the "Spark UI" link that doesn't work here.

In [20]:
# The hub hostname depends on which Binder federation member serves this session (here: gke2).
binder_url = "https://hub.gke2.mybinder.org" + os.environ["JUPYTERHUB_SERVICE_PREFIX"]
binder_url

'https://hub.gke2.mybinder.org/user/ouseful-templat--binder-pyspark-u9gz9x0l/'

In [23]:
external_spark_ui_url = binder_url + "proxy/4040"
external_spark_ui_url

'https://hub.gke2.mybinder.org/user/ouseful-templat--binder-pyspark-u9gz9x0l/proxy/4040'

Normally, you should be able to open the Spark UI by entering the above URL in a new browser tab. If a token is requested, you can copy it from the output of the next cell. In this environment, however, you will probably get an HTTP 4xx or 5xx error.

This *should* work, since the Docker container of this Binder project uses jupyter-server-proxy to expose ports that are listening locally through the Binder URL. See the documentation: https://jupyter-server-proxy.readthedocs.io/en/latest/arbitrary-ports-hosts.html

In practice it *doesn't* work.
There seems to be an issue with proxying the Spark UI port 4040: the UI cannot be reached from outside the container, but it can be reached from inside it (see the next cells). The cause appears to lie in how the Binder container is configured.

So let's accept that with this Binder container we cannot inspect the Spark UI while experimenting. Do check it on your company's cluster when you move from this toy environment to "the real thing".
For now, we just use the Binder container to play around with PySpark.
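
The URL concatenation in the cells above depends on the trailing slash of `JUPYTERHUB_SERVICE_PREFIX`. As a minimal sketch, the same URL can be assembled in a way that tolerates missing or duplicate slashes. `build_proxy_url` is a hypothetical helper written for this illustration, and `hub.example.org` and `/user/demo/` are placeholder values, not real endpoints.

```python
def build_proxy_url(hub_url, service_prefix, port=4040):
    """Join the hub URL, the Jupyter service prefix, and the proxy path for a port."""
    parts = [hub_url.rstrip("/"), service_prefix.strip("/"), "proxy", str(port)]
    # Drop empty segments so a bare "/" prefix does not produce a double slash.
    return "/".join(part for part in parts if part)

# Placeholder values; in the notebook the prefix comes from
# os.environ["JUPYTERHUB_SERVICE_PREFIX"].
print(build_proxy_url("https://hub.example.org", "/user/demo/"))
# https://hub.example.org/user/demo/proxy/4040
```

This only builds the address; whether the proxy actually serves the port still depends on the container configuration discussed above.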

In [19]:
! jupyter server list

Currently running servers:
http://jupyter-ouseful-2dtemplat-2d-2dbinder-2dpyspark-2du9gz9x0l:8888/user/ouseful-templat--binder-pyspark-u9gz9x0l/?token=-vnIMZ4cSOStJX-kBRUmWA :: /home/jovyan


In [15]:
# url used internally within this binder's docker container:
spark.sparkContext.uiWebUrl

'http://jupyter-ouseful-2dtemplat-2d-2dbinder-2dpyspark-2du9gz9x0l:4040'

In [36]:
# A workaround, not a real fix: fetch the UI from inside the container to show that it *is* live.
import requests
from IPython.display import HTML

def render_local_spark_ui(subpage="jobs"):
    """Fetch a Spark UI page over the container-internal URL and render it inline."""
    subpage_url = spark.sparkContext.uiWebUrl + ("" if subpage is None else "/" + subpage)
    print(subpage_url)
    response = requests.get(subpage_url, timeout=10)
    response.raise_for_status()  # Fail loudly if the UI does not answer with a 2xx status.
    return HTML(data=response.text)

render_local_spark_ui(subpage="jobs")

http://jupyter-ouseful-2dtemplat-2d-2dbinder-2dpyspark-2du9gz9x0l:4040/jobs


## Exploratory data analysis

We explore a small sample of the popular flights dataset: https://github.com/ozlerhakan/datacamp/blob/master/Introduction%20to%20PySpark/flights_small.csv

In [9]:
# Without inferSchema=True, every column is read as a string.
df = spark.read.csv("flights_small.csv", header=True, inferSchema=True)
df.printSchema()
df.show(10)  # For wide tables, df.limit(10).toPandas() renders more readably.

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8
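
The schema above reports `dep_time`, `dep_delay`, `air_time` and several other numeric-looking columns as `string`, even though `inferSchema=True` was set. One plausible cause, stated here as an assumption since the file contents are not inspected in this notebook, is a non-numeric placeholder such as `NA` in those columns. The pure-Python sketch below is a simplified illustration of that effect, not Spark's actual inference code; `infer_column_type` and the sample values are hypothetical.

```python
def infer_column_type(values, null_tokens=()):
    """Return "integer" if every non-null value parses as an int, else "string"."""
    non_null = [v for v in values if v not in null_tokens]
    if not non_null:
        return "string"  # Nothing to infer from.
    try:
        for v in non_null:
            int(v)
    except ValueError:
        return "string"
    return "integer"

dep_delay = ["-8", "2", "NA", "15"]  # hypothetical sample values

print(infer_column_type(dep_delay))                        # string
print(infer_column_type(dep_delay, null_tokens=("NA",)))   # integer
```

If the file does use such a placeholder, passing the `nullValue` option to `spark.read.csv` (for example `nullValue="NA"`) should let the reader treat it as null during inference.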

In [None]:
df.write.parquet("flights_results.parquet")

In [3]:
spark.stop()