# **Exercise 1: Start a Spark Standalone Cluster**
In this exercise, you will initialize a Spark Standalone Cluster with a Master and two Workers.
Next, you will start a PySpark shell that connects to the cluster and open the Spark Application
Web UI to monitor it. You will use the Theia terminal to run commands and Docker-based
containers to launch the Spark processes.

### **Task A:** Download [cars.csv](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/cars.csv)

### **Task B:** Initialize the Cluster

In [None]:
%%bash

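# Pull the spark-master image if it is not present locally and create the container.
# Port 7077 (the master RPC port) is published so that a driver running outside
# the container can reach the master at spark://<host>:7077.
docker run \
    --name spark-master \
    -h spark-master \
    -e ENABLE_INIT_DAEMON=false \
    -p 4040:4040 \
    -p 7077:7077 \
    -p 8080:8080 \
    -v `pwd`:/home/root \
    -d bde2020/spark-master:3.1.1-hadoop3.2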

# Pull the spark-worker image if it is not present locally and create 2 worker containers.
# The last argument of each command is the image (the result of docker build on the
# Dockerfile) used to create the container.
docker run \
    --name spark-worker-1 \
    --link spark-master:spark-master \
    -e ENABLE_INIT_DAEMON=false \
    -p 8081:8081 \
    -v `pwd`:/home/root \
    -d bde2020/spark-worker:3.1.1-hadoop3.2

docker run \
    --name spark-worker-2 \
    --link spark-master:spark-master \
    -e ENABLE_INIT_DAEMON=false \
    -e SPARK_WORKER_WEBUI_PORT=8082 \
    -p 8082:8082 \
    -v `pwd`:/home/root \
    -d bde2020/spark-worker:3.1.1-hadoop3.2
# "-e SPARK_WORKER_WEBUI_PORT=8082" is needed because the Dockerfile opens only port 8081
# for every worker. Setting it makes worker 2 serve its web UI on 8082 instead;
# without it, you won't be able to access the worker 2 UI.
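
Once the containers are up, it can help to confirm that the published web UI ports accept connections before moving on. The following is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `cluster_status`, and the assumption that Docker publishes the ports on `localhost`, are illustrative and not part of the lab.

```python
import socket

# Web UI ports published by the docker run commands above.
# The host name used below is an assumption; adjust it to your Docker host.
CLUSTER_PORTS = {"master UI": 8080, "worker 1 UI": 8081, "worker 2 UI": 8082}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def cluster_status(host="localhost"):
    """Map each UI name to whether its published port accepts connections."""
    return {name: port_is_open(host, port) for name, port in CLUSTER_PORTS.items()}
```

Calling `cluster_status()` returns a dictionary of booleans, one per UI. A `False` entry only says that nothing is listening on that port; `docker ps` and `docker logs <container>` are the next places to look.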

### **Task C:** Connect a PySpark Shell to the Cluster, Open the UI and Create a DataFrame

#### 1. Connect to PySpark
**NOTE:** This step is needed only when running the upcoming steps in an interactive terminal.

In [None]:
%%bash

# docker exec runs pyspark inside the spark-master container; pyspark then connects to the spark-master service.
docker exec \
    -it `docker ps | grep spark-master | awk '{print $1}'` \
    /spark/bin/pyspark \
    --master spark://spark-master:7077

#### 2. Create a DataFrame

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('cars_csv_processor').master('spark://172.17.160.1:7077').getOrCreate()

25/12/06 00:26:05 WARN Utils: Your hostname, MICKYXPS15 resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/12/06 00:26:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/06 00:26:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/06 00:26:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/12/06 00:26:17 ERROR TransportClient: Failed to send RPC RPC 5174377288581702508 to /172.17.160.1:4040: io.netty.channel.StacklessClosedChannelException
io.netty.channel.StacklessClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
25/12/06 00:26:17 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 172.

25/12/06 00:27:18 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master


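In [None]:
# Stop the session only to reset it (for example, after a failed connection);
# re-create it with the cell above before running the cells below.
spark.stop()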

In [None]:
df = spark.read.csv("/home/root/cars.csv", header=True, inferSchema=True) \
        .repartition(32) \
        .cache()
df.show()

# **Exercise 2: Run an SQL Query and Debug in the Application UI**
In this exercise, you will define a user-defined function (UDF) and run a query that results in an
error. You will locate that error in the application UI and find its root cause. Finally, you will
correct the error and re-run the query.

### **Task A:** Run an SQL Query

#### 1. Define a UDF to show engine type.

In [None]:
from pyspark.sql.functions import udf
import time

In [None]:
@udf("string")
def engine(cylinders):
    time.sleep(0.2)  # Intentionally delay task
    eng = {4: "inline-four", 6: "V6", 8: "V8"}
    return eng.get(cylinders, "other")
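
Because a UDF runs on the executors, it is often easier to check its lookup logic in plain Python first. The sketch below mirrors the body of `engine` without the Spark decorator and without the artificial delay; the names `ENGINE_BY_CYLINDERS`, `engine_label`, and `engine_label_strict` are illustrative and do not appear in the lab.

```python
# Same mapping as in the UDF above.
ENGINE_BY_CYLINDERS = {4: "inline-four", 6: "V6", 8: "V8"}


def engine_label(cylinders):
    """Return the engine type, falling back to "other" for unmapped values."""
    return ENGINE_BY_CYLINDERS.get(cylinders, "other")


def engine_label_strict(cylinders):
    """Variant using plain indexing: raises KeyError for unmapped values."""
    return ENGINE_BY_CYLINDERS[cylinders]
```

The difference matters for rows whose cylinder count is missing from the mapping, or is `None`: `dict.get` returns the default, while plain indexing raises `KeyError`. Inside a UDF, such an exception would fail the task.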

#### 2. Add the UDF as a column in the DataFrame

In [None]:
df = df.withColumn("engine", engine("cylinders"))

#### 3. Group the DataFrame by “cylinders” and aggregate other columns

In [None]:
dfg = df.groupby("cylinders")
dfa = dfg.agg({"mpg": "avg", "engine": "first"})
dfa.show()
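
To make the aggregation concrete, the sketch below reproduces the same grouping in plain Python on a few made-up rows. The row values are hypothetical and are not taken from `cars.csv`. The function name `group_avg_first` is illustrative, and the result keys `avg(mpg)` and `first(engine)` are chosen to resemble the column names Spark generates for a dictionary-style `agg`.

```python
from collections import defaultdict

# Hypothetical sample rows; the real values come from cars.csv.
rows = [
    {"cylinders": 4, "mpg": 30.0, "engine": "inline-four"},
    {"cylinders": 4, "mpg": 26.0, "engine": "inline-four"},
    {"cylinders": 8, "mpg": 15.0, "engine": "V8"},
    {"cylinders": 8, "mpg": 13.0, "engine": "V8"},
    {"cylinders": 6, "mpg": 20.0, "engine": "V6"},
]


def group_avg_first(rows):
    """Group rows by "cylinders", then average "mpg" and take the first "engine"."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cylinders"]].append(row)
    return {
        key: {
            "avg(mpg)": sum(r["mpg"] for r in members) / len(members),
            "first(engine)": members[0]["engine"],
        }
        for key, members in groups.items()
    }


result = group_avg_first(rows)
```

In Spark, `first` returns whichever row a group happens to see first, and that order is not guaranteed after a repartition. Here it is harmless because `engine` is derived from `cylinders`, so every row in a group carries the same value.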