# Spark Clusters

### Starting a Spark Cluster
`cd $SPARK_HOME/bin`

(for PowerShell)
`cd $env:SPARK_HOME\bin`

`spark-class org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080`

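When building a SparkSession, you now have to specify the master as `spark://localhost:7077`.
The master's web UI is served on the port passed to `--webui-port`, so with the command above it is at `http://localhost:8080/`. A worker's own UI defaults to port 8081.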
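Before building a session, it can help to confirm that the master is actually listening. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and the host and port assume the master command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 7077 is the port passed to the master via --port (assumed unchanged).
    print("master reachable:", is_port_open("localhost", 7077))
```

This only shows that something accepts connections on the port. It does not prove that the listener is a Spark master.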

In [2]:
import findspark
findspark.init()

import pyspark  
from pyspark.sql import SparkSession
import os
import pyspark.sql.functions as F
import json

In [6]:
# Use a context manager so the credentials file is closed after reading
with open("../credentials/credentials.json") as f:
    credentials = json.load(f)
spark = SparkSession\
    .builder\
    .master("spark://localhost:7077")\
    .appName('HelloWorld')\
    .config("spark.driver.memory", "6G") \
    .config("spark.executor.memory", "6G") \
    .config("spark.driver.maxResultSize", "6G") \
    .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")\
    .config(f'fs.azure.account.key.{credentials["storage_account_name"]}.blob.core.windows.net',credentials["storage_account_key"])\
    .getOrCreate()

spark
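The session above reads two keys from the credentials file, `storage_account_name` and `storage_account_key`. As a rough sketch, the file can be validated before the builder runs, so that a missing key fails with a clear message rather than partway through building the session. The helper names `load_credentials` and `account_key_option` are hypothetical.

```python
import json
from pathlib import Path

# The two keys the session builder above reads from the JSON file.
REQUIRED_KEYS = ("storage_account_name", "storage_account_key")


def load_credentials(path):
    """Read the credentials JSON and fail early if a required key is missing."""
    with Path(path).open(encoding="utf-8") as fh:
        credentials = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"missing credential keys: {missing}")
    return credentials


def account_key_option(credentials):
    """Return the (name, value) pair for the storage account key setting."""
    account = credentials["storage_account_name"]
    name = f"fs.azure.account.key.{account}.blob.core.windows.net"
    return name, credentials["storage_account_key"]
```

The returned pair can be unpacked into the builder, for example `.config(*account_key_option(credentials))`.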

### Starting a worker

`spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 --host localhost`

With the master and a worker running, test the cluster with the code below (copied from `09_azure_blob.ipynb`).

In [10]:
df_yellow = spark.read.parquet("../resources/datasets/yellow/*/*")
df_green = spark.read.parquet("../resources/datasets/green/*/*")

df_green = df_green\
    .withColumnRenamed("lpep_pickup_datetime", "pickup_datetime") \
    .withColumnRenamed("lpep_dropoff_datetime", "dropoff_datetime")

# Same for yellow
df_yellow = df_yellow\
    .withColumnRenamed("tpep_pickup_datetime", "pickup_datetime") \
    .withColumnRenamed("tpep_dropoff_datetime", "dropoff_datetime")

common_columns = set(df_yellow.columns) & set(df_green.columns) # intersection of columns using & operator

df_yellow = df_yellow.select(*common_columns).withColumn("service_type", F.lit("yellow"))
df_green = df_green.select(*common_columns).withColumn("service_type", F.lit("green"))

df_trips_data = df_green.unionAll(df_yellow)

df_trips_data.show(5)

+------------------+------------+----------+--------------------+---------------+---------------------+----------+-----------+-------------+------------+--------+------------+-------------------+-------+------------+-------------------+------------+-----+------------+
|store_and_fwd_flag|tolls_amount|RatecodeID|congestion_surcharge|passenger_count|improvement_surcharge|tip_amount|fare_amount|trip_distance|payment_type|VendorID|PULocationID|    pickup_datetime|mta_tax|total_amount|   dropoff_datetime|DOLocationID|extra|service_type|
+------------------+------------+----------+--------------------+---------------+---------------------+----------+-----------+-------------+------------+--------+------------+-------------------+-------+------------+-------------------+------------+-----+------------+
|                 N|         0.0|       1.0|                null|            5.0|                  0.3|       0.0|        3.0|          0.0|         2.0|       2|         264|2018-12-21 23:17:2
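A Python `set` has no defined order, which is why the columns in the output above appear shuffled relative to the source files. If a stable column order matters, one option is to keep the order of one DataFrame's columns and filter by membership in the other. This is a sketch over plain lists of column names; the helper name `ordered_common_columns` and the placeholder names `only_in_yellow` and `only_in_green` are hypothetical.

```python
def ordered_common_columns(left_columns, right_columns):
    """Columns present in both lists, in the order they appear in left_columns."""
    right = set(right_columns)  # set gives fast membership checks
    return [name for name in left_columns if name in right]


# Illustrative column lists, not the real schemas.
yellow_cols = ["VendorID", "pickup_datetime", "dropoff_datetime", "only_in_yellow"]
green_cols = ["only_in_green", "dropoff_datetime", "VendorID", "pickup_datetime"]

common = ordered_common_columns(yellow_cols, green_cols)
print(common)  # ['VendorID', 'pickup_datetime', 'dropoff_datetime']
```

With real DataFrames, this would be called as `ordered_common_columns(df_yellow.columns, df_green.columns)` and the result passed to `select(*common)` on both sides.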

### Converting to script
`jupyter nbconvert --to=script "./spark/10_spark_cluster.ipynb"`

In [None]:
# Parquet stores the schema in the file itself, so no "header" option is needed
df_trips_data.write.parquet(f"wasbs://zoomcampcontainer@{credentials['storage_account_name']}.blob.core.windows.net/tripsdata")

You can now run the script from the command line, which is useful when running on a cluster.

`python 10_spark_cluster.py`

With argument passing:

`python 10_spark_cluster.py --input_green "2019/*" --input_yellow "2019/*" --output "trips_new/tripsdata_2019"`

```
# From the script's source code
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=False, default="*/*")
parser.add_argument('--input_yellow', required=False, default="*/*")
parser.add_argument('--output', required=False, default='tripsdata')
args = parser.parse_args()
```
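To illustrate how these arguments might feed into the read and write paths, here is a sketch. The script's actual wiring is not shown, so the function `build_paths`, the `base_dir` default, and the way the arguments are joined onto the paths are assumptions based on the paths used earlier in the notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the three optional arguments; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_green", required=False, default="*/*")
    parser.add_argument("--input_yellow", required=False, default="*/*")
    parser.add_argument("--output", required=False, default="tripsdata")
    return parser.parse_args(argv)


def build_paths(args, account_name,
                base_dir="../resources/datasets",   # assumed from the notebook
                container="zoomcampcontainer"):     # container used in the notebook
    """Hypothetical helper: turn parsed arguments into input and output paths."""
    return {
        "green": f"{base_dir}/green/{args.input_green}",
        "yellow": f"{base_dir}/yellow/{args.input_yellow}",
        "output": f"wasbs://{container}@{account_name}.blob.core.windows.net/{args.output}",
    }
```

With the defaults, the input paths reduce to the `*/*` globs used in the notebook cells, so running the script without arguments would read the same data.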

### Spark Submits
What is `spark-submit`?
`spark-submit` is the command-line tool that ships with Apache Spark for submitting applications to a cluster. It is the primary interface for running Spark applications on a cluster: it sets up the application's environment, distributes the application code and its dependencies, and launches the application. Options such as `--master` and `--executor-memory` go before the script name; everything after the script name is passed to the script itself.

`spark-submit --master spark://localhost:7077 --executor-memory 4G --total-executor-cores 2 10_spark_cluster.py --input_green "2019/*" --input_yellow "2019/*" --output "trips_new_spark_submit/tripsdata_2019"`
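If the submit command needs to be launched from Python (for example from a scheduler task), it can be assembled as an argument list. This is a sketch only; the helper name `build_submit_command` is hypothetical, and the flags are the ones used in the command above.

```python
import shlex


def build_submit_command(script, master, executor_memory,
                         total_executor_cores, script_args=()):
    """Assemble a spark-submit invocation as an argument list."""
    return [
        "spark-submit",
        "--master", master,
        "--executor-memory", executor_memory,
        "--total-executor-cores", str(total_executor_cores),
        script,          # spark-submit options must come before the script
        *script_args,    # everything after the script goes to the script
    ]


command = build_submit_command(
    "10_spark_cluster.py",
    master="spark://localhost:7077",
    executor_memory="4G",
    total_executor_cores=2,
    script_args=["--input_green", "2019/*", "--input_yellow", "2019/*",
                 "--output", "trips_new_spark_submit/tripsdata_2019"],
)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string means the `*` globs reach the script unexpanded by a shell. Note that a master hardcoded in the script via `.master(...)` takes precedence over the `--master` flag, so scripts meant for `spark-submit` usually leave it out.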