# Hands-On Exercise: Advanced Spark Programming Using PySpark

Objective: This exercise introduces students to advanced Spark programming with PySpark. It covers RDDs, DataFrames, Datasets, Spark SQL, data analytics, and machine learning with Spark MLlib. By the end, students will have hands-on experience implementing transformations and performing analytics using PySpark.

## Step 1: Spark RDDs, DataFrames, and Datasets

**What are RDDs, DataFrames, and Datasets?**

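- RDDs (Resilient Distributed Datasets): Fault-tolerant, immutable collections of elements that are partitioned across the cluster and can be operated on in parallel.

- DataFrames: Distributed collections of data organized into named columns, similar to tables in a relational database.

- Datasets: A typed extension of the DataFrame API that provides compile-time type safety. Datasets are available only in Scala and Java; PySpark exposes RDDs and DataFrames, so this exercise uses those two.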


### Task 1: Create RDDs and Perform Basic Operations

1. Initialize a Spark Session:

In [None]:
from pyspark.sql import SparkSession

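# Option A: simple Spark session running locally (uses all local cores)
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("BasicSpark") \
    .getOrCreate()

# Option B: Spark session on YARN with extra configuration.
# getOrCreate() returns any session that is already running and ignores the
# settings below, so stop the local session first.
# Add further options with: .config("spark.some.config.option", "some-value")
spark.stop()
spark = SparkSession.builder \
    .master("yarn") \
    .appName("AdvancedSpark") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "3") \
    .config("spark.cores.max", "3") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()
# Note: spark.cores.max applies to standalone mode and has no effect on YARN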


2. **Create an RDD**: Create an RDD from a list of numbers and perform basic transformations:

In [None]:
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Map transformation: Multiply each element by 2
rdd_mapped = rdd.map(lambda x: x * 2)

# Collect results
print(rdd_mapped.collect())


## Step 2: Spark DataFrame Operations

### Task 2: Perform DataFrame Transformations and Actions

1. Join DataFrames: Assume you have two DataFrames: `sales_df` (sales data) and `product_df` (product information). Join them:

In [None]:
sales_df = spark.read.csv(
    "hdfs:///user/datatech-labs/retail_data/sales_data.csv",
    header=True,
    inferSchema=True
)

product_df = spark.read.csv(
    "hdfs:///user/datatech-labs/retail_data/products_data.csv",
    header=True,
    inferSchema=True
)

retail_df = sales_df.join(product_df, sales_df["product_id"] == product_df["product_id"]) \
                    .drop(product_df["product_id"])
retail_df.show(5)

# write to hdfs
retail_df.write.csv(
    "hdfs:///user/datatech-labs/retail_data/retail_data.csv",
    header=True,
    mode="overwrite"
)


2. DataFrame Transformations: Use transformation functions like `withColumn`, `drop`, `distinct`:

In [None]:
# Add a new column with a calculated value (unit price = total amount / quantity)
retail_df = retail_df.withColumn("price", retail_df["total_amount"] / retail_df["quantity"])
retail_df.show(5)

# Drop columns
retail_df = retail_df.drop("discount")
retail_df.show(5)


3. Cache and Persist: Cache a DataFrame so that repeated actions reuse the computed result instead of recomputing it:

In [None]:
retail_df.cache()
retail_df.count()  # Action to trigger caching


### Task 3: Explore DataFrames in Detail

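1. Load a DataFrame from a CSV File: Load the joined retail data written in Task 2 back into a DataFrame. The file was saved before the `price` column was added, so recompute that column after loading:

In [None]:
retail_df = spark.read.csv(
    "hdfs:///user/datatech-labs/retail_data/retail_data.csv",
    header=True,
    inferSchema=True
)

# The CSV was written before "price" was derived, so recompute it here
retail_df = retail_df.withColumn("price", retail_df["total_amount"] / retail_df["quantity"])
retail_df.show(5)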


2. Explore DataFrame Schema: Check the schema of the DataFrame:

In [None]:
retail_df.printSchema()


3. Select and Filter Data: Perform basic operations on DataFrames:

In [None]:
# Select specific columns
retail_df.select("product_id", "quantity", "price").show()

# Filter rows where quantity is greater than 8
retail_df.filter(retail_df["quantity"] > 8).show()


4. GroupBy and Aggregations: Perform aggregations on the DataFrame:

In [None]:
retail_df.groupBy("product_id") \
    .agg({
        "quantity": "sum",
        "price": "avg"
    }) \
    .show()


## Step 3: Spark SQL and Analytics

### Task 4: Write SQL Queries on DataFrames

1. Create Temporary Views for SQL Queries:

In [None]:
retail_df.createOrReplaceTempView("retail_data")


2. Execute SQL Queries: Run SQL queries on the DataFrame:

In [None]:
result = spark.sql("""
    SELECT product_id, SUM(quantity) as total_quantity 
    FROM retail_data 
    GROUP BY product_id
""")
result.show()


3. Join Using SQL: Write SQL for joining the DataFrames:

In [None]:
sales_df.createOrReplaceTempView("sales")
product_df.createOrReplaceTempView("products")

spark.sql("""
    SELECT s.product_id, p.product_name, SUM(s.quantity) as total_quantity
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY s.product_id, p.product_name
""").show()


## Step 4: Connecting to Hive and Reading Data from Hive Tables

### Task 5: Query Data from Hive Tables

1. **Connect to Hive**: Ensure that Hive is properly configured and that the Hive Metastore is accessible from Spark.

**Refer to "Hive_installation" in week3 for more details.**

2. Read Hive Table into Spark: Query data from Hive and load it into a Spark DataFrame.

In [None]:
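# Enable Hive support. This must be set when the session is first created:
# getOrCreate() reuses any running session, so stop it (spark.stop()) or
# restart the kernel before running this cell.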
spark = SparkSession.builder \
    .appName("Spark Hive Exercise") \
    .enableHiveSupport() \
    .getOrCreate()

# Load data from a Hive table
hive_df = spark.sql("SELECT * FROM retail_dw.sales")

hive_df.show()


3. Write Data to Hive: Write data back to a Hive table.

In [None]:
# enrich data
enriched_hive_df = hive_df.withColumn("price", hive_df["total_amount"] / hive_df["quantity"])

enriched_hive_df.write \
    .mode("overwrite") \
    .saveAsTable("retail_dw.enriched_sales")


## Step 5: Submitting PySpark Jobs with spark-submit

In this step, students will learn how to use the `spark-submit` command to run PySpark applications on a cluster, how to include dependencies, and how to monitor the job's progress through the Spark UI.

### Task 6: Submit a PySpark Script Using `spark-submit`

1. Create a PySpark Script: Write a PySpark script and save it as `retail_analysis.py`. Here’s a basic script to process retail data:

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Retail Data Analysis") \
    .getOrCreate()

# Load retail sales data from CSV
retail_df = spark.read.csv(
    "hdfs:///user/datatech-labs/retail_data/retail_data.csv",
    header=True,
    inferSchema=True
)

# Perform basic transformation
df_filtered = retail_df.filter(retail_df['total_amount'] > 100)

# Show results
df_filtered.show()

# Release cluster resources when the script finishes
spark.stop()


2. Submit the Script Using spark-submit: Use the `spark-submit` command to submit the PySpark script to the Spark cluster.

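- `--master`: Specifies the master URL (in this case, `yarn` for a Hadoop cluster).

- `--deploy-mode`: Defines where the driver program runs. In `client` mode the driver runs on the machine that submits the job, so the output of `show()` appears in your terminal. In `cluster` mode the driver runs inside the cluster, and its output goes to the driver logs.

- `retail_analysis.py`: The path to the PySpark script, given as the last argument (here, a file in the current directory).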

In [None]:
$ spark-submit \
    --master yarn \
    --deploy-mode client \
    retail_analysis.py

### Task 7: Including Dependencies in spark-submit

Sometimes, you need to include external dependencies (such as additional libraries) when submitting a job. There are two common ways to do this:

1. Include a Python Package: Use the `--py-files` option to add additional Python files or ZIP files that contain dependencies:

In [None]:
$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --py-files /path/to/dependencies.zip \
    /path/to/retail_analysis.py
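
One way to produce the ZIP file passed to `--py-files` is with Python's standard `zipfile` module. The sketch below is only an illustration: the package name `retail_utils`, its `pricing.py` module, and the helper `build_py_files_zip` are hypothetical. Files are stored relative to the package's parent directory so that `import retail_utils` resolves once the archive is on the Python path.

```python
import tempfile
import zipfile
from pathlib import Path


def build_py_files_zip(package_dir, zip_path):
    """Zip a Python package so it can be passed to spark-submit via --py-files."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*.py")):
            # Keep the package folder as the top-level entry in the archive
            zf.write(path, path.relative_to(package_dir.parent).as_posix())
    return zip_path


# Demo with a throwaway, hypothetical package
work_dir = Path(tempfile.mkdtemp())
pkg = work_dir / "retail_utils"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "pricing.py").write_text(
    "def unit_price(total, qty):\n    return total / qty\n"
)

zip_file = build_py_files_zip(pkg, work_dir / "dependencies.zip")

with zipfile.ZipFile(zip_file) as zf:
    archived = zf.namelist()
print(archived)
```

The resulting `dependencies.zip` can then be supplied to `--py-files` as in the command above.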


2. Include a JAR File: If the job requires external Java libraries, use the `--jars` option to include the JAR file:

In [None]:
$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --jars /path/to/external-library.jar \
    /path/to/retail_analysis.py


### Task 8: Monitor the Job Through Spark UI

Spark provides a web-based UI for monitoring job execution and resource usage.

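1. Access the Spark UI: Each running application serves its own Spark UI from the driver, by default on port `4040` (if that port is taken, Spark tries `4041`, `4042`, and so on). On YARN, the UI is also reachable through the Resource Manager web interface, which listens on port `8088` by default.

- If running in client mode, open a web browser and navigate to:
    `http://<driver-host>:4040`

- If running in cluster mode, the driver runs on a cluster node, so open the YARN Resource Manager web interface, find the application, and follow its ApplicationMaster tracking link:
    `http://<resource-manager-host>:8088`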

2. Monitor Job Progress:

- *Stages*: View how each job is split into stages and how the tasks in each stage are executed.

- *Tasks*: On a stage's detail page, monitor task completion and check for failed or slow tasks.

- *Executors*: Check storage memory, cores, and task time for each executor.

- *Storage*: See cached RDDs and DataFrames and how much memory or disk they occupy.


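3. Check Logs: In the Spark UI, the *Executors* tab links to the stdout and stderr logs of the driver and each executor, which help diagnose failures or performance issues. For applications run on YARN with log aggregation enabled, the logs can also be retrieved after the job finishes with `yarn logs -applicationId <application-id>`.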