# Spark Execution Concepts

## Spark SQL

Spark SQL is a core module of Apache Spark for working with structured data. It lets you query data with SQL or with the DataFrame API, a programmatic interface for structured data, and it integrates with the other components of Spark's unified stack.

### Key Concepts in Spark SQL

1. **SparkSession**:
   - The SparkSession is the entry point to Spark SQL functionality. It's the central object you use to interact with Spark. Think of it as the starting point for your Spark application.
   - It's responsible for creating DataFrames and executing SQL queries.
   - It wraps the older ```SparkContext``` (still used for core Spark functionality such as RDDs) and supersedes ```SQLContext``` (the entry point in earlier versions of Spark SQL). The underlying context remains available as ```spark.sparkContext```.

2. **Application**:
   - A Spark application is the overall program you write using Spark. It consists of a driver program and executors.
   - The driver program is the main process that coordinates the execution of your Spark application. It creates the ```SparkSession```, defines the transformations and actions on the data, and schedules the tasks to be executed on the cluster.
   - Executors are processes that run on worker nodes in the cluster. They execute the tasks assigned to them by the driver program.
   - The ```SparkSession``` is created within the driver program and is associated with the application.

3. **DataFrames**:
   - A DataFrame is a distributed collection of data organized into named columns. It's conceptually similar to a table in a relational database or a Pandas DataFrame in Python, but it's distributed across the Spark cluster.
   - DataFrames provide a structured view of the data, allowing you to perform operations using SQL queries or a DataFrame API.
   - They are the primary abstraction for working with structured data in Spark SQL.
  
4. **SQL Queries**:
   - Spark SQL allows you to run SQL queries against DataFrames once they are registered as temporary views or tables.
   - This makes it easy for those familiar with SQL to work with Spark.
   - Spark SQL supports a wide range of SQL features, including joins, aggregations, and window functions.

5. **Catalyst Optimizer**:
   - Spark SQL uses the Catalyst optimizer to optimize the execution of SQL queries and DataFrame operations.
   - Catalyst analyzes the query and generates an optimized execution plan, which can significantly improve performance.

6. **Tungsten Engine**:
   - Tungsten is another performance optimization within Spark SQL.
   - It focuses on improving memory management and code generation to make data processing more efficient.

### Import Modules

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

In [2]:
# Initialize SparkSession object in a PySpark application
spark = SparkSession.builder.appName("TotalOrdersPerRegionCountry").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/02/14 10:32:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
type(spark)

pyspark.sql.session.SparkSession

### Load Dataset

In [None]:
# Import data file
data_file = "data/sales_records.csv"

# Load data file into a PySpark DataFrame
sales_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(data_file)

                                                                                

In [5]:
type(sales_df)

pyspark.sql.dataframe.DataFrame

### Transformations and Actions

#### Select and Aggregation Operations on a DataFrame

In [6]:
# Select specific columns from the DataFrame
sales_df.select("Region", "Country", "Order ID").show(n=10, truncate=False)

+---------------------------------+---------------------+---------+
|Region                           |Country              |Order ID |
+---------------------------------+---------------------+---------+
|Middle East and North Africa     |Azerbaijan           |535113847|
|Central America and the Caribbean|Panama               |874708545|
|Sub-Saharan Africa               |Sao Tome and Principe|854349935|
|Sub-Saharan Africa               |Sao Tome and Principe|892836844|
|Central America and the Caribbean|Belize               |129280602|
|Europe                           |Denmark              |473105037|
|Europe                           |Germany              |754046475|
|Middle East and North Africa     |Turkey               |772153747|
|Europe                           |United Kingdom       |847788178|
|Asia                             |Kazakhstan           |471623599|
+---------------------------------+---------------------+---------+
only showing top 10 rows



In [7]:
# Group by Region and Country columns and count the number of Order IDs
count_sales_df = (sales_df.select("Region", "Country", "Order ID")
                  .groupBy("Region", "Country").agg(count("Order ID").alias("Total Orders"))
                  .orderBy("Total Orders", ascending=False))

# Show the resulting DataFrame
count_sales_df.show(n=10, truncate=False)

# Count the total number of rows in the DataFrame
print(f"Total Rows: {count_sales_df.count()}")

                                                                                

+---------------------------------+-------------+------------+
|Region                           |Country      |Total Orders|
+---------------------------------+-------------+------------+
|Sub-Saharan Africa               |Sudan        |623         |
|Australia and Oceania            |New Zealand  |593         |
|Europe                           |Vatican City |590         |
|Europe                           |Malta        |589         |
|Sub-Saharan Africa               |Mozambique   |589         |
|Middle East and North Africa     |Tunisia      |584         |
|Asia                             |Cambodia     |584         |
|Central America and the Caribbean|Panama       |578         |
|Sub-Saharan Africa               |Rwanda       |576         |
|Sub-Saharan Africa               |Cote d'Ivoire|575         |
+---------------------------------+-------------+------------+
only showing top 10 rows

Total Rows: 185


#### Narrow versus Wide Transformations

**Narrow Transformations** 
- **Definition:** A narrow transformation is one where each partition of the input RDD contributes to at most one partition of the output RDD. In the common case (for example ```map``` or ```filter```), the data required to compute an output partition resides entirely within the corresponding input partition.
- **Characteristics:**
  - **No data shuffling:** Do not require data to be shuffled between partitions or across the network.
  - **Localized computation:** Each partition can be processed independently without needing data from other partitions.   
  - **Examples:** ```map```, ```filter```, ```flatMap```, ```union```
- **Benefits:**
  - **Performance:** Narrow transformations are generally faster because they avoid the overhead of data shuffling.   
  - **Fault tolerance:** If a partition is lost, it can be easily recomputed from the corresponding input partition.   

**Wide Transformations**
- **Definition:** A wide transformation is one where a single partition of the input RDD may contribute to multiple partitions of the output RDD. This means that records from different input partitions must be redistributed (shuffled) and combined to produce the output.
- **Characteristics:**
  - **Data shuffling:** Wide transformations require data to be shuffled across the network, which can be expensive.   
  - **Global computation:** Computing a partition in the output RDD may require data from multiple input partitions.   
  - **Examples:** ```groupByKey```, ```reduceByKey```, ```join``` (when the inputs are not already co-partitioned), ```distinct```, ```repartition```, ```intersection```. Note that ```coalesce``` is not included here: by default it merges existing partitions without a full shuffle, so it behaves as a narrow transformation.
- **Considerations:**
  - **Performance:** Wide transformations are generally slower than narrow transformations due to the data shuffling involved.
  - **Resource intensive:** Wide transformations can be more resource intensive as they may require significant network I/O and memory.


**Key Difference: Data Shuffling**
The fundamental difference between narrow and wide transformations lies in whether data shuffling is required. Narrow transformations do not involve shuffling, while wide transformations do. Data shuffling is the process of redistributing data across the cluster based on some key or condition. It is a costly operation that involves moving data over the network, which can be a bottleneck.   
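
The difference can be illustrated with a toy model in plain Python. This is not Spark code: the list-of-lists representation of partitions, the helper names (`narrow_map`, `shuffle_by_key`, `wide_count_by_key`), and the byte-sum partitioner are all assumptions made for illustration only.

```python
from collections import defaultdict

# Toy model: a dataset is a list of partitions, each a list of (key, value) records.
partitions = [
    [("Europe", 1), ("Asia", 1)],
    [("Europe", 1), ("Africa", 1)],
    [("Asia", 1), ("Europe", 1)],
]

def narrow_map(parts, fn):
    """Apply fn to each record; every output partition reads one input partition."""
    return [[fn(record) for record in part] for part in parts]

def shuffle_by_key(parts, num_output_partitions):
    """Route each record to an output partition chosen by its key (a shuffle)."""
    output = [[] for _ in range(num_output_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic stand-in for a hash partitioner
            target = sum(key.encode("utf-8")) % num_output_partitions
            output[target].append((key, value))
    return output

def wide_count_by_key(parts, num_output_partitions):
    """Sum values per key; needs a shuffle so equal keys land in one partition."""
    shuffled = shuffle_by_key(parts, num_output_partitions)
    results = []
    for part in shuffled:
        totals = defaultdict(int)
        for key, value in part:
            totals[key] += value
        results.append(dict(totals))
    return results

# Narrow: partition boundaries are unchanged, no record moves
upper = narrow_map(partitions, lambda kv: (kv[0].upper(), kv[1]))

# Wide: records with the same key are gathered from all input partitions
counts = wide_count_by_key(partitions, num_output_partitions=2)
merged = {key: total for part in counts for key, total in part.items()}

print(upper[0])
print(merged)
```

In the narrow case each output partition is built from exactly one input partition, while in the wide case every input partition can send records to every output partition, which is the movement a real cluster pays for in network I/O.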

**Importance**
Understanding the difference between narrow and wide transformations is crucial for optimizing Spark applications. By choosing appropriate transformations and minimizing data shuffling, you can significantly improve the performance of your Spark jobs.