# Resilient Distributed Datasets

# What Are the Low-Level APIs?

In Apache Spark, the low-level APIs refer to the foundational programming interfaces that enable developers to interact directly with distributed data and control the execution of tasks in a Spark application. The two main low-level APIs in Spark are:

1. **Resilient Distributed Datasets (RDD):**
   - The RDD is the fundamental data structure in Spark: an immutable, fault-tolerant, distributed collection of objects that can be processed in parallel. RDDs can be created from external data sources, from in-memory collections, or by transforming other RDDs with operations such as `map`, `filter`, and `flatMap`.

   **Example: Creating and Transforming RDDs:**
   ```python
   from pyspark import SparkContext

   # Create a SparkContext
   sc = SparkContext("local", "RDDExample")

   # Create an RDD from a list
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Transformations on RDD
   squared_rdd = rdd.map(lambda x: x**2)

   # Actions on RDD
   result = squared_rdd.reduce(lambda x, y: x + y)

   print(result)
   ```

2. **Spark Core API:**
   - Spark Core is the foundation of the Spark ecosystem and provides the basic functionality for distributed computing, including task scheduling, memory management, and fault recovery. The RDD API itself is part of Spark Core, as are shared variables (broadcast variables and accumulators). Spark applications use the Spark Core API to interact with the underlying Spark engine.

   **Example: Using Spark Core for Word Count:**
   ```python
   from pyspark import SparkContext

   # Create a SparkContext
   sc = SparkContext("local", "SparkCoreExample")

   # Read a text file and perform word count
   text_file = sc.textFile("path/to/your/text/file.txt")
   word_count = (
       text_file.flatMap(lambda line: line.split(" "))  # split each line into words
       .map(lambda word: (word, 1))                     # pair each word with a count of 1
       .reduceByKey(lambda a, b: a + b)                 # sum the counts per word
   )

   # Collect and print the results
   results = word_count.collect()
   for result in results:
       print(result)
   ```

While RDD and Spark Core provide powerful capabilities for distributed computing, higher-level abstractions like DataFrames and Spark SQL have been introduced to simplify development for common use cases and improve optimization. In modern Spark applications, developers often use higher-level APIs for ease of use and productivity, but understanding the low-level APIs can be beneficial for fine-grained control and optimization in specific scenarios.

# When to Use the Low-Level APIs?
You should generally use the lower-level APIs in three situations:
1. You need some functionality that you cannot find in the higher-level APIs; for example, very tight control over physical data placement across the cluster.
2. You need to maintain a legacy codebase written using RDDs.
3. You need to do some custom shared variable manipulation.

# RDD

Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark, serving as the building block for distributed and fault-tolerant data processing. RDDs provide a distributed collection of objects that can be processed in parallel across a cluster. They are designed to be fault-tolerant, meaning they can recover from node failures during computation.

Here are key characteristics and concepts related to RDDs:

1. **Immutable Distributed Collection:**
   - RDDs are immutable, meaning their content cannot be changed once created. However, transformations on RDDs result in new RDDs. This immutability simplifies fault recovery and parallel processing.

2. **Resilience:**
   - RDDs are resilient to node failures. If a partition of an RDD is lost due to a node failure, Spark can recover the lost data by recomputing the lost partition from the original data and lineage information (information about the sequence of transformations).

3. **Partitioning:**
   - RDDs are divided into partitions, which are the basic units of parallelism. Each partition can be processed on a separate node in the Spark cluster. The number of partitions can be configured to control parallelism.

4. **Transformation and Action:**
   - RDDs support two types of operations: transformations and actions. Transformations create a new RDD from an existing one (e.g., `map`, `filter`), while actions return a value to the driver program or write data to an external storage system (e.g., `reduce`, `collect`).

5. **Lazy Evaluation:**
   - RDD transformations are evaluated lazily: they are not executed when they are declared, only when an action is called. Because Spark sees the whole chain of transformations before running anything, it can pipeline consecutive narrow transformations into a single stage and avoid materializing intermediate results.

6. **Data Lineage:**
   - RDDs keep track of their lineage, which is a record of the sequence of transformations used to build the RDD. This lineage information is crucial for fault recovery. If a partition is lost, Spark can recompute it using the original data and the lineage.

7. **Wide vs. Narrow Transformations:**
   - Transformations are categorized as either narrow or wide. Narrow transformations (e.g., `map`, `filter`) do not require shuffling of data between partitions, while wide transformations (e.g., `groupByKey`, `reduceByKey`) require data shuffling, which can be more expensive.

8. **Caching:**
   - RDDs can be cached in memory to improve the performance of iterative algorithms or when the same dataset is used multiple times. Caching allows Spark to keep the data in memory across multiple stages of a computation.

Here's a simple example of using RDDs:



```python
from pyspark import SparkContext
import warnings
warnings.filterwarnings("ignore")

# Create a SparkContext
sc = SparkContext("local", "RDDExample")

# Create an RDD from a list
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations
squared_rdd = rdd.map(lambda x: x**2)

# Action
result = squared_rdd.reduce(lambda x, y: x + y)

print(result)  # prints 55
```


# Types Of RDD

RDDs in Apache Spark can be broadly categorized into two types based on their content and behavior: "generic" RDDs and key-value RDDs.

1. **Generic RDDs:**
   - A generic RDD, often referred to as a "simple" or "non-pair" RDD, is a distributed collection of elements without any inherent key-value structure. Each element in the RDD is treated as an independent unit of data, and transformations and actions are applied to the entire dataset.

   **Example:**
   ```python
   from pyspark import SparkContext

   # Create a SparkContext
   sc = SparkContext("local", "GenericRDDExample")

   # Create a generic RDD
   data = [1, 2, 3, 4, 5]
   rdd = sc.parallelize(data)

   # Transformation and Action on a generic RDD
   squared_rdd = rdd.map(lambda x: x**2)
   result = squared_rdd.reduce(lambda x, y: x + y)

   print(result)
   ```

2. **Key-Value RDDs:**
   - Key-value RDDs, also known as "pair" RDDs, represent data as key-value pairs. Each element in the RDD is a tuple (key, value), where both key and value can be of any data type. Key-value RDDs are particularly useful for operations that involve grouping or aggregating data based on keys.

   **Example:**
   ```python
   from pyspark import SparkContext

   # Create a SparkContext
   sc = SparkContext("local", "KeyValueRDDExample")

   # Create a key-value RDD
   data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
   kv_rdd = sc.parallelize(data)

   # Transformation and Action on a key-value RDD
   age_sum_by_name = kv_rdd.reduceByKey(lambda x, y: x + y)
   age_sum_by_name.collect()
   ```

In the second example, the key-value RDD holds (name, age) tuples. The `reduceByKey` transformation sums the ages for each unique name; because every name appears only once in this data, the result contains the same three pairs as the input.

The choice between using a generic RDD or a key-value RDD depends on the nature of the data and the operations you intend to perform. Key-value RDDs are especially beneficial for certain types of operations, such as grouping by key, reducing by key, and joining with other key-value RDDs. They provide a convenient way to express relationships and dependencies in your data.

# Transformations On RDD

Transformations in Apache Spark are operations that derive a new RDD from one or more existing RDDs, usually by applying a function to each element. Transformations are lazy: they are not executed immediately but are recorded in the RDD's lineage, which is executed only when an action is called. Here are some common transformations in Spark:

1. **`map(func)`**
   - Applies a function to each element of the RDD and returns a new RDD of the results.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "MapTransformationExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Square each element using map transformation
   squared_rdd = rdd.map(lambda x: x**2)
   ```

2. **`filter(func)`**
   - Returns a new RDD containing only the elements that satisfy the given predicate.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "FilterTransformationExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Filter even numbers using filter transformation
   filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
   ```

3. **`flatMap(func)`**
   - Similar to `map`, but each input item can be mapped to zero or more output items.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "FlatMapTransformationExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Duplicate each element using flatMap transformation
   duplicated_rdd = rdd.flatMap(lambda x: [x, x])
   ```

4. **`union(other)`**
   - Returns a new RDD that contains the elements of the source RDD and the other RDD.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "UnionTransformationExample")
   rdd1 = sc.parallelize([1, 2, 3])
   rdd2 = sc.parallelize([3, 4, 5])

   # Combine two RDDs using union transformation
   combined_rdd = rdd1.union(rdd2)
   ```

5. **`groupByKey()`**
   - Groups the elements of the RDD by key and returns a new RDD of `(key, iterable of values)` pairs.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "GroupByKeyTransformationExample")
   kv_rdd = sc.parallelize([("Alice", 25), ("Bob", 30), ("Alice", 22)])

   # Group elements by key using groupByKey transformation
   grouped_rdd = kv_rdd.groupByKey()
   ```

6. **`reduceByKey(func)`**
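   - Merges the values for each key using an associative and commutative reduce function. Values are combined locally within each partition before the shuffle, which usually makes it cheaper than `groupByKey` followed by a reduction.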

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "ReduceByKeyTransformationExample")
   kv_rdd = sc.parallelize([("Alice", 25), ("Bob", 30), ("Alice", 22)])

   # Sum ages by key using reduceByKey transformation
   sum_by_key_rdd = kv_rdd.reduceByKey(lambda x, y: x + y)
   ```

These are just a few examples of transformation operations in Spark. Transformations are the building blocks for expressing complex data manipulations on your RDDs. Remember that transformations are evaluated lazily, and their execution is triggered when an action is called.

# Actions On RDD

Actions in Apache Spark are operations that return a value to the driver program or write data to an external storage system. Unlike transformations, actions trigger the execution of the computation plan built by transformations. Here are some common actions in Spark:

1. **`collect()`**
   - Returns all the elements of the RDD to the driver program as a Python list. Be cautious when using `collect()` on large datasets: it brings all the data to the driver, where it might not fit in memory.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "CollectActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Collect all elements to the driver program
   result = rdd.collect()
   ```

2. **`count()`**
   - Returns the number of elements in the RDD.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "CountActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Count the number of elements in the RDD
   count = rdd.count()
   ```

3. **`first()`**
   - Returns the first element of the RDD.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "FirstActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Get the first element of the RDD
   first_element = rdd.first()
   ```

4. **`take(n)`**
   - Returns the first `n` elements of the RDD.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "TakeActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Get the first three elements of the RDD
   first_three_elements = rdd.take(3)
   ```

5. **`reduce(func)`**
   - Aggregates the elements of the RDD using a specified reduce function. The function should be associative and commutative.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "ReduceActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Sum all elements using reduce action
   sum_result = rdd.reduce(lambda x, y: x + y)
   ```

6. **`foreach(func)`**
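   - Applies a function to each element of the RDD for its side effects, such as writing to an external system or updating an accumulator. The function runs on the executors, so on a cluster anything it prints appears in the executor logs rather than in the driver's output.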

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "ForeachActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Print each element using foreach action
   rdd.foreach(lambda x: print(x))
   ```

7. **`saveAsTextFile(path)`**
   - Writes the elements of the RDD as text to a directory at the specified path, producing one `part-` file per partition. The target directory must not already exist.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "SaveAsTextFileActionExample")
   rdd = sc.parallelize([1, 2, 3, 4, 5])

   # Save the RDD as a text file
   rdd.saveAsTextFile("path/to/save/text/files")
   ```

These actions trigger the actual computation of the RDD and return values to the driver program or write data to external storage. It's important to note that actions are the operations that lead to the execution of the entire Spark computation plan built by transformations.

# Spark Context

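In Apache Spark, a `SparkContext` is the fundamental entry point for interacting with a Spark cluster. It represents the connection to the cluster and is responsible for coordinating the execution of distributed Spark applications. The `SparkContext` is typically created once in a Spark application and serves as its central coordinator. Only one `SparkContext` can be active at a time in a given process, so calling the constructor again while one is running raises an error; use `SparkContext.getOrCreate()` to reuse the active context, or stop it first.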

Here are key aspects and functionalities associated with the `SparkContext`:

1. **Creation:**
   - The `SparkContext` is created when a Spark application is initiated. It connects to the Spark cluster and coordinates the distribution of tasks.

   ```python
   from pyspark import SparkContext

   # Create a SparkContext
   sc = SparkContext("local", "MySparkApplication")
   ```

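   In this example, the `SparkContext` is created with the `"local"` master, indicating that the Spark application runs in local mode on a single machine with one worker thread (`"local[*]"` uses one thread per available core). On a real cluster, you would replace `"local"` with the cluster's master URL, for example `spark://<host>:7077` for a standalone cluster or `yarn` for a YARN cluster.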

2. **Configuration:**
   - The `SparkContext` is configured through a `SparkConf` object, which holds settings such as the application name, the master URL, and resource-related properties.

   ```python
   from pyspark import SparkConf, SparkContext

   # Configure Spark
   conf = SparkConf().setAppName("MySparkApplication").setMaster("local")
   sc = SparkContext(conf=conf)
   ```

3. **Access to Cluster Resources:**
   - The `SparkContext` obtains the resources the application needs, including memory and CPU. It communicates with the cluster manager (e.g., Apache Mesos, Hadoop YARN, Kubernetes, or Spark's standalone cluster manager) to request executors for running tasks.

4. **Creation of RDDs:**
   - RDDs (Resilient Distributed Datasets) are the fundamental data structure in Spark. The `SparkContext` is used to create RDDs from external data sources or by parallelizing existing data structures.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "RDDCreationExample")

   # Create an RDD from a list
   rdd = sc.parallelize([1, 2, 3, 4, 5])
   ```

5. **Coordinate Job Execution:**
   - The `SparkContext` coordinates the execution of Spark jobs. It divides the application into stages and tasks, schedules tasks on worker nodes, and monitors their progress.

6. **Control of Logging:**
   - The `SparkContext` provides control over the logging level and other logging configurations for the Spark application.

   ```python
   from pyspark import SparkContext

   sc = SparkContext("local", "LoggingExample")

   # Set the log level to ERROR
   sc.setLogLevel("ERROR")
   ```

7. **Stop the SparkContext:**
   - Once the Spark application is complete, it's essential to stop the `SparkContext` to release resources and shut down the Spark cluster connections.

   ```python
   sc.stop()
   ```

The `SparkContext` is a crucial component for managing the execution and resources of Spark applications. In modern Spark applications, you might also encounter the use of `SparkSession`, which is a higher-level abstraction built on top of `SparkContext` and includes additional functionalities for working with DataFrames and Spark SQL.

# Spark Context vs. Spark Session

`SparkContext` and `SparkSession` are both important components in Apache Spark, but they serve different purposes and have distinct roles in Spark applications.

### SparkContext:

1. **Role:**
   - `SparkContext` is the entry point and the central coordinator for low-level Spark functionality.
   - It represents the connection to a Spark cluster and is responsible for managing the execution of Spark applications.

2. **Functionality:**
   - Manages the allocation of resources in the Spark cluster, including memory and CPU.
   - Coordinates the execution of Spark jobs, dividing the application into stages and tasks.
   - Provides access to cluster resources and controls the creation of Resilient Distributed Datasets (RDDs).
   - Handles configuration settings for the Spark application.

3. **Creation:**
   - Typically created once in a Spark application.

   ```python
   from pyspark import SparkContext

   # Create a SparkContext
   sc = SparkContext("local", "MySparkApplication")
   ```

4. **Legacy:**
   - `SparkContext` is the older and more traditional entry point in Spark, and it predates the introduction of DataFrames and Spark SQL.

### SparkSession:

1. **Role:**
   - `SparkSession` is a higher-level abstraction introduced in Spark 2.0 to provide a unified entry point for reading data, working with DataFrames, and executing Spark SQL queries.
   - It encapsulates `SparkContext` and provides additional functionalities for structured data processing.

2. **Functionality:**
   - Provides a single entry point for reading data from various structured sources, creating DataFrames, and executing Spark SQL queries.
   - Manages the creation of DataFrames, Datasets, and temporary views for working with structured data.
   - Encapsulates configurations, including those related to Spark SQL and Hive.

3. **Creation:**
   - Typically created once in a Spark application, similar to `SparkContext`.

   ```python
   from pyspark.sql import SparkSession

   # Create a SparkSession
   spark = SparkSession.builder.appName("MySparkApplication").getOrCreate()
   ```

4. **Modern Usage:**
   - `SparkSession` is the modern and recommended entry point for Spark applications, especially when working with structured data using DataFrames and Spark SQL.

### Summary:

- `SparkContext` is primarily concerned with low-level Spark functionality, resource management, and the coordination of Spark jobs.
- `SparkSession` is a higher-level abstraction that focuses on structured data processing, providing unified access to Spark's structured APIs.

In practice, many Spark applications use `SparkSession` for structured data processing while still having access to `SparkContext` for low-level operations. The `SparkSession` encapsulates a `SparkContext` internally and simplifies the overall development experience, especially for users working with structured and semi-structured data.