## Spark Operations
### Lesson objectives
#### In this lesson, we will:

- Understand the two types of Spark operations: transformations and actions.
- Learn why Spark datasets are immutable and what that implies.
- Explore examples of transformations and actions, including lazy evaluation and its benefits.
### Spark Operations
- Spark operations on distributed data can be classified into two types: transformations and actions.
- Spark datasets (RDDs and DataFrames) are immutable: an operation never modifies its input and instead returns a new dataset.
### Immutable Objects
- An object whose state cannot change after it has been constructed is called immutable (unchangeable).
- The methods of an immutable object do not modify the state of the object; they return a new object instead.
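
The idea is not specific to Spark, so it may help to see it in plain Python first. The sketch below uses a hypothetical `Person` record (not part of the lesson's dataset API) built with a frozen dataclass: its `with_age` method returns a new object, and a direct assignment is rejected.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Person:
    """Hypothetical immutable record; frozen=True blocks attribute assignment."""

    name: str
    age: int

    def with_age(self, new_age: int) -> "Person":
        # Does not touch self; builds and returns a new Person instead.
        return replace(self, age=new_age)


original = Person("Abdo", 25)
updated = original.with_age(26)

print(original)  # the original object keeps its state
print(updated)   # the change lives in a separate object

try:
    original.age = 30  # direct mutation is rejected
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

print(f"Mutation blocked: {mutation_blocked}")
```

An RDD behaves analogously: a call such as `filter()` hands back a new RDD and leaves the one it was called on as it was.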

In [0]:
data = [("Abdo", 25), ("Ola", 23), ("Esraa", 28), ("Asmaa", 32)]
rdd = sc.parallelize(data)

print("Original Data RDD:")
for row in rdd.collect():
    print(row)

Original Data RDD:
('Abdo', 25)
('Ola', 23)
('Esraa', 28)
('Asmaa', 32)


In [0]:
print(f"Original RDD ID: {rdd.id()}")

# filter() returns a new RDD; the original rdd is left unchanged.
rdd_filter = rdd.filter(lambda x: x[1] > 23)
print("RDD Filter:")
for row in rdd_filter.collect():
    print(row)

print(f"RDD ID: {rdd_filter.id()}")


Original RDD ID: 6
RDD Filter:
('Abdo', 25)
('Esraa', 28)
('Asmaa', 32)
RDD ID: 7

- The filtered result is a new RDD with its own ID (7); the original RDD (ID 6) still holds all four rows.


### Spark Operations: Transformations
- Transformations produce a new RDD or DataFrame from an existing one without altering the original data.
- Examples of Spark transformations: **map(), select(), filter(), and drop()** (map() applies to RDDs; select() and drop() apply to DataFrames).

### Spark Transformations: What are Lazy Transformations?
- In Spark, transformations are lazy.
- This means computations are not executed immediately.
- Spark builds a DAG (Directed Acyclic Graph) of transformations.
- Transformation results are not computed when the transformation is defined; instead, each transformation is recorded (remembered) as part of a lineage.
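
To make "recorded as a lineage" concrete, here is a toy model in plain Python. `LazyDataset` is a hypothetical class written for this illustration, not Spark's implementation: each transformation only appends a step to a lineage tuple, and the work happens when the `collect()` action replays those steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD; hypothetical, not Spark's implementation."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input, never modified
        self.lineage = lineage   # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: recorded in the same way, nothing is computed here.
        return LazyDataset(self._source, self.lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        rows = list(self._source)
        for kind, fn in self.lineage:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


calls = []  # every row the map function actually touches


def flag_older(row):
    calls.append(row)
    return (row[0], row[1], row[1] > 23)


data = [("Abdo", 25), ("Ola", 23), ("Esraa", 28), ("Asmaa", 32)]
ds = LazyDataset(data).map(flag_older).filter(lambda row: row[2])

calls_before_action = len(calls)  # still zero: only the lineage exists
steps = [kind for kind, _ in ds.lineage]
result = ds.collect()             # the action runs the recorded steps

print(calls_before_action)
print(steps)
print(result)
print(len(calls))
```

Real Spark goes further than this sketch: it plans the recorded steps into stages and runs them in parallel across partitions.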
### Spark Transformations: Benefits of Lazy Evaluation
- Optimization: A lineage allows Spark, at a later time in its execution plan, to rearrange certain transformations, coalesce them, or optimize transformations into stages for more efficient execution.
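- Resource Management: Spark computes only what an action actually needs and can avoid materializing every intermediate result, which saves memory and compute.
- Fault Tolerance: If part of the data is lost, Spark can recompute just that part by replaying its lineage.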
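
One of these benefits, doing no more work than the final request needs, can be sketched with plain Python generators. This is only an analogy under simplifying assumptions (a single machine, no partitions), and the helper `tag` is hypothetical. Because the map and filter steps are lazy, each row flows through both in one pass, and asking for two results stops the pipeline early.

```python
from itertools import islice

processed = []  # names of the rows that actually pass through the map step


def tag(row):
    processed.append(row[0])
    return (row[0], row[1], row[1] > 23)


data = [("Abdo", 25), ("Ola", 23), ("Esraa", 28), ("Asmaa", 32)]

# Generators are lazy: these two lines only describe the pipeline.
mapped = (tag(row) for row in data)
filtered = (row for row in mapped if row[2])

# "Action": request only the first two matching rows.
first_two = list(islice(filtered, 2))

print(first_two)
print(processed)
```

The last row is never processed, because the request was satisfied before the pipeline reached it. An eager version that built the full mapped list first would have touched all four rows.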
### Spark Transformations: Lazy Transformation
- Consider a dataset with map and filter transformations.
- Spark does not execute these transformations when they are defined.
- Transformations are executed when an action (like collect, count) is called.

In [0]:
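data = [("Abdo", 25), ("Ola", 23), ("Esraa", 28), ("Asmaa", 32)]
rdd = sc.parallelize(data)

# Transformation: add a boolean flag to each row (lazy, nothing runs yet).
map_rdd = rdd.map(lambda x: (x[0], x[1], x[1] > 23))

# Transformation: keep only the rows whose flag is True.
filter_rdd = map_rdd.filter(lambda x: x[2])

# Name the columns; Spark infers the column types from the data.
df = spark.createDataFrame(filter_rdd, ["Name", "Age", "OlderThan23"])

# Transformation: keep only the two columns we need.
final_df = df.select("Name", "Age")

# Action: collect() triggers execution of the whole pipeline.
result = final_df.collect()

display(result)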


Name,Age
Abdo,25
Esraa,28
Asmaa,32


### Spark Operations: Actions
- An action triggers execution of all the transformations recorded so far.
- They are used to either compute a result to be returned to the Spark driver program or to write data to an external storage system.
- Actions include operations like **count, collect, saveAsTextFile, and take**.
### Examples of Spark Actions
- **collect()**: Returns all elements of the dataset to the driver program; use it only when the result fits in the driver's memory.
- **count()**: Returns the number of elements in the dataset.
- **saveAsTextFile(path)**: Writes the dataset as text files (one per partition) into a directory at the specified path.
- **take(n)**: Returns a list with the first n elements of the dataset.