# Resilient Distributed Dataset

RDD stands for Resilient Distributed Dataset. It's a fundamental data structure in Apache Spark, representing an immutable, distributed collection of data elements. Let's break down the key aspects:
- **Resilient**: RDDs are fault-tolerant. If a node in the cluster fails, Spark can rebuild the partitions that were lost. It does this through lineage tracking: the lineage is a record of the transformations that produced an RDD, so Spark can recompute only the missing partitions instead of reloading the entire dataset.
- **Distributed**: RDDs are partitioned and distributed across multiple nodes in a cluster. This parallelization is crucial for Spark's performance. The data is split into chunks (partitions), and each partition can be processed on a different machine concurrently.
- **Dataset**: RDDs represent a collection of data. This data can come from various sources:
  - Files (text files, CSV, Parquet, Avro, etc.)
  - Databases (JDBC connections)
  - Other RDDs
  - In-memory collections
- **Immutable**: Once an RDD is created, it cannot be changed. Transformations on an RDD create a new RDD. This immutability simplifies debugging and makes it easier to reason about the code.
- **Lazy Evaluation**: Computations on RDDs are not performed immediately. Instead, Spark builds a plan of operations (a Directed Acyclic Graph or DAG) and executes it only when an action is triggered. This lazy evaluation allows Spark to optimize the execution plan and avoid unnecessary computations.
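
Immutability, lineage, and lazy evaluation can be illustrated with a small plain-Python toy model. This is only a sketch: the class `LazyDataset` and the helper `double` are hypothetical names, and Spark's real scheduler is far more involved. The point is that transformations only record a plan, and the work happens when an action such as `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, lineage=()):
        self._source = source      # original in-memory data, never modified
        self._lineage = lineage    # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a NEW object; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also just extends the recorded lineage.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []

def double(x):
    calls.append(x)   # side effect so we can observe when work happens
    return x * 2

base = LazyDataset([1, 2, 3, 4])
evens_doubled = base.filter(lambda x: x % 2 == 0).map(double)

calls_before_action = len(calls)      # only the plan exists at this point
result = evens_doubled.collect()      # the action triggers the computation
recomputed = evens_doubled.collect()  # results can be rebuilt from the lineage
print(calls_before_action, result, recomputed)
```

Because `base` is never modified, calling `base.collect()` afterwards still returns the original four numbers.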

**Key Concepts Related to RDDs**:
- **Transformations**: Operations that create new RDDs from existing ones. Examples:
  - `map`: Applies a function to each element.
  - `filter`: Returns elements that satisfy a condition.
  - `reduceByKey`: Aggregates the values of each key using a function. (Plain `reduce` is an action, because it returns a single value to the driver.)
  - `groupBy`: Groups elements based on a key.
  - `join`: Combines elements from two RDDs.
- **Actions**: Operations that trigger the execution of RDD computations and return a result to the driver program or write data to an external system. Examples:
  - `collect`: Returns all elements of the RDD to the driver program. (Use with caution for large datasets!)
  - `count`: Returns the number of elements in the RDD.
  - `first`: Returns the first element.
  - `take`: Returns the first n elements.
  - `saveAsTextFile`: Writes the RDD to a text file.
- **Partitions**: RDDs are divided into partitions, which are the basic units of parallelism. The number of partitions can be configured and significantly impacts performance. Well-balanced partitions spread the workload evenly across the cluster.
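
As a rough illustration of partitions as units of parallelism, the sketch below splits a list into chunks, processes each chunk independently, and combines the partial results, which is conceptually what an action like `count` does. It is plain Python, not Spark: the helper `split_into_partitions` is a hypothetical name, and the even chunking shown here is an assumption for illustration rather than Spark's actual partitioning logic.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous, near-equal chunks."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

numbers = list(range(1, 21))
partitions = split_into_partitions(numbers, 4)

# Each partition is processed independently (here by a small thread pool).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(len, partitions))
    partial_sums = list(pool.map(sum, partitions))

# The "driver" combines the per-partition results into the final answer.
total_count = sum(partial_counts)
total_sum = sum(partial_sums)
print(partial_counts)
print(total_count, total_sum)
```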

**Why are RDDs important?**

RDDs provide a powerful abstraction for distributed data processing. They handle the complexities of data distribution, fault tolerance, and parallel execution, so developers can focus on the logic of their data processing tasks. Newer abstractions such as DataFrames and Datasets are built on top of RDDs and offer more structured data handling and optimizations, but understanding RDDs remains important for understanding Spark's internals and for certain advanced use cases.

## Import Module

In [1]:
from pyspark.sql import SparkSession

## Create a Spark Session

In [2]:
spark = SparkSession.builder.appName("ExperimentingWithRDDs").getOrCreate()


## Create a List of Words

Split a sentence on single spaces to produce a Python list of words, `text_for_list`.

In [3]:
text_for_list = "Spark makes life much easier and puts me in a good mood which makes Spark awesome!".split(" ")
type(text_for_list)

list

In [4]:
print(text_for_list)

['Spark', 'makes', 'life', 'much', 'easier', 'and', 'puts', 'me', 'in', 'a', 'good', 'mood', 'which', 'makes', 'Spark', 'awesome!']
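
Note that `split(" ")` splits on every single space character, while `split()` with no argument splits on any run of whitespace. For the sentence above both give the same list, but they differ as soon as the text contains double spaces or tabs. A small sketch with a made-up sentence:

```python
sentence = "Spark  makes life\teasier"   # note the double space and the tab

by_single_space = sentence.split(" ")    # empty string appears between the two spaces
by_any_whitespace = sentence.split()     # runs of whitespace are treated as one separator

print(by_single_space)
print(by_any_whitespace)
```

If the input text is not guaranteed to be cleanly spaced, `split()` is usually the safer choice before turning the words into an RDD.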


## Convert the List to an RDD

**Parallelize**: `parallelize()` distributes the local list `text_for_list` across the cluster as an RDD.

In [5]:
text_rdd = spark.sparkContext.parallelize(text_for_list)

## Collect and Print RDD Data

**Collect**: The `collect()` action gathers all the elements of the RDD back into a local Python list on the driver.

In [6]:
text_data = text_rdd.collect()

for word in text_data:
    print(word)

Spark
makes
life
much
easier
and
puts
me
in
a
good
mood
which
makes
Spark
awesome!


## Count Elements in the RDD

**Count**: The `count()` action returns the total number of elements in the RDD.

In [7]:
text_rdd.count()

16

## Count Unique Elements

**Distinct**: The `distinct()` transformation creates a new RDD that contains only the unique elements of `text_rdd`.

***Note***: *Transformations in Spark are lazy, i.e. they are not executed until an action is called.*

In [8]:
text_rdd.distinct().count()

14
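
The two counts can be cross-checked on the driver with plain Python. The sketch below rebuilds the same word list and uses `collections.Counter` to show which words account for the difference between the total and the distinct count; the variable names are illustrative only.

```python
from collections import Counter

words = "Spark makes life much easier and puts me in a good mood which makes Spark awesome!".split(" ")

word_counts = Counter(words)
total = len(words)              # same figure as count() on the RDD
unique = len(word_counts)       # same figure as distinct().count()
repeated = sorted(w for w, c in word_counts.items() if c > 1)

print(total, unique, repeated)
```

The two repeated words explain why the distinct count is two less than the total.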

## Recollecting and Printing the RDD Data

In [9]:
text_data = text_rdd.collect()

for word in text_data:
    print(word)

Spark
makes
life
much
easier
and
puts
me
in
a
good
mood
which
makes
Spark
awesome!


## Create and Print a Unique RDD

Create a new RDD from an existing one by applying a transformation, then perform an action to view the results.

In [10]:
text_unique_rdd = text_rdd.distinct()

for word in text_unique_rdd.collect():
    print(word)

good
makes
life
puts
a
which
Spark
much
easier
awesome!
in
and
me
mood
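
The unique words come back in a different order than the input because `distinct()` involves a shuffle that redistributes elements across partitions by hash. The toy model below imitates that effect in plain Python. It is only a sketch: `NUM_PARTITIONS` and `bucket_for` are hypothetical, and CRC32 stands in for Spark's real hash partitioning, so the exact order printed will not match Spark's output.

```python
import zlib

words = "Spark makes life much easier and puts me in a good mood which makes Spark awesome!".split(" ")
NUM_PARTITIONS = 4  # hypothetical partition count for this sketch

def bucket_for(word):
    # crc32 is deterministic across runs, unlike the built-in hash() for str,
    # which is salted per process.
    return zlib.crc32(word.encode("utf-8")) % NUM_PARTITIONS

buckets = [[] for _ in range(NUM_PARTITIONS)]
for word in words:
    bucket = buckets[bucket_for(word)]
    if word not in bucket:   # duplicates always land in the same bucket
        bucket.append(word)

# Concatenating the buckets gives the unique words, but not in input order.
shuffled_unique = [w for bucket in buckets for w in bucket]

# If first-seen order matters, deduplicate the collected list on the driver.
ordered_unique = list(dict.fromkeys(words))

print(shuffled_unique)
print(ordered_unique)
```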


## Define a Custom Function and Filter the RDD

Apply custom logic with the `filter` transformation to select a subset of an RDD, then `collect` the result into a local list.

In [11]:
def wordStartsWith(word, letter):
    return word.startswith(letter)

In [12]:
text_rdd.filter(lambda word: wordStartsWith(word, "S")).collect()

['Spark', 'Spark']
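
`filter` expects a function of one argument, which is why the cell above wraps the two-argument `wordStartsWith` in a lambda. One alternative is a small factory that returns a ready-made predicate. The helper `starts_with` below is a hypothetical name, and the sketch uses Python's built-in `filter` on a local list so it runs without a cluster.

```python
def starts_with(letter):
    """Return a one-argument predicate that tests a word's prefix."""
    def predicate(word):
        return word.startswith(letter)
    return predicate

words = "Spark makes life much easier and puts me in a good mood which makes Spark awesome!".split(" ")

starts_with_s = list(filter(starts_with("S"), words))
starts_with_m = list(filter(starts_with("m"), words))

print(starts_with_s)
print(starts_with_m)
```

The same predicate could be passed straight to an RDD, for example `text_rdd.filter(starts_with("S"))`.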

## Create a List of Numbers

In [13]:
num_list = [*range(1, 21)]
print(num_list)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]


## Convert the List of Numbers into an RDD

In [14]:
num_rdd = spark.sparkContext.parallelize(num_list)

## Map each Number to a Tuple of its Value and its Square

Create a new RDD with the `map` transformation, which applies a function to every element. For each number `n`, the function returns the tuple `(n, n * n)`, so each element of the new RDD has the form `(number, square)`.

The `collect()` action gathers all the elements from the distributed RDD back to the driver as a list.

In [15]:
num_squared_rdd = num_rdd.map(lambda n: (n, n * n))

for element in num_squared_rdd.collect():
    print(element)

(1, 1)
(2, 4)
(3, 9)
(4, 16)
(5, 25)
(6, 36)
(7, 49)
(8, 64)
(9, 81)
(10, 100)
(11, 121)
(12, 144)
(13, 169)
(14, 196)
(15, 225)
(16, 256)
(17, 289)
(18, 324)
(19, 361)
(20, 400)


## Map Text Data to a Transformed Tuple

Apply the `map` transformation to an existing RDD: for each word, the function returns a tuple of the word, its first character, and whether it starts with `"S"`. The tuples form a new RDD.

In [16]:
text_transformed_rdd = text_rdd.map(lambda word: (word, word[0], word.startswith("S")))
for element in text_transformed_rdd.collect():
    print(element)

('Spark', 'S', True)
('makes', 'm', False)
('life', 'l', False)
('much', 'm', False)
('easier', 'e', False)
('and', 'a', False)
('puts', 'p', False)
('me', 'm', False)
('in', 'i', False)
('a', 'a', False)
('good', 'g', False)
('mood', 'm', False)
('which', 'w', False)
('makes', 'm', False)
('Spark', 'S', True)
('awesome!', 'a', False)


## Flatten Words into Characters and Retrieve a Subset

Transform each word into its constituent characters and then retrieve a sample of those characters.

The `flatMap` transformation applies the function to each word and flattens the resulting lists into a single new RDD of individual characters. The `take` action then retrieves the first 10 characters from the flattened RDD.

In [17]:
text_rdd.flatMap(lambda word: list(word)).take(10)

['S', 'p', 'a', 'r', 'k', 'm', 'a', 'k', 'e', 's']
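
The difference between `map` and `flatMap` is easy to see in plain Python. In this sketch, a list comprehension plays the role of `map` (one list per word), `itertools.chain.from_iterable` plays the role of `flatMap` (one flat stream of characters), and `islice` plays the role of `take`. These are analogies only, using a shortened word list.

```python
from itertools import chain, islice

words = ["Spark", "makes", "life"]

# Like map: one output element (a list of characters) per input word.
mapped = [list(word) for word in words]

# Like flatMap: the per-word lists are flattened into a single sequence.
flattened = chain.from_iterable(list(word) for word in words)

# Like take(10): only the first ten elements are pulled from the sequence.
first_ten = list(islice(flattened, 10))

print(mapped[0])
print(first_ten)
```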

## Create Countries List and Convert to RDD

In [18]:
countries = [("USA", 96), ("India", 68), ("UK", 86), ("Germany", 84), ("Canada", 82), ("France", 83), ("Norway", 81), ("Australia", 82), ("Brazil", 79), ("Mexico", 76)]

countries_rdd = spark.sparkContext.parallelize(countries)

## Sort the Countries List by Country Name (Key)

The `sortByKey()` transformation sorts the RDD by key (here, the country name), and `collect()` brings the sorted result back to the driver as a list.

In [19]:
sorted_countries = countries_rdd.sortByKey().collect()

for element in sorted_countries:
    print(element)

('Australia', 82)
('Brazil', 79)
('Canada', 82)
('France', 83)
('Germany', 84)
('India', 68)
('Mexico', 76)
('Norway', 81)
('UK', 86)
('USA', 96)


## Reverse the Tuple Order and Sort in Descending Order

To sort by value instead, first use `map` to swap each tuple from `(country, value)` to `(value, country)` so that the value becomes the key. Then `sortByKey(False)` sorts the RDD by that key in descending order.

In [20]:
sorted_countries = countries_rdd.map(lambda c: (c[1], c[0])).sortByKey(False).collect()

for element in sorted_countries:
    print(element)

(96, 'USA')
(86, 'UK')
(84, 'Germany')
(83, 'France')
(82, 'Canada')
(82, 'Australia')
(81, 'Norway')
(79, 'Brazil')
(76, 'Mexico')
(68, 'India')
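
Sorting by value does not strictly require swapping the tuple. In plain Python the same ordering comes from `sorted` with a key function, as sketched below. PySpark's `RDD.sortBy` accepts a key function and an `ascending` flag in a similar way, though that variant is not run in this notebook.

```python
countries = [("USA", 96), ("India", 68), ("UK", 86), ("Germany", 84), ("Canada", 82),
             ("France", 83), ("Norway", 81), ("Australia", 82), ("Brazil", 79), ("Mexico", 76)]

# Sort on the second tuple element, highest value first.
# Python's sort is stable, so ties keep their original relative order.
by_value_desc = sorted(countries, key=lambda c: c[1], reverse=True)

for element in by_value_desc:
    print(element)
```

The tuples keep their original `(country, value)` shape, which avoids a second `map` to swap them back.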
