# CS5052 - Spark Programming
> Created by: Professor Blesson Varghese\
> School of Computer Science, University of St Andrews\
> Contact: cs5052.staff@st-andrews.ac.uk

This notebook introduces you to Spark programming using Python. Spark is a system that coordinates the processing of large datasets in parallel across many machines. In practice, Spark runs across a cluster of nodes that it manages. In this lab, Spark is installed and run on a single machine.

You can set up the environment to run this notebook on the lab machine as follows:
```
cd <your desired folder>
python3.12 -m venv pyspark
. pyspark/bin/activate
pip install --upgrade pip
pip install pyspark jupyterlab
```

Run the JupyterLab server after activating the virtual environment using the following command:
```
jupyter-lab
```
A browser window should open automatically.

To keep a notebook self-contained, install any additional packages from within the notebook itself using the following command:
```
%pip install <package_name>
```

**Note:** The notebook submitted for the CS5052 Practical 1 must run on the lab machine. 

# `SparkSession`

- Every Spark application consists of a driver program and executors (workers); see figure below
- Driver program accesses Spark through a `SparkSession` object
    - A unified point of entry as of Spark 2.0
    - Represents a connection to a cluster
    - `SparkContext`, `SQLContext` and `HiveContext` all combined in `SparkSession`

 ![Spark Overview; Obtained from: https://spark.apache.org/docs/latest/cluster-overview.html](images/sparksession.png)

In [1]:
# import os
# import sys
#
# print("Python being used:", sys.executable)
#
# os.environ["PYSPARK_PYTHON"] = sys.executable
# os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

In [2]:
# Import SparkSession class from pyspark.sql module
# SparkSession is the entry point to Spark 
from pyspark.sql import SparkSession

In [3]:
# Create a SparkSession and assign it to variable 'spark'
# There are different variants on the usage - refer to the documentation or a tutorial
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()


# DataFrame

## DataFrame: create manually

In [4]:
# Create a DataFrame with one column called "number" and 1000 rows
data = spark.range(1000).toDF("number")

# Shows the first 20 rows by default
data.show()

# Show more or fewer rows N
N = 50
data.show(N)

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows
+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
|    20|
|    21|
|    22|
|    23|
|    24|
|    25|
|    26|
|    27|
|    28|
|    29|
|    30|
|    31|
|    32|
|    33|
|    34|
|    35|
|    36|
|    37|
|    38|
|    39|
|    40|
|    41|
|    42|
|    43|
|    44|
|    45|
|    46|
|    47|
|    48|
|    49|
+------+
only showing top 50 rows


In [5]:
from pyspark.sql import Row

# Python list containing two rows
emp = [Row("Jack", 24), Row("Bobby", 26)]

# Convert Python data into Spark DataFrame
emp_df = spark.createDataFrame(emp, ["name","age"])

emp_df.show()


Py4JJavaError: An error occurred while calling o56.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (Monster executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
Caused by: java.net.SocketException: Connection reset


## DataFrame: create from file

In [None]:
# Create DataFrame from a file
df = ( 
    spark.read
    .option("header", True)         # Tells Spark the first line is a header
    .option("inferSchema", True)    # Spark scans the column and infers data type; id will be an integer, country and capital will be a string
    .format("csv")                  
    .load("sample_data1.csv")
)

df.show()

## DataFrame: Datasource

Many different data sources are supported, including CSV, JSON, ORC, Parquet, text, tables and JDBC.

# Two Major Operations

- All abstractions such as RDD and DataFrames offer two types of operation
    - **Transformation:** construct a new RDD/DataFrame from a previous one
    - **Action:** compute the result based on an RDD/DataFrame

# Transformations

## Transformations: `printSchema()` and `describe()`

In [None]:
# Print the structure (data type of the columns) of the DataFrame
df.printSchema()

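In [None]:
# Compute summary statistics (count, mean, stddev, min, max) for the columns of the DataFrame
# describe() returns a new DataFrame, so show() is needed to display it
df.describe().show()

In [None]:
# Compute summary statistics for a specific column
df.select("Country").describe().show()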

## Transformations: `where()` and `filter()`

In [None]:
df_population = ( 
                    spark.read
                    .option("header", True)         
                    .option("inferSchema", True) 
                    .format("csv")                  
                    .load("sample_data2.csv")
)

df_population.show()

hundredK_plus = df_population.filter("Population >= 100000")
hundredK_plus.show()

under_50K = df_population.where("Population <= 50000")
under_50K.show()

## Transformation: `distinct()` and `limit()`

In [None]:
df_town_village = ( 
                    spark.read
                    .option("header", True)         
                    .option("inferSchema", True) 
                    .format("csv")                  
                    .load("sample_data3.csv")
)

df_town_village.show()

unique_county = df_town_village.select("County").distinct()
unique_county.show()

N = 5
shortN_list = df_town_village.limit(N)
shortN_list.show()

# Alternate usage
df_town_village.limit(N).show()

## Transformation: Sorting using `sort()` or `orderBy()`

### Basic sorting

In [None]:
# Sort by a single column
# (the result is named sorted_df to avoid shadowing Python's built-in sorted())
sorted_df = df_town_village.sort("County")
sorted_df.show()

# Sort by multiple columns
sorted_df = df_town_village.sort("County", "Town/Village")
sorted_df.show()

In [None]:
# Order by a single column
sorted_df = df_town_village.orderBy("Town/Village")
sorted_df.show()

# Order by multiple columns
sorted_df = df_town_village.orderBy("County", "Town/Village")
sorted_df.show()

### Specifying sort direction

In [None]:
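from pyspark.sql.functions import desc, asc

# Descending order on a single column
sorted_df = df_town_village.orderBy(desc("Town/Village"))
sorted_df.show()

# Ascending by county, then descending by town/village within each county
sorted_df = df_town_village.orderBy(asc("County"), desc("Town/Village"))
sorted_df.show()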

## Transformation: Sampling data using `sample`

In [None]:
with_replacement = False    # Sample without replacement; each row can appear at most once 
fraction = 0.50             # Roughly 50% of the rows are selected
seed = None                 # Random seed; set it to an integer to get the same sample every time

sample = df_town_village.sample(with_replacement, fraction, seed)
sample.show()

## Transformation: Aggregation

In [None]:
from pyspark.sql.functions import count, countDistinct 

df_town_village.select(count("County")).show()

df_town_village.select(countDistinct("County")).show()

# The min, max, avg, first and last functions and the groupBy() method are also available and self-explanatory


## DataFrame: Some Actions

In [None]:
# first()
row = df_town_village.first()
print(row)
print(row["Town/Village"])      # Access a column of the first row

# show()
df_town_village.show()
N = 6
df_town_village.show(N)

# take(N)
N = 4
rows = df_town_village.take(N)  # Similar to first(), but returns a list of N rows
for row in rows:
    print(row)

# collect()
all_rows = df_town_village.collect()    # Returns all rows as a list of Row objects
# Note: if the DataFrame is large, this may fail because all rows are brought into the driver's memory
# Use this only for small datasets or debugging
for row in all_rows:
    print(row)

# count()
print(f"Total rows: {df_town_village.count()}")

# RDD

- Low level, but still relevant in some cases
    - Raw data processing, e.g. a text file without structure
- Working with RDDs involves:
    - Creating new RDDs
    - Transforming existing RDDs
    - Computing results from RDDs


## Create RDD

In [None]:
# From an existing file
lines = spark.sparkContext.textFile("README_dummy.md")

# or
sc = spark.sparkContext
lines = sc.textFile("README_dummy.md")

# Collect all lines into a Python list
all_lines = lines.collect()

# Print each line
for line in all_lines:
    print(line)

In [None]:
# From a list
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
# numbers is an RDD containing numbers 1 to 8
# The data is split into partitions and can be processed in parallel

# Aggregate the partitions and print all numbers
print(numbers.collect())

In [None]:
# Create an RDD from a file 
lines = sc.textFile("README_dummy.md")

# Create new RDD with lines containing Spark
lines = lines.filter(lambda x: 'Spark' in x)

# Count the number of items in this RDD
# Note: The two statements above do not trigger any computation (transformations are lazy)
# The statement below will read the file and do the computation
print(lines.count())

# The statement below will read the file and do the computation again
print(lines.count())


## RDD - Persisting

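- Spark recomputes an RDD each time an action is performed on it
    - By default, an RDD is not stored in memory
    - The `persist()` function marks an RDD to be kept (in memory by default) once it has been computed, so that later actions reuse it instead of recomputing it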

In [None]:
lines = sc.textFile("README_dummy.md")
lines = lines.filter(lambda x: 'Spark' in x)

# Mark the RDD to be kept in memory; nothing is computed yet
lines.persist()

# The first action computes the RDD and stores it; later actions reuse the stored data
print(lines.count())

## Basic RDD Transformation Functions

- Construct an RDD from a previous one
    - Performed on one or more RDDs
    - Return a new RDD

### `filter()`
- Takes in a function and returns an RDD containing only the elements for which the function returns true

In [None]:
lines = sc.textFile("README_dummy.md")

# Create a new RDD consisting of lines that contain 'Spark'
lines = lines.filter(lambda x: 'Spark' in x)

all_lines = lines.collect()

for line in all_lines:
    print(line)

### `map()`
- Takes in a function and applies it to each element in the RDD

In [None]:
lines = sc.textFile("README_dummy.md")

# Create a new RDD in which all strings are in uppercase
lines = lines.map(lambda x: x.upper())

all_lines = lines.collect()

for line in all_lines:
    print(line)

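### `flatMap()`
- Applies a function to each element in an RDD
- The function returns a sequence (list of elements) for each input element
- The final RDD is flattened, i.e. it contains the elements of all the sequences rather than the sequences themselves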

In [None]:
lines = sc.parallelize([
    "I love Spark",
    "Spark is awesome",
    "Big data rocks"
])

words_using_map = lines.map(lambda line: line.split(" "))
print(words_using_map.collect())

In [None]:
words_using_flatmap = lines.flatMap(lambda line: line.split(" "))
print(words_using_flatmap.collect())

### `distinct()`
- Returns a new RDD with only distinct items

In [None]:
numbers = sc.parallelize([0, 1, 2, 4, 7, 5, 4, 3, 2, 1, 1, 0])
numbers = numbers.distinct()
print(numbers.collect())

### `union(other)`
- Returns a new RDD consisting of items from both sources

In [None]:
numbers = sc.parallelize([0, 1, 2, 3, 4])
characters = sc.parallelize(['A', 'B', 'C', 'D', 'E'])
result = numbers.union(characters)
print(result.collect())

### `intersection(other)`
- Returns a new RDD consisting of only the items present in both sources, with all duplicates removed

In [None]:
number_list1 = sc.parallelize([0, 1, 2, 4, 6, 7, 8])
number_list2 = sc.parallelize([0, 1, 3, 4, 5, 7])
result = number_list1.intersection(number_list2)
print(result.collect())

### `subtract(other)`
- Returns a new RDD consisting of only items in the first RDD but not in the other one

In [None]:
number_list1 = sc.parallelize([0, 1, 2, 4, 6, 7, 8])
number_list2 = sc.parallelize([0, 1, 3, 4, 5, 7])
result = number_list1.subtract(number_list2)
print(result.collect())

## Basic RDD Action Functions

- Compute result based on RDD(s)
    - Performed on one or more RDD(s)
    - Return a result, which is not an RDD

### `first()`
- Returns the first item in an RDD

In [None]:
numbers = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8]) 
print(numbers.first())

### `collect()`
- Returns a list containing the entire RDD's content

In [None]:
numbers = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8]) 
print(numbers.collect())

### `count()`
- Returns the number of items in an RDD

In [None]:
numbers = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8]) 
print(numbers.count())

### `reduce(function)`
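- Takes a function that operates on two elements and returns a new element of the same type
- The function should be commutative and associative, because partial results are computed per partition and then combined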

In [None]:
numbers = sc.parallelize([1, 2, 3, 4, 5]) 
result = numbers.reduce(lambda x, y: x * y)
print(result)

### `takeOrdered(num, key)`
- Returns the first `num` items in ascending order, or in the order defined by the optional `key` function

In [None]:
numbers = sc.parallelize([8, 0, 4, 6, 9, 7, 2, 1, 5, 3])

# Return five smallest numbers from the list
print(numbers.takeOrdered(5, lambda x: x))

# Return five largest numbers from the list
print(numbers.takeOrdered(5, lambda x: -x))
# Note: How the key function works:
# Original numbers: 8, 0, 4, 6, 9, 7, 2, 1, 5, 3
# Keys (negated numbers): -8, 0, -4, -6, -9, -7, -2, -1, -5, -3
# 5 smallest keys: -9, -8, -7, -6, -5
# Elements returned for those keys: 9, 8, 7, 6, 5