# Introduction to Spark
© Explore Data Science Academy

<p align="center">
<img src="https://github.com/Explore-AI/Pictures/blob/f7a433fa65521cca0016a42da6cffbebc6e65c40/data_engineering/transform/spark_logo.png?raw=True"
     alt="Spark Logo"
     style="padding-bottom=0.5em"
     width=600px/>
</p>

## Learning objectives

In this train, you will learn how to:
- describe the four characteristics Spark is designed around;
- list and compare the four components of the Spark API stack;
- understand and apply the Spark Structured API to ingest and transform data; and
- describe datasets commonly used in Spark.

## Introduction

With the introduction of MapReduce, Apache Hadoop radically transformed how data were processed at the time. However, one of the downsides of the MapReduce architecture was that it relied on disk, including the Hadoop Distributed File System (HDFS), to store intermediate results while performing computations. Disks are slow compared to memory, and repeatedly moving data between disk and the central processing unit is costly, which limits the speed at which data can be processed.

Apache Spark went one step further in improving the processing of big data by introducing in-memory processing: intermediate results are kept in the machine's Random Access Memory (RAM), and when memory fills up, cached data are evicted using the **L**east **R**ecently **U**sed (LRU) algorithm. Processing data in memory largely removes the need to write intermediate data to disk and keeps the data readily accessible to the central processing unit, greatly reducing latency.
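
To make the LRU idea concrete, here is a minimal plain-Python sketch of a least-recently-used cache. It is an illustration only and not Spark's implementation: the `LRUCache` class, its `capacity` parameter, and the `block_*` names are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache (illustrative only, not Spark code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key):
        if key not in self._items:
            return None  # cache miss: the caller would recompute or reload
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def keys(self):
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("block_a", [1, 2, 3])
cache.put("block_b", [4, 5, 6])
cache.get("block_a")             # touch block_a, so block_b is now the oldest
cache.put("block_c", [7, 8, 9])  # over capacity: block_b is evicted
print(cache.keys())              # ['block_a', 'block_c']
```

The entry that was used longest ago is the first to go, which keeps recently used data in memory.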

Spark is built as a unified analytics engine. It goes beyond the simple use cases of ETL pipelines by combining the following libraries:
- SparkSQL, a powerful engine for performing transformations of Big Data;
- MLlib, a machine learning library that makes use of Spark's powerful processing engine;
- GraphX, a graph-processing library for working with graphs at scale; and
- Spark Streaming, an engine for performing real-time analytics on large datasets.

Because these libraries share a single engine, they interoperate, and the same engine can serve a wide variety of big data use cases.



## Spark characteristics
*The developers of Spark have established some core principles by which they wanted to design and implement the tool. These characteristics are central to the Spark tool, and when diving deeper into the system, you should be able to easily see how they have been implemented.*

### Speed
1. Spark benefits from recent hardware advances and the falling cost of CPUs and memory. It is optimised to take advantage of machines with many gigabytes of memory and multiple cores, and of the efficient multithreading and parallel processing offered by the underlying Unix-based operating systems.
2. Spark builds its query computations as a directed acyclic graph (DAG). Its scheduler and query optimiser construct an efficient computational graph that can be decomposed into tasks, which are executed in parallel across the workers on a cluster.
3. Spark's physical execution engine (Tungsten) uses whole-stage code generation to generate compact code for execution.
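
To get a feel for how a directed acyclic graph can be decomposed into units of work that run in parallel, consider the plain-Python sketch below. It uses the standard library's `graphlib` module (Python 3.9 or later) and does not reflect Spark's actual scheduler. The `plan` dictionary and its operation names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each key is an operation, and each value is the set
# of operations whose output it needs first.
plan = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
}

sorter = TopologicalSorter(plan)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # operations with no pending inputs
    waves.append(ready)                 # these could all run in parallel
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
# 1 ['read_customers', 'read_orders']
# 2 ['filter_orders']
# 3 ['join']
# 4 ['aggregate']
```

Operations in the same wave do not depend on each other, so a scheduler is free to hand them to different workers at the same time.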

### Ease of use
While Spark is built on top of a simple logical data structure called a Resilient Distributed Dataset (RDD), it offers high-level data abstractions on top of RDDs, such as DataFrames and Datasets. These abstractions expose operations, in the form of transformations and actions, that provide a simple programming model for building big data applications in a familiar language.

### Modularity
Spark consists of various modules, including SparkSQL (the core module), Structured Streaming, MLlib, and GraphX. These modules are available in various programming languages (Scala, Python, Java, SQL, and R) with well-documented APIs. Although Spark is modular, all of its modules run on the same underlying engine, so a single Spark application can handle all of your workloads in a unified way.

### Extensibility
Spark decouples storage and compute (unlike Hadoop, which includes both), focussing on fast, parallel computation. This allows Spark to read stored data from a very wide range of sources (Apache Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and more) and perform all the processing in memory. Spark's DataFrameReaders and DataFrameWriters can also be extended to read from and write to cloud providers and streaming applications (for example, Amazon S3, Apache Kafka, and Azure Storage). Additionally, Spark has a vibrant community that has created connectors to other sources and monitoring tools.

<p align="center">
<img src="https://github.com/Explore-AI/Pictures/blob/cdf0d3a37606ad752412f2234776b3bd6135ff1c/data_engineering/transform/intro_to_spark/spark_connectors.png?raw=True"
     alt="Spark Connectors"
     style="padding-bottom=0.5em"
     width=600px/>
     <br>
     <center><em>The Apache Spark ecosystem of connectors, including some examples of data source management, application, and environment connectors. Taken from <a href="https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/">Learning Spark</a>  </em></center>
</p>


## Installing Apache Spark and configuring the environment variables
*Before Spark can be used, it needs to be installed, and several environment variables need to be configured.*

Have a look at an installation guide for [Windows 7](https://edumine.wordpress.com/2015/06/11/how-to-install-apache-spark-on-a-windows-7-environment/), [Windows 10](https://phoenixnap.com/kb/install-spark-on-windows-10), or [macOS](https://notadatascientist.com/install-spark-on-macos/), for more guidance on the following steps:
1. Install Java 8 or above.
2. Install Python 3.7 or above, if you don't have a proper installation yet.
3. Download Apache Spark from the [Spark download page](https://spark.apache.org/downloads.html), and choose a release and Hadoop build. For this module, we prefer using Spark 3.x.x, prebuilt with Hadoop 2.7.
4. Verify the integrity of the downloaded Spark file (for example, by comparing its checksum with the one published on the download page).
5. Install Apache Spark.
6. On Windows, add the `winutils.exe` file.
7. Set the `SPARK_HOME`, `HADOOP_HOME`, and `JAVA_HOME` environment variables, and add the `bin` directory of each to your `PATH`.
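
Once you have worked through the steps above, a quick check along the following lines can confirm that the variables are visible to Python. This is a small sketch using only the standard library. The helper names `missing_variables` and `bin_on_path` are our own, and whether you need all three variables depends on your platform and installation method.

```python
import os

REQUIRED_VARIABLES = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def missing_variables(environment):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environment.get(name)]


def bin_on_path(environment, name):
    """Check whether <variable>/bin appears as an entry in PATH."""
    home = environment.get(name)
    if not home:
        return False
    expected = os.path.normpath(os.path.join(home, "bin"))
    entries = environment.get("PATH", "").split(os.pathsep)
    return expected in [os.path.normpath(entry) for entry in entries if entry]


# Report on the environment of the current Python process.
print("Missing:", missing_variables(os.environ))
for variable in REQUIRED_VARIABLES:
    print(variable, "bin on PATH:", bin_on_path(os.environ, variable))
```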


|💡 NOTE |
|:--------------- |
| If you are only programming in Python, you can install Spark from the [PyPI repository](https://pypi.org/project/pyspark/) using `pip install pyspark`, and additional dependencies using `pip install pyspark[sql,ml,mllib]`.|

<p align="center">
<img src="https://github.com/Explore-AI/Pictures/blob/48c7a5e4f87e1770973c91fb114c9a5b02bdc575/data_engineering/transform/intro_to_spark/download_spark.png?raw=True"
     alt="Spark Download Page"
     style="padding-bottom=0.5em"
     width=800px/>
     <br>
     <center><em>The Apache <a href="https://spark.apache.org/downloads.html">Spark download page</a> for Spark 3.x.x, prebuilt with Hadoop 2.7.</em></center>
</p>

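You can now open an interactive PySpark (Python Spark) shell by running `pyspark` from Spark's `bin` directory (or from any directory, if `bin` is on your `PATH`). Alternatively, if you installed PySpark from PyPI, the `pyspark` command is available directly, and you can also import `pyspark` in a regular Python session.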

## Under the Spark hood
*A Spark application has a complex architecture which enables parallel processing, with multiple workers that perform the heavy lifting, and a driver that coordinates the process.*


### Cluster manager
The cluster manager allocates and manages resources on the Spark cluster. Spark currently supports four cluster managers: the built-in standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.

### Spark executor
A Spark executor runs on each worker node in the cluster. Executors communicate with the driver, execute the tasks the driver assigns to them, and report the state of the computation back to it.


### Spark driver
At a high level, a Spark application has a driver program that orchestrates parallel operations on a Spark cluster. The driver accesses the distributed components in the cluster (Spark executors and the cluster manager) through a `SparkSession`, the unified entry point into Spark since Spark 2.0.
The driver also communicates with the cluster manager to request resources (CPU, memory) for the executors, transforms the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors. Once the resources are allocated, the driver communicates directly with the executors.



<p align="center">
<img src="https://github.com/Explore-AI/Pictures/blob/ee6fc34f790a7eaa5259d794813732bd0ed6ff2b/data_engineering/transform/intro_to_spark/spark_cluster.PNG?raw=True"
     alt="Spark Cluster"
     style="padding-bottom=0.5em"
     width=600px/>
     <br>
     <center><em>An overview of an Apache Spark cluster. </em></center>
</p>

In the above diagram, the Spark driver is indicated as a Master Node, and the Spark executor as a Worker Node.
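
The division of labour between the driver and the executors can be imitated on a single machine. In the plain-Python sketch below, threads from the standard library stand in for executors. It is an analogy under that assumption rather than how Spark is implemented, and the function names `driver` and `run_task` are ours.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Work done by one 'executor': sum the numbers in its partition."""
    return sum(partition)


def driver(data, num_partitions):
    """Split the data into tasks, hand them to workers, combine the results."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(run_task, partitions))
    return partial_sums, sum(partial_sums)


partial_sums, total = driver(list(range(1, 101)), num_partitions=4)
print(partial_sums)  # [1225, 1250, 1275, 1300]
print(total)         # 5050
```

The driver never touches the individual numbers itself. It only decides how the work is split and combines what the workers report back.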

### Distributed data and partitions
Spark distributes data across storage as partitions, which can reside in cloud storage or HDFS. While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction: a DataFrame in memory. Where possible, each executor is given tasks that read the partitions closest to it in the network, which minimises network bandwidth. Each executor core is thus assigned its own partition of the data to work on.

<p align="center">
<img src="https://github.com/Explore-AI/Pictures/blob/a50e6031aa1eb955c2a2443ca4e9b063a75424e9/data_engineering/transform/intro_to_spark/spark_data_partitioning.PNG?raw=True"
     alt="Spark Data Partitioning"
     style="padding-bottom=0.5em"
     width=600px/>
     <br>
     <center><em>Data files are distributed across physical machines, and each Spark executor core gets a specific partition of the data to work on. Based on <a href="https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch01.html">this</a> example.</em></center>
</p>

### SparkSession
SparkSession is the single, unified entry point into Spark, through which all other functionality can be accessed (for example, setting runtime configuration parameters, creating DataFrames, accessing metadata, reading in data, and issuing queries). A SparkSession can be instantiated using any of the language APIs. In the interactive Spark shells, it is created for you and made available through the global variable `spark`.

It is worth clarifying the difference between the SparkSession and SparkContext here. Prior to Spark 2.0, you had to create a separate context for each 'type' of work you wanted to do: a SparkContext for core functionality, a SQLContext for SQL work, a HiveContext for Hive, and so on. SparkSession effectively combines all of these contexts into a single entry point.


Let's have a look at the launched SparkSession:

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

In [2]:
spark

Let's have a look at any SparkContexts associated with the launched SparkSession, if any:

In [3]:
sc = spark.sparkContext
sc

It is also possible to create multiple SparkSessions, to allow multiple entrypoints into the SparkContext. This is beyond the scope of this train, and more information on the difference between the SparkSession and SparkContext can be found [here](https://medium.com/@achilleus/spark-session-10d0d66d1d24), along with how to set up multiple SparkSessions.

You can now access the SparkContext, SparkUI, and other methods which act directly on the SparkContext. We will discuss exposing and using these components in future trains.

## Spark components
*The Spark application is not just a single tool, but rather allows for the processing of big data, dealing with graph objects, performing machine learning, and streaming.*

One of the distinguishing features of Spark is the unification of its processing components: a single engine handles batch, stream, machine learning, and graph processing. Each component is separate from the core Spark engine, in the sense that you write your application against one of the above APIs, and Spark converts it into a DAG that is executed by the underlying fault-tolerant engine. Whichever API you use, the code is compiled into highly compact bytecode that runs in the workers' JVMs.

### SparkSQL
This module is specifically for processing structured data (RDBMS tables, CSV, Parquet, Avro, JSON, ORC, etc.). The API first reads in the data from one of the aforementioned sources and converts it to a temporary or permanent Spark table.
This can then be manipulated using one of the programming language APIs, or using SQL queries directly. Spark Structured APIs are ANSI SQL:2003-compliant and can be used as a pure SQL engine as well.

We can create a DataFrame and register it as a temporary view:

In [4]:
# Create data to read into the DataFrame.
data = [(1, 'doge', 23),
        (2, 'pepe', 44),
        (3, 'steve', 65),
        (4, 'brian', 0)] 

columns = ['id', 'name', 'count']  

# Read the data as a DataFrame using the createDataFrame method.
df = spark.createDataFrame(data, columns)

# Create a SparkSQL view using the createOrReplaceTempView() method.
df.createOrReplaceTempView("temp_df")


Let's run through what we have just done:

1. Created a list of tuples which will each form a row within a DataFrame we want to create.
2. Created a list of column names for the DataFrame.
3. Created a DataFrame by using the createDataFrame() method of the SparkSession and specifying the data generated in 1, and field names in 2.
4. Created a view from the DataFrame.

Now we can use the SparkSQL API to query the view we created:

In [9]:
# The spark.sql() method returns a DataFrame on which we can call the show() method to print the results 

spark.sql("SELECT * FROM temp_df").show()

+---+-----+-----+
| id| name|count|
+---+-----+-----+
|  1| doge|   23|
|  2| pepe|   44|
|  3|steve|   65|
|  4|brian|    0|
+---+-----+-----+



You can see that we use the `sql()` method to run SQL queries on the underlying views or tables. 
You will also start to become familiar with another method: `show()` to visualise the data.

### Spark MLlib
This module contains common machine learning algorithms, built on top of the Spark DataFrame-based APIs. The API allows users to extract and transform features, build pipelines for model training and prediction, and persist models for deployment. In addition to basic machine learning algorithms, it contains common linear algebra operations and statistics, as well as lower-level ML primitives, such as a gradient descent optimiser. The API also closely resembles the `fit()`/`transform()` paradigm familiar from the scikit-learn Python package.
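
If you have not met the `fit()`/`transform()` paradigm before, the plain-Python sketch below shows the idea: an estimator learns something from training data in `fit()` and returns a fitted model, which then applies what it learned in `transform()`. The `MeanCentrer` and `MeanCentrerModel` classes are hypothetical and are not part of MLlib or scikit-learn.

```python
class MeanCentrer:
    """Hypothetical estimator: learns the mean of the values during fit()."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return MeanCentrerModel(mean)  # fit() hands back a fitted model


class MeanCentrerModel:
    """Hypothetical fitted model: applies the learned mean in transform()."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [value - self.mean for value in values]


training = [23, 44, 65, 0]
model = MeanCentrer().fit(training)
print(model.mean)                 # 33.0
print(model.transform([33, 43]))  # [0.0, 10.0]
```

Separating what is learned (`fit()`) from how it is applied (`transform()`) is what makes it possible to chain such steps into a pipeline.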

### Spark Structured Streaming
This module is built atop the SparkSQL engine and DataFrame-based APIs. The Stream API views an incoming stream as a continually growing table, with new data being added as rows to the end of the table. This allows developers to treat the table in the same way as they would a static table. This module also accepts a wide range of sources, including Apache Kafka, Kinesis, and HDFS-based or cloud storage.
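
The "continually growing table" model can be sketched in plain Python. In this simplified illustration, each micro-batch of new rows is appended to a table, and the same query (a count per name) is run after every batch. The function name `process_stream` and the sample batches are made up, and a real streaming engine would update its results incrementally rather than rescanning the whole table.

```python
from collections import Counter


def process_stream(micro_batches):
    """Treat the stream as a table that grows with every micro-batch."""
    table = []                 # the conceptual, ever-growing input table
    counts_per_batch = []
    for batch in micro_batches:
        table.extend(batch)    # new rows are appended to the end of the table
        # The same query we would run on a static table: count rows per name.
        counts_per_batch.append(dict(Counter(name for name, _ in table)))
    return counts_per_batch


batches = [
    [("doge", 1), ("pepe", 1)],
    [("doge", 1)],
    [("steve", 1), ("doge", 1)],
]
results = process_stream(batches)
for result in results:
    print(result)
# {'doge': 1, 'pepe': 1}
# {'doge': 2, 'pepe': 1}
# {'doge': 3, 'pepe': 1, 'steve': 1}
```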

### GraphX
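This module is primarily for the creation and manipulation of graph objects and for performing graph-parallel computations. GraphX offers standard graph algorithms for analysis, connections, and traversals (for example, PageRank, connected components, and triangle counting), many of which were contributed by the Spark community.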

## Basic Spark concepts
When using Spark, it is important to understand some of the basic concepts in the Spark application and how the code is transformed and executed as tasks across the Spark executors. 

- *Application*: The complete Spark program as written by a user. It consists of a driver program and executors on the cluster.
- *SparkSession*: An object that provides the point of entry for interacting with the underlying Spark engine and allows you to program Spark through its APIs. In the interactive shell (and in Databricks), it is instantiated for you, but you have to create it yourself when writing a standalone Spark application.
- *Job*: A parallel computation that consists of multiple tasks and is created in response to a Spark action (`collect`, `show`, `write`). Spark transforms each job into a DAG, which is Spark's execution plan. Each node within a DAG can be a single Spark stage or multiple Spark stages.
- *Stage*: Each job is divided into smaller sets of tasks, called stages, that depend on each other.
- *Task*: A single unit of work that is sent to a Spark executor.

Spark operations can be classified as either *transformations* or *actions*. Transformations transform a Spark DataFrame into a new DataFrame without altering the original data, which is what makes DataFrames immutable. Examples include `select()` and `filter()`. All transformations are evaluated lazily: they are not executed immediately, but rather recorded and remembered as a lineage. This allows Spark to rearrange the transformations at a later stage into a more efficient execution order. Lazy evaluation thus lets Spark delay execution until an action is invoked.

As an example, we can select the name column from the DataFrame we created previously:

In [10]:
sub_df = df.select('name')
sub_df

DataFrame[name: string]

This does not actually execute the select statement, but rather records the transformation as a lineage. We can view the execution plan by calling the `explain()` method on the transformed DataFrame:

In [11]:
sub_df.explain()

== Physical Plan ==
*(1) Project [name#1]
+- *(1) Scan ExistingRDD[id#0L,name#1,count#2L]




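Lazy evaluation lets Spark optimise the execution plan by inspecting the whole chain of transformations before running any of them. Lineage and [immutability](https://en.wikipedia.org/wiki/Immutable_object), in turn, provide fault tolerance: because every transformation is recorded and the input data never change, Spark can reproduce a DataFrame by replaying its lineage, and replaying it is [idempotent](https://en.wikipedia.org/wiki/Idempotence).

An action, such as `show()` or `collect()`, triggers the evaluation of all the recorded transformations.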


In [12]:
sub_df.show()

+-----+
| name|
+-----+
| doge|
| pepe|
|steve|
|brian|
+-----+
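
The mechanics of lazy evaluation can be imitated in a few lines of plain Python. The `LazyTable` class below is a hypothetical stand-in for a DataFrame, written only to illustrate the idea: transformations return a new object that records one more step, and nothing is computed until the `collect()` action is called. It performs none of the optimisation that Spark does.

```python
class LazyTable:
    """Illustrative stand-in for a DataFrame: records steps, runs them on demand."""

    def __init__(self, rows, lineage=()):
        self._rows = rows
        self.lineage = tuple(lineage)  # recorded transformations, in order

    def filter(self, predicate):
        # Transformation: return a NEW object with one more recorded step.
        return LazyTable(self._rows, self.lineage + (("filter", predicate),))

    def select(self, *names):
        def pick(row):
            return {name: row[name] for name in names}
        return LazyTable(self._rows, self.lineage + (("select", pick),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        rows = self._rows
        for kind, function in self.lineage:
            if kind == "filter":
                rows = [row for row in rows if function(row)]
            else:
                rows = [function(row) for row in rows]
        return rows


table = LazyTable([
    {"id": 1, "name": "doge", "count": 23},
    {"id": 2, "name": "pepe", "count": 44},
    {"id": 3, "name": "steve", "count": 65},
    {"id": 4, "name": "brian", "count": 0},
])
names = table.filter(lambda row: row["count"] > 30).select("name")
print([kind for kind, _ in names.lineage])  # ['filter', 'select']
print(names.collect())                      # [{'name': 'pepe'}, {'name': 'steve'}]
print(len(table.collect()))                 # 4, the original is unchanged
```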



## Spark Structured API and RDD
*Most of the processing in Spark is done using the Structured API (`pyspark.sql`). However, underneath that sits the RDD API, which was the main way to interact with Spark in its first versions. It is still supported and allows some functionality that is not possible with the higher levels of abstraction.*

### Spark RDD
The RDD is the lowest level of abstraction in Spark. It has three characteristics: dependencies, partitions, and the compute function. All three are integral to the RDD programming model, on which all the high-level APIs are built:

- A list of dependencies instructs Spark on how an RDD is constructed with its required inputs. This enables resiliency, by allowing Spark to recreate the RDD from scratch if required. 

- Partitions allow Spark to split the processing between executors. 

- RDD has a compute function that produces an iterator for the data that will be stored in the RDD. The iterator will iterate over the data to perform the computations it is instructed to do.

While this model is simple and elegant, there are a couple of issues. The compute function is opaque to Spark, meaning that Spark cannot inspect the operation and therefore cannot optimise it. Spark also does not know the data type of the objects in the RDD, and can thus not serialise them efficiently.

First, we have to create an RDD.
One way to create an RDD is to call the `parallelize()` method on the `SparkContext`. Unlike DataFrames, which can be printed with `show()`, the contents of an RDD are viewed by bringing them back to the driver with the `collect()` method.

In [13]:
rdd = sc.parallelize(
    [('pink', 10), ('blue', 9), ('green', 8), ('gold', 7), ('black', 6), ('pink', 7), ('blue', 6)])
rdd.collect()

[('pink', 10),
 ('blue', 9),
 ('green', 8),
 ('gold', 7),
 ('black', 6),
 ('pink', 7),
 ('blue', 6)]

We can inspect various parts of the RDD, such as getting the number of partitions the data reside in:

In [14]:
rdd.getNumPartitions()

4

We can perform various transformations on the RDD object, although doing so is a bit more difficult than working with pandas DataFrames.

Here we map a lambda function over each element of the RDD, accessing the values in each tuple by index, as you would the fields of a row:

In [15]:
rdd.map(lambda y: y[1] * 2).collect()

[20, 18, 16, 14, 12, 14, 12]

If you want to preserve the RDD structure, you have to specify the tuple structure again: 

In [16]:
rdd.map(lambda y: (y[0], y[1] * 2)).collect()

[('pink', 20),
 ('blue', 18),
 ('green', 16),
 ('gold', 14),
 ('black', 12),
 ('pink', 14),
 ('blue', 12)]

You can apply additional functions, such as sorting the returned RDD:

In [20]:
rdd.map(lambda y: (y[0], y[1] * 2)).sortByKey().collect()

[('black', 12),
 ('blue', 18),
 ('blue', 12),
 ('gold', 14),
 ('green', 16),
 ('pink', 20),
 ('pink', 14)]

Some of the other operations on the RDD are a lot less intuitive. `reduceByKey()` is a method that merges the values for each key with the function specified. In this case, the lambda function indicates what should happen to the values of the keys that are matched:

In [15]:
rdd.reduceByKey(lambda a,b: a+b).collect()

[('green', 8), ('pink', 17), ('blue', 15), ('black', 6), ('gold', 7)]
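
If `reduceByKey()` feels opaque, it may help to see the same idea written out in plain Python. The sketch below merges values per key inside each partition first, and then merges the partial results across partitions. The helper `reduce_by_key` and the hand-made split into two partitions are our own illustration and not Spark's implementation.

```python
def reduce_by_key(partitions, function):
    """Merge values per key: first inside each partition, then across them."""
    partial_results = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = function(local[key], value) if key in local else value
        partial_results.append(local)

    merged = {}
    for local in partial_results:
        for key, value in local.items():
            merged[key] = function(merged[key], value) if key in merged else value
    return merged


# The same records as in the RDD above, split by hand into two partitions.
partitions = [
    [("pink", 10), ("blue", 9), ("green", 8), ("gold", 7)],
    [("black", 6), ("pink", 7), ("blue", 6)],
]
totals = reduce_by_key(partitions, lambda a, b: a + b)
print(sorted(totals.items()))
# [('black', 6), ('blue', 15), ('gold', 7), ('green', 8), ('pink', 17)]
```

The totals match the output of `reduceByKey()` above. Only the order differs, because we sorted the result here.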

The RDD API is not a very intuitive interface to work with, and it does not benefit from the automatic optimisations built into the higher-level abstractions. It does allow full customisation, which means that you can tune the computation by hand, but the responsibility for optimising it rests with you.

If you are interested in learning more about the RDD API, spend some time reading the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html).

### Structured Spark
SparkSQL, DataFrames and Datasets make up the Spark Structured API. 

Spark 2.x introduced structure to Spark. The idea was to express computations using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, averaging, and grouping. Because these operators tell Spark what you want to compute, Spark can construct a more efficient query plan for execution.

Structuring Spark then also allows arranging data into a more tabular format that data analysts, scientists, and engineers are familiar with. 

Some other benefits include better performance and space efficiency across Spark components, as well as better expressivity, simplicity, composability, and uniformity.

Spark DataFrame APIs were inspired by pandas DataFrames in structure, format, and some specific operations. Spark DataFrames are like distributed tables with named columns and schemas, where each column has a specific data type. This tabular structure is easy to digest when viewed, and makes it easy to perform common operations on rows or columns. DataFrames are also immutable, and Spark keeps a lineage of all transformations, creating new DataFrames while the previous versions are preserved.

In Spark, we use the `show()` method to print out our DataFrames. The first argument to the `show()` method is the number of rows to show, and the second specifies whether to truncate long values (by default, strings longer than 20 characters are truncated).

Let's preview the DataFrame, setting the number to `10` and truncation to `False`:

In [21]:
# Call the show() method to print the DataFrame

df.show(10, False)

+---+-----+-----+
|id |name |count|
+---+-----+-----+
|1  |doge |23   |
|2  |pepe |44   |
|3  |steve|65   |
|4  |brian|0    |
+---+-----+-----+



Let's select only the name column:

In [22]:
df.select('name').show()

+-----+
| name|
+-----+
| doge|
| pepe|
|steve|
|brian|
+-----+



We can show the original DataFrame again, and we will see that the DataFrame has not changed:

In [17]:
df.show()

+---+-----+-----+
| id| name|count|
+---+-----+-----+
|  1| doge|   23|
|  2| pepe|   44|
|  3|steve|   65|
|  4|brian|    0|
+---+-----+-----+



To keep the result of a selection, we have to assign it to a new object:

In [23]:
df2 = df.select('name')

In [24]:
df2.show()

+-----+
| name|
+-----+
| doge|
| pepe|
|steve|
|brian|
+-----+



#### Data types
A named column in a DataFrame and its associated Spark data type can be declared in the schema.

Spark supports basic internal data types of supported programming languages. For example, Python integer, float, and string types are supported, along with other basic types, as seen in the table below.

*Basic Spark data types, the corresponding Python value type, and the API to instantiate them (from `pyspark.sql.types`).*

| Data type   | Value assigned   | API to instantiate |
|-------------|------------------|--------------------|
| ByteType    | `int`            | `ByteType()`       |
| ShortType   | `int`            | `ShortType()`      |
| IntegerType | `int`            | `IntegerType()`    |
| LongType    | `int`            | `LongType()`       |
| FloatType   | `float`          | `FloatType()`      |
| DoubleType  | `float`          | `DoubleType()`     |
| StringType  | `str`            | `StringType()`     |
| BooleanType | `bool`           | `BooleanType()`    |
| DecimalType | `decimal.Decimal`| `DecimalType()`    |

Additionally, Spark makes provision for more complex data types, such as maps, arrays, dates, and timestamps.

*Python complex data types, the value assigned in Python, and the API to instantiate.*

| Data type     | Value assigned                                   | API to instantiate                        |
|---------------|--------------------------------------------------|-------------------------------------------|
| BinaryType    | `bytearray`                                      | `BinaryType()`                            |
| TimestampType | `datetime.datetime`                              | `TimestampType()`                         |
| DateType      | `datetime.date`                                  | `DateType()`                              |
| ArrayType     | List, tuple, or array                            | `ArrayType(dataType, [nullable])`         |
| MapType       | `dict`                                           | `MapType(keyType, valueType, [nullable])` |
| StructType    | List or tuple                                    | `StructType([fields])`                    |
| StructField   | A value type corresponding to type of this field | `StructField(name, dataType, [nullable])` |

To fully utilise these types they have to be combined in a structured manner. A *schema* defines the column names and data types for a Spark DataFrame. It is best practice to define a schema upfront before creating a DataFrame. This has the following advantages over letting Spark infer the schema:

1. Spark is relieved of the computational burden of inferring the schema. To infer a schema, Spark has to read part of the file, which can be expensive for large files.
2. Data that does not match the schema can be detected simply and early on.

You can define a schema programmatically (using the programming language API), or by employing a Data Definition Language (DDL) string.
Once a DataFrame has been created, its schema can be accessed through the `schema` attribute in Python.

We can demonstrate this practically, first, by importing the types that are needed for defining the schema:

In [25]:
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

In [26]:
schema = StructType([StructField('id', IntegerType()),
                    StructField('name', StringType()),
                    StructField('count', IntegerType())])

`StructType` defines the structure of a DataFrame and is a collection, or list, of `StructField` objects.

The `StructField` class defines a column of a DataFrame. Its required parameters are the column name (a string) and the column type (a `DataType`, as imported above). It also accepts two optional arguments: `nullable` (a Boolean indicating whether the column can contain null values) and `metadata` (a dictionary).

We can then construct a DataFrame, as previously:

In [27]:
data = [(1, 'doge', 23),
        (2, 'pepe', 44),
        (3, 'steve', 65),
        (4, 'brian', 0)] 


df = spark.createDataFrame(data, schema=schema)

In [23]:
df.show()

+---+-----+-----+
| id| name|count|
+---+-----+-----+
|  1| doge|   23|
|  2| pepe|   44|
|  3|steve|   65|
|  4|brian|    0|
+---+-----+-----+



To make sure that the values were read correctly, we can use the `printSchema()` method offered by Spark:

In [28]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- count: integer (nullable = true)



#### Columns and expressions
Columns are the basic operational units in Spark, and in Spark-supported programming languages, columns are objects with public methods. You can use the `expr()` function to build a column from a SQL expression string; for example, `expr("pow(columnName, 2)")` computes the square of each value in the `columnName` column. Columns can also be referenced using the `col()` function; to achieve the same as above you would use `col("columnName") ** 2`. Other functions and methods can be found in the [reference documentation](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html), and include string manipulations (`concat`), aggregations (`sum`, `count`), comparisons (`>`, `<`), and ordering (`sort`).

We recommend that you spend some more time experimenting with the Spark Structured API and DataFrames to get comfortable with the API (including trying all of the above expressions).

## Spark Dataset

In addition to DataFrames, Spark also has a Dataset API, which is only available in Scala and Java.
This API offers some additional options, which are made possible by the static typing of these two languages.

For more information, read the [Spark documentation](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html).

## Conclusion

This train was an initial introduction to the Spark API. We introduced you to the characteristics and inner workings of Spark, briefly showed the four components of Spark, and presented the RDD and Structured APIs, along with some of the functions and methods available in the Spark API.

You should now start getting comfortable with using the Spark API, be able to create DataFrames, and start thinking about which aggregations you would want to perform on a given dataset. 