[_Ref: Learning Spark - Chapter 3, Damji et al_]

There are three key Spark interfaces that you should know about. [Ref](https://databricks.com/spark/getting-started-with-apache-spark/quick-start#spark-interfaces)

- RDD (*original data structure for Apache Spark*)
- 💗 DataFrame (*most common today*)
- Datasets (*Java and Scala only*)

The RDD is the most basic abstraction in Spark, a simple programming API model upon which all higher-level functionality is constructed. However, the RDD APIs are not very expressive, and in most scenarios you will not want to use them directly.

Spark 2.x introduced a few key schemes for structuring Spark. One is to express computations by using common patterns found in data analysis. These patterns are expressed as high-level operations such as filtering, selecting, counting, aggregating, averaging, and grouping. This provides added clarity and simplicity.

### Is it efficient? - Opaqueness of RDD

The computation of an RDD is opaque to Spark. That is, Spark does not know what you are doing in the compute function. Whether you are performing a join, filter, select, or aggregation, Spark only sees it as a lambda expression. Another problem is that the data type is also opaque for Python RDDs; Spark only knows that it’s a generic object in Python.

This opacity clearly hampers Spark’s ability to rearrange your computation into an efficient query plan.

What about **DataFrame** APIs? Their operators let you tell Spark what you wish to compute with your data, and as a result, it can construct an efficient query plan for execution. Structure yields a number of benefits, including better performance and space efficiency across Spark components.

## DataFrame APIs

A `DataFrame` looks like a table to us, and it is in fact a distributed in-memory table with named columns (reminiscent of a pandas DataFrame).

Each column is assigned a [data type](https://spark.apache.org/docs/latest/sql-ref-datatypes.html). The types fall into two main categories:

- basic data types
  - numeric (`IntegerType`, `FloatType`, etc)
  - strings (`StringType`, etc)
  - boolean
- structured and complex types
  - date and time (`TimestampType`, `DateType`, `DayTimeIntervalType`, etc)
  - structures (`ArrayType`, `MapType`, `StructType`)

In [0]:
# Spark can infer data types (inferSchema=True); without that option or an explicit schema, every CSV column is read as a string
df = spark.read.csv('dbfs:/databricks-datasets/learning-spark-v2/mnm_dataset.csv', header=True)
df.show()

+-----+------+-----+
|State| Color|Count|
+-----+------+-----+
|   TX|   Red|   20|
|   NV|  Blue|   66|
|   CO|  Blue|   79|
|   OR|  Blue|   71|
|   WA|Yellow|   93|
|   WY|  Blue|   16|
|   CA|Yellow|   53|
|   WA| Green|   60|
|   OR| Green|   71|
|   TX| Green|   68|
|   NV| Green|   59|
|   AZ| Brown|   95|
|   WA|Yellow|   20|
|   AZ|  Blue|   75|
|   OR| Brown|   72|
|   NV|   Red|   98|
|   WY|Orange|   45|
|   CO|  Blue|   52|
|   TX| Brown|   94|
|   CO|   Red|   82|
+-----+------+-----+
only showing top 20 rows



However, it is also possible to define the schema and the data type **before** reading the data (and this is actually recommended).

❓🙋‍♀️🙋‍♂️
Why is it recommended to define the schema before reading a dataset?

In [0]:
df.printSchema()

root
 |-- State: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Count: string (nullable = true)



Three main advantages:

1. Performance. When Spark has to infer the schema, it must read the data (or part of it) an extra time just to make this inference. This can be avoided if you tell Spark which data type to expect for each column.
2. Precision. Spark might choose the wrong data types when inferring.
3. Errors. Some datasets contain errors (wrong formatting, etc.). These errors can be spotted when the data does not match the schema.