[_Ref: Learning Spark - Chapter 3, Damji et al_]

There are three key Spark interfaces that you should know about. [Ref](https://databricks.com/spark/getting-started-with-apache-spark/quick-start#spark-interfaces)

- RDD (*original data structure for Apache Spark*)
- 💗 DataFrame (*most common today*)
- Datasets (*Java and Scala only*)

The RDD is the most basic abstraction in Spark, a simple programming API model upon which all higher-level functionality is constructed. However, the RDD API is low-level and not very expressive, so in most scenarios you are better off using the higher-level APIs.

Spark 2.x introduced a few key schemes for structuring Spark. One is to express computations by using common patterns found in data analysis. These patterns are expressed as high-level operations such as filtering, selecting, counting, aggregating, averaging, and grouping. This provides added clarity and simplicity.

### Is it efficient? - Opaqueness of RDD

The computation of an RDD is opaque to Spark. That is, Spark does not know what you are doing in the compute function. Whether you are performing a join, filter, select, or aggregation, Spark only sees it as a lambda expression. Another problem is that the data type is also opaque for Python RDDs; Spark only knows that it’s a generic object in Python.

This opacity clearly hampers Spark’s ability to rearrange your computation into an efficient query plan.

What about **DataFrame** APIs? Their operators let you tell Spark what you wish to compute with your data, and as a result, it can construct an efficient query plan for execution. Structure yields a number of benefits, including better performance and space efficiency across Spark components.

## DataFrame APIs

A `DataFrame` looks like a table: it is a distributed in-memory table with named columns, reminiscent of a pandas DataFrame.

Each column is assigned a [data type](https://spark.apache.org/docs/latest/sql-ref-datatypes.html). The types fall into two main categories:

- basic data types
  - numeric (`IntegerType`, `FloatType`, etc)
  - strings (`StringType`, etc)
  - boolean
- structured and complex types
  - date and time (`TimestampType`, `DateType`, `DayTimeIntervalType`, etc)
  - structures (`ArrayType`, `MapType`, `StructType` with its `StructField`s)

In [0]:
# without a schema (and without inferSchema=True), Spark reads every CSV column as a string
df = spark.read.csv('dbfs:/databricks-datasets/learning-spark-v2/mnm_dataset.csv', header=True)
df.show(3)

+-----+-----+-----+
|State|Color|Count|
+-----+-----+-----+
|   TX|  Red|   20|
|   NV| Blue|   66|
|   CO| Blue|   79|
+-----+-----+-----+
only showing top 3 rows



However, it is also possible to define the schema and the data type **before** reading the data (and this is actually recommended).

❓🙋‍♀️🙋‍♂️
Why is it recommended to define the schema before reading a dataset?

.

.

.

.

.

.

.

.

In [0]:
df.printSchema()

root
 |-- State: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Count: string (nullable = true)



Three main advantages:

1. Performance. To infer a schema, Spark has to read part of the dataset just to guess the types. You avoid this extra work by telling Spark which data type to expect for each column.
2. Precision. Spark might choose the wrong data types when inferring.
3. Error detection. Datasets can contain errors (wrong formatting, etc.). These errors can be spotted when the data does not match the schema.

In [0]:
# Let's define a schema BEFORE reading the data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("State", StringType(), False),
    StructField("Color", StringType(), False),
    StructField("Count", IntegerType(), False)
])

In [0]:
df = spark.read.csv(
    'dbfs:/databricks-datasets/learning-spark-v2/mnm_dataset.csv',
    schema=schema,
    header=True
)

In [0]:
df.printSchema()

root
 |-- State: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Count: integer (nullable = true)



In [0]:
df.show(3)

+-----+-----+-----+
|State|Color|Count|
+-----+-----+-----+
|   TX|  Red|   20|
|   NV| Blue|   66|
|   CO| Blue|   79|
+-----+-----+-----+
only showing top 3 rows



Writing a schema from scratch is tedious. Can we speed it up? Yes: we can print the schema definition of an existing `DataFrame` and reuse it.

In [0]:
df.schema

Out[7]: StructType([StructField('State', StringType(), True), StructField('Color', StringType(), True), StructField('Count', IntegerType(), True)])

### Columns and expressions

What exactly are the columns of a `DataFrame`? They are `Column` objects, which can be accessed in different ways, including the `col` function.

In [0]:
df.Count

Out[8]: Column<'Count'>

In [0]:
from pyspark.sql.functions import col

In [0]:
col('Count')

Out[10]: Column<'Count'>

Let's see what we can do with columns and how we can transform them.

In [0]:
# select columns using col
df.select(col('Count')).show(3)

+-----+
|Count|
+-----+
|   20|
|   66|
|   79|
+-----+
only showing top 3 rows



In [0]:
# select columns using DataFrame.NameOfColumn
df.select(df.Count).show(3)

+-----+
|Count|
+-----+
|   20|
|   66|
|   79|
+-----+
only showing top 3 rows



In [0]:
# We can transform a column via expr or using direct python operators
df.select(df.Count * 10).show(3)

from pyspark.sql.functions import expr
df.select(expr('Count * 10')).show(3)

+------------+
|(Count * 10)|
+------------+
|         200|
|         660|
|         790|
+------------+
only showing top 3 rows

+------------+
|(Count * 10)|
+------------+
|         200|
|         660|
|         790|
+------------+
only showing top 3 rows



In [0]:
# With withColumn, we can define a new column that is computed based on other columns' values
df.withColumn('halfCount', df.Count / 2).show(3)
df.withColumn('halfCount', col('Count') / 2).show(3)
df.withColumn('halfCount', expr('Count / 2')).show(3)

+-----+-----+-----+---------+
|State|Color|Count|halfCount|
+-----+-----+-----+---------+
|   TX|  Red|   20|     10.0|
|   NV| Blue|   66|     33.0|
|   CO| Blue|   79|     39.5|
+-----+-----+-----+---------+
only showing top 3 rows

+-----+-----+-----+---------+
|State|Color|Count|halfCount|
+-----+-----+-----+---------+
|   TX|  Red|   20|     10.0|
|   NV| Blue|   66|     33.0|
|   CO| Blue|   79|     39.5|
+-----+-----+-----+---------+
only showing top 3 rows

+-----+-----+-----+---------+
|State|Color|Count|halfCount|
+-----+-----+-----+---------+
|   TX|  Red|   20|     10.0|
|   NV| Blue|   66|     33.0|
|   CO| Blue|   79|     39.5|
+-----+-----+-----+---------+
only showing top 3 rows



In [0]:
# we can concatenate columns too
from pyspark.sql.functions import concat
df.withColumn('StateColor', concat(df.State, df.Color))\
  .select('StateColor').show(3)

+----------+
|StateColor|
+----------+
|     TXRed|
|    NVBlue|
|    COBlue|
+----------+
only showing top 3 rows



In [0]:
# what if I'd like to split them with a '-'?
df.withColumn('StateColor', concat(df.State, ' - ', df.Color))\
  .select('StateColor').show(3)

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-1933447525078976>:2
      1 # what if I'd like to split them with a '-'?
----> 2 df.withColumn('StateColor', concat(df.State, ' - ', df.Color))\
      3   .select('StateColor').show(3)

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_fun

In [0]:
# concat interprets a plain string as a column name, so ' - ' fails: there is no column with that name
# the lit function wraps a constant (a literal) in a Column so it can be used here
from pyspark.sql.functions import lit
df.withColumn('StateColor', concat(df.State, lit(' - '), df.Color))\
  .select('StateColor').show(3)

+----------+
|StateColor|
+----------+
|  TX - Red|
| NV - Blue|
| CO - Blue|
+----------+
only showing top 3 rows



In [0]:
# We can sort a dataframe according to the value of a column
df.sort(df.Count.desc()).show()

+-----+------+-----+
|State| Color|Count|
+-----+------+-----+
|   CA| Brown|  100|
|   WY| Green|  100|
|   NV|   Red|  100|
|   TX|   Red|  100|
|   CA|   Red|  100|
|   UT|   Red|  100|
|   WY|  Blue|  100|
|   UT|Yellow|  100|
|   AZ|Orange|  100|
|   CO|   Red|  100|
|   TX|Yellow|  100|
|   CO| Green|  100|
|   NV|   Red|  100|
|   CA| Brown|  100|
|   UT|Yellow|  100|
|   CA|   Red|  100|
|   NM| Green|  100|
|   NV|   Red|  100|
|   WA| Brown|  100|
|   NM|Orange|  100|
+-----+------+-----+
only showing top 20 rows



## Exercise

Consider the dataset `/databricks-datasets/flights/departuredelays.csv` about flights and delays.

1. Import the CSV into a `DataFrame`. Would you define the schema beforehand?
2. The column `delay` expresses the delay in minutes. Can you compute a new column `delayInHours` where the `delay` is converted to hours?
3. Which flight has the largest delay?
4. [Bonus] What is the most popular route? Note that a route is the combination of an `origin` and a `destination`.

## Rows

A row in Spark is a generic [Row](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.Row.html) object, containing one or more columns.
Its fields can be accessed:

- like attributes (`row.key`)
- like dictionary values (`row['key']`)

In [0]:
from pyspark.sql import Row

Row('Alice', 11)

Out[20]: <Row('Alice', 11)>

In [0]:
# Or using named arguments
Row(name="Alice", age=11)

Out[21]: Row(name='Alice', age=11)

In [0]:
alice_row = Row(name="Alice", age=11)
# we can access values via index or via key
print(alice_row[0])
print(alice_row['name'])

Alice
Alice


In [0]:
# you can quickly create DataFrames for prototyping
rows = [
    alice_row,
    Row(name='Bob', age=15)
]
spark.createDataFrame(rows).show()

+-----+---+
| name|age|
+-----+---+
|Alice| 11|
|  Bob| 15|
+-----+---+



In [0]:
# collect() loads the whole DataFrame into the driver's memory as a list of Row objects
df.collect()

Out[27]: [Row(State='TX', Color='Red', Count=20),
 Row(State='NV', Color='Blue', Count=66),
 Row(State='CO', Color='Blue', Count=79),
 Row(State='OR', Color='Blue', Count=71),
 Row(State='WA', Color='Yellow', Count=93),
 Row(State='WY', Color='Blue', Count=16),
 Row(State='CA', Color='Yellow', Count=53),
 Row(State='WA', Color='Green', Count=60),
 Row(State='OR', Color='Green', Count=71),
 Row(State='TX', Color='Green', Count=68),
 Row(State='NV', Color='Green', Count=59),
 Row(State='AZ', Color='Brown', Count=95),
 Row(State='WA', Color='Yellow', Count=20),
 Row(State='AZ', Color='Blue', Count=75),
 Row(State='OR', Color='Brown', Count=72),
 Row(State='NV', Color='Red', Count=98),
 Row(State='WY', Color='Orange', Count=45),
 Row(State='CO', Color='Blue', Count=52),
 Row(State='TX', Color='Brown', Count=94),
 Row(State='CO', Color='Red', Count=82),
 Row(State='CO', Color='Red', Count=12),
 Row(State='CO', Color='Red', Count=17),
 Row(State='OR', Color='Green', Count=16),
 Row(State='AZ

## User Defined Functions

You can define your own custom functions to transform `DataFrame`s. They are called _User Defined Functions_ (UDF).

In [0]:
from pyspark.sql.functions import udf

@udf
def to_upper(some_string):
    if some_string is not None:
        return some_string.upper()

In [0]:
df.select("Color", to_upper(col("Color"))).show(5)

+------+---------------+
| Color|to_upper(Color)|
+------+---------------+
|   Red|            RED|
|  Blue|           BLUE|
|  Blue|           BLUE|
|  Blue|           BLUE|
|Yellow|         YELLOW|
+------+---------------+
only showing top 5 rows

