# Spark

## Spark: What’s Underneath an RDD?

• Dependencies   
• Partitions (with some locality information)   
• Compute function: Partition => Iterator[T]   
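To make the three pieces concrete, here is a toy, single-process model in plain Python. It is only a sketch of the idea, not Spark's implementation: `ToyRDD` and `from_chunks` are hypothetical names, and locality information is left out.

```python
from typing import Callable, Iterator, List, Optional


class ToyRDD:
    """A toy stand-in for an RDD: partitions, a compute function, and a parent."""

    def __init__(self,
                 partitions: List[int],
                 compute: Callable[[int], Iterator],
                 parent: Optional["ToyRDD"] = None):
        self.partitions = partitions  # partition ids
        self.compute = compute        # Partition => Iterator[T]
        self.parent = parent          # dependency (lineage)

    def map(self, fn: Callable) -> "ToyRDD":
        # A new ToyRDD that lazily applies fn on top of the parent's compute function
        return ToyRDD(self.partitions,
                      lambda p: (fn(x) for x in self.compute(p)),
                      parent=self)

    def collect(self) -> list:
        # Nothing is computed until we walk every partition here
        return [x for p in self.partitions for x in self.compute(p)]


def from_chunks(chunks: List[list]) -> ToyRDD:
    """Build a base ToyRDD in which each chunk is one partition."""
    return ToyRDD(list(range(len(chunks))), lambda p: iter(chunks[p]))


base = from_chunks([[1, 2], [3, 4, 5]])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10]
```

Because `doubled` keeps a reference to `base`, the lineage can be replayed to recompute a lost partition, which is the role dependencies play in a real RDD.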


Spark 2.x introduced a few key schemes for structuring Spark. One is to express
computations by using common patterns found in data analysis. These patterns are
expressed as high-level operations such as filtering, selecting, counting, aggregating,
averaging, and grouping. This provides added clarity and simplicity.

This specificity is further narrowed through the use of a set of common operators in a
DSL.

## Schemas and Creating DataFrames

Defining a schema up front as opposed to taking a schema-on-read approach offers three benefits:
    
• You relieve Spark from the onus of inferring data types.  
• You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.  
• You can detect errors early if data doesn’t match the schema.   

Here's an example of how to define a schema:

In [1]:
# In Python
from pyspark.sql import SparkSession
#sc.setLogLevel(newLevel)
# Define schema for our data using DDL
schema = "`Id` INT, `First` STRING, `Last` STRING, `Url` STRING, `Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>"
# Create our static data
data = [[1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter",
        "LinkedIn"]],
        [2, "Brooke","Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter",
        "LinkedIn"]],
        [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web",
        "twitter", "FB", "LinkedIn"]],
        [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568,
        ["twitter", "FB"]],
        [5, "Matei","Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web",
        "twitter", "FB", "LinkedIn"]],
        [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568,
        ["twitter", "LinkedIn"]]
        ]
# Main program
if __name__ == "__main__":
    # Create a SparkSession
    spark = (SparkSession
        .builder
        .appName("Example-3_6")
        .getOrCreate())
    # Create a DataFrame using the schema defined above
    blogs_df = spark.createDataFrame(data, schema)
    # Show the DataFrame; it should reflect the data defined above
    blogs_df.show()
    # Print the schema used by Spark to process the DataFrame
    # (printSchema() prints directly and returns None, so no print() is needed)
    blogs_df.printSchema()



+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)

[Here's a link to the documentation on Python](http://spark.apache.org/docs/latest/api/python/)

To start with the basics, let's look at the [columns API](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#column-apis).

If you run into a data type issue,
```Column.cast(dataType)``` converts the column to the type dataType.
...
Keep exploring.   

On Rows, here are examples of how to make quick rows:

In [2]:
from pyspark.sql import Row
blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015", ["twitter", "LinkedIn"])
# access using index for individual items
blog_row[1]

'Reynold'

In [3]:
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



To read from a file, use ```DataFrameReader``` (reached through `spark.read`).   
To save to a file, use ```DataFrameWriter``` (reached through `df.write`).

## Hands on a Dataset
Quote:
To get started, let’s read a large CSV file containing data on San Francisco Fire
Department calls. As noted previously, we will define a schema for this file and use
the DataFrameReader class and its methods to tell Spark what to do. Because this file
contains 28 columns and over 4,380,660 records, it’s more efficient to define a
schema than have Spark infer it.

First, let's define the schema for the dataset.

In [4]:
from pyspark.sql.types import *
# Programmatic way to define a schema
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
            StructField('UnitID', StringType(), True),
            StructField('IncidentNumber', IntegerType(), True),
            StructField('CallType', StringType(), True),
            StructField('CallDate', StringType(), True),
            StructField('WatchDate', StringType(), True),
            StructField('CallFinalDisposition', StringType(), True),
            StructField('AvailableDtTm', StringType(), True),
            StructField('Address', StringType(), True),
            StructField('City', StringType(), True),
            StructField('Zipcode', IntegerType(), True),
            StructField('Battalion', StringType(), True),
            StructField('StationArea', StringType(), True),
            StructField('Box', StringType(), True),
            StructField('OriginalPriority', StringType(), True),
            StructField('Priority', StringType(), True),
            StructField('FinalPriority', IntegerType(), True),
            StructField('ALSUnit', BooleanType(), True),
            StructField('CallTypeGroup', StringType(), True),
            StructField('NumAlarms', IntegerType(), True),
            StructField('UnitType', StringType(), True),
            StructField('UnitSequenceInCallDispatch', IntegerType(), True),
            StructField('FirePreventionDistrict', StringType(), True),
            StructField('SupervisorDistrict', StringType(), True),
            StructField('Neighborhood', StringType(), True),
            StructField('Location', StringType(), True),
            StructField('RowID', StringType(), True),
            StructField('Delay', FloatType(), True)])


Now, let's read the dataset file. ```dfFire``` is read from the file path below, using the schema defined above.

In [5]:
sf_fire_file = "../repo/chapter3/data/sf-fire-calls.csv"
dfFire = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

To save the DataFrame, for example to Parquet, which stores the schema in its metadata, it's as simple as:   
```python
parquet_path = ...
dfFire.write.format("parquet").save(parquet_path)
```

### Filters and projections

A projection in relational parlance returns only a chosen subset of columns, while a
filter returns only the rows matching a certain condition. In Spark, projections are
done with the select() method, while filters can be expressed using the filter() or
where() method. We can use both to examine specific aspects of our SF Fire
Department data set:

In [6]:
# In Python
from pyspark.sql.functions import col

dfFewFires = (dfFire
        .select("IncidentNumber", "AvailableDtTm", "CallType")
        .where(col("CallType") != "Medical Incident"))
dfFewFires.show(5, truncate=False)


+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



```countDistinct```: Returns a new Column for distinct count of col or cols.   
What if we want to know how many distinct CallTypes were recorded as the causes
of the fire calls? These simple and expressive queries do the job:

In [7]:
from pyspark.sql.functions import countDistinct
(dfFire
    .select("CallType")
    .where(col("CallType").isNotNull())
    .agg(countDistinct("CallType").alias("DistinctCallTypes"))
    .show()
)




+-----------------+
|DistinctCallTypes|
+-----------------+
|               30|
+-----------------+



                                                                                

We can list the distinct call types in the data set by filtering for distinct, non-null CallTypes across all rows:

In [8]:
(dfFire
    .select("CallType")
    .where(col("CallType").isNotNull())
    .distinct()
    .show(10, False)
)

+-----------------------------------+
|CallType                           |
+-----------------------------------+
|Elevator / Escalator Rescue        |
|Marine Fire                        |
|Aircraft Emergency                 |
|Confined Space / Structure Collapse|
|Administrative                     |
|Alarms                             |
|Odor (Strange / Unknown)           |
|Citizen Assist / Service Call      |
|HazMat                             |
|Watercraft in Distress             |
+-----------------------------------+
only showing top 10 rows



Renaming columns in PySpark is also easy.  
An open question: can I pass a dictionary of renames and have Spark apply them all?

In [9]:
# In Python
dfNewFire = dfFire.withColumnRenamed("Delay", "ResponseDelayedinMins")
(dfNewFire
    .select("ResponseDelayedinMins")
    .where(col("ResponseDelayedinMins") > 5)
    .show(5, False)
)


+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows



### Mutate columns

Modifying the contents of a column or its type is a common operation during data
exploration. 

For example, in our SF Fire Department data set, the columns CallDate, WatchDate, and AvailableDtTm are strings rather than either Unix timestamps or SQL dates, both of which Spark supports and can easily manipulate during transformations or actions.

`pyspark.sql.functions` has a set of to/from date/timestamp functions such as `to_timestamp()` and `to_date()` that we can use for just this purpose.

In [10]:
from pyspark.sql.functions import to_timestamp
dfFireTS = (dfNewFire
    .withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy"))
    .drop("CallDate")
    .withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy"))
    .drop("WatchDate")
    .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"),
    "MM/dd/yyyy hh:mm:ss a"))
    .drop("AvailableDtTm"))


In [11]:
(dfFireTS
    .select("IncidentDate", "OnWatchDate", "AvailableDtTS")
    .show(5, False))


+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows



Unpacked:
1. Convert the existing column’s data type from string to a Spark-supported timestamp.
2. Use the new format specified in the format string "MM/dd/yyyy" or "MM/dd/yyyy hh:mm:ss a" where appropriate.
3. After converting to the new data type, `drop()` the old column and append the new one specified in the first argument to the `withColumn()` method.
4. Assign the new modified DataFrame to `dfFireTS`.

In [12]:
from pyspark.sql.functions import year
(dfFireTS
    .select(year('IncidentDate'))
    .distinct()
    .orderBy(year('IncidentDate'))
    .show())




+------------------+
|year(IncidentDate)|
+------------------+
|              2000|
|              2001|
|              2002|
|              2003|
|              2004|
|              2005|
|              2006|
|              2007|
|              2008|
|              2009|
|              2010|
|              2011|
|              2012|
|              2013|
|              2014|
|              2015|
|              2016|
|              2017|
|              2018|
+------------------+



                                                                                

### Aggregations

What if we want to know what the most common types of fire calls were, or what zip codes accounted for the most calls? These kinds of questions are common in data analysis and exploration.

Let’s take our first question: what were the most common types of fire calls?

In [13]:
(dfFireTS
    .select("CallType")
    .where(col("CallType").isNotNull())
    .groupBy("CallType")
    .count()
    .orderBy("count", ascending=False)
    .show(n=10, truncate=False))

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
+-------------------------------+------+
only showing top 10 rows



Other common DataFrame operations include `min()`, `max()`, `sum()`, and `avg()`:

In [14]:
import pyspark.sql.functions as F
(dfFireTS
    .select(F.sum("NumAlarms"),
            F.avg("ResponseDelayedinMins").alias('ResponseDelayAverage'),
            F.min("ResponseDelayedinMins"),
            F.max("ResponseDelayedinMins"))
    .show())


+--------------+--------------------+--------------------------+--------------------------+
|sum(NumAlarms)|ResponseDelayAverage|min(ResponseDelayedinMins)|max(ResponseDelayedinMins)|
+--------------+--------------------+--------------------------+--------------------------+
|        176170|   3.892364154521585|               0.016666668|                   1844.55|
+--------------+--------------------+--------------------------+--------------------------+



## End-to-End DataFrame Example

• What were all the different types of fire calls in 2018?   
• What months within the year 2018 saw the highest number of fire calls?  
• Which neighborhood in San Francisco generated the most fire calls in 2018?  
• Which neighborhoods had the worst response times to fire calls in 2018?  
• Which week in the year in 2018 had the most fire calls?  
• Is there a correlation between neighborhood, zip code, and number of fire calls?   
• How can we use Parquet files or SQL tables to store this data and read it back?   



In [15]:
# My Code here

## Typed Objects, Untyped Objects, and Generic Rows

In Spark’s supported languages, Datasets make sense only in Java and Scala, whereas in Python and R only DataFrames make sense. This is because Python and R are not compile-time type-safe; types are dynamically inferred or assigned during execution, not during compile time. The reverse is true in Scala and Java.

Row is a generic object type in Spark, holding a collection of mixed types that can be
accessed using an index.

In [16]:
from pyspark.sql import Row
row = Row(350, True, "Learning Spark 2E", None)

In [17]:
row[0]

350

In [18]:
row[1]

True

In [19]:
row[2]

'Learning Spark 2E'

## DataFrames, Datasets, and RDDs

• If you want to tell Spark what to do, not how to do it, use DataFrames or Datasets.  
• If you want rich semantics, high-level abstractions, and DSL operators, use DataFrames or Datasets.  
• If you want strict compile-time type safety and don’t mind creating multiple case classes for a specific Dataset[T], use Datasets.  
• If your processing demands high-level expressions, filters, maps, aggregations, computing averages or sums, SQL queries, columnar access, or use of relational operators on semi-structured data, use DataFrames or Datasets.  
• If your processing dictates relational transformations similar to SQL-like queries, use DataFrames.  
• If you want to take advantage of and benefit from Tungsten’s efficient serialization with Encoders, use Datasets.  
• If you want unification, code optimization, and simplification of APIs across Spark components, use DataFrames.  
• If you are an R user, use DataFrames.   
• If you are a Python user, use DataFrames and drop down to RDDs if you need more control.  
• If you want space and speed efficiency, use DataFrames.  
• If you want errors caught during compilation rather than at runtime, choose the appropriate API as depicted in the following figure.   

![title](chapter3/img1.png)

Use RDDs when you:

• Are using a third-party package that’s written using RDDs  
• Can forgo the code optimization, efficient space utilization, and performance benefits available with DataFrames and Datasets   
• Want to precisely instruct Spark how to do a query   



What’s more, you can seamlessly move between DataFrames or Datasets and RDDs at will using a simple API method call, `df.rdd`. (Note, however, that this does have a cost and should be avoided unless necessary.) After all, DataFrames and Datasets are built on top of RDDs, and they get decomposed to compact RDD code during whole-stage code generation, which we discuss in the next section.

Spark Stack:
![title](chapter3/img2.png)

### The .explain(True) method

![title](chapter3/img3.png)

In [20]:
dfFireTS.explain(True)

== Parsed Logical Plan ==
Project [CallNumber#56, UnitID#57, IncidentNumber#58, CallType#59, CallFinalDisposition#62, Address#64, City#65, Zipcode#66, Battalion#67, StationArea#68, Box#69, OriginalPriority#70, Priority#71, FinalPriority#72, ALSUnit#73, CallTypeGroup#74, NumAlarms#75, UnitType#76, UnitSequenceInCallDispatch#77, FirePreventionDistrict#78, SupervisorDistrict#79, Neighborhood#80, Location#81, RowID#82, ... 4 more fields]
+- Project [CallNumber#56, UnitID#57, IncidentNumber#58, CallType#59, CallFinalDisposition#62, AvailableDtTm#63, Address#64, City#65, Zipcode#66, Battalion#67, StationArea#68, Box#69, OriginalPriority#70, Priority#71, FinalPriority#72, ALSUnit#73, CallTypeGroup#74, NumAlarms#75, UnitType#76, UnitSequenceInCallDispatch#77, FirePreventionDistrict#78, SupervisorDistrict#79, Neighborhood#80, Location#81, ... 5 more fields]
   +- Project [CallNumber#56, UnitID#57, IncidentNumber#58, CallType#59, CallFinalDisposition#62, AvailableDtTm#63, Address#64, City#65, Zi

21/08/08 22:21:27 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


## Summary (Cropped)

Through illustrative common data operations and code examples, we demonstrated that the high-level DataFrame and Dataset APIs are far more expressive and intuitive than the low-level RDD API. Designed to make processing of large data sets easier, the Structured APIs provide domain-specific operators for common data operations, increasing the clarity and expressiveness of your code.