# Spark

## Spark: What’s Underneath an RDD?

• Dependencies   
• Partitions (with some locality information)   
• Compute function: Partition => Iterator[T]   
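
To make these three pieces concrete, here is a toy pure-Python model. It is a sketch only: `ToyRDD`, `from_chunks`, `map`, and `collect` are names made up for this note rather than Spark's real classes, and locality information is left out.

```python
from typing import Callable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: dependencies, partitions, and a compute function."""

    def __init__(self, partitions: List[int],
                 compute: Callable[[int], Iterator],
                 dependencies: Optional[List["ToyRDD"]] = None):
        self.partitions = partitions            # partition ids
        self.compute = compute                  # Partition => Iterator[T]
        self.dependencies = dependencies or []  # parent RDDs (the lineage)

    @classmethod
    def from_chunks(cls, chunks: List[list]) -> "ToyRDD":
        # A source RDD has no parents; each chunk acts as one partition.
        return cls(list(range(len(chunks))), lambda p: iter(chunks[p]))

    def map(self, fn: Callable) -> "ToyRDD":
        # The child records its parent and wraps the parent's compute lazily.
        return ToyRDD(self.partitions,
                      lambda p: (fn(x) for x in self.compute(p)),
                      dependencies=[self])

    def collect(self) -> list:
        # Nothing is computed until an action walks every partition.
        return [x for p in self.partitions for x in self.compute(p)]


source = ToyRDD.from_chunks([[1, 2], [3, 4, 5]])
doubled = source.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10]
```

The point of the sketch is that `map` does no work by itself; it only records the parent and wraps the compute function, which is roughly how lineage and lazy evaluation fit together.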


Spark 2.x introduced a few key schemes for structuring Spark. One is to express
computations by using common patterns found in data analysis. These patterns are
expressed as high-level operations such as filtering, selecting, counting, aggregating,
averaging, and grouping. This provides added clarity and simplicity.

This specificity is further narrowed through the use of a set of common operators in a
DSL.

## Schemas and Creating DataFrames

Defining a schema up front as opposed to taking a schema-on-read approach offers three benefits:
    
• You relieve Spark from the onus of inferring data types.  
• You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.  
• You can detect errors early if data doesn’t match the schema.   

Here's an example of how to define a schema:

In [4]:
# In Python
from pyspark.sql import SparkSession
#sc.setLogLevel(newLevel)
# Define schema for our data using DDL
schema = "`Id` INT, `First` STRING, `Last` STRING, `Url` STRING, `Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>"
# Create our static data
data = [[1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter",
        "LinkedIn"]],
        [2, "Brooke","Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter",
        "LinkedIn"]],
        [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web",
        "twitter", "FB", "LinkedIn"]],
        [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568,
        ["twitter", "FB"]],
        [5, "Matei","Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web",
        "twitter", "FB", "LinkedIn"]],
        [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568,
        ["twitter", "LinkedIn"]]
        ]
# Main program
if __name__ == "__main__":
    # Create a SparkSession
    spark = (SparkSession
        .builder
        .appName("Example-3_6")
        .getOrCreate())
    # Create a DataFrame using the schema defined above
    blogs_df = spark.createDataFrame(data, schema)
    # Show the DataFrame; it should reflect our table above
    blogs_df.show()
    # Print the schema used by Spark to process the DataFrame
    # (printSchema() prints directly and returns None, so no print() wrapper)
    blogs_df.printSchema()


+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)

[Here's a link to the documentation on Python](http://spark.apache.org/docs/latest/api/python/)

To start with the basics, let's look at the [columns API](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#column-apis).

If you run into a data type issue,
```Column.cast(dataType)``` converts the column to type dataType.
...
Keep exploring.   
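
As a small sketch of `cast` in use, the helper below converts several columns at once. `cast_columns` is a hypothetical helper written for this note; it relies only on `DataFrame.withColumn` and `Column.cast`.

```python
def cast_columns(df, casts):
    """Return df with each column in `casts` converted to the given type.

    `casts` maps column name -> target type string, e.g. {"Hits": "long"}.
    """
    for name, dtype in casts.items():
        # withColumn with an existing name replaces that column
        df = df.withColumn(name, df[name].cast(dtype))
    return df


# Usage, assuming the blogs_df DataFrame created earlier:
# blogs_long = cast_columns(blogs_df, {"Hits": "long"})
# blogs_long.printSchema()
```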

On Rows, here are examples of how to make quick rows:

In [6]:
from pyspark.sql import Row
blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015", ["twitter", "LinkedIn"])
# access using index for individual items
blog_row[1]

'Reynold'

In [7]:
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



To read from a file: ```DataFrameReader```   
To save to a file: ```DataFrameWriter```

## Hands on a Dataset
Quote:
To get started, let’s read a large CSV file containing data on San Francisco Fire
Department calls. As noted previously, we will define a schema for this file and use
the DataFrameReader class and its methods to tell Spark what to do. Because this file
contains 28 columns and over 4,380,660 records, it’s more efficient to define a
schema than have Spark infer it.

First, let's define the schema for the dataset.

In [11]:
from pyspark.sql.types import *
# Programmatic way to define a schema
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
            StructField('UnitID', StringType(), True),
            StructField('IncidentNumber', IntegerType(), True),
            StructField('CallType', StringType(), True),
            StructField('CallDate', StringType(), True),
            StructField('WatchDate', StringType(), True),
            StructField('CallFinalDisposition', StringType(), True),
            StructField('AvailableDtTm', StringType(), True),
            StructField('Address', StringType(), True),
            StructField('City', StringType(), True),
            StructField('Zipcode', IntegerType(), True),
            StructField('Battalion', StringType(), True),
            StructField('StationArea', StringType(), True),
            StructField('Box', StringType(), True),
            StructField('OriginalPriority', StringType(), True),
            StructField('Priority', StringType(), True),
            StructField('FinalPriority', IntegerType(), True),
            StructField('ALSUnit', BooleanType(), True),
            StructField('CallTypeGroup', StringType(), True),
            StructField('NumAlarms', IntegerType(), True),
            StructField('UnitType', StringType(), True),
            StructField('UnitSequenceInCallDispatch', IntegerType(), True),
            StructField('FirePreventionDistrict', StringType(), True),
            StructField('SupervisorDistrict', StringType(), True),
            StructField('Neighborhood', StringType(), True),
            StructField('Location', StringType(), True),
            StructField('RowID', StringType(), True),
            StructField('Delay', FloatType(), True)])
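
The same kind of schema can also be written as a DDL string, like the blogs example earlier. The sketch below assembles one from (name, type) pairs; `to_ddl` and `fire_columns` are hypothetical names, and only a handful of the 28 columns are listed to keep it short.

```python
# A few columns only; names and types mirror the StructType above.
fire_columns = [
    ("CallNumber", "INT"),
    ("UnitID", "STRING"),
    ("IncidentNumber", "INT"),
    ("CallType", "STRING"),
    ("ALSUnit", "BOOLEAN"),
    ("Delay", "FLOAT"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string (hypothetical helper)."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


fire_ddl = to_ddl(fire_columns)
print(fire_ddl)
```

A real schema for this file would need all 28 columns in file order. `spark.read.csv` accepts either a `StructType` or a DDL string for its `schema` argument, so the finished string could be passed in place of `fire_schema`.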


Now, let's read the dataset file. ```dfFire``` is read from the file path below, using the schema defined above.

In [16]:
sf_fire_file = "../repo/chapter3/data/sf-fire-calls.csv"
dfFire = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

To save the DataFrame, for example to Parquet, which stores the schema in its metadata, it's as simple as:   
```python
parquet_path = ...
dfFire.write.format("parquet").save(parquet_path)
```
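
Because the schema travels with the Parquet files, reading them back needs no schema argument. A minimal sketch, assuming an active `spark` session and a writable path; `save_and_reload_parquet` is a hypothetical helper name.

```python
def save_and_reload_parquet(spark, df, path):
    """Write df as Parquet, then read it back without supplying a schema."""
    # mode("overwrite") replaces any existing output at `path`
    df.write.format("parquet").mode("overwrite").save(path)
    # No schema argument here: Parquet files carry their own schema
    return spark.read.format("parquet").load(path)


# Usage, assuming the dfFire DataFrame loaded above:
# reloaded = save_and_reload_parquet(spark, dfFire, parquet_path)
# reloaded.printSchema()
```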

Examples of some filters and projections:
In relational parlance, a projection returns only the chosen columns, while a filter
returns only the rows matching a certain condition. In Spark, projections are
done with the select() method, while filters can be expressed using the filter() or
where() method. We can use these techniques to examine specific aspects of our SF Fire
Department data set:

In [21]:
# In Python
from pyspark.sql.functions import col

dfFewFires = (dfFire
        .select("IncidentNumber", "AvailableDtTm", "CallType")
        .where(col("CallType") != "Medical Incident"))
dfFewFires.show(5, truncate=False)


+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



```countDistinct```: Returns a new Column for distinct count of col or cols.   
What if we want to know how many distinct CallTypes were recorded as the causes
of the fire calls? These simple and expressive queries do the job:

In [24]:
from pyspark.sql.functions import countDistinct
(dfFire
    .select("CallType")
    .where(col("CallType").isNotNull())
    .agg(countDistinct("CallType").alias("DistinctCallTypes"))
    .show()
)




+-----------------+
|DistinctCallTypes|
+-----------------+
|               30|
+-----------------+



                                                                                

We can list the distinct call types in the data set using this query, which keeps only the distinct non-null CallTypes from all the rows:

In [26]:
(dfFire
    .select("CallType")
    .where(col("CallType").isNotNull())
    .distinct()
    .show(10, False)
)

                                                                                

+-----------------------------------+
|CallType                           |
+-----------------------------------+
|Elevator / Escalator Rescue        |
|Marine Fire                        |
|Aircraft Emergency                 |
|Confined Space / Structure Collapse|
|Administrative                     |
|Alarms                             |
|Odor (Strange / Unknown)           |
|Citizen Assist / Service Call      |
|HazMat                             |
|Watercraft in Distress             |
+-----------------------------------+
only showing top 10 rows



Renaming columns in PySpark is also easy.  
An open question: can I pass a dictionary of renames and have Spark apply them all?

In [27]:
# In Python
dfNewFire = dfFire.withColumnRenamed("Delay", "ResponseDelayedinMins")
(dfNewFire
    .select("ResponseDelayedinMins")
    .where(col("ResponseDelayedinMins") > 5)
    .show(5, False)
)


+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows
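
On the dictionary question above: `withColumnRenamed` takes one old/new pair per call, so a dictionary has to be applied pair by pair. A minimal sketch, where `rename_columns` is a hypothetical helper and the new names in the usage comment are only examples:

```python
from functools import reduce


def rename_columns(df, renames):
    """Apply every old -> new pair in `renames` via withColumnRenamed."""
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        renames.items(),
        df,
    )


# Usage, assuming the dfFire DataFrame loaded above:
# renamed = rename_columns(dfFire, {"Delay": "ResponseDelayedinMins",
#                                    "CallType": "TypeOfCall"})
```

As far as I know, PySpark 3.4 and later also offer `DataFrame.withColumnsRenamed`, which accepts the dictionary directly; check the docs for your version before relying on it.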

