<DIV ALIGN=CENTER>

# Introduction to Spark (DataFrames)
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

Previously in this course, we have discussed doing data science at the
Unix command line, and with Python, primarily by using Pandas. We also
have discussed other Python libraries that bring new functionalities to
the Python data science stack. Other _big data_ technologies, however,
also exist and can be relevant to particular data science
investigations, depending on the scale of data. Of these other
technologies, one of the most promising is [**Spark**][sp].

Spark is a cluster computing system that leverages [Hadoop][sh]
technologies like [HDFS][shdfs] for high performance storage and
[Yarn][sy] for cluster management. While some may see Spark as a
replacement for Hadoop, an alternative argument can be made that [Spark
is simply another compute engine][sce] for Hadoop, in addition to
Map-Reduce.

In this IPython Notebook, we explore using Spark to perform data
processing in a similar manner to our previous efforts with Pandas. For
this we will use the airline data, which has been stored in an HDFS
system that is accessible from within our Spark cluster. [Other][dw]
tutorials exist, although they often focus on Scala examples since Spark
is written in that language.

-----
[sp]: http://spark.apache.org
[sh]: http://hadoop.apache.org
[sy]: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[shdfs]: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[sce]: http://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/
[dw]: https://github.com/deanwampler/spark-workshop/tree/master/tutorial

### Initialization

In this class, we do not use a dedicated Spark cluster, and instead run
our Spark applications from within our
JupyterHub Server environment. However, we still emphasize resource
management. In particular, we demonstrate how to ensure that any
SparkContext previously used by this Jupyter Server is properly released
before starting a new one. After this, we initialize a new
SparkContext so that this dockerized IPython Notebook can interact
properly with Spark.
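
The release-then-recreate pattern described above can be sketched without Spark. The helper below is only an illustration and is not part of the PySpark API: `stop_if_defined` and `FakeContext` are hypothetical names, and `FakeContext` merely stands in for any object with a `stop()` method, such as a `SparkContext`.

```python
def stop_if_defined(namespace, name="sc"):
    """Stop the object bound to `name` in `namespace`, if there is one.

    Returns True when something was stopped, False otherwise.
    """
    context = namespace.get(name)
    if context is None:
        return False
    context.stop()
    return True


class FakeContext:
    """Hypothetical stand-in for a SparkContext; it only records stop() calls."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


# Nothing is defined yet, so there is nothing to release.
first_try = stop_if_defined({})

# Once a context exists, it is stopped before a new one would be created.
session = {"sc": FakeContext()}
second_try = stop_if_defined(session)
print(first_try, second_try, session["sc"].stopped)
```

In a notebook, passing `globals()` as the namespace would check for a previously created `sc` before a new context is built.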

----- 

-----

### Using Spark

Spark is a framework for processing large-data tasks; in general, this
means petabytes (or more) of data. Spark can run on the HDFS file
system, which can be set up to chunk files into blocks and to replicate
these blocks across a cluster's storage to promote increased
performance. Spark abstracts these details, however, allowing us to
develop an application on a small system and scale up to large data on a
cluster. 
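
To make the chunking and replication concrete, the following arithmetic sketch estimates how many blocks a file occupies. It assumes the common Hadoop defaults of 128 MB blocks and a replication factor of 3; both are configurable, so the values on any given cluster may differ. The function name `hdfs_block_usage` is made up for this illustration.

```python
import math

MIB = 1024 ** 2  # bytes in one mebibyte


def hdfs_block_usage(file_bytes, block_bytes=128 * MIB, replication=3):
    """Return (blocks, stored_replicas) for a single file.

    The default block size and replication factor are assumptions,
    not values read from a cluster.
    """
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, blocks * replication


# A 1 GiB file splits evenly into blocks.
blocks, replicas = hdfs_block_usage(1024 * MIB)

# A file slightly larger than one block needs a second, mostly empty block.
small_blocks, small_replicas = hdfs_block_usage(130 * MIB)
print(blocks, replicas, small_blocks, small_replicas)
```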

In Spark, communications move between a driver process and the executor
processes. This communication is handled for us by a
[`SparkContext`][sc], which requests resources, such as the number of
cores, from the Spark master process; these resources are reserved to
complete our Spark tasks. In the previous code cell, we initialized our
`SparkContext`. Once a Spark Context is active, we can use the Spark
Console to monitor jobs and the overall Spark infrastructure. The
Jupyter Server currently sets an HTTP header (`X-Frame-Options`) that
prevents us from easily displaying this console within this Notebook.
However, if you open a new web browser to the IP address of this
Notebook and use `4040` as the port number, you should be able to view
and interact with the console, as shown in the following screenshot:

![Spark Console](images/spark-console.png)

-----

The basic data structure in Spark is a [Resilient Distributed
Dataset][rdd] (RDD). An RDD is immutable; thus, if you want to add a
column to an RDD, you must create a new copy that includes the new
column. In Spark, data processing tasks can be transformations or
actions, and these tasks can be pipelined for efficiency. Each
transformation creates a new RDD, but since Spark uses lazy evaluation,
the transformations are not executed until an action is invoked.
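
The difference between a lazy transformation and an action can be illustrated without a cluster. The following is a plain-Python analogy built on the lazy `map` and `filter` built-ins; it is not Spark code, and the names `calls`, `square`, `squared`, and `evens` are made up for this sketch.

```python
# Plain-Python analogy for lazy evaluation (this is NOT Spark code).
calls = []  # records each time the "transformation" function really runs


def square(x):
    calls.append(x)
    return x * x


data = range(5)

# "Transformations": building the pipeline does no work yet.
squared = map(square, data)
evens = filter(lambda v: v % 2 == 0, squared)
work_before_action = len(calls)

# "Action": materializing the result forces the whole pipeline to run.
result = list(evens)
work_after_action = len(calls)

print(work_before_action, work_after_action, result)
```

No element is squared until `list` asks for the results, which mirrors how Spark defers transformations until an action requests output.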

These concepts are demonstrated in the following code cells, where we
first create a list of integers, which we use to initialize a new RDD.


-----
[sc]: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
[rdd]: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

## Introduction to Spark DataFrames 

In this IPython Notebook, we explore using Spark to perform data
processing in a similar manner to our previous efforts with Pandas. For
this we will use the airline data, which has been stored within a
filesystem that is accessible from within our Spark cluster. We first
initialize our Spark environment, which in this Notebook is slightly
different since we will connect our Spark environment to our Cassandra
database. This requires additional Java libraries to be acquired and
installed into the Spark environment, which will cause the Spark Context
creation to take longer (so be patient).

-----

In [1]:
# Setup pySpark to be able to work with Cassandra
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages TargetHolding:pyspark-cassandra:0.3.5 pyspark-shell'

In [2]:
# Release the previous SparkContext, if one exists.
try:
    sc
except NameError:
    # No SparkContext has been created yet, so there is nothing to stop.
    pass
else:
    sc.stop()

# Now handle initial import statements
from pyspark import SparkConf, SparkContext

# Create new Spark Configuration 
# Also set Cassandra host ip
myconf = SparkConf()
myconf.setMaster('local[*]')
myconf.setAppName("ACCY 571: Professor Brunner")
myconf.set('spark.executor.memory', '1g')
myconf.set('spark.cassandra.connection.host', '192.168.100.24')

# Create and initialize a new Spark Context
sc = SparkContext(conf=myconf)

# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))


Spark version: 2.0.1


-----

### Data Processing

In this Notebook, we will need sample data. To simplify acquiring data
to demonstrate using Spark DataFrames, we include the RDD code from the
[Introduction to Spark](intro2spark.ipynb) Notebook in the following
cell.

-----

In [3]:
filename = '/home/data_scientist/data/2001/2001-1.csv'

text_file = sc.textFile(filename)

col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

cols = col_data.filter(lambda line: 'NA' not in line)

fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))

# Should be 480106 if everything works correctly
print('Number of entries in fields RDD = {0}'.format(fields.count()))

Number of entries in fields RDD = 480106
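
The chained lambdas above can also be written as one named function, which is easier to check without a cluster. This is a minimal sketch that assumes the same column positions as the cell above; `parse_flight` and `make_line` are hypothetical helpers, and the test lines use `x` as filler for the columns that are not kept.

```python
# Column positions copied from the RDD cell above.
KEEP = (0, 1, 2, 4, 14, 15, 16, 17, 18)
STRING_POSITIONS = {6, 7}  # Origin and Destination stay as strings


def parse_flight(line):
    """Return a typed tuple for one CSV line, or None for header or NA rows."""
    parts = line.split(",")
    picked = tuple(parts[i] for i in KEEP)
    if "Year" in picked or "NA" in picked:
        return None
    return tuple(value if i in STRING_POSITIONS else int(value)
                 for i, value in enumerate(picked))


def make_line(values):
    """Build a 19-column CSV line with filler in the unused positions."""
    parts = ["x"] * 19
    for position, value in zip(KEEP, values):
        parts[position] = value
    return ",".join(parts)


# Values taken from the first row displayed by df.show(5) in this Notebook.
good = make_line(["2001", "1", "17", "1806", "-3", "-4", "BWI", "CLT", "361"])
header = make_line(["Year", "Month", "DayofMonth", "DepTime", "ArrDelay",
                    "DepDelay", "Origin", "Dest", "Distance"])
missing = make_line(["2001", "1", "17", "NA", "-3", "-4", "BWI", "CLT", "361"])

parsed = [parse_flight(line) for line in (header, good, missing)]
print(parsed)
```

With a live `SparkContext`, the same function could be applied as `sc.textFile(filename).map(parse_flight).filter(lambda r: r is not None)`.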


-----

## Spark DataFrame

Spark supports a simplified [DataFrame][spdf] as part of the [Spark
SQL][spsql] library. We can create a DataFrame from an existing RDD by
also specifying the column labels and data types. The data types must
be one of the pre-defined [Spark SQL types][spdt]. After creating the
new DataFrame (which is backed by an RDD), we can perform many of the
same tasks with Spark that we performed with Pandas (but not all, and
not always as simply). The following code cells show how we
can take our 2001 flight data RDD and create a new DataFrame, which we
use in several subsequent code cells.

-----
[spdf]: https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
[spsql]: https://spark.apache.org/sql/
[spdt]: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types

In [4]:
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

schemaString = "Year Month DayOfMonth DepTime ArrDelay DepDelay Origin Destination Distance"

# One Spark SQL type per column, in the same order as schemaString
fieldTypes = [IntegerType(), IntegerType(), IntegerType(),
              IntegerType(), IntegerType(), IntegerType(),
              StringType(), StringType(), IntegerType()]

# Pair each column name with its type; True marks the column as nullable
f_data = [StructField(field_name, field_type, True)
          for field_name, field_type in zip(schemaString.split(), fieldTypes)]

schema = StructType(f_data)

In [5]:
df = sqlContext.createDataFrame(fields, schema)
print(df)

DataFrame[Year: int, Month: int, DayOfMonth: int, DepTime: int, ArrDelay: int, DepDelay: int, Origin: string, Destination: string, Distance: int]


-----

In the following three code cells, we `show` the first few lines of the
DataFrame, then use the `head` method, which returns a list of `Row`
objects that label each value with its column name, and finally use the
`describe` method, which
doesn't execute until the `show` action is invoked. While the output is
less visually attractive than the Pandas result, we still obtain the
necessary information.

After these code cells, we access the DataFrame schema, first by using
the `printSchema` method to nicely output the schema, and next by
accessing a column directly, which we can now do since we have named our
DataFrame columns.

-----

In [6]:
df.show(5)

+----+-----+----------+-------+--------+--------+------+-----------+--------+
|Year|Month|DayOfMonth|DepTime|ArrDelay|DepDelay|Origin|Destination|Distance|
+----+-----+----------+-------+--------+--------+------+-----------+--------+
|2001|    1|        17|   1806|      -3|      -4|   BWI|        CLT|     361|
|2001|    1|        18|   1805|       4|      -5|   BWI|        CLT|     361|
|2001|    1|        19|   1821|      23|      11|   BWI|        CLT|     361|
|2001|    1|        20|   1807|      10|      -3|   BWI|        CLT|     361|
|2001|    1|        21|   1810|      20|       0|   BWI|        CLT|     361|
+----+-----+----------+-------+--------+--------+------+-----------+--------+
only showing top 5 rows



In [7]:
df.head(4)

[Row(Year=2001, Month=1, DayOfMonth=17, DepTime=1806, ArrDelay=-3, DepDelay=-4, Origin='BWI', Destination='CLT', Distance=361),
 Row(Year=2001, Month=1, DayOfMonth=18, DepTime=1805, ArrDelay=4, DepDelay=-5, Origin='BWI', Destination='CLT', Distance=361),
 Row(Year=2001, Month=1, DayOfMonth=19, DepTime=1821, ArrDelay=23, DepDelay=11, Origin='BWI', Destination='CLT', Distance=361),
 Row(Year=2001, Month=1, DayOfMonth=20, DepTime=1807, ArrDelay=10, DepDelay=-3, Origin='BWI', Destination='CLT', Distance=361)]

In [8]:
df.describe().show()

+-------+--------------------+------+-----------------+-----------------+-----------------+------------------+-----------------+
|summary|                Year| Month|       DayOfMonth|          DepTime|         ArrDelay|          DepDelay|         Distance|
+-------+--------------------+------+-----------------+-----------------+-----------------+------------------+-----------------+
|  count|              480106|480106|           480106|           480106|           480106|            480106|           480106|
|   mean|              2001.0|   1.0|16.01370530674476|1359.660206287778|6.382288494624103| 8.781523246949632|716.9933556339641|
| stddev|1.136732532560936...|   0.0|8.936964382456553|487.2369594358406|31.04865060768924|27.966300686761794|568.6557196351681|
|    min|                2001|     1|                1|                1|              -80|               -59|               21|
|    max|                2001|     1|               31|             2400|             1688|      
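
As a point of reference for the statistics that `describe` reports, the same five summary values can be computed in plain Python. This sketch uses only the `ArrDelay` values from the five rows printed by `df.show(5)` above, so the numbers differ from the full-table summary. It assumes that the `stddev` row is a sample standard deviation, which is what `statistics.stdev` computes.

```python
import statistics

# ArrDelay values from the five rows printed by df.show(5) above.
arr_delay = [-3, 4, 23, 10, 20]

summary = {
    "count": len(arr_delay),
    "mean": statistics.mean(arr_delay),
    "stddev": statistics.stdev(arr_delay),  # sample standard deviation
    "min": min(arr_delay),
    "max": max(arr_delay),
}

for name, value in summary.items():
    print(name, value)
```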

In [9]:
df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayOfMonth: integer (nullable = true)
 |-- DepTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Destination: string (nullable = true)
 |-- Distance: integer (nullable = true)



In [10]:
df.Year

Column<b'Year'>

-----

We can extract data from the DataFrame by using similar techniques to
what we used with Pandas. One difference is that we need to `filter` the
DataFrame, as opposed to directly accessing rows. In the following
cells, we first filter rows to extract flights that left O'Hare, and
second those flights that left O'Hare more than two hours late. In the
second case, we also transform the output to `select` the _Destination_
column and a new column that is the _Distance_ in kilometers
(approximated as the distance in miles multiplied by 1.6).

-----

In [11]:
df.filter(df['Origin'] == 'ORD').count()

27455

In [12]:
df.filter(df['Origin'] == 'ORD').filter(df['DepDelay'] > 120).select(df['Destination'], df['Distance'] * 1.6).show(10)

+-----------+-----------------+
|Destination| (Distance * 1.6)|
+-----------+-----------------+
|        PHL|           1084.8|
|        CLT|958.4000000000001|
|        MEM|            785.6|
|        MEM|            785.6|
|        MEM|            785.6|
|        STL|            412.8|
|        STL|            412.8|
|        PVD|           1358.4|
|        LAX|           2792.0|
|        LAX|           2792.0|
+-----------+-----------------+
only showing top 10 rows



-----

## Spark SQL

Given a Spark DataFrame, we can apply SQL statements directly against
the DataFrame by registering the DataFrame as a Spark temporary SQL
table. The following code cell demonstrates this, as we register our
DataFrame as a `flights` table, and execute a SQL statement to select
the same rows we obtained from our previous DataFrame filter. Since the
data are unordered, the rows displayed via the `show` method are not
guaranteed to match the earlier output. Note also that this query
returns the _Distance_ in miles, without the conversion to kilometers.

-----

In [13]:
df = sqlContext.createDataFrame(fields, schema)

df.registerTempTable("flights")

# SQL can be run over DataFrames that have been registered as a table.
sql_q = "SELECT Destination, Distance FROM flights WHERE Origin = 'ORD' AND DepDelay > 120"

results = sqlContext.sql(sql_q)

# The result of a SQL query is a DataFrame and supports the normal DataFrame operations.
results.show(10)

+-----------+--------+
|Destination|Distance|
+-----------+--------+
|        PHL|     678|
|        CLT|     599|
|        MEM|     491|
|        MEM|     491|
|        MEM|     491|
|        STL|     258|
|        STL|     258|
|        PVD|     849|
|        LAX|    1745|
|        LAX|    1745|
+-----------+--------+
only showing top 10 rows
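
The kilometer conversion and an explicit row order can both be expressed in the SQL statement itself. The sketch below uses the standard-library `sqlite3` module as a stand-in for Spark SQL, so it runs without a cluster; SQL dialects differ, so this is an illustration rather than the query used in this Notebook. The table is a small hand-made sample: the distances are borrowed from the output above, while the `DepDelay` values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (Origin TEXT, Destination TEXT, "
             "Distance INTEGER, DepDelay INTEGER)")

# Hand-made sample rows; the DepDelay values are invented for illustration.
rows = [
    ("ORD", "PHL", 678, 130),
    ("ORD", "CLT", 599, 150),
    ("ORD", "STL", 258, 45),    # not late enough, so filtered out
    ("BWI", "CLT", 361, 200),   # wrong origin, so filtered out
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)

# Same WHERE clause as above, plus the conversion and a deterministic order.
query = """
SELECT Destination, Distance * 1.6 AS DistanceKm
FROM flights
WHERE Origin = 'ORD' AND DepDelay > 120
ORDER BY Distance DESC
"""
late = conn.execute(query).fetchall()
conn.close()
print(late)
```

Adding `ORDER BY` removes the dependence on how the underlying data happen to be stored.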



-----

### Cassandra Query

We now connect to a remote database from a Spark application. In this
case, we will use our existing Cassandra instance running on Microsoft
Azure. We have already configured the Spark context to acquire and
install the spark-cassandra connector in the first code cell in this
Notebook, and we specified the host IP address for our Cassandra database
as part of the Spark Context configuration parameters in the second code
cell. Our next step is to establish a connection to a Cassandra keyspace
and read data from a table into a Spark DataFrame.

This last step is performed in the following cell. We first tell Spark
to use the spark-cassandra driver, which will run in the Spark JVM, to
connect to the database. Next, we load the `airlines` table from the
`bigdog` keyspace in our Cassandra database. This table was created by
using the following CQL statements, shown here as Python strings:

```cql
drop_schema = '''
DROP TABLE IF EXISTS Airlines ;
'''

create_schema = '''
CREATE TABLE Airlines (
    Year int,
    Month int,
    DayOfMonth int,
    DepTime int,
    ArrDelay int,
    DepDelay int,
    Origin text,
    Destination text,
    Distance int,
    PRIMARY KEY(Month, DayOfMonth, DepTime, Origin)
);
'''
```

The 2001 flight data we have analyzed previously in this Notebook has
already been loaded into this table by using the following Python code:

```python
df.write.format("org.apache.spark.sql.cassandra").\
    options(table='airlines', keyspace='bigdog').save(mode="overwrite")
```

One change from the Spark DataFrame used earlier in this Notebook is
the column names. Since CQL is case insensitive, while Spark is case
sensitive, `df` was created with the following lowercase column names
before this table was written:

```python
schemaString = "year month dayofmonth deptime arrdelay depdelay origin destination distance"
```
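
The lowercase names do not need to be retyped by hand. A small sketch that derives them from the `schemaString` defined earlier in this Notebook; `lower_names` and `lower_schema_string` are names introduced here for illustration.

```python
schemaString = "Year Month DayOfMonth DepTime ArrDelay DepDelay Origin Destination Distance"

# Derive lowercase column names that match the Cassandra table definition.
lower_names = schemaString.lower().split()
lower_schema_string = " ".join(lower_names)
print(lower_schema_string)
```

With a live DataFrame, `df.toDF(*lower_names)` would return a copy that carries the new column names.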

After we load the data into the new `flights` DataFrame, we display the
first few rows and the column data types.

-----

In [14]:
flights = sqlContext.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="bigdog", table="airlines")

Py4JJavaError: An error occurred while calling o97.load.
: java.lang.ClassNotFoundException: org.apache.spark.Logging was removed in Spark 2.0. Please check if your library is compatible with Spark 2.0
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:159)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/Logging
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:132)
	... 16 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 34 more


In [None]:
flights.show(5)

In [None]:
flights.dtypes

-----

Given this Spark DataFrame, we can now perform subsequent operations
similar to those demonstrated in the [Introduction to
Spark](intro2spark.ipynb) Notebook. Below, we apply several filters to
`flights` to generate a subset of the full data. In this case, we select
long flights from the Baltimore Washington International airport.
Because `sqlContext.read` returns a DataFrame, the Spark DataFrame
techniques presented earlier in this Notebook apply directly.

-----

In [None]:
bwi = flights.filter(flights.origin == 'BWI').filter(flights.distance > 1500)

In [None]:
bwi.show(5)

-----
### Student Activity

In the preceding cells, we introduced Spark DataFrames and Spark SQL.
Now that you have run the Notebook, go back and make the following
changes to see how the results change.

1. Change the DataFrame to include different columns from the flights
data. You might review the original [airline data
set](http://stat-computing.org/dataexpo/2009/) website to see the column
descriptions.

2. Use a SQL query on the `df` DataFrame to compute the mean distance
between all flights from O'Hare to Los Angeles International Airport
(LAX).

3. Add an index column to this Spark DataFrame, which sequentially
increases.

Additional, more advanced problems:

1. Turn the Cassandra SQL RDD obtained previously in this Notebook into
a Spark DataFrame and output the results of the `describe` function on
all numeric columns.

2. Turn this Spark DataFrame into a Pandas DataFrame and make a
regression plot of the arrival delay versus the departure delay by using
Seaborn.

-----

### Ending the Spark Session

We must stop the `SparkContext` in order to release resources on the
instructional cluster before exiting this Notebook.

-----

In [None]:
sc.stop()