# Spark
<img src="img/arch.png" width="50%"/>

Create the Spark context and a `SQLContext`, which is used to interact with the Spark SQL API.
The context appears under the `Applications` section of the [Spark Master UI](http://localhost:8080/).

The Spark [Jobs UI](http://127.0.0.1:4040/jobs/) is served by the driver, which runs on the client computer (the Jupyter container).

In [1]:
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('spark://spark:7077')
sqlContext = SQLContext(sc)

Create a pandas DataFrame with 4 rows. More on DataFrames later.

In [3]:
import pandas as pd
pdf = pd.DataFrame({'x': list(range(0, 4))})

Use a list comprehension to loop over the pandas DataFrame. The loop executes sequentially, so 4 rows at 30 seconds each take roughly 120 seconds.

In [4]:
import time

sleepTime = 30

start_time = time.time()
[time.sleep(sleepTime) for x in pdf['x']]
(time.time() - start_time)

120.03146696090698

1. Convert the pandas DataFrame to a Spark DataFrame.
2. Declare a sleep function that will run on the executors.
3. Tell Spark to group the data (`groupby`), then call a UDF (`sleep`) on each group with `applyInPandas`. `count()` is an action, so it prompts Spark to begin execution.

Execution time is shorter because the work is distributed across workers, so the groups sleep concurrently instead of one after another.

In [14]:
import pandas as pd
import time
from pyspark.sql.functions import pandas_udf, PandasUDFType

sleepTime = 30

df = sqlContext.createDataFrame(pdf)

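def sleep(data):
    # Sleep once per row in this group, then return the group unchanged.
    for _ in data['x']:
        time.sleep(sleepTime)
    return data

start_time = time.time()
# count() is an action, so it triggers execution of the grouped apply.
df.groupby('x').applyInPandas(sleep, df.schema).count()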
(time.time() - start_time)

43.67003536224365
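
The difference between the two timings can be reproduced without a cluster. The sketch below is only a local analogy, not Spark: it groups four rows by `x` in plain Python, then applies a stand-in `sleep` to each group, first one group at a time and then with one thread per group. The helper names (`group_by`, `timed`, `run_concurrent`) and the 0.2-second sleep are assumptions made for a quick demo.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

SLEEP_SECONDS = 0.2  # assumption: scaled down from the notebook's 30 s
rows = [{'x': x} for x in range(4)]  # same four values as pdf['x']

def sleep(group):
    """Stand-in for the notebook's sleep(): one sleep per row, rows returned unchanged."""
    for _ in group:
        time.sleep(SLEEP_SECONDS)
    return group

def group_by(rows, key):
    """Split rows into lists that share the same value for key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return list(groups.values())

def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_concurrent(groups, workers):
    """Apply sleep() to every group using a pool of worker threads, then count rows."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(len(out) for out in pool.map(sleep, groups))

groups = group_by(rows, 'x')

# Sequential: one group after another, like the list comprehension over pdf['x'].
seq_count, seq_elapsed = timed(lambda: sum(len(sleep(g)) for g in groups))

# Concurrent: the four groups sleep at the same time.
par_count, par_elapsed = timed(lambda: run_concurrent(groups, workers=4))

print(f"sequential: {seq_elapsed:.2f}s, concurrent: {par_elapsed:.2f}s")
```

With `workers=2` the four groups run in two waves and take about twice as long as with `workers=4`. A cluster with fewer free cores than groups may behave similarly, which is one possible reason (alongside scheduling and serialization overhead) for a measured time above the 30-second ideal.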

Intentionally raise an exception. The [Stages](http://127.0.0.1:4040/stages) section of the Jobs UI can be used to track down errors.

In [None]:
import pandas as pd
import time
from pyspark.sql.functions import pandas_udf, PandasUDFType

sleepTime = 30

def sleep(data):
    # Fail on purpose so the error shows up in the Stages view.
    raise Exception('Random UDF error')

start_time = time.time()
df.groupby('x').applyInPandas(sleep, df.schema).count()
(time.time() - start_time)
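
When several groups run the same function, it helps if the error says which group failed. The sketch below wraps a per-group function so that any exception is re-raised with the group's key values attached. `with_group_context` is a hypothetical helper, not part of PySpark, and the local check uses a plain dict as a stand-in for one group's pandas DataFrame.

```python
import functools

def with_group_context(func, key_column):
    """Hypothetical helper: re-raise errors from func with the group's key values attached."""
    @functools.wraps(func)
    def wrapper(data):
        try:
            return func(data)
        except Exception as exc:
            # Collect the distinct key values in this group, in first-seen order.
            keys = ", ".join(str(k) for k in dict.fromkeys(data[key_column]))
            raise RuntimeError(
                f"{func.__name__} failed for {key_column} in [{keys}]"
            ) from exc
    return wrapper

def sleep(data):
    # Same intentional failure as in the cell above.
    raise Exception('Random UDF error')

# Local check: a dict stands in for one group's pandas DataFrame (assumption).
try:
    with_group_context(sleep, 'x')({'x': [2]})
except RuntimeError as err:
    message = str(err)
    cause = str(err.__cause__)
    print(message, '<-', cause)
```

In the notebook, the wrapped function would be passed in place of `sleep`, for example `applyInPandas(with_group_context(sleep, 'x'), df.schema)`, assuming it keeps the one-argument signature. The extra message should then appear next to the original traceback in the failed stage's details.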

In [None]:
sc.stop()