# Introduction to PySpark

In [1]:
import findspark
findspark.init()

## Getting to know PySpark

### What is Spark, anyway?

- Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple *nodes* (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

### Using Spark in Python

- The first step in using Spark is connecting to a cluster.
- Creating the connection is as simple as creating an instance of the `SparkContext` class.

### Examining the SparkContext

In [2]:
# make the connection
from pyspark import SparkContext

sc = SparkContext(appName = 'DataCampTutorial')

# verify sparkcontext
print(sc)

# print spark version
print(sc.version)

<SparkContext master=local[*] appName=DataCampTutorial>
2.4.3


## Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD). Spark DataFrames are a higher-level abstraction built on top of RDDs and are easier to work with.

To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`.

You can think of `SparkContext` as your **connection** to the cluster and
`SparkSession` as your **interface** with that connection.

### Creating a SparkSession

In [3]:
# make interface to work with
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# check if interface was loaded correctly
print(spark)

<pyspark.sql.session.SparkSession object at 0x7f433b732c50>


In [4]:
file_path = "../data/flights.csv"

# read data
flights = spark.read.csv(file_path, header=True)

#show data
flights.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

In [5]:
# make table in SparkSession interface
flights.createOrReplaceTempView("flights")

### Viewing tables

In [6]:
# check table was created
print(spark.catalog.listTables())

[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


### Are you query-ious?

Now that we have created a table in our `SparkSession`, we can start running queries against it.

In [7]:
query = "SELECT * FROM flights LIMIT 10"

# get the first 10 rows of flights
flights10 = spark.sql(query)

flights10.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

### Pandafy a Spark DataFrame

Sometimes it makes sense to take the data you just queried and work with it locally using `pandas`. Spark makes this easy with the `.toPandas()` method.

In [8]:
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"

# run query
flight_counts = spark.sql(query)

# convert the results to a pandas DataFrame
pd_counts = flight_counts.toPandas()

# print head of pd_counts
print(pd_counts.head())

  origin dest    N
0    SEA  RNO    8
1    SEA  DTW   98
2    SEA  CLE    2
3    SEA  LAX  450
4    PDX  SEA  144


## Manipulating data

### Creating columns

The `.withColumn()` method allows you to perform column-wise operations on a Spark `DataFrame`.

In [9]:
# create a DataFrame called flights_df
flights_df = spark.table("flights") #note "flights" is a table in the SparkSession

# show the first five rows of flights_df
# (.show() prints the table itself and returns None, so no print() is needed)
flights_df.show(5)

# add duration_hrs column
# *note column air_time contains duration of flight in minutes
flights_df = flights_df.withColumn("duration_hrs", flights_df.air_time / 60)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
+----+-----+---+--------+---------+-----

### Filtering Data

The `.filter()` method is similar to SQL's `WHERE` clause. It takes either a boolean column expression or a string containing the condition you would write after `WHERE` in a SQL query.

For example, the following two expressions will produce the same output:

> flights.filter(flights.air_time > 120).show()
<br>flights.filter("air_time > 120").show()

In [13]:
# filter flights with SQL string
long_flights1 = flights_df.filter("distance > 1000")

# filter flights with boolean statement
long_flights2 = flights_df.filter(flights_df.distance > 1000)

# .show() prints the rows itself and returns None, so it is not wrapped in print()
long_flights1.show()

long_flights2.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|      duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|               6.0|
|2014|    4| 19|    1236|       -4|    1508|       -7|     AS| N309AS|   490|   SEA| SAN|     135|    1050|  12|    36|              2.25|
|2014|   11| 19|    1812|       -3|    2352|       -4|     AS| N564AS|    26|   SEA| ORD|     198|    1721|  18|    12|               3.3|
|2014|    8|  3|    1120|        0|    1415|        2|     AS| N305AS|   656|   SEA| PHX|     154|    1107|  11|    20| 2.566666666666667|
|2014|   11| 12|    2346|  

### Selecting

The `.select()` method is Spark's variant of SQL's `SELECT` statement. The method takes multiple arguments, one for each column you want to select.

Similar to SQL, you can combine `.select()` with `.alias()` to perform any column operation and return the transformed column under a new name.

For example,
> flights_df.select((flights_df.air_time / 60).alias("duration_hrs"))

An often easier way is to use the `.selectExpr()` method, which takes SQL expressions as strings
> flights_df.selectExpr("air_time/60 as duration_hrs")

In [14]:
# select columns tailnum, origin, and dest
selected1 = flights_df.select("tailnum", "origin", "dest")

# select columns origin, dest, and carrier using df.colName syntax
temp = flights_df.select(flights_df.origin, flights_df.dest, flights_df.carrier)

# define first filter
filterA = flights_df.origin == "SEA"

# define second filter
filterB = flights_df.dest =="PDX"

# filter the data
selected2 = temp.filter(filterA).filter(filterB)

### Selecting II

In [15]:
# define avg_speed
# *note avg_speed is distance column divided by air_time column in hours
avg_speed = (flights_df.distance / (flights_df.air_time / 60)).alias("avg_speed")

# select the correct columns
speed1 = flights_df.select("origin", "dest", "tailnum", avg_speed)

# create the same table using SQL expressions
speed2 = flights_df.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed")

### Aggregating

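Calling the `.groupBy()` method on a DataFrame returns a **GroupedData** object, which provides aggregation methods such as `.min()`, `.max()`, and `.count()`. Calling `.groupBy()` with no arguments treats the whole DataFrame as a single group.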

For example
> df.groupBy().min("col").show()

In [16]:
flights_df.describe()

DataFrame[summary: string, year: string, month: string, day: string, dep_time: string, dep_delay: string, arr_time: string, arr_delay: string, carrier: string, tailnum: string, flight: string, origin: string, dest: string, air_time: string, distance: string, hour: string, minute: string, duration_hrs: string]

In [17]:
flights_df.describe("air_time", "distance").show()

+-------+------------------+-----------------+
|summary|          air_time|         distance|
+-------+------------------+-----------------+
|  count|             10000|            10000|
|   mean|152.88423173803525|        1208.1516|
| stddev|  72.8656286392139|656.8599023464376|
|    min|               100|             1009|
|    max|                NA|              991|
+-------+------------------+-----------------+



In [18]:
# change distance and air_time columns to floats

flights_df = flights_df.withColumn("distance", flights_df.distance.cast("float"))
flights_df = flights_df.withColumn("air_time", flights_df.air_time.cast("float"))

# find the shortest flight from PDX in terms of distance
flights_df.filter(flights_df.origin == "PDX").groupBy().min("distance").show()

# find the longest flight from SEA in terms of duration
flights_df.filter(flights_df.origin == "SEA").groupBy().max("air_time").show()

+-------------+
|min(distance)|
+-------------+
|        106.0|
+-------------+

+-------------+
|max(air_time)|
+-------------+
|        409.0|
+-------------+



### Aggregating II

In [19]:
# find avg duration of Delta flights that left from Seattle
flights_df.filter(flights_df.carrier == "DL").filter(flights_df.origin == "SEA").groupBy().avg("air_time").show()

# find total of air_time in hours
# *hint create new column called duration_hrs
flights_df.withColumn("duration_hrs", flights_df.air_time/60).groupBy().sum("duration_hrs").show()

+------------------+
|     avg(air_time)|
+------------------+
|188.20689655172413|
+------------------+

+------------------+
| sum(duration_hrs)|
+------------------+
|25289.600000000126|
+------------------+



### Grouping and Aggregating

You can pass one or more columns in your DataFrame to the `.groupBy()` method and aggregate the groups as you would with a SQL `GROUP BY` statement.

Spark's **GroupedData** also has a `.agg()` method that lets you use aggregate functions from the `pyspark.sql.functions` submodule. This submodule contains useful functions for computing things like *standard deviation*.

### Grouping and Aggregating I

In [20]:
# group by tailnum
by_plane = flights_df.groupBy("tailnum")

# get number of flights of each plane
by_plane.count().show()

# group by origin
by_origin = flights_df.groupBy("origin")

# get average duration of flights from PDX and SEA
by_origin.avg("air_time").show()

+-------+-----+
|tailnum|count|
+-------+-----+
| N442AS|   38|
| N102UW|    2|
| N36472|    4|
| N38451|    4|
| N73283|    4|
| N513UA|    2|
| N954WN|    5|
| N388DA|    3|
| N567AA|    1|
| N516UA|    2|
| N927DN|    1|
| N8322X|    1|
| N466SW|    1|
|  N6700|    1|
| N607AS|   45|
| N622SW|    4|
| N584AS|   31|
| N914WN|    4|
| N654AW|    2|
| N336NW|    1|
+-------+-----+
only showing top 20 rows

+------+------------------+
|origin|     avg(air_time)|
+------+------------------+
|   SEA| 160.4361496051259|
|   PDX|137.11543248288737|
+------+------------------+



### Grouping and Aggregating II

In [21]:
# change dep_delay column to float
flights_df = flights_df.withColumn("dep_delay", flights_df.dep_delay.cast("float"))

import pyspark.sql.functions as F

# group by month and dest
by_month_dest = flights_df.groupBy("month", "dest")

# get average departure delay by month and destination
by_month_dest.avg("dep_delay").show()

# get standard deviation of departure delays
by_month_dest.agg(F.stddev("dep_delay")).show()

+-----+----+--------------------+
|month|dest|      avg(dep_delay)|
+-----+----+--------------------+
|   11| TUS| -2.3333333333333335|
|   11| ANC|   7.529411764705882|
|    1| BUR|               -1.45|
|    1| PDX| -5.6923076923076925|
|    6| SBA|                -2.5|
|    5| LAX|-0.15789473684210525|
|   10| DTW|                 2.6|
|    6| SIT|                -1.0|
|   10| DFW|  18.176470588235293|
|    3| FAI|                -2.2|
|   10| SEA|                -0.8|
|    2| TUS| -0.6666666666666666|
|   12| OGG|  25.181818181818183|
|    9| DFW|   4.066666666666666|
|    5| EWR|               14.25|
|    3| RDM|                -6.2|
|    8| DCA|                 2.6|
|    7| ATL|   4.675675675675675|
|    4| JFK| 0.07142857142857142|
|   10| SNA| -1.1333333333333333|
+-----+----+--------------------+
only showing top 20 rows

+-----+----+----------------------+
|month|dest|stddev_samp(dep_delay)|
+-----+----+----------------------+
|   11| TUS|    3.0550504633038935|
|   11| ANC|  

## Joining

In PySpark, joins are performed using the DataFrame method `.join()`. This method takes three arguments:
- The first is the second DataFrame that you want to join with the first one.
- The second, `on`, names the key column(s) that both DataFrames share and that the join is performed on.
- The third, `how`, specifies the kind of join to perform.

### Joining II

In [22]:
# load data to spark session
file_path = "../data/airports.csv"
airports = spark.read.csv(file_path, header = True)

# load DataFrame as a table into catalog
airports.createOrReplaceTempView("airports")

# check table was created
spark.catalog.listTables()

# from here on, use flights instead of flights_df (creating flights_df turned out to be redundant)

# examine data
airports.describe()

airports.show(5)

+---+--------------------+----------+-----------+----+---+---+
|faa|                name|       lat|        lon| alt| tz|dst|
+---+--------------------+----------+-----------+----+---+---+
|04G|   Lansdowne Airport|41.1304722|-80.6195833|1044| -5|  A|
|06A|Moton Field Munic...|32.4605722|-85.6800278| 264| -5|  A|
|06C| Schaumburg Regional|41.9893408|-88.1012428| 801| -6|  A|
|06N|     Randall Airport| 41.431912|-74.3915611| 523| -5|  A|
|09J|Jekyll Island Air...|31.0744722|-81.4277778|  11| -4|  A|
+---+--------------------+----------+-----------+----+---+---+
only showing top 5 rows



In [23]:
# rename faa column
airports = airports.withColumnRenamed("faa", "dest")

# join the DataFrames
flights_with_airports = flights.join(airports, on = "dest", how = "leftouter")

flights_with_airports.show(5)

+----+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+--------+--------+----+------+--------------------+---------+-----------+---+---+---+
|dest|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|air_time|distance|hour|minute|                name|      lat|        lon|alt| tz|dst|
+----+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+--------+--------+----+------+--------------------+---------+-----------+---+---+---+
| LAX|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA|     132|     954|   6|    58|    Los Angeles Intl|33.942536|-118.408075|126| -8|  A|
| HNL|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA|     360|    2677|  10|    40|       Honolulu Intl|21.318681|-157.922428| 13|-10|  N|
| SFO|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA|     111|     679|  14|    43|  San 

## Getting started with machine learning pipelines

### Machine Learning Pipelines

At the core of the `pyspark.ml` module are the `Transformer` and `Estimator` classes. Almost every other class in the module behaves similarly to these two basic classes.

`Transformer` classes have a `.transform()` method that takes a DataFrame and returns a new DataFrame; usually the original one with a new column appended. For example, you might use the class `Bucketizer` to create discrete bins from a continuous feature or the class `PCA` to reduce the dimensionality of your dataset using principal component analysis.

`Estimator` classes all implement a `.fit()` method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a `StringIndexerModel` for including categorical data saved as strings in your models, or a random forest model (such as `RandomForestClassificationModel` or `RandomForestRegressionModel`) that uses the random forest algorithm for classification or regression.

### Join the DataFrames

In [37]:
file_path = "../data/planes.csv"
planes = spark.read.csv(file_path, header=True)

# Rename year column
planes = planes.withColumnRenamed("year", "plane_year")

# join the DataFrames
model_data = flights.join(planes, on="tailnum", how="leftouter")

### Data types

It's important to know that Spark's machine learning routines only handle numeric data. That means all of the columns you feed to a model must be either integers or decimals (called 'doubles' in Spark).

When we imported our data, we did not ask Spark to infer the column types (`spark.read.csv` only does this when `inferSchema=True` is set), so every column was read as a string. You can see that some of the columns in our DataFrame are strings containing numbers as opposed to actual numeric values.

To remedy this, you can use the `.cast()` method in combination with the `.withColumn()` method. It's important to note that `.cast()` works on columns, while `.withColumn()` works on DataFrames.

The only argument you need to pass to the `.cast()` method is the type you want to convert to, in string form.

### String to integer

In [42]:
# cast the columns to integers
model_data = model_data.withColumn("arr_delay", model_data.arr_delay.cast("integer"))
model_data = model_data.withColumn("air_time", model_data.air_time.cast("integer"))
model_data = model_data.withColumn("month", model_data.month.cast("integer"))
model_data = model_data.withColumn("plane_year", model_data.plane_year.cast("integer"))

### Create a new column

In [43]:
# create the column plane_age
model_data = model_data.withColumn("plane_age", model_data.year - model_data.plane_year)

### Making a Boolean

In [44]:
# create is_late
model_data = model_data.withColumn("is_late", model_data.arr_delay > 0)

# convert to an integer
model_data = model_data.withColumn("label", model_data.is_late.cast("integer"))

# remove missing values
model_data = model_data.filter("arr_delay is not NULL and dep_delay is not NULL and air_time is not NULL and plane_year is not NULL")

### Strings and factors

The `pyspark.ml.feature` submodule has classes that turn non-numeric data into numeric features for modeling.

Steps for encoding a categorical feature:
- Create a `StringIndexer`
    - Members of this class are `Estimator`s that take a DataFrame with a column of strings and map each unique string to a number.
    - The `Estimator` then returns a `Transformer` that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
- The second step is to one-hot encode this numeric column (the class is called `OneHotEncoderEstimator` in Spark 2.4 and `OneHotEncoder` in Spark 3.0 and later).
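To make the two steps concrete, here is a conceptual sketch in plain Python. It is not Spark's implementation: the carrier list is made up, and the helper `one_hot` is a hypothetical name used only here.

```python
from collections import Counter

# made-up carrier column (illustrative values only)
carriers = ["AS", "VX", "AS", "WN", "AS", "VX"]

# step 1: map each distinct string to an index, most frequent first
# (this mirrors StringIndexer's default ordering; ties are broken alphabetically here)
counts = Counter(carriers)
ordered = sorted(counts, key=lambda c: (-counts[c], c))
index = {carrier: i for i, carrier in enumerate(ordered)}

# step 2: turn each index into a vector with a single 1
def one_hot(i, size):
    """Return a list of `size` zeros with a 1 at position `i`."""
    return [1 if j == i else 0 for j in range(size)]

encoded = [one_hot(index[c], len(index)) for c in carriers]
print(index)
print(encoded[0], encoded[3])
```

Spark's encoder stores the result as sparse vectors and, by default, drops the last category; the sketch keeps every category so the mapping is easier to read.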

### Carrier

In [49]:
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator 

# create a StringIndexer
carr_indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")

# create a OneHotEncoder
carr_encoder = OneHotEncoderEstimator(inputCols=["carrier_index"], outputCols=["carrier_fact"])

### Destination

In [50]:
# create a StringIndexer
dest_indexer = StringIndexer(inputCol="dest", outputCol="dest_index")

# create a OneHotEncoder
dest_encoder = OneHotEncoderEstimator(inputCols=["dest_index"], outputCols=["dest_fact"])

### Assemble a vector 

In [51]:
from pyspark.ml.feature import VectorAssembler

# make a VectorAssembler
vec_assembler = VectorAssembler(inputCols=["month", "air_time", "carrier_fact", "dest_fact",
                                          "plane_age"], outputCol="features")

### Create the pipeline

In [52]:
# import pipeline
from pyspark.ml import Pipeline

# make the pipeline
flights_pipe = Pipeline(stages=[dest_indexer, dest_encoder, carr_indexer, 
                                carr_encoder, vec_assembler])

### Test vs Train

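It's good practice to now split the data into a training set and a test set in order to see your model's performance on unseen data. It's important to make sure you perform all of the transformations **BEFORE** splitting the data, so that both sets are encoded in the same way (a `StringIndexer`, for example, assigns its indices based on the data it is fitted on).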

### Transform the data

In [54]:
# fit and transform the data
piped_data = flights_pipe.fit(model_data).transform(model_data)

### Split the data

In [55]:
# split the data into training and test sets
training, test = piped_data.randomSplit([0.6, 0.4])

## Model tuning and selection

### What is logistic regression?

This model is very similar to a linear regression, but instead of predicting a numeric variable, it predicts the probability (between 0 and 1) of an event.
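A small sketch of the idea in plain Python: a linear combination of the features is passed through the logistic (sigmoid) function, which squeezes it into the range between 0 and 1. The weights, intercept, and feature values are hypothetical numbers, not coefficients fitted on the flights data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class for one row of features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# hypothetical coefficients, for illustration only
weights = [0.8, -0.5]
intercept = 0.1

p = predict_proba([1.0, 2.0], weights, intercept)
print(p)
```

A probability above a chosen cutoff (0.5 by default in Spark's `LogisticRegression`) is then classified as the positive class.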

### Create the modeler

In [56]:
# import LogisticRegression
from pyspark.ml.classification import LogisticRegression

# create a logisticregression estimator
lr = LogisticRegression()

### Cross validation

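Cross validation is not a model but a way of estimating a model's error on unseen data. In k-fold cross validation the training data is split into k partitions (folds). Each fold is held out once while the model is fitted on the remaining folds, and the error on the held-out folds is averaged. That average is a good estimate of the model's error on the test set.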
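The partitioning can be sketched in plain Python. This is a deterministic illustration only; the function name `k_fold_indices` is hypothetical, and Spark's `CrossValidator` assigns rows to folds at random.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; every row is held out exactly once."""
    # deal the row indices into k folds, round-robin
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        valid = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, valid

splits = list(k_fold_indices(10, 3))
for train, valid in splits:
    print("train:", train, "validate:", valid)
```

For each hyperparameter combination, a model would be fitted on every `train` part, scored on the matching `valid` part, and the k scores averaged.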

### Create the evaluator

In [57]:
# import the evaluation submodule
import pyspark.ml.evaluation as evals

# create a BinaryClassificationEvaluator
evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC")

### Make a grid

In [58]:
import numpy as np

# import the tuning submodule
import pyspark.ml.tuning as tune

# create the parameter grid
grid = tune.ParamGridBuilder()

# add the hyperparameter
grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
grid = grid.addGrid(lr.elasticNetParam, [0, 1])

# build the grid
grid = grid.build()

### Make the validator

In [59]:
# create the crossvalidator
cv = tune.CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=evaluator)

### Fit the model(s)

In [60]:
# fit the plain estimator on the training data
# note: this does not run the CrossValidator `cv` defined above
best_lr = lr.fit(training)

# print best_lr
print(best_lr)

LogisticRegressionModel: uid = LogisticRegression_b782cecd3956, numClasses = 2, numFeatures = 81


### Evaluating binary classifiers

*AUC*, the area under the ROC curve, is the metric used here to evaluate binary classification models. The closer *AUC* is to 1, the better the model; a value of 0.5 is what random guessing would achieve.
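One way to build intuition: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A plain-Python sketch of that pairwise definition follows; the labels and scores are made up, and this is not how Spark computes the metric internally.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# made-up labels and scores, for illustration only
labels = [1, 0, 1, 0]
perfect = auc(labels, [0.9, 0.1, 0.8, 0.2])   # every positive outranks every negative
mixed = auc(labels, [0.9, 0.8, 0.3, 0.2])     # one of four pairs is ranked wrongly
constant = auc(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores
print(perfect, mixed, constant)
```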

### Evaluate the model

In [61]:
# use the model to predict the test set
test_results = best_lr.transform(test)

# evaluate the predictions
print(evaluator.evaluate(test_results))

0.6942929666404072


### Conclusion

**The next steps are learning how to create large scale Spark clusters and manage and submit jobs so that you can use models in the real world.**

In [62]:
# close connection
sc.stop()
spark.stop()