![](nb_images/guiones_wave.jpeg "A big day at Playa Guiones.")

Well, you guessed it: it's time for us to learn PySpark!

I know, I know, I can hear you screaming into your pillow. We just spent all that time converting from R and learning python, so why the hell do we need yet another API for working with dataframes?

That's a totally fair question.

So what happens when we're working on something in the real world, where datasets get large in a hurry, and we suddenly have a dataframe that no longer fits into memory?
We need a way for our computations and datasets to scale across multiple nodes in a distributed system without having to get too fussy about all the distributed compute details.

Enter PySpark.

I think it's fair to think of PySpark as a python package for working with arbitrarily large dataframes, i.e., it's like pandas but scalable.
It's built on top of [Apache Spark](https://spark.apache.org/), a unified analytics engine for large-scale data processing.
[PySpark](https://spark.apache.org/docs/latest/api/python/) is essentially a way to access the functionality of Spark via python code.
While there are other high-level interfaces to Spark (such as Java, Scala, and R), for data scientists who are already working extensively with python, PySpark will be the natural interface of choice.
PySpark also has great integration with [SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html), and it has a companion machine learning library called [MLlib](https://spark.apache.org/mllib/) that's more or less a scalable scikit-learn (maybe we can cover it in a future post).

So, here's the plan.
First we're going to get set up to run PySpark locally in a jupyter notebook on our laptop.
This is my preferred environment for interactively playing with PySpark and learning the ropes.
Then we're going to get up and running in PySpark as quickly as possible by reviewing the most essential functionality for working with dataframes and comparing it to how we would do things in pandas.
Once we're comfortable running PySpark on the laptop, it's going to be much easier to jump onto a distributed cluster and run PySpark at scale.

Let's do this.

## How to Run PySpark in a Jupyter Notebook on Your Laptop

Ok, I'm going to walk us through how to get things installed on a Mac or Linux machine where we're using homebrew and conda to manage virtual environments.
If you have a different setup, your favorite search engine will help you get PySpark set up locally.

### Install Spark

Most of the Spark source code is written in Scala, so first we install Scala.

```
$ brew install scala
```

Install Spark.

```
$ brew install apache-spark
```

Check where Spark is installed.
```
$ brew info apache-spark
apache-spark: stable 3.1.1, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/3.1.2 (1,361 files, 242.6MB) *
...
```

Set the `SPARK_HOME` environment variable to the path returned by `brew info` with `/libexec` appended to the end.
Don't forget to add the export to your `.zshrc` file too (or whatever startup file your shell uses), so it sticks around in new terminal sessions.

```
$ export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2/libexec
```
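
If you'd rather double check the variable from python, here's a minimal sketch. The helper name `check_spark_home` is made up for illustration, and the `bin/spark-submit` check simply assumes the standard layout of a Spark distribution, so treat it as a sanity check rather than an official test.

```python
import os


def check_spark_home(environ=os.environ):
    """Return (ok, message) describing whether SPARK_HOME looks usable.

    Hypothetical helper for illustration; not part of Spark or findspark.
    """
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return False, "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return False, f"SPARK_HOME points to a missing directory: {spark_home}"
    # Assumption: a Spark distribution keeps its launcher scripts under bin/
    if not os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")):
        return False, f"no bin/spark-submit found under {spark_home}"
    return True, f"SPARK_HOME looks good: {spark_home}"


if __name__ == "__main__":
    ok, message = check_spark_home()
    print(message)
```

If this reports that the variable isn't set, the most likely culprit is that the export never made it into the shell session you're running python from.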

Test the installation by starting the Spark shell.

```
$ spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 14.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

If you get the `scala>` prompt, then you've successfully installed Spark on your laptop!

### Install PySpark

Use conda to install the PySpark python package.
As usual, it's advisable to do this in a new virtual environment.


```
$ conda install pyspark
```
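
To confirm which PySpark version landed in the active environment, something like the following should work. It's a small sketch that relies only on the standard library's `importlib.metadata` (python 3.8+), and the function name `installed_version` is just an illustrative choice. It assumes the package was installed with its usual metadata, which is normally the case for conda and pip installs.

```python
from importlib import metadata


def installed_version(package="pyspark"):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("pyspark is not installed in this environment")
    else:
        print(f"pyspark {version}")
```

It's worth comparing this against the version that `brew info apache-spark` reported, since it's generally safest to keep the PySpark package and the Spark installation on the same version.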

You should now be able to launch an interactive PySpark REPL by running `pyspark`.

```
$ pyspark
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.8.3 (default, Jul  2 2020 11:26:31)
Spark context Web UI available at http://192.168.100.47:4041
Spark context available as 'sc' (master = local[*], app id = local-1624127229929).
SparkSession available as 'spark'.
>>>
```

This time we get a familiar python `>>>` prompt.
This is an interactive shell where we can easily experiment with PySpark.
Feel free to run the example code in this post here in the PySpark shell, or, if you prefer a notebook, read on and we'll get set up to run PySpark in a jupyter notebook.

### The Spark Session Object

You may have noticed that when we launched that PySpark interactive shell, it told us that something called `SparkSession` was available as `'spark'`.
What's happening here is that when we launch the pyspark shell, it instantiates an object called `spark`, which is an instance of class `pyspark.sql.session.SparkSession`.
The spark session object is going to be our entry point for all kinds of PySpark functionality, i.e., we're going to be saying things like `spark.this()` and `spark.that()` to make stuff happen.

The PySpark interactive shell is kind enough to instantiate one of these spark session objects for us automatically.
However, when we're using another interface to PySpark (like, say, a jupyter notebook running a python kernel), we'll have to make a spark session object for ourselves.

### Create a PySpark Session in a Jupyter Notebook

There are a few ways to run PySpark in jupyter which you can read about [here](https://www.datacamp.com/community/tutorials/apache-spark-python).

For derping around with PySpark on your laptop, I think the best way is to instantiate a spark session from a jupyter notebook running on a regular python kernel.
To make that work, we'll use the findspark package, which helps python locate our Spark installation.
So, first install the findspark package.

```
$ conda install findspark
```

Launch jupyter as usual.

```
$ jupyter notebook
```


Go ahead and fire up a new notebook using a regular python 3 kernel.
Once you land inside the notebook, there are a couple things we need to do to get a spark session instantiated.
You can think of this as boilerplate code that we need to run in the first cell of a notebook where we're going to use PySpark.

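```python
import findspark

# Locate the Spark installation (via SPARK_HOME) before importing pyspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

# Reuse the active spark session if there is one, otherwise create it
spark = SparkSession.builder.appName('My Spark App').getOrCreate()
```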

First we're running findspark's `init()` method to find our Spark installation. If you run into errors here,
make sure you set the `SPARK_HOME` environment variable correctly in the install instructions above.
Then we instantiate a spark session as `spark`.
Once you run this, you're ready to rock and roll with PySpark in your jupyter notebook.

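> Note: Spark provides a handy web UI that you can use for monitoring and debugging. Once you instantiate the spark session, you can open the UI in your web browser at [http://localhost:4040/jobs/](http://localhost:4040/jobs/). Port 4040 is the default; if it's already taken (say, by another running Spark session), Spark tries 4041, then 4042, and so on.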