# **PySpark**: The Apache Spark Python API

## 1. Introduction

This notebook shows how to connect a Jupyter notebook to a Spark cluster and process data with the Spark Python API.

## 2. The Spark Cluster

### 2.1. Connection

To connect to the Spark cluster, create a SparkSession object with the following parameters:

+ **appName:** application name displayed in the [Spark Master Web UI](http://localhost:8080/);
+ **master:** Spark Master URL, the same one used by the Spark Workers;
+ **spark.executor.memory:** must be less than or equal to the `SPARK_WORKER_MEMORY` setting in the Docker Compose configuration.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        getOrCreate()
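
The constraint on **spark.executor.memory** can be checked before the session is created. The following is a minimal sketch, not part of the original notebook: `to_mib` and `fits_worker` are hypothetical helpers, the `1g` worker value is an assumed example of what `SPARK_WORKER_MEMORY` might hold, and only the single-letter `k`/`m`/`g`/`t` suffixes are handled.

```python
# Hypothetical helpers: compare a requested executor memory size against
# the worker memory limit before building the SparkSession.
UNITS_IN_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def to_mib(size: str) -> float:
    """Convert a size string such as '512m' or '1g' to mebibytes."""
    size = size.strip().lower()
    if len(size) < 2 or size[-1] not in UNITS_IN_MIB:
        raise ValueError(f"unsupported memory size: {size!r}")
    return float(size[:-1]) * UNITS_IN_MIB[size[-1]]


def fits_worker(executor_memory: str, worker_memory: str) -> bool:
    """Return True when the executor request does not exceed the worker limit."""
    return to_mib(executor_memory) <= to_mib(worker_memory)


# Assumed example value; use the SPARK_WORKER_MEMORY value from your compose file.
worker_memory = "1g"
print(fits_worker("512m", worker_memory))
print(fits_worker("2g", worker_memory))
```

With the assumed `1g` limit, the `512m` request used above fits, while a `2g` request would not.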


Additional configuration options for the SparkSession object in standalone mode can be set with the **config** method. See the API docs [here](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession).
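
Several settings can be passed by chaining one **config** call per key. As a sketch under stated assumptions: `apply_confs` is a hypothetical helper, and the values below are illustrative rather than taken from this notebook's cluster. It works with any builder-like object whose `config(key, value)` method returns the builder, which is how `SparkSession.builder` behaves.

```python
# Illustrative values only; the keys are standard Spark properties.
extra_confs = {
    "spark.executor.memory": "512m",
    "spark.executor.cores": "1",
    "spark.cores.max": "2",
}


def apply_confs(builder, confs):
    """Call builder.config(key, value) for each entry and return the final builder."""
    for key, value in confs.items():
        builder = builder.config(key, value)
    return builder


# Usage with Spark (needs a reachable cluster, so it is left commented out):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("pyspark-notebook").master("spark://spark-master:7077")
# spark = apply_confs(builder, extra_confs).getOrCreate()
```

Keeping the settings in one dictionary makes it easier to see at a glance what the session was started with.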

In [2]:
sc = spark.sparkContext

In [11]:
tickets = sc.textFile("/data/aircrafts_data.csv")

In [12]:
tickets.collect()

['aircraft_code,model,range',
 '773,"Boeing 777-300",11100',
 '763,"Boeing 767-300",7900',
 'SU9,"Sukhoi Superjet-100",3000',
 '320,"Airbus A320-200",5700',
 '321,"Airbus A321-200",5600',
 '319,"Airbus A319-100",6700',
 '733,"Boeing 737-300",4200',
 'CN1,"Cessna 208 Caravan",1200',
 'CR2,"Bombardier CRJ-200",2700']

In [16]:
# Keep the first two CSV fields of each line: (aircraft_code, model)
names = tickets.map(lambda x: (x.split(",")[0], x.split(",")[1]))

In [25]:
boeing = names.filter(lambda x: x[1].startswith('"Boeing'))

In [26]:
boeing.collect()

[('773', '"Boeing 777-300"'),
 ('763', '"Boeing 767-300"'),
 ('733', '"Boeing 737-300"')]
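
Splitting each line on commas keeps the double quotes around the model name and would split a model name that itself contains a comma. A minimal sketch of a more robust alternative uses Python's standard `csv` module; `parse_code_and_model` is a hypothetical helper name, and the second sample row is invented purely to illustrate an embedded comma.

```python
import csv


def parse_code_and_model(line: str) -> tuple:
    """Parse one CSV line and return (aircraft_code, model) with quotes removed."""
    fields = next(csv.reader([line]))
    return fields[0], fields[1]


print(parse_code_and_model('773,"Boeing 777-300",11100'))
# Hypothetical row: the comma inside the quoted field stays part of the model name.
print(parse_code_and_model('X01,"Example, Model",100'))

# In the notebook this could replace the lambda, for example:
# names = tickets.map(parse_code_and_model)
# boeing = names.filter(lambda x: x[1].startswith("Boeing"))
```

Because the quotes are stripped during parsing, the filter prefix no longer needs a leading `"` character.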