# PySpark 
> [Main Table of Contents](../../README.md)

## In This Notebook

- What is PySpark?
- Workflow
- SQL API
    - Example: Builder pattern to create SparkSession
- Vocabulary

## What is PySpark?

- The Python API for Apache Spark, which distributes computation across a cluster for large-scale parallel data processing

## Workflow

1. Create a connection to a cluster with `pyspark.SparkContext(conf=conf)`, where the connection is configured through `conf = pyspark.SparkConf()`
2. Access the `pyspark.sql.DataFrame` API through an instance of `pyspark.sql.SparkSession`
    - This instance is an easier-to-use, high-level abstraction over RDDs

## SQL API

Classes | Explanation
--- | ---
`pyspark.SparkConf()` | Configuration for a Spark application, set as key-value pairs
`pyspark.SparkContext(conf=conf)` | Connection to a cluster<br>Tells Spark how to access a cluster using the `conf` kwarg
`pyspark.sql.SparkSession(spark_context)` | Interface to a cluster<br>The entry point to programming Spark with the Dataset and *_DataFrame API_*<br><br>A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, and cache tables<br>Create a `SparkSession` using the builder pattern

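```python
# Example: Builder pattern to create SparkSession
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local")  # run Spark locally on a single thread
    .appName("Word Count")  # name of the application
    .config("spark.some.config.option", "some-value")  # placeholder key/value
    .getOrCreate()  # reuse an active session or create a new one
)
```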

## Vocabulary

Term | Explanation
--- | ---
Cluster | Collection of processing units<br>e.g. Group of separate computers
Node | One unit in a cluster
Master | Main unit in a cluster responsible for splitting data and distributing to workers
Worker | Other units in a cluster that work on segments of data delegated by the master
RDD<br>Resilient Distributed Dataset | The core data structure in Spark<br>A fault-tolerant collection of elements that can be operated on in parallel<br>RDD is a low-level API and is harder to use directly. Prefer the higher-level DataFrame API, accessed through a `SparkSession`
