# Lesson 05 - Introduction to PySpark

## Spark Language APIs

Apache Spark is written in the Scala programming language, but it provides APIs that allow Spark applications to be developed in Java, Scala, Python, and R. We will work primarily in Python in this course. The Spark Python API is called **PySpark**.

## PySpark

In this lesson, we introduce some of the components of a PySpark application. We can import PySpark into our Python session as shown below.

In [0]:
import pyspark

## The SparkSession

Every Spark application has a **SparkSession** object that is created as part of the driver process. The `SparkSession` instance is the entry point through which all Spark functionality is accessed: the user issues commands to Spark through this object.

The next cell shows how to create a SparkSession object.

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

We can get information about our Spark session by viewing the `SparkSession` object that we have created.

In [0]:
spark

In the cell below, we will check the type of the object stored in the variable `spark`.

In [0]:
print(type(spark))

The `SparkSession` object has an attribute that stores the version of Spark that we are running.

In [0]:
print(spark.version)

## SparkContext

As mentioned, the `SparkSession` object is the entry point through which we submit commands to Spark. This object can be used directly to work with DataFrames, which are the primary data type in the Spark SQL component (and will be discussed later in this course). However, the `SparkSession` object also contains entry points for accessing the other components of Spark. These entry points are referred to as **contexts**. The first context we will work with is the **`SparkContext`**, which is associated with the Spark Core component. The `SparkContext` provides tools for working with Resilient Distributed Datasets, or RDDs. An RDD is the most basic data type introduced by Spark, and all other data types in Spark are built on top of RDDs.

In the next code cell, we will assign the `SparkContext` object to the variable `sc`. We will then use it to display some information about our Spark environment.

In [0]:
sc = spark.sparkContext

print('Spark Version:', sc.version)
print('Spark Mode:', sc.master)
print('Spark AppName:', sc.appName)
print('Default Parallelism:', sc.defaultParallelism)

In the next lesson, we will begin to explore RDDs and the `SparkContext` object.