PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.

In [2]:
import findspark
findspark.init()

SparkSession was introduced in version 2.0 and is the entry point to the underlying Spark functionality for programmatically creating Spark RDDs, DataFrames and Datasets. Its object, spark, is available by default in spark-shell, and it can also be created programmatically using the SparkSession builder pattern.

In [3]:
from pyspark.sql import SparkSession

In [4]:
# (1) Initialise the Spark Session
spark = SparkSession\
    .builder\
    .getOrCreate()

In [5]:
# (2) Generate fake data on the driver
mylist = [
    ["Julien", 67], 
    ["Ιουλιανός", 32], 
    ["Юлиан", 89],
    ["尤利安", 40]
]

A Spark “driver” is an application that creates a SparkContext for executing one or more jobs in the Spark cluster. The SparkContext lets your Spark/PySpark application access the Spark cluster with the help of a resource manager.

When you create a SparkSession object, a SparkContext is also created and can be retrieved using spark.sparkContext. Only one SparkContext is active per application: calling SparkSession.builder.getOrCreate() or SparkContext.getOrCreate() again returns the existing SparkContext rather than creating a new one.

### Role of a Spark Context

The Spark Context is the entry point, like the key to your car.

Functions:
* Tells Spark how to access a cluster
* Allocates Executors
* The context, living in your driver program, coordinates sets of processes on the cluster to run your application.
* The context keeps track of live executors by sending heartbeat messages periodically.
* Finally, the context may perform dynamic resource allocation if the cluster manager permits. This increases cluster utilization in shared environments by proper scheduling of multiple applications according to their resource demands.

#### Scenario 1:

If you want to compute a complex aggregation on Spark, the task has to be distributed across the cluster.

The Spark context is the gateway to a Spark cluster.

#### Scenario 2:

If you have a dataset (a simple CSV/TXT file) and want to run computations on it, all the worker nodes need access to this data. You can use the Spark context to broadcast the data to all the nodes.

The Spark context allows your Spark application to access the Spark cluster with the help of a resource manager. The resource manager can be, among others, Spark Standalone, YARN, or Apache Mesos.

## 1.1 What are RDDs?

* RDD stands for Resilient Distributed Dataset.

* The RDD is the core abstraction in Apache Spark. (If you are wondering what abstraction means in programming, see this excellent explanation on Stack Overflow: Abstraction.)

* An RDD is simply an immutable distributed collection of elements.

* The name captures two important properties:

    - Resilient means that we must be able to withstand failures and complete an ongoing computation.
    - Distributed means that we must account for multiple machines each holding a subset of the data. Formally, an RDD is a read-only, partitioned collection of records.

* In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.

* The data inside a Spark application is read into RDDs, and Spark then automatically distributes the data contained in the RDDs across your cluster and parallelizes the operations you perform on them.

* RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

## 1.2 Creating RDDs
Spark provides two ways to create RDDs:

* Parallelizing a collection in your driver program.
* Loading an external dataset


#### Parallelizing a collection in your driver program:

Create an RDD by calling the parallelize() method on an existing iterable or collection in your driver program. We will do this in our first program.

#### Loading an external dataset
Two methods are available. Both take a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI).

- `sc.textFile()`: Creates an RDD with each line as an element.
- `sc.wholeTextFiles()`: Creates a PairRDD with the key being the file name with its path and the value being the content of that file.

In [6]:
# (3) Distribute it over the network
rdd = spark.sparkContext.parallelize(mylist)

## 1.4 Operations on RDDs

From the Spark Programming Guide:

> RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

### Transformation on RDDs
Transformations are operations that transform your RDD data from one form to another. When you apply a transformation to an RDD, you get a new RDD with the transformed data.

Operations like map, filter, flatMap are transformations.

> Note: when you apply a transformation to an RDD, the operation is not performed immediately. Spark instead records it in a DAG (Directed Acyclic Graph) built from the applied operation, the source RDD and the function used for the transformation. It keeps extending this graph until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.

### Actions on RDDs
An action does not give you another RDD. It triggers all the lined-up transformations on the base RDD (the DAG), executes the action on the last RDD, and returns a value to the driver program (or writes the result to storage).

Operations like collect, take, count, first, saveAsTextFile are actions.

Let's have a look at what the data in your RDD currently looks like. We will use the take() method for this:

take(n):

The action take(n) returns the first n elements of the RDD.

In [7]:
rdd.take(3)

[['Julien', 67], ['Ιουλιανός', 32], ['Юлиан', 89]]

In [6]:
# (4) Return the mean of ages:
meanage = rdd.map(lambda x: x[1]).mean()
print("Mean age is {}".format(meanage))

Mean age is 57.0


In [None]:
# (5) Return person whose age is below 60
belowthreshold = rdd\
    .filter(lambda x: x[1] < 60)\
    .map(lambda x: x[0])\
    .collect()
print("{} is/are below 60".format(belowthreshold))

In [None]:
# (6) Go from RDD to DataFrame
df = rdd.toDF(["Name", "Age"])
df.show()
df.printSchema()