
In [4]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

In [7]:
import pyspark

pyspark.__version__

'2.4.5'

In [6]:
!pip install pyspark==2.4.5

Collecting pyspark==2.4.5
  Downloading https://files.pythonhosted.org/packages/9a/5a/271c416c1c2185b6cb0151b29a91fff6fcaed80173c8584ff6d20e46b465/pyspark-2.4.5.tar.gz (217.8MB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.5-py2.py3-none-any.whl size=218257927 sha256=584064a9927e1cd322acf3a44407ac85b342bd5631b6d4f2232cba1e972650e4
  Stored in directory: /root/.cache/pip/wheels/bf/db/04/61d66a5939364e756eb1c1be4ec5bdce6e04047fc7929a3c3c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4

In [8]:
import os
# Point JAVA_HOME at the OpenJDK 8 installation from the setup cell
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# Point SPARK_HOME at the unpacked Spark 2.4.5 distribution
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

In [9]:
import findspark
findspark.init()

In [10]:
import pyspark 
sc = pyspark.SparkContext('local[*]') # 'local[*]' runs Spark locally (no cluster), with one worker thread per logical core

In [11]:
sc, type(sc)

(<SparkContext master=local[*] appName=pyspark-shell>,
 pyspark.context.SparkContext)

# Understanding Spark and `SparkContext( )`

## OBJECTIVES

* Describe Spark's parallelism with master and executor nodes. 
* Understand `SparkContext()` for managing connections in parallel applications. 
* Provide an overview of major `SparkContext()` properties and methods.  

### Introduction

The PySpark series of lessons and labs introduces **Apache Spark**, a leading framework for big data processing. You will work in Jupyter notebooks with PySpark, using a PySpark Docker image running in standalone mode. 

Spark comes bundled with a **Cluster Resource Manager** which can divide and share the physical resources of a cluster of machines between multiple Spark applications. Spark's **Standalone cluster manager** operates in the standalone mode and allows Spark to manage its own cluster. 

In Spark's computational model, communication routinely occurs between a **driver** and **executors**. The driver holds the Spark jobs that need to run. Each job is split into tasks, which are submitted to the executors for completion, and the results of these tasks are delivered back to the driver. The Spark driver declares the transformations and actions on data and submits these requests to the **master**. The machine on which the Spark Standalone cluster manager runs is called the **Master Node**. For these labs, this distributed arrangement is simulated on a single machine, which allows you to initialize multiple worker nodes. 
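
The split-submit-collect cycle described above can be pictured with a small sketch that uses only Python's standard library. This is an analogy, not Spark code: a thread pool stands in for the executors, and the function names (`split_into_tasks`, `run_task`, `run_job`) are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(data, num_tasks):
    """Driver side: cut one job's input into roughly equal chunks (tasks)."""
    size = max(1, -(-len(data) // num_tasks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(chunk):
    """Executor side: do the work for one task and return a partial result."""
    return sum(x * x for x in chunk)


def run_job(data, num_executors=4):
    """Driver side: submit tasks to the executors, then combine their results."""
    tasks = split_into_tasks(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, tasks))
    return sum(partial_results)


# Sum of the squares of 1..100, computed as four partial sums
print(run_job(list(range(1, 101))))  # prints 338350
```

In real Spark the executors are separate JVM processes, possibly on other machines, and the scheduling is handled by Spark rather than by a thread pool.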

### `SparkContext( )`
In order to use Spark and its API we will need to use a **SparkContext**. A SparkContext is a client of Spark’s execution environment and it acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to a Spark execution environment. In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark Master. 

When running Spark, we start a new Spark application by creating a new SparkContext. Once created, the SparkContext asks the master for some cores to do its work. The master sets these cores aside, and they are not used by other applications. This setup is shown in the figure below.

![](https://github.com/Thegreatesthumphrey/ds-spark-sparkcontext-onl01-dtsc-ft-041320/blob/master/executors.png?raw=1)

A Spark application's driver program launches various parallel operations on executor Java Virtual Machines (JVMs), running either in a cluster or locally on the same machine. When running locally, "PySparkShell" is the driver program. In all cases, the driver program contains the main loop for the program, creates distributed datasets on the cluster, and then applies operations (transformations and actions) to those datasets. Driver programs access Spark through the `SparkContext` object, which represents a connection to a computing cluster. A SparkContext object (usually named `sc`) is the main entry point for Spark functionality. A SparkContext can be used to create Resilient Distributed Datasets (RDDs) on a cluster.
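
As a brief preview of how a driver program uses the context, the sketch below creates an RDD, applies one transformation, and finishes with one action. The helper name `sum_of_squares` is hypothetical. The calls it makes (`parallelize`, `map`, `sum`) are part of the PySpark RDD API, and the function expects an active SparkContext as its first argument.

```python
def sum_of_squares(sc, numbers):
    """Square every number and add the results, using the given SparkContext."""
    rdd = sc.parallelize(numbers)        # create an RDD from a local collection
    squares = rdd.map(lambda x: x * x)   # transformation: recorded, not yet run
    return squares.sum()                 # action: triggers the computation
```

With the `sc` created in this notebook, `sum_of_squares(sc, [1, 2, 3])` would be expected to return 14. RDDs are covered properly in later labs.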


Let's start a Spark application by importing `pyspark`, creating a SparkContext named `sc`, and printing the type of `sc`.

In [None]:
import pyspark
sc = pyspark.SparkContext('local[*]')

In [14]:
# Display the type of the Spark Context

type(sc)
# pyspark.context.SparkContext

pyspark.context.SparkContext
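
Only one SparkContext can be active in a Python process, so creating a second one while the first is still running raises a `ValueError`. A minimal sketch of one way around this relies on `SparkContext.getOrCreate()`, which returns the running context if there is one. The helper name `get_local_context` and the application name `sparkcontext-lab` are assumptions made for illustration, not part of the lesson.

```python
def get_local_context(master="local[*]", app_name="sparkcontext-lab"):
    """Return the active SparkContext, or create a local one if none exists."""
    # Imported lazily so the function can be defined before pyspark is set up
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    # If a context is already running, it is returned and `conf` is not applied
    return SparkContext.getOrCreate(conf)
```

Calling `sc = get_local_context()` is then safe to re-run in a notebook cell, at the cost of silently keeping the settings of whichever context was created first.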

### SparkContext attributes

You can use Python's `dir()` function to get a list of all the attributes (including methods) accessible through the `sc` object.

In [20]:
# Use Python's dir(obj) to get a list of all attributes of SparkContext
dir(sc)

['PACKAGE_EXTENSIONS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accumulatorServer',
 '_active_spark_context',
 '_batchSize',
 '_callsite',
 '_checkpointFile',
 '_conf',
 '_dictToJavaMap',
 '_do_init',
 '_encryption_enabled',
 '_ensure_initialized',
 '_gateway',
 '_getJavaStorageLevel',
 '_initialize_context',
 '_javaAccumulator',
 '_jsc',
 '_jvm',
 '_lock',
 '_next_accum_id',
 '_pickled_broadcast_vars',
 '_python_includes',
 '_repr_html_',
 '_serialize_to_jvm',
 '_temp_dir',
 '_unbatched_serializer',
 'accumulator',
 'addFile',
 'addPyFile',
 'appName',
 'applicationId',
 'binaryFiles',
 'binaryRecords',
 'broadcas
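
Much of the `dir()` listing consists of dunder and private names that begin with an underscore. A small sketch of how to keep only the public names follows. The helper name `public_attributes` is hypothetical, and it works on any Python object, not just a SparkContext.

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]
```

Calling `public_attributes(sc)` would leave entries such as `accumulator`, `addFile`, and `appName` from the listing above, while dropping names like `__class__` and `_jsc`.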

Alternatively, you can use Python's `help()` function to get an easier-to-read description of all the attributes of the `sc` object, including examples.

In [19]:
# Use Python's help ( help(object) ) function to get information on attributes and methods for sc object. 
help(sc)




You should also have a look at [Spark's SparkContext Documentation Page](https://spark.apache.org/docs/0.6.0/api/core/spark/SparkContext.html) to explore these in further detail.

Let's check a few SparkContext attributes, including `SparkContext.version` and `SparkContext.defaultParallelism`, to find the current version of Apache Spark and the number of cores being used for parallel processing. 


In [28]:
# Check the current version of Spark
sc.version

# Check the default level of parallelism (the number of cores in local[*] mode)
sc.defaultParallelism

# Current version of Spark: 2.4.5
# Default number of cores being used: 2

2

Let's also check the name of the current application by using the `SparkContext.appName` attribute. 

In [29]:
# Check the name of application currently running in spark environment
sc.appName

# 'pyspark-shell'

'pyspark-shell'
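
The attributes inspected so far can be gathered in one place. The sketch below is a convenience helper with a hypothetical name, `describe_context`. It only reads the `version`, `master`, `appName`, and `defaultParallelism` attributes of whatever context it is given.

```python
def describe_context(sc):
    """Collect a few read-only SparkContext attributes into a dictionary."""
    return {
        "version": sc.version,                        # Spark version string
        "master": sc.master,                          # master URL, e.g. 'local[*]'
        "appName": sc.appName,                        # application name
        "defaultParallelism": sc.defaultParallelism,  # default level of parallelism
    }
```

For the context in this notebook, `describe_context(sc)` should report the `local[*]` master and the `pyspark-shell` application name seen in the earlier outputs.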

A SparkContext can be shut down using the `SparkContext.stop()` method. Let's use this method to shut down the current SparkContext. 

In [30]:
# Shut down SparkContext
sc.stop()

Once the SparkContext is shut down, you can no longer access Spark functionality until you start a new SparkContext. 
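
Because a stopped context cannot be reused, and a forgotten one keeps its cores reserved, it can help to guarantee that `stop()` is called even when a job fails. A minimal sketch, using a hypothetical helper named `run_and_stop`, wraps the work in `try`/`finally`:

```python
def run_and_stop(sc, job):
    """Run job(sc) and always stop the context afterwards, even on error."""
    try:
        return job(sc)
    finally:
        sc.stop()  # release the context whether or not the job succeeded
```

For example, `run_and_stop(sc, lambda ctx: ctx.appName)` returns the application name and then shuts the context down. The `__enter__` and `__exit__` entries in the `dir(sc)` listing indicate that a SparkContext can also be used in a `with` statement to similar effect.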

### Summary:

In this short lab, we saw how SparkContext is used as an entry point to Spark applications. We learned how to start a SparkContext, how to list and use some of its attributes and methods, and how to shut it down. You are encouraged to explore the other attributes and methods offered by the `sc` object. Some of these, namely creating and transforming datasets as RDDs, will be explored in later labs. 