### 1. What is Spark, anyway?

Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.
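To make the split-then-combine idea concrete, here is a minimal sketch in plain Python, not Spark. The helper names `split` and `partial_sum_of_squares` are made up for this illustration, and threads stand in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Cut `data` into `n_parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks get one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Work done by one 'node' on its own subset of the data."""
    return sum(x * x for x in chunk)


data = list(range(1, 11))          # the "full dataset": 1..10
chunks = split(data, n_parts=3)    # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]

# Threads stand in for nodes here; they show the pattern, not a real speed-up.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum_of_squares, chunks))

total = sum(partials)              # combine the per-chunk results
print(partials, total)             # [30, 110, 245] 385
```

Each chunk is processed without looking at the others, and only the small partial results are combined at the end. That independence is what makes a calculation easy to spread over a cluster.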

However, with greater computing power comes greater complexity.

Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:

1. Is my data too big to work with on a single machine?
1. Can my calculations be easily parallelized?

### 2. Using Spark in Python

The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. There will be one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.

When you're just getting started with Spark, it's simpler to run a cluster locally. So for this course, instead of connecting to another computer, you'll run all computations on DataCamp's servers in a simulated cluster.

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the SparkConf() constructor. Take a look at the [Spark documentation](http://spark.apache.org/docs/2.4.6/index.html) for all the details!
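As a rough sketch of how the two classes fit together, the snippet below builds a SparkConf from a few settings and hands it to SparkContext. The `SETTINGS` dictionary, its values, and the `connect` helper are hypothetical names chosen for this example. The PySpark calls used are `SparkConf.setAll` and `SparkContext.getOrCreate`.

```python
# Hypothetical settings for a local "cluster"; the values are illustrative only.
SETTINGS = {
    "spark.master": "local[2]",       # run locally with 2 worker threads
    "spark.app.name": "spark-notes",  # label shown for this application
}


def connect(settings=None):
    """Build a SparkConf from `settings` and return a SparkContext."""
    # Imported here so the settings above can be inspected without Spark installed.
    from pyspark import SparkConf, SparkContext

    pairs = list((settings or SETTINGS).items())
    conf = SparkConf().setAll(pairs)
    # getOrCreate() reuses an already running context if there is one,
    # in which case these settings are not applied to it.
    return SparkContext.getOrCreate(conf=conf)
```

With PySpark installed, `sc = connect()` followed by `print(sc.master)` should show the configured master, unless a context was already running in the session.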

For the rest of this course you'll have a SparkContext called sc already available in your workspace.

#### Q: How do you connect to a Spark cluster from PySpark?
Ans: Create an instance of the SparkContext class.

### Exercise
#### Examining The SparkContext

In this exercise you'll get familiar with the *SparkContext*.

You'll probably notice that code takes longer to run than you might expect. Spark is some serious software, and it takes more time to start up than you might be used to. Simple computations may also take longer than expected, because the optimizations Spark has under its hood are designed for complicated operations on big datasets. For simple or small problems, Spark may actually perform worse than some other solutions!

#### Instructions

Get to know the SparkContext.

1. Call print() on sc to verify there's a SparkContext in your environment.
1. Call print() on sc.version to see what version of Spark is running on your cluster.

```python
import pyspark as sp

sc = sp.SparkContext.getOrCreate()

# Verify SparkContext
print(sc)

# Print Spark version
print(sc.version)
```

Output:

```
<SparkContext master=local[*] appName=pyspark-shell>
2.4.5
```
