## Spark Getting Started

### University of California, Santa Barbara  
### PSTAT 135/235  
### Last Updated: Dec 12, 2018

---  


Source: Learning Spark

Chapter 1: Introduction to Data Analysis with Spark  
Chapter 2: Getting Started

### OBJECTIVES
-  Spark background
-  Setup and installation
-  Basic concepts
-  Minimal code examples
-  Running Spark: Interactive Session
-  Running Spark: Command Line

### CONCEPTS

- Functional programming

- SparkSession - single point of entry for interacting with Spark functionality

- SparkContext - the entry point used since Spark 1.0; in Spark 2.0 and later it is accessed through the SparkSession

- Resilient Distributed Datasets (RDDs) - Spark’s fundamental abstraction for distributed data and computation

- Dataset - distributed collection of data; the main programming interface since Spark 2.0

- Driver Program - contains the application's main function, defines RDDs on the cluster, and applies operations to them

- Worker Node and Executor - a worker node is a machine in the cluster; executors are the processes on it that perform tasks
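
Spark's API leans on the functional style named in the first bullet: small functions are passed to operations such as `map`, `filter`, and `reduce` instead of writing explicit loops. The sketch below uses only built-in Python on a local list (no Spark involved), and the sample data is invented for illustration.

```python
from functools import reduce

# Invented sample data; a Spark RDD would spread this across a cluster
numbers = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element
squares = list(map(lambda x: x * x, numbers))

# filter: keep only the elements for which the function returns True
even_squares = list(filter(lambda x: x % 2 == 0, squares))

# reduce: combine the elements pairwise into a single value
total = reduce(lambda x, y: x + y, even_squares)

print(squares)
print(even_squares)
print(total)
```

The RDD methods used later in these notes (`flatMap`, `map`, `reduceByKey`) take lambdas in the same way.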

---

*Spark Background*

- Designed to be fast  
No waiting around for hours; users need to work interactively with data  


- Designed to handle big data


- General purpose  
Unlike Hadoop MapReduce, Spark bundles several modules in one place: machine learning, batch processing, queries, and streaming


- Caching is possible, so intermediate data can be stored in memory on workers


- Highly accessible: simple APIs for Python, Java, Scala, R, and SQL  
Integrates with other Big Data tools such as Hadoop and Cassandra  
Can access data in HDFS, Amazon S3, and other stores


**Documentation from README**  
You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).

For general development tips, including info on developing Spark using an IDE, see   
[the Useful Developer Tools page](http://spark.apache.org/developer-tools.html).

Spark also comes with several sample programs in the `examples` directory.  
To run one of them in a shell, use `./bin/run-example <class> [params]`.  

For example:

In [None]:
./bin/run-example SparkPi

will run the Pi example locally.

**Install**  

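Page 9 of Learning Spark provides step-by-step download and install instructions.

Dependencies:
1. Python needs to be installed
2. Java needs to be installed: the book, written for Spark 1.x, lists Java 6 or higher; Spark 2.2 and later require Java 8 or higher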

To change the logging level (page 12), copy `conf/log4j.properties.template` to `conf/log4j.properties` and change: 

log4j.rootCategory=INFO, console  

To  

log4j.rootCategory=WARN, console  

## Interactive Python Shell

From the Spark installation directory, using `$` to denote the shell prompt (on Windows, use `bin\pyspark`):


In [None]:
$ bin/pyspark

Set up a minimal SparkSession:
- using the local machine as master
- naming the app

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("pspark_test") \
        .getOrCreate()

In [2]:
# print info about the session
spark

In [3]:
sc = spark.sparkContext


### RDDs and Datasets

Before Spark 2.0, the main programming interface of Spark was the *Resilient Distributed Dataset (RDD)*.  

Since Spark 2.0, the RDD has been replaced as the main interface by the *Dataset*, which is strongly typed like an RDD but with richer optimizations under the hood. In Python, Datasets are not strongly typed; every Dataset is a collection of `Row` objects and is called a *DataFrame*. 

The RDD interface is still supported.  

Using Dataset is recommended because it performs better than RDD.

## Computing

### Example 1: Read lines from text file

In [4]:
import os

In [5]:
data_path = '/home/jovyan/UCSB_BigDataAnalytics/data/'

In [6]:
data_filename = 'README.md'

In [7]:
lines = spark.read.text(os.path.join(data_path, data_filename))

In [None]:
lines.count()

In [None]:
lines.first()

In [None]:
lines.collect()

In [None]:
type(lines.collect())

In [None]:
type(lines.collect()[0])

### Example 2: Text Search - print all lines containing “Spark”

In [None]:
# list the column names; a text file loads into a single column named "value"
lines.columns

In [88]:
spark_lines = lines.filter(lines.value.contains("Spark"))

In [9]:
# return list of first 5 records
spark_lines.take(5)   

In [None]:
type(spark_lines)

### Example 3: Word Count

In [113]:
# Read the file into an RDD
lines = sc.textFile(os.path.join(data_path, data_filename))

In [None]:
type(lines)

In [116]:
words = lines.flatMap(lambda x: x.split())

In [None]:
words.take(5)

In [122]:
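# pair each word with 1, sum the counts per word,
# swap to (count, word), and sort by count in descending order
wordcounts = words.map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y) \
                  .map(lambda x: (x[1], x[0])) \
                  .sortByKey(False)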

In [None]:
wordcounts.take(10)
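
To see what each step of the pipeline does, here is a plain-Python sketch of the same logic on a small in-memory list. It does not use Spark; the sample lines are invented, and the comments name the RDD operation each step stands in for.

```python
from collections import defaultdict

# Invented sample input standing in for the lines of README.md
sample_lines = ["Spark is fast", "Spark is general", "learning Spark"]

# flatMap: split every line and flatten the results into one list of words
words_local = [w for line in sample_lines for w in line.split()]

# map: pair every word with a count of 1
pairs = [(w, 1) for w in words_local]

# reduceByKey: add up the counts for each distinct word
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n

# map + sortByKey(False): swap to (count, word), then sort by count, descending
counts_local = sorted(((n, w) for w, n in totals.items()),
                      key=lambda pair: pair[0], reverse=True)

print(counts_local[:2])
```

In Spark the same steps run in parallel across partitions, and `reduceByKey` combines counts within each partition before shuffling data between workers.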