[back](./00-index.ipynb)

---
## `Apache Spark - Overview`


### `What is Spark?`

- **Spark** is a framework for distributed computing
- It is a streamlined alternative to Map-Reduce
- **Spark** applications can be written in Scala, Java, Python, or R

### `Why Spark?`

**Why learn Spark?**

- **Spark** enables you to analyze petabytes of data
- **Spark** is significantly faster than Map-Reduce, largely because it keeps intermediate data in memory
- Paradoxically, **Spark's** API is simpler than the Map-Reduce API

### `Origins`

- **Spark** was started at **UC Berkeley's AMPLab** (AMP = Algorithms, Machines, People) in 2009
- After being open sourced in 2010 under a BSD license, the project was donated in 2013 to the _Apache Software Foundation_ and switched its license to Apache 2.0.
- **Spark** is one of the most active projects in the _Apache Software Foundation_ and one of the most active open source big data projects.

### `Essence of Spark`

What is the basic idea of **Spark**?

- **Spark** takes the Map-Reduce paradigm and changes it in two critical ways:
  - Instead of chaining separate Map-Reduce jobs, a single **Spark** job can consist of a whole series of map and reduce functions
  - Moreover, the intermediate data is kept in memory where possible instead of being written to disk between steps
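
As a rough plain-Python analogy (not Spark code), chaining several map-style steps and a final reduce while keeping intermediate values in memory might look like the sketch below. The built-in `map` and `filter` return lazy iterators, so no intermediate result is written anywhere. The variable names are illustrative only.

```python
from functools import reduce

# Plain-Python analogy, not Spark: several steps chained into one pipeline.
numbers = range(1, 11)

squares = map(lambda x: x * x, numbers)                # first map step
even_squares = filter(lambda x: x % 2 == 0, squares)   # works on the in-memory intermediate values
total = reduce(lambda a, b: a + b, even_squares)       # final reduce step

print(total)
```

In a classic Map-Reduce setup each of these steps would typically be its own job with its output written to disk; here the values simply flow from one step to the next.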

### `Spark Ecosystem`

<p align="center">
  <img src="../../assets/spark_ecosystem.png" alt="Spark Ecosystem">
</p>

### `Spark Logging`

Q: How can I make Spark logging less verbose?

- By default Spark logs messages at the **INFO** level
- Here are the steps to make it only print out warnings and errors
    ```bash
    cd $SPARK_HOME/conf
    cp log4j.properties.template log4j.properties
    ```
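- Edit `log4j.properties` and replace `rootCategory=INFO` with `rootCategory=WARN` (use `ERROR` instead of `WARN` to print errors only)
- Spark 3.3 and later use Log4j 2, so the template there is `log4j2.properties.template` and the setting to change is `rootLogger.level`
- Alternatively, change the level at runtime with `sc.setLogLevel("WARN")`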

### `Spark Execution`

<p align="center">
  <img src="../../assets/spark_execution.png" alt="Spark Execution">
</p>

#### `Spark Terminology`

|Term|Meaning|
|:----|:----|
|Driver|Process that runs the application's main program and contains the `SparkContext`|
|Executor|Process that executes one or more Spark tasks|
|Master|Process that manages applications across the cluster, e.g., Spark Master|
|Worker|Process that manages executors on a particular worker node, e.g., Spark Worker|


<!-- <p align="center">
  <img src="../../assets/spark_terminology.png" alt="Spark Terminology">
</p> -->

### `Spark Job`

Q: Flip a coin 100 times using Python's `random.random()` function. What fraction of the flips come up heads?

- Initialize Spark


```python
import pyspark as ps

# Start (or reuse) a local session that uses all available cores
spark = ps.sql.SparkSession.builder.master('local[*]').appName('spark-lecture').getOrCreate()

# The SparkContext is the entry point to the RDD API
sc = spark.sparkContext
```

```
22/06/12 01:37:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
```


- Define and run the Spark job

```python
import random

n = 100

# Draw one uniform random number per element and count those below 0.5 as heads
heads = (sc.parallelize(range(n))
         .map(lambda _: random.random())
         .filter(lambda r: r < 0.5)
         .count())

tails = n - heads
ratio = heads / n

print('heads: ', heads)
print('tails: ', tails)
print('ratio: ', ratio)
```

```
heads:  55
tails:  45
ratio:  0.55
```

#### `Notes`

- `sc.parallelize` creates an **RDD**
- **map** and **filter** are _transformations_
  - They create new RDDs from existing RDDs
  - They are lazy: no computation runs until an action is called
- `count` is an _action_: it triggers the computation and returns the result (here, a single number) to the driver
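
To make the split between transformations and actions concrete, here is a toy stand-in written in plain Python. `ToyRDD` is a hypothetical class invented for this sketch, not part of PySpark, and it ignores partitioning and distribution entirely. It only illustrates that transformations record work while an action runs it.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, f):
        # Transformation: return a new ToyRDD and do no work yet.
        return ToyRDD(self.source, self.steps + (("map", f),))

    def filter(self, f):
        # Transformation: same idea, only the recorded step differs.
        return ToyRDD(self.source, self.steps + (("filter", f),))

    def _run(self):
        items = iter(self.source)
        for kind, f in self.steps:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    def collect(self):
        # Action: execute the recorded steps and return a local list.
        return list(self._run())

    def count(self):
        # Action: execute the recorded steps and return a single number.
        return sum(1 for _ in self._run())


calls = []

def double(x):
    calls.append(x)  # record every time the function actually runs
    return x * 2

rdd = ToyRDD(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: the transformations did no work
result = rdd.collect()            # the action triggers the computation

print(calls_before_action, result, len(calls))
```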

#### `Spark Terminology`

|Term|Meaning|
|:----|:----|
|RDD|_Resilient Distributed Dataset_: a distributed sequence of records|
|Spark Job|Sequence of transformations on data with a final action|
|Transformation|Spark operation that produces a new RDD|
|Action|Spark operation that produces a local object|
|Spark Application|Sequence of Spark jobs and other code|

- A Spark job pushes the data to the cluster, where all computation happens on the _executors_; only the result is sent back to the driver.
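
The driver/executor flow can be imitated on one machine with a thread pool. This is only an analogy under simplifying assumptions: `split_into_partitions` and `count_evens` are hypothetical helpers, threads stand in for executors, and real Spark handles scheduling, serialization, and fault tolerance itself.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """'Driver' side: cut the data into roughly equal chunks."""
    data = list(data)
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def count_evens(partition):
    """'Executor' side: process one partition and return a small result."""
    return sum(1 for x in partition if x % 2 == 0)

partitions = split_into_partitions(range(100), num_partitions=4)

# Each worker thread handles one partition independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_evens, partitions))

# Only the per-partition counts travel back to the "driver", which combines them.
total = sum(partial_counts)

print(partial_counts, total)
```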

#### `Understanding The Code`

```python
rdd = sc.parallelize(range(n))
rdd.collect()
```

```
[0, 1, 2, ..., 97, 98, 99]
```

```python
rdd_map = rdd.map(lambda _: random.random())
rdd_map.collect()
```

```
[0.11048732820468354,
 0.2520917913435715,
 0.8319029660855509,
 0.2914023400457618,
 0.8970176817649934,
 ...]
```

```python
rdd_filter = rdd_map.filter(lambda r: r < 0.5)
rdd_filter.collect()
```

```
[0.24816898680350252,
 0.4444352084761919,
 0.46082730666770844,
 0.15626997426544742,
 0.18839250524454165,
 ...]
```

```python
rdd_filter.count()  # count() is the action here; collect(), mean(), etc. are other actions
```

```
59
```
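
Note that the one-shot job earlier reported 55 heads while `rdd_filter.count()` reported 59. That is expected: unless an RDD is cached, each action recomputes it from the start, so `random.random()` is drawn afresh every time. The plain-Python sketch below imitates this with generators (`flip_pipeline` is a hypothetical helper). Materializing the draws once plays the role that `rdd_map.cache()` aims for in Spark, although Spark may still recompute cached partitions that get evicted.

```python
import random

def flip_pipeline(n):
    """Build a fresh lazy pipeline; no numbers are drawn until it is consumed."""
    draws = (random.random() for _ in range(n))
    return (r for r in draws if r < 0.5)

n = 100

# Two separate "actions" each re-run the pipeline and draw new random numbers,
# so the two counts are not guaranteed to match.
count_a = sum(1 for _ in flip_pipeline(n))
count_b = sum(1 for _ in flip_pipeline(n))

# Materializing the draws once keeps every later step consistent with them.
draws = [random.random() for _ in range(n)]
heads = [r for r in draws if r < 0.5]

print(count_a, count_b, len(heads))
```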

#### `Pop Quiz`

In the Spark job below, which call is the transformation and which is the action?

```python
sc.parallelize(range(10)).filter(lambda x: x % 2 == 0).collect()
```

#### `Lambda vs. Functions`

- Instead of a `lambda`, we can pass a named function defined with `def` to `map`, `filter`, and other `RDD` transformations.
- We often use `lambda` for short, one-expression functions.
- We use `def` for more substantial functions that deserve a name, a docstring, or their own tests.
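
A small illustration, using Python's built-in `filter` as a stand-in for the RDD method so that the block runs without Spark (`is_even` is a hypothetical example function). Either style of callable is passed in the same way.

```python
def is_even(x):
    """Named function: easier to document, test, and reuse."""
    return x % 2 == 0

values = range(10)

with_lambda = list(filter(lambda x: x % 2 == 0, values))  # inline, anonymous
with_def = list(filter(is_even, values))                  # named, reusable

print(with_lambda == with_def, with_def)
```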

#### `Another Example - Finding Primes`

Q: Find all the primes less than 100.

- Define function to determine if a number is prime.

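```python
def is_prime(number):
    """Return True if number is prime, using trial division up to sqrt(number)."""
    if number < 2:
        return False  # 0, 1 and negative numbers are not prime
    factor_max = int(number ** 0.5) + 1
    for factor in range(2, factor_max):
        if number % factor == 0:
            return False
    return True
```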
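
Before handing a function like this to Spark, it can be worth checking it locally. The sketch below repeats the trial-division logic so that it runs on its own and compares it against a slow reference check for inputs from 2 upward, the range used in this example. `is_prime_brute_force` is a hypothetical helper introduced only for the comparison.

```python
def is_prime(number):
    # Same trial-division logic as in the text, repeated so this block runs alone.
    for factor in range(2, int(number ** 0.5) + 1):
        if number % factor == 0:
            return False
    return True

def is_prime_brute_force(number):
    # Slow reference check: no divisor anywhere in 2 .. number - 1.
    return all(number % d != 0 for d in range(2, number))

candidates = range(2, 200)
mismatches = [k for k in candidates if is_prime(k) != is_prime_brute_force(k)]
primes_below_20 = [k for k in range(2, 20) if is_prime(k)]

print(mismatches, primes_below_20)
```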

- Now, we use this function to filter non-primes.

```python
numbers = range(2, 100)

# Keep only the numbers for which is_prime returns True
primes = (sc.parallelize(numbers)
          .filter(is_prime)
          .collect())

print(primes)
```

```
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
```


#### `Pop Quiz`

<p align="center">
  <img src="../../assets/spark_execution.png" alt="Spark Execution">
</p>

Q: Where does `is_prime` execute?

Q: Where does the RDD get collected?

#### `Transformations and Actions`

- Common RDD Constructs

    |Expression|Meaning|
    |:----:|:----|
    |`sc.parallelize(list)`|Create RDD of elements of list|
    |`sc.textFile(path)`|Create RDD of lines from file|
<br>

- Common Transformations

    |Expression|Meaning|
    |:----:|:----|
    |`filter(lambda x: x % 2 == 0)`|Discard non-even elements|
    |`map(lambda x: x * 2)`|Multiply each RDD element by 2|
    |`map(lambda x: x.split())`|Split each string into a list of words|
    |`flatMap(lambda x: x.split())`|Split each string into words and flatten into a single sequence|
    |`sample(withReplacement=True, fraction=0.25)`|Sample roughly 25% of the elements with replacement|
    |`union(rdd)`|Append `rdd` to existing RDD (duplicates are kept)|
    |`distinct()`|Remove duplicates in RDD|
    |`sortBy(lambda x: x, ascending=False)`|Sort elements in descending order|
<br>

- Common Actions

    |Expression|Meaning|
    |:----:|:----|
    |`collect()`|Return all elements of the RDD to the driver as a list|
    |`take(3)`|First 3 elements of RDD|
    |`top(3)`|Largest 3 elements of RDD, in descending order|
    |`takeSample(withReplacement=True, num=3)`|Create sample of 3 elements with replacement|
    |`sum()`|Find element sum (assumes numeric elements)|
    |`mean()`|Find element mean (assumes numeric elements)|
    |`stdev()`|Find element standard deviation (assumes numeric elements)|
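
The difference between `map` and `flatMap` with `split()` is easy to miss in a table. The sketch below mimics both in plain Python, along with `distinct()` followed by a descending `sortBy`. It is an analogy that runs without Spark, and the sample `lines` are made up for illustration.

```python
from itertools import chain

lines = ["to be or", "not to be"]

# Like map(lambda x: x.split()): one output element (a list of words) per input line.
mapped = [line.split() for line in lines]

# Like flatMap(lambda x: x.split()): the per-line lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# Like distinct() followed by sortBy(lambda x: x, ascending=False).
distinct_desc = sorted(set(flat_mapped), reverse=True)

print(mapped)
print(flat_mapped)
print(distinct_desc)
```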


### `Conclusion`


---
[next]()