# Spark Overview

- This lecture is an abstract overview of the Spark ecosystem. We will cover:
    - How Spark is different
    - Spark vs MapReduce
    - Spark RDDs
    - RDD Operations

- Don't stress about understanding every operation right now; we will go through them again when we actually program with Spark and Python.

# Spark

- Spark is one of the latest technologies being used to quickly and easily handle Big Data.
- It is an open-source project maintained by the Apache Software Foundation.
- It was created at the AMPLab at UC Berkeley, open-sourced in 2010, and donated to the Apache Software Foundation in 2013.
- It has exploded in popularity due to its ease of use and speed.
- Relatively speaking, it is a very new technology, but it has already been adopted by many users.

-----
- We may think of Spark as an alternative to MapReduce rather than a replacement for Hadoop.
- It is not intended to replace Hadoop, but rather to provide a comprehensive and unified solution for a variety of big data use cases and requirements.
- Spark can use data stored in a variety of storage systems, such as:
    - Cassandra
    - AWS S3
    - HDFS
    - And more

## Spark vs MapReduce

- MapReduce requires files to be stored in HDFS; Spark does not!
- Spark can also perform operations up to 100x faster than MapReduce, mainly for workloads that fit in memory.
---------
<span class="girk">**How does Spark achieve this speed ?**</span>

- MapReduce writes most data to disk after each map and reduce operation.
- Spark keeps most of the data in memory (RAM) after each operation.
- Spark can spill over to disk if memory fills up.
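
To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is only an analogy, not how either engine is implemented: the function names `run_with_disk` and `run_in_memory` are made up for illustration, and temporary JSON files stand in for HDFS.

```python
import json
import os
import tempfile


def run_with_disk(numbers, steps):
    """MapReduce-style analogy: write the intermediate result to disk after every step."""
    disk_writes = 0
    data = list(numbers)
    with tempfile.TemporaryDirectory() as tmp:
        for i, step in enumerate(steps):
            data = [step(x) for x in data]
            path = os.path.join(tmp, f"step_{i}.json")
            with open(path, "w") as f:
                json.dump(data, f)       # persist the intermediate result
            disk_writes += 1
            with open(path) as f:
                data = json.load(f)      # the next step reads it back from disk
    return data, disk_writes


def run_in_memory(numbers, steps):
    """Spark-style analogy: keep the intermediate results in RAM between steps."""
    data = list(numbers)
    for step in steps:
        data = [step(x) for x in data]
    return data, 0


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
disk_result, disk_writes = run_with_disk(range(5), steps)
memory_result, memory_writes = run_in_memory(range(5), steps)

print(disk_result == memory_result)  # True: same answer either way
print(disk_writes, memory_writes)    # 3 0
```

Both versions compute the same result; the only difference is how many round trips to disk happen along the way, which is where the speed gap comes from.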

## Spark RDDs
- At the core of Spark is the idea of a Resilient Distributed Dataset (RDD).
- An RDD has 4 main features:
    - Distributed collection of data
    - Fault tolerance
    - Parallel operation (the data is partitioned)
    - Ability to use many data sources
![image.png](attachment:image.png)
- Spark operates in the same way as the distributed systems we discussed in the last series of notes.
- A driver program runs a SparkContext, which communicates with the Cluster Manager. The Cluster Manager in turn communicates with the Worker Nodes, which execute the individual tasks.
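
The driver and worker roles can be imitated on a single machine. The sketch below is a rough analogy under stated assumptions: a thread pool stands in for the worker nodes, and `split_into_partitions` and `worker_task` are hypothetical helper names, not part of Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the data into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def worker_task(partition):
    """Worker side: a task that processes one partition independently."""
    return sum(x * x for x in partition)


data = list(range(1, 11))
partitions = split_into_partitions(data, num_partitions=3)

# The thread pool plays the role of the worker nodes: each partition
# becomes one task, and the tasks can run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(worker_task, partitions))

total = sum(partial_results)  # the driver combines the partial results

print(partitions)  # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
print(total)       # 385
```

In a real cluster the cluster manager decides which machine runs which task; here the thread pool makes that decision for us.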

 ![image.png](attachment:image.png)
 
 - Here we have RDD objects, on which we perform a series of transformations and actions.
 - Next comes the DAG (Directed Acyclic Graph) Scheduler.
 - Then comes the Task Scheduler, followed by the Worker Nodes themselves.
 ----------
 - Spark allows programmers to develop complex, multi-step data pipelines using the DAG pattern.
 - It also supports in-memory data sharing across these DAGs, so that different jobs can work with the same data.
 - For our work, we will be concerned only with RDD objects; everything else will happen under the hood for us.
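
To get a feel for what "DAG pattern" means, here is a small sketch that uses Python's standard-library `graphlib` module (Python 3.9+) to order the steps of a pipeline. The step names are hypothetical, and this is not Spark's scheduler; it only illustrates that a directed acyclic graph of dependencies always admits a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "load_logs": set(),
    "load_users": set(),
    "parse_logs": {"load_logs"},
    "join": {"parse_logs", "load_users"},
    "count_by_user": {"join"},
}

# static_order() yields the steps so that every step comes after its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Several orders can be valid (for example, `load_users` may run before or after `parse_logs`), which is exactly the freedom a scheduler exploits to run independent steps in parallel.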
 
 - RDDs are immutable, lazily evaluated and cacheable.
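 - There are 2 types of RDD operations:
     - Transformations: describe a new RDD derived from an existing one; nothing is computed yet.
     - Actions: trigger the computation and return a result to the driver program.
 - These 2 types of operations are the core of what we'll be doing with Python and Spark.
 - We will be coding transformations and actions on a distributed data set.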
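
The "immutable, lazily evaluated and cacheable" idea can be sketched with a toy class. `LazyDataset` is a made-up stand-in written for these notes, not Spark's RDD implementation: it only records transformations and runs them when an action (`collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazily evaluated, cacheable."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)  # never modified after creation
        self._lineage = lineage       # recorded transformations, not yet run
        self._use_cache = False
        self._cache = None

    # Transformations: return a NEW dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def cache(self):
        """Ask for the result to be kept after the first action."""
        self._use_cache = True
        return self

    # Action: runs the recorded lineage and returns a plain Python list.
    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cache = data
        return list(data)


calls = []


def traced_double(x):
    calls.append(x)  # lets us see when the function really runs
    return x * 2


numbers = LazyDataset(range(1, 6))
result_ds = numbers.map(traced_double).filter(lambda x: x > 4).cache()

calls_before_action = len(calls)  # transformations alone do no work
first_run = result_ds.collect()   # the action triggers the computation
second_run = result_ds.collect()  # served from the cache, nothing is recomputed

print(calls_before_action)  # 0
print(first_run)            # [6, 8, 10]
print(len(calls))           # 5
print(numbers.collect())    # [1, 2, 3, 4, 5] (the original is untouched)
```

The recorded lineage is also what makes recomputation possible: if a result is lost, it can be rebuilt from the source by replaying the same steps.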

## Actions

- First - Return the first element of the RDD.
- Collect - Return all the elements of the RDD as an array at the driver program.
- Count - Return the number of elements in the RDD.
- Take - Return an array with the first n elements of the RDD.
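
As a rough local analogy, the four actions behave like the following operations on a plain Python list. The list is only a stand-in for the contents of an RDD; the matching PySpark calls are named in the comments.

```python
from itertools import islice

elements = [5, 3, 8, 1, 9]  # stand-in for the contents of an RDD

first_element = elements[0]              # like rdd.first()
all_elements = list(elements)            # like rdd.collect()
element_count = len(elements)            # like rdd.count()
first_three = list(islice(elements, 3))  # like rdd.take(3)

print(first_element)  # 5
print(all_elements)   # [5, 3, 8, 1, 9]
print(element_count)  # 5
print(first_three)    # [5, 3, 8]
```

Each action returns an ordinary value to the driver program rather than another RDD.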

## Basic Transformations

- Filter : RDD.filter() : Applies a function to each element and returns the elements for which it evaluates to true.
![image.png](attachment:image.png)

- Map : RDD.map() : Transforms each element and preserves the number of elements; the idea is very similar to pandas .apply().
- In other words, it maps a function onto every element in the RDD.
![image.png](attachment:image.png)

- FlatMap : RDD.flatMap() : Transforms each element into 0-N elements, so the number of elements can differ from that of the original RDD.
![image.png](attachment:image.png)
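
The three transformations can be mimicked locally with Python built-ins. This is an analogy on a plain list, not PySpark code: the real methods return new RDDs and run lazily across the cluster.

```python
from itertools import chain

numbers = [1, 2, 3, 4]

# like rdd.filter(f): keep only the elements for which f is true
evens = list(filter(lambda x: x % 2 == 0, numbers))

# like rdd.map(f): exactly one output per input
squares = list(map(lambda x: x * x, numbers))

# like rdd.flatMap(f): each input yields 0..N outputs, flattened into one list
# (here x is repeated x - 1 times, so 1 contributes nothing)
repeated = list(chain.from_iterable([x] * (x - 1) for x in numbers))

print(evens)     # [2, 4]
print(squares)   # [1, 4, 9, 16]
print(repeated)  # [2, 3, 3, 4, 4, 4]
```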

## Example for Map vs FlatMap

- map()
    - Grabbing the first letter of each name in a list of names. The number of elements is the same before and after the transformation.
- flatMap()
    - Transforming a corpus of text into a list of words. The number of elements changes after the transformation.

More examples of these two will follow when we program with PySpark.
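
Until then, the two examples above can be tried on plain Python lists. The names and the corpus are made-up sample data, and list comprehensions stand in for the RDD methods.

```python
names = ["Alice", "Bob", "Carol"]
corpus = ["the quick brown fox", "jumps over", "the lazy dog"]

# map-style: one output per input, so the element count is preserved
first_letters = [name[0] for name in names]

# map-style on the corpus: still one output (a list of words) per line
words_per_line = [line.split() for line in corpus]

# flatMap-style: the per-line lists are flattened into one list of words
words = [word for line in corpus for word in line.split()]

print(first_letters)                     # ['A', 'B', 'C']
print(len(corpus), len(words_per_line))  # 3 3
print(len(corpus), len(words))           # 3 9
```

The middle case shows why flatMap is needed: mapping `split` over the corpus keeps 3 elements (each a nested list), while the flattened version produces 9 individual words.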

## Pair RDDs
- Often RDDs will hold their values in tuples (key, value).
- This offers better partitioning of data and enables functionality based on reduction.

### Reduce and ReduceByKey
- reduce()
    - An action that aggregates RDD elements using a function and returns a single element.
- reduceByKey()
    - A transformation that aggregates Pair RDD elements by key using a function and returns a Pair RDD.
- These ideas are similar to a GroupBy operation.
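
Here is a small pure-Python sketch of both ideas. `reduce_by_key` is a hypothetical helper written for these notes (the sorting is only there to make the output predictable), and the sales data is made up.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4, 5]

# like rdd.reduce(add): the whole collection collapses to a single value
total = reduce(add, numbers)


def reduce_by_key(pairs, fn):
    """Combine the values of each key with fn, like rdd.reduceByKey(fn)."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return sorted(result.items())


sales = [("apples", 3), ("pears", 2), ("apples", 4), ("pears", 1), ("kiwis", 5)]
totals = reduce_by_key(sales, add)

print(total)   # 15
print(totals)  # [('apples', 7), ('kiwis', 5), ('pears', 3)]
```

Note the shape of the results: the plain reduce gives back one value, while the by-key version gives back (key, value) pairs again, one per distinct key.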

## About Updates and Releases :

- Spark is under continual development, and new releases come out very often!
- The Spark ecosystem now includes:
    - Spark SQL
    - Spark DataFrames
    - MLlib
    - GraphX
    - Spark Streaming

# Next :
- Setting up an AWS account and getting Spark up and running!