# Big Data Overview


* Explanation of Hadoop, MapReduce, Spark, and PySpark

* Local versus Distributed Systems

* Overview of Hadoop Ecosystem

* Detailed overview of Spark

* Set-up on Amazon Web Services

* Resources on other Spark Options 

* Jupyter Notebook hands-on code with PySpark and RDDs

___

* We've worked with data that can fit on a local computer, on the scale of 0-8 GB.

* But what can we do if we have a larger set of data? 

    * Try using an SQL database to move storage onto a hard drive instead of RAM
    
    * Or use a distributed system that spreads the data across multiple machines.
    
___

![image.png](attachment:image.png)

![image.png](attachment:image.png)

* A local process uses the computational resources of a single machine.

* A distributed process has access to the computational resources of a number of machines connected through a network.

* After a certain point, it is easier to scale out to many machines with lower-end CPUs than to scale up to a single machine with a high-end CPU.

___

* Distributed systems also have the advantage of scaling easily: you can just add more machines.

* They also provide fault tolerance: if one machine fails, the rest of the network can keep going.

* Let's discuss the typical format of a distributed architecture that uses Hadoop.
___

* Hadoop is a way to distribute very large files across multiple machines.

* It uses the Hadoop Distributed File System (HDFS) 

* HDFS allows a user to work with large data sets 

* HDFS also duplicates blocks of data for fault tolerance 

* Hadoop also uses MapReduce

* MapReduce allows computations on that data  

![image.png](attachment:image.png)

* HDFS will use blocks of data, with a size of 128 MB by default

* Each of these blocks is replicated 3 times 

* The blocks are distributed in a way to support fault tolerance. 

* Smaller blocks provide more parallelization during processing

* Multiple copies of a block prevent loss of data due to a failure of a node
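
The block arithmetic above can be made concrete with a small sketch. It assumes the default values quoted in these notes (128 MB blocks, 3 replicas); the function name `hdfs_footprint` and the 1000 MB file are made up for illustration and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # default HDFS block size quoted above
REPLICATION_FACTOR = 3   # default number of copies per block quoted above


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total block copies, raw storage in MB)."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    total_copies = num_blocks * REPLICATION_FACTOR
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return num_blocks, total_copies, raw_storage_mb


blocks, copies, storage = hdfs_footprint(1000)  # hypothetical 1000 MB file
print(blocks, copies, storage)  # 8 24 3000
```

So a 1000 MB file is split into 8 blocks that could be processed in parallel, and the cluster stores 24 block copies in total.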

* MapReduce is a way of splitting a computation task across a distributed set of files (such as files stored in HDFS)

* It consists of a Job Tracker and multiple Task Trackers. 

![image.png](attachment:image.png)

* The Job Tracker sends code to run on the Task Trackers

* The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes
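
To give a feel for the map and reduce steps themselves, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code: the `blocks` list and the helper names `map_phase`, `shuffle` and `reduce_phase` are assumptions made for illustration, and a simple loop stands in for the work that would be spread across worker nodes.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one block of a larger file.
blocks = ["spark hadoop spark", "hadoop hdfs", "spark"]


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each block could be mapped on a different machine; here we loop instead.
mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)  # {'spark': 3, 'hadoop': 2, 'hdfs': 1}
```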

____

* What we covered can be thought of in two distinct parts:
    
    * Using HDFS to distribute large data sets
    
    * Using MapReduce to distribute a computational task to a distributed data set

* Next we will learn about the latest technology in this space known as Spark.

* Spark improves on these ideas of distributed storage and computation

# Spark

* This will be an abstract overview. We will discuss:
    
    * Spark
    
    * Spark vs MapReduce
    
    * Spark RDDs 
    
    * RDD Operations
    
* Don't worry about having to understand all the operations. We will review and cover this again when we actually program with Spark and Python.

___

* Spark is one of the latest technologies being used to quickly and easily handle Big Data. 

* It is an open source Apache project

* It was open-sourced in 2010, moved to the Apache Software Foundation in 2013, and has exploded in popularity due to its ease of use and speed.

* Created at the AMPLab at UC Berkeley.

---

* Think of Spark as a flexible alternative to MapReduce.

* Spark can use data stored in a variety of storage systems

    * Cassandra
    
    * AWS S3
    
    * HDFS
    
    * And more

---

* MapReduce requires files to be stored in HDFS; Spark does not!

* Spark can also perform operations up to 100x faster than MapReduce

* So how does it achieve this speed?

---

* MapReduce writes most data to disk after each map and reduce operation

* Spark keeps most of the data in memory after each transformation

* Spark can spill over to disk if the memory is filled.

---

* At the core of Spark is the idea of a Resilient Distributed Dataset (RDD)

* A Resilient Distributed Dataset (RDD) has 4 main features:

    * Distributed Collection of Data
    
    * Fault-tolerant
    
    * Parallel operation - partitioned 
    
    * Ability to use many data sources
    
---

![image.png](attachment:image.png)

![image.png](attachment:image.png)

* RDDs are immutable, lazily evaluated and cacheable

* There are two types of RDD operations:

    * Transformations
    
    * Actions
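
Lazy evaluation is easier to see in a small example. The sketch below is a plain-Python analogy, not Spark code: Python's built-in `map` stands in for a transformation and `list(...)` stands in for an action. The `calls` list and the `double` function are made up for illustration.

```python
calls = []  # records every time the function below actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformation": builds a lazy pipeline, nothing is computed yet.
doubled = map(double, numbers)
calls_before_action = len(calls)

# "Action": asking for concrete results forces the computation.
result = list(doubled)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)  # 0 4 [2, 4, 6, 8]
print(numbers)  # the source list is unchanged: [1, 2, 3, 4]
```

The analogy is loose in one respect: a Python `map` object can only be consumed once, whereas an RDD can be recomputed or cached and reused.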

---

* Basic Actions:
    
    * First
    
    * Collect
    
    * Count 
    
    * Take
    
---

* Collect - Return all the elements of the RDD as an array at the driver program.

* Count - Return the number of elements in the RDD

* First - Return first element in the RDD

* Take - Return an array with the first n elements of the RDD
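
As a rough picture of what each action returns, here is a plain-Python sketch on an ordinary list. In PySpark the corresponding RDD methods are `collect()`, `count()`, `first()` and `take(n)`. The `data` list is a hypothetical stand-in for an RDD's contents.

```python
# Hypothetical data standing in for the contents of an RDD.
data = [5, 3, 8, 1, 9]

collected = list(data)   # like collect(): every element, returned to the driver
count = len(data)        # like count(): number of elements
first = data[0]          # like first(): the first element
taken = data[:3]         # like take(3): the first three elements

print(collected, count, first, taken)  # [5, 3, 8, 1, 9] 5 5 [5, 3, 8]
```

Because collect brings every element back to the driver program, take is usually the safer choice for peeking at a large data set.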

--- 

* Basic Transformations

    * Filter
    
    * Map
    
    * FlatMap

---

* RDD.filter()
    
    * Applies a function to each element and returns the elements for which it evaluates to true.
    
![image.png](attachment:image.png)

* RDD.map()

    * Transforms each element and preserves the number of elements, very similar in idea to pandas .apply()

![image.png](attachment:image.png)

* RDD.flatMap()

    * Transforms each element into 0-N elements, so the number of elements can change.

![image.png](attachment:image.png)
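
How each of the three transformations affects the number of elements can be sketched in plain Python. This is an analogy rather than PySpark code: list comprehensions stand in for `filter` and `map`, and `itertools.chain.from_iterable` stands in for the flattening step of `flatMap`. The `values` list is made up for illustration.

```python
from itertools import chain

values = [1, 2, 3, 4]

# filter: keep the elements for which the function returns True
evens = [x for x in values if x % 2 == 0]

# map: exactly one output element per input element
squares = [x * x for x in values]

# flatMap: each element becomes zero or more elements, then results are flattened
repeated = list(chain.from_iterable([x] * (x - 1) for x in values))

print(evens)     # [2, 4]
print(squares)   # [1, 4, 9, 16]
print(repeated)  # [2, 3, 3, 4, 4, 4]
```

Four inputs give two elements after the filter, four after the map, and six after the flatMap (the value 1 contributes none).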

## Map vs FlatMap

* Map()
    
    * Grabbing first letter of a list of names
    
* FlatMap()
    
    * Transforming a corpus of text into a list of words

* We will show many more examples when programming with PySpark. 
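
Until then, the two use cases above can be sketched in plain Python. The `names` and `corpus` values are made up for illustration, and `chain.from_iterable` stands in for the flattening that `flatMap` performs.

```python
from itertools import chain

names = ["Alice", "Bob", "Carol"]              # hypothetical names
corpus = ["big data is big", "spark is fast"]  # hypothetical lines of text

# map: one first letter per name
first_letters = [name[0] for name in names]

# map with split: one list of words per line (nested result)
words_per_line = [line.split() for line in corpus]

# flatMap with split: a single flat list of words
words = list(chain.from_iterable(line.split() for line in corpus))

print(first_letters)   # ['A', 'B', 'C']
print(words_per_line)  # [['big', 'data', 'is', 'big'], ['spark', 'is', 'fast']]
print(words)           # ['big', 'data', 'is', 'big', 'spark', 'is', 'fast']
```

Using map with a split would leave one list per line, which is why flatMap is the natural fit for turning text into words.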

## Pair RDDs

* Often RDDs will be holding their values in tuples 
    
    * (key, value)

* This offers better partitioning of data and enables key-based operations such as reduction.
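
One way to picture why keys help with partitioning is to assign each pair to a partition from a hash of its key, so that all values for the same key land together. The sketch below is a simplified illustration, not Spark's actual partitioner: the CRC32 hash, the partition count and the sample `pairs` are all assumptions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count

# Hypothetical (key, value) pairs
pairs = [("apple", 1), ("pear", 2), ("apple", 3), ("kiwi", 4), ("pear", 5)]


def partition_for(key):
    """Pick a partition from a deterministic hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)
for key, value in pairs:
    partitions[partition_for(key)].append((key, value))

# Every pair with the same key ends up in the same partition.
for index in sorted(partitions):
    print(index, partitions[index])
```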

## Reduce and ReduceByKey

* Reduce()

    * An action that aggregates RDD elements using a function and returns a single element
    
* ReduceByKey()
    
    * A transformation that aggregates the values for each key of a Pair RDD using a function and returns a new Pair RDD
    
* These ideas are similar to a Group By operation
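
The difference between the two can be sketched in plain Python. `functools.reduce` plays the role of `reduce()`, and a small helper plays the role of `reduceByKey()`. The helper name `reduce_by_key` and the sample data are assumptions for illustration, not PySpark API.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]
total = reduce(add, numbers)  # like reduce(): collapses to a single value

# Hypothetical (key, value) pairs
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]


def reduce_by_key(pairs, func):
    """Combine the values of each key with func, returning (key, value) pairs."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())


totals_by_key = reduce_by_key(sales, add)

print(total)          # 10
print(totals_by_key)  # [('east', 17), ('west', 6)]
```

The first call produces one number for the whole collection, while the second produces one (key, value) pair per distinct key, much like a Group By followed by a sum.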

---

* Spark is being continually developed and new releases come out often! 

* The Spark Ecosystem now includes:
    
    * Spark SQL
    
    * Spark DataFrames
    
    * MLlib
    
    * GraphX
    
    * Spark Streaming

---

* Now we've learned enough to get started!

* We're now going to show you how to set up an Amazon Web Services account to get Spark up and running! 

* We'll also have a text article lecture covering some other options in case you don't want to use AWS.
