# Big Data Overview

* Explanation of Hadoop, MapReduce, Spark and PySpark
* Local versus Distributed Systems
* Overview of Hadoop Ecosystem
* Detailed overview of Spark
* Setup on AWS
* Resources on other Spark options
* Jupyter Notebook hands-on code with PySpark and RDDs

___
* What can we do if we have a large data set that a local computer can't handle?
    * Try using a SQL database to move storage onto the hard drive instead of RAM
    * Or use a distributed system that spreads the data across multiple machines
        * one master node controls a network of distributed nodes
        
___

* A local process uses the computational resources of a single machine
* A distributed process has access to the computational resources of a number of machines connected through a network
* After a certain point, it is easier to scale out to many lower-CPU machines than to scale up a single machine with a higher CPU

___

* Distributed systems also scale easily: you can just add more machines
* They also provide fault tolerance: if one machine fails, the rest of the network can carry on
___

## Hadoop

* Hadoop is a way to distribute very large files across multiple machines
    * Name Node (CPU, RAM)
        * Data (Child) Node (CPU, RAM)
        * Data (Child) Node (CPU, RAM)
        * Data (Child) Node (CPU, RAM)
* It uses the Hadoop Distributed File System (**HDFS**)
* HDFS allows a user to work with large data sets
* HDFS also duplicates blocks of data for fault tolerance
* It also uses **MapReduce**, which allows computations on that data
    * Job Tracker (CPU, RAM)
        * Task Tracker (CPU, RAM)
        * Task Tracker (CPU, RAM)
        * Task Tracker (CPU, RAM)

___
* HDFS splits files into blocks, with a default size of 128 MB
* Each block is replicated 3 times by default
* The blocks are distributed in a way to support fault tolerance
* Smaller blocks provide more parallelization during processing
* Multiple copies of a block prevent loss of data due to failure of a node      
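
The block size and replication factor above translate into simple arithmetic. The following is a minimal sketch in plain Python (not an HDFS API); the function name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size quoted above
REPLICATION = 3      # default replication factor quoted above


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)  # the last block may be partly full
    replicas = blocks * REPLICATION                   # copies spread over the data nodes
    raw_storage_mb = file_size_mb * REPLICATION       # only the real bytes are stored
    return blocks, replicas, raw_storage_mb


# Hypothetical 1 GB (1024 MB) file
print(hdfs_footprint(1024))  # (8, 24, 3072)
```

With these defaults a 1 GB file becomes 8 blocks that can be processed in parallel, and losing a single node still leaves two copies of every block.
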
___
* MapReduce is a way of splitting a computation task across a distributed set of files (such as those in HDFS)
* It consists of a Job Tracker and multiple Task Trackers
* The Job Tracker sends code to run on the Task Trackers
* The Task Trackers allocate CPU and memory for the tasks and monitor the task on the worker nodes
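
To make the map and reduce steps concrete, here is a single-machine sketch of a word count in plain Python. It only imitates the idea; it does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample chunks are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical chunks; on a cluster each one would live on a different node.
chunks = ["spark is fast", "hadoop is reliable", "spark is popular"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts["spark"], word_counts["is"])  # 2 3
```

In a real cluster each chunk would be mapped by a different Task Tracker, and only the grouped intermediate pairs would travel over the network.
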
___
* Two distinct parts
    1. HDFS to distribute large data sets
    2. MapReduce to distribute a computational task to a distributed data set
* Spark is the latest technology in this space and improves on the concept of using distribution
___

## Spark

* Abstract overview
    * Spark
    * Spark vs MapReduce
    * RDD Operations
    
___

* Spark is one of the latest technologies being used to quickly and easily handle Big Data
* It's an open source project from Apache
* Open-sourced in 2010 and donated to the Apache Software Foundation in 2013, it has since exploded in popularity due to its ease of use and speed
* It was created at the AMPLab at UC Berkeley
* Think of Spark as a flexible alternative to MapReduce, not Hadoop
* Spark can use data stored in a variety of storage systems
    * AWS S3
    * HDFS
    * Cassandra
    * and more...

___

* Spark vs MapReduce
    * MapReduce requires files to be stored on HDFS, while Spark does not
    * Spark can perform some computations up to 100x faster than MapReduce, mainly for workloads that fit in memory
    * How is Spark faster?
        * MapReduce writes most data to disk after each map and reduce operation
        * Spark keeps most of the data in memory after each transformation
        * Spark can spill over to disk if memory is filled

___

* At the core of Spark is the idea of a Resilient Distributed Dataset (**RDD**)
* **RDD** has 4 main features
    1. Distributed collection of data
    2. Fault-tolerant
    3. Supports parallel operations by partitioning the data
    4. Ability to use many data sources
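
The partitioning idea in point 3 can be sketched on a single machine. This is a plain-Python analogy, not Spark code: the helper `partition` is hypothetical, and a thread pool stands in for the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into num_partitions contiguous, roughly equal slices."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


data = list(range(10))
parts = partition(data, 3)
print(parts)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Each partition is summed independently, then the partial results are combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, parts))
print(partial_sums, sum(partial_sums))  # [6, 15, 24] 45
```
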
___

* Driver Program: SparkContext
    * Cluster Manager
        * Worker Node
            * Executor
                * Task
                * Task
            * Cache
        * Worker Node
            * Executor
                * Task
                * Task
            * Cache
            
___
* RDD Objects : build operator DAG (directed acyclic graph)
    * DAGScheduler: splits graph into stages of tasks; submits each stage as ready
        * TaskScheduler: launches tasks via cluster manager; retries failed or straggling tasks
            * Worker: executes tasks; stores + serves blocks
            
___

* Spark allows developers to create complex multi-step data pipelines structured as a DAG (directed acyclic graph)
* It also supports in-memory data sharing across the DAGs so different jobs can work with the same data
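
As a rough illustration of what "DAG" means here, the sketch below describes a small pipeline as a dependency graph and asks Python's standard `graphlib` module (Python 3.9+) for a valid execution order. The step names are hypothetical, and this is not how Spark's own scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical pipeline).
pipeline = {
    "load_logs": set(),
    "parse": {"load_logs"},
    "errors_only": {"parse"},
    "count_errors": {"errors_only"},
    "count_all": {"parse"},  # a second job reusing the parsed data
}

# Any order that respects the dependencies is valid for an acyclic graph.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Both `count_errors` and `count_all` depend on `parse`, which is the kind of shared intermediate result that benefits from being kept in memory.
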

___

* RDDs are immutable, lazily evaluated, and cacheable
* There are **2 types of RDD operations**
    1. Transformations
    2. Actions
* We'll be coding Transformations and Actions, in Python, on a distributed dataset
* Basic Actions
    * **First** - return the first element in the RDD
    * **Collect** - return all the elements of the RDD as an array at the driver program
    * **Count** - return the number of elements in the RDD
    * **Take** - return an array with the first n elements of the RDD
* Basic Transformations
    * Filter - **RDD.filter()** applies a function to each element and keeps the elements for which it returns true, similar to Python's built-in filter()
    * Map - **RDD.map()** transforms each element and preserves the number of elements, similar to pandas .apply()
    * FlatMap - **RDD.flatMap()** transforms each element into 0-N elements, so the result usually has a different number of elements than the original RDD
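
The difference between transformations (lazy, returning a new RDD) and actions (triggering the work) can be imitated with a toy class. `MiniRDD` below is a hypothetical single-machine stand-in written for illustration; it is not a PySpark class and ignores distribution entirely.

```python
import itertools


class MiniRDD:
    """Toy single-machine stand-in for an RDD (hypothetical, not PySpark)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # never modified: mimics immutability
        self._steps = tuple(steps)    # recorded transformations, not yet run

    def _with_step(self, kind, fn):
        # Every transformation returns a NEW MiniRDD; no data is touched here.
        return MiniRDD(self._source, self._steps + ((kind, fn),))

    def map(self, fn):
        return self._with_step("map", fn)

    def filter(self, fn):
        return self._with_step("filter", fn)

    def flatMap(self, fn):
        return self._with_step("flatMap", fn)

    def _run(self):
        # Only actions call this, so the recorded steps execute lazily.
        items = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = map(fn, items)
            elif kind == "filter":
                items = filter(fn, items)
            else:  # flatMap: flatten the 0-N outputs of each element
                items = itertools.chain.from_iterable(map(fn, items))
        return items

    # Actions: trigger the computation and return plain Python values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


calls = []  # records when the mapping function actually runs


def shout(name):
    calls.append(name)
    return name.upper()


names = MiniRDD(["ann", "bob", "al"])
shouted = names.filter(lambda s: s.startswith("a")).map(shout)
print(calls)              # [] because no action has been called yet
print(shouted.collect())  # ['ANN', 'AL']
print(names.count())      # 3, the original data is unchanged
```
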
    
___

Map() and FlatMap() are easy to confuse; two examples help clarify:
* Map() - grabbing the first letter of each name in a list of names
* FlatMap() - transforming a corpus of text into a list of words
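
The two examples above can be written out in plain Python to show the shape of the results. This is an analogy rather than Spark code; the sample names and corpus are made up, and the comments name the PySpark calls that would play the same role on an RDD.

```python
from itertools import chain

names = ["Alice", "Bob", "Carol"]
# Like rdd.map(lambda name: name[0]): exactly one output per input element.
first_letters = [name[0] for name in names]
print(first_letters)  # ['A', 'B', 'C']

corpus = ["hello big data", "hello spark"]
# Mapping split() keeps one (list) result per line.
mapped = [line.split() for line in corpus]
print(mapped)  # [['hello', 'big', 'data'], ['hello', 'spark']]

# Like rdd.flatMap(lambda line: line.split()): the per-line lists are flattened.
flat_mapped = list(chain.from_iterable(line.split() for line in corpus))
print(flat_mapped)  # ['hello', 'big', 'data', 'hello', 'spark']
```
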

___

**Pair RDDs**
* Often RDDs will be holding their values in tuples (key, value)
* This offers better partitioning of data and leads to functionality based on reduction

**Reduction methods**
* **Reduce()**
    * An action that aggregates RDD elements using a function and returns a single element
* **ReduceByKey()**
    * A transformation that aggregates pair RDD elements by key using a function and returns a new pair RDD
* These ideas are similar to a .groupby() operation
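
A plain-Python sketch of the two reductions, assuming a small made-up list of (key, value) pairs. The helper `reduce_by_key` is hypothetical and only mirrors the behaviour of the corresponding RDD method on one machine.

```python
from functools import reduce
from operator import add

# Hypothetical pair data: (product, units sold)
sales = [("apples", 3), ("pears", 2), ("apples", 4), ("pears", 1)]

# Like reduce(): collapse all values into a single result.
total_units = reduce(add, (units for _, units in sales))
print(total_units)  # 10


def reduce_by_key(pairs, fn):
    """Combine the values of each key with fn, returning sorted (key, value) pairs."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return sorted(result.items())


# Like reduceByKey(): one combined value per key, still in (key, value) form.
units_per_product = reduce_by_key(sales, add)
print(units_per_product)  # [('apples', 7), ('pears', 3)]
```
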
___

Spark is continually developed and new releases come out often.
The Spark ecosystem now includes
* Spark SQL
* Spark DataFrames
* MLlib
* GraphX
* Spark Streaming
    
___

## AWS Setup

* Sign up for the 12-month free tier
* Guide to creating AWS account: https://docs.aws.amazon.com/AmazonSimpleDB/latest/DeveloperGuide/AboutAWSAccounts.html
* We'll be using Amazon Elastic Compute Cloud **EC2** 
    * think of EC2 as a Virtual Machine, in which we'll set everything up
    * we'll be using a t2.micro instance: 750 hours of usage per month are free for 12 months
    * https://aws.amazon.com/ec2/pricing/on-demand/
        * t2.micro \$0.0116 per Hour
    * https://aws.amazon.com/ec2/spot/pricing/
        * t2.micro \$0.0081 per Hour
* Make sure to shut resources down after use
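
To see why shutting resources down matters, the hourly prices listed above can be turned into a monthly figure. This is a small arithmetic sketch only: it uses the prices quoted in these notes (which may be out of date) and assumes 750 billed hours outside the free tier.

```python
from decimal import Decimal

ON_DEMAND_PER_HOUR = Decimal("0.0116")  # t2.micro on-demand price quoted above (USD)
SPOT_PER_HOUR = Decimal("0.0081")       # t2.micro spot price quoted above (USD)
HOURS = 750                             # same size as the monthly free-tier allowance


def monthly_cost(price_per_hour, hours):
    """Cost in USD of running one instance for the given number of hours."""
    return price_per_hour * hours


on_demand = monthly_cost(ON_DEMAND_PER_HOUR, HOURS)
spot = monthly_cost(SPOT_PER_HOUR, HOURS)
print(on_demand, spot)  # 8.7000 6.0750
```
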

___

**EC2 Instance and Spark  Setup**

* **Create EC2 instance on AWS**
    * AWS login > Services > EC2 > Launch instance
        * select Ubuntu Free Tier Eligible > General Purpose t2.micro Free Tier Eligible
        * Next: Configure Details >>
            * Number of instances = 1 (default)
            * leave default settings
        * Next: Add Storage >>
            * 8 GB (default)
            * leave default settings
        * Next: Add Tags >>
            * Add
                * key: myspark
                * value: mymachine
        * Next: Configure Security Group >>
            * Type: All traffic (default is SSH)
            * leave everything else to default settings
            * **NOTE:** For simplicity we leave all ports open when configuring the security group for these instances; these settings should be much stricter in production.
        * Review and Launch >>
        * Launch >>
            * Create a new key pair
                * key pair name: newspark
                * Download Key Pair >>
                    * private key file (.pem)
                    * *Won't be able to download file again after this window, so make sure to download it*
        * Launch Instances >> / Request Spot Instances >>
        * Click on the instance number/ID
            * now we have an Instance ID, a public DNS name (with its IP address), and a .pem file for our VM
            * perform actions
                * Actions > Instance State > Stop / Reboot / Terminate
                * we'll use Terminate once we're done    
* **Use SSH to connect to EC2 over the internet**
    * SSH is different for Windows vs Mac/Linux
        * our goal is to remotely connect to the command line of our virtual machine running on EC2
        * Windows: 
            * follow these steps: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
            * Download PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/
                * putty.exe
                * puttygen.exe
            * need the EC2 instance ID and public DNS name from the AWS EC2 UI
            * and need the .pem file
            * convert the .pem file to a .ppk file using PuTTYgen
                * open **puttygen.exe**
                * Type of Key: RSA or SSH-2 RSA
                * Load >> the .pem file
                * Save private key >>
                    * save to folder with file name puttyspark.ppk
            * start PuTTY session
                * click **putty.exe**
                * Session pane (default)
                    * Hostname: prefix 'ubuntu@' to Public DNS name from EC2 Instance UI
                * Connection pane
                    * SSH > Auth
                        * Browse: .ppk file
                * open >> to start putty session
                    * say yes to security warning
                * now we have a terminal connected to our EC2 Ubuntu VM
                    * can run python etc...
        * Mac/Linux: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
            * simpler on Mac/Linux as SSH is built in and only requires 2 simple commands
        * Secure Shell Connection (SSH)
            * once the session is open (PuTTY on Windows, `ssh` on Mac/Linux), check that Python runs on the VM
                * `python3`
* **Setup Spark and Jupyter on EC2 instance**
    * https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297
    * install Anaconda
        * `wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh`
            * can change the URL to a newer Anaconda version (Python 2 or 3)
        * `bash Anaconda3-4.1.1-Linux-x86_64.sh`
            * enter to go through license
            * 'yes' to continue
            * 'yes' to installer location
    * Check the installation by seeing which python we're using (Anaconda vs Ubuntu built-in)
        * `which python`
    * change to anaconda python
        * `source .bashrc`
        * `which python`
            * should say `/home/ubuntu/anaconda3/bin/python`
        * now we can use `python` in the terminal
    * config file for jupyter
        * `jupyter notebook --generate-config`
    * create certificates for our connections
        * `mkdir certs`
        * `cd certs`
        * `sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem`
            * `US`
            * `CA`
            * `San Jose`
            * `Pieran Data`
            * `Data`
            * `Jose`
            * email leave blank
        * Once in the certs directory, change the permissions on the .pem file with: `sudo chmod 777 mycert.pem`
    * Edit the Jupyter config file so it uses mycert.pem
        * `cd ~/.jupyter/`
        * `vi jupyter_notebook_config.py`
            * press `i` key for insert mode
                * `c = get_config()`
                * `# Notebook config this is where you saved your pem cert`
                * `c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'`
                * `# Run on all IP addresses of your instance`
                * `c.NotebookApp.ip = '*'`
                * `# Don't open browser by default`
                * `c.NotebookApp.open_browser = False`  
                * `# Fix port to 8888`
                * `c.NotebookApp.port = 8888`
            * press `esc` to exit insert mode
            * press `:wq!` then `enter`      
    * Check Jupyter notebook is working
       * `cd` back to home directory
       * `jupyter notebook`
           * should see the jupyter notebook is running on port 8888
       * in chrome browser tab:
           * navigate to `https://` + `EC2 Public DNS address` + `:8888`
           * click through untrusted certificate warnings
       * jupyter notebook instance, connected to the EC2 VM, should appear
       * kill the notebook using `ctrl + c`
    * Install Spark
        * need to first install Java and Scala
            * `sudo apt-get update`
            * `sudo apt-get install default-jre`
            * `java -version`
            * `sudo apt-get install scala`
            * `scala -version`
        * Install Py4j
            * `export PATH=$PATH:$HOME/anaconda3/bin`
            * `conda install pip`
            * `which pip`
                * should say `/home/ubuntu/anaconda3/bin/pip`
            * `pip install py4j`
        * Install Spark and Hadoop
            * `wget http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz`
            * `sudo tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz`
        * Tell python where to find spark
            * `export SPARK_HOME='/home/ubuntu/spark-2.0.0-bin-hadoop2.7'`
            * `export PATH=$SPARK_HOME:$PATH`
            * `export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH`
        * Launch jupyter notebook `jupyter notebook`
            * in browser navigate to `https://` + `EC2 Public DNS address` + `:8888`
            * create a new notebook
                * `from pyspark import SparkContext`
                * `sc = SparkContext()`
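
The `export` lines above only apply to the shell session they were typed in. If the notebook is started from a fresh session, one possible workaround is to set the same locations from inside the notebook before importing `pyspark`. The sketch below is an assumption-laden illustration, not part of the original setup: `add_spark_to_path` is a hypothetical helper, and the `python/lib/py4j-*.zip` pattern assumes the Spark download bundles py4j at that location.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Set SPARK_HOME and put Spark's Python sources on sys.path."""
    os.environ["SPARK_HOME"] = spark_home
    candidates = [os.path.join(spark_home, "python")]
    # Assumption: a bundled py4j archive, if present, lives under python/lib.
    candidates += sorted(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))
    added = []
    for path in candidates:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


# Same location as the SPARK_HOME export above.
added_paths = add_spark_to_path("/home/ubuntu/spark-2.0.0-bin-hadoop2.7")
print(added_paths)
```

After running this cell, `from pyspark import SparkContext` should resolve in the same notebook, provided Spark was unpacked at that path.
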