# Lesson 01 - Introduction to Spark

![Apache Spark](https://drbeane.github.io/files/images/spark_logo.png)

## What is Big Data?

**Big data** is a field that focuses on developing tools and methods for analyzing and extracting information from data sets that are too large or complex to be dealt with using traditional techniques. Big data is often characterized by three properties, known as the "**Three Vs**": **Volume**, **Velocity**, and **Variety**.


* **Volume.** This refers to the amount of data available. An estimated 463 exabytes (that's 463 billion gigabytes) of data were generated every day in 2020. This rate of data generation is constantly increasing. In fact, by one frequently cited estimate, 90% of the world's data has been generated in the last two years alone. As a result, the last several years have seen a dramatic increase in the amount of data available to companies, governments, and other organizations. This has necessitated the development of new methods for retrieving, storing, processing, and working with large amounts of data.

* **Velocity.** This refers to the speed at which data is received, and the speed at which it must be acted on. In some cases, data is collected only occasionally and is warehoused until it is needed. But it is becoming more and more common for businesses to collect data in continuous streams, which must also be processed continuously to allow for real-time decision making. Online retailers and credit card companies must continuously process transaction data in order to identify and react to potential instances of fraud. Security companies must continuously monitor server log data in order to identify and block malicious behavior. Rideshare companies must continuously track the locations and destinations of drivers so that they can efficiently schedule rides.

* **Variety.** This refers to the diversity in the types of data that are now available. Traditionally, there has been an emphasis on working with structured data that can be neatly arranged in rows and columns, and can thus be conveniently stored in a relational database management system (RDBMS). But it is becoming more and more common for organizations to collect and gain insights from unstructured and semi-structured data such as text, audio, and video.

## Distributed Data Analysis

One of the challenges presented by the increased volume of available data is the need to store and analyze datasets that are too large to be stored on or processed by a single machine. **Distributed data analysis** refers to the practice of using a network of connected computers to process data. 

A **cluster** is a group of computers that are able to communicate and share resources. The machines in a cluster work collectively to perform computationally intensive tasks in **parallel** (i.e. simultaneously). Using a cluster allows us to increase the computational power available to us. Individual machines on a cluster are called **nodes**. In a typical cluster, one of the nodes will be designated as the **master** node, while the remaining nodes will be **workers**. The master node is responsible for scheduling tasks and delegating them to the workers. A **cluster manager** is a piece of software that is installed on the nodes of a cluster and that monitors the resources available on each of these machines. The master node uses the cluster manager to request resources from workers and to efficiently schedule tasks.

## Apache Spark

**Apache Spark** is an open-source, unified engine for performing large-scale data processing. Spark allows you to split a data processing task across the machines in a cluster, with each node performing the task on a chunk of the data. Behind the scenes, Spark manages and coordinates the flow of information and execution of tasks involved in using a cluster to process data.
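The idea of splitting a task into chunks and combining the partial results can be illustrated with a minimal pure-Python sketch. This is only an analogy and not Spark code: threads on one machine stand in for the nodes of a cluster, and the function names (`split_into_chunks`, `sum_of_squares`, `distributed_sum_of_squares`) are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into num_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The task that each simulated worker performs on its own chunk."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, num_workers=4):
    """Run the task on every chunk, then combine the partial results."""
    chunks = split_into_chunks(data, num_workers)
    # Threads stand in for worker nodes (an assumption made for illustration).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)


total = distributed_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```

In a real cluster the chunks would live on different machines, and Spark would handle the splitting, scheduling, and combining steps on our behalf.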

### Spark History

The development of Spark was initiated in 2009 by Matei Zaharia, who was at that time a graduate student at UC Berkeley. The goal of the project was to address some of the limitations of Hadoop MapReduce, the most popular framework for distributed computing at that time. Spark was open-sourced in 2010 and was donated to the Apache Software Foundation in 2013, where it became a top-level project in 2014. In July 2016, Apache released Version 2.0 of Spark, which represented a major redesign of the software and introduced many new features. At the time of this writing, the current version of Spark is version 3.0, which was released in June 2020.

### Spark's Design Philosophy

Spark has several characteristics that represent improvements on the systems it was intended to replace. Two of the most important such improvements are its speed and its pursuit of unification.

#### Speed

Spark is often dramatically faster than Hadoop MapReduce at performing computations. One reason for this is that Spark keeps the results of intermediate calculations in memory whenever possible, whereas Hadoop MapReduce frequently writes the results of intermediate calculations to disk. Reading data from and writing data to disk are expensive operations, and Spark is able to achieve increased performance by avoiding them.

Another factor that contributes to Spark's speed is its use of **lazy evaluation**, which is a strategy in which computations are not performed until absolutely necessary. Spark operations are grouped into two types: **transformations** and **actions**. We will discuss these concepts in more detail later, but you can think of transformations as defining intermediate calculations that are not intended to generate a final result. Actions, on the other hand, are operations that explicitly request results to be returned to the user or to be written to disk. Spark does not perform transformations as they are received. Instead, Spark records them in a structure called a **directed acyclic graph**, or **DAG**. The transformations composing a DAG are not actually executed until an action is called that requires the results of these transformations. This strategy allows Spark to review the DAG prior to performing an action to see if the requested transformations can be optimized in a way that would allow for more efficient calculation.
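To make the distinction between transformations and actions concrete, here is a toy pure-Python sketch of lazy evaluation. It is not Spark's API: `LazyDataset` is a hypothetical class written for this example, its recorded plan is a simple linear chain rather than a general DAG, and it performs no optimization. The method names `map`, `filter`, and `collect` are chosen to mirror the kinds of operations discussed above.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, data, plan=()):
        self._data = list(data)
        self._plan = tuple(plan)  # recorded transformations, not yet executed

    # Transformations: record a step and return a new dataset.
    def map(self, func):
        return LazyDataset(self._data, self._plan + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._plan + (("filter", predicate),))

    # Action: run the recorded plan and return the results.
    def collect(self):
        results = self._data
        for kind, func in self._plan:
            if kind == "map":
                results = [func(x) for x in results]
            else:
                results = [x for x in results if func(x)]
        return results


calls = []


def square(x):
    calls.append(x)  # record each time the function actually runs
    return x * x


dataset = LazyDataset([1, 2, 3, 4, 5])
pipeline = dataset.map(square).filter(lambda x: x % 2 == 1)

print(len(calls))          # 0: the transformations were only recorded
print(pipeline.collect())  # [1, 9, 25]
print(len(calls))          # 5: the action triggered the computation
```

Because the whole plan is known before anything runs, a real engine has the opportunity to reorder or combine steps before executing them, which this sketch does not attempt.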

#### Unified Analytics

Hadoop MapReduce was not originally designed to perform tasks relating to machine learning or processing streaming data. As a result, developers created additional systems to handle these types of workloads. These systems came with distinct APIs and approaches, and did not always work well together. One of the goals of Spark was to provide a unified approach to performing a variety of tasks that previously had to be handled using separate frameworks. Spark achieves this unification by providing a core set of functionality along with a set of additional modules that are designed to handle a diverse set of specific use cases.

#### Focus on Computation

Apache Spark is a computational engine, and not a storage solution. Spark can pull data from a wide range of storage systems, including (but not limited to) [Amazon S3](https://aws.amazon.com/free/storage/s3), [Hadoop HDFS](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html), [Apache Cassandra](https://cassandra.apache.org/), and [Apache Kafka](https://kafka.apache.org/). Apache Spark also supports multiple cluster managers, including [Apache Mesos](http://mesos.apache.org/), [Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), and Spark's own Standalone cluster manager. Spark was designed to focus on optimizing computations, and to work with whatever storage system and cluster manager an organization already has in place.

### Spark Components

The diagram below provides an overview of the primary components that are available in Spark. 

![Apache Spark](https://drbeane.github.io/files/images/spark_components.png)


The components of Spark are as follows:

* **Spark Core** is the foundation on which the rest of Spark is built. Spark Core provides the Resilient Distributed Dataset (RDD) data type, which is the most basic type of container for data in Spark.

* **Spark SQL** provides advanced tools for working with structured data. It provides the DataFrame data type, which allows for convenient representation of data in a tabular format. Spark SQL also provides an API for issuing SQL commands, allowing Spark to interface with relational databases.

* **MLlib** is Spark's machine learning library. It will allow us to distribute the training of a machine learning model over a cluster. 

* **Spark Streaming** provides tools for analyzing real-time streaming data. 

* **GraphX** provides tools for processing network or graph data.

### Drivers and Executors

A Spark application consists of a single **driver** process and some number of **executor** processes. The driver process is essentially the brain of the application. It is responsible for maintaining information about the application, processing user input, and scheduling tasks to be performed by the executors. The executor processes are responsible for performing tasks and reporting the results back to the driver.

![Spark Architecture](https://drbeane.github.io/files/images/spark_architecture.png)

Note that drivers and executors do not necessarily correspond to individual machines in the cluster. A given node might host the driver as well as several executors. In fact, all of these processes might be hosted on a single computer. Each executor might be associated with a single core on a node, or executors might each have access to multiple cores. 

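To summarize the terminology:

* **Worker.** A machine (node) in the cluster that carries out the work.
* **Executor.** A software process that runs on a worker machine and performs tasks.
* **Driver.** The process that coordinates the application. It is often, but not necessarily, hosted on the master machine.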

## Spark Language APIs

Spark is written in Scala, but APIs exist to allow Spark commands to be issued from Java, Scala, Python, and R. We will work primarily in Python in this course. The Spark Python API is called **PySpark**.