### Big Data
* Datasets that are so large or so complex that traditional data processing software or simple code is inadequate to deal with them.
* Not only data, but also the infrastructure that is needed to support such analyses.
* Four Vs
    * Volume - "Scale of Data" - As a loose rule of thumb, anything bigger than about 1 GB is often considered to be 'BIG'
    * Velocity - "Speed of Data"
    * Variety - "Diversity of Data" - Data can be dense or sparse, time-dependent or not, and of many different types.
    * Veracity - "Certainty of Data" - Is the data trustworthy and valid?
    
Always use data processing tools with discretion, e.g. if your dataset is only 0.1 MB, you should probably stick to something simple like a CSV file.

Searching is typically spread across many machines.

To store, process, and retrieve information from large and complex datasets, you almost always need more than one computer (or several relatively small servers) to handle the data. Once data is spread across potentially many machines, you need tools that abstract away the management and workflow of using multiple machines.

### Types of data you will find in the wild
* Structured Data - Makes most sense for relational databases, e.g. database tables or Pandas DataFrames
* Unstructured Data - Images, Audio, video, GeoSpatial, etc.

### 3 Major domains of big data
1. Computations
2. Infrastructure
3. Data Storage

### Processing Data "Computations use cases for big data"
Generalized Data Processing - Being fed data and running a computation over it.
Examples: Hadoop MapReduce and Apache Spark. Spark is often run on top of Hadoop (YARN and HDFS), but it can also run on its own.

Spark - Distributed data processing framework
* Based on resilient distributed datasets
RDD - A huge collection of objects, split into partitions and distributed across many computers because it is too big for one machine's memory. "Resilient" means a lost partition can be recomputed.
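
As a rough mental model only (this is plain Python, not Spark, and not how Spark is implemented), an RDD can be pictured as one big list cut into partitions that are processed independently. The helper name `split_into_partitions` and the partition count of 3 are made up for illustration.

```python
def split_into_partitions(items, num_partitions):
    """Cut a list into roughly equal contiguous chunks (hypothetical helper)."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional item.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


data = list(range(10))
# Pretend each chunk lives on its own machine.
partitions = split_into_partitions(data, 3)

# Each "machine" computes a partial result on its own partition...
partial_sums = [sum(p) for p in partitions]
# ...and the partial results are combined into the final answer.
total = sum(partial_sums)
```

The point of the sketch is that the final answer only needs the small partial results, never the whole list in one place.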

* Spark gives you shared variables (broadcast variables and accumulators)

Functional tools in Python that are useful for Spark (apply a function to a list, and return another list or a single value)
* lambda
    * Anonymous functions, e.g. lambda x: x + 1
* map
    * Applies a function to each item in a list, and returns the results (like apply in Pandas). In Python 3, map returns a lazy iterator, so wrap it in list() to see the values.
    * list(map(add1, [1, 2, 3])) => [2, 3, 4]
* filter
    * Keeps only the items for which a function returns True. Also lazy in Python 3.
    * list(filter(isOdd, [1, 2, 3, 4])) => [1, 3]
* reduce
    * Combines the elements of a list pairwise, from left to right, and returns ONE value, not a list. In Python 3 it lives in functools.
    * reduce(add, [1, 2, 3, 4]) => 10 "(((1 + 2) + 3) + 4)"
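* itertools
    * chain
        * Concatenates several iterables into one.
    * FlatMap (not an itertools function, but a pattern built from map and chain)
        * Maps each item to a list, then flattens the resulting list of lists, e.g. [[1, 2, 3, 4], [7, 8, 9]]
        * from itertools import chain
        * list(chain.from_iterable(map(lambda t: range(t[0], t[1]), [(1, 5), (7, 10)]))) => [1, 2, 3, 4, 7, 8, 9] (range excludes its end value)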
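
The tools listed above can be tried in plain Python 3 without Spark. This is a minimal sketch: `add1` and `isOdd` are not defined in these notes, so the definitions below are assumptions, and `add` is taken from the standard `operator` module.

```python
from functools import reduce      # reduce is not a builtin in Python 3
from itertools import chain
from operator import add


def add1(x):
    return x + 1


def isOdd(x):
    return x % 2 == 1


# map and filter are lazy in Python 3, so list() forces the results.
mapped = list(map(add1, [1, 2, 3]))
filtered = list(filter(isOdd, [1, 2, 3, 4]))

# reduce folds the list from the left into a single value.
total = reduce(add, [1, 2, 3, 4])

# "flatMap": map every (start, stop) pair to a list, then flatten.
pairs = [(1, 5), (7, 10)]          # range() excludes the stop value
nested = [list(range(start, stop)) for start, stop in pairs]
flat = list(chain.from_iterable(nested))
```

Here `mapped` is `[2, 3, 4]`, `filtered` is `[1, 3]`, `total` is `10`, and `flat` is `[1, 2, 3, 4, 7, 8, 9]`. Note that `chain.from_iterable` is needed for flattening; calling `chain(...)` on a single list of lists just yields the inner lists unchanged.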
        
### In the Spark Shell "pyspark"
* sc.parallelize => Takes a list of elements and hands it to Spark, which returns an RDD
* Calling sc.parallelize doesn't run any computation; it just tells Spark "if I ever need to run this, I will run it this way".

We divide RDD methods into two kinds:
1. Transformations (map, filter, flatMap)
    * Return another RDD
    * Are not actually performed until an action is called (lazy)
2. Actions (reduce, take, collect, sum)
    * Return a value other than an RDD
    * Are performed immediately

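Example:
RDD1 = sc.parallelize(range(1, 1000))
RDD2 = RDD1.map(lambda x: x + 10)  # transformation: nothing is computed yet
RDD2.take(5)  # action: triggers the computation => [11, 12, 13, 14, 15]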
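
The lazy/eager split can be imitated in plain Python, because the built-in `map` is also lazy. This is only an analogy, not Spark code: the `calls` list and the `add10` function are made up here to make it visible when the work really happens.

```python
calls = []  # records every time the function actually runs


def add10(x):
    calls.append(x)
    return x + 10


numbers = range(1, 1000)

# Like a transformation: this only describes the work, nothing runs yet.
lazy = map(add10, numbers)
calls_before = len(calls)

# Like the action take(3): pulling results forces just enough computation.
first_three = [next(lazy) for _ in range(3)]
calls_after = len(calls)
```

`calls_before` is 0 and `calls_after` is 3: the function did not run when the pipeline was defined, and only ran for the three items that were actually requested.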

### Reading Text Files
sc.textFile(name, minPartitions=None, use_unicode=True)
* Returns an RDD of strings (1 per line)
* Can read many files using wildcard, *
* Can read from hdfs, ....
Example: people = sc.textFile("../data/people.txt")

### Group By Functions
1. Read in a file, let's say one structured as "Name | Gender | Age", and split each line into its fields
2. Use reduceByKey to group by a key and combine the values that share that key
Example: people.map(lambda line: line.split("|")).map(lambda t: (t[1].strip(), 1)).reduceByKey(lambda x, y: x + y).collect() => e.g. [("M", 3), ("F", 5)]
3. You are grouping by the second column, t[1] (Gender), and summing up the count of each one.
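
What the group-by steps compute can be sketched in plain Python. This is not Spark's implementation: `reduce_by_key` is a hypothetical helper that mimics the result of reduceByKey on one machine, and the three sample lines are invented to match the "Name | Gender | Age" layout.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


# Invented sample lines in the "Name | Gender | Age" layout.
lines = [
    "Ann | F | 34",
    "Bob | M | 45",
    "Cat | F | 29",
]

# Step 1: split each line into its fields and trim the spaces.
fields = [[part.strip() for part in line.split("|")] for line in lines]
# Step 2: emit (gender, 1) for every person.
pairs = [(t[1], 1) for t in fields]
# Step 3: add up the 1s per gender.
counts = reduce_by_key(pairs, lambda x, y: x + y)
```

For this sample, `counts` is `{"F": 2, "M": 1}`. The function passed in only ever sees two values at a time, which is what lets Spark combine partial counts from different machines.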


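The Spark engine itself is written in Scala and runs on the JVM. In PySpark, the Python functions you pass to operations like map and reduce are shipped to Python interpreter processes on the workers and executed there.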

### Create objects in Python
1. Load the .py file that has your class in it, and import the class
from something import Person
2. people = sc.textFile("../../data/sales/sales.txt").map(lambda x: Person().parse(x))
3. This gives an RDD of Person objects (assuming parse returns the Person it filled in)
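
The notes do not show the Person class, so the following is a guess at its shape. The field names, the "|" delimiter, and the sample lines are all assumptions; the one firm requirement is that `parse` returns the object, otherwise `Person().parse(x)` inside map would produce None for every line.

```python
class Person:
    """Hypothetical record class; fields and delimiter are assumptions."""

    def __init__(self):
        self.name = None
        self.gender = None
        self.age = None

    def parse(self, line):
        # Split "Name | Gender | Age" and trim the spaces around each field.
        name, gender, age = [part.strip() for part in line.split("|")]
        self.name = name
        self.gender = gender
        self.age = int(age)
        return self  # returning self lets Person().parse(x) be used in map


# Plain-Python stand-in for the RDD of lines.
lines = ["Ann | F | 34", "Bob | M | 45"]
people = list(map(lambda x: Person().parse(x), lines))
```

With Spark, the same lambda would be passed to the RDD's map instead of the built-in one.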

### Joins 
* You can use join to combine two RDDs of (key, value) pairs on their keys

states = [("AK", "Alaska"), ("AL", "Alabama"), ("AZ", "Arizona")]
populations = [("AK", 100000), ("AL", 90000), ("AZ", 345000)]

states_rdd = sc.parallelize(states)
populations_rdd = sc.parallelize(populations)

states_rdd.join(populations_rdd)  # pairs up values that share a key, e.g. ("AK", ("Alaska", 100000))
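
To see the shape of what a join produces, here is a plain-Python sketch of an inner join on the same data. `inner_join` is a hypothetical single-machine helper, not Spark's implementation, and the population figures are the placeholder values from these notes.

```python
def inner_join(left, right):
    """Pair up values from two lists of (key, value) tuples that share a key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


states = [("AK", "Alaska"), ("AL", "Alabama"), ("AZ", "Arizona")]
populations = [("AK", 100000), ("AL", 90000), ("AZ", 345000)]

joined = inner_join(states, populations)

# Keys present on only one side are dropped by an inner join.
partial = inner_join(states[:2], populations)
```

Each result has the form (key, (left value, right value)), for example `("AK", ("Alaska", 100000))`. In `partial`, "AZ" is missing because it has no match on the left side.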

## Writing a Spark Application

## Spark for Machine Learning

When we write a Spark program, we write a driver program. We describe to Spark the sequence of operations we want applied to our data, and the driver program then orchestrates what happens on the worker nodes.

### Create DataFrame
img here

### Specify Feature Extraction
img here

### Model Evaluation
img here

SIDENOTE: A good way to reduce overfitting with decision trees is to use random forests. Decision trees have a tendency to fit the training set very specifically, but perform badly on unseen data.