# Fundamentals of Big Data
### Notes, Introduction to Big Data
---

## What is a Traditional Data Problem?

Traditional problems are solved with business intelligence:

* skills
    * querying
    * reporting
    * exploration
* relational databases
    * SQL
* historical questions
    * SELECT AVG(profit) FROM results
    * facts 
        * if the data is correct, the answers are correct
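
A minimal sketch of such a historical question, using Python's built-in `sqlite3` module. The `results` table and its three rows are made-up stand-ins, not data from the notes:

```python
import sqlite3

# Hypothetical in-memory table standing in for a business "results" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (quarter TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("Q1", 100.0), ("Q2", 150.0), ("Q3", 50.0)],
)

# The historical question from the notes: what was the average profit?
(avg_profit,) = conn.execute("SELECT AVG(profit) FROM results").fetchone()
print(avg_profit)
conn.close()
```

The answer is a fact about the stored rows: if the rows are correct, so is the average.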

## What is a Big Data problem?

Definition: a non-traditional data problem...

> Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by **traditional data-processing application software**. 

> Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.

> Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy.


- Wikipedia, Big Data

## What new challenges brought about this field?

* real-time analysis
* complex data
* 

## What situations require real-time analysis?

* responsive interaction with customers
* IOT devices working with sensor data
* ... 

## E-commerce Example
* imagine a web store (eg., Amazon)
* the website wishes to monitor customer behaviour (eg., adding items to cart, leaving the website without purchasing, etc.)
    * monitor these in REAL TIME
* respond to the "customer behaviour signals"
    * with some intelligent/adaptive response
    * offering discounts, alerting user, drawing user attention, ... 
    
    
### The Challenges
* each customer generates a significant amount of data
    * eg., mouse behaviour, click events, leaving & joining... 
* thousands of customers using the web store *at the same time*
* analysis has to be fast, and responsive *in real time*
    * vs., a "traditional problem" which can be left to batch / over-night processing
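
A rough sketch of the difference from batch processing: events are handled one at a time as they arrive, with a small amount of per-customer state. The event names (`add_to_cart`, `purchase`, `leave`) and the `offer_discount` response are hypothetical, chosen only to mirror the example above:

```python
from collections import defaultdict


def respond_to_events(events):
    """Yield (customer, response) pairs as events arrive, one at a time."""
    cart_sizes = defaultdict(int)  # per-customer running state
    for customer, event in events:
        if event == "add_to_cart":
            cart_sizes[customer] += 1
        elif event == "purchase":
            cart_sizes[customer] = 0
        elif event == "leave" and cart_sizes[customer] > 0:
            # Behaviour signal: leaving with a non-empty cart.
            yield (customer, "offer_discount")


# A tiny stand-in for a live event stream.
stream = [
    ("alice", "add_to_cart"),
    ("bob", "add_to_cart"),
    ("bob", "purchase"),
    ("alice", "leave"),
    ("bob", "leave"),
]
responses = list(respond_to_events(stream))
print(responses)
```

With thousands of concurrent customers, the same loop has to keep up with the incoming event rate, which is where a single machine starts to struggle.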
    
    

## What types of data are more complex than "traditional" rows?
* increased data complexity
* images
* text
    * speech
* audio


## Satellite Image Analysis for Government & Charity Resource Tracking
* Suppose we distribute resources via trucks/cars/etc. in a developing nation
    * can we track the efficient/safe/reliable distribution of these?
    * also, eg., George Clooney: satellite image analysis for early-warning of conflicts
    
### Challenges
* a single satellite image is GBs of data
* many images are needed to reconstruct a country/road-system
    * more than 10 TB of data *per view*
* processing requires looking at a "view of the country" each day/week/etc. 

## What are The Three Vs?

How can we decide when a problem is non-traditional? Are there any empirical metrics?

* Velocity
    * real-time
    * problems "quickly" become non-traditional
        * traditional solutions *do not scale well*
    * response required < 1s
    * ratio = traditional-query-time / response-time
    * is this close-to or more-than 1?
        * eg., 60 seconds / 120 seconds = 0.5
        * eg., 60 seconds / 30 seconds = 2
        
* Variety
    * data complexity 
    * problems "quickly" become non-traditional
        * traditional solutions *do not scale well*
    * independent dimensions in the dataset
        * independent attributes of the data
    * repeated information:
        * (lat, long), (address), (geotag)
    * large amounts of *relevant* non-repeated information
    * image: 1 MP is 1,000,000 pixels
        * even a small image is 1-million dimensions
    * if you're considering >20 *relevant*, *independent* attributes, you may have a big-data problem
    
    
* Volume
    * traditional solutions *scale well*
    * RAM
        * how much memory does a query need?
        * ratio = query-need / machine-amount
            * of the best plausible single machine
        * eg., 100 GB / 512 GB
            * not a big data problem
        * eg., 1 TB / 512 GB
            * could be a big data problem
            * may also be possible to optimize the query 
    * Storage
        * how big is the entire dataset?
        * ratio = dataset-size / machine-amount
        * eg., 10 TB / 16 TB
            * not a big data problem
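
The ratio checks above can be written out as a few lines of arithmetic. This is only a sketch: the helper names are made up, 1 TB is taken as 1000 GB, and "close-to or more-than 1" is simplified to a hard threshold of 1.0:

```python
def ratio(need, available):
    """How much the problem needs, relative to what is available."""
    return need / available


def may_be_big_data(r, threshold=1.0):
    # Assumption: "close-to or more-than 1" is simplified to r >= threshold.
    return r >= threshold


# Velocity: traditional query time / required response time (seconds).
velocity = [ratio(60, 120), ratio(60, 30)]
# Volume (RAM): query need / best plausible single machine (GB).
ram = [ratio(100, 512), ratio(1000, 512)]
# Volume (storage): dataset size / single-machine disk (TB).
storage = [ratio(10, 16)]

for name, values in [("velocity", velocity), ("ram", ram), ("storage", storage)]:
    print(name, [(round(v, 2), may_be_big_data(v)) for v in values])
```

A ratio near 1 is a prompt to look harder (eg., optimize the query first), not a verdict.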
            
            
## Where is the boundary between big-data and traditional?

Not fixed. In 2010, a 100 GB query may have been big data; now it *certainly isn't*. 


## What other `V`s are also mentioned?
* Veracity
* Value
* ...

## What is Big Data for an Analyst?

'Big Data Analysis' connotes prediction, explanation, non-traditional data models: graphs, documents, ... 

## What is Big Data for an Engineer?

Installing, performance, tools, tradeoffs (local single-machine vs. data centres), CAP theorem, configuration

## What are these non-traditional data models?

The traditional data model is *the table*: rows, columns, ...

Non-Traditional:

* key-value pairs
    * loose structure
* documents
    * hierarchy
* graphs
    * order
* column stores
    * 1000s of columns
* schemaless
    * unstructured data
* exotic:
    * time series dbs
    * matrices, linear algebra, ...

Often with big data problems, datasets will be stored in a non-tabular form -- even if, by the end, we have a table. 

## What is NoSQL?

Originally, "not SQL"; now, "not only SQL".

Data systems and data models (ie., structures) which are non-tabular and are not "natively" SQL-based. 

## How do I understand data models & queries?

The data model describes how the data is structured. The query brings together datasets, reshapes and summarises them; ie., the query structures and processes. 


There's a tradeoff: you can either structure data *specific* to a need, which makes the query simpler & faster; or make the structure general, in which case querying takes longer. 

Relational databases propose a universal data structure (ie., a table); but this *can* have negative performance implications if your problem requires: 
* order, 
* sparsity, 
* hierarchy, 
* extensive re-structuring. 
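
A small sketch of the structure-vs-query tradeoff. The order records and function names are hypothetical; the point is only that a structure built for one need turns the query into a lookup, while a general structure makes the query do the work:

```python
from collections import defaultdict

# Hypothetical order records: (customer, amount).
orders = [("alice", 30), ("bob", 20), ("alice", 5)]


# General structure: a flat list. The query scans every record.
def total_by_scan(customer):
    return sum(amount for name, amount in orders if name == customer)


# Specific structure: totals keyed by customer, built once for this one need.
totals = defaultdict(int)
for name, amount in orders:
    totals[name] += amount


def total_by_lookup(customer):
    return totals.get(customer, 0)


print(total_by_scan("alice"), total_by_lookup("alice"))
```

The keyed structure answers this one question quickly but is of little use for a different question (eg., orders by date), which is the cost of being specific.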

## What's a table?
* { (id, age, temp), ... }
* { (1, 20, 19), (2, 20, 30), ...}
* table = set of rows
* row = tuple of values

## What are Key-Value pairs?
* a single tagged value
    * {"id": 1}
* each pair bundles a single piece of data with a label
    * a dataset is a bag of such pairs
        * me:   {"id": 1}, {"age": 20}, {"temp": 19}
        * you:  {"id": 2}, {"age": 20}, {"temp": 30}
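
A sketch of the "bag of pairs" idea in plain Python. The `lookup` helper and the extra `nickname` pair are made up for illustration; the loose structure shows up in the fact that the two bags need not share the same keys:

```python
# Each dataset is a bag (here, a list) of single key-value pairs.
me = [{"id": 1}, {"age": 20}, {"temp": 19}]
you = [{"id": 2}, {"age": 20}, {"temp": 30}, {"nickname": "Y"}]  # extra key is fine


def lookup(bag, key):
    """Return the value tagged with `key`, or None if the bag has no such pair."""
    for pair in bag:
        if key in pair:
            return pair[key]
    return None


print(lookup(me, "temp"), lookup(you, "nickname"), lookup(me, "nickname"))
```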

## What are Documents?
* documents give you hierarchy
* using key-values in a more structured way
* collection of key-value pairs
    * where each *value* can itself be such a collection
* this enables modelling *hierarchical* data
    * in tabular data, *querying* creates the hierarchy
    * here, we *store* the data in a hierarchy
    
```
credit_file = {
    "name" : "Michael",
    "address": [("London", "UK"), ...],
    "loan" : {
        "amount": 1000,
        "date": "1/1/1900"
    }
}
```
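
Because the hierarchy is stored, a query can simply walk it. A minimal sketch follows; `get_path` and its dotted-path syntax are hypothetical helpers, not the API of any document database:

```python
credit_file = {
    "name": "Michael",
    "address": [("London", "UK")],
    "loan": {"amount": 1000, "date": "1/1/1900"},
}


def get_path(doc, path):
    """Follow a dotted path such as 'loan.amount' down through nested dicts."""
    value = doc
    for key in path.split("."):
        value = value[key]
    return value


print(get_path(credit_file, "loan.amount"))
```

In a tabular design the loan would typically live in its own table and be joined back at query time.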


## What are Graphs?
* graphs give you ordering between datasets ("rows")
* data is stored in an order
    * eg., consider storing the graph below as
    * eg., $\{(Alice, (Michael, Bob)),  (Michael, ()), (Bob, (Alice)), (Eve, (Bob))\}$
        * ie., $\{(Node, Neighbors)\}$, where Neighbors are the nodes that like Node
* tables do not order data
    * querying has to impose order
    * therefore querying is more efficient if data is stored in an order
    
```
Michael -LIKES-> Alice <-LIKES-> Bob -LIKES-> Eve
```

* who likes Alice?
    * {Michael, Bob}
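
A sketch contrasting the two storage choices, using the $\{(Node, Neighbors)\}$ lists from the notes. The function names are made up; the table version is derived from the same lists so both hold identical data:

```python
# Neighbour lists copied from the notes: node -> the nodes that like it.
liked_by = {
    "Alice": ("Michael", "Bob"),
    "Michael": (),
    "Bob": ("Alice",),
    "Eve": ("Bob",),
}


def who_likes(name):
    """Graph-style query: one lookup, because the links are stored with the node."""
    return set(liked_by[name])


# The same data as an unordered table of (liker, liked) rows.
likes_table = {
    (liker, liked) for liked, likers in liked_by.items() for liker in likers
}


def who_likes_by_scan(name):
    """Table-style query: scan every row to rebuild the links."""
    return {liker for liker, liked in likes_table if liked == name}


print(who_likes("Alice") == who_likes_by_scan("Alice"))
```

Both give the same answer; the difference is how much of the dataset each query has to touch.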

## What's a Column-Store?
* a table stored in column-form

* compare with: { (id, age, temp), ... }
* { (1,2), (20, 20), (19, 30), ...}

* table = set of columns
* querying selects a *subset of columns*
    * so choosing, say, 20 / 2000 columns is *efficient* 
* eg., consider an image: that could be a 1-million-column row
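
A sketch of the row-form vs column-form layouts, using the two example rows from the notes. Real column stores add compression and on-disk layout on top of this idea; the dict of lists here is only an illustration:

```python
names = ("id", "age", "temp")

# Row form: table = set of rows (a list here, to keep the order stable).
rows = [(1, 20, 19), (2, 20, 30)]

# Column form: table = set of columns, one list of values per column name.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}
print(columns)

# A query over one column touches only that column's values.
mean_temp = sum(columns["temp"]) / len(columns["temp"])
print(mean_temp)
```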


## What's a schemaless db / data store?

Consider images: it is not efficient to store a 1-million-column image as a row. 

Rather, leave it on the disk as a binary image file *and while querying* convert to a tabular (ie., a matrix) form. 


This suggests a need for a storage system which imposes no consistent schema across datasets (ie., files); and leaves the query to do *all* the processing, including even imposing some basic structure. 

This is just a file system!

A schemaless data store is then *just a file system*... a big data version will provide a file system across 100s+ machines. 
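
A single-machine sketch of "the query imposes the structure". The file name, the 2x3 size and the one-byte-per-pixel format are all made up for illustration; a real image format would need a proper decoder:

```python
import tempfile
from pathlib import Path


def load_as_matrix(path, width):
    """Impose structure at query time: raw bytes -> rows of pixel values."""
    raw = path.read_bytes()
    return [list(raw[i:i + width]) for i in range(0, len(raw), width)]


with tempfile.TemporaryDirectory() as folder:
    # The "store" is just a directory; it knows nothing about the file's contents.
    image_path = Path(folder) / "tiny_image.bin"  # hypothetical 2x3 greyscale image
    image_path.write_bytes(bytes([0, 10, 20, 30, 40, 50]))

    # Only the query decides that these bytes form a matrix 3 pixels wide.
    matrix = load_as_matrix(image_path, width=3)

print(matrix)
```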

## What are the important examples of Big Data tools & techniques?

* schemaless ("object store")
    * Hadoop (HDFS, on-premise)
    * Amazon S3
    * Microsoft Azure Blob Storage
    * Google Cloud Storage
* graph
    * Neo4j
* key-value
    * Redis
* document
    * MongoDB
    * ...
* columnar
    * Cassandra
    * HBase