# Big Data Intro

**Big Data** - the approach used when traditional data mining and database tools cannot uncover the insights and meaning in the underlying data, typically because the data is:
- unstructured
- time sensitive
- streaming
- too large to be processed by traditional database engines

Big data systems handle this with massive parallelism on readily available hardware.

**Big Data Topics**
- Analytics
- Architecture
- High Performance Computing
- Streaming Data
- Visualization

## Three V's of Big Data
**Volume, Variety, Velocity**

**Volume** - Ranges from terabytes to petabytes of data

**Variety** - Includes data from a variety of sources and formats
- Web logs
- Social media interactions
- Ecommerce transactions
- Financial transactions

**Velocity** - The speed at which data is generated and must be collected and processed (for example, streaming or time-sensitive data)

# Big Data Architecture
A Data Architect aligns the data, software, hardware, networks, cloud services, developers, testers, sysadmins, DBAs, and other components of IT infrastructure.

**Hadoop Ecosystem**
- HDFS: Distributed file system
- YARN: Resource management and job scheduling for distributed processing
- ZooKeeper: Coordination
- Flume: Log Collector
- Sqoop: Data Exchange
- Oozie: Workflow
- Pig: Scripting
- Mahout: Machine Learning
- R Connections: Statistics
- Hive: SQL Query
- HBase: Columnar Store
- Ambari: Provisioning, Managing and Monitoring Hadoop Clusters

# Big Data and HPC
High Performance Computing (HPC) allows scientists and engineers to solve complex, compute-intensive problems. HPC applications require:
- high network performance
- fast storage
- large amounts of memory
- high compute capabilities

# Big Data and Visualization
Data visualization is the process of displaying data and information in graphical charts and figures. Its purpose is to report findings to others visually; to do this effectively, you need to pick the right chart or figure for the question being asked.
- KPIs: A single value that reflects how you're doing in a particular area. Use KPI charts.
- Relationships: Establish whether a relationship exists between two or more variables. Use a scatter or bubble chart.
- Comparisons: Show or examine how variables change. Use a bar chart, table, column chart, or line chart.
- Distributions: Depict how your data is distributed over certain intervals. Use a column histogram or scatter chart.
- Compositions: Highlight the elements that make up your data, whether the data is static or changing over time. Use a stacked pie chart, tree map, or stacked chart.
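
As a rough illustration of the "Distributions" case, the binning step behind a histogram can be sketched with the standard library alone. The function names, the sample values, and the bin width of 5 are all hypothetical; a real report would hand the counts to a plotting library instead of printing text bars.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each interval [start, start + bin_width)."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))

def render_histogram(counts, bin_width):
    """Draw one text bar per interval (a stand-in for a column histogram)."""
    lines = []
    for start, count in counts.items():
        lines.append(f"[{start}, {start + bin_width}) {'#' * count}")
    return "\n".join(lines)

# Hypothetical sample data: response times in milliseconds.
response_times_ms = [1, 2, 2, 7, 8, 12]
bins = histogram_counts(response_times_ms, bin_width=5)
print(render_histogram(bins, bin_width=5))
```

Each bar's length is the number of values in that interval, which is the same information a column histogram encodes as column height.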

# Big Data and Streaming
Streaming is the process of continuously transferring data from a sender to a recipient over a network path. Before working with streaming data, understand the difference between batch processing and stream processing.

## Batch Processing
Batch processing computes arbitrary queries over different sets of data, deriving results from all the data it encompasses. MapReduce is an example of batch processing.
- Data Scope: Queries or processing over all or most of the data in a dataset
- Data Size: Large batches of data
- Performance: Latencies in minutes to hours
- Analysis: Complex Analytics
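
The batch model above can be sketched as a MapReduce-style word count in plain Python. This is a single-process illustration only, not how Hadoop runs a job: the phase functions and the sample log lines are hypothetical, and a real job would distribute the map and reduce phases across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def batch_word_count(records):
    """Run the whole job over the complete dataset in one pass."""
    mapped = chain.from_iterable(map_phase(record) for record in records)
    return reduce_phase(shuffle_phase(mapped))

# Hypothetical sample batch; a real job would read files from HDFS or similar.
logs = ["GET /home", "GET /cart", "POST /cart"]
counts = batch_word_count(logs)
print(counts)
```

The result is only available once every record has been processed, which is why batch latencies are measured in minutes to hours on large datasets.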

## Stream Processing
Stream processing requires ingesting a sequence of data and incrementally updating metrics, reports, and summary statistics in response to each arriving data record. It is better suited for real-time monitoring.
- Data Scope: Queries or processing over data within a rolling time window, or on just the most recent data
- Data Size: Individual records or micro batches consisting of a few records
- Performance: Requires latency on the order of seconds or milliseconds
- Analysis: Simple response functions, aggregates, and rolling metrics
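
A rolling metric of the kind described above can be sketched as follows. The class name, the 10-second window, and the sample readings are hypothetical, and the sketch assumes records arrive in timestamp order, which real streams do not always guarantee.

```python
from collections import deque

class RollingWindowAverage:
    """Maintain the average of values seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def update(self, timestamp, value):
        """Ingest one record and return the updated rolling average."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict records that have fallen out of the rolling time window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)

# Hypothetical sensor readings as (timestamp in seconds, value).
metric = RollingWindowAverage(window_seconds=10)
averages = [metric.update(ts, value) for ts, value in [(0, 10), (5, 20), (12, 30)]]
print(averages)
```

Each call touches only the new record and the records being evicted, so the metric is updated incrementally instead of being recomputed over the full dataset.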

# Data and Databases
## Data Structures
Data structures are one of the most common topics for developer job interview questions. The good news is that they're basically just specialized formats for organizing and storing data. Common examples include:

- 1D Arrays
- ND Arrays
- Dynamic Arrays
- Singly Linked List
- Doubly Linked List
- Circular Linked List
- Stack
- Queue
- Trees
- Binary Search Trees
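
A few of the structures listed above can be sketched in Python. The stack and queue use built-in types; the binary search tree is a minimal, unbalanced version with hypothetical helper names, meant only to show the ordering idea.

```python
from collections import deque

# Stack: last in, first out (LIFO), backed by a dynamic array (Python list).
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()

# Queue: first in, first out (FIFO), backed by a double-ended queue.
queue = deque(["a", "b"])
first = queue.popleft()

class Node:
    """One node of a binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    """Insert key, keeping smaller keys on the left and larger on the right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root  # duplicate keys are ignored

def bst_contains(root, key):
    """Walk down from the root, discarding one subtree at each step."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

def bst_in_order(root):
    """An in-order traversal visits the keys in sorted order."""
    if root is None:
        return []
    return bst_in_order(root.left) + [root.key] + bst_in_order(root.right)

root = None
for key in [8, 3, 10, 1, 6]:
    root = bst_insert(root, key)
print(top, first, bst_in_order(root))
```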

## Data Management and Data Migration
### Data Management
A database management system (DBMS) is system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data.
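
The create, retrieve, update, and manage operations can be sketched with SQLite, a small DBMS that ships with Python's standard library. The `customers` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Create: define a table, then add rows.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Ada", "London"))
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Grace", "New York"))

# Update: change an existing row.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Arlington", "Grace"))

# Manage: delete a row and commit the transaction.
conn.execute("DELETE FROM customers WHERE name = ?", ("Ada",))
conn.commit()

# Retrieve: query the remaining data.
rows = conn.execute("SELECT name, city FROM customers ORDER BY id").fetchall()
print(rows)
```

The `?` placeholders let the DBMS bind values safely instead of building SQL strings by hand.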

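### Data Migration
Data migration is the process of transporting data between computers, storage devices, and storage formats. The **AWS Schema Conversion Tool** is one example of a migration tool; it converts a database schema from one database engine to another.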

## Data Warehouses and Lakes
**Data Warehouse** - Collection of information derived from operational systems and external sources. Designed to allow data consolidation, analysis, and reporting at various aggregated levels. Data is populated through **extraction, transformation, and loading** (**ETL**). Stores data hierarchically, in **files or folders**.

**Data Lake** - Storage repository that holds a vast amount of raw data in its native form until the data is needed. Stores data using a flat architecture, in its raw format.

A **data warehouse** utilizes a pre-defined schema optimized for analytics.
 
A **data lake** is a centralized repository for all data, including structured and unstructured.
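
The extraction, transformation, and loading (ETL) flow that populates a data warehouse can be sketched as three small functions. Everything here is a simplified assumption: the CSV export, the `orders` table, and the cleaning rules are hypothetical, and SQLite stands in for a real warehouse engine.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system; order 2 has no amount.
RAW_CSV = """order_id,amount,currency
1,19.99,usd
2,,usd
3,5.00,eur
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, cast types, normalize currency codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows that would violate the warehouse schema
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the pre-defined warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL NOT NULL, currency TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_CSV)))
totals = warehouse.execute(
    "SELECT currency, COUNT(*) FROM orders GROUP BY currency ORDER BY currency"
).fetchall()
print(totals)
```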

**Data Warehouse**
- Data: Relational data from transactional systems, operational databases, and line of business applications
- Schema: Designed prior to the data warehouse implementation (schema-on-write)
- Price/Performance: Fastest query results using higher cost storage
- Data Quality: Highly curated data that serves as the central version of the truth
- Users: Business analysts, data scientists, and data developers
- Analytics: Batch reporting, BI, and visualizations

**Data Lake**
- Data: Non-relational and relational data from IoT devices, web sites, mobile apps, social media, and corporate applications
- Schema: Written at the time of analysis (schema-on-read)
- Price/Performance: Query results getting faster using low-cost storage
- Data Quality: Any data that may or may not be curated (i.e. raw data)
- Users: Data scientists, data developers, and business analysts (using curated data)
- Analytics: Machine learning, predictive analytics, data discovery, and profiling
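
The schema-on-write versus schema-on-read distinction in the two lists above can be sketched as follows. The event records, the schemas, and the `conform` helper are hypothetical; the point is only when the schema is applied, not how a real warehouse or lake is implemented.

```python
import json

# Hypothetical raw events from mixed sources, one JSON document per line.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5"}',
    '{"device": "sensor-2", "temp_c": 19, "battery": 0.8}',
    '{"user": "u42", "page": "/home"}',
]

SENSOR_SCHEMA = {"device": str, "temp_c": float}
PAGE_VIEW_SCHEMA = {"user": str, "page": str}

def conform(record, schema):
    """Return the record coerced to the schema, or None if it does not fit."""
    try:
        return {field: cast(record[field]) for field, cast in schema.items()}
    except (KeyError, TypeError, ValueError):
        return None

# Schema-on-write (warehouse): the schema is fixed up front, and only
# conforming, already-shaped rows are ever stored.
warehouse = []
for line in RAW_EVENTS:
    row = conform(json.loads(line), SENSOR_SCHEMA)
    if row is not None:
        warehouse.append(row)

# Schema-on-read (lake): every raw record is stored unchanged, and a schema
# is chosen at analysis time.
lake = list(RAW_EVENTS)

def read_lake(raw_lines, schema):
    """Apply a schema while reading, skipping records that do not fit it."""
    for line in raw_lines:
        row = conform(json.loads(line), schema)
        if row is not None:
            yield row

page_views = list(read_lake(lake, PAGE_VIEW_SCHEMA))
print(warehouse)
print(page_views)
```

The page-view record never reaches the warehouse because it does not match the schema chosen in advance, while the lake still holds it for a later question nobody planned for.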

## Relational
A relational database is a collection of datasets organized into tables, records (rows), and columns.
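
A minimal sketch of the relational model, using SQLite from Python's standard library: two tables are linked by a key column and combined with a join. The `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders (customer_id, amount) VALUES (1, 10.0), (1, 5.5), (2, 7.25);
    """
)

# Join the two tables on the shared key and aggregate per customer.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)
```

Each record in `orders` points at one record in `customers` through `customer_id`, which is the relationship that gives the model its name.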

## Non-Relational