# Week 11 - Non-relational Databases - Big Data and NoSQL

# Chapter 14: Big Data and NoSQL

## Learning Objectives
* Explain the role of Big Data in modern business[cite: 1, 14].
* Describe how the primary characteristics of Big Data go beyond the traditional "3 Vs"[cite: 1, 14].
* Explain how the core components of the Hadoop framework operate[cite: 1, 14].
* Identify the major components of the Hadoop ecosystem[cite: 1, 14].
* Summarize the four major approaches of the NoSQL data model[cite: 1, 14].
* Describe the characteristics of NewSQL databases[cite: 1, 14].
* Explain how a document database such as MongoDB stores and manipulates data[cite: 1, 14].
* Explain how a graph database such as Neo4j stores and manipulates data[cite: 2, 15].

## Preview
This chapter explores Big Data and the NoSQL data model in detail, building on concepts introduced in Chapter 2[cite: 2, 15]. It addresses how Big Data goes beyond the traditional "3 Vs" (volume, velocity, variety) and delves into technologies developed to handle it, such as the Hadoop framework and its components like Hadoop Distributed File System (HDFS) and MapReduce[cite: 4, 17, 18, 19, 20, 21]. The chapter also covers higher-level approaches of the NoSQL data model, including key-value, document, column-oriented, and graph databases, as well as NewSQL databases[cite: 21, 22]. Finally, it examines basic database activities in MongoDB and Neo4j[cite: 23].

## 14-1 Big Data
Big Data refers to datasets exhibiting characteristics of volume, velocity, and variety to an extent that makes them unsuitable for management by traditional relational database management systems (RDBMS)[cite: 46, 47, 597].

### Characteristics of Big Data (The 3 Vs):
* **Volume**: The quantity of data to be stored[cite: 47, 597, 605, 630]. This is often ambiguous as what constitutes "Big Data" changes over time[cite: 599, 600, 601]. The key is that the volume challenges current relational database technology[cite: 602].
* **Velocity**: The speed at which data enters the system and must be processed[cite: 47, 606, 658]. The rate of data capture has increased sharply: an online retailer that once recorded a single transaction per sale may now record the 30 clicks that led up to it[cite: 661, 662]. Velocity also encompasses processing speed, broken into:
    * **Stream processing**: Analyzing data as it enters the system, often for filtering and deciding what to store (e.g., CERN's Large Hadron Collider filtering 600 TB/sec to 1 GB/sec)[cite: 672, 673, 674, 675, 676, 677, 690].
    * **Feedback loop processing**: Analyzing stored data to produce actionable results, focusing on outputs (e.g., real-time book recommendations or informing strategic decisions)[cite: 678, 679, 680, 681, 682, 684, 692].
* **Variety**: The vast array of formats and structures in which data can be captured[cite: 47, 607, 685]. Data can be structured, unstructured (e.g., emails, videos), or semi-structured[cite: 686, 687, 688, 689, 697, 702]. Unlike relational databases that impose structure upon storage, Big Data processing imposes structure as needed during retrieval and processing, offering flexibility[cite: 708, 709, 710].
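
The two kinds of velocity processing described above can be sketched in a few lines. This is a minimal illustration only: the sensor readings, the `energy` field, and the threshold are hypothetical and do not come from the chapter.

```python
from typing import Iterable, Iterator


def stream_filter(readings: Iterable[dict], threshold: float) -> Iterator[dict]:
    """Yield only the readings worth storing, deciding one reading at a time."""
    for reading in readings:
        if reading["energy"] >= threshold:
            yield reading


# Hypothetical readings arriving in order.
incoming = [
    {"sensor": "a", "energy": 0.2},
    {"sensor": "b", "energy": 7.5},
    {"sensor": "a", "energy": 0.1},
    {"sensor": "c", "energy": 9.1},
]

# Stream processing: filter on the way in, before anything is stored.
stored = list(stream_filter(incoming, threshold=5.0))

# Feedback loop processing: analyze what was stored to produce a result.
average_energy = sum(r["energy"] for r in stored) / len(stored)
```

The filter decides what to keep using only the data in front of it, while the feedback step works on what has already been stored.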

### Other Characteristics of Big Data (Additional Vs):
Beyond the initial 3 Vs, other characteristics have been proposed[cite: 712, 713]:
* **Variability**: Data meaning changes based on context[cite: 714, 716, 723]. This is especially relevant in areas like sentiment analysis, which attempts to determine if a statement conveys a positive, negative, or neutral attitude[cite: 717, 718, 719, 720, 721, 722].
* **Veracity**: The trustworthiness and accuracy of the data, especially pertinent with automated data capture[cite: 725, 726, 727, 728, 729].
* **Value (Viability)**: The degree to which data can be analyzed to provide meaningful information and add value to the organization[cite: 733, 734, 735, 736, 765]. Only data with potential to impact organizational behavior should be captured[cite: 736, 744, 745].
* **Visualization**: The ability to graphically present data in an understandable way to enable insights for decision-makers[cite: 737, 738, 739, 766].

Big Data does not negate relational database technology, which remains critical for structured data requiring ACID transactions[cite: 747, 748, 749]. However, an RDBMS is no longer always the best solution for every kind of organizational data[cite: 750]. This has led to **polyglot persistence**, the coexistence of various data storage and management technologies[cite: 753, 767]. Scaling out (distributing data across clusters of commodity servers) is the dominant approach for Big Data, leading to the development of new non-relational technologies[cite: 756, 757].

## 14-2 Hadoop
Hadoop is a Java-based framework for distributing and processing very large datasets across computer clusters[cite: 208, 760]. It has become the de facto standard for Big Data storage and processing[cite: 190, 759]. Its two most important components are the Hadoop Distributed File System (HDFS) and MapReduce[cite: 209, 761].

### 14-2a HDFS (Hadoop Distributed File System)
HDFS is a highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speeds[cite: 771]. Key assumptions of HDFS include:
* **High volume**: HDFS assumes extremely large files (terabytes, petabytes)[cite: 774, 775], organized into large physical blocks (64 MB by default in early releases; 128 MB since Hadoop 2), which reduces metadata overhead[cite: 779, 780].
* **Write-once, read-many**: Files are written once, then closed and cannot be changed, simplifying concurrency and improving throughput[cite: 781, 782, 783]. Recent advancements allow appending new data to files, which is crucial for database logs[cite: 785, 786].
* **Streaming access**: Optimized for batch processing of entire files as continuous data streams, unlike transaction processing systems that randomly access small data pieces[cite: 787, 788].
* **Fault tolerance**: Designed for clusters of thousands of low-cost commodity computers, HDFS replicates data across multiple devices (default replication factor of three) to ensure availability if a device fails[cite: 789, 790, 791, 792].
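
To make the block-size and replication assumptions concrete, here is a small arithmetic sketch. It uses the 64 MB block size and replication factor of three mentioned above; the helper name `hdfs_footprint` and the 10 GB file are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def hdfs_footprint(file_bytes: int, block_bytes: int = 64 * MB, replication: int = 3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = -(-file_bytes // block_bytes)  # integer ceiling division
    raw_bytes = file_bytes * replication    # every block is stored `replication` times
    return blocks, raw_bytes


# A 10 GB file with the defaults discussed above.
blocks, raw = hdfs_footprint(10 * GB)

# The same file with 4 KB blocks, a size typical of ordinary file systems.
small_blocks, _ = hdfs_footprint(10 * GB, block_bytes=4 * 1024)
```

With 64 MB blocks the file needs 160 block entries of metadata; with 4 KB blocks it would need more than 2.6 million, which is why large blocks keep metadata overhead low. Replication triples the raw storage consumed.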

Hadoop utilizes three types of nodes within HDFS[cite: 795]:
* **Client node**: Interacts with the HDFS to store or retrieve data[cite: 795, 806].
* **Name node**: The "brain" of HDFS, managing file system metadata (e.g., file names, block locations). It doesn't store actual file data[cite: 800, 801, 802, 803, 805].
* **Data nodes**: Store the actual file data blocks and handle read/write requests from clients. They send heartbeat signals and block reports to the name node[cite: 796, 797, 798, 799].
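
The division of labor among the three node types can be sketched as a toy, in-memory model. It is not the HDFS API: the class names, the round-robin replica placement, and the `failed` parameter are assumptions made for illustration, and heartbeats and block reports are omitted.

```python
class NameNode:
    """Holds metadata only: which blocks form a file and where the replicas live."""

    def __init__(self, data_node_ids, replication=3):
        self.data_node_ids = list(data_node_ids)
        self.replication = replication
        self.block_map = {}  # file name -> [(block id, [data node ids])]

    def allocate(self, file_name, n_blocks):
        n = len(self.data_node_ids)
        placements = []
        for b in range(n_blocks):
            # Simple round-robin placement (an assumption of this sketch).
            replicas = [self.data_node_ids[(b + r) % n] for r in range(self.replication)]
            placements.append((f"{file_name}#blk{b}", replicas))
        self.block_map[file_name] = placements
        return placements

    def locate(self, file_name):
        return self.block_map[file_name]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block id -> bytes


def client_write(name_node, data_nodes, file_name, data, block_size):
    """Client role: split the file, ask the name node where blocks go, send them."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placements = name_node.allocate(file_name, len(chunks))
    for (block_id, replicas), chunk in zip(placements, chunks):
        for node_id in replicas:
            data_nodes[node_id].blocks[block_id] = chunk


def client_read(name_node, data_nodes, file_name, failed=()):
    """Client role: look up block locations, then read from any surviving replica."""
    parts = []
    for block_id, replicas in name_node.locate(file_name):
        node_id = next(n for n in replicas if n not in failed)
        parts.append(data_nodes[node_id].blocks[block_id])
    return b"".join(parts)


data_nodes = {f"dn{i}": DataNode(f"dn{i}") for i in range(4)}
name_node = NameNode(data_nodes.keys())
payload = b"abcdefghij"
client_write(name_node, data_nodes, "log.txt", payload, block_size=4)

# Two of the four data nodes are treated as failed; the file is still readable.
recovered = client_read(name_node, data_nodes, "log.txt", failed={"dn0", "dn1"})
```

The name node never touches `payload`; it only records where each block lives, which is why it can coordinate a very large cluster.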

### 14-2b MapReduce
MapReduce is a programming model that supports processing large datasets in a highly parallel, distributed manner[cite: 211, 763]. It works synergistically with HDFS[cite: 764]. MapReduce involves two main functions:
* **Map function**: Processes input data and generates intermediate key-value pairs[cite: 224, 225, 243, 244].
* **Reduce function**: Takes the intermediate key-value pairs and combines them to produce a summary result[cite: 226, 245].
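
A word count is the customary illustration of the two functions. The sketch below runs in a single Python process rather than on a Hadoop cluster, so it shows only the shape of the model; the function names and sample lines are hypothetical.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: combine all values that share a key into one summary value."""
    return key, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # group intermediate pairs by key
    return dict(reduce_fn(key, values) for key, values in groups.items())


lines = ["big data big clusters", "data nodes store data"]
counts = map_reduce(lines, map_fn, reduce_fn)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across many nodes, which is what makes the model suitable for data stored in HDFS.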

Writing MapReduce jobs directly in low-level Java requires significant programming skill, so direct MapReduce programming has declined in favor of simpler, higher-level applications that generate the jobs.

### 14-2c Hadoop Ecosystem Components
* **Hive**: A data warehousing system on HDFS that supports HiveQL (a SQL-like language) for ad hoc queries, processed into MapReduce jobs. It's suitable for scalable batch processing but not fast retrieval of small data subsets.
* **Pig**: A tool for compiling Pig Latin (a high-level procedural scripting language) into MapReduce jobs, often used for data transformation tasks like ETL.
* **Data Ingestion Applications**:
    * **Flume**: For ingesting large datasets from server log files (e.g., clickstream data) into Hadoop, supporting scheduled or event-based imports with optional transformations.
    * **Sqoop**: Converts data between relational databases (Oracle, MySQL, SQL Server) and HDFS, working in both directions. It imports table data row by row in parallel using MapReduce.
* **Direct Query Applications**: Provide faster query access by interacting directly with HDFS, bypassing the MapReduce layer.
    * **HBase**: A column-oriented NoSQL database on HDFS, highly distributed and scalable[cite: 268, 269, 270, 271]. It doesn't support SQL, relying on lower-level languages like Java, and is good for fast processing of sparse datasets[cite: 272, 273, 274, 291, 292, 293]. It is used by Facebook for its messaging system[cite: 275, 294].
    * **Impala**: The first SQL on Hadoop application, a query engine supporting SQL queries directly against data in HDFS, often using in-memory caching[cite: 295, 296, 298, 299].

### 14-2d Hadoop Pushback
While a customized Hadoop ecosystem offers powerful solutions for Big Data, its modular nature presents significant implementation challenges due to the need for installation and integration of independent components[cite: 303, 304, 323, 324]. This complexity and steep learning curve have propelled interest in alternative solutions like NoSQL databases[cite: 328].

## 14-3 NoSQL
NoSQL is a broad term for non-relational database technologies developed to address Big Data challenges[cite: 329]. The name is considered unfortunate as it defines what these technologies are not, rather than what they are, and some support SQL-like query languages[cite: 330, 331, 357, 358]. Some prefer "Not Only SQL"[cite: 360].

Most NoSQL products fall into four categories[cite: 364]:
* Key-value data stores
* Document databases
* Column-oriented databases
* Graph databases

Many NoSQL databases are open-source and often associated with the Linux operating system due to cost and customization benefits in large clusters[cite: 366, 367, 368, 369, 370].

### 14-3a Key-Value Databases
Key-value (KV) databases are conceptually the simplest NoSQL data model, storing data as collections of key-value pairs[cite: 387, 388].
* The **key** acts as an identifier[cite: 389].
* The **value** can be any type of data (text, XML, image), and the database does not interpret its contents; applications handle the value's meaning[cite: 389, 390, 391, 394].
* There are no foreign keys or direct relationship tracking among keys, which simplifies DBMS work and makes KV databases extremely fast and scalable for basic processing[cite: 392, 393].
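
The points above can be illustrated with a minimal in-memory sketch. It is not the interface of any particular key-value product: the class, the `cust:10010` key, and the JSON encoding chosen by the application are all assumptions.

```python
import json


class KeyValueStore:
    """A toy key-value store: values are opaque bytes the store never inspects."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        del self._data[key]


store = KeyValueStore()

# The application decides how to encode the value; here it chooses JSON.
store.put("cust:10010", json.dumps({"cus_lname": "Ramas"}).encode("utf-8"))

# The store returns raw bytes; only the application knows how to interpret them.
raw = store.get("cust:10010")
customer = json.loads(raw.decode("utf-8"))
```

The store offers only put, get, and delete by key. Anything resembling a query on `cus_lname` would have to be done by the application after it decodes the value.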

### Column-Oriented Databases
This term can refer to two things:
1.  **Column-centric storage within relational databases**: Data for a given column is stored together, optimizing read performance for analytical queries but being inefficient for transactional insert/update/delete operations[cite: 415, 416]. These still require structured data and support SQL[cite: 417].
2.  **A type of NoSQL database (column family database)**: This model takes column-centric storage beyond the relational model[cite: 418]. These databases do not require data to conform to predefined structures nor do they support standard SQL (though some, like Cassandra, have SQL-like languages like CQL)[cite: 419]. Examples include Google's BigTable, HBase, Hypertable, and Cassandra[cite: 420]. A "column" here is a key-value pair itself (e.g., "cus_lname: Ramas").
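
A rough sketch of the second meaning, using nested Python dictionaries. The layout (row key, then column family, then columns) follows common column family designs but is an assumption here, as are the row keys, family names, and every value other than `cus_lname: Ramas`.

```python
# row key -> column family -> {column name: value}
rows = {
    "10010": {
        "name": {"cus_lname": "Ramas"},
        "contact": {"cus_phone": "555-0100"},  # hypothetical column
    },
    "10011": {
        "name": {"cus_lname": "Dunne"},        # hypothetical row
        # No "contact" family is stored for this row at all.
    },
}


def get_column(rows, row_key, family, column, default=None):
    """Look up one column; missing rows, families, or columns yield the default."""
    return rows.get(row_key, {}).get(family, {}).get(column, default)


def stored_cells(rows):
    """Count the key-value pairs actually stored."""
    return sum(len(columns) for families in rows.values() for columns in families.values())


phone_missing = get_column(rows, "10011", "contact", "cus_phone")
cell_count = stored_cells(rows)
```

Each column is a name-value pair, and a row stores only the columns it actually has. No placeholder is kept for the missing phone number, so sparse data costs nothing extra.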

### 14-3b Document Databases
Document databases are a type of NoSQL database that stores and manages data in a flexible, semi-structured format called documents.
* **Self-describing**: Each document carries information about its own structure alongside the data, so new attributes can be added without altering a predefined schema.
* **Flexible structure**: Documents can vary in structure from one to the next.
* **Schema-on-read**: Structure is imposed when data is retrieved and processed (schema-on-read) rather than when it is stored (schema-on-write), so new attributes do not affect existing documents.
* **Querying**: Document databases allow querying on fields within the documents.
* **Examples**: MongoDB is a popular document database.
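
The sketch below models documents as plain Python dictionaries to show varying structure and querying on fields. It does not use a real document database; the sample documents and the `find` and `get_path` helpers are hypothetical.

```python
# Three documents in one collection, each with a different structure.
collection = [
    {"_id": 1, "name": "Ana", "email": "ana@example.com"},
    {"_id": 2, "name": "Ben", "address": {"city": "Austin", "state": "TX"}},
    {"_id": 3, "name": "Cy", "address": {"city": "Boston"}, "tags": ["vip"]},
]


def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; return None if any part is absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, value):
    """Return the documents whose field at `path` equals `value`."""
    return [doc for doc in collection if get_path(doc, path) == value]


in_austin = find(collection, "address.city", "Austin")
```

Unlike a key-value store, the database can look inside each document, so a query can target `address.city` even though one of the documents has no address at all.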

### 14-3c Graph Databases
Graph databases are a type of NoSQL database that stores data as nodes and relationships, optimized for highly interconnected data.
* **Nodes**: Represent entities (e.g., people, places, events).
* **Relationships**: Represent connections between nodes, often with properties.
* **Aggregate ignorant**: Unlike key-value, document, and column family databases that are "aggregate aware" (data collected around a central entity), graph databases are "aggregate ignorant"[cite: 447, 476]. This means data about each topic is stored separately, and the individual pieces are combined as needed at query time by following relationships, the graph counterpart of a relational join[cite: 477].
* **Performance**: They excel at querying relationships, making them suitable for social networks, recommendation engines, and fraud detection.
* **Scalability**: Graph databases do not scale out to clusters as well as aggregate-aware NoSQL databases due to their focus on highly related data[cite: 446, 479, 500].
* **Examples**: Neo4j is a popular graph database.
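
A relationship query of the kind graph databases are built for can be sketched with plain Python structures. This is not Neo4j or Cypher: the people, the `FRIEND_OF` and `WORKS_WITH` relationship types, and the helper functions are hypothetical.

```python
from collections import deque

# Nodes carry properties; relationships are (source, type, target, properties).
nodes = {
    1: {"label": "Person", "name": "Ana"},
    2: {"label": "Person", "name": "Ben"},
    3: {"label": "Person", "name": "Cy"},
    4: {"label": "Person", "name": "Dee"},
}
relationships = [
    (1, "FRIEND_OF", 2, {"since": 2019}),
    (2, "FRIEND_OF", 3, {"since": 2021}),
    (3, "FRIEND_OF", 4, {"since": 2022}),
    (1, "WORKS_WITH", 4, {}),
]


def neighbors(node_id, rel_type):
    """Targets reachable from node_id over one relationship of the given type."""
    return [dst for src, kind, dst, _ in relationships if src == node_id and kind == rel_type]


def within_hops(start, rel_type, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops relationships."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node_id, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in neighbors(node_id, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, depth + 1))
    return found


friends_and_fofs = [nodes[n]["name"] for n in within_hops(1, "FRIEND_OF", 2)]
```

The query walks outward from one node along its relationships, which is the access pattern behind friend-of-friend recommendations. A real graph database stores relationships so that this walk does not require scanning the whole list as `neighbors` does here.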

### 14-3d Aggregate Awareness
* **Aggregate Aware**: Key-value, document, and column family databases are aggregate aware[cite: 447]. This means data is collected or aggregated around a central topic or entity[cite: 448]. For example, a blog website might aggregate all data related to a blog post (title, content, date, poster, comments) into a single denormalized collection[cite: 468, 469]. This allows for clustering efficiency by making each piece of data relatively independent[cite: 472, 473]. Separating independent pieces of data (shards) across nodes allows NoSQL databases to scale out effectively[cite: 475].
* **Aggregate Ignorant**: Graph databases, like relational databases, are aggregate ignorant[cite: 476]. They do not organize data into collections based on a central entity[cite: 476]. Data about each topic is stored separately, and joins are used to aggregate individual pieces of data as needed[cite: 477]. This offers greater flexibility in combining data elements[cite: 478].
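
The blog example can be written out both ways. The sketch below is illustrative only; the post, users, comments, and helper functions are hypothetical.

```python
# Aggregate-aware: everything about the post is kept in one denormalized unit.
post_aggregate = {
    "post_id": "p1",
    "title": "Scaling out",
    "poster": "ana",
    "comments": [
        {"by": "ben", "text": "Nice"},
        {"by": "cy", "text": "Agreed"},
    ],
}

# Aggregate-ignorant: each topic is stored separately and combined when needed.
posts = [{"post_id": "p1", "title": "Scaling out", "poster_id": "u1"}]
users = [
    {"user_id": "u1", "name": "ana"},
    {"user_id": "u2", "name": "ben"},
    {"user_id": "u3", "name": "cy"},
]
comments = [
    {"post_id": "p1", "user_id": "u2", "text": "Nice"},
    {"post_id": "p1", "user_id": "u3", "text": "Agreed"},
]


def assemble_post(post_id):
    """Join posts, users, and comments to rebuild the aggregate on demand."""
    names = {u["user_id"]: u["name"] for u in users}
    post = next(p for p in posts if p["post_id"] == post_id)
    return {
        "post_id": post["post_id"],
        "title": post["title"],
        "poster": names[post["poster_id"]],
        "comments": [
            {"by": names[c["user_id"]], "text": c["text"]}
            for c in comments
            if c["post_id"] == post_id
        ],
    }


def comments_by(user_name):
    """A different combination of the same data: all comments by one user."""
    ids = {u["user_id"] for u in users if u["name"] == user_name}
    return [c["text"] for c in comments if c["user_id"] in ids]
```

The aggregate can be placed on any single node and read in one step, which helps scaling out. The separate tables need a join to answer the same request, but they answer other questions, such as `comments_by("ben")`, just as easily.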

## 14-4 NewSQL Databases
NewSQL databases attempt to bridge the gap between RDBMS and NoSQL[cite: 503]. They aim to provide ACID-compliant transactions over a highly distributed infrastructure[cite: 504].
* **SQL support**: Like RDBMSs, NewSQL databases support SQL as the primary interface[cite: 508].
* **ACID compliance**: They maintain ACID-compliant transactions, critical for line-of-business operations[cite: 504, 508].
* **Distributed clusters**: Similar to NoSQL, they support highly distributed clusters[cite: 535].
* **Data stores**: They can incorporate key-value or column-oriented data stores[cite: 535].
* **Disadvantages**: NewSQL databases often rely heavily on in-memory storage, which can jeopardize the durability component of ACID and limit their ability to handle vast datasets[cite: 536, 537, 538]. While theoretically scalable, practical scaling has been limited to dozens of data nodes, far less than the hundreds used by some NoSQL databases[cite: 539, 540].
* **Examples**: ClustrixDB and NuoDB are examples of NewSQL products[cite: 507].

## 14-5 MongoDB
MongoDB is a popular document database that stores data in flexible, JSON-like documents.
* **Document Structure**: Data is stored as BSON (Binary JSON) documents, a binary encoding of JSON-like documents that allows nested structures and varying schemas within a collection.
* **Collections**: Documents are organized into collections, which are analogous to tables in relational databases.
* **_id field**: Each document in MongoDB has a unique `_id` field, serving as a primary key.
* **Schema Flexibility**: MongoDB is "schema-less," meaning documents within the same collection can have different fields, allowing for agile development and easy adaptation to changing data requirements.
* **Sharding**: MongoDB supports sharding, which is the distribution of data across multiple servers to handle large datasets and high throughput.
* **Querying**: MongoDB provides a rich query language that allows for complex queries, aggregation, and indexing on document fields.
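
The idea behind sharding can be sketched by hashing each document's `_id` to choose a shard. This is a toy model, not MongoDB's implementation, which uses its own hash function and distributes ranges of hashed values; the `shard_for` and `find_by_id` helpers, the MD5 hash, the shard count, and the sample documents are assumptions.

```python
import hashlib


def shard_for(doc_id, n_shards):
    """Map an _id to a shard number using a stable hash (MD5 is used only for spreading)."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Documents in one collection may have different fields.
documents = [
    {"_id": 1, "name": "Ana", "email": "ana@example.com"},
    {"_id": 2, "name": "Ben", "address": {"city": "Austin"}},
    {"_id": 3, "name": "Cy", "tags": ["vip"]},
    {"_id": 4, "name": "Dee"},
]

N_SHARDS = 3
shards = [[] for _ in range(N_SHARDS)]
for doc in documents:
    shards[shard_for(doc["_id"], N_SHARDS)].append(doc)


def find_by_id(doc_id):
    """Only the shard that owns the _id has to be searched."""
    shard = shards[shard_for(doc_id, N_SHARDS)]
    return next((d for d in shard if d["_id"] == doc_id), None)
```

Because each document is a self-contained aggregate, it can live on any shard, and a lookup by `_id` touches only one server.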
