# Big Data Principles Chapters 1 and 2 - Summary
### By Carolina Garma Escoffié
---

## Chapter 1. A new paradigm for Big Data

#### 1. 
**Your boss asks you to build a simple web analytics application. Your monitoring system reports many “Timeout error on inserting to the database” errors. You need to do something to fix the problem, and you need to do it quickly. What's the first thing that comes to mind?**
 > Instead of having the web server hit the database directly, you insert a queue
between the web server and the database.

![r1](img/r1.JPG)
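
The queueing idea can be sketched with the Python standard library. This is only an illustration under stated assumptions: `FakeDatabase`, `drain`, and `run_demo` are hypothetical names, the queue lives in memory, and a real deployment would use a durable queue and a real database client.

```python
import queue
import threading


class FakeDatabase:
    """Hypothetical stand-in for a real database; records each batch it receives."""

    def __init__(self):
        self.batches = []

    def insert_many(self, rows):
        self.batches.append(list(rows))


def drain(events, db, batch_size):
    """Worker loop: pull events off the queue and write them in batches."""
    batch = []
    while True:
        event = events.get()
        if event is None:  # sentinel: stop and flush whatever is left
            break
        batch.append(event)
        if len(batch) >= batch_size:
            db.insert_many(batch)
            batch = []
    if batch:
        db.insert_many(batch)


def run_demo(num_events=250, batch_size=100):
    events = queue.Queue()
    db = FakeDatabase()
    worker = threading.Thread(target=drain, args=(events, db, batch_size))
    worker.start()
    # The web server only enqueues; it never waits on the database.
    for i in range(num_events):
        events.put({"url": "/home", "hit": i})
    events.put(None)
    worker.join()
    return db


print([len(b) for b in run_demo().batches])  # [100, 100, 50]
```

Batching many events into one write is what relieves the database: 250 single-row inserts become three requests.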

#### 2. 
**You still have the same problem, and you realize the database is clearly the bottleneck. How would you scale a write-heavy relational database?**
 > The best approach is to use multiple database servers and spread the
table across all the servers. Each server will have a subset of the data for the table. This
is known as _horizontal partitioning or sharding_.
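
A minimal sketch of horizontal partitioning, assuming hash-based routing on the row key. `shard_for` and `ShardedTable` are hypothetical names, and each in-memory dictionary stands in for one database server.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedTable:
    """Spreads rows across several shards; each dict plays the role of one server."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


table = ShardedTable(num_shards=4)
for i in range(1000):
    table.put(f"/page/{i}", i)

print(table.get("/page/42"))              # 42
print([len(s) for s in table.shards])     # each shard holds a subset of the rows
```

Note that with this modulo scheme, changing the number of shards moves most keys to a different shard, which is part of why resharding a live system is painful.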

#### 3. 
**In general, what are some of the main issues of working with a heavily loaded relational DB?**
 > Answers may vary. Common issues include insert timeouts, overloaded servers, the database becoming a bottleneck, the added complexity of managing queues and shards across multiple servers, fault-tolerance issues, and data corruption.

#### 4. 
**How do Big Data techniques help?**
 > 1. When it comes to scaling, you just add nodes.
 > 2. Data is immutable.
 > 3. You might write bad data, but at least you won’t destroy good data.
 > 4. The result is a much stronger human-fault tolerance guarantee.

#### 5. 
**Name some open-source Big Data systems.**
 > 1. Hadoop
 > 2. HBase
 > 3. MongoDB
 > 4. Cassandra
 > 5. RabbitMQ

#### 6. 
**In what kinds of situations shouldn't you use Hadoop?**
 > Hadoop can parallelize large-scale batch computations on very large
amounts of data, but the computations have high latency. You don’t use Hadoop for
anything where you need low-latency results.

#### 7. 
**In what kinds of situations shouldn't you use Cassandra?**
 > NoSQL databases like Cassandra achieve their scalability by offering you a much
more limited data model than you’re used to with something like SQL. Squeezing
your application into these limited data models can be very complex. And because the
databases are mutable, they’re not human-fault tolerant. 

#### 8. 
**Provide a general-purpose definition for a data system**
 > query = function(all data) 
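
The equation can be read literally as a function that takes the entire dataset as its argument. The sketch below assumes a hypothetical pageview dataset and a hypothetical `pageviews_for` query; it only illustrates the idea.

```python
def pageviews_for(all_data, url):
    """A query expressed as a pure function of all the data."""
    return sum(1 for record in all_data if record["url"] == url)


# Hypothetical master dataset of raw pageview records.
all_data = [
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "carol"},
]

print(pageviews_for(all_data, "/home"))  # 2
```

Scanning everything on every query is correct but too slow at scale, which is the motivation for precomputing results.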

#### 9. 
**List the desired properties for a Big Data System**
 > 1. Robustness and fault tolerance
 > 2. Low-latency reads and updates
 > 3. Scalability
 > 4. Generalization
 > 5. Extensibility
 > 6. Ad hoc queries
 > 7. Minimal maintenance
 > 8. Debuggability

#### 10. 
**The Lambda Architecture is horizontally scalable. What does that mean?**
 > Scaling is accomplished by adding more machines.

#### 11. 
**Enumerate some problems with fully incremental architectures.**
 > 1. Operational complexity: compaction, cascading failure
 > 2. Extreme complexity of achieving eventual consistency
 > 3. Lack of human-fault tolerance

#### 12. 
**How is the Lambda Architecture composed?**
 >  The main idea of the Lambda Architecture is to build Big Data systems as a series of
layers, as shown in figure 1.6

![r2](img/r2.JPG)

#### 13. 
**Describe the batch layer**
 >  The portion of the Lambda Architecture that implements the batch view = function(all data) equation is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset (see figure 1.8).

![r3](img/r3.JPG)
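
A rough sketch of batch view = function(all data), assuming a hypothetical pageview dataset. `compute_batch_view` is an illustrative name; a real batch layer would run this kind of computation on a distributed batch system rather than in one process.

```python
from collections import Counter


def compute_batch_view(master_dataset):
    """Recompute the whole view from scratch over the entire master dataset."""
    return Counter(record["url"] for record in master_dataset)


# Hypothetical master dataset: an append-only collection of raw records.
master_dataset = [
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "carol"},
]

batch_view = compute_batch_view(master_dataset)
print(batch_view["/home"])   # 2, answered by a lookup instead of a full scan
```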

#### 14. 
**Describe the serving layer**
 >  The batch layer emits batch views as the result of its functions. The next step is to load the views somewhere so that they can be queried. This is where the serving layer comes in. The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it (see figure 1.9). When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available. 

![r4](img/r4.JPG)
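
The swap-and-read behavior can be sketched as follows. `ServingLayer` here is a hypothetical in-memory class, not a real distributed database; the point is that the view is replaced wholesale and only read at random, never updated key by key.

```python
class ServingLayer:
    """Holds the latest batch view and answers random reads against it."""

    def __init__(self):
        self._view = {}

    def load(self, new_batch_view):
        # Swap in the new view as a whole; there are no per-key writes.
        self._view = dict(new_batch_view)

    def get(self, key, default=0):
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({"/home": 2, "/about": 1})
print(serving.get("/home"))   # 2

# A newer batch view becomes available and replaces the old one.
serving.load({"/home": 5, "/about": 1, "/pricing": 3})
print(serving.get("/home"))   # 5
```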

#### 15. 
**Describe the speed layer**
 >  As its name suggests, its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements (see figure 1.10). You can think of the speed layer as being similar to the batch layer in that it produces views based on data it receives. One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once.

![r5](img/r5.JPG)

![r6](img/r6.JPG)
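
One way to picture how the layers fit together: the speed layer keeps a small realtime view that is updated incrementally, and a query merges it with the batch view. The names `SpeedLayer`, `on_new_data`, `expire`, and `query` are hypothetical, and the pageview counts are made-up sample data.

```python
from collections import Counter


class SpeedLayer:
    """Maintains a realtime view covering only data the batch layer has not yet absorbed."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_new_data(self, record):
        # Incremental update: touch only the affected key instead of recomputing.
        self.realtime_view[record["url"]] += 1

    def expire(self):
        # Once a new batch view includes the recent data, the realtime results are discarded.
        self.realtime_view.clear()


def query(batch_view, realtime_view, url):
    """Answer a query by merging the batch view with the realtime view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


batch_view = {"/home": 2}
speed = SpeedLayer()
speed.on_new_data({"url": "/home"})
print(query(batch_view, speed.realtime_view, "/home"))   # 3
```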

## Chapter 2. Data model for Big Data

#### 1. 
**What does data mean according to the Big Data Model?**
 > _Data_ refers to the information that can’t be derived from anything else. Data
serves as the axioms from which everything else derives. 

#### 2. 
**What's the difference between _views_ and _queries_?**
 > _Queries_ are questions you ask of your data. For example, you query your financial transaction history to determine your current bank account balance. 
_Views_ are information that has been derived from your base data. They are built
to assist with answering specific types of queries. 

![r7](img/r7.JPG)

#### 3. 
**What are the key properties of data?**
 > _rawness_, _immutability_, and _perpetuity_ (or the “eternal trueness of data”).

#### 4. 
**What is data _rawness_?**
 > The rawer your data, the more information you can deduce from it and the more questions you can ask of it.

#### 5. 
**How do you keep data immutable?**
 > Instead of updating or deleting data, you only add more. Because you only append, you don't need an index for your data, which is a huge
simplification.

![r8](img/r8.JPG)
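
A small sketch of the append-only idea, assuming hypothetical location records with integer timestamps. `record_location` and `current_city` are illustrative names; nothing is ever updated in place, and the current value is derived from the record with the latest timestamp.

```python
from operator import itemgetter


def record_location(log, user_id, city, timestamp):
    """Append a new record; existing records are never modified or deleted."""
    log.append({"user_id": user_id, "city": city, "timestamp": timestamp})


def current_city(log, user_id, as_of=None):
    """Derive the city from the newest record, optionally as of a past timestamp."""
    records = [
        r for r in log
        if r["user_id"] == user_id and (as_of is None or r["timestamp"] <= as_of)
    ]
    if not records:
        return None
    return max(records, key=itemgetter("timestamp"))["city"]


log = []
record_location(log, "user-1", "Merida", timestamp=100)
record_location(log, "user-1", "Cancun", timestamp=200)

print(current_city(log, "user-1"))             # Cancun
print(current_city(log, "user-1", as_of=150))  # Merida, the history is still available
```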

#### 6. 
**How is data eternally true?**
 > The key consequence of immutability is that each piece of data is true in perpetuity. This mentality is the same as when you learned history in school. The fact _The United States consisted of thirteen states on July 4, 1776_ is always true because it is tied to a specific date; the fact that the number of states has increased since then is captured in additional (also perpetual) data.

#### 7. 
**How is data represented in the fact-based model?**
 > In the fact-based model, you deconstruct the data into fundamental units called facts. Facts are _atomic_, _timestamped_, and _identifiable_.

![r9](img/r9.JPG)
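
The three properties can be illustrated with a small immutable record type. This is a sketch under assumptions: `PageviewFact` and its fields are hypothetical, and the `nonce` field is one possible way to make otherwise identical pageviews distinguishable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageviewFact:
    """One atomic fact: a single pageview that cannot be subdivided further."""
    user_id: str
    url: str
    timestamp: int   # when the fact became true
    nonce: int       # assumed extra field that makes the fact uniquely identifiable


first = PageviewFact("user-1", "/home", timestamp=100, nonce=1)
retry = PageviewFact("user-1", "/home", timestamp=100, nonce=1)   # same fact written twice
second = PageviewFact("user-1", "/home", timestamp=100, nonce=2)  # a distinct pageview

# Identifiable facts make duplicate writes harmless: a set keeps one copy of each.
facts = {first, retry, second}
print(len(facts))   # 2
```

Because the dataclass is frozen, attempting to modify a fact after creation raises an error, which matches the immutability requirement.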

#### 8. 
**What are the components of a graph schema?**
 > 1. _Nodes_ are the entities in the system.
 > 2. _Edges_ are relationships between nodes.
 > 3. _Properties_ are information about entities.

![r10](img/r10.JPG)
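
The three components can be modeled directly as types. The sketch below is illustrative only: `Node`, `Edge`, `Property`, the helper functions, and the sample people are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str          # an entity in the system, for example a person


@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    label: str            # the kind of relationship between the two nodes


@dataclass(frozen=True)
class Property:
    node: Node
    name: str
    value: object         # a piece of information about the entity


alice = Node("person:alice")
bob = Node("person:bob")
edges = [Edge(alice, bob, "friend")]
properties = [Property(alice, "age", 25), Property(bob, "location", "Merida")]


def properties_of(node, properties):
    """Collect every property attached to one node."""
    return {p.name: p.value for p in properties if p.node == node}


def neighbors(node, edges, label):
    """Follow edges with the given label out of one node."""
    return [e.target for e in edges if e.source == node and e.label == label]


print(properties_of(alice, properties))   # {'age': 25}
print(neighbors(alice, edges, "friend"))  # [Node(node_id='person:bob')]
```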

## Some Additional Resources about Big Data Risks

#### 1. 
**Why is Big Data nothing without the right infrastructure?**
 > According to Forrester reports, at least 60% of all data within an enterprise remains unused. 
A lot of what was considered big data is garbage, meaning that it is unreliable, uncleaned data that requires an extreme amount of work to be usable. And sometimes, the little information that can be found in an extremely large dataset is not worth the effort, time, and cost needed to find it.


#### 2. 
**Is more data always better?**
 > The quality of the data often matters more than the quantity. We want data that measures something we care about, in a reliable, consistent manner. 


#### References
Marz, N., & Warren, J. (2015). _Big Data: Principles and best practices of scalable real-time data systems_. Manning Publications Co., pp. 1-46.

Di Russo, J. (2020). _Bye Bye Big Data!_ Medium, Towards Data Science. Retrieved from https://towardsdatascience.com/bye-bye-big-data-fbea187c7739