# Part 1: Foundations of Data Systems

The concepts and methods in this part describe data systems from the point of view of a single physical machine.

## Chapter 1: Reliable, Scalable, and Maintainable Applications

![](images/image_1.png)

!! Definition !!

* **Data-Intensive Application:**
An application for which raw CPU power is rarely the limiting factor; the bigger
problems are usually the amount of data, the complexity of data, and the speed at
which it is changing.

Typical needs of a data-intensive application:
* Store data so that the same application, or another one, can find it again later (databases)
* Remember the result of an expensive operation, to speed up reads (caches)
* Allow users to search data by keyword or filter it in various ways (search indexes)
* Send a message to another process, to be handled asynchronously (stream processing)
* Periodically crunch a large amount of accumulated data (batch processing)

![](images/image_2.png)

!! Definition !!
* **Reliability:** The system should continue to work correctly (performing the correct function at
the desired level of performance) even in the face of adversity (hardware or software
faults, and even human error).

* **Scalability:** As the system grows (in data volume, traffic volume, or complexity), there should
be reasonable ways of dealing with that growth.

* **Maintainability:** Over time, many different people will work on the system (engineering and operations,
both maintaining current behavior and adapting the system to new use
cases), and they should all be able to work on it productively.

---

### **Reliability**

Reliability is the ability of a system to keep working correctly even when faults occur. There are three typical types of faults:
* **Hardware Faults** 
* **Software Errors**
* **Human Errors**

#### **Hardware Faults**
Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone
unplugs the wrong network cable, etc.

**PREVENTION STRATEGY:** 

Add redundancy to the individual hardware components
in order to reduce the failure rate of the system. Disks may be set up in a RAID
configuration, servers may have dual power supplies and hot-swappable CPUs, and
datacenters may have batteries and diesel generators for backup power. When one
component dies, the redundant component can take its place while the broken component
is replaced. This approach reduces the failure rate, but it cannot completely prevent
hardware problems from causing failures.

Hence there is a move toward systems that can tolerate the loss of entire machines, by
using software fault-tolerance techniques in preference or in addition to hardware
redundancy. Such systems also have operational advantages: a single-server system
requires planned downtime if you need to reboot the machine (to apply operating
system security patches, for example), whereas a system that can tolerate machine
failure can be patched one node at a time, without downtime of the entire system (a
rolling upgrade).
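
The rolling upgrade described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `rolling_upgrade` and the `upgrade` and `is_healthy` callbacks are hypothetical stand-ins for whatever patching and health-check mechanism a real cluster would use.

```python
from typing import Callable, List


def rolling_upgrade(
    nodes: List[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one at a time and stop at the first unhealthy node.

    Returns the nodes that were upgraded and passed their health check.
    """
    upgraded: List[str] = []
    for node in nodes:
        # Only this node is out of service; the others keep serving traffic.
        upgrade(node)
        if not is_healthy(node):
            # Halt so the remaining nodes stay on the old, working version.
            break
        upgraded.append(node)
    return upgraded


if __name__ == "__main__":
    # Hypothetical node names; the "upgrade" here only records the call.
    patched: List[str] = []
    print(rolling_upgrade(["node-a", "node-b", "node-c"], patched.append, lambda n: True))
```

Stopping at the first failed health check is one possible policy; it limits a bad patch to a single node at the cost of leaving the cluster on mixed versions until someone intervenes.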

#### **Software Errors**
Software errors are systematic faults that are harder to anticipate, and because they are correlated across nodes, they tend to cause
many more system failures than uncorrelated hardware faults.

**PREVENTION STRATEGY:** 

There is no quick solution to the problem of systematic faults in software. Lots of
small things can help: carefully thinking about assumptions and interactions in the
system; thorough testing; process isolation; allowing processes to crash and restart;
measuring, monitoring, and analyzing system behavior in production. If a system is
expected to provide some guarantee (for example, in a message queue, that the number
of incoming messages equals the number of outgoing messages), it can constantly
check itself while it is running and raise an alert if a discrepancy is found.
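
The message queue self-check mentioned above might look like the sketch below. The class `SelfCheckingQueue` and its `alert` callback are hypothetical names introduced here. The sketch also assumes that messages still waiting in the queue count toward the balance, so the invariant checked is incoming = outgoing + pending.

```python
from collections import deque
from typing import Any, Callable, Deque


class SelfCheckingQueue:
    """In-memory queue that counts messages and can verify its own invariant."""

    def __init__(self, alert: Callable[[str], None] = print) -> None:
        self._items: Deque[Any] = deque()
        self.incoming = 0   # total messages ever accepted
        self.outgoing = 0   # total messages ever handed out
        self._alert = alert

    def put(self, message: Any) -> None:
        self._items.append(message)
        self.incoming += 1

    def get(self) -> Any:
        message = self._items.popleft()
        self.outgoing += 1
        return message

    def check(self) -> bool:
        """Return True if incoming == outgoing + pending; otherwise raise an alert."""
        expected = self.outgoing + len(self._items)
        if self.incoming != expected:
            self._alert(
                f"queue discrepancy: incoming={self.incoming}, "
                f"outgoing+pending={expected}"
            )
            return False
        return True
```

In a running system, `check` would be called periodically (for example from a background task), and `alert` would feed into whatever monitoring is in place.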

#### **Human Errors**
Humans design and build software systems, and the operators who keep the systems
running are also human. Even when they have the best intentions, humans are
known to be unreliable.

**PREVENTION STRATEGY:** 

How do we make our systems reliable, in spite of unreliable humans? The best systems
combine several approaches:
* Design systems in a way that minimizes opportunities for error. For example,
well-designed abstractions, APIs, and admin interfaces make it easy to do “the
right thing” and discourage “the wrong thing.” However, if the interfaces are too
restrictive people will work around them, negating their benefit, so this is a tricky
balance to get right.
* Decouple the places where people make the most mistakes from the places where
they can cause failures. In particular, provide fully featured non-production
sandbox environments where people can explore and experiment safely, using
real data, without affecting real users.
* Test thoroughly at all levels, from unit tests to whole-system integration tests and
manual tests. Automated testing is widely used, well understood, and especially
valuable for covering corner cases that rarely arise in normal operation.
* Allow quick and easy recovery from human errors, to minimize the impact in the
case of a failure. For example, make it fast to roll back configuration changes, roll
out new code gradually (so that any unexpected bugs affect only a small subset of
users), and provide tools to recompute data (in case it turns out that the old computation
was incorrect).
* Set up detailed and clear monitoring, such as performance metrics and error
rates. In other engineering disciplines this is referred to as telemetry. (Once a
rocket has left the ground, telemetry is essential for tracking what is happening,
and for understanding failures.) Monitoring can show us early warning signals
and allow us to check whether any assumptions or constraints are being violated.
When a problem occurs, metrics can be invaluable in diagnosing the issue.
* Implement good management practices and training—a complex and important
aspect, and beyond the scope of this book.
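
The advice above to roll out new code gradually can be illustrated with a deterministic percentage rollout. This is a minimal sketch, not a method the text prescribes: the function name `in_rollout`, the use of SHA-256 for bucketing, and the 100-bucket scheme are all assumptions made here.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Decide whether a user receives the new code at a given rollout percentage.

    Each user is hashed into a stable bucket from 0 to 99, so the same user
    always gets the same answer, and raising `percent` only ever adds users.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical user IDs; roughly 5% of them should see the new code.
    users = [f"user-{i}" for i in range(1000)]
    print(sum(in_rollout(u, 5) for u in users), "of", len(users), "users in rollout")
```

Because the bucket depends only on the user ID, setting the percentage back to 0 acts as a quick rollback for everyone.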

---

### **Scalability**

Scalability is the term we use to describe a system’s ability to cope with increased
load. Note, however, that it is not a one-dimensional label that we can attach to a system:
it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way,
what are our options for coping with the growth?” and “How can we add computing
resources to handle the additional load?”

#### **Measuring load**
Load can be described
with a few numbers which we call load parameters. The best choice of parameters
depends on the architecture of your system: it may be requests per second to a web
server, the ratio of reads to writes in a database, the number of simultaneously active
users in a chat room, the hit rate on a cache, or something else. Perhaps the average
case is what matters for you, or perhaps your bottleneck is dominated by a small
number of extreme cases.

![](images/image_3.png)
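
As a rough illustration of load parameters, the sketch below derives two of the examples named above (requests per second and the ratio of reads to writes) from a list of observed requests. The `Request` type, its field names, and the `load_parameters` function are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation window
    kind: str         # "read" or "write"


def load_parameters(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Summarize observed traffic as a few load parameters."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    reads = sum(1 for r in requests if r.kind == "read")
    writes = sum(1 for r in requests if r.kind == "write")
    return {
        "requests_per_second": len(requests) / window_seconds,
        # With no writes at all, the ratio is reported as infinity.
        "read_write_ratio": reads / writes if writes else float("inf"),
    }
```

Which parameters are worth computing depends on the architecture; these two are only a starting point.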

#### **Performance**
Once you have described the load on your system, you can investigate what happens
when the load increases. You can look at it in two ways:
* When you increase a load parameter and keep the system resources (CPU, memory,
network bandwidth, etc.) unchanged, how is the performance of your system
affected?
* When you increase a load parameter, how much do you need to increase the
resources if you want to keep performance unchanged?

Both questions require performance numbers, so let’s look briefly at describing the
performance of a system.

Examples:

In a batch processing system such as Hadoop, we usually care about throughput—the
number of records we can process per second, or the total time it takes to run a job
on a dataset of a certain size. In online systems, what’s usually more important is the
service’s response time—that is, the time between a client sending a request and
receiving a response.

![](images/image_4.png)
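
Throughput and response time can both be measured with a small harness like the one below. This is a sketch under assumptions: `measure` and `handler` are hypothetical names, and the timing covers only in-process handling, whereas the response time a real client sees also includes network and queueing delays.

```python
import time
from statistics import mean, median
from typing import Any, Callable, Dict, Sequence


def measure(handler: Callable[[Any], Any], requests: Sequence[Any]) -> Dict[str, float]:
    """Run each request through `handler` and summarize throughput and response times."""
    if not requests:
        raise ValueError("need at least one request to measure")
    durations = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        handler(request)
        durations.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        # Requests handled per second over the whole run.
        "throughput_per_second": len(durations) / elapsed if elapsed > 0 else float("inf"),
        "mean_response_time": mean(durations),
        "median_response_time": median(durations),
        "max_response_time": max(durations),
    }
```

Reporting the median and maximum alongside the mean helps show whether typical requests or a few slow outliers dominate the picture.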

#### **Scalability Strategy**

Some systems are elastic, meaning that they can automatically add computing resources
when they detect a load increase, whereas other systems are scaled manually (a
human analyzes the capacity and decides to add more machines to the system). An
elastic system can be useful if load is highly unpredictable, but manually scaled systems
are simpler and may have fewer operational surprises.
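
The decision an elastic system makes automatically, and a human makes in a manually scaled one, can be reduced to a calculation like the sketch below. The function `desired_machines` and its parameters are hypothetical, and the sketch assumes every machine handles the same fixed amount of load, which real systems rarely guarantee.

```python
import math


def desired_machines(
    current_load: float,
    capacity_per_machine: float,
    min_machines: int = 1,
) -> int:
    """Return how many machines are needed to serve `current_load`.

    Both load and capacity are in the same unit, e.g. requests per second.
    """
    if capacity_per_machine <= 0:
        raise ValueError("capacity_per_machine must be positive")
    needed = math.ceil(current_load / capacity_per_machine)
    return max(min_machines, needed)
```

An elastic system would re-evaluate this continuously from live metrics; in a manually scaled system the same arithmetic happens during capacity planning.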

An architecture that scales well for a particular application is built around assumptions
of which operations will be common and which will be rare—the load parameters.
If those assumptions turn out to be wrong, the engineering effort for scaling is at
best wasted, and at worst counterproductive. In an early-stage startup or an unproven
product it’s usually more important to be able to iterate quickly on product features
than it is to scale to some hypothetical future load.

For example, a system that is designed to handle 100,000 requests per second, each
1 kB in size, looks very different from a system that is designed for 3 requests per
minute, each 2 GB in size—even though the two systems have the same data throughput.
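
The claim that the two systems have the same data throughput can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes); the helper name `throughput_bytes_per_second` is introduced here only for illustration.

```python
KB = 1_000            # bytes, decimal units assumed
GB = 1_000_000_000    # bytes, decimal units assumed


def throughput_bytes_per_second(requests: int, per_seconds: float, request_size: int) -> float:
    """Data throughput for `requests` requests of `request_size` bytes every `per_seconds` seconds."""
    return requests * request_size / per_seconds


# 100,000 requests per second, each 1 kB
many_small = throughput_bytes_per_second(100_000, 1, 1 * KB)

# 3 requests per minute, each 2 GB
few_large = throughput_bytes_per_second(3, 60, 2 * GB)

# Both come to 100,000,000 bytes per second, i.e. 100 MB/s.
print(many_small, few_large)
```

The totals match, yet the first system must handle a very high request rate while the second must move a few very large payloads, which is why the two designs look so different.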

---

### **Maintainability**

We can and should design software in such a way that it will hopefully minimize
pain during maintenance, and thus avoid creating legacy software ourselves. To
this end, we will pay particular attention to three design principles for software
systems:
* **Operability:**
Make it easy for operations teams to keep the system running smoothly.
* **Simplicity:**
Make it easy for new engineers to understand the system, by removing as much
complexity as possible from the system. (Note this is not the same as simplicity
of the user interface.)
* **Evolvability:**
Make it easy for engineers to make changes to the system in the future, adapting
it for unanticipated use cases as requirements change. Also known as extensibility,
modifiability, or plasticity.

As previously with reliability and scalability, there are no easy solutions for achieving
these goals. Rather, we will try to think about systems with operability, simplicity,
and evolvability in mind.

#### **Operability**

Good operability means making routine tasks easy, allowing the operations team to
focus their efforts on high-value activities. Data systems can do various things to
make routine tasks easy, including:
* Providing visibility into the runtime behavior and internals of the system, with
good monitoring
* Providing good support for automation and integration with standard tools
* Avoiding dependency on individual machines (allowing machines to be taken
down for maintenance while the system as a whole continues running uninterrupted)
* Providing good documentation and an easy-to-understand operational model
(“If I do X, Y will happen”)
* Providing good default behavior, but also giving administrators the freedom to
override defaults when needed
* Self-healing where appropriate, but also giving administrators manual control
over the system state when needed
* Exhibiting predictable behavior, minimizing surprises

#### **Simplicity**
One of the best tools we have for removing accidental complexity is abstraction. A
good abstraction can hide a great deal of implementation detail behind a clean,
simple-to-understand façade. A good abstraction can also be used for a wide range of
different applications. Not only is this reuse more efficient than reimplementing a
similar thing multiple times, but it also leads to higher-quality software, as quality
improvements in the abstracted component benefit all applications that use it.

#### **Evolvability**

In terms of organizational processes, Agile working patterns provide a framework for
adapting to change. The Agile community has also developed technical tools and patterns
that are helpful when developing software in a frequently changing environment,
such as test-driven development (TDD) and refactoring.

Maintainability has many facets, but in essence it’s about making life better for the
engineering and operations teams who need to work with the system. Good abstractions
can help reduce complexity and make the system easier to modify and adapt for
new use cases. Good operability means having good visibility into the system’s health,
and having effective ways of managing it.