## Chapter 1

The book focuses on three important concerns:
1. Reliability: The system should continue to work correctly, even when faults occur.
2. Scalability: As the system grows, there should be reasonable ways to deal with that growth.
3. Maintainability: Over time, many different people will work on the system, and they should all be able to work on it productively.

### Reliability

- Fault: A component of the system deviates from its spec.
- Failure: The system as a whole stops working.

It is impossible to reduce the number of faults to zero. The system should instead be designed so that faults do not cause failures.

#### Hardware Faults

- The mean time to failure of a single hardware component is generally large. However, when we have a lot of hardware, the probability that at least one component fails increases.

If each component fails independently with probability $\alpha$, and we have $N$ components (say, N=10_000), the probability that at least one component fails is $1 - (1-\alpha)^N$.

```python
def failure_prob(N, alpha):
    """Percent chance that at least one of N independent components fails."""
    return round((1 - (1 - alpha)**N) * 100, 2)


for alpha in (.0001, .001, .01):
    for N in (1000, 10_000):
        print(f'{N=:5}, {alpha=:6}, failure_prob={failure_prob(N, alpha):5}%')
```

```
N= 1000, alpha=0.0001, failure_prob= 9.52%
N=10000, alpha=0.0001, failure_prob=63.21%
N= 1000, alpha= 0.001, failure_prob=63.23%
N=10000, alpha= 0.001, failure_prob=100.0%
N= 1000, alpha=  0.01, failure_prob=100.0%
N=10000, alpha=  0.01, failure_prob=100.0%
```
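
The two results near 63% are not a coincidence. As a rough check, assuming $\alpha$ is small so that $\ln(1-\alpha) \approx -\alpha$, the formula can be approximated by an exponential:

$$
1 - (1-\alpha)^N = 1 - e^{N \ln(1-\alpha)} \approx 1 - e^{-\alpha N}
$$

So whenever $\alpha N = 1$, the probability is close to $1 - e^{-1} \approx 63.2\%$, which is what the rows for N=10000, alpha=0.0001 and N=1000, alpha=0.001 show.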


In earlier days, hardware faults were dealt with by adding redundancy to individual components, such as RAID and backup power. In the cloud, N is too large for that alone to be enough, and it is common for a VM to become unavailable. Thus, it is common to design systems that can tolerate the loss of a whole machine using software fault-tolerance techniques.
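
As a rough illustration of why redundancy helps, the sketch below computes the chance that every copy of a component fails at once. It assumes each copy fails independently with the same probability `alpha`, which is optimistic for real hardware (a shared power supply or rack breaks that assumption). The helper name `all_replicas_fail_prob` is made up for these notes.

```python
def all_replicas_fail_prob(replicas, alpha):
    """Percent chance that all `replicas` independent copies fail together."""
    return alpha ** replicas * 100


# Assumed per-copy failure probability of 1%.
for replicas in (1, 2, 3):
    print(f'{replicas=}, all_fail={all_replicas_fail_prob(replicas, 0.01):.4f}%')
```

Under this independence assumption, each extra copy multiplies the chance of losing everything by `alpha`, so going from one copy to two takes it from about 1% to about 0.01%.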

#### Software Faults

Software errors tend to cause more failures than hardware faults, e.g., a bug in the Linux kernel, a rogue process eating up resources, or cascading failures.

These bugs often remain dormant and surface only rarely. When they do, we generally find that the original designers made certain assumptions that no longer hold true.

### Scalability

Systems that work fine today may not work when the "load" increases. Load can be described with a few numbers called load parameters. The choice of load parameters depends on the application: it could be the request rate (e.g., requests per minute), the ratio of reads to writes in a database, the number of active users in a chat room, etc.

Performance can be looked at in two ways. When a load parameter increases:
1. How does system performance change if the hardware is kept constant?
2. How much do the resources need to be increased to keep performance constant?

Performance is generally described by:
- throughput in offline systems
- response time in online systems

It is important to differentiate between latency and response time. Response time is what the user sees: the time to process the request (the service time) plus network and queueing delays. Latency is the time a request spends waiting to be handled. If resources are in heavy demand, requests stay in the queue for a long time: the service time stays the same, but latency, and therefore response time, goes up.
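
The effect of queueing can be sketched with a tiny single-server simulation. This is a simplified model, assuming evenly spaced arrivals, a fixed processing time, and first-come-first-served handling; the function name and the numbers are invented for illustration.

```python
def simulate_response_times(arrival_interval, processing_time, num_requests):
    """Single-server FIFO queue with evenly spaced arrivals.

    Returns the response time (queue wait + processing) of each request.
    """
    response_times = []
    server_free_at = 0  # time at which the server finishes its current work
    for i in range(num_requests):
        arrival = i * arrival_interval
        start = max(arrival, server_free_at)  # wait if the server is still busy
        finish = start + processing_time
        response_times.append(finish - arrival)
        server_free_at = finish
    return response_times


# Processing always takes 10 ms; only the arrival rate changes.
for arrival_interval in (20, 10, 8):
    times = simulate_response_times(arrival_interval, processing_time=10, num_requests=5)
    print(f'{arrival_interval=:2} ms -> response times (ms): {times}')
```

As long as requests arrive no faster than one every 10 ms, every response takes 10 ms. At one request every 8 ms, each request waits a little longer than the one before it, even though the processing time never changes.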

Response times are usually reported as percentiles such as p95, p99, and p999 (the 95th, 99th, and 99.9th percentiles).
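
A minimal sketch of how such percentiles could be computed from raw measurements, using the nearest-rank method (one of several common definitions; monitoring tools may interpolate differently). The sample data and the helper name `percentile` are made up for illustration.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = math.ceil(p * len(sorted_values) / 100)  # 1-based rank
    return sorted_values[max(rank, 1) - 1]


# Hypothetical response times in ms: mostly fast, with a slow tail.
response_times_ms = sorted([20] * 9000 + [50] * 850 + [300] * 130 + [2000] * 20)

for p in (50, 95, 99, 99.9):
    print(f'p{p}: {percentile(response_times_ms, p)} ms')
```

With this made-up data the median is 20 ms, while the higher percentiles expose the slow tail that an average would hide.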

Similar to the previous analysis, even if a single request has only a 1 in 1000 chance of hitting the tail response time, a user who sends N requests has a higher chance of having at least one bad experience.

```python
for alpha in (0.01, 0.001):
    for N in (10, 25, 50):
        print(f'{N=:2}, {alpha=:6}, failure_prob={failure_prob(N, alpha):5}%')
```

```
N=10, alpha=  0.01, failure_prob= 9.56%
N=25, alpha=  0.01, failure_prob=22.22%
N=50, alpha=  0.01, failure_prob= 39.5%
N=10, alpha= 0.001, failure_prob=  1.0%
N=25, alpha= 0.001, failure_prob= 2.47%
N=50, alpha= 0.001, failure_prob= 4.88%
```


In addition, the customers with the slowest experience may well be the ones who buy the most on your platform: their large purchase history is what causes the higher response times.

## Chapter 2: Data Models and Query Language