# Designing Data-Intensive Applications: Step-by-step guide

## Step 1: Big picture of components (See chapter 1)
The first step in designing a data-intensive application is to broadly identify the components that will come into play. You can ask yourself questions like:
* How big is my data?
* How complex is it?
* How fast does it change?
* Does it need caching processes?
* Does it need to allow users to search data by keyword or filter it in various ways (search indexes)?

The idea is to clarify a broad picture of the components that are needed in the architecture, and then go deeper into each component by selecting the technology that best fits the overall need of the application.

![](images/image_2.png)

## Step 2: Designing for reliability (See chapter 1)
The idea is to choose an initial prevention strategy (it may change at any stage of the design process) for each kind of fault:
* Hardware faults
    - Adding redundancy on hardware components
    - Using systems that can tolerate the loss of an entire machine (distributed systems)
* Software errors
    - Check software assumptions and interactions
    - Testing
    - Monitoring, with alerts that fire when assumptions about the software or its interactions are violated
* Human errors
    - Well designed abstractions
    - Testing

## Step 3: Designing for scalability (See chapter 1)
Scalability is the term we use to describe a system’s ability to cope with increased load. It includes questions like:
* If the system grows in a particular way, what are our options for coping with the growth?
* How can we add computing resources to handle the additional load?

In this step you must identify the following parameters (the examples are not exhaustive):
* Load parameter
    - Requests per second to a web server
    - Ratio of reads to writes in a database
    - Number of simultaneously active users in a chat room
    - Hit rate on a cache
* Performance parameter
    - Throughput (batch processing systems)
    - Response time (online systems)
    
Based on these parameters, the architect chooses a scalability strategy that absorbs the additional load without degrading performance (e.g. manual scaling vs. _elastic_ systems).
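To make the two performance parameters concrete, here is a minimal sketch that summarizes a window of measured response times as throughput and percentiles. The function names (`percentile`, `summarize`), the nearest-rank percentile definition, and the sample values are assumptions chosen for illustration, not something prescribed by the book.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def summarize(response_times_ms, window_seconds):
    """Summarize one measurement window (hypothetical helper)."""
    ordered = sorted(response_times_ms)
    return {
        "throughput_rps": len(ordered) / window_seconds,  # requests per second
        "p50_ms": percentile(ordered, 50),                # median response time
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
    }

# Made-up samples: most requests are fast, two are slow outliers.
samples = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]
stats = summarize(samples, window_seconds=2)
print(stats)
```

The median stays low here while the high percentiles are dominated by the outliers, which is why a single average is usually a poor way to describe response time.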

## Step 4: Designing for maintainability (See chapter 1)
The idea is to choose a strategy for improving:
* Operability
    - Good monitoring
    - Good support for automation and integration with standard tools
    - Avoiding dependency on individual machines
    - Good documentation
* Simplicity
    - Good abstraction
* Evolvability
    - Agile working patterns
    - Test-driven development (TDD)
    - Refactoring

## Step 5: Choosing a data model (See chapter 2)

Depending on how many relationships exist in the data, the most popular data models for databases are relational, document, and graph.

* Relational databases fit the majority of cases.
* Document databases target use cases where data comes in self-contained documents
and relationships between one document and another are rare.
* Graph databases go in the opposite direction, targeting use cases where anything
is potentially related to everything.
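As a rough illustration of the three models, the sketch below holds similar data in each shape using plain Python structures. All names and records (`users`, `positions`, `user_doc`, `edges`, and the helper functions) are hypothetical. A real system would use a database and its query language rather than in-memory lists.

```python
from collections import deque

# Relational: normalized rows, linked by a shared key and joined at query time.
users = [{"user_id": 1, "name": "Ada"}]
positions = [
    {"user_id": 1, "title": "Engineer"},
    {"user_id": 1, "title": "Analyst"},
]

def titles_relational(user_id):
    return [p["title"] for p in positions if p["user_id"] == user_id]

# Document: one self-contained record, so no join is needed to read it.
user_doc = {
    "user_id": 1,
    "name": "Ada",
    "positions": [{"title": "Engineer"}, {"title": "Analyst"}],
}

def titles_document(doc):
    return [p["title"] for p in doc["positions"]]

# Graph: anything can relate to anything, stored as (source, label, target) edges.
edges = [
    ("ada", "KNOWS", "alan"),
    ("alan", "KNOWS", "grace"),
    ("ada", "WORKS_AT", "acme"),
]

def reachable(start, label):
    """Follow edges with the given label over any number of hops (breadth-first)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for src, edge_label, dst in edges:
            if src == node and edge_label == label and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen - {start}

print(titles_relational(1), titles_document(user_doc), reachable("ada", "KNOWS"))
```

The multi-hop traversal in `reachable` is the kind of query that becomes awkward in the other two models once the number of hops is not known in advance.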

Also take into account the query language each model uses and how each model handles schemas. In the relational model **the schema is explicit** (enforced on write), whereas in the document and graph models **the schema is implicit** (handled on read).
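The difference between the two schema approaches can be sketched in a few lines. This is an illustration under assumptions: `REQUIRED_FIELDS`, the function names, and the "older document shape" with `first_name`/`last_name` are all invented for the example, and the lists stand in for a real table and collection.

```python
import json

REQUIRED_FIELDS = {"id": int, "name": str}  # hypothetical explicit schema

def write_with_schema(table, row):
    """Schema-on-write: rows that do not match the schema are rejected."""
    if set(row) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(row[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(row)

def write_document(collection, doc):
    """Schema-on-read: whatever arrives is stored as serialized JSON."""
    collection.append(json.dumps(doc))

def read_name(raw_doc):
    """The reading code interprets the structure, including older shapes."""
    doc = json.loads(raw_doc)
    if "name" in doc:
        return doc["name"]
    # Hypothetical older document shape with separate name fields.
    return f'{doc.get("first_name", "")} {doc.get("last_name", "")}'.strip()

table = []
write_with_schema(table, {"id": 1, "name": "Ada Lovelace"})

collection = []
write_document(collection, {"id": 1, "name": "Ada Lovelace"})
write_document(collection, {"id": 2, "first_name": "Alan", "last_name": "Turing"})
names = [read_name(raw) for raw in collection]
print(names)
```

With the explicit schema, a malformed row fails at write time. With the implicit schema, both document shapes are accepted and the burden moves to every piece of code that reads them.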

If the data needs to be stored as an opaque binary object (text, video, JSON document, etc.), a key-value database is often the best choice, since key-value stores are highly effective at scaling applications that deal with high-velocity, non-transactional data.

## Step 6: Choosing a storage and processing system (See chapter 3)
Depending on the read and write patterns of the application, either an OLTP or an OLAP system should be used.

![](images/image_16.png)

For OLTP systems, storage engines are divided into log-structured and page-oriented.

* Log-structured
    - Hash index: Bitcask (the default storage engine in Riak)
    - LSM-Tree: Cassandra, HBase
    - Search engines: Lucene
* Page-oriented
    - B-Tree: LMDB
    - B+ Tree: SQL Server, Oracle
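To show what the simplest log-structured design in the list looks like, here is a minimal sketch of an append-only log with an in-memory hash index. The class name `HashIndexedLog` and the `key,value` line format are assumptions made for the example. It assumes keys contain no commas and neither keys nor values contain newlines, and it leaves out compaction, crash recovery, and deletes.

```python
import io

class HashIndexedLog:
    """Append-only log plus an in-memory hash map from key to byte offset."""

    def __init__(self, log_file):
        self.log = log_file   # any seekable binary file object
        self.index = {}       # key -> offset of the most recent record

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(record)     # old records are never overwritten
        self.index[key] = offset   # the index now points at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)      # one seek, then one sequential read
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value

db = HashIndexedLog(io.BytesIO())
db.put("user:1", "Ada")
db.put("user:2", "Alan")
db.put("user:1", "Ada Lovelace")   # an update is just another append
print(db.get("user:1"))
```

Writes are sequential appends, which is what makes this family of engines fast for writes. The cost is that superseded records pile up in the log until something reclaims them.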

As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees
are thought to be faster for reads. Reads are typically slower on LSM-trees
because they have to check several different data structures and SSTables at different
stages of compaction.
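The read-path cost described above can be sketched with a toy LSM-tree. `TinyLSM`, `SSTable`, and `memtable_limit` are hypothetical names, and the sketch keeps everything in memory. It omits compaction, Bloom filters, tombstones, and on-disk files, so it only illustrates why a read may touch several structures while a write touches one.

```python
import bisect

class SSTable:
    """Immutable run of key-value pairs, sorted by key."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)   # binary search in the sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return True, self.values[i]
        return False, None

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []   # newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # a write only touches the memtable
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable into a new sorted, immutable run.
            self.sstables.insert(0, SSTable(self.memtable))
            self.memtable = {}

    def get(self, key):
        """Return (value, number of structures checked)."""
        checked = 1
        if key in self.memtable:
            return self.memtable[key], checked
        for table in self.sstables:   # newest to oldest
            checked += 1
            found, value = table.get(key)
            if found:
                return value, checked
        return None, checked

db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]:
    db.put(k, v)
print(db.get("e"), db.get("a"), db.get("z"))
```

A recently written key is found in the first structure, an old key requires walking through every newer run first, and a missing key forces a check of all of them.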

For OLAP systems the most important factor is the data model, where star and snowflake schemas are the main options. Disk bandwidth (not seek time) is often the bottleneck, and column-oriented storage is an increasingly popular solution for this kind of workload.
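A small sketch can show why a column-oriented layout helps when bandwidth is the bottleneck: an analytic query usually needs only a few columns, so the rest never has to be read. The fact table below and the helper `to_columns` are made up for illustration, and real column stores add compression and on-disk encoding that this sketch ignores.

```python
# Row-oriented layout: each record keeps all of its columns together.
rows = [
    {"date": "2024-01-01", "product": "apple", "quantity": 3, "price": 0.5},
    {"date": "2024-01-01", "product": "pear", "quantity": 2, "price": 0.8},
    {"date": "2024-01-02", "product": "apple", "quantity": 4, "price": 0.5},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns

columns = to_columns(rows)

# Analytic query: total quantity sold for one product.
# Only the "product" and "quantity" columns are scanned;
# "date" and "price" are never touched.
total_apples = sum(
    quantity
    for product, quantity in zip(columns["product"], columns["quantity"])
    if product == "apple"
)
print(total_apples)
```

The i-th entry of every column belongs to the same row, which is how a column store reassembles a full record when it needs one.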

* Column-oriented products
    - Apache Parquet
* Data warehouse products
    - OLTP and OLAP: SAP HANA, SQL Server
    - OLAP (paid): Teradata, Vertica, Amazon Redshift, Google BigQuery
    - OLAP (open source): Apache Hive, Spark SQL, Impala, Presto, Apache Tajo, Apache Drill