# "Chapter 2: The Software: ClickHouse and Tinybird"
> "SQL queries and scaling-up ever-growing amounts of data"

- toc: true
- branch: master
- badges: true
- comments: false
- categories: [real-time, analytics, queries, SQL]
- image: images/tb_logo_navbar.png
- hide: false
- search_exclude: true
- sticky-rank: 3


#The Software: ClickHouse and Tinybird

If you are comfortable writing queries in SQL and need to scale up to deal with ever-growing amounts of data, then this course is for you.

The choice of software is not important here because we are interested in the principles of real-time analytics, which hold whatever software you use. However, since the best way to learn is to write code and experiment, running the examples will help you. The examples we show use ClickHouse and Tinybird (which is built on ClickHouse) because ClickHouse is a database management system built for fast queries on large amounts of data.

**ClickHouse Installation**
ClickHouse is a fast, open-source, column-oriented OLAP (online analytical processing) database management system that can be [installed](https://clickhouse.tech/docs/en/getting-started/install/) on many systems.

**Tinybird Installation**
[Tinybird](https://www.tinybird.co/) is the easiest way to develop real-time analytics APIs over large quantities of data for low-latency and high-concurrency applications. You can use Tinybird’s user interface, CLI tool or REST API to get the data you need in real time to build your applications.

Sign up [here](https://www.tinybird.co/signup) to open an account to run the examples in these notebooks.


#Modern Computer Hardware
Most likely, you either remember what you were taught about hardware architectures back when you were a student or you don’t know anything at all about hardware. That’s fine: you don’t need to know how the engine works to drive a car. Still, a few really basic concepts about how hardware works nowadays are useful, so let’s introduce them here.

Let’s start with the simplest approach: a computer has a CPU that executes instructions stored in memory and communicates with the rest of the world using various IO devices (network, disk…). That’s an oversimplification of a machine but it’s enough to start with.


##Data Storage Systems
If we simplify, we have main memory and external storage devices. In theory, main memory is much faster and offers random access (any address takes the same time to reach), while disk is slower. We were taught that disk access is not random because a spinning disk has to move to the right position before it can read. This is still true today, even with SSDs (although they are much faster). And obviously, data in memory is not permanent.

But the reality is more complex: there are several cache layers in the CPU (L1, L2, L3…), multiple CPUs, main memory that does not really behave as random access, and the “slow” disk, which is often accessed across the network (although, if we do things right, the OS will map it to main memory).


##CPU Speed
Back in the 1980s, a CPU was a piece of hardware that fetched instructions from memory and executed them. Today, CPUs are complex things that fetch instructions from memory in parallel, group instructions to parallelize execution and have many execution pipelines that run in parallel.

The important takeaway here is that CPUs are much, much faster than memory, so to take advantage of that speed we need to organize our data in such a way that the CPUs are not left stalled waiting for it.

It makes no sense having 32 CPUs in a machine if 90% of the time they are waiting for data.

It is not only clock speed and pipeline complexity that have changed. Today’s CPUs also have specific instruction sets that optimize some use cases. For example, there are instructions that add several numbers at a time, whereas a simpler CPU can add just one number at a time. This is called vectorization and we will see how analytical databases use it to speed up computation.
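
To get a feeling for why vectorization matters, here is a minimal Python sketch that only counts instructions. It is a back-of-the-envelope model, not a benchmark: the function name `add_instructions` is made up for this example, and the lane count of 8 assumes 32-bit values packed into a 256-bit register. It also ignores everything else that affects real speed, such as waiting for memory.

```python
import math


def add_instructions(n_values: int, lanes: int = 1) -> int:
    """Add instructions needed to add two columns of n_values numbers
    element-wise when each instruction handles `lanes` values at once."""
    if n_values < 0 or lanes < 1:
        raise ValueError("n_values must be >= 0 and lanes must be >= 1")
    return math.ceil(n_values / lanes)


rows = 1_000_000
scalar = add_instructions(rows, lanes=1)      # one number per instruction
vectorized = add_instructions(rows, lanes=8)  # assumed: 8 x 32-bit values per instruction

print(scalar, vectorized)
```

With these assumptions the vectorized version needs one eighth of the add instructions for the same column of data.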


##Memory Speed
We need to differentiate between latency and speed/throughput. If you want to move things from one place to another you could use a car. A car is fast but it cannot carry much in one trip. If you use a truck it will take more time to get there but you will carry a much bigger load. The car has low latency and low throughput; the truck has high latency and high throughput.

Given our use case, what we are looking for is speed, or throughput. If we just wanted to get a single value from main memory, latency would matter more, but when working with large datasets we usually fetch megabytes of data per query.
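
The car and truck comparison can be turned into a few lines of Python. This is only a sketch: the `Vehicle` class and all the numbers (trip times, boxes per trip) are invented for illustration, and return trips are ignored.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    name: str
    trip_hours: float    # latency: how long one trip takes
    boxes_per_trip: int  # how much fits in one trip

    def hours_to_move(self, boxes: int) -> float:
        """Total time to move `boxes`, making as many trips as needed."""
        trips = -(-boxes // self.boxes_per_trip)  # ceiling division
        return trips * self.trip_hours


# Hypothetical numbers, chosen only to make the trade-off visible.
car = Vehicle("car", trip_hours=1.0, boxes_per_trip=10)
truck = Vehicle("truck", trip_hours=2.0, boxes_per_trip=500)

for boxes in (1, 1000):
    print(boxes, car.hours_to_move(boxes), truck.hours_to_move(boxes))
```

For a single box the car wins because its latency is lower. For a thousand boxes the truck wins by a wide margin, and that is the situation we are in when a query has to read megabytes of data.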

GPUs are an extreme example of this: they have a very specific access pattern and their memory layout is designed to optimize bandwidth. Read [Fabian Giesen on how GPUs work](https://fgiesen.wordpress.com/2011/07/02/a-trip-through-the-graphics-pipeline-2011-part-2/) to understand how each architecture differs depending on the use case.

In this fantastic [article](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/) by Forrest Smith, Memory Bandwidth Napkin Math, you can see different memory speed benchmarks:

per core
 - L1 Bandwidth:  210 GB/s 
 - L2 Bandwidth:  80 GB/s
 - L3 Bandwidth:  60 GB/s

whole system
 - RAM Bandwidth: 45 GB/s

memory sizes
 - L1 cache: 192 kilobytes (32 KB per core)
 - L2 cache: 1.5 megabytes (256 KB per core)
 - L3 cache: 12 megabytes  (shared; 2 MB per core)
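
As a rough way of reading the numbers above, the following Python sketch looks up the fastest level that can hold a given working set. The function name `fastest_level_for` is made up for this example, and the model is deliberately naive: it assumes one core, treats KB and MB as powers of 1024, and pretends that core can use the whole shared L3.

```python
KB = 1024
MB = 1024 * KB

# (level, capacity in bytes for one core, bandwidth in GB/s), from the lists above.
LEVELS = [
    ("L1", 32 * KB, 210),
    ("L2", 256 * KB, 80),
    ("L3", 12 * MB, 60),  # shared; assumed fully available to this core
]
RAM_GB_PER_S = 45  # whole-system figure, used as the fallback


def fastest_level_for(working_set_bytes: int) -> tuple[str, int]:
    """Return (level, GB/s) for the smallest level that holds the working set."""
    for name, capacity, gb_per_s in LEVELS:
        if working_set_bytes <= capacity:
            return name, gb_per_s
    return "RAM", RAM_GB_PER_S


for size in (16 * KB, 100 * KB, 4 * MB, 1024 * MB):
    print(size, fastest_level_for(size))
```

A 16 KB working set fits in L1, while a 1 GB column can only be streamed from RAM, which is several times slower according to the same figures.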


But those figures are the theory: in practice, speed depends a lot on how memory is accessed. These are the main takeaways about memory speed in real scenarios:
 - Random access is around 10x slower than sequential access.
 - Accessing already cached data is way faster than accessing non-cached data.
 - With cached data, speed increases linearly with the number of cores.
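
Some napkin math of our own shows what the first takeaway means for a query. The sketch below assumes that 1 GB of data is read from RAM at the 45 GB/s listed earlier, and applies the 10x penalty for random access on top of it. The function `scan_ms` and the combination of the two figures are assumptions made for this example, not measurements.

```python
RAM_GB_PER_S = 45    # whole-system RAM bandwidth from the list above
RANDOM_PENALTY = 10  # "random access is around 10x slower"


def scan_ms(n_bytes: int, gb_per_s: float = RAM_GB_PER_S,
            random_access: bool = False) -> float:
    """Estimated time in milliseconds to read n_bytes from memory."""
    seconds = n_bytes / (gb_per_s * 1e9)
    if random_access:
        seconds *= RANDOM_PENALTY
    return seconds * 1000


one_gb = 10**9
print(round(scan_ms(one_gb), 1))                      # sequential read
print(round(scan_ms(one_gb, random_access=True), 1))  # random read
```

Under these assumptions the sequential read takes roughly 22 ms and the random one roughly 222 ms, which is the difference between an API that feels instant and one that does not.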

To visualize it better, imagine you go to the grocery store and every single product you want to buy is on the same shelf. You would get all of them in a matter of seconds. In reality, however, products are usually organized by type, so you need to spend time moving between shelves to get all the items. In this example, you are the CPU and the shelves are the memory.

A great [talk](https://www.youtube.com/watch?v=MC1EKLQ2Wmg) for understanding how modern machines work is Mythbusting Modern Hardware to Gain 'Mechanical Sympathy' by Martin Thompson.

##Disk Speed
In general, disks are slow, or at least way slower than memory. As always, that’s not totally true: for example, SSDs are pretty decent at reading large pieces of data. Obviously, try to use SSDs when possible, but always keep an eye on the performance you get per $ spent.


##OS
We do not talk to the CPU, memory, disk or network directly. The operating system (OS) is a layer that helps us do that and controls how it is done. For example, the Linux kernel manages processes and decides which processes run, which disk pages are in memory and so on.

There are many things we should take into account related to the OS (threads, IO and so on) but the main one is the page cache, in other words, how the OS keeps disk data in memory so that access is faster. If you are not careful, you might end up fetching more disk pages than needed and therefore increasing the amount of data read. And you already know that reading from disk is slow.
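
To see how easy it is to fetch more pages than needed, here is a small Python sketch that counts the pages touched by a set of reads. The 4096-byte page size is an assumption (it is a common default, but check your own system), and `bytes_fetched` is a helper written for this example, not an OS API.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def bytes_fetched(reads, page_size: int = PAGE_SIZE) -> int:
    """Bytes brought in for a list of (offset, length) reads,
    counting every distinct page once."""
    pages = set()
    for offset, length in reads:
        if length <= 0:
            continue
        first = offset // page_size
        last = (offset + length - 1) // page_size
        pages.update(range(first, last + 1))
    return len(pages) * page_size


# 1000 values of 8 bytes each (8000 useful bytes), laid out in two ways.
contiguous = [(0, 8000)]
scattered = [(i * PAGE_SIZE, 8) for i in range(1000)]  # one value per page

print(bytes_fetched(contiguous))  # 2 pages
print(bytes_fetched(scattered))   # 1000 pages
```

The useful data is the same in both cases, but the scattered layout touches 500 times more pages. Keeping the values a query needs next to each other is what keeps the amount of data read small.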

As a side note, there are databases that use direct IO to bypass the OS page cache so you’ll need to spend some time researching what the best practices are when using them.


##Cloud
When you run software on a machine in the cloud, most likely you are running an application on top of a virtualized OS. In other words, there is (at least) one more layer between your application and the hardware. That means your application might be fighting for resources, especially page cache. If you assume that part of the data you are querying is in memory but it really isn’t because another guest is using that memory, then your API response time percentiles are going to grow.
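
The effect on percentiles is easy to underestimate, so here is a toy Python model of it. All the numbers are invented: a request served from memory is assumed to take 5 ms, a request that has to go to disk 50 ms, and we compare 20 against 100 cache misses per 1000 requests. The helpers `latencies` and `percentile` (nearest-rank method) are written for this example.

```python
def latencies(requests: int, misses: int, hit_ms: int = 5, miss_ms: int = 50):
    """Response times for `requests` calls, `misses` of which go to disk."""
    return [hit_ms] * (requests - misses) + [miss_ms] * misses


def percentile(values, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p% of the sample size
    return ordered[rank - 1]


quiet = latencies(1000, misses=20)   # data mostly stays in memory
noisy = latencies(1000, misses=100)  # a neighbour pushes our data out

for name, sample in (("quiet", quiet), ("noisy", noisy)):
    print(name, [percentile(sample, p) for p in (50, 95, 99)])
```

In this model the median stays at 5 ms in both cases, but the 95th percentile jumps from 5 ms to 50 ms once more requests miss the cache. That is why looking only at averages or medians hides the problem.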

A great [talk](https://www.infoq.com/presentations/low-latency-cloud-oss/) about this is Achieving Low-Latency in the Cloud with OSS by Mark Price.

On the other hand, in the cloud you can have cheap storage and “unlimited” network and resources, so if your data grows unpredictably, that is something you should value when deciding where to put your data. You can also take advantage of spot instances to deal with the load and use lots of smaller instances, so that if one of them goes down the impact is small.


## [Course Outline](https://colab.research.google.com/github/AlisonJD/RTACourse/blob/main/01_Getting_Started.ipynb)

|Previous Notebook       |Next Notebook|
| :----------------- |:-------------|
|1. [Getting Started - How to Run the Notebooks in Google Colab and Tinybird](https://colab.research.google.com/github/AlisonJD/RTACourse/blob/main/01_Getting_Started.ipynb) |3. [Use the Right Database](https://colab.research.google.com/github/AlisonJD/RTACourse/blob/main/03_Use_the_Right_Database.ipynb)|