## Getting started with Monitoring

The key to ensuring all of this is to set up good monitoring and alerting rules.

**Monitoring lets us look into the history and current status of a system**. How can we know what the status is? We'll check out a bunch of different metrics.

Well, some metrics are generic, like how much memory an instance is using. Other metrics are specific to the service we want to monitor. Say your company is running a website and you want to check if it's working correctly. When a web server responds to an HTTP request, it starts by sending a response code, followed by the content of the response.

Response codes in the 400 range mean there was a client-side problem in the request, while codes in the 500 range mean the server failed to handle it. When monitoring your web service, you want to check both the count of response codes and their types to know if everything's okay.
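As a rough illustration of checking both the count and the type of response codes, here is a minimal Python sketch. The `summarize_status_codes` helper and the sample list are hypothetical; a real setup would read the codes from server logs or a metrics library.

```python
from collections import Counter


def summarize_status_codes(codes):
    """Count responses per class: 2xx, 3xx, 4xx, 5xx."""
    return Counter(f"{code // 100}xx" for code in codes)


# Made-up sample of response codes, for illustration only.
sample = [200, 200, 404, 500, 301, 200, 403, 503]
summary = summarize_status_codes(sample)
print(summary)
```

A growing `4xx` or `5xx` count relative to `2xx` is the kind of signal a monitoring system would track over time.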

Now, once we've decided what metrics we care about, what do we do with them? We'll typically store them in the monitoring system.

**Some systems, like AWS CloudWatch, Google Stackdriver, or Azure Metrics, are offered directly by the cloud providers. Other systems, like Prometheus, Datadog, or Nagios, can be used across vendors**.

There are two ways of getting our metrics into the monitoring system.

#### Pull Model
**Some systems use a pull model, which means that the monitoring infrastructure periodically queries our service to get the metrics**. 

#### Push Model

**Other monitoring systems use a push model, which means that our service needs to periodically connect to the system to send the metrics**. No matter how we get the metrics into the system, we can create dashboards based on the collected data.
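The difference between the pull and push models can be sketched with two toy classes. This is only an in-memory illustration: `Service`, `MonitoringSystem`, and their methods are hypothetical names, and real systems exchange metrics over the network rather than through direct method calls.

```python
class Service:
    """A toy service that tracks a single metric (hypothetical)."""

    def __init__(self, monitor=None):
        self.requests_served = 0
        self.monitor = monitor  # only needed for the push model

    def handle_request(self):
        self.requests_served += 1

    def metrics(self):
        # Pull model: the monitoring system calls this periodically.
        return {"requests_served": self.requests_served}

    def push_metrics(self):
        # Push model: the service sends its metrics to the monitor.
        self.monitor.receive(self.metrics())


class MonitoringSystem:
    """Stores every sample it collects or receives."""

    def __init__(self):
        self.samples = []

    def scrape(self, service):
        self.samples.append(service.metrics())

    def receive(self, sample):
        self.samples.append(sample)


monitor = MonitoringSystem()
pulled = Service()
pushed = Service(monitor=monitor)

pulled.handle_request()
pushed.handle_request()
pushed.handle_request()

monitor.scrape(pulled)  # pull: the monitor asks the service
pushed.push_metrics()   # push: the service reports to the monitor
```

Either way, the samples end up in the same place, which is why dashboards work the same regardless of the model.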


On a dashboard, we can look at the progression of two or more metrics together to check how a change in one metric affects another.

> Pro tip: you only want to store the metrics that you care about, since storing all of these metrics in the system takes space, and storage space costs money.

### Whitebox Monitoring 

When we collect metrics from inside a system, like how much storage space the service is currently using or how long it takes to process a request, this is called whitebox monitoring. **Whitebox monitoring checks the behavior of the system from the inside**. We know the information we want to track, and we're in charge of making it possible to track.

### Blackbox Monitoring

**Blackbox monitoring checks the behavior of the system from the outside**. This is typically done by making a request to the service and then checking that the actual response matches the expected response.
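A blackbox check of this kind might look like the following sketch, which uses only the Python standard library. The function names, the default expectations, and the injectable `fetcher` parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=5):
    """Return (status_code, body) for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep the status code.
        return err.code, ""


def blackbox_check(url, expected_status=200, expected_text="", fetcher=fetch):
    """Compare the actual response with the expected one, from the outside."""
    try:
        status, body = fetcher(url)
    except OSError:
        # Connection refused, DNS failure, timeout: the service is unreachable.
        return False
    return status == expected_status and expected_text in body
```

A call such as `blackbox_check("https://example.com/", expected_text="Example")` would perform a real request; passing a different `fetcher` makes the check easy to exercise without a network.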

We can set up alerting rules to let us know if something's wrong. This is a critical part of ensuring a reliable system.
 

## Getting Alerts When Things Go Wrong

It's much better to create automation that checks the health of your system and notifies you when things don't behave as expected.
 
So how do we do that? The most basic approach is to run a job periodically that checks the health of the system and sends out an email if the system isn't healthy. **On a Linux system, we could do this using cron, which is the tool to schedule periodic jobs**. We'd pair this with a simple Python script that checks the service and sends any necessary emails. This is an extremely simplified version of an alerting system, but it shares the same principles as all alerting systems, no matter how complex and advanced.
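The cron-plus-script approach could be sketched roughly as follows. Everything specific here is an assumption: the crontab schedule and file path, the host and port being checked, the email addresses, and the presence of a mail relay on `localhost`.

```python
# Hypothetical crontab entry that runs this script every five minutes:
# */5 * * * * /usr/bin/python3 /opt/monitoring/health_check.py
import smtplib
import socket
from email.message import EmailMessage


def service_is_healthy(host, port, timeout=3):
    """Treat the service as healthy if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_alert(host, port, sender, recipient):
    """Create the email that reports the failed check."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {host}:{port} is not responding"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"The health check could not connect to {host}:{port}.")
    return msg


def main():
    host, port = "localhost", 8080  # assumed service address
    if not service_is_healthy(host, port):
        msg = build_alert(host, port, "monitor@example.com", "oncall@example.com")
        with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
            smtp.send_message(msg)


# Under cron, the file would end with a call to main(). It is left
# uncalled here so the sketch has no side effects when loaded.
```

The check itself is deliberately crude. A real script would usually inspect the response as well, not just whether the port accepts connections.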

Raising an alert signals that something is broken and a human needs to respond. For example, you can set up your system to raise alerts if the application is using more than 10 gigabytes of RAM, or if it's responding with too many 500 errors, or if the queue of requests waiting to get processed gets too long. 
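The three example conditions above can be written as simple rules evaluated against the latest metrics. This is a sketch: the metric names are hypothetical, and only the 10 gigabyte limit comes from the text, while the 500-error and queue thresholds are made-up values.

```python
GB = 10 ** 9

# Each rule pairs an alert name with a condition on the current metrics.
ALERT_RULES = [
    ("high_memory", lambda m: m["ram_bytes"] > 10 * GB),
    ("too_many_500s", lambda m: m["http_500_rate"] > 0.01),  # assumed threshold
    ("queue_too_long", lambda m: m["queue_length"] > 1000),  # assumed threshold
]


def firing_alerts(metrics):
    """Return the names of all rules whose condition currently holds."""
    return [name for name, is_firing in ALERT_RULES if is_firing(metrics)]


current = {"ram_bytes": 12 * GB, "http_500_rate": 0.002, "queue_length": 40}
print(firing_alerts(current))
```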

**We typically divide useful alerts into two groups, those that need immediate attention and those that need attention in the near future.**

**Those that need immediate attention are called pages, which comes from a device called a pager.** Before mobile phones became popular, pagers were the device of choice for receiving urgent messages, and they're still used in some places around the world. Nowadays, most people receive their pages in other forms like SMS, automated phone calls, emails, or through a mobile app, but we still call them pages. 

**On the flip side, the non-urgent alerts are usually configured to create bugs or tickets for an IT specialist to take care of during their workday.** They can also be configured to send email to specific mailing lists or send a message to a chat channel that will be seen by the people maintaining the service. 

> One thing to highlight is that all alerts should be actionable.



## Service-Level Objectives

No system is ever available 100% of the time; it's just not possible. But depending on how critical the service is, it can have different service level objectives, or SLOs. **SLOs are pre-established performance goals for a specific service. Setting these objectives helps manage the expectations of the service users, and the targets also guide the work of those responsible for keeping the service running**. SLOs need to be measurable, which means that there should be metrics that track how the service is performing and let you check if it's meeting the objectives or not.



If our service has an SLO of 99% availability, it means it can be down up to 1% of the time. If we measure this over a year, the service can be down for a total of 3.65 days during the year and still have 99% availability. Availability targets like this one are commonly named by their number of nines. Our 99% example would be a two-nines service, 99.9% availability is a three-nines service, and 99.999% availability is a five-nines service. Five-nines services promise a total downtime of roughly five minutes at most in a year. **Five nines is super high availability, reserved only for the most critical systems**. A three-nines service, aiming for a maximum of just under nine hours of downtime per year, is fine for a lot of IT systems.

Now, you might be wondering, why not just make everything a five-nines service? It's a good question. The answer is that it's really expensive and usually not necessary. If your service isn't super critical and it's okay for it to be down briefly once in a while, having two or three nines of availability might be enough. You can keep the service running with a small team.
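The downtime figures for each number of nines follow from simple arithmetic, which this short sketch makes explicit. The function name is hypothetical, and a year is assumed to have 365 days.

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime allowed in the period for a given availability."""
    return period_hours * (1 - availability_percent / 100)


for target in (99, 99.9, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}%: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

For 99% this works out to 87.6 hours (3.65 days), for 99.9% to about 8.76 hours, and for 99.999% to about 5.26 minutes.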


Any service can have a bunch of different service level objectives like these. They tell its users what to expect from it. **Some services, like those that we pay for, also have more strict promises in the form of service level agreements, or SLAs. A service level agreement is a commitment between a provider and a client**.


Service level objectives, though, are more like a soft target. It's what the maintaining team aims for, but the target might be missed in some situations.

If you're in charge of a website, you'll typically measure the rate of responses with 500 return codes to check if your service is behaving correctly. If your SLO is 99% of successful requests, you can set up a non-paging alert if the rate of errors is above 0.5%, and a paging alert if it reaches 1%. 
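The two thresholds from this example can be expressed as a small function. It is a sketch under the stated numbers (non-paging above 0.5%, paging at 1%); the function name and the `"ticket"` / `"page"` labels are assumptions.

```python
def alert_level(errors, total, ticket_threshold=0.005, page_threshold=0.01):
    """Return "page", "ticket", or None based on the rate of 500 responses."""
    if total == 0:
        return None  # no traffic, so there is no error rate to judge
    rate = errors / total
    if rate >= page_threshold:
        return "page"    # the SLO is being violated: needs immediate attention
    if rate > ticket_threshold:
        return "ticket"  # heading toward the limit: handle during the workday
    return None
```

Setting the non-paging threshold well below the SLO gives the team time to react before the objective is actually missed.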

If your service was working fine and meeting all of its SLOs and then started misbehaving, it's likely this was caused by a recent change. That's why some teams use the concept of **error budgets to handle their services**.

Say you're running a service that has three nines of availability. This means the service can be down about 43 minutes per month (0.1% of a 30-day month). This is your error budget. You can tolerate up to 43 minutes of downtime, so you keep track of the total time the service was down during the month. If it starts to get close to those 43 minutes, you might decide to stop pushing any new features and focus on resolving the problems that keep causing the downtime.
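Tracking an error budget like this one could be sketched as below. The function names are hypothetical, a 30-day month is assumed, and the 80% warning fraction is an arbitrary stand-in for "close to" the budget.

```python
def error_budget_minutes(availability_percent, days=30):
    """Total downtime allowed in the period, in minutes."""
    return days * 24 * 60 * (1 - availability_percent / 100)


def should_freeze_features(availability_percent, downtime_minutes,
                           days=30, warn_fraction=0.8):
    """True once the recorded downtime has used most of the budget."""
    budget = error_budget_minutes(availability_percent, days)
    return downtime_minutes >= warn_fraction * budget


budget = error_budget_minutes(99.9)  # roughly 43.2 minutes for three nines
print(f"Monthly error budget: {budget:.1f} minutes")
print("Freeze features after 40 minutes down?", should_freeze_features(99.9, 40))
```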

## Basic Monitoring in GCP

We want to see how we can use the tools provided by the cloud vendor to monitor our instances and create alerts based on their metrics.
**We'll use the monitoring tool called Stackdriver, which is part of the overall offering**.

The monitoring system gives us a very simple overview of each of the instances with three basic metrics: CPU usage, disk I/O, and network traffic.

We want to check out how to set up an alert to notify us if something isn't behaving correctly. To do this, we'll create a new alerting policy. To set up a new alert, we have to configure the condition that triggers the alert.

After we've done that, we can also configure how we want to be notified of the issue and add any documentation that we want the notification to include. Let's start by configuring the condition.

We'll start by selecting that we want to monitor GCE VM instances, which are the instances that we currently have running, and then select the CPU utilization metric.

We could choose to only look at some of the instances, selecting by their zone, region, or name. This can be useful if you want to have separate alerts for instances used for production and those used for testing or development. On top of that, we can also choose an aggregator for the data. These aggregators are useful when the metrics that you're collecting are about the overall system and not just one instance.
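To make the idea of filtering and aggregating concrete, here is a plain Python sketch. It does not use the Stackdriver API; the instance names, zones, CPU values, and the `aggregate_cpu` helper are all made up for illustration.

```python
from statistics import mean

# Hypothetical CPU utilization samples, one per instance (0.0 to 1.0).
samples = [
    {"name": "web-prod-1", "zone": "us-west1-a", "cpu": 0.40},
    {"name": "web-prod-2", "zone": "us-west1-a", "cpu": 0.90},
    {"name": "web-test-1", "zone": "us-east1-b", "cpu": 0.10},
]


def aggregate_cpu(samples, aggregator=mean, zone=None):
    """Combine per-instance CPU values, optionally limited to one zone."""
    selected = [s["cpu"] for s in samples if zone is None or s["zone"] == zone]
    return aggregator(selected) if selected else None


print(aggregate_cpu(samples, aggregator=max))                 # busiest instance
print(aggregate_cpu(samples, aggregator=mean, zone="us-west1-a"))  # zone average
```

An alert on the maximum catches a single overloaded instance, while an alert on the mean reflects the load on the group as a whole.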


## More Information on Monitoring and Alerting
Check out the following links for more information:

- https://www.datadoghq.com/blog/monitoring-101-collecting-data/
- https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting
- https://en.wikipedia.org/wiki/High_availability
- https://landing.google.com/sre/books/

