# AUA, DS 229 – MLOps
### Week 14 – Monitoring Machine Learning Systems

***

<div class="alert alert-block alert-danger">
<b>Action</b>:
    <b>Open a terminal and run</b>: `docker-compose up --build`
    then run all the cells of this notebook. Afterwards, you can continue 
    from the next cell.
</div>

Machine learning systems are increasingly being used to solve complex problems across various fields, such as healthcare, finance, and autonomous vehicles. However, it is essential to monitor these systems to ensure that they are performing as expected and to identify any potential issues or biases.

<center><img src="./images/ml_lifecycle.png" width=900 height = 200/></center>

[[Image source](https://martinfowler.com/articles/cd4ml.html)]

#### Monitoring vs Testing

**Testing** – Our best effort verification of correctness (**necessity**)  
**Monitoring** – Our best effort to track predictable failures (**satisfaction**)

[[Source](https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/)]

<mark>What problems can be detected as a result of monitoring?</mark>  
Monitoring machine learning systems involves collecting data about the system's behavior and performance, analyzing the data to identify patterns and trends, and taking appropriate actions based on the findings. This process can help to detect and address issues such as model drift, data quality problems, and biases in the system.

### Data drift

<center><img src="./images/data_drift.png" width=700 height = 150/></center>

Data drift occurs when the distribution of the data a model receives in production shifts away from the distribution of the data it was trained on. This can happen for a variety of reasons, such as changes in user behavior, changes in the underlying environment, or changes in the data collection process. When data drift occurs, the performance of the machine learning model can degrade, because the patterns it learned no longer match the new data.

Drift detection is a technique used to monitor for data drift. It involves comparing the distribution of the data used to train the machine learning model to the distribution of the new data that the model is being used to make predictions on. If there is a significant difference between these distributions, it can indicate that data drift has occurred. (Strictly speaking, *concept drift* is a related but different problem: a change in the relationship between the inputs and the target, even when the input distribution stays the same.)

There are various approaches to detecting drift, including statistical methods, distance-based methods, and density-based methods. Statistical methods involve comparing statistical properties of the data distributions, such as mean, variance, and correlation. Distance-based methods measure the difference between the distributions using measures such as Kullback-Leibler divergence or Earth Mover's Distance. Density-based methods estimate the probability density function of the new data and compare it to that of the previous distribution.
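As a rough illustration of the distance-based approach, the sketch below bins a reference sample and a new sample into the same histogram and computes the Kullback-Leibler divergence between them. The sample sizes, bin edges, synthetic Gaussian data, and the helper names `histogram` and `kl_divergence` are assumptions made for this example; they are not part of the course project.

```python
import math
import random


def histogram(values, edges):
    """Return smoothed bin probabilities for `values` over the given bin edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values outside the edges are clamped into the first or last bin.
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    # Add-one smoothing keeps every bin non-zero so the logarithm is defined.
    total = len(values) + len(counts)
    return [(c + 1) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # stands in for training data
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]       # new data, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # new data, mean shifted by 1

edges = [-4 + 0.5 * i for i in range(17)]  # 16 bins covering [-4, 4]
ref_hist = histogram(reference, edges)

kl_same = kl_divergence(histogram(same, edges), ref_hist)
kl_shifted = kl_divergence(histogram(shifted, edges), ref_hist)
print(f"no drift: {kl_same:.4f}, shifted: {kl_shifted:.4f}")
```

The divergence for the shifted sample should come out far larger than for the undrifted one. In practice, the alerting threshold would have to be calibrated on historical data rather than chosen by eye.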

Once data drift has been detected, there are several strategies that can be used to address it. One approach is to retrain the machine learning model using the new data. Another approach is to use techniques such as online learning or incremental learning to update the model in real-time as new data arrives. In some cases, it may also be necessary to collect additional data that better represents the new distribution.

## The goal

Of course, we are naturally interested in knowing how accurate our models are once they are put into use. However, in many cases it is impossible to assess a model's accuracy immediately. For example, with a fraud detection model, the accuracy of its predictions can only be confirmed when new live cases arise and are investigated or when customer data is cross-checked against known fraudsters. Other fields, such as disease risk prediction, credit risk prediction, future property values, and long-term stock market prediction, face similar challenges where immediate feedback is not available. With these limitations in mind, it makes sense to track proxies in production for the quality metrics we cannot yet compute, such as precision, recall, F1 score, or area under the receiver operating characteristic curve (AUC-ROC). The following can serve as proxies for model accuracy:
- Model prediction distribution (regression algorithms) or frequencies (classification algorithms) – we can compare the distributions of our model predictions using statistical tests, either through an automated or manual process. This can involve basic statistical measures such as median, mean, standard deviation, and maximum/minimum values. For instance, if the variables follow a normal distribution, we would anticipate the mean values to fall within the standard error of the mean range. However, this is a simple statistical method and may not capture all of the nuances of the data. More in-depth analyses may be required to fully understand the accuracy of the model's predictions.

- Model input distribution (numerical features) or frequencies (categorical features), as well as missing value checks – If we have a predetermined set of values for an input feature, we can perform two checks. Firstly, for categorical inputs, we can confirm that the input values fall within an acceptable set; for numerical inputs, we can verify that the input values are within the expected range. Secondly, we can check that the frequency of each respective value within the set is consistent with what we have observed in previous data. These checks do not prove that the predictions are correct, but they confirm that the model is receiving the kind of input it was trained on.
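The input checks described above can be sketched as follows. The expected range, the categorical feature `scanner_type`, and its expected frequencies are hypothetical values chosen for illustration. The breast cancer predictor in this notebook only uses numerical features.

```python
from collections import Counter

# Hypothetical expectations that would normally be derived from training data.
EXPECTED_RANGE = {"mean_area": (100.0, 2600.0)}
EXPECTED_CATEGORIES = {"scanner_type": {"A", "B"}}
EXPECTED_FREQ = {"scanner_type": {"A": 0.7, "B": 0.3}}


def check_row(row):
    """Return a list of human-readable problems found in one input row."""
    problems = []
    for name, (low, high) in EXPECTED_RANGE.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    for name, allowed in EXPECTED_CATEGORIES.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value not in allowed:
            problems.append(f"{name}: unexpected category {value!r}")
    return problems


def frequency_shift(rows, name):
    """Largest absolute gap between observed and expected category frequency."""
    observed = Counter(r[name] for r in rows if r.get(name) is not None)
    total = sum(observed.values())
    if total == 0:
        raise ValueError(f"no values observed for {name}")
    return max(
        abs(observed.get(cat, 0) / total - freq)
        for cat, freq in EXPECTED_FREQ[name].items()
    )


good_row = {"mean_area": 500.0, "scanner_type": "A"}
bad_row = {"mean_area": 5000.0, "scanner_type": "C"}
print(check_row(good_row))
print(check_row(bad_row))

batch = [{"scanner_type": "A"}] * 5 + [{"scanner_type": "B"}] * 5
print(frequency_shift(batch, "scanner_type"))
```

Row-level checks catch individual bad inputs, while the frequency check is only meaningful over a batch or time window of requests.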


## Operations Monitoring

The field of software engineering has a more established system of monitoring which falls under the umbrella of Site Reliability Engineering. When it comes to the operational aspects of our ML system, the following areas are of concern:

- System Performance (**Latency**) – the amount of time it takes for a system to respond to a given input or request. Latency can be measured in various units such as seconds, milliseconds, or microseconds.
- System Performance (**IO/Memory/Disk Utilisation**) – IO Utilization refers to the amount of data being transferred to and from a system's storage devices, such as hard drives or solid-state drives (SSDs). High IO utilization can indicate that the system is heavily reliant on its storage devices and may be experiencing a bottleneck in data transfer. This can lead to slower response times and reduced overall system performance. Memory Utilization refers to the amount of physical memory being used by the system. Disk Utilization refers to the amount of disk space being used by the system. High disk utilization can indicate that the system is running out of storage space, which can impact performance and cause errors or crashes.
- System Reliability (**Uptime**) – System reliability refers to the ability of a computer system to function without failure or interruption over a period of time. Uptime, in particular, is a measure of system reliability that refers to the amount of time a system is operational and available for use. Uptime is typically measured as a percentage of the total time a system is expected to be available. For example, a system that is expected to be available 24 hours a day, 7 days a week and never goes down has an uptime of 100%, while a system that experiences 1 hour of downtime in a week has an uptime of about 99.4% (167 of 168 hours).
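The latency and uptime definitions above translate into very little code. The following is a minimal sketch. The helper names `timed` and `uptime_percent` are made up for this example.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, latency in seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def uptime_percent(downtime_hours, period_hours=24 * 7):
    """Uptime as a percentage of the period the system should be available."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (period_hours - downtime_hours) / period_hours


result, latency = timed(sum, range(1_000_000))
print(f"latency: {latency * 1000:.2f} ms")

# One hour of downtime in a 168-hour week.
week_uptime = uptime_percent(1)
print(f"uptime: {week_uptime:.1f}%")
```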

***

## [Prometheus](https://prometheus.io/docs/introduction/overview/)

<center><img src="./images/prometheus.png" width=200 height = 80/></center>

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud.

**Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.**

#### What are metrics?
Metrics are numeric measurements. Time series means that changes are recorded over time. What users want to measure differs from application to application. For a web server it might be request times; for a database it might be the number of active connections or the number of active queries.

Metrics play an important role in understanding why your application is behaving in a certain way. Let's assume you are running a web application and find that the application is slow. You will need some information to find out what is happening with your application. For example, the application can become slow when the number of requests is high. If you have the request count metric, you can spot the reason and increase the number of servers to handle the load.

[[Source](https://prometheus.io/docs/introduction/overview/)]

## [Grafana](https://grafana.com/grafana/dashboards/)

<center><img src="./images/grafana.png" width=400 height = 80/></center>

Grafana is an open-source analytics and visualization platform that enables users to create, explore, and share dashboards, visualizations, and data queries. It allows users to connect to and visualize data from various sources, including databases, cloud platforms, and other data repositories, and build interactive, real-time dashboards for monitoring and analysis.

Grafana provides an intuitive user interface and a wide range of visualization options, including graphs, charts, and tables, making it easy for users to explore and understand their data. It also supports advanced features like alerting, data transformations, and annotations, allowing users to detect and respond to anomalies and trends in their data.

Grafana is highly extensible, with a rich plugin architecture that allows users to customize and extend the platform's functionality. It also integrates with a range of popular data sources and tools, including InfluxDB, Elasticsearch, Prometheus, and more.

Grafana is widely used by organizations of all sizes and industries, from startups to large enterprises, to monitor and analyze their data and gain insights into their systems and operations. With its flexible architecture, powerful visualization capabilities, and active developer community, Grafana is a popular choice for building custom analytics and monitoring solutions.

***

# Monitoring breast cancer predictor

#### Project structure:

- We create a Connexion app instance and add the API definition as before.

- We create a Flask app instance from the Connexion app, which allows us to access the underlying Flask application.

- We use the DispatcherMiddleware class from the werkzeug.middleware.dispatcher module to add a Prometheus metrics endpoint to the Flask app. This endpoint will be accessible at */metrics*.

- Finally, we start the application by calling app.run() as before. Now, the application will be monitored by Prometheus and you can view the metrics by accessing the */metrics* endpoint.

*Prometheus will act as a data source where all the data will be available on port 9090, and Grafana will fetch these data points and display them in the dashboard*.

We are going to run all our services locally on Docker containers.
- [Prometheus Docker Hub image](https://hub.docker.com/r/prom/prometheus/)
- [Grafana Docker Hub image](https://hub.docker.com/r/grafana/grafana)

The most important point in the `docker-compose.yml` configuration is mounting the `prometheus.yml` file from our local machine into the Docker container. This file includes the configuration for pulling data (metrics) from our app service or Python project. Without the file, we won't be able to see the custom metrics that our project includes.

The Prometheus configuration file, `prometheus.yml`, is used to specify which targets Prometheus should scrape and how to process the collected data. By default, Prometheus uses the YAML file from path: `/etc/prometheus/prometheus.yml`. We need to add our own `prometheus.yml` file and inform Prometheus to use that one.

The `prometheus.yml` file has a simple structure with three main sections:

- **global**: This section contains global configuration settings for Prometheus. These settings apply to all scraping jobs defined in the configuration file. Some examples of global settings include the evaluation interval and the scrape timeout.

    > The **scrape_interval** parameter specifies the interval at which Prometheus should scrape the metrics data from the target. The default value for scrape_interval is 1 minute, but you can adjust this interval according to your needs. It is important to note that **scrape_interval** affects both the performance and the accuracy of the metrics collected by Prometheus. A shorter interval will result in more frequent scraping and more up-to-date data, but can also put more load on the system being monitored. A longer interval will reduce the load on the system, but may also result in less accurate data.

    > The **evaluation_interval** parameter specifies the interval at which Prometheus should re-evaluate the rules defined in its rule files and generate alerts if any of the defined alert conditions are met. The default value for evaluation_interval is 1 minute, but you can adjust this interval according to your needs. A shorter interval will result in more frequent evaluations and more up-to-date alerts, but can also put more load on the Prometheus server itself. A longer interval will reduce that load, but may also result in less timely alerts.


- **scrape_configs**: This section defines the list of scrape jobs. Each job is defined in a separate block within the scrape_configs section. Each block specifies the targets that Prometheus should scrape, along with any additional configuration settings for that job, such as the scheme, the metrics path, or a job-specific scrape interval.

    > The **job_name** parameter specifies the name of the job that the targets belong to. Prometheus attaches this name to every scraped time series as the `job` label, which makes it easy to identify the job in the Prometheus UI and in alert rules, and to manage and monitor different targets and their associated metrics.  
    Each entry in scrape_configs has exactly one **job_name**, which must be unique within the `prometheus.yml` configuration file, and defines the specific endpoints or URLs that Prometheus should scrape for metrics data for that particular job.


- **rule_files**: This section defines the list of files containing Prometheus rules that should be evaluated. Each rule file is defined in a separate line. Rules are used to define alerts based on the collected metrics. For example, you might create a rule that generates an alert if a certain metric goes above a certain threshold.
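To make the three sections concrete, here is a minimal sketch of a `prometheus.yml`, held in a Python string so it can be written to disk from the notebook. The job name, the rule file name `alert_rules.yml`, and the 15-second intervals are assumptions for illustration. The target host and port follow the app address used elsewhere in this notebook, but the project's actual file may differ.

```python
from pathlib import Path

# Illustrative configuration only; names and intervals are hypothetical.
PROMETHEUS_YML = """\
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated

rule_files:
  - "alert_rules.yml"       # hypothetical rule file

scrape_configs:
  - job_name: "breast_cancer_predictor"  # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:8000"]
"""


def write_config(path="prometheus.yml"):
    """Write the example configuration to `path`."""
    Path(path).write_text(PROMETHEUS_YML, encoding="utf-8")


# Collect the top-level section names as a quick sanity check.
TOP_LEVEL_SECTIONS = [
    line.rstrip(":")
    for line in PROMETHEUS_YML.splitlines()
    if line and not line.startswith((" ", "#", "-"))
]
print(TOP_LEVEL_SECTIONS)
```

The order of the top-level sections does not matter to Prometheus. They are listed here in the order the text describes them.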



#### [Prometheus metric types include](https://prometheus.io/docs/concepts/metric_types/):

- **Counter**: A counter is a cumulative metric that counts the number of events that occur. **Counters always increase in value and never decrease** (they can only be reset to zero on restart). They are useful for tracking the number of requests, errors, or other events. Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge.  
When working with counter metrics in the Prometheus Python client, two related time series are exposed automatically: one with the **_total** suffix and another with the **_created** suffix. The series with the _total suffix represents the total number of events that have occurred since the counter was created or last reset. It is incremented every time an event is observed. For example, if you are monitoring the number of HTTP requests received by a web server, the _total series (such as `request_count_total` used later in this notebook) would represent the total number of requests received. The series with the _created suffix represents the timestamp when the counter was created. It is useful when you want to track how long a counter has been running. For example, if the counter is created when a service starts, the _created series tells you when the service was started.



- **Gauge**: A gauge is a metric that represents a single numerical value that can go up or down over time. Gauges are used to track metrics such as CPU usage, memory usage, or the number of connections to a database.

- **Histogram**: A histogram is used to measure the distribution of a set of values. It separates values into buckets and counts how many values fall into each bucket. Histograms are useful for measuring things like response times or request sizes. A histogram with a base metric name of `<basename>` exposes multiple time series during a scrape:
  - cumulative counters for the observation buckets, exposed as `<basename>_bucket` (each bucket of the histogram is represented by a separate time-series with a label **le** that indicates the upper bound of the bucket range.)
  - the total sum of all observed values, exposed as `<basename>_sum`
  - the count of events that have been observed, exposed as `<basename>_count`  
  
- **Summary**: A summary is similar to a histogram. While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window. Summaries are useful for measuring things like request latencies. A summary with a base metric name of `<basename>` exposes multiple time series during a scrape:
  - streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as `<basename>`
  - the total sum of all observed values, exposed as `<basename>_sum`
  - the count of events that have been observed, exposed as `<basename>_count`
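The cumulative nature of histogram buckets is easier to see in code. The class below is a toy model written for this explanation, not the `prometheus_client` implementation. The bucket bounds and the metric name `example_latency_seconds` are hypothetical.

```python
import math


class ToyHistogram:
    """Toy model of a Prometheus histogram (not the prometheus_client class)."""

    def __init__(self, upper_bounds):
        # Every histogram has an implicit +Inf bucket that catches all values.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Buckets are cumulative: a value is counted in every bucket whose
        # upper bound `le` is greater than or equal to it.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1

    def expose(self, basename):
        """Return the series a scrape would see, as (name, value) pairs."""
        series = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if bound == math.inf else str(bound)
            series.append((f'{basename}_bucket{{le="{le}"}}', n))
        series.append((f"{basename}_sum", self.sum))
        series.append((f"{basename}_count", self.count))
        return series


hist = ToyHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 0.8, 2.0):
    hist.observe(seconds)

for name, value in hist.expose("example_latency_seconds"):
    print(name, value)
```

Because each bucket includes everything below its bound, the `+Inf` bucket always equals `<basename>_count`.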

Histograms and summaries both sample observations, typically request durations or response sizes. They track the number of observations and the sum of the observed values, allowing you to calculate the average of the observed values. Note that the number of observations (showing up in Prometheus as a time series with a _count suffix) is inherently a counter (as described above, it only goes up). The sum of observations (showing up as a time series with a _sum suffix) behaves like a counter, too, as long as there are no negative observations. Obviously, request durations or response sizes are never negative. In principle, however, you can use summaries and histograms to observe negative values (e.g. temperatures in centigrade). In that case, the sum of observations can go down, so you cannot apply rate() to it anymore. In those rare cases where you need to apply rate() and cannot avoid negative observations, you can use two separate summaries, one for positive and one for negative observations (the latter with inverted sign), and combine the results later with suitable PromQL expressions.

***

Open http://localhost:8000/ui/ and http://localhost:3000/. For Grafana, use **admin** as both the username and the password. Grafana will then ask you to set a new password; since we are testing locally, we can keep it the same. Visit http://localhost:9090 for the Prometheus server.

Add a Prometheus data source to Grafana by navigating to the Grafana web interface at http://localhost:3000. Click on "Add data source", select "Prometheus" as the type, and enter the URL of the Prometheus server (http://prometheus:9090 or http://host.docker.internal:9090 in this case). Click "Save & Test" to verify the connection.

Open http://localhost:3000/dashboards and add a new panel by selecting the metric you want to monitor.

> In most cases, specifying `localhost` or `127.0.0.1` as a target in `prometheus.yml` will not work when Prometheus is running inside a Docker container, because inside the container the localhost address refers to the container itself, not the host machine.

> Instead, you need to use a special DNS name, `host.docker.internal`, that resolves to the internal IP address of the host machine from within the Docker container. This DNS name has been the recommended way to reach the host in Docker for Mac and Windows since version 18.03 and can be used to access services running on the host machine from within a Docker container.

> So, when a service running inside a Docker container needs to access another service running on the host machine, you should use `host.docker.internal` instead of `localhost` or `127.0.0.1` in the configuration files.

### Simulating Live Data for Breast Cancer Prediction

In [None]:
import requests
import time
import random

import numpy as np

In [None]:
# These statistics are taken from the training data.
stats = {
    "mean": {
        0: {
            "mean concavity": 0.160775, 
            "worst area": 1422.286321, 
            "mean area": 978.376415
        },
        1: {
            "mean concavity": 0.046058, 
            "worst area": 558.899440, 
            "mean area": 462.790196
        }
    },
    "std": {
        0: {
            "mean concavity": 0.075019, 
            "worst area": 597.967743, 
            "mean area": 367.937978
        },
        1: {
            "mean concavity": 0.043442, 
            "worst area": 163.601424, 
            "mean area": 134.287118
        }
    }
}


def sample_features(target=None):
    """Samples features corresponding to the target if specified."""
    
    if target is None:
        target = random.randint(0, 1)
    
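    # Dict insertion order keeps means and stds aligned feature by feature.
    means = list(stats["mean"][target].values())
    stds = list(stats["std"][target].values())

    # Use a tenth of the training std so samples stay close to the class mean.
    return [np.random.normal(mean, std / 10) for mean, std in zip(means, stds)]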

In [None]:
minutes = 10  # For how many minutes to run each nested loop.
URL = "http://localhost:8000/predict"

In [None]:
for target in (0, 1) * 50:
    
    END_TIME = time.time() + 60 * minutes
    current_min = 1
    request_counter = 0
    print(f"Feeding data 'corresponding' to target {target}")

    # Nested loop start.
    while time.time() < END_TIME:

        sec = random.random() * 15
        time.sleep(sec)

        feature_values = sample_features(target=target)
        params = {
            "mean_concavity": feature_values[0], 
            "worst_area": feature_values[1],
            "mean_area": feature_values[2]
        }
        response = requests.get(URL, params=params)

        diff = END_TIME - time.time()
        if (diff > 0) and (diff // 60 == minutes - current_min):
            current_min += 1
            print(f"  .. {int(diff)} seconds still to go")

        request_counter += 1


    print("="*40)
    print(f"Performed {request_counter} requests.\n\n")
    

***
## [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/)

Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets the user select and aggregate time series data in real time.

<div class="alert alert-block alert-danger">
<b>Action</b>:
    Visit http://localhost:3000/ and create a dashboard to visualize the following metrics:
</div>

#### 1) Visualize the number of predict requests as a time series data:
> `request_count_total`

#### 2) Visualize the predicted class labels:
> `target_sum_sign`

### [Range Vector Selectors](https://prometheus.io/docs/prometheus/latest/querying/basics/#range-vector-selectors)
Range Vector Selectors select a range of samples back from the current instant. Syntactically, a time duration is appended in square brackets (`[]`) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element. In the first example below, we select all the values we have recorded within the last 5 minutes for all time series that have the metric name **request_processing_seconds_count** (paste the following two queries in http://localhost:9090/ query space and play with minutes):
> `request_processing_seconds_count[5m]`  
> `request_processing_seconds_sum[10m]`

[Here](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-durations) is the list of possible time units. Note that consecutive samples in each of the two queries above are separated from each other by approximately the amount of time specified in **scrape_interval** in `prometheus.yml`.

#### `rate()`, `irate()` and  `increase()`
- `rate()`: This calculates the rate of increase per second, averaged over the entire provided time window.   
Example: `rate(request_count_total[5m])` yields the per-second rate of HTTP requests as averaged over a time window of 5 minutes. 
- `irate()` ("instant rate"): This calculates the rate of increase per second just like `rate()`, but only considers the last two samples under the provided time window for the calculation and ignores all earlier ones.   
Example: `irate(request_count_total[5m])` looks at the two last samples under the provided 5-minute window and calculates the per-second rate of increase between them. This function can be helpful if you want to make a zoomed-in graph show very quick responses to changes in a rate, but the output will be much more spiky than for `rate()`.
- `increase()`: This function is exactly equivalent to `rate()` except that it does not convert the final unit to "per-second" (1/s). Instead, the final output unit is per-provided-time-window.   
Example: `increase(request_count_total[5m])` yields the total increase in handled HTTP requests over a 5-minute window (unit: 1 / 5m). Thus `increase(foo[5m]) / (5 * 60)` is 100% equivalent to `rate(foo[5m])`.

See how the rate is calculated [here](https://promlabs.com/blog/2021/01/29/how-exactly-does-promql-calculate-rates/).
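The relationship between the three functions can be sketched in plain Python over a list of `(timestamp, value)` samples. This is a deliberate simplification: it handles counter resets but omits the extrapolation to the window boundaries that Prometheus performs, so real query results will differ slightly. The function names and sample values are made up for the example.

```python
def counter_increase(samples):
    """Total increase of a counter over its samples, correcting for resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    return increase


def rate(samples, window_seconds):
    """Per-second increase averaged over the whole window."""
    return counter_increase(samples) / window_seconds


def irate(samples):
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2:]
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


# Hypothetical request counter scraped once a minute over a 5-minute window.
samples = [(0, 0), (60, 30), (120, 60), (180, 90), (240, 120), (300, 180)]
window = 300

total = counter_increase(samples)
avg_rate = rate(samples, window)
instant_rate = irate(samples)
print(total, avg_rate, instant_rate)

# A counter that was reset to zero between the second and third scrape.
with_reset = [(0, 100), (60, 130), (120, 10), (180, 40)]
print(counter_increase(with_reset))
```

Here the average rate is lower than the instant rate because traffic doubled only in the final minute, which is exactly the kind of change `irate()` reacts to and `rate()` smooths out.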


#### 3) Calculate the number of predict requests during the last 10 minutes:
> `increase(request_count_total[10m])`


#### 4) Calculate the average request time in seconds during the last 2 minutes:
> `(rate(request_processing_seconds_sum[2m]) / rate(request_processing_seconds_count[2m]))`

#### 5) Calculate the average `worst_area` during the last 2 minutes:
> `(rate(worst_area_hist_sum[2m]))/(rate(worst_area_hist_count[2m]))`
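Queries 4 and 5 work because the window length cancels out when one rate is divided by the other, leaving the increase in `_sum` divided by the increase in `_count`. A minimal sketch of that arithmetic, with a hypothetical helper name and made-up scrape values:

```python
import math


def windowed_mean(sum_start, sum_end, count_start, count_end):
    """Mean of the observations recorded between two scrapes."""
    observations = count_end - count_start
    if observations == 0:
        # No new observations in the window; PromQL likewise yields NaN for 0 / 0.
        return float("nan")
    return (sum_end - sum_start) / observations


# Hypothetical values of a _sum and _count series at the window boundaries.
mean_seconds = windowed_mean(sum_start=12.0, sum_end=15.0,
                             count_start=40, count_end=50)
print(mean_seconds)

empty_window = windowed_mean(15.0, 15.0, 50, 50)
print(math.isnan(empty_window))
```

Ten observations adding three seconds in total give a mean of 0.3 seconds per request. Earlier observations do not affect the result, which is why this differs from dividing the raw `_sum` by the raw `_count`.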

# References
- [Prometheus](https://prometheus.io/docs/introduction/overview/)
- [Grafana](https://grafana.com/docs/grafana/latest/)
- [PromQL intro](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [PromQL functions](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate)
- [How Exactly Does PromQL Calculate Rates?](https://promlabs.com/blog/2021/01/29/how-exactly-does-promql-calculate-rates/)