# 2. ML Monitoring metrics

## 2.1. How to evaluate ML model quality

### Challenges of standard ML monitoring
Standard ML monitoring means measuring how the model performs in production, using:
- **Model quality and error metrics:** These assess predictive performance in production, such as accuracy, precision, recall, and F1 score.
- **Business or product metrics:** These assess the business impact of the model in production, such as purchases, clicks, and visits.

**Challenges** of this approach:
- *Delayed ground truth:* The ground truth is often not available immediately after the model makes a prediction, so quality metrics cannot be calculated until the labels arrive.
- *Past performance does not guarantee future performance:* The data distribution may change in the future.
- *Segments with different quality:* An aggregate metric can hide poor performance on specific segments of the data.
- *Volatile target function:* The relationship between the inputs and the target can itself change over time.

### Early monitoring metrics
**Early monitoring metrics** are metrics that can be calculated before the ground truth is available. They are used to track issues with:
- **Data Quality**
- **Data drift**
- **Output drift:** A change in the distribution of the model's predictions over time. It can be measured without ground truth and may signal that the input data or the model environment has changed.

## 2.2. Evaluating ML model quality in production

### Classification quality metrics
- **Accuracy:** The proportion of correctly classified instances out of the total instances. This metric can be misleading on imbalanced data, where always predicting the majority class already yields a high score.

        $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
- **Precision:** The proportion of true positive predictions out of all positive predictions.

        $Precision = \frac{TP}{TP + FP}$
- **Recall:** The proportion of true positive predictions out of all actual positive instances.
    
        $Recall = \frac{TP}{TP + FN}$
- **F1 score:** The harmonic mean of precision and recall, providing a balance between the two metrics.

        $F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$
- **AUC-ROC:** The area under the Receiver Operating Characteristic (ROC) curve, which measures the model's ability to distinguish between positive and negative classes.
- **Log loss:** It measures how close the predicted probabilities are to the actual labels. Lower values are better.

        $LogLoss = -\frac{1}{N}\sum_{i=1}^{N}(y_i\log(p_i) + (1-y_i)\log(1-p_i))$
- **Confusion matrix:** A table that compares the model's predictions to the actual outcomes by counting true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- **Precision-recall table:** A table that shows the precision and recall values for different thresholds.
- **Class separation quality:** It helps to understand how well the model separates the classes. It is measured by the distance between the distributions of the model's predictions for each class.
- **Error analysis:** It helps to understand the model's errors and the reasons behind them.
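
The formulas above can be sketched in plain Python for the binary case. This is a minimal illustration rather than a production implementation: the function names, the 0/1 label encoding, the default threshold of 0.5, and the sample data are all assumptions made for the example.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1 and log loss."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # 2PR / (P + R) is algebraically the harmonic mean of P and R.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Clip probabilities so that log(0) never occurs.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)
    return {
        "confusion": (tp, fp, tn, fn),
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }

def precision_recall_table(y_true, y_prob, thresholds):
    """Return (threshold, precision, recall) rows for each threshold."""
    rows = []
    for th in thresholds:
        m = classification_metrics(y_true, y_prob, th)
        rows.append((th, m["precision"], m["recall"]))
    return rows

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

metrics = classification_metrics(y_true, y_prob)
table = precision_recall_table(y_true, y_prob, [0.3, 0.5, 0.7])
print(metrics)
for row in table:
    print(row)
```

Lowering the threshold in the precision-recall table typically raises recall at the cost of precision, which is why the table is useful for choosing an operating point.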

### Regression quality metrics
- **Mean Absolute Error (MAE):** The average of the absolute differences between predictions and actual values.

        $MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y_i}|$
- **Mean Squared Error (MSE):** The average of the squared differences between predictions and actual values.
    
        $MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y_i})^2$
- **Root Mean Squared Error (RMSE):** The square root of the average of the squared differences between predictions and actual values.

        $RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y_i})^2}$
- **Predicted vs. Actual Values:** A plot that shows the predicted values against the actual values.
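
As a small sketch of the three error formulas above, the following plain-Python function computes them from paired lists. The function name and the sample values are assumptions made for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE and RMSE from actual and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    return {"mae": mae, "mse": mse, "rmse": rmse}

# Hypothetical actual and predicted values, for illustration only.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

reg = regression_metrics(actual, predicted)
print(reg)
```

Because MSE and RMSE square the errors, a single large error (here the error of 2) pulls them up more than it pulls up MAE.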

### Ranking quality metrics
Ranking focuses on the relative order of items rather than their absolute values, as in search engines and recommender systems. These are the metrics used:
- **Cumulative Gain** 
- **Discounted Cumulative Gain (DCG)**
- **Normalized DCG (NDCG)**
- **Precision@K**
- **Recall@K**
- **Lift@K**
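
The notes list these metrics without formulas, so the sketch below uses one common formulation: DCG with a `log2(rank + 1)` discount and linear gain. Other variants exist (for example an exponential gain), and Lift@K is not covered here. The function names and sample relevance lists are assumptions made for illustration.

```python
import math

def cumulative_gain(relevance, k):
    """Sum of relevance scores in the top k positions (order is ignored)."""
    return sum(relevance[:k])

def dcg(relevance, k):
    """Discounted cumulative gain: lower positions contribute less."""
    # enumerate() is 0-based, so the 1-based rank i maps to log2(i + 1) = log2(idx + 2).
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevance[:k]))

def ndcg(relevance, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevance, reverse=True), k)
    return dcg(relevance, k) / ideal if ideal > 0 else 0.0

def precision_at_k(is_relevant, k):
    """Share of the top k items that are relevant (0/1 flags in ranked order)."""
    return sum(is_relevant[:k]) / k

def recall_at_k(is_relevant, k):
    """Share of all relevant items that appear in the top k.

    Assumes the list contains every relevant item for the query.
    """
    total = sum(is_relevant)
    return sum(is_relevant[:k]) / total if total else 0.0

# Hypothetical ranked results, for illustration only.
graded = [3, 2, 0, 1]      # graded relevance, in the order the model ranked them
binary = [1, 0, 1, 0, 1]   # 1 = relevant, 0 = not relevant

print(cumulative_gain(graded, 3), dcg(graded, 4), ndcg(graded, 4))
print(precision_at_k(binary, 3), recall_at_k(binary, 3))
```

NDCG equals 1 only when the items are already in the ideal order, which makes it comparable across queries with different numbers of relevant items.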

## 2.3. Data quality in Machine Learning

### What can go wrong with the input data?
- **Wrong source**
- **Lost access**
- **Bad SQL or non-SQL pipeline code**
- **Infrastructure update**
- **Broken feature code**
- **Data loss or corruption**

### Data quality metrics and analysis
- **Data profiling:** It involves basic descriptive statistics for the dataset, such as min and max values, quantiles, unique values, and most common values. It also includes data visualization and data distribution analysis.
- **Manual Thresholds:** In the absence of reference data, establish data quality criteria manually based on domain knowledge.
    - Minimal missing values.
    - No duplicate rows or columns.
    - Avoid constant or highly correlated features.
    - Prevent target leaks (high feature-target correlations).
    - Enforce logical ranges, considering feature context (e.g., age cannot be negative).
- **Reference Data:** Having reference data simplifies the process and enables automated checks.

    - Compare current data to the reference dataset.
    - Automatically assess data schema, completeness, batch size, and specific column patterns.
    - Check for unique/non-unique features and expected data distributions.
    - Set criteria based on observed value ranges and descriptive statistics like averages and medians.
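
A few of the manual checks above can be sketched in plain Python over a list of row dictionaries. This is an illustrative sketch only: the function names, the column names, the allowed ranges, and the `max_missing_share` threshold are hypothetical and would come from domain knowledge or a reference dataset in practice.

```python
def missing_share(values):
    """Share of missing (None) values in a column."""
    return sum(v is None for v in values) / len(values)

def is_constant(values):
    """True if the column has at most one distinct non-missing value."""
    return len({v for v in values if v is not None}) <= 1

def out_of_range_count(values, low, high):
    """Number of non-missing values outside the closed range [low, high]."""
    return sum(1 for v in values if v is not None and not (low <= v <= high))

def duplicate_row_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

def data_quality_report(rows, ranges, max_missing_share=0.1):
    """Run the checks per column; `ranges` maps column name to (low, high)."""
    columns = {name: [row.get(name) for row in rows] for name in rows[0]}
    report = {"duplicate_rows": duplicate_row_count(rows), "columns": {}}
    for name, values in columns.items():
        checks = {
            "missing_share": missing_share(values),
            "is_constant": is_constant(values),
        }
        checks["too_many_missing"] = checks["missing_share"] > max_missing_share
        if name in ranges:
            low, high = ranges[name]
            checks["out_of_range"] = out_of_range_count(values, low, high)
        report["columns"][name] = checks
    return report

# Hypothetical batch of input data, for illustration only.
rows = [
    {"age": 34, "country": "DE", "plan": "basic"},
    {"age": -2, "country": "DE", "plan": "basic"},
    {"age": None, "country": "FR", "plan": "basic"},
    {"age": 34, "country": "DE", "plan": "basic"},
]

quality = data_quality_report(rows, ranges={"age": (0, 120)})
print(quality)
```

With reference data available, the hard-coded range and threshold could instead be derived from the values observed in the reference batch.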

## 2.4. Data and prediction drift in ML

### What is data drift and why evaluate it?
When ground truth data is unavailable, proxy metrics such as feature drift and prediction drift are used.
- **Prediction drift:** It tracks changes in the distribution of the model predictions over time. If a change is detected it can be an early sign of changes in the model environment.
- **Feature drift:** It tracks changes in the distribution of the input features. It can be an early warning of model quality degradation, data quality issues, or changes in the model environment.

### How to detect data drift
- **Drift detection method:** such as statistical tests, distance metrics, etc.
- **Drift detection threshold** 
- **Reference dataset**
- **Alert conditions**
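
The four components above can be combined in a short sketch. Here the drift detection method is the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) used as a distance, and the alert fires when a given share of features drifts. This is only one possible choice. The function names and both threshold values are hypothetical, and a real setup might use a p-value from a statistical test instead of a raw statistic.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest absolute gap between the empirical CDFs of two numeric samples."""
    ref, cur = sorted(reference), sorted(current)
    # The gap between two step functions is largest at an observed value.
    points = set(ref) | set(cur)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )

def detect_drift(reference, current, stat_threshold=0.3, share_threshold=0.5):
    """Compare each feature to the reference and decide whether to alert.

    `reference` and `current` map feature names to lists of numeric values.
    Both thresholds are illustrative assumptions, not recommended defaults.
    """
    features = {}
    for name in reference:
        stat = ks_statistic(reference[name], current[name])
        features[name] = {"statistic": stat, "drifted": stat >= stat_threshold}
    drifted_share = sum(f["drifted"] for f in features.values()) / len(features)
    return {
        "features": features,
        "drifted_share": drifted_share,
        "alert": drifted_share >= share_threshold,
    }

# Hypothetical reference and current batches, for illustration only.
reference_data = {
    "age": [20, 25, 30, 35, 40, 45],
    "income": [10, 20, 30, 40, 50, 60],
}
current_data = {
    "age": [21, 26, 31, 36, 41, 46],
    "income": [70, 80, 90, 100, 110, 120],
}

drift = detect_drift(reference_data, current_data)
print(drift)
```

The same function can be applied to the model's predictions by passing them as one more column, which gives a simple prediction drift check.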

### Univariate vs. multivariate drift
- **Univariate drift:** It tracks drift in the distribution of each feature independently. It's easy to understand because we're looking at one feature at a time.
- **Multivariate drift:** It tracks drift in the distribution of the whole dataset. It's useful when dealing with many features or complex interactions.


### Tips for calculating drift
- **Calculate data quality**
- **Mind the feature set:** The approach to drift analysis varies based on the type and importance of features.
- **Mind segment-based drift**

### Some key considerations about data drift
- **Prediction Drift Matters Most:** Changes in what the model predicts are often more critical than changes in the input data.
- **Data Drift is Flexible:** What's considered "drift" depends on the specific situation and data you're dealing with.
- **Drift Doesn't Always Harm Models:** Consider the context, importance of features, and use case.
- **Data Drift Monitoring is Selective:** It's especially useful for important models with delayed feedback, but it's not always necessary.
- **Data Drift Helps Debugging:** Tracking it can help troubleshoot issues in the model.
- **Drift Detection is Valuable Even with Labels:** Changes in features might show up before the model's performance drops.