# **Chapter 73: Alerting and Notification Systems**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the role of alerting in time‑series prediction systems
- Distinguish between different types of alerts: threshold‑based, anomaly, drift, performance, and system health
- Design effective alert rules with appropriate severity levels
- Implement multiple notification channels (email, Slack, SMS, webhooks)
- Apply deduplication and aggregation techniques to prevent alert fatigue
- Build an escalation policy that ensures critical alerts are handled
- Create a reusable alert manager class integrated with a NEPSE monitoring pipeline

---

## **73.1 Introduction to Alerting in Time‑Series Systems**

In any production time‑series prediction system, such as the NEPSE stock prediction system, monitoring is only half the story. The other half is **alerting** – the proactive notification of stakeholders when certain conditions are met. Without alerting, a model could silently fail, data pipelines could break, or unusual market movements could go unnoticed until it is too late.

Alerting serves several critical purposes:

- **Operational awareness** – Notify engineers when the system is unhealthy (e.g., data ingestion stops, API latency spikes).
- **Model performance tracking** – Inform data scientists when prediction accuracy drops or when concept drift is detected.
- **Business action triggers** – Alert traders or portfolio managers when a stock reaches a predefined price level, when a technical indicator signals a potential reversal, or when unusual volume suggests a market move.

For the NEPSE prediction system, which processes daily OHLCV data from a CSV file, we might want alerts when:

- A stock’s closing price moves more than 4% (the first circuit breaker level).
- The 14‑day RSI enters overbought (>70) or oversold (<30) territory.
- Trading volume is more than three standard deviations above its 20‑day average.
- The prediction error (MAE) of the model exceeds a threshold for three consecutive days.

This chapter will guide you through building a robust alerting and notification subsystem that integrates seamlessly with your time‑series pipeline.

---

## **73.2 Alert Types and Conditions**

Alerts can be classified based on what they monitor. In a time‑series prediction system, five categories are especially relevant:

### **73.2.1 Threshold‑Based Alerts**

These are the simplest: they fire when a numeric value crosses a predefined threshold. For NEPSE, examples include:

- `Close > 500` (price level)
- `Daily_Return > 5` (large percentage gain)
- `Volume > 1,000,000` (liquidity spike)

Thresholds can be static (configured at design time) or dynamic (e.g., based on rolling statistics).
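The difference between a static and a dynamic threshold can be sketched as follows. This is a minimal illustration using only the standard library; the function names `static_breach` and `dynamic_breach`, the 20‑day window, and the multiplier `k` are assumptions made for the example, not part of the NEPSE pipeline.

```python
from statistics import mean, stdev
from typing import Sequence


def static_breach(value: float, threshold: float) -> bool:
    """Static rule: fire when the value exceeds a fixed, preconfigured level."""
    return value > threshold


def dynamic_breach(history: Sequence[float], value: float,
                   window: int = 20, k: float = 3.0) -> bool:
    """Dynamic rule: fire when the value exceeds mean + k * std of the
    last `window` observations (the new value itself is excluded)."""
    recent = list(history)[-window:]
    if len(recent) < 2:
        return False  # not enough data to estimate a spread
    return value > mean(recent) + k * stdev(recent)


# Hypothetical volumes for the preceding trading days
volumes = [1000, 1100, 950, 1050, 1000, 980, 1020, 990, 1010, 1005]

fixed = static_breach(1200, 1_000_000)   # far below the fixed level
rolling = dynamic_breach(volumes, 1200)  # unusual relative to recent history
```

A dynamic rule adapts to each stock's own activity level, so one rule definition can serve both thinly and heavily traded symbols.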

### **73.2.2 Anomaly Detection Alerts**

Anomaly alerts identify data points that deviate significantly from expected patterns. They are essential for catching data quality issues or extreme market events. Examples include a sudden zero‑volume day for a normally liquid stock, or a price that is 10 standard deviations away from its moving average.
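Both examples can be expressed as simple checks on the latest observation. The sketch below assumes plain Python lists of closes and volumes with the newest value last; the function name `find_anomalies`, the 20‑day window, and the limit of 10 standard deviations are illustrative choices.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(closes: Sequence[float], volumes: Sequence[float],
                   window: int = 20, z_limit: float = 10.0) -> List[str]:
    """Check the latest observation against the preceding `window` days."""
    reasons: List[str] = []
    past_closes = list(closes[:-1])[-window:]
    past_volumes = list(volumes[:-1])[-window:]

    # Data-quality check: a normally liquid stock reports zero volume
    if volumes[-1] == 0 and past_volumes and min(past_volumes) > 0:
        reasons.append("zero volume on a normally liquid stock")

    # Statistical check: distance from the moving average in standard deviations
    if len(past_closes) >= 2:
        spread = stdev(past_closes)
        if spread > 0:
            z = (closes[-1] - mean(past_closes)) / spread
            if abs(z) > z_limit:
                reasons.append(
                    f"close is {z:.1f} standard deviations from its moving average"
                )
    return reasons


# Hypothetical history: ten quiet days, then a suspicious final row
closes = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 150]
volumes = [500, 520, 480, 510, 530, 490, 500, 505, 495, 515, 0]
reasons = find_anomalies(closes, volumes)
```

Returning a list of reasons rather than a single boolean lets the alert message say which check failed.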

### **73.2.3 Model Drift Alerts**

Once a model is deployed, its performance may degrade over time because the relationship between the features and the target changes (concept drift) or because the distribution of the input features shifts (data drift). Alerts can be triggered when:

- The prediction error on recent data exceeds a baseline by a certain margin.
- The distribution of a key feature (e.g., `Volume`) shifts significantly (detected via a statistical test such as the Kolmogorov–Smirnov test).
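A minimal sketch of both checks, using only the standard library, might look like this. The function names and the three‑day streak are assumptions for illustration. The second function computes only the two‑sample KS statistic; in practice a library routine such as `scipy.stats.ks_2samp` also returns a p‑value for deciding whether the shift is significant.

```python
from bisect import bisect_right
from typing import Sequence


def error_streak_exceeded(daily_mae: Sequence[float], threshold: float,
                          days: int = 3) -> bool:
    """True when MAE was above `threshold` on each of the last `days` days."""
    recent = list(daily_mae)[-days:]
    return len(recent) == days and all(err > threshold for err in recent)


def ks_statistic(reference: Sequence[float], recent: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical cumulative distribution functions."""
    ref, rec = sorted(reference), sorted(recent)
    gap = 0.0
    # The largest gap always occurs at one of the observed values
    for x in ref + rec:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_rec = bisect_right(rec, x) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


# Hypothetical values: errors rising over the last three days
streak = error_streak_exceeded([1.0, 5.0, 6.0, 7.0], threshold=4.0)
# Hypothetical volumes: the recent sample does not overlap the reference at all
shift = ks_statistic([1, 2, 3], [10, 11, 12])
```

A statistic near 0 means the two samples look alike; a value near 1 means they barely overlap.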

### **73.2.4 Performance Degradation Alerts**

These are system‑centric: they notify when the prediction service itself is underperforming. Examples:

- API response time > 500ms.
- Batch prediction job fails or takes longer than expected.
- Database connection errors.
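A latency check like the first example can be sketched with a small timing wrapper. The helper name `timed_call` and the stand‑in function `slow_prediction` are hypothetical; the 500 ms default mirrors the example above.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(func: Callable[..., Any], *args: Any,
               limit_ms: float = 500.0, **kwargs: Any) -> Tuple[Any, float, bool]:
    """Run `func` and return (result, elapsed_ms, too_slow).

    Exceptions propagate to the caller, so a failed job can be reported
    as its own alert rather than as a slow one.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, elapsed_ms > limit_ms


def slow_prediction(x: int) -> int:
    """Hypothetical stand-in for a call to the prediction service."""
    time.sleep(0.02)
    return x * 2


# A deliberately strict 10 ms limit so the 20 ms call counts as too slow
result, elapsed_ms, too_slow = timed_call(slow_prediction, 21, limit_ms=10.0)
```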

### **73.2.5 System Health Alerts**

Health alerts cover infrastructure: CPU/memory usage, disk space, and service availability. While not directly tied to predictions, they are vital for maintaining a reliable system.
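As one example, a disk space check needs nothing beyond the standard library. The function names, the 80% and 95% limits, and the mapping to severity labels are assumptions for this sketch; CPU and memory checks usually need a third‑party package and are not shown.

```python
import shutil
from typing import Optional


def disk_used_percent(path: str = ".") -> float:
    """Percentage of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_severity(used_percent: float, warn_at: float = 80.0,
                  critical_at: float = 95.0) -> Optional[str]:
    """Map disk usage to a severity label, or None when usage is healthy."""
    if used_percent >= critical_at:
        return "P0"
    if used_percent >= warn_at:
        return "P2"
    return None


current = disk_used_percent(".")
severity = disk_severity(current)  # None, "P2", or "P0" depending on the machine
```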

---

## **73.3 Alert Severity and Prioritization**

Not all alerts are equally important. A price alert for a minor movement might be informational, while a model failure alert could be critical. Defining severity levels helps route alerts appropriately and prevents desensitization.

Common severity levels:

| Level | Name | Meaning | Response time |
|-------|------|---------|---------------|
| P0 | Critical | System down, data loss, immediate action required | < 15 minutes |
| P1 | High | Severe performance degradation, major prediction errors | < 1 hour |
| P2 | Medium | Non‑urgent issues, potential drift detected | < 24 hours |
| P3 | Low | Informational, threshold crosses of minor interest | Best effort |
| P4 | Debug | Development/testing only, no production notification | None |

In the NEPSE system, a circuit‑breaker hit might be P1 (high), while a daily RSI value above 70 might be P2 or P3.
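The severity table can be encoded so that routing decisions live in one place. The sketch below is one possible arrangement; the `Severity` enum, the `ROUTING` map, and the channel names are assumptions chosen to match the levels above.

```python
from enum import IntEnum
from typing import Dict, Iterable, List


class Severity(IntEnum):
    """Lower numbers are more urgent, matching the P0-P4 table."""
    P0 = 0  # Critical
    P1 = 1  # High
    P2 = 2  # Medium
    P3 = 3  # Low
    P4 = 4  # Debug


# Hypothetical routing: which channels each level is delivered to
ROUTING: Dict[Severity, List[str]] = {
    Severity.P0: ["sms", "slack"],
    Severity.P1: ["sms", "slack"],
    Severity.P2: ["slack"],
    Severity.P3: ["email"],
    Severity.P4: [],  # debug alerts produce no production notification
}


def channels_for(severity: Severity) -> List[str]:
    """Return a copy of the channel list for the given severity."""
    return list(ROUTING[severity])


def most_urgent(severities: Iterable[Severity]) -> Severity:
    """Pick the most urgent level from a group of alerts."""
    return min(severities)


circuit_breaker = Severity["P1"]  # look up a level by its label
```

Using an ordered enum instead of bare strings makes comparisons such as "P1 or more urgent" explicit and guards against typos in rule definitions.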

---

## **73.4 Notification Channels**

Once an alert is triggered, it must be delivered through one or more channels. The choice of channel depends on the alert severity and the intended audience.

### **73.4.1 Email**

Email is suitable for low‑severity alerts, daily digests, and reports. It is not ideal for urgent alerts because of potential delays and inbox noise.

### **73.4.2 SMS / Push Notifications**

Critical alerts (P0, P1) should use SMS or mobile push notifications (via services like Twilio, Pushover) to reach on‑call engineers immediately.

### **73.4.3 Slack / Microsoft Teams**

Team collaboration platforms are excellent for medium‑severity alerts. They allow rich formatting, threading, and easy integration with chatbots.

### **73.4.4 Webhooks**

Webhooks enable integration with custom applications, incident management systems (PagerDuty, Opsgenie), or even triggering automated remediation scripts.

### **73.4.5 Dashboard / Logging**

For debugging and historical analysis, alerts can be written to logs or displayed on a monitoring dashboard (e.g., Grafana). This is not a primary notification method but is useful for post‑mortem analysis.
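To make the webhook option concrete, here is a minimal sketch that builds a JSON payload and posts it with the standard library. The payload field names and the helper names `build_payload` and `post_json` are hypothetical; a real incident management system defines its own schema, which you would follow instead.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def build_payload(rule_name: str, severity: str, message: str,
                  symbol: Optional[str] = None) -> dict:
    """Assemble a JSON-serializable alert payload (field names are illustrative)."""
    return {
        "rule": rule_name,
        "severity": severity,
        "message": message,
        "symbol": symbol,
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }


def post_json(url: str, payload: dict, timeout: float = 10.0) -> bool:
    """POST the payload as JSON; return True on a 2xx response."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses: treat as a failed send
        return False


payload = build_payload("Circuit Breaker Hit", "P1", "Stock moved more than 4%.")
encoded = json.dumps(payload)  # what post_json would put on the wire
```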

---

## **73.5 Alert Aggregation and Deduplication**

Without careful design, a flapping alert (one that fires repeatedly) can cause **alert fatigue**, where team members ignore notifications. To combat this, we implement:

- **Deduplication** – If the same alert fires multiple times within a short window, only the first occurrence is sent, or notifications are batched.
- **Aggregation** – Similar alerts (e.g., multiple stocks hitting RSI oversold) can be grouped into a single message.
- **Rate limiting** – Prevent more than N alerts per hour from a single source.

A common technique is to maintain an **alert history** in memory or a database, storing the last time each alert was triggered. New alerts are only sent if the cooldown period has elapsed.
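A minimal in‑memory sketch of that idea, combining a cooldown per alert key with a sliding one‑hour rate limit per source, could look like this. The class name `AlertGate` and its parameters are assumptions for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict


class AlertGate:
    """Decides whether an alert may be sent, using a cooldown per alert key
    and a sliding-window rate limit per source."""

    def __init__(self, cooldown: timedelta, max_per_hour: int):
        self.cooldown = cooldown
        self.max_per_hour = max_per_hour
        self._last_sent: Dict[str, datetime] = {}
        self._sent_times: Dict[str, Deque[datetime]] = defaultdict(deque)

    def allow(self, key: str, source: str, now: datetime) -> bool:
        # Deduplication: suppress repeats of the same key within the cooldown
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False

        # Rate limiting: drop timestamps older than one hour, then count
        window = self._sent_times[source]
        while window and now - window[0] >= timedelta(hours=1):
            window.popleft()
        if len(window) >= self.max_per_hour:
            return False

        self._last_sent[key] = now
        window.append(now)
        return True


gate = AlertGate(cooldown=timedelta(minutes=30), max_per_hour=2)
t0 = datetime(2024, 1, 1, 10, 0)
first = gate.allow("rsi:AAA", "rsi", t0)                            # sent
repeat = gate.allow("rsi:AAA", "rsi", t0 + timedelta(minutes=5))    # in cooldown
other = gate.allow("rsi:BBB", "rsi", t0 + timedelta(minutes=5))     # sent
limited = gate.allow("rsi:CCC", "rsi", t0 + timedelta(minutes=10))  # over the limit
```

Passing `now` explicitly, rather than reading the clock inside the class, makes the behaviour easy to test with fixed timestamps.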

---

## **73.6 Escalation Policies**

Even the best alerting system will sometimes fail to elicit a response. An escalation policy defines what happens if an alert is not acknowledged within a certain time. For example:

1. Alert fires → send to Slack channel #alerts.
2. If no acknowledgment within 15 minutes, escalate to SMS to on‑call engineer.
3. If still unacknowledged after 30 minutes, escalate to phone call to team lead.

Escalation policies are typically managed by dedicated tools (PagerDuty, Opsgenie), but we can implement a simple version in code for demonstration.
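The three steps above can be sketched as a small table of delays and targets. This is a simplified illustration: the names `ESCALATION_STEPS`, `OpenAlert`, and `due_escalations` are hypothetical, and the delays are assumed to be measured from the moment the alert fired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# (delay after firing, target) pairs, in escalation order
ESCALATION_STEPS = [
    (timedelta(0), "slack:#alerts"),
    (timedelta(minutes=15), "sms:on-call engineer"),
    (timedelta(minutes=30), "phone:team lead"),
]


@dataclass
class OpenAlert:
    name: str
    fired_at: datetime
    acknowledged: bool = False
    steps_done: int = 0  # how many escalation steps were already executed


def due_escalations(alert: OpenAlert, now: datetime) -> List[str]:
    """Return the targets to notify now, and mark those steps as done."""
    if alert.acknowledged:
        return []
    due: List[str] = []
    while alert.steps_done < len(ESCALATION_STEPS):
        delay, target = ESCALATION_STEPS[alert.steps_done]
        if now - alert.fired_at < delay:
            break  # the next step is not due yet
        due.append(target)
        alert.steps_done += 1
    return due


t0 = datetime(2024, 1, 1, 9, 0)
alert = OpenAlert("Model failure", fired_at=t0)
at_fire = due_escalations(alert, t0)
at_10_min = due_escalations(alert, t0 + timedelta(minutes=10))
at_16_min = due_escalations(alert, t0 + timedelta(minutes=16))
```

A scheduler would call `due_escalations` periodically for every open alert; setting `acknowledged = True` stops further escalation.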

---

## **73.7 Implementation: Building an Alert Manager for NEPSE**

We will now build a flexible alert manager tailored to the NEPSE prediction system. The design consists of:

- `AlertRule`: a class that defines a condition, severity, and notification channels.
- `AlertManager`: the core engine that evaluates rules against incoming data and triggers notifications.
- `NotificationChannel`: abstract base class for different notification methods (Email, Slack, SMS).
- `AlertHistory`: tracks recent alerts to implement deduplication and cooldown.

We will also integrate with the existing feature‑engineered DataFrame from previous chapters.

### **73.7.1 Defining an Alert Rule**

An alert rule encapsulates everything needed to evaluate a condition and decide how to notify.

```python
from typing import Callable, List, Optional
from datetime import datetime, timedelta
import pandas as pd

class AlertRule:
    """
    Represents a single alert rule.
    
    Attributes:
        name (str): Unique identifier for the rule.
        condition (Callable[[pd.Series], bool]): A function that takes a row (Series) and returns True if the alert should fire.
        severity (str): Severity level (e.g., 'P0', 'P1', ...).
        channels (List[str]): List of notification channels to use.
        cooldown_minutes (int): Minimum time between consecutive alerts for this rule.
        last_trigger (Optional[datetime]): When the rule last fired (used for cooldown).
        description (str): Human-readable description.
    """
    def __init__(
        self,
        name: str,
        condition: Callable[[pd.Series], bool],
        severity: str,
        channels: List[str],
        cooldown_minutes: int = 30,
        description: str = ""
    ):
        self.name = name
        self.condition = condition
        self.severity = severity
        self.channels = channels
        self.cooldown_minutes = cooldown_minutes
        self.last_trigger = None
        self.description = description

    def should_trigger(self, row: pd.Series, now: datetime) -> bool:
        """
        Evaluate the condition and respect cooldown.
        """
        if not self.condition(row):
            return False
        
        if self.last_trigger is not None:
            elapsed = (now - self.last_trigger).total_seconds() / 60.0
            if elapsed < self.cooldown_minutes:
                return False  # Still in cooldown
        
        self.last_trigger = now
        return True
```

**Explanation:**

- The `condition` is a callable that takes a row (a `pd.Series` representing one day of NEPSE data) and returns `True` if the alert condition is met. This design allows us to write any logic, from simple comparisons to complex multi‑feature checks.
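- `cooldown_minutes` prevents the same alert from flooding the system. For example, if a stock’s price stays above a threshold all day, we only want one notification, not one per minute. Note that the cooldown is tracked per rule, not per symbol: after a rule fires for one stock, it stays silent for all other stocks until the cooldown expires.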
- `last_trigger` stores the timestamp of the last firing, and `should_trigger` checks the cooldown before allowing a new alert.

### **73.7.2 Notification Channels**

We define an abstract base class and concrete implementations for email and Slack. In a real system, you would also implement SMS via Twilio, etc.

```python
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import requests
import logging
from abc import ABC, abstractmethod
from typing import List  # used in EmailChannel's type hints

logger = logging.getLogger(__name__)

class NotificationChannel(ABC):
    """Abstract base class for notification channels."""
    
    @abstractmethod
    def send(self, subject: str, message: str, severity: str) -> bool:
        """Send a notification. Return True on success."""
        pass

class EmailChannel(NotificationChannel):
    def __init__(self, smtp_host: str, smtp_port: int, username: str, password: str, from_addr: str, to_addrs: List[str]):
        self.smtp_host = smtp_host
        self.smtp_port = smtp_port
        self.username = username
        self.password = password
        self.from_addr = from_addr
        self.to_addrs = to_addrs

    def send(self, subject: str, message: str, severity: str) -> bool:
        msg = MIMEMultipart()
        msg['From'] = self.from_addr
        msg['To'] = ', '.join(self.to_addrs)
        msg['Subject'] = f"[{severity}] {subject}"
        msg.attach(MIMEText(message, 'plain'))

        try:
            with smtplib.SMTP(self.smtp_host, self.smtp_port, timeout=10) as server:
                server.starttls()
                server.login(self.username, self.password)
                server.send_message(msg)
            logger.info(f"Email sent: {subject}")
            return True
        except Exception as e:
            logger.error(f"Email failed: {e}")
            return False

class SlackChannel(NotificationChannel):
    def __init__(self, webhook_url: str, channel: str = "#alerts", username: str = "AlertBot"):
        self.webhook_url = webhook_url
        self.channel = channel
        self.username = username

    def send(self, subject: str, message: str, severity: str) -> bool:
        payload = {
            # Note: newer app-based incoming webhooks may ignore the channel,
            # username, and icon overrides and use the webhook's own settings.
            "channel": self.channel,
            "username": self.username,
            "text": f"*[{severity}]* {subject}\n{message}",
            "icon_emoji": ":warning:" if severity.startswith(("P0", "P1")) else ":information_source:"
        }
        try:
            # A timeout keeps a slow or unreachable endpoint from blocking the pipeline
            response = requests.post(self.webhook_url, json=payload, timeout=10)
            response.raise_for_status()
            logger.info(f"Slack message sent: {subject}")
            return True
        except requests.RequestException as e:
            logger.error(f"Slack failed: {e}")
            return False
```

**Explanation:**

- The `NotificationChannel` abstract base class defines the interface. Each concrete channel implements `send()`.
- `EmailChannel` uses SMTP to send plain‑text emails. In production, you would likely use a dedicated email service like SendGrid or AWS SES, but SMTP works for demonstration.
- `SlackChannel` posts a message to a Slack channel via an incoming webhook. The severity determines the emoji.
- Both channels log success/failure for debugging.

### **73.7.3 Alert History and Cooldown Tracking**

We need a simple store to remember when each rule last fired. For a single‑process application, an in‑memory dictionary suffices. For distributed systems, use Redis or a database.

```python
from threading import Lock

class AlertHistory:
    """Thread‑safe store for alert firing timestamps."""
    def __init__(self):
        self._data = {}
        self._lock = Lock()

    def get_last_trigger(self, rule_name: str) -> Optional[datetime]:
        with self._lock:
            return self._data.get(rule_name)

    def set_last_trigger(self, rule_name: str, timestamp: datetime):
        with self._lock:
            self._data[rule_name] = timestamp
```

In this chapter, `AlertRule.should_trigger` does not use this history; for simplicity it stores `last_trigger` directly on the rule. In a production system, you would likely move that state into a shared store like this one, so that concurrent evaluations work from one synchronized record instead of racing on the rule object.
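One way to externalize the cooldown state is sketched below. It differs from `AlertHistory` in two assumed design choices: the key is a `(rule_name, symbol)` pair, so each stock gets its own cooldown, and the check and the update happen under a single lock acquisition, so two threads cannot both pass the check. The class name `KeyedCooldown` is hypothetical.

```python
from datetime import datetime, timedelta
from threading import Lock
from typing import Dict, Tuple


class KeyedCooldown:
    """Thread-safe cooldown tracker keyed by (rule name, symbol)."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[str, str], datetime] = {}
        self._lock = Lock()

    def try_acquire(self, rule_name: str, symbol: str,
                    now: datetime, cooldown: timedelta) -> bool:
        """Return True and record `now` if the key is outside its cooldown.

        The check and the update share one lock, which avoids the gap that
        separate get and set calls would leave between them.
        """
        key = (rule_name, symbol)
        with self._lock:
            last = self._last.get(key)
            if last is not None and now - last < cooldown:
                return False
            self._last[key] = now
            return True


tracker = KeyedCooldown()
t0 = datetime(2024, 1, 1, 15, 0)
hour = timedelta(hours=1)
first_a = tracker.try_acquire("RSI Overbought", "AAA", t0, hour)
again_a = tracker.try_acquire("RSI Overbought", "AAA", t0 + timedelta(minutes=10), hour)
first_b = tracker.try_acquire("RSI Overbought", "BBB", t0 + timedelta(minutes=10), hour)
```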

### **73.7.4 The Alert Manager**

The `AlertManager` orchestrates rule evaluation and notification dispatching.

```python
from typing import Dict, List, Optional
import pandas as pd
from datetime import datetime

class AlertManager:
    """
    Manages alert rules and notification channels.
    
    Usage:
        manager = AlertManager()
        manager.add_rule(rule)
        manager.register_channel('email', email_channel)
        manager.register_channel('slack', slack_channel)
        manager.process_row(row)   # called for each new data point
    """
    def __init__(self):
        self.rules: List[AlertRule] = []
        self.channels: Dict[str, NotificationChannel] = {}
        self.history = AlertHistory()

    def add_rule(self, rule: AlertRule):
        self.rules.append(rule)

    def register_channel(self, name: str, channel: NotificationChannel):
        self.channels[name] = channel

    def process_row(self, row: pd.Series, now: Optional[datetime] = None):
        """
        Evaluate all rules against a single row of data.
        """
        if now is None:
            now = datetime.now()

        for rule in self.rules:
            if not rule.should_trigger(row, now):
                continue

            # Build alert message
            subject = f"Alert: {rule.name}"
            message = self._format_message(rule, row)

            # Send to each configured channel
            for channel_name in rule.channels:
                channel = self.channels.get(channel_name)
                if channel:
                    channel.send(subject, message, rule.severity)
                else:
                    logger.warning(f"Channel '{channel_name}' not registered for rule {rule.name}")

    def _format_message(self, rule: AlertRule, row: pd.Series) -> str:
        """Create a human‑readable alert message."""
        def fmt(value) -> str:
            # Two decimals for numbers; plain text for anything else (e.g. 'N/A')
            try:
                return f"{value:.2f}"
            except (TypeError, ValueError):
                return str(value)

        lines = [
            f"Rule: {rule.name}",
            f"Severity: {rule.severity}",
            f"Time: {datetime.now().isoformat()}",
            f"Symbol: {row.get('Symbol', 'N/A')}",
            f"Close: {fmt(row.get('Close', 'N/A'))}",
            f"Description: {rule.description}",
            "\nRelevant values:"
        ]
        # Add a few key columns for context
        for col in ['Open', 'High', 'Low', 'Vol', 'Daily_Return', 'RSI']:
            if col in row:
                lines.append(f"  {col}: {fmt(row[col])}")
        return "\n".join(lines)
```

**Explanation:**

- The `AlertManager` holds a list of rules and a dictionary of named channels.
- `process_row` iterates through all rules, checks if they should trigger (respecting cooldown via the rule's own `should_trigger`), and then dispatches notifications through the channels specified in the rule.
- `_format_message` creates a detailed, readable message that includes the symbol, price, and any relevant feature values. This helps the recipient quickly understand the context.

### **73.7.5 Defining NEPSE‑Specific Rules**

Let’s create some concrete rules for our NEPSE system using the engineered DataFrame from previous chapters.

Assume our DataFrame has columns: `Symbol`, `Close`, `Daily_Return`, `RSI`, `Volume_Z_Score`, etc.

```python
def create_nepse_rules():
    rules = []

    # Rule 1: Circuit breaker hit (price change > 4%)
    def circuit_breaker_condition(row):
        return abs(row.get('Daily_Return', 0)) > 4

    rules.append(AlertRule(
        name="Circuit Breaker Hit",
        condition=circuit_breaker_condition,
        severity="P1",
        channels=["slack", "email"],
        cooldown_minutes=60,
        description="Stock moved more than 4% in a single day."
    ))

    # Rule 2: RSI overbought
    def rsi_overbought(row):
        return row.get('RSI', 50) > 70

    rules.append(AlertRule(
        name="RSI Overbought",
        condition=rsi_overbought,
        severity="P2",
        channels=["slack"],
        cooldown_minutes=120,
        description="RSI exceeds 70, possible reversal."
    ))

    # Rule 3: Volume anomaly (Z-score > 3)
    def volume_anomaly(row):
        return row.get('Volume_Z_Score', 0) > 3

    rules.append(AlertRule(
        name="Volume Anomaly",
        condition=volume_anomaly,
        severity="P2",
        channels=["slack"],
        cooldown_minutes=30,
        description="Volume is more than 3 standard deviations above its 20-day average."
    ))

    # Rule 4: Extreme price drop (>3%) with high volume
    def price_drop_with_volume(row):
        return row.get('Daily_Return', 0) < -3 and row.get('Volume_Z_Score', 0) > 2

    rules.append(AlertRule(
        name="Sharp Drop with Heavy Volume",
        condition=price_drop_with_volume,
        severity="P1",
        channels=["slack", "email"],
        cooldown_minutes=120,
        description="Price drops more than 3% on unusually high volume – potential panic selling."
    ))

    return rules
```

### **73.7.6 Integrating with the Data Pipeline**

Now we can integrate the alert manager into our daily prediction pipeline. For example, after we load and engineer features for each day, we call `manager.process_row(row)` for each stock.

```python
# Assume we have a DataFrame 'df' with all engineered features
manager = AlertManager()
manager.register_channel('slack', SlackChannel(webhook_url='https://hooks.slack.com/...'))
manager.register_channel('email', EmailChannel(
    smtp_host='smtp.gmail.com',
    smtp_port=587,
    username='alerts@example.com',
    password='app-password',
    from_addr='nepse-alerts@example.com',
    to_addrs=['trader@example.com']
))

for rule in create_nepse_rules():
    manager.add_rule(rule)

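# Iterate through data in chronological order.
# Pass the row's own timestamp as `now` so cooldowns follow data time rather
# than wall-clock time; otherwise a fast replay fires each rule at most once.
# Assumption: `df` has a DatetimeIndex. Adapt this if the date is a column.
for idx, row in df.iterrows():
    manager.process_row(row, now=idx)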
```

**Explanation:**

- The manager is instantiated, channels are registered, and rules are added.
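- We iterate through the DataFrame (which should be sorted by date) and process each row. This simulates a real‑time or batch‑oriented alert check. Because `process_row` defaults `now` to the current wall‑clock time, cooldowns during a fast replay of historical rows are measured in wall‑clock time, not data time; pass each row’s date as `now` when cooldowns should follow the data.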
- In a production streaming system, you would call `process_row` as soon as a new data point arrives (e.g., from Kafka or after a daily ETL job).

---

## **73.8 Best Practices and Pitfalls**

### **73.8.1 Avoid Alert Fatigue**

- **Use cooldowns** as shown above. No one wants 100 emails per hour.
- **Aggregate similar alerts**. Instead of one message per stock, group all stocks that triggered the same rule into a single digest.
- **Set appropriate severity**. Reserve high‑urgency channels (SMS) for truly critical events.
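The aggregation idea can be sketched as a small grouping step that runs after all rules have been evaluated for the day. The function name `build_digests`, the message format, and the symbols `AAA` and `BBB` are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_digests(triggered: Iterable[Tuple[str, str]]) -> List[str]:
    """Group (rule_name, symbol) pairs into one message per rule."""
    by_rule: Dict[str, List[str]] = defaultdict(list)
    for rule_name, symbol in triggered:
        if symbol not in by_rule[rule_name]:  # skip duplicate symbols
            by_rule[rule_name].append(symbol)
    return [
        f"{rule}: {len(symbols)} stock(s): {', '.join(sorted(symbols))}"
        for rule, symbols in by_rule.items()
    ]


# Hypothetical results collected during one evaluation pass
triggered = [
    ("RSI Oversold", "BBB"),
    ("RSI Oversold", "AAA"),
    ("Volume Anomaly", "AAA"),
    ("RSI Oversold", "BBB"),  # duplicate, ignored
]
digests = build_digests(triggered)
```

Here three distinct triggers collapse into two messages, one per rule.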

### **73.8.2 Test Your Alerts**

- Implement a “dry run” mode where alerts are logged but not sent.
- Use synthetic data to verify that conditions fire as expected.
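Both suggestions can be combined in a short test harness. The sketch below uses a hypothetical `DryRunChannel` that records messages instead of delivering them, and plain dictionaries as synthetic rows (they support `.get()` just like a `pd.Series`). The condition repeats the circuit breaker rule from Section 73.7.5; the `TEST` symbols are made up.

```python
import logging
from typing import List, Tuple

logger = logging.getLogger("alerts.dry_run")


class DryRunChannel:
    """Records and logs notifications instead of delivering them."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str, str]] = []

    def send(self, subject: str, message: str, severity: str) -> bool:
        logger.info("DRY RUN [%s] %s", severity, subject)
        self.sent.append((subject, message, severity))
        return True


def circuit_breaker_condition(row) -> bool:
    """Same logic as the circuit breaker rule: absolute move above 4%."""
    return abs(row.get("Daily_Return", 0)) > 4


# Synthetic rows covering both directions, the boundary, and missing data
synthetic_rows = [
    {"Symbol": "TEST1", "Daily_Return": 4.5},
    {"Symbol": "TEST2", "Daily_Return": -4.2},
    {"Symbol": "TEST3", "Daily_Return": 4.0},  # exactly at the limit: no alert
    {"Symbol": "TEST4"},                       # column missing: no alert
]

channel = DryRunChannel()
for row in synthetic_rows:
    if circuit_breaker_condition(row):
        subject = f"Alert: Circuit Breaker Hit ({row['Symbol']})"
        channel.send(subject, "synthetic test", "P1")

fired = [subject for subject, _, _ in channel.sent]
```

Because `DryRunChannel` exposes the same `send()` signature as the real channels, it could be registered under the real channel names during a test run.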

### **73.8.3 Document Every Rule**

- Maintain a central registry of alert rules, their rationale, and expected action. This helps on‑call engineers respond correctly.

### **73.8.4 Monitor the Alerting System Itself**

- Alerting is part of your system’s critical path. Monitor that notifications are being sent and that channels are healthy. A dead webhook should trigger an alert itself.

### **73.8.5 Handle Failures Gracefully**

- If a notification channel fails (e.g., SMTP server down), log the error and consider falling back to another channel or queuing the alert for later retry.
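A simple retry‑then‑fallback strategy might be sketched as follows. The helper name `send_with_fallback` and its parameters are assumptions; each sender is any callable with the same `(subject, message, severity)` signature as `NotificationChannel.send`, and the two stand‑in senders exist only to exercise the logic.

```python
import logging
import time
from typing import Callable, Sequence

logger = logging.getLogger("alerts.fallback")

SendFn = Callable[[str, str, str], bool]


def send_with_fallback(senders: Sequence[SendFn], subject: str, message: str,
                       severity: str, retries: int = 2,
                       delay_seconds: float = 0.0) -> bool:
    """Try each sender in order, retrying before moving to the next one."""
    for send in senders:
        for attempt in range(1, retries + 1):
            try:
                if send(subject, message, severity):
                    return True
            except Exception as exc:  # treat an exception like a failed send
                logger.error("Sender raised on attempt %d: %s", attempt, exc)
            if delay_seconds:
                time.sleep(delay_seconds)
    logger.error("All channels failed for: %s", subject)
    return False


attempts = []


def broken_smtp(subject: str, message: str, severity: str) -> bool:
    """Hypothetical primary channel that is currently down."""
    attempts.append("smtp")
    raise ConnectionError("SMTP server down")


def working_slack(subject: str, message: str, severity: str) -> bool:
    """Hypothetical fallback channel that succeeds."""
    attempts.append("slack")
    return True


delivered = send_with_fallback([broken_smtp, working_slack],
                               "Alert: Volume Anomaly", "details", "P2")
```

If every channel fails, the function returns `False`, at which point the caller could queue the alert for a later retry.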

---

## **Chapter Summary**

In this chapter, we built a comprehensive alerting and notification subsystem for a time‑series prediction system, using the NEPSE stock prediction system as a running example. We covered:

- The importance of alerting in production ML systems.
- Types of alerts: threshold, anomaly, drift, performance, and health.
- Severity levels to prioritize responses.
- Notification channels (email, Slack, SMS, webhooks) and their appropriate use.
- Techniques to prevent alert fatigue: deduplication, aggregation, cooldowns.
- Escalation policies for ensuring critical alerts are acted upon.
- A full implementation of an `AlertManager`, `AlertRule`, and notification channels, with concrete rules tailored to NEPSE data.

The alert manager we designed is modular, extensible, and integrates with the feature‑engineered DataFrame from earlier chapters. By adding a few lines of code to your data pipeline, you can now be notified about significant market events, model degradation, or system issues as soon as new data is processed.

In the next chapter, **Chapter 74: Complete Financial Prediction System**, we will tie together everything we have learned – data collection, feature engineering, model training, deployment, and now alerting – into a complete, end‑to‑end NEPSE prediction system.

---

**End of Chapter 73**

