# 18. Anomaly Detection with Transformers

**Author:** Praut s.r.o. - AI Integration & Business Automation

In this notebook we learn to use Transformer models for anomaly detection in several kinds of data: time series, text, and tabular data.

## Contents
1. Introduction to anomaly detection
2. Anomalies in time series (Transformer autoencoder)
3. Anomalies in text (outlier detection)
4. Anomalies in logs and events
5. Production monitoring system

In [None]:
# Install libraries
!pip install transformers sentence-transformers torch pandas numpy scikit-learn matplotlib seaborn -q

In [None]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Check for a GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Introduction to anomaly detection

Anomalies are rare patterns in data that differ markedly from the majority. Several approaches exist:

| Method | Principle | Typical use |
|--------|-----------|-------------|
| Autoencoder | Reconstruction error | Time series, images |
| Isolation Forest | Isolating rare points | Tabular data |
| Embedding + LOF | Distance in latent space | Text, logs |
| Attention-based | Unusual attention patterns | Sequences |
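
To make the reconstruction-error principle from the first table row concrete, here is a minimal sketch that swaps the neural autoencoder for a one-component linear projection (PCA via SVD) on toy 2-D data. The toy data, the 99th-percentile threshold, and the helper name `reconstruction_error` are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal points lie close to the line y = 2x; one outlier sits far from it
x = rng.normal(0, 1, 200)
normal = np.column_stack([x, 2 * x + rng.normal(0, 0.05, 200)])
outlier = np.array([[1.0, -2.0]])
data = np.vstack([normal, outlier])

# "Train" on normal data only: keep the first principal direction
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
component = vt[:1]  # shape (1, 2)

def reconstruction_error(points: np.ndarray) -> np.ndarray:
    """Squared error after projecting onto the 1-D subspace and back."""
    centered = points - mean
    reconstructed = centered @ component.T @ component
    return ((centered - reconstructed) ** 2).sum(axis=1)

errors = reconstruction_error(data)
threshold = np.percentile(reconstruction_error(normal), 99)
flags = errors > threshold
print(f"Outlier error: {errors[-1]:.3f}, threshold: {threshold:.5f}")
```

The outlier cannot be represented in the subspace learned from normal data, so its error stands out. The Transformer autoencoder in section 2 relies on the same idea with a nonlinear model.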

In [None]:
# Generate synthetic data with anomalies

def generate_sensor_data_with_anomalies(
    n_samples: int = 5000,
    anomaly_ratio: float = 0.05,
    seed: int = 42
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate synthetic sensor data with injected anomalies.
    Simulates readings from an industrial machine.
    """
    np.random.seed(seed)
    
    # Normal data: periodic signals with noise
    t = np.linspace(0, 100, n_samples)
    
    # 4 sensors
    sensor1 = np.sin(t * 0.1) + np.random.normal(0, 0.1, n_samples)  # Temperature
    sensor2 = np.cos(t * 0.15) + np.random.normal(0, 0.1, n_samples)  # Vibration
    sensor3 = 0.5 * np.sin(t * 0.2) + 0.3 * np.cos(t * 0.1) + np.random.normal(0, 0.05, n_samples)  # Pressure
    sensor4 = np.random.normal(0, 0.2, n_samples)  # Noise-only channel (stability)
    
    data = np.column_stack([sensor1, sensor2, sensor3, sensor4])
    
    # Anomaly labels
    labels = np.zeros(n_samples)
    n_anomalies = int(n_samples * anomaly_ratio)
    anomaly_indices = np.random.choice(n_samples, n_anomalies, replace=False)
    
    # Inject different types of anomalies
    for idx in anomaly_indices:
        anomaly_type = np.random.choice(['spike', 'drift', 'noise', 'drop'])
        sensor_idx = np.random.randint(0, 4)
        
        if anomaly_type == 'spike':  # Sudden jump
            data[idx, sensor_idx] += np.random.uniform(2, 4) * np.sign(np.random.randn())
        elif anomaly_type == 'drift':  # Gradual drift
            drift_len = min(20, n_samples - idx)
            drift = np.linspace(0, np.random.uniform(1, 2), drift_len)
            data[idx:idx+drift_len, sensor_idx] += drift
        elif anomaly_type == 'noise':  # Increased noise
            noise_len = min(10, n_samples - idx)
            data[idx:idx+noise_len, sensor_idx] += np.random.normal(0, 1, noise_len)
        else:  # Drop: sensor dropout
            data[idx, sensor_idx] = 0
        
        labels[idx] = 1
    
    return data.astype(np.float32), labels.astype(np.int64)


# Generate the data
sensor_data, labels = generate_sensor_data_with_anomalies(n_samples=5000, anomaly_ratio=0.05)
print(f"Data shape: {sensor_data.shape}")
print(f"Anomalies: {labels.sum()} ({labels.mean()*100:.1f}%)")

# Visualization
fig, axes = plt.subplots(4, 1, figsize=(14, 8), sharex=True)
sensor_names = ['Temperature', 'Vibration', 'Pressure', 'Stability']

for i, (ax, name) in enumerate(zip(axes, sensor_names)):
    ax.plot(sensor_data[:500, i], 'b-', alpha=0.7, linewidth=0.5)
    # Mark the anomalies
    anomaly_idx = np.where(labels[:500] == 1)[0]
    ax.scatter(anomaly_idx, sensor_data[anomaly_idx, i], c='red', s=30, label='Anomaly')
    ax.set_ylabel(name)
    ax.grid(True, alpha=0.3)
    if i == 0:
        ax.legend()

plt.xlabel('Time')
plt.suptitle('Sensor data with anomalies')
plt.tight_layout()
plt.show()
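
In the generator above, `drift` and `noise` anomalies modify a run of up to 20 or 10 consecutive samples, but only the starting index is labelled. If point-level evaluation matters, one option is to label the whole affected run. A minimal sketch, assuming the generator also recorded each anomaly's start and length (the helper `expand_labels` and its arguments are hypothetical, not part of the notebook):

```python
import numpy as np

def expand_labels(labels: np.ndarray, starts, lengths) -> np.ndarray:
    """Mark every sample of each anomalous run, clipped to the array length."""
    expanded = labels.copy()
    n = len(expanded)
    for start, length in zip(starts, lengths):
        expanded[start:min(start + length, n)] = 1
    return expanded

# Toy example: a 20-sample drift starting at index 5 and a 10-sample
# noise burst starting at index 25 in a series of only 30 samples
labels = np.zeros(30, dtype=np.int64)
labels[[5, 25]] = 1
expanded = expand_labels(labels, starts=[5, 25], lengths=[20, 10])
print(labels.sum(), expanded.sum())
```

The second run is cut off at the end of the series, mirroring the `min(..., n_samples - idx)` guard in the generator.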

## 2. Anomalies in time series (Transformer autoencoder)

In [None]:
import math


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding for the Transformer."""
    
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


class TransformerAutoencoder(nn.Module):
    """
    Transformer-based autoencoder for anomaly detection.
    Anomalies are identified as samples with a high reconstruction error.
    """
    
    def __init__(
        self,
        n_features: int,
        d_model: int = 64,
        n_heads: int = 4,
        n_encoder_layers: int = 2,
        n_decoder_layers: int = 2,
        d_ff: int = 256,
        dropout: float = 0.1,
        latent_dim: int = 32
    ):
        super().__init__()
        
        self.n_features = n_features
        self.d_model = d_model
        
        # Input projection
        self.input_projection = nn.Linear(n_features, d_model)
        self.positional_encoding = PositionalEncoding(d_model, dropout=dropout)
        
        # Encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            activation='gelu',
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_encoder_layers)
        
        # Bottleneck (latent space)
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.from_latent = nn.Linear(latent_dim, d_model)
        
        # Decoder
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            activation='gelu',
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_decoder_layers)
        
        # Output projection
        self.output_projection = nn.Linear(d_model, n_features)
        
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Encode the input into the latent space."""
        x = self.input_projection(x)
        x = self.positional_encoding(x)
        encoded = self.encoder(x)
        latent = self.to_latent(encoded)
        return latent
    
    def decode(self, latent: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        """Decode from the latent space, using the projected target sequence as decoder input."""
        memory = self.from_latent(latent)
        target_emb = self.input_projection(target)
        target_emb = self.positional_encoding(target_emb)
        decoded = self.decoder(target_emb, memory)
        output = self.output_projection(decoded)
        return output
    
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            x: Input tensor (batch, seq_len, n_features)
            
        Returns:
            reconstruction: Reconstructed input
            latent: Latent representation
        """
        latent = self.encode(x)
        reconstruction = self.decode(latent, x)
        return reconstruction, latent


print("Model class defined")
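
The sinusoidal table built in `PositionalEncoding` can be reproduced in plain NumPy to inspect its values. This is a sketch for illustration only: the function name `sinusoidal_encoding` is an assumption, and like the class above it expects an even `d_model`.

```python
import math
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) table: sine on even columns, cosine on odd ones."""
    position = np.arange(max_len)[:, None]
    div_term = np.exp(np.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=64)
print(pe.shape)
print(pe[0, :4])  # position 0: sin(0) = 0 on even columns, cos(0) = 1 on odd ones
```

Each pair of columns oscillates at a different frequency, which gives every position in the window a distinct signature that is added to the projected sensor values.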

In [None]:
from torch.utils.data import Dataset, DataLoader


class TimeSeriesDataset(Dataset):
    """Sliding-window dataset for time series."""
    
    def __init__(
        self,
        data: np.ndarray,
        labels: np.ndarray,
        window_size: int = 50,
        stride: int = 1,
        scaler: Optional[StandardScaler] = None
    ):
        self.data = data
        self.labels = labels
        self.window_size = window_size
        self.stride = stride
        
        # Scaling: reuse an already fitted scaler if one is given (e.g. the
        # training scaler for test data), otherwise fit a new one on this data
        if scaler is None:
            self.scaler = StandardScaler()
            self.data_scaled = self.scaler.fit_transform(data)
        else:
            self.scaler = scaler
            self.data_scaled = self.scaler.transform(data)
        
        # Start indices of the windows
        self.indices = list(range(0, len(data) - window_size + 1, stride))
        
    def __len__(self):
        return len(self.indices)
    
    def __getitem__(self, idx):
        start_idx = self.indices[idx]
        window = self.data_scaled[start_idx:start_idx + self.window_size]
        
        # Label is 1 if the window contains at least one anomaly
        window_labels = self.labels[start_idx:start_idx + self.window_size]
        is_anomaly = 1 if window_labels.sum() > 0 else 0
        
        return {
            'data': torch.FloatTensor(window),
            'label': torch.tensor(is_anomaly, dtype=torch.long),
            'start_idx': start_idx
        }


# Training set: normal samples only, taken from the first 3000 time steps
# so that it does not overlap the test range below
normal_mask = labels == 0
train_data = sensor_data[:3000][normal_mask[:3000]]
train_labels = np.zeros(len(train_data))

train_dataset = TimeSeriesDataset(train_data, train_labels, window_size=50, stride=5)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# The test set also contains anomalies; it reuses the scaler fitted on the training data
test_dataset = TimeSeriesDataset(sensor_data[3000:], labels[3000:], window_size=50, stride=1, scaler=train_dataset.scaler)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

print(f"Train windows: {len(train_dataset)}")
print(f"Test windows: {len(test_dataset)}")
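
The number of windows follows directly from the index range used in `TimeSeriesDataset`: `(n_samples - window_size) // stride + 1`. A small standalone check, using the 2000-sample test split (`sensor_data[3000:]`) as the example; the helper name `window_starts` is an assumption for illustration:

```python
def window_starts(n_samples: int, window_size: int, stride: int) -> list:
    """Start indices of all full windows, as computed in TimeSeriesDataset."""
    return list(range(0, n_samples - window_size + 1, stride))

test_starts = window_starts(n_samples=2000, window_size=50, stride=1)
strided_starts = window_starts(n_samples=2000, window_size=50, stride=5)

print(len(test_starts), test_starts[-1])        # 1951 1950
print(len(strided_starts), strided_starts[-1])  # 391 1950
```

With `stride=1` consecutive windows overlap in 49 of 50 samples, so a single anomalous point marks up to 50 windows as anomalous.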

In [None]:
# Train the autoencoder

n_features = sensor_data.shape[1]
model = TransformerAutoencoder(
    n_features=n_features,
    d_model=64,
    n_heads=4,
    n_encoder_layers=2,
    n_decoder_layers=2,
    latent_dim=32
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
criterion = nn.MSELoss()

# Training loop
n_epochs = 30
train_losses = []

for epoch in range(n_epochs):
    model.train()
    epoch_loss = 0
    
    for batch in train_loader:
        data = batch['data'].to(device)
        
        optimizer.zero_grad()
        reconstruction, _ = model(data)
        loss = criterion(reconstruction, data)
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    scheduler.step(avg_loss)
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{n_epochs} - Loss: {avg_loss:.6f}")

# Plot the training loss
plt.figure(figsize=(10, 4))
plt.plot(train_losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True, alpha=0.3)
plt.show()
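
The training loop calls `clip_grad_norm_` with a maximum norm of 1.0. The following NumPy re-implementation is only a sketch of the idea (rescale all gradients by a common factor when their combined L2 norm exceeds the limit); the function name `clip_by_global_norm` and the small `eps` term are assumptions for illustration, not PyTorch's code.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm: float = 1.0, eps: float = 1e-6):
    """Scale all gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two "parameter gradients" with a joint norm of sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(float((g ** 2).sum()) for g in clipped)))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")

# Gradients that are already small are left unchanged
small, _ = clip_by_global_norm([np.array([0.1, 0.2])], max_norm=1.0)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length is limited.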

In [None]:
# Detect anomalies on the test data

@torch.no_grad()
def compute_reconstruction_errors(model: nn.Module, dataloader: DataLoader, device: torch.device) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute the reconstruction error of every window."""
    model.eval()
    
    errors = []
    all_labels = []
    indices = []
    
    for batch in dataloader:
        data = batch['data'].to(device)
        labels = batch['label'].numpy()
        start_idx = batch['start_idx'].numpy()
        
        reconstruction, _ = model(data)
        
        # MSE for each window
        mse = ((reconstruction - data) ** 2).mean(dim=(1, 2)).cpu().numpy()
        
        errors.extend(mse)
        all_labels.extend(labels)
        indices.extend(start_idx)
    
    return np.array(errors), np.array(all_labels), np.array(indices)


# Compute reconstruction errors
errors, test_labels, test_indices = compute_reconstruction_errors(model, test_loader, device)

print(f"Mean error: {errors.mean():.6f}")
print(f"Error std: {errors.std():.6f}")

# Set the anomaly threshold (here the 95th percentile of the test errors,
# so roughly 5% of the test windows are flagged regardless of their content)
threshold = np.percentile(errors, 95)
print(f"Threshold (95th percentile): {threshold:.6f}")

# Predict anomalies
predictions = (errors > threshold).astype(int)

# Evaluation
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print("\nClassification Report:")
print(classification_report(test_labels, predictions, target_names=['Normal', 'Anomaly']))

try:
    auc = roc_auc_score(test_labels, errors)
    print(f"\nAUC-ROC: {auc:.4f}")
except ValueError:
    print("AUC cannot be computed (only one class present)")
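
A percentile threshold can also be derived from errors on held-out normal data instead of the test set itself, which keeps the threshold independent of the data being scored. The sketch below uses synthetic gamma-distributed numbers as stand-ins for reconstruction errors; all arrays and their parameters are assumptions for illustration, not results from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: small errors for normal windows, large ones for anomalies
val_normal_errors = rng.gamma(shape=2.0, scale=0.01, size=1000)
test_normal_errors = rng.gamma(shape=2.0, scale=0.01, size=1000)
test_anomaly_errors = 0.5 + rng.gamma(shape=2.0, scale=0.05, size=50)

# Threshold from held-out normal data only
threshold = np.percentile(val_normal_errors, 95)

false_positive_rate = float((test_normal_errors > threshold).mean())
recall = float((test_anomaly_errors > threshold).mean())
print(f"threshold={threshold:.4f}, FPR={false_positive_rate:.3f}, recall={recall:.3f}")
```

With this choice the percentile controls the expected false-positive rate on normal data, while recall depends on how well the anomalies separate from the normal error distribution.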

In [None]:
# Visualize the reconstruction errors

fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Top plot: reconstruction errors
ax1 = axes[0]
ax1.plot(test_indices, errors, 'b-', alpha=0.7, linewidth=0.5, label='Reconstruction error')
ax1.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold ({threshold:.4f})')

# Mark the true anomalies
anomaly_mask = test_labels == 1
ax1.scatter(test_indices[anomaly_mask], errors[anomaly_mask], c='red', s=20, alpha=0.5, label='True anomalies')

ax1.set_ylabel('Reconstruction error')
ax1.set_title('Anomaly detection with a Transformer autoencoder')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bottom plot: error distribution
ax2 = axes[1]
ax2.hist(errors[test_labels == 0], bins=50, alpha=0.7, label='Normal', density=True)
ax2.hist(errors[test_labels == 1], bins=50, alpha=0.7, label='Anomaly', density=True)
ax2.axvline(x=threshold, color='r', linestyle='--', label='Threshold')
ax2.set_xlabel('Reconstruction error')
ax2.set_ylabel('Density')
ax2.set_title('Distribution of reconstruction errors')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Anomalies in text (outlier detection)

In [None]:
class TextAnomalyDetector:
    """
    Text anomaly detector based on Sentence Transformers.
    Identifies texts that are semantically different from the majority.
    """
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Loading model {model_name}...")
        self.encoder = SentenceTransformer(model_name)
        self.embeddings = None
        self.texts = None
        self.detector = None
        self.centroid = None
        self.threshold = None
        
    def fit(self, texts: List[str], contamination: float = 0.05):
        """
        Fit the detector on a collection of texts.
        
        Args:
            texts: List of texts
            contamination: Expected fraction of anomalies
        """
        self.texts = texts
        print(f"Creating embeddings for {len(texts)} texts...")
        
        # Create embeddings
        self.embeddings = self.encoder.encode(
            texts,
            show_progress_bar=True,
            convert_to_numpy=True
        )
        
        # Compute the centroid (mean embedding)
        self.centroid = self.embeddings.mean(axis=0)
        
        # Train the Isolation Forest
        self.detector = IsolationForest(
            contamination=contamination,
            random_state=42,
            n_estimators=100
        )
        self.detector.fit(self.embeddings)
        
        # Compute the threshold from the distances to the centroid
        distances = np.linalg.norm(self.embeddings - self.centroid, axis=1)
        self.threshold = np.percentile(distances, (1 - contamination) * 100)
        
        print(f"Model fitted. Distance threshold: {self.threshold:.4f}")
        
    def detect_anomalies(self, texts: List[str]) -> List[Dict]:
        """
        Detect anomalies in new texts.
        
        Returns:
            A list of dicts with information about each text
        """
        # Create embeddings
        embeddings = self.encoder.encode(texts, convert_to_numpy=True)
        
        # Isolation Forest predictions
        if_predictions = self.detector.predict(embeddings)
        if_scores = self.detector.decision_function(embeddings)
        
        # Distance from the centroid
        distances = np.linalg.norm(embeddings - self.centroid, axis=1)
        
        results = []
        for i, text in enumerate(texts):
            is_anomaly_if = if_predictions[i] == -1
            is_anomaly_dist = distances[i] > self.threshold
            
            results.append({
                'text': text[:100] + '...' if len(text) > 100 else text,
                'is_anomaly': is_anomaly_if or is_anomaly_dist,
                'isolation_forest_anomaly': is_anomaly_if,
                'distance_anomaly': is_anomaly_dist,
                'anomaly_score': float(-if_scores[i]),  # Higher = more anomalous
                'distance_from_centroid': float(distances[i])
            })
        
        return results
    
    def find_most_anomalous(self, n: int = 5) -> List[Tuple[int, str, float]]:
        """Find the N most anomalous texts in the training data."""
        distances = np.linalg.norm(self.embeddings - self.centroid, axis=1)
        top_indices = np.argsort(distances)[-n:][::-1]
        
        return [
            (int(idx), self.texts[idx], float(distances[idx]))
            for idx in top_indices
        ]


# Example text data: Czech e-shop reviews
reviews = [
    # Normal reviews
    "Skvělý produkt, jsem velmi spokojený s kvalitou.",
    "Doručení bylo rychlé a zboží v pořádku.",
    "Doporučuji, výborný poměr cena/výkon.",
    "Produkt odpovídá popisu, bez problémů.",
    "Kvalitní zpracování, splnilo očekávání.",
    "Rychlé dodání, dobře zabalené.",
    "Super obchod, určitě nakoupím znovu.",
    "Zboží v perfektním stavu, děkuji.",
    "Přesně to co jsem hledal, spokojenost.",
    "Bezproblémová komunikace, vřele doporučuji.",
    "Výborná kvalita za rozumnou cenu.",
    "Produkt překonal mé očekávání.",
    "Všechno proběhlo hladce, díky.",
    "Skvělá zákaznická podpora.",
    "Určitě doporučuji tento obchod.",
    # Anomalous reviews (spam, suspicious content)
    "SUPER VÝHRA!!! Klikni sem a vyhraj iPhone!!!",
    "asdfghjkl random text bla bla",
    "Navštivte www.podvodny-web.cz pro slevy!",
    "🎁🎁🎁 FREE MONEY 💰💰💰 click here!!!",
    "The quick brown fox jumps over lazy dog",  # English in a Czech context
]

# Fit the detector
text_detector = TextAnomalyDetector()
text_detector.fit(reviews[:15], contamination=0.1)  # Fit on the normal reviews only

# Detect anomalies
print("\n--- Text anomaly detection ---")
results = text_detector.detect_anomalies(reviews)

for result in results:
    status = "❌ ANOMALY" if result['is_anomaly'] else "✓ OK"
    print(f"{status}: {result['text'][:50]}... (score: {result['anomaly_score']:.3f})")
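
Because `fit` places the distance threshold at the `(1 - contamination)` percentile of the training distances, roughly a `contamination` fraction of the training texts exceeds it by construction, even when all of them are normal. A toy check with 15 made-up distances (placeholders for real embedding distances):

```python
import numpy as np

# 15 hypothetical centroid distances, one per training text
distances = np.arange(15, dtype=float)
contamination = 0.1

# Same rule as in TextAnomalyDetector.fit
threshold = np.percentile(distances, (1 - contamination) * 100)
n_flagged = int((distances > threshold).sum())

print(f"threshold={threshold:.2f}, flagged {n_flagged} of {len(distances)}")
```

So when the detector is fitted on clean data only, a few normal reviews are expected to be reported as anomalies; a smaller `contamination` reduces that effect.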

## 4. Anomalies in logs and events

In [None]:
@dataclass
class LogEntry:
    """Representation of a log record."""
    timestamp: datetime
    level: str
    source: str
    message: str
    metadata: Dict = field(default_factory=dict)


class LogAnomalyDetector:
    """
    Anomaly detector for system logs.
    Combines frequency analysis, embeddings, and pattern matching.
    """
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.encoder = SentenceTransformer(model_name)
        self.known_patterns = {}
        self.frequency_baseline = {}
        self.embedding_detector = None
        self.embeddings_cache = {}
        
    def learn_patterns(self, logs: List[LogEntry], window_minutes: int = 60):
        """
        Learn normal patterns from historical logs.
        """
        print(f"Learning patterns from {len(logs)} logs...")
        
        # 1. Frequencies by level and source
        level_counts = {}
        source_counts = {}
        
        for log in logs:
            level_counts[log.level] = level_counts.get(log.level, 0) + 1
            source_counts[log.source] = source_counts.get(log.source, 0) + 1
        
        total = len(logs)
        self.frequency_baseline = {
            'levels': {k: v / total for k, v in level_counts.items()},
            'sources': {k: v / total for k, v in source_counts.items()}
        }
        
        # 2. Message embeddings for detecting unusual messages
        messages = [log.message for log in logs]
        embeddings = self.encoder.encode(messages, show_progress_bar=True)
        
        # Isolation Forest on the embeddings
        self.embedding_detector = IsolationForest(
            contamination=0.05,
            random_state=42
        )
        self.embedding_detector.fit(embeddings)
        
        # 3. Store the known patterns
        unique_messages = set(messages)
        for msg in unique_messages:
            # Normalize the message (replace numbers, IP addresses, UUIDs)
            normalized = self._normalize_message(msg)
            self.known_patterns[normalized] = self.known_patterns.get(normalized, 0) + 1
        
        print(f"Learned {len(self.known_patterns)} unique patterns")
        print(f"Baseline levels: {self.frequency_baseline['levels']}")
    
    def _normalize_message(self, message: str) -> str:
        """Normalize a message by replacing its variable parts with placeholders."""
        import re
        # Replace UUIDs and IP addresses first; the generic number rule below
        # would otherwise consume their digits and these patterns would never match
        normalized = re.sub(r'[a-f0-9-]{36}', '<UUID>', message)
        normalized = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '<IP>', normalized)
        # Replace the remaining numbers
        normalized = re.sub(r'\d+', '<NUM>', normalized)
        return normalized
    
    def analyze_log(self, log: LogEntry) -> Dict:
        """
        Analyze a single log entry and return its anomaly score.
        """
        anomaly_indicators = []
        scores = {}
        
        # 1. Level check
        level_freq = self.frequency_baseline['levels'].get(log.level, 0)
        if level_freq < 0.01:  # Rare level
            anomaly_indicators.append(f"Rare level: {log.level}")
        scores['level_rarity'] = 1 - level_freq
        
        # 2. Source check
        source_freq = self.frequency_baseline['sources'].get(log.source, 0)
        if source_freq < 0.01:  # Unknown or rare source
            anomaly_indicators.append(f"Unknown source: {log.source}")
        scores['source_rarity'] = 1 - source_freq
        
        # 3. Embedding anomaly
        embedding = self.encoder.encode([log.message])
        if_score = -self.embedding_detector.decision_function(embedding)[0]
        if if_score > 0.1:  # Anomalous embedding
            anomaly_indicators.append("Unusual message (embedding)")
        scores['embedding_anomaly'] = max(0, if_score)
        
        # 4. Pattern matching
        normalized = self._normalize_message(log.message)
        if normalized not in self.known_patterns:
            anomaly_indicators.append("Unknown message pattern")
            scores['pattern_novelty'] = 1.0
        else:
            scores['pattern_novelty'] = 0.0
        
        # 5. Compute the overall score
        weights = {
            'level_rarity': 0.2,
            'source_rarity': 0.2,
            'embedding_anomaly': 0.4,
            'pattern_novelty': 0.2
        }
        total_score = sum(scores[k] * weights[k] for k in weights)
        
        return {
            'log': log,
            'is_anomaly': total_score > 0.3 or log.level == 'CRITICAL',
            'anomaly_score': total_score,
            'indicators': anomaly_indicators,
            'detailed_scores': scores
        }
    
    def analyze_batch(self, logs: List[LogEntry]) -> List[Dict]:
        """Analyze a batch of logs."""
        return [self.analyze_log(log) for log in logs]


# Create synthetic logs
def generate_logs(n_normal: int = 200, n_anomalous: int = 20) -> List[LogEntry]:
    """Generate synthetic logs."""
    np.random.seed(42)
    logs = []
    
    # Normal messages
    normal_messages = [
        "Request processed successfully in {ms}ms",
        "User {user_id} logged in from {ip}",
        "Database query completed in {ms}ms",
        "Cache hit for key {key}",
        "Background job {job_id} completed",
        "Health check passed",
        "Connection established to {service}",
        "Metrics exported successfully",
    ]
    
    sources = ['api-server', 'database', 'cache', 'worker', 'gateway']
    
    for i in range(n_normal):
        msg_template = np.random.choice(normal_messages)
        msg = msg_template.format(
            ms=np.random.randint(10, 500),
            user_id=np.random.randint(1000, 9999),
            ip=f"{np.random.randint(1,255)}.{np.random.randint(1,255)}.{np.random.randint(1,255)}.{np.random.randint(1,255)}",
            key=f"cache_key_{np.random.randint(1,100)}",
            job_id=f"job_{np.random.randint(1,1000)}",
            service=np.random.choice(['redis', 'postgres', 'elasticsearch'])
        )
        
        logs.append(LogEntry(
            timestamp=datetime.now() - timedelta(minutes=np.random.randint(0, 1440)),
            level=np.random.choice(['INFO', 'DEBUG', 'INFO', 'INFO', 'WARNING']),
            source=np.random.choice(sources),
            message=msg
        ))
    
    # Anomalous logs
    anomalous_messages = [
        "SECURITY: Multiple failed login attempts from {ip}",
        "ERROR: Database connection timeout after 30s",
        "CRITICAL: Disk space below 5%",
        "ALERT: Unusual traffic spike detected",
        "ERROR: Unhandled exception in payment processing",
        "WARNING: SSL certificate expires in 2 days",
        "CRITICAL: Memory usage above 95%",
        "SECURITY: SQL injection attempt detected",
    ]
    
    for i in range(n_anomalous):
        msg = np.random.choice(anomalous_messages).format(
            ip=f"{np.random.randint(1,255)}.{np.random.randint(1,255)}.{np.random.randint(1,255)}.{np.random.randint(1,255)}"
        )
        
        level = 'ERROR' if 'ERROR' in msg else ('CRITICAL' if 'CRITICAL' in msg else 'WARNING')
        
        logs.append(LogEntry(
            timestamp=datetime.now() - timedelta(minutes=np.random.randint(0, 1440)),
            level=level,
            source=np.random.choice(sources + ['unknown-service']),
            message=msg
        ))
    
    return logs


# Generate and analyze logs
all_logs = generate_logs(n_normal=200, n_anomalous=20)

# Train on normal logs (the first 150)
log_detector = LogAnomalyDetector()
log_detector.learn_patterns(all_logs[:150])

# Analyze the remaining logs
print("\n--- Log analysis ---")
results = log_detector.analyze_batch(all_logs[150:])

# Show the anomalies
print("\nDetected anomalies:")
for result in sorted(results, key=lambda x: -x['anomaly_score'])[:10]:
    if result['is_anomaly']:
        log = result['log']
        print(f"\n[{log.level}] {log.source}: {log.message[:60]}...")
        print(f"  Score: {result['anomaly_score']:.3f}")
        print(f"  Indicators: {', '.join(result['indicators'])}")
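
When normalizing log messages into patterns, the order of the substitutions matters: a generic number rule applied first consumes the digits inside IP addresses and UUIDs, so the more specific rules can no longer match. A standalone sketch of both orders; the regular expressions and function names here are illustrative assumptions and are stricter than the ones in `LogAnomalyDetector`:

```python
import re

UUID_RE = re.compile(r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')
IP_RE = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')
NUM_RE = re.compile(r'\d+')

def normalize(message: str) -> str:
    """Most specific patterns first, generic numbers last."""
    message = UUID_RE.sub('<UUID>', message)
    message = IP_RE.sub('<IP>', message)
    return NUM_RE.sub('<NUM>', message)

def normalize_numbers_first(message: str) -> str:
    """Same rules in the opposite order, shown for comparison."""
    message = NUM_RE.sub('<NUM>', message)
    message = IP_RE.sub('<IP>', message)
    return UUID_RE.sub('<UUID>', message)

msg = "User 1234 logged in from 10.0.0.1 (session 123e4567-e89b-12d3-a456-426614174000)"
good = normalize(msg)
bad = normalize_numbers_first(msg)
print(good)
print(bad)
```

With the specific rules first, messages that differ only in user ID, address, or session collapse to one pattern, which is what the pattern-novelty score relies on.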

## 5. Production monitoring system

In [None]:
from collections import deque
import threading
import time


@dataclass
class Alert:
    """Representation of an alert."""
    timestamp: datetime
    severity: str  # 'low', 'medium', 'high', 'critical'
    source: str
    message: str
    anomaly_score: float
    details: Dict = field(default_factory=dict)


class ProductionAnomalyMonitor:
    """
    Production system for real-time anomaly detection.
    Combines several detectors and provides alerting.
    """
    
    def __init__(
        self,
        time_series_model: TransformerAutoencoder,
        text_detector: TextAnomalyDetector,
        log_detector: LogAnomalyDetector,
        device: torch.device,
        alert_threshold: float = 0.5,
        window_size: int = 50
    ):
        self.ts_model = time_series_model
        self.text_detector = text_detector
        self.log_detector = log_detector
        self.device = device
        self.alert_threshold = alert_threshold
        self.window_size = window_size
        
        # Buffers for streaming data
        self.sensor_buffer = deque(maxlen=window_size * 2)
        
        # Alerts
        self.alerts = []
        self.alert_callbacks = []
        
        # Statistics
        self.stats = {
            'sensor_checks': 0,
            'text_checks': 0,
            'log_checks': 0,
            'alerts_raised': 0,
            'start_time': datetime.now()
        }
        
        # Scaler for the sensors (should be fitted on historical data)
        self.sensor_scaler = StandardScaler()
        self.sensor_scaler_fitted = False
        
    def register_alert_callback(self, callback):
        """Register a callback for new alerts."""
        self.alert_callbacks.append(callback)
        
    def _raise_alert(self, alert: Alert):
        """Record a new alert and notify the callbacks."""
        self.alerts.append(alert)
        self.stats['alerts_raised'] += 1
        
        # Invoke the callbacks
        for callback in self.alert_callbacks:
            try:
                callback(alert)
            except Exception as e:
                print(f"Error in alert callback: {e}")
    
    @torch.no_grad()
    def process_sensor_data(self, data_point: np.ndarray) -> Optional[Alert]:
        """
        Process a new sensor data point.
        Returns an alert if an anomaly is detected.
        """
        self.stats['sensor_checks'] += 1
        
        # Append to the buffer
        self.sensor_buffer.append(data_point)
        
        # At least window_size points are needed
        if len(self.sensor_buffer) < self.window_size:
            return None
        
        # Fit the scaler on first use
        if not self.sensor_scaler_fitted:
            self.sensor_scaler.fit(list(self.sensor_buffer))
            self.sensor_scaler_fitted = True
        
        # Prepare the data
        window = np.array(list(self.sensor_buffer)[-self.window_size:])
        window_scaled = self.sensor_scaler.transform(window)
        window_tensor = torch.FloatTensor(window_scaled).unsqueeze(0).to(self.device)
        
        # Prediction
        self.ts_model.eval()
        reconstruction, _ = self.ts_model(window_tensor)
        
        # Reconstruction error
        mse = ((reconstruction - window_tensor) ** 2).mean().item()
        
        # Threshold check
        if mse > self.alert_threshold:
            severity = 'critical' if mse > self.alert_threshold * 2 else 'high' if mse > self.alert_threshold * 1.5 else 'medium'
            
            alert = Alert(
                timestamp=datetime.now(),
                severity=severity,
                source='sensor_monitor',
                message=f"Anomaly detected in sensor data (MSE: {mse:.4f})",
                anomaly_score=mse,
                details={
                    'window_end_values': data_point.tolist(),
                    'reconstruction_error': mse
                }
            )
            self._raise_alert(alert)
            return alert
        
        return None
    
    def process_text(self, text: str, source: str = 'text_input') -> Optional[Alert]:
        """
        Analyze a text for anomalies.
        """
        self.stats['text_checks'] += 1
        
        results = self.text_detector.detect_anomalies([text])
        result = results[0]
        
        if result['is_anomaly']:
            severity = 'high' if result['anomaly_score'] > 0.8 else 'medium'
            
            alert = Alert(
                timestamp=datetime.now(),
                severity=severity,
                source=source,
                message=f"Anomalous text detected: {text[:50]}...",
                anomaly_score=result['anomaly_score'],
                details=result
            )
            self._raise_alert(alert)
            return alert
        
        return None
    
    def process_log(self, log: LogEntry) -> Optional[Alert]:
        """
        Analyzes a log entry for anomalies.
        """
        self.stats['log_checks'] += 1
        
        result = self.log_detector.analyze_log(log)
        
        if result['is_anomaly']:
            severity_map = {
                'CRITICAL': 'critical',
                'ERROR': 'high',
                'WARNING': 'medium'
            }
            severity = severity_map.get(log.level, 'low')
            
            alert = Alert(
                timestamp=datetime.now(),
                severity=severity,
                source='log_monitor',
                message=f"Anomalous log [{log.level}]: {log.message[:50]}...",
                anomaly_score=result['anomaly_score'],
                details={
                    'log_level': log.level,
                    'log_source': log.source,
                    'indicators': result['indicators']
                }
            )
            self._raise_alert(alert)
            return alert
        
        return None
    
    def get_statistics(self) -> Dict:
        """Returns monitoring statistics."""
        runtime = (datetime.now() - self.stats['start_time']).total_seconds()
        
        return {
            'runtime_seconds': runtime,
            'sensor_checks': self.stats['sensor_checks'],
            'text_checks': self.stats['text_checks'],
            'log_checks': self.stats['log_checks'],
            'total_alerts': self.stats['alerts_raised'],
            'alert_rate': self.stats['alerts_raised'] / max(1, runtime) * 60,  # per minute
            'alerts_by_severity': self._count_alerts_by_severity()
        }
    
    def _count_alerts_by_severity(self) -> Dict[str, int]:
        """Counts alerts by severity."""
        counts = {'low': 0, 'medium': 0, 'high': 0, 'critical': 0}
        for alert in self.alerts:
            counts[alert.severity] = counts.get(alert.severity, 0) + 1
        return counts
    
    def get_recent_alerts(self, n: int = 10) -> List[Alert]:
        """Returns the N most recent alerts."""
        return sorted(self.alerts, key=lambda x: x.timestamp, reverse=True)[:n]


# Create the production monitor
monitor = ProductionAnomalyMonitor(
    time_series_model=model,
    text_detector=text_detector,
    log_detector=log_detector,
    device=device,
    alert_threshold=0.3,
    window_size=50
)

# Register a callback for alerts
def alert_handler(alert: Alert):
    print(f"🚨 [{alert.severity.upper()}] {alert.message}")

monitor.register_alert_callback(alert_handler)

print("Production monitor configured.")

In [None]:
# Simulate real-time monitoring

print("=== Real-time monitoring simulation ===")
print("\n1. Streaming sensor data:")

# Simulate streaming data (first 100 points)
for i in range(100):
    # Normal data
    data_point = sensor_data[i]
    monitor.process_sensor_data(data_point)

# Inject an anomaly
print("\n(Injecting an anomaly into the sensor data...)")
anomalous_point = sensor_data[100].copy()
anomalous_point[0] += 5  # Temperature spike
monitor.process_sensor_data(anomalous_point)

print("\n2. Checking text inputs:")
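texts_to_check = [
    "Normální zpráva od zákazníka.",        # "A normal message from a customer."
    "KLIKNI SEM PRO VÝHRU!!! www.spam.cz",  # "CLICK HERE TO WIN!!! www.spam.cz"
    "Produkt je v pořádku, děkuji."         # "The product is fine, thank you."
]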

for text in texts_to_check:
    monitor.process_text(text, source='customer_feedback')

print("\n3. Checking logs:")
test_logs = [
    LogEntry(datetime.now(), 'INFO', 'api-server', 'Request processed in 45ms'),
    LogEntry(datetime.now(), 'CRITICAL', 'database', 'Connection pool exhausted'),
    LogEntry(datetime.now(), 'ERROR', 'unknown-service', 'Segmentation fault in worker')
]

for log in test_logs:
    monitor.process_log(log)

# Display statistics
print("\n" + "=" * 50)
print("Monitoring statistics:")
stats = monitor.get_statistics()
for key, value in stats.items():
    print(f"  {key}: {value}")

print("\nMost recent alerts:")
for alert in monitor.get_recent_alerts(5):
    print(f"  [{alert.severity}] {alert.timestamp.strftime('%H:%M:%S')}: {alert.message[:60]}...")

## Summary

In this notebook we built a comprehensive anomaly detection system:

1. **Transformer Autoencoder** - detects anomalies in time series via reconstruction error
2. **Text Anomaly Detector** - identifies anomalous texts using embeddings and Isolation Forest
3. **Log Analyzer** - combines pattern matching, frequency, and semantic analysis
4. **Production Monitor** - a real-time system that combines all of the detectors

### Key takeaways

- Autoencoders flag anomalies as samples with a high reconstruction error
- Embeddings make it possible to detect semantic anomalies in text
- Combining several methods (an ensemble) improves robustness
- The anomaly threshold has to be calibrated on your own data
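
The point about threshold calibration can be sketched as follows. This is a minimal illustration, assuming you have reconstruction errors measured on held-out data known to be normal. The helper name `calibrate_threshold`, the 99th-percentile choice, and the synthetic gamma-distributed errors are assumptions made for the example, not part of the notebook's `ProductionAnomalyMonitor`.

```python
import numpy as np


def calibrate_threshold(normal_errors: np.ndarray, percentile: float = 99.0) -> float:
    """Return the error value below which `percentile` % of normal windows fall.

    Hypothetical helper: `normal_errors` holds per-window reconstruction
    errors (e.g. MSE) measured on held-out data known to be normal.
    """
    if normal_errors.size == 0:
        raise ValueError("normal_errors must not be empty")
    return float(np.percentile(normal_errors, percentile))


# Synthetic stand-in for real reconstruction errors (assumption for the demo)
rng = np.random.default_rng(42)
normal_errors = rng.gamma(shape=2.0, scale=0.05, size=1000)

threshold = calibrate_threshold(normal_errors, percentile=99.0)

# By construction, roughly 1 % of normal windows exceed the threshold
false_alarm_rate = float((normal_errors > threshold).mean())
print(f"threshold={threshold:.4f}, false alarm rate on normal data={false_alarm_rate:.3f}")
```

A value obtained this way could be passed as `alert_threshold` when constructing the monitor instead of a hand-picked constant. The percentile then directly sets the expected false alarm rate on normal data.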

### Practical tips

- Always train the model on normal data only
- Monitor the false positive rate and keep tuning the thresholds
- Implement alerting with several severity levels
- Log every detection for later analysis and improvement
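
The last two tips can be combined in a small sketch: store each detection as a structured record, then estimate how many alerts were false alarms once a human has reviewed them. The `ReviewedDetection` record and both helper functions are hypothetical names introduced for this example. The metric is strictly the share of false alarms among reviewed alerts (a false discovery rate), used here as a practical proxy because non-alerting samples are not logged.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ReviewedDetection:
    """Hypothetical record of one alert plus an optional human verdict."""
    timestamp: str
    source: str
    anomaly_score: float
    confirmed: Optional[bool] = None  # None means "not reviewed yet"


def to_jsonl(detections: List[ReviewedDetection]) -> str:
    """Serialize detections as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(d)) for d in detections)


def false_alert_share(detections: List[ReviewedDetection]) -> Optional[float]:
    """Share of reviewed alerts that turned out to be false alarms."""
    reviewed = [d for d in detections if d.confirmed is not None]
    if not reviewed:
        return None  # nothing has been reviewed, so no estimate is possible
    false_alarms = sum(1 for d in reviewed if d.confirmed is False)
    return false_alarms / len(reviewed)


# Made-up example records (illustrative values only)
detections = [
    ReviewedDetection("2024-01-01T10:00:00", "sensor_monitor", 0.62, confirmed=True),
    ReviewedDetection("2024-01-01T10:05:00", "log_monitor", 0.41, confirmed=False),
    ReviewedDetection("2024-01-01T10:07:00", "customer_feedback", 0.85),
]

print(to_jsonl(detections))
print(false_alert_share(detections))
```

If this share keeps rising for one source, that is a hint that the threshold for that detector may be too low.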