# DISTILLED MONITORING SYSTEM

## Predictive monitoring with local caching and fallback support

### 🚀 QUICK WORKFLOW:
1. `setup()` - Initialize system and fallbacks
2. `generate_datasets()` - Generate training data (resumable)
3. `train()` - Train the distilled model
4. `test()` - Test model inference
5. `demo()` - Run monitoring demo

### 📊 MONITORING:
- `status()` - Check system status
- `show_progress()` - Check dataset generation progress

### 🔧 RECOVERY:
- `retry_failed()` - Retry failed generations
- `reset_progress()` - Start fresh
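
The resume, retry, and reset behaviour listed above can be pictured with a small progress file. This is a minimal sketch, not the notebook's actual implementation: `ProgressTracker`, its JSON layout, and the `worker` callback are hypothetical names introduced only for illustration.

```python
import json
import tempfile
from pathlib import Path


class ProgressTracker:
    """Persist completed and failed item IDs so a run can resume (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self.state = {"done": [], "failed": []}
        if self.path.exists():
            # Resuming: pick up whatever an earlier session saved.
            self.state = json.loads(self.path.read_text())

    def save(self):
        self.path.write_text(json.dumps(self.state))

    def run(self, items, worker):
        """Process every item that is not already done and record failures."""
        for item in items:
            if item in self.state["done"]:
                continue  # finished in an earlier session
            try:
                worker(item)
            except Exception:
                if item not in self.state["failed"]:
                    self.state["failed"].append(item)
            else:
                self.state["done"].append(item)
                if item in self.state["failed"]:
                    self.state["failed"].remove(item)
            self.save()  # this sketch saves after every item

    def retry_failed(self, worker):
        """Re-run only the items that failed previously."""
        self.run(list(self.state["failed"]), worker)

    def reset(self):
        """Discard all recorded progress."""
        self.state = {"done": [], "failed": []}
        self.save()


# Usage: item "b" fails on the first pass and succeeds on retry.
broken = {"b"}


def worker(item):
    if item in broken:
        raise ValueError(f"could not generate {item}")


with tempfile.TemporaryDirectory() as tmp:
    progress_file = Path(tmp) / "progress.json"
    tracker = ProgressTracker(progress_file)
    tracker.run(["a", "b", "c"], worker)
    after_first = (list(tracker.state["done"]), list(tracker.state["failed"]))
    broken.clear()
    tracker.retry_failed(worker)
    after_retry = (list(tracker.state["done"]), list(tracker.state["failed"]))
    # A fresh tracker on the same file sees the saved progress.
    resumed_done = list(ProgressTracker(progress_file).state["done"])

print(after_first, after_retry)
```

Keeping failed items in their own list is what lets a retry touch only the failures instead of regenerating everything.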

**Data Sources:** Splunk, Jira, Confluence, IBM Spectrum Conductor, VEMKD logs from Red Hat Linux

**Fallback Order:** Remote API → Ollama → Local Model → Static Responses
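
The fallback order above amounts to trying each tier in turn and moving on whenever one fails or returns nothing. The sketch below illustrates that idea only. `query_with_fallback` and the four stand-in providers are hypothetical, and the real system's interfaces may look quite different.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider takes a prompt and returns text; it may raise or return None on failure.
Provider = Callable[[str], Optional[str]]


def query_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, response)."""
    errors = []
    for name, provider in providers:
        try:
            response = provider(prompt)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
            continue
        if response:  # treat None or empty output as a miss
            return name, response
        errors.append(f"{name}: empty response")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the four tiers.
def remote_api(prompt: str) -> Optional[str]:
    raise ConnectionError("remote LLM not configured")


def ollama(prompt: str) -> Optional[str]:
    return None  # pretend Ollama returned nothing


def local_model(prompt: str) -> Optional[str]:
    return f"[local] answer to: {prompt}"


def static_response(prompt: str) -> Optional[str]:
    return "Please check system logs for details."


CHAIN = [
    ("remote_api", remote_api),
    ("ollama", ollama),
    ("local_model", local_model),
    ("static", static_response),
]

name, text = query_with_fallback("What does high CPU usage indicate?", CHAIN)
print(name, "->", text)
```

Placing the static responses last means the chain always has an answer, even with no network and no model loaded.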

In [1]:
# Import the system
from main_notebook import *
from config import CONFIG

CONFIG['model_name'] = "bert-base-uncased"  # Use the locally cached model instead of attempting a download.
print("🚀 Distilled Monitoring System")
print("📊 Ready for predictive monitoring with local caching")
print(f"📁 Cache directory: {CONFIG['hf_cache_dir']}")

INFO:config:📋 Batch discovered 19 Ollama models
INFO:config:📁 Efficiently discovered 4 local models
INFO:config:📋 Discovered 22 total models
INFO:config:🎯 Built rotation pool: 18 models
INFO:config:   ollama: 15 models
INFO:config:   local: 2 models
INFO:config:   static: 1 models
INFO:config:✅ Enhanced model chain initialized
INFO:config:   Total models: 22
INFO:config:   Rotation pool: 18
INFO:distilled_model_trainer:🗂️ Enhanced logging initialized - local log: logs\training_20250724_174024.log
INFO:common_utils:✅ Loaded language_dataset.json


🚀 Distilled Monitoring System
📊 Predictive monitoring with dynamic discovery

Type quick_start_guide() for usage instructions
Type status() to check system status
🚀 Distilled Monitoring System
📊 Ready for predictive monitoring with local caching
📁 Cache directory: ./hf_cache/


In [2]:
# 1. Setup system with fallback chain
print("🚀 Setting up Distilled Monitoring System...")
print("This includes: directories, fallback systems, and progress tracking")

setup_success = setup()

if setup_success:
    print("\n✅ System setup complete!")
    print("\nNext: generate_datasets() to create training data")
else:
    print("\n❌ Setup failed. Check error messages above.")
    print("You may need to install dependencies or setup Ollama.")

🚀 Setting up Distilled Monitoring System...
This includes: directories, fallback systems, and progress tracking
🚀 Setting up distilled monitoring system...

FALLBACK SYSTEM SETUP
Setting up comprehensive fallback system...

1. Checking Remote LLM...
   ❌ Remote LLM not configured

2. Checking Ollama...


INFO:config:📝 Static responses created: 20 responses


   ✅ Ollama available with 19 models
      • qwen2.5-coder:latest
      • falcon:latest
      • qwen2.5vl:latest
      • tinyllama:latest
      • phi4:latest
      • ... and 14 more

3. Checking Local Models...
      • microsoft_DialoGPT-medium
      • bert-base-uncased
   ✅ 2 local models available

4. Setting up Static Fallback...
   ✅ Static fallback responses created

✅ Fallback system ready with 3 methods: Ollama, Local Models, Static Fallback
Testing fallback system...

Running test queries...

Test 1: Explain what high CPU usage indicates in system mo...
✅ Success: 849 chars, 1 model(s)
   Sample: High CPU usage in system monitoring indicates that the central processing unit (...

Test 2: What causes java.lang.OutOfMemoryError in applicat...
✅ Success: 744 chars, 1 model(s)
   Sample: The `java.lang.OutOfMemoryError` is a common error that occurs when an applicati...

Test 3: How do you troubleshoot network connectivity issue...


INFO:dataset_generator:🔍 Discovering YAML files in data_config
INFO:dataset_generator:  ✅ conversation_prompts.yaml: 499 items (conversation_patterns)
INFO:dataset_generator:  ✅ english_patterns.yaml: 200 items (error_scenarios)
INFO:dataset_generator:  ✅ error_patterns.yaml: 33 items (error_scenarios)
INFO:dataset_generator:  ✅ metrics_patterns.yaml: 107 items (error_scenarios)
INFO:dataset_generator:  ✅ personality_types.yaml: 40 items (personality_types)
INFO:dataset_generator:  ✅ project_management_terms.yaml: 625 items (technical_terms)
INFO:dataset_generator:  ✅ question_styles.yaml: 54 items (question_styles)
INFO:dataset_generator:  ✅ response_templates.yaml: 48 items (general_knowledge)
INFO:dataset_generator:  ✅ technical_terms.yaml: 1271 items (technical_terms)


✅ Success: 629 chars, 1 model(s)
   Sample: To troubleshoot network connectivity issues on Linux, you can follow these steps...

📊 TEST RESULTS: 3/3 successful
✅ Fallback system ready


INFO:dataset_generator:📊 Total YAML configs loaded: 9
INFO:dataset_generator:📊 Resuming session session_20250718_113730
INFO:dataset_generator:🔧 Extracted technical terms: 1352 terms across 15 categories
INFO:dataset_generator:✅ Enhanced generator initialized
INFO:dataset_generator:   YAML configs: 9
INFO:dataset_generator:   Technical terms: 15
INFO:dataset_generator:   Conversation styles: 3



✅ Setup complete!

✅ System setup complete!

Next: generate_datasets() to create training data


In [3]:
# 2. Check system status
status()

INFO:common_utils:✅ Loaded language_dataset.json
INFO:dataset_generator:🎯 Dynamic targets calculated:
INFO:dataset_generator:   Base multiplier: 1.5
INFO:dataset_generator:   Models per question: 1
INFO:dataset_generator:   Total target: 2347
INFO:common_utils:✅ Loaded language_dataset.json
INFO:common_utils:✅ Loaded language_dataset.json
INFO:config:🎮 CUDA GPU: NVIDIA GeForce RTX 4090 (22GB)



SYSTEM STATUS
Setup: ✅
Datasets: ❌
Model: ✅
Fallbacks: ✅

ENHANCED DATASET GENERATION PROGRESS
Session: session_20250718_113730
Models per question: 1
Save frequency: every 50 samples

✅ technical_explanation:
    Progress: 2540/2028 (125.2%)
    Base: 2028 × 1 models = 2028 total
    Remaining: 0

🔄 conversation_example:
    Progress: 0/165 (0.0%)
    Base: 165 × 1 models = 165 total
    Remaining: 165

🔄 error_troubleshooting:
    Progress: 0/69 (0.0%)
    Base: 69 × 1 models = 69 total
    Remaining: 69

🔄 english_pattern:
    Progress: 0/19 (0.0%)
    Base: 19 × 1 models = 19 total
    Remaining: 19

🔄 personality_response:
    Progress: 0/36 (0.0%)
    Base: 36 × 1 models = 36 total
    Remaining: 36

🔄 question_style_variation:
    Progress: 0/30 (0.0%)
    Base: 30 × 1 models = 30 total
    Remaining: 30

📈 OVERALL: 2540/2347 (108.2%)
🎯 Remaining: 319 samples
💾 Saves completed: 0

Files:
  Language Dataset: ✅
    Samples: 3314
  Metrics Dataset: ❌
  Trained Model: ✅

Environmen
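
The status output above reports 108.2% overall yet 319 samples remaining. The figures are consistent with the remaining count being summed per category, with each category clamped at zero so that a surplus in one cannot cover a shortfall in another. That reading is an inference from the printed numbers, not something the source confirms. A short sketch that reproduces the arithmetic:

```python
# (generated, target) per category, copied from the status output.
progress = {
    "technical_explanation": (2540, 2028),
    "conversation_example": (0, 165),
    "error_troubleshooting": (0, 69),
    "english_pattern": (0, 19),
    "personality_response": (0, 36),
    "question_style_variation": (0, 30),
}

total_generated = sum(done for done, _ in progress.values())
total_target = sum(target for _, target in progress.values())

# Assumption: shortfalls are clamped at zero per category before summing.
remaining = sum(max(target - done, 0) for done, target in progress.values())
overall_pct = 100 * total_generated / total_target

print(f"OVERALL: {total_generated}/{total_target} ({overall_pct:.1f}%)")
print(f"Remaining: {remaining} samples")
```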

In [None]:
# 3. Check dataset generation progress (if any)
print("📊 Current dataset generation progress:")
show_progress()

print("\n💡 TIPS:")
print("• First run: Will show new session")
print("• Resuming: Will show existing progress")
print("• Use reset_progress() to start fresh")
print("• Use retry_failed() to retry failed items")

In [None]:
# Use with caution: reset_progress() deletes saved generation progress.
# reset_progress()

In [4]:
# 4. Generate the datasets. 
generate_datasets()

INFO:dataset_generator:🗣️ Starting complete dataset generation
INFO:dataset_generator:🎯 Dynamic targets calculated:
INFO:dataset_generator:   Base multiplier: 1.5
INFO:dataset_generator:   Models per question: 1
INFO:dataset_generator:   Total target: 2347
INFO:dataset_generator:🗣️ Starting rich language generation: 2347 target



📊 DATASET GENERATION


INFO:common_utils:✅ Loaded language_dataset.json
INFO:dataset_generator:✅ Target already met: 3314/2347
INFO:dataset_generator:📊 Starting metrics generation: 10000 target
INFO:dataset_generator:📊 Need to generate: 10000 metrics samples
INFO:common_utils:💾 Saved metrics_dataset.json
INFO:dataset_generator:✅ Metrics generation complete: 10000 total
INFO:common_utils:✅ Loaded language_dataset.json
INFO:common_utils:✅ Loaded metrics_dataset.json


✅ Dataset generation completed successfully!
   Language dataset: 0
   Metrics dataset: Updated
✅ Dataset generation completed!


(True, True)

In [None]:
# 5. Train the distilled model
print("🏋️ TRAINING DISTILLED MODEL")
print("="*40)
print(f"Environment: {detect_training_environment()}")
print(f"Model: {CONFIG['model_name']}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Epochs: {CONFIG['epochs']}")
print("")
print("⚠️  Training can take a long time, depending on hardware")
print("")

try:
    success = train()
    if success:
        print("\n✅ Training completed!")
        print("Next: test() to test the model")
    else:
        print("\n❌ Training failed - check if datasets exist")
except Exception as e:
    print(f"\n❌ Training error: {e}")
    print("Check logs for detailed error information")

In [None]:
# 6. Test model inference
print("🧪 TESTING MODEL INFERENCE")
print("="*40)
print("Testing with scenarios:")
print("• Normal operation")
print("• CPU spike")
print("• Memory pressure")
print("")

test_success = test()

if test_success:
    print("\n✅ Model testing successful!")
    print("Next: demo() to run monitoring demo")
else:
    print("\n❌ Testing failed - ensure model is trained")

In [None]:
# 7. Run monitoring demo
print("🎭 MONITORING DEMO")
print("="*30)
print("Features:")
print("• Real-time metric processing")
print("• Anomaly detection")
print("• Alert generation")
print("• Recommendation engine")
print("• Dashboard display")
print("")

# Customize demo duration
DEMO_MINUTES = 3

print(f"Running {DEMO_MINUTES}-minute demo...")
print("Will inject anomalies to demonstrate detection")
print("")

try:
    demo(minutes=DEMO_MINUTES)
    print("\n✅ Demo completed!")
    print("Check exported metrics history for results")
except KeyboardInterrupt:
    print("\n⏹️  Demo stopped by user")
except Exception as e:
    print(f"\n❌ Demo error: {e}")

In [None]:
# 8. Final system status
print("📋 FINAL SYSTEM STATUS")
print("="*40)

status()

print("\n🎉 SYSTEM COMPLETE!")
print("="*30)
print("Your distilled monitoring system includes:")
print("  ✅ Multi-model fallback system")
print("  ✅ Local caching for portability")
print("  ✅ Progress tracking and resume")
print("  ✅ Trained monitoring model")
print("  ✅ Real-time anomaly detection")
print("  ✅ Actionable recommendations")
print("")
print("🔧 NEXT STEPS:")
print("  • Integrate with your actual data sources")
print("  • Customize thresholds and alerts")
print("  • Set up continuous monitoring")
print("  • Implement feedback loops for learning")

## 🛠️ Troubleshooting & Recovery

### Common Commands:
- `status()` - Complete system status
- `show_progress()` - Dataset generation progress
- `retry_failed()` - Retry failed generation items
- `reset_progress()` - Start dataset generation fresh

### Common Issues:
- **Generation interrupted:** Just run the generation cell again
- **Failed generations:** Use `retry_failed()`
- **Want to start over:** Use `reset_progress()`
- **Memory issues:** Reduce `CONFIG['batch_size']`
- **No models available:** Check Ollama is running
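
For the "No models available" case, one quick check is to ask the local Ollama server which models it has installed. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and uses its `/api/tags` endpoint. The helper names `parse_model_names` and `list_ollama_models` are hypothetical and are not part of this notebook's modules.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumed default local endpoint


def parse_model_names(payload) -> list:
    """Extract model names from an /api/tags style response body."""
    if not isinstance(payload, dict):
        return []
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(url: str = OLLAMA_TAGS_URL, timeout: float = 2.0) -> list:
    """Return installed model names, or an empty list if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):  # connection refused, timeout, or invalid JSON
        return []


models = list_ollama_models()
if models:
    print(f"Ollama reachable with {len(models)} models")
else:
    print("Ollama not reachable or no models installed")
```

If this reports no models while Ollama is running, the server may be bound to a different host or port than the one assumed here.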

### Configuration:
Modify `CONFIG` in `config.py` or use:
```python
CONFIG['language_samples'] = 2000  # Increase dataset size
CONFIG['batch_size'] = 8           # Reduce for less memory
CONFIG['epochs'] = 5               # More training epochs
```

In [None]:
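import os
# Set these before the imports below so they are in place when the libraries load.
os.environ['TRITON_DISABLE_LINE_INFO'] = '1'  # skip line info in Triton-compiled kernels
os.environ['TORCH_COMPILE_DISABLE'] = '1'     # turn off torch.compile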
from transformers import AutoTokenizer
from config import CONFIG
from distilled_model_trainer import DistilledModelTrainer

trainer = DistilledModelTrainer(CONFIG, resume_training=True)
trainer.train()  # Updates latest model or creates new if none found

In [None]:
# Test if the moved model will work for training
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModel, AutoConfig

def test_training_compatibility():
    """Test if the moved model is ready for training."""
    model_path = "./pretrained/bert-base-uncased/"
    
    print("🧪 TESTING TRAINING COMPATIBILITY")
    print("=" * 40)
    
    try:
        # Load exactly as the training code will
        print("Loading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            local_files_only=True
        )
        
        print("Loading config...")
        config = AutoConfig.from_pretrained(
            model_path,
            local_files_only=True
        )
        
        print("Loading model...")
        model = AutoModel.from_pretrained(
            model_path,
            config=config,
            local_files_only=True,
            torch_dtype=torch.float32
        )
        
        print("✅ All components loaded successfully!")
        
        # Test training-specific functionality
        print("\nTesting training features...")
        
        # Check if model can be put in training mode
        model.train()
        print("✅ Model can enter training mode")
        
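        # Test gradient computation. eval() disables dropout so the dummy
        # loss is deterministic; gradients are still computed in this mode.
        model.eval()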
        test_input = tokenizer("test", return_tensors="pt", max_length=128, truncation=True)
        
        # Enable gradients
        for param in model.parameters():
            param.requires_grad = True
        
        output = model(**test_input)
        loss = output.last_hidden_state.mean()  # Dummy loss
        loss.backward()
        
        print("✅ Gradient computation works")
        
        # Check model size
        param_count = sum(p.numel() for p in model.parameters())
        print(f"✅ Model parameters: {param_count:,} ({param_count/1e6:.1f}M)")
        
        # Check config details
        print(f"✅ Hidden size: {config.hidden_size}")
        print(f"✅ Vocab size: {config.vocab_size}")
        
        print("\n🎉 MODEL IS READY FOR TRAINING!")
        return True
        
    except Exception as e:
        print(f"❌ Training compatibility test failed: {e}")
        return False

# Run the test
success = test_training_compatibility()

if success:
    print("\n✅ Your moved model in ./pretrained/bert-base-uncased/ is ready!")
    print("distilled_model_trainer.py should now work without internet access.")
else:
    print("\n❌ There may still be issues with the moved model.")

In [None]:
# Recovery and troubleshooting commands
print("🔧 RECOVERY COMMANDS")
print("="*30)

print("\n📊 Progress Management:")
print("show_progress()    # Check current progress")
print("retry_failed()     # Retry failed items")
print("reset_progress()   # Start completely fresh")

print("\n🔍 Diagnostics:")
print("status()           # Complete system status")

print("\n⚙️  Current Configuration:")
print(f"Language samples: {CONFIG['language_samples']}")
print(f"Metrics samples: {CONFIG['metrics_samples']}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Model: {CONFIG['model_name']}")
print(f"Cache dir: {CONFIG['hf_cache_dir']}")

print("\n💡 To modify configuration:")
print("CONFIG['language_samples'] = 2000")
print("CONFIG['batch_size'] = 8")

In [None]:
# Optional: Run individual recovery commands
# Uncomment as needed:

# show_progress()      # Check progress
# retry_failed()       # Retry failed items
# reset_progress()     # Start fresh (WARNING: deletes progress)

print("Uncomment commands above as needed for recovery")