# 🚀 Smart Alert AI Service - Training Notebook

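This notebook builds every model artifact for the Smart Alert service from scratch: it trains the alert scorer, downloads the pre-trained duplicate detector, and fits the notification timing parameters.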

## Models:
1. **Alert Scoring Model** - Random Forest for priority scoring (0-100)
2. **Semantic Duplicate Detector** - Sentence Transformers for duplicate detection
3. **Notification Timing Model** - Thompson Sampling for optimal notification times

## Steps:
1. Install dependencies
2. Setup project directories
3. Cleanup old data
4. Train Alert Scorer
5. Initialize Duplicate Detector
6. Train Notification Timing
7. Test all models
8. Download trained models
9. Print training summary


---
## 1. Install Dependencies


In [11]:
# Install required packages
%pip install -q numpy pandas scikit-learn scipy joblib
%pip install -q sentence-transformers
%pip install -q fastapi uvicorn pydantic

print("✅ Dependencies installed!")


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


✅ Dependencies installed!





---
## 2. Setup Project Directories


In [12]:
import os
from pathlib import Path

# Create project directories
BASE_DIR = Path('/content/ai_service')
DATA_DIR = BASE_DIR / 'data'
MODELS_DIR = DATA_DIR / 'models'
TRAINING_DIR = DATA_DIR / 'training'
CACHE_DIR = DATA_DIR / 'cache'

for directory in [MODELS_DIR, TRAINING_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

print("✅ Directories created:")
print(f"  - Models: {MODELS_DIR}")
print(f"  - Training: {TRAINING_DIR}")
print(f"  - Cache: {CACHE_DIR}")


✅ Directories created:
  - Models: \content\ai_service\data\models
  - Training: \content\ai_service\data\training
  - Cache: \content\ai_service\data\cache


---
## 3. Cleanup Old Data


In [13]:
import shutil

def cleanup_all():
    """Remove all old models, databases, and caches"""
    deleted = 0
    
    # Cleanup models
    if MODELS_DIR.exists():
        for f in MODELS_DIR.glob('*'):
            if f.is_file():
                f.unlink()
                print(f"  Deleted: {f.name}")
                deleted += 1
    
    # Cleanup database
    if TRAINING_DIR.exists():
        for f in TRAINING_DIR.glob('*.db*'):
            f.unlink()
            print(f"  Deleted: {f.name}")
            deleted += 1
    
    # Cleanup cache
    if CACHE_DIR.exists():
        for item in CACHE_DIR.iterdir():
            if item.is_dir():
                shutil.rmtree(item)
            else:
                item.unlink()
            print(f"  Deleted: {item.name}")
            deleted += 1
    
    if deleted == 0:
        print("  (No old files found)")
    print(f"\n✅ Cleanup complete! ({deleted} items deleted)")

cleanup_all()


  Deleted: alert_scorer.pkl
  Deleted: notification_timing.json
  Deleted: .locks
  Deleted: models--sentence-transformers--paraphrase-multilingual-MiniLM-L12-v2

✅ Cleanup complete! (4 items deleted)


---
## 4. Train Alert Scoring Model (Random Forest)

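Uses synthetic data: feature vectors are sampled at random and labeled with a rule-based scoring formula, then a Random Forest regressor is fit to reproduce those scores. Only the five alert properties enter the formula; the other ten features are sampled but do not affect the label.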

**Features (15 total):**
- Alert properties: severity, type, age, distance, audience match
- Contextual: user interactions, time of day, day of week, weather
- Characteristics: content length, images, safety guide
- Social signals: similar alerts, engagement rate, source reliability
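
The training labels come from a weighted rule-based formula over the five alert properties. A minimal sketch of that formula for a single alert follows, using the same weights and constants as the training cell. The function name `rule_based_score` and its parameter names are hypothetical, chosen only for illustration, and the sketch uses the standard library instead of NumPy.

```python
import math

def rule_based_score(severity, alert_type, hours, distance_km, audience_match):
    """Score one alert on a 0-100 scale (illustrative helper, not part of the service)."""
    severity_score = 25 * severity                    # severity 1..4 -> 25..100
    type_score = 30 + 17.5 * alert_type               # type 1..4 -> 47.5..100
    time_decay_score = 100 * math.exp(-0.05 * hours)  # older alerts score lower
    if distance_km >= 50:
        distance_score = 0.0                          # 50 km or more contributes nothing
    else:
        distance_score = 100 * (1 - distance_km / 50) ** 2
    audience_score = 100 if audience_match else 50

    final = (0.35 * severity_score + 0.20 * type_score
             + 0.15 * time_decay_score + 0.20 * distance_score
             + 0.10 * audience_score)
    return min(max(final, 0.0), 100.0)                # clamp to [0, 100]

# Severity 3, type 2, 2 hours old, 10 km away, audience matches
example_score = rule_based_score(3, 2, 2, 10, 1)
print(f"{example_score:.2f}")
```

These are the first five values of the test vector used in section 7. The formula gives about 75.62 for them, close to the 74.71 that the trained forest predicts for the same alert.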


In [14]:
import numpy as np
import joblib
import math
import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Configuration
N_FEATURES = 15
N_SAMPLES = 1000
RF_N_ESTIMATORS = 100
RF_MAX_DEPTH = 10
RF_RANDOM_STATE = 42

def generate_synthetic_features(n_samples: int) -> np.ndarray:
    """Generate synthetic feature vectors"""
    np.random.seed(42)
    features = np.zeros((n_samples, N_FEATURES))
    
    for i in range(n_samples):
        features[i, 0] = np.random.choice([1, 2, 3, 4])  # severity_score
        features[i, 1] = np.random.choice([1, 2, 3, 4])  # alert_type_score
        features[i, 2] = np.random.exponential(12)       # hours_since_created
        features[i, 3] = np.random.exponential(20)       # distance_km
        features[i, 4] = np.random.choice([0, 1])        # target_audience_match
        features[i, 5] = np.random.poisson(5)            # user_previous_interactions
        features[i, 6] = np.random.randint(0, 24)        # time_of_day
        features[i, 7] = np.random.randint(0, 7)         # day_of_week
        features[i, 8] = np.random.choice([0, 1, 2, 3, 4])  # weather_severity
        features[i, 9] = np.random.randint(50, 500)      # content_length
        features[i, 10] = np.random.choice([0, 1])       # has_images
        features[i, 11] = np.random.choice([0, 1])       # has_safety_guide
        features[i, 12] = np.random.poisson(3)           # similar_alerts_count
        features[i, 13] = np.random.beta(2, 2)           # alert_engagement_rate
        features[i, 14] = np.random.uniform(0.5, 1.0)    # source_reliability
    
    return features

def apply_rule_based_scoring(X: np.ndarray) -> np.ndarray:
    """Apply rule-based scoring formula"""
    scores = np.zeros(len(X))
    
    for i, features in enumerate(X):
        severity_score = 25 * features[0]
        type_score = 30 + 17.5 * features[1]
        hours = features[2]
        time_decay_score = 100 * math.exp(-0.05 * hours)
        distance = features[3]
        if distance >= 50:
            distance_score = 0
        else:
            ratio = 1 - (distance / 50)
            distance_score = 100 * ratio * ratio
        audience_score = 100 if features[4] else 50
        
        final_score = (
            0.35 * severity_score +
            0.20 * type_score +
            0.15 * time_decay_score +
            0.20 * distance_score +
            0.10 * audience_score
        )
        scores[i] = np.clip(final_score, 0, 100)
    
    return scores

print("🔄 Training Alert Scoring Model...")
start_time = time.time()

# Generate training data
print(f"  Generating {N_SAMPLES} synthetic samples...")
X = generate_synthetic_features(N_SAMPLES)
y = apply_rule_based_scoring(X)

# Train model
print("  Training Random Forest...")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = RandomForestRegressor(
    n_estimators=RF_N_ESTIMATORS,
    max_depth=RF_MAX_DEPTH,
    random_state=RF_RANDOM_STATE,
    n_jobs=-1
)
model.fit(X_scaled, y)

# Cross-validation
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='neg_mean_absolute_error')
cv_mae = -cv_scores.mean()

elapsed = time.time() - start_time

# Feature importance
feature_names = [
    'severity_score', 'alert_type_score', 'hours_since_created',
    'distance_km', 'target_audience_match', 'user_previous_interactions',
    'time_of_day', 'day_of_week', 'weather_severity',
    'content_length', 'has_images', 'has_safety_guide',
    'similar_alerts_count', 'alert_engagement_rate', 'source_reliability'
]
importance = dict(zip(feature_names, model.feature_importances_))
top_features = sorted(importance.items(), key=lambda x: x[1], reverse=True)[:5]

print("\n📊 Training Results:")
print(f"  - Training time: {elapsed:.2f} seconds")
print(f"  - Cross-validation MAE: {cv_mae:.2f}")
print("\n  Top 5 Feature Importance:")
for name, imp in top_features:
    print(f"    - {name}: {imp:.4f}")

# Save model
model_data = {'model': model, 'scaler': scaler, 'is_trained': True}
model_path = MODELS_DIR / 'alert_scorer.pkl'
joblib.dump(model_data, model_path)

print(f"\n✅ Alert Scorer saved to: {model_path}")


🔄 Training Alert Scoring Model...
  Generating 1000 synthetic samples...
  Training Random Forest...

📊 Training Results:
  - Training time: 0.86 seconds
  - Cross-validation MAE: 2.38

  Top 5 Feature Importance:
    - severity_score: 0.5407
    - distance_km: 0.2603
    - hours_since_created: 0.0839
    - alert_type_score: 0.0697
    - target_audience_match: 0.0126

✅ Alert Scorer saved to: \content\ai_service\data\models\alert_scorer.pkl


---
## 5. Initialize Semantic Duplicate Detector

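Uses a pre-trained multilingual Sentence Transformer model (`paraphrase-multilingual-MiniLM-L12-v2`, a MiniLM encoder) to embed alert text, then compares embeddings by cosine similarity.
No training is needed: the cell downloads the model and runs a quick similarity check.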
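
Once alerts are embedded, duplicate detection reduces to comparing cosine similarity against a threshold. The sketch below shows that comparison on plain Python lists so it runs without downloading the model. `cosine_sim` and `is_duplicate` are hypothetical helpers, the toy vectors stand in for real embeddings, and the 0.85 default is the threshold reported by the test cell in section 7.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def is_duplicate(emb_a, emb_b, threshold=0.85):
    """Flag two embeddings as duplicates when similarity reaches the threshold."""
    return cosine_sim(emb_a, emb_b) >= threshold

# Toy 3-dimensional "embeddings" (assumption: real ones are much longer vectors)
near = is_duplicate([1.0, 0.9, 0.1], [0.9, 1.0, 0.2])
far = is_duplicate([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(near, far)
```

The threshold is the main tuning knob: a lower value catches more paraphrases but also risks merging distinct alerts.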


In [15]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"

print("🔄 Loading Sentence Transformer model...")
print("  (This may take a few minutes on first run)")

start_time = time.time()
duplicate_model = SentenceTransformer(MODEL_NAME, cache_folder=str(CACHE_DIR))
elapsed = time.time() - start_time

print(f"\n✅ Model loaded in {elapsed:.2f} seconds")
print(f"  Model: {MODEL_NAME}")

# Test similarity
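print("\n🧪 Testing semantic similarity...")

# Vietnamese test alerts (English glosses in the trailing comments)
test_texts = [
    "Mưa lớn gây ngập lụt tại quận 1, TP.HCM",  # Heavy rain causes flooding in District 1, HCMC
    "Ngập lụt do mưa lớn ở khu vực quận 1 thành phố Hồ Chí Minh",  # Flooding due to heavy rain in the District 1 area of Ho Chi Minh City
    "Động đất mạnh 5.5 độ richter tại Nhật Bản"  # Strong magnitude-5.5 earthquake in Japan
]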

embeddings = duplicate_model.encode(test_texts)

sim_12 = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
sim_13 = cosine_similarity([embeddings[0]], [embeddings[2]])[0][0]

print(f"\n  Text 1: {test_texts[0]}")
print(f"  Text 2: {test_texts[1]}")
print(f"  Text 3: {test_texts[2]}")
print(f"\n  Similarity (1-2, similar): {sim_12:.4f}")
print(f"  Similarity (1-3, different): {sim_13:.4f}")

print("\n✅ Duplicate Detector ready!")


🔄 Loading Sentence Transformer model...
  (This may take a few minutes on first run)


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



✅ Model loaded in 11.70 seconds
  Model: paraphrase-multilingual-MiniLM-L12-v2

🧪 Testing semantic similarity...

  Text 1: Mưa lớn gây ngập lụt tại quận 1, TP.HCM
  Text 2: Ngập lụt do mưa lớn ở khu vực quận 1 thành phố Hồ Chí Minh
  Text 3: Động đất mạnh 5.5 độ richter tại Nhật Bản

  Similarity (1-2, similar): 0.7800
  Similarity (1-3, different): 0.2965

✅ Duplicate Detector ready!


---
## 6. Train Notification Timing Model (Thompson Sampling)

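Treats each of the 24 hourly slots as an arm of a multi-armed bandit and keeps a Beta distribution over its engagement rate, which is the state Thompson Sampling needs to pick notification times.
This cell seeds those distributions with simulated engagement counts for typical parts of the day; it does not run the online sampling loop itself.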
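
For reference, the online loop that Thompson Sampling runs over such Beta parameters can be sketched as follows. This is an illustrative simulation, not the service's implementation: the helper names, the three-slot setup, and the `true_rates` values are all assumptions, and it uses the standard library `random` module.

```python
import random

def thompson_select(alpha, beta, rng):
    """Draw one sample per slot from Beta(alpha, beta) and pick the largest draw."""
    draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draws)), key=draws.__getitem__)

def update(alpha, beta, slot, engaged):
    """Posterior update: a success increments alpha, a failure increments beta."""
    if engaged:
        alpha[slot] += 1
    else:
        beta[slot] += 1

rng = random.Random(0)                # fixed seed so the run is repeatable
true_rates = [0.1, 0.3, 0.8]          # hypothetical engagement rates for 3 slots
alpha = [1.0] * len(true_rates)       # Beta(1, 1) is the uniform prior
beta = [1.0] * len(true_rates)
pulls = [0] * len(true_rates)

for _ in range(2000):
    slot = thompson_select(alpha, beta, rng)
    engaged = rng.random() < true_rates[slot]   # simulated user response
    update(alpha, beta, slot, engaged)
    pulls[slot] += 1

most_used = pulls.index(max(pulls))
print(pulls, most_used)
```

Because each slot is chosen in proportion to how likely it is to be the best, the loop keeps exploring uncertain slots early and concentrates on the highest-rate slot as evidence accumulates.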


In [16]:
import json

N_TIME_SLOTS = 24
EPSILON = 0.1

print("🔄 Training Notification Timing Model...")

# Initialize Beta distribution parameters (uniform prior)
alpha = np.ones(N_TIME_SLOTS, dtype=float)
beta_param = np.ones(N_TIME_SLOTS, dtype=float)

# Simulate realistic day patterns
print("  Simulating realistic engagement patterns...")

patterns = {
    'morning': (6, 9, 0.6),      # Moderate engagement
    'work': (9, 17, 0.3),        # Low engagement
    'evening': (17, 22, 0.8),    # High engagement
    'night_early': (22, 24, 0.1),
    'night_late': (0, 6, 0.1)
}

for name, (start, end, rate) in patterns.items():
    for hour in range(start, end):
        if hour < N_TIME_SLOTS:
            successes = round(20 * rate)  # round() avoids float truncation (e.g. 5.999... -> 5)
            failures = 20 - successes
            alpha[hour] += successes
            beta_param[hour] += failures

# Calculate expected success rates
expected_rewards = alpha / (alpha + beta_param)
top_indices = np.argsort(expected_rewards)[-5:][::-1]

print("\n📊 Best notification times:")
for idx in top_indices:
    rate = expected_rewards[idx]
    samples = int(alpha[idx] + beta_param[idx] - 2)
    print(f"  - {idx:02d}:00 - Success rate: {rate:.2f} ({samples} samples)")

# Save parameters
params_data = {
    'alpha': alpha.tolist(),
    'beta': beta_param.tolist(),
    'n_slots': N_TIME_SLOTS,
    'epsilon': EPSILON
}

params_path = MODELS_DIR / 'notification_timing.json'
with open(params_path, 'w') as f:
    json.dump(params_data, f)

print(f"\n✅ Notification Timing saved to: {params_path}")


🔄 Training Notification Timing Model...
  Simulating realistic engagement patterns...

📊 Best notification times:
  - 20:00 - Success rate: 0.77 (20 samples)
  - 21:00 - Success rate: 0.77 (20 samples)
  - 18:00 - Success rate: 0.77 (20 samples)
  - 19:00 - Success rate: 0.77 (20 samples)
  - 17:00 - Success rate: 0.77 (20 samples)

✅ Notification Timing saved to: \content\ai_service\data\models\notification_timing.json


---
## 7. Test All Models


In [17]:
print("🧪 Testing All Models")
print("=" * 60)

# Test Alert Scorer
print("\n1️⃣ Alert Scoring Model")
scorer_data = joblib.load(MODELS_DIR / 'alert_scorer.pkl')
scorer_model = scorer_data['model']
scorer_scaler = scorer_data['scaler']

test_features = np.array([[
    3, 2, 2, 10, 1, 3, 14, 2, 2, 200, 1, 1, 2, 0.7, 0.9
]])

test_scaled = scorer_scaler.transform(test_features)
score = scorer_model.predict(test_scaled)[0]

tree_preds = np.array([tree.predict(test_scaled)[0] for tree in scorer_model.estimators_])
confidence = 1.0 - (np.std(tree_preds) / 100.0)

print(f"  Test alert (high severity, weather type):")
print(f"  - Priority Score: {score:.2f}")
print(f"  - Confidence: {confidence:.2f}")

# Test Duplicate Detector
print("\n2️⃣ Semantic Duplicate Detector")
alert1 = "Cảnh báo mưa lớn tại Quận 7, nguy cơ ngập cao"  # Heavy rain warning in District 7, high flood risk
alert2 = "Mưa to ở Q7, có thể gây ngập nặng"  # Heavy rain in D7, may cause severe flooding
alert3 = "Động đất 4.5 độ richter tại Điện Biên"  # Magnitude-4.5 earthquake in Dien Bien

emb = duplicate_model.encode([alert1, alert2, alert3])
sim_same = cosine_similarity([emb[0]], [emb[1]])[0][0]
sim_diff = cosine_similarity([emb[0]], [emb[2]])[0][0]

print(f"  Similar alerts similarity: {sim_same:.4f} (threshold: 0.85)")
print(f"  Different alerts similarity: {sim_diff:.4f}")
print(f"  Duplicate detected: {'Yes' if sim_same >= 0.85 else 'No'}")

# Test Notification Timing
print("\n3️⃣ Notification Timing Model")
with open(MODELS_DIR / 'notification_timing.json', 'r') as f:
    timing_data = json.load(f)

alpha_loaded = np.array(timing_data['alpha'])
beta_loaded = np.array(timing_data['beta'])

# Thompson Sampling selection
samples = np.random.beta(alpha_loaded, beta_loaded)  # one posterior draw per time slot
best_slot = int(np.argmax(samples))

print(f"  Recommended notification time: {best_slot:02d}:00")
print(f"  Expected success rate: {(alpha_loaded[best_slot] / (alpha_loaded[best_slot] + beta_loaded[best_slot])):.2f}")

print("\n" + "=" * 60)
print("✅ All models tested successfully!")


🧪 Testing All Models

1️⃣ Alert Scoring Model
  Test alert (high severity, weather type):
  - Priority Score: 74.71
  - Confidence: 0.96

2️⃣ Semantic Duplicate Detector
  Similar alerts similarity: 0.6017 (threshold: 0.85)
  Different alerts similarity: 0.3344
  Duplicate detected: No

3️⃣ Notification Timing Model
  Recommended notification time: 18:00
  Expected success rate: 0.77

✅ All models tested successfully!


---
## 8. Download Trained Models


In [18]:
# Create zip of trained models
OUTPUT_ZIP = '/content/trained_models.zip'

print("📦 Creating downloadable archive...")

# List files
print("\nFiles included:")
for f in MODELS_DIR.glob('*'):
    size_kb = f.stat().st_size / 1024
    print(f"  - {f.name} ({size_kb:.1f} KB)")

# Create zip (make_archive appends '.zip' to the base name itself)
archive_base = OUTPUT_ZIP[:-len('.zip')]
shutil.make_archive(archive_base, 'zip', MODELS_DIR)

print("\n✅ Archive created: trained_models.zip")

# Download (for Google Colab)
try:
    from google.colab import files
    files.download(OUTPUT_ZIP)
    print("\n📥 Download started!")
except ImportError:
    print(f"\n⚠️ Not running in Colab. Find the file at: {OUTPUT_ZIP}")


📦 Creating downloadable archive...

Files included:
  - alert_scorer.pkl (6527.9 KB)
  - notification_timing.json (0.3 KB)

✅ Archive created: trained_models.zip

⚠️ Not running in Colab. Find the file at: /content/trained_models.zip


---
## 9. Training Summary


In [19]:
from datetime import datetime

print("\n" + "=" * 60)
print("  TRAINING SUMMARY")
print("=" * 60)

print(f"\n📅 Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n📊 Models trained:")
print("  ✅ Alert Scoring Model (Random Forest)")
print("     - 1000 synthetic samples")
print("     - 15 features")
print("  ✅ Semantic Duplicate Detector (Sentence Transformers)")
print("     - Pre-trained multilingual model")
print("     - Threshold: 0.85")
print("  ✅ Notification Timing Model (Thompson Sampling)")
print("     - 24 time slots")
print("     - Simulated engagement patterns")

print("\n📁 Output files:")
for f in MODELS_DIR.glob('*'):
    size_kb = f.stat().st_size / 1024
    print(f"  - {f.name} ({size_kb:.1f} KB)")

print("\n" + "=" * 60)
print("  🎉 All models ready for deployment!")
print("=" * 60)

print("\n📝 Next steps:")
print("  1. Download trained_models.zip")
print("  2. Extract to ai_service/data/models/")
print("  3. Run: python main.py")
print("  4. API available at: http://localhost:8000/docs")



  TRAINING SUMMARY

📅 Completed at: 2026-01-04 19:57:47

📊 Models trained:
  ✅ Alert Scoring Model (Random Forest)
     - 1000 synthetic samples
     - 15 features
  ✅ Semantic Duplicate Detector (Sentence Transformers)
     - Pre-trained multilingual model
     - Threshold: 0.85
  ✅ Notification Timing Model (Thompson Sampling)
     - 24 time slots
     - Simulated engagement patterns

📁 Output files:
  - alert_scorer.pkl (6527.9 KB)
  - notification_timing.json (0.3 KB)

  🎉 All models ready for deployment!

📝 Next steps:
  1. Download trained_models.zip
  2. Extract to ai_service/data/models/
  3. Run: python main.py
  4. API available at: http://localhost:8000/docs
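
Step 2 of the next steps can also be scripted. The sketch below unpacks an archive into a models folder with the standard library `zipfile` module. `extract_models` is a hypothetical helper, and the demo builds a stand-in archive in a temporary directory with placeholder content so that it runs anywhere; in practice the arguments would be the downloaded `trained_models.zip` and `ai_service/data/models/`.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_models(zip_path, target_dir):
    """Unpack the archive into target_dir and return the extracted file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())

# Self-contained demo: build a stand-in archive in a temp folder, then extract it
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "trained_models.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("notification_timing.json", "{}")  # placeholder content
    models_dir = Path(tmp) / "ai_service" / "data" / "models"
    extracted = extract_models(demo_zip, models_dir)
    target_exists = (models_dir / "notification_timing.json").exists()

print(extracted, target_exists)
```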
