# Cyber Threat Intelligence – RAG Indexing Notebook (Kaggle)

**Purpose:** Data prep, chunking, embedding, and FAISS index building for the Voice-Enabled Cyber Threat Intelligence Assistant.

**Use this notebook on Kaggle for:**
- Downloading CVE / NVD or MITRE ATT&CK data
- Chunking and embedding documents
- Building and saving a FAISS index (+ metadata)
- Quick retrieval experiments

**Output:** Save the index and chunk metadata so your local Streamlit app can load them.

## 1. Setup & Dependencies

Run once. On Kaggle, enable **GPU** (Settings → Accelerator → GPU) if you use a larger embedding model or Whisper later.

In [81]:
# Install packages (run this cell first)
# faiss-gpu often has no pip wheel on many environments; faiss-cpu works everywhere
!pip install -q faiss-cpu sentence-transformers anthropic  # anthropic = Claude API client

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
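
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed at the top of the first code cell so it runs before any tokenizer is used:

```python
import os

# Set this before sentence_transformers / tokenizers is used, so later
# `!pip install` cells (which fork the process) do not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```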


In [82]:
import os
import json
from pathlib import Path

import pandas as pd
import numpy as np
import requests

# Chunking & embeddings
from sentence_transformers import SentenceTransformer
import faiss

print("Setup OK")

Setup OK


## 2. Configuration

To keep embedding fast, **max_docs** caps how many documents are indexed (default 5000). Use a small **embedding_model** (e.g. `paraphrase-MiniLM-L3-v2`, a 3-layer MiniLM) and an **embedding_batch_size** of 128. Set `"max_docs": None` only when you need the full dataset.
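
As a rough illustration of how these settings interact, the sketch below validates the chunking values and counts how many embedding batches a run needs. `check_config` is a hypothetical helper, not part of the notebook, and the chunk count passed in is an assumed example value.

```python
import math

def check_config(config: dict, n_chunks: int) -> int:
    """Validate chunking settings and return the number of embedding batches."""
    if config["chunk_overlap"] >= config["chunk_size"]:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if config["embedding_batch_size"] <= 0:
        raise ValueError("embedding_batch_size must be positive")
    return math.ceil(n_chunks / config["embedding_batch_size"])

example = {"chunk_size": 384, "chunk_overlap": 64, "embedding_batch_size": 128}
print(check_config(example, n_chunks=5000))  # 40 batches: ceil(5000 / 128)
```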

In [83]:
CONFIG = {
    "chunk_size": 384,
    "chunk_overlap": 64,
    # chunk_text() only has a dedicated branch for "sentence"; any other value,
    # including "doc" and "char", falls back to character-based chunking.
    "chunk_mode": "doc",
    "max_docs": 5000,           # cap so embedding stays fast; set None to index everything
    "sample_mode": "first",     # "first" or "random" (uses sample_seed)
    "sample_seed": 42,
    "embedding_model": "sentence-transformers/paraphrase-MiniLM-L3-v2",  # L3 = faster/smaller; or all-MiniLM-L6-v2
    "embedding_batch_size": 128,  # larger = faster on GPU; reduce to 32 if OOM
    "top_k": 5,
    "index_dir": "/kaggle/working/rag_index",
    "rebuild_anyway": False,
}

Path(CONFIG["index_dir"]).mkdir(parents=True, exist_ok=True)
print("Config:", CONFIG)

Config: {'chunk_size': 384, 'chunk_overlap': 64, 'chunk_mode': 'doc', 'max_docs': 5000, 'sample_mode': 'first', 'sample_seed': 42, 'embedding_model': 'sentence-transformers/paraphrase-MiniLM-L3-v2', 'embedding_batch_size': 128, 'top_k': 5, 'index_dir': '/kaggle/working/rag_index', 'rebuild_anyway': False}


## 3. Data Loading

Two cells only:
1. **Default loader** – NVD CVE/CPE Kaggle dataset (attach it in **Add Data**).
2. **Optional** – Add more sources (NVD API, MITRE ATT&CK, other). Uncomment what you need.

### 3a. Helper: Build list of documents (text + optional metadata)

Each doc = `{"text": "...", "source": "...", "id": "..."}`.

In [84]:
def doc_to_text(doc: dict) -> str:
    """Single document to a single string for chunking."""
    parts = []
    if doc.get("id"):
        parts.append(f"ID: {doc['id']}")
    if doc.get("title"):
        parts.append(f"Title: {doc['title']}")
    if doc.get("description"):
        parts.append(str(doc["description"]))
    if doc.get("text"):
        parts.append(str(doc["text"]))
    return "\n".join(parts) if parts else ""

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    return " ".join(s.split())

print("Helpers defined.")

Helpers defined.


### Default loader – NVD CVE/CPE Kaggle dataset

Attach the dataset in **Add Data**: [NVD CVE/CPE till Feb 2025](https://www.kaggle.com/datasets/nikhilarora1729/nvd-cve-cpe-dataset-till-february-2025). If not attached, this cell leaves `documents` empty and you can use the optional cell below.

In [85]:
documents = []
NVD_KAGGLE_PATH = Path("/kaggle/input/nvd-cve-cpe-dataset-till-february-2025")

if not NVD_KAGGLE_PATH.exists():
    print("NVD CVE/CPE Kaggle dataset not attached. Run the optional cell below to use NVD API / MITRE ATT&CK instead.")
else:
    def _guess_cve_cols(df: pd.DataFrame):
        cols = list(df.columns)
        id_col = None
        for c in cols:
            cl = c.lower()
            if "cve" in cl and ("id" in cl or cl.endswith("_id") or cl == "cve"):
                id_col = c
                break
        if id_col is None:
            for c in cols:
                if c.lower() in {"cve_id", "cveid"}:
                    id_col = c
                    break
        desc_col = None
        for c in cols:
            cl = c.lower()
            if any(k in cl for k in ["description", "summary", "details"]):
                desc_col = c
                break
        return id_col, desc_col

    added = 0
    for csv_file in NVD_KAGGLE_PATH.glob("*.csv"):
        print(f"Loading {csv_file.name} ...")
        df_nvd = pd.read_csv(csv_file)
        id_col, desc_col = _guess_cve_cols(df_nvd)
        print(f"  id_col={id_col}, desc_col={desc_col}")
        if not id_col or not desc_col:
            continue
        for _, row in df_nvd.iterrows():
            doc_text = row.get(desc_col, "")
            if not isinstance(doc_text, str) or not doc_text.strip():
                continue
            documents.append({
                "id": row.get(id_col, ""),
                "description": doc_text,
                "source": "Kaggle_NVD_CVE_CPE",
            })
            added += 1
    print(f"Added {added} documents. Total: {len(documents)}")

documents = [d for d in documents if normalize_text(doc_to_text(d))]
print(f"Total documents to index: {len(documents)}")

Loading cpe.csv ...
  id_col=None, desc_col=None
Loading junction.csv ...
  id_col=cveId, desc_col=None
Loading nvd_cves.csv ...
  id_col=cveId, desc_col=description
Added 282250 documents. Total: 282250
Total documents to index: 282250



### Optional: add more sources

Uncomment the blocks you want: **NVD API** (recent CVEs), **MITRE ATT&CK** (techniques), or other Kaggle/Hugging Face datasets. Run this cell after the default loader.

In [86]:
def fetch_nvd_recent(results_per_page: int = 100) -> list:
    """Fetch recent CVEs from NVD API (no key). Rate limit ~5 req/30s."""
    url = "https://services.nvd.nist.gov/rest/json/cves/2.0"
    params = {"resultsPerPage": results_per_page, "startIndex": 0}
    out = []
    try:
        r = requests.get(url, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()
        for item in data.get("vulnerabilities", []):
            cve = item.get("cve", {})
            desc = (cve.get("descriptions") or [{}])[0].get("value", "")
            out.append({"id": cve.get("id", ""), "title": cve.get("id", ""), "description": desc, "source": "NVD"})
    except Exception as e:
        print("NVD fetch error:", e)
    return out

def fetch_mitre_attack_enterprise() -> list:
    """Load MITRE ATT&CK Enterprise techniques from official STIX JSON."""
    url = "https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json"
    out = []
    try:
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        data = r.json()
        for obj in data.get("objects", []):
            if obj.get("type") != "attack-pattern":
                continue
            name = obj.get("name", "")
            desc = obj.get("description", "")
            ext = obj.get("external_references", [])
            refs = " ".join([e.get("external_id", "") for e in ext if e.get("external_id")])
            if ext:
                ext_id = ext[0].get("external_id", obj.get("id", ""))
            else:
                ext_id = obj.get("id", "")
            out.append({
                "id": ext_id,
                "title": name,
                "description": desc,
                "text": refs,
                "source": "MITRE_ATTACK",
            })
    except Exception as e:
        print("MITRE fetch error:", e)
    return out

# --- Optional sources: uncomment what you need ---

# NVD API (recent CVEs)
# nvd_docs = fetch_nvd_recent(results_per_page=200)
# documents.extend(nvd_docs)
# print(f"After NVD API: {len(documents)}")

# MITRE ATT&CK (enabled: enriches responses with technique descriptions)
attack_docs = fetch_mitre_attack_enterprise()
documents.extend(attack_docs)
print(f"After MITRE ATT&CK: {len(attack_docs)} techniques. Total documents: {len(documents)}")

# === Global Cybersecurity Threats 2015–2024 (Kaggle) ===
# Dataset: https://www.kaggle.com/datasets/atharvasoundankar/global-cybersecurity-threats-2015-2024
# After attaching it in "Add Data", the path will look like:
#   /kaggle/input/global-cybersecurity-threats-2015-2024
GCT_PATH = "/kaggle/input/global-cybersecurity-threats-2015-2024"
if Path(GCT_PATH).exists():
    try:
        # Check the actual filename in Kaggle's Data tab; adjust if needed
        df_gct = pd.read_csv(f"{GCT_PATH}/global_cybersecurity_threats.csv")
    except Exception as e:
        print("Could not load Global Cybersecurity Threats dataset:", e)
    else:
        # print(df_gct.columns)  # uncomment once to inspect columns
        def build_threat_text(row):
            parts = []
            year = row.get("Year") or row.get("year")
            country = row.get("Country") or row.get("country")
            attack_type = row.get("Attack_Type") or row.get("Attack Type") or row.get("attack_type")
            sector = row.get("Target_Industry") or row.get("Industry") or row.get("Target Industry")
            loss = row.get("Financial_Loss_Million_USD") or row.get("Financial_Loss") or row.get("financial_loss")
            users = row.get("Affected_Users_Million") or row.get("Affected_Users") or row.get("affected_users")
            vuln = row.get("Vulnerability_Exploited") or row.get("Vulnerability")
            group = row.get("Attack_Source") or row.get("Source") or row.get("Attacker")

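            # Avoid emitting "In None in ..." when only one of the two fields is present
            if year and country:
                parts.append(f"In {year} in {country},")
            elif year:
                parts.append(f"In {year},")
            elif country:
                parts.append(f"In {country},")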
            if attack_type:
                parts.append(f"a {attack_type} attack")
            if sector:
                parts.append(f"targeted the {sector} sector")
            sentence = " ".join(p for p in parts if p).strip()
            if sentence and not sentence.endswith("."):
                sentence += "."

            details = []
            if loss not in (None, ""):
                details.append(f"Estimated financial loss: {loss}.")
            if users not in (None, ""):
                details.append(f"Affected users (millions): {users}.")
            if vuln:
                details.append(f"Vulnerability exploited: {vuln}.")
            if group:
                details.append(f"Attack source: {group}.")

            text = " ".join([sentence] + details).strip()
            return text

        added_gct = 0
        for idx, row in df_gct.iterrows():
            text = build_threat_text(row)
            if not text:
                continue
            documents.append({
                "id": f"GCT-{idx}",
                "title": row.get("Attack_Type", row.get("attack_type", "")),
                "description": text,
                "source": "Kaggle_GlobalThreats",
            })
            added_gct += 1
        print(f"Added {added_gct} Global Cybersecurity Threat incidents. Total documents: {len(documents)}")
else:
    print("Global Cybersecurity Threats dataset not attached; skipping.")

documents = [d for d in documents if normalize_text(doc_to_text(d))]
print(f"Total documents to index: {len(documents)}")

After MITRE ATT&CK: 835 techniques. Total documents: 283085
Global Cybersecurity Threats dataset not attached; skipping.
Total documents to index: 283085
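
`fetch_nvd_recent` above requests a single page (`startIndex` 0). To pull more records you would step `startIndex` by `resultsPerPage` until the `totalResults` value reported in the API response is covered. The sketch below only generates the request parameters; `nvd_page_params` is a hypothetical helper, and the total of 450 is an assumed example value. With the roughly 5 requests per 30 s limit noted in the docstring, pausing about 6 seconds between requests is a reasonable precaution.

```python
def nvd_page_params(total_results: int, results_per_page: int = 200) -> list:
    """Return one params dict per page needed to cover total_results records."""
    return [
        {"resultsPerPage": results_per_page, "startIndex": start}
        for start in range(0, total_results, results_per_page)
    ]

pages = nvd_page_params(total_results=450, results_per_page=200)
print([p["startIndex"] for p in pages])  # [0, 200, 400]
```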


In [87]:
# Documents are filled by the default loader and optional cell above.
# Proceed to chunking (section 4).

## 4. Chunking

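Split each document into overlapping chunks. Set **chunk_mode** to `"sentence"` to avoid cutting mid-sentence; any other value (including `"char"` and the configured `"doc"`) uses character-based windows of `chunk_size` characters that advance by `chunk_size - chunk_overlap`. Set **max_docs** in CONFIG to cap documents for quick runs.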
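
To make the character-based windowing concrete, the sketch below computes only the `(start, end)` offsets that the character branch of `chunk_text` would cut. `char_windows` is a hypothetical helper written for illustration, and the 1000-character length is an assumed example.

```python
def char_windows(text_len: int, chunk_size: int, overlap: int) -> list:
    """Return (start, end) offsets of character-based chunks."""
    step = max(1, chunk_size - overlap)  # how far each window advances
    spans = []
    for start in range(0, text_len, step):
        spans.append((start, min(start + chunk_size, text_len)))
        if start + chunk_size >= text_len:
            break  # the last window already reaches the end of the text
    return spans

print(char_windows(text_len=1000, chunk_size=384, overlap=64))
# [(0, 384), (320, 704), (640, 1000)]
```

Each window starts 320 characters after the previous one, so consecutive chunks share 64 characters.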

In [88]:
import re
import random

def chunk_text(text: str, chunk_size: int, overlap: int, mode: str = "char") -> list:
    """Split into chunks. mode='char': character-based. mode='sentence': group by sentences to avoid mid-sentence cuts."""
    if not text or chunk_size <= 0:
        return []
    text = normalize_text(text)
    if mode == "sentence":
        # Split on sentence boundaries, then merge into ~chunk_size
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        chunks = []
        current, current_len = [], 0
        for s in sentences:
            if current_len + len(s) + 1 <= chunk_size:
                current.append(s)
                current_len += len(s) + 1
            else:
                if current:
                    chunks.append(" ".join(current))
                # overlap: keep last sentence(s) that fit in overlap
                overlap_len = 0
                overlap_sentences = []
                for x in reversed(current):
                    if overlap_len + len(x) + 1 <= overlap:
                        overlap_sentences.insert(0, x)
                        overlap_len += len(x) + 1
                    else:
                        break
                current = overlap_sentences + [s] if overlap else [s]
                current_len = sum(len(x) for x in current) + len(current) - 1
        if current:
            chunks.append(" ".join(current))
        return chunks
    # character-based (original)
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(text), step):
        c = text[start : start + chunk_size]
        if c:
            chunks.append(c)
        if start + chunk_size >= len(text):
            break
    return chunks

def build_chunks(documents: list, config: dict) -> tuple:
    """Return (list of chunk strings, list of metadata dicts). Uses max_docs and sample_mode (first/random)."""
    chunks, meta = [], []
    docs = list(documents)
    if config.get("max_docs"):
        if config.get("sample_mode") == "random":
            random.seed(config.get("sample_seed", 42))
            random.shuffle(docs)
        docs = docs[: config["max_docs"]]
    for doc in docs:
        text = doc_to_text(doc)
        if not text:
            continue
        for c in chunk_text(
            text,
            config["chunk_size"],
            config["chunk_overlap"],
            config.get("chunk_mode", "char"),
        ):
            chunks.append(c)
            meta.append({
                "source": doc.get("source", ""),
                "doc_id": doc.get("id", ""),
                "title": doc.get("title", ""),
            })
    return chunks, meta

chunk_texts, chunk_meta = build_chunks(documents, CONFIG)
print(f"Total chunks: {len(chunk_texts)} (chunk_mode={CONFIG.get('chunk_mode', 'char')})")

Total chunks: 5533 (chunk_mode=doc)


In [89]:
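# BM25 index over chunks (used by hybrid retrieval in section 7)
!pip install -q rank_bm25 nltk

from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer data quietly (first run only).
# Newer NLTK releases load "punkt_tab" instead of "punkt"; requesting both is harmless.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)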

# Build BM25 index over the same chunks used for embeddings
tokenized_chunks = [word_tokenize(c.lower()) for c in chunk_texts]
bm25 = BM25Okapi(tokenized_chunks)
print("BM25 index built over", len(tokenized_chunks), "chunks")


BM25 index built over 5533 chunks


## 5. Embeddings & FAISS Index

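Embed all chunks and build a FAISS index. If **rebuild_anyway** is False and a saved index exists whose `embedding_model`, `top_k`, and vector count (`ntotal`) match the current run, it is loaded and the rebuild is skipped. The check does not compare chunk text or chunking settings, so set **rebuild_anyway** to True after changing the data or the `chunk_*` options.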
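
The index cell uses `IndexFlatIP` on L2-normalized vectors, which makes the inner-product score equal to cosine similarity. The sketch below checks that equivalence with plain Python lists; the two vectors are made-up example values, not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length (left unchanged if it is the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(round(ip_after_norm, 6), round(cosine(a, b), 6))  # 0.96 0.96
```

This is why the query vector must be normalized the same way at search time: without it, the scores are no longer cosine similarities.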

In [90]:
index_dir = CONFIG["index_dir"]
config_path = os.path.join(index_dir, "config.json")
index_path = os.path.join(index_dir, "faiss.index")
INDEX_WAS_LOADED = False

if (
    not CONFIG.get("rebuild_anyway", False)
    and os.path.exists(config_path)
    and os.path.exists(index_path)
):
    with open(config_path, encoding="utf-8") as f:
        old = json.load(f)
    n_chunks = len(chunk_texts)
    if (
        old.get("embedding_model") == CONFIG["embedding_model"]
        and old.get("top_k") == CONFIG["top_k"]
        and old.get("ntotal") == n_chunks
    ):
        index = faiss.read_index(index_path)
        dim = int(old["dim"])
        model = SentenceTransformer(CONFIG["embedding_model"])
        INDEX_WAS_LOADED = True
        print(f"Loaded existing index (ntotal={index.ntotal}, dim={dim}). Skipped rebuild.")
    else:
        old = None
if not INDEX_WAS_LOADED:
    model = SentenceTransformer(CONFIG["embedding_model"])
    batch_size = CONFIG.get("embedding_batch_size", 128)
    embeddings = model.encode(chunk_texts, show_progress_bar=True, batch_size=batch_size)
    embeddings = np.array(embeddings, dtype=np.float32)
    dim = embeddings.shape[1]
    print(f"Embeddings shape: {embeddings.shape}, dim={dim}")
    index = faiss.IndexFlatIP(dim)
    faiss.normalize_L2(embeddings)
    index.add(embeddings)
    print(f"FAISS index: {index.ntotal} vectors.")

Loaded existing index (ntotal=5533, dim=384). Skipped rebuild.


## 6. Save Index & Metadata

Persist the index, chunks, metadata, and config under `/kaggle/working/` so you can download the files or add them as an output dataset.
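
On the consuming side (for example the local Streamlit app), the saved files can be read back roughly as follows. This is a sketch under the assumption that the directory contains exactly the four files written by the save cell; `load_rag_artifacts` is a hypothetical helper name.

```python
import json
import os

def load_rag_artifacts(index_dir: str, load_index: bool = True) -> dict:
    """Load chunks, metadata, config, and optionally the FAISS index."""
    def _read_json(name):
        with open(os.path.join(index_dir, name), encoding="utf-8") as f:
            return json.load(f)

    artifacts = {
        "chunks": _read_json("chunks.json"),
        "metadata": _read_json("metadata.json"),
        "config": _read_json("config.json"),
    }
    if len(artifacts["chunks"]) != len(artifacts["metadata"]):
        raise ValueError("chunks.json and metadata.json are out of sync")
    if load_index:
        import faiss  # imported lazily so the JSON part works without faiss
        index = faiss.read_index(os.path.join(index_dir, "faiss.index"))
        if index.ntotal != len(artifacts["chunks"]):
            raise ValueError("FAISS index size does not match chunks.json")
        artifacts["index"] = index
    return artifacts

# Usage (path is an example): artifacts = load_rag_artifacts("rag_index")
```

The size checks catch the case where files from different runs end up in the same directory.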

In [91]:
if not INDEX_WAS_LOADED:
    index_dir = CONFIG["index_dir"]
    faiss.write_index(index, os.path.join(index_dir, "faiss.index"))
    with open(os.path.join(index_dir, "chunks.json"), "w", encoding="utf-8") as f:
        json.dump(chunk_texts, f, ensure_ascii=False, indent=0)
    with open(os.path.join(index_dir, "metadata.json"), "w", encoding="utf-8") as f:
        json.dump(chunk_meta, f, ensure_ascii=False, indent=0)
    with open(os.path.join(index_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({
            "embedding_model": CONFIG["embedding_model"],
            "top_k": CONFIG["top_k"],
            "dim": dim,
            "ntotal": index.ntotal,
        }, f, indent=2)
    print("Saved:", os.listdir(index_dir))
else:
    print("Index was loaded; skip save.")

Index was loaded; skip save.


## 7. Retrieval Experiment

Test retrieval with a few queries. Use the same model and normalization as at index time.

In [92]:
# Hybrid retrieval: Reciprocal Rank Fusion (RRF) over FAISS + BM25

def hybrid_search_rrf(
    query: str,
    index,
    chunk_texts,
    chunk_meta,
    model,
    top_k: int = 5,
    faiss_k: int = 50,
    bm25_k: int = 50,
    k_rrf: int = 60,
):
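    """Combine FAISS (dense) and BM25 (lexical) rankings using Reciprocal Rank Fusion.

    fused_score = 1 / (k_rrf + rank_faiss) + 1 / (k_rrf + rank_bm25)

    Ranks are 0-based, and a candidate missing from one ranking contributes
    nothing for that ranking. Relies on the global `bm25` index built earlier
    and on `search()`, which must be defined before this function is called.
    """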
    # 1) FAISS ranking (dense)
    faiss_results = search(query, index, chunk_texts, model, top_k=faiss_k)
    faiss_ranks = {r["idx"]: rank for rank, r in enumerate(faiss_results)}

    # 2) BM25 ranking (lexical)
    q_tokens = word_tokenize(query.lower())
    bm25_scores = bm25.get_scores(q_tokens)
    bm25_top_idx = np.argsort(bm25_scores)[::-1][:bm25_k]
    bm25_ranks = {int(idx): rank for rank, idx in enumerate(bm25_top_idx)}

    # 3) RRF fusion over union of candidates
    all_ids = set(faiss_ranks.keys()) | set(bm25_ranks.keys())
    fused = []
    for cid in all_ids:
        r_faiss = faiss_ranks.get(cid)
        r_bm25 = bm25_ranks.get(cid)
        score = 0.0
        if r_faiss is not None:
            score += 1.0 / (k_rrf + r_faiss)
        if r_bm25 is not None:
            score += 1.0 / (k_rrf + r_bm25)
        fused.append((cid, score))

    # 4) Rank fused results
    fused.sort(key=lambda x: x[1], reverse=True)
    results = []
    for cid, score in fused[:top_k]:
        results.append(
            {
                "idx": int(cid),
                "chunk": chunk_texts[cid],
                "score": float(score),
                "source": chunk_meta[cid].get("source", ""),
            }
        )
    return results
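
To see how the fusion step in `hybrid_search_rrf` orders candidates, here is a small standalone sketch of Reciprocal Rank Fusion on two toy rankings. `rrf_fuse` is a hypothetical helper, and the chunk indices are invented for illustration.

```python
def rrf_fuse(rankings: list, k_rrf: int = 60) -> list:
    """Fuse ranked ID lists with Reciprocal Rank Fusion (0-based ranks)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = [7, 2, 9]    # hypothetical chunk indices from FAISS, best first
lexical = [2, 5, 7]  # hypothetical chunk indices from BM25, best first
fused = rrf_fuse([dense, lexical])
print([doc_id for doc_id, _ in fused])  # [2, 7, 5, 9]
```

Chunks 2 and 7 appear in both rankings, so they outrank 5 and 9, which appear in only one.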

In [93]:
def search(query: str, index, chunk_texts: list, model, top_k: int = 5):
    q_emb = model.encode([query])
    q_emb = np.array(q_emb, dtype=np.float32)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, min(top_k, index.ntotal))
    return [
        {"chunk": chunk_texts[i], "score": float(s), "idx": int(i)}
        for i, s in zip(indices[0], scores[0])
    ]

for q in ["Windows vulnerability", "remote code execution", "ransomware"]:
    results = search(q, index, chunk_texts, model, CONFIG["top_k"])
    print(f"Query: {q}")
    for r in results[:2]:
        print(f"  score={r['score']:.3f} | {r['chunk'][:120]}...")
    print()

Query: Windows vulnerability
  score=0.591 | emote attackers to execute arbitrary code on Windows by leveraging an untrusted search path vulnerability in (a) Interne...
  score=0.587 |  As of 20081210, it is unclear whether this vulnerability is related to a WordPad issue disclosed on 20080925 with a 200...

Query: remote code execution
  score=0.639 | leveraged for arbitrary remote code execution in conjunction with CVE-2007-6378....
  score=0.593 | R01 allows remote attackers to execute arbitrary code via a long Session cookie....

Query: ransomware
  score=0.507 | ID: CVE-2008-0792 Multiple F-Secure anti-virus products, including Internet Security 2006 through 2008, Anti-Virus 2006 ...
  score=0.489 |  can be leveraged for attacks such as DNS cache poisoning against OpenBSD's modification of BIND....



### 7a. RAG prompt template

RAG needs a **prompt** that gives the LLM the retrieved context and the user question. Use this template in the notebook and in your Streamlit app so answers are grounded in the retrieved chunks.

In [94]:
# RAG prompt: system instruction + user message with {context} and {question}
RAG_SYSTEM_PROMPT = """You are a cybersecurity threat intelligence assistant.
Answer only from the provided context. If the context does not contain enough information,
say so. Be concise and cite CVE IDs or technique names when relevant. You may see MITRE ATT&CK techniques under 'MITRE ATT&CK techniques'.
When mapping attacks, prefer these techniques and do not invent techniques that are not in the context."""

RAG_USER_TEMPLATE = """Context (from threat intelligence documents):

{context}

Question: {question}

Answer briefly and based only on the context above:"""

def build_rag_prompt(context: str, question: str, max_context_chars: int = 6000) -> str:
    """Build the user prompt for RAG: context + question. Truncate context if needed."""
    context_trimmed = context[:max_context_chars]  # slicing is a no-op when the context is already short enough
    return RAG_USER_TEMPLATE.format(context=context_trimmed, question=question)

print("RAG prompt template defined. Use build_rag_prompt(context, question) for queries.")

RAG prompt template defined. Use build_rag_prompt(context, question) for queries.
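
`build_rag_prompt` truncates by character count, which can cut the last chunk mid-sentence. One possible alternative, sketched here as an assumption rather than the notebook's method, is to add whole chunks until the budget is used up. `join_within_budget` is a hypothetical helper.

```python
def join_within_budget(chunks: list, max_chars: int = 6000, sep: str = "\n\n---\n\n") -> str:
    """Join chunks in order, stopping before the first one that would exceed max_chars."""
    kept, used = [], 0
    for chunk in chunks:
        extra = len(chunk) + (len(sep) if kept else 0)
        if used + extra > max_chars:
            break
        kept.append(chunk)
        used += extra
    return sep.join(kept)

print(join_within_budget(["aaaa", "bbbb", "cccc"], max_chars=10, sep="|"))  # aaaa|bbbb
```

Because retrieval results arrive best first, dropping trailing chunks removes the least relevant context.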


### 7b. Retrieval evaluation (optional)

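Checks whether retrieval finds relevant documents: for each (query, expected_doc_id_substring) pair, we check whether any retrieved chunk's `doc_id` contains that string. Reports **Hit@k** (fraction of queries whose expected document appears in the top k) and **MRR** (mean reciprocal rank of the first match). The placeholder substring `"CVE-"` matches every CVE document, so the scores are only meaningful once you replace it with specific IDs.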
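
A small worked example of the two metrics, using made-up ranks for three queries (first match at rank 1, first match at rank 3, and no match in the top k). `hit_and_mrr` is a hypothetical helper that mirrors the arithmetic only, not the retrieval.

```python
def hit_and_mrr(found_ranks: list) -> tuple:
    """found_ranks: 1-based rank of the first relevant result per query, or None if missed."""
    n = len(found_ranks)
    if n == 0:
        return 0.0, 0.0
    hits = sum(1 for r in found_ranks if r is not None)
    rr_sum = sum(1.0 / r for r in found_ranks if r is not None)
    return hits / n, rr_sum / n

hit_at_k, mrr = hit_and_mrr([1, 3, None])
print(f"Hit@k={hit_at_k:.3f}  MRR={mrr:.3f}")  # Hit@k=0.667  MRR=0.444
```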

In [95]:
# Eval set: list of (query, expected_doc_id_substring). Edit with real CVE/technique IDs from your data.
EVAL_QUERIES = [
    ("Windows remote code execution vulnerability", "CVE-"),   # any CVE
    ("privilege escalation", "CVE-"),
    ("cross-site scripting", "CVE-"),
]

def eval_retrieval(queries, index, chunk_texts, chunk_meta, model, top_k=5):
    hit = 0
    rr_sum = 0.0
    for query, expected_substr in queries:
        results = search(query, index, chunk_texts, model, top_k)
        doc_ids = [chunk_meta[r["idx"]].get("doc_id", "") for r in results]
        found_rank = None
        for i, did in enumerate(doc_ids):
            if expected_substr in str(did):
                found_rank = i + 1
                break
        if found_rank is not None:
            hit += 1
            rr_sum += 1.0 / found_rank
    n = len(queries)
    hit_at_k = hit / n if n else 0
    mrr = rr_sum / n if n else 0
    return hit_at_k, mrr

if EVAL_QUERIES and chunk_texts:
    hit_at_k, mrr = eval_retrieval(EVAL_QUERIES, index, chunk_texts, chunk_meta, model, CONFIG["top_k"])
    print(f"Hit@{CONFIG['top_k']}: {hit_at_k:.2%}  |  MRR: {mrr:.3f}")
else:
    print("Add (query, expected_doc_id_substring) to EVAL_QUERIES and re-run to get Hit@k and MRR.")

Hit@5: 100.00%  |  MRR: 1.000


In [96]:
# Anthropic API key: use env ANTHROPIC_API_KEY, or paste below for this run (do not commit after).
import os
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")  # e.g. "sk-ant-api03-..."
if ANTHROPIC_API_KEY:
    os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY
    print("ANTHROPIC_API_KEY set.")
else:
    print("Paste your key in ANTHROPIC_API_KEY above, or set it in Kaggle Secrets.")

Paste your key in ANTHROPIC_API_KEY above, or set it in Kaggle Secrets.


In [97]:
!pip install -q openai



In [98]:
# Groq key setup for llama-3.1-8b-instant
import os

# Prefer setting GROQ_API_KEY in the environment / Kaggle Secrets.
# If you paste a key here for local experimentation, remove it before sharing.
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")
GROQ_BASE_URL = os.environ.get("GROQ_BASE_URL", "https://api.groq.com/openai/v1")

if GROQ_API_KEY:
    # The OpenAI client below reads these env vars, so we map Groq -> OpenAI_* names
    os.environ["OPENAI_API_KEY"] = GROQ_API_KEY
    os.environ["OPENAI_BASE_URL"] = GROQ_BASE_URL
    print("Groq key set. Base URL:", GROQ_BASE_URL)
else:
    print("Set GROQ_API_KEY in env / Kaggle Secrets or paste it above.")

Groq key set. Base URL: https://api.groq.com/openai/v1
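
The messages above mention Kaggle Secrets. A minimal sketch of reading a key from there, with a fallback to the environment when running outside Kaggle; `get_api_key` is a hypothetical helper, and it assumes the secret is attached to the notebook under the same name.

```python
import os

def get_api_key(name: str) -> str:
    """Return a key from Kaggle Secrets when available, else from the environment."""
    try:
        from kaggle_secrets import UserSecretsClient  # only importable inside Kaggle
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Not on Kaggle, or the secret is not attached to this notebook
        return os.environ.get(name, "")

GROQ_API_KEY = get_api_key("GROQ_API_KEY")
```

Keeping the key out of the notebook source avoids leaking it when the notebook is shared.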


In [99]:
from openai import OpenAI

GROQ_MODEL = "llama-3.1-8b-instant"  # Groq model ID

LLM_QUESTIONS = [
    "What are the main risks described in the context? Summarize in 2-3 bullet points.",
    "List any CVE or vulnerability IDs mentioned in the context.",
]

def rag_llm_check_openrouter(questions, index, chunk_texts, chunk_meta, model, config):
    import os
    api_key = os.environ.get("OPENAI_API_KEY")
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.groq.com/openai/v1")
    if not api_key:
        print("Set GROQ_API_KEY (your Groq key) before running this cell.")
        return

    client = OpenAI(api_key=api_key, base_url=base_url)

    for q in questions[:2]:
        # 1) RAG retrieval
        results = search(q, index, chunk_texts, model, config["top_k"])

        # Split context into: vulnerabilities/incidents vs MITRE ATT&CK techniques
        vuln_chunks = []
        mitre_chunks = []
        for r in results:
            meta = chunk_meta[r["idx"]]
            src = meta.get("source", "")
            if src == "MITRE_ATTACK":
                mitre_chunks.append(r["chunk"])
            else:
                vuln_chunks.append(r["chunk"])

        # 2) Build structured context
        parts = []
        if vuln_chunks:
            parts.append("Vulnerabilities and incidents:\n\n" +
                         "\n\n---\n\n".join(vuln_chunks))
        if mitre_chunks:
            parts.append("MITRE ATT&CK techniques:\n\n" +
                         "\n\n---\n\n".join(mitre_chunks))

        context = "\n\n\n".join(parts) if parts else ""
        user_prompt = build_rag_prompt(context, q)

        # 3) LLM call via Groq (OpenAI-compatible endpoint)
        try:
            resp = client.chat.completions.create(
                model=GROQ_MODEL,
                messages=[
                    {"role": "system", "content": RAG_SYSTEM_PROMPT},
                    {"role": "user", "content": user_prompt},
                ],
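                max_tokens=400,
                temperature=0.4,
                top_p=0.9,
                # top_k is not accepted by chat.completions.create() in the OpenAI client;
                # passing it raised the TypeError recorded in this cell's output.
            )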
            answer = resp.choices[0].message.content
            print(f"Q: {q[:70]}...")
            print(f"A: {answer}\n")
        except Exception as e:
            print(f"Groq API error: {e}")
            print("Prompt preview:", user_prompt[:300], "...\n")
            break

if chunk_texts:
    rag_llm_check_openrouter(LLM_QUESTIONS, index, chunk_texts, chunk_meta, model, CONFIG)
else:
    print("No chunks; run data loading and chunking first.")

Groq API error: Completions.create() got an unexpected keyword argument 'top_k'
Prompt preview: Context (from threat intelligence documents):

Vulnerabilities and incidents:

orun.inf file, and possibly other vectors related to (a) AutoRun and (b) AutoPlay actions.

---

 NOTE: some of these details are obtained from third party information.

---

 some of these details are obtained from third ...



### 7c. Optional LLM check (validates full RAG chain)

Uses the **Groq API** through its OpenAI-compatible endpoint (the `openai` client with `base_url` set to `https://api.groq.com/openai/v1`). Set **GROQ_API_KEY** in the environment (or Kaggle Secrets). Runs retrieval → build prompt → call the model named in `GROQ_MODEL`.

In [105]:
from openai import OpenAI

GROQ_MODEL = "meta-llama/llama-4-maverick-17b-128e-instruct"  # Groq model ID || Tool Use, JSON Object Mode, JSON Schema Mode, Vision

def ask_rag(question: str):
    """
    Free-form question → RAG retrieval → llama-4-maverick-17b-128e-instruct via Groq.
    Example question:
      "I clicked a link from an unknown person and now my PC won't start.
       What kind of attack could this be and what should I do?"
    """
    import os
    api_key = os.environ.get("OPENAI_API_KEY")
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.groq.com/openai/v1")
    if not api_key:
        print("Set GROQ_API_KEY (Groq key) first.")
        return

    if not chunk_texts:
        print("No chunks; run data loading, chunking, and indexing cells first.")
        return

    client = OpenAI(api_key=api_key, base_url=base_url)

    # 1) Retrieve relevant context from your threat index
    results = search(question, index, chunk_texts, model, CONFIG["top_k"])
    context = "\n\n---\n\n".join([r["chunk"] for r in results])

    # 2) Build RAG prompt using your template
    user_prompt = build_rag_prompt(context, question)

    # 3) Call the model named in GROQ_MODEL via Groq
    try:
        resp = client.chat.completions.create(
            model=GROQ_MODEL,
            messages=[
                {"role": "system", "content": RAG_SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=500,
            temperature=0.4,
            top_p=0.9,
        )
        answer = resp.choices[0].message.content
        print("Question:")
        print(question)
        print("\nAnswer:")
        print(answer)
    except Exception as e:
        print(f"Groq API error: {e}")
        print("\nPrompt preview:\n", user_prompt[:400], "...")

In [106]:
# ✏️ Type your question here and run this cell, then run ask_rag(...)
user_question = """
I got a link from an unknown person. After clicking it and downloading a file,
my PC doesn’t start properly anymore. What kinds of attacks or malware could
this be, and what should I do next?
""".strip()

ask_rag(user_question)

Question:
I got a link from an unknown person. After clicking it and downloading a file,
my PC doesn’t start properly anymore. What kinds of attacks or malware could
this be, and what should I do next?

Answer:
Based on the context, the issue could be related to a denial of service (DoS) attack, potentially caused by a malformed archive or a crafted file that exploits a vulnerability in an antivirus engine, similar to CVE-2008-3447 or CVE-2008-1437/CVE-2008-1438. To proceed, you should: 

1. Disconnect from the internet to prevent further potential damage.
2. Try to boot in safe mode or use a recovery disk to diagnose the issue.
3. Run a thorough scan with an updated antivirus engine to detect potential malware.

The context does not provide enough information to determine the exact malware or attack. More information is needed for a precise diagnosis.


In [107]:
# Non-technical question (end user / manager)
user_question = """
I work in a small company and recently our employees received very realistic
emails asking them to click a link to "verify their account". Some people
clicked and entered their passwords. I don't know much about cybersecurity.

Based on known cyber attacks and MITRE ATT&CK techniques, what kind of attack
is this, what are the main risks for our company, and what immediate steps
should we take to reduce the damage?
""".strip()

ask_rag(user_question)

Question:
I work in a small company and recently our employees received very realistic
emails asking them to click a link to "verify their account". Some people
clicked and entered their passwords. I don't know much about cybersecurity.

Based on known cyber attacks and MITRE ATT&CK techniques, what kind of attack
is this, what are the main risks for our company, and what immediate steps
should we take to reduce the damage?

Answer:
The attack you're describing is likely a phishing attack, which is not directly listed in the provided CVEs, but can be related to CVE-2008-3868 (CSRF) in the sense that both can be used to steal credentials. The main risk is that attackers may have obtained your employees' passwords, potentially gaining unauthorized access to your company's systems and data.

Immediate steps to reduce the damage:

1. Inform all employees about the phishing attack and instruct them not to click on suspicious links or enter their passwords on unknown websites.
2. Force a pas

In [108]:
# Technical question (security analyst / engineer)
user_question = """
We suspect a targeted phishing campaign against our finance team. The initial
vector appears to be email with malicious links, possibly corresponding to
MITRE ATT&CK techniques T1566 (Phishing) and T1204 (User Execution).

Using the context and MITRE ATT&CK, can you:
1. Identify the most relevant ATT&CK techniques for this scenario (initial access and execution)?
2. Describe likely follow-on techniques (e.g., credential access, lateral movement) we should watch for.
3. Recommend concrete detection and mitigation actions mapped to those techniques.
""".strip()

ask_rag(user_question)

Question:
We suspect a targeted phishing campaign against our finance team. The initial
vector appears to be email with malicious links, possibly corresponding to
MITRE ATT&CK techniques T1566 (Phishing) and T1204 (User Execution).

Using the context and MITRE ATT&CK, can you:
1. Identify the most relevant ATT&CK techniques for this scenario (initial access and execution)?
2. Describe likely follow-on techniques (e.g., credential access, lateral movement) we should watch for.
3. Recommend concrete detection and mitigation actions mapped to those techniques.

Answer:
1. The most relevant ATT&CK techniques for this scenario are T1566 (Phishing) for initial access and T1204 (User Execution) for execution, as the initial vector is email with malicious links.

2. Likely follow-on techniques to watch for include T1056 (Keylogging or Input Capture) or T1111 (Two-Factor Authentication Interception) for credential access, and T1074 (Data Staging) or T1021 (Remote Desktop Protocol) for lateral m

In [109]:
# ISO 27001-style incident report generation
user_question = """
Using the context from the threat intelligence index and MITRE ATT&CK techniques,
generate an ISO 27001-style incident report for the following situation:

"I got a link from an unknown person. After clicking it and downloading a file,
my PC doesn’t start properly anymore."

Structure the report using typical ISO 27001 incident management sections:
1. Incident identification and summary
2. Scope and impact (assets, data, users, business processes)
3. Cause and attack description (map to relevant MITRE ATT&CK techniques where possible)
4. Containment actions taken / recommended
5. Eradication and recovery steps
6. Lessons learned and preventive controls (policies, training, technical controls)
7. References to any relevant CVEs or incidents from the context

Base everything ONLY on the retrieved context and MITRE ATT&CK information. If
something is not supported by the context, say that explicitly instead of guessing.
""".strip()

ask_rag(user_question)

Question:
Using the context from the threat intelligence index and MITRE ATT&CK techniques,
generate an ISO 27001-style incident report for the following situation:

"I got a link from an unknown person. After clicking it and downloading a file,
my PC doesn’t start properly anymore."

Structure the report using typical ISO 27001 incident management sections:
1. Incident identification and summary
2. Scope and impact (assets, data, users, business processes)
3. Cause and attack description (map to relevant MITRE ATT&CK techniques where possible)
4. Containment actions taken / recommended
5. Eradication and recovery steps
6. Lessons learned and preventive controls (policies, training, technical controls)
7. References to any relevant CVEs or incidents from the context

Base everything ONLY on the retrieved context and MITRE ATT&CK information. If
something is not supported by the context, say that explicitly instead of guessing.

Answer:
1. Incident identification and summary:
The incide

In [121]:
import os
import base64
from openai import OpenAI
import ipywidgets as widgets
from IPython.display import display, clear_output

# Groq OpenAI-compatible client (reuses your existing env vars)
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url=os.environ.get("OPENAI_BASE_URL", "https://api.groq.com/openai/v1"),
)

VISION_MODEL = "meta-llama/llama-4-maverick-17b-128e-instruct"  # vision-capable Groq model

upload = widgets.FileUpload(accept="image/*", multiple=False)
button = widgets.Button(description="Run image test")
output = widgets.Output()

display(upload, button, output)

def on_click(_):
    with output:
        clear_output()

        if not upload.value:
            print("Please upload an image first.")
            return

        # --- 1) Get image bytes from FileUpload (value is a tuple of dicts) ---
        item = upload.value[0]
        content_bytes = item["content"]
        image_b64 = base64.b64encode(content_bytes).decode("utf-8")

        # --- 2) First LLM call: describe the image ---
        user_prompt = "Describe this image briefly, focusing on any security or risk-related elements if present."

        try:
            img_resp = client.chat.completions.create(
                model=VISION_MODEL,
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful cybersecurity assistant.",
                    },
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": user_prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    # Use the uploaded file's MIME type; fall back to PNG if it is missing
                                    "url": f"data:{item.get('type') or 'image/png'};base64,{image_b64}"
                                },
                            },
                        ],
                    },
                ],
                max_tokens=250,
                temperature=0.4,
                top_p=0.9,
            )
        except Exception as e:
            print("Vision API error:", e)
            return

        image_description = img_resp.choices[0].message.content
        print("Image description:\n")
        print(image_description)
        print("\n" + "=" * 80 + "\n")

        # --- 3) RAG retrieval based on the image description ---
        if not chunk_texts:
            print("RAG index not available (no chunks). Run indexing cells first.")
            return

        # Use your existing hybrid retrieval (BM25 + FAISS)
        try:
            rag_results = hybrid_search_rrf(
                image_description,
                index,
                chunk_texts,
                chunk_meta,
                model,
                top_k=CONFIG["top_k"],
            )
        except NameError:
            print("hybrid_search_rrf or index/model not defined. Run earlier cells first.")
            return

        context = "\n\n---\n\n".join([r["chunk"] for r in rag_results])

        # --- 4) Second LLM call: use RAG context to propose prevention ---
        rag_question = (
            "Given the description of the image and the following threat-intel context, "
            "describe likely threats shown in the image and give concrete prevention "
            "and mitigation advice."
        )
        rag_user_prompt = build_rag_prompt(
            context,
            f"Image description: {image_description}\n\n{rag_question}",
        )

        try:
            rag_resp = client.chat.completions.create(
                model=VISION_MODEL,
                messages=[
                    {"role": "system", "content": RAG_SYSTEM_PROMPT},
                    {"role": "user", "content": rag_user_prompt},
                ],
                max_tokens=400,
                temperature=0.4,
                top_p=0.9,
            )
        except Exception as e:
            print("RAG+vision API error:", e)
            return

        rag_answer = rag_resp.choices[0].message.content
        print("RAG‑augmented prevention advice:\n")
        print(rag_answer)

button.on_click(on_click)

FileUpload(value=(), accept='image/*', description='Upload')

Button(description='Run image test', style=ButtonStyle())

Output()
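
The image cell above retrieves context through `hybrid_search_rrf`, which is defined earlier in this notebook. For readers who want the gist of the fusion step, here is a minimal sketch of reciprocal rank fusion (RRF), where each chunk scores the sum of `1 / (k + rank)` over every ranked list it appears in. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the chunk ids are illustrative assumptions and are not taken from the notebook's own implementation.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # rank starts at 1 so the best hit contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]

bm25_ranking = [3, 1, 7]    # hypothetical chunk ids from keyword search, best first
faiss_ranking = [1, 7, 9]   # hypothetical chunk ids from vector search, best first
print(rrf_fuse([bm25_ranking, faiss_ranking]))  # prints [1, 7, 3, 9]
```

Chunks that appear in both lists (1 and 7) rise above chunks that only one retriever found, which is the reason to fuse keyword and vector results by rank instead of comparing their raw scores.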

## 8. Export for Your Streamlit App

1. **From Kaggle:** Run all cells, then **Save Version → Quick Save** (or **Save & Run All**).
2. **Download the index:** In the right panel, open **Output** and download the `rag_index` folder (contains `faiss.index`, `chunks.json`, `metadata.json`, `config.json`).
3. **In your local project:** Place `rag_index/` next to your Streamlit app and load:
   - `faiss.read_index("rag_index/faiss.index")`
   - Load `chunks.json` and `metadata.json` for displaying sources.
   - Use the same `embedding_model` from `config.json` for query encoding.

**Optional:** Add this notebook's output as a **Kaggle Dataset** so your team can use the same index without re-running.
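
The loading steps above can be sketched as follows for the Streamlit side. This is a sketch under stated assumptions, not the app's actual loader: the function names are hypothetical, `chunks.json` is assumed to be a JSON list with one entry per indexed vector, and `config.json` is assumed to contain an `embedding_model` key as in `CONFIG`.

```python
import json
from pathlib import Path

def load_sidecars(index_dir="rag_index"):
    """Read chunks.json, metadata.json and config.json from index_dir."""
    index_dir = Path(index_dir)
    loaded = {}
    for stem in ("chunks", "metadata", "config"):
        with open(index_dir / f"{stem}.json", encoding="utf-8") as fh:
            loaded[stem] = json.load(fh)
    return loaded

def load_rag_index(index_dir="rag_index"):
    """Return (index, encoder, sidecars) ready for query-time retrieval."""
    # Heavy imports stay local so load_sidecars() works without them.
    import faiss
    from sentence_transformers import SentenceTransformer

    sidecars = load_sidecars(index_dir)
    index = faiss.read_index(str(Path(index_dir) / "faiss.index"))
    # Assumption: the index holds one vector per entry in chunks.json.
    if index.ntotal != len(sidecars["chunks"]):
        raise ValueError("faiss.index and chunks.json are out of sync")
    # Queries must be encoded with the same model that built the index.
    encoder = SentenceTransformer(sidecars["config"]["embedding_model"])
    return index, encoder, sidecars
```

In a Streamlit app you would typically wrap `load_rag_index` in `st.cache_resource` so the index and model load once per process and are reused across reruns.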

*(End of notebook.)*
