<a href="https://colab.research.google.com/github/TUSHAR91316/ML_MODELS/blob/main/Rag_baseds_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install langchain faiss-cpu pypdf transformers accelerate bitsandbytes langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

In [5]:
import re
from datetime import datetime
from bs4 import BeautifulSoup
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain.llms import HuggingFacePipeline

# ---------- Feature: Rule-based topic classification ----------
def classify_topic(text: str) -> str:
    text = text.lower()
    if "ransomware" in text:
        return "Ransomware"
    elif "apt" in text or "nation-state" in text:
        return "APT"
    elif "phishing" in text:
        return "Phishing"
    elif "zero-day" in text:
        return "Zero-Day"
    else:
        return "General"

# ---------- Feature: Try to extract title and date ----------
def extract_title_and_date(html_content: str):
    soup = BeautifulSoup(html_content, "html.parser")
    title = soup.title.string if soup.title else "No Title"
    # Simple date match (YYYY-MM-DD or MM/DD/YYYY)
    match = re.search(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b", soup.get_text())
    date = match.group(1) if match else "Unknown"
    return title.strip(), date

# ---------- Step 1: Load from multiple web sources ----------
sources = [
    {"url": "https://www.cisa.gov/news-events/cybersecurity-advisories", "origin": "CISA"},
    {"url": "https://www.cert-in.org.in", "origin": "CERT-IN"},
    {"url": "https://nvd.nist.gov/vuln/data-feeds", "origin": "NVD"},
    {"url": "https://www.ic3.gov/Media/Y2024/", "origin": "FBI-IC3"},
    {"url": "https://attack.mitre.org/news/updates/", "origin": "MITRE"}
]

all_docs = []

for source in sources:
    try:
        loader = WebBaseLoader(source["url"])
        docs = loader.load()
        for doc in docs:
            doc.metadata["origin"] = source["origin"]
            doc.metadata["source"] = "Web"
            doc.metadata["title"], doc.metadata["published_date"] = extract_title_and_date(doc.page_content)
            doc.metadata["topic"] = classify_topic(doc.page_content)
        all_docs.extend(docs)
    except Exception as e:
        print(f"❌ Failed to load {source['origin']}: {e}")

# ---------- Step 2: Chunking ----------
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.split_documents(all_docs)

# ---------- Step 3: Embedding and Index ----------
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embedding_model)

# ---------- Step 4: Load mistral model (quantized) ----------
model_name = "HuggingFaceH4/zephyr-7b-alpha"  # permissive and open

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)
llm = HuggingFacePipeline(pipeline=pipe)

# ---------- Step 5: Setup retriever with filtering (e.g., MITRE + APT topics only) ----------
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {
            "origin": "MITRE",
            "topic": "APT"
        }
    }
)

# ---------- Step 6: Run RAG ----------
rag_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

query = "What recent APT-related techniques have MITRE reported?"
response = rag_chain.run(query)

print("\n🧠 Final RAG Answer (Filtered by MITRE + APT):")
print(response)




tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/628 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not in

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Device set to use cuda:0
  llm = HuggingFacePipeline(pipeline=pipe)
  response = rag_chain.run(query)



🧠 Final RAG Answer (Filtered by MITRE + APT):
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.



Question: What recent APT-related techniques have MITRE reported?
Helpful Answer: In their latest Cybersecurity and Infrastructure Security Agency (CISA) report, MITRE has highlighted recent APT-related techniques such as the use of malicious PowerShell scripts, malicious macros in Word documents, and the deployment of remote access trojans (RATs). These techniques allow attackers to maintain persistent access to compromised systems and steal sensitive information. MITRE recommends that organizations implement multi-factor authentication, regularly patch systems, and train employees to recognize and report suspicious activity.
