# Threat Actor Knowledge RAG - ETDA Database
This notebook is heavily inspired by the excellent research of Roberto Rodriguez (@Cyb3rWard0g) into the applicability of generative AI for threat intelligence purposes. It follows the same structure with small alterations for this specific use case. 

Specifically, this notebook describes how to query a vector database in question-answering (QA) fashion using LangChain and OpenAI. It uses the vector database built with the notebook in the 'knowledge' folder. This research shows that having a large set of reports does not guarantee that a simple similarity search through the vector database returns the chunks we are looking for. A minor change in the original question can result in vastly different answers. The quality of the source data also has a large influence: because we scraped a large number of reports without much effort to check their quality, the context the LLM has to work with contains noise.

References:

- https://blog.openthreatresearch.com/demystifying-generative-ai-a-security-researchers-notes/
- https://github.com/OTRF/GenAI-Security-Adventures
- https://github.com/OTRF/GenAI-Security-Adventures/blob/main/experiments/RAG/Threat-Intelligence/ATTCK-Groups/LangChain/notebook.ipynb
- https://python.langchain.com/docs/get_started/introduction
- https://apt.etda.or.th/cgi-bin/aptgroups.cgi

# Improvement ideas
- [ ] Experiment with other RAG prompts
- [ ] Experiment with other [retrieval mechanisms](https://python.langchain.com/docs/modules/data_connection/retrievers/)
- [ ] Experiment with matching more similar vectors
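
The last two ideas can be combined: fetch more candidates than needed, then re-rank them so that near-duplicate chunks do not crowd out other material. The sketch below implements maximal marginal relevance (MMR) in plain Python on toy vectors, so the trade-off is visible without a vector store. The names `cosine` and `mmr_select`, the `lambda_mult` values and the toy vectors are illustrative assumptions and not part of this notebook. LangChain retrievers expose a comparable option via `as_retriever(search_type="mmr")`, which may be the simpler starting point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by similarity to the query;
    lower values penalise candidates that resemble already-selected ones.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings (hypothetical): documents 0 and 1 are near-duplicates,
# document 2 covers something else.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [0.99, 0.05], [0.3, 1.0]]

plain_top2 = mmr_select(query, docs, k=2, lambda_mult=1.0)    # similarity only
diverse_top2 = mmr_select(query, docs, k=2, lambda_mult=0.3)  # favours diversity
print(plain_top2, diverse_top2)
```

With pure similarity ranking both near-duplicates are returned; with a lower `lambda_mult` the second slot goes to the dissimilar document instead.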

# Import modules

In [7]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
import openai
import os
from dotenv import load_dotenv
import tqdm as notebook_tqdm
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser

# Define initial variables and OpenAI API key

In [8]:
load_dotenv()
# Get your key: https://platform.openai.com/account/api-keys
openai.api_key = os.getenv("OPENAI_API_KEY")
# __file__ is not defined inside a notebook, so resolve paths from the working directory
current_directory = os.getcwd()
chroma_db = os.path.join(current_directory, "knowledge", "chroma_db")

# Load vector DB

In [9]:
import chromadb

persistent_client = chromadb.PersistentClient(path=chroma_db)

# Define embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")

db = Chroma(
    client=persistent_client,
    collection_name="groups_collection",
    embedding_function=embedding_function,
)
db.get()

{'ids': ['d10a0c46-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0d2c-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0d86-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0dc2-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0dfe-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0e30-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0e6c-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0ea8-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0eda-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0f16-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0f52-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0f84-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0fb6-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a0ff2-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a102e-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a106a-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a109c-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a10d8-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a110a-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a113c-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a1178-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a11b4-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a11dc-cbfc-11ee-9ff4-7085c27bb25b',
  'd10a1218-cbfc-11ee-9ff4-

# Query ETDA listed threat actor reports and information


## Setup retriever, prompt and llm 
Retrieve the five most similar chunks with the retriever and let an OpenAI GPT model answer from them at temperature 0 (minimal randomness), using a specialized LangChain RAG prompt pulled from the LangChain Hub.

In [30]:
retriever = db.as_retriever(search_kwargs={"k":5})
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=openai.api_key)
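
Before handing chunks to the LLM, it can help to look at what the retriever would return and how close each match is. The helper below is a minimal sketch: `preview_hits` is a hypothetical name, and it assumes the store exposes LangChain's `similarity_search_with_score`, as the Chroma wrapper does. With Chroma the score is typically a distance, so lower values indicate closer matches.

```python
def preview_hits(store, query, k=5, width=80):
    """Return (score, snippet) rows for the top-k chunks matching `query`.

    `store` is assumed to expose `similarity_search_with_score(query, k=...)`,
    which yields (Document, score) pairs.
    """
    rows = []
    for doc, score in store.similarity_search_with_score(query, k=k):
        # Collapse whitespace so each chunk fits on one line, then truncate
        snippet = " ".join(doc.page_content.split())[:width]
        rows.append((round(float(score), 3), snippet))
    return rows

# In this notebook it could be called as (assumes `db` from the cell above):
# for score, snippet in preview_hits(db, "What vulnerabilities has cl0p exploited?"):
#     print(score, snippet)
```

Comparing the scores for two phrasings of the same question gives a quick indication of whether a poor answer comes from retrieval or from the LLM.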

## Setup chain that also reports on sources found
https://python.langchain.com/docs/use_cases/question_answering/sources

In [31]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
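
The chain returns a dictionary with the retrieved documents under `context`, the original `question`, and the generated `answer`. Because the raw output is long, a small helper can pull out the answer and the list of sources. This is a sketch only: `summarize_result` is a hypothetical name, and the `source` metadata key is an assumption that depends on how the vector database was built.

```python
def summarize_result(result, source_key="source"):
    """Split a chain result into the answer and a de-duplicated source list.

    `result` is expected to be the dict returned by
    rag_chain_with_source.invoke(...), with "answer" and "context" keys.
    `source_key` is an assumed metadata field; use whatever key was stored
    when the chunks were indexed.
    """
    sources = []
    for doc in result.get("context", []):
        source = doc.metadata.get(source_key, "unknown")
        if source not in sources:  # keep first-seen order, drop repeats
            sources.append(source)
    return {"answer": result.get("answer", ""), "sources": sources}

# Possible usage in this notebook:
# summary = summarize_result(rag_chain_with_source.invoke("What vulnerabilities has cl0p exploited?"))
# print(summary["answer"])
# print(summary["sources"])
```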

# Example query with successful and complete answer (see LLM answer at bottom of output)

In [32]:
rag_chain_with_source.invoke("What vulnerabilities has cl0p exploited?")

{'context': [Document(page_content='Cl0p ransomware claims to have attacked Saks Fifth Avenue (BleepingComputer)\nThe threat actor has not yet disclosed any additional information, such as what all data it stole from the luxury brand retailer\'s systems, or details about any ongoing ransom negotiations.\nBleepingComputer has confirmed, however, the cyber security incident is linked to Clop\'s ongoing attacks targeting GoAnywhere servers vulnerable to a security flaw.\nThe flaw, now tracked as CVE-2023-0669, enables attackers to gain remote code execution on unpatched GoAnywhere MFT instances with their administrative console exposed to Internet access.\nGoAnywhere MFT\'s developer Fortra (formerly HelpSystems) had previously disclosed to its customers that the vulnerability had been exploited as a zero-day in the wild and urged customers to patch their systems. The official advisory remains hidden to the public, but was earlier made public by investigative reporter Brian Krebs.\nIn Feb

# Example query with unsuccessful output due to irrelevant chunks found (see LLM answer at bottom of output)

In [35]:
rag_chain_with_source.invoke("Describe vulnerabilities that threat actor cl0p has exploited throughout various campaigns.")

{'context': [Document(page_content='Reported hacking operations\n\n2007: Hupigon and Pirpi Backdoors\nhttps://www.fireeye.com/blog/threat-research/2010/11/ie-0-day-hupigon-joins-the-party.html\n\n2014-04: Operation “Clandestine Fox”\nFireEye Research Labs identified a new Internet Explorer (IE) zero-day exploit used in targeted attacks.  The vulnerability affects IE6 through IE11, but the attack is targeting IE9 through IE11.  This zero-day bypasses both ASLR and DEP. Microsoft has assigned CVE-2014-1776 to the vulnerability and released security advisory to track this issue.\nhttps://www.fireeye.com/blog/threat-research/2014/04/new-zero-day-exploit-targeting-internet-explorer-versions-9-through-11-identified-in-targeted-attacks.html\n\n2014-06: Operation “Clandestine Fox”, Part Deux\nWhile Microsoft quickly released a patch to help close the door on future compromises, we have now observed the threat actors behind “Operation Clandestine Fox” shifting their point of attack and using a 
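
To quantify how far the two phrasings above diverge at the retrieval step, the retrieved chunks can be compared directly. The sketch below computes a Jaccard overlap on chunk text; `chunk_overlap` is a hypothetical helper, and the commented usage assumes the `retriever` defined in the setup cell.

```python
def chunk_overlap(docs_a, docs_b):
    """Jaccard overlap between two lists of retrieved documents, by chunk text.

    Returns 1.0 when both queries retrieve exactly the same chunks and 0.0
    when they share none.
    """
    texts_a = {doc.page_content for doc in docs_a}
    texts_b = {doc.page_content for doc in docs_b}
    union = texts_a | texts_b
    return len(texts_a & texts_b) / len(union) if union else 1.0

# Possible usage in this notebook:
# a = retriever.invoke("What vulnerabilities has cl0p exploited?")
# b = retriever.invoke("Describe vulnerabilities that threat actor cl0p has exploited throughout various campaigns.")
# print(chunk_overlap(a, b))
```

A low overlap between near-identical questions points to the retriever, rather than the prompt or the LLM, as the place to experiment first.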