# Threat Actor Report Summarizer
This notebook summarizes all the threat actor reports we have scraped and combines the summaries into one file per threat actor. We can leverage these summaries to improve LLM use cases, such as improving RAG performance through MultiVectorRetrievers and avoiding token limits when passing this information to online LLMs for other purposes. It uses OpenAI GPT-3.5 for summarization and builds on the threat actor knowledge builder notebook.

References:
- https://blog.openthreatresearch.com/demystifying-generative-ai-a-security-researchers-notes/
- https://github.com/OTRF/GenAI-Security-Adventures
- https://python.langchain.com/docs/get_started/introduction

# Improvement ideas
- [ ] Use local LLM
- [ ] Summarize using template prompt such as [fabric](https://github.com/danielmiessler/fabric)
- [ ] Create MultiVectorRetriever

# Import modules

In [9]:
import os
import openai
import glob
import time
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.documents import Document
from jinja2 import Template

# Define initial variables and OpenAI API key

In [3]:
load_dotenv()
# Get your key: https://platform.openai.com/account/api-keys
openai.api_key = os.getenv("OPENAI_API_KEY")

# "__file__" is a string literal here (notebooks do not define __file__), so
# dirname() returns "" and every path below is relative to the working directory.
current_directory = os.path.dirname("__file__")
documents_directory = os.path.join(current_directory, "documents")
reports_directory = os.path.join(documents_directory, "reports")
summaries_directory = os.path.join(documents_directory, "summaries")
contrib_directory = os.path.join(current_directory, "contrib")
embeddings_directory = os.path.join(current_directory, "embeddings")
templates_directory = os.path.join(current_directory, "templates")
document_template = os.path.join(templates_directory, "document.md")
group_template = os.path.join(templates_directory, "group.md")
summary_template = os.path.join(templates_directory, "summary.md")
# One folder per threat actor, each holding that actor's markdown reports.
group_folders = glob.glob(os.path.join(reports_directory, "*"))
report_files = glob.glob(os.path.join(reports_directory, "*", "*.md"))

# Load threat actor reports

In [3]:
md_docs = []
print("[+] Loading group markdown files..")
for group in group_folders:
    print(f' [*] Loading {os.path.basename(group)}')
    report_files = glob.glob(os.path.join(group, "*.md"))
    reports = []

    for report in report_files:
        print(f' [*] Loading {os.path.basename(report)}')
        loader = UnstructuredMarkdownLoader(report)
        reports.extend(loader.load())

    # One entry per threat actor, holding all of its loaded reports.
    md_docs.append(reports)

# md_docs holds one list per group, so this counts groups, not individual reports.
print(f'[+] Number of threat actor groups processed: {len(md_docs)}')

[+] Loading group markdown files..
 [*] Loading 8220 Gang
 [*] Loading 8220_Gang.md
 [*] Loading 8220_Gang_Cloud_Botnet_Targets_Misconfigured_Cloud_Workloads_-_SentinelOne.md
 [*] Loading 8220_Gang_Evolves_With_New_Strategies.md
 [*] Loading 8220_Gangs_Recent_use_of_Custom_Miner_and_Botnet.md
 [*] Loading From_the_Front_Lines_|_8220_Gang_Massively_Expands_Cloud_Botnet_to_30,000_Infected_Hosts_-_SentinelOne.md
 [*] Loading Imperva_Detects_Undocumented_8220_Gang_Activities_|_Imperva.md
 [*] Loading Radware_Page.md
 [*] Loading Achilles
 [*] Loading Achilles.md
 [*] Loading Another_Hacker_Selling_Access_to_Charity,_Antivirus_Firm_Networks.md
 [*] Loading Iranian_hackers_suspected_in_cyber_breach_and_extortion_attempt_on_Navy_shipbuilder_Austal_-_ABC_News.md
 [*] Loading AeroBlade
 [*] Loading AeroBlade.md
 [*] Loading AeroBlade_on_the_Hunt_Targeting_the_U.S._Aerospace_Industry.md
 [*] Loading Aggah
 [*] Loading APT_or_not_APT?_What's_Behind_the_Aggah_Campaign_-_Yoroi.md
 [*] Loading Aggah

In [4]:
print(len(md_docs))

465


# Setup summarization prompt template
We focus on creating short summaries that include the information our threat actor assessment model needs. When a report is not an actual threat actor activity report, the model is instructed to output that the report is malformed.

In [7]:
prompt = PromptTemplate.from_template("""
TASK
Provide a summary of the following activity/information report about a threat actor. It should not be longer than ten sentences. Make sure to include the region, operating sector and type of company of victims that were targeted if mentioned. Also focus on the evidence of the capability of the threat actor and the novelty of the tools and techniques used by the threat actor. Include the date and/or operation time window if included. Be concise and dry, as when writing an intelligence briefing. Focus on providing facts without bias. If the provided report seems to be empty, malformed, or wrongly retrieved, only output 'Malformed report'.

##################

REPORT
{context}
"""
)
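
Because the prompt asks the model to answer with only a fixed marker for unusable reports, those answers can be filtered out before the summaries are written to disk. The sketch below is one possible way to do that. The helper names `is_malformed` and `keep_valid_summaries` are hypothetical and not part of the original notebook, and the check is deliberately loose (any answer that starts with "malformed"), which assumes a genuine summary never starts with that word.

```python
def is_malformed(summary: str) -> bool:
    """Return True when the model answered with the malformed-report marker."""
    # Ignore surrounding whitespace, leading quotes, and letter case.
    normalized = summary.strip().lstrip("'\"").lower()
    return normalized.startswith("malformed")


def keep_valid_summaries(summaries: list[str]) -> list[str]:
    """Drop marker answers so they do not end up in the summary file."""
    return [summary for summary in summaries if not is_malformed(summary)]


example = ["Malformed report", "AeroBlade targeted a U.S. aerospace organization."]
print(keep_valid_summaries(example))
```

Applied to this notebook, the list of summaries collected for a threat actor could be passed through `keep_valid_summaries` before the template is rendered.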

# Setup LLM and chain

In [8]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=openai.api_key)
chain = prompt | llm

# Summarize reports per threat actor
Create one markdown file per threat actor containing the summaries of all its reports. We leverage OpenAI GPT-3.5, which can run into tokens-per-minute rate limits, so we sleep and retry up to 5 times when we catch an error. Some very long reports keep failing and are not summarized. The run takes a while, so it can resume where it left off after a failure or a manual stop: each threat actor's markdown file is written as soon as all of its reports are summarized, so the presence of that file tells us which threat actors are already done.
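
The sleep-and-retry behavior described above can also be factored into a small helper. The sketch below is a variation, not the notebook's own method: it swaps the fixed 3-second sleep for exponential backoff with jitter. The name `invoke_with_backoff` and all of its parameters are hypothetical, and it assumes every exception is worth retrying.

```python
import random
import time


def invoke_with_backoff(call, max_attempts=5, base_delay=3.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `max_attempts` attempts have failed.

    Returns the result of `call()`, or None when every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as error:  # assumption: any error is treated as retryable
            if attempt == max_attempts - 1:
                break  # no point in sleeping after the final attempt
            # The delay doubles on each attempt (3s, 6s, 12s, ...) plus up to 1s of jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"[!] Attempt {attempt + 1} failed ({error}), sleeping {delay:.1f}s..")
            sleep(delay)
    return None
```

In this notebook the wrapped call would be something like `lambda: chain.invoke({"context": article.page_content}).content`, with a `None` result meaning the article is skipped.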

In [10]:
if not os.path.exists(summaries_directory):
    print("[+] Creating summaries directory..")
    os.makedirs(summaries_directory)

# Load the Jinja2 template once and reuse it for every group.
with open(summary_template, encoding='utf-8') as template_file:
    markdown_summary_template = Template(template_file.read())

# md_docs was built by iterating group_folders, so the two lists line up.
for group_folder, group_docs in zip(group_folders, md_docs):
    print(f'[+] Loading {group_folder}')

    group_name = os.path.basename(group_folder)
    summary_path = os.path.join(summaries_directory, f'{group_name}_summary.md')
    # Resume support: a summary file on disk means this group is already done.
    if os.path.exists(summary_path):
        print(" [+] Summary already present, skipping..")
        continue

    group_summaries = []
    for article_num, article in enumerate(group_docs):
        for attempt in range(5):
            try:
                print(f' [+] Loading article: {article_num+1}')
                group_summaries.append(chain.invoke({"context": article.page_content}).content)
            except Exception as error:
                # Usually a rate or token limit error; Exception (not a bare
                # except) keeps Ctrl+C able to stop the run.
                print(f"[+] Request failed ({error}), sleeping for 3 seconds..")
                time.sleep(3)
            else:
                break
        else:
            print("[+] Article is exceeding token limits, skipping..")

    print("[+] Creating markdown files with all summaries of threat actor..")
    markdown_summary = markdown_summary_template.render(summaries=group_summaries)
    with open(summary_path, encoding='utf-8', mode='w') as summary_file:
        summary_file.write(markdown_summary)

[+] Loading documents/reports/8220 Gang
 [+] Summary already present, skipping..
[+] Loading documents/reports/Achilles
 [+] Summary already present, skipping..
[+] Loading documents/reports/AeroBlade
 [+] Summary already present, skipping..
[+] Loading documents/reports/Aggah
 [+] Summary already present, skipping..
[+] Loading documents/reports/Agrius
 [+] Summary already present, skipping..
[+] Loading documents/reports/Allanite
 [+] Summary already present, skipping..
[+] Loading documents/reports/ALPHV, BlackCat Gang
 [+] Summary already present, skipping..
[+] Loading documents/reports/ALTDOS
 [+] Summary already present, skipping..
[+] Loading documents/reports/Anchor Panda, APT 14
 [+] Summary already present, skipping..
[+] Loading documents/reports/Andariel, Silent Chollima
 [+] Summary already present, skipping..
[+] Loading documents/reports/Andromeda Spider
 [+] Summary already present, skipping..
[+] Loading documents/reports/Antlion
 [+] Summary already present, skipping
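
For the very long reports that keep failing, one option is to split the text into smaller pieces before summarizing. The sketch below is an assumption about how that could look, not something the notebook currently does. `split_into_chunks` and its `max_chars` default are hypothetical, and the character budget is only a rough stand-in for a real token count (a common rule of thumb is about four characters per English token).

```python
def split_into_chunks(text: str, max_chars: int = 12000) -> list[str]:
    """Split `text` on blank lines into chunks of at most `max_chars` characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # A single oversized paragraph is cut hard into max_chars slices.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        if not paragraph:
            continue  # skip empty paragraphs left by repeated blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars:
            # The paragraph does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be summarized separately and the partial summaries combined, at the cost of extra API calls per report.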

In [66]:
group_summaries[2]

[AIMessage(content="Summary:\nThe threat actor known as AeroBlade, identified by BlackBerry, targeted an aerospace organization in the United States for commercial and competitive cyber espionage. The actor utilized spear-phishing with weaponized documents containing remote template injection and malicious VBA macro code. Evidence indicates the attacker's network infrastructure became operational in September 2022, with the offensive phase occurring in July 2023. The attacker improved its toolset for stealthiness while maintaining the same network infrastructure. The goal of the attack was assessed to be commercial cyber espionage with medium to high confidence. The sponsor type was information theft and espionage, with the victims operating in the aerospace sector in the USA. The threat actor's country of origin remains unknown. The report was first seen in 2022, with the source last modified in 2024-01-16."),
 AIMessage(content="Summary:\nA threat actor named AeroBlade targeted an ae