Project Overview
----------------

The goal of this project is to process PDF threat reports and extract key intelligence about threat actors and their activities. The solution uses open-source libraries and frameworks to handle the different formats and rich content found in the reports.

### Target PDF Reports for Extraction:

1.  [Modern Asian APT Groups TTPs Report](https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2023/11/09055246/Modern-Asian-APT-groups-TTPs_report_eng.pdf)
2.  [APT41: A Dual Espionage and Cyber Crime Operation](https://services.google.com/fh/files/misc/apt41-a-dual-espionage-and-cyber-crime-operation.pdf)
3.  [Mandiant Report on APT38](https://www.mandiant.com/sites/default/files/2021-09/rpt-apt38-2018-web_v5-1.pdf)

### Tools and Libraries:

*   **LlamaParse**: Used for its efficiency in parsing complex PDF documents ([GitHub Repository](https://github.com/run-llama/llama_parse)).
*   Other open-source libraries as necessary.

### Project Outputs:

#### Output 1: Design Strategy

*   **Strategy**: Description of the approach to extract and process information.
*   **Handling Different Formats**: Methods for tackling diverse formats and embedded images within the reports.
*   **Vectorization**: If a Retrieval-Augmented Generation (RAG) solution is built, detail the approach for vectorizing the information.
*   **Data Organization**: Strategy for organizing the extracted information, focusing on threat actor behavior.
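
The vectorization and data organization items above can be sketched with the standard library alone. This is a minimal illustration, not the project's chosen design: the `Chunk` record, its `actor` field, and the term-frequency vectors are all assumptions (a real RAG pipeline would normally swap `vectorize` for an embedding model), and the two sample passages are made-up sentences for demonstration only.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One retrievable passage plus the metadata used to organize it (hypothetical schema)."""
    text: str
    source: str                      # report file name
    actor: str = "unknown"           # threat actor label, e.g. "APT41"
    vector: Counter = field(default_factory=Counter)


def vectorize(text):
    """Term-frequency vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(passages):
    """passages: iterable of (text, source, actor) tuples."""
    return [Chunk(text, source, actor, vectorize(text)) for text, source, actor in passages]


def search(index, query, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = vectorize(query)
    return sorted(index, key=lambda c: cosine(q, c.vector), reverse=True)[:top_k]


# Illustrative passages only; they are not quotes from the reports
index = build_index([
    ("APT38 targets banks and abuses SWIFT systems.",
     "rpt-apt38-2018-web_v5-1.pdf", "APT38"),
    ("APT41 conducts espionage and targets the video game industry.",
     "apt41-a-dual-espionage-and-cyber-crime-operation.pdf", "APT41"),
])
best = search(index, "Which group targets banks via SWIFT?")[0]
print(best.actor, best.source)
```

Keeping the actor and source alongside each vector means a retrieved passage can be traced back to a report and grouped by threat actor, which is the organizing principle the strategy calls for.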

#### Output 2: Code Implementation

*   **Task**: Write a script to extract the Sigma rules contained in the first example report and format them as YAML files according to the [SigmaHQ specification](https://github.com/SigmaHQ/sigma-specification).
*   **Frameworks**: Any chosen machine learning model and framework suitable for the task.
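
As a rough sketch of the formatting half of this task, the snippet below checks a rule for the top-level fields the Sigma specification requires (`title`, `logsource`, `detection` with a `condition`) and writes it out as block-style YAML. The helper names and the example rule are hypothetical and not taken from the report. The tiny serializer only handles nested dictionaries and lists of scalars; in practice a YAML library such as PyYAML would be the more robust choice.

```python
import json

# Top-level fields the Sigma rules specification marks as required
REQUIRED_FIELDS = ("title", "logsource", "detection")


def _scalar(value):
    # JSON scalars (quoted strings, numbers, true/false/null) are also valid YAML
    return json.dumps(value, ensure_ascii=False)


def to_yaml_lines(mapping, indent=0):
    """Serialize a dict of dicts, scalar lists, and scalars as block-style YAML lines.

    Keys are written unquoted, so this assumes they are plain-safe strings.
    """
    pad = "  " * indent
    lines = []
    for key, value in mapping.items():
        if isinstance(value, dict) and value:
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 1))
        elif isinstance(value, list) and value:
            lines.append(f"{pad}{key}:")
            lines.extend(f"{pad}  - {_scalar(item)}" for item in value)
        else:
            lines.append(f"{pad}{key}: {_scalar(value)}")
    return lines


def sigma_rule_to_yaml(rule):
    """Validate the required fields, then return the rule as YAML text."""
    missing = [name for name in REQUIRED_FIELDS if name not in rule]
    if missing:
        raise ValueError(f"missing required Sigma fields: {missing}")
    if "condition" not in rule["detection"]:
        raise ValueError("detection must contain a condition")
    return "\n".join(to_yaml_lines(rule)) + "\n"


# Hypothetical rule for illustration; not extracted from any report
example_rule = {
    "title": "Example: whoami execution",
    "status": "experimental",
    "logsource": {"category": "process_creation", "product": "windows"},
    "detection": {
        "selection": {"Image|endswith": "\\whoami.exe"},
        "condition": "selection",
    },
    "tags": ["attack.discovery", "attack.t1033"],
    "level": "low",
}
yaml_text = sigma_rule_to_yaml(example_rule)
print(yaml_text)
```

Validating before serializing keeps incomplete extractions (for example, a rule whose `detection` block was cut off at a page break) from being written out as seemingly valid files.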

This project aims to streamline the analysis of threat intelligence, making it more accessible and actionable for cybersecurity professionals.

In [15]:
from collections import Counter
import re  # Regular expression library

from tqdm import tqdm  # Progress bar

# !pip install PyMuPDF
import fitz  # PyMuPDF

# !pip install spacy
# !python -m spacy download en_core_web_sm
import spacy

from transformers import pipeline  # Hugging Face pipelines (used for summarization)

`PyMuPDF` (imported as `fitz`) is a fast Python library for working with PDF, XPS, and e-book documents. It covers basic document handling such as opening and reading files, extraction of text, images, and other content, and modification of PDF files.

In [2]:

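def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, concatenated in page order."""
    # The context manager closes the document even if extraction raises
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)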
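
Raw page text usually carries layout residue along with the content; the APT41 summary further down, for instance, is dominated by table-of-contents dot leaders. A light cleanup pass before any NLP step is one way to reduce that noise. The sketch below is an assumption about what such a pass could look like: the function name and both patterns are hypothetical, and dropping number-only lines can also discard legitimate numeric table cells.

```python
import re

# Runs of four or more dots, optionally followed by a page number (table-of-contents leaders)
DOT_LEADER = re.compile(r"\.{4,}\s*\d*")
# Lines that contain nothing but a short number (likely page numbers)
PAGE_NUMBER_LINE = re.compile(r"^\s*\d{1,4}\s*$")


def clean_extracted_text(text):
    """Remove dot leaders and bare page-number lines, and collapse repeated spaces."""
    cleaned = []
    for line in text.splitlines():
        if PAGE_NUMBER_LINE.match(line):
            continue
        line = DOT_LEADER.sub(" ", line)
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Made-up sample that mimics the residue seen in extracted report text
sample = "Attribution.................30\n12\nAPT41 targets  the video game industry."
print(clean_extracted_text(sample))
```

The function takes and returns a plain string, so it can be applied to the result of `extract_text_from_pdf` without changing any other cell.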


In [11]:
# Load the spaCy language model once; it is reused for every report
nlp = spacy.load("en_core_web_sm")

def extract_text_stats(pdf_path):
    """Return (full_text, filtered_word_total, word_frequencies) for one PDF."""
    # Reuse the extraction helper instead of duplicating the page loop
    full_text = extract_text_from_pdf(pdf_path)

    # Process the text with spaCy (named spacy_doc to keep it distinct from the PDF document)
    spacy_doc = nlp(full_text)

    # Lemmatize and lowercase each content token, then strip non-word characters
    words = [
        re.sub(r'\W+', '', token.lemma_.lower())
        for token in spacy_doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    words = [word for word in words if word]  # Drop tokens that became empty

    word_count = Counter(words)

    return full_text, len(words), word_count

def print_most_common(word_count, num=10):
    """Print the num most frequent words with their counts."""
    for word, freq in word_count.most_common(num):
        print(f"{word}: {freq}")

# List of PDF files
pdf_files = [
    'Modern-Asian-APT-groups-TTPs_report_eng.pdf',
    'apt41-a-dual-espionage-and-cyber-crime-operation.pdf',
    'rpt-apt38-2018-web_v5-1.pdf'
]

# Processing each file
for pdf_file in pdf_files:
    _, total_words, word_count = extract_text_stats(pdf_file)
    print(f"\nStats for {pdf_file}:")
    print(f"Total words (post-filtering): {total_words}")
    print("Most common words:")
    print_most_common(word_count)



Stats for Modern-Asian-APT-groups-TTPs_report_eng.pdf:
Total words (post-filtering): 50163
Most common words:
apt: 656
asian: 610
system: 535
process: 471
sigma: 443
file: 434
service: 430
attacker: 423
groups: 402
group: 398

Stats for apt41-a-dual-espionage-and-cyber-crime-operation.pdf:
Total words (post-filtering): 11963
Most common words:
apt41: 296
espionage: 128
operation: 126
malware: 100
target: 99
file: 96
group: 91
game: 90
cyber: 88
report: 86

Stats for rpt-apt38-2018-web_v5-1.pdf:
Total words (post-filtering): 6376
Most common words:
apt38: 207
north: 101
target: 74
korean: 70
system: 69
malware: 69
operation: 68
swift: 66
bank: 59
activity: 56


In [None]:
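# Load the summarization pipeline (no model is named, so transformers falls back to its default)
summarizer = pipeline("summarization")

def summarize_text_in_chunks(text, max_chunk_size=1024):
    """Summarize text chunk by chunk; max_chunk_size is measured in characters, not tokens."""
    # Divide the text into chunks of whole words
    chunks = []
    current_chunk = []
    current_length = 0  # Length of ' '.join(current_chunk), tracked without re-joining

    for word in text.split():
        current_length += len(word) + (1 if current_chunk else 0)  # +1 for the joining space
        current_chunk.append(word)
        # A chunk is closed once it passes the limit, so it can overshoot by up to one word
        if current_length > max_chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0

    # Add the last chunk if any
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    # Summarize each chunk; a chunk that fails is reported and left out of the summary
    summaries = []
    for chunk in tqdm(chunks):
        try:
            result = summarizer(chunk, max_length=150, min_length=50, do_sample=False)
            summaries.append(result[0]['summary_text'])
        except Exception as e:
            print("Error during summarization:", e)

    # Combine all summaries into one
    full_summary = ' '.join(summaries)
    return full_summary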

# List of PDF files
pdf_files = [
    'Modern-Asian-APT-groups-TTPs_report_eng.pdf',
    'apt41-a-dual-espionage-and-cyber-crime-operation.pdf',
    'rpt-apt38-2018-web_v5-1.pdf'
]

# Process and summarize each file with a progress bar
for pdf_file in pdf_files:
    print('processing '+pdf_file+'...')
    text = extract_text_from_pdf(pdf_file)
    summary = summarize_text_in_chunks(text)
    print(f"\nComprehensive Summary of {pdf_file}:")
    print(summary)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


processing Modern-Asian-APT-groups-TTPs_report_eng.pdf...


Your max_length is set to 150, but your input_length is only 113. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=56)
100%|█████████████████████████████████████████| 534/534 [25:37<00:00,  2.88s/it]



Comprehensive Summary of Modern-Asian-APT-groups-TTPs_report_eng.pdf:
 Kaspersky is constantly tracking thousands of malicious actors all over the world, including highly advanced groups that are capable of conducting sophisticated cyberattacks . These formidable groups are globally recognized as Advanced Persistent Threats (APT) Asia APT groups include Russia and Belarus, Indonesia, Malaysia and Pakistan .  Asian APT groups attacked the greatest number of countries and industries . Analysis of hundreds of attacks revealed a similar pattern among various groups . They achieve specific objectives at various stages of the Cyber Kill Chain using a common but limited number of techniques encountered by security professionals all over the world .  Report: Modern Asian APT groups: Tactics, Techniques and Procedures . It is not our goal to attribute a particular group to a specific country in Asia . Our goal is to provide the most extensive information on the approaches taken by APT actors, 

  0%|                                                   | 0/124 [00:00<?, ?it/s]
  1%|▎                                          | 1/124 [00:02<04:51,  2.37s/it]
  2%|▋                                          | 2/124 [00:05<06:08,  3.02s/it]


Comprehensive Summary of apt41-a-dual-espionage-and-cyber-crime-operation.pdf:
 APT41, A DUAL ESPIONAGE AND CYBER CRIME OPERATION REPORT REPORT, is a dual Espionage and Cyber Crime Operation report . The report was published in December 2014 and is entitled MANDIANT APT 41, A Dual Espionage & Cyber Crime Report .  July 2017....................................................................................24 June 2018...................................................................................25 July 2018...........................................................................................26 Overlaps Between Espionage and Financial Operations.............27 Attribution.................................................................................30 Status as Potential Contractors..............................................33 Links to Other Known Chinese Espionage Operators.................................................33 Links .  Technical Annex: Attack Lifecycle.....

  3%|█▎                                          | 2/68 [00:04<02:45,  2.50s/it]
  4%|█▉                                          | 3/68 [00:07<02:48,  2.59s/it]
 90%|██████████████████████████████████████▌    | 61/68 [02:56<00:20,  2.96s/it]
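
The repeated `max_length` warnings in the output above come from short chunks being summarized with a fixed `max_length=150`, and `summarize_text_in_chunks` cuts chunks in the middle of sentences. One possible refinement, sketched here under stated assumptions, is to pack whole sentences into chunks and scale the length bounds to each chunk. Both helper names are hypothetical, the regex sentence splitter is deliberately naive, and word counts are only a rough proxy for the token counts the model actually uses.

```python
import re


def chunk_by_sentences(text, max_words=400):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk before it would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summary_length_bounds(chunk, ratio=0.5, floor=10, ceiling=150):
    """Return (min_length, max_length) scaled to the chunk's word count."""
    n_words = len(chunk.split())
    max_length = max(floor, min(ceiling, int(n_words * ratio)))
    min_length = min(50, max_length // 2)
    return min_length, max_length


text = "One two three. Four five six. Seven eight nine ten."
print(chunk_by_sentences(text, max_words=6))
print(summary_length_bounds("word " * 100))
```

Inside the summarization loop, the bounds would be passed as `min_length` and `max_length` in place of the fixed 50 and 150. For long chunks they return those same original values, so only short chunks are treated differently.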