# **GenAI-Powered STIX 2.1 Generator**

**Notebook Version:** 13.0  
**Author:** Antonio Formato  
**Python Version:** >= 3.8  
**Key Libraries:** `stix2`, `openai`, `iocextract`, `MarkItDown`

---

## **Objective**

This notebook automates the conversion of unstructured Cyber Threat Intelligence (CTI) reports into structured, machine-readable STIX 2.1 bundles. By leveraging a Large Language Model (LLM) for entity and relationship extraction, it streamlines the intelligence lifecycle, enabling faster integration with security platforms like TIPs, SIEMs, and SOARs.

## **Workflow Overview**

1. **Setup**: Configure the environment.
2. **Data Ingestion**: Load the raw CTI report text from pasted input, a web page, or a PDF file.
3. **Advanced IOC Extraction**: Use a hybrid regex and LLM approach to identify and validate all Indicators of Compromise (IOCs).
4. **Programmatic Object Generation** (**SCOs & Indicators**): Convert the validated IOCs into STIX Cyber Observable Objects (SCOs) and create corresponding `Indicator` objects with `derived-from` relationships.
5. **Comprehensive Entity Extraction** (**SDOs**): Employ an LLM to parse the entire report for high-level STIX Domain Objects (SDOs) like `Malware`, `Attack Pattern`, and `Identity`.
6. **Final Bundling**: Assemble all extracted STIX objects (SDOs, SCOs, SROs) into a single, cohesive, and contextually rich STIX 2.1 Bundle and save it to a JSON file.
7. **Populating GitHub repo**: Automatically populate a public GitHub repository with the generated STIX bundles.
8. **STIX Viewer**: Visualize the generated STIX 2.1 bundle.

## **Part 1**: Setup

This initial block handles the setup required by the rest of the notebook.

In particular, it sets up the Azure OpenAI client using credentials stored in Colab secrets.

It also defines global variables that control the LLM's behavior:
*   **Temperature**: Controls how creative or predictable the model's responses are. A low temperature (0.1) makes the output more predictable and closer to deterministic, while a high temperature (1.0) makes it more creative.
*   **Reasoning**: Controls the amount of "thinking" or analysis the model devotes to a request before generating a response. It accepts four values: minimal, low, medium, or high. With a low value, the model responds faster but may be less accurate, less contextual, or unable to follow complex instructions. With a high value, the model takes longer to analyze the request and evaluate different "chains of thought", and it tends to produce a more accurate, detailed response that follows the instructions more faithfully.
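
As a minimal sketch, the two settings could be bundled into keyword arguments for a chat-completion request as shown below. The helper name `build_completion_kwargs` and the deployment name are hypothetical; the `temperature` and `reasoning_effort` keys mirror the parameters this notebook passes to `client.chat.completions.create` in later parts. Which combinations a given deployment accepts depends on the model, so the validation here is an assumption.

```python
from typing import Any, Dict

# Values listed in the description of the reasoning setting above
ALLOWED_REASONING = ("minimal", "low", "medium", "high")

def build_completion_kwargs(deployment: str, prompt: str,
                            temperature: float, reasoning_effort: str) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for a chat-completion call."""
    if reasoning_effort not in ALLOWED_REASONING:
        raise ValueError(f"reasoning_effort must be one of {ALLOWED_REASONING}")
    return {
        "model": deployment,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "reasoning_effort": reasoning_effort,
    }

# "my-deployment" is a placeholder deployment name
kwargs = build_completion_kwargs("my-deployment", "Summarize this report.", 0.1, "low")
print(kwargs["temperature"], kwargs["reasoning_effort"])
```

With a configured client, the dictionary could then be unpacked as `client.chat.completions.create(**kwargs)`.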



In [None]:
# Initialization
!pip install openai

In [None]:
from openai import AzureOpenAI
from google.colab import userdata

# Configuration: Set up the Azure OpenAI client
try:
    client = AzureOpenAI(
        azure_endpoint=userdata.get('AZURE_OPENAI_ENDPOINT'),
        api_key=userdata.get('AZURE_OPENAI_KEY'),
        api_version="2024-12-01-preview"
    )
    DEPLOYMENT_NAME = userdata.get('DEPLOYMENT_NAME')
    print("✅ Azure OpenAI client configured successfully.")

    # --- GLOBAL CONFIGURATION OF MODELS ---
    # Change these values to control all API calls
    # The temperature can be a value between 0.1 and 1, lower temperature for more predictable, structured output.
    # Reasoning can be set to: minimal, low, medium, or high values.

    # Part 3 (IOC extraction)
    TEMP_IOC_EXTRACTION = 1
    REASONING_IOC_EXTRACTION = "high"

    # Part 5 (SDO extraction)
    TEMP_SDO_EXTRACTION = 1
    REASONING_SDO_EXTRACTION = "high"

    print("✅ Global configuration variables loaded.")
    # --- END OF CONFIGURATION ---

except Exception as e:
    print(f"❌ Error configuring Azure OpenAI client: {e}")
    print("Please ensure AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY, and DEPLOYMENT_NAME are set in Colab secrets.")
    client = None

## **Part 2**: Data Ingestion

These blocks provide three alternative ways to load a report into the notebook:

*   **Raw input**: paste the text of a CTI report directly into the notebook;
*   **Web Scraper**: enter the URL of a blog post or writeup, and its content is converted to text;
*   **PDF to Markdown converter**: load a PDF file from a Google Drive folder and convert it to Markdown for use as input.

### **Raw input**

This block allows you to insert the raw text of a CTI report directly into the notebook for analysis.

Once you have entered all the text, type "END" (in any letter case) on a new line and press Enter.
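
The sentinel logic can be sketched independently of `input()`, which makes it easy to try on a list of lines. The function name `read_until_sentinel` and the sample lines are illustrative assumptions, not part of the notebook.

```python
from typing import Iterable

def read_until_sentinel(lines: Iterable[str], sentinel: str = "END") -> str:
    """Collect lines until one equals the sentinel (case-insensitive, ignoring whitespace)."""
    collected = []
    for line in lines:
        if line.strip().upper() == sentinel.upper():
            break
        collected.append(line)
    return "\n".join(collected)

# Sample input: everything after the sentinel line is ignored
sample = ["Threat report, line 1", "C2 server: 203.0.113.5", "end", "ignored line"]
report = read_until_sentinel(sample)
print(report)
```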

In [None]:
# Data Ingestion: Paste text directly via interactive input
print("\nPaste your report content below. When you are done, type 'END' on a new line and press Enter.")
lines = []
while True:
    try:
        line = input()
        if line.strip().upper() == 'END':
            break
        lines.append(line)
    except EOFError:
        break
text = "\n".join(lines)

if text and text.strip():
    print(f"\n✅ Successfully loaded {len(text)} characters from pasted text.")
else:
    print("\n⚠️ No text was pasted or an error occurred.")
    text = None

### **Web Scraper**

This section implements a web scraper that extracts the text content from a specified blog post URL (for example, an infosec writeup).

It uses the `requests` library to fetch the page and BeautifulSoup (`bs4`) to parse it, and it extracts text only after checking that the HTTP request succeeded.
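
Plain text extraction from HTML usually also picks up the contents of `<script>` and `<style>` elements, which add noise to the report text. The sketch below shows one way to skip them. It deliberately uses the standard-library `html.parser` module as a stand-in for BeautifulSoup so that it runs without extra dependencies; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text nodes while ignoring script, style, and noscript content."""
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

page = ("<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
        "<body><h1>Report</h1><p>C2 at 203.0.113.5</p></body></html>")
print(html_to_text(page))
```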

In [None]:
# Initialization
!pip install requests
!pip install beautifulsoup4

In [None]:
# Web scraper
import requests
from bs4 import BeautifulSoup

def scrape_text(url):
    # Send a browser-like User-Agent; many sites reject the default one
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}

    # Send a GET request to the URL (the timeout avoids hanging on unresponsive hosts)
    response = requests.get(url, headers=headers, timeout=30)

    # A status code of 200 means the request succeeded
    if response.status_code == 200:
        # Parse the HTML and return only its text content
        soup = BeautifulSoup(response.content, "html.parser")
        return soup.get_text()

    # Return None so that later cells can detect the failure with `if text`
    print(f"Failed to scrape the website (HTTP {response.status_code})")
    return None

# Enter site
print("Enter blog post writeup URL and press Enter")
url = input()

text = scrape_text(url)

### **PDF to Markdown Converter**

This section implements a PDF to Markdown converter.

The conversion is performed with the MarkItDown library. Enter the name of one of the files in the `PDF_Reports` folder.

The conversion result is saved in the `Markdown_Reports` folder.

**Note**: the first run requires connecting to Google Drive.

In [None]:
# install MarkItDown
!pip install 'markitdown[all]'

In [None]:
# Importing the necessary libraries
import os
from google.colab import drive
# Import the MarkItDown class from the library
from markitdown import MarkItDown

# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

def convert_pdf_to_markdown(filename, folder_path):
    """
    Converts a PDF file specified from Google Drive to Markdown format using
    the MarkItDown class.

    Args:
        filename (str): The name of the PDF file to be converted (e.g., "report.pdf").
        folder_path (str): The path to the folder on Google Drive that contains the PDFs.

    Returns:
        str: The content of the PDF converted to Markdown, or an empty string in case of error.
    """
    # Function that combines a folder path and a file name
    full_pdf_path = os.path.join(folder_path, filename)

    # Verify that the file exists in the specified path
    if not os.path.exists(full_pdf_path):
        print(f"ERROR: The file '{filename}' was not found in the path '{folder_path}'.")
        print("Check the file name and make sure it is in the correct folder.")
        return ""

    # Proceed with the conversion
    print(f"File '{filename}' found. I'm starting the conversion to Markdown...")
    try:
        # 1. Create an instance of the MarkItDown class
        md_converter = MarkItDown(enable_plugins=False)

        # 2. Call the .convert() method on the created object
        result = md_converter.convert(full_pdf_path)

        print("Conversion successfully completed!")

        # 3. Return the textual content of the result
        return result.text_content

    except Exception as e:
        print(f"An error occurred while converting the PDF: {e}")
        return ""

# Define the folder path on Drive
# MAKE SURE YOU HAVE A "Reports/PDF_Reports" FOLDER IN YOUR DRIVE
drive_folder = '/content/drive/MyDrive/Reports/PDF_Reports/'

# Ask the user to enter the name of the PDF file (with the extension .pdf)
pdf_input_name = input(f"Enter the name of the PDF file (e.g., test.pdf) located in '{drive_folder}': ")

# Call the function and save the result in the variable 'text'
text = convert_pdf_to_markdown(pdf_input_name, drive_folder)

# Check the result, print the preview, and save the .md file.
if text:
    print("\n--- Preview of the extracted text (first 500 characters) ---")
    print(text[:500] + "...")

    # Define the destination folder path for Markdown files
    markdown_folder = '/content/drive/MyDrive/Reports/Markdown_Reports/'

    # Create the folder if it doesn't exist
    os.makedirs(markdown_folder, exist_ok=True)

    # Create the name for the .md file based on the name of the original PDF.
    base_filename = os.path.splitext(pdf_input_name)[0]
    markdown_filename = f"{base_filename}.md"

    # Create the full path by combining the folder and file name.
    full_save_path = os.path.join(markdown_folder, markdown_filename)

    # Write the contents of the 'text' variable to the new file
    with open(full_save_path, 'w', encoding='utf-8') as file:
        file.write(text)

    print("\n--- Saving Markdown File ---")
    print(f"Markdown file successfully saved in: {full_save_path}")
else:
    print("\nThe variable 'text' is empty due to a previous error.")

## **Part 3**: Advanced IOC Extraction

This block implements a robust, three-stage pipeline to identify, validate, and score Indicators of Compromise (IOCs) from the source text.

* **Stage 1** (**Regex Triage**): Performs a quick first pass using `iocextract` to find common IOC patterns like IPs and hashes.
* **Stage 2** (**LLM Analysis**): Uses the LLM with a specific function-calling schema to perform a deep contextual analysis, validating that the candidates are genuinely malicious and extracting their associated names and descriptions.
* **Stage 3** (**Consolidation**): Merges the results from the first two stages, using the regex findings to increase the confidence score of the LLM-validated IOCs.
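
The interplay between the stages can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the regexes below stand in for `iocextract`, the LLM output is a hard-coded list, and the function names are hypothetical. As in the notebook's Stage 3, a regex hit only raises the confidence of an IOC the LLM already returned; regex-only candidates are not added to the result.

```python
import re
from typing import Dict, List

# Simplified stand-ins for the iocextract patterns (assumption)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")
HASH_TYPES = {"md5", "sha1", "sha256"}

def refang(text: str) -> str:
    """Undo common defanging such as 10.0.0[.]1 or hxxp://."""
    return text.replace("[.]", ".").replace("hxxp", "http")

def regex_triage(text: str) -> Dict[str, str]:
    clean = refang(text)
    candidates = {ip: "ipv4" for ip in IPV4_RE.findall(clean)}
    candidates.update({h.lower(): "sha256" for h in SHA256_RE.findall(clean)})
    return candidates

def consolidate(regex_iocs: Dict[str, str], llm_iocs: List[Dict]) -> List[Dict]:
    merged: Dict[str, Dict] = {}
    for ioc in llm_iocs:
        value = ioc["value"].lower() if ioc["type"] in HASH_TYPES else ioc["value"]
        merged[value] = {**ioc, "value": value, "confidence": "medium"}
    for value in regex_iocs:
        if value in merged:          # both stages agree on this value
            merged[value]["confidence"] = "high"
    return list(merged.values())

report = "Beacon traffic went to 198.51.100[.]7 and a decoy at 192.0.2.10."
llm_output = [  # hard-coded in place of a real LLM response
    {"value": "198.51.100.7", "type": "ipv4", "name": "C2 server", "description": "Beacon target"},
    {"value": "evil.example", "type": "domain-name", "name": "C2 domain", "description": "Fallback C2"},
]
for ioc in consolidate(regex_triage(report), llm_output):
    print(ioc["value"], ioc["confidence"])
```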

In [None]:
# Initialization
!pip install stix2
!pip install iocextract

In [None]:
import json
import re
import uuid
from typing import List, Dict, Any
from datetime import datetime
import iocextract
from stix2 import (Indicator, Malware, Tool, AttackPattern, Infrastructure,
                   Relationship, Bundle, File, IPv4Address, Directory, DomainName, WindowsRegistryKey,
                   ThreatActor, Vulnerability, Identity)

# Schema Definition for LLM Function Calling
IOC_FUNCTION_SCHEMA = {
    'name': 'extract_and_validate_iocs',
    'description': 'Extracts and validates IOCs from a CTI report.',
    'parameters': {
        'type': 'object',
        'properties': {
            'iocs': {
                'type': 'array',
                'description': 'A list of validated IOCs found in the text.',
                'items': {
                    'type': 'object',
                    'properties': {
                        'value': {'type': 'string', 'description': "The normalized IOC value (e.g., IP address, domain, full URL, hash, directory path, registry key). For composite file paths, this might be the full path string."},
                        'type': {'type': 'string', 'description': "The specific type of IOC. Use STIX compatible types like 'ipv4', 'domain-name', 'url', 'md5', 'sha1', 'sha256', 'mutex', 'windows-registry-key', 'directory'. For file paths involving both directory and filename, use the special type 'file-path'."},
                        'name': {'type': 'string', 'description': "A short, descriptive name for the IOC (e.g., the malware component name, C2 domain)."},
                        'description': {'type': 'string', 'description': "Contextual notes about the IOC's purpose or origin."},
                        'filename': {'type': 'string', 'description': "REQUIRED only if type is 'file-path'. The name of the file."},
                        'directory_path': {'type': 'string', 'description': "REQUIRED only if type is 'file-path'. The path to the directory containing the file, potentially including variables like %LOCALAPPDATA%."}
                    },
                    'required': ['value', 'type', 'name', 'description']
                }
            }
        },
        'required': ['iocs']
    }
}

# STAGE 1: REGEX-BASED TRIAGE
def stage1_regex_triage(text: str) -> Dict[str, str]:
    print("--- Stage 1: Starting regex triage ---")
    candidate_dict: Dict[str, str] = {}
    try:
        # Extract IPs
        for ip in iocextract.extract_ips(text, refang=True):
             # Check to avoid common local IPs if not desired
             if ip not in ['127.0.0.1']:
                candidate_dict[ip] = 'ipv4'

        # Extract hashes
        for h in iocextract.extract_hashes(text):
            h_lower = h.lower()
            if len(h_lower) == 32: candidate_dict[h_lower] = 'md5'
            elif len(h_lower) == 40: candidate_dict[h_lower] = 'sha1'
            elif len(h_lower) == 64: candidate_dict[h_lower] = 'sha256'

        # Regex for URLs
        for url in iocextract.extract_urls(text, refang=True):
             candidate_dict[url] = 'url'

        # Regular expression for specific directories and paths
        path_re = re.compile(r'(/[\w\./\-\\]+|%[a-zA-Z]+%[\w\\\./\-]+)') # Generic regex for paths
        for path in set(path_re.findall(text)):
             if path not in candidate_dict:
                 # Basic classification based on the presence of file extensions
                 if '.' in path.split('/')[-1].split('\\')[-1] and not path.endswith(('/', '\\')):
                     candidate_dict[path] = 'file-path' # Potentially a file path
                 else:
                     candidate_dict[path] = 'directory' # Probably a directory

        print(f"✅ Stage 1: Found {len(candidate_dict)} unique candidates via regex.")
        return candidate_dict
    except Exception as e:
        print(f"❌ Stage 1: Error during regex triage: {e}")
        return {}

# STAGE 2: LLM-BASED CONTEXTUAL ANALYSIS
def stage2_llm_analysis(text: str, openai_client: AzureOpenAI, deployment_name: str) -> List[Dict]:
    print("\n--- Stage 2: Starting deep contextual analysis with LLM ---")
    if not openai_client: return []

    prompt = f"""As a senior CTI analyst, your task is to meticulously analyze the following threat report. Your goal is to identify ALL Indicators of Compromise (IOCs). Scrutinize the text to confirm that IOCs are presented in a malicious context.

    Extract ALL genuine IOCs and structure them using the provided function schema. Pay close attention to the required 'type' for each IOC:
    - Use 'ipv4' for IP addresses.
    - Use 'domain-name' for domain names (e.g., example.com).
    - Use 'url' for full URLs (e.g., http://example.com/path).
    - Use 'md5', 'sha1', 'sha256' for file hashes. Ensure you associate hashes with the correct filename in the 'name' field.
    - Use 'mutex' for mutex names.
    - Use 'windows-registry-key' for registry keys.
    - Use 'directory' for simple directory paths (e.g., /data2/.ztls/, /tmp/, C:\\Users\\Public).
    - **Special Case: File Paths:** If an IOC represents a specific file within a directory (like '%LOCALAPPDATA%\\KeyStore\\KeyProv.dll' or '/data2/tmp/%s.ini'), use the type 'file-path'. For this type ONLY, you MUST ALSO provide the 'filename' (e.g., 'KeyProv.dll', '%s.ini') and the 'directory_path' (e.g., '%LOCALAPPDATA%\\KeyStore', '/data2/tmp') as separate fields in the output object. The 'value' field should contain the full path string.

    For each IOC, extract its 'value', 'type', 'name', and 'description'. Only include indicators clearly associated with malicious activity described in the report.

    ---REPORT TEXT:
    {text}
    ---"""

    try:
        response = openai_client.chat.completions.create(
            model=deployment_name,
            messages=[{"role": "user", "content": prompt}],
            functions=[IOC_FUNCTION_SCHEMA],
            function_call={"name": "extract_and_validate_iocs"},
            temperature=TEMP_IOC_EXTRACTION,
            reasoning_effort=REASONING_IOC_EXTRACTION
        )

        result = response.choices[0].message
        if result.function_call:
            function_args = json.loads(result.function_call.arguments)
            llm_iocs = function_args.get("iocs", [])
            print(f"✅ Stage 2: LLM extracted and validated {len(llm_iocs)} IOCs.")
            return llm_iocs
        else:
            print("⚠️ Stage 2: LLM did not call the function. No IOCs extracted by LLM.")
            return []
    except Exception as e:
        print(f"❌ Stage 2: Error during LLM analysis: {e}")
        return []

# STAGE 3: CONSOLIDATION AND SCORING
def stage3_consolidate_and_score(regex_iocs: Dict[str, str], llm_iocs: List[Dict]) -> List[Dict]:
    print("\n--- Stage 3: Consolidating and scoring IOCs ---")

    final_iocs: Dict[str, Dict] = {}

    # First process the IOCs from the LLM (most reliable by type and context)
    for ioc in llm_iocs:
        ioc_type = ioc.get("type")
        value = ioc.get("value", "")
        # Normalize hash cases
        if ioc_type in ['md5', 'sha1', 'sha256']:
            value = value.lower()

        # Prepare the base object
        final_ioc_data = {
            "value": value,
            "type": ioc_type,
            "name": ioc.get('name', 'N/A'),
            "description": ioc.get('description', ''),
            "confidence": 'medium'
        }
        # Add specific fields for file paths if present and valid
        if ioc_type == 'file-path':
            if ioc.get("filename") and ioc.get("directory_path"):
                final_ioc_data["filename"] = ioc.get("filename")
                final_ioc_data["directory_path"] = ioc.get("directory_path")
            else:
                print(f"⚠️ LLM extracted file-path without required filename/directory_path: {value}")
                continue

        final_iocs[value] = final_ioc_data

    # Compare with regex candidates to increase confidence
    for regex_value, regex_type in regex_iocs.items():
         # Normalise hash cases
         if regex_type in ['md5', 'sha1', 'sha256']:
             regex_value = regex_value.lower()
         if regex_type == 'ipv4-addr': regex_type = 'ipv4'

         if regex_value in final_iocs:
             # If the LLM found the same value, it increases confidence.
             final_iocs[regex_value]['confidence'] = 'high'

    consolidated_list = list(final_iocs.values())
    print(f"✅ Stage 3: Consolidated to {len(consolidated_list)} total IOCs before filtering.")
    return consolidated_list

# ORCHESTRATION AND EXECUTION
def run_ioc_extraction_pipeline(report_text: str):
    print("\n=== Starting Advanced IOC Extraction Pipeline ===")
    if not isinstance(report_text, str) or not report_text.strip(): return None

    regex_candidates = stage1_regex_triage(report_text)
    llm_results = stage2_llm_analysis(report_text, client, DEPLOYMENT_NAME)
    consolidated_iocs = stage3_consolidate_and_score(regex_candidates, llm_results)

    # Filter by confidence
    final_filtered_iocs = [ioc for ioc in consolidated_iocs if ioc.get('confidence') in ['high', 'medium']]
    print(f"\n=== ✅ Pipeline Complete: Extracted {len(final_filtered_iocs)} High/Medium Confidence IOCs ===")
    print(json.dumps(final_filtered_iocs, indent=2))
    return final_filtered_iocs

# Run the pipeline
extracted_iocs = []
if text and client:
    extracted_iocs = run_ioc_extraction_pipeline(text)
else:
    print("\nSkipping IOC extraction: Ensure the 'text' variable contains the report and the OpenAI client is configured.")

## **Part 4**: Programmatic SCO & Indicator Generation

This block transforms the raw list of IOCs from the previous stage into formal STIX objects. Its primary goal is to create the foundational "what to look for" components of the threat intelligence.

This task is performed in three steps:

* **Step 1** (**SCO Creation**): Creates STIX Cyber Observable Objects (SCOs) for tangible entities like IP addresses, URLs, and mutexes.
* **Step 2** (**Granular Indicator Creation**): Generates a separate `Indicator` object for each individual hash (MD5, SHA1, SHA256) to allow for detailed relationship mapping.
* **Step 3** (**Relationship Generation**): Automatically creates `derived-from` relationships that link all `Indicator` objects originating from the same piece of evidence (e.g., all hash indicators for the same file).
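
Steps 2 and 3 can be illustrated without the `stix2` library by building the pattern strings directly and pairing them per file. This is a sketch under assumptions: the function names and sample file names are hypothetical, patterns stand in for full `Indicator` objects, and the hash labels follow the STIX 2.1 hash algorithm vocabulary (`MD5`, `SHA-1`, `SHA-256`).

```python
from itertools import combinations
from typing import Dict, List, Tuple

# IOC type label -> hash name as written in STIX 2.1 patterns
STIX_HASH_NAMES = {"md5": "MD5", "sha1": "SHA-1", "sha256": "SHA-256"}

def escape_pattern_value(value: str) -> str:
    """Escape backslashes first, then single quotes, for use inside a pattern literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")

def hash_pattern(ioc_type: str, value: str) -> str:
    return f"[file:hashes.'{STIX_HASH_NAMES[ioc_type]}' = '{escape_pattern_value(value.lower())}']"

def link_hash_patterns(iocs: List[Dict]) -> List[Tuple[str, str, str]]:
    """Group hash patterns by file name and link every pair within a group."""
    by_file: Dict[str, List[str]] = {}
    for ioc in iocs:
        by_file.setdefault(ioc["name"], []).append(hash_pattern(ioc["type"], ioc["value"]))
    links = []
    for patterns in by_file.values():
        links.extend((a, "derived-from", b) for a, b in combinations(patterns, 2))
    return links

sample_iocs = [  # the hashes of an empty file, used purely as sample values
    {"name": "loader.dll", "type": "md5", "value": "d41d8cd98f00b204e9800998ecf8427e"},
    {"name": "loader.dll", "type": "sha1", "value": "da39a3ee5e6b4b0d3255bfef95601890afd80709"},
    {"name": "loader.dll", "type": "sha256",
     "value": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"},
    {"name": "dropper.exe", "type": "md5", "value": "D41D8CD98F00B204E9800998ECF8427E"},
]
links = link_hash_patterns(sample_iocs)
print(len(links), "relationships")
```

Three hashes for one file yield three pairwise links, while a file with a single hash yields none.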

In [None]:
from stix2 import (
    File, Directory, IPv4Address, URL, Mutex, Indicator, Relationship, DomainName, WindowsRegistryKey  # Added
)
import datetime as dt
import re
from typing import List, Dict, Tuple
from itertools import combinations

def stix_safe_string(s: str) -> str:
    # Escape backslashes before single quotation marks
    if isinstance(s, str):
        # First: replace existing backslashes with double backslashes
        # Second: replace single quotation marks with escaped single quotation marks.
        return s.replace("\\", "\\\\").replace("'", "\\'")
    return str(s)

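# Map IOC type labels to the hash names of the STIX 2.1 hash algorithm vocabulary
STIX_HASH_NAMES = {"md5": "MD5", "sha1": "SHA-1", "sha256": "SHA-256"}

def generate_stix_pattern_single_hash(ioc_type: str, ioc_value: str) -> str:
    hash_name = STIX_HASH_NAMES.get(ioc_type.lower(), ioc_type.upper())
    return f"[file:hashes.'{hash_name}' = '{stix_safe_string(ioc_value)}']"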

def generate_sco_and_indicators_with_relations(ioc_list: List[Dict]) -> Tuple[List, List, List]:
    """
    Creates STIX SCOs, Indicator SDOs and 'derived-from' relationships.
    Handles IPv4, DomainName, URL, Mutex, Directory, compound Paths (file+dir), Registry Keys, and Hashes.
    """
    print("\n=== Starting SCO & Indicator Generation with Derived-From Relationships ===")

    stix_scos = []
    stix_indicators = []
    stix_relationships = []
    file_indicator_map = {}

    # Step 1: Create SCOs and Indicators for non-hash types
    for ioc in ioc_list:
        ioc_type = ioc.get("type")
        ioc_value = ioc.get("value", "")
        ioc_name = ioc.get("name")
        ioc_desc = ioc.get("description")

        sco = None
        indicator_pattern = None

        try:
            # Clean and validate IP
            if ioc_type == "ipv4":
                ip_cleaned = ioc_value.replace('[.]', '.').replace('[:]', ':')
                ip_only = ip_cleaned.split(':')[0]
                sco = IPv4Address(value=ip_only)
                indicator_pattern = f"[ipv4-addr:value = '{sco.value}']"

            # Domain Name Management
            elif ioc_type == "domain-name":
                domain_cleaned = ioc_value.replace('[.]', '.')
                sco = DomainName(value=domain_cleaned)
                indicator_pattern = f"[domain-name:value = '{sco.value}']"

            elif ioc_type == "url":
                url_cleaned = ioc_value.replace('[.]', '.').replace('[:]', ':')
                url_value = url_cleaned
                if not url_value.startswith(('http://', 'https://', 'ftp://')):
                    print(f"⚠️ URL IOC '{url_value}' lacks a scheme; adding 'http://' by default.")
                    url_value = 'http://' + url_value
                sco = URL(value=url_value)
                indicator_pattern = f"[url:value = '{stix_safe_string(sco.value)}']"

            elif ioc_type == "mutex":
                sco = Mutex(name=ioc_value)
                indicator_pattern = f"[mutex:name = '{stix_safe_string(sco.name)}']"

            # Directory Management
            elif ioc_type == "directory":
                sco = Directory(path=ioc_value)
                indicator_pattern = f"[directory:path = '{stix_safe_string(sco.path)}']"

            # Composite Path Management
            elif ioc_type == "file-path":
                filename = ioc.get("filename")
                dir_path = ioc.get("directory_path")
                if filename and dir_path:
                    processed_path = dir_path
                    stix_path = stix_safe_string(processed_path)
                    stix_filename = stix_safe_string(filename)

                    indicator_pattern = f"[file:name = '{stix_filename}' AND file:parent_directory_ref.path LIKE '{stix_path}']"
                else:
                    print(f"⚠️ Skipping file-path IOC due to missing filename/directory_path (in generate func): {ioc}")
                    continue

            # Registry Key Management
            elif ioc_type == "windows-registry-key":
                sco = WindowsRegistryKey(key=ioc_value)
                indicator_pattern = f"[windows-registry-key:key = '{stix_safe_string(sco.key)}']"

            elif ioc_type in ["md5", "sha1", "sha256"]:
                continue

            else:
                print(f"ℹ️ Skipping IOC with unhandled type in generation: {ioc_type} - Value: {ioc_value}")
                continue

            # Indicator Creation
            if indicator_pattern:
                if sco:
                    stix_scos.append(sco)

                indicator = Indicator(
                    allow_custom=True,
                    name=ioc_name,
                    description=ioc_desc,
                    pattern_type="stix",
                    pattern=indicator_pattern,
                    valid_from=dt.datetime.now(dt.timezone.utc)
                )
                stix_indicators.append(indicator)

            elif ioc_type not in ["md5", "sha1", "sha256"]:
                print(f"⚠️ Internal Logic Error: Pattern not generated for IOC: {ioc}")

        except Exception as e:
            print(f"❌ Error processing IOC during STIX object creation: {ioc} - Error: {e}")
            continue

    # Step 2: Create an Indicator for EACH file hash
    for ioc in ioc_list:
        ioc_type = ioc.get("type")
        if ioc_type in ["md5", "sha1", "sha256"]:
            ioc_value = ioc.get("value", "").lower()
            file_name_desc = ioc.get("name")

            if not ioc_value or not file_name_desc:
                print(f"⚠️ Skipping HASH IOC due to missing value/name: {ioc}")
                continue

            try:
                indicator = Indicator(
                    allow_custom=True,
                    name=f"Indicator for {file_name_desc} ({ioc_type.upper()})",
                    description=ioc.get("description"),
                    pattern_type="stix",
                    pattern=generate_stix_pattern_single_hash(ioc_type, ioc_value),
                    valid_from=dt.datetime.now(dt.timezone.utc)
                )
                stix_indicators.append(indicator)

                if file_name_desc not in file_indicator_map:
                    file_indicator_map[file_name_desc] = []
                file_indicator_map[file_name_desc].append(indicator)
            except Exception as e:
                print(f"❌ Error processing HASH IOC during STIX Indicator creation: {ioc} - Error: {e}")
                continue

    # Step 3: Create 'derived-from' relationships
    for file_name, indicators in file_indicator_map.items():
        if len(indicators) > 1:
            for ind1, ind2 in combinations(indicators, 2):
                try:
                    stix_relationships.append(Relationship(ind1.id, 'derived-from', ind2.id))
                except Exception as e:
                    print(f"❌ Error creating derived-from relationship between {ind1.id} and {ind2.id} - Error: {e}")

    print(f"✅ Generated {len(stix_scos)} SCOs, {len(stix_indicators)} Indicators, and {len(stix_relationships)} Relationships.")
    return stix_scos, stix_indicators, stix_relationships

# Execution
stix_scos, stix_indicators, stix_relationships = [], [], []
if 'extracted_iocs' in locals() and extracted_iocs:
    stix_scos, stix_indicators, stix_relationships = generate_sco_and_indicators_with_relations(extracted_iocs)
    print("\n--- Preview Generated Objects ---")
    print(f"SCOs created: {len(stix_scos)}")
    print(f"Indicators created: {len(stix_indicators)}")
    print(f"Relationships created: {len(stix_relationships)}")

## **Part 5**: Comprehensive Entity Extraction (SDOs)

With the low-level indicators defined, this block uses the LLM to understand the high-level context of the threat. It parses the entire CTI report to extract the core STIX Domain Objects (SDOs).

The block consists of three main parts:

* **LLM Prompting**: A detailed prompt instructs the model to act as a CTI analyst and identify `Malware`, `Attack Pattern`, `Vulnerability`, `Threat-Actor`, and `Identity` objects.
* **Structured Extraction**: The model is tasked with extracting not only names and descriptions but also specific metadata like `malware_types` and `kill_chain_phases` from the MITRE ATT&CK table.
* **JSON Output**: The result is a clean, structured list of entities that will form the narrative backbone of the final STIX bundle.
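
The prompt in this part asks the model for a JSON object with two root keys, `primary_malware_subject` and `entities`. A defensive way to read such a response might look like the sketch below. The function name, the set of accepted types, and the sample payload are illustrative assumptions; the notebook's own parsing may differ.

```python
import json
from typing import Dict, List, Optional, Tuple

# Entity types requested by the prompt (assumed here to be the complete set)
ALLOWED_TYPES = {"malware", "attack-pattern", "identity", "tool",
                 "threat-actor", "vulnerability", "yara-rule"}

def parse_entity_response(raw: str) -> Tuple[Optional[str], List[Dict]]:
    """Return (primary subject, valid entities); fall back to (None, []) on bad JSON."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None, []
    if not isinstance(payload, dict):
        return None, []
    entities = payload.get("entities", [])
    if not isinstance(entities, list):
        entities = []
    # Keep only objects that carry a name and a recognized type
    valid = [e for e in entities
             if isinstance(e, dict) and e.get("name") and e.get("type") in ALLOWED_TYPES]
    return payload.get("primary_malware_subject"), valid

sample = json.dumps({  # made-up response used only for illustration
    "primary_malware_subject": "examplerat",
    "entities": [
        {"name": "ExampleRAT", "type": "malware", "description": "Sample entry",
         "malware_types": ["remote-access-trojan"]},
        {"name": "Unknown thing", "type": "campaign"},
    ],
})
subject, entities = parse_entity_response(sample)
print(subject, [e["name"] for e in entities])
```

Filtering on type and name keeps malformed or unexpected objects out of the later bundling step.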

In [None]:
import json

def extract_all_entities_revised(report_text: str):
    """
    Use an LLM to robustly extract all SDO entities,
    including Attack Patterns from the MITRE table.
    """
    print("\n=== Starting Comprehensive Entity Extraction ===")

    if not client:
        print("❌ OpenAI client is not configured. Aborting extraction.")
        return []

    # Improved prompt that instructs the LLM to extract ALL entities, including Attack Patterns, from their specific table.
    prompt = f"""
    As a senior CTI analyst, your task is to meticulously identify and classify all distinct entities within the provided threat report that correspond to STIX Domain Object types.

    Fundamental Rule: You must extract entities found EXCLUSIVELY within the report text provided after "--- THREAT REPORT TEXT ---". The examples provided in these instructions serve ONLY as a guide to the format and MUST NOT be extracted.

    Focus on the following STIX types: Malware, Attack-Pattern, Identity, Tool, Threat-Actor, and Vulnerability.

    Instructions:
    1.  Read the entire report to understand the context.
    2.  For the **Malware** object, you MUST extract:
        - Its `name`.
        - Its `type` as "malware".
        - A detailed `description`.
        - A `malware_types` array, inferring the type from this list: ["remote-access-trojan", "backdoor", "downloader", "spyware", "ransomware"].
    3.  For any **Identity** or **Tool** objects mentioned (e.g., NCSC, Trend Micro, PwC, VMware), extract their `name`, `type`, and a concise `description` of their role in the report.
    4.  **Author Identification**: Identify the primary organization that authored or published this report (e.g., Cisco Talos, Mandiant, NCSC) and extract it as an 'identity' object.
    5.  For any **Threat-Actor** (e.g., APT groups, specific threat actors), you MUST extract:
        - Its `name` and any known `aliases`.
        - Its `type` as "threat-actor".
        - A detailed `description` of its goals, motivations, or relevant TTPs mentioned in the report.
    6.  For any **Vulnerability** (e.g., CVEs), you MUST extract:
        - Its `name` (the CVE identifier, e.g., "CVE-2021-44228").
        - Its `type` as "vulnerability".
        - A `description` of how the vulnerability is exploited according to the report.
    7.  **YARA Rule Extraction**: Search for any YARA rule blocks. For EACH rule, you MUST extract:
        - The `name` of the rule.
        - The `type` as "yara-rule".
        - The `pattern`, which is the ENTIRE text of the rule.
        - An `indicates_malware` field containing the lowercase name of the primary malware this rule detects (e.g., "umbrella stand").
        - An `associated_hashes` array containing a list of any file hashes (MD5, SHA1, or SHA256) that the text directly associates with this rule.
    8.  Specifically locate the **'MITRE ATT&CK®' table**. For EACH row in that table, you MUST extract the Attack Pattern.
    9.  For each **Attack-Pattern**, you MUST extract:
        - The `name` (from the "Technique" column).
        - The `type` as "attack-pattern".
        - The `description` (from the "Procedure" column).
        - The `external_id` (from the "ID" column, e.g., "T1129").
        - The `kill_chain_phases` as an array with a single object containing the phase name (from the "Tactic" column, e.g., {{"kill_chain_name": "mitre-attack", "phase_name": "execution"}}).
    10.  **Primary Subject Identification**: After analyzing the report, identify the primary malware family that is the main topic and place its lowercase name in a root-level JSON key called "primary_malware_subject".
    11.  Format the entire output as a single, valid JSON object with TWO root keys: "primary_malware_subject" and "entities". The value of "entities" must be an array of the extracted objects.

    Example of a final object in the array (FOR FORMATTING REFERENCE ONLY):
    {{
      "name": "Name-Technique-Example",
      "type": "attack-pattern",
      "description": "Description of how the sample malware uses this technique...",
      "external_id": "TXXXX",
      "kill_chain_phases": [
        {{
          "kill_chain_name": "mitre-attack",
          "phase_name": "tactic-name-example"
        }}
      ]
    }}
    Example of a YARA rule object (FOR FORMATTING REFERENCE ONLY):
    {{
      "name": "YARA_RULE_EXAMPLE_NAME",
      "type": "yara-rule",
      "pattern": "rule YARA_RULE_EXAMPLE_NAME {{ meta: ... strings: ... condition: ... }}",
      "indicates_malware": "example-malware-name",
      "associated_hashes": ["example_file_sha256_hash"]
    }}

    --- THREAT REPORT TEXT ---
    {report_text}
    --- END OF REPORT ---
    """

    try:
        print("▶️ Sending request to Azure OpenAI API for comprehensive entity extraction...")
        response = client.chat.completions.create(
            model=DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": prompt}],
            temperature=TEMP_SDO_EXTRACTION,
            response_format={"type": "json_object"},
            reasoning_effort=REASONING_SDO_EXTRACTION
        )

        raw_response_content = response.choices[0].message.content

        # Debug: Print the raw response for inspection
        print("\n--- DEBUG: Raw LLM Response ---")
        print(raw_response_content)
        print("-----------------------------\n")

        result_json = json.loads(raw_response_content)

        # Extracts the list of entities from the root key 'entities'
        entity_list = result_json.get("entities", [])
        if not entity_list:
            print("⚠️ Warning: The LLM returned valid JSON, but the 'entities' list is empty.")

        print(f"✅ LLM extraction complete. Total entities identified: {len(entity_list)}.")
        return result_json

    except json.JSONDecodeError as e:
        print(f"❌ CRITICAL ERROR: Failed to decode JSON from the LLM response. Error: {e}")
        print("   Check the raw LLM response above to diagnose the issue.")
        return {}
    except Exception as e:
        print(f"❌ CRITICAL ERROR during SDO entity extraction: {e}")
        return {}

# --- Extraction execution ---
sdo_entities = []
# new variable for the name of the main malware
main_malware_name = None
if text and client:
    extraction_result = extract_all_entities_revised(text)
    if extraction_result:
        sdo_entities = extraction_result.get("entities", [])
        # extracts the output name from the LLM
        main_malware_name = extraction_result.get("primary_malware_subject")
        print(f"\n✅ Main subject identified by LLM: {main_malware_name}")
    print("\n--- Final Extracted SDO Entities ---")
    print(json.dumps(sdo_entities, indent=2))
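
Because the LLM output is only loosely constrained by the prompt, it can help to normalize the parsed result before later cells consume it. The following is a minimal sketch and not part of the original notebook: the helper name `normalize_extraction_result` and its filtering rules (keep only entities that carry both `name` and `type`) are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def normalize_extraction_result(result: Any) -> Tuple[Optional[str], List[Dict[str, Any]]]:
    """Return (primary_malware_subject, entities) from an LLM result of uncertain shape."""
    # Anything that is not a dict (e.g., an empty list returned on failure) yields no data.
    if not isinstance(result, dict):
        return None, []

    raw_entities = result.get("entities", [])
    if not isinstance(raw_entities, list):
        raw_entities = []

    # Keep only dict entries that carry the two fields the later steps rely on.
    entities = [
        e for e in raw_entities
        if isinstance(e, dict) and e.get("name") and e.get("type")
    ]

    # Normalize the subject to a trimmed, lowercase string, or None if absent.
    subject = result.get("primary_malware_subject")
    primary = subject.strip().lower() if isinstance(subject, str) and subject.strip() else None
    return primary, entities


# Hypothetical LLM output used only to illustrate the call.
example = {
    "primary_malware_subject": " ExampleRAT ",
    "entities": [{"name": "ExampleRAT", "type": "malware"}, {"type": "tool"}, "junk"],
}
primary, entities = normalize_extraction_result(example)
```

A check like this keeps malformed entries (missing names, non-dict items) from reaching the STIX object constructors, where they would otherwise raise errors.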

## **Part 6**: Final Assembly and Bundling

This is the final stage where all previously generated STIX objects are brought together to create a single, cohesive, and interoperable intelligence package.

* **Object Aggregation**: Gathers all SDOs, SCOs, and SROs created in the previous blocks into a single list.
* **Contextual Relationship Creation**: Creates the high-level relationships that connect the threat narrative, such as linking the `Malware` object to the `Attack Patterns` it `uses` and the `Indicators` that `indicate` its presence.
* **Report Object Generation**: Creates a top-level `Report` object that summarizes the analysis and references all other objects in the bundle.
* **Bundle Creation & Serialization**: Assembles all objects into a final STIX 2.1 `Bundle` and saves it as a timestamped JSON file.
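
To make the steps above concrete, here is a simplified sketch of the resulting structure using plain dictionaries and the standard library only. It deliberately does not use the `stix2` library that the notebook relies on, and the helper names (`stix_id`, `make_relationship`, `build_bundle`) and sample values are assumptions for illustration; real objects carry more properties than shown here.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List


def stix_id(object_type: str) -> str:
    """Return a STIX-style identifier of the form '<type>--<UUIDv4>'."""
    return f"{object_type}--{uuid.uuid4()}"


def make_relationship(source_id: str, relationship_type: str, target_id: str, now: str) -> Dict:
    """Build a minimal relationship object (SRO) as a plain dict."""
    return {
        "type": "relationship", "spec_version": "2.1", "id": stix_id("relationship"),
        "created": now, "modified": now,
        "relationship_type": relationship_type,
        "source_ref": source_id, "target_ref": target_id,
    }


def build_bundle(malware_name: str, technique_names: List[str]) -> Dict:
    """Aggregate objects, link them with 'uses', add a report, and wrap them in a bundle."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")
    malware = {
        "type": "malware", "spec_version": "2.1", "id": stix_id("malware"),
        "created": now, "modified": now, "name": malware_name, "is_family": True,
    }
    objects = [malware]

    # One attack-pattern per technique, each linked as: malware -> uses -> attack-pattern.
    for name in technique_names:
        attack_pattern = {
            "type": "attack-pattern", "spec_version": "2.1", "id": stix_id("attack-pattern"),
            "created": now, "modified": now, "name": name,
        }
        objects.append(attack_pattern)
        objects.append(make_relationship(malware["id"], "uses", attack_pattern["id"], now))

    # The report references every object collected so far.
    report = {
        "type": "report", "spec_version": "2.1", "id": stix_id("report"),
        "created": now, "modified": now,
        "name": f"Malware Analysis: {malware_name.title()}",
        "published": now, "report_types": ["threat-report"],
        "object_refs": [obj["id"] for obj in objects],
    }
    objects.append(report)
    return {"type": "bundle", "id": stix_id("bundle"), "objects": objects}


# Hypothetical inputs used only to illustrate the call.
bundle = build_bundle("examplerat", ["Shared Modules", "Ingress Tool Transfer"])
bundle_json = json.dumps(bundle, indent=2)
```

The `stix2` library performs the same assembly with validation of required properties and vocabularies, which is why the notebook uses it rather than hand-built dictionaries.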

In [None]:
# STIX2 Imports
from stix2 import (Identity, Indicator, Malware, AttackPattern, ThreatActor, Vulnerability,
                   Relationship, Bundle, Report, MarkingDefinition, TLP_WHITE)
import datetime as dt
import os

# Variable to contain the final bundle
final_bundle = None
# List to contain all STIX objects before bundling
all_stix_objects = []

# Check that the variables required by the previous blocks exist
if 'sdo_entities' in locals() and 'stix_scos' in locals() and 'stix_indicators' in locals() and 'stix_relationships' in locals():
    print("▶️ Starting assembly of the final STIX bundle...")

    # --- 1. Create fundamental metadata objects (with smarter dynamic author) ---
    identity_author = None
    author_keywords = ['author', 'published', 'researchers', 'report', 'responsible for this analysis'] # Keywords used to identify the author

    # Search for a specific author among the extracted entities, based on keywords.
    for entity in sdo_entities:
        if entity.get("type") == "identity":
            # Check if the description contains any of our keywords.
            description = (entity.get("description") or "").lower()
            if any(keyword in description for keyword in author_keywords):
                print(f"✅ Author identified by keywords: '{entity.get('name')}'")
                identity_author = Identity(
                    name=entity.get("name"),
                    identity_class="organization",
                    description=entity.get("description")
                )
                all_stix_objects.append(identity_author)
                break

    # If no specific author was found after the loop, use a default.
    if not identity_author:
        print("⚠️ No specific author found; 'NCSC' is used as default.")
        identity_author = Identity(name="NCSC", identity_class="organization")
        all_stix_objects.append(identity_author)

    # Add other metadata
    tlp_clear = TLP_WHITE
    all_stix_objects.append(tlp_clear)
    print(f"✅ Author Identity object set to: '{identity_author.name}'")

    # --- 2. Add the SCOs, Indicators, and their relationships from the previous blocks ---
    all_stix_objects.extend(stix_scos)
    all_stix_objects.extend(stix_indicators)
    all_stix_objects.extend(stix_relationships)
    print(f"✅ Added {len(stix_scos)} SCOs, {len(stix_indicators)} Indicators, and {len(stix_relationships)} 'based-on' relationships.")

    # --- 3. Create SDOs (Malware, Attack Patterns, etc.) from the 'sdo_entities' list ---
    created_sdos = {}
    malware_main_obj = None

    for entity in sdo_entities:
        entity_type = entity.get("type")
        entity_name = entity.get("name")
        sdo = None

        if entity_type == "malware":
            sdo = Malware(
                name=entity_name.lower(),
                is_family=True,
                description=entity.get("description"),
                malware_types=entity.get("malware_types", ["remote-access-trojan"]),
                created_by_ref=identity_author.id
            )
            created_sdos[entity_name] = sdo

        elif entity_type == "attack-pattern":
            # NEW LOGIC: Correctly formats kill_chain_phases
            kill_chain_phases = entity.get("kill_chain_phases", [])
            for phase in kill_chain_phases:
                if 'phase_name' in phase:
                    phase['phase_name'] = phase['phase_name'].lower().replace(' ', '-')

            sdo = AttackPattern(
                name=entity_name,
                description=entity.get("description"),
                created_by_ref=identity_author.id,
                kill_chain_phases=kill_chain_phases, # Use the formatted list
                external_references=[{
                    "source_name": "mitre-attack",
                    "external_id": entity.get("external_id"),
                    "url": f"https://attack.mitre.org/techniques/{entity.get('external_id').replace('.', '/')}"
                }]
            )
            created_sdos[entity.get("external_id")] = sdo

        elif entity_type == "threat-actor":
            sdo = ThreatActor(
                name=entity.get("name"),
                description=entity.get("description"),
                aliases=entity.get("aliases", []),
                created_by_ref=identity_author.id
            )
            created_sdos[entity_name] = sdo

        elif entity_type == "vulnerability":
            sdo = Vulnerability(
                name=entity.get("name"),
                description=entity.get("description"),
                created_by_ref=identity_author.id
            )
            created_sdos[entity_name] = sdo

        elif entity_type == "yara-rule":
            sdo = Indicator(
                name=entity.get("name"),
                description="YARA rule to detect related activity.",
                pattern_type="yara",
                pattern=entity.get("pattern"),
                created_by_ref=identity_author.id,
                valid_from=dt.datetime.now(dt.timezone.utc)
            )

        elif entity_type == "identity":
            # If this identity is the author we have already created, skip it to avoid a duplicate.
            if entity.get("name") == identity_author.name:
                continue

            # Otherwise, it is another identity (e.g., a victim), so create it as usual.
            sdo = Identity(
                name=entity.get("name"),
                identity_class="organization",
                description=entity.get("description")
            )

        if sdo:
            all_stix_objects.append(sdo)

    # --- Find the main malware object dynamically ---
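    if main_malware_name:
        # Malware names are stored in lowercase, so match on the lowercased LLM suggestion
        malware_main_obj = next(
            (sdo for sdo in created_sdos.values()
             if sdo.type == "malware" and sdo.name == main_malware_name.lower()),
            None
        )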

    # Fallback in case the name does not match or has not been found
    if not malware_main_obj:
        print("⚠️ Main malware subject not found via LLM suggestion. Attempting to find first malware in list.")
        # Search for the first malware object created as a last resort
        for obj in all_stix_objects:
            if obj.type == 'malware':
                malware_main_obj = obj
                break

    if malware_main_obj:
        print(f"✅ Main malware object for relationships set to: '{malware_main_obj.name}'")

    print(f"✅ Created {len(created_sdos)} SDOs (Malware, Attack Patterns, Identities).")

    # --- 4. Create contextual relationships (SROs) ---
    print("⏳ Creating contextual relationships...")
    if malware_main_obj:
        for entity in sdo_entities:
            if entity.get("type") == "attack-pattern":
                attack_pattern_obj = created_sdos.get(entity.get("external_id"))
                if attack_pattern_obj:
                    rel = Relationship(malware_main_obj.id, 'uses', attack_pattern_obj.id, created_by_ref=identity_author.id)
                    all_stix_objects.append(rel)

        for indicator in stix_indicators:
            rel = Relationship(indicator.id, 'indicates', malware_main_obj.id, created_by_ref=identity_author.id)
            all_stix_objects.append(rel)

        print("✅ 'uses' and 'indicates' relationships successfully created.")
    else:
        print("⚠️ Warning: Main malware object not found. Unable to create relationships.")

    # --- Precise YARA relationships ---
    print("⏳ Creating precise relationships for YARA indicators...")

    yara_entities = [e for e in sdo_entities if e.get('type') == 'yara-rule']
    yara_indicators = [o for o in all_stix_objects if o.type == 'indicator' and o.pattern_type == 'yara']

    new_yara_rels = 0
    for entity in yara_entities:
        # Find the corresponding YARA indicator object
        yara_indicator = next((yi for yi in yara_indicators if yi.name == entity.get('name')), None)
        if not yara_indicator:
            continue

        # 1. Create the precise "indicates" relationship
        malware_name = entity.get('indicates_malware')
        # Search for the corresponding malware among the objects already created
        malware_obj = next((obj for obj in all_stix_objects if obj.type == 'malware' and obj.name == malware_name), None)
        if malware_obj:
            rel_indicates = Relationship(yara_indicator.id, 'indicates', malware_obj.id, created_by_ref=identity_author.id)
            all_stix_objects.append(rel_indicates)
            new_yara_rels += 1

        # 2. Create precise "derived-from" relationships
        associated_hashes = entity.get('associated_hashes', [])
        for h in associated_hashes:
            # Find the corresponding hash indicator
            hash_indicator = next((ind for ind in stix_indicators if h in ind.pattern), None)
            if hash_indicator:
                rel_derived = Relationship(yara_indicator.id, 'derived-from', hash_indicator.id, created_by_ref=identity_author.id)
                all_stix_objects.append(rel_derived)
                new_yara_rels += 1

    if new_yara_rels > 0:
        print(f"✅ Created {new_yara_rels} new precise relationships for YARA indicators.")
    else:
        print("ℹ️ No YARA relationships to create based on LLM output.")

    # --- Add relationships for Threat Actors and Vulnerabilities ---
    print("⏳ Creating Threat Actor and Vulnerability relationships...")

    # Find all the new items we have created
    threat_actors = [obj for obj in all_stix_objects if obj.type == 'threat-actor']
    vulnerabilities = [obj for obj in all_stix_objects if obj.type == 'vulnerability']

    new_context_rels = 0
    if threat_actors and malware_main_obj:
        # Create the "uses" relationship: Threat Actor -> uses -> Malware
        for ta in threat_actors:
            rel = Relationship(ta.id, 'uses', malware_main_obj.id, created_by_ref=identity_author.id)
            all_stix_objects.append(rel)
            new_context_rels += 1

    if vulnerabilities and malware_main_obj:
        # Create the "exploits" relationship: Malware -> exploits -> Vulnerability
        for vuln in vulnerabilities:
            rel = Relationship(malware_main_obj.id, 'exploits', vuln.id, created_by_ref=identity_author.id)
            all_stix_objects.append(rel)
            new_context_rels += 1

    if new_context_rels > 0:
        print(f"✅ Created {new_context_rels} new contextual relationships.")
    else:
        print("ℹ️ No new Threat Actor or Vulnerability relationships to create.")

    # --- 5. Create the Report object to contextualize the bundle ---
    valid_object_refs = [obj.id for obj in all_stix_objects if obj.type != 'marking-definition']
    report_obj = Report(
        name=f"Malware Analysis: {malware_main_obj.name.title() if malware_main_obj else 'Threat Report'}",
        description=f"This report contains technical analysis and indicators associated with the malware {malware_main_obj.name if malware_main_obj else 'unknown'}.",
        published=dt.datetime.now(dt.timezone.utc),
        created_by_ref=identity_author.id,
        object_marking_refs=[tlp_clear.id],
        report_types=["threat-report"],
        object_refs=valid_object_refs
    )
    all_stix_objects.append(report_obj)
    print("✅ Report object created.")

    # --- 6. Create the final bundle ---
    final_bundle = Bundle(*all_stix_objects)

    print("\n🎉 --- STIX 2.1 Bundle Generated Successfully --- 🎉")
    print(f"Total objects in bundle: {len(final_bundle.objects)}")

    # --- 7. Save the bundle to a file with a timestamp ---
    # Set the path for the output folder on Google Drive
    stix_reports_folder = '/content/drive/MyDrive/Reports/STIX_Reports'

    # Create the folder if it does not already exist
    os.makedirs(stix_reports_folder, exist_ok=True)

    # Generate a base name for the file, checking if a PDF name exists
    if 'pdf_input_name' in locals() and pdf_input_name:
      # If using PDF input, use its name
      base_name = os.path.splitext(pdf_input_name)[0]
    else:
      # Otherwise, use a generic name
      base_name = "cti_report"

    # Generate the filename
    timestamp = dt.datetime.now().strftime('%Y_%m_%d_%H_%M')  # 'datetime' is imported as 'dt' in this cell
    stix_filename = f"{base_name}_bundle_{timestamp}.json"

    # Combine the folder path and file name to obtain the full path
    full_save_path = os.path.join(stix_reports_folder, stix_filename)

    # Write the STIX bundle to the file
    with open(full_save_path, 'w') as file:
        file.write(final_bundle.serialize(pretty=True)) # Serialize the bundle to JSON string before writing

    print("\n--- Saving Complete ---")
    print(f"STIX bundle successfully saved in: {full_save_path}")
    print(final_bundle.serialize(pretty=True))

else:
    print("❌ ERROR: Required variables not found. Run the previous cells first.")

## **Part 7**: Populating GitHub repo
This section automatically pushes the generated STIX bundle to a public GitHub repository, using credentials stored in Colab Secrets.
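
Outside Colab, where the `!` and `%cd` shell magics are not available, the commit-and-push part of this step could be expressed with `subprocess`. The sketch below is an illustration under stated assumptions: the repository is already cloned into `repo_dir`, the bundle file has already been copied into it, and the function names `build_git_commands` and `run_git_commands` are hypothetical.

```python
import subprocess
from typing import List


def build_git_commands(repo_dir: str, filename: str, username: str) -> List[List[str]]:
    """Return the git invocations (as argv lists) that commit and push one file."""
    # 'git -C <dir>' runs the command as if started inside <dir>, so no chdir is needed.
    git = ["git", "-C", repo_dir]
    return [
        git + ["config", "user.name", username],
        git + ["config", "user.email", f"{username}@users.noreply.github.com"],
        git + ["add", filename],
        git + ["commit", "-m", f"Add STIX bundle: {filename}"],
        git + ["push"],
    ]


def run_git_commands(commands: List[List[str]]) -> None:
    """Run each command in order; check=True raises CalledProcessError on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical values used only to illustrate the call; nothing is executed here.
commands = build_git_commands("my-repo", "cti_report_bundle.json", "example-user")
```

Passing arguments as lists avoids shell quoting problems with file names, and adding only the named file keeps unrelated changes out of the commit.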

In [None]:
import os
from google.colab import userdata

print("--- Start Synchronization with GitHub ---")

try:
    # Retrieve credentials from Colab Secrets
    token = userdata.get('GITHUB_PAT')
    username = userdata.get('GITHUB_USERNAME')
    repo_name = userdata.get('GITHUB_REPO_NAME')

    # Check that all secrets have been set
    if not all([token, username, repo_name]):
        raise ValueError("Make sure you have set GITHUB_PAT, GITHUB_USERNAME, and GITHUB_REPO_NAME in Colab Secrets.")

    print(f"GitHub credentials retrieved. Target repository: {repo_name}")

    # Clone the repository from GitHub using the authentication token
    # Removes the folder if it already exists to ensure a clean state at each execution
    !rm -rf {repo_name}

    repo_url = f"https://{token}@github.com/{username}/{repo_name}.git"
    !git clone {repo_url}

    # Copy the generated JSON file (from Drive) to the local repository folder
    if 'full_save_path' in locals() and os.path.exists(full_save_path):
        !cp "{full_save_path}" "{repo_name}/"
        stix_filename = os.path.basename(full_save_path)
        print(f"File '{stix_filename}' copied to the local repository.")

        # Run the Git commands to commit and push the new file.
        %cd {repo_name}
        !git config user.name "{username}"
        !git config user.email "{username}@users.noreply.github.com"
        !git add .

        commit_message = f"Add STIX bundle: {stix_filename}"
        !git commit -m "{commit_message}"

        !git push

        print(f"\nPush completed successfully")
        print(f"You can view the file at: https://github.com/{username}/{repo_name}")

        # Return to the main Colab work directory
        %cd /content
    else:
        print("ERROR: The variable 'full_save_path' with the STIX file path was not found or the file does not exist.")

except Exception as e:
    print(f"An error occurred while synchronizing with GitHub: {e}")

## **Part 8**: STIX Viewer
STIX 2.1 bundle Visualizer.

To visualize the STIX bundle, we use the cti-stix-visualization project, inserted as an iFrame in the notebook.

[STIX Visualizer](https://oasis-open.github.io/cti-stix-visualization/)
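
Before pasting a bundle into the viewer, a quick sanity check of the JSON can catch an empty or malformed file. This is a minimal sketch using only the standard library; the helper name `summarize_bundle` and the sample input are assumptions for illustration.

```python
import json
from collections import Counter


def summarize_bundle(bundle_json: str) -> Counter:
    """Parse a bundle's JSON text and count its objects by STIX type."""
    data = json.loads(bundle_json)
    if not isinstance(data, dict) or data.get("type") != "bundle":
        raise ValueError("Not a STIX bundle: top-level 'type' must be 'bundle'.")
    return Counter(obj.get("type", "unknown") for obj in data.get("objects", []))


# Hypothetical, heavily abbreviated bundle used only to illustrate the call.
sample = json.dumps({
    "type": "bundle",
    "id": "bundle--00000000-0000-4000-8000-000000000000",
    "objects": [
        {"type": "malware"},
        {"type": "attack-pattern"},
        {"type": "relationship"},
        {"type": "relationship"},
    ],
})
counts = summarize_bundle(sample)
for stix_type, count in sorted(counts.items()):
    print(f"{stix_type}: {count}")
```

In the notebook, the text to check would be the contents of the file saved at `full_save_path`.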


In [None]:
# Visualize STIX Bundle
# Copy and paste the contents of the generated JSON file into the viewer
from IPython.display import IFrame

IFrame(src='https://oasis-open.github.io/cti-stix-visualization/', width=1200, height=1000)