
Karasu

Beta — This project is under active development. Expect breaking changes between releases.

A threat intelligence automation platform that ingests URLs, extracts structured intelligence using an LLM, allows analyst review, and publishes the result as a MISP event.


Table of Contents

  • Overview
  • Architecture
  • Getting started
  • Development (without Docker)
  • Adding a custom LLM provider
  • Design decisions

Overview

Analysts submit a URL (threat report, blog post, PDF). The platform fetches the content, runs it through an LLM to extract structured threat intelligence, and presents the result in an editor for review. Once satisfied, the analyst pushes the event directly to a MISP instance.

Pipeline

URL submitted → Fetch content → LLM extraction → Analyst review → Push to MISP

Extracted intelligence

| Category | Detail |
|---|---|
| Summary | Short description of the threat |
| Threat actors | Named groups or individuals |
| Target sectors | Industries or sectors targeted |
| Target countries | Countries targeted |
| IoCs | IPs, domains, URLs, MD5 / SHA1 / SHA256 hashes (with to_ids flag) |
| TTPs | MITRE ATT&CK technique ID, name, and context |
| Detection rules | Engine (Sigma, KQL, …) and query |
| Threat hunting hypotheses | Title, hypothesis, approach, and visibility |
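
For illustration, a single extraction might produce a structure along these lines. The field names below are assumptions for readability, not the exact schema; see backend/app/services/llm/schemas.py and the AzureFoundryLLMService implementation for the authoritative shape.

# Hypothetical shape of one extraction result; key names are illustrative.
extracted = {
    "summary": "Phishing campaign delivering a loader to energy-sector targets.",
    "threat_actors": ["ExampleBear"],
    "target_sectors": ["Energy"],
    "target_countries": ["Germany", "France"],
    "iocs": [
        {"type": "ip-dst", "value": "203.0.113.10", "to_ids": True},
        {"type": "domain", "value": "malicious.example", "to_ids": True},
    ],
    "ttps": [
        {
            "technique_id": "T1566.001",
            "name": "Spearphishing Attachment",
            "context": "Initial access via a malicious attachment.",
        },
    ],
    "detection_rules": [
        {"engine": "Sigma", "query": "title: Suspicious Loader Execution ..."},
    ],
    "threat_hunting_hypotheses": [
        {
            "title": "Loader delivery via email",
            "hypothesis": "Endpoints in the energy sector received the loader by email.",
            "approach": "Search mail gateway logs for the listed domains and hashes.",
            "visibility": "email gateway, endpoint",
        },
    ],
}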

MISP event structure

| Intelligence | MISP representation |
|---|---|
| IoCs | Typed attributes (ip-dst, domain, url, md5, sha1, sha256) |
| Threat actors | threat-actor attributes |
| Target sectors | target-org attributes |
| Target countries | target-location attributes |
| TTPs | Galaxy tags (misp-galaxy:mitre-attack-pattern) + attack-pattern objects |
| Detection rules | text attributes with engine comment |
| Threat hunting hypotheses | Event reports (markdown, tagged Threat Hunting Hypothesis) |
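
As a rough sketch of this mapping, here is how the categories above could be expressed with PyMISP. This is illustrative only, not Karasu's internal event-building code.

from pymisp import MISPEvent

event = MISPEvent()
event.info = "Example: phishing campaign report"

# IoCs become typed attributes; to_ids controls IDS export
event.add_attribute("ip-dst", "203.0.113.10", to_ids=True)
event.add_attribute("domain", "malicious.example", to_ids=True)

# Actor and targeting context become plain attributes
event.add_attribute("threat-actor", "ExampleBear")
event.add_attribute("target-org", "Energy")
event.add_attribute("target-location", "Germany")

# TTPs become galaxy tags on the event
event.add_tag('misp-galaxy:mitre-attack-pattern="Spearphishing Attachment - T1566.001"')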

Architecture

┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│   Frontend  │────▶│   Backend   │────▶│  PostgreSQL  │
│  React/Vite │     │   FastAPI   │     └──────────────┘
│   (nginx)   │     │  (Uvicorn)  │     ┌──────────────┐
└─────────────┘     └──────┬──────┘────▶│    Redis     │
                           │            └──────────────┘
                    ┌──────▼──────────────────────────────┐
                    │           Celery Workers             │
                    │  ┌─────────┐ ┌─────────┐ ┌──────┐  │
                    │  │  fetch  │ │ extract │ │ misp │  │
                    │  └────┬────┘ └────┬────┘ └──┬───┘  │
                    └───────┼───────────┼──────────┼──────┘
                            │           │          │
                    fetch page      LLM provider  MISP
                    content         (pluggable)   galaxy
                                    │             resolution
                                    ▼             │
                              ┌──────────┐        ▼
                              │ Azure AI │   ┌─────────┐
                              │ Foundry  │   │  MISP   │
                              └──────────┘   └─────────┘

Processing pipeline

URL submitted → [fetch] → [extract] → Analyst review → [misp] → MISP event published

Services

| Service | Role |
|---|---|
| frontend | React SPA served by nginx, proxies /api to the backend |
| backend | FastAPI REST API, JWT auth, business logic |
| celery | Background workers consuming the fetch, extract, and misp queues |
| postgres | Primary data store for URLs, raw content, and extracted intelligence |
| redis | Celery broker and result backend |

Celery queues

| Queue | Task | Description |
|---|---|---|
| fetch | fetch_url_task | Downloads URL content (HTML, PDF) and stores raw text |
| extract | extract_llm_task | Runs LLM extraction; falls back to split extraction on token limit |
| misp | push_to_misp_task | Resolves ATT&CK galaxy tags and publishes the event to MISP |
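
A minimal sketch of how this routing could be configured in Celery is shown below. The task module paths and broker URL are assumptions; the actual configuration lives under backend/app/workers/.

from celery import Celery

# Redis serves as both broker and result backend (see Services above)
app = Celery("karasu", broker="redis://redis:6379/0", backend="redis://redis:6379/0")

# Route each task to its dedicated queue so workers can be scaled
# independently with -Q fetch,extract,misp
app.conf.task_routes = {
    "app.workers.tasks.fetch_url_task": {"queue": "fetch"},
    "app.workers.tasks.extract_llm_task": {"queue": "extract"},
    "app.workers.tasks.push_to_misp_task": {"queue": "misp"},
}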

LLM abstraction

The LLM provider is pluggable via LLM_PROVIDER in the environment. The active provider is selected at runtime by the factory in backend/app/services/llm/factory.py. See Adding a custom LLM provider for implementation details.


Getting started

Prerequisites

  • Docker and Docker Compose
  • A running MISP instance with the MITRE ATT&CK galaxy imported
  • An LLM provider (default: Azure AI Foundry with Mistral Small 2503)

1. Configure environment

Copy the example and fill in the values:

cp backend/.env.example backend/.env

All required values are listed in .env.example with generation instructions where applicable. Key values:

| Variable | Description | How to generate |
|---|---|---|
| POSTGRES_PASSWORD | Database password | Choose a strong password |
| SECRET_KEY | Application secret | python -c "import secrets; print(secrets.token_hex(32))" |
| JWT_SECRET_KEY | JWT signing key | python -c "import secrets; print(secrets.token_hex(32))" |
| MISP_TOKEN_ENCRYPTION_KEY | Fernet key for encrypting MISP tokens at rest | python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" |
| AZURE_API_KEY | Azure AI Foundry API key | Azure portal |
| AZURE_INFERENCE_ENDPOINT | Azure AI Foundry endpoint URL | Azure portal |
| MISP_URL | Base URL of your MISP instance | e.g. https://misp.example.com |
| CORS_ORIGINS | Comma-separated allowed origins | e.g. https://your-host |
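
For reference, a sketch of how the backend could load these values with pydantic-settings; the actual Settings class in the repo may differ.

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Illustrative subset of the variables above; see .env.example for the full list
    model_config = SettingsConfigDict(env_file=".env")

    POSTGRES_PASSWORD: str
    SECRET_KEY: str
    JWT_SECRET_KEY: str
    MISP_TOKEN_ENCRYPTION_KEY: str
    MISP_URL: str
    LLM_PROVIDER: str = "azure_foundry"

settings = Settings()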

2. Build and run

docker compose up --build -d

The frontend is available at https://your-host (port 443). HTTP traffic on port 80 is redirected to HTTPS automatically. The self-signed TLS certificate will trigger a browser warning on first visit — add a browser exception to proceed.

3. Create the first admin user

Once the containers are running, create the initial admin account:

docker compose exec backend python -m app.scripts.create_admin <username> <password>

Additional users can be created and managed through the User Management page in the UI.

4. Set your MISP API token

Each user must configure their personal MISP API token before they can push events:

  1. Log in and click your username in the top bar
  2. Enter your MISP API token and click Save token

Tokens are encrypted at rest and never exposed in API responses. Events pushed to MISP are attributed to the token owner, preserving audit trail integrity.


Development (without Docker)

Backend

cd backend
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
cp .env.example .env   # fill in values
uvicorn app.main:app --reload

Start the Celery worker separately:

celery -A app.workers.celery_app worker --loglevel=info -Q fetch,extract,misp

Frontend

cd frontend
npm install
npm run dev

Adding a custom LLM provider

Karasu uses an abstraction layer that makes it straightforward to swap in a different LLM provider without touching the rest of the application.

1. Implement the base class

Create a new file in backend/app/services/llm/ and implement BaseLLMService:

from app.services.llm.base import BaseLLMService
from app.services.llm.schemas import LLMRequest, LLMResponse, LLMTokenLimitExceeded

class MyLLMService(BaseLLMService):

    async def extract(self, request: LLMRequest) -> LLMResponse:
        # Make a single request to your LLM using MISP_EXTRACTION_PROMPT
        # Raise LLMTokenLimitExceeded if the model hits its output limit
        # Return an LLMResponse with extracted_data, token counts, and model name
        ...

    async def extract_split(self, request: LLMRequest) -> LLMResponse:
        # Make two parallel requests using MISP_IOC_TTP_PROMPT and MISP_ANALYSIS_PROMPT
        # Merge the results and return a single LLMResponse
        ...

The prompts are defined in backend/app/services/llm/prompts.py. The extracted JSON must conform to the schema the rest of the pipeline expects; see the existing AzureFoundryLLMService implementation for reference.

LLMTokenLimitExceeded must be raised (not caught silently) when the model's output is truncated — this is what triggers the split extraction fallback in the worker.
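
In outline, the worker-side fallback looks something like the sketch below. This is simplified: the real Celery task also handles retries and persistence, and the function name is illustrative.

from app.services.llm.factory import get_llm_client
from app.services.llm.schemas import LLMRequest, LLMTokenLimitExceeded

async def run_extraction(request: LLMRequest):
    llm = get_llm_client()
    try:
        # Single-request extraction first
        return await llm.extract(request)
    except LLMTokenLimitExceeded:
        # Output was truncated: retry as two parallel,
        # smaller-scope requests and merge the results
        return await llm.extract_split(request)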

2. Register the provider in the factory

In backend/app/services/llm/factory.py, add your provider:

from app.services.llm.my_provider import MyLLMService

def get_llm_client() -> BaseLLMService:
    if settings.LLM_PROVIDER == "azure_foundry":
        return AzureFoundryLLMService()
    if settings.LLM_PROVIDER == "my_provider":
        return MyLLMService()

    raise ValueError(f"Unsupported LLM provider: {settings.LLM_PROVIDER}")

3. Set the provider in your environment

LLM_PROVIDER=my_provider

Design decisions

Human review before MISP publication

LLM extraction is not treated as ground truth. After extraction completes, the result is presented to the analyst in an editor where every field — IoCs, TTPs, detection rules, threat hunting hypotheses — can be inspected, corrected, or removed before anything is sent to MISP. The push to MISP is always a deliberate, manual action. This keeps a human in the loop and prevents LLM hallucinations or misclassifications from polluting the threat intelligence platform automatically.

Per-user MISP API tokens

Each analyst authenticates to MISP using their own personal API token rather than a shared service account. This preserves attribution in MISP's audit log — events pushed by different analysts are recorded under their respective accounts. Tokens are stored encrypted at rest using Fernet symmetric encryption and are never exposed in API responses.
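
A minimal sketch of this at-rest encryption, assuming MISP_TOKEN_ENCRYPTION_KEY holds the Fernet key generated per the environment table above (helper names are illustrative):

import os
from cryptography.fernet import Fernet

fernet = Fernet(os.environ["MISP_TOKEN_ENCRYPTION_KEY"])

def encrypt_token(plaintext: str) -> bytes:
    # Ciphertext is what gets stored in the database
    return fernet.encrypt(plaintext.encode())

def decrypt_token(ciphertext: bytes) -> str:
    # Decrypted only when a push to MISP actually needs the token
    return fernet.decrypt(ciphertext).decode()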

Split extraction for large documents

LLMs have finite output limits. Long or verbose documents can cause the model to truncate its response mid-JSON, producing unusable output. To handle this, Karasu uses a two-request fallback strategy: the first attempt extracts everything in a single request capped at 10,000 output tokens. If the model hits this limit, the task automatically splits the work into two parallel requests — one for IoCs and TTPs, one for detection rules and threat hunting hypotheses — each with its own 10,000 token cap. Each extraction attempt can therefore consume up to 30,000 output tokens. Combined with up to 3 retries for transient failures (4 attempts total), a single document can consume up to 4 × 30,000 = 120,000 output tokens in the worst case. This avoids silently incomplete extractions without requiring the analyst to resubmit.

