Clouseau is a hierarchical multi-agent system for iterative cyber attack investigation. Starting from a point of interest (e.g., a suspicious domain), it autonomously explores incident data sources, issues targeted queries, analyzes evidence, and incrementally reconstructs an attack narrative without training or predefined heuristics.

# Getting Started

* Set the LLM provider.
* Run a sample investigation.
* Evaluate the attack report.



## LLM Provider
First, run a quick test to verify that Clouseau works properly. Set the API key in the cell below, along with the model name and base URL. Clouseau only supports models with tool-calling ability. If you use a reasoning model, increase `max_tokens` to at least 4096 and set `reasoning_effort` to low. We highly recommend enabling LLM tracing here, either by changing the callback (or adding a new one) or by activating LangSmith, which only requires setting a few environment variables; refer to its documentation for details.

In [3]:
from debug_callback import SimpleCallback
from langchain_openai import ChatOpenAI
from getpass import getpass
import os

model = 'gpt-4.1-mini'
api_key = None # set your api key here
base_url = "https://api.openai.com/v1/"

if api_key is None:
    api_key = os.getenv("OPENAI_API_KEY", None)
    if api_key is None:
        print("No API Key is found")
        # read api key from stdin
        api_key = getpass("please type your API key")
        if len(api_key) == 0:
            print("Set the API key to continue")
            api_key = None

llm = None
if api_key is not None:
    llm = ChatOpenAI(
        model=model,
        api_key=api_key,
        base_url=base_url,
        temperature=0,
        streaming=True,
        callbacks=[SimpleCallback()],
    )
    try:
        resp = llm.invoke("Say 'pong'.")
        print("OK ✅")
    except Exception as e:
        print("FAILED ❌", e)


No API Key is found
please type your API key··········
AI: 
pong
OK ✅
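
The tracing recommendation above can be sketched as follows. This is a minimal example, assuming LangSmith's documented environment variables (`LANGSMITH_TRACING`, `LANGSMITH_API_KEY`, `LANGSMITH_PROJECT`). The helper name `enable_langsmith_tracing` and the project name `clouseau-demo` are hypothetical and not part of Clouseau. Older LangChain releases read `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY` instead, so check the LangSmith documentation for the version you have installed.

```python
import os


def enable_langsmith_tracing(api_key: str, project: str = "clouseau-demo") -> dict:
    """Set the environment variables LangSmith reads and return what was set.

    `project` is a hypothetical project name chosen for this sketch.
    """
    settings = {
        "LANGSMITH_TRACING": "true",   # switches tracing on
        "LANGSMITH_API_KEY": api_key,  # your LangSmith key, not the LLM provider key
        "LANGSMITH_PROJECT": project,  # groups the traces of one experiment
    }
    os.environ.update(settings)
    return settings
```

Calling the helper before the `ChatOpenAI` object is created is the safest ordering, since every later call is then traced under the same project.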


## Attack Investigation

Run the cell below to start an investigation. You can change the agent parameters to make the investigation more thorough. In the paper, we set all three limits (`max_investigations`, `max_questions`, and `max_queries`) to 10. However, this means a longer investigation and more tokens. For testing, setting them to 2, 5, and 5 respectively is enough for simple investigations. Due to size constraints, we only include the S1-S4 attack scenarios in this artifact. You can download and preprocess the remaining scenarios; refer to the README file in `preprocessing/`. We highly recommend doing this in an unrestricted environment with persistent storage rather than in Google Colab.

These scenarios come with three POIs: the malicious domain `0xalsaheel.com`, the malicious executable `payload.exe`, and the IP address `192.168.223.3`. Set the `clue` option appropriately. The clue in the cell below directs Clouseau to investigate the malicious domain.

In [None]:
from chief_inspector import investigate_atlas
import constants

scenario = "S1"  # change this to S2, S3 or S4
# Set Investigation Parameters
configs = {
    "max_investigations" : 2,
    "max_questions" : 5,
    "max_queries" : 5,
    "max_tokens" : constants.DEFAULT_MAX_TOKENS,
    "db_name" : f"scenarios/{scenario}/scenario.db",
    "data_path": f"scenarios/{scenario}/",
    "clue" : "We noticed abnormal visits to 0xalsaheel.com, investigate these visits and report any abnormal activities associated with this possible attack."
}

results = investigate_atlas(llm=llm, configs=configs)

AI: 

Tool Call: investigate_lead
lead: Investigate the domain name 0xalsaheel.com, find any processes who connected to this address, construct execution tree, and investigate these processes. Find any executables or script that may have been downloaded around that time, inspect any frequent or unordinary connections around the time.
AI: 

Tool Call: ask_dns
question: Show the DNS response for the domain '0xalsaheel.com'.
Tool Call: ask_audit
question: List all processes that connected to the IP addresses resolved from '0xalsaheel.com'.
AI: 

Tool Call: run_sql_query
query: SELECT response FROM dns_requests WHERE domain = '0xalsaheel.com';
Tool Call: run_sql_query
query: SELECT time, domain, response FROM dns_requests WHERE response LIKE '%error%' OR response = '' OR response IS NULL; 
AI: 
The DNS response for the domain '0xalsaheel.com' shows multiple entries with the IP address '192.168.223.3'.

Additionally, I checked for any abnormal DNS responses such as errors or empty responses
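
To switch between scenarios and POIs without editing the dictionary by hand, the configuration can be built by a small helper. The sketch below is not part of Clouseau: `make_configs`, `BUNDLED_SCENARIOS`, and `CLUES` are hypothetical names, and the clue wording for the file and IP POIs is an assumption written for illustration (only the domain clue follows the notebook). The token limit is passed in explicitly because `constants.DEFAULT_MAX_TOKENS` is only available inside the repository.

```python
# Scenarios shipped with the artifact, according to the text above.
BUNDLED_SCENARIOS = {"S1", "S2", "S3", "S4"}

# Clue per POI type. The domain clue follows the notebook; the file and IP
# clues are hypothetical wording written for this sketch.
CLUES = {
    "domain": ("We noticed abnormal visits to 0xalsaheel.com, investigate these "
               "visits and report any abnormal activities associated with this "
               "possible attack."),
    "file": ("We found a suspicious executable named payload.exe, investigate "
             "how it arrived and what it did."),
    "ip": ("We noticed persistent connections to 192.168.223.3, investigate "
           "these connections and any related activity."),
}


def make_configs(scenario, poi, max_tokens, limits=(2, 5, 5)):
    """Build a configs dict shaped like the one passed to investigate_atlas."""
    if scenario not in BUNDLED_SCENARIOS:
        raise ValueError(f"scenario {scenario!r} is not bundled with the artifact")
    if poi not in CLUES:
        raise ValueError(f"unknown POI type {poi!r}")
    max_investigations, max_questions, max_queries = limits
    return {
        "max_investigations": max_investigations,
        "max_questions": max_questions,
        "max_queries": max_queries,
        "max_tokens": max_tokens,  # the notebook uses constants.DEFAULT_MAX_TOKENS
        "db_name": f"scenarios/{scenario}/scenario.db",
        "data_path": f"scenarios/{scenario}/",
        "clue": CLUES[poi],
    }
```

For example, `make_configs("S2", "file", max_tokens=constants.DEFAULT_MAX_TOKENS)` would produce a quick-test configuration for the executable POI, and passing `limits=(10, 10, 10)` would match the setting the paper reports.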

## Results Evaluation

Now that the investigation is complete, we can parse the output JSON report and evaluate the findings quantitatively.

In [None]:
from evaluation import evaluate_report

er = evaluate_report(configs, results)
print(er)

p = er.tp / (er.tp + er.fp)
r = er.tp / (er.tp + er.fn)
f1 = 2 * (p * r) / (p + r)
fpr = er.fp / (er.fp + er.tn)
print(f"Precision: {p*100:.2f}%, Recall: {r*100:.2f}%, F1: {f1*100:.2f}%, FPR: {fpr*100:.2f}%")


P = 4598, N = 90490, tp = 4550, tn = 90275, fp = 215, fn = 48
Precision: 95.49%, Recall: 98.96%, F1: 97.19%, FPR: 0.24%
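
The metric formulas in the cell above raise `ZeroDivisionError` when a denominator is zero, for example when an investigation flags nothing (`tp + fp == 0`). A guarded variant might look like the sketch below. The function name `confusion_metrics` is hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something the evaluation module is known to do.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, F1 and FPR from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    fpr = ratio(fp, fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Counts taken from the evaluation output above.
m = confusion_metrics(tp=4550, tn=90275, fp=215, fn=48)
print(", ".join(f"{name}: {value * 100:.2f}%" for name, value in m.items()))
```

With the counts printed above, this reproduces the same percentages as the notebook cell.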


# Single Host Scenarios

Running the next cell evaluates Clouseau against all single-host scenarios (S1-S4) across all POIs, then prints the average metrics for these investigations alongside the metrics for each scenario per POI. Running this will consume a significant amount of your OpenAI credit (or credit with whichever provider you use). Again, it would be a waste to run these investigations without recording their traces.

In [7]:
import constants
from app import run_scenarios
from prompts import atlas_init_domain, atlas_init_file, atlas_init_ip

llm.callbacks = []  # set a tracing callback here; avoid the SimpleCallback used earlier, since its output is hard to inspect across this many runs

pois = [(atlas_init_ip, 'IP'), (atlas_init_domain, 'domain'), (atlas_init_file, 'file')]
s_scn = [
    {'name': 's1', 'path': 'scenarios/S1/', 'poi': pois},
    {'name': 's2', 'path': 'scenarios/S2/', 'poi': pois},
    {'name': 's3', 'path': 'scenarios/S3/', 'poi': pois},
    {'name': 's4', 'path': 'scenarios/S4/', 'poi': pois}
]

csv_file = "single_host.csv"
configs = {
    "max_investigations" : constants.DEFAULT_INVESTIGATIONS,
    "max_questions" : constants.DEFAULT_QUESTIONS,
    "max_queries" : constants.DEFAULT_QUERIES,
    "max_tokens" : constants.DEFAULT_MAX_TOKENS,
    }

run_scenarios(s_scn, llm, configs, False, False, csv_file)
print("Done")


Investigating s1
chief_inspector: Agent is quitting without a tool call, returning initial response.

We have completed the maximum allowed investigations. Here is the final comprehensive report based on all findings:

---

### Final Incident Analysis Report

#### Incident Overview:
A persistent and suspicious connection to the internal IP address **192.168.223.3** was detected on a compromised machine within a small enterprise network. This IP was found to be associated with the attacker-controlled domain **0xalsaheel.com**. The attack involved a malicious executable named **payload.exe** that was executed through a chain originating from the Firefox browser.

---

#### Attack Artifacts:

| Artifact Type           | Details                                                                                  |
|------------------------|------------------------------------------------------------------------------------------|
| **Malicious Process**   | `payload.exe` (PID 2592), located at

Now, let's print the raw results for each test.

In [17]:
import pandas as pd

df = pd.read_csv(csv_file)
df

Unnamed: 0,test_name,P,N,tp,tn,fp,fn,total_tokens,input_tokens,output_tokens,model_name
0,s1_IP,4598,90490,4598,90490,0,0,0,0,0,
1,s1_domain,4598,90490,4598,90226,264,0,0,0,0,
2,s1_file,4598,90490,4588,90490,0,10,0,0,0,
3,s2_IP,15073,382953,15073,382950,3,0,0,0,0,
4,s2_domain,15073,382953,15055,382461,492,18,0,0,0,
5,s2_file,15073,382953,15073,382556,397,0,0,0,0,
6,s3_IP,5165,123164,5014,123163,1,151,0,0,0,
7,s3_domain,5165,123164,5018,123163,1,147,0,0,0,
8,s3_file,5165,123164,5018,123163,1,147,0,0,0,
9,s4_IP,18062,107551,17850,107551,0,212,0,0,0,


In [18]:
df["precision"] = df["tp"] / (df["tp"] + df["fp"])
df["recall"]    = df["tp"] / (df["tp"] + df["fn"])
df["fpr"]       = df["fp"] / (df["fp"] + df["tn"])
df["f1"] = 2 * df["precision"] * df["recall"] / (df["precision"] + df["recall"])
df.rename(columns={"test_name": "Name"}, inplace=True)
df = df[["Name", "precision", "recall", "fpr", "f1"]]
df

Unnamed: 0,Name,precision,recall,fpr,f1
0,s1_IP,1.0,1.0,0.0,1.0
1,s1_domain,0.945701,1.0,0.002917,0.972093
2,s1_file,1.0,0.997825,0.0,0.998911
3,s2_IP,0.999801,1.0,8e-06,0.9999
4,s2_domain,0.968354,0.998806,0.001285,0.983344
5,s2_file,0.974337,1.0,0.001037,0.987002
6,s3_IP,0.999801,0.970765,8e-06,0.985069
7,s3_domain,0.999801,0.971539,8e-06,0.985467
8,s3_file,0.999801,0.971539,8e-06,0.985467
9,s4_IP,1.0,0.988263,0.0,0.994097


In [19]:
average_df = df[["precision", "recall", "fpr", "f1"]].agg(["mean"])
average_df

Unnamed: 0,precision,recall,fpr,f1
mean,0.990428,0.990583,0.000473,0.990353
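
The `agg(["mean"])` call averages the per-test metrics, so every test counts equally regardless of how many events its scenario contains (a macro average). An alternative is to pool the raw counts first and compute the metric once (a micro average), which weights larger scenarios more heavily. The sketch below contrasts the two for precision, using only the three S1 rows from the raw results table. The function names are hypothetical, and this comparison is an illustration rather than a result reported by the notebook.

```python
# (tp, tn, fp, fn) for the three S1 tests, copied from the raw results table.
ROWS = {
    "s1_IP": (4598, 90490, 0, 0),
    "s1_domain": (4598, 90226, 264, 0),
    "s1_file": (4588, 90490, 0, 10),
}


def precision(tp: int, fp: int) -> float:
    """Precision with a 0.0 fallback when nothing was flagged."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def macro_precision(rows: dict) -> float:
    """Average the per-test precision values (each test weighted equally)."""
    per_test = [precision(tp, fp) for tp, _tn, fp, _fn in rows.values()]
    return sum(per_test) / len(per_test)


def micro_precision(rows: dict) -> float:
    """Pool the counts across tests, then compute precision once."""
    total_tp = sum(counts[0] for counts in rows.values())
    total_fp = sum(counts[2] for counts in rows.values())
    return precision(total_tp, total_fp)


print(f"macro: {macro_precision(ROWS):.4f}, micro: {micro_precision(ROWS):.4f}")
```

The two values stay close when the tests have similar sizes, as the S1 rows do. They can diverge when scenarios differ widely in event counts.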


With this, we have replicated part of the experiments conducted in the paper. Please refer to `README.md` for details on how to run all the experiments. We highly recommend running them in an unrestricted environment with persistent storage.