Binary file added Method_Overview.png
104 changes: 102 additions & 2 deletions README.md
@@ -1,2 +1,102 @@
# HyperGLLM
Code & dataset for AAAI26 Paper "HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models".
# <div align="center">HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models</div>
<div align="center">
<p><strong>An efficient framework that introduces hypergraph reasoning into LLMs for malicious behavior detection in EDR logs.</strong></p>
<a href="https://huggingface.co/datasets/The_data_will_be_released_upon_acceptance_of_the_paper./" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Dataset-23999a.svg></a>
<a href="https://github.com/"><img alt="License" src="https://img.shields.io/badge/License-MIT-green.svg"></a>
<a><img alt="Static Badge" src="https://img.shields.io/badge/made_with-Python-blue"></a>
</div>

<h4 align="center">

<p>
<a href="#overview">Overview</a> |
<a href="#sparkles-features">Features</a> |
<a href="#page_with_curl-changelog">Changelog</a> |
<a href="#rocket-quick-start">Quick-Start</a> |
<a href="#floppy_disk-dataset">Dataset</a> |
<a href="#raised_hands-faqs"> FAQs</a>
</p>

## Overview

**HyperGLLM** is a novel detection framework that introduces hypergraph reasoning into LLMs. It first constructs an attribute-value level relation-aware graph to model low-order structural semantics while reducing textual redundancy. Then, it introduces a differential hypergraph module with multi-granularity clustering to capture high-order behavioral dependencies embedded in interleaved events and reinforce threat semantics. Finally, the hypergraph representations are aligned with an LLM for efficient contextual reasoning over potential malicious behaviors. We curate EDR3.6B-63F, a large-scale EDR dataset containing 3.6 billion events across 63 distinct behavior families. Extensive experiments demonstrate that HyperGLLM significantly outperforms state-of-the-art methods by reducing the false alarm rate to 1.67%, achieving 94.65% accuracy across 63 behavior families, and improving the modeling efficiency of LLMs on long EDR logs. Our framework and dataset provide a solid foundation for future research and support the development of advanced detection solutions in endpoint security.

<p align="center">
<img src="Method_Overview.png">
</p>
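
To make the pipeline concrete, the toy sketch below illustrates the intuition behind the first two stages on a handful of events. It is only an illustration: all function names are hypothetical, and the actual implementation lives in `experiments_main/`.

```python
# Toy illustration only; function names are hypothetical, not the HyperGLLM API.
from collections import defaultdict

def build_relation_graph(events):
    """Stage 1 (intuition): attribute-value nodes, linked when values co-occur in one event."""
    graph = defaultdict(set)
    for event in events:                               # event: {attribute: value}
        nodes = [f"{k}={v}" for k, v in event.items()]
        for n in nodes:
            graph[n].update(m for m in nodes if m != n)
    return graph

def group_into_hyperedges(events, key):
    """Stage 2 (intuition): one hyperedge per shared value, tying together events
    that are interleaved in the log but belong to the same behavior."""
    hyperedges = defaultdict(list)
    for i, event in enumerate(events):
        if key in event:
            hyperedges[event[key]].append(i)
    return hyperedges

events = [
    {"process_name": "powershell.exe", "file_path": "C:\\Users\\x\\a.ps1"},
    {"process_name": "svchost.exe", "remote_ip": "10.0.0.8"},
    {"process_name": "powershell.exe", "reg_path": "HKCU\\...\\Run"},
]
graph = build_relation_graph(events)
print(group_into_hyperedges(events, "process_name")["powershell.exe"])  # [0, 2]
# Stage 3 (paper): the hypergraph representations are aligned with an LLM for contextual reasoning.
```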

## :sparkles: Features

- **Framework**: We propose HyperGLLM, an efficient framework that introduces hypergraph reasoning into LLMs for malicious behavior detection in EDR logs, capturing both structural semantics and long-range temporal dependencies.
- **Structural Semantics & Temporal Dependencies**: We design an attribute-value level relation-aware graph and a differential hypergraph module with multi-granularity clustering to jointly model low- and high-order behavior semantics, thereby enhancing the semantic representation of threat behaviors.
- **EDR3.6B-63F Dataset**: We construct EDR3.6B-63F, a large-scale EDR dataset that serves as a high-quality benchmark for advancing AI-driven research in endpoint security, offering diverse behavior types and detailed event records.
- **Effective and Efficient**: Extensive experiments demonstrate that HyperGLLM consistently outperforms state-of-the-art baselines across multiple metrics while maintaining high inference efficiency.


## :page_with_curl: Changelog
[23/10/25] The project launched!



## :rocket: Quick Start

We conduct training on eight NVIDIA H100 (80GB) GPUs and perform evaluation on a single GPU.

**Directory Overview**

- *requirements.txt* — Lists all dependencies required to reproduce this project.
- *datasets/* — Contains all datasets used in both the main experiments and the ablation study described in the paper.
- *experiments_main/* — Source code for reproducing the main experiments.
- *experiments_ablation/* — Source code for reproducing the ablation study experiments in the paper.

**Install Dependencies**

To reproduce our work, you need to have Python installed along with the required libraries. You can install the necessary dependencies using the following command:
```bash
pip install -r requirements.txt
```

**Reproducing the Main Experiment**

To reproduce the main experimental results:
1. cd to the main experiment directory:
```bash
cd experiments_main
```
2. Run the training and inference script:
```bash
sh run.sh
```
3. Obtain evaluation metrics:
```bash
python get_metrics.py
```
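
For reference, the snippet below is a minimal sketch of how the two headline metrics (multi-class accuracy and the false alarm rate over benign samples) can be computed from prediction/label lists. The label name `benign` is an assumption; `get_metrics.py` remains the authoritative implementation.

```python
# Minimal sketch; get_metrics.py is the authoritative script, and "benign" is an assumed label name.
def evaluate(labels, preds, benign_label="benign"):
    accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)
    benign = [(l, p) for l, p in zip(labels, preds) if l == benign_label]
    false_alarms = sum(p != benign_label for _, p in benign)
    far = false_alarms / len(benign) if benign else 0.0
    return accuracy, far

acc, far = evaluate(["benign", "ransomware", "benign"], ["benign", "ransomware", "worm"])
print(f"accuracy={acc:.2%}  false_alarm_rate={far:.2%}")  # accuracy=66.67%  false_alarm_rate=50.00%
```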
You can also obtain runtime efficiency metrics (GPU memory usage and Time-to-First-Token) by running:
```bash
python get_gpumu_tps.py
```
This reports the efficiency metrics for an input of 1,024K tokens.

*To reproduce other experiments (e.g., the ablation study), cd to the corresponding directory (e.g., `cd experiments_ablation/Analysis_DHGNN`), run `sh run.sh` for training and inference, and obtain evaluation metrics with `python get_metrics.py`.*


## :floppy_disk: Dataset

The EDR3.6B-63F dataset is available in [this Hugging Face repository](https://huggingface.co/datasets/Iceswordlab/EDR3.6B-63F); see `datasets/readme.md` for download and preprocessing instructions.
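
A minimal download sketch using `huggingface_hub` (the target directory below is our choice; see `datasets/readme.md` for where each archive should be extracted):

```python
# Sketch: fetch the dataset repository; adjust local_dir as needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Iceswordlab/EDR3.6B-63F",
    repo_type="dataset",
    local_dir="datasets/EDR3.6B-63F",
)
```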

## :raised_hands: FAQs

## :bookmark: License

Our project is licensed under the [*MIT License*](./LICENSE).

## Citation
If you use the EDR3.6B-63F dataset in your research or find our HyperGLLM method helpful, please consider citing our paper:

```bibtex
@inproceedings{hypergllm2026,
title = {HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models},
year = {2026},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
}
```
200 changes: 200 additions & 0 deletions datasets/get_npz_convenient.py
@@ -0,0 +1,200 @@
import os
import json
import gzip
import pickle
import random
import multiprocessing

import numpy as np
from difflib import SequenceMatcher



# Per-key length statistics; only the third entry of each list is used below,
# as the fixed character width reserved for that key's value field.
used_keys = {'action_type': [6, 12.09, 14.0, 14], 'addr_family': [7, 7.41, 8.0, 8],
'child_command_line': [1, 85.68, 127.0, 20638], 'child_process_build_name': [1, 11.34, 21.0, 127],
'child_process_cert_hash': [40, 40.00, 40.0, 41], 'child_process_ext_level': [1, 1.04, 1.0, 3],
'child_process_integrity': [3, 5.3, 6.0, 6], 'child_process_md5': [32, 32.0, 32.0, 32],
'child_process_name': [1, 13.54, 27.0, 163], 'child_process_net_level': [1, 1.99, 2.0, 2],
'child_process_orig_name': [1, 13.06, 23.0, 63], 'child_process_path': [8, 42.91, 88.0, 174],
'child_process_pdb': [6, 22.16, 60.0, 210], 'child_process_product_name': [1, 31.83, 36.0, 127],
'child_process_service': [1, 13.07, 33.0, 156], 'child_process_signer': [2, 20.06, 27.0, 68],
'command_line': [0, 79.77, 269.0, 20638], 'control': [6, 6.0, 6.0, 6],
'data_size': [1, 1.92, 4.0, 7], 'device_reserved': [1, 1.0, 1.0, 1],
'device_type': [1, 1.0, 1.0, 1], 'dmpinfo': [34, 34.67, 35.8, 36],
'edrver': [11, 11.0, 11.0, 11], 'event_cost': [3, 3.0, 3.0, 3],
'file_action': [17, 26.97, 31.0, 31], 'file_cert_hash': [40, 40.00, 40.0, 41],
'file_delete_count': [1, 1.04, 1.0, 3], 'file_device': [2, 2.00, 2.0, 3],
'file_dltool': [1, 1.00, 1.0, 4], 'file_ext_level': [1, 1.00, 1.0, 3],
'file_hidden_data': [1, 1.0, 1.0, 1], 'file_internal_name': [1, 13.98, 21.0, 127],
'file_md5': [32, 32.0, 32.0, 32], 'file_name': [1, 21.27, 44.0, 229],
'file_net_level': [2, 2.0, 2.0, 2], 'file_network': [21, 34.78, 44.0, 44],
'file_oldname': [4, 22.03, 40.0, 92], 'file_oldpath': [9, 69.85, 108.0, 163],
'file_orig_name': [1, 14.84, 28.0, 63], 'file_path': [0, 71.71, 119.0, 259],
'file_pdb': [4, 54.25, 95.0, 259], 'file_product_name': [1, 18.14, 36.0, 127],
'file_signer': [2, 25.84, 33.0, 68], 'file_source': [0, 40.16, 51.0, 124],
'file_written_count': [1, 1.36, 2.0, 3], 'finalexcep': [4, 4.0, 4.0, 4],
'flags': [1, 1.0, 1.0, 1], 'from_etw_data': [1, 1.0, 1.0, 1],
'guid': [38, 38.0, 38.0, 38], 'image_path': [9, 36.85, 52.0, 92],
'inject_size': [1, 3.29, 4.0, 17], 'inject_target_address': [5, 8.20, 13.0, 15],
'inject_type': [11, 17.71, 18.0, 20], 'invalid_sign': [1, 1.0, 1.0, 1],
'lnk_cmd': [1, 28.47, 47.0, 190], 'lnk_exe': [0, 25.51, 46.0, 106],
'lnk_name': [27, 90.55, 131.0, 131], 'lnk_path': [9, 71.24, 91.0, 113],
'load_driver_type': [1, 1.0, 1.0, 1], 'logon_domain': [15, 15.0, 15.0, 15],
'matched_filepath_list': [9, 82.55, 114.0, 10550], 'mem_load': [2, 2.0, 2.0, 2],
'online': [2, 2.0, 2.0, 2], 'operation_type': [1, 2.43, 6.0, 6],
'parent_command_line': [1, 38.96, 59.0, 20638], 'parent_process_name': [2, 17.81, 48.0, 163],
'parent_process_net_level': [2, 2.0, 2.0, 2], 'parent_process_path': [8, 53.07, 127.0, 174],
'path_hijacking': [1, 1.0, 1.0, 1], 'port_number': [1, 1.0, 1.0, 1],
'procaction_timestamp': [24, 24.0, 24.0, 24],'process_build_name': [1, 15.30, 45.0, 127],
'process_cert_hash': [40, 40.00, 40.0, 41], 'process_exec_hijack': [1, 1.0, 1.0, 1],
'process_ext_level': [1, 1.02, 1.0, 3], 'process_file_version': [1, 23.8, 38.0, 63],
'process_input_handle': [14, 14.00, 14.0, 17], 'process_integrity': [0, 5.47, 6.0, 6],
'process_link': [9, 140.82, 730.0, 2199], 'process_link_level': [2, 2.0, 2.0, 2],
'process_md5': [32, 32.0, 32.0, 32], 'process_name': [1, 17.43, 43.0, 163],
'process_net_level': [2, 2.0, 2.0, 2], 'process_orig_name': [1, 16.52, 41.0, 63],
'process_output_handle': [14, 14.00, 14.0, 17], 'process_path': [6, 59.38, 140.0, 174],
'process_pdb': [6, 33.70, 90.0, 210], 'process_product_name': [1, 27.50, 36.0, 127],
'process_readed_ext': [0, 6.39, 12.0, 84], 'process_remote_shell': [1, 1.0, 1.0, 1],
'process_service': [1, 40.95, 168.0, 183], 'process_signer': [2, 20.28, 27.0, 68],
'process_start_time': [10, 10.0, 10.0, 10], 'read_file_count': [1, 1.00, 1.0, 4],
'read_file_io_counters': [1, 3.79, 5.0, 6], 'read_offset': [1, 10.83, 15.0, 15],
'reg_name': [0, 6.43, 16.0, 64], 'reg_path': [23, 82.0, 115.0, 154],
'reg_previous_data': [1, 21.48, 38.0, 2048], 'reg_type': [0, 7.05, 10.0, 20],
'reg_value': [0, 296.23, 3396.0, 3952], 'remote_ip': [12, 20.59, 35.0, 36],
'remote_operate': [1, 1.0, 1.0, 1], 'remote_port': [3, 4.94, 5.0, 5],
'reparse_tag': [10, 10.0, 10.0, 10], 'reset_code': [10, 10.0, 10.0, 10],
'reset_info': [11, 11.0, 11.0, 11], 'reset_time': [1, 1.0, 1.0, 1],
'sensor_modules': [793, 1019.31, 1052.0, 1052], 'service_name': [1, 4.93, 16.0, 51],
'service_type': [1, 1.0, 1.0, 1], 'shell_code': [64, 64.0, 64.0, 64],
'shell_name': [13, 13.63, 14.0, 14], 'source_process_integrity': [0, 5.67, 6.0, 6],
'source_process_path': [10, 10.0, 10.0, 10], 'source_process_starttime': [10, 10.0, 10.0, 10],
'target_command_line': [1, 10.87, 11.0, 3686], 'target_count': [1, 1.51, 2.0, 3],
'target_file_cert_hash': [40, 40.0, 40.0, 40], 'target_file_ext_level': [1, 1.03, 1.0, 3],
'target_file_md5': [32, 32.0, 32.0, 32], 'target_file_net_level': [2, 2.0, 2.0, 2],
'target_file_path': [5, 60.32, 73.0, 143], 'target_file_pdb': [7, 39.37, 50.0, 135],
'target_file_signer': [4, 21.06, 21.0, 59], 'target_process_cert_hash': [40, 40.0, 40.0, 40],
'target_process_ext_level': [1, 1.042, 1.0, 3], 'target_process_md5': [0, 32.0, 32.0, 32],
'target_process_name': [1, 12.51, 23.0, 163], 'target_process_net_level': [2, 2.0, 2.0, 2],
'target_process_path': [8, 39.27, 102.0, 174], 'target_process_pdb': [6, 14.74, 32.0, 210],
'target_process_signer': [2, 24.65, 27.0, 64], 'task_author': [3, 20.66, 23.0, 23],
'task_name': [2, 112.65, 115.0, 115], 'thread_module': [2, 11.90, 29.0, 163],
'unicode_rtlo': [1, 1.0, 1.0, 4], 'virus_name': [20, 21.0, 22.0, 22],
'write_offset': [1, 9.56, 13.0, 13], 'zero_day_check': [3, 3.0, 3.0, 3],
'zero_day_index': [1, 1.0, 1.0, 1], 'zero_day_op': [5, 5.0, 5.0, 5],
'zero_day_type': [3, 3.34, 4.0, 5], 'event_type': [7, 11.81, 15.0, 18],
'host_name': [15, 15.0, 15.0, 15],
}
# Width of each key's slot = key-name length + value-field width; MAXdim is the widest slot.
list_key_dims = [len(k) + int(np.ceil(v[2])) for k, v in used_keys.items()]
MAXdim = max(list_key_dims)




def most_similar_string_idx(a, candidates):
    """Return the candidate most similar to `a` (difflib ratio) and its index in `candidates`."""
    string = max(candidates, key=lambda x: SequenceMatcher(None, a, x).ratio())
    idx = candidates.index(string)
    return string, idx


def readfile(args):
    file_path, true_idx, time_len, max_time_len, save_fold = args
    is_training = False; ae = None; ac = None  # placeholders for training-time hooks; disabled in this script

    # Subsample long logs to at most max_time_len events, always keeping the labeled true_idx events.
    if max_time_len > 0 and time_len > max_time_len:
        if len(true_idx) > 0 and len(true_idx) > max_time_len:
            true_idx = random.sample(true_idx, max_time_len); usedidx = true_idx
            usedidx.sort(); true_idx.sort()
        if len(true_idx) > 0 and len(true_idx) <= max_time_len:
            pool_list = list(set(range(time_len)) - set(true_idx))
            usedidx = true_idx + random.sample(pool_list, max_time_len - len(true_idx))
            usedidx.sort(); true_idx.sort()
        if len(true_idx) == 0:
            pool_list = list(range(time_len))
            usedidx = random.sample(pool_list, max_time_len)
            usedidx.sort()
    else:
        usedidx = list(range(time_len))

    all_line_values = []; all_line_masks = []; flag = -1; ref_data = []
    with open(file_path, 'r', encoding='utf-8', errors="ignore") as json_file:
        for log_line in json_file:
            try:
                data = json.loads(log_line)
            except json.JSONDecodeError:
                continue
            flag += 1
            if flag not in usedidx:
                continue
            data["event"] = {k: v for k, v in data["event"].items() if k in used_keys}
            data["meta"] = {k: v for k, v in data["meta"].items() if k in used_keys}
            if flag not in true_idx:
                ref_data.append(data["event"])
            if is_training and ae is not None:
                data["event"] = ae.apply(data["event"], random.choice(ref_data) if len(ref_data) > 0 else {})

            # Initialize one fixed-width uint8 slot per key (key name + value bytes)
            # and a 0/1 mask marking which keys are present in this event.
            line_values = []
            for key in used_keys.keys():
                line_values.append(np.zeros(len(key) + int(np.ceil(used_keys[key][2])), dtype=np.uint8))
            line_masks = np.zeros(len(used_keys), dtype=np.uint8)
            for key in data["event"].keys():
                string, idx = most_similar_string_idx(key, list(used_keys.keys()))
                key_len = len(string); value_len = int(np.ceil(used_keys[string][2]))

                value_string = ''.join([char for char in str(data["event"][key]) if char.isascii()])
                input_string = key[-key_len:].ljust(key_len, "\0") + value_string[-value_len:].ljust(value_len, "\0")
                line_values[idx] = np.frombuffer(input_string.encode('utf-8'), dtype=np.uint8)
                line_masks[idx] = 1
            for key in data["meta"].keys():
                if key in data["event"].keys():
                    continue
                string, idx = most_similar_string_idx(key, list(used_keys.keys()))
                key_len = len(string); value_len = int(np.ceil(used_keys[string][2]))

                value_string = ''.join([char for char in str(data["meta"][key]) if char.isascii()])
                input_string = key[-key_len:].ljust(key_len, "\0") + value_string[-value_len:].ljust(value_len, "\0")
                line_values[idx] = np.frombuffer(input_string.encode('utf-8'), dtype=np.uint8)
                line_masks[idx] = 1
            all_line_values.append(line_values)
            all_line_masks.append(line_masks)
    if is_training and ac is not None:
        Allvalue, Truevalue = ac.apply(usedidx, true_idx)
        idxs = [usedidx.index(v) for v in Allvalue]
        all_line_values = [all_line_values[idx] for idx in idxs]
        all_line_masks = [all_line_masks[idx] for idx in idxs]
    # return all_line_values, all_line_masks
    data_save = {"all_line_values": all_line_values, "all_line_masks": all_line_masks}
    name_save = os.path.join(save_fold, file_path.split("/")[-1] + ".pkl.gz")
    with gzip.open(name_save, "wb") as f:
        pickle.dump(data_save, f, protocol=pickle.HIGHEST_PROTOCOL)
    return True


def process_files_concurrently(args, max_workers=64):
    # Cap the worker count at both max_workers and the available CPUs.
    num_workers = min(max_workers, os.cpu_count() or 1)
    with multiprocessing.Pool(processes=num_workers) as pool:
        results = list(pool.map(readfile, args))
    return results






if __name__ == "__main__":
    save_fold = "Meta_data_cleaned"
    os.makedirs(save_fold, exist_ok=True)
    for filename in ["Data_lists_cleaned/TRAIN.json", "Data_lists_cleaned/TEST.json", "Data_lists_cleaned/TEST_forUnknownAttack.json"]:
        with open(filename, "r", encoding="utf-8") as file:
            data = json.load(file)
        # Each entry provides file_path, true_idx, and time_len; cap every sample at 2048 events.
        args = [(item["file_path"], item["true_idx"], item["time_len"], 2048, save_fold) for item in data]
        RESULTS = process_files_concurrently(args)

    # ### Quick check: reload one converted sample
# with gzip.open("./Meta_data_cleaned/xxx.pkl.gz", "rb") as f:
# loaded_data = pickle.load(f)
# all_line_values = loaded_data["all_line_values"]
# all_line_masks = loaded_data["all_line_masks"]
# print(len(all_line_values))
# print(len(all_line_masks))
21 changes: 21 additions & 0 deletions datasets/readme.md
@@ -0,0 +1,21 @@
**Download**

Download the experimental `Data_lists_cleaned` archive from the [Data_lists_cleaned repo](https://huggingface.co/datasets/Iceswordlab/EDR3.6B-63F) and extract it into the `Data_lists_cleaned/` directory to obtain `Families.json`, `TRAIN.json`, `TEST.json`, and `TEST_forUnknownAttack.json`.

Download the `Meta_data_cleaned` archive from the [Meta_data_cleaned repo](https://huggingface.co/datasets/Iceswordlab/EDR3.6B-63F) and extract it into the `Meta_data_cleaned/` directory.

**File descriptions:**
- `Data_lists_cleaned/` — Directory containing all data lists needed for both the main experiments and the ablation study.
- `Families.json` — List of families used in the experiments.
- `TRAIN.json` — List of training data samples.
- `TEST.json` — List of test data samples.
- `TEST_forUnknownAttack.json` — Data list for an additional, independent test set containing 9,000 normal samples and 10,000 unknown-attack samples.
- `Meta_data_cleaned/` — Directory containing all the raw data required for both the main experiments and the ablation study.

**Preprocessing (recommended):**
- To facilitate experiments, we recommend first running:
```bash
python get_npz_convenient.py
```
- This script reads the raw data (text) in `Meta_data_cleaned/` and converts each sample into a `.pkl.gz` file with the same name.
- The generated `.pkl.gz` files are also saved in the `Meta_data_cleaned/` directory.
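
- To verify the conversion, the sketch below (mirroring the commented-out test at the end of `get_npz_convenient.py`) loads the first entry of `TRAIN.json` and its converted `.pkl.gz` sample; paths assume the default directory layout above.
```python
import gzip, json, os, pickle

with open("Data_lists_cleaned/TRAIN.json", "r", encoding="utf-8") as f:
    entry = json.load(f)[0]                      # each entry carries file_path, true_idx, time_len
sample_path = os.path.join("Meta_data_cleaned", os.path.basename(entry["file_path"]) + ".pkl.gz")
with gzip.open(sample_path, "rb") as f:
    sample = pickle.load(f)
print(len(sample["all_line_values"]), len(sample["all_line_masks"]))  # number of retained events
```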