Binary file added Method_Overview.png
104 changes: 102 additions & 2 deletions README.md
@@ -1,2 +1,102 @@
# HyperGLLM
Code & dataset for AAAI26 Paper "HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models".
# <div align="center">HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models</div>
<div align="center">
<p><strong>An efficient framework that introduces hypergraph reasoning into LLMs for malicious behavior detection in EDR logs.</strong></p>
<a href="https://huggingface.co/datasets/The_data_will_be_released_upon_acceptance_of_the_paper./" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Dataset-23999a.svg></a>
<a href="https://github.com/"><img alt="License" src="https://img.shields.io/badge/License-MIT-green.svg"></a>
<a><img alt="Static Badge" src="https://img.shields.io/badge/made_with-Python-blue"></a>
</div>

<h4 align="center">

<p>
<a href="#overview">Overview</a> |
<a href="#sparkles-features">Features</a> |
<a href="#page_with_curl-changelog">Changelog</a> |
<a href="#rocket-quick-start">Quick-Start</a> |
<a href="#floppy_disk-dataset">Dataset</a> |
<a href="#raised_hands-faqs"> FAQs</a>
</p>

## Overview

**HyperGLLM** is a novel detection framework that introduces hypergraph reasoning into LLMs. It first constructs an attribute-value level relation-aware graph to model low-order structural semantics while reducing textual redundancy. Then, it introduces a differential hypergraph module with multi-granularity clustering to capture high-order behavioral dependencies embedded in interleaved events and reinforce threat semantics. Finally, the hypergraph representations are aligned with an LLM for efficient contextual reasoning over potential malicious behaviors. We curate EDR3.6B-63F, a large-scale EDR dataset containing 3.6 billion events across 63 distinct behavior families. Extensive experiments demonstrate that HyperGLLM significantly outperforms state-of-the-art methods by reducing the false alarm rate to 1.67%, achieving 94.65% accuracy across 63 behavior families, and improving the modeling efficiency of LLMs on long EDR logs. Our framework and dataset provide a solid foundation for future research and support the development of advanced detection solutions in endpoint security.

<p align="center">
<img src="Method_Overview.png">
</p>
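
To make the pipeline concrete, the toy sketch below illustrates the intuition behind the first two stages on a handful of events. It is only an illustration: all function names are hypothetical, and the actual implementation lives in `experiments_main/`.

```python
# Toy illustration only; function names are hypothetical, not the HyperGLLM API.
from collections import defaultdict

def build_relation_graph(events):
    """Stage 1 (intuition): attribute-value nodes, linked when values co-occur in one event."""
    graph = defaultdict(set)
    for event in events:                               # event: {attribute: value}
        nodes = [f"{k}={v}" for k, v in event.items()]
        for n in nodes:
            graph[n].update(m for m in nodes if m != n)
    return graph

def group_into_hyperedges(events, key):
    """Stage 2 (intuition): one hyperedge per shared value, tying together events
    that are interleaved in the log but belong to the same behavior."""
    hyperedges = defaultdict(list)
    for i, event in enumerate(events):
        if key in event:
            hyperedges[event[key]].append(i)
    return hyperedges

events = [
    {"process_name": "powershell.exe", "file_path": "C:\\Users\\x\\a.ps1"},
    {"process_name": "svchost.exe", "remote_ip": "10.0.0.8"},
    {"process_name": "powershell.exe", "reg_path": "HKCU\\...\\Run"},
]
graph = build_relation_graph(events)
print(group_into_hyperedges(events, "process_name")["powershell.exe"])  # [0, 2]
# Stage 3 (paper): the hypergraph representations are aligned with an LLM for contextual reasoning.
```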

## :sparkles: Features

- **Framework**: We propose HyperGLLM, an efficient framework that introduces hypergraph reasoning into LLMs for malicious behavior detection in EDR logs, capturing both structural semantics and long-range temporal dependencies.
- **Structural Semantics & Temporal Dependencies**: We design an attribute-value level relation-aware graph and a differential hypergraph module with multi-granularity clustering to jointly model low- and high-order behavior semantics, thereby enhancing the semantic representation of threat behaviors.
- **EDR3.6B-63F Dataset**: We construct EDR3.6B-63F, a large-scale EDR dataset that serves as a high-quality benchmark for advancing AI-driven research in endpoint security, offering diverse behavior types and detailed event records.
- **Effective and Efficient**: Extensive experiments demonstrate that HyperGLLM consistently outperforms state-of-the-art baselines across multiple metrics while maintaining high inference efficiency.


## :page_with_curl: Changelog
[23/10/25] The project launched!



## :rocket: Quick Start

We conduct training on eight NVIDIA H100 (80GB) GPUs and perform evaluation on a single GPU.

**Directory Overview**

- *requirements.txt* — Lists all dependencies required to reproduce this project.
- *datasets/* — Contains all datasets used in both the main experiments and the ablation study described in the paper.
- *experiments_main/* — Source code for reproducing the main experiments.
- *experiments_ablation/* — Source code for reproducing the ablation study experiments in the paper.

**Install Dependencies**

To reproduce our work, you need to have Python installed along with the required libraries. You can install the necessary dependencies using the following command:
```bash
pip install -r requirements.txt
```

**Reproducing the Main Experiment**

To reproduce the main experimental results:
1. cd to the main experiment directory:
```bash
cd experiments_main
```
2. Run the training and inference script:
```bash
sh run.sh
```
3. Obtain evaluation metrics:
```bash
python get_metrics.py
```
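
For reference, the snippet below is a minimal sketch of how the two headline metrics (multi-class accuracy and the false alarm rate over benign samples) can be computed from prediction/label lists. The label name `benign` is an assumption; `get_metrics.py` remains the authoritative implementation.

```python
# Minimal sketch; get_metrics.py is the authoritative script, and "benign" is an assumed label name.
def evaluate(labels, preds, benign_label="benign"):
    accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)
    benign = [(l, p) for l, p in zip(labels, preds) if l == benign_label]
    false_alarms = sum(p != benign_label for _, p in benign)
    far = false_alarms / len(benign) if benign else 0.0
    return accuracy, far

acc, far = evaluate(["benign", "ransomware", "benign"], ["benign", "ransomware", "worm"])
print(f"accuracy={acc:.2%}  false_alarm_rate={far:.2%}")  # accuracy=66.67%  false_alarm_rate=50.00%
```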
You can also obtain runtime efficiency metrics (GPU memory usage and Time-to-First-Token) by running:
```bash
python get_gpumu_tps.py
```
This reports the efficiency metrics for an input of 1,024K tokens.

*To reproduce other experiments (e.g., the ablation study), cd to the corresponding directory (e.g., `cd experiments_ablation/Analysis_DHGNN`), run `sh run.sh` for training and inference, and obtain evaluation metrics with `python get_metrics.py`.*


## :floppy_disk: Dataset

The EDR3.6B-63F dataset is available in [this Hugging Face repository](https://huggingface.co/datasets/Iceswordlab/EDR3.6B-63F); see `datasets/readme.md` for download and preprocessing instructions.
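
A minimal download sketch using `huggingface_hub` (the target directory below is our choice; see `datasets/readme.md` for where each archive should be extracted):

```python
# Sketch: fetch the dataset repository; adjust local_dir as needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Iceswordlab/EDR3.6B-63F",
    repo_type="dataset",
    local_dir="datasets/EDR3.6B-63F",
)
```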

## :raised_hands: FAQs

## :bookmark: License

Our project is licensed under the [*MIT License*](./LICENSE).

## Citation
If you use the EDR3.6B-63F dataset in your research or find our HyperGLLM method helpful, please consider citing our paper:

```bibtex
@inproceedings{hypergllm2026,
title = {HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models},
year = {2026},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
}
```
200 changes: 200 additions & 0 deletions datasets/get_npz_convenient.py
@@ -0,0 +1,200 @@
import os
import json
import gzip
import pickle
import random
import multiprocessing

import numpy as np
from difflib import SequenceMatcher



# Per-key length statistics; only the third entry of each list is used below,
# as the fixed character width reserved for that key's value field.
used_keys = {'action_type': [6, 12.09, 14.0, 14], 'addr_family': [7, 7.41, 8.0, 8],
'child_command_line': [1, 85.68, 127.0, 20638], 'child_process_build_name': [1, 11.34, 21.0, 127],
'child_process_cert_hash': [40, 40.00, 40.0, 41], 'child_process_ext_level': [1, 1.04, 1.0, 3],
'child_process_integrity': [3, 5.3, 6.0, 6], 'child_process_md5': [32, 32.0, 32.0, 32],
'child_process_name': [1, 13.54, 27.0, 163], 'child_process_net_level': [1, 1.99, 2.0, 2],
'child_process_orig_name': [1, 13.06, 23.0, 63], 'child_process_path': [8, 42.91, 88.0, 174],
'child_process_pdb': [6, 22.16, 60.0, 210], 'child_process_product_name': [1, 31.83, 36.0, 127],
'child_process_service': [1, 13.07, 33.0, 156], 'child_process_signer': [2, 20.06, 27.0, 68],
'command_line': [0, 79.77, 269.0, 20638], 'control': [6, 6.0, 6.0, 6],
'data_size': [1, 1.92, 4.0, 7], 'device_reserved': [1, 1.0, 1.0, 1],
'device_type': [1, 1.0, 1.0, 1], 'dmpinfo': [34, 34.67, 35.8, 36],
'edrver': [11, 11.0, 11.0, 11], 'event_cost': [3, 3.0, 3.0, 3],
'file_action': [17, 26.97, 31.0, 31], 'file_cert_hash': [40, 40.00, 40.0, 41],
'file_delete_count': [1, 1.04, 1.0, 3], 'file_device': [2, 2.00, 2.0, 3],
'file_dltool': [1, 1.00, 1.0, 4], 'file_ext_level': [1, 1.00, 1.0, 3],
'file_hidden_data': [1, 1.0, 1.0, 1], 'file_internal_name': [1, 13.98, 21.0, 127],
'file_md5': [32, 32.0, 32.0, 32], 'file_name': [1, 21.27, 44.0, 229],
'file_net_level': [2, 2.0, 2.0, 2], 'file_network': [21, 34.78, 44.0, 44],
'file_oldname': [4, 22.03, 40.0, 92], 'file_oldpath': [9, 69.85, 108.0, 163],
'file_orig_name': [1, 14.84, 28.0, 63], 'file_path': [0, 71.71, 119.0, 259],
'file_pdb': [4, 54.25, 95.0, 259], 'file_product_name': [1, 18.14, 36.0, 127],
'file_signer': [2, 25.84, 33.0, 68], 'file_source': [0, 40.16, 51.0, 124],
'file_written_count': [1, 1.36, 2.0, 3], 'finalexcep': [4, 4.0, 4.0, 4],
'flags': [1, 1.0, 1.0, 1], 'from_etw_data': [1, 1.0, 1.0, 1],
'guid': [38, 38.0, 38.0, 38], 'image_path': [9, 36.85, 52.0, 92],
'inject_size': [1, 3.29, 4.0, 17], 'inject_target_address': [5, 8.20, 13.0, 15],
'inject_type': [11, 17.71, 18.0, 20], 'invalid_sign': [1, 1.0, 1.0, 1],
'lnk_cmd': [1, 28.47, 47.0, 190], 'lnk_exe': [0, 25.51, 46.0, 106],
'lnk_name': [27, 90.55, 131.0, 131], 'lnk_path': [9, 71.24, 91.0, 113],
'load_driver_type': [1, 1.0, 1.0, 1], 'logon_domain': [15, 15.0, 15.0, 15],
'matched_filepath_list': [9, 82.55, 114.0, 10550], 'mem_load': [2, 2.0, 2.0, 2],
'online': [2, 2.0, 2.0, 2], 'operation_type': [1, 2.43, 6.0, 6],
'parent_command_line': [1, 38.96, 59.0, 20638], 'parent_process_name': [2, 17.81, 48.0, 163],
'parent_process_net_level': [2, 2.0, 2.0, 2], 'parent_process_path': [8, 53.07, 127.0, 174],
'path_hijacking': [1, 1.0, 1.0, 1], 'port_number': [1, 1.0, 1.0, 1],
'procaction_timestamp': [24, 24.0, 24.0, 24],'process_build_name': [1, 15.30, 45.0, 127],
'process_cert_hash': [40, 40.00, 40.0, 41], 'process_exec_hijack': [1, 1.0, 1.0, 1],
'process_ext_level': [1, 1.02, 1.0, 3], 'process_file_version': [1, 23.8, 38.0, 63],
'process_input_handle': [14, 14.00, 14.0, 17], 'process_integrity': [0, 5.47, 6.0, 6],
'process_link': [9, 140.82, 730.0, 2199], 'process_link_level': [2, 2.0, 2.0, 2],
'process_md5': [32, 32.0, 32.0, 32], 'process_name': [1, 17.43, 43.0, 163],
'process_net_level': [2, 2.0, 2.0, 2], 'process_orig_name': [1, 16.52, 41.0, 63],
'process_output_handle': [14, 14.00, 14.0, 17], 'process_path': [6, 59.38, 140.0, 174],
'process_pdb': [6, 33.70, 90.0, 210], 'process_product_name': [1, 27.50, 36.0, 127],
'process_readed_ext': [0, 6.39, 12.0, 84], 'process_remote_shell': [1, 1.0, 1.0, 1],
'process_service': [1, 40.95, 168.0, 183], 'process_signer': [2, 20.28, 27.0, 68],
'process_start_time': [10, 10.0, 10.0, 10], 'read_file_count': [1, 1.00, 1.0, 4],
'read_file_io_counters': [1, 3.79, 5.0, 6], 'read_offset': [1, 10.83, 15.0, 15],
'reg_name': [0, 6.43, 16.0, 64], 'reg_path': [23, 82.0, 115.0, 154],
'reg_previous_data': [1, 21.48, 38.0, 2048], 'reg_type': [0, 7.05, 10.0, 20],
'reg_value': [0, 296.23, 3396.0, 3952], 'remote_ip': [12, 20.59, 35.0, 36],
'remote_operate': [1, 1.0, 1.0, 1], 'remote_port': [3, 4.94, 5.0, 5],
'reparse_tag': [10, 10.0, 10.0, 10], 'reset_code': [10, 10.0, 10.0, 10],
'reset_info': [11, 11.0, 11.0, 11], 'reset_time': [1, 1.0, 1.0, 1],
'sensor_modules': [793, 1019.31, 1052.0, 1052], 'service_name': [1, 4.93, 16.0, 51],
'service_type': [1, 1.0, 1.0, 1], 'shell_code': [64, 64.0, 64.0, 64],
'shell_name': [13, 13.63, 14.0, 14], 'source_process_integrity': [0, 5.67, 6.0, 6],
'source_process_path': [10, 10.0, 10.0, 10], 'source_process_starttime': [10, 10.0, 10.0, 10],
'target_command_line': [1, 10.87, 11.0, 3686], 'target_count': [1, 1.51, 2.0, 3],
'target_file_cert_hash': [40, 40.0, 40.0, 40], 'target_file_ext_level': [1, 1.03, 1.0, 3],
'target_file_md5': [32, 32.0, 32.0, 32], 'target_file_net_level': [2, 2.0, 2.0, 2],
'target_file_path': [5, 60.32, 73.0, 143], 'target_file_pdb': [7, 39.37, 50.0, 135],
'target_file_signer': [4, 21.06, 21.0, 59], 'target_process_cert_hash': [40, 40.0, 40.0, 40],
'target_process_ext_level': [1, 1.042, 1.0, 3], 'target_process_md5': [0, 32.0, 32.0, 32],
'target_process_name': [1, 12.51, 23.0, 163], 'target_process_net_level': [2, 2.0, 2.0, 2],
'target_process_path': [8, 39.27, 102.0, 174], 'target_process_pdb': [6, 14.74, 32.0, 210],
'target_process_signer': [2, 24.65, 27.0, 64], 'task_author': [3, 20.66, 23.0, 23],
'task_name': [2, 112.65, 115.0, 115], 'thread_module': [2, 11.90, 29.0, 163],
'unicode_rtlo': [1, 1.0, 1.0, 4], 'virus_name': [20, 21.0, 22.0, 22],
'write_offset': [1, 9.56, 13.0, 13], 'zero_day_check': [3, 3.0, 3.0, 3],
'zero_day_index': [1, 1.0, 1.0, 1], 'zero_day_op': [5, 5.0, 5.0, 5],
'zero_day_type': [3, 3.34, 4.0, 5], 'event_type': [7, 11.81, 15.0, 18],
'host_name': [15, 15.0, 15.0, 15],
}
# Width of each key's slot = key-name length + value-field width; MAXdim is the widest slot.
list_key_dims = [len(k) + int(np.ceil(v[2])) for k, v in used_keys.items()]
MAXdim = max(list_key_dims)




def most_similar_string_idx(a, candidates):
    """Return the candidate most similar to `a` (difflib ratio) and its index in `candidates`."""
    string = max(candidates, key=lambda x: SequenceMatcher(None, a, x).ratio())
    idx = candidates.index(string)
    return string, idx


def readfile(args):
    file_path, true_idx, time_len, max_time_len, save_fold = args
    is_training = False; ae = None; ac = None  # placeholders for training-time hooks; disabled in this script

    # Subsample long logs to at most max_time_len events, always keeping the labeled true_idx events.
    if max_time_len > 0 and time_len > max_time_len:
        if len(true_idx) > 0 and len(true_idx) > max_time_len:
            true_idx = random.sample(true_idx, max_time_len); usedidx = true_idx
            usedidx.sort(); true_idx.sort()
        if len(true_idx) > 0 and len(true_idx) <= max_time_len:
            pool_list = list(set(range(time_len)) - set(true_idx))
            usedidx = true_idx + random.sample(pool_list, max_time_len - len(true_idx))
            usedidx.sort(); true_idx.sort()
        if len(true_idx) == 0:
            pool_list = list(range(time_len))
            usedidx = random.sample(pool_list, max_time_len)
            usedidx.sort()
    else:
        usedidx = list(range(time_len))

    all_line_values = []; all_line_masks = []; flag = -1; ref_data = []
    with open(file_path, 'r', encoding='utf-8', errors="ignore") as json_file:
        for log_line in json_file:
            try:
                data = json.loads(log_line)
            except json.JSONDecodeError:
                continue
            flag += 1
            if flag not in usedidx:
                continue
            data["event"] = {k: v for k, v in data["event"].items() if k in used_keys}
            data["meta"] = {k: v for k, v in data["meta"].items() if k in used_keys}
            if flag not in true_idx:
                ref_data.append(data["event"])
            if is_training and ae is not None:
                data["event"] = ae.apply(data["event"], random.choice(ref_data) if len(ref_data) > 0 else {})

            # Initialize one fixed-width uint8 slot per key (key name + value bytes)
            # and a 0/1 mask marking which keys are present in this event.
            line_values = []
            for key in used_keys.keys():
                line_values.append(np.zeros(len(key) + int(np.ceil(used_keys[key][2])), dtype=np.uint8))
            line_masks = np.zeros(len(used_keys), dtype=np.uint8)
            for key in data["event"].keys():
                string, idx = most_similar_string_idx(key, list(used_keys.keys()))
                key_len = len(string); value_len = int(np.ceil(used_keys[string][2]))

                value_string = ''.join([char for char in str(data["event"][key]) if char.isascii()])
                input_string = key[-key_len:].ljust(key_len, "\0") + value_string[-value_len:].ljust(value_len, "\0")
                line_values[idx] = np.frombuffer(input_string.encode('utf-8'), dtype=np.uint8)
                line_masks[idx] = 1
            for key in data["meta"].keys():
                if key in data["event"].keys():
                    continue
                string, idx = most_similar_string_idx(key, list(used_keys.keys()))
                key_len = len(string); value_len = int(np.ceil(used_keys[string][2]))

                value_string = ''.join([char for char in str(data["meta"][key]) if char.isascii()])
                input_string = key[-key_len:].ljust(key_len, "\0") + value_string[-value_len:].ljust(value_len, "\0")
                line_values[idx] = np.frombuffer(input_string.encode('utf-8'), dtype=np.uint8)
                line_masks[idx] = 1
            all_line_values.append(line_values)
            all_line_masks.append(line_masks)
    if is_training and ac is not None:
        Allvalue, Truevalue = ac.apply(usedidx, true_idx)
        idxs = [usedidx.index(v) for v in Allvalue]
        all_line_values = [all_line_values[idx] for idx in idxs]
        all_line_masks = [all_line_masks[idx] for idx in idxs]
    # return all_line_values, all_line_masks
    data_save = {"all_line_values": all_line_values, "all_line_masks": all_line_masks}
    name_save = os.path.join(save_fold, file_path.split("/")[-1] + ".pkl.gz")
    with gzip.open(name_save, "wb") as f:
        pickle.dump(data_save, f, protocol=pickle.HIGHEST_PROTOCOL)
    return True


def process_files_concurrently(args, max_workers=64):
    # Cap the worker count at both max_workers and the available CPUs.
    num_workers = min(max_workers, os.cpu_count() or 1)
    with multiprocessing.Pool(processes=num_workers) as pool:
        results = list(pool.map(readfile, args))
    return results






if __name__ == "__main__":
    save_fold = "Meta_data_cleaned"
    os.makedirs(save_fold, exist_ok=True)
    for filename in ["Data_lists_cleaned/TRAIN.json", "Data_lists_cleaned/TEST.json", "Data_lists_cleaned/TEST_forUnknownAttack.json"]:
        with open(filename, "r", encoding="utf-8") as file:
            data = json.load(file)
        # Each entry provides file_path, true_idx, and time_len; cap every sample at 2048 events.
        args = [(item["file_path"], item["true_idx"], item["time_len"], 2048, save_fold) for item in data]
        RESULTS = process_files_concurrently(args)

    # ### Quick check: reload one converted sample
# with gzip.open("./Meta_data_cleaned/xxx.pkl.gz", "rb") as f:
# loaded_data = pickle.load(f)
# all_line_values = loaded_data["all_line_values"]
# all_line_masks = loaded_data["all_line_masks"]
# print(len(all_line_values))
# print(len(all_line_masks))
21 changes: 21 additions & 0 deletions datasets/readme.md
@@ -0,0 +1,21 @@
**Download**

Download the experimental `Data_lists_cleaned` archive from the [Data_lists_cleaned repo](https://huggingface.co/datasets/Iceswordlab/EDR3.6B-63F) and extract it into the `Data_lists_cleaned/` directory to obtain `Families.json`, `TRAIN.json`, `TEST.json`, and `TEST_forUnknownAttack.json`.

Download the `Meta_data_cleaned` archive from the [Meta_data_cleaned repo](https://huggingface.co/datasets/Iceswordlab/EDR3.6B-63F) and extract it into the `Meta_data_cleaned/` directory.

**File descriptions:**
- `Data_lists_cleaned/` — Directory containing all data lists needed for both the main experiments and the ablation study.
- `Families.json` — List of families used in the experiments.
- `TRAIN.json` — List of training data samples.
- `TEST.json` — List of test data samples.
- `TEST_forUnknownAttack.json` — Data list for an additional, independent test set containing 9,000 normal samples and 10,000 unknown-attack samples.
- `Meta_data_cleaned/` — Directory containing all the raw data required for both the main experiments and the ablation study.

**Preprocessing (recommended):**
- To facilitate experiments, we recommend first running:
```bash
python get_npz_convenient.py
```
- This script reads the raw data (text) in `Meta_data_cleaned/` and converts each sample into a `.pkl.gz` file with the same name.
- The generated `.pkl.gz` files are also saved in the `Meta_data_cleaned/` directory.
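
- To verify the conversion, the sketch below (mirroring the commented-out test at the end of `get_npz_convenient.py`) loads the first entry of `TRAIN.json` and its converted `.pkl.gz` sample; paths assume the default directory layout above.
```python
import gzip, json, os, pickle

with open("Data_lists_cleaned/TRAIN.json", "r", encoding="utf-8") as f:
    entry = json.load(f)[0]                      # each entry carries file_path, true_idx, time_len
sample_path = os.path.join("Meta_data_cleaned", os.path.basename(entry["file_path"]) + ".pkl.gz")
with gzip.open(sample_path, "rb") as f:
    sample = pickle.load(f)
print(len(sample["all_line_values"]), len(sample["all_line_masks"]))  # number of retained events
```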