AuldLangSynth

Bridging past knowledge with future intelligence.

AuldLangSynth is an open-source data-centric language synthesis platform designed to generate, analyze, and curate high-quality instruction datasets for modern AI systems. It provides an end-to-end workflow for producing structured language samples, auditing their quality, and transforming them into embeddings and datasets ready for training.

At its core, AuldLangSynth emphasizes controlled generation, transparency, and insight. The application was originally developed to support synthetic dataset generation with applied reasoning for cybersecurity research. Its design was later extended through the introduction of dataset profiles, significantly broadening its applicability across other domains.


Motivation & Background

AuldLangSynth originated during work on a separate research effort involving the active development of a fully autonomous reverse engineering and malware analysis MCP server. This system relied on a broad set of analysis tools and local large language models to support automated reasoning and decision-making.

During this work, it became apparent that out-of-the-box local LLMs did not consistently produce the level of accuracy and reliability required for these tasks. Achieving stable and repeatable results necessitated fine-tuning models on domain-specific data derived from cybersecurity research and reverse engineering workflows. AuldLangSynth was therefore created to support the systematic generation, evaluation, and curation of high-quality, reasoning-based synthetic datasets tailored for LLM fine-tuning in security-focused domains.

While a number of applications exist for instruction generation and data synthesis, few were designed with cybersecurity-oriented workflows in mind, and even fewer exposed the full lifecycle required for research-grade dataset creation—from controlled generation and structural enforcement to deduplication, evaluation, and downstream embedding.

This gap motivated the development of AuldLangSynth as a unified platform that combines:

  • Cybersecurity-aware data generation and analysis with explicit reasoning steps.
  • Schema-driven dataset profiles.
  • Explicit quality control through LLM-as-a-Judge evaluation.
  • Flexible, researcher-controlled workflows suitable for experimentation and reproducibility.
  • Fully local operation, subject to hardware resource availability.
  • A local knowledge base of heterogeneous files that provides contextual grounding for synthetic sample generation.

The resulting tool is intended to serve both as a practical system for synthetic dataset creation and as a reference implementation for data-centric language model research in security-focused domains.

The platform supports:

  1. AI-driven data generation: configurable prompts, models, and sampling (temperature/top-k/p, max tokens, concurrency) to synthesize datasets aligned to profiles, with outputs feeding downstream pipelines.
  2. Dataset profile management: define/select schemas (instruction/input/output/metadata), include built-in profiles, and allow custom profile registration to enforce structure across generation and evaluation.
  3. Binary reverse engineering and analysis: upload single or batch binaries to extract rich static analysis artifacts (optionally augmented with sandbox reports), which are then analyzed by LLMs to generate structured synthetic samples for malware-focused dataset creation and model training.
  4. Deduplication engine: exact hashing plus shingle/Jaccard near-duplicate detection with adjustable thresholds and neighbor caps, showing clusters and previews to prune noisy samples (a minimal sketch of the shingle/Jaccard idea follows this list).
  5. Audio transcription/translation/summarization (Work in progress): in-browser Whisper.js (transformers.js) for YouTube URLs or local audio, with streaming decode, progress UI, and optional embedding of transcripts into the vector DB.
  6. Document ingestion and embedding: drag/drop or folder upload for PDF/DOCX/TXT/MD/HTML/JSON, automatic text extraction, chunking controls, embedding, and optional Qdrant sync with progress/error feedback.
  7. Embedding inspection: view embeddings and vector DB sync status, ensuring consistency between generated/ingested data and Qdrant collections.
  8. LLM-as-a-judge evaluation: configurable judge/teacher models and concurrency to score samples on accuracy/completeness/clarity/overall, feeding back into quality control.
  9. Close-the-loop automation: orchestrate regeneration plus evaluation cycles to iteratively improve sample quality based on judge feedback, triggered with a single click.
  10. Grammar tooling: editor, AST viewer, and playground to design/verify grammars and toggle grammar-constrained generation.
  11. Vector DB integration: configure Qdrant host/API key/collection, auto-hint vector sizes, ensure collections exist, and monitor sync health.
  12. Dataset browsing/editing: inspect and edit generated samples with schema awareness for quick curation.
  13. Environment/config management: UI to set API keys and endpoints (OpenAI, Google Gemini, OpenRouter, vLLM, Ollama, etc.), toggle embedding/vector settings, and control global chunking/concurrency defaults.
  14. Fine-tune/export: prepare curated datasets for downstream training/export in training-ready formats.
  15. Backend helpers: YouTube audio fetch (yt-dlp/puppeteer), website scrape-to-text endpoint, and binary analysis routes, all mounted via Vite dev middleware.
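To make the deduplication step (item 4) concrete, here is a minimal, illustrative TypeScript sketch of shingle/Jaccard near-duplicate detection. The function names, shingle size, and default threshold are assumptions for illustration only, not AuldLangSynth's actual code or API.

// Sketch of shingle/Jaccard near-duplicate detection with a neighbor cap.
// All names and defaults here are illustrative assumptions.

// Build the set of k-word shingles for a text sample.
function shingles(text: string, k = 5): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + k <= words.length; i++) {
    out.add(words.slice(i, i + k).join(" "));
  }
  return out;
}

// Jaccard similarity: |A intersect B| / |A union B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // two empty samples count as identical
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  return inter / (a.size + b.size - inter);
}

// Flag sample pairs whose similarity exceeds a threshold, capping neighbors per sample.
function findNearDuplicates(
  samples: string[],
  threshold = 0.8,
  maxNeighbors = 10,
): Array<[number, number, number]> {
  const sets = samples.map((s) => shingles(s));
  const pairs: Array<[number, number, number]> = [];
  const neighborCount = new Array(samples.length).fill(0);
  for (let i = 0; i < sets.length; i++) {
    for (let j = i + 1; j < sets.length; j++) {
      if (neighborCount[i] >= maxNeighbors) break; // neighbor cap keeps clusters manageable
      const sim = jaccard(sets[i], sets[j]);
      if (sim >= threshold) {
        pairs.push([i, j, sim]);
        neighborCount[i]++;
        neighborCount[j]++;
      }
    }
  }
  return pairs;
}

Raising the threshold keeps only close paraphrases in the duplicate clusters; lowering it prunes more aggressively at the cost of discarding legitimately similar samples.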

Friendly note from me :) "The data-flow logic does not impose a rigid workflow, allowing users full control over how their datasets are generated. What I mean by this is that maybe you're happy with just Teacher-generated samples without needing LLM Judge scoring. That's okay to do. I purposefully made this app as flexible as possible. I abstracted away a lot of the complexity and focused heavily on giving the end user freedom to choose. However, to ensure higher-quality synthetic generation, using Qdrant with it is highly desirable."
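If you do point the app at Qdrant, the collection handling described in item 11 is driven from the UI. Purely as a hedged illustration of what "ensure collections exist" means at the Qdrant REST level, here is a small TypeScript sketch using Qdrant's standard HTTP API; the collection name and vector size are placeholder assumptions, not AuldLangSynth defaults.

// Sketch only: check that a Qdrant collection exists and create it if not.
// QDRANT_HOST and QDRANT_API_KEY match the env vars listed in the installation section.
const host = process.env.QDRANT_HOST ?? "http://localhost:6333";
const headers: Record<string, string> = { "Content-Type": "application/json" };
const apiKey = process.env.QDRANT_API_KEY;
if (apiKey) headers["api-key"] = apiKey;

async function ensureCollection(name: string, vectorSize: number): Promise<void> {
  // GET /collections/{name} responds with 404 when the collection is missing.
  const probe = await fetch(`${host}/collections/${name}`, { headers });
  if (probe.ok) return;
  // PUT /collections/{name} creates it with the given vector configuration.
  const create = await fetch(`${host}/collections/${name}`, {
    method: "PUT",
    headers,
    body: JSON.stringify({ vectors: { size: vectorSize, distance: "Cosine" } }),
  });
  if (!create.ok) throw new Error(`Qdrant returned HTTP ${create.status}`);
}

// Example: a hypothetical collection sized for 768-dimensional embeddings.
ensureCollection("auldlangsynth_samples", 768).catch(console.error);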

AuldLangSynth's UI

Screenshots cover the main views: Dashboard, Generate, Profiles, Reverse Engineer, Transcribe, Deduplication, Stratified Mix, Uploads, Embeddings, Dataset, Grammar, LLM-As-A-Judge, Close-The-Loop, Vector-DB, and Fine-Tune-Export.

Installation

For a more in-depth guide, see the docs, which will help you get started quickly.

Docs

Prerequisites

  • Node.js 18.17+ (20.x recommended); npm 9+.
  • git installed.
  • Optional: yt-dlp (for YouTube audio fetch), npx puppeteer browsers install chrome (for the Puppeteer fallback), and a Qdrant instance if using the vector DB.

Steps

  1. Clone and install
git clone <repo-url>
cd AuldLangSynth
npm install
  2. (Optional but recommended) Create .env.local with your keys: GEMINI_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, QDRANT_HOST, QDRANT_API_KEY, etc. (an example file is sketched after these steps).
  3. Run the dev server (Vite + backend middleware)
npm run dev
  4. Build for production
npm run build
  5. Preview the production build
npm run preview
  6. Run the reverse-engineering backend (optional standalone)
npm run reverse-backend
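As mentioned in step 2, a minimal .env.local might look like the following. The variable names are the ones listed above; the values are placeholders, and you can omit any provider or the Qdrant entries if you don't use them.

# .env.local (placeholder values; omit any provider you don't use)
GEMINI_API_KEY=your-gemini-key
OPENAI_API_KEY=your-openai-key
OPENROUTER_API_KEY=your-openrouter-key
QDRANT_HOST=http://localhost:6333
QDRANT_API_KEY=your-qdrant-key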

Full Documentation

Docs


Project Status & Expectations

AuldLangSynth is stable and usable, but development is currently slower than during its initial creation, which was tied to a conference paper. Due to other commitments, updates will be made on a best-effort basis rather than through a fixed roadmap. That said, the project is not abandoned. Issues, feedback, and pull requests are welcome, and improvements may be incorporated as time allows.

Intended for research, educational, and defensive security use.

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

You are free to use, modify, and distribute this software, provided that any derivative works are also licensed under GPL-3.0 and the source code is made available.
