AuldLangSynth

Bridging past knowledge with future intelligence.

AuldLangSynth is an open-source data-centric language synthesis platform designed to generate, analyze, and curate high-quality instruction datasets for modern AI systems. It provides an end-to-end workflow for producing structured language samples, auditing their quality, and transforming them into embeddings and datasets ready for training.

At its core, AuldLangSynth emphasizes controlled generation, transparency, and insight. The application was originally developed to support synthetic dataset generation with applied reasoning for cybersecurity research. Its design was later extended through the introduction of dataset profiles, significantly broadening its applicability across other domains.


Motivation & Background

AuldLangSynth originated during work on a separate research effort involving the active development of a fully autonomous reverse engineering and malware analysis MCP server. This system relied on a broad set of analysis tools and local large language models to support automated reasoning and decision-making.

During this work, it became apparent that out-of-the-box local LLMs did not consistently produce the level of accuracy and reliability required for these tasks. Achieving stable and repeatable results necessitated fine-tuning models on domain-specific data derived from cybersecurity research and reverse engineering workflows. AuldLangSynth was therefore created to support the systematic generation, evaluation, and curation of high-quality, reasoning-based synthetic datasets tailored for LLM fine-tuning in security-focused domains.

While a number of applications exist for instruction generation and data synthesis, few were designed with cybersecurity-oriented workflows in mind, and even fewer exposed the full lifecycle required for research-grade dataset creation—from controlled generation and structural enforcement to deduplication, evaluation, and downstream embedding.

This gap motivated the development of AuldLangSynth as a unified platform that combines:

  • Cybersecurity-aware data generation and analysis with explicit reasoning steps.
  • Schema-driven dataset profiles.
  • Explicit quality control through LLM-as-a-Judge evaluation.
  • Flexible, researcher-controlled workflows suitable for experimentation and reproducibility.
  • Fully local operation, subject to hardware resource availability.
  • A local knowledge base of heterogeneous files that provides contextual grounding for synthetic sample generation.

The resulting tool is intended to serve both as a practical system for synthetic dataset creation and as a reference implementation for data-centric language model research in security-focused domains.

The platform supports:

  1. AI-driven data generation: configurable prompts, models, and sampling (temperature/top-k/p, max tokens, concurrency) to synthesize datasets aligned to profiles, with outputs feeding downstream pipelines.
  2. Dataset profile management: define/select schemas (instruction/input/output/metadata), include built-in profiles, and allow custom profile registration to enforce structure across generation and evaluation.
  3. Binary reverse engineering and analysis: upload single or batch binaries to extract rich static analysis artifacts (optionally augmented with sandbox reports), which are then analyzed by LLMs to generate structured synthetic samples for malware-focused dataset creation and model training.
  4. Deduplication engine: exact hashing plus shingle/Jaccard near-duplicate detection with adjustable thresholds and neighbor caps, showing clusters and previews to prune noisy samples (a minimal sketch of the shingle/Jaccard idea follows this list).
  5. Audio transcription/translation/summarization (Work in progress): in-browser Whisper.js (transformers.js) for YouTube URLs or local audio, with streaming decode, progress UI, and optional embedding of transcripts into the vector DB.
  6. Document ingestion and embedding: drag/drop or folder upload for PDF/DOCX/TXT/MD/HTML/JSON, automatic text extraction, chunking controls, embedding, and optional Qdrant sync with progress/error feedback.
  7. Embedding inspection: view embeddings and vector DB sync status, ensuring consistency between generated/ingested data and Qdrant collections.
  8. LLM-as-a-judge evaluation: configurable judge/teacher models and concurrency to score samples on accuracy/completeness/clarity/overall, feeding back into quality control.
  9. Close-the-loop automation: orchestrate regeneration plus evaluation cycles to iteratively improve sample quality based on judge feedback, triggered with a single click.
  10. Grammar tooling: editor, AST viewer, and playground to design/verify grammars and toggle grammar-constrained generation.
  11. Vector DB integration: configure Qdrant host/API key/collection, auto-hint vector sizes, ensure collections exist, and monitor sync health.
  12. Dataset browsing/editing: inspect and edit generated samples with schema awareness for quick curation.
  13. Environment/config management: UI to set API keys and endpoints (OpenAI, Google Gemini, OpenRouter, vLLM, Ollama, etc.), toggle embedding/vector settings, and control global chunking/concurrency defaults.
  14. Fine-tune/export: prepare curated datasets for downstream training/export in training-ready formats.
  15. Backend helpers: YouTube audio fetch (yt-dlp/puppeteer), website scrape-to-text endpoint, and binary analysis routes, all mounted via Vite dev middleware.
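To make the deduplication step (item 4) concrete, here is a minimal, illustrative TypeScript sketch of shingle/Jaccard near-duplicate detection. The function names, shingle size, and default threshold are assumptions for illustration only, not AuldLangSynth's actual code or API.

// Sketch of shingle/Jaccard near-duplicate detection with a neighbor cap.
// All names and defaults here are illustrative assumptions.

// Build the set of k-word shingles for a text sample.
function shingles(text: string, k = 5): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + k <= words.length; i++) {
    out.add(words.slice(i, i + k).join(" "));
  }
  return out;
}

// Jaccard similarity: |A intersect B| / |A union B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // two empty samples count as identical
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  return inter / (a.size + b.size - inter);
}

// Flag sample pairs whose similarity exceeds a threshold, capping neighbors per sample.
function findNearDuplicates(
  samples: string[],
  threshold = 0.8,
  maxNeighbors = 10,
): Array<[number, number, number]> {
  const sets = samples.map((s) => shingles(s));
  const pairs: Array<[number, number, number]> = [];
  const neighborCount = new Array(samples.length).fill(0);
  for (let i = 0; i < sets.length; i++) {
    for (let j = i + 1; j < sets.length; j++) {
      if (neighborCount[i] >= maxNeighbors) break; // neighbor cap keeps clusters manageable
      const sim = jaccard(sets[i], sets[j]);
      if (sim >= threshold) {
        pairs.push([i, j, sim]);
        neighborCount[i]++;
        neighborCount[j]++;
      }
    }
  }
  return pairs;
}

Raising the threshold keeps only close paraphrases in the duplicate clusters; lowering it prunes more aggressively at the cost of discarding legitimately similar samples.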

Friendly note from me :) "The data-flow logic does not impose a rigid workflow, allowing users full control over how their datasets are generated. What I mean by this is that maybe you're happy with just Teacher-generated samples without needing LLM Judge scoring. That's okay to do. I purposefully made this app as flexible as possible. I abstracted away a lot of the complexity and focused heavily on giving the end user freedom to choose. However, to ensure higher-quality synthetic generation, using Qdrant with it is highly desirable."
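If you do point the app at Qdrant, the collection handling described in item 11 is driven from the UI. Purely as a hedged illustration of what "ensure collections exist" means at the Qdrant REST level, here is a small TypeScript sketch using Qdrant's standard HTTP API; the collection name and vector size are placeholder assumptions, not AuldLangSynth defaults.

// Sketch only: check that a Qdrant collection exists and create it if not.
// QDRANT_HOST and QDRANT_API_KEY match the env vars listed in the installation section.
const host = process.env.QDRANT_HOST ?? "http://localhost:6333";
const headers: Record<string, string> = { "Content-Type": "application/json" };
const apiKey = process.env.QDRANT_API_KEY;
if (apiKey) headers["api-key"] = apiKey;

async function ensureCollection(name: string, vectorSize: number): Promise<void> {
  // GET /collections/{name} responds with 404 when the collection is missing.
  const probe = await fetch(`${host}/collections/${name}`, { headers });
  if (probe.ok) return;
  // PUT /collections/{name} creates it with the given vector configuration.
  const create = await fetch(`${host}/collections/${name}`, {
    method: "PUT",
    headers,
    body: JSON.stringify({ vectors: { size: vectorSize, distance: "Cosine" } }),
  });
  if (!create.ok) throw new Error(`Qdrant returned HTTP ${create.status}`);
}

// Example: a hypothetical collection sized for 768-dimensional embeddings.
ensureCollection("auldlangsynth_samples", 768).catch(console.error);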

AuldLangSynth's UI

Screenshots cover the main views: Dashboard, Generate, Profiles, Reverse Engineer, Transcribe, Deduplication, Stratified Mix, Uploads, Embeddings, Dataset, Grammar, LLM-As-A-Judge, Close-The-Loop, Vector-DB, and Fine-Tune-Export.

Installation

For a more in-depth guide, see the docs, which will help you get started quickly.

Docs

Prerequisites

  • Node.js 18.17+ (20.x recommended); npm 9+.
  • git installed.
  • Optional: yt-dlp (for YouTube audio fetch), npx puppeteer browsers install chrome (for the Puppeteer fallback), and a Qdrant instance if using the vector DB.

Steps

  1. Clone and install
git clone <repo-url>
cd AuldLangSynth
npm install
  2. (Optional but recommended) Create .env.local with your keys: GEMINI_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, QDRANT_HOST, QDRANT_API_KEY, etc. (an example file is sketched after these steps).
  3. Run the dev server (Vite + backend middleware)
npm run dev
  4. Build for production
npm run build
  5. Preview the production build
npm run preview
  6. Run the reverse-engineering backend (optional standalone)
npm run reverse-backend
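As mentioned in step 2, a minimal .env.local might look like the following. The variable names are the ones listed above; the values are placeholders, and you can omit any provider or the Qdrant entries if you don't use them.

# .env.local (placeholder values; omit any provider you don't use)
GEMINI_API_KEY=your-gemini-key
OPENAI_API_KEY=your-openai-key
OPENROUTER_API_KEY=your-openrouter-key
QDRANT_HOST=http://localhost:6333
QDRANT_API_KEY=your-qdrant-key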

Full Documentation

Docs


Project Status & Expectations

AuldLangSynth is stable and usable, but development is currently slower than during its initial creation, which was tied to a conference paper. Due to other commitments, updates will be made on a best-effort basis rather than through a fixed roadmap. That said, the project is not abandoned. Issues, feedback, and pull requests are welcome, and improvements may be incorporated as time allows.

Intended for research, educational, and defensive security use.

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

You are free to use, modify, and distribute this software, provided that any derivative works are also licensed under GPL-3.0 and the source code is made available.
