A CLI tool to extract readable text from static HTML files (e.g., Next.js static exports) and generate an `llm.txt` file optimized for ingestion by Large Language Models (LLMs). It also supports generating a JSON metadata file (`pages.json`) and an optional interactive HTML UI (`llm_ui.html`) for browsing extracted content.
- Text Extraction: Extracts clean, readable text from HTML files using Cheerio, removing noise like scripts, styles, and hidden elements (see the sketch after this list).
- Concurrent Processing: Processes multiple HTML files concurrently with configurable concurrency limits for performance.
- Output Formats:
  - `llm.txt`: A single text file with extracted content, organized with file headers and a table of contents.
  - `pages.json`: Metadata about processed files, including file paths, sizes, text lengths, and SHA-256 hashes.
  - `llm_ui.html` (optional): An interactive HTML interface for browsing extracted text with search functionality.
- Customizable: Supports glob patterns for file selection, configurable output directories, and verbosity options.
- Efficient: Uses streaming for large file outputs and handles errors gracefully.
- MIT Licensed: Free to use, modify, and distribute.
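
As a rough illustration of the extraction step above, here is a minimal sketch using Cheerio. It is not llm-gen's actual implementation; the selector list, whitespace handling, and function name are assumptions for illustration only.

```js
// extract-text.mjs — illustrative sketch of Cheerio-based text extraction (not llm-gen's source)
import { readFile } from "node:fs/promises";
import * as cheerio from "cheerio";

export async function extractReadableText(htmlPath) {
  const html = await readFile(htmlPath, "utf8");
  const $ = cheerio.load(html);

  // Remove non-content and hidden elements before reading text.
  $("script, style, noscript, template, [hidden]").remove();
  $('[style*="display:none"], [style*="display: none"]').remove();

  // Collapse whitespace so the result is clean, LLM-friendly plain text.
  return $("body").text().replace(/\s+/g, " ").trim();
}

// Usage: const text = await extractReadableText("./out/index.html");
```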
Install `llm-gen` globally via npm for CLI access, or locally for use in scripts:

```bash
# Install globally
npm install -g llm-gen

# Or install locally in a project
npm install llm-gen
```
- Node.js: Version 18 or higher (uses ES Modules and modern JavaScript features).
- npm: For installing dependencies.
Run `llm-gen` from the command line, specifying the source directory containing HTML files (e.g., a Next.js static export in the `out` directory):

```bash
llm-gen --src ./out
```
This processes all HTML files (`*.html`, `*.htm`) in the `./out` directory, generating:

- `llm.txt`: Extracted text with a table of contents.
- `pages.json`: Metadata about processed files.
| Option | Description | Type | Default |
|---|---|---|---|
| `--src` | Source directory containing HTML files to process (required). | String | None |
| `--public` | Output directory for generated files (`llm.txt`, `pages.json`, `llm_ui.html`). | String | `.` (current dir) |
| `--out` | Output filename for the extracted text (relative to `--public`). | String | `llm.txt` |
| `--ui` | Generate an interactive HTML UI file (`llm_ui.html`). | Boolean | `false` |
| `--concurrency` | Maximum number of files to process concurrently. | Number | `10` |
| `--verbose` | Enable detailed logging of processing steps. | Boolean | `false` |
| `--pattern` | Glob pattern to match HTML files (relative to `--src`). | String | `**/*.htm?(l)` |
| `--help` | Display help information. | - | - |
1. Process HTML files in a Next.js `out` directory:

   ```bash
   llm-gen --src ./out --public ./dist --out extracted.txt
   ```

   - Processes HTML files in `./out`.
   - Writes `extracted.txt` and `pages.json` to `./dist`.

2. Generate an interactive UI:

   ```bash
   llm-gen --src ./out --ui --verbose
   ```

   - Generates `llm.txt`, `pages.json`, and `llm_ui.html` in the current directory.
   - Logs detailed progress.

3. Custom glob pattern with high concurrency:

   ```bash
   llm-gen --src ./out --pattern "**/*.html" --concurrency 20 --public ./output
   ```

   - Processes only `*.html` files with up to 20 concurrent tasks.
   - Outputs to `./output`.
1. `llm.txt`:
   - Contains extracted text from all HTML files, organized with headers for each file.
   - Includes a table of contents with file paths, sizes, and character counts.
   - Example structure:

     ```text
     Generated: 2025-08-10T12:34:56.789Z
     Source directory: /path/to/out
     Files processed: 5

     ╔══════════════════════════════╤═══════╤═══════╗
     ║ File Path                    │ Size  │ Chars ║
     ╟──────────────────────────────┼───────┼───────╢
     ║ index.html                   │ 12345 │ 5678  ║
     ║ about.html                   │ 6789  │ 2345  ║
     ╚══════════════════════════════╧═══════╧═══════╝

     Total files: 5
     Total characters: 12,345

     ════════════════════════════════════════════════════════════════════════════════
     📄 FILE: index.html
     ════════════════════════════════════════════════════════════════════════════════

     Welcome to our website! This is the home page content...
     ```

2. `pages.json`:
   - Metadata about processed files in JSON format.
   - Includes file paths, sizes, text lengths, SHA-256 hashes, and any errors.
   - Example:

     ```json
     {
       "generatedAt": "2025-08-10T12:34:56.789Z",
       "source": "/path/to/out",
       "pages": [
         {
           "path": "index.html",
           "size": 12345,
           "textLength": 5678,
           "hash": "a1b2c3d4...",
           "error": null
         }
       ]
     }
     ```

3. `llm_ui.html` (if `--ui` is enabled):
   - An interactive HTML page for browsing extracted text.
   - Features a search bar to filter files by path or content.
   - Collapsible sections for each file with a text preview.
   - Note: Full-text display for large files is not implemented (displays a placeholder alert).
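
The SHA-256 hashes recorded in `pages.json` can be reproduced with Node's built-in `crypto` module. Below is a minimal sketch of building one metadata entry; exactly which bytes llm-gen hashes (raw HTML vs. extracted text) is an assumption here, where the extracted text is used.

```js
// page-entry.mjs — illustrative sketch of one pages.json entry (not llm-gen's source)
import { createHash } from "node:crypto";
import { stat } from "node:fs/promises";

export async function buildPageEntry(relativePath, absolutePath, extractedText) {
  const { size } = await stat(absolutePath); // file size in bytes
  return {
    path: relativePath,
    size,
    textLength: extractedText.length,
    // Assumption: hash the extracted text; the real tool may hash the raw HTML instead.
    hash: createHash("sha256").update(extractedText).digest("hex"),
    error: null,
  };
}
```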
Clone the repository and install dependencies:

```bash
git clone https://github.com/Agecoder/llm-gen.git
cd llm-gen
npm install
```
- `npm start`: Run the CLI with default arguments (`--src ./out`).
- `npm run build`: Run the CLI with `./out` as the source directory.
- `npm run dev`: Run the CLI in watch mode for development.
- `npm run lint`: Lint the codebase using ESLint.
- `npm run format`: Format the codebase using Prettier.
- cheerio: HTML parsing for text extraction.
- fs-extra: Enhanced file system operations.
- glob: File pattern matching.
- p-limit: Concurrency control for file processing.
- yargs: Command-line argument parsing.
- eslint: Linting for code quality.
- prettier: Code formatting.
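
As a sketch of how glob and p-limit typically combine for the concurrent processing described above (function names and pipeline structure are assumptions, not llm-gen's actual code):

```js
// process-pages.mjs — illustrative glob + p-limit pipeline (not llm-gen's source)
import path from "node:path";
import { glob } from "glob";
import pLimit from "p-limit";
import { extractReadableText } from "./extract-text.mjs"; // the earlier illustrative sketch

export async function processPages(srcDir, pattern = "**/*.htm?(l)", concurrency = 10) {
  const files = await glob(pattern, { cwd: srcDir, nodir: true });
  const limit = pLimit(concurrency); // cap the number of files processed at once

  return Promise.all(
    files.map((file) =>
      limit(async () => {
        try {
          const text = await extractReadableText(path.join(srcDir, file));
          return { path: file, textLength: text.length, error: null };
        } catch (err) {
          // Errors are recorded per file so one bad page doesn't abort the run.
          return { path: file, textLength: 0, error: String(err) };
        }
      })
    )
  );
}
```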
Contributions are welcome! Please follow these steps:

- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.

Please ensure your code passes linting (`npm run lint`) and is formatted (`npm run format`).
This project is licensed under the MIT License.
- Vedant Navale
- Email: vedantnavale45@gmail.com
- GitHub: Agecoder
Report bugs or suggest features at GitHub Issues.
- Built with inspiration from static site generation workflows and LLM content ingestion needs.
- Thanks to the open-source community for providing robust libraries like Cheerio and yargs.