llm-gen

A CLI tool to extract readable text from static HTML files (e.g., Next.js static exports) and generate an llm.txt file optimized for ingestion by Large Language Models (LLMs). It also supports generating a JSON metadata file (pages.json) and an optional interactive HTML UI (llm_ui.html) for browsing extracted content.

Features

Text Extraction: Extracts clean, readable text from HTML files using Cheerio, removing noise like scripts, styles, and hidden elements.
Concurrent Processing: Processes multiple HTML files concurrently with configurable concurrency limits for performance.
Output Formats:
- llm.txt: A single text file with extracted content, organized with file headers and a table of contents.
- pages.json: Metadata about processed files, including file paths, sizes, text lengths, and SHA-256 hashes.
- llm_ui.html (optional): An interactive HTML interface for browsing extracted text with search functionality.
Customizable: Supports glob patterns for file selection, configurable output directories, and verbosity options.
Efficient: Uses streaming for large file outputs and handles errors gracefully.
MIT Licensed: Free to use, modify, and distribute.
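
To make the "Concurrent Processing" feature concrete, here is a minimal, dependency-free sketch of a concurrency cap. The Dependencies section lists p-limit for this job; the names `mapWithLimit` and `fakeProcess` below are hypothetical and only illustrate the idea, so this should not be read as llm-gen's actual implementation.

```javascript
// Hypothetical helper: run `worker` over `items`, at most `limit` at a time.
// It imitates what a library such as p-limit offers; it is not llm-gen's code.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each runner claims items one by one until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results;
}

// Track how many workers are active at once to show that the cap holds.
let active = 0;
let peak = 0;
async function fakeProcess(file) {
  active++;
  peak = Math.max(peak, active);
  await new Promise((resolve) => setTimeout(resolve, 5)); // stand-in for file I/O
  active--;
  return file.toUpperCase();
}

const files = ["a.html", "b.html", "c.html", "d.html", "e.html"];
const done = mapWithLimit(files, 2, fakeProcess);
done.then((out) => console.log(out, "peak concurrency:", peak));
```

Results keep the input order even though files finish at different times, which matters if the output file is expected to list pages in a stable order.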

Installation

Install llm-gen globally via npm for CLI access, or locally for use in scripts:

# Install globally
npm install -g llm-gen

# Or install locally in a project
npm install llm-gen

Prerequisites

Node.js: Version 18 or higher (uses ES Modules and modern JavaScript features).
npm: For installing dependencies.

Usage

Run llm-gen from the command line, specifying the source directory containing HTML files (e.g., a Next.js static export in the out directory).

Basic Command

llm-gen --src ./out

This processes all HTML files (*.html, *.htm) in the ./out directory, generating:

llm.txt: Extracted text with a table of contents.
pages.json: Metadata about processed files.

Options

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| `--src` | Source directory containing HTML files to process (required). | String | None |
| `--public` | Output directory for generated files (`llm.txt`, `pages.json`, `llm_ui.html`). | String | `.` (current dir) |
| `--out` | Output filename for the extracted text (relative to `--public`). | String | `llm.txt` |
| `--ui` | Generate an interactive HTML UI file (`llm_ui.html`). | Boolean | `false` |
| `--concurrency` | Maximum number of files to process concurrently. | Number | `10` |
| `--verbose` | Enable detailed logging of processing steps. | Boolean | `false` |
| `--pattern` | Glob pattern to match HTML files (relative to `--src`). | String | `**/*.htm?(l)` |
| `--help` | Display help information. | - | - |

Example Commands

Process HTML files in a Next.js out directory:
```
llm-gen --src ./out --public ./dist --out extracted.txt
```
- Processes HTML files in ./out.
- Writes extracted.txt and pages.json to ./dist.

Generate an interactive UI:
```
llm-gen --src ./out --ui --verbose
```
- Generates llm.txt, pages.json, and llm_ui.html in the current directory.
- Logs detailed progress.

Custom glob pattern with high concurrency:
```
llm-gen --src ./out --pattern "**/*.html" --concurrency 20 --public ./output
```
- Processes only *.html files with up to 20 concurrent tasks.
- Outputs to ./output.

Output Files

llm.txt:

Contains extracted text from all HTML files, organized with headers for each file.
Includes a table of contents with file paths, sizes, and character counts.

Example structure:

Generated: 2025-08-10T12:34:56.789Z
Source directory: /path/to/out
Files processed: 5

╔══════════════════════════════╤═══════╤═══════╗
║ File Path                    │ Size  │ Chars ║
╟──────────────────────────────┼───────┼───────╢
║ index.html                   │ 12345 │  5678 ║
║ about.html                   │  6789 │  2345 ║
╚══════════════════════════════╧═══════╧═══════╝
Total files: 5
Total characters: 12,345

════════════════════════════════════════════════════════════════════════════════
 📄 FILE: index.html
════════════════════════════════════════════════════════════════════════════════
Welcome to our website! This is the home page content...
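
A table of contents in this style can be produced with plain string padding. The sketch below is illustrative only: `renderToc` is a hypothetical name, the column widths (28, 5, 5) are read off the example above rather than taken from llm-gen's source, and the `path`, `size`, and `textLength` fields follow the pages.json example.

```javascript
// Hypothetical renderer for a box-drawn table of contents.
// Column widths are assumptions inferred from the README example.
function renderToc(pages) {
  const W = { path: 28, size: 5, chars: 5 };

  // Build a horizontal rule from corner, fill, and junction characters.
  const rule = (left, fill, join, right) =>
    left +
    [W.path, W.size, W.chars].map((w) => fill.repeat(w + 2)).join(join) +
    right;
  const row = (a, b, c) => `║ ${a} │ ${b} │ ${c} ║`;

  const lines = [
    rule("╔", "═", "╤", "╗"),
    row("File Path".padEnd(W.path), "Size".padEnd(W.size), "Chars".padEnd(W.chars)),
    rule("╟", "─", "┼", "╢"),
    // Paths are left-aligned, numbers right-aligned.
    ...pages.map((p) =>
      row(
        p.path.padEnd(W.path),
        String(p.size).padStart(W.size),
        String(p.textLength).padStart(W.chars)
      )
    ),
    rule("╚", "═", "╧", "╝"),
  ];
  return lines.join("\n");
}

const toc = renderToc([
  { path: "index.html", size: 12345, textLength: 5678 },
  { path: "about.html", size: 6789, textLength: 2345 },
]);
console.log(toc);
```

Paths longer than the assumed 28 characters would push the row out of alignment; a real implementation would need to truncate them or size the column from the longest path.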

pages.json:

Metadata about processed files in JSON format.
Includes file paths, sizes, text lengths, SHA-256 hashes, and any errors.

Example:

{
  "generatedAt": "2025-08-10T12:34:56.789Z",
  "source": "/path/to/out",
  "pages": [
    {
      "path": "index.html",
      "size": 12345,
      "textLength": 5678,
      "hash": "a1b2c3d4...",
      "error": null
    }
  ]
}
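
As a rough illustration of how one entry of this shape could be assembled, the sketch below uses Node's built-in `crypto` module. `buildPageRecord` is a hypothetical helper, and two details are assumptions rather than documented behavior: that `size` is the byte length of the HTML and that `hash` is computed over the extracted text. It uses `require` for brevity; in an ES Modules project the equivalent is `import { createHash } from "node:crypto"`.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical helper: build one pages.json entry from raw HTML and extracted text.
// Assumptions: `size` is the HTML byte length and `hash` covers the extracted text.
function buildPageRecord(path, html, text) {
  return {
    path,
    size: Buffer.byteLength(html, "utf8"),
    textLength: text.length,
    hash: createHash("sha256").update(text, "utf8").digest("hex"),
    error: null, // a failed page would carry an error message here instead
  };
}

const record = buildPageRecord("index.html", "<p>abc</p>", "abc");
console.log(JSON.stringify(record, null, 2));
```

A stable hash per page lets downstream tooling detect which pages changed between two runs without comparing full text.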

llm_ui.html (if --ui is enabled):
- An interactive HTML page for browsing extracted text.
- Features a search bar to filter files by path or content.
- Collapsible sections for each file with a text preview.
- Note: Full-text display for large files is not implemented yet; a placeholder alert is shown instead.
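
The search behavior described above (filtering by path or content) can be sketched as a small pure function. This is an illustration under assumptions, not the code shipped in llm_ui.html: `filterPages` is a hypothetical name, and the `path` and `text` fields are chosen for the example.

```javascript
// Hypothetical search filter: keep pages whose path or text contains the query.
// The `path` and `text` field names are assumptions for illustration.
function filterPages(pages, query) {
  const needle = query.trim().toLowerCase();
  if (needle === "") return pages; // an empty search shows everything
  return pages.filter(
    (page) =>
      page.path.toLowerCase().includes(needle) ||
      page.text.toLowerCase().includes(needle)
  );
}

const samplePages = [
  { path: "index.html", text: "Welcome to our website!" },
  { path: "about.html", text: "Our team and history." },
];
console.log(filterPages(samplePages, "team").map((p) => p.path));
```

Matching is case-insensitive and substring-based, which keeps the filter simple enough to run on every keystroke for a modest number of pages.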

Development

Setup

Clone the repository and install dependencies:

git clone https://github.com/Agecoder/llm-gen.git
cd llm-gen
npm install

Scripts

npm start: Run the CLI with default arguments (--src ./out).
npm run build: Run the CLI with ./out as the source directory.
npm run dev: Run the CLI in watch mode for development.
npm run lint: Lint the codebase using ESLint.
npm run format: Format the codebase using Prettier.

Dependencies

cheerio: HTML parsing for text extraction.
fs-extra: Enhanced file system operations.
glob: File pattern matching.
p-limit: Concurrency control for file processing.
yargs: Command-line argument parsing.

Dev Dependencies

eslint: Linting for code quality.
prettier: Code formatting.

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a feature branch (git checkout -b feature/your-feature).
Commit your changes (git commit -m "Add your feature").
Push to the branch (git push origin feature/your-feature).
Open a pull request.

Please ensure your code passes linting (npm run lint) and is formatted (npm run format).

License

This project is licensed under the MIT License.

Author

Vedant Navale
Email: vedantnavale45@gmail.com
GitHub: Agecoder

Issues

Report bugs or suggest features at GitHub Issues (https://github.com/Agecoder/llm-gen/issues).

Acknowledgments

Built with inspiration from static site generation workflows and LLM content ingestion needs.
Thanks to the open-source community for providing robust libraries like Cheerio and yargs.
