This project is intended to processes long unformatted text files (such as a raw audio transcripts) converting them into formatted content with complete sentences.
- Converts raw unformatted transcripts into readable, well-structured text.
- Handles text of any length using chunked processing.
- Produces output with punctuation of sentences
- Un-formatted fragments remain unchanged.
- Python 3.8+
- The text-generation-webui sever and a loaded model, currently developing with mythomax gguf variants.
-
Start the server (provides the loaded model):
python server.py
-
The OpenAI-compatible API will be available at:
http://0.0.0.0:5000
-
Use the webui to load a model.
Run the formatter:
python process.py
* config.py defines file paths
- Processes text in chunks that fit within the LLM's context window
- Applies proper formatting to each chunk
- Combines chunks results correctly
- Preserves the original content's meaning and intent
- Proper sentence structure with correct punctuation
- Uses chunked processing to handle unlimited length input
- Maintains overlap between chunks for seamless transitions
- Preliminary results shows that a LoRA will need to be trained.
- Proper chunking and combining of chunks is essential and is the current priority.
- Proper chunking facilitates the production of training data.