ScriptureReflection is a research project that explores how AI-driven reflection techniques can aid in generating Bible translations for high-resource languages. The approach relies on iterative improvement: an LLM drafts a translation, an AI grader evaluates it, and the draft is refined over multiple cycles. While this repository primarily focuses on high-resource languages, many of the techniques used may have applications for low-resource languages as well.
The ultimate goal of the project is not only to produce automated Bible translations but also to develop tools and workflows that complement human involvement in translation and quality assurance.
This repository builds upon the concept of "reflection," first developed in a prior repository, doctrine_detector. A detailed presentation of the doctrine_detector project can be found here.
Reflection is a process wherein an LLM generates output, grades its quality, provides feedback, and iteratively refines its output based on that feedback. More specifically:
- Initial Output: An LLM generates content, such as a Bible verse translation.
- Grading: An LLM evaluates that content and assigns a grade with comments on how to improve.
- Refinement: The output is then adjusted based on the feedback.
- Iteration: Steps 2 and 3 are repeated several times, with the goal of improving quality each time (see the sketch below).
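As a minimal sketch of that loop (the `generate_translation`, `grade_translation`, and `refine_translation` helpers below are hypothetical stand-ins for LLM calls, not functions from this repository):

```python
# Illustrative only: generate_translation, grade_translation, and
# refine_translation are hypothetical wrappers around LLM calls.
def reflect(reference: str, source_text: str, cycles: int = 3) -> str:
    draft = generate_translation(source_text)                  # 1. initial output
    for _ in range(cycles):                                    # 4. iteration
        grade, comments = grade_translation(reference, draft)  # 2. grading
        draft = refine_translation(draft, comments)            # 3. refinement
    return draft
```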
Initial findings from the doctrine_detector project revealed that multiple parallel grading runs improve stability and that reflection improves overall quality as the process cycles. Building on those findings, this project aims to:
- Reduce Plagiarism Risk: Generating translations without verbatim reproduction of known versions.
- Explore Paraphrasing: Enable the model to produce translations that balance readability and fidelity.
- Iterative Improvement: Utilize iterative reflection to improve translation quality and capture nuances.
- Grading Metrics: Investigate ways to quantify and assess grades during the translation process.
- Human-AI Collaboration: Build tools for human involvement in the reflection process.
The repository consists of several modules and YAML configuration files that control various aspects of the pipeline. Each module is tailored to handle a specific phase of the reflection-based translation process. Below is an overview of the key components:
- `easy_draft.py`:
  - Generates an initial draft of translations in JSONL format (see the example record below).
  - Supports creating paraphrases aimed at a specific reading level (e.g., comprehension by a seven-year-old).
  - YAML Config: `easy_draft.yaml`.
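For illustration, each line of the draft JSONL might hold a record along these lines (the field names here are made up for the example, not the repository's actual schema):

```python
import json

# Hypothetical layout for one verse in the draft JSONL output.
record = {
    "reference": "JHN 3:16",
    "source": "For God so loved the world...",
    "translation": "God loved everyone in the world so much...",
    "reading_level": "age 7",
}
with open("draft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```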
- `rangeable_easy_draft.py`:
  - Extends `easy_draft.py` by allowing multiple verses to merge into ranges based on natural flow (see the sketch below).
  - Useful for paraphrased translations but still requires refinement due to nonsensical merges.
  - YAML Config: `rangeable_easy_draft.yaml`.
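As a rough illustration of what a merge produces, consecutive verse records could be collapsed into a single ranged record as sketched below; in `rangeable_easy_draft.py` the merge decision is made by the model based on natural flow, not by a fixed rule.

```python
# Sketch: collapse consecutive verse records (using the hypothetical
# record layout above) into one ranged record.
def merge_range(records: list[dict]) -> dict:
    first, last = records[0], records[-1]
    last_verse = last["reference"].split(":")[-1]
    return {
        "reference": f"{first['reference']}-{last_verse}",  # e.g. "JHN 3:16-18"
        "translation": " ".join(r["translation"] for r in records),
    }
```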
- `input_formats.py`:
  - Facilitates the import of translations in various formats, including USFM, USX, and biblenlp.
  - Allows specification of both source and target languages, each with their corresponding formats.
  - Requires that the source and target languages adhere to the same versification, preventing incorrect pairings (a check along the lines sketched below).
  - YAML Config: `input_formats.yaml`.
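One straightforward way to enforce the shared-versification requirement is to compare the verse references present on each side before pairing them; the sketch below assumes the hypothetical record layout used earlier.

```python
# Sketch: refuse to pair source and target files whose verse references differ.
def check_versification(source_records: list[dict], target_records: list[dict]) -> None:
    source_refs = {r["reference"] for r in source_records}
    target_refs = {r["reference"] for r in target_records}
    if source_refs != target_refs:
        mismatched = source_refs ^ target_refs
        raise ValueError(f"Versification mismatch on {len(mismatched)} references")
```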
- `output_formats.py`:
  - Converts the intermediate JSONL outputs into:
    - USFM format
    - JSONL for external tools like SWARM
    - Markdown for direct viewing on GitHub
    - A single-file report sorted by grade, worst to best (roughly as sketched below). If an OpenAI API key is provided, it also summarizes the grade comments into a single improvement request and translates everything in the report inline in parentheses.
  - USFM export may not yet fully support range merging.
  - YAML Config: `output_formats.yaml`.
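The worst-to-best report could, for instance, be produced by sorting the graded records before rendering them, roughly as below (the `grade` field and its numeric scale are assumptions):

```python
import json

# Sketch: render a worst-to-best Markdown report from graded JSONL records.
def write_report(jsonl_path: str, report_path: str) -> None:
    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    records.sort(key=lambda r: r.get("grade", 0))  # worst grades first
    with open(report_path, "w", encoding="utf-8") as out:
        for r in records:
            out.write(f"## {r['reference']} (grade {r.get('grade', 'n/a')})\n\n")
            out.write(f"{r['translation']}\n\n")
```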
The reflection process has two main phases:
- Grading: `grade_output.py`
  - Assigns grades to translations.
  - Outputs grading results as a separate file (illustrated below).
  - YAML Config: `grade_output.yaml`.
- Single Reflection Cycle: `do_reflection.py`
  - Applies one round of reflection using grading feedback.
  - Outputs a new translation version.
  - YAML Config: `do_reflection.yaml`.
- Inefficiencies: This approach requires manual configuration updates and produced numerous redundant files. To address this, iterative loop-based tools were created.
- `grade_reflect_loop.py`:
  - Automates grading and reflection loops.
  - Enables iterative improvement with dynamic context (adjacent verses).
  - Introduces a mode that focuses on the verse with the lowest grade at each step, propagating improvements while mitigating bad suggestions (see the sketch after this list).
  - The grading-reflection loop is somewhat format-agnostic, allowing use with any JSONL-based verse translation input.
  - A grading-only mode allows the tool to be used for quality assessment without changing any verses.
  - Enhancements:
    - Finalization of challenging verses after several attempts by picking the best version graded so far.
    - Finalization helps resolve oscillation between valid but competing outputs (e.g., "deceiver" vs. "false prophet").
  - YAML Config: `grade_reflect_loop.yaml`.
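A compact sketch of the lowest-grade-first strategy with finalization, reusing the hypothetical `grade_translation` and `refine_translation` helpers from the reflection sketch above:

```python
# Sketch: repeatedly reflect on the worst-graded verse, keeping a per-verse
# history so a stubborn verse can be finalized to its best version so far.
def grade_reflect_loop(verses: dict[str, str], iterations: int,
                       max_attempts: int = 5) -> dict[str, str]:
    history: dict[str, list[tuple[float, str]]] = {ref: [] for ref in verses}
    finalized: set[str] = set()
    for _ in range(iterations):
        results = {ref: grade_translation(ref, text)
                   for ref, text in verses.items() if ref not in finalized}
        if not results:
            break
        worst = min(results, key=lambda ref: results[ref][0])  # lowest grade first
        grade, comments = results[worst]
        history[worst].append((grade, verses[worst]))
        if len(history[worst]) >= max_attempts:
            # Finalize: keep the best-graded version seen so far to stop oscillation.
            verses[worst] = max(history[worst])[1]
            finalized.add(worst)
        else:
            verses[worst] = refine_translation(verses[worst], comments)
    return verses
```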
- `streamlit_reflector.py`:
  - A graphical tool enabling Human-in-the-Loop iterative functionality (a rough sketch follows this list).
  - Works in combination with the `grade_reflect_loop.py` script.
  - Displays verses sorted by grade.
  - Allows direct editing of verses.
  - Allows adding comments to verses to guide grading.
  - Shows verse history and grade over time.
  - Allows pinning specific verses for a hybrid manual/reflection mode.
  - Intended to alternate with `grade_reflect_loop.py`, not run simultaneously.
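A bare-bones sketch of what such a page might look like in Streamlit: it loads a JSONL file, lists verses worst grade first, and lets a human edit one and leave a comment. The file path and field names are assumptions, not the tool's actual schema.

```python
import json
import streamlit as st

RECORDS_PATH = "translation.jsonl"  # assumed path

# Load graded verses and show the worst ones first.
with open(RECORDS_PATH, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
records.sort(key=lambda r: r.get("grade", 0))

labels = [f"{r['reference']} (grade {r.get('grade', 'n/a')})" for r in records]
idx = st.selectbox("Verse (worst grade first)", range(len(records)),
                   format_func=lambda i: labels[i])
record = records[idx]

edited = st.text_area("Translation", record["translation"])
comment = st.text_input("Comment for the grader", record.get("comment", ""))

if st.button("Save"):
    record["translation"], record["comment"] = edited, comment
    with open(RECORDS_PATH, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    st.success("Saved")
```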
- Paraphrasing with Reflection:
  - Iterative reflection produced a paraphrased version of John 3 (Output).
  - Excessive wordiness emerged before refinement but proved beneficial, as it allowed the model to accumulate and refine complex ideas.
- Focused Iteration:
  - Reflection on poorly graded verses improved the consistency of the chapters.
  - Divergent ideas (like adding footnotes) propagated across early drafts and stabilized through consensus.
- Challenges:
  - Stability issues when iterating over full chapters; `grade_reflect_loop.py` therefore outputs one verse at a time even though it has a larger context.
  - Oscillation between competing word choices.
- Evaluation Metrics:
  - The average grade of translations improves initially but may plateau or fluctuate without proper safeguards. Grading from the bottom up prevents oscillations by concentrating on the worst grade. See the following (Grade Chart) for the reflection on Matthew.
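If every cycle's grades are kept, the plateau can be tracked with a simple per-cycle average, as in this small sketch (it assumes a list of per-verse grades for each iteration, which is not a structure the repository documents):

```python
# Sketch: track the average grade per reflection cycle to spot plateaus.
def average_grades(grade_history: list[list[float]]) -> list[float]:
    """grade_history[i] holds every verse grade recorded during cycle i."""
    return [sum(grades) / len(grades) for grades in grade_history if grades]
```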
Each module is configured and executed independently based on YAML files. Below is a general workflow:
- Generate an initial draft: edit `easy_draft.yaml`, then run `python easy_draft.py`.
- Or, instead of generating a draft, import an existing translation: edit `import_formats.yaml`, then run `python import_formats.py`.
- Run grading and reflection: edit `grade_reflect_loop.yaml`, then run `python grade_reflect_loop.py`.
- Run the Streamlit POC GUI: edit `streamlit_reflector.yaml` only if your JSONL keys are custom, then run `streamlit run streamlit_reflector.py`.
- Return to the grading and reflection step after editing verses or adding comments in the Streamlit app. The Streamlit app modifies the JSONL file so that `grade_reflect_loop.py` knows where to continue.
- Generate output formats: edit `output_formats.yaml`, then run `python output_formats.py`.
- Paraphrased translation of John 3 in English: Link
- Full Book of Matthew in English: Link
- Reflection Grade Chart for Matthew in English: Link
MIT
For questions or contributions, please reach out to the repository maintainer.