⚠️ Ongoing project – currently in the planning, design, and development stage; no production code has been pushed yet for security reasons.
This project is a small Python toolkit to automate running and testing prompts against LLM-based chatbots.
Right now, it focuses on:
- Reading prompts from a CSV/Excel file
- Automatically sending each prompt to:
- HKChat (primary model)
- Perplexity (comparison model)
- Collecting the answers into new output columns and saving a results sheet
In the future, an LLM-based evaluator will also be added to analyse the responses and mark each test as PASS or FAIL based on expected behaviour.
You prepare a `.csv` or `.xlsx` file with columns such as:

- `Prompt` – the user message / question to send
- Optional mode columns:
  - `HKChat Mode` – e.g. “Thinking”, “Weather”, “Legal” (maps to HKChat buttons)
  - `Perplexity Mode` – e.g. “Research” / “Search” (can be auto-inferred from the HKChat mode)
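To make the expected layout concrete, here is a minimal, hypothetical snippet that writes such an input file with pandas. The file name `prompts.xlsx` and the sample prompts are purely illustrative:

```python
import pandas as pd

# Two illustrative test rows using the columns described above.
rows = [
    {"Prompt": "What will the weather in Hong Kong be like tomorrow?",
     "HKChat Mode": "Weather", "Perplexity Mode": "Search"},
    {"Prompt": "Summarise the main duties of a data protection officer.",
     "HKChat Mode": "Legal", "Perplexity Mode": "Research"},
]
pd.DataFrame(rows).to_excel("prompts.xlsx", index=False)
```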
The tool will:
- Detect the prompt column automatically (supports names like `Prompt`, `prompt`, `Question`, etc.).
- Add or reuse the output columns `HKChat Output` and `Perplexity Output`.
- Optionally add `Perplexity Mode` if it does not exist.
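A minimal sketch of this column handling, assuming the accepted prompt-column names are roughly the ones listed above (the real script's heuristics may differ):

```python
import pandas as pd

PROMPT_NAMES = {"prompt", "question"}  # lower-cased names accepted as the prompt column

def find_prompt_column(df: pd.DataFrame) -> str:
    """Return the first column whose name looks like a prompt column."""
    for col in df.columns:
        if str(col).strip().lower() in PROMPT_NAMES:
            return col
    raise ValueError("No prompt column found (expected e.g. 'Prompt' or 'Question')")

def ensure_output_columns(df: pd.DataFrame) -> None:
    """Add the output columns, and 'Perplexity Mode', if they are missing."""
    for col in ("HKChat Output", "Perplexity Output", "Perplexity Mode"):
        if col not in df.columns:
            df[col] = ""
```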
Using Playwright, the script opens:
- A persistent browser session for HKChat
- A persistent browser session for Perplexity
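A sketch of how the two sessions could be opened with Playwright's synchronous API; the profile directories and the HKChat URL are placeholders, not values taken from the real script:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Persistent contexts keep cookies and logins between runs in the given profile folders.
    hkchat_ctx = p.chromium.launch_persistent_context("profiles/hkchat", headless=False)
    pplx_ctx = p.chromium.launch_persistent_context("profiles/perplexity", headless=False)

    hkchat_page = hkchat_ctx.new_page()
    pplx_page = pplx_ctx.new_page()

    hkchat_page.goto("https://hkchat.example")   # placeholder URL
    pplx_page.goto("https://www.perplexity.ai")
```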
For each row in the spreadsheet, it:
- Starts a new chat in HKChat.
- Sets the requested mode (e.g. “Thinking”, “Weather”, “Legal”) when provided.
- Types the prompt into HKChat and waits until the full response is finished.
- Captures the reply (preferably via the copy button; otherwise by reading the final rendered text).
- Writes the cleaned reply into `HKChat Output`.
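Roughly, the HKChat part of that loop could look like the sketch below. Every selector is a hypothetical placeholder; the real mode buttons, input field, and completion signal depend on the HKChat front end:

```python
def run_hkchat_prompt(page, prompt: str, mode: str | None) -> str:
    """Send one prompt to HKChat and return the raw reply text (hypothetical selectors)."""
    page.click("button.new-chat")                          # start a new chat
    if mode:
        page.click(f"button.mode-toggle >> text={mode}")   # e.g. "Thinking", "Weather", "Legal"
    page.fill("textarea.prompt-input", prompt)
    page.keyboard.press("Enter")
    # Treat the appearance of a copy button as the "response finished" signal.
    page.wait_for_selector("button.copy-reply", timeout=120_000)
    return page.locator("div.assistant-message").last.inner_text().strip()
```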
Then, for the same row, it:
- Starts a new chat in Perplexity.
- Selects the Search or Research mode (based on the HKChat mode if available).
- Sends the same prompt.
- Waits for the final answer (copy button or stable text).
- Writes the cleaned reply into `Perplexity Output`.
- Records the Perplexity mode into `Perplexity Mode`.
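How the Perplexity mode is inferred from the HKChat mode is not specified yet; the mapping in the sketch below is only an assumption, as is the way results are written back into the DataFrame:

```python
import pandas as pd

def infer_perplexity_mode(hkchat_mode: str | None) -> str:
    """Assumed mapping: reasoning-heavy HKChat modes use Research, everything else uses Search."""
    if hkchat_mode and hkchat_mode.strip().lower() in {"thinking", "legal"}:
        return "Research"
    return "Search"

def write_row_results(df: pd.DataFrame, idx, hkchat_reply: str, pplx_reply: str, pplx_mode: str) -> None:
    """Store one row's captured replies and the chosen Perplexity mode in the output columns."""
    df.at[idx, "HKChat Output"] = hkchat_reply
    df.at[idx, "Perplexity Output"] = pplx_reply
    df.at[idx, "Perplexity Mode"] = pplx_mode
```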
The tool also:
- Adds a simple `S/N` column when none exists, for easier row tracking.
- Saves screenshots into an `out/` folder when errors occur, for later debugging.
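Both conveniences are small; a possible implementation is sketched below (the screenshot file naming is an assumption):

```python
from pathlib import Path

def ensure_serial_column(df) -> None:
    """Insert a 1-based S/N column at the front of the sheet if none exists."""
    if "S/N" not in df.columns:
        df.insert(0, "S/N", range(1, len(df) + 1))

def save_error_screenshot(page, row_index: int) -> None:
    """Capture the current page into out/ when a row fails, for later debugging."""
    out_dir = Path("out")
    out_dir.mkdir(exist_ok=True)
    page.screenshot(path=str(out_dir / f"row_{row_index}_error.png"), full_page=True)
```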
The current version only runs prompts and captures outputs.
The next major feature will be an automatic evaluator, powered by an LLM, that will:
- Read a test specification for each prompt, including:
- The prompt itself
- Optional expected keywords or behaviour description
- Inspect the HKChat Output (and possibly the Perplexity answer).
- Use an LLM to decide whether the response:
- Contains the required information
- Follows style or safety constraints
- Mark each row with a simple result, e.g.:
  - `Result`: PASS / FAIL
  - Optional `Notes`: a short explanation from the evaluator model
Planned extra columns in the spreadsheet might include:

- `Expected Keywords`
- `Evaluation Result` (PASS / FAIL)
- `Evaluation Notes`
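Since the evaluator does not exist yet, the sketch below only illustrates the intended shape: a keyword pre-check plus a judge call through a placeholder `call_llm` function (any external LLM API could sit behind it). The judge prompt and the parsing are assumptions:

```python
def evaluate_response(prompt: str, response: str,
                      expected_keywords: list[str], call_llm) -> tuple[str, str]:
    """Return ("PASS" or "FAIL", notes). `call_llm` is a placeholder for an external LLM API call."""
    # Cheap pre-check: fail early if any expected keyword is missing from the answer.
    missing = [kw for kw in expected_keywords if kw.lower() not in response.lower()]
    if missing:
        return "FAIL", "Missing expected keywords: " + ", ".join(missing)

    judge_prompt = (
        "You are grading a chatbot answer.\n"
        f"Question: {prompt}\n"
        f"Answer: {response}\n"
        "Reply with PASS or FAIL on the first line, then one sentence of justification."
    )
    verdict = call_llm(judge_prompt)
    first_line, _, notes = verdict.partition("\n")
    return ("PASS" if "PASS" in first_line.upper() else "FAIL"), notes.strip()
```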
This will turn the project from a “prompt runner” into a more complete regression testing tool for LLM chatbots.
- Language: Python 3
- Automation: Playwright (Chromium)
- Input/Output: CSV / Excel via `pandas`
- Data processing:
- Auto-detect prompt / mode / output columns
- Normalisation and cleanup of model outputs
- Planned evaluation:
- LLM-based scoring for PASS / FAIL (e.g. using an external LLM API)
- `pytest` for internal unit tests
- GitHub Actions for basic CI
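For the normalisation and cleanup of model outputs mentioned above, a minimal cleanup helper might look like this (the exact rules are not finalised):

```python
import re

def clean_output(text: str) -> str:
    """Normalise whitespace in a captured model reply before writing it to the sheet."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    text = re.sub(r"[ \t]+\n", "\n", text)   # strip trailing spaces at line ends
    return text.strip()
```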
- Place one or more `.csv` / `.xlsx` files in the working directory.
- Run the script (example): `python web_excel_runner.py`