Automated Testing Tool for LLM Chatbots

⚠️ Ongoing project – currently in the planning, design, and development stage; no production code has been pushed yet for security reasons.

This project is a small Python toolkit to automate running and testing prompts against LLM-based chatbots.

Right now, it focuses on:

  • Reading prompts from a CSV/Excel file
  • Automatically sending each prompt to:
    • HKChat (primary model)
    • Perplexity (comparison model)
  • Collecting the answers into new output columns and saving a results sheet

In the future, an LLM-based evaluator will also be added to analyse the responses and mark each test as PASS or FAIL based on expected behaviour.


What It Does Today

1. Spreadsheet-driven test runs

You prepare a .csv or .xlsx file with columns such as:

  • Prompt – the user message / question to send
  • Optional mode columns:
    • HKChat Mode – e.g. “Thinking”, “Weather”, “Legal” (maps to HKChat buttons)
    • Perplexity Mode – e.g. “Research” / “Search” (can be auto-inferred from the HKChat mode)

The tool will (a minimal pandas sketch of these steps follows the list):

  1. Detect the prompt column automatically (supports names like Prompt, prompt, Question, etc.).
  2. Add or reuse output columns:
    • HKChat Output
    • Perplexity Output
  3. Optionally add Perplexity Mode if it does not exist.
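
For illustration, the column handling above might be implemented roughly as follows with pandas. The helper name and the candidate column names are assumptions made for the sketch, not the tool's actual code.

    import pandas as pd

    # Column names treated as the prompt column; the real tool may accept more variants.
    PROMPT_CANDIDATES = {"prompt", "question"}

    def detect_prompt_column(df: pd.DataFrame) -> str:
        """Return the first column whose name looks like a prompt column."""
        for col in df.columns:
            if str(col).strip().lower() in PROMPT_CANDIDATES:
                return col
        raise ValueError("No prompt column found (expected e.g. 'Prompt' or 'Question').")

    def prepare_sheet(path: str) -> pd.DataFrame:
        # Load either CSV or Excel depending on the file extension.
        df = pd.read_csv(path) if path.lower().endswith(".csv") else pd.read_excel(path)
        detect_prompt_column(df)  # fail early if no prompt column exists
        # Add or reuse the output / mode columns used by the runner.
        for col in ("HKChat Output", "Perplexity Output", "Perplexity Mode"):
            if col not in df.columns:
                df[col] = ""
        return df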

2. Automated browser runs (HKChat + Perplexity)

Using Playwright, the script opens (see the launch sketch after this list):

  • A persistent browser session for HKChat
  • A persistent browser session for Perplexity
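
A minimal sketch of how those two sessions could be opened with Playwright's sync API; the profile directories and headless setting are illustrative choices, not the tool's real configuration.

    from playwright.sync_api import sync_playwright

    # Persistent contexts keep cookies and logins between runs, so manual sign-in
    # only has to happen once per profile directory (the paths are placeholders).
    with sync_playwright() as p:
        hkchat = p.chromium.launch_persistent_context("profiles/hkchat", headless=False)
        perplexity = p.chromium.launch_persistent_context("profiles/perplexity", headless=False)
        hkchat_page = hkchat.new_page()
        perplexity_page = perplexity.new_page()
        # ... per-row prompt loop goes here ...
        hkchat.close()
        perplexity.close()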

For each row in the spreadsheet, it (illustrated in the sketch after this list):

  1. Starts a new chat in HKChat.
  2. Sets the requested mode (e.g. “Thinking”, “Weather”, “Legal”) when provided.
  3. Types the prompt into HKChat and waits until the full response is finished.
  4. Captures the reply (preferably via the copy button; otherwise by reading the final rendered text).
  5. Writes the cleaned reply into HKChat Output.
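
One way the HKChat steps above could look in code; every selector below is a placeholder, since the real element names depend on HKChat's UI.

    def run_hkchat_prompt(page, prompt: str, mode: str | None = None) -> str:
        """Send one prompt to HKChat and return the reply text (all selectors are placeholders)."""
        page.click("text=New chat")                        # 1. start a fresh conversation
        if mode:
            page.click(f"button:has-text('{mode}')")       # 2. e.g. 'Thinking' / 'Weather' / 'Legal'
        page.fill("textarea", prompt)                      # 3. type the prompt
        page.keyboard.press("Enter")
        page.wait_for_selector("button:has-text('Copy')",  # 4. a visible copy button is one possible
                               timeout=120_000)            #    "response finished" signal
        reply = page.inner_text(".assistant-message >> nth=-1")  # read the final rendered text
        return reply.strip()                               # 5. cleaned reply for HKChat Output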

Then, for the same row, it:

  1. Starts a new chat in Perplexity.
  2. Selects the Search or Research mode (based on the HKChat mode if available; one possible mapping is sketched after this list).
  3. Sends the same prompt.
  4. Waits for the final answer (copy button or stable text).
  5. Writes the cleaned reply into Perplexity Output.
  6. Records the Perplexity mode into Perplexity Mode.
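
The mode mapping in step 2 is not spelled out in the description, so the rule below is only a guess at how the auto-inference from the HKChat mode might work.

    def infer_perplexity_mode(hkchat_mode: str | None) -> str:
        """Guess the Perplexity mode from the HKChat mode (this mapping is an assumption)."""
        if hkchat_mode and hkchat_mode.strip().lower() == "thinking":
            return "Research"   # a deeper HKChat mode maps to the deeper Perplexity mode
        return "Search"         # default to the lighter mode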

The tool also (see the bookkeeping sketch after this list):

  • Adds a simple S/N column when none exists, for easier row tracking.
  • Saves screenshots into an out/ folder when errors occur for later debugging.
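
The bookkeeping described above could be implemented along these lines; the out/ folder comes from the description, while the file naming and S/N handling are illustrative.

    from datetime import datetime
    from pathlib import Path

    def ensure_serial_column(df) -> None:
        """Add a simple 1-based S/N column when the sheet does not already have one."""
        if "S/N" not in df.columns:
            df.insert(0, "S/N", list(range(1, len(df) + 1)))

    def save_error_screenshot(page, row_index: int) -> Path:
        """Capture the current page into out/ so failed rows can be debugged later."""
        out_dir = Path("out")
        out_dir.mkdir(exist_ok=True)
        path = out_dir / f"row{row_index}_{datetime.now():%Y%m%d_%H%M%S}.png"
        page.screenshot(path=str(path), full_page=True)
        return path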

Planned: LLM-based PASS / FAIL Judging

The current version only runs prompts and captures outputs.
The next major feature will be an automatic evaluator, powered by an LLM, that will:

  1. Read a test specification (for each prompt), including:
    • The prompt itself
    • Optional expected keywords or behaviour description
  2. Inspect the HKChat Output (and possibly the Perplexity answer).
  3. Use an LLM to decide whether the response:
    • Contains the required information
    • Follows style or safety constraints
  4. Mark each row with a simple result, e.g.:
    • Result: PASS / FAIL
    • Optional Notes: short explanation from the evaluator model

Planned extra columns in the spreadsheet might include:

  • Expected Keywords
  • Evaluation Result (PASS / FAIL)
  • Evaluation Notes

This will turn the project from a “prompt runner” into a more complete regression testing tool for LLM chatbots.
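
Since the evaluator does not exist yet, the snippet below is only a thought experiment of what the PASS/FAIL call might look like; the client object, its complete method and the prompt wording are all placeholders for whichever LLM API is eventually chosen.

    def judge_response(client, prompt: str, expected: str, output: str) -> tuple[str, str]:
        """Ask an evaluator LLM for PASS/FAIL plus a short note (the API shape is a placeholder)."""
        instruction = (
            "You are a test evaluator. Given a prompt, the expected behaviour or keywords, "
            "and the chatbot's answer, reply with 'PASS' or 'FAIL' on the first line and a "
            "one-sentence reason on the second line.\n\n"
            f"Prompt: {prompt}\nExpected: {expected}\nAnswer: {output}"
        )
        text = client.complete(instruction)   # placeholder call, not a real library API
        verdict, _, note = text.partition("\n")
        return verdict.strip().upper(), note.strip()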


Tech Stack

  • Language: Python 3
  • Automation: Playwright (Chromium)
  • Input/Output: CSV / Excel via pandas
  • Data processing:
    • Auto-detect prompt / mode / output columns
    • Normalisation and cleanup of model outputs (a small cleanup sketch follows the list)
  • Planned evaluation:
    • LLM-based scoring for PASS / FAIL (e.g. using an external LLM API)
    • pytest for internal unit tests
    • GitHub Actions for basic CI
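
The output cleanup step could be as simple as the sketch below; the actual rules used by the tool are not documented, so the specifics here are assumptions.

    import re

    def clean_output(text: str) -> str:
        """Normalise whitespace and strip obvious UI artefacts from a captured reply."""
        text = text.replace("\r\n", "\n").strip()
        text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
        text = re.sub(r"\n{3,}", "\n\n", text)     # allow at most one blank line in a row
        return text.removesuffix("Copy").strip()   # drop a trailing copied-in button label, if any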

High-Level Workflow

  1. Place one or more .csv / .xlsx files in the working directory.

  2. Run the script (example; the top-level loop is sketched after the command):

    python web_excel_runner.py
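
Putting the pieces together, the top-level loop inside web_excel_runner.py presumably resembles the sketch below. It reuses the hypothetical helpers from the earlier sketches (prepare_sheet, run_hkchat_prompt, infer_perplexity_mode, ensure_serial_column, save_error_screenshot) plus an analogous run_perplexity_prompt; none of these are the tool's real function names.

    def run_sheet(path: str, hkchat_page, perplexity_page) -> None:
        df = prepare_sheet(path)                     # load the sheet and add output columns
        ensure_serial_column(df)
        for i, row in df.iterrows():
            prompt = str(row["Prompt"])              # assumes the column is literally named 'Prompt'
            hk_mode = row.get("HKChat Mode")
            if not isinstance(hk_mode, str) or not hk_mode.strip():
                hk_mode = None                       # treat empty or missing cells as "no mode"
            try:
                df.at[i, "HKChat Output"] = run_hkchat_prompt(hkchat_page, prompt, hk_mode)
                px_mode = infer_perplexity_mode(hk_mode)
                df.at[i, "Perplexity Mode"] = px_mode
                df.at[i, "Perplexity Output"] = run_perplexity_prompt(perplexity_page, prompt, px_mode)
            except Exception:
                save_error_screenshot(hkchat_page, i)  # keep going; the screenshot lands in out/
        results_path = path.rsplit(".", 1)[0] + "_results.xlsx"
        df.to_excel(results_path, index=False)       # the results file name is illustrative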
