-
Notifications
You must be signed in to change notification settings - Fork 0
[Publish] From Raw Data to Audit-Ready Spreadsheets: An AI-Assisted, Deterministic Python Workflow #29
Description
Post title
From Raw Data to Audit-Ready Spreadsheets: An AI-Assisted, Deterministic Python Workflow
Category
Learning Resources
Summary
This post shows a complete, runnable Python workflow that deterministically transcribes structured values from a raw text file into Excel and compares them against a technician spreadsheet to produce audit-ready mismatch reports.
Body HTML
What happens if a single number is silently altered while transcribing raw experimental data into a spreadsheet—and that spreadsheet later becomes part of a pharmaceutical regulatory submission package?
I would like to share with you a complete, runnable example showing how to transcribe structured values from a raw text file (for example, text extracted from OCR or a searchable PDF) into an Excel spreadsheet, and then compare it against a technician’s spreadsheet using fully deterministic, audit-ready rules.
AI tools can make the workflow easier (for example, by extracting text or suggesting column mappings), but using AI alone for end-to-end comparison can introduce a real risk of hallucination—confident but incorrect outputs. That’s why we rely on Python for the core transcription and comparison: it is deterministic, reproducible, and auditable. We still use AI (such as Copilot) to assist with writing and maintaining the code, while keeping the actual data checks fully rule-based—critical when results may be reviewed by QA or regulators.
Python code and how it works
This workflow follows a simple pipeline: start from the raw file, transcribe it into a spreadsheet, then generate comparison reports against the technician sheet. The goal is to catch transcription errors early and produce a clear, audit-ready list of mismatches. The full Python script is provided in the Appendix as a downloadable file.
If you’re new to Python, it helps to know there are two layers here: one for the data logic and one for the Excel file format.
In plain terms: pandas does the deterministic table parsing, alignment, and numeric comparisons, while openpyxl is the behind-the-scenes Excel engine that lets pandas read and write .xlsx files.
1) Reading and parsing the raw file
The parser (turning raw text into structured data) scans the raw text line-by-line. When it sees Sample ID:, it starts a new record. It then captures Analyte: and Result: inside the same record. A blank line ends a record.
The output of this step is a structured table with three columns:
Sample_ID
Analyte
Result_mg_L
2) Writing the spreadsheet
The parsed table is written to Excel using to_excel(). This is your deterministic “raw → spreadsheet” transcription step.
3) Comparing against the technician spreadsheet
The comparison is done in two parts.
- First, merge() aligns rows using the shared keys (Sample_ID, Analyte). This creates a single table where values from both sources sit side-by-side.
- Second, the code computes a numeric difference and checks whether it falls within an allowed tolerance. Every mismatch is flagged and exported.
4) Why _merge matters
The _merge column tells you whether a row exists in both files or only one side.
both: both sources contain the same key
left_only: present in raw-transcribed but missing in technician sheet
right_only: present in technician sheet but missing in raw-transcribed
This makes missing or extra entries auditable.
How to run the files
To run this example end-to-end, you only need Python installed on your computer.
- First, download all files listed in the Appendix and place them in the same folder on your machine. Make sure the Python script (raw_to_spreadsheet_audit.py) sits in the same directory as the raw text file and the technician spreadsheet.
- Next, install the required Python packages by running pip install pandas openpyxl in your terminal or command prompt.
- Finally, run the script with python raw_to_spreadsheet_audit.py. The script will generate the transcribed spreadsheet and the comparison reports automatically in the same folder.
What you should expect in the results
- comparison_report.xlsx contains the aligned values, the numeric difference, and a deterministic match flag.
- differences_only.xlsx contains only the rows that fail the match criteria. This is the file you hand to a human reviewer to resolve exceptions.
How to adapt this to multiple spreadsheet formats
In real labs, different experiments produce different spreadsheet layouts. The scalable way is to keep the deterministic core the same, and externalize only the mapping rules. In simple terms, this means the comparison logic (how values are checked and flagged) never changes, while only a small set of instructions tells the script where to find the right columns in each different spreadsheet.
For example, one experiment’s spreadsheet might label the result column as Result_mg_L, while another uses Reported_Value. Instead of changing the comparison code, you simply tell the script which column name corresponds to the canonical field Result for that experiment.
To make this easier for chemists, an AI tool (such as Copilot) can assist by suggesting these column mappings after looking at the spreadsheet headers and sample rows. The chemist reviews and approves the mapping, and the deterministic Python code then applies it.
You can define a small per-experiment configuration that tells the script which columns in a technician spreadsheet correspond to the canonical fields (Sample_ID, Analyte, Result). The script then renames columns into the canonical schema before merging.
This keeps the validation logic stable and audit-friendly while allowing many formats.
When you run the Python script, if you hit an error, copy the full traceback (the complete Python error message showing where and why the code failed) from your terminal (e.g. Command Prompt) and paste it into the Chat in VS Code using Agent mode, including the command you ran (for example, python raw_to_spreadsheet_audit.py). VS code can pinpoint the failing line, suggest a fix, and you can review and apply the patch after you approve it.
Appendix: Downloadable Example Files
Note: To run this example, you only need three inputs: raw_extracted_text.txt, technician_spreadsheet.xlsx, and raw_to_spreadsheet_audit.py. The remaining files are generated outputs (transcribed spreadsheet and comparison reports) provided for reference.
- raw_extracted_text.txt — raw input file extracted from OCR / searchable PDF
- technician_spreadsheet.xlsx — example technician-prepared spreadsheet
- raw_to_spreadsheet_audit.py — complete Python script for transcription and deterministic comparison
- transcribed_from_raw.xlsx — spreadsheet deterministically transcribed from the raw file
- comparison_report.xlsx — full row-by-row comparison report
- differences_only.xlsx — filtered report showing only mismatches (for human review)
