[Publish] From Raw Data to Audit-Ready Spreadsheets: An AI-Assisted, Deterministic Python Workflow

### Post title

From Raw Data to Audit-Ready Spreadsheets: An AI-Assisted, Deterministic Python Workflow

### Category

Learning Resources

### Summary

This post shows a complete, runnable Python workflow that deterministically transcribes structured values from a raw text file into Excel and compares them against a technician spreadsheet to produce audit-ready mismatch reports.

### Body HTML

<div style="max-width: 820px; margin: 0 auto; padding: 0 12px; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Arial, sans-serif; line-height: 1.65; color: #111;">

<p style="margin: 1.25rem 0 1.5rem;"><img 
  src="https://raw.githubusercontent.com/Dryqu/ChemBioAI/8ea59656578efaa9ad71a62c927d3f763f9a31e7/assets/images/From-Raw-Data-to-Audit-Ready-Spreadsheets.png" 
  alt="From Raw Data to Audit-Ready Spreadsheets"
  loading="lazy"
  style="max-width: 100%; width: 720px; height: auto; border-radius: 8px;"
/></p>
<p style="margin: 0 0 1rem;">What happens if a single number is silently altered while transcribing raw experimental data into a spreadsheet—and that spreadsheet later becomes part of a pharmaceutical regulatory submission package?</p>
<p style="margin: 0 0 1rem;">This post shares a complete, runnable example that transcribes structured values from a raw text file (for example, text extracted by OCR or from a searchable PDF) into an Excel spreadsheet, then compares the result against a technician’s spreadsheet using fully deterministic, audit-ready rules.</p>
<p style="margin: 0 0 1rem;">AI tools can make the workflow easier (for example, by extracting text or suggesting column mappings), but using AI alone for end-to-end comparison can introduce a real risk of hallucination—confident but incorrect outputs. That’s why we rely on Python for the core transcription and comparison: it is deterministic, reproducible, and auditable. We still use AI (such as Copilot) to assist with writing and maintaining the code, while keeping the actual data checks fully rule-based—critical when results may be reviewed by QA or regulators.</p>
<h2 style="font-size: 1.35rem; margin: 1.75rem 0 0.75rem;">Python code and how it works</h2>
<p style="margin: 0 0 1rem;">This workflow follows a simple pipeline: start from the raw file, transcribe it into a spreadsheet, then generate comparison reports against the technician sheet. The goal is to catch transcription errors early and produce a clear, audit-ready list of mismatches. The full Python script is provided in the Appendix as a downloadable file.</p>
<p style="margin: 0 0 1rem;">If you’re new to Python, it helps to know there are two layers here: one for the data logic and one for the Excel file format.</p>
<p style="margin: 0 0 1rem;">In plain terms: pandas does the deterministic table parsing, alignment, and numeric comparisons, while openpyxl is the behind-the-scenes Excel engine that lets pandas read and write .xlsx files.</p>
<h3 style="font-size: 1.1rem; margin: 1.25rem 0 0.5rem;">1) Reading and parsing the raw file</h3>
<p style="margin: 0 0 1rem;">The parser, which turns raw text into structured data, scans the raw text line by line. When it sees Sample ID:, it starts a new record. It then captures Analyte: and Result: within that record. A blank line ends the record.</p>
<p style="margin: 0 0 0.75rem;">The output of this step is a structured table with three columns:</p>
<ul style="margin: 0.25rem 0 1rem 1.25rem; padding: 0;">
  <li style="margin: 0.35rem 0;">Sample_ID</li>
  <li style="margin: 0.35rem 0;">Analyte</li>
  <li style="margin: 0.35rem 0;">Result_mg_L</li>
</ul>
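<p style="margin: 0 0 1rem;">The parsing rules above can be sketched roughly as follows. This is a minimal illustration, not the script from the Appendix: the function name parse_raw_text and the sample text are hypothetical, and it assumes each Result: line holds a number optionally followed by a unit.</p>

```python
from typing import Dict, List


def parse_raw_text(text: str) -> List[Dict[str, object]]:
    """Parse 'Sample ID / Analyte / Result' blocks into a list of records."""
    records: List[Dict[str, object]] = []
    current: Dict[str, object] = {}

    def flush() -> None:
        # Store the record in progress (if any) and start a fresh one.
        if current:
            records.append(dict(current))
            current.clear()

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            flush()  # a blank line ends the record
            continue
        if line.startswith("Sample ID:"):
            flush()  # a new Sample ID always starts a new record
            current["Sample_ID"] = line.split(":", 1)[1].strip()
        elif line.startswith("Analyte:"):
            current["Analyte"] = line.split(":", 1)[1].strip()
        elif line.startswith("Result:"):
            tokens = line.split(":", 1)[1].split()
            # Assumption: the number is the first token, optionally followed by a unit.
            # float() raises ValueError on a non-numeric value, so bad input fails loudly.
            current["Result_mg_L"] = float(tokens[0]) if tokens else None
    flush()  # the last record may not be followed by a blank line
    return records


# Illustrative input only; the real raw file may differ in detail.
SAMPLE = """Sample ID: S-001
Analyte: Lead
Result: 0.012 mg/L

Sample ID: S-002
Analyte: Copper
Result: 1.3 mg/L
"""

records = parse_raw_text(SAMPLE)
```

<p style="margin: 0 0 1rem;">Passing the resulting list to pandas.DataFrame yields the three-column table. Letting float() raise on a malformed value is a deliberate choice in this sketch: a parser that guesses is harder to audit than one that stops.</p>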
<h3 style="font-size: 1.1rem; margin: 1.25rem 0 0.5rem;">2) Writing the spreadsheet</h3>
<p style="margin: 0 0 1rem;">The parsed table is written to Excel using to_excel(). This is your deterministic “raw → spreadsheet” transcription step.</p>
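<p style="margin: 0 0 1rem;">As a rough sketch of this step (the helper names build_table and write_transcription are hypothetical, and the Appendix script may be organized differently), fixing the column order before writing keeps the output layout identical from run to run:</p>

```python
import pandas as pd

COLUMNS = ["Sample_ID", "Analyte", "Result_mg_L"]


def build_table(records):
    """Build the table with a fixed column order, whatever the key order in each record."""
    return pd.DataFrame(records, columns=COLUMNS)


def write_transcription(df, path="transcribed_from_raw.xlsx"):
    # index=False keeps the pandas row index out of the sheet.
    # engine="openpyxl" names the .xlsx writer explicitly (requires openpyxl to be installed).
    df.to_excel(path, index=False, engine="openpyxl")


# Illustrative record with keys in a different order than the output columns.
table = build_table([{"Result_mg_L": 0.012, "Analyte": "Lead", "Sample_ID": "S-001"}])
# write_transcription(table)  # uncomment to write the file next to the script
```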
<h3 style="font-size: 1.1rem; margin: 1.25rem 0 0.5rem;">3) Comparing against the technician spreadsheet</h3>
<p style="margin: 0 0 1rem;">The comparison is done in two parts.</p>
<ul style="margin: 0.25rem 0 1rem 1.25rem; padding: 0;">
  <li style="margin: 0.35rem 0;">First, merge() aligns rows using the shared keys (Sample_ID, Analyte). This creates a single table where values from both sources sit side-by-side.</li>
  <li style="margin: 0.35rem 0;">Second, the code computes a numeric difference and checks whether it falls within an allowed tolerance. Every mismatch is flagged and exported.</li>
</ul>
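<p style="margin: 0 0 1rem;">The two parts above might look like the following in pandas. This is a sketch under stated assumptions: the tolerance value, the Abs_Diff and Match column names, and the sample data are all hypothetical, and the Appendix script may use different names or a different tolerance rule.</p>

```python
import pandas as pd

KEYS = ["Sample_ID", "Analyte"]
TOLERANCE = 0.001  # hypothetical absolute tolerance, in mg/L


def compare(raw_df, tech_df, tol=TOLERANCE):
    """Align both tables on the shared keys and flag rows outside the tolerance."""
    merged = raw_df.merge(
        tech_df,
        on=KEYS,
        how="outer",                # keep rows that exist on only one side
        suffixes=("_raw", "_tech"),
        indicator=True,             # adds the _merge column
    )
    merged["Abs_Diff"] = (merged["Result_mg_L_raw"] - merged["Result_mg_L_tech"]).abs()
    # A row matches only if it exists on both sides and the difference is within tolerance.
    # Rows missing on one side have a NaN difference, which compares as False.
    merged["Match"] = (merged["_merge"] == "both") & (merged["Abs_Diff"] <= tol)
    return merged


# Illustrative data only.
raw = pd.DataFrame({
    "Sample_ID": ["S-001", "S-002", "S-003"],
    "Analyte": ["Lead", "Copper", "Zinc"],
    "Result_mg_L": [0.012, 1.3, 0.5],
})
tech = pd.DataFrame({
    "Sample_ID": ["S-001", "S-002"],
    "Analyte": ["Lead", "Copper"],
    "Result_mg_L": [0.012, 1.8],
})

report = compare(raw, tech)
differences = report[~report["Match"]]  # the rows a human reviewer needs to see
```

<p style="margin: 0 0 1rem;">In this sketch, report corresponds to what would be saved as comparison_report.xlsx and differences to differences_only.xlsx.</p>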

<h3 style="font-size: 1.1rem; margin: 1.25rem 0 0.5rem;">4) Why _merge matters</h3>
<p style="margin: 0 0 0.75rem;">The _merge column tells you whether a row exists in both files or only one side.</p>
<ul style="margin: 0.25rem 0 1rem 1.25rem; padding: 0;">
  <li style="margin: 0.35rem 0;">both: both sources contain the same key</li>
  <li style="margin: 0.35rem 0;">left_only: present in raw-transcribed but missing in technician sheet</li>
  <li style="margin: 0.35rem 0;">right_only: present in technician sheet but missing in raw-transcribed</li>
</ul>
<p style="margin: 0 0 1rem;">This makes missing or extra entries auditable.</p>
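<p style="margin: 0 0 1rem;">A small illustration of how the three _merge values can be separated, using made-up keys. Note that pandas only adds _merge when indicator=True is passed, and the left_only and right_only rows only survive with how="outer"; the default inner join would silently drop them.</p>

```python
import pandas as pd

# Illustrative keys only: S-003 exists only in the raw side, S-004 only in the technician side.
raw_df = pd.DataFrame({"Sample_ID": ["S-001", "S-003"], "Analyte": ["Lead", "Zinc"]})
tech_df = pd.DataFrame({"Sample_ID": ["S-001", "S-004"], "Analyte": ["Lead", "Iron"]})

merged = raw_df.merge(tech_df, on=["Sample_ID", "Analyte"], how="outer", indicator=True)

# Rows the technician sheet is missing, and rows it has that the raw file does not.
missing_in_tech = merged[merged["_merge"] == "left_only"]
extra_in_tech = merged[merged["_merge"] == "right_only"]

# A per-category count is a quick summary for the audit trail.
counts = merged["_merge"].value_counts().to_dict()
```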
<h2 style="font-size: 1.35rem; margin: 1.75rem 0 0.75rem;">How to run the files</h2>
<p style="margin: 0 0 1rem;">To run this example end-to-end, you need Python installed on your computer, plus the two packages installed in the second step below.</p>
<ul style="margin: 0.25rem 0 1rem 1.25rem; padding: 0;">
  <li style="margin: 0.35rem 0;">First, download all files listed in the Appendix and place them in the same folder on your machine. Make sure the Python script (raw_to_spreadsheet_audit.py) sits in the same directory as the raw text file and the technician spreadsheet.</li>
  <li style="margin: 0.35rem 0;">Next, install the required Python packages by running pip install pandas openpyxl in your terminal or command prompt.</li>
  <li style="margin: 0.35rem 0;">Finally, run the script with python raw_to_spreadsheet_audit.py. The script will generate the transcribed spreadsheet and the comparison reports automatically in the same folder.</li>
</ul>
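<p style="margin: 0 0 1rem;">If the script fails immediately, a missing input file or package is a likely cause. The optional check below is a hypothetical helper, not part of the Appendix script; it only confirms that the three inputs and two packages named in this post are present.</p>

```python
import importlib.util
from pathlib import Path

REQUIRED_FILES = [
    "raw_extracted_text.txt",
    "technician_spreadsheet.xlsx",
    "raw_to_spreadsheet_audit.py",
]
REQUIRED_PACKAGES = ["pandas", "openpyxl"]


def preflight(folder="."):
    """Return a list of problems found; an empty list means the folder looks ready."""
    problems = []
    base = Path(folder)
    for name in REQUIRED_FILES:
        if not (base / name).is_file():
            problems.append(f"missing file: {name}")
    for pkg in REQUIRED_PACKAGES:
        # find_spec returns None when a top-level package cannot be imported.
        if importlib.util.find_spec(pkg) is None:
            problems.append(f"missing package: {pkg} (try: pip install {pkg})")
    return problems


for problem in preflight():
    print(problem)
```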

<h2 style="font-size: 1.35rem; margin: 1.75rem 0 0.75rem;">What you should expect in the results</h2>
<ul style="margin: 0.25rem 0 1rem 1.25rem; padding: 0;">
<li style="margin: 0.25rem 0;">comparison_report.xlsx contains the aligned values, the numeric difference, and a deterministic match flag.</li>
<li style="margin: 0.25rem 0;">differences_only.xlsx contains only the rows that fail the match criteria. This is the file you hand to a human reviewer to resolve exceptions.</li>
</ul>
<h2 style="font-size: 1.35rem; margin: 1.75rem 0 0.75rem;">How to adapt this to multiple spreadsheet formats</h2>
<p style="margin: 0 0 1rem;">In real labs, different experiments produce different spreadsheet layouts. The scalable way is to keep the deterministic core the same, and externalize only the mapping rules. In simple terms, this means the comparison logic (how values are checked and flagged) never changes, while only a small set of instructions tells the script where to find the right columns in each different spreadsheet.</p>
<p style="margin: 0 0 1rem;">For example, one experiment’s spreadsheet might label the result column as Result_mg_L, while another uses Reported_Value. Instead of changing the comparison code, you simply tell the script which column name corresponds to the canonical field Result for that experiment.</p>
<p style="margin: 0 0 1rem;">To make this easier for chemists, an AI tool (such as Copilot) can assist by suggesting these column mappings after looking at the spreadsheet headers and sample rows. The chemist reviews and approves the mapping, and the deterministic Python code then applies it.</p>
<p style="margin: 0 0 1rem;">You can define a small per-experiment configuration that tells the script which columns in a technician spreadsheet correspond to the canonical fields (Sample_ID, Analyte, Result). The script then renames columns into the canonical schema before merging.</p>
<p style="margin: 0 0 1rem;">This keeps the validation logic stable and audit-friendly while allowing many formats.</p>
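<p style="margin: 0 0 1rem;">One possible shape for such a configuration is a plain dictionary per experiment, sketched below. The experiment names, the Sample and Compound headers, and the to_canonical helper are hypothetical; Result_mg_L and Reported_Value are the two example headers mentioned above.</p>

```python
import pandas as pd

CANONICAL = ["Sample_ID", "Analyte", "Result"]

# Hypothetical per-experiment mappings: spreadsheet column name -> canonical field.
MAPPINGS = {
    "experiment_a": {"Sample_ID": "Sample_ID", "Analyte": "Analyte", "Result_mg_L": "Result"},
    "experiment_b": {"Sample": "Sample_ID", "Compound": "Analyte", "Reported_Value": "Result"},
}


def to_canonical(df, mapping):
    """Rename mapped columns into the canonical schema and drop everything else."""
    if sorted(mapping.values()) != sorted(CANONICAL):
        raise ValueError(f"mapping must cover exactly these fields: {CANONICAL}")
    missing = [src for src in mapping if src not in df.columns]
    if missing:
        # Fail loudly instead of comparing against the wrong column.
        raise KeyError(f"columns not found in spreadsheet: {missing}")
    return df.rename(columns=mapping)[CANONICAL]


# Illustrative technician sheet with its own headers and an extra column.
tech_b = pd.DataFrame({
    "Sample": ["S-001"],
    "Compound": ["Lead"],
    "Reported_Value": [0.012],
    "Notes": ["ok"],
})
canonical_b = to_canonical(tech_b, MAPPINGS["experiment_b"])
```

<p style="margin: 0 0 1rem;">Because the mapping is data rather than code, a mapping suggested by an AI tool can be reviewed, approved, and stored alongside the results, while the comparison code itself stays untouched.</p>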
<p style="margin: 0 0 1rem;">If the script raises an error when you run it, copy the full traceback (the complete Python error message showing where and why the code failed) from your terminal (e.g. Command Prompt) and paste it into Chat in VS Code in Agent mode, along with the command you ran (for example, python raw_to_spreadsheet_audit.py). The chat agent can help pinpoint the failing line and suggest a fix, which you review and approve before the patch is applied.</p>
<h2 style="font-size: 1.35rem; margin: 1.75rem 0 0.75rem;">Appendix: Downloadable Example Files</h2>
<p style="margin: 0 0 1rem;">Note: To run this example, you only need three inputs: raw_extracted_text.txt, technician_spreadsheet.xlsx, and raw_to_spreadsheet_audit.py. The remaining files are generated outputs (transcribed spreadsheet and comparison reports) provided for reference.</p>
<ul style="margin: 0.25rem 0 1rem 1.25rem; padding: 0;">
<li style="margin: 0.25rem 0;"><a href="https://drive.google.com/file/d/1V_68jdgRB_a50UAgcuvIOCmaf9KEJZq-/view" target="_blank" rel="noopener noreferrer">raw_extracted_text.txt</a>: raw input file extracted from OCR / searchable PDF</li>
<li style="margin: 0.25rem 0;"><a href="https://docs.google.com/spreadsheets/d/1zWLorwbblCmUXSwu3iuB6BZCeKDaW55x/edit?usp=drive_link&amp;ouid=117156502044483055461&amp;rtpof=true&amp;sd=true" target="_blank" rel="noopener noreferrer">technician_spreadsheet.xlsx</a>: example technician-prepared spreadsheet</li>
<li style="margin: 0.25rem 0;"><a href="https://drive.google.com/file/d/1lO-_DG4FZTU4tqW4JC8FOq6BdpWndWD3/view" target="_blank" rel="noopener noreferrer">raw_to_spreadsheet_audit.py</a>: complete Python script for transcription and deterministic comparison</li>
<li style="margin: 0.25rem 0;"><a href="https://docs.google.com/spreadsheets/d/17DCosDVWvnnaQnh19TboufpmXrqoTzT_/edit?usp=drive_link&amp;ouid=117156502044483055461&amp;rtpof=true&amp;sd=true" target="_blank" rel="noopener noreferrer">transcribed_from_raw.xlsx</a>: spreadsheet deterministically transcribed from the raw file</li>
<li style="margin: 0.25rem 0;"><a href="https://docs.google.com/spreadsheets/d/1943xvMfE4bSvw3sPwluF1Vw57SYOq4cw/edit?usp=drive_link&amp;ouid=117156502044483055461&amp;rtpof=true&amp;sd=true" target="_blank" rel="noopener noreferrer">comparison_report.xlsx</a>: full row-by-row comparison report</li>
<li style="margin: 0.25rem 0;"><a href="https://docs.google.com/spreadsheets/d/1U_ZeNDbsHe-vWYB4XC7iTeiWF-Ne78AB/edit?usp=drive_link&amp;ouid=117156502044483055461&amp;rtpof=true&amp;sd=true" target="_blank" rel="noopener noreferrer">differences_only.xlsx</a>: filtered report showing only mismatches (for human review)</li>
</ul>
</div>

