Unzipping the File

In [217]:
import zipfile
import os

zip_path = "/Users/ajaypunia/Desktop/LLM_Extraction_Annotator-main/user_data_sample.zip"
extract_dir = 'unzipped_data'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print("Files extracted to:", extract_dir)

Files extracted to: unzipped_data


Checking the Extracted Files

In [219]:
for root, dirs, files in os.walk(extract_dir):
    print(f"\n Folder: {root}")
    for file in files:
        print(f" File: {file}")


 Folder: unzipped_data
 File: ChartReviewRawText.csv

 Folder: unzipped_data/__MACOSX
 File: ._ChartReviewRawText.csv
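The listing above shows a `__MACOSX` folder with a `._` file next to the real CSV. These are macOS metadata entries, not data. A minimal sketch of an extraction step that skips them follows; the helper name `extract_without_macos_metadata` and the demo archive are hypothetical and are not part of the original notebook.

```python
import os
import tempfile
import zipfile


def extract_without_macos_metadata(zip_path, extract_dir):
    """Extract a zip archive, skipping __MACOSX folders and ._ metadata files."""
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for member in zip_ref.namelist():
            parts = member.split("/")
            is_metadata = "__MACOSX" in parts or os.path.basename(member).startswith("._")
            if is_metadata:
                continue
            zip_ref.extract(member, extract_dir)
            extracted.append(member)
    return extracted


# Demo on a throwaway archive (file contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("ChartReviewRawText.csv", "MRN,Date\n1,12/2/21\n")
        zf.writestr("__MACOSX/._ChartReviewRawText.csv", "metadata")
    kept = extract_without_macos_metadata(demo_zip, os.path.join(tmp, "out"))

print(kept)  # ['ChartReviewRawText.csv']
```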


In [220]:
import pandas as pd

csv_path = os.path.join(extract_dir, "ChartReviewRawText.csv")
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,MRN,Date,Endoscopy Report,Pathology Report,A_Soroush,J_Jiang,S_Jaladakani,JY_Yoon,C_Wang
0,1,12/2/21,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...,True,,,,
1,2,7/12/21,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...,True,,,,
2,3,6/29/22,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...,True,True,,,True
3,4,1/23/22,Endoscopy report text .Endoscopy report text ....,,True,True,,,True
4,4,3/6/22,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...,True,True,,,True


## Filter Labeled Reports by A_Soroush

Here we select only the rows where Dr. A_Soroush has marked the report as labeled (`True`).
Then we show only the key columns: MRN, Date, Endoscopy Report, and Pathology Report.

In [222]:
filtered = df[df['A_Soroush'] == True]
filtered[['MRN', 'Date', 'Endoscopy Report', 'Pathology Report']].head()

Unnamed: 0,MRN,Date,Endoscopy Report,Pathology Report
0,1,12/2/21,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...
1,2,7/12/21,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...
2,3,6/29/22,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...
3,4,1/23/22,Endoscopy report text .Endoscopy report text ....,
4,4,3/6/22,Endoscopy report text .Endoscopy report text ....,Pathology report text. Pathology report text. ...
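The same comparison can be applied to every annotator column at once to see how many reports each person has labeled. This is a small sketch on a made-up three-row frame. The column names follow the CSV above, but the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame: True means "labeled", None means "not labeled"
demo = pd.DataFrame({
    "MRN": [1, 2, 3],
    "A_Soroush": [True, True, None],
    "J_Jiang": [None, True, None],
})

annotators = ["A_Soroush", "J_Jiang"]

# .eq(True) treats missing values as "not labeled", so no fillna step is needed
labeled = demo[annotators].eq(True).sum()
labeled_counts = {name: int(count) for name, count in labeled.items()}

print(labeled_counts)  # {'A_Soroush': 2, 'J_Jiang': 1}
```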


## Define Pydantic Schema

This is the data structure (schema) we use to validate each report.

- `MRN`: patient ID (must be an integer)
- `Date`: the date of the procedure (string for now)
- `Endoscopy_Report`: text from the endoscopy report
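- `Pathology_Report`: text from the pathology report (may be `None`; in Pydantic v2 the field is still required unless it is given a default, and an empty CSV cell arrives from pandas as `NaN`, not `None`)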

We use `Field(alias=...)` so the field names match the column names in our CSV file.
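As a small illustration of how an alias behaves in Pydantic v2, the sketch below uses a hypothetical one-field model (`AliasDemo`, not part of the notebook). Input is matched on the alias, the attribute is read by its Python name, and `model_dump(by_alias=True)` writes the alias back out.

```python
from pydantic import BaseModel, Field


class AliasDemo(BaseModel):
    # Python attribute name on the left, CSV column name as the alias
    endoscopy_report: str = Field(alias="Endoscopy Report")


record = {"Endoscopy Report": "Normal colon."}  # made-up example value
demo = AliasDemo(**record)

print(demo.endoscopy_report)            # Normal colon.
print(demo.model_dump(by_alias=True))   # {'Endoscopy Report': 'Normal colon.'}
```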

In [224]:
from pydantic import BaseModel, Field
from typing import Optional

class ColonoscopyReport(BaseModel):
    MRN: int
    Date: str
    Endoscopy_Report: str = Field(alias="Endoscopy Report")
    Pathology_Report: Optional[str] = Field(alias="Pathology Report")
    

Validating One Row

In [226]:
row = filtered.iloc[0]
report = ColonoscopyReport(**row.to_dict())
print(report)

MRN=1 Date='12/2/21' Endoscopy_Report='Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report text .Endoscopy report t

Validate All Rows and Catch Errors
Here we loop through all the filtered rows and check if they match our `ColonoscopyReport` schema.

- If a row is valid, we add it to the `valid_reports` list.
- If it's missing fields or has issues, we catch the error and save it in `invalid_rows`.

At the end, we print how many reports passed and failed, and show a few sample errors.

In [228]:
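from pydantic import ValidationError

valid_reports = []
invalid_rows = []

for i, row in filtered.iterrows():
    try:
        report = ColonoscopyReport(**row.to_dict())
        valid_reports.append(report)
    except ValidationError as e:
        # Keep the row index with the message so failures can be traced back
        invalid_rows.append((i, str(e)))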

print(f" Valid reports: {len(valid_reports)}")
print(f" Invalid reports: {len(invalid_rows)}")

# Show sample errors 
for idx, error in invalid_rows[:3]:
    print(f"\nRow {idx} failed validation:\n{error}")
    

 Valid reports: 92
 Invalid reports: 40

Row 3 failed validation:
1 validation error for ColonoscopyReport
Pathology Report
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type

Row 5 failed validation:
1 validation error for ColonoscopyReport
Endoscopy Report
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type

Row 9 failed validation:
1 validation error for ColonoscopyReport
Pathology Report
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type
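All three sample errors have the same cause: pandas represents an empty cell as `NaN` (a float), which is not a valid string. One possible way to handle this, sketched below under assumptions, is to convert `NaN` to `None` before validation and give the pathology field a default. The model name `ColonoscopyReportLenient` and the helper `nan_to_none` are hypothetical. Whether a missing pathology report should count as valid is a project decision that this notebook does not state.

```python
import math
from typing import Optional

from pydantic import BaseModel, Field


class ColonoscopyReportLenient(BaseModel):
    MRN: int
    Date: str
    Endoscopy_Report: str = Field(alias="Endoscopy Report")
    # default=None makes the field truly optional (assumption: missing pathology is acceptable)
    Pathology_Report: Optional[str] = Field(default=None, alias="Pathology Report")


def nan_to_none(record: dict) -> dict:
    """Replace float NaN values (pandas' marker for empty cells) with None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Made-up row shaped like row 3 above: pathology text is missing
row = {
    "MRN": 4,
    "Date": "1/23/22",
    "Endoscopy Report": "Endoscopy report text.",
    "Pathology Report": float("nan"),
}

report = ColonoscopyReportLenient(**nan_to_none(row))
print(report.Pathology_Report)  # None
```

A row with a missing endoscopy report would still fail here, because `Endoscopy_Report` remains a required string.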


In [229]:
!pip install panel



Importing Libraries and Activating Panel

In [231]:
import panel as pn
import json

pn.extension()

Create the Form Widgets Using Panel

We create form fields that the user can fill out:

- `Procedure Indication`: a text box
- `Bowel Prep`: a dropdown with options ("Good", "Fair", "Poor")
- `Number of Polyps`: a number input starting at 0

We also add an **Export** button that the user will click to save the form data.

In [233]:
indication = pn.widgets.TextInput(name="Procedure Indication")
prep_quality = pn.widgets.Select(name="Bowel Prep", options=["Good", "Fair", "Poor"])
num_polyps = pn.widgets.IntInput(name="Number of Polyps", value=0, start=0)

export_button = pn.widgets.Button(name="Export to JSON", button_type="primary")

Define What Happens When the Button Is Clicked

In [235]:
def export_callback(event):
    # Collect values from all input fields
    form_data = {
        "Procedure Indication": indication.value,
        "Bowel Prep": prep_quality.value,
        "Number of Polyps": num_polyps.value
    }

    # Save it as a JSON file
    with open("annotation_output.json", "w") as f:
        json.dump(form_data, f, indent=2)

    print("Data exported to 'annotation_output.json'")
    

Linking the Function to the Button

In [237]:
export_button.on_click(export_callback)

Watcher(inst=Button(button_type='primary', name='Export to JSON'), cls=<class 'panel.widgets.button.Button'>, fn=<function export_callback at 0x15465de40>, mode='args', onlychanged=False, parameter_names=('clicks',), what='value', queued=False, precedence=0)
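The export logic can also be checked without clicking anything by moving the file writing into a plain function and reading the file back. This is only a sketch: `save_annotation`, `load_annotation`, and the sample values are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile


def save_annotation(form_data, path):
    """Write the collected form values to a JSON file."""
    with open(path, "w") as f:
        json.dump(form_data, f, indent=2)


def load_annotation(path):
    """Read a previously exported annotation back into a dict."""
    with open(path) as f:
        return json.load(f)


# Made-up values standing in for what the widgets would hold
sample = {
    "Procedure Indication": "Screening",
    "Bowel Prep": "Good",
    "Number of Polyps": 2,
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "annotation_output.json")
    save_annotation(sample, out_path)
    restored = load_annotation(out_path)

print(restored == sample)  # True
```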

Mini Manual Form (For My Learning Only)

This form was created manually using `TextInput`, `Select`, and `IntInput` widgets.

I built this first to understand how `Panel` works and how user input is captured.

This helped me before moving to the dynamic (schema-driven) version of the form.

In [239]:
form_ui = pn.Column(
    "# Mini Annotation Form",
    indication,
    prep_quality,
    num_polyps,
    export_button
)
form_ui

# Schema-Driven Annotation Form (LLM Data Project)

This notebook demonstrates a small prototype of a schema-driven annotation form built using **Pydantic** and **Panel**.

Instead of hardcoding form fields, the UI is generated automatically from a Pydantic schema. The form captures user inputs and exports them to a structured `.json` file, including a `schema_version`.

Importing Libraries

- `Literal` is used to define fixed dropdown options in our schema (like "Good", "Fair", "Poor").
- `panel` is used to build the form UI.

In [242]:
from typing import Literal
import panel as pn
import json

pn.extension()

Create the Pydantic Schema

In [244]:
class ColonoscopyForm(BaseModel):
    indication: str = Field(title="Procedure Indication")
    bowel_prep: Literal["Good", "Fair", "Poor"] = Field(title="Bowel Prep")
    num_polyps: int = Field(title="Number of Polyps", ge=0)

Generate the Form Automatically from the Schema

This function takes a Pydantic schema and creates the form fields automatically using Panel.

- It loops through each field in the schema.
- Based on the field type, it creates:
  - `TextInput` for strings
  - `IntInput` for integers
  - `Select` (dropdown) for fixed choices (using `Literal`)
- It uses the field’s title (from the schema) as the label in the UI.

The function returns a dictionary of all the widgets.
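The type-to-widget mapping described above can be sketched with the standard library alone, which makes the dispatch logic easy to inspect. The class `FormSketch` and the function `describe_widgets` are hypothetical stand-ins. Instead of Panel widgets they return the name of the widget that would be created and, for `Literal` fields, its options.

```python
from typing import Literal, get_args, get_origin, get_type_hints


class FormSketch:
    # Plain annotations standing in for the Pydantic schema fields
    indication: str
    bowel_prep: Literal["Good", "Fair", "Poor"]
    num_polyps: int


def describe_widgets(schema_class):
    """Map each annotated field to (widget kind, options) based on its type."""
    plan = {}
    for name, annotation in get_type_hints(schema_class).items():
        if annotation is str:
            plan[name] = ("TextInput", None)
        elif annotation is int:
            plan[name] = ("IntInput", None)
        elif get_origin(annotation) is Literal:
            # get_args returns the allowed values of the Literal
            plan[name] = ("Select", list(get_args(annotation)))
    return plan


plan = describe_widgets(FormSketch)
print(plan["bowel_prep"])  # ('Select', ['Good', 'Fair', 'Poor'])
```

Checking `get_origin(annotation) is Literal` is stricter than testing for an `__args__` attribute, which other generic types such as `Optional[str]` also have.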

In [246]:
from typing import Literal, get_args, get_origin


def generate_form(schema_class):
    widgets = {}

    for field_name, field_info in schema_class.model_fields.items():
        field_type = field_info.annotation
        # Field(title=...) is stored on field_info.title; fall back to the field name
        field_title = field_info.title or field_name

        if field_type is str:
            widgets[field_name] = pn.widgets.TextInput(name=field_title)
        elif field_type is int:
            widgets[field_name] = pn.widgets.IntInput(name=field_title, start=0)
        elif get_origin(field_type) is Literal:  # fixed choices become a dropdown
            options = list(get_args(field_type))
            widgets[field_name] = pn.widgets.Select(name=field_title, options=options)

    return widgets

Build and Display the Schema-Driven Form

- We use `generate_form()` to create all the widgets from the `ColonoscopyForm` schema.
- A submit button labeled "Export to JSON" is added.
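- When the button is clicked, the `on_submit()` function collects the input values, adds the `schema_version`, and saves them as a `.json` file. The file name comes from the "File Name" box and defaults to `annotation_output.json` when the box is left empty.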
- The form is displayed using `pn.Column()` with all the widgets stacked vertically.

This is the final form that the user fills out and submits.

In [276]:
widgets = generate_form(ColonoscopyForm)
filename_input = pn.widgets.TextInput(name="File Name", placeholder="Enter file name (without .json)")
submit_btn = pn.widgets.Button(name="Export to JSON", button_type="primary")
SCHEMA_VERSION = "v1"


def on_submit(event):
    data = {k: v.value for k, v in widgets.items()}
    data["schema_version"] = SCHEMA_VERSION

    # Get filename from input
    file_name = filename_input.value.strip() or "annotation_output"
    file_path = f"{file_name}.json"

    with open(file_path, "w") as f:
        json.dump(data, f, indent=2)

    print(f"Data saved to '{file_path}'")

submit_btn.on_click(on_submit)

form_ui = pn.Column(
    "# Schema-Driven Colonoscopy Form",
    *widgets.values(),
    filename_input,
    submit_btn
)

form_ui
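The export step above writes the widget values as they are, without running them through the schema. A possible refinement, sketched here under assumptions, is to validate the values against `ColonoscopyForm` first, so that constraints such as `ge=0` are enforced before anything is saved. The helper `build_export_payload` and the sample inputs are hypothetical. The schema is repeated only to keep the example self-contained.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class ColonoscopyForm(BaseModel):
    indication: str = Field(title="Procedure Indication")
    bowel_prep: Literal["Good", "Fair", "Poor"] = Field(title="Bowel Prep")
    num_polyps: int = Field(title="Number of Polyps", ge=0)


SCHEMA_VERSION = "v1"


def build_export_payload(raw_values):
    """Validate raw form values against the schema, then add the schema version."""
    form = ColonoscopyForm(**raw_values)  # raises ValidationError on bad input
    payload = form.model_dump()
    payload["schema_version"] = SCHEMA_VERSION
    return payload


# Made-up inputs standing in for {k: v.value for k, v in widgets.items()}
good = build_export_payload(
    {"indication": "Screening", "bowel_prep": "Good", "num_polyps": 1}
)
print(good["schema_version"])  # v1

try:
    build_export_payload(
        {"indication": "Screening", "bowel_prep": "Good", "num_polyps": -1}
    )
    rejected = False
except ValidationError:
    rejected = True

print(rejected)  # True
```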