# Documentation assistant

This notebook demonstrates a documentation assistant that converts videos to documentation using Vertex AI.

Converting videos to documentation involves three steps:
1. Protocol finder: select the protocol that best captures the procedure performed in the video
2. Video comparison against the ground-truth protocol → lab documentation + errors in the procedure
3. Analytics on a benchmark dataset: automatic comparison of the errors found by the documentation assistant with the actual errors

In this notebook, I will focus on steps two and three: comparing the video with the protocol and evaluating the result against the benchmark.

In [165]:
from __future__ import annotations

# %load_ext autoreload
%reload_ext autoreload
%autoreload 2

import configparser
import os
import sys
from pathlib import Path
import json
import pandas as pd
import pprint


from IPython.display import Markdown

path_to_append = Path(Path.cwd()).parent / "proteomics_specialist"
sys.path.append(str(path_to_append))
import video_to_protocol

config = configparser.ConfigParser()
config.read("../secrets.ini")

['../secrets.ini']

In [3]:
import vertexai

config = configparser.ConfigParser()
config.read("../secrets.ini")

PROJECT_ID = config["DEFAULT"]["PROJECT_ID"]
vertexai.init(project=PROJECT_ID, location="europe-west9")  # europe-west9 is Paris

In [4]:
from google.cloud import storage

os.environ["GOOGLE_CLOUD_PROJECT"] = config["DEFAULT"]["PROJECT_ID"]

# Initialize Cloud Storage client
storage_client = storage.Client()
bucket_name = "mannlab_videos"
bucket = storage_client.bucket(bucket_name)

In [155]:
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
from vertexai.generative_models import GenerativeModel, GenerationConfig
from typing import TYPE_CHECKING, Any, NamedTuple

def generate_content_from_model(
    inputs: Any,
    model_name: str = "gemini-2.0-flash",
    temperature: float = 0.9,
) -> tuple:
    """Generate content using Google's Generative AI model.
    
    This function sends inputs to a specified Gemini model and returns the 
    generated response along with usage metadata.
    
    Parameters
    ----------
    inputs : Any
        The inputs to send to the model (text, images, or videos).
    model_name : str, default="gemini-2.0-flash"
        Name of the generative model to use.
    temperature : float, default=0.9
        Controls the randomness of the output. Higher values (closer to 1.0)
        make output more random, lower values make it more deterministic.
        
    Returns
    -------
    tuple
        A tuple containing (response_text, usage_metadata)
        
    Raises
    ------
    ValueError
        If the model fails to generate content.
    """
    try:
        model = GenerativeModel(model_name)
        
        generation_config = GenerationConfig(
            temperature=temperature,
            # Uncomment if using single audio/video input
            # audio_timestamp=True
        )
        
        response = model.generate_content(
            inputs,
            generation_config=generation_config
        )
        documentation = response.text
        usage_metadata = response.usage_metadata
        
    except Exception as e:
        logger.exception("Error during content generation")
        # Chain the original exception so the traceback keeps the root cause
        raise ValueError(f"Failed to generate content: {e}") from e
    
    return documentation, usage_metadata
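
Calls to a hosted model can fail transiently (quota limits, network hiccups). Below is a minimal sketch of wrapping any callable, such as `generate_content_from_model`, in retries with exponential backoff. The helper `call_with_retries` and its parameters are hypothetical and not part of this notebook or the Vertex AI SDK. It retries on `ValueError` only because that is what `generate_content_from_model` raises.

```python
import time
from typing import Any, Callable


def call_with_retries(
    func: Callable[..., Any],
    *args: Any,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    **kwargs: Any,
) -> Any:
    """Call func(*args, **kwargs), retrying on ValueError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except ValueError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A call would then look like `call_with_retries(generate_content_from_model, inputs, temperature=0.9)`. Retrying on every `ValueError` is a simplification; a stricter version might inspect the chained exception first.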

In [156]:
from vertexai.generative_models import Part

def prepare_all_inputs(
    lab_video_path: str,
    protocol_path: str,
    documentation_video_path: str,
    documentation_path: str,
    bucket: storage.Bucket,
    prefix: str = "compare_protocol_video"
) -> dict:
    """Prepare all four standard inputs for the generative model.
    
    This function uploads the four standard files (lab video, protocol document, 
    documentation video, and documentation document) and formats them as inputs 
    for a generative model.
    
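    Parameters
    ----------
    lab_video_path : str
        Path to the protocol video file. Despite the parameter name, this
        video is returned under 'protocol_video_input'.
    protocol_path : str
        Path to the protocol markdown file.
    documentation_video_path : str
        Path to the lab video that the documentation describes; returned
        under 'lab_video_input'.
    documentation_path : str
        Path to the documentation markdown file.
    bucket : google.cloud.storage.Bucket
        GCS bucket the files are uploaded to (the notebook passes the bucket
        object created above, not a bucket name).
    prefix : str, default="compare_protocol_video"
        Prefix for the files in the GCS bucket.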
        
    Returns
    -------
    dict
        A dictionary containing the four formatted inputs:
        'protocol_video_input', 'protocol_input', 'lab_video_input', 'documentation_input'
    """
    
    video_uri = video_to_protocol.upload_video_to_gcs(lab_video_path, bucket, prefix)
    protocol_video_input = [Part.from_uri(video_uri, mime_type="video/mp4")]
    
    uri = video_to_protocol.upload_video_to_gcs(protocol_path, bucket, prefix)
    protocol_input = [Part.from_uri(uri, mime_type="text/md")]
    
    video_uri = video_to_protocol.upload_video_to_gcs(documentation_video_path, bucket, prefix)
    lab_video_input = [Part.from_uri(video_uri, mime_type="video/mp4")]

    uri = video_to_protocol.upload_video_to_gcs(documentation_path, bucket, prefix)
    documentation_input = [Part.from_uri(uri, mime_type="text/md")]
    
    return {
        'protocol_video_input': protocol_video_input,
        'protocol_input': protocol_input,
        'lab_video_input': lab_video_input,
        'documentation_input': documentation_input
    }
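
Uploading large videos only to discover that one of the four paths is wrong wastes time. Below is a small sketch of a pre-flight check that could run before `prepare_all_inputs`. The helper `find_missing_files` is hypothetical and not part of `video_to_protocol`.

```python
from pathlib import Path


def find_missing_files(paths: dict[str, str]) -> list[str]:
    """Return the labels whose path does not point to an existing file."""
    return [label for label, path in paths.items() if not Path(path).is_file()]
```

Passing a dictionary such as `{"protocol_path": protocol_path, "documentation_path": documentation_path}` returns an empty list when every file exists, so the caller can raise `FileNotFoundError` early otherwise.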

In [None]:
def process_benchmark_dataset(csv_path, protocol_videos_base, documentation_videos_base, markdown_base, bucket, prefix):
    """
    Process every row of the benchmark dataset CSV and prepare model inputs.
    
    Parameters:
    -----------
    csv_path : str
        Path to the CSV file containing benchmark dataset information
    protocol_videos_base : str
        Base path to the protocol videos directory
    documentation_videos_base : str
        Base path to the documentation videos directory
    markdown_base : str
        Base path to the markdown files directory
    bucket : object
        The bucket object used in the prepare_all_inputs function
    prefix : str
        Prefix for the files in GCS bucket.
    
    Returns:
    --------
    dict
        Dictionary containing all model inputs for every row in the CSV,
        with experiment names as keys
    """
    benchmark_df = pd.read_csv(
        csv_path, 
        sep=';'
    )
    
    all_model_inputs = {}
    
    for _, row in benchmark_df.iterrows():  # for testing: benchmark_df.head(2).iterrows()
        lab_video_path = os.path.join(protocol_videos_base, row["protocol video"])
        protocol_path = os.path.join(markdown_base, row["protocol"])
        documentation_video_path = os.path.join(documentation_videos_base, row["documentation video"])
        documentation_path = os.path.join(markdown_base, row["documentation"])
        
        dict_model_inputs = prepare_all_inputs(
            lab_video_path,
            protocol_path,
            documentation_video_path,
            documentation_path,
            bucket,
            prefix
        )
        
        experiment_name = row["documentation"].split(".")[0]
        all_model_inputs[experiment_name] = dict_model_inputs
        
        print(f"Processed {experiment_name}")
        
    return all_model_inputs
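
For orientation, here is a sketch of the CSV layout that `process_benchmark_dataset` expects: semicolon-separated, with the four column names used in the loop. The file names are invented for illustration, and the standard-library `csv` module stands in for pandas to keep the sketch dependency-free. It also shows how the experiment name derived with `split(".")[0]` differs from `Path.stem` when a file name contains extra dots.

```python
import csv
import io
import os
from pathlib import Path

# Hypothetical two-row file using the column names read by process_benchmark_dataset
csv_text = (
    "protocol video;protocol;documentation video;documentation\n"
    "protocol_a.mp4;protocol_a.md;run_1.mp4;run_1_docu.md\n"
    "protocol_a.mp4;protocol_a.md;run_2.mp4;run_2.v2_docu.md\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=";"))

for row in rows:
    video_path = os.path.join("videos", row["documentation video"])
    name_split = row["documentation"].split(".")[0]  # what the notebook uses
    name_stem = Path(row["documentation"]).stem      # keeps dots inside the name
    print(video_path, name_split, name_stem)
```

If two documentation files share the text before their first dot, `split(".")[0]` would give them the same key and the later row would overwrite the earlier one in `all_model_inputs`.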

In [162]:
def generate_documentation_evaluation(documentation_input, documentation, model_name="gemini-2.0-flash", temperature=0.9):
    """
    Generate an evaluation of AI-generated documentation against benchmark documentation.
    
    Parameters:
    -----------
    documentation_input : list
        The benchmark documentation (ground truth) represented as a list of strings
    documentation : list
        The AI-generated documentation to evaluate represented as a list of strings
    model_name : str, optional
        The model to use for evaluation, default is "gemini-2.0-flash"
    temperature : float, optional
        Temperature setting for content generation, default is 0.9
        
    Returns:
    --------
    tuple
        A tuple containing (evaluation_text, usage_metadata)
    """
    inputs = [
        """
        # Instruction
        You are an expert evaluator specializing in scientific protocol documentation. Your task is to evaluate the error identification accuracy, error type classification and documentation quality of an AI-generated documentation against a benchmark documentation (ground truth). You will be provided with an AI-generated documentation and a benchmark documentation (human-verified ground truth).

        # Evaluation Parts
        ## Part 1: Error Identification Accuracy
        For each step in the protocol, determine if the AI correctly identified the presence or absence of errors by classifying into one of these categories:
        - **No Error**: Both benchmark and AI response agree there was no error
        - **Error (Correctly Identified)**: Both benchmark and AI response agree there was an error
        - **False Positive**: AI response claimed an error when the benchmark indicates none
        - **False Negative**: AI response missed an error that the benchmark shows

        ## Part 2: Error Type Classification
        For each error that was correctly identified by both the benchmark and AI response, determine if the AI correctly classified the error type:
        - **Correct Classification**: AI used the same error type as the benchmark (Omitted, Error, Deviation, Added)
        - **Incorrect Classification**: AI used a different error type than the benchmark

        ## Part 3: Documentation Quality
        Evaluate the AI's documentation quality based on these criteria:
        1. **Structure**: Did it keep only relevant sections: Aim, Materials, Procedure, Results?
        2. **Tense**: Did it use past tense to describe what actually happened, not what should happen?
        3. **Language**: Did it remove all instructional language and replace with observations?
        4. **Numbering**: Did it maintain step numbering of the original protocol even if order changed?
        5. **Timing**: Did it include exact actual timing, not estimated timing?

        # Rating Rubric
        For each part, provide an evaluation:

        ### Part 1: Error Identification Accuracy
        - Calculate and report:
            - Total number of correct identifications (No Error + Correctly Identified Error)
            - Total number of false positives
            - Total number of false negatives
            - Overall accuracy percentage (correct identifications / total steps)

        ### Part 2: Error Type Classification
        - Calculate and report:
            - Total errors correctly classified / Total errors correctly identified
            - Overall error classification accuracy percentage

        ### Part 3: Documentation Quality
        For each criterion:
        - **Excellent**: The criterion was fully met with no issues
        - **Good**: The criterion was mostly met with minor issues
        - **Poor**: The criterion was not met or had significant issues

        # Evaluation Steps
        1. Create a table for each step in the protocol showing error identification accuracy
        2. Analyze correctly identified errors to determine classification accuracy
        3. Evaluate documentation quality against the 5 criteria
        4. Provide final scores and overall assessment
        5. Highlight specific strengths and areas for improvement

        # Output Format
        ## Part 1: Error Identification Accuracy
        | Step | Benchmark | AI Response | Classification |
        |------|-----------|-------------|----------------|
        | [Step details] | [Error/No Error] | [Error/No Error] | [No Error/Error (Correctly Identified)/False Positive/False Negative] |

        **Summary Statistics:**
        - Total correct identifications: [X]/[Total Steps]
        - Total false positives: [X]
        - Total false negatives: [X]
        - Overall accuracy: [X]%

        ## Part 2: Error Classification Accuracy
        | Step | Benchmark Error Type | AI Error Type | Classification |
        |------|---------------------|---------------|----------------|
        | [Step with error] | [Error Type] | [Error Type] | [Correct/Incorrect] |

        **Summary Statistics:**
        - Total correctly classified errors: [X]/[Total Errors]
        - Error classification accuracy: [X]%

        ## Part 3: Documentation Quality
        | Criterion | Rating | Explanation |
        |-----------|--------|-------------|
        | Structure | [Excellent/Good/Poor] | [Explanation] |
        | Tense | [Excellent/Good/Poor] | [Explanation] |
        | Language | [Excellent/Good/Poor] | [Explanation] |
        | Numbering | [Excellent/Good/Poor] | [Explanation] |
        | Timing | [Excellent/Good/Poor] | [Explanation] |

        ## Overall Assessment
        [Provide a concise overall assessment of the AI documentation's quality, highlighting key strengths and weaknesses, with suggestions for improvement.]

        # Input Materials
        ## Benchmark Documentation (Ground Truth)
        
        """
    ]
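    inputs.extend(documentation_input)

    inputs.append("## AI-Generated Documentation")
    # response.text is a single string; wrap it so extend() does not add it
    # one character at a time.
    if isinstance(documentation, str):
        documentation = [documentation]
    inputs.extend(documentation)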

    evaluation, usage_metadata = generate_content_from_model(
        inputs,
        model_name=model_name,
        temperature=temperature,
    )
    
    return evaluation, usage_metadata

def get_table_json_prompt(text_with_tables: str, table_identifier: str) -> str:
    """
    Generates a prompt to extract a specific table from text into JSON.

    Args:
        text_with_tables: The full text containing the table(s).
        table_identifier: A string to help the model identify the target table
                          (e.g., the table title, or a unique phrase near it).

    Returns:
        A formatted prompt string.
    """
    prompt = f"""
    You are an expert data extraction tool.
    Your task is to locate a specific table within the provided text and output its data as a JSON array.

    Here is the text containing the table(s):
    ---TEXT_START---
    {text_with_tables}
    ---TEXT_END---

    Identify the table that best matches the following title: "{table_identifier}"

    It is very important that you output the data from ONLY this table as a valid JSON array. Each object in the array should represent a row from the table. The keys of each object should be the exact column headers from the identified table.

    Output Constraints:
    - Answer directly with the JSON.
    - If the specified table cannot be found, output an empty JSON array: []
    """
    return prompt

def extract_json_from_model_output(model_output_string):
    """
    Extract and parse JSON data from a model output string that contains JSON within code block markers.
    
    Parameters:
    -----------
    model_output_string : str
        The string output from the model that contains JSON within code block markers
        
    Returns:
    --------
    pandas.DataFrame or None
        DataFrame created from the JSON data, or None if extraction failed
    """
    start_marker = "```json"
    end_marker = "```"

    start_index = model_output_string.find(start_marker)
    end_index = model_output_string.find(end_marker, start_index + len(start_marker))  # Search for end marker after the start
    
    df = None
    if start_index != -1 and end_index != -1:
        extracted_json_string = model_output_string[start_index + len(start_marker):end_index].strip()
        
        try:
            json_data = json.loads(extracted_json_string)
            logger.info("Successfully extracted and parsed JSON.")
            
            if isinstance(json_data, list) and all(isinstance(item, dict) for item in json_data):
                df = pd.DataFrame(json_data)
            else:
                logger.warning("JSON data is not a list of dictionaries, could not create DataFrame.")
                
        except json.JSONDecodeError as e:
            logger.error(f"Error decoding JSON after extraction: {e}")
            logger.debug(f"Extracted string: {extracted_json_string}")
    else:
        logger.error("Could not find JSON code block markers in the output.")
        logger.debug(f"Model output: {model_output_string}")
    
    return df

def extract_table_to_dataframe(evaluation, table_name, model_name="gemini-2.0-flash", temperature=0.9):
    """
    Extract a table from evaluation content and convert it to a DataFrame.
    
    Parameters:
    -----------
    evaluation : str
        The evaluation content containing tables
    table_name : str
        The name of the table to extract
    model_name : str, optional
        The model to use for content generation, default is "gemini-2.0-flash"
    temperature : float, optional
        Temperature setting for content generation, default is 0.9
        
    Returns:
    --------
    pandas.DataFrame or None
        DataFrame containing the extracted table data, or None if extraction failed
    """
    # Generate prompt for table extraction
    extraction_prompt = get_table_json_prompt(evaluation, table_name)
    
    # Generate JSON response from the model
    json_response, _ = generate_content_from_model(
        extraction_prompt,
        model_name=model_name,
        temperature=temperature
    )
    
    # Extract and convert JSON to DataFrame
    results_df = extract_json_from_model_output(json_response)
    
    return results_df

def calculate_error_evaluation_metrics(evaluation):
    """
    Calculate comprehensive error evaluation metrics from an evaluation document.
    
    This function extracts tables from the evaluation document and calculates
    metrics for error identification, error classification, and documentation quality.
    
    Parameters:
    -----------
    evaluation : str
        The evaluation document containing the tables to analyze
        
    Returns:
    --------
    dict
        A dictionary containing all calculated metrics organized by category
    """
    error_evaluation_metrics = {}
    
    # Part 1: Error Identification Accuracy
    identification_table_name = "Part 1: Error Identification Accuracy"
    identification_results_df = extract_table_to_dataframe(evaluation, identification_table_name)
    
    if identification_results_df is not None:
        correctly_identified_rows = identification_results_df[
            (identification_results_df["Classification"] == "No Error") |
            (identification_results_df["Classification"] == "Error (Correctly Identified)")
        ]
        total_actual_errors = identification_results_df[identification_results_df["Benchmark"] == "Error"]
        correctly_identified_errors = identification_results_df[identification_results_df["Classification"] == "Error (Correctly Identified)"]
        false_positive_errors = identification_results_df[identification_results_df["Classification"] == "False Positive"]
        false_negative_errors = identification_results_df[identification_results_df["Classification"] == "False Negative"]
        
        error_evaluation_metrics["Error Identification Statistics"] = {
            "Total steps evaluated": len(identification_results_df),
            "Total correct identifications": len(correctly_identified_rows),
            "Overall identification accuracy": len(correctly_identified_rows) / len(identification_results_df) if len(identification_results_df) > 0 else 0,
            "Error recall rate": len(correctly_identified_errors) / len(total_actual_errors) if len(total_actual_errors) > 0 else "N/A",
            "False positive count": len(false_positive_errors),
            "False negative count": len(false_negative_errors)
        }
    else:
        error_evaluation_metrics["Error Identification Statistics"] = {
            "Status": "No data available"
        }
    
    # Part 2: Error Classification Accuracy
    classification_table_name = "Part 2: Error Classification Accuracy"
    classification_results_df = extract_table_to_dataframe(evaluation, classification_table_name)
    
    if classification_results_df is not None:
        correctly_classified_errors = classification_results_df[classification_results_df["Classification"] == "Correct"]
        
        error_evaluation_metrics["Error Classification Statistics"] = {
            "Total errors analyzed": len(classification_results_df),
            "Correctly classified errors": len(correctly_classified_errors),
            "Classification accuracy": len(correctly_classified_errors) / len(classification_results_df) if len(classification_results_df) > 0 else 0
        }
    else:
        error_evaluation_metrics["Error Classification Statistics"] = {
            "Status": "No data available"
        }
    
    # # Part 3: Documentation Quality
    # documentation_table_name = "Part 3: Documentation Quality"
    # documentation_quality_df = extract_table_to_dataframe(evaluation, documentation_table_name)

    return error_evaluation_metrics
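
`extract_json_from_model_output` only succeeds when the model wraps its answer in a fenced `json` block, while the extraction prompt asks for the JSON directly. Below is a sketch of a more lenient parser that tries a fenced block first and falls back to the whole string. The name `parse_json_lenient` is hypothetical, and the fence marker is built from single backticks only so the example can itself sit inside a fenced block.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

# Fenced block with an optional "json" language tag; captures the block body
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_json_lenient(model_output: str) -> Optional[Any]:
    """Parse JSON from a fenced block if present, else from the whole string."""
    match = _FENCE_RE.search(model_output)
    candidate = match.group(1) if match else model_output
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
```

The return value is the parsed object (or `None`), so the existing list-of-dicts check would still be needed before building the DataFrame.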

In [159]:
def generate_documentation(protocol_video_example, protocol_example, lab_video_example, documentation_example,
                      protocol_video_input, protocol_input, lab_video_input, 
                      model_name="gemini-2.0-flash", temperature=0.9):
    """
    Generate corrected documentation by comparing protocol with actual implementation.
    
    Parameters:
    -----------
    protocol_video_example : list
        Example protocol video content
    protocol_example : list
        Example protocol content
    lab_video_example : list
        Example lab video content
    documentation_example : list
        Example documentation content
    protocol_video_input : list
        Input protocol video content to process
    protocol_input : list
        Input protocol content to process
    lab_video_input : list
        Input lab video content to process
    model_name : str, optional
        The model to use for generation, default is "gemini-2.0-flash"
    temperature : float, optional
        Temperature parameter for generation, default is 0.9
        
    Returns:
    --------
    tuple
        A tuple containing the documentation text and usage metadata
    """
    inputs = [
        """
        You are Professor Matthias Mann, a pioneering scientist in proteomics and mass spectrometry.
        # Your Task:
        Compare the original protocol with the actual implementation shown in a video, and create a corrected documentation that reflects what actually happened.
        Your documentation should follow these guidelines:
        1. Keep only relevant sections: Aim, Materials, Procedure, Results
        2. Use past tense to describe what actually happened, not what should happen
        3. Remove all instructional language and replace with observations
        4. Maintain step numbering of the original protocol even if the order is changed (1, 3, 2, ...)
        5. Include exact actual timing, not estimated timing
        Use these consistent symbols to indicate step status:
        - ✓ (Followed correctly with no special notation needed)
        - ❌ **Error:** (When something was done incorrectly - be specific about what happened)
        - ❌ **Omitted:** (When a step was completely skipped)
        - ⚠️ **Deviation:** (When a step was followed differently than prescribed)
        - ➕ **Added:** (When a new step not in the protocol was performed)
        # Example
        """
    ]
    inputs.extend(["## Protocol video:"])
    inputs.extend(protocol_video_example)
    inputs.extend(["## Protocol:"])
    inputs.extend(protocol_example)
    inputs.extend(["## Lab video:"])
    inputs.extend(lab_video_example)
    inputs.extend(["## Documentation:"])
    inputs.extend(documentation_example)
    inputs.extend(
        ["""
        # Your task now
        Provide me with a documentation as in the example above.
        """]
    )
    inputs.extend(["## Protocol video:"])
    inputs.extend(protocol_video_input)
    inputs.extend(["## Protocol:"])
    inputs.extend(protocol_input)
    inputs.extend(["## Lab video:"])
    inputs.extend(lab_video_input)
    inputs.append("Output: Correct documentation")
    
    documentation, usage_metadata = generate_content_from_model(
        inputs,
        model_name=model_name,
        temperature=temperature,
    )
    
    return documentation, usage_metadata
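
The input list built in `generate_documentation` follows a simple few-shot pattern: instruction, one fully worked example under labeled headings, then the new case under the same headings. Below is a sketch of that pattern with plain strings standing in for the `Part` objects. The helpers `labeled_section` and `build_few_shot_inputs` are hypothetical and only illustrate the ordering.

```python
from typing import Any


def labeled_section(label: str, parts: list[Any]) -> list[Any]:
    """Return a heading string followed by the parts that belong under it."""
    return [label, *parts]


def build_few_shot_inputs(
    instruction: str,
    example: dict[str, list[Any]],
    query: dict[str, list[Any]],
) -> list[Any]:
    """Interleave headings and parts: instruction, worked example, then the new case."""
    inputs: list[Any] = [instruction]
    for label, parts in example.items():
        inputs += labeled_section(label, parts)
    inputs.append("# Your task now")
    for label, parts in query.items():
        inputs += labeled_section(label, parts)
    return inputs
```

The example dictionary carries one more section than the query (the documentation), which is what the model is asked to produce for the new case.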

In [99]:
csv_path = '/Users/patriciaskowronek/Documents/proteomics_specialist/data/benchmark_dataset.csv'
protocol_videos_base = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/protocols"
documentation_videos_base = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/documentation"
markdown_base = "/Users/patriciaskowronek/Documents/proteomics_specialist/data"
prefix = "compare_protocol_video"

all_model_inputs = process_benchmark_dataset(csv_path, protocol_videos_base, documentation_videos_base, markdown_base, bucket, prefix)

Processed PlaceEvotips_docuCorrect
Processed PlaceEvotips_docuWrongPosition


In [166]:
from vertexai.generative_models import GenerativeModel, GenerationConfig

example = 'PlaceEvotips_docuCorrect'
protocol_video_example = all_model_inputs[example]['protocol_video_input']
protocol_example = all_model_inputs[example]['protocol_input']
lab_video_example = all_model_inputs[example]['lab_video_input']
documentation_example = all_model_inputs[example]['documentation_input']

# Hold out the few-shot example so it is not evaluated against itself
copy_all_model_inputs = all_model_inputs.copy()
subset_all_model_inputs = copy_all_model_inputs.pop(example)
results_collection = {}
for key, value in copy_all_model_inputs.items():
    print(key)
    protocol_video_input = value['protocol_video_input']
    protocol_input = value['protocol_input']
    lab_video_input = value['lab_video_input']
    documentation_input = value['documentation_input']

    documentation, usage_metadata = generate_documentation(
        protocol_video_example, protocol_example, lab_video_example, documentation_example,
        protocol_video_input, protocol_input, lab_video_input, 
        model_name="gemini-2.0-flash", 
        temperature=0.9
    )
    display(Markdown(documentation))

    evaluation, usage_metadata_evaluation = generate_documentation_evaluation(documentation_input, documentation)
    display(Markdown(evaluation))

    metrics = calculate_error_evaluation_metrics(evaluation)
    pprint.pprint(metrics)

    results_collection[key] = {
        "inputs": {
            "experiment_name": key,
            "protocol_video_input": value['protocol_video_input'],
            "protocol_input": value['protocol_input'],
            "lab_video_input": value['lab_video_input'],
            "documentation_input": value['documentation_input']
        },
        "outputs": {
            "documentation": documentation,
            "documentation_metadata": usage_metadata,
            "evaluation": evaluation,
            "evaluation_metadata": usage_metadata_evaluation,
            "metrics": metrics
        }
    }


PlaceEvotips_docuWrongPosition


Alright Professor Mann, here's the corrected documentation:

## Documentation:# Placing Evotips in Evotip Boxes on the Evosep One System

## Aim

Placing Evotips in Evotip boxes with HeLa samples and blanks.


## Materials

### Equipment

- **Evotips**
  - Single-use stage tips for sample injection
  - Rack layout: Two columns (left and right)
  - Left column (top to bottom): S1, S2, S3
  - Right column (top to bottom): S4, S5, S6
  - Within each box: Standard 96-well format with A1 (top left), A12 (top right), H12 (bottom right)
- **Evotip Boxes**
  - 96-well format (A1-H12) (Figure 1)
- **Evosep One System**
  - Liquid chromatography system

### Reagents

- Formic acid (FA)
  - ! CAUTION: This liquid may be corrosive. It is harmful and can cause damage if direct contact occurs.

### Reagent setup

- **Buffer A**
  - Consists of 0.1% (vol/vol) FA. The buffers are stable for at least 6 months at room temperature as long as they are protected from sunlight.


## Procedure

*Estimated timing: less than 1 minute*

✓ 1.  Verified that Evotip box was filled to a minimum depth of 1 cm with Buffer A solution. (0:10)

✓ 2. Placed Evotip Box at S1 within the rack system of the Evosep instrument. Ensured box was firmly seated in its designated position. (0:23)

✓ 3. Placed an empty Evotip Box for Blank tips at S3. Ensured box was firmly seated in its designated position. (0:33)

⚠️ 4. Inspected each Evotip before placement to verify its condition. Properly prepared Evotips should display a pale-colored SPE material disc with visible solvent above it. Evotips for HeLa samples were visually confirmed to have a solvent on top and bottom, and the SPE material displayed a pale color (0:47).
    - ⚠️ **Deviation:** The "blanks" were completely dry, unused tips. (1:22)

5. Placed the verified Evotips with HeLa samples into the prepared Evotip boxes at S1 from A1 to A6. (0:50)

6. Placed empty Evotips, called Blanks, at S3 from A1 to A6. (1:17)

7. Documented the precise position of each placed Evotip. (1:34)


## Results
- Properly seated Evotip boxes in the rack system
- Visible Buffer A solution in boxes (1 cm depth)
- All Evotips with HeLa samples showing pale-colored SPE material discs & clear solvent meniscus above each SPE disc of each Evotip.
- All Blanks did not contain clear solvent meniscus above each SPE disc of each Evotip.
- Evotips with HeLa placed at S1 from A1 to A6.
- Blanks placed at S3 from A1 to A6.

## Part 1: Error Identification Accuracy

| Step | Benchmark | AI Response | Classification |
|------|-----------|-------------|----------------|
| 1 | No Error | No Error | No Error |
| 2 | No Error | No Error | No Error |
| 3 | No Error | No Error | No Error |
| 4 | No Error | Error | False Positive |
| 5 | Error | No Error | False Negative |
| 6 | No Error | No Error | No Error |
| 7 | No Error | No Error | No Error |

**Summary Statistics:**
- Total correct identifications: 6/7
- Total false positives: 1
- Total false negatives: 1
- Overall accuracy: 85.7%

## Part 2: Error Classification Accuracy

| Step | Benchmark Error Type | AI Error Type | Classification |
|------|---------------------|---------------|----------------|
| 5 | Error | N/A | N/A |

**Summary Statistics:**
- Total correctly classified errors: 0/0
- Error classification accuracy: N/A%

## Part 3: Documentation Quality

| Criterion | Rating | Explanation |
|-----------|--------|-------------|
| Structure | Good | The structure is mostly good, but the added sections like "Reagents" and the detailed description of "Equipment" are unnecessary and detract from the focus on documenting the specific experiment. |
| Tense | Excellent | The tense is consistently past tense, accurately reflecting the completed actions. |
| Language | Excellent | The language is observational and avoids instructional tone. |
| Numbering | Excellent | The original step numbering is maintained correctly. |
| Timing | Excellent | The AI includes exact timing for each step, which is a significant improvement. |

## Overall Assessment

The AI documentation demonstrates a good understanding of the required format and successfully converts the protocol into an observational document. The use of past tense, observational language, and accurate timing are strengths. However, the AI incorrectly identifies an error in step 4 and misses the error in step 5, and introduces irrelevant details in the "Materials" section. To improve, the AI should focus on accurately identifying errors and avoiding the inclusion of extraneous information not directly related to the observed experiment.


2025-04-21 10:50:46,868 - __main__ - INFO - Successfully extracted and parsed JSON.
2025-04-21 10:50:47,470 - __main__ - INFO - Successfully extracted and parsed JSON.


{'Error Classification Statistics': {'Classification accuracy': 0.0,
                                     'Correctly classified errors': 0,
                                     'Total errors analyzed': 1},
 'Error Identification Statistics': {'Error recall rate': 0.0,
                                     'False negative count': 1,
                                     'False positive count': 1,
                                     'Overall identification accuracy': 0.7142857142857143,
                                     'Total correct identifications': 5,
                                     'Total steps evaluated': 7}}
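
The Part 1 table can be re-scored by hand without another model call. The sketch below copies the Benchmark and AI Response columns from the table above and applies the four categories from the evaluation prompt. The tally gives 5 correct steps out of 7, which agrees with the computed metrics dictionary (0.714) rather than with the 6/7 stated in the model-written summary. The function `classify` is a hypothetical helper.

```python
from collections import Counter

# (step, benchmark_has_error, ai_reported_error) copied from the Part 1 table
steps = [
    (1, False, False),
    (2, False, False),
    (3, False, False),
    (4, False, True),
    (5, True, False),
    (6, False, False),
    (7, False, False),
]


def classify(benchmark_has_error: bool, ai_reported_error: bool) -> str:
    """Map one step to the four categories used in the evaluation prompt."""
    if benchmark_has_error and ai_reported_error:
        return "Error (Correctly Identified)"
    if benchmark_has_error:
        return "False Negative"
    if ai_reported_error:
        return "False Positive"
    return "No Error"


counts = Counter(classify(b, a) for _, b, a in steps)
correct = counts["No Error"] + counts["Error (Correctly Identified)"]
accuracy = correct / len(steps)
print(counts, f"accuracy={accuracy:.3f}")
```

Computing the statistics from the table rows, as `calculate_error_evaluation_metrics` does, avoids relying on arithmetic performed by the evaluating model.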
