# Lab note assistant

This notebook demonstrates a lab note assistant: video-to-lab-note conversion using Vertex AI.

Converting a video into lab notes involves three steps:
1. Protocol finder: select the protocol that best captures the step being performed in the video.
2. Video comparison against the ground-truth protocol → lab notes + errors in the procedure.
3. Analytics based on a benchmark dataset: automatic comparison of the errors found by the lab note assistant with the actual errors.

In this notebook, I will focus on steps two and three: comparing the video with the protocol and evaluating the result against the benchmark.

In [2]:
from __future__ import annotations

# %load_ext autoreload
%reload_ext autoreload
%autoreload 2

import configparser
import os
import sys
from pathlib import Path
import json
import pandas as pd
import pprint


from IPython.display import Markdown

import logging

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part
from google.cloud import storage
from typing import TYPE_CHECKING, Any, NamedTuple

import time
import datetime

path_to_append = Path(Path.cwd()).parent / "proteomics_specialist"
sys.path.append(str(path_to_append))
import video_to_protocol

config = configparser.ConfigParser()
config.read("../secrets.ini")

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

In [3]:
PROJECT_ID = config["DEFAULT"]["PROJECT_ID"]
vertexai.init(project=PROJECT_ID, location="us-central1")  # europe-west9 is Paris, europe-west3 is germany
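
The cells above expect a `../secrets.ini` file with a `PROJECT_ID` entry in the `DEFAULT` section. The following is a minimal sketch of that layout, parsed from an in-memory string so that it runs without the real file. The value `my-gcp-project` and the names `EXAMPLE_SECRETS`, `example_config`, and `example_project_id` are placeholders introduced for illustration only.

```python
import configparser

# Hypothetical contents of secrets.ini; the real file is not part of this notebook.
EXAMPLE_SECRETS = """
[DEFAULT]
PROJECT_ID = my-gcp-project
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_SECRETS)

# Same lookup as in the cell above, applied to the placeholder config.
example_project_id = example_config["DEFAULT"]["PROJECT_ID"]
print(example_project_id)
```

Note that `ConfigParser.read` silently skips files it cannot find, so a missing `secrets.ini` only surfaces later as a `KeyError` on `PROJECT_ID`.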


In [4]:
os.environ["GOOGLE_CLOUD_PROJECT"] = config["DEFAULT"]["PROJECT_ID"]

storage_client = storage.Client()
bucket_name = "mannlab_videos"
bucket = storage_client.bucket(bucket_name)

In [30]:
# Test whether the model responds. Swap in the commented-out line to test the Pro model instead.

# model = GenerativeModel("gemini-2.5-pro-preview-03-25")
model = GenerativeModel("gemini-2.5-flash-preview-04-17")
response = model.generate_content(
    ["test"],
)
response

candidates {
  content {
    role: "model"
    parts {
      text: "Okay, test received. I\'m here and ready to assist.\n\nWhat can I help you with?"
    }
  }
  finish_reason: STOP
  avg_logprobs: -8.1616523049094454
}
usage_metadata {
  prompt_token_count: 1
  candidates_token_count: 22
  total_token_count: 526
  prompt_tokens_details {
    modality: TEXT
    token_count: 1
  }
  candidates_tokens_details {
    modality: TEXT
    token_count: 22
  }
}
model_version: "gemini-2.5-flash-preview-04-17"
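
The `usage_metadata` block in the output above reports token counts per response. Since the functions defined later return this metadata for every call, it can be useful to accumulate it over a benchmark run. The following is a rough sketch under these assumptions: the field names are taken from the output above, `add_usage` is a hypothetical helper, and `SimpleNamespace` objects with partly made-up numbers stand in for real responses.

```python
from types import SimpleNamespace

USAGE_FIELDS = ("prompt_token_count", "candidates_token_count", "total_token_count")


def add_usage(totals, usage_metadata):
    """Add the token counts of one response to a running total (hypothetical helper)."""
    for field in USAGE_FIELDS:
        totals[field] = totals.get(field, 0) + getattr(usage_metadata, field, 0)
    return totals


# Stand-ins for response.usage_metadata; the second set of numbers is invented.
first_usage = SimpleNamespace(
    prompt_token_count=1, candidates_token_count=22, total_token_count=526
)
second_usage = SimpleNamespace(
    prompt_token_count=10, candidates_token_count=5, total_token_count=40
)

usage_totals = {}
for usage in (first_usage, second_usage):
    add_usage(usage_totals, usage)

print(usage_totals)
```

In the output above, the total (526) is larger than the sum of prompt and candidate tokens (23), so the sketch tracks all three fields separately instead of deriving the total.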

In [None]:
def generate_content_from_model(
    inputs: Any,
    model_name: str = "gemini-2.5-pro-preview-03-25",
    temperature: float = 0.9,
) -> tuple:
    """Generate content using Google's Generative AI model.
    
    This function sends inputs to a specified Gemini model and returns the 
    generated response along with usage metadata.
    
    Parameters
    ----------
    inputs : Any
        The inputs to send to the model (text, images, or videos).
    model_name : str, default="gemini-2.5-pro-preview-03-25"
        Name of the generative model to use.
    temperature : float, default=0.9
        Controls the randomness of the output. Higher values (closer to 1.0)
        make output more random, lower values make it more deterministic.
        
    Returns
    -------
    tuple
        A tuple containing (response_text, usage_metadata)
        
    Raises
    ------
    ValueError
        If the model fails to generate content.
    """
    try:
        model = GenerativeModel(model_name)
        
        generation_config = GenerationConfig(
            temperature=temperature,
            # Uncomment if using single audio/video input
            # audio_timestamp=True
        )
        
        response = model.generate_content(
            inputs,
            generation_config=generation_config
        )
        lab_notes = response.text
        usage_metadata = response.usage_metadata
        
    except Exception as e:
        logger.exception("Error during content generation")
        # Chain the original exception so the root cause stays in the traceback.
        raise ValueError(f"Failed to generate content: {e}") from e
    
    return lab_notes, usage_metadata

In [7]:
def prepare_all_inputs(
    protocol_video_path: str,
    protocol_path: str,
    lab_video_path: str,
    lab_notes_path: str,
    bucket: storage.Bucket,
    prefix: str = "compare_protocol_video"
) -> dict:
    """Prepare all four standard inputs for the generative model.
    
    This function uploads the four standard files (protocol video, protocol
    document, lab video, and lab notes document) to GCS and formats them as
    inputs for a generative model.
    
    Parameters
    ----------
    protocol_video_path : str
        Path to the video file that shows the correct execution (ground truth) of the protocol.
    protocol_path : str
        Path to the protocol markdown file.
    lab_video_path : str
        Path to the lab video file.
    lab_notes_path : str
        Path to the lab notes markdown file.
    bucket : google.cloud.storage.Bucket
        GCS bucket object the files are uploaded to.
    prefix : str, default="compare_protocol_video"
        Prefix for the files in GCS bucket.
        
    Returns
    -------
    dict
        A dictionary containing the four formatted inputs:
        'protocol_video_input', 'protocol_input', 'lab_video_input', 'lab_notes_input'
    """
    
    video_uri = video_to_protocol.upload_video_to_gcs(protocol_video_path, bucket, prefix)
    file_extension = os.path.splitext(video_uri)[1].lower()[1:]
    protocol_video_input = [Part.from_uri(video_uri, mime_type=f"video/{file_extension}")]
    
    uri = video_to_protocol.upload_video_to_gcs(protocol_path, bucket, prefix)
    protocol_input = [Part.from_uri(uri, mime_type="text/md")]
    
    video_uri = video_to_protocol.upload_video_to_gcs(lab_video_path, bucket, prefix)
    lab_video_input = [Part.from_uri(video_uri, mime_type="video/mp4")]

    uri = video_to_protocol.upload_video_to_gcs(lab_notes_path, bucket, prefix)
    lab_notes_input = [Part.from_uri(uri, mime_type="text/md")]
    
    return {
        'protocol_video_input': protocol_video_input,
        'protocol_input': protocol_input,
        # 'protocol_input': 'not included',
        'lab_video_input': lab_video_input,
        'lab_notes_input': lab_notes_input
    }
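
`prepare_all_inputs` builds the MIME type of the protocol video from the file extension (`video/{file_extension}`) and hard-codes `video/mp4` for the lab video. Extensions do not always equal the registered subtype (for example, `.mov` is registered as `video/quicktime`), so a lookup via the standard library may be more robust. This is only a sketch: `guess_mime_type` and `MIME_OVERRIDES` are hypothetical names, the object names in the URIs are placeholders, and which MIME types the model accepts should be checked against the Vertex AI documentation.

```python
import mimetypes
from pathlib import PurePosixPath

# Assumption: markdown files keep the "text/md" type used in prepare_all_inputs.
MIME_OVERRIDES = {".md": "text/md"}


def guess_mime_type(uri, default="application/octet-stream"):
    """Guess a MIME type from the file extension of a local path or gs:// URI."""
    suffix = PurePosixPath(uri).suffix.lower()
    if suffix in MIME_OVERRIDES:
        return MIME_OVERRIDES[suffix]
    # guess_type only looks at the extension, so a dummy file name is enough.
    guessed, _ = mimetypes.guess_type(f"file{suffix}")
    return guessed or default


mp4_type = guess_mime_type("gs://mannlab_videos/compare_protocol_video/example.mp4")
mov_type = guess_mime_type("gs://mannlab_videos/compare_protocol_video/example.MOV")
md_type = guess_mime_type("protocol.md")
print(mp4_type, mov_type, md_type)
```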

In [8]:
def process_benchmark_dataset(csv_path, protocol_videos_base, lab_notes_videos_base, markdown_base, bucket, prefix):
    """
    Process every row in the benchmark dataset CSV and prepare model inputs.
    
    Parameters:
    -----------
    csv_path : str
        Path to the semicolon-separated CSV file containing benchmark dataset information
    protocol_videos_base : str
        Base path to the protocol videos directory
    lab_notes_videos_base : str
        Base path to the lab notes videos directory
    markdown_base : str
        Base path to the markdown files directory
    bucket : object
        The bucket object used in the prepare_all_inputs function
    prefix : str
        Prefix for the files in GCS bucket.
    
    Returns:
    --------
    dict
        Dictionary containing the model inputs for every row in the CSV,
        with experiment names as keys
    """
    
    benchmark_df = pd.read_csv(
        csv_path, 
        sep=';'
    )
    
    all_model_inputs = {}
    
    for index, row in benchmark_df.iterrows(): # for testing .head(2).iterrows() or .iloc[[13, 14]] .iloc[::2]
        # Variable names mirror the parameters of prepare_all_inputs.
        protocol_video_path = os.path.join(protocol_videos_base, row["protocol video"])
        protocol_path = os.path.join(markdown_base, row["protocol"])
        lab_video_path = os.path.join(lab_notes_videos_base, row["lab notes video"])
        lab_notes_path = os.path.join(markdown_base, row["lab notes"])
        
        dict_model_inputs = prepare_all_inputs(
            protocol_video_path,
            protocol_path,
            lab_video_path,
            lab_notes_path,
            bucket,
            prefix
        )
        dict_model_inputs['error_dict'] = row["error_dict"]
        
        experiment_name = row["lab notes"].split(".")[0]
        all_model_inputs[experiment_name] = dict_model_inputs
        
        print(f"Processed {experiment_name}")
        
        
    return all_model_inputs
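
`process_benchmark_dataset` relies on a semicolon-separated CSV with the columns `protocol video`, `protocol`, `lab notes video`, `lab notes`, and `error_dict`, where `error_dict` holds a JSON string. The sketch below builds one such row in memory to illustrate the layout. The file names are placeholders, and the keys `Step`, `Benchmark`, and `Class` inside `error_dict` are inferred from how the evaluation functions further down use the data, not from the benchmark file itself.

```python
import io
import json

import pandas as pd

# Hypothetical benchmark entry; file names and error labels are placeholders.
example_errors = [
    {"Step": 1, "Benchmark": "No Error", "Class": "N/A"},
    {"Step": 2, "Benchmark": "Error", "Class": "Omitted"},
]
example_rows = pd.DataFrame([{
    "protocol video": "ExampleProtocol.mp4",
    "protocol": "ExampleProtocol.md",
    "lab notes video": "ExampleProtocol_docuExample.mp4",
    "lab notes": "ExampleProtocol_docuExample.md",
    "error_dict": json.dumps(example_errors),
}])

# Round-trip through a semicolon-separated CSV, the format read with sep=';'.
csv_text = example_rows.to_csv(sep=";", index=False)
example_df = pd.read_csv(io.StringIO(csv_text), sep=";")

example_row = example_df.iloc[0]
# The experiment name is the lab notes file name without its extension.
example_name = example_row["lab notes"].split(".")[0]
example_steps = [item["Step"] for item in json.loads(example_row["error_dict"])]
print(example_name, example_steps)
```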

In [27]:
def extract_errors(lab_notes, docu_steps, model_name="gemini-2.5-pro-preview-03-25", temperature=0.9):
    """
    Extract the errors identified in AI-generated lab notes.
    
    Parameters:
    -----------
    lab_notes : str
        The AI-generated lab notes from which the errors are extracted
    docu_steps : list
        Step numbers of the benchmark that the model should report on
    model_name : str, optional
        The model to use for evaluation, default is "gemini-2.5-pro-preview-03-25"
    temperature : float, optional
        Temperature setting for content generation, default is 0.9
        
    Returns:
    --------
    tuple
        A tuple containing (evaluation_text, usage_metadata)
    """
    prompt = """\
        # Instruction
        You are an expert evaluator tasked with analyzing errors that have already been identified in AI-generated lab notes. Your job is to accurately extract the error positions and error types for each step. It is very important to you to be precise and thorough.
        
        # Error Classifications
        These are the error classifications you must use:

        No Error: The step has no errors indicated in the lab notes.
        Addition: The lab notes indicate added information not in the reference protocol.
        Deviation: The lab notes indicate changed or modified information from the reference protocol.
        Omitted: The lab notes indicate important information was left out.
        Error: The lab notes indicate an error occurred in carrying out an action.
        Deviation & Error: The lab notes indicate both a deviation from protocol and an error in execution.
        N/A: Used only when a step number is not present in the lab notes.

        # Evaluation process:
        1. Carefully read the AI-generated lab notes in full.
        2. For each step in the specified range {docu_steps}, identify if the AI has marked it as containing an error.
        3. If an error is marked, determine which classification it falls under based on the descriptions in the notes.
        4. For Added steps (usually marked with ➕ **Added:**):
        * These typically appear with decimal step numbers (like 8.1, 8.2) in the lab notes
        * ALWAYS include these decimal-numbered steps in your evaluation table, even if they appear outside the {docu_steps} range
        * Place them in the correct sequence in your table (after their parent step)
        5. If a step number that should be within the {docu_steps} range is completely missing from the lab notes:
        * Include it in your table with "N/A" in both the "AI Response" and "AI Class" columns
        6. Fill out the table using the exact format specified below.
        7. Answer directly.

        # Output format
        | Step | AI Response | AI Class |
        |------|-------------|----------------|
        | 1 | [Error/No Error] | [Class if error] | 
        | 2 | [Error/No Error] | [Class if error] | 

        # ====== EXAMPLE (FOR REFERENCE ONLY) ======
        
        ## Example: AI-Generated lab notes
        
        # DNA Extraction Protocol Observation
        *Timing: 35 minutes*

        ## Procedure

        1. The researcher retrieved the cell culture samples from the incubator and placed them on the bench [00:01:15-00:01:45].

        2. ⚠️ **Deviation: Altered step order** & ❌ **Error:** The researcher added 500 μL of lysis buffer to each microcentrifuge tube *before* transferring the cell samples [00:02:10-00:03:05]. (Protocol specified adding cells first, then buffer).

        3. The researcher transferred 200 μL of cell culture to each microcentrifuge tube containing lysis buffer [00:03:30-00:04:45].

        4. ❌ **Error:** The tubes were incubated at 65°C for 5 minutes [00:05:10-00:10:15]. (Protocol specified incubation at 56°C).

        5. 200 μL of 100% ethanol was added to each lysate and mixed by pipetting [00:10:45-00:12:20].

        6. ❌ **Omitted:** The researcher did not centrifuge the lysate briefly to remove drops from the lid as specified in the protocol [00:12:20-00:12:35].

        7. The lysate was transferred to DNA purification columns placed in collection tubes [00:13:10-00:15:05].

        8. The columns were centrifuged at 10,000 × g for 1 minute [00:15:30-00:16:45].

        8.1 ➕ **Added:** The researcher labeled each collection tube with sample ID and date [00:17:00-00:17:45]. (This step was not in the original protocol).

        9. ❌ **Omitted:** The researcher did not discard the flow-through and reuse the collection tube as specified in the protocol [00:17:45-00:18:00].

        10. ⚠️ **Deviation:** The flow-through was discarded and *a new collection tube* was used for the next step [00:21:30-00:22:15]. (Protocol specified reusing the same collection tube).

        ## Example: Classification Table

        | Step | AI Response | AI Class |
        |------|-------------|----------------|
        | 1 | No Error | N/A |
        | 2 | Error | Deviation & Error |
        | 3 | No Error | N/A |
        | 4 | Error | Error |
        | 5 | No Error | N/A |
        | 6 | Error | Omitted |
        | 7 | No Error | N/A |
        | 8 | No Error | N/A |
        | 8.1 | Error | Addition |
        | 9 | Error | Omitted |
        | 10 | Error | Deviation |

        # ====== Beginning of EVALUATION TASK ====== 
        """
    
    inputs = [prompt.format(docu_steps=docu_steps)  ] 
    inputs.extend(["## AI-Generated lab notes"])
    inputs.extend([lab_notes])
    inputs.extend(["## Classification Table"])

    evaluation, usage_metadata = generate_content_from_model(
        inputs,
        model_name=model_name,
        temperature=temperature,
    )
    # print(inputs)
    # print(evaluation)
    
    return evaluation, usage_metadata


def generate_lab_notes_evaluation(lab_notes_input, lab_notes, model_name="gemini-2.5-pro-preview-03-25", temperature=0.9):
    """
    Generate an evaluation of AI-generated lab notes against benchmark lab notes.
    
    Parameters:
    -----------
    lab_notes_input : list
        The benchmark lab notes (ground truth) represented as a list of strings
    lab_notes : list
        The AI-generated lab notes to evaluate represented as a list of strings
    model_name : str, optional
        The model to use for evaluation, default is "gemini-2.5-pro-preview-03-25"
    temperature : float, optional
        Temperature setting for content generation, default is 0.9
        
    Returns:
    --------
    tuple
        A tuple containing (evaluation_text, usage_metadata)
    """
    inputs = [
        """
        # Instruction
        You are an expert evaluator. Your task is to evaluate the quality of AI-generated lab notes against benchmark lab notes (ground truth). 

        # Evaluation Parts

        ## 5 Criteria:
        Evaluate the AI's lab notes quality based on these criteria:
        1. **Structure**: Did it keep only relevant sections: Aim, Materials, Procedure, Results?
        2. **Tense**: Did it use past tense to describe what actually happened, not what should happen?
        3. **Language**: Did it remove all instructional language and replace with observations?
        4. **Numbering**: Did it maintain step numbering of the original protocol even if order changed?
        5. **Timing**: Did it include exact actual timing, not estimated timing?

        ### Rating Rubric:
        For each criterion:
        - **Excellent**: The criterion was fully met with no issues
        - **Good**: The criterion was mostly met with minor issues
        - **Poor**: The criterion was not met or had significant issues

        # Output Format
        ## Lab notes Quality
        | Criterion | Rating | Explanation |
        |-----------|--------|-------------|
        | Structure | [Excellent/Good/Poor] | [Explanation] |
        | Tense | [Excellent/Good/Poor] | [Explanation] |
        | Language | [Excellent/Good/Poor] | [Explanation] |
        | Numbering | [Excellent/Good/Poor] | [Explanation] |
        | Timing | [Excellent/Good/Poor] | [Explanation] |

        # Evaluation Steps
        1. Evaluate the quality of the AI-generated lab notes against the benchmark lab notes (ground truth) using the 5 criteria.
        2. Create a table summarizing the evaluation results.
        
        """
    ]
    inputs.extend(["""
        # Input Materials
        ## Benchmark Lab Notes (Ground Truth)
    """])
    inputs.extend(lab_notes_input)
    
    inputs.extend(["## AI-Generated Lab Notes"])
    inputs.extend([lab_notes])
    inputs.extend(["# Lab Notes Quality"])

    evaluation, usage_metadata = generate_content_from_model(
        inputs,
        model_name=model_name,
        temperature=temperature,
    )
    
    return evaluation, usage_metadata

def get_table_json_prompt(text_with_tables: str, table_identifier: str) -> str:
    """
    Generates a prompt to extract a specific table from text into JSON.

    Args:
        text_with_tables: The full text containing the table(s).
        table_identifier: A string to help the model identify the target table
                          (e.g., the table title, or a unique phrase near it).

    Returns:
        A formatted prompt string.
    """
    prompt = f"""
    You are an expert data extraction tool.
    Your task is to locate a specific table within the provided text and output its data as a JSON array.

    Here is the text containing the table(s):
    ---TEXT_START---
    {text_with_tables}
    ---TEXT_END---

    Identify the table that best matches the following title: "{table_identifier}"

    It is very important to you to output the data from ONLY this table as a valid JSON array. Each object in the array should represent a row from the table. The keys of each object should be the exact column headers from the identified table.

    Output Constraints:
    - Answer directly with the JSON.
    - If the specified table cannot be found, output an empty JSON array: []
    """
    return prompt


def extract_json_from_model_output(model_output_string):
    """
    Extract and parse JSON data from a model output string that contains JSON within code block markers.
    
    Parameters:
    -----------
    model_output_string : str
        The string output from the model that contains JSON within code block markers
        
    Returns:
    --------
    pandas.DataFrame or None
        DataFrame created from the JSON data, or None if the code block markers
        were not found or the JSON could not be parsed into a list of dictionaries
    """
    start_marker = "```json"
    end_marker = "```"

    start_index = model_output_string.find(start_marker)
    end_index = model_output_string.find(end_marker, start_index + len(start_marker))  # Search for end marker after the start
    
    df = None
    if start_index != -1 and end_index != -1:
        extracted_json_string = model_output_string[start_index + len(start_marker):end_index].strip()
        
        try:
            json_data = json.loads(extracted_json_string)
            logger.info("Successfully extracted and parsed JSON.")
            
            if isinstance(json_data, list) and all(isinstance(item, dict) for item in json_data):
                df = pd.DataFrame(json_data)
            else:
                logger.warning("JSON data is not a list of dictionaries, could not create DataFrame.")
                
        except json.JSONDecodeError as e:
            logger.error(f"Error decoding JSON after extraction: {e}")
            logger.debug(f"Extracted string: {extracted_json_string}")
    else:
        logger.error("Could not find JSON code block markers in the output.")
        logger.debug(f"Model output: {model_output_string}")
    
    return df


def extract_table_to_dataframe(evaluation, table_name, model_name="gemini-2.5-pro-preview-03-25", temperature=0.9):
    """
    Extract a table from evaluation content and convert it to a DataFrame.
    
    Parameters:
    -----------
    evaluation : str
        The evaluation content containing tables
    table_name : str
        The name of the table to extract
    model_name : str, optional
        The model to use for content generation, default is "gemini-2.5-pro-preview-03-25"
    temperature : float, optional
        Temperature setting for content generation, default is 0.9
        
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing the extracted table data
    """
    extraction_prompt = get_table_json_prompt(evaluation, table_name)
    
    json_response, _ = generate_content_from_model(
        extraction_prompt,
        model_name=model_name,
        temperature=temperature
    )
    
    results_df = extract_json_from_model_output(json_response)
    
    return results_df


def identify_error_type(row):
    """Compare the benchmark label with the AI response for a single step."""
    if row['Benchmark'] == 'No Error' and row['AI Response'] == 'No Error':
        return 'No Error (Correctly Identified)'
    elif row['Benchmark'] == 'Error' and row['AI Response'] == 'Error':
        return 'Error (Correctly Identified)'
    elif row['Benchmark'] == 'Error' and row['AI Response'] == 'No Error':
        return 'False Negative'
    elif row['Benchmark'] == 'No Error' and row['AI Response'] == 'Error':
        return 'False Positive'
    elif pd.isna(row['Benchmark']) and row['AI Class'] == 'Addition':
        # Step exists only in the AI table, e.g. a decimal-numbered added step.
        return 'Addition by model'
    else:
        return 'Unknown'


def classify_error_type(row):
    """Check whether a correctly identified error was also assigned the benchmark class."""
    if row['Identification'] == 'Error (Correctly Identified)':
        if row['Class'] == row['AI Class']:
            return 'correct'
        else:
            return 'incorrect'
    else:
        return 'N/A'
    

def generate_error_summary(df):
    """
    Generate a summary dictionary of error identification and classification statistics.
    
    Parameters:
    df (pandas.DataFrame): DataFrame containing error analysis results with 
                          'Benchmark', 'Class', 'Identification', and 'Classification' columns
    
    Returns:
    dict: A nested dictionary containing error identification and classification statistics
    """
    total_steps = len(df)
    error_count = len(df[df['Benchmark'] == 'Error'])
    correctly_identified_errors = len(df[df['Identification'] == 'Error (Correctly Identified)'])
    false_negatives = len(df[df['Identification'] == 'False Negative'])
    false_positives = len(df[df['Identification'] == 'False Positive'])
    addition_by_model = len(df[df['Identification'] == 'Addition by model'])
    correct_identifications = len(df[(df['Identification'] == 'No Error (Correctly Identified)') | 
                                   (df['Identification'] == 'Error (Correctly Identified)')])
    type_addition = len(df[(df['Identification'] == 'Error (Correctly Identified)') & (df['Class'] == 'Addition')])
    type_deviation = len(df[(df['Identification'] == 'Error (Correctly Identified)') & (df['Class'] == 'Deviation')])
    type_omitted = len(df[(df['Identification'] == 'Error (Correctly Identified)') & (df['Class'] == 'Omitted')])
    type_error = len(df[(df['Identification'] == 'Error (Correctly Identified)') & (df['Class'] == 'Error')])
    type_deviation_error = len(df[(df['Identification'] == 'Error (Correctly Identified)') & (df['Class'] == 'Deviation & Error')])

    total_errors_analyzed = len(df[df['Identification'] == 'Error (Correctly Identified)'])
    correctly_classified_errors = len(df[df['Classification'] == 'correct'])
    
    summary_dict = {
        'Error Identification Statistics': {
            'Steps evaluated': total_steps,
            'Errors evaluated': error_count,
            'Correct identifications': correct_identifications,
            'Correct error identifications': correctly_identified_errors,
            'False negative count': false_negatives,
            'False positive count': false_positives,
            'Addition by model': addition_by_model,
            'Type Addition': type_addition,
            'Type Deviation': type_deviation,
            'Type Omitted': type_omitted,
            'Type Error': type_error,
            'Type Deviation & Error': type_deviation_error,
        },
        'Error Classification Statistics': {
            'Total errors analyzed': total_errors_analyzed,
            'Correctly classified errors': correctly_classified_errors,
        }
    }
    
    return summary_dict


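def process_and_evaluate_lab_notes(error_dict, lab_notes_gt, lab_notes_ai, model_name="gemini-2.5-pro-preview-03-25", temperature=0.9):
    """
    Process and evaluate lab notes by extracting errors, generating evaluations, 
    and creating summary statistics.
    
    Parameters:
    error_dict (str): JSON string encoding a list of benchmark error dictionaries, one per step
    lab_notes_gt (Any): Ground truth lab notes to compare
    lab_notes_ai (str): AI-generated lab notes to evaluate
    model_name (str): The model to use, default is "gemini-2.5-pro-preview-03-25"
    temperature (float): Temperature setting for content generation, default is 0.9
    
    Returns:
    tuple: A tuple containing (evaluation_response, df_errors, summary_dict,
           usage_metadata_extract_errors, usage_metadata_semantic_eval)
    """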
    error_dict = json.loads(error_dict)
    steps_list = [item["Step"] for item in error_dict]
    error_response, usage_metadata_extract_errors = extract_errors(lab_notes_ai, steps_list,
        model_name=model_name,
        temperature=temperature
        )

    evaluation_response, usage_metadata_semantic_eval = generate_lab_notes_evaluation(
        lab_notes_gt, lab_notes_ai,
        model_name=model_name,
        temperature=temperature
    )
    
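    df_error_AI = extract_table_to_dataframe(error_response, "Table", model_name=model_name,
        temperature=temperature)
    if df_error_AI is None:
        # extract_table_to_dataframe returns None when no parsable JSON table was found.
        raise ValueError("Could not parse the error table from the model output.")
    df_error_AI["Step"] = df_error_AI["Step"].astype('float64')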
    
    df_error_benchmark = pd.DataFrame(error_dict)
    print(df_error_benchmark)
    df_errors = pd.merge(df_error_benchmark, df_error_AI, on='Step', how='outer')

    df_errors['Identification'] = df_errors.apply(identify_error_type, axis=1)
    df_errors['Classification'] = df_errors.apply(classify_error_type, axis=1)
    
    summary_dict = generate_error_summary(df_errors)

    print(summary_dict)
    
    return evaluation_response, df_errors, summary_dict, usage_metadata_extract_errors, usage_metadata_semantic_eval
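
The analytics step hinges on the outer merge between the benchmark table and the AI table: steps that exist only on the AI side (decimal-numbered additions such as 8.1) end up with a missing `Benchmark` value and are labelled `Addition by model`. The toy example below walks through that logic without calling the model. All values are invented, and `toy_identify` and `toy_classify` are condensed restatements of `identify_error_type` and `classify_error_type` so that the block runs on its own.

```python
import pandas as pd

# Toy benchmark labels (ground truth) and toy AI labels; all values are illustrative.
toy_benchmark = pd.DataFrame({
    "Step": [1.0, 2.0, 3.0],
    "Benchmark": ["No Error", "Error", "Error"],
    "Class": ["N/A", "Omitted", "Error"],
})
toy_ai = pd.DataFrame({
    "Step": [1.0, 2.0, 2.1, 3.0],
    "AI Response": ["No Error", "Error", "Error", "No Error"],
    "AI Class": ["N/A", "Deviation", "Addition", "N/A"],
})

# The outer merge keeps step 2.1, which exists only on the AI side (NaN benchmark).
toy = pd.merge(toy_benchmark, toy_ai, on="Step", how="outer")


def toy_identify(row):
    """Condensed restatement of identify_error_type for this example."""
    if pd.isna(row["Benchmark"]):
        return "Addition by model" if row["AI Class"] == "Addition" else "Unknown"
    outcomes = {
        ("No Error", "No Error"): "No Error (Correctly Identified)",
        ("Error", "Error"): "Error (Correctly Identified)",
        ("Error", "No Error"): "False Negative",
        ("No Error", "Error"): "False Positive",
    }
    return outcomes.get((row["Benchmark"], row["AI Response"]), "Unknown")


def toy_classify(row):
    """Condensed restatement of classify_error_type for this example."""
    if row["Identification"] != "Error (Correctly Identified)":
        return "N/A"
    return "correct" if row["Class"] == row["AI Class"] else "incorrect"


toy["Identification"] = toy.apply(toy_identify, axis=1)
toy["Classification"] = toy.apply(toy_classify, axis=1)
print(toy[["Step", "Identification", "Classification"]])
```

In this toy data, step 2 is a correctly identified error with the wrong class, step 2.1 is an addition by the model, and step 3 is a false negative.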

In [10]:
def generate_lab_notes_prompt(protocol_video_example, protocol_example, lab_video_example, lab_notes_example, 
                      protocol_video_input, protocol_input, lab_video_input, proteomics_knowledge, key,
                      model_name="gemini-2.5-pro-preview-03-25", temperature=0.9):
    """
    Generate lab notes by comparing the written protocol with its actual execution in the lab video.
    
    Parameters:
    -----------
    protocol_video_example : list
        Example protocol video content (currently not added to the prompt)
    protocol_example : list
        Example protocol content
    lab_video_example : list
        Example lab video content
    lab_notes_example : list
        Example lab notes content
    protocol_video_input : list
        Input protocol video content to process (currently not added to the prompt)
    protocol_input : list
        Input protocol content to process
    lab_video_input : list
        Input lab video content to process
    proteomics_knowledge : Any
        Background knowledge on proteomics, passed to the model as a single input
    key : Any
        Currently unused in this function
    model_name : str, optional
        The model to use for generation, default is "gemini-2.5-pro-preview-03-25"
    temperature : float, optional
        Temperature parameter for generation, default is 0.9
        
    Returns:
    --------
    tuple
        A tuple containing the lab notes text and usage metadata
    """
    inputs = [
        """
        You are Professor Matthias Mann, a pioneering scientist in proteomics and mass spectrometry. Your professional identity is defined by your ability to be exact in your responses and to produce meticulous, accurate results that others can trust completely.

        ## ====== Background Knowledge (FOR REFERENCE ONLY) ======
        [These documents are for building your proteomics background knowledge and are not part of today's task.]
        """]
    inputs.extend([proteomics_knowledge])
    inputs.extend([
        """
        # Instruction

        You work with following two inputs:
        - Ground truth written protocol: The official procedure description
        - Video to evaluate: The actual implementation by a researcher in a routine setting. Be aware that researchers tend to make mistakes in routine tasks.

        Compare the 'Ground truth written protocol' with the 'Video to evaluate', and create a "resulting lab notes" that reflects what actually happened in the 'video to evaluate'.
        

        # Evaluation

        ## Rating rubrics for each step:
            1. It was followed correctly (no special notation needed)
            2. It was skipped: ❌ **Omitted:**
            3. It was carried out but wrongly: ❌ **Error:** (be specific about what happened)
            4. It was added: ➕ **Added:**
            5. It was carried out later in the procedure: ⚠️ **Deviation: Altered step order**
            6. A combination of 5. and the others: e.g. ⚠️ **Deviation: Altered step order** & ❌ **Omitted:**

        ## Follow this structured approach:

        * STEP 1: Read the 'Ground truth written protocol' thoroughly and write it down again word-by-word.

        * STEP 2: Go through the 'Video to evaluate' completely from beginning to end.
            - Document all observed actions with timestamps
         
        Table 1:
        | Timestamp | Visual/Audio Action |
        |---|---|
        | [hh:mm:ss] | [Description of action] |
        | [hh:mm:ss] | [Description of action] |

        * STEP 3: Systematic comparison
            - Go through the 'Ground truth written protocol' step by step as if it were a checklist
            - For each step, specifically search for evidence in Table 1
            - If a step is not present, scan the entire Table 1 to confirm it wasn't performed out of sequence
            - For each step, clearly state:
                * The evaluation of the step according to the rating rubrics
                * The specific visual/audio evidence (or lack thereof) supporting your determination
                * Precise timestamps from the 'Video to evaluate'
            - If any step is present in Table 1 but not in 'Ground truth written protocol': 
                * add this step in sequence
                * label it with the rating rubric '➕ **Added:**'
                * Number these steps using a decimal increment after the preceding step number
                * For example, if an addition appears after step 8, label it as step 8.1
                * If multiple additions appear after the same step, number them sequentially (8.1, 8.2, etc.)
            
         
        Table 2:
        | Step | Step Description | Timestamp in 'Video to evaluate' | Comparison Result | Notes |
        |---|---|---|---|---|
        | 1 | [Description of step in 'Ground truth written protocol'] | [hh:mm:ss] | [Aligned/Partially/Misaligned] | [Explanation] |
        | 2 | [Description of step in 'Ground truth written protocol'] | [hh:mm:ss], [hh:mm:ss] | [Aligned/Partially/Misaligned] | [Explanation] |

        * STEP 4: Create a "resulting lab notes" that accurately reflects what occurred in the 'Video to evaluate':
        - Rename sections as specified (Abstract to Aim, Expected Results to Results, Estimated timing to Timing)
        - Use past tense to describe actual observations
        - Include exact timing from the lab video
        - Remove instructional language and replace with observations
        - Omit Figures and References sections
                
        """
    ])
    
    inputs.extend(["""
        # ====== EXAMPLE (FOR REFERENCE ONLY) ======\n
        The following set of inputs and expected result should solely serve as an example and is not part of the evaluation task.\n
        """])
    inputs.extend(["## Example: 'Ground truth written protocol': \n"])
    inputs.extend(protocol_example)
    inputs.extend(["## Example: 'Video to evaluate': \n"])
    inputs.extend(lab_video_example)
    inputs.extend(["## Example - Expected result: 'resulting lab notes': \n"])
    inputs.extend(lab_notes_example)
    
    inputs.extend(["# ====== Beginning of EVALUATION TASK ====== \n"])
    inputs.extend(["## Important: The evaluation must be performed on the following video \n"])
    
    inputs.extend(["## Task: 'Ground truth written protocol': \n"])
    inputs.extend(protocol_input)
    inputs.extend(["## Task: 'Video to evaluate': \n"])
    inputs.extend(lab_video_input)
    inputs.extend([""" 
        As a reminder: Compare the 'Ground truth written protocol' against the 'Video to evaluate' to retrieve the 'resulting lab notes'. Your final output should clearly state which rating rubric was identified for each step in the 'resulting lab notes'.
        """])
    # print(inputs)
    
    lab_notes, usage_metadata = generate_content_from_model(
        inputs,
        model_name=model_name,
        temperature=temperature,
    )
    
    return lab_notes, usage_metadata
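
The decimal numbering rule for added steps in the prompt (8.1, 8.2, ...) can also be checked outside the model. The following is a minimal sketch, assuming observed actions arrive as `(step, description)` tuples where `step` is `None` for an action that is absent from the written protocol. `number_added_steps` is a hypothetical helper and not part of this notebook.

```python
def number_added_steps(observed):
    """Label observed actions; additions get a decimal suffix after the preceding step.

    `observed` is a list of (protocol_step_or_None, description) tuples in video order.
    None marks an action that is not part of the written protocol (assumed convention).
    """
    labels = []
    last_step = 0   # most recent protocol step number seen so far
    additions = 0   # number of additions since that step
    for step, description in observed:
        if step is None:
            additions += 1
            labels.append((f"{last_step}.{additions}", description))
        else:
            last_step, additions = step, 0
            labels.append((str(step), description))
    return labels


# Illustrative input only, not taken from the benchmark dataset
example = [
    (7, "Document positions"),
    (8, "Close lid"),
    (None, "Wipe bench"),
    (None, "Point at boxes"),
    (9, "Start run"),
]
for label, description in number_added_steps(example):
    print(label, description)
```

Such a helper could be used to verify that the step labels in a generated Table 2 follow the requested scheme.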

In [11]:
csv_path = '/Users/patriciaskowronek/Documents/proteomics_specialist/data/benchmark_dataset.csv'
protocol_videos_base = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/protocols"
lab_notes_videos_base = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/documentation"
markdown_base = "/Users/patriciaskowronek/Documents/proteomics_specialist/data"
prefix = "compare_protocol_video"

all_model_inputs = process_benchmark_dataset(csv_path, protocol_videos_base, lab_notes_videos_base, markdown_base, bucket, prefix)

Processed PlaceEvotips_docuCorrect
Processed PlaceEvotips_docuWrongPosition
Processed PlaceEvotips_docuLiquidNotChecked
Processed PlaceEvotips_docuBoxAngeled
Processed ConnectingColumnSampleLine_docuWithoutStandbyANDtimsControl
Processed ESIsourceToUltraSource_docuCorrect
Processed ESIsourceToUltraSource_docuFogotOvenPowerSupply
Processed UltraSourceToESIsource_docuCorrect
Processed UltraSourceToESIsource_docuForgotN2Line
Processed UltraSourceToESIsource_docuForgotGlovesANDCapillaryCap
Processed UltraSourceToESIsource_docuForgotCapillaryCap
Processed DisconnectingColumn_docuCorrect
Processed DisconnectingColumn_docuWithoutStandby
Processed TimsCalibration_docuCorrect
Processed TimsCalibration_docuCorrect_camera
Processed TimsCalibration_docuNotAllClicksVisibleOnVideo
Processed TimsCalibration_docuSavedMethod
Processed TimsCalibration_docuWrongOrderSteps
Processed QueueSamples_docuCorrect
Processed QueueSamples_docuWrongRow_S3A1Twice
Processed QueueSamples_docuNoBlankNoSampleIDWrongMSme

In [37]:
# analyze one specific video

subfolder_in_bucket = "knowledge"
path = "/Users/patriciaskowronek/Documents/proteomics_specialist/data/backgroundKnowledge.pdf"
file_uri = video_to_protocol.upload_video_to_gcs(
    path, bucket, subfolder_in_bucket
)
proteomics_knowledge = Part.from_uri(file_uri, mime_type="application/pdf")

example = 'Dilute_docuWrongVolume_PipettTipNotChanged'
protocol_video_example = all_model_inputs[example]['protocol_video_input']
protocol_example = all_model_inputs[example]['protocol_input']
lab_video_example = all_model_inputs[example]['lab_video_input']
lab_notes_example = all_model_inputs[example]['lab_notes_input']
copy_all_model_inputs = all_model_inputs.copy()
copy_all_model_inputs.pop(example)

items_list = list(copy_all_model_inputs.items())
key, value = items_list[0]
print(key)
            
lab_notes, usage_metadata = generate_lab_notes_prompt(
    protocol_video_example, protocol_example, lab_video_example, lab_notes_example,
    value['protocol_video_input'], value['protocol_input'], value['lab_video_input'], proteomics_knowledge, key,
    # model_name="gemini-2.5-pro-preview-03-25", temperature=0.9
    model_name="gemini-2.5-flash-preview-04-17", temperature=0.9
)
display(Markdown(lab_notes))

evaluation_response, df_errors, metrics, usage_metadata_extract_errors, usage_metadata_semantic_eval = process_and_evaluate_lab_notes(
    # value['error_dict'], value['lab_notes_input'], lab_notes, model_name="gemini-2.5-pro-preview-03-25", temperature=0.9
    value['error_dict'], value['lab_notes_input'], lab_notes, model_name="gemini-2.5-flash-preview-04-17", temperature=0.9
)
display(Markdown(evaluation_response))
display(df_errors)
print(usage_metadata)
print('usage_metadata_extract_errors', usage_metadata_extract_errors)
print('usage_metadata_semantic_eval', usage_metadata_semantic_eval)


PlaceEvotips_docuCorrect

Excellent. I shall proceed with the utmost precision and adherence to the provided instructions. Let us begin.

## Placing Evotips in Evotip Boxes on the Evosep One System

## Aim
This protocol describes the proper procedure for inspecting Evotips and placing Evotips in Evotip boxes on the liquid chromatography system Evosep One.

## Materials

### Equipment
- Evotips
  - Single-use stage tips for sample injection
  - Rack layout: Two columns (left and right)
  - Left column (top to bottom): S1, S2, S3
  - Right column (top to bottom): S4, S5, S6
  - Within each box: Standard 96-well format with A1 (top left), A12 (top right), H12 (bottom right)
- Evotip Boxes
  - 96-well format (A1-H12) (Figure 1)
- Evosep One System
  - Liquid chromatography system

### Reagents
- Formic acid (FA)
  ! CAUTION: This liquid may be corrosive. It is harmful and can cause damage if direct contact occurs.

### Reagent setup
- Buffer A: Consists of 0.1% (vol/vol) FA. The buffers are stable for at least 6 months at room temperature as long as they are protected from sunlight.

## Procedure
*Estimated timing: less than 1 minute*

1. Verify that Evotip box is filled to a minimum depth of 1 cm with Buffer A solution.
2. Place Evotip Box at S1 within the rack system of the Evosep instrument. Ensure each box is firmly seated in its designated position.
3. Place an empty Evotip Box for Blank tips at S3. Ensure each box is firmly seated in its designated position.
4. Inspect each Evotip before placement to verify its condition. Properly prepared Evotips should display a pale-colored SPE material disc with visible solvent above it (Figure 2).
   **CRITICAL STEP**: Discard any Evotips showing signs of dryness or displaying a white-colored disc, as these conditions indicate compromised functionality that could affect sample analysis.
5. Place the verified Evotips into the prepared Evotip boxes at S1 from A1 to A6.
6. Place empty Evotips, called Blanks, at S3 from A1 to A6.
7. Document the precise position of each placed Evotip.

## Expected Results
When the procedure is performed correctly, you should observe:
- Properly seated Evotip boxes in the rack system
- Visible Buffer A solution in boxes (1 cm depth)
- All non-blank Evotips showing pale-colored SPE material discs & clear solvent meniscus above each SPE disc of each Evotip
- Accurate documentation of tip positions: Evotips that are placed at S1 from A1 to A6 and blanks placed at S3 from A1 to A6.

## Figures

### Figure 1: Evosep positions
- Close-up of single Evotip box showing well positions (A1-H12)

### Figure 2: Evotip Quality Assessment
- Most Evotips: Properly hydrated Evotip with pale-colored disc and visible solvent
- Orange-highlighted Evotip: Compromised Evotip showing white/dry disc

## References
1. Evosep One - User Guide: https://www.evosep.com/wp-content/uploads/2024/06/Evosep-One-User-Guide-v18.pdf
2. Sample loading protocol for Evotips: https://www.evosep.com/wp-content/uploads/2020/03/Sample-loading-protocol.pdf

---

### Table 1: Video Action Documentation

| Timestamp | Visual/Audio Action                                  |
| :-------- | :--------------------------------------------------- |
| [00:00]   | Evosep One system is visible.                        |
| [00:03]   | Researcher picks up a yellow-topped Evotip box.      |
| [00:04]   | Researcher holds the box, showing its side and top.  |
| [00:08]   | Researcher moves the box towards the Evosep system. |
| [00:10]   | Researcher places the box onto the S1 position.      |
| [00:12]   | Researcher picks up a second yellow-topped box.     |
| [00:13]   | Researcher places the second box onto the S3 position. |
| [00:14]   | Researcher picks up a small clear plastic box containing Evotips. |
| [00:15]   | Researcher opens the clear box lid.                |
| [00:19]   | Researcher picks up one Evotip using forceps.      |
| [00:20]   | Researcher holds the Evotip above the Evosep rack. |
| [00:25]   | Researcher places the Evotip into position A1 of the box at S1. |
| [00:27]   | Researcher picks up a second Evotip using forceps.   |
| [00:28]   | Researcher holds the Evotip above the Evosep rack. |
| [00:30]   | Researcher places the second Evotip into position A2 of the box at S1. |
| [00:31]   | Researcher picks up a third Evotip using forceps.    |
| [00:32]   | Researcher holds the Evotip above the Evosep rack. |
| [00:38]   | Researcher places the third Evotip into position A1 of the box at S3. |
| [00:39]   | Researcher picks up a fourth Evotip using forceps.   |
| [00:40]   | Researcher places the fourth Evotip into position A3 of the box at S1. |
| [00:40]   | Researcher picks up a fifth Evotip using forceps.    |
| [00:42]   | Researcher places the fifth Evotip into position A2 of the box at S3. |
| [00:43]   | Researcher picks up a sixth Evotip using forceps.    |
| [00:44]   | Researcher places the sixth Evotip into position A3 of the box at S3. |
| [00:45]   | Researcher picks up a seventh Evotip using forceps.  |
| [00:46]   | Researcher places the seventh Evotip into position A4 of the box at S3. |
| [00:47]   | Researcher picks up an eighth Evotip using forceps.   |
| [00:48]   | Researcher places the eighth Evotip into position A5 of the box at S3. |
| [00:49]   | Researcher picks up a ninth Evotip using forceps.    |
| [00:50]   | Researcher places the ninth Evotip into position A6 of the box at S3. |
| [00:51]   | Researcher points to the boxes at S1 and S3.       |
| [00:52]   | Video ends.                                          |

---

### Table 2: Systematic Comparison

| Step | Step Description                                                                                                                               | Timestamp in 'Video to evaluate' | Comparison Result                                  | Notes                                                                                                                               |
| :--- | :--------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------- | :------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------- |
| 1    | Verify that Evotip box is filled to a minimum depth of 1 cm with Buffer A solution.                                                            | [00:04]                          | ❌ **Omitted:**                                    | The video shows the box appears to be filled with liquid, likely Buffer A, but the researcher does not perform an explicit verification of the liquid level depth. |
| 2    | Place Evotip Box at S1 within the rack system of the Evosep instrument. Ensure each box is firmly seated in its designated position.             | [00:10]                          | Aligned                                            | The first box is placed at the S1 position and appears firmly seated.                                                              |
| 3    | Place an empty Evotip Box for Blank tips at S3. Ensure each box is firmly seated in its designated position.                                    | [00:13]                          | ❌ **Error:**                                      | A box is placed at S3 and appears firmly seated. However, the box is not empty; it is the same type of box filled with tips as the one placed at S1. |
| 4    | Inspect each Evotip before placement to verify its condition. Properly prepared Evotips should display a pale-colored SPE material disc with visible solvent above it (Figure 2). **CRITICAL STEP**: Discard any Evotips showing signs of dryness or displaying a white-colored disc... | [00:19], [00:24], [00:28], [00:37], [00:39], [00:40], [00:43], [00:45], [00:47], [00:49] | ❌ **Omitted:**                                    | The researcher picks up and places Evotips using forceps, but there is no observable inspection of each tip's condition (SPE color, solvent meniscus) prior to placement. |
| 5    | Place the verified Evotips into the prepared Evotip boxes at S1 from A1 to A6.                                                               | [00:25], [00:27], [00:40]         | Partially Aligned (Incomplete)                     | Three Evotips are placed at S1 in positions A1, A2, and A3. The remaining positions A4, A5, and A6 in the S1 box were not filled in the video. The Evotips were not verified (see step 4). |
| 6    | Place empty Evotips, called Blanks, at S3 from A1 to A6.                                                                                      | [00:31], [00:38], [00:40], [00:42], [00:44], [00:45], [00:46], [00:47], [00:48], [00:49], [00:50] | ❌ **Error:**                                      | Evotips are placed at S3 in positions A1-A6. However, these do not appear to be empty Evotips intended as Blanks, but rather the same type of tips placed at S1. This contradicts the instruction to place 'empty Evotips, called Blanks'. |
| 7    | Document the precise position of each placed Evotip.                                                                                           | N/A                              | ❌ **Omitted:**                                    | No documentation of the positions is shown in the video.                                                                              |
| 7.1  | Pointed to the boxes at S1 and S3.                                                                                                           | [00:51]                          | ➕ **Added:**                                      | This action was performed but was not part of the written protocol.                                                                   |

---

## Resulting Lab Notes: Placing Evotips in Evotip Boxes on the Evosep One System

## Aim
The protocol describes the proper procedure for inspecting Evotips and placing Evotips in Evotip boxes on the liquid chromatography system Evosep One.

## Materials

### Equipment
- Evotips
  - Single-use stage tips for sample injection
  - Rack layout: Two columns (left and right)
  - Left column (top to bottom): S1, S2, S3
  - Right column (top to bottom): S4, S5, S6
  - Within each box: Standard 96-well format with A1 (top left), A12 (top right), H12 (bottom right)
- Evotip Boxes
  - 96-well format (A1-H12)
- Evosep One System
  - Liquid chromatography system

### Reagents
- Formic acid (FA)
  ! CAUTION: This liquid may be corrosive. It is harmful and can cause damage if direct contact occurs.

### Reagent setup
- Buffer A: Consists of 0.1% (vol/vol) FA. The buffers are stable for at least 6 months at room temperature as long as they are protected from sunlight.

## Procedure
*Timing: 52 seconds*

1. ❌ **Omitted:** Verification that Evotip box was filled to a minimum depth of 1 cm with Buffer A solution was not performed.
2. Placed Evotip Box at S1 within the rack system of the Evosep instrument. The box was firmly seated in its designated position.
3. ❌ **Error:** Placed a non-empty Evotip Box at S3 instead of an empty one intended for Blank tips. The box was firmly seated in its designated position.
4. ❌ **Omitted:** Inspection of each Evotip before placement to verify its condition (pale-colored SPE material disc with visible solvent above it) was not performed.
5. Partially Placed the Evotips into the boxes at S1 from A1 to A3. Positions A4 to A6 in the S1 box were not filled.
6. ❌ **Error:** Placed Evotips at S3 from A1 to A6. These did not appear to be empty Evotips designated as Blanks, but rather the same type of tips as those placed at S1.
7. ❌ **Omitted:** Documentation of the precise position of each placed Evotip was not performed.
7.1 ➕ **Added:** Pointed to the boxes at S1 and S3.

## Results
- Evotip boxes were placed at S1 and S3.
- The box placed at S3 was not empty, as specified for blanks.
- No explicit verification of Buffer A level was performed.
- No explicit inspection of individual Evotips for condition was performed before placement.
- Evotips were placed at S1 positions A1-A3 and at S3 positions A1-A6. These appeared to be the same type of tips.
- No documentation of tip positions was performed.
- The total observed time for the procedure was 52 seconds.

candidates {
  content {
    role: "model"
    parts {
      text: "| Step | AI Response | AI Class |\n|------|-------------|----------------|\n| 1 | Error | Omitted |\n| 2 | No Error | N/A |\n| 3 | Error | Error |\n| 4 | Error | Omitted |\n| 5 | No Error | N/A |\n| 6 | Error | Error |\n| 7 | Error | Omitted |\n| 7.1 | Error | Addition |"
    }
  }
  finish_reason: STOP
  avg_logprobs: -2.387176513671875
}
usage_metadata {
  prompt_token_count: 4474
  candidates_token_count: 100
  total_token_count: 5528
  prompt_tokens_details {
    modality: TEXT
    token_count: 4474
  }
  candidates_tokens_details {
    modality: TEXT
    token_count: 100
  }
}
model_version: "gemini-2.5-flash-preview-04-17"


2025-05-05 23:41:17,430 - __main__ - INFO - Successfully extracted and parsed JSON.
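
The log line above reports that a JSON payload was extracted from the model response. The notebook's own extraction code is not shown in this section, so the following is only a sketch of one way to do it, assuming the model wraps its answer in a Markdown code fence tagged `json`. `extract_json_block` is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this example readable


def extract_json_block(text):
    """Parse the JSON inside the first fenced block, or the whole text if unfenced."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Shortened response in the same shape as the model output (illustrative)
response_text = (
    f"{FENCE}json\n"
    '[\n  {"Step": "1", "AI Response": "Error", "AI Class": "Omitted"},\n'
    '  {"Step": "2", "AI Response": "No Error", "AI Class": "N/A"}\n]\n'
    f"{FENCE}"
)
rows = extract_json_block(response_text)
print(rows[0]["AI Class"])
```

`json.loads` raises `json.JSONDecodeError` on malformed payloads, so a caller would typically wrap this in the same retry logic used for generation.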


candidates {
  content {
    role: "model"
    parts {
      text: "```json\n[\n  {\n    \"Step\": \"1\",\n    \"AI Response\": \"Error\",\n    \"AI Class\": \"Omitted\"\n  },\n  {\n    \"Step\": \"2\",\n    \"AI Response\": \"No Error\",\n    \"AI Class\": \"N/A\"\n  },\n  {\n    \"Step\": \"3\",\n    \"AI Response\": \"Error\",\n    \"AI Class\": \"Error\"\n  },\n  {\n    \"Step\": \"4\",\n    \"AI Response\": \"Error\",\n    \"AI Class\": \"Omitted\"\n  },\n  {\n    \"Step\": \"5\",\n    \"AI Response\": \"No Error\",\n    \"AI Class\": \"N/A\"\n  },\n  {\n    \"Step\": \"6\",\n    \"AI Response\": \"Error\",\n    \"AI Class\": \"Error\"\n  },\n  {\n    \"Step\": \"7\",\n    \"AI Response\": \"Error\",\n    \"AI Class\": \"Omitted\"\n  },\n  {\n    \"Step\": \"7.1\",\n    \"AI Response\": \"Error\",\n    \"AI Class\": \"Addition\"\n  }\n]\n```"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.079208117398348724
}
usage_metadata {
  prompt_token_count: 261
  candidates_token_count:

  df_errors = pd.merge(df_error_benchmark, df_error_AI, on='Step', how='outer')


```markdown
## Lab notes Quality
| Criterion | Rating | Explanation |
|-----------|--------|-------------|
| Structure | Poor | The AI output includes several sections (Figures, References, Table 1, Table 2) in addition to the required Aim, Materials, Procedure, and Results sections. The instruction was to "keep only relevant sections". While the "Resulting Lab Notes" section itself has the correct structure, the overall output fails this criterion. |
| Tense | Excellent | The AI consistently used past tense to describe the actions observed in the procedure ("was not performed", "Placed", "was firmly seated"). |
| Language | Excellent | The AI successfully removed instructional language from the procedure steps and replaced it with observational language describing what was done or omitted ("Verification... was not performed", "Placed Evotip Box...", "Inspection... was not performed"). |
| Numbering | Excellent | The AI correctly maintained the step numbering (1-7) from the original protocol, even when noting omissions or errors. It also appropriately added a new step (7.1) observed in the procedure but not in the original protocol. |
| Timing | Excellent | The AI provided an exact actual timing (52 seconds) for the procedure as observed, which is the correct approach for lab notes, rather than an estimated time. |
```

Unnamed: 0,Step,Benchmark,Class,AI Response,AI Class,Identification,Classification
0,1.0,No Error,,Error,Omitted,False Positive,
1,2.0,No Error,,No Error,,No Error (Correctly Identified),
2,3.0,No Error,,Error,Error,False Positive,
3,4.0,No Error,,Error,Omitted,False Positive,
4,5.0,No Error,,No Error,,No Error (Correctly Identified),
5,6.0,No Error,,Error,Error,False Positive,
6,7.0,Error,Omitted,Error,Omitted,Error (Correctly Identified),correct
7,7.1,,,Error,Addition,Addition by model,
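
The table above combines the benchmark annotations with the errors reported by the model through an outer merge on `Step`. A minimal sketch of how the `Identification` column could be derived from such a merge is given here. The function `classify_identification` and the label strings `False Negative` and `Missing in AI output` are assumptions; the remaining labels are copied from the table.

```python
import pandas as pd


def classify_identification(row):
    """Compare benchmark and AI verdicts for one step (assumed labelling scheme)."""
    benchmark, ai = row["Benchmark"], row["AI Response"]
    if pd.isna(benchmark):
        return "Addition by model"          # step exists only in the AI output
    if pd.isna(ai):
        return "Missing in AI output"       # assumed label, not in the original table
    if benchmark == "Error" and ai == "Error":
        return "Error (Correctly Identified)"
    if benchmark == "No Error" and ai == "No Error":
        return "No Error (Correctly Identified)"
    if benchmark == "No Error" and ai == "Error":
        return "False Positive"
    return "False Negative"                 # benchmark error missed by the AI


# Illustrative subset of the steps shown above
df_benchmark = pd.DataFrame(
    {"Step": [1.0, 2.0, 7.0], "Benchmark": ["No Error", "No Error", "Error"]}
)
df_ai = pd.DataFrame(
    {"Step": [1.0, 2.0, 7.0, 7.1], "AI Response": ["Error", "No Error", "Error", "Error"]}
)

# The outer merge keeps steps that appear in only one of the two tables
df_merged = pd.merge(df_benchmark, df_ai, on="Step", how="outer")
df_merged["Identification"] = df_merged.apply(classify_identification, axis=1)
print(df_merged)
```

Using `how="outer"` is what allows step 7.1, which has no benchmark entry, to survive the merge and be flagged as an addition.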


prompt_token_count: 49172
candidates_token_count: 3165
total_token_count: 54174
prompt_tokens_details {
  modality: TEXT
  token_count: 2618
}
prompt_tokens_details {
  modality: AUDIO
  token_count: 3450
}
prompt_tokens_details {
  modality: VIDEO
  token_count: 35880
}
prompt_tokens_details {
  modality: DOCUMENT
  token_count: 7224
}
candidates_tokens_details {
  modality: TEXT
  token_count: 3165
}

usage_metadata_extract_errors prompt_token_count: 4474
candidates_token_count: 100
total_token_count: 5528
prompt_tokens_details {
  modality: TEXT
  token_count: 4474
}
candidates_tokens_details {
  modality: TEXT
  token_count: 100
}

usage_metadata_semantic_eval prompt_token_count: 3969
candidates_token_count: 276
total_token_count: 6173
prompt_tokens_details {
  modality: TEXT
  token_count: 3969
}
candidates_tokens_details {
  modality: TEXT
  token_count: 276
}



In [15]:
# analyze a sequence of videos

# Constants for retry logic
WAIT_TIME_BETWEEN_ITEMS = 10  # seconds
RETRY_WAIT_TIME = 120  # seconds
MAX_RETRIES = 3

CHECKPOINT_FILE = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/results/results_checkpoint.json"

def safe_json_dump(data, filename):
    """Write data to filename as JSON, converting non-serializable objects to strings."""
    def serialize(obj):
        if isinstance(obj, dict):
            return {k: serialize(v) for k, v in obj.items()}
        elif isinstance(obj, (list, tuple)):
            return [serialize(item) for item in obj]
        elif isinstance(obj, (int, float, str, bool)) or obj is None:
            return obj
        else:
            return str(obj)
    
    # Write to a temporary file first, then swap it in, so an interrupted
    # write cannot leave a truncated checkpoint behind.
    temp_file = f"{filename}.tmp"
    with open(temp_file, 'w') as f:
        json.dump(serialize(data), f)
    os.replace(temp_file, filename)

# Load checkpoint
results_collection = {}
last_processed_key = None
try:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            data = json.load(f)
            results_collection = data.get('results', {})
            last_processed_key = data.get('last_key', None)
        print(f"Loaded checkpoint. Last processed key: {last_processed_key}")
except Exception as e:
    print(f"Error loading checkpoint: {e}")

# Upload knowledge files to Google Cloud Storage
subfolder_in_bucket = "knowledge"
path = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/knowledge_base_selected/Connecting_or_disconnecting_column_2.pdf"
file_uri = video_to_protocol.upload_video_to_gcs(
    path, bucket, subfolder_in_bucket
)
proteomics_knowledge = Part.from_uri(file_uri, mime_type="application/pdf")

# example = 'ESIsourceToUltraSource_docuFogotOvenPowerSupply'
example = 'Dilute_docuWrongVolume_PipettTipNotChanged'
protocol_video_example = all_model_inputs[example]['protocol_video_input']
protocol_example = all_model_inputs[example]['protocol_input']
lab_video_example = all_model_inputs[example]['lab_video_input']
lab_notes_example = all_model_inputs[example]['lab_notes_input']
copy_all_model_inputs = all_model_inputs.copy()
copy_all_model_inputs.pop(example)

items_list = list(copy_all_model_inputs.items())
start_index = 0 if not last_processed_key else next((i + 1 for i, (k, _) in enumerate(items_list) if k == last_processed_key), 0)

for i in range(start_index, len(items_list)):
    key, value = items_list[i]
    
    for attempt in range(MAX_RETRIES):
        try:
            print(f"Processing {key} (attempt {attempt + 1})")
            
            start_generate_time = time.time()
            lab_notes, usage_metadata = generate_lab_notes_prompt(
                protocol_video_example, protocol_example, lab_video_example, lab_notes_example,
                value['protocol_video_input'], value['protocol_input'], value['lab_video_input'], proteomics_knowledge, key,
                model_name="gemini-2.5-pro-preview-03-25", temperature=0.9
            )
            end_generate_time = time.time()
            generate_time = end_generate_time - start_generate_time
            print(f"Time to generate lab notes: {generate_time:.2f} seconds")

            display(Markdown(lab_notes))
            
            start_evaluate_time = time.time()
            evaluation_response, df_errors, metrics, usage_metadata_extract_errors, usage_metadata_semantic_eval = process_and_evaluate_lab_notes(
                value['error_dict'], value['lab_notes_input'], lab_notes
            )
            end_evaluate_time = time.time()
            evaluate_time = end_evaluate_time - start_evaluate_time
            print(f"Time to process and evaluate lab notes: {evaluate_time:.2f} seconds")
            
            display(Markdown(evaluation_response))
            display(df_errors)
            
            # Store results
            results_collection[key] = {
                "inputs": {"experiment_name": key, **{k: v for k, v in value.items()}},
                "outputs": {
                    "lab_notes": lab_notes, 
                    "lab_notes_usage_metadata": usage_metadata,
                    "lab_notes_generate_time": generate_time,
                    "evaluation": evaluation_response, 
                    "eval_usage_metadata_extract_error": usage_metadata_extract_errors,
                    "eval_usage_metadata_semantic": usage_metadata_semantic_eval,
                    "eval_generate_time": evaluate_time,
                    "metrics": metrics
                }
            }
            
            safe_json_dump({"last_key": key, "results": results_collection}, CHECKPOINT_FILE)
            
            print(f"Waiting {WAIT_TIME_BETWEEN_ITEMS} seconds before next item...")
            time.sleep(WAIT_TIME_BETWEEN_ITEMS)
            break  # Success, exit retry loop
            
        except Exception as e:
            print(f"Error processing {key}: {e}")
            if attempt < MAX_RETRIES - 1:
                print(f"Waiting {RETRY_WAIT_TIME} seconds before retry...")
                time.sleep(RETRY_WAIT_TIME)
            else:
                print(f"Max retries reached for {key}, moving to next item")
                safe_json_dump({"last_key": key, "results": results_collection}, CHECKPOINT_FILE)

try:
    timestamp = time.time()
    safe_json_dump(results_collection, f"/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/results/results_checkpoint_{timestamp}.json")
    print("All processing complete. Final results saved.")
except Exception as e:
    print(f"Error saving final results: {e}")

ConnectionError: ('Connection aborted.', TimeoutError(60, 'Operation timed out'))
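
A run that aborts with an error like the one above can be resumed from the checkpoint. The resume logic in the cell is condensed into one expression; the sketch below isolates it in a hypothetical helper, `resume_index`, to make its behaviour easier to inspect. The key names are taken from the processed list earlier in the notebook.

```python
def resume_index(keys, last_processed_key):
    """Return the index of the first key to process after the checkpointed key.

    Falls back to 0 when there is no checkpoint or the key is not found.
    """
    if not last_processed_key:
        return 0
    return next(
        (i + 1 for i, k in enumerate(keys) if k == last_processed_key),
        0,
    )


keys = [
    "PlaceEvotips_docuCorrect",
    "PlaceEvotips_docuWrongPosition",
    "PlaceEvotips_docuLiquidNotChecked",
]
print(resume_index(keys, None))                              # 0
print(resume_index(keys, "PlaceEvotips_docuWrongPosition"))  # 2
print(resume_index(keys, "not_in_list"))                     # 0
```

Because the cell also writes `last_key` after the maximum number of retries is reached, an item that failed permanently is skipped on the next run rather than retried. If failed items should be retried on resume, one option would be to store them in a separate list in the checkpoint.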

In [None]:
results_collection

{'PlaceEvotips_docuCorrect': {'inputs': {'experiment_name': 'PlaceEvotips_docuCorrect',
   'protocol_video_input': [file_data {
      mime_type: "video/mp4"
      file_uri: "gs://mannlab_videos/compare_protocol_video/PlaceEvotips_protocolCorrect.MP4"
    }],
   'protocol_input': [file_data {
      mime_type: "text/md"
      file_uri: "gs://mannlab_videos/compare_protocol_video/PlaceEvotips_protocolCorrect.md"
    }],
   'lab_video_input': [file_data {
      mime_type: "video/mp4"
      file_uri: "gs://mannlab_videos/compare_protocol_video/PlaceEvotips_docuCorrect.MP4"
    }],
   'lab_notes_input': [file_data {
      mime_type: "text/md"
      file_uri: "gs://mannlab_videos/compare_protocol_video/PlaceEvotips_docuCorrect.md"
    }],
   'error_dict': '[\n{"Step": 1, "Benchmark": "No Error", "Class": "N/A"},\n{"Step": 2, "Benchmark": "No Error", "Class": "N/A"},\n{"Step": 3, "Benchmark": "No Error", "Class": "N/A"},\n{"Step": 4, "Benchmark": "No Error", "Class": "N/A"},\n{"Step": 5, "Benc

In [None]:
def load_json_file(filename):
    """Load the JSON file that was saved with safe_json_dump"""
    with open(filename, 'r') as f:
        return json.load(f)

def flatten_dict(nested_dict, prefix=''):
    """Flatten nested dicts into a single level, joining nested keys with '_'."""
    flattened = {}
    for key, value in nested_dict.items():
        if isinstance(value, dict):
            flattened.update(flatten_dict(value, f"{prefix}{key}_"))
        else:
            flattened[f"{prefix}{key}"] = value
    return flattened

loaded_data = load_json_file("/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/results/20250504_results_gemini_2.0-flash.json")
    
flattened_data = [flatten_dict(data) for data in loaded_data.values()]
df = pd.DataFrame(flattened_data)
df_subset = df[['inputs_experiment_name',
       'outputs_metrics_Error Identification Statistics_Steps evaluated',
       'outputs_metrics_Error Identification Statistics_Correct identifications',
       'outputs_metrics_Error Identification Statistics_Correct error identifications',
       'outputs_metrics_Error Identification Statistics_False positive count',
       'outputs_metrics_Error Identification Statistics_False negative count',
       'outputs_metrics_Error Classification Statistics_Total errors analyzed',
       'outputs_metrics_Error Classification Statistics_Correctly classified errors',]]

new_columns = ['experiment_name', 'Steps evaluated',
        'Correct identifications',  'Correct error identifications', 
       'False positive count', 'False negative count',
       'Errors analyzed', 'Correctly classified errors',
       ]

# ToDo: Correct table entries where the model was hallucinating
# df_with_summary_stats.loc[df_with_summary_stats['experiment_name'] == 'ESIsourceToUltraSource_docuFogotOvenPowerSupply', 
#     ['Correct identifications', 'False positive count', 'False negative count', 'Correct error identifications']] = [20, 9, 0, 0]
# df_with_summary_stats

# ToDo: Calculate summary statistics after table is corrected
# precision = correct_identifications / total_steps if correct_identifications > 0 else 0
# recall = correctly_identified_errors / error_count if error_count > 0 else 0
# classification_accuracy = correctly_classified_errors / total_errors_analyzed if total_errors_analyzed > 0 else 0

# 'Identification precision', 'Error recall rate', 'Classification accuracy'

df_subset.columns = new_columns
df_subset = df_subset.replace('N/A', 0)

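# NOTE: 'Identification accuracy', 'Error recall rate' and 'Classification accuracy'
# are not in new_columns; they must be added to df_subset before this cell runs.
summary_stats = pd.Series({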
    'experiment_name': 'Summary',
    'Steps evaluated': df_subset['Steps evaluated'].sum(),
    'Correct identifications': df_subset['Correct identifications'].sum(),
    'Identification accuracy': df_subset['Identification accuracy'].mean(),
    'Error recall rate': df_subset['Error recall rate'].mean(),
    'False positive count': df_subset['False positive count'].sum(),
    'False negative count': df_subset['False negative count'].sum(),
    'Errors analyzed': df_subset['Errors analyzed'].sum(),
    'Correctly classified errors': df_subset['Correctly classified errors'].sum(),
    'Classification accuracy': df_subset['Classification accuracy'].mean()
})

df_with_summary_stats = pd.concat([df_subset, pd.DataFrame([summary_stats])], ignore_index=True)
df_with_summary_stats

,experiment_name,Steps evaluated,Correct identifications,Identification accuracy,Error recall rate,False positive count,False negative count,Errors analyzed,Correctly classified errors,Classification accuracy
0,PlaceEvotips_docuCorrect,7,4,0.571429,0.0,3,0,0,0,0.0
1,PlaceEvotips_docuWrongPosition,7,3,0.428571,0.0,3,1,0,0,0.0
2,PlaceEvotips_docuLiquidNotChecked,7,5,0.714286,0.0,0,2,0,0,0.0
3,PlaceEvotips_docuBoxAngeled,7,1,0.142857,0.0,5,1,0,0,0.0
4,ConnectingColumnSampleLine_docuWithoutStandbyA...,14,12,0.857143,0.8,1,1,4,4,1.0
5,ESIsourceToUltraSource_docuCorrect,29,20,0.689655,0.0,9,0,0,0,0.0
6,ESIsourceToUltraSource_docuFogotOvenPowerSupply,29,22,0.758621,0.0,0,7,0,0,0.0
7,UltraSourceToESIsource_docuCorrect,25,22,0.88,0.0,3,0,0,0,0.0
8,UltraSourceToESIsource_docuForgotN2Line,25,20,0.8,1.0,5,0,1,1,1.0
9,UltraSourceToESIsource_docuForgotGlovesANDCapi...,25,18,0.72,0.5,6,1,1,1,1.0
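
The summary row above averages `Identification accuracy`, `Error recall rate`, and `Classification accuracy`, but the column selection shown in the cell does not create them. The ToDo formulas suggest they are ratios of the count columns. The following is a minimal sketch under that assumption: `safe_ratio` and `rate_metrics` are hypothetical helpers, and treating the error count as correct error identifications plus false negatives is an assumption, not something the notebook states. Unlike the ToDo formulas, the guard checks the denominator, since that is the value that can cause a division by zero.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is not positive."""
    return numerator / denominator if denominator > 0 else 0.0


def rate_metrics(steps_evaluated, correct_identifications,
                 correct_error_identifications, false_negatives,
                 errors_analyzed, correctly_classified_errors):
    """Derive the three rate columns from the count columns of one experiment."""
    # Assumption: every real error is either found or missed (a false negative).
    error_count = correct_error_identifications + false_negatives
    return {
        "Identification accuracy": safe_ratio(correct_identifications, steps_evaluated),
        "Error recall rate": safe_ratio(correct_error_identifications, error_count),
        "Classification accuracy": safe_ratio(correctly_classified_errors, errors_analyzed),
    }


# Counts from the ConnectingColumnSampleLine row; the 4 correct error
# identifications are assumed, because that column is not in the table.
metrics = rate_metrics(14, 12, 4, 1, 4, 4)
```

With these inputs the sketch gives roughly 0.857, 0.8, and 1.0, which agrees with that row of the table. To use it on the DataFrame, the function could be applied row by row after the `'N/A'` values have been replaced with 0.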


In [None]:
process_and_evaluate_lab_notes(
    value['error_dict'], value['lab_notes_input'], lab_notes_example
)

2025-05-01 09:34:05,727 - __main__ - ERROR - Error during content generation
Traceback (most recent call last):
  File "/var/folders/54/g1_1ycl12hl02xj_g_nm_6cm0000gn/T/ipykernel_56368/3333729111.py", line 48, in generate_content_from_model
    response = model.generate_content(
  File "/Users/patriciaskowronek/Conda/miniconda3/envs/docu_test/lib/python3.9/site-packages/vertexai/generative_models/_generative_models.py", line 695, in generate_content
    return self._generate_content(
  File "/Users/patriciaskowronek/Conda/miniconda3/envs/docu_test/lib/python3.9/site-packages/vertexai/generative_models/_generative_models.py", line 812, in _generate_content
    request = self._prepare_request(
  File "/Users/patriciaskowronek/Conda/miniconda3/envs/docu_test/lib/python3.9/site-packages/vertexai/generative_models/_generative_models.py", line 3387, in _prepare_request
    request_v1beta1 = super()._prepare_request(
  File "/Users/patriciaskowronek/Conda/miniconda3/envs/docu_test/lib/python3

ValueError: Failed to generate content: Unexpected item type: [file_data {
  mime_type: "text/md"
  file_uri: "gs://mannlab_videos/compare_protocol_video/Dilute_docuWrongVolume_PipettTipNotChanged.md"
}
].Only types that represent a single Content or a single Part are supported here.
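
The square brackets around `file_data {...}` in the ValueError suggest that one element of the list passed to `model.generate_content` was itself a list rather than a single `Part`. A minimal sketch of one possible remedy, assuming that nesting is the cause: flatten the inputs before the call. `flatten_contents` is a hypothetical helper, and plain strings stand in for `Part` objects here.

```python
def flatten_contents(items):
    """Flatten arbitrarily nested lists/tuples into one flat list of items."""
    flat = []
    for item in items:
        if isinstance(item, (list, tuple)):
            # Recurse so that [[part]] and [part] both end up as part
            flat.extend(flatten_contents(item))
        else:
            flat.append(item)
    return flat


# Strings stand in for Part objects; the nested list mimics the failing input.
inputs = ["prompt text", ["protocol_part"], "video_part"]
contents = flatten_contents(inputs)
```

The flattened `contents` list could then be passed to `model.generate_content` in place of the nested one.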

In [None]:
lab_notes_example = "Alright, here is the documentation following your specifications:\n\n## Documentation:# Change source: ESI source to UltraSource\n\n## Abstract\nThis protocol describes the procedure for switching from the ESI source to UltraSource.\n\n## Materials\n\n### Equipment\n- timsTOF Ultra Mass Spectrometer:\n  - ESI ion source\n  - UltraSource ion source \n- IonOpticks Column\n- Evosep One LC System with sample line\n- NanoViper Adapter (black)\n- Pliers\n\n## Procedure\n\n*Estimated timing: less than 10 minute*\n\n### Switch timsTOF to standby\n\n1. ✓ Verified the instrument was on standby mode\n2. ✓ Verified the syringe was inactive\n3. ✓ Selected 'CaptiveSpray' but did not activate it yet\n\n### Remove ESI source\n\n4. ✓ Disconnected the peak connector of the sample tubing\n5. ✓ Disconnected the nebulizer N₂ line\n6. ✓ Removed the source door. Hinged it out\n7. ❌ **Omitted:** Put on gloves after removing source door\n8. ✓ Removed the spray shield, and capillary cap.\n9. ⚠️ **Deviation:** Inspected the capillary position and gently pushed it back into proper position \n\n### Mount UltraSource\n\n10. ✓ Hinged the UltraSource door in and closed it \n11. ✓ Slid the UltraSource housing onto the source door and secured it by flipping the handles\n12. ✓ Connected the filter tubing to the source\n\n### Connect column and sample line\n\n13. ✓ Noted an IonOpticks column already inside UltraSource \n14. ✓ Noted the LC sample line had NanoViper adapter already attached\n15. ❌ **Omitted:** No need to snipp access liquid\n16. ✓ Held the column fititng of the IonOpticks column with a pliers.\n17. ✓ Hand-tightened the NanoViper of the LC sample line with the column fitting \n18. ✓ Drew the oven closer to the UltraSource, and secured it \n19. ✓ Removed the NanoViper adapter \n20. ✓ Placed the metal grounding screw\n21. ✓ Closed the lid of the oven\n22. ✓ Connected the oven to the electrical power supply\n23. ✓ Noted that with the correct temperature\n\n### Switch timsTOF to operate and idle flow\n\n24. ✓ Noted the CaptiveSpray function in timsControl had been activated.\n25. ✓ Noted that the instrument was on the operational mode\n26. ✓ Noted the idle flow was active\n27. ✓ Stay in timsControl\n28. ⚠️ **Deviation:** Checked the MS signal. Noted it needed to be adjusted to between 9-11 mbar\n\n## Expected Results\n- In timsControl, signal intensity should be above 10^7\n- Stable signal in in timsControl\n\n"

display(Markdown(lab_notes_example))

Alright, here is the documentation following your specifications:

## Documentation:# Change source: ESI source to UltraSource

## Abstract
This protocol describes the procedure for switching from the ESI source to UltraSource.

## Materials

### Equipment
- timsTOF Ultra Mass Spectrometer:
  - ESI ion source
  - UltraSource ion source 
- IonOpticks Column
- Evosep One LC System with sample line
- NanoViper Adapter (black)
- Pliers

## Procedure

*Estimated timing: less than 10 minute*

### Switch timsTOF to standby

1. ✓ Verified the instrument was on standby mode
2. ✓ Verified the syringe was inactive
3. ✓ Selected 'CaptiveSpray' but did not activate it yet

### Remove ESI source

4. ✓ Disconnected the peak connector of the sample tubing
5. ✓ Disconnected the nebulizer N₂ line
6. ✓ Removed the source door. Hinged it out
7. ❌ **Omitted:** Put on gloves after removing source door
8. ✓ Removed the spray shield, and capillary cap.
9. ⚠️ **Deviation:** Inspected the capillary position and gently pushed it back into proper position 

### Mount UltraSource

10. ✓ Hinged the UltraSource door in and closed it 
11. ✓ Slid the UltraSource housing onto the source door and secured it by flipping the handles
12. ✓ Connected the filter tubing to the source

### Connect column and sample line

13. ✓ Noted an IonOpticks column already inside UltraSource 
14. ✓ Noted the LC sample line had NanoViper adapter already attached
15. ❌ **Omitted:** No need to snipp access liquid
16. ✓ Held the column fititng of the IonOpticks column with a pliers.
17. ✓ Hand-tightened the NanoViper of the LC sample line with the column fitting 
18. ✓ Drew the oven closer to the UltraSource, and secured it 
19. ✓ Removed the NanoViper adapter 
20. ✓ Placed the metal grounding screw
21. ✓ Closed the lid of the oven
22. ✓ Connected the oven to the electrical power supply
23. ✓ Noted that with the correct temperature

### Switch timsTOF to operate and idle flow

24. ✓ Noted the CaptiveSpray function in timsControl had been activated.
25. ✓ Noted that the instrument was on the operational mode
26. ✓ Noted the idle flow was active
27. ✓ Stay in timsControl
28. ⚠️ **Deviation:** Checked the MS signal. Noted it needed to be adjusted to between 9-11 mbar

## Expected Results
- In timsControl, signal intensity should be above 10^7
- Stable signal in in timsControl
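
The example lab note marks each numbered step with ✓, ❌, or ⚠️. As a rough illustration of how such notes could feed the analytics step, here is a minimal sketch that counts steps per marker. `count_step_statuses` and the labels `followed`, `omitted`, and `deviation` are hypothetical names chosen for this sketch, not part of the notebook.

```python
import re
from collections import Counter

# Marker -> label; the labels are illustrative names, not taken from the notebook.
STATUS_LABELS = {"\u2713": "followed", "\u274c": "omitted", "\u26a0": "deviation"}

# A numbered step looks like "7. ❌ **Omitted:** ...". The warning sign may be
# followed by the variation selector U+FE0F; matching the base character suffices.
STEP_PATTERN = re.compile(r"^[ \t]*(\d+)\.[ \t]*([\u2713\u274c\u26a0])", re.MULTILINE)


def count_step_statuses(lab_notes):
    """Count numbered steps per status marker in a lab-note Markdown string."""
    return Counter(STATUS_LABELS[marker] for _, marker in STEP_PATTERN.findall(lab_notes))


sample_notes = (
    "1. \u2713 Verified the instrument was on standby mode\n"
    "2. \u274c **Omitted:** Put on gloves\n"
    "3. \u26a0\ufe0f **Deviation:** Inspected the capillary position\n"
    "- Pliers\n"
)
status_counts = count_step_statuses(sample_notes)
```

Bullet lines without a step number, such as the equipment list, are ignored. Calling `count_step_statuses(lab_notes_example)` would give the corresponding counts for the full example note.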



In [None]:
# Useful helper function

def check_file_exists(file_path):
    """Print whether file_path exists on the local file system."""
    if os.path.exists(file_path):
        print(f"File found: {file_path}")
    else:
        print(f"Error: File not found: {file_path}")