# Documentation assistant

This notebook demonstrates a documentation assistant that converts lab videos into documentation using Vertex AI.

Converting a video into documentation involves three steps:
1. Protocol finder: select the protocol that best captures the procedure performed in the video.
2. Video-to-protocol comparison: compare the video against the ground-truth protocol to produce lab documentation and a list of errors in the procedure.
3. Analytics on a benchmark dataset: automatically compare the errors found by the documentation assistant with the actual errors.

This notebook focuses on steps two and three.

In [2]:
from __future__ import annotations

# %load_ext autoreload
%reload_ext autoreload
%autoreload 2

import configparser
import os
import sys
from pathlib import Path

from IPython.display import Markdown

path_to_append = Path(Path.cwd()).parent / "proteomics_specialist"
sys.path.append(str(path_to_append))
import video_to_protocol

config = configparser.ConfigParser()
config.read("../secrets.ini")

['../secrets.ini']

In [3]:
import vertexai

config = configparser.ConfigParser()
config.read("../secrets.ini")

PROJECT_ID = config["DEFAULT"]["PROJECT_ID"]
vertexai.init(project=PROJECT_ID, location="europe-west9")  # europe-west9 is Paris
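
`ConfigParser.read` returns the list of files it managed to parse and silently skips missing ones, so a wrong path only surfaces later as a `KeyError`. The following is a minimal sketch of a stricter lookup. The file contents and the helper name `read_project_id` are assumptions for illustration; the real `secrets.ini` is not shown in this notebook.

```python
import configparser

# Hypothetical file contents; the real secrets.ini is not part of this notebook.
EXAMPLE_INI = """
[DEFAULT]
PROJECT_ID = my-example-project
"""


def read_project_id(ini_text: str) -> str:
    """Return PROJECT_ID from the DEFAULT section, failing loudly if it is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    project_id = parser["DEFAULT"].get("PROJECT_ID")
    if not project_id:
        raise KeyError("PROJECT_ID is missing from the [DEFAULT] section")
    return project_id


print(read_project_id(EXAMPLE_INI))
```
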

In [4]:
from google.cloud import storage

os.environ["GOOGLE_CLOUD_PROJECT"] = config["DEFAULT"]["PROJECT_ID"]

# Initialize Cloud Storage client
storage_client = storage.Client()
bucket_name = "mannlab_videos"
bucket = storage_client.bucket(bucket_name)

In [None]:
# Model used in the cells below: gemini-2.0-flash

In [None]:
# Upload the example video together with its protocol and its benchmark documentation.
upload_docu = [
    "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/documentation/TimsCalibration_docuSavedMethod.mov",
    "/Users/patriciaskowronek/Documents/proteomics_specialist/data/TimsCalibration_protocolCorrect.md",
    "/Users/patriciaskowronek/Documents/proteomics_specialist/data/TimsCalibration_docuSavedMethod.md",
]
for file_path in upload_docu:
    video_to_protocol.upload_video_to_gcs(file_path, bucket, "compare_protocol_video")
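
The implementation of `upload_video_to_gcs` is not shown here. Judging from the `file_uri` printed further down in this notebook, the returned URI appears to follow the pattern `gs://<bucket>/<folder>/<file name>`. The sketch below only reproduces that naming pattern; `expected_gcs_uri` is a hypothetical helper and performs no upload.

```python
from pathlib import PurePosixPath


def expected_gcs_uri(bucket_name: str, folder: str, local_path: str) -> str:
    """Build gs://<bucket>/<folder>/<file name> for a local file path (assumed layout)."""
    file_name = PurePosixPath(local_path).name
    return f"gs://{bucket_name}/{folder}/{file_name}"


example_uri = expected_gcs_uri(
    "mannlab_videos",
    "compare_protocol_video",
    "/Users/patriciaskowronek/Documents/proteomics_specialist/data/QueueSamples_docuWrongRow_S3A1Twice.md",
)
print(example_uri)
```
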


In [25]:
from vertexai.generative_models import Part

# Example
video_path = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/documentation/TimsCalibration_docuSavedMethod.mov"
video_uri = video_to_protocol.upload_video_to_gcs(video_path, bucket, "compare_protocol_video")
video_input = [
    "## Lab video:",
    Part.from_uri(video_uri, mime_type="video/mp4"),
]

path = "/Users/patriciaskowronek/Documents/proteomics_specialist/data/TimsCalibration_protocolCorrect.md"
uri = video_to_protocol.upload_video_to_gcs(path, bucket, "compare_protocol_video")
protocol_input = [
    "## Protocol:",
    Part.from_uri(uri, mime_type="text/md"),
]

path = "/Users/patriciaskowronek/Documents/proteomics_specialist/data/TimsCalibration_docuSavedMethod.md"
uri = video_to_protocol.upload_video_to_gcs(path, bucket, "compare_protocol_video")
documentation_input = [
    "## Documentation:",
    Part.from_uri(uri, mime_type="text/md"),
]

In [26]:
# Task

video_path = "/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/documentation/QueueSamples_docuWrongRow_S3A1Twice.mov"
video_uri = video_to_protocol.upload_video_to_gcs(video_path, bucket, "compare_protocol_video")
video_input2 = [
    "## Lab video:",
    Part.from_uri(video_uri, mime_type="video/mp4"),
]

path = "/Users/patriciaskowronek/Documents/proteomics_specialist/data/QueueSamples_protocolCorrect.md"
uri = video_to_protocol.upload_video_to_gcs(path, bucket, "compare_protocol_video")
protocol_input2 = [
    "## Protocol:",
    Part.from_uri(uri, mime_type="text/md"),
]

path = "/Users/patriciaskowronek/Documents/proteomics_specialist/data/QueueSamples_docuWrongRow_S3A1Twice.md"
uri = video_to_protocol.upload_video_to_gcs(path, bucket, "compare_protocol_video")
documentation_input2 = [
    "## Documentation:",
    Part.from_uri(uri, mime_type="text/md"),
]
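
The cells above pass `video/mp4` for `.mov` files and `text/md` for Markdown files, and the calls in this run went through with those values. The registered media types for these extensions are `video/quicktime` and `text/markdown`. A minimal sketch of deriving the type from the file extension follows; the mapping and the helper name `guess_mime_type` are assumptions, and which values Vertex AI accepts should be checked against its list of supported types.

```python
from pathlib import Path

# Registered media types by file extension (an illustrative subset).
MIME_BY_SUFFIX = {
    ".mov": "video/quicktime",
    ".mp4": "video/mp4",
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def guess_mime_type(path: str) -> str:
    """Look up the media type for a path, raising on unknown extensions."""
    suffix = Path(path).suffix.lower()
    try:
        return MIME_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No MIME type configured for {suffix!r}") from None


print(guess_mime_type("QueueSamples_docuWrongRow_S3A1Twice.mov"))
print(guess_mime_type("QueueSamples_protocolCorrect.md"))
```
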

In [None]:
from vertexai.generative_models import GenerativeModel, GenerationConfig

inputs = [
    """
    You are Professor Matthias Mann, a pioneering scientist in proteomics and mass spectrometry.
    
    # Your Task:
    Compare the original protocol with the actual implementation shown in a video, and create a corrected documentation that reflects what actually happened. 
    
    Your documentation should follow these guidelines:
    1. Keep only relevant sections: Aim, Materials, Procedure, Results
    2. Use past tense to describe what actually happened, not what should happen
    3. Remove all instructional language and replace with observations
    4. Maintain step numbering of the original protocol even if the order is changed (1, 3, 2, ...)
    5. Include exact actual timing, not estimated timing

    Use these consistent symbols to indicate step status:
    - ✓ (Followed correctly with no special notation needed)
    - ❌ **Error:** (When something was done incorrectly - be specific about what happened)
    - ❌ **Omitted:** (When a step was completely skipped)
    - ⚠️ **Deviation:** (When a step was followed differently than prescribed)
    - ➕ **Added:** (When a new step not in the protocol was performed)

    # Example
    """
]
inputs.extend(video_input)
inputs.extend(protocol_input)
inputs.extend(documentation_input)

inputs.extend(
    ["""
    # Your task now
    Provide me with a documentation as in the example above.
    """]
)

inputs.extend(video_input2)
inputs.extend(protocol_input2)

inputs.append(
    "Output: Correct documentation"
)

model = GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    inputs, 
    generation_config=GenerationConfig(
        temperature=0.9,
        # audio_timestamp=True # Supported if only one video or audio is used
    )
)

print(video_path)
documentation = response.text
print(response.usage_metadata)
Markdown(documentation)

/Users/patriciaskowronek/Documents/documentation_agent_few_shot_examples/benchmark_dataset/documentation/QueueSamples_docuWrongRow_S3A1Twice.mov
prompt_token_count: 53314
candidates_token_count: 647
total_token_count: 53961
prompt_tokens_details {
  modality: AUDIO
  token_count: 4425
}
prompt_tokens_details {
  modality: VIDEO
  token_count: 46020
}
prompt_tokens_details {
  modality: TEXT
  token_count: 2869
}
candidates_tokens_details {
  modality: TEXT
  token_count: 647
}
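
The per-modality counts printed above add up to the reported `prompt_token_count`, and the two videos dominate the prompt at about 86% of the tokens. A small sketch of that breakdown, with the numbers copied by hand from the output above rather than read from the response object:

```python
# Per-modality prompt token counts copied from the usage metadata printed above.
prompt_tokens = {"AUDIO": 4425, "VIDEO": 46020, "TEXT": 2869}
reported_prompt_total = 53314

total = sum(prompt_tokens.values())
assert total == reported_prompt_total

# Print the modalities from largest to smallest share.
for modality, count in sorted(prompt_tokens.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{modality:<5} {count:>6} tokens  {count / total:6.1%}")
```
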



Okay, Professor Mann, here's the corrected documentation based on the provided video, reflecting the actual actions performed and their timing.

## Documentation:# Queue and measure samples in HyStar

## Aim
Queueing samples in HyStar for LC-MS measurement.

## Materials

### Software
HyStar 6.0

## Procedure
Timing: 3 minutes

Prerequisite 1. ✓ Mentioned that 5 ng HeLa Evotips were placed at S1 from A1 to A6 and blanks at S3 from A1 to A6.

Prerequisite 2. ✓ Reported that the TIMS device had already been calibrated.

1. ✓ Navigated to the 'Acquisition' tab in HyStar.
2. ✓ Selected an already existing sample table by pressing the arrow down button when hovering over the sample table name in the left sample table column.

3. ❌ **Omitted:** Copied already existing sample table entries to modify them
4. ⚠️ **Deviation:** Manually Adjusted the sample ID without following this pattern: currentDate_massSpec_user_sampleType_projectID_ sampleName, instead created "THMS50tcep_PAlk_SA_blank".

5. ⚠️ **Deviation:** The queue did not contain three dda-PASEF or three dia-PASEF runs. The queue consisted of multiple rows with samples labelled "THMS50tcep_PAlk_SA_blank" and "THMS50tcep_PAlk_MA_HeLa"
    
6. ✓ Verified the column autocompletion settings with right-click on a field in the column 'vial'. The arrows pointet from A1-A12, indicating that values increased to the right. The tray type was set to 'Evosep' and slots 1-6 were designated as '96Evotip'.

7. ✓ Matched the Evotip position with the sample's location in the Evotip box. The first Evotip was placed in position S1 A1, and all remaining positions were specified individually and automatically by dragging the values.
8. ⚠️ **Deviation:** Path was not explicity specified.
9. ✓ The separation method "WhisperRj.zoom" was selected.

10. ✓ Injection method was set to 'standard'.

11. ✓ The MS method "20240703_DDA_maintenance_ionOptics_100ms_m/z713_300-1200_HS_1800V" was selected.

12. ❌ **Omitted:** Idle flow on the Evosep was not canceled.
13. ✓ Saved the sample table.

14. ⚠️ **Deviation:** Only the last row was selected to upload sample conditions, instead of all rows. The status changed to loaded.

15. ✓ Pressed 'start' and the sequence started to run.

## Expected Results
- ✓ The sample table was running.

## Figures

### Figure 1: Hystar
- Screenshot of Hystar settings
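
Because the prompt fixes the status symbols and labels, the generated documentation can be turned into a step-to-status mapping without a second model call. The following is a minimal sketch under the assumption that every step line starts with its number, a period, and one of the symbols; `parse_step_statuses` is a hypothetical helper, and the sample lines are quoted from the output above.

```python
import re

# Step number (optionally prefixed with "Prerequisite"), a status symbol, and an optional bold label.
STEP_PATTERN = re.compile(
    r"^(?P<step>(?:Prerequisite )?\d+)\.\s+"
    r"(?P<symbol>✓|❌|⚠\ufe0f?|➕)\s*"
    r"(?:\*\*(?P<label>\w+):\*\*)?"
)


def parse_step_statuses(documentation: str) -> dict[str, str]:
    """Map each step identifier to its status label."""
    statuses = {}
    for line in documentation.splitlines():
        match = STEP_PATTERN.match(line.strip())
        if not match:
            continue
        if match["label"]:
            statuses[match["step"]] = match["label"]
        else:
            # A bare check mark means the step was followed; anything else lacks a label.
            statuses[match["step"]] = "Followed" if match["symbol"] == "✓" else "Unlabelled"
    return statuses


sample = """\
Prerequisite 2. ✓ Reported that the TIMS device had already been calibrated.
8. ⚠️ **Deviation:** Path was not explicity specified.
12. ❌ **Omitted:** Idle flow on the Evosep was not canceled.
13. ✓ Saved the sample table.
"""
print(parse_step_statuses(sample))
```
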


In [75]:
inputs = [
  """
  # Instruction
  You are an expert evaluator specializing in scientific protocol documentation. Your task is to evaluate the error identification accuracy, error type classification and documentation quality of an AI-generated documentation against a benchmark documentation (ground truth). You will be provided with an AI-generated documentation and a benchmark documentation (human-verified ground truth).

  # Evaluation Parts
  ## Part 1: Error Identification Accuracy
  For each step in the protocol, determine if the AI correctly identified the presence or absence of errors by classifying into one of these categories:
  - **No Error**: Both benchmark and AI response agree there was no error
  - **Error (Correctly Identified)**: Both benchmark and AI response agree there was an error
  - **False Positive**: AI response claimed an error when the benchmark indicates none
  - **False Negative**: AI response missed an error that the benchmark shows

  ## Part 2: Error Type Classification
  For each error that was correctly identified by both the benchmark and AI response, determine if the AI correctly classified the error type:
  - **Correct Classification**: AI used the same error type as the benchmark (Omitted, Error, Deviation, Added)
  - **Incorrect Classification**: AI used a different error type than the benchmark

  ## Part 3: Documentation Quality
  Evaluate the AI's documentation quality based on these criteria:
  1. **Structure**: Did it keep only relevant sections: Aim, Materials, Procedure, Results?
  2. **Tense**: Did it use past tense to describe what actually happened, not what should happen?
  3. **Language**: Did it remove all instructional language and replace with observations?
  4. **Numbering**: Did it maintain step numbering of the original protocol even if order changed?
  5. **Timing**: Did it include exact actual timing, not estimated timing?

  # Rating Rubric
  For each part, provide an evaluation:

  ### Part 1: Error Identification Accuracy
  - Calculate and report:
    - Total number of correct identifications (No Error + Correctly Identified Error)
    - Total number of false positives
    - Total number of false negatives
    - Overall accuracy percentage (correct identifications / total steps)

  ### Part 2: Error Type Classification
  - Calculate and report:
    - Total errors correctly classified / Total errors correctly identified
    - Overall error classification accuracy percentage

  ### Part 3: Documentation Quality
  For each criterion:
  - **Excellent**: The criterion was fully met with no issues
  - **Good**: The criterion was mostly met with minor issues
  - **Poor**: The criterion was not met or had significant issues

  # Evaluation Steps
  1. Create a table for each step in the protocol showing error identification accuracy
  2. Analyze correctly identified errors to determine classification accuracy
  3. Evaluate documentation quality against the 5 criteria
  4. Provide final scores and overall assessment
  5. Highlight specific strengths and areas for improvement

  # Output Format
  ## Part 1: Error Identification Accuracy
  | Step | Benchmark | AI Response | Classification |
  |------|-----------|-------------|----------------|
  | [Step details] | [Error/No Error] | [Error/No Error] | [No Error/Error/False Positive/False Negative] |

  **Summary Statistics:**
  - Total correct identifications: [X]/[Total Steps]
  - Total false positives: [X]
  - Total false negatives: [X]
  - Overall accuracy: [X]%

  ## Part 2: Error Classification Accuracy
  | Step | Benchmark Error Type | AI Error Type | Classification |
  |------|---------------------|---------------|----------------|
  | [Step with error] | [Error Type] | [Error Type] | [Correct/Incorrect] |

  **Summary Statistics:**
  - Total correctly classified errors: [X]/[Total Errors]
  - Error classification accuracy: [X]%

  ## Part 3: Documentation Quality
  | Criterion | Rating | Explanation |
  |-----------|--------|-------------|
  | Structure | [Excellent/Good/Poor] | [Explanation] |
  | Tense | [Excellent/Good/Poor] | [Explanation] |
  | Language | [Excellent/Good/Poor] | [Explanation] |
  | Numbering | [Excellent/Good/Poor] | [Explanation] |
  | Timing | [Excellent/Good/Poor] | [Explanation] |

  ## Overall Assessment
  [Provide a concise overall assessment of the AI documentation's quality, highlighting key strengths and weaknesses, with suggestions for improvement.]

  # Input Materials
  ## Benchmark Documentation (Ground Truth)
  
  """
]
inputs.extend(documentation_input2)
inputs.extend(["""
  ## AI-Generated Documentation
"""])
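# documentation is a str: append() adds it as one element, whereas extend() would add it character by character.
inputs.append(documentation)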

model = GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    inputs, 
    generation_config=GenerationConfig(
        temperature=0.9,
        # audio_timestamp=True # Supported if only one video or audio is used
    )
)
Markdown(response.text)

## Part 1: Error Identification Accuracy

| Step | Benchmark | AI Response | Classification |
|------|-----------|-------------|----------------|
| Prerequisite 1 | No Error | No Error | No Error |
| Prerequisite 2 | No Error | No Error | No Error |
| 1 | No Error | No Error | No Error |
| 2 | No Error | No Error | No Error |
| 3 | No Error | Error | False Positive |
| 4 | No Error | Error | False Positive |
| 5 | No Error | Error | False Positive |
| 6 | No Error | No Error | No Error |
| 7 | Error | No Error | False Negative |
| 8 | No Error | Error | False Positive |
| 9 | No Error | No Error | No Error |
| 10 | No Error | No Error | No Error |
| 11 | No Error | No Error | No Error |
| 12 | Error | Error | Error (Correctly Identified) |
| 13 | No Error | No Error | No Error |
| 14 | No Error | Error | False Positive |
| 15 | No Error | No Error | No Error |

**Summary Statistics:**
- Total correct identifications: 11/16
- Total false positives: 6
- Total false negatives: 1
- Overall accuracy: 68.75%

## Part 2: Error Classification Accuracy

| Step | Benchmark Error Type | AI Error Type | Classification |
|------|---------------------|---------------|----------------|
| 12 | Omitted | Omitted | Correct |

**Summary Statistics:**
- Total correctly classified errors: 1/1
- Error classification accuracy: 100%

## Part 3: Documentation Quality

| Criterion | Rating | Explanation |
|-----------|--------|-------------|
| Structure | Good | The AI kept the relevant sections (Aim, Materials, Procedure, Results) but also included "Expected Results" and "Figures", which are not typically part of a post-experiment documentation. |
| Tense | Poor | The AI continues to use instructional language instead of describing what happened. |
| Language | Poor | The AI uses instructional language and includes unnecessary conversational elements ("Okay, Professor Mann"). It uses checkmarks and the words "Mentioned" and "Reported," which are inappropriate for documentation. |
| Numbering | Excellent | The AI maintained the step numbering of the original protocol. |
| Timing | Poor | The AI provided an estimated timing rather than an exact actual timing. |

## Overall Assessment

The AI-generated documentation has significant shortcomings. While it maintains the original numbering and identifies one error correctly, it struggles with differentiating between a protocol and documentation, leading to inappropriate language and tense usage. The high number of false positives significantly reduces its accuracy. The documentation also fails to incorporate actual timings.

**Recommendations for Improvement:**

1.  **Focus on Tense and Language:** Train the AI to strictly use past tense and remove any instructional or conversational language. The output should read as a record of what *was* done, not what *should* be done.
2.  **Reduce False Positives:** Improve the AI's ability to distinguish between minor deviations and actual errors. It needs to be more precise in identifying when a step deviates significantly from the original protocol.
3.  **Adhere to Structure:** Stick to the core sections (Aim, Materials, Procedure, Results) without adding speculative sections like "Expected Results".
4.  **Incorporate Timing:** The AI needs to be able to extract and include exact timings from the video, rather than providing estimated times.
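
The summary statistics in the evaluation above were computed by the model and do not match its own table: the summary reports 11/16 correct identifications and 6 false positives, while the Part 1 table lists 17 rows, 5 of them false positives. A minimal sketch for recomputing the counts directly from the table follows, so the arithmetic does not depend on the model. The table text is copied from the output above; `classification_counts` is a hypothetical helper.

```python
from collections import Counter

TABLE = """\
| Step | Benchmark | AI Response | Classification |
|------|-----------|-------------|----------------|
| Prerequisite 1 | No Error | No Error | No Error |
| Prerequisite 2 | No Error | No Error | No Error |
| 1 | No Error | No Error | No Error |
| 2 | No Error | No Error | No Error |
| 3 | No Error | Error | False Positive |
| 4 | No Error | Error | False Positive |
| 5 | No Error | Error | False Positive |
| 6 | No Error | No Error | No Error |
| 7 | Error | No Error | False Negative |
| 8 | No Error | Error | False Positive |
| 9 | No Error | No Error | No Error |
| 10 | No Error | No Error | No Error |
| 11 | No Error | No Error | No Error |
| 12 | Error | Error | Error (Correctly Identified) |
| 13 | No Error | No Error | No Error |
| 14 | No Error | Error | False Positive |
| 15 | No Error | No Error | No Error |
"""


def classification_counts(markdown_table: str) -> Counter:
    """Count the values in the last column of a Markdown pipe table."""
    counts = Counter()
    for line in markdown_table.strip().splitlines()[2:]:  # skip header and separator rows
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        counts[cells[-1]] += 1
    return counts


counts = classification_counts(TABLE)
total_steps = sum(counts.values())
correct = counts["No Error"] + counts["Error (Correctly Identified)"]
print(f"correct: {correct}/{total_steps} ({correct / total_steps:.1%})")
print(f"false positives: {counts['False Positive']}, false negatives: {counts['False Negative']}")
```

Recomputed this way, the table gives 11 correct identifications out of 17 steps, 5 false positives, and 1 false negative.
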


In [None]:
# Assuming 'model' is your configured Vertex AI GenerativeModel instance

def get_table_json_prompt(text_with_tables: str, table_identifier: str) -> str:
    """
    Generates a prompt to extract a specific table from text into JSON.

    Args:
        text_with_tables: The full text containing the table(s).
        table_identifier: A string to help the model identify the target table
                          (e.g., the table title, or a unique phrase near it).

    Returns:
        A formatted prompt string.
    """
    prompt = f"""
    You are an expert data extraction tool.
    Your task is to locate a specific table within the provided text and output its data as a JSON array.

    Here is the text containing the table(s):
    ---TEXT_START---
    {text_with_tables}
    ---TEXT_END---

    Identify the table that best matches the following description or title: "{table_identifier}"

    Output the data from ONLY this table as a valid JSON array. Each object in the array should represent a row from the table. The keys of each object should be the exact column headers from the identified table.

    Output Constraints:
    - Do not include any introductory or explanatory text (e.g., "Here is the JSON:").
    - Do not include any text before or after the JSON object.
    - The output must be *only* the valid JSON array structure itself.
    - If the specified table cannot be found, output an empty JSON array: []

    Answer directly with the JSON.
    """
    return prompt

text_from_previous_response = response.text
table_to_extract = "Part 1: Error Identification Accuracy"
# table_to_extract = "Part 2: Error Classification Accuracy"

prompt_for_extraction = get_table_json_prompt(text_from_previous_response, table_to_extract)
json_response = model.generate_content([prompt_for_extraction])


Model did not return valid JSON.
Model output:
```json
[
  {
    "Step": "Prerequisite 1",
    "Benchmark": "No Error",
    "AI Response": "No Error",
    "Classification": "No Error"
  },
  {
    "Step": "Prerequisite 2",
    "Benchmark": "No Error",
    "AI Response": "No Error",
    "Classification": "No Error"
  },
  {
    "Step": "1",
    "Benchmark": "No Error",
    "AI Response": "No Error",
    "Classification": "No Error"
  },
  {
    "Step": "2",
    "Benchmark": "No Error",
    "AI Response": "No Error",
    "Classification": "No Error"
  },
  {
    "Step": "3",
    "Benchmark": "No Error",
    "AI Response": "Error",
    "Classification": "False Positive"
  },
  {
    "Step": "4",
    "Benchmark": "No Error",
    "AI Response": "Error",
    "Classification": "False Positive"
  },
  {
    "Step": "5",
    "Benchmark": "No Error",
    "AI Response": "Error",
    "Classification": "False Positive"
  },
  {
    "Step": "6",
    "Benchmark": "No Error",
    "AI Response": "No Er

In [80]:
import json

# This is the full string you received from the model
model_output_string = json_response.text

# Define the markers for the JSON code block
start_marker = "```json"
end_marker = "```"

# Find the position of the markers
start_index = model_output_string.find(start_marker)
end_index = model_output_string.find(end_marker, start_index + len(start_marker)) # Search for end marker after the start

json_data = None
extracted_json_string = ""

if start_index != -1 and end_index != -1:
    # Extract the string between the markers
    # Add the length of the start_marker to get the content *after* it
    extracted_json_string = model_output_string[start_index + len(start_marker) : end_index].strip()

    # Now, try to parse the extracted string as JSON
    try:
        json_data = json.loads(extracted_json_string)
        print("Successfully extracted and parsed JSON:")
        # print(json.dumps(json_data, indent=2)) # Optional: print for verification
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON after extraction: {e}")
        print("Extracted string:")
        print(extracted_json_string)
else:
    print("Could not find JSON code block markers in the output.")
    print("Model output:")
    print(model_output_string)

# Now 'json_data' holds your data as a Python list of dictionaries
# You can use it directly or convert it to a pandas DataFrame:
if json_data is not None:
    import pandas as pd
    df = pd.DataFrame(json_data)
    print("\nDataFrame created:")
    print(df)

Successfully extracted and parsed JSON:

DataFrame created:
              Step Benchmark AI Response                Classification
0   Prerequisite 1  No Error    No Error                      No Error
1   Prerequisite 2  No Error    No Error                      No Error
2                1  No Error    No Error                      No Error
3                2  No Error    No Error                      No Error
4                3  No Error       Error                False Positive
5                4  No Error       Error                False Positive
6                5  No Error       Error                False Positive
7                6  No Error    No Error                      No Error
8                7     Error    No Error                False Negative
9                8  No Error       Error                False Positive
10               9  No Error    No Error                      No Error
11              10  No Error    No Error                      No Error
12              1
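
The extraction above assumes the model always wraps its answer in a fenced `json` block. A slightly more tolerant sketch is shown below: it accepts output with or without a fence and lets `json.JSONDecodeError` propagate when the content is not valid JSON. The function name `parse_json_output` is an assumption, and the fence marker is built from single backticks only to keep this example readable inside Markdown.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
FENCE_PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_output(model_output: str):
    """Parse JSON from model output, with or without a Markdown code fence."""
    match = FENCE_PATTERN.search(model_output)
    candidate = match.group(1) if match else model_output
    return json.loads(candidate.strip())


bare = '[{"Step": "1", "Classification": "No Error"}]'
fenced = f"{FENCE}json\n{bare}\n{FENCE}"
print(parse_json_output(fenced) == parse_json_output(bare))
```

Recent versions of the Vertex AI SDK also accept `response_mime_type="application/json"` in `GenerationConfig`, which may make the fence handling unnecessary; that is worth checking against the installed version.
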

In [81]:
df

|   | Step | Benchmark | AI Response | Classification |
|---|------|-----------|-------------|----------------|
| 0 | Prerequisite 1 | No Error | No Error | No Error |
| 1 | Prerequisite 2 | No Error | No Error | No Error |
| 2 | 1 | No Error | No Error | No Error |
| 3 | 2 | No Error | No Error | No Error |
| 4 | 3 | No Error | Error | False Positive |
| 5 | 4 | No Error | Error | False Positive |
| 6 | 5 | No Error | Error | False Positive |
| 7 | 6 | No Error | No Error | No Error |
| 8 | 7 | Error | No Error | False Negative |
| 9 | 8 | No Error | Error | False Positive |


In [57]:
response.text

'## Part 1: Error Identification Accuracy\n\n| Step | Benchmark | AI Response | Classification |\n|------|-----------|-------------|----------------|\n| Prerequisite 1 | No Error | No Error | No Error |\n| Prerequisite 2 | No Error | No Error | No Error |\n| 1 | No Error | No Error | No Error |\n| 2 | No Error | No Error | No Error |\n| 3 | No Error | Error | False Positive |\n| 4 | No Error | Error | False Positive |\n| 5 | No Error | Error | False Positive |\n| 6 | No Error | No Error | No Error |\n| 7 | Error | No Error | False Negative |\n| 8 | No Error | Error | False Positive |\n| 9 | No Error | No Error | No Error |\n| 10 | No Error | No Error | No Error |\n| 11 | No Error | No Error | No Error |\n| 12 | Error | Error | Correctly Identified |\n| 13 | No Error | No Error | No Error |\n| 14 | No Error | Error | False Positive |\n| 15 | No Error | No Error | No Error |\n\n**Summary Statistics:**\n- Total correct identifications: 10/17\n- Total false positives: 6\n- Total false nega

In [44]:
documentation_input2

['## Documentation:',
 file_data {
   mime_type: "text/md"
   file_uri: "gs://mannlab_videos/compare_protocol_video/QueueSamples_docuWrongRow_S3A1Twice.md"
 }]
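
As the output above shows, `documentation_input2` is a two-element list, whereas `documentation` is a plain string. The two need different list methods when building `inputs`: `extend` iterates over its argument, so it suits a list of parts but splits a string into single characters. A short illustration with placeholder strings standing in for the real parts:

```python
parts = ["## AI-Generated Documentation"]

# extend() iterates over its argument, so a string is split into characters.
extended = list(parts)
extended.extend("## Aim")

# append() adds the whole string as one element.
appended = list(parts)
appended.append("## Aim")

# extend() is the right call for a list of parts such as documentation_input2.
combined = list(parts)
combined.extend(["## Documentation:", "<file part placeholder>"])

print(len(extended), len(appended), len(combined))  # 7 2 3
```
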

In [40]:
Markdown(formatted_prompt)

# Instruction
You are an expert evaluator specializing in scientific protocol documentation. Your task is to evaluate the quality of an AI-generated documentation against a benchmark documentation (ground truth). You will be provided with an AI-generated documentation and a benchmark documentation (human-verified ground truth).

Your evaluation will have three distinct parts, each focused on different aspects of the AI's performance.

# Evaluation Parts
## Part 1: Error Identification Accuracy
For each step in the protocol, determine if the AI correctly identified the presence or absence of errors by classifying into one of these categories:
- **No Error**: Both benchmark and AI response agree there was no error
- **Error (Correctly Identified)**: Both benchmark and AI response agree there was an error
- **False Positive**: AI response claimed an error when the benchmark indicates none
- **False Negative**: AI response missed an error that the benchmark shows

## Part 2: Error Type Classification
For each error that was correctly identified by both the benchmark and AI response, determine if the AI correctly classified the error type:
- **Correct Classification**: AI used the same error type as the benchmark (Omitted, Error, Deviation, Added)
- **Incorrect Classification**: AI used a different error type than the benchmark

## Part 3: Documentation Quality
Evaluate the AI's documentation quality based on these criteria:
1. **Structure**: Did it keep only relevant sections: Aim, Materials, Procedure, Results?
2. **Tense**: Did it use past tense to describe what actually happened, not what should happen?
3. **Language**: Did it remove all instructional language and replace with observations?
4. **Numbering**: Did it maintain step numbering of the original protocol even if order changed?
5. **Timing**: Did it include exact actual timing, not estimated timing?

# Rating Rubric
For each part, provide an evaluation:

### Part 1: Error Identification Accuracy
- Calculate and report:
  - Total number of correct identifications (No Error + Correctly Identified Error)
  - Total number of false positives
  - Total number of false negatives
  - Overall accuracy percentage (correct identifications / total steps)

### Part 2: Error Type Classification
- Calculate and report:
  - Total errors correctly classified / Total errors correctly identified
  - Overall error classification accuracy percentage

### Part 3: Documentation Quality
For each criterion:
- **Excellent**: The criterion was fully met with no issues
- **Good**: The criterion was mostly met with minor issues
- **Poor**: The criterion was not met or had significant issues

# Evaluation Steps
1. Create a table for each step in the protocol showing error identification accuracy
2. Analyze correctly identified errors to determine classification accuracy
3. Evaluate documentation quality against the 5 criteria
4. Provide final scores and overall assessment
5. Highlight specific strengths and areas for improvement

# Input Materials
## AI-Generated Documentation
Okay, Professor Mann, here's the corrected documentation based on the provided video, reflecting the actual actions performed and their timing.

## Documentation:# Queue and measure samples in HyStar

## Aim
Queueing samples in HyStar for LC-MS measurement.

## Materials

### Software
HyStar 6.0

## Procedure
Timing: 3 minutes

Prerequisite 1. ✓ Mentioned that 5 ng HeLa Evotips were placed at S1 from A1 to A6 and blanks at S3 from A1 to A6.

Prerequisite 2. ✓ Reported that the TIMS device had already been calibrated.

1. ✓ Navigated to the 'Acquisition' tab in HyStar.
2. ✓ Selected an already existing sample table by pressing the arrow down button when hovering over the sample table name in the left sample table column.

3. ❌ **Omitted:** Copied already existing sample table entries to modify them
4. ⚠️ **Deviation:** Manually Adjusted the sample ID without following this pattern: currentDate_massSpec_user_sampleType_projectID_ sampleName, instead created "THMS50tcep_PAlk_SA_blank".

5. ⚠️ **Deviation:** The queue did not contain three dda-PASEF or three dia-PASEF runs. The queue consisted of multiple rows with samples labelled "THMS50tcep_PAlk_SA_blank" and "THMS50tcep_PAlk_MA_HeLa"
    
6. ✓ Verified the column autocompletion settings with right-click on a field in the column 'vial'. The arrows pointet from A1-A12, indicating that values increased to the right. The tray type was set to 'Evosep' and slots 1-6 were designated as '96Evotip'.

7. ✓ Matched the Evotip position with the sample's location in the Evotip box. The first Evotip was placed in position S1 A1, and all remaining positions were specified individually and automatically by dragging the values.
8. ⚠️ **Deviation:** Path was not explicity specified.
9. ✓ The separation method "WhisperRj.zoom" was selected.

10. ✓ Injection method was set to 'standard'.

11. ✓ The MS method "20240703_DDA_maintenance_ionOptics_100ms_m/z713_300-1200_HS_1800V" was selected.

12. ❌ **Omitted:** Idle flow on the Evosep was not canceled.
13. ✓ Saved the sample table.

14. ⚠️ **Deviation:** Only the last row was selected to upload sample conditions, instead of all rows. The status changed to loaded.

15. ✓ Pressed 'start' and the sequence started to run.

## Expected Results
- ✓ The sample table was running.

## Figures

### Figure 1: Hystar
- Screenshot of Hystar settings


## Benchmark Documentation (Ground Truth)
['## Documentation:', file_data {
  mime_type: "text/md"
  file_uri: "gs://mannlab_videos/compare_protocol_video/QueueSamples_docuWrongRow_S3A1Twice.md"
}
]

# Output Format
## Part 1: Error Identification Accuracy
| Step | Benchmark | AI Response | Classification |
|------|-----------|-------------|----------------|
| [Step details] | [Error/No Error] | [Error/No Error] | [No Error/Error/False Positive/False Negative] |

**Summary Statistics:**
- Total correct identifications: [X]/[Total Steps]
- Total false positives: [X]
- Total false negatives: [X]
- Overall accuracy: [X]%

## Part 2: Error Classification Accuracy
| Step | Benchmark Error Type | AI Error Type | Classification |
|------|---------------------|---------------|----------------|
| [Step with error] | [Error Type] | [Error Type] | [Correct/Incorrect] |

**Summary Statistics:**
- Total correctly classified errors: [X]/[Total Errors]
- Error classification accuracy: [X]%

## Part 3: Documentation Quality
| Criterion | Rating | Explanation |
|-----------|--------|-------------|
| Structure | [Excellent/Good/Poor] | [Explanation] |
| Tense | [Excellent/Good/Poor] | [Explanation] |
| Language | [Excellent/Good/Poor] | [Explanation] |
| Numbering | [Excellent/Good/Poor] | [Explanation] |
| Timing | [Excellent/Good/Poor] | [Explanation] |

## Overall Assessment
[Provide a concise overall assessment of the AI documentation's quality, highlighting key strengths and weaknesses, with suggestions for improvement.]
