<a href="https://colab.research.google.com/github/Swadian/FrontierScience-Bench-Evaluating-AI-Research-Capabilities-in-LLMs/blob/main/Verdict_Framework_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
try:
  import verdict
except ImportError:
  !pip install verdict
  import verdict
from verdict import Pipeline, Layer
from verdict.common.judge import CategoricalJudgeUnit, JudgeUnit
from verdict.scale import DiscreteScale
from verdict.schema import Schema, Field
from verdict.transform import MapUnit, MaxPoolUnit
from collections import Counter
from google.colab import drive, userdata
drive.mount('/content/drive')

Collecting verdict
  Downloading verdict-0.2.2-py3-none-any.whl.metadata (11 kB)
Collecting datasets (from verdict)
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting dill (from verdict)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting eval-type-backport (from verdict)
  Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting instructor==1.7.2 (from verdict)
  Downloading instructor-1.7.2-py3-none-any.whl.metadata (18 kB)
Collecting krippendorff (from verdict)
  Downloading krippendorff-0.8.1-py3-none-any.whl.metadata (3.0 kB)
Collecting litellm (from verdict)
  Downloading litellm-1.68.0-py3-none-any.whl.metadata (36 kB)
Collecting loguru (from verdict)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting jiter<0.9,>=0.6.1 (from instructor==1.7.2->verdict)
  Downloading jiter-0.8.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting dill (from verdict)
  Downlo

In [None]:
rubric_guidelines = """
The rubric specifications are as follows:
- Each paper will be assigned a score between 1 and 10 (inclusive), based on how similar it is to the original paper. A higher score indicates the predicted methodology is more similar to the original methodology.
- When assigning a score to a paper, clearly explain what parts of the paper's methodology are similar or different from the original paper.
- Break down your thought process by identifying key strengths, weaknesses, and any alignment or discrepancies with the paper.
- The following rubric will be used to assign scores:
  - A score of 1 represents a methodology that is vastly different in the approach and execution compared to the original paper.
  - A score of 2 represents a methodology that is significantly different but contains a few minor similarities.
  - A score of 3 represents a methodology that has some similarities but misses key details in the approach and execution compared to the original paper.
  - A score of 4 represents a methodology that is somewhat similar but still lacks important aspects of the original paper.
  - A score of 5 represents a methodology that has a relatively equal number of similarities and differences in approach and execution compared to the original paper.
  - A score of 6 represents a methodology that is fairly close to the original but omits or alters some details.
  - A score of 7 represents a methodology that closely matches the original in approach and execution but with minor alterations.
  - A score of 8 represents a methodology that is very similar to the original with only small, noncritical differences in the execution.
  - A score of 9 represents a methodology that is nearly identical to the original paper, with differences that are extremely minor and almost inconsequential.
  - A score of 10 represents a methodology that is almost identical to the approach and execution of the original paper, with trivial differences ignored.

A hypothetical example paper and its key methodological details are provided below. This is PURELY an example reference paper to gauge the differences between each score.
Examples of scores of 1-10 (inclusive) are also provided, with a description of the key similarities and differences that justify that score being assigned.

The example paper is titled:
“Efficient Fine‑Tuning Strategies for Domain‑Specific Language Models in Legal Document Analysis.”

Its key methodological details include:
Pre‑Trained Model: Uses a pre‑trained BERT base model (12 layers, 110M parameters).
Fine‑Tuning Data: Fine‑tuned on a corpus of 100,000 legal documents.
Domain‑Adaptive Pre‑Training (DAPT): Begins with a DAPT phase using masked language modeling for 3 epochs.
Classification Fine‑Tuning: Followed by fine‑tuning with a classification head for legal issue categorization over 4 epochs.
Hyperparameters: Employs a batch size of 32 and a learning rate of 2×10⁻⁵.
Evaluation: Performance is measured using Accuracy, F1‑Score, and ROC‑AUC on a held‑out test set of 10,000 documents.
Ablation Study: An ablation study is conducted comparing performance with and without the DAPT phase, with statistical significance assessed via paired t‑tests.

- Score 1 Example: "The evaluated paper’s methodology is vastly different from the original.
  Instead of leveraging a pre‑trained BERT model, the study builds an LSTM network from scratch without any pre‑training phase.
  The dataset comprises only 2,000 blog posts rather than legal documents, and there is no Domain‑Adaptive Pre‑Training (DAPT) phase.
  The paper directly trains the model for text classification using arbitrary hyperparameters (e.g., a batch size of 8 and a learning rate of 1×10⁻³)
  and evaluates performance solely based on training loss rather than using standard metrics like Accuracy, F1‑Score, or ROC‑AUC.
  Additionally, no ablation study is performed. These fundamental differences in model architecture, data, training strategy, hyperparameters,
  evaluation metrics, and analysis justify a score of 1."

- Score 2 Example: "The evaluated paper shares minimal similarities with the original methodology and differs significantly in approach and execution.
  While it does use a transformer-based model, it fine-tunes a GPT-2 model instead of a pre-trained BERT base model. The dataset consists of only
  5,000 general legal summaries rather than a large corpus of 100,000 legal documents. There is no Domain-Adaptive Pre-Training (DAPT) phase,
  and the model is fine-tuned directly for 1 epoch using a batch size of 16 and a learning rate of 1×10⁻³. Evaluation is based solely on perplexity
  rather than classification-focused metrics like Accuracy, F1-Score, or ROC-AUC. No ablation study is performed. These fundamental departures
  from the original methodology justify a score of 2."

- Score 3 Example: "This evaluated methodology shows some elements of an ML approach but deviates significantly from the original.
  While it uses a pre‑trained BERT model, the study fine‑tunes on a dataset of 20,000 news articles instead of 100,000 legal documents.
  The paper omits the Domain‑Adaptive Pre‑Training phase, moving directly to fine‑tuning for classification with only 2 epochs rather than 4.
  It employs a batch size of 64 and a learning rate of 5×10⁻⁵, which diverge notably from the original hyperparameters.
  The evaluation relies solely on Accuracy, ignoring F1‑Score and ROC‑AUC, and it does not include any ablation study.
  These substantial differences in data selection, training procedure, hyperparameter configuration, and evaluation methods warrant a score of 3."

- Score 4 Example: "The evaluated methodology shares some methodological elements with the original but deviates in key aspects. It uses a pre-trained BERT
  base model, but the fine-tuning dataset consists of 30,000 legal documents instead of 100,000. The study skips the Domain-Adaptive Pre-Training (DAPT) phase
  entirely and instead applies a simple transfer learning approach. The fine-tuning runs for only 3 epochs instead of 4, with a batch size of 40
  and a learning rate of 3×10⁻⁵. Evaluation includes Accuracy and F1-Score but omits ROC-AUC, and the ablation study is limited to a single comparison
  without statistical significance testing. While the methodology retains some recognizable aspects of the original, the substantial differences in
  data scale, training procedure, hyperparameters, and analysis warrant a score of 4."

- Score 5 Example: "The evaluated paper’s methodology exhibits a balanced mix of similarities and differences compared to the original.
  Like the original, it employs a BERT base model and focuses on a classification task. However, it fine‑tunes on a smaller dataset of 50,000
  documents (with only 40,000 being legal texts and 10,000 from related domains), and it forgoes the Domain‑Adaptive Pre‑Training (DAPT) phase
  entirely. The fine‑tuning phase runs for 4 epochs as in the original but uses a slightly different batch size of 28 (which is not detrimental to the score)
  and a learning rate of 2×10⁻⁵. Evaluation includes Accuracy and F1‑Score but omits ROC‑AUC, and there is a brief, underdeveloped ablation study
  comparing two training settings rather than a comprehensive analysis with paired t‑tests. These comparable yet noticeably different choices
  justify a score of 5."

- Score 6 Example: "The evaluated methodology is fairly close to the original but omits or alters some details. It still employs the pre‑trained BERT
  base model and targets legal document classification, yet instead of using the full dataset of 100,000 legal documents, it utilizes around 80,000 documents.
  Notably, the Domain‑Adaptive Pre‑Training (DAPT) phase is entirely omitted, with the paper proceeding directly to fine‑tuning for classification over 4 epochs.
  Additionally, while the original hyperparameters include a batch size of 32 and a learning rate of 2×10⁻⁵, the evaluated study slightly adjusts the learning rate
  to 2.2×10⁻⁵, though it maintains the same batch size. The evaluation is conducted using Accuracy and F1‑Score, but ROC‑AUC is not reported, and the ablation study
  is simplified, comparing only a basic variant of the methodology rather than performing a full paired t‑test analysis. These omissions and modifications while retaining
  the core framework justify a score of 6."

- Score 7 Example: "The evaluated methodology resembles the original in several high‑level components but introduces notable differences in key areas.
  Like the original, it utilizes a pre‑trained BERT base model and targets legal document classification. However, it fine‑tunes on a corpus of
  120,000 documents, where only 70,000 are strictly legal documents and the remainder are general documents from adjacent domains—altering the
  data composition significantly. Instead of a pure DAPT phase with masked language modeling for 3 epochs, the evaluated paper implements a
  combined DAPT phase that includes both masked language modeling and next sentence prediction over 3 epochs. Additionally, while the fine‑tuning
  phase still runs for 4 epochs, it adopts a dynamic learning rate schedule that starts at 2×10⁻⁵ but decays to 1×10⁻⁵ halfway through training.
  For evaluation, the paper reports Accuracy and F1‑Score but omits ROC‑AUC, and the ablation study is conducted using descriptive performance
  comparisons rather than paired t‑tests. These differences in dataset composition, training regimen, learning rate strategy, evaluation metrics,
  and statistical analysis indicate a good overall alignment in the core methodology while incorporating several meaningful variations, warranting
  a score of 7."

- Score 8 Example: "The evaluated methodology is nearly identical to the original, with only a few minor adjustments. It employs the same pre‑trained BERT base model
  and uses a corpus almost identical to the original 100,000 legal documents—except that about 5% of the documents are contracts rather than legal case records.
  The Domain‑Adaptive Pre‑Training (DAPT) phase is retained at 3 epochs, with the only change being the addition of a small contrastive learning component alongside masked language modeling.
  The fine‑tuning stage mirrors the original by running for 4 epochs with a batch size of 32, though the learning rate is slightly adjusted to 1.8×10⁻⁵ (a modest change from 2×10⁻⁵).
  Evaluation continues to use Accuracy, F1‑Score, and ROC‑AUC, and while the ablation study largely follows the original structure, it adds a single extra experiment to analyze an alternative pre‑training strategy.
  These slight modifications in data composition, training, and analysis are minor enough to justify a score of 8, as they do not significantly alter the core methodology."

- Score 9 Example: "The evaluated methodology is nearly identical to the original, with only extremely minor variations. It employs the same pre-trained BERT base model and fine-tunes on a dataset of 100,000 legal documents,
  maintaining the same proportions of legal case records and statutes. The Domain-Adaptive Pre-Training (DAPT) phase follows the same masked language modeling procedure for 3 epochs, and the fine-tuning process
  runs for 4 epochs with a batch size of 32. The only slight modification is the learning rate schedule, which begins at 2×10⁻⁵ but decays linearly to 1.5×10⁻⁵ over training.
  Evaluation includes Accuracy, F1-Score, and ROC-AUC, and the ablation study mirrors the original but includes one extra comparison on a different random seed.
  Since these differences are minimal and do not significantly affect the methodology’s structure, a score of 9 is appropriate."

- Score 10 Example: "The evaluated methodology is almost identical to that of the original paper. It employs the same pre‑trained BERT base model and
  fine‑tunes on an identical corpus of 100,000 legal documents. The methodology follows the original two‑phase training process: an
  initial Domain‑Adaptive Pre‑Training (DAPT) phase using masked language modeling for 3 epochs, followed by a 4‑epoch fine‑tuning phase with a
  classification head. The hyperparameters—batch size of 32 and learning rate of 2×10⁻⁵—match exactly. Performance is evaluated using Accuracy,
  F1‑Score, and ROC‑AUC on a held‑out test set of 10,000 documents, and the paper includes a comprehensive ablation study comparing the impact of
   DAPT, with paired t‑tests verifying statistical significance. Any differences are purely stylistic or due to formatting, thus justifying a score of 10."
"""

trivial_differences = """
Below is a list of illustrative examples showing that minor or trivial differences in a predicted methodology should not be penalized when the overarching
ideas and experimental approaches remain intact. These examples are drawn from various aspects of AI/ML research papers in general:

Alternate Evaluation Metrics with the Same Objective:
Original Methodology: Evaluates a classification model using Accuracy and F1‑Score to measure overall performance.
Predicted Methodology: Uses Accuracy and Recall (or even macro‑F1) as evaluation metrics.
Illustration: Although the specific metrics differ slightly, both methods aim to assess classification performance. The predicted approach still
captures the essence of performance evaluation without a substantial departure from the original intent.

Slight Variation in Data Preprocessing Techniques:
Original Methodology: Applies z‑score normalization to all input features to standardize the data before training.
Predicted Methodology: Uses min‑max scaling for data normalization instead.
Illustration: Both normalization techniques are standard practices in machine learning. While the scaling method is different, the core goal—ensuring
that the features are on a similar scale—is maintained, and thus this variation is trivial.

Different Hyperparameter Scheduling with Similar Impact:
Original Methodology: Utilizes a fixed learning rate schedule during training.
Predicted Methodology: Implements cosine annealing for learning rate decay throughout training.
Illustration: Even though the scheduling strategy differs, both approaches aim to optimize convergence during training. The alternative schedule is
an acceptable variation as it preserves the overall training philosophy.

Minor Architectural Adjustments in Model Design:
Original Methodology: Uses a ResNet‑50 architecture with a standard block configuration for image classification.
Predicted Methodology: Adopts a slightly modified ResNet‑50 where dropout layers are added after certain convolutional blocks to improve regularization.
Illustration: The predicted methodology retains the high‑level structure and intent of using a ResNet‑based architecture. The addition of dropout is a
minor tweak aimed at enhancing performance without changing the core design.

Alternate Statistical Analysis in Ablation Studies:
Original Methodology: Conducts an ablation study with statistical significance determined using paired t‑tests.
Predicted Methodology: Uses a non‑parametric test, such as the Wilcoxon signed‑rank test, to assess the significance of differences in performance.
Illustration: The change in statistical method does not undermine the overall experimental design. Both tests are rigorous and accepted in the field, so
the predicted methodology remains fundamentally aligned with the original goal of validating the results.

Alternate Magnitudes of Values:
Original Methodology: Training dataset consisted of 100,000 images
Predicted Methodology: Training dataset consisted of 90,000 images
Illustration: The change in the size of the dataset is still sufficiently large to be considered representative of the original distribution, ensuring that
the model’s performance remains comparable despite the slight reduction in training data.
"""

judge_prompt = f"""You will be given the TRUE research paper contributions, followed by
the PREDICTED research paper contributions. Both are in JSON format.

Instructions:
- You are an honest and analytical judge and you think through things carefully.
- You will compare how similar the PREDICTED contributions are to the TRUE contributions based on a rubric that will be specified below.
- Do NOT consider stylistic/writing choices in your comparison.
- Do NOT consider trivial details in your comparison.
- Prioritize clarity, correctness, and alignment of the ideas with the research problem over the use of mathematical notation. A methodology can still be valid and well-reasoned even without formal math notation.

{rubric_guidelines}

{trivial_differences}
"""

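The rubric text is long and hand-written, so it can be worth checking that every score from 1 to 10 actually has a definition line before the prompt is sent to any judge. The following is a minimal sketch of such a check. The helper name `rubric_scores` and the shortened `sample_rubric` string are hypothetical and only for illustration; in the notebook the same function could be pointed at `rubric_guidelines`.

```python
import re

def rubric_scores(rubric_text):
    """Return the sorted scores that have an 'A score of N represents' line."""
    matches = re.findall(r"A score of (\d+) represents", rubric_text)
    return sorted({int(m) for m in matches})

# Hypothetical, shortened rubric used only to demonstrate the check.
sample_rubric = """
  - A score of 1 represents a methodology that is vastly different.
  - A score of 2 represents a methodology that is significantly different.
  - A score of 10 represents a methodology that is almost identical.
"""

found = rubric_scores(sample_rubric)
missing = [score for score in range(1, 11) if score not in found]
print("defined:", found)
print("missing:", missing)
```

An empty `missing` list means every point on the 1 to 10 scale has a definition line.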
In [None]:
import os
import datetime
import pickle
import json
from collections import Counter
from tqdm import tqdm
import pandas as pd
from verdict.util import ratelimit
ratelimit.disable()

# =============================================================================
# Global Flags for Model Evaluation
# =============================================================================

# Self-omission judging: set the flag of the model that produced the predictions
# to False so that it does not judge its own output.
EVAL_GPT_4O          = True
EVAL_O3_MINI         = True
EVAL_CLAUDE_35SONNET = True
EVAL_GEMINI_15PRO    = False
EVAL_GROK_2          = True

# =============================================================================
# Setup API Clients and Models
# =============================================================================

# https://verdict.haizelabs.com/docs/concept/model/#usage
# https://docs.litellm.ai/docs/set_keys


os.environ["OPENAI_API_KEY"] = userdata.get("algoverse-openai")
os.environ["ANTHROPIC_API_KEY"] = userdata.get("algoverse-anthropic")
os.environ["GEMINI_API_KEY"] = userdata.get("algoverse-gemini")
os.environ["XAI_API_KEY"] = userdata.get("grok")
os.environ["XAI_BASE_URL"] = "https://api.x.ai/v1"

# =============================================================================
# Data Loading and Result Storage Setup
# =============================================================================

print("Loading prediction and ground truth data...")
# Use context managers so the pickle files are closed after loading.
with open('/content/drive/MyDrive/ResearchPapers/gemini_official_predictions/predictions.pkl', 'rb') as f:
  predictions = pickle.load(f)
with open('/content/drive/MyDrive/ResearchPapers/ground_truth.pkl', 'rb') as f:
  ground_truths = pickle.load(f)

results_dir = "/content/drive/MyDrive/ResearchPapers/Verdict_Judging"
os.makedirs(results_dir, exist_ok=True)

# =============================================================================
# Helper Functions
# =============================================================================

score_scale = DiscreteScale(list(range(1, 11)))  # integer scores 1 through 10

def create_repeated_judge_layer(name, model_name):
  # https://verdict.haizelabs.com/docs/concept/unit/#unit
  # https://colab.research.google.com/github/haizelabs/verdict/blob/main/notebooks/common/judge.ipynb
  judge = (
    CategoricalJudgeUnit(name=f"{name} Judge", categories=score_scale, explanation=True)
    .prompt(judge_prompt + """\n
    Ground truth methodology:
    {source.ground_truth}

    Predicted methodology:
    {source.predicted}
    """)
    .via(model_name, retries=3, temperature=0.7)
  )
  # Run the same judge 5 times, then pool the repeated verdicts into one choice and explanation.
  return Layer(judge, repeat=5) >> MaxPoolUnit() # MapUnit(compute_mode_with_explanation)

# Define judge units per model
judging_units = []
included_judges = []
if EVAL_GPT_4O:
  judging_units.append(create_repeated_judge_layer("GPT-4o", "gpt-4o"))
  included_judges.append("GPT-4o")
if EVAL_O3_MINI:
  judging_units.append(create_repeated_judge_layer("o3-mini", "o3-mini"))
  included_judges.append("o3-mini")
if EVAL_CLAUDE_35SONNET:
  judging_units.append(create_repeated_judge_layer("Claude 3.5", "claude-3-5-sonnet-20241022"))
  included_judges.append("Claude 3.5")
if EVAL_GEMINI_15PRO:
  judging_units.append(create_repeated_judge_layer("Gemini 1.5 Pro", "gemini/gemini-1.5-pro"))
  included_judges.append("Gemini 1.5 Pro")
if EVAL_GROK_2:
  judging_units.append(create_repeated_judge_layer("Grok 2", "xai/grok-2"))
  included_judges.append("Grok 2")

# =============================================================================
# Assemble Verdict Pipeline
# =============================================================================

pipeline = Pipeline("LLM Majority Judge") >> Layer(judging_units)

# =============================================================================
# Running the Verdict Pipeline
# =============================================================================

total_papers = len(predictions)
for idx in tqdm(range(1, 3)):
  paper_id = f"paper_{idx:03}"
  print(f"\n========== Processing {paper_id} ({idx+1}/{total_papers}) ==========")

  try:
    ground_truth_json = json.loads(ground_truths[idx])
    prediction_json   = json.loads(predictions[idx])
    ground_truth      = str(ground_truth_json['methodology'])
    prediction        = str(prediction_json['contributions'])
  except Exception as e:
    print(f"Error extracting JSON fields for {paper_id}: {e}")
    continue

  print(f"[{paper_id}] Extracted ground truth and prediction.")

  content = Schema.of(
      ground_truth=ground_truth,
      predicted=prediction
  )

  result = pipeline.run(content)

  judge_dict, key_list = result

  # key_list alternates choice/explanation keys, one pair per judge, in the same
  # order as included_judges:
  # ['LLM Majority Judge_root.block.layer[0].block.block.unit[Map MaxPool]_choice',
  #  'LLM Majority Judge_root.block.layer[0].block.block.unit[Map MaxPool]_explanation',
  #  'LLM Majority Judge_root.block.layer[1].block.block.unit[Map MaxPool]_choice',
  #  'LLM Majority Judge_root.block.layer[1].block.block.unit[Map MaxPool]_explanation', ...]
  mode_scores = {}        # e.g. {'GPT-4o': 1, 'o3-mini': 2}
  mode_explanations = {}  # e.g. {'GPT-4o': 'explanation...', 'o3-mini': 'explanation...'}
  for i, judge_name in enumerate(included_judges):
    mode_scores[judge_name] = judge_dict[key_list[2 * i]]
    mode_explanations[judge_name] = judge_dict[key_list[2 * i + 1]]

  print(f"[{paper_id}] Mode scores: {mode_scores}")
  print(f"[{paper_id}] Mode explanations: {mode_explanations}")

  # -----------------------------------------------------------------------------
  # Aggregation + Save to CSV
  # -----------------------------------------------------------------------------

  # With equal weights this is the plain average, sum(x) / len(x).
  # Per-model weights could be substituted here to weigh some judges more heavily.

  model_weight = 1 / len(included_judges)
  aggregated_score = sum(mode_scores.values()) * model_weight

  print(f"[{paper_id}] Aggregated score: {aggregated_score}")

  row = {"paper": paper_id}
  row.update(mode_scores)
  row["aggregated"] = aggregated_score

  df = pd.DataFrame([row])
  ordered_cols = ["paper"] + included_judges + ["aggregated"]
  df = df[ordered_cols]

  # create a directory for this paper's explanations
  os.makedirs(os.path.join(results_dir, paper_id), exist_ok=True)

  for judge_name in included_judges:
    # one text file per judge inside the paper's directory (opened in append mode)
    explanation_file = os.path.join(results_dir, paper_id, f"{judge_name}.txt")
    with open(explanation_file, "a") as f:
      f.write(mode_explanations[judge_name])

  csv_path = os.path.join(results_dir, "aggregated_results.csv")

  # append one row per paper; write the header only when the file does not exist yet
  df.to_csv(csv_path, mode='a', header=not os.path.exists(csv_path), index=False)

Loading prediction and ground truth data...


  0%|          | 0/2 [00:00<?, ?it/s]


[paper_001] Extracted ground truth and prediction.


 50%|█████     | 1/2 [00:14<00:14, 14.24s/it]

[paper_001] Mode scores: {'GPT-4o': 1, 'o3-mini': 2, 'Claude 3.5': 3, 'Grok 2': 3}
[paper_001] Mode explanations: {'GPT-4o': '1. Different Approach: The predicted methodology focuses on developing an AI-powered system using CNNs and SVMs for image and text processing, whereas the true methodology focuses on sentiment analysis, web scraping, and association rule mining.\n\n2. Lack of Key Modules: The predicted methodology does not mention the specific modules used in the true methodology (Web scrapper, Sentiment analysis, Association rule mining, Data analysis dashboard).\n\n3. Different Data Sources: The predicted methodology includes laboratory and field test data, which is not mentioned in the true methodology. The true methodology focuses on Amazon reviews and images.\n\n4. Different Techniques: The true methodology uses LSTM for sentiment analysis and FP-growth for association rule mining, whereas the predicted methodology uses CNN and SVM for feature extraction and prediction.\n\n

100%|██████████| 2/2 [00:30<00:00, 15.25s/it]

[paper_002] Mode scores: {'GPT-4o': 3, 'o3-mini': 3, 'Claude 3.5': 3, 'Grok 2': 3}
[paper_002] Mode explanations: {'GPT-4o': "The predicted methodology diverges significantly from the original paper. The original study uses a specific dataset (X-Fact) and detailed experimental setup focusing on particular language families and specific LLMs like GPT-4, Mistral, and Llama, with a focus on translation techniques and translation bias. The predicted methodology, however, describes a broader and more generic approach to multilingual claim verification without referencing the specific dataset or experimental setup used in the original study. It mentions general dataset creation and selection of multilingual LLMs but does not specify the models used in the original study. Additionally, the predicted methodology includes components like bias analysis, cross-lingual transfer evaluation, calibration analysis, and robustness evaluation, which are not explicitly covered in the original paper. The 
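
The two aggregation steps in the cell above (pooling five repeated runs per judge, then averaging across judges) can be sketched in plain Python without Verdict. This is only an illustration under stated assumptions: it assumes the pooling step amounts to a majority vote over the repeated scores, as the `mode_scores` naming suggests, and the tie-breaking rule (lowest score wins) is an arbitrary choice made here, not something the notebook or Verdict specifies. The judge names, scores, and weights are made up.

```python
from collections import Counter

def majority_score(scores):
    """Most frequent score among repeated runs; ties go to the lowest score (assumption)."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)

def weighted_aggregate(per_judge, weights=None):
    """Weighted mean of per-judge scores; equal weights give the plain average."""
    if weights is None:
        weights = {name: 1.0 for name in per_judge}
    total = sum(weights[name] for name in per_judge)
    return sum(per_judge[name] * weights[name] for name in per_judge) / total

# Hypothetical repeated scores (5 runs per judge), made up for illustration.
repeated = {
    "judge_a": [3, 3, 2, 4, 3],
    "judge_b": [2, 2, 3, 3, 2],
}
per_judge = {name: majority_score(runs) for name, runs in repeated.items()}
equal = weighted_aggregate(per_judge)
skewed = weighted_aggregate(per_judge, {"judge_a": 3.0, "judge_b": 1.0})
print(per_judge, equal, skewed)
```

Passing a `weights` dictionary is one way to weigh some judges more heavily while keeping the equal-weight case identical to the current average.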




In [None]:
print(judge_dict["LLM Majority Judge_root.block.layer[0].block.block.unit[Map MaxPool]_choice"])
print(judge_dict["LLM Majority Judge_root.block.layer[0].block.block.unit[Map MaxPool]_explanation"])
print(key_list[0])
print(key_list[1])
print(key_list)

3
The predicted methodology and the original methodology share some high-level stages such as image acquisition, preprocessing (including segmentation with techniques like thresholding), feature extraction, and the use of traditional machine learning classifiers (e.g., SVM, RF). However, there are substantial differences. In the original, image acquisition is performed in an uncontrolled setting with specific details (e.g., fixed distance, capturing four images per tomato for comprehensive analysis) and an emphasis on transfer learning using pre‑trained deep networks (MobileNetv2, Inceptionv3, ResNet50, AlexNet) to extract deep features. In contrast, the predicted method uses an inline system with images captured on a conveyor belt under more controlled conditions and relies on handcrafted features (size, shape, color, texture) rather than deep feature extraction. Additionally, the predicted approach introduces a separate defect detection module (employing anomaly detection or a dedica
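
The main loop reads scores and explanations by position in `key_list`. An alternative is to parse the key names themselves, which avoids depending on key order. The sketch below assumes the keys follow the pattern shown in the notebook's comments (`...layer[N]...unit[Map MaxPool]_choice` and `..._explanation`) and that layer index `N` corresponds to the N-th entry of the judge list; both are assumptions based on the printed keys, not on documented Verdict behavior. The helper name `group_by_judge` and the sample values are hypothetical.

```python
import re

# Matches keys such as '..._root.block.layer[0]...unit[Map MaxPool]_choice'.
KEY_PATTERN = re.compile(r"layer\[(\d+)\].*_(choice|explanation)$")

def group_by_judge(judge_dict, judge_names):
    """Group pipeline outputs as {judge_name: {'choice': ..., 'explanation': ...}}."""
    grouped = {name: {} for name in judge_names}
    for key, value in judge_dict.items():
        match = KEY_PATTERN.search(key)
        if match is None:
            continue  # skip keys that do not follow the expected naming
        layer_idx, field = int(match.group(1)), match.group(2)
        grouped[judge_names[layer_idx]][field] = value
    return grouped

# Sample keys built from the pattern in the notebook; the values are placeholders.
prefix = "LLM Majority Judge_root.block.layer[{}].block.block.unit[Map MaxPool]_{}"
sample_output = {
    prefix.format(0, "choice"): 1,
    prefix.format(0, "explanation"): "placeholder explanation A",
    prefix.format(1, "choice"): 2,
    prefix.format(1, "explanation"): "placeholder explanation B",
}
grouped = group_by_judge(sample_output, ["GPT-4o", "o3-mini"])
print(grouped["GPT-4o"]["choice"], grouped["o3-mini"]["choice"])
```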