In [1]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings


def get_embedding_model():
    """Return a Gemini embedding client (reads the GOOGLE_API_KEY environment variable)."""
    return GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")


def generate_embeddings(texts):
    """Embed a list of strings; returns one vector (list of floats) per input text."""
    embedding_model = get_embedding_model()
    return embedding_model.embed_documents(texts)

In [4]:
text1 = """The Judicial Panel Pattern
🧐 Problem
How do you perform a complex evaluation or make a decision that requires multiple, diverse criteria to be assessed simultaneously? A single, monolithic evaluation function can become incredibly complex, difficult to maintain, and hard to extend. For instance, evaluating the quality of an AI's response involves checking for factual accuracy, tone, safety, clarity, and helpfulness. Combining all this logic into one place is brittle. Similarly, in traditional systems like loan processing, you need to check credit history, income stability, and fraud risk, which are distinct areas of expertise.

💡 Solution
The Judicial Panel pattern provides a solution by decoupling the evaluation criteria into a collection of specialized, independent components called Judges. Each Judge is an expert in one specific area. A coordinating entity, the SuperJudge, acts as a Mediator. It doesn't perform evaluations itself but instead gathers the "verdicts" from all the individual Judges and synthesizes them into a single, final output.

This structure is evident in the repository:

ConcreteJudge: Represents a specialized evaluator. Each instance would be configured with a specific goal or "metaprompt" to assess one facet of a problem.

JudgeFactory: A factory for creating the required Judge instances for a given task, based on a configuration (Assembly).

SuperJudge: Implements the Mediator pattern. It has a register_judge method to enlist the individual evaluators and a notify method for each Judge to report its findings. Its final_verdict method aggregates these findings into a cohesive summary.

JudgeOrchestrator: The high-level component that initiates the entire process.

The process works as follows:

The JudgeOrchestrator receives a request.

It uses the JudgeFactory to create a panel of ConcreteJudge agents and a SuperJudge.

Each ConcreteJudge independently performs its evaluation, potentially using external tools or Plugins (like the StatisticalAnalysisPlugin).

Each Judge reports its findings to the SuperJudge.

The SuperJudge consolidates all the reports into a final, comprehensive verdict.

🚀 Applicability & Reusability
This pattern is highly reusable and valuable for traditional software projects beyond its AI origins. It provides a clean, extensible architecture for any system requiring complex, multi-criteria decision-making.

Financial Services: In a loan approval system, you could have a CreditScoreJudge, an IncomeVerificationJudge, and a FraudDetectionJudge. The SuperJudge would aggregate their reports to make a final approve/deny decision.

Automated Content Moderation: Instead of one complex filter, you can deploy a HateSpeechJudge, a SpamJudge, and a PIIJudge (Personally Identifiable Information). The panel's final verdict determines if the content is published.

System Diagnostics: To monitor the health of a complex application, you can have a DatabaseHealthJudge, a NetworkLatencyJudge, and an ApplicationLogJudge. The SuperJudge provides a holistic view of the system's status.

E-commerce: When a new product is uploaded, different judges can validate the image quality (ImageJudge), check the description for prohibited keywords (DescriptionJudge), and verify the pricing against business rules (PricingJudge)."""

In [5]:
text2 = """
🧠 What Is "LLM as Judge"?

In this pattern, an LLM is tasked with evaluating outputs from other AI models or agents. The evaluation is guided by specific instructions embedded in the prompt, such as assessing factual accuracy, tone, coherence, or adherence to guidelines. This method is particularly useful for quality assurance, model benchmarking, and safety monitoring in production environments. 
Evidently AI

🔄 How It Works

Input Processing: The system collects outputs from AI agents or models, which may include text, code, or responses.

Evaluation Prompting: An LLM is prompted to assess the collected outputs based on defined criteria.

Scoring and Reasoning: The LLM provides a score (e.g., 1–10) along with a rationale for its assessment.

Feedback Loop: The evaluation results are used to refine the AI models, update training data, or adjust system behavior.

This process can be automated to continuously monitor and improve AI system performance.

🧩 Design Patterns and Architectures

Implementing "LLM as Judge" involves several design considerations:

Evaluation Granularity: Breaking down tasks into smaller components allows for more precise evaluations and reduces ambiguity. 
Confident AI

Multi-Agent Collaboration: Utilizing multiple LLMs can provide diverse perspectives and enhance evaluation robustness.

Human-in-the-Loop: Incorporating human oversight at critical points ensures that evaluations align with ethical standards and domain expertise. 
Medium

⚙️ Practical Applications

Automated Content Moderation: Assessing user-generated content for compliance with community guidelines.

Model Benchmarking: Comparing the performance of different AI models or versions.

Safety Monitoring: Detecting and mitigating harmful or biased outputs in AI systems.

These applications are particularly relevant in industries where content quality and safety are paramount, such as social media platforms, customer service, and healthcare.
"""

In [20]:
text3 = """
from call_graph_generator import build_call_graph
from repo_structure_extractor import extract_repo_structure
import json
import os
from utils import save_json
    
    def extract_structure(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
                The Judicial Panel Pattern
                m"""

In [12]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
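# Compare the two pattern descriptions: text1 (Judicial Panel) vs. text2 (LLM as Judge).
embs = generate_embeddings([text1, text2])
np_emb1 = np.array(embs[0])
np_emb2 = np.array(embs[1])
# cosine_similarity expects 2-D inputs, so each vector is wrapped in a list.
print(cosine_similarity([np_emb1], [np_emb2])[0][0])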

E0000 00:00:1760611038.973579  103540 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.
E0000 00:00:1760611038.975140  103540 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


0.8714339009011423
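
The score printed above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their Euclidean norms. A minimal standard-library sketch of that computation follows. The function name `cosine` and the toy vectors are illustrative and not part of the notebook; real embeddings are much longer lists of floats, and raising on a zero vector is a choice made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, not real embeddings.
print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([3.0, 4.0], [4.0, 3.0]))  # somewhere in between
```

Because the score depends only on direction, scaling either vector by a positive constant leaves it unchanged.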


In [21]:
# Compare the Judicial Panel description (text1) with the unrelated code snippet (text3).
embs = generate_embeddings([text1, text3])
np_emb1 = np.array(embs[0])
np_emb2 = np.array(embs[1])
print(cosine_similarity([np_emb1], [np_emb2])[0][0])

0.7524981720667041


In [22]:
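# Compare the LLM-as-Judge description (text2) with the unrelated code snippet (text3).
embs = generate_embeddings([text2, text3])
np_emb1 = np.array(embs[0])  # embedding of text2
np_emb2 = np.array(embs[1])  # embedding of text3
print(cosine_similarity([np_emb1], [np_emb2])[0][0])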

0.6899585992138212
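
The three comparison cells above repeat the same steps for each pair of texts. One way to avoid the repetition is a helper that embeds all texts in a single call and then scores every pair. The sketch below is an illustration under stated assumptions, not the notebook's code: `pairwise_similarities`, its `embed_fn` parameter, and `fake_embed` are hypothetical names, and the stand-in embedder exists only so the example runs without an API key.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarities(named_texts, embed_fn):
    """Embed every text once, then score each unordered pair.

    named_texts: dict mapping a label to its text.
    embed_fn: callable taking a list of strings and returning one vector per string.
    """
    names = list(named_texts)
    vectors = dict(zip(names, embed_fn([named_texts[n] for n in names])))
    return {(a, b): cosine(vectors[a], vectors[b]) for a, b in combinations(names, 2)}


def fake_embed(texts):
    """Hypothetical stand-in embedder: [character count, word count, 1.0]."""
    return [[float(len(t)), float(len(t.split())), 1.0] for t in texts]


scores = pairwise_similarities(
    {"a": "judge panel", "b": "judge panel", "c": "x"}, fake_embed
)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 4))
```

In the notebook, the same helper could be called as `pairwise_similarities({"text1": text1, "text2": text2, "text3": text3}, generate_embeddings)`, which would request all three embeddings in one `embed_documents` call instead of three.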
