# Self Verification: Enhancing LLM Reasoning with Solution Evaluation

This recipe demonstrates how to implement the Self Verification technique using Large Language Models (LLMs) with Mirascope. Self Verification is a solution evaluation technique that enhances an LLM's reasoning capabilities by generating multiple candidate solutions and scoring each one by how well the original problem can be reconstructed from it.

<div class="admonition tip">
<p class="admonition-title">Mirascope Concepts Used</p>
<ul>
<li><a href="../../../../learn/prompts/">Prompts</a></li>
<li><a href="../../../../learn/calls/">Calls</a></li>
<li><a href="../../../../learn/dynamic_configuration/">Dynamic Configuration</a></li>
<li><a href="../../../../learn/response_models/">Response Models</a></li>
</ul>
</div>

<div class="admonition note">
<p class="admonition-title">Background</p>
<p>
<a href="https://arxiv.org/pdf/2212.09561">Self Verification</a> is not a direct prompt engineering technique, but rather a solution evaluation technique used to pick the best answer from a list of candidates. It can also be used to provide feedback for generating improved answers. The technique involves generating multiple Chain of Thought (CoT) solutions, identifying key information in the original prompt, creating masked versions of the prompt, and then evaluating each solution based on its ability to reconstruct the original prompt.
</p>
</div>
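
To make the masking and reconstruction idea concrete, here is a minimal sketch that uses plain string operations in place of LLM calls. The helpers `mask_value` and `reconstruction_score` are hypothetical names introduced only for illustration, and exact string equality stands in for the semantic comparison an LLM would perform.

```python
def mask_value(prompt: str, value: str, placeholder: str = "X") -> str:
    """Replace the first occurrence of `value` in `prompt` with a placeholder."""
    if value not in prompt:
        raise ValueError(f"{value!r} does not occur in the prompt")
    return prompt.replace(value, placeholder, 1)


def reconstruction_score(original: str, filled_in_prompts: list[str]) -> int:
    """Count the filled-in prompts that match the original exactly.

    Assumption: exact equality is a stand-in for semantic equivalence.
    """
    return sum(filled == original for filled in filled_in_prompts)


prompt = "Jane walked 6 miles today. It was a sunny day."
masked = mask_value(prompt, "6")

# A solution that concluded "6 miles" restores the prompt; one that
# concluded "8 miles" does not.
good_fill = masked.replace("X", "6", 1)
bad_fill = masked.replace("X", "8", 1)

print(masked)
print(reconstruction_score(prompt, [good_fill]))
print(reconstruction_score(prompt, [bad_fill]))
```

A solution scores higher the more masked values it lets you recover, which is the signal the recipe uses to rank candidates.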

## Self Verification Implementation

Let's implement Self Verification using Mirascope:


import asyncio
import random

from mirascope.core import openai, prompt_template
from pydantic import BaseModel, Field


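# Note: random.uniform runs once, when the decorator is applied, so every call to
# zero_shot_cot shares the same temperature. Values near 2 can yield incoherent text.
@openai.call(model="gpt-4o-mini", call_params={"temperature": random.uniform(0, 2)})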
async def zero_shot_cot(query: str) -> str:
    return f"Answer the following question, going through it step by step: {query}"


class ProblemInfo(BaseModel):
    key_info: list[str] = Field(
        ...,
        description="""Key pieces of information needed to solve the problem.
        Each key info is written as a short phrase.""",
    )


@openai.call(model="gpt-4o-mini", response_model=ProblemInfo)
@prompt_template(
    """
    For the following question, identify the key pieces of information needed to answer the question.
    Question: {query}
    """
)
def identify_key_info(query: str): ...


class MaskedPrompts(BaseModel):
    masked_prompts: list[str] = Field(
        ...,
        description="""The original prompt with the specified key info replaced with X.
        Example:
        Prompt: Jane walked 6 miles today. It was a sunny day.
        Key info: How many miles Jane walked
        Masked Prompt: Jane walked X miles today. It was a sunny day.""",
    )


@openai.call(model="gpt-4o-mini", response_model=MaskedPrompts)
@prompt_template(
    """
    Generate one masked copy of the prompt for each piece of key info.
    The number of masked prompts you generate should be equal to the number of pieces of key info, and each masked prompt should mask one key piece of information.
    Prompt: {query}
    Key Info: {key_info}
    """
)
async def mask_prompt(query: str) -> openai.OpenAIDynamicConfig:
    key_info = identify_key_info(query).key_info
    return {"computed_fields": {"key_info": key_info}}


@openai.call(model="gpt-4o-mini")
@prompt_template(
    """
    SYSTEM:
    You will be given a masked prompt with some value X'd out, and the solution to the original problem.
    Return the prompt verbatim, but fill in the value for X according to what you see in the solution.

    USER:
    solution: {solution}
    masked prompt: {masked_prompt}
    """
)
async def fill_in_value(solution: str, masked_prompt: str): ...


class PromptComparison(BaseModel):
    score: int = Field(
        ...,
        description="""The number of variation prompts which are semantically equivalent to the original prompt. For numbers in the prompts, expect the same values and for quantitative words, only count semantically identical terms, such as two times = double.""",
    )


@openai.call(model="gpt-4o-mini", response_model=PromptComparison)
@prompt_template(
    """
    SYSTEM:
    You will be given an original prompt and some variations of it.
    Return the number of variations which are semantically identical to the original prompt.

    USER:
    Original Prompt:
    {query}
    Variations:
    {numbered_variations}
    """
)
def score_variations(
    query: str, filled_in_prompts: list[str]
) -> openai.OpenAIDynamicConfig:
    return {
        "computed_fields": {
            "numbered_variations": [
                f"{i + 1}. {prompt}" for i, prompt in enumerate(filled_in_prompts)
            ]
        }
    }


async def evaluate_solution(
    query: str, solution: str, masked_prompts: list[str]
) -> int:
    tasks = [fill_in_value(solution, masked_prompt) for masked_prompt in masked_prompts]
    responses = await asyncio.gather(*tasks)
    filled_in_prompts = [response.content for response in responses]
    score = score_variations(query, filled_in_prompts).score
    return score


async def self_verify(query: str, num_solutions: int) -> None:
    tasks = [mask_prompt(query)]
    tasks += [zero_shot_cot(query) for _ in range(num_solutions)]
    responses = await asyncio.gather(*tasks)
    masked_prompts = responses[0].masked_prompts
    cot_responses = [response.content for response in responses[1:]]
    tasks = [
        evaluate_solution(query, solution, masked_prompts) for solution in cot_responses
    ]
    scores = await asyncio.gather(*tasks)
    print(f"MAX SCORE:{len(masked_prompts)}")
    for cot_solution, score in zip(cot_responses, scores):
        print(f"Solution:\n{cot_solution}\nScore:\n{score}\n")


# Example usage
query = """Tim wanted to make lemonade for a pool party. For a gallon of lemonade, \
his recipe called for 1 cup of fresh lemon juice. He found that 6 lemons would yield \
1 cup of juice. He figured he would need to make 4 gallons of lemonade for the \
party. His best friend Allen asked if Tim could make an extra gallon for him that \
was twice as tart as the other gallons. How many lemons will Tim need?"""

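# Top-level await works in a notebook; in a script, use
# asyncio.run(self_verify(query, 3)) instead.
await self_verify(query, 3)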

MAX SCORE:5
Solution:
To answer this question, let's break it down into the necessary steps:

1. **Determine the Juice per Gallon:** 
   According to the recipe, Tim's vanilla recipe calls for 1 cup of fresh lemon juice for a gallon of lemonade.

2. **Calculate the Total Amount of Juice Needed:**
   Tim wants to make 4 gallons plus an extra gallon (that's twice as tart). So the extra gallon adds up to:  
   - 4 gallons + 1 gallon = 5 gallons of lemonade total.
   
   With 1 cup of juice needed per gallon, Tim will need:  
   - 5 gallons × 1 cup/gallon = 5 cups of lemon juice.

3. [The rest of this sampled solution is incoherent text, which can happen when the sampling temperature is close to 2. The output is truncated here.]

This implementation demonstrates the core components of Self Verification:

1. Generating multiple Chain of Thought solutions using `zero_shot_cot`.
2. Identifying key information in the prompt with `identify_key_info`.
3. Creating masked versions of the prompt using `mask_prompt`.
4. Evaluating solutions by reconstructing the original prompt with `fill_in_value` and `score_variations`.
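
Note that `self_verify` prints each solution with its score but does not pick a winner. A minimal sketch of that final step follows, assuming you collect `cot_responses` and `scores` as parallel lists. `select_best_solution` is a hypothetical helper, not part of Mirascope, and the example strings are made up.

```python
def select_best_solution(solutions: list[str], scores: list[int]) -> tuple[str, int]:
    """Return the highest-scoring solution and its score.

    Ties are broken in favor of the earliest solution.
    """
    if not solutions or len(solutions) != len(scores):
        raise ValueError("solutions and scores must be non-empty and equal in length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return solutions[best_index], scores[best_index]


# Illustrative values only.
solutions = ["solution A", "solution B", "solution C"]
scores = [3, 5, 5]
best, best_score = select_best_solution(solutions, scores)
print(best, best_score)
```

Returning the score alongside the solution makes it easy to compare it with the maximum possible score, `len(masked_prompts)`, and to reject every candidate when even the best one reconstructs too little of the prompt.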

## Benefits and Considerations

The Self Verification technique offers several advantages:

1. Improved accuracy: By generating multiple solutions and scoring them, the technique favors the candidate whose reasoning is most consistent with the facts stated in the problem.
2. Robustness: Self Verification helps mitigate the impact of occasional errors or inconsistencies in LLM outputs.
3. Transparency: The scoring process provides insight into how well each solution captures the key information from the original prompt.
4. Adaptability: The technique can be applied to a wide range of problem types and domains.

When implementing Self Verification, consider the following:

- Computational cost: Generating and evaluating multiple solutions requires more API calls and processing time compared to single-solution approaches.
- Prompt engineering: The effectiveness of the technique relies on well-crafted prompts for generating solutions and identifying key information.
- Scoring mechanism: The current implementation uses a simple count of correctly filled masked prompts, and that count is itself produced by an LLM call (`score_variations`). More sophisticated or deterministic scoring methods could be developed for specific use cases.
- Trade-offs between quantity and quality: Increasing the number of generated solutions may improve the chances of finding a high-quality answer but also increases computational overhead.
- Integration with other techniques: The recipe already generates its candidates with Chain of Thought. The verification scores can also be combined with methods such as Self-Consistency, which may improve results further.
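
The computational cost can be estimated directly from the structure of the implementation above. The sketch below counts LLM calls for one run of `self_verify`, assuming one call each to `identify_key_info` and `mask_prompt`, one `zero_shot_cot` call per solution, and, for every solution, one `fill_in_value` call per masked prompt plus one `score_variations` call. `count_llm_calls` is a hypothetical helper, and retries are not counted.

```python
def count_llm_calls(num_solutions: int, num_masked_prompts: int) -> int:
    """Estimate the number of LLM calls made by one run of self_verify."""
    setup_calls = 2  # identify_key_info + mask_prompt
    generation_calls = num_solutions  # one zero_shot_cot call per solution
    fill_in_calls = num_solutions * num_masked_prompts  # fill_in_value
    scoring_calls = num_solutions  # one score_variations call per solution
    return setup_calls + generation_calls + fill_in_calls + scoring_calls


# Three solutions and five masked prompts, as in the example run.
print(count_llm_calls(3, 5))  # 23
```

The count grows with the product of the number of solutions and the number of masked prompts, so both values matter when budgeting for a larger run.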

<div class="admonition tip">
<p class="admonition-title">Additional Real-World Applications</p>
<ul>
<li><b>Complex problem solving</b>: Use Self Verification for multi-step problems in fields like engineering or scientific research.</li>
<li><b>Content generation</b>: Improve the quality and accuracy of automatically generated reports or articles.</li>
<li><b>Decision support systems</b>: Enhance the reliability of AI-generated recommendations in high-stakes scenarios.</li>
<li><b>Educational tools</b>: Create more accurate and comprehensive explanations for students in various subjects.</li>
<li><b>Data analysis</b>: Verify and improve the interpretation of complex datasets in fields like finance or healthcare.</li>
</ul>
</div>

When adapting this recipe to your specific use case, consider:

- Tailoring the key information extraction process to your domain for better performance.
- Experimenting with different scoring mechanisms that align with your specific accuracy requirements.
- Implementing a feedback loop to continuously improve the quality of the generated solutions.
- Combining Self Verification with other techniques like Chain of Thought or Self-Consistency for even more powerful reasoning capabilities.
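
One possible way to combine Self Verification with Self-Consistency is to let the verification scores weight a vote over final answers. The sketch below is one such scheme, not the method from the paper or from this recipe. It assumes the final answer of each solution has already been extracted into a string. `weighted_vote` is a hypothetical helper and the values are illustrative.

```python
from collections import defaultdict


def weighted_vote(final_answers: list[str], scores: list[int]) -> str:
    """Return the answer with the highest total verification score.

    Ties are broken in favor of the answer that appears first.
    """
    if not final_answers or len(final_answers) != len(scores):
        raise ValueError("final_answers and scores must be non-empty and equal in length")
    totals: dict[str, int] = defaultdict(int)
    for answer, score in zip(final_answers, scores):
        totals[answer] += score
    return max(totals, key=totals.__getitem__)


# Illustrative values: two solutions agree on "36", one says "30".
final_answers = ["36", "30", "36"]
scores = [4, 5, 4]
print(weighted_vote(final_answers, scores))
```

In this example, the answer `"36"` wins with a combined score of 8 even though the single highest-scoring solution proposed `"30"`, which shows how agreement between solutions can outweigh one confident outlier.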

By leveraging Mirascope's `call` decorator, response models, and dynamic configuration, you can easily implement and customize the Self Verification technique to enhance your LLM's reasoning capabilities across a wide range of applications.
