<a href="https://colab.research.google.com/github/CesarChalcoElias/prompting-techniques/blob/main/02_Self_Consistency.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Self-Consistency and Multiple Paths of Reasoning

### **Overview**  
This tutorial explores the concept of self-consistency and multiple paths of reasoning in prompt engineering. We'll focus on techniques for generating diverse reasoning paths and aggregating results to improve the quality and reliability of AI-generated answers.

### **Motivation**  
Large language models can sometimes produce inconsistent or unreliable outputs. By leveraging multiple reasoning paths and aggregating results, we can enhance the robustness and accuracy of AI-generated responses. This approach is particularly useful for complex problem-solving tasks where a single path of reasoning might be insufficient or prone to errors.

## Setup

In [22]:
!pip install langchain-openai -q

In [23]:
import os
import random
from collections import Counter
from google.colab import userdata

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

In [24]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [25]:
llm = ChatOpenAI(model_name="gpt-5-nano")

## Generating Multiple Reasoning Paths

In [26]:
def generate_multiple_reasoning_paths(
    problem: str,
    num_paths: int = 3
):
    """
    Generate multiple reasoning paths for a given problem.

    Args:
        problem (str): The problem to be solved.
        num_paths (int): The number of reasoning paths to generate.

    Returns:
        list: A list of generated reasoning paths.
    """
    prompt_template = PromptTemplate(
        input_variables=["problem", "path_number"],
        template="""Solve the following problem using a unique approach. This is reasoning path {path_number}.
        Problem: {problem}
        Reasoning path {path_number}:"""
    )

    # Build the chain once; it is reused for every path.
    chain = prompt_template | llm

    paths = []

    for path_index in range(num_paths):
        response = chain.invoke(
            {
                "problem": problem,
                "path_number": path_index + 1
            }
        ).content
        paths.append(response)

    return paths
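
The function above requests the paths one at a time, so total latency grows with `num_paths`. Because the paths are independent of each other, they could in principle be requested concurrently. The following is a minimal sketch of that idea using only the standard library; `fake_llm_call` and `generate_paths_concurrently` are hypothetical names, and the stand-in function replaces the real `chain.invoke(...).content` call so the example runs without an API key. LangChain runnables also expose a `batch` method that may serve the same purpose.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(problem: str, path_number: int) -> str:
    """Hypothetical stand-in for chain.invoke(...).content (no network call)."""
    return f"Path {path_number} for: {problem}"


def generate_paths_concurrently(problem: str, num_paths: int = 3, call=fake_llm_call) -> list:
    """Request all reasoning paths at once and return them in path order."""
    with ThreadPoolExecutor(max_workers=num_paths) as pool:
        # Submit one task per path; path numbers start at 1 as in the prompt.
        futures = [pool.submit(call, problem, n) for n in range(1, num_paths + 1)]
        # Collect results in submission order so path i stays at index i - 1.
        return [future.result() for future in futures]


print(generate_paths_concurrently("How high will the ball go?"))
```

Whether this helps in practice depends on the provider's rate limits, which this sketch does not handle.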


In [27]:
problem = "A ball is thrown upwards with an initial velocity of 20 m/s. How high will it go?"
paths = generate_multiple_reasoning_paths(problem)

for i, path in enumerate(paths, 1):
    print("=" * 40)
    print(f"Path {i}:\n{path}\n")

Path 1:
Approach: energy conservation (no air resistance)

- At launch: kinetic energy KE = (1/2) m u^2 with u = 20 m/s.
- At the top: velocity is zero, so all energy is potential: PE = m g h.
- With energy conservation: (1/2) m u^2 = m g h → h = u^2 / (2 g).

Plug in values (g ≈ 9.8 m/s^2): h = 20^2 / (2 × 9.8) = 400 / 19.6 ≈ 20.4 m.

Answer: approximately 20.4 meters. (Mass cancels out; same result with any mass.)

Path 2:
Reasoning path 2: Area under the velocity–time graph (triangle)

- With constant downward acceleration g, the velocity is v(t) = v0 − g t.
- It comes to rest at t, where v(t) = 0 ⇒ t = v0/g.
- The height reached is the area under the v(t) curve from t = 0 to t = v0/g. Since it’s a straight line, this area is a triangle with base t and height v0:
  h = (1/2) * (v0/g) * v0 = v0^2 / (2g).

Plugging in v0 = 20 m/s and g ≈ 9.8 m/s^2:
- t ≈ 20 / 9.8 ≈ 2.04 s
- h ≈ (1/2) * 2.04 * 20 ≈ 20.4 m

Answer: about 20.4 meters (ignoring air resistance).

Path 3:
I can’t share inte
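
Both complete paths above reduce to the same expression, h = v0^2 / (2g). As a quick sanity check outside the model, the two derivations can be evaluated directly. This is only an illustrative check under the same idealizations the paths state (constant g, no air resistance); the variable names are chosen here for illustration.

```python
v0 = 20.0  # initial velocity in m/s, from the problem statement
g = 9.8    # gravitational acceleration in m/s^2, as used in the paths above

# Path 1: energy conservation, (1/2) m v0^2 = m g h  =>  h = v0^2 / (2 g)
h_energy = v0**2 / (2 * g)

# Path 2: area of the triangle under the velocity-time line
t_top = v0 / g                 # time at which the velocity reaches zero
h_area = 0.5 * t_top * v0      # (1/2) * base * height

print(f"energy: {h_energy:.2f} m, area: {h_area:.2f} m")
```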

## Aggregating results

The following function aggregates the reasoning paths and determines the most consistent answer.

In [28]:
def aggregate_results(paths) -> str:
    """
    Aggregate the results of multiple reasoning paths
    and determine the most consistent answer.

    Args:
        paths (list): A list of reasoning paths.

    Returns:
        str: The most consistent answer among the paths.
    """
    prompt_template = PromptTemplate(
        input_variables=["paths"],
        template="""Analyze the following reasoning paths and determine the most consistent answer as conclusion. If there are discrepancies, explain why and provide the most likely correct answer.
        Reasoning paths:
        {paths}

        Most consistent answer:"""
    )

    chain = prompt_template | llm
    response = chain.invoke({"paths": "\n".join(paths)}).content
    return response

In [29]:
aggregated_result = aggregate_results(paths)
print("Aggregated Result:\n", aggregated_result)

Aggregated Result:
 Most consistent answer: height ≈ 20.4 meters (neglecting air resistance).

Why they agree (and no real discrepancy):
- All three approaches are physically equivalent under the same assumptions (constant gravity, no drag, initial upward velocity 20 m/s).
- Approach 1 (energy conservation) gives h = v0^2 / (2g) = 400 / (2g) = 400 / 19.6 ≈ 20.4 m.
- Approach 2 (area under v–t graph) also yields h = (1/2) v0 (v0/g) = v0^2 / (2g) ≈ 20.4 m.
- Approach 3 (concise energy-based calculation) yields the same expression and numeric result.

Notes:
- Mass cancels in the energy equation, so the result is independent of mass.
- If you use a slightly different g (e.g., 9.81 m/s^2), you get h ≈ 400 / 19.62 ≈ 20.39 m, still about 20.4 m.
- Real-world caveat: air resistance would reduce the height slightly, but within typical introductory physics, 20.4 m is the correct neglecting-air-resistance value.
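
The aggregation above delegates the comparison to the LLM. A common alternative for problems with a short final answer is a plain majority vote over the answers extracted from each path. The sketch below assumes each path states its result on a line containing `Answer:` followed by a number, as paths 1 and 2 do here; `extract_final_number` and `majority_vote` are hypothetical helpers, and real outputs will often need more robust parsing.

```python
import re
from collections import Counter


def extract_final_number(text: str):
    """Hypothetical helper: first number after 'Answer:', or None if absent."""
    match = re.search(r"Answer:\D*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def majority_vote(paths: list, ndigits: int = 1):
    """Return (most common answer, share of all paths that voted for it)."""
    extracted = [extract_final_number(p) for p in paths]
    # Paths without a parsable answer (e.g. refusals) cast no vote.
    votes = [round(a, ndigits) for a in extracted if a is not None]
    if not votes:
        return None, 0.0
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / len(paths)


sample_paths = [
    "Answer: approximately 20.4 meters. (Mass cancels out.)",
    "Answer: about 20.4 meters (ignoring air resistance).",
    "I can't share that.",
]
answer, agreement = majority_vote(sample_paths)
print(answer, round(agreement, 2))
```

The agreement share gives a rough confidence signal that the LLM-based aggregation does not expose directly.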


## Self-Consistency Check

To further improve our results, let's implement a self-consistency check that evaluates the reliability of our aggregated answer.

In [30]:
def self_consistency_check(
    problem: str,
    aggregated_result: str
) -> str:
    """
    Perform a self-consistency check on the aggregated result.

    Args:
        problem (str): The problem to be solved.
        aggregated_result (str): The aggregated result from multiple reasoning paths.

    Returns:
        str: The final answer after self-consistency check.
    """
    prompt_template = PromptTemplate(
        input_variables=["problem", "result"],
        template="""Evaluate the consistency and reliability of the following result for the given problem.
        Problem: {problem}
        Result: {result}

        Evaluation (consider factors like logical consistency, adherence to known facts, and potential biases):"""
    )

    chain = prompt_template | llm
    response = chain.invoke({"problem": problem, "result": aggregated_result}).content
    return response

In [31]:
consistency_evaluation = self_consistency_check(problem, aggregated_result)
print("Self-Consistency Evaluation:\n", consistency_evaluation)

Self-Consistency Evaluation:
 Overall assessment: The result is consistent and reliable under the stated idealizations (constant g, no air resistance, launch from ground level).

Key points supporting consistency
- The numeric value h ≈ 20.4 m comes from h = v0^2/(2g). With v0 = 20 m/s and g ≈ 9.8 m/s^2, h ≈ 400/19.6 ≈ 20.4 m. Using g = 9.81 m/s^2 gives ≈ 20.39 m. These are in good agreement.
- All three approaches cited (energy conservation, area under the v–t graph, and a concise energy-based derivation) yield the same formula h = v0^2/(2g). This internal consistency confirms sound reasoning.
- The mass cancellation in the energy approach is correct for this problem, reinforcing the claim that the result is independent of mass.

Assumptions and caveats
- Launch height: The calculation assumes the ball is launched from ground level (height zero). If the launch height h0 is not zero, the maximum height above the ground is h0 + v0^2/(2g).
- Air resistance: Neglected. In the real world, 

## Applying the same approach to different problem types

In [32]:
def solve_problem(problem):
    """
    Solve a problem using multiple reasoning paths, aggregation, and self-consistency check.

    Args:
        problem (str): The problem statement.

    Returns:
        tuple: (aggregated_result, consistency_evaluation)
    """
    paths = generate_multiple_reasoning_paths(problem)
    aggregated_result = aggregate_results(paths)
    consistency_evaluation = self_consistency_check(problem, aggregated_result)
    return aggregated_result, consistency_evaluation


In [34]:
# Example problems
problems = [
    "What is the capital of France?",
    "Explain the concept of supply and demand in economics.",
    "If a train travels at 60 km/h, how long will it take to cover 180 km?"
]

for problem in problems:
    print(f"Problem: {problem}")
    result, evaluation = solve_problem(problem)
    print("Aggregated Result:\n", result)
    print("\nConsistency Evaluation:\n", evaluation)
    print("\n" + "-"*50 + "\n")

Problem: What is the capital of France?
Aggregated Result:
 Most consistent answer: Paris

Why: All reasoning paths correctly identify Paris as the capital and seat of government of France. There are no discrepancies among the paths; they all align with the standard geography fact that Paris is the capital of France. Paris has served as the capital for many centuries, making it the correct country-to-capital mapping.

Consistency Evaluation:
 Assessment: Highly reliable and consistent.

- Factual correctness: The capital of France is Paris. The result states Paris, which is correct.

- Logical consistency: The reasoning note claims all paths point to Paris and align with standard geography. There are no contradictions or alternative conclusions presented.

- Adherence to known facts: Consistent with widely accepted knowledge. Paris has long been the seat of government and the capital for centuries, matching the typical interpretation of the term “capital.”

- Potential biases or limita

## Conclusions

**Trade-offs**  
* This approach often achieves strong, sometimes state-of-the-art, performance on reasoning tasks, but at the cost of many more tokens than other strong techniques such as chain-of-thought (CoT) prompting.
* It's slower because the user has to wait while the model generates the N reasoning paths.
* It's considerably more expensive because we pay for every generated path.
* It's even less deterministic because the aggregation LLM takes multiple non-deterministic answers as input, which can add random noise.
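
To make the cost trade-off concrete: the `solve_problem` pipeline in this notebook makes one call per path plus one aggregation call and one consistency-check call. The sketch below turns that into a rough estimate. The function name and the per-call token figures are illustrative assumptions, not measurements; substitute numbers observed in your own runs.

```python
def estimate_pipeline_cost(num_paths: int = 3,
                           tokens_per_path: int = 400,
                           tokens_per_summary: int = 300) -> dict:
    """Rough per-problem estimate; the token figures are assumed, not measured."""
    # N reasoning paths + 1 aggregation call + 1 consistency-check call.
    llm_calls = num_paths + 2
    # Tokens generated: every path, the aggregated result, and the evaluation.
    output_tokens = num_paths * tokens_per_path + 2 * tokens_per_summary
    # Tokens fed back in: the aggregator reads all paths, and the checker
    # reads the aggregated result (problem text and prompt wording ignored).
    reread_input_tokens = num_paths * tokens_per_path + tokens_per_summary
    return {
        "llm_calls": llm_calls,
        "output_tokens": output_tokens,
        "reread_input_tokens": reread_input_tokens,
    }


estimate = estimate_pipeline_cost()
print(estimate)
```

A single direct prompt would be one call, so with the default of three paths this pipeline makes five times as many calls.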

**When to apply it**  
* This is a solid approach for complex problems such as math problems, as well as a powerful way to get a robust 'I don't know' as the answer when there's no consensus among the paths.  
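
One way to obtain that 'I don't know' is to abstain whenever the most common answer falls below an agreement threshold. This is a minimal sketch, assuming the final answers have already been extracted from each path as comparable values; `answer_or_abstain` and the default threshold of 0.6 are hypothetical choices to tune per task.

```python
from collections import Counter


def answer_or_abstain(answers: list, min_agreement: float = 0.6):
    """Return the majority answer, or "I don't know" without enough agreement."""
    if not answers:
        return "I don't know"
    top_answer, votes = Counter(answers).most_common(1)[0]
    # Abstain when the winning answer's share is below the threshold.
    if votes / len(answers) < min_agreement:
        return "I don't know"
    return top_answer


print(answer_or_abstain(["20.4", "20.4", "19.6"]))  # two of three paths agree
print(answer_or_abstain(["12", "15", "20"]))        # no two paths agree
```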

**When NOT to apply it**  
* For creative use cases where there's no single ground truth.
* For extraction or classification problems where a simpler prompt already performs above the baseline: the small gain rarely justifies the extra time and cost of this slower approach.
* Some reasoning models, such as OpenAI o1 or DeepSeek-R1, already reason through and check their answer internally before showing it to the user, so layering this technique on top of them is unlikely to add much.