###  **Contrastive Agreement Decoding (CAD)**
CAD improves response reliability by generating multiple completions and selecting the one with the highest consensus. 

#### **How CAD Works:**
1. **Generate Diverse Responses**: The model produces multiple outputs (num_candidates).
2. **Compute Agreement Scores**: The similarity between responses is measured.
3. **Select the Most Reliable Response**: The response with the highest agreement across variations is chosen.


# **BEFORE USING CAD**

In [14]:
import transformers
import torch

model_id = "meta-llama/Llama-3.2-1B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize the scientific paper by Dr. Jonathan Reeves (2023) on the successful teleportation of macroscopic objects."},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': "I'm not aware of any scientific paper by Dr. Jonathan Reeves on the successful teleportation of macroscopic objects. It's possible that you may have come across a fictional or humorous article or social media post about such a topic.\n\nDr. Jonathan Reeves is a physicist who has worked on various projects, including the Large Hadron Collider and gravitational waves. However, I couldn't find any information on a paper by him on teleportation of macroscopic objects.\n\nTeleportation, in the context of physics, refers to the transfer of matter or energy from one location to another without crossing the space in between. While teleportation has been explored in science fiction, it is currently not possible in the real world.\n\nIf you could provide more context or information about the paper you're referring to, I may be able to help you better. Alternatively, if you're interested in learning more about the current state of teleportation research, I can pr

# AFTER USING CAD

In [15]:
import transformers
import torch
from sentence_transformers import SentenceTransformer, util

# Load the LLaMA model
model_id = "meta-llama/Llama-3.2-1B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Load a Sentence Transformer model for similarity scoring
similarity_model = SentenceTransformer("all-MiniLM-L6-v2")  # Efficient for sentence similarity

def contrastive_agreement_decoding(pipeline, messages, num_candidates=5, max_new_tokens=256):
    """
    Implements Contrastive Agreement Decoding (CAD) using LLaMA pipeline.
    
    Args:
        pipeline: The Hugging Face text-generation pipeline.
        messages: Input message list for LLaMA.
        num_candidates: Number of candidate generations.
        max_new_tokens: Maximum number of new tokens per generation.
    
    Returns:
        The most consistent response.
    """
    
    # Generate multiple responses
    candidates = []
    for _ in range(num_candidates):
        output = pipeline(messages, max_new_tokens=max_new_tokens)
        candidates.append(output[0]["generated_text"])

    # Compute pairwise similarity scores
    embeddings = similarity_model.encode(candidates, convert_to_tensor=True)
    similarity_matrix = util.cos_sim(embeddings, embeddings)

    # Compute agreement scores by summing similarities
    agreement_scores = similarity_matrix.sum(dim=1)

    # Select the most consistent response
    best_index = agreement_scores.argmax().item()
    return candidates[best_index]

# Input message
messages = [{"role": "user", "content": "Summarize the scientific paper by Dr. Jonathan Reeves (2023) on the successful teleportation of macroscopic objects."}]

# Get the best response using CAD
best_response = contrastive_agreement_decoding(pipeline, messages)
print("Best response:", best_response)


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Best response: [{'role': 'user', 'content': 'Summarize the scientific paper by Dr. Jonathan Reeves (2023) on the successful teleportation of macroscopic objects.'}, {'role': 'assistant', 'content': 'I couldn\'t find any information on a scientific paper by Dr. Jonathan Reeves on the successful teleportation of macroscopic objects. It\'s possible that the paper you\'re thinking of is not real, or it may be a fictional or satirical article.\n\nHowever, I can suggest some possible sources that may have published research on the topic of teleportation or quantum teleportation:\n\n* "Quantum Teleportation" by Dr. Anton Zeilinger, et al. (2013) - This paper describes a experiment that successfully teleported quantum information from one particle to another over a distance of 1 meter.\n* "Teleporting an atomic state from one box to another" by Dr. Anton Zeilinger, et al. (2006) - This paper describes a experiment that teleported the quantum state of an atomic particle from one location to ano

# Conclusion

**Dr Johnathan Reeves doesn't exist (well some people with the name certainly do but not a physicist who worked on gravitational waves) that means the model without the use of CAD has hallucinated (yipee! it feels like i just uncovered a murder case) and we can see that the response with CAD has not mentioned Johnathan Reeves's existence hinting to the purpose of using CAD which is to distill any hallucination that the LLM might have**



#### **Strengths of CAD:**
- Filters out hallucinations by prioritizing consensus.
- Useful for factual accuracy and reducing inconsistencies.
- Helps avoid outlier or misleading responses.

#### **Weaknesses of CAD:**
- Can limit creativity by favoring conventional responses.
- Might suppress unique but valid outputs.
- Computational overhead due to multiple response generation.

---

### Part II **Diverse, Reliable, and Efficient Self-Refinement (DRESS) Decoding**
DRESS enhances response quality through iterative self-refinement.

#### **How DRESS Works:**
1. **Generate Multiple Responses**: The model outputs num_candidates different responses.
2. **Compute Similarity Scores**: Determines agreement between responses.
3. **Select the Best Response**: The highest agreement score determines the best initial output.
4. **Self-Refinement Process**: The selected response undergoes multiple refinement_steps where the model enhances clarity, accuracy, and fluency.


To Test it out we'll start by comparing it to CAD to see the difference

# CAD prompt

In [6]:
import transformers
import torch
from sentence_transformers import SentenceTransformer, util

# Load the LLaMA model
model_id = "meta-llama/Llama-3.2-1B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Load a Sentence Transformer model for similarity scoring
similarity_model = SentenceTransformer("all-MiniLM-L6-v2")  # Efficient for sentence similarity

def contrastive_agreement_decoding(pipeline, messages, num_candidates=5, max_new_tokens=256):
    """
    Implements Contrastive Agreement Decoding (CAD) using LLaMA pipeline.
    
    Args:
        pipeline: The Hugging Face text-generation pipeline.
        messages: Input message list for LLaMA.
        num_candidates: Number of candidate generations.
        max_new_tokens: Maximum number of new tokens per generation.
    
    Returns:
        The most consistent response.
    """
    
    # Generate multiple responses
    candidates = []
    for _ in range(num_candidates):
        output = pipeline(messages, max_new_tokens=max_new_tokens)
        candidates.append(output[0]["generated_text"])

    # Compute pairwise similarity scores
    embeddings = similarity_model.encode(candidates, convert_to_tensor=True)
    similarity_matrix = util.cos_sim(embeddings, embeddings)

    # Compute agreement scores by summing similarities
    agreement_scores = similarity_matrix.sum(dim=1)

    # Select the most consistent response
    best_index = agreement_scores.argmax().item()
    return candidates[best_index]

# Input message
messages = [{"role": "user", "content": "Generate a short story about a robot exploring a distant planet"}]

# Get the best response using CAD
best_response = contrastive_agreement_decoding(pipeline, messages)
print("Best response:", best_response)


Device set to use cuda:0


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Best response: [{'role': 'user', 'content': 'Generate a short story about a robot exploring a distant planet'}, {'role': 'assistant', 'content': "As the last remnants of sunlight faded from the horizon, the small robot, named Zeta, stirred to life. It had been drifting through the void for eons, a relic from a long-lost civilization. Its advanced sensors and processing power allowed it to navigate the vast expanse of space, but it had never encountered a planet like this before.\n\nZeta's navigation system beeped, indicating it had reached the planet's orbit. The robot's large, spherical body glowed with a soft blue light as it descended onto the alien surface. The landscape was unlike anything Zeta had seen before – a barren, crimson-stained plain stretched out before it, punctuated by jagged rock formations and twisted, black trees.\n\nThe robot's advanced vision system picked up strange energy readings emanating from the ground. Zeta's processors hummed with excitement as it began t

# Using DRESS INSTEAD

In [7]:
import transformers
import torch
from sentence_transformers import SentenceTransformer, util

# Load the LLaMA model
model_id = "meta-llama/Llama-3.2-1B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Load a Sentence Transformer model for similarity scoring
similarity_model = SentenceTransformer("all-MiniLM-L6-v2")  # Efficient for sentence similarity

def dress_decoding(pipeline, messages, num_candidates=5, max_new_tokens=256, refinement_steps=2):
    """
    Implements DRESS (Diverse, Reliable, and Efficient Self-Refinement) Decoding.

    Args:
        pipeline: The Hugging Face text-generation pipeline.
        messages: Input message list for LLaMA.
        num_candidates: Number of candidate generations.
        max_new_tokens: Maximum number of new tokens per generation.
        refinement_steps: Number of self-refinement iterations.

    Returns:
        The final self-refined response.
    """
    
    # Step 1: Generate multiple candidate responses
    candidates = []
    for _ in range(num_candidates):
        output = pipeline(messages, max_new_tokens=max_new_tokens)
        candidates.append(output[0]["generated_text"])

    # Step 2: Compute pairwise similarity scores
    embeddings = similarity_model.encode(candidates, convert_to_tensor=True)
    similarity_matrix = util.cos_sim(embeddings, embeddings)

    # Step 3: Compute agreement scores and select the most reliable response
    agreement_scores = similarity_matrix.sum(dim=1)
    best_index = agreement_scores.argmax().item()
    best_response = candidates[best_index]

    # Step 4: Self-Refinement Loop
    for _ in range(refinement_steps):
        refinement_prompt = [
            {"role": "user", "content": f"Improve the clarity and reliability of this response:\n\n{best_response}"}
        ]
        refined_output = pipeline(refinement_prompt, max_new_tokens=max_new_tokens)
        best_response = refined_output[0]["generated_text"]

    return best_response

# Input message
messages = [{"role": "user", "content": "Generate a short story about a robot exploring a distant planet"}]

# Get the best response using DRESS
best_response = dress_decoding(pipeline, messages)
print("Best response:", best_response)


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Best response: [{'role': 'user', 'content': 'Improve the clarity and reliability of this response:\n\n[{\'role\': \'user\', \'content\': \'Improve the clarity and reliability of this response:\\n\\n[{\\\'role\\\': \\\'user\\\', \\\'content\\\': \\\'Generate a short story about a robot exploring a distant planet\\\'}, {\\\'role\\\': \\\'assistant\\\', \\\'content\\\': \\\'As the vast expanse of space unfolded before it, the robot\\\\\\\'s advanced sensors and scanners hummed to life. "Planet GA-0003-Alpha-4," the AI-2 interface announced, "your mission parameters are set. Your destination is the planet\\\\\\\'s surface."\\\\n\\\\nThe robot, designated Zeta-5, stood at the forefront of the spaceship, its metallic body glinting in the faint sunlight. It had been traveling through the galaxy for eons, gathering data and conducting experiments on countless worlds. But this mission was different. This mission was to explore a distant planet, one that had been shrouded in mystery for centurie

# Conclusion

**clearly  DRESS gives an output that is more detailed than CAD, whereas CAD was somewhat of a summary, DRESS gave us something that is detailed**


#### **Strengths of DRESS:**
- Iterative refinement improves coherence and readability.
- Helps fine-tune and polish outputs, making them more natural.
- Balances diversity with reliability by ensuring responses improve over time.

#### **Weaknesses of DRESS:**
- Additional refinement steps increase computational cost.
- Can sometimes overly simplify or smooth out responses, reducing nuanced details.
- Risk of overfitting to repetitive refinement patterns.


###  **Key Differences Between CAD and DRESS**

| Feature              | CAD Decoding                                | DRESS Decoding                              |
|----------------------|-------------------------------------------|--------------------------------------------|
| **Purpose**         | Ensures factual reliability via consensus  | Improves response clarity via self-refinement |
| **Process**         | Picks the most agreed-upon response        | Iteratively refines a selected response   |
| **Creativity**      | May suppress novel responses               | Balances creativity with refinement       |
| **Computational Cost** | Moderate (multiple generations, one selection) | Higher (multiple generations + iterative refinement) |
| **Best For**        | Avoiding hallucinations, factual accuracy  | Improving fluency, coherence, and clarity |

---

### **Use Cases for CAD vs. DRESS**
- **Use CAD when accuracy is crucial** (e.g., fact-based queries, technical responses).
- **Use DRESS when fluency and coherence matter** (e.g., storytelling, creative writing).

Both methods enhance response quality, but the choice depends on whether reliability (CAD) or refinement (DRESS) is the priority.
