# LLM Reasoning Summary

LLMs generate text by repeatedly predicting a likely next token; what looks like reasoning emerges from patterns learned during training. Key prompting and decoding techniques that affect reasoning include:

- **Zero-shot:** Answering tasks without examples.  
- **Few-shot:** Learning from a few examples in the prompt.  
- **Chain-of-Thought (CoT):** Breaking problems into intermediate reasoning steps.  
- **Analogical Reasoning:** Solving new problems using similarities to known cases.  
- **Decoding Strategies:** Choosing each output token by greedy decoding, beam search, or sampling.  
- **Self-Consistency:** Sampling multiple reasoning paths and selecting the most consistent answer.  
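
The self-consistency idea in the last bullet can be sketched as a majority vote over sampled answers. This is a minimal illustration, not a reference implementation: the helper names and the sample strings are hypothetical, and taking the last integer in each sample as its final answer is an assumption made for this example.

```python
import re
from collections import Counter


def extract_final_number(text):
    """Return the last integer that appears in text, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def self_consistent_answer(samples):
    """Majority vote over the final numbers extracted from sampled reasoning paths."""
    answers = [extract_final_number(s) for s in samples]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled outputs for "10 apples, eat 3, buy 5 more"
samples = [
    "10 - 3 = 7, then 7 + 5 = 12. The answer is 12",
    "Eat 3 leaves 7. Buy 5 gives 12",
    "10 + 5 = 15, minus 3 is 13",
]
print(self_consistent_answer(samples))  # 12 (two of the three paths agree)
```

In practice the samples would come from several sampled generations of the same prompt; the vote only helps when the model reaches the right answer more often than any single wrong one.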

**Limitations:**  
- **Irrelevant Context:** Unrelated or distracting information in a long prompt can derail the model's answer.  
- **Premise Order Sensitivity:** Changing the order of facts or statements can lead to different or incorrect conclusions.  
- **Self-Correction Challenges:** May fail to identify or correct mistakes without expli

## LLM Reasoning Experiments

In [7]:
from transformers import pipeline

In [2]:
# Initialize text-generation pipeline
gen = pipeline("text-generation", model="gpt2")

Device set to use mps:0


### Reasoning Experiment 1: Simple Math

In [3]:
prompt1 = "Step by step: If I have 10 apples and eat 3, then buy 5 more, how many apples do I have?"
output1 = gen(prompt1, 
              max_new_tokens=60,
              truncation=True,
              pad_token_id=50256
             )[0]['generated_text']

print("Prompt 1:", prompt1)
print("Output 1:", output1)

Prompt 1: Step by step: If I have 10 apples and eat 3, then buy 5 more, how many apples do I have?
Output 1: Step by step: If I have 10 apples and eat 3, then buy 5 more, how many apples do I have?

I am not sure if a specific apple is worth buying because it is not really obvious. If a specific apple is worth buying, then I would say to be careful. If you have a limited amount of apples, just buy them.

One of the most important things in buying apples


### Reasoning Experiment 2: Tricky Math

In [4]:
prompt2 = "Step by step: I have 2 bananas. I eat 2 and then buy 4. How many bananas do I have now?"
output2 = gen(prompt2, 
              max_new_tokens=60,
              truncation=True,
              pad_token_id=50256
             )[0]['generated_text']
print("\nPrompt 2:", prompt2)
print("Output 2:", output2)


Prompt 2: Step by step: I have 2 bananas. I eat 2 and then buy 4. How many bananas do I have now?
Output 2: Step by step: I have 2 bananas. I eat 2 and then buy 4. How many bananas do I have now? 6 bananas. Now I have 2 bananas. How many bananas do I have now? 1 banana. How many bananas do I have now? 1 banana. How many bananas do I have now? 2 bananas. Now I have 2 bananas. How many bananas do I have now? 2 bananas. Now


### Reasoning Experiment 3: Text / Story Prompt

In [5]:
prompt3 = "Step by step: Explain why the sky is blue in simple terms."
output3 = gen(prompt3, 
              max_new_tokens=100,
              truncation=True,
              pad_token_id=50256
             )[0]['generated_text']
print("\nPrompt 3:", prompt3)
print("Output 3:", output3)


Prompt 3: Step by step: Explain why the sky is blue in simple terms.
Output 3: Step by step: Explain why the sky is blue in simple terms.

How long are there days of the year in the sky?

The average amount of time we spend in the sky is:

Weekday 1: 3 hours

Weekday 2: 2 hours

Weekday 3: 1 hour

Weekday 4: 1 hour

Weekday 5: 1 hour

If you get this far, you'll almost certainly need to take a short break before you get to work.

How long do the days of


### Observations
**Which prompts worked?**

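Prompt 3 (Sky explanation) – Partially worked at best
- The output stays loosely on the topic of the sky, but in the run shown above it drifts into unrelated text about days and hours and never explains why the sky is blue.
- Output varies between runs because generation is sampled; a run that mentions “blue”, “Earth”, or “eyes” is still incorrect and repetitive, not scientifically accurate.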

**Which failed?**

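Prompt 1 & Prompt 2 (Math) – Failed
- Neither output arrives at the correct result: 10 − 3 + 5 = 12 for Prompt 1 and 2 − 2 + 4 = 4 for Prompt 2.
- Prompt 1 drifts into unrelated text about buying apples; Prompt 2 first answers “6 bananas” and then loops through contradictory counts.
- The prompt appears at the start of each output because the pipeline returns the prompt together with the continuation by default, not because GPT-2 chose to repeat it.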
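
The check behind these observations can be sketched in a few lines. The helper names are hypothetical, the output strings are shortened copies of the ones recorded above, and treating the first number in the continuation as the model's answer is a simplifying assumption.

```python
import re


def continuation(prompt, generated_text):
    """Drop the echoed prompt so only the model's continuation is checked."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def first_number(text):
    """Return the first integer in text, or None if no digits appear."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


def is_correct(prompt, generated_text, expected):
    """True only if the first number in the continuation equals the expected result."""
    return first_number(continuation(prompt, generated_text)) == expected


prompt1 = "Step by step: If I have 10 apples and eat 3, then buy 5 more, how many apples do I have?"
prompt2 = "Step by step: I have 2 bananas. I eat 2 and then buy 4. How many bananas do I have now?"

# Shortened copies of the outputs recorded above
output1 = prompt1 + "\n\nI am not sure if a specific apple is worth buying because it is not really obvious."
output2 = prompt2 + " 6 bananas. Now I have 2 bananas. How many bananas do I have now? 1 banana."

print(is_correct(prompt1, output1, 10 - 3 + 5))  # False: no number in the continuation
print(is_correct(prompt2, output2, 2 - 2 + 4))   # False: the first number is 6, not 4
```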
  
**Insights on why reasoning fails in GPT-2 (or smaller models)**
- GPT-2 is a base language model trained to continue text, so it mimics plausible language rather than performing calculations or logic.
- Prompt engineering helps a little, but model limitations dominate.
- This motivates agents, tool use, and larger LLMs for reasoning tasks.
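
To make the tool-use point concrete, here is a minimal sketch of a calculator tool that an agent could call instead of asking the model to do arithmetic. The `calculator` function is hypothetical and not part of any library; it assumes the expression has already been extracted from the question, and it supports only `+`, `-`, `*`, and `/`.

```python
import ast
import operator

# Supported binary operators; anything else is rejected
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression):
    """Evaluate a plain arithmetic expression without using eval()."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


print(calculator("10 - 3 + 5"))  # 12
print(calculator("2 - 2 + 4"))   # 4
```

Walking the parsed syntax tree keeps the tool limited to arithmetic, so text produced by a model cannot trigger arbitrary code the way a bare `eval()` could.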

## Reasoning Demo

We first tried reasoning with GPT-2, but it failed on every prompt.  
Here, we use a small instruction-tuned model (`bloomz-560m`) to see whether it handles the same kind of step-by-step prompt any better.  

> Note: Small models often give incorrect results for math/logic, and they may answer directly without showing any steps.
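
One way to encourage a small model to show intermediate steps is to combine the few-shot and chain-of-thought ideas from the summary: put worked examples in the prompt before the new question. The sketch below only builds the prompt string. The `build_few_shot_prompt` helper and the `Q:`/`A:` layout are assumptions for illustration, and this notebook does not test whether such a prompt improves the results of `bloomz-560m`.

```python
def build_few_shot_prompt(examples, question):
    """Format worked (question, reasoning) pairs followed by the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Worked examples reuse the arithmetic from the earlier experiments
examples = [
    (
        "If I have 10 apples and eat 3, then buy 5 more, how many apples do I have?",
        "10 - 3 = 7. 7 + 5 = 12. The answer is 12.",
    ),
    (
        "I have 2 bananas. I eat 2 and then buy 4. How many bananas do I have now?",
        "2 - 2 = 0. 0 + 4 = 4. The answer is 4.",
    ),
]

prompt = build_few_shot_prompt(
    examples,
    "If I have 12 oranges, give 4 to a friend, then buy 6 more, how many do I have?",
)
print(prompt)
```

The resulting string can be passed to a text-generation pipeline in place of the plain "Step by step:" prompt.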


In [20]:
from transformers import pipeline

# Small instruction-tuned model for demo (fast, no login required)
reasoner = pipeline(
    "text-generation",
    model="bigscience/bloomz-560m",
    device="cpu"  # use "cuda" if you have GPU
)

# Prompts for demonstration
prompts = [
    "Step by step: If I have 12 oranges, give 4 to a friend, then buy 6 more, how many do I have?",
    "Step by step: The Eiffel Tower is in Paris. Paris is in which country?"
]

for p in prompts:
    print("Prompt:", p)
    output = reasoner(p, max_length=128, temperature=0.2)
    print("Model Output:", output[0]["generated_text"], "\n")

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=128) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Prompt: Step by step: If I have 12 oranges, give 4 to a friend, then buy 6 more, how many do I have?


Both `max_new_tokens` (=256) and `max_length`(=128) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Model Output: Step by step: If I have 12 oranges, give 4 to a friend, then buy 6 more, how many do I have? 12 

Prompt: Step by step: The Eiffel Tower is in Paris. Paris is in which country?
Model Output: Step by step: The Eiffel Tower is in Paris. Paris is in which country? France 



### 📝 Observations
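- GPT-2 failed the earlier math and explanation prompts.  
- `bloomz-560m` answers directly without showing intermediate steps: it gets the factual question right (France) but the arithmetic wrong (it answers 12; the correct result is 12 − 4 + 6 = 14).  
- Larger instruction-tuned models such as Falcon-7B, Mistral-7B, or LLaMA-2 are generally needed for more **accurate reasoning**.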