## Extension Activity: Comparing Language Models

### Overview

In this extension activity, you'll load a different, more powerful language model (Qwen3-8B) and compare its performance to the TinyLlama model you've been using. This will help you understand how model size and training affect AI capabilities.

**What you'll explore:**
- How to load and compare different language models
- The relationship between model size and capability
- Tradeoffs between speed and accuracy
- How to choose the right model for different applications

### Part 11 - Understanding Model Differences

**TinyLlama-1.1B:**
- Size: 1.1 billion parameters
- Strengths: Fast, runs on most computers, good for learning
- Limitations: Limited reasoning ability, handles long or complex context less reliably

**Qwen3-8B:**
- Size: 8 billion parameters (over 7x larger!)
- Strengths: Better reasoning, more accurate, better instruction following
- Limitations: Requires more memory and processing power, slower responses
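
To make "requires more memory" concrete, here is a rough back-of-the-envelope sketch. It assumes each parameter is stored as a 16-bit number (2 bytes), which is an assumption about how the weights get loaded, and it counts only the weights themselves. Real memory use will be higher because of activations and framework overhead. The function name `weight_memory_gb` is made up for this illustration.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Assumption: 2 bytes per parameter (16-bit weights). Actual usage is higher.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts taken from the descriptions above
models = {
    "TinyLlama (1.1B)": 1.1e9,
    "Qwen (8B)": 8e9,
}

for name, num_params in models.items():
    gb_16bit = weight_memory_gb(num_params)
    gb_32bit = weight_memory_gb(num_params, bytes_per_param=4)
    print(f"{name}: ~{gb_16bit:.1f} GB at 16-bit, ~{gb_32bit:.1f} GB at 32-bit")
```

Under these assumptions the larger model needs roughly seven times the memory for its weights alone, which is why it may not fit on every computer.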

### Part 12 - Setting Up Both Models

We'll set up both TinyLlama and Qwen3-8B from scratch so we can compare them side-by-side.

#### 12.0 - Import Required Libraries

In [None]:
# Core LLM libraries
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate

# Transformers for loading models
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Utilities
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

#### 12.1 - Load TinyLlama Model

In [None]:
# Load TinyLlama - the smaller, faster model
print("📥 Loading TinyLlama-1.1B model...")
print("⏳ This may take a few minutes on first run...")

tiny_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load tokenizer and model
tiny_tokenizer = AutoTokenizer.from_pretrained(tiny_model_name)
tiny_model = AutoModelForCausalLM.from_pretrained(
    tiny_model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Create pipeline
tiny_pipe = pipeline(
    "text-generation",
    model=tiny_model,
    tokenizer=tiny_tokenizer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

# Wrap for LangChain
llm = HuggingFacePipeline(pipeline=tiny_pipe)

print("✅ TinyLlama model loaded successfully!")
print("📊 Model size: ~1.1 billion parameters")

#### 12.2 - Load the Qwen3-8B Model

In [None]:
# Load the Qwen3-8B model - the larger, more powerful model
print("📥 Loading Qwen3-8B model...")
print("⏳ This is a larger model and will take longer to load...")

qwen_model_name = "Qwen/Qwen3-8B"  # The model named throughout this activity

# Load tokenizer and model
qwen_tokenizer = AutoTokenizer.from_pretrained(qwen_model_name)
qwen_model = AutoModelForCausalLM.from_pretrained(
    qwen_model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Create pipeline
qwen_pipe = pipeline(
    "text-generation",
    model=qwen_model,
    tokenizer=qwen_tokenizer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

# Wrap for LangChain
qwen_llm = HuggingFacePipeline(pipeline=qwen_pipe)

print("✅ Qwen3 model loaded successfully!")
print("📊 Model size: ~8 billion parameters")

##### Question 45
How long did it take to load each model? Which one took longer and why do you think that is?
- **TinyLlama load time:**
- **Qwen load time:**
- **Explanation:**
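
If you would rather measure load times than estimate them, one option is a small timing helper like the sketch below. The names `timed` and `load_times` are made up for this example, and the `time.sleep` call is only a stand-in for the real loading code.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, results: dict):
    """Record how long the wrapped block takes (in seconds) under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.2f} seconds")

load_times = {}

# Stand-in for a slow step such as AutoModelForCausalLM.from_pretrained(...)
with timed("demo", load_times):
    time.sleep(0.1)
```

To use it in this lab, you could put the tokenizer and model loading lines from 12.1 and 12.2 inside `with timed("TinyLlama", load_times):` and `with timed("Qwen", load_times):` blocks. Keep in mind that the first run also includes download time, so a second run gives a fairer comparison.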

### Part 13 - Direct Comparison Tests

Let's test both models with the same prompts to see how they differ!

#### 13.0 - Compare Basic Responses

In [None]:
# Comparison test prompts
comparison_prompts = [
    "Explain what machine learning is in one sentence.",
    "Write a creative story starter (2-3 sentences) about a robot learning to paint.",
    "List 3 pros and cons of social media.",
    "Solve this problem: If a train travels 60 miles in 45 minutes, what is its speed in miles per hour?"
]

print("🔬 COMPARISON TEST: TinyLlama vs Qwen3\n")
print("=" * 80)

for i, prompt in enumerate(comparison_prompts, 1):
    print(f"\n📝 Test {i}: {prompt}")
    print("-" * 80)

    # TinyLlama response
    print("\n🤖 TinyLlama Response:")
    tiny_response = llm.invoke(prompt)
    print(tiny_response)

    # Qwen response
    print("\n🤖 Qwen3 Response:")
    qwen_response = qwen_llm.invoke(prompt)
    print(qwen_response)
    
    print("\n" + "=" * 80)

##### Question 46
Which model gave more accurate responses? Give specific examples.
- **Answer:**
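
To judge accuracy on the train problem, it helps to know the correct answer before reading the model outputs. The sketch below works it out and adds a crude check for whether a response mentions that number. The helper `mentions_number` is made up for this example and only looks for matching digits, so you should still read each response yourself.

```python
import re

# Train problem from the comparison prompts: 60 miles in 45 minutes
distance_miles = 60
time_minutes = 45

# Convert minutes to hours, then divide distance by time
expected_mph = distance_miles / (time_minutes / 60)
print(expected_mph)  # 80.0

def mentions_number(text: str, target: float) -> bool:
    """Crude check: does any number written in the text equal the target?"""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return any(float(n) == target for n in numbers)

# Example with a hand-written response (not real model output)
print(mentions_number("The train's speed is 80 miles per hour.", expected_mph))  # True
```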

##### Question 47
Which model gave more detailed or elaborate responses?
- **Answer:**

##### Question 48
Did you notice any difference in the writing style or tone between the two models?
- **Answer:**

### Part 14 - Custom AI Assistant Comparison

Now let's test your custom AI assistant from the main lab with both models!

#### 14.0 - Create Two Versions of Your Custom AI

In [None]:
# TODO: Copy your custom system prompt from Part 6 or Part 8 of the main lab
my_comparison_prompt = """
[PASTE YOUR SYSTEM PROMPT HERE]
"""

# Create template (same for both)
comparison_template = ChatPromptTemplate.from_messages([
    ("system", my_comparison_prompt),
    ("human", "{question}")
])

# Create two chains - one for each model
tiny_custom_chain = comparison_template | llm        # TinyLlama version
qwen_custom_chain = comparison_template | qwen_llm   # Qwen version

print("✅ Created two versions of your custom AI!")

#### 14.1 - Test Both Versions

In [None]:
# TODO: Write 3 test questions for your custom AI
my_comparison_questions = [
    "Question 1",
    "Question 2",
    "Question 3"
]

print("🔬 Testing Your Custom AI with Both Models\n")

for question in my_comparison_questions:
    print(f"📝 Question: {question}")
    print("-" * 80)

    print("\n🤖 TinyLlama Version:")
    tiny_response = tiny_custom_chain.invoke({"question": question})
    print(tiny_response)

    print("\n🤖 Qwen3 Version:")
    qwen_response = qwen_custom_chain.invoke({"question": question})
    print(qwen_response)
    
    print("\n" + "=" * 80 + "\n")

##### Question 49
Did both models follow your system prompt instructions equally well? Which one stayed more "in character"?
- **Answer:**

##### Question 50
Rate each model's performance on your custom AI task (1-10):
- TinyLlama: _____ / 10
- Qwen3: _____ / 10
- Explanation:

### Part 15 - Response Speed Test

Let's measure how fast each model generates responses.

#### 15.0 - Compare Generation Speed

In [None]:
import time

test_prompt = "Explain why the sky is blue in simple terms."

# Time TinyLlama (perf_counter is a high-resolution clock meant for timing)
print("⏱️  Testing TinyLlama speed...")
start_time = time.perf_counter()
tiny_response = llm.invoke(test_prompt)
tiny_time = time.perf_counter() - start_time

# Time Qwen
print("⏱️  Testing Qwen3 speed...")
start_time = time.perf_counter()
qwen_response = qwen_llm.invoke(test_prompt)
qwen_time = time.perf_counter() - start_time

# Compare
print("\n📊 SPEED COMPARISON")
print("=" * 50)
print(f"TinyLlama: {tiny_time:.2f} seconds")
print(f"Qwen3:     {qwen_time:.2f} seconds")
print(f"Difference: {abs(tiny_time - qwen_time):.2f} seconds")

if tiny_time < qwen_time:
    print(f"\n✅ TinyLlama was {(qwen_time/tiny_time):.1f}x faster")
else:
    print(f"\n✅ Qwen3 was {(tiny_time/qwen_time):.1f}x faster")

##### Question 51
Which model was faster? By how much?
- **Answer:**

##### Question 52
Based on the speed difference, when might you choose to use the smaller model vs. the larger model?
- **Answer:**
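
A single timed run can be misleading: because sampling is turned on, each run may produce a different number of tokens, and the first call is often slower than later ones. One way to get a steadier number is to repeat the call and take the median, as in this sketch. The names `median_latency` and `fake_model` are made up for illustration, and `fake_model` just sleeps briefly so the example runs without loading a real model.

```python
import statistics
import time
from typing import Callable

def median_latency(generate: Callable[[str], str], prompt: str, runs: int = 3) -> float:
    """Call generate(prompt) several times and return the median time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; sleeps briefly, then echoes the prompt."""
    time.sleep(0.01)
    return prompt.upper()

print(f"Median latency: {median_latency(fake_model, 'hello'):.3f} seconds")
```

In this lab you could pass `llm.invoke` or `qwen_llm.invoke` in place of `fake_model`. Expect the larger model to take noticeably longer per run, so keep `runs` small.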

### Part 16 - Complex Reasoning Test

Let's test how well each model can handle logical reasoning and problem-solving.

#### 16.0 - Test Logical Reasoning

In [None]:
# Complex reasoning prompts
reasoning_prompts = [
    "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly? Explain your reasoning.",
    "A farmer has 17 sheep. All but 9 die. How many sheep are left?",
    "What comes next in this pattern: 2, 6, 12, 20, 30, ___?"
]

print("🧠 REASONING TEST\n")

for prompt in reasoning_prompts:
    print(f"📝 Challenge: {prompt}")
    print("-" * 80)

    print("\n🤖 TinyLlama:")
    tiny_response = llm.invoke(prompt)
    print(tiny_response)

    print("\n🤖 Qwen3:")
    qwen_response = qwen_llm.invoke(prompt)
    print(qwen_response)
    
    print("\n" + "=" * 80 + "\n")

##### Question 53
Which model performed better on logical reasoning tasks? Give specific examples.
- **Answer:**

##### Question 54
Did either model make mistakes? What kind of errors did you notice?
- **Answer:**
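
When checking for mistakes, compare each response against the correct answers. For the roses prompt, the conclusion does not follow: the flowers that fade quickly might not include any roses. For the sheep prompt, "all but 9 die" means 9 sheep are left. The number pattern can be worked out in code, as in the sketch below. The helper `next_term` is made up for this example and assumes the gaps between terms grow by a constant amount.

```python
def next_term(sequence):
    """Extend a sequence whose consecutive differences grow by a constant step."""
    diffs = [b - a for a, b in zip(sequence, sequence[1:])]
    step = diffs[1] - diffs[0]
    # Check the assumption: every gap grows by the same step
    if any(d2 - d1 != step for d1, d2 in zip(diffs, diffs[1:])):
        raise ValueError("differences do not grow by a constant step")
    return sequence[-1] + diffs[-1] + step

pattern = [2, 6, 12, 20, 30]
print(next_term(pattern))  # 42

# The same pattern written as n * (n + 1) for n = 1, 2, 3, ...
print([n * (n + 1) for n in range(1, 7)])  # [2, 6, 12, 20, 30, 42]
```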

### Part 17 - Final Analysis

Now that you've tested both models extensively, let's analyze the results.

#### Reflection Questions

##### Question 55
Create a comparison table:

| Feature | TinyLlama (1.1B) | Qwen3 (8B) | Winner |
|---------|------------------|------------|--------|
| Speed | | | |
| Accuracy | | | |
| Instruction Following | | | |
| Creativity | | | |
| Reasoning Ability | | | |
| Resource Usage | | | |

##### Question 56
If you were building a real application, which factors would you consider when choosing between a smaller model (like TinyLlama) vs. a larger model (like Qwen3)?
- **Answer:**

##### Question 57
In what scenarios would you prefer:
- **TinyLlama (smaller model):**
- **Qwen3 (larger model):**

##### Question 58
Both models have the same temperature (0.7) and max_new_tokens (256). Why do you think their outputs are still different?
- **Answer:**

##### Question 59
Based on this comparison, what have you learned about the relationship between model size and model capability?
- **Answer:**

### Part 18 - Challenge: Parameter Tuning (Optional)

If you have time, experiment with different parameters for Qwen3!

#### 18.0 - Temperature Experiment

In [None]:
# Try different temperature settings
temperatures = [0.1, 0.5, 0.9]
test_prompt = "Write a creative opening sentence for a mystery story."

print("🔧 TEMPERATURE EXPERIMENT with Qwen3\n")

for temp in temperatures:
    # Create new pipeline with different temperature
    temp_pipe = pipeline(
        "text-generation",
        model=qwen_model,
        tokenizer=qwen_tokenizer,
        max_new_tokens=256,
        do_sample=True,
        temperature=temp,
    )
    temp_llm = HuggingFacePipeline(pipeline=temp_pipe)
    
    print(f"🌡️ Temperature: {temp}")
    print("-" * 50)
    response = temp_llm.invoke(test_prompt)
    print(response)
    print("\n")

##### Question 60
How did changing the temperature affect Qwen3's responses? Which temperature setting do you prefer and why?
- **Answer:**
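
To see why temperature changes the output, it can help to look at the math usually used to describe it: the model's score for each candidate token is divided by the temperature before being turned into a probability. The sketch below uses made-up scores for three candidate tokens. It illustrates the general idea and is not the exact code path inside the `transformers` library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]

for temp in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(logits, temp)
    print(f"temperature={temp}: {[round(p, 3) for p in probs]}")
```

At a low temperature nearly all of the probability goes to the top-scoring token, so responses become more predictable. At a higher temperature the probabilities are more spread out, so less likely tokens are picked more often.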

### Summary

**🎉 Congratulations on completing the extension activity!**

You've now compared two different language models and learned that:
- Larger models generally perform better but require more resources
- Model size affects accuracy, reasoning, and response quality
- There are tradeoffs between speed and capability
- The "best" model depends on your specific use case

**Key Takeaways:**
1. **Size matters** - Larger models (more parameters) typically give better results
2. **Speed vs. Quality** - Smaller models are faster but less capable
3. **Application-specific** - Choose based on whether you need speed or accuracy
4. **Resource constraints** - Consider memory, processing power, and cost

These insights will help you make informed decisions when building AI applications in the future!