# Lab 3.4.3: DeepSeek-R1 Exploration

**Module:** 3.4 - Test-Time Compute & Reasoning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand how DeepSeek-R1 differs from standard LLMs
- [ ] Run R1 models on DGX Spark via Ollama
- [ ] Parse and analyze `<think>` reasoning tokens
- [ ] Observe how R1 approaches different problem types
- [ ] Understand the GRPO training approach

---

## Prerequisites

- Completed Labs 3.4.1 and 3.4.2
- Ollama installed and running
- DeepSeek-R1 model downloaded (see setup below)
- DGX Spark or system with sufficient memory (~45GB for 70B Q4)

---

## Real-World Context

In January 2025, DeepSeek released R1, an open-source reasoning model that rivals OpenAI's o1. The key innovation: **R1 was trained with reinforcement learning (GRPO) to explicitly reason before answering**, producing visible "thinking" tokens.

**Why R1 Matters:**
- One of the first open-weight models to report o1-level results on reasoning benchmarks
- Distilled versions (1.5B to 70B) run on consumer hardware
- The `<think>` tokens let us see *how* the model reasons
- Reported to outperform GPT-4o on math and coding benchmarks such as AIME 2024 and Codeforces

**Industry Impact:**
- Open-source alternative to expensive API calls
- Interpretable reasoning for high-stakes applications
- Foundation for custom reasoning agents

---

## ELI5: DeepSeek-R1 and Reasoning Models

> **Imagine two students taking a math test...**
>
> **Student A (Standard LLM):** Reads the problem, immediately writes an answer.
> Sometimes right, sometimes wrong. You can't see their thinking.
>
> **Student B (R1):** Reads the problem, then *writes out their thinking*
> on scratch paper before giving the final answer.
> "Hmm, let me break this down... First I need to... Wait, that doesn't work...
> Let me try another approach..."
>
> **The magic:** Student B was *trained with a reward* for getting the right
> answer, so they learned that careful thinking leads to better grades!
>
> **In AI terms:**
> - `<think>` tokens = scratch paper (visible reasoning)
> - GRPO training = rewarding correct final answers
> - The model learned: more thinking = better answers

```
Standard LLM:    Question ──────────────────> Answer
                                               
DeepSeek-R1:     Question ─> <think>          </think> ─> Answer
                             │                │
                             │ "Let me see..."│
                             │ "This means..."│
                             │ "So therefore."│
                             └────────────────┘
                             (Visible reasoning!)
```

---

## Part 1: Setup and Model Download

First, let's ensure we have DeepSeek-R1 available.

In [None]:
import json
import time
import re
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass

import ollama

print("Checking Ollama models...")

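model_names = []  # Defined up front so later cells still run if Ollama is unreachable
try:
    models = ollama.list()
    # Newer ollama-python releases expose the tag as 'model'; older ones used 'name'
    model_names = [m.get('model') or m.get('name') or 'unknown' for m in models.get('models', [])]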
    print(f"\nAvailable models:")
    for name in model_names:
        marker = " <-- R1" if 'r1' in name.lower() else ""
        print(f"  - {name}{marker}")
except Exception as e:
    print(f"Error: {e}")
    print("Make sure 'ollama serve' is running.")

In [None]:
# Model selection based on available memory
# DGX Spark (128GB) can run the 70B distilled version easily!
# Note: these Ollama tags are the distilled R1 models (Qwen/Llama based), not the full 671B R1.

# Check if R1 is available, otherwise provide instructions
R1_MODELS = {
    'small': 'deepseek-r1:7b',      # ~5GB, fast for testing
    'medium': 'deepseek-r1:32b',    # ~20GB, good balance
    'large': 'deepseek-r1:70b',     # ~45GB, best quality
}

# Detect which R1 model is available
available_r1 = None
for size, name in R1_MODELS.items():
    if any(name in m for m in model_names):
        available_r1 = name
        print(f"Found R1 model: {name} ({size})")
        break

if not available_r1:
    print("\nNo DeepSeek-R1 model found!")
    print("\nTo download, run one of these commands in your terminal:")
    print("\n  # For testing (7B, ~5GB):")
    print("  ollama pull deepseek-r1:7b")
    print("\n  # For DGX Spark (70B, ~45GB) - RECOMMENDED:")
    print("  ollama pull deepseek-r1:70b")
    print("\nThen re-run this cell.")
else:
    MODEL = available_r1
    print(f"\nUsing model: {MODEL}")

In [None]:
# If no R1 available, you can simulate with a standard model for learning
# Set MODEL manually if needed

if 'MODEL' not in dir() or MODEL is None:
    # Fallback: use any available model for demonstration
    MODEL = model_names[0] if model_names else "qwen3:8b"
    print(f"Using fallback model: {MODEL}")
    print("Note: This won't show <think> tokens, but code will still run.")

# Test the model
print(f"\nTesting {MODEL}...")
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "What is 2+2? Be brief."}],
    options={"num_predict": 50}
)
print(f"Response: {response['message']['content'][:100]}...")

---

## Part 2: Understanding R1's Thinking Process

R1 wraps its reasoning in `<think>...</think>` tags. Let's see it in action!

### Key Python Tools: Dataclasses and Advanced Regex

**Dataclasses** provide a clean way to create data containers:

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

# Creates a class with __init__, __repr__, etc. automatically
person = Person(name="Alice", age=30)
print(person)  # Person(name='Alice', age=30)
```

**Advanced Regex** for parsing multi-line text:

```python
import re

# re.DOTALL flag: makes '.' match newlines too
text = "<think>\nLine 1\nLine 2\n</think>"
match = re.findall(r'<think>(.*?)</think>', text, re.DOTALL)
# Without DOTALL: [] (no match because . doesn't match \n)
# With DOTALL: ['\nLine 1\nLine 2\n']

# re.sub() replaces patterns (like find-and-replace)
cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
# Result: '' (thinking block removed)
```

In [None]:
@dataclass
class ThinkingResult:
    """Parsed result from R1-style response."""
    thinking: str       # Content within <think>...</think>
    answer: str         # Final answer after thinking
    thinking_tokens: int  # Rough token count for thinking
    answer_tokens: int    # Rough token count for answer
    raw_response: str     # Full original response


def parse_r1_response(response: str) -> ThinkingResult:
    """
    Parse DeepSeek-R1 style thinking tokens from a response.
    
    R1 wraps its reasoning in <think>...</think> tags.
    """
    # Pattern for <think>...</think> blocks
    think_pattern = r'<think>(.*?)</think>'
    
    # Find all thinking blocks
    thinking_matches = re.findall(think_pattern, response, re.DOTALL)
    thinking_content = "\n".join(thinking_matches)
    
    # Remove thinking blocks from response to get the answer
    answer_content = re.sub(think_pattern, '', response, flags=re.DOTALL)
    answer_content = answer_content.strip()
    
    # Rough token count (1 token ~ 4 characters on average)
    thinking_tokens = len(thinking_content) // 4
    answer_tokens = len(answer_content) // 4
    
    return ThinkingResult(
        thinking=thinking_content.strip(),
        answer=answer_content,
        thinking_tokens=thinking_tokens,
        answer_tokens=answer_tokens,
        raw_response=response,
    )


# Test with a sample R1-style response
sample_response = '''<think>
Let me work through this step by step.

The problem asks for 17 * 23.

I can break this down:
17 * 23 = 17 * (20 + 3)
       = 17 * 20 + 17 * 3
       = 340 + 51
       = 391

Let me verify: 391 / 17 = 23. Yes, that's correct.
</think>

The answer is **391**.'''

parsed = parse_r1_response(sample_response)

print("Parsed R1 Response:")
print(f"\nThinking ({parsed.thinking_tokens} tokens):")
print(f"  {parsed.thinking[:200]}...")
print(f"\nFinal Answer ({parsed.answer_tokens} tokens):")
print(f"  {parsed.answer}")

---

## Part 3: R1 on Math Problems

Let's see how R1 reasons through math problems. Watch the `<think>` tokens!

In [None]:
def query_r1(
    question: str,
    model: str = MODEL,
    max_tokens: int = 2048,
    temperature: float = 0.0,
    show_thinking: bool = True,
) -> ThinkingResult:
    """
    Query R1 and parse the response.
    """
    print(f"Querying {model}...")
    start_time = time.time()
    
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": question}],
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    
    elapsed = time.time() - start_time
    response_text = response['message']['content']
    
    result = parse_r1_response(response_text)
    
    print(f"Response time: {elapsed:.1f}s")
    print(f"Thinking tokens: ~{result.thinking_tokens}")
    print(f"Answer tokens: ~{result.answer_tokens}")
    
    if show_thinking and result.thinking:
        print("\n" + "="*50)
        print("THINKING PROCESS:")
        print("="*50)
        print(result.thinking)
    
    print("\n" + "="*50)
    print("FINAL ANSWER:")
    print("="*50)
    print(result.answer)
    
    return result

In [None]:
# Test on a multi-step math problem
math_problem = """
A store sells apples for $2 each and oranges for $3 each.
If I buy 5 apples and some oranges and spend $25 total,
how many oranges did I buy?
"""

result = query_r1(math_problem)

In [None]:
# A harder problem - watch R1 reason!
hard_math = """
A train leaves Station A at 9:00 AM traveling at 60 mph toward Station B.
Another train leaves Station B at 10:00 AM traveling at 80 mph toward Station A.
The stations are 280 miles apart.
At what time do the trains meet?
"""

result = query_r1(hard_math)

### Analyzing R1's Thinking Patterns

Notice how R1 typically:
1. **Restates the problem** - Ensures understanding
2. **Identifies unknowns** - "Let x = ..."
3. **Sets up equations** - Translates words to math
4. **Shows calculations** - Step by step arithmetic
5. **Verifies the answer** - Checks if it makes sense

On multi-step problems, this tends to be more reliable than jumping straight to an answer, because each intermediate step is written down and can be checked.
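To check R1's final answer on the train problem independently, the arithmetic can be reproduced in a few lines. This is only a verification sketch; the variable names are illustrative and not part of the lab's helper code.

```python
from datetime import datetime, timedelta

# Values taken from the train problem above.
distance_miles = 280
speed_a_mph = 60       # leaves Station A at 9:00 AM
speed_b_mph = 80       # leaves Station B at 10:00 AM
head_start_hours = 1   # train A travels alone from 9:00 to 10:00

# Distance still separating the trains when train B departs.
gap_at_10am = distance_miles - speed_a_mph * head_start_hours

# From 10:00 AM on, the gap closes at the combined speed of both trains.
hours_after_10am = gap_at_10am / (speed_a_mph + speed_b_mph)

# The date is arbitrary; only the time of day matters here.
meeting_time = datetime(2025, 1, 1, 10, 0) + timedelta(hours=hours_after_10am)
print(f"Gap at 10:00 AM: {gap_at_10am} miles")
print(f"Trains meet at about {meeting_time:%H:%M}")
```

Comparing this value with the time in R1's final answer is a quick way to tell a sound reasoning trace from one that merely looks plausible.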

---

## Part 4: R1 on Coding Problems

R1 also excels at coding. Let's see how it reasons about code.

In [None]:
coding_problem = """
Write a Python function that finds the longest palindromic substring 
in a given string. For example, for "babad", the answer could be "bab" or "aba".

Explain your approach before coding.
"""

result = query_r1(coding_problem, max_tokens=3000)

In [None]:
# Debug a piece of code
debug_problem = """
The following Python code is supposed to find the two numbers in a list
that add up to a target. But it has bugs. Find and fix them:

```python
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(len(nums)):
            if nums[i] + nums[j] = target:
                return [i, j]
    return None
```
"""

result = query_r1(debug_problem)

---

## Part 5: R1 on Logic and Reasoning

Where R1 really shines: complex logical reasoning.

In [None]:
logic_problem = """
Five friends (Alice, Bob, Carol, David, Eve) are sitting in a row at a movie theater.
- Alice doesn't sit next to Bob.
- Carol sits exactly in the middle.
- David sits to the right of Alice.
- Eve sits at one of the ends.

What is one valid seating arrangement from left to right?
"""

result = query_r1(logic_problem)
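
Because the puzzle is small, R1's proposed seating can be checked by brute force. The sketch below enumerates every arrangement and keeps those that satisfy the four constraints. It assumes "to the right of" means anywhere to the right, not necessarily adjacent; `is_valid` and `valid_seatings` are illustrative names.

```python
from itertools import permutations

FRIENDS = ("Alice", "Bob", "Carol", "David", "Eve")


def is_valid(seating):
    """Check one left-to-right seating against the puzzle's constraints."""
    pos = {name: i for i, name in enumerate(seating)}
    return (
        abs(pos["Alice"] - pos["Bob"]) != 1   # Alice doesn't sit next to Bob
        and pos["Carol"] == 2                  # Carol sits exactly in the middle
        and pos["David"] > pos["Alice"]        # Assumption: anywhere to the right
        and pos["Eve"] in (0, 4)               # Eve sits at one of the ends
    )


# 5! = 120 arrangements, so exhaustive search is instant.
valid_seatings = [s for s in permutations(FRIENDS) if is_valid(s)]
for seating in valid_seatings:
    print(" | ".join(seating))
```

Any arrangement R1 returns should appear in this list; if it does not, the thinking trace is a good place to look for the constraint it dropped.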

In [None]:
# A tricky reasoning problem
trick_problem = """
A bat and a ball cost $1.10 in total.
The bat costs $1.00 more than the ball.
How much does the ball cost?

Think carefully before answering.
"""

result = query_r1(trick_problem)

### The Classic Trick Question

The bat and ball problem is famous for tripping up humans (and AIs)!

- **Wrong intuition:** Ball = $0.10 (because $1.10 - $1.00 = $0.10)
- **Correct answer:** Ball = $0.05
  - If ball = $0.05, bat = $1.05 ($1.00 more)
  - Total: $0.05 + $1.05 = $1.10 ✓

Watch how R1 catches the mistake in its `<think>` process!
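
The algebra behind the correct answer can be checked directly. The sketch below works in integer cents to avoid floating-point rounding; the variable names are illustrative.

```python
total_cents = 110        # bat + ball = $1.10
difference_cents = 100   # bat = ball + $1.00

# ball + (ball + difference) = total  =>  ball = (total - difference) / 2
ball_cents = (total_cents - difference_cents) // 2
bat_cents = ball_cents + difference_cents

# The intuitive answer (ball = 10 cents) breaks the total.
intuitive_ball = 10
intuitive_total = intuitive_ball + (intuitive_ball + difference_cents)

print(f"Ball: ${ball_cents / 100:.2f}, bat: ${bat_cents / 100:.2f}")
print(f"Intuitive guess would total ${intuitive_total / 100:.2f}")
```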

---

## Part 6: Analyzing Thinking Overhead

R1 uses more tokens due to thinking. Is it worth it?

In [None]:
def analyze_thinking_overhead(results: List[ThinkingResult]) -> Dict:
    """
    Analyze the thinking token overhead across multiple responses.
    """
    thinking_tokens = [r.thinking_tokens for r in results]
    answer_tokens = [r.answer_tokens for r in results]
    
    total_thinking = sum(thinking_tokens)
    total_answer = sum(answer_tokens)
    total = total_thinking + total_answer
    
    return {
        'total_thinking_tokens': total_thinking,
        'total_answer_tokens': total_answer,
        'total_tokens': total,
        'thinking_percentage': total_thinking / total * 100 if total > 0 else 0,
        'overhead_ratio': total_thinking / total_answer if total_answer > 0 else 0,
        'avg_thinking_per_response': total_thinking / len(results) if results else 0,
        'avg_answer_per_response': total_answer / len(results) if results else 0,
    }


# Test on multiple problems
test_problems = [
    "What is 17 * 23?",
    "If I have 5 apples and eat 2, how many are left?",
    "A rectangle has length 12 and width 7. What is its area?",
]

results = []
print("Running test problems...\n")
for i, prob in enumerate(test_problems):
    print(f"Problem {i+1}: {prob[:40]}...")
    result = query_r1(prob, show_thinking=False)
    results.append(result)
    print()

# Analyze overhead
overhead = analyze_thinking_overhead(results)

print("\n" + "="*50)
print("THINKING OVERHEAD ANALYSIS")
print("="*50)
print(f"Total thinking tokens: {overhead['total_thinking_tokens']}")
print(f"Total answer tokens: {overhead['total_answer_tokens']}")
print(f"Thinking percentage: {overhead['thinking_percentage']:.1f}%")
print(f"Overhead ratio: {overhead['overhead_ratio']:.1f}x")
print(f"\nAvg thinking per response: {overhead['avg_thinking_per_response']:.0f} tokens")
print(f"Avg answer per response: {overhead['avg_answer_per_response']:.0f} tokens")

---

## Part 7: Understanding GRPO Training

What makes R1 special? The **GRPO (Group Relative Policy Optimization)** training approach.

### How R1 Was Trained

```
Traditional LLM Training:
  "Learn to predict the next token from human text"
  → Good at mimicking, not always at reasoning

R1's GRPO Training:
  1. Start with a base model (DeepSeek-V3-Base)
  2. Give it reasoning problems
  3. Let it generate a group of solutions with thinking for each problem
  4. Score each solution with rule-based rewards (correct final answer, correct format)
  5. Reinforce solutions that score above their group's average, discourage those below
  6. The model learns: "careful thinking → correct answers → reward"
```

**Key Insight:** R1 discovered that "showing its work" leads to better outcomes!
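
The "group relative" part of GRPO can be illustrated with a few lines of arithmetic: each sampled solution's reward is compared with the mean of its own group and scaled by the group's standard deviation. The sketch below shows only that normalization step, not the full policy update. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled solutions.

    advantage_i = (reward_i - group mean) / (group std + eps)
    eps is an illustrative guard against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # Assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions to one problem: 1.0 = correct final answer, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Solutions that beat their group's average get a positive advantage and are reinforced; the rest get a negative one. When every solution in a group earns the same reward, all advantages are zero and that group contributes no learning signal.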

### Distilled Models

DeepSeek also released "distilled" versions:
- R1-distill-Qwen-1.5B, 7B, 14B, 32B
- R1-distill-Llama-8B, 70B

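These are smaller Qwen and Llama models fine-tuned on reasoning traces generated by R1 (supervised fine-tuning only, without the RL stage), so they mimic R1's reasoning patterns. The `deepseek-r1` tags at 70B and below in Ollama are these distilled models.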

| Model | Size | Memory (Q4) | Reasoning Quality |
|-------|------|-------------|-------------------|
| R1-distill-1.5B | 1.5B | ~1GB | Basic |
| R1-distill-7B | 7B | ~5GB | Good |
| R1-distill-32B | 32B | ~20GB | Very Good |
| R1-distill-70B | 70B | ~45GB | Excellent |
| R1 (full) | 671B | ~400GB | Best |
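
The memory column follows from simple arithmetic: parameter count times bits per weight. The sketch below reproduces the rough figures. The default of 4.8 bits per weight is an assumption that approximates a Q4_K_M-style quantization, and the estimate covers weights only, not the KV cache or runtime overhead.

```python
def estimate_weights_gb(params_billion, bits_per_weight=4.8):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.8 is an assumed average for 4-bit K-quant formats.
    """
    return params_billion * bits_per_weight / 8


for name, params in [("1.5B", 1.5), ("7B", 7), ("32B", 32), ("70B", 70), ("671B", 671)]:
    print(f"{name:>5}: ~{estimate_weights_gb(params):.0f} GB")
```

Actual usage is higher once the context window fills, so leave headroom beyond these numbers when choosing a model size.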

---

## Part 8: DGX Spark Optimization

On DGX Spark's 128GB unified memory, you can run the 70B distilled version comfortably.

In [None]:
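# Check device-wide memory usage (if CUDA available)
# Note: torch.cuda.memory_allocated() only counts tensors created by PyTorch in
# this process, so it would miss the model that Ollama has loaded.
try:
    import torch
    if torch.cuda.is_available():
        mem_free, mem_total = torch.cuda.mem_get_info()  # Bytes, device-wide
        mem_used = (mem_total - mem_free) / 1e9
        print(f"GPU Memory: {mem_used:.1f}GB / {mem_total / 1e9:.1f}GB")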
        print(f"\nDGX Spark Tip: With 128GB unified memory,")
        print(f"you can run R1-70B (~45GB) + a reward model (~16GB)")
        print(f"simultaneously for Best-of-N sampling!")
except ImportError:
    print("PyTorch not available - can't check GPU memory")
    print("\nDGX Spark has 128GB unified CPU+GPU memory.")
    print("R1-70B uses ~45GB, leaving plenty of room.")

In [None]:
# Performance comparison: Different R1 sizes on DGX Spark
dgx_spark_performance = """
Rough performance estimates for DGX Spark (Blackwell GB10), not measured in this lab:

| Model           | Memory  | Speed (tok/s) | Quality    |
|-----------------|---------|---------------|------------|
| R1-distill-7B   | ~5GB    | ~50-60        | Good       |
| R1-distill-32B  | ~20GB   | ~30-40        | Very Good  |
| R1-distill-70B  | ~45GB   | ~15-25        | Excellent  |

Tips:
- Use Q4_K_M quantization for best speed/quality balance
- Enable mmap for faster model loading
- Clear cache between large model switches:
  sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
"""
print(dgx_spark_performance)
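
Rather than relying on estimates, generation speed can be measured from the metadata Ollama attaches to each response: `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds). The helper below is a minimal sketch; `tokens_per_second` is an illustrative name, and the sample dictionary stands in for a real response.

```python
def tokens_per_second(response):
    """Compute decode speed from Ollama response metadata.

    Expects 'eval_count' (tokens generated) and 'eval_duration'
    (nanoseconds spent generating), as returned by Ollama.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if not eval_duration_ns:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# Stand-in for a real response: 300 tokens generated in 15 seconds.
sample_response = {"eval_count": 300, "eval_duration": 15_000_000_000}
print(f"{tokens_per_second(sample_response):.1f} tok/s")
```

Passing the object returned by `ollama.chat(...)` in place of `sample_response` gives the measured speed for the model and prompt you actually ran.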

---

## Common Mistakes

### Mistake 1: Ignoring the Thinking Tokens

```python
# Wrong: Just taking the full response as the answer
answer = response['message']['content']  # Includes <think> blocks!

# Right: Parse out the thinking and extract the answer
result = parse_r1_response(response['message']['content'])
answer = result.answer  # Clean answer only
```

### Mistake 2: Setting Max Tokens Too Low

```python
# Wrong: R1 needs room to think!
response = ollama.chat(model="deepseek-r1:70b", messages=...,
                       options={"num_predict": 256})  # Thinking gets cut off!

# Right: Allow enough tokens for thinking + answer
response = ollama.chat(model="deepseek-r1:70b", messages=...,
                       options={"num_predict": 2048})  # Room to reason
```
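
A low token limit has a second effect: if generation stops before `</think>` is emitted, a pattern that requires both tags finds nothing, and the unfinished reasoning ends up treated as the answer. The sketch below shows one way to detect that case. `split_thinking` is a hypothetical helper, not part of the lab's code, and it handles only the first thinking block.

```python
import re


def split_thinking(response):
    """Return (thinking, answer, truncated) for an R1-style response."""
    closed = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if closed:
        # Normal case: a complete thinking block followed by the answer.
        answer = response[:closed.start()] + response[closed.end():]
        return closed.group(1).strip(), answer.strip(), False

    open_idx = response.find("<think>")
    if open_idx != -1:
        # Opening tag without a closing tag: generation stopped mid-thought.
        thinking = response[open_idx + len("<think>"):]
        return thinking.strip(), response[:open_idx].strip(), True

    # No thinking tags at all (for example, a non-reasoning model).
    return "", response.strip(), False


print(split_thinking("<think>17 * 20 = 340, plus</think>\n\nThe answer is 391."))
print(split_thinking("<think>17 * 20 = 340, plus"))
```

When the third value is `True`, the answer is empty or incomplete, which is a signal to retry with a larger `num_predict`.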

### Mistake 3: Using High Temperature for Reasoning

```python
# Wrong: High temperature makes reasoning inconsistent
response = ollama.chat(..., options={"temperature": 1.0})

# Right: Use a lower temperature for reliable reasoning
response = ollama.chat(..., options={"temperature": 0.0})  # Greedy: repeatable, but can loop
# DeepSeek's usage notes recommend 0.5-0.7 (0.6 suggested) to avoid endless repetition
response = ollama.chat(..., options={"temperature": 0.6})
```

---

## Checkpoint

You've learned:
- ✅ **What R1 is:** A reasoning model trained with GRPO to "think out loud"
- ✅ **How to parse responses:** Extract `<think>` tokens and final answers
- ✅ **R1's strengths:** Math, coding, logic puzzles
- ✅ **The overhead tradeoff:** More tokens but more accurate
- ✅ **DGX Spark optimization:** Can run 70B comfortably

---

## Challenge: Build a "Thinking Visualizer"

Create a tool that nicely formats R1's thinking process for analysis.

In [None]:
def visualize_thinking(result: ThinkingResult, max_chars: int = 500) -> str:
    """
    Create a nice visualization of R1's thinking process.
    """
    output = []
    output.append("\n" + "#" * 60)
    output.append("#  R1 THINKING PROCESS")
    output.append("#" * 60)
    
    if result.thinking:
        output.append("\n[THINKING] (~{} tokens)".format(result.thinking_tokens))
        output.append("-" * 40)
        
        # Split thinking into "steps"
        lines = result.thinking.split('\n')
        for line in lines[:20]:  # Limit lines shown
            if line.strip():
                if any(line.strip().startswith(x) for x in ['1.', '2.', '3.', 'Step', 'First', 'Then', 'Finally']):
                    output.append(f"   {line.strip()}")
                else:
                    output.append(f"   {line.strip()[:80]}")
        
        if len(lines) > 20:
            output.append(f"   ... ({len(lines) - 20} more lines)")
    else:
        output.append("\n[NO THINKING TOKENS]")
        output.append("(Model may not be R1, or problem was simple)")
    
    output.append("\n[FINAL ANSWER] (~{} tokens)".format(result.answer_tokens))
    output.append("-" * 40)
    answer_preview = result.answer[:max_chars]
    if len(result.answer) > max_chars:
        answer_preview += "..."
    output.append(answer_preview)
    
    output.append("\n" + "#" * 60)
    overhead = result.thinking_tokens / max(result.answer_tokens, 1)
    output.append(f"Thinking overhead: {overhead:.1f}x answer length")
    output.append("#" * 60)
    
    return "\n".join(output)


# Test visualization
if results:  # Use results from earlier
    print(visualize_thinking(results[-1]))

---

## Further Reading

- [DeepSeek-R1 Paper](https://arxiv.org/abs/2501.12948) - The research paper
- [DeepSeek-R1 GitHub](https://github.com/deepseek-ai/DeepSeek-R1) - Code and models
- [GRPO Explained](https://arxiv.org/abs/2402.03300) - Training methodology
- [R1 vs o1 Comparison](https://www.deepseek.com/research) - Benchmark results

---

## Cleanup

In [None]:
# Summary
print("Lab 3.4.3 Complete!")
print(f"\nModel used: {MODEL}")
print(f"Problems tested: {len(results) if 'results' in dir() else 0}")

# Cleanup memory
import gc

# Clear GPU memory if torch was used
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
except ImportError:
    pass

gc.collect()
print("\nMemory cleaned up.")

---

## Next Steps

Now that you understand how R1 reasons, let's put it to the test!

In the next lab, you'll **compare R1 vs. standard models** quantitatively to see exactly how much the thinking helps.

**Continue to:** [Lab 3.4.4: R1 vs Standard Model Comparison](./lab-3.4.4-r1-comparison.ipynb)