# Project 4: **Build a Deep Research System**
Welcome to Project 4! For this project, we shift our focus from tool use and agents to *reasoning* models. You will practice state-of-the-art inference-time scaling methods such as *Chain-of-Thought* prompting and *Tree-of-Thoughts*, and briefly explore the high-level concepts behind training reasoning models with techniques like **STaR**.


Finally, you will put everything together to build a *deep research agent* that can browse the web, reason over what it finds, and give structured answers.

## Learning Objectives  
* Apply common inference-time scaling methods: **zero-shot / few-shot CoT, self-consistency, sequential revision, tree-of-thoughts**  
* Gain intuition for **training** reasoning-capable models following the **STaR** approach 
* Build a minimal **deep-research agent** that combines step-by-step reasoning with live web search   
* Practice extending the deep-research agent into a multi-agent system 

## Roadmap  
0. Environment setup  
1. Inference-time scaling  
  1.1 Few-shot CoT  
  1.2 Zero-shot CoT  
  1.3 Self-consistency  
  1.4 Sequential revision  
  1.5 Tree-of-Thoughts (ToT)
2. Training reasoning models and inspecting DeepSeek-R1 
3. Deep-research agent  
4. (Optional) Multi-agent deep-research

# 0- Environment setup

### Step 1: Create your environment and install dependencies 
Before we start coding, you need a reproducible setup. Open a terminal in the same directory as this notebook, and use Conda or uv to install the project dependencies.

#### Option 1: Conda
```bash
# Create and activate the conda environment
conda env create -f environment.yaml && conda activate deep_research
```

#### Option 2: uv (Fast alternative)
If you prefer [uv](https://docs.astral.sh/uv/) over Conda:

```bash
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install dependencies
uv venv .venv-deep-research && source .venv-deep-research/bin/activate
uv pip install -r requirements.txt
```

### Step 2: Register this environment as a Jupyter kernel
```bash
python -m ipykernel install --user --name=deep_research --display-name "deep_research"
```
Now open your notebook and switch to the `deep_research` kernel (Kernel → Change Kernel).

### Step 3: Set up and run the Ollama server

In this project we use the `llama3.2:3b`, `qwen2.5:3b-instruct` and `deepseek-r1:1.5b` models. You can try other smaller or larger reasoning LLMs such as `phi4-mini` to compare performance. Explore available models here: https://ollama.com/library.

Open a terminal and start the Ollama server:
```bash
ollama serve
```
Then open another terminal and pull the required models:
```bash
ollama pull llama3.2:3b
ollama pull deepseek-r1:1.5b
ollama pull qwen2.5:3b-instruct
# Additional small reasoning models to compare
# ollama pull phi4-mini
```

---  
# 1- Inference-time scaling

Inference-time scaling refers to techniques that make an existing model reason better without retraining it. Instead of changing the model's weights, we improve reasoning by changing how we prompt the model, sample from it, or aggregate its outputs.

In this section, we'll explore several inference-time strategies that improve reasoning quality using a non-reasoning base model. You will experiment with and compare methods such as:

- Few-shot Chain-of-Thought (CoT)
- Zero-shot CoT
- Self-consistency
- Sequential revision
- Tree-of-Thoughts (ToT)

### 1.1: Few-Shot CoT

Few-shot prompting places worked examples in the prompt before a new question. The model picks up the pattern from context, with no weight updates, and applies it to the new input.

We'll explore this with two models to understand how few-shot interacts with model capabilities:

1. **GPT-2** (no instruction tuning): Doesn't reason by default. We'll see if few-shot examples can elicit reasoning.
2. **Llama 3.2** (instruction-tuned): Already reasons naturally. We'll use few-shot to control the output format.

#### GPT-2: Can few-shot examples elicit reasoning?

GPT-2 is a base language model that just predicts the next token. It wasn't trained to follow instructions or reason step-by-step. Let's see what happens with and without few-shot examples.

In [5]:
import os
import torch
from transformers import pipeline

question = "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?"

# Step 1: Load the GPT-2 text-generation pipeline from Hugging Face (https://huggingface.co/docs/transformers/en/model_doc/gpt2)
# Step 2: Write 1–2 few-shot reasoning examples (short, explicit steps + final answer in your own unique format)
# Step 3: Append a new test question after the examples to form one prompt string
# Step 4: Generate outputs with and without the few-shot prompt and compare the difference.

# Use a distinct name so the imported `pipeline` factory is not shadowed (re-running the cell would otherwise fail).
# device=0 assumes a GPU is available; use device="cpu" and drop dtype to run on CPU.
generator = pipeline(task="text-generation", model="openai-community/gpt2", dtype=torch.float16, device=0)
print(generator(question)[0]['generated_text'])

print("----")

fewshot_prompt = """Question: A rectangle has a perimeter of 60 cm. If the length is twice the width, what is the area?
Answer: If the length is twice the width, then we can say that length = 2 * width. The perimeter of a rectangle is given by the formula P = 2 * (length + width). Substituting the values, we get 60 = 2 * (2 * width + width) => 60 = 6 * width => width = 60 / 6 => 10. Now, we can find the length: length = 2 * width => length = 2 * 10 => 20. Finally, the area of a rectangle is given by the formula A = length * width => A = 20 * 10 => A = 200 cm^2.

Question: """

# Full prompt: worked example + new question + an "Answer:" cue for the model to continue.
final_qquestion = fewshot_prompt + question + "\nAnswer:"

# Generate from the full few-shot prompt (not from the examples alone).
print(generator(final_qquestion)[0]['generated_text'])



A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?

The rectangle is made up of 32 circles, each with its own radius, the diameter of which is 64 cm. The circle's diameter is determined by the radius of the rectangle, which is the diameter of a circle.

In this case, a rectangle has diameter of 64 cm, and its radius is 64 cm.

The rectangle's diameter is determined by the radius of the rectangle, which is the diameter of a circle.

If the radius of the rectangle is less than the radius of the rectangle, then the rectangle is rounded, which means that its radius is zero. The rectangle is then rounded to the nearest whole number.

If the radius of the rectangle is less than the radius of the rectangle, then the rectangle is rounded to the nearest whole number. The rectangle is then rounded to its nearest whole number.

If the radius of the rectangle is less than the radius of the rectangle, then the rectangle is rounded to the nearest whole number. 

#### Llama 3.2: Using few-shot to control output format

Unlike GPT-2, Llama 3.2 is instruction-tuned and already produces reasoning traces by default. So what's the point of few-shot examples?

**The power of few-shot with instruction-tuned models is controlling the output format.** We can make the model follow a specific structure like `[GIVEN]/[FIND]/[SOLVE]/[ANSWER]` that it wouldn't use naturally.
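As a minimal sketch of that idea, the snippet below assembles a few-shot prompt whose single worked example uses the `[GIVEN]/[FIND]/[SOLVE]/[ANSWER]` tags. The names `FEWSHOT_EXAMPLES` and `build_format_prompt` are hypothetical helpers introduced here for illustration, and the worked example reuses the 60 cm rectangle from the GPT-2 cell. Ending the prompt on an open `[GIVEN]` tag is one way to nudge the model into the format; it is not a guarantee.

```python
# Hypothetical worked example written in the target format.
FEWSHOT_EXAMPLES = [
    {
        "question": "A rectangle has a perimeter of 60 cm. If the length is twice the width, what is the area?",
        "given": "perimeter = 60 cm, length = 2 * width",
        "find": "area",
        "solve": "60 = 2 * (2w + w) = 6w, so w = 10 and length = 20; area = 20 * 10",
        "answer": "200 cm^2",
    },
]


def build_format_prompt(examples, question):
    """Render each example in the tagged format, then leave [GIVEN] open for the model."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Question: {ex['question']}\n"
            f"[GIVEN] {ex['given']}\n"
            f"[FIND] {ex['find']}\n"
            f"[SOLVE] {ex['solve']}\n"
            f"[ANSWER] {ex['answer']}"
        )
    # The new question ends on an open tag so the model continues in the same structure.
    blocks.append(f"Question: {question}\n[GIVEN]")
    return "\n\n".join(blocks)


format_prompt = build_format_prompt(
    FEWSHOT_EXAMPLES,
    "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?",
)
print(format_prompt)
```

To try it, pass `format_prompt` as the `content` of the user message in the `chat(...)` call and compare the response with the unformatted one.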

In [7]:
from ollama import chat
from ollama import ChatResponse

question = "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?"

# Step 1: Create your Ollama client
# Step 2: Write a few examples showing reasoning steps
# Step 3: Concatenate examples + new question into a single prompt
# Step 4: Call your Ollama or OpenAI client to get a response from llama3.2:3b # e.g., client.chat.completions.create(...)
# Step 5: Print the final answer with and without few-shot examples and compare them.

response: ChatResponse = chat(model='llama3.2:3b', messages=[
  {
    'role': 'user',
    'content': final_qquestion,
  },
])
print(response['message']['content'])

To find the area of the rectangle with a given perimeter and a length that is twice the width, we can follow these steps:

The perimeter of a rectangle is given by the formula P = 2 * (length + width). Since the length is twice the width, let's say the width is w. Then, the length is 2w.

Substituting the values into the formula, we get:
36 = 2 * (2w + w)
36 = 6w
Now, divide both sides by 6 to find the width:
w = 36 / 6
w = 6

Since the length is twice the width, we can now find the length:
length = 2 * width
length = 2 * 6
length = 12

Finally, we can find the area of the rectangle using the formula A = length * width:
A = 12 * 6
A = 72 cm^2


### 1.2: Zero-Shot Chain-of-Thought
Zero-shot CoT encourages the model to reason without examples by adding a short cue such as "Let's think step by step." This simple phrase often activates the model's latent reasoning ability even when no demonstrations are provided. It serves as a baseline to compare with few-shot and other inference-time scaling methods.

In [8]:
from ollama import chat
from ollama import ChatResponse

# Step 1: Create your Ollama client
# Step 2: Write a question and a zero-shot CoT cue (e.g., "Let's think step by step.")
# Step 3: Build a single prompt string that includes brief role guidance plus the question
# Step 4: Call your Ollama or OpenAI client to get a response from llama3.2:3b  # e.g., client.chat.completions.create(...)
# Step 5: Print the chain and the final answer

response: ChatResponse = chat(model='llama3.2:3b', messages=[
  {
    'role': 'user',
    'content': question + " Let's think step by step.",
  },
])
print(response['message']['content'])


To find the area of the rectangle, we need to first determine its dimensions (length and width). We know that the length is twice the width.

Let's denote the width as "w" cm. Since the length is twice the width, the length can be represented as 2w cm.

The perimeter of a rectangle is given by the formula: Perimeter = 2(length + width)

We are told that the perimeter is 36 cm, so we can set up an equation:

2(2w + w) = 36

Combine like terms:
4w + 2w = 36
6w = 36

Now, divide both sides by 6 to solve for "w":
w = 36/6
w = 6

So, the width of the rectangle is 6 cm.

Since the length is twice the width, we can now find the length:
Length = 2w = 2(6) = 12 cm

Now that we have both dimensions (length and width), we can calculate the area:

Area = Length x Width
= 12 x 6
= 72 square centimeters

Therefore, the area of the rectangle is 72 square centimeters.


### 1.3 Self-Consistency
Self-consistency enhances reasoning accuracy by sampling multiple independent reasoning paths for the same question instead of relying on a single deterministic answer. Each run may follow a slightly different logical chain, and the diversity helps correct individual mistakes. After generating several reasoning traces, you then aggregate the final answers using majority voting.

This approach is especially useful when tasks involve multi-step reasoning or arithmetic, where single-path outputs may be incorrect.
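The voting step can be sketched on its own, without any model calls. The example below is a minimal illustration under stated assumptions: `normalize_answer` and `majority_vote` are hypothetical helper names, the sampled answers are made up, and the normalization (take the first number if one is present) is just one simple choice so that `12`, `12.` and `The answer is 12` count as the same vote.

```python
import collections
import re


def normalize_answer(text):
    """Reduce a free-form final answer to a comparable key (first number, else cleaned text)."""
    cleaned = text.strip().lower().rstrip(".")
    number = re.search(r"-?\d+(?:\.\d+)?", cleaned.replace(",", ""))
    return number.group(0) if number else cleaned


def majority_vote(answers):
    """Count normalized answers, ignoring failed extractions, and return (winner, votes)."""
    votes = collections.Counter(normalize_answer(a) for a in answers if a)
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes


# Made-up final answers from five sampled reasoning chains; None marks a failed extraction.
sampled = ["12", "12.", "The answer is 12", "14", None]
winner, votes = majority_vote(sampled)
print("Votes:", dict(votes))
print("Chosen answer:", winner)
```

Without normalization, superficially different strings split the vote, which can hide a clear majority.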

In [12]:
from openai import OpenAI
import re, collections

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def cot_answer(question, temperature=1.2):
    # Generate a step-by-step reasoning chain for the given question and extract the final answer.
    resp = client.responses.create(
        model=MODEL,
        input=question + " Let's think step by step. Give the final answer in the format 'Final Answer: <your answer>'.",
        temperature=temperature,
    )
    match = re.search(r"Final Answer: (.*)", resp.output_text)
    if match:
        return match.group(1)
    return None


def self_consistent(question, n=5):
    # Run multiple reasoning chains and select the most frequent final answer by majority voting.
    answers = [a for a in (cot_answer(question) for _ in range(n)) if a is not None]
    counter = collections.Counter(answers)
    winner = counter.most_common(1)[0][0] if counter else None
    return winner, counter, answers

question = "What is the square root of 144?"
winner, counter, traces = self_consistent(question, n=5)

print("Votes:", counter)
print("Chosen answer:", winner)

To find the square root of 144, we can use the following steps:

1. Start with a number, let's say x.
2. Square x to get 144.

Let's try x = 12:
12^2 = 144

So, x = 12 is a perfect square of 144.

Final Answer: 12
To find the square root of 144, let's break it down step by step:

1. Start with the number 144.
2. Look for a perfect square that is less than or equal to 144, such as 121 (11^2) and 169 (13^2).
3. Since 12^2 = 144, we can conclude that the square root of 144 is 12.

Final Answer: 12
To find the square root of 144, let's break it down:

1. Start with the number: 144
2. Look for perfect squares around this number.
3. Notice that 12 squared is equal to 144 (12² = 144).

Therefore, the square root of 144 is 12.

Final Answer: 12
To find the square root of 144, we'll start by thinking about what numbers multiplied together give us 144.

Let's try to think of a number that when squared (multiplied by itself) gives us 144. We can also look for perfect squares in our minds that ar

### 1.4: Sequential Revision

Sequential revision iteratively improves an answer by generating a first draft, critiquing it, and producing revised drafts that condition on prior answers. Each round should be short and focused, so improvements accumulate without drifting from the question.

In [1]:
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def sequential_revision(question: str, max_steps: int = 3) -> str:
    # Generate an initial draft answer, then iteratively refine it by conditioning each revision on the previous one.
    # Step 1: Ask the model to produce the first draft for the given question
    # Step 2: Loop for max_steps-1 times, each time feeding the last draft back to the model with a request to revise
    # Step 3: Print each draft to observe how the answer evolves
    # Step 4: Return the final improved draft
    # Keep the whole conversation so each revision is conditioned on the earlier drafts.
    messages = [{"role": "user", "content": question + " Produce a first draft of the answer."}]

    resp = client.responses.create(
        model=MODEL,
        input=messages,
        temperature=1.2,
    )
    # Add the output to the conversation for the next step
    draft = resp.output_text
    messages.append({"role": "assistant", "content": draft})

    for step in range(1, max_steps):
        print(f"Draft after step {step}:\n{draft}\n{'-'*50}")
        messages.append({"role": "user", "content": "Please revise and improve your answer."})

        resp = client.responses.create(
            model=MODEL,
            input=messages,
            temperature=1.2,
        )
        draft = resp.output_text
        messages.append({"role": "assistant", "content": draft})

    return draft

# Step 1: Define a question that benefits from multi-step reasoning
# Step 2: Call sequential_revision(question, max_steps)
# Step 3: Print the final output
question = "The income of my company is roughly 120,000€ per year. I need 5000€ per year for common expenses. I need 3000€ per month to develop another project. I have to pay 35% of my salary in taxes. How much money can I pay myself per month as a salary without jeopardizing my financial stability?"
final_answer = sequential_revision(question, max_steps=3)
print(final_answer)


Draft after step 1:
To determine how much you can pay yourself per month without jeopardizing your financial stability, we need to consider several factors.

1. Annual income: €120,000
2. Common expenses: €5,000 per year
3. Monthly project development budget: €3,000 per month
4. Taxes: 35% of annual income

First, let's calculate the total amount available for personal use:

Annual income: €120,000
Taxes (35%): €42,000 (calculated as 0.35 x €120,000)
Remaining income after taxes: €78,000

Common expenses per year: €5,000
Monthly common expenses: €416.67 (calculated as €5,000 / 12)

Monthly project development budget: €3,000

Total monthly deductions:
€416.67 (common expenses) + €3,000 (project development) = €4,416.67

Now, let's calculate how much you can afford to pay yourself per month:

Remaining income after taxes: €78,000
Total monthly deductions: €4,416.67
Available for personal use: €73,583.33 (calculated as €78,000 - €4,416.67)

To ensure 

### 1.5 Tree-of-Thoughts

Tree-of-Thoughts (ToT) reframes reasoning as a search problem. Instead of generating one linear chain of thoughts, the model:
1. Generates multiple candidate "thoughts" at each step
2. Evaluates how promising each thought is
3. Expands only the best candidates (beam search)
4. Backtracks if needed

This mirrors how humans solve hard problems: brainstorm options, evaluate them, pursue the best, and backtrack when stuck.

#### Example 1: Word Ladder (Algorithmic ToT)

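This example shows ToT as pure beam search without LLM calls. Each "thought" is a candidate word that differs from the previous one by one letter. We score each candidate by its string dissimilarity to the goal (1 minus the `SequenceMatcher` similarity ratio, used here as a cheap proxy for edit distance) and keep the best candidates.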

This demonstrates the **core algorithm** behind ToT: expand, score, prune.

In [6]:
###### Word Ladder Puzzle ##########
from difflib import SequenceMatcher

def neighbors(word, vocabulary):
    # Generate all valid one-letter mutations of 'word' that exist in 'vocabulary' and return them.
    found = []
    for i in range(len(word)):
        for c in 'abcdefghijklmnopqrstuvwxyz':
            mutated = word[:i] + c + word[i+1:]
            if mutated in vocabulary and mutated != word:
                found.append(mutated)
    return found


def tree_of_thought(start, goal, vocab, max_depth=5, beam_width=4):
    # Search over partial thoughts (paths) using a small beam.
    # Step 1: Initialize the frontier with a single path [start]
    # Step 2: For each depth, expand each path by one neighbor from 'neighbors'
    # Step 3: Score paths by dissimilarity between last word and 'goal' (smaller is better)
    # Step 4: Keep the top 'beam_width' paths and stop early if any reaches 'goal'
    # Step 5: Return the best goal-reaching path or None
    frontier = [[start]]
    for depth in range(max_depth):
        new_frontier = []
        for path in frontier:
            last_word = path[-1]
            for neighbor in neighbors(last_word, vocab):
                if neighbor in path:
                    continue  # Skip words already on this path to avoid cycles
                new_frontier.append(path + [neighbor])
        # Score paths by dissimilarity to the goal (1 - SequenceMatcher ratio; 0 means identical)
        scored_paths = [(path, 1 - SequenceMatcher(None, path[-1], goal).ratio()) for path in new_frontier]
        scored_paths.sort(key=lambda x: x[1])  # Most similar to the goal first
        frontier = [path for path, score in scored_paths[:beam_width]]  # Keep top beam_width paths
        for path in frontier:
            if path[-1] == goal:
                return path  # Return the first path that reaches the goal
    return None  # No path found within max_depth


vocab = {"hit","dot","cog","log","dog","lot","lit","hot"}

print(tree_of_thought("hit", "cog", vocab))


['hit', 'hot', 'dot', 'dog', 'cog']


#### Example 2: Generic ToT for Open-Ended Problems

For open-ended problems without verifiable answers, we can still apply ToT by having the LLM both propose and evaluate thoughts.

In [7]:
###### Generic ToT Search ##########

import re
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"

def propose_thoughts(question, state, k=2):
    # Propose up to k next "thoughts" that extend the current partial solution/state.
    # Steps: build a short prompt with problem + current state; call your client. Then return a list of stripped strings.
    prompt = f"Question: {question}\nCurrent state: {state}\nPropose up to {k} next thoughts to extend this state, one per line:"
    resp = client.responses.create(
        model=MODEL,
        input=prompt,
        temperature=1.2,
    )
    thoughts = [line.strip() for line in resp.output_text.splitlines() if line.strip()]
    return thoughts[:k]  # Return up to k thoughts



def score_state(question, state):
    # Score how promising a partial solution is on a 1–10 scale (higher is better).
    # Steps: build a rating prompt; call the model; parse the first integer 1–10;
    prompt = f"Question: {question}\nCurrent state: {state}\nRate the promise of this state on a scale of 1-10:"
    resp = client.responses.create(
        model=MODEL,
        input=prompt,
        temperature=1.2,
    )
    match = re.search(r'\b([1-9]|10)\b', resp.output_text)
    return int(match.group(0)) if match else 0


def tree_of_thoughts(question, depth=2, width=2):
    # Run a tiny ToT search: expand states with propose_thoughts, score with score_state, keep top-k at each depth.
    # Steps: initialize frontier=[("", 0)]; for each depth, expand each state with k=width thoughts; score each; sort by score desc; keep top 'width'; return best state and score.
    frontier = [("", 0)]  # List of (state, score)
    for d in range(depth):
        new_frontier = []
        for state, _ in frontier:
            thoughts = propose_thoughts(question, state, k=width)
            for thought in thoughts:
                new_state = state + "\n" + thought if state else thought
                score = score_state(question, new_state)
                new_frontier.append((new_state, score))
        if not new_frontier:
            break  # No thoughts were proposed; keep the previous frontier
        # Sort by score desc and keep top 'width'
        new_frontier.sort(key=lambda x: x[1], reverse=True)
        frontier = new_frontier[:width]
    best_state, best_score = frontier[0]
    return best_state, best_score


question = "Design a plan for a weekend science workshop for 12-year-olds."
solution, score = tree_of_thoughts(question)

print(f"Best solution (score {score}):\n{solution}")

Best solution (score 6):
**Weekend Science Workshop Plan for 12-Year-Olds**
**Objective:** To provide an engaging and interactive science experience for 12-year-olds, promoting hands-on learning, critical thinking, and curiosity about the natural world.


---  
# 2- Training Models for Reasoning

### 2.1: CoT Training
Chain-of-Thought (CoT) training conditions the model on explicit rationales during fine-tuning. Instead of teaching the model to output only the final answer, we train on (question, rationale, answer) triples so the model internalizes multi-step reasoning patterns. A practical recipe is STaR (Self-Taught Reasoner): the model generates rationales for a set of questions, keeps only those that reach the correct answer, is fine-tuned on the kept traces, and the loop repeats. A closely related variant, distillation, has a stronger teacher model generate the rationales for a smaller student.

For tasks that require multi-hop reasoning, models fine-tuned on rationales often achieve higher accuracy and are more stable at inference time than models trained on direct answers only. 

Training a full language model is beyond the scope of this notebook, but here is the high-level workflow, followed by short pseudocode:
- Collect questions: Prepare a dataset of questions and correct answers.
- Generate rationales: Have the model itself (STaR) or a stronger teacher LLM (distillation) produce step-by-step reasoning that ends with an answer.
- Filter and clean: Discard incorrect or low-quality rationales.
- Prepare training data: Format triples (question, rationale, answer) for supervised fine-tuning.
- Fine-tune: Fine-tune the LLM on rationales.
- Iterate: Refine prompts, improve data quality, and retrain for stronger reasoning.

In [None]:
# Pseudocode (STaR loop)
# for round in 1 ... iters:
    # STEP 1: generate reasoning (the current model, or a teacher in the distillation variant, writes rationale + answer)
    # STEP 2: keep only correct, high-quality traces
    # STEP 3: fine-tune the model on the kept (question, rationale, answer) data
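
The filtering step (STEP 2) is the part of the loop that is easy to make concrete. The sketch below is a toy illustration, not a training pipeline: `extract_final_answer` and `filter_correct_traces` are hypothetical helpers, the two samples are hand-written, and the `Final Answer:` marker is an assumed output convention (the same one used in the self-consistency cell).

```python
import re


def extract_final_answer(rationale):
    """Pull the text after 'Final Answer:' from a generated rationale, or None if absent."""
    match = re.search(r"Final Answer:\s*(.+)", rationale)
    return match.group(1).strip() if match else None


def filter_correct_traces(samples):
    """Keep only traces whose extracted answer equals the gold answer."""
    kept = []
    for sample in samples:
        if extract_final_answer(sample["rationale"]) == sample["gold"]:
            kept.append({
                "question": sample["question"],
                "rationale": sample["rationale"],
                "answer": sample["gold"],
            })
    return kept


# Hand-written stand-ins for model generations: one correct trace, one incorrect.
samples = [
    {"question": "What is 6 * 12?", "rationale": "6 * 12 = 72.\nFinal Answer: 72", "gold": "72"},
    {"question": "What is 36 / 6?", "rationale": "36 / 6 = 5.\nFinal Answer: 5", "gold": "6"},
]
training_data = filter_correct_traces(samples)
print(len(training_data), "of", len(samples), "traces kept")
```

Exact-match filtering only checks the outcome, so a trace with flawed reasoning that happens to land on the right answer would still be kept.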

### 2.2: ORM vs PRM + RL
Training a Reward Model (RM) allows large language models to be improved through reinforcement learning (RL). Instead of fine-tuning directly on examples, we train a separate model that can score or rank model outputs, and use those scores as feedback signals to refine the policy model.

Two main reward modeling approaches are the Outcome Reward Model (ORM), which predicts a scalar reward for the final answer, and the Process Reward Model (PRM), which evaluates the reasoning steps instead of just the outcome.



| Approach | What it does | When to use |
|-----------|-------------|-------------|
| *Outcome Reward Model (ORM)* | Predicts one scalar reward for the final answer | Training data is easy to collect using verifiers |
| *Process Reward Model (PRM)* | Predicts a reward for each reasoning step | Training data is harder to collect, but the signal is often more accurate |
| *RLHF* | Uses the RM as the reward in **RL** fine-tuning | Aligns the policy with human or synthetic preferences |




In [None]:
# for round = 1 ... iters:
    # STEP 1:  Generate reasoning
        # sample a minibatch of questions
        # policy roll-out (actions + log-probs)
    # STEP 2:  Score the trajectory
        # ORM: scalar reward for the final answer / PRM: scalar reward for the thought process
    # STEP 3:  Reinforce the policy (PPO)
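
To make the difference in STEP 2 concrete, here is a toy sketch that scores one reasoning trace both ways. It is only an illustration: real ORMs and PRMs are learned models, whereas `step_is_valid` below is a hypothetical rule-based checker for steps written as `a op b = c`, and the trace is hand-written with a deliberate mistake in its last step.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def step_is_valid(step):
    """Check one step of the form 'a op b = c' (toy stand-in for a learned PRM)."""
    left, right = step.split("=")
    a, op, b = left.split()
    return abs(OPS[op](float(a), float(b)) - float(right)) < 1e-9


def outcome_reward(final_answer, gold):
    """ORM-style signal: a single scalar for the whole trace."""
    return 1.0 if final_answer == gold else 0.0


def process_rewards(steps):
    """PRM-style signal: one scalar per reasoning step."""
    return [1.0 if step_is_valid(step) else 0.0 for step in steps]


# Hand-written trace for the 36 cm rectangle; the last step contains an arithmetic slip.
steps = ["36 / 6 = 6", "2 * 6 = 12", "12 * 6 = 70"]
print("ORM reward:", outcome_reward("70", "72"))
print("PRM rewards:", process_rewards(steps))
```

The outcome reward only says the trace failed; the per-step rewards also show where it failed, which is the extra signal a PRM provides and the reason its training data is harder to collect.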

### 2.3 Inspect a reasoning model

Now that we've discussed how reasoning models are trained, let's see one in action. We'll use **DeepSeek-R1**, a reasoning model that produces explicit *thinking tokens* before giving its final answer. The model wraps its internal chain-of-thought inside `<think>...</think>` tags, followed by a clean final response.

In the cell below we send a question to DeepSeek-R1 and parse the output to separate:
- **Thinking tokens**: the model's internal reasoning process (hidden from the end user in production).
- **Final answer**: the polished response the user actually sees.
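
The tag-splitting logic can be tried offline on a hand-written string before calling the model. This is a minimal sketch: `split_thinking` is a hypothetical helper name and `raw` is a made-up response, not real model output.

```python
import re


def split_thinking(text):
    """Return (thinking, final_answer) from text that may contain one <think>...</think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    final_answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thinking, final_answer


# Made-up response in the tagged format described above.
raw = "<think>\nPerimeter 36 means 6w = 36, so w = 6.\n</think>\nThe area is 72 cm^2."
thinking, final_answer = split_thinking(raw)
print("Thinking:", thinking)
print("Final Answer:", final_answer)
```

`re.DOTALL` matters here because the thinking block usually spans several lines.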

We use `deepseek-r1:1.5b` here for speed. You can switch to `deepseek-r1:8b` for higher-quality reasoning, but it will take longer to run. Pull whichever variant you want to try (for example, `ollama pull deepseek-r1:8b`) before running the cell.

In [12]:
import re
from openai import OpenAI

# Step 1: Create OpenAI client and set your DeepSeek Model
# Step 2: Write a math question
# Step 3: Call your model
# Step 4: Inspect the output. Separate thinking and final answer sections and print them.
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "deepseek-r1:1.5b"

question = "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?"

resp = client.responses.create(
    model=MODEL,
    input=question,
    temperature=1.2,
    reasoning={"effort": "medium"}
)

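# Parse thinking and final answer. Depending on the server version, the reasoning arrives either
# inline between <think> and </think> tags or as separate reasoning items in resp.output.
thinking_match = re.search(r"<think>(.*?)</think>", resp.output_text, re.DOTALL)
if thinking_match:
    thinking = thinking_match.group(1).strip()
else:
    # Fall back to reasoning items (see the ResponseReasoningItem in the printed response below).
    reasoning_items = [item for item in resp.output if item.type == "reasoning"]
    thinking = "\n".join(s.text for item in reasoning_items for s in item.summary) or "No thinking found"
final_answer = re.sub(r"<think>.*?</think>", "", resp.output_text, flags=re.DOTALL).strip()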

print(resp)

print("Thinking:\n", thinking)
print("Final Answer:\n", final_answer)

Response(id='resp_622599', created_at=1771178050.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='deepseek-r1:1.5b', object='response', output=[ResponseReasoningItem(id='rs_resp_622599', summary=[Summary(text='Okay, so I need to find the area of a rectangle where the perimeter is 36 cm and the length is twice the width. Hmm, let me think about how to approach this step by step.\n\nFirst off, rectangles have opposite sides that are equal in length. So if one pair of sides is \'length\' (which we\'ll call L) and the other is \'width\' (which we can call W), then there are two sides each with length L and width W.\n\nThe formula for the perimeter of a rectangle is:  \nPerimeter = 2*(Length + Width).  \n\nThey told us that the perimeter is 36 cm. So plugging into the formula:\n\n2*(L + W) = 36 cm.\n\nAdditionally, they said the length is twice the width. That means L = 2*W. Okay, so I can substitute this into our equation for the perimeter to solve for the wid

---  
# 3- A Deep Research Agent

A deep-research agent pairs a reasoning model with external tools for web search and retrieval. We will follow the ReAct pattern: the model writes short thoughts, decides when to call tools, reads observations, and continues reasoning until it can answer or reaches a step limit.

Concretely, we combine a **search tool** with an LLM in a multi-step loop (reason → tool → observation):

1. The model reasons and decides to use tools
2. The agent searches and feeds condensed snippets back as context
3. Iterate until the model answers or hits a step limit
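
The loop above can be sketched without any model or network access. Everything in this example is a stand-in: `fake_search` replaces the real search tool, `scripted_model` replaces the LLM with a fixed policy (search once, then answer), and `react_loop` is a hypothetical name for the control flow. It is meant to show the shape of the loop, not how `create_agent` is implemented.

```python
def fake_search(query):
    """Hypothetical stand-in for a web search tool."""
    return f"Snippet about {query}"


def scripted_model(messages):
    """Hypothetical stand-in for an LLM: call the search tool once, then answer."""
    observations = [m for m in messages if m["role"] == "tool"]
    if not observations:
        return {"action": "search", "input": messages[0]["content"]}
    return {"action": "final", "input": f"Answer based on: {observations[-1]['content']}"}


def react_loop(question, model, tools, max_steps=3):
    """Reason -> tool -> observation, repeated until a final answer or the step limit."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = model(messages)
        if decision["action"] == "final":
            return decision["input"]
        # Run the chosen tool and feed its output back as an observation.
        observation = tools[decision["action"]](decision["input"])
        messages.append({"role": "tool", "content": observation})
    return "Step limit reached without a final answer."


answer = react_loop("ReAct pattern", scripted_model, {"search": fake_search})
print(answer)
```

The step limit is what keeps a confused model from calling tools forever; with `max_steps=1` the scripted model never gets to answer.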

We use `create_agent` from `langchain.agents`, which builds a ReAct-style agent graph. Note: the agent model must support **tool calling** (e.g., `llama3.2:3b`). Models like `deepseek-r1` are reasoning models that do not support native tool calling and cannot be used directly as the agent LLM. We can stick to the `llama3.2:3b` or `qwen2.5:3b-instruct` for this section.

In [18]:
from ddgs import DDGS
from langchain_core.tools import tool


@tool
def ddg_search(query: str, k: int = 5) -> str:
    """
    Run a simple web search and return joined snippets.
    """
    # Use DDGS to fetch up to k results and join their text snippets.
    ddgs = DDGS()
    results = ddgs.text(query, max_results=k)
    snippets = [result.get('body', '') for result in results]
    return "\n".join(snippets)

In [19]:
from langchain.agents import create_agent
from langchain_ollama import ChatOllama

MODEL = "qwen2.5:3b-instruct"
question = "What are the best resources to learn machine learning in 2026?"

# Step 1: Initialize the LLM via ChatOllama (must support tool calling)
client = ChatOllama(model=MODEL)

# Step 2: Build a tool-calling agent with DuckDuckGo search
agent = create_agent(
    model=client,
    tools=[ddg_search],
)

# Step 3: Ask a query and let the agent search + reason to produce an answer
response = agent.invoke({"messages": [{"role": "user", "content": question}]})
print(response)



{'messages': [HumanMessage(content='What are the best resources to learn machine learning in 2026?', additional_kwargs={}, response_metadata={}, id='ffe1a356-0d39-4940-b822-ecec78d26bd9'), AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'qwen2.5:3b-instruct', 'created_at': '2026-02-15T17:58:48.062176Z', 'done': True, 'done_reason': 'stop', 'total_duration': 3278058458, 'load_duration': 83512458, 'prompt_eval_count': 172, 'prompt_eval_duration': 106910916, 'eval_count': 31, 'eval_duration': 3062504791, 'logprobs': None, 'model_name': 'qwen2.5:3b-instruct', 'model_provider': 'ollama'}, id='lc_run--019c6274-732e-7433-9f0a-14f6f8208a6c-0', tool_calls=[{'name': 'ddg_search', 'args': {'query': 'best resources to learn machine learning 2026'}, 'id': '70e917df-d29d-42fa-8be6-f15de77f490f', 'type': 'tool_call'}], invalid_tool_calls=[], usage_metadata={'input_tokens': 172, 'output_tokens': 31, 'total_tokens': 203}), ToolMessage(content='\n" Professional Certificate in AI 

# 4- (Optional) Multi-Agent Deep Research

Instead of a single agent, we can design multiple collaborating agents that work in parallel:

1. **Planner**: Analyzes the query and breaks it into sub-questions
2. **Researchers**: Run in parallel, each searching and summarizing findings for one sub-question  
3. **Synthesizer**: Combines all research into a coherent final report

This setup improves coverage and speed by parallelizing the research phase.
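
To see why parallelizing the research phase helps, here is a minimal, dependency-free sketch of the fan-out/fan-in pattern. `fake_research` is a hypothetical stand-in that sleeps instead of calling a search engine or an LLM, and the 0.2-second delay is an arbitrary assumption. Each researcher spends most of its time waiting on I/O, so threads are enough to overlap the waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_research(sub_question: str) -> dict:
    """Hypothetical researcher: sleeps to simulate network and LLM latency."""
    time.sleep(0.2)
    return {"question": sub_question, "summary": f"(summary for: {sub_question})"}


sub_questions = ["courses?", "books?", "bootcamps?", "projects?"]

# Sequential baseline: total time is roughly the sum of all waits.
start = time.perf_counter()
sequential = [fake_research(q) for q in sub_questions]
sequential_seconds = time.perf_counter() - start

# Fan-out: one worker per sub-question, so the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(sub_questions)) as executor:
    # executor.map yields results in input order, not completion order.
    parallel = list(executor.map(fake_research, sub_questions))
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

With a single local model server, the LLM calls may still be queued behind one another, so the real speedup can be smaller than in this sketch.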

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
from ddgs import DDGS

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def plan_research(query: str) -> list[str]:
    """Planner agent: breaks query into sub-questions."""
    # Prompt the LLM to decompose the query into 1-5 focused sub-questions.
    # The prompt should instruct the model to return only sub-questions, one per line.
    # Parse the response into a list of stripped strings and return
    prompt = f"Decompose the following research query into 1-5 focused sub-questions, one per line:\n\n{query}"
    resp = client.responses.create(
        model=MODEL,
        input=prompt,
        temperature=1.2,
    )
    sub_questions = [line.strip() for line in resp.output_text.splitlines() if line.strip()]
    return sub_questions


def search_and_summarize(sub_question: str) -> dict:
    """Researcher agent: searches web and summarizes findings for one sub-question."""
    # Step 1: Use DDGS to search the web for the sub-question (max_results=3)
    # Step 2: Join the result snippets into a single string
    # Step 3: Prompt the LLM to write a concise summary based on the snippets
    # Step 4: Return a dict with keys "question" and "summary"
    ddgs = DDGS()
    results = ddgs.text(sub_question, max_results=3)
    snippets = [result['body'] for result in results]
    joined_snippets = "\n".join(snippets)
    summary_prompt = f"Summarize the following information in a concise way to answer the question: {sub_question}\n\nInformation:\n{joined_snippets}"
    summary_resp = client.responses.create(
        model=MODEL,
        input=summary_prompt,
        temperature=1.2,
    )
    summary = summary_resp.output_text.strip()
    return {"question": sub_question, "summary": summary}


def synthesize_report(query: str, findings: list[dict]) -> str:
    """Synthesizer agent: combines all findings into a coherent report."""
    # Step 1: Format the findings list into a readable text block (e.g., "### sub-question\nsummary" per finding)
    # Step 2: Prompt the LLM to combine them into a well-structured markdown report that answers the original query
    # Step 3: Return the report string
    formatted_findings = "\n\n".join([f"### {f['question']}\n{f['summary']}" for f in findings])
    report_prompt = f"Combine the following findings into a well-structured markdown report that answers the query: {query}\n\nFindings:\n{formatted_findings}"
    report_resp = client.responses.create(
        model=MODEL,
        input=report_prompt,
        temperature=1.2,
    )
    report = report_resp.output_text.strip()
    return report


def deep_research(query: str) -> str:
    """Run the full multi-agent deep research pipeline."""
    # Step 1: Break the query into sub-questions and print them
    sub_questions = plan_research(query)
    print("Sub-questions:")
    for i, sq in enumerate(sub_questions, 1):
        print(f"{i}. {sq}")

    # Step 2: Run search_and_summarize in parallel, one task per sub-question
    # (executor.map returns the findings in the same order as sub_questions)
    with ThreadPoolExecutor() as executor:
        findings = list(executor.map(search_and_summarize, sub_questions))

    # Step 3: Combine all findings into a final report and return it
    report = synthesize_report(query, findings)
    return report


# Run the multi-agent research
query = "What are the best resources to learn machine learning in 2026?"
report = deep_research(query)
print("=" * 60)
print("FINAL REPORT")
print("=" * 60)
print(report)

Sub-questions:
1. Here are 4 focused sub-questions that decompose the original research query:
2. * What are the most popular online courses or tutorials for learning machine learning in 2026?
3. * Which books or textbooks provide comprehensive and up-to-date information on machine learning in 2026?
4. * Are there any specialized bootcamps, workshops, or conferences that offer hands-on experience with machine learning in 2026?
5. * What are the most effective ways to learn machine learning through real-world projects and case studies in 2026?
FINAL REPORT
**Learning Machine Learning in 2026: A Comprehensive Report**

**Introduction**
As the field of machine learning continues to evolve, it is essential to identify effective resources that can help individuals develop this critical skillset. This report aims to provide an overview of the best resources available for learning machine learning in 2026.

**Online Courses and Tutorials**
While specific online courses or tutorials are not me
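
In the run above, the planner's introductory sentence was treated as sub-question 1 and every real sub-question kept its `* ` marker, because `plan_research` keeps every non-empty line as-is. One possible remedy is to strip list markers and discard other lines before searching. The sketch below assumes that real sub-questions end with a question mark. `clean_sub_questions` is a hypothetical helper, not part of the notebook.

```python
import re

# Leading list markers such as "* ", "- ", "1. ", or "2) ".
_LIST_MARKER = re.compile(r"^\s*(?:[*\-]|\d+[.)])\s+")


def clean_sub_questions(raw_text: str, max_questions: int = 5) -> list[str]:
    """Turn raw planner output into a list of bare sub-questions."""
    questions = []
    for line in raw_text.splitlines():
        candidate = _LIST_MARKER.sub("", line).strip()
        # Assumption: anything that is not phrased as a question is preamble.
        if candidate.endswith("?"):
            questions.append(candidate)
    return questions[:max_questions]


raw = """Here are 4 focused sub-questions that decompose the original research query:
* What are the most popular online courses or tutorials for learning machine learning in 2026?
* Which books or textbooks provide comprehensive and up-to-date information on machine learning in 2026?
"""
for question in clean_sub_questions(raw):
    print(question)
```

In `plan_research`, this could replace the list comprehension with `return clean_sub_questions(resp.output_text)`, so that no search is spent on the preamble line.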

## 🎉 Congratulations!

You have:
* Practiced various inference-time reasoning methods (CoT, self-consistency, sequential revision, ToT)
* Gained intuition about training reasoning models (STaR, ORM/PRM)
* Built a **deep-research agent** with tool calling and ReAct-style reasoning
* Implemented a **multi-agent system** with parallel research and report synthesis


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.