## Prepare the local model checkpoint

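For example, download the Qwen2.5-VL-3B-Instruct model checkpoint and save it under the `local_model` directory. Either clone the repository with `git lfs` (commented out below) or use `snapshot_download` from `huggingface_hub`.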

In [1]:
# !git lfs install
# !git clone https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct local_model/Qwen2.5-VL-3B

In [2]:
from huggingface_hub import snapshot_download

# Download the model into the local_model directory
model_path = snapshot_download(
    repo_id="Qwen/Qwen2.5-VL-3B-Instruct",
    local_dir="local_model/Qwen2.5-VL-3B",
    local_dir_use_symlinks=False  # Deprecated and ignored in recent huggingface_hub releases; files are written to local_dir directly
)

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.
Fetching 14 files: 100%|██████████| 14/14 [00:00<00:00, 1991.53it/s]
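
Before pointing the solver at the directory, it can help to confirm that the download left a usable checkpoint on disk. The following is a minimal sketch, not part of octotools or `huggingface_hub`: the helper name `looks_like_checkpoint` is hypothetical, and the expectation that a checkpoint contains a `config.json` plus at least one `*.safetensors` or `*.bin` weight file is an assumption based on the usual Hugging Face layout, not something this notebook states.

```python
from pathlib import Path


def looks_like_checkpoint(model_dir: str) -> bool:
    """Return True if model_dir has a config.json and at least one weight file.

    Hypothetical helper; the expected file names are an assumption based on
    the usual Hugging Face checkpoint layout.
    """
    root = Path(model_dir)
    if not (root / "config.json").is_file():
        return False
    # Accept either safetensors or legacy PyTorch .bin weight shards.
    weight_files = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    return len(weight_files) > 0


print(looks_like_checkpoint("local_model/Qwen2.5-VL-3B"))
```

A `False` result here usually means the download was interrupted or the path is wrong, which is cheaper to catch now than when the engine fails to load.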


## Run the local model

The downloaded model is saved in the `local_model/Qwen2.5-VL-3B` directory.

To run it, set `llm_engine_name` to the local model path prefixed with `vllm-`:

In [3]:
from octotools.solver import construct_solver

# Set the LLM engine name
local_model_path = "local_model/Qwen2.5-VL-3B"
llm_engine_name = f"vllm-{local_model_path}"

# Construct the solver
solver = construct_solver(
    llm_engine_name=llm_engine_name,
    enabled_tools=["Generalist_Solution_Generator_Tool", "Image_Captioner_Tool", "Object_Detector_Tool"],
    verbose=True,
)


==> Initializing octotools...
Enabled tools: ['Generalist_Solution_Generator_Tool', 'Image_Captioner_Tool', 'Object_Detector_Tool']
LLM engine name: vllm-local_model/Qwen2.5-VL-3B

==> Setting up tools...
Loading tools and getting metadata...
Updated Python path: ['/workspace/octotools', '/workspace/octotools/octotools', '/workspace/octotools/examples/notebooks', '/root/miniforge3/envs/oct/lib/python310.zip', '/root/miniforge3/envs/oct/lib/python3.10', '/root/miniforge3/envs/oct/lib/python3.10/lib-dynload', '', '/root/miniforge3/envs/oct/lib/python3.10/site-packages', '__editable__.octotoolkit-0.2.0.finder.__path_hook__']

==> Attempting to import: tools.generalist_solution_generator.tool
Found tool class: Generalist_Solution_Generator_Tool
Metadata for Generalist_Solution_Generator_Tool: {'tool_name': 'Generalist_Solution_Generator_Tool', 'tool_description': 'A generalized tool that takes query from the user as prompt, and answers the question step by step to the best of its ability.

2025-05-21 19:47:12,496	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


Error instantiating Image_Captioner_Tool: Connection error.

==> Attempting to import: tools.object_detector.tool
CUDA_HOME is not set
Found tool class: Object_Detector_Tool
Metadata for Object_Detector_Tool: {'tool_name': 'Object_Detector_Tool', 'tool_description': 'A tool that detects objects in an image using the Grounding DINO model and saves individual object images with empty padding.', 'tool_version': '1.0.0', 'input_types': {'image': 'str - The path to the image file.', 'labels': 'list - A list of object labels to detect.', 'threshold': 'float - The confidence threshold for detection (default: 0.35).', 'model_size': "str - The size of the model to use ('tiny' or 'base', default: 'tiny').", 'padding': 'int - The number of pixels to add as empty padding around detected objects (default: 20).'}, 'output_type': 'list - A list of detected objects with their scores, bounding boxes, and saved image paths.', 'demo_commands': [{'command': 'execution = tool.execute(image="path/to/image.p
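
The engine name passed to `construct_solver` above is the local path with a `vllm-` prefix. The sketch below wraps that convention in a small helper and shows the reverse split. `make_vllm_engine_name` and `split_engine_name` are hypothetical names introduced for illustration; how octotools itself parses the prefix is not shown in this notebook.

```python
VLLM_PREFIX = "vllm-"


def make_vllm_engine_name(model_path: str) -> str:
    """Build an engine name in the `vllm-<path>` form used in this notebook."""
    return f"{VLLM_PREFIX}{model_path}"


def split_engine_name(engine_name: str) -> tuple[str, str]:
    """Split an engine name into (backend, model path)."""
    if not engine_name.startswith(VLLM_PREFIX):
        raise ValueError(f"expected a name starting with {VLLM_PREFIX!r}")
    return "vllm", engine_name[len(VLLM_PREFIX):]


engine_name = make_vllm_engine_name("local_model/Qwen2.5-VL-3B")
print(engine_name)  # vllm-local_model/Qwen2.5-VL-3B
```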

In [4]:
# Solve the user query
output = solver.solve(question="How many baseballs are there?", image_path="baseball.png")


==> 🔍 Received Query: How many baseballs are there?

==> 🖼️ Received Image: baseball.png

==> 🐙 Reasoning Steps from OctoTools (Deep Thinking...)



==> 🔍 Step 0: Query Analysis

### Query Summary
The query asks for the number of baseballs present in the provided image. The image is labeled as "baseball.png" and has dimensions of 719x458 pixels.

### Required Skills
1. **Counting Objects**: The ability to count individual objects within an image.
2. **Image Analysis**: Understanding the structure and layout of the image to identify objects.
3. **Precision**: Ensuring accurate counting to avoid miscounting or missing any objects.

### Relevant Tools
1. **Generalist_Solution_Generator_Tool**
   - **Explanation**: This tool can be used to analyze the image and count the baseballs. By providing a clear prompt about counting objects, the tool will generate a response indicating the number of baseballs present in the image.
   - **Limitations**: The tool may not always detect all objects accurately, especially if they are very small or have similar colors to the background.

2. **Object_Detector_Tool**
   - **Explanation**: This tool ca

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cuda


In [5]:
print(output["final_output"])

### Summary:
The query asks about the number of baseballs present in the image. The Object_Detector_Tool was used to detect and count the baseballs in the provided image. The tool identified 20 baseballs across different positions within the image.

### Detailed Analysis:
1. **Tool Execution**:
   - **Tool Used**: Object_Detector_Tool
   - **Purpose**: To detect and count the number of baseballs in the image.
   - **Key Results**: The tool identified 20 baseballs in various locations within the image.

2. **Step-by-Step Process**:
   - The Object_Detector_Tool was applied to the image "baseball.png".
   - The tool detected multiple instances of baseballs, each with a confidence score above 0.6.
   - The detected baseballs were saved as separate images for reference.

3. **Contribution to Query**:
   - The detection and counting process helped identify the total number of baseballs present in the image.
   - Each detected baseball was confirmed by its position and confidence score, ensu

In [6]:
print(output["direct_output"])

There are 20 baseballs in the image.
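
`direct_output` is free text, so code that needs the count as a number has to parse it. A minimal sketch, assuming the answer contains the count as a plain integer as in the output above; `extract_first_int` is a hypothetical helper, not an octotools function.

```python
import re
from typing import Optional


def extract_first_int(text: str) -> Optional[int]:
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


answer = "There are 20 baseballs in the image."
print(extract_first_int(answer))  # 20
```

This breaks if the model spells the number out ("twenty") or mentions another number first, so treat it as a convenience rather than a reliable parser.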


In [7]:
print(f"Step count: {output['step_count']} step(s)")
print(f"Execution time: {output['execution_time']} seconds")

Step count: 1 step(s)
Execution time: 7.19 seconds
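
The cells above read four keys from the returned dictionary: `final_output`, `direct_output`, `step_count`, and `execution_time`. As a rough sketch, the last three can be gathered into one summary line. `summarize_result` is a hypothetical helper, and the sample dictionary is hand-built from the values printed above rather than produced by a solver call.

```python
def summarize_result(result: dict) -> str:
    """Format the answer, step count, and runtime of a solver result as one line."""
    answer = result.get("direct_output", "").strip()
    steps = result.get("step_count", 0)
    seconds = result.get("execution_time", 0.0)
    return f"{answer} ({steps} step(s), {seconds:.2f} s)"


# Hand-built sample mirroring the printed values; not real solver output.
sample = {
    "direct_output": "There are 20 baseballs in the image.",
    "step_count": 1,
    "execution_time": 7.19,
}
print(summarize_result(sample))
# There are 20 baseballs in the image. (1 step(s), 7.19 s)
```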
