## Installing & Importing Libraries

In [1]:
!pip install transformers accelerate gradio

from transformers import AutoTokenizer, AutoModelForCausalLM, LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import gradio as gr

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.wh

## Load Pretrained Language and Vision-Language Models

In [2]:
# Load DeepSeek language model
code_model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
code_tokenizer = AutoTokenizer.from_pretrained(code_model_id, trust_remote_code=True)
code_model = AutoModelForCausalLM.from_pretrained(
    code_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load LLaVA-NeXT for vision-language tasks (image + text)
vision_model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
vision_processor = LlavaNextProcessor.from_pretrained(vision_model_id)
vision_model = LlavaNextForConditionalGeneration.from_pretrained(
    vision_model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True
    )

device = "cuda" if torch.cuda.is_available() else "cpu"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/2.07k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/41.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/552 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/176 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.28k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/70.2k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/380M [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]



## Generation Functions

This section implements the core logic for:

- **Vision + Text Generation** — `image_with_prompt(prompt, image, temperature, top_p, max_new_tokens)`  
  Generates a response based on both an uploaded image and a text prompt using a vision-language model.

- **Image-Only Description** — `describe_image(image, temperature, top_p, max_new_tokens)`  
  Uses the same model with a default prompt to describe the content of an image.

- **Code Generation** — `generate_code(prompt, temperature, top_p, max_new_tokens, display_thinking)`  
  Uses a causal language model to generate code, optionally including detailed reasoning and boxed final answers.

In [3]:
def image_with_prompt(prompt, image, temperature=0.6, top_p=0.9, max_new_tokens = 500):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    prompt = vision_processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = vision_processor(image, prompt, return_tensors="pt").to(device)
    output = vision_model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens)

    decoded = vision_processor.decode(output[0], skip_special_tokens=True)

    # Split by [/INST] (end of instruction token)
    if "[/INST]" in decoded:
        _ , response = decoded.split("[/INST]", 1)
        return response.strip()
    else:
        return decoded.strip()

def describe_image(image, temperature=0.6, top_p=0.9, max_new_tokens = 500):
    prompt = "Describe the image in detail."
    return image_with_prompt(prompt, image, temperature, top_p, max_new_tokens)

In [4]:
def generate_code(prompt, temperature=0.6, top_p=0.9, max_new_tokens=2000, display_thinking=False):
    formatted_prompt = f"{prompt.strip()}\n\nPlease reason step by step, and put the final answer within \\boxed{{}}.\n<think>\n"
    inputs = code_tokenizer(formatted_prompt, return_tensors="pt").to(device)
    output = code_model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens
    )
    decoded = code_tokenizer.decode(output[0], skip_special_tokens=True)

    # Split by </think> (end of thinking token)
    if "</think>" in decoded:
        pre_think, post_think = decoded.split("</think>", 1)
        if display_thinking:
            return f"### Thinking\n\n{pre_think.strip()}</think>\n\n### Final Answer\n\n{post_think.strip()}"
        else:
            return post_think.strip()
    else:
        return decoded.strip()

## Gradio UI

In [6]:
code_interface = gr.Interface(
    fn=generate_code,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Slider(0, 1, value=0.6, label="Temperature (randomness)"),
        gr.Slider(0, 1, value=0.9, label="Top-p (sampling scope)"),
        gr.Slider(10, 5000, value=2000, step=10, label="Max New Tokens (Output Length)"),
        gr.Checkbox(label="Show thinking steps", value=False),
    ],
    outputs =gr.Markdown(),
    title="Code Generation"
)

# Image-only tab
image_desc_interface = gr.Interface(
    fn=describe_image,
    inputs=[
        gr.Image(type="pil", label="Upload Image"),
        gr.Slider(0, 1, value=0.6, label="Temperature (randomness)"),
        gr.Slider(0, 1, value=0.9, label="Top-p (sampling scope)"),
        gr.Slider(10, 5000, value=2000, step=10, label="Max New Tokens (Output Length)")
    ],
    outputs=gr.Textbox(label="Description"),
    title="Image Description"
)

# Image + Prompt tab
image_prompt_interface = gr.Interface(
    fn=image_with_prompt,
    inputs=[
        gr.Textbox(label="Prompt", placeholder="e.g., What is the person doing?"),
        gr.Image(type="pil", label="Upload Image"),
        gr.Slider(0, 1, value=0.6, label="Temperature (randomness)"),
        gr.Slider(0, 1, value=0.9, label="Top-p (sampling scope)"),
        gr.Slider(10, 5000, value=2000, step=10, label="Max New Tokens (Output Length)")
    ],
    outputs=gr.Textbox(label="Response"),
    title="Visual Question Answering"
)


gr.TabbedInterface(
    [code_interface, image_desc_interface, image_prompt_interface],
    ["Code Generation", "Image Description", "VQA"]
).launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://fd440e68954cdb38d7.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Testing The functions

In [5]:
generate_code("Write a Python function that checks if a number is prime.", max_new_tokens=5000)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


"Write a Python function that checks if a number is prime.\n\nPlease reason step by step, and put the final answer within \\boxed{}.\n<think>\nAlright, I need to write a Python function that checks if a number is prime. Let's think about how to approach this.\n\nFirst, what defines a prime number? A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. So, for example, 2 is prime because it's only divisible by 1 and 2. But 4 is not prime because it's divisible by 1, 2, and 4.\n\nOkay, so the function should take an integer as input and return a boolean indicating whether it's prime or not. Let's outline the steps.\n\n1. **Handle edge cases**: Numbers less than 2 are not prime. So, if the input is 0, 1, or negative, we immediately return False.\n\n2. **Check divisibility**: For numbers 2 and above, we need to check if any number from 2 up to the square root of the input divides it evenly. If any such number exists, the input is not prime.

In [5]:
import requests
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)

describe_image(image_snowman)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


"In the image, a jovial snowman is seated on a bed of snow, engrossed in the warmth of a small campfire. The snowman, with its orange nose and black eyes, is dressed in a plaid scarf and a green hat, adding a touch of color to the otherwise monochrome scene. \n\nThe campfire, a bundle of sticks and logs, is nestled on the left side of the image, casting a warm glow on the snowman and its surroundings. On the right side of the image, a silver pot and a tin can are scattered around, perhaps remnants of a meal enjoyed by the snowman.\n\nThe background of the image is a stark contrast to the warmth of the campfire. It's a dark, snowy forest, its bare trees reaching up towards the sky. The snow on the ground reflects the light from the campfire, creating an ethereal glow that illuminates the scene.\n\nDespite being an inanimate object, the snowman exudes a sense of joy and warmth, making one wonder about its imaginary adventures in this winter wonderland."