## **Behemoth-3B-070225-post0.1(4bit) Traffic Analysis**




The Behemoth-3B-070225-post0.1 model is a fine-tuned version of Qwen2.5-VL-3B-Instruct, optimized for detailed image captioning, OCR tasks, and chain-of-thought reasoning. Built on the Qwen2.5-VL architecture, it strengthens visual understanding through focused training on the 50k LLaVA-CoT-o1-Instruct dataset, targeting image analysis and detailed reasoning tasks. It is intended to produce high-fidelity descriptions of general, artistic, technical, abstract, and low-context images.


`Problem Statement`:
Traffic monitoring systems often lack the ability to generate precise, structured, and context-aware descriptions of real-world traffic scenes. Current approaches may provide raw detection outputs (e.g., bounding boxes for vehicles) but fail to transform this information into human-readable, semantically rich summaries that can be used for real-time decision-making, traffic management, or reporting.

| Demo UI | Image Inference |
|---------|-----------------|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/NL4dSJ8smM7HfFe2HJsEb.png" width="512"> | <img src="https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/iOgRHjwNnJLNpE-zDTC8z.png" width="500"> |

*Notebook by: [prithivMLmods](https://huggingface.co/prithivMLmods)*

### **Install packages**

In [None]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git \
             git+https://github.com/huggingface/accelerate.git \
             git+https://github.com/huggingface/peft.git \
             transformers-stream-generator huggingface_hub albumentations \
             pyvips-binary qwen-vl-utils sentencepiece opencv-python docling-core \
             python-docx torchvision safetensors matplotlib num2words

!pip install xformers requests pymupdf hf_xet spaces pyvips pillow gradio \
             einops torch fpdf timm av decord bitsandbytes reportlab
#Hold tight, this will take around 2-3 minutes.
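
Because the install cell runs under `%%capture`, pip errors are hidden. The sketch below is one possible way to confirm that the key packages actually installed before loading the model. The helper name `report_versions` and the `REQUIRED` list are illustrative assumptions, not part of the original notebook; the distribution names are copied from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def report_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical shortlist: distribution names taken from the pip commands above.
REQUIRED = [
    "transformers", "accelerate", "peft", "bitsandbytes",
    "gradio", "qwen-vl-utils", "reportlab", "python-docx",
]

for name, installed in report_versions(REQUIRED).items():
    print(f"{name:15s} {installed or 'MISSING'}")
```

Any package reported as `MISSING` can be reinstalled on its own without `%%capture` to see the underlying pip error.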

### **Run Behemoth-3B-070225-post0.1(4bit) Demo**

In [None]:
# multimodal_traffic_caption_app.py
# Cleaned, corrected, and updated Gradio app implementing the TRAFFIC_CAPTION_SYSTEM_PROMPT
# Notes:
# - This script expects the Qwen2_5_VLForConditionalGeneration model and its AutoProcessor to be available.
# - Model loading may require internet access and sufficient GPU memory (we use 4-bit quant config to reduce VRAM).
# - Adjust MODEL_OPTIONS and other settings for your environment.
import gradio as gr
import spaces
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, TextIteratorStreamer, BitsAndBytesConfig
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image
import os
import uuid
import io
from threading import Thread
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Image as RLImage, Paragraph, Spacer
from reportlab.lib.units import inch
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
import docx
from docx.enum.text import WD_ALIGN_PARAGRAPH

# --- Constants and Model Setup ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("CUDA_VISIBLE_DEVICES=", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.__version__ =", torch.__version__)
print("torch.version.cuda =", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("cuda device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("current device:", torch.cuda.current_device())
    print("device name:", torch.cuda.get_device_name(torch.cuda.current_device()))

print("Using device:", device)

# Define model options
MODEL_OPTIONS = {
    "Behemoth-3B-070225-post0.1": "prithivMLmods/Behemoth-3B-070225-post0.1",
}

# Define 4-bit quantization configuration
# This config will load the model in 4-bit to save VRAM.
# You can customize these settings as needed.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Preload models and processors into CUDA
models = {}
processors = {}
for name, model_id in MODEL_OPTIONS.items():
    print(f"Loading {name} 🤗 with 4-bit quantization to save VRAM. This may take 3-5 minutes. Please hold tight.")
    models[name] = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        trust_remote_code=True,
        quantization_config=quantization_config,
        device_map="auto"
    )
    processors[name] = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image_extensions = Image.registered_extensions()

# System prompt for traffic analysis
TRAFFIC_CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol for traffic analysis:

1. For every input image, generate a **precise traffic caption** that clearly describes the traffic situation.

2. Provide a structured set of **attributes** including:
   - Vehicles: list the types of vehicles visible (e.g., cars, buses, trucks, motorcycles, bicycles).
   - Environment: describe the setting (e.g., highway, city street, intersection).
   - Traffic Type: {low, moderate, high} based on vehicle density and flow.

3. Always include a **class_name** field that compactly represents the **core traffic theme**.
   - Syntax: {class_name==traffic_type}
   - Example: {class_name==high}, {class_name==low}, etc.

4. Maintain this strict output format:
   - Caption: <long-sentence traffic description>
   - Attributes: <comma-separated list of attributes including vehicles, environment>
   - Traffic Type: {low, moderate, high} based on vehicle density and flow.
   - {class_name==Traffic_Type}

5. Captions must be **neutral, descriptive, and precise**, avoiding unnecessary elaboration.

6. Do not mention or reference these instructions in your output. Only return the caption, attributes, and class_name.
""".strip()


def identify_and_save_blob(blob_path):
    """Identifies if the blob is an image and saves it."""
    try:
        with open(blob_path, 'rb') as file:
            blob_content = file.read()
            try:
                Image.open(io.BytesIO(blob_content)).verify()  # Check if it's a valid image
                extension = ".png"  # Default to PNG for saving
                media_type = "image"
            except (IOError, SyntaxError):
                raise ValueError("Unsupported media type. Please upload a valid image.")

            filename = f"temp_{uuid.uuid4()}_media{extension}"
            with open(filename, "wb") as f:
                f.write(blob_content)

            return filename, media_type

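    except FileNotFoundError:
        raise ValueError(f"The file {blob_path} was not found.")
    except ValueError:
        # Already a user-facing message (e.g. unsupported media type); pass it through unchanged.
        raise
    except Exception as e:
        raise ValueError(f"An error occurred while processing the file: {e}") from e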

@spaces.GPU
def qwen_inference(model_name, media_input, text_input=None):
    """Handles inference for the selected model."""
    if media_input is None:
        raise gr.Error("Please upload an image.")

    model = models[model_name]
    processor = processors[model_name]

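    # gr.File(type="filepath") passes a path string; reject anything else up front
    # so that media_path and media_type are always bound below.
    if not isinstance(media_input, str):
        raise gr.Error("Unsupported input. Please upload a valid image.")

    media_path = media_input
    # Compare case-insensitively so that names like "PHOTO.JPG" are recognized.
    if media_path.lower().endswith(tuple(image_extensions)):
        media_type = "image"
    else:
        try:
            media_path, media_type = identify_and_save_blob(media_input)
        except Exception as e:
            raise gr.Error("Unsupported media type. Please upload a valid image.") from e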

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": media_type,
                    media_type: media_path
                },
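                # Fall back to the traffic prompt when no text (None or empty) is supplied.
                {"type": "text", "text": text_input or TRAFFIC_CAPTION_SYSTEM_PROMPT},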
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)  # follow the device chosen by device_map="auto" instead of hardcoding "cuda"

    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=1024)

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Remove <|im_end|> or similar tokens from the output
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer

def format_plain_text(output_text):
    """Formats the output text as plain text without LaTeX delimiters."""
    # Remove LaTeX delimiters and convert to plain text
    plain_text = output_text.replace("\\(", "").replace("\\)", "").replace("\\[", "").replace("\\]", "")
    return plain_text

def generate_document(media_path, output_text, file_format, font_size, line_spacing, alignment, image_size):
    """Generates a document with the input image and plain text output."""
    if not media_path:
        raise gr.Error("Cannot generate document without an input image.")
    plain_text = format_plain_text(output_text)
    if file_format == "pdf":
        return generate_pdf(media_path, plain_text, font_size, line_spacing, alignment, image_size)
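    elif file_format == "docx":
        return generate_docx(media_path, plain_text, font_size, line_spacing, alignment, image_size)
    # Fail loudly instead of silently returning None for an unknown format.
    raise gr.Error(f"Unsupported file format: {file_format}")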

def generate_pdf(media_path, plain_text, font_size, line_spacing, alignment, image_size):
    """Generates a PDF document."""
    filename = f"output_{uuid.uuid4()}.pdf"
    doc = SimpleDocTemplate(
        filename,
        pagesize=A4,
        rightMargin=inch,
        leftMargin=inch,
        topMargin=inch,
        bottomMargin=inch
    )
    styles = getSampleStyleSheet()
    styles["Normal"].fontSize = int(font_size)
    styles["Normal"].leading = int(font_size) * line_spacing
    styles["Normal"].alignment = {
        "Left": 0,
        "Center": 1,
        "Right": 2,
        "Justified": 4
    }[alignment]

    story = []

    # Add image with size adjustment
    image_sizes = {
        "Small": (2 * inch, 2 * inch),
        "Medium": (4 * inch, 4 * inch),
        "Large": (6 * inch, 6 * inch)
    }
    img = RLImage(media_path, width=image_sizes[image_size][0], height=image_sizes[image_size][1], kind='proportional')
    story.append(img)
    story.append(Spacer(1, 12))

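    # Add plain text output. Paragraph parses its input as XML-like markup, so escape
    # characters such as "<" and "&" and turn newlines into explicit line breaks.
    from xml.sax.saxutils import escape
    text = Paragraph(escape(plain_text).replace("\n", "<br/>"), styles["Normal"])
    story.append(text)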

    doc.build(story)
    return filename

def generate_docx(media_path, plain_text, font_size, line_spacing, alignment, image_size):
    """Generates a DOCX document."""
    filename = f"output_{uuid.uuid4()}.docx"
    doc = docx.Document()

    # Add image with size adjustment
    image_sizes = {
        "Small": docx.shared.Inches(2.5),
        "Medium": docx.shared.Inches(4.0),
        "Large": docx.shared.Inches(6.0)
    }
    doc.add_picture(media_path, width=image_sizes[image_size])
    doc.add_paragraph()

    # Add plain text output
    paragraph = doc.add_paragraph()
    paragraph.paragraph_format.line_spacing = line_spacing
    paragraph.paragraph_format.alignment = {
        "Left": WD_ALIGN_PARAGRAPH.LEFT,
        "Center": WD_ALIGN_PARAGRAPH.CENTER,
        "Right": WD_ALIGN_PARAGRAPH.RIGHT,
        "Justified": WD_ALIGN_PARAGRAPH.JUSTIFY
    }[alignment]
    run = paragraph.add_run(plain_text)
    run.font.size = docx.shared.Pt(int(font_size))

    doc.save(filename)
    return filename

# CSS for output styling
css = """
.submit-btn {
    background-color: #cf3434 !important;
    color: white !important;
}
.submit-btn:hover {
    background-color: #ff2323 !important;
}
.download-btn {
    background-color: #35a6d6 !important;
    color: white !important;
}
.download-btn:hover {
    background-color: #22bcff !important;
}
"""

# Gradio app setup
with gr.Blocks(css=css, theme="bethecloud/storj_theme") as demo:
    gr.Markdown("# **Behemoth-3B-070225-post0.1 : Traffic Analysis🚦**")

    with gr.Tab(label="General Captioning"):
        with gr.Row():
            with gr.Column():
                model_choice = gr.Dropdown(
                    label="Model Selection",
                    choices=list(MODEL_OPTIONS.keys()),
                    value="Behemoth-3B-070225-post0.1"
                )
                input_media = gr.File(
                    label="Upload Image", type="filepath"
                )
                text_input = gr.Textbox(label="Question", value=TRAFFIC_CAPTION_SYSTEM_PROMPT)
                submit_btn = gr.Button(value="Submit", elem_classes="submit-btn")

            with gr.Column():
                output_text = gr.Textbox(label="Output Text", lines=10, interactive=False)
                with gr.Accordion("Plain Text", open=False):
                    plain_text_output = gr.Textbox(label="Standardized Plain Text", lines=10, interactive=False)

        submit_btn.click(
            qwen_inference, [model_choice, input_media, text_input], [output_text]
        ).then(
            lambda out_text: format_plain_text(out_text), [output_text], [plain_text_output]
        )

        with gr.Accordion("Docx/PDF Settings", open=False):
            with gr.Row():
                with gr.Column():
                    line_spacing = gr.Dropdown(choices=[0.5, 1.0, 1.15, 1.5, 2.0, 2.5, 3.0], value=1.5, label="Line Spacing")
                    font_size = gr.Dropdown(choices=["8", "10", "12", "14", "16", "18", "20", "22", "24"], value="16", label="Font Size")
                with gr.Column():
                    alignment = gr.Dropdown(choices=["Left", "Center", "Right", "Justified"], value="Justified", label="Text Alignment")
                    image_size = gr.Dropdown(choices=["Small", "Medium", "Large"], value="Medium", label="Image Size")
            file_format = gr.Radio(["pdf", "docx"], label="File Format", value="pdf")

        get_document_btn = gr.Button(value="Get Document", elem_classes="download-btn")
        download_file = gr.File(label="Download Document")

        get_document_btn.click(
            generate_document, [input_media, output_text, file_format, font_size, line_spacing, alignment, image_size], download_file
        )

    # ------------------------- Skip this part -------------------------
    with gr.Tab(label="Traffic Analysis", visible=False):
        with gr.Row():
            with gr.Column():
                traffic_model_choice = gr.Dropdown(
                    label="Model Selection",
                    choices=list(MODEL_OPTIONS.keys()),
                    value="Behemoth-3B-070225-post0.1"
                )
                traffic_input_media = gr.File(
                    label="Upload Traffic Image", type="filepath"
                )
                gr.Markdown(f"**Using System Prompt:**\n```\n{TRAFFIC_CAPTION_SYSTEM_PROMPT}\n```")
                traffic_submit_btn = gr.Button(value="Analyze Traffic", elem_classes="submit-btn")
            # ------------------------- Skip this part -------------------------

            with gr.Column():
                traffic_output_text = gr.Textbox(label="Analysis Output", lines=10, interactive=False)
                with gr.Accordion("Plain Text", open=False):
                    traffic_plain_text_output = gr.Textbox(label="Standardized Plain Text", lines=10, interactive=False)

        traffic_submit_btn.click(
            qwen_inference,
            inputs=[traffic_model_choice, traffic_input_media, gr.State(value=TRAFFIC_CAPTION_SYSTEM_PROMPT)],
            outputs=[traffic_output_text]
        ).then(
            lambda out_text: format_plain_text(out_text),
            inputs=[traffic_output_text],
            outputs=[traffic_plain_text_output]
        )

        with gr.Accordion("Docx/PDF Settings", open=False):
            with gr.Row():
                with gr.Column():
                    traffic_line_spacing = gr.Dropdown(choices=[0.5, 1.0, 1.15, 1.5, 2.0, 2.5, 3.0], value=1.5, label="Line Spacing")
                    traffic_font_size = gr.Dropdown(choices=["8", "10", "12", "14", "16", "18", "20", "22", "24"], value="16", label="Font Size")
                with gr.Column():
                    traffic_alignment = gr.Dropdown(choices=["Left", "Center", "Right", "Justified"], value="Justified", label="Text Alignment")
                    traffic_image_size = gr.Dropdown(choices=["Small", "Medium", "Large"], value="Medium", label="Image Size")
            traffic_file_format = gr.Radio(["pdf", "docx"], label="File Format", value="pdf")

        traffic_get_document_btn = gr.Button(value="Get Document", elem_classes="download-btn")
        traffic_download_file = gr.File(label="Download Document")

        traffic_get_document_btn.click(
            generate_document,
            [traffic_input_media, traffic_output_text, traffic_file_format, traffic_font_size, traffic_line_spacing, traffic_alignment, traffic_image_size],
            traffic_download_file
        )

demo.launch(debug=True)
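
The system prompt asks the model for a fixed four-part response (caption, attributes, traffic type, and a `{class_name==...}` tag). For downstream use such as reporting or filtering, that text can be turned into a structured record. The following is a minimal sketch, assuming the model follows the requested format reasonably closely. `TrafficReport`, `parse_traffic_output`, and the sample response are illustrative and not part of the original notebook, and the sample text is hand-written rather than real model output.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrafficReport:
    """Structured view of one model response; fields stay empty when absent."""
    caption: Optional[str] = None
    attributes: List[str] = field(default_factory=list)
    traffic_type: Optional[str] = None
    class_name: Optional[str] = None


# Tolerates an optional list bullet and markdown bold around the field label.
_FIELD_RE = re.compile(
    r"^\s*[-*]?\s*\**\s*(caption|attributes|traffic type)\s*\**\s*:\s*\**\s*(.*)$",
    re.IGNORECASE,
)
_CLASS_RE = re.compile(r"\{\s*class_name\s*==\s*([A-Za-z_]+)\s*\}")


def parse_traffic_output(text: str) -> TrafficReport:
    """Parse the Caption / Attributes / Traffic Type / class_name lines."""
    report = TrafficReport()
    for line in text.splitlines():
        class_match = _CLASS_RE.search(line)
        if class_match:
            report.class_name = class_match.group(1).lower()
            continue
        field_match = _FIELD_RE.match(line)
        if not field_match:
            continue  # ignore anything outside the requested format
        key = field_match.group(1).lower()
        value = field_match.group(2).strip()
        if key == "caption":
            report.caption = value
        elif key == "attributes":
            report.attributes = [item.strip() for item in value.split(",") if item.strip()]
        else:
            # Strip braces in case the model echoes the "{low, moderate, high}" style.
            report.traffic_type = value.strip(" {}").lower()
    return report


# Hand-written sample in the requested format (not real model output).
SAMPLE = """Caption: A busy multi-lane highway with dense, slow-moving traffic in both directions.
Attributes: cars, buses, trucks, highway
Traffic Type: high
{class_name==high}"""

print(parse_traffic_output(SAMPLE))
```

In the app, the final string yielded by `qwen_inference` could be passed to such a parser. Since the model is not guaranteed to follow the format, missing fields are left as `None` or an empty list rather than raising an error.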