# Direct Multimodal LLM Pipeline using LLaVA

### Contents

##### Objective
##### Why LLaVA was chosen
##### Modular Pipeline Design (Architecture)
##### Data Ingestion and Preprocessing for PDF, Image, and Audio Modalities
##### Model Access Check, Model Definition, and Inputs
##### Prompt Engineering and Instruction Tuning


### Why LLaVA was chosen: 

LLaVA (Large Language-and-Vision Assistant) was selected for the following reasons:

##### True Direct Multimodality
LLaVA jointly processes image embeddings + text tokens inside the same transformer.
No intermediate caption generation or OCR-only workflow.
##### Open-Source & Actively Maintained
Publicly available model.
Well-documented Hugging Face integration.
##### Lightweight Compared to Alternatives
MiniGPT-4 requires heavier vision encoders and tuning.
Qwen-VL is powerful but relatively heavier and newer.
LLaVA works well on limited GPU environments.
##### Excellent Instruction Following
Strong alignment for question answering, explanations, and reasoning tasks.
##### Conclusion
LLaVA offers the best balance of reasoning quality, openness, and practical usability for academic and prototype-level multimodal systems.
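
To make the "True Direct Multimodality" point concrete, the sketch below counts how many positions one image occupies in the sequence the language model processes. It assumes the CLIP ViT-L/14 vision encoder at 336x336 resolution used by LLaVA-1.5, with one projected embedding per patch. The function names and the 40-token prompt length are illustrative, not part of any library.

```python
def num_image_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Number of patch embeddings produced for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def merged_sequence_length(num_text_tokens: int, num_images: int = 1) -> int:
    """Sequence length once each image is represented by its patch embeddings.

    num_text_tokens counts the text tokens only, not the <image> placeholder.
    """
    return num_text_tokens + num_images * num_image_tokens()


print(num_image_tokens())          # 576
print(merged_sequence_length(40))  # 616
```

Under these assumptions a single image takes far more sequence positions than a short question, which is worth keeping in mind when long PDF or audio text is added to the same prompt.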

## Modular Pipeline Design (Architecture)

# Installations and imports

In [1]:
!pip install torch torchvision torchaudio --quiet
!pip install transformers accelerate bitsandbytes -U --quiet
!pip install PyPDF2 pillow --quiet
!pip install -U openai-whisper



In [1]:
import os
# expandable_segments:True lets PyTorch's CUDA allocator grow and reuse memory segments, reducing fragmentation.
# Without this setting, this notebook ran into out-of-memory errors.
# Set it before PyTorch initializes CUDA so the allocator picks it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

In [3]:
import gc
gc.collect()

0

## Defining Model 

In [4]:
import torch
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model_id = "llava-hf/llava-1.5-7b-hf"

processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the disk and cpu.
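
The last message above indicates that part of the model was offloaded to CPU and disk, which usually slows generation. Since `bitsandbytes` is already installed, one option is to load the weights in 4-bit precision. The configuration below is a sketch and was not run as part of this notebook. It assumes a CUDA GPU is available, and quantization may change outputs slightly.

```python
import torch
from transformers import (
    BitsAndBytesConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
)

model_id = "llava-hf/llava-1.5-7b-hf"

# Store weights in 4-bit and run the matrix multiplications in float16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```

With 4-bit weights the model needs roughly a quarter of the memory of the float16 weights, so it is more likely to fit on a single small GPU without offloading.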


# Data Ingestion and Preprocessing

#### Image Modality
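##### Vision models (including LLaVA) expect 3-channel RGB images, so every input is converted to RGB
##### resize() can stretch images, whereas thumbnail() preserves the aspect ratio; the function below only converts to RGB and leaves resizing to the LLaVA processor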

In [5]:
from PIL import Image
def preprocess_image(image_path):
    if image_path is None:
        return None
    return Image.open(image_path).convert("RGB")
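
If very large images need to be shrunk before they reach the processor, `Image.thumbnail()` can be used without distorting them. The variant below is a sketch: the function name and the `max_side=1024` default are assumptions, not values used elsewhere in this notebook.

```python
from PIL import Image


def preprocess_image_thumbnail(image_source, max_side=1024):
    """Load an image as RGB and shrink it so neither side exceeds max_side.

    image_source may be a file path or a binary file object.
    thumbnail() works in place, keeps the aspect ratio, and never enlarges.
    """
    if image_source is None:
        return None
    image = Image.open(image_source).convert("RGB")
    image.thumbnail((max_side, max_side))
    return image
```

A 2000x1000 image passed with `max_side=1000` comes back as 1000x500, whereas `resize((1000, 1000))` would stretch it to a square.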


#### Text Modality

PDF documents are processed with PyPDF2 to extract machine-readable text. Pages are parsed sequentially and their text is concatenated into a single buffer. Extraction can return nothing for scanned or image-based pages. Such pages are skipped, and if the entire document yields no text a `ValueError` is raised so the failure is not silent.

In [7]:
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_path):
    """
    Extract text from a PDF.
    Raises an error if no text is found in the entire document.
    """
    reader = PdfReader(pdf_path)
    text = ""

    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text + "\n"

    if not text.strip():
        raise ValueError(
            "No extractable text found in the PDF. "
            "The document may be fully scanned."
        )

    return text

#def clean_text(text):
#      text = text.replace("\n", " ")
#      text = " ".join(text.split())  # remove extra spaces
#      return text

#input_pdftext = clean_text(text)
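
`extract_pdf_text` fails only when the whole document is empty, so a PDF with a few scanned pages passes silently. A small helper can report which pages produced no text. This is a sketch: the helper name is hypothetical, and it assumes the page texts were collected as `[page.extract_text() for page in reader.pages]`.

```python
def find_pages_without_text(page_texts):
    """Return 1-based page numbers whose extracted text is missing or blank."""
    return [
        number
        for number, page_text in enumerate(page_texts, start=1)
        if not (page_text or "").strip()
    ]


# Hand-written page texts for illustration; pages 2 and 4 mimic scanned pages.
pages = ["Intro text", "", "Results\n", None]
print(find_pages_without_text(pages))  # [2, 4]
```

The returned page numbers could be logged as a warning, or used to route those pages to an OCR step.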

#### Audio Modality using Whisper 

In [8]:
import whisper

_whisper_model = whisper.load_model("base")

def preprocess_audio(audio_path):
    if audio_path is None:
        return None
    result = _whisper_model.transcribe(audio_path)
    return result["text"]

In [9]:
def combine_text_context(pdf_text=None, audio_text=None):
    """
    Combine all text-like modalities.
    Image is handled separately by LLaVA.
    """
    combined = ""

    if pdf_text:
        combined += pdf_text + "\n"

    if audio_text:
        combined += audio_text + "\n"

    return combined.strip()
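
Because the prompt later asks the model to ground every claim in the image, PDF, or audio, it may help to mark where each piece of text came from. The variant below is a sketch of that idea. The function name and the bracketed labels are assumptions and do not correspond to any format LLaVA requires.

```python
def combine_labeled_context(pdf_text=None, audio_text=None):
    """Join the text modalities, tagging each section with its source."""
    sections = []
    if pdf_text and pdf_text.strip():
        sections.append("[PDF text]\n" + pdf_text.strip())
    if audio_text and audio_text.strip():
        sections.append("[Audio transcript]\n" + audio_text.strip())
    return "\n\n".join(sections)


print(combine_labeled_context(audio_text=" hello "))
# [Audio transcript]
# hello
```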

#### Defining the Prompt

In [10]:
def prepare_llava_inputs(processor, prompt, image=None, context_text=None):
    # Build text
    full_text = ""

    if context_text:
        full_text += context_text.strip() + "\n\n"

    full_text += prompt.strip()

    # Multimodal path
    if image is not None:
        return processor(
            text=full_text,
            images=image,
            return_tensors="pt"
        )

    # Text-only path
    return processor(
        text=full_text,
        return_tensors="pt"
    )
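
The model card for `llava-hf/llava-1.5-7b-hf` describes a conversation template of the form `USER: <image>\n<prompt> ASSISTANT:`. The helper below sketches how a prompt could be wrapped in that template. The function name, its arguments, and the `Context:` label are hypothetical, and the template should be checked against the model card for the checkpoint in use.

```python
def build_llava15_prompt(question, context_text=None, has_image=True):
    """Wrap a question and optional context in the LLaVA-1.5 chat template."""
    parts = []
    if has_image:
        # Placeholder that the processor replaces with image features.
        parts.append("<image>")
    if context_text and context_text.strip():
        parts.append("Context:\n" + context_text.strip())
    parts.append(question.strip())
    return "USER: " + "\n".join(parts) + " ASSISTANT:"


print(build_llava15_prompt("What is the boy doing?"))
# USER: <image>
# What is the boy doing? ASSISTANT:
```

Ending the prompt with `ASSISTANT:` signals where the answer should begin, which tends to work better with instruction-tuned checkpoints than a free-form `Answer:` marker.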

#### Data Ingestion

In [11]:
def ingest_inputs(pdf_path=None, image_path=None, audio_path=None):
    """
    Central entry point for all modalities.
    All inputs are OPTIONAL.
    """
    return {
        "pdf": pdf_path,
        "image": image_path,
        "audio": audio_path
    }
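
Since every input is optional, a mistyped path only surfaces later, inside PyPDF2, PIL, or Whisper. A check right after ingestion can fail earlier with a clearer message. The helper below is a sketch and its name is hypothetical.

```python
from pathlib import Path


def validate_inputs(inputs):
    """Raise FileNotFoundError if a supplied path does not exist.

    Entries set to None are treated as "modality not provided" and skipped.
    """
    missing = [
        f"{name}: {path}"
        for name, path in inputs.items()
        if path is not None and not Path(path).is_file()
    ]
    if missing:
        raise FileNotFoundError("Input file(s) not found: " + ", ".join(missing))
    return inputs
```

It could be applied as `inputs = validate_inputs(ingest_inputs(pdf_path, image_path, audio_path))`.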


In [15]:
# ================= USER INPUTS =================
pdf_path = None                       # no PDF in this run; set to a path to include one
#pdf_path = "boy_playing_ball.pdf"
image_path = "Boyandball.png"         # or None
audio_path = "carnatic.wav"           # or None

# ===============INGESTING INPUTS ===============

inputs = ingest_inputs(pdf_path, image_path, audio_path)
print(inputs)
pdf_text = extract_pdf_text(inputs["pdf"]) if inputs["pdf"] else None
image = preprocess_image(inputs["image"])
audio_text = preprocess_audio(inputs["audio"])

print("audiotext")
print(audio_text)
# ================= BUILDING PROMPT  =================

context_text = combine_text_context(pdf_text, audio_text)

prompt = f"""
<image>

Document excerpt:
\"\"\"
{pdf_text}
\"\"\"

Document excerpt plus audio context :
\"\"\"
{context_text}
\"\"\"

Instructions:
You are a strict multimodal analyst.
Instructions:
- Use ONLY information that is explicitly visible in the image and explicitly stated in the PDF.
- Do NOT infer, assume, guess, or use prior knowledge.
- If a detail is not visible or mentioned, respond with exactly: "Not present in the input."
- Do NOT fill gaps with common sense.
- Every claim must be supported by evidence from the image or PDF or Audio

Question:
What is image telling about boy or play? Answer in full sentence.
What is the context text telling about boy or play? Answer in full sentence.
Are they related in the context of boy and play? Answer in full sentence. Explain in detail . 

Answer: 
"""
# ===========================================================

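# Note: prepare_llava_inputs prepends context_text, and the prompt above already
# embeds it, so the context reaches the model twice (visible in the output below).
llava_inputs = prepare_llava_inputs(
    processor=processor,
    prompt=prompt,
    image=image,
    context_text=context_text
)
# max_new_tokens=50 is a tight budget for three detailed answers; raise it if the answer is cut off.
output = model.generate(
    **llava_inputs,
    max_new_tokens=50,
    do_sample=False
)

# output[0] holds the prompt tokens followed by the generated tokens,
# so the decoded string starts with the full prompt.
response = processor.decode(output[0], skip_special_tokens=True)
print("Printing response..........")
print(response)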


{'pdf': None, 'image': 'Boyandball.png', 'audio': 'carnatic.wav'}
audiotext
 Sharan Agata Rakshamam Hey Karuna Nide Pahimaam Sharan Agata Rakshamam
Printing response..........
Sharan Agata Rakshamam Hey Karuna Nide Pahimaam Sharan Agata Rakshamam

 

Document excerpt:
"""
None
"""

Document excerpt plus audio context :
"""
Sharan Agata Rakshamam Hey Karuna Nide Pahimaam Sharan Agata Rakshamam
"""

Instructions:
You are a strict multimodal analyst.
Instructions:
- Use ONLY information that is explicitly visible in the image and explicitly stated in the PDF.
- Do NOT infer, assume, guess, or use prior knowledge.
- If a detail is not visible or mentioned, respond with exactly: "Not present in the input."
- Do NOT fill gaps with common sense.
- Every claim must be supported by evidence from the image or PDF or Audio

Question:
What is image telling about boy or play? Answer in full sentence.
What is the context text telling about boy or play? Answer in full sentence.
Are they related in th