# Vision Fine-tuning - Maths OCR

- https://docs.unsloth.ai/basics/vision-fine-tuning
- https://colab.research.google.com/drive/1whHb54GNZMrNxIsi2wm2EY_-Pvo2QyKh?usp=sharing

In [1]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
from transformers import TextStreamer

from datasets import load_dataset

# To render latex in jupyter
from IPython.display import display, Math, Latex

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Load the model

* We support Llama 3.2 Vision 11B, 90B; Pixtral; Qwen2VL 2B, 7B, 72B; and any Llava variant like Llava NeXT!
* We support 16bit LoRA via `load_in_4bit=False` or 4bit QLoRA. Both are accelerated and use much less memory!

In [None]:
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
model, tokenizer = FastVisionModel.from_pretrained(
    # More models at https://huggingface.co/unsloth
    "unsloth/Qwen2-VL-7B-Instruct",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
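
To get a feel for why 4bit loading helps with memory, here is a rough back-of-the-envelope sketch. It counts the weights only and ignores activations, gradients, optimizer state and the small overhead of quantization constants. The parameter count of `7e9` is an assumed round figure for a "7B" model, not an exact number, and `weight_memory_gb` is a hypothetical helper, not part of Unsloth.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# ASSUMPTION: 7e9 is a round figure for a "7B" model, not an exact count.
num_params = 7e9

mem_16bit = weight_memory_gb(num_params, 16)
mem_4bit = weight_memory_gb(num_params, 4)

print(f"16bit weights: ~{mem_16bit:.1f} GB")
print(f"4bit weights:  ~{mem_4bit:.1f} GB")
```

Actual usage during training will be higher than these figures, since they cover only the stored weights.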

We now add LoRA adapters for parameter-efficient finetuning, so we only need to train about 1% of all parameters.

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended: alpha at least equal to r
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
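
As a rough illustration of why LoRA trains so few parameters: for a single linear layer with a `d_in x d_out` weight matrix, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below works through that arithmetic. The layer size of 4096 is a hypothetical value chosen for illustration, not a dimension read from the model, and the helper names are made up for this example.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by a rank-r LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix (bias ignored)."""
    return d_in * d_out

# ASSUMPTION: a hypothetical 4096 x 4096 projection, for illustration only.
d_in, d_out, r = 4096, 4096, 16

added = lora_params(d_in, d_out, r)
full = dense_params(d_in, d_out)
fraction = added / full

print(f"LoRA adds {added:,} trainable parameters next to {full:,} frozen ones")
print(f"Trainable fraction for this layer: {fraction:.2%}")  # 0.78%
```

The overall fraction for a real model depends on which layers receive adapters and on their actual shapes, so treat this as an order-of-magnitude check.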

## Load the data

We'll be using a sampled dataset of handwritten maths formulas. The goal is to convert these images into a computer-readable form, i.e. LaTeX, so we can render them. This can be very useful for complex formulas.

You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).

In [None]:
dataset = load_dataset("unsloth/LaTeX_OCR", split="train")
dataset

Let's take a quick look at the dataset. We'll view the 3rd image and the caption that goes with it.

In [None]:
dataset[2]["image"]

In [None]:
# We can render the LaTeX in the browser directly!
latex = dataset[2]["text"]
display(Math(latex))

### Format the input for the model

To format the dataset, all vision finetuning tasks should be formatted as follows:

```python
[
    {
        "role": "user",
        "content": [
            {"type": "text",  "text": Q},
            {"type": "image", "image": image}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text",  "text": A}
        ]
    },
]
```

In [None]:
instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    conversation = [
        { 
            "role": "user",
            "content" : [
                {"type" : "text",  "text"  : instruction},
                {"type" : "image", "image" : sample["image"]}
            ]
        },
        {
            "role" : "assistant",
            "content" : [
                {"type" : "text",  "text"  : sample["text"]}
            ]
        },
    ]
    return { "messages" : conversation }

Let's convert the dataset into the "correct" format for finetuning:

In [None]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
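
Before training, it can be worth checking that every converted sample follows the message layout described above. The helper below is a minimal sketch of such a check. `is_valid_conversation` is a hypothetical name, and the example uses a placeholder string where the real data would hold an image object.

```python
def is_valid_conversation(sample: dict) -> bool:
    """Check for one user turn (text + image) followed by one assistant turn (text)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return False
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    user_types = {part.get("type") for part in user.get("content", [])}
    assistant_types = [part.get("type") for part in assistant.get("content", [])]
    return user_types == {"text", "image"} and assistant_types == ["text"]

# Stand-in sample: the string "<image>" is a placeholder for a real image object.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write the LaTeX representation for this image."},
                {"type": "image", "image": "<image>"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": r"x^2 + y^2 = z^2"}],
        },
    ]
}

print(is_valid_conversation(example))
```

In the notebook, the same check could be run over the whole list with `all(is_valid_conversation(s) for s in converted_dataset)`.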

## Inference before finetuning

Before we do any finetuning, let's see what the model outputs for the third example (index 2), the same one we looked at above!

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

In [None]:
image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]
    }
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

In [None]:
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
    **inputs, streamer=text_streamer, max_new_tokens = 128,
    use_cache=True, temperature=0.1, min_p=0.1
)
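
To put a number on how close a generated string is to the reference LaTeX, one crude option is a character-level edit distance. The sketch below is an assumption on our part rather than anything the notebook or Unsloth provides: `levenshtein` and `normalized_similarity` are hypothetical helpers, and the two strings are made-up examples, not model output.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_similarity(pred: str, target: str) -> float:
    """1.0 for identical strings (after collapsing whitespace), lower otherwise."""
    pred = " ".join(pred.split())
    target = " ".join(target.split())
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(pred, target) / longest

# Made-up strings for illustration; in the notebook, `target` could be
# dataset[2]["text"] and `pred` the decoded model output.
pred = r"\frac{a}{b}"
target = r"\frac{a}{c}"
print(f"similarity: {normalized_similarity(pred, target):.3f}")
```

Character-level similarity does not know that two different LaTeX strings can render identically, so it is best read as a rough signal alongside a visual check with `display(Math(...))`.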