Before running:
1. Install the required libraries before proceeding. On Colab, you can run the installation cell below to install all dependencies.
2. On a local machine, ensure that torch, transformers, trl, bitsandbytes, accelerate, peft, datasets, huggingface_hub, and hf_transfer are installed.
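
On a local machine, a quick way to see which of the listed libraries are absent is sketched below. The helper name `missing_packages` is hypothetical, and the sketch assumes that each library's import name matches the name in the list above.

```python
import importlib.util

# Names taken from the list above; assumed to be importable under the same names.
REQUIRED = [
    "torch", "transformers", "trl", "bitsandbytes", "accelerate",
    "peft", "datasets", "huggingface_hub", "hf_transfer",
]

def missing_packages(names):
    """Return the names that cannot be found as top-level importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name that is printed would still need to be installed with pip before running the rest of the notebook.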

Instructions for running:
1. The notebook is written for Colab. It assumes that the reasoning-trace data (the source of the chosen responses) and the SFT model are stored under 'PROJECT_DIR' = '/content/drive/MyDrive/cs776-project', in the subdirectories referenced by the data_dir and model_id variables.
2. It loads the SFT model and generates the DPO training data: the rejected responses are generated by the SFT model, and the Gemini reasoning traces are used as the chosen responses.
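
Before running the heavier cells, it may help to confirm that the expected subdirectories are actually present. The following is a minimal sketch; the helper `missing_paths` is hypothetical, and the two subdirectory names are the ones this notebook uses for the small model and the reasoning-trace dataset.

```python
import os

def missing_paths(project_dir, subdirs):
    """Return the full paths under project_dir that are not existing directories."""
    paths = [os.path.join(project_dir, name) for name in subdirs]
    return [path for path in paths if not os.path.isdir(path)]

PROJECT_DIR = "/content/drive/MyDrive/cs776-project"
# Assumed layout: the SFT checkpoint and the reasoning-trace dataset used below.
EXPECTED = ["SFT_256M_smolcot_ep2_high_lr", "new_train_smolcot_hf_1700"]

for path in missing_paths(PROJECT_DIR, EXPECTED):
    print("Missing directory:", path)
```

On Colab, this check is only meaningful after Google Drive has been mounted.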

In [None]:
from google.colab import drive
drive.mount('/content/drive')
PROJECT_DIR = '/content/drive/MyDrive/cs776-project'

image_dir = PROJECT_DIR + "/filtered_images"
train_data_file = PROJECT_DIR + "/train_cot_updated.json"

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
import torch

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(DEVICE)

cuda


In [None]:
# from unsloth.trainer import UnslothVisionDataCollator
from peft import get_peft_model, LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor
from transformers.image_utils import load_image
from datasets import features, load_dataset
from trl import DPOConfig, DPOTrainer
import gc
from transformers import pipeline

In [None]:
SMALL = True

Load the SFT model and the processor, which tokenizes the text and preprocesses the images.

In [None]:
model_id = None
if SMALL:
  model_id = PROJECT_DIR + '/SFT_256M_smolcot_ep2_high_lr'
  model_processor = "HuggingFaceTB/SmolVLM-256M-Instruct"
else:
  # model_id = "HuggingFaceTB/SmolVLM-500M-Instruct"
  model_id = PROJECT_DIR + '/SFT_500M_updated_checkpoint'
  model_processor = "HuggingFaceTB/SmolVLM-500M-Instruct"

model_ref = AutoModelForVision2Seq.from_pretrained(model_id).to(DEVICE)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(DEVICE)
processor = AutoProcessor.from_pretrained(model_processor, do_image_splitting=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Helper: load a PIL image from a URL

from PIL import Image
import requests
from io import BytesIO

def load_image_from_url(url):
    """Loads an image from a URL using PIL."""
    try:
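        # The full body is read via response.content, so streaming is unnecessary;
        # a timeout prevents the cell from hanging on an unresponsive server.
        response = requests.get(url, timeout=30)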
        response.raise_for_status()  # Raise an exception for bad status codes
        image = Image.open(BytesIO(response.content)).convert('RGB')
        return image
    except requests.exceptions.RequestException as e:
        print(f"Error downloading image: {e}")
        return None
    except Exception as e:
        print(f"Error loading image: {e}")
        return None

url = "https://farm7.staticflickr.com/6155/6179447413_d60cf99f28_z.jpg"
image = load_image_from_url(url)
# image

Run a quick generation test to confirm that the model and the processor work correctly so far.

In [None]:
# Sanity check before DPO training.
# The prompts can be modified to test performance on the actual reasoning task.


image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

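# Each prompt contains a single image placeholder (although the question
# refers to two images), so each of the three prompts receives one image.
prompts = [prompt, prompt, prompt]
inputs = processor(text=prompts, images=[image1, image2, image1], return_tensors="pt")
inputs = inputs.to(DEVICE)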
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

User:Can you describe the two images?
Assistant: The first image shows a statue of the Statue of Liberty, while the second image shows a boat on the water.


In [None]:
print(generated_texts[2])

User:Can you describe the two images?
Assistant: The first image shows a statue of the Statue of Liberty, while the second image shows a boat on the water.
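
As the outputs above show, the decoded text contains the prompt as well as the model's reply. If only the reply is needed, it can be separated with plain string handling. The sketch below assumes the "Assistant:" marker seen in the output above; the helper name `extract_assistant_reply` is hypothetical.

```python
def extract_assistant_reply(decoded, marker="Assistant:"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()

sample = "User:Can you describe the two images?\nAssistant: The first image shows a statue."
print(extract_assistant_reply(sample))
```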


In [None]:
!ls /content/drive/MyDrive/cs776-project

ChartQADataset.zip		      README.gdoc
DPO_dataset			      SFT_256M_smolcot_ep2_high_lr
DPO_dataset_500M_SFT_manual_download  SFT_500M_smolcot_on_3k
filtered_data_4k.json		      SFT_500M_updated_checkpoint
filtered_data_5k.json		      SFT_checkpoint
filtered_data.json		      SFT_updated_checkpoint
filtered_images			      system_prompt.md
filtered_images_4k		      test_data.json
filtered_images_5k		      tiny_system_prompt.md.gdoc
GRPO				      train_cot_traces.json
hf_data_5k			      train_cot_updated.json
hf_data_updated			      train_data.json
hf_data_version			      training_args.bin
new_train_smolcot_hf_1700	      train_smolcot_3k_p1.json
Project-Lit-review-CS776.gdoc	      train_traces_5k.json


Load the reasoning traces data

In [None]:
data_dir = "/content/drive/MyDrive/cs776-project/new_train_smolcot_hf_1700"

In [None]:
import datasets
dataset = datasets.load_from_disk(data_dir)
print(len(dataset))

1700


In [None]:
print(dataset[0]['messages'][0]['content'][0])

{'index': None, 'text': "What is the ratio of the total of 'Very' to 'Somewhat'?", 'type': 'text'}


In [None]:
dataset

Dataset({
    features: ['images', 'messages'],
    num_rows: 1700
})

Create a dataset in the format expected by the DPO trainer: each row must contain the images, the prompt, a chosen answer, and a rejected answer, each in its own column. The rejected answer is generated by the SFT model, whereas Gemini's reasoning trace is used as the chosen answer.
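
To make the row layout concrete, the sketch below checks a single row against the four columns described above. The helper `validate_dpo_row` is hypothetical and not part of TRL. It assumes that the prompt, chosen, and rejected columns hold chat-templated strings, as they do in this notebook, and it also flags rows whose chosen and rejected answers are identical, since such a pair carries no preference signal.

```python
REQUIRED_COLUMNS = ("images", "prompt", "chosen", "rejected")

def validate_dpo_row(row):
    """Return a list of problems found in one preference row (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    for c in ("prompt", "chosen", "rejected"):
        if c in row and not isinstance(row[c], str):
            problems.append(f"{c} should be a string")
    if "chosen" in row and row.get("chosen") == row.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems

# Placeholder values for illustration only; real rows hold PIL images and templated text.
example_row = {"images": [], "prompt": "question", "chosen": "answer A", "rejected": "answer B"}
print(validate_dpo_row(example_row))
```

Running such a check over a handful of rows after the mapping step is a cheap way to catch formatting mistakes before training starts.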

In [None]:
pipe = pipeline(task="image-text-to-text", model=model, processor = processor, torch_dtype=torch.float16)
system_prompt = """Your task is to answer question based on the attached image.
Use this format to answer the question: <think>[Reasoning steps]</think> <answer>[Concise answer]</answer>.
Put your thinking inside <think> tags and then a concise(single word/phrase/numeric) answer inside <answer> tags.
Question:\n
"""

def format(example):
    # Prepare the input for the chat template
    user_prompt = example['messages'][0]['content'][0]['text']

    prompt = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": system_prompt + user_prompt}],
        },
    ]
    chosen = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example['messages'][1]['content'][0]['text']}],
        },
    ]

    # Generate the rejected response with the SFT model
    model_out = pipe(text=prompt, images=list(example["images"]), return_full_text=False, max_new_tokens=2048)

    rejected = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": model_out[0]['generated_text']}],
        },
    ]
    # Apply the chat template
    prompt = processor.apply_chat_template(prompt, tokenize=False)
    # print(prompt)
    chosen = processor.apply_chat_template(chosen, tokenize=False)
    rejected = processor.apply_chat_template(rejected, tokenize=False)
    # Resize the first image in place so that it fits within the maximum
    # allowable size of the processor, to prevent OOM errors.
    max_size = processor.image_processor.size["longest_edge"]
    example["images"][0].thumbnail((max_size, max_size))
    return {"images": list(example["images"]), "prompt": prompt, "chosen": chosen, "rejected": rejected}


Device set to use cuda:0
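
The system prompt above asks for responses in the form <think>...</think> <answer>...</answer>. When inspecting the generated rejected responses, it can be useful to check whether they follow this format. The following is a minimal sketch using a regular expression; the helper name `parse_response` is hypothetical, and the pattern assumes at most whitespace between the two tagged parts.

```python
import re

# Matches the tagged format requested by the system prompt.
TAGGED = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def parse_response(text):
    """Return (reasoning, answer) if text follows the tagged format, else None."""
    match = TAGGED.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

print(parse_response("<think>2 + 2 = 4</think> <answer>4</answer>"))
print(parse_response("free-form text without tags"))
```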


In [None]:
dataset = dataset.map(format, remove_columns=dataset.column_names)

f = dataset.features
f["images"] = features.Sequence(features.Image(decode=True))  # to avoid bytes
dataset = dataset.cast(f)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

Save the mapped dataset to Drive for use in DPO training.

In [None]:
dataset.save_to_disk(PROJECT_DIR + '/DPO_dataset_256M_SFT_smolcot_ep2_high_lr')  # new directory name, to avoid overwriting previously saved data
