# InternVL

To run in Google Colab: <a target="_blank" href="https://colab.research.google.com/github/Jeon0001/ImageSynthPipeline/blob/main/internVL-eval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a> [**Recommended**] (The model used in this notebook, `InternVL2_5-26B-AWQ`, uses about 38 GB of GPU memory and has been tested on an A100 GPU. Google Colab Pro is needed.)

## 1. Environment Setup

Install dependencies and import modules.

In [None]:
!pip install -q lmdeploy

In [None]:
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
import numpy as np
import os
import csv
import pandas as pd
from google.colab import drive
import nest_asyncio

nest_asyncio.apply()
os.makedirs("responses", exist_ok=True)
drive.mount('/content/drive')

Define functions.

In [4]:
def process_images_in_batch(image_folder, prompt, pipe, batch_size=4, verbose=True):
    responses = []
    image_files = [f for f in os.listdir(image_folder) if f.endswith(('.png', '.jpg', '.jpeg'))]

    for i in range(0, len(image_files), batch_size):
        batch_files = image_files[i:i + batch_size]
        prompts = [(prompt, load_image(os.path.join(image_folder, file))) for file in batch_files]

        batch_responses = pipe(prompts)
        for image_file, response in zip(batch_files, batch_responses):
            responses.append({"image_file": image_file, "response": response.text})

            if verbose:
                print(f'Image File: {image_file} | Response: {response.text}')

    return responses


def save_responses(responses, image_folder, csv_file_path, first_write=True, verbose=False):
    ### The file needs to exist if it's not the first write
    if not first_write and not os.path.exists(csv_file_path):
        print("Please provide a valid CSV file path.")
        return

    ### determining original_country and synthesized_race automatically from the folder name
    possible_countries = ['Korea', 'UK', 'Myanmar', 'Azerbaijan']
    possible_synthesized_races = ['Asian', 'Indian', 'Black', 'White', 'Caucasian']

    original_country = [country for country in possible_countries if country in image_folder][0]

    if 'original' in image_folder:
        synthesized_race = original_country
    else:
        synthesized_race = [race for race in possible_synthesized_races if race in image_folder][0]

    ### saving into the csv file
    with open(csv_file_path, mode='a', newline='', encoding='utf-8') as csvfile:
        csv_writer = csv.writer(csvfile)

        # Write the header if it's the first write
        if first_write:
            csv_writer.writerow(['original_country', 'synthesized_race', 'image_file_name', 'response'])

        # Write each response
        for response in responses:
            # original_country and synthesized_race were derived from the folder name above
            csv_writer.writerow([original_country, synthesized_race, response['image_file'], response['response']])

            if verbose: print(f"Filename: {response['image_file']} | Response: {response['response']}")

        print(f"Data saved to: {csv_file_path}")

    ### sort the rows in the .csv file by file index
    df = pd.read_csv(csv_file_path)
    df["index"] = df["image_file_name"].apply(lambda file_name: int(file_name.split("_")[-1].split(".")[0]))
    df_sorted = df.sort_values(by=["synthesized_race", "index"], ascending=[True, True])
    df_sorted = df_sorted.drop(columns=["index"])
    df_sorted.to_csv(csv_file_path, index=False)
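
`save_responses` infers its labels from substrings of the folder path and sorts rows by the numeric suffix of each file name. The sketch below isolates those two steps so they can be checked without a GPU. `parse_labels` and `file_index` are hypothetical helper names, not part of the notebook, and the explicit `ValueError` is an added assumption (the notebook code would raise an `IndexError` on an unknown folder instead).

```python
# Label vocabularies copied from save_responses.
POSSIBLE_COUNTRIES = ['Korea', 'UK', 'Myanmar', 'Azerbaijan']
POSSIBLE_RACES = ['Asian', 'Indian', 'Black', 'White', 'Caucasian']


def parse_labels(image_folder):
    """Hypothetical helper: return (original_country, synthesized_race) for a folder path."""
    countries = [c for c in POSSIBLE_COUNTRIES if c in image_folder]
    if not countries:
        raise ValueError(f"No known country in folder path: {image_folder!r}")
    country = countries[0]

    # Original images are labelled with their own country, as in save_responses.
    if 'original' in image_folder:
        return country, country

    races = [r for r in POSSIBLE_RACES if r in image_folder]
    if not races:
        raise ValueError(f"No known synthesized race in folder path: {image_folder!r}")
    return country, races[0]


def file_index(file_name):
    """Hypothetical helper: numeric suffix of names such as 'Myanmar_food_18.png'."""
    return int(file_name.split("_")[-1].split(".")[0])


print(parse_labels("drive/MyDrive/Team VQA - Datasets/Myanmar_Clothes/synthesized_images/Black"))
print(sorted(["Myanmar_food_10.png", "Myanmar_food_2.png"], key=file_index))
```

Sorting on the integer index rather than the raw file name keeps `..._2.png` ahead of `..._10.png`, which a plain string sort would not.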

## 2. Batch Inference

Load a model from the `InternVL2.5` series. The available models are listed in [this Hugging Face collection](https://huggingface.co/collections/OpenGVLab/internvl25-673e1019b66e2218f68d7c1c).

In [None]:
model = 'OpenGVLab/InternVL2_5-26B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=1))

Run inference for the original images, then for the synthesized images.

In [None]:
# prompt = "Which country is the food in the photo mostly associated with? What is the food?" # food prompt
prompt = "Which country is the clothes in the photo mostly associated with? Which visual cues did you use to determine it?" # clothes prompt
batch_size = 5
csv_file_path = "responses/Myanmar_Original_Clothes_Results.csv"

### For original images
image_folder = "drive/MyDrive/Team VQA - Datasets/Myanmar_Clothes/original_images/"
responses = process_images_in_batch(image_folder, prompt, pipe, batch_size)
save_responses(responses, image_folder, csv_file_path, first_write=True)

### For synthesized images
races = ['Asian', 'Black', 'White', 'Indian']
for race in races:
  image_folder = f"drive/MyDrive/Team VQA - Datasets/Myanmar_Clothes/synthesized_images/{race}"
  responses = process_images_in_batch(image_folder, prompt, pipe, batch_size)
  # The header row was already written with the original images, so only append here;
  # a second header would end up as a data row and break the index sort in save_responses
  save_responses(responses, image_folder, csv_file_path, first_write=False)
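
Because `save_responses` opens the CSV in append mode, the header row must be written exactly once per file, including across notebook re-runs. A minimal sketch of one way to make that automatic follows, assuming a hypothetical helper `append_rows` (not part of the notebook) that decides from the file itself rather than from a `first_write` flag.

```python
import csv
import os

# Column names copied from save_responses.
HEADER = ['original_country', 'synthesized_race', 'image_file_name', 'response']


def append_rows(csv_file_path, rows):
    """Hypothetical helper: append rows, writing the header only for a new or empty file."""
    needs_header = (not os.path.exists(csv_file_path)
                    or os.path.getsize(csv_file_path) == 0)
    with open(csv_file_path, mode='a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        if needs_header:
            writer.writerow(HEADER)
        writer.writerows(rows)
```

With this approach, calling the helper once for the original images and once per synthesized race yields a single header followed by all rows, regardless of call order.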

Zip the responses directory and download.

In [None]:
!zip -r /content/responses.zip /content/responses
from google.colab import files
files.download("/content/responses.zip")

# Old way

Import necessary modules.

In [None]:
import numpy as np
import os
import csv
import pandas as pd
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from google.colab import drive

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tile grids (columns, rows) containing min_num..max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the candidate grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
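
To see what the dynamic tiling above does for a concrete image size, the sketch below re-implements only the grid selection and crop-box arithmetic in plain Python, with no PIL or torch. The names `candidate_grids`, `pick_grid`, and `tile_boxes` are hypothetical, and the secondary sort key is an added assumption to make tie ordering deterministic.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {(cols, rows)
             for cols in range(1, max_num + 1)
             for rows in range(1, max_num + 1)
             if min_num <= cols * rows <= max_num}
    # Smallest grids first; the secondary key keeps the order deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def pick_grid(width, height, image_size=448, max_num=12):
    """Mirror of find_closest_aspect_ratio: closest aspect ratio, larger grid on ties
    when the image has enough pixels to fill it."""
    aspect_ratio = width / height
    best_diff, best = float('inf'), (1, 1)
    for cols, rows in candidate_grids(max_num=max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            best = (cols, rows)
    return best


def tile_boxes(grid, image_size=448):
    """Crop boxes (left, top, right, bottom) in row-major order, as in dynamic_preprocess."""
    cols, rows = grid
    return [((i % cols) * image_size, (i // cols) * image_size,
             (i % cols + 1) * image_size, (i // cols + 1) * image_size)
            for i in range(cols * rows)]


grid = pick_grid(800, 600)
boxes = tile_boxes(grid)
print(grid, len(boxes), boxes[0], boxes[-1])
# (4, 3) 12 (0, 0, 448, 448) (1344, 896, 1792, 1344)
```

An 800x600 image is therefore resized to 1792x1344 and cut into 12 tiles; with `use_thumbnail=True`, `load_image` appends one more 448x448 thumbnail, giving 13 tiles in total.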

### Load model and process in batch

In [None]:
path = 'OpenGVLab/InternVL2_5-2B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

In [None]:
def process_images_in_batch(image_folder, prompt, model, verbose=True):
    # Note: relies on the global `tokenizer` loaded together with the model above
    responses = []
    generation_config = dict(max_new_tokens=1024, do_sample=False)

    for filename in os.listdir(image_folder):
        if not filename.endswith(('.png', '.jpg', '.jpeg')):
            print(f"Skipping {filename}: the file name must end with .png, .jpg, or .jpeg")
            continue

        filepath = os.path.join(image_folder, filename)

        # set the max number of tiles in `max_num`
        pixel_values = load_image(filepath, max_num=12).to(torch.bfloat16).cuda()

        question = '<image>\n' + prompt
        response = model.chat(tokenizer, pixel_values, question, generation_config)
        responses.append({"image_file": filename, "response": response})

        if verbose:
            print(f'Image File: {filename} | Response: {response}')

    return responses


def save_responses(responses, image_folder, csv_file_path, first_write=True, verbose=False):
    ### The file needs to exist if it's not the first write
    if not first_write and not os.path.exists(csv_file_path):
        print("Please provide a valid CSV file path.")
        return

    ### determining original_country and synthesized_race automatically from the folder name
    possible_countries = ['Korea', 'UK', 'Myanmar', 'Azerbaijan']
    possible_synthesized_races = ['Asian', 'Indian', 'Black', 'White', 'Caucasian']

    original_country = [country for country in possible_countries if country in image_folder][0]

    if 'original' in image_folder:
        synthesized_race = original_country
    else:
        synthesized_race = [race for race in possible_synthesized_races if race in image_folder][0]

    ### saving into the csv file
    with open(csv_file_path, mode='a', newline='', encoding='utf-8') as csvfile:
        csv_writer = csv.writer(csvfile)

        # Write the header if it's the first write
        if first_write:
            csv_writer.writerow(['original_country', 'synthesized_race', 'image_file_name', 'response'])

        # Write each response
        for response in responses:
            # original_country and synthesized_race were derived from the folder name above
            csv_writer.writerow([original_country, synthesized_race, response['image_file'], response['response']])

            if verbose: print(f"Filename: {response['image_file']} | Response: {response['response']}")

        print(f"Data saved to: {csv_file_path}")

    ### sort the rows in the .csv file by file index
    df = pd.read_csv(csv_file_path)
    df["index"] = df["image_file_name"].apply(lambda file_name: int(file_name.split("_")[-1].split(".")[0]))
    df_sorted = df.sort_values(by=["synthesized_race", "index"], ascending=[True, True])
    df_sorted = df_sorted.drop(columns=["index"])
    df_sorted.to_csv(csv_file_path, index=False)

In [None]:
image_folder = "drive/MyDrive/Team VQA - Datasets/Myanmar_Food/original_images/"
prompt = "Which country is the clothing in the photo mostly associated with? Which visual cues did you use to determine it?"
responses = process_images_in_batch(image_folder, prompt, model)

csv_file_path = "Myanmar_UK_Clothes_Results.csv"
save_responses(responses, image_folder, csv_file_path, first_write=True)

Image File: Myanmar_food_18.png | Response: The clothing in the photo is mostly associated with Malaysia. Here are some visual cues that led to this determination:

1. **Language and Text**: The text on the shirt reads "Malaysia," which is a clear indicator of the country of origin.

2. **Cultural Elements**: The style of the shirt, particularly the round glasses and the black color, are often associated with Malaysian fashion.

3. **Design and Patterns**: The shirt features a design that is commonly seen in Malaysian fashion, which often includes bold patterns and colors.

These elements combined strongly suggest that the clothing is associated with Malaysia.
Image File: Myanmar_food_15.png | Response: The clothing in the photo appears to be associated with Vietnam. Here are some visual cues that led to this conclusion:

1. **Color Scheme**: The clothing features a mix of earthy tones, such as browns and beiges, which are common in traditional Vietnamese attire. The pattern and style 

### Individual Processing

In [None]:
# set the max number of tiles in `max_num`
pixel_values = load_image('Myanmar_food_3.png', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)


# single-image, single-round conversation
question = '<image>\nPlease describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

User: <image>
Please describe the image in detail.
Assistant: The image shows a person sitting at a table with a plate of food in front of them. The person is wearing a black shirt with a blue and white leaf pattern. They are holding a fork and appear to be about to eat a dish that includes noodles, possibly with a curry or sauce, garnished with green herbs. 

On the table, there is a large bowl of a light brown liquid, likely a soup or broth. In front of the plate, there is a small bowl of a reddish-brown sauce, possibly a spicy condiment. There is also a plate with a portion of white noodles, and some fresh green herbs, possibly cilantro, are visible. The background is a pink brick wall.
