<a href="https://colab.research.google.com/github/evalevanto/Indaba-2024-GeoAI-Challenge/blob/main/bootstrap_geoai_challenge_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Starter Code

- [x] Load and run `quantized-4bit LLaVA 1.5 7B`.
- [x] Load a dataset with `train` and `test` splits using `datasets`.
- [x] Create a `pipeline` to run inference on a batch of our train data.
  - [ ] We need to benchmark inference as well.
  - [ ] This is also a good place to talk about prompt-tuning techniques.
- [ ] Define the evaluation metrics.
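For the open benchmarking item, a minimal timing sketch could look like the following. The helper name `benchmark` and the statistics it reports are assumptions for illustration, not part of the challenge code. It times any callable, so the pipeline call can be wrapped in a lambda.

```python
import statistics
import time


def benchmark(fn, inputs, warmup=1):
    """Time fn(x) for each input and return simple latency statistics (seconds)."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("benchmark needs at least one input")

    # Warm-up calls are not timed: the first calls often pay one-off setup costs.
    for x in inputs[:warmup]:
        fn(x)

    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies.append(time.perf_counter() - start)

    total = sum(latencies)
    return {
        "n": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "throughput_per_s": len(latencies) / total if total > 0 else float("inf"),
    }
```

Once the pipeline and prompt from the later sections exist, something like `benchmark(lambda img: pipe(img, prompt=prompt), images)` on a handful of training images should give a rough per-image latency.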



## 1. Load a VLM
Feel free to explore other VLMs.

Here we load a 4-bit quantized LLaVA 1.5 7B model.

- Awesome resources:
    - https://github.com/amrzv/awesome-colab-notebooks
    - https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

- Leaderboards:
    - https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

- Open Source Vision LLMs:
    - Paligemma: https://blog.roboflow.com/paligemma-multimodal-vision/
    - Zoo 1: https://github.com/salesforce/LAVIS
    - Zoo 2: https://github.com/InternLM/InternLM-XComposer
    - Zoo 3: https://github.com/OpenGVLab/InternVL
    - LLaVA: https://github.com/haotian-liu/LLaVA (1y ago)
    - OpenFlamingo: https://github.com/mlfoundations/open_flamingo (1y ago)
    - BLIP: https://github.com/salesforce/BLIP (2y ago)
    - OFA: https://github.com/OFA-Sys/OFA (2y ago)
    - GIT: https://github.com/microsoft/GenerativeImage2Text (2y ago)
    - DeepSeekVL: https://github.com/deepseek-ai/DeepSeek-VL

- Closed Vision LLMs:
    - OpenAI GPT-4 with Vision: https://openai.com/index/gpt-4-research/
    - Claude Vision: https://docs.anthropic.com/en/docs/build-with-claude/vision
    - Gemini Vision: https://cloud.google.com/blog/products/data-analytics/how-to-use-gemini-pro-vision-in-bigquery


In [None]:
# install packages
!pip install -q -U transformers==4.37.2
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
!pip install -q datasets
!pip install -q evaluate
!pip install -q vllm
!pip install -q scikit-learn

In [None]:
import torch
from transformers import BitsAndBytesConfig
from transformers import pipeline

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model_id = "llava-hf/llava-1.5-7b-hf"

## 2. Load dataset




In [None]:
from datasets import Image, load_dataset
from pathlib import Path

In [None]:
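# let's get the train dataset
data_path = 'africa_dataset'
# the folder must be passed as data_dir; the second positional argument is the config name
train_ = load_dataset("imagefolder", data_dir=data_path, split='train')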
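The `imagefolder` loader can infer splits from `train`/`test` directories and labels from the class subfolder names. Assuming the data is laid out as `africa_dataset/train/<class_name>/<image>` (an assumption, check your copy of the dataset), a quick sanity check of the class balance before loading might look like this. `count_images_per_class` and `IMAGE_SUFFIXES` are hypothetical names.

```python
from collections import Counter
from pathlib import Path

# Assumption: adjust to the file types actually present in the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root, split="train"):
    """Count image files in root/split/<class_name>/ for every class folder."""
    split_dir = Path(root) / split
    counts = Counter()
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
    return dict(counts)
```

Calling `count_images_per_class('africa_dataset')` returns a dict mapping each class folder name to its image count, which is handy when deciding how to average the metrics later.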

## 3. Run the model with a prepared prompt

### i. Use `pipelines`


In [None]:
# !pip install tqdm
from transformers.pipelines.pt_utils import KeyDataset
from tqdm import tqdm

In [None]:
pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})

In [None]:
# configs
max_new_tokens = 200
prompt = "USER: <image>\nGiven the following classes: {\"0\": \"Text is at the top of the image\", \"1\": \"Text is at the center of the image\", \"2\": \"Image has no text\"}.\nYour task is to analyse the image, first figure out if there is text. If there is text figure out the position. Finally return only the class key as an int. ASSISTANT: Class key: "
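When experimenting with prompt variants, it can help to build the prompt from a class dictionary instead of editing one long string. The sketch below reproduces the prompt wording used here. `build_prompt`, its `cue` parameter, and `prompt_v2` are hypothetical names introduced for illustration.

```python
import json


def build_prompt(classes, cue="Class key: "):
    """Build a LLaVA-1.5-style prompt that lists the classes as a JSON object."""
    instructions = (
        f"Given the following classes: {json.dumps(classes)}.\n"
        "Your task is to analyse the image, first figure out if there is text. "
        "If there is text figure out the position. "
        "Finally return only the class key as an int."
    )
    # The trailing cue nudges the model to answer with the key right away.
    return f"USER: <image>\n{instructions} ASSISTANT: {cue}"


classes = {
    "0": "Text is at the top of the image",
    "1": "Text is at the center of the image",
    "2": "Image has no text",
}
prompt_v2 = build_prompt(classes)
```

Changing the class descriptions or the cue then becomes a one-line edit, which makes it easier to compare prompt variants on the same batch of images.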

In [None]:
# post processing function
import re

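def process_result(output):
  """Extract the predicted class key from the generated text; return -1 if none is found."""
  # The prompt ends with "ASSISTANT: Class key: ", so the digit follows that cue
  # rather than coming directly after "ASSISTANT: ".
  match = re.search(r'ASSISTANT:\s*(?:Class key:\s*)?(\d+)', output)
  if match:
    return int(match.group(1))

  return -1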
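A slightly stricter parser can also reject keys that are not among the classes in the prompt. This is a sketch under the assumption that the generated text echoes the prompt (so the answer sits after the last `ASSISTANT:` tag). `parse_class_key` and `VALID_KEYS` are hypothetical names.

```python
import re

VALID_KEYS = {0, 1, 2}  # the class keys listed in the prompt


def parse_class_key(text, valid_keys=VALID_KEYS):
    """Return the first integer after the last 'ASSISTANT:' tag, or -1 if it is missing or invalid."""
    # Keep only what follows the last ASSISTANT: tag so digits in the prompt are ignored.
    # If the tag is absent, the whole text is searched.
    answer = text.rsplit("ASSISTANT:", 1)[-1]
    match = re.search(r"\d+", answer)
    if match and int(match.group()) in valid_keys:
        return int(match.group())
    return -1
```

Returning `-1` for out-of-range keys keeps failed generations visible in the evaluation instead of silently mapping them to a real class.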

In [None]:
# TODO: batch processing*
outputs = []

prepped_dataset = KeyDataset(train_, "image")
for out in tqdm(pipe(prepped_dataset, prompt=prompt, generate_kwargs={"max_new_tokens": max_new_tokens}), total=len(prepped_dataset)):
    outputs.append(process_result(out[0]['generated_text']))

train_dataset = train_.add_column('y_hat', outputs)
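For the batch-processing TODO, `transformers` pipelines accept a `batch_size` argument when called on a dataset. A minimal wrapper might look like the sketch below. `run_batched` is a hypothetical helper, and the default `batch_size=4` is a guess that will need tuning to the available GPU memory.

```python
def run_batched(pipe, dataset, prompt, postprocess, batch_size=4, max_new_tokens=200):
    """Run a pipeline over a dataset in batches and post-process each result."""
    results = []
    for out in pipe(
        dataset,
        prompt=prompt,
        batch_size=batch_size,
        generate_kwargs={"max_new_tokens": max_new_tokens},
    ):
        # Each item is a list with one dict per generated sequence.
        results.append(postprocess(out[0]["generated_text"]))
    return results
```

With the objects defined above, `run_batched(pipe, prepped_dataset, prompt, process_result)` would replace the loop. Batching does not always speed things up, so it is worth timing a few batch sizes. Since only a single class key is expected, a much smaller `max_new_tokens` may also be enough.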


## 4. Evaluate your model

We are interested in precision, recall, and F1.


In [None]:
import evaluate
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

targets = train_dataset['label']
predictions = train_dataset['y_hat']

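# The task has three classes, so the default average="binary" would raise an error.
# Macro averaging is one reasonable choice; "micro" or "weighted" are alternatives.
pr = precision.compute(predictions=predictions, references=targets, average="macro")
rc = recall.compute(predictions=predictions, references=targets, average="macro")
# Use a new name so the loaded f1 metric object is not overwritten.
f1_score = f1.compute(predictions=predictions, references=targets, average="macro")

print(f'Precision: {pr}\nRecall: {rc}\nF1: {f1_score}')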
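To make explicit what macro-averaged scores measure on a three-class task, here is a dependency-free sketch. `macro_scores` is a hypothetical helper, and treating an unparsed prediction (`-1`) as a miss for the true class is an assumption about how failures should be counted.

```python
def macro_scores(references, predictions, labels=(0, 1, 2)):
    """Macro-averaged precision, recall and F1 over the given labels."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(references, predictions))
    for label in labels:
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        # Undefined ratios (no predictions or no examples for a class) count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Each class contributes equally to the average regardless of how many images it has, so a rare class that the model never predicts pulls all three scores down.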