# LITA Checkpoint Conversion, Finetuning and Inference Tutorial

### Note:
Currently, this notebook must be run in a NeMo container (> 24.04). An example command to launch the container:

```
docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>
```
For inference and finetuning, you need to increase the shared memory size to avoid out-of-memory (OOM) issues. For example:
```
docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev
```

The `-v $PWD:/ws` option mounts the current local directory at `/ws/` inside the docker container. We can use this directory to hold the `NeMo` source code, the checkpoints, and the datasets we will generate.

# LITA Introduction

[LITA](https://arxiv.org/pdf/2403.19046) stands for Language Instructed Temporal-Localization Assistant. It demonstrates strong performance on the Reasoning Temporal Localization (RTL) task by introducing time tokens that help the LLM answer "When?" questions about a video. The figure below, from the [LITA paper](https://arxiv.org/pdf/2403.19046), gives a clear picture of how LITA works.

<img src="images/LITA_arch.png" alt="drawing" style="width:800px;"/>

# Tokenizer and Checkpoint Conversion
As described above, LITA introduces `time tokens`, so the timestamps of events in a video are represented as time tokens instead of the original floating-point timestamps. We therefore need to add these time tokens to the tokenizer of the backbone LLM. Since the backbone models (vision encoder and LLM) form a Hugging Face LLaVA-like model, we can convert them to a NeMo model with `convert_hf_llava_to_neva.py`, located at `NeMo/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py`, and then finetune in NeMo. In this tutorial, we use `Llama-3-VILA1.5-8B` to show how to integrate LITA into a LLaVA-like model. You may follow similar steps to convert other Llama models, or LLaVA-like models whose backbone LLM is Llama, such as [vicuna](https://huggingface.co/lmsys/vicuna-13b-v1.5) and [llava-v1.6-vicuna-13b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b).
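
To make the idea of time tokens concrete, here is a minimal sketch of how an absolute timestamp could be mapped to a relative time token and back. It uses the `<t{t}>` template and the 100 time tokens that this tutorial adds to the tokenizer. The helper names, the zero-based index, and the rounding rule are assumptions for illustration; the NeMo data pipeline may use a different convention.

```python
TIME_TOKEN_TEMPLATE = "<t{t}>"  # same template as used for the tokenizer in this tutorial
NUM_TIME_TOKENS = 100           # number of time tokens added in this tutorial


def timestamp_to_time_token(seconds, duration, num_time_tokens=NUM_TIME_TOKENS):
    """Map a timestamp (in seconds) to the nearest relative time token.

    Assumed convention: token 0 is the start of the video and
    token num_time_tokens - 1 is the end.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp to the video length, then scale to the token index range.
    fraction = min(max(seconds / duration, 0.0), 1.0)
    index = int(fraction * (num_time_tokens - 1) + 0.5)  # round half up
    return TIME_TOKEN_TEMPLATE.format(t=index)


def time_token_to_timestamp(token, duration, num_time_tokens=NUM_TIME_TOKENS):
    """Invert the mapping: recover an approximate timestamp from a time token."""
    index = int(token[len("<t"):-len(">")])
    return index / (num_time_tokens - 1) * duration


print(timestamp_to_time_token(15.0, 60.0))
print(time_token_to_timestamp("<t99>", 60.0))
```

Because the tokens encode a position relative to the video length, the same token refers to different absolute times in videos of different durations.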

Please download the huggingface `Llama-3-VILA1.5-8B` model.

In [None]:
# Each `!` line runs in its own subshell, so `cd` must be chained with the clone command.
! mkdir -p pretrained_models && cd pretrained_models && git clone https://huggingface.co/Efficient-Large-Model/Llama-3-VILA1.5-8B

## Tokenizer conversion
Here we show how to add 100 time tokens and the NeMo extra tokens to a Hugging Face tokenizer.
For the definition of the NeMo extra tokens, please refer to `NeMo/nemo/collections/multimodal/data/neva/conversation.py`.


In [None]:
# define the TIME_TOKEN_TEMPLATE
TIME_TOKEN_TEMPLATE = "<t{t}>"
hf_llm_model_path='/ws/pretrained_models/Llama-3-VILA1.5-8B/llm'
tokenizer_path = '/ws/converted_models/tokenizer/'

In [None]:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(hf_llm_model_path)
DEFAULT_IM_START_TOKEN = "<extra_id_4>" # mark the start of the slow token
DEFAULT_IM_END_TOKEN = "<extra_id_5>" # the end of the slow token
VID_START_TOKEN = "<extra_id_8>" # the start of the fast token
VID_END_TOKEN = "<extra_id_9>" # the end of the fast token
num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, VID_START_TOKEN, VID_END_TOKEN], special_tokens=True)
tokenizer.pad_token = tokenizer.eos_token  # use eos token as pad token
num_time_tokens = 100
time_tokens = [TIME_TOKEN_TEMPLATE.format(t=x) for x in range(num_time_tokens)]
num_new_tokens = tokenizer.add_tokens(time_tokens)
# add the other nemo extra tokens
extra_tokens = ["<extra_id_0>","<extra_id_1>","<extra_id_2>","<extra_id_3>","<extra_id_6>","<extra_id_7>"]
tokenizer.add_tokens(extra_tokens)
tokenizer.save_pretrained(tokenizer_path)
print(len(tokenizer))  # len() includes the added tokens; vocab_size of a Hugging Face tokenizer does not

You can check the saved tokenizer by loading it in NeMo:

In [None]:
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
tokenizer = get_nmt_tokenizer(library="huggingface", model_name=tokenizer_path)
print(tokenizer.vocab_size)
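
Beyond the vocabulary size, it can be useful to confirm that every new token actually ended up in the vocabulary. The following is a minimal sketch; `missing_tokens` is a hypothetical helper, and it only assumes that the tokenizer object exposes `get_vocab()`, as Hugging Face tokenizers do.

```python
def missing_tokens(tokenizer, expected_tokens):
    """Return the expected tokens that are absent from the tokenizer vocabulary.

    `tokenizer` is assumed to be a Hugging Face tokenizer (or any object
    with a `get_vocab()` method that returns a token -> id mapping).
    """
    vocab = tokenizer.get_vocab()
    return [token for token in expected_tokens if token not in vocab]


# Tokens this tutorial adds: 100 time tokens plus <extra_id_0> ... <extra_id_9>.
expected = [f"<t{i}>" for i in range(100)] + [f"<extra_id_{i}>" for i in range(10)]

# Usage (requires `transformers` and the tokenizer saved earlier):
# import transformers
# hf_tokenizer = transformers.AutoTokenizer.from_pretrained('/ws/converted_models/tokenizer/')
# print(missing_tokens(hf_tokenizer, expected))  # an empty list means all tokens are present
```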

Note: if you want to convert checkpoints trained with [LITA1.0](https://github.com/NVlabs/LITA), add the time tokens first and then add all the extra tokens, including `DEFAULT_IM_START_TOKEN` and `DEFAULT_IM_END_TOKEN`, after them.
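
The ordering matters because a Hugging Face tokenizer assigns ids to newly added tokens in the order they are added. A minimal sketch of building the token list in the LITA1.0-compatible order follows. The function name is hypothetical, and the relative order among the extra tokens themselves is an assumption; check it against the checkpoint you are converting.

```python
TIME_TOKEN_TEMPLATE = "<t{t}>"


def lita1_compatible_token_order(num_time_tokens=100, num_extra_ids=10):
    """Return new tokens with the time tokens first and the extra tokens after them.

    Assumption: the extra tokens are <extra_id_0> ... <extra_id_9> in
    numeric order; LITA1.0 checkpoints may expect a different order.
    """
    time_tokens = [TIME_TOKEN_TEMPLATE.format(t=i) for i in range(num_time_tokens)]
    extra_tokens = [f"<extra_id_{i}>" for i in range(num_extra_ids)]
    return time_tokens + extra_tokens


ordered_tokens = lita1_compatible_token_order()
print(ordered_tokens[:2], ordered_tokens[-2:])
```

The resulting list could then be passed to `tokenizer.add_tokens(...)` in place of the separate calls shown earlier.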

## Checkpoint Conversion
Now we convert the Hugging Face LLaVA or Llama model to a NeMo model. For a Llama model, please refer to `NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py`. For a LLaVA model, please refer to `NeMo/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py`.


In [None]:
! cd /opt/ && git clone https://github.com/haotian-liu/LLaVA   # we only need the model structure, no need to install
# Each `!` line runs in its own subshell, so `cd` and `export` do not carry over to later lines.
# Chain them with the conversion command, and run it from /ws, not from the `/opt` folder.
# For the config file, check /opt/NeMo/examples/multimodal/multimodal_llm/neva/conf/lita_config.yaml
! cd /ws && export PYTHONPATH=/opt/LLaVA/:$PYTHONPATH && \
python /opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py \
--in-file /ws/pretrained_models/Llama-3-VILA1.5-8B/llm \
--mm_vision_tower google/siglip-so400m-patch14-384 \
--mm_projector_ckpt_dir /ws/pretrained_models/Llama-3-VILA1.5-8B/mm_projector \
--out-file /ws/converted_models/Llama-3-VILA1.5-8B.nemo \
--tokenizer-model /ws/converted_models/tokenizer/ \
--config-file vita_config.yaml \
--conv-template llama_3

Note that `mm_vision_tower` and `mm_projector_ckpt_dir` are optional.
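
After the conversion finishes, a quick way to sanity-check the output is to list what the `.nemo` file contains. This is a minimal sketch that assumes the file is a tar archive, as `.nemo` checkpoints normally are; `list_nemo_archive` is a hypothetical helper, and the exact member names depend on the NeMo version.

```python
import tarfile


def list_nemo_archive(nemo_path, limit=20):
    """Return up to `limit` member names from a .nemo file (assumed to be a tar archive)."""
    with tarfile.open(nemo_path, "r:*") as archive:  # "r:*" auto-detects compression
        return archive.getnames()[:limit]


# Usage (path taken from the conversion command in this tutorial):
# for name in list_nemo_archive('/ws/converted_models/Llama-3-VILA1.5-8B.nemo'):
#     print(name)
```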