# Contrastive Language-Image Pretraining with SogCLR

### Introduction

In this tutorial, you will learn how to run contrastive language-image pretraining by optimizing the [Global Contrastive Loss](https://arxiv.org/abs/2202.12387) (GCL) on a subset of the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/) dataset. You will also learn how to evaluate the model on a retrieval task using the [MSCOCO](https://cocodataset.org/#home) dataset and on a zero-shot classification task using the [ImageNet](https://www.image-net.org/challenges/LSVRC/index.php) dataset. The code is based on the [iSogCLR](https://github.com/zhqiu/contrastive-learning-iSogCLR) codebase, which includes implementations of CLIP, SogCLR, and iSogCLR.
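
To make the objective concrete, here is a minimal sketch of the symmetric mini-batch contrastive loss used in CLIP-style training, written in plain Python. It is illustrative only and is not the repository's implementation: the function name `clip_contrastive_loss` and the toy similarity matrices are assumptions. GCL differs in that each anchor is contrasted against negatives from the whole dataset rather than only the current mini-batch.

```python
import math


def clip_contrastive_loss(sim, tau):
    """Symmetric mini-batch contrastive loss for an n x n similarity matrix.

    sim[i][j] is the similarity between image i and text j; matched
    pairs sit on the diagonal. tau is the temperature.
    """
    n = len(sim)

    def mean_diag_cross_entropy(rows):
        # Softmax cross-entropy of each row against its diagonal entry.
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / tau for s in row]
            m = max(logits)  # subtract the max for numerical stability
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            total += lse - logits[i]
        return total / n

    image_to_text = mean_diag_cross_entropy(sim)
    # Transposing the matrix gives the text-to-image direction.
    text_to_image = mean_diag_cross_entropy([list(col) for col in zip(*sim)])
    return 0.5 * (image_to_text + text_to_image)


uniform = [[0.0, 0.0], [0.0, 0.0]]  # every pair equally similar
aligned = [[1.0, 0.0], [0.0, 1.0]]  # matched pairs clearly separated
print(clip_contrastive_loss(uniform, tau=0.01))
print(clip_contrastive_loss(aligned, tau=0.01))
```

When all similarities are equal the loss is log(n) for a batch of n pairs, and with a small temperature such as 0.01 it approaches zero once matched pairs dominate their row and column.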

### Preparation

First, we:

1. Download the source code and data
2. Install required packages

In [None]:
# !git clone -b project https://github.com/xywei00/csce689_iSogCLR.git iSogCLR

# `!export` only affects its own subshell, so set environment variables with `%env` instead.
# %env PYTHONPATH=./iSogCLR/bimodal_exps
# %env HUGGINGFACE_HUB_CACHE=./checkpoints/huggingface
# !mkdir checkpoints

# !gdown 142xxRoMaHxX3BIfCw_1b_G_dgu-02Yq3    # clip_train.tar.gz
# !gdown 142zQjlOw0Xw4tKzXMrQjYE6NtGRTeasT    # cc3m_subset_100k.tar.gz
# !gdown 142tMsnclHTTPpnTXHSeNgTUlBk4She6o    # mscoco_val.tar.gz
# !gdown 1NXhfhwFy-nhdABACkodgYqm9pomDKE39    # val.tar

# !mkdir datasets
# !mkdir -p datasets/imagenet
# !tar xf clip_train.tar.gz
# !tar xf cc3m_subset_100k.tar.gz -C datasets
# !tar xf mscoco_val.tar.gz -C datasets
# !tar xf val.tar -C datasets/imagenet

# !pip install -r ./iSogCLR/requirements_colab.txt    # pip may print warnings or errors here; they can usually be ignored

### Training

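The following command runs the training script to train a ResNet50 (pretrained on ImageNet) and a DistilBERT (pretrained on BookCorpus and English Wikipedia) on the CC3M subset for 30 epochs with temperature 0.01. The loss is selected by `--ita_type`; the cell below passes `--ita_type clip`, so this run trains with the CLIP loss and writes its checkpoints to `output/clip_cc3m_g0.8_e30`.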

In [1]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/clip_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type clip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
Train Epoch: [0]  [  0/781]  eta: 1:14:29  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 11.8477  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 5.7234  data: 0.8225  max mem: 9358
Train Epoch: [0]  [ 50/781]  eta: 0:04:55  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 7.1008  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 0.2988  data: 0.0002  max mem: 9358
Train Epoch: [0]  [100/781]  eta: 0:03:59  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 6.2750  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0
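
The `781` in the progress counter above is the number of optimizer steps per epoch. A quick check, assuming the batch size of 128 quoted in the Benchmarks section and that the last partial batch is dropped (both are assumptions about this run, and `steps_per_epoch` is a hypothetical helper):

```python
def steps_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of mini-batches in one pass over the dataset."""
    full, remainder = divmod(num_samples, batch_size)
    if remainder and not drop_last:
        return full + 1  # keep the final, smaller batch
    return full


# 100000 training pairs (from the log) with an assumed batch size of 128.
print(steps_per_epoch(100_000, 128))                   # 781
print(steps_per_epoch(100_000, 128, drop_last=False))  # 782
```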

### Evaluation

The following command runs the evaluation script to evaluate the retrieval performance of the trained model on the MSCOCO validation dataset and the zero-shot classification performance on the ImageNet validation dataset. The evaluation command is obtained by appending `--evaluate --checkpoint /path/to/your/checkpoint --zs_dataset imagenet --zs_datafolder /path/to/imagenet/val` to the training command.

In [3]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/clip_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type clip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/clip_cc3m_g0.8_e30/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/clip_cc3m_g0.8_e30/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:00:15
coco val: {'txt_r1': 11.62, 'txt_r5': 30.78, 'txt_r10': 43.36, 'txt_r_mean': 28.586666666666662, 'img_r1': 9.160702147227, 'img_r5': 25.31488664080931, 'img_r10': 35.995041784957415, 'img_r_mean': 23.490210190997903, 'r_mean': 26.038438428832283}
zeroshot: {'zeroshot_top1': 21.658, 'zeroshot_top3': 34.418, 'zeroshot_top5': 40.576, 'zeroshot_top10': 48.798}
Training time 0:04:31
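
For reference, the `*_r1`, `*_r5`, and `*_r10` entries above are recall@K percentages, and the `*_mean` entries are plain averages of them. The sketch below shows one common way to compute recall@K from a query-by-candidate similarity matrix; `recall_at_k` and the toy data are illustrative assumptions, not the evaluation code in `clip.py`.

```python
def recall_at_k(sim, positives, k):
    """Percentage of queries whose top-k candidates include a correct match.

    sim[q][c] scores candidate c for query q (higher is better);
    positives[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for scores, pos in zip(sim, positives):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if pos & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 3 queries, 3 candidates, one correct candidate per query.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
positives = [{0}, {1}, {2}]
r1 = recall_at_k(sim, positives, 1)
r2 = recall_at_k(sim, positives, 2)
print(r1, r2)

# The aggregate entries in the log are simple means of the reported recalls.
txt_r_mean = (11.62 + 30.78 + 43.36) / 3
r_mean = (txt_r_mean + 23.490210190997903) / 2
print(txt_r_mean, r_mean)
```

Using sets for `positives` covers the case where a query has several correct candidates, for example an image with multiple captions.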


### Benchmarks

The table below reports results on the provided MSCOCO and ImageNet datasets. The first row is from the model trained with the CLIP loss, and the second row is from the model trained with the SogCLR loss. All results use a batch size of 128 and 30 epochs of pretraining. TR@1 denotes recall at 1 for text retrieval on MSCOCO, IR@1 denotes recall at 1 for image retrieval on MSCOCO, and ACC@1 denotes top-1 accuracy on ImageNet. Average is the mean of the three metrics.

| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|:------:|:-----------:|:-----------:|:--------------:|:-------:|
| CLIP   | 12.0        | 9.32        | 21.35          | 14.22   |
| SogCLR | 14.38       | 10.73       | 24.54          | 16.55   |
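
The Average column can be reproduced directly from the other three columns. A small check in Python, using only the numbers in the table (the variable names are illustrative):

```python
# (TR@1, IR@1, ACC@1) for each method, copied from the table.
rows = {
    "CLIP": (12.0, 9.32, 21.35),
    "SogCLR": (14.38, 10.73, 24.54),
}

# Mean of the three metrics, rounded to two decimals as in the table.
averages = {method: round(sum(vals) / len(vals), 2) for method, vals in rows.items()}
print(averages)
```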

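## Loss cyclip

The next two cells repeat the training and evaluation commands with `--ita_type cyclip` and a separate output directory, `output/cyclip_cc3m_g0.8_e30`.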

In [2]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/cyclip_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type cyclip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
Train Epoch: [0]  [  0/781]  eta: 1:28:23  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 20.7273  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 6.7906  data: 0.8027  max mem: 9358
Train Epoch: [0]  [ 50/781]  eta: 0:05:59  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 8.8137  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 0.4202  data: 0.1183  max mem: 9358
Train Epoch: [0]  [100/781]  eta: 0:05:05  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 6.4159  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0

In [3]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/cyclip_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type cyclip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/cyclip_cc3m_g0.8_e30/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/cyclip_cc3m_g0.8_e30/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:01:30
coco val: {'txt_r1': 14.1, 'txt_r5': 33.84, 'txt_r10': 46.3, 'txt_r_mean': 31.413333333333338, 'img_r1': 10.68415370466632, 'img_r5': 27.694030149146307, 'img_r10': 38.17825582790196, 'img_r_mean': 25.518813227238194, 'r_mean': 28.466073280285766}
zeroshot: {'zeroshot_top1': 25.906, 'zeroshot_top3': 39.492, 'zeroshot_top5': 45.658, 'zeroshot_top10': 53.904}
Training time 0:10:15
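
The `zeroshot_topK` values above are top-K accuracies. As a rough sketch of how zero-shot classification is typically done with a CLIP-style model, each image embedding is compared with one text embedding per class, and the most similar classes are taken as predictions. The helper names and the two-dimensional toy embeddings below are assumptions for illustration; the actual prompts and embeddings come from the repository's evaluation code.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_topk(image_emb, class_embs, k):
    """Indices of the k classes whose text embedding is most similar to the image."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(c))) for c in class_embs]
    return sorted(range(len(sims)), key=lambda c: sims[c], reverse=True)[:k]


def topk_accuracy(image_embs, labels, class_embs, k):
    """Percentage of images whose true label is among the top-k predictions."""
    hits = sum(
        label in zero_shot_topk(emb, class_embs, k)
        for emb, label in zip(image_embs, labels)
    )
    return 100.0 * hits / len(labels)


# Toy data: 3 classes and 3 images in a 2-dimensional embedding space.
class_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[2.0, 0.1], [0.2, 3.0], [1.0, 0.2]]
labels = [0, 1, 2]
top1 = topk_accuracy(image_embs, labels, class_embs, 1)
top2 = topk_accuracy(image_embs, labels, class_embs, 2)
print(top1, top2)
```

In the toy data the third image is closest to class 0 but labeled 2, so it counts as a miss at top-1 and a hit at top-2.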


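## Loss vicreg

The next two cells repeat the commands with `--ita_type vicreg`. In this setup the training run raises an exception inside the model's forward pass before any progress is logged (the traceback below is truncated), so no checkpoint is written and the evaluation cell then fails with `FileNotFoundError`.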

In [4]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/vicreg_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type vicreg \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
Traceback (most recent call last):
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 710, in <module>
    main(args)
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 502, in main
    train_stats = train(model, train_loader, optimizer, tokenizer, epoch, max_epoch, warmup_steps, device, lr_scheduler, 
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 84, in train
    loss_ita, info_dict = model(image, text_input, idx=idx, text_idx=text_idx, epoch=epoch, max_epoch=max_epoch)
  File "/home/grads/s/skpaul/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/grads/s/skpaul/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gra

In [5]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/vicreg_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type vicreg \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/vicreg_cc3m_g0.8_e30/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Traceback (most recent call last):
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 710, in <module>
    main(args)
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 444, in main
    checkpoint = torch.load(args.checkpoint, map_location='cpu') 
  File "/home/grads/s/skpaul/.local/lib/python3.10/site-packages/torch/serialization.py", line 1319, in load
    with _open_file_like(f, "rb") as opened_file:
  File "/home/grads/s/skpaul/.local/lib/python3.10/site-packages/torch/serialization.py", line 659, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/grads/s/skpaul/.local/lib/python3.10/site-packages/torch/serialization.py", line 640, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './output/vicreg_cc3m_g0.8_e30/checkpoint_30.pth'
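
Since the evaluation cell above fails only because the checkpoint file is missing, it can help to check for the file before launching evaluation. A minimal sketch, assuming checkpoints follow the `checkpoint_<epoch>.pth` naming seen in the commands above (`find_checkpoint` is a hypothetical helper):

```python
from pathlib import Path


def find_checkpoint(output_dir, epoch):
    """Return the checkpoint path for `epoch` if the file exists, else None."""
    path = Path(output_dir) / f"checkpoint_{epoch}.pth"
    return path if path.is_file() else None


ckpt = find_checkpoint("./output/vicreg_cc3m_g0.8_e30", 30)
if ckpt is None:
    print("No checkpoint found; fix the training run before evaluating.")
else:
    print(f"Evaluate with --checkpoint {ckpt}")
```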


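## Loss sogclr_dro

The next two cells repeat the commands with `--ita_type sogclr_dro`. Both fail while the model is being constructed: `models/model_clip.py` raises `NotImplementedError` for this loss type, so neither training nor evaluation can run with this version of the code.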

In [6]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/sogclr_dro_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type sogclr_dro \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Traceback (most recent call last):
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 710, in <module>
    main(args)
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 435, in main
    model = CLIP(image_encoder=args.image_encoder, text_encoder=args.text_encoder, embed_dim=args.embed_dim, init_model=args.init_model, bsz=args.batch_size_train*args.world_size,
  File "/home/grads/s/skpaul/CLIP/iSogCLR/bimodal_exps/models/model_clip.py", line 104, in __init__
    raise NotImplementedError
NotImplementedError


In [7]:
!CUDA_VISIBLE_DEVICES=0 python3 ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/sogclr_dro_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type sogclr_dro \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/sogclr_dro_cc3m_g0.8_e30/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Traceback (most recent call last):
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 710, in <module>
    main(args)
  File "/home/grads/s/skpaul/CLIP/./iSogCLR/bimodal_exps/clip.py", line 435, in main
    model = CLIP(image_encoder=args.image_encoder, text_encoder=args.text_encoder, embed_dim=args.embed_dim, init_model=args.init_model, bsz=args.batch_size_train*args.world_size,
  File "/home/grads/s/skpaul/CLIP/iSogCLR/bimodal_exps/models/model_clip.py", line 104, in __init__
    raise NotImplementedError
NotImplementedError
