Update README.md #54

Open
wants to merge 102 commits into main

Changes from all commits (102 commits)
0e781fe
Update Pretrain & Downsteam Tasks
Dec 27, 2022
321b1ba
Update Pretrain & Downsteam Tasks
Dec 27, 2022
7d87b6f
Update Pretrain & Downsteam Tasks
Dec 27, 2022
b83bda2
Update README.md
yinanhe Dec 27, 2022
9e03dfe
Update README.md
yinanhe Dec 27, 2022
23607f2
Update README.md
yinanhe Dec 27, 2022
7ea8e98
Rename README_zh-CN.md to README.md
shepnerd Dec 27, 2022
2ab50d3
Update README.md
shepnerd Dec 27, 2022
e9bc0b5
Update README.md
shepnerd Dec 27, 2022
a212a49
Update README.md
shepnerd Dec 27, 2022
e87eafb
Update ensemble.py
shepnerd Dec 27, 2022
e546267
Update README.md
shepnerd Dec 27, 2022
3bf8ac0
Merge branch 'main' of https://github.com/OpenGVLab/InternVideo
shepnerd Dec 27, 2022
e4b5b1b
add_stal
xings19 Jan 1, 2023
dedb591
Update README.md
shepnerd Jan 2, 2023
71c82d9
Update README.md
Richard-61 Jan 9, 2023
680809f
Merge pull request #4 from Richard-61/main
Jan 13, 2023
912cb5f
Add Multi-Modal Tasks Downstream
Jan 16, 2023
ef812ac
Update README.md
shepnerd Jan 16, 2023
58a551e
Delete MODEL_ZOO.md
shepnerd Jan 17, 2023
51c1178
Create LICENSE
Jan 17, 2023
dff898a
update
shepnerd Jan 17, 2023
c11fb35
Merge branch 'main' of https://github.com/OpenGVLab/InternVideo
shepnerd Jan 17, 2023
7cc40ce
Update README.md
shepnerd Jan 18, 2023
507fc34
Add VLN-CE Downstream
wz0919 Jan 18, 2023
3abc8aa
Update README.md
shepnerd Jan 18, 2023
88dbdbc
add modelzoo for videomae
congee524 Feb 2, 2023
db2cc81
refine readme
congee524 Feb 2, 2023
1055551
Merge pull request #10 from congee524/add_videomae_modelzoo
yinanhe Feb 2, 2023
341c856
newpr
Jazzcharles Feb 3, 2023
ed3a4a8
Update README.md
shepnerd Feb 4, 2023
7d57534
Merge pull request #11 from Jazzcharles/newpr
shepnerd Feb 4, 2023
9d38597
Update README.md
shepnerd Feb 4, 2023
5b14157
Update README.md
shepnerd Feb 4, 2023
9ee0834
Update README.md
shepnerd Feb 5, 2023
4c3df43
Add multi-modalities pre-training and B/16 model.
Feb 5, 2023
571f171
Merge pull request #12 from liyz15/add_multi_modalities_pretraining
shepnerd Feb 5, 2023
6664d9d
fixed bug, batch_nms
Richard-61 Feb 6, 2023
12b45b2
Merge pull request #13 from Richard-61/main
shepnerd Feb 6, 2023
91aaebc
Update README.md
shepnerd Feb 6, 2023
5dfd3f2
add link of vit_b_k710_ft
congee524 Feb 14, 2023
65815bb
Merge pull request #15 from congee524/release_vit_b_k710_ft
shepnerd Feb 14, 2023
54e0cab
model weights
shepnerd Feb 20, 2023
f1ae8f5
Update README.md
shepnerd Feb 20, 2023
edaf54c
Update README.md
shepnerd Feb 20, 2023
7431fc9
Update README.md
shepnerd Feb 20, 2023
2733374
Update README.md
shepnerd Feb 20, 2023
635c058
Update README.md
shepnerd Feb 20, 2023
15018f0
Update README.md
JerryFlymi Mar 9, 2023
4542069
Update README.md
shepnerd Mar 9, 2023
cab65c3
Update README.md
shepnerd Mar 9, 2023
06b6a05
update pretrained weight for vln-ce
wz0919 Mar 24, 2023
3c4718d
Merge branch 'main' of https://github.com/OpenGVLab/InternVideo into …
wz0919 Mar 24, 2023
a1e6c7f
Update zero-shot k400 readme.
liyz15 Mar 28, 2023
13ed0b8
Merge pull request #28 from liyz15/update_k400_readme
shepnerd Mar 28, 2023
7de671f
Update README.md
shepnerd Apr 23, 2023
69e037d
Update README.md
shepnerd Apr 23, 2023
6244731
Update README.md
shepnerd Apr 23, 2023
3de36c9
Update README.md
shepnerd Apr 23, 2023
97aaec6
Update README.md
shepnerd Apr 23, 2023
a064b85
Update README.md
shepnerd Apr 25, 2023
56348dc
Update README.md
shepnerd Apr 27, 2023
dc05eb1
add data
Andy1621 May 10, 2023
efc5a17
add data
Andy1621 May 10, 2023
438af9e
Update instruction_data.md
yinanhe May 11, 2023
56132eb
remove unnecessary files
shepnerd May 11, 2023
6d3e8d3
remove unnecessary files
shepnerd May 11, 2023
98b0798
remove unnecessary files
shepnerd May 11, 2023
d2de1cb
readme_cn
shepnerd May 11, 2023
ecab9b8
update
shepnerd May 11, 2023
56aa384
refactor
shepnerd May 11, 2023
a356cf3
Update README.md
yinanhe May 11, 2023
f059145
update
shepnerd May 11, 2023
408a775
Update README.md
yinanhe May 30, 2023
380cb95
Update README_cn.md
yinanhe May 30, 2023
8f0c8f3
np.int has deprecated and caused error
MasoudKaviani Jul 2, 2023
0df66a6
Create README.md
shepnerd Jul 12, 2023
7af166f
Update README.md
shepnerd Jul 14, 2023
fceee33
internvid subset download link
shepnerd Jul 17, 2023
d6aff57
Update README.md
shepnerd Jul 17, 2023
fb9160d
Update README.md
yinanhe Jul 17, 2023
21ca7bf
Update README.md
shepnerd Jul 17, 2023
daf6066
Merge branch 'main' of https://github.com/OpenGVLab/InternVideo
shepnerd Jul 17, 2023
3dd7062
Update README.md
shepnerd Jul 18, 2023
3f37d43
freeze layer name rectified
GoatWang Jul 20, 2023
4f7ae6c
Merge pull request #45 from GoatWang/main
shepnerd Jul 23, 2023
de1e082
Merge pull request #41 from MasoudKaviani/patch-1
shepnerd Jul 23, 2023
cf85634
Update README.md
yinanhe Sep 9, 2023
0a1ff45
Update README.md
yinanhe Sep 9, 2023
d96cfd7
Update README.md
yinanhe Sep 9, 2023
c7e97ca
Update README.md
yinanhe Sep 11, 2023
0239e6c
Update tgif.py
yinanhe Oct 24, 2023
65f465a
Update video_base_dataset.py
yinanhe Oct 24, 2023
a6e8a2d
More feasible downloading entries of InternVid and ViCLIP; Demo of Vi…
shepnerd Oct 25, 2023
f620f4c
Merge branch 'main' of https://github.com/OpenGVLab/InternVideo
shepnerd Oct 25, 2023
dc72fd7
Update README.md
shepnerd Oct 25, 2023
349c6f5
Update README.md
shepnerd Oct 25, 2023
0b51944
Update README.md
shepnerd Oct 25, 2023
5cf58e2
Update demo.ipynb
shepnerd Oct 25, 2023
92cf0f5
Update demo.ipynb
shepnerd Oct 25, 2023
1ea86ea
Delete Ego-Tasks
shepnerd Oct 25, 2023
fe2a963
Update README.md
hjzhang-forward Oct 27, 2023
6 changes: 6 additions & 0 deletions .gitmodules
@@ -0,0 +1,6 @@
[submodule "Pretrain/UniFormerV2"]
path = Pretrain/UniFormerV2
url = https://github.com/OpenGVLab/UniFormerV2.git
[submodule "Downstream/Ego-Tasks"]
path = Downstream/Ego-Tasks
url = https://github.com/OpenGVLab/ego4d-eccv2022-solutions.git
60 changes: 60 additions & 0 deletions Data/InternVid/README.md
@@ -0,0 +1,60 @@
# InternVid \[[Paper](https://arxiv.org/pdf/2307.06942.pdf)\]

[![Dataset meta](https://img.shields.io/badge/%F0%9F%A4%97%20InternVid-Dataset-blue)](https://huggingface.co/datasets/OpenGVLab/InternVid) | [![Model Checkpoint](https://img.shields.io/badge/%F0%9F%A4%97%20ViCLIP-Model-purple)](https://huggingface.co/OpenGVLab/ViCLIP)

# :fire: News
We are excited to announce the partial release of a large-scale video-text dataset aimed at facilitating multimodal understanding and generation. As part of this release, we are making available a [subset](https://huggingface.co/datasets/OpenGVLab/InternVid) of the dataset, which comprises 10 million video clips. Additionally, we have provided a [ViCLIP](https://huggingface.co/OpenGVLab/ViCLIP) model trained on this subset, using the ViT-L architecture. It achieves SOTA zero-shot action recognition performance on Kinetics.

We give step-by-step instructions for accessing and using ViCLIP in [demo.ipynb](https://github.com/OpenGVLab/InternVideo/blob/main/Data/InternVid/demo.ipynb).

Stay tuned for updates!

# Introduction

**Data**

We collected videos from 16 popular categories with varying percentages. We ensured diversity by selecting videos from countries with different languages instead of relying on a dominant language environment. The countries we sampled from include the UK, USA, Australia, Japan, Korea, China, Russia, and France, among others. In terms of duration, each video lasts 351.9 seconds on average. Almost half (49%) of the videos are five minutes or less, while about a quarter (26%) fall between five and ten minutes. Only 8% of the videos are over 20 minutes long. Among the curated videos, 85% are high-resolution (720P), while the remaining 15% have lower resolutions ranging from 360P to 720P. Although the lower-resolution videos may not perform as well as the high-resolution ones in content generation tasks, they can still be useful in video-language representation learning, provided that they have appropriate captions.

![b469e00b43d46a6b3f89899483abcf6](https://github.com/OpenGVLab/InternVideo/assets/43169235/7d6aca7d-362a-425d-9ef2-ec0189491b52)

InternVid exhibits diverse clip durations and caption lengths at the segmented-clip level. The aesthetic scores and clip-caption similarities are distributed uniformly. The majority of clips are 0-10 seconds in length, accounting for 85% of all clips. Approximately half of the clips have captions with 10-20 words, while one-third of the clip captions have fewer than 10 words. About 11% of clips have long captions with more than 20 words.

![429af4993adb77478c000c865ae5a1b](https://github.com/OpenGVLab/InternVideo/assets/43169235/f64588c3-81e8-43de-b771-46500474d2ff)

**ViCLIP: a simple video CLIP for transferable video-text representation**

Built upon <a href="https://github.com/openai/CLIP">CLIP</a>, we develop a simple video-text pretraining baseline, ViCLIP. It consists of a video encoder (ViT) and a text encoder, as shown below. Both modules are initialized from the corresponding CLIP components. We replace the native attention in the video encoder with spatiotemporal attention while keeping the other design elements unchanged. For efficient learning, we apply masking to videos during pre-training.

<img width="633" alt="87c6263cc4aceee72cc8e37085a8109" src="https://github.com/OpenGVLab/InternVideo/assets/43169235/1e540a2b-f503-4036-b2a8-ba99401fc5b0">


# Data & Model Zoo

### Pretrained Data & Model
<div>

| Model | Training Data | Descriptions |
| :-----------------: | :----------------------: | :---------------------------------------------------------------------------------------------------: |
| ViCLIP-L-14 \[[HuggingFace](https://huggingface.co/OpenGVLab/ViCLIP) \| [Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViClip-InternVid-10M-FLT.pth )\] | InternVid-10M-FLT \[[HuggingFace](https://huggingface.co/datasets/OpenGVLab/InternVid) \| [OpenDataLab](https://opendatalab.com/shepshep/InternVid)\] | |
</div>
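
The commands below are a hedged sketch for pulling the checkpoint and annotations from HuggingFace with `huggingface_hub`; the checkpoint filename is an assumption inferred from the Aliyun link above, so verify it against the repository listing first.

```python
from huggingface_hub import hf_hub_download, snapshot_download

# ViCLIP-L-14 weights (filename assumed from the Aliyun mirror; check the repo).
ckpt_path = hf_hub_download(repo_id="OpenGVLab/ViCLIP",
                            filename="ViClip-InternVid-10M-FLT.pth")

# InternVid-10M-FLT annotations: download the whole dataset repository.
data_dir = snapshot_download(repo_id="OpenGVLab/InternVid", repo_type="dataset")

print(ckpt_path, data_dir)
```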


## Citation

If you find this work useful for your research, please consider citing InternVid. Your acknowledgement would greatly help us in continuing to contribute resources to the research community.

```
@article{wang2023internvid,
title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},
author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},
journal={arXiv preprint arXiv:2307.06942},
year={2023}
}

@article{wang2022internvideo,
title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
journal={arXiv preprint arXiv:2212.03191},
year={2022}
}
```
102 changes: 102 additions & 0 deletions Data/InternVid/demo.ipynb
@@ -0,0 +1,102 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f86bc499",
"metadata": {},
"source": [
"## download ViCILP weights and put its pth file in viclip folder. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7a90379-d9ee-45d9-9073-7ed5132fa6b1",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import os\n",
"import cv2\n",
"\n",
"from viclip import retrieve_text, _frame_from_video"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a425a5da-ceaf-4b89-9845-c8ba576902d8",
"metadata": {},
"outputs": [],
"source": [
"video = cv2.VideoCapture('example1.mp4')\n",
"frames = [x for x in _frame_from_video(video)]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3fb7397a-02ef-41b5-9ffe-f2363b277778",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"text: A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run. ~ prob: 0.8264\n",
"text: A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon. ~ prob: 0.1587\n",
"text: A pet dog excitedly runs through the snowy yard, chasing a toy thrown by its owner. ~ prob: 0.0141\n",
"text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0006\n",
"text: A playful dog slides down a snowy hill, wagging its tail with delight. ~ prob: 0.0002\n"
]
}
],
"source": [
"text_candidates = [\"A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.\",\n",
" \"A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.\",\n",
" \"A person dressed in a blue jacket shovels the snow-covered pavement outside their house.\",\n",
" \"A pet dog excitedly runs through the snowy yard, chasing a toy thrown by its owner.\",\n",
" \"A person stands on the snowy floor, pushing a sled loaded with blankets, preparing for a fun-filled ride.\",\n",
" \"A man in a gray hat and coat walks through the snowy yard, carefully navigating around the trees.\",\n",
" \"A playful dog slides down a snowy hill, wagging its tail with delight.\",\n",
" \"A person in a blue jacket walks their pet on a leash, enjoying a peaceful winter walk among the trees.\",\n",
" \"A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.\",\n",
" \"A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery.\"]\n",
"\n",
"texts, probs = retrieve_text(frames, text_candidates, name='viclip', topk=5)\n",
"\n",
"for t, p in zip(texts, probs):\n",
" print(f'text: {t} ~ prob: {p:.4f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2969ba6-19d0-4893-b071-b82fa046c312",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Binary file added Data/InternVid/example1.mp4
Binary file not shown.
3 changes: 3 additions & 0 deletions Data/InternVid/viclip/README.md
@@ -0,0 +1,3 @@
---
license: mit
---
71 changes: 71 additions & 0 deletions Data/InternVid/viclip/__init__.py
@@ -0,0 +1,71 @@
from .simple_tokenizer import SimpleTokenizer as _Tokenizer
from .viclip import ViCLIP
import torch
import numpy as np
import cv2

clip_candidates = {'viclip': None, 'clip': None}

def get_clip(name='viclip'):
    global clip_candidates
    m = clip_candidates[name]
    if m is None:
        if name == 'viclip':
            tokenizer = _Tokenizer()
            vclip = ViCLIP(tokenizer)
            # m = vclip
            m = (vclip, tokenizer)
        else:
            raise Exception('the target clip model is not found.')
    return m

def get_text_feat_dict(texts, clip, tokenizer, text_feat_d={}):
    for t in texts:
        feat = clip.get_text_features(t, tokenizer, text_feat_d)
        text_feat_d[t] = feat
    return text_feat_d

def get_vid_feat(frames, clip):
    return clip.get_vid_features(frames)

def _frame_from_video(video):
    while video.isOpened():
        success, frame = video.read()
        if success:
            yield frame
        else:
            break

v_mean = np.array([0.485, 0.456, 0.406]).reshape(1, 1, 3)
v_std = np.array([0.229, 0.224, 0.225]).reshape(1, 1, 3)
def normalize(data):
    return (data / 255.0 - v_mean) / v_std

def frames2tensor(vid_list, fnum=8, target_size=(224, 224), device=torch.device('cuda')):
    assert(len(vid_list) >= fnum)
    step = len(vid_list) // fnum
    vid_list = vid_list[::step][:fnum]
    vid_list = [cv2.resize(x[:, :, ::-1], target_size) for x in vid_list]
    vid_tube = [np.expand_dims(normalize(x), axis=(0, 1)) for x in vid_list]
    vid_tube = np.concatenate(vid_tube, axis=1)
    vid_tube = np.transpose(vid_tube, (0, 1, 4, 2, 3))
    vid_tube = torch.from_numpy(vid_tube).to(device, non_blocking=True).float()
    return vid_tube

def retrieve_text(frames, texts, name='viclip', topk=5, device=torch.device('cuda')):
    clip, tokenizer = get_clip(name)
    clip = clip.to(device)
    frames_tensor = frames2tensor(frames, device=device)
    vid_feat = get_vid_feat(frames_tensor, clip)

    text_feat_d = {}
    text_feat_d = get_text_feat_dict(texts, clip, tokenizer, text_feat_d)
    text_feats = [text_feat_d[t] for t in texts]
    text_feats_tensor = torch.cat(text_feats, 0)

    probs, idxs = clip.get_predict_label(vid_feat, text_feats_tensor, top=topk)

    ret_texts = [texts[i] for i in idxs.numpy()[0].tolist()]
    return ret_texts, probs.numpy()[0]

Binary file not shown.
135 changes: 135 additions & 0 deletions Data/InternVid/viclip/simple_tokenizer.py
@@ -0,0 +1,135 @@
import gzip
import html
import os
from functools import lru_cache

import ftfy
import regex as re


@lru_cache()
def default_bpe():
    return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
# @lru_cache()
# def default_bpe():
#     return "bpe_simple_vocab_16e6.txt.gz"


@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))


def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text


class SimpleTokenizer(object):
    def __init__(self, bpe_path: str = default_bpe()):
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
        merges = merges[1:49152-256-2+1]
        merges = [tuple(merge.split()) for merge in merges]
        vocab = list(bytes_to_unicode().values())
        vocab = vocab + [v+'</w>' for v in vocab]
        for merge in merges:
            vocab.append(''.join(merge))
        vocab.extend(['<|startoftext|>', '<|endoftext|>'])
        self.encoder = dict(zip(vocab, range(len(vocab))))
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
        self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + (token[-1] + '</w>',)
        pairs = get_pairs(word)

        if not pairs:
            return token+'</w>'

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except:
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_tokens = []
        text = whitespace_clean(basic_clean(text)).lower()
        for token in re.findall(self.pat, text):
            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
        return bpe_tokens

    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
        return text
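
As a quick sanity check of the tokenizer above, the sketch below encodes and decodes a caption; it assumes `bpe_simple_vocab_16e6.txt.gz` sits next to `simple_tokenizer.py`, which is what `default_bpe()` expects.

```python
from viclip.simple_tokenizer import SimpleTokenizer

# default_bpe() resolves bpe_simple_vocab_16e6.txt.gz relative to this module.
tokenizer = SimpleTokenizer()

ids = tokenizer.encode("a dog runs through the snow")
print(ids)                    # BPE token ids
print(tokenizer.decode(ids))  # roughly round-trips: "a dog runs through the snow "
```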