### Problem statement

This notebook does two things:

- Load CLIP
- Compute natural language embeddings for tokens of interest using the CLIP transformer model.

_CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs, first reported by Radford et al. (2021) and applied by Hessel et al. (2021)._

_The CLIP authors claim that, given an image, it can predict the most relevant text snippet without having been fine-tuned for that task, in a way similar to the zero-shot capabilities of GPT-2 and GPT-3._
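
For orientation, a contextual embedding of a token is the vector that the text transformer outputs at that token's position, as opposed to the token's integer id in the vocabulary. The sketch below is a minimal illustration, not the notebook's own method: it assumes the attribute names of the text encoder in the `openai/CLIP` repository (`token_embedding`, `positional_embedding`, `transformer`, `ln_final`, `dtype`) and follows the same steps as `encode_text`, stopping before the pooling and projection step. The names `per_token_embeddings` and `demo` are hypothetical.

```python
def per_token_embeddings(model, token_ids):
    """Return contextual vectors of shape (batch, context_length, width).

    Assumptions: `model` exposes the attributes used by CLIP's text encoder,
    and `token_ids` is the integer tensor returned by clip.tokenize() with the
    model's own context length (77 for ViT-B/32).
    """
    x = model.token_embedding(token_ids).type(model.dtype)   # look up token vectors
    x = x + model.positional_embedding.type(model.dtype)     # add learned positions
    x = x.permute(1, 0, 2)       # (batch, seq, width) -> (seq, batch, width)
    x = model.transformer(x)     # transformer stack
    x = x.permute(1, 0, 2)       # back to (batch, seq, width)
    return model.ln_final(x).type(model.dtype)   # final layer norm, no pooling


def demo(text="Her long sleeve covered the wound completely."):
    """Run the sketch end to end; requires torch and clip to be installed."""
    import torch
    import clip

    model, _ = clip.load("ViT-B/32", device="cpu", jit=False)
    token_ids = clip.tokenize([text])            # default context length
    with torch.no_grad():
        vectors = per_token_embeddings(model, token_ids)
    return token_ids[0], vectors[0]
```

Calling `demo()` would return the token ids of the sentence together with one vector per position. Position 0 holds the start-of-text token, so the first sub-word of the sentence sits at position 1.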

### Guidelines

This iPython notebook must be run in a Python virtual environment running Python 3.7.0. This is required so that the matching versions of Torch (1.7.0+cpu) and TorchVision (0.8.1+cpu) are used to run the CLIP 1.0 inference engine. The instructions below install CLIP on Linux hosts, assuming that:

- your interactive terminal session runs a Bourne shell (`sh`) or a Bourne-derived shell (`ksh`, `bash`, ...).
- you have already set up a Python 3.7.0 virtual environment in directory `/path/to/my_directory`.
- you are comfortable working on the command line in a terminal.

#### Setting up and registering a custom iPython kernel

What follows applies to CPU-only installations. For CUDA-enabled machines, refer to `https://github.com/OpenAI/CLIP`.

- Assuming you have configured `pyenv` (on your favorite Linux host) to manage multiple Python virtual environments with specific package requirements, choose the directory in which to set up your Python virtual environment and install the iPython kernel package `ipykernel`:

```
      $ cd /path/to/my_directory
      $ pyenv local 3.7.0
      $ python -m pip install ipykernel
```
- Install all required packages in the virtual environment directory `/path/to/my_directory` (the "Package requirements" section gives the full pinned list):
    
```
      $ python -m pip install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
      $ python -m pip install ftfy regex tqdm
      $ python -m pip install git+https://github.com/openai/CLIP.git
```

- Make the new custom iPython kernel, `clip1.0`, available in interactive Python sessions:

```
      $ cd /path/to/my_directory
      $ ipython kernel install --user --name clip1.0 --display-name "Python3.7.0 (clip1.0)"
      $ jupyter notebook        # launch an iPython session
```

- Select the custom virtual environment kernel ***Python3.7.0 (clip1.0)*** (the display name registered above) under the `New notebook` button in the top-right region of the browser page.
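
Once the kernel is selected, a quick check from a notebook cell can confirm that the interpreter and the packages are the expected ones. The sketch below is illustrative only: the helper `missing_modules` and the list of module names are assumptions, not part of the original notebook.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found by the interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules the setup above is expected to provide (assumed list; adjust as needed).
required = ["torch", "torchvision", "clip", "ftfy", "regex", "tqdm", "nltk", "numpy"]

print("Python executable:", sys.executable)
print("Python version:", ".".join(map(str, sys.version_info[:3])))
print("Missing modules:", missing_modules(required) or "none")
```

If the executable path does not point inside `/path/to/my_directory`, or if the list of missing modules is not empty, the notebook is probably running under a different kernel than the one registered above.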


### Package requirements

Package requirements are detailed below. For a quick demo, also install `Pillow==8.2.0` and its dependencies.

- Install all required packages in the virtual environment directory `/path/to/my_directory`, with:
```
    $ cd /path/to/my_directory
    $ python -m pip install -r /dev/stdin << 'EOF'
--find-links https://download.pytorch.org/whl/torch_stable.html
clip @ git+https://github.com/openai/CLIP.git@04f4dc2ca1ed0acc9893bd1a3b526a7e02c4bb10
Cython==0.29.22
ftfy==6.0.3
ipykernel==5.5.3
ipyparallel==6.3.0
ipython==7.22.0
ipython-genutils==0.2.0
ipywidgets==7.6.3
nltk==3.6.5
numpy==1.18.5
regex==2021.10.8
torch==1.7.0+cpu
torchaudio==0.7.0
torchvision==0.8.1+cpu
tqdm==4.61.1
scipy==1.6.2
EOF
```
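
To see whether an existing environment already satisfies these pins, the installed versions can be compared with the pinned ones. This is a minimal sketch under stated assumptions: the helpers `parse_pins`, `installed_version` and `report_mismatches` are hypothetical names, and on Python 3.7 the `importlib_metadata` backport is assumed to be installed, since `importlib.metadata` only ships with Python 3.8 and later.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for lines of the form 'name==version'."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, pip options (e.g. --find-links) and direct URLs ('name @ url').
        if not line or line.startswith(("#", "-")) or " @ " in line:
            continue
        name, sep, version = line.partition("==")
        if sep:
            pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        from importlib import metadata           # Python 3.8+
    except ImportError:
        import importlib_metadata as metadata    # backport, assumed installed on Python 3.7
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Example with two of the pins listed above; the result depends on the local environment.
print(report_mismatches(parse_pins("numpy==1.18.5\ntqdm==4.61.1")))
```

An empty dictionary means every pin is met. A `None` in the second position of a pair means the package is not installed at all.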

### Licensing terms and copyright

A substantial portion of the following code is based on CLIP, as originally made available by its authors. The use and distribution of CLIP are governed by the terms and conditions of the MIT license.

    Copyright (c) 2021 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

=================================

The pre-processing and wrapper code used to compute CLIP vector embeddings in this notebook is copyrighted.

    Copyright (C) 2020,2021 Cedric Bhihe

The OpenAI software made available in this repository was extended with pre- and post-processing steps. The corresponding code is available for free under the terms of the GNU General Public License (GPL), version 3 or later.

In short, you are welcome to use and distribute the code of this repo as long as you always credit its author and reproduce the licensing terms in full. Consult the licensing terms and conditions in License.md on this repo.

=================================

***To contact the repo owner about any issue, please open a new thread under the [Issues] tab of the source code repo.***

In [1]:
import os, sys, csv, time
from datetime import datetime as dt 
import re
import numpy as np
import torch
import clip
import nltk

In [2]:
## Check PyTorch version
print("Torch version:", torch.__version__)

Torch version: 1.7.1+cpu


In [3]:
## Import trained CLIP model
localdevice = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=localdevice, jit=False)
#model.cuda().eval()
input_resolution = model.visual.input_resolution   # pixel size of resized square picture side
context_length = model.context_length              # maximum number of text tokens per input
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Image input resolution:", input_resolution,"by",input_resolution,"pixels")
print("Textual context length:", context_length)
print("Total vocabulary size:", vocab_size)
print("Local device:", localdevice)

Model parameters: 151,277,313
Image input resolution: 224 by 224 pixels
Textual context length: 77
Total vocabulary size: 49408
Local device: cpu
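
The parameter count printed above is the sum, over all parameter tensors, of the product of each tensor's dimensions. A minimal sketch of that arithmetic in plain Python follows, using two layer shapes of the ViT-B/32 vision tower as examples: a 32x32 patch convolution without bias and a 768-to-3072 linear layer with bias. The helper `n_params` is hypothetical, and the weight layouts `(out, in, kH, kW)` and `(out, in)` are the usual PyTorch conventions.

```python
from functools import reduce
from operator import mul


def n_params(*shapes):
    """Total number of scalars held by tensors with the given shapes."""
    return sum(reduce(mul, shape, 1) for shape in shapes)


# Conv2d(3, 768, kernel_size=(32, 32), bias=False): a single weight tensor.
conv1 = n_params((768, 3, 32, 32))
# Linear(in_features=768, out_features=3072, bias=True): weight plus bias.
c_fc = n_params((3072, 768), (3072,))

print(f"conv1: {conv1:,}  c_fc: {c_fc:,}")   # conv1: 2,359,296  c_fc: 2,362,368
```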


In [4]:
## Load model of choice (not JIT), and return info on:
#   - VisionTransformer (for images) and 
#   - Transformer (for text)
clip.load("ViT-B/32", device=localdevice, jit=False)

(CLIP(
   (visual): VisionTransformer(
     (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
     (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
     (transformer): Transformer(
       (resblocks): Sequential(
         (0): ResidualAttentionBlock(
           (attn): MultiheadAttention(
             (out_proj): _LinearWithBias(in_features=768, out_features=768, bias=True)
           )
           (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
           (mlp): Sequential(
             (c_fc): Linear(in_features=768, out_features=3072, bias=True)
             (gelu): QuickGELU()
             (c_proj): Linear(in_features=3072, out_features=768, bias=True)
           )
           (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
         )
         (1): ResidualAttentionBlock(
           (attn): MultiheadAttention(
             (out_proj): _LinearWithBias(in_features=768, out_features=768, bias=True)
           )
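
A word-level index, such as one produced by a word tokenizer, and a position in the output of `clip.tokenize()` coincide only as long as every preceding word maps to exactly one sub-word token. The sketch below shows one way to track the correspondence explicitly. It is an illustration under stated assumptions: `subword_spans` is a hypothetical helper, and `toy_encode` is a stand-in that splits words on hyphens, not the CLIP byte-pair encoder.

```python
import re


def subword_spans(words, encode):
    """Map each word to the half-open (start, end) range of its sub-word positions.

    `encode` is any callable returning the list of sub-word tokens for one word.
    Positions start at 1 because clip.tokenize() prepends a start-of-text token.
    """
    spans, position = [], 1
    for word in words:
        n = len(encode(word))
        spans.append((position, position + n))
        position += n
    return spans


def toy_encode(word):
    """Toy encoder (assumption, for illustration only): split on hyphens, keeping them."""
    return [piece for piece in re.split(r"(-)", word) if piece]


words = ["unmask", "her", "cover-up", ",", "she", "thought"]
print(subword_spans(words, toy_encode))
# [(1, 2), (2, 3), (3, 6), (6, 7), (7, 8), (8, 9)]
```

Here the word at index 2 (`cover-up`) occupies positions 3 to 5, so every later word is shifted by two positions relative to its word index. With the real tokenizer, `encode` could be the `encode` method of `SimpleTokenizer` from `clip.simple_tokenizer`, assuming that module layout. Encoding words one at a time is only an approximation of tokenizing the full sentence, for instance when the word tokenizer rewrites quotation marks.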
        

In [5]:
## Text input w/ case-insensitive tokenizer: clip.tokenize()

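consize = 90                                     # context size used for tokenizing only (the model itself expects 77)

def pattern_index_in(summary: str, toi: str) -> list:
    '''
    Finds indices for occurrences of toi in summary.
    '''
    tokens = nltk.word_tokenize(summary)          # tokenize + misc possible conditionals (e.g. tagging, parse-tree position, etc.)
    pattern = re.compile(r'^.*' + re.escape(toi) + r'.*$',re.I)   # toi, token of interest
    # enumerate() keeps the position of every match; tokens.index() would return
    # only the first occurrence when the same token appears more than once
    indices = [idx for idx, token in enumerate(tokens) if pattern.match(token)]
    return indices

def find_embeddings(summary: str, toi: str) -> None:
    '''
    Prints, for each occurrence of toi in summary, its word index and the value found
    at position index+1 of the clip.tokenize() output (+1 skips the start-of-text token).
    Note: these values are vocabulary ids, not embedding vectors.
    '''
    starttime = time.time()
    indices = pattern_index_in(summary,toi)
    token_ids = clip.tokenize(summary,consize)[0]   # tokenize once, reuse for every index
    embedding = [int(token_ids[idx+1]) for idx in indices]
    endtime = time.time()
    print(f"\nSummary: {summary}\n    -- time elapsed: {round((endtime - starttime)*1000,3)}ms")
    print(f"Found tokens (position-index,embedding): {list(zip(indices,embedding))}")
    return None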



toi = "cover" 
print(f"\nCLIP contextualized embedding for '{toi}':\n========================")
 
text = 'Her long sleeve covered the wound completely. Nobody could possibly unmask her cover-up, she thought.'
find_embeddings(text, toi)

text = '"The proposal ought to have been covering the keypoints in question!", he said non-plused.'
find_embeddings(text, toi)

toi = "ride"
print(f"\nCLIP contextualized embedding for '{toi}':\n========================")

text = 'I am a poor lonesome cowbow, riding in the sunset...'
find_embeddings(text, toi)

text = '''His stride was enormous, exaggerated by the fact that the ground seemed to give back considerable
        force upward each time one of his feet touched the otherwise mushy surface... Astride his left shoulder
        I watched the surroundings.'''
find_embeddings(text, toi)

text = '"Not all motorbike riders are either St-George or Hell\'s angels... Nothing will override the statistical evidence  on this!"'
find_embeddings(text, toi)

text = '"You shouldn\'t have riden your bike on the beach, said the cop, I\'m afraid I\'ll have to take you for a ride!"'
find_embeddings(text, toi)

text = '''Canonical representations in 14th to 17th century western iconography often depict knights as 
        men in armor, who ride horses ... This representation however is neither exclusive, nor generally 
        accurate. The image of a knight as a horse-rider in shining armor is only familiar to us, because
        it corresponds to what is often depicted in modern novels and other fictional productions.'''
find_embeddings(text, toi)

text = '''Canonical representations in 14th to 17th century western iconography often depict knights as 
        men in armor, who ride horses ... This representation however is by no means exclusive. The image
        of a knight as a surfer in Hawaian shorts and who rides waves is only familiar to us, because it 
        corresponds to what is often depicted in modern novels and other fictional productions.'''
find_embeddings(text, toi)



CLIP contextualized embedding for 'cover':

Summary: Her long sleeve covered the wound completely. Nobody could possibly unmask her cover-up, she thought.
    -- time elapsed: 19.832ms
Found tokens (position-index,embedding): [(3, 5603), (13, 899)]

Summary: "The proposal ought to have been covering the keypoints in question!", he said non-plused.
    -- time elapsed: 2.242ms
Found tokens (position-index,embedding): [(7, 9462)]

CLIP contextualized embedding for 'ride':

Summary: I am a poor lonesome cowbow, riding in the sunset...
    -- time elapsed: 0.271ms
Found tokens (position-index,embedding): []

Summary: His stride was enormous, exaggerated by the fact that the ground seemed to give back considerable
        force upward each time one of his feet touched the otherwise mushy surface... Astride his left shoulder
        I watched the surroundings.
    -- time elapsed: 2.15ms
Found tokens (position-index,embedding): [(1, 36024), (31, 7744)]

Summary: "Not all motorbike riders ar