### Problem statement

This notebook does 3 things:

- Experiment with loading the CLIP v1.0 model on a CPU-only build of Torch v1.7.1, in order to implement a one-shot inference engine on annotation-image pairs (testing dataset).
- Select images on the condition that their '.xml' metadata specifies at least 3 bounding boxes (bbxes).
- Classify the selected images with CLIP one-shot inference, according to the labels used in training. <BR>Results are saved in file: '../Sgoab/Data/Images/clip_lab-class.tsv'

_CLIP is the *Contrastive Language-Image Pre-Training* neural network trained on a variety of (image, text) pairs, first reported by Radford et al. (2021) and applied by Hessel et al. (2021). <BR>
In principle, given an image, CLIP can be instructed in natural language to predict the most relevant text snippet, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3._
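
The principle can be sketched without CLIP itself: given one image embedding and one text embedding per candidate label, the predicted label is the one with the highest scaled cosine similarity after a softmax. The vectors below are made-up toy values, not real CLIP features, and `rank_labels` is a hypothetical helper written for illustration only.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_labels(image_vec, text_vecs, labels, scale=100.0):
    """Return (label, probability) pairs sorted by decreasing probability."""
    img = normalize(image_vec)
    logits = []
    for vec in text_vecs:
        txt = normalize(vec)
        # scaled cosine similarity between image and text embeddings
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    # numerically stable softmax over all candidate labels
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values, for illustration only)
labels = ["cross", "angel", "chalice"]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_vec = [0.9, 0.3, 0.1]

ranking = rank_labels(image_vec, text_vecs, labels)
for label, prob in ranking:
    print(f"{label:>10s}: {100 * prob:.2f}%")
```

In the notebook proper, the embeddings come from `model.encode_image` and `model.encode_text`, and the same normalization and softmax steps are applied to Torch tensors.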

### Guidelines

This iPython notebook must be run in a Python virtual environment based on Python v3.7.1. This is a prerequisite, so that the proper versions of Torch (1.7.1+cpu) and TorchVision (0.8.2+cpu) can be invoked to run the CLIP 1.0 inference engine on test images. A succinct installation description is scripted below for a Linux host, assuming that:

- your interactive terminal session shell is `bash` or `sh`.
- you have already set up a Python 3.7.1 virtual environment, in directory `/path/to/my_directory`.
- you know how to handle the command line interface on the terminal.

#### Setting up and registering a custom iPython kernel

What follows applies to CPU-only constrained installations. For CUDA-enabled machines, refer to `https://github.com/OpenAI/CLIP`.

- Assuming you have configured `pyenv` (on your favorite Linux host) to manage multiple Python virtual environments with specific package requirements, choose the directory in which to set up your Python virtual environment and install the iPython kernel utility package `ipykernel`:

```
      $ cd /path/to/my_directory
      $ pyenv local 3.7.1
      $ python -m pip install ipykernel  # or "ipykernel==4.10.0"
```
- Install all required packages in the virtual environment directory `/path/to/my_directory` (see the "Package requirements" section for details).
    
```
      $ python -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
      $ python -m pip install ftfy regex tqdm
      $ python -m pip install git+https://github.com/openai/CLIP.git
```

- Make the new custom iPython kernel, clip1.0, available in interactive Python sessions:
```
      $ cd /path/to/my_directory
      $ ipython kernel install --user --name clip1.0 --display-name "Python3.7.1 (clip1.0)"     # or
      $ python -m ipykernel install --user --name clip1.0 --display-name "Python3.7.1 (clip1.0)"
      $ jupyter notebook        # launch an iPython session based on 'notebook' server 
```

- Select the special virtual environment kernel ***Python3.7.1 (clip1.0)*** under the `New notebook` button in the top-right region of the browser page.
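
Once the notebook is open, a quick sanity check can confirm that the selected kernel really runs the interpreter of the virtual environment. The sketch below uses only the standard library; `describe_interpreter` and the `EXPECTED` tuple are illustrative assumptions, not part of the original setup.

```python
import sys

EXPECTED = (3, 7, 1)   # assumed target version, as stated in the guidelines

def describe_interpreter(version_info=None, expected=EXPECTED):
    """Return (ok, message) comparing an interpreter version with the expected one."""
    if version_info is None:
        version_info = sys.version_info
    found = tuple(version_info[:3])
    ok = found == tuple(expected)
    found_str = ".".join(map(str, found))
    expected_str = ".".join(map(str, expected))
    status = "matches" if ok else "differs from"
    return ok, f"Python {found_str} {status} expected {expected_str}"

ok, message = describe_interpreter()
print(message)
print("Interpreter path:", sys.executable)    # should point inside the virtual environment
print("Environment prefix:", sys.prefix)
```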


### Package requirements

Package requirements are detailed below. For a quick demo also install `Pillow==8.3.2` and dependencies.

- Install all required packages in the virtual environment directory "/path/to/my_directory", with:
```
    $ cd /path/to/my_directory
    $ python -m pip install -f https://download.pytorch.org/whl/torch_stable.html -r /dev/stdin <<- 'EOF'
                clip @ git+https://github.com/openai/CLIP.git@04f4dc2ca1ed0acc9893bd1a3b526a7e02c4bb10
                Cython==0.29.1
                h5py==2.9.0
                ftfy==5.5.1
                matplotlib==3.0.2
                numpy==1.17.3
                Pillow==8.3.2
                pyyaml==5.1
                regex==2021.8.3
                requests==2.20.1
                torch==1.7.1+cpu
                torchaudio==0.7.2
                torchvision==0.8.2+cpu
                tqdm==4.38.0
                scipy==1.2.0
                zipfile37==0.1.3
    EOF
```
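
Whether the environment actually matches the pinned versions can be checked from Python. The following is a minimal sketch, not part of the original notebook: `parse_pins`, `find_mismatches` and `installed_version` are hypothetical helpers, and the version lookup assumes that `importlib.metadata` (Python 3.8+) or, on Python 3.7, the `importlib_metadata` backport is available.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict; skip blanks, comments and URL pins."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins

def find_mismatches(pins, lookup):
    """Return {name: (wanted, found)} for every pin that `lookup` does not satisfy.

    `lookup(name)` must return the installed version string, or None if absent.
    """
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

def installed_version(name):
    """Look up the version of an installed distribution, or None if not installed."""
    try:
        from importlib import metadata            # Python >= 3.8
    except ImportError:
        import importlib_metadata as metadata     # assumed backport on Python 3.7
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

pins = parse_pins("""
    torch==1.7.1+cpu
    torchvision==0.8.2+cpu
    Pillow==8.3.2
""")
for name, (wanted, found) in find_mismatches(pins, installed_version).items():
    print(f"{name}: pinned {wanted}, found {found}")
```

Passing the lookup as a function keeps the comparison testable without touching the real environment.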

Jupyter environment requirements include:
```
                ipykernel==6.6.0
                ipython==7.30.1
                ipython_genutils==0.2.0
                ipywidgets==7.6.5
                jupyter_client==7.1.0
                jupyter_core==4.9.1
                nbclient==0.5.9
                nbconvert==6.3.0
                nbformat==5.1.3
                notebook==5.7.4
                traitlets==5.1.1
```
 ... and starting the Jupyter notebook from the system's Jupyter instance with:
```
    $ /usr/bin/jupyter notebook 01_clip_cpu_classify
```
#### Known issues
\- Launching the notebook by relying on the local environment's shims, with:

    $ jupyter notebook 01_clip_cpu_classify
    
may fail under Pyenv with a "Segmentation fault". It is likely an iPython issue related to Jupyter. To avoid it, launch either notebook from a more recent Python version, and select your custom-built 3.7.1 iPython kernel from the notebook at first launch.

### Licensing terms and copyright

A substantial portion of the following code is based on CLIP, as originally made available by its authors. CLIP, its use and its distribution are governed by the terms and conditions of the MIT license.

    Copyright (C) 2021 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy of this 
software and associated documentation files (the "Software"), to deal in the Software 
without restriction, including without limitation the rights to use, copy, modify, merge, 
publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons 
to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or 
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, 
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

=========================================================

The pre-processing and wrapper code used to conduct one-shot classification with CLIP are protected. 

    Copyright (c) 2020, 2021 Cedric Bhihe
    
The OpenAI software made available in this repository was extended with pre- and post-processing steps. The corresponding code is available for free under the terms of the GNU General Public License (GPL), version 3 or later.

In short, you are welcome to use and distribute the code of this repo as long as you always quote its author's name and the Licensing terms in full. Consult the licensing terms and conditions in License.md on this repo.

=========================================================

***To contact the repo owner for any issue, please open a new thread under the [Issues] tab on the source code repo.***


In [1]:
import os, sys, csv, re, time   # 're' is needed to clean the class labels further down
from datetime import datetime as dt 

import random
from zipfile import ZipFile
import multiprocessing as mp
#from multiprocessing.pool import ThreadPool
import faulthandler     # Python builtin

from PIL import Image, ImageDraw, ImageFont, ImageFile

import numpy as np
from pandas import DataFrame as DF

import xml.etree.ElementTree as ET
import clip
import torch

faulthandler.enable()

In [2]:
## Check the pyTorch version used
print("Torch version:", torch.__version__)
#assert torch.__version__.split(".") >= ["1", "7", "1"], "PyTorch 1.7.1 or later is re{commend,quir}ed"

Torch version: 1.7.1+cpu
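
The commented-out assertion in the cell above compares lists of strings, which orders version components lexicographically (so `"10"` sorts before `"7"`). A hedged alternative is to compare tuples of integers; `version_tuple` below is a hypothetical helper written for this notebook, not a PyTorch API.

```python
import re

def version_tuple(version):
    """Turn '1.7.1+cpu' into (1, 7, 1), ignoring any local suffix after '+'."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:        # stop at the first non-numeric component
            break
        parts.append(int(match.group()))
    return tuple(parts)

# String comparison gets this wrong, integer tuples do not
print(["1", "10", "0"] >= ["1", "7", "1"])                  # False
print(version_tuple("1.10.0") >= version_tuple("1.7.1"))    # True
print(version_tuple("1.7.1+cpu") >= (1, 7, 1))              # True
```

With such a helper, the check could read `assert version_tuple(torch.__version__) >= (1, 7, 1)`.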


In [3]:
## TESTING
clip.available_models()   # list names of available CLIP models
## TESTING END

['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16']

In [4]:
## Import trained CLIP model
localdevice = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=localdevice, jit=False) # Must set jit=False for training
#model.cuda().eval()
input_resolution = model.visual.input_resolution   # pixel size of resized square picture side
context_length = model.context_length   # maximum number of tokens per text input
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Image input resolution:", input_resolution,"by",input_resolution,"pixels")
print("Textual context length:", context_length)
print("Total vocabulary size:", vocab_size)
print("Local device:", localdevice)

Model parameters: 151,277,313
Image input resolution: 224 by 224 pixels
Textual context length: 77
Total vocabulary size: 49408
Local device: cpu


In [5]:
## Load model of choice and return the VisionTransformer (for images) and 
# Transformer (for text) information. The loaded model is not JIT.
clip.load("ViT-B/32", device=localdevice, jit=False)

(CLIP(
   (visual): VisionTransformer(
     (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
     (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
     (transformer): Transformer(
       (resblocks): Sequential(
         (0): ResidualAttentionBlock(
           (attn): MultiheadAttention(
             (out_proj): _LinearWithBias(in_features=768, out_features=768, bias=True)
           )
           (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
           (mlp): Sequential(
             (c_fc): Linear(in_features=768, out_features=3072, bias=True)
             (gelu): QuickGELU()
             (c_proj): Linear(in_features=3072, out_features=768, bias=True)
           )
           (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
         )
         (1): ResidualAttentionBlock(
           (attn): MultiheadAttention(
             (out_proj): _LinearWithBias(in_features=768, out_features=768, bias=True)
           )
        

In [8]:
## Load labels with which object-detection CNN was trained
data_dir = 'Data_git/'
res_dir = data_dir + 'Results/'

file = 'class_w_categories.csv'
infile = data_dir + file

# class labels only: keep the first column, skip blank and commented rows,
# trim and collapse whitespace, then sort alphabetically
with open(infile,'r', encoding='utf-8-sig') as fd:
    f_csv = csv.reader(fd)
    classids = [re.sub(r'\s*$|^\s*|\s+(?=\s)|\s*,\s*.*','',row[0]) for row in f_csv if row and not re.search(r'^\s*#|^$', row[0])]
    classids = sorted(classids, reverse = False)
    class_cnt = len(classids)
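
The effect of the two regular expressions used above is easier to see on a small in-memory sample. This is a sketch under assumptions: the sample rows are made up (the real `class_w_categories.csv` may look different) and `read_class_labels` is a hypothetical helper, not a function of the notebook.

```python
import csv, io, re

CLEAN = re.compile(r'\s*$|^\s*|\s+(?=\s)|\s*,\s*.*')   # same pattern as in the cell above
SKIP = re.compile(r'^\s*#|^$')                        # commented or empty first field

def read_class_labels(stream):
    """Return the sorted, cleaned first column of a CSV stream."""
    labels = []
    for row in csv.reader(stream):
        if not row or SKIP.search(row[0]):   # blank line, comment, or empty label
            continue
        labels.append(CLEAN.sub('', row[0]))
    return sorted(labels)

# Made-up sample content, for illustration only
sample = io.StringIO(
    "# label,category\n"
    "  crown   of  thorns ,object\n"
    "\n"
    "angel,figure\n"
    ",missing label\n"
)
class_labels = read_class_labels(sample)
print(class_labels)    # ['angel', 'crown of thorns']
```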

In [9]:
## Preliminary (generic) for image display with decorations
#  ####################
path_to_font = '/usr/share/fonts/TTF/times.ttf'    # path is system specific
if not os.path.exists(path_to_font):
    print(' ==> Missing font file or erroneous path to file.')
    raise FileNotFoundError(path_to_font)

font_size1 = 28
font_size2 = 22

try:
    label_font1 = ImageFont.truetype(path_to_font,font_size1)
    label_font2 = ImageFont.truetype(path_to_font,font_size2)
except NameError as ne:
    # an exception cannot be concatenated to a str with '+'; format it instead
    print(f"{ne}\n ==> Missing 'from PIL import ImageFont'\n If necessary install module PIL/Pillow.")
    raise
except Exception as ex:
    print(ex)
    print('Unsure about what happened. Probably unsafe to continue; ttf loading interrupted.')
    raise ex

In [18]:
## Parameter initialization

dataset = 'training'
max_img_nbr = 3                  # nbr of images + xml files to be extracted from zip archive
min_nbr_bbx = 3                  # minimum nbr of bbxes per sampled file
top_labels = 5                   # top label ranking for image classification

In [21]:
## Load N random images with minimum number of bbxes

cnt = 0
extract_file_lst = list()
sample_file_label_lst = list()

# Load  zipped training files
file = dataset + '_set.zip'
infile = data_dir + file
with ZipFile(infile,'r') as zipped:
    img_lst = [f.split(sep='/')[-1] for f in zipped.namelist() if f.endswith('.png') or f.endswith('.jpg')]
    img_lst.sort()
    xml_lst = [f.split(sep='/')[-1] for f in zipped.namelist() if f.endswith('.xml')]
    xml_lst.sort()
    print(f"Total nbr of image files: {len(img_lst)}\nTotal nbr of XML files: {len(xml_lst)}")
    
    try:
        # sample file basenames w/o replacement
        f_sampled = ['.'.join(f.split('.')[:-1]) for f in random.sample(img_lst, max_img_nbr)] 
    except ValueError:
        f_sampled = ['.'.join(f.split('.')[:-1]) for f in img_lst]
    
    for f in f_sampled:
        label_lst = list()
        label_dct = dict()
        
        f_xml = f + '.xml'
        zipped.extract(f_xml,path = data_dir + 'Annots/')
        tree = ET.parse(data_dir + 'Annots/' + f_xml) 
        root = tree.getroot()
        obj_idx = [(idx,root[idx][0].text)
                   for idx, val in enumerate([child.tag for child in root]) 
                   if val == 'object']
        if len(obj_idx) >= min_nbr_bbx:
            f_jpg = f + '.jpg'
            zipped.extract(f_jpg,path = data_dir + 'Images/')
            cnt +=1
            label_lst = [label for _,label in obj_idx]
            for label in set(label_lst):
                label_dct[label]=label_lst.count(label)
            extract_file_lst.append(f)
            sample_file_label_lst.append(label_dct)
        else:
            try:
                os.remove(data_dir + 'Annots/' + f_xml)
            except OSError:
                print(f"### invalid file name or inaccessible path for ({f}),")
                print("    or arguments with correct type, but rejected by OS.")
                
print(f"{cnt} files with {min_nbr_bbx} or more bbxes were selected at random.")

Total nbr of image files: 5
Total nbr of XML files: 5
3 files with 3 or more bbxes were selected at random.
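
The selection criterion used in the cell above can be illustrated on a single annotation held in memory. The XML below is made up and assumes a Pascal VOC-like layout in which every `<object>` element has `<name>` as its first child; `count_labels` and `has_enough_boxes` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up annotation (assumed structure, for illustration only)
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <object><name>cross</name><bndbox><xmin>1</xmin></bndbox></object>
  <object><name>angel</name><bndbox><xmin>5</xmin></bndbox></object>
  <object><name>cross</name><bndbox><xmin>9</xmin></bndbox></object>
</annotation>
"""

def count_labels(xml_text):
    """Count bounding-box labels, one per <object> element directly under the root."""
    root = ET.fromstring(xml_text)
    return Counter(child[0].text for child in root if child.tag == 'object')

def has_enough_boxes(xml_text, min_nbr_bbx=3):
    """True if the annotation describes at least `min_nbr_bbx` bounding boxes."""
    return sum(count_labels(xml_text).values()) >= min_nbr_bbx

label_counts = count_labels(SAMPLE_XML)
print(dict(label_counts))
print(has_enough_boxes(SAMPLE_XML))
```

A `Counter` yields the same label-to-count mapping that the cell builds by hand in `label_dct`.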


In [22]:
## Prepare text input
# 'torch.cat' concatenates given sequence of tensors in given dimension (default: row-wise or 'dim=0')
text_inputs = torch.cat([clip.tokenize(f"a painting of a {c}") for c in classids]).to(localdevice)
image_path = data_dir + 'Images/'

## Prepare output
print(f"{top_labels} top label ranking for {cnt} images in:\n{image_path}")

try:
    with open(res_dir + 'clip_classify.tsv', 'x', newline='') as fd:  # create if file does not exist
        header = ['img_name', 'rec_create_time', 'img_dataset', 'img_top_labels', 'img_labels']
        writer = csv.DictWriter(fd, fieldnames=header, delimiter='\t')
        writer.writeheader()
except FileExistsError:
    pass
       
with open(res_dir + 'clip_classify.tsv', 'r', newline='') as infile:  
    # binary mode (e.g. 'rb') doesn't accept argument [newline='']
    reader = csv.reader(infile, delimiter='\t')
    header = next(reader)
    ncol = len(header)
    ndarr = np.empty([0,ncol],dtype=object)           # empty array; object dtype avoids truncating long strings or dicts
    for row in reader:
        ndarr = np.append(ndarr,[row],axis=0)


img_in_file = [str(x) for x in list(ndarr[:,0])]
ImageFile.LOAD_TRUNCATED_IMAGES = True

cnt_new = 0

for f_idx, f in enumerate(extract_file_lst):
    starttime = time.time()
    timestamp = dt.now().strftime('%Y%m%d-%H%M%S')

    f_jpg = image_path + f + '.jpg'
    img = Image.open(f_jpg, mode='r',formats=None) # lazy op, default: mode='r',formats=None
    width, height = img.size
    img_labels = sample_file_label_lst[f_idx]

    # Use 'mode=RGB' or 'mode=RGBA' below, to blend drawing in underlying image
    draw = ImageDraw.Draw(img, mode='RGBA') 

    # Modify font size to fit image size
    multiplier = max(1,(width+height)//1400)*4/5
    label_font1 = ImageFont.truetype(path_to_font,int(font_size1*multiplier))
    label_font2 = ImageFont.truetype(path_to_font,int(font_size2*multiplier))

    draw.text((width/20,height/20),
              'File: ' + str(f_jpg.split('/')[-1]),
              #fill=contrasted_rgb_color,
              fill=(255,40,0,255),                 # ferrari red (255,40,0,...), full opacity (...,255)
              font=label_font1)

    ## Prepare image input
    # Insert new singleton (batch) dimension of size 1 along dimension 0,
    # so the preprocessed 3-channel 224x224 image tensor of size [3,224,224] becomes a tensor of size [1,3,224,224]
    image_input = preprocess(img).unsqueeze(0).to(localdevice) 

    # Calculate features
    # 'torch.no_grad()', context manager to disable gradient calculation locally over indented block
    with torch.no_grad():
    #with torch.autograd.no_grad():
    #with torch.autograd.inference_mode():
        image_features = model.encode_image(image_input)    # CLIP image-based inference
        text_features = model.encode_text(text_inputs)      # CLIP text-based inference 

    # Normalize features, then pick the 'top_labels' most similar labels for the image
    image_features /= image_features.norm(dim=-1, keepdim=True) # <class 'torch.Tensor'> size [1,512]
    text_features /= text_features.norm(dim=-1, keepdim=True)   # <class 'torch.Tensor'> size [92,512]
    #try:
    #    print(type(image_features), image_features.shape)
    #    print(type(text_features), text_features.shape)
    #except:
    #    pass
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    values, indices = similarity[0].topk(top_labels)

    # Print classification results
    print(f"\n \u27A2 {f_jpg.split('/')[-1]}")
    print(f"   time: {round(time.time() - starttime,2)}s")

    classify_dict = dict()
    for i, (value, index) in enumerate(zip(values, indices)):
        print(f"{classids[index]:>20s}: {100 * value.item():.2f}%")
        draw.text((10+width/20, 40*multiplier*(i+1)+height/20),
                  classids[index] + ": " + str(round(100 * value.item(),2)) + "%",
                  #fill=contrasted_rgb_color,
                  #fill=(0  ,176, 24, 20),          # green, 78% opacity
                  #fill=(255,236,0  ,255),          # yellow, 100% opacity
                  #fill=(255,255,255,255),          # white, 100% opacity
                  fill=(255, 40,  0,255),          # red, 100% opacity
                  font=label_font2)
        classify_dict.update({classids[index]:round(value.item(),4)})
        
    # Save CLIP classification result in Numpy nd.array
    if str(f) not in img_in_file:                  # image never processed before
        ndarr = np.append(ndarr,[[str(f), timestamp, dataset, classify_dict, img_labels]],axis=0)
        cnt_new += 1
    else:                                          # image processed previously
        ndarr[img_in_file.index(str(f))] = [str(f),timestamp,dataset,classify_dict, img_labels]  # 'ValueError' if not in list
        
    #img.save(res_dir + 'images/' + f + '_cliplabel.jpg') 
    img.show()        

# Transform CLIP classification result from Numpy nd.array to Pandas dataframe and sort by image basename
df = DF(ndarr)    
df.sort_values(by=[df.columns[0]], ascending=True, inplace=True)    

# Commit dataframe to disk
df.to_csv(res_dir + 'clip_classify.tsv', mode='w', index=False, sep='\t', header=header, quoting=csv.QUOTE_ALL)
cnt_new = "No" if cnt_new == 0 else cnt_new
grammar = 'new file was' if cnt_new in {"No",1} else 'new files were'
print(f'\n==========================================\n{cnt_new} {grammar} label-based classified by CLIP and added to:\n {res_dir + "clip_classify.tsv"}')

5 top label ranking for 3 images in:
Data_git/Images/

 ➢ 00000110.jpg
   time: 2.63s
         crucifixion: 53.12%
               cross: 22.48%
              prayer: 6.70%
            shepherd: 3.74%
     crown of thorns: 2.95%

 ➢ 00000028.jpg
   time: 2.88s
         crucifixion: 86.22%
               cross: 6.20%
               mitre: 1.08%
              banner: 0.82%
              prayer: 0.74%

 ➢ 00000103.jpg
   time: 2.74s
              prayer: 12.67%
              judith: 10.37%
             chalice: 7.26%
               angel: 6.99%
                halo: 5.45%

1 new file was label-based classified by CLIP and added to:
 Data_git/Results/clip_classify.tsv
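
Because the two dictionary columns are written to the TSV file as their string representation, they need to be parsed when the file is read back. The sketch below assumes the column layout written above; the sample record (including its timestamp and label counts) is made up and `load_results` is a hypothetical helper.

```python
import ast, csv, io

# Made-up file content in the layout written above (tab-separated, all fields quoted)
SAMPLE_TSV = (
    '"img_name"\t"rec_create_time"\t"img_dataset"\t"img_top_labels"\t"img_labels"\n'
    '"00000028"\t"20211215-101500"\t"training"\t'
    '"{\'crucifixion\': 0.8622, \'cross\': 0.062}"\t"{\'cross\': 2, \'halo\': 1}"\n'
)

def load_results(stream):
    """Read classification records, turning the two dict columns back into dicts."""
    records = []
    for row in csv.DictReader(stream, delimiter='\t'):
        # literal_eval only accepts Python literals, so it is safer than eval()
        row['img_top_labels'] = ast.literal_eval(row['img_top_labels'])
        row['img_labels'] = ast.literal_eval(row['img_labels'])
        records.append(row)
    return records

records = load_results(io.StringIO(SAMPLE_TSV))
top = records[0]['img_top_labels']
best = max(top, key=top.get)
print(records[0]['img_name'], '->', best)
```

For the real file, the same function could be called on `open(res_dir + 'clip_classify.tsv', newline='')`.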
