# Inference with CLAVE

<a target="_blank" href="https://colab.research.google.com/github/davidaf3/CLAVE/blob/master/src/run_clave.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how to run inference with CLAVE and builds a Gradio UI that lets you experiment with the model.

## Setup

Install the necessary dependencies. This only installs the packages that are not available in Colab. If you are not using Colab, you might also need to install `torch`, `requests`, and `tqdm`.

In [1]:
%pip install rarfile gradio

Collecting rarfile
  Downloading rarfile-4.2-py3-none-any.whl.metadata (4.4 kB)
Collecting gradio
  Downloading gradio-5.24.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.4-py3-none-manylinux_2_17_x86_6

Clone CLAVE's repo and move into its `src` directory. If you are running this notebook locally and have already cloned the repo, this step is not necessary.

In [2]:
!git clone https://github.com/davidaf3/CLAVE.git
%cd CLAVE/src

Cloning into 'CLAVE'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 101 (delta 47), reused 88 (delta 34), pack-reused 0 (from 0)
Receiving objects: 100% (101/101), 183.62 KiB | 1.18 MiB/s, done.
Resolving deltas: 100% (47/47), done.
/content/CLAVE/src


## Download the model weights
First, download the model weights and the SentencePiece tokenizer data from the provided URLs:

In [3]:
from tqdm import tqdm
import requests


res = requests.get(
    "https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/model.rar",
    stream=True,
)

with tqdm(
    total=int(res.headers.get("content-length", 0)), unit="B", unit_scale=True
) as progress_bar:
    with open("model.rar", "wb") as f:
        for data in res.iter_content(1024):
            progress_bar.update(len(data))
            f.write(data)

res = requests.get(
    "https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/tokenizer_data.zip",
    stream=True,
)

with tqdm(
    total=int(res.headers.get("content-length", 0)), unit="B", unit_scale=True
) as progress_bar:
    with open("tokenizer_data.zip", "wb") as f:
        for data in res.iter_content(1024):
            progress_bar.update(len(data))
            f.write(data)

100%|██████████| 277M/277M [00:24<00:00, 11.5MB/s]
100%|██████████| 1.03M/1.03M [00:00<00:00, 1.35MB/s]


Extract the downloaded `model.rar` and `tokenizer_data.zip` files:

In [4]:
import rarfile
import zipfile


with rarfile.RarFile("model.rar") as f:
    f.extractall(path=".")

with zipfile.ZipFile("tokenizer_data.zip") as f:
    f.extractall(path=".")
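
If extraction fails silently (for example, `rarfile` depends on an external extraction tool such as `unrar` being installed), the next step will fail with a less obvious error. A minimal sketch of a sanity check follows; `missing_files` is a hypothetical helper, not part of CLAVE, and `CLAVE.pt` is the checkpoint name used in the loading step.

```python
from pathlib import Path


def missing_files(names, root="."):
    """Return the entries of `names` that do not exist under `root`."""
    return [name for name in names if not (Path(root) / name).exists()]


# "CLAVE.pt" is the checkpoint file that the loading step expects.
missing = missing_files(["CLAVE.pt"])
if missing:
    print("Missing after extraction:", missing)
```

The check only reports missing files; it does not validate their contents.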

## Load the weights
Create a new model (`FineTunedModel` class) and load the weights from the extracted file (`CLAVE.pt`):

In [5]:
import torch
from model import FineTunedModel
from tokenizer import SpTokenizer


device = "cuda" if torch.cuda.is_available() else "cpu"

model = FineTunedModel(
    SpTokenizer.get_vocab_size(), 512, 512, 8, 2048, 6, use_layer_norm=True
).to(device)
model_checkpoint = torch.load("CLAVE.pt", map_location=device)
weights = {
    k[10:] if k.startswith("_orig_mod") else k: v
    for k, v in model_checkpoint["model_state_dict"].items()
}
model.load_state_dict(weights)
model.eval()

FineTunedModel(
  (encoder): Encoder(
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (embedding): Embedding(16000, 512)
    (pos_embedding): Embedding(2048, 512)
    (embedding_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (embedding_dropout): Dropout(p=0.1, inplace=
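
The `k[10:]` slice in the loading cell drops the first 10 characters of keys that start with `_orig_mod`. That matches the `_orig_mod.` prefix that `torch.compile` adds to parameter names, so the checkpoint was presumably saved from a compiled model. The sketch below shows the same renaming on a plain dictionary; the key names and values are illustrative, and `strip_compile_prefix` is a hypothetical helper.

```python
PREFIX = "_orig_mod."  # 10 characters, hence the k[10:] slice in the loading cell


def strip_compile_prefix(state_dict: dict) -> dict:
    """Return a copy of state_dict with a leading '_orig_mod.' removed from each key."""
    return {key.removeprefix(PREFIX): value for key, value in state_dict.items()}


# Illustrative keys only; a real checkpoint maps parameter names to tensors.
example = {
    "_orig_mod.encoder.embedding.weight": "tensor A",
    "encoder.pos_embedding.weight": "tensor B",
}
cleaned = strip_compile_prefix(example)
print(sorted(cleaned))
```

`str.removeprefix` requires Python 3.9 or later and leaves keys without the prefix unchanged, which mirrors the `else k` branch of the original comprehension.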

## Start the UI
Start the Gradio UI configured to run the `verify_authorship` function. This function tokenizes the two inputs, processes the tokens with CLAVE to obtain an embedding for each input, and computes the cosine distance between the embeddings. Inputs whose distance is at most `threshold` (0.1050) are reported as written by the same author.

In [6]:
import gradio as gr
import torch.nn.functional as F
from utils import pad_and_split_tokens


tokenizer = SpTokenizer()
threshold = 0.1050


def verify_authorship(source_code_1, source_code_2):
    with torch.inference_mode():
        tokens_1 = pad_and_split_tokens(tokenizer.tokenizes(source_code_1))[0]
        tokens_2 = pad_and_split_tokens(tokenizer.tokenizes(source_code_2))[0]
        embedding_1 = model(torch.tensor([tokens_1], device=device))
        embedding_2 = model(torch.tensor([tokens_2], device=device))
        distance = (1 - F.cosine_similarity(embedding_1, embedding_2)).item()
        return [
            distance,
            "Yes" if distance <= threshold else "No",
        ]


ui = gr.Interface(
    fn=verify_authorship,
    inputs=[
        gr.Code(language="python", label="Source code 1"),
        gr.Code(language="python", label="Source code 2"),
    ],
    outputs=[gr.Number(label="Distance"), gr.Text(label="Same author?")],
    allow_flagging="never",
)
ui.launch()



Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://4ee4691156f3c5a2a6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
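
The decision rule inside `verify_authorship` can be illustrated without the model. The following is a minimal sketch in plain Python, assuming two already-computed embedding vectors; the toy vectors are made up for illustration, and `cosine_distance` and `same_author` are hypothetical helper names.

```python
import math

THRESHOLD = 0.1050  # same value as `threshold` in the UI cell


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_author(u, v, threshold=THRESHOLD):
    """Apply the same rule as the UI: distance <= threshold means 'Yes'."""
    return cosine_distance(u, v) <= threshold


# Toy 2-D vectors; real CLAVE embeddings have many more dimensions.
a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, so the distance is 0
c = [0.0, 1.0]  # orthogonal to a, so the distance is 1

print(same_author(a, b), same_author(a, c))
```

Unlike `F.cosine_similarity`, this sketch does not guard against zero-length vectors.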




In [9]:
!apt-get update
!apt-get install -y tesseract-ocr


Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,383 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,810 kB]
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [4,000 kB]
Get:13 https://r2u.stat.illinois

In [7]:
!pip install pytesseract opencv-python pillow


Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13


In [None]:
import cv2
import numpy as np
import pytesseract
from google.colab import files

from utils import pad_and_split_tokens

# Step 1: Upload an image (Colab only) and use the first uploaded file
uploaded = files.upload()
image_path = list(uploaded.keys())[0]

# Step 2: Extract code from the image using OCR
img = cv2.imread(image_path)
if img is None:  # cv2.imread returns None when the file cannot be decoded
    raise ValueError(f"Could not read image: {image_path}")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
code_text = pytesseract.image_to_string(gray)
print("Extracted Code:\n", code_text)

# Step 3: Tokenize and embed with CLAVE, calling the model as in verify_authorship
model.eval()
tokens = pad_and_split_tokens(tokenizer.tokenizes(code_text))[0]
tensor = torch.tensor([tokens], device=device)  # works on both CPU and GPU

with torch.inference_mode():
    embedding = model(tensor)
embedding = embedding.cpu().numpy()  # shape: (1, D)

# Step 4: Compare to known embeddings (e.g., from your val/test set)
# Here you would load comparison embeddings from a dataset or previous outputs.
# Example (pseudo-code; "val_embeddings.npy" is a placeholder file name):
# from sklearn.metrics.pairwise import cosine_similarity
# known_embeddings = np.load("val_embeddings.npy")  # shape: (N, D)
# similarities = cosine_similarity(embedding, known_embeddings)
# top_k = np.argsort(similarities[0])[::-1][:5]
# print("Most similar authors/samples:", top_k)
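
The comparison outlined in Step 4 can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the `known` matrix below is toy data standing in for stored embeddings, and `top_k_similar` is a hypothetical helper, not part of CLAVE.

```python
import numpy as np


def top_k_similar(query: np.ndarray, known: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k rows of `known` most cosine-similar to `query`, best first."""
    query = query.reshape(-1)  # accept shape (D,) or (1, D)
    # Normalize so that a dot product equals the cosine similarity.
    q = query / np.linalg.norm(query)
    m = known / np.linalg.norm(known, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(sims)[::-1][:k]


# Toy data: three stored 2-D "embeddings" and one query of shape (1, D).
known = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[0.9, 0.1]])
print(top_k_similar(query, known, k=2))  # prints [0 2]
```

The returned indices refer to rows of the stored matrix, so mapping them back to authors requires keeping a parallel list of labels.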
