Inconsistent perplexity from pseudo_log_likelihood when tokenizing first versus inputting string directly

First, thank you for making the code and checkpoint available.

I observe inconsistent results from the function pseudo_log_likelihood between the two usage examples shown in the README. Here is an example to reproduce the issue (I used 5 random OAS paired sequences):

```
paired_oas_example = [['VQLVESGGGLVKPGGSLRLSCEGSGFDFKTKWMSWVRQAPGRGLEWVGRIKSKRDGGTTDYTGSVKGRFIITRDDSRNTLYLQINSLATEDTGVYYCTTDPREWGQGVLVTVSS',
'QAVVTQEPSVTVSPGGTVILTCGSSTGAVTSGHYPYWFQQKPGQAPRTLIYDTSNKYFWTPARFSGSLIGGKAALTLSGAQPEDDADYYCLVSFSGARVFGGGTKLTV'],
['EVQLVESGGGLVKPGGSLRLSCAASGFTFSNAWMSWVRQAPGKGLEWVGRIKGKTDGGATDYAAPVKGRFTISRDDSENTLYLQMNSLKTEDTAVYYCTTTYIGTYYPGYWGQGTLVTVS',
'QSELTQPPSASGTPGQRVIISCSGSSSNIGSNYVFWYQQLPGTAPKLLIYRNNQRPSGVPDRFSGSKSGTSASLAISGLRSEDEAVYYCAAWDDSLVRVFGGGTKLTVL'],
['EVHLVQSGGGLVKPGGSLRLSCVASGFTFSKVWMNWVRQAPGKGLEWVGRIKSESDDGTTDYAAPVKGRFTISRDDSKNTLYLQMNSLKSEDTAVYYCTGNDFWSAMFDSWGQGTLVSVSS',
'QSVLTQPPSASGTPGQTVTISCSGSSSNIGIYHVSWYQQLPGTAPRLLIYGKNQRLSGVPDRFSGSKSGTSASLAISGLRAEDEADYYCTTWDDSLNGRLFGGGTKLTVL'],
['QVQLVESGGGVVEPGRSLRLSCAASGFTFSSHAMHWVRQAPGKGLEWVSFISWDGIHKYFADSVKGRFTISRDNSKNTVYLQMNSLRAEDTAVYYCAKDLSTRYSCDYWGQGTLVTVSS',
'QSALTQPPSASGSPGQSVTISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYEVTKRPSGVPDRFSGSKSGNTASLTVSGLQAEDEADYYCSSYASNTWVFGGGTKVTVL'],
['EVQLVESGGGLVQPGGSLSLSCAASGFTFSAYSMNWVRQAPGKGLEWLAYTSSVGSPIYYADSVRGRFTISRDNAKNSLYLQMNSLRVEDTAVYYCAREGFDIWGQGTLVTVSS',
'QAVLTQPASLSASPGASASLTCTLRSGINVGIYKIYWYQQKPGSPPQYLLRYKSDLDKQQGSGVPSRFSGSKDASANAGILLISGLQSEDEADYYCMIWHSSAYVFGGGTKLTVL']]

import prism

model = prism.pretrained("RomeroLab-Duke/prism-antibody")

tokenizer = model.get_tokenizer()

# Path 1: tokenize first, then pass input_ids and attention_mask
all_ppl_tokenize = []
for heavy, light in paired_oas_example:
    inputs = tokenizer(heavy, light_chain=light, return_tensors="pt")
    result = model.pseudo_log_likelihood(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

    ppl = result["marginalized"]["perplexity"]
    all_ppl_tokenize.append(ppl)
print(f"all_ppl_tokenize: {all_ppl_tokenize}")

# Path 2: pass the heavy and light chain strings directly
all_ppl_string = []
for heavy, light in paired_oas_example:
    result = model.pseudo_log_likelihood(
        heavy_chains=heavy,
        light_chains=light
    )

    ppl = result["marginalized"]["perplexity"]
    all_ppl_string.append(ppl)
print(f"all_ppl_string: {all_ppl_string}")
```
The resulting perplexities are:
```
all_ppl_tokenize: [8.739658401751875, 3.5378353969761314, 4.83460110963223, 2.93945275697227, 6.270151541876721]
all_ppl_string: [19.834938927843847, 13.141539521152554, 13.229897776805013, 12.468008213183204, 14.407962929102396]
```
which do not agree with each other. A bit of investigation suggests that this is due to how conditioning is handled: when tokenizing first, the conditioning is None, whereas passing the sequences directly causes prism to call ANARCI to produce the conditioning. I don't think this difference is intended; if it is, it should be documented somewhere.
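
To put a number on the gap, here is a minimal sketch that only post-processes the values printed above and does not call prism. It assumes (my assumption, not confirmed from the prism source) that the reported perplexity is the exponential of the negative mean per-residue log-likelihood, in which case the log of the ratio is the per-residue difference in mean log-likelihood between the two call paths.

```python
import numpy as np

# Perplexities copied from the output above (no prism call is made here).
ppl_tokenize = np.array([8.739658401751875, 3.5378353969761314, 4.83460110963223,
                         2.93945275697227, 6.270151541876721])
ppl_string = np.array([19.834938927843847, 13.141539521152554, 13.229897776805013,
                       12.468008213183204, 14.407962929102396])

# Ratio of the string-input perplexity to the tokenize-first perplexity.
ratio = ppl_string / ppl_tokenize

# Assumption: perplexity = exp(-mean log-likelihood), so log(ratio) is the
# per-residue gap in mean log-likelihood (in nats) between the two paths.
delta_mean_ll = np.log(ratio)

for r, d in zip(ratio, delta_mean_ll):
    print(f"ratio = {r:.2f}, mean log-likelihood gap = {d:.2f} nats")
```

On these five sequences the string-input perplexity is more than twice the tokenize-first one in every case.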

I have also tried the older checkpoint 0.3.0 on Hugging Face, which gives a smaller difference, but the two results still do not match.
```
# when using checkpoint 0.3.0 https://huggingface.co/RomeroLab-Duke/prism-antibody/tree/d4ee3e181394a5f4fccf9decaf245e1fffebed8c
all_ppl_tokenize: [6.1152188882862255, 3.0897894778099864, 5.054281527293421, 2.96039600734065, 4.582305101040003]
all_ppl_string: [9.060815978859193, 6.0984450583279735, 6.606973580675475, 5.867985883049776, 7.319161804655159]
```


On a related note, the perplexities produced by prism are quite large. For comparison, I ran ablang2 on the same set of sequences using the following:
```
import ablang2
import numpy as np

ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')

ablang2_results = ablang(paired_oas_example, mode='pseudo_log_likelihood')
print(np.exp(-ablang2_results))
```
The resulting perplexities are much smaller than any of the outputs from prism above:
```
[1.651289  1.5422703 1.8323187 1.3421805 1.5188011]
```
Granted, this is only from five sequences, but I did some calculations with a larger number of sequences and the trend holds. This seems to contradict the results shown in the paper. Is the correct checkpoint being provided on Hugging Face?
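
For a sense of scale, here is a sketch of how I am reading these numbers, assuming the usual definition of pseudo-perplexity as the exponential of the negative mean log-probability assigned to each residue when it is masked. The helper name pseudo_perplexity is hypothetical and is not part of prism or ablang2.

```python
import numpy as np

def pseudo_perplexity(log_probs):
    """Hypothetical helper: exp of the negative mean per-residue log-probability."""
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.exp(-log_probs.mean()))

# A model that guesses uniformly over the 20 standard amino acids assigns
# log(1/20) to every masked residue, which gives a perplexity of 20
# (up to floating-point error), regardless of sequence length.
uniform_ppl = pseudo_perplexity(np.full(120, np.log(1.0 / 20.0)))
print(uniform_ppl)
```

Under this definition, the string-input perplexities of roughly 12 to 20 from the current checkpoint are not far from the uniform baseline of 20, which is why the values look too large to me.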
