Inconsistent perplexity from pseudo_log_likelihood when tokenizing first versus inputting strings directly #1

@dxu16

First, thank you for making the code and checkpoint available.

I observe inconsistent results from the function pseudo_log_likelihood between the two usage examples shown in the README. Here is an example that reproduces the problem (I used 5 random paired OAS sequences):

paired_oas_example = [['VQLVESGGGLVKPGGSLRLSCEGSGFDFKTKWMSWVRQAPGRGLEWVGRIKSKRDGGTTDYTGSVKGRFIITRDDSRNTLYLQINSLATEDTGVYYCTTDPREWGQGVLVTVSS',
'QAVVTQEPSVTVSPGGTVILTCGSSTGAVTSGHYPYWFQQKPGQAPRTLIYDTSNKYFWTPARFSGSLIGGKAALTLSGAQPEDDADYYCLVSFSGARVFGGGTKLTV'],
['EVQLVESGGGLVKPGGSLRLSCAASGFTFSNAWMSWVRQAPGKGLEWVGRIKGKTDGGATDYAAPVKGRFTISRDDSENTLYLQMNSLKTEDTAVYYCTTTYIGTYYPGYWGQGTLVTVS',
'QSELTQPPSASGTPGQRVIISCSGSSSNIGSNYVFWYQQLPGTAPKLLIYRNNQRPSGVPDRFSGSKSGTSASLAISGLRSEDEAVYYCAAWDDSLVRVFGGGTKLTVL'],
['EVHLVQSGGGLVKPGGSLRLSCVASGFTFSKVWMNWVRQAPGKGLEWVGRIKSESDDGTTDYAAPVKGRFTISRDDSKNTLYLQMNSLKSEDTAVYYCTGNDFWSAMFDSWGQGTLVSVSS',
'QSVLTQPPSASGTPGQTVTISCSGSSSNIGIYHVSWYQQLPGTAPRLLIYGKNQRLSGVPDRFSGSKSGTSASLAISGLRAEDEADYYCTTWDDSLNGRLFGGGTKLTVL'],
['QVQLVESGGGVVEPGRSLRLSCAASGFTFSSHAMHWVRQAPGKGLEWVSFISWDGIHKYFADSVKGRFTISRDNSKNTVYLQMNSLRAEDTAVYYCAKDLSTRYSCDYWGQGTLVTVSS',
'QSALTQPPSASGSPGQSVTISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYEVTKRPSGVPDRFSGSKSGNTASLTVSGLQAEDEADYYCSSYASNTWVFGGGTKVTVL'],
['EVQLVESGGGLVQPGGSLSLSCAASGFTFSAYSMNWVRQAPGKGLEWLAYTSSVGSPIYYADSVRGRFTISRDNAKNSLYLQMNSLRVEDTAVYYCAREGFDIWGQGTLVTVSS',
'QAVLTQPASLSASPGASASLTCTLRSGINVGIYKIYWYQQKPGSPPQYLLRYKSDLDKQQGSGVPSRFSGSKDASANAGILLISGLQSEDEADYYCMIWHSSAYVFGGGTKLTVL']]

import prism

model = prism.pretrained("RomeroLab-Duke/prism-antibody")

tokenizer = model.get_tokenizer()

# Path 1: tokenize first, then pass input_ids and attention_mask.
all_ppl_tokenize = []
for heavy, light in paired_oas_example:
    inputs = tokenizer(heavy, light_chain=light, return_tensors="pt")
    result = model.pseudo_log_likelihood(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

    ppl = result["marginalized"]["perplexity"]
    all_ppl_tokenize.append(ppl)
print(f"all_ppl_tokenize: {all_ppl_tokenize}")

# Path 2: pass the heavy and light chain strings directly.
all_ppl_string = []
for heavy, light in paired_oas_example:
    result = model.pseudo_log_likelihood(
        heavy_chains=heavy,
        light_chains=light,
    )

    ppl = result["marginalized"]["perplexity"]
    all_ppl_string.append(ppl)
print(f"all_ppl_string: {all_ppl_string}")

The resulting perplexities are:

all_ppl_tokenize: [8.739658401751875, 3.5378353969761314, 4.83460110963223, 2.93945275697227, 6.270151541876721]
all_ppl_string: [19.834938927843847, 13.141539521152554, 13.229897776805013, 12.468008213183204, 14.407962929102396]

which do not agree with each other. A bit of investigation suggests that the cause is how conditioning is handled. When tokenizing first, the conditioning is None, whereas passing the sequences directly causes prism to call anarci to produce the conditioning. I don't think this difference is intended; if it is, it should be documented somewhere.
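To quantify the gap, here is a minimal sketch that compares the two lists reported above. It uses only NumPy and the numbers already printed; the helper name `paths_agree` and the tolerance `rtol` are my own choices for illustration and are not part of prism.

```python
import numpy as np

# Perplexities reported above for the current checkpoint.
ppl_tokenize = np.array([
    8.739658401751875, 3.5378353969761314, 4.83460110963223,
    2.93945275697227, 6.270151541876721,
])
ppl_string = np.array([
    19.834938927843847, 13.141539521152554, 13.229897776805013,
    12.468008213183204, 14.407962929102396,
])


def paths_agree(a, b, rtol=1e-4):
    """Return True if two perplexity arrays match within a relative tolerance.

    Hypothetical helper; rtol is an arbitrary illustrative threshold.
    """
    return bool(np.allclose(a, b, rtol=rtol))


# Per-sequence ratio of string-input perplexity to tokenize-first perplexity.
ratio = ppl_string / ppl_tokenize

print(paths_agree(ppl_tokenize, ppl_string))
print(np.round(ratio, 2))
```

For these five sequences the string-input perplexity is more than twice the tokenize-first value in every case, so the discrepancy is far beyond numerical noise.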

I have also tried the older checkpoint 0.3.0 on Hugging Face, which gives a smaller difference, but the two results still do not agree.

# when using checkpoint 0.3.0 https://huggingface.co/RomeroLab-Duke/prism-antibody/tree/d4ee3e181394a5f4fccf9decaf245e1fffebed8c
all_ppl_tokenize: [6.1152188882862255, 3.0897894778099864, 5.054281527293421, 2.96039600734065, 4.582305101040003]
all_ppl_string: [9.060815978859193, 6.0984450583279735, 6.606973580675475, 5.867985883049776, 7.319161804655159]

On a related note, the perplexities produced by prism are quite large. If we run ablang2 on the same set of sequences using the following:

import ablang2
import numpy as np

ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')

ablang2_results = ablang(paired_oas_example, mode='pseudo_log_likelihood')
print(np.exp(-ablang2_results))

The resulting perplexities are much smaller compared to any of the outputs from prism above:

[1.651289  1.5422703 1.8323187 1.3421805 1.5188011]

Granted, this is only from five sequences, but I did some calculations with a larger number of sequences and the trend holds. This seems to contradict the results shown in the paper. Is the correct checkpoint being provided on Hugging Face?
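For reference, the ablang2 conversion above treats the returned value as a mean per-residue log-likelihood in natural log, so perplexity is exp of its negative. The sketch below spells that out with made-up log-probabilities; the function name `pseudo_perplexity` is hypothetical, and I am only assuming (not asserting) that prism's "marginalized" perplexity is normalized the same way. If the two libraries average over different positions or use a different log base, the numbers would not be directly comparable.

```python
import numpy as np


def pseudo_perplexity(token_log_probs):
    """exp of the negative mean per-position log-probability (natural log).

    Hypothetical helper for illustration; not taken from prism or ablang2.
    """
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    return float(np.exp(-token_log_probs.mean()))


# Made-up example: a 4-residue sequence where the model is uniform over
# the 20 standard amino acids at every masked position.
uniform = np.log(np.full(4, 1 / 20))
print(pseudo_perplexity(uniform))  # approximately 20

# Made-up example: the model assigns probability 1 to every true residue.
certain = np.log(np.ones(4))
print(pseudo_perplexity(certain))  # 1.0
```

Under this definition a perplexity near 1 means the model almost always recovers the masked residue, while a value near 20 is no better than guessing uniformly among amino acids.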
