Create basic eval script for ProteinGym benchmark #4

Closed
pascalnotin opened this issue Aug 2, 2023 · 9 comments
@pascalnotin
Collaborator

No description provided.

@Muedi
Contributor

Muedi commented Aug 3, 2023

What do we want as a 'basic' eval script? Should I try to make a script that downloads and preprocesses the CSVs, then uses ESM-2 as a placeholder: take its final hidden states and feed these to one row of logits?
Would this work as a zero-shot test? Any suggestions in this direction would be welcome; so far I have only used my own models or pretrained ones as is :)

If the script runs, we can later use our models instead and have a comparison with esm already built in :)

@pascalnotin
Collaborator Author

Thanks @Muedi ! What you suggest is great for the semi-supervised property prediction setting.

For the pure zero-shot setting with ESM models, we could use the ESM-1v masked-marginal approach as described here: https://github.com/facebookresearch/esm/blob/main/examples/variant-prediction/predict.py (that's what we've been using when reporting the corresponding model performance in ProteinGym). It works nearly out of the box for the single-sequence ESM models (e.g., ESM1b, ESM1v, ESM2), except for sequences that are longer than the context length (i.e., longer than 1023 AAs, if we count the BOS token).
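For readers new to the idea, here is a minimal sketch of masked-marginal scoring. It is not the linked ESM script: `log_prob_fn` is a hypothetical stand-in for a real model call, `toy_log_prob_fn` is a dummy used only so the example runs, and the `:`-separated mutation string with 1-based positions is an assumption about the input format.

```python
import math
from typing import Callable, Dict

AAS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical interface: (wild-type sequence, 0-based position to mask)
# -> log-probabilities over amino acids at that masked position.
LogProbFn = Callable[[str, int], Dict[str, float]]


def masked_marginal_score(wt_seq: str, mutant: str,
                          log_prob_fn: LogProbFn, offset: int = 1) -> float:
    """Sum of log p(mutant aa) - log p(wild-type aa) over mutated positions,
    each evaluated with that position masked in the wild-type sequence."""
    score = 0.0
    for mut in mutant.split(":"):  # assumed format, e.g. "K2R:A4G"
        wt_aa, pos, mt_aa = mut[0], int(mut[1:-1]) - offset, mut[-1]
        if wt_seq[pos] != wt_aa:
            raise ValueError(f"{mut}: expected {wt_aa}, found {wt_seq[pos]}")
        log_probs = log_prob_fn(wt_seq, pos)
        score += log_probs[mt_aa] - log_probs[wt_aa]
    return score


def toy_log_prob_fn(seq: str, pos: int) -> Dict[str, float]:
    """Dummy model: gives the wild-type residue twice the weight of any other.
    A real masked LM would not see seq[pos]; this peeks only to stay runnable."""
    weights = {aa: (2.0 if aa == seq[pos] else 1.0) for aa in AAS}
    total = sum(weights.values())
    return {aa: math.log(w / total) for aa, w in weights.items()}


single = masked_marginal_score("MKTAYIAK", "K2R", toy_log_prob_fn)
double = masked_marginal_score("MKTAYIAK", "K2R:A4G", toy_log_prob_fn)
```

With a real model, `log_prob_fn` would mask the position, run a forward pass, and return the log-softmax at that position; higher scores mean the model prefers the mutant over the wild type.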

@Muedi
Contributor

Muedi commented Aug 6, 2023

Is the code that's adapted to ProteinGym available too?

Then I'll gladly take that and put the script together, perhaps with a true/false flag if we want to get zero shot results or fine-tuned?
Otherwise I'll adapt what you linked already :)

@pascalnotin
Collaborator Author

Sounds good regarding the binary flag. Our code is not yet open source, but the ESM script should be good enough for testing things out (95% of assay sequences are below the 1023 threshold). Note that this script will only be relevant for masked language models, not the AR models we intend to train. So we will need another script parameter to choose the model type (ESM vs AR), and the code will then use the relevant zero-shot scoring function/utils.
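A rough sketch of how the two parameters could be wired together. The flag names (`--model_type`, `--zero_shot`, `--assay`) and the scorer functions are hypothetical placeholders, not the interface of the script that was eventually merged.

```python
import argparse
from typing import Callable, Dict, List, Optional


def esm_zero_shot(assay: str) -> str:
    # Placeholder: a real version would run masked-marginal scoring.
    return f"masked-marginal scoring for {assay}"


def ar_zero_shot(assay: str) -> str:
    # Placeholder: a real version would use an AR-specific scoring utility.
    return f"autoregressive scoring for {assay}"


# Dispatch table: model type -> zero-shot scoring function.
ZERO_SHOT_SCORERS: Dict[str, Callable[[str], str]] = {
    "esm": esm_zero_shot,
    "ar": ar_zero_shot,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ProteinGym eval (sketch)")
    parser.add_argument("--assay", required=True, help="assay identifier")
    parser.add_argument("--model_type", choices=sorted(ZERO_SHOT_SCORERS),
                        default="esm", help="which scoring utils to use")
    parser.add_argument("--zero_shot", action="store_true",
                        help="zero-shot scoring instead of fine-tuned eval")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    if not args.zero_shot:
        raise NotImplementedError("fine-tuned evaluation is not sketched here")
    return ZERO_SHOT_SCORERS[args.model_type](args.assay)


result = main(["--assay", "ASSAY_A", "--zero_shot", "--model_type", "ar"])
```

Keeping the scorers in a dictionary means a new model family only needs one extra entry, and `argparse` rejects unknown model types automatically.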

@Muedi
Contributor

Muedi commented Aug 13, 2023

@pascalnotin Hi, to implement the script you provided, we need the base sequence.
Do some experiments have multiple base seqs?
If not, I have already written a function that returns the base seqs for all experiments during preprocessing :)

@pascalnotin
Collaborator Author

Hi @Muedi -- each assay mutates a single reference sequence. We have a reference file with all these ref sequences in the repo (these can't be inferred from the mutants, as the mutated range is sometimes just a subset of the full protein).
Link to reference file: https://github.com/OATML-Markslab/ProteinGym/blob/main/ProteinGym_reference_file_substitutions.csv
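As a sketch of how the reference file could be used: the snippet below maps each assay to its reference sequence and flags sequences over the 1023 threshold mentioned earlier in the thread. The column names `DMS_id` and `target_seq` are assumptions to check against the actual CSV header, and the in-memory sample stands in for the downloaded file.

```python
import csv
import io
from typing import Dict, IO, List


def load_reference_sequences(handle: IO[str],
                             id_col: str = "DMS_id",
                             seq_col: str = "target_seq") -> Dict[str, str]:
    """Map assay id -> reference sequence (column names are assumptions)."""
    return {row[id_col]: row[seq_col] for row in csv.DictReader(handle)}


def too_long_for_context(refs: Dict[str, str], max_len: int = 1023) -> List[str]:
    """Assay ids whose reference sequence exceeds the given length limit."""
    return sorted(a for a, seq in refs.items() if len(seq) > max_len)


# Made-up sample rows standing in for the real reference file.
SAMPLE = "DMS_id,target_seq\nASSAY_A,MKTAYIAK\nASSAY_B," + "A" * 1500 + "\n"

refs = load_reference_sequences(io.StringIO(SAMPLE))
long_assays = too_long_for_context(refs)
```

For the real file, the same function should work with `open(path, newline="")` in place of the `io.StringIO` sample.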

@Muedi
Contributor

Muedi commented Aug 17, 2023

Ok, thanks, will revert the respective function then and instead download this file too :)

@Muedi
Contributor

Muedi commented Aug 21, 2023

/take

@pascalnotin
Collaborator Author

Closing this issue as it was addressed by PR#50 - thank you @Muedi !
