# ProteinGym Exploration

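The column descriptions below are copied from the `reference_files_description.md` file in the ProteinGym dataset repository on Hugging Face, found [here](https://huggingface.co/datasets/OATML-Markslab/ProteinGym/blob/main/reference_files_description.md).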
## ProteinGym reference files

In the reference files, we provide detailed information about all DMS assays included in ProteinGym. There are two reference files: one for the substitution benchmark and one for the indel benchmark.

The meaning of each column in the ProteinGym reference files is provided below:
- DMS_id (str): Uniquely identifies each DMS assay in ProteinGym. It is obtained as the concatenation of the UniProt ID of the mutated protein, the first author name and the year of publication. If there are several datasets with the same characteristics, another defining attribute of the assay is added to preserve uniqueness.
- DMS_filename (str): Name of the processed DMS file.
- target_seq (str): Sequence of the target protein (reference sequence mutated in the assay).
- seq_len (int): Length of the target protein sequence.
- includes_multiple_mutants (bool): Indicates whether the DMS contains mutations that are multiple mutants. Substitution benchmark only.
- DMS_total_number_mutants (int): Number of rows of the DMS in ProteinGym.
- DMS_number_single_mutants (int): Number of single amino acid substitutions in the DMS. Substitution benchmark only.
- DMS_number_multiple_mutants (int): Number of multiple amino acid substitutions in the DMS. Substitution benchmark only.
- DMS_binarization_cutoff_ProteinGym (float): Cutoff used to divide fitness scores into binary labels.
- DMS_binarization_method (str): Method used to decide the binarization cutoff (manual or median).
- region_mutated (str): Region of the target protein that is mutated in the DMS.
- MSA_filename (str): Name of the MSA file generated based on the reference sequence mutated during the DMS experiment. Note that different reference sequences may be used in different DMS experiments for the same protein. For example, Giacomelli et al. (2018) and Kotler et al. (2018) used slightly different reference sequences in their respective DMS experiments for the P53 protein. We generated different MSAs accordingly.
- MSA_start (int): Locates the beginning of the first sequence in the MSA with respect to the target sequence. For example, if the MSA covers from position 10 to position 60 of the target sequence, then MSA_start is 10.
- MSA_end (int): Locates the end of the first sequence in the MSA with respect to the target sequence. For example, if the MSA covers from position 10 to position 60 of the target sequence, then MSA_end is 60.
- MSA_bitscore (float): Bitscore threshold used to generate the alignment divided by the length of the target protein.
- MSA_theta (float): Hamming distance cutoff for sequence re-weighting.
- MSA_num_seqs (int): Number of sequences in the Multiple Sequence Alignment (MSA) used in this work for this DMS.
- MSA_perc_cov (float): Percentage of positions of the MSA that had a coverage higher than 70% (less than 30% gaps).
- MSA_num_cov (int): Number of positions of the MSA that had a coverage higher than 70% (less than 30% gaps).
- MSA_N_eff (float): The effective number of sequences in the MSA defined as the sum of the different sequence weights.
- MSA_N_eff_L (float): MSA_N_eff / MSA_num_cov.
- MSA_num_significant (int): Number of evolutionary couplings that are considered significant. Significance is defined by having more than 90% probability of belonging to the log-normal distribution in a Gaussian Mixture Model of normal and log-normal distributions.
- MSA_num_significant_L (float): MSA_num_significant / MSA_num_cov.
- raw_DMS_filename (str): Name of the raw DMS file.
- raw_DMS_phenotype_name (str): Name of the column in the raw DMS that we used as fitness score.
- raw_DMS_directionality (int): Sign of the correlation between the DMS_phenotype column values and protein fitness in the raw DMS files. In any given DMS, the directionality is 1 if higher values of the measurement are associated with higher fitness, and -1 otherwise. For simplicity, we adjusted directionality in the final ProteinGym benchmarks so that a higher value of DMS_score is always associated with higher fitness. Consequently, correlations between model scores and the final DMS_score values should always be positive (unless the predictions from the considered model are worse than random for that DMS).
- raw_DMS_mutant_column (str): Name of the column in the raw DMS that indicates which mutants were assayed.
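
To make a few of the column definitions above concrete, here is a minimal sketch of how `MSA_start`/`MSA_end`, `raw_DMS_directionality`, and `DMS_binarization_cutoff_ProteinGym` might be applied. It assumes positions are 1-indexed and inclusive, and that scores strictly above the cutoff are labeled 1; neither convention is stated in the descriptions above. The function names and toy values are hypothetical and not part of ProteinGym.

```python
from typing import List


def msa_covered_region(target_seq: str, msa_start: int, msa_end: int) -> str:
    """Return the part of target_seq covered by the MSA.

    Assumes MSA_start and MSA_end are 1-indexed and inclusive.
    """
    return target_seq[msa_start - 1 : msa_end]


def orient_scores(raw_scores: List[float], directionality: int) -> List[float]:
    """Flip raw scores when needed so that higher always means fitter."""
    if directionality not in (1, -1):
        raise ValueError("directionality must be 1 or -1")
    return [directionality * s for s in raw_scores]


def binarize(scores: List[float], cutoff: float) -> List[int]:
    """Label scores above the cutoff as 1 and the rest as 0 (assumed convention)."""
    return [1 if s > cutoff else 0 for s in scores]


# Toy values, made up for illustration only
toy_seq = "MKTAYIAKQR"
region = msa_covered_region(toy_seq, msa_start=3, msa_end=6)
oriented = orient_scores([0.5, -1.2, 2.0], directionality=-1)
labels = binarize(oriented, cutoff=0.0)
```

With a directionality of -1 the raw values are negated before the cutoff is applied, which mirrors the adjustment described for `raw_DMS_directionality`.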

## Code

In [21]:
# system dependencies
import os

# library dependencies
from datasets import (
    get_dataset_config_names,
    get_dataset_split_names,
    list_datasets,
    load_dataset,
    load_dataset_builder,
)
from tqdm import tqdm
import numpy as np
import pandas as pd

# local dependencies

In [9]:
datasets_list = list_datasets()

In [11]:
# print(', '.join(dataset for dataset in datasets_list))

In [7]:
# let's see if we can download the dataset
dataset = load_dataset("OATML-Markslab/ProteinGym", split="train", cache_dir="../tmp/hf_cache/", data_dir="../data/gym/")

Repo card metadata block was not found. Setting CardData to empty.
Resolving data files: 100%|██████████| 96/96 [00:00<00:00, 882.38it/s]
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading data: 100%|██████████

DatasetGenerationError: An error occurred while generating the dataset

In [13]:
dataset = load_dataset("OATML-Markslab/ProteinGym", split="train")

Repo card metadata block was not found. Setting CardData to empty.
Resolving data files: 100%|██████████| 96/96 [00:00<00:00, 863.66it/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 68.30it/s]
Extracting data files: 100%|██████████

DatasetGenerationError: An error occurred while generating the dataset

Hmmm. Both calls fail with `DatasetGenerationError`, with or without the custom `cache_dir` and `data_dir`, and I am not sure what the source of the issue is.
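
A wrapper error like `DatasetGenerationError` may carry the underlying failure through Python's standard exception chaining. A minimal sketch of walking that chain follows; `exception_chain` is a hypothetical helper, and the simulated error only stands in for the real `load_dataset` failure, whose actual cause is not shown in the output above.

```python
from typing import List


def exception_chain(exc: BaseException) -> List[str]:
    """Describe exc and every exception linked via __cause__ or __context__."""
    chain = []
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(f"{type(current).__name__}: {current}")
        # Prefer the explicit cause ("raise ... from ..."), else the implicit context
        current = current.__cause__ or current.__context__
    return chain


# Simulated failure standing in for the real load_dataset call
try:
    try:
        raise ValueError("bad column type")
    except ValueError as inner:
        raise RuntimeError("An error occurred while generating the dataset") from inner
except RuntimeError as outer:
    chain = exception_chain(outer)
```

In the notebook, the same helper could be called inside an `except Exception as e:` block wrapped around the `load_dataset` call.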

Let's inspect the dataset.

In [15]:
ds_builder = load_dataset_builder("OATML-Markslab/ProteinGym")

Repo card metadata block was not found. Setting CardData to empty.
Resolving data files: 100%|██████████| 96/96 [00:00<00:00, 240533.56it/s]


In [16]:
ds_builder.info.description

''

In [17]:
ds_builder.info.features

In [20]:
get_dataset_split_names("OATML-Markslab/ProteinGym")

Repo card metadata block was not found. Setting CardData to empty.
Resolving data files: 100%|██████████| 96/96 [00:00<00:00, 805.32it/s]


['train']

In [22]:
configs = get_dataset_config_names("OATML-Markslab/ProteinGym")
print(configs)

Repo card metadata block was not found. Setting CardData to empty.
Resolving data files: 100%|██████████| 96/96 [00:00<00:00, 860.62it/s]


['default']


I will just download it manually lol

In [23]:
from datasets import DownloadManager

In [24]:
# manually curate csv files (the following are just from the indel folder)
# in principle, you can do this for substitutions as well
# I also avoided human + yeast proteins
csv_files = [
    "A0A1J4YT16_9PROT_Davidi_2020.csv",  # Replace with actual file names if known
    "B1LPA6_ECOSM_Russ_2020.csv",
    "BLAT_ECOLX_Gonzalez_indels_2019.csv",
    "CAPSD_AAV2S_Sinai_indels_2021.csv"
]

In [25]:
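# Base URL for the ProteinGym_indels folder
# Note: `/resolve/` serves the file contents, whereas `/raw/` can return a Git LFS pointer for LFS-tracked files
base_url = "https://huggingface.co/datasets/OATML-Markslab/ProteinGym/resolve/main/ProteinGym_indels/"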

In [None]:
# DownloadManager takes no `cache_dir` argument; the cache location is set through DownloadConfig
from datasets import DownloadConfig

# Specify a directory to store the downloaded data
download_dir = "../data/gym/"

# Initialize the download manager
download_config = DownloadConfig(cache_dir=download_dir)
download_manager = DownloadManager(dataset_name="ProteinGym", download_config=download_config)

# Attempt to download each csv file, recording either the local path or the error message
downloaded_paths = {}

for file_name in csv_files:
    data_url = base_url + file_name
    try:
        downloaded_file_path = download_manager.download(data_url)
        downloaded_paths[file_name] = downloaded_file_path
    except Exception as e:
        downloaded_paths[file_name] = f"Error: {e}"
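
Because the loop above stores error strings in the same dictionary as file paths, it helps to separate the two before reading anything. The sketch below does that split; `split_download_results` is a hypothetical helper, it relies on the `"Error: "` prefix used in the loop, and the example paths are made up.

```python
from typing import Dict, Tuple


def split_download_results(
    results: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate successful downloads (local paths) from failures (error messages)."""
    prefix = "Error: "
    succeeded, failed = {}, {}
    for file_name, value in results.items():
        text = str(value)
        if text.startswith(prefix):
            # Keep only the message, without the prefix added by the download loop
            failed[file_name] = text[len(prefix):]
        else:
            succeeded[file_name] = text
    return succeeded, failed


# Made-up results mimicking the shape of `downloaded_paths`
example_results = {
    "a.csv": "/tmp/cache/a.csv",
    "b.csv": "Error: 404 Client Error",
}
succeeded, failed = split_download_results(example_results)
```

The paths in `succeeded` could then be passed to `pd.read_csv`, while `failed` shows which files need another attempt.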