### Preprocess the data
The data were obtained from the Stanford HIV Drug Resistance Database (HIVDB).
It is a genotype-phenotype correlation dataset of isolates on which in vitro susceptibility tests were performed using the PhenoSense assay. This notebook studies the protease inhibitor (PI) resistance dataset.
The data can be re-downloaded with the command below.

curl -o PI_dataset.txt https://hivdb.stanford.edu/download/GenoPhenoDatasets/PI_DataSet.txt

In [1]:
# Installing the required packages
%pip install pandas
%pip install numpy
%pip install transformers
%pip install torch

Note: you may need to restart the kernel to use updated packages.


In [5]:
import pandas as pd

# Define the consensus amino acid sequence of the protease (99 residues)
protease_consensus = "PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF"

# Load the dataset from the tab-separated text file downloaded above
df = pd.read_csv("PI_dataset.txt", sep="\t")

# Collect the per-position sequence columns.
# We keep columns whose names start with 'P' and are followed only by digits,
# which assumes the sequence columns are named 'P1', 'P2', ..., 'P99'.
p_columns = [col for col in df.columns if col.startswith("P") and col[1:].isdigit()]

# Sort the columns in numerical order (P1, P2, ..., P99)
p_columns = sorted(p_columns, key=lambda x: int(x[1:]))

# Check that we have exactly 99 positions (P1 to P99)
# This should be the case as protease is a homodimer with each subunit having 99 amino acids 
if len(p_columns) != 99:
    print(f"Warning: Expected 99 sequence positions but found {len(p_columns)}")

def concatenate_sequence(row, columns, consensus_seq):
    """
    Concatenate amino acid columns into a single sequence string.
    Handling missing/ambiguous symbols (from the dataset description):
      - If a cell is NaN, replace with 'X' (unknown).
      - If the value is '.', replace with 'X' (unknown, no sequence data).
      - If the value is '-', replace with the corresponding consensus residue.
    """
    seq_list = []
    # Enumerate over the sorted columns so we know the position (0-indexed)
    for idx, col in enumerate(columns):
        aa = row[col]
        if pd.isna(aa):
            # If the value is NaN, mark as unknown
            aa = 'X'
        else:
            aa = str(aa).strip()
            # Replace '.' with unknown, and '-' with the consensus residue
            if aa == '.':
                aa = 'X'
            elif aa == '-':
                aa = consensus_seq[idx]
        seq_list.append(aa)
    
    # Join the amino acids into a continuous string (without spaces)
    raw_seq = ''.join(seq_list)
    
    # Insert spaces between amino acids, the input format expected by the ProtBert tokenizer
    formatted_seq = " ".join(raw_seq)
    
    return formatted_seq

# Apply the function to each row to create a new column with the formatted sequence
df["FormattedSequence"] = df.apply(lambda row: concatenate_sequence(row, p_columns, protease_consensus), axis=1)

# Display a sample formatted sequence
print("Sample Formatted Sequence:")
print(df.loc[0, "FormattedSequence"])

Sample Formatted Sequence:
P Q I T L W Q R P L V T I K I G G Q L K E A L L D T G A D N T V L E E M N L P G R W K P K M I G G I G G F I K V G Q Y D Q I L I E I C G H K A I G T V L V G P T P V N I I G R D L L T Q I G C T L N F
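
The function above assumes that each position cell holds exactly one character. If a cell held anything else, for example several letters or a symbol other than '.' and '-', the concatenated sequence would no longer have one character per position. Below is a minimal sketch of a stricter per-cell mapping. `resolve_cell`, `STANDARD_AA`, and the toy data frame are hypothetical names introduced for illustration, and collapsing every unrecognised value to 'X' is an assumption rather than the dataset's prescribed handling.

```python
import pandas as pd

# The 20 standard amino acid one-letter codes
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def resolve_cell(value, consensus_residue):
    """Map one position cell to exactly one character (hypothetical helper)."""
    if pd.isna(value):
        return "X"
    value = str(value).strip()
    if value == "-":
        # '-' means the isolate matches the consensus at this position
        return consensus_residue
    if len(value) == 1 and value in STANDARD_AA:
        return value
    # Assumption: anything else ('.', multi-letter cells, other symbols) is unknown
    return "X"

# Toy data with 4 positions, NOT taken from the real dataset
toy_consensus = "PQIT"
toy_columns = ["P1", "P2", "P3", "P4"]
toy = pd.DataFrame({
    "P1": ["-", "P"],
    "P2": ["-", "KR"],
    "P3": [".", "I"],
    "P4": ["T", "~"],
})

toy_sequences = [
    "".join(resolve_cell(row[col], toy_consensus[idx])
            for idx, col in enumerate(toy_columns))
    for _, row in toy.iterrows()
]
print(toy_sequences)
```

With this mapping every sequence keeps the same length as the consensus, so a simple `len(seq) == len(consensus)` check can flag rows that were built incorrectly.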


In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
sample_encoding = tokenizer(
    df.loc[0, "FormattedSequence"],
    max_length=128,  # Must cover sequence length plus special tokens (a 99-residue sequence + [CLS] + [SEP] = 101)
    padding='max_length',
    truncation=True,
    return_tensors="pt"
)
print(sample_encoding.input_ids.shape)

torch.Size([1, 128])
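
The `max_length` value can be derived from the data rather than guessed. The sketch below counts the whitespace-separated residues in each formatted sequence and adds the special tokens. It assumes one token per residue and two special tokens per sequence ([CLS] and [SEP]), which is the usual BERT convention. `required_max_length` is a hypothetical helper, and the example sequences are toy inputs rather than rows from the dataset.

```python
def required_max_length(formatted_sequences, num_special_tokens=2):
    """Smallest max_length that avoids truncating any sequence.

    Assumes each whitespace-separated residue becomes one token and that
    the tokenizer adds `num_special_tokens` tokens per sequence.
    """
    longest = max(len(seq.split()) for seq in formatted_sequences)
    return longest + num_special_tokens

# Toy inputs: a 10-residue and a 5-residue sequence, space-separated
example_sequences = [" ".join("PQITLWQRPL"), " ".join("PQITL")]
needed = required_max_length(example_sequences)
print(needed)
```

In the notebook, the same function could be applied to `df["FormattedSequence"]` to confirm that 128 is large enough before tokenizing the full dataset.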
