# Table of Contents

0. <a href="#sec0">Dependencies</a>
1. <a href="#sec1">Reproducing the Paper</a>
2. <a href="#sec2">Training the Model on Custom Datasets</a>
3. <a href="#sec3">Inference using Trained Model</a>

<a id="sec0"></a>
## 0. Dependencies

### Python Packages
The first step after cloning this repository is to download and install the necessary Python libraries/packages. Install the required packages by running the following cell.

In [1]:
%pip install -r requirements.txt

Collecting numpy==1.25.2 (from -r requirements.txt (line 1))
  Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting torch==2.0.1 (from -r requirements.txt (line 2))
  Downloading torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting scikit-learn==1.3.0 (from -r requirements.txt (line 3))
  Downloading scikit_learn-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting transformers==4.31.0 (from -r requirements.txt (line 4))
  Downloading transformers-4.31.0-py3-none-any.whl.metadata (116 kB)
Collecting tqdm==4.66.0 (from -r requirements.txt (line 5))
  Downloading tqdm-4.66.0-py3-none-any.whl.metadata (57 kB)
Collecting wandb==0.15.8 (from -r requirements.txt (line 6))
  Downloading wandb-0.15.8-py3-none-any.whl.metadata (8.3 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch==2.0.1->-r requirements.txt (line 2))
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py

<a id="sec1"></a>
## 1. Reproducing the Paper

### 1.1 Download Data
The next step is to download the data used to train/evaluate the models. Running the following command will download all 3 datasets, and convert their encodings so that they can be used by PeptideBERT.

In [1]:
!python ./data/download_data.py

### 1.2 Train-Val-Test Split
Now, we want to combine the positive and negative samples (downloaded by the above cell), shuffle them and split them into 3 non-overlapping sets - train, validation, and test.

To do so, run the following cell. This will create sub-directories (inside the `data` directory) for each dataset and place the subsets (train, validation, test) inside them.

Additionally, if you want to augment any dataset, you can do so by editing the `./data/split_augment.py` file. Call the `augment_data` function from the `main` function with the dataset that you want to augment. For example, to augment the `solubility` dataset, add `augment_data('sol')` to the `main` function.

Further, to experiment with the augmentation techniques applied, edit the `augment_data` function. Comment or uncomment the calls to the augmentation functions (such as `random_replace`, `random_delete`, etc.) and change the augmentation factor as desired. Keep in mind that each augmentation applied must be followed by a call to the `combine` function. For example, to apply the `random_swap` augmentation with a `factor` of 0.2, add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)` to merge the augmented dataset into the original dataset.
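The augment-then-combine pattern described above can be illustrated on toy data. This is only a sketch: `swap_augment` and `merge` are hypothetical stand-ins written for this example, not the repository's `random_swap` and `combine`, and the meaning given to `factor` here (fraction of positions swapped per sequence) is an assumption.

```python
import random

def swap_augment(inputs, labels, factor, seed=0):
    """Hypothetical stand-in for an augmentation such as `random_swap`.

    Assumption: `factor` is the fraction of positions swapped per sequence.
    """
    rng = random.Random(seed)
    new_inputs = []
    for seq in inputs:
        seq = list(seq)  # copy, so the original sequence is left untouched
        n_swaps = int(len(seq) * factor)
        for _ in range(n_swaps):
            i, j = rng.randrange(len(seq)), rng.randrange(len(seq))
            seq[i], seq[j] = seq[j], seq[i]
        new_inputs.append(seq)
    # Swapping residues does not change the label of a sequence.
    return new_inputs, list(labels)

def merge(inputs, labels, new_inputs, new_labels):
    """Hypothetical stand-in for `combine`: append augmented samples."""
    return inputs + new_inputs, labels + new_labels

inputs = [[6, 7, 8, 9, 10], [5, 11, 12, 13]]
labels = [1, 0]
new_inputs, new_labels = swap_augment(inputs, labels, 0.2)
all_inputs, all_labels = merge(inputs, labels, new_inputs, new_labels)
print(len(all_inputs), all_labels)
```

Each augmentation produces a second copy of the data, so the merge step is what actually grows the training set.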

In [None]:
!python ./data/split_augment.py
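For intuition, the combine, shuffle, and split step can be sketched in a few lines of plain Python. The function name `split_three_ways`, the 10% validation and test fractions, and the seed are assumptions made for this illustration; the actual logic lives in `./data/split_augment.py`.

```python
import random

def split_three_ways(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle samples and cut them into train/val/test lists with no overlap."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Positive and negative samples are combined before shuffling.
positives = [(f"pos{i}", 1) for i in range(60)]
negatives = [(f"neg{i}", 0) for i in range(40)]
train, val, test = split_three_ways(positives + negatives)
print(len(train), len(val), len(test))  # 80 10 10
```

Because the three lists are slices of one shuffled list, no sample can appear in more than one subset.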

### 1.3 Model Config
Edit the `config.yaml` file and set the `task` parameter to one of `hemo` (for hemolysis dataset), `sol` (for solubility dataset), or `nf` (for non-fouling dataset) as desired.

Additionally, if you want to tweak the model before training, you can do so by editing the `./model/network.py` and `config.yaml` files. `./model/network.py` contains the actual architecture of the model as well as the optimizer and scheduler used to train the model. `config.yaml` contains all the hyperparameters used for training, as well as which dataset to train on.

### 1.4 Training
Now we are ready to train our model. Run the following cell to start the training procedure. This will save a checkpoint of the best model (on the validation set) inside the `checkpoints` directory.

In [None]:
!python train.py

<a id="sec2"></a>
## 2. Training the Model on Custom Datasets

Follow the cells below to train the model on custom datasets.

Clean the datasets to keep only the character sequences (`sequence`) and the label indicating whether each peptide is an ACP or a non-ACP (`label`).

In [5]:
import pandas as pd
acps = pd.read_csv("../datasets/features_acps.csv",on_bad_lines='skip')
non_acps = pd.read_csv("../datasets/features_non_acps.csv",on_bad_lines='skip')
acps_limpio=pd.DataFrame(acps["sequence"])
acps_limpio['label'] = 1
non_acps_limpio=pd.DataFrame(non_acps["sequence"])
non_acps_limpio['label'] = 0


In [17]:
cpps = pd.read_csv("../datasets/features_cpps.csv",on_bad_lines='skip')
non_cpps = pd.read_csv("../datasets/features_non_cpps.csv",on_bad_lines='skip')
cpps_limpio=pd.DataFrame(cpps["sequence"])
cpps_limpio['label'] = 1
non_cpps_limpio=pd.DataFrame(non_cpps["sequence"])
non_cpps_limpio['label'] = 0
datos_cpps = pd.concat([cpps_limpio,non_cpps_limpio], ignore_index=True)
datos_cpps.to_csv("../datasets/features_all_cpps.csv", index=False)
datos_cpps

Unnamed: 0,sequence,label
0,PKKGSKKAVTKAQKKDGA,1
1,MAPTKRKGSCPGAAPNKKP,1
2,RFTFHFRFEFTFHFE,1
3,DWLKAFYDKVAEKLKEAF,1
4,YGDCLPHLKLCKENKDCCSKKCKRRGTNIEKRCR,1
...,...,...
5848,PTDHFIDVATYRSQEWRIAEYLG,0
5849,KRMWFVNFIRHKSCWMTIKYWSIMRIHNCR,0
5850,KCVINNEHDCNYELLR,0
5851,ADWHDVKNPRLMLPDFGAHGEYFTVKNGH,0


In [21]:
non_cpps_limpio

Unnamed: 0,sequence,label
0,FLGALFKVASKVLPSVFCAITKKC,0
1,AITFYPFAPNQITCIHE,0
2,IKRYLIKR,0
3,CLGSGEQCVRDTSCCSMSCTNNICF,0
4,FITKALGISYGRKKRRQS,0
...,...,...
4363,PTDHFIDVATYRSQEWRIAEYLG,0
4364,KRMWFVNFIRHKSCWMTIKYWSIMRIHNCR,0
4365,KCVINNEHDCNYELLR,0
4366,ADWHDVKNPRLMLPDFGAHGEYFTVKNGH,0


In [8]:
datos_acps = pd.concat([acps_limpio,non_acps_limpio], ignore_index=True)

In [11]:
datos_acps.to_csv("../datasets/features_all_acps.csv", index=False)
datos_acps

Unnamed: 0,sequence,label
0,AACARFIDDFCDTLTPNIYRPRDNGQRCYAVNGHRCDFTVFNTNNG...,1
1,AACSDRAHGHICESFKSFCKDSGRNGVKLRANCKKTCGLC,1
2,AAKKWAKAKWAKAKKWAKAA,1
3,AAKPMGITCDLLSLWKVGHAACAAHCLVLGDVGGYCTKEGLCVCKE,1
4,AALKGCWTKSIPPKPCFGKR,1
...,...,...
7931,YWSKHMVKCEIA,0
7932,YYAPESAEAAPLVAVLTSDGWETQWPLPEA,0
7933,YYFFRGHVYGDFDDGERFAFFQLAAIEAMERIAFIP,0
7934,YYFYLNKYERYELRRSKIHAHNPPCI,0


### 2.1 Data Preparation

Create a `csv` file with the following format:
```csv
sequence,label
AAAAAAA,1
LLLLLLL,0
CCCCCCC,0
DDDDDDD,1
```
where `sequence` is the peptide sequence and `label` is the binary label (0 or 1). Save this file as `custom_data.csv` inside the `data` directory. Now, run the following cell to convert the `csv` file to the format required by PeptideBERT. Edit `task_name` as desired, and point the `open(...)` path at your file (the cell as shown reads `../datasets/features_all_cpps.csv`).

In [18]:
import numpy as np

task_name = 'cpps'

# read data
seqs, labels = [], []
with open('../datasets/features_all_cpps.csv', 'r') as f:
    for line in f.readlines()[1:]:
        seq, label = line.strip().split(',')
        seqs.append(seq)
        labels.append(int(label))

MAX_LEN = max(map(len, seqs))

# convert to tokens
mapping = dict(zip(
    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',
    'A','G','V','E','S','I','K','R','D','T','P','N',
    'Q','F','Y','M','H','C','W'],
    range(30)
))

pos_data, neg_data = [], []
for i in range(len(seqs)):
    seq = [mapping[c] for c in seqs[i]] 
    seq.extend([0] * (MAX_LEN - len(seq)))  # padding to max length
    if labels[i] == 1:
        pos_data.append(seq)
    else:
        neg_data.append(seq)

pos_data = np.array(pos_data)
neg_data = np.array(neg_data)

np.savez(
    f'./data/{task_name}-positive.npz',
    arr_0=pos_data
)
np.savez(
    f'./data/{task_name}-negative.npz',
    arr_0=neg_data
)
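The cell above looks up every character with `mapping[c]`, which raises a `KeyError` for any character outside the 20 residues in the vocabulary (for example `X`). A minimal sketch of a more tolerant encoder follows, assuming it is acceptable to map such characters to `[UNK]`. The helper name `encode` is hypothetical and not part of PeptideBERT, and how well the model handles `[UNK]` tokens is not verified here.

```python
# Same vocabulary order as in the cell above, so the ids are identical.
VOCAB = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'L',
         'A', 'G', 'V', 'E', 'S', 'I', 'K', 'R', 'D', 'T', 'P', 'N',
         'Q', 'F', 'Y', 'M', 'H', 'C', 'W']
mapping = {token: idx for idx, token in enumerate(VOCAB)}

def encode(seq, max_len):
    """Map a peptide string to token ids, padded with [PAD] up to max_len."""
    # Assumption: characters missing from the vocabulary become [UNK].
    ids = [mapping.get(c, mapping['[UNK]']) for c in seq.strip()]
    if len(ids) > max_len:
        raise ValueError(f"sequence of length {len(ids)} exceeds max_len={max_len}")
    return ids + [mapping['[PAD]']] * (max_len - len(ids))

print(encode("AXL", 5))  # [6, 1, 5, 0, 0]
```

Raising on over-long sequences is a design choice for this sketch; silently truncating would hide data problems.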

### 2.2 Train-Val-Test Split
Now, we want to combine the positive and negative samples, shuffle them and split them into 3 non-overlapping sets - train, validation, and test.

To do so, edit the `main` function inside the `./data/split_augment.py` file (comment out the existing calls to `split_data` and add the line `split_data('REPLACE_WITH_TASK_NAME')`) and run the following cell. This will create a sub-directory (inside the `data` directory) for the custom dataset and place the subsets (train, validation, test) inside it.

Additionally, if you want to augment the dataset, you can do so by editing the `./data/split_augment.py` file. Call the `augment_data` function from the `main` function like so: `augment_data('REPLACE_WITH_TASK_NAME')`.

Further, to experiment with the augmentation techniques applied, edit the `augment_data` function. Comment or uncomment the calls to the augmentation functions (such as `random_replace`, `random_delete`, etc.) and change the augmentation factor as desired. Keep in mind that each augmentation applied must be followed by a call to the `combine` function. For example, to apply the `random_swap` augmentation with a `factor` of 0.2, add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)` to merge the augmented dataset into the original dataset.

In [19]:
!python ./data/split_augment.py

### 2.3 Model Config
Edit the `config.yaml` file and set the `task` parameter to `REPLACE_WITH_TASK_NAME`.

Additionally, if you want to tweak the model before training, you can do so by editing the `./model/network.py` and `config.yaml` files. `./model/network.py` contains the actual architecture of the model as well as the optimizer and scheduler used to train the model. `config.yaml` contains all the hyperparameters used for training, as well as which dataset to train on.

### 2.4 Training
Now we are ready to train our model. Run the following cell to start the training procedure. This will save a checkpoint of the best model (on the validation set) inside the `checkpoints` directory.

In [16]:
!python train.py

Device: cuda

Batch size:  32
Train dataset samples:  6427
Validation dataset samples:  715
Test dataset samples:  794
Train dataset batches:  201
Validation dataset batches:  23
Test dataset batches:  25

pytorch_model.bin: 100%|███████████████████| 1.68G/1.68G [02:27<00:00, 11.4MB/s]
[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice: ^C
Traceback (most recent call last):
  File "/home/drojas/PeptideBERT/PeptideBERT-master/train.py", line 61, in <module>
    wandb.init(project='PeptideBERT', name=run_name)
  File "/home/drojas/.venv/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
    raise e
  File "/home/drojas/.venv/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1143, in init
    wi.setup(kwargs)
  File "/home/drojas/.venv/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 289, in setup
    wandb_login
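The run above stopped at the interactive W&B prompt. One way to avoid the prompt in a notebook is to set the `WANDB_MODE` environment variable before launching training. This is a sketch that assumes `train.py` calls `wandb.init` without overriding the mode itself; that has not been checked against the script.

```python
import os

# Assumption: train.py does not set the W&B mode explicitly, so this
# environment variable decides how wandb.init() behaves.
# "offline" logs to local files only; "disabled" turns logging off entirely.
os.environ["WANDB_MODE"] = "offline"
print(os.environ["WANDB_MODE"])  # offline
```

Processes started afterwards from the same kernel, such as `!python train.py`, inherit this variable.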



<a id="sec3"></a>
## 3. Inference using Trained Model

### 3.1 Load Trained Model
Load the trained model by running the following cell. Edit the `run_name` parameter to the name of the directory containing the trained model (inside the `checkpoints` directory).

In [77]:
import torch
import yaml
from model.network import create_model

run_name = 'acps-0123_2212'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


config = yaml.load(open('./config.yaml', 'r'), Loader=yaml.FullLoader)
config['device'] = device

model_acps = create_model(config)
model_acps.load_state_dict(torch.load(f'./checkpoints/{run_name}/model.pt')['model_state_dict'], strict=False)



<All keys matched successfully>

In [78]:
run_name = 'cpps-0123_2244'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


config = yaml.load(open('./config.yaml', 'r'), Loader=yaml.FullLoader)
config['device'] = device

model_cpps = create_model(config)
model_cpps.load_state_dict(torch.load(f'./checkpoints/{run_name}/model.pt')['model_state_dict'], strict=False)

<All keys matched successfully>

### 3.2 Input Data
Create a text file containing peptide sequences in the following format:
```txt
AAAAAAA
LLLLLLL
CCCCCCC
DDDDDDD
```
where each line represents a peptide sequence. Save this file as `input.txt` inside the `data` directory and run the following cell. The corresponding predictions will be saved in `output.txt` file inside the `data` directory.

Predictions are made on the ACP dataset using the CPP model, and on the CPP dataset using the model trained on ACPs.

In [None]:
seqs = []
with open('./data/input.txt', 'r') as f:
    for line in f.readlines():
        seq = line.strip()
        seqs.append(seq)

MAX_LEN = max(map(len, seqs))

# convert to tokens
mapping = dict(zip(
    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',
    'A','G','V','E','S','I','K','R','D','T','P','N',
    'Q','F','Y','M','H','C','W'],
    range(30)
))

for i in range(len(seqs)):
    seqs[i] = [mapping[c] for c in seqs[i]] 
    seqs[i].extend([0] * (MAX_LEN - len(seqs[i])))  # padding to max length

preds = []
with torch.inference_mode():
    for i in range(len(seqs)):
        input_ids = torch.tensor([seqs[i]]).to(device)
        attention_mask = (input_ids != 0).float()
        # use one of the models loaded in 3.1 (model_acps or model_cpps)
        output = int(model_acps(input_ids, attention_mask)[0] > 0.5)
        print(output)
        preds.append(output)

with open('./data/output.txt', 'w') as f:
    for pred in preds:
        f.write(str(pred) + '\n')
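Two small steps in the cell above are easy to miss: the attention mask is derived from the padding (token id 0), and the model's score is turned into a class by thresholding at 0.5. A toy sketch without PyTorch follows; the helper names `pad_and_mask` and `to_label` are hypothetical and exist only for this illustration.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad a list of token ids and build the matching attention mask."""
    padded = token_ids + [pad_id] * (max_len - len(token_ids))
    # 1.0 marks a real residue, 0.0 marks padding the model should ignore.
    mask = [float(t != pad_id) for t in padded]
    return padded, mask

def to_label(score, threshold=0.5):
    """Turn a score in [0, 1] into a binary prediction."""
    return int(score > threshold)

padded, mask = pad_and_mask([6, 5, 23], max_len=5)
print(padded)  # [6, 5, 23, 0, 0]
print(mask)    # [1.0, 1.0, 1.0, 0.0, 0.0]
print(to_label(0.73), to_label(0.5))  # 1 0
```

Note that a score of exactly 0.5 maps to class 0, because the comparison is strict.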

In [42]:
# read data
seqs, labels = [], []
with open('../datasets/features_all_cpps.csv', 'r') as f:
    for line in f.readlines()[1:]:
        seq, label = line.strip().split(',')
        seqs.append(seq)
        labels.append(int(label))

MAX_LEN = max(map(len, seqs))

# convert to tokens
mapping = dict(zip(
    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',
    'A','G','V','E','S','I','K','R','D','T','P','N',
    'Q','F','Y','M','H','C','W'],
    range(30)
))

for i in range(len(seqs)):
    seqs[i] = [mapping[c] for c in seqs[i]] 
    seqs[i].extend([0] * (MAX_LEN - len(seqs[i])))  # padding to max length

preds = []
with torch.inference_mode():
    for i in range(len(seqs)):
        input_ids = torch.tensor([seqs[i]]).to(device)
        attention_mask = (input_ids != 0).float()
        # ACP model applied to the CPP dataset
        output = int(model_acps(input_ids, attention_mask)[0] > 0.5)
        print(output)
        preds.append(output)

with open('./data/output.txt', 'w') as f:
    for pred in preds:
        f.write(str(pred) + '\n')

1
0
0
0
0
...
1
1
1
1


In [43]:
with open('./data/output.txt', mode='r', encoding='utf-8') as txtfile:
    lines = [line.strip() for line in txtfile]

acps_pred = pd.DataFrame(lines, columns=["acps"])

In [44]:
df=pd.concat([datos_cpps, acps_pred],axis=1)
df
df.to_csv("../datasets/pred_acps_to_cpps.csv", index=False)

In [45]:
pred_cpps_to_acps = pd.read_csv("../datasets/pred_cpps_to_acps.csv",on_bad_lines='skip')                                
pred_acps_to_cpps = pd.read_csv("../datasets/pred_acps_to_cpps.csv",on_bad_lines='skip')

In [49]:
cpps_y_acps=pred_cpps_to_acps[(pred_cpps_to_acps==1).all(axis=1)]
cpps_y_acps

Unnamed: 0,sequence,label,cpps


In [50]:
acps_y_cpps=pred_acps_to_cpps[(pred_acps_to_cpps==1).all(axis=1)]
acps_y_cpps

Unnamed: 0,sequence,label,acps
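Both filters above return empty frames. A likely reason is that `(df == 1).all(axis=1)` compares every column to 1, including the string `sequence` column, which never matches. A hedged sketch on toy data follows; `toy` is an invented stand-in for `pred_acps_to_cpps`, and only the column names are taken from the cells above.

```python
import pandas as pd

# Toy stand-in: true CPP label plus the ACP model's prediction.
toy = pd.DataFrame({
    "sequence": ["AAAA", "LLLL", "CCCC", "DDDD"],
    "label": [1, 1, 0, 0],
    "acps": [1, 0, 1, 0],
})

# Comparing the whole frame to 1 also tests the string column,
# so no row can satisfy .all(axis=1).
all_columns = toy[(toy == 1).all(axis=1)]

# Restricting the comparison to the two numeric columns keeps the
# rows that are positive in both.
both_positive = toy[(toy["label"] == 1) & (toy["acps"] == 1)]

print(len(all_columns))
print(both_positive["sequence"].tolist())
```

With the column-wise condition, the result lists the sequences labelled as CPP that the ACP model also predicts as positive.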


In [53]:
(pred_acps_to_cpps["acps"]==1).sum()

830

The fine-tuned models are used to make predictions.

In [80]:


mapping = dict(zip(
    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',
    'A','G','V','E','S','I','K','R','D','T','P','N',
    'Q','F','Y','M','H','C','W'],
    range(30)
))

seq="AACSDRAHGHICESFKSFCKDSGRNGVKLRANCKKTCGLC".strip()

seq = [mapping[c] for c in seq] 
seq.extend([0] * (MAX_LEN - len(seq)))

seq

with torch.inference_mode():
    input_ids = torch.tensor([seq]).to(device)
    attention_mask = (input_ids != 0).float()
    output = int(model_cpps(input_ids, attention_mask)[0] > 0.5)

output

0