# DBP - DNA-Binding Proteins

After cleaning the original CSV files, we converted them to FASTA format.
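The conversion step itself is not shown in this notebook. A minimal sketch of how such a conversion could look, assuming a CSV with one protein per row; the column names `id` and `sequence` and the helper `csv_to_fasta` are hypothetical and not taken from the project:

```python
import csv
import io


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence"):
    """Convert CSV text with one protein per row into FASTA text.

    The column names are assumptions; adjust them to the real CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        seq = row[seq_col].strip().upper()
        if not seq:
            continue  # skip rows without a sequence
        records.append(f">{row[id_col].strip()}\n{seq}\n")
    return "".join(records)


# Tiny made-up example with two sequences
example_csv = "id,sequence,label\nP1,MKTAYIAK,1\nP2,GSHMLE,0\n"
fasta_text = csv_to_fasta(example_csv)
print(fasta_text, end="")
```

Each record becomes a header line starting with `>` followed by the sequence on the next line, which is the layout the embedding script expects to read.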

These FASTA files are the input to the ESM [`extract.py` script](https://github.com/facebookresearch/esm/blob/main/scripts/extract.py), which efficiently extracts embeddings in bulk.

The script `scripts/extract.py` stores the embeddings in PyTorch `.pt` files (written with `torch.save`), one file per FASTA sequence.
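To read one of these files back, a sketch along the following lines may help. It assumes the layout written by the ESM `extract.py` when called with `--include mean`: a dictionary holding the sequence `label` and a `mean_representations` entry keyed by layer number. The helper names and the label `P1` are hypothetical.

```python
from pathlib import Path


def embedding_file(folder, label):
    """Return the expected .pt path for one FASTA record label."""
    return Path(folder) / f"{label}.pt"


def load_mean_embedding(pt_file, layer=33):
    """Load the mean-pooled embedding of one sequence from a .pt file."""
    import torch  # imported lazily so the path helper works without PyTorch

    record = torch.load(pt_file, map_location="cpu")
    # Assumption: with `--include mean`, one tensor per requested layer
    # is stored under the key "mean_representations".
    return record["mean_representations"][layer]


# Hypothetical label; real labels come from the FASTA header lines
print(embedding_file("dbp_train_esm1v_mean", "P1"))
```

For the 650M-parameter models used here, the returned tensor should be a vector with 1280 entries.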


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Make the project's scripts folder importable
import sys
sys.path.append('../../scripts')

# Project helper module with path and file utilities
import file_utilities as fu

Create a path for the script `extract.py`.

In [3]:
# Path for extract.py
esm_scripts_path = '/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts'
extract = os.path.join(esm_scripts_path, 'extract.py')
extract

'/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts/extract.py'

Initialize arguments

In [4]:
# Define arguments for the file_paths function
# The first 3 stay constant in this notebook
ptmodel = 'esm'
task = 'dbp'
pool = 'mean'
# The last 3 change throughout the notebook
file_base = 'train'
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33

<br>

## Train Dataset

### ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Call the function `fu.file_paths` to prepare the paths. The default root data folder is *../../data*.

In [5]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/dna_binding/train_esm.fa 
 ../../data/dna_binding/esm/train/dbp_train_esm1v_mean
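
The source of `fu.file_paths` is not shown here, so the following is only an illustrative sketch of the naming convention visible in the printed paths. The function `sketch_paths`, the task-to-folder mapping, and the rule for shortening the model name are all assumptions; the real function also returns a third value that is not reproduced.

```python
import posixpath

# Hypothetical mapping from task code to data folder (an assumption)
TASK_FOLDERS = {"dbp": "dna_binding"}


def sketch_paths(ptmodel, task, file_base, model, pool, root="../../data"):
    """Rebuild the two printed paths from the arguments (illustrative only)."""
    task_dir = posixpath.join(root, TASK_FOLDERS[task])
    model_short = model.split("_")[0]  # 'esm1v_t33_650M_UR90S_1' -> 'esm1v'
    fasta = posixpath.join(task_dir, f"{file_base}_{ptmodel}.fa")
    emb_dir = posixpath.join(
        task_dir, ptmodel, file_base, f"{task}_{file_base}_{model_short}_{pool}"
    )
    return emb_dir, fasta


sketch_pt, sketch_fa = sketch_paths(
    "esm", "dbp", "train", "esm1v_t33_650M_UR90S_1", "mean"
)
print(sketch_fa)
print(sketch_pt)
```

The embedding folder name encodes task, dataset split, model family, and pooling operation, so each run writes to its own folder.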


<br>

Run the embedding script for: `esm - dbp - train - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [6]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 



Transferred model to GPU
Read ../../data/dna_binding/train_esm.fa with 13108 sequences
Processing 1 of 1212 batches (67 sequences)
Processing 2 of 1212 batches (64 sequences)
Processing 3 of 1212 batches (61 sequences)
Processing 4 of 1212 batches (56 sequences)
Processing 5 of 1212 batches (54 sequences)
Processing 6 of 1212 batches (51 sequences)
Processing 7 of 1212 batches (50 sequences)
Processing 8 of 1212 batches (48 sequences)
Processing 9 of 1212 batches (47 sequences)
Processing 10 of 1212 batches (46 sequences)
Processing 11 of 1212 batches (45 sequences)
Processing 12 of 1212 batches (44 sequences)
Processing 13 of 1212 batches (43 sequences)
Processing 14 of 1212 batches (42 sequences)
Processing 15 of 1212 batches (41 sequences)
Processing 16 of 1212 batches (40 sequences)
Processing 17 of 1212 batches (40 sequences)
Processing 18 of 1212 batches (39 sequences)
Processing 19 of 1212 batches (39 sequences)
Processing 20 of 1212 batches (39 sequences)
Processing 21 of 1212 
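
Outside a notebook, the `%run` cell corresponds to an ordinary command-line call with the same positional arguments (model, FASTA file, output folder) and the `--repr_layers` and `--include` options. A minimal sketch using `subprocess`; the helper names are hypothetical, and the short `extract.py` path stands in for the full path built earlier:

```python
import subprocess
import sys


def build_extract_command(extract, model, path_fa, path_pt, emb_layer, pool):
    """Assemble the argument list that mirrors the %run cell."""
    return [
        sys.executable, str(extract),
        model, str(path_fa), str(path_pt),
        "--repr_layers", str(emb_layer),
        "--include", pool,
    ]


def run_extract(cmd):
    """Run the command; this needs the ESM scripts and PyTorch installed."""
    return subprocess.run(cmd, check=True)


cmd = build_extract_command(
    "extract.py", "esm1v_t33_650M_UR90S_1",
    "../../data/dna_binding/train_esm.fa",
    "../../data/dna_binding/esm/train/dbp_train_esm1v_mean",
    33, "mean",
)
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.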

<br>

### ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [7]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/dna_binding/train_esm.fa 
 ../../data/dna_binding/esm/train/dbp_train_esm1b_mean


<br>

Run the embedding script for: `esm - dbp - train - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [8]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../../data/dna_binding/train_esm.fa with 13108 sequences
Processing 1 of 1212 batches (67 sequences)
Processing 2 of 1212 batches (64 sequences)
Processing 3 of 1212 batches (61 sequences)
Processing 4 of 1212 batches (56 sequences)
Processing 5 of 1212 batches (54 sequences)
Processing 6 of 1212 batches (51 sequences)
Processing 7 of 1212 batches (50 sequences)
Processing 8 of 1212 batches (48 sequences)
Processing 9 of 1212 batches (47 sequences)
Processing 10 of 1212 batches (46 sequences)
Processing 11 of 1212 batches (45 sequences)
Processing 12 of 1212 batches (44 sequences)
Processing 13 of 1212 batches (43 sequences)
Processing 14 of 1212 batches (42 sequences)
Processing 15 of 1212 batches (41 sequences)
Processing 16 of 1212 batches (40 sequences)
Processing 17 of 1212 batches (40 sequences)
Processing 18 of 1212 batches (39 sequences)
Processing 19 of 1212 batches (39 sequences)
Processing 20 of 1212 batches (39 sequences)
Processing 21 of 1212 

<br>

**Check the folders**

In [9]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/dna_binding/esm/train
├── [4.0K Oct  3 08:10]  dbp_train_esm1b_mean
└── [4.0K Oct  3 07:30]  dbp_train_esm1v_mean

2 directories, 0 files


Print the total size and number of pt files in each embedding folder

In [10]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

dbp_train_esm1b_mean consumes: 74.14MB in 13108 files
dbp_train_esm1v_mean consumes: 74.14MB in 13108 files
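
The implementation of `fu.emb_files_stats` is not shown in this notebook. A rough sketch of how such a summary could be computed with the standard library, assuming that "MB" means 1024² bytes; the function `folder_stats` is hypothetical:

```python
import os
import tempfile


def folder_stats(folder, suffix=".pt"):
    """Return (number of files, total size in MB) for files ending in suffix."""
    count, total_bytes = 0, 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(suffix):
                count += 1
                total_bytes += entry.stat().st_size
    return count, total_bytes / 1024**2


# Small self-check on a temporary folder with two dummy 1 KB files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pt", "b.pt"):
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(b"\0" * 1024)
    n_files, size_mb = folder_stats(tmp)
    print(f"{size_mb:.4f}MB in {n_files} files")
```

As a plausibility check on the numbers above, 74.14 MB spread over 13108 files is a little under 6 KB per file, which fits a single 1280-dimensional float32 vector (5120 bytes) plus serialization overhead.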


<br>

## Test Dataset

### ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [11]:
# Update arguments
model = 'esm1v_t33_650M_UR90S_1'
file_base = 'test'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/dna_binding/test_esm.fa 
 ../../data/dna_binding/esm/test/dbp_test_esm1v_mean


<br>

Run the embedding script for: `esm - dbp - test - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [12]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../../data/dna_binding/test_esm.fa with 2081 sequences
Processing 1 of 198 batches (56 sequences)
Processing 2 of 198 batches (49 sequences)
Processing 3 of 198 batches (45 sequences)
Processing 4 of 198 batches (41 sequences)
Processing 5 of 198 batches (39 sequences)
Processing 6 of 198 batches (36 sequences)
Processing 7 of 198 batches (34 sequences)
Processing 8 of 198 batches (33 sequences)
Processing 9 of 198 batches (32 sequences)
Processing 10 of 198 batches (30 sequences)
Processing 11 of 198 batches (28 sequences)
Processing 12 of 198 batches (27 sequences)
Processing 13 of 198 batches (26 sequences)
Processing 14 of 198 batches (25 sequences)
Processing 15 of 198 batches (24 sequences)
Processing 16 of 198 batches (23 sequences)
Processing 17 of 198 batches (23 sequences)
Processing 18 of 198 batches (22 sequences)
Processing 19 of 198 batches (22 sequences)
Processing 20 of 198 batches (21 sequences)
Processing 21 of 198 batches (20 sequences)


<br>

### ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [13]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/dna_binding/test_esm.fa 
 ../../data/dna_binding/esm/test/dbp_test_esm1b_mean


<br>

Run the embedding script for: `esm - dbp - test - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [14]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../../data/dna_binding/test_esm.fa with 2081 sequences
Processing 1 of 198 batches (56 sequences)
Processing 2 of 198 batches (49 sequences)
Processing 3 of 198 batches (45 sequences)
Processing 4 of 198 batches (41 sequences)
Processing 5 of 198 batches (39 sequences)
Processing 6 of 198 batches (36 sequences)
Processing 7 of 198 batches (34 sequences)
Processing 8 of 198 batches (33 sequences)
Processing 9 of 198 batches (32 sequences)
Processing 10 of 198 batches (30 sequences)
Processing 11 of 198 batches (28 sequences)
Processing 12 of 198 batches (27 sequences)
Processing 13 of 198 batches (26 sequences)
Processing 14 of 198 batches (25 sequences)
Processing 15 of 198 batches (24 sequences)
Processing 16 of 198 batches (23 sequences)
Processing 17 of 198 batches (23 sequences)
Processing 18 of 198 batches (22 sequences)
Processing 19 of 198 batches (22 sequences)
Processing 20 of 198 batches (21 sequences)
Processing 21 of 198 batches (20 sequences)


<br>

**Check the folders**

In [15]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/dna_binding/esm/test
├── [4.0K Oct  3 08:26]  dbp_test_esm1b_mean
└── [4.0K Oct  3 08:19]  dbp_test_esm1v_mean

2 directories, 0 files


Print the total size and number of pt files in each embedding folder

In [16]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

dbp_test_esm1b_mean consumes: 11.77MB in 2081 files
dbp_test_esm1v_mean consumes: 11.77MB in 2081 files
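
The file counts match the number of sequences reported when each FASTA file was read (13108 for train, 2081 for test). That comparison can also be automated. The sketch below uses hypothetical helper names and a temporary folder with made-up data, and assumes one `>` header line per record:

```python
import os
import tempfile


def count_fasta_records(fasta_path):
    """Count records in a FASTA file by counting header lines."""
    with open(fasta_path) as fh:
        return sum(1 for line in fh if line.startswith(">"))


def count_pt_files(folder):
    """Count the .pt files directly inside a folder."""
    return sum(1 for name in os.listdir(folder) if name.endswith(".pt"))


# Self-contained demonstration with two made-up records
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "demo.fa")
    with open(fasta, "w") as fh:
        fh.write(">P1\nMKTAYIAK\n>P2\nGSHMLE\n")
    emb_dir = os.path.join(tmp, "emb")
    os.mkdir(emb_dir)
    for label in ("P1", "P2"):
        open(os.path.join(emb_dir, f"{label}.pt"), "wb").close()
    n_records = count_fasta_records(fasta)
    n_embeddings = count_pt_files(emb_dir)
    print(n_records, n_embeddings, n_records == n_embeddings)
```

With the notebook's variables, the same check would compare `count_fasta_records(path_fa)` against `count_pt_files(path_pt)`.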
