<a href="https://colab.research.google.com/github/CasBlaauw/BertOGlyc/blob/main/ProtBert_BFD_NetOGlyc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3> Extracting protein sequence features using the pretrained ProtBert-BFD model </h3>

<b>1. Load necessary libraries, including Hugging Face transformers</b>

In [1]:
!pip install -q transformers

In [2]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline
import re
import numpy as np
import pandas as pd
import gc
from google.colab import files, drive

<b>2. Load the vocabulary and ProtBert-BFD Model</b>

In [3]:
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False )

In [4]:
model = AutoModel.from_pretrained("Rostlab/prot_bert_bfd")

Some weights of the model checkpoint at Rostlab/prot_bert_bfd were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<b>3. Load the model onto the GPU if available</b>

In [5]:
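# Use the first GPU when one is available, otherwise fall back to the CPU (device=-1)
fe = pipeline('feature-extraction', model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)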

<b>4. Preprocess data</b>

In [6]:
# sequences_Example = ["A E T C Z A O","S K T Z P"]
# sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

In [3]:
# Read in the sequences column
sequences = pd.read_csv('glycosites_unique_filtered.tsv', sep = '\t', usecols = ['sequence']).squeeze('columns')

In [4]:
# Map rarely used amino acids to X (these probably don't occur in our data)
sequences = sequences.str.replace(r"[UZOB]", "X", regex=True)
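
The preprocessing in this step can be sketched on a couple of toy sequences without pandas. This is a minimal illustration only: the helper name `preprocess` and the example sequences are made up here and are not part of the notebook's data. It shows the two transformations applied before tokenization: mapping the rare residues U, Z, O and B to X, and interlacing spaces so that each residue becomes its own token.

```python
import re


def preprocess(sequence: str) -> str:
    """Map rare residues (U, Z, O, B) to X, then put a space between residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence)
    return " ".join(cleaned)


# Toy sequences (hypothetical, not taken from glycosites_unique_filtered.tsv)
examples = ["AETCZAO", "SKTZP"]
processed = [preprocess(seq) for seq in examples]
print(processed)
```

The space-separated form is what the notebook later produces with `sequences.map(' '.join)`.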

In [5]:
# Set maximum length; sequences of this length or longer are filtered out
max_len = 4000

In [6]:
# Read in the info
glycosites = pd.read_csv('glycosites_unique_filtered.tsv', sep = '\t')
# Filter out long genes
glycosites = glycosites[glycosites['sequence'].str.len() < max_len]
glycosites.reset_index(inplace=True, drop=True)

In [7]:
# Drop very long sequences because they crash the model
# Sequences of 10000+ residues are already filtered out of glycosites_unique_filtered.tsv
# The model can handle up to 10000, but the CNN is trained on up to 4000 for now because every sequence is padded up to max_len
removed = np.where(sequences.str.len() >= max_len)[0].tolist()
print(f'Dropping sequences #{removed}')
print(len(sequences))
sequences = sequences[sequences.str.len() < max_len]
sequences.reset_index(inplace=True, drop=True)
print(sequences)

# Store the lengths
seq_lens = sequences.str.len().copy()
print('Top lengths: ', sorted(seq_lens, reverse=True)[:20])

# Tokenize sequences by interlacing spaces
sequences = sequences.map(' '.join)

Dropping sequences #[85, 200, 201, 210, 230, 265, 318, 415, 431]
475
0      MAIDRRREAAGGGPGRQPAPAEENGSLPPGDAAASAPLGGRAGPGG...
1      MRVLACLLAALVGIQAVERLRLADGPHGCAGRLEVWHGGRWGTVCD...
2      MNKTNQVYAANEDHNSQFIDDYSSSDESLSVSHFSFSKQSHRPRTI...
3      MGVAARPPALRHWFSHSIPLAIFALLLLYLSVRSLGARSGCGPRAQ...
4      MARHGCLGLGLFCCVLFAATVGPQPTPSIPGAPATTLTPVPQSEAS...
                             ...                        
461    MTPQSLLQTTLFLLSLLFLVQGAHGRGHREDFRFCSQRNQTHRSSL...
462    MGQRLSGGRSCLDVPGRLLPQPPPPPPPVRRKLALLFAMLCVWLYM...
463    MAPRTLWSCYLCCLLTAAAGAASYPPRGFSLYTGSSGALSPGGPQA...
464    MPRATALGALVSLLLLLPLPRGAGGLGERPDATADYSELDGEEGTE...
465    MKWKHVPFLVMISLLSLSPNHLFLAQLIPDPEDVERGNDHGTPIPT...
Name: sequence, Length: 466, dtype: object
Top lengths:  [3396, 3333, 3230, 3063, 3014, 2912, 2828, 2623, 2595, 2413, 2386, 2346, 2315, 2235, 2224, 2214, 2179, 2169, 2135, 2045]


In [8]:
!mkdir embed

# Option 1: Output for CNN
- Adds padding to embeddings and sequences up to `max_len`
- Outputs embeddings as a zip of .npy arrays for each gene, of (`max_len`,1024) each
- Outputs info as a single tab-separated .txt (`n_seqs`, 3), with genes as rows (set as index labels) and `['sequence', 'sites', 'label']` as columns. 
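
As a rough sketch of the info side of this option, the snippet below pads a sequence with `-` up to `max_len` and builds the matching 0/1 label string from a space-separated `sites` entry such as `S3 T7` (residue letter followed by a 1-based position). The helper names `pad_sequence` and `build_label`, the toy sequence and the small `max_len` are assumptions made for illustration.

```python
def pad_sequence(sequence: str, max_len: int, pad_char: str = "-") -> str:
    """Right-pad a protein sequence with pad_char up to max_len."""
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    return sequence + pad_char * (max_len - len(sequence))


def build_label(sites: str, max_len: int) -> str:
    """Turn 'S3 T7' (1-based positions) into a 0/1 string of length max_len."""
    site_ids = {int(site[1:]) - 1 for site in sites.split(" ")}
    return "".join("1" if idx in site_ids else "0" for idx in range(max_len))


# Toy example (hypothetical values, max_len kept small for readability)
sequence = "MASTPLT"
sites = "S3 T7"
max_len = 10

padded = pad_sequence(sequence, max_len)
label = build_label(sites, max_len)
print(padded)
print(label)

# Sanity check: every annotated site should point at the residue it names
for site in sites.split(" "):
    assert padded[int(site[1:]) - 1] == site[0]
```

Using a set for `site_ids` keeps the membership test cheap when `max_len` is in the thousands.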

<b>5. Extract sequences' features, add padding, and write to file</b>

In [8]:
# Optionally turn off making embedding/info if not needed
write_embedding = False
write_info = True

In [9]:
index = 0
n_seqs = len(glycosites)

# Add scaffold for labels
if write_info:
  glycosites = pd.concat([glycosites, pd.Series(['0'*max_len]*len(glycosites), name = 'label')], axis = 1)

# Start loop
for seq in sequences:
  # ----- Keeping track -----
  print(f'{index+1} / {n_seqs}')
  pad_len = max_len - seq_lens[index]
  msg = 'pre_pad: '
  msg_pad = 'padded: '

  # ----- Embedding -----
  if write_embedding:
    # Get the embedding for each sequence
    embedding = fe([seq])

    # Remove any special tokens ([PAD],[CLS],[SEP]) added by model
    embedding = np.array(embedding)[0, 1:(seq_lens[index]+1), :]
    msg += 'embed ' + str(embedding.shape)    

    # Add padding
    embedding_padding = np.zeros((pad_len, 1024))
    embedding = np.append(embedding, embedding_padding, axis = 0)

    msg_pad += 'embed ' + str(embedding.shape)

    # Write embedding to file
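    # Name the file after the gene in this row ('gene' is still a column at this point)
    np.save(f"embed/embeddings_{glycosites['gene'][index]}", embedding)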

  # ----- Info -----
  if write_info:
    msg += 'prot ' + str(len(glycosites.loc[index, 'sequence']))

    # Add padding to sequence
    glycosites.loc[index, 'sequence'] += '-'*pad_len
    msg_pad += 'prot ' + str(len(glycosites.loc[index, 'sequence']))

    # Set labels to positive at sites
    sites = glycosites['sites'][index].split(' ')
    site_ids = [int(site[1:])-1 for site in sites]
    label = ''.join(['1' if idx in site_ids else '0' for idx in range(max_len)])
    glycosites.loc[index, 'label'] = label
    print([(site, glycosites.loc[index, 'sequence'][site_id]) for site, site_id in zip(sites, site_ids)]) # Sanity check that site matches seq at id

  # ----- Housekeeping -----
  print(msg)
  print(msg_pad)
  index += 1
  gc.collect()

# Write info to file
if write_info:
  glycosites.set_index('gene', inplace = True)
  print(glycosites.head())
  glycosites.to_csv('embeddings_info.txt', sep = '\t')

1 / 466
[('T143', 'T')]
pre_pad: prot 686
padded: prot 4000
2 / 466
[('S694', 'S'), ('T693', 'T'), ('T696', 'T'), ('T697', 'T'), ('T699', 'T'), ('T701', 'T')]
pre_pad: prot 1573
padded: prot 4000
3 / 466
[('S393', 'S'), ('T388', 'T'), ('T390', 'T')]
pre_pad: prot 531
padded: prot 4000
4 / 466
[('T64', 'T')]
pre_pad: prot 291
padded: prot 4000
5 / 466
[('S139', 'S'), ('S206', 'S'), ('T143', 'T')]
pre_pad: prot 899
padded: prot 4000
6 / 466
[('T109', 'T')]
pre_pad: prot 502
padded: prot 4000
7 / 466
[('S901', 'S'), ('S905', 'S'), ('T40', 'T'), ('T902', 'T')]
pre_pad: prot 956
padded: prot 4000
8 / 466
[('S493', 'S'), ('S549', 'S'), ('S576', 'S'), ('S588', 'S'), ('S604', 'S'), ('S643', 'S'), ('S655', 'S'), ('T276', 'T'), ('T277', 'T'), ('T281', 'T'), ('T282', 'T'), ('T289', 'T'), ('T577', 'T'), ('T590', 'T'), ('T592', 'T'), ('T593', 'T'), ('T647', 'T')]
pre_pad: prot 747
padded: prot 4000
9 / 466
[('S222', 'S'), ('S283', 'S'), ('S396', 'S'), ('S405', 'S'), ('T194', 'T'), ('T216', 'T'), ('

<b>6. Export files</b>

In [None]:
!zip -r /content/embeddings_npy.zip embed/embeddings_*.*

In [10]:
drive.mount('/content/drive')
if write_embedding:
  !cp /content/embeddings_npy.zip /content/drive/MyDrive/NetOGlyc
if write_info:
  !cp /content/embeddings_info.txt /content/drive/MyDrive/NetOGlyc
drive.flush_and_unmount()

Mounted at /content/drive


# Option 2: Just embeddings, as .txt in zip
- Doesn't pad sequences
- Outputs embeddings as a zip of .txt files for each gene, of (prot_len, 1024) each
- Doesn't output info. The glycosites file can be used instead, but make sure the `max_len` filtering is the same! (`glycosites_unique_filtered.tsv` is filtered to 10000, current `max_len` is 4000)

The initial part might need some changes for this to work (mostly the label column, but possibly other things too), as I changed it for option 1 before formalising this option.
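
To illustrate the filtering caveat above, here is a small sketch of working out which genes should have an embedding file for a given `max_len`, so that the info rows and the per-gene files stay aligned. The gene names, sequences and the tiny `max_len` are hypothetical; only the `embeddings_{gene}.txt` naming pattern follows the loop below.

```python
# Toy (gene, sequence) records; hypothetical stand-ins for rows of the glycosites file
records = [
    ("geneA", "MKT"),
    ("geneB", "MKTAYIAKQR"),
    ("geneC", "MAS"),
]
max_len = 5  # tiny value for illustration; the notebook uses 4000

# Same rule as the notebook: keep only sequences strictly shorter than max_len
kept = [gene for gene, seq in records if len(seq) < max_len]

# File names the export loop would produce for the kept genes
expected_files = [f"embeddings_{gene}.txt" for gene in kept]
print(kept)
print(expected_files)
```

Comparing a list like `expected_files` against the contents of the zip is one way to catch a mismatch in `max_len` before training.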

<b>5. Extract sequences' features and remove padding/special tokens</b>

In [None]:
index = 0
n_seqs = len(sequences)
for seq in sequences:
  # Keeping track
  print(f'{index+1} / {n_seqs}')
  # Get the embedding for each sequence
  embedding = fe([seq])
  # Remove padding ([PAD]) and special tokens ([CLS],[SEP]) added by model
  embedding = np.array(embedding)[0, 1:(seq_lens[index]+1), :]
  print(f"Embedding size: {embedding.shape}, seq_len {seq_lens[index]}")
  # Save embeddings to file, matrix of (prot_len x 1024) per protein
  np.savetxt(f"embed/embeddings_{glycosites['gene'][index]}.txt", embedding, delimiter = '\t')
  # Housekeeping to prepare for next loop
  index += 1
  gc.collect()

<b>6. Export files</b>

In [None]:
!zip -r /content/embeddings_individual_filtered.zip /content/embed

In [22]:
# Downloading the zip directly is very slow; it is faster to copy it to Drive and download/access it from there
# files.download("/content/embeddings_individual_filtered.zip")
drive.mount('/content/drive')
!cp /content/embeddings_individual_filtered.zip /content/drive/MyDrive/NetOGlyc/
drive.flush_and_unmount()

Mounted at /content/drive


# Option 3: Embeddings and data in one big file
- Doesn't pad sequences
- Outputs embeddings as a zip of one big .txt file, with all genes concatenated and a residue on each row (`seq_lens.sum()`, 1027) 
- Info is included in the main file, as first three columns: `['gene', 'residue', 'label']`, after which `['0', '1', ... , '1023']` start.
- Because of these label columns, this .txt file is meant to be read with pandas rather than numpy. `data.loc[:, '0':'1023'].to_numpy()` should extract the embeddings, at the cost of holding everything in memory twice. This format also turned out to be of little use for training a CNN on whole genes rather than on individual residue embeddings.

This option hasn't been touched for a while, so it might lack some of the improvements made to option 1.
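
Reading such a file back could look roughly like the sketch below. It uses a tiny in-memory stand-in with 4 embedding columns instead of 1024, and the gene names and values are invented for illustration; the column layout (`gene`, `residue`, `label`, then numbered embedding columns) follows the description above.

```python
import io

import pandas as pd

# Tiny stand-in for embeddings.txt: 3 info columns, then embedding columns '0'..'3'
text = (
    "gene\tresidue\tlabel\t0\t1\t2\t3\n"
    "geneA\tM\t0\t0.1\t0.2\t0.3\t0.4\n"
    "geneA\tT\t1\t0.5\t0.6\t0.7\t0.8\n"
    "geneB\tS\t1\t0.9\t1.0\t1.1\t1.2\n"
)

data = pd.read_csv(io.StringIO(text), sep="\t")

# Label-based column slice: everything from column '0' to the last embedding column
features = data.loc[:, "0":"3"].to_numpy()
labels = data["label"].to_numpy()

print(features.shape)
print(labels)
```

For the real file the slice would be `'0':'1023'`, and `pd.read_csv` would take the file path instead of the `StringIO` object.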

<b> 5-6. Alternative: write data with aa and gene name to one big file </b>



In [None]:
with open('embed/embeddings.txt', 'w') as file:
  index = 0
  header = True
  n_seqs = len(glycosites)
  for seq in sequences:
    # Keeping track
    print(f'{index+1} / {n_seqs}')

    # Get the embedding for each sequence
    embedding = fe([seq])

    # Remove padding ([PAD]) and special tokens ([CLS],[SEP]) added by model
    embedding = np.array(embedding)[0, 1:(seq_lens[index]+1), :]

    # Prepare gene/residue/site labels
    gene = pd.Series([glycosites.iloc[index, 0]]*seq_lens[index], name = 'gene')
    prot_seq = pd.Series(list(seq.replace(" ", "")), name = 'residue')
    sites = pd.Series(glycosites.iloc[index, 1].split(' '), name = 'sites')
    label = pd.Series([0]*seq_lens[index], name = 'label')

    # Bind into one dataframe
    embedding = pd.DataFrame(embedding)
    print(f'shapes: gene {gene.shape}, prot {prot_seq.shape}, embedding {embedding.shape}')
    embedding = pd.concat([gene, prot_seq, label, embedding], axis = 1)

    # Add sites
    for site in sites:
      site_index = int(site[1:])-1
      embedding.iloc[site_index, 2] = 1
      # print(f"gene: {embedding['gene'][site_index]}, site: {site}, prot residue: {embedding['residue'][site_index]}")
    print([(site, embedding['residue'][int(site[1:])-1]) for site in sites])

    # Write to file
    embedding.to_csv(file, sep = '\t', header = header, index = False, mode = 'a')

    # Housekeeping
    index += 1
    header = False
    gc.collect()

In [None]:
!zip /content/embeddings_txt.zip /content/embed/embeddings.txt

In [None]:
drive.mount('/content/drive')
!cp /content/embeddings_txt.zip /content/drive/MyDrive/NetOGlyc/
drive.flush_and_unmount()