# CafChem tools for masking and embedding proteins

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MauricioCafiero/CafChem/blob/main/notebooks/ProteinMaskEmbed_CafChem.ipynb)

## This notebook allows you to:
- load a PDB and visualize
- choose a chain to study
- mask a fraction of residues in the chain
- use ESM to fill the masks, creating a novel protein
- embed proteins
- compare proteins by cosine similarity of embeddings

## Requirements:

- Runs quickly on an L4 GPU

## Install and import libraries
- pull CafChem from GitHub

In [1]:
!pip install -q py3Dmol
!pip install -q "fair-esm[esmfold]"


In [20]:
!git clone https://github.com/MauricioCafiero/CafChem.git

Cloning into 'CafChem'...
remote: Enumerating objects: 1156, done.
remote: Counting objects: 100% (478/478), done.
remote: Compressing objects: 100% (221/221), done.
remote: Total 1156 (delta 398), reused 257 (delta 257), pack-reused 678 (from 1)
Receiving objects: 100% (1156/1156), 58.28 MiB | 16.71 MiB/s, done.
Resolving deltas: 100% (679/679), done.


In [4]:
import py3Dmol
import numpy as np
import pandas as pd

import CafChem.CafChemProteinMaskEmbed as ccpme

print('all libraries imported.')

all libraries imported.


## Retrieve protein from the PDB and display it

In [5]:
ccpme.show_protein('4ZGM')

<py3Dmol.view at 0x7ae02f53c770>

### Get PDB file

In [6]:
res = ccpme.get_protein_from_pdb('4ZGM')
part = res.split('\n')
print(part[0])

HEADER    SIGNALING PROTEIN                       23-APR-15   4ZGM              
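
How `get_protein_from_pdb` obtains the file is not shown in this notebook. As a minimal sketch, a PDB entry can be downloaded as plain text from the RCSB file server using only the standard library. The helper names `pdb_url` and `fetch_pdb_text` are hypothetical and are not part of CafChem.

```python
from urllib.request import urlopen

# RCSB file-download endpoint; the PDB ID is inserted into the file name.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id: str) -> str:
    """Build the download URL for a four-character PDB ID (hypothetical helper)."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a valid PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def fetch_pdb_text(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the PDB file and return it as one string (needs network access)."""
    with urlopen(pdb_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With network access, `fetch_pdb_text('4ZGM').split('\n')[0]` should give a `HEADER` record like the one printed above.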


## Examine sequence
- get a dictionary of protein chains from the PDB
- examine tokenization

In [7]:
chains, chains_ol = ccpme.extract_sequence('4ZGM')
print(chains.keys())


Blank line
dict_keys(['A', 'B'])
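
The internals of `extract_sequence` are not shown here. One common way to build a per-chain dictionary is to read the `SEQRES` records of the PDB text, sketched below. The function name `chains_from_seqres` is hypothetical, and mapping every non-standard residue to `X` is an assumption (it is at least consistent with the `X` that appears in chain B later in this notebook).

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chains_from_seqres(pdb_text):
    """Return ({chain: [three-letter codes]}, {chain: one-letter string})."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain_id = line[11]            # chain ID sits in column 12 of the record
            residues = line[19:].split()   # residue names start at column 20
            chains.setdefault(chain_id, []).extend(residues)
    # Assumption: anything outside the 20 standard residues becomes 'X'.
    one_letter = {
        chain_id: "".join(THREE_TO_ONE.get(res, "X") for res in residues)
        for chain_id, residues in chains.items()
    }
    return chains, one_letter
```

The two return values mirror the `chains` and `chains_ol` pair used above, with chain IDs such as `'A'` and `'B'` as keys.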


## Perform masking and prediction

In [8]:
sgt_mask = ccpme.gen_mask_fill(checkpoint = 'facebook/esm2_t33_650M_UR50D', seq = chains_ol['B'], num_to_mask = 15)
sgt_mask.start_model()

tokenizer_config.json:   0%|          | 0.00/95.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/93.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/724 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.61G [00:00<?, ?B/s]

In [9]:
seq_ids, masked_chain, masked_chain_ids = sgt_mask.mask_tokens()
print(seq_ids)
print(masked_chain_ids)

[0, 21, 24, 9, 6, 11, 18, 11, 8, 13, 7, 8, 8, 19, 4, 9, 6, 16, 5, 5, 15, 9, 18, 12, 5, 22, 4, 7, 10, 6, 10, 6, 2]
[0, 21, 24, 9, 32, 11, 18, 11, 8, 32, 7, 32, 8, 32, 4, 32, 32, 32, 5, 32, 15, 9, 18, 12, 5, 22, 4, 32, 32, 6, 10, 6, 2]
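
In the output above, masked positions carry the ID 32, and the first and last IDs (0 and 2) are left untouched. Ten positions are masked although `num_to_mask = 15` was requested. The sketch below shows one way to mask token IDs at random. It samples positions without replacement, so it always masks exactly the requested number; that is an assumption and may differ from what `mask_tokens` does. The function name `mask_token_ids` is hypothetical.

```python
import random

MASK_ID = 32  # mask token ID as it appears in the printed output above


def mask_token_ids(token_ids, num_to_mask, mask_id=MASK_ID, seed=None):
    """Return a masked copy of token_ids and the sorted masked positions."""
    rng = random.Random(seed)
    # Exclude the first and last entries, which are special tokens.
    candidates = range(1, len(token_ids) - 1)
    num_to_mask = min(num_to_mask, len(candidates))
    # Assumption: sample without replacement, so no position is picked twice.
    positions = sorted(rng.sample(candidates, num_to_mask))
    masked = list(token_ids)  # copy, so the input list is not modified
    for pos in positions:
        masked[pos] = mask_id
    return masked, positions
```

Passing a fixed `seed` makes the choice of masked positions reproducible between runs.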


In [10]:
model_preds = sgt_mask.unmask()
print(model_preds)

[0, 20, 4, 9, 4, 11, 18, 11, 8, 4, 7, 4, 8, 4, 4, 8, 9, 9, 5, 5, 15, 9, 18, 12, 5, 22, 4, 9, 9, 6, 10, 6, 2]
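
In the prediction above, the IDs at indices 1 and 2 differ from the original (21 to 20, 24 to 4) even though those positions do not hold the mask ID. If only the masked residues should change, the model scores can be applied at masked positions alone. A minimal sketch, assuming a hypothetical `logits` input with one list of vocabulary scores per position, as a masked language model would return:

```python
def fill_masked_positions(masked_ids, logits, mask_id=32):
    """Replace only the masked IDs with the highest-scoring token at that position."""
    filled = list(masked_ids)
    for pos, token_id in enumerate(masked_ids):
        if token_id == mask_id:
            scores = logits[pos]
            # argmax over the vocabulary scores for this position
            filled[pos] = max(range(len(scores)), key=scores.__getitem__)
    return filled
```

Unmasked positions are copied through unchanged, so every difference between the original and the novel sequence is guaranteed to sit at a masked position.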


In [12]:
print(seq_ids)
print(masked_chain_ids)
print(model_preds)

[0, 21, 24, 9, 6, 11, 18, 11, 8, 13, 7, 8, 8, 19, 4, 9, 6, 16, 5, 5, 15, 9, 18, 12, 5, 22, 4, 7, 10, 6, 10, 6, 2]
[0, 21, 24, 9, 32, 11, 18, 11, 8, 32, 7, 32, 8, 32, 4, 32, 32, 32, 5, 32, 15, 9, 18, 12, 5, 22, 4, 32, 32, 6, 10, 6, 2]
[0, 20, 4, 9, 4, 11, 18, 11, 8, 4, 7, 4, 8, 4, 4, 8, 9, 9, 5, 5, 15, 9, 18, 12, 5, 22, 4, 9, 9, 6, 10, 6, 2]
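
To read these ID lists as sequences, each ID can be mapped back to its token. The table below is the ESM-2 vocabulary order assumed here; it is consistent with the IDs and sequences printed in this notebook, but it should be checked against the tokenizer's `vocab.txt`. The function name `decode_ids` is hypothetical.

```python
# Assumed ESM-2 vocabulary order (list index = token ID).
ESM2_TOKENS = [
    "<cls>", "<pad>", "<eos>", "<unk>",
    "L", "A", "G", "V", "S", "E", "R", "T", "I", "D", "P", "K",
    "Q", "N", "F", "Y", "M", "H", "W", "C", "X", "B", "U", "Z", "O",
    ".", "-", "<null_1>", "<mask>",
]


def decode_ids(token_ids, mask_char="_"):
    """Turn token IDs into a one-letter string; masks become mask_char."""
    letters = []
    for token_id in token_ids:
        token = ESM2_TOKENS[token_id]
        if token == "<mask>":
            letters.append(mask_char)
        elif token.startswith("<"):
            continue  # drop special tokens such as <cls> and <eos>
        else:
            letters.append(token)
    return "".join(letters)


model_preds = [0, 20, 4, 9, 4, 11, 18, 11, 8, 4, 7, 4, 8, 4, 4, 8, 9, 9, 5,
               5, 15, 9, 18, 12, 5, 22, 4, 9, 9, 6, 10, 6, 2]
print(decode_ids(model_preds))  # MLELTFTSLVLSLLSEEAAKEFIAWLEEGRG
```

Decoding `masked_chain_ids` the same way shows the masked residues as `_`, which makes the masked stretch of the chain easy to see.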


In [13]:
orig, new = sgt_mask.compare_seqs_naive()

Original: HXEGTFTSDVSSYLEGQAAKEFIAWLVRGRG
Novel   : MLELTFTSLVLSLLSEEAAKEFIAWLEEGRG
Number of differences: 11 out of 31
Percentage of differences: 0.355
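
The value 0.355 printed above is a fraction (11/31), i.e. 35.5 %. A position-by-position comparison of this kind can be sketched as follows; the function name `count_differences` is hypothetical, and the sketch assumes both sequences have the same length, as they do here.

```python
def count_differences(original, novel):
    """Return the 1-based positions that differ and the fraction of differing residues."""
    if len(original) != len(novel):
        raise ValueError("sequences must have the same length")
    positions = [
        i + 1 for i, (a, b) in enumerate(zip(original, novel)) if a != b
    ]
    fraction = len(positions) / len(original)
    return positions, fraction


orig = "HXEGTFTSDVSSYLEGQAAKEFIAWLVRGRG"
new = "MLELTFTSLVLSLLSEEAAKEFIAWLEEGRG"
positions, fraction = count_differences(orig, new)
print(f"{len(positions)} of {len(orig)} residues differ ({fraction:.1%})")
# 11 of 31 residues differ (35.5%)
```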


In [14]:
orig, new = sgt_mask.compare_seqs()

Original: HXEGTFTSDVSSYLEGQAAKEFIAWLVRGRG
Novel   : MLELTFTSLVLSLLSEEAAKEFIAWLEEGRG
Residue 1 changed HIS --> MET. This token was not masked.
Residue 2 changed X --> LEU. This token was not masked.
Residue 4 changed GLY --> LEU. This token was not masked.
Residue 9 changed ASP --> LEU. This token was not masked.
Residue 11 changed SER --> LEU. This token was not masked.
Residue 13 changed TYR --> LEU. This token was not masked.
Residue 15 changed GLU --> SER. This token was not masked.
Residue 16 changed GLY --> GLU. This token was masked.
Residue 17 changed GLN --> GLU. This token was masked.
Residue 27 changed VAL --> GLU. This token was not masked.
Residue 28 changed ARG --> GLU. This token was masked.
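
The masked flag can be cross-checked directly against the three ID lists printed earlier. Those lists carry a start token at index 0, so residue n sits at list index n. Under that indexing assumption the flags may differ from the report above; for example, index 4 of `masked_chain_ids` holds the mask ID 32. The function name `changed_residues` is hypothetical.

```python
MASK_ID = 32

seq_ids = [0, 21, 24, 9, 6, 11, 18, 11, 8, 13, 7, 8, 8, 19, 4, 9, 6, 16, 5,
           5, 15, 9, 18, 12, 5, 22, 4, 7, 10, 6, 10, 6, 2]
masked_chain_ids = [0, 21, 24, 9, 32, 11, 18, 11, 8, 32, 7, 32, 8, 32, 4, 32,
                    32, 32, 5, 32, 15, 9, 18, 12, 5, 22, 4, 32, 32, 6, 10, 6, 2]
model_preds = [0, 20, 4, 9, 4, 11, 18, 11, 8, 4, 7, 4, 8, 4, 4, 8, 9, 9, 5,
               5, 15, 9, 18, 12, 5, 22, 4, 9, 9, 6, 10, 6, 2]


def changed_residues(original_ids, masked_ids, predicted_ids, mask_id=MASK_ID):
    """List (residue number, was_masked) for every residue whose ID changed."""
    changes = []
    # Skip index 0 and the last index, which hold special tokens.
    for residue in range(1, len(original_ids) - 1):
        if original_ids[residue] != predicted_ids[residue]:
            changes.append((residue, masked_ids[residue] == mask_id))
    return changes


for residue, was_masked in changed_residues(seq_ids, masked_chain_ids, model_preds):
    print(f"Residue {residue}: masked = {was_masked}")
```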


## Compare embeddings for original and new sequences

In [21]:
# reload the already-imported CafChem module
import importlib
importlib.reload(ccpme)

<module 'CafChem.CafChemProteinMaskEmbed' from '/content/CafChem/CafChemProteinMaskEmbed.py'>

In [22]:
seqs = [orig, new]
seqs

['HXEGTFTSDVSSYLEGQAAKEFIAWLVRGRG', 'MLELTFTSLVLSLLSEEAAKEFIAWLEEGRG']

In [23]:
overlap = ccpme.embed_proteins(checkpoint = 'facebook/esm2_t6_8M_UR50D', list_seqs = seqs)
overlap.start_model()

Using device: cuda


Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded from facebook/esm2_t6_8M_UR50D


In [24]:
embeddings = overlap.embed_seqs()
embeddings.shape

(2, 320)

In [25]:
overlap.compare_embeddings(0,1)

Overlap between protein 0 and 1: 0.97612
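
Each protein is represented above by one 320-dimensional vector, and the introduction describes the comparison as a cosine similarity of embeddings. As a sketch of that measure (not necessarily the exact code behind `compare_embeddings`), cosine similarity is the dot product of two vectors divided by the product of their lengths:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

The result ranges from -1 to 1, and a value of 1 means the two embeddings point in the same direction, so 0.976 indicates that the model places the original and novel sequences close together.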


## Fold novel protein

- Copy the novel sequence printed below and use it in the ESMFold notebook
- The notebook is here: [![Open ESMFold In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MauricioCafiero/CafChem/blob/main/notebooks/ESMFold_CafChem.ipynb)

In [36]:
print('New sequence: -----------------------')
print(new)
print('old sequence: -----------------------')
print(orig)

New sequence: -----------------------
MXXXLLTSDGLGYLEGQALAAFLAWLVRGGG
old sequence: -----------------------
HXEGTFTSDVSSYLEGQAAKEFIAWLVRGRG
