# Getting started with InstaNovo

<a target="_blank" href="https://colab.research.google.com/github/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In this notebook, we demo InstaNovo, a transformer neural network that translates fragment ion peaks into the sequence of amino acids making up the studied peptide(s). We evaluate the model on the yeast test set of the nine-species dataset.

![](https://raw.githubusercontent.com/instadeepai/InstaNovo/main/graphical_abstract.jpeg)
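
To make "translating fragment ion peaks into amino acids" more concrete, here is a minimal sketch of the forward direction: computing the singly charged b- and y-ion m/z values that a known peptide would ideally produce. It is an illustration only, not part of InstaNovo. The function names are hypothetical, the residue table is a subset of the standard monoisotopic masses, and modifications are ignored.

```python
# Standard monoisotopic residue masses in daltons (subset, no modifications).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "R": 156.10111,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton


def b_ions(peptide):
    """Singly charged b-ion m/z values (N-terminal prefixes)."""
    mz, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        mz.append(total + PROTON)
    return mz


def y_ions(peptide):
    """Singly charged y-ion m/z values (C-terminal suffixes)."""
    mz, total = [], WATER
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        mz.append(total + PROTON)
    return mz


def precursor_mz(peptide, charge):
    """m/z of the intact peptide at the given charge state."""
    neutral = sum(RESIDUE_MASS[r] for r in peptide) + WATER
    return (neutral + charge * PROTON) / charge


peptide = "METNEQTMPTK"
print([round(m, 3) for m in b_ions(peptide)])
print([round(m, 3) for m in y_ions(peptide)])
print(round(precursor_mz(peptide, charge=2), 3))
```

Neighbouring peaks in either ladder differ by exactly one residue mass, which is the signal a de novo sequencing model has to recover from a noisy, incomplete spectrum.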

**Paper:**

- **De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments** \
  Kevin Eloff, Konstantinos Kalogeropoulos, Oliver Morell, Amandla Mabona, Jakob Berg Jespersen, Wesley Williams, Sam van Beljouw, Marcin Skwark, Andreas Hougaard Laustsen, Stan J. J. Brouns, Anne Ljungars, Erwin M. Schoof, Jeroen Van Goey, Ulrich auf dem Keller, Karim Beguir, Nicolas Lopez Carranza, Timothy P. Jenkins \
  [bioRxiv](https://www.biorxiv.org/content/10.1101/2023.08.30.555055v1), [GitHub](https://github.com/instadeepai/InstaNovo)

**Important:**

It is highly recommended to run this notebook in an environment with access to a GPU. If you are running this notebook in Google Colab:

- In the menu, go to `Runtime > Change Runtime Type > T4 GPU`

## Loading the InstaNovo model

We first install the latest `instanovo` package.

_Note: the cell below currently installs directly from GitHub; it will switch to the PyPI package in the next release._

In [None]:
%%capture
#!pip install instanovo
!pip install instanovo@git+https://github.com/instadeepai/InstaNovo

In [None]:
from instanovo.inference.knapsack import Knapsack
from instanovo.inference.knapsack_beam_search import KnapsackBeamSearchDecoder
from instanovo.transformer.model import InstaNovo

from tqdm import tqdm
import torch
import os
import numpy as np
import pandas as pd



We can download the model checkpoint directly from the [InstaNovo releases](https://github.com/instadeepai/InstaNovo/releases).

In [None]:
# Download checkpoint locally
!mkdir -p checkpoints/
!wget https://github.com/instadeepai/InstaNovo/releases/download/0.1.4/instanovo_yeast.pt -P ./checkpoints/


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

Loading the model...

In [None]:
model, config = InstaNovo.load("./checkpoints/instanovo_yeast.pt")
model = model.to(device).eval()

## Loading the nine-species dataset
We download the [nine-species exc-yeast](https://huggingface.co/datasets/InstaDeepAI/instanovo_ninespecies_exclude_yeast) dataset from Hugging Face.

In [None]:
from datasets import load_dataset

# Only evaluate on a subset of the data for demo
dataset = load_dataset("InstaDeepAI/instanovo_ninespecies_exclude_yeast", split="test[:10%]")

# Otherwise evaluate on the full test set
# dataset = load_dataset("InstaDeepAI/instanovo_ninespecies_exclude_yeast", split="test")

Downloading readme:   0%|          | 0.00/3.03k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/502M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/504M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/59.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/54.7M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/499402 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/28572 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/27142 [00:00<?, ? examples/s]

In [None]:
from instanovo.transformer.dataset import SpectrumDataset, collate_batch

# Invert the model's index-to-residue vocabulary into a residue-to-index map
s2i = {v: k for k, v in model.i2s.items()}
ds = SpectrumDataset(dataset, s2i, config["n_peaks"], return_str=True)

In [None]:
from torch.utils.data import DataLoader

dl = DataLoader(ds, batch_size=64, shuffle=False, collate_fn=collate_batch)

In [None]:
batch = next(iter(dl))

spectra, precursors, spectra_mask, peptides, _ = batch
spectra = spectra.to(device)
precursors = precursors.to(device)

In [None]:
spectra.shape, precursors.shape

(torch.Size([64, 165, 2]), torch.Size([64, 3]))
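
The shapes above show that every spectrum in the batch has been brought to the same number of peaks. A minimal sketch of how variable-length peak lists can be padded into one rectangular batch with a padding mask is shown below. This is a generic illustration in plain Python, not the library's `collate_batch`. The `(m/z, intensity)` reading of the last dimension, the zero pad value, and the function name `pad_spectra` are all assumptions.

```python
def pad_spectra(spectra, pad_value=0.0):
    """Pad peak lists to a common length and return (padded, mask).

    Each spectrum is a list of (mz, intensity) pairs. In the mask,
    True marks a padded position that the model should ignore.
    """
    max_peaks = max(len(peaks) for peaks in spectra)
    padded, mask = [], []
    for peaks in spectra:
        n_pad = max_peaks - len(peaks)
        padded.append(list(peaks) + [(pad_value, pad_value)] * n_pad)
        mask.append([False] * len(peaks) + [True] * n_pad)
    return padded, mask


# Two toy spectra with different numbers of peaks (made-up values).
batch = [
    [(147.113, 0.8), (276.155, 1.0), (373.208, 0.4)],
    [(98.060, 0.6)],
]
padded, mask = pad_spectra(batch)
print(len(padded), len(padded[0]), len(padded[0][0]))
print(mask[1])
```

Under this reading, `[64, 165, 2]` would mean 64 spectra, each padded or truncated to 165 peaks with two values per peak.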

## Knapsack beam-search decoder

Set up the knapsack beam-search decoder. Generating the knapsack may take a few minutes.

In [None]:
def _setup_knapsack(model: InstaNovo) -> Knapsack:
    MASS_SCALE = 10000
    residue_masses = model.peptide_mass_calculator.masses
    residue_masses["$"] = 0
    residue_indices = model.decoder._aa2idx
    return Knapsack.construct_knapsack(
        residue_masses=residue_masses,
        residue_indices=residue_indices,
        max_mass=4000.00,
        mass_scale=MASS_SCALE,
    )

knapsack_path = "./checkpoints/knapsack/"

if not os.path.exists(knapsack_path):
    print("Knapsack path missing or not specified, generating...")
    knapsack = _setup_knapsack(model)
    decoder = KnapsackBeamSearchDecoder(model, knapsack)
    print(f"Saving knapsack to {knapsack_path}")
    knapsack.save(knapsack_path)
else:
    print("Knapsack path found. Loading...")
    decoder = KnapsackBeamSearchDecoder.from_file(model=model, path=knapsack_path)

INFO:root:Scaling masses.


Knapsack path missing or not specified, generating...


INFO:root:Initializing chart.
INFO:root:Performing search.


Saving knapsack to ./checkpoints/knapsack/
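
The idea behind the knapsack can be sketched with a small dynamic program: scale residue masses to integers, then mark every integer mass that some combination of residues can reach. A beam search can use such a chart to discard partial sequences whose remaining mass cannot be completed. The sketch below is a simplified illustration, not the library's `Knapsack.construct_knapsack`, whose chart may store more than a yes/no flag per mass. The function name and the toy two-residue vocabulary are assumptions.

```python
def reachable_masses(residue_masses, max_mass, mass_scale):
    """Return (reachable, scaled) for integer-scaled masses.

    reachable[m] is True if some multiset of residues sums to
    exactly m after scaling and rounding.
    """
    scaled = {aa: round(mass * mass_scale) for aa, mass in residue_masses.items()}
    limit = round(max_mass * mass_scale)
    reachable = [False] * (limit + 1)
    reachable[0] = True  # the empty sequence has mass zero
    for mass in range(1, limit + 1):
        reachable[mass] = any(
            mass >= m and reachable[mass - m] for m in scaled.values()
        )
    return reachable, scaled


# Toy vocabulary: glycine and alanine only, at a coarse scale.
reachable, scaled = reachable_masses(
    {"G": 57.02146, "A": 71.03711}, max_mass=300.0, mass_scale=100
)
print(scaled)
print(reachable[scaled["G"] + scaled["A"]])
print(reachable[6000])
```

With the notebook's settings (`max_mass=4000.00`, `MASS_SCALE = 10000`) the scaled mass axis has 40 million bins, which is consistent with the construction taking a few minutes.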


## Inference time 🚀

Evaluating a single batch...

In [None]:
with torch.no_grad():
    p = decoder.decode(
        spectra=spectra,
        precursors=precursors,
        beam_size=config["n_beams"],
        max_length=config["max_length"],
    )
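# Entries that are lists carry no decoded sequence: use "" and a placeholder log-probability of -1
preds = ["".join(x.sequence) if not isinstance(x, list) else "" for x in p]
probs = [x.log_probability if not isinstance(x, list) else -1 for x in p]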

In [None]:
from instanovo.utils.metrics import Metrics

metrics = Metrics(config["residues"], config["isotope_error_range"])

In [None]:
aa_prec, aa_recall, pep_recall, pep_prec = metrics.compute_precision_recall(peptides, preds)
pep_recall

0.8125
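
For intuition about the peptide-level numbers, here is a simplified sketch of peptide precision and recall. It assumes that recall is the fraction of all targets predicted correctly, that precision is the fraction of non-empty predictions that are correct, and that isoleucine and leucine count as equal because they have the same mass. The library's `Metrics` class may match residues differently (for example by mass tolerance), so treat this as an approximation with a hypothetical function name.

```python
def peptide_precision_recall(targets, predictions):
    """Return (precision, recall) at the whole-peptide level."""

    def normalize(sequence):
        # I and L are isobaric, so treat them as the same residue.
        return sequence.replace("I", "L")

    n_predicted = sum(1 for p in predictions if p)
    n_correct = sum(
        1
        for t, p in zip(targets, predictions)
        if p and normalize(t) == normalize(p)
    )
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / len(targets) if targets else 0.0
    return precision, recall


# Toy example: one wrong, one I/L match, one empty, one exact match.
toy_targets = ["TPGREDAAEETAAPGK", "DVAEKQDDIKEEAK", "METNEQTMPTK", "METNEQTMPTK"]
toy_predictions = ["TDRPGEAAEETAAPGK", "DVAEKQDDLKEEAK", "", "METNEQTMPTK"]
precision, recall = peptide_precision_recall(toy_targets, toy_predictions)
print(precision, recall)
```

Under this definition, a peptide recall of 0.8125 on a batch of 64 spectra corresponds to 52 correctly predicted peptides.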

Evaluating on the 10% subset of the nine-species exc-yeast test set loaded above:

In [None]:
preds = []
targs = []
probs = []

for batch in tqdm(dl, total=len(dl)):
    spectra, precursors, _, peptides, _ = batch
    spectra = spectra.to(device)
    precursors = precursors.to(device)

    with torch.no_grad():
        p = decoder.decode(
            spectra=spectra,
            precursors=precursors,
            beam_size=config["n_beams"],
            max_length=config["max_length"],
        )

    preds += ["".join(x.sequence) if not isinstance(x, list) else "" for x in p]
    probs += [x.log_probability if not isinstance(x, list) else -1 for x in p]
    targs += list(peptides)

100%|██████████| 43/43 [26:33<00:00, 37.05s/it]
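
The progress bar can be sanity-checked with a little arithmetic, and the same arithmetic gives a rough runtime estimate for the full test set. The sketch below uses the test split size (27,142) and the per-batch time (37.05 s) reported in the outputs above. It assumes the 10% slice rounds to the nearest whole example and that per-batch time stays roughly constant, which depends on hardware and spectrum length.

```python
import math


def n_batches(n_examples, batch_size):
    """Number of batches a DataLoader yields without drop_last."""
    return math.ceil(n_examples / batch_size)


n_test = 27142                       # size of the test split
n_subset = round(n_test * 10 / 100)  # examples in "test[:10%]" (assumed rounding)
seconds_per_batch = 37.05            # from the progress bar above

subset_batches = n_batches(n_subset, batch_size=64)
full_batches = n_batches(n_test, batch_size=64)

print(subset_batches, round(subset_batches * seconds_per_batch / 60, 1), "min")
print(full_batches, round(full_batches * seconds_per_batch / 3600, 1), "h")
```

The subset works out to 43 batches and about 26.5 minutes, matching the progress bar; the full test set would need 425 batches at the same batch size.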


In [None]:
aa_prec, aa_recall, pep_recall, pep_prec = metrics.compute_precision_recall(targs, preds)
aa_er = metrics.compute_aa_er(targs, preds)
auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))

print(f"aa_er       {aa_er}")
print(f"aa_prec     {aa_prec}")
print(f"aa_recall   {aa_recall}")
print(f"pep_prec    {pep_prec}")
print(f"pep_recall  {pep_recall}")
print(f"auc         {auc}")

aa_er       0.3226288085838648
aa_prec     0.6658796062929456
aa_recall   0.6351604413402167
pep_prec    0.6567891972993248
pep_recall  0.6451731761238025
auc         0.6115333781610109


_Note: to reproduce the results of the paper, the entire Yeast test set should be evaluated._

Saving the predictions...

In [None]:
pred_df = pd.DataFrame({
    "targets": targs,
    "predictions": preds,
    "log_probabilities": probs,
})
pred_df.head()

|   | targets | predictions | log_probabilities |
|---|---|---|---|
| 0 | TPGREDAAEETAAPGK | NRNVGDQNGC(+57.02)LAPGK | -8.156048 |
| 1 | TPGREDAAEETAAPGK | TDRPGEAAEETAAPGK | -3.115963 |
| 2 | DVAEKQDDIKEEAK | DVAEKQDDLKEEAK | -0.308857 |
| 3 | METNEQTMPTK | KTGGEMETMPTK | -4.681867 |
| 4 | METNEQTMPTK | METNEQTMPTK | -0.00187 |
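
The log-probabilities can be read as confidence scores by exponentiating them, as the `np.exp` call in the AUC computation does. A minimal sketch of filtering predictions by confidence, using the five rows shown above, follows. The function name and the 0.5 threshold are illustrative choices, not values recommended by the authors.

```python
import math

# (target, prediction, log_probability) rows copied from the table above
rows = [
    ("TPGREDAAEETAAPGK", "NRNVGDQNGC(+57.02)LAPGK", -8.156048),
    ("TPGREDAAEETAAPGK", "TDRPGEAAEETAAPGK", -3.115963),
    ("DVAEKQDDIKEEAK", "DVAEKQDDLKEEAK", -0.308857),
    ("METNEQTMPTK", "KTGGEMETMPTK", -4.681867),
    ("METNEQTMPTK", "METNEQTMPTK", -0.00187),
]


def filter_confident(rows, min_probability):
    """Keep non-empty predictions whose probability meets the threshold."""
    kept = []
    for _target, prediction, log_prob in rows:
        probability = math.exp(log_prob)
        if prediction and probability >= min_probability:
            kept.append((prediction, probability))
    return kept


for prediction, probability in filter_confident(rows, min_probability=0.5):
    print(f"{prediction}  {probability:.3f}")
```

Note that the placeholder log-probability of -1 assigned to failed decodes maps to a probability of about 0.37, so the sketch also checks that the prediction is non-empty rather than relying on the score alone.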


In [None]:
pred_df.to_csv("predictions.csv", index=False)

## InstaNovo+: Iterative Refinement with a Diffusion Model
In this section, we show how to refine the predictions from the transformer model with a diffusion model.

First, we download the model checkpoint.

In [None]:
!wget https://github.com/instadeepai/InstaNovo/releases/download/0.1.5/instanovoplus_yeast.zip -P ./checkpoints/
!unzip ./checkpoints/instanovoplus_yeast.zip -d ./checkpoints

Next, we load the checkpoint and create a decoder object.

In [None]:
from instanovo.diffusion.multinomial_diffusion import MultinomialDiffusion
from instanovo.inference.diffusion import DiffusionDecoder

diffusion_model = MultinomialDiffusion.load("./checkpoints/diffusion_checkpoint")
diffusion_model = diffusion_model.to(device).eval()
diffusion_decoder = DiffusionDecoder(model=diffusion_model)


Then we prepare the inference data loader using predictions from the InstaNovo transformer model.

In [None]:
import polars as pl
from instanovo.diffusion.dataset import AnnotatedPolarsSpectrumDataset, collate_batches

# Pair each spectrum with its InstaNovo prediction, which seeds the refinement
diffusion_dataset = AnnotatedPolarsSpectrumDataset(
    pl.from_pandas(pd.DataFrame(dataset)), peptides=preds
)

diffusion_data_loader = DataLoader(diffusion_dataset, batch_size=64, shuffle=False,
                                   collate_fn=collate_batches(
                                       residues=diffusion_model.residues,
                                       max_length=diffusion_model.config.max_length,
                                       time_steps=diffusion_decoder.time_steps,
                                       annotated=True
                                   ))

Finally, we predict sequences by iterating over the spectra and refining the InstaNovo predictions.

In [None]:
predictions = []
log_probs = []

for batch in tqdm(diffusion_data_loader, total=len(diffusion_data_loader)):
    spectra, spectra_padding_mask, precursors, peptides, peptide_padding_mask = batch
    spectra = spectra.to(device)
    spectra_padding_mask = spectra_padding_mask.to(device)
    precursors = precursors.to(device)
    peptides = peptides.to(device)
    peptide_padding_mask = peptide_padding_mask.to(device)

    with torch.no_grad():
        batch_predictions, batch_log_probs = diffusion_decoder.decode(
            spectra=spectra,
            spectra_padding_mask=spectra_padding_mask,
            precursors=precursors,
            initial_sequence=peptides
        )
    predictions.extend(["".join(sequence) for sequence in batch_predictions])
    log_probs.extend(batch_log_probs)

The iterative refinement improves performance on this sample of the nine-species dataset. (To replicate the performance reported in the paper, you would need to evaluate on the entire dataset.)

In [None]:
aa_prec, aa_recall, pep_recall, pep_prec = metrics.compute_precision_recall(targs, predictions)
aa_er = metrics.compute_aa_er(targs, predictions)
auc = metrics.calc_auc(targs, predictions, np.exp(pd.Series(log_probs)))

print(f"aa_er       {aa_er}")
print(f"aa_prec     {aa_prec}")
print(f"aa_recall   {aa_recall}")
print(f"pep_prec    {pep_prec}")
print(f"pep_recall  {pep_recall}")
print(f"auc         {auc}")

In [None]:
diffusion_predictions = pd.DataFrame({
    "targets": targs,
    "predictions": predictions,
    "log_probabilities": log_probs,
})
diffusion_predictions.head()

In [None]:
diffusion_predictions.to_csv("diffusion_predictions.csv", index=False)
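
Beyond the aggregate metrics, it can be informative to count how many individual peptides the refinement fixed or broke. The sketch below does this with exact string matching. The function name is hypothetical, and the `after` list in the example is made up for illustration; it does not show real model output.

```python
def refinement_summary(targets, before, after):
    """Count per-peptide outcomes of refinement using exact matches."""
    summary = {"fixed": 0, "broken": 0, "still_correct": 0, "still_wrong": 0}
    for target, old, new in zip(targets, before, after):
        was_correct, is_correct = old == target, new == target
        if is_correct and not was_correct:
            summary["fixed"] += 1
        elif was_correct and not is_correct:
            summary["broken"] += 1
        elif is_correct:
            summary["still_correct"] += 1
        else:
            summary["still_wrong"] += 1
    return summary


# Illustrative inputs only: the "after" sequences are invented.
example_targets = ["TPGREDAAEETAAPGK", "METNEQTMPTK", "METNEQTMPTK"]
example_before = ["TDRPGEAAEETAAPGK", "KTGGEMETMPTK", "METNEQTMPTK"]
example_after = ["TPGREDAAEETAAPGK", "KTGGEMETMPTK", "METNEQTMPTK"]
print(refinement_summary(example_targets, example_before, example_after))
```

In the notebook, the equivalent call would be `refinement_summary(targs, preds, predictions)`.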