# SimMS - bring your own MGF

In this notebook, you can upload your own `.mgf` spectra files and use the Colab GPU for fast similarity calculations.

You will need to enable a **Colab GPU instance** for this to run; refer to the [quickstart notebook](https://colab.research.google.com/github/PangeAI/simms/blob/main/notebooks/samples/colab_tutorial_pesticide.ipynb) to learn how to do this.

In [1]:
import os

assert os.system("nvidia-smi") == 0, ("GPU isn't available. \n In the upper right, click triangle icon->'Change runtime Type' and select the 'T4 GPU'.")

Upload two `.mgf` files here (we will compute pairwise similarities between the spectra of those two files):

In [11]:
from google.colab import files

print("Upload an mgf file (as a query)")
uploaded = files.upload()
# Handle the uploaded file
for filename_a, content in uploaded.items():
    print(f'Uploaded file "{filename_a}" with size {len(content)/1e9:.3f}GB')

print("Now upload a second mgf file (as a reference)")
uploaded = files.upload()
# Handle the uploaded file
for filename_b, content in uploaded.items():
    print(f'Uploaded file "{filename_b}" with size {len(content)/1e9:.3f}GB')

Upload an mgf file (as a query)


Saving pesticides.mgf to pesticides (5).mgf
Uploaded file "pesticides (5).mgf" with size 0.000GB
Now upload another mgf file (as a reference)


Saving pesticides.mgf to pesticides (6).mgf
Uploaded file "pesticides (5).mgf" with size 0.000GB
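A quick sanity check on an uploaded file can catch truncated or malformed input early. The sketch below counts spectra by matching the `BEGIN IONS` / `END IONS` lines that delimit each spectrum in the MGF format. The helper name `count_mgf_spectra` and the sample text are made up for illustration; they are not part of `simms` or `matchms`.

```python
import io

def count_mgf_spectra(lines):
    """Count complete BEGIN IONS ... END IONS blocks in an iterable of lines."""
    count = 0
    inside = False
    for line in lines:
        token = line.strip().upper()
        if token == "BEGIN IONS":
            inside = True
        elif token == "END IONS" and inside:
            # Only count a block once it has been properly closed.
            count += 1
            inside = False
    return count

# Tiny illustrative MGF text (made-up values, two spectra).
sample = """BEGIN IONS
PEPMASS=231.07
SCANS=1
100.0 10.0
150.5 55.0
END IONS
BEGIN IONS
PEPMASS=301.14
SCANS=2
120.2 80.0
END IONS
"""
n_spectra = count_mgf_spectra(io.StringIO(sample))
print(n_spectra)
```

In this notebook the same helper could be applied to an uploaded file with `with open(filename_a) as f: count_mgf_spectra(f)`.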


Install the `simms` package from GitHub

In [2]:
! pip uninstall simms -qq -y
! pip install -q git+https://github.com/PangeAI/simms.git@main

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for simms (pyproject.toml) ... done


Import relevant methods and libraries

In [3]:
from matchms.importing import load_from_mgf
from matchms.filtering import default_filters
from matchms.filtering import normalize_intensities
from matchms import calculate_scores
from matchms.similarity import CosineGreedy
from simms.utils import download
from pathlib import Path
from simms.similarity import CudaCosineGreedy
from numba import cuda
import numpy as np
assert cuda.is_available()

In [6]:
# Set reasonable default parameters for the kernel
match_limit = 1024
max_peaks = 1024
batch_size = 2048
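To get a feel for how `batch_size` relates to the amount of work, the sketch below assumes the score matrix is covered by tiles of at most `batch_size` × `batch_size` spectrum pairs. That tiling scheme is an assumption made for illustration, and `count_batches` is a hypothetical helper, not a `simms` function.

```python
import math

def count_batches(n_references, n_queries, batch_size):
    """Number of tiles needed to cover an (n_references x n_queries) matrix."""
    return math.ceil(n_references / batch_size) * math.ceil(n_queries / batch_size)

# A small run: 76 references against 76 queries fits in a single tile.
small = count_batches(76, 76, 2048)

# A larger, made-up workload needs many tiles.
large = count_batches(100_000, 5_000, 2048)

print(small, large)
```

Under this assumption, a larger `batch_size` means fewer, bigger tiles, at the cost of more GPU memory per tile.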

In [7]:
from pathlib import Path
from joblib import Parallel, delayed
from matchms.filtering import default_filters, normalize_intensities, reduce_to_number_of_peaks
from matchms.importing import load_from_mgf

def parse_file(spectra_file):
    """Load an .mgf file and preprocess every spectrum in parallel."""
    def parse_spectrum(spectrum):
        # Uncomment if you want default filters enabled - add more if you need them.
        # spectrum = default_filters(spectrum)
        # Keep at most max_peaks peaks per spectrum
        spectrum = reduce_to_number_of_peaks(spectrum, n_max=max_peaks)
        # spectrum = normalize_intensities(spectrum)
        return spectrum

    spectrums = load_from_mgf(spectra_file)
    # n_jobs=-1 uses all available CPU cores
    spectrums = Parallel(n_jobs=-1)(delayed(parse_spectrum)(spec) for spec in spectrums)
    # Filters may return None for spectra they reject; drop those
    spectrums = [spe for spe in spectrums if spe is not None]
    return spectrums

# filename_a was uploaded as the query file, filename_b as the reference file
queries = parse_file(filename_a)
references = parse_file(filename_b)
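The peak-reduction step above caps each spectrum at `max_peaks` peaks by discarding the least intense ones. The NumPy sketch below illustrates that idea on plain arrays; `keep_top_peaks` is a hypothetical helper written for this example, not the `matchms` implementation, and the peak values are made up.

```python
import numpy as np

def keep_top_peaks(mz, intensities, n_max):
    """Keep the n_max most intense peaks, returned in ascending m/z order."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    if mz.size <= n_max:
        return mz, intensities
    # Indices of the n_max largest intensities.
    top = np.argsort(intensities)[-n_max:]
    # Restore ascending m/z order among the kept peaks.
    top = top[np.argsort(mz[top])]
    return mz[top], intensities[top]

mz = [100.0, 150.0, 200.0, 250.0, 300.0]
intensities = [5.0, 50.0, 1.0, 80.0, 20.0]
kept_mz, kept_int = keep_top_peaks(mz, intensities, n_max=3)
print(kept_mz, kept_int)
```

Capping the peak count keeps every spectrum within a fixed-size buffer, which is why the same `max_peaks` value is used both here and in the kernel parameters.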

Compute the similarities using the original `matchms` functions.

In [8]:
similarity_function=CudaCosineGreedy(
    # You can modify the similarity parameters here, just like in matchms.
    tolerance=0.1,
)

scores = calculate_scores(references=references,
                          queries=queries,
                          similarity_function=similarity_function)
print(f"Size of matrix of computed similarities: {scores.scores.shape}")

# matchms lets you retrieve the best matches for any query with scores_by_query
query = queries[0]  # just an example
best_matches = scores.scores_by_query(query, 'CudaCosineGreedy_score', sort=True)

# Print the calculated scores for each spectrum pair
for (reference, (score, matches, overflow)) in best_matches[:10]:
    # Ignore scores between same spectrum
    if reference is not query:
        print(f"Reference scan id: {reference.metadata['scans']}")
        print(f"Query scan id: {query.metadata['scans']}")
        print(f"Score: {score:.4f}")
        print(f"Number of matching peaks: {matches}")
        # Overflow flags pairs where the kernel reached its match limit (see match_limit above)
        print(f"Overflow: {bool(overflow)}")
        print("----------------------------")

100%|██████████| 1/1 [00:01<00:00,  1.55s/it]

Size of matrix of computed similarities: (76, 76, 3)
Reference scan id: 2161
Query scan id: 2161
Score: 1.0000
Number of matching peaks: 81
Overflow: False
----------------------------
Reference scan id: 613
Query scan id: 2161
Score: 0.8646
Number of matching peaks: 14
Overflow: False
----------------------------
Reference scan id: 603
Query scan id: 2161
Score: 0.8237
Number of matching peaks: 14
Overflow: False
----------------------------
Reference scan id: 2160
Query scan id: 2161
Score: 0.8015
Number of matching peaks: 25
Overflow: False
----------------------------
Reference scan id: 2362
Query scan id: 2161
Score: 0.2923
Number of matching peaks: 7
Overflow: False
----------------------------
Reference scan id: 2598
Query scan id: 2161
Score: 0.2231
Number of matching peaks: 5
Overflow: False
----------------------------
Reference scan id: 2594
Query scan id: 2161
Score: 0.1761
Number of matching peaks: 3
Overflow: False
----------------------------
Reference scan id: 1944
Quer
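For intuition about the score and the number of matching peaks reported above, here is a minimal CPU sketch of a greedy cosine score. It assumes intensity-only weighting (no m/z weighting) and no precursor shift, and it is a plain NumPy illustration, not the CUDA kernel used by `CudaCosineGreedy`. The function name `cosine_greedy` and the example peaks are made up.

```python
import numpy as np

def cosine_greedy(mz_a, int_a, mz_b, int_b, tolerance=0.1):
    """Greedy cosine score between two peak lists; returns (score, n_matches)."""
    mz_a, int_a = np.asarray(mz_a, float), np.asarray(int_a, float)
    mz_b, int_b = np.asarray(mz_b, float), np.asarray(int_b, float)
    # All candidate pairs whose m/z values differ by at most `tolerance`.
    i_idx, j_idx = np.nonzero(np.abs(mz_a[:, None] - mz_b[None, :]) <= tolerance)
    products = int_a[i_idx] * int_b[j_idx]
    # Visit candidates from the largest intensity product down.
    order = np.argsort(-products, kind="stable")
    used_a, used_b = set(), set()
    total, n_matches = 0.0, 0
    for k in order:
        i, j = int(i_idx[k]), int(j_idx[k])
        if i in used_a or j in used_b:
            continue  # each peak may be matched at most once
        used_a.add(i)
        used_b.add(j)
        total += products[k]
        n_matches += 1
    # Normalize by the product of the two intensity vector norms.
    norm = np.sqrt(np.sum(int_a**2)) * np.sqrt(np.sum(int_b**2))
    score = float(total / norm) if norm > 0 else 0.0
    return score, n_matches

# Two toy spectra sharing one peak within the tolerance.
score, n_matches = cosine_greedy([100.0, 200.0], [1.0, 1.0],
                                 [100.05, 300.0], [1.0, 1.0])
print(score, n_matches)
```

A spectrum compared with itself matches every peak and scores 1.0, which is why the first entry in the output above pairs scan 2161 with itself.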




In [9]:
np.savez_compressed(
        'scores.npz',
        score=scores.scores.to_array()['CudaCosineGreedy_score'],
)

# Trigger file download
files.download('scores.npz')


The cell above saves the full similarity matrix to `scores.npz` and downloads it.

You can then use `scores.npz` as follows:

```py
import numpy as np

score_loaded = np.load('scores.npz')['score']

# Get the score between reference number 10 and query number 21
# (axis 0 indexes references, axis 1 indexes queries)
print(score_loaded[10, 21])
```

Or, you can just use it within this notebook, without downloading it.
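As one example of working with the matrix directly, the sketch below picks the best-scoring references for a single query using `np.argsort`. It runs on a synthetic random matrix shaped `(n_references, n_queries)` so that it is self-contained; `top_k_references` is a hypothetical helper, and the random values stand in for real scores.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the score matrix, shaped (n_references, n_queries).
score_matrix = rng.random((6, 4))

def top_k_references(score_matrix, query_index, k=3):
    """Indices of the k best-scoring references for one query, best first."""
    column = score_matrix[:, query_index]
    # Negate so that argsort returns the highest scores first.
    return np.argsort(-column, kind="stable")[:k]

best = top_k_references(score_matrix, query_index=0, k=3)
print(best, score_matrix[best, 0])
```

Inside this notebook, the real matrix is `scores.scores.to_array()['CudaCosineGreedy_score']`, which can be passed in place of `score_matrix`.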