<a href="https://colab.research.google.com/github/klanita/PoincareMSA/blob/master/PoincareMSA_colab_Tatiana.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/klanita/PoincareMSA/blob/master/.github/PoincareMSA_small_logo.png?raw=true" height="100" style="height:100px;margin-left: 0px;">

# Poincaré maps for visualization of large protein families

PoincareMSA builds an interactive projection of an input protein multiple sequence alignment (MSA) using a method based on Poincaré maps, described by Klimovskaia et al. [1]. The projection preserves both the local proximities between protein sequences and the hierarchy contained in the data. Sequences located closer to the center of the projection therefore correspond to proteins that share the most general functional properties and/or appeared at earlier stages of evolution. Source code is available at https://github.com/klanita/PoincareMSA.

[1] Klimovskaia, A., Lopez-Paz, D., Bottou, L. et al. Poincaré maps for analyzing complex hierarchies in single-cell data. Nat Commun 11, 2966 (2020).
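
As an illustration of the geometry behind the projection, here is a minimal sketch of the standard distance of the Poincaré disk model, d(u, v) = arcosh(1 + 2|u - v|² / ((1 - |u|²)(1 - |v|²))). The function name `poincare_distance` is ours and is not part of the PoincareMSA code. The example only shows why the center of the disk stays comparatively close to every point, while two points near the border are far apart from each other.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disk."""
    sq_norm_u = u[0] ** 2 + u[1] ** 2
    sq_norm_v = v[0] ** 2 + v[1] ** 2
    if sq_norm_u >= 1 or sq_norm_v >= 1:
        raise ValueError("Both points must lie strictly inside the unit disk.")
    sq_diff = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))

center = (0.0, 0.0)
near_border_a = (0.9, 0.0)
near_border_b = (-0.9, 0.0)

# Distance from the center to a point near the border.
d_center = poincare_distance(center, near_border_a)
# Distance between two opposite points near the border.
d_border = poincare_distance(near_border_a, near_border_b)
print(d_center, d_border)
```

In this toy case the path between the two border points is twice as long as the path from the center to either of them, which is the behaviour expected when the shortest route passes through the center.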

# Notebook initialization

In [1]:
#@title ### Load PoincaréMSA Github repository & install dependencies
print("1. Load PoincaréMSA Github repository")
import os
if os.getcwd() == "/content":
    !git clone https://github.com/klanita/PoincareMSA.git
    %cd PoincareMSA

# Check if the GPU is activated
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print('\nUsing device:', device)

#Install missing module
print("\n2. Install dependencies")
!pip install adjustText
!pip install -U kaleido

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

#File import
from google.colab import files
import io

#Import visualization functions
from scripts.visualize_projection.pplots_new import read_embeddings, plot_embedding, plot_embedding_interactive, rotate, get_colors
%matplotlib inline

#Create optional variables
path_annotation = ""

1. Load PoincaréMSA Github repository
Cloning into 'PoincareMSA'...
remote: Enumerating objects: 50031, done.
remote: Counting objects: 100% (178/178), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 50031 (delta 78), reused 150 (delta 59), pack-reused 49853
Receiving objects: 100% (50031/50031), 98.91 MiB | 16.55 MiB/s, done.
Resolving deltas: 100% (20104/20104), done.
Checking out files: 100% (3007/3007), done.
/content/PoincareMSA

Using device: cuda

2. Install dependencies
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting adjustText
  Downloading adjustText-0.7.3.tar.gz (7.5 kB)
Building wheels for collected packages: adjustText
  Building wheel for adjustText (setup.py) ... done
  Created wheel for adjustText: filename=adjustText-0.7.3-py3-none-any.whl size=7097 sha256=1f54a8cb7527c1ca39aa803ae1a34a3e0454cf8a69a83b1a2ec839130276ab6a
  Stored in directory: /root/.cache/pip/wh

# Data upload

In [2]:
#@title ### Upload MSA in mfasta format
uploaded = files.upload()
mfasta = next(iter(uploaded))

nb_seq = 0
with open(mfasta, "r") as f_in:
    for line in f_in:
        if line[0] == ">":
            nb_seq += 1

print(f"\nNumber of sequences found: {nb_seq}.")

Saving kinases.mfasta to kinases.mfasta

Number of sequences found: 497.


In [4]:
#@title ### Upload annotation file (optional)
uploaded = files.upload()
path_annotation = next(iter(uploaded))
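try:
    df_annotation = pd.read_csv(path_annotation)
except Exception as err:
    #Only parsing errors are reported as a format problem
    raise ValueError("Annotation file is not in .csv format.") from err

#Checked outside the try block so that this message is not masked
if len(df_annotation) != nb_seq:
    raise ValueError("Annotation file doesn't match the .mfasta file length.")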

print("\nAnnotation file correctly loaded.")
annotation_names = list(df_annotation.columns)
print(f"{len(annotation_names)} annotations found: {annotation_names}.")

Saving kinase_group_new.csv to kinase_group_new.csv

Annotation file correctly loaded.
18 annotations found: ['proteins_id', '1_Group', '2_Gene', '3_HGNC', '4_Uni_entry', '5_Uni_acc', '6_Domain_begin', '7_Domain_end', '8_Domain_length', '9_Largest_insert_length', '10_PDB_validation', '11_Conformational_state', '12_Dihedral_state', '13_Group_in_Uni', '14_Group_in_Manning', '15_Synonymn', 'evo_distance', 'decile_domain'].
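
The check in the cell above only requires the annotation table to contain one row per sequence. Below is a minimal sketch of building such a table from the alignment headers. It assumes (our assumption, not stated in the notebook) that rows should follow the order of the sequences in the `.mfasta` file. The helper name `headers_to_annotation` and the `my_group` column are hypothetical, and whether the plotting functions expect particular column names is not covered here.

```python
import csv
import io

def headers_to_annotation(mfasta_text):
    """Return CSV text with one row per sequence header of a FASTA-formatted string."""
    headers = [
        line[1:].strip()
        for line in mfasta_text.splitlines()
        if line.startswith(">")
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # "my_group" is a placeholder column to be filled with real annotations.
    writer.writerow(["proteins_id", "my_group"])
    for header in headers:
        writer.writerow([header, "unknown"])
    return buffer.getvalue()

example_mfasta = ">seq1\nMK-LV\n>seq2\nMKALV\n"
annotation_csv = headers_to_annotation(example_mfasta)
print(annotation_csv)
```

The resulting text can be saved to a `.csv` file and uploaded in the cell above; it has as many data rows as there are `>` headers, which is what the length check compares against.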


# Data preparation
Here we clean the input .mfasta alignment and translate each sequence to a vector ready for projection.
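
To illustrate the cleaning step, here is a minimal sketch of gap-threshold filtering: an alignment column is dropped when its fraction of gap characters exceeds the threshold. It is an illustration only and not the code in `scripts/prepare_data`; the function name `filter_gapped_columns` is hypothetical.

```python
def filter_gapped_columns(sequences, gap_threshold=0.9, gap_char="-"):
    """Drop alignment columns whose gap fraction is above gap_threshold."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("All aligned sequences must have the same length.")
    n_seq = len(sequences)
    # Indices of the columns that are kept.
    kept = [
        col
        for col in range(length)
        if sum(seq[col] == gap_char for seq in sequences) / n_seq <= gap_threshold
    ]
    return ["".join(seq[col] for col in kept) for seq in sequences]

alignment = ["MK-A", "M--A", "MR-A", "M--C"]
cleaned = filter_gapped_columns(alignment, gap_threshold=0.9)
print(cleaned)
```

In this toy alignment only the third column, which consists entirely of gaps, is removed; the second column is half gaps and is kept because 0.5 is below the threshold.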

In [5]:
#@title ## Parameters for data preparation
#@markdown ### Job name
#@markdown Name for the output folder
out_name = "poincareMSA" #@param {type:"string"}

#@markdown ### Threshold for filtering gapped positions
#@markdown Positions whose proportion of gaps exceeds the given threshold are removed from the alignment. If your alignment is very gapped, you may want to increase this value.
gapth = 0.9 #@param {type:"number"}

In [6]:
#@title ## Preparation
#@markdown Data preparation consists of cleaning the `.mfasta` alignment according to a gap threshold and translating each sequence into a PSSM profile.

print("1. Data preparation")
prep_parameters = "scripts/prepare_data" + " " + mfasta + " " + out_name + " " + out_name + " " + str(gapth)
bash_projection = "bash scripts/prepare_data/create_projection.sh " + prep_parameters
!{bash_projection}

1. Data preparation
Input file: kinases.mfasta
Name of the protein family: kinases
23923 X aa replaced by gaps in 497 sequences
filter_gaps finished for kinases.mfasta
mfasta2fasta finished for poincareMSA/poincareMSA.clean0.9.mfasta
23923 X aa replaced by gaps in 497 sequences


# Projection

In [8]:
#@title ### Projection parameters
#@markdown Here you can set the parameters of Poincaré maps. In our computational experiments, the best results were achieved with the default values below. The impact of the different parameters is analyzed in the original paper [1].
knn = 5 #@param {type:"number"}
gamma = 2 #@param {type:"number"}
sigma = 1 #@param {type:"number"}
cospca = 0 #@param {type:"number"}
batchs = 4 #@param {type:"number"}
epochs = 1000 #@param {type:"number"}
seed = 42 #@param {type:"number"}
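
To give some intuition for `knn` and `sigma`, below is a rough sketch of how a k-nearest-neighbour graph can be turned into a relative forest accessibility (RFA) matrix, the quantity reported as "Computing RFA..." in the output of the projection step. It assumes Euclidean distances, an exponentially decaying kernel on neighbour distances, and RFA = (I + L)^(-1) with L the graph Laplacian. The implementation in `scripts/build_poincare_map` may differ in its distance, kernel scaling and symmetrization, and the function name `rfa_from_features` is hypothetical.

```python
import numpy as np

def rfa_from_features(features, knn=5, sigma=1.0):
    """Sketch: symmetric kNN graph -> kernel weights -> Laplacian -> (I + L)^-1."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Pairwise Euclidean distances (an assumption; other metrics are possible).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each point's knn nearest neighbours, excluding the point itself.
    adjacency = np.zeros((n, n), dtype=bool)
    for i in range(n):
        neighbours = [j for j in np.argsort(dist[i]) if j != i][:knn]
        adjacency[i, neighbours] = True
    adjacency = adjacency | adjacency.T  # symmetrize the graph
    # Exponential kernel on distances; sigma controls how fast weights decay.
    weights = np.where(adjacency, np.exp(-dist / sigma), 0.0)
    laplacian = np.diag(weights.sum(axis=1)) - weights
    return np.linalg.inv(np.eye(n) + laplacian)

rng = np.random.default_rng(42)
toy_features = rng.normal(size=(8, 3))
rfa = rfa_from_features(toy_features, knn=3, sigma=1.0)
print(rfa.shape, rfa.sum(axis=1))
```

Under these assumptions each row of the matrix sums to one, so a row can be read as a distribution of "closeness" of one sequence to all others. A larger `knn` connects more of the graph, while a smaller `sigma` makes the weights fall off faster with distance.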

In [9]:
print("\n2. Data projection using Poincaré disk")
#@title ## Building projection and preparing data for visualization
#@markdown This step creates a projection of encoded sequences to a Poincaré disk.
bash_pm = "python3 "+ "scripts/build_poincare_map/main.py --input_path " + out_name + "/fasta" + str(gapth) + " --output_path " + out_name + "/projections/ --gamma "+ str(gamma) +" --pca "+ str(cospca) + " --epochs "+ str(epochs) +" --seed "+ str(seed) + " --knn " + str(knn)
!{bash_pm}

print("\n3. Format data for visualization")
#If no annotation file was provided, create a dummy one
if not path_annotation:
    df_annotation = pd.DataFrame(list(range(1,nb_seq+1)), columns=["id"])
    df_annotation.to_csv("dummy_annotation.csv", index=False)
    path_annotation = "dummy_annotation.csv"
    annotation_names = ["id"]


path_embedding = f"{out_name}/projections/PM{knn:1.0f}sigma={sigma:2.2f}gamma={gamma:2.2f}cosinepca={cospca:1.0f}_seed{seed:1.0f}.csv"
df_embedding = read_embeddings(path_embedding, path_annotation, withroot=False)


2. Data projection using Poincaré disk
CUDA: True
497 proteins found in folder poincareMSA/fasta0.9.
No root detected
Prepare data: tensor construction
Prepare data: successfully terminated
Computing laplacian...
Laplacian computed in 0.08 sec
Computing RFA...
RFA computed in 0.04 sec
Starting training...
loss: 0.63936: 100%|████████████████████████| 1000/1000 [04:36<00:00,  3.62it/s]
PM computed in 276.33 sec

loss = 6.394e-01
time = 4.608 min

3. Format data for visualization
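
The embedding file read in the cell above is located through a formatted file name. As a small illustration of how the format specifiers expand with the default parameters, the sketch below wraps the same f-string in a helper; the name `embedding_path` is hypothetical.

```python
def embedding_path(out_name, knn, sigma, gamma, cospca, seed):
    """Rebuild the projection CSV path with the same f-string as the notebook cell."""
    return (
        f"{out_name}/projections/PM{knn:1.0f}sigma={sigma:2.2f}"
        f"gamma={gamma:2.2f}cosinepca={cospca:1.0f}_seed{seed:1.0f}.csv"
    )

path = embedding_path("poincareMSA", knn=5, sigma=1, gamma=2, cospca=0, seed=42)
print(path)  # poincareMSA/projections/PM5sigma=1.00gamma=2.00cosinepca=0_seed42.csv
```

Because the path is rebuilt from the current parameter values, changing a parameter in the form without rerunning the projection produces a path that does not match any existing file.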


# Projection visualization

In [11]:
#@title ### Available labels
#@markdown Here are the labels found in your annotation file (if one was uploaded):

print(f"{len(annotation_names)} annotations found: {annotation_names}.")
# print(df_annotation.head())

18 annotations found: ['proteins_id', '1_Group', '2_Gene', '3_HGNC', '4_Uni_entry', '5_Uni_acc', '6_Domain_begin', '7_Domain_end', '8_Domain_length', '9_Largest_insert_length', '10_PDB_validation', '11_Conformational_state', '12_Dihedral_state', '13_Group_in_Uni', '14_Group_in_Manning', '15_Synonymn', 'evo_distance', 'decile_domain'].


In [12]:
#@title ### Create interactive plot
#@markdown Here you can set different parameters to color & annotate the resulting projection:

title = "" #@param {type:"string"}

#Labels name
#@markdown ---
#@markdown #### Select the coloring from annotation .csv file:
labels_name = "id" #@param {type:"string"}
if labels_name == "":
    labels_name = None
elif labels_name not in annotation_names:
    raise NameError(f'"labels_name" {labels_name} is not in the available annotations.\nAvailable annotations: {annotation_names}')

#Labels text
#@markdown #### Select classes to label among the "labels_name" or "second_labels_name" column (comma separated list):
second_labels_name = "" #@param {type:"string"}
if second_labels_name == "":
    second_labels_name = None
elif second_labels_name not in annotation_names:
    raise NameError(f'"second_labels_name" {second_labels_name} is not in the available annotations.\nAvailable annotations: {annotation_names}')


labels_text = "20" #@param {type:"string"}
if labels_text:
    try:
        labels_text = [s.strip() for s in labels_text.split(",")]
    except:
        print('Error: "labels_text" field is not a valid list.')
else:
    labels_text = [""]

#Convert labels_text to the dtype of the column used for labelling
#(second_labels_name takes precedence over labels_name when both are set)
label_column = second_labels_name if second_labels_name else labels_name
if label_column and labels_text != [""]:
    labels_text_dtype = df_annotation[label_column].dtypes
    try:
        labels_text = list(np.array(labels_text).astype(labels_text_dtype))
    except (ValueError, TypeError):
        raise TypeError(f'"labels_text" is not compatible with "{label_column}" data format ({labels_text_dtype}).')

show_text = True #@param {type:"boolean"}
#@markdown ---

#@markdown #### Use a custom color palette:
color_palette = None #@param {type:"raw"}
use_custom_palette = False #@param {type:"boolean"}

if not use_custom_palette:
    color_palette = None

#Plot graph
fig = plot_embedding_interactive(df_embedding, 
                                 labels_name = labels_name,
                                 second_labels_name = second_labels_name, 
                                 show_text = show_text,
                                 labels_text = labels_text,
                                 color_palette = color_palette, 
                                 title = title, 
                                 fontsize = 11)
fig.show()

In [None]:
#@title Save plot to file
output_name = "globins3k-epochs1000-kingdom" #@param {type:"string"}
output_format = "html" #@param ["png", "html", "pdf", "svg"]

if output_format != "html":
    fig.write_image(f"{output_name}.{output_format}", engine="kaleido")
else:
    fig.write_html(f"{output_name}.{output_format}")
files.download(f"{output_name}.{output_format}")

In [None]:
#@title Download intermediate data
bash_command = f"zip -r -q {out_name}.zip {out_name}"
!{bash_command}

files.download(f"{out_name}.zip")

# Help

### Enabling the GPU

To enable the GPU in your notebook, select the following menu option and choose GPU as the hardware accelerator:
```
Runtime / Change runtime type
```

<figure>
<center>
<img src="https://github.com/klanita/PoincareMSA/blob/master/.github/colab_gpu.png?raw=true" width=500>
</center>
</figure>

