<a href="https://colab.research.google.com/github/Bio2Byte/public_notebooks/blob/main/B2B_Tools_MSA_Example_In_function_of_a_protein's_behavior.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bio2Byte - Multiple Sequence Alignment Analysis - In function of a reference protein.

#### **Goal: Compute and visualize the biophysical and sequence conservation of a Multiple Sequence Alignment (MSA) using the b2btools.**

*In this notebook we compute the biophysical and sequence conservation IN FUNCTION OF A REFERENCE protein. To compute them for the overall behavior of the MSA, please refer to the following [notebook](https://colab.research.google.com/drive/10gqZ8B1-EOXBnZ2m13ZOxYRg6peP3tcW#scrollTo=rmjYOzRzYfA6&uniqifier=4).*

The following code will:
0. Install the b2btools and its dependencies.
1. Upload the Multiple Sequence Alignment (MSA) to study.
2. Predict the biophysical properties of the input MSA:
      *   Backbone and sidechain dynamics (Dynamine)
      *   Conformational propensities (sheet, helix, coil, polyproline II) (Dynamine)
      *   Early folding propensities (EFoldMine)
      *   Disorder propensities (Disomine)
3. Compute the sequence conservation (Shannon's entropy) of the input MSA at every position as well as the MSA occupancy.
4. Compute 2D plots of the biophysical behaviour, the conservation of the biophysical properties and the sequence conservation.
5. Compute the Gaussian Mixture Model scores of the proteins to identify residues with a different behavior compared to the other residues at the same position in the MSA.
6. Download your results.

**Check out our webserver: [online b2BTools](https://bio2byte.be/b2btools/)**

In [2]:
#@title 1.a Install the B2BTools Package
%%capture
!pip install b2bTools==3.0.7 ipympl hmmer
!pip install numpy==1.25.2


#@markdown Please be patient, this can take several minutes...

#@markdown ⚠️ Once installed, please restart the session and run step 1.b directly, without rerunning the first step.

In [3]:
#@title 1.b Import the dependencies
%%capture
import math
import Bio
from Bio import AlignIO
import os
import json
import shutil
from google.colab import files
from b2bTools.multipleSeq.Predictor import MineSuiteMSA
from google.colab import output
import numpy as np
from sklearn import mixture
import sys
from os import path

In [None]:
#@title 1.c Upload your MSA file
from google.colab import files

#@markdown Execute this cell to upload your MSA file from your computer.

#@markdown **🚨⚠️ IMPORTANT input file size limits:**
#@markdown * Dynamine (dynamics and secondary structure): min 5 residues per sequence
#@markdown * Disomine (disorder): min 5 residues per sequence
#@markdown * EfoldMine (early folding): min 5 residues per sequence

#@markdown **🚨⚠️ IMPORTANT input file format:**
#@markdown The file can be in CLUSTAL, FASTA, BaliBase, PSI, A3M, Blast, PHYLIP and STOCKHOLM format


def upload_msa_file():
    print("Please upload your MSA file.")

    # Upload file
    uploaded = files.upload()

    if not uploaded:
        print("No file uploaded.")
        return None

    for fn in uploaded.keys():
        print(f'MSA file "{fn}" with length {len(uploaded[fn])} bytes uploaded successfully.')
        return fn

msa_filename = upload_msa_file()


if msa_filename:

    fasta = True #@param {type:"boolean"}
    clustal = False #@param {type:"boolean"}
    phylip = False #@param {type:"boolean"}
    stockholm = False #@param {type:"boolean"}

    # Define possible extensions
    extensions = [fasta, clustal, phylip, stockholm]
    extensions_names = ["fasta", "clustal", "phylip", "stockholm"]

    # Determine selected extension
    extension = None
    for ext, name in zip(extensions, extensions_names):
        if ext:
            extension = name
            break

    if extension:
        print(f"The selected format of the MSA file is: {extension}")
    else:
        print("No format selected.")
else:
    print("File upload was unsuccessful. Please try again.")
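
As an alternative to ticking a checkbox, the format name could be inferred from the file extension. The sketch below is only an illustration: `guess_msa_format` and its extension table are hypothetical and are not part of b2bTools or Biopython. The returned names match the format strings used in this notebook.

```python
import os

# Hypothetical extension table; extend it to match your own file naming.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".faa": "fasta",
    ".aln": "clustal",
    ".clustal": "clustal",
    ".phy": "phylip",
    ".phylip": "phylip",
    ".sto": "stockholm",
    ".stockholm": "stockholm",
}


def guess_msa_format(filename):
    """Return a format name for `filename`, or None if the extension is unknown."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_FORMAT.get(ext.lower())


print(guess_msa_format("family.FASTA"))
print(guess_msa_format("family.txt"))
```

An unknown extension returns `None`, so the caller can still fall back to the manual selection above.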


In [6]:
#@title 2. Select the predictors you want to use and run them
#@markdown Select the bio2Byte tools you want to include in the predictions:
%%capture

#@markdown ### DynaMine predictor tool
#@markdown >Fast predictor of protein backbone dynamics using only sequence information as input.
#@markdown >The version here also predicts side-chain dynamics and secondary structure propensities
#@markdown >using the same principle.

#@markdown **Prediction values included**: `backbone`, `sidechain`, `helix`, `ppII`, `coil`, and `sheet`

DynaMine = True #@param {type:"boolean"}

#@markdown ### DisoMine predictor tool
#@markdown >Predicts protein disorder with recurrent neural networks not directly
#@markdown >from the amino acid sequence, but instead from more generic predictions of key
#@markdown >biophysical properties, here protein dynamics, secondary structure
#@markdown >and early folding.

#@markdown **Prediction values included**: `disoMine`

DisoMine = False #@param {type:"boolean"}

#@markdown ### EFoldMine predictor tool
#@markdown >Predicts from the primary amino acid sequence of a protein,
#@markdown >which amino acids are likely involved in early folding events.

#@markdown **Prediction values included**: `earlyFolding`

EFoldMine = False #@param {type:"boolean"}

#@markdown **Don't forget to run the cell after ticking the boxes.**

msaSuite = MineSuiteMSA()

# Collect the ticked predictors and pass them as a tuple of predictor names
predTypes = tuple(
    name
    for name, ticked in (('eFoldMine', EFoldMine), ('disoMine', DisoMine), ('dynamine', DynaMine))
    if ticked
)
if not predTypes:
    raise ValueError("Tick at least one predictor before running this cell.")

msaSuite.predictAndMapSeqsFromMSA(f"/content/{msa_filename}", predTypes=predTypes)

msaSuite.getDistributions()
jsondata_list = [msaSuite.alignedPredictionDistribs]
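
A note on Python tuple syntax that matters when passing predictor names: parentheses around a single string do not create a tuple, only a trailing comma does. The sketch below shows the difference and one way to build the selection from the three checkboxes. `build_pred_types` is a hypothetical helper, not part of b2bTools; the predictor names are the ones used in this cell, and how `predictAndMapSeqsFromMSA` treats a plain string is not verified here.

```python
def build_pred_types(dynamine, disomine, efoldmine):
    """Return a tuple of predictor names for the ticked boxes (hypothetical helper)."""
    flags = (("eFoldMine", efoldmine), ("disoMine", disomine), ("dynamine", dynamine))
    return tuple(name for name, ticked in flags if ticked)


# Parentheses alone do not make a tuple; the trailing comma does.
print(type(("dynamine")).__name__)   # str
print(type(("dynamine",)).__name__)  # tuple

print(build_pred_types(True, False, False))
```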

In [7]:
#@title 2.b. Select the protein you want to use as a reference
#@markdown The plots will be drawn relative to the values of the selected reference protein.

#@markdown Select the reference protein in the dropdown list that appears once you execute this cell.

import Bio
from Bio import AlignIO
import ipywidgets as widgets

alignment_file=AlignIO.read( msa_filename, extension)

ids =[]
for prot in alignment_file:
  ids.append(prot.id)

out = widgets.Dropdown(
    options=ids,
    description='Select protein:',
    disabled=False,
)
display(out)

In [8]:
#@title 2.c. Set the variables in function of the value selected above
%%capture
alignment_file = AlignIO.read(msa_filename, extension)

#selected protein in drop down list
selected_prot = out.value

print(f"The selected protein is {out.value}")

#retrieve columns occupied by selected protein in MSA (remove gaps)
occupied_pos = []
for prot in alignment_file:
  if prot.id == selected_prot:
      selected_prot_seq=list(prot.seq)
      for position,residue in enumerate(selected_prot_seq):
          if residue != "-":
              occupied_pos.append(position)
      break

#prediction selected protein
predictions_single_seq = msaSuite.allAlignedPredictions
selected_prot_data_none = predictions_single_seq[selected_prot]
selected_prot_data = {}
# Remove null values
for biophys_prop in selected_prot_data_none.keys():
    if isinstance(selected_prot_data_none[biophys_prop], list):
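        # Drop only the None placeholders; a plain truthiness test would also drop valid 0.0 values
        selected_prot_data[biophys_prop] = [i for i in selected_prot_data_none[biophys_prop] if i is not None]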
    else:
        selected_prot_data[biophys_prop] = selected_prot_data_none[biophys_prop]


#from the MSA predictions, select only the values linked to columns occupied by the selected protein
msa_predictions_in_fct_of = {}
selected_prot_predictions = {}

for biophys_prop in jsondata_list[0].keys():
  selected_prot_predictions[biophys_prop] = selected_prot_data[biophys_prop]
  values_msa = {}
  for statistics in jsondata_list[0][biophys_prop].keys():
      values = jsondata_list[0][biophys_prop][statistics]
      stat_values_msa = []
      #only keep values of columns occupied in selected protein
      for position_interest in occupied_pos:
          stat_values_msa.append(values[position_interest])
      values_msa[statistics] = stat_values_msa
  msa_predictions_in_fct_of[biophys_prop] = values_msa
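
The column selection performed in this cell can be illustrated on toy data. The sketch below is independent of b2bTools and Biopython; the function names, the toy sequence and the toy values are made up for illustration.

```python
def occupied_columns(aligned_seq, gap="-"):
    """Indices of alignment columns where the reference has a residue."""
    return [i for i, residue in enumerate(aligned_seq) if residue != gap]


def restrict_to_columns(per_column_values, columns):
    """Keep only the values that fall on the given alignment columns."""
    return [per_column_values[i] for i in columns]


reference = "MK--LV-A"  # toy aligned reference sequence
column_medians = [0.9, 0.8, None, None, 0.7, 0.6, None, 0.5]  # toy per-column statistic

columns = occupied_columns(reference)
print(columns)                                       # [0, 1, 4, 5, 7]
print(restrict_to_columns(column_medians, columns))  # [0.9, 0.8, 0.7, 0.6, 0.5]
```

After this step every list has one entry per residue of the reference protein, so the MSA statistics and the reference predictions share the same x-axis.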

In [9]:
#@title 4.a. Sequence conservation (Shannon's entropy)
%%capture
#@markdown Once this cell has been executed, the conservation $C$ of the residues
#@markdown at a particular column $x$ within the MSA is computed from
#@markdown Shannon's entropy $E$ and the fraction of gaps $G$ in that column:
#@markdown $$ C(x) = (1 - E(x))(1 - G(x)) $$

#@markdown Shannon's entropy at column $x$ equals:

#@markdown $$ E(x) = - \lambda \sum_{a=1}^{K} p_{a} \log_{2}(p_{a}) $$

#@markdown where $K$ is the alphabet size, which equals 21 as it includes the 20 amino acid types and 1 symbol for the gaps,

#@markdown and $p_{a}$ is the probability of observing the $a$-th symbol type at column $x$.

#@markdown $\lambda$ scales the entropy to the range $[0, 1]$.

#@markdown If the number of symbol types observed in column $x$, $M$ ($1 \leq M \leq K$), equals 1:
#@markdown $$ \lambda = 1 $$
#@markdown else:
#@markdown $$ \lambda = [\log_{2}(M)]^{-1} $$

#@markdown Moreover, the fraction of gaps in column $x$ is:
#@markdown $$ G(x) = \dfrac{n}{N} $$
#@markdown where $n$ is the number of gaps in column $x$ and $N$ is the number of sequences in the MSA.

#shannon_entropy computes the entropy of the symbols in one MSA column, together with the
#fraction of gaps in that column. See https://doi.org/10.1002/prot.10146 for the formula.
def shannon_entropy(list_input):
    tot = len(list_input) #total number of AA at particular position in MSA
    gaps = list_input.count("-") / tot #count frequency of gaps in that column of MSA
    unique_base = set(list_input) #remove duplicates, "-" is seen as an amino acid type
    unique_base_len = len(unique_base) #total number of AA at particular position in MSA with no duplicates


    entropy_list = [] # entropy of AA at particular position

    for base in unique_base:
        n_i = list_input.count(base)
        P_i = n_i / tot

        entropy_i = P_i * (math.log(P_i, 2))
        entropy_list.append(entropy_i)

    #sum entropy of every residue at a position and normalize it so it is between 0 and 1
    entropy_sum = math.fsum(entropy_list)

    if unique_base_len == 1: # log(1, 2) = 0; n/0 throws ZeroDivisionError
        shannon_entropy = entropy_sum
    else:
        unique_base_len_log = math.log(unique_base_len, 2)
        shannon_entropy = (-1 / unique_base_len_log) * entropy_sum

    #Return the entropy and the gap fraction at one position
    #If the entropy is high, then there are many possible arrangements (high variability)
    return shannon_entropy, gaps

def conservation(alignment_file):
    conservation_AA_list = []

    for col_no in range(len(list(alignment_file[0]))):
        list_input = list(alignment_file[:, col_no])

        sh_entropy_AA, sh_entropy_gaps = shannon_entropy(list_input)

        #Translate entropy into conservation: 0 means no conservation, 1 means highly conserved
        #As we took the gaps into account: if a column holds only gaps, entropy = 0 and freq_gaps = 1, so conservation = 0
        conservation_AA = (1 - sh_entropy_AA) * (1 - sh_entropy_gaps)
        conservation_AA_list.append(conservation_AA)

    return conservation_AA_list

#Read MSA
alignment_file = AlignIO.read(msa_filename, extension)
conservation_AA_list = conservation(alignment_file)

conservation_selected_protein = []
for position_interest in occupied_pos:
    conservation_selected_protein.append(conservation_AA_list[position_interest])
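
To make the formula concrete, the sketch below evaluates $C(x) = (1 - E(x))(1 - G(x))$ on a few toy columns. It is a compact restatement written for illustration, assuming the same definitions as above (the gap symbol counts as its own symbol type, and the entropy is scaled by $\log_{2}(M)$); `column_conservation` is a hypothetical name, not the function used by the notebook.

```python
import math
from collections import Counter


def column_conservation(column, gap="-"):
    """C(x) = (1 - E(x)) * (1 - G(x)) for one alignment column."""
    total = len(column)
    counts = Counter(column)  # the gap symbol counts as its own type
    gap_fraction = counts.get(gap, 0) / total
    if len(counts) == 1:
        entropy = 0.0  # a single symbol type carries no entropy
    else:
        raw = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropy = raw / math.log2(len(counts))  # scale to [0, 1]
    return (1 - entropy) * (1 - gap_fraction)


print(column_conservation(list("AAAA")))  # fully conserved column
print(column_conservation(list("----")))  # gap-only column
print(column_conservation(list("AAGG")))  # two equally frequent residues
print(column_conservation(list("AAA-")))  # one gap among identical residues
```

A fully conserved column scores 1, while both a gap-only column and a column with two equally frequent residues score 0. The last column shows how a single gap lowers the score twice: once through the entropy and once through the gap fraction.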

In [11]:
#@title 5. Prepare to plot results
#@markdown Run this cell in order to prepare the notebook context to render different plots. Please be patient this can take a few minutes.
%%capture
# output.enable_custom_widget_manager()
%matplotlib widget


import matplotlib.pyplot as plt
import os
import math

if not os.path.exists("/content/results"):
    os.mkdir("/content/results", )

# Subplot titles per tool, listed in the order in which the subplots are stacked
TITLES_BY_TOOL = [
    (DynaMine, {
        'backbone': "DynaMine backbone dynamics",
        'sidechain': "DynaMine sidechain dynamics",
        'ppII': "DynaMine conformational propensities: ppII (polyproline II)",
        'coil': "DynaMine conformational propensities: Coil",
        'sheet': "DynaMine conformational propensities: Sheet",
        'helix': "DynaMine conformational propensities: Helix",
    }),
    (EFoldMine, {'earlyFolding': "Early folding (EFoldMine)"}),
    (DisoMine, {'disoMine': "Disorder (disoMine)"}),
]

# Keep only the titles of the ticked tools
PREDICTION_TITLES = {}
for ticked, titles in TITLES_BY_TOOL:
    if ticked:
        PREDICTION_TITLES.update(titles)

# Row index of every prediction in the figure, and the total number of rows
PREDICTION_POSITION = {name: row for row, name in enumerate(PREDICTION_TITLES)}
NB_SUBPLOTS = len(PREDICTION_TITLES)

AXIS_TITLES = {
    "x": "Residue position in the MSA",
    "y": "Prediction values"
}


import matplotlib as mpl
mpl.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
import matplotlib.ticker as ticker

In [12]:
#@title 5.a. Plot the predicted biophysical conservation of the MSA
%%capture
#@markdown Once this cell has been executed, a file with plots will be saved in the folder "Results".
#@markdown In case you want to zoom in on the plots, run cell 5.b below.


#@markdown The plots represent the biophysical behaviour of the proteins in the MSA.
#@markdown There is one graph for every biophysical property that is studied.
#@markdown As the median, 1st and 3rd quartile and outliers are shown on the graphs,
#@markdown the conservation of the biophysical properties can be observed.

# from matplotlib.collections import LineCollection

# Function to plot the biophysical MSA
def plot_biophysical_msa(jsondata_list_interest, jsondata_list_selected, selected_prots, sequences):
    colors = ['blue', 'orange']
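    # Use the first available property so this also works when DynaMine is not selected
    residues_count = len(next(iter(jsondata_list_interest[0].values()))['median'])
    sequences_count = len(sequences)
    fig, axs = plt.subplots(NB_SUBPLOTS)
    axs = np.atleast_1d(axs)  # plt.subplots returns a single Axes when NB_SUBPLOTS == 1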
    fig.set_figwidth(20)
    fig.set_figheight(50)
    plt.suptitle(f"Predicted biophysical properties of the MSA in function of {selected_prots} ({sequences_count} aligned sequences)", fontsize=14)

    predictions = jsondata_list_interest[0].keys()
    for prediction_index, biophys_data in enumerate(predictions):
        if biophys_data == 'agmata':
            continue

        subplot_index_row = PREDICTION_POSITION[biophys_data]

        ax = axs[subplot_index_row]
        for data, col in zip(jsondata_list_interest, colors):
            none_idx = []

            for n in range(residues_count):
                if data[biophys_data]['median'][n] is None or data[biophys_data]['firstQuartile'][n] is None or data[biophys_data]['thirdQuartile'][n] is None:
                    none_idx.append(n)

            range_list = []
            for n in range(len(none_idx)):
                try:
                    if none_idx[n] + 1 != none_idx[n + 1]:
                        range_list.append((none_idx[n] + 1, none_idx[n + 1]))
                    else:
                        continue
                except IndexError:
                    if len(none_idx) == 1:
                        range_list.append((0, none_idx[0]))
                        range_list.append((none_idx[0] + 1, len(data[biophys_data]['median'])))
                    else:
                        range_list.append((0, none_idx[0]))
                        range_list.append((none_idx[-1] + 1, len(data[biophys_data]['median'])))

            if range_list:
                for tuple in range_list:
                    x = np.arange(tuple[0], tuple[1], 1)
                    firstq = data[biophys_data]['firstQuartile'][tuple[0]:tuple[1]]
                    thirdq = data[biophys_data]['thirdQuartile'][tuple[0]:tuple[1]]
                    bottom = data[biophys_data]['bottomOutlier'][tuple[0]:tuple[1]]
                    top = data[biophys_data]['topOutlier'][tuple[0]:tuple[1]]
                    ax.fill_between(x, firstq, thirdq, alpha=0.3, color=col, label="1st-3rd Quartiles")
                    ax.fill_between(x, bottom, top, alpha=0.1, color=col, label="Outliers")
            else:
                x = np.arange(0, len(data[biophys_data]['median']), 1)
                firstq = data[biophys_data]['firstQuartile']
                thirdq = data[biophys_data]['thirdQuartile']
                bottom = data[biophys_data]['bottomOutlier']
                top = data[biophys_data]['topOutlier']
                ax.fill_between(x, firstq, thirdq, alpha=0.3, color=col, label="1st-3rd Quartiles")
                ax.fill_between(x, bottom, top, alpha=0.1, color=col, label="Outliers")

            ax.plot(data[biophys_data]['median'], linewidth=1.25, color=col, label="Median")

            colors_2 = "magenta"
            ax.plot(jsondata_list_selected[0][biophys_data], '-', linewidth=1.5, color=colors_2, label=f"Prediction {selected_prots}")

        ax.set_title(PREDICTION_TITLES[biophys_data])
        ax.axis([0, residues_count, min(bottom)-0.05, max(top)+0.05])
        ax.set_ylabel(AXIS_TITLES['y'])
        ax.set_xlabel(AXIS_TITLES['x'])
        if biophys_data == 'backbone':
            ax.axhline(y=1.0, color='green', linewidth=1.5, linestyle='-.', label="Above: Membrane spanning")
            ax.axhline(y=0.8, color='orange', linewidth=1.5, linestyle='-.', label="Above: Rigid")
            if min(bottom) - 0.05 < 0.69:
                ax.axhline(y=0.69, color='red', linewidth=1.5, linestyle='-.', label="Above: Context dependent \nBelow: Flexible")
        if biophys_data == 'earlyFolding':
            ax.axhline(y=0.169, color='red', linewidth=1.5, linestyle='-.', label="Above: Likely to start folding")
        if biophys_data == 'disoMine':
            ax.axhline(y=0.5, color='red', linewidth=1.5, linestyle='-.', label="Above: Likely to be disordered")
        ax.legend(ncol=1, bbox_to_anchor=(1.01, 0.5), loc='center left')

    plt.tight_layout()
    fig.subplots_adjust(top=0.96, hspace=0.2)

    if not os.path.exists('/content/results/'):
        os.makedirs('/content/results/')

    # Save the figure to a PNG file
    fig.savefig('/content/results/msa_biophysical_conservation.png')

    plt.show()

    return fig, axs
sequences =  msaSuite.seqs
fig, axs = plot_biophysical_msa([msa_predictions_in_fct_of],[selected_prot_predictions],selected_prot,sequences)
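
The loops that handle `None` values in the plotting function split each statistic into contiguous stretches that can be drawn separately. A simpler way to obtain such stretches is sketched below. It is an alternative written for illustration, not the logic the notebook runs; `non_none_segments` is a hypothetical name.

```python
def non_none_segments(values):
    """Return (start, stop) index pairs of contiguous runs without None, stop exclusive."""
    segments = []
    start = None
    for index, value in enumerate(values):
        if value is not None and start is None:
            start = index  # a new run begins
        elif value is None and start is not None:
            segments.append((start, index))  # the current run ends
            start = None
    if start is not None:
        segments.append((start, len(values)))  # run reaching the end of the list
    return segments


median = [0.8, 0.7, None, None, 0.6, 0.9, None, 0.5]
print(non_none_segments(median))  # [(0, 2), (4, 6), (7, 8)]
```

Each `(start, stop)` pair can be used to slice the quartile and outlier lists before handing them to `ax.fill_between`, without relying on `try`/`except` to detect the end of the list.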

In [13]:
#@title 5.b. Zoom in: Plot the predicted biophysical conservation of the MSA
#@markdown Same as cell 5.a. but with zooms on specific regions of the MSA.
%%capture

#@markdown Enter the number of sections you want to split your plots into:

n = 2 #@param {type:"integer"}
#@markdown

#@markdown Once this cell has been executed, a file with plots will be saved in the folder "Results".

#@markdown The plots represent the biophysical behaviour of the proteins in the MSA.
#@markdown There is one graph for every biophysical property that is studied.
#@markdown As the median, 1st and 3rd quartile and outliers are shown on the graphs,
#@markdown the conservation of the biophysical properties can be observed.

#create sections
# Use the first available property so this also works when DynaMine is not selected
residues_count = len(next(iter(msa_predictions_in_fct_of.values()))['median'])
all_regions = []
for elem in range(1,n+1):
    label = "Section%d_of_%d" %(elem, n)
    low = int(residues_count * ((1/n)*(elem-1)))
    high = int(residues_count * ((1/n)*elem))
    if all_regions == []:
      low = 1
    all_regions.append((label, (low, high)))

def plot_biophysical_msa_zoom(jsondata_list_interest,jsondata_list_selected,selected_prot, sequences,all_regions):
    colors = ['blue', 'orange']
    # Use the first available property so this also works when DynaMine is not selected
    residues_count = len(next(iter(jsondata_list_interest[0].values()))['median'])
    sequences_count = len(sequences)

    for region in all_regions:
        name_region = region[0]
        lower_lim = region[1][0]
        upper_lim =region[1][1]

        #Plot representation
        fig, axs = plt.subplots(NB_SUBPLOTS)
        axs = np.atleast_1d(axs)  # plt.subplots returns a single Axes when NB_SUBPLOTS == 1
        fig.set_figwidth(20)
        fig.set_figheight(50)

        fig.suptitle(f'Predicted biophysical properties of the MSA in function of {selected_prot} ({sequences_count} aligned sequences): "{name_region}" (residue {lower_lim} to residue {upper_lim-1})', fontsize=14)

        # These for loops got too complicated, I have to think
        # something simpler to handle the None values in the data
        predictions = jsondata_list_interest[0].keys()
        for prediction_index, biophys_data in enumerate(predictions):
            if biophys_data == 'agmata':
                continue

            subplot_index_row = PREDICTION_POSITION[biophys_data]

            ax = axs[subplot_index_row]
            for data, col in zip(jsondata_list_interest, colors):
                none_idx = []

                for n in range(residues_count):
                    if data[biophys_data]['median'][n] == None \
                            or data[biophys_data][
                        'firstQuartile'][n] == None \
                            or data[biophys_data][
                        'thirdQuartile'][n] == None:
                        none_idx.append(n)

                range_list = []
                for n in range(len(none_idx)):
                    try:
                        if none_idx[n] + 1 != none_idx[n + 1]:
                            range_list.append(
                                (none_idx[n] + 1, none_idx[n + 1]))
                        else:
                            continue
                    except IndexError:
                        if len(none_idx) == 1:
                            range_list.append((0, none_idx[0]))
                            range_list.append((none_idx[0] + 1, len(
                                data[biophys_data][
                                    'median'])))

                        else:
                            range_list.append((0, none_idx[0]))
                            range_list.append((none_idx[-1] + 1, len(
                                data[biophys_data][
                                    'median'])))

                # When there are None values in the data
                if range_list:
                    for tuple in range_list:
                        x = np.arange(tuple[0], tuple[1], 1)
                        firstq = \
                            data[biophys_data][
                                'firstQuartile'][
                            tuple[0]:tuple[1]]
                        thirdq = \
                            data[biophys_data][
                                'thirdQuartile'][
                            tuple[0]:tuple[1]]
                        bottom = \
                            data[biophys_data][
                                'bottomOutlier'][
                            tuple[0]:tuple[1]]
                        top = \
                            data[biophys_data]['topOutlier'][
                            tuple[0]:tuple[1]]
                        ax.fill_between(
                            x[lower_lim:upper_lim], firstq[lower_lim:upper_lim], thirdq[lower_lim:upper_lim], alpha=0.3, color=col, label="1st & 3rd quartile")
                        ax.fill_between(
                            x[lower_lim:upper_lim], bottom[lower_lim:upper_lim], top[lower_lim:upper_lim], alpha=0.1, color=col, label="Outliers")

                # When there aren't None values in the data
                else:
                    x = np.arange(lower_lim,upper_lim, 1)
                    firstq = data[biophys_data][
                        'firstQuartile'][lower_lim:upper_lim]
                    thirdq = data[biophys_data][
                        'thirdQuartile'][lower_lim:upper_lim]
                    bottom = data[biophys_data][
                        'bottomOutlier'][lower_lim:upper_lim]
                    top = data[biophys_data]['topOutlier'][lower_lim:upper_lim]
                    ax.fill_between(
                        x, firstq, thirdq, alpha=0.3, color=col, label="1st & 3rd quartile")
                    ax.fill_between(
                        x, bottom, top, alpha=0.1, color=col, label="Outliers")

                    ax.plot(data[biophys_data]['median'],linewidth=1.25, color="black", label= "Median")

                    #Add the selected protein if there are some
                    colors_2 = "magenta"
                    ax.plot(jsondata_list_selected[0][biophys_data], '-', linewidth=1.5, color=colors_2, label=f"Prediction {selected_prot}")

            ax.set_title(PREDICTION_TITLES[biophys_data])
            ax.axis([lower_lim,upper_lim, min(bottom)-0.05, max(top)+0.05])
            ax.set_ylabel(AXIS_TITLES['y'])
            ax.set_xlabel(AXIS_TITLES['x'])
            # ax.set_xticks(list(np.arange(lower_lim,upper_lim,5)))
            if biophys_data == 'backbone':
                ax.axhline(y=1.0, color='green', linewidth= 1.25, linestyle='-.', label="Above: Membrane spanning") #membrane spanning
                ax.axhline(y=0.8, color='orange', linewidth= 1.25, linestyle='-.', label="Above: Rigid") #rigid
                if min(bottom)-0.05 < 0.69:
                    ax.axhline(y=0.69, color='red', linewidth= 1.25, linestyle='-.', label="Above: Context dependent \nBelow: Flexible") #context dependent (either rigid or flexible)
            if biophys_data == 'earlyFolding':
                ax.axhline(y=0.169, color='red', linewidth= 1.25, linestyle='-.', label="Above: Likely to start folding") #above: likely start protein folding process
            if biophys_data == 'disoMine':
                ax.axhline(y=0.5, color='red', linewidth= 1.25, linestyle='-.', label="Above: Likely to be disordered") #above: likely disordered
            ax.legend(ncol=1, bbox_to_anchor =(1.01,0.5), loc='center left')


        plt.tight_layout()
        fig.subplots_adjust(top=0.96, hspace = 0.2)
        plt.savefig('/content/results/'+ name_region + '_msa_biophysical_conservation.png')
        plt.show()

    return fig, axs

sequences =  msaSuite.seqs
fig, axs = plot_biophysical_msa_zoom([msa_predictions_in_fct_of],[selected_prot_predictions],selected_prot,sequences,all_regions)
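
The section boundaries used for zooming can be computed with integer arithmetic, as in the sketch below. `split_into_sections` is a hypothetical helper written for illustration. Unlike the cell above, which starts the first section at residue 1, this sketch starts at 0 so that the first column is included; that difference is an assumption about the desired behaviour, not a statement about the notebook's intent.

```python
def split_into_sections(residues_count, n_sections):
    """Return (label, (low, high)) tuples covering 0..residues_count, high exclusive."""
    if n_sections < 1:
        raise ValueError("n_sections must be at least 1")
    sections = []
    for k in range(1, n_sections + 1):
        low = residues_count * (k - 1) // n_sections
        high = residues_count * k // n_sections
        sections.append((f"Section{k}_of_{n_sections}", (low, high)))
    return sections


print(split_into_sections(101, 2))
# [('Section1_of_2', (0, 50)), ('Section2_of_2', (50, 101))]
```

Consecutive sections share a boundary (the `high` of one is the `low` of the next), so no residue is skipped or plotted twice.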

In [14]:
#@title 5.c. Plot the biophysical and sequence conservation of the MSA
#@markdown Once this cell has been executed, a file with plots will be saved in the folder "Results".
#@markdown In case you want to zoom in on the plots, run cell 5.d below.
%%capture

#@markdown The plots represent the biophysical behaviour of the proteins in the MSA as well as their sequence conservation.
#@markdown There is one graph for every biophysical property that is studied.
#@markdown As the median, 1st and 3rd quartile and outliers are shown on the graphs,
#@markdown the conservation of the biophysical properties can be observed.

def plot_entropy_biphysical_msa(jsondata_list, jsondata_list_selected, selected_prot, sequences, conservation_AA_list):
    sequences_count = len(sequences)
    # Use the first available property so this also works when DynaMine is not selected
    residues_count = len(next(iter(jsondata_list[0].values()))['median'])

    #Color map for conservation sequence
    cmap = mpl.cm.Blues(np.linspace(0,1,20))
    cmap = mpl.colors.ListedColormap(cmap[:5])
    colors = ['blue', 'orange']

    #Plot representation
    fig, axs = plt.subplots(NB_SUBPLOTS)
    axs = np.atleast_1d(axs)  # plt.subplots returns a single Axes when NB_SUBPLOTS == 1
    fig.set_figwidth(20)
    fig.set_figheight(50)

    fig.suptitle(f"Predicted biophysical properties of the MSA in function of {selected_prot} ({sequences_count} aligned sequences)", fontsize=14)

    predictions = jsondata_list[0].keys()
    for prediction_index, biophys_data in enumerate(predictions):
        if biophys_data == 'agmata':
            continue

        subplot_index_row = PREDICTION_POSITION[biophys_data]

        ax = axs[subplot_index_row]
        for data, col in zip(jsondata_list, colors):
            none_idx = []

            for n in range(residues_count):
                if data[biophys_data]['median'][n] == None \
                        or data[biophys_data][
                    'firstQuartile'][n] == None \
                        or data[biophys_data][
                    'thirdQuartile'][n] == None:
                    none_idx.append(n)

            range_list = []
            for n in range(len(none_idx)):
                try:
                    if none_idx[n] + 1 != none_idx[n + 1]:
                        range_list.append(
                            (none_idx[n] + 1, none_idx[n + 1]))
                    else:
                        continue
                except IndexError:
                    if len(none_idx) == 1:
                        range_list.append((0, none_idx[0]))
                        range_list.append((none_idx[0] + 1, len(
                            data[biophys_data][
                                'median'])))

                    else:
                        range_list.append((0, none_idx[0]))
                        range_list.append((none_idx[-1] + 1, len(
                            data[biophys_data][
                                'median'])))

            # When there are None values in the data
            if range_list:
                for start, stop in range_list:
                    x = np.arange(start, stop, 1)
                    firstq = data[biophys_data]['firstQuartile'][start:stop]
                    thirdq = data[biophys_data]['thirdQuartile'][start:stop]
                    bottom = data[biophys_data]['bottomOutlier'][start:stop]
                    top = data[biophys_data]['topOutlier'][start:stop]
                    ax.plot(x,firstq, linewidth=0.5, color="black", label="1st & 3rd quartile")
                    ax.plot(x,thirdq, linewidth=0.5, color="black")
                    ax.plot(x,bottom, alpha=0.25, linestyle ="--", color="black", label="Bottom & Top outliers")
                    ax.plot(x,top, alpha=0.25, linestyle ="--", color="black")

            # When there aren't None values in the data
            else:
                x = np.arange(0, len(
                    data[biophys_data]['median']), 1)
                firstq = data[biophys_data][
                    'firstQuartile']
                thirdq = data[biophys_data][
                    'thirdQuartile']
                bottom = data[biophys_data][
                    'bottomOutlier']
                top = data[biophys_data]['topOutlier']

                ax.plot(x,firstq, linewidth=0.5, color="black", label="1st & 3rd quartile")
                ax.plot(x,thirdq, linewidth=0.5, color="black")
                ax.plot(x,bottom, alpha=0.25, linestyle ="--", color="black", label="Bottom & Top outliers")
                ax.plot(x,top, alpha=0.25, linestyle ="--", color="black")

                ax.plot(data[biophys_data]['median'], linewidth=1.25, color="black", label= "Median")

                #Add the selected protein if there are some
                colors_2 = "magenta"
                ax.plot(jsondata_list_selected[0][biophys_data], '-', linewidth=1.5, color=colors_2, label=f"Prediction {selected_prot}")

                #plot sequence conservation
                entropy_values = conservation_AA_list

                extent = [0, residues_count, min(bottom)-0.05, max(top)+0.05]
                ax.imshow(np.array(entropy_values)[np.newaxis,:], cmap=cmap, extent=extent,aspect = "auto", vmin=0, vmax=1)
                cbar = ax.figure.colorbar(
                    mpl.cm.ScalarMappable(norm=None, cmap=cmap), shrink=1, pad = 0.03, ax=ax, ticks=[0, 1])
                cbar.ax.set_yticklabels(['0%', '100%'])
                cbar.ax.set_ylabel('Sequence conservation', rotation=270)

        ax.set_title(PREDICTION_TITLES[biophys_data] , fontsize=14)
        ax.axis([0, residues_count, min(bottom)-0.05, max(top)+0.05])
        # ax.xaxis.set_major_locator(ticker.FixedLocator(np.arange(0,residues_count,5)))

        ax.set_ylabel(AXIS_TITLES['y'], fontsize=10)
        ax.set_xlabel(AXIS_TITLES['x'], fontsize=10)

        if biophys_data == 'backbone':
            ax.axhline(y=1.0, color='green', linewidth= 1.5, linestyle='-.', label="Above: Membrane spanning") #membrane spanning
            ax.axhline(y=0.8, color='orange', linewidth= 1.5, linestyle='-.', label="Above: Rigid") #rigid
            if min(bottom)-0.05 < 0.69:
                ax.axhline(y=0.69, color='red', linewidth= 1.5, linestyle='-.', label="Above: Context dependent \nBelow: Flexible") #context dependent (either rigid or flexible)
        if biophys_data == 'earlyFolding':
            ax.axhline(y=0.169, color='red', linewidth= 1.5, linestyle='-.', label="Above: Likely to start folding") #above: likely start protein folding process
        if biophys_data == 'disoMine':
            ax.axhline(y=0.5, color='red', linewidth= 1.5, linestyle='-.', label="Above: Likely to be disordered") #above: likely disordered
        ax.legend(ncol=1, bbox_to_anchor =(1.1,0.5), loc='center left')


    plt.tight_layout()
    fig.subplots_adjust(top=0.96, hspace = 0.6)

    plt.savefig('/content/results/msa_sequence_biophysical_conservation.png')

    return fig, axs

sequences =  msaSuite.seqs

fig, axs = plot_entropy_biphysical_msa([msa_predictions_in_fct_of],[selected_prot_predictions],selected_prot,sequences,conservation_selected_protein)
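The plotting function above splits each curve around the alignment positions where a statistic is `None`. A simpler way to find those stretches could look like the sketch below. It is only an illustration: `valid_ranges` is a hypothetical helper, not part of the b2btools, and the input lists are made-up values. It returns half-open `(start, stop)` ranges in which every series has a value, so the "with None" and "without None" cases could share the same plotting code.

```python
from itertools import groupby


def valid_ranges(*series):
    """Return half-open (start, stop) index ranges where every series has a value.

    Hypothetical helper for illustration; not part of the b2btools.
    """
    # A position is valid only if none of the series is None there
    valid = [all(value is not None for value in column) for column in zip(*series)]

    ranges = []
    position = 0
    # groupby yields consecutive runs of identical flags (valid / not valid)
    for is_valid, run in groupby(valid):
        length = len(list(run))
        if is_valid:
            ranges.append((position, position + length))
        position += length
    return ranges


# Made-up statistics with gaps at different positions
median = [0.8, 0.7, None, None, 0.9, 0.6, None, 0.5]
first_quartile = [0.7, 0.6, None, None, 0.8, None, None, 0.4]

print(valid_ranges(median, first_quartile))  # [(0, 2), (4, 5), (7, 8)]
```

When no value is missing, the helper returns a single range covering the whole alignment, so one loop over the ranges would be enough.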

In [15]:
#@title 5.d. Zoom in: Plot the sequence and biophysical conservation of the MSA
#@markdown Same as cell 5.c. but with zooms on specific regions of the MSA.
%%capture

#@markdown Enter the number of sections to split the plots into:
n = 2 #@param {type:"integer"}

#@markdown

#@markdown The plots represent the biophysical behaviour of the proteins in the MSA as well as their sequence conservation.
#@markdown There is one graph for every biophysical property that is studied.
#@markdown Because the median, the 1st and 3rd quartiles and the outliers are shown on the graphs,
#@markdown the conservation of each biophysical property can be read from the spread of these curves.

#@markdown The dots on the plots show the sequence occupancy of the MSA; there are 4 different sizes representing 25%, 50%, 75% and 100% occupancy.

#create sections
residues_count = len([msa_predictions_in_fct_of][0]['backbone']['median'])
all_regions = []
for elem in range(1,n+1):
    label = "Section%d_of_%d" %(elem, n)
    low = int(residues_count * ((1/n)*(elem-1)))
    high = int(residues_count * ((1/n)*elem))
    if not all_regions:
        low = 1
    all_regions.append((label, (low, high)))
print(all_regions)

def plot_entropy_biphysical_msa_zoom(jsondata_list,jsondata_list_selected,selected_prot, sequences, all_regions, conservation_AA_list):
    sequences_count = len(sequences)
    residues_count = len(jsondata_list[0]['backbone']['median'])

    #Color map for conservation sequence
    cmap = mpl.cm.Blues(np.linspace(0,1,20))
    cmap = mpl.colors.ListedColormap(cmap[:5])

    colors = ['blue', 'orange']

    for region in all_regions:
        name_region = region[0]
        lower_lim = region[1][0]
        upper_lim = region[1][1]

        #Plot representation
        fig, axs = plt.subplots(8)
        fig.set_figwidth(20)
        fig.set_figheight(50)

        fig.suptitle(f'Predicted biophysical properties of the MSA in function of {selected_prot}: "{name_region}" (residue {lower_lim} to residue {upper_lim-1}) ({sequences_count} aligned sequences)', fontsize=14)

        # These for loops got too complicated, I have to think
        # something simpler to handle the None values in the data
        predictions = jsondata_list[0].keys()
        for prediction_index, biophys_data in enumerate(predictions):
            if biophys_data == 'agmata':
                continue

            subplot_index_row = PREDICTION_POSITION[biophys_data]

            ax = axs[subplot_index_row]
            for data, col in zip(jsondata_list, colors):
                none_idx = []

                for n in range(residues_count):
                    # A position counts as missing if any of the core statistics is None
                    if data[biophys_data]['median'][n] is None \
                            or data[biophys_data]['firstQuartile'][n] is None \
                            or data[biophys_data]['thirdQuartile'][n] is None:
                        none_idx.append(n)

                range_list = []
                for n in range(len(none_idx)):
                    try:
                        if none_idx[n] + 1 != none_idx[n + 1]:
                            range_list.append(
                                (none_idx[n] + 1, none_idx[n + 1]))
                        else:
                            continue
                    except IndexError:  # raised at the last None index, where none_idx[n + 1] does not exist
                        if len(none_idx) == 1:
                            range_list.append((0, none_idx[0]))
                            range_list.append((none_idx[0] + 1, len(
                                data[biophys_data][
                                    'median'])))

                        else:
                            range_list.append((0, none_idx[0]))
                            range_list.append((none_idx[-1] + 1, len(
                                data[biophys_data][
                                    'median'])))

                # When there are None values in the data
                if range_list:
                    for start, stop in range_list:
                        x = np.arange(start, stop, 1)
                        firstq = data[biophys_data]['firstQuartile'][start:stop]
                        thirdq = data[biophys_data]['thirdQuartile'][start:stop]
                        bottom = data[biophys_data]['bottomOutlier'][start:stop]
                        top = data[biophys_data]['topOutlier'][start:stop]
                        ax.plot(x,firstq, linewidth=0.25, color="black", label="1st & 3rd quartile")
                        ax.plot(x,thirdq, linewidth=0.25, color="black")
                        ax.plot(x,bottom, alpha=0.25, linestyle ="--", color="black", label="Bottom & Top outliers")
                        ax.plot(x,top, alpha=0.25, linestyle ="--", color="black")

                # When there aren't None values in the data
                else:
                    x = np.arange(0, len(
                        data[biophys_data]['median']), 1)
                    firstq = data[biophys_data][
                        'firstQuartile']
                    thirdq = data[biophys_data][
                        'thirdQuartile']
                    bottom = data[biophys_data][
                        'bottomOutlier']
                    top = data[biophys_data]['topOutlier']

                    ax.plot(x,firstq, linewidth=0.5, color="black", label="1st & 3rd quartile")
                    ax.plot(x,thirdq, linewidth=0.5, color="black")
                    ax.plot(x,bottom, alpha=0.25, linestyle ="--", color="black", label="Bottom & Top outliers")
                    ax.plot(x,top, alpha=0.25, linestyle ="--", color="black")

                    ax.plot(data[biophys_data]['median'], linewidth=1.25, color="black", label= "Median")

                    #Add the selected protein if there are some
                    colors_2 = "magenta"
                    ax.plot(jsondata_list_selected[0][biophys_data], '-', linewidth=1.5, color=colors_2, label=f"Prediction {selected_prot}")

                    #Sequence conservation
                    entropy_values = conservation_AA_list
                    extent = [0, residues_count, min(bottom)-0.05, max(top)+0.05]
                    ax.imshow(np.array(entropy_values)[np.newaxis,:], cmap=cmap, extent=extent,aspect = "auto", vmin=0, vmax=1)
                    cbar = ax.figure.colorbar(
                        mpl.cm.ScalarMappable(norm=None, cmap=cmap), shrink=1, pad = 0.03, ax=ax, ticks=[0, 1])
                    cbar.ax.set_yticklabels(['0%', '100%'])
                    cbar.ax.set_ylabel('Sequence conservation', rotation=270)

            ax.set_title(PREDICTION_TITLES[biophys_data] , fontsize=14)
            ax.axis([lower_lim, upper_lim, min(bottom)-0.05, max(top)+0.05])
            # ax.xaxis.set_major_locator(ticker.FixedLocator(np.arange(lower_lim,upper_lim-1,10)))

            ax.set_ylabel(AXIS_TITLES['y'], fontsize=10)
            ax.set_xlabel(AXIS_TITLES['x'], fontsize=10)


            if biophys_data == 'backbone':
                ax.axhline(y=1.0, color='green', linewidth= 1.5, linestyle='-.', label="Above: Membrane spanning") #membrane spanning
                ax.axhline(y=0.8, color='orange', linewidth= 1.5, linestyle='-.', label="Above: Rigid") #rigid
                if min(bottom)-0.05 < 0.69:
                    ax.axhline(y=0.69, color='red', linewidth= 1.5, linestyle='-.', label="Above: Context dependent \nBelow: Flexible") #context dependent (either rigid or flexible)
            if biophys_data == 'earlyFolding':
                ax.axhline(y=0.169, color='red', linewidth= 1.5, linestyle='-.', label="Above: Likely to start folding") #above: likely start protein folding process
            if biophys_data == 'disoMine':
                ax.axhline(y=0.5, color='red', linewidth= 1.5, linestyle='-.', label="Above: Likely to be disordered") #above: likely disordered
            ax.legend(ncol=1, bbox_to_anchor =(1.1,0.5), loc='center left')


        plt.tight_layout()
        fig.subplots_adjust(top=0.96, hspace = 0.6)

        plt.savefig('/content/results/'+ name_region + '_msa_sequence_biophysical_conservation.png')

        plt.show()

    return fig, axs

sequences =  msaSuite.seqs
fig, axs = plot_entropy_biphysical_msa_zoom([msa_predictions_in_fct_of],[selected_prot_predictions],selected_prot,sequences,all_regions,conservation_selected_protein)
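The sections used for the zoomed plots are contiguous slices of the alignment. The sketch below shows one possible way to compute such boundaries. `split_into_regions` is a hypothetical helper, and it makes two assumptions that differ from the cell above: boundaries are computed with integer floor division rather than floating-point fractions, and the first section starts at index 0.

```python
def split_into_regions(residues_count, n_sections):
    """Split range(residues_count) into n_sections contiguous half-open regions.

    Hypothetical helper for illustration; labels follow the
    "Section<i>_of_<n>" pattern used in this notebook.
    """
    regions = []
    for section in range(1, n_sections + 1):
        # Integer arithmetic keeps the last boundary exactly at residues_count
        low = residues_count * (section - 1) // n_sections
        high = residues_count * section // n_sections
        regions.append((f"Section{section}_of_{n_sections}", (low, high)))
    return regions


print(split_into_regions(10, 3))
# [('Section1_of_3', (0, 3)), ('Section2_of_3', (3, 6)), ('Section3_of_3', (6, 10))]
```

With floor division, consecutive sections share their boundary and the last section always ends at the final alignment position, whatever the number of sections.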

In [16]:
#@title 6. Get the Gaussian Mixture Model (GMM) scores
%%capture
#@markdown Once this cell has been executed, the GMM scores will be computed and the most different residues will be shown.
#@markdown The results are also saved in the Results folder.

#@markdown Note: only the following biophysical properties are used to calculate
#@markdown the GMM scores: backbone and sidechain dynamics, coil, sheet and helix conformational propensities,
#@markdown early folding propensity and disorder propensity.


# Function to compute GMM scores
def GMM_scores(jsondata_list, single_preds):
    # Define the biophysical properties of interest
    biophys_props = ['backbone', 'sidechain', 'coil', 'sheet', 'helix', 'earlyFolding', 'disoMine']
    n_features = len(biophys_props)  # The expected number of features

    # Train predictor
    data_full = []
    for aln_pos in range(len(jsondata_list[0]['backbone']['median'])):
        pred_vector = []
        for biophys in biophys_props:
            if biophys in jsondata_list[0]:
                value = jsondata_list[0][biophys]['median'][aln_pos]
                # The median can be None at some positions; np.isfinite(None) would raise a TypeError
                if value is not None and np.isfinite(value):
                    pred_vector.append(value)
                else:
                    pred_vector.append(0.0)  # Default value for missing or non-finite values
            else:
                pred_vector.append(0.0)  # Default value for missing features
        data_full.append(pred_vector)
    X_train = np.vstack(data_full)
    clf = mixture.GaussianMixture(n_components=1, covariance_type='full', verbose=2, verbose_interval=1)
    clf.fit(X_train)

    # Predict GMM scores
    gmm_dict = {}
    scores = []
    gmm_info = {}
    sequences = {}

    for prot in single_preds.keys():
        if prot != "sequence":
            full_pred = []
            for res in range(len(single_preds[prot]['backbone'])):
                pred_vector = []
                for biophys in biophys_props:
                    if biophys in single_preds[prot] and single_preds[prot][biophys][res] is not None:
                        value = single_preds[prot][biophys][res]
                        if np.isfinite(value):  # Check if the value is finite
                            pred_vector.append(value)
                        else:
                            pred_vector.append(0.0)  # Default value for non-finite values
                    else:
                        pred_vector.append(0.0)  # Default value for missing features

                if len(pred_vector) == n_features:  # Ensure pred_vector has the correct number of features
                    full_pred.append(pred_vector)

            if full_pred:  # Ensure full_pred is not empty
                preds = np.vstack(full_pred)
                gmm_scores = clf.score_samples(preds).tolist()
                gmm_info[prot] = gmm_scores
                scores.extend(gmm_scores)

    # Ensure 'predictions_single_seq' exists and is correctly defined
    if 'predictions_single_seq' in globals():
        for prot in predictions_single_seq["sequence"].keys():
            sequences[prot] = predictions_single_seq["sequence"][prot]
    else:
        raise KeyError("predictions_single_seq not found in the global scope")

    # Most different biophysical behaviour
    perc_5 = np.percentile(scores, 5)
    perc_1 = np.percentile(scores, 1)

    for key in gmm_info.keys():
        worst_5perc = []
        worst_1perc = []

        # Find the residues that are the most different compared to the other residues at the same position in the MSA
        # We analyze the difference in biophysical behavior between the residues
        # enumerate keeps the true position even when two residues share the same score
        # (list.index would always return the first occurrence)
        for idx, score in enumerate(gmm_info[key]):
            if score <= perc_5:
                AA = sequences[key][idx]
                worst_5perc.append(AA + str(idx + 1))
            if score <= perc_1:
                AA = sequences[key][idx]
                worst_1perc.append(AA + str(idx + 1))

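        # Avoid listing the same residue twice: keep in the 5% list only what is not in the 1% list
        # (built as a new list, because removing items while iterating skips elements)
        worst_5perc = [AA for AA in worst_5perc if AA not in worst_1perc]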

        worst = [worst_5perc, worst_1perc]
        new_list = []
        for worst_ in worst:
            consecutive = []
            for position in range(len(worst_) - 1):
                idx = worst_[position][1:]
                next_idx = worst_[position + 1][1:]  # avoid shadowing the built-in next()
                if int(idx) + 1 == int(next_idx):
                    consecutive.append(True)  # True means there is a consecutive one coming
                else:
                    consecutive.append(False)
            consecutive.append(False)

            found_first = False
            new = []
            for i in range(len(worst_)):
                if consecutive[i] and not found_first:
                    found_first = True
                    AA_first = worst_[i]
                if not consecutive[i] and found_first:
                    AA_last = worst_[i]
                    concatenated = AA_first + "-" + AA_last
                    new.append(concatenated)
                    found_first = False
                elif not consecutive[i] and not found_first:
                    new.append(worst_[i])
            new_list.append(new)
        worst_5perc_new = new_list[0]
        worst_1perc_new = new_list[1]

        # Worst 5% means the residue scores are at or below the 5th percentile (and above the 1st)
        # Worst 1% means the residue scores are at or below the 1st percentile
        gmm_dict[key] = {"GMMscore": gmm_info[key], "Worst 5%": worst_5perc_new, "Worst 1%": worst_1perc_new}

    if not os.path.exists('/content/results/'):
        os.makedirs('/content/results/')

    with open('/content/results/GMM_quantification.json', 'w') as fp:
        json.dump(gmm_dict, fp)

    return gmm_dict

jsondata_list = [msaSuite.alignedPredictionDistribs]
single_preds = msaSuite.allAlignedPredictions
gmm_scores = GMM_scores(jsondata_list, single_preds)

print("\n")
print("GMM score analysis to find the residues that are the most different:")
for prot in gmm_scores.keys():
    print(prot)
    print("Worst 5%:")
    print(str(gmm_scores[prot]["Worst 5%"]))
    print("Worst 1%:")
    print(str(gmm_scores[prot]["Worst 1%"]))
    print("\n")
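The post-processing in the cell above has two steps: residues whose GMM score falls at or below a percentile threshold are labelled with their amino acid and 1-based position, and runs of consecutive positions are then merged into ranges such as `K2-L4`. The sketch below illustrates both steps on made-up scores. `flag_low_scores` and `collapse_runs` are hypothetical helper names, and the threshold is passed in directly instead of being computed with `np.percentile`.

```python
def flag_low_scores(scores, sequence, threshold):
    """Label residues (amino acid + 1-based position) scoring at or below threshold."""
    return [
        f"{sequence[idx]}{idx + 1}"
        for idx, score in enumerate(scores)
        if score <= threshold
    ]


def collapse_runs(labels):
    """Merge labels with consecutive residue numbers into 'first-last' ranges."""
    collapsed = []
    run_start = None
    for position, label in enumerate(labels):
        number = int(label[1:])  # drop the one-letter amino acid code
        is_last = position == len(labels) - 1
        continues = not is_last and int(labels[position + 1][1:]) == number + 1
        if continues:
            # Remember where the current run of consecutive residues started
            if run_start is None:
                run_start = label
        else:
            collapsed.append(label if run_start is None else f"{run_start}-{label}")
            run_start = None
    return collapsed


# Made-up log-likelihood scores; two residues deliberately share the same value
scores = [-1.0, -9.0, -9.0, -8.5, -2.0, -7.0]
sequence = "MKVLAG"

flagged = flag_low_scores(scores, sequence, threshold=-7.0)
print(flagged)                 # ['K2', 'V3', 'L4', 'G6']
print(collapse_runs(flagged))  # ['K2-L4', 'G6']
```

Iterating with `enumerate` keeps the position of each score, so residues with identical scores are still reported at their own positions.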


In [17]:
#@title 7. Download the predictions

#@markdown Once this cell has been executed, a zip-archive with
#@markdown the predictions, the corresponding plots and the GMM scores will be automatically downloaded
#@markdown to your computer.
%%capture
final_results = {}

predictions = msaSuite.allAlignedPredictions

for prot in predictions.keys():
    if prot != "sequence":
        predictions[prot]["sequence"] = predictions_single_seq["sequence"][prot]
        final_results[prot] = {**predictions[prot]}
final_results["statistics"] = jsondata_list[0]

# Use a context manager so the file is flushed and closed before zipping
with open('/content/results/predicted_biophysical_features.json', 'w') as fp:
    json.dump(final_results, fp, indent=4)

!zip --quiet -r /content/b2b_predictions_results.zip /content/results

files.download("/content/b2b_predictions_results.zip")
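Outside Colab, where the `!zip` shell command and `files.download` may not be available, the same export step could be done with the Python standard library. The sketch below is an illustration under that assumption: `archive_results` is a hypothetical helper and the demo data are made up. Unlike `zip -r /content/results`, `shutil.make_archive` with `root_dir` stores paths relative to the results folder.

```python
import json
import shutil
import tempfile
from pathlib import Path


def archive_results(results, results_dir, archive_base):
    """Write results as JSON into results_dir and zip that folder.

    Hypothetical helper for illustration. archive_base is the archive path
    without the ".zip" suffix and should live outside results_dir.
    Returns the path of the created archive.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)

    # The context manager closes the file before it is archived
    with open(results_dir / "predicted_biophysical_features.json", "w") as fp:
        json.dump(results, fp, indent=4)

    return shutil.make_archive(str(archive_base), "zip", root_dir=results_dir)


# Demo in a temporary folder with made-up predictions
with tempfile.TemporaryDirectory() as tmp:
    archive_path = archive_results(
        {"demo_protein": {"backbone": [0.8, 0.7]}},
        Path(tmp) / "results",
        Path(tmp) / "b2b_predictions_results",
    )
    print(Path(archive_path).name)  # b2b_predictions_results.zip
```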

## Questions? Feedback?

Contact us through the [Feedback](https://www.bio2byte.be/b2btools/feedback) page on the Bio2Byte website.

## PyPI repository

Want to try out the b2btools? Install the Bio2Byte tools package from PyPI: https://pypi.org/project/b2bTools/.

## Citations

- Implementation of the b2btools to study the protein biophysical features and their conservation:
> Kagami, L. P., Orlando, G., Raimondi, D., Ancien, F., Dixit, B., Gavaldá-García, J., Ramasamy, P., Roca-Martínez, J., Tzavella, K., & Vranken, W. (2021). b2bTools: Online predictions for protein biophysical features and their conservation. Nucleic Acids Research, 49(W1), W52–W59. https://doi.org/10.1093/nar/gkab425

- DynaMine:
> Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013). From protein sequence to dynamics and disorder with DynaMine. Nature Communications, 4(1), 2741–2741. https://doi.org/10.1038/ncomms3741

- EFoldMine:
> Raimondi, D., Orlando, G., Pancsa, R., Khan, T., & Vranken, W. F. (2017). Exploring the Sequence-based Prediction of Folding Initiation Sites in Proteins. Scientific Reports, 7(1), 8826–8826. https://doi.org/10.1038/s41598-017-08366-3

- DisoMine:
> Orlando, G., Raimondi, D., Codicè, F., Tabaro, F., & Vranken, W. (2022). Prediction of Disordered Regions in Proteins with Recurrent Neural Networks and Protein Dynamics. Journal of Molecular Biology, 434(12), 167579. https://doi.org/10.1016/j.jmb.2022.167579

- AgMata:
> Orlando, G., Silva, A., Macedo-Ribeiro, S., Raimondi, D., & Vranken, W. (2020). Accurate prediction of protein beta-aggregation with generalized statistical potentials. Bioinformatics, 36(7), 2076–2081. https://doi.org/10.1093/bioinformatics/btz912

- PSPer:
> Orlando, G., Raimondi, D., Tabaro, F., Codicè, F., Moreau, Y., & Vranken, W. F. (2019). Computational identification of prion-like RNA-binding proteins that form liquid phase-separated condensates. Bioinformatics, 35(22), 4617–4623. https://doi.org/10.1093/bioinformatics/btz274