## RSALOR: Ready-to-Use Notebook
[![PyPI Version](https://img.shields.io/pypi/v/rsalor.svg)](https://pypi.org/project/rsalor/) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/3BioCompBio/RSALOR/blob/main/colab_notebook_RSALOR.ipynb)

<img src="https://raw.githubusercontent.com/3BioCompBio/RSALOR/main/Logo.png" height="250" align="right" style="height:200px;">

Ready-to-Use Notebook to run the RSALOR model:
 - Upload your MSA file
 - Upload your 3D structure file
 - Run predictions on all single-site mutations
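
Outside Colab, the same three steps can be sketched as a single helper. This is a minimal sketch that reuses only the `MSA(...)` and `save_scores(...)` calls that appear in the cells of this notebook, with the notebook's default settings; the helper name `run_rsalor` and the file names in the example call are hypothetical placeholders.

```python
def run_rsalor(msa_path: str, pdb_path: str, chain: str, output_path: str) -> None:
    """Compute RSA*LOR scores for all single-site mutations and save them as CSV."""
    # Imported inside the function so that defining the helper does not
    # require the package to be installed yet (`pip install rsalor`).
    from rsalor import MSA

    # Same arguments and default values as in the "Run RSALOR" cell of this notebook
    msa = MSA(
        msa_path, pdb_path, chain,
        theta_regularization=0.01,
        seqid_weights=0.80,
        min_seqid=0.35,
        num_threads=2,
        verbose=True,
    )

    # Write one row per single-site mutation to the output CSV file
    msa.save_scores(output_path, sep=",", round_digit=6, log_results=True)


# Example call (placeholder file names, uncomment to run):
# run_rsalor("my_msa.fasta", "my_structure.pdb", "A", "my_msa_rsalor.csv")
```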

The `rsalor` package combines structural data (Relative Solvent Accessibility, RSA) and evolutionary data (Log Odds Ratio, LOR, computed from the MSA) to evaluate the effects of missense mutations in proteins.
It computes the `RSA*LOR` score for each single-site missense mutation in a target protein, combining multiple computational steps into a fast and user-friendly tool.
The source code is available in the [RSALOR GitHub repository](https://github.com/3BioCompBio/RSALOR).
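
To give an intuition for how a structural term and an evolutionary term can be combined, here is a toy sketch. It is not the package's implementation: the regularization formula, the sign convention of the LOR, the use of RSA as a fraction in [0, 1], and the function names `log_odds_ratio` and `rsa_weighted_score` are all assumptions made for illustration. Please refer to the cited papers for the exact definitions.

```python
import math


def log_odds_ratio(f_wt: float, f_mt: float, theta: float = 0.01) -> float:
    """Illustrative log odds ratio between wild-type and mutant residue frequencies."""

    def regularize(f: float) -> float:
        # ASSUMPTION: shrink each frequency toward 1/20 (20 amino acid types) so
        # that residues never observed in the MSA column keep finite odds.
        return (f + theta) / (1.0 + 20.0 * theta)

    def odds(f: float) -> float:
        return f / (1.0 - f)

    # ASSUMPTION: sign chosen so that mutating a frequent residue into a rare one is positive.
    return math.log(odds(regularize(f_wt)) / odds(regularize(f_mt)))


def rsa_weighted_score(lor: float, rsa: float) -> float:
    """Illustrative combination: down-weight the LOR for solvent-exposed residues."""
    # ASSUMPTION: rsa is a fraction in [0, 1]; larger values are capped at 1.
    return (1.0 - min(rsa, 1.0)) * lor


# Toy example: a conserved position (wild-type frequency 0.9, mutant never observed)
lor = log_odds_ratio(f_wt=0.9, f_mt=0.0)
print("buried residue  (RSA=0.0):", rsa_weighted_score(lor, rsa=0.0))
print("exposed residue (RSA=1.0):", rsa_weighted_score(lor, rsa=1.0))
```

Under these assumptions, the same substitution contributes its full LOR when the residue is buried and nothing when it is fully exposed.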

**Please cite**:
- [Matsvei Tsishyn, Pauline Hermans, Fabrizio Pucci, Marianne Rooman (2025). Residue conservation and solvent accessibility are (almost) all you need for predicting mutational effects in proteins. Bioinformatics, btaf322](https://doi.org/10.1093/bioinformatics/btaf322).

- [Pauline Hermans, Matsvei Tsishyn, Martin Schwersensky, Marianne Rooman, Fabrizio Pucci (2024). Exploring evolution to uncover insights into protein mutational stability. Molecular Biology and Evolution, 42(1), msae267](https://doi.org/10.1093/molbev/msae267).


In [None]:
#@title Install and import dependencies

# Install packages -------------------------------------------------------------
!pip install rsalor


# Import packages --------------------------------------------------------------
import os
import requests
from google.colab import files
from Bio.PDB import PDBParser, PPBuilder, PDBList
from rsalor import MSA
from rsalor.sequence import Sequence, FastaReader, PairwiseAlignment
from rsalor.structure import Structure


# Dependencies (small helper functions) ----------------------------------------

# Log functions
def clip_string(input_str: str, max_len: int=100) -> str:
  """Truncate a string and append '...' if it exceeds max_len."""
  return input_str if len(input_str) <= max_len else input_str[:max_len] + "..."

# Fetch functions
def is_valid_pdb_id(pdb_id: str) -> bool:
  """Return True if 'pdb_id' has the format of a PDB ID (a digit 1-9 followed by 3 alphanumeric characters)."""
  POSSIBLE_FIRST_CHARACTERS = "123456789"
  POSSIBLE_FOLLOWING_CHARACTERS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  input_upper = pdb_id.upper()
  return len(input_upper) == 4 \
    and input_upper[0] in POSSIBLE_FIRST_CHARACTERS \
    and all([char in POSSIBLE_FOLLOWING_CHARACTERS for char in input_upper[1:]])

def fetch_pdb(pdb_id: str) -> str:
  """Fetch a '.pdb' file from the PDB and return its path."""
  pdb_id = pdb_id.lower().strip()
  if not is_valid_pdb_id(pdb_id):
    raise ValueError(f"❌ pdb_id='{pdb_id}' is not a valid PDB Id.")
  pdb_fetcher = PDBList()
  file_path = pdb_fetcher.retrieve_pdb_file(pdb_id, file_format="pdb", pdir="./")
  if file_path is None or not os.path.isfile(file_path):
    raise ValueError(f"❌ Fetching pdb_id='{pdb_id}' failed.")
  if file_path.endswith(".ent"):
    file_path_old = file_path
    file_path = file_path.removesuffix(".ent") + ".pdb"
    os.rename(file_path_old, file_path)
  return file_path

def fetch_uniprot(uniprot_id: str) -> str:
  """Fetch a '.pdb' file from the AlphaFoldDB by its UniProt ID and return its path."""
  uniprot_id = uniprot_id.upper().strip()
  url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v6.pdb"
  file_path = f"./{uniprot_id}.pdb"
  r = requests.get(url, timeout=30) # avoid hanging indefinitely if the server does not respond
  if r.status_code != 200:
    raise ValueError(f"❌ AlphaFoldDB fetch failed for UniProt ID '{uniprot_id}' from '{url}'")
  with open(file_path, "wb") as fs:
    fs.write(r.content)
  return file_path

In [None]:
#@title Upload MSA file (in '.fasta', '.a2m' or '.a3m' format)

# Drag-and-drop file picker
uploaded = files.upload()
msa_path: str = list(uploaded.keys())[0]

# Guard: check for an accepted file extension
if not any(msa_path.endswith(ext) for ext in MSA.ACCEPTED_EXTENTIONS):
    raise ValueError(f"❌ Uploaded MSA file '{msa_path}' should have an extension among {MSA.ACCEPTED_EXTENTIONS}.")

# Log uploaded MSA file
print(f" * ✅ MSA file uploaded: '{msa_path}'")
taget_sequence = FastaReader.read_first_sequence(msa_path)
print(f" * ✅ target sequence (L={len(taget_sequence)}): '{clip_string(taget_sequence.sequence)}'")

In [None]:
#@title Upload 3D structure file (in '.pdb' format)

# Define upload method ---------------------------------------------------------
#@markdown ### Choose your 3D structure upload method (upload a local file, or fetch from the PDB or the AlphaFold-DB)
upload_method = "upload_local_file" # @param ["upload_local_file","fetch_from_pdb","fetch_from_alphafold_db"]
retrieval_id = "" # @param {"type":"string","placeholder":"PDB ID like '6m0j' or UniProt ID like 'Q9LW00'"}
#@markdown - if using `upload_local_file`, leave empty
#@markdown - if using `fetch_from_pdb`, specify the **PDB ID** (like `6m0j`)
#@markdown - if using `fetch_from_alphafold_db`, specify the **UniProt ID** (like `Q9LW00`)

# Case: Drag-and-drop local file -----------------------------------------------
if upload_method == "upload_local_file":

  # Drag-and-drop file picker
  uploaded = files.upload()
  pdb_path: str = list(uploaded.keys())[0]

  # Guard: check for the '.pdb' file extension
  if not pdb_path.endswith(".pdb"):
    raise ValueError(f"❌ Uploaded PDB file '{pdb_path}' should have the extension '.pdb'.")
  pdb_name = pdb_path.removesuffix(".pdb")
  print(f" * ✅ PDB file uploaded: '{pdb_path}'")

# Case: fetch file from the PDB ------------------------------------------------
elif upload_method == "fetch_from_pdb":

  # Guard: retrieval_id must be filled in
  if retrieval_id == "" or retrieval_id is None:
    raise ValueError(f"❌ If upload_method='{upload_method}', please specify a retrieval_id.")

  # Fetch PDB file
  pdb_path: str = fetch_pdb(retrieval_id)

  # Log
  pdb_name = pdb_path.removesuffix(".pdb")
  print(f" * ✅ PDB file fetched: '{pdb_path}'")

# Case: fetch file from the AlphaFold-DB ---------------------------------------
elif upload_method == "fetch_from_alphafold_db":

  # Guard: retrieval_id must be filled in
  if retrieval_id == "" or retrieval_id is None:
    raise ValueError(f"❌ If upload_method='{upload_method}', please specify a retrieval_id.")

  # Fetch PDB file
  pdb_path: str = fetch_uniprot(retrieval_id)

  # Log
  pdb_name = pdb_path.removesuffix(".pdb")
  print(f" * ✅ AF-DB file fetched: '{pdb_path}'")

# Case: error ------------------------------------------------------------------
else:
  raise ValueError(f"❌ Unknown upload_method='{upload_method}'.")

# Validate and Log -------------------------------------------------------------
# Parse PDB file
pp_builder = PPBuilder()
structure = PDBParser(QUIET=True).get_structure("protein", pdb_path)

# Log uploaded PDB file
sequences_by_chain: dict[str, Sequence] = {}
for chain in structure[0]: # loop only over the first model
  peptides = pp_builder.build_peptides(chain) # list of polypeptide fragments
  aa_seq: str = "".join([str(pp.get_sequence()) for pp in peptides]) # concatenate fragments if multiple
  print(f"    - chain {chain.id} (L={len(aa_seq)}): '{clip_string(aa_seq)}'")
  sequences_by_chain[chain.id] = Sequence(f"{pdb_name}_{chain.id}", aa_seq)
print(f" * ✅ Choose target chain among {len(sequences_by_chain)} detected chain(s): '{''.join(sequences_by_chain.keys())}'")


In [None]:
#@title Select PDB chain
chain = "A" # @param {"type":"string","placeholder":"A"}

# Guardians
if len(chain) != 1:
  raise ValueError(f"❌ chain='{chain}' should be a string of length 1.")
if chain not in sequences_by_chain:
  raise ValueError(f"❌ chain='{chain}' not in PDB '{pdb_path}' (among '{''.join(sequences_by_chain.keys())}')")

# Show alignments
print(f" * ✅ MSA target sequence aligned to chain '{chain}' in PDB structure.")
align = PairwiseAlignment(taget_sequence, sequences_by_chain[chain])
align.show();


In [None]:
#@title Run RSALOR

# Output settings
#@markdown ### Output settings
output_name = "" # @param {"type":"string","placeholder":"<msa_name>_rsalor"}
#@markdown - leave empty for auto
sep = "," # @param {"type":"string","placeholder":","}
#@markdown - separator in output CSV file

# RSALOR run settings
#@markdown ### RSALOR settings
theta_regularization = 0.01 # @param {"type":"number","placeholder":"0.01"}
#@markdown - regularization term for LOR/LR at amino acid frequencies level
seqid_weights = 0.80 # @param {"type":"number","placeholder":"0.80"}
#@markdown - sequence identity threshold above which two sequences are placed in the same cluster for weighting (set None to ignore)
min_seqid = 0.35 # @param {"type":"number","placeholder":"0.35"}
#@markdown - discard sequences whose sequence identity with the target sequence is below this threshold (set None to ignore)
num_threads = 2 # @param {"type":"integer","placeholder":"2"}
#@markdown - number of threads (CPUs) for weights evaluation (in the C++ backend)

# Run RSALOR
msa = MSA(
    msa_path, pdb_path, chain,
    theta_regularization=theta_regularization,
    seqid_weights=seqid_weights,
    min_seqid=min_seqid,
    num_threads=num_threads,
    verbose=True,
)

# Set output path
if output_name == "" or output_name is None:
  output_name = f"{msa.name}_rsalor"
output_path = f"{output_name}.csv"

# Compute and save scores
rsalor_scores = msa.save_scores(
    output_path,
    sep=sep,
    round_digit=6,
    log_results=True,
)


In [None]:
#@title Download RSALOR output
files.download(output_path)