# Domain Architecture and Evolution of RRNPP Proteins

Welcome! In this notebook, you will analyze the domain architecture and evolutionary relationships of the RRNPP family of quorum sensing proteins, following the workflow from [this preprint](https://www.biorxiv.org/content/10.1101/2023.09.19.558401v2).

## Objectives
- **Introduction to RRNPP proteins:** Review their roles and domain organization.
- **Domain architecture analysis:** Compare sequences and define core, N-terminal, and C-terminal regions.
- **Homology clustering:** Group and visualize similar regions using graph-based methods.
- **Phylogenetic analysis:** Build and annotate a tree based on core regions.
- **Evolutionary inference:** Map and interpret domain architecture transitions.

## Workflow Overview
1. Introduction to RRNPP proteins
2. Domain architecture analysis
3. Homology clustering
4. Phylogenetic analysis
5. Evolutionary inference and discussion

---

Let's get started!

In [2]:
cd projects/Structural_evo_tutorial/Structural_evo_tutorial/

/home/dmoi/projects/Structural_evo_tutorial/Structural_evo_tutorial


In [6]:
# Let's load some functions from foldtree
from fold_tree.src.corecut import *
from fold_tree.src.foldseek2tree import *
from fold_tree.src.AFDB_tools import *

In [7]:
import glob
rrnppa_structures = glob.glob( 'rrnppa/*.pdb')
uniprotids = [ s.split('/')[-1].split('.')[0] for s in rrnppa_structures ]
print( len(uniprotids)  , 'structures found' )
print( uniprotids[0:5] , '...' )

768 structures found
['A0A2Z4MR52', 'S6FLN1', 'A0A075RCM7', 'A0A7Z2J5F3', 'A0A410DTG4'] ...


In [8]:
import py3Dmol

# Let's look at some structures with diverse architectures.
# rrnppa_structures (the list of PDB paths) was built in the previous cell.

# Visualize the first 3 structures (if available)
for pdb_file in rrnppa_structures[:3]:
	print(f"Visualizing {pdb_file}")
	with open(pdb_file, 'r') as f:
		pdb_data = f.read()
	view = py3Dmol.view(width=400, height=300)
	view.addModel(pdb_data, 'pdb')
	view.setStyle({'cartoon': {'color': 'spectrum'}})
	view.zoomTo()
	display(view)

Visualizing rrnppa/A0A2Z4MR52.pdb


<py3Dmol.view at 0x7fc54db87650>

Visualizing rrnppa/S6FLN1.pdb


<py3Dmol.view at 0x7fc54fd27020>

Visualizing rrnppa/A0A075RCM7.pdb


<py3Dmol.view at 0x7fc54dfaf440>

In [9]:
from Bio.PDB import Superimposer, PDBParser
from Bio.PDB import PDBIO
import tempfile
import py3Dmol

def rigid_body_align(structure_path1, structure_path2, chain_id1='A', chain_id2='A'):
	"""
	Align two protein structures using rigid body superposition.

	CA atoms are paired by position along the chain after truncating both
	chains to the length of the shorter one. No sequence or structural
	alignment is performed, so the RMSD is only a rough measure when the
	proteins differ in length or architecture.

	Returns:
		rmsd (float): Root mean square deviation after alignment.
		super_imposer (Superimposer): Biopython Superimposer object.
		structure1, structure2: Biopython Structure objects (structure2 is superposed).
		view (py3Dmol.view): py3Dmol view showing the superposed structures.
	"""
	parser = PDBParser(QUIET=True)
	structure1 = parser.get_structure('struct1', structure_path1)
	structure2 = parser.get_structure('struct2', structure_path2)

	# Use only the requested chain of the first model in each structure
	atoms1 = [atom for atom in structure1[0][chain_id1].get_atoms() if atom.get_id() == 'CA']
	atoms2 = [atom for atom in structure2[0][chain_id2].get_atoms() if atom.get_id() == 'CA']

	min_len = min(len(atoms1), len(atoms2))
	atoms1 = atoms1[:min_len]
	atoms2 = atoms2[:min_len]

	sup = Superimposer()
	sup.set_atoms(atoms1, atoms2)
	sup.apply(structure2.get_atoms())

	# Save structures to temp files for visualization
	io = PDBIO()
	# structure1
	tmp1 = tempfile.NamedTemporaryFile(delete=False, suffix='.pdb')
	io.set_structure(structure1)
	io.save(tmp1.name)
	# structure2 (already superposed)
	tmp2 = tempfile.NamedTemporaryFile(delete=False, suffix='.pdb')
	io.set_structure(structure2)
	io.save(tmp2.name)

	# Visualize with py3Dmol
	with open(tmp1.name) as f1, open(tmp2.name) as f2:
		pdb1 = f1.read()
		pdb2 = f2.read()
	view = py3Dmol.view(width=600, height=400)
	view.addModel(pdb1, 'pdb')
	view.setStyle({'model': 0}, {'cartoon': {'color': 'spectrum'}})
	view.addModel(pdb2, 'pdb')
	view.setStyle({'model': 1}, {'cartoon': {'color': 'magenta'}})
	view.zoomTo()
	return sup.rms, sup, structure1, structure2, view


In [10]:
# again, let's try some structure pairs
import itertools
import random
combos = [ ( i, j ) for i, j in itertools.combinations(range(len(rrnppa_structures)), 2) ]
sample = random.sample(combos, 5)
for i, j in sample:
	print( i, j, rrnppa_structures[i], rrnppa_structures[j] )
	rms, sup, structure1, structure2, view = rigid_body_align(rrnppa_structures[i], rrnppa_structures[j])
	view.show()
	print( 'RMSD:', rms)
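# The loop above aligns five randomly chosen pairs of structures and prints the RMSD of each.
# rigid_body_align has already superposed structure2 onto structure1, so the
# transformation must not be applied a second time here.
# Save the aligned structure from the last pair (the output filename is an arbitrary choice)
io = PDBIO()
io.set_structure(structure2)
io.save('aligned_structure2.pdb')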


400 753 rrnppa/A0A291BEY5.pdb rrnppa/A0A3S4NUA8.pdb


RMSD: 15.482971227195264
450 730 rrnppa/F8LYE6.pdb rrnppa/A0A6I7FCM0.pdb


RMSD: 16.89975159120775
85 297 rrnppa/A0A410DXN0.pdb rrnppa/D8H6E1.pdb


RMSD: 21.152097931926935
163 506 rrnppa/A0A075RAN3.pdb rrnppa/A0A7T5ESM5.pdb


RMSD: 16.425440640961437
478 584 rrnppa/F4BN04.pdb rrnppa/A0A6H3AJL9.pdb


RMSD: 14.49679066452427


### Structural Diversity and Its Impact on Phylogenetic Inference

The RRNPP protein structures analyzed here exhibit a common core fold but display considerable diversity in their overall architectures. This is reminiscent of the "fusexin" example, where proteins share a conserved structural core but differ significantly in their N- and C-terminal extensions or additional domains.

#### Comparison with the Fusexin Example

In both RRNPP and fusexin families, the presence of variable terminal regions or accessory domains leads to proteins with similar cores but divergent overall architectures. This structural diversity can arise from domain shuffling, insertions, or deletions during evolution.

#### Consequences for Sequence and Structural Distances

- **Sequence Distances:** The inclusion of variable regions inflates sequence distances between proteins, as these regions may be highly divergent or even unrelated by descent.
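- **Structural Distances:** Similarly, structural distances (e.g., RMSD) increase when comparing full-length proteins, as the variable regions contribute additional differences not present in the conserved core. The five random pairs aligned above gave full-length RMSDs of roughly 14.5 to 21.2 Å; part of that is also due to the position-by-position pairing of CA atoms, which ignores insertions and terminal extensions.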
- **Phylogenetic Inference:** Using full-length sequences or structures as input for phylogenetic analysis can obscure true evolutionary relationships. The incongruence introduced by variable regions may lead to inaccurate trees, as the signal from the conserved core is diluted by noise from the divergent regions.
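
The effect on structural distances can be illustrated with synthetic coordinates. The sketch below is not part of the notebook's pipeline: it builds two made-up CA traces that share a noisy copy of the same 100-residue "core" but carry unrelated 40-residue tails, then compares the RMSD after optimal superposition (Kabsch algorithm) for the core alone and for the full-length traces. The function name `kabsch_rmsd`, the chain lengths, and the noise levels are all assumptions chosen for illustration.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid superposition."""
    P = P - P.mean(axis=0)                      # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the 3x3 covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation taking P onto Q
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))

rng = np.random.default_rng(0)

# Synthetic random-walk "core" shared by both proteins (illustrative only)
core = np.cumsum(rng.normal(scale=2.0, size=(100, 3)), axis=0)

# Unrelated C-terminal tails, each continuing from the end of the core
tail_a = core[-1] + np.cumsum(rng.normal(scale=2.0, size=(40, 3)), axis=0)
tail_b = core[-1] + np.cumsum(rng.normal(scale=2.0, size=(40, 3)), axis=0)

# A random rotation and shift, so protein B sits in a different frame
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(rot) < 0:
    rot[:, 0] *= -1
shift = np.array([10.0, -5.0, 3.0])

prot_a = np.vstack([core, tail_a])
core_b = core + rng.normal(scale=0.3, size=core.shape)   # slightly diverged core
prot_b = np.vstack([core_b, tail_b]) @ rot.T + shift

core_rmsd = kabsch_rmsd(prot_a[:100], prot_b[:100])
full_rmsd = kabsch_rmsd(prot_a, prot_b)
print(f"core-only RMSD:   {core_rmsd:.2f}")
print(f"full-length RMSD: {full_rmsd:.2f}")
```

With these settings the core-only RMSD stays close to the noise level, while the full-length value is dominated by the unrelated tails. Real proteins additionally require an alignment to decide which residues to pair, which this toy example sidesteps by placing the core at identical indices.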

#### Best Practice

To obtain more accurate evolutionary insights, it is advisable to focus on the conserved core region shared by all members of the family. By extracting and analyzing only the core, one can reduce noise from variable regions and improve the reliability of sequence and structural comparisons, as well as downstream phylogenetic inference.

In [None]:
# Let's plot the lengths of all the structures

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from Bio.PDB import PDBParser
import pandas as pd
import tqdm

# Create the parser once instead of once per file
parser = PDBParser(QUIET=True)

lengths = []
# Parsing hundreds of PDB files takes a while, so show a progress bar
for pdb_file in tqdm.tqdm(rrnppa_structures):
	structure = parser.get_structure('PDB', pdb_file)
	# The length of a structure is its number of residues
	lengths.append(len(list(structure.get_residues())))

# Histogram of structure lengths
plt.hist(lengths, bins=50)
plt.xlabel('Length of Structure (residues)')
plt.ylabel('Frequency')
plt.title('Distribution of Structure Lengths')
plt.show()

KeyboardInterrupt: 

In [None]:
# Let's get the UniProt metadata for the structures



In [None]:
# All of these structures have a common core fold but diverse architectures...
# this might throw off our evolutionary distances. Each domain has its own history...

## Evolutionary History and Domain Shuffling in Prokaryotes

### Domain Shuffling Events

Domain shuffling events are common in prokaryotes, leading to proteins with mixed evolutionary origins.

### Incongruent Phylogenetic Histories

As a result, the evolutionary history of individual domains within a protein can differ, producing incongruent phylogenetic trees when domains are analyzed together. This can obscure true evolutionary relationships and reduce the accuracy of global trees.

### Importance of Domain-Specific Analysis

Therefore, it is often more reliable to analyze each domain separately to capture their distinct evolutionary trajectories.
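
One way to make "incongruent" concrete is to count the clades that two trees do not share. The sketch below is an illustration rather than part of the notebook's workflow: it computes a rooted Robinson-Foulds distance between two toy trees written as nested tuples. The protein labels `P1` to `P5`, both tree shapes, and the helper names `clades` and `rooted_rf` are hypothetical.

```python
def clades(tree):
    """Return (all leaves, set of non-trivial clades) for a nested-tuple rooted tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):                 # a leaf
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)                         # every internal node defines a clade
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)                     # the root clade is shared by every tree
    return all_leaves, found

def rooted_rf(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf set")
    return len(clades_a ^ clades_b)

# Hypothetical trees for the same five proteins, one per region
core_tree = ((("P1", "P2"), "P3"), ("P4", "P5"))
cterm_tree = ((("P1", "P4"), "P3"), ("P2", "P5"))

print(rooted_rf(core_tree, core_tree))    # 0: identical topologies
print(rooted_rf(core_tree, cterm_tree))   # 6: no non-trivial clade is shared
```

A distance of zero means the two regions support the same branching order; larger values mean the regions tell different stories, which is the situation that argues for analyzing domains separately.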

In [None]:
# Let's try using corecut


## The CoreCut Approach

The CoreCut approach focuses on identifying and extracting a consensus "core" region shared among a group of related protein structures. By systematically removing variable N-terminal and C-terminal extensions, CoreCut isolates the structurally conserved segment, enabling more accurate comparative and evolutionary analyses. This method allows for domain-specific studies, reducing noise from divergent terminal regions and facilitating separate analyses for each group of structures based on their conserved cores.
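
The trimming step itself is easy to picture. The sketch below is not the `fold_tree` corecut implementation, and it assumes the core boundaries have already been determined by some other means. It only shows how terminal extensions could be removed from PDB-format text once a residue range is known. The function names `trim_pdb_to_range` and `ca_line`, the toy 10-residue chain, and the chosen range 3 to 7 are all hypothetical.

```python
def trim_pdb_to_range(pdb_text, start, end):
    """Keep ATOM/HETATM records whose residue number lies in [start, end]."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            resseq = int(line[22:26])   # residue sequence number, PDB columns 23-26
            if start <= resseq <= end:
                kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"

def ca_line(serial, resseq, x):
    """Build one fixed-width CA ATOM record for a toy alanine chain."""
    return (
        "ATOM  {:5d}  CA  ALA A{:4d}    {:8.3f}{:8.3f}{:8.3f}  1.00 90.00           C"
        .format(serial, resseq, x, 0.0, 0.0)
    )

# Toy 10-residue chain laid out along the x axis (illustrative only)
toy_pdb = "\n".join(ca_line(i, i, 3.8 * i) for i in range(1, 11)) + "\nEND\n"

# Pretend residues 3-7 are the conserved core
core_pdb = trim_pdb_to_range(toy_pdb, 3, 7)
core_residues = [int(l[22:26]) for l in core_pdb.splitlines() if l.startswith("ATOM")]
print(core_residues)   # [3, 4, 5, 6, 7]
```

In practice the range differs for every protein, since the same core sits at different positions depending on the length of each N-terminal extension.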