<a href="https://colab.research.google.com/github/Siddhartha96123/GCHS/blob/main/GCHS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the **General** Executable Conformational Heterogeneity Scanner.

It performs a '***PDB vs PDB***' conformational heterogeneity scan of the ***same enzyme from the SAME or two different organisms***.

In [None]:
# Required libraries installation
!pip install biopython

Now that the dependencies are installed, let's move on to the actual script.

At the tail end of this notebook, the script asks for the **Reference PDB** first, which must be uploaded from your local system, followed by the **Target PDB**.

A green tick to the left of the Play button means the cell has executed completely without any errors. In the cell below, we are:

Defining the functions for uploading files, parsing the structures, and aligning their sequences.

Defining the function for superimposing (aligning) the structures.

Defining the function for calculating the per-residue deviation (***RMSD/Res***).

Defining the function to write the values to an Excel sheet.

Defining the function for plotting the same values as a graph.
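
To make the quantities listed above concrete, here is a minimal NumPy-only sketch of what the superposition and per-residue steps compute. It assumes the Cα coordinates of the aligned residue pairs are already available as two equal-length arrays. The helper names (`kabsch_superimpose`, `per_residue_distances`, `global_rmsd`) and the toy coordinates are illustrative only and are not part of this notebook, which delegates the fit to Biopython's `Superimposer`.

```python
import numpy as np


def kabsch_superimpose(ref, mob):
    """Return a copy of `mob` rotated and translated onto `ref` (least-squares fit).

    Both inputs are (N, 3) arrays whose rows are already paired one-to-one.
    """
    ref = np.asarray(ref, dtype=float)
    mob = np.asarray(mob, dtype=float)
    ref_center = ref.mean(axis=0)
    mob_center = mob.mean(axis=0)
    P = mob - mob_center           # centred mobile coordinates
    Q = ref - ref_center           # centred reference coordinates
    H = P.T @ Q                    # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (R @ P.T).T + ref_center


def per_residue_distances(ref, fitted):
    """Euclidean distance between each paired row of two (N, 3) arrays."""
    return np.linalg.norm(np.asarray(ref) - np.asarray(fitted), axis=1)


def global_rmsd(distances):
    """Root mean square of the per-residue distances."""
    distances = np.asarray(distances, dtype=float)
    return float(np.sqrt(np.mean(distances ** 2)))


# Toy data: five made-up "CA" positions, then a rotated and shifted copy
toy_ref = np.array([[0.0, 0.0, 0.0],
                    [1.5, 0.0, 0.0],
                    [1.5, 1.5, 0.0],
                    [0.0, 1.5, 1.0],
                    [2.0, 3.0, 1.0]])
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
toy_mob = toy_ref @ rot_z.T + np.array([5.0, -3.0, 2.0])

toy_fitted = kabsch_superimpose(toy_ref, toy_mob)
toy_distances = per_residue_distances(toy_ref, toy_fitted)
print(toy_distances, global_rmsd(toy_distances))
```

With only two structures, the per-residue value is a single Cα to Cα distance after the fit, and the global RMSD is the root mean square of those distances.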

In [4]:
from Bio.PDB import PDBParser, Superimposer, is_aa
from Bio import pairwise2
from Bio.SeqUtils import seq1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
import os

# Function to upload files in Colab
def upload_files():
    uploaded = files.upload()
    file_paths = list(uploaded.keys())
    return file_paths

def extract_sequence_and_residues(structure):
    """Extracts the amino acid sequence and residues from the PDB structure."""
    sequence = ""
    residues = []
    for chain in structure[0]:
        for residue in chain:
            if is_aa(residue, standard=True):  # Check if residue is an amino acid
                sequence += seq1(residue.get_resname())  # Get one-letter code for the residue
                residues.append(residue)
    return sequence, residues

def align_sequences(ref_seq, tgt_seq):
    """Align two sequences using pairwise global alignment (globalxx)."""
    # Parameter names avoid shadowing the imported seq1() helper
    alignments = pairwise2.align.globalxx(ref_seq, tgt_seq)
    return alignments[0]  # Return the first of the top-scoring alignments

def map_residues_by_alignment(alignment, ref_residues, target_residues):
    """Map residues based on sequence alignment."""
    ref_aligned_residues = []
    target_aligned_residues = []

    ref_index = 0
    target_index = 0

    for ref_aa, tgt_aa in zip(alignment[0], alignment[1]):
        if ref_aa != '-' and tgt_aa != '-':
            # Both residues are aligned, add their corresponding residues to the list
            ref_aligned_residues.append(ref_residues[ref_index])
            target_aligned_residues.append(target_residues[target_index])

        if ref_aa != '-':
            ref_index += 1  # Move to the next residue in the reference sequence
        if tgt_aa != '-':
            target_index += 1  # Move to the next residue in the target sequence

    return ref_aligned_residues, target_aligned_residues

def align_structures(reference_path, target_path):
    parser = PDBParser(QUIET=True)

    # Load the reference structure
    reference_structure = parser.get_structure("reference", reference_path)

    # Load the target structure
    target_structure = parser.get_structure("target", target_path)

    # Extract sequences and residues from both structures
    ref_seq, ref_residues = extract_sequence_and_residues(reference_structure)
    tgt_seq, target_residues = extract_sequence_and_residues(target_structure)

    # Align sequences globally
    alignment = align_sequences(ref_seq, tgt_seq)

    # Map residues based on alignment
    ref_aligned_residues, target_aligned_residues = map_residues_by_alignment(alignment, ref_residues, target_residues)

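    # Initialize the superimposer
    superimposer = Superimposer()

    # Keep only residue pairs where BOTH residues have a CA atom, so the two
    # atom lists stay the same length and remain paired one-to-one
    ca_pairs = [(ref_res['CA'], tgt_res['CA'])
                for ref_res, tgt_res in zip(ref_aligned_residues, target_aligned_residues)
                if 'CA' in ref_res and 'CA' in tgt_res]
    reference_atoms = [pair[0] for pair in ca_pairs]
    target_atoms = [pair[1] for pair in ca_pairs]

    # Check if atoms are available for superimposition
    if not reference_atoms or not target_atoms:
        raise ValueError("No atoms available for reference superimposition.")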

    # Perform the superimposition
    superimposer.set_atoms(reference_atoms, target_atoms)
    superimposer.apply(target_structure[0].get_atoms())

    return reference_structure, target_structure

def calculate_distances(reference_structure, target_structure, ref_aligned_residues, target_aligned_residues):
    distances = []

    # Iterate over aligned residues and calculate distances between CA atoms
    for ref_residue, target_residue in zip(ref_aligned_residues, target_aligned_residues):
        if 'CA' in ref_residue and 'CA' in target_residue:
            ref_atom = ref_residue['CA']
            target_atom = target_residue['CA']
            distance = np.linalg.norm(ref_atom.get_coord() - target_atom.get_coord())
            distances.append((ref_residue.get_id()[1], target_residue.get_id()[1], distance))  # Use residue ID

    return distances

def save_distances_to_excel(distances, output_path):
    df = pd.DataFrame(distances, columns=["Reference Residue", "Aligned Residue", "Distance"])
    df.to_excel(output_path, index=False)

def plot_rmsf(distances, output_plot_path):
    reference_residues = [d[0] for d in distances]
    rmsf_values = [d[2] for d in distances]

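    plt.scatter(reference_residues, rmsf_values)
    plt.xlabel("Reference Residue")
    # With only two structures, the plotted value is a per-residue CA-CA distance
    plt.ylabel("Cα distance (Å)")
    plt.title("Per-Residue Cα Deviation (Reference vs Target)")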
    plt.savefig(output_plot_path)
    plt.close()

The green tick on the block above signals that all the functions this tool relies on have been ***Defined & Initialised***.
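
The gap-walking logic inside `map_residues_by_alignment` can be checked without any PDB file. The sketch below reproduces the same index bookkeeping on plain strings; `paired_indices` and the toy sequences are made up for illustration and are not part of the notebook.

```python
def paired_indices(aligned_ref, aligned_tgt):
    """Return (ref_index, tgt_index) for every column where neither side is a gap.

    The indices refer to positions in the UNGAPPED sequences, mirroring how
    map_residues_by_alignment indexes into its residue lists.
    """
    pairs = []
    ref_index = 0
    tgt_index = 0
    for ref_aa, tgt_aa in zip(aligned_ref, aligned_tgt):
        if ref_aa != '-' and tgt_aa != '-':
            pairs.append((ref_index, tgt_index))
        if ref_aa != '-':
            ref_index += 1  # consumed one residue of the reference
        if tgt_aa != '-':
            tgt_index += 1  # consumed one residue of the target
    return pairs


# One gap on each side: reference "MKTAYL" vs target "MKGAYL"
internal_gap_pairs = paired_indices("MKT-AYL", "MK-GAYL")

# Target is missing its first two residues: indices are offset by two
offset_pairs = paired_indices("MKTAYL", "--TAYL")

print(internal_gap_pairs)
print(offset_pairs)
```

Columns containing a gap contribute no pair, but they still advance the index of whichever sequence has a residue there, which is what keeps later pairs correctly matched.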

We can now finally move on to uploading the ***SINGLE CHAIN PDBs*** for all further analyses.
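
Because `extract_sequence_and_residues` walks every chain in the first model, a multi-chain file would be concatenated into one sequence. A quick check along the following lines may help before uploading. It is only a sketch that assumes standard fixed-column PDB records (chain ID in column 22 of `ATOM` lines); `chain_ids_in_pdb` and `is_single_chain` are hypothetical helpers, not part of the notebook.

```python
def chain_ids_in_pdb(lines):
    """Return the chain IDs found in ATOM records of the first model, in order."""
    chains = []
    for line in lines:
        if line.startswith("ENDMDL"):
            break  # inspect only the first model, since the script uses structure[0]
        if line.startswith("ATOM") and len(line) > 21:
            chain_id = line[21]  # column 22 in the fixed-width PDB format
            if chain_id not in chains:
                chains.append(chain_id)
    return chains


def is_single_chain(path):
    """True if the PDB file at `path` has exactly one chain in its first model."""
    with open(path) as handle:
        return len(chain_ids_in_pdb(handle)) == 1
```

For example, `is_single_chain("my_structure.pdb")` (file name hypothetical) could be called on a local copy before running the upload cell.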

In [None]:
# Upload the PDB files individually
print("Please upload the reference PDB file.")
reference_pdb = upload_files()[0]

print("Please upload the target PDB file.")
target_pdb = upload_files()[0]

try:
    # Align structures
    reference_structure, aligned_structure = align_structures(reference_pdb, target_pdb)

    # Extract sequences and residues again for distance calculation
    ref_seq, ref_residues = extract_sequence_and_residues(reference_structure)
    tgt_seq, target_residues = extract_sequence_and_residues(aligned_structure)

    # Align sequences globally
    alignment = align_sequences(ref_seq, tgt_seq)

    # Map residues based on alignment
    ref_aligned_residues, target_aligned_residues = map_residues_by_alignment(alignment, ref_residues, target_residues)

    # Calculate distances between atoms
    distances = calculate_distances(reference_structure, aligned_structure, ref_aligned_residues, target_aligned_residues)

    # Save distances to Excel
    output_excel = "GCHS_PDB_RMSF.xlsx"
    save_distances_to_excel(distances, output_excel)

    # Plot RMSF
    output_plot = "GCHS_PDB_RMSF_plot.png"
    plot_rmsf(distances, output_plot)

    # Download the files
    print("Script execution completed. \n The temp Excel file and the Plot shall be auto-downloaded to your local system now. \n Please allow permissions.")
    files.download(output_excel)
    files.download(output_plot)

except Exception as e:
    print("An error occurred during script execution:", str(e))