# This notebook details the production of an analysis pipeline for molecular dynamics simulations carried out with OpenMM

The flow of the analysis is broken down and handled by classes as follows: <br>

1. **master_anal** - this class contains information about an entire simulation. <br>

    - A dictionary of all files output by the simulation, organised into their respective steps (i.e. it groups all thermal ramping files, equilibration files and minimization files). The idea is that these files can be looked up and MDAnalysis universes created for each part of the simulation with ease.
    - Information about the residues in the simulation: a set of unique residue codes and a dictionary matching each polymer to its residue numbers. This is necessary because polymers made with AMBER are built from a series of residues, so a 10-mer actually consists of 10 residues. The dictionary makes selecting individual polymers much easier! <br>

2. **Universe** - this class creates an MDAnalysis universe we can pass to the analysis methods <br>

3. **anal_methods** - this class contains MDAnalysis methods

  
   

The first thing to do is set up the manager class - something that is consistent across all of these notebooks!

In [1]:
from modules.sw_directories import *
import os

manager = SnippetSimManage(os.getcwd())

The system I will be analysing is "3HB_10_polymer_5_5_array_crystal"; however, you can use any system name that has simulations associated with it. The first step is to locate the required simulation files and return their paths. The function below can return file paths to multiple directories if the simulation has been carried out multiple times.

In [7]:
system_name = "3HB_10_polymer_5_5_array_crystal"
sim_avail = manager.simulations_avail(system_name)

Output contains paths to simulation directories.


Print out the path of the simulation directories.

In [6]:
sim_avail

['/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/2024-10-04_174036']
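
When a system has been simulated more than once, the list will hold several timestamped directories. A minimal sketch of how the most recent run could be picked, assuming every directory name follows the `YYYY-MM-DD_HHMMSS` pattern seen above. `pick_latest_run` is a hypothetical helper, not part of the manager class, and the second path is invented purely for illustration:

```python
import os
from datetime import datetime


def pick_latest_run(sim_dirs):
    """Return the directory whose final path component is the latest timestamp."""
    def stamp(path):
        # Assumes the directory name looks like '2024-10-04_174036'
        name = os.path.basename(os.path.normpath(path))
        return datetime.strptime(name, "%Y-%m-%d_%H%M%S")
    return max(sim_dirs, key=stamp)


example_dirs = [
    "/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/2024-10-04_174036",
    # Hypothetical earlier run, added only to illustrate the comparison
    "/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/2024-09-30_091500",
]
latest = pick_latest_run(example_dirs)
print(latest)
```

Parsing the name into a `datetime` avoids depending on plain string ordering, although for this zero-padded pattern the two would agree.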

Everything for the analysis is contained within this path with the exception of the topology file which can be returned in different ways using the manager class. <br>

Now we want to set up the *master_anal* class, which will contain the steps of the simulation we want to analyse.

In [167]:
from collections import defaultdict

class master_anal():
    def __init__(self, manager, system_name, simulation_directory, poly_length=None):
        self.manager = manager
        self.system_name = system_name
        self.topology_file = self.manager.load_amber_filepaths(system_name)[0]
        self.simulation_directory = simulation_directory
        self.simulation_files = self.group_files()
        self.min_filepath = os.path.join(self.simulation_directory, self.simulation_files["min"][0])
        # Passing a polymer length is only appropriate when every polymer in the system has the same length
        if poly_length is not None:
            self.poly_length = poly_length
            # Parse the minimized PDB once and unpack both the polymer dictionary and the residue codes
            _, self.poly_sel_dict, self.residue_codes = self.calculate_polymers_and_assign_residue_codes(self.min_filepath, self.poly_length)
        else:
            self.poly_length = None
            # Without a polymer length there is no polymer-to-residue mapping
            self.poly_sel_dict = None
            self.residue_codes = self.extract_rescodes_and_resnums(self.min_filepath)[1]
        self.simulation_steps = list(self.simulation_files.keys())
    
    def group_files(self):
        grouped_files = defaultdict(list)

        sim_step_strings = ["1_atm", "temp_ramp_heat", "temp_ramp_cool", "min"]

        for file in os.listdir(self.simulation_directory):
            if file.endswith(('.txt', '.dcd', '.pdb')):
                base_name = os.path.splitext(file)[0]
                for string in sim_step_strings:
                    if string in base_name:
                        grouped_files[string].append(file)

        return grouped_files

    def extract_rescodes_and_resnums(self, pdb_file_path):
        largest_residue_number = None  # Variable to track the largest residue number
        unique_residue_codes = set()    # Set to hold unique residue codes

        with open(pdb_file_path, 'r') as pdb_file:
            for line in pdb_file:
                # Parse only lines that start with "ATOM" or "HETATM"
                if line.startswith("ATOM") or line.startswith("HETATM"):
                    # Extract the residue number (PDB columns 23-26, i.e. slice 22:26)
                    residue_number = int(line[22:26].strip())
                    # Extract the residue code (PDB columns 18-20, i.e. slice 17:20)
                    residue_code = line[17:20].strip()

                    # Update the largest residue number if this one is larger
                    if largest_residue_number is None or residue_number > largest_residue_number:
                        largest_residue_number = residue_number
                
                    # Add the residue code to the set for unique codes
                    unique_residue_codes.add(residue_code)

        return largest_residue_number, unique_residue_codes

    def calculate_polymers_and_assign_residue_codes(self, pdb_file_path, poly_length):
        # Find the largest residue number and unique residue codes
        largest_residue_number, unique_residue_codes = self.extract_rescodes_and_resnums(pdb_file_path)

        # Calculate the number of polymers
        num_polymers = largest_residue_number // poly_length

        # Create a dictionary to hold the polymer residue codes
        polymers_dict = {}

        # Assign residue codes based on the number of residues per polymer
        for i in range(num_polymers):
            # Calculate the start and end residue codes for this polymer
            start_code = i * poly_length + 1
            end_code = start_code + poly_length - 1
            polymers_dict[f'Polymer_{i + 1}'] = list(range(start_code, end_code + 1))

        return num_polymers, polymers_dict, unique_residue_codes   

# Set up masterclass for simulation analysis

The *master_anal* class doesn't do any analysis itself, but it holds information that is very useful for analysis. We need to pass the system name, the simulation folder we want to analyse and the length of the polymers (if this is applicable).
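
As an illustration of why the polymer dictionary is useful, the sketch below turns one entry into an MDAnalysis-style selection string. It assumes the residue numbers in the dictionary match the `resid` values MDAnalysis assigns from the topology, which should be checked for a given system. `polymer_selection` is a hypothetical helper and the small dictionary is a hand-built stand-in for `poly_sel_dict`:

```python
def polymer_selection(poly_sel_dict, polymer_name):
    """Build a 'resid start:end' selection string for one polymer."""
    resids = sorted(poly_sel_dict[polymer_name])
    # The range syntax only makes sense for a contiguous block of residues
    if resids != list(range(resids[0], resids[-1] + 1)):
        raise ValueError(f"{polymer_name} does not span a contiguous residue range")
    return f"resid {resids[0]}:{resids[-1]}"


# Stand-in for masterclass.poly_sel_dict with a polymer length of 10
example_dict = {
    "Polymer_1": list(range(1, 11)),
    "Polymer_2": list(range(11, 21)),
}
selection = polymer_selection(example_dict, "Polymer_2")
print(selection)  # resid 11:20
```

The resulting string could then be handed to `select_atoms` on an MDAnalysis universe to pull out a single chain.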

In [168]:
# sim_avail[x] comes from the manager class and is the full path to a directory containing simulation outputs
masterclass = master_anal(manager, system_name, sim_avail[0], 10)

In [169]:
masterclass.topology_file

'/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/3HB_10_polymer_5_5_array_crystal.prmtop'

In [170]:
masterclass.simulation_files

defaultdict(list,
            {'1_atm': ['3HB_10_polymer_5_5_array_crystal_1_atm_2024-10-04_174036.pdb',
              '3HB_10_polymer_5_5_array_crystal_1_atm_2024-10-04_174036.dcd',
              '3HB_10_polymer_5_5_array_crystal_1_atm_2024-10-04_174036.txt'],
             'temp_ramp_heat': ['3HB_10_polymer_5_5_array_crystal_temp_ramp_heat_300_700_2024-10-04_174036.txt',
              '3HB_10_polymer_5_5_array_crystal_temp_ramp_heat_300_700_2024-10-04_174036.pdb',
              '3HB_10_polymer_5_5_array_crystal_temp_ramp_heat_300_700_2024-10-04_174036.dcd'],
             'min': ['min_3HB_10_polymer_5_5_array_crystal.pdb'],
             'temp_ramp_cool': ['3HB_10_polymer_5_5_array_crystal_temp_ramp_cool_300_700_2024-10-04_174036.dcd',
              '3HB_10_polymer_5_5_array_crystal_temp_ramp_cool_300_700_2024-10-04_174036.txt',
              '3HB_10_polymer_5_5_array_crystal_temp_ramp_cool_300_700_2024-10-04_174036.pdb']})
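
The grouping shown above relies on substring matching against the file name. A standalone sketch of the same idea, run on a short list of names instead of a directory listing. `group_by_step` is a hypothetical stand-in for the `group_files` method, and `notes.log` is an invented file name:

```python
import os
from collections import defaultdict


def group_by_step(filenames, step_keys=("1_atm", "temp_ramp_heat", "temp_ramp_cool", "min")):
    """Group file names by the simulation-step keyword found in their base name."""
    grouped = defaultdict(list)
    for name in filenames:
        # Only trajectory, structure and data files are of interest
        if not name.endswith((".txt", ".dcd", ".pdb")):
            continue
        base = os.path.splitext(name)[0]
        for key in step_keys:
            if key in base:
                grouped[key].append(name)
    return grouped


files = [
    "3HB_10_polymer_5_5_array_crystal_1_atm_2024-10-04_174036.dcd",
    "min_3HB_10_polymer_5_5_array_crystal.pdb",
    "notes.log",  # hypothetical file, skipped because of its extension
]
grouped = group_by_step(files)
print(dict(grouped))
```

Because the match is a plain substring test, a file whose name contained two keywords (for example a system name that happened to include "min") would land in both groups, which is worth keeping in mind when naming systems.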

In [171]:
masterclass.residue_codes

{'hAD', 'mAD', 'tAD'}

In [172]:
masterclass.poly_sel_dict

{'Polymer_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'Polymer_2': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
 'Polymer_3': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 'Polymer_4': [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
 'Polymer_5': [41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
 'Polymer_6': [51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
 'Polymer_7': [61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
 'Polymer_8': [71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
 'Polymer_9': [81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
 'Polymer_10': [91, 92, 93, 94, 95, 96, 97, 98, 99, 100],
 'Polymer_11': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
 'Polymer_12': [111, 112, 113, 114, 115, 116, 117, 118, 119, 120],
 'Polymer_13': [121, 122, 123, 124, 125, 126, 127, 128, 129, 130],
 'Polymer_14': [131, 132, 133, 134, 135, 136, 137, 138, 139, 140],
 'Polymer_15': [141, 142, 143, 144, 145, 146, 147, 148, 149, 150],
 'Polymer_16': [151, 152, 153, 154, 155, 156, 157, 158, 159, 160],
 'Polymer_17': [161, 162, 163, 164, 165, 166,

In [173]:
masterclass.simulation_steps

['1_atm', 'temp_ramp_heat', 'min', 'temp_ramp_cool']



Now we can set up a universe for our simulation. We pass the *masterclass* instance to **Universe** alongside a string taken from **masterclass.simulation_steps** as follows: <br>

universe = Universe(masterclass, '1_atm') <br>

This will create an MDAnalysis universe for a specific part of the simulation using the '.pdb' trajectory. We can also specify the trajectory format we want to use (i.e. '.pdb' or '.dcd'); if nothing is specified, the '.pdb' trajectory is used by default. <br>

universe = Universe(masterclass, '1_atm', '.dcd') <br>

In [181]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import nglview as nv
import MDAnalysis as mda
from MDAnalysis.lib import distances 
from MDAnalysis.analysis import rdf
import MDAnalysisData as data
from MDAnalysis.analysis.polymer import PersistenceLength

import pandas as pd

In [194]:
class Universe():
    def __init__(self, master_anal, sim_key, traj_format=None):
        # Default to the '.pdb' trajectory when no format is given
        if traj_format is None:
            traj_format = ".pdb"
        # Fail early on unsupported formats instead of leaving traj_format unset
        if traj_format not in (".pdb", ".dcd"):
            raise ValueError(f"{traj_format} is not supported, please enter '.pdb' or '.dcd' format.")
        self.traj_format = traj_format
        self.sim_key = sim_key
        self.masterclass = master_anal
        self.topology = self.masterclass.topology_file
        # True tells 'select_file' we are searching for the traj
        self.trajectory = os.path.join(self.masterclass.simulation_directory, self.select_file(True))
        self.universe = mda.Universe(self.topology, self.trajectory)
        self.output_filename = os.path.join(self.masterclass.simulation_directory, self.masterclass.system_name + f"_{self.sim_key}")
        # False tells 'select_file' we are searching for the data file
        self.data_file = os.path.join(self.masterclass.simulation_directory, self.select_file(False))
        self.data = pd.read_csv(self.data_file)

    def select_file(self, traj):
        # Raise instead of returning an error string, which would otherwise be joined into a file path
        if self.sim_key not in self.masterclass.simulation_files:
            raise KeyError(f"Key '{self.sim_key}' not found in the dictionary.")
        # Filter the files based on the required extension
        extension = self.traj_format if traj else ".txt"
        matching_files = [filename for filename in self.masterclass.simulation_files[self.sim_key] if filename.endswith(extension)]
        if not matching_files:
            raise FileNotFoundError(f"No files with extension '{extension}' found for key '{self.sim_key}'.")
        return matching_files[0]  # Return the first matching file
        

In [195]:
universe = Universe(masterclass, 'temp_ramp_cool', ".pdb")

In [196]:
universe.topology

'/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/3HB_10_polymer_5_5_array_crystal.prmtop'

In [197]:
universe.trajectory

'/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/2024-10-04_174036/3HB_10_polymer_5_5_array_crystal_temp_ramp_cool_300_700_2024-10-04_174036.pdb'

In [198]:
universe.output_filename

'/home/dan/polymersimulator/pdb_files/systems/3HB_10_polymer_5_5_array_crystal/2024-10-04_174036/3HB_10_polymer_5_5_array_crystal_temp_ramp_cool'

In [199]:
universe.universe

<Universe with 3075 atoms>

In [200]:
universe.data

Unnamed: 0,"#""Progress (%)""",Step,Time (ps),Potential Energy (kJ/mole),Kinetic Energy (kJ/mole),Total Energy (kJ/mole),Temperature (K),Box Volume (nm^3),Density (g/mL),Speed (ns/day),Elapsed Time (s)
0,0.0%,1000,1.000000,-25115.301993,18640.853699,-6474.448294,584.457605,38.765671,0.941194,0.00,0.000231
1,0.0%,2000,2.000000,-22984.163637,20445.845135,-2538.318502,641.050559,38.884793,0.938311,6.11,14.141430
2,0.0%,3000,3.000000,-22207.232071,21687.085684,-520.146388,679.967901,39.295464,0.928505,6.12,28.242461
3,0.0%,4000,4.000000,-21275.433820,22604.436956,1329.003136,708.730153,38.514641,0.947329,6.12,42.335292
4,0.1%,5000,5.000000,-21498.174381,22345.699989,847.525608,700.617821,39.464460,0.924529,6.13,56.420394
...,...,...,...,...,...,...,...,...,...,...,...
9994,100.0%,9995000,9995.000002,-36960.375464,9719.748180,-27240.627284,304.748958,31.623401,1.153767,6.05,142759.173750
9995,100.0%,9996000,9996.000002,-37217.966354,9361.407441,-27856.558913,293.513691,31.813964,1.146856,6.05,142773.638903
9996,100.0%,9997000,9997.000002,-37140.970462,9574.010816,-27566.959646,300.179569,31.824723,1.146468,6.05,142788.077985
9997,100.0%,9998000,9998.000002,-37093.903853,9411.511642,-27682.392211,295.084638,32.051156,1.138369,6.05,142802.541960


In [43]:
masterclass.simulation_files["min"]


['min_3HB_10_polymer_5_5_array_crystal.pdb']

In [45]:
min_filepath = os.path.join(masterclass.simulation_directory, masterclass.simulation_files["min"][0])

In [86]:
def parse_pdb_simple(pdb_file_path):
    residues = []  # List to store all residue codes
    unique_residues = set()  # Set to store unique residue codes
    unique_resnums = []
    largest_residue_number = None
    
    with open(pdb_file_path, 'r') as pdb_file:
        for line in pdb_file:
            # Parse only lines that start with "ATOM" or "HETATM"
            if line.startswith("ATOM") or line.startswith("HETATM"):
                
                # Extract the residue code (PDB columns 18-20, i.e. slice 17:20)
                residue_code = line[17:20].strip()
                # Append residue code to the list of residues
                residues.append(residue_code)
                # Add the residue code to the unique residues set
                unique_residues.add(residue_code)

                residue_number = int(line[22:26].strip())
                # Update the largest residue number if this one is larger
                if largest_residue_number is None or residue_number > largest_residue_number:
                    largest_residue_number = residue_number
    
    # Return the sorted unique residue codes and the largest residue number
    return sorted(unique_residues), largest_residue_number

In [87]:
a, b = parse_pdb_simple(min_filepath)

In [89]:
b

250
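
The slices used by these parsers follow the fixed-column PDB layout: the residue name sits in columns 18-20 and the residue sequence number in columns 23-26 (1-based), which correspond to `line[17:20]` and `line[22:26]` in Python. A minimal sketch that builds one ATOM record and reads those fields back. `make_atom_line` and `parse_residue_fields` are hypothetical helpers, and the atom name, residue code pairing and coordinates are illustrative values only:

```python
def make_atom_line(serial, atom_name, res_name, res_seq, x, y, z):
    """Build a minimal fixed-column PDB ATOM record (illustrative helper)."""
    return (
        f"{'ATOM':<6}{serial:>5} {atom_name:<4} {res_name:>3} "
        f" {res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_residue_fields(line):
    """Return (residue code, residue number) using the same slices as the notebook."""
    return line[17:20].strip(), int(line[22:26])


# Illustrative record for the last residue of the system (residue number 250)
line = make_atom_line(3075, "C1", "mAD", 250, 1.0, 2.0, 3.0)
res_name, res_seq = parse_residue_fields(line)

# With 10 residues per polymer, 250 residues correspond to 25 polymers
num_polymers = res_seq // 10
print(res_name, res_seq, num_polymers)
```

A count of 25 polymers is what one would expect if the `5_5_array` in the system name refers to a 5 x 5 arrangement of chains.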

In [97]:
def extract_rescodes_and_resnums(pdb_file_path):
    largest_residue_number = None  # Variable to track the largest residue number
    unique_residue_codes = set()    # Set to hold unique residue codes

    with open(pdb_file_path, 'r') as pdb_file:
        for line in pdb_file:
            # Parse only lines that start with "ATOM" or "HETATM"
            if line.startswith("ATOM") or line.startswith("HETATM"):
                # Extract the residue number (PDB columns 23-26, i.e. slice 22:26)
                residue_number = int(line[22:26].strip())
                # Extract the residue code (PDB columns 18-20, i.e. slice 17:20)
                residue_code = line[17:20].strip()

                # Update the largest residue number if this one is larger
                if largest_residue_number is None or residue_number > largest_residue_number:
                    largest_residue_number = residue_number
                
                # Add the residue code to the set for unique codes
                unique_residue_codes.add(residue_code)

    return largest_residue_number, unique_residue_codes

def calculate_polymers_and_assign_residue_codes(pdb_file_path, poly_length):
    # Find the largest residue number and unique residue codes
    largest_residue_number, unique_residue_codes = extract_rescodes_and_resnums(pdb_file_path)

    # Calculate the number of polymers
    num_polymers = largest_residue_number // poly_length

    # Create a dictionary to hold the polymer residue codes
    polymers_dict = {}

    # Assign residue codes based on the number of residues per polymer
    for i in range(num_polymers):
        # Calculate the start and end residue codes for this polymer
        start_code = i * poly_length + 1
        end_code = start_code + poly_length - 1
        polymers_dict[f'Polymer_{i + 1}'] = list(range(start_code, end_code + 1))

    return num_polymers, polymers_dict, unique_residue_codes

# Example usage
pdb_file_path = min_filepath  # Replace with your actual PDB file path
residues_per_polymer = 10  # Number of residues per polymer

num_polymers, polymers_dict, rescodes = calculate_polymers_and_assign_residue_codes(pdb_file_path, residues_per_polymer)

In [98]:
rescodes

{'hAD', 'mAD', 'tAD'}