# Biopandas for PDB files
-----------------------------------------------------------------------------
## Overview
If you are a structural biologist working with molecular structure files, pandas dataframes are an excellent way to process PDB files. The tools are beyond the scope of this Introduction to Python course, but we include them here to give you a taste of how you can use traditional programming to query and calculate with these complex file types.

## Learning objectives
After this submodule, you will be able to:
1. Import a PDB file from the RCSB using biopandas
2. Access dataframes within the file
3. Calculate from and analyze series
4. Carry out an exercise identifying a ligand binding site in a structure

## Prerequisites
- Knowledge of pandas

## Getting Started
Run the next code box to install pandas and biopandas (NumPy is installed along with pandas) and to import the libraries.

In [None]:
%pip install pandas
%pip install biopandas

import pandas as pd
import numpy as np
from biopandas.pdb import PandasPdb

# Importing a PDB file into a pandas dataframe

As with Biopython, biopandas has functions that are well adapted to biological data types. A Protein Data Bank (PDB) file is a large text file that records the position and identity of every atom in a 3D structure.
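
To make "the position and identity of every atom" concrete, here is a minimal sketch of how a single atom record is laid out. It assumes the standard fixed-width column positions of the PDB format; the record itself is made up for illustration, and `parse_atom_line` is a hypothetical helper, not part of biopandas. The dictionary keys are chosen to mirror the column names biopandas uses.

```python
# A made-up ATOM record, assembled field by field so the column widths are explicit.
atom_line = (
    "ATOM  "      # columns 1-6: record type
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue number
    "    "        # columns 27-30: insertion code plus padding
    "  27.340"    # columns 31-38: x coordinate (Å)
    "  24.430"    # columns 39-46: y coordinate (Å)
    "   2.614"    # columns 47-54: z coordinate (Å)
    "  1.00"      # columns 55-60: occupancy
    "  9.67"      # columns 61-66: B-factor
    "          "  # columns 67-76: not used in this example
    " N"          # columns 77-78: element symbol
)

def parse_atom_line(line):
    """Slice one ATOM/HETATM line into a dict (Python slices are 0-based)."""
    return {
        "record_name": line[0:6].strip(),
        "atom_number": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        "x_coord": float(line[30:38]),
        "y_coord": float(line[38:46]),
        "z_coord": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element_symbol": line[76:78].strip(),
    }

atom = parse_atom_line(atom_line)
print(atom)
```

biopandas does this slicing for every line of the file and stacks the results into dataframes, so you rarely need to parse the columns by hand.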

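The file can be imported directly from the RCSB using fetch_pdb as shown. The result stores each record type (for example ATOM and HETATM) as a separate dataframe in the ppdb.df dictionary.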

In [None]:
ppdb = PandasPdb().fetch_pdb('3eiy')
ppdb.df['ATOM'].tail()

In [None]:
ppdb.df['ATOM']["residue_name"].unique()

In [None]:
ppdb.df['ATOM']["b_factor"].mean()
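
The cells above select a single column, which returns a pandas Series, and then call a method on it. The sketch below repeats the same pattern on a small made-up table (a stand-in for `ppdb.df['ATOM']`, so it runs without downloading anything) and adds one extra step, a per-residue average with `groupby`.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] (values are made up for illustration)
atoms = pd.DataFrame({
    "residue_name": ["SER", "SER", "PHE", "PHE", "GLU"],
    "atom_name": ["N", "CA", "N", "CA", "N"],
    "b_factor": [50.0, 52.0, 30.0, 34.0, 40.0],
})

b_factors = atoms["b_factor"]           # selecting one column returns a Series
print(type(b_factors).__name__)
print(b_factors.mean())                 # average over all atoms
print(atoms["residue_name"].unique())   # distinct values, in order of appearance

# Mean B-factor for each residue type
mean_by_residue = atoms.groupby("residue_name")["b_factor"].mean()
print(mean_by_residue)
```

On the real structure, the same `groupby` call on `ppdb.df['ATOM']` would show which residue types have the highest average B-factor.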

In [None]:
ppdb.df['HETATM'].head()

It is easy to identify all the heteroatoms in this structure file:

In [None]:
ppdb.df['HETATM']['residue_name'].unique()

Because this is a pandas **Series** rather than a built-in type whose count() method tallies a given value (such as a string, list, or tuple), the tool to 'count' items is slightly different: series.value_counts()\[value]

The cell below counts the water molecules (residue name HOH). Can you also identify how many sodium atoms are in the structure file?

In [None]:
ppdb.df['HETATM']['residue_name'].value_counts()['HOH']
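
As a side-by-side illustration of the difference described above, this sketch counts the same value in a plain list and in a Series. The residue names are made up and stand in for `ppdb.df['HETATM']['residue_name']`.

```python
import pandas as pd

# Made-up residue names standing in for ppdb.df['HETATM']['residue_name']
names = ["K", "HOH", "HOH", "NA", "HOH"]

# A plain list tallies one value with count()
print(names.count("HOH"))

# A Series tallies every value at once with value_counts()
counts = pd.Series(names).value_counts()
print(counts)
print(counts["HOH"])

# Indexing with a label that is absent raises KeyError;
# get() lets you supply a default instead
print(counts.get("ZN", 0))
```

Using `get()` with a default of 0 is a small design choice that keeps a script running when a structure does not contain the residue you asked about.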

## Exercise: Identifying the ligand binding residues 
In this project, you'll analyze a protein-ligand complex from a PDB file to:

1. Extract and explore atomic information.
2. Identify the binding site of the ligand.
3. Calculate distances between the ligand and nearby protein residues.
4. Visualize the results using Python.

You should re-run it with a different pdb file and ligand to see how it can work for you.

On a PC, many of these steps take ~10 min to run. This is where cloud computing shows its power and utility!

### Step one: Get the file
Assign the PDB file to the variable pdb. The code below fetches PDB ID 4CFF, a structure with bound AMP ligands.

As always, look at the first few lines of the file to make sure it is what you expected and to see the column names.

In [None]:
# Load the PDB file
pdb = PandasPdb().fetch_pdb('4CFF')

# View the ATOM and HETATM data
print(pdb.df['ATOM'].head())  # Protein atoms
print(pdb.df['HETATM'].head())  # Ligand or non-standard residues
pdb.df['HETATM']['residue_name'].unique()

### Step two: Filter the Ligand Data
Identify and extract the ligand information (from the HETATM section).

In [None]:
# Extract the ligand atoms (here "AMP", adenosine monophosphate)
amp_ligand = pdb.df['HETATM'][pdb.df['HETATM']['residue_name'] == 'AMP']
# Preview a few rows and columns (note that iloc[1:5] starts at the second row)
print(amp_ligand.iloc[1:5,1:8])
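
The filter above packs two steps into one line. The sketch below separates them on a small made-up table (a stand-in for `pdb.df['HETATM']`) and shows how conditions can be combined; the residue and chain values are illustrative only.

```python
import pandas as pd

# Toy stand-in for pdb.df['HETATM'] (made-up values)
hetatm = pd.DataFrame({
    "residue_name": ["AMP", "AMP", "HOH", "SO4", "AMP", "HOH"],
    "chain_id": ["A", "A", "A", "B", "B", "B"],
    "atom_name": ["P", "O1P", "O", "S", "P", "O"],
})

# Step 1: the comparison produces one True/False value per row
is_amp = hetatm["residue_name"] == "AMP"
print(is_amp.tolist())

# Step 2: indexing with that mask keeps only the True rows
amp_atoms = hetatm[is_amp]

# Masks combine with & (and), | (or), ~ (not); wrap each comparison in parentheses
amp_chain_a = hetatm[is_amp & (hetatm["chain_id"] == "A")]

# isin() tests membership in a list, here to drop waters and sulfate
not_solvent = hetatm[~hetatm["residue_name"].isin(["HOH", "SO4"])]
print(not_solvent)
```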


### Step three: Identify Nearby Protein Atoms
Calculate the distance between the ligand and nearby protein atoms to identify potential binding site residues.

In [None]:
# Extract protein atoms
protein_atoms = pdb.df['ATOM']

# Define a function to calculate distances
def calculate_distance(coord1, coord2):
    return np.linalg.norm(coord1 - coord2)

# Find nearby protein atoms within 5Å of the ligand
binding_site = []
for _, ligand_row in amp_ligand.iterrows():
    ligand_coords = np.array([ligand_row['x_coord'], ligand_row['y_coord'], ligand_row['z_coord']])
    for _, protein_row in protein_atoms.iterrows():
        protein_coords = np.array([protein_row['x_coord'], protein_row['y_coord'], protein_row['z_coord']])
        distance = calculate_distance(ligand_coords, protein_coords)
        if distance <= 5.0:
            binding_site.append(protein_row)

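# Convert the binding site data to a DataFrame. A protein atom that is close to
# several ligand atoms is appended more than once, so drop the repeated rows.
binding_site_df = pd.DataFrame(binding_site).drop_duplicates()
print(binding_site_df)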
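
The nested loops above compare every ligand atom with every protein atom one pair at a time, which is why this cell is slow. A possible alternative, sketched here on made-up coordinates rather than the real structure, is to let NumPy broadcasting compute all pairwise distances in one step. The names `protein_toy`, `ligand_toy`, and `binding_site_toy` are placeholders for `protein_atoms`, `amp_ligand`, and `binding_site_df`.

```python
import numpy as np
import pandas as pd

coord_cols = ["x_coord", "y_coord", "z_coord"]

# Toy stand-ins for protein_atoms and amp_ligand (made-up coordinates, in Å)
protein_toy = pd.DataFrame({
    "residue_name": ["GLY", "LYS", "ASP", "ARG"],
    "residue_number": [10, 11, 12, 50],
    "x_coord": [0.0, 3.0, 10.0, 1.0],
    "y_coord": [0.0, 0.0, 0.0, 4.0],
    "z_coord": [0.0, 0.0, 0.0, 0.0],
})
ligand_toy = pd.DataFrame({
    "x_coord": [1.0, 2.0],
    "y_coord": [0.0, 0.0],
    "z_coord": [0.0, 0.0],
})

protein_xyz = protein_toy[coord_cols].to_numpy()   # shape (n_protein, 3)
ligand_xyz = ligand_toy[coord_cols].to_numpy()     # shape (n_ligand, 3)

# Broadcasting: (n_protein, 1, 3) - (1, n_ligand, 3) -> (n_protein, n_ligand, 3)
diff = protein_xyz[:, np.newaxis, :] - ligand_xyz[np.newaxis, :, :]
distances = np.linalg.norm(diff, axis=2)           # shape (n_protein, n_ligand)

# Keep each protein atom whose closest ligand atom is within the cutoff
cutoff = 5.0
near_mask = distances.min(axis=1) <= cutoff
binding_site_toy = protein_toy[near_mask]
print(binding_site_toy)
```

Because the mask has exactly one entry per protein atom, each atom appears at most once in the result. The trade-off is memory: the intermediate array holds n_protein × n_ligand × 3 numbers, which is usually manageable for a single structure and a small ligand but grows quickly for larger selections.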

### Step four: Visualize the Results
Plot the ligand and the binding site residues in 3D using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Extract coordinates
ligand_coords = amp_ligand[['x_coord', 'y_coord', 'z_coord']].values
protein_coords = binding_site_df[['x_coord', 'y_coord', 'z_coord']].values

# Create a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plot ligand
ax.scatter(ligand_coords[:, 0], ligand_coords[:, 1], ligand_coords[:, 2], color='red', label='Ligand')

# Plot binding site residues
ax.scatter(protein_coords[:, 0], protein_coords[:, 1], protein_coords[:, 2], color='blue', label='Binding Site')

# Label the plot
ax.set_title("Protein-Ligand Binding Site")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
ax.legend()

plt.show()

You should see the four AMP molecules (in red) along with the nearby protein atoms (in blue).
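
If you would rather confirm the number of ligand copies from the data than from the plot, one option is to group the ligand atoms by chain and residue number, since together those identify a single molecule. This is a sketch on a made-up table (`amp_toy` stands in for `amp_ligand`; the chain letters and atom names are illustrative).

```python
import pandas as pd

# Toy stand-in for amp_ligand: two AMP copies with three atoms each (made-up)
amp_toy = pd.DataFrame({
    "residue_name": ["AMP"] * 6,
    "chain_id": ["E", "E", "E", "F", "F", "F"],
    "residue_number": [1, 1, 1, 1, 1, 1],
    "atom_name": ["P", "C5'", "N9"] * 2,
})

# Each molecule is identified by its chain and residue number
atoms_per_molecule = amp_toy.groupby(["chain_id", "residue_number"]).size()
print(atoms_per_molecule)
print("Number of AMP molecules:", len(atoms_per_molecule))
```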

### Step five: Collect the nearby amino acids
In the previous steps we identified all the nearby **atoms**. Here, we collect the names of the nearby residues, this time using a tighter 3 Å cutoff. We store the residue information in a set, which keeps only unique elements: calling add() with an entry that is already present changes nothing. *This step can take some time.*

In [None]:
# Find amino acids near AMP ligand
nearby_residues = set()  # To store unique residues
distance_threshold = 3.0  # Distance threshold in Å. Consider adding a user input to ask you for the distance

for _, ligand_row in amp_ligand.iterrows():
    ligand_coords = np.array([ligand_row['x_coord'], ligand_row['y_coord'], ligand_row['z_coord']])
    for _, protein_row in protein_atoms.iterrows():
        protein_coords = np.array([protein_row['x_coord'], protein_row['y_coord'], protein_row['z_coord']])
        distance = calculate_distance(ligand_coords, protein_coords)
        if distance <= distance_threshold:
            # Add residue information to the set
            residue_info = f"{protein_row['residue_name']} {protein_row['residue_number']} {protein_row['chain_id']}"
            nearby_residues.add(residue_info)

# Display the nearby residues (the set already holds only unique entries)
print("Amino acids near AMP ligand:")
print("Residue # Chain")
for residue in sorted(nearby_residues):
    print(residue)
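
If you want the result as a dataframe rather than a set of strings, one possible approach is to wrap the distance test in a function and let `drop_duplicates` collapse atoms into residues. This is a sketch under assumptions: `residues_near_ligand` is a hypothetical helper (not part of biopandas), the coordinates are made up, and the function expects a ligand table with at least one atom.

```python
import numpy as np
import pandas as pd

def residues_near_ligand(protein_df, ligand_df, cutoff):
    """Return one row per residue with any atom within `cutoff` Å of the ligand."""
    cols = ["x_coord", "y_coord", "z_coord"]
    protein_xyz = protein_df[cols].to_numpy()
    ligand_xyz = ligand_df[cols].to_numpy()
    # All protein-ligand atom distances at once, shape (n_protein, n_ligand)
    dist = np.linalg.norm(protein_xyz[:, None, :] - ligand_xyz[None, :, :], axis=2)
    near_atoms = protein_df[dist.min(axis=1) <= cutoff]
    id_cols = ["chain_id", "residue_number", "residue_name"]
    return (near_atoms[id_cols]
            .drop_duplicates()                        # many atoms -> one row per residue
            .sort_values(["chain_id", "residue_number"])
            .reset_index(drop=True))

# Toy data: two ARG atoms, two THR atoms, one distant LEU atom (made-up, in Å)
protein_toy = pd.DataFrame({
    "residue_name": ["ARG", "ARG", "THR", "THR", "LEU"],
    "residue_number": [70, 70, 86, 86, 120],
    "chain_id": ["A"] * 5,
    "x_coord": [0.0, 1.0, 2.5, 6.0, 20.0],
    "y_coord": [0.0] * 5,
    "z_coord": [0.0] * 5,
})
ligand_toy = pd.DataFrame({
    "x_coord": [0.5, 1.5], "y_coord": [0.0, 0.0], "z_coord": [0.0, 0.0],
})

nearby_df = residues_near_ligand(protein_toy, ligand_toy, cutoff=3.0)
print(nearby_df)
```

Keeping the residue number as a numeric column means the rows sort in sequence order, whereas sorting strings such as "ARG 70 A" orders them alphabetically by residue name.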

# Conclusion
You should now see the potential power of pandas to process even complex structure data files. Perhaps you can envision using a Python script to collect and compare hundreds of AMP binding sites from protein files. Hopefully, you edited the scripts above to look at a protein of your choice or to use other tools from previous tutorials!

Next, learn more about the tools available in Python for [visualizing your data](./Submodule_2_Tutorial_3_VisualizingData.ipynb).

## Clean up
After you are done, be sure to stop the compute instance for this Jupyter notebook to avoid unnecessary charges.