<a href="https://colab.research.google.com/github/Ash100/CADD_Project/blob/main/QSAR%20Modeling%20-%20Descriptors%20Generations%20-%20Part_14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QSAR - Generation of Feature Profiles for a Dataset

My name is **Dr. Ashfaq Ahmad** and I work in the field of Structural Biology and Bioinformatics. A step-by-step video demonstration can be found in the [**Video Tutorial**]

These files are prepared for teaching and research purposes. If you want to use them for commercial purposes, please **contact us**.


**Quantitative structure–activity relationship (QSAR)** models are built by users to predict the activity of their compounds. Here we will learn how to build a QSAR model for our own data.

Next, we will use that model to predict the activity of unknown compounds. Once you have generated and trained your model, you can keep it and reuse it in the future.

I suggest reading some literature from your field of interest.



In [None]:
#@title Install RDKit
# RDKit is published on PyPI as "rdkit"; "rdkit-pypi" is the older, deprecated package name
!pip install rdkit
!pip install networkx

In [13]:
#@title Import necessary libraries
import pandas as pd
import networkx as nx
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, rdmolops

Now we are going to upload our compound dataset below. In case you are new here, I suggest following the [**Tutorial of EDA**](https://youtu.be/r42qIjN7NM0); the dataset generated there can also be used here.

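Our dataset will contain:
1. SMILES (in a column named `SMILES`)
2. pIC50 values (in a column named `Activity`, which is the name the code below expects)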
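
Before generating descriptors, it can help to confirm that the table really has the two columns the later cells read. The sketch below is only an illustration: the inline CSV text, the `pIC50` header, and the activity values are made-up assumptions standing in for your uploaded file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for an uploaded file (values are made up)
csv_text = """SMILES,pIC50
CCO,5.1
c1ccccc1,
CCN,6.3
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: the activity column was exported as "pIC50"; the cells below expect "Activity"
df = df.rename(columns={"pIC50": "Activity"})

# Stop early if a required column is absent
missing = {"SMILES", "Activity"} - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {sorted(missing)}")

# Drop rows that lack a SMILES string or an activity value
df = df.dropna(subset=["SMILES", "Activity"]).reset_index(drop=True)
print(df)
```

With a real file, you would replace `io.StringIO(csv_text)` with the path to your CSV.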

In [11]:
# Load data from a CSV file (make sure to upload your file to Colab)
# The CSV file should have at least two columns: "SMILES" and "Activity"
file_path = '/content/sample_data/comp-QSAR.csv'  # Change this to the path of your CSV file
data = pd.read_csv(file_path)

Below, we will generate various descriptors for the dataset uploaded above. I will generate four different kinds of descriptors and finally merge all four into a single master file. You are free to use these four descriptor types as they are, or to add or remove some.

### **Topological Descriptors**

These are topological indices used to describe molecular structures, particularly in cheminformatics and QSAR (Quantitative Structure-Activity Relationship) studies. Here's a brief explanation of each:

**1. Wiener Index:** Measures molecular branching and size. It's the sum of the shortest path lengths between all pairs of atoms in the molecule.

**2. Zagreb Index:** Describes molecular branching and complexity. It's the sum of the squares of the vertex degrees (number of bonds) in the molecular graph.

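**3. TPSA (Topological Polar Surface Area):** Estimates the polar surface area of the molecule by summing tabulated contributions of polar atoms (mainly nitrogen and oxygen, together with their attached hydrogens). It's a measure of polarity and hydrogen bonding potential.

**4. Balaban Index:** Characterizes molecular connectivity and branching. It's the average distance-sum connectivity index, computed over all bonds from the distance sums of the two atoms that each bond joins in the molecular graph.

**5. Kappa1, Kappa2, Kappa3:** These are Kier shape indices that capture molecular shape and flexibility. They're based on the number of one-, two-, and three-bond paths in the molecular graph, compared with the smallest and largest counts possible for a molecule with the same number of atoms.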
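
To make the first two definitions concrete, here is a small pure-Python sketch that computes the Wiener index and the first Zagreb index by hand for two carbon skeletons. It is not the RDKit/networkx code used in this notebook; the adjacency dictionaries `n_butane` and `isobutane` and the helper names are illustrative assumptions.

```python
from collections import deque

def shortest_path_lengths(adjacency, source):
    """Breadth-first search: distance (in bonds) from source to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist

def wiener_index(adjacency):
    """Sum of shortest path lengths over all unordered pairs of atoms."""
    total = sum(sum(shortest_path_lengths(adjacency, a).values()) for a in adjacency)
    return total // 2  # every pair was counted twice, once from each end

def first_zagreb_index(adjacency):
    """Sum of squared vertex degrees."""
    return sum(len(neighbors) ** 2 for neighbors in adjacency.values())

# Heavy-atom graphs written by hand: a straight chain and a branched isomer
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), first_zagreb_index(n_butane))    # 10 10
print(wiener_index(isobutane), first_zagreb_index(isobutane))  # 9 12
```

The branched isomer has the smaller Wiener index (it is more compact) and the larger Zagreb index (it has a higher-degree atom), which is why both are read as branching descriptors.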

In [None]:
#@title Calculate Specific Topological Descriptors

def wiener_index(mol):
    g = nx.Graph(rdmolops.GetAdjacencyMatrix(mol))
    wiener_index = nx.wiener_index(g)
    return wiener_index

def zagreb_index(mol):
    # Calculating the first Zagreb index as sum of squares of vertex degrees
    adj_matrix = rdmolops.GetAdjacencyMatrix(mol)
    degrees = adj_matrix.sum(axis=0)
    zagreb_index = sum(degrees**2)
    return zagreb_index

def tpsa(mol):
    return Descriptors.TPSA(mol)

# Kappa shape indices are directly available in RDKit
# (note: despite the function name, these are kappa indices, not chi indices)
def kier_hall_chi_indices(mol):
    return {
        'Kappa1': Descriptors.Kappa1(mol),
        'Kappa2': Descriptors.Kappa2(mol),
        'Kappa3': Descriptors.Kappa3(mol),
    }

def balaban_index(mol):
    return Descriptors.BalabanJ(mol)

# Function to calculate all desired descriptors
def calculate_topological_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    descriptors = {
        'WienerIndex': wiener_index(mol),
        'ZagrebIndex': zagreb_index(mol),
        'TPSA': tpsa(mol),
        'BalabanIndex': balaban_index(mol)
    }
    kier_hall = kier_hall_chi_indices(mol)
    descriptors.update(kier_hall)
    return descriptors

# Calculate topological descriptors for each compound
descriptor_data = []
for index, row in data.iterrows():
    smiles = row['SMILES']
    activity = row['Activity']
    descriptors = calculate_topological_descriptors(smiles)
    if descriptors is not None:
        descriptors['SMILES'] = smiles
        descriptors['Activity'] = activity
        descriptor_data.append(descriptors)

# Convert the descriptor data to a DataFrame
descriptor_df = pd.DataFrame(descriptor_data)

# Save the results to a new CSV file
output_file_path = '/content/topological_descriptors.csv'  # Lowercase name must match the path read in the merge step (paths are case-sensitive)
descriptor_df.to_csv(output_file_path, index=False)

print(f"Descriptors calculated and saved to {output_file_path}")

### **Lipinski Descriptors**
**1. Molecular Weight (MW):** The sum of the atomic masses of all atoms in the molecule, usually expressed in Daltons (Da) or atomic mass units (amu).

**2. logP:** The logarithm of the partition coefficient between octanol and water, measuring a compound's lipophilicity (its preference for the fatty octanol phase over water). RDKit's `MolLogP` is a calculated estimate (Crippen method), not a measured value.

**3. NumHDonors:** The number of hydrogen bond donors in the molecule, typically hydrogens bonded to a highly electronegative atom (in practice, mostly N-H and O-H groups).

**4. NumHAcceptors:** The number of hydrogen bond acceptors in the molecule, typically electronegative atoms with a lone pair of electrons (mainly oxygen and nitrogen).

**5. NumRotatableBonds:** The number of single bonds that can rotate freely, affecting molecular flexibility and conformation.

**6. NumRings:** The number of rings (cyclic structures) in the molecule, which can influence molecular stability, reactivity, and binding properties.
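
These descriptors are commonly read against Lipinski's rule of five (molecular weight up to 500 Da, logP up to 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The sketch below counts violations from a dictionary of already computed values. The function name and both example compounds are hypothetical; the numbers are made up for illustration and were not computed from real molecules.

```python
def rule_of_five_violations(desc):
    """Count Lipinski rule-of-five violations from precomputed descriptor values."""
    checks = [
        desc["MolecularWeight"] > 500,  # molecular weight above 500 Da
        desc["LogP"] > 5,               # too lipophilic
        desc["NumHDonors"] > 5,         # too many hydrogen bond donors
        desc["NumHAcceptors"] > 10,     # too many hydrogen bond acceptors
    ]
    return sum(checks)

# Made-up descriptor values, keyed like the columns produced in this notebook
compound_a = {"MolecularWeight": 320.4, "LogP": 2.7, "NumHDonors": 2, "NumHAcceptors": 5}
compound_b = {"MolecularWeight": 612.8, "LogP": 6.1, "NumHDonors": 3, "NumHAcceptors": 9}

print(rule_of_five_violations(compound_a))  # 0
print(rule_of_five_violations(compound_b))  # 2
```

The rule is a screening heuristic, so a violation count is a flag for closer inspection rather than a verdict.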

In [None]:
#@title Calculate Lipinski's Descriptors
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Define functions to calculate Lipinski's descriptors

def lipinski_descriptors(mol):
    if mol is None:
        return None
    descriptors = {
        'MolecularWeight': Descriptors.MolWt(mol),
        'LogP': Descriptors.MolLogP(mol),
        'NumHDonors': Descriptors.NumHDonors(mol),
        'NumHAcceptors': Descriptors.NumHAcceptors(mol),
        'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
        'NumRings': Descriptors.RingCount(mol),
    }
    return descriptors

# Function to calculate all desired descriptors
def calculate_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    descriptors = lipinski_descriptors(mol)
    return descriptors

# Calculate descriptors for each compound
descriptor_data = []
for index, row in data.iterrows():
    smiles = row['SMILES']
    activity = row['Activity']
    descriptors = calculate_descriptors(smiles)
    if descriptors is not None:
        descriptors['SMILES'] = smiles
        descriptors['Activity'] = activity
        descriptor_data.append(descriptors)

# Convert the descriptor data to a DataFrame
descriptor_df = pd.DataFrame(descriptor_data)

# Save the results to a new CSV file
output_file_path = '/content/lipinski_descriptors.csv'  # Update this to your desired output path
descriptor_df.to_csv(output_file_path, index=False)

print(f"Descriptors calculated and saved to {output_file_path}")

### **3D Descriptors**
**1. Molecular Volume:** The three-dimensional space occupied by a molecule, typically measured in cubic angstroms (Å³) or cubic nanometers (nm³). It represents the size of the molecule.

**2. Surface Area:** The total area of the molecular surface, usually measured in square angstroms (Å²) or square nanometers (nm²). It influences interactions with other molecules, solubility, and binding properties.

**3. Radius of Gyration (Rg):** A measure of a molecule's size and shape, representing the root mean square distance of atoms from the molecular center of mass. It's typically measured in angstroms (Å) or nanometers (nm). Rg indicates molecular compactness and flexibility.
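
As a worked example of the radius of gyration definition, the sketch below computes it for a handful of point masses in plain Python. It assumes mass weighting (each squared distance is weighted by the atom's mass); the function name, coordinates, and masses are illustrative and do not come from a real conformer.

```python
import math

def radius_of_gyration(coords, masses):
    """Mass-weighted root mean square distance of points from their center of mass."""
    total_mass = sum(masses)
    # Center of mass, one coordinate axis at a time
    center = [
        sum(m * xyz[k] for m, xyz in zip(masses, coords)) / total_mass
        for k in range(3)
    ]
    # Mass-weighted sum of squared distances to the center of mass
    weighted_sq = sum(
        m * sum((xyz[k] - center[k]) ** 2 for k in range(3))
        for m, xyz in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total_mass)

# Two equal masses 2 Å apart: each sits 1 Å from the center of mass
print(radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [12.0, 12.0]))  # 1.0

# Unequal masses shift the center of mass toward the heavier atom
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
masses = [1.0, 3.0]
print(radius_of_gyration(coords, masses))
```

In the second case the center of mass lies at x = 3, so the result is the square root of (1·9 + 3·1)/4 = 3.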

In [None]:
#@title Calculate 3D Descriptors
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from rdkit.Chem import rdMolDescriptors

from rdkit.Chem import rdFreeSASA

# Function to generate a 3D conformer and optimize it
def generate_3D_conformer(mol):
    if mol is None:
        return None
    # Add hydrogens
    mol = Chem.AddHs(mol)
    # Generate 3D coordinates; EmbedMolecule returns -1 if embedding fails
    if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
        return None
    # Minimize energy
    AllChem.MMFFOptimizeMolecule(mol, maxIters=500, nonBondedThresh=500.0)
    return mol

# Define functions to calculate 3D descriptors
# Each function expects a molecule that already has a 3D conformer

def molecular_volume(mol):
    # Grid-based estimate of the molecular volume (Å³)
    return AllChem.ComputeMolVolume(mol)

def surface_area(mol):
    # Solvent-accessible surface area (Å²) computed with FreeSASA
    radii = rdFreeSASA.classifyAtoms(mol)
    return rdFreeSASA.CalcSASA(mol, radii)

def radius_of_gyration(mol):
    # Radius of gyration of the 3D conformer (Å)
    return rdMolDescriptors.CalcRadiusOfGyration(mol)

# Function to calculate all desired 3D descriptors
def calculate_3D_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = generate_3D_conformer(mol)
    if mol is None:
        # 3D embedding failed, so skip this compound
        return None
    descriptors = {
        'MolecularVolume': molecular_volume(mol),
        'SurfaceArea': surface_area(mol),
        'RadiusOfGyration': radius_of_gyration(mol),
    }
    return descriptors

# Calculate 3D descriptors for each compound
descriptor_data = []
for index, row in data.iterrows():
    smiles = row['SMILES']
    activity = row['Activity']
    descriptors = calculate_3D_descriptors(smiles)
    if descriptors is not None:
        descriptors['SMILES'] = smiles
        descriptors['Activity'] = activity
        descriptor_data.append(descriptors)

# Convert the descriptor data to a DataFrame
descriptor_df = pd.DataFrame(descriptor_data)

# Save the results to a new CSV file
output_file_path = '/content/3D_descriptors.csv'  # Update this to your desired output path
descriptor_df.to_csv(output_file_path, index=False)

print(f"3D Descriptors calculated and saved to {output_file_path}")

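### **QED Score**

The **QED (Quantitative Estimate of Drug-likeness)** score is a numerical value that assesses a molecule's drug-likeness based on its structural features and physicochemical properties. It's a widely used metric in drug discovery and development.

The QED score ranges from 0 to 1, with higher values indicating better drug-likeness. It's calculated from a combination of eight properties: molecular weight, logP, number of hydrogen bond donors, number of hydrogen bond acceptors, polar surface area, number of rotatable bonds, number of aromatic rings, and number of structural alerts.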
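
One way to picture how individual property scores end up as a single value between 0 and 1: in the published QED formulation, as I understand it, each property is first mapped to a "desirability" between 0 and 1, and the desirabilities are then combined as a weighted geometric mean. The sketch below shows only that last combining step. The desirability values and weights are made up for illustration; RDKit's `QED.qed` performs the real calculation with its own fitted functions and weights.

```python
import math

def weighted_geometric_mean(desirabilities, weights):
    """Combine per-property desirabilities (each in (0, 1]) into one score."""
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / sum(weights))

# Made-up desirabilities for three properties, with equal weights
score_equal = weighted_geometric_mean([0.9, 0.5, 0.8], [1.0, 1.0, 1.0])

# Same desirabilities, but the weakest property is weighted more heavily
score_skewed = weighted_geometric_mean([0.9, 0.5, 0.8], [1.0, 3.0, 1.0])

print(round(score_equal, 3), round(score_skewed, 3))
```

Because it is a geometric mean, a single poor property pulls the overall score down more strongly than it would in a plain average, and weighting that property more heavily lowers the score further.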

In [None]:
#@title Calculate QED Score
import pandas as pd
from rdkit import Chem
from rdkit.Chem import QED

# Function to calculate QED score
def qed_score(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return QED.qed(mol)

# Calculate QED score for each compound
qed_data = []
for index, row in data.iterrows():
    smiles = row['SMILES']
    activity = row['Activity']
    qed_value = qed_score(smiles)
    if qed_value is not None:
        qed_data.append({'SMILES': smiles, 'Activity': activity, 'QED': qed_value})

# Convert the QED data to a DataFrame
qed_df = pd.DataFrame(qed_data)

# Save the results to a new CSV file
output_file_path = '/content/qed_scores.csv'  # Update this to your desired output path
qed_df.to_csv(output_file_path, index=False)

print(f"QED scores calculated and saved to {output_file_path}")


**Important**

We have generated four different CSV files, and now we need to merge them. All four files share the _SMILES_ and _Activity_ columns. We will merge them on the _SMILES_ column.
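
Because every file carries its own _Activity_ column, it is worth seeing what pandas does with a shared column that is not used as the merge key. The toy sketch below uses two tiny DataFrames with made-up values (not computed descriptors); the variable names are illustrative.

```python
import pandas as pd

# Toy tables with made-up values; both carry SMILES and Activity
qed = pd.DataFrame({"SMILES": ["CCO", "CCN"], "Activity": [5.1, 6.3], "QED": [0.41, 0.39]})
lip = pd.DataFrame({"SMILES": ["CCO", "CCN"], "Activity": [5.1, 6.3], "LogP": [-0.1, -0.2]})

# Merging on SMILES alone keeps both Activity columns, renamed with _x / _y suffixes
naive = pd.merge(qed, lip, on="SMILES", how="outer")
print(list(naive.columns))

# Dropping the repeated column before merging keeps a single Activity column
clean = pd.merge(qed, lip.drop(columns="Activity"), on="SMILES", how="outer")
print(list(clean.columns))
```

With four files, the suffixed columns from the first merge can collide with those from later merges, so keeping _Activity_ from only one file is the simpler route.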

In [None]:
#@title Merge all the CSVs into one master file
import pandas as pd

# Define file paths for the input CSV files
qed_file = '/content/qed_scores.csv'
three_d_file = '/content/3D_descriptors.csv'
lipinski_file = '/content/lipinski_descriptors.csv'
descriptors_file = '/content/topological_descriptors.csv'

# Read the CSV files into DataFrames
df_qed = pd.read_csv(qed_file)
df_3d = pd.read_csv(three_d_file)
df_lipinski = pd.read_csv(lipinski_file)
df_descriptors = pd.read_csv(descriptors_file)

# Merge DataFrames on the 'SMILES' column
# 'Activity' is present in every file, so keep it only from the first DataFrame;
# otherwise the repeated merges produce clashing Activity_x / Activity_y columns
# An 'outer' merge ensures all rows are included, even if they don't match in all files
merged_df = df_qed
for df in (df_3d, df_lipinski, df_descriptors):
    merged_df = pd.merge(merged_df, df.drop(columns='Activity'), on='SMILES', how='outer')

# Save the merged DataFrame to a new CSV file
output_file_path = '/content/merged_descriptors.csv'
merged_df.to_csv(output_file_path, index=False)

print(f"CSV files merged and saved to {output_file_path}")

**WOW**, you have prepared the data successfully and learned something new.

We will use this data in the next session for QSAR model preparation and testing.

Please subscribe to my YouTube channel [**Bioinformatics Insights**](https://www.youtube.com/@Bioinformaticsinsights).

Also, if you want to stay connected for updates, courses, and computational services, you can follow the WhatsApp channel [**BinfoLab**](https://whatsapp.com/channel/0029VajkwkdCHDydS6Y2lM36).