## Choice of descriptors

### Domain-Driven Descriptor Prioritization

| Descriptor | Rationale |
|---|---|
| `MolLogP` | Lipophilicity (critical for penetrating the mycobacterial cell wall) |
| `FractionCSP3` | Fraction of sp3 carbons; measures saturation (linked to metabolic stability) |
| `NHOHCount` | Count of N-H and O-H groups, i.e. hydrogen-bond donors (target engagement) |
| `NumAromaticRings` | Aromatic interactions with bacterial enzymes |
| `LabuteASA` | Approximate surface area (permeability predictor) |
| `HeavyAtomMolWt` | Molecular weight excluding hydrogens; preferred here over `MolWt` for drug-likeness (Rule of 5 compliance) |
| `HallKierAlpha` | Hall-Kier alpha term from the kappa shape indices, used here as a proxy for molecular flexibility (conformational adaptation to targets) |



In [5]:
# Original descriptor list:
descriptor_names = [
    "BalabanJ", "BertzCT", "Chi0", "Chi0n", "Chi0v", "Chi1", "Chi1n", "Chi1v", "Chi2n", "Chi2v", "Chi3n", "Chi3v", "Chi4n", "Chi4v",
    "EState_VSA1", "EState_VSA10", "EState_VSA11", "EState_VSA2", "EState_VSA3", "EState_VSA4", "EState_VSA5", "EState_VSA6", "EState_VSA7",
    "EState_VSA8", "EState_VSA9", "ExactMolWt", "FpDensityMorgan1", "FpDensityMorgan2", "FpDensityMorgan3", "FractionCSP3", "HallKierAlpha",
    "HeavyAtomCount", "HeavyAtomMolWt", "Ipc", "Kappa1", "Kappa2", "Kappa3", "LabuteASA", "MaxAbsEStateIndex", "MaxAbsPartialCharge",
    "MaxEStateIndex", "MaxPartialCharge", "MinAbsEStateIndex", "MinAbsPartialCharge", "MinEStateIndex", "MinPartialCharge", "MolLogP", "MolMR",
    "MolWt", "NHOHCount", "NOCount", "NumAliphaticCarbocycles", "NumAliphaticHeterocycles", "NumAliphaticRings", "NumAromaticCarbocycles",
    "NumAromaticHeterocycles",
]

### Key Changes and Rationale

#### Removed Redundant Descriptors
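- **Removed**: all `Chi` indices except `Chi0n` and `Chi4n` (that is, `Chi0`, `Chi0v`, `Chi1`, `Chi1n`, `Chi1v`, `Chi2n`, `Chi2v`, `Chi3n`, `Chi3v`, `Chi4v`)  
  **Reason**: Kept `Chi0n` and `Chi4n` for simplicity and interpretability. The remaining connectivity indices are often strongly correlated with these two and add complexity without adding much information.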

- **Removed**: `ExactMolWt` and `MolWt`  
  **Reason**: Redundant with `HeavyAtomMolWt`, which excludes hydrogens and is the weight measure preferred here for drug-likeness.

- **Removed**: `Kappa2` and `Kappa3`  
  **Reason**: Kept `Kappa1` for shape simplicity. Higher-order kappa indices (`Kappa2`, `Kappa3`) are less interpretable and often do not add significant predictive value.

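- **Removed**: all `EState_VSA` bins except `EState_VSA3` (`EState_VSA1`, `EState_VSA2`, etc.)  
  **Reason**: Kept `EState_VSA3` as a single representative. The `EState_VSA` descriptors (surface area binned by electrotopological state) are often correlated with one another and therefore redundant.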

---

#### Added Chemically Meaningful Descriptors
- **Added**: `NumHDonors` and `NumHAcceptors`  
  **Reason**: Explicit counts of hydrogen-bond donors and acceptors are more interpretable than `NHOHCount` (which the refined list nevertheless still includes). These counts are critical for target engagement and solubility.

- **Added**: `TPSA` (Topological Polar Surface Area)  
  **Reason**: Critical for predicting permeability and absorption. It is a well-established descriptor in drug discovery.

- **Added**: `NumRotatableBonds`  
  **Reason**: Measures molecular flexibility, which is important for binding entropy and conformational adaptation to targets.

- **Added**: `NumHeteroatoms`  
  **Reason**: Heteroatom count influences interactions with bacterial targets and overall molecular diversity.

---

#### Retained Key Descriptors
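- **Retained**: `MolLogP`, `FractionCSP3`, and `LabuteASA`. `NumAromaticRings` is listed here as well, although the original list contains only `NumAromaticCarbocycles` and `NumAromaticHeterocycles`, so it effectively replaces those two.  
  **Reason**: These descriptors are strongly relevant to drug-likeness and anti-TB activity.  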
  - `MolLogP`: Lipophilicity (critical for membrane penetration).  
  - `FractionCSP3`: Measures saturation (linked to metabolic stability).  
  - `LabuteASA`: Approximate surface area (permeability predictor).  
  - `NumAromaticRings`: Aromatic interactions with bacterial enzymes.

---

#### Simplified Connectivity Indices
- **Kept**: `Chi0n` and `Chi4n`  
  **Reason**: Representative connectivity indices that capture molecular branching and complexity. Removed others (`Chi1v`, `Chi2v`, etc.) to reduce multicollinearity.
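
The multicollinearity argument can be checked empirically rather than assumed. The sketch below is one possible approach, not the method used to build the list above: a greedy filter that keeps a descriptor only if its absolute Pearson correlation with every already-kept descriptor stays at or below a threshold. The function names, the 0.95 threshold, and the toy values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| <= threshold against every column kept so far.

    `columns` maps descriptor name -> list of values (one per molecule).
    Earlier keys take priority, so list the preferred descriptors first.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values for five molecules, for illustration only.
toy = {
    "Chi0n": [4.1, 6.3, 8.0, 9.7, 12.2],
    "Chi0v": [4.2, 6.5, 8.1, 9.9, 12.5],  # nearly proportional to Chi0n
    "MolLogP": [1.2, 3.4, 0.5, 2.8, 1.9],
}
kept = drop_correlated(toy)
print(kept)  # ['Chi0n', 'MolLogP']
```

Because the filter is order-dependent, placing `Chi0n` and `Chi4n` first would reproduce the preference expressed above whenever the other `Chi` indices exceed the threshold.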

---

#### Removed Low-Impact Descriptors
- **Removed**: `MaxAbsEStateIndex`, `MinAbsEStateIndex`, and other EState indices  
  **Reason**: These are less interpretable and often redundant with other descriptors.

- **Removed**: `FpDensityMorgan1`, `FpDensityMorgan2`, and `FpDensityMorgan3`  
  **Reason**: These are derived from Morgan fingerprints rather than from a physicochemical property, so they are better suited to similarity-based methods than to descriptor-based models.

---

### Summary of Changes
- **Reduced redundancy**: Removed highly correlated or less interpretable descriptors.
- **Added chemically meaningful descriptors**: Focused on properties directly relevant to drug-likeness and anti-TB activity.
- **Simplified connectivity indices**: Kept only the most representative indices.
- **Removed low-impact descriptors**: Eliminated descriptors judged unlikely to contribute significantly to predictive power.

This refined list balances interpretability, chemical relevance, and computational efficiency, which makes it well suited to the anti-tuberculosis activity prediction model.
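
The changes described above can be audited mechanically by comparing the two lists as sets. The following is a minimal sketch; `diff_descriptors` is a hypothetical helper, and the two lists are short excerpts chosen for illustration rather than the full lists from the notebook.

```python
def diff_descriptors(old, new):
    """Return (removed, added, retained) as alphabetically sorted lists."""
    old_set, new_set = set(old), set(new)
    return (
        sorted(old_set - new_set),  # in the old list only
        sorted(new_set - old_set),  # in the new list only
        sorted(old_set & new_set),  # in both lists
    )


# Excerpts of the original and refined lists, for illustration only.
old_excerpt = ["Chi0n", "Chi0v", "ExactMolWt", "HeavyAtomMolWt",
               "Kappa1", "Kappa2", "NHOHCount"]
new_excerpt = ["Chi0n", "HeavyAtomMolWt", "Kappa1", "NHOHCount",
               "NumHDonors", "TPSA"]

removed, added, retained = diff_descriptors(old_excerpt, new_excerpt)
print("removed: ", removed)   # ['Chi0v', 'ExactMolWt', 'Kappa2']
print("added:   ", added)     # ['NumHDonors', 'TPSA']
print("retained:", retained)  # ['Chi0n', 'HeavyAtomMolWt', 'Kappa1', 'NHOHCount']
```

Running the same helper on the full lists is a quick way to confirm that every name described as removed, added, or retained really falls into that category.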


In [None]:
# New list of descriptors
descriptor_names = [
    "BertzCT",               # Molecular complexity (topological index)
    "Chi0n",                 # Order-0 connectivity index (kept instead of Chi0v)
    "Chi4n",                 # Higher-order connectivity index (captures branching)
    "EState_VSA3",           # Surface area in one electrotopological-state bin
    "FractionCSP3",          # Fraction of sp3 hybridized carbons (saturation)
    "HallKierAlpha",         # Hall-Kier alpha (kappa shape term; flexibility proxy)
    "HeavyAtomMolWt",        # Molecular weight excluding hydrogens (better than MolWt)
    "LabuteASA",             # Approximate surface area (permeability predictor)
    "MolLogP",               # Lipophilicity (critical for membrane penetration)
    "NHOHCount",             # Number of NH or OH groups (hydrogen bond donors)
    "NumAromaticRings",      # Aromatic ring count (important for target binding)
    "NumAliphaticRings",     # Aliphatic ring count (rigidity and 3D shape)
    "NumHDonors",            # Explicit hydrogen bond donors (target engagement)
    "NumHAcceptors",         # Explicit hydrogen bond acceptors (solubility)
    "TPSA",                  # Topological polar surface area (permeability)
    "Kappa1",                # Shape index (simpler than Kappa2/Kappa3)
    "Ipc",                   # Information content (molecular complexity)
    "MolMR",                 # Molar refractivity (polarizability)
    "NumRotatableBonds",     # Molecular flexibility (conformational freedom)
    "NumHeteroatoms"         # Heteroatom count (diversity of interactions)
]

In [None]:
# Further reduced list: Chemprop's graph-based model may already learn some of
# these properties implicitly from the molecular graph
descriptor_names = [
    "MolLogP",               # Lipophilicity (critical for membrane penetration)
    "TPSA",                  # Topological polar surface area (permeability)
    "FractionCSP3",          # Fraction of sp3 hybridized carbons (saturation)
    "NumHDonors",            # Hydrogen bond donors (target engagement)
    "NumHAcceptors",         # Hydrogen bond acceptors (solubility)
    "NumAromaticRings",      # Aromatic ring count (important for target binding)
    "NumRotatableBonds",     # Molecular flexibility (conformational freedom)
    "HeavyAtomMolWt",        # Molecular weight excluding hydrogens (drug-likeness)
    "LabuteASA",             # Approximate surface area (permeability predictor)
    "HallKierAlpha",         # Hall-Kier alpha (kappa shape term; flexibility proxy)
]

In [None]:
final_descriptor_names = [
    # Global Physicochemical Properties
    "MolLogP", "TPSA", "FractionCSP3", "HeavyAtomMolWt", "LabuteASA", "HallKierAlpha",

    # Structural and Topological Descriptors
    "BertzCT", "Chi0n", "Chi4n", "NumAromaticRings", "NumAliphaticRings", 
    "NumRotatableBonds", "NumHeteroatoms",

    # Hydrogen Bonding and Polarity
    "NumHDonors", "NumHAcceptors", "EState_VSA3",

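    # 3D Descriptors (need a 3D conformer, so they cannot be computed from
    # the 2D graph alone)
    "RadiusOfGyration", "PMI1", "PMI2", "PMI3", "InertialShapeFactor", 
    "Asphericity", "SpherocityIndex", "PlaneOfBestFit",

    # Target-Specific Descriptors (placeholder names, not standard RDKit
    # descriptors; they need custom code or an external calculation)
    "NumHalogens", "NumElectronWithdrawingGroups", "NumElectronDonatingGroups", 
    "Redox_Potential",

    # Toxicity and Bioavailability (placeholder names; apart from the Rule of 5
    # count, these require separate predictive models or experimental data)
    "RuleOf5Violations", "CYP450_Inhibition", "hERG_Inhibition", "Ames_Mutagenicity"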
]
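
Some of the names in this list, such as `RuleOf5Violations`, are not ready-made descriptors and have to be derived from others. As a minimal sketch, assuming the standard Lipinski thresholds and property values that have already been computed elsewhere, the violation count could be obtained as follows. The function name and the example values are hypothetical.

```python
def rule_of_five_violations(mol_wt, logp, h_donors, h_acceptors):
    """Count Lipinski Rule of 5 violations from precomputed property values."""
    checks = [
        mol_wt > 500,      # molecular weight above 500 Da
        logp > 5,          # calculated logP above 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical property values, for illustration only.
small = rule_of_five_violations(mol_wt=350.4, logp=2.1, h_donors=2, h_acceptors=5)
large = rule_of_five_violations(mol_wt=822.9, logp=4.0, h_donors=6, h_acceptors=15)
print(small, large)  # 0 3
```

Note that the rule is defined on the full molecular weight, so if `HeavyAtomMolWt` is passed instead of `MolWt` the count may differ slightly for molecules near the 500 Da cutoff.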

# Molecular Descriptors to Complement GNNs

This document lists the key molecular descriptors that should be used alongside a **Graph Neural Network (GNN)** for predicting antimycobacterial activity. These descriptors capture global physicochemical, permeability, and electronic properties that are difficult for GNNs to infer directly.

## 1. Physicochemical Properties

### **LogD (pH 7.4)**
- **Definition:** Octanol/water distribution coefficient at pH 7.4.
- **Why it's important:** More relevant than LogP for estimating passive permeability across the mycobacterial cell wall.

### **Topological Polar Surface Area (TPSA)**
- **Definition:** Sum of tabulated surface-area contributions from polar atoms (mainly nitrogen and oxygen, together with their attached hydrogens).
- **Why it's important:** Reflects hydrogen bonding capacity, affecting both solubility and permeability.

### **Molecular Weight (MW)**
- **Definition:** Sum of atomic weights of all atoms in the molecule.
- **Why it's important:** Helps assess permeability, as large molecules struggle to diffuse through bacterial membranes.

### **Number of Rotatable Bonds**
- **Definition:** Count of single bonds not in a ring that allow free rotation.
- **Why it's important:** Measures molecular flexibility, which can influence binding affinity and permeability.

## 2. Electronic Properties

### **Dipole Moment**
- **Definition:** Measure of charge separation in a molecule.
- **Why it's important:** Affects interactions with proteins and biological membranes.

### **HOMO-LUMO Gap**
- **Definition:** Energy difference between the Highest Occupied Molecular Orbital (HOMO) and Lowest Unoccupied Molecular Orbital (LUMO).
- **Why it's important:** Indicates reactivity and redox potential, which are key for certain antimycobacterial drugs.

### **Partial Atomic Charges (ESP/Gasteiger Charges)**
- **Definition:** Electrostatic potential (ESP) or Gasteiger charges assigned to individual atoms.
- **Why it's important:** Useful for modeling electrostatic interactions with biomolecular targets.

## 3. Permeability-Specific Features

### **Eccentric Connectivity Index (ECI)**
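- **Definition:** Graph-based descriptor of molecular shape complexity, usually computed as the sum over heavy atoms of each atom's degree multiplied by its eccentricity (its largest shortest-path distance to any other atom).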
- **Why it's important:** Affects diffusion and transport through bacterial membranes.
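
As an illustration, the index can be computed from a plain adjacency list with a breadth-first search. This sketch assumes the usual definition (sum of degree times eccentricity over the hydrogen-suppressed graph) and a connected molecule; the function name and the hand-written graphs are hypothetical and not tied to any toolkit.

```python
from collections import deque


def eccentric_connectivity_index(adjacency):
    """Sum over atoms of (degree x eccentricity) on a connected heavy-atom graph.

    `adjacency` maps each atom index to a list of bonded atom indices.
    """
    total = 0
    for start in adjacency:
        # Breadth-first search gives shortest path lengths (in bonds) from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        eccentricity = max(dist.values())
        total += len(adjacency[start]) * eccentricity
    return total


# Heavy-atom graphs written by hand.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cyclohexane = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

eci_butane = eccentric_connectivity_index(n_butane)
eci_cyclohexane = eccentric_connectivity_index(cyclohexane)
print(eci_butane, eci_cyclohexane)  # 14 36
```

For n-butane the terminal carbons contribute 1 x 3 each and the inner carbons 2 x 2 each, giving 14; in cyclohexane every carbon has degree 2 and eccentricity 3, giving 36.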

### **Fraction of sp3 Hybridized Carbons (Fsp3)**
- **Definition:** Ratio of sp3 hybridized carbons to total carbons.
- **Why it's important:** Higher Fsp3 correlates with increased solubility and metabolic stability.

### **Molecular Rigidity Score**
- **Definition:** Measure of overall molecular rigidity, i.e. the opposite of flexibility.
- **Why it's important:** Affects the ability to diffuse through bacterial membranes.

## 4. ADMET & Pharmacokinetics

### **CYP Inhibition Score**
- **Definition:** Predicts the likelihood of a compound inhibiting cytochrome P450 enzymes.
- **Why it's important:** Helps assess metabolic stability and potential drug-drug interactions.

### **Plasma Protein Binding (%PPB)**
- **Definition:** Percentage of drug molecules bound to plasma proteins.
- **Why it's important:** Determines free drug availability and pharmacokinetics.

## Summary
This reduced descriptor set complements the GNN by providing:
- **Physicochemical insights (LogD, MW, TPSA, rotatable bonds)**
- **Electronic properties (Dipole moment, HOMO-LUMO gap, atomic charges)**
- **Permeability predictors (ECI, Fsp3, rigidity score)**
- **ADMET properties (CYP inhibition, PPB)**

This balance ensures the model captures both **structural and functional** properties of antimycobacterial compounds.



In [None]:
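# Note: most of these are placeholder names rather than RDKit descriptor names.
# The electronic properties typically require a quantum-chemistry calculation,
# and "PartialAtomicCharges" is a per-atom quantity that must be aggregated
# before it can serve as a molecule-level feature.
descriptor_names = [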
    # Physicochemical Properties
    "LogD_pH7.4", "TPSA", "MolecularWeight", "NumRotatableBonds",
    
    # Electronic Properties
    "DipoleMoment", "HOMO_LUMO_Gap", "PartialAtomicCharges",
    
    # Permeability-Specific Features
    "EccentricConnectivityIndex", "Fraction_sp3_Carbons", "MolecularRigidityScore",
    
    # ADMET & Pharmacokinetics
    "CYP_Inhibition", "PlasmaProteinBinding"
]
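
Because this list mixes names a cheminformatics toolkit can compute with names that need external tools, it can help to split it before featurization. The sketch below is a hypothetical helper that partitions requested names against whatever set of names is actually available; `toolkit_names` is a small stand-in, not a real toolkit's descriptor list.

```python
def partition_by_availability(requested, available):
    """Split names into (found, missing), preserving the requested order."""
    available = set(available)
    found = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return found, missing


# Hypothetical stand-in for the names a toolkit exposes.
toolkit_names = {"TPSA", "NumRotatableBonds", "MolWt", "FractionCSP3", "MolLogP"}

requested = ["LogD_pH7.4", "TPSA", "MolecularWeight",
             "NumRotatableBonds", "DipoleMoment"]
found, missing = partition_by_availability(requested, toolkit_names)
print(found)    # ['TPSA', 'NumRotatableBonds']
print(missing)  # ['LogD_pH7.4', 'MolecularWeight', 'DipoleMoment']
```

With RDKit installed, the available set could plausibly be built from the names in `rdkit.Chem.Descriptors._descList`; anything reported as missing then needs either a renamed equivalent (for example `MolWt` for `MolecularWeight`) or a separate calculation.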