# Molecular Descriptor Generation  

## Overview  

This notebook generates molecular descriptors and structural fingerprints for the curated bioactivity dataset. These numerical features form the input space for subsequent QSAR model development.

Descriptors and fingerprints are calculated with the PaDEL-Descriptor engine through the `padelpy` Python interface. The resulting feature matrix integrates both physicochemical descriptors and structural fingerprint representations.

Each step runs as an explicit notebook cell with fixed file paths, so the workflow can be rerun from start to finish.

## Acknowledgement  

This workflow adapts the open-source PaDEL integration script by *Data Professor* (https://github.com/dataprofessor/padel), used under the MIT License with minor modifications for dataset compatibility and pipeline integration.

**Step 1: Install and Configure PaDEL-Descriptor**

The `padelpy` interface is used to compute molecular descriptors and structural fingerprints via the PaDEL-Descriptor engine.  

Fingerprint definitions are provided as XML configuration files consistent with the referenced open-source workflow.

In [None]:
# Install padelpy (required for PaDEL-Descriptor integration)
!pip install padelpy

**Step 2: Prepare Fingerprint Configuration Files**

PaDEL fingerprint definitions are provided through XML configuration files.  
These files specify the fingerprint families that will be computed in this workflow.

The XML configuration set is obtained from the referenced open-source implementation to maintain consistency with the adapted methodology.

In [None]:
# Download fingerprint XML configuration files
!wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
!unzip -o fingerprints_xml.zip

**Step 3: Identify and Sort Fingerprint XML Files**

The downloaded XML files are detected and sorted to ensure consistent mapping between fingerprint definitions and configuration files.

In [None]:
# Identify and sort fingerprint XML configuration files
import glob

xml_files = sorted(glob.glob("*.xml"))

print(f"Total XML files detected: {len(xml_files)}")
xml_files

**Step 4: Define Fingerprint Identifier List**

A list of fingerprint identifiers is defined to correspond to the downloaded XML configuration files.  
These identifiers will be mapped to their respective XML files for controlled descriptor generation.

In [None]:
# Define fingerprint identifiers corresponding to XML files
FP_list = [
    'AtomPairs2DCount',
    'AtomPairs2D',
    'EState',
    'CDKextended',
    'CDK',
    'CDKgraphonly',
    'KlekotaRothCount',
    'KlekotaRoth',
    'MACCS',
    'PubChem',
    'SubstructureCount',
    'Substructure'
]

print(f"Total fingerprint types defined: {len(FP_list)}")

**Step 5: Map Fingerprint Identifiers to XML Configuration Files**

A dictionary is created to map each fingerprint identifier to its corresponding XML configuration file.  
This enables controlled selection of fingerprint types during descriptor calculation.

In [None]:
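# Validate matching counts; zip() would silently drop unmatched entries
if len(FP_list) != len(xml_files):
    raise ValueError(
        f"Mismatch: {len(FP_list)} fingerprint names vs {len(xml_files)} XML files."
    )

# Create mapping dictionary (pairs names and files by position,
# so FP_list must follow the sorted order of the XML file names)
fp = dict(zip(FP_list, xml_files))

print("Fingerprint mapping created successfully.")
fp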
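
Because the mapping pairs names and files purely by position, it can help to confirm which fingerprint each XML file actually switches on. The following is a minimal sketch, assuming the configuration files consist of `Descriptor` elements carrying `name` and `value` attributes (an assumption about the file layout that is not verified in this notebook). `enabled_descriptors` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def enabled_descriptors(xml_path):
    """Return the names of Descriptor elements whose value is "true".

    Assumes entries shaped like <Descriptor name="..." value="true"/>;
    adjust the tag and attribute names if the files differ.
    """
    root = ET.parse(xml_path).getroot()
    return [
        elem.get("name")
        for elem in root.iter("Descriptor")
        if (elem.get("value") or "").lower() == "true"
    ]
```

Evaluating `{name: enabled_descriptors(path) for name, path in fp.items()}` would then list the enabled entries next to each identifier, which makes a mismatched pairing easy to spot.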

**Step 6: Load Dataset for Descriptor Calculation**

The curated dataset generated in the previous stage of the workflow is loaded from the structured project directory.  
Using a predefined file path ensures consistent and reproducible execution across environments.

In [None]:
from pathlib import Path
import pandas as pd

# Define path to curated dataset
DATA_PATH = Path("../data/processed/bioactivity_dataset_curated.csv")

# Validate file existence (an explicit exception also fires when asserts are disabled)
if not DATA_PATH.exists():
    raise FileNotFoundError(f"Dataset not found at: {DATA_PATH}")

# Load dataset
df = pd.read_csv(DATA_PATH)

print("Dataset loaded successfully.")
print(f"Shape: {df.shape}")

df.head()

**Step 7: Prepare SMILES Input File for PaDEL**

A subset of the dataset containing SMILES and compound identifiers is prepared and exported in `.smi` format.  
This file serves as the input for PaDEL-Descriptor calculation.

In [None]:
# Validate required columns
required_columns = ['SMILES', 'ID']
missing_cols = [col for col in required_columns if col not in df.columns]

if missing_cols:
    raise ValueError(f"Missing required columns: {missing_cols}")

# Prepare SMILES input for PaDEL
df2 = df[['SMILES', 'ID']].copy()

# Export as .smi file (tab-separated, no header)
df2.to_csv('molecule.smi', sep='\t', index=False, header=False)

print(f"SMILES input file created successfully. Shape: {df2.shape}")
df2.head()
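
A quick read-back of the exported file can catch rows that would confuse PaDEL, such as empty SMILES strings or repeated identifiers. This is a sketch only: `check_smi_file` is a hypothetical helper, and it assumes the two-column, tab-separated layout written in this step.

```python
import csv
from collections import Counter


def check_smi_file(path):
    """Validate a tab-separated .smi file with SMILES and ID columns.

    Returns the number of rows; raises ValueError on malformed rows,
    empty fields, or duplicated identifiers.
    """
    ids = Counter()
    n_rows = 0
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) != 2 or not row[0].strip() or not row[1].strip():
                raise ValueError(f"Malformed row at line {line_no}: {row}")
            ids[row[1]] += 1
            n_rows += 1
    duplicates = sorted(name for name, count in ids.items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate identifiers: {duplicates}")
    return n_rows
```

Calling `check_smi_file('molecule.smi')` right after the export should return the same row count as `df2.shape[0]`.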

**Step 8: Calculate Molecular Fingerprints Using PaDEL**

One fingerprint type is selected from the predefined fingerprint dictionary and passed to the PaDEL-Descriptor engine.  

The selected fingerprint configuration determines which structural representation will be generated.  
To compute a different fingerprint type, adjust the `fingerprint` variable to one of the keys defined in the `fp` dictionary.

In [None]:
from padelpy import padeldescriptor

# Select fingerprint type (must match a key in fp dictionary)
fingerprint = 'Substructure'  # Change as needed

if fingerprint not in fp:
    raise ValueError(f"Fingerprint '{fingerprint}' not found in fingerprint dictionary.")

fingerprint_output_file = f"{fingerprint}.csv"
fingerprint_descriptortypes = fp[fingerprint]

print(f"Selected fingerprint type: {fingerprint}")
print(f"Output file: {fingerprint_output_file}")

padeldescriptor(
    mol_dir='molecule.smi',
    d_file=fingerprint_output_file,
    descriptortypes=fingerprint_descriptortypes,
    detectaromaticity=True,
    standardizenitro=True,
    standardizetautomers=True,
    threads=2,
    removesalt=True,
    log=True,
    fingerprints=True
)
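
To generate every fingerprint family in one pass rather than editing `fingerprint` by hand, the call can be wrapped in a loop over the mapping. The sketch below takes the descriptor function as an argument so that it does not depend on any particular installation. `run_all_fingerprints` is a hypothetical name, and the keyword arguments simply mirror those used in this step.

```python
def run_all_fingerprints(fp_map, smiles_file, compute, **options):
    """Run `compute` once per fingerprint type and return the output files.

    fp_map      : dict mapping fingerprint name -> XML configuration file
    smiles_file : path to the .smi input
    compute     : callable accepting the same keywords as padeldescriptor
    options     : extra keyword arguments forwarded unchanged to `compute`
    """
    outputs = {}
    for name, xml_file in fp_map.items():
        out_file = f"{name}.csv"
        compute(
            mol_dir=smiles_file,
            d_file=out_file,
            descriptortypes=xml_file,
            fingerprints=True,
            **options,
        )
        outputs[name] = out_file
    return outputs
```

In this notebook the call might look like `run_all_fingerprints(fp, 'molecule.smi', padeldescriptor, removesalt=True, threads=2)`, which returns a dictionary of fingerprint names and the CSV files written for them.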

**Step 9: Load and Standardise Calculated Fingerprints**

The generated fingerprint file is loaded into a DataFrame.  
The identifier column is renamed to ensure consistency with the main dataset structure.

In [None]:
# Load generated fingerprint file
descriptors = pd.read_csv(fingerprint_output_file)

print(f"Fingerprint file loaded successfully. Shape: {descriptors.shape}")

# Ensure identifier consistency
if 'Name' in descriptors.columns:
    descriptors = descriptors.rename(columns={'Name': 'ID'})
    print("Column 'Name' renamed to 'ID' for consistency.")
else:
    print("Column 'Name' not found; no renaming applied.")

# Save updated file
descriptors.to_csv(fingerprint_output_file, index=False)

descriptors.head()
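
PaDEL may skip or reorder compounds, for example when a structure fails to parse or when several threads are used, so it is worth comparing the identifiers in the output with those in the input before merging the two tables. A minimal sketch using only the standard library follows. `missing_ids` is a hypothetical helper, and the `id_column` default assumes the identifier column has already been renamed to `ID`.

```python
import csv


def missing_ids(smi_path, fingerprint_csv, id_column="ID"):
    """Return input identifiers that are absent from the fingerprint file."""
    # Second column of the .smi file holds the compound identifier
    with open(smi_path, newline="") as handle:
        input_ids = [
            row[1] for row in csv.reader(handle, delimiter="\t") if len(row) >= 2
        ]
    # Collect every identifier present in the fingerprint output
    with open(fingerprint_csv, newline="") as handle:
        output_ids = {row[id_column] for row in csv.DictReader(handle)}
    return [name for name in input_ids if name not in output_ids]
```

An empty list from `missing_ids('molecule.smi', fingerprint_output_file)` indicates that every input compound received a fingerprint row. Joining on `ID` rather than relying on row position is the safer choice when the order is not guaranteed.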

**Step 10: Download Generated Fingerprint File**

When the notebook runs in Google Colab, the generated fingerprint CSV file is downloaded for local storage and subsequent modelling steps.

In [None]:
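# Download generated fingerprint file (Colab environment only)
try:
    from google.colab import files
    print(f"Preparing download for: {fingerprint_output_file}")
    files.download(fingerprint_output_file)
except ImportError:
    print(f"Not running in Colab; file saved locally as: {fingerprint_output_file}")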