## 💊 Drug Detection using Machine Learning

An IITG Summer Internship Project

## Data Preparation

### Build NEGATIVE dataset with MolD2 (777) + ECFP4 (2048) + MACCS (166)

In [None]:
"""
Build NEGATIVE dataset with
MolD2 (777) + ECFP4 (2048) + MACCS (166)
"""

import random

import numpy as np
import pandas as pd
from tqdm import tqdm
from rdkit import Chem
from rdkit.Chem import MACCSkeys
import rdkit.DataStructs as ds
from Mold2_pywrapper import Mold2

# ── CONFIG ────────────────────────────────────────────────────────────
dude_path   = "Data/negatives/dud-e/dude_decoys.csv"
gdb13_path  = "Data/negatives/gdb13/gdb13_simple_non_drugs.csv"
gdb17_path  = "Data/negatives/gdb17/GDB17.50000000LL.noSR.smi"
tox21_path  = "Data/negatives/tox21/tox21_stress_response_toxics.csv"
zinc_path   = "Data/negatives/zinc20/for-sale.csv"

output_path = "Dataset/negatives/dataset.csv"
mold2_zip   = "Tools/Mold2-Executable-File.zip"

sample_dude_n  = 15_000
sample_gdb17_n = 15_000
ecfp_bits      = 2048
maccs_bits     = 166
# ───────────────────────────────────────────────────────────────────────


def read_smiles_csv(path):
    return pd.read_csv(path)["smiles"].dropna().astype(str).tolist()


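def read_smiles_smi(path, max_lines=None):
    """Return the first token (the SMILES) of each line, reading at most max_lines lines."""
    smiles = []
    with open(path) as fh:
        for i, line in enumerate(fh):
            # Stops after the first max_lines lines: a head of the file, not a random sample
            if max_lines is not None and i >= max_lines:
                break
            tok = line.split()
            if tok:
                smiles.append(tok[0])
    return smiles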


def fp_to_array(fp, n_bits):
    arr = np.zeros((n_bits,), dtype=int)
    ds.ConvertToNumpyArray(fp, arr)
    return arr


# ------------ 1.  LOAD & MERGE ---------------------------------------
print("📥 Reading / sampling source datasets …")
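random.seed(42)  # arbitrary fixed seed so the DUD-E sample is reproducible
dude_all = read_smiles_csv(dude_path)
dude  = random.sample(dude_all, min(sample_dude_n, len(dude_all)))  # never ask for more than exist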
gdb13 = read_smiles_csv(gdb13_path)
gdb17 = read_smiles_smi(gdb17_path, max_lines=sample_gdb17_n)
tox21 = read_smiles_csv(tox21_path)
zinc  = read_smiles_csv(zinc_path)

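# De-duplicate while preserving first-seen order (a set would give a run-dependent order)
all_smiles = list(dict.fromkeys([*dude, *gdb13, *gdb17, *tox21, *zinc]))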
print(f"🔄 Unique SMILES collected: {len(all_smiles):,}")

# ------------ 2.  RDKit Mol objects ----------------------------------
print("🔬 Converting SMILES → mols …")
mols, smiles_valid = [], []
for s in tqdm(all_smiles, desc="MolFromSmiles"):
    m = Chem.MolFromSmiles(s)
    if m:
        mols.append(m)
        smiles_valid.append(s)
print(f"✅ Valid mols: {len(mols):,}")

# ------------ 3.  MolD2 descriptors ----------------------------------
print("🧪 Calculating MolD2 …")
mold2 = Mold2.from_executable(mold2_zip)
mold2_df = pd.DataFrame(mold2.calculate(mols))

# ------------ 4.  ECFP4 via new MorganGenerator ----------------------
print("🧬 Calculating ECFP4 (2048 bits) …")
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

morgan_gen = GetMorganGenerator(
    radius=2,
    countSimulation=False,
    includeChirality=False,
    useBondTypes=True,
    onlyNonzeroInvariants=False,
    includeRingMembership=True,
    fpSize=ecfp_bits
)

ecfp_mat = [
    fp_to_array(morgan_gen.GetFingerprint(m), ecfp_bits)
    for m in tqdm(mols, desc="ECFP4")
]
ecfp_df = pd.DataFrame(ecfp_mat,
                       columns=[f"ECFP4_{i}" for i in range(ecfp_bits)])

# ------------ 5.  MACCS keys -----------------------------------------
print("🧬 Calculating MACCS (166 bits) …")
maccs_mat = [
    # RDKit returns 167 bits; bit 0 is unused, so keep keys 1-166
    fp_to_array(MACCSkeys.GenMACCSKeys(m), maccs_bits + 1)[1:]
    for m in tqdm(mols, desc="MACCS")
]
maccs_df = pd.DataFrame(maccs_mat, columns=[f"MACCS_{i}" for i in range(maccs_bits)])

# ------------ 6.  MERGE & SAVE ---------------------------------------
print("📦 Combining descriptors …")
final_df = pd.concat(
    [pd.Series(smiles_valid, name="smiles"), mold2_df, ecfp_df, maccs_df],
    axis=1
)
final_df.drop_duplicates(inplace=True)
final_df.replace([np.inf, -np.inf], np.nan, inplace=True)
final_df.dropna(inplace=True)

final_df.to_csv(output_path, index=False)
print(f"✅ Saved {len(final_df):,} negatives → {output_path}")



📥 Reading / sampling source datasets …
🔄 Unique SMILES collected: 56,252
🔬 Converting SMILES → mols …


[00:38:27] Explicit valence for atom # 2 Cl, 1, is greater than permitted
MolFromSmiles: 100%|██████████| 56252/56252 [00:04<00:00, 12704.56it/s]


✅ Valid mols: 56,251
🧪 Calculating MolD2 …
Mold2 calculates a large and diverse set of molecular descriptors encoding two-
dimensional chemical structure information. Comparative analysis of Mold2 descriptors
with those calculated from commercial software on several published datasets
demonstrated that Mold2 descriptors convey sufficient structural information. In addition,
better models were generated using Mold2 descriptors than the compared commercial
software packages. This publicly available software is developed by the Center for
Bioinformatics, which is led by Dr. Weida Tong, at the National Center for Toxicological
Research (NCTR).
    
Mold2 is a product designed and produced by the National Center for Toxicological
Research (NCTR).  FDA and NCTR retain ownership of this product.

Please address any questions or suggestions to Dr. Huixiao Hong, National Center for Toxicological
Research, at 870-543-7296 or Huixiao.Hong@fda.hhs.gov.

###################################

Should 

ECFP4: 100%|██████████| 56251/56251 [00:03<00:00, 16316.36it/s]


🧬 Calculating MACCS (166 bits) …


MACCS: 100%|██████████| 56251/56251 [00:36<00:00, 1552.45it/s]


📦 Combining descriptors …
✅ Saved 56,251 negatives → Dataset/negatives/dataset.csv
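
The cell above draws the DUD-E subset with `random.sample` and takes the first `sample_gdb17_n` lines of the GDB17 file. The following is a minimal sketch, not the notebook's actual method, of how both draws could be made reproducible and, for the large `.smi` file, uniform over the whole file without loading it into memory. The helper names `sample_up_to` and `reservoir_sample` and the seed value are hypothetical choices for illustration.

```python
import random
from typing import Iterable, List


def sample_up_to(items: List[str], n: int, seed: int = 42) -> List[str]:
    """Draw at most n items without replacement using a private, seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(items, min(n, len(items)))


def reservoir_sample(lines: Iterable[str], n: int, seed: int = 42) -> List[str]:
    """Uniformly sample n SMILES (first token per line) from a stream.

    Uses reservoir sampling (Algorithm R), so the stream is read once and
    only n entries are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0  # number of non-empty lines processed so far
    for line in lines:
        tok = line.split()
        if not tok:
            continue
        seen += 1
        if len(reservoir) < n:
            reservoir.append(tok[0])
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < n:
                reservoir[j] = tok[0]
    return reservoir
```

With a real file, `reservoir_sample(open(gdb17_path), sample_gdb17_n)` would be the intended call shape; a full pass over a 50-million-line file takes noticeably longer than reading only its head.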


### Build POSITIVE dataset with MolD2 (777) + ECFP4 (2048) + MACCS (166)

In [None]:
"""
Build POSITIVE dataset with
MolD2 (777) + ECFP4 (2048) + MACCS (166)
"""

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import MACCSkeys
import rdkit.DataStructs as ds
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
from Mold2_pywrapper import Mold2

# ---------- paths ----------
input_file      = "Data/positives/zinc20/world.csv"
output_path     = "Dataset/positives/dataset.csv"
mold2_zip       = "Tools/Mold2-Executable-File.zip"

ecfp_bits  = 2048
maccs_bits = 166

# ---------- helper ----------
def fp_to_array(fp, n_bits):
    arr = np.zeros((n_bits,), dtype=int)
    ds.ConvertToNumpyArray(fp, arr)
    return arr

# ---------- load SMILES ----------
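df = pd.read_csv(input_file)
# Drop multi-fragment entries (salts, mixtures): "." separates disconnected fragments in SMILES.
# na=True marks missing SMILES as matches so they are dropped too.
df = df[~df["smiles"].str.contains(".", regex=False, na=True)].reset_index(drop=True)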

# RDKit mols (filter out invalid)
mols, smiles_valid = [], []
for s in df["smiles"]:
    m = Chem.MolFromSmiles(s)
    if m:
        mols.append(m)
        smiles_valid.append(s)

# ---------- MolD2 ----------
mold2 = Mold2.from_executable(mold2_zip)
mold2_df = pd.DataFrame(mold2.calculate(mols))

# ---------- ECFP4 (Morgan r=2, 2048 bits) ----------
print("🧬 Calculating ECFP4 …")
morgan_gen = GetMorganGenerator(
    radius=2,
    countSimulation=False,
    includeChirality=False,
    useBondTypes=True,
    onlyNonzeroInvariants=False,
    includeRingMembership=True,
    fpSize=ecfp_bits
)
ecfp_mat = [fp_to_array(morgan_gen.GetFingerprint(m), ecfp_bits) for m in mols]
ecfp_df  = pd.DataFrame(ecfp_mat, columns=[f"ECFP4_{i}" for i in range(ecfp_bits)])

# ---------- MACCS (slice off dummy bit 0) ----------
print("🧬 Calculating MACCS …")
maccs_mat = [
    fp_to_array(MACCSkeys.GenMACCSKeys(m), maccs_bits + 1)[1:]   # keep bits 1-166
    for m in mols
]
maccs_df = pd.DataFrame(maccs_mat, columns=[f"MACCS_{i}" for i in range(maccs_bits)])

# ---------- combine & clean ----------
combined_df = pd.concat(
    [pd.Series(smiles_valid, name="smiles"), mold2_df, ecfp_df, maccs_df],
    axis=1
).drop_duplicates()

combined_df.replace([np.inf, -np.inf], np.nan, inplace=True)
combined_df.dropna(inplace=True)

# ---------- save ----------
combined_df.to_csv(output_path, index=False)
print(f"✅ Saved {len(combined_df)} molecules with MolD2 + ECFP4 + MACCS → {output_path}")
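
The positive set is filtered on the `.` character before parsing, because `.` separates disconnected fragments in a SMILES string (for example a salt and its counter-ion). As a rough illustration of what that filter keeps and drops, here is a pure-Python sketch; the function name `keep_single_fragment` and the example strings are hypothetical and not taken from the ZINC file.

```python
from typing import Iterable, List


def keep_single_fragment(smiles_list: Iterable[object]) -> List[str]:
    """Keep non-empty string entries that contain no '.' fragment separator."""
    return [s for s in smiles_list if isinstance(s, str) and s and "." not in s]


examples = ["CCO", "[Na+].[Cl-]", "CC(=O)O.N", None, "c1ccccc1"]
print(keep_single_fragment(examples))  # ['CCO', 'c1ccccc1']
```

Dropping these rows is the simplest option. Keeping only the largest fragment of each entry would be an alternative, at the cost of an extra RDKit step.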


### Final Dataset

In [3]:
import pandas as pd

# ---------- Paths ----------
negative_path = "Dataset/negatives/dataset.csv"
positive_path = "Dataset/positives/dataset.csv"
combined_path = "Dataset/final/dataset.csv"

# ---------- Load ----------
pos_df = pd.read_csv(positive_path)
pos_df["Is Drug"] = 1

neg_df = pd.read_csv(negative_path)
neg_df["Is Drug"] = 0

# ---------- Combine and de-duplicate (keep drugs on conflict) ----------
combined_df = pd.concat([pos_df, neg_df], ignore_index=True)

# Sort so drugs come before non-drugs; with keep="first" the drug row wins on conflict
combined_df.sort_values(by="Is Drug", ascending=False, inplace=True)

# Drop rows that match on every column except the label (SMILES plus all descriptors)
feature_cols = combined_df.columns.difference(["Is Drug"]).tolist()
combined_df = combined_df.drop_duplicates(subset=feature_cols, keep="first")

# Shuffle for training
combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

# ---------- Save ----------
combined_df.to_csv(combined_path, index=False)
print(f"✅ Saved {len(combined_df)} to {combined_path}")


✅ Saved 62134 to Dataset/final/dataset.csv
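
To make the "keep drugs on conflict" rule concrete, the sketch below applies the same sort-then-`drop_duplicates` pattern to a four-row toy frame. The toy rows, the `feat_0` column, and the helper name `dedupe_keep_drugs` are made up for illustration; only the `Is Drug` label and the overall pattern come from the cell above.

```python
import pandas as pd

# Toy data: "CCO" appears twice with identical features but conflicting labels
toy = pd.DataFrame({
    "smiles": ["CCO", "CCO", "CCN", "c1ccccc1"],
    "feat_0": [1, 1, 0, 1],
    "Is Drug": [0, 1, 0, 1],
})


def dedupe_keep_drugs(df: pd.DataFrame, label_col: str = "Is Drug") -> pd.DataFrame:
    """Drop rows duplicated on all non-label columns, preferring label 1."""
    feature_cols = df.columns.difference([label_col]).tolist()
    # Put positives first so keep="first" retains the drug row on conflict
    ordered = df.sort_values(by=label_col, ascending=False, kind="stable")
    return ordered.drop_duplicates(subset=feature_cols, keep="first").reset_index(drop=True)


result = dedupe_keep_drugs(toy)
counts = result["Is Drug"].value_counts().to_dict()
print(result)
print(counts)
```

Because the SMILES column is part of the duplicate key, two different SMILES strings for the same molecule are only merged if the strings themselves match. Canonicalizing SMILES beforehand would be one way to tighten this, though that is an assumption about the desired behavior rather than something the notebook does.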
