
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
sohamshee_drug_molecular_toxicity_dataset_path = kagglehub.dataset_download('sohamshee/drug-molecular-toxicity-dataset')

print('Data source import complete.')
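The download call above returns a local directory, whereas the data loading cell later in this notebook reads from the fixed Kaggle path `/kaggle/input/drug-molecular-toxicity-dataset`. One way to make both environments work is sketched below. `resolve_data_dir` and `EXPECTED_FILES` are hypothetical names introduced for illustration (they are not part of kagglehub), and the sketch assumes the four CSV files sit at the top level of the dataset directory.

```python
from pathlib import Path

# Hypothetical helper (not part of kagglehub): return the first candidate
# directory that contains every expected CSV file.
# Assumption: the CSV files sit directly inside the dataset directory.
EXPECTED_FILES = (
    'names_smiles_train.csv',
    'names_labels_train.csv',
    'names_smiles_test.csv',
    'names_labels_test.csv',
)

def resolve_data_dir(candidates, expected_files=EXPECTED_FILES):
    """Return the first directory in `candidates` holding all expected files."""
    candidates = list(candidates)
    for candidate in candidates:
        directory = Path(candidate)
        if all((directory / name).is_file() for name in expected_files):
            return directory
    raise FileNotFoundError(
        f'None of {candidates} contains all of {list(expected_files)}'
    )
```

With this helper, `data_dir = resolve_data_dir([sohamshee_drug_molecular_toxicity_dataset_path, '/kaggle/input/drug-molecular-toxicity-dataset'])` could be followed by calls such as `pd.read_csv(data_dir / 'names_smiles_train.csv')`.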


<img src="https://devra.ai/analyst/notebook/1845/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;"><div style="font-size:150%; color:#FEE100"><b>Drug Molecular Toxicity Prediction and Analysis Notebook</b></div><div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

This notebook explores the prediction of drug molecular toxicity. It extracts chemistry-inspired descriptors from SMILES strings (molecular weight, LogP, and hydrogen-bond donor and acceptor counts) and fits a predictive model to estimate the toxicity potential of molecules. If you find these insights useful, please consider upvoting.

## Table of Contents

- [Setup and Imports](#Setup-and-Imports)
- [Data Loading](#Data-Loading)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Feature Extraction](#Feature-Extraction)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Predictive Modeling and Evaluation](#Predictive-Modeling-and-Evaluation)
- [Conclusion and Future Work](#Conclusion-and-Future-Work)

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Plotting setup: use the inline backend so that figures render in the notebook.
# The non-interactive 'Agg' backend is not selected here: the inline magic would
# override it, and on its own it would keep plt.show() from displaying anything.
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns

# For predictive modeling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc

# For permutation importance
from sklearn.inspection import permutation_importance

# Attempt to import RDKit for chemistry-related feature extraction
try:
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    rdkit_available = True
except ImportError:
    rdkit_available = False
    print('RDKit not found. Falling back to crude string-based descriptors; install RDKit to extract real chemical descriptors.')

# Set seaborn plotting style
sns.set_theme(style='whitegrid', palette='muted', color_codes=True)

In [None]:
# Data Loading
# The dataset is split into separate files for SMILES strings and toxicity labels for training and testing.

# Load training SMILES data
train_smiles_df = pd.read_csv('/kaggle/input/drug-molecular-toxicity-dataset/names_smiles_train.csv',
                             encoding='ascii')

# Load training labels
train_labels_df = pd.read_csv('/kaggle/input/drug-molecular-toxicity-dataset/names_labels_train.csv',
                              encoding='ascii')

# Load testing SMILES data
test_smiles_df = pd.read_csv('/kaggle/input/drug-molecular-toxicity-dataset/names_smiles_test.csv',
                            encoding='ascii')

# Load testing labels
test_labels_df = pd.read_csv('/kaggle/input/drug-molecular-toxicity-dataset/names_labels_test.csv',
                            encoding='ascii')

print('Data loading complete.')

In [None]:
# Data Cleaning and Preprocessing
# Each SMILES file and its matching labels file share a molecule-identifier column, but the raw column names are not convenient to work with.
# We rename the first two columns of each file by position to simplify merging and subsequent analysis.

# For training set
train_smiles_df = train_smiles_df.rename(columns={train_smiles_df.columns[0]: 'molecule_id',
                                                  train_smiles_df.columns[1]: 'smiles'})
train_labels_df = train_labels_df.rename(columns={train_labels_df.columns[0]: 'molecule_id',
                                                  train_labels_df.columns[1]: 'label'})

# Merge training data on molecule_id
train_df = pd.merge(train_smiles_df, train_labels_df, on='molecule_id')

# For testing set
test_smiles_df = test_smiles_df.rename(columns={test_smiles_df.columns[0]: 'molecule_id',
                                                test_smiles_df.columns[1]: 'smiles'})
test_labels_df = test_labels_df.rename(columns={test_labels_df.columns[0]: 'molecule_id',
                                                test_labels_df.columns[1]: 'label'})

test_df = pd.merge(test_smiles_df, test_labels_df, on='molecule_id')

print('Training and testing data merged.')
# Display the first few rows of the training data for verification
train_df.head()
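A plain inner merge silently drops molecules that appear in only one of the two files and silently multiplies rows when an identifier is duplicated. The sketch below shows one way to surface both problems using the `validate` and `indicator` options of `pd.merge`. `merge_with_checks` is a hypothetical helper, and the toy frames stand in for the real files.

```python
import pandas as pd

def merge_with_checks(smiles_df, labels_df, key='molecule_id'):
    """Inner-join two frames on `key`, reporting rows without a partner."""
    # validate='one_to_one' raises pandas.errors.MergeError on duplicate keys.
    outer = pd.merge(smiles_df, labels_df, on=key, how='outer',
                     validate='one_to_one', indicator=True)
    n_unmatched = int((outer['_merge'] != 'both').sum())
    if n_unmatched:
        print(f'{n_unmatched} rows have no match on {key!r} and are dropped.')
    # The inner merge keeps the original dtypes and the row order of smiles_df.
    return pd.merge(smiles_df, labels_df, on=key, how='inner')

# Toy data: 'c' has no label and 'd' has no SMILES string.
demo_smiles = pd.DataFrame({'molecule_id': ['a', 'b', 'c'],
                            'smiles': ['CCO', 'CCN', 'C']})
demo_labels = pd.DataFrame({'molecule_id': ['a', 'b', 'd'],
                            'label': [0, 1, 1]})
demo_merged = merge_with_checks(demo_smiles, demo_labels)
```

In this notebook the same call could replace the two `pd.merge` lines, for example `train_df = merge_with_checks(train_smiles_df, train_labels_df)`.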

In [None]:
# Feature Extraction
# If RDKit is available, we extract four chemical descriptors: molecular weight, LogP, and the numbers of hydrogen-bond donors and acceptors.
# These descriptors are widely used in cheminformatics and can offer insight into molecular properties related to toxicity.

def compute_descriptors(smiles):
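    '''
    Computes selected chemical descriptors for a SMILES string.

    With RDKit available, returns RDKit descriptors, or NaNs if the SMILES string cannot be parsed.
    Without RDKit, returns crude string-based proxies (see the fallback branch below).
    '''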
    if rdkit_available:
        try:
            mol = Chem.MolFromSmiles(smiles)
            if mol is not None:
                desc = {
                    'MolWt': Descriptors.MolWt(mol),
                    'LogP': Descriptors.MolLogP(mol),
                    'NumHDonors': Descriptors.NumHDonors(mol),
                    'NumHAcceptors': Descriptors.NumHAcceptors(mol)
                }
            else:
                desc = {'MolWt': np.nan, 'LogP': np.nan, 'NumHDonors': np.nan, 'NumHAcceptors': np.nan}
        except Exception as e:
            # Logging the error for this molecule, as similar errors might be encountered by other users
            print(f'Error computing descriptors for SMILES {smiles}: {e}')
            desc = {'MolWt': np.nan, 'LogP': np.nan, 'NumHDonors': np.nan, 'NumHAcceptors': np.nan}
    else:
        # If RDKit is not installed, fall back to crude proxies: the SMILES string length for size,
        # and counts of 'N' and 'O' characters for donors and acceptors. These are not real descriptor values.
        desc = {
            'MolWt': len(smiles),
            'LogP': len(smiles) / 10.0,
            'NumHDonors': len([c for c in smiles if c == 'N']),
            'NumHAcceptors': len([c for c in smiles if c == 'O'])
        }
    return desc

# Apply descriptor computation to the training set
descriptor_list = [compute_descriptors(s) for s in train_df['smiles']]

descriptors_df = pd.DataFrame(descriptor_list)

# Concatenate the descriptors with the original dataframe
train_df = pd.concat([train_df.reset_index(drop=True), descriptors_df.reset_index(drop=True)], axis=1)

print('Feature extraction complete. Descriptors computed and added to training dataframe.')
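The cell above computes descriptors for the training set only, so `test_df` has no feature columns yet. A small wrapper, sketched below, would let the same steps run on any dataframe with a SMILES column. `add_descriptors` and `toy_descriptors` are hypothetical names; the toy function only stands in for `compute_descriptors` so that the sketch runs without RDKit.

```python
import pandas as pd

def add_descriptors(df, descriptor_fn, smiles_col='smiles'):
    """Return a copy of `df` with one new column per computed descriptor."""
    descriptors = pd.DataFrame([descriptor_fn(s) for s in df[smiles_col]])
    return pd.concat([df.reset_index(drop=True), descriptors], axis=1)

# Toy stand-in for compute_descriptors (string-based, for illustration only).
def toy_descriptors(smiles):
    return {'length': len(smiles), 'n_oxygen': smiles.count('O')}

demo = pd.DataFrame({'molecule_id': ['m1', 'm2'],
                     'smiles': ['CCO', 'CC(=O)O']})
demo_with_features = add_descriptors(demo, toy_descriptors)
```

Applied to this notebook, `test_df = add_descriptors(test_df, compute_descriptors)` would give the test set the same four descriptor columns as the training set.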

In [None]:
# Exploratory Data Analysis
# We now perform a series of exploratory visualizations to understand the distribution of the descriptors
# and their relation to the toxicity label.

numeric_df = train_df.select_dtypes(include=[np.number])

# Check if we have enough numeric columns (at least 4) to compute a correlation heatmap
if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(8, 6))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()
else:
    print('Not enough numeric features for a meaningful correlation heatmap.')

# Pair Plot of descriptors colored by toxicity label
sns.pairplot(train_df, vars=['MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors'], hue='label')
plt.show()

# Histograms for each descriptor
for col in ['MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors']:
    sns.histplot(train_df[col].dropna(), kde=True)
    plt.title(f'Histogram of {col}')
    plt.show()

# Count plot of the toxicity label (a bar chart showing the class balance)
sns.countplot(x='label', data=train_df)
plt.title('Distribution of Toxicity Labels')
plt.show()
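The plots above give a visual impression of class balance and descriptor distributions; a numeric summary can make the same comparison easier to quote. The sketch below uses a hypothetical helper, `summarize_by_label`, on a small made-up dataframe (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

def summarize_by_label(df, feature_cols, label_col='label'):
    """Return class proportions and per-class mean/median of the features."""
    class_share = df[label_col].value_counts(normalize=True).sort_index()
    per_class = df.groupby(label_col)[feature_cols].agg(['mean', 'median'])
    return class_share, per_class

# Made-up values for illustration only.
demo = pd.DataFrame({
    'MolWt': [100.0, 200.0, 300.0, 400.0],
    'LogP': [1.0, 2.0, 3.0, 4.0],
    'label': [0, 0, 0, 1],
})
class_share, per_class = summarize_by_label(demo, ['MolWt', 'LogP'])
print(class_share)
print(per_class)
```

On the notebook's data, the call would be `summarize_by_label(train_df, ['MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors'])`.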

In [None]:
# Predictive Modeling and Evaluation
# We will build a simple logistic regression classifier using the computed chemical descriptors
# as features and the binary toxicity label as the target. We then evaluate its performance
# using accuracy, confusion matrix, ROC curve and permutation importance for feature relevance.

# Define feature columns and target
features = ['MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors']
target = 'label'

# Prepare training data (dropping any rows with missing descriptor values)
train_model_df = train_df.dropna(subset=features + [target])

X_train = train_model_df[features]
y_train = train_model_df[target]

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on training data (note: in a real setting, cross-validation or a hold-out set would be more appropriate)
y_pred = model.predict(X_train)

# Compute accuracy
accuracy = accuracy_score(y_train, y_pred)
print(f'Training Accuracy: {accuracy:.4f}')

# Confusion Matrix
cm = confusion_matrix(y_train, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# ROC Curve (computed on the training data, so it likely overstates performance)
y_probs = model.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, y_probs)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

# Permutation Importance
perm_importance = permutation_importance(model, X_train, y_train, n_repeats=10, random_state=42)
sorted_idx = perm_importance.importances_mean.argsort()
plt.figure(figsize=(6, 4))
plt.barh(np.array(features)[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('Feature Importance')
plt.show()
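The modeling cell notes that cross-validation or a hold-out set would be more appropriate than scoring on the training data. A minimal sketch of the cross-validated variant follows. It runs on synthetic data from `make_classification`, which is an assumption standing in for the four descriptor columns, and the `StandardScaler` step is a design choice of this sketch rather than something the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four descriptors and the binary toxicity label.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    random_state=42,
)

# Scaling happens inside the pipeline, so each fold is scaled using only
# its own training portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='roc_auc')

print(f'Cross-validated ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}')
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`. Alternatively, once descriptors are computed for `test_df`, the model could be fitted on the training set and scored on the test set.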

## Conclusion and Future Work

This notebook presented an exploratory analysis of drug molecular toxicity that combines chemical descriptor extraction with predictive modeling. Heatmaps, pair plots, histograms, and ROC curves helped visualize relationships in the data, and a simple logistic regression model, evaluated here only on the training data, provides a starting point for prediction.

In future work, further preprocessing of the SMILES strings, incorporation of more advanced descriptors or fingerprints, and the application of more sophisticated machine learning models could lead to improved predictions. Additionally, handling missing descriptor data and hyperparameter tuning may further enhance performance.
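
Two of these ideas, handling missing descriptor values and hyperparameter tuning, could be combined in a single scikit-learn pipeline. The sketch below is illustrative only: the data is randomly generated, and the median imputation strategy and the grid of `C` values are assumptions rather than tuned recommendations.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in for four descriptor columns with a binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
# Knock out roughly 10% of the values to mimic failed descriptor calculations.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs instead of dropping rows
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Search over the inverse regularization strength C with 5-fold cross-validation.
param_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Because imputation sits inside the pipeline, the medians are learned from each training fold only, which avoids leaking information from the validation fold.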
