
When working with SMILES (Simplified Molecular Input Line Entry System) strings for training generative models like a Variational Autoencoder (VAE), effective preprocessing is crucial. The SMILES notation represents the structure of a chemical species using short ASCII strings, which makes it suitable for machine learning models that process sequential data.

Here's a step-by-step guide to preprocessing SMILES strings for your VAE model in Python:

1. **Data Collection and Cleaning**

  First, gather your dataset of SMILES strings, which might come from public chemical databases like PubChem, ChEMBL, or a proprietary dataset.

- **Remove duplicates**: Ensure that the dataset does not contain duplicate entries to prevent bias.
- **Data cleaning**: Some SMILES strings might be invalid or too long for practical use in training. You can use a chemistry library like RDKit to filter out invalid SMILES and possibly normalize them.
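
The cleaning steps above can be sketched in plain Python. This is a minimal illustration rather than a definitive pipeline: `clean_smiles`, `max_len`, and the `is_valid` callback are hypothetical names introduced here. In practice `is_valid` could wrap RDKit's `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse.

```python
def clean_smiles(smiles_list, max_len=120, is_valid=None):
    """Drop blanks, over-long strings, rejected strings, and duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for smiles in smiles_list:
        smiles = smiles.strip()
        if not smiles or len(smiles) > max_len:
            continue  # skip empty or over-long entries
        if is_valid is not None and not is_valid(smiles):
            continue  # skip entries the validity check rejects
        if smiles in seen:
            continue  # skip exact duplicates
        seen.add(smiles)
        cleaned.append(smiles)
    return cleaned

raw = ["CCO", "CCO", " CCN ", "", "C" * 200]
print(clean_smiles(raw))  # ['CCO', 'CCN']
```

Exact string matching misses duplicates written as different SMILES for the same molecule; canonicalizing first (for example with RDKit's `Chem.MolToSmiles`) addresses that.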

2. **Tokenization**

  Tokenization is the process of converting the SMILES strings into a format that can be fed into the VAE. Since SMILES are sequences of characters, the simplest scheme treats each unique character as a token. Note that this splits two-letter atoms such as `Cl` and `Br` into separate characters.

- **Identify unique characters**: Parse through all SMILES strings to gather a set of all unique characters used.
- **Create a character-to-index dictionary**: Map each character to a unique integer for model processing.

In [None]:
# Assistant

import numpy as np

def build_vocab(smiles_list):
    vocab = set()
    for smiles in smiles_list:
        vocab.update(smiles)
    vocab = sorted(vocab)
    char_to_index = {char: idx for idx, char in enumerate(
        vocab, 1)}  # Start indexing from 1
    char_to_index['<pad>'] = 0  # Add a padding character
    return char_to_index

# Example usage
smiles_list = ['CCO', 'CCC', 'CCN']
vocab = build_vocab(smiles_list)
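
Character-level tokens split two-letter atoms such as `Cl` and `Br`, and bracket atoms such as `[NH4+]`, into pieces. As a hedged alternative to the character vocabulary above, the sketch below uses a regular expression to keep those units together. The pattern and the name `tokenize_smiles` are illustrative assumptions, not a complete SMILES grammar.

```python
import re

# Bracket atoms, two-letter halogens, two-digit ring closures (%NN), then any single character
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, keeping multi-character units intact."""
    return _SMILES_TOKEN.findall(smiles)

print(tokenize_smiles("CC(Cl)Br"))  # ['C', 'C', '(', 'Cl', ')', 'Br']
print(tokenize_smiles("[NH4+]"))    # ['[NH4+]']
```

A vocabulary built from these tokens could replace the per-character set in `build_vocab` by iterating over `tokenize_smiles(smiles)` instead of `smiles`.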



3. **Vectorization**

  Convert SMILES strings into sequences of integers using the mapping created. This transformation is necessary for neural network processing.

In [None]:
# Assistant

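def smiles_to_vector(smiles, char_to_index, max_length):
    # Map characters to indices (KeyError if a character is not in the vocabulary)
    # and truncate so that every vector has at most max_length entries
    vector = [char_to_index[char] for char in smiles][:max_length]
    # Right-pad with the <pad> index up to max_length
    padded_vector = vector + [char_to_index['<pad>']] * (
        max_length - len(vector))
    return np.array(padded_vector, dtype=np.int64)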

# Determine max length from your dataset
max_length = max(len(smiles) for smiles in smiles_list)

# Convert all SMILES to vectors
smiles_vectors = np.array([smiles_to_vector(
    smiles, vocab, max_length) for smiles in smiles_list])
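
A decoder that generates sequences token by token usually needs explicit start and end markers, which the vocabulary above does not include. The sketch below is one possible way to add them; the token names `<sos>` and `<eos>` and both helper names are assumptions made for illustration.

```python
def build_vocab_with_specials(smiles_list):
    """Character vocabulary with <pad>=0, <sos>=1, <eos>=2 reserved."""
    specials = ['<pad>', '<sos>', '<eos>']
    chars = sorted(set(''.join(smiles_list)))
    return {tok: idx for idx, tok in enumerate(specials + chars)}

def encode_with_specials(smiles, char_to_index, max_length):
    """Wrap the sequence in <sos>/<eos>, then right-pad to max_length + 2."""
    ids = [char_to_index['<sos>']]
    ids += [char_to_index[ch] for ch in smiles[:max_length]]
    ids.append(char_to_index['<eos>'])
    ids += [char_to_index['<pad>']] * (max_length + 2 - len(ids))
    return ids

vocab_sp = build_vocab_with_specials(['CCO', 'CCN'])
print(encode_with_specials('CCO', vocab_sp, max_length=4))  # [1, 3, 3, 5, 2, 0]
```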




4. **Data Normalization**

  While traditional data normalization (e.g., mean subtraction, division by standard deviation) isn't used with SMILES strings, ensuring consistent length via padding (as done in vectorization) is essential.

5. **Splitting the Dataset**
  
  Divide your dataset into training, validation, and test sets. This helps in training the model effectively and evaluating its performance.

In [None]:
# Assistant

from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test = train_test_split(
    smiles_vectors, test_size=0.1, random_state=42)
X_train, X_val = train_test_split(X_train, test_size=0.1, random_state=42)




6. **Batching and Loading**

  Use a data loader to batch the data during training, which is essential for efficient training of deep learning models.

In [None]:
# Assistant

from torch.utils.data import DataLoader, TensorDataset
import torch

# Convert to PyTorch tensors (nn.Embedding expects integer indices, so cast to int64)
train_data = TensorDataset(torch.from_numpy(X_train).long())
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

# Similarly for val_loader and test_loader


7. **Model Training**

  Now, with your preprocessed data, you can define your VAE model in PyTorch or TensorFlow/Keras and start the training process. Be sure to monitor both the loss and chemical validity metrics during training.

This preprocessing pipeline sets a robust foundation for training a VAE on SMILES data, aiming to generate new and valid drug-like compounds efficiently.










Designing a Variational Autoencoder (VAE) to handle SMILES strings, which represent molecular structures, requires a thoughtful approach to both the encoder and decoder architectures. SMILES strings are sequences of characters, and as such, they share similarities with natural language processing (NLP). Therefore, models that excel in NLP tasks, such as recurrent neural networks (RNNs), LSTMs, GRUs, or Transformer-based models, are particularly suitable.

**Encoder Design**

The encoder's job is to convert the input SMILES strings into a fixed-size latent representation that captures the essential information about the molecular structure.

1. **Input Layer**:

- **Embedding Layer**: Start with an embedding layer that converts the input tokens (characters in SMILES) into dense vectors. This layer translates the sparse, one-hot encoded vectors into a more manageable and meaningful form for the network.

2. **Recurrent Layers**:

- **GRU/LSTM**: Use GRU (Gated Recurrent Units) or LSTM (Long Short-Term Memory) layers to process the sequence data. These layers can handle the dependencies and structural characteristics in SMILES strings, which are crucial for capturing the sequential nature of the data.

3. **Variational Layer**:

- **Latent Space Representation**: After processing the sequence with RNN layers, connect to dense layers that output the mean (μ) and log variance (log σ²) of the latent distribution.
- **Reparameterization Trick**: Implement the reparameterization trick to enable backpropagation. Rather than sampling the latent vector directly, sample noise from a standard normal distribution and combine it with μ and σ, so that gradients can flow through μ and σ.
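
Written out, with ε as the auxiliary noise and ⊙ as element-wise multiplication, the trick described above amounts to the following (a standard formulation, stated here with the same μ and log σ² symbols):

$$
z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \qquad \sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right)
$$

Because the randomness lives entirely in ε, z is a deterministic, differentiable function of μ and log σ², which is what lets gradients reach the encoder.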


In [None]:
# Assistant

import torch
from torch import nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, rnn_units, latent_dim):
        super(Encoder, self).__init__()
        # Embedding layer converts token indices to dense vectors of a fixed size
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # GRU layer for processing sequences; outputs last hidden state
        self.rnn = nn.GRU(embedding_dim, rnn_units, batch_first=True)
        # Linear layers to produce means and log variance for the latent space
        self.linear_mu = nn.Linear(rnn_units, latent_dim)
        self.linear_var = nn.Linear(rnn_units, latent_dim)

    def forward(self, x):
        # Convert sparse input tokens to dense embedding vectors
        x = self.embedding(x)
        # GRU returns output and last hidden state; '_' is unused output
        _, h = self.rnn(x)
        # h has shape (num_layers, batch, rnn_units); with a single layer, drop the leading dimension
        h = h.squeeze(0)
        # Compute the mean and log variance for parameterizing the latent distribution
        mu = self.linear_mu(h)
        log_var = self.linear_var(h)
        return mu, log_var



**Decoder Design**

The decoder reconstructs SMILES strings from the latent representations, essentially mirroring the structure of the encoder but in reverse.

1. **Input Layer**:

- **Latent to Sequence**: Start from the latent space and expand it to the sequence length required, typically using a dense layer.

2. **Recurrent Layers**:

- **GRU/LSTM**: Similar to the encoder, use GRU or LSTM layers to generate the output sequence. The initial state of this RNN can be set from the latent vector.

3. **Output Layer**:

- **Dense Layer**: Use a dense layer to convert the RNN outputs to the size of the vocabulary (i.e., a score for each possible character in the SMILES vocabulary).
- **Softmax Activation**: Apply softmax to convert these scores into probabilities for each character in the vocabulary when sampling. During training, pass the raw scores (logits) to the cross-entropy loss, which applies the softmax internally.


In [None]:
# Assistant

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, rnn_units, latent_dim):
        super(Decoder, self).__init__()
        # Embedding layer to convert token indices to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Linear layer to turn a latent vector into the GRU's initial hidden state
        self.latent_to_hidden = nn.Linear(latent_dim, rnn_units)
        # GRU layer to generate sequences from the latent space
        self.rnn = nn.GRU(embedding_dim, rnn_units, batch_first=True)
        # Linear layer to map from RNN output to vocabulary size (for generating tokens)
        self.dense = nn.Linear(rnn_units, vocab_size)

    def init_hidden(self, z):
        # (batch, latent_dim) -> (1, batch, rnn_units), the shape nn.GRU expects for h_0
        return torch.tanh(self.latent_to_hidden(z)).unsqueeze(0)

    def forward(self, x, h):
        # Convert sparse token indices to embeddings
        x = self.embedding(x)
        # Process input through GRU along with the initial hidden state
        x, h = self.rnn(x, h)
        # Output layer that converts RNN output to logits for each vocabulary token
        x = self.dense(x)
        return x, h



**Explanation and Use of the Code**

- **Embedding Layer**: This layer is crucial as it translates the integer-encoded SMILES characters into dense embeddings. These embeddings capture more information per character than the integers themselves.

- **GRU (Gated Recurrent Unit)**: GRUs are effective for sequence generation tasks because their gating mechanism helps retain information over longer sequences and mitigates (though does not eliminate) the vanishing gradient problem common with standard recurrent neural networks.

- **Linear Transformation for Latent Variables (mu and log_var)**: These transformations convert the last hidden state of the GRU into parameters for the latent space distribution. These parameters are used to sample and generate new data points in the latent space, maintaining continuity and enabling the generation of new, plausible SMILES strings.

- **Decoder's Forward Function**: It takes an initial input (typically a start token) and the latent vector (transformed into an initial state) to begin generating a sequence. The decoder can continue generating tokens based on its previous outputs until it produces an end token or reaches a maximum sequence length.

This structured and commented approach in building the VAE components clarifies how each section contributes to processing and generating SMILES strings, aiding in the development of robust generative models for chemical compounds.
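
The generation loop described for the decoder's forward function can be sketched without any framework. In the sketch below, `step_fn` is a hypothetical stand-in for one decoder step (it takes the previous token and state and returns per-token scores plus the new state), and `greedy_decode` is an illustrative name. Picking the highest-scoring token is only one option; sampling from the softmax distribution is a common alternative.

```python
def greedy_decode(step_fn, init_state, start_id, end_id, max_len):
    """Generate token ids one at a time until end_id appears or max_len is reached."""
    tokens = []
    token, state = start_id, init_state
    for _ in range(max_len):
        scores, state = step_fn(token, state)  # one decoder step
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax over scores
        if token == end_id:
            break
        tokens.append(token)
    return tokens

# Toy step function for the demo: emit token 3 twice, then the end token (2)
def toy_step(token, state):
    scores = [0.0, 0.0, 0.0, 0.0]
    scores[3 if state < 2 else 2] = 1.0
    return scores, state + 1

print(greedy_decode(toy_step, init_state=0, start_id=1, end_id=2, max_len=10))  # [3, 3]
```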










Training a Variational Autoencoder (VAE) on SMILES strings involves tuning several hyperparameters to optimize model performance. The choice of these parameters can significantly affect the quality of the generated SMILES strings, their diversity, and the stability of the training process. Here’s a guide to understanding and tuning the key hyperparameters for a VAE trained on SMILES strings:

1. **Learning Rate**
- **Importance**: The learning rate controls how much to change the model in response to the estimated error each time the model weights are updated.
- **Tuning**: Start with a standard learning rate (e.g., 0.001) and adjust based on training behavior. If the training loss fluctuates widely, consider reducing it. Use learning rate schedules or adaptive learning rate methods (like Adam) to improve convergence.

2. **Batch Size**
- **Importance**: Batch size affects the stability of the training process and can influence the convergence speed and memory utilization.
- **Tuning**: Larger batch sizes provide a more accurate estimate of the gradient but require more memory and can sometimes lead to poorer generalization. Experiment with sizes like 32, 64, or 128 to find a balance between performance and resource usage.

3. **Number of Epochs**
- **Importance**: Determines how many times the entire dataset is passed forward and backward through the neural network.
- **Tuning**: Set a reasonably high number to ensure the network has enough iterations to learn from the data, but use early stopping to halt training if the validation loss stops improving.

4. **Latent Space Dimension**
- **Importance**: The size of the latent space determines how much information the VAE can encode about the input data.
- **Tuning**: Smaller dimensions force the VAE to learn more efficient representations but may miss some nuances of the data. Start with a size like 50 or 100 and adjust based on the quality of generated SMILES strings.

5. **RNN Layers and Units**
- **Importance**: The complexity of the RNN architecture influences the model’s ability to capture the dependencies in the data.
- **Tuning**:
  - **Layers**: More layers can model more complex patterns but increase the risk of overfitting and require more computational resources.
  - **Units**: More units (neurons) per layer allow the model to learn more detailed representations but can slow down training. Typical sizes might range from 128 to 512 units per layer.

6. **Reconstruction Loss Weight**
- **Importance**: In VAEs, the final loss function is typically a combination of reconstruction loss and KL divergence. The weight of the reconstruction loss impacts how well the VAE learns to recreate the input data.
- **Tuning**: Adjusting the balance between reconstruction loss and KL divergence is crucial. Too much emphasis on KL divergence can lead to overly regularized models that do not capture detailed variations in the data.

7. **KL Divergence Weight (Beta)**
- **Importance**: Controls the regularization effect in the VAE, forcing the latent distribution to approximate the prior distribution.
- **Tuning**: A technique known as "beta annealing" can be useful, where the influence of KL divergence is gradually increased during training. Start with a low beta and increase it as training progresses.
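
Two of the items above, beta annealing and early stopping, reduce to a few lines of bookkeeping. The following is a minimal sketch under stated assumptions: `linear_beta`, `warmup_steps`, and the `EarlyStopping` class are hypothetical names, and a linear ramp is just one of several possible annealing schedules.

```python
def linear_beta(step, warmup_steps, beta_max=1.0):
    """Ramp beta linearly from 0 to beta_max over warmup_steps, then hold."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

print([linear_beta(s, 4) for s in range(6)])  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]

stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([1.0, 0.8, 0.81, 0.82, 0.7]):
    if stopper.step(loss):
        print("stop at epoch", epoch)  # stop at epoch 3
        break
```

If beta changes during training, a validation loss that includes the beta-weighted KL term is not directly comparable across epochs, so it may be safer to base early stopping on a fixed-beta or reconstruction-only metric.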

### Hyperparameter Tuning Techniques
- **Grid Search**: Systematically vary parameters over a fixed grid of values to find the best combination.
- **Random Search**: Randomly sample the hyperparameter space, often more effective and efficient than grid search.
- **Bayesian Optimization**: Uses a probabilistic model to predict which hyperparameters might lead to better performance.
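
As an illustration of random search, the sketch below samples configurations from a small discrete space using only the standard library. Everything in it is an assumption for demonstration: the search space values, the helper name `random_search`, and especially `toy_objective`, which stands in for "train the VAE and return the validation loss".

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations from `space`; return (best_score, best_params)."""
    rng = random.Random(seed)
    best_score, best_params = float('inf'), None
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)  # lower is better, e.g. validation loss
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params

space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'latent_dim': [32, 64, 128],
    'rnn_units': [128, 256, 512],
}

def toy_objective(params):
    # Purely illustrative stand-in for a real training-and-validation run
    return abs(params['latent_dim'] - 64) / 64 + abs(params['learning_rate'] - 3e-4)

best_score, best_params = random_search(toy_objective, space, n_trials=20)
print(best_score, best_params)
```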

### Monitoring and Evaluation
- Monitor the loss components separately to understand how well the model is learning both aspects of the VAE objective.
- Validate the chemical validity of generated SMILES using tools like RDKit, and assess diversity and novelty through similarity metrics.
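
One common way to summarize generated output is as three fractions: validity, uniqueness, and novelty. The sketch below uses one plausible set of definitions (others exist), with `generation_metrics` and the `is_valid` callback as hypothetical names; a real validity check would parse each string with RDKit. Comparing strings directly assumes the SMILES are already canonical.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return validity, uniqueness, and novelty as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        'validity': len(valid) / len(generated) if generated else 0.0,
        'uniqueness': len(unique) / len(valid) if valid else 0.0,
        'novelty': len(novel) / len(unique) if unique else 0.0,
    }

# Toy validity check for the demo only
def toy_is_valid(smiles):
    return smiles != "invalid"

metrics = generation_metrics(["CCO", "CCO", "CCN", "invalid"], ["CCO"], toy_is_valid)
print(metrics)
```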

### Practical Implementation
Consider using libraries like `optuna` or `hyperopt` for more sophisticated hyperparameter optimization strategies. These libraries can handle complex search spaces and optimize based on the model’s performance metrics efficiently.

By carefully tuning these hyperparameters and continuously evaluating the output quality, you can enhance the performance of a VAE designed for generating novel SMILES strings, thereby pushing forward the boundaries in computational drug discovery.

This section implements the loss function for a Variational Autoencoder (VAE) that generates SMILES strings representing molecular structures. The focus is on balancing reconstruction loss and KL divergence to optimize the model's performance.

**Key Components of the VAE Loss Function**

- **Reconstruction Loss**: Measures how well the VAE can recreate the input SMILES strings. For categorical data like SMILES, the categorical cross-entropy loss is typically used.

- **KL Divergence**: Provides regularization by encouraging the latent space distributions to approximate the prior distribution (usually a standard normal distribution). It helps ensure that the latent space encodes meaningful and generalizable representations.
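
For a diagonal Gaussian posterior with mean μ_j and variance σ_j² in each latent dimension j, and a standard normal prior, the KL term has a closed form. It is stated here as a standard result, together with the β-weighted total:

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^{2}) \,\|\, \mathcal{N}(0, I)\big) = -\frac{1}{2}\sum_{j}\left(1 + \log\sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right),
\qquad
\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \, D_{\mathrm{KL}}
$$

The sum is zero exactly when μ = 0 and σ² = 1 in every dimension, and positive otherwise.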

**Implementation in PyTorch**

Here’s how to implement the combined VAE loss function using PyTorch:

In [None]:
# Assistant

import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    """
    Compute VAE loss composed of reconstruction loss and KL divergence.

    Parameters:
    - recon_x: tensor, logits from VAE decoder.
    - x: tensor, original input data (SMILES strings indices).
    - mu: tensor, mean from the latent space.
    - logvar: tensor, log variance from the latent space.
    - beta: float, scales the impact of KL divergence.

    Returns:
    - Total loss as a tensor.
    """
    # Reconstruction loss: categorical cross-entropy between output and target
    recon_loss = F.cross_entropy(
        recon_x.view(-1, recon_x.size(2)), x.view(-1), reduction='sum')

    # KL divergence: measures how closely the latent variables match a standard normal distribution
    kl_div = -0.5 * torch.sum(
        1 + logvar - mu.pow(2) - logvar.exp())

    # Total VAE loss
    total_loss = recon_loss + beta * kl_div
    return total_loss



**Explanation:**

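- `recon_x` should be the raw logits from the decoder (not passed through softmax).
- `x` must be the tensor of indices representing characters in the SMILES strings. As written, padded positions count toward the reconstruction loss; `F.cross_entropy` accepts an `ignore_index` argument if you want to exclude the `<pad>` index.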
- `mu` and `logvar` are the outputs from the encoder, representing the parameters of the latent distribution.
- `beta` is a tuning parameter that allows control over the trade-off between enforcing the latent distribution to match the prior and achieving accurate reconstruction.
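
To get a feel for the magnitude of the KL term, it can help to evaluate it by hand on tiny inputs. The sketch below is a plain-Python version for a single sample, with `kl_standard_normal` as a hypothetical helper name; it is algebraically the same quantity as the `kl_div` expression in `vae_loss`, written with the signs rearranged.

```python
import math

def kl_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)) for one sample, given per-dimension lists."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1 for m, lv in zip(mu, logvar))

print(kl_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

A posterior that matches the prior costs nothing, while shifting one mean by 1 costs 0.5; multiplying by `beta` scales that penalty relative to the reconstruction term.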

**Using the Loss Function**

During training, apply this function to calculate the loss and backpropagate errors to update the model. Monitoring both components of the loss (reconstruction and KL divergence) is crucial for diagnosing model performance and ensuring effective learning.

This concise setup emphasizes the practical aspects of implementing and using the VAE loss function, facilitating efficient training and optimization of your generative model for SMILES strings.


To effectively evaluate the novelty and plausibility of molecules generated by a Variational Autoencoder (VAE), you should incorporate checks for chemical validity, assess drug-likeness, compute novelty, and analyze molecular property distributions.

Below is a Python code block using RDKit that encapsulates these checks:



In [None]:
# Assistant

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors, Descriptors, AllChem, DataStructs

# Helper functions

def is_valid_smiles(smiles):
    """
    Check if a SMILES string is chemically valid.

    Parameters:
    - smiles: String representing a molecule in SMILES notation.

    Returns:
    - bool: True if the SMILES string can be parsed into a valid molecule;
    False otherwise.
    """
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None

def check_druglike_properties(smiles):
    """
    Assess drug-likeness of a molecule using Lipinski's Rule of Five.

    Parameters:
    - smiles: String representing a molecule in SMILES notation.

    Returns:
    - bool: True if the molecule meets Lipinski's Rule of Five; False otherwise.
    """
    mol = Chem.MolFromSmiles(smiles)
    if not mol:
        return False
    mw = Descriptors.ExactMolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = rdMolDescriptors.CalcNumHBD(mol)
    hba = rdMolDescriptors.CalcNumHBA(mol)
    return (mw < 500 and logp <= 5 and hbd <= 5 and hba <= 10)

def compute_min_similarity(smiles, reference_smiles):
    """
    Compute minimum Tanimoto similarity between a given SMILES
    and a list of reference SMILES.

    Parameters:
    - smiles: String, SMILES notation of the molecule to compare.
    - reference_smiles: List of strings, SMILES notations of known molecules.

    Returns:
    - float: Minimum similarity score to any molecule in the reference set.
    """
    mol1 = Chem.MolFromSmiles(smiles)
    fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2)
    similarities = []
    for ref_smiles in reference_smiles:
        mol2 = Chem.MolFromSmiles(ref_smiles)
        fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2)
        similarities.append(DataStructs.TanimotoSimilarity(fp1, fp2))
    return min(similarities) if similarities else None

# Example usage
generated_smiles = ["CCO", "N#N", "C1CCCC1", "invalidSMILES"]  # Hypothetical generated SMILES
reference_smiles = ["CCC", "O=C(C)Oc1ccccc1C(=O)O", "c1ccccc1"]  # Known molecules

valid_smiles = [sm for sm in generated_smiles if is_valid_smiles(sm)]
druglike_smiles = [sm for sm in valid_smiles if check_druglike_properties(sm)]
novelty_scores = {sm: compute_min_similarity(
    sm, reference_smiles) for sm in druglike_smiles}

print("Valid SMILES:", valid_smiles)
print("Drug-like SMILES:", druglike_smiles)
print("Novelty Scores:", novelty_scores)





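**Explanation:**

* `is_valid_smiles`: Checks chemical validity of each SMILES string.
* `check_druglike_properties`: Applies Lipinski's Rule of Five to determine drug-likeness.
* `compute_min_similarity`: Calculates the minimum Tanimoto similarity between a generated molecule and the reference set. Note that this is the similarity to the *least* similar reference; novelty is more commonly judged by the maximum (nearest-neighbour) similarity, where a low value means the molecule is far from every known molecule.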

**Usage**: The example processes a list of generated SMILES strings, filtering them based on validity and drug-likeness, then calculates novelty scores against a reference set.

This implementation allows you to systematically evaluate generated molecules for their potential use in further drug development applications.
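
To make the Tanimoto calculation and the nearest-neighbour view of novelty concrete, here is a small framework-free sketch. It treats a fingerprint as a set of on-bit indices; the bit sets below are made-up stand-ins for real fingerprints, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_neighbor_similarity(query_bits, reference_bits):
    """Highest similarity between the query and any reference fingerprint."""
    return max((tanimoto(query_bits, ref) for ref in reference_bits), default=None)

# Hypothetical on-bit indices standing in for real fingerprints
query = {1, 4, 7, 9}
references = [{1, 4, 7, 8}, {2, 3}, {9}]
print(nearest_neighbor_similarity(query, references))  # 0.6
```

A generated molecule whose nearest-neighbour similarity is low is far from everything in the reference set, which is usually what "novel" is taken to mean.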


For implementing a Variational Autoencoder (VAE) to work with SMILES strings, you can efficiently use the combination of PyTorch for deep learning model development and RDKit for cheminformatics tasks. Here’s a concise guide and example of integrating these libraries into a VAE model:

* **Deep Learning Framework**: `PyTorch` and `TensorFlow/Keras`

  `PyTorch` offers dynamic computation graphs and a flexible approach to building deep learning models, which is particularly beneficial for developing custom architectures like VAEs.

  `TensorFlow` offers an extensive ecosystem and Keras provides high-level APIs which are user-friendly. TensorFlow 2.x has integrated Keras deeply, making model development more intuitive.


* **Cheminformatics Library**: `RDKit`

  `RDKit` is essential for preprocessing SMILES strings to ensure chemical validity and for additional chemical informatics functionalities.

In [None]:
# Assistant

import torch
from torch import nn
from rdkit import Chem

# Encoder class
class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, hidden_dim)  # Embedding layer for SMILES characters
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # GRU to capture sequence information
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)  # Linear layer to output means of latent variables
        self.fc_var = nn.Linear(hidden_dim, latent_dim)  # Linear layer to output log variance of latent variables

    def forward(self, x):
        embedded = self.embedding(x)  # Convert indices to embeddings
        _, hidden = self.rnn(embedded)  # Apply GRU
        mu = self.fc_mu(hidden.squeeze(0))  # Get mean from the hidden state
        log_var = self.fc_var(hidden.squeeze(0))  # Get log variance from the hidden state
        return mu, log_var

# Decoder class
class Decoder(nn.Module):
    def __init__(self, latent_dim, hidden_dim, output_dim):
        super(Decoder, self).__init__()
        self.fc = nn.Linear(latent_dim, hidden_dim)  # Map latent variables to hidden dimension
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # GRU for sequence generation
        self.fc_out = nn.Linear(hidden_dim, output_dim)  # Output layer for generating SMILES characters

    def forward(self, z, output_length):
        # Project z to the hidden size and repeat it at every time step as the GRU input sequence
        hidden = self.fc(z).unsqueeze(0).repeat(output_length, 1, 1).transpose(0, 1)
        output, _ = self.rnn(hidden)  # Decode the sequence
        prediction = self.fc_out(output)  # Logits over the vocabulary at each position
        return prediction

# Initialize the encoder and decoder
input_dim = 50  # Number of unique SMILES characters
output_length = 60  # Maximum length of SMILES strings
hidden_dim = 128  # Hidden dimension of GRU
latent_dim = 20  # Dimension of latent space

encoder = Encoder(input_dim, hidden_dim, latent_dim)
decoder = Decoder(latent_dim, hidden_dim, input_dim)

# Example dummy data and forward pass
smiles_idx = torch.randint(0, input_dim, (10, output_length))  # Example indices of SMILES characters
mu, log_var = encoder(smiles_idx)  # Encode input
z = torch.randn_like(mu) * torch.exp(log_var / 2) + mu  # Reparameterization trick
recon_smiles = decoder(z, output_length)  # Decode from latent space
