<a href="https://colab.research.google.com/github/Duncan1738/SMILES-Morgan-Fingerprints-and-Tanimoto-Similarity/blob/main/Molecular_Descriptors_and_Fingerprints_in_Cheminformatics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 # **Molecular Descriptors and Fingerprints in Cheminformatics**

# This Colab notebook demonstrates how to generate **Morgan fingerprints** (ECFP) for molecules,
# calculate **Tanimoto similarity**, and experiment with different radius values using the `RDKit` library.

 ## **Step 1: Install RDKit in Google Colab**
# RDKit is required for molecular processing tasks such as calculating fingerprints and similarity.

In [2]:
# Install RDKit (newer releases are published on PyPI as `rdkit`; `rdkit-pypi` is the legacy package name)
!pip install rdkit-pypi


Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Downloading rdkit_pypi-2022.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)
Installing collected packages: rdkit-pypi
Successfully installed rdkit-pypi-2022.9.5
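
# A quick check that the install worked can save confusion later. The sketch below only imports RDKit, prints its version, and parses one SMILES string; the variable name `smoke_mol` is made up for this example.

```python
# Confirm that RDKit imports and report which version was installed.
import rdkit
from rdkit import Chem

print("RDKit version:", rdkit.__version__)

# Parsing a trivial SMILES string is a cheap smoke test.
smoke_mol = Chem.MolFromSmiles("CCO")  # ethanol
print("Heavy atoms in ethanol:", smoke_mol.GetNumAtoms())
```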


## **Step 2: Generate Morgan Fingerprints and Calculate Tanimoto Similarity**
# Below, we define three molecules using their SMILES representation, convert them into molecular objects,
# generate fingerprints with two different radii (radius 2, i.e. ECFP4, and radius 4, i.e. ECFP8), and calculate the Tanimoto similarity.

### Molecular Information:
# - **Ethanol**: C2H5OH (SMILES: `CCO`)
# - **Propanol**: C3H7OH (SMILES: `CCCO`)
# - **Butanol**: C4H9OH (SMILES: `CCCCO`)

In [3]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import TanimotoSimilarity

# Example SMILES for three molecules
smiles_1 = 'CCO'    # Ethanol (C2H5OH)
smiles_2 = 'CCCO'   # Propanol (C3H7OH)
smiles_3 = 'CCCCO'  # Butanol (C4H9OH)

# Convert SMILES strings to RDKit molecular objects
mol_1 = Chem.MolFromSmiles(smiles_1)
mol_2 = Chem.MolFromSmiles(smiles_2)
mol_3 = Chem.MolFromSmiles(smiles_3)
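
# `Chem.MolFromSmiles` returns `None` rather than raising when a SMILES string cannot be parsed, so a small guard may be worth adding before fingerprinting. The helper `parse_smiles` and the dictionary names below are hypothetical, introduced only for this sketch.

```python
from rdkit import Chem


def parse_smiles(smiles):
    """Return an RDKit Mol for `smiles`, or raise ValueError if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit signals a parse failure with None, not an exception
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol


named_smiles = {"Ethanol": "CCO", "Propanol": "CCCO", "Butanol": "CCCCO"}
mols = {name: parse_smiles(smi) for name, smi in named_smiles.items()}

# Print the canonical SMILES and heavy-atom count for each parsed molecule.
for name, mol in mols.items():
    print(name, Chem.MolToSmiles(mol), mol.GetNumAtoms())
```

# Calling `parse_smiles("C1CC")` (an unclosed ring) should raise `ValueError` instead of silently passing `None` downstream.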


 ### **Step 3: Generating Morgan Fingerprints (ECFP)**
# Morgan fingerprints are generated here as fixed-length bit vectors (1024 bits) at two different radii. The number in an ECFP name is the diameter, i.e. twice the radius:
# - **Radius = 2**: ECFP4 (looks at atom neighborhoods up to two bonds away)
# - **Radius = 4**: ECFP8 (looks at atom neighborhoods up to four bonds away; ECFP6 would be radius 3)
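
# The ECFP label is conventionally the diameter of the atom neighborhood, i.e. twice the Morgan radius. A tiny sketch of that mapping follows; `ecfp_name` is a hypothetical helper, not an RDKit function.

```python
def ecfp_name(radius):
    """Map a Morgan radius to the conventional ECFP label (suffix = diameter)."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return f"ECFP{2 * radius}"


for r in (1, 2, 3, 4):
    print(f"radius={r} -> {ecfp_name(r)}")
```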


In [4]:
# Generate Morgan Fingerprints (ECFP) with radius=2 and 1024 bits (ECFP4)
fingerprint_radius_2_mol1 = AllChem.GetMorganFingerprintAsBitVect(mol_1, radius=2, nBits=1024)
fingerprint_radius_2_mol2 = AllChem.GetMorganFingerprintAsBitVect(mol_2, radius=2, nBits=1024)
fingerprint_radius_2_mol3 = AllChem.GetMorganFingerprintAsBitVect(mol_3, radius=2, nBits=1024)

# Generate Morgan Fingerprints (ECFP) with radius=4 and 1024 bits (ECFP8; ECFP6 would be radius=3)
fingerprint_radius_4_mol1 = AllChem.GetMorganFingerprintAsBitVect(mol_1, radius=4, nBits=1024)
fingerprint_radius_4_mol2 = AllChem.GetMorganFingerprintAsBitVect(mol_2, radius=4, nBits=1024)
fingerprint_radius_4_mol3 = AllChem.GetMorganFingerprintAsBitVect(mol_3, radius=4, nBits=1024)
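
# Recent RDKit releases also provide a generator-based interface in `rdkit.Chem.rdFingerprintGenerator`, and newer versions may emit a deprecation warning for `GetMorganFingerprintAsBitVect`. The sketch below shows the generator form for one molecule, assuming an RDKit version that ships `GetMorganGenerator`; the variable names are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

mol = Chem.MolFromSmiles("CCO")  # ethanol, as in the cells above

# One generator object per parameter set; it can be reused for many molecules.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)
fp = morgan_gen.GetFingerprint(mol)  # folded bit vector of length fpSize

print("Vector length:", fp.GetNumBits())
print("Bits set:", fp.GetNumOnBits())
print("On-bit indices:", list(fp.GetOnBits()))
```

# Inspecting the on-bit indices is a handy way to see how sparse the 1024-bit vector is for a small molecule.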


### **Step 4: Calculate Tanimoto Similarity**

# The **Tanimoto similarity** between two molecules is computed from their fingerprints: the number of bits set in both, divided by the number of bits set in either.
# This metric is commonly used in cheminformatics to assess molecular similarity:
# - **1.0**: Identical fingerprints (the same bits are set in both)
# - **0.0**: No bits in common
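
# For bit-vector fingerprints the Tanimoto coefficient is the size of the intersection of the on-bits divided by the size of their union. The plain-Python sketch below works on sets of bit indices; the function `tanimoto` and the example index sets are made up for illustration and are not real fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Both inputs empty: the ratio is undefined; this sketch returns 0.0.
        return 0.0
    return len(a & b) / len(union)


# Made-up bit indices purely for illustration.
fp_a = {1, 5, 9, 12, 20, 33}
fp_b = {1, 5, 9, 12, 20, 40, 41, 42}

print(tanimoto(fp_a, fp_b))  # 5 shared bits / 9 bits in the union
print(tanimoto(fp_a, fp_a))  # identical sets
print(tanimoto(fp_a, {99}))  # disjoint sets
```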


In [5]:
# Calculate Tanimoto similarity for radius=2 (ECFP4)
similarity_radius_2_mol1_mol2 = TanimotoSimilarity(fingerprint_radius_2_mol1, fingerprint_radius_2_mol2)
similarity_radius_2_mol1_mol3 = TanimotoSimilarity(fingerprint_radius_2_mol1, fingerprint_radius_2_mol3)

# Calculate Tanimoto similarity for radius=4 (ECFP8)
similarity_radius_4_mol1_mol2 = TanimotoSimilarity(fingerprint_radius_4_mol1, fingerprint_radius_4_mol2)
similarity_radius_4_mol1_mol3 = TanimotoSimilarity(fingerprint_radius_4_mol1, fingerprint_radius_4_mol3)
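
# When one molecule is compared against several others, `DataStructs.BulkTanimotoSimilarity` can replace the repeated pairwise calls. This is a minimal sketch with the same three alcohols; the names `query_fp`, `target_fps`, and `scores` are illustrative.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("CCO")  # ethanol
targets = [Chem.MolFromSmiles(smi) for smi in ("CCCO", "CCCCO")]  # propanol, butanol

query_fp = AllChem.GetMorganFingerprintAsBitVect(query, radius=2, nBits=1024)
target_fps = [
    AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=1024) for m in targets
]

# One call returns a list with one similarity per target fingerprint.
scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fps)
for name, score in zip(("Propanol", "Butanol"), scores):
    print(f"Ethanol vs. {name}: {score:.4f}")
```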


 ### **Step 5: Display Results**
# The results show the Tanimoto similarity between:
# 1. **Ethanol vs. Propanol**
# 2. **Ethanol vs. Butanol**

# Similarity is calculated for both ECFP4 (radius 2) and ECFP8 (radius 4).


In [6]:
# Display results for radius=2 (ECFP4)
print(f'Tanimoto Similarity (Radius 2 - ECFP4) between Ethanol and Propanol: {similarity_radius_2_mol1_mol2}')
print(f'Tanimoto Similarity (Radius 2 - ECFP4) between Ethanol and Butanol: {similarity_radius_2_mol1_mol3}')

# Display results for radius=4 (ECFP8)
print(f'Tanimoto Similarity (Radius 4 - ECFP8) between Ethanol and Propanol: {similarity_radius_4_mol1_mol2}')
print(f'Tanimoto Similarity (Radius 4 - ECFP8) between Ethanol and Butanol: {similarity_radius_4_mol1_mol3}')


Tanimoto Similarity (Radius 2 - ECFP4) between Ethanol and Propanol: 0.5555555555555556
Tanimoto Similarity (Radius 2 - ECFP4) between Ethanol and Butanol: 0.4166666666666667
Tanimoto Similarity (Radius 4 - ECFP8) between Ethanol and Propanol: 0.5555555555555556
Tanimoto Similarity (Radius 4 - ECFP8) between Ethanol and Butanol: 0.4166666666666667
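
# The radius 2 and radius 4 results above are identical. One way to investigate is to compare the on-bits of the two fingerprints for each molecule and list any bits that appear only at the larger radius. The sketch below does that; the dictionary `new_bits_at_r4` is a name invented for this example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

new_bits_at_r4 = {}
for name, smiles in (("Ethanol", "CCO"), ("Propanol", "CCCO"), ("Butanol", "CCCCO")):
    mol = Chem.MolFromSmiles(smiles)
    bits_r2 = set(AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024).GetOnBits())
    bits_r4 = set(AllChem.GetMorganFingerprintAsBitVect(mol, radius=4, nBits=1024).GetOnBits())
    # Bits present at radius 4 but absent at radius 2 correspond to newly found environments.
    new_bits_at_r4[name] = sorted(bits_r4 - bits_r2)
    print(f"{name}: {len(bits_r2)} bits at radius 2, {len(bits_r4)} bits at radius 4, "
          f"new at radius 4: {new_bits_at_r4[name]}")
```

# If the list of new bits is empty for every molecule, the larger radius found no additional atom environments, which would explain the matching similarities.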


 ## **Summary**

# - **Morgan Fingerprints** (ECFP) encode molecular structures as fixed-length bit vectors that capture topological features.
# - The **radius** (e.g., 2 for ECFP4, 4 for ECFP8) sets how far each atom neighborhood extends. For molecules as small as these three alcohols, radius 4 adds no atom environments beyond those already found at radius 2, which is why both radii give the same similarities above.
# - The **Tanimoto similarity** metric then scores the structural similarity between molecules based on these fingerprints.

# You can modify the SMILES strings or the radius values to explore different molecular structures and their similarities.
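
# To make that kind of experimentation easier, the steps of this notebook can be wrapped in one function and looped over several radii. This is a sketch under the same settings as above (1024 bits); `morgan_tanimoto` and `molecules` are hypothetical names.

```python
from itertools import combinations

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import TanimotoSimilarity


def morgan_tanimoto(smiles_a, smiles_b, radius=2, n_bits=1024):
    """Tanimoto similarity of the Morgan bit-vector fingerprints of two SMILES strings."""
    fps = []
    for smi in (smiles_a, smiles_b):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # MolFromSmiles returns None on a parse failure
            raise ValueError(f"Could not parse SMILES: {smi!r}")
        fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=n_bits))
    return TanimotoSimilarity(fps[0], fps[1])


molecules = {"Ethanol": "CCO", "Propanol": "CCCO", "Butanol": "CCCCO"}

# Compare every pair of molecules at each radius.
for radius in (1, 2, 3, 4):
    for (name_a, smi_a), (name_b, smi_b) in combinations(molecules.items(), 2):
        sim = morgan_tanimoto(smi_a, smi_b, radius=radius)
        print(f"radius={radius}  {name_a} vs. {name_b}: {sim:.4f}")
```

# Swapping in larger or branched molecules is a good way to see the radius start to matter.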
