# Welcome to **ChemCluster** 🧪

**ChemCluster** is a cheminformatics tool for interactive clustering and visualization of molecular datasets. Built as part of the CH-200 course at EPFL, it lets you load your own SMILES data, compute descriptors, perform clustering, and explore chemical space interactively in 2D and 3D.

In this notebook, we’ll walk through the main features of ChemCluster step by step, including descriptor calculation, clustering, centroid visualization, and dataset comparison.

## ⚙️ Installation & Initialization

ChemCluster is available directly from PyPI. You can install it and start the app in just two steps.

### 1. Install the ChemCluster package
Run the following command in a new terminal:
```bash
pip install chemcluster
```
### 2. Launch the interface
Once installed, you can open the ChemCluster app in your browser by typing:
```bash
chemcluster
```
This will launch the Streamlit interface for interactive clustering and visualization.

🧑‍💻 *Note:* If you’re contributing to development or want to run it locally from source, see the instructions on [GitHub](https://github.com/erubbia/ChemCluster).

## 1. Project Objectives

This project was developed to:

1. **Automate molecular property extraction**  
   Compute useful properties (molecular weight, LogP, TPSA, fingerprints) from SMILES using a reproducible pipeline.
2. **Explore and cluster datasets**  
   Let users upload SMILES files or datasets and discover meaningful molecular clusters without manual preprocessing.
3. **Make the analysis interactive**  
   Provide a Streamlit app to adjust clustering parameters and view the 2D or 3D structure of the most representative molecule in each cluster.
4. **Design clean, reusable code**  
   Organize the project as a Python package for easy testing, maintenance, and future extension.

## 2. Project Structure

The code is structured as follows:

```
chemcluster/
├── app.py                 # Streamlit app entry point
├── environment.yml        # Conda environment
├── notebooks/
│   └── project_report.ipynb  # This report
├── src/
│   └── chemcluster/
│       ├── __init__.py
│       ├── descriptors.py     # Molecular descriptors
│       ├── clustering.py      # Clustering logic
│       └── visualization.py   # 2D and 3D rendering
└── tests/                # Unit tests
```

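Keeping the package under `src/` separates the importable code from the app entry point, the notebooks, and the tests, so each module can be developed and tested on its own.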

## 3. Molecular Descriptor Calculation 🧬

We now calculate key chemical descriptors from a list of SMILES strings. These include:
- Molecular weight, LogP, TPSA
- Number of H-bond donors/acceptors
- Number of rotatable bonds, aromatic rings, etc.

Additionally, we compute **ECFP-like fingerprints** (Morgan fingerprints of radius 2, folded to 2048 bits) with RDKit; these serve as the input for clustering.
Here are the core functions defined in `chemcluster/descriptors.py` that we use:

In [3]:
from rdkit import Chem
from rdkit.Chem import (
    Descriptors, Crippen, Lipinski, rdMolDescriptors,
    Draw, AllChem
)
import py3Dmol
from io import BytesIO
import base64

def calculate_properties(mol, mol_name="Unknown"):
    return {
        "Molecule": mol_name,
        "Molecular Weight": round(Descriptors.MolWt(mol), 2),
        "LogP": round(Crippen.MolLogP(mol), 2),
        "H-Bond Donors": Lipinski.NumHDonors(mol),
        "H-Bond Acceptors": Lipinski.NumHAcceptors(mol),
        "TPSA": round(rdMolDescriptors.CalcTPSA(mol), 2),
        "Rotatable Bonds": Lipinski.NumRotatableBonds(mol),
        "Aromatic Rings": Lipinski.NumAromaticRings(mol),
        "Heavy Atom Count": mol.GetNumHeavyAtoms()
    }

def get_fingerprint(mol):
    # Morgan fingerprint, radius 2 (ECFP4-like), folded to a 2048-bit vector
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def clean_smiles_list(smiles_list):
    # Keep only the SMILES that RDKit can parse (MolFromSmiles returns None otherwise)
    mols, valid_smiles = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            mols.append(mol)
            valid_smiles.append(smi)
    return mols, valid_smiles

def show_3d_molecule(mol, confId=-1):
    # Generate a 3D conformer with ETKDG only if the molecule has none yet
    if not mol.GetNumConformers():
        AllChem.EmbedMolecule(mol, AllChem.ETKDG())
    # Export the chosen conformer as a MOL block and render it as sticks
    mb = Chem.MolToMolBlock(mol, confId=confId)
    viewer = py3Dmol.view(width=300, height=300)
    viewer.addModel(mb, 'mol')
    viewer.setStyle({'stick': {}})
    viewer.setBackgroundColor('0xffffff')
    viewer.zoomTo()
    return viewer

def mol_to_base64_img(mol):
    img = Draw.MolToImage(mol, size=(200, 200))
    buffer = BytesIO()
    img.save(buffer, format="PNG")
    buffer.seek(0)
    return "data:image/png;base64," + base64.b64encode(buffer.read()).decode()
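
Fingerprints such as those returned by `get_fingerprint` are commonly compared with the Tanimoto coefficient. The sketch below illustrates that measure in plain Python, with each fingerprint represented as a set of on-bit indices. The helper name `tanimoto` and the toy bit sets are assumptions for illustration and are not part of ChemCluster; for real RDKit bit vectors, `DataStructs.TanimotoSimilarity` would be the usual choice.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    bits_a, bits_b = set(bits_a), set(bits_b)
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Toy on-bit sets (hypothetical, not computed from real molecules)
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 7, 42, 100}

similarity = tanimoto(fp_a, fp_b)  # 3 shared bits out of 6 distinct bits
print(f"Tanimoto similarity: {similarity:.2f}")
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.
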

## 4. Clustering the Molecules 🧩

Once descriptors or fingerprints are computed, we can apply clustering algorithms to group similar molecules.

**ChemCluster** supports:
- K-Means clustering (with PCA-reduced fingerprints)
- DBSCAN (for density-based separation)

Below are example utilities from our clustering module. These use `scikit-learn` and can take either molecular descriptors or fingerprint vectors as input.

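📌 After clustering, each molecule is assigned a cluster label, and a **centroid structure** is selected for each cluster: the member whose fingerprint lies closest to the cluster mean. Points labelled `-1` (DBSCAN noise) are skipped.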

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from rdkit import Chem
import numpy as np

def reduce_fingerprints(fps, n_components=2):
    pca = PCA(n_components=n_components)
    return pca.fit_transform(fps)

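def cluster_with_kmeans(data, n_clusters=5):
    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(data)
    return labels, kmeans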

def cluster_with_dbscan(data, eps=0.5, min_samples=5):
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    labels = dbscan.fit_predict(data)
    return labels, dbscan

def get_cluster_centroids(fps, labels, mols):
    # Work on a float array so the mean and the distances are well defined
    fps = np.asarray(fps, dtype=float)
    centroids = []
    for label in sorted(set(labels)):
        if label == -1:
            continue  # skip DBSCAN noise
        indices = [i for i, lab in enumerate(labels) if lab == label]
        mean_fp = fps[indices].mean(axis=0)
        # Representative = member closest (Euclidean distance) to the cluster mean
        closest = min(indices, key=lambda i: np.linalg.norm(fps[i] - mean_fp))
        centroids.append(mols[closest])
    return centroids
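
To make the centroid selection concrete, here is a dependency-free sketch of the same idea as `get_cluster_centroids`: for each cluster, average the member vectors and pick the member closest to that mean. The function name `closest_to_mean`, the 2D points, and the labels are invented for illustration only.

```python
import math


def closest_to_mean(points, labels):
    """Map each cluster label to the index of the member closest to the cluster mean."""
    representatives = {}
    for label in sorted(set(labels)):
        if label == -1:
            continue  # skip noise, as DBSCAN labels it
        members = [i for i, lab in enumerate(labels) if lab == label]
        dim = len(points[members[0]])
        # Component-wise mean of all member vectors
        mean = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
        representatives[label] = min(members, key=lambda i: math.dist(points[i], mean))
    return representatives


# Hypothetical 2D coordinates (e.g. PCA-reduced fingerprints) and cluster labels
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (10.0, 15.0),
          (50.0, 50.0)]
labels = [0, 0, 0, 1, 1, 1, -1]

reps = closest_to_mean(points, labels)
print(reps)
```

Because the representative is always an actual member of the cluster, it corresponds to a real molecule that can be drawn, unlike the mean vector itself.
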

In [None]:
# Example:
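
As one possible example, the labels returned by `cluster_with_kmeans` or `cluster_with_dbscan` can be summarized before any plotting. The sketch below counts cluster sizes and noise points using only the standard library. The label list is invented for illustration, and `summarize_labels` is a hypothetical helper, not part of ChemCluster.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (cluster sizes keyed by label, number of noise points labelled -1)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # DBSCAN marks noise with -1; K-Means never does
    return dict(sorted(counts.items())), n_noise


# Hypothetical labels, shaped like the output of a DBSCAN run
labels = [0, 0, 1, -1, 1, 1, 2, -1]

sizes, n_noise = summarize_labels(labels)
print(f"{len(sizes)} clusters, sizes={sizes}, noise points={n_noise}")
```
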

## 5. Results and Visualization

*TODO: Insert plots (histogram of MW, PCA scatter, 3D py3Dmol view).*

In [None]:
# TODO: plotting code, e.g.:

## 6. Discussion

**Strengths:**  
- Interactive UI  
- Modular code  

**Limitations:**  
- Scalability to large libraries  
- Fixed default metrics  

**Future work:**  
- Add new similarity measures  
- Parallelize descriptor calculations

## 7. Conclusion

*TODO: Summarize key findings and perspectives.*

## 8. References & Appendix

1. RDKit: https://www.rdkit.org  
2. Pedregosa et al., "Scikit-learn: Machine Learning in Python", JMLR 12 (2011)  

*Appendix: code snippets, test instructions, etc.*


# ChemCluster Project Report

**CH-200 – Practical Programming in Chemistry**  
**Group:** Elisa Rubbia, Romain Guichonnet, Flavia Zabala Perez  
**Date:** May 2025


## Welcome to ChemCluster!

ChemCluster is an interactive platform for analyzing and visualizing chemical datasets with molecular clustering techniques. It uses **RDKit** for cheminformatics and **Streamlit** for the web interface, and is aimed at researchers and students alike.

The platform supports two main modes of operation:

- **Dataset Mode**: This mode allows users to upload datasets (in `.csv` or `.sdf` format) containing multiple molecular structures. These are then clustered based on structural similarity, and representative structures for each cluster are highlighted for further exploration.
- **Single Molecule Mode**: In this mode, ChemCluster generates a set of 3D conformers for a single input molecule. The conformers are clustered, and representative structures are visualized in both 2D and 3D so that users can assess conformational diversity.
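
As a rough sketch of the first step in Dataset Mode, the snippet below reads SMILES strings from CSV text with Python's standard `csv` module. The column name `SMILES` and the helper `read_smiles_column` are assumptions for illustration; the column handling in ChemCluster itself may differ.

```python
import csv
import io


def read_smiles_column(csv_text, column="SMILES"):
    """Return the non-empty entries of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    # Skip rows where the cell is missing or blank
    return [row[column].strip() for row in reader
            if row[column] and row[column].strip()]


# Small in-memory example; the column names are hypothetical
example_csv = """Name,SMILES
ethanol,CCO
benzene,c1ccccc1
missing,
aspirin,CC(=O)Oc1ccccc1C(=O)O
"""

smiles = read_smiles_column(example_csv)
print(smiles)
```

The resulting list could then be validated with RDKit before descriptors or fingerprints are computed.
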

## Setup and Initialization

To begin using ChemCluster, you have two options: installation via PyPI or running the application locally from source.

### Installation from PyPI
The simplest way to install ChemCluster is through the Python Package Index. In a terminal or command prompt, run:

```bash
pip install chemcluster
```
Once installed, you can launch the application by executing:

```bash
chemcluster
```
This command will open the ChemCluster interface directly in your default web browser.

### Running Locally from Source (for Development or Contribution)
If you want to contribute to the project or run it in a local development environment, follow these steps:
1. Clone the repository from GitHub:
```bash
git clone https://github.com/erubbia/ChemCluster
```
2. Navigate into the project directory:
```bash
cd ChemCluster
```
3. Create a conda environment based on the project's environment file:
```bash
conda env create -f environment.yml
```
4. Activate the newly created environment:
```bash
conda activate chemcluster-env
```
5. Finally, install the project in editable mode:
```bash
pip install -e .
```
After this setup, you can launch ChemCluster the same way by running `chemcluster` in the terminal.
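
To confirm that either installation route worked, one option is to check from Python that the package can be found in the active environment. The sketch below uses only the standard library; it assumes the import name is `chemcluster`, matching the package name used with `pip`, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Assumption: the import name matches the PyPI package name
if is_installed("chemcluster"):
    print("chemcluster is importable in this environment")
else:
    print("chemcluster not found; check that the right environment is active")
```
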