<a href="https://colab.research.google.com/github/MehrdadJalali-AI/BlackHole/blob/main/MOFGraph_GAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Explanation of the MOF Similarity Model**

### **1. Overview**
The MOF (Metal-Organic Framework) similarity model computes pairwise similarities between different MOFs using both molecular and numerical feature-based approaches. The resulting adjacency matrix represents these relationships and is used to construct a similarity graph, where strong connections indicate structural or property-based similarity between MOFs.

---

### **2. Components of the Model**

#### **A. Molecular Similarity (Tanimoto Similarity)**
- Uses **Morgan fingerprints** (radius 2, 1024 bits) to encode molecular structures.
- Computes **Tanimoto similarity** for the following molecular components:
  - **Metal SBU (Secondary Building Unit) SMILES**
  - **Linker SMILES**
  - **Metal Cluster SMILES**
  - **Ligand SMILES**
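- A weighted sum of these similarities (weights 0.3, 0.3, 0.2, and 0.2, in the order listed) forms the molecular similarity matrix. A component with a missing or unparsable SMILES contributes 0.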
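
As a minimal illustration of the Tanimoto coefficient and the weighted combination, the sketch below uses small hand-written sets of "on" bit positions in place of real Morgan fingerprints. The bit sets, the helper names `tanimoto` and `molecular_similarity`, and the dictionary keys are hypothetical; only the weights are taken from the notebook's code.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two sets of 'on' bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

# Weights taken from the notebook's similarity computation
WEIGHTS = {"metal_sbu": 0.3, "linker": 0.3, "metal_cluster": 0.2, "ligand": 0.2}

def molecular_similarity(mof_a, mof_b):
    """Weighted sum of per-component Tanimoto scores; a missing component scores 0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        fp_a, fp_b = mof_a.get(name), mof_b.get(name)
        if fp_a is not None and fp_b is not None:
            total += weight * tanimoto(fp_a, fp_b)
    return total

# Hypothetical toy fingerprints (sets of bit positions that are switched on)
mof_1 = {"metal_sbu": {1, 2, 3}, "linker": {10, 11}, "metal_cluster": {1, 2}, "ligand": {7}}
mof_2 = {"metal_sbu": {2, 3, 4}, "linker": {10, 11}, "metal_cluster": {1, 2}, "ligand": None}

score = molecular_similarity(mof_1, mof_2)
print(round(score, 3))
```

Because the weights sum to 1 and each Tanimoto score lies in [0, 1], the combined molecular similarity also stays in [0, 1].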

#### **B. Feature-Based Similarity (Cosine Similarity)**
- Uses **void fraction, accessible surface area (ASA), accessible volume (AV), pore limiting diameter (PLD), largest cavity diameter (LCD), and largest free path diameter (LFPD)**.
- These numerical features are **normalized using Min-Max scaling** before computing **cosine similarity**.
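
The arithmetic behind this step can be sketched with plain NumPy. The notebook itself relies on scikit-learn's `MinMaxScaler` and `cosine_similarity`; the helper names and feature values below are hypothetical and only meant to show what those calls compute.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 with every row
    unit = X / norms
    return unit @ unit.T

# Hypothetical values for three MOFs: void fraction, ASA (A^2), PLD (A)
features = np.array([
    [0.40, 1500.0, 8.0],
    [0.80, 1200.0, 6.0],
    [0.60, 2400.0, 4.0],
])

scaled = min_max_scale(features)
feature_similarity = cosine_similarity_matrix(scaled)
print(np.round(feature_similarity, 3))
```

One consequence worth keeping in mind: a MOF that holds the minimum of every feature is scaled to the zero vector, so its cosine similarity to every other MOF is 0.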

#### **C. Combined Similarity Matrix**
- The final similarity matrix is obtained by averaging **molecular similarity** and **feature similarity**:
  
  \[ \text{Final Similarity} = \frac{\text{Molecular Similarity} + \text{Feature Similarity}}{2} \]

---

### **3. Adjacency Matrix Construction**
- The **adjacency matrix** is created using the computed similarity values.
- A **threshold** is applied, and values below this threshold are set to **zero** (removing weak connections).
- The matrix is **symmetric**, ensuring that similarity(A, B) = similarity(B, A).
- The diagonal elements are set to **1** (each MOF is identical to itself).
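
The averaging, thresholding, and diagonal steps can be sketched on a tiny example. The function name `build_adjacency` and the similarity values are hypothetical; the order of operations follows the notebook's code.

```python
import numpy as np

def build_adjacency(molecular_sim, feature_sim, threshold=0.5):
    """Average two similarity matrices, zero out weak links, set the diagonal to 1."""
    combined = (np.asarray(molecular_sim) + np.asarray(feature_sim)) / 2.0
    adjacency = np.where(combined < threshold, 0.0, combined)
    np.fill_diagonal(adjacency, 1.0)
    return adjacency

# Hypothetical similarity values for three MOFs
molecular_sim = np.array([[1.0, 0.8, 0.1],
                          [0.8, 1.0, 0.3],
                          [0.1, 0.3, 1.0]])
feature_sim = np.array([[1.0, 0.6, 0.5],
                        [0.6, 1.0, 0.9],
                        [0.5, 0.9, 1.0]])

adjacency = build_adjacency(molecular_sim, feature_sim, threshold=0.5)
print(adjacency)
```

Here the pair (0, 2) averages to 0.3 and is removed, while (0, 1) and (1, 2) average to 0.7 and 0.6 and are kept.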

---

### **4. Graph Construction**
- A **graph** is created where:
  - **Nodes** represent MOFs.
  - **Edges** represent strong similarity connections.
  - **Edge weights** correspond to similarity scores.
- Only edges with a similarity above the **threshold** are included.
- A **spring layout** (for example `nx.spring_layout`) can be used to visualize clusters of highly similar MOFs. The code cell below builds the graph but does not draw it.
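
A small sketch of the graph construction and layout step, assuming NetworkX and NumPy are installed. The function name `graph_from_adjacency`, the labels, and the adjacency values are hypothetical.

```python
import networkx as nx
import numpy as np

def graph_from_adjacency(adjacency, labels):
    """Build an undirected weighted graph from the upper triangle of an adjacency matrix."""
    graph = nx.Graph()
    graph.add_nodes_from(labels)
    rows, cols = np.triu_indices(len(labels), k=1)  # pairs with i < j, so no self-loops
    for i, j in zip(rows, cols):
        if adjacency[i, j] > 0:
            graph.add_edge(labels[i], labels[j], weight=float(adjacency[i, j]))
    return graph

# Hypothetical thresholded adjacency matrix for three MOFs
adjacency = np.array([[1.0, 0.7, 0.0],
                      [0.7, 1.0, 0.6],
                      [0.0, 0.6, 1.0]])
labels = ["MOF_A", "MOF_B", "MOF_C"]  # hypothetical refcodes

graph = graph_from_adjacency(adjacency, labels)
positions = nx.spring_layout(graph, weight="weight", seed=42)  # maps node -> (x, y)
# With matplotlib available: nx.draw(graph, positions, with_labels=True)
print(graph.number_of_edges(), sorted(positions))
```

Fixing `seed` makes the layout reproducible between runs, which helps when comparing plots for different thresholds.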

---

### **5. Sparsity Calculation**
- The sparsity of the adjacency matrix is calculated as:
  
  \[ \text{Sparsity} = \left(1 - \frac{\text{Number of Edges}}{\text{Total Possible Edges}}\right) \times 100\% \]
  
- Displays the:
  - **Number of MOFs (Nodes)**
  - **Threshold applied**
  - **Number of strong similarity links (Edges)**
  - **Sparsity percentage**
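
As a worked check of the formula, the sketch below plugs in the node and edge counts reported in the output of the code cell further down (1000 MOFs, 186141 edges). The function name `sparsity_percentage` is hypothetical.

```python
def sparsity_percentage(num_nodes, num_edges):
    """Percentage of possible undirected edges (no self-loops) that are absent."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes to define possible edges")
    total_possible = num_nodes * (num_nodes - 1) // 2
    return (1 - num_edges / total_possible) * 100

# 1000 nodes give 1000 * 999 / 2 = 499500 possible edges
sparsity = sparsity_percentage(1000, 186141)
print(f"{sparsity:.2f}%")
```

A sparsity near 63% means that more than a third of all MOF pairs remain connected at a threshold of 0.5, so the graph is still fairly dense.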

---

### **6. Output and Visualization**
- **MOF_Adjacency_Matrix.csv**: Stores the thresholded adjacency matrix, with MOF refcodes as row and column labels.
- **Graph object**: A NetworkX graph `G` that contains only the strong similarity connections and is ready to be drawn.

---

### **7. Applications**
- **Material Discovery**: Identifying structurally similar MOFs for specific applications.
- **Data-Driven Screening**: Efficiently selecting candidate MOFs for further experimental validation.
- **Clustering & Classification**: Grouping MOFs based on chemical and structural similarities.

---

This model gives a compact way to analyze and visualize MOF relationships, weighting molecular and feature-based similarity equally.



In [19]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import warnings
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from joblib import Parallel, delayed  # Parallel processing

# Suppress warnings
warnings.filterwarnings("ignore")

# User-defined parameters
num_mofs = 1000   # Choose the number of MOFs to analyze
threshold = 0.5  # Set the minimum similarity for connections

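# Load dataset (keep only the first num_mofs rows)
df = pd.read_csv("MOF.csv").head(num_mofs)
num_mofs = len(df)  # Guard against a CSV with fewer rows than requested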

# Function to compute Morgan fingerprints
def get_fingerprint(smile):
    if pd.isna(smile) or smile == "":
        return None
    mol = Chem.MolFromSmiles(str(smile))
    if mol:
        return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    return None

# Compute fingerprints in parallel for efficiency
df["metal_fp"] = Parallel(n_jobs=-1)(delayed(get_fingerprint)(smile) for smile in df["metal_sbu_smile"])
df["linker_fp"] = Parallel(n_jobs=-1)(delayed(get_fingerprint)(smile) for smile in df["linker_smile"])
df["metal_cluster_fp"] = Parallel(n_jobs=-1)(delayed(get_fingerprint)(smile) for smile in df["metal_cluster_smile"])
df["ligand_fp"] = Parallel(n_jobs=-1)(delayed(get_fingerprint)(smile) for smile in df["ligand_smile"])

# Initialize adjacency matrix
adj_matrix = np.zeros((num_mofs, num_mofs))

# Compute molecular similarity (Tanimoto) using the upper triangle
for i in range(num_mofs):
    for j in range(i + 1, num_mofs):
        metal_sim = linker_sim = cluster_sim = ligand_sim = 0

        # Compare against None explicitly rather than relying on fingerprint truthiness
        if df["metal_fp"][i] is not None and df["metal_fp"][j] is not None:
            metal_sim = DataStructs.TanimotoSimilarity(df["metal_fp"][i], df["metal_fp"][j])

        if df["linker_fp"][i] is not None and df["linker_fp"][j] is not None:
            linker_sim = DataStructs.TanimotoSimilarity(df["linker_fp"][i], df["linker_fp"][j])

        if df["metal_cluster_fp"][i] is not None and df["metal_cluster_fp"][j] is not None:
            cluster_sim = DataStructs.TanimotoSimilarity(df["metal_cluster_fp"][i], df["metal_cluster_fp"][j])

        if df["ligand_fp"][i] is not None and df["ligand_fp"][j] is not None:
            ligand_sim = DataStructs.TanimotoSimilarity(df["ligand_fp"][i], df["ligand_fp"][j])

        # Weighted similarity combination (adjustable weights)
        similarity = (0.3 * metal_sim) + (0.3 * linker_sim) + (0.2 * cluster_sim) + (0.2 * ligand_sim)

        # Fill both upper and lower triangle
        adj_matrix[i, j] = similarity
        adj_matrix[j, i] = similarity

# Feature-based similarity using void fraction and geometric properties
feature_cols = ["void_fraction", "asa (A^2)", "av (A^3)", "pld (A)", "lcd (A)", "lfpd (A)"]
features = df[feature_cols].fillna(0)

# ✅ Normalize numeric features before cosine similarity
scaler = MinMaxScaler()
features_normalized = scaler.fit_transform(features)

# Compute feature similarity (Cosine similarity)
feature_similarity = cosine_similarity(features_normalized)

# ✅ Combine molecular and feature similarity (equal weight for both)
adj_matrix = (adj_matrix + feature_similarity) / 2

# ✅ Apply threshold: Set values below threshold to exactly 0
adj_matrix[adj_matrix < threshold] = 0

# ✅ Ensure diagonal values are 1 (MOF identical to itself)
np.fill_diagonal(adj_matrix, 1)

# Convert to DataFrame and save
adj_df = pd.DataFrame(adj_matrix, index=df["Refcode"], columns=df["Refcode"])
adj_df.to_csv("MOF_Adjacency_Matrix.csv")

# --- Graph Visualization ---
G = nx.Graph()

# Add nodes
for refcode in df["Refcode"]:
    G.add_node(refcode)

# Add edges **only for values above threshold**
num_edges = 0
for i in range(num_mofs):
    for j in range(i + 1, num_mofs):
        if adj_matrix[i, j] > 0:  # Now it only adds connections above threshold
            G.add_edge(df["Refcode"][i], df["Refcode"][j], weight=adj_matrix[i, j])
            num_edges += 1

# ✅ Calculate sparsity percentage
total_possible_edges = (num_mofs * (num_mofs - 1)) / 2  # Upper triangle of the adjacency matrix
sparsity_percentage = (1 - (num_edges / total_possible_edges)) * 100

# ✅ Display results
print(f"Number of MOFs (Nodes): {num_mofs}")
print(f"Threshold: {threshold}")
print(f"Number of Links (Edges): {num_edges}")
print(f"Sparsity Percentage: {sparsity_percentage:.2f}%")




Number of MOFs (Nodes): 1000
Threshold: 0.5
Number of Links (Edges): 186141
Sparsity Percentage: 62.73%


# MOF Dataset
The dataset (`MOF.csv`) describes a collection of metal-organic frameworks (MOFs), with one row per structure covering its building blocks and geometric pore properties. The columns are:

**Refcode** – A unique identifier for each structure (e.g., a reference from a crystallographic database like the Cambridge Structural Database).

**metals** – The metal element(s) present in the structure.

**max_metal_coordination_n** – The maximum coordination number of the metal, indicating how many ligands or atoms are directly bonded to it.

**metal_sbu_smile** – SMILES (Simplified Molecular Input Line Entry System) representation of the metal secondary building unit (SBU), which describes its atomic connectivity.

**linker_smile** – The SMILES representation of the organic linker connecting the metal SBUs.

**n_sbu_point_of_extension** – The number of extension points from the metal SBU, representing its connectivity.

**n_linker_point_of_extension** – The number of points where the linker extends and connects to the structure.

**metal_cluster_smile** – SMILES representation of the full metal cluster, which might be more complex than the individual SBU.

**ligand_smile** – SMILES representation of the coordinating ligand(s) attached to the metal.

**n_channel** – The number of channels in the structure, which may relate to porosity or transport properties.

**void_fraction** – The fraction of the unit cell volume that is void or empty space, useful for assessing porosity.

**asa (A^2)** – Accessible surface area in square ångströms (Å²): the internal surface that a probe molecule can reach.

**av (A^3)** – Accessible volume in cubic ångströms (Å³): the pore volume that a probe molecule can reach.

**pld (A)** – Pore limiting diameter in ångströms (Å): the diameter of the largest sphere that can pass through the framework's channels, which limits the size of guest molecules that can diffuse through.

**lcd (A)** – Largest cavity diameter in ångströms (Å): the diameter of the largest sphere that fits inside the pores.

**lfpd (A)** – Largest free path diameter in ångströms (Å): the diameter of the largest sphere that fits along the free path through the structure (the path followed by the pore-limiting sphere).



In [4]:
# Mount drive
from google.colab import drive
import os

drive.mount('/content/drive')
# Change working path
os.chdir('/content/drive/MyDrive/Research/MOF/GAN-NodeGeneration/')
!pip install rdkit

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Collecting rdkit
  Downloading rdkit-2024.9.5-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.0 kB)
Downloading rdkit-2024.9.5-cp311-cp311-manylinux_2_28_x86_64.whl (34.3 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2024.9.5


In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from joblib import Parallel, delayed
import warnings

# Suppress RDKit warnings
from rdkit import rdBase
rdBase.DisableLog("rdApp.*")  # Completely disable RDKit warnings
warnings.filterwarnings("ignore")

# ✅ 1. Load and Process MOF Data
num_mofs = 200  # Number of MOFs to use
threshold = 0.5  # Minimum similarity threshold

df = pd.read_csv("MOF.csv").head(num_mofs)
adj_matrix = pd.read_csv("MOF_Adjacency_Matrix.csv", index_col=0).values

# The saved matrix may cover more MOFs than num_mofs (it was built from 1000 above),
# so keep only the block that matches the rows loaded into df
adj_matrix = adj_matrix[:num_mofs, :num_mofs].copy()
adj_matrix[adj_matrix < threshold] = 0
np.fill_diagonal(adj_matrix, 1)

# Edge list in COO format (shape 2 x num_edges), as torch_geometric expects
edge_index = torch.tensor(np.array(np.nonzero(adj_matrix)), dtype=torch.long)

# Compute Morgan fingerprints for SMILES codes
def get_fingerprint(smile):
    if pd.isna(smile) or smile == "":
        return np.zeros(512, dtype=np.float32)
    mol = Chem.MolFromSmiles(str(smile))
    if mol:
        arr = np.zeros(512, dtype=np.float32)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=512)
        DataStructs.ConvertToNumpyArray(fp, arr)
        return arr
    return np.zeros(512, dtype=np.float32)

fingerprint_features = np.array([
    np.concatenate([
        get_fingerprint(df["metal_sbu_smile"].iloc[i]),
        get_fingerprint(df["linker_smile"].iloc[i]),
        get_fingerprint(df["metal_cluster_smile"].iloc[i]),
        get_fingerprint(df["ligand_smile"].iloc[i])
    ]) for i in range(len(df))
], dtype=np.float32)

feature_cols = ["void_fraction", "asa (A^2)", "av (A^3)", "pld (A)", "lcd (A)", "lfpd (A)"]
scaler = MinMaxScaler()
features_normalized = scaler.fit_transform(df[feature_cols].fillna(0))

final_features = np.concatenate([features_normalized, fingerprint_features], axis=1)
node_features = torch.tensor(final_features, dtype=torch.float32)

mof_graph = Data(x=node_features, edge_index=edge_index)

# ✅ 2. Define Improved Generator & Discriminator
class Generator(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Generator, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.conv1 = GCNConv(hidden_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)
        self.batch_norm = nn.BatchNorm1d(hidden_dim)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x, edge_index):
        x = torch.relu(self.fc1(x))
        x = self.batch_norm(x)
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = self.conv1(x, edge_index)
        x = self.conv2(x, edge_index)
        return torch.sigmoid(x)

class Discriminator(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Discriminator, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, x, edge_index):
        x = self.leaky_relu(self.conv1(x, edge_index))
        x = self.leaky_relu(self.conv2(x, edge_index))
        return torch.sigmoid(self.fc(x))

# ✅ 3. Train GraphGAN
generator = Generator(node_features.shape[1], 128, node_features.shape[1])
discriminator = Discriminator(node_features.shape[1], 128)

optimizer_G = optim.Adam(generator.parameters(), lr=0.001)
optimizer_D = optim.Adam(discriminator.parameters(), lr=0.0005)  # Reduce learning rate

num_epochs = 1000
eps = 1e-8  # Small constant to prevent log(0)
for epoch in range(num_epochs):
    # --- Discriminator step ---
    # Detach fake nodes so this update does not backpropagate into the generator
    fake_nodes = generator(node_features, edge_index).detach()
    real_preds = discriminator(node_features, edge_index)
    fake_preds = discriminator(fake_nodes, edge_index)
    loss_D = -torch.mean(torch.log(torch.clamp(real_preds, eps, 1 - eps)) +
                         torch.log(torch.clamp(1 - fake_preds, eps, 1 - eps)))

    optimizer_D.zero_grad()
    loss_D.backward()
    # Clip before the optimizer step so the clipped gradients are the ones applied
    torch.nn.utils.clip_grad_norm_(discriminator.parameters(), max_norm=1.0)
    optimizer_D.step()

    # --- Generator step ---
    fake_nodes = generator(node_features, edge_index)
    fake_preds = discriminator(fake_nodes, edge_index)
    loss_G = -torch.mean(torch.log(torch.clamp(fake_preds, eps, 1 - eps)))

    optimizer_G.zero_grad()
    loss_G.backward()
    torch.nn.utils.clip_grad_norm_(generator.parameters(), max_norm=1.0)
    optimizer_G.step()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss_D={loss_D.item()}, Loss_G={loss_G.item()}")

print("Training Complete. Generating New MOFs...")

# ✅ 4. Generate New MOFs
generator.eval()  # Disable dropout and use running batch-norm statistics
with torch.no_grad():
    generated_nodes = generator(node_features, edge_index).numpy()

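# Note: dot products of the sigmoid outputs are not bounded to [0, 1], so this
# threshold is not on the same scale as the one applied to the input similarity matrix
generated_adj_matrix = np.dot(generated_nodes, generated_nodes.T)
generated_adj_matrix[generated_adj_matrix < threshold] = 0
np.fill_diagonal(generated_adj_matrix, 1)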

generated_graph = nx.from_numpy_array(generated_adj_matrix)

print("Generated MOFs Ready for Further Processing.")

# ✅ 5. Display Features of Generated Nodes
# The first six values are min-max-scaled (0 to 1), not physical units
print("Features of Generated MOFs (min-max scaled):")
for i, node in enumerate(generated_nodes):
    print(f"MOF {i+1}:")
    print(f"  Void Fraction: {node[0]:.4f}")
    print(f"  ASA (A^2): {node[1]:.4f}")
    print(f"  AV (A^3): {node[2]:.4f}")
    print(f"  PLD (A): {node[3]:.4f}")
    print(f"  LCD (A): {node[4]:.4f}")
    print(f"  LFPD (A): {node[5]:.4f}")
    print("  Molecular Fingerprint: [Truncated for Display]")
    print("----------------------------------------------------")

Epoch 0: Loss_D=1.6007205247879028, Loss_G=1.1248561143875122
Epoch 100: Loss_D=0.010129088535904884, Loss_G=8.614158630371094
Epoch 200: Loss_D=0.022129282355308533, Loss_G=11.760248184204102
Epoch 300: Loss_D=0.007892061956226826, Loss_G=14.634784698486328
Epoch 400: Loss_D=0.004156322218477726, Loss_G=15.811095237731934
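
The generator ends in a sigmoid and is trained against min-max-scaled features, so the six geometric values it prints lie between 0 and 1 rather than in physical units. A minimal sketch of mapping them back is shown below; the function name `inverse_min_max` and all numbers are hypothetical. In the notebook, `scaler.inverse_transform(generated_nodes[:, :6])` should perform the same mapping, assuming `scaler` is the `MinMaxScaler` fitted on the six feature columns.

```python
import numpy as np

def inverse_min_max(scaled, data_min, data_max):
    """Map values from [0, 1] back to the original range of each column."""
    scaled = np.asarray(scaled, dtype=float)
    data_min = np.asarray(data_min, dtype=float)
    data_max = np.asarray(data_max, dtype=float)
    return scaled * (data_max - data_min) + data_min

# Hypothetical per-column minima and maxima of the training data
# (void fraction, ASA in A^2, PLD in A)
data_min = np.array([0.1, 200.0, 2.0])
data_max = np.array([0.9, 3000.0, 12.0])

generated_scaled = np.array([[0.5, 0.25, 1.0]])  # hypothetical generator output
generated_physical = inverse_min_max(generated_scaled, data_min, data_max)
print(generated_physical)
```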


# 🔹 Features of This Script

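✔ Loads the MOF adjacency matrix & removes weak links

✔ Builds node features from min-max-scaled geometric properties and Morgan fingerprints

✔ Creates a GraphGAN whose generator and discriminator are built from GCN layers

✔ Generates new node feature vectors and derives an adjacency matrix from their dot products

✔ Prints the six generated geometric features for each node

✘ Not implemented in the cell above: converting generated graphs back to SMILES, validating chemical structures with RDKit, and visualizing the generated MOFs as a graph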
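
For reference, the two objectives minimized in the training loop above can be written out as follows. The notation is introduced here for illustration and is not taken from the code: \(X\) is the node feature matrix, \(E\) the edge index, \(N\) the number of nodes, \(G\) the generator, and \(D\) the discriminator, which returns one probability per node.

\[ \mathcal{L}_D = -\frac{1}{N}\sum_{i=1}^{N}\left[\log D(X, E)_i + \log\left(1 - D(G(X, E), E)_i\right)\right] \]

\[ \mathcal{L}_G = -\frac{1}{N}\sum_{i=1}^{N}\log D(G(X, E), E)_i \]

In the code, each probability is clamped to \([\epsilon, 1 - \epsilon]\) with \(\epsilon = 10^{-8}\) before the logarithm is taken. Note that the generator receives the real node features \(X\) as input rather than random noise.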

# 🔹 Future Enhancements
Replace placeholder SMILES with a real MOF decoding function.

Improve Generator & Discriminator using deeper Graph Convolutions.

Implement property-based optimization (e.g., optimizing void fraction, ASA).

Use Reinforcement Learning for property-aware MOF generation.

In [24]:
pip install torch-geometric

Collecting torch-geometric
  Using cached torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
Using cached torch_geometric-2.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch-geometric
Successfully installed torch-geometric-2.6.1
