## PyTorch Geometric

PyTorch Geometric is a PyTorch library for training graph neural networks (GNNs).

One "simple" architecture is the Graph Convolutional Network (Kipf and Welling).

References:
- http://tkipf.github.io/graph-convolutional-networks/
- https://proceedings.neurips.cc/paper/2015/hash/f9be311e65d81a9ad8150a60844bb94c-Abstract.html
- https://arxiv.org/abs/1609.02907
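
For reference, the layer-wise propagation rule of the Graph Convolutional Network, as stated in the arXiv paper listed above (Kipf and Welling), can be written as follows. The notation is the paper's, not PyTorch Geometric's.

```latex
H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right),
\qquad \tilde{A} = A + I_N,
\qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}
```

Here $A$ is the adjacency matrix of the graph, $I_N$ the identity matrix (self-loops), $H^{(l)}$ the matrix of node features at layer $l$, $W^{(l)}$ the trainable weight matrix of that layer and $\sigma$ an activation function such as ReLU.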


Install PyTorch Geometric (see https://pytorch-geometric.readthedocs.io/en/latest/):

In [None]:
! pip install torch_geometric rdkit

Simple graphs are represented by the torch_geometric.data.Data class: https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data

The constructor takes, among other arguments, a tensor of node features (`x`) and a tensor describing the edges (`edge_index`).

Edges are represented by a 2D tensor of shape [2, num_edges] (see the example at https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html), where each column represents one edge.
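
As a small illustration of this layout, the sketch below turns a list of undirected edges into the two-row, one-column-per-edge form. The helper name `to_coo` is hypothetical and not part of PyTorch Geometric; plain Python lists are used so that the block runs without any dependency.

```python
def to_coo(undirected_edges):
    """Return [sources, targets] with each undirected edge stored in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

# Node indices are 0-based: a path 0 - 1 - 2.
edge_index = to_coo([(0, 1), (1, 2)])
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Wrapping the result in `torch.tensor(edge_index, dtype=torch.long)` gives a tensor of shape `[2, num_edges]` that can be passed as `edge_index` to `Data`.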


Create a simple undirected graph with three nodes in which node 1 is connected to nodes 2 and 3.
Set the node values to 1, 2 and 3.

In [None]:
import torch
from torch_geometric.data  import Data
import rdkit
from rdkit.Chem import MolFromSmiles, Draw
import networkx as nx
from torch_geometric.datasets import MoleculeNet
from torch_geometric.utils import to_dense_adj
from torch_geometric.utils import to_networkx
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader is deprecated
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool


In [None]:
! [ -e outputs ] || mkdir outputs

In [None]:
OUTPUT_DIR_PATH = 'outputs'

In [None]:
to_output = lambda file_path : os.path.join(OUTPUT_DIR_PATH,file_path)

In [None]:
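# Node features: one value per node (shape [3, 1])
nodes = torch.tensor([[1.], [2.], [3.]])
# Each column is one directed edge (source on the first row, target on the
# second). Nodes are indexed from 0, so "node 1" is index 0. An undirected
# edge is stored once in each direction.
edges = torch.tensor([[0, 1, 0, 2],
                      [1, 0, 2, 0]], dtype=torch.long)
first_graph = Data(x=nodes, edge_index=edges)
first_graph.to_dict()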

# Ex 1: Explore a dataset in PyTorch Geometric


To better understand how molecules map to graphs, we will load and explore one of the molecule datasets that ship with the torch_geometric library.

**1. Load the HIV dataset from the torch_geometric.datasets.MoleculeNet module.**

In [None]:
molecule_net = MoleculeNet(root='.',name='HIV')


**2. How many graphs are there in this dataset ? Print out the number of features and the number of classes for this dataset.**

In [None]:
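graph_number = len(molecule_net)
print(f'There are {graph_number} graphs in the dataset')
print(f'Number of node features: {molecule_net.num_features}')
print(f'Number of classes: {molecule_net.num_classes}')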

**3. Get the first graph in this dataset. Print out the number of nodes, the number of edges, the number of features and the adjacency matrix of this graph. Is this graph undirected?**

In [None]:
first_graph = molecule_net.get(0)
n_nodes = first_graph.num_nodes
n_edges = first_graph.num_edges
n_features = first_graph.num_features
adj_matrix = to_dense_adj(first_graph.edge_index)
print(f"{n_nodes = }")
print(f"{n_edges = }")
print(f"{n_features = }")
print(f"is undirected: {first_graph.is_undirected()}")
print(f"{adj_matrix = }")
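
To make the link between `edge_index` and the adjacency matrix explicit, here is a dependency-free sketch that builds a dense adjacency matrix from the two rows of an edge index and checks whether it is symmetric, which is what "undirected" means in this representation. The helper names are hypothetical and plain Python lists stand in for tensors.

```python
def dense_adjacency(edge_index, num_nodes):
    """Build a num_nodes x num_nodes 0/1 matrix from [sources, targets]."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in zip(edge_index[0], edge_index[1]):
        adj[src][dst] = 1
    return adj

def is_symmetric(adj):
    """True when every edge (i, j) has a matching edge (j, i)."""
    n = len(adj)
    return all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))

edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]  # path 0 - 1 - 2, both directions
adj = dense_adjacency(edge_index, num_nodes=3)
for row in adj:
    print(row)
print(is_symmetric(adj))  # True
```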


**4. Draw this graph using networkx (it's already installed with pytorch_geometric) and torch_geometric.utils.to_networkx.**


In [None]:
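# to_undirected=True merges the two directions of each bond into a single edge
nx.draw(to_networkx(first_graph, to_undirected=True), with_labels=True)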

**5. (Optional) Get the SMILES string of this molecule and draw its structure with RDKit. Does the structure of this molecule look like the graph that you drew in step 4?**

In [None]:
smiles = first_graph['smiles']
Draw.MolToImage(MolFromSmiles(smiles))

# Ex 2: Convert a molecule to a graph

A single graph in PyTorch Geometric is described by an instance of the torch_geometric.data.Data class. So, in order to use a graph neural network in PyTorch Geometric, we need to convert molecules to torch_geometric.data.Data objects.


The **mol2graph(mol, y, smiles)** function below converts a molecule (RDKit format) into a graph (a torch_geometric.data.Data object).


First, run the cell below to load the lookup tables it relies on.

In [None]:
x_map = {
    'atomic_num':
    list(range(0, 119)),
    'chirality': [
        'CHI_UNSPECIFIED',
        'CHI_TETRAHEDRAL_CW',
        'CHI_TETRAHEDRAL_CCW',
        'CHI_OTHER',
    ],
    'degree':
    list(range(0, 11)),
    'formal_charge':
    list(range(-5, 7)),
    'num_hs':
    list(range(0, 9)),
    'num_radical_electrons':
    list(range(0, 5)),
    'hybridization': [
        'UNSPECIFIED',
        'S',
        'SP',
        'SP2',
        'SP3',
        'SP3D',
        'SP3D2',
        'OTHER',
    ],
    'is_aromatic': [False, True],
    'is_in_ring': [False, True],
}




e_map = {
    'bond_type': [
        'misc',
        'SINGLE',
        'DOUBLE',
        'TRIPLE',
        'AROMATIC',
    ],
    'stereo': [
        'STEREONONE',
        'STEREOZ',
        'STEREOE',
        'STEREOCIS',
        'STEREOTRANS',
        'STEREOANY',
    ],
    'is_conjugated': [False, True],
}

In [None]:
x_map.keys()

In [None]:
def mol2graph(mol, y, smiles):

    
    xs = []

    for atom in mol.GetAtoms():

        x = []

        x.append(x_map['atomic_num'].index(atom.GetAtomicNum()))
        # The atomic number is the number of protons in the nucleus of an atom

        x.append(x_map['chirality'].index(str(atom.GetChiralTag())))

        x.append(x_map['degree'].index(atom.GetTotalDegree()))
        # The total degree is the number of neighbouring atoms, hydrogens included

        x.append(x_map['formal_charge'].index(atom.GetFormalCharge()))
        x.append(x_map['num_hs'].index(atom.GetTotalNumHs()))
        x.append(x_map['num_radical_electrons'].index(
            atom.GetNumRadicalElectrons()))
        x.append(x_map['hybridization'].index(str(atom.GetHybridization())))
        x.append(x_map['is_aromatic'].index(atom.GetIsAromatic()))
        x.append(x_map['is_in_ring'].index(atom.IsInRing()))

        xs.append(x)



    x = torch.tensor(xs, dtype=torch.float).view(-1, 9)

    #print("x", x)

    edge_indices, edge_attrs = [], []
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()

        e = []
        e.append(e_map['bond_type'].index(str(bond.GetBondType())))
        e.append(e_map['stereo'].index(str(bond.GetStereo())))
        e.append(e_map['is_conjugated'].index(bond.GetIsConjugated()))

        edge_indices += [[i, j], [j, i]]
        edge_attrs += [e, e]

    edge_index = torch.tensor(edge_indices)
    edge_index = edge_index.t().to(torch.long).view(2, -1)
    edge_attr = torch.tensor(edge_attrs, dtype=torch.long).view(-1, 3)

    y = torch.tensor(y, dtype=torch.long)

    # Sort the edges by (source, target) and reorder the attributes accordingly.
    if edge_index.numel() > 0:
        perm = (edge_index[0] * x.size(0) + edge_index[1]).argsort()
        edge_index, edge_attr = edge_index[:, perm], edge_attr[perm]

    data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y = y, smiles=smiles)

    return data
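
The sorting step in `mol2graph` orders the edges by source node first and target node second. It does so by collapsing each `(source, target)` pair into the single integer `source * num_nodes + target` and sorting on that key. Below is a dependency-free sketch of the same idea, using plain lists instead of tensors; the variable names and the example edges are illustrative.

```python
num_nodes = 4
sources = [2, 0, 1, 0]
targets = [1, 3, 2, 1]

# One integer key per edge; keys compare like (source, target) tuples
# because every target is smaller than num_nodes.
keys = [s * num_nodes + t for s, t in zip(sources, targets)]

# Equivalent of tensor.argsort(): positions that would sort the keys.
perm = sorted(range(len(keys)), key=keys.__getitem__)

sorted_sources = [sources[p] for p in perm]
sorted_targets = [targets[p] for p in perm]
print(keys)            # [9, 3, 6, 1]
print(perm)            # [3, 1, 2, 0]
print(sorted_sources)  # [0, 0, 1, 2]
print(sorted_targets)  # [1, 3, 2, 1]
```

In `mol2graph`, the same permutation is applied to `edge_attr` so that each attribute row stays aligned with its edge.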

**1. Use this function to convert an acetic acid molecule to a graph. The function takes three parameters as inputs: the RDKit molecule (mol), the label of the graph (y, here whether the molecule is active or not) and the SMILES string of the molecule. The SMILES string of acetic acid is "CC(O)=O", and in this case you can choose the label y = 1.**

In [None]:
smiles_acetic_acid = "CC(O)=O"
mol = MolFromSmiles(smiles_acetic_acid)
y = 1
acetic_acid_graph = mol2graph(mol,y,smiles_acetic_acid)
acetic_acid_graph

**2. How many features do the nodes have? What are they? Print out the "edge_index" of the acetic acid graph.**

In [None]:
n_features = acetic_acid_graph.num_features
print(f'{n_features = }')
print(acetic_acid_graph.edge_index)

According to the code of `mol2graph`, these features correspond to the properties listed in `x_map`:

In [None]:
print(list(x_map.keys()))
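
Each node feature is stored as the position of the raw value in the corresponding `x_map` list, which is what the `.index(...)` calls in `mol2graph` do. The sketch below replays this encoding on a hand-written atom description. The reduced table `small_map` and the `atom` dictionary are made-up stand-ins for illustration, not RDKit output.

```python
# Reduced copy of two lookup tables from x_map, for illustration only.
small_map = {
    'formal_charge': list(range(-5, 7)),
    'hybridization': ['UNSPECIFIED', 'S', 'SP', 'SP2', 'SP3', 'SP3D', 'SP3D2', 'OTHER'],
}

# Hypothetical raw values for one atom.
atom = {'formal_charge': 0, 'hybridization': 'SP2'}

# Replace every raw value by its position in the matching table.
encoded = [small_map[name].index(value) for name, value in atom.items()]
print(encoded)  # [5, 3]
```

A formal charge of 0 is therefore stored as 5, because the table starts at -5. The numbers in `x` are category indices rather than physical quantities.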

# Build a Graph Neural Network (GNN)

 In the next exercises of this notebook, we will build a graph neural network to predict the ability of molecules to inhibit a protein known as ERK2. For this purpose, we will use compounds derived from the DUD-E database.

# Ex 3: Create dataset

The file named "active_data.csv" consists of more than 300 active and decoy molecules. The dataset is made of two components:

-  Chemical structural data on compounds: each chemical compound is described under the SMILES format.

-  ERK2-activity : it corresponds to the screening result evaluating the activity (1) or the inactivity (0) of the chemical compound.

**1. Read the "active_data.csv" file into a pandas dataframe. How many active molecules and how many decoy molecules are there?**

In [None]:
active_data_path = os.path.join('/kaggle/input/active-data-2','active_data.csv')
active_df = pd.read_csv(active_data_path)
active_df.head()

In [None]:
n_actives_mol = len(active_df[active_df['is_active'] == 1])
n_decoys_mol = len(active_df[active_df['is_active'] == 0])
print(f'{n_actives_mol} active molecules')
print(f'{n_decoys_mol} decoy molecules')
print(f'{len(active_df)} total molecules')

**2. From this dataframe, create a list of RDKit molecules.**

In [None]:
molecules = list(map(MolFromSmiles,active_df['SMILES']))

**3. Use the mol2graph(mol, y, smiles) function to convert the list of RDKit molecules into a list of torch_geometric.data.Data objects. Name this list "list_data".**

In [None]:
# Reuse the RDKit molecules built in step 2, paired with their label and SMILES string
list_data = [
    mol2graph(mol, int(is_active), smiles)
    for mol, is_active, smiles in zip(molecules, active_df['is_active'], active_df['SMILES'])
]

**4. Plot a histogram to see the ratio between active and inactive compounds.**

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('activity')
ax.set_ylabel('count')
# np.unique returns the sorted labels (0, then 1) together with their counts
unique_vals, counts = np.unique(active_df['is_active'], return_counts=True)
ax.bar(['not active', 'active'], counts, label='number of molecules', color=(0.6, 0, 1))
ax.legend()
fig.savefig(to_output('active_mol_hist.png'))
plt.show()


# EX 4. Create training set and test set
In this exercise, we will prepare a training set and a test set.

**1. Shuffle the "list_data" list that you've created above.**

In [None]:
random.seed(123)
random.shuffle(list_data)

**2. Take the first 300 molecules for "train_dataset" and the rest for "test_dataset".**

In [None]:
train_dataset = list_data[:300]
test_dataset = list_data[300:]

# Ex5: Create DataLoader

Usually a graph classification task trains on a lot of graphs, and it will be very inefficient to use only one graph at a time when training the model.

PyTorch Geometric opts for building a single giant graph from a list of graphs by stacking adjacency matrices in a diagonal fashion; node and target features are simply concatenated in the node dimension.

A single giant graph is automatically built from a list of graphs with DataLoader.
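
The batching scheme described above can be sketched without any library: the node indices of each graph are shifted by the number of nodes that precede it, the edge lists are concatenated, and a `batch` vector records which graph every node came from. The function name `collate_graphs` and the dictionary layout are assumptions made for this sketch; PyTorch Geometric's own implementation is more general.

```python
def collate_graphs(graphs):
    """Merge small graphs into one disconnected graph.

    Each graph is a dict with 'num_nodes' and 'edge_index' ([sources, targets]).
    """
    sources, targets, batch = [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        # Shift node indices so that they stay unique in the merged graph.
        sources += [s + offset for s in g['edge_index'][0]]
        targets += [t + offset for t in g['edge_index'][1]]
        # Remember the graph of origin of every node.
        batch += [graph_id] * g['num_nodes']
        offset += g['num_nodes']
    return [sources, targets], batch

g1 = {'num_nodes': 2, 'edge_index': [[0, 1], [1, 0]]}
g2 = {'num_nodes': 3, 'edge_index': [[0, 1, 1, 2], [1, 0, 2, 1]]}
edge_index, batch = collate_graphs([g1, g2])
print(edge_index)  # [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
print(batch)       # [0, 0, 1, 1, 1]
```

Because no edge links nodes of different graphs, message passing on the merged graph gives the same result as processing each graph separately.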

1. Create **train_loader** and **test_loader** from **train_dataset** and **test_dataset** by using the torch_geometric.loader.DataLoader class.

In [None]:
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# The evaluation set does not need to be shuffled
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

2. Get a batch from **train_loader**. Print out the number of graphs and data of this batch.

In [None]:
batch = next(iter(train_loader))
print(f'There are {batch.num_graphs} graphs in this batch')
batch

# Ex 6: Graph Neural Network Layer

 Let's test a graph neural network layer. This kind of layer is available in the torch_geometric.nn module. It plays a role similar to that of a linear layer in a multi-layer perceptron (MLP).

In [None]:
smiles_acetic_acid = "CC(O)=O"
mol_acetic_acid = rdkit.Chem.MolFromSmiles(smiles_acetic_acid)
graph_acetic_acid = mol2graph(mol_acetic_acid, y = 1, smiles = smiles_acetic_acid)

 **1. Create an instance of the torch_geometric.nn.GCNConv class. You need to choose two parameters: the number of input features and the number of output (hidden) channels.**

In [None]:
gnn = GCNConv(in_channels=graph_acetic_acid.num_features,out_channels=16)

**2. Apply it to the graph of acetic acid.**

In [None]:
output = gnn(graph_acetic_acid.x,graph_acetic_acid.edge_index)
output

**3. What is the output? What is its size?**

In [None]:
print(f'{output.size()}')
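
To see where the output shape comes from, the sketch below applies the neighbourhood aggregation of a GCN layer (self-loops plus symmetric degree normalisation) to a tiny graph, using plain Python lists. It deliberately leaves out the learned weight matrix and the bias that `GCNConv` also applies, so it keeps the input feature size; the function name is hypothetical.

```python
import math

def gcn_propagate(adj, features):
    """Compute D^-1/2 (A + I) D^-1/2 X for small dense inputs (no weights, no bias)."""
    n = len(adj)
    # Add self-loops: A_tilde = A + I
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of every node in A_tilde
    deg = [sum(row) for row in a_tilde]
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if a_tilde[i][j]:
                coeff = 1.0 / math.sqrt(deg[i] * deg[j])
                for k in range(dim):
                    out[i][k] += coeff * features[j][k]
    return out

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path 0 - 1 - 2
features = [[1.0], [2.0], [3.0]]         # one feature per node
out = gcn_propagate(adj, features)
print([len(out), len(out[0])])           # [3, 1]
```

The result has one row per node. In `GCNConv`, the additional multiplication by a weight matrix of shape `[in_channels, out_channels]` changes the number of columns to `out_channels`.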

# EX 7: global_mean_pool Layer

As we saw in Ex 6, the output of a GNN layer is a tensor of size (number of nodes, 16), that is (4, 16) for acetic acid. However, for a graph classification task, the label of a graph is a single scalar. So, we need to aggregate the node embeddings into a unified graph embedding (a step known as the readout layer) before training a final classifier. Let's see what the output of a global_mean_pool layer is.


1. Pass the **out_GCN_layer** variable to the global_mean_pool function. Store the result in a variable named **out_GMP_layer**

In [None]:

# GMP stands for global mean pool
data_for_test_GMP_layer = DataLoader([graph_acetic_acid], batch_size=1 )
conv_test = GCNConv(9, 16)


data = next(iter(data_for_test_GMP_layer))
out_GCN_layer = conv_test(data.x, data.edge_index)

#### TO DO #####

out_GMP_layer = global_mean_pool(out_GCN_layer,data.batch)

2. Print out the shape of `out_GMP_layer`.

In [None]:
print(out_GMP_layer)
print("shape of output_GMP_layer ", out_GMP_layer.size() )
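
What the readout computes can be reproduced by hand: all node embeddings that carry the same graph id in the `batch` vector are averaged into one row. The sketch below is a plain-Python illustration with made-up numbers; the function name `mean_pool` is hypothetical and is not the library implementation.

```python
def mean_pool(node_embeddings, batch):
    """Average the rows of node_embeddings that share the same graph id."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for row, graph_id in zip(node_embeddings, batch):
        counts[graph_id] += 1
        for k, value in enumerate(row):
            sums[graph_id][k] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# Five nodes with 2-dimensional embeddings, split over two graphs.
node_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, 2.0], [4.0, 1.0]]
batch = [0, 0, 1, 1, 1]
pooled = mean_pool(node_embeddings, batch)
print(pooled)  # [[2.0, 3.0], [2.0, 1.0]]
```

The output has one row per graph and keeps the embedding dimension, which is why a batch containing a single graph yields a tensor with one row.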

# EX 8: Building a graph network for graph classification task with Pytorch geometric


In this exercise, we will create a network that classifies whether a molecule is active. This network consists of 5 layers:

1. A graph convolution network layer conv1

2. Another GCN layer conv2

3. Another GCN layer conv3

4. A torch_geometric.nn.global_mean_pool layer

5. A linear layer

The ReLU activation function is used after the first two layers.

Complete the lines below (after #TODO) to finish the definition of this network.

In [None]:
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool

import torch
import torch.nn.functional as F
from torch.nn import Linear

#### TODO
n_features = 9



class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        torch.manual_seed(12)
        self.conv1 = GCNConv(n_features, 8)
        self.conv2 = GCNConv(8, 16)
        self.conv3 = GCNConv(16, 32)

        # TO DO
        self.linear = Linear(32, 2)


    def forward(self, x, edge_index, batch):


        # 1. Obtain node embeddings
        # First GCN layer
        x = self.conv1(x, edge_index)
        x = F.relu(x)

        ## TODO
        # Second GCN layer
        x =  self.conv2(x,edge_index)
        x =  F.relu(x)


        # Third GCN layer
        x = self.conv3(x, edge_index)


        ### TODO #####
        # 2. Readout layer
        x = global_mean_pool(x,batch)


        # 3. Linear Layer
        ## TO DO
        x = self.linear(x)

        return x


# Ex 9: Create network


1. Create the network, then print out the model and look at its text representation.

In [None]:
model = Net()

2. Define an optimizer. You should use the torch.optim.Adam class.

In [None]:
optimizer = torch.optim.Adam(model.parameters(),lr=0.001)

3. Define a loss function. You should use the CrossEntropyLoss class.

In [None]:
from torch.nn import CrossEntropyLoss
loss_func = CrossEntropyLoss()
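
`CrossEntropyLoss` expects raw scores (logits), one per class, together with integer class labels. It applies a log-softmax internally and, by default, averages the negative log-likelihood over the batch. The plain-Python sketch below reproduces that arithmetic on a hand-made batch, which shows why the network can end with a linear layer and no softmax. The function name and the numbers are illustrative.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits)."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp, with the maximum subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[label]
    return total / len(labels)

logits = [[2.0, 0.0], [0.0, 0.0]]  # two graphs, two classes
labels = [0, 1]
loss = cross_entropy(logits, labels)
print(loss)
```

The first graph is classified with some confidence and contributes a small loss. The second one has equal scores for both classes and contributes log(2).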

In [None]:
print( model)

# Ex 10: Train model for an epoch

Write a function named **train()** that allows us to train a model for an epoch.

The tasks that the function should execute:


0. Iterate in batches over the train_loader

1. Perform a single forward pass

2. Compute the loss

3. Derive the gradient

4. Update parameters

5. Clear gradients



Complete the lines (with ?????) below to finish the function.

In [None]:
def train(loader=train_loader):
    model.train()
    # The optimizer and the loss are created here so that they always refer to
    # the current global `model`.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
    loss_func = CrossEntropyLoss()
    for data in loader:
        # 1. Forward pass
        out = model(data.x, data.edge_index, data.batch)
        # 2. Compute the loss
        loss = loss_func(out, data.y)
        # 3. Compute the gradients
        loss.backward()
        # 4. Update the parameters (weights)
        optimizer.step()
        # 5. Clear the gradients
        optimizer.zero_grad()

# Ex 11: Test

Similar to Ex 10, write a function named **test(loader)** that computes the accuracy of the model on the dataset served by "loader".

The steps to calculate the accuracy of a classification model:

1. Iterate in batches over the loader.

2. Compute the output of the model

3. Find the class with highest probability

4. Count the predictions that match the ground-truth labels

5. Compute the accuracy
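
The steps above can be sketched on a hand-made batch of scores, using plain Python lists instead of tensors. The function name and the numbers are illustrative. When the data comes in several batches, the count of correct predictions has to be accumulated inside the loop over batches and divided by the total number of graphs at the end.

```python
def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(pred == label)
    return correct / len(labels)

logits = [[0.2, 1.5], [3.0, -1.0], [0.1, 0.4], [2.0, 2.5]]
labels = [1, 0, 0, 1]
print(accuracy(logits, labels))  # 0.75
```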


Complete the lines (with ?????) to finish the definition of this function.

In [None]:
def test(loader):
    model.eval()
    correct = 0
    # No gradients are needed for evaluation
    with torch.no_grad():
        for data in loader:
            # Output of the model
            out = model(data.x, data.edge_index, data.batch)
            # Use the class with the highest score
            pred = out.argmax(dim=1)
            # Check against ground-truth labels, accumulating over all batches
            correct += int((pred == data.y).sum())
    return correct / len(loader.dataset)

# Ex 12: Training model

Train the model for 100 epochs.

Compute the accuracy on train_loader and test_loader by using the **train()** function and the **test(loader)** function.


In [None]:
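from tqdm import tqdm
epochs = 100
model = Net()
print(model)
for epoch in tqdm(range(epochs)):
    train(train_loader)
# Report the accuracy on both splits
train_acc = test(train_loader)
test_acc = test(test_loader)
print(f'train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}')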

In [None]:
test(test_loader)