# Graph Convolutions

- Understanding graph representations and graph convolution models

![Molecular Graph](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/assets/basic_graphs.gif?raw=1)



## Graph Convolution Concepts

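- Standard CNN
 - Used for image processing, time-series processing, and similar tasks
 - The signal passes through several convolutional layers; each layer applies filters to neighboring samples to extract abstract patterns
 - Pooling is applied at intervals to condense information and to tolerate shifts in the pattern signal (features)
- Graph CNN
 - Similar to a standard CNN, but it operates on input represented as a graph rather than on images or time-series data
 - Convolution and pooling use only the samples connected through the graph, not every surrounding sample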
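
The idea in the list above can be sketched on a toy graph with plain NumPy. This is only an illustration under simplifying assumptions: the names `graph_conv` and `graph_pool`, the chain graph, and the single shared pair of weight matrices are all hypothetical, and DeepChem's actual layers are more elaborate.

```python
import numpy as np

# Toy graph: 4 nodes in a chain 0-1-2-3, each with a 2-dimensional feature vector.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [2.0, 2.0],
                     [3.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def graph_conv(features, neighbors, w_self, w_nbr):
    """Combine each node's own features with the sum of its neighbors' features."""
    out = np.zeros((features.shape[0], w_self.shape[1]))
    for node, nbrs in neighbors.items():
        nbr_sum = features[nbrs].sum(axis=0)
        out[node] = features[node] @ w_self + nbr_sum @ w_nbr
    return np.tanh(out)

def graph_pool(features, neighbors):
    """Max-pool each node's features over itself and its graph neighbors."""
    return np.stack([features[[node] + nbrs].max(axis=0)
                     for node, nbrs in sorted(neighbors.items())])

# Hypothetical fixed weights; a real layer would learn these.
w_self = np.eye(2)
w_nbr = 0.5 * np.eye(2)

conv_out = graph_conv(features, neighbors, w_self, w_nbr)
pool_out = graph_pool(features, neighbors)
print(conv_out.shape, pool_out.shape)
```

Unlike an image filter, which always sees a fixed grid of neighbors, each node here only reads from the nodes it is connected to, so the number of inputs varies per node.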


# import

In [None]:
!pip install deepchem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepchem
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Installing collected packages: rdkit-pypi, deepchem
Successfully installed deepchem-2.6.1 rdkit-pypi-2022.3.5


In [None]:
import deepchem as dc
from deepchem.models.layers import GraphConv, GraphPool, GraphGather
from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol
import tensorflow as tf
import tensorflow.keras.layers as layers
import numpy as np

# Data
- Download the Tox21 dataset
- Apply featurizer='GraphConv'

In [None]:
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

# Model Definition, Training, and Evaluation
- Use the GraphConvModel model

In [None]:
n_tasks = len(tasks)
model = dc.models.GraphConvModel(n_tasks, mode='classification')
model.fit(train_dataset, nb_epoch=50)
metric1 = dc.metrics.Metric(dc.metrics.roc_auc_score)
metric2 = dc.metrics.Metric(dc.metrics.accuracy_score)
print('Training set score:', model.evaluate(train_dataset, [metric1, metric2], transformers))
print('Test set score:', model.evaluate(test_dataset, [metric1, metric2], transformers))

Training set score: {'roc_auc_score': 0.9709350136389215, 'accuracy_score': 0.8844721157939549}
Test set score: {'roc_auc_score': 0.6980784946614246, 'accuracy_score': 0.7615858843537415}


# Implementing Graph Convolutions Directly (Reference)
- `GraphConv` layer: performs the graph convolution
- `GraphPool` layer: performs max-pooling over the feature vectors of neighboring nodes
- `GraphGather` layer: collects node-level (atom) features and computes a graph-level (molecule) feature vector
- In addition, [Dense](https://keras.io/api/layers/core_layers/dense/), [BatchNormalization](https://keras.io/api/layers/normalization_layers/batch_normalization/), and [Softmax](https://keras.io/api/layers/activation_layers/softmax/) are used
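
To make the node-to-graph step concrete, here is a minimal NumPy sketch of a readout. It assumes the molecule vector is the per-molecule sum and per-molecule max of the atom features, concatenated and passed through tanh; that choice, and the name `gather_readout`, are assumptions for illustration and may not match what `GraphGather` does internally.

```python
import numpy as np

def gather_readout(atom_features, membership, n_molecules):
    """Reduce per-atom features to one vector per molecule (sum and max, concatenated)."""
    n_feat = atom_features.shape[1]
    sums = np.zeros((n_molecules, n_feat))
    maxes = np.full((n_molecules, n_feat), -np.inf)
    for feat, mol in zip(atom_features, membership):
        sums[mol] += feat                            # accumulate the atom into its molecule
        maxes[mol] = np.maximum(maxes[mol], feat)    # track the element-wise maximum
    return np.tanh(np.concatenate([sums, maxes], axis=1))

# Three atoms: the first two belong to molecule 0, the last to molecule 1.
atom_features = np.array([[0.1, 0.2], [0.3, -0.1], [0.5, 0.5]])
membership = np.array([0, 0, 1])

mol_features = gather_readout(atom_features, membership, n_molecules=2)
print(mol_features.shape)
```

Whatever the number of atoms, every molecule ends up with a vector of the same length, which is what lets a dense output layer follow the readout.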

In [None]:
batch_size = 100

class MyGraphConvModel(tf.keras.Model):

  def __init__(self):
    super(MyGraphConvModel, self).__init__()
    self.gc1 = GraphConv(128, activation_fn=tf.nn.tanh)
    self.batch_norm1 = layers.BatchNormalization()
    self.gp1 = GraphPool()

    self.gc2 = GraphConv(128, activation_fn=tf.nn.tanh)
    self.batch_norm2 = layers.BatchNormalization()
    self.gp2 = GraphPool()

    self.dense1 = layers.Dense(256, activation=tf.nn.tanh)
    self.batch_norm3 = layers.BatchNormalization()
    self.readout = GraphGather(batch_size=batch_size, activation_fn=tf.nn.tanh)

    self.dense2 = layers.Dense(n_tasks*2)
    self.logits = layers.Reshape((n_tasks, 2))
    self.softmax = layers.Softmax()

  def call(self, inputs):
    gc1_output = self.gc1(inputs)
    batch_norm1_output = self.batch_norm1(gc1_output)
    gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

    gc2_output = self.gc2([gp1_output] + inputs[1:])
    batch_norm2_output = self.batch_norm2(gc2_output)
    gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

    dense1_output = self.dense1(gp2_output)
    batch_norm3_output = self.batch_norm3(dense1_output)
    readout_output = self.readout([batch_norm3_output] + inputs[1:])

    logits_output = self.logits(self.dense2(readout_output))
    return self.softmax(logits_output)

In [None]:
# Wrap the Keras model in a DeepChem KerasModel
model = dc.models.KerasModel(MyGraphConvModel(), loss=dc.models.losses.CategoricalCrossEntropy())

In [None]:
# Each input is a ConvMol object
test_dataset.X[0]

<deepchem.feat.mol_graphs.ConvMol at 0x7f3dc9102450>

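# Creating the Input Data

- The model operates on ndarray inputs, so a function is needed that generates X, y, and w from `ConvMol` objects
- The data must be generated in batches
- Key variables:
 - `atom_features`: a feature vector for each atom, of length 75
 - `degree_slice`: an index that locates the atoms of a given degree (exposed as `deg_slice` on the `ConvMol` object)
 - `membership`: defines which molecule each atom belongs to (atom `i` belongs to molecule `membership[i]`)
 - `deg_adjs`: lists of adjacent atoms, grouped by atom degree

- [Implementation source code](https://github.com/deepchem/deepchem/blob/master/deepchem/feat/mol_graphs.py)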
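
The variables above can be illustrated with a small hand-built batch. The sketch below is a NumPy approximation, not the `ConvMol` code itself: it assumes atoms are sorted by degree and that each row of the slice table holds a `(start, count)` pair, and the arrays `atom_mol` and `atom_deg` are made-up data.

```python
import numpy as np

# Hypothetical batch of two molecules with five atoms in total.
atom_mol = np.array([0, 0, 0, 1, 1])   # molecule index of each atom
atom_deg = np.array([1, 2, 1, 1, 1])   # number of bonded neighbors of each atom

# Sort atoms by degree so that atoms of equal degree are contiguous.
order = np.argsort(atom_deg, kind="stable")
membership = atom_mol[order]           # atom i belongs to molecule membership[i]
sorted_deg = atom_deg[order]

# One (start, count) row per degree, locating that degree's atoms in the sorted order.
max_degree = 2
deg_slice = np.zeros((max_degree + 1, 2), dtype=int)
start = 0
for deg in range(max_degree + 1):
    count = int(np.sum(sorted_deg == deg))
    deg_slice[deg] = (start, count)
    start += count

print(membership.tolist())
print(deg_slice.tolist())
```

Grouping atoms by degree means all atoms with the same number of neighbors can be processed together as one dense array, instead of looping over atoms one at a time.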

## Data Generator
- Define a function that keeps generating X, y, and w automatically

In [None]:
def data_generator(dataset, epochs=1):
  for ind, (X_b, y_b, w_b, ids_b) in enumerate(dataset.iterbatches(batch_size, 
              epochs, deterministic=False, pad_batches=True)):
    multiConvMol = ConvMol.agglomerate_mols(X_b)
    inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, 
              np.array(multiConvMol.membership)]
              
    for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
      inputs.append(multiConvMol.get_deg_adjacency_lists()[i])
    labels = [to_one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)]
    weights = [w_b]
    yield (inputs, labels, weights)
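
The label line in the generator reshapes binary labels into one-hot pairs per task. The following sketch reproduces that shape transformation with plain NumPy, using `np.eye` as a stand-in for `to_one_hot`; the batch `y_demo` and the name `n_tasks_demo` are invented for illustration.

```python
import numpy as np

n_tasks_demo = 3                       # hypothetical number of tasks
y_demo = np.array([[0, 1, 0],          # 2 molecules x 3 binary task labels
                   [1, 1, 0]])

flat = y_demo.flatten().astype(int)    # shape (6,)
one_hot = np.eye(2)[flat]              # shape (6, 2): class 0 -> [1, 0], class 1 -> [0, 1]
labels = one_hot.reshape(-1, n_tasks_demo, 2)

print(labels.shape)
```

The resulting `(batch, n_tasks, 2)` shape lines up with the model output produced by `Reshape((n_tasks, 2))` followed by `Softmax`, so the cross-entropy loss can compare them element by element.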

- Train the model with fit_generator(generator)
 - The generator is produced by the data_generator function defined above

In [None]:
model.fit_generator(data_generator(train_dataset, epochs=50))

0.2700279998779297

## Performance Evaluation
- Use the generator defined above

In [None]:
print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric1, metric2], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric1, metric2], transformers))

Training set score: {'roc_auc_score': 0.7812785269538275, 'accuracy_score': 0.9147486772486774}
Test set score: {'roc_auc_score': 0.6251479217537751, 'accuracy_score': 0.8997916666666667}
