<a href="https://colab.research.google.com/github/PsorTheDoctor/Sekcja-SI/blob/master/neural_networks/NSL/document_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Graph regularization: Document classification using graphs

In [0]:
!pip install tensorflow-gpu==2.0.1

In [2]:
!pip install -q neural-structured-learning


## Dependencies and imports

In [3]:
from __future__ import absolute_import, division, print_function, unicode_literals

import neural_structured_learning as nsl

import tensorflow as tf

# Reset the notebook state
tf.keras.backend.clear_session()

print('Version: ', tf.__version__)
print('Eager mode: ', tf.executing_eagerly())
print('GPU is', 'available' if tf.test.is_gpu_available() else 'NOT AVAILABLE')

Version:  2.0.1
Eager mode:  True
GPU is available


## Downloading the Cora dataset

In [4]:
%%bash
wget --quiet -P /tmp https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
tar -C /tmp -xvzf /tmp/cora.tgz

cora/
cora/README
cora/cora.content
cora/cora.cites
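
For orientation, here is a minimal sketch of how one record of `cora.content` could be parsed. It assumes the layout described in the dataset README: a paper ID, a run of 0/1 word indicators, and a class label, separated by tabs. The sample line is made up and shortened to five word indicators (real records carry 1433), and `parse_content_line` is a hypothetical helper, not part of NSL.

```python
def parse_content_line(line):
    """Splits one tab-separated record into (paper_id, words, label)."""
    fields = line.rstrip('\n').split('\t')
    paper_id = fields[0]
    words = [int(v) for v in fields[1:-1]]  # binary bag-of-words indicators
    label = fields[-1]
    return paper_id, words, label


# Made-up record with 5 word indicators instead of 1433.
sample = '31336\t0\t1\t0\t0\t1\tNeural_Networks\n'
paper_id, words, label = parse_content_line(sample)
print(paper_id, words, label)
```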


## Converting the Cora data to the NSL format

In [5]:
!wget https://raw.githubusercontent.com/tensorflow/neural-structured-learning/master/neural_structured_learning/examples/preprocess/cora/preprocess_cora_dataset.py

!python preprocess_cora_dataset.py \
--input_cora_content=/tmp/cora/cora.content \
--input_cora_graph=/tmp/cora/cora.cites \
--max_nbrs=5 \
--output_train_data=/tmp/cora/train_merged_examples.tfr \
--output_test_data=/tmp/cora/test_examples.tfr

--2020-03-20 23:38:12--  https://raw.githubusercontent.com/tensorflow/neural-structured-learning/master/neural_structured_learning/examples/preprocess/cora/preprocess_cora_dataset.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11419 (11K) [text/plain]
Saving to: ‘preprocess_cora_dataset.py’


2020-03-20 23:38:12 (127 MB/s) - ‘preprocess_cora_dataset.py’ saved [11419/11419]

Reading graph file: /tmp/cora/cora.cites...
Done reading 5429 edges from: /tmp/cora/cora.cites (0.01 seconds).
Making all edges bi-directional...
Done (0.01 seconds). Total graph nodes: 2708
Joining seed and neighbor tf.train.Examples with graph edges...
Done creating and writing 2155 merged tf.train.Examples (1.19 seconds).
Out-degree histogram: [(1, 386), (2, 468), (3, 452), (4, 309), (5
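
The log above mentions two graph steps: edges are made bi-directional, and each seed example is then joined with its neighbors (capped by `--max_nbrs`). A rough sketch of that idea in plain Python follows. The edge list is a toy example, the helper names are hypothetical, and the real script may select and order neighbors differently.

```python
from collections import defaultdict


def make_bidirectional(edges):
    """Returns an adjacency map that contains every edge in both directions."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def top_neighbors(graph, node, max_nbrs):
    """Returns at most max_nbrs neighbors of node (sorted for determinism)."""
    return sorted(graph[node])[:max_nbrs]


toy_edges = [('a', 'b'), ('c', 'a'), ('a', 'd')]
graph = make_bidirectional(toy_edges)
print(top_neighbors(graph, 'a', max_nbrs=2))
print(top_neighbors(graph, 'b', max_nbrs=2))
```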

## Global variables

In [0]:
### Experiment dataset
TRAIN_DATA_PATH = '/tmp/cora/train_merged_examples.tfr'
TEST_DATA_PATH = '/tmp/cora/test_examples.tfr'

### Constants used to identify neighbor features in the input
NBR_FEATURE_PREFIX = 'NL_nbr_'
NBR_WEIGHT_SUFFIX = '_weight'
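
These constants combine with a neighbor index and a feature name to form the keys under which neighbor data is stored in each example. Below is a small sketch of the naming scheme, using the same format strings as the parsing code in this notebook (`neighbor_keys` is a hypothetical helper):

```python
NBR_FEATURE_PREFIX = 'NL_nbr_'
NBR_WEIGHT_SUFFIX = '_weight'


def neighbor_keys(index, feature_name='words'):
    """Builds the feature and weight keys for the neighbor at `index`."""
    feature_key = '{}{}_{}'.format(NBR_FEATURE_PREFIX, index, feature_name)
    weight_key = '{}{}{}'.format(NBR_FEATURE_PREFIX, index, NBR_WEIGHT_SUFFIX)
    return feature_key, weight_key


for i in range(2):
    print(neighbor_keys(i))
```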

## Hyperparameters

In [0]:
class HParams(object):
  '''Hyperparameters used for training.'''
  def __init__(self):
    # dataset parameters
    # Cora has 7 classes (labels 0-6); the 3 extra output units are never used as targets.
    self.num_classes = 10
    self.max_seq_length = 1433
    # neural graph learning parameters
    self.distance_type = nsl.configs.DistanceType.L2
    self.graph_regularization_multiplier = 0.1
    self.num_neighbors = 1
    # model architecture
    self.num_fc_units = [50, 50]
    # training parameters
    self.train_epochs = 100
    self.batch_size = 128
    self.dropout_rate = 0.5
    # evaluation parameters
    self.eval_steps = None  # All instances in the test set are evaluated.

HPARAMS = HParams()
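
To make the role of `graph_regularization_multiplier`, `distance_type` and `num_neighbors` concrete, here is a plain-Python sketch of the general shape of a graph-regularized objective: the supervised loss plus the multiplier times a weighted distance between the model's output for a sample and its output for each neighbor. It assumes a squared L2 distance summed over the output vector. The exact reduction used inside NSL may differ, and the function names are illustrative only.

```python
def squared_l2(p, q):
    """Squared L2 distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def graph_regularized_loss(supervised_loss, sample_pred, nbr_preds,
                           nbr_weights, multiplier):
    """Adds the weighted neighbor distances to the supervised loss."""
    graph_loss = sum(
        w * squared_l2(sample_pred, nbr)
        for nbr, w in zip(nbr_preds, nbr_weights))
    return supervised_loss + multiplier * graph_loss


# Toy numbers: one sample with one neighbor (num_neighbors = 1).
loss = graph_regularized_loss(
    supervised_loss=0.5,
    sample_pred=[0.5, 0.5],
    nbr_preds=[[1.0, 0.0]],
    nbr_weights=[1.0],
    multiplier=0.1)
print(loss)
```

With a neighbor weight of 0.0 the second term vanishes and the function returns the supervised loss unchanged.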

## Loading the training and test data

In [0]:
def parse_example(example_proto):
  '''Parses one serialized tf.train.Example into (features, label).'''
  # Missing 'words' default to all zeros; a missing label defaults to -1.
  feature_spec = {
      'words': tf.io.FixedLenFeature([HPARAMS.max_seq_length],
                                     tf.int64,
                                     default_value=tf.constant(
                                         0,
                                         dtype=tf.int64,
                                         shape=[HPARAMS.max_seq_length])),
      'label': tf.io.FixedLenFeature((), tf.int64, default_value=-1)
  }

  # Add the spec for each neighbor's features and its edge weight.
  for i in range(HPARAMS.num_neighbors):
    nbr_feature_key = '{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'words')
    nbr_weight_key = '{}{}{}'.format(NBR_FEATURE_PREFIX, i, NBR_WEIGHT_SUFFIX)
    # Neighbor features default to zeros when the neighbor is absent.
    feature_spec[nbr_feature_key] = tf.io.FixedLenFeature(
        [HPARAMS.max_seq_length],
        tf.int64,
        default_value=tf.constant(
            0, dtype=tf.int64, shape=[HPARAMS.max_seq_length]))

    # A weight of 0.0 marks a missing neighbor.
    feature_spec[nbr_weight_key] = tf.io.FixedLenFeature(
        [1], tf.float32, default_value=tf.constant([0.0]))

  features = tf.io.parse_single_example(example_proto, feature_spec)
  label = features.pop('label')
  return features, label


def make_dataset(file_path, training=False):
  '''Builds a batched tf.data.Dataset from a TFRecord file.'''
  dataset = tf.data.TFRecordDataset([file_path])

  # Shuffle only the training data.
  if training:
    dataset = dataset.shuffle(10000)

  dataset = dataset.map(parse_example)
  dataset = dataset.batch(HPARAMS.batch_size)

  return dataset


train_dataset = make_dataset(TRAIN_DATA_PATH, training=True)
test_dataset = make_dataset(TEST_DATA_PATH)
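
Because `batch()` is called without `drop_remainder`, the last batch of each epoch can be smaller than the rest. A quick back-of-the-envelope check, using the 2155 merged training examples reported by the preprocessing log and the batch size from `HPARAMS` (the constant names below are introduced only for this illustration):

```python
import math

NUM_TRAIN_EXAMPLES = 2155  # merged examples reported by the preprocessing log
BATCH_SIZE = 128           # HPARAMS.batch_size

steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)
last_batch_size = NUM_TRAIN_EXAMPLES - (steps_per_epoch - 1) * BATCH_SIZE
print(steps_per_epoch, last_batch_size)
```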

In [17]:
for feature_batch, label_batch in train_dataset.take(1):
  print('Feature list:', list(feature_batch.keys()))
  print('Batch of inputs:', feature_batch['words'])
  nbr_feature_key = '{}{}_{}'.format(NBR_FEATURE_PREFIX, 0, 'words')
  nbr_weight_key = '{}{}{}'.format(NBR_FEATURE_PREFIX, 0, NBR_WEIGHT_SUFFIX)
  print('Batch of neighbor inputs:', feature_batch[nbr_feature_key])
  print('Batch of neighbor weights:',
        tf.reshape(feature_batch[nbr_weight_key], [-1]))
  print('Batch of labels:', label_batch)

Feature list: ['NL_nbr_0_weight', 'NL_nbr_0_words', 'words']
Batch of inputs: tf.Tensor(
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]], shape=(128, 1433), dtype=int64)
Batch of neighbor inputs: tf.Tensor(
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]], shape=(128, 1433), dtype=int64)
Batch of neighbor weights: tf.Tensor(
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1.], shape=(128,), dtype=float32)
Batch of labels: tf.Tensor(
[1 3 1 6 1 2 1 2 2 2 4 5 3 2 2 5 3 5 2 2 2 3 4 2 0 2 6 3 5 3 6 4 3 2 2 1 2
 2 1 6 1 1 2 0 0 1 2 0 2 

In [18]:
for feature_batch, label_batch in test_dataset.take(1):
  print('Feature list:', list(feature_batch.keys()))
  print('Batch of inputs:', feature_batch['words'])
  nbr_feature_key = '{}{}_{}'.format(NBR_FEATURE_PREFIX, 0, 'words')
  nbr_weight_key = '{}{}{}'.format(NBR_FEATURE_PREFIX, 0, NBR_WEIGHT_SUFFIX)
  print('Batch of neighbor inputs:', feature_batch[nbr_feature_key])
  print('Batch of neighbor weights:',
        tf.reshape(feature_batch[nbr_weight_key], [-1]))
  print('Batch of labels:', label_batch)

Feature list: ['NL_nbr_0_weight', 'NL_nbr_0_words', 'words']
Batch of inputs: tf.Tensor(
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]], shape=(128, 1433), dtype=int64)
Batch of neighbor inputs: tf.Tensor(
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]], shape=(128, 1433), dtype=int64)
Batch of neighbor weights: tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.], shape=(128,), dtype=float32)
Batch of labels: tf.Tensor(
[5 2 2 2 1 2 6 3 2 3 6 1 3 6 4 4 2 3 3 0 2 0 5 2 1 0 6 3 6 4 2 2 3 0 4 2 2
 2 2 3 2 2 2 0 2 2 2 2 4 

## Building the model
### Sequential model

In [0]:
def make_mlp_sequential_model(hparams):
  '''Creates a sequential multi-layer perceptron.'''
  model = tf.keras.Sequential()
  model.add(
      tf.keras.layers.InputLayer(
          input_shape=(hparams.max_seq_length,), name='words'))
  # The input is already one-hot encoded in integer format.
  # Here we cast it to floating point.
  model.add(
      tf.keras.layers.Lambda(lambda x: tf.keras.backend.cast(x, tf.float32)))
  for num_units in hparams.num_fc_units:
    model.add(tf.keras.layers.Dense(num_units, activation='relu'))
    model.add(tf.keras.layers.Dropout(hparams.dropout_rate))
  # Use hparams.num_classes; a bare num_classes is undefined in this scope.
  model.add(tf.keras.layers.Dense(hparams.num_classes, activation='softmax'))
  return model

### Functional model

In [0]:
def make_mlp_functional_model(hparams):
  '''Creates a multi-layer perceptron based on the functional API.'''
  inputs = tf.keras.Input(
      shape=(hparams.max_seq_length,), dtype='int64', name='words')

  # The input is already one-hot encoded in integer format.
  # Here we cast it to floating point.
  cur_layer = tf.keras.layers.Lambda(
      lambda x: tf.keras.backend.cast(x, tf.float32))(inputs)

  for num_units in hparams.num_fc_units:
    cur_layer = tf.keras.layers.Dense(num_units, activation='relu')(cur_layer)
    cur_layer = tf.keras.layers.Dropout(hparams.dropout_rate)(cur_layer)

  outputs = tf.keras.layers.Dense(
      hparams.num_classes, activation='softmax')(cur_layer)

  model = tf.keras.Model(inputs, outputs=outputs)
  return model

## Creating the base model

In [28]:
# Create a base MLP model using the functional API.
# Alternatively, a sequential model can be created with the
# make_mlp_sequential_model() function defined above.
base_model_tag, base_model = 'FUNCTIONAL', make_mlp_functional_model(HPARAMS)
base_model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
words (InputLayer)           [(None, 1433)]            0         
_________________________________________________________________
lambda_2 (Lambda)            (None, 1433)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 50)                71700     
_________________________________________________________________
dropout_4 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 50)                2550      
_________________________________________________________________
dropout_5 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                510 
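
The parameter counts in the summary above can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit. A short sketch of that arithmetic, with the layer widths taken from the summary (`dense_params` is a helper introduced only for this check):

```python
def dense_params(num_inputs, num_units):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_units + num_units


layer_sizes = [1433, 50, 50, 10]  # input width, two hidden layers, output
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```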

## Training the MLP model

In [29]:
# Compile and train
base_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
base_model.fit(train_dataset, epochs=HPARAMS.train_epochs, verbose=1)

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f0006ff3588>

## Evaluating the MLP model

In [0]:
# Helper function that prints the evaluation metrics.
def print_metrics(model_desc, eval_metrics):

  print('\n')
  print('Eval accuracy for ', model_desc, ': ', eval_metrics['accuracy'])
  print('Eval loss for ', model_desc, ': ', eval_metrics['loss'])
  if 'graph_loss' in eval_metrics:
    print('Eval graph loss for ', model_desc, ': ', eval_metrics['graph_loss'])

In [33]:
eval_results = dict(
    zip(base_model.metrics_names,
        base_model.evaluate(test_dataset, steps=HPARAMS.eval_steps)))
print_metrics('Base MLP model', eval_results)



Eval accuracy for  Base MLP model :  0.78842676
Eval loss for  Base MLP model :  1.242060649394989


## Training the MLP model with graph regularization

In [0]:
# Build a new base MLP model.
base_reg_model_tag, base_reg_model = 'FUNCTIONAL', make_mlp_functional_model(HPARAMS)

In [43]:
graph_reg_config = nsl.configs.make_graph_reg_config(
    max_neighbors=HPARAMS.num_neighbors,
    multiplier=HPARAMS.graph_regularization_multiplier,
    distance_type=HPARAMS.distance_type,
    sum_over_axis=-1)
graph_reg_model = nsl.keras.GraphRegularization(base_reg_model,
                                                graph_reg_config)
graph_reg_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
graph_reg_model.fit(train_dataset, epochs=HPARAMS.train_epochs, verbose=1)

Epoch 1/100


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 2/100
...
Epoch 7

<tensorflow.python.keras.callbacks.History at 0x7f00069f45c0>

In [44]:
eval_results = dict(
    zip(graph_reg_model.metrics_names,
        graph_reg_model.evaluate(test_dataset, steps=HPARAMS.eval_steps)))
print_metrics('MLP + graph regularization', eval_results)



Eval accuracy for  MLP + graph regularization :  0.8119349
Eval loss for  MLP + graph regularization :  1.1085497200489045
Eval graph loss for  MLP + graph regularization :  0.0
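
To put the two evaluation results side by side, the sketch below converts the reported accuracies into a gain in percentage points and into approximate counts of correctly classified documents. It assumes the test split holds every node that was not written to the training file (2708 total nodes minus 2155 training examples, both from the preprocessing log); the constant names are introduced only for this comparison.

```python
BASE_ACCURACY = 0.78842676       # base MLP, from the evaluation output
GRAPH_REG_ACCURACY = 0.8119349   # MLP + graph regularization

# Assumption: the test split holds every node not written to the training file.
NUM_TEST_EXAMPLES = 2708 - 2155

gain_points = (GRAPH_REG_ACCURACY - BASE_ACCURACY) * 100
base_correct = round(BASE_ACCURACY * NUM_TEST_EXAMPLES)
graph_reg_correct = round(GRAPH_REG_ACCURACY * NUM_TEST_EXAMPLES)
print('gain: {:.2f} percentage points'.format(gain_points))
print('correct predictions:', base_correct, 'vs', graph_reg_correct)
```

The graph loss of 0.0 on the test set is consistent with the all-zero neighbor weights printed earlier for the test batches.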
