<a href="https://colab.research.google.com/github/AmbiTyga/73String/blob/main/GNN_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading The Repository to GNN
> The repository used in this notebook is from research [Text Level Graph Neural Network for Text Classification](https://www.aclweb.org/anthology/D19-1345.pdf)

With some normal changes in dataset input format and training procedure, we can develop our own Text based GNN model for classification tasks

In [1]:
!git clone https://github.com/LindgeW/TextLevelGNN.git

Cloning into 'TextLevelGNN'...
remote: Enumerating objects: 56, done.[K
remote: Counting objects: 100% (56/56), done.[K
remote: Compressing objects: 100% (47/47), done.[K
remote: Total 56 (delta 22), reused 30 (delta 8), pack-reused 0[K
Unpacking objects: 100% (56/56), done.


# Downloading Glove and packages

In [2]:
# For optimizers we use huggingface's transformers package
!pip install transformers -q

[K     |████████████████████████████████| 1.8MB 10.9MB/s 
[K     |████████████████████████████████| 3.2MB 39.1MB/s 
[K     |████████████████████████████████| 890kB 36.7MB/s 
[?25h  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone


In [None]:
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip

--2021-02-13 04:33:16--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2021-02-13 04:33:16--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2021-02-13 04:33:17--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip

In [6]:
!unzip glove.840B.300d.zip

Archive:  glove.840B.300d.zip
  inflating: glove.840B.300d.txt     


# Importing packages

In [12]:
import time
import random
import numpy as np
import pandas as pd
import torch
from TextLevelGNN.modules.model import GNNModel
from TextLevelGNN.modules.optimizer import Optimizer
from TextLevelGNN.config.conf import arg_config, path_config
from TextLevelGNN.utils.datautil import create_vocab, batch_variable
import torch.nn.functional as F
from TextLevelGNN.logger.logger import logger
from TextLevelGNN.utils.dataset import DataSet, DataLoader
import torch.nn.utils as nn_utils
import json

# Setting up Parameters

In [13]:

data_paths = {
  "dataset": {
    "train": "./train.txt",
    "test": "./val.txt"
  },
  "pre_embed": {
    "word_embedding": "./glove.840B.300d.txt"
  }
}

with open('arg_data.json','w') as f:
  json.dump(data_paths,f)

In [14]:
# Argument Class to initialize network
class ARGS:
  def __init__(self):
    self.cuda=0
    self.learning_rate=0.001
    self.beta1=0.9
    self.beta2=0.98
    self.eps=1e-9
    self.warmup_step=10000
    self.decay=0.95
    self.decay_step=10000
    self.weight_decay=1e-4
    self.scheduler='linear'
    self.grad_clip=5.
    self.max_step=50000
    self.patient=10

    self.batch_size=32
    self.epoch=20
    self.update_step=1
    self.test_batch_size=50
    self.bert_lr=2e-5
    self.bert_layers=4
    self.bert_embed_dim=400
    self.wd_embed_dim=300
    self.tag_embed_dim=150
    self.char_embed_dim = 300

    self.hidden_size=128
    self.char_hidden_size=150
    self.dec_hidden_size=300
    self.rnn_depth=2
    self.enc_bidi=True

    self.mpe=600
    self.att_drop=0.1
    self.embed_drop=0.2
    self.rnn_drop=0.2
    self.dropout=0.33

    self.model_chkp='model.pkl'
    self.vocab_chkp='vocab.pkl'

args = ARGS()

# Loading Training and testing Dataset
Here we make simple changes in input format by converting the datasheet files to a text file, having delimiter a special character "`|`", to differentiate between different attributes in datasheet.

In [16]:
train = pd.read_csv('train.csv',header=None)
test = pd.read_csv('test.csv',header=None)

# Since header is none, heading is now at index:0
train.drop(index=0,inplace = True)
test.drop(index=0,inplace = True)

# Saving as text file, with delimiter: '|' for repositories internal processing
train.to_csv('train.txt',index = False,header = None,sep='|')
test.to_csv('test.txt',index = False,header = None,sep='|')

## Creating a validation data for training
To make the GNN learn, we sample a validation dataset from the training set. We use stratify feature from sklearn's train_test_split method to sample our data based on class or tags.

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_val = train_test_split(
    train.values, test_size=0.2, random_state=2021,stratify = train[0])

In [42]:
pd.DataFrame(X_train).to_csv('train.txt',sep='|',index = False,header = None)
pd.DataFrame(X_val).to_csv('val.txt',sep='|',index = False,header = None)

# Custom Trainer function
To be compatible with the notebook we customize a trainer class to build data loaders, train the Network, Validate and predict for the given data points.

In [20]:

class Trainer(object):
    def __init__(self, args, vocabs):
        self.vocabs = vocabs
        self.args = args
        self.model = GNNModel(num_node=len(vocabs['word']),
                              embedding_dim=args.wd_embed_dim,
                              num_cls=len(vocabs['label']),
                              pre_embed=vocabs['word'].embeddings).to(args.device)
        print(self.model)
        self.train_set = None
        self.val_set = None
        self.test_set = None

    def set_dataset(self, data_path):
        self.train_set = DataSet(data_path['dataset']['train'])
        self.val_set = DataSet(data_path['dataset']['test'])
        print(f'Train Size: {len(self.train_set)}, Val Size: {len(self.val_set)}')

    def train(self):
        params = filter(lambda p: p.requires_grad, self.model.parameters())
        optimizer = Optimizer(params, args)
        patient = 0
        best_dev_acc= 0
        for ep in range(1, self.args.epoch+1):
            train_loss, train_acc = self.train_iter(ep, self.train_set, optimizer)

            dev_acc = self.eval(self.val_set)
            if dev_acc > best_dev_acc:
                best_dev_acc = dev_acc
                
                patient = 0
            else:
                patient += 1

            print('[Epoch %d] train loss: %.4f, lr: %f, Train ACC: %.4f, Dev ACC: %.4f, Best Dev ACC: %.4f, patient: %d' % (
                    ep, train_loss, optimizer.get_lr(), train_acc, dev_acc, best_dev_acc, patient))

            if patient >= args.patient:
                break

    def train_iter(self, ep, train_set, optimizer):
        t1 = time.time()
        train_acc, train_loss = 0., 0.
        train_loader = DataLoader(train_set, batch_size=self.args.batch_size, shuffle=True)
        self.model.train()
        for i, batcher in enumerate(train_loader):
            batch = batch_variable(batcher, self.vocabs)
            batch.to_device(self.args.device)
            pred = self.model(batch.x, batch.nx, batch.ew)
            loss = F.nll_loss(pred, batch.y)
            loss.backward()
            nn_utils.clip_grad_norm_(filter(lambda p: p.requires_grad, self.model.parameters()),
                                     max_norm=args.grad_clip)
            optimizer.step()
            self.model.zero_grad()

            loss_val = loss.data.item()
            train_loss += loss_val
            train_acc += (pred.data.argmax(dim=-1) == batch.y).sum().item()

            print('[Epoch %d] Iter%d time cost: %.2fs, lr: %.6f, train acc: %.4f, train loss: %.4f' % (
                ep, i + 1, (time.time() - t1), optimizer.get_lr(), train_acc/len(train_set), loss_val))

        return train_loss/len(train_set), train_acc/len(train_set)

    def eval(self, test_set):
        nb_correct, nb_total = 0, 0
        test_loader = DataLoader(test_set, batch_size=self.args.test_batch_size)
        self.model.eval()
        with torch.no_grad():
            for i, batcher in enumerate(test_loader):
                batch = batch_variable(batcher, self.vocabs)
                batch.to_device(self.args.device)
                pred = self.model(batch.x, batch.nx, batch.ew)
                nb_correct += (pred.data.argmax(dim=-1) == batch.y).sum().item()
                nb_total += len(batch.y)
        return nb_correct / nb_total

    def predict(self,file_name):
      out = []
      texts = DataSet(file_name)
      test_loader = DataLoader(texts, batch_size=len(texts))
      self.model.eval()
      with torch.no_grad():
        for i, batcher in enumerate(test_loader):
                batch = batch_variable(batcher, self.vocabs,training = False)
                batch.to_device(self.args.device)
                pred = self.model(batch.x, batch.nx, batch.ew)
                out.append([x[:5] for x in pred.data.argsort(dim=1,descending = True)])
      return out


# Training

> We train the network using different schedulers
to check their performance

## Constant Scheduler

In [None]:

np.random.seed(2343)
random.seed(1347)
torch.manual_seed(1453)
torch.cuda.manual_seed(1347)
torch.cuda.manual_seed_all(1453)

print('cuda available:', torch.cuda.is_available())
print('cuDNN available:', torch.backends.cudnn.enabled)
print('gpu numbers:', torch.cuda.device_count())

if torch.cuda.is_available() and args.cuda >= 0:
    args.device = torch.device('cuda', args.cuda)
    torch.cuda.empty_cache()
else:
    args.device = torch.device('cpu')

data_path = path_config('/content/arg_data.json')
vocabs = create_vocab(data_path['dataset']['train'])
embed_count = vocabs['word'].load_embeddings(data_path['pre_embed']['word_embedding'])
print("%d pre-trained embeddings loaded..." % embed_count)
args.scheduler = 'const'
trainer = Trainer(args, vocabs)
trainer.set_dataset(data_path)
trainer.train()

cuda available: True
cuDNN available: True
gpu numbers: 1
{'dataset': {'train': './train.txt', 'test': './test.txt'}, 'pre_embed': {'word_embedding': './glove.840B.300d.txt'}}
12287 pre-trained embeddings loaded...
GNNModel(
  (node_embedding): Embedding(12290, 300)
  (edge_weight): Embedding(151019522, 1, padding_idx=0)
  (node_weight): Embedding(12290, 1, padding_idx=0)
  (fc): Sequential(
    (0): Linear(in_features=300, out_features=63, bias=True)
    (1): ReLU()
    (2): Dropout(
      (drop): Dropout(p=0.5, inplace=False)
    )
    (3): LogSoftmax(dim=1)
  )
)
Train Size: 4775, Val Size: 1195
[Epoch 1] Iter1 time cost: 0.05s, lr: 0.001000, train acc: 0.0002, train loss: 4.1611
[Epoch 1] Iter2 time cost: 0.09s, lr: 0.001000, train acc: 0.0006, train loss: 4.1432
[Epoch 1] Iter3 time cost: 0.14s, lr: 0.001000, train acc: 0.0006, train loss: 4.1526
[Epoch 1] Iter4 time cost: 0.18s, lr: 0.001000, train acc: 0.0010, train loss: 4.1328
[Epoch 1] Iter5 time cost: 0.23s, lr: 0.001000, tr

## Linear Scheduler

In [None]:
np.random.seed(2343)
random.seed(1347)
torch.manual_seed(1453)
torch.cuda.manual_seed(1347)
torch.cuda.manual_seed_all(1453)

print('cuda available:', torch.cuda.is_available())
print('cuDNN available:', torch.backends.cudnn.enabled)
print('gpu numbers:', torch.cuda.device_count())

if torch.cuda.is_available() and args.cuda >= 0:
    args.device = torch.device('cuda', args.cuda)
    torch.cuda.empty_cache()
else:
    args.device = torch.device('cpu')

data_path = path_config('/content/arg_data.json')
vocabs = create_vocab(data_path['dataset']['train'])
embed_count = vocabs['word'].load_embeddings(data_path['pre_embed']['word_embedding'])
print("%d pre-trained embeddings loaded..." % embed_count)
args.scheduler = 'linear'
trainer = Trainer(args, vocabs)
trainer.set_dataset(data_path)
trainer.train()

cuda available: True
cuDNN available: True
gpu numbers: 1
{'dataset': {'train': './train.txt', 'test': './test.txt'}, 'pre_embed': {'word_embedding': './glove.840B.300d.txt'}}
12287 pre-trained embeddings loaded...
GNNModel(
  (node_embedding): Embedding(12290, 300)
  (edge_weight): Embedding(151019522, 1, padding_idx=0)
  (node_weight): Embedding(12290, 1, padding_idx=0)
  (fc): Sequential(
    (0): Linear(in_features=300, out_features=63, bias=True)
    (1): ReLU()
    (2): Dropout(
      (drop): Dropout(p=0.5, inplace=False)
    )
    (3): LogSoftmax(dim=1)
  )
)
Train Size: 4775, Val Size: 1195
[Epoch 1] Iter1 time cost: 0.04s, lr: 0.000000, train acc: 0.0002, train loss: 4.1611
[Epoch 1] Iter2 time cost: 0.09s, lr: 0.000000, train acc: 0.0006, train loss: 4.1457
[Epoch 1] Iter3 time cost: 0.14s, lr: 0.000000, train acc: 0.0006, train loss: 4.1535




[Epoch 1] Iter4 time cost: 0.18s, lr: 0.000000, train acc: 0.0006, train loss: 4.1529
[Epoch 1] Iter5 time cost: 0.23s, lr: 0.000000, train acc: 0.0006, train loss: 4.1788
[Epoch 1] Iter6 time cost: 0.27s, lr: 0.000001, train acc: 0.0008, train loss: 4.1607
[Epoch 1] Iter7 time cost: 0.32s, lr: 0.000001, train acc: 0.0010, train loss: 4.1510
[Epoch 1] Iter8 time cost: 0.36s, lr: 0.000001, train acc: 0.0015, train loss: 4.1083
[Epoch 1] Iter9 time cost: 0.41s, lr: 0.000001, train acc: 0.0015, train loss: 4.1361
[Epoch 1] Iter10 time cost: 0.45s, lr: 0.000001, train acc: 0.0017, train loss: 4.1607
[Epoch 1] Iter11 time cost: 0.49s, lr: 0.000001, train acc: 0.0017, train loss: 4.1675
[Epoch 1] Iter12 time cost: 0.54s, lr: 0.000001, train acc: 0.0017, train loss: 4.1619
[Epoch 1] Iter13 time cost: 0.58s, lr: 0.000001, train acc: 0.0019, train loss: 4.1338
[Epoch 1] Iter14 time cost: 0.63s, lr: 0.000001, train acc: 0.0019, train loss: 4.1475
[Epoch 1] Iter15 time cost: 0.67s, lr: 0.000002, 

## Stepping Scheduler

In [None]:
# ['cosine', 'inv_sqrt', 'exponent', 'linear', 'step', 'const']
np.random.seed(2343)
random.seed(1347)
torch.manual_seed(1453)
torch.cuda.manual_seed(1347)
torch.cuda.manual_seed_all(1453)

print('cuda available:', torch.cuda.is_available())
print('cuDNN available:', torch.backends.cudnn.enabled)
print('gpu numbers:', torch.cuda.device_count())

if torch.cuda.is_available() and args.cuda >= 0:
    args.device = torch.device('cuda', args.cuda)
    torch.cuda.empty_cache()
else:
    args.device = torch.device('cpu')

data_path = path_config('/content/arg_data.json')
vocabs = create_vocab(data_path['dataset']['train'])
embed_count = vocabs['word'].load_embeddings(data_path['pre_embed']['word_embedding'])
print("%d pre-trained embeddings loaded..." % embed_count)
args.scheduler = 'step'
trainer = Trainer(args, vocabs)
trainer.set_dataset(data_path)
trainer.train()

cuda available: True
cuDNN available: True
gpu numbers: 1
{'dataset': {'train': './train.txt', 'test': './test.txt'}, 'pre_embed': {'word_embedding': './glove.840B.300d.txt'}}
12287 pre-trained embeddings loaded...
GNNModel(
  (node_embedding): Embedding(12290, 300)
  (edge_weight): Embedding(151019522, 1, padding_idx=0)
  (node_weight): Embedding(12290, 1, padding_idx=0)
  (fc): Sequential(
    (0): Linear(in_features=300, out_features=63, bias=True)
    (1): ReLU()
    (2): Dropout(
      (drop): Dropout(p=0.5, inplace=False)
    )
    (3): LogSoftmax(dim=1)
  )
)
Train Size: 4775, Val Size: 1195
[Epoch 1] Iter1 time cost: 0.05s, lr: 0.001000, train acc: 0.0002, train loss: 4.1611
[Epoch 1] Iter2 time cost: 0.09s, lr: 0.001000, train acc: 0.0006, train loss: 4.1432
[Epoch 1] Iter3 time cost: 0.13s, lr: 0.001000, train acc: 0.0006, train loss: 4.1526




[Epoch 1] Iter4 time cost: 0.18s, lr: 0.001000, train acc: 0.0010, train loss: 4.1328
[Epoch 1] Iter5 time cost: 0.23s, lr: 0.001000, train acc: 0.0010, train loss: 4.1688
[Epoch 1] Iter6 time cost: 0.27s, lr: 0.001000, train acc: 0.0015, train loss: 4.1162
[Epoch 1] Iter7 time cost: 0.32s, lr: 0.001000, train acc: 0.0025, train loss: 4.0703
[Epoch 1] Iter8 time cost: 0.36s, lr: 0.001000, train acc: 0.0034, train loss: 4.0800
[Epoch 1] Iter9 time cost: 0.41s, lr: 0.001000, train acc: 0.0034, train loss: 4.1295
[Epoch 1] Iter10 time cost: 0.45s, lr: 0.001000, train acc: 0.0036, train loss: 4.1370
[Epoch 1] Iter11 time cost: 0.50s, lr: 0.001000, train acc: 0.0038, train loss: 4.1149
[Epoch 1] Iter12 time cost: 0.54s, lr: 0.001000, train acc: 0.0046, train loss: 4.0376
[Epoch 1] Iter13 time cost: 0.58s, lr: 0.001000, train acc: 0.0052, train loss: 4.0091
[Epoch 1] Iter14 time cost: 0.63s, lr: 0.001000, train acc: 0.0057, train loss: 4.0850
[Epoch 1] Iter15 time cost: 0.67s, lr: 0.001000, 

## Exponential Scheduler

In [None]:
# ['cosine', 'inv_sqrt', 'exponent', 'linear', 'step', 'const']
np.random.seed(2343)
random.seed(1347)
torch.manual_seed(1453)
torch.cuda.manual_seed(1347)
torch.cuda.manual_seed_all(1453)

print('cuda available:', torch.cuda.is_available())
print('cuDNN available:', torch.backends.cudnn.enabled)
print('gpu numbers:', torch.cuda.device_count())

if torch.cuda.is_available() and args.cuda >= 0:
    args.device = torch.device('cuda', args.cuda)
    torch.cuda.empty_cache()
else:
    args.device = torch.device('cpu')

data_path = path_config('/content/arg_data.json')
vocabs = create_vocab(data_path['dataset']['train'])
embed_count = vocabs['word'].load_embeddings(data_path['pre_embed']['word_embedding'])
print("%d pre-trained embeddings loaded..." % embed_count)
args.scheduler = 'exponent'
trainer = Trainer(args, vocabs)
trainer.set_dataset(data_path)
trainer.train()

cuda available: True
cuDNN available: True
gpu numbers: 1
{'dataset': {'train': './train.txt', 'test': './test.txt'}, 'pre_embed': {'word_embedding': './glove.840B.300d.txt'}}
12287 pre-trained embeddings loaded...
GNNModel(
  (node_embedding): Embedding(12290, 300)
  (edge_weight): Embedding(151019522, 1, padding_idx=0)
  (node_weight): Embedding(12290, 1, padding_idx=0)
  (fc): Sequential(
    (0): Linear(in_features=300, out_features=63, bias=True)
    (1): ReLU()
    (2): Dropout(
      (drop): Dropout(p=0.5, inplace=False)
    )
    (3): LogSoftmax(dim=1)
  )
)
Train Size: 4775, Val Size: 1195
[Epoch 1] Iter1 time cost: 0.05s, lr: 0.001000, train acc: 0.0002, train loss: 4.1611




[Epoch 1] Iter2 time cost: 0.09s, lr: 0.001000, train acc: 0.0006, train loss: 4.1432
[Epoch 1] Iter3 time cost: 0.13s, lr: 0.001000, train acc: 0.0006, train loss: 4.1526
[Epoch 1] Iter4 time cost: 0.18s, lr: 0.001000, train acc: 0.0010, train loss: 4.1328
[Epoch 1] Iter5 time cost: 0.22s, lr: 0.001000, train acc: 0.0010, train loss: 4.1688
[Epoch 1] Iter6 time cost: 0.27s, lr: 0.001000, train acc: 0.0015, train loss: 4.1162
[Epoch 1] Iter7 time cost: 0.31s, lr: 0.001000, train acc: 0.0025, train loss: 4.0703
[Epoch 1] Iter8 time cost: 0.36s, lr: 0.001000, train acc: 0.0034, train loss: 4.0800
[Epoch 1] Iter9 time cost: 0.40s, lr: 0.001000, train acc: 0.0034, train loss: 4.1295
[Epoch 1] Iter10 time cost: 0.44s, lr: 0.001000, train acc: 0.0036, train loss: 4.1370
[Epoch 1] Iter11 time cost: 0.49s, lr: 0.001000, train acc: 0.0038, train loss: 4.1149
[Epoch 1] Iter12 time cost: 0.53s, lr: 0.001000, train acc: 0.0046, train loss: 4.0377
[Epoch 1] Iter13 time cost: 0.58s, lr: 0.001000, tr

## Inverse SquareRoot scheduler

In [None]:
np.random.seed(2343)
random.seed(1347)
torch.manual_seed(1453)
torch.cuda.manual_seed(1347)
torch.cuda.manual_seed_all(1453)

print('cuda available:', torch.cuda.is_available())
print('cuDNN available:', torch.backends.cudnn.enabled)
print('gpu numbers:', torch.cuda.device_count())

if torch.cuda.is_available() and args.cuda >= 0:
    args.device = torch.device('cuda', args.cuda)
    torch.cuda.empty_cache()
else:
    args.device = torch.device('cpu')

data_path = path_config('/content/arg_data.json')
vocabs = create_vocab(data_path['dataset']['train'])
embed_count = vocabs['word'].load_embeddings(data_path['pre_embed']['word_embedding'])
print("%d pre-trained embeddings loaded..." % embed_count)
args.scheduler = 'inv_sqrt'
trainer = Trainer(args, vocabs)
trainer.set_dataset(data_path)
trainer.train()

cuda available: True
cuDNN available: True
gpu numbers: 1
{'dataset': {'train': './train.txt', 'test': './test.txt'}, 'pre_embed': {'word_embedding': './glove.840B.300d.txt'}}
12287 pre-trained embeddings loaded...
GNNModel(
  (node_embedding): Embedding(12290, 300)
  (edge_weight): Embedding(151019522, 1, padding_idx=0)
  (node_weight): Embedding(12290, 1, padding_idx=0)
  (fc): Sequential(
    (0): Linear(in_features=300, out_features=63, bias=True)
    (1): ReLU()
    (2): Dropout(
      (drop): Dropout(p=0.5, inplace=False)
    )
    (3): LogSoftmax(dim=1)
  )
)
Train Size: 4775, Val Size: 1195
[Epoch 1] Iter1 time cost: 0.04s, lr: 0.000000, train acc: 0.0002, train loss: 4.1611
[Epoch 1] Iter2 time cost: 0.09s, lr: 0.000000, train acc: 0.0006, train loss: 4.1457
[Epoch 1] Iter3 time cost: 0.13s, lr: 0.000000, train acc: 0.0006, train loss: 4.1535




[Epoch 1] Iter4 time cost: 0.18s, lr: 0.000000, train acc: 0.0006, train loss: 4.1529
[Epoch 1] Iter5 time cost: 0.22s, lr: 0.000000, train acc: 0.0006, train loss: 4.1788
[Epoch 1] Iter6 time cost: 0.27s, lr: 0.000000, train acc: 0.0008, train loss: 4.1608
[Epoch 1] Iter7 time cost: 0.31s, lr: 0.000000, train acc: 0.0010, train loss: 4.1510
[Epoch 1] Iter8 time cost: 0.36s, lr: 0.000000, train acc: 0.0015, train loss: 4.1083
[Epoch 1] Iter9 time cost: 0.40s, lr: 0.000000, train acc: 0.0015, train loss: 4.1361
[Epoch 1] Iter10 time cost: 0.45s, lr: 0.000000, train acc: 0.0017, train loss: 4.1607
[Epoch 1] Iter11 time cost: 0.49s, lr: 0.000000, train acc: 0.0017, train loss: 4.1675
[Epoch 1] Iter12 time cost: 0.53s, lr: 0.000000, train acc: 0.0017, train loss: 4.1620
[Epoch 1] Iter13 time cost: 0.58s, lr: 0.000000, train acc: 0.0019, train loss: 4.1338
[Epoch 1] Iter14 time cost: 0.62s, lr: 0.000000, train acc: 0.0019, train loss: 4.1476
[Epoch 1] Iter15 time cost: 0.67s, lr: 0.000000, 

## Cosine-based Scheduler

In [46]:
np.random.seed(2343)
random.seed(1347)
torch.manual_seed(1453)
torch.cuda.manual_seed(1347)
torch.cuda.manual_seed_all(1453)
# args.epochs = 50
print('cuda available:', torch.cuda.is_available())
print('cuDNN available:', torch.backends.cudnn.enabled)
print('gpu numbers:', torch.cuda.device_count())

if torch.cuda.is_available() and args.cuda >= 0:
    args.device = torch.device('cuda', args.cuda)
    torch.cuda.empty_cache()
else:
    args.device = torch.device('cpu')

data_path = path_config('/content/arg_data.json')
vocabs = create_vocab(data_path['dataset']['train'])
embed_count = vocabs['word'].load_embeddings(data_path['pre_embed']['word_embedding'])
print("%d pre-trained embeddings loaded..." % embed_count)
args.scheduler = 'cosine'
trainer = Trainer(args, vocabs)
trainer.set_dataset(data_path)
trainer.train()

[Epoch 1] Iter1 time cost: 0.05s, lr: 0.000000, train acc: 0.0025, train loss: 2.8356
[Epoch 1] Iter2 time cost: 0.10s, lr: 0.000000, train acc: 0.0050, train loss: 2.7343
[Epoch 1] Iter3 time cost: 0.15s, lr: 0.000000, train acc: 0.0075, train loss: 2.8523
[Epoch 1] Iter4 time cost: 0.20s, lr: 0.000000, train acc: 0.0105, train loss: 2.6377
[Epoch 1] Iter5 time cost: 0.25s, lr: 0.000000, train acc: 0.0147, train loss: 1.9180
[Epoch 1] Iter6 time cost: 0.30s, lr: 0.000001, train acc: 0.0174, train loss: 2.5795
[Epoch 1] Iter7 time cost: 0.34s, lr: 0.000001, train acc: 0.0212, train loss: 2.0611
[Epoch 1] Iter8 time cost: 0.39s, lr: 0.000001, train acc: 0.0235, train loss: 2.7954
[Epoch 1] Iter9 time cost: 0.44s, lr: 0.000001, train acc: 0.0266, train loss: 2.3790
[Epoch 1] Iter10 time cost: 0.49s, lr: 0.000001, train acc: 0.0297, train loss: 2.4462
[Epoch 1] Iter11 time cost: 0.54s, lr: 0.000001, train acc: 0.0320, train loss: 2.8406
[Epoch 1] Iter12 time cost: 0.59s, lr: 0.000001, tra

# Evaluation

In [25]:
from sklearn.metrics import classification_report

In [47]:
outs = trainer.predict('./val.txt')[0]

outs = [x.cpu().detach().numpy() for x in outs]
outs = [[vocabs['label'].idx2inst(x) for x in y] for y in outs]
val = pd.read_csv('./val.txt',sep='|',header = None)

In [48]:
val[0]

0            Research & Consulting Services
1                  Communications Equipment
2                             Biotechnology
3                  Communications Equipment
4                              Homebuilding
                       ...                 
1189                          Biotechnology
1190                         Regional Banks
1191                         Regional Banks
1192    Environmental & Facilities Services
1193    Environmental & Facilities Services
Name: 0, Length: 1194, dtype: object

In [49]:
y_pred = np.array([x for _,_,_,_,x in outs])
y = val[0].values

In [51]:
val[2] = y_pred

# AUC report

In [55]:
from sklearn import metrics
label_df = [x for _,x in val.groupby(0)]

for x,df in val.groupby(0):
  df[2] = df[2].apply(lambda y: 0 if x!=y else 1)
  df[0] = 1
  fpr, tpr, thresholds = metrics.roc_curve(df[0].values, df[2].values, pos_label=2)
  auc = metrics.auc(fpr, tpr)
  print(f"AUC for {x} = {auc}")

AUC for Advertising = nan
AUC for Aerospace & Defense = nan
AUC for Apparel Retail = nan
AUC for Apparel, Accessories & Luxury Goods = nan
AUC for Application Software = nan
AUC for Asset Management & Custody Banks = nan
AUC for Auto Parts & Equipment = nan
AUC for Biotechnology = nan
AUC for Building Products = nan
AUC for Casinos & Gaming = nan
AUC for Commodity Chemicals = nan
AUC for Communications Equipment = nan
AUC for Construction & Engineering = nan
AUC for Construction Machinery & Heavy Trucks = nan
AUC for Consumer Finance = nan
AUC for Data Processing & Outsourced Services = nan
AUC for Diversified Metals & Mining = nan
AUC for Diversified Support Services = nan
AUC for Electric Utilities = nan
AUC for Electrical Components & Equipment = nan
AUC for Electronic Equipment & Instruments = nan
AUC for Environmental & Facilities Services = nan
AUC for Gold = nan
AUC for Health Care Equipment = nan
AUC for Health Care Facilities = nan
AUC for Health Care Services = nan
AUC for He



# Classification Report

In [50]:
print(classification_report(y, y_pred))

                                            precision    recall  f1-score   support

                               Advertising       0.00      0.00      0.00        22
                       Aerospace & Defense       0.00      0.00      0.00        15
                            Apparel Retail       0.00      0.00      0.00        11
       Apparel, Accessories & Luxury Goods       0.15      0.10      0.12        21
                      Application Software       0.25      0.02      0.04        41
          Asset Management & Custody Banks       0.00      0.00      0.00        21
                    Auto Parts & Equipment       0.00      0.00      0.00        10
                             Biotechnology       0.11      0.01      0.02        80
                         Building Products       0.00      0.00      0.00        20
                          Casinos & Gaming       0.00      0.00      0.00        17
                       Commodity Chemicals       0.00      0.00      0.00  

  _warn_prf(average, modifier, msg_start, len(result))


# Predicting

In [None]:
outs = trainer.predict('./test.txt')[0]

outs = [x.cpu().detach().numpy() for x in outs]
outs = [[vocabs['label'].idx2inst(x) for x in y] for y in outs]

In [None]:
test.rename(columns = {0:'Company',1:'Short Business Description'},inplace = True)

In [None]:
I,J,K,L,M = [],[],[],[],[]
for i,j,k,l,m in outs:
  I.append(i)
  J.append(j)
  K.append(k)
  L.append(l)
  M.append(m)

In [None]:
test['#1 Tag'] = I
test['#2 Tag'] = J
test['#3 Tag'] = K
test['#4 Tag'] = L
test['#5 Tag'] = M

In [None]:
test.drop(columns=['Short Business Description'],inplace=True)
test.to_excel('Test_Submission.xlsx',index =False)

## Saving the model

In [None]:
torch.save(trainer.model.state_dict(), './modelGNN.bin')