# Runtime Environment  
* python >= 3.6
* pytorch >= 1.0
* pandas
* nltk
* numpy
* scikit-learn (imported as `sklearn`)
* tqdm
* matplotlib (used for the learning curves)
* pickle, json (Python standard library)

# Data processing

## Remove redundant information  
The dataset keeps a lot of extra information for anyone who wants it, but this tutorial does not use all of it, so we first drop the columns we do not need.

In [1]:
import pandas as pd

dataset = pd.read_csv('../task1_trainset.csv', dtype=str)
dataset.head()

Unnamed: 0,Id,Title,Abstract,Authors,Categories,Created Date,Task 1
0,D00001,A Brain-Inspired Trust Management Model to Ass...,Rapid popularity of Internet of Things (IoT) a...,Mahmud/Kaiser/Rahman/Rahman/Shabut/Al-Mamun/Hu...,cs.CR/cs.AI/q-bio.NC,2018-01-11,BACKGROUND OBJECTIVES METHODS METHODS RESULTS ...
1,D00002,On Efficient Computation of Shortest Dubins Pa...,"In this paper, we address the problem of compu...",Sadeghi/Smith,cs.SY/cs.RO/math.OC,2016-09-21,OBJECTIVES OTHERS METHODS/RESULTS RESULTS RESULTS
2,D00003,Data-driven Upsampling of Point Clouds,High quality upsampling of sparse 3D point clo...,Zhang/Jiang/Yang/Yamakawa/Shimada/Kara,cs.CV,2018-07-07,BACKGROUND OBJECTIVES METHODS METHODS METHODS ...
3,D00004,Accessibility or Usability of InteractSE? A He...,Internet is the main source of information now...,Aqle/Khowaja/Al-Thani,cs.HC,2018-08-29,BACKGROUND BACKGROUND BACKGROUND OBJECTIVES OB...
4,D00005,Spatio-Temporal Facial Expression Recognition ...,Automated Facial Expression Recognition (FER) ...,Hasani/Mahoor,cs.CV,2017-03-20,BACKGROUND BACKGROUND BACKGROUND BACKGROUND ME...


In [2]:
train = pd.read_csv('../task1_trainset.csv', dtype=str)

In [3]:
train.loc[train["Id"].eq("D00001")]["Task 1"].values

array(['BACKGROUND OBJECTIVES METHODS METHODS RESULTS CONCLUSIONS'],
      dtype=object)

In [4]:
dataset.iloc[0]["Abstract"]

'Rapid popularity of Internet of Things (IoT) and cloud computing permits neuroscientists to collect multilevel and multichannel brain data to better understand brain functions, diagnose diseases, and devise treatments.$$$To ensure secure and reliable data communication between end-to-end (E2E) devices supported by current IoT and cloud infrastructure, trust management is needed at the IoT and user ends.$$$This paper introduces a Neuro-Fuzzy based Brain-inspired trust management model (TMM) to secure IoT devices and relay nodes, and to ensure data reliability.$$$The proposed TMM utilizes node behavioral trust and data trust estimated using Adaptive Neuro-Fuzzy Inference System and weighted-additive methods respectively to assess the nodes trustworthiness.$$$In contrast to the existing fuzzy based TMMs, the NS2 simulation results confirm the robustness and accuracy of the proposed TMM in identifying malicious nodes in the communication network.$$$With the growing usage of cloud based Io

In [5]:
dataset.drop('Title',axis=1,inplace=True)
dataset.drop('Categories',axis=1,inplace=True)
dataset.drop('Created Date',axis=1, inplace=True)
dataset.drop('Authors',axis=1,inplace=True)

## Data partition
During training we need a way to check how well the model is doing, so we split the training data into a training set and a validation set.

In [5]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
trainset, validset = train_test_split(dataset, test_size=0.1, random_state=42)

trainset.to_csv('trainset.csv',index=False)
validset.to_csv('validset.csv',index=False)
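The split above can be pictured as shuffling the rows with a fixed seed and cutting off 10% of them for validation. The following is a minimal pure-Python sketch of that idea; `holdout_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it will not select the same rows as `train_test_split`.

```python
import random

def holdout_split(rows, test_size=0.1, seed=42):
    """Shuffle a copy of `rows` and split it into (train, valid) lists."""
    rng = random.Random(seed)          # private RNG so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * test_size))
    return shuffled[n_valid:], shuffled[:n_valid]

train_rows, valid_rows = holdout_split(range(7000))
print(len(train_rows), len(valid_rows))   # 6300 700
```

Fixing the seed matters because the validation score is only comparable between runs when both runs hold out the same rows.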

### For test data

In [6]:
dataset = pd.read_csv('../task1_public_testset.csv', dtype=str)
dataset.drop('Title',axis=1,inplace=True)
dataset.drop('Categories',axis=1,inplace=True)
dataset.drop('Created Date',axis=1, inplace=True)
dataset.drop('Authors',axis=1,inplace=True)
dataset.to_csv('testset.csv',index=False)

### Count words  
We cannot feed raw words into the model because it only understands numbers. So we first collect every word in the data, give each one a unique index, and build a dictionary that maps words to indices.

Here we use the `nltk` library to split the text into tokens. You could write your own splitting rules instead, but that is usually more trouble than it is worth.  
We also use `multiprocessing` to speed up the processing.

In [43]:
import pandas as pd
from multiprocessing import Pool
from nltk.tokenize import word_tokenize
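def collect_words(data_path, n_workers=4):
    df = pd.read_csv(data_path, dtype=str)

    # Sentences inside an abstract are separated by '$$$'.
    sent_list = []
    for _, row in df.iterrows():
        sent_list += row['Abstract'].split('$$$')

    # Join the sentences into roughly n_workers text chunks.
    # max(1, ...) avoids a zero step when there are fewer sentences than workers.
    chunk_size = max(1, len(sent_list) // n_workers)
    chunks = [
        ' '.join(sent_list[i:i + chunk_size])
        for i in range(0, len(sent_list), chunk_size)
    ]
    with Pool(n_workers) as pool:
        # map vs map_async: https://stackoverflow.com/questions/35908987/multiprocessing-map-vs-map-async
        chunks = pool.map_async(word_tokenize, chunks)
        # Union the token lists from every chunk into one vocabulary set.
        words = set().union(*chunks.get())

    return words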
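The chunking step inside `collect_words` is easy to get subtly wrong, so here is a small standalone sketch of it. `split_into_chunks` is a hypothetical helper; it rounds the chunk size up, which keeps the step non-zero and yields at most `n_chunks` pieces.

```python
import math

def split_into_chunks(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous, near-equal slices."""
    if not items:
        return []
    size = math.ceil(len(items) / n_chunks)   # never 0 for non-empty input
    return [items[i:i + size] for i in range(0, len(items), size)]

sentences = ['s%d' % i for i in range(10)]
chunks = split_into_chunks(sentences, 4)
print([len(c) for c in chunks])   # [3, 3, 3, 1]
```

Each slice can then be joined into one string and handed to a worker, as the cell above does with `word_tokenize`.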

In [53]:
set(sum([[10], [20]], []))

{10, 20}

In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/rossleecooloh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
words = set()
words |= collect_words('trainset.csv')

# import pickle
# file = open('tokenized_words.pickle', 'wb')
# pickle.dump(words, file)
# file.close()

In [12]:
words

{'PCBC',
 '29.5',
 '//github',
 'detectability',
 'fiber-optics',
 'revisited',
 'shortest',
 'strength/weight',
 'inquiry',
 'attains',
 'compartmentalized',
 'lock',
 'multiconductor',
 'fuelling',
 'opaque',
 'Logistic',
 'Lombard',
 'confounders',
 'practicality',
 'Function',
 'Plattner',
 'underlined',
 'radio-access',
 'CFR',
 'polygons',
 'intercept',
 'tree-like',
 'promoted',
 '0.5MB',
 'Trainable',
 'estimated',
 'ever-increasing',
 'conjectures',
 'brainwaves',
 '0.582',
 'parameterization',
 'service-oriented',
 'boolean',
 'ReCoN',
 'locking',
 'undergone',
 'mixed-integer',
 'Maximum-Likelihood',
 'VNF',
 'pseudo-samples',
 'CLEAR-DR',
 'plague',
 'verbosity',
 'declares',
 'ECA',
 'arguments',
 'excited',
 'storage',
 'frauds',
 'reification',
 'integrating',
 'USCSP',
 'one-coincidence',
 'R-squared',
 'Messages',
 'Multi-Topology',
 'close-to-Pareto-front',
 'Four-atom',
 'Bayesian/generative',
 'Inria',
 'diminishes',
 'drop-in',
 'basic',
 'Recurrent',
 'contains',


`<pad>`: the token used for padding  
`<unk>`: the token for words that are not in our dictionary

In [13]:
PAD_TOKEN = 0
UNK_TOKEN = 1
word_dict = {'<pad>':PAD_TOKEN,'<unk>':UNK_TOKEN}
for word in words:
    word_dict[word] = len(word_dict)
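One caveat with the loop above: the iteration order of a set of strings can change between interpreter sessions (string hashing is randomized by default), so the index given to a word may differ from run to run. If a saved model is going to be reloaded in a new session, a deterministic ordering helps. A minimal sketch, assuming sorted order is acceptable; `build_vocab` is a hypothetical helper, not part of the original notebook.

```python
PAD_TOKEN = 0
UNK_TOKEN = 1

def build_vocab(words):
    """Map each word to a stable integer id; ids 0 and 1 are reserved."""
    vocab = {'<pad>': PAD_TOKEN, '<unk>': UNK_TOKEN}
    for word in sorted(words):          # sorted() makes the ids reproducible
        vocab.setdefault(word, len(vocab))
    return vocab

vocab = build_vocab({'shortest', 'lock', 'opaque'})
print(vocab)
# {'<pad>': 0, '<unk>': 1, 'lock': 2, 'opaque': 3, 'shortest': 4}
print(vocab.get('unseen-word', UNK_TOKEN))   # 1
```

The last line shows the lookup used later in `sentence_to_indices`: any word missing from the dictionary maps to the `<unk>` index.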

In [14]:
word_dict

{'<pad>': 0,
 '<unk>': 1,
 'PCBC': 2,
 '29.5': 3,
 '//github': 4,
 'detectability': 5,
 'fiber-optics': 6,
 'revisited': 7,
 'shortest': 8,
 'strength/weight': 9,
 'inquiry': 10,
 'attains': 11,
 'compartmentalized': 12,
 'lock': 13,
 'multiconductor': 14,
 'fuelling': 15,
 'opaque': 16,
 'Logistic': 17,
 'Lombard': 18,
 'confounders': 19,
 'practicality': 20,
 'Function': 21,
 'Plattner': 22,
 'underlined': 23,
 'radio-access': 24,
 'CFR': 25,
 'polygons': 26,
 'intercept': 27,
 'tree-like': 28,
 'promoted': 29,
 '0.5MB': 30,
 'Trainable': 31,
 'estimated': 32,
 'ever-increasing': 33,
 'conjectures': 34,
 'brainwaves': 35,
 '0.582': 36,
 'parameterization': 37,
 'service-oriented': 38,
 'boolean': 39,
 'ReCoN': 40,
 'locking': 41,
 'undergone': 42,
 'mixed-integer': 43,
 'Maximum-Likelihood': 44,
 'VNF': 45,
 'pseudo-samples': 46,
 'CLEAR-DR': 47,
 'plague': 48,
 'verbosity': 49,
 'declares': 50,
 'ECA': 51,
 'arguments': 52,
 'excited': 53,
 'storage': 54,
 'frauds': 55,
 'reificatio

### Data formatting  
With the dictionary in place, we now turn the data into individual samples: each input sentence becomes a list of word indices, and each answer becomes a one-hot vector (a sentence with several labels gets several ones).  
Here we again use `multiprocessing` to speed things up.

In [15]:
from tqdm import tqdm_notebook as tqdm
def label_to_onehot(labels):
    """ Convert label to onehot .
        Args:
            labels (string): sentence's labels.
        Return:
            outputs (onehot list): sentence's onehot label.
    """
    label_dict = {'BACKGROUND':0, 'OBJECTIVES':1, 'METHODS':2, 'RESULTS':3, 'CONCLUSIONS':4, 'OTHERS':5}
    onehot = [0,0,0,0,0,0]
    # Multiple labels for the same sentence are separated by '/'
    for l in labels.split('/'):
        onehot[label_dict[l]] = 1
    return onehot
        
def sentence_to_indices(sentence, word_dict):
    """ Convert sentence to its word indices.
    Args:
        sentence (str): One string.
    Return:
        indices (list of int): List of word indices.
    """
    # Words missing from the dictionary fall back to UNK_TOKEN
    return [word_dict.get(word, UNK_TOKEN) for word in word_tokenize(sentence)]
    
def get_dataset(data_path, word_dict, n_workers=4):
    """ Load data and return dataset for training and validating.

    Args:
        data_path (str): Path to the data.
    """
    dataset = pd.read_csv(data_path, dtype=str)

    results = [None] * n_workers
    with Pool(processes=n_workers) as pool:
        for i in range(n_workers):
            batch_start = (len(dataset) // n_workers) * i
            if i == n_workers - 1:
                batch_end = len(dataset)
            else:
                batch_end = (len(dataset) // n_workers) * (i + 1)
            
            batch = dataset[batch_start: batch_end]
            results[i] = pool.apply_async(preprocess_samples, args=(batch, word_dict))

        pool.close()
        pool.join()

    processed = []
    for result in results:
        processed += result.get()  # results of apply_async are fetched with .get()
    return processed  # one dict per abstract: sentences plus labels

def preprocess_samples(dataset, word_dict):
    """ Worker function.

    Args:
        dataset (pandas.DataFrame): slice of rows to process.
    Returns:
        list of processed dict.
    """
    # Process one abstract at a time
    processed = []
    for sample in tqdm(dataset.iterrows(), total=len(dataset)):
        processed.append(preprocess_sample(sample[1], word_dict))

    return processed

def preprocess_sample(data, word_dict):
    """
    Args:
        data (pandas.Series): one row of the dataset.
    Returns:
        dict
    """
    # For each abstract: tokenize every sentence, then map the tokens to indices
    processed = {}
    processed['Abstract'] = [sentence_to_indices(sent, word_dict) for sent in data['Abstract'].split('$$$')]
    if 'Task 1' in data:
        processed['Label'] = [label_to_onehot(label) for label in data['Task 1'].split(' ')]
        
    return processed
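To make the label format concrete: the `Task 1` field holds one label group per sentence, separated by spaces, and a sentence with several labels joins them with `/`. The sketch below re-implements the encoding so it runs on its own; `encode_labels` is a hypothetical name, and the input string is the start of the `Task 1` value shown for D00002 above.

```python
LABELS = ['BACKGROUND', 'OBJECTIVES', 'METHODS', 'RESULTS', 'CONCLUSIONS', 'OTHERS']
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}

def encode_labels(task_field):
    """Turn a 'Task 1' string into one multi-hot vector per sentence."""
    vectors = []
    for sentence_labels in task_field.split(' '):
        vec = [0] * len(LABELS)
        for name in sentence_labels.split('/'):
            vec[LABEL_TO_INDEX[name]] = 1
        vectors.append(vec)
    return vectors

print(encode_labels('OBJECTIVES OTHERS METHODS/RESULTS'))
# [[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 1, 1, 0, 0]]
```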

In [16]:
dataset = pd.read_csv("trainset.csv", dtype=str)
for s in dataset.iterrows():
    print(s[1])
    print(s)
    break

Id                                                         D05945
Title           DOTmark - A Benchmark for Discrete Optimal Tra...
Abstract        The Wasserstein metric or earth mover's distan...
Authors                         Schrieber/Schuhmacher/Gottschlich
Categories                                          math.OC/cs.CV
Created Date                                           2016-10-11
Task 1          BACKGROUND BACKGROUND BACKGROUND/OBJECTIVES ME...
Name: 0, dtype: object
(0, Id                                                         D05945
Title           DOTmark - A Benchmark for Discrete Optimal Tra...
Abstract        The Wasserstein metric or earth mover's distan...
Authors                         Schrieber/Schuhmacher/Gottschlich
Categories                                          math.OC/cs.CV
Created Date                                           2016-10-11
Task 1          BACKGROUND BACKGROUND BACKGROUND/OBJECTIVES ME...
Name: 0, dtype: object)


In [17]:
print('[INFO] Start processing trainset...')
train = get_dataset('trainset.csv', word_dict, n_workers=4)
print('[INFO] Start processing validset...')
valid = get_dataset('validset.csv', word_dict, n_workers=4)
print('[INFO] Start processing testset...')
test = get_dataset('testset.csv', word_dict, n_workers=4)

[INFO] Start processing trainset...




[INFO] Start processing validset...




[INFO] Start processing testset...






## Data packing

PyTorch's `pad_sequence` can also be used for the padding step: https://zhuanlan.zhihu.com/p/59772104

https://www.cnblogs.com/duye/p/10590146.html

To make batch training easier, we use [torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).  
To put our data into a `DataLoader`, we subclass [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and write a class that fits this dataset.  
`collate_fn` post-processes each batch: after the `DataLoader` has gathered the selected samples into a list, it calls `collate_fn`, and that is where we pad every sentence to the same length so the batch fits into a torch tensor (a tensor must be rectangular).

In [18]:
from torch.utils.data import Dataset
import torch
class AbstractDataset(Dataset):
    def __init__(self, data, pad_idx, max_len=500):
        self.data = data
        self.pad_idx = pad_idx
        self.max_len = max_len
    
    def __len__(self):
        ##############################################
        ### Indicate the total size of the dataset
        ##############################################
        return len(self.data)

    def __getitem__(self, index):
        ##############################################
        # Return one preprocessed sample: a dict with
        # 'Abstract' (and 'Label' when available).
        ##############################################
        return self.data[index]
    # collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). 
    # Used when using batched loading from a map-style dataset.
    # Usage reference: https://wizardforcel.gitbooks.io/learn-dl-with-pytorch-liaoxingyu/8.2.html
    def collate_fn(self, datas):
        # largest number of sentences per abstract in this batch
        max_sent = max([len(data['Abstract']) for data in datas])
        # longest sentence in this batch, capped at self.max_len
        max_len = max([min(len(sentence), self.max_len) for data in datas for sentence in data['Abstract']])
        batch_abstract = []
        batch_label = []
        sent_len = []
        for data in datas:
            # pad every sentence of this abstract to the same length
            pad_abstract = []
            for sentence in data['Abstract']:
                # truncate sentences longer than max_len
                if len(sentence) > max_len:
                    pad_abstract.append(sentence[:max_len])
                else:
                    # pad shorter sentences up to max_len
                    pad_abstract.append(sentence+[self.pad_idx]*(max_len-len(sentence)))
            sent_len.append(len(pad_abstract))
            # every abstract in a batch needs the same number of sentences,
            # so append all-padding sentences to the shorter abstracts
            pad_abstract.extend([[self.pad_idx]*max_len]*(max_sent-len(pad_abstract)))
            batch_abstract.append(pad_abstract)
            # gather labels
            if 'Label' in data:
                # copy before padding; extending data['Label'] in place would
                # modify the stored sample
                pad_label = list(data['Label'])
                pad_label.extend([[0]*6]*(max_sent-len(pad_label)))  # padded sentences get six zeros
                
                batch_label.append(pad_label)
        # the third output is the real (unpadded) sentence count of each abstract
        return torch.LongTensor(batch_abstract), torch.FloatTensor(batch_label), sent_len
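The two levels of padding in `collate_fn` (words within a sentence, then sentences within an abstract) are easier to follow on a toy batch. This is a pure-Python sketch of the same idea without torch; `pad_batch` is a hypothetical helper, and it returns nested lists rather than tensors.

```python
def pad_batch(batch, pad_idx=0, max_len=64):
    """Pad a batch of abstracts (lists of token-id lists) to one rectangular shape.

    Returns (padded, sent_counts); `padded` is indexed as
    [abstract][sentence][token].
    """
    max_sent = max(len(abstract) for abstract in batch)
    width = max(min(len(s), max_len) for abstract in batch for s in abstract)
    padded, sent_counts = [], []
    for abstract in batch:
        # truncate to `width`, then fill the remainder with pad_idx
        rows = [s[:width] + [pad_idx] * (width - len(s[:width])) for s in abstract]
        sent_counts.append(len(rows))
        # build a fresh padding row each time so rows never share one list object
        rows += [[pad_idx] * width for _ in range(max_sent - len(rows))]
        padded.append(rows)
    return padded, sent_counts

batch = [[[5, 6, 7], [8]], [[9, 10]]]
padded, sent_counts = pad_batch(batch)
print(padded)        # [[[5, 6, 7], [8, 0, 0]], [[9, 10, 0], [0, 0, 0]]]
print(sent_counts)   # [2, 1]
```

`sent_counts` plays the role of `sent_len` above: it records how many sentences in each abstract are real, so padded sentences can be discarded after prediction.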

In [67]:
print("Abstract amount:", len(train))
print("Each training row includes abstract and labels:", len(train[0]))
print("The sentence numbers in an Abstract:", len(train[0]['Abstract']))
print("The label numbers in an Abstract:",  len(train[0]['Label']))

Abstract amount: 6300
Each training row includes abstract and labels: 2
The sentence numbers in an Abstract: 6
The label numbers in an Abstract: 6


In [68]:
trainData = AbstractDataset(train, PAD_TOKEN, max_len = 64)
validData = AbstractDataset(valid, PAD_TOKEN, max_len = 64)
testData = AbstractDataset(test, PAD_TOKEN, max_len = 64)

# Model

Once the data is ready, we come to the core part: the `Model`.  
In this example we use a simple network: one RNN layer (a bidirectional GRU) followed by two linear layers. You can of course make it deeper.  
Sentences differ in length, but a linear layer needs a fixed input size, so we pool the hidden states of all the words in a sentence into a single vector that represents the sentence (the code below uses the element-wise maximum rather than the average).

In [69]:
import torch.nn as nn
import torch.nn.functional as F


class simpleNet(nn.Module):
    def __init__(self, vocabulary_size):
        super(simpleNet, self).__init__()
        self.embedding_size = 50
        self.hidden_dim = 512
        self.embedding = nn.Embedding(vocabulary_size, self.embedding_size)
        # https://zhuanlan.zhihu.com/p/39191116
        # `output` holds the last layer's hidden state h at every time step. For a bidirectional RNN,
        # each time step's h is the forward and backward states concatenated: [h_forward, h_backward].
        self.sent_rnn = nn.GRU(self.embedding_size,
                                self.hidden_dim,
                                bidirectional=True,
                                batch_first=True)
        self.l1 = nn.Linear(self.hidden_dim*2, self.hidden_dim)
        self.l2 = nn.Linear(self.hidden_dim, 6)

    def forward(self, x):
        x = self.embedding(x)
        b,s,w,e = x.shape
        x = x.view(b,s*w,e)
        x, __ = self.sent_rnn(x)
        x = x.view(b,s,w,-1)
        x = torch.max(x,dim=2)[0]  # max over the words (not the mean): the maximum represents the sentence
        x = torch.relu(self.l1(x))
        x = torch.sigmoid(self.l2(x))
        # sigmoid is element-wise, so it does not change the tensor's shape;
        # each entry is replaced with 1/(1+exp(-entry)).
        return x
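The pooling step in `forward` collapses the word-level hidden states of one sentence into a single fixed-size vector. The sketch below shows max pooling (what the model does) next to mean pooling (the alternative mentioned in the text) on plain Python lists; `pool_sentence` is a hypothetical helper and the numbers are made up for illustration.

```python
def pool_sentence(word_vectors, mode='max'):
    """Collapse a list of equal-length word vectors into one sentence vector."""
    columns = list(zip(*word_vectors))            # one tuple per hidden dimension
    if mode == 'max':
        return [max(col) for col in columns]
    if mode == 'mean':
        return [sum(col) / len(col) for col in columns]
    raise ValueError('unknown mode: %s' % mode)

hidden = [[0.5, -1.0], [1.5, 0.0], [-0.5, 4.0]]   # 3 words, hidden size 2
print(pool_sentence(hidden, 'max'))    # [1.5, 4.0]
print(pool_sentence(hidden, 'mean'))   # [0.5, 1.0]
```

Either way the output length equals the hidden size regardless of how many words the sentence has, which is what the following linear layer requires.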

#### test

In [70]:
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset=trainData,
                        batch_size=64,
                        shuffle=True,
                        collate_fn=trainData.collate_fn,
                        num_workers=4)

In [71]:
import torch.nn.utils.rnn as rnn_utils

for (x, y, sent_len) in dataloader:
    
    print(rnn_utils.pack_padded_sequence(input=x, lengths=sent_len, batch_first=True))
    break

RuntimeError: 'lengths' array has to be sorted in decreasing order
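The error above appears because `pack_padded_sequence`, as called here, expects the batch to be ordered from the longest sequence to the shortest. A common workaround is to sort the batch by length, remember the permutation, and restore the original order afterwards; newer PyTorch releases also accept `enforce_sorted=False`. The sketch below shows only the index bookkeeping on plain lists; `sort_by_length` is a hypothetical helper.

```python
def sort_by_length(lengths):
    """Return (order, restore): `order` sorts lengths descending,
    `restore` maps each original position to its sorted position."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    restore = [0] * len(order)
    for sorted_pos, original_pos in enumerate(order):
        restore[original_pos] = sorted_pos
    return order, restore

sent_len = [6, 13, 4, 9]
order, restore = sort_by_length(sent_len)
sorted_len = [sent_len[i] for i in order]
print(order)                              # [1, 3, 0, 2]
print(sorted_len)                         # [13, 9, 6, 4]
print([sorted_len[j] for j in restore])   # [6, 13, 4, 9]
```

With tensors, the same two index lists would be used to reorder the batch before packing and to put the RNN outputs back in their original order afterwards.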

In [79]:
E = nn.Embedding(len(word_dict), 5)
for (x, y, sent_len) in dataloader:
    print(E(x).shape)  # batch size, sent num, max len, each word's embedding dim
    print(x.shape)  # batch size, sent num, max len
    print(E(x))
    print(torch.max(x,dim=2)[0])
    break
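    # one batch holds 64 abstracts; each abstract is padded to the longest sentence count in the batch (13 here); each sentence has 64 tokens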

torch.Size([64, 13, 64, 5])
torch.Size([64, 13, 64])
tensor([[[[-1.4568e-01, -3.8034e-01, -2.5217e-01, -2.6783e-01,  8.2339e-01],
          [-9.5356e-01,  1.4439e+00,  4.1701e-01,  6.6224e-01,  5.5181e-01],
          [-4.8506e-01, -4.8425e-01, -1.0467e+00,  3.0083e-01,  9.0905e-01],
          ...,
          [ 9.1462e-01, -1.3605e+00,  1.3954e+00,  1.0752e+00,  1.7135e+00],
          [ 9.1462e-01, -1.3605e+00,  1.3954e+00,  1.0752e+00,  1.7135e+00],
          [ 9.1462e-01, -1.3605e+00,  1.3954e+00,  1.0752e+00,  1.7135e+00]],

         [[-3.8709e-01, -7.6052e-01,  8.0908e-01,  1.9075e-01, -1.7916e+00],
          [ 1.4451e+00,  3.4846e-01, -3.0386e-01, -4.6358e-01,  1.1868e+00],
          [ 1.3151e+00,  3.1930e-01,  6.0604e-01,  4.7804e-01,  1.3873e+00],
          ...,
          [ 9.1462e-01, -1.3605e+00,  1.3954e+00,  1.0752e+00,  1.7135e+00],
          [ 9.1462e-01, -1.3605e+00,  1.3954e+00,  1.0752e+00,  1.7135e+00],
          [ 9.1462e-01, -1.3605e+00,  1.3954e+00,  1.0752e+00,  1.71

In [80]:
print(x.shape)
embedding = nn.Embedding(len(word_dict), 50)
X = embedding(x)
print(X.shape)

torch.Size([64, 13, 64])
torch.Size([64, 13, 64, 50])


In [81]:
b,s,w,e = X.shape
X = X.view(b,s*w,e)
print(X.shape)

torch.Size([64, 832, 50])


In [82]:
sent_rnn = nn.GRU(50,
                  512,
                  bidirectional=True,
                  batch_first=True)
X, __ = sent_rnn(X)  # batch size, samples(sentence*word), hidden
X.shape

torch.Size([64, 832, 1024])

In [83]:
X = X.view(b,s,w,-1)
X.shape
# X = torch.max(X,dim=2)[0]  # 

torch.Size([64, 13, 64, 1024])

In [84]:
X

tensor([[[[-1.2257e-01, -1.0533e-01, -1.0726e-01,  ...,  2.1938e-02,
            4.7980e-02,  4.1659e-02],
          [ 5.7546e-02,  1.7083e-02, -1.9407e-01,  ..., -5.0114e-03,
            1.7382e-01, -6.4713e-02],
          [ 7.9679e-02, -3.0531e-02, -7.2432e-02,  ..., -5.3557e-02,
            6.8488e-02, -5.6473e-02],
          ...,
          [ 2.1845e-01, -3.6180e-03,  1.5780e-01,  ...,  9.6860e-02,
            1.7200e-01, -2.4694e-02],
          [ 2.1845e-01, -3.6180e-03,  1.5780e-01,  ...,  1.0404e-01,
            1.4831e-01, -2.2983e-02],
          [ 2.1845e-01, -3.6180e-03,  1.5780e-01,  ...,  1.3238e-01,
            9.8861e-02, -2.1004e-02]],

         [[-2.4501e-02, -7.5891e-02,  1.2959e-01,  ...,  2.0880e-01,
            4.2040e-04, -2.2536e-02],
          [ 4.0548e-02, -2.1071e-02, -4.4889e-02,  ...,  3.4508e-02,
            2.5480e-02, -2.4227e-03],
          [ 1.4336e-01,  5.8762e-03, -6.1120e-02,  ...,  1.7678e-01,
            9.3115e-02,  2.1718e-02],
          ...,
     

In [85]:
torch.max(X, dim=2)[0]
# output 經過relu, sigmoid -> batch, sentences, 6 -> each sentence has 6 dim with sigmoid activated element-wisely

tensor([[[ 0.2185,  0.1651,  0.2105,  ...,  0.2072,  0.1885,  0.2296],
         [ 0.2185,  0.1489,  0.1795,  ...,  0.2088,  0.1959,  0.0920],
         [ 0.2190,  0.2082,  0.1578,  ...,  0.3215,  0.2027,  0.1815],
         ...,
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0311],
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0311],
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0255]],

        [[ 0.2185,  0.2156,  0.1578,  ...,  0.1642,  0.2360,  0.2161],
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0311],
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0311],
         ...,
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0311],
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0311],
         [ 0.2185, -0.0036,  0.1578,  ...,  0.1179,  0.1881, -0.0255]],

        [[ 0.2244,  0.2754,  0.2157,  ...,  0.2772,  0.1923,  0.3156],
         [ 0.2185,  0.1796,  0.1578,  ...,  0

# Training
For training, PyTorch's `pack_padded_sequence` can be used: https://zhuanlan.zhihu.com/p/59772104

https://www.cnblogs.com/duye/p/10590146.html

Choose the device to run on.

In [86]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

Define a scoring function so that we can quickly see how the model is performing during training.  

In [87]:
class F1():
    def __init__(self):
        self.threshold = 0.5
        self.n_precision = 0
        self.n_recall = 0
        self.n_corrects = 0
        self.name = 'F1'

    def reset(self):
        self.n_precision = 0
        self.n_recall = 0
        self.n_corrects = 0

    def update(self, predicts, groundTruth):
        predicts = predicts > self.threshold
        self.n_precision += torch.sum(predicts).data.item()
        self.n_recall += torch.sum(groundTruth).data.item()
        # the element-wise product is 1 only where a predicted label is also a true label
        self.n_corrects += torch.sum(groundTruth.type(torch.uint8) * predicts).data.item()

    def get_score(self):
        recall = self.n_corrects / (self.n_recall + 1e-20)  # prevent division by zero
        precision = self.n_corrects / (self.n_precision + 1e-20)  # prevent division by zero
        return 2 * (recall * precision) / (recall + precision + 1e-20)

    def print_score(self):
        score = self.get_score()
        return '{:.5f}'.format(score)
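The `F1` class accumulates three counts over all labels of all sentences (predicted positives, true positives in the ground truth, and correct predictions), which amounts to a micro-averaged F1. A small worked example on plain lists, with made-up predictions; `micro_f1` is a hypothetical helper.

```python
def micro_f1(predicted, truth):
    """Micro-averaged F1 over flat lists of 0/1 predictions and labels."""
    n_predicted = sum(predicted)                       # precision denominator
    n_truth = sum(truth)                               # recall denominator
    n_correct = sum(p * t for p, t in zip(predicted, truth))
    if n_predicted == 0 or n_truth == 0 or n_correct == 0:
        return 0.0
    precision = n_correct / n_predicted
    recall = n_correct / n_truth
    return 2 * precision * recall / (precision + recall)

predicted = [1, 0, 1, 1, 0, 0]
truth     = [1, 0, 0, 1, 1, 0]
score = micro_f1(predicted, truth)
print('{:.5f}'.format(score))   # 0.66667
```

Here 2 of the 3 predicted labels are correct and 2 of the 3 true labels are found, so precision and recall are both 2/3 and so is F1.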


In [89]:
import os
def _run_epoch(epoch, training):
    model.train(training)
    if training:
        description = 'Train'
        dataset = trainData
        shuffle = True
    else:
        description = 'Valid'
        dataset = validData
        shuffle = False  # order does not matter for evaluation
    
    dataloader = DataLoader(dataset=dataset,
                            batch_size=64,
                            shuffle=shuffle,
                            collate_fn=dataset.collate_fn,
                            num_workers=4)

    trange = tqdm(enumerate(dataloader), total=len(dataloader), desc=description)
    loss = 0
    f1_score = F1()
    # each iteration yields whatever collate_fn returned for the batch
    for i, (x, y, sent_len) in trange:
        o_labels, batch_loss = _run_iter(x,y)
        if training:
            opt.zero_grad()
            batch_loss.backward()
            opt.step()

        loss += batch_loss.item()
        f1_score.update(o_labels.cpu(), y)  # .cpu() moves the outputs back to the CPU

        trange.set_postfix(
            loss=loss / (i + 1), f1=f1_score.print_score())
    
    if training:
        history['train'].append({'f1':f1_score.get_score(), 'loss':loss/ len(trange)})
    else:
        history['valid'].append({'f1':f1_score.get_score(), 'loss':loss/ len(trange)})

def _run_iter(x,y):
    if torch.cuda.is_available():
        abstract = x.to(device)
        labels = y.to(device)
        # .cuda()
    else:
        abstract = x.cpu()
        labels = y.cpu()
    o_labels = model(abstract)
    l_loss = criteria(o_labels, labels)
    return o_labels, l_loss

def save(epoch):
    if not os.path.exists('model'):
        os.makedirs('model')
    torch.save(model.state_dict(), 'model/model.pkl.'+str(epoch))
    with open('model/history.json', 'w') as f:
        json.dump(history, f, indent=4)

In [90]:
from torch.utils.data import DataLoader
from tqdm import trange
import json
model = simpleNet(len(word_dict))
opt = torch.optim.Adam(model.parameters())
criteria = torch.nn.BCELoss()

if torch.cuda.is_available():
    # moving to CUDA only works when a GPU is available --> model.to(device)
    model = model.cuda()
else:
    model = model.cpu()

max_epoch = 6
history = {'train':[],'valid':[]}

for epoch in range(max_epoch):
    print('Epoch: {}'.format(epoch))
    _run_epoch(epoch, training=True)
    _run_epoch(epoch, training=False)
    save(epoch)

Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
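The training loop minimizes `BCELoss` between the sigmoid outputs and the 0/1 label vectors. For one element with predicted probability $p$ and target $y$, the loss is $-[y \log p + (1-y)\log(1-p)]$, averaged over all elements. The sketch below computes this in pure Python with made-up numbers; `binary_cross_entropy` is a hypothetical helper, and the `eps` clamp is an assumption added here to keep the logarithm finite, not a description of PyTorch's internals.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all elements."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)    # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

probs = [0.9, 0.2, 0.5]
targets = [1, 0, 1]
loss = binary_cross_entropy(probs, targets)
print(round(loss, 3))   # 0.341
```

Because every label has its own sigmoid and its own loss term, a sentence can be assigned several labels at once, which matches the `METHODS/RESULTS` style annotations in the data.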




## Predict

In [94]:
model.train(False)
dataloader = DataLoader(dataset=testData,
                            batch_size=64,
                            shuffle=False,
                            collate_fn=testData.collate_fn,
                            num_workers=4)
trange = tqdm(enumerate(dataloader), total=len(dataloader), desc='Predict')
prediction = []
for i, (x, y, sent_len) in trange:
    # Run the model on whichever device is available. Assigning the raw input
    # batch to o_labels here (instead of model(x.cpu())) makes the batches differ
    # in their last dimension, and torch.cat below then fails with
    # "Sizes of tensors must match except in dimension 0. Got 64 and 62 in dimension 1".
    if torch.cuda.is_available():
        o_labels = model(x.to(device))
    else:
        o_labels = model(x.cpu())
    # threshold
    o_labels = o_labels > 0.5
    for idx, o_label in enumerate(o_labels):
        # keep only the real sentences of each abstract
        prediction.append(o_label[:sent_len[idx]].to('cpu'))
# cat stacks the rows; detach drops the gradient history
prediction = torch.cat(prediction).detach().numpy().astype(int)
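The last two steps of the prediction loop, thresholding at 0.5 and cutting each abstract back to its real sentence count, can be sketched on plain lists. `threshold_and_trim` is a hypothetical helper, and the example uses 2 labels per sentence instead of 6 to keep it short.

```python
def threshold_and_trim(batch_probs, sent_counts, threshold=0.5):
    """Binarize per-sentence probabilities and drop padded sentences.

    batch_probs: [abstract][sentence][label] nested lists of floats.
    sent_counts: real (unpadded) sentence count for each abstract.
    """
    rows = []
    for probs, n_real in zip(batch_probs, sent_counts):
        for sentence_probs in probs[:n_real]:      # skip padded sentences
            rows.append([int(p > threshold) for p in sentence_probs])
    return rows

batch_probs = [
    [[0.9, 0.1], [0.4, 0.6]],    # abstract with 2 real sentences
    [[0.7, 0.8], [0.5, 0.5]],    # second sentence here is padding
]
print(threshold_and_trim(batch_probs, [2, 1]))
# [[1, 0], [0, 1], [1, 1]]
```

The result is one row per real sentence across the whole test set, which is the shape the submission step expects.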

In [None]:
import numpy as np
def SubmitGenerator(prediction, sampleFile, public=True, filename='prediction.csv'):
    sample = pd.read_csv(sampleFile)
    submit = {}
    submit['order_id'] = list(sample.order_id.values)
    redundant = len(sample) - prediction.shape[0]
    if public:
        submit['BACKGROUND'] = list(prediction[:,0]) + [0]*redundant
        submit['OBJECTIVES'] = list(prediction[:,1]) + [0]*redundant
        submit['METHODS'] = list(prediction[:,2]) + [0]*redundant
        submit['RESULTS'] = list(prediction[:,3]) + [0]*redundant
        submit['CONCLUSIONS'] = list(prediction[:,4]) + [0]*redundant
        submit['OTHERS'] = list(prediction[:,5]) + [0]*redundant
    else:
        submit['BACKGROUND'] = [0]*redundant + list(prediction[:,0])
        submit['OBJECTIVES'] = [0]*redundant + list(prediction[:,1])
        submit['METHODS'] = [0]*redundant + list(prediction[:,2])
        submit['RESULTS'] = [0]*redundant + list(prediction[:,3])
        submit['CONCLUSIONS'] = [0]*redundant + list(prediction[:,4])
        submit['OTHERS'] = [0]*redundant + list(prediction[:,5])
    df = pd.DataFrame.from_dict(submit) 
    df.to_csv(filename,index=False)

In [None]:
SubmitGenerator(prediction,
                '../task1_sample_submission.csv', 
                True, 
                '../task1_submission.csv')

# Plot Learning Curve

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

with open('model/history.json', 'r') as f:
    history = json.loads(f.read())
    
train_loss = [l['loss'] for l in history['train']]
valid_loss = [l['loss'] for l in history['valid']]
train_f1 = [l['f1'] for l in history['train']]
valid_f1 = [l['f1'] for l in history['valid']]

plt.figure(figsize=(7,5))
plt.title('Loss')
plt.plot(train_loss, label='train')
plt.plot(valid_loss, label='valid')
plt.legend()
plt.show()

plt.figure(figsize=(7,5))
plt.title('F1 Score')
plt.plot(train_f1, label='train')
plt.plot(valid_f1, label='valid')
plt.legend()
plt.show()

print('Best F1 score ', max([[l['f1'], idx] for idx, l in enumerate(history['valid'])]))
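The final `print` relies on list comparison to find the best epoch. An equivalent and slightly more explicit form uses `max` with a key. The sketch below assumes the same `history` layout as `model/history.json`; the numbers are made up for illustration.

```python
history = {'valid': [{'f1': 0.61, 'loss': 0.40},
                     {'f1': 0.66, 'loss': 0.37},
                     {'f1': 0.64, 'loss': 0.38}]}   # made-up numbers for illustration

# enumerate() pairs each record with its epoch index; the key picks the F1 value
best_epoch, best = max(enumerate(history['valid']), key=lambda item: item[1]['f1'])
print('Best F1 {:.5f} at epoch {}'.format(best['f1'], best_epoch))
# Best F1 0.66000 at epoch 1
```

The epoch index tells you which saved checkpoint (`model/model.pkl.<epoch>`) to load for prediction.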