### Assignment goal: get comfortable reading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice loading data with torchtext. The data is in the attached polarity.tsv.

Hint: you can try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset) for this assignment, which makes reading the data simpler.

### Load packages

In [1]:
import re
import torch
import random
import pandas as pd
from collections import Counter
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from torchtext.vocab import Vocab
from torchtext.data.utils import get_tokenizer
from sklearn.model_selection import train_test_split

In [2]:
# Explore the data
# Each row holds a review text and a class label (positive or negative review)
input_data = pd.read_csv('data/Day06_polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
# Replace every non-letter character with a space
input_data['text'] = input_data['text'].str.replace(r"[^a-zA-Z]", " ", regex=True)

input_data.head()

|   | text | label |
|---|------|-------|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you ve got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
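
The preview shows "you ve got mail" without its apostrophe because the cleaning step replaces every character that is not an ASCII letter with a space. A minimal sketch on a made-up sentence (the helper name `clean_text` and the sample string are illustrative, not part of the assignment):

```python
import re

def clean_text(text):
    # Replace each character that is not an ASCII letter with one space
    return re.sub(r"[^a-zA-Z]", " ", text)

sample = "you've got mail (1998) works!"
cleaned = clean_text(sample)

# The string keeps its length; only the non-letters turn into spaces
print(repr(cleaned))
# Whitespace splitting (a stand-in for the tokenizer) then drops the gaps
print(cleaned.split())
```

Digits and punctuation are lost along with the apostrophes, so contractions such as "you've" end up as two separate tokens.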


In [3]:
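# train, test = train_test_split(input_data, test_size=0.2)
train = input_data
tokenizer = get_tokenizer('basic_english')
counter = Counter()

# Count token frequencies over the whole training set
for items in train.itertuples(index=False):
    counter.update(tokenizer(items[0]))

# Keep tokens seen at least 10 times; the specials take indices 0-3 in the order listed
vocab = Vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

# Wrap each tokenized text in <BOS> ... <EOS> and map the tokens to vocabulary indices
def text_transform(x):
    return [vocab['<BOS>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<EOS>']]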
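
To make the roles of `min_freq` and `specials` concrete, here is a rough plain-Python sketch of the same idea: count the tokens, reserve the first indices for the special tokens, drop rare tokens, and map anything unknown to `<unk>`. The helpers `build_vocab` and `encode` are hypothetical names, and the ordering of regular tokens is this sketch's own choice rather than a statement about torchtext's internals.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, specials):
    # Count how often each token occurs across all texts
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens take the first indices, in the order given
    itos = list(specials)
    # Remaining tokens: most frequent first, ties broken alphabetically (sketch's choice)
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

def encode(tokens, stoi):
    # Tokens missing from the vocabulary fall back to the <unk> index
    unk = stoi["<unk>"]
    return [stoi["<BOS>"]] + [stoi.get(t, unk) for t in tokens] + [stoi["<EOS>"]]

docs = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
stoi = build_vocab(docs, min_freq=2, specials=("<unk>", "<BOS>", "<EOS>", "<PAD>"))
ids = encode(["a", "great", "movie"], stoi)
print(ids)
```

Here "good" and "bad" occur only once, so with `min_freq=2` they are left out of the vocabulary, and the unseen word "great" is encoded with the `<unk>` index.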

In [4]:
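def collate_fn(batch):
    texts = []
    labels = []
    for text, label in batch:
        # Convert the raw text into a tensor of vocabulary indices
        processed_text = torch.tensor(text_transform(text))
        texts.append(processed_text)
        labels.append(label)

    # Pad to the longest sequence in the batch using the <PAD> index;
    # the result has shape (max_len, batch_size) because batch_first defaults to False
    return pad_sequence(texts, padding_value=vocab['<PAD>']), torch.tensor(labels)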
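
`pad_sequence` uses `batch_first=False` by default, so the padded batch has shape (max_len, batch_size): each row holds one time step for every sequence, and each column is one review. The plain-Python sketch below imitates that layout with nested lists; `pad_time_major` is a hypothetical helper for illustration, not a torch function.

```python
def pad_time_major(sequences, pad_id):
    # Pad every sequence on the right up to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Transpose to (max_len, batch_size): row t holds token t of every sequence
    return [list(row) for row in zip(*padded)]

# Two toy sequences using 1 = <BOS>, 2 = <EOS>, 3 = <PAD>
batch = [[1, 7, 8, 2], [1, 9, 2]]
grid = pad_time_major(batch, pad_id=3)
for row in grid:
    print(row)
```

The first row contains only `<BOS>` indices and the shorter column ends in padding. Passing `batch_first=True` to `pad_sequence` would give the transposed (batch_size, max_len) layout instead.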

In [5]:
train_list = train.values.tolist()
batch_size = 8 

def batch_sampler():
    indices = [(i, len(tokenizer(s[0]))) for i, s in enumerate(train_list)]
    random.shuffle(indices)
    
    # create pool of indices with similar lengths 
    pooled_indices = []
    for i in range(0, len(indices), batch_size * 100):
        pooled_indices.extend(sorted(indices[i:i + batch_size * 100], key=lambda x: x[1]))

    pooled_indices = [x[0] for x in pooled_indices]

    # yield indices for current batch
    for i in range(0, len(pooled_indices), batch_size):
        yield pooled_indices[i:i + batch_size]

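# batch_sampler() returns a one-shot generator, so this DataLoader can be iterated for one epoch only;
# rebuild it (or use a re-iterable sampler object) for further epochs
bucket_dataloader = DataLoader(train_list, batch_sampler=batch_sampler(),
                               collate_fn=collate_fn)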
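
A generator is used up after a single pass, so one possible variation is to wrap the same bucketing logic in a small class whose `__iter__` reshuffles on every call. The sketch below uses only the standard library; the class name `BucketBatchSampler` and the `pool_factor` parameter are assumptions made for illustration, not part of torchtext or of the assignment.

```python
import random

class BucketBatchSampler:
    """Yields batches of indices whose items have similar lengths."""

    def __init__(self, lengths, batch_size, pool_factor=100):
        self.lengths = list(lengths)
        self.batch_size = batch_size
        self.pool_size = batch_size * pool_factor

    def __iter__(self):
        # Reshuffle on every call so that each epoch sees different pools
        indices = list(range(len(self.lengths)))
        random.shuffle(indices)
        pooled = []
        for start in range(0, len(indices), self.pool_size):
            # Sort each pool by length so neighbouring items need little padding
            pool = indices[start:start + self.pool_size]
            pooled.extend(sorted(pool, key=lambda i: self.lengths[i]))
        # Cut the pooled index list into consecutive batches
        for start in range(0, len(pooled), self.batch_size):
            yield pooled[start:start + self.batch_size]

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller)
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

lengths = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
sampler = BucketBatchSampler(lengths, batch_size=2, pool_factor=2)
first_epoch = list(sampler)
second_epoch = list(sampler)
print(first_epoch)
```

An instance built from the token counts of `train_list` could then be passed as `batch_sampler=` to `DataLoader` in place of the generator, since `DataLoader` accepts any iterable of index batches there.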

In [7]:
next(iter(bucket_dataloader))

(tensor([[   1,    1,    1,  ...,    1,    1,    1],
         [ 329, 1378, 6961,  ...,  678, 5443,   12],
         [  11, 3975, 4987,  ..., 7748, 4600,   11],
         ...,
         [   3,    3,    3,  ...,    3,    3,  231],
         [   3,    3,    3,  ...,    3,    3,   29],
         [   3,    3,    3,  ...,    3,    3,    2]]),
 tensor([1, 0, 0, 1, 0, 1, 0, 0]))