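### Assignment goal: get comfortable loading text data with Torchtext

This assignment uses the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) movie-review data to practice loading text with torchtext. The data is in the attached polarity.tsv.

Hint: try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes reading the file simpler. In the torchtext version used here it is imported from `torchtext.legacy`.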

### Load packages

In [1]:
import torch
import pandas as pd
import numpy as np
from torchtext import datasets
from torchtext.legacy import data
import os

In [2]:
# Explore the data
# Each row holds a review text and a class label; the label marks a positive or negative review
path = '/content/drive/MyDrive/Colab Notebooks/NLP_course/day06'
input_data = pd.read_csv(os.path.join(path,'polarity.tsv'), delimiter='\t', header=None, names=['text', 'label'])
input_data

| | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything , " stigmata " should be taken as ... | 0 |
| 1996 | john boorman's " zardoz " is a goofy cinematic... | 0 |
| 1997 | the kids in the hall are an acquired taste .it... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


### Build a pipeline to generate the data

In [3]:
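# Create the Fields and the Dataset
# Note: dtype=torch.float64 makes the text batches floating-point tensors;
# an embedding layer needs integer indices (the Field default is torch.long).
text_field = data.Field(sequential=True, dtype=torch.float64, lower=True, tokenize='spacy')
label_field = data.Field(sequential=False)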
input_data = data.TabularDataset(path=os.path.join(path,'polarity.tsv'), 
                                 format='tsv', 
                                 fields=[('text', text_field), ('label', label_field)])
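
To make the cell above less of a black box, here is a minimal pure-Python sketch of what reading a two-column TSV into tokenized examples amounts to. It is not torchtext's implementation: `SAMPLE_TSV` and `read_examples` are hypothetical names, the two rows are made up, and a whitespace split stands in for the spaCy tokenizer.

```python
import csv
import io

# Hypothetical two-row sample in the same text<TAB>label layout as polarity.tsv.
SAMPLE_TSV = "A fine , funny film .\t1\nDull and far too long .\t0\n"


def read_examples(handle, lower=True):
    """Parse text<TAB>label rows into dicts holding tokenized text and a label."""
    examples = []
    for text, label in csv.reader(handle, delimiter="\t"):
        if lower:
            text = text.lower()  # mirrors lower=True on the text Field
        # Assumption: whitespace split replaces the spaCy tokenizer here.
        examples.append({"text": text.split(), "label": label})
    return examples


examples = read_examples(io.StringIO(SAMPLE_TSV))
print(examples[0])
```

As in the notebook, the label is kept as a string at this stage; it only becomes an index once a vocabulary is built for the label field.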

In [4]:
# Get the examples and shuffle them
examples = input_data.examples
np.random.shuffle(examples)
# Split the examples 8:2
train_ex = examples[:int(len(examples)*0.8)]
test_ex = examples[int(len(examples)*0.8):]
# Build the training and testing datasets
train_data = data.Dataset(examples=train_ex, fields={'text': text_field, 'label': label_field})
test_data = data.Dataset(examples=test_ex, fields={'text':text_field, 'label':label_field})

train_data[0].label, train_data[0].text

('1',
 ['how',
  'many',
  'of',
  'us',
  'would',
  'become',
  'strippers',
  '?',
  'for',
  'those',
  'of',
  'us',
  'who',
  'would',
  "n't",
  ',',
  'is',
  'it',
  'a',
  'moral',
  'reason',
  ',',
  'or',
  'purely',
  'a',
  'lack',
  'of',
  'confidence',
  '?',
  'that',
  "'s",
  'probably',
  'not',
  'a',
  'fair',
  'question',
  ',',
  'and',
  'for',
  'a',
  'lot',
  'of',
  'us',
  ',',
  'it',
  'could',
  'very',
  'well',
  'be',
  'for',
  'neither',
  'of',
  'those',
  'reasons',
  '.as',
  'you',
  'watch',
  'the',
  'full',
  'monty',
  ',',
  'however',
  ',',
  'you',
  'may',
  'begin',
  'asking',
  'yourself',
  'these',
  'kinds',
  'of',
  'questions',
  '.would',
  'you',
  'be',
  'willing',
  'to',
  'grin',
  'and',
  'bare',
  'it',
  'to',
  'bring',
  'in',
  'some',
  'much',
  'needed',
  'dough',
  '?',
  'in',
  'case',
  'you',
  'have',
  "n't",
  'guessed',
  ',',
  'the',
  'full',
  'monty',
  'is',
  'about',
  'stripping',
  ',
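
The split above shuffles in place without a seed, so the train/test membership changes on every run. A small sketch of a reproducible 8:2 split, using only the standard library, might look like the following; `split_examples` and its `seed` parameter are hypothetical helpers, not part of torchtext.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples with a fixed seed, then cut at train_fraction."""
    shuffled = list(examples)              # leave the caller's sequence untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so global state is not changed
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: ten integers instead of torchtext Example objects.
train_part, test_part = split_examples(range(10))
print(len(train_part), len(test_part))  # 8 2
```

The two returned lists could then be passed to `data.Dataset` in place of `train_ex` and `test_ex`.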

In [5]:
# Build the vocabularies
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)

print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', 'is', 'in'] 
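
The output above shows the two special tokens first, followed by tokens in descending frequency. The sketch below reproduces that ordering in plain Python as an illustration of the idea, not as torchtext's actual code; `build_vocab` is a hypothetical helper, the two token lists are made up, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter, defaultdict


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    # Unseen tokens fall back to index 0, which is '<unk>' with the default specials.
    stoi = defaultdict(int)
    stoi.update({tok: idx for idx, tok in enumerate(itos)})
    return itos, stoi


itos, stoi = build_vocab([["the", "film", "is", "good"],
                          ["the", "film", "is", "bad"]])
print(itos)
print(stoi["good"], stoi["never-seen"])
```

Building the vocabulary from the training split only, as the notebook does, means that words appearing only in the test split map to `<unk>`.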



In [9]:
# create iterator for training and testing data
train_iter, test_iter = data.Iterator.splits(datasets=(train_data, test_data),
                                             batch_sizes=(3, 5),
                                             repeat=False,  
                                             sort_key = lambda ex: len(ex.text))

In [10]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[525.,   9., 125.],
        [ 18., 747.,   4.],
        [  3., 235., 271.],
        ...,
        [639.,   1.,   1.],
        [200.,   1.,   1.],
        [ 17.,   1.,   1.]], dtype=torch.float64) torch.Size([1278, 3])
tensor([1, 1, 1]) torch.Size([3])
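
The `torch.Size([1278, 3])` shape and the trailing `1.` values in the batch above come from padding: every review in the batch is extended to the length of the longest one with the `<pad>` index (1 in this vocabulary), and the result is laid out sequence-first. A plain-Python sketch of that step, with a hypothetical `pad_batch` helper and arbitrary example indices, could be:

```python
def pad_batch(sequences, pad_index=1):
    """Pad index sequences to equal length and return them as [max_len][batch_size]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that each row is one time step across the whole batch.
    return [list(step) for step in zip(*padded)]


# Three hypothetical reviews of lengths 4, 2 and 3, already mapped to indices.
batch = pad_batch([[525, 18, 3, 639], [9, 747], [125, 4, 271]])
for row in batch:
    print(row)
```

Because one very long review forces padding for the whole batch, grouping reviews of similar length (which is what `sort_key` enables in iterators that sort or bucket) reduces wasted padding.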


In [11]:
for test_batch in test_iter:
    print(test_batch.text, test_batch.text.shape)
    print(test_batch.label, test_batch.label.shape)
    break

tensor([[2.9800e+02, 1.1009e+04, 3.3400e+02, 2.8080e+03, 2.3000e+01],
        [7.5100e+02, 6.0000e+00, 9.0000e+00, 5.6740e+03, 2.0000e+01],
        [4.5200e+02, 1.3900e+02, 3.0000e+00, 2.0000e+00, 8.0000e+00],
        [6.9000e+01, 1.9000e+01, 1.4510e+03, 8.0220e+03, 7.9850e+03],
        [5.6550e+03, 3.0000e+00, 4.0000e+01, 6.4630e+03, 8.7010e+03],
        [3.0000e+00, 1.5100e+02, 3.9000e+01, 2.0000e+00, 5.0000e+00],
        [1.8900e+02, 1.0000e+01, 1.7400e+02, 5.0000e+00, 2.8000e+01],
        [9.3000e+01, 3.8600e+02, 1.3600e+02, 8.1490e+03, 3.5200e+02],
        [4.5200e+02, 9.0000e+00, 2.7500e+02, 7.9740e+03, 3.6000e+01],
        [6.0000e+00, 3.0000e+00, 1.6300e+02, 1.0700e+02, 1.8600e+02],
        [1.3070e+03, 4.6900e+02, 5.0000e+00, 4.0000e+00, 7.0000e+00],
        [9.0000e+00, 2.7980e+03, 6.5000e+01, 1.7520e+03, 9.6200e+02],
        [3.1050e+03, 1.1000e+01, 8.0000e+00, 2.7410e+03, 1.1200e+02],
        [3.0700e+03, 2.1800e+02, 2.0900e+02, 6.0000e+00, 5.2000e+01],
        [1.8500e+02,