### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice loading data with torchtext. The data is provided in the attached polarity.tsv.

Hint: try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes loading the data simpler.

### Load packages

In [1]:
import torch
import pandas as pd
import numpy as np
from torchtext import data, datasets



In [2]:
# Explore the data
# Each row holds a review text and a class label; the label marks the review as positive or negative
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

Unnamed: 0,text,label
0,films adapted from comic books have had plenty...,1
1,every now and then a movie comes along from a ...,1
2,you've got mail works alot better than it dese...,1
3,jaws is a rare film that grabs your attentio...,1
4,moviemaking is a lot like being the general ma...,1
...,...,...
1995,"if anything , "" stigmata "" should be taken as ...",0
1996,"john boorman's "" zardoz "" is a goofy cinematic...",0
1997,the kids in the hall are an acquired taste .it...,0
1998,there was a time when john carpenter was a gre...,0


### Install packages

In [46]:
#!pip install torchtext
#!pip install spacy
#!python -m spacy download en
#!python -m spacy download en_core_web_sm

### Build a pipeline to generate the data

In [30]:
# Create the Field and Dataset

### <your code> ###
#import en_core_web_lg
import spacy
#spacy_en = spacy.load('en_core_web_sm')
spacy_en = spacy.load('en')

In [40]:
# Remove non-letter characters:
import re
def remove_nono_char(text):
    text = ' '.join(text)
    text = re.sub("[^a-zA-Z]", ' ', text)
    text = text.split()
    return text

def tokenizer(text): # create a tokenizer function
    # spacy_en.tokenizer(text) yields spacy.tokens.token.Token objects; return their text as a list of str
    return [tok.text for tok in spacy_en.tokenizer(text)]
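
To see what the preprocessing step above does to a list of tokens, here is a minimal standalone sketch using only the standard library. The function name `strip_non_letters` and the sample tokens are made up for illustration; the sketch mirrors the join / substitute / split logic of `remove_nono_char` and does not involve spaCy or torchtext.

```python
import re

def strip_non_letters(tokens):
    """Join tokens, replace every non-letter character with a space, then re-split."""
    joined = " ".join(tokens)
    letters_only = re.sub(r"[^a-zA-Z]", " ", joined)
    # str.split() with no argument collapses runs of whitespace,
    # so tokens that were pure punctuation or digits disappear entirely.
    return letters_only.split()

# Hypothetical token list, not taken from polarity.tsv
sample_tokens = ["it", "'s", "a", "1999", "film", ",", "really", "!"]
cleaned = strip_non_letters(sample_tokens)
print(cleaned)
```

One side effect worth noticing: the apostrophe is dropped, so a contraction such as "it's" ends up as the two tokens `it` and `s`, which is why `s` appears as a frequent word in this notebook's vocabulary.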

In [49]:
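# Note: dtype=torch.float64 makes the batched token indices floating point;
# Field's default dtype, torch.long, is what embedding layers expect.
text_field = data.Field(sequential=True, dtype=torch.float64, lower=True, tokenize='spacy', preprocessing=remove_nono_char)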
label_field = data.Field(sequential=False, use_vocab=False)

In [53]:
examples = []
for (text, label) in input_data.values:
    examples.append(data.Example.fromlist(data=[text, label], 
                                          fields=[('text', text_field),('label', label_field)]) )

In [54]:
# Take the examples and shuffle their order
### <your code> ###
import random
random.shuffle(examples)

# Split the examples 80:20
### <your code> ###
train_ex = examples[:int(len(examples)*0.8)]
test_ex = examples[int(len(examples)*0.8):]

# Create the training and testing datasets
### <your code> ###
test_data = data.Dataset(examples=test_ex, fields={'text':text_field, 'label':label_field})
train_data = data.Dataset(examples=train_ex, fields={'text': text_field, 'label': label_field})
train_data[0].label, train_data[0].text

(0,
 ['some',
  'movies',
  'i',
  'should',
  'just',
  'skip',
  'my',
  'daughter',
  'and',
  'i',
  'had',
  'a',
  'really',
  'vile',
  'time',
  'at',
  'my',
  'favorite',
  'martian',
  'a',
  'few',
  'weeks',
  'back',
  'and',
  'here',
  'comes',
  'another',
  'disney',
  'effects',
  'filled',
  'live',
  'action',
  'flick',
  'based',
  'on',
  'an',
  'old',
  'tv',
  'program',
  'true',
  'the',
  'probgram',
  'is',
  'only',
  'years',
  'old',
  'this',
  'time',
  'and',
  'it',
  's',
  'a',
  'cartoon',
  'but',
  'it',
  's',
  'a',
  'cartoon',
  'i',
  'liked',
  'and',
  'i',
  'was',
  'understandably',
  'reluctant',
  'to',
  'see',
  'what',
  'disney',
  'had',
  'done',
  'to',
  'it',
  'on',
  'the',
  'big',
  'screen',
  'but',
  'my',
  'daughter',
  'really',
  'wanted',
  'to',
  'go',
  'and',
  'how',
  'bad',
  'could',
  'it',
  'be',
  'turns',
  'out',
  'i',
  'was',
  'right',
  'mostly',
  'inspector',
  'gadget',
  'oddly',
  'enoug
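
The shuffle-and-split step in the cell above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions: `shuffle_and_split`, its `seed` parameter, and the toy list are hypothetical and not part of the assignment. Unlike the notebook cell, it seeds a private random generator so the split is reproducible, and it leaves the input list untouched.

```python
import random

def shuffle_and_split(items, train_ratio=0.8, seed=0):
    """Return (train, test) lists after a reproducible shuffle."""
    rng = random.Random(seed)      # private generator, global random state is untouched
    shuffled = list(items)         # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the list of torchtext Example objects
toy_examples = list(range(10))
train_part, test_part = shuffle_and_split(toy_examples)
print(len(train_part), len(test_part))  # 8 2
```

Computing the cut index once and slicing on both sides of it guarantees that every item lands in exactly one of the two parts.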

In [55]:
# Build the vocabulary
### <your code> ###
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)
print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 's'] 
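
The output above shows the two special tokens first, followed by words in descending frequency. A simplified sketch of that idea, using only the standard library, is given below. It is not torchtext's implementation: `build_toy_vocab`, the tie-breaking rule, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Build index-to-string and string-to-index mappings from tokenized texts."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Hypothetical tokenized corpus
toy_corpus = [["the", "film", "is", "good"],
              ["the", "plot", "is", "thin"],
              ["the", "end"]]
itos, stoi = build_toy_vocab(toy_corpus)

# Words missing from the vocabulary fall back to the <unk> index.
unk_index = stoi["<unk>"]
encoded = [stoi.get(tok, unk_index) for tok in ["the", "film", "was", "good"]]
print(itos[:4])   # ['<unk>', '<pad>', 'the', 'is']
print(encoded)    # [2, 5, 0, 6]
```

Building the vocabulary from the training split only, as the cell above does, means words that occur solely in the test split are mapped to `<unk>`.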



In [59]:
# create iterator for training and testing data
### <your code> ###
train_iter = data.Iterator(dataset=train_data,
                           batch_size=2,
                           repeat=False,
                           sort_key=lambda ex: len(ex.text))

test_iter = data.Iterator(dataset=test_data, 
                          batch_size=2, 
                          repeat=False, 
                          sort_key=lambda ex:len(ex.text))
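
The `sort_key` above ranks examples by token count. Whether the iterator actually sorts with it depends on its sort-related options, so the following is only a sketch of why a length key is useful: batching sequences of similar length reduces the amount of padding. The helper `padding_needed` and the toy lengths are hypothetical.

```python
def padding_needed(lengths, batch_size):
    """Count pad positions when consecutive sequences are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        # Every sequence in the batch is padded up to the longest one.
        total += sum(longest - n for n in batch)
    return total

# Hypothetical review lengths in tokens
toy_lengths = [12, 700, 15, 650, 20, 720]
unsorted_padding = padding_needed(toy_lengths, batch_size=2)
sorted_padding = padding_needed(sorted(toy_lengths), batch_size=2)
print(unsorted_padding, sorted_padding)  # 2023 653
```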

In [60]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[2.0000e+00, 3.2420e+03],
        [3.7340e+03, 1.9355e+04],
        [6.4800e+02, 2.0645e+04],
        ...,
        [2.1910e+03, 1.0000e+00],
        [6.9340e+03, 1.0000e+00],
        [1.1100e+02, 1.0000e+00]], dtype=torch.float64) torch.Size([765, 2])
tensor([1, 1]) torch.Size([2])
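
The text tensor above has shape `[765, 2]`: the first dimension is the length of the longest review in the batch and the second is the batch size. The trailing `1` values in the second column are the index of `<pad>`, which fills the shorter review. A minimal sketch of that padding and sequence-first layout, using plain lists instead of tensors, might look like this (`pad_and_transpose` and the toy batch are made up for illustration):

```python
def pad_and_transpose(sequences, pad_index=1):
    """Pad index sequences to equal length and return them as [time][batch]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # zip(*padded) turns the [batch][time] layout into [time][batch].
    return [list(step) for step in zip(*padded)]

# Hypothetical batch of two encoded reviews with different lengths
toy_batch = [[2, 3734, 648, 2191], [3242, 19355]]
batch_matrix = pad_and_transpose(toy_batch)
print(batch_matrix)  # [[2, 3242], [3734, 19355], [648, 1], [2191, 1]]
print(len(batch_matrix), len(batch_matrix[0]))  # 4 2
```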
