# Processing data with the torchtext library
Reference: [Torchtext 详细介绍 Part.1 (A Detailed Introduction to Torchtext, Part 1)](https://zhuanlan.zhihu.com/p/37223078)

In [1]:
import os
import spacy
import torchtext
from torchtext import data
import numpy as np

In [2]:
text = "The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013."
print(text)

The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013.


## 1. Define the tokenizer function and create the Fields

In [3]:
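# Load the English model. 'en' is a spaCy 2.x shortcut; spaCy 3+ needs the
# full package name, e.g. 'en_core_web_sm'.
spacy_en = spacy.load('en')
def tokenizer(text):                 # tokenizer function
    """
    Return a list of tokens.
    """
    return [token.text for token in spacy_en.tokenizer(text)]
    # Alternative that runs the full pipeline instead of only the tokenizer:
    # return [token.text for token in spacy_en(text)]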
toke = tokenizer(text)
print(toke[:10])

['The', 'sequel', ',', 'Yes', ',', 'Prime', 'Minister', ',', 'ran', 'from']
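
The `tokenize` argument only needs a callable that maps a string to a list of tokens, so a model-free stand-in can be handy for quick experiments. The sketch below is an assumption, not part of the original notebook: `regex_tokenizer` is a hypothetical helper that splits words and single punctuation marks. It matches the spaCy output above on this sentence, but it will differ on contractions, hyphenated words, and similar cases.

```python
import re

# Hypothetical stand-in for the spaCy tokenizer: runs of word characters
# and single punctuation marks become separate tokens.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def regex_tokenizer(text):
    """Return a list of tokens without loading any language model."""
    return _TOKEN_RE.findall(text)


sample = "The sequel, Yes, Prime Minister, ran from 1986 to 1988."
tokens = regex_tokenizer(sample)
print(tokens[:10])
```

Because it has the same signature as `tokenizer`, it could be passed as `tokenize=regex_tokenizer` when creating a Field.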


In [24]:
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False)

**If the labels are already integers, they do not need to be numericalized; set `use_vocab=False`.**
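
The difference between the two Fields can be illustrated without torchtext. In this minimal sketch, `numericalize` and `toy_stoi` are hypothetical names introduced for illustration: text tokens need a string-to-index table, while integer labels are simply converted and passed through.

```python
def numericalize(values, stoi=None):
    """Map values to integers; pass them through when no vocabulary is given."""
    if stoi is None:
        # The idea behind use_vocab=False: the data is already numeric,
        # so no lookup table is needed.
        return [int(v) for v in values]
    # Strings missing from the table fall back to the '<unk>' index.
    return [stoi.get(v, stoi["<unk>"]) for v in values]


# Labels read from a CSV file arrive as strings such as "0" and "1".
label_ids = numericalize(["0", "1", "1"])

# Text tokens go through a (toy) vocabulary instead.
toy_stoi = {"<unk>": 0, "<pad>": 1, "yes": 2, "prime": 3}
token_ids = numericalize(["yes", "prime", "minister"], toy_stoi)

print(label_ids, token_ids)
```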

## 2. Load the corpus

In [25]:
data_path = os.path.join('..', 'data', 'text')
train, val, test = data.TabularDataset.splits(
    path=data_path, train='train.csv', validation='val.csv',test='test.csv', format='csv',
    fields=[('text', TEXT), ('labels', LABEL)]
)

In [42]:
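TEXT.build_vocab(train)              # build the vocabulary from the training split only
train.__dict__.keys()                # attributes stored on the dataset object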

dict_keys(['examples', 'fields'])
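
Conceptually, building a vocabulary means counting the tokens of the training split and giving each one an integer index. The sketch below imitates that idea in plain Python. It is not torchtext's implementation: `build_vocab` here is a hypothetical standalone function, and the choice of `<unk>` and `<pad>` as the first two entries is an assumption made for the example.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>"), min_freq=1):
    """Return (itos, stoi): index-to-string list and string-to-index dict."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


corpus = [["the", "sequel", "ran"], ["the", "series", "ran", "again"]]
itos, stoi = build_vocab(corpus)
print(itos)
print(stoi["the"])
```

Only the training split is counted, so tokens that appear solely in the validation or test data would map to the `<unk>` index.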

## 3. Read the data in batches

In [None]:
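# Iterator.splits takes one batch size per split through `batch_sizes`;
# `batch_size` would hand the whole tuple to every iterator.
train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.text),
    batch_sizes=(10, 3, 1), device=-1           # -1 for CPU, 0 for the first GPU
)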
train_loader = iter(train_iter)
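
What a batch iterator does with variable-length text can be sketched in a few lines: group examples into chunks and pad the shorter ones so every row in a chunk has the same length. This is a simplified illustration under stated assumptions, not torchtext's code: `make_batches` is a hypothetical helper, the padding index `1` is an assumption, and real iterators also shuffle and return tensors rather than lists.

```python
def make_batches(sequences, batch_size, pad_id=1):
    """Yield batches of equal-length rows, padding short rows with pad_id."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        yield [seq + [pad_id] * (width - len(seq)) for seq in chunk]


toy_ids = [[2, 3], [4, 5, 6, 7], [8], [9, 10, 11], [12, 13]]

# Sorting by length first (the role of sort_key) keeps padding small.
batches = list(make_batches(sorted(toy_ids, key=len), batch_size=2))
print(batches)
```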

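In [None]:
# Materialize the batches of one pass over the training set.
# Caution: in older torchtext releases the training iterator repeats
# indefinitely by default, so this call only returns if the iterator
# was created with repeat=False.
list(train_iter)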