In [1]:
import torch
import numpy as np
import string

device = (
    "cuda:0"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cuda:0 device


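## Text processing

1. Tokenization
```py
list(str)     ## 1. by letter
str.split()   ## 2. by word
...........   ## 3. NLTK n-grams: similar to k-mers; merge n adjacent words into one segment and embed that segment later
```
2. Build the vocabulary (vocab)

3. Vectorize the vocabulary: tf-idf / one-hot / hash encoding / embedding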
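The three tokenization options in step 1 can be sketched in plain Python. The `ngrams` helper is a hypothetical stand-in written for this illustration, not NLTK's implementation, and the sample sentence is made up.

```python
def ngrams(tokens, n):
    """Join every run of n adjacent tokens into one segment (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the proud height of troy"

by_letter = list(sentence)      # 1. by letter
by_word = sentence.split()      # 2. by word
bigrams = ngrams(by_word, 2)    # 3. n-grams with n = 2

print(by_letter[:3])
print(by_word)
print(bigrams)
```

Each bigram would then get its own vocabulary entry and its own embedding, just as a single word does.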





### Recall: word2vec

word2vec learns word embeddings from how often words appear together in the corpus (co-occurrence).

Toolkit: [gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py)

![](word2vec/1.PNG) 

References

[1 - original paper] https://arxiv.org/pdf/1301.3781.pdf  
[2 - optimized Skip-gram] https://arxiv.org/pdf/1310.4546.pdf  
[3 - explanation] https://arxiv.org/pdf/1411.2738.pdf  
[4 - aid to understanding: mathematical description] https://zhuanlan.zhihu.com/p/595381757  
[5 - aid to understanding: negative sampling] http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
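To make "co-occurrence" concrete, here is a rough sketch of how Skip-gram style (center, context) training pairs could be collected from a token list. The `skipgram_pairs` helper and its `window` parameter are assumptions made for this illustration; gensim builds its training examples internally and differently in detail.

```python
def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "sing in me muse".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that often share a window produce many such pairs, which is what pushes their embeddings toward each other during training.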

In [2]:
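text = "Sing in me, Muse, and through me tell the story of that man skilled in all ways of contending, the wanderer, harried for years on end, after he plundered the stronghold on the proud height of Troy"

## Replace each punctuation character with a space, collapse doubled spaces, and lowercase
## (the variable is named `text` so that the built-in `str` is not shadowed)
for c in string.punctuation:
    text = text.replace(c,' ').replace('  ',' ').lower()

In [3]:
## Map each unique word to an integer index; set() order is arbitrary, so the indices can differ between runs
vocab = dict([(v,k) for k,v in enumerate(set(text.split()))])
s = [vocab.get(v) for v in text.split()]  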
print(s)

[9, 21, 0, 1, 13, 17, 0, 11, 20, 6, 10, 15, 8, 2, 21, 16, 27, 10, 14, 20, 23, 25, 26, 19, 7, 22, 3, 12, 4, 20, 18, 7, 20, 28, 5, 10, 24]
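Because `set()` gives no guaranteed order for strings, the indices printed above may not reproduce from run to run. A minimal sketch of one way to make the mapping deterministic, assuming alphabetical order is acceptable (the shortened sentence and the name `vocab_sorted` are only for illustration):

```python
tokens = "sing in me muse and through me tell the story".split()

# Sorting the unique words fixes the index assigned to each word
vocab_sorted = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
ids = [vocab_sorted[word] for word in tokens]

print(vocab_sorted)
print(ids)
```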


In [4]:
## 1. one-hot matrix, one row per token
b = np.zeros((len(s),len(vocab))) 
for index,v in enumerate(s):
    b[index,v] = 1
b

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
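The same one-hot matrix can also be built without an explicit loop by indexing an identity matrix with the token ids. This is a small sketch on made-up ids, not a change to the cell above.

```python
import numpy as np

ids = [2, 0, 1, 2]      # made-up token ids
vocab_size = 3

# Row i of the identity matrix is the one-hot vector for id i
one_hot = np.eye(vocab_size)[ids]
print(one_hot)
```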

In [5]:
## 2. Embedding, one row per token
emb_layer = torch.nn.Embedding(len(vocab),20)
emb = emb_layer(torch.LongTensor(s))
emb.shape

torch.Size([37, 20])
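An embedding layer can be read as a trainable lookup table: selecting rows by index gives the same result as multiplying a one-hot matrix by the weight matrix. The sketch below checks this on a tiny layer; the sizes and ids are arbitrary.

```python
import torch

torch.manual_seed(0)
emb_layer = torch.nn.Embedding(5, 3)        # 5 words, 3 dimensions
ids = torch.tensor([4, 0, 4])

looked_up = emb_layer(ids)                  # row lookup
one_hot = torch.nn.functional.one_hot(ids, num_classes=5).float()
via_matmul = one_hot @ emb_layer.weight     # one-hot times weight matrix

print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids materializing the one-hot matrix, which matters once the vocabulary has tens of thousands of entries.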

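## TorchText

```
torchtext.datasets.* provides common datasets, but it is incompatible with torch==2.3.0; the tar archive can instead be downloaded directly from the official link.

Downloads also fail easily, so after converting the dataset to list() it is worth checking its size and the set of labels.

https://pytorch.org/text/stable/datasets.html
```
* Text processing utilities: https://pytorch.org/text/stable/data_utils.html   
* If vocab.set_default_index is forgotten, later lookups crash as soon as they meet a token that is not in the vocab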
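The role of the default index can be illustrated without torchtext. The sketch below is a plain-Python approximation: `build_vocab` and `encode` are hypothetical helpers, and their ordering rules are a choice made here, not a description of torchtext internals.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<pad>", "<unk>"), min_freq=1):
    """Specials first, then tokens with at least min_freq occurrences (hypothetical helper)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # Most frequent first; ties broken alphabetically so the result is deterministic
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens, stoi, default_index):
    """Map tokens to ids, falling back to default_index for unknown tokens."""
    return [stoi.get(tok, default_index) for tok in tokens]


docs = [["i", "love", "this", "film"], ["i", "love", "it"], ["i", "rented", "it"]]
stoi = build_vocab(docs, min_freq=2)
print(stoi)
print(encode(["i", "love", "rented"], stoi, stoi["<unk>"]))
```

Here "rented" occurs only once, falls below `min_freq`, and is therefore mapped to the `<unk>` index instead of raising a lookup error.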

In [6]:
import torchtext     ## pip install torchtext; pip install torchdata
torchtext.__version__

'0.17.0+cpu'

In [7]:
## NOTE: despite its name, test_data holds the IMDB *train* split
test_data = list(torchtext.datasets.IMDB(split='train',root=r'L:\Datasets'))
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
def yield_tokens(data):
    for (_,text) in data:
        yield tokenizer(text)

vocab = torchtext.vocab.build_vocab_from_iterator(yield_tokens(test_data),specials=['<pad>','<unk>'],min_freq=3)
vocab.set_default_index(vocab['<unk>'])

In [8]:
set([k[0] for k in test_data])

{1, 2}
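The label check can be extended to count samples per label, which also catches a truncated download. A minimal sketch on made-up `(label, text)` pairs in the same layout as the dataset items:

```python
from collections import Counter

# Hypothetical toy samples, only to show the shape of the check
samples = [(1, "dull and slow"), (2, "a delight"), (2, "great cast"), (1, "not for me")]

label_counts = Counter(label for label, _ in samples)
print(len(samples), dict(label_counts))

# Fail early if a label is missing or an unexpected one appears
assert set(label_counts) == {1, 2}
```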

In [9]:
len(vocab)

40252

In [10]:
next(iter(test_data))

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

In [11]:
next(yield_tokens(test_data))

['i',
 'rented',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'because',
 'of',
 'all',
 'the',
 'controversy',
 'that',
 'surrounded',
 'it',
 'when',
 'it',
 'was',
 'first',
 'released',
 'in',
 '1967',
 '.',
 'i',
 'also',
 'heard',
 'that',
 'at',
 'first',
 'it',
 'was',
 'seized',
 'by',
 'u',
 '.',
 's',
 '.',
 'customs',
 'if',
 'it',
 'ever',
 'tried',
 'to',
 'enter',
 'this',
 'country',
 ',',
 'therefore',
 'being',
 'a',
 'fan',
 'of',
 'films',
 'considered',
 'controversial',
 'i',
 'really',
 'had',
 'to',
 'see',
 'this',
 'for',
 'myself',
 '.',
 'the',
 'plot',
 'is',
 'centered',
 'around',
 'a',
 'young',
 'swedish',
 'drama',
 'student',
 'named',
 'lena',
 'who',
 'wants',
 'to',
 'learn',
 'everything',
 'she',
 'can',
 'about',
 'life',
 '.',
 'in',
 'particular',
 'she',
 'wants',
 'to',
 'focus',
 'her',
 'attentions',
 'to',
 'making',
 'some',
 'sort',
 'of',
 'documentary',
 'on',
 'what',
 'the',
 'average',
 'swede',
 'thought',


In [12]:
vocab['am']   ## index

246

In [13]:
vocab(['i','love'])

[13, 125]

In [14]:
vocab.lookup_token(1)

'<unk>'

In [15]:
vocab(next(yield_tokens(test_data)))

[13,
 1568,
 13,
 246,
 35468,
 43,
 64,
 398,
 1135,
 92,
 7,
 37,
 2,
 7126,
 15,
 3363,
 11,
 60,
 11,
 17,
 94,
 629,
 12,
 6921,
 3,
 13,
 87,
 553,
 15,
 38,
 94,
 11,
 17,
 20193,
 40,
 1225,
 3,
 16,
 3,
 9263,
 51,
 11,
 131,
 780,
 8,
 2480,
 14,
 682,
 4,
 1575,
 118,
 6,
 342,
 7,
 114,
 1160,
 3052,
 13,
 72,
 75,
 8,
 74,
 14,
 19,
 537,
 3,
 2,
 121,
 10,
 5959,
 194,
 6,
 191,
 3862,
 474,
 1424,
 766,
 4314,
 42,
 489,
 8,
 834,
 287,
 61,
 58,
 50,
 127,
 3,
 12,
 826,
 61,
 489,
 8,
 1132,
 47,
 11859,
 8,
 257,
 56,
 441,
 7,
 669,
 28,
 54,
 2,
 863,
 29737,
 209,
 50,
 781,
 1001,
 1304,
 147,
 18,
 2,
 2675,
 337,
 5,
 1510,
 1304,
 12,
 2,
 2359,
 1592,
 3,
 12,
 203,
 2182,
 7271,
 5,
 1919,
 19586,
 7,
 21478,
 50,
 73,
 4656,
 28,
 2381,
 4,
 61,
 52,
 402,
 20,
 47,
 474,
 1692,
 4,
 8135,
 4,
 5,
 999,
 347,
 3,
 54,
 1080,
 78,
 50,
 13,
 246,
 35468,
 10,
 15,
 1614,
 161,
 587,
 4,
 14,
 17,
 1160,
 8206,
 3,
 72,
 4,
 2,
 402,
 5,
 1000,
 145,
 31,
 175

## DataLoader

Define a collate_fn that assembles each batch ...

This example prepares the input for torch.nn.EmbeddingBag.


In [16]:
def collate_batch(data_batch):
    ## label_lst: class ids; text_lst: token-id tensors; offset_lst: token count of each text
    label_lst, text_lst, offset_lst = [],[],[]
    for _label, _text in data_batch:
        label_lst.append(_label-1)  ## {1,2} => {0,1}
        tk_text = vocab(tokenizer(_text))
        text_lst.append(torch.tensor(tk_text,dtype=torch.int64))
        offset_lst.append(len(tk_text))
    label_lst = torch.tensor(label_lst)
    text_lst = torch.cat(text_lst)  ## all texts of the batch as one flat sequence
    ## start position of each text in the flat sequence: 0, len0, len0+len1, ...
    offsets = torch.tensor([0] + offset_lst[:-1]).cumsum(dim=0)
    return label_lst.to(device),text_lst.to(device),offsets.to(device)


test_dl = torch.utils.data.DataLoader(test_data, batch_size=64, shuffle=True, collate_fn = collate_batch)

In [17]:
for tt in test_dl:
    break

tt

(tensor([0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
         1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
         0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0], device='cuda:0'),
 tensor([  37,    7,  592,  ...,   29, 1145,    3], device='cuda:0'),
 tensor([    0,   273,   442,   889,  1060,  1610,  1788,  2090,  2261,  2474,
          2834,  3005,  3210,  3507,  3601,  3871,  4230,  4279,  4418,  4685,
          5132,  5275,  5748,  6026,  6249,  6829,  7435,  7567,  7640,  7827,
          8068,  8224,  8333,  8493,  8671,  8965,  9134,  9257,  9473,  9864,
         10135, 10382, 10620, 10788, 10957, 11020, 11395, 11559, 11873, 11977,
         12377, 13530, 13651, 13800, 13973, 14113, 14568, 14668, 14914, 15100,
         15330, 15622, 15796, 16033], device='cuda:0'))
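The third tensor in the batch holds the start position of each review inside the flat token tensor. A small sketch with made-up token ids shows how those offsets follow from the text lengths and how each text can be recovered from them:

```python
import torch

token_ids = [[5, 8, 2], [7, 7], [9, 1, 4, 3]]     # three made-up tokenized texts
lengths = [len(ids) for ids in token_ids]

# Concatenate everything into one flat sequence
flat = torch.cat([torch.tensor(ids, dtype=torch.int64) for ids in token_ids])

# Offsets are the cumulative lengths, shifted so the first text starts at 0
offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)
print(flat)
print(offsets)

# Slicing between consecutive offsets gives back the original texts
bounds = offsets.tolist() + [len(flat)]
recovered = [flat[a:b].tolist() for a, b in zip(bounds[:-1], bounds[1:])]
print(recovered)
```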

## EmbeddingBag Model

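```
## torch.nn.EmbeddingBag embeds token indices and aggregates the embedding outputs per text: sum / mean / max
## All texts in a batch can be concatenated into one long sequence, with the **offset** (start position) of each text recorded
```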
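As a sanity check of that description, the sketch below compares `torch.nn.EmbeddingBag` in `mean` mode with a manual average over the embedding rows of each text. The sizes and token ids are arbitrary.

```python
import torch

torch.manual_seed(0)
bag = torch.nn.EmbeddingBag(10, 4, mode="mean")

flat = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64)   # two texts, concatenated
offsets = torch.tensor([0, 3], dtype=torch.int64)          # text 0 = flat[0:3], text 1 = flat[3:5]

pooled = bag(flat, offsets)                                # one vector per text

# Manual version: look up the rows and average them per text
manual = torch.stack([
    bag.weight[flat[0:3]].mean(dim=0),
    bag.weight[flat[3:5]].mean(dim=0),
])
print(pooled.shape, torch.allclose(pooled, manual))
```

Since every text is reduced to a single vector, no padding to a common length is needed.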


In [18]:
class TextClassifyModel(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.Embed_Bag = torch.nn.EmbeddingBag(vocab_size,embed_dim)
        self.fc = torch.nn.Linear(embed_dim, 2)
    def forward(self, text, offset):
        embd = self.Embed_Bag(text,offset)
        return self.fc(embd)

model = TextClassifyModel(len(vocab),100).to(device)
model

TextClassifyModel(
  (Embed_Bag): EmbeddingBag(40252, 100, mode='mean')
  (fc): Linear(in_features=100, out_features=2, bias=True)
)

In [19]:
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
exp_lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
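`exp_lr_scheduler` changes the learning rate only when its `step()` method is called, which the training cells shown here do not do. A minimal sketch of stepping a `StepLR` schedule once per epoch, using a dummy parameter in place of the model (an assumption for illustration):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))      # dummy parameter standing in for a model
optimizer = torch.optim.SGD([param], lr=1e-1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

lrs = []
for epoch in range(21):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used in this epoch
    optimizer.step()                             # stands in for one epoch of training
    scheduler.step()                             # advance the schedule once per epoch

print(lrs[0], lrs[19], lrs[20])
```

With `step_size=20` the rate stays at its initial value for the first 20 epochs and is multiplied by `gamma` afterwards, so a two-epoch run is unaffected either way.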

In [20]:
def train(dataloader, model, loss_fn, optimizer):
    lossSum = 0
    model.train()                                    ### set training mode
    for label_lst,text_lst,offsets in dataloader:
        # Compute prediction error
        pred = model(text_lst,offsets)
        loss = loss_fn(pred, label_lst)
        lossSum += loss.item()
        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    avgTrainingLoss = lossSum/len(dataloader)
    return avgTrainingLoss

In [21]:
epochs = 2
for t in range(epochs):
    avgTrainingLoss = train(test_dl, model, loss_fn, optimizer)
    print(f'Epoch {t+1}----Train Loss:: {avgTrainingLoss:>7f}') 

Epoch 1----Train Loss:: 0.671038
Epoch 2----Train Loss:: 0.646020
