# Word Window Classification

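We now want to work through an example NLP task. Here is what we will cover:

    1. Data: creating a dataset of batched tensors
    2. Building the model
    3. Training the model
    4. Prediction

In this section, our goal is to train a model that finds the words in a sentence that correspond to a ```LOCATION``` (always a single word, never a multi-word phrase). This task is called ```Word Window Classification```. Instead of letting our model look at only one word in each forward pass, we want it to take the context of the word in question into account. That is, for each word, we want our model to be aware of the surrounding words. Let's dive in!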

## 1. Data
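The first task in any machine learning project is to set up our training set. Usually, there will be a training corpus we will be utilizing. In NLP tasks, the corpus is generally a ```.txt``` or ```.csv``` file where each row corresponds to a sentence or a tabular datapoint. In our toy task, we will assume that we have already read our data and the corresponding labels into Python lists.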

In [2]:
# Our raw data, which consists of sentences
corpus = [
          "We always come to Paris",
          "The professor is from Australia",
          "I live in Stanford",
          "He comes from Taiwan",
          "The capital of Turkey is Ankara"
         ]

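### 1.1 Preprocessing
To make it easier for our models to learn, we usually apply some preprocessing steps to our data. This is especially important when dealing with text data. Here are some examples of text preprocessing:
* Tokenization: splitting sentences into words.
* Lowercasing: changing all the letters to lowercase.
* Noise removal: removing special characters (such as punctuation).
* Stop word removal: removing commonly used words.

Which preprocessing steps are necessary is determined by the task at hand. For example, although it is useful to remove special characters in some tasks, they may be important in others (for example, if we are dealing with multiple languages). For our task, we will lowercase our words and tokenize them.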
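To make the four steps listed above concrete, here is a minimal sketch that applies all of them to one sentence. The stop-word list and the `preprocess` helper are assumptions made up for this illustration; they are not part of the notebook's pipeline.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration
STOP_WORDS = {"the", "is", "of", "to", "in"}

def preprocess(sentence, stop_words=STOP_WORDS):
    lowered = sentence.lower()                         # lowercasing
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)      # noise removal
    tokens = cleaned.split()                           # tokenization
    return [t for t in tokens if t not in stop_words]  # stop word removal

print(preprocess("The capital of Turkey is Ankara!"))
# ['capital', 'turkey', 'ankara']
```

Note that stop word removal would probably hurt our task: words such as "to" and "in" are exactly the kind of context that hints at a following location, which is one reason we only lowercase and tokenize here.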

In [3]:
# Our function is a simple one, we lowercase the letters
# and then tokenize the words.
def preprocess_sentence(sentence):
  return sentence.lower().split()

# Create our training set
train_sentences = [preprocess_sentence(sent) for sent in corpus]
train_sentences

[['we', 'always', 'come', 'to', 'paris'],
 ['the', 'professor', 'is', 'from', 'australia'],
 ['i', 'live', 'in', 'stanford'],
 ['he', 'comes', 'from', 'taiwan'],
 ['the', 'capital', 'of', 'turkey', 'is', 'ankara']]

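For each training example we have, we should also have a corresponding label. Recall that the goal of our model is to determine which words correspond to a ```LOCATION```. That is, we want our model to output 0 for every word that is not a ```LOCATION``` and 1 for every word that is.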

In [4]:
# Set of locations that appear in our corpus
locations = set(["australia", "ankara", "paris", "stanford", "taiwan", "turkey"])

# Our train labels
train_labels = [[1 if word in locations else 0 for word in sent] for sent in train_sentences]
train_labels

[[0, 0, 0, 0, 1],
 [0, 0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1, 0, 1]]

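### 1.2 Converting Words to Embeddings
Let us look at our training data a little more closely. Each datapoint we have is a sequence of words. On the other hand, we know that machine learning models work with numbers in vectors. How are we going to turn words into numbers? You may be thinking embeddings, and you are right!

Imagine that we have an embedding lookup table E, where each row corresponds to an embedding. That is, each word in our vocabulary has a corresponding row i in this table. Whenever we want to find the embedding for a word, we follow these steps:

    1. Find the index i of the word in the embedding table: word -> index
    2. Index into the embedding table to get the embedding: index -> embedding

Let us look at the first step. We should assign every word in our vocabulary to a corresponding index. We can do it as follows:

    1. Find all the unique words in our corpus.
    2. Assign an index to each.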

In [5]:
# Find all the unique words in our corpus 
vocabulary = set(w for s in train_sentences for w in s)
vocabulary

{'always',
 'ankara',
 'australia',
 'capital',
 'come',
 'comes',
 'from',
 'he',
 'i',
 'in',
 'is',
 'live',
 'of',
 'paris',
 'professor',
 'stanford',
 'taiwan',
 'the',
 'to',
 'turkey',
 'we'}

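Our vocabulary now contains all the words in our corpus. On the other hand, at test time we may see words that are not in our vocabulary. If we can find a way to represent unknown words, our model can still reason about whether they are a ```LOCATION```, since we are also looking at the neighboring words for each prediction.

We introduce a special token, ```<unk>```, to handle out-of-vocabulary words. We could pick another string for our unknown token if we wanted. The only requirement here is that our token should be unique: we should only use this token for unknown words.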

In [6]:
# Add the unknown token to our vocabulary
vocabulary.add("<unk>")

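Earlier we mentioned that our task is called Word Window Classification because our model looks at the surrounding words in addition to the given word when it needs to make a prediction.

For example, let us take the sentence "We always come to Paris". The corresponding training label for this sentence is ```0, 0, 0, 0, 1``` since only Paris, the last word, is a ```LOCATION```. In one pass (that is, one call to ```forward()```), our model will try to generate the correct label for one word. Say our model is trying to generate the correct label 1 for Paris. If we only allow our model to see Paris, and nothing else, we miss out on important information, such as the fact that the word "to" often appears with locations.

Word windows allow our model to consider the N words before and the N words after each word when making a prediction. In our earlier example for Paris, a window size of 1 means that our model will look at the words that come immediately before and after Paris, which are "to" and, well, nothing. This raises another issue. Paris is at the end of our sentence, so there is no word following it. Remember that we define the input dimensions of our PyTorch models when we initialize them. If we set the window size to 1, our model accepts 3 words in every pass. We cannot have our model receive only 2 words from time to time.

The solution is to introduce a special token, such as ```<pad>```, that is added to our sentences to make sure that every word has a valid window around it. As with the ```<unk>``` token, we could pick another string for our pad token if we wanted, as long as we make sure it is used only for this purpose.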

In [7]:
# Add the <pad> token to our vocabulary
vocabulary.add("<pad>")

# Function that pads the given sentence
# We are introducing this function here as an example
# We will be utilizing it later in the tutorial
def pad_window(sentence, window_size, pad_token="<pad>"):
  window = [pad_token] * window_size
  return window + sentence + window

# Show padding example
window_size = 2
pad_window(train_sentences[0], window_size=window_size)

['<pad>', '<pad>', 'we', 'always', 'come', 'to', 'paris', '<pad>', '<pad>']
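Once a sentence is padded, every original word sits at the center of a full window. The sketch below spells this out in plain Python for a window size of 1; `word_windows` is a hypothetical helper introduced only for this illustration, and `pad_window` is repeated so the snippet runs on its own.

```python
def pad_window(sentence, window_size, pad_token="<pad>"):
    window = [pad_token] * window_size
    return window + sentence + window

# Hypothetical helper: one window of 2 * window_size + 1 tokens per original word
def word_windows(sentence, window_size):
    padded = pad_window(sentence, window_size)
    full_window = 2 * window_size + 1
    return [padded[i:i + full_window] for i in range(len(sentence))]

sentence = ["we", "always", "come", "to", "paris"]
for word, window in zip(sentence, word_windows(sentence, window_size=1)):
    print(f"{word:>6}: {window}")
# The last line printed is:
#  paris: ['to', 'paris', '<pad>']
```

Thanks to the padding, the first and last words get windows of the same size as every other word.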

Now that our vocabulary is ready, let us assign an index to each of our words.

In [8]:
# We are just converting our vocabulary to a list to be able to index into it
# Sorting is not necessary, we sort to show an ordered word_to_ix dictionary
# That being said, we will see that having the index for the padding token
# be 0 is convenient as some PyTorch functions use it as a default value
# such as nn.utils.rnn.pad_sequence, which we will cover in a bit
ix_to_word = sorted(list(vocabulary))

# Creating a dictionary to find the index of a given word
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
word_to_ix

{'<pad>': 0,
 '<unk>': 1,
 'always': 2,
 'ankara': 3,
 'australia': 4,
 'capital': 5,
 'come': 6,
 'comes': 7,
 'from': 8,
 'he': 9,
 'i': 10,
 'in': 11,
 'is': 12,
 'live': 13,
 'of': 14,
 'paris': 15,
 'professor': 16,
 'stanford': 17,
 'taiwan': 18,
 'the': 19,
 'to': 20,
 'turkey': 21,
 'we': 22}

Great! We are ready to convert our training sentences into a sequence of indices corresponding to each token.

In [10]:
# Given a sentence of tokens, return the corresponding indices
def convert_token_to_indices(sentence, word_to_ix):
  indices = []
  for token in sentence:
    # Check if the token is in our vocabulary. If it is, get its index.
    # If not, get the index for the unknown token.
    if token in word_to_ix:
      index = word_to_ix[token]
    else:
      index = word_to_ix["<unk>"]
    indices.append(index)
  return indices

# More compact version of the same function
def _convert_token_to_indices(sentence, word_to_ix):
  return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]

# Show an example
example_sentence = ["we", "always", "come", "to", "kuwait"]
example_indices = convert_token_to_indices(example_sentence, word_to_ix)
restored_example = [ix_to_word[ind] for ind in example_indices]

print(f"Original sentence is: {example_sentence}")
print(f"Going from words to indices: {example_indices}")
print(f"Going from indices to words: {restored_example}")

Original sentence is: ['we', 'always', 'come', 'to', 'kuwait']
Going from words to indices: [22, 2, 6, 20, 1]
Going from indices to words: ['we', 'always', 'come', 'to', '<unk>']


In the example above, kuwait shows up as ```<unk>``` because it is not included in our vocabulary. Let us convert our ```train_sentences``` to ```example_padded_indices```.

In [11]:
# Converting our sentences to indices
example_padded_indices = [convert_token_to_indices(s, word_to_ix) for s in train_sentences]
example_padded_indices

[[22, 2, 6, 20, 15],
 [19, 16, 12, 8, 4],
 [10, 13, 11, 17],
 [9, 7, 8, 18],
 [19, 5, 14, 21, 12, 3]]

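Now that we have an index for each word in our vocabulary, we can create an embedding table with the ```nn.Embedding``` class in PyTorch. It is called as ```nn.Embedding(num_words, embedding_dimension)```, where ```num_words``` is the number of words in our vocabulary and ```embedding_dimension``` is the dimension of the embeddings we want to have.
There is nothing fancy about ```nn.Embedding```: it is just a wrapper class around a trainable NxE dimensional tensor, where N is the number of words in our vocabulary and E is the embedding dimension. This table is initially random, but it will change over time. As we train our network, the gradients are backpropagated all the way to the embedding layer, and hence our word embeddings are updated. Normally we would initialize the embedding layer inside our model, but we are doing it here to show an example.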

In [13]:
import torch
import torch.nn as nn

In [14]:
# Creating an embedding table for our words
embedding_dim = 5
embeds = nn.Embedding(len(vocabulary), embedding_dim)

# Printing the parameters in our embedding table
list(embeds.parameters())

[Parameter containing:
 tensor([[ 0.1258, -0.7275,  1.2096, -1.8804,  1.6865],
         [-0.6010, -0.1004,  0.5889, -0.4271,  0.3562],
         [-0.3893, -1.7048,  0.5934,  0.7237,  0.7924],
         [ 0.0721,  0.0072, -0.3638,  1.1655, -0.7205],
         [-0.9576, -0.0367,  0.2058, -0.9100, -0.4638],
         [ 0.0304, -0.7949, -0.1166,  0.1753,  0.3865],
         [-1.0422,  1.1801, -1.3600, -1.0125,  0.3692],
         [ 0.6910, -0.8811,  0.0676, -1.5679, -1.5041],
         [ 1.2847, -1.0386,  0.7956,  0.1647, -0.4481],
         [ 0.8692,  1.2935,  0.3884,  2.4309,  0.5096],
         [-2.2409, -0.8163,  0.7386,  1.1934,  0.1866],
         [ 2.6747,  0.4610, -0.9643,  0.1758,  1.9913],
         [ 1.6052, -0.2443, -0.8988, -0.1224,  0.7469],
         [ 1.1621, -0.7050,  1.4881,  1.6442, -0.2078],
         [-0.3992, -1.7097,  0.0521, -0.2182,  1.5734],
         [-1.4902, -1.2717,  0.1680,  0.5448,  1.5934],
         [ 0.3983,  1.8864, -0.3645, -0.5886, -0.2506],
         [ 1.3114, -1.508

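To get the embedding of a word in our vocabulary, all we need to do is create a lookup tensor. The lookup tensor is just a tensor containing the indices we want to look up. ```nn.Embedding``` expects an index tensor of type ```torch.long``` (a LongTensor), so we should create our tensor accordingly.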

In [18]:
# Get the embedding for the word Paris
index = word_to_ix["paris"]
index_tensor = torch.tensor(index, dtype=torch.long)
print("Index tensor:", index_tensor)
paris_embed = embeds(index_tensor)
paris_embed

Index tensor: tensor(15)


tensor([-1.4902, -1.2717,  0.1680,  0.5448,  1.5934],
       grad_fn=<EmbeddingBackward>)

In [20]:
# We can also get multiple embeddings at once
index_paris = word_to_ix["paris"]
index_ankara = word_to_ix["ankara"]
indices = [index_paris, index_ankara]
print("Indices:", indices)
indices_tensor = torch.tensor(indices, dtype=torch.long)
embeddings = embeds(indices_tensor)
embeddings

Indices: [15, 3]


tensor([[-1.4902, -1.2717,  0.1680,  0.5448,  1.5934],
        [ 0.0721,  0.0072, -0.3638,  1.1655, -0.7205]],
       grad_fn=<EmbeddingBackward>)
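To check the two claims made earlier, that an embedding lookup is just row indexing into a trainable table and that gradients flow back into that table, here is a small sketch. The sizes and the dummy loss are assumptions picked for the illustration, not values from our task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embeds = nn.Embedding(6, 3)  # toy sizes: 6 "words", 3 dimensions

indices = torch.tensor([1, 4], dtype=torch.long)
looked_up = toy_embeds(indices)

# The lookup returns the same values as indexing the weight matrix directly
same_rows = torch.equal(looked_up, toy_embeds.weight[indices])
print(same_rows)  # True

# Backpropagate a dummy loss (the sum of the looked-up embeddings).
# Only rows 1 and 4 of the table receive a nonzero gradient.
looked_up.sum().backward()
print(toy_embeds.weight.grad)
```

Rows that were never looked up keep a zero gradient, so only the embeddings of words that actually appear in a batch get updated in that step.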

Usually, we define the embedding layer as part of our model, as you will see later in this notebook.

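### 1.3 Batching Sentences
We have learned about batches in class. Waiting for the whole training corpus to be processed before making an update is expensive. On the other hand, updating the parameters after every training example makes the loss less stable between updates. To combat these issues, we instead update our parameters after training on a batch of data. This allows us to get a better estimate of the gradient of the global loss. In this section, we will learn how to structure our data into batches using the ```torch.utils.data.DataLoader``` class.

We will be calling the ```DataLoader``` class as follows: ```DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)```. The ```batch_size``` parameter determines the number of examples per batch. In every epoch, we will iterate over all the batches using the ```DataLoader```. The order of batches is deterministic by default, but we can ask ```DataLoader``` to reshuffle the examples at every epoch, which changes what goes into each batch, by setting the ```shuffle``` parameter to ```True```. This way we ensure that we do not encounter the same bad batch multiple times.

If provided, ```DataLoader``` passes the batches it prepares to the ```collate_fn```. We can write a custom function to pass to the ```collate_fn``` parameter in order to print stats about our batch or perform extra processing. In our case, we will use the ```collate_fn``` to:

    1. Window pad our training sentences.
    2. Convert the words in the training examples to indices.
    3. Pad the training examples so that all the sentences and labels have the same length. Similarly, we also need to pad the labels. This creates an issue, because when calculating the loss we need to know the actual number of words in a given example. We will also keep track of this number in the function we pass to the collate_fn parameter.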
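The third step can be sketched on its own with ```nn.utils.rnn.pad_sequence```. The index values below are made up for illustration; the point is that we record the true lengths before padding, because the padded tensor no longer tells us where each example really ends.

```python
import torch
import torch.nn as nn

# Two toy index sequences of different lengths
sequences = [torch.LongTensor([5, 7, 9]), torch.LongTensor([3, 8])]

# Record the true lengths before padding
lengths = torch.LongTensor([len(s) for s in sequences])

# Pad to the longest sequence in the batch; padding_value=0 is also the default
padded = nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=0)

print(padded.tolist())   # [[5, 7, 9], [3, 8, 0]]
print(lengths.tolist())  # [3, 2]
```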

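Because our version of the ```collate_fn``` function needs access to our ```word_to_ix``` dictionary (so that it can turn words into indices), we will make use of the ```partial``` function from Python's ```functools``` module, which binds the given arguments to the function passed to it.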
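As a small sketch of why ```partial``` helps: ```DataLoader``` calls the ```collate_fn``` with a single argument, the batch, so any extra arguments have to be bound in advance. The `collate_with_offset` function and the toy integer dataset are hypothetical and exist only for this illustration.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader

# Hypothetical collate function: turns a batch of ints into a tensor
# and adds an offset that must be supplied from outside
def collate_with_offset(batch, offset):
    return torch.tensor(batch) + offset

# Bind the extra argument ahead of time, leaving only `batch` open
collate_fn = partial(collate_with_offset, offset=10)

toy_loader = DataLoader(list(range(5)), batch_size=2, shuffle=False, collate_fn=collate_fn)
batches = [b.tolist() for b in toy_loader]
print(batches)  # [[10, 11], [12, 13], [14]]
```

The last batch is smaller because 5 examples do not divide evenly into batches of 2.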

In [21]:
from torch.utils.data import DataLoader
from functools import partial

def custom_collate_fn(batch, window_size, word_to_ix):
  # Break our batch into the training examples (x) and labels (y).
  # Both are still Python sequences at this point; we turn them into tensors
  # further below because nn.utils.rnn.pad_sequence expects tensors. This is
  # also useful since our model will be expecting tensor inputs.
  x, y = zip(*batch)

  # Now we need to window pad our training examples. We have already defined a 
  # function to handle window padding. We are including it here again so that
  # everything is in one place.
  def pad_window(sentence, window_size, pad_token="<pad>"):
    window = [pad_token] * window_size
    return window + sentence + window

  # Pad the train examples.
  x = [pad_window(s, window_size=window_size) for s in x]

  # Now we need to turn words in our training examples to indices. We are
  # copying the function defined earlier for the same reason as above.
  def convert_tokens_to_indices(sentence, word_to_ix):
    return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]

  # Convert the train examples into indices.
  x = [convert_tokens_to_indices(s, word_to_ix) for s in x]

  # We will now pad the examples so that the lengths of all the example in 
  # one batch are the same, making it possible to do matrix operations. 
  # We set the batch_first parameter to True so that the returned matrix has 
  # the batch as the first dimension.
  pad_token_ix = word_to_ix["<pad>"]

  # pad_sequence function expects the input to be a tensor, so we turn x into one
  x = [torch.LongTensor(x_i) for x_i in x]
  x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

  # We will also pad the labels. Before doing so, we will record the number 
  # of labels so that we know how many words existed in each example. 
  lengths = [len(label) for label in y]
  lengths = torch.LongTensor(lengths)

  y = [torch.LongTensor(y_i) for y_i in y]
  y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

  # We are now ready to return our variables. The order we return our variables
  # here will match the order we read them in our training loop.
  return x_padded, y_padded, lengths

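This function seems long, but it really doesn't have to be. Check out the alternative version below, where we remove the extra function declarations and comments.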

In [23]:
def _custom_collate_fn(batch, window_size, word_to_ix):
  # Prepare the datapoints
  x, y = zip(*batch)  
  x = [pad_window(s, window_size=window_size) for s in x]
  x = [convert_token_to_indices(s, word_to_ix) for s in x]

  # Pad x so that all the examples in the batch have the same size
  pad_token_ix = word_to_ix["<pad>"]
  x = [torch.LongTensor(x_i) for x_i in x]
  x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

  # Pad y and record the length
  lengths = [len(label) for label in y]
  lengths = torch.LongTensor(lengths)
  y = [torch.LongTensor(y_i) for y_i in y]
  y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

  return x_padded, y_padded, lengths

Now, we can see the ```DataLoader``` in action.

In [24]:
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate the DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Go through one loop
counter = 0
for batched_x, batched_y, batched_lengths in loader:
  print(f"Iteration {counter}")
  print("Batched Input:")
  print(batched_x)
  print("Batched Labels:")
  print(batched_y)
  print("Batched Lengths:")
  print(batched_lengths)
  print("")
  counter += 1

Iteration 0
Batched Input:
tensor([[ 0,  0, 19, 16, 12,  8,  4,  0,  0],
        [ 0,  0,  9,  7,  8, 18,  0,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0]])
Batched Lengths:
tensor([5, 4])

Iteration 1
Batched Input:
tensor([[ 0,  0, 10, 13, 11, 17,  0,  0,  0],
        [ 0,  0, 22,  2,  6, 20, 15,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]])
Batched Lengths:
tensor([4, 5])

Iteration 2
Batched Input:
tensor([[ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 1, 0, 1]])
Batched Lengths:
tensor([6])



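The batched input tensors you see above will be passed into our model. On the other hand, we started off saying that our model would be a window classifier. With the way our input tensors are currently formatted, we have all the words of a sentence in one datapoint. When we pass this input to our model, it needs to create the windows for each word, predict for each window whether the center word is a ```LOCATION```, put the predictions together, and return them.

We could avoid this if we formatted our data beforehand by breaking it into windows. In this example, we will instead let our model take care of this formatting.

Given that our ```window_size``` is N, we want our model to make a prediction for every window of 2N+1 tokens. That is, if we have an input with 9 tokens and a ```window_size``` of 2, we want our model to return 5 predictions. This makes sense: before we padded it with 2 tokens on each side, our input also had 5 tokens!

We can create these windows with for loops, but there is a faster PyTorch alternative, the ```unfold(dimension, size, step)``` method. We can create the windows we need using this method as follows: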

In [26]:
# Print the original tensor
print(f"Original Tensor: ")
print(batched_x)
print("")

# Create the 2 * 2 + 1 chunks
chunk = batched_x.unfold(1, window_size*2 + 1, 1)
print(f"Windows: ")
print(chunk)

Original Tensor: 
tensor([[ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]])

Windows: 
tensor([[[ 0,  0, 19,  5, 14],
         [ 0, 19,  5, 14, 21],
         [19,  5, 14, 21, 12],
         [ 5, 14, 21, 12,  3],
         [14, 21, 12,  3,  0],
         [21, 12,  3,  0,  0]]])
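To see that ```unfold``` really produces the same windows as the for-loop approach, and that 9 padded tokens give 5 windows when ```window_size``` is 2, here is a short comparison. The loop-based version is only a sketch written for this check.

```python
import torch

window_size = 2
full_window = 2 * window_size + 1

# A toy batch: one window-padded sentence of 9 token indices
batch = torch.tensor([[0, 0, 22, 2, 6, 20, 15, 0, 0]])

# Windows via unfold: slide a window of size 5 along dimension 1 with step 1
unfolded = batch.unfold(1, full_window, 1)

# The same windows built with an explicit Python loop
num_windows = batch.size(1) - full_window + 1
looped = torch.stack(
    [batch[:, i:i + full_window] for i in range(num_windows)], dim=1
)

print(tuple(unfolded.shape))          # (1, 5, 5)
print(torch.equal(unfolded, looped))  # True
```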


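## 2. Model
Now that we have prepared our data, we are ready to build our model. We have learned how to write custom ```nn.Module``` classes. We will do the same here and put together everything we have learned so far.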

In [27]:
class WordWindowClassifier(nn.Module):

  def __init__(self, hyperparameters, vocab_size, pad_ix=0):
    super(WordWindowClassifier, self).__init__()
    
    """ Instance variables """
    self.window_size = hyperparameters["window_size"]
    self.embed_dim = hyperparameters["embed_dim"]
    self.hidden_dim = hyperparameters["hidden_dim"]
    self.freeze_embeddings = hyperparameters["freeze_embeddings"]

    """ Embedding Layer 
    Takes in a tensor containing embedding indices, and returns the 
    corresponding embeddings. The output is of dim 
    (number_of_indices * embedding_dim).

    If freeze_embeddings is True, set the embedding layer parameters to be
    non-trainable. This is useful if we only want the parameters other than the
    embeddings parameters to change. 

    """
    self.embeds = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_ix)
    if self.freeze_embeddings:
      self.embeds.weight.requires_grad = False

    """ Hidden Layer
    """
    full_window_size = 2 * self.window_size + 1
    self.hidden_layer = nn.Sequential(
      nn.Linear(full_window_size * self.embed_dim, self.hidden_dim), 
      nn.Tanh()
    )

    """ Output Layer
    """
    self.output_layer = nn.Linear(self.hidden_dim, 1)

    """ Probabilities 
    """
    self.probabilities = nn.Sigmoid()

  def forward(self, inputs):
    """
    Let B:= batch_size
        L:= window-padded sentence length
        L~:= number of windows per sentence, L - 2 * self.window_size
        D:= self.embed_dim
        S:= full window size, 2 * self.window_size + 1
        H:= self.hidden_dim
        
    inputs: a (B, L) tensor of token indices
    """
    B, L = inputs.size()

    """
    Reshaping.
    Takes in a (B, L) LongTensor
    Outputs a (B, L~, S) LongTensor
    """
    # First, get our word windows for each word in our input.
    token_windows = inputs.unfold(1, 2 * self.window_size + 1, 1)
    _, adjusted_length, _ = token_windows.size()

    # Good idea to do internal tensor-size sanity checks, at the least in comments!
    assert token_windows.size() == (B, adjusted_length, 2 * self.window_size + 1)

    """
    Embedding.
    Takes in a torch.LongTensor of size (B, L~, S) 
    Outputs a (B, L~, S, D) FloatTensor.
    """
    embedded_windows = self.embeds(token_windows)

    """
    Reshaping.
    Takes in a (B, L~, S, D) FloatTensor.
    Resizes it into a (B, L~, S*D) FloatTensor.
    -1 argument "infers" what the last dimension should be based on leftover axes.
    """
    embedded_windows = embedded_windows.view(B, adjusted_length, -1)

    """
    Layer 1.
    Takes in a (B, L~, S*D) FloatTensor.
    Outputs a (B, L~, H) FloatTensor.
    """
    layer_1 = self.hidden_layer(embedded_windows)

    """
    Layer 2.
    Takes in a (B, L~, H) FloatTensor.
    Outputs a (B, L~, 1) FloatTensor.
    """
    output = self.output_layer(layer_1)

    """
    Sigmoid.
    Takes in a (B, L~, 1) FloatTensor of unnormalized scores.
    Outputs a (B, L~, 1) FloatTensor of probabilities in (0, 1).
    """
    output = self.probabilities(output)
    output = output.view(B, -1)

    return output
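To follow the tensor shapes through the forward pass, the sketch below repeats the same sequence of operations with standalone layers and prints the shape after each step. All sizes are toy values assumed for this illustration, and the layers are untrained.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for this illustration
B, L, window_size, D, H, vocab_size = 2, 9, 2, 4, 3, 10
full_window = 2 * window_size + 1

inputs = torch.randint(0, vocab_size, (B, L))   # (B, L) token indices
embeds = nn.Embedding(vocab_size, D, padding_idx=0)
hidden = nn.Sequential(nn.Linear(full_window * D, H), nn.Tanh())
output_layer = nn.Linear(H, 1)

windows = inputs.unfold(1, full_window, 1)      # (B, L~, full_window)
embedded = embeds(windows)                      # (B, L~, full_window, D)
flat = embedded.view(B, windows.size(1), -1)    # (B, L~, full_window * D)
scores = output_layer(hidden(flat))             # (B, L~, 1)
probs = torch.sigmoid(scores).view(B, -1)       # (B, L~)

for name, t in [("windows", windows), ("embedded", embedded),
                ("flat", flat), ("scores", scores), ("probs", probs)]:
    print(f"{name:>8}: {tuple(t.shape)}")
```

With these sizes, L~ is 9 - 2 * 2 = 5, so the model returns one probability for each of the 5 original words in every sentence.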

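## 3. Training
Now we are ready to put everything together. Let us start by preparing our data and initializing our model. We can then initialize our optimizer and define our loss function. This time, instead of using one of the predefined loss functions as we did before, we will define our own loss function.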

In [28]:
# Prepare the data
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate a DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Initialize a model
# It is useful to put all the model hyperparameters in a dictionary
model_hyperparameters = {
    "batch_size": 4,
    "window_size": 2,
    "embed_dim": 25,
    "hidden_dim": 25,
    "freeze_embeddings": False,
}

vocab_size = len(word_to_ix)
model = WordWindowClassifier(model_hyperparameters, vocab_size)

# Define an optimizer
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Define a loss function, which computes the binary cross entropy loss
def loss_function(batch_outputs, batch_labels, batch_lengths):   
    # Calculate the loss for the whole batch
    bceloss = nn.BCELoss()
    loss = bceloss(batch_outputs, batch_labels.float())

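    # Rescale the loss. Remember that we have used lengths to store the 
    # number of words in each training example. Note that nn.BCELoss already
    # averages over all elements by default (reduction="mean"), so this
    # division is an extra rescaling on top of that average.
    loss = loss / batch_lengths.sum().float()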

    return loss
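The loss above is computed over every position in the batch, including the padded label positions. As one possible alternative (a sketch, not what this notebook uses), the recorded lengths can be turned into a mask so that padded positions do not contribute at all. The `masked_bce_loss` name and the toy tensors are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(batch_outputs, batch_labels, batch_lengths):
    # mask[i, j] is 1.0 when position j is a real word of example i, else 0.0
    max_len = batch_labels.size(1)
    positions = torch.arange(max_len).unsqueeze(0)            # (1, L)
    mask = (positions < batch_lengths.unsqueeze(1)).float()   # (B, L)

    # Per-position loss, without any averaging
    per_token = F.binary_cross_entropy(
        batch_outputs, batch_labels.float(), reduction="none"
    )

    # Average over the real words only
    return (per_token * mask).sum() / mask.sum()

outputs = torch.tensor([[0.9, 0.1, 0.5], [0.8, 0.5, 0.5]])
labels = torch.tensor([[1, 0, 0], [1, 0, 0]])
lengths = torch.LongTensor([2, 1])
loss = masked_bce_loss(outputs, labels, lengths)

# Changing the predictions at padded positions leaves the loss unchanged
other_outputs = outputs.clone()
other_outputs[0, 2] = 0.99
other_outputs[1, 1:] = 0.01
other_loss = masked_bce_loss(other_outputs, labels, lengths)
print(torch.isclose(loss, other_loss).item())  # True
```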

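Unlike our earlier example, this time we will use batches instead of passing all of our training data to the model at once in each iteration. Hence, in each training epoch, we iterate over the batches.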

In [29]:
# Function that will be called in every epoch
def train_epoch(loss_function, optimizer, model, loader):
  
  # Keep track of the total loss for the epoch
  total_loss = 0
  for batch_inputs, batch_labels, batch_lengths in loader:
    # Clear the gradients
    optimizer.zero_grad()
    # Run a forward pass
    outputs = model(batch_inputs)
    # Compute the batch loss
    loss = loss_function(outputs, batch_labels, batch_lengths)
    # Calculate the gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
    total_loss += loss.item()

  return total_loss


# Function containing our main training loop
def train(loss_function, optimizer, model, loader, num_epochs=10000):

  # Iterate through each epoch and call our train_epoch function
  for epoch in range(num_epochs):
    epoch_loss = train_epoch(loss_function, optimizer, model, loader)
    if epoch % 100 == 0: print(epoch_loss)

Let's start training!

In [30]:
num_epochs = 1000
train(loss_function, optimizer, model, loader, num_epochs=num_epochs)

0.30812162160873413
0.2404438816010952
0.17627645283937454
0.1325424425303936
0.1190925408154726
0.07673261314630508
0.0607409942895174
0.061209687031805515
0.05008536949753761
0.04399936180561781


## 4. Prediction
Let us see how well our model does at making predictions. We can start by creating our test data.

In [31]:
test_corpus = ["She comes from Paris"]
test_sentences = [s.lower().split() for s in test_corpus]
test_labels = [[0, 0, 0, 1]]

# Create a test loader
test_data = list(zip(test_sentences, test_labels))
batch_size = 1
shuffle = False
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)
test_loader = DataLoader(test_data,
                         batch_size=batch_size,
                         shuffle=shuffle,
                         collate_fn=collate_fn)

Let us loop over our test examples to see how well we are doing.

In [32]:
for test_instance, labels, _ in test_loader:
  outputs = model.forward(test_instance)
  print(labels)
  print(outputs)

tensor([[0, 0, 0, 1]])
tensor([[0.3551, 0.1014, 0.0432, 0.9527]], grad_fn=<ViewBackward>)
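The model returns probabilities, so one more step is needed to get 0/1 labels. A minimal sketch, assuming a decision threshold of 0.5 (the threshold is our assumption, not something fixed by the model) and reusing the probabilities printed above:

```python
import torch

# Values copied from the output above
words = ["she", "comes", "from", "paris"]
probabilities = torch.tensor([[0.3551, 0.1014, 0.0432, 0.9527]])

# Assumed decision threshold
threshold = 0.5
predicted = (probabilities > threshold).long()

# Keep the words whose predicted label is 1
predicted_locations = [w for w, p in zip(words, predicted[0].tolist()) if p == 1]

print(predicted.tolist())    # [[0, 0, 0, 1]]
print(predicted_locations)   # ['paris']
```

With this threshold the predictions match the test labels. Note that "she" is not in our vocabulary, so the model sees it as ```<unk>```.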

