# PyTorch Assignment: Natural Language Processing (NLP)

**[Duke Community Standard](http://integrity.duke.edu/standard.html): By typing your name below, you are certifying that you have adhered to the Duke Community Standard in completing this assignment.**

Name: Fabian Alejandro Toledo

### Text Classification

In Notebook 4A, we built a sentiment analysis model for movie reviews.
That particular sentiment analysis was a two-class classification problem, where the two classes were whether the review was positive or negative.
Of course, natural language comes in all sorts of different forms; sometimes we want to perform other types of classification.

For example, the AG News dataset contains text from 127600 online news articles, from 4 different categories: World, Sports, Business, and Science/Technology.
AG News is typically used for topic classification: given an unseen news article, we're interested in predicting the topic.
For this assignment, you'll be training several models on the AG News dataset.
Unlike the quick example we trained in Notebook 4A, however, we're going to *learn* the word embeddings.
Since you may be unfamiliar with AG News, we're going to walk through how to load the data, to get you started.


### Loading AG News with Torchtext

The AG News dataset is one of many included in Torchtext.
It can be found grouped together with many of the other text classification datasets.
While we can download the source text online, Torchtext makes it retrievable with a quick API call&ast;. If you are running this notebook on your own machine, you can run this block:

In [1]:
import torch
import torchtext
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
import string

In [2]:
train_iter, test_iter = torchtext.datasets.AG_NEWS(root="./datasets")

<font size="1">&ast;At the time this notebook was created, Torchtext contains a small bug in its csv reader.
You may need to change one line in the source code, as suggested [here](https://discuss.pytorch.org/t/one-error-about-the-utils-pys-code/53885) to successfully load the AG News dataset.
</font>

Unfortunately, Torchtext assumes we have network connectivity. If we don't have network access, such as notebooks running in Coursera Labs, we need to reimplement some Torchtext functionality. Skip this next block if you were able to successfully run the previous code:

In [3]:
remove_punct = str.maketrans('','',string.punctuation)

tokenizer = get_tokenizer('basic_english')
counter = Counter()

label_list, text_list = [], []
for (label, line) in train_iter:
    # Delete punctuation signs and lowercase the sentences
    line = line.translate(remove_punct).lower()
    # Update the counter
    counter.update(tokenizer(line))
    # Append to lists of labels and texts
    label_list.append(label-1)
    text_list.append(tokenizer(line))

vocab = Vocab(counter, min_freq=1)

text_pipeline = lambda tokens_line : [vocab[token] for token in tokens_line]

tensors_list = []
for tokens_line in text_list:
    tensors_list.append(torch.tensor(text_pipeline(tokens_line), dtype=torch.int64))

# Form a list of tuples with (label, Tensor with integers from vocab)
agnews_train = list(zip(label_list, tensors_list))

In [4]:
agnews_test = []
for (label, line) in test_iter:
    # Delete punctuation signs and lowercase the sentences
    line = line.translate(remove_punct).lower()
    # Append to list of tuples with (label, Tensor with integers from vocab)
    tensor_line = torch.tensor(text_pipeline(tokenizer(line)), dtype=torch.int64)
    agnews_test.append((label-1, tensor_line))

    

Let's inspect the first data example to see how the data is formatted:

In [5]:
print(agnews_train[0])
print(agnews_test[0])


(2, tensor([  399,   395,  1564, 14824,   101,    55,     2,   839,    24,    24,
        52406,   399,  2034, 70568,     5, 53948,    35,  3998,   763,   298]))
(2, tensor([  835,     9,  2449,  2505,  1440,    29,   155,  1520,  4041,   376,
           14,  6811, 39137,   213,    61,    35,  4735,    29,   155,    12,
        11288,  2408,   305,   176,  9826]))


We can see that Torchtext has each example as a tuple, with the first element being the label (0, 1, 2, or 3), and the second element the text data.
Notice that the text is already "tokenized": the words of the news article have been represented as word IDs, with each number corresponding to a unique word.
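
To make the idea of word IDs concrete, here is a minimal sketch in plain Python of how a vocabulary can map tokens to integers. The toy sentences, the `build_vocab` helper, and the choice to reserve ID 0 for padding and ID 1 for unknown words are assumptions made for illustration; they are not Torchtext's exact behavior.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1):
    """Map each token to an integer ID, most frequent tokens first.

    IDs 0 and 1 are reserved for padding and unknown tokens
    (an assumption made for this sketch).
    """
    counter = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, freq in counter.most_common():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

toy_texts = [
    "stocks fall as oil prices rise".split(),
    "oil prices hit record high".split(),
]
toy_vocab = build_vocab(toy_texts)

# Convert a new sentence to IDs; words never seen before map to <unk>.
unk_id = toy_vocab["<unk>"]
ids = [toy_vocab.get(tok, unk_id) for tok in "oil prices crash".split()]
print(ids)
```

Frequent words receive small IDs, and the word "crash", which never appears in the toy corpus, falls back to the unknown-token ID.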

In previous notebooks, we've used `DataLoader`s to handle shuffling and batching.
However, if we directly try to feed these dataset objects into a `DataLoader`, we will face an error when we try to draw our first batch.
Can you figure out why?
Here's a hint:

In [6]:
print("Length of the first text example: {}".format(len(agnews_train[0][1])))
print("Length of the second text example: {}".format(len(agnews_train[1][1])))
print("Length of the third text example: {}".format(len(agnews_train[2][1])))
print("Length of the 101st text example: {}".format(len(agnews_train[100][1])))

Length of the first text example: 20
Length of the second text example: 35
Length of the third text example: 35
Length of the 101st text example: 44


Because each example is a news snippet, the examples can vary in length.
This is natural, as humans don't stick to consistent sentence length while writing.
This creates a bit of a problem while batching, as default tensors expect the size of each dimension to be consistent.
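
The failure is easy to reproduce in isolation. The default batching in a `DataLoader` stacks the examples into one tensor, and stacking only works when every tensor has the same shape. The two toy tensors below are made up for illustration.

```python
import torch

short = torch.tensor([5, 8, 2])
longer = torch.tensor([7, 1, 9, 4, 3])

try:
    # Stacking needs identical shapes, so this raises a RuntimeError.
    torch.stack([short, longer])
    stacked_ok = True
except RuntimeError as err:
    stacked_ok = False
    print("Stacking failed:", err)
```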

How do we fix this?
The common solution is to perform padding and/or truncation, picking a maximum sequence length $L$.
Inputs longer than the maximum length are truncated (i.e. $x_{t>L}$ are discarded), and shorter sequences have zeros padded to the end until they are all of length $L$.
We'll focus on padding here, for simplicity.
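
For reference, combining both operations takes only a few lines. The `pad_or_truncate` helper below is a hypothetical name introduced for this sketch, and it assumes 0 is the padding value; it is not part of PyTorch.

```python
import torch

def pad_or_truncate(seq, max_len, pad_value=0):
    """Return a 1-D tensor with exactly max_len entries."""
    seq = seq[:max_len]  # drop everything past position max_len
    out = torch.full((max_len,), pad_value, dtype=seq.dtype)
    out[: len(seq)] = seq  # copy the real tokens; the rest stays as padding
    return out

a = pad_or_truncate(torch.tensor([4, 9, 7]), max_len=5)
b = pad_or_truncate(torch.tensor([4, 9, 7, 2, 8, 6, 1]), max_len=5)
print(a)  # shorter input: zeros appended
print(b)  # longer input: tail discarded
```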

We can perform this padding manually, but PyTorch has this functionality implemented.
As an example, let's pad two sequences of different lengths (the first and the 101st training examples) to the same length:

In [7]:
from torch.nn.utils.rnn import pad_sequence

padded_exs = pad_sequence([agnews_train[0][1], agnews_train[100][1]])
print("First sequence padded: {}".format(padded_exs[:,0]))
print("First sequence length: {}".format(len(padded_exs[:,0])))
print("Second sequence padded: {}".format(padded_exs[:,1]))
print("Second sequence length: {}".format(len(padded_exs[:,1])))

First sequence padded: tensor([  399,   395,  1564, 14824,   101,    55,     2,   839,    24,    24,
        52406,   399,  2034, 70568,     5, 53948,    35,  3998,   763,   298,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])
First sequence length: 44
Second sequence padded: tensor([11635, 17848,     7,  3665,   402,     4,  5203,   600,  2080,  2080,
            4,  5203,   600,  1340,     3,  9794, 11635,     7, 17848,   166,
         1709,     3,    31,   148,     3,  3665,   616,     2, 12943,   509,
           35,  2114,    59,  7744,     7,    83,    31,    13,  2120,    13,
        24766,    17, 10362,  4182])
Second sequence length: 44


Although originally of unequal lengths, both sequences are now the same length, with the shorter one padded with zeros.

We'd like the `DataLoader` to perform this padding operation as part of its batching process, as this will allow us to effectively combine varying-length sequences in the same input tensor.
Fortunately, `DataLoader`s let us override the default batching behavior with the `collate_fn` argument.

In [8]:
import numpy as np
import torch

def collator(batch):
    labels = torch.tensor([example[0] for example in batch])
    sentences = [example[1] for example in batch]
    data = pad_sequence(sentences)
    
    return [data, labels]
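
As a quick sanity check, we can run the same collator on a tiny made-up dataset and look at the batch shapes. The `toy_data` list below is an assumption for illustration. Note that each batch is padded only to the length of its own longest sequence, so different batches can have different lengths.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collator(batch):
    labels = torch.tensor([example[0] for example in batch])
    sentences = [example[1] for example in batch]
    data = pad_sequence(sentences)  # shape: (longest_in_batch, batch_size)
    return [data, labels]

# Toy dataset of (label, token-ID tensor) pairs with different lengths.
toy_data = [
    (0, torch.tensor([3, 1, 4])),
    (1, torch.tensor([1, 5, 9, 2, 6])),
    (2, torch.tensor([5, 3])),
    (3, torch.tensor([5, 8, 9, 7])),
]

toy_loader = DataLoader(toy_data, batch_size=2, shuffle=False, collate_fn=collator)
shapes = [tuple(x.shape) for x, _ in toy_loader]
print(shapes)
```

The sequence dimension comes first and the batch dimension second, which is why the training loops count examples with `x.shape[1]`.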

Now that we have our collator padding our sequences, we can create our `DataLoader`s.
One last thing we need to do is choose a batch size for our `DataLoader`.
This may be something you have to play around with.
Too big and you may exceed your system's memory; too small and training may take longer (especially on CPU).
Batch size also tends to influence training dynamics and model generalization.
Fiddle around and see what works best.

In [9]:
BATCH_SIZE = 128

train_loader = torch.utils.data.DataLoader(agnews_train, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collator)
test_loader = torch.utils.data.DataLoader(agnews_test, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator)


### Simple Word Embedding Model

First, let's try out the Simple Word Embedding Model (SWEM) that we built in Notebook 4A on the AG News dataset.
Unlike before though, instead of loading pre-trained embeddings, let's learn the embeddings from scratch.
Before we begin, it will be helpful to define a few more hyperparameters.

In [32]:
VOCAB_SIZE = len(counter)+10
EMBED_DIM = 100
HIDDEN_DIM = 64
NUM_OUTPUTS = 4
NUM_EPOCHS = 3

print(VOCAB_SIZE)


102179


Once again, we're going to organize our model as a `nn.Module`.
Instead of assuming the input is already an embedding, we're going to make learning the embedding part of our model.
We do this by using `nn.Embedding` to perform an embedding look-up at the beginning of our forward pass.
Once we've done the look-up, we'll have a minibatch of embedded sequences of dimension $L \times$ `BATCH_SIZE` $\times$ `EMBED_DIM`.
For SWEM, remember, we take the mean&ast; across the length dimension to get an average embedding for the sequence.

<font size="1"> 
&ast;Note: Technically we should only take the mean across the embeddings at the positions corresponding to "real" words in our input, and not for the zero paddings we artificially added.
This can be done by generating a binary mask while doing the padding to track the "real" words in the input.
Ultimately though, this refinement doesn't have much impact on the results for this particular task, so we omit it for simplicity.
</font>

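For readers curious about the refinement mentioned in the note, here is a minimal sketch of a masked mean. It assumes that ID 0 is used only for padding (which may not hold for every vocabulary), and the `masked_mean` helper and toy sizes are hypothetical names and values chosen for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumption: ID 0 appears only as padding

def masked_mean(embed, token_ids, pad_id=PAD_ID):
    """Average embeddings over real tokens only.

    embed:     (L, batch, embed_dim) float tensor
    token_ids: (L, batch) integer tensor used to build the mask
    """
    mask = (token_ids != pad_id).unsqueeze(-1).to(embed.dtype)  # (L, batch, 1)
    summed = (embed * mask).sum(dim=0)                          # (batch, embed_dim)
    counts = mask.sum(dim=0).clamp(min=1)                       # avoid dividing by zero
    return summed / counts

token_ids = pad_sequence([torch.tensor([2, 3]), torch.tensor([4, 5, 6, 7])])
embedding = torch.nn.Embedding(10, 3)
embed = embedding(token_ids)

avg = masked_mean(embed, token_ids)
print(avg.shape)
```

For the first sequence, the result equals the plain average of the embeddings for IDs 2 and 3, with the two padded positions ignored.
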
In [27]:
import torch.nn as nn
import torch.nn.functional as F

class SWEM(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_dim, num_outputs):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        
        self.fc1 = nn.Linear(embedding_size, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_outputs)

    def forward(self, x):
        embed = self.embedding(x)
        embed_mean = torch.mean(embed, dim=0)
        
        h = self.fc1(embed_mean)
        h = F.relu(h)
        h = self.fc2(h)
        return h

With the model defined, we can instantiate, train, and evaluate.
Try doing so below!
Because of the way we organized our model as an `nn.Module` and our data pipeline with a `DataLoader`, you should be able to use much of the same code as we have in other examples.

<font size="1"> 
Note: Depending on your system and on how many training epochs you set, training may take up to a few hours.
To see results sooner, you can train for fewer iterations, but perhaps at the cost of final accuracy.
On the other hand, using a GPU (and GPU-enabled PyTorch) should enable full training in a couple of minutes.
</font>

In [28]:
### YOUR CODE HERE ###

# Initialize the model
model_swem = SWEM(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_OUTPUTS)


In [38]:
### YOUR CODE HERE ###

# Create loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_swem.parameters(), lr=0.001)

print([p.shape for p in model_swem.parameters()])


[torch.Size([102179, 100]), torch.Size([64, 100]), torch.Size([64]), torch.Size([4, 64]), torch.Size([4])]


In [30]:
### YOUR CODE HERE ###

for epoch in range(NUM_EPOCHS):
    correct = 0
    num_examples = 0
    mini_batch_num = 0
    for x, labels in train_loader:
        mini_batch_num += 1
        # Reset grads
        optimizer.zero_grad()
        # Forward pass
        y = model_swem(x)
        # Calc the loss
        loss = loss_fn(y, labels)
        # Backward pass
        loss.backward()
        # Update the parameters
        optimizer.step()

        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_examples += x.shape[1]

        if mini_batch_num % 50 == 0:
            acc = correct/num_examples
            print(f"Epoch: {epoch} \t Batch Num: {mini_batch_num} \t Train Loss: {loss} \t Train Acc: {acc}")

    

Epoch: 0 	 Batch Num: 50 	 Train Loss: 1.379472255706787 	 Train Acc: 0.3006249964237213
Epoch: 0 	 Batch Num: 100 	 Train Loss: 1.3291141986846924 	 Train Acc: 0.32929688692092896
Epoch: 0 	 Batch Num: 150 	 Train Loss: 1.193419098854065 	 Train Acc: 0.382708340883255
Epoch: 0 	 Batch Num: 200 	 Train Loss: 1.0379514694213867 	 Train Acc: 0.43460938334465027
Epoch: 0 	 Batch Num: 250 	 Train Loss: 0.745442807674408 	 Train Acc: 0.48656249046325684
Epoch: 0 	 Batch Num: 300 	 Train Loss: 0.7067626118659973 	 Train Acc: 0.5308333039283752
Epoch: 0 	 Batch Num: 350 	 Train Loss: 0.7388306856155396 	 Train Acc: 0.5698884129524231
Epoch: 0 	 Batch Num: 400 	 Train Loss: 0.568041205406189 	 Train Acc: 0.5994336009025574
Epoch: 0 	 Batch Num: 450 	 Train Loss: 0.4708831310272217 	 Train Acc: 0.624670147895813
Epoch: 0 	 Batch Num: 500 	 Train Loss: 0.4819585680961609 	 Train Acc: 0.6454843878746033
Epoch: 0 	 Batch Num: 550 	 Train Loss: 0.38743454217910767 	 Train Acc: 0.663309633731842
Epo

In [31]:
## Testing
correct = 0
num_test = 0

with torch.no_grad():
    # Iterate through test set minibatches
    for x, labels in test_loader:
        # Forward pass
        y = model_swem(x)
        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_test += x.shape[1]        
    
print('Test accuracy: {}'.format(correct/num_test))


Test accuracy: 0.9031578898429871


### RNNs

SWEM takes a mean over the time dimension, which means we're losing any information about the order of the data sequence.
How detrimental is this for document topic classification?
Modify the SWEM model to use an RNN instead.
Once you get an RNN working, try a GRU and LSTM as well.
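
Before modifying the model, it may help to see what PyTorch's recurrent layers return. The sketch below uses random inputs and made-up sizes (all assumptions for illustration) in the same sequence-first layout that `pad_sequence` produces.

```python
import torch
import torch.nn as nn

seq_len, batch_size, embed_dim, hidden_dim, num_layers = 7, 3, 10, 16, 2

rnn = nn.RNN(embed_dim, hidden_dim, num_layers=num_layers)
x = torch.randn(seq_len, batch_size, embed_dim)  # stand-in for embedded tokens

out, h = rnn(x)
print(out.shape)  # (seq_len, batch_size, hidden_dim): top-layer state at every step
print(h.shape)    # (num_layers, batch_size, hidden_dim): final state of each layer

# Two common ways to summarize a sequence as one vector per example:
mean_pooled = out.mean(dim=0)  # average the top-layer states over time
last_state = h[-1]             # final hidden state of the top layer
```

Either summary can feed the final linear layer. With sequences padded at the end, the final state is computed after the padding positions, which is one reason averaging over time can be a reasonable starting point. `nn.GRU` returns the same pair, while `nn.LSTM` returns the hidden and cell states together as a tuple in the second position.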

In [12]:
### YOUR CODE HERE ###
import torch.nn as nn
import torch.nn.functional as F

class SRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_dim, rnn_layers, num_outputs):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.RNN(embedding_size, hidden_dim, num_layers=rnn_layers)
        self.fc1 = nn.Linear(hidden_dim, num_outputs)

    def forward(self, x):
        embed = self.embedding(x)
        out, h = self.rnn(embed)
        embed_mean = torch.mean(out, dim=0)
        pred = self.fc1(embed_mean) 
        return pred
   

In [13]:
### YOUR CODE HERE ###
RNN_LAYERS = 2
# Initialize the model
model_rnn = SRNN(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, RNN_LAYERS, NUM_OUTPUTS)


In [14]:
### YOUR CODE HERE ###

# Create loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_rnn.parameters(), lr=0.001)

print([p.shape for p in model_rnn.parameters()])

[torch.Size([102179, 100]), torch.Size([64, 100]), torch.Size([64, 64]), torch.Size([64]), torch.Size([64]), torch.Size([64, 64]), torch.Size([64, 64]), torch.Size([64]), torch.Size([64]), torch.Size([4, 64]), torch.Size([4])]


In [15]:
### YOUR CODE HERE ###

for epoch in range(NUM_EPOCHS):
    correct = 0
    num_examples = 0
    mini_batch_num = 0
    for x, labels in train_loader:
        mini_batch_num += 1
        # Reset grads
        optimizer.zero_grad()
        # Forward pass
        y = model_rnn(x)
        # Calc the loss
        loss = loss_fn(y, labels)
        # Backward pass
        loss.backward()
        # Update the parameters
        optimizer.step()

        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_examples += x.shape[1]

        if mini_batch_num % 50 == 0:
            acc = correct/num_examples
            print(f"Epoch: {epoch} \t Batch Num: {mini_batch_num} \t Train Loss: {loss} \t Train Acc: {acc}")

Epoch: 0 	 Batch Num: 50 	 Train Loss: 1.3606551885604858 	 Train Acc: 0.26953125
Epoch: 0 	 Batch Num: 100 	 Train Loss: 1.2310659885406494 	 Train Acc: 0.3180468678474426
Epoch: 0 	 Batch Num: 150 	 Train Loss: 1.0034176111221313 	 Train Acc: 0.3824479281902313
Epoch: 0 	 Batch Num: 200 	 Train Loss: 0.9215617179870605 	 Train Acc: 0.4471093714237213
Epoch: 0 	 Batch Num: 250 	 Train Loss: 0.7810717821121216 	 Train Acc: 0.49696874618530273
Epoch: 0 	 Batch Num: 300 	 Train Loss: 0.6499370336532593 	 Train Acc: 0.5348437428474426
Epoch: 0 	 Batch Num: 350 	 Train Loss: 0.43504711985588074 	 Train Acc: 0.5658705234527588
Epoch: 0 	 Batch Num: 400 	 Train Loss: 0.5552324056625366 	 Train Acc: 0.5919140577316284
Epoch: 0 	 Batch Num: 450 	 Train Loss: 0.6894865036010742 	 Train Acc: 0.6113020777702332
Epoch: 0 	 Batch Num: 500 	 Train Loss: 0.519778311252594 	 Train Acc: 0.630484402179718
Epoch: 0 	 Batch Num: 550 	 Train Loss: 0.5694618225097656 	 Train Acc: 0.6455113887786865
Epoch: 0

In [16]:
## Testing
correct = 0
num_test = 0

with torch.no_grad():
    # Iterate through test set minibatches
    for x, labels in test_loader:
        # Forward pass
        y = model_rnn(x)
        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_test += x.shape[1]        
    
print('Test accuracy: {}'.format(correct/num_test))


Test accuracy: 0.8907894492149353


### LSTM

In [17]:
### YOUR CODE HERE ###
import torch.nn as nn
import torch.nn.functional as F

class SLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_dim, rnn_layers, num_outputs):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_dim, num_layers=rnn_layers)
        self.fc1 = nn.Linear(hidden_dim, num_outputs)

    def forward(self, x):
        embed = self.embedding(x)
        out, h = self.lstm(embed)
        embed_mean = torch.mean(out, dim=0)
        pred = self.fc1(embed_mean) 
        return pred
   

In [18]:
### YOUR CODE HERE ###
RNN_LAYERS = 2
# Initialize the model
model_lstm = SLSTM(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, RNN_LAYERS, NUM_OUTPUTS)


In [19]:
### YOUR CODE HERE ###

# Create loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_lstm.parameters(), lr=0.001)

print([p.shape for p in model_lstm.parameters()])


[torch.Size([102179, 100]), torch.Size([256, 100]), torch.Size([256, 64]), torch.Size([256]), torch.Size([256]), torch.Size([256, 64]), torch.Size([256, 64]), torch.Size([256]), torch.Size([256]), torch.Size([4, 64]), torch.Size([4])]


In [20]:
### YOUR CODE HERE ###

for epoch in range(NUM_EPOCHS):
    correct = 0
    num_examples = 0
    mini_batch_num = 0
    for x, labels in train_loader:
        mini_batch_num += 1
        # Reset grads
        optimizer.zero_grad()
        # Forward pass
        y = model_lstm(x)
        # Calc the loss
        loss = loss_fn(y, labels)
        # Backward pass
        loss.backward()
        # Update the parameters
        optimizer.step()

        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_examples += x.shape[1]

        if mini_batch_num % 50 == 0:
            acc = correct/num_examples
            print(f"Epoch: {epoch} \t Batch Num: {mini_batch_num} \t Train Loss: {loss} \t Train Acc: {acc}")

Epoch: 0 	 Batch Num: 50 	 Train Loss: 1.366959810256958 	 Train Acc: 0.28187501430511475
Epoch: 0 	 Batch Num: 100 	 Train Loss: 1.0923306941986084 	 Train Acc: 0.3824218809604645
Epoch: 0 	 Batch Num: 150 	 Train Loss: 0.878016471862793 	 Train Acc: 0.4532812535762787
Epoch: 0 	 Batch Num: 200 	 Train Loss: 0.7977313995361328 	 Train Acc: 0.502148449420929
Epoch: 0 	 Batch Num: 250 	 Train Loss: 0.7197909951210022 	 Train Acc: 0.5402500033378601
Epoch: 0 	 Batch Num: 300 	 Train Loss: 0.6239700317382812 	 Train Acc: 0.57442706823349
Epoch: 0 	 Batch Num: 350 	 Train Loss: 0.6686455607414246 	 Train Acc: 0.6020312309265137
Epoch: 0 	 Batch Num: 400 	 Train Loss: 0.6089595556259155 	 Train Acc: 0.6276757717132568
Epoch: 0 	 Batch Num: 450 	 Train Loss: 0.3940446972846985 	 Train Acc: 0.648559033870697
Epoch: 0 	 Batch Num: 500 	 Train Loss: 0.5151731967926025 	 Train Acc: 0.666671872138977
Epoch: 0 	 Batch Num: 550 	 Train Loss: 0.3537208139896393 	 Train Acc: 0.6817187666893005
Epoch:

In [21]:
## Testing
correct = 0
num_test = 0

with torch.no_grad():
    # Iterate through test set minibatches
    for x, labels in test_loader:
        # Forward pass
        y = model_lstm(x)
        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_test += x.shape[1]        
    
print('Test accuracy: {}'.format(correct/num_test))


Test accuracy: 0.9073684215545654


### GRU

In [33]:
### YOUR CODE HERE ###
import torch.nn as nn
import torch.nn.functional as F

class SGRU(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_dim, rnn_layers, num_outputs):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.gru = nn.GRU(embedding_size, hidden_dim, num_layers=rnn_layers)
        self.fc1 = nn.Linear(hidden_dim, num_outputs)

    def forward(self, x):
        embed = self.embedding(x)
        out, h = self.gru(embed)
        embed_mean = torch.mean(out, dim=0)
        pred = self.fc1(embed_mean) 
        return pred
   

In [34]:
### YOUR CODE HERE ###
RNN_LAYERS = 2
# Initialize the model
model_gru = SGRU(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, RNN_LAYERS, NUM_OUTPUTS)


In [35]:
### YOUR CODE HERE ###

# Create loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_gru.parameters(), lr=0.001)

print([p.shape for p in model_gru.parameters()])


[torch.Size([102179, 100]), torch.Size([192, 100]), torch.Size([192, 64]), torch.Size([192]), torch.Size([192]), torch.Size([192, 64]), torch.Size([192, 64]), torch.Size([192]), torch.Size([192]), torch.Size([4, 64]), torch.Size([4])]


In [36]:
### YOUR CODE HERE ###

for epoch in range(NUM_EPOCHS):
    correct = 0
    num_examples = 0
    mini_batch_num = 0
    for x, labels in train_loader:
        mini_batch_num += 1
        # Reset grads
        optimizer.zero_grad()
        # Forward pass
        y = model_gru(x)
        # Calc the loss
        loss = loss_fn(y, labels)
        # Backward pass
        loss.backward()
        # Update the parameters
        optimizer.step()

        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_examples += x.shape[1]

        if mini_batch_num % 50 == 0:
            acc = correct/num_examples
            print(f"Epoch: {epoch} \t Batch Num: {mini_batch_num} \t Train Loss: {loss} \t Train Acc: {acc}")

Epoch: 0 	 Batch Num: 50 	 Train Loss: 1.3227726221084595 	 Train Acc: 0.3070312440395355
Epoch: 0 	 Batch Num: 100 	 Train Loss: 0.760535478591919 	 Train Acc: 0.4214843809604645
Epoch: 0 	 Batch Num: 150 	 Train Loss: 0.6351439356803894 	 Train Acc: 0.5140104293823242
Epoch: 0 	 Batch Num: 200 	 Train Loss: 0.5299344062805176 	 Train Acc: 0.5737890601158142
Epoch: 0 	 Batch Num: 250 	 Train Loss: 0.4983929395675659 	 Train Acc: 0.6192812323570251
Epoch: 0 	 Batch Num: 300 	 Train Loss: 0.5482546091079712 	 Train Acc: 0.6515104174613953
Epoch: 0 	 Batch Num: 350 	 Train Loss: 0.5962448716163635 	 Train Acc: 0.6758928298950195
Epoch: 0 	 Batch Num: 400 	 Train Loss: 0.3886748254299164 	 Train Acc: 0.6963866949081421
Epoch: 0 	 Batch Num: 450 	 Train Loss: 0.3570895791053772 	 Train Acc: 0.7138194441795349
Epoch: 0 	 Batch Num: 500 	 Train Loss: 0.4605090022087097 	 Train Acc: 0.7280937433242798
Epoch: 0 	 Batch Num: 550 	 Train Loss: 0.39979660511016846 	 Train Acc: 0.7403267025947571


In [37]:
## Testing
correct = 0
num_test = 0

with torch.no_grad():
    # Iterate through test set minibatches
    for x, labels in test_loader:
        # Forward pass
        y = model_gru(x)
        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        num_test += x.shape[1]        
    
print('Test accuracy: {}'.format(correct/num_test))


Test accuracy: 0.9027631282806396


### Short Answer:

1\. How do the RNN, GRU, and LSTM compare to SWEM for AG News topic classification?
Are you surprised?
What about classification might make SWEM so effective for topic classification?

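The test accuracies are nearly identical: about 90.3% for SWEM, 89.1% for the RNN, 90.7% for the LSTM, and 90.3% for the GRU. Topic classification seems to depend mostly on which topic-specific words appear in an article rather than on their order, so averaging the word embeddings discards little of the information the classifier needs.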

2\. How many learnable parameters does each of the models you've trained have?

SWEM
[torch.Size([102179, 100]), torch.Size([64, 100]), torch.Size([64]), torch.Size([4, 64]), torch.Size([4])]
"10217900+6400+64+256+4=10224624"

RNN
[torch.Size([102179, 100]), torch.Size([64, 100]), torch.Size([64, 64]), torch.Size([64]), torch.Size([64]), torch.Size([64, 64]), torch.Size([64, 64]), torch.Size([64]), torch.Size([64]), torch.Size([4, 64]), torch.Size([4])]
"10217900+6400+4096+64+64+4096+4096+64+64+256+4=10237104"

LSTM
[torch.Size([102179, 100]), torch.Size([256, 100]), torch.Size([256, 64]), torch.Size([256]), torch.Size([256]), torch.Size([256, 64]), torch.Size([256, 64]), torch.Size([256]), torch.Size([256]), torch.Size([4, 64]), torch.Size([4])]
"10217900+25600+16384+256+256+16384+16384+256+256+256+4=10293936"

GRU
[torch.Size([102179, 100]), torch.Size([192, 100]), torch.Size([192, 64]), torch.Size([192]), torch.Size([192]), torch.Size([192, 64]), torch.Size([192, 64]), torch.Size([192]), torch.Size([192]), torch.Size([4, 64]), torch.Size([4])]
"10217900+19200+12288+192+192+12288+12288+192+192+256+4=10274992"

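One way to cross-check tallies like these is to compute them from the layer sizes. The sketch below is plain Python; the `recurrent_params` helper is a hypothetical name, and it assumes PyTorch's layout for unidirectional recurrent layers (input weights, hidden weights, and two bias vectors per layer, repeated once per gate).

```python
def recurrent_params(input_dim, hidden_dim, num_layers, gates):
    """Parameter count of a stacked unidirectional RNN, GRU, or LSTM.

    gates: 1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM.
    """
    total = 0
    for layer in range(num_layers):
        in_dim = input_dim if layer == 0 else hidden_dim
        # input weights + hidden weights + two bias vectors, per gate
        total += gates * hidden_dim * (in_dim + hidden_dim + 2)
    return total

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_OUTPUTS, RNN_LAYERS = 102179, 100, 64, 4, 2

embedding = VOCAB_SIZE * EMBED_DIM
head = HIDDEN_DIM * NUM_OUTPUTS + NUM_OUTPUTS   # final linear layer
swem_fc1 = EMBED_DIM * HIDDEN_DIM + HIDDEN_DIM  # SWEM's first linear layer

counts = {
    "SWEM": embedding + swem_fc1 + head,
    "RNN": embedding + recurrent_params(EMBED_DIM, HIDDEN_DIM, RNN_LAYERS, 1) + head,
    "LSTM": embedding + recurrent_params(EMBED_DIM, HIDDEN_DIM, RNN_LAYERS, 4) + head,
    "GRU": embedding + recurrent_params(EMBED_DIM, HIDDEN_DIM, RNN_LAYERS, 3) + head,
}
for name, n in counts.items():
    print(f"{name}: {n}")
```

With a model object in hand, `sum(p.numel() for p in model.parameters())` gives the same number directly. In every case the embedding table accounts for more than 99% of the parameters.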
