# Implementing softmax regression

In this notebook you will learn how to implement softmax regression in PyTorch, and apply it to the task of sentiment analysis.

## Loading the data

Our dataset is derived from the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/), which consists of 11,855 sentences extracted from movie reviews. Each sentence has been manually labelled with the sentiment it expresses towards the movie at hand, encoded as an integer between 0 (very negative) and 4 (very positive). For this data, sentiment analysis can be framed as a multi-class classification problem. We have pre-processed the original data by tokenizing it, removing non-alphabetical tokens, and lowercasing.

The following helper function loads the sentiment-labelled sentences from a tab-separated file. It returns a list of pairs where the first component of each pair is a tokenized sentence (represented as a list of string tokens) and the second component is the corresponding sentiment (an integer in the range 0–4). Sentences longer than 20 tokens are truncated to their first 20 tokens.

In [1]:
def load_data(filename, max_length=20):
    items = []
    with open(filename, 'rt', encoding='utf-8') as fp:
        for line in fp:
            sentence, label = line.rstrip().split('\t')
            items.append((sentence.split()[:max_length], int(label)))
    return items
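
To make the expected file format concrete, here is a minimal sketch that writes two made-up lines (not taken from the actual dataset) to a temporary file and loads them with the helper. The file name `toy.txt` and the sentences are hypothetical; `max_length=3` is chosen only to make the truncation visible.

```python
import os
import tempfile

def load_data(filename, max_length=20):
    items = []
    with open(filename, 'rt', encoding='utf-8') as fp:
        for line in fp:
            sentence, label = line.rstrip().split('\t')
            items.append((sentence.split()[:max_length], int(label)))
    return items

# Two invented lines in the "sentence<TAB>label" format
sample = "a charming and often affecting journey\t4\ndull and lifeless\t0\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.txt')  # hypothetical file name
    with open(path, 'wt', encoding='utf-8') as fp:
        fp.write(sample)
    toy_data = load_data(path, max_length=3)

print(toy_data)
# [(['a', 'charming', 'and'], 4), (['dull', 'and', 'lifeless'], 0)]
```

The first sentence has six tokens, so only its first three survive the cap.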

We use this function to load the training data and the development data:

In [2]:
train_data = load_data('sst-5-train.txt')
dev_data = load_data('sst-5-dev.txt')

The next cell prints the number of examples in the training data:

In [None]:
print(len(train_data))

## Vectorizing the data

To process the sentences using neural networks, we need to convert them into vectors of numerical values. A simple but common vector representation in natural language processing is the **bag-of-words**. Under this representation, we record *which* words occur in a text and *how often* they occur, but completely ignore their ordering. This is the representation that we will use in this notebook.
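
The loss of word order can be seen in a small sketch. The two sentences below are invented for illustration, and `collections.Counter` is used here only as a stand-in for the count vectors built later in the notebook.

```python
from collections import Counter

# Two invented sentences with opposite meanings but the same words
s1 = "the film is good not bad".split()
s2 = "the film is bad not good".split()

bag1 = Counter(s1)
bag2 = Counter(s2)

# A bag-of-words only stores word counts, so the two sentences
# become indistinguishable
print(bag1 == bag2)  # True
```

This is the price paid for the simplicity of the representation: any classifier built on it sees both sentences as the same input.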

We first construct a mapping from words to vector indices. The domain of this mapping will be our vocabulary.

In [None]:
def make_vocab(data):
    vocab = {}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab
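
As a quick illustration, the sketch below applies the function to two invented sentences. Each new word receives the next free index, in order of first occurrence; words that have been seen before keep their index.

```python
def make_vocab(data):
    vocab = {}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab

# Invented toy data: (tokens, label) pairs
toy_data = [(['a', 'good', 'film'], 3), (['a', 'bad', 'film'], 1)]
toy_vocab = make_vocab(toy_data)

print(toy_vocab)  # {'a': 0, 'good': 1, 'film': 2, 'bad': 3}
```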

We create the vocabulary from the training data:

In [None]:
vocab = make_vocab(train_data)

Running the cell below prints the size of the vocabulary:

In [None]:
print(len(vocab))

Next, we map each sentence to a vector that holds the counts in that sentence for all the words in the vocabulary. We collect all sentence vectors into a PyTorch tensor. We do the same thing for all labels, which results in a vector of integers.

In [None]:
import torch

def vectorize(vocab, data):
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * len(vocab)
        for w in sentence:
            if w in vocab:
                x[vocab[w]] += 1
        xs.append(x)
        ys.append(label)
    return torch.FloatTensor(xs), torch.LongTensor(ys)
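
The following sketch runs `vectorize` on a hand-written toy vocabulary and two invented sentences. It shows two properties of the function: repeated words are counted, and words that are missing from the vocabulary (here `terrible`) are silently skipped.

```python
import torch

def vectorize(vocab, data):
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * len(vocab)
        for w in sentence:
            if w in vocab:
                x[vocab[w]] += 1
        xs.append(x)
        ys.append(label)
    return torch.FloatTensor(xs), torch.LongTensor(ys)

# Hypothetical toy vocabulary and data
toy_vocab = {'a': 0, 'good': 1, 'film': 2, 'bad': 3}
toy_data = [(['a', 'good', 'good', 'film'], 4),
            (['a', 'terrible', 'film'], 0)]

toy_x, toy_y = vectorize(toy_vocab, toy_data)
print(toy_x.tolist())  # [[1.0, 2.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
print(toy_y.tolist())  # [4, 0]
```

Skipping unknown words matters in practice, because the development data may contain words that never occur in the training data from which the vocabulary was built.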

We vectorize the training data and the development data:

In [None]:
train_x, train_y = vectorize(vocab, train_data)
dev_x, dev_y = vectorize(vocab, dev_data)

The next cell prints the shapes of the resulting tensors:

In [None]:
print('Training data:', train_x.shape, train_y.shape)
print('Development data:', dev_x.shape, dev_y.shape)

## Evaluation

The standard evaluation measure for our dataset is **accuracy** – the proportion of examples for which the classifier predicts the correct label. The function below returns it as a fraction between 0 and 1.

In [None]:
def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()
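
A small sanity check on invented labels: with three out of four predictions correct, the function should return 0.75.

```python
import torch

def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()

# Invented gold labels and predictions; only the third one is wrong
y_true = torch.tensor([0, 1, 2, 3])
y_hat = torch.tensor([0, 1, 1, 3])

acc = accuracy(y_hat, y_true)
print(acc)  # 0.75
```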

By always predicting the majority class in the training data (rating 3), we can get an accuracy on the development data of slightly above 25%.

In [None]:
accuracy(torch.full_like(dev_y, 3), dev_y)

## Training the model

We are now ready to set up a softmax regression model and train it using the categorical cross-entropy loss function. Recall that a softmax regression model consists of a linear layer followed by the softmax function. In PyTorch, linear layers are implemented by the class [`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html), and cross-entropy loss is implemented by the function [`cross_entropy()`](https://pytorch.org/docs/stable/nn.functional.html#cross-entropy). The softmax function will be computed inside the loss function; see the note below.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

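We will train our model using minibatch gradient descent. The following function shuffles the data and yields minibatches of the specified size. Note that it only yields complete batches: if the number of examples is not a multiple of the batch size, the leftover examples are skipped for that pass.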

In [None]:
def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]
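
To see how the generator behaves, the sketch below runs it on ten invented examples with a batch size of 4. The toy tensors are constructed so that each feature value equals its label, which makes it easy to check that inputs and labels stay aligned after shuffling.

```python
import torch

def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]

# Ten toy examples with one feature each; feature value == label
toy_x = torch.arange(10).float().unsqueeze(1)
toy_y = torch.arange(10)

batches = list(minibatches(toy_x, toy_y, batch_size=4))
sizes = [len(bx) for bx, by in batches]
print(sizes)  # [4, 4]

# Inputs and labels are shuffled with the same permutation
aligned = all(torch.equal(bx[:, 0].long(), by) for bx, by in batches)
print(aligned)  # True
```

Only two batches are produced: the two examples that do not fill a third batch are left out of this pass. Because a new permutation is drawn on every call, different examples are left out in different epochs.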

With this we can now write our training loop:

In [None]:
def train(n_epochs=20, batch_size=24, lr=1e-1):
    # Initialize the model
    model = nn.Linear(len(vocab), 5)
    
    # Initialize the optimizer
    optimizer = optim.SGD(model.parameters(), lr=lr)
    
    # We train for several epochs
    for t in range(n_epochs):
        
        # In each epoch, we loop over all the minibatches
        for bx, by in minibatches(train_x, train_y, batch_size):
            
            # Reset the accumulated gradients
            optimizer.zero_grad()
            
            # Forward pass
            output = model.forward(bx)
            
            # Compute the loss
            loss = F.cross_entropy(output, by)
            
            # Backward pass; propagates the loss and computes the gradients
            loss.backward()
            
            # Update the parameters of the model
            optimizer.step()
    
    return model

**⚠️ The softmax is implicit!**

One thing that you will note when you go through the code of the training loop is that there is no explicit call to the softmax function. The outputs of the network are not normalized probabilities but just scores (logits). The softmax is computed inside `cross_entropy()`.
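
The sketch below checks this on random scores (the tensors are invented for illustration): applying `cross_entropy()` to the raw logits gives the same loss as applying `log_softmax()` followed by `nll_loss()`. It also shows how normalized probabilities can be obtained with `softmax()` when they are needed, for example at prediction time.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Invented logits for 3 examples and 5 classes, with invented gold labels
logits = torch.randn(3, 5)
targets = torch.tensor([0, 3, 4])

# cross_entropy() expects unnormalized scores ...
loss_implicit = F.cross_entropy(logits, targets)

# ... and is equivalent to an explicit log-softmax followed by
# the negative log-likelihood loss
loss_explicit = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_implicit, loss_explicit))  # True

# Probabilities are only needed if we want to inspect them
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))  # every row sums to (approximately) 1

# The predicted class is the same with or without the softmax
same_argmax = torch.equal(probs.argmax(dim=1), logits.argmax(dim=1))
print(same_argmax)
```

A practical consequence is that you should not apply the softmax yourself before calling `cross_entropy()`, since the function would then normalize the scores a second time.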

Here is an embellished version of the training loop that plots the per-epoch losses and the per-epoch accuracies on the development data.

In [None]:
# Same training loop with evaluation and plotting

import matplotlib.pyplot as plt
import tqdm

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def train(n_epochs=20, batch_size=24, lr=1e-1):
    model = nn.Linear(len(vocab), 5)
    optimizer = optim.SGD(model.parameters(), lr=lr)
    losses = []
    dev_losses = []
    dev_accuracies = []
    info = {'dev loss': 0, 'dev acc': 0}
    with tqdm.tqdm(total=n_epochs) as pbar:
        for t in range(n_epochs):
            model.train()
            running_loss = 0
            for bx, by in minibatches(train_x, train_y, batch_size):
                optimizer.zero_grad()
                output = model.forward(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item() * len(bx)
            losses.append(running_loss / len(train_x))
            model.eval()
            with torch.no_grad():
                dev_output = model.forward(dev_x)
                dev_loss = F.cross_entropy(dev_output, dev_y)
                dev_losses.append(dev_loss.item())  # store a plain float for plotting
                dev_y_pred = torch.argmax(dev_output, axis=1)
                dev_acc = accuracy(dev_y_pred, dev_y)
                dev_accuracies.append(dev_acc)
                info['dev loss'] = f'{dev_loss:.4f}'
                info['dev acc'] = f'{dev_acc:.4f}'
                pbar.set_postfix(info)
            pbar.update()
    plt.figure(figsize=(15, 6))
    plt.subplot(121)
    plt.plot(losses)
    plt.plot(dev_losses)
    plt.xlabel('Epoch')
    plt.ylabel('Average loss')
    plt.subplot(122)
    plt.plot(dev_accuracies)
    plt.xlabel('Epoch')
    plt.ylabel('Development set accuracy')
    return model

We are ready to train:

In [None]:
train()

## Using a GPU

The dataset that we used in this notebook is very small, and the model is very simple. For larger datasets or models, you may want to use a GPU, for example on a service such as [Colab](http://colab.research.google.com). Fortunately, making a model ready for GPU training is rather straightforward in PyTorch: you only need to ‘send’ the model and its inputs to the correct device.
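
In isolation, the pattern looks roughly as follows. This is a minimal sketch with an invented toy model and input; it falls back to the CPU when no GPU is available, so it runs on any machine.

```python
import torch
import torch.nn as nn

# Pick the GPU if there is one, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Send the model (that is, its parameters) to the device
toy_model = nn.Linear(4, 5).to(device)

# Send the input to the same device; .to() returns a new tensor,
# so the result has to be assigned
toy_input = torch.ones(2, 4)
toy_input = toy_input.to(device)

toy_output = toy_model(toy_input)
print(toy_output.shape)        # torch.Size([2, 5])
print(toy_output.device.type)  # 'cuda' or 'cpu', depending on the machine
```

The model and its inputs must live on the same device; mixing a GPU model with CPU inputs raises a runtime error.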

The next code cell contains a GPU-enabled version of the (embellished) training loop. Modified lines are marked as `CHANGED`.

In [None]:
# GPU-enabled training loop

import matplotlib.pyplot as plt
import tqdm

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# CHANGED: Set the device variable
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train(n_epochs=20, batch_size=24, lr=1e-1):
    
    # CHANGED: Send the model to the available device
    model = nn.Linear(len(vocab), 5).to(device)
    
    optimizer = optim.SGD(model.parameters(), lr=lr)
    losses = []
    dev_losses = []
    dev_accuracies = []
    info = {'dev loss': 0, 'dev acc': 0}
    with tqdm.tqdm(total=n_epochs) as pbar:
        for t in range(n_epochs):
            model.train()
            running_loss = 0
            for bx, by in minibatches(train_x, train_y, batch_size):

                # CHANGED: Send the minibatches to the available device
                bx = bx.to(device)
                by = by.to(device)

                optimizer.zero_grad()
                output = model.forward(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item() * len(bx)
            losses.append(running_loss / len(train_x))
            model.eval()
            with torch.no_grad():
                dev_output = model.forward(dev_x.to(device))
                dev_loss = F.cross_entropy(dev_output, dev_y.to(device))
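                # Store the loss as a Python float (a CUDA tensor cannot be plotted)
                dev_losses.append(dev_loss.item())
                dev_y_pred = torch.argmax(dev_output, axis=1)
                # CHANGED: Move the predictions back to the CPU, where dev_y lives
                dev_acc = accuracy(dev_y_pred.cpu(), dev_y)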
                dev_accuracies.append(dev_acc)
                info['dev loss'] = f'{dev_loss:.4f}'
                info['dev acc'] = f'{dev_acc:.4f}'
                pbar.set_postfix(info)
            pbar.update()
    plt.figure(figsize=(15, 6))
    plt.subplot(121)
    plt.plot(losses)
    plt.plot(dev_losses)
    plt.xlabel('Epoch')
    plt.ylabel('Average loss')
    plt.subplot(122)
    plt.plot(dev_accuracies)
    plt.xlabel('Epoch')
    plt.ylabel('Development set accuracy')
    return model

We can now train on the GPU. (For this very simple model, however, the training times on CPU and GPU are almost identical.)

In [None]:
train()

That’s all folks!