# **Neural Networks in PyTorch**

## **Deep Learning Building Blocks: Affine maps, non-linearities and objectives**

Deep learning builds powerful models by composing affine (linear) maps with non-linearities.

This section covers the basics of neural networks in PyTorch, including affine maps, non-linearities, objectives, and training a simple logistic regression model for text classification.

### **Affine Maps**

An affine map is a linear transformation followed by a translation. In mathematical terms, it is a function 𝑓(𝑥) = 𝐴𝑥 + 𝑏, where 𝐴 is a matrix, 𝑥 is the input vector, and 𝑏 is the bias vector. The parameters 𝐴 and 𝑏 are learned during training.

In PyTorch, nn.Linear implements this affine map: it multiplies the input by a learned weight matrix and adds a learned bias.

The example below creates a linear layer that maps from 𝑅^5 to 𝑅^3.

In [68]:
import torch        # core library for PyTorch
import torch.nn as nn        # provides classes & functions to build neural network layers & models.
import torch.nn.functional as F        # includes various functions for operations like activation functions.
import torch.optim as optim        # provides optimization algorithms like SGD.

torch.manual_seed(1)
lin = nn.Linear(5, 3)  # creates a layer that maps a 5d input to a 3d output (from R^5 to R^3)
data = torch.randn(4, 5) # Random input data with shape (4, 5), meaning we have 4 samples, each of 5 features
print(lin(data))  # Applies the affine transformation or linear layer to the data

tensor([[-0.0065,  0.4379,  0.1832],
        [ 0.7503,  0.0643, -0.5791],
        [-1.0600,  0.3903,  0.2721],
        [-0.7701,  0.2773,  0.3206]], grad_fn=<AddmmBackward0>)
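
As a quick check of what nn.Linear computes, the sketch below compares the layer's output with the affine map written out by hand. PyTorch stores the weight with shape (out_features, in_features), so for a batch of row vectors the map is `x @ W.T + b`. The variable names `x` and `manual` are only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
lin = nn.Linear(5, 3)          # weight has shape (3, 5), bias has shape (3,)
x = torch.randn(4, 5)          # 4 samples, 5 features each

# The same affine map written out by hand: multiply each row of x by W^T, then add b.
manual = x @ lin.weight.T + lin.bias

print(lin.weight.shape, lin.bias.shape)
print(torch.allclose(lin(x), manual, atol=1e-6))
```

If the two agree, the layer is doing nothing more than a matrix product plus a bias.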


### **Non-Linearities**

Non-linearities are functions applied after affine maps so that a network can represent functions that no single affine map can.

Without them, a network composed solely of affine layers (e.g., 𝑓(𝑥) = 𝐴𝑥+𝑏 and 𝑔(𝑥) = 𝐶𝑥+𝑑) would be equivalent to a single affine map: 𝑓(𝑔(𝑥)) = 𝐴(𝐶𝑥+𝑑) + 𝑏 = 𝐴𝐶𝑥 + (𝐴𝑑+𝑏), which is again affine, with matrix 𝐴𝐶 and bias 𝐴𝑑+𝑏.
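
The collapse of stacked affine layers can be checked numerically. The following is a minimal sketch with two nn.Linear layers of arbitrarily chosen sizes; the names `f`, `g`, `stacked`, and `collapsed` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
g = nn.Linear(4, 6)   # g(x) = Cx + d
f = nn.Linear(6, 3)   # f(y) = Ay + b
x = torch.randn(2, 4)

with torch.no_grad():
    A, b = f.weight, f.bias
    C, d = g.weight, g.bias
    stacked = f(g(x))                          # two affine layers applied in sequence
    collapsed = x @ (A @ C).T + (A @ d + b)    # one affine map with matrix AC and bias Ad + b

print(torch.allclose(stacked, collapsed, atol=1e-5))
```

Adding depth without a non-linearity in between therefore adds parameters but no expressive power.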

Common non-linearities include:
1. tanh⁡(𝑥)
2. 𝜎⁡(𝑥)
3. ReLU⁡(𝑥)

These functions are popular partly because their gradients are cheap to compute. For example:
\begin{align}\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))\end{align}

Although 𝜎(𝑥) was once popular, it is now often avoided because its gradient vanishes quickly as the absolute value of 𝑥 grows (the vanishing gradient problem); tanh and ReLU are more commonly used.
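
To see both the closed-form derivative and the vanishing gradient in numbers, the sketch below asks autograd for the derivative of the sigmoid at a few hand-picked points and compares it with 𝜎(𝑥)(1 − 𝜎(𝑥)). The sample points are arbitrary.

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
s = torch.sigmoid(x)
s.sum().backward()               # fills x.grad with d(sigmoid)/dx at each point

closed_form = (s * (1 - s)).detach()
print(x.grad)
print(torch.allclose(x.grad, closed_form, atol=1e-6))
```

The derivative peaks at 0.25 at 𝑥 = 0 and falls to roughly 4.5e-5 at |𝑥| = 10, so a saturated sigmoid passes almost no gradient back.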


In [69]:
data = torch.randn(2, 2)
print(data, "\n")
print(F.relu(data))  # Applies the ReLU non-linearity, which sets all negative values to 0 and leaves positive values unchanged.

tensor([[ 1.8213, -0.1814],
        [-0.9515,  0.4057]]) 

tensor([[1.8213, 0.0000],
        [0.0000, 0.4057]])


### **Softmax and Probabilities**

Softmax is a non-linearity that converts logits (raw model outputs, i.e. a vector of real numbers) into probabilities. It's typically used as the last layer in classification models.

Let $x$ be a vector of real numbers (positive or negative; there are no constraints). The $i$-th component of $\text{Softmax}(x)$ is defined as:
\begin{align}\frac{\exp(x_i)}{\sum_j \exp(x_j)}\end{align}
This ensures that the outputs are non-negative and sum to 1.

In [70]:
# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(data)

# F.softmax converts raw scores into probabilities. dim=0 is the only dimension here because data is 1-dimensional.
# Each value in data is exponentiated and normalized so that the resulting probabilities sum to 1.
print(F.softmax(data, dim=0)) 
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!

# Log softmax is useful for numerical stability in computations like cross-entropy loss.
# It computes the logarithm of the softmax probabilities.
print(F.log_softmax(data, dim=0))  

tensor([-1.5164,  0.7322,  2.2820, -1.2080,  1.1120])
tensor([0.0142, 0.1347, 0.6347, 0.0194, 0.1970])
tensor(1.)
tensor([-4.2530, -2.0044, -0.4546, -3.9446, -1.6246])
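
The softmax formula can also be written out by hand and compared with the library functions. This sketch reuses the values printed above (rounded to four decimals), so its results should match the printed probabilities only up to that rounding. Subtracting the maximum before exponentiating is a common stability trick and is an addition here, not something the cell above does explicitly.

```python
import torch
import torch.nn.functional as F

# Values copied from the printed tensor above (rounded to 4 decimals).
x = torch.tensor([-1.5164, 0.7322, 2.2820, -1.2080, 1.1120])

# Softmax written out: exponentiate, then normalize by the sum.
# Subtracting the max first leaves the result unchanged but avoids overflow in exp.
shifted = x - x.max()
manual_softmax = torch.exp(shifted) / torch.exp(shifted).sum()

# Log softmax written out: x_i - log(sum_j exp(x_j)), using logsumexp for stability.
manual_log_softmax = x - torch.logsumexp(x, dim=0)

print(manual_softmax)
print(torch.allclose(manual_softmax, F.softmax(x, dim=0), atol=1e-6))
print(torch.allclose(manual_log_softmax, F.log_softmax(x, dim=0), atol=1e-6))
```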


### **Objective Functions**
 
The objective function (or loss function or cost function) measures how well the model’s predictions match the actual labels. The goal is to minimize this loss. The process involves:

1. Choosing a training instance.
2. Running it through the neural network.
3. Computing the output loss.
4. Updating model parameters based on the gradient of the loss with respect to those parameters.

Confident but wrong predictions yield a high loss; confident and correct predictions yield a low loss. The loss is minimized on training examples in the hope that the model will also generalize to new data. A common loss function for multi-class classification is the negative log likelihood loss (NLLLoss), which minimizes the negative log probability of the correct output, or equivalently, maximizes the log probability of the correct output. In PyTorch, nn.NLLLoss expects log probabilities as input, which is why it is paired with log_softmax.

In [71]:
loss_function = nn.NLLLoss()
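
To make the claim about confident predictions concrete, here is a small sketch that feeds two hand-made probability vectors to nn.NLLLoss. The probabilities 0.9 and 0.1 are made up for illustration; the loss for a single example is simply the negative log probability assigned to the correct class.

```python
import torch
import torch.nn as nn

loss_function = nn.NLLLoss()
target = torch.tensor([0])   # the correct class is index 0

# NLLLoss expects log probabilities, so the probabilities are passed through log first.
confident_right = torch.log(torch.tensor([[0.9, 0.1]]))
confident_wrong = torch.log(torch.tensor([[0.1, 0.9]]))

loss_right = loss_function(confident_right, target)
loss_wrong = loss_function(confident_wrong, target)
print(loss_right.item(), loss_wrong.item())
```

The two losses are -log(0.9), about 0.105, and -log(0.1), about 2.303, so the confident mistake is penalized far more heavily.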

## **Creating Network Components in PyTorch**

To create a model in PyTorch, inherit from nn.Module and define the forward method. Here's an example of a classifier predicting "Travel" or "Food" from a bag-of-words (BoW) representation.

### **Example: Bag-of-Words Classifier for Travel vs Food**
Our model maps a sparse BoW representation to log probabilities over the labels "TRAVEL" and "FOOD". Each word in the vocabulary is assigned an index. For instance, with a vocabulary of "explore" (index 0) and "delicious" (index 1):
+ The BoW vector for "explore explore explore explore" is [4, 0].
+ The BoW vector for "explore delicious delicious explore" is [2, 2].
+ In general, it is [Count(word1), Count(word2)].

Denote this BoW vector as 𝑥. The output of our network is logSoftmax(𝐴𝑥 + 𝑏).

The input is passed through an affine map followed by a log softmax.
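
The two-word example can be reproduced in a few lines of plain Python. The helper `bow_counts` and the dictionary `toy_word_to_ix` are hypothetical names used only for this sketch; the notebook's own helper, which returns a tensor, is defined further down.

```python
# Hypothetical two-word vocabulary from the example above.
toy_word_to_ix = {"explore": 0, "delicious": 1}

def bow_counts(sentence, word_to_ix):
    """Return a list where position i holds the count of the word with index i."""
    counts = [0] * len(word_to_ix)
    for word in sentence.split():
        counts[word_to_ix[word]] += 1
    return counts

print(bow_counts("explore explore explore explore", toy_word_to_ix))      # [4, 0]
print(bow_counts("explore delicious delicious explore", toy_word_to_ix))  # [2, 2]
```

Word order is discarded: any permutation of the same words gives the same vector.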

First, we need to prepare our data and vocabulary.

In [72]:
# Sample data: contains sentences & their associated labels ("TRAVEL" or "FOOD"). 
# Each sentence is split into a list of words.
data = [("I love to explore new places".split(), "TRAVEL"),
        ("The restaurant had amazing dishes".split(), "FOOD"),
        ("Traveling opens new horizons".split(), "TRAVEL"),
        ("The chef prepared a delicious meal".split(), "FOOD")]

# Test Data: used to evaluate the performance of the model after training.
test_data = [("Vacation spots are wonderful".split(), "TRAVEL"),
             ("I enjoy trying new recipes".split(), "FOOD")]

# word_to_ix is a dictionary mapping each unique word in the data to a unique index.
word_to_ix = {}
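# Iterates through all sentences in data and test_data to populate this dictionary.
# Test words are included so that every test word has an index; make_bow_vector would raise a KeyError otherwise.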
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

# VOCAB_SIZE is the number of unique words (size of the vocabulary).
VOCAB_SIZE = len(word_to_ix)
# NUM_LABELS is the number of possible labels ("TRAVEL" and "FOOD"), which is 2.
NUM_LABELS = 2

{'I': 0, 'love': 1, 'to': 2, 'explore': 3, 'new': 4, 'places': 5, 'The': 6, 'restaurant': 7, 'had': 8, 'amazing': 9, 'dishes': 10, 'Traveling': 11, 'opens': 12, 'horizons': 13, 'chef': 14, 'prepared': 15, 'a': 16, 'delicious': 17, 'meal': 18, 'Vacation': 19, 'spots': 20, 'are': 21, 'wonderful': 22, 'enjoy': 23, 'trying': 24, 'recipes': 25}


Define the model:

In [73]:
# BoWClassifier takes a bag-of-words vector and outputs log probabilities.
# BoWClassifier inherits from nn.Module to define a custom model.
class BoWClassifier(nn.Module):     
    def __init__(self, num_labels, vocab_size): 
        super().__init__()       # Initializes the parent class (nn.Module).
        self.linear = nn.Linear(vocab_size, num_labels)   # performs affine transformation from BoW vector to label logits.

    def forward(self, bow_vec):
        """Applies linear layer to input BoW vector & passes result through log_softmax to get log probabilities."""
        return F.log_softmax(self.linear(bow_vec), dim=1)   
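
The `dim=1` argument matters because the model receives inputs of shape (batch, vocab_size): the softmax has to normalize across labels within each row, not across rows. The sketch below repeats the class definition so that it runs on its own and uses a made-up batch of three BoW vectors with a six-word vocabulary; `toy_model` and `batch` are illustrative names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoWClassifier(nn.Module):   # same definition as above, repeated so the sketch runs on its own
    def __init__(self, num_labels, vocab_size):
        super().__init__()
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)

torch.manual_seed(0)
toy_model = BoWClassifier(num_labels=2, vocab_size=6)   # sizes chosen only for illustration
batch = torch.tensor([[1., 0., 2., 0., 0., 1.],
                      [0., 3., 0., 0., 1., 0.],
                      [0., 0., 0., 0., 0., 0.]])          # 3 BoW vectors, 6 vocabulary entries each

with torch.no_grad():
    log_probs = toy_model(batch)

print(log_probs.shape)              # one row of log probabilities per input vector
print(log_probs.exp().sum(dim=1))   # each row sums to 1 after exponentiation
```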

Define helper functions for converting text to vectors and targets:

In [74]:
# Function make_bow_vector converts a sentence into a BoW vector.
def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))  # Initializes a zero vector of length equal to the vocabulary size.
    for word in sentence:
        vec[word_to_ix[word]] += 1  # Increments count for each word in sentence based on its index from word_to_ix.
    return vec.view(1, -1)  # Returns the vector reshaped to have a batch dimension.

# Function make_target converts a label into a tensor of indices for loss calculation.
def make_target(label, label_to_ix):
    return torch.tensor([label_to_ix[label]], dtype=torch.long)   # Maps the label to its index using label_to_ix.

Set up and train the model:

In [75]:
# Instantiates the BoWClassifier with the number of labels and vocabulary size.
model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

In [76]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.0644,  0.0431,  0.0713,  0.0972, -0.1816,  0.0987, -0.1379, -0.1480,
          0.0119, -0.0334,  0.1152, -0.1136, -0.1743,  0.1427, -0.0291,  0.1103,
          0.0630, -0.1471,  0.0394,  0.0471, -0.1313, -0.0931,  0.0669,  0.0351,
         -0.0834, -0.0594],
        [ 0.1796, -0.0363,  0.1106,  0.0849, -0.1268, -0.1668,  0.1882,  0.0102,
          0.1344,  0.0406,  0.0631,  0.1465,  0.1860, -0.1301,  0.0245,  0.1464,
          0.1421,  0.1218, -0.1419, -0.1412, -0.1186,  0.0246,  0.1955, -0.1239,
          0.1045, -0.1085]], requires_grad=True)
Parameter containing:
tensor([-0.1844, -0.0417], requires_grad=True)


In [77]:
# Run a forward pass with sample data
with torch.no_grad():   # to prevent gradient calculation.
    sample = data[0]
    # Converts the sample sentence into a BoW vector and computes log probabilities using the model.
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(log_probs)

tensor([[-0.6906, -0.6957]])


In [78]:
label_to_ix = {"TRAVEL": 0, "FOOD": 1}  # mapping labels to their indices for target tensor creation.
# Define loss function and optimizer
loss_function = nn.NLLLoss()    # computes the negative log likelihood loss.
optimizer = optim.SGD(model.parameters(), lr=0.1)   # Stochastic Gradient Descent (SGD) with a learning rate of 0.1.

for epoch in range(100):
    for instance, label in data:
        model.zero_grad()   # Clears previous gradients
        # Converts instance into a BoW vector & creates a target tensor.
        bow_vec = make_bow_vector(instance, word_to_ix)   
        target = make_target(label, label_to_ix)
        log_probs = model(bow_vec)   # Computes log probabilities with the model.
        loss = loss_function(log_probs, target)   # Calculates loss between log probabilities and target.
        loss.backward()   # Computes gradients
        optimizer.step()   # Updates model parameters
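
For plain SGD (no momentum or weight decay, as configured above), `optimizer.step()` amounts to subtracting the learning rate times the gradient from each parameter. The following sketch checks this on a small stand-in layer; `layer`, `x`, and `expected` are assumptions for illustration and are not part of the classifier above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
lr = 0.1
layer = nn.Linear(3, 2)                       # small stand-in model for illustration
optimizer = optim.SGD(layer.parameters(), lr=lr)

x = torch.randn(1, 3)
target = torch.tensor([1])
loss = F.nll_loss(F.log_softmax(layer(x), dim=1), target)

optimizer.zero_grad()
loss.backward()

# What plain SGD is expected to do: new_param = param - lr * grad
expected = [p.detach() - lr * p.grad for p in layer.parameters()]

optimizer.step()
matches = all(torch.allclose(p.detach(), e, atol=1e-6)
              for p, e in zip(layer.parameters(), expected))
print(matches)
```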

Test the model:

In [79]:
# Runs without gradient calculation to test the model's performance.
with torch.no_grad():
    # Converts test data into BoW vectors, computes log probabilities, and prints results.
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

tensor([[-0.6509, -0.7372]])
tensor([[-0.0967, -2.3840]])
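
Log probabilities are easier to read once they are turned back into probabilities and labels. This sketch copies the two printed outputs above into a tensor; `ix_to_label` is a hypothetical inverse of `label_to_ix` introduced only for readability.

```python
import torch

ix_to_label = {0: "TRAVEL", 1: "FOOD"}   # hypothetical inverse of label_to_ix

# Log probabilities copied from the printed test outputs above.
log_probs = torch.tensor([[-0.6509, -0.7372],
                          [-0.0967, -2.3840]])

probs = log_probs.exp()                 # back to probabilities; each row sums to about 1
predicted = log_probs.argmax(dim=1)     # index of the most likely label per sentence

for row, ix in zip(probs, predicted):
    print(ix_to_label[ix.item()], row.tolist())
```

With the numbers printed in this run, both test sentences are assigned TRAVEL, so the second sentence (labelled FOOD) is misclassified, which is not too surprising for a model trained on four sentences.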


In [80]:
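# Learned weights for "explore": entry 0 scores TRAVEL, entry 1 scores FOOD (the first parameter is the weight matrix).
print(next(model.parameters())[:, word_to_ix["explore"]])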

tensor([ 0.4086, -0.2265], grad_fn=<SelectBackward0>)


In [81]:
# Learned weights for "dishes": entry 0 scores TRAVEL, entry 1 scores FOOD.
print(next(model.parameters())[:, word_to_ix["dishes"]])

tensor([-0.2475,  0.4258], grad_fn=<SelectBackward0>)
