In [1]:
import torch.nn as nn

# Create an RNN
rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1)

# Check initial weights
print("Input weights W_{hx}:\n", rnn.weight_ih_l0)  # Weights for input to hidden
print("Hidden weights W_{hh}:\n", rnn.weight_hh_l0)  # Weights for hidden to hidden


Input weights W_{hx}:
 Parameter containing:
tensor([[ 0.5384, -0.2044,  0.1479,  0.2199],
        [ 0.0697, -0.5088,  0.2661, -0.1335],
        [-0.3623,  0.4109, -0.3424, -0.2226]], requires_grad=True)
Hidden weights W_{hh}:
 Parameter containing:
tensor([[ 0.5343,  0.2117, -0.1317],
        [ 0.4909,  0.0283, -0.3987],
        [ 0.0164, -0.3089,  0.0196]], requires_grad=True)


## Custom Weight Initialization

In [2]:
import torch
for i in rnn.named_parameters():
    print(type(i))
for i in rnn.named_parameters():
    print(i)


<class 'tuple'>
<class 'tuple'>
<class 'tuple'>
<class 'tuple'>
('weight_ih_l0', Parameter containing:
tensor([[ 0.5384, -0.2044,  0.1479,  0.2199],
        [ 0.0697, -0.5088,  0.2661, -0.1335],
        [-0.3623,  0.4109, -0.3424, -0.2226]], requires_grad=True))
('weight_hh_l0', Parameter containing:
tensor([[ 0.5343,  0.2117, -0.1317],
        [ 0.4909,  0.0283, -0.3987],
        [ 0.0164, -0.3089,  0.0196]], requires_grad=True))
('bias_ih_l0', Parameter containing:
tensor([ 0.0477,  0.2878, -0.2912], requires_grad=True))
('bias_hh_l0', Parameter containing:
tensor([-0.1890, -0.0997, -0.1595], requires_grad=True))


In [3]:
for i in rnn.named_parameters():
    print(i[0], i[1])
    break

weight_ih_l0 Parameter containing:
tensor([[ 0.5384, -0.2044,  0.1479,  0.2199],
        [ 0.0697, -0.5088,  0.2661, -0.1335],
        [-0.3623,  0.4109, -0.3424, -0.2226]], requires_grad=True)


In [4]:
import torch

# Initialize weights with Xavier uniform distribution
for name, param in rnn.named_parameters():
    if 'weight' in name:
        nn.init.xavier_uniform_(param)
        print(param)


Parameter containing:
tensor([[-0.6403, -0.4580,  0.0760,  0.2228],
        [ 0.7605,  0.2104, -0.1391, -0.1490],
        [-0.5899, -0.1240,  0.1980, -0.5080]], requires_grad=True)
Parameter containing:
tensor([[-0.0377,  0.6641,  0.1304],
        [-0.6605,  0.9089, -0.7736],
        [ 0.3250,  0.9797, -0.7253]], requires_grad=True)
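
The loop above applies Xavier initialization to both weight matrices and leaves the biases at their defaults. The following is a minimal sketch of a per-parameter scheme selected by parameter name. The helper name `init_rnn_` is hypothetical, and the choice of orthogonal recurrent weights and zero biases is an assumption made for illustration, not something the notebook itself uses.

```python
import torch
import torch.nn as nn

def init_rnn_(rnn: nn.RNNBase) -> None:
    """Hypothetical helper: initialize an RNN's parameters in place, chosen by name."""
    for name, param in rnn.named_parameters():
        if "weight_ih" in name:
            nn.init.xavier_uniform_(param)  # input-to-hidden weights
        elif "weight_hh" in name:
            nn.init.orthogonal_(param)      # hidden-to-hidden weights (assumed choice)
        elif "bias" in name:
            nn.init.zeros_(param)           # biases start at zero

rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1)
init_rnn_(rnn)
w_hh = rnn.weight_hh_l0.detach()
print(w_hh @ w_hh.T)  # close to the identity matrix for an orthogonal matrix
```

Matching on substrings such as `weight_ih` rather than the full name `weight_ih_l0` lets the same helper cover every layer of a multi-layer RNN.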


### Benefits of Initializing Weights with a Uniform Distribution

Initializing weights using a **uniform distribution** offers several important benefits, particularly in the context of neural networks. Let’s break down the main advantages:

#### 1. Breaking Symmetry

One of the primary reasons for initializing weights with a uniform distribution (or any random distribution) is to **break symmetry** between the neurons. If all weights are initialized to the same value (e.g., zeros), each neuron in a layer will receive the same gradients and update in the same way during training, effectively making them learn the same features. Random initialization with a uniform distribution ensures that neurons start with different weights, allowing them to learn different aspects of the data.

- **Benefit**: Different neurons can learn different features, leading to a more expressive and powerful network.
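
The symmetry argument can be checked directly. The sketch below builds a tiny 4-3-1 network (sizes chosen arbitrarily for illustration), initializes it once with constant weights and once with random weights, and compares the gradients that reach the three hidden units. The function name is hypothetical.

```python
import torch
import torch.nn as nn

def hidden_grad_rows_identical(init_fn) -> bool:
    """Initialize a 4-3-1 network with init_fn and report whether all three
    hidden units receive the same gradient."""
    torch.manual_seed(0)
    hidden = nn.Linear(4, 3)
    head = nn.Linear(3, 1)
    for layer in (hidden, head):
        init_fn(layer.weight)
        nn.init.zeros_(layer.bias)
    x = torch.randn(8, 4)
    loss = head(torch.tanh(hidden(x))).pow(2).mean()
    loss.backward()
    g = hidden.weight.grad  # one row per hidden unit
    return bool(torch.allclose(g[0], g[1]) and torch.allclose(g[1], g[2]))

constant_case = hidden_grad_rows_identical(lambda w: nn.init.constant_(w, 0.5))
random_case = hidden_grad_rows_identical(nn.init.xavier_uniform_)
print("constant init, identical gradients:", constant_case)
print("random init, identical gradients:  ", random_case)
```

With constant weights every hidden unit computes the same value and receives the same update, so the units remain copies of each other no matter how long training runs.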

#### 2. Ensuring Appropriate Scale of Weights

A uniform distribution gives direct control over the range of the initial weights, which helps keep the initial activations and gradients at a manageable scale and reduces the risk of vanishing or exploding gradients. If the range is too small, signals shrink toward zero as they pass through the layers; if it is too large, they grow or saturate the activation functions. A well-chosen range keeps the signal magnitude roughly constant as it propagates through the network.

- **Benefit**: Reduces the risk of vanishing or exploding gradients, leading to more stable and efficient training.
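
To make the effect of the range concrete, the sketch below pushes a random batch through a stack of `tanh` layers whose weights are drawn from `U(-bound, bound)` and reports the spread of the final activations. The width, depth, and the three bounds are arbitrary choices for illustration; the middle bound is the Xavier value for a square layer.

```python
import math
import torch

def final_activation_std(bound: float, width: int = 256, depth: int = 20) -> float:
    """Propagate a random batch through `depth` tanh layers with weights drawn
    from U(-bound, bound); return the standard deviation of the last activations."""
    gen = torch.Generator().manual_seed(0)
    h = torch.randn(64, width, generator=gen)
    for _ in range(depth):
        w = (torch.rand(width, width, generator=gen) * 2 - 1) * bound
        h = torch.tanh(h @ w)
    return h.std().item()

width = 256
tiny = final_activation_std(0.01, width)                              # range too small
xavier = final_activation_std(math.sqrt(6 / (width + width)), width)  # Xavier bound
large = final_activation_std(1.0, width)                              # range too large
print(f"tiny: {tiny:.2e}  xavier: {xavier:.2e}  large: {large:.2e}")
```

With the tiny bound the activations collapse toward zero. With the large bound they are pushed into the flat regions of `tanh`, where the derivative is close to zero, and that is what makes gradients vanish during backpropagation.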

#### 3. Faster Convergence

Choosing the bounds of the uniform distribution with a scheme such as **Xavier/Glorot** or **He/Kaiming** initialization helps the network converge faster, because the weights are scaled for the activation function in use (Xavier/Glorot for sigmoid and tanh, He/Kaiming for ReLU). These schemes are designed to keep the variance of the outputs roughly consistent across layers, which can significantly improve training speed and convergence.

- **Benefit**: Improved training speed and faster convergence to an optimal solution.

#### 4. Flexibility

Using a uniform distribution offers flexibility in controlling the range of initial weights. By specifying the bounds (e.g., `[-a, a]`), you can ensure the initial weights are not too large or too small, which can help avoid large fluctuations or vanishing signals during forward and backward propagation.

- **Benefit**: Control over the range of initial weights, reducing the risk of extreme initial values that can destabilize training.

#### 5. Good for Large Networks

Uniform initialization works well in practice for large networks: each neuron starts with a different weight, but within a controlled range that schemes such as Xavier and He scale with the layer width. This matters most in deep networks, where a poorly scaled initialization is compounded at every layer the signal passes through.

- **Benefit**: Uniform initialization provides a practical and scalable solution for initializing weights in large, deep neural networks.

### Common Methods Based on Uniform Distribution

Several commonly used weight initialization techniques are based on uniform distributions. These methods set the range of the uniform distribution from the number of input units (fan-in) and output units (fan-out) of the layer:

1. **Xavier/Glorot Uniform Initialization**:
   - Uses a uniform distribution whose range depends on the number of input and output neurons.
   - Designed for use with sigmoid and tanh activation functions.
   
   \[
   W \sim \mathcal{U}\left( -\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \right)
   \]
   where \( n_{\text{in}} \) and \( n_{\text{out}} \) are the numbers of input and output units in the layer.

2. **He/Kaiming Uniform Initialization**:
   - Uses a uniform distribution with the range adjusted for ReLU and variants.
   
   \[
   W \sim \mathcal{U}\left( -\sqrt{\frac{6}{n}}, \sqrt{\frac{6}{n}} \right)
   \]
   where \( n \) is the number of input units in the layer.

These methods aim to keep the variance of the activations roughly consistent across layers, which is critical for efficient training.
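
The bounds can be compared against PyTorch's own initializers. With the default gain of 1, `nn.init.xavier_uniform_` samples from `U(-b, b)` with `b = sqrt(6 / (fan_in + fan_out))`, and `nn.init.kaiming_uniform_` with `nonlinearity="relu"` uses `b = sqrt(6 / fan_in)`. The layer sizes below are arbitrary; this is a sanity check, not a derivation.

```python
import math
import torch
import torch.nn as nn

fan_out, fan_in = 256, 128  # weight shape is (fan_out, fan_in), as in nn.Linear
w = torch.empty(fan_out, fan_in)

# Xavier/Glorot uniform: the bound depends on fan_in + fan_out
xavier_bound = math.sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(w)
xavier_max = w.abs().max().item()

# He/Kaiming uniform for ReLU: the bound depends on fan_in only
he_bound = math.sqrt(6 / fan_in)
nn.init.kaiming_uniform_(w, mode="fan_in", nonlinearity="relu")
he_max = w.abs().max().item()

print(f"Xavier: bound {xavier_bound:.4f}, largest |w| {xavier_max:.4f}")
print(f"He:     bound {he_bound:.4f}, largest |w| {he_max:.4f}")
```

For the 3x4 and 3x3 RNN matrices initialized earlier, the Xavier bounds are `sqrt(6/7) ≈ 0.926` and `sqrt(6/6) = 1`, which agrees with the printed values all lying inside those ranges.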

### Summary of Benefits

1. **Breaking Symmetry**: Ensures neurons have unique weights, allowing them to learn different features.
2. **Appropriate Weight Scale**: Helps avoid vanishing or exploding gradients by controlling the range of initial weights.
3. **Faster Convergence**: Properly initialized weights lead to faster and more stable convergence during training.
4. **Flexibility**: You can control the bounds of the uniform distribution to fit the network’s needs.
5. **Scalable to Large Networks**: Uniform initialization is practical for large, deep networks and can be tailored for different activation functions.

In conclusion, initializing weights from a suitably scaled uniform distribution helps the network learn efficiently from the start, avoids several common training problems, and can improve convergence speed.


In [5]:
import torch
import torch.nn as nn
input = torch.randn(5, 3, 10)
input
# nn.RNN defaults to batch_first=False, so this tensor is read as
# (seq_length, batch, input_size) = (5, 3, 10)

tensor([[[-1.7095,  0.3367, -0.3368,  1.2858,  1.0533, -2.2277,  0.8800,
          -0.3742, -0.7403,  0.0883],
         [ 0.5700,  0.0425, -0.3584,  2.2801,  0.1727, -0.7082,  0.0138,
           1.0876,  0.3305,  1.0911],
         [-0.3166,  0.2255,  0.6320,  1.1624,  0.7219,  0.3716, -0.5637,
           0.7088,  0.1332, -0.9087]],

        [[ 0.2929, -0.3405, -0.7353,  0.2625, -0.2958,  1.9104, -0.9117,
          -0.1829, -0.1219, -0.8041],
         [ 1.1385, -0.6271,  0.2522, -0.6451, -0.4261, -2.2981,  1.1322,
          -0.9697, -0.0300,  0.0631],
         [-1.2156,  0.8823,  1.6000, -1.7288,  2.1517,  1.2736,  0.3070,
           1.3770, -1.0922,  1.1184]],

        [[ 0.3145, -0.3453,  1.0731, -0.6722, -0.0336, -0.9665,  2.2367,
          -0.6329,  0.7638, -0.1410],
         [-1.2890, -0.3704,  1.7607, -2.4346,  0.4207, -0.8050, -0.6024,
          -1.6714, -1.7115, -1.2611],
         [-1.0112,  0.0441, -0.2354, -1.0832, -0.8319, -1.3778,  0.7898,
          -2.0225, -0.1850,  0.5123

In [6]:
rnn = nn.RNN(10,20, 2)

In [7]:
rnn

RNN(10, 20, num_layers=2)

In [8]:
h0 = torch.randn(2, 3, 20)  # (num_layers, batch, hidden_size)

In [9]:
h0

tensor([[[ 1.8849, -0.0377, -0.8442, -0.3885, -0.3160, -0.9878, -1.7569,
          -0.4813, -0.2370,  0.7164,  0.1766, -1.8464,  0.8618, -0.1484,
          -0.9899,  0.8647, -0.6951, -0.6781, -1.3161,  0.2395],
         [-0.0618,  0.8876, -1.8760,  0.1734,  2.0396,  2.7417,  0.9176,
          -1.1632, -0.9284,  0.1284,  1.2373,  0.3357, -0.8940, -1.7979,
           1.3789, -0.0955,  0.0741,  0.3966, -0.2160, -0.7145],
         [-1.8246,  0.5814, -0.0167,  0.5478, -0.6746, -0.7669, -0.8837,
          -0.1548, -1.3660, -0.3456, -0.0959,  1.5976,  0.4747,  0.5659,
           0.2607, -2.0218, -1.3912, -1.2576, -1.2950,  0.8505]],

        [[-0.3516,  0.0660, -2.3885,  1.8896,  1.3383, -0.1246, -0.2231,
          -0.1445, -1.0238,  0.3231, -1.3092, -1.4012,  0.4618,  1.0453,
          -1.0436, -0.4002, -1.0601, -0.4116,  0.9950, -0.2595],
         [ 0.1968, -0.8300, -1.1702,  1.6192,  0.5013,  1.5713,  0.6121,
          -0.2064, -0.3890,  1.7899, -0.1054,  1.0439, -0.9625,  0.5699,
        

In [10]:
output, hn = rnn(input, h0)

In [11]:
output.shape

torch.Size([5, 3, 20])
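
The shapes above follow from `nn.RNN` using `batch_first=False` by default. The sketch below repeats the call with the same sizes, then again with `batch_first=True`, to show which dimensions move and which do not. Zeros are used for `h0` here purely for simplicity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size, num_layers = 5, 3, 10, 20, 2

# Default layout: input is (seq_len, batch, input_size)
rnn = nn.RNN(input_size, hidden_size, num_layers)
x = torch.randn(seq_len, batch, input_size)
h0 = torch.zeros(num_layers, batch, hidden_size)
out, hn = rnn(x, h0)
print(out.shape, hn.shape)        # shapes (5, 3, 20) and (2, 3, 20)

# batch_first=True changes the input/output layout only; h0 and hn keep their layout
rnn_bf = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
out_bf, hn_bf = rnn_bf(x.transpose(0, 1).contiguous(), h0)
print(out_bf.shape, hn_bf.shape)  # shapes (3, 5, 20) and (2, 3, 20)

# The last time step of the output equals the final hidden state of the top layer
last_step_matches = torch.allclose(out[-1], hn[-1])
```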

In [12]:
m = nn.Dropout(p=0.2)
input = torch.randn(20, 16)
print(input)
output = m(input)
output
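# In training mode, nn.Dropout zeroes each element with probability p=0.2 and scales the
# survivors by 1/(1-p) = 1.25 (e.g. 1.3358e-01 -> 1.6698e-01 in the output below).
# The nn.Linear layer used in the decoder further down is a fully connected (dense) layer that maps
# the LSTM output to one score (logit) per vocabulary word; the softmax applied inside
# nn.CrossEntropyLoss turns these scores into a probability distribution over the vocabulary.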

tensor([[ 1.3358e-01,  2.0782e+00,  1.7463e+00, -1.1511e+00,  3.1073e-01,
          7.0345e-01,  7.9059e-01,  4.5952e-01,  4.6525e-01, -5.1697e-01,
         -1.8640e-01, -6.7928e-01, -1.2891e-02, -7.0636e-01, -3.6923e-01,
         -1.6991e-01],
        [ 8.4825e-01, -4.9683e-01, -1.0318e+00,  2.8291e-01, -4.5743e-01,
         -6.9884e-01, -4.2754e-01, -1.7330e-01,  7.7686e-01, -5.4991e-01,
         -1.7890e+00, -1.4420e+00, -1.6257e+00,  1.7252e+00,  1.3617e+00,
         -6.7427e-01],
        [ 2.2866e-01,  4.3696e-01,  5.1147e-01,  7.3411e-01,  6.0509e-01,
         -7.3855e-01,  1.4007e-01, -1.3685e+00, -1.1359e+00,  4.9581e-01,
          4.8565e-01,  7.1663e-01,  1.0555e-01,  2.1112e-01,  1.2970e+00,
         -1.3649e+00],
        [ 7.7281e-01,  2.3529e-01, -4.5390e-01,  1.1778e+00, -4.6243e-03,
         -7.4004e-02, -5.5147e-02,  7.8496e-01, -5.4664e-01, -1.6033e+00,
         -5.3937e-01, -1.3047e+00,  1.4342e+00, -3.0194e+00,  1.0339e+00,
          1.2475e+00],
        [-4.8489e-01

tensor([[ 1.6698e-01,  2.5978e+00,  0.0000e+00, -1.4389e+00,  0.0000e+00,
          8.7931e-01,  9.8823e-01,  5.7440e-01,  5.8157e-01, -6.4621e-01,
         -2.3301e-01, -8.4909e-01, -1.6114e-02, -8.8295e-01, -4.6154e-01,
         -2.1238e-01],
        [ 1.0603e+00, -6.2104e-01, -0.0000e+00,  3.5364e-01, -5.7179e-01,
         -0.0000e+00, -5.3443e-01, -2.1662e-01,  9.7108e-01, -6.8738e-01,
         -2.2363e+00, -1.8025e+00, -2.0321e+00,  2.1565e+00,  0.0000e+00,
         -8.4284e-01],
        [ 2.8582e-01,  5.4620e-01,  6.3934e-01,  0.0000e+00,  0.0000e+00,
         -9.2319e-01,  1.7509e-01, -0.0000e+00, -1.4198e+00,  6.1977e-01,
          0.0000e+00,  0.0000e+00,  1.3194e-01,  0.0000e+00,  1.6212e+00,
         -1.7062e+00],
        [ 9.6602e-01,  2.9411e-01, -5.6737e-01,  1.4723e+00, -0.0000e+00,
         -9.2505e-02, -6.8934e-02,  9.8120e-01, -0.0000e+00, -2.0041e+00,
         -6.7422e-01, -1.6309e+00,  1.7928e+00, -0.0000e+00,  1.2923e+00,
          1.5594e+00],
        [-6.0612e-01
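
The two printed tensors differ in two ways: some entries are zeroed, and the rest are larger by a factor of `1 / (1 - p) = 1.25`. The sketch below checks both effects, and also that dropout does nothing in evaluation mode. The seed and variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.2
drop = nn.Dropout(p=p)
x = torch.randn(20, 16)

drop.train()                 # training mode: dropout is active
y = drop(x)
kept = y != 0                # positions that survived
scaled_ok = torch.allclose(y[kept], x[kept] / (1 - p))
dropped_fraction = 1 - kept.float().mean().item()
print(f"dropped fraction: {dropped_fraction:.3f} (expected to be near p = {p})")

drop.eval()                  # evaluation mode: dropout is the identity
identity_ok = torch.equal(drop(x), x)
```

The rescaling keeps the expected value of each activation the same in training and evaluation, which is why no correction is needed at inference time.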

In [13]:
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self.hidden_size = hidden_size
        self.dropout = nn.Dropout(0.5)
    def forward(self, features, captions):
        """
            Forward pass of the decoder
            Arguments:
            - features: Tensor of shape (batch_size, embed_size)
            - captions: Tensor of shape (batch_size, max_caption_length), word indices
            Returns:
            - outputs: Tensor of shape (batch_size, max_caption_length, vocab_size), word predictions
        """
        # Embed the captions, excluding the last token (<end>)
        embedding = self.embedding(captions[:, :-1])
        


In [14]:
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        
        # Embedding layer: converts word indices into dense vectors of size embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        # LSTM: maps embed_size inputs to hidden_size states. The CNN features are fed in as the
        # first time step, so their size must equal embed_size (not hidden_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        
        # Fully connected layer to map LSTM output to vocab_size
        self.fc = nn.Linear(hidden_size, vocab_size)
        
        # Store the hidden size for later reference
        self.hidden_size = hidden_size
        
        # Optional dropout to reduce overfitting (defined here, but not applied in forward)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, features, captions):
        """
        Forward pass of the decoder.
        Arguments:
        - features: Tensor of shape (batch_size, embed_size), image features from the encoder
        - captions: Tensor of shape (batch_size, max_caption_length), word indices
        
        Returns:
        - outputs: Tensor of shape (batch_size, max_caption_length, vocab_size), word predictions
        """
        
        # Embed the captions, excluding the last token (<end>)
        embeddings = self.embedding(captions[:, :-1])
        
        # Concatenate the features with the embedded captions
        # Features are passed as input to the first time step
        features = features.unsqueeze(1)  # shape (batch_size, 1, embed_size)
        lstm_input = torch.cat((features, embeddings), 1)  # shape (batch_size, max_caption_length, embed_size)
        
        # Pass the concatenated inputs through the LSTM
        lstm_out, _ = self.lstm(lstm_input)
        
        # Pass the LSTM output through the fully connected layer to get word predictions
        outputs = self.fc(lstm_out)
        
        return outputs
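
A quick shape check of the decoder with random stand-in tensors. The class is repeated in condensed form only so that the sketch runs on its own, and all sizes are made up and kept small.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Condensed copy of the decoder above, repeated so the sketch is self-contained."""
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.embedding(captions[:, :-1])
        lstm_input = torch.cat((features.unsqueeze(1), embeddings), 1)
        lstm_out, _ = self.lstm(lstm_input)
        return self.fc(lstm_out)

# Made-up sizes, chosen small so the check runs quickly
batch_size, max_len, embed_size, hidden_size, vocab_size = 4, 7, 16, 32, 50
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
features = torch.randn(batch_size, embed_size)                  # stand-in for encoder output
captions = torch.randint(0, vocab_size, (batch_size, max_len))  # stand-in for word indices
outputs = decoder(features, captions)
print(outputs.shape)  # (batch_size, max_len, vocab_size)

# Output position t is the prediction for captions[:, t], so the full captions are the targets
loss = nn.CrossEntropyLoss()(outputs.reshape(-1, vocab_size), captions.reshape(-1))
```

Because the image feature occupies the first time step and the last caption token is dropped from the input, the output sequence has the same length as the caption and lines up with it position by position.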


In [15]:
import os, sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))


In [16]:
import os, pickle
from utils.vocab import Vocabulary

vocab_file = '../vocab.pkl'

# Build vocabulary (done earlier)
if os.path.exists(vocab_file):
    with open(vocab_file, 'rb') as f:
        vocab = pickle.load(f)


In [17]:
import torch
import torch.optim as optim
import torch.nn as nn
import torchvision.models as models
from models.encoder import EncoderCNN
from models.decoder import DecoderRNN
import torchvision.transforms as transforms
from utils import data_loader 
from utils import vocab as vc
from utils.vocab import Vocabulary

import os
import pickle

image_root = '../data'
ann_file = '../data/captions_train2017.json'
save_dir = '../checkpoints/'
vocab_file = 'vocab.pkl'

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

# Build vocabulary (done earlier)
if os.path.exists(vocab_file):
    with open(vocab_file, 'rb') as f:
        vocab = pickle.load(f)
else:
    vocab = vc.build_vocab(ann_file, threshold=5)
    with open(vocab_file, 'wb') as f:
        pickle.dump(vocab, f)

# Hyperparameters
embed_size = 256
hidden_size = 512
vocab_size = len(vocab)  # Vocabulary size (make sure vocab.__len__() is implemented)
num_epochs = 1
learning_rate = 0.001
log_interval = 1  # Log every batch
batch_size = 32  # Batch size for training
device = 'cuda' if torch.cuda.is_available() else 'cpu'
subset_fraction = 0.001
# Initialize models
encoder = EncoderCNN(embed_size).to(device)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers=1).to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=learning_rate)


# Load data
# Note: the assignment below rebinds the name data_loader from the imported module to the
# DataLoader it returns, so re-running this cell requires re-importing the module first
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
data_loader = data_loader.get_loader(image_root, ann_file, vocab, transform, batch_size=batch_size, shuffle=True, num_workers=4, subset_fraction=subset_fraction)



loading annotations into memory...
Done (t=0.60s)
creating index...
index created!


In [18]:
data_loader.batch_size
(data_loader.dataset[0])

(tensor([[[-0.8507, -0.5253, -0.1828,  ..., -0.9020, -0.7479, -1.2617],
          [-0.9877, -0.6794, -0.3541,  ..., -0.4054, -0.3198, -1.0733],
          [-1.0904, -0.9705, -0.5253,  ..., -0.8164, -0.8164, -1.0733],
          ...,
          [-1.1760, -1.1247, -1.0390,  ...,  0.9817,  0.6563,  0.3309],
          [-1.1589, -1.0904, -1.0048,  ...,  0.7762,  0.3309,  0.0056],
          [-1.1932, -1.1247, -0.9705,  ...,  0.6734,  0.3481,  0.0912]],
 
         [[-1.3004, -1.0728, -0.8277,  ..., -0.7402, -0.5476, -1.0903],
          [-1.4230, -1.2829, -0.8452,  ..., -0.1975, -0.1625, -0.8627],
          [-1.5980, -1.4055, -0.9678,  ..., -0.5301, -0.6352, -0.9678],
          ...,
          [-0.6352, -0.6877, -0.6001,  ...,  0.4853,  0.1877, -0.0574],
          [-0.7227, -0.6176, -0.5826,  ...,  0.3803, -0.0749, -0.4776],
          [-0.8277, -0.7052, -0.5126,  ...,  0.3452,  0.0126, -0.4251]],
 
         [[-1.2119, -0.9678, -0.7587,  ..., -0.2358, -0.1661, -0.6715],
          [-1.3164, -1.0898,

In [19]:
for i, (images, captions, lengths) in enumerate(data_loader):
    # Move images and captions to the device (GPU or CPU)
    images = images.to(device)
    captions = captions.to(device)
    print(images)
    break

tensor([[[[-0.1657, -0.1486, -0.1486,  ..., -0.3027, -0.3027, -0.3541],
          [-0.1314, -0.1314, -0.1314,  ..., -0.2856, -0.2856, -0.3369],
          [-0.1486, -0.1314, -0.1314,  ..., -0.2856, -0.3027, -0.3198],
          ...,
          [-0.0116, -0.0972, -0.0972,  ..., -0.1999, -0.1657, -0.1828],
          [ 0.0741,  0.0056,  0.1083,  ...,  0.0056,  0.0398, -0.1143],
          [-0.0801,  0.0741,  0.1083,  ..., -0.1657, -0.1143, -0.0116]],

         [[ 0.7479,  0.7654,  0.7654,  ...,  0.6429,  0.6078,  0.6078],
          [ 0.7304,  0.7479,  0.7829,  ...,  0.6604,  0.6429,  0.6254],
          [ 0.7654,  0.7829,  0.8004,  ...,  0.6604,  0.6604,  0.6078],
          ...,
          [ 0.1176,  0.0301,  0.0126,  ...,  0.0476,  0.1352,  0.1877],
          [ 0.2052,  0.1352,  0.2227,  ...,  0.2752,  0.3277,  0.2402],
          [ 0.0476,  0.2052,  0.2752,  ...,  0.1527,  0.1527,  0.2052]],

         [[ 1.7511,  1.7685,  1.7685,  ...,  1.6291,  1.6117,  1.5768],
          [ 1.7337,  1.7511,  

In [20]:
images[1]

tensor([[[-0.9534, -0.9363, -0.9192,  ..., -0.7137, -0.7479, -0.6794],
         [-0.9192, -0.9020, -0.8849,  ..., -0.6965, -0.7137, -0.6281],
         [-0.8849, -0.8849, -0.8678,  ..., -0.6794, -0.6965, -0.5938],
         ...,
         [-1.0562, -1.0390, -1.0390,  ..., -0.5767, -0.5767, -0.5938],
         [-1.0904, -1.0733, -1.0562,  ..., -0.6109, -0.6109, -0.5938],
         [-1.0904, -1.0562, -1.0562,  ..., -0.6281, -0.6281, -0.6281]],

        [[-1.1253, -1.1078, -1.0903,  ..., -0.8627, -0.9153, -0.8627],
         [-1.0903, -1.0903, -1.0728,  ..., -0.8452, -0.8803, -0.8102],
         [-1.0728, -1.0553, -1.0553,  ..., -0.8277, -0.8627, -0.7752],
         ...,
         [-1.2304, -1.2304, -1.2304,  ..., -0.7577, -0.7577, -0.7752],
         [-1.2479, -1.2654, -1.2654,  ..., -0.7927, -0.7927, -0.7752],
         [-1.2479, -1.2479, -1.2479,  ..., -0.8102, -0.8102, -0.8102]],

        [[-1.1770, -1.1596, -1.1421,  ..., -0.9678, -1.0027, -0.9504],
         [-1.1421, -1.1421, -1.1247,  ..., -0

In [21]:
features = encoder(images)
features.dtype

torch.float32

In [22]:
features.shape
# (batch size, each feature length)

torch.Size([32, 256])

In [23]:
len(features[0])

256

In [24]:
captions.shape

torch.Size([32, 23])

In [25]:
captions.dtype

torch.int64

In [26]:
captions[0]

tensor([   1,    4,   61,  420,  218,    4, 2848,   36,   50,   10,  462,  124,
          10,  109,    7,    4,   91,  151,   40,   10,  603,   13,    2],
       device='cuda:0')

In [27]:
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present

embed = nn.Embedding(vocab_size, 256).to(device)

In [28]:
embedings = embed(captions)

In [29]:
embedings.shape

torch.Size([32, 23, 256])

In [30]:
embedings

tensor([[[ 1.6202e+00, -2.5387e+00, -1.3974e+00,  ...,  1.1184e+00,
           1.0806e+00,  1.2296e+00],
         [-8.6605e-01,  1.1939e-01,  2.2492e+00,  ..., -1.5899e+00,
           3.4686e-01,  1.5243e+00],
         [-1.9779e+00,  9.1244e-01,  6.3398e-01,  ..., -3.3810e-01,
           3.2364e-01, -1.0827e-02],
         ...,
         [ 4.2300e-01, -4.2027e-01, -2.7811e-01,  ...,  1.0645e+00,
          -7.4412e-01,  3.9002e-01],
         [-1.3493e+00,  1.4673e+00, -5.9974e-01,  ..., -5.4825e-01,
          -1.3084e+00,  2.9896e-01],
         [-1.9171e-03, -9.1087e-02, -3.4613e-01,  ..., -1.0111e-01,
          -1.6845e-01, -8.4402e-01]],

        [[ 1.6202e+00, -2.5387e+00, -1.3974e+00,  ...,  1.1184e+00,
           1.0806e+00,  1.2296e+00],
         [ 6.5716e-01,  3.0918e-01, -4.7231e-01,  ...,  1.2112e+00,
          -3.3847e-01,  2.1340e+00],
         [ 4.6459e-01, -1.7325e-01,  1.9442e+00,  ...,  2.1383e+00,
           9.9002e-02, -1.6520e-01],
         ...,
         [-1.3263e+00,  7

In [31]:
embedings[0]

tensor([[ 1.6202e+00, -2.5387e+00, -1.3974e+00,  ...,  1.1184e+00,
          1.0806e+00,  1.2296e+00],
        [-8.6605e-01,  1.1939e-01,  2.2492e+00,  ..., -1.5899e+00,
          3.4686e-01,  1.5243e+00],
        [-1.9779e+00,  9.1244e-01,  6.3398e-01,  ..., -3.3810e-01,
          3.2364e-01, -1.0827e-02],
        ...,
        [ 4.2300e-01, -4.2027e-01, -2.7811e-01,  ...,  1.0645e+00,
         -7.4412e-01,  3.9002e-01],
        [-1.3493e+00,  1.4673e+00, -5.9974e-01,  ..., -5.4825e-01,
         -1.3084e+00,  2.9896e-01],
        [-1.9171e-03, -9.1087e-02, -3.4613e-01,  ..., -1.0111e-01,
         -1.6845e-01, -8.4402e-01]], device='cuda:0',
       grad_fn=<SelectBackward0>)

In [32]:
lstm = nn.LSTM(256, 512, 1, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)


In [33]:
embedings.shape

torch.Size([32, 23, 256])

In [34]:
features.shape

torch.Size([32, 256])

In [35]:
features[0].shape

torch.Size([256])

In [36]:
features[0]

tensor([-1.7884e-01,  3.5485e-02,  1.8867e-01,  2.1133e-01,  3.0049e-01,
        -1.8708e-01,  1.0471e-02, -2.4576e-01,  1.5605e-01,  4.5011e-01,
        -9.1738e-02, -8.0126e-02, -2.6264e-01, -8.8581e-02, -1.5897e-01,
         3.4028e-01,  1.3875e-01, -7.2856e-02, -1.3199e-01,  1.7333e-01,
        -2.1683e-01,  3.1568e-02,  1.2569e-01, -3.8034e-01, -2.3918e-01,
        -9.8834e-02, -1.4781e-01,  1.8908e-01,  2.1438e-02, -3.7536e-02,
         7.2187e-02, -8.9411e-02, -1.2113e-03, -1.2203e-01,  5.9581e-02,
        -2.2986e-02,  8.0658e-02, -3.3516e-01, -1.8464e-01,  3.7603e-01,
         7.3006e-02, -1.9758e-01, -3.8929e-02,  3.1313e-03, -2.1165e-02,
         8.6840e-02,  3.9826e-01, -1.1354e-01, -1.0715e-02,  2.3976e-01,
         2.1704e-02, -9.9393e-02, -1.3941e-01, -1.7334e-01,  4.3932e-01,
        -1.2830e-01, -1.9109e-01, -7.7809e-03,  1.5201e-01, -1.5800e-01,
         1.1645e-01, -2.0851e-02, -3.4033e-01, -1.0205e-01, -7.5210e-02,
        -9.5101e-02, -4.6465e-02,  6.5427e-01, -9.2

In [37]:
features = features.unsqueeze(1)
features.shape
# To concatenate with the embeddings of shape (batch_size, seq_length, embedding_size),
# add a length-1 sequence dimension: (32, 256) -> (32, 1, 256)

torch.Size([32, 1, 256])

In [38]:
features[0].shape

torch.Size([1, 256])

In [39]:
lstm_input = torch.cat((features, embedings), 1)

In [40]:
lstm_input.shape

torch.Size([32, 24, 256])

In [41]:
lstm_input.is_cuda

True

In [42]:
lstm.to(device)

LSTM(256, 512, batch_first=True)

In [43]:
# Outputs: output, (h_n, c_n)

In [44]:
lstm_out, (hn, cn) = lstm(lstm_input)

In [45]:
lstm_out.shape

torch.Size([32, 24, 512])

In [46]:
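# Drop the last time step (the output produced after the final caption token),
# so the 23 remaining outputs line up with the 23 caption tokens
lstm_out = lstm_out[:, :-1, :]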

In [47]:
fc.to(device)

Linear(in_features=512, out_features=9439, bias=True)

In [48]:
output = fc(lstm_out)

In [49]:
output.shape

torch.Size([32, 23, 9439])

In [50]:
vocab_size

9439

In [51]:
# Assuming output has the shape (batch_size, seq_len, vocab_size)
batch_size, seq_len, vocab_size = output.shape

# Get the predicted word indices for each time step across the sequence
_, predicted_indices = torch.max(output, dim=2)  # Shape: (batch_size, seq_len)

# Convert the predicted indices to actual words using your vocabulary
predicted_words = []
for i in range(batch_size):
    words = [vocab.idx2word[idx.item()] for idx in predicted_indices[i]]
    predicted_words.append(words)

# Print predicted words for each sequence in the batch
for i, words in enumerate(predicted_words):
    print(f"Predicted words for sequence {i}: {words}")


Predicted words for sequence 0: ['bathed', 'suits', 'piano', 'travelers', 'regarding', 'pouch', 'atv', 'aircraft', 'california', 'demonic', 'demonic', 'rains', 'pops', 'merchant', 'possessions', 'possessions', 'aircraft', 'protection', 'stay', 'sneaker', 'adult', 'keychain', 'thai']
Predicted words for sequence 1: ['evil', 'suits', 'juicy', 'gazes', 'texas', 'dig', 'arabic', 'arabic', 'recorder', 'according', 'mama', 'investigates', 'plunger', 'funky', 'toppings', 'working', 'flames', 'smashed', 'yankees', 'walkie', 'walkie', 'walkie', 'walkie']
Predicted words for sequence 2: ['aircraft', 'suits', 'piano', 'theater', 'apparatus', 'plunger', 'mph', 'beat', 'ground', 'atv', 'once', 'cameraman', 'california', 'draped', 'den', 'fingertips', 'av', 'yankees', 'walkie', 'walkie', 'walkie', 'walkie', 'walkie']
Predicted words for sequence 3: ['pedestrians', 'suits', 'piano', 'crumbs', 'roasted', 'mugs', 'manmade', 'once', 'pressing', 'sheets', 'california', 'fenced', 'motorcycles', 'bot', 'ma
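
The predictions above look random because the LSTM and linear layer are untrained. To see the decoding step in isolation, the sketch below applies the same argmax-and-lookup logic to hand-written logits. The `idx2word` dictionary is a hypothetical toy vocabulary standing in for `vocab.idx2word`.

```python
import torch

# Hypothetical toy vocabulary standing in for vocab.idx2word
idx2word = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "a", 4: "dog"}

# Hand-written logits of shape (batch_size=1, seq_len=4, vocab_size=5)
logits = torch.full((1, 4, 5), -1.0)
for t, idx in enumerate([1, 3, 4, 2]):
    logits[0, t, idx] = 2.0  # make one word the clear winner at each time step

predicted = logits.argmax(dim=2)  # same indices as torch.max(logits, dim=2)[1]
sentences = [[idx2word[i] for i in row.tolist()] for row in predicted]
print(sentences)  # [['<start>', 'a', 'dog', '<end>']]
```

Calling `row.tolist()` once per sequence avoids an `.item()` call for every token, which matters when the tensors live on the GPU.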

In [52]:
predicted_indices.shape

torch.Size([32, 23])

In [53]:
predicted_indices[0]

tensor([4352, 3282, 4030, 2461, 4052, 1930,  799, 1095, 3366, 7332, 7332,  878,
        4697, 5388, 4569, 4569, 1095, 7588, 1792, 7459,  913, 3664, 2819],
       device='cuda:0')

In [54]:
lstm_out.shape

torch.Size([32, 23, 512])

In [55]:
output.shape

torch.Size([32, 23, 9439])

In [56]:
output = output.view(-1, vocab_size)

In [57]:
output.shape

torch.Size([736, 9439])

In [58]:
_, predicted_indices = torch.max(output, dim=1)

In [59]:
predicted_indices.shape

torch.Size([736])

In [60]:
predicted_indices[0]

tensor(4352, device='cuda:0')

In [61]:
captions.shape


torch.Size([32, 23])

In [62]:
targets = captions[:, :].contiguous().view(-1)  # Shape: (batch_size * sequence_length)


In [63]:
targets.shape

torch.Size([736])

In [64]:
targets[0]

tensor(1, device='cuda:0')

In [65]:
output.shape

torch.Size([736, 9439])

In [66]:
predicted_indices.shape

torch.Size([736])

In [67]:
targets.shape

torch.Size([736])

In [None]:
targets.to(float)

In [None]:
predicted_indices-targets
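
One way to compare predictions with targets, instead of subtracting index tensors, is to compute the cross-entropy loss on the flattened logits and a token-level accuracy. The sketch below uses all-zero logits as a stand-in for a model with no preference; the sizes are taken from the shapes printed above, and the targets are random.

```python
import math
import torch
import torch.nn as nn

vocab_size, n_tokens = 9439, 736            # sizes taken from the shapes above
logits = torch.zeros(n_tokens, vocab_size)  # stand-in: every word equally likely
targets = torch.randint(0, vocab_size, (n_tokens,))

# CrossEntropyLoss takes raw logits of shape (N, C) and integer class indices of shape (N,)
loss = nn.CrossEntropyLoss()(logits, targets)
chance_level = math.log(vocab_size)         # loss of a uniform guess over the vocabulary

# Token-level accuracy: compare indices directly; no float conversion is needed
predicted = logits.argmax(dim=1)
accuracy = (predicted == targets).float().mean().item()
print(f"loss {loss.item():.4f}  ln(V) {chance_level:.4f}  accuracy {accuracy:.4f}")
```

A freshly initialized decoder should start with a loss near `ln(vocab_size)`, so this value is a useful reference point when reading the first training logs.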

In [70]:
targets.shape

torch.Size([736])

In [82]:
type(targets)

torch.Tensor

In [85]:
type(targets[0])

torch.Tensor

In [86]:
print(targets[0])

tensor(1, device='cuda:0')


In [87]:
targets[0].item()

1

In [88]:
len(targets)

736

In [96]:
for i in range(len(targets)):
    predicted_words = [vocab.idx2word[predicted_indices[i].item()]]
    target_words = [vocab.idx2word[targets[i].item()]]
    print(f"target: {target_words}, predicted words: {predicted_words}")
    break

target: ['<start>'], predicted words: ['bathed']
