# Understanding the Limitations of RNNs and LSTMs
Before diving into self-attention models, it's important for the students to understand why the field moved beyond RNNs and LSTMs.

## Issues with RNNs/LSTMs:
- **Sequential Processing:** RNNs and LSTMs process data sequentially, which makes parallelization (and hence, faster training and inference) difficult.
- **Long-Term Dependencies:** Even with LSTMs, learning long-term dependencies in very long sequences can be challenging due to the vanishing gradient problem.
- **Complexity and Training Time:** Because each time step must wait for the previous one, training time grows with sequence length, and these models can take a long time to train on large datasets.
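
To make the first two points concrete, here is a minimal sketch of a one-unit recurrent network in plain Python. The weight `w`, the starting state `h0`, and the function name `gradient_through_time` are assumptions chosen for illustration, not part of any library. Each step needs the previous hidden state, so the loop cannot be parallelized across time, and the gradient of the last state with respect to the first is a product of per-step factors that shrinks as the sequence grows.

```python
import math

def gradient_through_time(steps, w=0.5, h0=0.1):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):           # each step depends on the previous one
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)    # chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - h_t^2)
    return grad

for steps in (5, 20, 50):
    print(steps, gradient_through_time(steps))
```

LSTM gates are designed to soften this decay, but as noted above, very long sequences can still be challenging.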

### Example: LSTM Struggling with Long-Context Word Prediction
#### Scenario:
We'll use sentences where a specific word early in the sentence determines the choice of a word later in the sentence. For example, in the sentence "I grew up in France... I speak fluent **French**", the word "French" is highly dependent on the word "France" mentioned earlier. Let's see if we can predict it.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Sample sentences
sentences = [
    "I grew up in France, I speak fluent French",
    "She was born in Spain and speaks Spanish",
    # ... more sentences ...
]

# Preprocessing
def tokenize(sentences):
    tokens = [sent.lower().split() for sent in sentences]
    return tokens

tokens = tokenize(sentences)
all_words = [word for sent in tokens for word in sent]
vocab = sorted(set(all_words))  # sorted so that word indices are reproducible across runs
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, lengths):
        x = self.embedding(x)
        packed_input = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        packed_output, _ = self.lstm(packed_input)
        output, _ = pad_packed_sequence(packed_output, batch_first=True)
        out = self.fc(output)
        return out

    def predict_next_word(self, sentence):
        self.eval()  # Set the model to evaluation mode
        tokens = sentence.lower().split()
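        # Convert words to indices. Note: unknown words fall back to index 0,
        # which is also the index of a real vocabulary word and of the padding value.
        idxs = [word_to_idx.get(word, 0) for word in tokens]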
        seq = torch.tensor(idxs, dtype=torch.long).unsqueeze(0)  # Convert to tensor
        with torch.no_grad():
            output = self(seq, [len(idxs)])
            last_word_logits = output[0, -1]
            predicted_idx = torch.argmax(last_word_logits).item()
        return vocab[predicted_idx]  # Return the predicted word

# Constants
VOCAB_SIZE = len(vocab)
EMBED_SIZE = 100
HIDDEN_SIZE = 128

# Instantiate the model
model = LSTMModel(VOCAB_SIZE, EMBED_SIZE, HIDDEN_SIZE)

# Prepare data for training
def create_dataset(tokens):
    sequences = []
    sequence_lengths = []
    for sentence in tokens:
        idxs = [word_to_idx[word] for word in sentence]
        for i in range(1, len(idxs)):
            sequences.append(idxs[:i+1])
            sequence_lengths.append(i+1)
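    # Right-pad every sequence with zeros to the longest length, then stack into a single array
    # (building a tensor from a list of separate NumPy arrays is slow and triggers a warning)
    max_len = max(sequence_lengths)
    sequences = np.array([np.pad(seq, (0, max_len - len(seq)), mode='constant') for seq in sequences])
    return torch.tensor(sequences, dtype=torch.long), torch.tensor(sequence_lengths, dtype=torch.long)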

sequences, seq_lengths = create_dataset(tokens)

# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
def train(model, data, epochs):
    for epoch in range(epochs):
        total_loss = 0
        for seq, length in zip(*data):
            optimizer.zero_grad()
            output = model(seq.unsqueeze(0), [length])
            loss = criterion(output.squeeze(0)[:length-1], seq[1:length])
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(data[0])
        print(f'Epoch {epoch+1}, Loss: {avg_loss:.4f}')


# Run training
train(model, (sequences, seq_lengths), 40)



Epoch 1, Loss: 2.5977
Epoch 2, Loss: 1.8552
Epoch 3, Loss: 1.0975
Epoch 4, Loss: 0.5490
Epoch 5, Loss: 0.2857
Epoch 6, Loss: 0.1703
Epoch 7, Loss: 0.1139
Epoch 8, Loss: 0.0823
Epoch 9, Loss: 0.0621
Epoch 10, Loss: 0.0482
Epoch 11, Loss: 0.0385
Epoch 12, Loss: 0.0316
Epoch 13, Loss: 0.0265
Epoch 14, Loss: 0.0226
Epoch 15, Loss: 0.0196
Epoch 16, Loss: 0.0172
Epoch 17, Loss: 0.0153
Epoch 18, Loss: 0.0137
Epoch 19, Loss: 0.0123
Epoch 20, Loss: 0.0112
Epoch 21, Loss: 0.0102
Epoch 22, Loss: 0.0094
Epoch 23, Loss: 0.0086
Epoch 24, Loss: 0.0080
Epoch 25, Loss: 0.0074
Epoch 26, Loss: 0.0069
Epoch 27, Loss: 0.0065
Epoch 28, Loss: 0.0061
Epoch 29, Loss: 0.0057
Epoch 30, Loss: 0.0054
Epoch 31, Loss: 0.0051
Epoch 32, Loss: 0.0048
Epoch 33, Loss: 0.0045
Epoch 34, Loss: 0.0043
Epoch 35, Loss: 0.0041
Epoch 36, Loss: 0.0039
Epoch 37, Loss: 0.0037
Epoch 38, Loss: 0.0035
Epoch 39, Loss: 0.0034
Epoch 40, Loss: 0.0032


In [2]:
def test_model_on_sentences(model, test_sentences):
    for sentence in test_sentences:
        prediction = model.predict_next_word(sentence)
        print(f"Sentence: '{sentence}' -> Predicted next word: '{prediction}'")

# Example sentences with important context at different positions
test_sentences = [
    "I grew up in France, I speak fluent",
    "She was born in Spain and speaks",
    "In Germany, many people speak",
    # Longer context sentences
    "After spending a decade in Japan, I finally learned to speak",
    "Listen to me, because I will only say this once: when you travel the world you will learn many different things. Once thing you will"
]

test_model_on_sentences(model, test_sentences)

Sentence: 'I grew up in France, I speak fluent' -> Predicted next word: 'french'
Sentence: 'She was born in Spain and speaks' -> Predicted next word: 'spanish'
Sentence: 'In Germany, many people speak' -> Predicted next word: 'in'
Sentence: 'After spending a decade in Japan, I finally learned to speak' -> Predicted next word: 'fluent'
Sentence: 'Listen to me, because I will only say this once: when you travel the world you will learn many different things. Once thing you will' -> Predicted next word: 'up'
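
Before interpreting these predictions, it helps to check how many words in each test sentence the model has ever seen. The sketch below rebuilds the vocabulary from only the two sample sentences shown in the training cell (the elided ones are not included, so this is an assumption) and lists the out-of-vocabulary tokens, which `predict_next_word` silently maps to index 0. The helper name `oov_words` is hypothetical.

```python
train_sentences = [
    "I grew up in France, I speak fluent French",
    "She was born in Spain and speaks Spanish",
]

# Same whitespace tokenization as the notebook: punctuation stays attached ("france,")
vocab = sorted({word for sent in train_sentences for word in sent.lower().split()})
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

def oov_words(sentence):
    """Return the tokens of `sentence` that are missing from the vocabulary."""
    return [w for w in sentence.lower().split() if w not in word_to_idx]

for s in ["I grew up in France, I speak fluent", "In Germany, many people speak"]:
    print(f"{s!r} -> unseen tokens: {oov_words(s)}")
```

Under that assumption, some of the failures above may reflect unseen words and the tiny training set as much as any limit on long-range dependencies.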


## Explanation

- **Model:** This LSTM model uses word embeddings and is designed to predict the next word in a sequence.
- **Data Preparation:** We preprocess the sentences by tokenizing them and creating a vocabulary. The model is trained on sequences of increasing length, predicting the next word at each step.
- **Training:** During training, the model receives partial sentences and tries to predict the next word. The training loop includes packing the sequences to handle variable lengths.
- **Prediction Function:** The `predict_next_word` method takes a partial sentence, processes it through the model, and outputs the model's prediction for the next word.
- **Testing with Varied Contexts:** The `test_model_on_sentences` function tests the model with different sentences where the key context for predicting the next word (the name of a country) is located at varying distances from the end of the sentence.
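
The data preparation step described above can be sketched without PyTorch. This only illustrates how one sentence expands into growing prefixes and their next-word targets; the function name `prefix_target_pairs` is an assumption, and the notebook's `create_dataset` additionally converts words to indices and pads the sequences.

```python
def prefix_target_pairs(sentence):
    """Return (prefix_tokens, next_token) for every position after the first word."""
    tokens = sentence.lower().split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for prefix, target in prefix_target_pairs("She was born in Spain and speaks Spanish"):
    print(" ".join(prefix), "->", target)
```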

# But what are embeddings?

Some resources:
- Vicki Boykis [What Are Embeddings](https://vickiboykis.com/what_are_embeddings/)
- Simon Willison [What Are Embeddings? Why They Matter](https://simonwillison.net/2023/Oct/23/embeddings/)
- Roy Keyes [The Shortest Definition of Embeddings?](https://roycoding.com/blog/2022/embeddings.html)


Today, we're going to look at this wonderful article by the Visual Storytelling Team and Madhumita Murgia in the Financial Times: [Generative AI](https://ig.ft.com/generative-ai/)

Embeddings are a crucial concept in natural language processing and machine learning, offering a way to represent words, sentences, or even entire documents as vectors in a high-dimensional space. These vectors capture semantic meaning and relationships between words or phrases. Visualizing embeddings can be quite enlightening, as it helps to understand how models perceive and process language.
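
A common way to measure how close two embeddings are is cosine similarity. The sketch below uses made-up 3-dimensional vectors (real embeddings have far more dimensions, and these numbers are purely illustrative) to show that related words score higher than unrelated ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

print(cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"]))
print(cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"]))
```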

### Visualizing Embeddings

To visualize embeddings, we usually reduce their dimensionality to 2D or 3D using techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding). This allows us to plot them and observe how words with similar meanings are grouped close together.

#### Step-by-Step Process:

1. **Get Pre-trained Embeddings:** We'll use pre-trained word embeddings like GloVe or Word2Vec. These embeddings are trained on large corpora and capture rich language semantics.

2. **Select Words for Visualization:** Choose a set of words that includes a mix of similar and dissimilar terms to illustrate how embeddings capture semantic relationships.

3. **Dimensionality Reduction:** Use PCA or t-SNE to reduce the embeddings to 2 or 3 dimensions.

4. **Plotting:** Plot the reduced embeddings and annotate them with corresponding words.
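
As a lightweight illustration of step 3, here is a sketch of PCA implemented with NumPy's SVD. To keep it runnable without downloading anything, it uses random stand-in vectors forming two hypothetical clusters rather than real GloVe embeddings; the function name `pca_2d` is an assumption.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Stand-in for real embeddings: two hypothetical clusters of 5 "words" in 50 dimensions
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(5, 50))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(5, 50))

points = pca_2d(np.vstack([cluster_a, cluster_b]))
print(points.shape)
print(points[:, 0].round(2))  # the first component separates the two clusters
```

PCA is deterministic and linear, whereas t-SNE is stochastic and tends to emphasize local neighborhoods, so the two can produce quite different pictures of the same vectors.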

### Code Example: Visualizing Word Embeddings

In [None]:
! pip install torchtext



In [None]:
import torch
import pandas as pd
from sklearn.manifold import TSNE
from torchtext.vocab import GloVe
import plotly.express as px

# Load pre-trained GloVe embeddings
glove = GloVe(name='6B', dim=100)  # Using 100-dimensional vectors

# Expanded list of words with categories
words = [
    ("king", "royalty"), ("queen", "royalty"), ("prince", "royalty"), ("princess", "royalty"), ("duke", "royalty"), ("monarch", "royalty"),
    ("apple", "fruit"), ("banana", "fruit"), ("grape", "fruit"), ("orange", "fruit"), ("berry", "fruit"), ("melon", "fruit"),
    ("paris", "city"), ("berlin", "city"), ("london", "city"), ("madrid", "city"), ("rome", "city"), ("vienna", "city"),
    ("google", "tech"), ("microsoft", "tech"), ("apple", "tech"), ("ibm", "tech"), ("intel", "tech"), ("facebook", "tech"),
    ("happy", "emotion"), ("sad", "emotion"), ("angry", "emotion"), ("joyful", "emotion"), ("upset", "emotion"), ("glad", "emotion"),
    ("sweet", "taste?/emotion?")
]

# Extracting vectors for selected words
word_vectors = [glove[word] for word, _ in words]
word_vectors = torch.stack(word_vectors)

# Reducing dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=0, perplexity=len(words) // 5)
words_reduced = tsne.fit_transform(word_vectors.numpy())  # scikit-learn works on NumPy arrays

# Convert the t-SNE output to a DataFrame
df = pd.DataFrame(words_reduced, columns=['x', 'y'])
df['word'] = [word for word, _ in words]
df['category'] = [category for _, category in words]

# Create a 2D scatter plot using Plotly Express with colors for categories
fig = px.scatter(df, x='x', y='y', text='word', color='category',
                 title="2D Visualization of GloVe Word Embeddings with Categorical Colors")
fig.update_traces(textposition='top center')
fig.show()

## Explanation
- **Loading GloVe Embeddings:** torchtext's GloVe class automatically handles downloading the GloVe embeddings. Here, we use the 100-dimensional vectors from the '6B' GloVe dataset.
- **Selecting Words:** We choose a set of diverse words to visualize how their embeddings relate to each other in a 2D space.
- **Dimensionality Reduction with t-SNE:** t-SNE is used to reduce the dimensions of the embeddings to 2D for visualization purposes.
- **Plotting:** We plot these reduced dimensions and annotate them with the corresponding words.

# Self-Attention and the Transformer
This brings us to Self-Attention and the Transformer.

Unlike LSTMs, transformers process the entire input sequence simultaneously, which lets them capture dependencies between words regardless of their position in the sentence. This is done through self-attention: for each word, the model computes a weighted sum over the representations of all words in the sentence, with the weights signifying how relevant each other word is when encoding that particular word.

### Introduction to Self-Attention

Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence. It has been effectively used in tasks where the entire context of the sequence is important.

The self-attention mechanism allows the model to weigh the influence of different parts of the input data differently. This is particularly useful for handling sequences with varying lengths and complex relationships between elements.
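
The weighting idea can be sketched in a few lines of plain Python. This is a deliberately stripped-down version, assuming no learned projections: each token vector acts as its own query, key, and value, and the 2-dimensional "embeddings" are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)                                # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention where each token is its own query, key and value."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in tokens]
        weights = softmax(scores)                  # relevance of every token to this query
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional "embeddings" for three tokens
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and every output vector is a blend of all input vectors, which is why distance within the sequence plays no role in the computation itself.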

### Building a Basic Transformer Model

Here we'll outline a basic version of the Transformer model introduced in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), focusing on the key components that make up the transformer architecture.

#### Transformer Model Components:

1. **Encoder and Decoder:** The transformer model consists of an encoder to process the input and a decoder to produce the output.
2. **Self-Attention Layer:** In both the encoder and decoder, self-attention layers compute attention scores for each element in the sequence.
3. **Feed-Forward Neural Networks:** Each self-attention layer is followed by a feed-forward neural network.
4. **Positional Encoding:** Since the transformer doesn't have any recurrent or convolutional structures, positional encodings are added to give the model information about the position of each token in the sequence.
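
As an illustration of the fourth component, here is a sketch of the fixed sinusoidal positional encoding described in "Attention Is All You Need". The function name and the small sizes are assumptions for the example; learned position embeddings are a common alternative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

These vectors are added to the token embeddings before the first attention layer, so two identical words at different positions end up with different inputs.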


### Transformer Model Code Example

Let's build a simple version of the transformer model using PyTorch:


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embed size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        # N is the batch size; the *_len values are the sequence lengths
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Attention calculation
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(d_k), the per-head key dimension, as in "Attention Is All You Need"
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out



This `SelfAttention` module, which already implements multi-head attention, can then be integrated into the encoder and decoder of a Transformer. The full architecture is considerably more complex and requires additional components such as layer normalization, residual connections, feed-forward layers, and positional encodings.

### Explanation of the Transformer Code

- **`SelfAttention` Class:** This class is a simplified version of the multi-head self-attention mechanism. It computes the attention scores and applies them to the values.
- **Forward Method:** The `forward` method takes values, keys, and queries (plus an optional mask), splits them into multiple heads, and applies the self-attention mechanism. It outputs a weighted sum of values, combined from each head's results.
- **Masking:** If a mask is provided (used for padding or future blinding in the decoder), it is applied to the attention scores to prevent the model from attending to certain positions.
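
The "future blinding" case can be illustrated with a small pure-Python sketch. The raw scores are hypothetical, and the helper names `causal_mask` and `masked_softmax` are assumptions; the fill value mirrors the `masked_fill` call in the class above.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row, fill=-1e20):
    # Blocked scores become very negative, so their softmax weight underflows to zero
    masked = [s if keep else fill for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
scores = [[0.5, 2.0, 1.0]] * 3      # hypothetical raw attention scores, same for each query
for mask_row, row in zip(mask, scores):
    print([round(w, 3) for w in masked_softmax(row, mask_row)])
```

The first query can only attend to itself, the second to the first two positions, and the last to all three.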

This basic self-attention module is a core building block of the transformer model, allowing it to consider the entire sequence at once and learn the dependencies between all tokens, regardless of their distance in the sequence. It demonstrates how transformers can overcome the limitations of RNNs and LSTMs.

### From Self-Attention to Global Attention

[Attention Family Tree](https://ai.v-gar.de/ml/transformer/timeline/)

The journey from the self-attention mechanism in models like BERT to the development of large-scale generative AI models like GPT-3 and GPT-4 involved significant advancements in neural network architectures, computational resources, and training methodologies.

### 1. **Advancements Post 'Attention Is All You Need' Paper**

After the introduction of the Transformer architecture in the "Attention Is All You Need" paper, there was a surge in research focusing on leveraging self-attention for various tasks.

- **BERT and Its Impact:** BERT (Bidirectional Encoder Representations from Transformers), introduced by Google, was a landmark in understanding how context in language can be used for tasks like question answering and language inference. Unlike previous models that processed text in one direction (either left-to-right or right-to-left), BERT was designed to understand the context of a word in relation to all other words in a sentence (bidirectionality).

### 2. **From Understanding to Generation: GPT Series**

While BERT was focused on understanding language (natural language understanding, NLU), the next big leap was towards language generation (natural language generation, NLG).

- **GPT Series:** OpenAI’s GPT (Generative Pretrained Transformer) models shifted the focus towards generative tasks. GPT models are trained to predict the next word in a sentence, given the words that come before it. This predictive modeling can be extended to generate coherent and contextually relevant text over longer passages.
  
  - **GPT-1 and GPT-2:** These models demonstrated that a Transformer pretrained on large text corpora (books for GPT-1, web pages for GPT-2) could generate coherent and surprisingly relevant text snippets based on given prompts.
  
  - **GPT-3 and Beyond:** With GPT-3, the scale was dramatically increased - both in terms of the size of the model (number of parameters) and the diversity and volume of training data. GPT-3 showed that scaling up the size of the model and the training data led to a significant increase in the model's ability to generate coherent and contextually appropriate text, as well as perform a variety of language tasks without task-specific training data (few-shot or zero-shot learning).
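
The generation loop behind the GPT series (predict the next word, append it, repeat) can be sketched with a toy stand-in for the model. The probability table below is entirely hypothetical and looks only at the last word, whereas a real GPT model conditions on all preceding tokens through self-attention.

```python
# Hypothetical next-word probabilities standing in for a trained language model
bigram_probs = {
    "i":      {"speak": 0.7, "grew": 0.3},
    "speak":  {"fluent": 0.8, "french": 0.2},
    "fluent": {"french": 0.9, "spanish": 0.1},
    "french": {"<eos>": 1.0},
}

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: repeatedly append the most probable next word."""
    tokens = prompt.lower().split()
    for _ in range(max_new_tokens):
        candidates = bigram_probs.get(tokens[-1])
        if not candidates:
            break                                   # no prediction available for this word
        next_word = max(candidates, key=candidates.get)
        if next_word == "<eos>":
            break                                   # the model signals the end of the text
        tokens.append(next_word)
    return " ".join(tokens)

print(generate("I"))
```

Greedy selection is only one decoding strategy; sampling from the predicted distribution instead of always taking the top word gives more varied text.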

### 3. **Computational Power and Data**

The leap to models like GPT-3 and GPT-4 also required massive computational resources:

- **Hardware Advancements:** The development of more powerful GPUs and TPUs allowed for training larger models with billions of parameters more efficiently.
- **Larger and More Diverse Datasets:** The availability of large-scale and diverse datasets facilitated training models that could understand and generate more nuanced and contextually varied text.

### 4. **Broader Applications and Fine-Tuning**

With powerful generative models, the range of applications expanded significantly:

- **Versatility:** Models like GPT-3 can be fine-tuned for specific tasks (like translation, summarization, content creation) or used directly with prompts for various applications, demonstrating a broad understanding of language and knowledge.
- **Emergence of AI as a Service:** The introduction of these models as cloud services made powerful AI accessible to a broader audience, fueling a wave of innovation in AI applications.

### 5. **Continued Research and Ethical Considerations**

- **Ongoing Research:** The field continues to evolve, with research focusing on making models more efficient, less biased, and more interpretable.
- **Ethical and Societal Impact:** The rise of powerful generative models has also brought attention to ethical considerations, including potential misuse, bias in AI, and the impact on various sectors like journalism, law, and creative industries.

The transition from BERT's breakthrough in understanding context to the generative prowess of GPT-3 and GPT-4 marks a significant era in AI's evolution, highlighting how advancements in model architecture, training techniques, and computational capabilities come together to push the boundaries of what's possible in natural language processing and AI at large.