# Recurrent Neural Networks

# **Section 1: Theoretical Background**

## Long Short-Term Memory Networks (LSTMs) in Sequence Modeling

Recurrent Neural Networks (RNNs) are a class of neural networks that are well-suited to modeling sequential data, such as time series or natural language. However, standard RNNs struggle with learning long-term dependencies due to the vanishing or exploding gradient problem. **Long Short-Term Memory networks (LSTMs)** address this issue by introducing a memory cell that can maintain information over long periods.

### The LSTM Architecture

An LSTM cell consists of several components that interact to decide what information to keep, write, or delete from the cell state. The key components are:

- **Cell state ($C_t$)**: Represents the internal memory of the cell.
- **Hidden state ($h_t$)**: Output of the cell at time $t$; it is fed back into the gates, together with the next input, at the following time step.
- **Input gate ($i_t$)**: Decides which new information to store in the cell state.
- **Forget gate ($f_t$)**: Decides which information to discard from the cell state.
- **Output gate ($o_t$)**: Decides what information to output from the cell state.

### Mathematical Formulations

At each time step $t$, the LSTM performs the following computations:

1. **Input Gate ($i_t$)**:

   $$
   i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
   $$

2. **Forget Gate ($f_t$)**:

   $$
   f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
   $$

3. **Cell Candidate ($\tilde{C}_t$)**:

   $$
   \tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C)
   $$

4. **Cell State Update ($C_t$)**:

   $$
   C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
   $$

5. **Output Gate ($o_t$)**:

   $$
   o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
   $$

6. **Hidden State ($h_t$)**:

   $$
   h_t = o_t \odot \tanh(C_t)
   $$

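Here, $\sigma$ denotes the sigmoid function, $\tanh$ is the hyperbolic tangent function, $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, and $\odot$ denotes element-wise multiplication. Each gate and the cell candidate has its own input weights $W$, recurrent weights $U$, and bias $b$, distinguished by the subscript.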
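
To make the six equations concrete, here is a minimal sketch of a single LSTM step for the scalar case (input size 1, hidden size 1), written with the standard library only. The function names `sigmoid` and `lstm_step`, the parameter dictionary, and the all-zero parameter values are assumptions chosen for illustration; they are not part of the notebook code.

```python
import math

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for scalar input and scalar state, following equations 1-6."""
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])        # input gate
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])        # forget gate
    c_tilde = math.tanh(p["W_C"] * x_t + p["U_C"] * h_prev + p["b_C"])  # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde                                  # cell state update
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])        # output gate
    h_t = o_t * math.tanh(c_t)                                          # hidden state
    return h_t, c_t

# Hypothetical parameters: every weight and bias is zero, so each gate
# evaluates to sigmoid(0) = 0.5 and the candidate to tanh(0) = 0.
zero_params = {f"{kind}_{gate}": 0.0 for gate in "ifCo" for kind in ("W", "U", "b")}

h_1, c_1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(c_1)  # half of the old cell state is kept, nothing new is written
print(h_1)  # 0.5 * tanh(c_1)
```

With these values the forget gate keeps half of the previous cell state ($0.5 \cdot 2.0$) and the input gate writes nothing, because the candidate is zero.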

### Step-by-Step Derivation

1. **Compute Gates**: Calculate the values of the input, forget, and output gates using the current input and previous hidden state.
2. **Update Cell State**: Modify the cell state by forgetting some information and adding new candidate information.
3. **Compute Hidden State**: Generate the new hidden state based on the updated cell state and output gate.


### Key Assumptions and Limitations

- **Assumptions**:
  - The sequential data has dependencies over varying time scales.
  - Input sequences may vary in length, since the same cell is applied at every time step.

- **Limitations**:
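  - More computation and parameters per step than a standard RNN: the three gates and the cell candidate each have their own weights, roughly four times as many for the same hidden size.
  - May still struggle with very long sequences, since information in the cell state decays whenever the forget gate is below 1.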
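
A small numeric sketch of the last limitation: if the gate values are treated as constants, the cell state update gives $\partial C_t / \partial C_{t-1} = f_t$, so the contribution of an old cell state after $T$ steps scales with the product of the forget gates. The helper name `cell_path_factor` and the constant gate value 0.99 are assumptions for illustration, and the dependence of the gates on $h_{t-1}$ is ignored.

```python
import math

def cell_path_factor(forget_gates):
    """Product of forget-gate values along the cell-state path."""
    return math.prod(forget_gates)

# Hypothetical setting: the forget gate stays at 0.99 at every step.
for T in (10, 100, 1000):
    print(T, cell_path_factor([0.99] * T))
```

Even with a forget gate this close to 1, the factor is roughly 0.37 after 100 steps and below 1e-4 after 1000 steps, which is why very long sequences remain difficult.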

### Practical Applications

- **Natural Language Processing (NLP)**: Part-of-speech tagging, language translation, and text generation.
- **Time Series Forecasting**: Stock prices, weather prediction.
- **Speech Recognition**: Modeling temporal dependencies in audio signals.

### Summary of Key Points

- LSTMs mitigate the vanishing gradient problem in RNNs.
- They use gates to control the flow of information.
- Suitable for modeling long-term dependencies in sequential data.
- Widely used in various domains requiring sequence modeling.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(1)

def prepare_sequence(seq, to_ix):
    """Converts a sequence of tokens (words, characters, or tags) to a tensor of indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Sample training data
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# Create word-to-index and tag-to-index mappings
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# Create character-to-index mapping
char_to_ix = {}
for word in word_to_ix.keys():
    for char in word:
        if char not in char_to_ix:
            char_to_ix[char] = len(char_to_ix)

# Hyperparameters
WORD_EMBEDDING_DIM = 6
CHAR_EMBEDDING_DIM = 3
HIDDEN_DIM = 6
CHAR_HIDDEN_DIM = 3

class LSTMTagger(nn.Module):
    """
    LSTM-based POS tagger that incorporates character-level features.
    """

    def __init__(self, word_embedding_dim, char_embedding_dim, hidden_dim,
                 char_hidden_dim, vocab_size, tagset_size, char_vocab_size):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.char_hidden_dim = char_hidden_dim

        # Word embeddings
        self.word_embeddings = nn.Embedding(vocab_size, word_embedding_dim)

        # Character embeddings and LSTM
        self.char_embeddings = nn.Embedding(char_vocab_size, char_embedding_dim)
        self.char_lstm = nn.LSTM(char_embedding_dim, char_hidden_dim)

        # Main LSTM
        self.lstm = nn.LSTM(word_embedding_dim + char_hidden_dim, hidden_dim)

        # Linear layer mapping to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, words):
        """
        Forward pass of the model.

        Args:
            sentence: Tensor of word indices.
            words: List of words corresponding to the indices.
        Returns:
            tag_scores: Log probabilities of tags for each word.
        """
        # Initialize list to hold combined embeddings
        embeddings = []

        for idx, word in enumerate(words):
            # Embed the characters of the word: shape (len(word), 1, char_embedding_dim)
            char_idxs = prepare_sequence(word, char_to_ix)
            char_embeds = self.char_embeddings(char_idxs).view(len(word), 1, -1)

            # Run the char LSTM and keep its final hidden state as the
            # character-level representation of the word
            _, (char_hidden, _) = self.char_lstm(char_embeds)
            char_hidden = char_hidden.view(-1)

            # Get word embedding and concatenate with char-level representation
            word_embed = self.word_embeddings(sentence[idx]).view(1, 1, -1)
            combined = torch.cat((word_embed, char_hidden.view(1, 1, -1)), 2)
            embeddings.append(combined.view(1, -1))

        # Stack embeddings and pass through main LSTM
        embeddings = torch.stack(embeddings).view(len(sentence), 1, -1)
        lstm_out, _ = self.lstm(embeddings)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

# Initialize the model, loss function, and optimizer
model = LSTMTagger(
    WORD_EMBEDDING_DIM,
    CHAR_EMBEDDING_DIM,
    HIDDEN_DIM,
    CHAR_HIDDEN_DIM,
    len(word_to_ix),
    len(tag_to_ix),
    len(char_to_ix)
)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(10):  # Reduced epochs for brevity
    for sentence, tags in training_data:
        # Zero gradients
        optimizer.zero_grad()

        # Prepare inputs
        inputs = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Forward pass
        tag_scores = model(inputs, sentence)

        # Compute loss and backpropagate
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# Inspect the tag scores for the first training sentence
with torch.no_grad():
    test_sentence = training_data[0][0]
    inputs = prepare_sequence(test_sentence, word_to_ix)
    tag_scores = model(inputs, test_sentence)
    print("Tag scores:", tag_scores)
```

```text
Tag scores: tensor([[-1.0572, -0.9974, -1.2597],
        [-1.0553, -1.0094, -1.2466],
        [-1.1263, -0.9667, -1.2193],
        [-1.1229, -1.0327, -1.1437],
        [-1.1395, -0.8838, -1.3211]])
```
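
Each row of the printed scores holds the log-probabilities of `DET`, `NN`, and `V` for one word, so the predicted tag is the index of the largest entry. The sketch below decodes the scores shown above with plain Python; the values are copied from that output, and `ix_to_tag` is a helper mapping introduced here as the inverse of `tag_to_ix`.

```python
# Inverse of tag_to_ix from the cell above (helper introduced for this sketch)
ix_to_tag = {0: "DET", 1: "NN", 2: "V"}

# Log-probabilities copied from the printed output
tag_scores = [
    [-1.0572, -0.9974, -1.2597],
    [-1.0553, -1.0094, -1.2466],
    [-1.1263, -0.9667, -1.2193],
    [-1.1229, -1.0327, -1.1437],
    [-1.1395, -0.8838, -1.3211],
]

# argmax over each row, then map the index back to its tag
predicted = [ix_to_tag[max(range(len(row)), key=row.__getitem__)] for row in tag_scores]
print(predicted)
```

Every row peaks at index 1, so after 10 epochs the model still tags each word as `NN`, whereas the training labels are `DET NN V DET NN`. The scores are close to uniform, which suggests the model needs more epochs on this toy data.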


## Gated Recurrent Units (GRUs) in Sequence Modeling

Recurrent Neural Networks (RNNs) are powerful for handling sequential data but suffer from the vanishing and exploding gradient problems, which hamper learning long-term dependencies. **Gated Recurrent Units (GRUs)** are a type of RNN architecture designed to mitigate these issues with gating mechanisms. They use a simpler architecture than LSTMs and often perform comparably.

### The GRU Architecture

A GRU combines the hidden state and cell state of an LSTM into a single hidden state. It uses two gates:

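- **Update Gate ($z_t$)**: Determines how much of the previous hidden state is replaced by the new candidate. With the update rule given below, $z_t$ near 0 keeps $h_{t-1}$ and $z_t$ near 1 overwrites it.
- **Reset Gate ($r_t$)**: Determines how much of the previous hidden state is used when computing the candidate activation.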

### Mathematical Formulations

At each time step $t$, the GRU performs the following computations:

1. **Update Gate ($z_t$)**:

   $$
   z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
   $$

2. **Reset Gate ($r_t$)**:

   $$
   r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
   $$

3. **Candidate Activation ($\tilde{h}_t$)**:

   $$
   \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
   $$

4. **Hidden State Update ($h_t$)**:

   $$
   h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
   $$

Here, $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, and $\odot$ denotes element-wise multiplication.
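
The following is a minimal sketch of one GRU step for the scalar case, using only the standard library, to show how the update gate interpolates between the previous state and the candidate. The function `gru_step` and all parameter values are assumptions chosen for illustration. Note that `torch.nn.GRU` documents the interpolation with the roles of $z_t$ and $1 - z_t$ swapped, so this sketch follows the equations above rather than the PyTorch convention.

```python
import math

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step for scalar input and scalar state, following equations 1-4."""
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])                  # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])                  # reset gate
    h_tilde = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])    # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                                   # new hidden state

# Hypothetical parameters: only W_h is non-zero, so the candidate is tanh(x_t).
base = {"W_z": 0.0, "U_z": 0.0, "b_z": 0.0,
        "W_r": 0.0, "U_r": 0.0, "b_r": 0.0,
        "W_h": 1.0, "U_h": 0.0, "b_h": 0.0}

h_keep = gru_step(1.0, 0.5, {**base, "b_z": -10.0})     # z_t close to 0
h_replace = gru_step(1.0, 0.5, {**base, "b_z": 10.0})   # z_t close to 1
print(h_keep, h_replace)
```

With a strongly negative update-gate bias the state stays near its previous value of 0.5; with a strongly positive bias it moves to the candidate $\tanh(1)$.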

### Step-by-Step Derivation

1. **Compute Update and Reset Gates**: Calculate $z_t$ and $r_t$ using the current input and previous hidden state.
2. **Compute Candidate Activation**: Generate a candidate hidden state $\tilde{h}_t$ using the reset gate to modulate the influence of the previous hidden state.
3. **Update Hidden State**: Combine the previous hidden state and the candidate activation using the update gate.

### Key Assumptions and Limitations

- **Assumptions**:
  - The sequential data exhibits temporal dependencies.
  - The model benefits from gating mechanisms to control information flow.

- **Limitations**:
  - On some tasks, GRUs may capture very long dependencies less well than LSTMs, since there is no separate cell state.
  - Like all RNNs, GRUs process time steps sequentially, which makes long sequences slow to train on.

### Practical Applications

- **Natural Language Processing (NLP)**: Machine translation, text summarization, and sentiment analysis.
- **Speech Recognition**: Modeling temporal patterns in audio data.
- **Time Series Analysis**: Forecasting and anomaly detection in financial or sensor data.

### Summary of Key Points

- GRUs are a simplified version of LSTMs with fewer gates.
- They mitigate the vanishing gradient problem in standard RNNs.
- GRUs use update and reset gates to control information flow.
- They have fewer parameters than an LSTM of the same hidden size (three sets of weights instead of four) and often perform comparably on sequence modeling tasks.
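
As a rough check on the size difference, the sketch below counts recurrent-layer parameters using the equations in this document, where each gate or candidate has one input matrix $W$, one recurrent matrix $U$, and one bias $b$. The helper `recurrent_param_count` is hypothetical, and the dimensions are taken from the tagger hyperparameters (input size 6 + 3 = 9, hidden size 6). PyTorch's recurrent layers keep two bias vectors per weight set, so `nn.LSTM` and `nn.GRU` would report slightly larger counts.

```python
def recurrent_param_count(input_dim, hidden_dim, num_weight_sets):
    """Parameters for num_weight_sets copies of (W, U, b) as written in the equations."""
    per_set = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return num_weight_sets * per_set

# A standard RNN has 1 weight set, a GRU has 3 (two gates plus the candidate),
# and an LSTM has 4 (three gates plus the candidate).
counts = {
    name: recurrent_param_count(input_dim=9, hidden_dim=6, num_weight_sets=k)
    for name, k in (("RNN", 1), ("GRU", 3), ("LSTM", 4))
}
print(counts)
```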

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(1)

def prepare_sequence(seq, to_ix):
    """Converts a sequence of tokens (words, characters, or tags) to a tensor of indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Sample training data
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# Create word-to-index and tag-to-index mappings
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# Create character-to-index mapping
char_to_ix = {}
for word in word_to_ix.keys():
    for char in word:
        if char not in char_to_ix:
            char_to_ix[char] = len(char_to_ix)

# Hyperparameters
WORD_EMBEDDING_DIM = 6
CHAR_EMBEDDING_DIM = 3
HIDDEN_DIM = 6
CHAR_HIDDEN_DIM = 3

class GRUTagger(nn.Module):
    """
    GRU-based POS tagger that incorporates character-level features.
    """

    def __init__(self, word_embedding_dim, char_embedding_dim, hidden_dim,
                 char_hidden_dim, vocab_size, tagset_size, char_vocab_size):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.char_hidden_dim = char_hidden_dim

        # Word embeddings
        self.word_embeddings = nn.Embedding(vocab_size, word_embedding_dim)

        # Character embeddings and GRU
        self.char_embeddings = nn.Embedding(char_vocab_size, char_embedding_dim)
        self.char_gru = nn.GRU(char_embedding_dim, char_hidden_dim)

        # Main GRU
        self.gru = nn.GRU(word_embedding_dim + char_hidden_dim, hidden_dim)

        # Linear layer mapping to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, words):
        """
        Forward pass of the model.

        Args:
            sentence: Tensor of word indices.
            words: List of words corresponding to the indices.
        Returns:
            tag_scores: Log probabilities of tags for each word.
        """
        # Initialize list to hold combined embeddings
        embeddings = []

        for idx, word in enumerate(words):
            # Embed the characters of the word: shape (len(word), 1, char_embedding_dim)
            char_idxs = prepare_sequence(word, char_to_ix)
            char_embeds = self.char_embeddings(char_idxs).view(len(word), 1, -1)

            # Run the char GRU and keep its final hidden state as the
            # character-level representation of the word
            _, char_hidden = self.char_gru(char_embeds)

            # Look up the word embedding
            word_embed = self.word_embeddings(sentence[idx])

            # Concatenate word embedding with char-level hidden state
            combined = torch.cat((word_embed, char_hidden.view(-1)), dim=0)
            embeddings.append(combined)

        # Stack embeddings and pass through main GRU
        embeddings = torch.stack(embeddings).view(len(sentence), 1, -1)
        gru_out, _ = self.gru(embeddings)
        tag_space = self.hidden2tag(gru_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

# Initialize the model, loss function, and optimizer
model = GRUTagger(
    WORD_EMBEDDING_DIM,
    CHAR_EMBEDDING_DIM,
    HIDDEN_DIM,
    CHAR_HIDDEN_DIM,
    len(word_to_ix),
    len(tag_to_ix),
    len(char_to_ix)
)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(10):  # Adjust the number of epochs as needed
    for sentence, tags in training_data:
        # Zero gradients
        optimizer.zero_grad()

        # Prepare inputs
        inputs = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Forward pass
        tag_scores = model(inputs, sentence)

        # Compute loss and backpropagate
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# Inspect the tag scores for the first training sentence
with torch.no_grad():
    test_sentence = training_data[0][0]
    inputs = prepare_sequence(test_sentence, word_to_ix)
    tag_scores = model(inputs, test_sentence)
    print("Tag scores:", tag_scores)
```

```text
Tag scores: tensor([[-0.8481, -1.0790, -1.4618],
        [-0.7996, -1.1305, -1.4801],
        [-1.0361, -1.0181, -1.2592],
        [-1.0467, -0.9545, -1.3322],
        [-1.0948, -1.0075, -1.2031]])
```


## Recurrent Neural Networks (RNNs) and Teacher Forcing in Sequence Modeling

Sequential data, such as time series and natural language, require models that can capture temporal dependencies. **Recurrent Neural Networks (RNNs)** are a class of neural networks designed for this purpose. However, training RNNs can be challenging due to issues like vanishing gradients and the complexities involved in sequence generation. **Teacher Forcing** is a training strategy used to address some of these challenges by providing the model with the ground truth output from the previous time step during training.

### Recurrent Neural Networks (RNNs)

An RNN processes sequences by maintaining a hidden state that is updated at each time step based on the current input and the previous hidden state.

#### Mathematical Formulation

At each time step $t$:

1. **Hidden State Update**:

   $$
   h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)
   $$

   - $x_t$: Input at time $t$
   - $h_{t-1}$: Hidden state from the previous time step
   - $W_{hx}, W_{hh}$: Weight matrices
   - $b_h$: Bias vector

2. **Output**:

   $$
   y_t = W_{hy} h_t + b_y
   $$

   - $y_t$: Output at time $t$
   - $W_{hy}$: Output weight matrix
   - $b_y$: Output bias
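
The two equations can be unrolled over a short sequence as in the sketch below, which uses plain Python lists in place of tensors. The helper names `matvec` and `rnn_forward` and all weight values are assumptions chosen for illustration (hidden size 2, input and output size 1).

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_forward(xs, W_hx, W_hh, b_h, W_hy, b_y):
    """Apply the hidden-state update and output equation at every time step."""
    h = [0.0] * len(b_h)  # h_0 = 0
    ys = []
    for x_t in xs:
        pre = [a + b + c for a, b, c in zip(matvec(W_hx, x_t), matvec(W_hh, h), b_h)]
        h = [math.tanh(v) for v in pre]                           # h_t
        ys.append([a + b for a, b in zip(matvec(W_hy, h), b_y)])  # y_t
    return ys, h

# Hypothetical weights; the same matrices are reused at every step.
W_hx = [[1.0], [-1.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b_h = [0.0, 0.0]
W_hy = [[1.0, 0.0]]
b_y = [0.0]

# A single non-zero input followed by zeros
ys, h_final = rnn_forward([[1.0], [0.0], [0.0]], W_hx, W_hh, b_h, W_hy, b_y)
print(ys)
```

Although only the first input is non-zero, the later outputs are still positive: the hidden state carries a decaying trace of that first input forward.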

#### Key Characteristics

- **Memory of Previous Inputs**: The hidden state $h_t$ serves as a memory of previous inputs.
- **Shared Parameters**: The weights are shared across time steps.
- **Challenges**: Standard RNNs struggle with long-term dependencies due to vanishing or exploding gradients.
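
A worked example of the gradient issue for a scalar RNN: from the hidden state update, $\partial h_t / \partial h_{t-1} = w_{hh} (1 - h_t^2)$, so the gradient across $T$ steps is a product of $T$ such factors. The function name and the chosen weights are assumptions for illustration; inputs are set to zero so the hidden state stays at 0 and each factor equals $w_{hh}$ exactly.

```python
import math

def grad_h_T_wrt_h_0(w_hh, xs, w_hx=1.0, b_h=0.0, h0=0.0):
    """Product of d h_t / d h_{t-1} = w_hh * (1 - h_t^2) over the sequence."""
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(w_hx * x_t + w_hh * h + b_h)
        grad *= w_hh * (1.0 - h * h)
    return grad

zeros = [0.0] * 50
vanishing = grad_h_T_wrt_h_0(0.5, zeros)   # equals 0.5 ** 50
exploding = grad_h_T_wrt_h_0(2.0, zeros)   # equals 2.0 ** 50
print(vanishing, exploding)
```

After 50 steps the gradient is either far below 1e-10 or far above 1e10, depending only on whether the recurrent weight is below or above 1 in magnitude.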

### Teacher Forcing

Teacher forcing is a training strategy where, during training, the model receives the actual ground truth output from the previous time step as input, instead of its own previous output.

#### Mechanism

- **With Teacher Forcing**:

  At each time step, the ground truth $y_{t-1}$ is used as input for generating $y_t$.

- **Without Teacher Forcing**:

  The model uses its own predicted output $\hat{y}_{t-1}$ from the previous time step to generate $y_t$.

#### Mathematical Representation

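- **With Teacher Forcing**:

  $$
  h_t = \tanh(W_{hx} x_t + W_{yh} y_{t-1} + W_{hh} h_{t-1} + b_h)
  $$

- **Without Teacher Forcing**:

  $$
  h_t = \tanh(W_{hx} x_t + W_{yh} \hat{y}_{t-1} + W_{hh} h_{t-1} + b_h)
  $$

  Here $W_{yh}$ is an additional weight matrix that maps the previous output into the hidden state; it is distinct from the output matrix $W_{hy}$ defined earlier.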
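
The only difference between the two update rules is which previous output is fed back. The sketch below isolates that choice in a generic loop. `decode`, `toy_step`, and the integer "tags" are hypothetical stand-ins for a real model step and are not taken from the notebook code.

```python
def decode(step_fn, xs, targets=None, teacher_forcing=False, start=0):
    """Run step_fn over xs, feeding back the gold or the predicted previous output."""
    prev, preds, fed = start, [], []
    for t, x_t in enumerate(xs):
        fed.append(prev)               # what the step receives as previous output
        y_hat = step_fn(x_t, prev)
        preds.append(y_hat)
        if teacher_forcing and targets is not None:
            prev = targets[t]          # ground truth y_t for the next step
        else:
            prev = y_hat               # the model's own prediction
    return preds, fed

def toy_step(x_t, prev):
    """Hypothetical stand-in for a model step over three tag indices."""
    return (x_t + prev) % 3

xs = [1, 1, 1, 1]
targets = [0, 1, 2, 0]

preds_tf, fed_tf = decode(toy_step, xs, targets, teacher_forcing=True)
preds_free, fed_free = decode(toy_step, xs)
print(fed_tf)    # start value followed by the gold outputs, shifted by one step
print(fed_free)  # start value followed by the model's own predictions
```

At the first step there is no previous output, so both modes receive the same start value; they differ only from the second step onward.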

### Step-by-Step Derivation with Teacher Forcing

1. **Initialization**: Start with an initial hidden state $h_0$ (often set to zeros).
2. **For each time step $t$**:
   - Compute the hidden state $h_t$ using the ground truth output $y_{t-1}$. At the first step no previous output exists, so a fixed start value such as a zero vector is used.
   - Generate the output $y_t$ based on $h_t$.

### Contrasting with and without Teacher Forcing

- **Convergence**:

  - *With Teacher Forcing*: The model typically converges faster because it receives correct context from the ground truth.
  - *Without Teacher Forcing*: Training may be slower and less stable due to accumulated errors from previous predictions.

- **Exposure Bias**:

  - *With Teacher Forcing*: The model may not learn to recover from its own mistakes because it rarely encounters them during training.
  - *Without Teacher Forcing*: The model learns to handle its own errors, potentially improving robustness.
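
A toy illustration of why errors accumulate when a model conditions on its own predictions. It assumes a hypothetical scalar process $y_t = 0.9\, y_{t-1}$ and an imperfect model that predicts $0.8$ times whatever previous value it is given; neither comes from the notebook code.

```python
def rollout_errors(true_coef=0.9, model_coef=0.8, y0=1.0, steps=10):
    """Relative error per step when conditioning on gold vs. own previous values."""
    truth = [y0]
    for _ in range(steps):
        truth.append(true_coef * truth[-1])

    tf_err, free_err = [], []
    prev_free = y0
    for t in range(1, steps + 1):
        tf_pred = model_coef * truth[t - 1]   # conditioned on the gold previous value
        free_pred = model_coef * prev_free    # conditioned on its own previous prediction
        tf_err.append(abs(tf_pred - truth[t]) / truth[t])
        free_err.append(abs(free_pred - truth[t]) / truth[t])
        prev_free = free_pred
    return tf_err, free_err

tf_err, free_err = rollout_errors()
print([round(e, 3) for e in tf_err])
print([round(e, 3) for e in free_err])
```

When conditioned on gold values, the relative error stays at about 11% at every step; in the free-running rollout it grows at every step. A model trained only in the first regime is never exposed to the second, which is the mismatch described as exposure bias.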

### Key Assumptions and Limitations

- **Assumptions**:

  - The ground truth sequence is available during training.
  - Sequential dependencies are important for the task.

- **Limitations**:

  - *Teacher Forcing* can lead to discrepancies between training and inference (known as exposure bias).
  - Without teacher forcing, training may be less efficient due to error accumulation.

### Practical Applications

- **Language Modeling**: Predicting the next word in a sentence.
- **Machine Translation**: Translating sequences from one language to another.
- **Speech Recognition**: Transcribing audio sequences into text.

### Summary of Key Points

- RNNs are suitable for modeling sequential data but face challenges with long-term dependencies.
- Teacher forcing is a strategy to stabilize and accelerate training by using ground truth outputs.
- There is a trade-off between training efficiency and robustness to errors when using teacher forcing.
- Understanding the effects of teacher forcing is crucial for designing effective sequence models.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(1)

def prepare_sequence(seq, to_ix):
    """Converts a sequence of words or tags to a tensor of indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Sample training data
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# Create word-to-index and tag-to-index mappings
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# Hyperparameters
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

class RNNTagger(nn.Module):
    """
    RNN-based POS tagger with optional teacher forcing.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.tagset_size = tagset_size

        # Word embeddings
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # RNN layer; its input is the word embedding plus the previous tag vector
        self.rnn = nn.RNN(embedding_dim + tagset_size, hidden_dim)

        # Linear layer mapping to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, targets=None, teacher_forcing=False):
        """
        Forward pass of the model.

        Args:
            sentence: Tensor of word indices.
            targets: Tensor of tag indices (ground truth).
            teacher_forcing: Boolean indicating whether to use teacher forcing.
        Returns:
            tag_scores: Log probabilities of tags for each word.
        """
        embeds = self.word_embeddings(sentence)
        tag_scores = []
        hidden = torch.zeros(1, 1, self.hidden_dim)

        # Previous-tag vector for the first step, where no previous tag exists
        prev_output = torch.zeros(1, self.tagset_size)

        for i in range(len(sentence)):
            # Input = current word embedding concatenated with the previous tag vector
            rnn_input = torch.cat((embeds[i].view(1, -1), prev_output), dim=1)

            # Pass through the RNN for a single time step
            rnn_output, hidden = self.rnn(rnn_input.view(1, 1, -1), hidden)

            # Compute tag scores (log-probabilities) for this word
            tag_space = self.hidden2tag(rnn_output.view(1, -1))
            tag_prob = F.log_softmax(tag_space, dim=1)
            tag_scores.append(tag_prob)

            # Choose what the next step receives as the previous tag
            if teacher_forcing and targets is not None:
                # Teacher forcing: one-hot vector of the ground-truth tag
                prev_output = torch.zeros(1, self.tagset_size)
                prev_output[0][targets[i]] = 1
            else:
                # Free running: the model's own predicted tag distribution
                prev_output = torch.exp(tag_prob)

        tag_scores = torch.cat(tag_scores, dim=0)
        return tag_scores

# Initialize the model, loss function, and optimizer
model = RNNTagger(
    EMBEDDING_DIM,
    HIDDEN_DIM,
    len(word_to_ix),
    len(tag_to_ix)
)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for sentence, tags in training_data:
        # Zero gradients
        model.zero_grad()

        # Prepare inputs
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Forward pass with teacher forcing
        tag_scores = model(sentence_in, targets, teacher_forcing=True)

        # Compute loss and backpropagate
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# Testing the model with and without teacher forcing
with torch.no_grad():
    test_sentence = training_data[0][0]
    inputs = prepare_sequence(test_sentence, word_to_ix)
    targets = prepare_sequence(training_data[0][1], tag_to_ix)

    # With teacher forcing
    tag_scores_tf = model(inputs, targets, teacher_forcing=True)
    print("Tag scores with teacher forcing:", tag_scores_tf)

    # Without teacher forcing
    tag_scores_no_tf = model(inputs, teacher_forcing=False)
    print("Tag scores without teacher forcing:", tag_scores_no_tf)
```

```text
Tag scores with teacher forcing: tensor([[-0.6665, -1.0373, -2.0242],
        [-1.5274, -0.5120, -1.6948],
        [-2.0098, -1.0183, -0.6836],
        [-0.5246, -1.3591, -1.8886],
        [-1.5782, -0.5187, -1.6176]])
Tag scores without teacher forcing: tensor([[-0.6665, -1.0373, -2.0242],
        [-1.5320, -0.5137, -1.6839],
        [-1.9086, -0.9977, -0.7278],
        [-0.6094, -1.3147, -1.6726],
        [-1.4744, -0.5879, -1.5343]])
```
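
To see where the two runs diverge, the sketch below compares the printed scores row by row using plain Python. The values are copied from the output above; the variable names are introduced here for illustration.

```python
import math

# Log-probabilities copied from the printed output
with_tf = [
    [-0.6665, -1.0373, -2.0242],
    [-1.5274, -0.5120, -1.6948],
    [-2.0098, -1.0183, -0.6836],
    [-0.5246, -1.3591, -1.8886],
    [-1.5782, -0.5187, -1.6176],
]
without_tf = [
    [-0.6665, -1.0373, -2.0242],
    [-1.5320, -0.5137, -1.6839],
    [-1.9086, -0.9977, -0.7278],
    [-0.6094, -1.3147, -1.6726],
    [-1.4744, -0.5879, -1.5343],
]

# Largest absolute difference in each row between the two modes
gaps = [max(abs(a - b) for a, b in zip(row_tf, row_free))
        for row_tf, row_free in zip(with_tf, without_tf)]

# exp of the log-probabilities in a row should sum to about 1
prob_sums = [sum(math.exp(v) for v in row) for row in with_tf]

print([round(g, 4) for g in gaps])
print([round(s, 3) for s in prob_sums])
```

The first row is identical in both modes because the first step has no previous tag and receives a zero vector either way. The later rows differ because the free-running model feeds back its predicted distribution instead of the one-hot ground-truth tag.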
