# Lab 4: Advanced Architectures

In this final neural networks lab, we'll explore architectures for sequential data, attention mechanisms, and generative models.

## Learning Objectives

By the end of this lab, you will:
- Understand and implement Recurrent Neural Networks (RNNs)
- Build LSTM and GRU networks
- Apply attention mechanisms
- Understand the scaled dot-product attention formula behind transformers
- Implement autoencoders
- Work with sequence-to-sequence models
- Build practical applications with sequential data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import torch
import torch.nn as nn

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

## Part 1: Recurrent Neural Networks (RNNs)

RNNs process sequential data one step at a time, carrying a hidden state $h_t$ that summarizes the inputs seen so far.

### Basic RNN:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

### Problems:
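- **Vanishing gradients**: Gradients shrink as they are backpropagated through many time steps, so long-term dependencies are hard to learn
- **Exploding gradients**: Gradients grow as they are backpropagated through many time steps, which makes training unstable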
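Both problems come from multiplying the gradient by the recurrent weight matrix once per time step. The sketch below shows this under simplifying assumptions: it ignores the $\tanh$ derivative and uses a scaled orthogonal matrix for $W_{hh}$, so the gradient norm changes by exactly the scale factor at each step. The helper name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(W_hh, n_steps=50):
    """Track the gradient norm as it is pushed back through n_steps linear recurrent steps."""
    grad = np.ones(W_hh.shape[0]) / np.sqrt(W_hh.shape[0])   # unit-norm gradient at the last step
    norms = []
    for _ in range(n_steps):
        grad = W_hh.T @ grad          # one step back in time (activation derivative ignored)
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # orthogonal matrix: all singular values equal 1

small = backprop_norms(0.5 * Q)   # singular values 0.5: the norm halves at every step
large = backprop_norms(1.5 * Q)   # singular values 1.5: the norm grows 1.5x at every step

print(f"scale 0.5, after 50 steps: {small[-1]:.3e}")
print(f"scale 1.5, after 50 steps: {large[-1]:.3e}")
```

In a real RNN the $\tanh$ derivative is at most 1, which pushes the balance further toward vanishing gradients.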

### Solution: LSTM and GRU

In [None]:
# Simple sequence prediction task
def generate_sequence_data(n_samples=1000, seq_length=10):
    """
    Generate sequences where target is sum of inputs.
    """
    X = np.random.rand(n_samples, seq_length, 1)
    y = np.sum(X, axis=1)
    return X, y

X_train, y_train = generate_sequence_data(1000)
X_test, y_test = generate_sequence_data(200)

print(f"Input shape: {X_train.shape}")
print(f"Output shape: {y_train.shape}")

In [None]:
# Simple RNN model
model_rnn = keras.Sequential([
    layers.SimpleRNN(32, activation='tanh', input_shape=(10, 1)),
    layers.Dense(1)
])

model_rnn.compile(optimizer='adam', loss='mse', metrics=['mae'])

history_rnn = model_rnn.fit(
    X_train, y_train,
    batch_size=32,
    epochs=20,
    validation_split=0.2,
    verbose=0
)

# evaluate() returns [loss, mae] because the model was compiled with metrics=['mae']
test_loss, test_mae = model_rnn.evaluate(X_test, y_test, verbose=0)
print(f"Simple RNN Test MAE: {test_mae:.4f}")

## Part 2: LSTM (Long Short-Term Memory)

An LSTM uses three gates to control what is forgotten from, written to, and read from its cell state:

### Gates:
1. **Forget gate**: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
2. **Input gate**: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
3. **Output gate**: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$

### Cell state update:
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$
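A single LSTM time step following the gate and cell-state equations above can be sketched in NumPy. This is an illustration only: the function `lstm_step`, the `params` dictionary layout, and the random weights are assumptions, and Keras stores its LSTM weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step: compute the gates, update the cell state, emit the hidden state."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])        # input gate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])        # output gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                           # keep some old, add some new
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

# Illustrative sizes and random (untrained) weights
rng = np.random.default_rng(0)
input_dim, hidden_dim = 1, 4
params = {}
for gate in ["f", "i", "o", "C"]:
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.random((10, input_dim)):
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)
```

The cell state update is additive, so when $f_t$ is close to 1 information (and gradient) can pass through many steps largely unchanged.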

In [None]:
# LSTM model
model_lstm = keras.Sequential([
    layers.LSTM(32, input_shape=(10, 1)),
    layers.Dense(1)
])

model_lstm.compile(optimizer='adam', loss='mse', metrics=['mae'])

history_lstm = model_lstm.fit(
    X_train, y_train,
    batch_size=32,
    epochs=20,
    validation_split=0.2,
    verbose=0
)

# GRU model (simplified LSTM)
model_gru = keras.Sequential([
    layers.GRU(32, input_shape=(10, 1)),
    layers.Dense(1)
])

model_gru.compile(optimizer='adam', loss='mse', metrics=['mae'])

history_gru = model_gru.fit(
    X_train, y_train,
    batch_size=32,
    epochs=20,
    validation_split=0.2,
    verbose=0
)

# Compare
print(f"LSTM Test MAE: {model_lstm.evaluate(X_test, y_test, verbose=0)[1]:.4f}")
print(f"GRU Test MAE: {model_gru.evaluate(X_test, y_test, verbose=0)[1]:.4f}")

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(history_rnn.history['val_loss'], label='RNN')
plt.plot(history_lstm.history['val_loss'], label='LSTM')
plt.plot(history_gru.history['val_loss'], label='GRU')
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('RNN vs LSTM vs GRU')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Part 3: Text Generation with LSTM

Let's build a character-level text generator.

In [None]:
# Sample text
text = """Deep learning is a subset of machine learning that uses neural networks with multiple layers. 
These networks can learn hierarchical representations of data, making them powerful for tasks like 
image recognition, natural language processing, and more."""

# Create character mappings
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

print(f"Unique characters: {len(chars)}")
print(f"Text length: {len(text)}")

# Create sequences
seq_length = 40
step = 3
sequences = []
next_chars = []

for i in range(0, len(text) - seq_length, step):
    sequences.append(text[i:i + seq_length])
    next_chars.append(text[i + seq_length])

print(f"Number of sequences: {len(sequences)}")

# Vectorize
X = np.zeros((len(sequences), seq_length, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

In [None]:
# Build LSTM model for text generation
model_text = keras.Sequential([
    layers.LSTM(128, input_shape=(seq_length, len(chars))),
    layers.Dense(len(chars), activation='softmax')
])

model_text.compile(
    optimizer='adam',
    loss='categorical_crossentropy'
)

# Train
model_text.fit(X, y, batch_size=32, epochs=50, verbose=0)

# Generate text
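def generate_text(model, seed_text, length=200, temperature=1.0):
    generated = seed_text

    for _ in range(length):
        # Prepare input: one-hot encode the last seq_length characters,
        # right-aligned so a seed shorter than seq_length is zero-padded on the left
        window = generated[-seq_length:]
        offset = seq_length - len(window)
        x_pred = np.zeros((1, seq_length, len(chars)))
        for t, char in enumerate(window):
            x_pred[0, offset + t, char_to_idx[char]] = 1

        # Predict next character; cast to float64 so the renormalized
        # probabilities sum to 1 within np.random.choice's tolerance
        preds = model.predict(x_pred, verbose=0)[0].astype('float64')
        preds = np.log(preds + 1e-7) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)

        # Sample from the temperature-scaled distribution
        next_idx = np.random.choice(len(chars), p=preds)
        generated += idx_to_char[next_idx]

    return generated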

# Generate with different temperatures
seed = "Deep learning is a subset of machine l"

for temp in [0.5, 1.0, 1.5]:
    print(f"\n--- Temperature: {temp} ---")
    print(generate_text(model_text, seed, length=100, temperature=temp))

## Part 4: Attention Mechanism

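Attention lets a model focus on the relevant parts of its input: the output is a weighted average of the values, with weights set by how well each key matches the query.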

### Attention Formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$: Query
- $K$: Key
- $V$: Value
- $d_k$: Dimension of key vectors
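The formula can be written out in a few lines of NumPy. This is a minimal single-head sketch without masking or batching; the function names and the matrix sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))     # 3 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))     # 5 keys of dimension d_k = 8
V = rng.normal(size=(5, 16))    # 5 values of dimension 16

attn_out, attn_w = scaled_dot_product_attention(Q, K, V)
print(attn_out.shape, attn_w.shape)
```

Each output row is a convex combination of the rows of `V`, one per query. Dividing by $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would otherwise saturate the softmax.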

In [None]:
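# Simple additive attention pooling layer: scores each time step with a small
# learned network (a different scoring function from the scaled dot-product formula above)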
class SimpleAttention(layers.Layer):
    def __init__(self, units):
        super(SimpleAttention, self).__init__()
        self.W = layers.Dense(units)
        self.V = layers.Dense(1)
    
    def call(self, hidden_states):
        # hidden_states shape: (batch, time, features)
        score = self.V(tf.nn.tanh(self.W(hidden_states)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * hidden_states
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

# Model with attention
inputs = keras.Input(shape=(10, 1))
lstm_out = layers.LSTM(32, return_sequences=True)(inputs)
context, attention_weights = SimpleAttention(32)(lstm_out)
outputs = layers.Dense(1)(context)

model_attention = keras.Model(inputs=inputs, outputs=outputs)
model_attention.compile(optimizer='adam', loss='mse', metrics=['mae'])

history_attention = model_attention.fit(
    X_train, y_train,
    batch_size=32,
    epochs=20,
    validation_split=0.2,
    verbose=0
)

print(f"LSTM with Attention Test MAE: {model_attention.evaluate(X_test, y_test, verbose=0)[1]:.4f}")

## Part 5: Autoencoders

Autoencoders learn compressed representations by being trained to reconstruct their own input through a narrow bottleneck.

### Architecture:
- **Encoder**: Compress input to latent representation
- **Latent space**: Bottleneck layer
- **Decoder**: Reconstruct from latent representation

### Applications:
- Dimensionality reduction
- Denoising
- Anomaly detection
- Feature learning
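To make the anomaly detection use case concrete, the sketch below flags points with a high reconstruction error. It is not a neural autoencoder: as a stand-in it uses PCA, which finds the same subspace as an optimal linear autoencoder trained with MSE. The synthetic data, the helper names, and the 99th-percentile threshold are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data lies close to a 2-D subspace of a 10-D space (synthetic, for illustration)
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))
anomalies = rng.normal(size=(5, 10))      # points that ignore the subspace structure

# Stand-in for training: PCA gives the top-2 principal directions of the normal data
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
components = Vt[:2]                       # shape (latent_dim, input_dim)

def pca_encode(x):
    return (x - mean) @ components.T      # compress to the 2-D latent space

def pca_decode(z):
    return z @ components + mean          # reconstruct in the original space

def reconstruction_error(x):
    return np.mean((x - pca_decode(pca_encode(x))) ** 2, axis=1)

normal_err = reconstruction_error(normal)
anomaly_err = reconstruction_error(anomalies)

threshold = np.percentile(normal_err, 99)   # hypothetical rule: flag the top 1% of errors
print("anomalies flagged:", int(np.sum(anomaly_err > threshold)), "of", len(anomalies))
```

The same thresholding idea applies to a trained neural autoencoder: inputs unlike the training data tend to reconstruct poorly.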

In [None]:
# Load MNIST for autoencoder
(X_train_ae, _), (X_test_ae, _) = keras.datasets.mnist.load_data()
X_train_ae = X_train_ae.astype('float32') / 255.0
X_test_ae = X_test_ae.astype('float32') / 255.0
X_train_ae = X_train_ae.reshape(-1, 784)
X_test_ae = X_test_ae.reshape(-1, 784)

# Build autoencoder
latent_dim = 32

# Encoder
encoder_input = keras.Input(shape=(784,))
x = layers.Dense(128, activation='relu')(encoder_input)
x = layers.Dense(64, activation='relu')(x)
latent = layers.Dense(latent_dim, activation='relu', name='latent')(x)

encoder = keras.Model(encoder_input, latent, name='encoder')

# Decoder
decoder_input = keras.Input(shape=(latent_dim,))
x = layers.Dense(64, activation='relu')(decoder_input)
x = layers.Dense(128, activation='relu')(x)
decoder_output = layers.Dense(784, activation='sigmoid')(x)

decoder = keras.Model(decoder_input, decoder_output, name='decoder')

# Autoencoder
autoencoder_input = keras.Input(shape=(784,))
encoded = encoder(autoencoder_input)
decoded = decoder(encoded)

autoencoder = keras.Model(autoencoder_input, decoded, name='autoencoder')

autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()

In [None]:
# Train autoencoder
history_ae = autoencoder.fit(
    X_train_ae, X_train_ae,  # Input = Output for autoencoder
    batch_size=256,
    epochs=10,
    validation_split=0.1,
    verbose=1
)

# Visualize reconstructions
n_samples = 10
decoded_imgs = autoencoder.predict(X_test_ae[:n_samples], verbose=0)

fig, axes = plt.subplots(2, n_samples, figsize=(15, 4))

for i in range(n_samples):
    # Original
    axes[0, i].imshow(X_test_ae[i].reshape(28, 28), cmap='gray')
    axes[0, i].axis('off')
    if i == 0:
        axes[0, i].set_ylabel('Original', fontsize=12)
    
    # Reconstructed
    axes[1, i].imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    axes[1, i].axis('off')
    if i == 0:
        axes[1, i].set_ylabel('Reconstructed', fontsize=12)

plt.tight_layout()
plt.show()

print(f"Latent dimension: {latent_dim}")
print(f"Compression: {784} → {latent_dim} (x{784/latent_dim:.1f} smaller)")

## Part 6: Sequence-to-Sequence (Seq2Seq)

Seq2Seq models transform one sequence into another, possibly of a different length:

### Architecture:
1. **Encoder**: Processes input sequence → context vector
2. **Decoder**: Generates output sequence from context
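The data flow between the two components can be sketched in NumPy. The weights here are random and untrained, so the generated tokens are meaningless; the point is only to show how the encoder produces one context vector and how the decoder feeds each prediction back in as its next input. All names, sizes, and the greedy decoding rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10, 16

# Randomly initialized (untrained) weights, for illustration only
E = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))      # token embeddings
W_enc = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # encoder recurrence
W_dec = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # decoder recurrence
W_out = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # hidden state -> token scores

def encode_sequence(tokens):
    """Encoder: fold the whole input sequence into a single context vector."""
    h = np.zeros(hidden_dim)
    for tok in tokens:
        h = np.tanh(W_enc @ h + E[tok])
    return h

def decode_sequence(context, n_steps, start_token=0):
    """Decoder: start from the context and emit one token per step (greedy)."""
    h, tok, out = context, start_token, []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h + E[tok])
        tok = int(np.argmax(h @ W_out))   # the prediction becomes the next input
        out.append(tok)
    return out

source = [3, 1, 4, 1, 5]
context_vec = encode_sequence(source)
generated_tokens = decode_sequence(context_vec, n_steps=len(source))
print(context_vec.shape, len(generated_tokens))
```

Because the decoder only sees the input through `context_vec`, the output length does not have to match the input length.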

### Applications:
- Machine translation
- Text summarization
- Chatbots
- Speech recognition

In [None]:
# Simple seq2seq example: reverse sequences
def generate_seq2seq_data(n_samples=10000, seq_length=10):
    """
    Generate sequences where target is reversed input.
    """
    X = np.random.randint(0, 10, (n_samples, seq_length))
    y = np.flip(X, axis=1)
    return X, y

X_seq, y_seq = generate_seq2seq_data()
X_seq_train, X_seq_test = X_seq[:8000], X_seq[8000:]
y_seq_train, y_seq_test = y_seq[:8000], y_seq[8000:]

# One-hot encode
def one_hot_encode_seq(X, num_classes=10):
    return tf.keras.utils.to_categorical(X, num_classes)

X_seq_train_oh = one_hot_encode_seq(X_seq_train)
y_seq_train_oh = one_hot_encode_seq(y_seq_train)
X_seq_test_oh = one_hot_encode_seq(X_seq_test)
y_seq_test_oh = one_hot_encode_seq(y_seq_test)

print(f"Input shape: {X_seq_train_oh.shape}")
print(f"Output shape: {y_seq_train_oh.shape}")

In [None]:
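# Simple seq2seq model: the encoder reads the whole input before any output is
# produced, which reversal requires (the first output is the last input)
model_seq2seq = keras.Sequential([
    layers.LSTM(128, input_shape=(10, 10)),    # Encoder: sequence -> context vector
    layers.RepeatVector(10),                   # Feed the context to every decoder step
    layers.LSTM(128, return_sequences=True),   # Decoder: context -> output sequence
    layers.TimeDistributed(layers.Dense(10, activation='softmax'))
])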

model_seq2seq.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history_seq2seq = model_seq2seq.fit(
    X_seq_train_oh, y_seq_train_oh,
    batch_size=128,
    epochs=20,
    validation_split=0.2,
    verbose=0
)

# Evaluate (accuracy is averaged over individual time steps, not whole sequences)
test_loss, test_acc = model_seq2seq.evaluate(X_seq_test_oh, y_seq_test_oh, verbose=0)
print(f"\nPer-token accuracy: {test_acc:.4f}")

# Test predictions
sample_input = X_seq_test[0:5]
sample_pred = model_seq2seq.predict(X_seq_test_oh[0:5], verbose=0)
sample_pred = np.argmax(sample_pred, axis=-1)

print("\nExample predictions:")
for i in range(5):
    print(f"Input:  {sample_input[i]}")
    print(f"Target: {y_seq_test[i]}")
    print(f"Pred:   {sample_pred[i]}")
    print()

## Key Takeaways

1. **RNNs** process sequential data by carrying a hidden state from step to step
2. **LSTMs** mitigate the vanishing gradient problem with gates and a separate cell state
3. **GRUs** are a simpler alternative to LSTMs with fewer parameters
4. **Attention** lets a model weight the most relevant parts of its input
5. **Autoencoders** learn compressed representations
6. **Seq2Seq** models transform one sequence into another (translation, summarization)
7. **Transformers** (not covered) replace recurrence with self-attention
8. Choose the architecture based on task requirements

## When to Use What?

**LSTMs/GRUs:**
- Time series forecasting
- Text generation
- Speech recognition
- Video analysis

**Attention/Transformers:**
- Machine translation
- Question answering
- Document understanding
- Modern NLP tasks

**Autoencoders:**
- Dimensionality reduction
- Anomaly detection
- Denoising
- Feature learning

## Exercises

1. **Time Series**: Forecast stock prices using LSTM
2. **Sentiment Analysis**: Classify movie reviews with LSTM
3. **Attention Visualization**: Visualize attention weights
4. **Variational Autoencoder**: Implement VAE for generation
5. **Bidirectional LSTM**: Compare with unidirectional
6. **Custom Seq2Seq**: Build chatbot with seq2seq
7. **Transformer**: Implement simple transformer encoder

## Next Steps

You've completed Week 6! Continue with:
- Week 7: Language - Advanced NLP and Transformers
- Explore: GANs, Reinforcement Learning, Graph Neural Networks
- Read: the "Attention Is All You Need" paper
- Practice: Build projects on Hugging Face

Congratulations! You now have a comprehensive understanding of neural network architectures!