# Transformer-based NLP with Neural DSL

This tutorial shows how to build a transformer encoder for text classification with Neural DSL, train it, and set up an LSTM baseline for comparison.

## Overview
- Build a transformer encoder for text classification
- Understand multi-head attention
- Train on text datasets
- Compare with LSTM models

## Setup

In [None]:
import os
import sys
import numpy as np
import matplotlib.pyplot as plt

from neural.parser.parser import create_parser, ModelTransformer
from neural.code_generation.code_generator import generate_code

## Define the Transformer Model

In [None]:
dsl_code = """
network TransformerNLP {
  input: (None, 512)
  
  layers:
    Embedding(input_dim=30000, output_dim=256)
    Dropout(rate=0.1)
    TransformerEncoder(num_heads=8, ff_dim=512, dropout=0.1)
    TransformerEncoder(num_heads=8, ff_dim=512, dropout=0.1)
    TransformerEncoder(num_heads=8, ff_dim=512, dropout=0.1)
    GlobalAveragePooling1D()
    Dense(units=256, activation="relu")
    Dropout(rate=0.3)
    Dense(units=128, activation="relu")
    Dropout(rate=0.3)
    Output(units=10, activation="softmax")

  loss: "sparse_categorical_crossentropy"
  optimizer: Adam(learning_rate=0.0001)
  metrics: ["accuracy"]

  train {
    epochs: 30
    batch_size: 32
    validation_split: 0.15
  }
}
"""

with open('transformer_nlp.neural', 'w') as f:
    f.write(dsl_code)

print("Transformer model defined!")
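To get a feel for the model's size, the layer list above can be turned into a rough parameter count. This is a back-of-the-envelope sketch, not output from Neural DSL. It assumes each `TransformerEncoder` is a standard encoder block with query, key, value, and output projections of width 256, a two-layer feed-forward network with `ff_dim=512`, and two layer norms. The code that `neural compile` actually generates may differ.

```python
# Rough parameter estimate for TransformerNLP.
# Assumption: a standard encoder block; the generated code may differ.

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def encoder_block_params(d_model, ff_dim):
    """Parameters of one encoder block under the assumptions stated above."""
    attention = 4 * dense_params(d_model, d_model)  # Q, K, V and output projections
    feed_forward = dense_params(d_model, ff_dim) + dense_params(ff_dim, d_model)
    layer_norms = 2 * (2 * d_model)                 # scale and shift for two norms
    return attention + feed_forward + layer_norms

vocab_size, d_model, ff_dim, num_blocks, num_classes = 30000, 256, 512, 3, 10

embedding = vocab_size * d_model
encoders = num_blocks * encoder_block_params(d_model, ff_dim)
head = (dense_params(d_model, 256) + dense_params(256, 128)
        + dense_params(128, num_classes))
total = embedding + encoders + head

print(f"Embedding:      {embedding:>10,}")
print(f"Encoder blocks: {encoders:>10,}")
print(f"Dense head:     {head:>10,}")
print(f"Total:          {total:>10,}")
```

Under these assumptions the embedding table holds about 82% of the weights, so vocabulary size and embedding width affect model size far more than the number of encoder blocks.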

## Understanding Transformers

The transformer architecture uses:
- **Multi-head attention**: lets every token attend to every other position, with each head learning its own weighting
- **Feed-forward networks**: transform each position independently after attention
- **Layer normalization**: stabilizes training
- **Positional encoding**: injects word-order information, which attention alone does not capture
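The multi-head attention component can be sketched in a few lines of NumPy. This is a minimal illustration, not the layer Neural DSL generates: biases, masking, and dropout are omitted, and the toy sizes and random weight matrices below are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Scaled dot-product scores, one (seq_len, seq_len) matrix per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v                      # (num_heads, seq_len, head_dim)
    # Concatenate the heads and apply the output projection
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

# Toy sizes chosen for illustration only
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape)   # (6, 16)
print(weights.shape)  # (8, 6, 6)
```

Each row of `weights[h]` sums to 1 and says how much one query position draws from every key position in head `h`.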

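Key advantages over RNNs:
- Parallel processing: all positions in a sequence are processed at once, which usually speeds up training
- Better handling of long-range dependencies: any two tokens are connected by a single attention step
- Attention weights that can be inspected directly, which can aid interpretation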
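Parallel processing has a cost: attention on its own ignores token order, which is why positional encoding appears in the component list. Below is a minimal sketch of the fixed sinusoidal encoding from the original transformer paper. This tutorial does not state whether Neural DSL's `TransformerEncoder` adds positional information itself, so treat the code as an illustration of the idea rather than of the generated model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position codes.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

# Sizes match the model above: 512 positions, embedding width 256
pe = sinusoidal_positional_encoding(seq_len=512, d_model=256)
print(pe.shape)  # (512, 256)
```

The encoding is added element-wise to the token embeddings before the first encoder block, so it must have the same shape as one embedded sequence.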

## Compile the Model

In [None]:
!neural compile transformer_nlp.neural --backend tensorflow --output transformer_nlp_tf.py
print("Model compiled!")
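Note that the `!` shell escape does not stop the cell when the command fails, so the success message prints either way. A small wrapper around `subprocess.run` can check the exit code and the output file. This is a sketch: `run_and_check` is a hypothetical helper, not part of Neural DSL, and the command is the same one used in the cell above.

```python
import subprocess
from pathlib import Path

def run_and_check(command, expected_output=None):
    """Return True only if the command exits with 0 and the expected file exists."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"Executable not found: {command[0]}")
        return False
    if result.returncode != 0:
        print(f"Command failed with exit code {result.returncode}:\n{result.stderr}")
        return False
    if expected_output is not None and not Path(expected_output).exists():
        print(f"Command succeeded but {expected_output} was not created")
        return False
    return True

# Same command as the cell above, passed as an argument list
compile_cmd = ["neural", "compile", "transformer_nlp.neural",
               "--backend", "tensorflow", "--output", "transformer_nlp_tf.py"]

if run_and_check(compile_cmd, expected_output="transformer_nlp_tf.py"):
    print("Model compiled!")
```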

## Visualize Architecture

In [None]:
!neural visualize transformer_nlp.neural --format html
print("Visualization generated!")

## Prepare Data

For this example, we'll use the IMDB movie review dataset bundled with Keras, a binary sentiment classification task.

In [None]:
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.preprocessing import sequence
    
    # Load dataset (example with IMDB, can be replaced)
    max_features = 30000
    maxlen = 512
    
    print("Loading dataset...")
    (x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(
        num_words=max_features
    )
    
    # IMDB is binary (labels 0 and 1). The 10-unit softmax output still
    # trains with sparse_categorical_crossentropy, but a multi-class
    # dataset is needed to exercise all 10 classes.
    
    print("Padding sequences...")
    x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
    x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
    
    print(f"Training data shape: {x_train.shape}")
    print(f"Test data shape: {x_test.shape}")
    
except ImportError:
    print("TensorFlow not installed")
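With its default arguments, `pad_sequences` pads at the front and also truncates from the front, so long reviews keep their last `maxlen` tokens. The pure-Python function below mimics that default behavior for a single sequence. `pad_pre` is a hypothetical name used only for this illustration, and it assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen tokens, then left-pad with value up to maxlen."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 8, 3], maxlen=5))              # [0, 0, 5, 8, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Index 0 is the padding value here, which is also how the IMDB word index reserves it.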

## Train the Model

In [None]:
# Train using CLI
!neural run transformer_nlp_tf.py --backend tensorflow
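The `train` block in the DSL determines how much work each epoch does. The arithmetic below is a sketch that assumes the generated script passes `epochs`, `batch_size`, and `validation_split` through to Keras `fit`, which holds out the last fraction of the training samples for validation.

```python
import math

num_samples = 25000       # size of the IMDB training set
validation_split = 0.15
batch_size = 32
epochs = 30

num_val = round(num_samples * validation_split)
num_train = num_samples - num_val
steps_per_epoch = math.ceil(num_train / batch_size)

print(f"Training samples:   {num_train}")
print(f"Validation samples: {num_val}")
print(f"Steps per epoch:    {steps_per_epoch}")
print(f"Total steps:        {steps_per_epoch * epochs}")
```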

## Compare with LSTM

In [None]:
# Create LSTM model for comparison
lstm_code = """
network LSTMBaseline {
  input: (None, 512)
  
  layers:
    Embedding(input_dim=30000, output_dim=256)
    LSTM(units=256, return_sequences=True, dropout=0.2)
    LSTM(units=128, dropout=0.2)
    Dense(units=128, activation="relu")
    Dropout(rate=0.3)
    Output(units=10, activation="softmax")

  loss: "sparse_categorical_crossentropy"
  optimizer: Adam(learning_rate=0.001)
  metrics: ["accuracy"]

  train {
    epochs: 30
    batch_size: 32
    validation_split: 0.15
  }
}
"""

with open('lstm_baseline.neural', 'w') as f:
    f.write(lstm_code)

print("LSTM baseline model created for comparison")
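When comparing the two models it helps to know how large the baseline is. The sketch below counts the parameters of `LSTMBaseline` using the standard LSTM formula (four weight sets, each with input weights, recurrent weights, and a bias). It assumes the generated code uses standard LSTM layers without extra projections.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    """Input, forget, and output gates plus the cell candidate."""
    return 4 * (units * (n_in + units) + units)

embedding = 30000 * 256
lstm_1 = lstm_params(256, 256)   # LSTM(units=256) on 256-dim embeddings
lstm_2 = lstm_params(256, 128)   # LSTM(units=128) on the first LSTM's output
head = dense_params(128, 128) + dense_params(128, 10)
total = embedding + lstm_1 + lstm_2 + head

print(f"Embedding: {embedding:>10,}")
print(f"LSTM 1:    {lstm_1:>10,}")
print(f"LSTM 2:    {lstm_2:>10,}")
print(f"Head:      {head:>10,}")
print(f"Total:     {total:>10,}")
```

As in most text models with a 30,000-word vocabulary, the embedding table dominates: it accounts for roughly 91% of the baseline's weights.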

## Visualize Attention Patterns

In [None]:
# Heatmap helper for attention weights.
# Extracting the weights from the trained model is not shown here.
# Requires seaborn (pip install seaborn).
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attention_weights, tokens):
    """Plot a 2-D (query x key) attention matrix as a heatmap."""
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention_weights, xticklabels=tokens, yticklabels=tokens,
                cmap='viridis', cbar=True)
    plt.title('Attention Weights')
    plt.xlabel('Key')
    plt.ylabel('Query')
    plt.show()

print("Attention visualization helper defined (requires attention weights from the model)")
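A heatmap helper such as `visualize_attention` expects a single 2-D matrix, while a multi-head layer produces one matrix per head. In Keras, for instance, `MultiHeadAttention` returns scores of shape `(batch, heads, query_len, key_len)` when called with `return_attention_scores=True`. Whether the code generated by Neural DSL exposes its attention layers this way is an assumption to verify. The sketch below reduces such a tensor to one matrix by averaging over heads, using random stand-in weights rather than values from a trained model.

```python
import numpy as np

def head_average(attention, sample=0):
    """Reduce (batch, heads, query, key) scores to one (query, key) matrix."""
    attention = np.asarray(attention)
    if attention.ndim != 4:
        raise ValueError("expected shape (batch, heads, query, key)")
    return attention[sample].mean(axis=0)

# Stand-in for extracted weights: random rows normalized to sum to 1
rng = np.random.default_rng(0)
raw = rng.random((2, 8, 4, 4))
attention = raw / raw.sum(axis=-1, keepdims=True)

tokens = ["the", "movie", "was", "great"]
matrix = head_average(attention)
print(matrix.shape)  # (4, 4)
```

Averaging hides differences between heads, so plotting `attention[sample, h]` for individual heads is often more informative.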

## Performance Analysis

In [None]:
# Compare training time and accuracy.
# Fill in `results` with measured values, then call plot_comparison(results):
# results = {
#     'Transformer': {'accuracy': 0.XX, 'time': XXX},
#     'LSTM': {'accuracy': 0.XX, 'time': XXX}
# }

def plot_comparison(results):
    """Draw bar charts of accuracy and training time for each model."""
    models = list(results.keys())
    accuracies = [results[m]['accuracy'] for m in models]
    times = [results[m]['time'] for m in models]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    ax1.bar(models, accuracies)
    ax1.set_ylabel('Accuracy')
    ax1.set_title('Model Accuracy Comparison')

    ax2.bar(models, times)
    ax2.set_ylabel('Training Time (s)')
    ax2.set_title('Training Time Comparison')

    plt.tight_layout()
    plt.show()

print("Performance comparison template ready")
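The `time` entries in the template have to come from somewhere. One option, sketched here, is a small wrapper that measures the wall-clock time of any callable. `timed` is a hypothetical helper, and the workload below is a stand-in rather than real training.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload for illustration; in practice this would wrap a
# training call, for example: history, seconds = timed(model.fit, x_train, y_train)
value, seconds = timed(sum, range(1000))
print(f"Result: {value}, elapsed: {seconds:.6f} s")
```

For a fair comparison, time both models on the same hardware with the same batch size and number of epochs.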

## Hyperparameter Tuning

In [None]:
# Optimize hyperparameters
!neural compile transformer_nlp.neural --backend tensorflow --hpo
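To make the idea of hyperparameter search concrete, here is a generic random search loop in plain Python. It is not a description of what `--hpo` does internally. The search space and `placeholder_objective` are made up for illustration, and a real objective would train the model and return its validation accuracy.

```python
import random

def sample_config(rng):
    """Draw one configuration from a hypothetical search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -3),  # log-uniform in [1e-5, 1e-3]
        "dropout": rng.choice([0.1, 0.2, 0.3]),
        "num_heads": rng.choice([4, 8]),
    }

def random_search(objective, num_trials, seed=0):
    """Evaluate num_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def placeholder_objective(config):
    # Stand-in only: replace with a function that trains the model
    # and returns validation accuracy.
    return -abs(config["dropout"] - 0.1)

best_config, best_score = random_search(placeholder_objective, num_trials=20)
print(best_config, best_score)
```

Sampling the learning rate on a log scale covers several orders of magnitude evenly, which matters because useful values for transformers often sit near the small end of the range.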

## Debug with NeuralDbg

In [None]:
print("To debug, run:")
print("neural debug transformer_nlp.neural --backend tensorflow --dashboard --port 8050")

## Export to Production

In [None]:
# Export to different formats
!neural compile transformer_nlp.neural --backend pytorch --output transformer_pytorch.py
!neural compile transformer_nlp.neural --backend onnx --output transformer.onnx

## Summary

In this tutorial, we:
1. Built a transformer-based NLP model in Neural DSL
2. Reviewed multi-head attention and the other transformer components
3. Defined an LSTM baseline for comparison
4. Set up templates for attention visualization and performance analysis

## Next Steps
- Build encoder-decoder transformers
- Implement BERT-style pre-training
- Try GPT-style autoregressive models
- Fine-tune pre-trained transformers
- Apply to machine translation