# HuggingFace Datasets Integration Example

This notebook demonstrates how to use HuggingFace datasets with the PyTorch sequence models in this repository.

## 1. Import Required Libraries

In [None]:
import torch
import numpy as np
from datasets import load_dataset
from torch.utils.data import DataLoader
from dataset.huggingface_dataset import (
    HuggingFaceDatasetAdapter,
    HuggingFaceSequenceClassificationDataset
)
from models.transformer import PyTorchTransformerEncoder
from models.embedding import EmbeddingType

## 2. Using HuggingFace Datasets with the Adapter

The `HuggingFaceDatasetAdapter` wraps a HuggingFace dataset so it can be used with the models in this repository. You tell it which column holds the inputs and which column holds the targets.

In [None]:
# Example: wrap a HuggingFace Dataset with the adapter.
# Datasets from the HuggingFace Hub usually need to be tokenized first,
# so this example builds a small synthetic dataset in memory instead.

from datasets import Dataset

# Create a simple example dataset
data = {
    'input_ids': [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]],
    'labels': [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
}
hf_dataset = Dataset.from_dict(data)

# Wrap it with our adapter
adapted_dataset = HuggingFaceDatasetAdapter(
    hf_dataset=hf_dataset,
    input_column='input_ids',
    target_column='labels'
)

# Create a DataLoader
dataloader = DataLoader(adapted_dataset, batch_size=2, shuffle=True)

# Test the dataloader
for batch in dataloader:
    inputs, targets = batch
    print(f"Input shape: {inputs.shape}")
    print(f"Target shape: {targets.shape}")
    print(f"Inputs: {inputs}")
    print(f"Targets: {targets}")
    break
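To make the adapter's role concrete, here is a minimal sketch of the contract a map-style dataset has to satisfy for `DataLoader`: it needs `__len__` and `__getitem__`. The class `ColumnPairAdapter` is hypothetical and written only for illustration; it is not the repository's `HuggingFaceDatasetAdapter`, whose implementation is not shown in this notebook. It returns plain Python lists, whereas the real adapter presumably returns tensors, since the cell above prints `.shape`.

```python
# Hypothetical stand-in for illustration; NOT the repository's HuggingFaceDatasetAdapter.
class ColumnPairAdapter:
    """Expose two columns of a row-indexable dataset as (input, target) pairs."""

    def __init__(self, rows, input_column, target_column):
        self.rows = rows
        self.input_column = input_column
        self.target_column = target_column

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        row = self.rows[index]  # integer indexing into a HuggingFace Dataset also yields a dict
        return row[self.input_column], row[self.target_column]


# A list of dicts mimics integer indexing into a HuggingFace Dataset.
rows = [
    {"input_ids": [1, 2, 3, 4, 5], "labels": [1, 2, 3, 4, 5]},
    {"input_ids": [6, 7, 8, 9, 10], "labels": [6, 7, 8, 9, 10]},
]
pairs = ColumnPairAdapter(rows, input_column="input_ids", target_column="labels")
first_input, first_target = pairs[0]
print(len(pairs), first_input, first_target)
```

Because the sketch only relies on `len(...)` and integer indexing, the same class would work on any object that returns a dict per row.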

## 3. Creating a Custom Text Classification Dataset

The `HuggingFaceSequenceClassificationDataset` class allows you to create custom datasets for text classification that are compatible with both PyTorch and HuggingFace.

In [None]:
# Create synthetic data for demonstration
num_samples = 100
vocab_size = 1000
max_length = 50
num_classes = 5

# Generate random sequences (in practice, these would be tokenized text)
sequences = [np.random.randint(1, vocab_size, size=np.random.randint(10, max_length)).tolist() 
             for _ in range(num_samples)]
labels = np.random.randint(0, num_classes, size=num_samples).tolist()

# Create the dataset
classification_dataset = HuggingFaceSequenceClassificationDataset(
    sequences=sequences,
    labels=labels,
    max_length=max_length,
    pad_token_id=0
)

print(f"Dataset size: {len(classification_dataset)}")
print(f"Sample item: {classification_dataset[0]}")
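The `max_length` and `pad_token_id` arguments suggest that each sequence is brought to a fixed length before batching. The sketch below shows one common way to do that in plain Python. It assumes right-padding and truncation from the end; the helper names `pad_or_truncate` and `attention_mask` are hypothetical, and the class in this repository may behave differently.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Return token_ids cut or right-padded to exactly max_length items."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_token_id] * (max_length - len(clipped))


def attention_mask(padded_ids, pad_token_id=0):
    """Mark real tokens with 1 and padding positions with 0."""
    return [0 if token == pad_token_id else 1 for token in padded_ids]


short = pad_or_truncate([7, 8, 9], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)                  # [7, 8, 9, 0, 0]
print(long)                   # [1, 2, 3, 4, 5]
print(attention_mask(short))  # [1, 1, 1, 0, 0]
```

The synthetic sequences above draw token IDs from `1` to `vocab_size - 1`, so `0` never collides with a real token and is safe to use as the padding ID.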

## 4. Convert Custom Dataset to HuggingFace Format

You can convert your custom dataset to a HuggingFace Dataset object for compatibility with HuggingFace tools.

In [None]:
# Convert to HuggingFace Dataset
hf_dataset_from_custom = classification_dataset.to_huggingface_dataset()

print(f"HuggingFace Dataset: {hf_dataset_from_custom}")
print(f"First item: {hf_dataset_from_custom[0]}")

## 5. Using the Dataset with a Transformer Encoder Model

Now let's use the custom dataset with one of the models from this repository.

In [None]:
# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Set hyperparameters
embedding_dim = 64
num_layers = 2
heads = 4
batch_size = 16

# Create model
model = PyTorchTransformerEncoder(
    embedding_type=EmbeddingType.POS_LEARNED,
    src_vocab_size=vocab_size,
    trg_vocab_size=num_classes,
    embedding_dim=embedding_dim,
    num_layers=num_layers,
    heads=heads,
    dropout=0.1,
    device=device,
    max_length=max_length
).to(device)

print(f"Model created with {sum(p.numel() for p in model.parameters())} parameters")

In [None]:
# Create DataLoader
train_loader = DataLoader(
    classification_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True
)

# Test forward pass
model.eval()
with torch.no_grad():
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        
        # Forward pass
        output = model(input_ids)
        
        print(f"Input shape: {input_ids.shape}")
        print(f"Output shape: {output.shape}")
        print(f"Labels shape: {labels.shape}")
        break
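The `DataLoader` above is created with `drop_last=True`, so the final incomplete batch is discarded. A quick calculation with the values used in this notebook shows how many batches and samples that leaves per epoch:

```python
import math

num_samples = 100
batch_size = 16

# With drop_last=True the trailing incomplete batch is discarded.
batches_with_drop_last = num_samples // batch_size
# With drop_last=False a final, smaller batch would be kept.
batches_without_drop_last = math.ceil(num_samples / batch_size)
dropped_samples = num_samples - batches_with_drop_last * batch_size

print(batches_with_drop_last, batches_without_drop_last, dropped_samples)  # 6 7 4
```

Since `shuffle=True` reorders the data every epoch, the four skipped samples differ from one epoch to the next.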

## 6. Example: Loading a Real HuggingFace Dataset

Here's how you might use a real dataset from the HuggingFace Hub. The cell is commented out because it downloads the dataset and a tokenizer and requires the `transformers` package.

In [None]:
# Example with a real dataset (uncomment to use)
# Note: the raw text must be tokenized before it is passed to the adapter

# from datasets import load_dataset
# from transformers import AutoTokenizer

# # Load dataset
# dataset = load_dataset("imdb", split="train[:100]")

# # Load tokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# # Tokenize the dataset
# def tokenize_function(examples):
#     return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# tokenized_dataset = dataset.map(tokenize_function, batched=True)

# # Use the adapter
# adapted_dataset = HuggingFaceDatasetAdapter(
#     hf_dataset=tokenized_dataset,
#     input_column='input_ids',
#     target_column='label'
# )

# # Create DataLoader
# dataloader = DataLoader(adapted_dataset, batch_size=8, shuffle=True)
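One thing to check before feeding such data to the model from section 5: the `bert-base-uncased` vocabulary has 30,522 entries and the cell above pads to 512 tokens, while that model was built with `vocab_size = 1000` and `max_length = 50`. The sketch below is a plain-Python sanity check for this kind of mismatch. The helper `find_incompatible_rows` is hypothetical, and it assumes the model cannot embed token IDs at or above `vocab_size` or sequences longer than `max_length`.

```python
def find_incompatible_rows(rows_of_ids, vocab_size, max_length):
    """Return (row_index, reason) for each row that breaks the assumed limits."""
    problems = []
    for index, ids in enumerate(rows_of_ids):
        if len(ids) > max_length:
            problems.append((index, "too long"))
        elif any(token < 0 or token >= vocab_size for token in ids):
            problems.append((index, "token id out of range"))
    return problems


# Made-up rows: one with large token IDs, one valid, one too long.
rows = [[101, 7592, 102], [1, 2, 3], list(range(60))]
print(find_incompatible_rows(rows, vocab_size=1000, max_length=50))
# [(0, 'token id out of range'), (2, 'too long')]
```

In practice you would either build the model with the tokenizer's vocabulary size and a matching `max_length`, or tokenize with a smaller `max_length` so the data fits the model.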

## Summary

This notebook demonstrated:
1. How to use the `HuggingFaceDatasetAdapter` to wrap HuggingFace datasets for use with models in this repository
2. How to create custom text classification datasets using `HuggingFaceSequenceClassificationDataset`
3. How to convert custom datasets to HuggingFace format
4. How to use these datasets with the Transformer models in the repository

The integration allows you to leverage the extensive HuggingFace datasets ecosystem while using the sequence models provided in this repository.