# Comprehensive NLG Tutorial: WebNLG, DART, and ToTTo for Aspiring Scientists

This Jupyter Notebook is a complete guide to understanding WebNLG, DART, and ToTTo, three key datasets in Natural Language Generation (NLG). Designed for beginners aiming to become scientists, it covers theory, practical code, visualizations, applications, research directions, and projects. We'll use simple language, analogies, and real-world examples to make concepts clear and note-friendly.

## Objectives
- Learn NLG and the role of WebNLG, DART, and ToTTo.
- Implement practical code with Hugging Face Transformers.
- Visualize dataset structures and model performance.
- Explore real-world applications and case studies.
- Identify research directions and project ideas.
- Understand future steps and essential topics for scientists.

## Prerequisites
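- Install Python packages: `pip install transformers torch pandas matplotlib seaborn datasets nltk sentencepiece accelerate` (`nltk` is used for BLEU, `sentencepiece` by `T5Tokenizer`, and `accelerate` by `Trainer`).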
- Basic Python knowledge (covered in resources like [Python-Knowledge-Hub](https://github.com/Nilesh-Choudhary/Python-Knowledge-Hub)).

## Structure
1. Theory of NLG and Datasets
2. Practical Code: Model Training and Inference
3. Visualizations
4. Applications and Case Studies
5. Mini and Major Project Ideas
6. Research Directions and Rare Insights
7. Future Directions and Tips
8. Missing Topics for Scientists
9. Conclusion

Let's begin!

## 1. Theory of NLG and Datasets

### What is NLG?
Natural Language Generation (NLG) is an AI process that transforms data into human-readable text. Think of it as a storyteller who takes raw facts (like a table of weather data) and crafts a sentence like, "It's sunny with a high of 75°F."

**Analogy**: NLG is like a chef turning ingredients (data) into a tasty dish (text). The ingredients must be chosen wisely (content selection) and prepared well (surface realization).

### Structured Data in NLG
- **Tables**: Rows and columns, e.g., a spreadsheet of names and ages.
- **RDF Triples**: Subject-predicate-object facts, e.g., `(Einstein, born_in, Germany)`.
- **Key-Value Pairs**: Simple mappings, e.g., "Temperature: 75°F."
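
To make the three input shapes concrete, here is a minimal sketch in plain Python. The variable names and the `linearize` helper are illustrative assumptions, not part of any dataset API; the flattening scheme simply joins fields with ` | `, mirroring the pipe-separated inputs used later in this notebook.

```python
# Three common shapes of structured input (all values are toy examples).
table = [
    {"Name": "Ada", "Age": 36},
    {"Name": "Alan", "Age": 41},
]
triples = [("Einstein", "born_in", "Germany")]
key_values = {"Temperature": "75°F"}


def linearize(fields):
    """Flatten a sequence of fields into one pipe-separated string (hypothetical helper)."""
    return " | ".join(str(field) for field in fields)


# Reduce each structure to a flat list of strings before linearizing.
table_fields = [f"{col}: {val}" for row in table for col, val in row.items()]
triple_fields = [part for triple in triples for part in triple]
kv_fields = [f"{key}: {value}" for key, value in key_values.items()]

table_text = linearize(table_fields)
triple_text = linearize(triple_fields)
kv_text = linearize(kv_fields)

print(table_text)
print(triple_text)
print(kv_text)
```

Whatever the source structure, a sequence-to-sequence model only ever sees a flat string, so the linearization scheme is a design choice worth keeping consistent between training and inference.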

### WebNLG
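- **Description**: Dataset of RDF triples from DBpedia paired with reference texts, covering 15 DBpedia categories in the original 2017 release (e.g., Astronaut, City), with further categories added in WebNLG 2020.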
- **Input**: 1–7 triples, e.g., `(Alan_Bean, occupation, Astronaut)` and `(Alan_Bean, birthPlace, Wheeler_Texas)`.
- **Output**: Text of one or more sentences covering every input triple, e.g., "Alan Bean is an astronaut born in Wheeler, Texas."
- **Challenge**: Generalizing to unseen domains.
- **Size**: ~35,000 examples (WebNLG 2020).
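
The generalization challenge is usually measured by splitting test examples according to whether their category appeared in training. The sketch below shows that bookkeeping on toy records; the `category` field name, the `split_seen_unseen` helper, and the sample values are assumptions for illustration, not the dataset's actual contents.

```python
# Toy records standing in for dataset examples (not real WebNLG data).
train_examples = [
    {"category": "Astronaut", "text": "Toy sentence about an astronaut."},
    {"category": "City", "text": "Toy sentence about a city."},
]
test_examples = [
    {"category": "Astronaut", "text": "Another toy astronaut sentence."},
    {"category": "Film", "text": "Toy sentence about a film."},
]


def split_seen_unseen(train, test):
    """Partition test examples by whether their category occurs in training."""
    seen_categories = {example["category"] for example in train}
    seen = [ex for ex in test if ex["category"] in seen_categories]
    unseen = [ex for ex in test if ex["category"] not in seen_categories]
    return seen, unseen


seen, unseen = split_seen_unseen(train_examples, test_examples)
print(len(seen), "seen /", len(unseen), "unseen")
```

Reporting scores separately on the two groups makes it visible when a model has memorized category-specific phrasing rather than learned to verbalize triples in general.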

### DART
- **Description**: Open-domain dataset combining WebNLG, Wikipedia tables, and more.
- **Input**: RDF triples or tree ontologies (hierarchical data).
- **Output**: Multi-sentence text, often complex.
- **Challenge**: Handling diverse, hierarchical inputs.
- **Size**: 82,191 examples.
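
As a rough intuition for how a hierarchical table can become triples, the sketch below walks a toy parent-to-child column tree and emits one `(parent value, child column, child value)` triple per edge. This is a simplified illustration under assumed names (`ontology`, `row`, `row_to_triples`) and toy data; the actual DART construction procedure is described in the DART paper and differs in detail.

```python
# Toy column hierarchy: "Team" is the parent of "City" and "Coach",
# and "Coach" is the parent of "Nationality". All values are made up.
ontology = {"Team": ["City", "Coach"], "Coach": ["Nationality"]}
row = {
    "Team": "Rovers",
    "City": "Springfield",
    "Coach": "J. Doe",
    "Nationality": "Canadian",
}


def row_to_triples(ontology, row):
    """Emit one triple per parent-child edge in the column tree."""
    triples = []
    for parent, children in ontology.items():
        for child in children:
            triples.append((row[parent], child, row[child]))
    return triples


dart_like_triples = row_to_triples(ontology, row)
for triple in dart_like_triples:
    print(triple)
```

The point of the hierarchy is that a nested attribute such as the coach's nationality attaches to the coach, not to the team, which a flat row-by-row reading would lose.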

### ToTTo
- **Description**: Table-to-text dataset from Wikipedia with highlighted cells.
- **Input**: Table with selected cells, e.g., `|Name|Occupation|` → `|Marie_Curie|Scientist|`.
- **Output**: Single sentence, e.g., "Marie Curie was a scientist."
- **Challenge**: Precise content selection.
- **Size**: ~120,000 examples.
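
Content selection in this setting amounts to reading only the highlighted cells and pairing each with its column header. A minimal sketch, assuming a simplified table stored as a header list plus rows, and highlights given as `(row, column)` index pairs; real ToTTo records carry richer cell metadata than this toy layout, and `select_highlighted` is a hypothetical helper.

```python
header = ["Name", "Occupation", "Born"]
rows = [
    ["Marie_Curie", "Scientist", "1867"],
    ["Alan_Bean", "Astronaut", "1932"],
]
# Highlight the Name and Occupation cells of the first row.
highlighted = [(0, 0), (0, 1)]


def select_highlighted(header, rows, highlighted):
    """Return (column header, cell value) pairs for the highlighted cells only."""
    return [(header[col], rows[row][col]) for row, col in highlighted]


selected = select_highlighted(header, rows, highlighted)
# Flatten into the same pipe-separated style used for model inputs in this notebook.
model_input = "table | " + " | ".join(f"{h} | {v}" for h, v in selected)
print(model_input)
```

Everything that is not highlighted (here, the `Born` column and the second row) never reaches the model, which is what makes the task one of controlled generation.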

**Analogy**: WebNLG is a librarian picking specific facts, DART is a novelist weaving diverse tales, ToTTo is an editor summarizing highlighted data.

## 2. Practical Code: Model Training and Inference

We'll use the T5 model from Hugging Face to generate text from WebNLG, DART, and ToTTo-like inputs. We'll also fine-tune a model on a sample dataset.

### Setup
Ensure you have the required libraries installed.

In [None]:
# !pip install transformers torch pandas matplotlib seaborn datasets
!pip install datasets

### Loading a Dataset
We'll use the `web_nlg` dataset from Hugging Face for demonstration.

In [2]:
from datasets import load_dataset

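# Load WebNLG dataset (English, release v3.0).
# `web_nlg` is served through a loading script, so trust_remote_code=True is
# needed on `datasets` releases that still run scripts (before 4.0).
dataset = load_dataset('web_nlg', 'release_v3.0_en', trust_remote_code=True)
print(dataset['train'][0])  # View a sample

**Note**: With `datasets` 4.0 or later, this cell stops with `RuntimeError: Dataset scripts are no longer supported, but found web_nlg.py`, because script-based loaders were removed. Installing an older release (for example `pip install "datasets<4"`) is one workaround.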

### Inference with Pre-trained T5
Let's generate text from a sample input using T5.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Initialize model and tokenizer
model_name = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Sample inputs for WebNLG, DART, ToTTo
inputs = [
    'generate text: Alan_Bean | occupation | Astronaut | birthPlace | Wheeler_Texas',  # WebNLG
    'generate text: Empire_State_Building | height | 443.2_meters | location | New_York_City',  # DART
    'generate text: table | Name | Marie_Curie | Occupation | Scientist',  # ToTTo
]

# Generate text
for input_text in inputs:
    inputs_tokenized = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    outputs = model.generate(**inputs_tokenized, max_length=50, num_beams=5)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f'Input: {input_text}\nOutput: {generated_text}\n')

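**Note**: The outputs may be rough, since `t5-small` isn't fine-tuned for data-to-text generation and `generate text:` is not one of the task prefixes it saw during pretraining. For better results, we’ll fine-tune below.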

### Fine-Tuning T5 on WebNLG
Let’s fine-tune T5 on a small subset of WebNLG.

In [None]:
from transformers import Trainer, TrainingArguments

# Prepare a small training dataset
train_data = dataset['train'].select(range(100))  # Use 100 examples

def preprocess(example):
    # Assumes the Hub schema for `web_nlg`: each triple is already a string
    # "subject | property | object" under modified_triple_sets -> mtriple_set.
    triples = ' | '.join(example['modified_triple_sets']['mtriple_set'][0])
    text = example['lex']['text'][0]  # First reference verbalization
    return {'input_text': f'generate text: {triples}', 'target_text': text}

train_data = train_data.map(preprocess)

# Tokenize dataset
def tokenize_function(examples):
    inputs = tokenizer(examples['input_text'], max_length=512, truncation=True, padding='max_length')
    targets = tokenizer(examples['target_text'], max_length=128, truncation=True, padding='max_length')
    # Replace padding ids with -100 so the loss ignores padded label positions
    inputs['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in targets['input_ids']
    ]
    return inputs

tokenized_data = train_data.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data,
)

# Train
trainer.train()

# Save model
model.save_pretrained('./fine_tuned_t5')
tokenizer.save_pretrained('./fine_tuned_t5')

**Note**: Fine-tuning on a full dataset requires more compute. This is a demo on 100 examples.
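
It helps to work out how many optimizer steps this demo actually runs. Assuming a single device and no gradient accumulation (both assumptions, since neither is set explicitly in the training arguments), the arithmetic is:

```python
import math

num_examples = 100   # size of the demo subset
batch_size = 4       # per_device_train_batch_size
num_epochs = 3       # num_train_epochs
save_steps = 10000   # checkpoint interval from the training arguments

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Number of step-based checkpoints reached before training ends.
checkpoints_during_training = total_steps // save_steps

print(steps_per_epoch, total_steps, checkpoints_during_training)
```

Under these assumptions the run is far shorter than the checkpoint interval, so the explicit `save_pretrained` calls are what persist the model.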

### Evaluating with BLEU
Let’s compute the BLEU score for a generated text.

In [None]:
from nltk.translate.bleu_score import sentence_bleu

reference = ['Alan Bean is an astronaut born in Wheeler, Texas.']
candidate = 'Alan Bean, an astronaut, was born in Wheeler, Texas.'
score = sentence_bleu([ref.split() for ref in reference], candidate.split())
print(f'BLEU Score: {score:.3f}')
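
To see what goes into that number, here is a from-scratch sketch of sentence-level BLEU for a single reference: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It uses the same whitespace tokenization as the cell above, applies no smoothing, and is meant as an illustration, not a replacement for NLTK's implementation; the helper names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram matches and total candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU against one reference; returns 0.0 if any precision is zero."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(reference, candidate, n)
        if clipped == 0 or total == 0:
            return 0.0
        log_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_sum / max_n)


ref_tokens = 'Alan Bean is an astronaut born in Wheeler, Texas.'.split()
cand_tokens = 'Alan Bean, an astronaut, was born in Wheeler, Texas.'.split()
precisions = [modified_precision(ref_tokens, cand_tokens, n) for n in range(1, 5)]
bleu = simple_bleu(ref_tokens, cand_tokens)
print(precisions, round(bleu, 3))
```

Note how punctuation attached to words (`Bean,` vs. `Bean`) counts as a mismatch under whitespace splitting, which lowers the score for a candidate that a human would judge as nearly equivalent.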

## 3. Visualizations

Let’s visualize dataset characteristics and model performance using Matplotlib and Seaborn, inspired by data visualization tutorials such as [Data Visualization Basics with Matplotlib](https://github.com/Geoffrey-lab/Data-Visualization-with-Python/blob/main/Data%20Visualization%20Basics%20with%20Matplotlib.ipynb) and the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.00-Introduction-To-Matplotlib.ipynb).

### Dataset Size Comparison

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset sizes
datasets = ['WebNLG', 'DART', 'ToTTo']
sizes = [35000, 82191, 120000]

plt.figure(figsize=(8, 5))
sns.barplot(x=datasets, y=sizes, palette='viridis')
plt.title('Dataset Size Comparison')
plt.ylabel('Number of Examples')
plt.show()

### Hypothetical Model Performance

In [None]:
# Hypothetical BLEU scores
bleu_scores = [65, 60, 70]

plt.figure(figsize=(8, 5))
sns.barplot(x=datasets, y=bleu_scores, palette='magma')
plt.title('Hypothetical BLEU Scores Across Datasets')
plt.ylabel('BLEU Score (%)')
plt.show()

## 4. Applications and Case Studies

### Applications
- **WebNLG**:
  - Automated journalism: Generating bios from DBpedia.
  - Chatbots: Answering factual queries.
  - E-commerce: Product descriptions.
- **DART**:
  - Travel apps: Destination guides from mixed data.
  - Education: Study guides from Wikipedia tables.
  - Business: Summarizing complex datasets.
- **ToTTo**:
  - Sports: Game summaries from stats tables.
  - Wikipedia: Auto-updating articles.
  - Dashboards: Text captions for charts.

### Case Studies
See the separate `Case_Studies.md` file for detailed examples, including:
- WebNLG in news automation.
- DART in travel apps.
- ToTTo in sports analytics.

## 5. Mini and Major Project Ideas

### Mini Projects
1. **WebNLG Text Generator**:
   - **Task**: Fine-tune T5 on WebNLG to generate text from 1–3 triples.
   - **Steps**: Load dataset, preprocess triples, train model, evaluate with BLEU.
   - **Outcome**: A model generating simple bios.
2. **DART Summarizer**:
   - **Task**: Create a script to summarize Wikipedia tables using DART.
   - **Steps**: Parse table data, use a pre-trained model, generate summaries.
   - **Outcome**: Summaries for educational content.
3. **ToTTo Sentence Generator**:
   - **Task**: Build a tool to generate sentences from highlighted table cells.
   - **Steps**: Simulate ToTTo data, fine-tune a model, test on sports tables.
   - **Outcome**: Sports game summaries.

### Major Projects
1. **Cross-Dataset NLG System**:
   - **Task**: Develop a unified model for WebNLG, DART, and ToTTo.
   - **Steps**: Combine datasets, fine-tune a large model (e.g., T5-base), evaluate generalization.
   - **Outcome**: A versatile NLG system for multiple data types.
2. **Interactive NLG Dashboard**:
   - **Task**: Create a web app to input triples/tables and generate text.
   - **Steps**: Use Flask/Streamlit, integrate a fine-tuned model, add visualizations.
   - **Outcome**: A tool for journalists or educators.
3. **New Evaluation Metric**:
   - **Task**: Design a metric beyond BLEU (e.g., incorporating semantic similarity).
   - **Steps**: Analyze limitations of BLEU, implement a new metric, test on datasets.
   - **Outcome**: A research paper contribution.

## 6. Research Directions and Rare Insights

### Research Directions
- **Generalization**: Improve models’ ability to handle unseen domains in WebNLG.
- **Coherence**: Enhance multi-sentence coherence in DART outputs.
- **Controlled Generation**: Optimize content selection in ToTTo for precision.
- **Multimodal NLG**: Combine text and visuals (e.g., generate captions for charts).
- **Ethical NLG**: Study biases in generated text (e.g., gender or cultural biases).

### Rare Insights
- **WebNLG**: Models often struggle with referring expressions (e.g., "he" vs. "Alan Bean") due to limited context. Research on coreference resolution could improve fluency.
- **DART**: Hierarchical inputs (tree ontologies) are underutilized. Exploring graph neural networks could unlock better performance.
- **ToTTo**: Highlighted cells oversimplify content selection. Real-world tables often lack such guidance, so studying implicit selection is a gap.

**Tip**: Read papers like "DART: Open-Domain Structured Data Record to Text Generation" (arXiv:2007.02871) for deeper insights.

## 7. Future Directions and Tips

### Future Directions
- **Large Language Models**: Leverage models like LLaMA or GPT-4 for NLG tasks.
- **Few-Shot Learning**: Train models with fewer examples for efficiency.
- **Multilingual NLG**: Extend datasets to non-English languages.
- **Real-Time NLG**: Develop systems for live data (e.g., sports scores).

### Tips for Scientists
- **Reproducibility**: Use Jupyter Notebooks for clear, shareable experiments. ([Jupyter Best Practices](https://github.com/chrisvoncsefalvay/jupyter-best-practices))
- **Version Control**: Store code on GitHub for collaboration.
- **Read Widely**: Follow arXiv and Papers With Code for NLG advancements.
- **Experiment**: Start with small datasets, then scale to full WebNLG/DART/ToTTo.
- **Network**: Join NLP communities on X or Reddit (e.g., r/MachineLearning).

## 8. Missing Topics for Scientists

The previous tutorial covered basics but missed some critical areas for researchers:
- **Evaluation Metrics Beyond BLEU**: BLEU focuses on n-gram overlap but ignores semantics, and ROUGE is likewise overlap-based (recall-oriented). Explore learned metrics such as BLEURT or BERTScore for semantic similarity.
- **Data Preprocessing**: Real-world data is messy. Learn to clean and normalize triples/tables.
- **Model Interpretability**: Understand why models generate specific outputs (e.g., attention mechanisms).
- **Ethical Considerations**: Address biases in training data (e.g., DBpedia’s coverage of certain demographics).
- **Scalability**: Handling large datasets requires efficient data pipelines and compute resources.
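
As one small example of the preprocessing step, the sketch below normalizes a DBpedia-style triple: underscores become spaces, camelCase property names are split, and stray quotes are stripped. The function names and the exact rules are illustrative assumptions; real pipelines usually need dataset-specific handling.

```python
import re


def split_camel_case(name):
    """Insert a space before each interior capital, e.g. 'birthPlace' -> 'birth Place'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


def clean(value):
    """Replace underscores with spaces and strip surrounding whitespace and quotes."""
    return value.replace("_", " ").strip().strip('"').strip()


def normalize_triple(subject, prop, obj):
    """Return a human-readable (subject, property, object) tuple."""
    return clean(subject), split_camel_case(clean(prop)).lower(), clean(obj)


normalized = normalize_triple("Alan_Bean", "birthPlace", '"Wheeler,_Texas"')
print(normalized)
```

Whether to normalize at all is a design choice: readable inputs sit closer to the language a pretrained model has seen, but the same rules must then be applied identically at training and inference time.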

## 9. Conclusion

This notebook provides a foundation for mastering WebNLG, DART, and ToTTo. You’ve learned theory, coded a model, visualized data, explored applications, and identified research paths. As a scientist, continue experimenting, reading, and collaborating to push NLG boundaries.

**Next Steps**:
1. Download datasets from Hugging Face (`web_nlg`, `dart`, `totto`).
2. Fine-tune a larger model (e.g., T5-base).
3. Propose a research question, e.g., "How can we improve coherence in DART?"
4. Share your work on GitHub or X to build your research profile.

Happy researching!