## 7. Next Steps

This notebook demonstrated the core functionality of the BERT text summarization tool. Here's what you can do next:

### Command Line Usage
```bash
# Run the CLI tool
python cli.py extractive --text "Your text..." --num-sentences 3

# Interactive mode
python cli.py interactive
```

### Web Interface
```bash
# Start the web app
python web_app.py
# Open browser to http://localhost:5000
```

### Testing and Evaluation
```bash
# Run comprehensive tests
python test_summarization.py
```

### Further Exploration
- Try different BERT models (DistilBERT for speed, BERT-large for quality)
- Experiment with abstractive summarization using the HybridSummarizer
- Fine-tune on domain-specific data
- Integrate with your own applications using the Python API

Happy summarizing! 🎉

In [None]:
# Advanced techniques and tips

print("💡 Performance Tips:")
print("=" * 30)
print("1. 🚀 First run may be slow due to model downloading")
print("2. ⚡ Subsequent runs are much faster (models cached)")
print("3. 📏 Shorter texts process faster")
print("4. 🧠 Consider DistilBERT for speed:")
print("   summarizer = BERTExtractiveSummarizer('distilbert-base-uncased')")
print("5. 🖥️  GPU acceleration available if CUDA installed")

print("\n🔧 Customization Options:")
print("=" * 30)
print("• num_sentences: Control summary length")
print("• max_sentence_length: Filter very long sentences")
print("• preprocess: Clean HTML, URLs, etc.")
print("• Different BERT models for different domains")

print("\n📊 Quality Guidelines:")
print("=" * 30)
print("• Input should be >100 words for best results")
print("• Use preprocessing for web content")
print("• Adjust sentence count based on document length")
print("• Compare ROUGE scores for evaluation")

# Example with custom parameters
print("\n🎯 Custom Configuration Example:")
print("-" * 40)

custom_summary = summarizer.extractive_summarize(
    sample_text,
    num_sentences=4,
    max_sentence_length=100  # Prefer shorter sentences
)

print(f"📋 Custom summary (4 sentences, max 100 words each):")
print(custom_summary)

## 6. Advanced Usage and Tips

Here are some advanced techniques and optimization tips for better summarization results.
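
One of the tips in this section is to adjust the sentence count to the document length. A minimal sketch of such a heuristic follows. The helper `pick_num_sentences` and its defaults (`ratio`, `minimum`, `maximum`) are hypothetical and not part of the tool; the sentence splitter is deliberately naive.

```python
import re


def pick_num_sentences(text, ratio=0.2, minimum=1, maximum=5):
    """Keep roughly `ratio` of the sentences, clamped to [minimum, maximum]."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return max(minimum, min(maximum, round(len(sentences) * ratio)))
```

The result could then be passed as `num_sentences` to `summarizer.extractive_summarize`, for example `num_sentences=pick_num_sentences(document)`.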

In [None]:
# Load sample dataset
dataset_loader = DatasetLoader()
sample_data = dataset_loader.load_cnn_dailymail_sample()

print(f"📂 Loaded {len(sample_data)} sample documents")

# Test each document
for i, data_point in enumerate(sample_data):
    print(f"\n📄 Document {i+1}: {data_point['title']}")
    print("-" * 50)
    
    document = data_point['document']
    reference = data_point['summary']
    
    # Generate summary
    start_time = time.time()
    generated = summarizer.extractive_summarize(document, num_sentences=2)
    processing_time = time.time() - start_time
    
    # Evaluate
    evaluation = evaluator.evaluate_summary(document, reference, generated)
    
    print(f"📊 Processing time: {processing_time:.3f}s")
    print(f"📏 Original: {len(document.split())} words")
    print(f"📏 Generated: {len(generated.split())} words")
    print(f"📈 ROUGE-1 F1: {evaluation['rouge1_f1']:.4f}")
    print(f"📈 Compression: {evaluation['compression_ratio']:.3f}")
    print(f"\n📚 Reference: {reference}")
    print(f"📋 Generated: {generated}")

## 5. Testing with Real Data

Let's test the summarizer on a small set of CNN/DailyMail sample documents and compare each generated summary with its reference.

In [None]:
# Initialize evaluator
evaluator = SummarizationEvaluator()

# Generate a summary to evaluate
generated_summary = summarizer.extractive_summarize(sample_text, num_sentences=2)

# Reference summary (what we consider a good summary)
reference_summary = """
Artificial intelligence is intelligence demonstrated by machines that perceive their environment 
and take actions to maximize their goals. The AI field includes reasoning, learning, natural 
language processing, and draws upon computer science, mathematics, psychology, and other fields.
"""

print("📋 Generated Summary:")
print(generated_summary)
print("\n📚 Reference Summary:")
print(reference_summary)

# Calculate ROUGE scores
scores = evaluator.calculate_rouge_scores(reference_summary, generated_summary)

print("\n📊 ROUGE Scores:")
for metric, score in scores.items():
    print(f"  {metric}: {score:.4f}")

# Calculate compression ratio
compression_ratio = evaluator.calculate_compression_ratio(sample_text, generated_summary)
print(f"\n📈 Compression Ratio: {compression_ratio:.3f}")
print(f"📏 Original: {len(sample_text.split())} words → Summary: {len(generated_summary.split())} words")

## 4. Evaluation and Metrics

Let's evaluate the quality of our summaries using ROUGE scores and the compression ratio.
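
To make these metrics concrete, here is a minimal standard-library sketch of ROUGE-1 (unigram overlap) and a word-level compression ratio. It is an illustration only, not the `SummarizationEvaluator` implementation: the function names are hypothetical, the tokenizer is simplistic, and the compression ratio is assumed here to mean summary length divided by original length.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, candidate):
    """ROUGE-1 precision, recall and F1 from clipped unigram counts."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # each token counted at most min(ref, cand) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compression_ratio(original, summary):
    """Summary word count divided by original word count (assumed definition)."""
    n = len(original.split())
    return len(summary.split()) / n if n else 0.0


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print({name: round(value, 4) for name, value in scores.items()})
```

Recall measures how much of the reference the summary covers, precision how much of the summary appears in the reference; F1 balances the two.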

In [None]:
# Initialize preprocessor and test with messy text
preprocessor = TextPreprocessor()

messy_text = """
<p>This is a sample text with HTML tags.</p>
Visit our website at https://example.com for more information!
Contact us at: info@example.com

This sentence has weird    spacing   and formatting.
This is way too short.
This is a proper sentence that should be kept in the final output after preprocessing and cleaning operations.
"""

print("🧹 Original messy text:")
print(repr(messy_text))

cleaned_text = preprocessor.preprocess_document(messy_text)
print("\n✨ Cleaned text:")
print(repr(cleaned_text))

# Summarize the cleaned text
if len(cleaned_text.split()) > 10:
    clean_summary = summarizer.extractive_summarize(cleaned_text, num_sentences=1)
    print("\n📋 Summary of cleaned text:")
    print(clean_summary)

## 3. Text Preprocessing

The tool includes text preprocessing capabilities to clean and normalize input text.
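
As a rough illustration of what this kind of cleanup involves, here is a minimal sketch using only the standard library. The function `clean_text` and its `min_words` threshold are hypothetical; this is not the tool's `TextPreprocessor`, which may apply different rules.

```python
import re


def clean_text(text, min_words=6):
    """Strip markup, links and email addresses, then drop very short sentences."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)   # email addresses
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    # Naive sentence split, then keep sentences with at least `min_words` words.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sentences if len(s.split()) >= min_words)


print(clean_text("<p>Some   markup here.</p> This sentence is long enough to be kept."))
```

Dropping very short sentences is a judgment call: it removes fragments such as navigation text, but it can also discard short sentences that carry meaning.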

In [None]:
# Generate summaries with different lengths
for num_sentences in [1, 2, 3]:
    print(f"\n🔍 {num_sentences}-sentence summary:")
    print("-" * 40)
    
    start_time = time.time()
    summary = summarizer.extractive_summarize(sample_text, num_sentences=num_sentences)
    processing_time = time.time() - start_time
    
    print(f"⏱️  Processing time: {processing_time:.3f}s")
    print(f"📏 Summary length: {len(summary.split())} words")
    print(f"📋 Summary: {summary}")

## 2. Basic Summarization

Let's generate summaries of different lengths to see how the tool works.

In [None]:
# Sample text for demonstration
sample_text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural
intelligence displayed by humans and animals. Leading AI textbooks define the field as the study
of "intelligent agents": any device that perceives its environment and takes actions that maximize
its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence"
is often used to describe machines that mimic "cognitive" functions that humans associate with the
human mind, such as "learning" and "problem solving". As machines become increasingly capable,
tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon
known as the AI effect. A quip in Tesler's Theorem says "AI is whatever hasn't been done yet."
For instance, optical character recognition is frequently excluded from things considered to be AI,
having become a routine technology. Modern machine learning techniques are heavy on data and require
large amounts of computing power. The traditional problems of AI research include reasoning, knowledge
representation, planning, learning, natural language processing, perception, and the ability to move
and manipulate objects. General intelligence is among the field's long-term goals. Approaches include
statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in
AI, including versions of search and mathematical optimization, artificial neural networks, and methods
based on statistics, probability and economics. The AI field draws upon computer science, information
engineering, mathematics, psychology, linguistics, philosophy, and many other fields.
"""

print("📄 Sample text loaded:")
print(f"📊 Length: {len(sample_text.split())} words")
print(f"📝 Preview: {sample_text[:200]}...")

In [None]:
# Initialize the BERT extractive summarizer
print("🔄 Initializing BERT summarizer...")
summarizer = BERTExtractiveSummarizer()

print("✅ Summarizer initialized!")
print(f"📋 Using model: {summarizer.model_name}")
print(f"🖥️  Device: {summarizer.device}")

## 1. Basic Setup and Initialization

Let's start by initializing the BERT summarizer and preparing some sample text.

In [None]:
# Import required libraries
import time
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

from text_summarizer import BERTExtractiveSummarizer, HybridSummarizer
from utils import TextPreprocessor, SummarizationEvaluator, DatasetLoader

# BERT Text Summarization Tool 🤖

This notebook demonstrates how to use BERT for extractive text summarization. The tool uses BERT embeddings to identify and extract the most important sentences from a document.
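
The idea of scoring sentences by their embeddings can be sketched without downloading a model. The example below is an assumption-laden illustration, not the tool's algorithm: bag-of-words count vectors stand in for BERT embeddings, and sentences are ranked by cosine similarity to the document centroid, which is one common way to pick representative sentences. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: a sentence ends at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def embed(sentence):
    """Stand-in for a BERT sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid_summary(text, num_sentences=2):
    """Pick the sentences closest to the document centroid, in original order."""
    sentences = split_sentences(text)
    vectors = [embed(s) for s in sentences]
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)  # summed vector; cosine ignores the scale
    scores = [cosine(vector, centroid) for vector in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))


text = "Cats sleep a lot. Cats and dogs are popular pets. The stock market fell today."
print(centroid_summary(text, num_sentences=2))
```

With real BERT embeddings the vectors would be dense and contextual, but the select-then-reorder step would look much the same.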

## Features
- 🔍 **Extractive Summarization**: Uses BERT to identify key sentences
- 🧠 **Semantic Understanding**: Leverages BERT's contextual embeddings  
- ⚙️ **Customizable**: Adjustable summary length and preprocessing
- 📊 **Evaluation**: ROUGE scores and compression metrics