# Vec2Summ Demo Notebook

This notebook demonstrates the Vec2Summ approach to text summarization via probabilistic sentence embeddings.

## What is Vec2Summ?

Vec2Summ is a novel approach that:
1. Embeds texts into high-dimensional space
2. Models the distribution of embeddings
3. Samples new vectors from this distribution
4. Reconstructs text from sampled vectors
5. Generates summaries from reconstructed texts

In [None]:
# Install required packages (if needed)
# !pip install -r ../requirements.txt

In [1]:
import sys
import os
sys.path.append('../src')

import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv

# Load environment variables
load_dotenv('../.env')

# Import vec2summ components
from vec2summ.core.embeddings import get_embeddings, get_embeddings_and_corrector, get_openai_embeddings, get_gtr_embeddings
from vec2summ.core.distribution import calculate_distribution_params, sample_from_distribution
from vec2summ.core.reconstruction import reconstruct_text_from_embeddings
from vec2summ.core.summarization import generate_vec2summ_summary
from vec2summ.utils.visualization import visualize_embeddings
from vec2summ.data.preprocessing import clean_text

## Configuration

Choose your embedding model and configure the experiment parameters.

In [None]:
# Configuration - Choose your embedding type
# Options: "openai" or "gtr"

EMBEDDING_TYPE = "gtr"  # Change this to "openai" to use OpenAI embeddings instead

# Model-specific settings
OPENAI_MODEL = "text-embedding-ada-002"  # Used when EMBEDDING_TYPE="openai"
GTR_MODEL = "sentence-transformers/gtr-t5-base"  # Used when EMBEDDING_TYPE="gtr"

# Other experiment parameters
N_SAMPLES = 5  # Number of vectors to sample from the distribution
BATCH_SIZE = 128  # Batch size for embedding computation

print(f"Selected embedding type: {EMBEDDING_TYPE}")
if EMBEDDING_TYPE == "openai":
    print(f"OpenAI model: {OPENAI_MODEL}")
    print("Note: Make sure your OpenAI API key is set in the .env file")
elif EMBEDDING_TYPE == "gtr":
    print(f"GTR model: {GTR_MODEL}")
    print("Note: GTR models will be downloaded locally if not cached")
else:
    print("Warning: Invalid embedding type selected!")

Selected embedding type: gtr
GTR model: sentence-transformers/gtr-t5-base
Note: GTR models will be downloaded locally if not cached


### Embedding Types Comparison

**OpenAI Embeddings (`text-embedding-ada-002`)**:
- ✅ High quality, well-trained embeddings
- ✅ No local model download required
- ✅ Consistent performance across domains
- ❌ Requires OpenAI API key and internet connection
- ❌ Usage costs apply
- ❌ No control over model architecture

**GTR Embeddings (`sentence-transformers/gtr-t5-base`)**:
- ✅ Free to use (open source)
- ✅ Runs locally (no internet required after download)
- ✅ Full control over the model
- ✅ No usage limits or costs
- ❌ Requires initial model download (~1GB)
- ❌ Requires more computational resources
- ❌ May need fine-tuning for specific domains

Choose the embedding type that best fits your use case!

### ⚠️ Important: Embedding-Corrector Pairing

The `vec2text` library requires that **embeddings and correctors are properly paired**:

- **OpenAI embeddings** → **OpenAI corrector** (same model name)
- **GTR embeddings** → **GTR corrector** (`gtr-base`)

**Mixing them will cause errors!** Our code automatically handles this pairing for you.
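
As a rough illustration of the pairing rule above, a compatibility check can be as small as a lookup table. The sketch below is hypothetical: `ALLOWED_CORRECTORS` and `is_valid_pair` are illustrative names, not part of `vec2summ` or `vec2text`, and the project's own `validate_embedding_corrector_pair` may be implemented differently.

```python
# Hypothetical sketch of a pairing check; not the vec2summ implementation.
ALLOWED_CORRECTORS = {
    "openai": {"text-embedding-ada-002"},  # corrector shares the embedding model name
    "gtr": {"gtr-base"},
}


def is_valid_pair(embedding_type: str, corrector_name: str) -> bool:
    """Return True if corrector_name is an allowed corrector for embedding_type."""
    # Unknown embedding types map to an empty set, so they never validate.
    return corrector_name in ALLOWED_CORRECTORS.get(embedding_type, set())


print(is_valid_pair("gtr", "gtr-base"))
print(is_valid_pair("openai", "gtr-base"))
```

Keeping the rule in one table means a new embedding type only needs a new entry, and an unknown type fails the check instead of raising.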

In [None]:
# Test embedding-corrector compatibility
from vec2summ.core.embeddings import validate_embedding_corrector_pair

print("Testing embedding-corrector compatibility:")
print("✅ Valid pairs:")
print(f"  OpenAI + text-embedding-ada-002: {validate_embedding_corrector_pair('openai', 'text-embedding-ada-002')}")
print(f"  GTR + gtr-base: {validate_embedding_corrector_pair('gtr', 'gtr-base')}")

print("❌ Invalid pairs:")
print(f"  OpenAI + gtr-base: {validate_embedding_corrector_pair('openai', 'gtr-base')}")
print(f"  GTR + text-embedding-ada-002: {validate_embedding_corrector_pair('gtr', 'text-embedding-ada-002')}")

print(f"\nYour current selection ({EMBEDDING_TYPE}) will be automatically paired correctly! ✅")

## Step 1: Prepare Sample Data

Let's start with some sample texts to demonstrate the approach.

In [None]:
# Sample texts for demonstration
sample_texts = [
    "Machine learning is transforming how we process and understand data.",
    "Artificial intelligence enables computers to perform tasks that typically require human intelligence.",
    "Deep learning uses neural networks with multiple layers to learn complex patterns.",
    "Natural language processing helps computers understand and generate human language.",
    "Computer vision allows machines to interpret and analyze visual information.",
    "Reinforcement learning trains agents to make decisions through trial and error.",
    "Big data analytics reveals insights from large and complex datasets.",
    "Cloud computing provides scalable and flexible computing resources."
]

# Clean the texts
cleaned_texts = [clean_text(text) for text in sample_texts]
print(f"Number of texts: {len(cleaned_texts)}")
print("\nSample texts:")
for i, text in enumerate(cleaned_texts[:3]):
    print(f"{i+1}. {text}")

## Step 2: Generate Embeddings

Convert our texts into high-dimensional embeddings using the embedding model selected in the configuration cell (`EMBEDDING_TYPE`).

In [None]:
# Generate embeddings and get properly paired corrector
print(f"Generating {EMBEDDING_TYPE} embeddings and loading paired corrector...")

embeddings, corrector, embedding_models = get_embeddings_and_corrector(
    cleaned_texts,
    embedding_type=EMBEDDING_TYPE,
    openai_model=OPENAI_MODEL,
    gtr_model=GTR_MODEL,
    batch_size=BATCH_SIZE
)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Successfully generated embeddings using {EMBEDDING_TYPE} model")
print(f"✅ Corrector properly paired with {EMBEDDING_TYPE} embeddings")

# Report which corrector will be used later for reconstruction
print(f"Corrector type: {type(corrector)}")
if EMBEDDING_TYPE == "openai":
    print(f"Using OpenAI corrector for model: {OPENAI_MODEL}")
else:
    print(f"Using GTR corrector (gtr-base) for GTR embeddings")

## Step 3: Model the Distribution

Calculate the mean vector and covariance matrix of our embeddings.
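
For intuition, this step can be sketched in plain NumPy. This is a minimal sketch that assumes the distribution is a multivariate Gaussian fitted by sample mean and sample covariance. `estimate_gaussian_params` is a hypothetical name, and the real `calculate_distribution_params` may differ, for example by regularizing the covariance.

```python
import numpy as np


def estimate_gaussian_params(embeddings: np.ndarray):
    """Return (mean, covariance) for an (n_texts, dim) embedding matrix."""
    mean = embeddings.mean(axis=0)
    # rowvar=False: rows are observations (texts), columns are dimensions.
    cov = np.cov(embeddings, rowvar=False)
    return mean, cov


# Toy stand-in for real embeddings: 8 "texts" in a 16-dimensional space.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(8, 16))

toy_mean, toy_cov = estimate_gaussian_params(toy_embeddings)
print(toy_mean.shape, toy_cov.shape)
print("covariance rank:", np.linalg.matrix_rank(toy_cov))
```

With fewer texts than dimensions, the sample covariance has rank at most `n_texts - 1`, so it is singular. That is the situation in this demo, where 8 texts are embedded in a space with hundreds of dimensions.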

In [None]:
# Calculate distribution parameters
mean_vector, covariance_matrix = calculate_distribution_params(embeddings)

print(f"Mean vector shape: {mean_vector.shape}")
print(f"Covariance matrix shape: {covariance_matrix.shape}")
print(f"Mean vector norm: {np.linalg.norm(mean_vector):.4f}")
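# Note: with fewer texts than embedding dimensions the sample covariance is rank-deficient,
# so this determinant is expected to be ~0 unless calculate_distribution_params regularizes it.
print(f"Covariance matrix determinant: {np.linalg.det(covariance_matrix):.2e}")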

## Step 4: Sample New Vectors

Sample new embedding vectors from our learned distribution.
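
A minimal sketch of Gaussian sampling with NumPy follows. It assumes the fitted distribution is a multivariate normal. `sample_gaussian` and its `ridge` parameter are hypothetical and only stand in for whatever `sample_from_distribution` does internally.

```python
import numpy as np


def sample_gaussian(mean, cov, n_samples, ridge=1e-6, seed=0):
    """Draw n_samples vectors from N(mean, cov + ridge * I)."""
    rng = np.random.default_rng(seed)
    # A small ridge keeps a rank-deficient sample covariance numerically
    # positive semi-definite (an illustrative choice, not the library's).
    stabilized = cov + ridge * np.eye(cov.shape[0])
    return rng.multivariate_normal(mean, stabilized, size=n_samples)


# Toy stand-in for real embeddings: 8 "texts" in a 16-dimensional space.
toy = np.random.default_rng(1).normal(size=(8, 16))
toy_samples = sample_gaussian(toy.mean(axis=0), np.cov(toy, rowvar=False), n_samples=5)
print(toy_samples.shape)  # (5, 16)
```

Each sampled row is a new point in embedding space that follows the fitted mean and covariance; it is not a copy of any input embedding.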

In [None]:
# Sample from the distribution
sampled_vectors = sample_from_distribution(mean_vector, covariance_matrix, n_samples=N_SAMPLES)

print(f"Sampled vectors shape: {sampled_vectors.shape}")
print(f"Number of samples: {N_SAMPLES}")

# Compare with original embeddings
original_mean_norm = np.mean([np.linalg.norm(emb) for emb in embeddings.numpy()])
sampled_mean_norm = np.mean([np.linalg.norm(vec) for vec in sampled_vectors])

print(f"Original embeddings mean norm: {original_mean_norm:.4f}")
print(f"Sampled vectors mean norm: {sampled_mean_norm:.4f}")
print(f"Using {EMBEDDING_TYPE} embeddings for the experiment")

## Step 5: Reconstruct Text (Demo)

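Note: For this demo, we simulate text reconstruction with hand-written stand-in texts, because real reconstruction runs the vec2text corrector and is much slower. The real call is included in the cell below, commented out.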

In [None]:
# Text reconstruction with properly paired corrector
print(f"Reconstructing text using {EMBEDDING_TYPE} corrector...")

# Option 1: Real reconstruction (uncomment to use actual vec2text reconstruction)
# reconstructed_texts = reconstruct_text_from_embeddings(sampled_vectors, corrector)

# Option 2: Demo reconstruction (faster for demonstration)
# For demo, let's create sample reconstructed texts matching our N_SAMPLES
demo_texts = [
    "AI and machine learning are changing data processing.",
    "Neural networks enable complex pattern recognition.",
    "Natural language models understand human communication.",
    "Computer vision analyzes visual data effectively.",
    "Big data provides insights through advanced analytics.",
    "Deep learning processes complex information patterns.",
    "Automation streamlines repetitive computational tasks.",
    "Cloud computing offers scalable processing solutions."
]

# Take only N_SAMPLES texts to match our sampling
reconstructed_texts = demo_texts[:N_SAMPLES]

print(f"Reconstructed texts (demo) - using {EMBEDDING_TYPE} embedding space:")
for i, text in enumerate(reconstructed_texts):
    print(f"{i+1}. {text}")

print(f"\nGenerated {len(reconstructed_texts)} reconstructed texts from {N_SAMPLES} sampled vectors")
print(f"✅ Corrector ({EMBEDDING_TYPE}) properly paired with embeddings")

# Note about real reconstruction
print("\n💡 Note: To use real reconstruction, uncomment the reconstruct_text_from_embeddings call (Option 1) and comment out the demo section (Option 2).")
print(f"    Real reconstruction uses the properly paired {EMBEDDING_TYPE} corrector loaded earlier.")

## Step 6: Generate Summary

Create a summary from the reconstructed texts.

In [None]:
# Generate summary from reconstructed texts
print("Generating Vec2Summ summary...")
summary = generate_vec2summ_summary(reconstructed_texts)

print("\nVec2Summ Summary:")
print("="*50)
print(summary)
print("="*50)

## Step 7: Visualize Embeddings

Create a 2D visualization of the original, sampled, and reconstructed embeddings using PCA.
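
To make the projection concrete, here is a small PCA sketch built on NumPy's SVD. It assumes the principal axes are fitted on the original embeddings and then reused for the other point sets. `fit_pca_2d` and `project` are hypothetical helpers, and `visualize_embeddings` may fit or plot differently.

```python
import numpy as np


def fit_pca_2d(reference: np.ndarray):
    """Return (center, components) for a 2D PCA fitted on reference."""
    center = reference.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - center, full_matrices=False)
    return center, vt[:2]


def project(points: np.ndarray, center: np.ndarray, components: np.ndarray):
    """Project points onto the fitted 2D principal axes."""
    return (points - center) @ components.T


# Toy stand-ins for the original and sampled embeddings.
rng = np.random.default_rng(0)
toy_original = rng.normal(size=(8, 16))
toy_sampled = rng.normal(size=(5, 16))

center, components = fit_pca_2d(toy_original)
original_2d = project(toy_original, center, components)
sampled_2d = project(toy_sampled, center, components)
print(original_2d.shape, sampled_2d.shape)  # (8, 2) (5, 2)
```

Reusing one set of axes for every point set keeps the scatter plots comparable, since all points are expressed in the same 2D coordinates.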

In [None]:
# Visualize embeddings
import torch

# Generate embeddings for reconstructed texts using the same embedding type
print(f"Generating embeddings for reconstructed texts using {EMBEDDING_TYPE}...")
reconstructed_embeddings = get_embeddings(
    reconstructed_texts,
    embedding_type=EMBEDDING_TYPE,
    openai_model=OPENAI_MODEL,
    gtr_model=GTR_MODEL,
    batch_size=BATCH_SIZE
)

print(f"✅ All embeddings use consistent {EMBEDDING_TYPE} model")

visualize_embeddings(
    original_embeddings=embeddings,
    sampled_embeddings=torch.tensor(sampled_vectors),
    reconstructed_embeddings=reconstructed_embeddings,
    save_path=f"demo_visualization_{EMBEDDING_TYPE}.png"
)

# Display the plot
from IPython.display import Image, display
print(f"Visualization using {EMBEDDING_TYPE} embeddings:")
display(Image(f"demo_visualization_{EMBEDDING_TYPE}.png"))

## Analysis and Insights

Let's analyze what we learned from this experiment.

In [None]:
# Analyze the embedding space
from sklearn.metrics.pairwise import cosine_similarity

# Calculate similarities between original embeddings
original_similarities = cosine_similarity(embeddings.numpy())
mean_original_sim = np.mean(original_similarities[np.triu_indices_from(original_similarities, k=1)])

# Calculate similarities between sampled vectors and originals
sampled_to_original_sims = cosine_similarity(sampled_vectors, embeddings.numpy())
mean_sampled_sim = np.mean(sampled_to_original_sims)

print(f"Mean similarity between original embeddings: {mean_original_sim:.4f}")
print(f"Mean similarity between sampled and original: {mean_sampled_sim:.4f}")

# Plot similarity distributions
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(original_similarities[np.triu_indices_from(original_similarities, k=1)], 
         bins=10, alpha=0.7, label='Original-Original')
plt.title('Similarity Distribution: Original Embeddings')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(sampled_to_original_sims.flatten(), bins=10, alpha=0.7, 
         label='Sampled-Original', color='orange')
plt.title('Similarity Distribution: Sampled vs Original')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

## Conclusion

This notebook demonstrated the Vec2Summ approach with **flexible embedding choices** and **proper model pairing**:

1. **Embedding Generation**: Converted texts to high-dimensional vectors using your chosen model
2. **Distribution Modeling**: Learned mean and covariance of embedding space
3. **Sampling**: Generated new vectors from the learned distribution
4. **Reconstruction**: (Simulated) converting vectors back to text using **properly paired corrector**
5. **Summarization**: Created summaries from reconstructed texts

### Key Insights:
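- Both supported embedding spaces are designed to capture semantic relationships, which is what makes the distribution of embeddings worth modeling
- Sampling from the fitted distribution is intended to preserve its mean and covariance; the norm and cosine-similarity checks above show how closely the samples match the originals for the selected model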
- The approach enables novel summarization via probabilistic modeling
- **Choice of embedding model** affects computational requirements and costs
- **⚠️ Embedding-corrector pairing is critical** for successful text reconstruction

### Embedding Type Benefits:
- **OpenAI**: Higher quality, no local setup, but requires API key and costs money
- **GTR**: Free and local, full control, but requires more compute resources

### Critical Implementation Details:
- **OpenAI embeddings** must be paired with **OpenAI correctors** (same model)
- **GTR embeddings** must be paired with **GTR correctors** (`gtr-base`)
- Our `get_embeddings_and_corrector()` function automatically handles this pairing
- Mixing embedding types with wrong correctors **will cause errors**

### Next Steps:
- Try both embedding types with your own datasets
- Compare results between OpenAI and GTR embeddings
- Experiment with different GTR models (ensure corrector compatibility)
- Use real reconstruction by uncommenting the appropriate lines
- Evaluate reconstruction quality with different embedding dimensions
- Compare with baseline summarization methods

### Changing Embedding Types:
Simply modify the `EMBEDDING_TYPE` variable in the configuration cell above and re-run the notebook. The pairing will be handled automatically!

In [None]:
# Cleanup
import os

# Clean up visualization files
for embedding_type in ["openai", "gtr"]:
    filename = f"demo_visualization_{embedding_type}.png"
    if os.path.exists(filename):
        os.remove(filename)
        print(f"Removed {filename}")

print(f"Demo completed successfully using {EMBEDDING_TYPE} embeddings! 🎉")