# Vec2Summ Demo Notebook

This notebook demonstrates the Vec2Summ approach to text summarization via probabilistic sentence embeddings.

## What is Vec2Summ?

Vec2Summ is a novel approach that:
1. Embeds texts into high-dimensional space
2. Models the distribution of the embeddings (mean vector and covariance matrix)
3. Samples new vectors from this distribution
4. Reconstructs text from sampled vectors
5. Generates summaries from reconstructed texts

In [None]:
# Install required packages (if needed)
# !pip install -r ../requirements.txt

In [None]:
import sys
import os
sys.path.append('../src')

import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv

# Load environment variables
load_dotenv('../.env')

# Import vec2summ components
from vec2summ.core.embeddings import get_openai_embeddings
from vec2summ.core.distribution import calculate_distribution_params, sample_from_distribution
from vec2summ.core.reconstruction import reconstruct_text_from_embeddings
from vec2summ.core.summarization import generate_vec2summ_summary
from vec2summ.utils.visualization import visualize_embeddings
from vec2summ.data.preprocessing import clean_text

## Step 1: Prepare Sample Data

Let's start with some sample texts to demonstrate the approach.

In [None]:
# Sample texts for demonstration
sample_texts = [
    "Machine learning is transforming how we process and understand data.",
    "Artificial intelligence enables computers to perform tasks that typically require human intelligence.",
    "Deep learning uses neural networks with multiple layers to learn complex patterns.",
    "Natural language processing helps computers understand and generate human language.",
    "Computer vision allows machines to interpret and analyze visual information.",
    "Reinforcement learning trains agents to make decisions through trial and error.",
    "Big data analytics reveals insights from large and complex datasets.",
    "Cloud computing provides scalable and flexible computing resources."
]

# Clean the texts
cleaned_texts = [clean_text(text) for text in sample_texts]
print(f"Number of texts: {len(cleaned_texts)}")
print("\nSample texts:")
for i, text in enumerate(cleaned_texts[:3]):
    print(f"{i+1}. {text}")

## Step 2: Generate Embeddings

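Convert our texts into high-dimensional embeddings using OpenAI's embedding model. This step calls the OpenAI API, so it relies on the environment variables loaded from `../.env` in the setup cell.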

In [None]:
# Generate embeddings
print("Generating embeddings...")
embeddings = get_openai_embeddings(cleaned_texts)
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")

## Step 3: Model the Distribution

Calculate the mean vector and covariance matrix of our embeddings.
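
As a rough illustration of what this step computes, the sketch below estimates a mean and covariance with plain NumPy on random stand-in data. It is not the implementation of `calculate_distribution_params`; the array `X`, the sizes, and the ridge term `eps` are assumptions made for the example. One thing it makes visible: with fewer texts than dimensions (8 texts here), the sample covariance has rank at most n - 1, so it is singular unless regularized.

```python
import numpy as np

rng = np.random.default_rng(0)
n_texts, dim = 8, 16                    # hypothetical sizes; real embeddings are much wider
X = rng.normal(size=(n_texts, dim))     # stand-in for an (n_texts, dim) embedding matrix

mu = X.mean(axis=0)                     # mean vector, shape (dim,)
cov = np.cov(X, rowvar=False)           # unbiased sample covariance, shape (dim, dim)

# With n_texts < dim the covariance is rank-deficient (rank <= n_texts - 1).
rank = np.linalg.matrix_rank(cov)

# Assumed remedy for illustration: a small ridge on the diagonal
# makes the matrix positive definite.
eps = 1e-6
cov_reg = cov + eps * np.eye(dim)

print(mu.shape, cov.shape, rank)
```

Whether the library applies any such regularization is not shown in this notebook, so treat `cov_reg` as one possible choice rather than a description of its behavior.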

In [None]:
# Calculate distribution parameters
mean_vector, covariance_matrix = calculate_distribution_params(embeddings)

print(f"Mean vector shape: {mean_vector.shape}")
print(f"Covariance matrix shape: {covariance_matrix.shape}")
print(f"Mean vector norm: {np.linalg.norm(mean_vector):.4f}")
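# With far fewer texts than embedding dimensions, the covariance is (near-)singular and its
# determinant underflows to ~0, so report the trace (total variance) instead.
print(f"Covariance matrix trace: {np.trace(np.asarray(covariance_matrix)):.4e}")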

## Step 4: Sample New Vectors

Sample new embedding vectors from our learned distribution.
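
One way to picture this step: a draw from a Gaussian with mean μ and covariance Σ can be written as μ + Az, where AAᵀ = Σ and z is standard normal. The sketch below builds A from an eigendecomposition, which also works when Σ is singular. The toy `mu` and `cov` and the helper name `sample_gaussian` are hypothetical; `sample_from_distribution` may be implemented differently.

```python
import numpy as np

def sample_gaussian(mu, cov, n_samples, seed=0):
    """Draw n_samples vectors from N(mu, cov) via an eigendecomposition of cov."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(cov)      # cov is symmetric
    eigvals = np.clip(eigvals, 0.0, None)       # clamp tiny negative round-off
    factor = eigvecs * np.sqrt(eigvals)         # factor @ factor.T == cov
    z = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + z @ factor.T

mu = np.array([1.0, -2.0, 0.5])
cov = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.2],
                [0.0, 0.2, 0.5]])

samples = sample_gaussian(mu, cov, n_samples=5)
print(samples.shape)  # (5, 3)
```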

In [None]:
# Sample from the distribution
n_samples = 5
sampled_vectors = sample_from_distribution(mean_vector, covariance_matrix, n_samples=n_samples)

print(f"Sampled vectors shape: {sampled_vectors.shape}")
print(f"Number of samples: {n_samples}")

# Compare with original embeddings (mean L2 norm per vector)
original_mean_norm = np.linalg.norm(embeddings.numpy(), axis=1).mean()
sampled_mean_norm = np.linalg.norm(np.asarray(sampled_vectors), axis=1).mean()

print(f"Original embeddings mean norm: {original_mean_norm:.4f}")
print(f"Sampled vectors mean norm: {sampled_mean_norm:.4f}")

## Step 5: Reconstruct Text (Demo)

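Note: This demo simulates text reconstruction, because the real step requires the vec2text model. The reconstructed texts below are hand-written placeholders, not outputs decoded from the sampled vectors.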

In [None]:
# Simulated reconstruction for demo purposes.
# In practice, you would use:
#   reconstructed_texts = reconstruct_text_from_embeddings(sampled_vectors, corrector)
# Note that `corrector` is not defined anywhere in this notebook.

# For the demo, use hand-written placeholder texts instead
reconstructed_texts = [
    "AI and machine learning are changing data processing.",
    "Neural networks enable complex pattern recognition.",
    "Natural language models understand human communication.",
    "Computer vision analyzes visual data effectively.",
    "Big data provides insights through advanced analytics."
]

print("Reconstructed texts (demo):")
for i, text in enumerate(reconstructed_texts):
    print(f"{i+1}. {text}")

## Step 6: Generate Summary

Create a summary from the reconstructed texts.

In [None]:
# Generate summary from reconstructed texts
print("Generating Vec2Summ summary...")
summary = generate_vec2summ_summary(reconstructed_texts)

print("\nVec2Summ Summary:")
print("="*50)
print(summary)
print("="*50)

## Step 7: Visualize Embeddings

Create a 2D visualization of the original and sampled embeddings using PCA.
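
For intuition, a PCA projection of this kind can be sketched with an SVD of the centered data. The example below fits the projection on stand-in "original" vectors and applies the same axes to stand-in "sampled" vectors. The data and the function name `pca_project_2d` are hypothetical, and `visualize_embeddings` may fit PCA differently (for example, on both sets combined).

```python
import numpy as np

def pca_project_2d(reference, others):
    """Fit a 2-component PCA on `reference` and project both arrays onto it."""
    center = reference.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - center, full_matrices=False)
    components = vt[:2]                          # shape (2, dim)
    return (reference - center) @ components.T, (others - center) @ components.T

rng = np.random.default_rng(0)
original = rng.normal(size=(8, 32))              # stand-in for original embeddings
sampled = rng.normal(size=(5, 32))               # stand-in for sampled vectors

orig_2d, samp_2d = pca_project_2d(original, sampled)
print(orig_2d.shape, samp_2d.shape)  # (8, 2) (5, 2)
```

Fitting on the originals only keeps the axes fixed, so the sampled points are read against the same frame of reference.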

In [None]:
# Visualize embeddings
import torch

visualize_embeddings(
    original_embeddings=embeddings,
    sampled_embeddings=torch.as_tensor(sampled_vectors, dtype=embeddings.dtype),  # match the originals' dtype
    save_path="demo_visualization.png"
)

# Display the plot
from IPython.display import Image, display
display(Image("demo_visualization.png"))

## Analysis and Insights

Let's compare cosine similarities among the original embeddings with those between the sampled vectors and the originals.
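
The analysis relies on cosine similarity, which is the dot product of two vectors after each is scaled to unit length. A minimal NumPy version, with made-up vectors and a hypothetical helper name `cosine_sim_matrix`, might look like this:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

a = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([[2.0, 0.0], [0.0, 3.0]])

sims = cosine_sim_matrix(a, b)
# Row 0: parallel to b[0] (1.0), orthogonal to b[1] (0.0).
# Row 1: at 45 degrees to both, so about 0.7071 each.
print(np.round(sims, 4))
```

When a set of vectors is compared with itself, the diagonal is always 1, which is why the cell below averages only the upper triangle (`k=1`) for the original-to-original comparison.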

In [None]:
# Analyze the embedding space
from sklearn.metrics.pairwise import cosine_similarity

# Calculate similarities between original embeddings
original_similarities = cosine_similarity(embeddings.numpy())
mean_original_sim = np.mean(original_similarities[np.triu_indices_from(original_similarities, k=1)])

# Calculate similarities between sampled vectors and originals
sampled_to_original_sims = cosine_similarity(sampled_vectors, embeddings.numpy())
mean_sampled_sim = np.mean(sampled_to_original_sims)

print(f"Mean similarity between original embeddings: {mean_original_sim:.4f}")
print(f"Mean similarity between sampled and original: {mean_sampled_sim:.4f}")

# Plot similarity distributions
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(original_similarities[np.triu_indices_from(original_similarities, k=1)],
         bins=10, alpha=0.7)
plt.title('Similarity Distribution: Original Embeddings')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(sampled_to_original_sims.flatten(), bins=10, alpha=0.7, color='orange')
plt.title('Similarity Distribution: Sampled vs Original')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

## Conclusion

This notebook demonstrated the Vec2Summ approach:

1. **Embedding Generation**: Converted texts to high-dimensional vectors
2. **Distribution Modeling**: Estimated the mean and covariance of the embeddings
3. **Sampling**: Generated new vectors from the estimated distribution
4. **Reconstruction**: Simulated the conversion of vectors back to text (the real step requires vec2text)
5. **Summarization**: Created a summary from the reconstructed texts

### Key Insights:
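- Cosine similarity between embeddings gives a measure of how semantically related the texts are
- Vectors sampled from the fitted distribution share its mean and covariance in expectation
- Summaries are generated from sampled points in embedding space rather than directly from the source texts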
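
The second point can be checked empirically on a toy example: with enough draws, the sample mean and covariance of the draws approach the parameters they were drawn from. The values below are made up for illustration, and the check assumes a multivariate Gaussian. It only shows that the first two moments are reproduced; it says nothing about whether real embeddings are actually Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
cov = np.array([[1.0, 0.4, 0.0],
                [0.4, 2.0, 0.3],
                [0.0, 0.3, 0.5]])

# Draw many samples and compare their empirical moments with the inputs.
samples = rng.multivariate_normal(mu, cov, size=20_000)

mean_error = np.abs(samples.mean(axis=0) - mu).max()
cov_error = np.abs(np.cov(samples, rowvar=False) - cov).max()
print(f"max mean error: {mean_error:.3f}, max covariance error: {cov_error:.3f}")
```

With only a handful of samples, as in this notebook (`n_samples = 5`), the empirical moments can differ noticeably from the fitted ones.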

### Next Steps:
- Try with your own datasets
- Experiment with different embedding models
- Evaluate reconstruction quality
- Compare with baseline summarization methods

In [None]:
# Cleanup: remove the image written in Step 7 (os is already imported in the setup cell)
if os.path.exists("demo_visualization.png"):
    os.remove("demo_visualization.png")

print("Demo completed successfully! 🎉")