# Dilution of Meaning: Multi-Text Experiment

This notebook scales the iterative rewriting experiment to process all text files in the `texts/` directory. It also introduces semantic tracking using embeddings and PCA (Principal Component Analysis) to visualize how the meaning of each text "drifts" across iterations.

## Environment Setup

First, we install and import the necessary libraries: `transformers` for the LLM (installed, together with `torch`, as a dependency of `sentence-transformers`), `sentence-transformers` for embeddings, and `scikit-learn` for PCA.

In [None]:
!pip install -q sentence-transformers scikit-learn matplotlib tqdm

In [22]:
import os
import glob
import torch
import matplotlib.pyplot as plt
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from tqdm.notebook import tqdm

# Select the device: fall back to CPU, use Apple MPS if present, and prefer CUDA when available
device = "cpu"
if torch.backends.mps.is_available():
    device = torch.device("mps")

if torch.cuda.is_available():
    device = torch.device("cuda")

print(device)

cpu


## Model Initialization

We initialize two models:
1.  **Generation Model**: The LLM used for rewriting (`google/gemma-3n-E2B-it`).
2.  **Embedding Model**: A specialized model for generating semantic vectors (`all-MiniLM-L6-v2`).

In [None]:
# Large Language Model for Generation
gen_model_id = "google/gemma-3n-E2B-it" 
# Note: "ServiceNow-AI/Apriel-1.5-15b-Thinker" can be used if hardware supports it.

print(f"Loading Generation Model: {gen_model_id}...")
pipe = pipeline("text-generation", model=gen_model_id, device=device)

# Sentence Transformer for Embeddings
print("Loading Embedding Model...")
embed_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)


## Multi-Text Execution Loop

We discover all `.txt` files in the `texts/` directory and run the rewriting experiment on each. We store the text and its embedding at every step.

In [None]:
text_files = sorted(glob.glob("./texts/*.txt"))  # sorted so the processing order is reproducible
iterations = 4
results = {} # {filename: {'texts': [...], 'embeddings': [...]}}

# Create (or truncate) output_txt_all.txt and write a header
with open("output_txt_all.txt", "w") as f:
    f.write("Dilution of Meaning: All Texts Experiment\n\n")

for file_path in tqdm(text_files, desc="Processing Files"):
    file_name = os.path.basename(file_path)
    print(f"\nProcessing: {file_name}")
    
    with open(file_path, "r") as f:
        current_text = f.read().strip()
    
    file_results = {'texts': [current_text], 'embeddings': []}
    
    # Generate embedding for original text
    orig_embedding = embed_model.encode(current_text)
    file_results['embeddings'].append(orig_embedding)
    
    with open("output_txt_all.txt", "a") as output_file:
        output_file.write(f"--- FILE: {file_name} ---\n")
        output_file.write(f"Original:\n{current_text}\n\n")
    
    for i in range(1, iterations + 1):
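        # Chat-style prompt: the system message gives the rewrite instruction,
        # the user message carries the text produced by the previous iteration
        rewrite_prompt = [
            {'role': 'system', 'content': 'Rewrite this passage in your own words and in a similar length and style.'},
            {'role': 'user', 'content': current_text}
        ]
        
        # Generate rewrite
        output = pipe(
            rewrite_prompt,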
            max_new_tokens=2000,
            return_full_text=False,
            do_sample=True,
            temperature=0.7,
            tokenizer_encode_kwargs={"enable_thinking": False},
        )
        
        generated_text = output[0]["generated_text"]
        
        # Generate embedding for the rewrite
        new_embedding = embed_model.encode(generated_text)
        
        file_results['texts'].append(generated_text)
        file_results['embeddings'].append(new_embedding)
        
        with open("output_txt_all.txt", "a") as output_file:
            output_file.write(f"Iteration {i}:\n{generated_text}\n\n")
        
        current_text = generated_text
    
    results[file_name] = file_results

print("\nAll processing complete.")
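
Before plotting, it can help to quantify drift directly in the full embedding space. The sketch below is not part of the original experiment: it computes the cosine similarity between the original embedding and the embedding at each iteration. The helper names `cosine_similarity` and `drift_from_original` and the toy vectors are assumptions made for illustration; with real data, the same function could be applied to `results[file_name]['embeddings']`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def drift_from_original(embeddings):
    """Similarity of every embedding in the list to the first (original) one."""
    original = embeddings[0]
    return [cosine_similarity(original, e) for e in embeddings]

# Toy 2D "embeddings" (hypothetical), standing in for one file's trajectory
toy_embeddings = [
    np.array([1.0, 0.0]),  # original
    np.array([1.0, 1.0]),  # iteration 1: 45 degrees away
    np.array([0.0, 1.0]),  # iteration 2: orthogonal to the original
]

sims = drift_from_original(toy_embeddings)
for step, sim in enumerate(sims):
    print(f"iteration {step}: similarity to original = {sim:.3f}")
```

A value near 1 means the rewrite stays close to the original in embedding space; values that fall with each iteration would indicate drift.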

## Visualization: Semantic Drift

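We use PCA to project the embeddings to 2D and plot the trajectory of each file. This shows how the "meaning" of the text moves through embedding space as it is rewritten. Because only the first two principal components are kept, distances in the plot approximate, rather than reproduce, distances in the full embedding space.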

In [None]:
# Flatten all embeddings to fit PCA
all_embeddings = []
labels = []
for file_name, data in results.items():
    all_embeddings.extend(data['embeddings'])
    for i in range(len(data['embeddings'])):
        labels.append((file_name, i))

all_embeddings = np.array(all_embeddings)

# Apply PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(all_embeddings)

# Plot
plt.figure(figsize=(12, 8))
colors = plt.cm.rainbow(np.linspace(0, 1, len(results)))

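offset = 0  # running start index into the flattened embeddings array
for idx, (file_name, data) in enumerate(results.items()):
    # Slice this file's rows using its actual embedding count,
    # rather than assuming every file has exactly iterations + 1 entries
    start_idx = offset
    end_idx = start_idx + len(data['embeddings'])
    offset = end_idx
    
    coords = embeddings_2d[start_idx:end_idx]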
    
    # Plot the line (trajectory)
    plt.plot(coords[:, 0], coords[:, 1], color=colors[idx], alpha=0.5, linestyle='--')
    
    # Plot the points (Original and Iterations)
    for i, (x, y) in enumerate(coords):
        marker = 'o' if i == 0 else 'x'
        marker_size = 100 if i == 0 else 50
        plt.scatter(x, y, color=colors[idx], marker=marker, s=marker_size, label=f"{file_name} (orig)" if i == 0 else "")
        plt.text(x, y, f"{i}", fontsize=9)
        
plt.title("Dilution of Meaning: Semantic Drift across Iterations (PCA 2D)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle=':', alpha=0.6)
plt.tight_layout()
plt.show()
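
How faithful the 2D picture is depends on how much of the total variance the first two components capture. The following is a minimal sketch of that check, assuming NumPy only; the function name `explained_variance_ratio` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def explained_variance_ratio(embeddings, n_components=2):
    """Fraction of total variance captured by each of the first n_components principal axes."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Singular values come back in descending order
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to the variance along each axis
    return variances[:n_components] / variances.sum()

# Toy data (hypothetical): 50 points lying almost entirely along one direction
rng = np.random.default_rng(0)
toy = np.outer(rng.normal(size=50), [3.0, 1.0, 0.5]) + 0.01 * rng.normal(size=(50, 3))

ratio = explained_variance_ratio(toy)
print("per-component ratio:", ratio)
print("captured by 2D projection:", ratio.sum())
```

On the real data, the same quantity should be available from the fitted `pca` object as `pca.explained_variance_ratio_`. If the two components together capture only a small fraction of the variance, the plotted trajectories should be read with caution.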