# Prompt Length vs. Semantic Noise Trade-off

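This notebook simulates prompts composed of informative and noisy tokens to study how latent drift depends on both total prompt length and the semantic redundancy ratio $\rho$ (the noise ratio is $1 - \rho$). A concept vector is reconstructed as the normalized mean of token embeddings, each of which either aligns with the ground-truth concept (the concept plus small Gaussian jitter) or is a random unit vector that contributes only noise. We evaluate the mean squared distance between the reconstruction and the true concept across prompt lengths from 2 to 32 tokens and noise ratios from 0 to 0.8, and visualize the result as a heatmap.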
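
As a minimal sketch of the setup described above, the snippet below builds a single prompt and reconstructs the concept from it. The helper names `build_prompt` and `reconstruct` and the `jitter` parameter are hypothetical, introduced only for illustration; the dimension of 64 and the 0.1 perturbation scale mirror the defaults used in the simulation. Because both vectors are (almost exactly) unit length, the squared distance used as the error equals $2 - 2\cos\theta$, where $\theta$ is the angle between the reconstruction and the concept.

```python
import numpy as np


def build_prompt(concept, seq_len, noise_ratio, rng, jitter=0.1):
    """Stack informative and noisy token embeddings into a (seq_len, dim) array."""
    dim = concept.shape[0]
    informative = int(round(seq_len * (1 - noise_ratio)))
    noisy = seq_len - informative
    # Informative tokens: the concept plus small Gaussian jitter.
    info_tokens = concept + jitter * rng.standard_normal((informative, dim))
    # Noisy tokens: random unit vectors unrelated to the concept.
    noise_tokens = rng.standard_normal((noisy, dim))
    noise_tokens /= np.linalg.norm(noise_tokens, axis=1, keepdims=True)
    return np.vstack([info_tokens, noise_tokens])


def reconstruct(prompt):
    """Normalized mean of the token embeddings."""
    mean = prompt.mean(axis=0)
    return mean / (np.linalg.norm(mean) + 1e-8)


rng = np.random.default_rng(0)
concept = rng.standard_normal(64)
concept /= np.linalg.norm(concept)

prompt = build_prompt(concept, seq_len=16, noise_ratio=0.25, rng=rng)
estimate = reconstruct(prompt)

sq_error = float(np.sum((estimate - concept) ** 2))
cosine = float(estimate @ concept)

# The two quantities agree up to the tiny 1e-8 stabilizer in the normalization.
print(prompt.shape)
print(sq_error, 2 - 2 * cosine)
```

Reading the error as $2 - 2\cos\theta$ bounds it between 0 (perfect alignment) and 4 (opposite direction), with a value near 2 indicating a reconstruction that is essentially orthogonal to the concept.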


In [None]:
import numpy as np
import matplotlib.pyplot as plt

plt.style.use("seaborn-v0_8")
np.random.seed(0)



In [None]:
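def simulate_error(seq_len: int, noise_ratio: float, dim: int = 64, trials: int = 200) -> float:
    """Mean squared distance between the reconstructed and true concept vectors."""
    errors = []
    for _ in range(trials):
        # Ground-truth concept: a random unit vector.
        concept = np.random.randn(dim)
        concept /= np.linalg.norm(concept)
        tokens = []
        # Token counts are rounded, so the realized noise ratio can differ from
        # noise_ratio at short lengths (e.g. seq_len=2, noise_ratio=0.8 yields
        # zero informative tokens).
        informative = int(round(seq_len * (1 - noise_ratio)))
        noisy = seq_len - informative
        if informative > 0:
            # Informative tokens: the concept plus small Gaussian jitter.
            info_tokens = concept + 0.1 * np.random.randn(informative, dim)
            tokens.append(info_tokens)
        if noisy > 0:
            # Noisy tokens: random unit vectors unrelated to the concept.
            noise_tokens = np.random.randn(noisy, dim)
            noise_tokens /= np.linalg.norm(noise_tokens, axis=1, keepdims=True)
            tokens.append(noise_tokens)
        # Reconstruct the concept as the normalized mean token embedding.
        prompt = np.vstack(tokens)
        reconstruction = prompt.mean(axis=0)
        reconstruction /= np.linalg.norm(reconstruction) + 1e-8
        # Squared Euclidean distance between unit vectors (equals 2 - 2 * cosine).
        errors.append(np.linalg.norm(reconstruction - concept) ** 2)
    return float(np.mean(errors))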

lengths = np.arange(2, 33)
noise_levels = np.linspace(0, 0.8, 17)
heatmap = np.zeros((len(noise_levels), len(lengths)))

for i, noise in enumerate(noise_levels):
    for j, length in enumerate(lengths):
        heatmap[i, j] = simulate_error(length, noise)



In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
mesh = ax.imshow(
    heatmap,
    aspect="auto",
    origin="lower",
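    # Pad by half a cell (0.5 tokens, 0.025 noise ratio) so ticks align with cell centers.
    extent=[lengths[0] - 0.5, lengths[-1] + 0.5, noise_levels[0] - 0.025, noise_levels[-1] + 0.025],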
    cmap="viridis",
)
ax.set_xlabel("Prompt length (tokens)")
ax.set_ylabel("Noise ratio (1 - ρ)")
ax.set_title("Reconstruction error vs. length and semantic noise")
fig.colorbar(mesh, ax=ax, label="Mean squared error")
plt.show()



Increasing prompt length reduces error only when the noise ratio stays low (high semantic redundancy $\rho$). Once noise dominates, longer prompts plateau or worsen reconstruction, consistent with the hypothesis that unstructured repetition injects aliasing rather than usable redundancy.
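
One way to check this reading quantitatively is to look at how error changes with length at a fixed noise ratio. The sketch below is self-contained and re-implements the simulation in a compact, vectorized form rather than reusing the cells above; `mean_sq_error` and `length_slope` are hypothetical names, and it uses fewer trials and a separate random generator, so its numbers will not match the heatmap exactly. It fits a least-squares slope of error against length for a few noise ratios: a clearly negative slope means longer prompts help at that noise level, while a slope near zero or above would correspond to the plateau or degradation described above.

```python
import numpy as np


def mean_sq_error(seq_len, noise_ratio, rng, dim=64, trials=100, jitter=0.1):
    """Vectorized estimate of the mean squared reconstruction error."""
    informative = int(round(seq_len * (1 - noise_ratio)))
    noisy = seq_len - informative
    # One random unit concept vector per trial.
    concept = rng.standard_normal((trials, dim))
    concept /= np.linalg.norm(concept, axis=1, keepdims=True)
    # Informative tokens: concept plus Gaussian jitter, shape (trials, informative, dim).
    info = concept[:, None, :] + jitter * rng.standard_normal((trials, informative, dim))
    # Noisy tokens: random unit vectors, shape (trials, noisy, dim).
    noise = rng.standard_normal((trials, noisy, dim))
    noise /= np.linalg.norm(noise, axis=2, keepdims=True)
    prompt = np.concatenate([info, noise], axis=1)
    # Normalized mean embedding per trial.
    recon = prompt.mean(axis=1)
    recon /= np.linalg.norm(recon, axis=1, keepdims=True) + 1e-8
    return float(np.mean(np.sum((recon - concept) ** 2, axis=1)))


def length_slope(noise_ratio, lengths, rng):
    """Least-squares slope of error vs. prompt length at a fixed noise ratio."""
    errors = np.array([mean_sq_error(int(n), noise_ratio, rng) for n in lengths])
    slope = np.polyfit(lengths, errors, deg=1)[0]
    return float(slope), errors


rng = np.random.default_rng(0)
lengths = np.arange(2, 33)
slopes = {}
for noise_ratio in (0.0, 0.4, 0.8):
    slope, errors = length_slope(noise_ratio, lengths, rng)
    slopes[noise_ratio] = slope
    print(
        f"noise={noise_ratio:.1f}  slope={slope:+.4f}  "
        f"error@{lengths[0]}={errors[0]:.3f}  error@{lengths[-1]}={errors[-1]:.3f}"
    )
```

A single linear slope is a coarse summary. Because token counts are rounded, the realized noise ratio at short lengths can differ from the nominal one, so it may be worth inspecting the full `errors` curve before drawing conclusions about any individual row of the heatmap.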
