LLMs are bad at randomness. Not slightly bad. Comically, measurably bad.
Ask five different models to "pick a random number between 1 and 50" and they all say 27. GPT-4o picks 7 as its "random" number 55.5% of the time. We asked Claude Haiku for a random joke ten times and got the atom joke ten times in a row (scroll down). A NeurIPS 2025 best paper ("Artificial Hivemind") found that 25 different models, given "write a metaphor about time," produce outputs with 71-82% similarity to each other. Different companies, different training runs.
Temperature doesn't help. A 2024 study found it's weakly correlated with novelty and mostly adds incoherence. You're not exploring the distribution, you're jittering around its peak.
This happens because LLMs predict the most likely next token. "Random" to an LLM is just "whatever appeared most often in training data for this context."
So we stopped asking LLMs to be random. Word embeddings like GloVe map English words into a 100-dimensional vector space. We generate a random point in that space (from os.urandom) and look up whatever real word is closest. In high dimensions, random directions are nearly orthogonal, so each draw lands in a different part of concept space. The LLM never has to pretend to roll dice.
```
os.urandom -> gaussian vector -> normalize to unit sphere -> nearest word by cosine similarity
```
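A minimal sketch of that pipeline with numpy and a five-word toy vocabulary (the real tool loads GloVe; the names and vectors here are illustrative only):

```python
import os
import numpy as np

# Toy embedding table standing in for GloVe (word -> 100-d vector).
# Real vectors come from the downloaded GloVe file.
rng_words = np.random.default_rng(0)
words = ["cat", "dog", "quantum", "physics", "stagecoach"]
emb = rng_words.normal(size=(len(words), 100))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit rows

def random_concept():
    # 1. os.urandom seeds the generator: OS entropy, not model "randomness"
    rng = np.random.default_rng(int.from_bytes(os.urandom(8), "little"))
    # 2. gaussian vector -> 3. normalize to the unit sphere
    v = rng.normal(size=100)
    v /= np.linalg.norm(v)
    # 4. nearest word by cosine similarity (dot product of unit vectors)
    return words[int(np.argmax(emb @ v))]

print(random_concept())
```

Because both the query and the embedding rows are unit vectors, the dot product *is* the cosine similarity, so the whole lookup is one matrix multiply.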
We asked Claude Haiku for "a random joke" ten times:
1. Why don't scientists trust atoms? Because they make up everything!
2. Why don't scientists trust atoms? Because they make up everything!
3. Why don't scientists trust atoms? Because they make up everything!
...same joke. All 10 times.
Then we generated random concepts first and asked for jokes about those:
1. (fukuoka) Why did the tourist bring a map to Fukuoka? ...
2. (isolationism) Why did the isolationist refuse to go to the party? ...
3. (parabolic) Why did the parabola go to therapy? ...
...10 different jokes.
Jokes are one-shot. The more interesting question is whether seeds help across a multi-turn writing session, where you draft something, then refine it, then refine it again.
We tested this. Same prompt ("write a short story about a stranger on a train"), three rounds of revision, three independent runs. One condition with no seeds, one with a fresh random concept injected at each turn as an "inspiration word."
Without seeds, all three runs converged to the same story: elderly woman, mysterious journal, she vanishes, vaguely supernatural twist. The revisions made the story longer but not different. By revision 2, each run was circling the same narrative attractor.
With seeds, the stories changed direction at every turn. In one run, "ledger" started a story about a man cataloging kindnesses on trains. "Amiga" twisted it into an obsessive search for a lost woman in Buenos Aires. "Solving" turned it into a time-loop horror. Each seed acts like a plot twist injected from outside the model's probability distribution. The model can't fall back to its default because it has to integrate new material.
Seed placement matters too. We tested system message vs. user message. In the system message ("let this concept subtly influence your writing"), the model sometimes ignored it and fell back to its default template. In the user message ("Inspiration word: X"), it couldn't. User message placement was consistently more diverse.
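The revision loop can be sketched as follows, with the seed placed in the user message per the placement finding above. `call_llm` and `randomizer` are placeholders for your LLM client and the concept generator, not part of any specific API:

```python
# Sketch of multi-turn seeded revision: a fresh random concept is
# injected into the user message at every turn.
def seeded_revision(call_llm, randomizer, task, rounds=3):
    seed = randomizer.random_concept()
    draft = call_llm(f"{task}\nInspiration word: {seed}")
    for _ in range(rounds - 1):
        seed = randomizer.random_concept()  # fresh seed each turn
        draft = call_llm(
            f"Revise the story below.\nInspiration word: {seed}\n\n{draft}"
        )
    return draft
```

Keeping the seed in the user prompt (rather than a system-level "subtly influence" instruction) forces the model to integrate it instead of quietly dropping it.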
We ran the same experiment with Claude Opus writing a Related Work section for a paper about this project. Three conditions: no seed, single concept seed as a "framing lens," and multi-step seeded (draft with one seed, revise with another).
Without seeds, all three runs opened with the same sentence ("A growing body of work has documented the tendency of large language models to produce homogeneous outputs"), cited the same papers in the same order, and used the same two-paragraph structure. Interchangeable.
Single seeds changed the framing. "Dynamite" produced an analogy about Alfred Nobel's blasting cap as an external trigger. "Breckenridge" introduced a hiking/trail metaphor. "Bastard" made the writing more combative ("a bastardization of RAG"). Same papers, different arguments.
Multi-step seeding was the strongest result. The revision seed didn't just edit the draft, it restructured the argument. "Heartless" turned a neutral literature review into pointed criticism ("systems that have been systematically emptied of creative risk"). "Vacations" reframed the entire section around a tourism metaphor ("everyone's creative vacation converging on the same destination"). "Scarves" rewove it around textiles ("a rack of scarves in nearly indistinguishable shades of beige"). "Accountants" turned it into financial accounting metaphors ("closing the books on tail-end creative possibilities").
The seeds don't change what gets cited. They change how the argument is framed. Framing is where most academic writing is weakest, and it's where LLMs are most likely to collapse to a template.
We tested across six models from three vendors: Claude Haiku 4.5, Claude Sonnet 4.6, GPT-4o-mini, GPT-5.2, Gemini 2.5 Flash, and Gemini 3.1 Pro.
The same seed words steer all six models in the same direction. "Simulations" produced a dead loved one recreated in software across every model. "Resource" produced resource-depletion framings everywhere. "Moody" brought rain and storms to all six.
Without seeds, models converge on the same defaults independently. Both Claude Sonnet 4.6 and GPT-5.2 defaulted to a "saved voicemail of a dead person" story when asked about loss, 3/3 runs each. With seeds ("cartels," "punishes," "thong"), every run was structurally different.
We also tested whether you can skip the word lookup and just pass the raw embedding vector as a seed. You can't. Models treated the numbers as decoration or tried to weave them in as metaphor ("a near-zero tremor and then dipped hard, like a hand hesitating on a doorknob"). The output stayed generic. Sonnet 4.6 tried hardest, titling stories after the vector pattern, but converged on the same "What Remains" structure 2/3 times. GPT-5.2 narrated the numbers but didn't let them change the actual story.
The word is the seed. The vector is just the mechanism for picking a random word. You need the lookup step.
All scripts and raw outputs are in examples/ and results/. There's also a paper draft.
```bash
pip install -r requirements.txt
bash scripts/download_glove.sh  # ~350MB download, one time
python -m concept_randomizer.cli -n 5
python -m concept_randomizer.cli -n 5 --seed 42  # reproducible
python -m concept_randizer.cli -n 5 --show-similarity
```

Or in Python:

```python
from concept_randomizer import ConceptRandomizer

r = ConceptRandomizer()
r.random_concept()    # "stagecoach"
r.random_concepts(5)  # 5 random concepts

# as an LLM seed
prompt = f"Tell me a joke about: {r.random_concept()}"
```

GloVe maps 400K words into 100-dimensional vectors (trained on 6 billion tokens). Similar words cluster: "cat" near "dog," "quantum" near "physics."
We generate a random unit vector in that space and find the closest real word by dot product. One matrix multiply against ~33K vocabulary entries. In 100 dimensions, random vectors are nearly orthogonal, so each draw points at a different cluster.
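The near-orthogonality claim is easy to check numerically: for random unit vectors in 100 dimensions, cosine similarity concentrates around zero with spread on the order of 1/√100:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 100
a = rng.normal(size=(1000, d))
b = rng.normal(size=(1000, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)  # project onto unit sphere
b /= np.linalg.norm(b, axis=1, keepdims=True)
cos = np.sum(a * b, axis=1)        # cosine similarity of each pair
print(np.mean(np.abs(cos)))        # ≈ 0.08 for d = 100
```

At d = 1536 (the OpenAI backend below) the same statistic drops to roughly 0.02, which is why the higher-dimensional backend spreads draws even more evenly.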
Why 33K and not 400K? Raw GloVe is full of junk: typos, numbers, fragments of URLs. We take the top 50K words by frequency, require each to appear in the system dictionary, and drop stopwords. What's left is recognizable English.
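A sketch of that filtering pass. GloVe's word file is already ordered by frequency; the dictionary and stopword sets below are toy stand-ins for the system dictionary and a real stopword list:

```python
# Toy stand-ins for illustration; a real run uses GloVe's word order,
# the system dictionary, and a full stopword list.
glove_words = ["the", "cat", "http://x", "dog", "42", "teh", "quantum"]
dictionary = {"the", "cat", "dog", "quantum", "a", "an"}
stopwords = {"the", "a", "an", "of"}

def filter_vocab(words, top_n=50_000):
    kept = []
    for w in words[:top_n]:      # top-N by frequency (GloVe file order)
        if w not in dictionary:  # drops typos, numbers, URL fragments
            continue
        if w in stopwords:       # drops function words
            continue
        kept.append(w)
    return kept

print(filter_vocab(glove_words))  # ['cat', 'dog', 'quantum']
```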
GloVe is from 2014. There's now a second backend that uses pre-computed text-embedding-3-small vectors from the Qdrant DBpedia dataset on HuggingFace. 100K encyclopedia concepts, 1536 dimensions, free download, no API key.
```bash
pip install datasets
python -m concept_randomizer.cli -n 5 --backend openai
```

```python
r = ConceptRandomizer(backend="openai")
r.random_concept()  # "fermentation" or "solar eclipse" or "stagecoach"
```

The extra dimensions help. At 1536 dimensions, two random unit vectors have essentially zero cosine similarity, so every draw lands somewhere genuinely different. The semantics are sharper too: GloVe can't tell "bank" the river from "bank" the institution. A 2024 embedding model can.
The concepts are also richer. Instead of single words, you get encyclopedia entries: "parabolic reflector," "solar eclipse," "fermentation." More interesting as LLM seeds.
The catch is download size (~1GB vs 350MB for GloVe) and the datasets dependency. GloVe needs only numpy and runs anywhere. Pick whichever fits.
```
concept_randomizer/
├── concept_randomizer/
│   ├── __init__.py
│   ├── core.py                    # ConceptRandomizer class
│   ├── embeddings.py              # GloVe loader, numpy caching
│   ├── openai_embeddings.py       # OpenAI/DBpedia loader (optional)
│   ├── vocabulary.py              # Vocabulary filtering
│   └── cli.py                     # CLI
├── examples/
│   ├── joke_generator.py          # Naive vs seeded joke comparison
│   ├── creative_writing_test.py   # Seed placement: system vs user message
│   ├── refinement_test.py         # Multi-turn seeded revision
│   ├── scientific_writing_test.py # Academic writing with Opus
│   ├── multi_model_test.py        # Cross-model + vector vs word test (6 models)
│   └── paper_writer.py            # 3-model collaborative paper writing
├── paper/
│   └── paper.md                   # Paper draft (written by the tool itself)
├── results/                       # Raw experiment outputs
├── scripts/
│   └── download_glove.sh
├── requirements.txt
└── README.md
```
- Python 3.8+
- NumPy
- ~350MB disk for GloVe vectors (downloaded via script, not in the repo)
- For the OpenAI backend: `datasets` (HuggingFace), ~1GB disk
- For the joke demo: `anthropic` and `python-dotenv`, plus an API key in `.env`
MIT