A bilingual "Human vs AI" summarization game for retail exhibitions
Adservio builds sovereign, aligned, responsible and frugal AI.
This Streamlit application creates an engaging "Human vs AI" game for exhibition stands. Visitors:
- See a short Wikipedia text for a few seconds
- Write their own summary from memory
- Compare their summary with an AI-generated summary
- Reveal the original text to see who did better
The app is fully bilingual (French/English) and designed for live demonstrations at retail expos like Porte de Versailles.
- Bilingual interface: French and English with instant language switching
- Two-pass AI summarization: Guarantees high-quality, consistent summaries even with small models
- Local LLM execution: Uses Ollama to run models locally (no external API calls)
- Online & Offline modes: Fetch live Wikipedia content or use pre-built corpus
- Multiple model support: Granite (default), Llama, DeepSeek, Mistral
- Dark theme: Professional exhibition-ready interface
- Always-visible branding: Adservio logo and motto displayed at all times
- Countdown timer: Visual circular countdown showing remaining time
- Semantic scoreboard: Multi-dimensional evaluation with 6 metrics
- Concept highlighting: Visual highlighting of semantically matching content
- Matrix visualization: Interactive heatmap showing phrase-level correspondences
- Cross-language support: Works even when comparing French ↔ English summaries
- Pedagogical tooltips: Mouseover explanations with mathematical formulas
- "Cheat" mode: Iterative learning - revise and recalculate your scores
- 📖 Read Phase (configurable time, default 10s)
- A Wikipedia text excerpt appears on screen (60-120 words)
- A countdown circle shows remaining time
- Read and memorize as much as you can!
- ✍️ Write Phase (unlimited time)
- Text disappears
- Write your summary from memory
- Try to capture the key information in 2-3 sentences
- Click "Valider mon résumé / Submit my summary"
- 🤖 AI Challenge
- The AI generates its own summary using two-pass summarization
- First pass: Extract key facts (hidden)
- Second pass: Generate polished 2-sentence summary (visible)
- 📊 Comparison & Results
- See both summaries side-by-side
- View semantic scoreboard with detailed metrics
- Optional: Highlight matching concepts
- Optional: View correspondence analysis matrix
- Reveal original text to see who captured it better
- 🎯 "Cheat" to Learn (optional)
- Edit your summary after seeing results
- Click "Tricher / Cheat" to recalculate scores
- Experiment with different phrasings
- Learn what makes a good summary!
The semantic scoreboard determines the winner using a composite score (0-100):
- Score difference > 2 points: Clear winner
- Score difference ≤ 2 points: Tie (both did well!)
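The 2-point margin rule can be sketched in a few lines; the function name and return values are illustrative, not the app's actual API:

```python
def decide_winner(human_score: float, ai_score: float, margin: float = 2.0) -> str:
    """Apply the tie rule: within `margin` points, both did well."""
    diff = human_score - ai_score
    if abs(diff) <= margin:
        return "tie"
    return "human" if diff > 0 else "ai"
```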
The AI is strong, but humans can win by:
- Capturing key concepts precisely
- Staying focused (no off-topic content)
- Writing concise, well-structured summaries
The app uses multilingual sentence embeddings (384-dimensional vectors) to evaluate summaries semantically, not just word-by-word.
- What it measures: Overall summary quality
- Formula:
S = 100 × [α·sim + β·cov + γ·focus] × penalty
- α (global similarity weight) = 0.4
- β (coverage weight) = 0.3
- γ (focus weight) = 0.3
- Interpretation: Higher is better. Combines all metrics with length penalty.
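As a sanity check, the composite score can be reproduced in plain Python. The function name and signature are assumptions, but the arithmetic follows the formula and weights above, including the Gaussian length penalty:

```python
import math

def composite_score(sim: float, cov: float, focus: float, n_words: int,
                    alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3,
                    target: int = 50, width: int = 25) -> float:
    """S = 100 × [α·sim + β·cov + γ·focus] × penalty, inputs in [0, 1]."""
    penalty = math.exp(-((n_words - target) / width) ** 2)  # Gaussian, peaks at `target`
    return 100 * (alpha * sim + beta * cov + gamma * focus) * penalty
```

A perfect summary at exactly 50 words scores 100; the same summary at 75 words is penalized by exp(-1) ≈ 0.37.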
- What it measures: Overall semantic closeness to original text
- Method: Cosine similarity between mean embeddings (384-D vectors)
- Range: 0-100%
- Interpretation: Measures if summary captures the "meaning" globally
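The computation reduces to a cosine between mean-pooled vectors; a pure-Python sketch on toy vectors (in the app the inputs are 384-D sentence embeddings, and these helper names are illustrative):

```python
import math

def mean_vector(vectors):
    """Mean-pool a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```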
- What it measures: Fraction of original text concepts captured
- Formula:
coverage = (matched phrases) / (total reference phrases)
- Threshold: Phrase similarity ≥ 65%
- Interpretation: Did you cover all the important points?
- What it measures: Fraction of summary content that's relevant
- Formula:
focus = (aligned phrases) / (total summary phrases)
- Threshold: Phrase similarity ≥ 65%
- Interpretation: Did you stay on-topic? (Penalizes hallucinations/off-topic content)
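Given a precomputed phrase-similarity matrix, coverage and focus reduce to counting matches above the threshold; a minimal sketch with hypothetical names (`sim[i][j]` = similarity of summary phrase i vs reference phrase j):

```python
def coverage_and_focus(sim, threshold=0.65):
    """Return (coverage, focus) from a summary-by-reference similarity matrix."""
    n_summary = len(sim)
    n_reference = len(sim[0]) if sim else 0
    # A reference phrase is covered if at least one summary phrase matches it
    covered = sum(1 for j in range(n_reference)
                  if any(sim[i][j] >= threshold for i in range(n_summary)))
    # A summary phrase is aligned if it matches at least one reference phrase
    aligned = sum(1 for i in range(n_summary)
                  if any(sim[i][j] >= threshold for j in range(n_reference)))
    coverage = covered / n_reference if n_reference else 0.0
    focus = aligned / n_summary if n_summary else 0.0
    return coverage, focus
```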
- What it measures: Penalty for summaries too short or too long
- Formula:
p = exp(-((n-50)/25)²), where n = word count
- Optimum: 50 words
- Interpretation: Gaussian penalty centered at target length
- What it measures: Total words in summary
- Method: Alphanumeric tokens only
- Interpretation: Context for understanding length penalty
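A minimal approximation of the token count (`\w+` also matches accented French letters; the exact tokenizer in the app may differ):

```python
import re

def word_count(text: str) -> int:
    # Count alphanumeric tokens only; punctuation is ignored
    return len(re.findall(r"\w+", text))
```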
Matrix Heatmap Visualization showing ALL phrase-level relationships:
- Rows (A, B, C...): Phrases from your summary
- Columns (1, 2, 3...): Phrases from original text
- Cell colors:
- 🟢 Green (>85%): Strong semantic match
- 🟠 Orange (70-85%): Medium match
- ⚪ Gray (<70%): Weak match
- Cell values: Cosine similarity percentage (0-100%)
What it reveals:
- Which parts of your summary match which parts of the original
- One-to-many relationships (one summary phrase capturing multiple original concepts)
- Gaps in coverage (original concepts you missed)
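The cell coloring maps directly onto the similarity thresholds above; a sketch (bucket names illustrative):

```python
def cell_color(similarity: float) -> str:
    """Map a cosine-similarity percentage (0-100) to a heatmap bucket."""
    if similarity > 85:
        return "green"   # strong semantic match
    if similarity >= 70:
        return "orange"  # medium match
    return "gray"        # weak match
```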
Toggle buttons to visually highlight matching content:
- Orange background on semantically similar phrases (≥65% similarity)
- Content words only (nouns, verbs, proper names)
- Stopwords excluded (70+ common FR/EN function words)
- Model: `paraphrase-multilingual-MiniLM-L12-v2` (384 dimensions)
- Languages: 50+ supported (FR, EN, and more)
- Comparison level: Phrase-level (3-12 words), not word-by-word
- Threshold: 0.65 for coverage/focus (configurable)
- Cross-language: Works for FR ↔ EN comparisons!
- Conda or Python 3.11+
- Ollama installed and running (see ollama.com)
- Clone or download this repository
- Create and activate the conda environment:
  ```bash
  conda env create -f environment.yaml
  conda activate retail-summarizer
  ```
- Install dependencies (if needed):
  ```bash
  pip install -r requirements.txt
  ```
- Install the semantic embeddings model (for semantic analysis):
  ```bash
  python scripts/preinstall_semantics_model.py
  ```
  This downloads and caches the multilingual embedding model (~100MB) for offline use.
- Pull required Ollama models:
  ```bash
  # Minimum (default model)
  ollama pull granite3.1-moe:3b

  # Optional alternatives
  ollama pull llama3:latest
  ollama pull deepseek-r1:14b
  ollama pull mistral:latest
  ```
- Generate the offline corpus (recommended for exhibition use):
  ```bash
  # French corpus (300 excerpts)
  python scripts/build_offline_corpus.py build --lang fr --n 300

  # English corpus (300 excerpts)
  python scripts/build_offline_corpus.py build --lang simple --n 300
  ```
- Start the application:
  ```bash
  streamlit run app.py
  ```
  The app will open in your browser at http://localhost:8501.
Define available LLM models for Ollama. Each model needs:
- `id`: Unique identifier
- `label`: Display name in the UI
- `ollama_name`: Exact Ollama model name (e.g., `granite3.1-moe:3b`)
- `description`: User-friendly description
The `default_model_id` setting specifies which model loads on startup.
Key configuration options:
- `default_language`: Starting language (`"fr"` or `"en"`)
- `default_display_time`: Seconds to show the original text (default: 10)
- `min_words` / `max_words`: Wikipedia excerpt size range (60-120)
- `ollama.host` / `ollama.port`: Ollama server connection
- `summarization.target_sentences`: Target summary length (default: 2)
- `summarization.max_attempts`: Retry attempts for sentence validation (default: 2)
- `semantics.reference_source`: Comparison reference (`"raw"` = original text, `"internal"` = AI's first pass)
- `semantics.similarity_threshold`: Threshold for phrase matching (default: 0.65, range: 0.60-0.75)
- `semantics.target_word_count`: Optimal summary length for the penalty (default: 50 words)
- `semantics.weights.alpha`: Weight for global similarity (default: 0.4)
- `semantics.weights.beta`: Weight for concept coverage (default: 0.3)
- `semantics.weights.gamma`: Weight for semantic focus (default: 0.3)
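For orientation, a hypothetical `semantics` fragment of `app_config.yaml` with the defaults listed above (the real file may organize its keys differently):

```yaml
semantics:
  reference_source: "raw"       # compare against the original text
  similarity_threshold: 0.65    # phrase-matching threshold
  target_word_count: 50         # optimum for the length penalty
  weights:
    alpha: 0.4                  # global similarity
    beta: 0.3                   # concept coverage
    gamma: 0.3                  # semantic focus
```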
Tuning tips:
- Lower `similarity_threshold` (0.60-0.63) for more lenient matching
- Raise `similarity_threshold` (0.70-0.75) for stricter matching
- Adjust `target_word_count` based on your typical text lengths
- Weights must sum to 1.0 for proper scaling
- Select language (FR/EN) in sidebar
- Choose LLM model from available options
- Select mode: Online (live Wikipedia) or Offline (local corpus)
- Adjust display time (5-60 seconds)
- Click "Nouveau texte / New text" to load a random excerpt
- Click "Démarrer / Start" to begin the round
- Read the text before it disappears
- Write your summary from memory
- Submit and see the AI's summary
- Compare and reveal the original text
The app uses a clever "two-pass" strategy to ensure high-quality AI summaries:
- First pass (hidden): Extract key information from the full text
- Second pass (visible): Generate a polished N-sentence summary from the internal summary
This approach:
- Improves consistency with small models
- Reduces hallucinations
- Produces cleaner, more focused summaries
- Validates sentence count and retries if needed
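The control flow can be sketched with an injected `generate` callable standing in for the Ollama call; the prompts and names are illustrative, not the app's actual code:

```python
import re

def two_pass_summarize(text, generate, n_sentences=2, max_attempts=2):
    """Two-pass summarization: extract facts, then polish with retry."""
    # First pass (hidden): extract key facts from the full text
    facts = generate(f"Extract the key facts from this text:\n{text}")
    # Second pass (visible): polish into exactly n_sentences, retrying if needed
    for _ in range(max_attempts):
        summary = generate(
            f"Write a {n_sentences}-sentence summary of these facts:\n{facts}")
        if len(re.findall(r"[.!?]+", summary)) == n_sentences:
            return summary
    return summary  # fall back to the last attempt
```

Injecting the LLM call makes the validation logic testable with a stub generator.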
For reliable exhibition use without internet dependency, pre-generate a local corpus:
```bash
# French Wikipedia
python scripts/build_offline_corpus.py build --lang fr --n 300

# Simple English Wikipedia
python scripts/build_offline_corpus.py build --lang simple --n 300

# Append more entries
python scripts/build_offline_corpus.py build --lang fr --n 100 --append

# Show corpus statistics
python scripts/build_offline_corpus.py stats
```
The corpus is saved in `data/wiki_corpus.jsonl` (JSONL format, one entry per line).
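Reading the corpus back is straightforward since each line is an independent JSON object; a sketch (the field names in the comment are assumptions, inspect the file for the real schema):

```python
import json

def load_corpus(path):
    """Read a JSONL corpus: one JSON object (e.g. a Wikipedia excerpt) per line."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries
```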
- Frontend: Streamlit with custom CSS for dark theme
- LLM Backend: Ollama (local inference)
- Data Source: Wikipedia API (online) or local JSONL corpus (offline)
- State Management: Streamlit session state
Default: Granite 3.1 MoE 3B
- Multilingual (French/English)
- Fast on CPU/GPU
- Good balance of size vs. quality
Alternatives:
- Llama 3: Strong general-purpose performance
- DeepSeek R1 14B: Larger reasoning model (slower)
- Mistral: Compact and efficient
```
.
├── app.py                      # Main Streamlit application (1700+ lines)
├── semantics_utils.py          # Semantic analysis module (v0.5.0+)
├── CLAUDE.md                   # Project specification
├── README.md                   # This file
├── CHANGELOG.md                # Version history
├── VERSION.txt                 # Current version
├── SEMANTICS_IMPLEMENTATION.md # Semantic analysis documentation
├── environment.yaml            # Conda environment
├── requirements.txt            # Python dependencies
├── config/
│   ├── models.json             # LLM model definitions
│   └── app_config.yaml         # App settings (incl. semantic config)
├── data/
│   └── wiki_corpus.jsonl       # Offline corpus (generated, git-ignored)
├── models/
│   └── embeddings/             # Cached embedding model (git-ignored)
├── scripts/
│   ├── build_offline_corpus.py # Corpus builder CLI
│   └── preinstall_semantics_model.py # Embedding model downloader
├── assets/
│   └── adservio-logo.svg       # Adservio branding
└── .streamlit/
    └── config.toml             # Streamlit theme (dark mode)
```
Solution: Ensure Ollama is running:
```bash
# Check if running
ollama list

# If not running, start it (varies by OS)
# On Linux:
ollama serve
```
Solution: Generate the corpus first:
```bash
python scripts/build_offline_corpus.py build --lang fr --n 300
```
Solution: Pull the required models:
```bash
ollama pull granite3.1-moe:3b
```
Solutions:
- Use a smaller model (Granite 3B or Mistral)
- Reduce `max_words` in config (fewer words to process)
- Use offline mode (eliminates the Wikipedia fetch time)
- Ensure Ollama is using the GPU if available
- Pre-install everything offline:
  ```bash
  # Download embedding model
  python scripts/preinstall_semantics_model.py

  # Pull Ollama models
  ollama pull granite3.1-moe:3b

  # Build offline corpus (600+ entries)
  python scripts/build_offline_corpus.py build --lang fr --n 300
  python scripts/build_offline_corpus.py build --lang simple --n 300
  ```
- Test on exhibition hardware:
  - Run end-to-end gameplay with semantic analysis
  - Verify the embedding model loads quickly (<2s)
  - Test both French and English modes
  - Check that the matrix visualization renders properly
- Configure for the optimal experience:
  - Set `default_display_time: 8` (challenging but fair)
  - Use `reference_source: "raw"` for fair scoring
  - Keep `similarity_threshold: 0.65` for balanced matching
- Use offline mode - No internet dependency
- Keep Granite 3B model - Best speed/quality balance
- Show tooltips - Hover over metrics to explain to technical visitors
- Encourage "cheating" - Let visitors iterate and learn!
- Highlight cross-language - Show FR summary vs EN original (impressive!)
- Challenge visitors: "Can you beat the AI in 10 seconds?"
- Show the matrix: "See how your words map to the original"
- Explain the math: Hover tooltips show formulas (for engineers/academics)
- Iterate mode: "Try again with the Cheat button - learn what works!"
- Multilingual demo: "Write in English, compare to French text - it works!"
- Slow semantic analysis: Check that the embedding model is cached (`models/embeddings/`)
- Coverage/focus = 0.00: Lower the threshold to 0.60 in config
- Matrix too large: Reduce `max_words` in config to get shorter texts
- AI summaries too long: Check that the Ollama model is properly loaded
Author: Olivier Vitrac, PhD, HDR
Email: olivier.vitrac@adservio.fr
Organization: Adservio – Innovation Lab
Technologies:
- Streamlit - UI framework with custom dark theme
- Ollama - Local LLM serving (no external API calls)
- Wikipedia API - Content source (online mode)
- Sentence Transformers - Multilingual embeddings (semantic analysis)
- PyTorch - Deep learning backend
- scikit-learn - Cosine similarity computations
- Language Models:
- Granite 3.1 MoE 3B (default)
- Llama 3 Latest
- DeepSeek R1 14B
- Mistral Latest
- Embedding Model: `paraphrase-multilingual-MiniLM-L12-v2` (384-D, 50+ languages)
© 2025 Adservio. All rights reserved.
Adservio builds sovereign, aligned, responsible and frugal AI.
Starting the app:
```bash
conda activate retail-summarizer
streamlit run app.py
```
Key buttons:
- Nouveau texte - Load new Wikipedia excerpt
- Démarrer - Start countdown timer
- Valider mon résumé - Submit human summary (triggers AI)
- Tricher - Recalculate after editing summary
- Analyse des correspondances - Show matrix visualization
Metrics to explain:
- Score final - Overall quality (0-100)
- Similarité globale - Semantic closeness (%)
- Couverture - Concepts captured (0-1)
- Focus - Content relevance (0-1)
- Pénalité - Length penalty (0-1)
Troubleshooting:
- Slow? Check offline mode is enabled
- Coverage = 0? Normal for very different summaries
- AI not responding? Check `ollama list` in a terminal
Hover over ℹ️ icons for:
- Mathematical formulas
- Technical explanations
- Vector space dimensions
Show off features:
- Cross-language comparison (FR ↔ EN)
- Matrix heatmap (phrase-level matching)
- Semantic embeddings (384 dimensions)
- Local execution (no API calls)
Key technical points:
- Multilingual transformer embeddings
- Cosine similarity in 384-D space
- Phrase-level granularity (not word-by-word)
- Threshold: 65% for phrase matching
- Two-pass summarization for stability
Current Version: 0.5.0 (2025-11-22)
Major Features:
- v0.1: Basic gameplay
- v0.2: Bilingual support
- v0.3: Rules display, countdown timer
- v0.4: Semantic analysis, scoreboard, concept highlighting
- v0.5: Matrix visualization, tooltips, cheat mode, phrase-level metrics
See CHANGELOG.md for detailed version history.