# 🎨 Words to Canvas - Build an LLM from YOUR Vocabulary

**Concept:** Train a tiny LLM using ONLY your controlled vocabulary (StPetePros wordlist), then expand its knowledge on demand by fetching definitions from Wikipedia.

This is like **"reverse OCR"** - instead of image → text, we do:

```
Your Words → Word Embeddings → Tiny LLM → Generated Text → Canvas Visualization
                                 ↓
                  (expand vocabulary from Wikipedia when needed)
```

**What makes this unique:**
- Starts with YOUR 263 Tampa Bay words
- Learns relationships through embeddings
- Expands contextually (not pre-trained on entire internet)
- Visualizes knowledge graph on canvas
- Pure NumPy (no black boxes)

**Sections:**
1. Load your vocabulary
2. Train word embeddings (Word2Vec-style)
3. Build tiny transformer LLM
4. Generate text from prompts
5. Expand vocabulary from Wikipedia
6. Visualize knowledge graph on canvas
7. Save everything

In [None]:
# Setup imports
import sys
import os

# Add core directory to path
sys.path.insert(0, os.path.abspath('../core'))

import numpy as np
import json
from pathlib import Path

# Import our custom modules
from word_embeddings import WordEmbeddings, build_vocabulary_from_wordlist, generate_training_pairs
from vocabulary_expander import VocabularyExpander
from tiny_llm import TinyLLM
from canvas_visualizer import CanvasVisualizer

print("✅ Imports loaded!")
print("📦 Modules available:")
print("   - WordEmbeddings (Word2Vec-style)")
print("   - VocabularyExpander (Wikipedia/dictionary)")
print("   - TinyLLM (Transformer from scratch)")
print("   - CanvasVisualizer (Knowledge graph)")

## Step 1: Load Your Vocabulary

Start with your 263 StPetePros words (Tampa Bay themed).

In [None]:
# Load base vocabulary from wordlist
wordlist_path = '../stpetepros-wordlist.txt'

words, word_to_idx, idx_to_word = build_vocabulary_from_wordlist(wordlist_path)

print("\n📚 Your Vocabulary:")
print(f"   Total words: {len(words)}")
print("\n   Sample words:")
for i, word in enumerate(words[:20]):
    print(f"      {i}: {word}")

# Only report a remainder when the list is longer than the sample shown
if len(words) > 20:
    print(f"\n   ... and {len(words) - 20} more!")

## Step 2: Train Word Embeddings

Learn vector representations where similar words have similar vectors.
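
As a rough illustration of what "similar vectors" means, the sketch below ranks words by cosine similarity over a small hand-made matrix. The vectors, the word list, and the helper name `cosine_top_k` are assumptions made up for this example; how `WordEmbeddings.most_similar` scores words internally is not shown in this notebook.

```python
import numpy as np

def cosine_top_k(vectors, query_idx, top_k=2):
    """Return (index, score) pairs for the rows most similar to the query row."""
    # Normalise rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # exclude the query word itself
    best = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in best]

# Toy 3-dimensional vectors (hypothetical values, not trained embeddings)
toy_words = ['plumber', 'pipe', 'beach', 'sand']
toy_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
    [0.1, 0.8, 0.5],
])

for idx, score in cosine_top_k(toy_vectors, toy_words.index('plumber')):
    print(f"{toy_words[idx]}: {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for ranking embedding neighbours.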

**Algorithm:** Word2Vec Skip-gram
- For each word, predict the surrounding context words
- Train a 2-layer neural network
- Use the hidden-layer weights as the embeddings
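
The steps above can be sketched in a few lines of NumPy. This is a minimal full-softmax skip-gram under stated assumptions, not the code inside `word_embeddings.py`: the function names `skipgram_pairs` and `train_skipgram`, the hyperparameters, and the use of plain SGD are all choices made for this example.

```python
import numpy as np

def skipgram_pairs(token_ids, window_size=2):
    """Pair every token with each neighbour within window_size positions."""
    pairs = []
    for pos, center in enumerate(token_ids):
        lo = max(0, pos - window_size)
        hi = min(len(token_ids), pos + window_size + 1)
        pairs.extend((center, token_ids[j]) for j in range(lo, hi) if j != pos)
    return pairs

def train_skipgram(pairs, vocab_size, dim=8, lr=0.1, epochs=100, seed=0):
    """Full-softmax skip-gram trained with plain SGD; returns (W_in, losses)."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(0.0, 0.1, (vocab_size, dim))   # input -> hidden (the embeddings)
    W_out = rng.normal(0.0, 0.1, (dim, vocab_size))  # hidden -> output scores
    losses = []
    for _ in range(epochs):
        total = 0.0
        for center, context in pairs:
            hidden = W_in[center]                    # hidden layer is an embedding lookup
            logits = hidden @ W_out
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            total -= np.log(probs[context])
            grad_logits = probs.copy()
            grad_logits[context] -= 1.0              # cross-entropy gradient w.r.t. logits
            grad_hidden = W_out @ grad_logits
            W_out -= lr * np.outer(hidden, grad_logits)
            W_in[center] -= lr * grad_hidden
        losses.append(total / len(pairs))
    return W_in, losses

pairs = skipgram_pairs([0, 1, 2, 3, 4], window_size=1)
embeddings_demo, losses = train_skipgram(pairs, vocab_size=5)
print(pairs)
print(f"mean loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A full softmax is fine for a vocabulary of a few hundred words; larger vocabularies usually switch to negative sampling or a hierarchical softmax to keep each update cheap.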

In [None]:
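# Generate (word, context) training pairs.
# The "corpus" here is just the vocabulary indices in order (0..N-1), so a
# word's context is its neighbours in the list rather than in real sentences.
training_pairs_idx = generate_training_pairs(
    list(range(len(words))),
    window_size=3  # Consider 3 words on each side as context
)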

print("\n🔗 Training Pairs Generated:")
print(f"   Total pairs: {len(training_pairs_idx)}")
print("\n   Sample pairs (word → context):")
for word_idx, context_idx in training_pairs_idx[:10]:
    print(f"      {idx_to_word[word_idx]} → {idx_to_word[context_idx]}")

In [None]:
# Initialize word embeddings
embeddings = WordEmbeddings(
    vocab_size=len(words),
    embedding_dim=50,  # 50-dimensional vectors
    learning_rate=0.1
)

# Train embeddings
embeddings.train(
    training_pairs_idx,
    epochs=100,  # More epochs = better embeddings (but slower)
    verbose=True
)

print("\n✅ Embeddings trained!")

In [None]:
# Test word similarity
print("\n🔍 Word Similarities:")

test_words = ['plumber', 'tampa', 'repair', 'service', 'professional']

for word in test_words:
    if word not in word_to_idx:
        print(f"   '{word}' not in vocabulary")
        continue

    word_idx = word_to_idx[word]
    similar = embeddings.most_similar(word_idx, word_to_idx, idx_to_word, top_k=5)

    print(f"\n   '{word}' is similar to:")
    for sim_word, score in similar:
        print(f"      {sim_word}: {score:.3f}")

## Step 3: Build Tiny Transformer LLM

Create a minimal GPT-style transformer:
- Self-attention mechanism
- Feed-forward layers
- Positional encodings
- Next-token prediction
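
To make the first and third bullets concrete, here is a minimal sketch of single-head causal self-attention with sinusoidal positional encodings. It is an illustration under assumptions: the internals of `TinyLLM` are not shown in this notebook, so the sinusoidal scheme, the single head, and every name below are choices made for the example rather than a description of that class.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, dim):
    """Fixed sine/cosine positional encodings, shape (seq_len, dim)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal mask."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask future positions so token t only attends to tokens <= t
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
x = rng.normal(size=(seq_len, dim)) + sinusoidal_positions(seq_len, dim)
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
out, weights = causal_self_attention(x, W_q, W_k, W_v)
print(out.shape)
print(np.round(weights, 2))
```

The printed weight matrix is lower-triangular: each row sums to 1 and puts zero weight on later positions, which is what makes next-token prediction possible without leaking the answer.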

In [None]:
# Initialize Tiny LLM
llm = TinyLLM(
    vocab_size=len(words),
    embedding_dim=50,  # Match embeddings dimension
    max_seq_len=20,  # Maximum sequence length
    learning_rate=0.01
)

# Load pre-trained word embeddings
llm.load_embeddings(embeddings.get_all_embeddings())

print("\n✅ Tiny LLM initialized with pre-trained embeddings!")

## Step 4: Generate Text

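Use the LLM to generate text from prompts built from YOUR vocabulary. No LLM training step has run at this point in the notebook: only the embeddings are trained, so expect the continuations to look close to random.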
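
The `temperature` argument used in the next cell controls how sharply the model's scores are turned into probabilities. The sketch below shows the usual mechanism (divide the logits by the temperature, apply softmax, then sample). The helper `sample_next_token` and the example logits are hypothetical; whether `TinyLLM.generate` samples exactly this way is an assumption.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    """Sample one token id from logits after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three tokens
for t in (0.2, 0.8, 2.0):
    _, probs = sample_next_token(logits, temperature=t, rng=np.random.default_rng(0))
    print(f"temperature={t}: {np.round(probs, 3)}")
```

Low temperatures concentrate probability on the top-scoring token, while high temperatures flatten the distribution and make rarer tokens more likely.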

In [None]:
# Test text generation
print("\n🎨 Generating Text from Prompts:\n")

prompts = [
    ['tampa', 'bay', 'plumber'],
    ['reliable', 'professional', 'service'],
    ['repair', 'fix', 'install']
]

for prompt_words in prompts:
    # Convert words to indices
    prompt_ids = [word_to_idx[w] for w in prompt_words if w in word_to_idx]

    if not prompt_ids:
        print(f"   Prompt words not in vocabulary: {prompt_words}")
        continue

    # Generate continuation
    generated_ids = llm.generate(
        prompt_ids=np.array(prompt_ids),
        max_new_tokens=5,
        temperature=0.8,  # Higher = more random
        idx_to_word=idx_to_word
    )

    # Convert back to words
    generated_words = [idx_to_word.get(i, '<?>') for i in generated_ids]
    print(f"   Prompt: {' '.join(prompt_words)}")
    print(f"   Generated: {' '.join(generated_words)}")
    print()

## Step 5: Expand Vocabulary from Wikipedia

When you encounter unknown words, fetch definitions on-demand.

This is the **"reverse OCR"** magic - expand knowledge contextually, not upfront!
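
One way to picture on-demand expansion is a lookup that only fetches when a word is missing and remembers the result. The sketch below is an assumption-laden illustration, not the `VocabularyExpander` implementation: `expand_on_demand` and `fake_fetch` are hypothetical names, and the fetcher is a local stub that stands in for a real Wikipedia request.

```python
def expand_on_demand(word, base_vocab, cache, fetch_definition):
    """Return a definition record for an unknown word, fetching at most once."""
    word = word.lower()
    if word in base_vocab:
        return None                           # already known, nothing to fetch
    if word not in cache:
        cache[word] = fetch_definition(word)  # may be None if the lookup fails
    return cache[word]

# Stand-in for a network call; a real fetcher would query Wikipedia here
def fake_fetch(word):
    canned = {'database': 'An organized collection of data.'}
    text = canned.get(word)
    return {'definition': text, 'source': 'stub'} if text else None

base_vocab = {'plumber', 'tampa', 'repair'}
cache = {}
for w in ['plumber', 'database', 'database', 'blockchain']:
    print(w, '->', expand_on_demand(w, base_vocab, cache, fake_fetch))
```

Passing the fetcher in as an argument keeps the lookup logic testable offline and makes it easy to try several sources in turn.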

In [None]:
# Initialize vocabulary expander
expander = VocabularyExpander(base_vocab_path='../stpetepros-wordlist.txt')

# Words to expand (not in your base vocabulary)
expand_words = [
    'database',
    'cryptocurrency',
    'blockchain',
    'neural',
    'algorithm'
]

print("\n🌐 Expanding Vocabulary from Wikipedia:\n")

# Expand words
results = expander.batch_expand(expand_words, sources=['wikipedia', 'builtin'])

# Show results
for word, definition in results.items():
    if definition:
        print(f"\n✅ {word}:")
        print(f"   {definition['definition'][:150]}...")
        print(f"   Source: {definition['source']}")

        # Show related words from your base vocabulary
        related = expander.get_related_words(word)
        if related:
            print(f"   Related to: {', '.join(related[:3])}")

In [None]:
# Show expansion statistics
stats = expander.get_expansion_stats()

print("\n📊 Vocabulary Statistics:")
print(f"   Base vocabulary: {stats['base_vocab_size']} words")
print(f"   Expanded vocabulary: {stats['expanded_vocab_size']} words")
print(f"   Total vocabulary: {stats['total_vocab_size']} words")
print(f"   Expansion rate: {stats['expanded_vocab_size'] / stats['base_vocab_size'] * 100:.1f}%")

## Step 6: Visualize Knowledge Graph on Canvas

Render your vocabulary + expansions as an interactive knowledge graph.

This is the **"canvas"** part - visual representation of your LLM's knowledge!
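
The layout cell below calls `layout_force_directed`. For readers curious what such a layout does, here is a simplified Fruchterman-Reingold-style sketch in NumPy: every pair of nodes repels, connected nodes attract, and a cooling "temperature" caps each step. The function `force_layout` and its constants are assumptions for illustration; `CanvasVisualizer` may use a different algorithm.

```python
import numpy as np

def force_layout(num_nodes, edges, width=800, height=600, iterations=50, seed=0):
    """Simplified Fruchterman-Reingold layout; returns (num_nodes, 2) positions."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform([0, 0], [width, height], size=(num_nodes, 2))
    k = np.sqrt(width * height / max(num_nodes, 1))  # ideal edge length
    temperature = width / 10.0                        # maximum step per iteration
    for _ in range(iterations):
        delta = pos[:, None, :] - pos[None, :, :]     # pairwise offsets
        dist = np.maximum(np.linalg.norm(delta, axis=-1), 1e-3)
        # Every pair of nodes repels with magnitude k^2 / d
        disp = (delta / dist[..., None] * (k * k / dist)[..., None]).sum(axis=1)
        # Connected nodes attract with magnitude d^2 / k
        for a, b in edges:
            pull = delta[a, b] / dist[a, b] * (dist[a, b] ** 2 / k)
            disp[a] -= pull
            disp[b] += pull
        # Move each node along its net force, capped by the temperature
        length = np.maximum(np.linalg.norm(disp, axis=1, keepdims=True), 1e-3)
        pos += disp / length * np.minimum(length, temperature)
        pos = np.clip(pos, [0, 0], [width, height])   # stay on the canvas
        temperature *= 0.95                           # cool down
    return pos

edges_demo = [(0, 1), (1, 2), (2, 0), (3, 4)]
positions_demo = force_layout(5, edges_demo)
print(np.round(positions_demo, 1))
```

The pairwise computation is quadratic in the number of nodes, which is acceptable for a graph of a few hundred words.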

In [None]:
# Build knowledge graph from expanded vocabulary
graph = expander.build_knowledge_graph()

print("\n🕸️ Knowledge Graph Built:")
print(f"   Nodes: {len(graph['nodes'])}")
print(f"   Edges: {len(graph['edges'])}")

# Show sample nodes
print("\n   Sample nodes:")
for node in graph['nodes'][:10]:
    node_type = node.get('type', 'unknown')
    print(f"      {node['label']} ({node_type})")

In [None]:
# Create visualizer
viz = CanvasVisualizer(width=800, height=600)

# Compute force-directed layout
positions = viz.layout_force_directed(
    nodes=graph['nodes'],
    edges=graph['edges'],
    iterations=50
)

print("\n✅ Layout computed!")

In [None]:
# Create output directory
output_dir = Path('../data/words_to_canvas')
output_dir.mkdir(parents=True, exist_ok=True)

# Render visualizations
print("\n🎨 Rendering Visualizations:\n")

# 1. SVG (static vector graphics)
svg_path = output_dir / 'knowledge_graph.svg'
viz.render_svg(graph['nodes'], graph['edges'], positions, str(svg_path))

# 2. Interactive HTML
html_path = output_dir / 'knowledge_graph.html'
viz.render_html_interactive(graph['nodes'], graph['edges'], positions, str(html_path))

# 3. JSON export
json_path = output_dir / 'knowledge_graph.json'
viz.export_json(graph['nodes'], graph['edges'], positions, str(json_path))

print(f"\n✅ All visualizations saved to: {output_dir}")
print(f"\n   🌐 Open {html_path} in your browser to explore!")

In [None]:
# Show ASCII version (for terminal/notebook)
ascii_graph = viz.render_ascii(graph['nodes'], graph['edges'], positions)
print(ascii_graph)

## Step 7: Save Everything

Persist your trained models and expanded vocabulary.
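
Since the files below use a `.json` extension, it may help to see how a NumPy matrix can round-trip through JSON. This is a sketch under assumptions: the helpers `save_embeddings_json` and `load_embeddings_json` and the `words`/`vectors` layout are made up for the example, and the actual format written by `embeddings.save` is not shown in this notebook.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_embeddings_json(path, matrix, idx_to_word):
    """Write an embedding matrix and its vocabulary as plain JSON."""
    payload = {
        'words': [idx_to_word[i] for i in range(len(idx_to_word))],
        'vectors': np.asarray(matrix).tolist(),  # ndarray is not JSON-serializable
    }
    Path(path).write_text(json.dumps(payload), encoding='utf-8')

def load_embeddings_json(path):
    """Read the file back into (matrix, idx_to_word)."""
    payload = json.loads(Path(path).read_text(encoding='utf-8'))
    return np.array(payload['vectors']), dict(enumerate(payload['words']))

demo_matrix = np.arange(6, dtype=float).reshape(3, 2)
demo_vocab = {0: 'tampa', 1: 'bay', 2: 'plumber'}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'embeddings_demo.json'
    save_embeddings_json(demo_path, demo_matrix, demo_vocab)
    restored, restored_vocab = load_embeddings_json(demo_path)
print(np.array_equal(restored, demo_matrix), restored_vocab == demo_vocab)
```

JSON keeps the saved model human-readable, at the cost of larger files than a binary format such as `.npy`.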

In [None]:
# Save word embeddings
embeddings_path = output_dir / 'word_embeddings.json'
embeddings.save(str(embeddings_path), word_to_idx, idx_to_word)

# Save LLM
llm_path = output_dir / 'tiny_llm.json'
llm.save(str(llm_path))

# Save expanded vocabulary
vocab_path = output_dir / 'expanded_vocabulary.json'
expander.save_expanded_vocab(str(vocab_path))

print("\n💾 All models saved!")
print(f"   Embeddings: {embeddings_path}")
print(f"   LLM: {llm_path}")
print(f"   Vocabulary: {vocab_path}")

## Summary: What We Built

🎉 **Congratulations!** You just built:

1. ✅ **Word Embeddings** - Word2Vec-style vectors from YOUR vocabulary
2. ✅ **Vocabulary Expander** - On-demand knowledge from Wikipedia
3. ✅ **Tiny LLM** - Transformer from scratch (pure NumPy)
4. ✅ **Text Generator** - Generate Tampa Bay business descriptions
5. ✅ **Knowledge Graph** - Visualize relationships on canvas

**This is "reverse OCR":**
- Instead of: Image → Text
- You built: Words → LLM → Canvas Visualization

**Key innovations:**
- Controlled vocabulary (your 263 words)
- Contextual expansion (Wikipedia on-demand)
- Transparent (no black boxes)
- Visual (knowledge graph)

**Next steps:**
- Train on real Tampa Bay business data
- Expand to more sources (news, dictionary, Python docs)
- Add more LLM layers (multi-head attention)
- Fine-tune for specific domains

**Open the HTML visualization to explore your knowledge graph!**