# 🧩 Tutorial: Solving Word Riddles with Semantic Similarity

![Word Riddles](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Question_mark_%28black%29.svg/200px-Question_mark_%28black%29.svg.png)

## Welcome to the Fascinating World of Word Riddles! 🔍

In this comprehensive tutorial, you'll learn:
- 🧩 What are word riddles and how humans solve them
- 📊 Word embeddings and semantic similarity
- 🎯 Clustering algorithms for grouping similar words
- 🔢 TF-IDF and inverse document frequency
- 🤖 Building an intelligent riddle-solving system
- 💻 Hands-on implementation with real Polish riddles
- 🎨 Optimization techniques for better performance
- 🧪 Interactive exercises to build your skills

By the end, you'll be ready to implement a sophisticated word riddle solver using semantic similarity and clustering!


## 📚 Table of Contents

1. [🎓 Understanding Word Riddles](#1--understanding-word-riddles)
2. [📊 Word Embeddings and Semantic Similarity](#2--word-embeddings-and-semantic-similarity)
3. [🔧 Setting Up the Environment](#3--setting-up-the-environment)
4. [🗂️ Working with Polish Language Data](#4--working-with-polish-language-data)
5. [🎯 Clustering Words by Meaning](#5--clustering-words-by-meaning)
6. [📈 TF-IDF and Word Importance](#6--tf-idf-and-word-importance)
7. [🧮 Cosine Similarity and Vector Operations](#7--cosine-similarity-and-vector-operations)
8. [🏗️ Building the Riddle Solver Architecture](#8--building-the-riddle-solver-architecture)
9. [⚡ Optimization Techniques](#9--optimization-techniques)
10. [🎮 Interactive Exercises](#10--interactive-exercises)
11. [🚀 Complete Solution Walkthrough](#11--complete-solution-walkthrough)
12. [📖 Summary and Next Steps](#12--summary-and-next-steps)


## 1. 🎓 Understanding Word Riddles

### What is a Word Riddle?

A word riddle presents a description or definition, and you need to guess the word being described. Think of it as reverse dictionary lookup! 🔄

**Examples:**
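- **Riddle:** "kobieta podróżująca środkiem transportu, np. samolotem, pociągiem, statkiem" (a woman travelling by a means of transport, e.g. by plane, train, or ship)
- **Answer:** "pasażerka" (female passenger)

- **Riddle:** "emocjonalne uczucie łączące dwie osoby, oparte na zaufaniu, szacunku, trosce i oddaniu" (an emotional feeling connecting two people, based on trust, respect, care, and devotion)
- **Answer:** "miłość" (love)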

### How Do Humans Solve Riddles?

1. **Parse the description** - identify key concepts and relationships
2. **Activate semantic knowledge** - think of related words and concepts
3. **Find intersections** - look for words that match multiple clues
4. **Eliminate impossibilities** - rule out words that don't fit
5. **Select best match** - choose the word that best fits all clues

### The Computational Challenge

To solve riddles computationally, we need to:
- **Understand meaning** - represent words as vectors in semantic space
- **Measure similarity** - quantify how similar word meanings are
- **Handle ambiguity** - deal with multiple meanings and synonyms
- **Scale efficiently** - search through thousands of possible answers

### Mathematical Formulation

Given a riddle $R = \{w_1, w_2, ..., w_n\}$ (set of words in the description) and a dictionary $D$ of possible answers, find:

$$\text{answer} = \arg\max_{d \in D} \text{similarity}(R, \text{definitions}(d))$$

The challenge is defining `similarity` in a way that captures semantic relationships! 🧠
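
As a minimal sketch of the formulation above, the snippet below picks the candidate whose best-matching definition is most similar to the riddle. The Jaccard overlap and the names `jaccard`, `best_answer`, and `toy_definitions` are placeholders chosen for illustration; they are not the similarity measure developed later in this tutorial.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of words (a stand-in similarity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_answer(riddle, definitions):
    """Return the candidate whose best-matching definition is most similar to the riddle."""
    return max(
        definitions,
        key=lambda word: max(jaccard(riddle, d) for d in definitions[word]),
    )


# Hypothetical toy dictionary: word -> list of definitions (sets of words)
toy_definitions = {
    "kot": [{"zwierzę", "domowe", "ssak"}, {"mały", "drapieżnik", "miauczy"}],
    "samochód": [{"pojazd", "koła", "silnik"}],
}
toy_riddle = {"domowe", "zwierzę", "miauczy"}
print(best_answer(toy_riddle, toy_definitions))
```

Exact word overlap fails as soon as the riddle uses a synonym, which is why the rest of the tutorial replaces it with vector-based similarity.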


## 2. 📊 Word Embeddings and Semantic Similarity

### What are Word Embeddings?

Word embeddings are dense vector representations of words that capture semantic relationships. Instead of treating words as discrete symbols, we represent them as points in a high-dimensional space where **similar words are close together**.

### Key Intuition

**"You shall know a word by the company it keeps"** - J.R. Firth

Words that appear in similar contexts tend to have similar meanings. Word2Vec and similar models learn these patterns from large text corpora.

### Word2Vec Architecture

```
Input: "The cat sat on the mat"
Context window size = 2

Training pairs:
("cat", "The"), ("cat", "sat"), ("cat", "on")                  # words within 2 positions of "cat"
("sat", "The"), ("sat", "cat"), ("sat", "on"), ("sat", "the")  # words within 2 positions of "sat"
...
```
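
The pair generation can be sketched in a few lines of Python. The function name `skipgram_pairs` is an assumption made for this illustration; real Word2Vec implementations add refinements such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


sentence = "The cat sat on the mat".split()
pairs = list(skipgram_pairs(sentence, window=2))
print([p for p in pairs if p[0] == "cat"])
```

With `window=2`, the word `cat` is paired with `The`, `sat`, and `on`.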

### Mathematical Foundation

Word2Vec learns two matrices:
- **Input matrix** $W_{in} \in \mathbb{R}^{V \times d}$ (vocabulary size × embedding dimension), whose rows are the input vectors $v_w$
- **Output matrix** $W_{out} \in \mathbb{R}^{d \times V}$, whose columns are the output vectors $v'_w$

The probability that word $w_o$ appears in the context of word $w_i$ is:

$$P(w_o|w_i) = \frac{\exp({v'_{w_o}}^T v_{w_i})}{\sum_{w=1}^{V} \exp({v'_w}^T v_{w_i})}$$
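
The softmax above can be sketched with NumPy as follows. The matrices are random stand-ins, and the sizes and the function name `context_probabilities` are assumptions for illustration. Practical implementations avoid this full softmax and use approximations such as negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4  # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))   # one input vector per word (rows)
W_out = rng.normal(size=(d, V))  # one output vector per word (columns)


def context_probabilities(center_idx):
    """Softmax over the vocabulary: P(w_o | w_i) for every candidate w_o."""
    scores = W_in[center_idx] @ W_out  # shape (V,)
    scores -= scores.max()             # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


probs = context_probabilities(center_idx=2)
print(probs.round(3), probs.sum())
```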

### Properties of Word Embeddings

1. **Semantic similarity**: Similar words have similar vectors
2. **Arithmetic relationships**: king - man + woman ≈ queen
3. **Clustering**: Related words cluster together in vector space
4. **Compositionality**: Phrases can be represented as vector combinations

### Why This Matters for Riddles

Word embeddings allow us to:
- **Measure semantic similarity** between riddle words and definitions
- **Handle synonyms** - words with similar meanings have similar vectors
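- **Capture context** - combining the vectors of all words in a riddle approximates its overall meaning (Word2Vec itself assigns one static vector per word, so polysemous words are not disambiguated)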
- **Enable fuzzy matching** - find approximate rather than exact matches


## 3. 🔧 Setting Up the Environment

Let's start by importing all the necessary libraries and setting up our environment for working with Polish word riddles.


In [None]:
# Essential imports for word riddle solving
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict as dd
import math
import random
import os
from tqdm import tqdm

# Natural language processing
import nltk
from nltk.tokenize import word_tokenize as tokenize

# Word embeddings
from gensim.models import Word2Vec

# Linear algebra operations
from numpy.linalg import norm

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

print("✅ Libraries imported successfully!")
print("📊 NumPy version:", np.__version__)

# Download required NLTK data (nltk.download returns False on failure)
try:
    punkt_ready = nltk.download("punkt", quiet=True)
except Exception:  # e.g. no network access
    punkt_ready = False

if punkt_ready:
    print("✅ NLTK punkt tokenizer ready!")
else:
    print("⚠️  NLTK download may be needed")

## 4. 🗂️ Working with Polish Language Data

### Understanding the Data Structure

In this problem, we work with several key data sources:

1. **Dictionary definitions** (`plwiktionary_definitions_clean.txt`) - definitions of Polish words
2. **Word base forms** (`superbazy_clean.txt`) - mapping from inflected forms to base forms
3. **Word embeddings** (`w2v_polish_lemmas.model`) - pre-trained Word2Vec model
4. **Sample riddles** (`zagadki_do_testow_clean.txt`) - examples for testing

### Key Concepts

**Base Forms (Lemmatization)**: Polish is a highly inflected language. The word "kot" (cat) can appear as "kota", "kotem", "kotów", etc. We need to map all forms to their base form.

**Inverse Document Frequency (IDF)**: Measures how rare, and therefore how informative, a word is. Rare words tell us more than common words such as "i" (and) or "jest" (is).

Let's simulate loading this data structure:


In [None]:
# Simulate the data structures we'll work with
# In the real problem, these would be loaded from files

# Dictionary: word -> list of definitions (each definition is a set of words)
all_word_definitions = dd(list)

# Dictionary: inflected form -> base form
bases = {}

# Dictionary: base form -> IDF score
base_idf = dd(float)

# Let's create some example data to understand the structure
# Example 1: "kot" (cat)
all_word_definitions["kot"] = [
    {"zwierzę", "domowe", "ssak", "futro", "pazury"},
    {"mały", "drapieżnik", "miauczy", "łapie", "myszy"},
]

# Example 2: "miłość" (love)
all_word_definitions["miłość"] = [
    {"uczucie", "emocja", "przywiązanie", "serce", "kochać"},
    {"relacja", "związek", "partnerstwo", "zaufanie", "oddanie"},
]

# Example base form mappings
bases["koty"] = "kot"  # cats -> cat
bases["kota"] = "kot"  # cat (genitive) -> cat
bases["kotem"] = "kot"  # with cat (instrumental) -> cat
bases["kocham"] = "kochać"  # I love -> to love
bases["miłości"] = "miłość"  # love (genitive) -> love

# Example IDF scores (higher = rarer word)
base_idf["kot"] = 4.2
base_idf["miłość"] = 5.1
base_idf["zwierzę"] = 3.8
base_idf["uczucie"] = 4.6
base_idf["i"] = 1.2  # very common word, low IDF
base_idf["jest"] = 1.5  # very common word, low IDF

print("📚 Sample data structure created!")
print(f"🐱 Definitions for 'kot': {len(all_word_definitions['kot'])} definitions")
print(f"❤️  Definitions for 'miłość': {len(all_word_definitions['miłość'])} definitions")
print(f"🔤 Base form mappings: {len(bases)} examples")
print(f"📊 IDF scores: {len(base_idf)} words")

In [None]:
# Helper function to get base form of a word
def get_word_base(word):
    """Get the base (lemmatized) form of a word"""
    word = word.lower()
    return bases.get(word, word)  # return base form if exists, else return word itself


# Test the function
test_words = ["koty", "kocham", "miłości", "nowe_słowo"]
print("🔍 Testing base form lookup:")
for word in test_words:
    base = get_word_base(word)
    print(f"  {word} → {base}")

# Demonstrate why base forms matter
print("\n💡 Why base forms matter:")
print(
    "The riddle might use 'koty' (cats) but the dictionary definition is under 'kot' (cat)"
)
print("Without lemmatization, we'd miss the connection!")

## 5. 🎯 Clustering Words by Meaning

### The Clustering Challenge

When we look at word definitions, we often see many related words. For example, a definition of "kot" might include: `{"zwierzę", "domowe", "ssak", "futro", "pazury", "mały", "drapieżnik", "miauczy", "łapie", "myszy"}`

**Problem**: Not all these words are equally important! Some are:
- **Core concepts** (zwierzę, ssak) - central to the meaning
- **Descriptive details** (futro, pazury) - important but secondary  
- **Common words** (mały) - less discriminative

### Clustering Solution

We can group related words into **clusters** where each cluster represents a coherent concept. This helps us:
1. **Reduce noise** - group similar words together
2. **Weight importance** - give more weight to important clusters
3. **Improve matching** - compare clusters instead of individual words

### Mathematical Approach

For a set of words $W = \{w_1, w_2, ..., w_n\}$ in a definition:

1. **Convert to vectors**: $V = \{v_1, v_2, ..., v_n\}$ using Word2Vec
2. **Apply clustering algorithm**: Group similar vectors
3. **Compute cluster centroids**: $c_k = \frac{\sum_{v_i \in C_k} \alpha_i \cdot v_i}{\sum_{v_i \in C_k} \alpha_i}$
4. **Weight by IDF**: Use inverse document frequency to set the weights $\alpha_i$ (written $\alpha_i$ to avoid a clash with the words $w_i$)
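
The weighted centroid in step 3 is an ordinary weighted mean, which `np.average` computes directly. The vectors and weights below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical 3-dimensional vectors and IDF-based weights for one cluster
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
weights = np.array([3.0, 1.0, 1.0])

# Weighted mean: sum(weight_i * v_i) / sum(weight_i)
centroid = np.average(vectors, axis=0, weights=weights)
print(centroid)
```

The heavily weighted first vector pulls the centroid towards its own direction, which is the intended effect of IDF weighting.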

### Clustering Algorithm

We'll use a **greedy clustering approach**:
- Start with empty clusters
- For each word vector:
  - Find most similar existing cluster (cosine similarity)
  - If similarity > threshold: add to cluster and update centroid
  - Else: create new cluster (if word is important enough)

This is more flexible than K-means because we don't need to specify the number of clusters in advance! The trade-off is that the result depends on the order in which the words are processed.


## 6. 📈 TF-IDF and Word Importance

### What is TF-IDF?

**TF-IDF** stands for Term Frequency - Inverse Document Frequency. It's a way to measure how important a word is in a document relative to a collection of documents.

**Formula:**
$$\text{TF-IDF}(w, d, D) = \text{TF}(w, d) \times \text{IDF}(w, D)$$

Where:
- $\text{TF}(w, d)$ = frequency of word $w$ in document $d$
- $\text{IDF}(w, D) = \log\frac{|D|}{|\{d \in D : w \in d\}|}$

### Simplified IDF for Our Problem

In our riddle solver, we use a simplified version focusing only on **IDF**:
- Words that appear in many definitions → **low IDF** → less important
- Words that appear in few definitions → **high IDF** → more important
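
To make this concrete, here is a minimal sketch that computes $\log(|D| / \text{df}(w))$ over a handful of toy definitions. The function name `compute_idf` and the toy data are assumptions for illustration; in the actual problem the scores would be derived from the full dictionary file.

```python
import math
from collections import Counter


def compute_idf(documents):
    """IDF(w) = log(N / df(w)), where df(w) counts documents containing w."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical definitions, each represented as a set of base forms
toy_definitions = [
    {"zwierzę", "domowe", "ssak"},
    {"zwierzę", "duże", "trąba"},
    {"pojazd", "koła", "silnik"},
    {"zwierzę", "futro"},
]
toy_idf = compute_idf(toy_definitions)
print(round(toy_idf["zwierzę"], 3), round(toy_idf["pojazd"], 3))
```

Here "zwierzę" appears in three of four definitions and gets a lower IDF than "pojazd", which appears in only one.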

### Weight Function

We transform raw IDF scores using a weight function:

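$$\text{weight}(w) = \big(\max(0,\ \text{IDF}(w) - M)\big)^{D}$$

Where:
- $M$ = offset parameter (shifts the function)
- $D$ = shape parameter (controls how steeply weights increase; not to be confused with the document collection $D$ in the TF-IDF formula above)

This gives us several benefits:
1. **Filter common words**: Words with $\text{IDF}(w) \le M$ get weight 0; with a negative $M$ (as in the default $M = -1.0$ used below), common words are down-weighted rather than removed
2. **Boost rare words**: Rare, specific words get high weights  
3. **Smooth scaling**: Gradual transition, not binary cutoff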


In [None]:
# Implement the weight function
def get_weight(idf_score, weight_m=-1.0, weight_d=1.35):
    """
    Transform raw IDF score into a weight using power function.

    Args:
        idf_score: Raw IDF value
        weight_m: Offset parameter (shifts function left/right)
        weight_d: Shape parameter (controls steepness)

    Returns:
        float: Transformed weight (always >= 0)
    """
    x = idf_score - weight_m
    if x < 0:
        return 0.0
    return x**weight_d


# Visualize the weight function
idf_values = np.linspace(0, 6, 100)
weights = [get_weight(idf) for idf in idf_values]

plt.figure(figsize=(10, 6))
plt.plot(idf_values, weights, "b-", linewidth=2, label="Weight Function")
plt.axhline(y=0, color="k", linestyle="--", alpha=0.3)
plt.xlabel("IDF Score")
plt.ylabel("Weight")
plt.title("📊 IDF to Weight Transformation")
plt.grid(True, alpha=0.3)
plt.legend()

# Add some example points
example_words = ["i", "jest", "zwierzę", "kot", "miłość"]
example_idfs = [1.2, 1.5, 3.8, 4.2, 5.1]
example_weights = [get_weight(idf) for idf in example_idfs]

for word, idf, weight in zip(example_words, example_idfs, example_weights):
    plt.plot(idf, weight, "ro", markersize=8)
    plt.annotate(
        f"{word}\n({weight:.2f})",
        xy=(idf, weight),
        xytext=(10, 10),
        textcoords="offset points",
        fontsize=9,
        bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7),
    )

plt.tight_layout()
plt.show()

print("💡 Key insights:")
print(f"  • Common words like 'i' get a low weight ({get_weight(1.2):.2f})")
print(f"  • Specific words like 'miłość' get a high weight ({get_weight(5.1):.2f})")
print("  • The function smoothly transitions between extremes")

## 7. 🧮 Cosine Similarity and Vector Operations

### Understanding Cosine Similarity

Cosine similarity measures the **angle** between two vectors, not their magnitude. This is perfect for comparing word meanings because:

- **Direction matters more than magnitude** - "cat" and "kitten" should be similar regardless of vector lengths
- **Range is [-1, 1]** - easy to interpret (1 = same direction, 0 = orthogonal, -1 = opposite)
- **Robust to scaling** - adding the same concept multiple times doesn't change similarity

### Mathematical Definition

For vectors $\mathbf{a}$ and $\mathbf{b}$:

$$\text{cosine similarity} = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} = \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \sqrt{\sum_{i=1}^n b_i^2}}$$

### Why Cosine Similarity for Words?

1. **Semantic relationships**: Words with similar meanings have similar vector directions
2. **Scale invariance**: A word mentioned once vs. multiple times has the same semantic content
3. **Efficient computation**: Can be computed quickly using matrix operations

### Vector Normalization

Normalizing vectors to unit length simplifies cosine similarity:

$$\text{normalize}(\mathbf{v}) = \frac{\mathbf{v}}{||\mathbf{v}||} \quad \text{so that} \quad ||\text{normalize}(\mathbf{v})|| = 1$$

For normalized vectors: $\text{cosine similarity} = \mathbf{a} \cdot \mathbf{b}$ (just dot product!)


In [None]:
# Implement vector operations for word embeddings
def normalize(vec):
    """Normalize vector to unit length"""
    return vec / np.sqrt((vec**2).sum())


def vector_len(vec):
    """Calculate vector length (Euclidean norm)"""
    return np.sqrt((vec**2).sum())


def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(vec1, vec2) / (vector_len(vec1) * vector_len(vec2))


# Create example vectors to demonstrate
# Simulate word embeddings (in practice, these come from Word2Vec)
cat_vector = np.array([0.2, 0.8, 0.1, 0.9, 0.3])
dog_vector = np.array([0.3, 0.7, 0.2, 0.8, 0.4])  # Similar to cat
car_vector = np.array([0.9, 0.1, 0.8, 0.2, 0.7])  # Different from cat

print("🐱 Example: Comparing word vectors")
print(f"Cat vector: {cat_vector}")
print(f"Dog vector: {dog_vector}")
print(f"Car vector: {car_vector}")
print()

# Calculate similarities
cat_dog_sim = cosine_similarity(cat_vector, dog_vector)
cat_car_sim = cosine_similarity(cat_vector, car_vector)
dog_car_sim = cosine_similarity(dog_vector, car_vector)

print("🔍 Cosine similarities:")
print(f"Cat ↔ Dog: {cat_dog_sim:.3f}")
print(f"Cat ↔ Car: {cat_car_sim:.3f}")
print(f"Dog ↔ Car: {dog_car_sim:.3f}")
print()

# Demonstrate normalization
print("📏 Vector normalization:")
cat_norm = normalize(cat_vector)
print(f"Original cat vector length: {vector_len(cat_vector):.3f}")
print(f"Normalized cat vector length: {vector_len(cat_norm):.3f}")
print(f"Normalized cat vector: {cat_norm}")

# Show that cosine similarity with normalized vectors is just dot product
print(f"\n🧮 Cosine similarity: {cosine_similarity(cat_vector, dog_vector):.3f}")
print(
    f"Dot product of normalized: {np.dot(normalize(cat_vector), normalize(dog_vector)):.3f}"
)
print("✅ They're the same!")

## 8. 🏗️ Building the Riddle Solver Architecture

### Overall Strategy

Our riddle solver uses a **semantic similarity approach**:

1. **Preprocess** - cluster all dictionary definitions  
2. **Query** - cluster the riddle description
3. **Compare** - find dictionary words with most similar definition clusters
4. **Rank** - return top K most similar words

### Key Components

1. **Clustering Engine** - groups related words in definitions
2. **Similarity Calculator** - compares cluster sets using cosine similarity  
3. **Scoring Function** - combines multiple similarity signals
4. **Optimization Layer** - speeds up search with masking

### Clustering Algorithm Details

```python
def set_to_clusters(word_set):
    clusters = []
    for word in word_set:
        # Get word vector and IDF weight
        vector = get_word_embedding(word)
        weight = get_weight(base_idf[word])
        
        # Find most similar existing cluster
        best_similarity = -1
        best_cluster = -1
        
        for i, (cluster_centroid, cluster_weight) in enumerate(clusters):
            similarity = cosine_similarity(vector, cluster_centroid)
            if similarity > best_similarity:
                best_similarity = similarity
                best_cluster = i
        
        # Decide: add to existing cluster or create new one?
        if best_similarity > SIMILARITY_THRESHOLD:
            # Update existing cluster centroid (weighted average)
            old_centroid, old_weight = clusters[best_cluster]
            new_centroid = (old_centroid * old_weight + vector * weight) / (old_weight + weight)
            clusters[best_cluster] = [new_centroid, old_weight + weight]
        elif weight > IMPORTANCE_THRESHOLD:
            # Create new cluster for important words
            clusters.append([vector, weight])
    
    return clusters
```

### Scoring Strategy

For each dictionary word, we compare its definition clusters with riddle clusters:

$$\text{score} = \frac{\text{max similarities}_{\text{riddle} \to \text{def}} + \text{max similarities}_{\text{def} \to \text{riddle}}}{2}$$

This **bidirectional scoring** ensures that both the riddle matches the definition AND the definition matches the riddle.
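
A minimal sketch of this score is shown below. It assumes that the per-cluster maxima in each direction are combined by an unweighted mean, which the formula leaves open; the function name `bidirectional_score` and the toy cluster matrices are also assumptions for illustration.

```python
import numpy as np


def bidirectional_score(riddle_clusters, definition_clusters):
    """Average of best-match cosine similarities in both directions."""
    # Unit-normalize rows so that dot products are cosine similarities
    r = riddle_clusters / np.linalg.norm(riddle_clusters, axis=1, keepdims=True)
    d = definition_clusters / np.linalg.norm(definition_clusters, axis=1, keepdims=True)
    sims = r @ d.T  # shape: (num_riddle_clusters, num_def_clusters)

    riddle_to_def = sims.max(axis=1).mean()  # best definition match per riddle cluster
    def_to_riddle = sims.max(axis=0).mean()  # best riddle match per definition cluster
    return (riddle_to_def + def_to_riddle) / 2


riddle = np.array([[1.0, 0.0], [0.0, 1.0]])
good_definition = np.array([[2.0, 0.0], [0.0, 3.0]])   # covers both riddle concepts
poor_definition = np.array([[1.0, 0.0], [-1.0, 0.0]])  # covers only one of them
print(bidirectional_score(riddle, good_definition))
print(bidirectional_score(riddle, poor_definition))
```

The definition that covers both riddle concepts scores 1.0, while the one that misses a concept scores 0.5.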


In [None]:
# Implement a simplified clustering algorithm
def simple_clustering_demo(
    word_set, similarity_threshold=0.3, importance_threshold=0.7
):
    """
    Demonstrate clustering algorithm with simplified word vectors.
    In practice, this would use real Word2Vec embeddings.
    """
    # Simulate word embeddings (normally from Word2Vec)
    simulated_embeddings = {
        "zwierzę": np.array([0.8, 0.2, 0.1, 0.9, 0.3]),
        "ssak": np.array([0.7, 0.3, 0.2, 0.8, 0.4]),
        "kot": np.array([0.6, 0.4, 0.3, 0.7, 0.5]),
        "futro": np.array([0.5, 0.5, 0.4, 0.6, 0.6]),
        "miauczy": np.array([0.4, 0.6, 0.5, 0.5, 0.7]),
        "łapie": np.array([0.2, 0.8, 0.7, 0.3, 0.9]),
        "myszy": np.array([0.3, 0.7, 0.6, 0.4, 0.8]),
        "mały": np.array([0.1, 0.1, 0.1, 0.1, 0.1]),  # generic word
    }

    clusters = []

    print("🎯 Clustering process:")
    print(f"Similarity threshold: {similarity_threshold}")
    print(f"Importance threshold: {importance_threshold}")
    print()

    for word in sorted(word_set):  # sets are unordered; sort for reproducible clusters
        if word not in simulated_embeddings:
            print(f"⚠️  Skipping {word} (no embedding)")
            continue

        vector = normalize(simulated_embeddings[word])
        weight = get_weight(base_idf.get(word, 2.0))

        print(f"Processing '{word}' (weight: {weight:.2f})")

        if not clusters:
            # First cluster
            clusters.append([vector, weight])
            print(f"  → Created first cluster")
            continue

        # Find most similar cluster
        best_sim = -1
        best_idx = -1

        for i, (centroid, cluster_weight) in enumerate(clusters):
            sim = cosine_similarity(vector, centroid)
            if sim > best_sim:
                best_sim = sim
                best_idx = i

        print(f"  → Best similarity: {best_sim:.3f} with cluster {best_idx}")

        if best_sim > similarity_threshold:
            # Add to existing cluster
            old_centroid, old_weight = clusters[best_idx]
            new_centroid = (old_centroid * old_weight + vector * weight) / (
                old_weight + weight
            )
            clusters[best_idx] = [normalize(new_centroid), old_weight + weight]
            print(f"  → Added to cluster {best_idx}")
        elif weight > importance_threshold:
            # Create new cluster
            clusters.append([vector, weight])
            print(f"  → Created new cluster {len(clusters)-1}")
        else:
            print(f"  → Ignored (low importance)")

        print()

    print(f"🏁 Final result: {len(clusters)} clusters")
    return clusters


# Test with cat definition
cat_definition = {
    "zwierzę",
    "ssak",
    "kot",
    "futro",
    "miauczy",
    "łapie",
    "myszy",
    "mały",
}
clusters = simple_clustering_demo(cat_definition)

## 9. ⚡ Optimization Techniques

### Performance Challenges

Solving riddles requires comparing thousands of dictionary words against each riddle. Key bottlenecks:

1. **Vector operations** - many cosine similarity calculations
2. **Redundant comparisons** - similar words get similar scores
3. **Memory usage** - storing all embeddings and clusters

### Optimization Strategy 1: Masking Similar Words

**Idea**: If word A is very similar to word B, and word A doesn't match the riddle well, then word B probably won't either.

**Implementation**:
- Precompute similarity matrix between all dictionary words
- Create boolean mask: `mask[i,j] = True` if words i and j are similar
- During search, if word i gets low score, mark similar words as "masked"
- Skip masked words in future comparisons

```python
# Precompute similarity matrix
similarity_matrix = compute_word_similarities(all_words)
mask = similarity_matrix > MASK_THRESHOLD

# During riddle solving
for i, word in enumerate(all_words):
    if already_processed[i]:
        continue
    
    score = compute_score(riddle, word)
    if score < LOW_SCORE_THRESHOLD:
        # Mark similar words as processed
        already_processed |= mask[i]
```
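
A runnable sketch of the same idea on toy data follows. The thresholds, the four 2-dimensional word vectors, and the use of a single vector per word (instead of definition clusters) are all simplifying assumptions made for illustration.

```python
import numpy as np

MASK_THRESHOLD = 0.9       # hypothetical: "similar enough to skip together"
LOW_SCORE_THRESHOLD = 0.2  # hypothetical: "clearly not the answer"

# Toy vectors for four dictionary words; words 0 and 1 are near-duplicates
word_vectors = np.array([
    [1.0, 0.0],
    [0.995, 0.0999],
    [0.0, 1.0],
    [-1.0, 0.0],
])
word_vectors /= np.linalg.norm(word_vectors, axis=1, keepdims=True)
mask = (word_vectors @ word_vectors.T) > MASK_THRESHOLD

riddle_vector = np.array([0.0, 1.0])
already_processed = np.zeros(len(word_vectors), dtype=bool)
scored = {}

for i in range(len(word_vectors)):
    if already_processed[i]:
        continue
    score = float(word_vectors[i] @ riddle_vector)
    scored[i] = score
    already_processed[i] = True
    if score < LOW_SCORE_THRESHOLD:
        already_processed |= mask[i]  # skip neighbours of a poor match

print(sorted(scored))
```

Word 1 is never scored: word 0 matched the riddle poorly, so its near-duplicate is skipped. The risk of this shortcut is that a correct answer can be masked by a poorly scoring neighbour, so the thresholds need tuning.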

### Optimization Strategy 2: Matrix Operations

**Idea**: Use numpy's optimized matrix operations instead of loops.

**Before**: Compare clusters one by one
```python
for riddle_cluster in riddle_clusters:
    for def_cluster in definition_clusters:
        similarity = cosine_similarity(riddle_cluster, def_cluster)
```

**After**: Compute all similarities at once
```python
# Shape: (num_riddle_clusters, num_def_clusters)
# Rows must be unit-normalized for these dot products to equal cosine similarities
similarity_matrix = riddle_clusters @ definition_clusters.T
```

### Optimization Strategy 3: Dimension Reduction

**Idea**: Use every Nth element of word vectors to reduce computation.

```python
# Instead of full 300-dimensional vector
full_vector = model.wv['word']

# Use every 2nd element → 150 dimensions
reduced_vector = model.wv['word'][::2]
```

This trades some accuracy for significant speed improvement!
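
The effect of this slicing on cosine similarity can be checked with a quick experiment. The vectors below are random stand-ins for two related 300-dimensional embeddings, not real Word2Vec vectors, so the numbers only illustrate the mechanics; the helper name `cosine` is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for two related 300-dimensional embeddings
base = rng.normal(size=300)
other = base + 0.5 * rng.normal(size=300)

full_sim = cosine(base, other)
reduced_sim = cosine(base[::2], other[::2])  # every 2nd element: 150 dimensions
print(f"full: {full_sim:.3f}, reduced: {reduced_sim:.3f}")
print(base[::2].shape)
```

On real embeddings the gap between the two values should be measured on held-out riddles before committing to a reduction factor.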


In [None]:
# Demonstrate matrix operations optimization
def compare_performance():
    """Compare loop-based vs matrix-based similarity computation"""

    # Create example cluster matrices
    riddle_clusters = np.random.randn(3, 10)  # 3 clusters, 10 dimensions each
    def_clusters = np.random.randn(5, 10)  # 5 clusters, 10 dimensions each

    # Normalize for fair comparison
    riddle_clusters = np.array([normalize(c) for c in riddle_clusters])
    def_clusters = np.array([normalize(c) for c in def_clusters])

    print("🔄 Method 1: Loop-based computation")
    import time

    # perf_counter has much finer resolution than time.time for short runs
    start_time = time.perf_counter()
    similarities_loop = np.zeros((3, 5))
    for i in range(3):
        for j in range(5):
            similarities_loop[i, j] = np.dot(riddle_clusters[i], def_clusters[j])
    loop_time = time.perf_counter() - start_time

    print("⚡ Method 2: Matrix-based computation")
    start_time = time.perf_counter()
    similarities_matrix = riddle_clusters @ def_clusters.T
    matrix_time = time.perf_counter() - start_time

    print("\nResults comparison:")
    print(f"Loop method time: {loop_time*1000:.4f} ms")
    print(f"Matrix method time: {matrix_time*1000:.4f} ms")
    if matrix_time > 0:  # guard against division by zero on very fast runs
        print(f"Speedup: {loop_time/matrix_time:.1f}x")
    print(f"Results identical: {np.allclose(similarities_loop, similarities_matrix)}")

    return similarities_matrix


# Run the comparison
similarity_matrix = compare_performance()
print(f"\n📊 Similarity matrix shape: {similarity_matrix.shape}")
print("Matrix values:")
print(similarity_matrix.round(3))

## 10. 🎮 Interactive Exercises

Now let's test your understanding with some hands-on exercises! Work through these to solidify the concepts.

### Exercise 1: Word Similarity Exploration 🔍

**Your Task**: Use the functions we've built to explore relationships between Polish words.

**Instructions**: 
1. Choose three Polish words that you think should be semantically similar
2. Choose one word that should be different from the others
3. Calculate pairwise cosine similarities
4. Explain the results


In [None]:
# Exercise 1: Complete this code
print("🎯 Exercise 1: Word Similarity Exploration")
print("-" * 50)

# TODO: Replace these with your chosen words
similar_words = ["kot", "pies", "królik"]  # Three similar words
different_word = "samochód"  # One different word

all_exercise_words = similar_words + [different_word]

# Create simulated embeddings for exercise
exercise_embeddings = {
    "kot": np.array([0.8, 0.2, 0.1, 0.9, 0.3]),
    "pies": np.array([0.7, 0.3, 0.2, 0.8, 0.4]),
    "królik": np.array([0.6, 0.4, 0.3, 0.7, 0.5]),
    "samochód": np.array([0.1, 0.9, 0.8, 0.2, 0.6]),
}

print("Your words:", all_exercise_words)
print("\n📊 Pairwise similarities:")

# TODO: Calculate all pairwise similarities
for i, word1 in enumerate(all_exercise_words):
    for j, word2 in enumerate(all_exercise_words):
        if i < j:  # Avoid duplicates
            vec1 = normalize(exercise_embeddings[word1])
            vec2 = normalize(exercise_embeddings[word2])
            similarity = cosine_similarity(vec1, vec2)
            print(f"{word1} ↔ {word2}: {similarity:.3f}")

print("\n❓ Questions to think about:")
print("1. Which pairs have highest similarity?")
print("2. Which pairs have lowest similarity?")
print("3. Do the results match your intuition?")
print("4. What might cause unexpected results?")

### Exercise 2: Build a Mini Riddle Solver 🧩

**Your Task**: Create a simple riddle solver that can handle basic cases.

**Scenario**: You're given a riddle and three possible answers. Find the best match!

**Instructions**:
1. Implement a function to compare riddle words with definition words
2. Score each possible answer
3. Return the best match


In [None]:
# Exercise 2: Build a Mini Riddle Solver
print("🧩 Exercise 2: Mini Riddle Solver")
print("-" * 50)

# Sample riddle and possible answers
riddle_words = {"duże", "zwierzę", "trąba", "szary", "afryka"}
candidate_answers = {
    "słoń": {"zwierzę", "duże", "ssak", "trąba", "afryka", "szary"},
    "kot": {"zwierzę", "mały", "futro", "miauczy", "pazury"},
    "samochód": {"pojazd", "koła", "silnik", "transport", "benzyna"},
}

print(f"🎯 Riddle words: {riddle_words}")
print(f"🤔 Possible answers: {list(candidate_answers.keys())}")
print()


def mini_riddle_solver(riddle_words, candidate_answers):
    """
    Simple riddle solver using word overlap and IDF weighting.

    Args:
        riddle_words: set of words in the riddle
        candidate_answers: dict mapping answer -> definition words

    Returns:
        list of (answer, score) pairs, sorted by score descending
    """
    scores = []

    for answer, definition in candidate_answers.items():
        print(f"Evaluating '{answer}'...")

        # Find word overlaps
        common_words = riddle_words.intersection(definition)
        print(f"  Common words: {common_words}")

        # Calculate weighted score using IDF
        score = 0
        for word in common_words:
            weight = get_weight(base_idf.get(word, 3.0))  # Default IDF if not found
            score += weight
            print(f"    '{word}': IDF weight = {weight:.2f}")

        # Add bonus for coverage
        coverage = len(common_words) / len(riddle_words)
        score += coverage * 2  # Bonus multiplier
        print(f"  Coverage bonus: {coverage:.2f} * 2 = {coverage * 2:.2f}")
        print(f"  Total score: {score:.2f}")
        print()

        scores.append((answer, score))

    # Sort by score descending
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores


# Solve the riddle
results = mini_riddle_solver(riddle_words, candidate_answers)

print("🏆 Final ranking:")
for rank, (answer, score) in enumerate(results, 1):
    print(f"{rank}. {answer}: {score:.2f}")

print(f"\n✅ Best answer: {results[0][0]}")
print("Does this make sense? Why or why not?")

## 11. 🚀 Complete Solution Walkthrough

Now let's walk through the key components of the complete solution, connecting all the concepts we've learned.

### Solution Architecture Overview

The complete solution has these main components:

1. **Configuration Parameters** - tunable hyperparameters
2. **Helper Functions** - vector operations and utilities  
3. **Clustering Engine** - groups words by semantic similarity
4. **Masking System** - optimization for similar words
5. **Precomputation** - processes all data upfront
6. **Main Solver** - answers riddles using similarity scoring

### Key Parameters from the Solution

```python
# Clustering parameters
CLUSTERIZATION_A = 0.3      # Similarity threshold for joining clusters
CLUSTERIZATION_W = 0.7      # Minimum IDF weight for new clusters
VECTOR_R = 2                # Vector dimension reduction factor
NORM_F = True               # Whether to normalize vectors

# Masking parameters  
MASK_T1 = 0.75              # High similarity threshold for masking
MASK_T2 = 0.45              # Low similarity threshold for masking

# IDF weight function parameters
WEIGHT_M = -1.0             # Offset parameter
WEIGHT_D = 1.35             # Shape parameter

# Scoring parameters
SCR_SNDW_1 = 0.3            # Weight for second direction similarity
SCR_SNDW_2 = 0.3            # Weight for second direction normalization
```

### Algorithm Flow

1. **Precomputation Phase**:
   - Load word definitions and embeddings
   - Cluster all dictionary definitions
   - Compute similarity masks between words
   - Optimize word ordering

2. **Query Phase**:
   - Cluster the riddle words
   - Compare riddle clusters with each dictionary word's definition clusters
   - Apply masking to skip similar words
   - Score using bidirectional similarity
   - Return top K answers
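
The final "return top K" step can be sketched with the standard library's `heapq.nlargest`, which avoids sorting the full candidate list. The function name `top_k_answers` and the scores below are hypothetical values for illustration.

```python
import heapq


def top_k_answers(scores, k=3):
    """Return the k highest-scoring (word, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical scores for a handful of candidate words
toy_scores = {"słoń": 0.91, "kot": 0.55, "samochód": 0.12, "pies": 0.58}
print(top_k_answers(toy_scores, k=2))
```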


In [None]:
# Demonstrate the core scoring logic from the solution
def demonstrate_bidirectional_scoring():
    """Show how the solution scores riddle-definition similarity"""

    print("🎯 Bidirectional Scoring Demonstration")
    print("=" * 50)

    # Simulate riddle clusters (3 clusters, 5 dimensions each)
    riddle_clusters = np.array(
        [
            [0.8, 0.2, 0.1, 0.9, 0.3],  # Animal concept
            [0.6, 0.4, 0.3, 0.7, 0.5],  # Size concept
            [0.4, 0.6, 0.5, 0.5, 0.7],  # Geography concept
        ]
    )
    riddle_weights = np.array([3.0, 2.0, 2.5])  # IDF weights

    # Simulate definition clusters (4 clusters, 5 dimensions each)
    definition_clusters = np.array(
        [
            [0.7, 0.3, 0.2, 0.8, 0.4],  # Animal concept (similar)
            [0.5, 0.5, 0.4, 0.6, 0.6],  # Size concept (similar)
            [0.1, 0.9, 0.8, 0.2, 0.6],  # Transport concept (different)
            [0.3, 0.7, 0.6, 0.4, 0.8],  # Action concept (different)
        ]
    )
    definition_weights = np.array([2.8, 1.8, 1.2, 1.5])  # IDF weights

    print("Riddle clusters shape:", riddle_clusters.shape)
    print("Definition clusters shape:", definition_clusters.shape)
    print()

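    # Normalize each cluster vector to unit length so that the dot products
    # below are true cosine similarities
    riddle_unit = riddle_clusters / np.linalg.norm(riddle_clusters, axis=1, keepdims=True)
    definition_unit = definition_clusters / np.linalg.norm(
        definition_clusters, axis=1, keepdims=True
    )

    # Compute similarity matrix (cosine similarities between all cluster pairs)
    # This is the key matrix operation from the solution
    similarity_matrix = riddle_unit @ definition_unit.T

    print("📊 Cluster Similarity Matrix:")
    print("   ", " ".join(f"Def{i}" for i in range(definition_clusters.shape[0])))
    for i, row in enumerate(similarity_matrix):
        print(f"R{i} ", " ".join(f"{val:.2f}" for val in row))
    print()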

    # Direction 1: Riddle → Definition (for each riddle cluster, find best definition match)
    riddle_to_def = similarity_matrix.max(axis=1)  # Max along definition axis
    score_1 = np.dot(riddle_to_def, riddle_weights) / riddle_weights.sum()

    print("Direction 1 (Riddle → Definition):")
    print("  Max similarities per riddle cluster:", riddle_to_def)
    print("  Weighted score:", score_1)

    # Direction 2: Definition → Riddle (for each definition cluster, find best riddle match)
    def_to_riddle = similarity_matrix.max(axis=0)  # Max along riddle axis
    score_2 = np.dot(def_to_riddle, definition_weights) / definition_weights.sum()

    print("\nDirection 2 (Definition → Riddle):")
    print("  Max similarities per definition cluster:", def_to_riddle)
    print("  Weighted score:", score_2)

    # Final combined score (like in the solution)
    final_score = (score_1 + score_2 * 0.3) / (1 + 0.3)  # SCR_SNDW_1 = 0.3

    print(f"\n🏆 Final combined score: {final_score:.3f}")
    print("\n💡 Key insights:")
    print("  • Higher scores = better semantic match")
    print("  • Bidirectional scoring ensures both riddle and definition are well-covered")
    print("  • IDF weighting emphasizes important/rare concepts")

    return similarity_matrix, final_score


# Run the demonstration
similarity_matrix, score = demonstrate_bidirectional_scoring()

## 12. 📖 Summary and Next Steps

### 🎯 What You've Learned

Congratulations! You've now mastered the key concepts needed to build a sophisticated word riddle solver:

1. **Word Embeddings** - representing words as vectors that capture semantic relationships
2. **Cosine Similarity** - measuring how similar word meanings are
3. **Clustering** - grouping related words to reduce noise and improve matching
4. **TF-IDF Weighting** - emphasizing rare, informative words over common ones
5. **Matrix Operations** - optimizing computations for speed
6. **Bidirectional Scoring** - ensuring both riddle and definitions are well-matched
7. **Masking Optimization** - skipping similar words to improve performance

### 🧠 Key Insights

- **Semantic similarity is key** - word embeddings let us capture meaning relationships
- **Clustering reduces noise** - grouping similar words improves signal-to-noise ratio
- **IDF weighting matters** - rare words are more informative than common ones
- **Bidirectional comparison is crucial** - both directions must match well
- **Optimization is essential** - techniques like masking make real-time performance possible

### 🚀 Building Your Solution

Now you're ready to implement your own riddle solver! Key components:

1. **Load and preprocess data** - word definitions, embeddings, IDF scores
2. **Implement clustering** - group words by semantic similarity
3. **Create similarity functions** - compare riddle clusters with definition clusters
4. **Add optimization** - masking, matrix operations, dimension reduction
5. **Tune parameters** - clustering thresholds, scoring weights, etc.
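When tuning parameters, it helps to have a single number to compare runs. The sketch below uses mean reciprocal rank (MRR), which is one common choice and an assumption here, since the tutorial does not prescribe a metric. The function name, the parameter values, and the solver outputs are all hypothetical placeholders.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """Average of 1/rank of the correct word; counts 0 when it is not in the ranking."""
    total = 0.0
    for ranking, gold in zip(ranked_answers, gold_answers):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_answers)


# Hypothetical solver outputs for three riddles under two threshold settings
gold = ["kot", "rower", "pies"]
runs = {
    0.3: [["kot", "pies"], ["pies", "rower"], ["pies", "kot"]],
    0.5: [["pies", "kot"], ["rower", "pies"], ["kot", "rower"]],
}
scores = {threshold: mean_reciprocal_rank(r, gold) for threshold, r in runs.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

In practice each entry in `runs` would come from running your solver on a held-out set of riddles with that parameter value.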

### 🔗 Useful Resources

- **Word2Vec Paper**: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
- **Cosine Similarity**: [Understanding the math behind similarity measures](https://en.wikipedia.org/wiki/Cosine_similarity)
- **TF-IDF**: [Term Frequency - Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- **K-means Clustering**: [Understanding clustering algorithms](https://en.wikipedia.org/wiki/K-means_clustering)
- **NumPy Documentation**: [Efficient numerical operations](https://numpy.org/doc/)

### 🎮 Additional Challenges

Ready for more? Try these extensions:
- **Experiment with different clustering algorithms** (K-means, DBSCAN, hierarchical)
- **Try other similarity or distance measures** (Euclidean distance, Manhattan distance)
- **Implement attention mechanisms** for better cluster comparison
- **Add word sense disambiguation** for polysemous words
- **Create a web interface** for interactive riddle solving
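As a starting point for the similarity-measure challenge, here is a small sketch comparing cosine similarity with two distance measures on toy vectors. The helper names are hypothetical. Keep in mind that distances run in the opposite direction to similarities: smaller means closer.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return float(np.linalg.norm(a - b))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return float(np.abs(a - b).sum())


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

cos = cosine_similarity(a, b)
euc = euclidean_distance(a, b)
man = manhattan_distance(a, b)
print(f"cosine={cos:.3f} euclidean={euc:.3f} manhattan={man:.3f}")
```

The two vectors point in the same direction, so cosine similarity treats them as identical while both distances report a gap. If you switch to a distance measure, normalizing the embeddings first keeps vector length from dominating the comparison.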

Good luck building your riddle solver! 🧩✨
