In [1]:
import sys
print(sys.executable)  # Should show the path to the active conda environment (here: ml)

import torch
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

/opt/miniconda3/envs/ml/bin/python


[nltk_data] Downloading package punkt to /Users/chaklader/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/chaklader/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chaklader/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/chaklader/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/chaklader/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
sample_text = "The quick brown fox jumps over the lazy dog."
tokens = sample_text.lower().split()

print(tokens)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']



---
##### Word Embeddings Explained

Word embeddings like GloVe are dense vector representations of words where:

- Each word is mapped to a fixed-length vector of real numbers
- The vectors capture semantic relationships between words
- Words with similar meanings have vectors that are close in the vector space
- The vector dimensions implicitly represent different semantic aspects of words

GloVe (Global Vectors for Word Representation) specifically is trained to capture global word-word co-occurrence statistics from a corpus. The resulting embeddings have interesting properties:

- Words that appear in similar contexts have similar embeddings
- Vector arithmetic works meaningfully: e.g., vector("king") - vector("man") + vector("woman") ≈ vector("queen")
- The distance between word vectors correlates with semantic similarity

The file naming convention `glove.6B.50d.txt` indicates:
- `6B`: Trained on 6 billion tokens
- `50d`: Each word is represented by a 50-dimensional vector

These pre-trained embeddings allow you to convert text data into numerical representations that machine learning models can process while preserving semantic relationships between words.

##### How These Components Work Together

In a typical NLP pipeline:

1. The `preprocess_text` function would clean and tokenize raw text
2. The tokens would be converted to embeddings using the loaded embedding dictionary
3. These embeddings would then be fed into a machine learning model

For example, after preprocessing a sentence, you might average the embeddings of all its words to get a sentence representation, or you might create sequences of embeddings to feed into an LSTM or other neural network.

This approach is fundamental to many NLP tasks like sentiment analysis, text classification, and question answering.
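
The three pipeline steps can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `toy_embeddings` and `toy_preprocess` are hypothetical stand-ins for the loaded GloVe dictionary and for `preprocess_text`, and the 3-dimensional vectors are made up.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real GloVe vectors have 50 or more dimensions.
toy_embeddings = {
    "great": np.array([0.9, 0.1, 0.3]),
    "movie": np.array([0.2, 0.7, 0.5]),
}

def toy_preprocess(text: str) -> list[str]:
    """Stand-in for preprocess_text: lowercase and split on whitespace."""
    return text.lower().split()

# Step 1: clean and tokenize.
tokens = toy_preprocess("Great movie")

# Step 2: look up an embedding for every token that has one.
vectors = [toy_embeddings[t] for t in tokens if t in toy_embeddings]

# Step 3: build the model input, either one averaged vector per sentence
# or one row per token for a sequence model such as an LSTM.
sentence_vector = np.mean(vectors, axis=0)
sequence_matrix = np.stack(vectors)

print(sentence_vector.shape)  # (3,)
print(sequence_matrix.shape)  # (2, 3)
```

The averaged form discards word order; the stacked form keeps it at the cost of a variable number of rows per sentence.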

----

# Understanding Word Embeddings: From Text to Mathematical Representations

## The Fundamental Concept

Word embeddings transform the discrete, symbolic nature of human language into continuous mathematical vectors that computers can process effectively. Rather than treating words as arbitrary symbols, embeddings capture semantic relationships through dense numerical representations where words with similar meanings occupy nearby positions in high-dimensional space.

Consider how humans understand word relationships: we intuitively know that "dog" and "puppy" are more related than "dog" and "computer." Word embeddings encode these relationships mathematically, enabling machines to perform similar semantic reasoning.

## Dense vs. Sparse Representations

Traditional text processing often uses sparse representations like one-hot encoding, where each word is represented by a vector with exactly one element set to 1 and all others set to 0. For a vocabulary of 10,000 words, each word becomes a 10,000-dimensional vector with 9,999 zeros and one 1.

Word embeddings replace this inefficient representation with dense vectors of real numbers in which nearly every component is non-zero. A 50-dimensional embedding carries far more semantic information than a 10,000-dimensional one-hot vector, which encodes nothing beyond the word's identity, and it stores 200 times fewer values per word.

**Comparison Example**:
- One-hot representation of "king": [0, 0, 0, 1, 0, 0, 0, ...] (9,999 zeros, one 1)
- Dense embedding of "king": [0.2, -0.1, 0.8, 0.3, -0.4, 0.7, 0.1, -0.5, ...]
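
The size difference can be checked with a short sketch. The index of "king" and the random dense vector are assumptions for illustration only; a real embedding would come from the GloVe file.

```python
import numpy as np

vocab_size = 10_000
king_index = 3  # hypothetical position of "king" in the vocabulary

# Sparse one-hot vector: a single 1 and 9,999 zeros.
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[king_index] = 1.0

# Random stand-in for a learned 50-dimensional embedding.
rng = np.random.default_rng(0)
dense = rng.normal(size=50).astype(np.float32)

print(int(one_hot.sum()), int(np.count_nonzero(one_hot == 0)))  # 1 9999
print(one_hot.nbytes, dense.nbytes)                             # 40000 200
```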

## The GloVe Algorithm and Training Process

GloVe (Global Vectors for Word Representation) learns embeddings by analyzing global word co-occurrence statistics across large text corpora. The algorithm constructs a word-word co-occurrence matrix where each entry represents how frequently two words appear together within a specified context window.

The training objective is a weighted least-squares loss: for each word pair, it penalizes the squared difference between the dot product of the two word vectors (plus per-word bias terms) and the logarithm of the pair's co-occurrence count. Words with similar co-occurrence patterns therefore end up with similar vector representations.
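
For reference, the GloVe objective is usually written as below. The symbols are not defined elsewhere in this notebook: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that is zero for pairs that never co-occur and caps the influence of very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$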

**Co-occurrence Example**:
In the sentence "The quick brown fox jumps over the lazy dog," with a context window of 2, the word "fox" co-occurs with "brown," "jumps," "quick," and "over." Words that consistently appear in similar contexts across millions of sentences develop similar embeddings.
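
The window-based counting in this example can be reproduced with a few lines. This is a simplified sketch of symmetric, unweighted counting on one sentence; it is not the full GloVe preprocessing.

```python
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # words to the left and right that count as context

cooccurrence = Counter()
for i, word in enumerate(sentence):
    lo = max(0, i - window)
    hi = min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooccurrence[(word, sentence[j])] += 1

# Context words seen within two positions of "fox".
fox_neighbors = sorted(ctx for (w, ctx) in cooccurrence if w == "fox")
print(fox_neighbors)  # ['brown', 'jumps', 'over', 'quick']
```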

## Numerical Properties and Semantic Relationships

The most remarkable property of word embeddings is their ability to encode semantic relationships through vector arithmetic. Mathematical operations on embeddings often yield semantically meaningful results.

**Classic Analogy Example**:
- vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This works because the vector difference between "king" and "man" captures the concept of royalty, which when added to "woman" points toward the feminine royal equivalent.

**Dimensional Analysis**:
Using 50-dimensional GloVe embeddings, suppose we have:
- king: [0.2, -0.1, 0.8, 0.3, ..., 0.1]
- man: [0.1, 0.2, 0.4, 0.1, ..., 0.2]  
- woman: [0.1, 0.2, 0.4, 0.7, ..., 0.3]

The arithmetic operation produces a vector that should be closest to:
- queen: [0.2, -0.1, 0.8, 0.9, ..., 0.2]
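
The arithmetic can be verified on the components that are written out. Only the first four values of each illustrative vector are used here; the elided components are left out rather than guessed.

```python
import numpy as np

# First four components of the illustrative vectors above.
king = np.array([0.2, -0.1, 0.8, 0.3])
man = np.array([0.1, 0.2, 0.4, 0.1])
woman = np.array([0.1, 0.2, 0.4, 0.7])
queen = np.array([0.2, -0.1, 0.8, 0.9])

result = king - man + woman
print(np.allclose(result, queen))  # True
```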

## Semantic Clustering and Distance Metrics

Words with similar meanings cluster together in the embedding space. The cosine similarity between two vectors measures their semantic relatedness, with values ranging from -1 (opposite directions) to 1 (same direction).

**Similarity Examples** (illustrative values, not measured from a specific model):
- cosine_similarity("dog", "puppy") ≈ 0.8 (highly similar)
- cosine_similarity("dog", "computer") ≈ 0.1 (weakly related)
- cosine_similarity("hot", "cold"): typically positive rather than negative, because antonyms tend to appear in similar contexts

These relationships emerge naturally from training data without explicit programming of semantic rules.
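
A minimal sketch of the metric itself, using made-up 2-dimensional vectors to show the three boundary cases (same direction, orthogonal, opposite):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; assumes neither is the zero vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Because the score depends only on direction, scaling a vector by a positive constant leaves it unchanged.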

## Dimensionality and Information Encoding

The choice of embedding dimensions represents a trade-off between expressiveness and computational efficiency. Common dimensions include 50, 100, 200, and 300, with each serving different purposes:

**50-dimensional embeddings**: Capture basic semantic relationships efficiently, suitable for smaller vocabularies and computational constraints.

**300-dimensional embeddings**: Provide richer representations capable of encoding subtle semantic distinctions and complex relationships.

Each dimension implicitly represents different semantic aspects, though these aspects aren't directly interpretable. Some dimensions might capture concepts like animacy, size, emotional valence, or grammatical properties, but these associations emerge implicitly during training.
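
The efficiency side of the trade-off can be estimated with simple arithmetic. The vocabulary size of 400,000 words and 4-byte (float32) storage are assumptions for this sketch; the actual footprint depends on the file and the dtype used when loading.

```python
vocab_size = 400_000   # assumed vocabulary size
bytes_per_value = 4    # assumes float32 storage

# Raw size of the embedding table for each common dimensionality, in megabytes.
footprint_mb = {
    dim: vocab_size * dim * bytes_per_value / 1e6
    for dim in (50, 100, 200, 300)
}

for dim, mb in footprint_mb.items():
    print(f"{dim:>3}d: {mb:.0f} MB")
```

Memory grows linearly with the number of dimensions, so 300-dimensional vectors need six times the space of 50-dimensional ones.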

## Training Corpus and Statistical Foundation

The "6B" in "glove.6B.50d.txt" indicates training on 6 billion tokens, representing an enormous collection of text from sources like Wikipedia and news articles. This massive scale ensures that statistical patterns reflect genuine language usage rather than idiosyncratic patterns from smaller datasets.

The quality of embeddings depends heavily on corpus diversity and size. Larger, more diverse corpora produce embeddings that generalize better across different domains and capture more subtle semantic relationships.

## Practical Applications in NLP Pipelines

Word embeddings serve as the foundation for numerous NLP applications. In sentiment analysis, embeddings allow models to recognize that "excellent" and "outstanding" convey similar positive sentiment even if they weren't seen together during training.

**Sentence Representation Example**:
For the sentence "The movie was excellent," individual word embeddings might be:
- "the": [0.1, 0.2, -0.1, ...]
- "movie": [0.3, -0.2, 0.5, ...]  
- "was": [0.0, 0.1, 0.2, ...]
- "excellent": [0.8, 0.6, 0.3, ...]

A simple sentence representation averages these vectors, though more sophisticated methods use weighted averages or sequential processing through neural networks.
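
Both the plain and the weighted average can be sketched on the three components shown for each word. The weights are hypothetical; they simply down-weight the function words "the" and "was".

```python
import numpy as np

words = ["the", "movie", "was", "excellent"]
vectors = np.array([
    [0.1, 0.2, -0.1],   # the
    [0.3, -0.2, 0.5],   # movie
    [0.0, 0.1, 0.2],    # was
    [0.8, 0.6, 0.3],    # excellent
])

# Plain mean: every word contributes equally.
mean_vector = vectors.mean(axis=0)

# Weighted mean with hypothetical per-word weights.
weights = np.array([0.1, 1.0, 0.1, 1.0])
weighted_vector = np.average(vectors, axis=0, weights=weights)

print(mean_vector)      # approximately [0.3, 0.175, 0.225]
print(weighted_vector)
```

With these weights the result moves toward "movie" and "excellent", which carry most of the sentence's meaning.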

## Limitations and Considerations

Word embeddings have several important limitations. They struggle with polysemy (words with multiple meanings), as "bank" (financial institution) and "bank" (river edge) receive the same embedding despite different contexts. They also reflect biases present in training data, potentially perpetuating stereotypes or unfair associations.

Additionally, embeddings are static representations that don't adapt to context within specific sentences. Recent advances like BERT and GPT address some limitations through contextualized embeddings, but classical embeddings like GloVe remain valuable for many applications due to their simplicity and computational efficiency.
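
The polysemy limitation follows directly from how a static lookup works, as this sketch shows. The vector for "bank" is made up, and `embed` is a hypothetical helper.

```python
import numpy as np

# Hypothetical static embedding table with a single entry.
toy_embeddings = {"bank": np.array([0.4, -0.2, 0.7])}

def embed(tokens):
    """Look up each known token; the surrounding words play no role."""
    return [toy_embeddings[t] for t in tokens if t in toy_embeddings]

finance = embed("she deposited cash at the bank".split())
river = embed("they fished from the river bank".split())

# The same vector comes back for both senses of "bank".
print(np.array_equal(finance[0], river[0]))  # True
```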

## Integration with Machine Learning Models

Word embeddings bridge the gap between human language and machine learning algorithms. They provide a numerical foundation that enables traditional machine learning methods to process text data while preserving semantic relationships that rule-based approaches often miss.

The mathematical properties of embeddings - their ability to encode relationships through vector arithmetic, cluster semantically similar concepts, and provide dense representations - make them indispensable tools in modern natural language processing systems.
   
----

#### Sample GloVe embeddings:

| Word | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | ... | Dimension 48 | Dimension 49 | Dimension 50 |
|------|-------------|-------------|-------------|-------------|-------------|-----|--------------|--------------|--------------|
| on | 0.30045 | 0.25006 | -0.16692 | 0.1923 | 0.026921 | ... | -0.07131 | 0.23052 | -0.51939 |
| is | 0.6185 | 0.64254 | -0.46552 | 0.3757 | 0.74838 | ... | -0.27557 | 0.30899 | 0.48497 |
| was | 0.086888 | -0.19416 | -0.24267 | -0.33391 | 0.56731 | ... | -0.77 | 0.3945 | -0.16937 |
| said | 0.38973 | -0.2121 | 0.51837 | 0.80136 | 1.0336 | ... | 0.86119 | 0.1415 | 1.2018 |
| with | 0.25616 | 0.43694 | -0.11889 | 0.20345 | 0.41959 | ... | -0.07573 | -0.25868 | -0.39339 |
| he | -0.20092 | -0.060271 | -0.61766 | -0.8444 | 0.5781 | ... | -0.33317 | -0.041659 | -0.013171 |

##### Key Observations from the Sample

**Vector Diversity**: Each word has a unique 50-dimensional vector. In the columns shown, values fall between about -0.84 and 1.20; across all 50 dimensions the range is wider, approximately -3 to +4, reflecting the continuous nature of the embedding space.

**Semantic Relationships**: "is" and "was" (both forms of "be") might be expected to look alike and "he" to differ, but a handful of columns is too little to judge by eye. Cosine similarity over all 50 dimensions is the reliable way to compare them.

**Dimensionality**: All vectors have exactly 50 dimensions as specified by the "50d" in the filename, with each dimension capturing different aspects of word meaning and context.

**Value Distribution**: The embeddings contain both positive and negative values, allowing for rich representation of semantic relationships through vector arithmetic operations.

This table format clearly shows how each word maps to its corresponding dense vector representation that can be used in machine learning models.

----


In [4]:
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def preprocess_text(text: str) -> list:
    """
    Preprocess text for NLP tasks through normalization, tokenization, and stopword removal.
    
    This function implements a standard text preprocessing pipeline commonly used for
    machine learning and natural language processing tasks. The preprocessing steps
    help normalize text variations and reduce noise by removing common but less
    informative words.
    
    Processing steps:
    1. Case normalization: Converts all text to lowercase for consistency
    2. Punctuation removal: Eliminates common punctuation marks that may not contribute to meaning
    3. Tokenization: Splits text into individual word tokens using NLTK's word_tokenize
    4. Stopword filtering: Removes common English words (the, is, at, etc.) that typically
       carry little semantic information for classification tasks
    
    Args:
        text (str): Raw input text string to be preprocessed.
            Can contain mixed case, punctuation, and common stopwords.
            
    Returns:
        list: List of cleaned and filtered tokens ready for further NLP processing.
            Tokens are lowercase strings with punctuation and stopwords removed.
    
    Example:
        >>> preprocess_text("Hello! This is a great movie.")
        ['hello', 'great', 'movie']
        
    Note:
        Requires NLTK data: the punkt tokenizer models (punkt_tab on recent
        NLTK releases) and the stopwords corpus. Run nltk.download('punkt'),
        nltk.download('punkt_tab'), and nltk.download('stopwords') if they
        are not available.
    """
    # Lowercase so that "Hello" and "hello" map to the same token.
    text = text.lower()

    # Strip common punctuation: .,;:!?-"'()[]{}
    # Because this runs before tokenization, the tokenizer never sees
    # apostrophes: "don't" reaches it as "dont". Extend the character set
    # for domain-specific symbols if needed.
    text = ''.join(c for c in text if c not in '.,;:!?-"\'()[]{}')

    # Split the cleaned text into word tokens with NLTK's word_tokenize.
    tokens = word_tokenize(text)

    # Drop English stopwords (the, is, at, which, on, ...), which usually
    # carry little discriminative information for classification tasks.
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return filtered_tokens


def load_glove_model(file: str) -> dict:
    """
    Load pre-trained GloVe word embeddings from a text file into memory.
    
    GloVe (Global Vectors for Word Representation) embeddings provide dense vector
    representations of words trained on large corpora. These embeddings capture
    semantic relationships between words based on their co-occurrence patterns
    in natural language text.
    
    This function parses the standard GloVe file format where each line contains:
    word embedding_dim1 embedding_dim2 ... embedding_dimN
    
    The resulting embeddings can be used to initialize embedding layers in neural
    networks or for similarity computations between words.
    
    Args:
        file (str): Path to the GloVe embeddings file.
            Common files include glove.6B.50d.txt, glove.6B.100d.txt, etc.
            The file naming convention `glove.6B.50d.txt` indicates:
            - `6B`: Trained on 6 billion tokens
            - `50d`: Each word is represented by a 50-dimensional vector
            Format: each line contains word followed by space-separated float values.
            
    Returns:
        dict: Dictionary mapping words (str) to their embedding vectors (numpy.ndarray).
            Keys are vocabulary words, values are dense vector representations.
            Vector dimensions match the GloVe file specification (e.g., 50, 100, 200, 300).
    
    Raises:
        FileNotFoundError: If the specified GloVe file path doesn't exist.
        ValueError: If file format is incorrect or contains invalid numeric values.
        
    Example:
        >>> embeddings = load_glove_model("glove.6B.50d.txt")
        >>> print(embeddings['king'].shape)  # (50,)
        >>> dot = np.dot(embeddings['king'], embeddings['queen'])  # raw dot product, not cosine similarity
        
    Note:
        Large GloVe files (several GB) may take significant time to load and
        consume substantial memory. Consider loading only required words for
        memory-constrained applications.
    """
    # Word-to-vector mapping; dict lookups are O(1) on average.
    glove_model = {}

    # Read the file one line at a time instead of all at once.
    # Each line has the form "word dim1 dim2 ... dimN",
    # e.g. "hello 0.12 -0.45 0.78 ... 0.23".
    # GloVe text files are UTF-8, so set the encoding explicitly rather than
    # relying on the platform default.
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            # First field is the word; the remaining fields are the vector.
            split_line = line.split()
            word = split_line[0]

            # float64 keeps full precision; float32 would halve memory use
            # and is usually sufficient.
            embedding = np.array(split_line[1:], dtype=np.float64)

            # A duplicate word (not expected in standard GloVe files) would
            # overwrite the earlier entry.
            glove_model[word] = embedding

    return glove_model


# Load the 50-dimensional GloVe embeddings trained on 6 billion tokens from
# the Wikipedia and Gigaword corpora. They can be used for:
# - Initializing embedding layers in neural networks
# - Computing word similarities and analogies
# - Feature engineering for traditional ML models
embedding_dict = load_glove_model("data/glove.6B.50d.txt")

# Look up one word to inspect its dense 50-dimensional vector.
hello_embedding = embedding_dict['hello']
print("Embedding for 'hello':")
print(hello_embedding)

# Sanity check: every vector in glove.6B.50d.txt should have 50 dimensions.
print(f"Embedding dimension: {hello_embedding.shape[0]}")  # Expected: 50

Embedding for 'hello':
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
Embedding dimension: 50
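
The docstring of `load_glove_model` suggests loading only the required words when memory is tight. One way to do that is sketched below; `load_glove_subset` is a hypothetical helper, not part of the notebook, and the tiny temporary file exists only so the example runs without the real GloVe download.

```python
import os
import tempfile
import numpy as np

def load_glove_subset(path: str, wanted: set[str]) -> dict[str, np.ndarray]:
    """Load only the words in `wanted` from a GloVe-format text file."""
    subset = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in wanted:
                # float32 halves memory use compared with float64.
                subset[word] = np.asarray(values, dtype=np.float32)
    return subset

# Made-up three-line file in the GloVe text format.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\nfox 0.7 0.8 0.9\n")
    demo_path = tmp.name

demo = load_glove_subset(demo_path, {"hello", "fox"})
os.remove(demo_path)

print(sorted(demo))  # ['fox', 'hello']
```

In practice `wanted` would be the vocabulary of the dataset at hand, collected with `preprocess_text` before the embeddings are loaded.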


In [5]:
# DEMONSTRATION: Working with word embeddings
# -----------------------------------------------------------------------------

def get_sentence_embedding(text: str, embedding_dict: dict) -> np.ndarray:
    """
    Convert a sentence to its embedding representation by averaging constituent word vectors.
    
    This function implements a simple but effective approach to sentence-level embeddings
    by taking the element-wise average of all valid word embeddings in the sentence.
    While more sophisticated methods exist (weighted averages, neural encoders), mean
    pooling often provides surprisingly good results for many NLP tasks.
    
    The approach handles out-of-vocabulary words gracefully by filtering them out,
    and returns a zero vector for sentences with no valid embeddings to prevent
    errors in downstream processing.
    
    Args:
        text (str): Input sentence or text to convert to embedding representation.
            Can contain punctuation, mixed case, and stopwords which will be
            preprocessed according to the preprocess_text function.
        embedding_dict (dict): Dictionary mapping words to their embedding vectors.
            Typically loaded from pre-trained embeddings like GloVe or Word2Vec.
            Keys are lowercase strings, values are numpy arrays of consistent dimensionality.
            
    Returns:
        np.ndarray: Dense vector representation of the sentence with same dimensionality
            as individual word embeddings. For empty sentences or sentences with no
            valid words, returns zero vector of appropriate dimension.
    
    Example:
        >>> sentence = "The cat sat on the mat"
        >>> embedding = get_sentence_embedding(sentence, glove_embeddings)
        >>> print(embedding.shape)  # (50,) for 50-dimensional GloVe
        
    Note:
        This averaging approach loses word order information and may not capture
        complex compositional semantics. Consider using sequential models (LSTM,
        Transformer) for tasks requiring word order understanding.
    """
    # Lowercase, strip punctuation, tokenize, and drop stopwords.
    tokens = preprocess_text(text)

    # Keep only tokens that have an embedding; out-of-vocabulary (OOV) words
    # are skipped, which avoids a KeyError on lookup.
    valid_tokens = [token for token in tokens if token in embedding_dict]

    # No usable tokens: return a zero vector of the embedding dimension so
    # the output shape stays consistent.
    if not valid_tokens:
        embedding_dim = next(iter(embedding_dict.values())).shape[0]
        return np.zeros(embedding_dim)

    # One vector per valid token.
    token_embeddings = [embedding_dict[token] for token in valid_tokens]

    # Element-wise mean over the token axis:
    # sentence_vector = (1/n) * sum(word_vector_i for i in 1..n)
    sentence_embedding = np.mean(token_embeddings, axis=0)

    return sentence_embedding


def find_similar_words(word: str, embedding_dict: dict, n: int = 5) -> list:
    """
    Identify the n most semantically similar words to a target word using cosine similarity.
    
    This function leverages the geometric properties of word embeddings where semantically
    related words cluster together in the vector space. Cosine similarity measures the
    angle between vectors, providing a scale-invariant measure of semantic relatedness
    that ranges from -1 (opposite) to 1 (identical direction).
    
    The cosine similarity metric is preferred over Euclidean distance for word embeddings
    because it focuses on the direction rather than magnitude of vectors, making it
    more robust to variations in vector norms while preserving semantic relationships.
    
    Args:
        word (str): Target word to find semantic neighbors for. Must exist in the
            embedding dictionary vocabulary. Case-sensitive based on embedding keys.
        embedding_dict (dict): Pre-trained word embeddings mapping words to vectors.
            All vectors should have consistent dimensionality for valid comparisons.
        n (int, optional): Number of most similar words to return. Defaults to 5.
            Larger values provide broader semantic neighborhoods but increase computation.
            
    Returns:
        list: List of tuples (word, similarity_score) ordered by decreasing similarity.
            Similarity scores are float values between -1 and 1, with higher values
            indicating greater semantic similarity. Returns error message if word
            not found in vocabulary.
    
    Example:
        >>> # With glove.6B.50d.txt:
        >>> similar = find_similar_words("king", embeddings, n=3)
        >>> print([(w, round(float(s), 4)) for w, s in similar])
        [('prince', 0.8236), ('queen', 0.7839), ('ii', 0.7746)]

    Note:
        Scoring is O(V * d) for vocabulary size V and dimension d, and the
        full sort adds O(V log V). For large vocabularies, consider using
        approximate nearest neighbor algorithms for efficiency.
    """
    # Out-of-vocabulary query: return a sentinel entry instead of raising KeyError.
    if word not in embedding_dict:
        return [("Word not found in vocabulary", 0)]

    # Reference vector that every other word is compared against.
    word_embedding = embedding_dict[word]

    # Cosine similarity: cos(theta) = (A . B) / (||A|| * ||B||), where ||A||
    # is the L2 norm. The score depends on direction only, not on magnitude.
    def cosine_similarity(vec1, vec2):
        dot_product = np.dot(vec1, vec2)
        norm_vec1 = np.linalg.norm(vec1)
        norm_vec2 = np.linalg.norm(vec2)
        return dot_product / (norm_vec1 * norm_vec2)

    # Score every other word in the vocabulary against the target.
    similarities = []
    for other_word, other_embedding in embedding_dict.items():
        # Skip the target itself; it would always rank first with similarity 1.0.
        if other_word == word:
            continue

        similarity = cosine_similarity(word_embedding, other_embedding)
        similarities.append((other_word, similarity))

    # Highest similarity first; return the top n (word, score) tuples.
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:n]


# Sentence-level embedding for a sample sentence: the word vectors are averaged
# into one vector that can feed similarity, classification, or clustering tasks.
sample_sentence = "The quick brown fox jumps over the lazy dog"
print("\n--- Sentence Embedding Example ---")

sentence_embedding = get_sentence_embedding(sample_sentence, embedding_dict)
print(f"Original sentence: '{sample_sentence}'")
print(f"Preprocessed tokens: {preprocess_text(sample_sentence)}")
print(f"Sentence embedding shape: {sentence_embedding.shape}")
print(f"First 5 values of sentence embedding: {sentence_embedding[:5]}")


# Nearest-neighbor search: words close to the target in the embedding space
# tend to be semantically related.
target_word = "king"
print(f"\n--- Finding words similar to '{target_word}' ---")

similar_words = find_similar_words(target_word, embedding_dict)
print("Most similar words (with similarity scores):")

for word, score in similar_words:
    print(f"{word}: {score:.4f}")


# Word analogy through vector arithmetic: king - man + woman ≈ queen.
# 1. (king - man) captures the concept "royalty"
# 2. Adding "woman" applies this concept to the feminine domain
# 3. The result vector points toward "queen" in the embedding space
if all(word in embedding_dict for word in ["king", "man", "woman"]):
    print("\n--- Word Vector Arithmetic Example ---")

    # Encodes the relationship "king is to man as queen is to woman".
    result_vector = embedding_dict["king"] - embedding_dict["man"] + embedding_dict["woman"]

    # Find the vocabulary word whose embedding has the highest cosine
    # similarity to the result vector.
    closest_word = None
    highest_similarity = -1

    for word, embedding in embedding_dict.items():
        # Exclude the query words; otherwise one of them could win trivially.
        if word in ["king", "man", "woman"]:
            continue

        similarity = np.dot(result_vector, embedding) / (np.linalg.norm(result_vector) * np.linalg.norm(embedding))

        if similarity > highest_similarity:
            highest_similarity = similarity
            closest_word = word

    print(f"king - man + woman ≈ {closest_word} (similarity: {highest_similarity:.4f})")


--- Sentence Embedding Example ---
Original sentence: 'The quick brown fox jumps over the lazy dog'
Preprocessed tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Sentence embedding shape: (50,)
First 5 values of sentence embedding: [-0.15505333 -0.18144967 -0.12989    -0.17379167  0.29983667]

--- Finding words similar to 'king' ---
Most similar words (with similarity scores):
prince: 0.8236
queen: 0.7839
ii: 0.7746
emperor: 0.7736
son: 0.7667

--- Word Vector Arithmetic Example ---
king - man + woman ≈ queen (similarity: 0.8610)
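
The loops above visit every vocabulary word in Python, which is slow for large vocabularies. A vectorized variant is sketched below under stated assumptions: `most_similar_vectorized` is a hypothetical helper, the toy vectors are made up, and the query word is assumed to be in the dictionary.

```python
import numpy as np

def most_similar_vectorized(word, embedding_dict, n=5):
    """Rank all words by cosine similarity with one matrix-vector product."""
    words = list(embedding_dict)
    matrix = np.stack([embedding_dict[w] for w in words])

    # Normalize each row to unit length; zero vectors are left as zeros.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)

    # The dot product of unit vectors equals their cosine similarity.
    scores = unit @ unit[words.index(word)]
    order = np.argsort(-scores)
    return [(words[i], float(scores[i])) for i in order if words[i] != word][:n]

# Made-up 3-dimensional vectors for illustration.
toy = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([0.9, 1.0, 0.1]),
    "apple": np.array([0.0, 0.1, 1.0]),
}
neighbors = most_similar_vectorized("king", toy, n=2)
print(neighbors[0][0])  # queen
```

For repeated queries, the normalized matrix could be built once and reused instead of being rebuilt on every call.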


In [6]:
# Now let's create the embedding matrix for sample_text
sample_tokens = preprocess_text(sample_text)
sample_embedding_matrix = []

for sample_token in sample_tokens:
    sample_embedding_matrix.append(embedding_dict[sample_token])

# We should have as many embedding vectors (rows of the embedding matrix) as there are sample tokens.
assert len(sample_embedding_matrix) == len(sample_tokens)

# Let's print a token and its embedding.
print(sample_tokens[2])
print(sample_embedding_matrix[2])

fox
[ 0.44206   0.059552  0.15861   0.92777   0.1876    0.24256  -1.593
 -0.79847  -0.34099  -0.24021  -0.32756   0.43639  -0.11057   0.50472
  0.43853   0.19738  -0.1498   -0.046979 -0.83286   0.39878   0.062174
  0.28803   0.79134   0.31798  -0.21933  -1.1015   -0.080309  0.39122
  0.19503  -0.5936    1.7921    0.3826   -0.30509  -0.58686  -0.76935
 -0.61914  -0.61771  -0.68484  -0.67919  -0.74626  -0.036646  0.78251
 -1.0072   -0.59057  -0.7849   -0.39113  -0.49727  -0.4283   -0.15204
  1.5064  ]
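
The cell above indexes `embedding_dict` directly, so a token missing from the vocabulary would raise a `KeyError`. A more defensive sketch follows. `build_embedding_matrix` is a hypothetical helper, and mapping unknown tokens to a zero vector is one assumption among several possible choices (skipping them is another).

```python
import numpy as np

def build_embedding_matrix(tokens, embedding_dict, dim):
    """Stack token vectors into a (len(tokens), dim) array; unknown tokens become zeros."""
    rows = [embedding_dict.get(tok, np.zeros(dim)) for tok in tokens]
    return np.stack(rows) if rows else np.zeros((0, dim))

# Made-up 2-dimensional vectors; "zzyzx" is deliberately out of vocabulary.
toy = {"quick": np.array([0.1, 0.2]), "fox": np.array([0.3, 0.4])}
matrix = build_embedding_matrix(["quick", "zzyzx", "fox"], toy, dim=2)

print(matrix.shape)  # (3, 2)
```

Returning a single array rather than a list of rows also gives the fixed `(tokens, dim)` shape that sequence models expect.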
