# Word Embeddings

## Text Representations

In machine learning (ML) and neural networks, the data often needs to be numerical since models cannot usually work directly with raw text. Text is inherently complex, filled with semantics, context, and structure that computers cannot easily interpret. To train models effectively on textual data, we need systematic ways to convert text into numerical representations while preserving as much relevant information as possible. Without this transformation, machine learning algorithms would be unable to derive patterns or make predictions from text data.

Text is composed of characters and words, which hold meaning for humans but are not directly understandable by machines. To perform tasks like classification, clustering, or generation, machine learning models need features, or numerical inputs, that capture relevant patterns in the text. 

Effective text representations help models understand and make predictions by encoding:
- Frequency and presence of important words
- Context and relationships between words
- Semantic meaning beyond the surface structure

The quality of these representations directly impacts model performance.

## Different Approaches To Text Representations

![](https://aiml.com/wp-content/uploads/2023/02/disadvantage-bow.png)

### Bag of Words (BoW)  

The Bag of Words approach is one of the simplest and most widely used text representations. It involves creating a vocabulary from the text and representing each document as a vector indicating the presence or frequency of each word in the vocabulary.

For example, consider two sentences:  
- "The cat sat on the mat."  
- "The dog sat on the mat."  

After lowercasing, the vocabulary from these sentences is: `['the', 'cat', 'dog', 'sat', 'on', 'mat']`. 
The first sentence can be represented as the vector `[2, 1, 0, 1, 1, 1]`, where each entry is the count of the corresponding vocabulary word ("the" appears twice and "dog" does not appear at all). 

While BoW is easy to implement, it has limitations. Primarily, it ignores word order and the context in which words appear.
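
To make the word-order limitation concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of any library; the point is that two sentences with opposite meanings receive identical count vectors.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return word counts for `text`, in the fixed order given by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Two sentences with the same words but opposite meanings (illustrative)
sentence_a = "the cat chased the dog"
sentence_b = "the dog chased the cat"

# Shared vocabulary: sorted unique lowercase tokens from both sentences
vocabulary = sorted(set((sentence_a + " " + sentence_b).lower().split()))

vector_a = bag_of_words(sentence_a, vocabulary)
vector_b = bag_of_words(sentence_b, vocabulary)

print(vocabulary)
print(vector_a)
print(vector_b)
print("Identical vectors:", vector_a == vector_b)
```

Because only counts are stored, any model built on these vectors cannot tell which animal did the chasing.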

#### Example Using CountVectorizer

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus
corpus = [
    "The cat sat on the mat",
    "The dog barked at the cat",
    "The cat chased the mouse",
    "The dog chased the ball"
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus into BoW representation
X = vectorizer.fit_transform(corpus)

# Convert the result to a DataFrame for readability
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display the Bag of Words representation
print("Bag of Words Representation:")
bow_df

Bag of Words Representation:


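|   | at | ball | barked | cat | chased | dog | mat | mouse | on | sat | the |
|---|----|------|--------|-----|--------|-----|-----|-------|----|-----|-----|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 2 |
| 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |
| 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 |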


### TF-IDF (Term Frequency-Inverse Document Frequency)  

TF-IDF is a statistical measure used to evaluate how important a word (term) is to a document within a collection or corpus of documents. It builds upon the simple Bag of Words (BoW) representation by incorporating both the frequency of a term in a single document and how common (or rare) that term is across the entire corpus. In other words, TF-IDF helps us down-weight very common words (like “the”, “is”, “and”) that carry little distinguishing information and up-weight rarer and more informative words.

#### Term Frequency (TF)

The term frequency component quantifies how often a given term appears in a specific document. A higher TF value means the term appears more often in that document, suggesting it may be important within that document’s context. There are a few common ways to define TF:

- **Raw Count**:  
  $\mathrm{TF}(t, d) = \text{count of term } t \text{ in document } d.$
  

- **Normalized Frequency** (accounts for document length):  
  $\mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}.$
  

- **Log-Scaled Frequency** (dampens large counts):  
  $\mathrm{TF}(t, d) =
    \begin{cases}
      1 + \log(\text{count of } t \text{ in } d), & \text{if count} > 0,\\
      0, & \text{otherwise}.
    \end{cases}$
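
The three TF definitions above can be compared side by side with a short sketch. The function name `term_frequencies` and the example sentence are assumptions made for illustration; the natural logarithm is assumed for the log-scaled variant.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Compute raw, length-normalized, and log-scaled TF for every term in a document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: {
            "raw": count,
            "normalized": count / total,
            # Every term in `counts` has count > 0, so the log is always defined
            "log_scaled": 1 + math.log(count),
        }
        for term, count in counts.items()
    }

tf = term_frequencies("the cat sat on the mat".split())
for term, values in tf.items():
    print(term, values)
```

Terms absent from the document simply do not appear in the result, which corresponds to the "0 otherwise" case of the log-scaled definition.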
    

#### Inverse Document Frequency (IDF)

While TF measures how often a term appears in one document, IDF measures how unique or rare that term is across the entire corpus. Intuitively:

- If a term appears in almost every document (e.g., “the” or “and”), it carries less discriminative power.
- If a term appears in very few documents (e.g., “photosynthesis” in a mixed corpus), it is more informative.

A common definition of IDF is:

$$
\mathrm{IDF}(t, D) \;=\; \log\!\Bigl(\frac{N}{\,|\{d \in D: t \in d\}|\,}\Bigr),
$$  

where:

- $N$ is the total number of documents in the corpus $D$.
- $|\{\,d \in D : t \in d\,\}|$ is the number of documents in which term $t$ appears (the document frequency of $t$).


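Some variants add 1 to both the numerator and the denominator (as if one extra document contained every term) and add 1 to the result. This avoids division by zero and keeps the weight of terms that appear in every document above zero. This smoothed form is the default in scikit-learn's `TfidfVectorizer`:  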
$$
\mathrm{IDF}(t, D) = \log\!\Bigl(\frac{1+N}{1 + |\{d: t \in d\}|}\Bigr) + 1.
$$

The higher the IDF, the rarer the term is across documents.
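
As a rough illustration, both IDF definitions above can be computed directly. This is a sketch under assumptions: the helper `idf`, its `smooth` flag, and the four-sentence corpus are chosen for the example, and the natural logarithm is used.

```python
import math

# Small illustrative corpus, already lowercased and tokenized
docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "the cat chased the mouse".split(),
    "the dog chased the ball".split(),
]

def idf(term, docs, smooth=False):
    """Inverse document frequency of `term`; `smooth=True` uses the add-one variant."""
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in doc)
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # The plain definition is undefined (division by zero) for unseen terms
    return math.log(n_docs / doc_freq)

for term in ["the", "cat", "mat"]:
    print(f"{term}: plain={idf(term, docs):.4f}, smoothed={idf(term, docs, smooth=True):.4f}")
```

In this corpus "the" appears in every document, so its plain IDF is exactly zero, while the smoothed variant still assigns it a weight of 1.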

#### Combining TF and IDF

The core idea of TF-IDF is to multiply the two components:

$$
\text{TF-IDF}(t, d, D) \;=\; \text{TF}(t, d) \times \text{IDF}(t, D).
$$

- If a term $t$ occurs frequently in a document $d$ (high TF) but rarely in the rest of the corpus (high IDF), then TF-IDF is large.  
- If a term is common in a document but also common across many documents (low IDF), then TF-IDF remains moderate.  
- If a term is very rare in the specific document (low TF), its TF-IDF is low regardless of its IDF.  

By weighting words this way, we produce a vector representation for each document where each dimension corresponds to a vocabulary term, and the value is the TF-IDF score. These vectors can then be used as features in downstream tasks (classification, clustering, information retrieval, etc.).
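
The product can be worked through by hand for one document. The sketch below assumes raw-count TF and the unsmoothed IDF definition, so its numbers will not match libraries that apply smoothing or vector normalization; the helper `tf_idf` is hypothetical.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "the cat chased the mouse".split(),
    "the dog chased the ball".split(),
]

def tf_idf(term, doc, docs):
    """Raw-count TF multiplied by unsmoothed IDF (natural log)."""
    tf = doc.count(term)
    doc_freq = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / doc_freq)

first_doc = docs[0]
scores = {term: tf_idf(term, first_doc, docs) for term in sorted(set(first_doc))}
for term, score in scores.items():
    print(f"{term}: {score:.4f}")
```

Even though "the" is the most frequent word in the first document, its score is zero because it appears in every document, while "mat" (unique to this document) scores highest.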

#### Limitations of TF-IDF

- **Ignores Word Order**: TF-IDF still treats each term independently and does not capture the sequence or context in which words appear. For example, “dog bites man” and “man bites dog” have the same TF-IDF representation even though their meanings differ.

- **Ignores Word Relationships/Semantics**: Synonyms (“car” vs. “automobile”) will be treated as completely separate dimensions. TF-IDF does not capture semantic similarity between related words.

- **High Dimensionality**: The resulting feature space is as large as the vocabulary size, which can be in the tens or hundreds of thousands. Sparse, high-dimensional feature vectors can be computationally expensive to store and process.

- **Static Across Corpus**: Once you compute IDF, those values remain fixed unless you retrain. In dynamic corpora where documents are continually added, TF-IDF scores can become stale unless recalculated.

#### Example Using TfidfVectorizer

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample corpus
corpus = [
    "The cat sat on the mat",
    "The dog barked at the cat",
    "The cat chased the mouse",
    "The dog chased the ball"
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus into TF-IDF representation
X = vectorizer.fit_transform(corpus)

# Convert the result to a DataFrame for readability
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print("TF-IDF Representation:")
tfidf_df

TF-IDF Representation:


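|   | at | ball | barked | cat | chased | dog | mat | mouse | on | sat | the |
|---|----|------|--------|-----|--------|-----|-----|-------|----|-----|-----|
| 0 | 0.0 | 0.0 | 0.0 | 0.301002 | 0.0 | 0.0 | 0.471578 | 0.0 | 0.471578 | 0.471578 | 0.492178 |
| 1 | 0.492768 | 0.0 | 0.492768 | 0.314527 | 0.0 | 0.388504 | 0.0 | 0.0 | 0.0 | 0.0 | 0.514293 |
| 2 | 0.0 | 0.0 | 0.0 | 0.361459 | 0.446473 | 0.0 | 0.0 | 0.566295 | 0.0 | 0.0 | 0.591032 |
| 3 | 0.0 | 0.547794 | 0.0 | 0.0 | 0.431887 | 0.431887 | 0.0 | 0.0 | 0.0 | 0.0 | 0.571724 |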


## Word Embeddings  

Word embeddings are a type of dense vector representation where words with similar meanings have similar vector representations. Instead of treating words as discrete entities, embeddings place them in a continuous vector space, allowing models to capture semantic relationships and contextual similarity. 

For instance, in a well-trained word embedding model, the relationship "king - man + woman" could approximate the vector for "queen," showing how embeddings can encode relational knowledge.
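
The analogy arithmetic can be illustrated with hand-made vectors. Everything in this sketch is an assumption for demonstration: the two dimensions (roughly "royalty" and "gender") and all vector values are invented, whereas real embeddings are learned and have many more dimensions.

```python
import math

# Hypothetical 2-D embeddings: [royalty, gender]; values are invented for illustration
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed component by component
query = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Exclude the input words, then pick the candidate closest in direction to the query
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best_match = max(candidates, key=lambda w: cosine(query, candidates[w]))

print("Query vector:", query)
print("Closest word:", best_match)
```

With learned embeddings the result is only approximate, which is why the nearest neighbor of the query vector is reported rather than an exact match.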

Word embeddings are typically generated by training on large corpora of text. The model learns to place words in the vector space such that words that appear in similar contexts have similar vector representations. The training process focuses on capturing semantic relationships and dependencies between words.

Popular methods for generating word embeddings include **Word2Vec**, **GloVe**, and **FastText**. These models learn from large corpora and capture relationships between words, such as analogies and semantic similarities.

Word embeddings overcome many limitations of simpler methods like BoW and TF-IDF by capturing semantic meaning and some degree of context.

## Word2Vec  

Word2Vec is a popular algorithm for generating word embeddings. Word2Vec uses neural networks to map words into a continuous vector space based on their context in a corpus of text. The key idea is that words appearing in similar contexts will have similar vector representations.  

The training objective of Word2Vec is to predict either:
- the surrounding words given a target word (Skip-Gram) 
- or predict the target word given its surrounding words (CBOW). 
  
**Skip-Gram:**
- The model looks at a single “center” word and tries to guess which words appear around it in the sentence. For example, if the center word is “dog,” it will learn to predict words like “barks,” “leash,” or “park” if those tend to appear near “dog.”

**CBOW (Continuous Bag of Words):**
- The model does the reverse. It takes a group of words around a blank spot (the context) and tries to predict which word belongs in that position. For instance, seeing “the ___ chased the ball,” it would learn to predict “dog.”

Because each training step looks only at a small window of words, and because optimizations such as negative sampling avoid updating the weights for the entire vocabulary at every step, Word2Vec can process very large amounts of text quickly. The end result is that words with similar meanings or grammatical roles (like “run” and “jog,” or “cat” and “kitty”) get mapped to vectors that sit close together in this new numerical space. These vectors then become convenient features for tasks like classifying documents, finding related words, or powering simple search systems.

![](https://miro.medium.com/v2/resize:fit:1200/1*xC6wfTU_zpUlpRlXs5NZ4w.png)

### Skip-Gram Model  

The Skip-Gram model, a core component of Word2Vec, predicts the surrounding context words given a target word. For example, given the word "cat," the model will try to predict nearby words like "the," "sat," or "mat."  

Mathematically, this looks like:  

$$
\text{maximize} \, \sum \log P(\text{context} \mid \text{target})
$$

Skip-Gram is particularly useful for capturing meaningful relationships when the corpus is large, as it can effectively handle rare words. One way to understand Skip-Gram intuitively is to visualize a sliding window moving over the text and predicting the words within the window.
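
The sliding-window idea can be sketched as a function that generates the (target, context) pairs Skip-Gram trains on. This is a simplified illustration: the function `skipgram_pairs` is hypothetical, and real implementations add details such as subsampling of frequent words and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(f"{target} -> {context}")
```

Each pair becomes one training example in which the model is asked to assign high probability to the context word given the target word.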

### Continuous Bag of Words (CBOW)  

CBOW is the opposite of the Skip-Gram model. Instead of predicting context words given a target word, it predicts the target word based on its surrounding context. Given a window of words surrounding a target word, the model aims to predict the target word.

Mathematically, this looks like:

$$
\text{maximize} \, \sum \log P(\text{target} \mid \text{context})
$$  

CBOW is generally faster to train than Skip-Gram because it averages the context word vectors and makes a single prediction per window position, whereas Skip-Gram makes one prediction per context word. However, averaging smooths over individual context words, so CBOW may not perform as well on rare words.
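
For comparison, here is a sketch of how CBOW training examples could be assembled from the same kind of window. The function `cbow_examples` is a hypothetical helper; it only builds the (context, target) examples and does not perform the vector averaging or prediction.

```python
def cbow_examples(tokens, window=2):
    """Generate (context_words, target) examples: the neighbors predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

tokens = ["the", "dog", "chased", "the", "ball"]
examples = cbow_examples(tokens, window=2)
for context, target in examples:
    print(f"{context} -> {target}")
```

Note that each position yields exactly one example, regardless of window size, which is one reason CBOW needs fewer updates per pass over the corpus than Skip-Gram.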

### Coding Example

In [None]:
!pip install gensim

In [35]:
from gensim.models import Word2Vec

# Sample corpus (list of tokenized sentences)
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "dog", "chased", "the", "ball"],
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "sat", "on", "the", "sofa"]
]

# Train Word2Vec model - Skip-Gram Model
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

# Train Word2Vec model - CBOW Model
#model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=100)

# Save the trained model
# model.save("word2vec.model")

# Access vector for a word
cat_vector = model.wv["cat"]
print(f"Vector for 'cat':\n{cat_vector}")

# Find the most similar words to 'cat'
similar_words = model.wv.most_similar("cat", topn=3)
print("\nMost similar words to 'cat':")
for word, score in similar_words:
    print(f"{word}: {score}")

# Find similarity between two words
similarity_score = model.wv.similarity("cat", "dog")
print(f"\nSimilarity between 'cat' and 'dog': {similarity_score:.2f}")

Vector for 'cat':
[-0.01753848  0.00737714  0.01058508  0.01203614  0.01473678 -0.01264185
  0.00283708  0.01270612 -0.00632933 -0.01250195 -0.00059064 -0.01700857
 -0.01139127  0.01432026  0.00677256  0.01453993  0.01423641  0.01551222
 -0.00832898 -0.00157581  0.00498176 -0.00899076  0.01758184 -0.01981414
  0.01403708  0.00569289 -0.00990202  0.00932641 -0.00395903  0.01334891
  0.0198663  -0.00912277 -0.00077523 -0.01217308  0.00750499  0.00543842
  0.01447675  0.01197266  0.01928601  0.01860449  0.0156456  -0.01382573
 -0.01894968 -0.00101606 -0.0058394   0.01620697  0.01174434 -0.00293944
  0.00312232  0.00403298]

Most similar words to 'cat':
on: 0.18660055100917816
the: 0.16965018212795258
mat: 0.16849403083324432

Similarity between 'cat' and 'dog': 0.03

With only five short sentences, these scores are essentially noise: the nearest neighbors of "cat" are mostly function words, and "cat" and "dog" look unrelated. Meaningful similarities require a much larger corpus, such as the 20 Newsgroups dataset used at the end of this section.


#### Explanation of Parameters:

- **`sentences`**: The corpus of tokenized sentences used for training.  
- **`vector_size=50`**: The size (dimensionality) of the word embeddings.  
- **`window=3`**: The context window size, meaning the model looks at up to 3 words before and after the target word.  
- **`min_count=1`**: Minimum number of occurrences for a word to be included in the vocabulary.  
- **`sg=1`**: Skip-Gram model (set `sg=0` for CBOW).  
- **`epochs=100`**: Number of training epochs (iterations over the corpus).  


## Alternatives to Word2Vec  

Although Word2Vec is widely used, several alternative methods exist:  

- **GloVe (Global Vectors for Word Representation):** GloVe focuses on co-occurrence statistics across the entire corpus rather than relying on local context windows. It constructs a co-occurrence matrix and factorizes it to generate word embeddings. GloVe works well for capturing global context and semantic relationships.  
- **FastText:** Developed by Facebook, FastText extends Word2Vec by incorporating subword information. It breaks words into character n-grams, making it better at handling rare words and morphologically rich languages.  
- **BERT and Contextual Embeddings:** Unlike static embeddings like Word2Vec and GloVe, contextual embeddings (e.g., BERT, GPT) generate dynamic representations based on the context in which a word appears. These embeddings are ideal for tasks where word meaning depends on surrounding words.  

### Contextual Embeddings  

Contextual embeddings take word representations a step further by making them dependent on the surrounding text. Models like **BERT**, **GPT**, and **RoBERTa** generate embeddings dynamically, meaning the same word can have different representations depending on its usage. This is particularly useful for disambiguating polysemous words (e.g., the word "bank" has different meanings in "river bank" and "financial bank").

### Character-Level Representations  

Rather than representing words as whole units, some models use representations based on characters. This is particularly beneficial in cases where words are rare, misspelled, or morphologically complex. 

Character-level models can detect subword patterns and handle previously unseen words more effectively. Techniques like **Char-CNN** and **Char-RNN** implement this approach for languages with complex morphology.

While character-level representations add robustness, they often require more computational resources and longer training times compared to word-based approaches.

## Important Things to Consider  

#### Window Size  
The window size defines the number of words around a target word that the model should consider when learning embeddings. A shorter window (e.g., 2-3 words) typically captures **syntactic** information, such as word order or grammatical relationships. A longer window (e.g., 5-10 words) is better for capturing **semantic or topical relationships** between words.  

For example:  
- A shorter window size may learn that "run" and "ran" have similar grammatical functions. 
- A longer window size may capture that "book" and "novel" are topically related, even if they are not grammatically similar.  
  
#### Subword Information  
Languages with complex morphology or compound words benefit from embeddings that capture subword information. Techniques like FastText and character-based embeddings decompose words into smaller units, enabling models to handle unseen or rare words more effectively.  
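
To show what "smaller units" means in practice, the following sketch extracts character n-grams with `<` and `>` as word-boundary markers, in the style FastText describes. It is a simplified illustration: the helper `char_ngrams` is hypothetical, and the full FastText model also hashes n-grams and learns a vector for each one.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word`, wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))

# Two related words share n-grams even if one was never seen during training
shared = set(char_ngrams("running", 3, 3)) & set(char_ngrams("runner", 3, 3))
print(sorted(shared))
```

Because a word's vector is composed from its n-gram vectors, an unseen word such as "runner" can still receive a sensible representation through the pieces it shares with known words.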

#### Handling Out-of-Vocabulary (OOV) Words  
Static embeddings, such as Word2Vec or GloVe, suffer from the issue of out-of-vocabulary (OOV) words. If a word is not in the training corpus, it will not have an embedding. Contextual embeddings and subword-based models provide solutions to this problem by dynamically generating embeddings or breaking words into subunits.  

#### Semantic and Syntactic Trade-offs  
Depending on the task, you may need embeddings that prioritize semantic relationships (e.g., grouping words by topic) or syntactic relationships (e.g., grammatical roles). Tuning parameters like window size, corpus size, and embedding dimensions allows for optimizing the model based on your needs.

#### Dimensionality and Training Time  
Choosing the dimensionality of the word vectors is important. Larger dimensions can capture more nuance, but the gains diminish while computational cost and the risk of overfitting increase, particularly on small corpora. Commonly used values range from roughly 50 to 300. Practical implementations balance the quality of embeddings with the available computational resources.  

## Distance Functions

When we represent text as vectors in an **n-dimensional space**, comparing the similarity between different text representations involves calculating the "distance" between them. 

These distances provide a quantitative measure of how similar (or different) two text samples are, which is crucial for tasks like document clustering, classification, and information retrieval.

The choice of distance function can impact the performance of your model, as different tasks may require different types of similarity measures. 

![](https://www.maartengrootendorst.com/images/posts/2021-01-02-distances/header.png)

### Euclidean Distance

Euclidean distance is one of the most common and straightforward distance metrics. It measures the straight-line distance between two points in space. 

Given two vectors $ \mathbf{A} $ and $ \mathbf{B} $, the Euclidean distance $ d $ is defined as:

$$
d(\mathbf{A}, \mathbf{B}) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
$$

Euclidean distance is sensitive to vector magnitude, so it works well when the absolute feature values matter. However, in high-dimensional or sparse text data it may not capture meaningful similarities: for example, a long document and a short document on the same topic can be far apart simply because their word counts differ in scale.

#### Code Example

In [8]:
import numpy as np
from scipy.spatial.distance import euclidean

# Define two vectors (representing text vectors or points in n-dimensional space)
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 0, 3])

# Method 1: Manual computation of Euclidean distance
euclidean_manual = np.sqrt(np.sum((vector1 - vector2)**2))
print(f"Euclidean distance (manual): {euclidean_manual}")

# Method 2: Using scipy's euclidean function
euclidean_scipy = euclidean(vector1, vector2)
print(f"Euclidean distance (scipy): {euclidean_scipy}")

Euclidean distance (manual): 3.605551275463989
Euclidean distance (scipy): 3.605551275463989


### Manhattan Distance (L1 Distance)

Manhattan distance measures the distance by summing the absolute differences between the vector components. 

Mathematically, it is given by:

$$
d(\mathbf{A}, \mathbf{B}) = \sum_{i=1}^{n} |A_i - B_i|
$$

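The name comes from a grid-like city layout, where travel is restricted to moving along the axes rather than in a straight line. Because differences are not squared, a single large difference in one dimension influences Manhattan distance less than it influences Euclidean distance. It can be useful in text-related tasks where many small deviations in feature values are meaningful.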

#### Code Example

In [9]:
import numpy as np
from scipy.spatial.distance import cityblock  # Scipy function for Manhattan distance

# Define two vectors (representing text vectors or points in n-dimensional space)
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 0, 3])

# Method 1: Manual computation of Manhattan distance
manhattan_manual = np.sum(np.abs(vector1 - vector2))
print(f"Manhattan distance (manual): {manhattan_manual}")

# Method 2: Using scipy's cityblock function
manhattan_scipy = cityblock(vector1, vector2)
print(f"Manhattan distance (scipy): {manhattan_scipy}")

Manhattan distance (manual): 5
Manhattan distance (scipy): 5


### Minkowski Distance

The **Minkowski distance** is a generalized form of distance that includes both the **Euclidean distance** and **Manhattan distance** as special cases. The formula for the Minkowski distance between two vectors $ \mathbf{A} $ and $ \mathbf{B} $ is:

$$
d(\mathbf{A}, \mathbf{B}) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{\frac{1}{p}}
$$

- **p = 1:** The formula becomes the **Manhattan distance**.  
- **p = 2:** The formula becomes the **Euclidean distance**.  
- For larger values of $ p $, the metric places more emphasis on large differences between vector components.  
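
The effect of $p$ can be seen by computing the distance for several values on the same pair of vectors. This is a minimal pure-Python sketch; the function name `minkowski_distance` is an assumption, and the maximum absolute difference (the Chebyshev distance, which is the limit as $p$ grows) is included for reference.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = [1, 2, 3]
b = [4, 0, 3]

distances = {p: minkowski_distance(a, b, p) for p in (1, 2, 3, 10)}
for p, d in distances.items():
    print(f"p={p}: {d:.4f}")

# As p grows, the distance approaches the largest single-component difference
chebyshev = max(abs(x - y) for x, y in zip(a, b))
print(f"max difference: {chebyshev}")
```

The values shrink toward the largest component difference as $p$ increases, which is what "more emphasis on large differences" means in practice.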

#### Code Example

In [11]:
import numpy as np
from scipy.spatial.distance import minkowski

# Define two vectors (representing text vectors or points in n-dimensional space)
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 0, 3])

# Method 1: Manual computation of Minkowski distance with p = 3
p = 3
minkowski_manual = np.power(np.sum(np.abs(vector1 - vector2)**p), 1/p)
print(f"Minkowski distance (manual, p=3): {minkowski_manual}")

# Method 2: Using scipy's minkowski function
minkowski_scipy = minkowski(vector1, vector2, p=3)
print(f"Minkowski distance (scipy, p=3): {minkowski_scipy}")

Minkowski distance (manual, p=3): 3.2710663101885897
Minkowski distance (scipy, p=3): 3.2710663101885897


### Cosine Similarity

Cosine similarity is a popular measure for text representations because it compares the **angles between vectors** rather than their magnitudes. This makes it ideal for text data, where two documents with different lengths may still have high similarity if they share the same word distribution.

The formula for cosine similarity between two vectors $ \mathbf{A} $ and $ \mathbf{B} $ is:

$$
\text{cosine similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}
$$

Here, $ \mathbf{A} \cdot \mathbf{B} $ represents the dot product of the two vectors, and $ \|\mathbf{A}\| $ and $ \|\mathbf{B}\| $ represent their magnitudes.

Cosine similarity is widely used in **information retrieval** and **document clustering** because it focuses on the direction of vectors, making it less sensitive to differences in vector length.
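
The insensitivity to length can be checked with a small sketch: a count vector and a copy scaled by two (as if the same text were repeated twice) point in the same direction but are far apart in Euclidean terms. The helper names and vectors are illustrative assumptions.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean_dist(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1, 2, 3]                     # word counts for a short document
long_doc = [2 * x for x in short_doc]     # same word distribution, twice the length

cos_value = cosine_sim(short_doc, long_doc)
euc_value = euclidean_dist(short_doc, long_doc)

print(f"Cosine similarity:  {cos_value:.4f}")
print(f"Euclidean distance: {euc_value:.4f}")
```

The cosine similarity is (up to floating-point rounding) 1, while the Euclidean distance is clearly non-zero even though both documents have the same word distribution.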

#### Code Example

In [13]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Define two vectors (representing text vectors or points in n-dimensional space)
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 0, 3])

# Method 1: Manual computation of cosine similarity
dot_product = np.dot(vector1, vector2)
magnitude1 = np.linalg.norm(vector1)
magnitude2 = np.linalg.norm(vector2)
cosine_similarity_manual = dot_product / (magnitude1 * magnitude2)
print(f"Cosine similarity (manual): {cosine_similarity_manual}")

# Method 2: Using scikit-learn's cosine_similarity function
cosine_similarity_sklearn = cosine_similarity([vector1], [vector2])[0][0]
print(f"Cosine similarity (scikit-learn): {cosine_similarity_sklearn}")

Cosine similarity (manual): 0.6948792289723034
Cosine similarity (scikit-learn): 0.6948792289723034


### Other Distance Measures

There are additional distance measures you may encounter depending on the specific problem:

1. **Jaccard Similarity:** Measures the similarity between two sets by dividing the size of their intersection by the size of their union. This is particularly useful for comparing sparse vectors or binary representations of text.
   $$
   J(A, B) = \frac{|A \cap B|}{|A \cup B|}
   $$

2. **Hamming Distance:** Measures the number of positions where corresponding components of two vectors differ. It is commonly used for comparing binary or categorical data.
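
Both measures are simple enough to sketch in pure Python. The function names and example inputs below are assumptions for illustration: Jaccard similarity is computed on the sets of unique tokens, and Hamming distance on binary presence vectors of equal length.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(set_a & set_b) / len(set_a | set_b)

def hamming_distance(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    return sum(x != y for x, y in zip(u, v))

doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat on the mat".split()
jaccard = jaccard_similarity(doc_a, doc_b)

# Binary presence vectors over the vocabulary [the, cat, dog, sat, on, mat]
presence_a = [1, 1, 0, 1, 1, 1]
presence_b = [1, 0, 1, 1, 1, 1]
hamming = hamming_distance(presence_a, presence_b)

print(f"Jaccard similarity: {jaccard:.4f}")
print(f"Hamming distance:   {hamming}")
```

The two sentences share four of six unique words, and their presence vectors differ in exactly the two positions for "cat" and "dog".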

## Real World Example Using Newsgroup Dataset

#### Install Required Libraries (if needed)

In [None]:
!pip install gensim scikit-learn nltk

In [15]:
from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
import nltk

#### Load The Dataset

We are using the 20 Newsgroups dataset, a common benchmark dataset in NLP. It contains around 18,000 newsgroup documents spread across 20 different categories, including topics like sports, politics, computers, and religion. This dataset is widely used for text classification, clustering, and word embedding demonstrations.

In [21]:
# Download NLTK tokenizer
nltk.download('punkt_tab')

# Load the 20 Newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jonathanschlosser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


#### Tokenization

In [22]:
# Tokenize each document into a list of lowercase word tokens (each document is treated as one "sentence")
sentences = [word_tokenize(doc.lower()) for doc in newsgroups_data.data]

In [26]:
sentences[0]

['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty',
 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about',
 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils', '.', 'actually', ',',
 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', '.',
 'however', ',', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to',
 'non-pittsburghers', "'", 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the',
 'pens', '.', 'man', ',', 'they', 'are', 'killing', 'those', 'devils', 'worse',
 'than', 'i', 'thought', '.', 'jagr', 'just', 'showed', 'you', 'why', 'he',
 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', '.', 'he',
 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the',
 'playoffs', '.', 'bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of',
 'fun', 'in', 'the', 'next', 'couple', ...

#### Train Word2Vec Model

In [28]:
# Train the Word2Vec model on the tokenized sentences
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1, epochs=10)

#### Evaluate

In [29]:
# Access the vector representation of a word
word_vector = model.wv['computer']
print(f"Vector for 'computer':\n{word_vector}")

Vector for 'computer':
[-0.21225789  0.5404231   0.40459347  0.32228127  0.12945211 -0.3299833
 -0.09653363  0.20713666 -0.36554763  0.13997215 -0.37456098 -0.67297906
 -0.3721857   0.00911829 -0.20481554 -0.36107153  0.7215204   0.3608223
 -0.41200316 -0.7169939   0.38293985  0.86623466  0.09137827 -0.26366988
  0.23594125 -0.17807429 -0.17314552 -0.07365485  0.2280913   0.200225
  0.03429207 -0.2574355   0.28391963 -0.32798564 -0.03226035 -0.5596238
 -0.12443216  0.20995711  0.05276503 -0.3246789   0.9564842  -0.6072698
  0.62925166  0.00243073  0.15336983  0.07485364 -0.31112248 -0.06012597
 -0.04015968  0.29611218 -0.20674986 -0.33931953 -0.40383413 -0.0755101
  0.12370312  0.03053324 -0.13557634 -0.25636342 -0.26846898  0.263858
 -0.21728204  0.4428228   0.11869055  0.20396456 -0.00572924  0.39625686
 -0.02928978  0.41261107  0.07562852  0.1514651   0.10060371 -0.1340094
  0.09531747 -0.27664697  0.58087593 -0.3628328  -0.00305403  0.30699235
 -0.08109504  0.03020633 -0.24945392  

In [30]:
# Find the most similar words to a given word
similar_words = model.wv.most_similar('computer', topn=5)
print("\nMost similar words to 'computer':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")


Most similar words to 'computer':
visualisation: 0.6896
graphics: 0.6778
shopper: 0.6759
eckton: 0.6675
computing: 0.6632


In [31]:
# Calculate similarity between two words
similarity_score = model.wv.similarity('computer', 'software')
print(f"\nSimilarity between 'computer' and 'software': {similarity_score:.4f}")


Similarity between 'computer' and 'software': 0.5553
