<a href="https://colab.research.google.com/github/Sagaust/DH-Computational-Methodologies/blob/main/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings and Semantic Similarity

---

**Definition:**  
Word embeddings are numerical representations of words: dense, real-valued vectors, typically with tens to a few hundred dimensions. Words with similar meanings or usages receive similar vectors, so they lie close together in the vector space. Closeness, commonly measured with cosine similarity, therefore serves as a measure of semantic similarity between words.
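
As a minimal sketch of how closeness in vector space can be measured, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors in `toy_vectors` are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, hand-written for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these made-up values, "cat" scores higher against "dog" than against "car", which is the behaviour a trained embedding is expected to show for related words.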

---

## 📌 **Why are Word Embeddings Important?**

1. **Semantic Meaning**: They capture the semantic meaning of words. One-hot encoding cannot do this, because every pair of distinct one-hot vectors is equally far apart.
2. **Dimensionality Reduction**: They represent each word with a few hundred dense values instead of one sparse dimension per vocabulary word.
3. **Versatility**: Useful in a wide range of NLP tasks including text classification, sentiment analysis, and machine translation.
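
The contrast with one-hot encoding can be made concrete. In this small sketch (the four-word `vocabulary` is an assumption chosen for illustration), each one-hot vector is as long as the vocabulary, and the dot product of any two distinct words is zero, so the encoding carries no similarity information.

```python
# Hypothetical four-word vocabulary, for illustration only.
vocabulary = ["cat", "dog", "car", "tree"]

def one_hot(word, vocab):
    """Vector with a single 1 at the word's index; its length equals the vocabulary size."""
    return [1 if w == word else 0 for w in vocab]

cat = one_hot("cat", vocabulary)
dog = one_hot("dog", vocabulary)

# Distinct one-hot vectors never share a non-zero position.
dot = sum(a * b for a, b in zip(cat, dog))

print(cat, dog, dot)
```

A real vocabulary of 100,000 words would need 100,000 dimensions per word under this scheme, whereas a dense embedding of the same vocabulary usually needs only a few hundred.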

---

## 🛠 **How Do Word Embeddings Work?**

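Word embeddings are trained on large corpora to capture semantic relationships between words. Training relies on the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. A model therefore looks at the words surrounding each occurrence of a word, usually within a fixed-size window, and adjusts the vectors so that words sharing many contexts end up close together.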
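
To illustrate what "context" means here, the sketch below counts the words that occur within a fixed window around each word. It is a simplified illustration of the raw signal that embedding models learn from, not an embedding algorithm itself; the function name `context_counts`, the window size, and the two-sentence `corpus` are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Hypothetical toy corpus.
corpus = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]

counts = context_counts(corpus)

# "cat" and "dog" never co-occur, yet they share context words.
shared = set(counts["cat"]) & set(counts["dog"])
print(sorted(shared))
```

Because "cat" and "dog" share context words in this toy corpus, a model trained on similar evidence at scale would tend to place their vectors near each other.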

---

## 🌐 **Popular Word Embedding Models**:

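- **Word2Vec**: Developed by Google, it offers two training architectures. Continuous Bag of Words (CBOW) predicts a word from its surrounding context, while Skip-Gram predicts the surrounding context from a word.
- **GloVe (Global Vectors for Word Representation)**: Developed at Stanford, it learns vectors from a global word co-occurrence matrix, fitting them so that dot products approximate the logarithm of co-occurrence counts.
- **FastText**: Developed by Facebook, it extends Word2Vec. Whereas Word2Vec treats each word in the corpus as an atomic entity, FastText represents a word as a bag of character n-grams, which lets it build vectors for rare words and for words that never appeared in the training data.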
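
The "bag of character n-grams" idea behind FastText can be sketched in a few lines. The helper `char_ngrams` below is a hypothetical illustration, not FastText's actual implementation; it assumes the `<` and `>` word-boundary markers and the 3-to-6 n-gram range described for FastText.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# Trigrams only, to keep the output short.
print(char_ngrams("where", min_n=3, max_n=3))
```

In a subword model of this kind, a word's vector is built from the vectors of its n-grams, so an unseen word such as a misspelling still shares most of its n-grams with known words.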

---

## 📚 **Applications of Word Embeddings**:

1. **Text Classification**: Improve accuracy by using semantically rich word representations.
2. **Information Retrieval**: Search for documents that are semantically related to a query.
3. **Sentiment Analysis**: Understand the sentiment of texts by capturing the semantic meaning of the words.
4. **Machine Translation**: Translate between languages by mapping words to a common semantic space.

---

## 💡 **Insights from Word Embeddings**:

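1. **Semantic Relationships**: Vector offsets encode analogies such as "man" is to "woman" as "king" is to "queen": the vector for "king" minus "man" plus "woman" lies close to the vector for "queen".
2. **Topic Identification**: Clusters of nearby vectors tend to group words by topic, which helps identify the topics a word relates to.
3. **Language Structure**: Uncover syntactic relationships between words, such as "walk" is to "walked" as "swim" is to "swam".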
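
The analogy in the first point can be sketched with vector arithmetic. The two-dimensional vectors below are hand-built so that one axis loosely stands for "royalty" and the other for "gender"; they are an assumption for illustration, and real embeddings only approximate this behaviour.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: [royalty, gender]. Not taken from a trained model.
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

# king - man + woman
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Rank the remaining words by similarity to the target, skipping the query words.
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print(best)
```

Excluding the three query words from the candidates mirrors common practice, since the nearest vector to the target is otherwise often one of the inputs.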

---

## 🛑 **Challenges with Word Embeddings**:

1. **Requires Large Data**: Reliable word embeddings typically require training on large corpora.
2. **Static Representation**: Traditional word embeddings are static: a word has the same vector regardless of context, so "bank" in "river bank" and "bank account" is represented identically. Contextual models such as BERT address this by producing a different vector for each occurrence.
3. **Storage**: High-dimensional vectors for large vocabularies can take up significant storage.
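
A rough back-of-the-envelope calculation shows how quickly storage grows. The sketch assumes 32-bit floats (4 bytes per value) and uses illustrative figures of 3 million vocabulary entries and 300 dimensions; `embedding_size_bytes` is a hypothetical helper.

```python
def embedding_size_bytes(vocab_size, dimensions, bytes_per_value=4):
    """Size of a dense embedding matrix, ignoring any file-format overhead."""
    return vocab_size * dimensions * bytes_per_value

# Illustrative figures, assuming float32 storage.
size = embedding_size_bytes(vocab_size=3_000_000, dimensions=300)

print(f"{size / 1e9:.1f} GB")
```

Halving the precision or the number of dimensions halves the footprint, which is why reduced-precision and lower-dimensional variants are sometimes used when memory is tight.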

---

## 🧪 **Word Embeddings in Python**:

Python libraries like Gensim provide tools to work with word embeddings. Here's a simple example using Gensim's Word2Vec:

```python
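from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
# It is far too small to yield meaningful vectors; it only demonstrates the API.
sentences = [["cat", "say", "meow"], ["dog", "say", "bark"]]

# Train a Word2Vec model (CBOW by default; pass sg=1 for Skip-Gram).
# min_count=1 keeps every word; the default of 5 would discard this whole vocabulary.
# seed and workers=1 make repeated runs more reproducible.
model = Word2Vec(sentences, min_count=1, seed=42, workers=1)

# Look up the vector for a word (a NumPy array of length model.vector_size)
vector = model.wv['cat']
print(vector.shape)

# List the words closest to 'cat' by cosine similarity, as (word, score) pairs
similar = model.wv.most_similar('cat', topn=5)
print(similar)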
