# What are Embeddings?

- Embeddings act like a translator for machines: they take diverse forms of data (like words, documents, images, people, ads, and more) and transform them into a compact series of numbers in a lower-dimensional space.
- This transformation manages to hold on to the crucial 'meaning' or 'context' of these data forms, which is what we call capturing their semantic information.
- The real strength of embeddings comes from their ability to place similar data points close to each other and dissimilar ones farther apart in this compact space. It's like clustering related items in a neighborhood and separating distinct ones into different districts. 

- To make this concrete, consider a hypothetical 6-dimensional embedding space. The embeddings (vectors) for the words "king", "queen", and "apple" could look something like this:

1. king: [1.5, 2.2, 0.9, 1.1, 0.7, 1.3]
2. queen: [1.6, 2.3, 0.85, 1.2, 0.75, 1.4]
3. apple: [3.2, 4.1, 3.9, 3.5, 4.0, 3.8]

Each number list (vector) here represents a point in our 6-dimensional space. The numbers for "king" and "queen" are closer to each other in this six-dimensional space compared to "apple". This indicates that "king" and "queen" are semantically more similar to each other than either is to "apple", which aligns with our intuitive understanding of the meanings of these words.
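
To make "closer" concrete, the sketch below computes the Euclidean distance and cosine similarity between the three toy vectors above. The helper names `euclidean` and `cosine` are illustrative, and the vectors are the hypothetical values from this example, not output from a trained model.

```python
import math

# Toy 6-dimensional vectors from the example above (hypothetical values)
vectors = {
    "king":  [1.5, 2.2, 0.9, 1.1, 0.7, 1.3],
    "queen": [1.6, 2.3, 0.85, 1.2, 0.75, 1.4],
    "apple": [3.2, 4.1, 3.9, 3.5, 4.0, 3.8],
}

def euclidean(a, b):
    """Straight-line distance between two vectors (smaller = closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two vectors (closer to 1 = more aligned)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for first, second in [("king", "queen"), ("king", "apple"), ("queen", "apple")]:
    a, b = vectors[first], vectors[second]
    print(first, second,
          "distance:", round(euclidean(a, b), 3),
          "cosine:", round(cosine(a, b), 3))
```

Both measures agree here: the "king"/"queen" pair has the smallest distance and the highest cosine similarity of the three pairs.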

This is the essence of how word embeddings capture semantic information, though in practice the vectors live in much higher-dimensional spaces (often hundreds or thousands of dimensions).



# How Are Embeddings Made?

- Creating embeddings involves transforming discrete tokens (like words, sentences, or entire documents) into continuous vectors in a high-dimensional space. 
- These vectors capture the semantic similarities between the tokens. Let's dive into the process of creating word embeddings using a popular method: `Word2Vec`.




### Step 1: Text Pre-Processing

- Before training, raw text data must be cleaned and prepared. 
- This involves removing punctuation, lowercasing, tokenizing (converting sentences into words), removing stop words, and possibly lemmatizing (reducing words to their base or root form).


In [None]:
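import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download tokenizer models, the stop word list, and WordNet data
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once rather than on every call or every token
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Replacing punctuation (any non-word character) with spaces
    text = re.sub(r'\W', ' ', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Removing stop words, then lemmatizing what remains
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in STOP_WORDS]
    return tokens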

# Sample text
sample_text = "Natural language processing enables computers to understand human language."
tokens = preprocess_text(sample_text)
print(tokens)


### Step 2: Preparing Data for Word2Vec

- Word2Vec requires a sequence of tokenized sentences for training. 
- If you're working with a large corpus, each document should be tokenized into sentences, and then each sentence tokenized into words.


In [None]:
# Assuming 'raw_text' is a large string containing all your text
from nltk.tokenize import sent_tokenize

# Tokenize the raw text into sentences
sentences = sent_tokenize(raw_text)

# Preprocess each sentence and collect them
preprocessed_sentences = [preprocess_text(sentence) for sentence in sentences]


### Step 3: Understanding Word2Vec Parameters

When initializing Word2Vec, key parameters to consider are:

- `vector_size`: Dimensionality of the word vectors.
- `window`: Maximum distance between the current and predicted word within a sentence.
- `min_count`: Ignores all words with total frequency lower than this.
- `workers`: Number of worker threads to train the model.
- `sg`: Training algorithm: 1 for Skip-Gram, 0 for CBOW.
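
As a rough illustration of what `min_count` does, the sketch below filters a toy vocabulary by frequency with plain Python. It does not call Gensim; the corpus and the helper `filter_vocab` are hypothetical, and Gensim's own vocabulary building involves more steps (for example, optional downsampling of very frequent words).

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Return the words whose total corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Hypothetical tokenized corpus
toy_sentences = [
    ["natural", "language", "processing"],
    ["language", "model"],
    ["processing", "language", "data"],
]

print(sorted(filter_vocab(toy_sentences, min_count=1)))
print(sorted(filter_vocab(toy_sentences, min_count=2)))
```

With `min_count=2`, only words that appear at least twice survive, which is why small corpora can end up with a very small vocabulary.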


### Step 4: Training the Model

- Using Gensim's Word2Vec implementation, train the model on the preprocessed sentences.

In [None]:
from gensim.models import Word2Vec

# Train the Word2Vec model
model = Word2Vec(preprocessed_sentences, vector_size=100, window=5, min_count=2, workers=4, sg=1)


### Step 5: Exploring and Visualizing Embeddings

- After training, explore the embeddings by checking similar words and visualizing the word vectors to get a sense of how words are positioned relative to each other in the embedding space.

In [None]:
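# Find similar words (the query word must be in the trained vocabulary)
if 'computer' in model.wv.key_to_index:
    print(model.wv.most_similar('computer'))

# Visualize embeddings using t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = list(model.wv.key_to_index.keys())
word_vectors = model.wv[words]  # 2-D NumPy array, one row per word

# t-SNE's perplexity must be smaller than the number of points
tsne = TSNE(n_components=2, perplexity=min(30, len(words) - 1))
word_vecs_2d = tsne.fit_transform(word_vectors)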

plt.figure(figsize=(10, 10))
for i, word in enumerate(words):
    plt.scatter(word_vecs_2d[i, 0], word_vecs_2d[i, 1])
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.show()

- This visualization helps illustrate the principle that semantically similar words cluster together in the embedding space. 
- For example, "king" and "queen" might be closer together, whereas "apple" might be far from them but closer to "orange".

# Word-level Embeddings

Word-level embeddings represent individual words in a high-dimensional space. These embeddings capture the semantic properties of words based on their usage in the training corpus.

In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode a word
word = 'king'
inputs = tokenizer(word, return_tensors='pt')
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)

# Get the word embedding: drop the [CLS] and [SEP] positions, then squeeze away the batch dimension
word_embedding = outputs.last_hidden_state
print("Word Embedding for 'king':", word_embedding[:, 1:-1, :].squeeze().numpy())

# Sentence-level Embeddings

Sentence-level embeddings represent entire sentences, capturing not just the semantics of individual words but also how those words are used together in a sentence.



In [None]:
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define sentences
sentences = [
    "The king rules the kingdom.",
    "A piece of cake.",
    "I have a dream."
]

# Generate embeddings
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding[:5]}...")  # Display first 5 elements for brevity
    print("Embedding length:", len(embedding), "\n")

- In the word-level embedding example, we extracted an embedding for a single word, 'king'. Because the word was encoded on its own, the vector reflects what the model learned about 'king' from its training data rather than from any surrounding sentence.

- For the sentence-level embeddings, we encoded three whole sentences, including 'The king rules the kingdom.'. Each resulting embedding is an aggregate (in this case, a mean) of all token embeddings in that sentence, capturing its overall semantic content.
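
Mean pooling itself is simple to sketch. The example below averages hypothetical 3-dimensional token vectors into one sentence vector; the numbers are made up for illustration, and a real model also has to ignore padding tokens when averaging.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors, dimension by dimension."""
    n_tokens = len(token_vectors)
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_vectors)]

# Hypothetical token embeddings for "the king rules" (made-up numbers)
token_vectors = [
    [0.0, 1.0, 2.0],   # "the"
    [3.0, 1.0, 0.0],   # "king"
    [3.0, 4.0, 1.0],   # "rules"
]

sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

The sentence vector has the same length as each token vector, which is why sentences of different lengths can still be compared directly.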

# Types of Embeddings

### Word2Vec
- Word2Vec, developed by Mikolov et al. at Google, is a predictive model for generating word embeddings. 
- Word2Vec uses a shallow neural network to learn word associations from a large corpus of text. It operates on the principle that "a word is known by the company it keeps," focusing on learning embeddings that predict nearby words.
- It uses either a Continuous Bag of Words (CBOW) or Skip-Gram model:
    - **Skip-Gram Model**: Predicts surrounding context words given a target word.
    - **CBOW Model**: Predicts the target word from a set of context words.

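Imagine the sentence "Cats love to play." With CBOW, Word2Vec learns to predict "love" when given "Cats" and "play" as context. With Skip-Gram, the direction is reversed: given "love", it learns to predict context words such as "Cats" and "play".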
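
The difference between the two objectives shows up in the training examples each one builds. The sketch below generates them for a tokenized sentence using plain Python; `skipgram_pairs` and `cbow_examples` are hypothetical helper names, and real implementations add details such as randomly shrunk windows and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: the target word is used to predict each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """(context_words, target) examples: the neighbors jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["cats", "love", "to", "play"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-Gram produces one training pair per (target, neighbor) combination, while CBOW produces one example per position with all neighbors grouped as the input.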

In [None]:
from gensim.models import Word2Vec

# Sample sentences
sentences = [["cat", "say", "meow"], ["dog", "say", "bark"]]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access the word vector for 'cat'
vector_cat = model.wv['cat']
print(vector_cat)


### GloVe
- GloVe (Global Vectors for Word Representation), developed by Stanford, is an unsupervised learning model for obtaining vector representations for words by aggregating global word-word co-occurrence statistics from a corpus.

    - Co-occurrence Matrix: Constructs a matrix that captures how often pairs of words occur together in the context of the entire corpus.
    - Matrix Factorization: Reduces this matrix to produce a lower-dimensional representation (embeddings) that captures major semantic relationships.
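
The first step, counting co-occurrences, can be sketched in a few lines. This is a simplified illustration rather than the GloVe reference implementation: it counts co-occurrences within a fixed window and skips the distance weighting that GloVe applies. The helper name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbor) pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = [["cat", "say", "meow"], ["dog", "say", "bark"]]
counts = cooccurrence_counts(sentences, window=1)
print(counts[("cat", "say")], counts[("say", "cat")], counts[("cat", "dog")])  # 1 1 0
```

Words that never share a window, like "cat" and "dog" here, get a count of zero even though both co-occur with "say"; the factorization step is what lets such words end up with similar vectors.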


In [None]:
from gensim.models import KeyedVectors

# Assumes the GloVe vectors file has been downloaded to the working directory
glove_input_file = 'glove.6B.100d.txt'

# GloVe text files have no header line, so tell Gensim (4.0+) not to expect one;
# this replaces the older glove2word2vec conversion step
model = KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True)

# Access the word vector for 'cat'
vector_cat = model['cat']
print(vector_cat)


### FastText
- FastText, developed by Facebook Research, extends Word2Vec to consider subword information.
- FastText breaks words into a bag of character n-grams. This approach allows FastText to generate word embeddings for words not seen during training, making it particularly useful for handling rare words or morphologically rich languages.

    - Subword N-grams: Breaks words into smaller chunks and learns embeddings for these sub-parts, which are then aggregated to form the word's embedding.
    - Handling of Rare Words: By considering subwords, FastText can generate embeddings for rare or out-of-vocabulary words based on their compositional parts.



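For the word "unbelievable," even if it wasn't in the training corpus, FastText could build an embedding from character n-grams it has already seen in other words, such as "<un", "believ", and "able>" (where "<" and ">" mark the word boundaries). These n-grams are character windows rather than true morphemes, but they often overlap with meaningful prefixes, stems, and suffixes.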
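
To see what these character n-grams look like, the sketch below extracts them in plain Python. It assumes the commonly described FastText scheme of wrapping the word in `<` and `>` boundary markers and taking n-grams of length 3 to 6; `char_ngrams` is a hypothetical helper, not a Gensim or fastText API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

ngrams = char_ngrams("unbelievable")
print(len(ngrams))
print(ngrams[:5])
```

An out-of-vocabulary word still shares many of these n-grams with words seen during training, which is what lets the model assemble a vector for it.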

In [None]:
from gensim.models import FastText

# Sample sentences
sentences = [["cat", "say", "meow"], ["dog", "say", "bark"]]

# Train a FastText model
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access the word vector for 'cat'
vector_cat = model.wv['cat']
print(vector_cat)

# Limitations

Each embedding technique has its own strengths, making it suitable for different NLP tasks. However, they share common limitations, such as the inability to distinguish between the senses of a word with multiple meanings (word sense disambiguation).

In [None]:
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    "The cat sat on the mat.",
    "Dogs love to play in the park.",
    "I love my cat and my dog.",
    "The bank of the river was flooded.",
    "He went to the bank to deposit money."
]

# Tokenization of sentences
tokens = [sentence.lower().split() for sentence in sentences]

# Training the Word2Vec model
model = Word2Vec(tokens, vector_size=100, window=5, min_count=1, workers=4)

# Exploring word embeddings
word_vectors = model.wv

# Example: Find most similar words to 'bank'
similar_words = word_vectors.most_similar('bank', topn=5)
print("Words similar to 'bank':", similar_words)


While Word2Vec is powerful for capturing semantic relationships, it assigns a single vector per word, regardless of its different meanings in various contexts. For example, 'bank' in our sentences refers to both the land alongside a river and a financial institution, but Word2Vec cannot distinguish these senses.



In [None]:
# Assuming 'bank' has been included in the model vocabulary
print("Vector for 'bank':", word_vectors['bank'])

This output will show a single vector representation for 'bank', learned from all of its contexts in the training data, so both senses are blended into one point. Because Word2Vec does not generate different embeddings for the different meanings of 'bank', it can't disambiguate between them based on context alone.
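
One way to picture this limitation: a static embedding model behaves like a lookup table keyed by the word alone. The sketch below uses a hypothetical table with made-up vectors to show that the surrounding sentence never enters the lookup.

```python
# Hypothetical static embedding table (made-up 3-dimensional vectors)
static_embeddings = {
    "bank": [0.4, 0.1, 0.7],
    "river": [0.9, 0.2, 0.1],
    "money": [0.1, 0.8, 0.6],
}

def embed_word(word, sentence):
    """Look up a word's vector; the sentence is accepted but deliberately unused."""
    return static_embeddings[word]

river_sense = embed_word("bank", "The bank of the river was flooded.")
money_sense = embed_word("bank", "He went to the bank to deposit money.")
print(river_sense == money_sense)  # True
```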



# Overcoming Limitations
More advanced models like BERT or ELMo offer solutions to this limitation by providing context-dependent embeddings, where the representation of 'bank' would differ based on its usage in a sentence. These models generate different embeddings for the same word in different contexts, better capturing its various meanings.