# What is Word Embedding in NLP?
Word Embedding is an approach for representing words and documents. A word embedding (or word vector) is a numeric vector that represents a word in a lower-dimensional space, so that words with similar meanings have similar representations.

Word embeddings are a method of extracting features from text so that those features can be fed into a machine learning model. They try to preserve syntactic and semantic information. Methods such as Bag of Words (BoW), CountVectorizer and TF-IDF rely on word counts in a sentence but do not preserve any syntactic or semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary, and most of its elements are zero, which gives a sparse matrix. Large input vectors mean a huge number of weights, which makes training computationally expensive. Word embeddings offer a solution to these problems.

# Need for Word Embedding
- To reduce dimensionality.
- To use a word to predict the words around it.
- To capture inter-word semantics.

# How are Word Embeddings used?
- They are used as input to machine learning models: take the words -> give their numeric representation -> use in training or inference.
- To represent or visualize any underlying patterns of usage in the corpus that was used to train them.

Word vectors can also be built for different words on the basis of given features. Words with similar vectors are likely to have similar meanings or to convey similar sentiment.
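
One common way to quantify how close two word vectors are is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are made-up feature values rather than trained embeddings, and `cosine_similarity` and `toy_vectors` are hypothetical names introduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-made feature vectors (assumed values, not trained embeddings)
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these assumed values, the "king" and "queen" vectors point in almost the same direction and score much higher than "king" and "apple".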

# Approaches for Text Representation
# 1. Traditional Approach
The conventional method involves compiling a list of distinct terms and giving each one a unique integer value, or ID. Each word in a sentence is then replaced by its ID. Every vocabulary word is handled as a feature in this approach, so a large vocabulary results in an extremely large feature size. Common traditional methods include:

# 1.1. One-Hot Encoding
One-hot encoding is a simple method for representing words in natural language processing (NLP). In this encoding scheme, each word in the vocabulary is represented as a unique vector, where the dimensionality of the vector is equal to the size of the vocabulary. The vector has all elements set to 0, except for the element corresponding to the index of the word in the vocabulary, which is set to 1.

While one-hot encoding is simple and intuitive, it has several disadvantages that may limit its effectiveness in certain applications:

- One-hot encoding results in high-dimensional vectors, making it computationally expensive and memory-intensive, especially with large vocabularies.
- It does not capture semantic relationships between words; each word is treated as an isolated entity without considering its meaning or context.
- It is restricted to the vocabulary seen during training, making it unsuitable for handling out-of-vocabulary words.
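
The encoding scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: `one_hot_encode` and the tiny sample sentence are assumptions made for the example.

```python
def one_hot_encode(word, vocabulary):
    """Return a one-hot list for `word`; raises KeyError for unknown words."""
    index = vocabulary[word]
    vector = [0] * len(vocabulary)
    vector[index] = 1
    return vector

sentence = "the cat sat on the mat"
# Sorted so that the word-to-index mapping is reproducible
vocabulary = {word: i for i, word in enumerate(sorted(set(sentence.split())))}

for word in sentence.split():
    print(f"{word:>4}: {one_hot_encode(word, vocabulary)}")
```

Each vector is as long as the vocabulary, and a word that is missing from `vocabulary` cannot be encoded at all, which mirrors the dimensionality and out-of-vocabulary limitations listed above.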

# 1.2. Bag of Words (BoW)
Bag-of-Words (BoW) is a text representation technique that represents a document as an unordered collection of words and their respective frequencies. It discards the word order and captures only how often each word occurs in the document, creating a vector representation.

While BoW is a simple and interpretable representation, the following disadvantages highlight its limitations in capturing certain aspects of language structure and semantics:

- BoW ignores the order of words in the document, leading to a loss of sequential information and context. This makes it less effective for tasks where word order is crucial, such as natural language understanding.
- BoW representations are often sparse, with many elements being zero, which increases memory requirements and computational cost, especially with large datasets.

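
To make the idea concrete, here is a minimal Bag-of-Words sketch using only the Python standard library. The two sample documents and the helper name `bag_of_words` are assumptions made for illustration.

```python
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog barked at the cat",
]

# Vocabulary: every distinct word in the corpus, in sorted order
bow_vocabulary = sorted({word for doc in documents for word in doc.split()})

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in `document`."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

bow_matrix = [bag_of_words(doc, bow_vocabulary) for doc in documents]
print(bow_vocabulary)
for row in bow_matrix:
    print(row)

# Shuffling the words gives exactly the same vector: word order is lost
print(bag_of_words("mat the on sat cat the", bow_vocabulary) == bow_matrix[0])
```

Even in this tiny corpus each row already contains several zeros, which hints at how sparse the vectors become with a realistic vocabulary.
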
# 1.3. Term frequency-inverse document frequency (TF-IDF)
Term Frequency-Inverse Document Frequency, commonly known as TF-IDF, is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is widely used in natural language processing and information retrieval to evaluate the significance of a term within a specific document in a larger corpus. TF-IDF consists of two components:

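Term Frequency (TF): measures how often a term (word) appears in a document. It is commonly computed as the number of times the term t occurs in document d divided by the total number of terms in d.

Inverse Document Frequency (IDF): measures how rare, and therefore how informative, a term is across a collection of documents. It is commonly computed as the logarithm of the total number of documents in the corpus D divided by the number of documents that contain t.

The TF-IDF score for a term t in a document d is then given by multiplying the TF and IDF values:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)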

The higher the TF-IDF score for a term in a document, the more important that term is to that document within the context of the entire corpus. This weighting scheme helps in identifying and extracting relevant information from a large collection of documents, and it is commonly used in text mining, information retrieval, and document clustering.
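
The product above can be worked through by hand on a toy corpus. The sketch below assumes one common variant of the definitions (TF as a relative frequency and IDF as the natural logarithm of N divided by the document frequency). Other variants exist, and the function names and sample corpus are illustrative only.

```python
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "the bird flew over the mat".split(),
]

def term_frequency(term, document):
    """Share of the document's tokens that are `term`."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    """log(N / df); assumes `term` occurs in at least one document."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

for term in ["the", "cat", "sat"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

Under these assumptions "the" scores zero because it occurs in every document, while "sat", which occurs in only one document, scores higher than "cat", which occurs in two.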

Let's implement Term Frequency-Inverse Document Frequency (TF-IDF) in Python with the scikit-learn library. The code begins by defining a set of sample documents. The TfidfVectorizer is then used to transform these documents into a TF-IDF matrix, and the TF-IDF values for each word in each document are extracted and printed. This statistical measure helps assess the importance of words in a document relative to their frequency across a collection of documents, aiding information retrieval and text analysis tasks.

TF-IDF is a widely used technique in information retrieval and text mining, but its limitations should be considered, especially when dealing with tasks that require a deeper understanding of language semantics. For example:

- No semantic awareness: TF-IDF treats words as independent entities and doesn't consider semantic relationships between them. This limits its ability to capture contextual information and word meanings.
- Sensitivity to document length: Longer documents tend to have higher raw term frequencies, which can bias TF-IDF towards longer documents unless the scores are normalized.

# 2. Neural Approach
# 2.1. Word2Vec
Word2Vec is a neural approach for generating word embeddings. It belongs to the family of distributed representation models and is a popular technique in natural language processing (NLP) for representing words as vectors in a continuous vector space. Developed by a team at Google, Word2Vec aims to capture the semantic relationships between words by mapping them to dense vectors, typically with a few hundred dimensions, far fewer than the vocabulary size. The underlying idea is that words with similar meanings should have similar vector representations. In Word2Vec every word is assigned a vector, which starts out as either a random vector or a one-hot vector and is then adjusted during training.

In [1]:
from gensim.models import Word2Vec

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 
concerned with the interactions between computers and human language. As such, NLP is related to the area of 
human-computer interaction. Many challenges in NLP involve understanding natural language to derive meaning 
and information from it."""


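# Split the text on whitespace and treat the whole text as one training sentence
# (punctuation stays attached to words, e.g. 'language.' differs from 'language')
sentences = [text.split()]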

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access vector representation of a word
word_vector = model.wv['language']

# Display word vector
print(word_vector)


[-8.6228903e-03  3.6640612e-03  5.1995856e-03  5.7334779e-03
  7.4570905e-03 -6.1744191e-03  1.0984222e-03  6.0582189e-03
 -2.8444759e-03 -6.1877300e-03 -4.1608122e-04 -8.3801262e-03
 -5.6023225e-03  7.1134185e-03  3.3506860e-03  7.2196117e-03
  6.8130088e-03  7.5271088e-03 -3.7921285e-03 -5.7889870e-04
  2.3492652e-03 -4.5099277e-03  8.3968453e-03 -9.8600173e-03
  6.7578158e-03  2.9302840e-03 -4.9444796e-03  4.3994249e-03
 -1.7387009e-03  6.7221164e-03  9.9649020e-03 -4.3673944e-03
 -6.0853740e-04 -5.7161008e-03  3.8461862e-03  2.8021245e-03
  6.9052088e-03  6.0953684e-03  9.5311170e-03  9.2763416e-03
  7.9116691e-03 -6.9992831e-03 -9.1550248e-03 -3.6017137e-04
 -3.0979610e-03  7.8838654e-03  5.9249173e-03 -1.5498387e-03
  1.5103049e-03  1.7842182e-03  7.8201881e-03 -9.5230276e-03
 -2.0481556e-04  3.4776574e-03 -9.4633829e-04  8.3850771e-03
  9.0182032e-03  6.5341769e-03 -7.1102567e-04  7.7052424e-03
 -8.5333847e-03  3.1948513e-03 -4.6267984e-03 -5.0804727e-03
  3.5976472e-03  5.38585

There are two neural embedding methods for Word2Vec, Continuous Bag of Words (CBOW) and Skip-gram.

# 2.2. Continuous Bag of Words (CBOW)

Continuous Bag of Words (CBOW) is a type of neural network architecture used in the Word2Vec model. The primary objective of CBOW is to predict a target word based on its context, which consists of the surrounding words in a given window. Given a sequence of words in a context window, the model is trained to predict the target word at the center of the window.
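
The training examples that CBOW learns from can be sketched as (context, target) pairs. The helper below is a hypothetical illustration of how such pairs might be built for a given window size. It is not how gensim does it internally.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in `tokens`."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = list(cbow_pairs(tokens, window=2))
for context, target in pairs:
    print(f"{context} -> {target}")
```

Words near the start or end of the sentence simply get a shorter context, since there are fewer neighbours on one side.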

CBOW is a feedforward neural network with a single hidden layer. The input layer represents the context words, and the output layer represents the target word. The hidden layer contains the learned continuous vector representations (word embeddings) of the input words.

The architecture is useful for learning distributed representations of words in a continuous vector space.



The weights between the input layer and the hidden layer are learned during training, and the dimensionality of the hidden layer determines the size of the word embeddings (the continuous vector space).
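
One way to see why the input-to-hidden weights are the word embeddings: multiplying a one-hot input vector by the weight matrix just selects one row of that matrix. The NumPy sketch below illustrates this with randomly initialized weights; the sizes and the name `W_in` are assumptions for the example.

```python
import numpy as np

vocab_size, embed_size = 5, 3
rng = np.random.default_rng(seed=0)

# Input-to-hidden weight matrix: one row of size embed_size per vocabulary word
W_in = rng.normal(size=(vocab_size, embed_size))

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

hidden = one_hot @ W_in      # full matrix product with the one-hot input
lookup = W_in[word_index]    # direct row lookup gives the same vector

print(np.allclose(hidden, lookup))
```

This is why frameworks usually implement the first layer as a lookup table (for example an embedding layer) instead of an explicit matrix product.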

In [2]:
from gensim.models import Word2Vec

# Sample sentences for training
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked', 'at', 'the', 'cat'],
    ['the', 'bird', 'flew', 'over', 'the', 'cat']
]

# Train CBOW model (cbow means sg=0)
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# Access the word vector for 'cat'
word_vector = cbow_model.wv['cat']
print("CBOW Word Vector for 'cat':", word_vector)

# Finding most similar words to 'cat'
similar_words = cbow_model.wv.most_similar('cat')
print("\nMost similar words to 'cat':", similar_words)


CBOW Word Vector for 'cat': [-0.01631721  0.0089922  -0.00827343  0.00164993  0.01699845 -0.00892532
  0.00903522 -0.01357311 -0.00709744  0.01879622 -0.00315539  0.00064151
 -0.00828214 -0.01536442 -0.00301554  0.00494069 -0.00177506  0.01106851
 -0.00548655  0.00452008  0.010912    0.01669134 -0.00290626 -0.01841786
  0.00874216  0.00114401  0.01488318 -0.00162592 -0.00527712 -0.01750513
 -0.00171165  0.00565252  0.01080284  0.01410456 -0.01140574  0.00371808
  0.01217882 -0.00959526 -0.00621315  0.01359672  0.00326414  0.0003788
  0.00694596  0.00043551  0.01923731  0.01012244 -0.01783401 -0.01408344
  0.00180315  0.01278541]

Most similar words to 'cat': [('bird', 0.12493900209665298), ('barked', 0.07400304824113846), ('the', 0.042383477091789246), ('at', 0.018274817615747452), ('over', 0.011328201740980148), ('on', 0.0013540179934352636), ('mat', -0.11909283697605133), ('flew', -0.1742609143257141), ('dog', -0.17549626529216766), ('sat', -0.24705801904201508)]


Explanation of parameters:

- vector_size=50: Dimensionality of the word vectors.
- window=3: Context window size (maximum distance between the target word and the context words considered).
- min_count=1: Ignores words with a total frequency lower than this.
- sg=0: Selects the CBOW architecture (the default).

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define CBOW model
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, context):
        # context: 1-D tensor of context word indices, shape (2 * context_size,)
        # Sum over the context words to get one vector of shape (embed_size,)
        context_embeds = self.embeddings(context).sum(dim=0)
        # Scores (logits) over the vocabulary for the target word
        return self.linear(context_embeds)

# Sample data
context_size = 2  # number of words taken on each side of the target
raw_text = "word embeddings are awesome because they capture the meaning of words in dense vectors"
tokens = raw_text.split()
vocab = sorted(set(tokens))  # sorted so that word indices are reproducible
word_to_index = {word: i for i, word in enumerate(vocab)}

# Build (context, target) pairs; the text needs more than 2 * context_size tokens
data = []
for i in range(context_size, len(tokens) - context_size):
    context_words = tokens[i - context_size:i] + tokens[i + 1:i + context_size + 1]
    context = torch.tensor([word_to_index[word] for word in context_words])
    target = torch.tensor(word_to_index[tokens[i]])
    data.append((context, target))

# Hyperparameters
vocab_size = len(vocab)
embed_size = 10
learning_rate = 0.01
epochs = 100

# Initialize CBOW model
cbow_model = CBOWModel(vocab_size, embed_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(cbow_model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    total_loss = 0.0
    for context, target in data:
        optimizer.zero_grad()
        output = cbow_model(context)  # shape (vocab_size,)
        # CrossEntropyLoss expects a batch dimension: (1, vocab_size) and (1,)
        loss = criterion(output.unsqueeze(0), target.unsqueeze(0))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")

# Example usage: Get embedding for a specific word
word_to_lookup = "embeddings"
word_index = word_to_index[word_to_lookup]
embedding = cbow_model.embeddings(torch.tensor([word_index]))
print(f"Embedding for '{word_to_lookup}': {embedding.detach().numpy()}")

# 2.3. Skip-Gram
The Skip-Gram model learns distributed representations of words in a continuous vector space. The main objective of Skip-Gram is to predict context words (words surrounding a target word) given a target word. This is the opposite of the Continuous Bag of Words (CBOW) model, where the objective is to predict the target word based on its context. In practice, Skip-Gram tends to produce better embeddings for infrequent words, while CBOW is faster to train.
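
The training examples for Skip-Gram can be sketched as (target, context) pairs, one pair per neighbouring word. The helper below is a simplified, hypothetical illustration. Real implementations add refinements such as subsampling of frequent words and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
for target, context in skipgram_pairs(tokens, window=1):
    print(f"{target} -> {context}")
```

Each target word produces several training pairs, one per context word, whereas CBOW combines the whole context into a single example per target.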



After applying the above neural embedding methods, we get trained vectors for each word after many iterations through the corpus. These trained vectors preserve syntactic and semantic information in far fewer dimensions than the vocabulary size. Words with similar meaning or semantic information are placed close to each other in the vector space.

Let's understand this with a basic example. In the Python code below, the vector_size parameter controls the dimensionality of the word vectors, and you can adjust other parameters such as window based on your specific needs.

Note: Word2Vec models can perform better with larger datasets. If you have a large corpus, you might achieve more meaningful word embeddings.



In [1]:
from gensim.models import Word2Vec

# Same sample sentences for training
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked', 'at', 'the', 'cat'],
    ['the', 'bird', 'flew', 'over', 'the', 'cat']
]

# Train Skip-Gram model (skip-gram means sg=1)
skipgram_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Access the word vector for 'cat'
word_vector = skipgram_model.wv['cat']
print("Skip-Gram Word Vector for 'cat':", word_vector)

# Finding most similar words to 'cat'
similar_words = skipgram_model.wv.most_similar('cat')
print("\nMost similar words to 'cat':", similar_words)


Skip-Gram Word Vector for 'cat': [-0.01631985  0.0089937  -0.00827527  0.00164995  0.01700127 -0.00892669
  0.00903687 -0.01357599 -0.00709863  0.01880009 -0.00315596  0.00064206
 -0.00828341 -0.01536769 -0.00301628  0.00494125 -0.00177575  0.01107021
 -0.00548739  0.00452096  0.01091394  0.01669473 -0.00290724 -0.01842082
  0.00874346  0.00114407  0.01488624 -0.00162646 -0.00527802 -0.01750879
 -0.00171249  0.00565381  0.01080491  0.01410751 -0.0114081   0.00371863
  0.01218077 -0.00959739 -0.00621481  0.0135988   0.00326435  0.00037924
  0.00694775  0.00043561  0.01924111  0.01012394 -0.01783769 -0.01408602
  0.00180341  0.01278773]

Most similar words to 'cat': [('bird', 0.12490319460630417), ('barked', 0.07400049269199371), ('the', 0.04237981140613556), ('at', 0.01827564835548401), ('over', 0.011206768453121185), ('on', 0.001355123007670045), ('mat', -0.11909693479537964), ('flew', -0.17425644397735596), ('dog', -0.17548997700214386), ('sat', -0.24706168472766876)]


Explanation of parameters:

- sg=1: Selects the Skip-Gram architecture.
- The rest of the parameters remain the same as in the CBOW example.

Key differences:

- CBOW (in gensim, sg=0): predicts a word given its surrounding context (the words before and after the target word).
- Skip-Gram (in gensim, sg=1): predicts the context words given a target word.

Outputs:

Both examples print the word vector for "cat" and the words most similar to "cat" according to the trained embeddings. On a corpus this small the two models produce nearly identical vectors, as the outputs above show. On larger corpora the results differ more, as CBOW tends to work better for frequent words, while Skip-Gram works better for infrequent words.

In [2]:

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt') # Download the tokenizer models if not already downloaded

sample = "Word embeddings are dense vector representations of words."
tokenized_corpus = word_tokenize(sample.lower()) # Lowercasing for consistency

skipgram_model = Word2Vec(sentences=[tokenized_corpus],
						vector_size=100, # Dimensionality of the word vectors
						window=5,		 # Maximum distance between the current and predicted word within a sentence
						sg=1,			 # Skip-Gram model (1 for Skip-Gram, 0 for CBOW)
						min_count=1,	 # Ignores all words with a total frequency lower than this
						workers=4)	 # Number of CPU cores to use for training the model

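# The constructor above has already trained the model on the corpus;
# this call continues training for 10 more epochs
skipgram_model.train([tokenized_corpus], total_examples=1, epochs=10)

# Save the model to disk, reload it, and look up a word vector
skipgram_model.save("skipgram_model.model")
loaded_model = Word2Vec.load("skipgram_model.model")
vector_representation = loaded_model.wv['word']
print("Vector representation of 'word':", vector_representation)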


Vector representation of 'word': [-9.5800208e-03  8.9437785e-03  4.1664648e-03  9.2367809e-03
  6.6457358e-03  2.9233587e-03  9.8055992e-03 -4.4231843e-03
 -6.8048164e-03  4.2256550e-03  3.7299085e-03 -5.6668529e-03
  9.7035142e-03 -3.5551414e-03  9.5499391e-03  8.3657773e-04
 -6.3355025e-03 -1.9741615e-03 -7.3781307e-03 -2.9811086e-03
  1.0425397e-03  9.4814906e-03  9.3598543e-03 -6.5986011e-03
  3.4773252e-03  2.2767992e-03 -2.4910474e-03 -9.2290826e-03
  1.0267317e-03 -8.1645092e-03  6.3240929e-03 -5.8001447e-03
  5.5353874e-03  9.8330071e-03 -1.5987856e-04  4.5296676e-03
 -1.8086446e-03  7.3613892e-03  3.9419360e-03 -9.0095028e-03
 -2.3953868e-03  3.6261671e-03 -1.0080514e-04 -1.2024897e-03
 -1.0558038e-03 -1.6681013e-03  6.0541567e-04  4.1633579e-03
 -4.2531900e-03 -3.8336846e-03 -5.0755290e-05  2.6549282e-04
 -1.7014991e-04 -4.7843382e-03  4.3120929e-03 -2.1710952e-03
  2.1056964e-03  6.6702347e-04  5.9686624e-03 -6.8418151e-03
 -6.8183104e-03 -4.4762432e-03  9.4359247e-03 -1.593

