# Embedding Techniques

## Introduction

This notebook walks through the embedding techniques listed below, which convert **text to numeric** values (vectors) so that machine learning models can process natural language. Machine learning algorithms cannot operate on raw text, so the text must first be converted to a numerical representation. This process is known as **embedding** the text.


The embedding techniques we will explore are:

1. Bag of Words (BoW)

2. TF-IDF (Term Frequency - Inverse Document Frequency)

3. Word2Vec

4. GloVe

5. FastText



## Bag of Words (BoW)

The BoW technique represents a text by the frequency of the words it contains. It records which words occur in the input text and how often, and uses those counts as the vector representation of the text. Hence we do not get any syntactic or semantic information about the text. The input can be anything from a few sentences to whole documents. Conceptually, we think of the whole document as a “bag” of words rather than a sequence.

The following steps are performed to generate a Bag of Words vector representation of text:

- **tokenizing** strings: for instance by using white-spaces and punctuation as token separators and giving an integer id for each possible token.

- **counting**: the occurrences of tokens in each document.
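The two steps above can be sketched in plain Python before turning to a library. This is a minimal illustration, not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` and the toy sentences are made up for this example, and the regular expression is an assumption chosen to keep tokens of two or more word characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # Whitespace and punctuation act as token separators.
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Step 1: tokenize every document and give each distinct token an integer id.
    tokenized = [tokenize(doc) for doc in docs]
    all_tokens = sorted({tok for tokens in tokenized for tok in tokens})
    vocab = {tok: idx for idx, tok in enumerate(all_tokens)}
    # Step 2: count how often each vocabulary token occurs in each document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

toy_docs = ["the cat sat on the mat", "the dog sat"]
toy_vocab, toy_vectors = bag_of_words(toy_docs)
print(toy_vocab)    # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(toy_vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a vector with one entry per vocabulary token, so "the" contributes a 2 in the first document and a 1 in the second.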

Let's see Bag of Words in action using the scikit-learn library.


In [1]:
#import the CountVectorizer class from the feature_extraction.text module of the sklearn library (also known as scikit-learn)
#CountVectorizer implements both tokenization and occurrence counting in a single class
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#CountVectorizer(*,
#                input='content',
#                encoding='utf-8',
#                decode_error='strict',
#                strip_accents=None,
#                lowercase=True,
#                preprocessor=None,
#                tokenizer=None,
#                stop_words=None,
#                token_pattern='(?u)\\b\\w\\w+\\b',
#                ngram_range=(1, 1),
#                analyzer='word',
#                max_df=1.0,
#                min_df=1,
#                max_features=None,
#                vocabulary=None,
#                binary=False,
#                dtype=<class 'numpy.int64'>)

In [35]:
#selecting the Bag of Words Vectorizer, with default configurations
#token_pattern='(?u)\\b\\w\\w+\\b' - match words that are at least two characters long
vectorizer = CountVectorizer()
vectorizer

In [36]:
#Use CountVectorizer to tokenize and count the word occurrences of a minimalistic corpus of text documents
#corpus - list of strings
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
#Learns the corpus, creates vocabulary index and builds a document-term matrix (string to vocabulary matrix)
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

The result of the CountVectorizer's fit_transform method is a document term matrix with 4 rows and 9 columns. 4 rows represent the 4 input documents in the corpus. 9 columns represent the tokenized words identified from the 4 documents.

In [37]:
#finding the tokens - this is the order of vocabulary index created
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [38]:
#vocabulary index - a dictionary with token as key and their index as values
#as we can see, indices are assigned in alphabetical order of the tokens
vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [39]:
#document-term matrix
#as per the vocabulary - 9 elements in each vector representation.
#count of tokens present marked by their frequency in their respective index as per vocabulary index
dtm_array = X.toarray()
dtm_array

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [40]:
#getting index of a term from vocabulary
vectorizer.vocabulary_.get('first')

2

In [41]:
#for each document, returns the vocabulary terms with non-zero counts (original word order and counts are not recovered)
vectorizer.inverse_transform(X)

[array(['this', 'is', 'the', 'first', 'document'], dtype='<U8'),
 array(['this', 'is', 'the', 'document', 'second'], dtype='<U8'),
 array(['the', 'and', 'third', 'one'], dtype='<U8'),
 array(['this', 'is', 'the', 'first', 'document'], dtype='<U8')]

In [22]:
print(corpus[0])
print(corpus[3])

This is the first document.
Is this the first document?


In [23]:
print(dtm_array[0])
print(dtm_array[3])

[0 1 1 1 0 0 1 0 1]
[0 1 1 1 0 0 1 0 1]


**The vector representation of two different texts with the same tokens in a different order is the same**, because the Bag of Words embedding only looks at the frequencies of the tokens and does not preserve token positions.

**To resolve ambiguities encoded in local positioning patterns, we can use ngrams feature**. An n-gram is a contiguous sequence of n items (typically words) from a given text.

The ngram_range parameter of CountVectorizer specifies the minimum and maximum length of the contiguous word sequences to extract:

- ngram_range=(1, 1): only unigrams (single words) are extracted.

- ngram_range=(1, 2): both unigrams and bigrams (pairs of consecutive words) are extracted.

- ngram_range=(1, 3): unigrams, bigrams, and trigrams (sequences of three consecutive words) are extracted.

- ngram_range=(2, 2): only bigrams are extracted.

This gives flexibility in capturing different levels of word combinations for text analysis.

**Increasing the range (especially the upper limit) can significantly increase the number of features generated, which might lead to higher computational costs and memory usage.**
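To make the effect of n-grams concrete, here is a small library-free sketch. The helper `word_ngrams` is hypothetical and only imitates what an `ngram_range` setting does conceptually (sliding a window over the tokens and joining them with a space). It shows that the two documents that were indistinguishable as unigram bags differ once bigrams are used.

```python
import re

def word_ngrams(text, ngram_range=(1, 1)):
    # Tokenize: lowercase, tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

doc_a = "This is the first document."
doc_b = "Is this the first document?"

# Same bag of unigrams, so the BoW vectors are identical.
print(sorted(word_ngrams(doc_a)) == sorted(word_ngrams(doc_b)))  # True

# The bigrams differ, so the two documents can now be told apart.
bigrams_a = word_ngrams(doc_a, (2, 2))
bigrams_b = word_ngrams(doc_b, (2, 2))
print(bigrams_a)  # ['this is', 'is the', 'the first', 'first document']
print(bigrams_b)  # ['is this', 'this the', 'the first', 'first document']
```

Note how a five-token document already yields four bigrams on top of its five unigrams, which is why the feature count grows quickly with the upper limit of the range.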



### Pros and Cons

While BoW is a simple and interpretable representation, the disadvantages below highlight its limitations in capturing certain aspects of language structure and semantics:

- BoW ignores the order of words in the document, leading to a **loss of sequential information and context** making it less effective for tasks where word order is crucial, such as in natural language understanding.

- BoW representations are **often sparse**, with many elements being zero resulting in increased memory requirements and computational inefficiency, especially when dealing with large datasets.

### Applications

The bag of words model is typically used to embed documents in order to train a classifier (a model that assigns a document to one of several categories). The mere presence and frequency of certain words is often strongly indicative of the category a document belongs to.

Some common uses of the bag of words method include spam filtering and sentiment analysis.



## TF - IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a numerical statistic that reflects the importance of a word in a document. It is commonly used in NLP to represent the relevance of a term to a document or a corpus of documents. The TF-IDF algorithm takes into account two main factors: the frequency of a word in a document (TF) and how rare the word is across all documents in the corpus (IDF).

**Term Frequency (TF)**

**TF measures the frequency of a term within a document**. It is calculated as the ratio of the number of times a term occurs in a document to the total number of terms in that document. The resulting value is a number between 0 and 1. The goal is to emphasize words that are frequent within a document.

TF(t,d) = (count of t in d) / (number of words in d)


**Inverse Document Frequency (IDF)**

**IDF measures how informative a term is across a collection of documents**. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. Without smoothing, the resulting value is greater than or equal to 0. The goal is to diminish the weight of terms that occur in many documents of the set and to increase the weight of terms that occur rarely.

With a large corpus, say 100,000,000 documents, the raw ratio N/df explodes for rare terms. To avoid this, we take the logarithm of the ratio.

At query time, a word that is not in the vocabulary has a DF of 0. Since we can’t divide by 0, we smooth the value by adding 1 to the denominator.

That gives the final formula:

idf(t,D) = log(N/(df + 1))

N: total number of documents
df: number of documents containing the term t

IDF will be very low for the most frequent words, such as stop words like “is”, because those words are present in almost all of the documents. With the smoothed formula above, a term that appears in every document even gets a slightly negative value, since N/(df + 1) is then below 1.

Note that scikit-learn's TfidfVectorizer, used in the cells that follow, applies a slightly different smoothed variant by default, idf(t) = ln((1 + N)/(1 + df)) + 1, uses raw counts as TF, and L2-normalizes each document vector, so its numbers will not match the formula above exactly.

The **TF-IDF score** for a term in a document is obtained by multiplying its TF and IDF scores.

TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)

TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus. The higher the TF-IDF score, the more important the term is in the document.
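The formulas above can be checked by hand with a few lines of plain Python. This is a sketch of the textbook definition given in this section (TF as a ratio, IDF as log(N/(df + 1))), not of any particular library; the names `tfidf_docs`, `tfidf_tokens`, `tf`, `idf` and `tf_idf` are introduced here only for illustration.

```python
import math
import re

tfidf_docs = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]
# Lowercased tokens of two or more word characters, one list per document.
tfidf_tokens = [re.findall(r"\b\w\w+\b", d.lower()) for d in tfidf_docs]
n_docs = len(tfidf_tokens)

def tf(term, tokens):
    # Share of the document's tokens that are equal to the term.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Smoothed inverse document frequency: log(N / (df + 1)).
    df = sum(1 for tokens in tfidf_tokens if term in tokens)
    return math.log(n_docs / (df + 1))

def tf_idf(term, doc_index):
    return tf(term, tfidf_tokens[doc_index]) * idf(term)

print(tf("second", tfidf_tokens[1]))  # 2 of the 6 tokens in document 1
print(idf("second"))                  # log(4 / 2): appears in one document
print(idf("the"))                     # log(4 / 5): appears in all four documents
print(tf_idf("second", 1))
```

"second" is frequent in its own document and rare in the corpus, so it receives the highest score there, while "the" is pushed down by its IDF.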

Let's see TF-IDF vectorization in action using the scikit-learn library.

In [42]:
#import the TfidfVectorizer class from the feature_extraction.text module of the sklearn library (also known as scikit-learn)
#TfidfVectorizer class converts a collection of raw documents to a matrix of TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
#TfidfVectorizer(*,
#                input='content',
#                encoding='utf-8',
#                decode_error='strict',
#                strip_accents=None,
#                lowercase=True,
#                preprocessor=None,
#                tokenizer=None,
#                analyzer='word',
#                stop_words=None,
#                token_pattern='(?u)\\b\\w\\w+\\b',
#                ngram_range=(1, 1),
#                max_df=1.0,
#                min_df=1,
#                max_features=None,
#                vocabulary=None,
#                binary=False,
#                dtype=<class 'numpy.float64'>,
#                norm='l2', use_idf=True,
#                smooth_idf=True,
#                sublinear_tf=False)

In [45]:
#selecting the TF - IDF Vectorizer, with default configurations
#token_pattern='(?u)\\b\\w\\w+\\b' - match words that are at least two characters long

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer

In [46]:
#Learns vocabulary and idf, return document-term matrix
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
X_tfidf

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [51]:
#TF-IDF values
X_tfidf.toarray()

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])

In [49]:
#vocabulary - tokens with index
tfidf_vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [52]:
#identified tokens
tfidf_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [53]:
#IDF values of tokens
tfidf_vectorizer.idf_

array([1.91629073, 1.22314355, 1.51082562, 1.22314355, 1.91629073,
       1.91629073, 1.        , 1.91629073, 1.22314355])

In [62]:
print('\nidf values:\n')
for ele1, ele2 in zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_):
    if(ele1=='document'):
      print(ele1, ':', ele2)
    else:
      print(ele1, '\t:', ele2)


idf values:

and 	: 1.916290731874155
document : 1.2231435513142097
first 	: 1.5108256237659907
is 	: 1.2231435513142097
one 	: 1.916290731874155
second 	: 1.916290731874155
the 	: 1.0
third 	: 1.916290731874155
this 	: 1.2231435513142097
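The IDF values printed above can be reproduced by hand, which helps explain why "the" gets exactly 1.0. The sketch below assumes the scikit-learn defaults shown in the signature earlier (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the documented formula is idf(t) = ln((1 + N)/(1 + df)) + 1 and each row is L2-normalized. The helper names are hypothetical and no scikit-learn code is called.

```python
import math
import re

sk_docs = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]
sk_tokens = [re.findall(r"\b\w\w+\b", d.lower()) for d in sk_docs]
sk_vocab = sorted({tok for tokens in sk_tokens for tok in tokens})

def sklearn_style_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for tokens in sk_tokens if term in tokens)
    return math.log((1 + len(sk_tokens)) / (1 + df)) + 1

def sklearn_style_row(tokens):
    # Raw term counts times IDF, then divide by the Euclidean (L2) norm of the row.
    weights = [tokens.count(term) * sklearn_style_idf(term) for term in sk_vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

for term in sk_vocab:
    print(f"{term:<9}: {sklearn_style_idf(term):.8f}")

print([round(v, 8) for v in sklearn_style_row(sk_tokens[0])])
```

A term that occurs in all 4 documents gives ln(5/5) + 1 = 1, which matches the value shown for "the", and the first row agrees with the first row of the TF-IDF matrix above.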


### Pros and Cons

Some of the advantages of using TF-IDF:

- Measures relevance: TF-IDF helps to identify which terms are most relevant to a particular document.

- Handles stop words: TF-IDF automatically down-weights common words that occur frequently in the text corpus (stop words) that do not carry much meaning or importance, making it a more accurate measure of term importance.

Some of its limitations are:

- Ignores the context: TF-IDF only considers the frequency of each term in a document, and does not take into account the context in which the term appears. This can lead to incorrect interpretations of the meaning of the document.

- Assumes independence: TF-IDF assumes that the terms in a document are independent of each other. However, this is often not the case in natural language, where words are often related to each other in complex ways.

- No concept of word order: TF-IDF treats words regardless of their order or position in the document. This can be problematic for certain applications, such as sentiment analysis, where word order can be crucial for determining the sentiment of a document.

### Applications

Here are some of the main applications of TF-IDF:

- Search engines: TF-IDF is used in search engines to rank documents based on their relevance to a query. The TF-IDF score of a document is used to measure how well the document matches the search query.

- Text classification: TF-IDF is used in text classification to identify the most important features in a document. The TF-IDF score of each term in the document is used to measure its relevance to the class.

- Information extraction: TF-IDF is used in information extraction to identify the most important entities and concepts in a document. The TF-IDF score of each term is used to measure its importance in the document.

- Keyword extraction: TF-IDF is used in keyword extraction to identify the most important keywords in a document. The TF-IDF score of each term is used to measure its importance in the document.

- Recommender systems: TF-IDF is used in recommender systems to recommend items to users based on their preferences. The TF-IDF score of each item is used to measure its relevance to the user’s preferences.

- Sentiment analysis: TF-IDF is used in sentiment analysis to identify the most important words in a document that contribute to the sentiment. The TF-IDF score of each word is used to measure its importance in the document.


## Word2Vec

While BoW and TF-IDF are frequency-based embedding techniques that follow a deterministic approach, **Word2Vec is a prediction-based embedding technique** that works with shallow neural networks.

Word2Vec is able to preserve the context of the words in the text, which BoW and TF-IDF embeddings cannot do.

Word2Vec gives similar words similar vectors and, consequently, helps capture context.

There are **two neural embedding methods for Word2Vec**:

- Continuous Bag of Words (CBOW)
- Skip-gram



**Continuous Bag of Words (CBOW):**

- It is a **neural network architecture** used in the Word2Vec model.
- It predicts a target word based on its context.
- It is a feedforward neural network with a single hidden layer.
- The input layer represents the context words.
- The output layer represents the target word.
- The hidden layer contains the learned continuous vector representations (word embeddings) of the input words.
- The weights between the input layer and the hidden layer are learned during training.
- The dimensionality of the hidden layer represents the size of the word embeddings (the continuous vector space).

**Skip-Gram:**

- The model **learns distributed representations of words in a continuous vector space**.
- It predicts context words given a target word.
- It works in the opposite direction to CBOW.
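The difference between the two architectures is easiest to see in the training examples they are fed. The sketch below only builds those examples and does not train a network; the function name `training_pairs`, the example sentence and the window size are illustrative choices.

```python
def training_pairs(tokens, window=2):
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: all context words together predict the target word.
        cbow.append((context, target))
        # Skip-gram: the target word predicts each context word separately.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

sentence_tokens = "the quick brown fox jumps".split()
cbow_pairs, skipgram_pairs = training_pairs(sentence_tokens, window=1)

print(cbow_pairs[1])       # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position, while Skip-gram produces one example per (target, context) pair, which is one reason Skip-gram takes longer to train on the same text.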


After training a Skip-Gram model over many iterations through the corpus, we obtain a trained vector for each word. These vectors preserve syntactic and semantic information and have far fewer dimensions than the vocabulary-sized vectors of BoW or TF-IDF. Vectors of words with similar meaning are placed close to each other in the vector space.

Word2Vec models can perform better with larger datasets as you achieve more meaningful word embeddings.

CBOW might be preferred when training resources are limited, and capturing syntactic information is important.

Skip-gram might be chosen when semantic relationships and the representation of rare words are crucial.

While Skip-Gram frequently performs better for infrequent words, CBOW is faster and typically performs better with frequent words.

### Pros and Cons

In NLP, Word2Vec is a significant method for representing words as vectors in a continuous vector space. The advantages of the model are:

- Semantic Representations: Word2Vec captures semantic connections between words. Words are represented in the vector space so that similar words are near one another. This enables the model to interpret words according to their context within a particular corpus.

- Distributional Semantics: Word2Vec generates vector representations that reflect semantic similarities by learning from the distributional patterns of words in a large corpus. Words with similar meanings are more likely to occur in similar contexts.

- Vector Arithmetic: Word2Vec generates vector representations that have intriguing algebraic characteristics. Vector arithmetic, for instance, can be used to capture word relationships. One well-known example is that the vector for “king” minus “man” plus “woman” lies close to the vector for “queen”.

- Transfer Learning: Pre-trained Word2Vec models can be used as a starting point for a variety of natural language processing tasks.

- Scalability: Word2Vec scales well and can handle large corpora with ease, which is essential for training on large text datasets.
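The vector arithmetic point in the list above can be illustrated with a toy example. The two-dimensional vectors below are hand-picked for this sketch (roughly "royalty" and "gender" axes) and are not the output of a trained Word2Vec model; real embeddings have hundreds of dimensions and the analogy holds only approximately.

```python
import math

# Hypothetical, hand-made embeddings: [royalty, gender]
toy_embeddings = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    # Word closest to vector(a) - vector(b) + vector(c), excluding the three inputs.
    va, vb, vc = toy_embeddings[a], toy_embeddings[b], toy_embeddings[c]
    target = [x - y + z for x, y, z in zip(va, vb, vc)]
    candidates = [w for w in toy_embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, toy_embeddings[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the input words from the candidates is a common convention for analogy queries, since the nearest vector to the result is often one of the inputs themselves.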


Below are the limitations of Word2Vec:

- Inability to handle unknown or out-of-vocabulary (OOV) words: if the model hasn’t encountered a word before, it has no vector for it and no way to build one.

- No shared representations at the sub-word level: Word2Vec represents every word as an independent vector, even though many words are morphologically similar, such as flawless and careless. This becomes a particular challenge in morphologically rich languages such as Arabic, German or Turkish.

### Applications

Word2Vec embeddings are used in a number of natural language processing (NLP) applications, such as machine translation, text classification, sentiment analysis, information retrieval and question answering. These applications benefit from the embeddings' capacity to capture semantic relationships.

## GloVe

**Global Vectors for Word Representation (GloVe)** is a powerful word embedding technique that captures the **semantic relationships** between words by considering their **co-occurrence probabilities within a corpus**.

The key to GloVe’s effectiveness lies in the construction of a **word-context matrix and** the subsequent **factorization process**.

**Word-Context Matrix Formation**:

The first step of GloVe involves creating a word-context matrix. This matrix is designed to represent the likelihood of a given word appearing near another across the entire corpus. Each cell in the matrix holds the co-occurrence count of how often words appear together in a certain context window.
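A co-occurrence count of this kind can be sketched in a few lines. This is a simplified illustration with unweighted counts, whitespace tokenization and a symmetric window; the function name `cooccurrence_counts` and the toy sentences are made up for the example.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    # Maps (word, context_word) to the number of times they appear within the window.
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

toy_sentences = ["ice is cold", "steam is hot", "ice is solid"]
cooc_counts = cooccurrence_counts(toy_sentences, window=1)

print(cooc_counts.get(("ice", "is"), 0))    # 2: adjacent in two sentences
print(cooc_counts.get(("is", "cold"), 0))   # 1
print(cooc_counts.get(("ice", "cold"), 0))  # 0: two positions apart, outside the window
```

Widening the window to 2 would also count ("ice", "cold"), which shows how the choice of context window shapes the matrix.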

**Factorization for Word Vectors**:

With the word-context matrix in place, the next step in GloVe is matrix factorization. The objective here is to **decompose this high-dimensional matrix into two smaller matrices: one representing words and the other contexts**.

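Let’s denote the word matrix by $W$ and the context matrix by $C$.

The ideal scenario is when the product of $W$ and $C^{\top}$ (the transpose of $C$) approximates the original word-context matrix $X$:

$$X \approx W C^{\top}$$

Through iterative optimization, GloVe adjusts $W$ and $C$ to minimize the difference between $X$ and $W C^{\top}$. (More precisely, the published GloVe objective fits the logarithm of the co-occurrence counts, with bias terms, using a weighted least-squares loss.) This process yields refined vector representations for each word, capturing the nuances of their co-occurrence patterns.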
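The idea of factorizing a co-occurrence matrix into a word matrix and a context matrix can be illustrated with NumPy. Note the substitution: this sketch uses a truncated SVD of log(1 + counts) as a stand-in, whereas GloVe itself learns the factors iteratively with a weighted least-squares loss. The 4x4 matrix `cooc` is a made-up example for a hypothetical four-word vocabulary.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for a 4-word vocabulary.
cooc = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

# Work on log(1 + count) so that large counts do not dominate.
log_cooc = np.log1p(cooc)

# Full SVD, then keep only the k strongest directions.
U, S, Vt = np.linalg.svd(log_cooc)
k = 2
W_vecs = U[:, :k] * np.sqrt(S[:k])     # one k-dimensional vector per word
C_vecs = Vt[:k, :].T * np.sqrt(S[:k])  # one k-dimensional vector per context

# The product of the two small matrices approximates the original matrix.
approx = W_vecs @ C_vecs.T
rank_k_error = np.linalg.norm(log_cooc - approx)

print(W_vecs.shape, C_vecs.shape)  # (4, 2) (4, 2)
print(rank_k_error)
```

Splitting the singular values evenly between the two factors is one common convention; it keeps the word and context vectors on a comparable scale.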

**Vector Representations**:

Once trained, GloVe provides each word with a dense vector that captures local context and global word usage patterns. **These vectors encode semantic and syntactic information**, revealing similarities and differences between words based on their overall usage in the corpus.

### Pros and Cons

Pros:

- Efficiently captures global statistics of the corpus.

- Good at representing both semantic and syntactic relationships.

- Effective in capturing word analogies.


Cons:

- Requires more memory for storing co-occurrence matrices.

- Less effective with very small corpora.


### Applications

- **Text Classification**: GloVe embeddings can be utilised as features in machine learning models for sentiment analysis, topic classification, spam detection, and other applications.

- **Named Entity Recognition (NER)**: By capturing the semantic relationships between words and enhancing the model’s capacity to identify entities in text, GloVe embeddings can improve the performance of NER systems.

- **Machine Translation**: GloVe embeddings can be used to represent words in the source and target languages in machine translation systems, which aim to translate text from one language to another, thereby enhancing the quality of the translation.

- **Question Answering Systems**: To help models comprehend the context and relationships between words and produce more accurate answers, GloVe embeddings are used in question-answering tasks.

- **Document Similarity and Clustering**: GloVe embeddings enable applications in information retrieval and document organization by measuring the semantic similarity between documents or grouping documents according to their content.

- **Word Analogy Tasks**: In word analogy tasks, GloVe embeddings frequently yield good results. For instance, the generated vector for “king-man + woman” might resemble the “queen” vector, demonstrating the capacity to recognize semantic relationships.

- **Semantic Search**: In semantic search applications, where retrieving documents or passages according to their semantic relevance to a user’s query is the aim, GloVe embeddings are helpful.

## FastText

**FastText** is an **advanced word embedding technique** developed by Facebook AI Research (FAIR) that **extends the Word2Vec model**. Unlike Word2Vec, **FastText not only considers whole words but also incorporates subword information — parts of words like n-grams**. This approach enables the **handling of morphologically rich languages** and captures information about word structure more effectively.

**Subword Information**:

FastText represents each word as a **bag of character n-grams in addition to the whole word** itself. This means that the word “apple” is represented by the word itself and its constituent n-grams like “ap”, “pp”, “pl”, “le”, etc. **This approach helps capture the meanings of shorter words and affords a better understanding of suffixes and prefixes**.
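Character n-gram extraction is simple to sketch. The helper `char_ngrams` below is hypothetical; its defaults (n from 3 to 6, with `<` and `>` added as word boundary markers) follow the settings described for fastText, while `mark_boundaries=False` reproduces the plain bigram example from the paragraph above.

```python
def char_ngrams(word, min_n=3, max_n=6, mark_boundaries=True):
    # Optionally wrap the word in boundary markers so that prefixes and
    # suffixes get n-grams that differ from word-internal ones.
    text = f"<{word}>" if mark_boundaries else word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

# Plain bigrams, as in the "apple" example above.
print(char_ngrams("apple", 2, 2, mark_boundaries=False))  # ['ap', 'pp', 'pl', 'le']

# Trigrams with boundary markers.
print(char_ngrams("apple", 3, 3))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```

With boundary markers, the n-gram `le>` at the end of "apple" is distinct from an `le` inside another word, which is what lets the model learn suffix-specific and prefix-specific information.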



**Model Training**:

Similar to Word2Vec, FastText can use either the CBOW or Skip-gram architecture. It incorporates the subword information during training. The neural network in FastText is trained to predict words (in CBOW) or context (in Skip-gram) not just based on the target words but also based on these n-grams.

### Pros and Cons

Pros:

A significant advantage of FastText is its ability to generate better word representations for **rare words or even words not seen during training**. By breaking down words into n-grams, FastText can construct meaningful representations for these words based on their subword units.

Word2Vec and GloVe, by contrast, cannot provide a vector for words that are not in the model's vocabulary.

- Better representation of rare words.

- Capable of handling out-of-vocabulary words.

- Richer word representations due to subword information.
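The out-of-vocabulary behaviour listed above can be mimicked with a toy lookup table. Everything here is an assumption made for illustration: the n-gram vectors are random rather than trained, only trigrams are used, and a plain dictionary replaces the hashing scheme a real FastText model uses to store n-gram vectors.

```python
import random

def trigrams(word):
    # Character trigrams of the word wrapped in boundary markers.
    wrapped = f"<{word}>"
    return [wrapped[i:i + 3] for i in range(len(wrapped) - 3 + 1)]

# Pretend these words were seen during training; give each of their
# trigrams a (random, untrained) 4-dimensional vector.
rng = random.Random(0)
dim = 4
training_words = ["care", "careful", "flawless"]
ngram_vectors = {}
for w in training_words:
    for g in trigrams(w):
        if g not in ngram_vectors:
            ngram_vectors[g] = [rng.uniform(-1, 1) for _ in range(dim)]

def embed(word):
    # Average the vectors of the word's known trigrams; None if it shares none.
    known = [ngram_vectors[g] for g in trigrams(word) if g in ngram_vectors]
    if not known:
        return None
    return [sum(values) / len(known) for values in zip(*known)]

# "careless" was never seen, but it shares trigrams with the training words.
shared = [g for g in trigrams("careless") if g in ngram_vectors]
print(shared)             # ['<ca', 'car', 'are', 'les', 'ess', 'ss>']
print(embed("careless"))  # a 4-dimensional vector built from those trigrams
print(embed("xyz"))       # None: no trigram overlap with the training words
```

The unseen word inherits information from "care" and from the "-less" ending of "flawless", which is the mechanism behind the better handling of rare and morphologically related words.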


Cons:

- Increased model size due to n-gram information.

- Longer training times compared to Word2Vec.


### Applications

- **Text Classification and Categorisation**:

**Spam filtering, topic categorisation, and content tagging** across various domains.

- **Language Identification and Translation**

The subword-level embeddings in fastText empower it to work with languages even in cases where only fragments or limited text samples are available. This proves beneficial in **language identification tasks**, aiding multilingual applications and facilitating language-specific processing. Additionally, fastText’s embeddings have been utilised to enhance **machine translation systems, improving the accuracy and performance of translation models**.

- **Sentiment Analysis and Opinion Mining**

In sentiment analysis, fastText’s robustness in capturing subtle linguistic nuances allows for more accurate sentiment classification. Its ability to understand and represent words based on their subword units enables a more profound comprehension of sentiment-laden expressions, contributing to more nuanced **opinion mining in social media analysis, product reviews, and customer feedback**.

- **Entity Recognition and Tagging**

Entity recognition involves identifying and classifying entities within a text, such as names of persons, organisations, locations, and more. fastText’s subword embeddings contribute to better handling of unseen or rare entities, improving the accuracy of entity recognition systems. This capability finds applications in **information extraction, search engines, and content analysis**.

## Summary

**Choosing the Right Embedding Model**

- Word2Vec: Use when semantic relationships are crucial, and you have a large dataset.

- GloVe: Suitable for diverse datasets and when capturing global context is important.

- FastText: Opt for morphologically rich languages or when handling out-of-vocabulary words is vital.