# <center> NLP Assignment - 1 </center>

### Q1. Explain One Hot Encoding 

One-hot encoding is a technique used in machine learning and data processing to represent categorical variables as binary vectors. It is commonly employed when a model requires numerical input but the data is categorical. Each unique category or label is represented by a binary vector in which every element is zero except the one at the index assigned to that category, which is set to one. Stacking these vectors for a dataset produces a sparse binary matrix in which each column corresponds to a category and each row (one per sample) contains exactly one non-zero element.

Here's a simple example to illustrate the concept. Consider a categorical variable "Color" with three categories: Red, Green, and Blue. The one-hot encoding for these categories would be as follows:
- Red:   [1, 0, 0]
- Green: [0, 1, 0]
- Blue:  [0, 0, 1]
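
The mapping above can be reproduced with a few lines of Python. This is a minimal sketch, not a library API: the function name `one_hot_encode` and the choice to pass the category order explicitly are assumptions made for illustration.

```python
def one_hot_encode(values, categories):
    """Return one binary vector per value, with a single 1 at the category's index."""
    # Fix the position of each category; the order here is chosen by the caller.
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for a value outside `categories`
        encoded.append(row)
    return encoded


categories = ["Red", "Green", "Blue"]
encoded = one_hot_encode(["Red", "Green", "Blue", "Green"], categories)
for row in encoded:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

Each row contains exactly one 1, and the vector length equals the number of categories, which is why one-hot vectors become very long for large vocabularies.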

### Q2. Explain Bag of Words

The Bag of Words (BoW) model is a common and simple technique used in Natural Language Processing (NLP) for text representation. It represents a document (or a piece of text) as an unordered collection of words (a multiset), disregarding grammar and word order but keeping the frequency of each word. The term "bag of words" reflects this: the words are treated as if they were thrown into a bag, with no record of where they appeared.

Here's a step-by-step explanation of how the Bag of Words model works:

1. **Tokenization:**
   - The first step is to break down the text into individual words or tokens. This process is known as tokenization.

2. **Vocabulary Building:**
   - Create a vocabulary containing all unique words present in the entire set of documents. Each word in the vocabulary is assigned a unique index.

3. **Vectorization:**
   - Represent each document as a vector where each element corresponds to the count (or sometimes binary presence/absence) of a word in the vocabulary.

   - For example, if the vocabulary contains ["apple", "banana", "orange"], and you have a document like "I like apple and banana," the vector representation would be [1, 1, 0] because "apple" and "banana" each appear once, while "orange" does not appear. Words that are not in the vocabulary are ignored.

4. **Sparse Representation:**
   - Since most documents use only a small subset of the entire vocabulary, the resulting vectors are often sparse (contain mostly zeros).
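
The steps above can be sketched in plain Python. The helper names (`tokenize`, `build_vocabulary`, `vectorize`), the lowercasing, and the regular-expression tokenizer are assumptions chosen for illustration; real pipelines often use a library vectorizer instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Step 2: assign a unique index to every distinct word in the corpus."""
    words = sorted({token for doc in documents for token in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Step 3: count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


documents = ["I like apple and banana", "Banana banana orange"]
vocabulary = build_vocabulary(documents)
bow = [vectorize(doc, vocabulary) for doc in documents]

print(sorted(vocabulary, key=vocabulary.get))
# ['and', 'apple', 'banana', 'i', 'like', 'orange']
print(bow[0])  # [1, 1, 1, 1, 1, 0]
print(bow[1])  # [0, 0, 2, 0, 0, 1]
```

Here the vocabulary is built from the documents themselves, so it also contains "i", "like", and "and". With a realistic vocabulary of tens of thousands of words, almost every entry of each vector would be zero.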

The Bag of Words model has some limitations, such as:
- **Loss of Word Order:** Since the model ignores the order of words, it may not capture the semantics of a sentence well.
- **Ignores Context:** The model does not consider the context in which words appear.

### Q3. Explain bag of N-grams 

The Bag of N-grams model is an extension of the Bag of Words (BoW) model in Natural Language Processing (NLP). While the BoW model represents a document as an unordered collection of individual words, the Bag of N-grams model counts sequences of N contiguous words, known as "N-grams." BoW is the special case N = 1 (unigrams).

Here's an explanation of the Bag of N-grams model:

1. **Tokenization:**
   - Similar to the Bag of Words model, the first step is to tokenize the text into individual words or tokens.

2. **N-gram Generation:**
   - Instead of treating each word in isolation, the Bag of N-grams model considers sequences of N consecutive words (N-grams) in the text.
   - For example, if the original text is "I love natural language processing," and N is set to 2 (bigrams), the generated bigrams would be: ["I love", "love natural", "natural language", "language processing"].

3. **Vocabulary Building:**
   - Create a vocabulary containing all unique N-grams present in the entire set of documents. Each N-gram in the vocabulary is assigned a unique index.

4. **Vectorization:**
   - Represent each document as a vector where each element corresponds to the count (or binary presence/absence) of an N-gram in the vocabulary.
   - The vector captures not only individual words but also the relationships between adjacent words in sequences of length N.

   - For example, if the vocabulary contains ["I love", "love natural", "natural language", "language processing"], and you have a document like "I love natural language processing," the vector representation might be [1, 1, 1, 1].

5. **Sparse Representation:**
   - As in the Bag of Words model, the resulting vectors are often sparse since most documents use only a small subset of the entire N-gram vocabulary.
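
The steps above can be sketched as follows. The function names `ngrams` and `vectorize`, the whitespace tokenization, and the first-seen ordering of the vocabulary are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(tokens, vocabulary, n):
    """Count how often each vocabulary N-gram occurs in the token list."""
    counts = Counter(ngrams(tokens, n))
    return [counts[gram] for gram in vocabulary]


documents = [
    "I love natural language processing".split(),
    "I love language".split(),
]

# Build the bigram vocabulary in first-seen order.
vocabulary = []
for tokens in documents:
    for gram in ngrams(tokens, 2):
        if gram not in vocabulary:
            vocabulary.append(gram)

print(ngrams(documents[0], 2))
# ['I love', 'love natural', 'natural language', 'language processing']
print(vectorize(documents[0], vocabulary, 2))  # [1, 1, 1, 1, 0]
print(vectorize(documents[1], vocabulary, 2))  # [1, 0, 0, 0, 1]
```

Both documents share the bigram "I love", but only the second contains "love language", so the two vectors differ even though every word of the second document also appears in the first.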

The Bag of N-grams model addresses some limitations of the Bag of Words model by capturing local word order, so that phrases such as "not good" are kept together. However, it still loses sentence structure and context beyond N words, and the vocabulary grows quickly as N increases, which makes the vectors even sparser.


### Q4. Explain TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). TF-IDF is commonly used in information retrieval and text mining to represent the significance of terms in a document within a larger dataset.

Here's a breakdown of how TF-IDF is calculated:

1. **Term Frequency (TF):**
   - TF measures the frequency of a term (word) within a document.
   - It is calculated as the number of times a term appears in a document divided by the total number of terms in that document.
   - Mathematically, TF for a term \(t\) in a document \(d\) is given by:
     \[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]

2. **Inverse Document Frequency (IDF):**
   - IDF measures the importance of a term across a collection of documents (corpus).
   - It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term, with 1 added to the denominator as a smoothing term.
   - Mathematically, IDF for a term \(t\) in a corpus \(D\) is given by:
     \[ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in the corpus } D}{\text{Number of documents containing term } t + 1}\right) \]
   - The addition of 1 in the denominator avoids division by zero for terms that appear in no document of the corpus. A side effect is that a term appearing in every document receives a slightly negative IDF.

3. **TF-IDF Calculation:**
   - The TF-IDF score for a term \(t\) in a document \(d\) is the product of its TF and IDF:
     \[ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \]

   - The TF-IDF score is higher for terms that are frequent in a specific document but rare across the entire corpus. This helps in highlighting terms that are discriminative or characteristic of a particular document.

4. **Vectorization:**
   - Each document in the corpus is represented as a vector where each element corresponds to the TF-IDF score of a term in that document.

TF-IDF is widely used in various natural language processing tasks, such as information retrieval, text summarization, and document clustering, as it helps in capturing the importance of terms while considering their distribution across the entire dataset.
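
The formulas above translate directly into code. The sketch below follows the definitions given in this answer, including the `+ 1` in the IDF denominator; the function names and the toy corpus are assumptions for illustration, and library implementations may use a different smoothing variant.

```python
import math


def term_frequency(term, tokens):
    """TF: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """IDF with +1 in the denominator, as in the formula above."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(term, tokens, corpus):
    """TF-IDF: product of the two quantities."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a fish swam".split(),
]

doc = corpus[0]
for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, doc, corpus), 4))
```

In this corpus "the" occurs in three of the four documents, so its IDF is log(4 / 4) = 0 and its score is zero despite being the most frequent word in the first document. "mat" occurs in only one document and receives the highest score of the three terms.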

### Q5. What is OOV problem?

OOV stands for "Out of Vocabulary" or "Out of Vocabulary Items." The OOV problem arises in natural language processing and machine learning when a model encounters words or tokens that it has not seen during training. These unseen words are considered out of vocabulary because they are not part of the vocabulary or set of known tokens the model has learned from.

The OOV problem can occur for various reasons:

1. **New or Unseen Words:**
   - If the model is trained on a specific dataset, it may not be able to handle new or previously unseen words that appear in the test or real-world data. This is common in scenarios where the vocabulary of the training set is limited.

2. **Typos and Misspellings:**
   - Words with typos or misspellings may be treated as OOV if the model has not been explicitly trained to handle variations in spelling.

3. **Rare Words:**
   - If a word is infrequent or rare in the training data, the model may not have learned a robust representation for it, making it challenging to handle during testing.

4. **Domain Shift:**
   - If the distribution of words in the test data is significantly different from the distribution in the training data (domain shift), OOV occurrences may increase.

Addressing the OOV problem is crucial for building robust and generalizable natural language processing models. Some common strategies to mitigate the OOV problem include:

1. **Handling Unknown Words:**
   - Design models that can handle unknown words gracefully. For example, using character-level embeddings or subword embeddings can help capture information from parts of words, even if the entire word is unknown.

2. **Using Pre-trained Embeddings:**
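   - Utilize pre-trained word embeddings (e.g., Word2Vec, GloVe, or FastText) that have been trained on large, diverse datasets. Their large vocabularies reduce how often OOV words occur, although Word2Vec and GloVe still have no vector for a word outside that vocabulary. FastText can additionally build a vector for an unseen word from its character n-grams.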

3. **Dynamic Vocabulary Expansion:**
   - Dynamically update the vocabulary during training to include new words encountered in the data. This approach allows the model to adapt to new words as it encounters them.

4. **Handling Typos:**
   - Implement techniques for handling typos and misspellings, such as approximate string matching or incorporating spelling correction mechanisms.
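
Two of these ideas can be sketched briefly: falling back to a reserved unknown-word index, and breaking words into character n-grams so that an unseen or misspelled word still shares pieces with known words. The `<unk>` token name, the helper names, and the `<`/`>` word-boundary markers are assumptions made for this illustration.

```python
UNK = "<unk>"  # hypothetical placeholder token for out-of-vocabulary words


def build_vocabulary(training_tokens):
    """Assign an index to every training token, reserving index 0 for UNK."""
    vocabulary = {UNK: 0}
    for token in training_tokens:
        vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(tokens, vocabulary):
    """Map tokens to indices; unseen tokens fall back to the UNK index."""
    return [vocabulary.get(token, vocabulary[UNK]) for token in tokens]


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


vocabulary = build_vocabulary("the model reads the training text".split())
print(encode("the model reads tweets".split(), vocabulary))  # [1, 2, 3, 0]

# A misspelled word is OOV at the word level but overlaps at the subword level.
shared = char_ngrams("training") & char_ngrams("trainng")
print(sorted(shared))
```

The word "tweets" is mapped to index 0 because it never appeared in training. The misspelling "trainng" would also be OOV as a whole word, yet it shares several trigrams with "training", which is the property that subword-based models exploit.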

By addressing the OOV problem, models can better generalize to diverse and unseen data, making them more reliable in real-world applications.

### Q6. What are word embeddings?

Word embeddings are numerical representations of words in a continuous vector space. These representations capture semantic relationships and similarities between words, making them a fundamental concept in natural language processing (NLP) and machine learning. Unlike traditional approaches that represent words using discrete symbols or one-hot encoding, word embeddings aim to capture the meaning of words in a more continuous and distributed form.

Key characteristics of word embeddings include:

1. **Continuous Vector Space:**
   - Each word is represented as a vector of real numbers, and similar words are placed closer together in the vector space.

2. **Semantic Relationships:**
   - Word embeddings capture semantic relationships between words. Words with similar meanings or contexts tend to have similar vector representations.

3. **Contextual Information:**
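   - The meaning of a word is influenced by its context in sentences. Static embeddings such as Word2Vec and GloVe learn from context during training but assign each word a single vector, regardless of how it is used. Contextual models such as BERT go further and produce a different representation for the same word depending on the sentence it appears in.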

4. **Dimensionality:**
   - The dimensionality of word embeddings is a key hyperparameter. Each word is represented by a dense vector of fixed length, commonly between 50 and 300 dimensions for static embeddings. More dimensions can encode finer distinctions but require more data and memory.

5. **Pre-trained Embeddings:**
   - Pre-trained word embeddings are often used, where models are trained on large text corpora to learn general language patterns. These pre-trained embeddings can be transferred to downstream NLP tasks, providing a useful starting point for models.

Common methods for generating word embeddings include:

1. **Word2Vec:**
   - Introduced by Google, Word2Vec learns word embeddings by predicting a word's context or vice versa. It has two models: Skip-gram and Continuous Bag of Words (CBOW).

2. **GloVe (Global Vectors for Word Representation):**
   - GloVe is an unsupervised learning algorithm that learns word embeddings by factorizing the word co-occurrence matrix. It considers global word-word relationships in the entire corpus.

3. **FastText:**
   - Developed by Facebook, FastText extends Word2Vec by representing words as bags of character n-grams. This approach is especially useful for handling morphologically rich languages and capturing subword information.

4. **BERT (Bidirectional Encoder Representations from Transformers):**
   - BERT is a transformer-based model that generates contextual embeddings for words by considering both left and right context. At its release it achieved state-of-the-art results on a range of NLP benchmarks.

Word embeddings have significantly contributed to the success of NLP applications, including sentiment analysis, machine translation, document classification, and more. They enable algorithms to understand and process natural language by capturing the nuanced relationships between words based on their contextual usage.
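
The idea that similar words lie close together in the vector space is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are not taken from any trained model, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this example, not learned embeddings.
embeddings = {
    "king": [0.80, 0.65, 0.10],
    "queen": [0.75, 0.70, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

print(round(cosine_similarity(embeddings["king"], embeddings["queen"]), 3))
print(round(cosine_similarity(embeddings["king"], embeddings["apple"]), 3))
```

With these toy values, "king" and "queen" score close to 1, while "king" and "apple" score much lower. With one-hot vectors, by contrast, every pair of distinct words has a cosine similarity of exactly 0.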

### Q7. Explain Continuous bag of words (CBOW)

Continuous Bag of Words (CBOW) is a type of word embedding model used in natural language processing (NLP). It is designed to learn distributed representations of words based on their context in a given corpus. CBOW is part of the Word2Vec family of models, which were introduced by Tomas Mikolov and his colleagues at Google.

The primary goal of CBOW is to predict a target word from its surrounding context. Unlike traditional n-gram language models, which predict a word only from the words that precede it, CBOW uses the words on both sides of the target within a fixed window and ignores their order, which is why it is called a "bag" of words.

Here's how the CBOW model works:

1. **Context Window:**
   - Given a sentence or a sequence of words, a context window of a fixed size (e.g., 2 or 3 words on each side) is slid over the sequence. The target word is the word in the center of the window, and the remaining words in the window form its context.

2. **Word Representation:**
   - Each word in the context window is represented as a one-hot encoded vector, where only the position corresponding to the word is set to 1, and all other positions are 0.

3. **Embedding Layer:**
   - The one-hot encoded word vectors are then passed through an embedding layer, which converts them into continuous vector representations (word embeddings).

4. **Aggregation:**
   - The continuous vector representations of the context words are aggregated (e.g., averaged) to obtain a single context vector.

5. **Prediction:**
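   - The context vector is used to predict the target word, typically by scoring every vocabulary word and applying a softmax to obtain a probability distribution. The model is trained to assign high probability to the actual target word, i.e., to minimize the cross-entropy loss.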

The CBOW model is trained to maximize the conditional probability of the target word given its context. The objective function involves maximizing the average log probability of predicting the target word across all words in the training data.

One advantage of CBOW is its ability to generate embeddings efficiently, especially for frequent words. However, CBOW might not perform as well for infrequent words compared to other models like Skip-gram, which is another variant of Word2Vec.

In summary, Continuous Bag of Words (CBOW) is a neural network-based model that learns word embeddings by predicting a target word based on its context in a given window. It is part of the Word2Vec family and is used for capturing distributed representations of words in natural language.
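
The forward pass described above can be sketched in plain Python: build (context, target) pairs, average the context embeddings, score every vocabulary word, and apply a softmax. This is an untrained toy with random weights; the function names, the embedding size of 3, and the use of a full softmax are assumptions for illustration, and no weight update is shown.

```python
import math
import random


def cbow_pairs(tokens, window=2):
    """Pair each target word with the words up to `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, input_embeddings, output_embeddings):
    """Average the context embeddings, then score every vocabulary word."""
    dim = len(input_embeddings[0])
    # Looking up a row is equivalent to multiplying a one-hot vector
    # by the embedding matrix.
    context_vector = [
        sum(input_embeddings[c][d] for c in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    scores = [sum(w * h for w, h in zip(row, context_vector))
              for row in output_embeddings]
    return softmax(scores)


tokens = "the quick brown fox jumps".split()
vocab = {word: i for i, word in enumerate(tokens)}

rng = random.Random(0)
dim = 3
input_embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
output_embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

pairs = cbow_pairs(tokens, window=2)
context, target = pairs[2]
print(context, "->", target)  # ['the', 'quick', 'fox', 'jumps'] -> brown

probs = cbow_forward([vocab[w] for w in context], input_embeddings, output_embeddings)
loss = -math.log(probs[vocab[target]])  # cross-entropy for this one example
print(round(loss, 4))
```

Training would repeat this for every pair and adjust both embedding matrices to reduce the loss. After training, the rows of the input embedding matrix are typically used as the word vectors.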

### Q8. Explain Skip-gram

Skip-gram is another type of word embedding model belonging to the Word2Vec family, introduced by Tomas Mikolov and his colleagues at Google. Unlike Continuous Bag of Words (CBOW), which predicts a target word based on its context, Skip-gram takes a reverse approach. It aims to predict the context words given a target word. Skip-gram is designed to learn distributed representations of words that capture semantic relationships and similarities.

Here's how the Skip-gram model works:

1. **Target Word Selection:**
   - Given a sequence of words (e.g., a sentence), a target word is selected.

2. **Context Window:**
   - A context window of a fixed size is defined around the target word. The context words within this window become the positive examples for the model.

3. **Negative Sampling (Optional):**
   - In the Skip-gram model, negative sampling is often used to address computational efficiency. Negative sampling involves randomly selecting a few words from the vocabulary that do not appear in the context window and treating them as negative examples.

4. **Word Representation:**
   - The target word is represented as a one-hot encoded vector, where only the position corresponding to the target word is set to 1, and all other positions are 0.

5. **Embedding Layer:**
   - The one-hot encoded target word vector is passed through an embedding layer, which converts it into a continuous vector representation (word embedding).

6. **Prediction:**
   - The model is trained to predict the context words (and optionally, distinguish them from negative examples) based on the embedding of the target word.

The objective of the Skip-gram model is to maximize the conditional probability of the context words given the target word. The training process involves adjusting the parameters of the model to increase the likelihood of predicting the actual context words.

Skip-gram is known for capturing semantic relationships and analogies well, especially for rare words, because every (target, context) pair produces its own training update instead of being averaged with other context words. It is often preferred when embedding quality matters more than training speed, since it is slower to train than CBOW.

In summary, Skip-gram is a Word2Vec model that learns word embeddings by predicting context words based on a given target word. It has been widely used in natural language processing tasks and has demonstrated effectiveness in capturing semantic information in word representations.
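
The data preparation for Skip-gram can be sketched as follows: generate (target, context) pairs from a window, then draw negative examples for a given target. The function names are hypothetical, and the negatives are drawn uniformly here as a simplification; the original Word2Vec implementation samples them from a smoothed unigram frequency distribution.

```python
import random


def skipgram_pairs(tokens, window=2):
    """Emit one (target, context) pair per context word inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def sample_negatives(vocabulary, excluded, k, rng):
    """Pick up to k words that are neither the target nor in its context."""
    candidates = [word for word in vocabulary if word not in excluded]
    return rng.sample(candidates, min(k, len(candidates)))


print(skipgram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]

tokens = "the quick brown fox jumps over the lazy dog".split()
vocabulary = sorted(set(tokens))

target_index = 3  # "fox"
window = 2
context = tokens[target_index - window:target_index] + \
    tokens[target_index + 1:target_index + 1 + window]
excluded = set(context) | {tokens[target_index]}

rng = random.Random(0)
negatives = sample_negatives(vocabulary, excluded, k=2, rng=rng)
print(context)    # ['quick', 'brown', 'jumps', 'over']
print(negatives)  # two words drawn from: dog, lazy, the
```

During training, the model is pushed to score each (target, context) pair highly and each (target, negative) pair poorly, which avoids computing a softmax over the whole vocabulary.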

### Q9. Explain GloVe Embeddings

GloVe, which stands for "Global Vectors for Word Representation," is an unsupervised learning algorithm for generating word embeddings. Developed by Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford University, GloVe focuses on capturing global information about word co-occurrences in a corpus to learn meaningful and contextually rich word representations.

Here are the key principles behind GloVe embeddings:

1. **Word Co-occurrence Matrix:**
   - GloVe starts by constructing a word co-occurrence matrix \(X\), where \(X_{ij}\) represents the number of times word \(i\) co-occurs with word \(j\) in a given context window across the entire corpus.

2. **Objective Function:**
   - The core idea of GloVe is to learn word embeddings such that the dot product of a word vector and a context vector, plus bias terms, approximates the logarithm of the pair's co-occurrence count. The objective function minimizes the weighted squared difference between these two quantities.

   - The objective function is defined as:
     \[ J = \sum_{i, j=1}^{V} f(X_{ij}) \left(\mathbf{w}_i^\top \mathbf{v}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2 \]

   - Here, \(V\) is the vocabulary size, \(\mathbf{w}_i\) is the word vector for word \(i\), \(\mathbf{v}_j\) is the context vector for word \(j\), \(b_i\) and \(\tilde{b}_j\) are the corresponding word and context bias terms, and \(f(X_{ij})\) is a weighting function that caps the influence of very common word pairs and equals zero when \(X_{ij} = 0\), so pairs that never co-occur are skipped.

3. **Training:**
   - The model is trained by adjusting the word vectors and biases to minimize the objective function using gradient descent or other optimization algorithms.

4. **Resulting Embeddings:**
   - The learned embeddings capture semantic relationships between words based on their co-occurrence patterns. Similar words in terms of context tend to have similar embeddings, making the vectors useful for various natural language processing tasks.
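
The first two steps can be sketched in plain Python: count co-occurrences within a window, then evaluate the weighted squared error for a single word pair. The function names are hypothetical. The weighting function uses the form and default values (`x_max = 100`, `alpha = 0.75`) reported in the GloVe paper, and the counts here are unweighted by distance, which is a simplification.

```python
import math
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within the window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1.0
    return counts


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting term f: grows with the count, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def pair_loss(word_vector, context_vector, bias_i, bias_j, x_ij):
    """One term of the objective; only defined for pairs with x_ij > 0."""
    dot = sum(a * b for a, b in zip(word_vector, context_vector))
    return weight(x_ij) * (dot + bias_i + bias_j - math.log(x_ij)) ** 2


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[("the", "cat")], counts[("cat", "the")])  # 1.0 1.0

# A pair whose dot product plus biases already equals log(count) has zero loss.
print(pair_loss([1.0, 0.0], [0.0, 1.0], 0.0, math.log(4.0), 4.0))  # 0.0
print(round(weight(1.0), 4), weight(250.0))  # 0.0316 1.0
```

The full objective sums `pair_loss` over all pairs with a non-zero count, and training adjusts the vectors and biases to reduce that sum.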

GloVe embeddings have several advantages:

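- **Efficiency:** Once the co-occurrence matrix has been built, GloVe trains only on its non-zero entries rather than on every window in the corpus, which makes training fast. Building and storing the matrix itself can be memory-intensive for large vocabularies.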
  
- **Contextual Understanding:** The embeddings capture not only local context but also global context information, resulting in meaningful representations.

- **Generalization:** Pre-trained GloVe embeddings can be transferred to downstream tasks, providing a useful starting point for models with limited training data.

GloVe has become a popular choice for generating word embeddings, and pre-trained GloVe embeddings are widely used in various natural language processing applications such as sentiment analysis, named entity recognition, and machine translation.