# Assignment 1

#### 1. Explain One-Hot Encoding

One-Hot Encoding is a method for representing categorical variables as binary vectors. It is frequently used in data preprocessing for machine learning, particularly when categorical data cannot be used directly in numerical calculations.

In one-hot encoding, each distinct category of the variable is converted into a fixed-length binary vector in which exactly one element is "hot" (1) and all the others are "cold" (0). The length of the vector equals the number of distinct categories in the variable.

Here's an example to illustrate the process:

Let's say we have a categorical variable called "Fruit" with three categories: "Apple", "Orange", and "Banana". To apply One-Hot Encoding, we would create three binary vectors, one for each category:

- "Apple" -> [1, 0, 0]
- "Orange" -> [0, 1, 0]
- "Banana" -> [0, 0, 1]

In the binary vectors, the position of the "hot" (1) element corresponds to the category it represents. All other elements are set to "cold" (0).
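The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation; the helper name `one_hot_encode` and the choice to index categories in first-seen order are assumptions made for the example.

```python
def one_hot_encode(values):
    """Map each distinct category to a binary vector with a single 1."""
    # Assumption: categories are indexed in the order they first appear.
    categories = list(dict.fromkeys(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)  # all positions "cold"
        vector[index[value]] = 1        # one position "hot"
        encoded.append(vector)
    return categories, encoded


fruits = ["Apple", "Orange", "Banana", "Apple"]
categories, encoded = one_hot_encode(fruits)
for fruit, vector in zip(fruits, encoded):
    print(fruit, vector)
```

Each vector has length three because the variable has three distinct categories, and repeated values such as the second "Apple" reuse the same vector.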

One-Hot Encoding allows us to represent categorical variables in a format that can be used as input for machine learning algorithms. Each category becomes a separate feature with binary values, capturing the presence or absence of a specific category in the data. It helps in avoiding any ordinality assumptions between the categories and enables algorithms to treat each category equally.

#### 2. Explain Bag of Words

Natural language processing (NLP) practitioners frequently utilise the Bag of Words (BoW) technique to express textual data as numbers. It is a quick and efficient approach to turn text into a numerical representation that can be applied to a variety of machine learning applications, including sentiment analysis, document clustering, and text categorization.

The Bag of Words method treats a text as an unordered collection ("bag") of words: it ignores word order and records only how often each word occurs. A vocabulary of the unique words found across the whole corpus is built first, and each document is then represented as a numerical vector holding the frequency of each vocabulary word in that document.

Here's a step-by-step explanation of the Bag of Words process:

1. Corpus Construction: Gather a collection of documents that form the corpus. These documents can be sentences, paragraphs, or entire text documents.

2. Tokenization: Break down each document into individual words or tokens. This step involves removing punctuation, converting text to lowercase, and splitting the text into words based on whitespace or other delimiters.

3. Vocabulary Creation: Create a vocabulary or dictionary of unique words present in the corpus. Each unique word is assigned a unique index or identifier.

4. Document Representation: Represent each document in the corpus as a numerical vector. The vector has the same length as the vocabulary size, and each element represents the frequency or presence of a word in the document. Various methods can be used to assign values to the vector elements, such as term frequency (count of occurrences) or binary values (presence or absence of the word).

5. Vectorization: Combine the numerical vectors of all the documents to create a matrix representation of the entire corpus. Each row in the matrix corresponds to a document, and each column represents a word in the vocabulary.
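The five steps above can be sketched as follows. This is a simplified illustration, not a production tokenizer: the regular expression, the sorted vocabulary order, and the function names `tokenize`, `build_vocabulary`, and `bag_of_words` are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Assign each unique token an index (sorted order, for reproducibility)."""
    tokens = {token for doc in documents for token in tokenize(doc)}
    return {token: i for i, token in enumerate(sorted(tokens))}


def bag_of_words(document, vocabulary, binary=False):
    """Return a count vector (or a presence vector) over the vocabulary."""
    counts = Counter(tokenize(document))
    vector = [counts[token] for token in vocabulary]
    return [min(c, 1) for c in vector] if binary else vector


corpus = ["The cat sat on the mat.", "The dog sat."]
vocabulary = build_vocabulary(corpus)
# One row per document, one column per vocabulary word.
matrix = [bag_of_words(doc, vocabulary) for doc in corpus]

print(list(vocabulary))
for row in matrix:
    print(row)
```

Passing `binary=True` switches from term counts to presence/absence values, which corresponds to the two weighting options mentioned in step 4.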

The Bag of Words strategy has some drawbacks. It disregards word order, which results in the loss of crucial context. It also doesn't take into account the semantics of words or the connections between them. Bag of Words is still a popular and efficient method for many text-based machine learning problems in spite of these drawbacks.

#### 3. Explain Bag of N-Grams

Bag of N-Grams is an expansion of the Bag of Words (BoW) approach used in natural language processing (NLP) that takes into account n-grams in addition to single words. Bag of N-Grams measures the frequency of word sequences of length 'n', whereas BoW simply measures the frequency of single words.

Here is a detailed breakdown of the Bag of N-Grams procedure:

1. Corpus Construction: Gather a collection of documents that form the corpus, similar to the Bag of Words approach.

2. Tokenization: Break down each document into individual words or tokens, similar to the Bag of Words approach.

3. N-Gram Generation: Generate n-grams from the tokenized documents. An n-gram is a contiguous sequence of 'n' words. For example, with n=2, the 2-grams (bigrams) of the sentence "I love to code" are "I love", "love to", and "to code".

4. Vocabulary Creation: Create a vocabulary or dictionary of unique n-grams present in the corpus. Each unique n-gram is assigned a unique index or identifier.

5. Document Representation: Represent each document in the corpus as a numerical vector based on the frequency or presence of the n-grams. Similar to BoW, various methods can be used to assign values to the vector elements, such as the count of occurrences or binary values indicating the presence or absence of an n-gram.

6. Vectorization: Combine the numerical vectors of all the documents to create a matrix representation of the entire corpus, similar to the Bag of Words approach.
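As a rough sketch of the procedure above, the following code builds a bag of bigrams for a two-document corpus. Whitespace tokenization, lowercasing, and the helper names `ngrams` and `bag_of_ngrams` are simplifying assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(documents, n):
    """Build an n-gram vocabulary and a document-by-n-gram count matrix."""
    doc_ngrams = [ngrams(doc.lower().split(), n) for doc in documents]
    vocabulary = sorted({gram for grams in doc_ngrams for gram in grams})
    matrix = []
    for grams in doc_ngrams:
        counts = Counter(grams)
        matrix.append([counts[gram] for gram in vocabulary])
    return vocabulary, matrix


bigrams = ngrams("I love to code".split(), 2)
vocabulary, matrix = bag_of_ngrams(["I love to code", "I love to read"], n=2)

print(bigrams)
print(vocabulary)
print(matrix)
```

The two documents share the bigrams "i love" and "love to" and differ only in the last column pair, which is the local word-order information that single-word counts would not separate as clearly.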

The Bag of N-Grams method records the frequency of word sequences rather than only individual words (many implementations count every n-gram length from 1 up to n), so it captures some local word order and adds context. Larger n-grams, such as trigrams (n=3) or higher, can capture more intricate word patterns, at the cost of a larger and sparser vocabulary.

In a variety of NLP tasks, such as text classification, sentiment analysis, and information retrieval, the bag of N-grams can be utilised as a feature representation. By changing the value of "n," it is a flexible technique that may be modified to collect various amounts of word sequence information.

#### 4. Explain TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistic used in natural language processing (NLP) to measure how relevant a term (word) is to a document within a corpus. It highlights words that are more significant in a particular document than in the corpus as a whole.

Term frequency (TF) measures how often a term appears in a document. It is commonly computed as the ratio of the term's occurrences to the total number of terms in the document, and it identifies the words used most frequently in that document.

IDF (Inverse Document Frequency) measures the rarity or uniqueness of a term across the entire corpus. It calculates the logarithm of the ratio between the total number of documents in the corpus and the number of documents containing the term. IDF assigns higher weights to terms that appear in a smaller number of documents, considering them more informative or distinctive.

The TF-IDF score of a term in a document is obtained by multiplying its TF and IDF values. The higher the TF-IDF score of a term in a document, the more important or relevant that term is to the document.

Here's the calculation formula for TF-IDF:

TF-IDF(term, document) = TF(term, document) * IDF(term)
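A minimal sketch of this formula, using the TF and IDF definitions given above (occurrence ratio and the natural logarithm of the document ratio). The function names and the tiny pre-tokenized corpus are assumptions for illustration; libraries often use smoothed or otherwise adjusted variants, so their scores may differ from these.

```python
import math


def tf(term, tokens):
    """Occurrences of the term divided by the number of tokens in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus gets no weight
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
for term in ["the", "cat", "sat"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its score vanishes, while "sat" occurs in only one document and receives the highest score in the first document.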

The TF-IDF can be used to represent documents as numerical feature vectors, with each dimension denoting a distinct term and the value indicating the term's TF-IDF score. These feature vectors can be utilised in a variety of NLP tasks, including text mining, document classification, and information retrieval, to determine how similar two texts are to one another or to locate crucial terms within a document.

TF-IDF reduces the influence of common terms that appear in most documents, such as "the," "is," and "and," and emphasises uncommon or distinctive terms that are more informative about a given document.

#### 5. What is OOV problem?

The OOV (Out-of-Vocabulary) problem is the difficulty of handling words or tokens that appear at inference time but are not in a model's or system's vocabulary, typically because they did not occur in the training data. Because the model has no representation for an OOV term, it cannot process that term in the usual way.

The OOV issue can occur in a variety of natural language processing (NLP) activities, including text categorization, sentiment analysis, speech recognition, and machine translation. There are a number of causes for it, including:

1. Rare or infrequent words: If a word appears rarely in the training data or is entirely absent, the model may not have learned any meaningful representation for it.

2. Out-of-domain or domain-specific words: If the training data does not cover specific domains or topics, words from those domains may be treated as OOV when encountered during inference.

3. Misspellings or variations: Words that are misspelled, abbreviated, or written in a different form from what was seen during training can be considered OOV.

The OOV problem can impact the performance and accuracy of NLP models, as they may struggle to handle or make sense of unseen words. Handling the OOV problem typically involves implementing strategies such as:

1. Handling unknown words: Assigning a special token or placeholder for OOV words during inference and treating them as a distinct category.

2. Word normalization: Applying techniques like stemming, lemmatization, or handling case sensitivity to bring words into a more standardized form and reduce the chances of encountering OOV words.

3. Incorporating external resources: Utilizing pre-trained word embeddings or language models that have a larger vocabulary and coverage, which can provide representations for a broader range of words.

4. Data augmentation: Expanding the training data by generating or augmenting examples with variations, synonyms, or domain-specific terms to expose the model to a wider vocabulary.

5. Dynamic vocabulary expansion: Updating the vocabulary of the model dynamically as new words are encountered during inference, either by adding them to the existing vocabulary or by using open-vocabulary models that can handle OOV words more effectively.
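The first strategy in the list, a placeholder token for unknown words, can be sketched as below. The token string `<unk>`, the `min_count` threshold, and the function names are hypothetical choices for this example, not a fixed convention of any particular library.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def build_vocab(training_tokens, min_count=1):
    """Index every token seen at least min_count times; index 0 is reserved for UNK."""
    counts = Counter(training_tokens)
    vocab = {UNK: 0}
    for token, count in counts.items():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its index, falling back to the UNK index."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


vocab = build_vocab("the cat sat on the mat".split())
ids = encode("the dog sat".split(), vocab)
print(vocab)
print(ids)  # "dog" was never seen in training, so it maps to index 0
```

Raising `min_count` also sends rare training words to the placeholder, which gives the model examples of the unknown token during training.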

#### 6. What are word embeddings?

Word embeddings are distributed numerical representations of words that capture syntactic and semantic relationships between them. They are dense vectors in which words with related meanings or contextual usage lie close to each other in the vector space.

Word embeddings are usually learned from large volumes of text with models such as Word2Vec, GloVe, or FastText. These models infer the meaning of a word from the contexts in which it appears in the training data, so the resulting embeddings reflect similarities and relationships between word meanings.

The following are some benefits of using word embeddings for natural language processing (NLP) tasks:

1. Dimensionality reduction: Word embeddings typically have a lower dimensionality compared to one-hot encoded word representations, which reduces the computational complexity and memory requirements of NLP models.

2. Semantic relationships: Word embeddings capture semantic relationships between words. For example, words with similar meanings or that often appear in similar contexts have embeddings that are close together in the vector space.

3. Analogical reasoning: Word embeddings can exhibit interesting algebraic relationships. For example, adding and subtracting word vectors can recover analogies such as "king - man + woman ≈ queen," where the answer is the word whose vector lies nearest to the computed point.

4. Generalization: Because similar words receive similar vectors, a downstream model can generalize from words it has seen often to related words it has seen rarely. Words that were entirely absent from the embedding's training corpus have no vector in Word2Vec or GloVe; subword-based models such as FastText can compose a vector for them from character n-grams.
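The similarity and analogy properties above can be illustrated with cosine similarity. The vectors below are hand-made toy values, not learned embeddings; their three dimensions loosely stand for "royalty", "male", and "female" purely so the arithmetic is easy to follow, and the helper names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen by hand for illustration only (NOT trained embeddings).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, vectors[word]))


result = analogy("king", "man", "woman", embeddings)
print(result)
```

With real embeddings the same computation runs over the full vocabulary, and the analogy holds only approximately.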

Word embeddings are now a fundamental building block of many NLP applications, such as sentiment analysis, machine translation, text classification, and named entity recognition. By encoding contextual information and the semantic connections between words, they let models capture subtle differences in meaning and improve performance on NLP tasks.

#### 7. Explain Continuous bag of words (CBOW)

Continuous Bag of Words (CBOW) is a popular approach for training word embeddings in natural language processing (NLP). CBOW predicts a target word from its context, which consists of the words surrounding it within a fixed-size window.

The CBOW model is trained with a shallow neural network, as follows:

1. Data Preparation: CBOW requires a large corpus of text as training data. The corpus is split into sentences, and a sliding window of a fixed size is used to create training samples. Each training sample consists of the context words as input and the target word as the output.

2. Word Encoding: Before training, each word in the vocabulary is assigned a unique index. The words are often represented as one-hot encoded vectors or through other encoding schemes.

3. Architecture: CBOW uses a shallow neural network with a single hidden layer. The input layer takes the context words, each represented over the vocabulary (for example as a one-hot vector); the hidden layer has as many neurons as the embedding dimension; and the output layer has as many neurons as the vocabulary size.

4. Training: During training, the CBOW model learns to predict the target word given its context. The context words are fed into the input layer, and their embeddings (word vectors) are averaged to form the hidden-layer context representation. This representation is then passed through the output layer, which, typically via a softmax, outputs a probability for each word in the vocabulary.

5. Loss Calculation: The model's predictions are compared to the true target word using a loss function such as cross-entropy. The model's parameters (weights and biases) are adjusted through backpropagation and gradient descent to minimize the loss.

6. Word Embeddings: Once the CBOW model is trained, the weights between the input layer and the hidden layer serve as word embeddings, with one row per vocabulary word. These embeddings capture the distributional properties of words in the training corpus, representing their semantic and syntactic relationships.

CBOW is known for its training efficiency and for producing good embeddings for words that occur frequently in the training set. It is especially helpful when the context offers clear hints about the target word. It tends to represent uncommon words less well than Skip-gram, however.
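The data preparation and forward pass described in the steps above can be sketched in plain Python. This is an illustrative sketch, not the reference Word2Vec implementation: the window size, the embedding size `dim`, the random initialization, and the names `cbow_pairs`, `cbow_forward`, `W_in`, and `W_out` are assumptions, and no weight update is performed.

```python
import math
import random


def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) samples with a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def cbow_forward(context_ids, W_in, W_out):
    """Average the context embeddings, then apply a softmax over the vocabulary."""
    dim = len(W_in[0])
    hidden = [sum(W_in[c][d] for c in context_ids) / len(context_ids)
              for d in range(dim)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = "the quick brown fox jumps".split()
vocab = {word: i for i, word in enumerate(dict.fromkeys(tokens))}
pairs = cbow_pairs(tokens, window=1)

rng = random.Random(0)
dim = 3  # hypothetical embedding size
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

context, target = pairs[1]
probs = cbow_forward([vocab[w] for w in context], W_in, W_out)
loss = -math.log(probs[vocab[target]])  # cross-entropy for this one sample
print(context, "->", target)
```

Training would repeat this forward pass over all samples and use backpropagation to adjust `W_in` and `W_out`; after training, the rows of `W_in` are the word embeddings.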

#### 8. Explain SkipGram

Skip-gram is another popular approach for training word embeddings in natural language processing (NLP). In contrast to Continuous Bag of Words (CBOW), which predicts the target word given its context, Skip-gram predicts the context words given a target word.

The Skip-gram model operates as follows:

1. Data Preparation: Similar to CBOW, the training data for Skip-gram consists of a large corpus of text. The corpus is split into sentences, and a sliding window is used to create training samples. Each training sample consists of a target word and its context words within a fixed-size window.

2. Word Encoding: Each word in the vocabulary is assigned a unique index, and the words are often represented as one-hot encoded vectors or through other encoding schemes.

3. Architecture: Skip-gram uses the same kind of shallow neural network as CBOW, with the direction of prediction reversed. The input layer has as many neurons as the vocabulary size, the hidden layer has a much lower dimensionality (the embedding size), and the output layer again has as many neurons as the vocabulary size.

4. Training: During training, the Skip-gram model learns to predict the context words given a target word. The target word is fed into the input layer, and its embedding (word vector) is obtained from the hidden layer. This embedding is then passed through the output layer, which produces a probability distribution over the vocabulary. The model aims to maximize the probability of the true context words and minimize the probability of other words.

5. Loss Calculation: Skip-gram is typically trained with either a full softmax loss or negative sampling. The softmax loss is the cross-entropy between the predicted probabilities and the true context words, while negative sampling draws a few random "negative" words and adjusts the model's parameters to distinguish true context words from those randomly sampled words, which avoids computing a softmax over the whole vocabulary.

6. Word Embeddings: After training, the weights between the input layer and the hidden layer serve as word embeddings. These embeddings capture the semantic and syntactic relationships between words in the training corpus. The embeddings are dense vector representations that can be used for various NLP tasks, such as word similarity, document classification, and language generation.
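The sample generation and the negative-sampling objective described above can be sketched as follows. The function names, the window size, and the example vectors are assumptions for illustration; the loss shown is the standard negative-sampling form for a single (target, context) pair, without the sampling distribution or the parameter updates.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) samples with a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """-log s(t.c) - sum over negatives of log s(-t.n), where s is the sigmoid."""
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for negative in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, negative)))
    return loss


pairs = skipgram_pairs("I love to code".split(), window=1)
print(pairs)

# Hypothetical 2-d vectors: the loss is lower when the true context vector
# points the same way as the target and the negative points away from it.
good = negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = negative_sampling_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(good < bad)
```

Each target word yields one sample per context word, so Skip-gram produces more training samples from the same text than CBOW does.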

#### 9. Explain GloVe Embeddings

GloVe (Global Vectors for Word Representation) is a word embedding model that uses co-occurrence statistics from a large text corpus to capture semantic and syntactic relationships between words. It was developed by researchers at Stanford University.

GloVe embeddings operate as follows:

1. Co-occurrence Matrix: GloVe starts by constructing a co-occurrence matrix from the corpus of text. The matrix captures how often each word co-occurs with other words in a given window of context. The co-occurrence count represents the strength of the relationship between words.

2. Co-occurrence Probabilities: Normalizing the counts in each row gives the probability that a context word appears near a given word. GloVe's key observation is that ratios of these probabilities, rather than the raw counts alone, carry information about word meaning.

3. Word Embeddings: GloVe learns word embeddings whose dot products approximate the logarithm of the co-occurrence counts. It does this by minimizing a weighted least-squares loss that measures the difference between the dot product of a word vector and a context vector (plus bias terms) and the logarithm of their co-occurrence count. The embeddings are initialized randomly and updated iteratively during training.

4. Training: The word embeddings are adjusted iteratively with gradient-based optimization. A weighting function in the loss reduces the influence of rare co-occurrences and caps the influence of very frequent ones, so that neither dominates the learned representations.

5. Vector Space Representation: The resulting GloVe embeddings are dense vector representations where each dimension corresponds to a latent feature capturing word properties. Words with similar meanings or usage tend to have similar embeddings, allowing for semantic relationships to be captured in the vector space.
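The co-occurrence counting and the weighted least-squares objective can be sketched as below. This is a simplified illustration under stated assumptions: every co-occurrence inside the window counts as 1 (no distance weighting), the weighting function uses the commonly cited defaults `x_max=100` and `alpha=0.75`, parameters are stored in plain dictionaries, and no optimization step is shown.

```python
import math
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within the window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1.0
    return counts


def weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the weight of frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(counts, W, W_ctx, b, b_ctx):
    """Sum of weight(x) * (w_i . w_j + b_i + b_j - log x)^2 over observed pairs."""
    loss = 0.0
    for (wi, wj), x in counts.items():
        dot = sum(p * q for p, q in zip(W[wi], W_ctx[wj]))
        diff = dot + b[wi] + b_ctx[wj] - math.log(x)
        loss += weight(x) * diff ** 2
    return loss


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)

dim = 2  # hypothetical embedding size
words = sorted(set(tokens))
W = {w: [0.0] * dim for w in words}      # word vectors (zeros for the demo)
W_ctx = {w: [0.0] * dim for w in words}  # context vectors
b = {w: 0.0 for w in words}
b_ctx = {w: 0.0 for w in words}

loss = glove_loss(counts, W, W_ctx, b, b_ctx)
print(counts[("sat", "the")], loss)
```

Training would start from small random values instead of zeros and adjust the vectors and biases to reduce this loss; only pairs that actually co-occur contribute to the sum.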