### Q1. What are vanilla autoencoders?

Vanilla autoencoders, also known as basic or traditional autoencoders, are a type of neural network architecture used for unsupervised learning and dimensionality reduction tasks. The purpose of an autoencoder is to learn a compact representation of input data by encoding it into a lower-dimensional latent space and then reconstructing it back to its original form. Vanilla autoencoders consist of two main components: an encoder and a decoder.

1. **Encoder**:
   - The encoder takes the input data and maps it to a lower-dimensional latent space representation.
   - It consists of one or more layers of neural network units that gradually reduce the dimensionality of the input data.
   - The output of the encoder is the encoded representation (also known as the latent space representation) of the input data, which typically has a lower dimensionality than the input.

2. **Decoder**:
   - The decoder takes the encoded representation from the latent space and reconstructs the original input data.
   - It consists of one or more layers of neural network units that gradually increase the dimensionality of the encoded representation.
   - The output of the decoder is the reconstructed data, which ideally should be as close as possible to the original input data.

3. **Objective Function**:
   - The training objective of a vanilla autoencoder is to minimize the reconstruction error between the input data and the reconstructed data.
   - Commonly used loss functions for this purpose include the mean squared error (MSE) loss or binary cross-entropy loss, depending on the nature of the input data.

4. **Latent Space Representation**:
   - The dimensionality of the latent space representation is a crucial hyperparameter in autoencoder models.
   - By reducing the dimensionality of the input data, autoencoders aim to capture the most salient features or patterns in the data, facilitating tasks such as compression, denoising, and feature extraction.

5. **Applications**:
   - Vanilla autoencoders have various applications in unsupervised learning, dimensionality reduction, and data compression tasks.
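   - They can be used for feature learning, anomaly detection, and data denoising, among others. Image generation is possible only in a limited way, because the latent space of a vanilla autoencoder is not trained to be sampled from.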

Despite their simplicity, vanilla autoencoders are powerful tools for learning efficient representations of high-dimensional data and have served as the basis for more advanced autoencoder architectures, such as variational autoencoders (VAEs) and denoising autoencoders. However, vanilla autoencoders may suffer from overfitting, and their ability to capture complex data distributions is limited compared to more advanced models.
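The encoder, decoder, and reconstruction objective described above can be sketched with a tiny linear autoencoder. This is a minimal illustration rather than a reference implementation: it assumes NumPy is available, uses synthetic data, omits non-linear activations and biases, and the names `W_enc`, `W_dec`, and `reconstruct` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples in 8 dimensions that lie on a 2-D subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent_true @ mixing

input_dim, latent_dim = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(input_dim, latent_dim))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))  # decoder weights


def reconstruct(X, W_enc, W_dec):
    Z = X @ W_enc      # encoder: input -> lower-dimensional code
    X_hat = Z @ W_dec  # decoder: code -> reconstruction
    return Z, X_hat


def mse(X, X_hat):
    return float(np.mean((X - X_hat) ** 2))


learning_rate = 0.05
losses = []
for step in range(500):
    Z, X_hat = reconstruct(X, W_enc, W_dec)
    losses.append(mse(X, X_hat))
    # Gradient of the mean squared error with respect to the reconstruction.
    grad_out = 2.0 * (X_hat - X) / X.size
    grad_W_dec = Z.T @ grad_out
    grad_W_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_W_dec
    W_enc -= learning_rate * grad_W_enc

print(f"reconstruction MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The only training signal is the reconstruction error, so no labels are needed. In practice a framework with automatic differentiation would replace the hand-written gradients.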

### Q2. What are sparse autoencoders?

Sparse autoencoders are a variant of autoencoder neural networks that incorporate sparsity constraints during training. The objective of sparse autoencoders is to learn sparse representations of input data, where only a small fraction of the latent space neurons are activated for any given input. This sparsity encourages the model to learn more meaningful and informative features, leading to better generalization and improved performance on tasks such as denoising, feature learning, and dimensionality reduction.

Here are the key components and characteristics of sparse autoencoders:

1. **Sparsity Constraint**:
   - Sparse autoencoders add a sparsity penalty to the loss function. The penalty grows when hidden-unit activations are large or when a unit's average activation departs from a predefined sparsity target.
   - The sparsity constraint encourages the autoencoder to learn representations where only a subset of neurons are activated for any given input, resulting in a more compact and informative representation.

2. **Sparsity Regularization**:
   - Various techniques can be used to enforce sparsity in autoencoders, most commonly an \( L_1 \) penalty on the activations or a penalty term based on the Kullback-Leibler (KL) divergence between each unit's average activation and a target sparsity level.
   - \( L_1 \) regularization penalizes the absolute values of the activations, promoting sparsity by encouraging many neurons to have zero activations.
   - \( L_2 \) regularization penalizes the squared values of the activations. It shrinks activations toward zero but rarely makes them exactly zero, so on its own it produces small, distributed activations rather than truly sparse ones.
   - KL divergence-based regularization explicitly penalizes deviations of each unit's average activation from the target sparsity level, which is usually a small fixed hyperparameter (for example, 0.05).

3. **Training**:
   - Sparse autoencoders are trained using backpropagation and gradient descent optimization algorithms.
   - The sparsity constraint is incorporated into the loss function, and gradients are computed with respect to both the reconstruction error and the sparsity regularization term.
   - During training, the autoencoder learns to balance between minimizing the reconstruction error and achieving the desired level of sparsity in the latent space.

4. **Applications**:
   - Sparse autoencoders have been applied to various tasks in unsupervised learning, including dimensionality reduction, feature learning, and data compression.
   - They are particularly useful in scenarios where interpretability and generalization are important, such as anomaly detection, where sparse representations can highlight unusual patterns in the data.

Sparse autoencoders offer a balance between data representation and compression, allowing for more compact and meaningful representations of high-dimensional data. By promoting sparsity in the latent space, these models can capture salient features and underlying structures of the data, leading to improved performance on downstream tasks.
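The two most common sparsity penalties mentioned above can be written in a few lines. This is a sketch under stated assumptions: the KL version assumes hidden activations lie in (0, 1), for example after a sigmoid, and the function names, the `target` default, and the weight `beta` are illustrative choices, not values from the text.

```python
import math


def l1_penalty(activations):
    """Sum of absolute activations; it is zero only when every unit is inactive."""
    return sum(abs(a) for a in activations)


def kl_sparsity_penalty(mean_activations, target=0.05, eps=1e-8):
    """Sum over hidden units of KL(target || rho_hat).

    rho_hat is a unit's activation averaged over a batch, assumed to be in (0, 1).
    """
    total = 0.0
    for rho_hat in mean_activations:
        rho_hat = min(max(rho_hat, eps), 1.0 - eps)  # avoid log(0)
        total += target * math.log(target / rho_hat)
        total += (1.0 - target) * math.log((1.0 - target) / (1.0 - rho_hat))
    return total


def sparse_loss(reconstruction_error, mean_activations, beta=1.0):
    """Training objective: reconstruction error plus a weighted sparsity term."""
    return reconstruction_error + beta * kl_sparsity_penalty(mean_activations)


print(kl_sparsity_penalty([0.05, 0.05]))  # units already at the target
print(kl_sparsity_penalty([0.5, 0.5]))    # units far too active
```

The weight `beta` controls the trade-off between reconstruction quality and sparsity that the training step has to balance.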

### Q3. What are denoising autoencoders?

Denoising autoencoders are a type of autoencoder neural network that is specifically designed to learn robust representations of data by reconstructing clean inputs from noisy or corrupted versions. They are trained to remove noise or corruption from input data, leading to more robust and informative representations in the latent space.

Here are the key components and characteristics of denoising autoencoders:

1. **Corruption Process**:
   - Denoising autoencoders introduce noise or corruption to the input data during training. This corruption process can involve various types of noise, such as Gaussian noise, dropout, masking, or random occlusions.
   - The goal of the corruption process is to simulate real-world scenarios where input data may be corrupted or noisy, such as in sensor measurements, image acquisition, or natural language processing tasks.

2. **Reconstruction Objective**:
   - Denoising autoencoders are trained to reconstruct clean versions of the input data from the corrupted inputs.
   - The objective function of the denoising autoencoder penalizes the reconstruction error between the clean input data and the output of the autoencoder.
   - By learning to reconstruct clean data from noisy inputs, denoising autoencoders implicitly learn to capture the underlying structure and features of the data, making them more robust to noise and corruption.

3. **Training**:
   - During training, denoising autoencoders are presented with pairs of corrupted input data and their corresponding clean versions.
   - The autoencoder is trained to minimize the reconstruction error between the clean input data and the output of the autoencoder.
   - The corruption process encourages the autoencoder to learn meaningful representations that capture the underlying structure of the data while filtering out noise and artifacts.

4. **Robust Feature Learning**:
   - Denoising autoencoders learn to encode and decode clean data by filtering out noise and capturing salient features.
   - The learned representations in the latent space tend to be more robust and informative, as they are less affected by noise and corruption in the input data.
   - These robust representations can be useful for downstream tasks such as classification, clustering, or anomaly detection, where clean and informative features are essential.

5. **Applications**:
   - Denoising autoencoders have various applications in image denoising, signal processing, text denoising, and other tasks where noisy or corrupted data is common.
   - They can also be used for data augmentation, where noisy data is artificially generated to increase the diversity of training examples and improve the generalization of machine learning models.

Denoising autoencoders provide a powerful framework for learning robust representations of data in the presence of noise and corruption. By training on noisy inputs and reconstructing clean data, denoising autoencoders learn to filter out noise and capture the essential features of the underlying data distribution.
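The key detail is that the model sees the corrupted input while the loss is measured against the clean input. The sketch below shows two corruption processes and how a training pair is formed. The helper names are hypothetical, and `identity_model` is only a stand-in for a trained autoencoder.

```python
import random


def mask_corrupt(x, drop_prob, rng):
    """Masking noise: set each feature to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def gaussian_corrupt(x, sigma, rng):
    """Additive Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def identity_model(x):
    """Placeholder for an autoencoder; a real model would try to undo the noise."""
    return list(x)


rng = random.Random(0)
clean = [0.2, 0.9, 0.4, 0.7]
noisy = mask_corrupt(clean, drop_prob=0.5, rng=rng)

# The model receives the noisy input, but the target is the clean input.
loss = mse(identity_model(noisy), clean)
print(noisy, loss)
```

A fresh corruption is usually drawn each time an example is presented, so the model never sees exactly the same noisy input twice.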

### Q4. What are convolutional autoencoders?

Convolutional autoencoders are a type of autoencoder neural network that incorporates convolutional layers in both the encoder and decoder architectures. They are designed to learn hierarchical representations of data with spatial structures, such as images, by capturing local patterns and spatial dependencies in the input data.

Here are the key components and characteristics of convolutional autoencoders:

1. **Convolutional Encoder**:
   - The encoder of a convolutional autoencoder consists of one or more convolutional layers followed by pooling or striding operations.
   - Convolutional layers apply a set of learnable filters to the input data, capturing local patterns and features at different spatial scales.
   - Pooling or striding operations reduce the spatial dimensions of the feature maps, increasing the receptive field of the network and promoting translation invariance.

2. **Latent Space Representation**:
   - The output of the convolutional encoder is a set of feature maps with reduced spatial resolution that represent the input data in a compressed form. Some architectures flatten these maps into a single latent vector.
   - These feature maps capture hierarchical representations of the input data, with lower layers capturing local features and higher layers capturing more abstract and global features.

3. **Deconvolutional Decoder**:
   - The decoder of a convolutional autoencoder consists of one or more deconvolutional layers (more precisely called transposed convolution or up-convolution layers), or alternatively of upsampling operations followed by ordinary convolutions.
   - Deconvolutional layers apply learnable filters to the feature maps, gradually increasing the spatial dimensions of the data.
   - Upsampling operations increase the resolution of the feature maps, reconstructing the original input data from the compressed representation learned by the encoder.

4. **Objective Function**:
   - The training objective of a convolutional autoencoder is to minimize the reconstruction error between the input data and the reconstructed data.
   - Commonly used loss functions for this purpose include mean squared error (MSE) loss or binary cross-entropy loss, depending on the nature of the input data.

5. **Applications**:
   - Convolutional autoencoders are widely used for various tasks in computer vision, such as image denoising, image inpainting, image compression, and feature learning.
   - They can also be used for generative modeling tasks, such as image generation or style transfer, by training the decoder to generate realistic samples from the latent space representations.

Convolutional autoencoders leverage the hierarchical structure of convolutional neural networks to capture spatial dependencies and local patterns in the input data efficiently. By learning compact representations of high-dimensional data, convolutional autoencoders enable tasks such as denoising, compression, and generation of realistic images. They have become a fundamental building block in many computer vision applications and have led to significant advancements in the field of deep learning.
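To see how strided convolutions shrink the feature maps and transposed convolutions restore them, it helps to trace the spatial sizes. The sketch below uses the standard output-size formulas for a single spatial dimension with no dilation. The 28-pixel input and the kernel, stride, and padding values are illustrative assumptions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Encoder: two stride-2 convolutions halve the resolution twice.
size = 28
encoder_sizes = [size]
for _ in range(2):
    size = conv_out(size, kernel=3, stride=2, padding=1)
    encoder_sizes.append(size)

# Decoder: two stride-2 transposed convolutions mirror the encoder.
decoder_sizes = [size]
for _ in range(2):
    size = conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1)
    decoder_sizes.append(size)

print(encoder_sizes)  # [28, 14, 7]
print(decoder_sizes)  # [7, 14, 28]
```

The `output_padding` term resolves the ambiguity created by the floor division in the forward convolution, which is why it is needed here to get back to exactly 28.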

### Q5. What are stacked autoencoders?

Stacked autoencoders, also known as deep autoencoders or multi-layer autoencoders, are a type of autoencoder neural network architecture composed of multiple layers of encoder and decoder units stacked on top of each other. They are designed to learn hierarchical representations of input data by capturing increasingly abstract and higher-level features across successive layers.

Here are the key components and characteristics of stacked autoencoders:

1. **Layer-wise Training**:
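   - Stacked autoencoders were traditionally trained using a greedy layer-wise pre-training approach, where each layer of the autoencoder is trained independently as a shallow autoencoder. With modern initialization and optimizers, they are also often trained end to end.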
   - During the layer-wise pre-training phase, the input data is first fed through the first layer of the autoencoder, and the output of this layer is used as input to the next layer.
   - Each layer is trained to reconstruct the input data at its own level of abstraction, learning features that capture patterns and structures in the input data.

2. **Fine-tuning**:
   - After the layer-wise pre-training phase, the entire stacked autoencoder is fine-tuned using backpropagation and gradient descent optimization.
   - During fine-tuning, the weights of all layers in the stacked autoencoder are adjusted simultaneously to minimize the reconstruction error between the input data and the reconstructed data.

3. **Hierarchical Representation**:
   - Stacked autoencoders learn hierarchical representations of input data, with lower layers capturing local patterns and features and higher layers capturing more abstract and global features.
   - As information flows through the network, each layer learns to extract increasingly complex and higher-level representations of the input data.

4. **Non-linear Activation Functions**:
   - Stacked autoencoders typically use non-linear activation functions, such as the rectified linear unit (ReLU) or hyperbolic tangent (tanh), in the encoder and decoder layers.
   - These non-linearities enable the autoencoder to learn complex and non-linear mappings between the input and latent space representations.

5. **Regularization Techniques**:
   - Stacked autoencoders may incorporate regularization techniques, such as dropout or \( L_1 \) and \( L_2 \) regularization, to prevent overfitting and improve generalization.
   - These regularization techniques help the autoencoder learn more robust and generalizable representations of input data.

6. **Applications**:
   - Stacked autoencoders have various applications in unsupervised learning, dimensionality reduction, feature learning, and data denoising tasks.
   - They can also be used as building blocks in deep learning architectures for tasks such as image classification, object detection, and natural language processing.

Stacked autoencoders leverage the depth and hierarchical structure of deep neural networks to learn rich and hierarchical representations of input data. By capturing increasingly abstract and higher-level features across successive layers, stacked autoencoders enable the extraction of informative and discriminative features from high-dimensional data. They have become a powerful tool in the field of deep learning and have been successfully applied to a wide range of real-world problems.
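Greedy layer-wise pre-training can be sketched as a loop in which each shallow autoencoder is trained on the codes produced by the previous one. This is a minimal illustration assuming NumPy, random synthetic data, a tanh encoder with a linear decoder, and plain gradient descent. The function name `train_shallow_autoencoder` and the layer sizes are hypothetical, and the fine-tuning phase is omitted.

```python
import numpy as np


def train_shallow_autoencoder(X, hidden_dim, steps=300, lr=0.05, seed=0):
    """Train one tanh-encoder / linear-decoder autoencoder on X.

    Returns the encoder weights, decoder weights, and final reconstruction MSE.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, hidden_dim))
    W_dec = rng.normal(scale=0.1, size=(hidden_dim, n_features))
    for _ in range(steps):
        H = np.tanh(X @ W_enc)                 # hidden code
        X_hat = H @ W_dec                      # reconstruction of this layer's input
        grad_out = 2.0 * (X_hat - X) / X.size  # d(MSE)/d(X_hat)
        grad_W_dec = H.T @ grad_out
        grad_H = grad_out @ W_dec.T
        grad_W_enc = X.T @ (grad_H * (1.0 - H ** 2))  # tanh derivative
        W_dec -= lr * grad_W_dec
        W_enc -= lr * grad_W_enc
    loss = float(np.mean((np.tanh(X @ W_enc) @ W_dec - X) ** 2))
    return W_enc, W_dec, loss


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 12))

layer_sizes = [8, 4]  # hidden sizes of the stacked encoder
encoders = []
layer_input = X
for size in layer_sizes:
    W_enc, _, _ = train_shallow_autoencoder(layer_input, size)
    encoders.append(W_enc)
    # The codes of this layer become the training data for the next layer.
    layer_input = np.tanh(layer_input @ W_enc)

codes = layer_input
print(codes.shape)
```

After this loop, the pre-trained encoder weights would initialize a deep autoencoder that is then fine-tuned jointly on the original reconstruction objective.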

### Q6. Explain how to generate sentences using LSTM autoencoders

Generating sentences using LSTM autoencoders involves training a sequence-to-sequence model where the input and output sequences are both sentences. The autoencoder architecture consists of an LSTM-based encoder and decoder, which encode the input sentence into a fixed-size latent representation and then decode it back into the original sentence. Here's a step-by-step guide on how to generate sentences using LSTM autoencoders:

1. **Data Preparation**:
   - Prepare a dataset of sentences for training the autoencoder. Tokenize the sentences into words or subword units and create vocabulary mappings from words to numerical indices.
   - Pad or truncate the sentences to a fixed length to ensure uniform input sizes.

2. **Model Architecture**:
   - Design the LSTM autoencoder architecture with an encoder and a decoder.
   - The encoder consists of one or more LSTM layers followed by a dense layer to produce the latent representation.
   - The decoder mirrors the encoder's architecture but in reverse, with LSTM layers followed by a dense layer to reconstruct the input sequence.
   - Optionally, you can add attention mechanisms to the decoder to improve the model's ability to generate coherent and contextually relevant sentences.

3. **Training**:
   - Train the LSTM autoencoder on the dataset of sentences.
   - Use teacher forcing during training, where the decoder's input at each time step is the ground truth word from the target sentence rather than the model's own predictions.
   - Minimize the reconstruction loss between the input and output sentences using optimization algorithms like stochastic gradient descent (SGD), Adam, or RMSprop.

4. **Latent Space Exploration**:
   - Once the autoencoder is trained, you can explore the latent space by encoding input sentences into their corresponding latent representations.
   - Visualize the latent space using dimensionality reduction techniques like t-SNE or PCA to gain insights into the learned representations.

5. **Sentence Generation**:
   - To generate sentences, you can input a seed sentence into the encoder and decode its latent representation using the decoder.
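   - Alternatively, you can pick points in the latent space, for example by interpolating between the codes of two encoded sentences, and decode them into sentences using the decoder. A plain autoencoder does not define a distribution to sample from, so arbitrary latent points may decode poorly; variational autoencoders address this.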
   - Repeat the decoding process until the end-of-sequence token is generated or a maximum length is reached.

6. **Evaluation**:
   - Evaluate the generated sentences using qualitative and quantitative metrics such as BLEU score, perplexity, or human judgment.
   - Fine-tune the autoencoder architecture or adjust hyperparameters based on evaluation results to improve sentence generation quality.

7. **Repeat and Iterate**:
   - Iterate on the training process, experimenting with different architectures, hyperparameters, and training strategies to improve the LSTM autoencoder's performance on sentence generation tasks.

By following these steps, you can train an LSTM autoencoder to generate coherent and contextually relevant sentences, making it a valuable tool for tasks such as text generation, paraphrasing, and language modeling.
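The data preparation and teacher forcing steps are independent of the deep learning framework and can be sketched in plain Python. The special tokens, the whitespace tokenizer, and the helper names below are illustrative assumptions. A real pipeline would typically use a subword tokenizer and feed the resulting index arrays to an LSTM encoder and decoder.

```python
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab(sentences):
    """Map each word to an integer index, reserving slots for special tokens."""
    vocab = {PAD: 0, SOS: 1, EOS: 2, UNK: 3}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Convert a sentence to indices, truncate, append EOS, and pad to max_len."""
    ids = [vocab.get(w, vocab[UNK]) for w in sentence.lower().split()]
    ids = ids[: max_len - 1] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


def teacher_forcing_pair(target_ids, vocab):
    """Decoder input is the target shifted right by one, starting with SOS."""
    decoder_input = [vocab[SOS]] + target_ids[:-1]
    return decoder_input, target_ids


sentences = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab, max_len=5) for s in sentences]
decoder_inputs = [teacher_forcing_pair(ids, vocab)[0] for ids in encoded]

print(encoded)         # [[4, 5, 6, 2, 0], [4, 7, 6, 8, 2]]
print(decoder_inputs)  # [[1, 4, 5, 6, 2], [1, 4, 7, 6, 8]]
```

For an autoencoder the target sequence is the input sentence itself, so at step t the decoder is given the true word t-1 and asked to predict word t.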

### Q7. Explain Extractive summarization

Extractive summarization is a text summarization technique that involves selecting and extracting a subset of sentences or passages from a document to create a summary. Unlike abstractive summarization, which generates new sentences to capture the main ideas of the original text, extractive summarization relies on identifying and retaining existing sentences that are deemed most important or representative of the document's content.

Here's how extractive summarization works:

1. **Sentence Selection**:
   - The first step in extractive summarization is to analyze the document and identify sentences or passages that contain key information or important ideas.
   - Various methods can be used for sentence selection, including statistical methods, graph-based algorithms, machine learning models, or heuristics based on sentence features such as length, position, and relevance to the document's topic.

2. **Scoring Sentences**:
   - Once candidate sentences are identified, they are scored based on their relevance, importance, or informativeness.
   - Common features used for scoring sentences include term frequency-inverse document frequency (TF-IDF), sentence length, sentence position, presence of keywords or named entities, and semantic similarity to other sentences in the document.
   - Machine learning models such as support vector machines (SVM), decision trees, or neural networks can also be trained to assign scores to sentences based on labeled training data.

3. **Sentence Ranking**:
   - After scoring each sentence, they are ranked based on their scores in descending order.
   - The top-ranked sentences are selected to form the summary, with the number of sentences chosen typically constrained by a predefined summary length or word limit.

4. **Summary Generation**:
   - The selected sentences are concatenated to form the final summary, preserving the order in which they appeared in the original document.
   - Optionally, post-processing steps such as sentence compression or coherence optimization can be applied to improve the readability and flow of the summary.

5. **Evaluation**:
   - The quality of the extractive summary can be evaluated using various metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which measures the overlap between the summary and reference summaries in terms of n-gram recall.
   - Human evaluation can also be conducted to assess the informativeness, coherence, and overall quality of the generated summary.

Extractive summarization has the advantage of producing summaries that are directly extracted from the original document, ensuring that the summary remains faithful to the source text. However, extractive summarization may struggle with preserving coherence and readability, especially when combining sentences from different parts of the document. Despite its limitations, extractive summarization remains a widely used approach for generating concise and informative summaries of text documents, particularly for tasks such as news summarization, document summarization, and text summarization in information retrieval systems.
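The score, rank, and select pipeline can be sketched with a simple word-frequency scorer. This is a deliberately minimal illustration: it uses raw term frequency instead of TF-IDF, a regex-based sentence splitter, and no stop-word filtering, all of which are simplifying assumptions. The function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def summarize(text, num_sentences=2):
    sentences = split_sentences(text)
    word_freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average document frequency of the words in the sentence.
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original document order
    return [sentences[i] for i in chosen]


text = ("Autoencoders compress data. The weather was nice. "
        "Autoencoders reconstruct data from compressed codes.")
summary = summarize(text, num_sentences=2)
print(summary)
```

Here the off-topic sentence about the weather scores lowest because its words appear only once in the document, so it is the one left out.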

### Q8. Explain Abstractive summarization

Abstractive summarization is a text summarization technique that involves generating a concise summary of a document by paraphrasing and synthesizing its content in a new way. Unlike extractive summarization, which selects and extracts sentences directly from the original document, abstractive summarization aims to understand the document's meaning and generate human-like summaries that may contain new sentences not present in the source text.

Here's how abstractive summarization works:

1. **Understanding the Document**:
   - The first step in abstractive summarization is to comprehend the content of the document to be summarized.
   - Natural language processing techniques, such as named entity recognition, part-of-speech tagging, syntactic parsing, and semantic analysis, are applied to extract key information, identify important concepts, and understand the relationships between them.

2. **Generating Summary**:
   - Once the document is understood, the abstractive summarization model generates a summary by composing new sentences that capture the main ideas and key information in the document.
   - The model may use a variety of techniques, including neural networks, recurrent neural networks (RNNs), long short-term memory (LSTM) networks, transformer-based architectures, or sequence-to-sequence models, to generate text based on the learned representation of the document.

3. **Paraphrasing and Synthesizing**:
   - During the summary generation process, the abstractive summarization model paraphrases and synthesizes the content of the document, rephrasing sentences, combining information from multiple sources, and expressing ideas in a concise and coherent manner.
   - The model may also incorporate external knowledge sources, such as databases, ontologies, or pre-trained language models, to enhance the quality and relevance of the generated summary.

4. **Evaluation**:
   - The quality of the abstractive summary can be evaluated using various metrics, such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BLEU (Bilingual Evaluation Understudy), or METEOR (Metric for Evaluation of Translation with Explicit Ordering).
   - Human evaluation can also be conducted to assess the informativeness, coherence, and overall quality of the generated summary.

Abstractive summarization offers several advantages over extractive summarization, including the ability to generate more concise and coherent summaries that better capture the essence of the original document. However, abstractive summarization also presents challenges, such as generating grammatically correct and contextually relevant sentences, avoiding factual inaccuracies or distortions, and ensuring that the generated summary reflects the author's intended meaning. Despite these challenges, abstractive summarization techniques continue to advance, driven by innovations in natural language processing and machine learning, and are increasingly being used in applications such as news summarization, document summarization, and conversational agents.
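Since ROUGE is the usual automatic metric for both extractive and abstractive summaries, a small sketch of ROUGE-N recall may help. It counts how many of the reference's n-grams also appear in the candidate, with clipped counts. Whitespace tokenization and the function names are simplifying assumptions; published ROUGE scores typically come from a standard toolkit with stemming and other options.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that are matched by the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
rouge_1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 unigrams match
rouge_2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 bigrams match
print(round(rouge_1, 3), round(rouge_2, 3))
```

Because it measures surface overlap, ROUGE can under-reward an abstractive summary that paraphrases the reference well, which is one reason human evaluation remains important.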

### Q9. Explain Beam search

Beam search is a heuristic search algorithm commonly used in sequence generation tasks, such as machine translation, text generation, and speech recognition. It is used to efficiently generate the most likely sequence of words or tokens given a probabilistic model, such as a language model or a sequence-to-sequence model. Beam search explores the search space by considering multiple hypotheses in parallel and selecting the most promising candidates based on their likelihood scores.

Here's how beam search works:

1. **Initialization**:
   - Beam search starts with an initial hypothesis, typically the start token or an empty sequence, and initializes a set of partial sequences, often referred to as beams, with a fixed beam width or beam size.
   - Each beam represents a potential sequence of words or tokens and is associated with a score indicating its likelihood according to the probabilistic model.

2. **Expanding Beams**:
   - At each time step, the algorithm expands each beam by considering all possible next tokens and generating new partial sequences.
   - For each beam, the algorithm calculates the likelihood scores of all possible next tokens using the probabilistic model and adds each token's log-probability to the beam's cumulative score.
   - The candidates from all beams are pooled, and the k candidates with the highest cumulative scores are kept, where k is the beam width.
   - The selected candidates become the new set of partial sequences for the next time step.

3. **Pruning**:
   - To keep the search space manageable, beam search may employ pruning strategies to discard low-scoring beams and focus on the most promising candidates.
   - One common pruning strategy is to keep only the top-k beams with the highest scores at each time step and discard the rest.
   - This helps prevent the search space from growing exponentially and improves computational efficiency.

4. **Termination**:
   - Beam search continues expanding beams until a termination condition is met, such as reaching a maximum sequence length or generating an end token.
   - Alternatively, beam search may continue until a fixed number of time steps or until all beams have reached an end token.

5. **Selection of the Best Sequence**:
   - Once beam search has completed, the algorithm selects the final sequence from the set of completed beams, typically choosing the sequence with the highest overall likelihood score.
   - If multiple sequences are tied for the highest score, beam search may select the first sequence encountered or use additional criteria for selection, such as diversity or fluency.

Beam search is a widely used algorithm for sequence generation tasks due to its simplicity, efficiency, and effectiveness in finding high-quality sequences. However, it may suffer from issues such as repetition, suboptimal solutions, and lack of diversity, especially when the beam width is small or when the search space is highly multimodal. Various extensions and improvements to beam search, such as length normalization, diverse beam search, and stochastic beam search, have been proposed to address these challenges and improve its performance in practice.
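The procedure can be sketched in a few lines of Python. The model here is a hypothetical lookup table (`toy_model`) that returns next-token probabilities for a prefix; it is constructed so that the greedy choice at the first step leads to a worse overall sequence. Scores are cumulative log-probabilities, and no length normalization is applied.

```python
import math


def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Return the (tokens, log_prob) hypothesis with the highest score."""
    beams = [([], 0.0)]  # each beam: (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Pool candidates from all beams and keep the best beam_width of them.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical next-token distribution, keyed by the prefix so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


greedy_tokens, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_model, beam_width=2, max_len=5)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # starts with 'a', p = 0.3
print(beam_tokens, round(math.exp(beam_score), 2))      # ['b', 'z', '<eos>'], p = 0.36
```

With a beam width of 1 the search is greedy and commits to "a"; with a width of 2 it keeps "b" alive long enough to discover the higher-probability continuation.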

### Q10. Explain Length normalization

Length normalization is a technique commonly used in sequence generation tasks, such as machine translation and text generation, to mitigate the bias towards shorter sequences when using probabilistic models. In these tasks, the probability of a sequence is the product of its per-token probabilities, each of which is at most 1, so every additional token can only lower the total and longer sequences tend to score lower than shorter ones. Length normalization adjusts the scores of sequences based on their length to account for this bias and ensure fair comparison between sequences of different lengths.

Here's how length normalization works:

1. **Probability Distribution**:
   - In sequence generation tasks, a probabilistic model (e.g., a language model or a sequence-to-sequence model) assigns probabilities to sequences of words or tokens.
   - The probability distribution over the sequence space is typically conditioned on the input data and modeled using techniques such as maximum likelihood estimation or neural networks.

2. **Normalization Factor**:
   - Length normalization introduces a normalization factor that adjusts the score of each sequence based on its length.
   - The normalization factor is usually a function of the sequence length \( T \), such as \( T \) itself or \( T^{\alpha} \) with \( 0 \le \alpha \le 1 \), where \( \alpha \) controls the strength of the normalization.
   - Longer sequences receive larger normalization factors, while shorter sequences receive smaller ones.

3. **Normalized Score**:
   - The normalized score of a sequence is obtained by dividing its log-probability by the normalization factor, which for \( \alpha = 1 \) gives the average log-probability per token.
   - Because log-probabilities are negative, dividing by a larger factor moves the score of a longer sequence closer to zero. This offsets the penalty that longer sequences accumulate simply by having more tokens, making scores more comparable across different sequence lengths.

4. **Length Penalty**:
   - Length normalization can also be interpreted as applying a length-dependent penalty or reward to sequences.
   - Without normalization, every additional token lowers a sequence's score, so shorter sequences are implicitly rewarded; normalization removes most of that advantage.
   - This encourages the generation of sequences that are not only fluent and coherent but also of appropriate length.

5. **Sequence Selection**:
   - When selecting the final sequence from a set of candidate sequences generated by a sequence generation algorithm (e.g., beam search), length normalization can be used to bias the selection towards sequences of appropriate length.
   - Sequences with higher normalized probabilities are preferred over sequences with lower normalized probabilities, ensuring that the selected sequence is not overly biased towards shorter or longer sequences.

Length normalization helps mitigate the bias towards shorter sequences in sequence generation tasks and encourages the generation of sequences that are both fluent and of appropriate length. It is a simple yet effective technique that improves the performance of probabilistic models in tasks such as machine translation, text summarization, and dialogue generation.
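A small numeric example makes the effect concrete. The sketch below scores a sequence by its log-probability divided by its length raised to a power `alpha`, which is one common form of length normalization. The per-token probabilities and the function names are made up for illustration.

```python
import math


def sequence_log_prob(token_probs):
    """Unnormalized score: sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs, alpha=1.0):
    """Log-probability divided by length**alpha (alpha=0 disables normalization)."""
    return sequence_log_prob(token_probs) / (len(token_probs) ** alpha)


short_seq = [0.5, 0.5]           # 2 tokens, each with probability 0.5
long_seq = [0.6, 0.6, 0.6, 0.6]  # 4 tokens, each with probability 0.6

raw_short = sequence_log_prob(short_seq)        # log(0.25)
raw_long = sequence_log_prob(long_seq)          # log(0.1296)
norm_short = length_normalized_score(short_seq)
norm_long = length_normalized_score(long_seq)

print(raw_short > raw_long)    # True: the raw score prefers the shorter sequence
print(norm_long > norm_short)  # True: per-token, the longer sequence is better
```

The longer sequence is more confident at every step, yet loses on raw probability purely because it has more factors below 1; dividing by length reverses the ranking.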

### Q11. Explain Coverage normalization

Coverage normalization (more commonly described as a coverage mechanism or coverage penalty) is a technique used in sequence-to-sequence models, particularly in tasks like neural machine translation, where the model generates output sequences one token at a time. The purpose of coverage normalization is to ensure that the model attends to all parts of the input sequence when generating the output sequence. This helps prevent the model from repeatedly attending to the same parts of the input, which can lead to incomplete translations or repetitions in the generated output.

Here's how coverage normalization works:

1. **Coverage Vector**:
   - During the generation of each token in the output sequence, the model maintains a coverage vector, which keeps track of how much attention has been placed on each token in the input sequence up to that point.
   - Initially, the coverage vector is initialized to zeros.

2. **Attention Mechanism**:
   - The attention mechanism in the sequence-to-sequence model computes attention scores for each token in the input sequence based on its relevance to the current step of generating the output sequence.
   - These attention scores are used to compute a context vector, which summarizes the relevant parts of the input sequence to help generate the next token in the output sequence.

3. **Coverage Vector Update**:
   - After the attention weights (the normalized attention scores) for the current decoding step are computed, the coverage vector is updated to reflect the attention given to each input token.
   - Specifically, each entry of the coverage vector is incremented by the attention weight of the corresponding token, so the vector accumulates the attention each token has received across all decoding steps so far.

4. **Coverage Penalty**:
   - During the computation of the attention scores, the coverage vector is used to penalize attention to tokens that have already received significant attention in previous decoding steps.
   - This penalty discourages the model from repeatedly attending to the same parts of the input sequence and encourages it to distribute attention more evenly across the input.

5. **Normalized Attention Scores**:
   - The attention scores are adjusted using the coverage vector and then normalized into a distribution over input tokens before being used to compute the context vector.
   - As a result, tokens that have received little attention in previous decoding steps tend to receive higher attention weights, while tokens that have already been heavily attended to tend to receive lower ones.

6. **Sequence Generation**:
   - The normalized attention scores are used to compute the context vector, which is then combined with the decoder's hidden state to generate the next token in the output sequence.
   - This process is repeated until an end-of-sequence token is generated or a maximum sequence length is reached.
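
The steps above can be sketched with plain Python lists. This is a simplified illustration under stated assumptions, not the formulation of any particular paper or library: the penalty is applied by subtracting `penalty_weight * coverage` from the raw scores before the softmax, and the per-step coverage loss is taken as the sum of `min(attention, coverage)` over input positions, which is one commonly used form. The function names and `penalty_weight` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_coverage(raw_scores, coverage, penalty_weight=1.0):
    """One decoding step of attention with a coverage vector.

    Returns (attention_weights, updated_coverage, coverage_loss).
    The subtractive penalty is an assumption made for this sketch.
    """
    # Step 4: penalize positions that already received attention.
    penalized = [s - penalty_weight * c for s, c in zip(raw_scores, coverage)]
    # Step 5: normalize the adjusted scores into attention weights.
    weights = softmax(penalized)
    # Overlap between current attention and past coverage (zero on step one).
    coverage_loss = sum(min(a, c) for a, c in zip(weights, coverage))
    # Step 3: accumulate attention into the coverage vector.
    new_coverage = [c + a for c, a in zip(coverage, weights)]
    return weights, new_coverage, coverage_loss


# Hypothetical raw scores for a 3-token input, reused on every step.
raw_scores = [2.0, 1.0, 0.5]
coverage = [0.0, 0.0, 0.0]  # Step 1: coverage starts at zero.
for step in range(3):
    weights, coverage, loss = attend_with_coverage(raw_scores, coverage)
    print(step, [round(w, 3) for w in weights], round(loss, 3))
```

Because the raw scores are held fixed here, the only thing changing between steps is the coverage vector, so the printed weights show attention shifting away from the first token as its coverage grows.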

Incorporating coverage normalization into the attention mechanism encourages a sequence-to-sequence model to attend to all parts of the input sequence while generating the output. This reduces repeated and dropped content and so improves the quality of the generated sequences, particularly in tasks like neural machine translation, where it is important to capture all the relevant information from the input sequence.

### Q12. Explain ROUGE metric evaluation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics commonly used for evaluating the quality of automatic summaries or machine-generated texts by comparing them to reference (human-created) summaries or texts. ROUGE measures the overlap between the generated summary and the reference summaries in terms of n-gram recall, capturing how well the generated summary covers the important content present in the reference summaries.

There are several variants of the ROUGE metric, including ROUGE-N, ROUGE-L, and ROUGE-W. Here's an overview of each variant:

1. **ROUGE-N (N-gram overlap)**:
   - ROUGE-N measures the overlap between n-grams (contiguous sequences of n tokens) in the generated summary and the reference summaries.
   - It calculates precision, recall, and F1-score for a chosen value of n (most commonly n = 1 and n = 2, reported as ROUGE-1 and ROUGE-2).
   - Precision measures the proportion of n-grams in the generated summary that are also present in the reference summaries.
   - Recall measures the proportion of n-grams in the reference summaries that are also present in the generated summary.
   - F1-score is the harmonic mean of precision and recall, providing a single value to assess the overall performance.

2. **ROUGE-L (Longest Common Subsequence)**:
   - ROUGE-L is based on the length of the longest common subsequence (LCS) between the generated summary and the reference summaries.
   - A common subsequence is a sequence of tokens that appears in the same relative order, though not necessarily contiguously, in both the generated summary and the reference summary.
   - ROUGE-L calculates precision, recall, and F1-score based on the length of the longest common subsequence.

3. **ROUGE-W (Weighted LCS)**:
   - ROUGE-W is an extension of ROUGE-L that gives more weight to consecutive matches within the common subsequence.
   - It penalizes gaps between matched tokens and rewards longer contiguous runs, so two summaries with the same LCS length can receive different scores.
   - ROUGE-W calculates precision, recall, and F1-score based on the weighted length of the longest common subsequence.
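
As an illustration of the LCS-based variant, here is a minimal ROUGE-L sketch for a single candidate and a single reference. It assumes lowercased whitespace tokenization and a balanced F1-score, both simplifications chosen for the example; the function names are hypothetical, and established ROUGE packages handle stemming, multiple references, and other details differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) based on LCS length."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# LCS is "the cat on the mat" (5 tokens) out of 6 tokens on each side.
print(rouge_l("the cat sat on the mat", "the cat is on the mat"))
```

Note that the matched tokens do not need to be adjacent: "sat" and "is" are skipped, yet the remaining five tokens still count because they appear in the same order in both texts.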

To compute ROUGE scores, the generated summary and reference summaries are tokenized into units (words, n-grams, or characters), and then the overlap between these units is computed. The precision, recall, and F1-score are then calculated based on the number of overlapping units.
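
The tokenize, count, and score pipeline described above can be sketched for ROUGE-N as follows. This is a simplified illustration assuming one reference, lowercased whitespace tokenization, and clipped n-gram counts (an n-gram is credited at most as many times as it occurs in the other text); the function names are hypothetical and not taken from any ROUGE package.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(candidate, reference, n=1))  # 5 of 6 unigrams overlap
print(rouge_n(candidate, reference, n=2))  # 3 of 5 bigrams overlap
```

Moving from unigrams to bigrams lowers the score for the same pair of sentences, because a single substituted word breaks the two bigrams that contain it.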

ROUGE metrics provide a quantitative measure of how well a generated summary captures the content of the reference summaries. They are widely used in natural language processing research and evaluation to compare the performance of different summarization models and to assess the quality of machine-generated summaries.