
# 1) What is BERT and how does it work?
**Ans:** **BERT (Bidirectional Encoder Representations from Transformers)** is a pre-trained deep learning model for natural language processing (NLP) tasks. Developed by Google, it represents each word using the words on both sides of it, which makes it bidirectional; many earlier language models conditioned only on the preceding words.

### How BERT Works:

1. **Transformer Architecture:**
   - BERT is based on the **Transformer** architecture, which uses attention mechanisms to process input data. The Transformer has no recurrence, so it can process all words in parallel rather than one at a time like older models; word order is supplied separately through positional embeddings added to the input.
   
2. **Bidirectional Context:**
   - Unlike traditional models like RNNs or LSTMs that process text in one direction (left-to-right or right-to-left), BERT looks at the full sentence at once, considering the context from both directions.
   
3. **Masked Language Model (MLM):**
   - During pre-training, BERT uses a technique called **masked language modeling**. It randomly selects about 15% of the input tokens, hides most of them behind a special `[MASK]` token, and tries to predict the originals from the surrounding context. This helps BERT learn the relationship between words and the context they appear in.

4. **Next Sentence Prediction (NSP):**
   - BERT is also trained on the task of **next sentence prediction**. It learns to predict whether a given sentence logically follows another, helping it understand sentence relationships.
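
The masking step of masked language modeling can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual preprocessing code: it assumes whitespace tokens instead of WordPiece, and the function name `mask_tokens` and the toy sentence are made up for the example. It follows the rule from the BERT paper in which a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using the 80/10/10 masking rule.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                     # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            labels.append(None)                      # not selected: ignored by the loss
            corrupted.append(token)
    return corrupted, labels

# Toy sentence (hypothetical example data)
tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

During pre-training, the loss is computed only at positions where `labels` is not `None`, so the model is never rewarded for simply copying visible tokens.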

# 2) What are the main advantages of using the attention mechanism in neural networks?
**Ans:** The attention mechanism in neural networks offers several significant advantages:

1. **Focus on Relevant Information**:  
   Attention allows the model to dynamically focus on the most important parts of the input data while ignoring irrelevant information, improving performance on tasks like machine translation and image captioning.

2. **Improved Long-Range Dependencies**:  
   Attention mechanisms help capture relationships between distant elements in the input sequence, overcoming limitations of models like vanilla RNNs or LSTMs in handling long-range dependencies.

3. **Parallelization**:  
   Unlike RNNs, attention mechanisms, especially in transformers, enable processing all input elements simultaneously, leading to faster training and inference.

4. **Interpretability**:  
   Attention weights show which parts of the input the model weighted most heavily for a given output, which can aid interpretability, although they are not a complete explanation of the model's decision.

5. **Versatility Across Modalities**:  
   Attention is applicable to various data types—text, images, audio, and even multimodal data—making it a versatile tool for diverse tasks.

6. **Enhanced Representations**:  
   By considering the global context, attention mechanisms create richer and more nuanced representations of the data, improving model accuracy.

7. **Scalability**:  
   With the advent of transformer models, attention mechanisms scale effectively with larger datasets and higher model complexity, as seen in models like GPT and BERT.

# 3) How does the self-attention mechanism differ from traditional attention mechanisms?
**Ans:** The **self-attention mechanism** differs from traditional attention mechanisms in several key aspects:

1. **Scope of Application**:
   - **Traditional Attention**:
     - Operates between two distinct sequences, such as the encoder and decoder in sequence-to-sequence models.
     - Aligns elements from the input sequence to the output sequence, facilitating tasks like machine translation.
   - **Self-Attention**:
     - Operates within a single sequence, allowing each element to attend to all other elements in the same sequence.
     - Captures intra-sequence dependencies, which is crucial for understanding context within the sequence.

2. **Functionality**:
   - **Traditional Attention**:
     - Enhances the model's focus on relevant parts of the input when generating each part of the output.
     - Improves performance in tasks where alignment between input and output sequences is essential.
   - **Self-Attention**:
     - Enables the model to weigh the importance of different elements within the same sequence.
     - Allows for the capture of long-range dependencies without relying on recurrent architectures.

3. **Architectural Integration**:
   - **Traditional Attention**:
     - Often used in conjunction with recurrent neural networks (RNNs) to improve their performance by focusing on relevant parts of the input sequence.
   - **Self-Attention**:
     - Forms the backbone of Transformer architectures, replacing the need for recurrence and enabling parallelization.
     - Leads to more efficient training and inference, especially for large-scale models.

4. **Computational Complexity**:
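   - **Traditional Attention**:
     - Usually attached to an RNN encoder-decoder, so the overall model still runs step by step and parallelizes poorly across time steps.
   - **Self-Attention**:
     - Computes all pairwise interactions in one matrix operation, which parallelizes well and shortens training time. The trade-off is that compute and memory grow quadratically with sequence length.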
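
The difference in scope can be made concrete with a small NumPy sketch. It is a simplification under stated assumptions: the hidden states are random toy values, there are no learned projection matrices, and both variants use dot-product scoring (early encoder-decoder attention often used an additive score instead). The helper name `attend` is made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Dot-product attention: each query row receives a weighted average of the value rows."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 source positions, hidden size 8 (toy values)
decoder_states = rng.normal(size=(4, 8))   # 4 target positions

# Traditional (encoder-decoder) attention: queries come from the decoder,
# keys and values come from the encoder.
cross_out, cross_w = attend(decoder_states, encoder_states, encoder_states)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attend(encoder_states, encoder_states, encoder_states)

print(cross_w.shape)  # (4, 6): one row per target position, one column per source position
print(self_w.shape)   # (6, 6): every position attends to every position of the same sequence
```

The only thing that changes between the two calls is where the queries come from, which is the distinction the comparison above describes.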

# 4) What is the role of the decoder in a Seq2Seq model?
**Ans:** In a Sequence-to-Sequence (Seq2Seq) model, the **decoder** plays a crucial role in generating the output sequence based on the encoded representation of the input. Its primary functions include:

1. **Generating the Output Sequence**:
   - The decoder takes the context vector (also known as the thought vector) produced by the encoder and generates the output sequence. It operates in an autoregressive manner, producing one element of the output sequence at a time.

2. **Utilizing Encoder's Context**:
   - The decoder uses the hidden representation from the encoder to generate the output sequence.

3. **Incorporating Attention Mechanism**:
   - In models enhanced with attention mechanisms, the decoder can focus dynamically on the most relevant parts of the input sequence during the generation process. This approach boosts accuracy and provides valuable insights into the model’s decision-making process.

# 5) What is the difference between GPT-2 and BERT models?
**Ans:** GPT-2 and BERT are both influential language models based on the Transformer architecture, but they differ in several key aspects:

**1. Model Architecture:**
- **GPT-2 (Generative Pre-trained Transformer 2):**
  - Employs a *decoder-only* Transformer architecture.
  - Designed primarily for text generation tasks.
- **BERT (Bidirectional Encoder Representations from Transformers):**
  - Utilizes an *encoder-only* Transformer architecture.
  - Aimed at understanding and processing language, excelling in tasks like text classification and question answering.

**2. Training Objectives:**
- **GPT-2:**
  - Trained using a *causal language modeling* objective, predicting the next word in a sequence based on preceding words.
  - Processes text in a unidirectional (left-to-right) manner.
- **BERT:**
  - Trained with a *masked language modeling* (MLM) objective, where random words in a sentence are masked, and the model learns to predict these masked words using the surrounding context.
  - Processes text bidirectionally, considering context from both left and right.

**3. Contextual Understanding:**
- **GPT-2:**
  - Focuses on generating coherent and contextually relevant text by leveraging preceding context.
- **BERT:**
  - Excels in understanding the meaning of words and sentences by analyzing the full context, both preceding and succeeding words.

**4. Applications:**
- **GPT-2:**
  - Well-suited for natural language generation tasks, such as text completion, summarization, and creative writing.
- **BERT:**
  - Ideal for natural language understanding tasks, including sentiment analysis, named entity recognition, and question answering.

**5. Training Data:**
- **GPT-2:**
  - Trained on WebText, roughly 40 GB of text from about 8 million web pages collected by following outbound links from Reddit posts with at least 3 karma.
- **BERT:**
  - Trained on a combination of English Wikipedia and BooksCorpus, focusing on long, well-formed text.
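
The unidirectional versus bidirectional distinction comes down to which positions each token is allowed to attend to. The NumPy sketch below illustrates this with attention masks; it is a toy under stated assumptions (all raw scores are zero so the effect of the mask is easy to read, and the helper name `attention_weights` is made up for the example), not code from either model.

```python
import numpy as np

def attention_weights(scores, mask):
    """Softmax over allowed positions only; masked positions receive zero weight."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # uniform toy scores

# BERT-style: every token may attend to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)
# GPT-2-style: token i may attend only to positions 0..i (no future tokens).
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

bert_weights = attention_weights(scores, bidirectional_mask)
gpt2_weights = attention_weights(scores, causal_mask)

print(bert_weights)   # every row is [0.25, 0.25, 0.25, 0.25]
print(gpt2_weights)   # row i spreads its weight evenly over positions 0..i
```

The lower-triangular mask is what makes GPT-2 usable for generation: the prediction at position i never depends on tokens that have not been produced yet.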

# 6) Why is the Transformer model considered more efficient than RNNs and LSTMs?
**Ans:** The Transformer model is considered more efficient than Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) due to several key factors:

**1. Parallel Processing Capability:**
- **Transformers:** Utilize a self-attention mechanism that allows for the simultaneous processing of all input elements. This parallelism leads to significantly faster training and inference times, especially with large datasets.
- **RNNs and LSTMs:** Process input sequences sequentially, where each step depends on the previous one. This inherent sequential nature limits opportunities for parallel computation, resulting in slower processing times.

**2. Effective Handling of Long-Range Dependencies:**
- **Transformers:** Employ self-attention mechanisms to capture relationships between distant elements in a sequence efficiently. Any two positions are connected in a single attention step, so the path between distant tokens does not grow with their distance.
- **RNNs and LSTMs:** While capable of managing long-term dependencies to some extent, they often encounter challenges such as the vanishing gradient problem, which can impede learning over long sequences.

**3. Scalability and Model Capacity:**
- **Transformers:** Their architecture supports scaling to larger models with increased parameters, enhancing their ability to learn complex patterns from extensive datasets.
- **RNNs and LSTMs:** Scaling these models can be more challenging due to training instabilities and the sequential nature of their computations.

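**4. Memory Trade-offs:**
- **Transformers:** Self-attention stores an attention matrix whose size grows quadratically with sequence length, so memory use climbs quickly on long sequences. In exchange, the whole sequence is processed in a few large matrix operations that use GPUs and TPUs efficiently.
- **RNNs and LSTMs:** Carry a fixed-size hidden state from step to step, so per-step memory is small, but everything about earlier tokens must be compressed into that state, and training still has to unroll the sequence one step at a time.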

These advantages make Transformers a preferred choice for many natural language processing tasks, leading to their widespread adoption in models like BERT and GPT.

# 7) How does the attention mechanism work in a Transformer model?
**Ans:** In Transformer models, the **attention mechanism** enables the model to weigh the significance of different words in an input sequence when processing each word. This mechanism is pivotal for capturing dependencies and relationships within the data, facilitating more effective learning and generation.

**How the Attention Mechanism Works:**

1. **Input Representation:**
   - Each word in the input sequence is transformed into three vectors:
     - **Query (Q):** Determines which words to focus on.
     - **Key (K):** Represents the words in the sequence.
     - **Value (V):** Contains the actual information of the words.

2. **Calculating Attention Scores:**
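   - The model computes a score for each pair of words by taking the dot product of one word's Query vector with the other word's Key vector, then divides the score by the square root of the key dimension (√d_k) to keep the values in a range where softmax behaves well.
   - These scores indicate the relevance of one word to another.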

3. **Applying Softmax:**
   - The scores are passed through a softmax function to convert them into probabilities, emphasizing the most relevant words while diminishing the less relevant ones.

4. **Weighted Sum of Values:**
   - Each Value vector is weighted by the corresponding softmax probability.
   - The model computes a weighted sum of these Value vectors, producing an output that reflects the importance of each word in the context of the input sequence.

5. **Multi-Head Attention:**
   - The Transformer employs multiple attention mechanisms, known as heads, to capture different aspects of relationships between words.
   - Each head processes the input independently, and their outputs are concatenated and linearly transformed to produce the final result.
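
The five steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production implementation: the weight matrices are random rather than learned, there is no batching or masking, and the function name `multi_head_self_attention` is made up for the example. The scores are divided by the square root of the per-head dimension, as in the standard scaled dot-product formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads             # assumes d_model is divisible by num_heads

    # Step 1: project the inputs into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Split the model dimension into heads: (num_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)

    # Step 2: scaled dot-product scores, one (seq_len, seq_len) matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # Step 3: softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)

    # Step 4: weighted sum of the value vectors.
    heads = weights @ v                       # (num_heads, seq_len, d_head)

    # Step 5: concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))       # toy token representations
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape)   # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

Each of the four heads produces its own 5×5 weight matrix, which is how different heads can attend to different relationships in the same sentence.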

# 8) What is the difference between an encoder and a decoder in a Seq2Seq model?
**Ans:** In a Sequence-to-Sequence (Seq2Seq) model, the **encoder** and **decoder** serve distinct yet complementary roles in transforming an input sequence into an output sequence.

**Encoder:**

- **Function:** Processes the input sequence to capture its essential information and represent it in a fixed-size context vector.

- **Operation:** Sequentially reads each element of the input, updating its internal state to encapsulate the sequence's context.

- **Output:** Generates a context vector that summarizes the input sequence's information.

**Decoder:**

- **Function:** Generates the output sequence by interpreting the context vector provided by the encoder.

- **Operation:** Produces the output sequence element by element, using the context vector and previously generated elements to inform each step.

- **Output:** Constructs the final output sequence, such as a translated sentence in another language.

# 9) What is the primary purpose of using the self-attention mechanism in transformers?
**Ans:** The primary purpose of the self-attention mechanism in Transformer models is to enable the model to evaluate and assign varying levels of importance to different words within a single input sequence. This capability allows the model to capture intricate relationships and dependencies between words, regardless of their positions in the sequence.

# 10) How does the GPT-2 model generate text?
**Ans:** GPT-2, or Generative Pre-trained Transformer 2, generates text by predicting the next word in a sequence based on the preceding context. Here's an overview of how this process works:

1. **Input Prompt:** The process begins with an initial text prompt provided by the user. This prompt serves as the starting context for the model.

2. **Tokenization:** The input text is divided into tokens, which are smaller units like words or subwords (GPT-2 uses byte-level byte pair encoding). Each token is then mapped to an integer ID that the model can process.

3. **Contextual Embedding:** GPT-2 uses its internal parameters to generate embeddings for each token, capturing the contextual meaning based on the input sequence.

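4. **Next-Token Prediction:** Leveraging the Transformer architecture, GPT-2 analyzes the sequence of tokens and computes a probability for every token in its vocabulary as the possible next token. The next token is then chosen by a decoding strategy: greedy decoding picks the most probable token, while sampling methods such as top-k sampling draw from the distribution to produce more varied text.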

5. **Iterative Generation:** The newly generated token is appended to the input sequence, and the model repeats the prediction process to generate subsequent tokens. This iterative approach continues until the model produces a complete text output of the desired length or until it encounters a stopping criterion.
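
The generation loop can be sketched without a real model. In the toy below, `next_token_logits` is a hypothetical stand-in for GPT-2: it looks only at the last token and reads from a random table, whereas the real model conditions on the whole context. The vocabulary, the function names, and the parameters are all assumptions made for illustration. The sketch supports both always taking the most probable token (greedy) and sampling with an optional top-k cutoff.

```python
import numpy as np

vocab = ["<eos>", "the", "cat", "sat", "down"]   # toy vocabulary; id 0 ends generation
rng = np.random.default_rng(0)
bigram_logits = rng.normal(size=(len(vocab), len(vocab)))   # stand-in for a trained model

def next_token_logits(token_ids):
    """Hypothetical model call: scores for the next token given the tokens so far."""
    return bigram_logits[token_ids[-1]]

def generate(prompt_ids, max_new_tokens=10, temperature=1.0, top_k=None, greedy=False, seed=0):
    sampler = np.random.default_rng(seed)
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids) / temperature
        if greedy:
            next_id = int(np.argmax(logits))             # always take the most probable token
        else:
            if top_k is not None:
                cutoff = np.sort(logits)[-top_k]         # keep only the k highest-scoring tokens
                logits = np.where(logits >= cutoff, logits, -np.inf)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            next_id = int(sampler.choice(len(vocab), p=probs))
        token_ids.append(next_id)                        # feed the new token back in
        if next_id == 0:                                 # stopping criterion: end-of-sequence
            break
    return token_ids

greedy_ids = generate([1], greedy=True)
sampled_ids = generate([1], temperature=0.8, top_k=3, seed=1)
print(" ".join(vocab[i] for i in greedy_ids))
print(" ".join(vocab[i] for i in sampled_ids))
```

Greedy decoding is deterministic, while sampling gives different continuations for different seeds; lowering `temperature` or `top_k` moves sampling closer to the greedy result.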

# 11) What is the main difference between the encoder-decoder architecture and a simple neural network?
**Ans:** The **encoder-decoder architecture** and a **simple neural network** differ primarily in their design and application, especially concerning sequence processing tasks.

**Encoder-Decoder Architecture:**

- **Structure:** Comprises two main components:
  - **Encoder:** Processes the input sequence and converts it into a fixed-length vector representation, capturing the essential information of the input.
  - **Decoder:** Takes this vector representation and generates the output sequence, one element at a time.

- **Purpose:** Designed for sequence-to-sequence tasks where the input and output may vary in length and structure, such as machine translation, text summarization, and image captioning.

- **Functionality:** Effectively handles variable-length input and output sequences by learning a mapping from the input sequence to the output sequence through the intermediate vector representation.

**Simple Neural Network:**

- **Structure:** Typically consists of an input layer, one or more hidden layers, and an output layer.

- **Purpose:** Suited for tasks where the input and output have fixed dimensions, such as image classification or regression problems.

- **Functionality:** Processes input data to produce an output without an intermediate representation tailored for sequence processing.

# 12) What is “fine-tuning” in BERT?
**Ans:** Fine-tuning in BERT (Bidirectional Encoder Representations from Transformers) involves adapting a pre-trained BERT model to a specific downstream task by training it further on task-specific labeled data. This process tailors the general language understanding capabilities of BERT to meet the requirements of particular applications, such as text classification, named entity recognition, or question answering.

**Key Aspects of Fine-Tuning BERT:**

1. **Task-Specific Adaptation:** After BERT's initial pre-training on large text corpora, fine-tuning adjusts its parameters using labeled data pertinent to a specific task, enabling the model to learn task-related patterns and nuances.

2. **Architecture Modification:** Typically, fine-tuning involves adding a task-specific layer, such as a classification head, on top of BERT's architecture. This additional layer is trained alongside the pre-trained layers to produce outputs aligned with the desired task.

3. **Training Process:** During fine-tuning, the entire model, including both the pre-trained BERT layers and the newly added task-specific layer, is trained on the labeled dataset. This process usually requires only a few epochs, as BERT has already learned general language representations during pre-training.

4. **Parameter Adjustment:** Fine-tuning updates BERT's parameters to better fit the specific task, enhancing performance by aligning the model's understanding with the task's unique characteristics.

# 13) How does the attention mechanism handle long-range dependencies in sequences?
**Ans:** The attention mechanism effectively manages long-range dependencies in sequences by allowing models to evaluate and assign varying levels of importance to different elements, regardless of their positions. This capability is particularly advantageous in natural language processing tasks, where understanding the relationship between distant words is crucial for grasping context and meaning.
# 14) What is the core principle behind the Transformer architecture?
**Ans:** The Transformer architecture revolutionized natural language processing by introducing a mechanism that allows models to process and generate sequences without relying on traditional recurrent or convolutional structures. The core principle behind this architecture is the self-attention mechanism, which enables the model to evaluate and assign varying levels of importance to different elements within a sequence, regardless of their positions.
# 15) What is the role of "positional encoding" in a Transformer model?
**Ans:** In Transformer models, the **positional encoding** mechanism is essential for incorporating information about the order of tokens in a sequence. Since Transformers process input data in parallel without inherent sequential awareness, positional encodings provide the necessary context regarding the position of each token, enabling the model to interpret the sequence correctly.

**Role of Positional Encoding:**

- **Sequence Order Awareness:** By adding positional encodings to token embeddings, the model gains information about the position of each token within the sequence, allowing it to distinguish between different orderings of the same set of words.

- **Facilitating Attention Mechanisms:** Positional encodings enable the self-attention mechanism to consider the position of tokens when calculating attention scores, which is crucial for tasks that depend on the relative positioning of words.
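
As one concrete example, the original Transformer paper used fixed sinusoidal encodings, sketched below in NumPy. This is only one scheme: BERT and GPT-2 learn their position embeddings instead. The function name and the toy embeddings are assumptions made for the example, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model / 2)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                       # even dimensions
    encoding[:, 1::2] = np.cos(angles)                       # odd dimensions
    return encoding

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))    # toy embeddings for a 4-token sentence
model_inputs = token_embeddings + sinusoidal_positional_encoding(4, 8)
print(model_inputs.shape)  # (4, 8)
```

Because each position receives a different vector, the same token embedding produces a different model input at position 0 than at position 3, which is what lets attention distinguish word orders.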

# 16) How do Transformers use multiple layers of attention?
**Ans:** In Transformer models, multiple layers of attention are employed to progressively extract and refine features from input sequences, enhancing the model's ability to capture complex patterns and relationships. This layered approach enables the model to build hierarchical representations, with each layer learning increasingly abstract features.
# 17) What does it mean when a model is described as “autoregressive” like GPT-2?
**Ans:** An autoregressive model predicts the next element in a sequence by relying on preceding elements. In the context of language models like GPT-2, this means generating text one token at a time, with each token's prediction based on all previously generated tokens.
# 18) How does BERT's bidirectional training improve its performance?
**Ans:** BERT (Bidirectional Encoder Representations from Transformers) enhances its performance through bidirectional training, enabling it to grasp the context of a word by considering both its preceding and succeeding words. This comprehensive understanding allows BERT to capture intricate language nuances, leading to improved accuracy in various natural language processing tasks.

# 19) What are the advantages of using the Transformer over RNN-based models in NLP?
**Ans:** Transformers offer several advantages over Recurrent Neural Network (RNN)-based models in Natural Language Processing (NLP):

1. **Parallelization**: Unlike RNNs, which process input sequences sequentially, Transformers handle entire sequences simultaneously. This parallel processing significantly reduces training times and allows for more efficient utilization of computational resources.

2. **Long-Range Dependency Handling**: Transformers effectively capture long-range dependencies within text due to their self-attention mechanism. This capability enables them to model relationships between distant words without the vanishing gradient issues that RNNs often face.

3. **Scalability**: The architecture of Transformers facilitates scaling to larger datasets and models, leading to improved performance across various NLP tasks. This scalability has been instrumental in the development of large language models like GPT and BERT.

4. **Reduced Training Time**: Due to their parallel processing capabilities, Transformers require less training time compared to RNNs, making them more efficient for large-scale language modeling.

5. **Enhanced Performance**: Transformers have consistently outperformed RNNs in various NLP tasks, including language comprehension, text translation, and context capturing, making them the preferred choice for many applications.

# 20) What is the attention mechanism’s impact on the performance of models like BERT and GPT-2?
**Ans:** The attention mechanism is a pivotal component in models like BERT and GPT-2, significantly enhancing their performance in natural language processing tasks. Its impact can be understood through several key aspects:

1. **Contextual Understanding**: Attention allows these models to weigh the importance of different words in a sentence, enabling a nuanced grasp of context. This capability is crucial for tasks such as machine translation and sentiment analysis, where understanding the relationship between words is essential.

2. **Handling Long-Range Dependencies**: Traditional models often struggle with capturing relationships between distant words. The attention mechanism addresses this by enabling models to consider the relevance of all words in a sequence, regardless of their position, thereby effectively managing long-range dependencies.

3. **Parallelization and Efficiency**: Incorporating attention mechanisms allows models to process input sequences in parallel, rather than sequentially. This parallelization leads to more efficient computation and faster training times, which is particularly beneficial when dealing with large datasets.

4. **Interpretability**: The attention mechanism provides insights into the decision-making process of models like BERT and GPT-2. By visualizing attention weights, researchers can understand which parts of the input the model focuses on, enhancing interpretability and trust in model predictions.

5. **Scalability**: The modular nature of attention mechanisms facilitates the scaling of models to handle more complex tasks and larger datasets, contributing to the development of advanced language models with superior performance.


# Practical


# 1) How to implement a simple text classification model using LSTM in Keras?

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split


texts = [
    'Sample text data for classification',
    'Another text for classification',
    'More text data here',
    'This is a different text',
    'And another sample text'
]
labels = [0, 1, 0, 1, 0]  # Replace with your labels

# Parameters
vocab_size = 5000
max_len = 100
embedding_dim = 100

# Tokenization
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

# Train-test split
X_train, X_val, y_train, y_val = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)

# Model building
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    LSTM(128, return_sequences=True),  # Default tanh activation; relu tends to be unstable in LSTMs
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(2, activation='softmax')  # One unit per class; the sample labels above use two classes (0 and 1)
])

# Model compilation
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Convert the splits to NumPy arrays: token IDs and class labels are integers
X_train = np.array(X_train, dtype=np.int32)
y_train = np.array(y_train, dtype=np.int32)
X_val = np.array(X_val, dtype=np.int32)
y_val = np.array(y_val, dtype=np.int32)

# Model training
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=64)


# 2) How to generate sequences of text using a Recurrent Neural Network (RNN)?

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding


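# Placeholder corpus: replace with real text. It must contain every character used in the
# seed string and be at least (seq_length + 1) * BATCH_SIZE characters long, otherwise the
# batched dataset below is empty.
text = "Once upon a time there was some text data here. " * 200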
vocab = sorted(set(text))
char_to_idx = {char: idx for idx, char in enumerate(vocab)}
idx_to_char = np.array(vocab)

# Convert text to integer sequences
text_as_int = np.array([char_to_idx[c] for c in text])

# Create input-output pairs
seq_length = 100
examples_per_epoch = len(text) - seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# Batch size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# Build the model
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

# The model is not stateful, so the same weights work with any batch size:
# batches of BATCH_SIZE during training and a single sequence during generation.
model = Sequential([
    Embedding(vocab_size, embedding_dim),
    LSTM(rnn_units, return_sequences=True, recurrent_initializer='glorot_uniform'),
    Dense(vocab_size)  # Raw logits over the character vocabulary
])

# Compile the model
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS)

# Text generation function
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    # Every character in start_string must appear in the training text
    generated_ids = [char_to_idx[s] for s in start_string]

    for _ in range(num_generate):
        # Feed the most recent seq_length characters and read the logits for the next one
        input_eval = tf.expand_dims(generated_ids[-seq_length:], 0)
        logits = model(input_eval)[:, -1, :] / temperature

        # Sample the next character; a lower temperature gives more predictable text
        predicted_id = int(tf.random.categorical(logits, num_samples=1)[0, 0].numpy())
        generated_ids.append(predicted_id)

    return ''.join(idx_to_char[generated_ids])

# Generate text
print(generate_text(model, start_string="Once upon a time"))


# 3) How to perform sentiment analysis using a simple CNN model?

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from sklearn.model_selection import train_test_split

# Sample data
texts = ['I love this movie', 'I hate this movie']  # Replace with your dataset
labels = [1, 0]  # 1 for positive, 0 for negative

# Parameters
vocab_size = 5000
max_len = 100
embedding_dim = 100

# Tokenization
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

# Train-test split
X_train, X_val, y_train, y_val = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)

# Model building
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(10, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Model compilation
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Convert the splits to NumPy arrays: integer token IDs, float labels for binary cross-entropy
X_train = np.array(X_train, dtype=np.int32)
y_train = np.array(y_train, dtype=np.float32)
X_val = np.array(X_val, dtype=np.int32)
y_val = np.array(y_val, dtype=np.float32)

# Model training
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=64)


# 4) How to perform Named Entity Recognition (NER) using spaCy?

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm


In [None]:
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")


text = "Apple is looking at buying U.K. startup for $1 billion."

# Process the text
doc = nlp(text)

# Iterate over the identified entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# 5) How to implement a simple Seq2Seq model for machine translation using LSTM in Keras?

In [None]:
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Sample data
source_texts = ['Hello', 'How are you?']  # Replace with your dataset
target_texts = ['Bonjour', 'Comment ça va?']  # Replace with your dataset

# Wrap each target in start/end markers so the decoder knows where a sentence
# begins and ends. Plain words are used because the default Tokenizer filters
# strip characters such as '\t' and '\n'.
target_texts = ['start ' + text + ' end' for text in target_texts]

# Tokenization and sequence conversion
source_tokenizer = Tokenizer()
target_tokenizer = Tokenizer()
source_tokenizer.fit_on_texts(source_texts)
target_tokenizer.fit_on_texts(target_texts)

source_sequences = source_tokenizer.texts_to_sequences(source_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)

# Padding sequences
max_source_len = max(len(seq) for seq in source_sequences)
max_target_len = max(len(seq) for seq in target_sequences)
source_sequences = pad_sequences(source_sequences, maxlen=max_source_len, padding='post')
target_sequences = pad_sequences(target_sequences, maxlen=max_target_len, padding='post')

# Vocabulary sizes (+1 because index 0 is reserved for padding)
source_vocab_size = len(source_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1

# Model parameters
embedding_dim = 256
latent_dim = 512

# Encoder: keep only the final hidden and cell states
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(source_vocab_size, embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: starts from the encoder states and predicts one token per step
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(target_vocab_size, embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(target_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Seq2Seq Model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Data preparation for training (teacher forcing):
# the decoder input is the target without its last position, and the expected
# output is the same target shifted one step to the left.
decoder_input_sequences = target_sequences[:, :-1]
decoder_target_sequences = target_sequences[:, 1:]

# Convert the shifted target sequences to one-hot encoded format
def one_hot_encode(sequences, vocab_size):
    one_hot = np.zeros(sequences.shape + (vocab_size,), dtype='float32')
    for i, seq in enumerate(sequences):
        for t, word_id in enumerate(seq):
            if word_id > 0:  # padding positions stay all zeros
                one_hot[i, t, word_id] = 1.0
    return one_hot

decoder_targets_one_hot = one_hot_encode(decoder_target_sequences, target_vocab_size)

# Training
model.fit([source_sequences, decoder_input_sequences], decoder_targets_one_hot,
          batch_size=64, epochs=100, validation_split=0.2)

# Inference models
# Encoder inference model
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Reuse dec_emb, the output of the trained decoder embedding. Creating a new
# Embedding layer here would feed untrained weights to the decoder.
decoder_outputs2, state_h2, state_c2 = decoder_lstm(
    dec_emb, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

# Function to decode sequences
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate it with the start marker. This assumes every target text was
    # wrapped as 'start ... end' before fitting target_tokenizer; the default
    # Tokenizer filters strip '\t' and '\n', so they never reach word_index.
    target_seq[0, 0] = target_tokenizer.word_index['start']

    # Greedy decoding loop: one token per iteration, up to max_target_len tokens
    decoded_words = []
    while len(decoded_words) < max_target_len:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Pick the most likely token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # Index 0 is padding and has no entry in index_word
        sampled_word = target_tokenizer.index_word.get(sampled_token_index, '')

        # Exit condition: end marker or padding was predicted.
        if sampled_word in ('end', ''):
            break
        decoded_words.append(sampled_word)

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return ' '.join(decoded_words)
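
A Seq2Seq decoder is usually trained with teacher forcing: at step t it receives the true target token for that step and must predict the token at step t+1. The following is a minimal NumPy sketch of how the two arrays line up. The ids are hypothetical (0 for padding, 1 for a start marker, 2 for an end marker) and do not come from the tokenizer above, and `make_teacher_forcing_pairs` is an illustrative helper name.

```python
import numpy as np

# Hypothetical ids for illustration: 0 = padding, 1 = start marker, 2 = end marker.
PAD_ID, START_ID, END_ID = 0, 1, 2

def make_teacher_forcing_pairs(padded_targets):
    """Split padded target ids into decoder inputs and next-token labels."""
    padded_targets = np.asarray(padded_targets)
    decoder_inputs = padded_targets[:, :-1]  # everything except the last position
    decoder_labels = padded_targets[:, 1:]   # the same sequence shifted left by one
    return decoder_inputs, decoder_labels

# Two toy target sentences, already wrapped in start/end markers and post-padded.
targets = [
    [START_ID, 5, END_ID, PAD_ID, PAD_ID],
    [START_ID, 6, 7, 8, END_ID],
]
decoder_inputs, decoder_labels = make_teacher_forcing_pairs(targets)
print(decoder_inputs)
print(decoder_labels)
```

Each row of `decoder_labels` is the matching row of `decoder_inputs` moved one step ahead, so the model never sees the token it is asked to predict.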



# 6) How to generate text using a pre-trained transformer model (GPT-2)?

In [None]:
!pip install transformers

In [18]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

prompt = "Once upon a time"

# Calling the tokenizer returns both input_ids and the attention_mask
inputs = tokenizer(prompt, return_tensors='pt')

# GPT-2 has no pad token, so the end-of-sequence token is reused for padding.
# Passing attention_mask and pad_token_id explicitly avoids the warnings that
# generate() otherwise prints. early_stopping is omitted because it only
# affects beam search, and this call uses greedy decoding.
output = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


Once upon a time, the world was a place of great beauty and great danger. The world of the gods was the place where the great gods were born, and where they were to live.

The world that was created was not the same as the one that is now. It was an endless, endless world. And the Gods were not born of nothing. They were created of a single, single thing. That was why the universe was so beautiful. Because the cosmos was made of two
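
The `no_repeat_ngram_size=2` argument used above prevents any bigram from appearing twice in the generated text. The rule can be sketched in plain Python. This is a simplified illustration of the idea, not the `transformers` implementation, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Return the token ids that would complete an n-gram already in generated_ids."""
    if ngram_size < 1 or len(generated_ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated_ids[len(generated_ids) - ngram_size + 1:])
    banned = set()
    # Scan every n-gram seen so far; if it starts with the current prefix,
    # its final token must not be generated next.
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Toy token ids: the sequence ends in 20, and the bigram (20, 30) already occurred.
history = [10, 20, 30, 10, 20]
print(banned_next_tokens(history, ngram_size=2))
```

During generation, the scores of the banned ids would be set to negative infinity before the next token is chosen.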


# 7) How to apply data augmentation for text in NLP?

In [None]:
!pip install textattack

In [None]:
from textattack.augmentation import EasyDataAugmenter

# Initialize the augmenter
augmenter = EasyDataAugmenter()

# Original sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Generate augmented sentences
augmented_sentences = augmenter.augment(sentence)

print(augmented_sentences)
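
`EasyDataAugmenter` follows the Easy Data Augmentation (EDA) recipe, which combines synonym replacement, random insertion, random swap, and random deletion. Two of these operations need no lexical resource and can be sketched with the standard library alone. The helper names `random_swap` and `random_deletion` are hypothetical and this is not TextAttack's code.

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    words = list(words)
    if not words:
        return words
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
words = "The quick brown fox jumps over the lazy dog".split()
print(' '.join(random_swap(words, n_swaps=1, rng=rng)))
print(' '.join(random_deletion(words, p=0.2, rng=rng)))
```

Both operations keep the vocabulary of the original sentence, so they are cheap to apply, though heavy deletion or swapping can change the meaning of short sentences.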


# 8) How can you add an Attention Mechanism to a Seq2Seq model?

In [22]:
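import tensorflow as tf
from tensorflow.keras.layers import Layer, Input, LSTM, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model

class Attention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # input_shape is [encoder_output_shape, decoder_output_shape]
        encoder_dim = input_shape[0][-1]
        decoder_dim = input_shape[1][-1]
        self.W = self.add_weight(shape=(decoder_dim, encoder_dim), initializer='random_normal', name='W')
        self.b = self.add_weight(shape=(encoder_dim,), initializer='zeros', name='b')
        super().build(input_shape)

    def call(self, inputs):
        encoder_out, decoder_out = inputs
        # Project decoder states into the encoder space: (batch, dec_steps, encoder_dim)
        projected = tf.tanh(tf.matmul(decoder_out, self.W) + self.b)
        # Score every decoder step against every encoder step: (batch, dec_steps, enc_steps)
        scores = tf.matmul(projected, encoder_out, transpose_b=True)
        # Normalize over the encoder steps
        attention_weights = tf.nn.softmax(scores, axis=-1)
        # Weighted sum of encoder states: (batch, dec_steps, encoder_dim)
        context_vector = tf.matmul(attention_weights, encoder_out)
        return context_vector, attention_weights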
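
To make the tensor shapes in an attention layer concrete, here is a minimal NumPy sketch of plain dot-product attention between decoder and encoder states. It leaves out the learned projection (`W`, `b`) of the layer above and uses random arrays in place of real LSTM outputs, so it only illustrates the score, softmax, and weighted-sum steps. The function name `dot_product_attention` is an illustrative choice.

```python
import numpy as np

def dot_product_attention(encoder_out, decoder_out):
    """encoder_out: (batch, enc_steps, dim); decoder_out: (batch, dec_steps, dim)."""
    # Similarity of every decoder step with every encoder step.
    scores = decoder_out @ encoder_out.transpose(0, 2, 1)  # (batch, dec_steps, enc_steps)
    # Softmax over the encoder axis (subtract the max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of encoder states for each decoder step.
    context = weights @ encoder_out                         # (batch, dec_steps, dim)
    return context, weights

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(2, 5, 8))  # stand-in for encoder LSTM outputs
decoder_out = rng.normal(size=(2, 3, 8))  # stand-in for decoder LSTM outputs
context, weights = dot_product_attention(encoder_out, decoder_out)
print(context.shape, weights.shape)
```

Each decoder step gets its own row of weights over the encoder steps, and every row sums to 1.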


In [23]:
def build_model(input_dim, output_dim, timesteps, latent_dim):
    # Encoder
    encoder_inputs = Input(shape=(timesteps, input_dim))
    encoder_lstm = LSTM(latent_dim, return_sequences=True)(encoder_inputs)

    # Decoder
    decoder_inputs = Input(shape=(timesteps, output_dim))
    decoder_lstm = LSTM(latent_dim, return_sequences=True)(decoder_inputs)

    # Attention
    attention = Attention()([encoder_lstm, decoder_lstm])
    context_vector = attention[0]
    attention_weights = attention[1]

    # Concatenate context vector with decoder LSTM output
    decoder_combined_context = Concatenate(axis=-1)([context_vector, decoder_lstm])

    # Output layer
    decoder_dense = TimeDistributed(Dense(output_dim, activation='softmax'))
    decoder_outputs = decoder_dense(decoder_combined_context)

    # Define the model
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    return model


In [None]:
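import numpy as np

# Assuming X_train and Y_train are your training data, one-hot encoded with
# shape (samples, timesteps, vocabulary size) and the same number of timesteps
# Teacher forcing: the decoder input is Y_train shifted one step to the right
# (the first step is all zeros), so the model only sees earlier target tokens.
decoder_input_train = np.zeros_like(Y_train)
decoder_input_train[:, 1:, :] = Y_train[:, :-1, :]

model = build_model(input_dim=X_train.shape[2], output_dim=Y_train.shape[2], timesteps=X_train.shape[1], latent_dim=256)
model.fit([X_train, decoder_input_train], Y_train, epochs=10, batch_size=64)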
