# Attention-Based Models and Transfer Learning

In [None]:
# Q1. What is BERT and how does it work?

# answer
# BERT (Bidirectional Encoder Representations from Transformers) is a Transformer encoder model developed by Google.
# Its self-attention layers let every token attend to tokens on both its left and its right, so context is learned bidirectionally.
# BERT is pretrained with Masked Language Modeling (MLM), which hides a fraction of the input tokens and trains the model
# to recover them, and Next Sentence Prediction (NSP), which asks whether one sentence follows another.
# It is then fine-tuned on specific NLP tasks such as classification and question answering.
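
# A minimal sketch of how MLM training inputs can be built, using only the standard library.
# Assumptions (not from the notes above): the 15% selection rate and the 80/10/10 split
# (mask / random token / unchanged) follow the recipe described in the BERT paper; the function
# name mask_tokens and the toy vocabulary are hypothetical and used only for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict,
    and None at positions that do not contribute to the MLM loss.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # prediction target
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: leave unchanged
        else:
            labels.append(None)                      # ignored by the loss
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(corrupted)
print(labels)
```

# Because the model sees the unmasked tokens on both sides of each [MASK], it has to use left and right context together.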

In [None]:
# Q2. What are the main advantages of using the attention mechanism in neural networks?

# answer
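# The attention mechanism lets a model weight the most relevant parts of the input when producing each output,
# relate tokens that are far apart (long-range dependencies), and expose attention weights that offer some
# insight into what the model is using. It also removes the bottleneck of squeezing the whole input into a single
# fixed-size context vector, as in plain RNN encoder-decoders, and the limited receptive field of CNNs.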

In [None]:
# Q3. How does the self-attention mechanism differ from traditional attention mechanisms?

# answer
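# Traditional (encoder-decoder) attention lets the decoder attend to the encoder's hidden states, so queries come
# from one sequence and keys/values from another. In self-attention, queries, keys, and values all come from the same
# sequence, so its tokens attend to each other. Because no recurrence is involved, all positions can be computed in
# parallel, and dependencies between any two input tokens are modeled directly.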

In [None]:
# Q4. What is the role of the decoder in a Seq2Seq model?

# answer
# The decoder generates the output sequence step by step, conditioned on the encoder’s context vector
# and its previously generated tokens.

In [None]:
# Q5. What is the difference between GPT-2 and BERT models?

# answer
# BERT is bidirectional and designed mainly for understanding tasks, while GPT-2 is unidirectional (autoregressive)
# and designed for text generation. BERT is pretrained with MLM and NSP; GPT-2 is pretrained to predict the next token
# in a sequence. Architecturally, BERT is a stack of Transformer encoder blocks and GPT-2 a stack of decoder blocks.
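
# One concrete way to see the bidirectional vs. unidirectional difference is the attention mask each model uses.
# The sketch below builds both masks as nested lists (1 = may attend, 0 = blocked). The function names are
# hypothetical; real implementations apply such a mask to the attention scores before the softmax.

```python
def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for name, mask in [("bidirectional", bidirectional_mask(4)), ("causal", causal_mask(4))]:
    print(name)
    for row in mask:
        print(" ", row)
```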

In [None]:
# Q6. Why is the Transformer model considered more efficient than RNNs and LSTMs?

# answer
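# Transformers replace recurrence with self-attention, so all positions of a sequence can be processed in parallel
# during training instead of one step at a time. Any two tokens are connected by a single attention step, which
# shortens the gradient path and avoids the vanishing gradients that RNNs suffer over long sequences.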

In [None]:
# Q7. Explain how the attention mechanism works in a Transformer model.

# answer
# Each token is projected into a query (Q), a key (K), and a value (V). Scaled dot-product attention compares
# queries with keys to get scores, Q·K^T / sqrt(d_k), and turns them into weights with a softmax. The output for each
# token is the weighted sum of the values, softmax(Q·K^T / sqrt(d_k))·V, so the model focuses on the most relevant tokens.
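
# A dependency-free sketch of scaled dot-product attention on plain Python lists, for illustration only.
# Assumptions: Q, K, and V are given directly (a real Transformer first obtains them from learned linear
# projections), and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of vectors. Returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compare the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = scaled_dot_product_attention(Q, K, V)
print("weights:", w)
print("output: ", out)
```

# The query matches the first key more closely, so the first value receives the larger weight.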

In [None]:
# Q8. What is the difference between an encoder and a decoder in a Seq2Seq model?

# answer
# The encoder processes the input sequence into context representations, while the decoder generates the
# output sequence step by step using the encoder’s output and its own previous outputs.

In [None]:
# Q9. What is the primary purpose of using the self-attention mechanism in transformers?

# answer
# Self-attention allows each token to capture contextual relationships with every other token in the sequence,
# enabling better handling of semantic dependencies.

In [None]:
# Q10. How does the GPT-2 model generate text?

# answer
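# GPT-2 generates text autoregressively: a stack of Transformer decoder blocks with masked (causal) self-attention
# predicts the next token from the prompt and all previously generated tokens. The chosen token is appended to the
# sequence and the process repeats until a length limit or an end-of-text token is reached.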

In [None]:
# Q11. What is the main difference between the encoder-decoder architecture and a simple neural network?

# answer
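# Encoder-decoder architectures are designed for sequence-to-sequence tasks such as translation: the encoder reads a
# variable-length input and the decoder produces a variable-length output. A simple feed-forward network maps a
# fixed-size input to a fixed-size output and has no built-in way to model the order of, or dependencies between, tokens.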

In [None]:
# Q12. Explain the concept of “fine-tuning” in BERT.

# answer
# Fine-tuning involves taking a pretrained BERT model, usually adding a small task-specific output layer, and
# continuing training on labeled data for a specific downstream task (e.g., sentiment classification, NER).

In [None]:
# Q13. How does the attention mechanism handle long-range dependencies in sequences?

# answer
# Attention directly computes pairwise relationships between tokens, regardless of distance,
# avoiding the vanishing gradient problem of RNNs.

In [None]:
# Q14. What is the core principle behind the Transformer architecture?

# answer
# The core principle is self-attention, which enables parallel computation and effective modeling
# of dependencies without recurrence.

In [None]:
# Q15. What is the role of the "position encoding" in a Transformer model?

# answer
# Since Transformers lack recurrence, position encodings are added to input embeddings to provide information
# about the order of tokens.
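
# A sketch of one common choice, the fixed sinusoidal encoding from the original Transformer paper.
# This is an assumption for illustration: the notes above do not say which encoding is meant, and models such as
# BERT and GPT-2 use learned position embeddings instead. The function name is hypothetical.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of position encodings."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = sinusoidal_positional_encoding(num_positions=3, d_model=4)

# Toy token embeddings (all zeros) so the added position signal is easy to see
embeddings = [[0.0] * 4 for _ in range(3)]
inputs = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
for row in inputs:
    print([round(x, 4) for x in row])
```

# Each position gets a distinct vector, so identical tokens at different positions no longer look the same to the model.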

In [None]:
# Q16. How do Transformers use multiple layers of attention?

# answer
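# Two ideas are combined. Within a layer, multi-head attention runs several attention heads in parallel, each with
# its own projections, so different heads can capture different types of relationships in the sequence at once.
# Across the model, many such layers are stacked, so each layer refines the representations built by the one below it.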
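
# A simplified sketch of the "split into heads, attend, concatenate" pattern, standard library only.
# Assumptions: the learned per-head projections (W_Q, W_K, W_V) and the output projection (W_O) of a real
# Transformer are omitted, so each head simply works on its own slice of the input vectors. All function
# names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Split every d_model-sized vector into num_heads slices of equal size."""
    assert len(X[0]) % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = len(X[0]) // num_heads
    return [[x[h * d_head:(h + 1) * d_head] for x in X] for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    heads = split_heads(X, num_heads)
    per_head = [attention(H, H, H) for H in heads]  # each head attends independently
    # Concatenate the head outputs back into one d_model-sized vector per token
    return [sum((per_head[h][t] for h in range(num_heads)), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
for row in multi_head_self_attention(X, num_heads=2):
    print([round(v, 3) for v in row])
```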

In [None]:
# Q17. What does it mean when a model is described as “autoregressive” like GPT-2?

# answer
# Autoregressive means the model generates one token at a time, conditioned on previously generated tokens,
# predicting the next word step by step.
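
# The generation loop itself can be sketched without any neural network. Assumptions: NEXT_TOKEN_PROBS is a
# hypothetical hand-written table standing in for a model, and it looks only at the last token, whereas GPT-2
# conditions on the whole prefix. Greedy selection is used here for simplicity.

```python
# Hypothetical toy "model": next-token probabilities given the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"I": 0.9, "NLP": 0.1},
    "I": {"love": 0.8, "enjoy": 0.2},
    "love": {"NLP": 0.7, "learning": 0.3},
    "enjoy": {"NLP": 1.0},
    "learning": {"NLP": 1.0},
    "NLP": {"</s>": 1.0},
}

def generate(max_new_tokens=10):
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]    # condition on what has been generated so far
        next_token = max(probs, key=probs.get)  # greedy: take the most probable token
        if next_token == "</s>":                # stop at the end-of-sequence token
            break
        tokens.append(next_token)               # feed the output back in as input
    return tokens[1:]

print(" ".join(generate()))
```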

In [None]:
# Q18. How does BERT's bidirectional training improve its performance?

# answer
# Bidirectional training allows BERT to learn context from both left and right tokens,
# leading to richer contextual representations compared to unidirectional models.

In [None]:
# Q19. What are the advantages of using the Transformer over RNN-based models in NLP?

# answer
# Transformers offer parallelization, better handling of long-range dependencies,
# scalability, and superior performance on NLP benchmarks.

In [None]:
# Q20. What is the attention mechanism’s impact on the performance of models like BERT and GPT-2?

# answer
# The attention mechanism improves context understanding, enables handling of long sequences,
# and significantly boosts the performance of language understanding and generation tasks.

# Practical

In [None]:
# Q1. How to implement a simple text classification model using LSTM in Keras?

#code >

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Sample dataset (tiny dataset, better accuracy with more data)
sentences = [
    "I love this course",       # Positive
    "This is an amazing class", # Positive
    "I enjoy learning NLP",     # Positive
    "I hate this subject",      # Negative
    "This class is boring"      # Negative
]

labels = [1, 1, 1, 0, 0]  # 1=Positive, 0=Negative

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Padding sequences to same length
X = pad_sequences(sequences, padding='post')
y = np.array(labels)

# Define LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=16))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
model.fit(X, y, epochs=10, verbose=1)

# Predict on two sentences (both also appear in the training set, so this is only a sanity check)
test_sentences = ["I enjoy learning NLP", "I hate this subject"]
test_seq = tokenizer.texts_to_sequences(test_sentences)
test_pad = pad_sequences(test_seq, maxlen=X.shape[1], padding='post')

predictions = model.predict(test_pad)
predicted_labels = (predictions > 0.5).astype(int)

#example >
print("Test Sentences:", test_sentences)
print("Predictions (probabilities):", predictions)
print("Predicted Labels:", predicted_labels.reshape(-1))

Epoch 1/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.4000 - loss: 0.6946
Epoch 2/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 59ms/step - accuracy: 0.4000 - loss: 0.6934
Epoch 3/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 57ms/step - accuracy: 0.6000 - loss: 0.6921
Epoch 4/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step - accuracy: 0.6000 - loss: 0.6909
Epoch 5/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step - accuracy: 0.6000 - loss: 0.6897
Epoch 6/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 80ms/step - accuracy: 0.6000 - loss: 0.6885
Epoch 7/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 134ms/step - accuracy: 0.6000 - loss: 0.6873
Epoch 8/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step - accuracy: 0.6000 - loss: 0.6861
Epoch 9/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

In [None]:
# Q2. How to generate sequences of text using a Recurrent Neural Network (RNN)?

#code >

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

# Sample corpus
data = """I love learning NLP with deep learning.
RNN models can generate sequences of text.
This is a simple RNN text generation example."""

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
total_words = len(tokenizer.word_index) + 1

# Convert text into sequences
input_sequences = []
for line in data.split("."):
    tokens = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(tokens)):
        n_gram = tokens[:i+1]
        input_sequences.append(n_gram)

# Pad sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_seq_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = to_categorical(y, num_classes=total_words)

# Define RNN model
model = Sequential()
model.add(Embedding(total_words, 16))  # input_length is deprecated in Keras 3; the sequence length is inferred from the data
model.add(SimpleRNN(32))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model.fit(X, y, epochs=200, verbose=0)

# Generate text
seed_text = "I love"
next_words = 5

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding='pre')
    # Greedy decoding: take the most probable next-word index as a plain int
    predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

    # index_word maps indices back to words; index 0 is padding and has no word
    word = tokenizer.index_word.get(predicted)
    if word:
        seed_text += " " + word

#example >
print("Generated Text:", seed_text)

Generated Text: I love learning nlp with deep learning


In [None]:
# Q3. How to perform sentiment analysis using a simple CNN model?

#code >
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

def cnn_sentiment_classifier(vocab_size=5000, embedding_dim=64, input_length=100):
    model = Sequential([
        Input(shape=(input_length,)),          # fixed-length sequences of token ids
        Embedding(vocab_size, embedding_dim),  # token ids -> dense vectors
        Conv1D(128, 5, activation='relu'),     # 128 filters over 5-token windows
        GlobalMaxPooling1D(),                  # strongest response of each filter
        Dense(1, activation='sigmoid')         # probability of the positive class
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

#example
cnn_model = cnn_sentiment_classifier()
cnn_model.summary()  # summary() prints the table itself and returns None, so it is not wrapped in print()


In [None]:
# Q4. How to perform Named Entity Recognition (NER) using spaCy?

#code >
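import spacy

# Load the pipeline once; install the model first with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def perform_ner(text):
    doc = nlp(text)
    # Each entity span carries its text and a label such as ORG, GPE, or MONEY
    return [(ent.text, ent.label_) for ent in doc.ents]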

#example
print(perform_ner("Apple is looking at buying U.K. startup for $1 billion"))

[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]


In [None]:
# Q5. How to implement a simple Seq2Seq model for machine translation using LSTM in Keras?

#code >
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

def seq2seq_model(latent_dim=256, input_dim=100, output_dim=100):
    # Encoder: read the one-hot source sequence and keep only the final LSTM states
    encoder_inputs = Input(shape=(None, input_dim))
    encoder = LSTM(latent_dim, return_state=True)
    _, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]

    # Decoder: start from the encoder states and predict a token at every step.
    # During training its input is typically the target sequence shifted by one step (teacher forcing).
    decoder_inputs = Input(shape=(None, output_dim))
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(output_dim, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    return model

#example
seq2seq = seq2seq_model()
seq2seq.summary()  # summary() prints the table itself and returns None, so it is not wrapped in print()


In [None]:
# Q6. How to generate text using a pre-trained transformer model (GPT-2)?

#code >
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer & model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 has no native pad token; set it to eos to avoid warnings
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prompt
prompt = "Natural Language Processing with transformers is"

# Tokenize WITH attention_mask and padding
enc = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True
)

# Generate (note: using max_new_tokens; and passing attention_mask & pad_token_id)
out_ids = model.generate(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    max_new_tokens=80,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

#example >
print("Prompt:", prompt)
print("Generated Text:", tokenizer.decode(out_ids[0], skip_special_tokens=True))

Prompt: Natural Language Processing with transformers is
Generated Text: Natural Language Processing with transformers is a tool that allows you to generate code that will run on your computer. It has been developed by the team at Intel and is used by the IBM Watson.

This is a step-by-step guide to making your own transformers. It will help you to make the most of your existing data sets and make the most of your existing data sets, without having to build your own transform
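
# What do_sample, temperature, and top_k do to the next-token distribution can be sketched with the standard
# library. Assumptions: the logits are made-up numbers, the function names are hypothetical, and top_p (nucleus)
# filtering is left out; this is not the transformers library's implementation.

```python
import math
import random

def top_k_temperature_probs(logits, temperature=0.7, top_k=2):
    """Turn raw logits into a sampling distribution restricted to the top_k tokens."""
    # Indices of the top_k largest logits; all other tokens get probability 0
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Dividing by a temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = {i: logits[i] / temperature for i in keep}
    m = max(scaled.values())
    exps = {i: math.exp(s - m) for i, s in scaled.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def sample(probs, rng):
    """Draw one token index according to probs (this is the do_sample=True step)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_temperature_probs(logits, temperature=0.7, top_k=2)
print("probabilities:", [round(p, 3) for p in probs])
print("sampled index:", sample(probs, random.Random(0)))
```

# Because the next token is sampled rather than chosen greedily, rerunning the GPT-2 cell above will usually give different text.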


In [None]:
# Q7. How to apply data augmentation for text in NLP?
#code >
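import random

DEFAULT_SYNONYMS = {"happy": ["joyful", "cheerful"], "sad": ["unhappy", "sorrowful"]}

def synonym_replacement(text, synonyms=None):
    # Avoid a mutable default argument; fall back to the small built-in table
    if synonyms is None:
        synonyms = DEFAULT_SYNONYMS
    # Replace every word that has an entry with a randomly chosen synonym (exact, case-sensitive match)
    return " ".join(
        random.choice(synonyms[word]) if word in synonyms else word
        for word in text.split()
    )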

#example
print(synonym_replacement("I am happy but sometimes sad"))

I am cheerful but sometimes unhappy
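
# Two other simple augmentations, random swap and random deletion, are sketched below. They are swapped in here
# as further examples from the "Easy Data Augmentation" family and are not part of the answer above; the function
# names, the default probability, and the rng parameter are assumptions for illustration.

```python
import random

def random_swap(text, n_swaps=1, rng=None):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def random_deletion(text, p=0.2, rng=None):
    """Drop each word independently with probability p."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty sentence: fall back to one randomly chosen word
    return " ".join(kept) if kept else rng.choice(words)

sentence = "I am happy but sometimes sad"
print(random_swap(sentence, rng=random.Random(0)))
print(random_deletion(sentence, rng=random.Random(0)))
```

# Both keep the label of the original sentence, which is a reasonable assumption for sentiment only when few words are changed.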


In [None]:
# Q8. How can you add an Attention Mechanism to a Seq2Seq model?

#code >
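from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Attention, Concatenate

def add_attention_to_seq2seq(latent_dim=256, input_dim=100, output_dim=100):
    # Encoder returns its full output sequence so the decoder can attend over every source step
    encoder_inputs = Input(shape=(None, input_dim))
    encoder_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(encoder_inputs)
    encoder_states = [state_h, state_c]

    decoder_inputs = Input(shape=(None, output_dim))
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

    # Dot-product attention: decoder outputs are the queries, encoder outputs the values (and keys)
    attn_layer = Attention()
    attn_out = attn_layer([decoder_outputs, encoder_outputs])
    # Combine each decoder state with its attention context so the prediction uses both
    decoder_concat = Concatenate(axis=-1)([decoder_outputs, attn_out])
    decoder_dense = Dense(output_dim, activation="softmax")
    decoder_outputs = decoder_dense(decoder_concat)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

#example
attention_seq2seq = add_attention_to_seq2seq()
attention_seq2seq.summary()  # summary() prints the table itself and returns None, so it is not wrapped in print()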
