# ShakespeareBot: Transformer-based Shakespearean QA System

This notebook implements a Shakespeare-style question answering system using both a traditional NLP technique (Markov chains) and modern transformer models (BERT and GPT-2). We'll compare both approaches and evaluate their performance.

## Part 1: Setup and Dependencies

In [1]:
# Install required packages
!pip install nltk
!pip install spacy
!pip install markovify
!pip install transformers
!pip install torch
!pip install tqdm
!pip install matplotlib
!pip install datasets
!python -m spacy download en_core_web_sm


In [3]:
# Import necessary libraries
import spacy
import re
import markovify
import nltk
from nltk.corpus import gutenberg  # includes three Shakespeare plays: Hamlet, Macbeth, Julius Caesar
import warnings
import torch
from transformers import BertTokenizer, BertModel, GPT2LMHeadModel, GPT2Tokenizer, get_scheduler
# Import AdamW from torch.optim instead of transformers
from torch.optim import AdamW
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm
import random
import pandas as pd

# Suppress warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

## Part 2: Traditional Approach - Markov Chain Model
### 2.1 Data Preparation and Cleaning

In [4]:
# Load the three plays as raw text
hamlet = gutenberg.raw('shakespeare-hamlet.txt')
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
caesar = gutenberg.raw('shakespeare-caesar.txt')

# Print first 200 characters of each
print('Raw:', hamlet[:200])
print('Raw:', macbeth[:200])
print('Raw:', caesar[:200])

Raw: [The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your sel
Raw: [The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Rai
Raw: [The Tragedie of Julius Caesar by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Flauius, Murellus, and certaine Commoners ouer the Stage.

  Flauius. Hence: home you idle Creatures, g


In [5]:
# Text cleaning function
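def text_cleaner(text):
    # Replace double hyphens with a space
    text = re.sub(r'--', ' ', text)
    # Drop bracketed editorial text such as the title line
    text = re.sub(r'\[.*?\]', '', text)
    # Drop standalone numbers (single backslashes, because this is a raw string)
    text = re.sub(r'(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)', '', text)
    # Collapse all whitespace, including newlines, to single spaces
    text = ' '.join(text.split())
    return text

# Remove chapter indicators (a no-op for these plays, which use act and scene headings)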
hamlet = re.sub(r'Chapter \d+', '', hamlet)
macbeth = re.sub(r'Chapter \d+', '', macbeth)
caesar = re.sub(r'Chapter \d+', '', caesar)

# Apply cleaning function to corpus
hamlet = text_cleaner(hamlet)
caesar = text_cleaner(caesar)
macbeth = text_cleaner(macbeth)
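
To make the effect of each substitution concrete, here is a small self-contained sketch that applies the same steps to a made-up snippet in the style of the Gutenberg text. The sample string and the function name `clean` are assumptions for illustration, and the number pattern is written with single backslashes, which is what the raw-string regex appears to intend.

```python
import re

def clean(text):
    # Replace double hyphens with a space
    text = re.sub(r'--', ' ', text)
    # Drop bracketed editorial text such as the title line
    text = re.sub(r'\[.*?\]', '', text)
    # Drop standalone integers and decimals, with any whitespace before them
    text = re.sub(r'(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)', '', text)
    # Collapse all runs of whitespace (including newlines) to single spaces
    return ' '.join(text.split())

# Invented sample, not taken from the corpus
sample = "[A Title 1599]\n\nActus Primus.\n  Who's there?--Nay, answer me at 12 sharp"
cleaned = clean(sample)
print(cleaned)
```

The bracketed title, the double hyphen, the number, and the line breaks are all removed, leaving one flat line of text for spaCy to segment.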

In [6]:
# Parse the cleaned plays with spaCy
nlp = spacy.load('en_core_web_sm')
hamlet_doc = nlp(hamlet)
macbeth_doc = nlp(macbeth)
caesar_doc = nlp(caesar)

hamlet_sents = ' '.join([sent.text for sent in hamlet_doc.sents if len(sent.text) > 1])
macbeth_sents = ' '.join([sent.text for sent in macbeth_doc.sents if len(sent.text) > 1])
caesar_sents = ' '.join([sent.text for sent in caesar_doc.sents if len(sent.text) > 1])

# Combine all Shakespeare text (joined with spaces so the plays do not run together)
shakespeare_sents = ' '.join([hamlet_sents, macbeth_sents, caesar_sents])

# Print a sample of our processed text
print(shakespeare_sents[:500])

Actus Primus. Scoena Prima. Enter Barnardo and Francisco two Centinels. Barnardo. Who's there? Fran. Nay answer me: Stand & vnfold your selfe Bar. Long liue the King Fran. Barnardo? Bar. He Fran. You come most carefully vpon your houre Bar. 'Tis now strook twelue, get thee to bed Francisco Fran. For this releefe much thankes: 'Tis bitter cold, And I am sicke at heart Barn. Haue you had quiet Guard? Fran. Not a Mouse stirring Barn. Well, goodnight. If you do meet Horatio and Marcellus, the Riuals


### 2.2 Building the Markov Chain Model

In [7]:
# Create text generator using markovify
generator_1 = markovify.Text(shakespeare_sents, state_size=3)

# Randomly generate three sentences
print("Basic Markov model outputs:")
for i in range(3):
    print(generator_1.make_sentence())

Basic Markov model outputs:
None
None
Yes, bring me word Luc.
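
Two of the three calls above returned `None`, which is how markovify signals that it could not produce an acceptable sentence within its retry budget. With `state_size=3` on a small corpus, most three-word states have only one continuation, so candidate sentences tend to copy the source and get rejected. The following is a minimal pure-Python sketch of an order-n word chain, not markovify's implementation; `build_chain`, `branching_states`, and the toy text are made up for illustration. It counts how many states offer more than one possible next word as the state size grows.

```python
from collections import defaultdict

def build_chain(words, state_size):
    """Map each tuple of `state_size` consecutive words to the words that follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain

def branching_states(chain):
    """Count states that have more than one distinct continuation."""
    return sum(1 for followers in chain.values() if len(set(followers)) > 1)

# Toy corpus (an assumption for illustration, not the notebook's data)
text = "to be or not to be that is the question to be or to die"
words = text.split()

for n in (1, 2, 3):
    chain = build_chain(words, n)
    # Prints: state size, number of distinct states, number of branching states
    print(n, len(chain), branching_states(chain))
```

On this toy text the number of branching states falls from 3 to 2 to 1 as the state size goes from 1 to 3. In the notebook, lowering `state_size` or passing a larger `tries` value to `make_sentence` would likely reduce the number of `None` results, at some cost in coherence.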


In [8]:
# Randomly generate three more sentences of no more than 100 characters
print("Short sentences:")
for i in range(3):
    print(generator_1.make_short_sentence(max_chars=100))

Short sentences:
Once more goodnight, And when you do them- Brut.
Hye you Messala, And I will bring him to the Capitoll Por.
My Lord, do as you please, But if you would driue me into a toyle?


In [9]:
# Use spaCy part-of-speech tags to generate more legible text
class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ['::'.join((word.orth_, word.pos_)) for word in nlp(sentence)]
    def word_join(self, words):
        sentence = ' '.join(word.split('::')[0] for word in words)
        return sentence

# Call the class on our text
generator_2 = POSifiedText(shakespeare_sents, state_size=3)

# Now we will use the above generator to generate sentences
print("POS-enhanced Markov model outputs:")
for i in range(5):
    print(generator_2.make_sentence())

POS-enhanced Markov model outputs:
Vpon my Head they plac'd a fruitlesse Crowne , And put it in his Pocket Qu .
Set on , and it marres him ; it sets him on , and not mine owne .
Being thus benetted round with Villaines , Ere I had euer seene that day Horatio .
Romans , Countrey - men , and it marres him ; it sets him on , and leaue you so .
Go Captaine , from me greet the Danish King , Tell him his prankes haue been too broad to beare with me , That I must be idle .


In [10]:
# Generate sentences of at most 100 characters
print("Short POS-enhanced sentences:")
for i in range(5):
    print(generator_2.make_short_sentence(max_chars=100))

Short POS-enhanced sentences:
Masters , you are a Gentleman .
Enter Macbeth , Lenox , Soldiers .
Thy Master is a Wise and Valiant Romane , I neuer gaue you ought Ophe .
It is not madnesse That I haue longed long to re - deliuer .
Poore Birds they are not Ham .
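
The `POSifiedText` subclass changes what counts as a "word" in the chain: each token becomes `word::POS`, so the same spelling used as a noun and as a verb leads to different states. A minimal sketch of that encode and decode round trip follows, using a tiny hard-coded tag lookup in place of spaCy. The `TOY_TAGS` table is an assumption for illustration and the two functions are simplified stand-ins for the methods overridden above.

```python
# Hypothetical stand-in for spaCy's tagger: a fixed word -> POS lookup
TOY_TAGS = {"I": "PRON", "love": "VERB", "thee": "PRON", ".": "PUNCT"}

def word_split(sentence):
    """Turn a sentence into 'word::POS' tokens, as the chain's states will see them."""
    return ["::".join((word, TOY_TAGS.get(word, "X"))) for word in sentence.split()]

def word_join(tokens):
    """Drop the POS suffix and rejoin the words with single spaces."""
    return " ".join(token.split("::")[0] for token in tokens)

tokens = word_split("I love thee .")
print(tokens)
print(word_join(tokens))
```

Because punctuation is a separate token and `word_join` puts a space between every pair of tokens, the generated sentences above contain spaces before commas and full stops.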


### 2.3 Simple Question-Answering with Markov Model

In [11]:
def markov_answer_question(question, model=generator_2):
    """
    A simple function to generate a Shakespeare-style response
    to a question using the Markov chain model
    """
    # Basic logic: longer questions get longer answers
    words = len(question.split())

    if words <= 3:
        return model.make_short_sentence(max_chars=80) or "Brevity is the soul of wit."
    elif words <= 6:
        return model.make_short_sentence(max_chars=120) or "The quality of mercy is not strained."
    else:
        return model.make_sentence() or "To be, or not to be, that is the question."

# Test with some sample questions
sample_questions = [
    "What is love?",
    "How do I know if I'm in love?",
    "What is the meaning of life and how should we live it?"
]

print("Markov Chain Q&A Demo:")
print("-" * 30)
for question in sample_questions:
    print(f"Q: {question}")
    print(f"A: {markov_answer_question(question)}")
    print()

Markov Chain Q&A Demo:
------------------------------
Q: What is love?
A: I rather tell thee what is to be buried in't ?

Q: How do I know if I'm in love?
A: To be, or not to be, that is the question.

Q: What is the meaning of life and how should we live it?
A: In what particular thought to work , I know not , Sir Brut .



## Part 3: Transformer-Based Approach
### 3.1 Preparing Shakespeare Data for Fine-Tuning

In [12]:
# Extract sentences from Shakespeare's works for training
def extract_training_sentences(doc, min_length=40, max_length=200):
    """Extract sentences suitable for training from a spaCy doc"""
    sentences = []
    for sent in doc.sents:
        text = sent.text.strip()
        # Filter for sentences of appropriate length and complexity
        if min_length <= len(text) <= max_length and len(text.split()) >= 5:
            sentences.append(text)
    return sentences

# Extract training sentences from each play
hamlet_training = extract_training_sentences(hamlet_doc)
macbeth_training = extract_training_sentences(macbeth_doc)
caesar_training = extract_training_sentences(caesar_doc)

# Combine all training sentences
shakespeare_training = hamlet_training + macbeth_training + caesar_training

# Print some statistics and examples
print(f"Total training sentences: {len(shakespeare_training)}")
print("Example sentences:")
for i in range(5):
    print(f"[{i+1}] {shakespeare_training[i]}")

Total training sentences: 2405
Example sentences:
[1] Enter Barnardo and Francisco two Centinels.
[2] Nay answer me: Stand & vnfold your selfe Bar.
[3] You come most carefully vpon your houre Bar.
[4] 'Tis now strook twelue, get thee to bed Francisco Fran.
[5] For this releefe much thankes: 'Tis bitter cold, And I am sicke at heart Barn.
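
The length filter can be checked on plain strings without spaCy. This sketch applies the same two conditions (40 to 200 characters, at least five whitespace-separated words) to three lines from the opening of Hamlet shown earlier; the helper name `keeps` and its `min_words` parameter are made up for illustration.

```python
def keeps(text, min_length=40, max_length=200, min_words=5):
    """Return True if a sentence passes the character and word-count filter."""
    text = text.strip()
    return min_length <= len(text) <= max_length and len(text.split()) >= min_words

candidates = [
    "Who's there?",                                  # too short to keep
    "Enter Barnardo and Francisco two Centinels.",   # 43 characters, 6 words
    "You come most carefully vpon your houre Bar.",  # 44 characters, 8 words
]
for c in candidates:
    print(keeps(c), len(c), c)
```

The filter looks only at length, so stage directions and trailing speaker labels such as `Bar.` pass through unchanged, as the example sentences above show.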


### 3.2 Building the BERT Question Encoder

In [13]:
class ShakespeareBertEncoder:
    """Uses BERT to encode questions for the Shakespeare QA system"""

    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.model = BertModel.from_pretrained('bert-base-uncased')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()

    def encode_question(self, question):
        """Encode a question to get its embedding representation"""
        inputs = self.tokenizer(question, return_tensors="pt",
                               padding=True, truncation=True, max_length=64)

        # Move inputs to device
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Get the [CLS] token embedding (question representation)
        question_embedding = outputs.last_hidden_state[:, 0, :]
        return question_embedding

    def get_relevant_sentence_ids(self, question, sentences, top_k=5):
        """Find the most relevant sentences to a question based on embedding similarity"""
        question_emb = self.encode_question(question)

        # Encode all sentences
        sentence_embeddings = []
        for sentence in tqdm(sentences, desc="Encoding sentences", leave=False):
            with torch.no_grad():
                inputs = self.tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=64)
                inputs = {k: v.to(self.device) for k, v in inputs.items()}
                outputs = self.model(**inputs)
                sentence_emb = outputs.last_hidden_state[:, 0, :]
                sentence_embeddings.append(sentence_emb)

        # Calculate similarities
        similarities = []
        for sent_emb in sentence_embeddings:
            similarity = torch.cosine_similarity(question_emb, sent_emb)
            similarities.append(similarity.item())

        # Get top-k indices, most similar first
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return top_indices, [similarities[i] for i in top_indices]
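
The ranking step in `get_relevant_sentence_ids` reduces to cosine similarity followed by a top-k selection. Here is a dependency-free sketch of that idea on hand-written 3-dimensional vectors. The vectors and the helper names `cosine` and `top_k` are invented for illustration; real `bert-base-uncased` [CLS] embeddings have 768 dimensions. The sketch returns candidates ordered from most to least similar.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query, candidates, k=2):
    """Return (index, similarity) pairs for the k most similar candidates, best first."""
    scored = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Invented vectors standing in for question and sentence embeddings
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
result = top_k(query, candidates)
print(result)
```

The third candidate points in the same direction as the query and ranks first, even though its magnitude differs, since cosine similarity ignores vector length.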

### 3.3 GPT-2 Shakespeare Response Generator

In [14]:
class ShakespeareGPTGenerator:
    """Generates Shakespeare-style responses using GPT-2, optionally fine-tuned via fine_tune()"""

    def __init__(self):
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

        # Add special tokens for Q&A format
        special_tokens = {'pad_token': '<PAD>', 'sep_token': '<SEP>'}
        self.tokenizer.add_special_tokens(special_tokens)
        self.model.resize_token_embeddings(len(self.tokenizer))

    def prepare_training_data(self, qa_pairs):
        """Prepare input data for fine-tuning"""
        inputs = []
        for question, answer in qa_pairs:
            # Format: "Question: {question} <SEP> Answer: {answer}"
            text = f"Question: {question} {self.tokenizer.sep_token} Answer: {answer}"
            inputs.append(text)
        return inputs

    def fine_tune(self, training_data, epochs=3, batch_size=4):
        """Fine-tune the GPT-2 model on Shakespeare data"""
        # Prepare the dataset
        encoded_inputs = self.tokenizer(training_data, padding=True, truncation=True,
                                       max_length=512, return_tensors="pt")

        # Create dataset
        dataset = torch.utils.data.TensorDataset(
            encoded_inputs["input_ids"],
            encoded_inputs["attention_mask"]
        )

        # Create data loader
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

        # Setup optimizer and scheduler
        optimizer = AdamW(self.model.parameters(), lr=5e-5)
        scheduler = get_scheduler(
                      name="linear",
                      optimizer=optimizer,
                      num_warmup_steps=0,
                      num_training_steps=len(dataloader) * epochs)

        # Training loop
        self.model.train()
        for epoch in range(epochs):
            print(f"Epoch {epoch+1}/{epochs}")
            epoch_loss = 0

            for batch in tqdm(dataloader, desc=f"Training epoch {epoch+1}"):
                batch = [item.to(self.device) for item in batch]
                input_ids, attention_mask = batch

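                # Forward pass; padded positions get label -100 so the loss ignores them
                labels = input_ids.clone()
                labels[attention_mask == 0] = -100
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss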

                # Backward pass
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)

                # Update parameters
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

                epoch_loss += loss.item()

            avg_loss = epoch_loss / len(dataloader)
            print(f"Average loss: {avg_loss:.4f}")

        print("Fine-tuning complete!")
        # Save the model
        self.model.save_pretrained('./models/shakespeare_gpt')
        self.tokenizer.save_pretrained('./models/shakespeare_gpt')
        print("Model saved to ./models/shakespeare_gpt")

    def generate_response(self, question, max_length=150):
        """Generate a Shakespeare-style response to a question"""
        # Format the input with the question
        prompt = f"Question: {question} {self.tokenizer.sep_token} Answer:"

        # Tokenize the prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        # Generate response
        output_sequences = self.model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=max_length,
            temperature=0.8,
            top_k=50,
            top_p=0.95,
            do_sample=True,
            num_return_sequences=1,
            pad_token_id=self.tokenizer.pad_token_id
        )

        # Decode the response
        generated_text = self.tokenizer.decode(output_sequences[0], skip_special_tokens=True)

        # Extract just the answer part
        answer = generated_text.split("Answer:")[-1].strip()

        return answer

# Helper function to convert Shakespeare text to Q&A pairs for training
def create_qa_pairs(sentences, num_pairs=500):
    """Create synthetic Q&A pairs from Shakespeare sentences for training"""
    # Common Shakespeare-related questions
    question_templates = [
        "What is {topic}?",
        "How does one find {topic}?",
        "Why is {topic} important?",
        "What would you say about {topic}?",
        "How should I think about {topic}?",
        "What advice can you give about {topic}?",
        "What does it mean to experience {topic}?",
        "How would you describe {topic}?",
        "What is the nature of {topic}?",
        "Can you explain {topic}?"
    ]

    # Common Shakespeare themes
    topics = [
        "love", "death", "honor", "ambition", "jealousy", "revenge", "power",
        "betrayal", "loyalty", "fate", "time", "truth", "deception", "madness",
        "grief", "friendship", "courage", "fear", "wisdom", "folly"
    ]

    qa_pairs = []
    for _ in range(num_pairs):
        topic = random.choice(topics)
        question = random.choice(question_templates).format(topic=topic)
        answer = random.choice(sentences)
        qa_pairs.append((question, answer))

    return qa_pairs


# Example usage of the training data generation
qa_pairs = create_qa_pairs(shakespeare_training, num_pairs=100)
print("Sample Q&A pairs for training:")
for i in range(3):
    print(f"Question: {qa_pairs[i][0]}")
    print(f"Answer: {qa_pairs[i][1]}")


class ShakespeareQASystem:
    """Complete Shakespeare QA system that combines both approaches"""

    def __init__(self, use_transformer=True, train_model=False):
        """Initialize the QA system

        Args:
            use_transformer: Whether to use transformer models (True) or Markov chains (False)
            train_model: Whether to fine-tune the pretrained GPT-2 model before answering
        """
        self.use_transformer = use_transformer

        # Initialize Markov model for comparison or fallback
        self.markov_model = generator_2  # Using our POS-enhanced model

        # Initialize transformer models if requested
        if use_transformer:
            print("Initializing transformer models...")
            self.bert_encoder = ShakespeareBertEncoder()
            self.gpt_generator = ShakespeareGPTGenerator()

            # Train the model if requested
            if train_model:
                print("Preparing to fine-tune GPT-2 on Shakespeare texts...")
                # Create training data
                qa_pairs = create_qa_pairs(shakespeare_training, num_pairs=500)
                training_data = self.gpt_generator.prepare_training_data(qa_pairs)

                # Fine-tune the model
                self.gpt_generator.fine_tune(training_data, epochs=2, batch_size=2)

    def answer_question(self, question):
        """Answer a question using either Markov chains or transformers"""
        if not self.use_transformer:
            # Use the Markov model approach
            return markov_answer_question(question, self.markov_model)
        else:
            # Use the transformer pipeline
            try:
                return self.gpt_generator.generate_response(question)
            except Exception as e:
                print(f"Error with transformer model: {e}")
                # Fallback to Markov model
                return markov_answer_question(question, self.markov_model)

    def interactive_chat(self):
        """Run an interactive chat session with the system"""
        print("=== Shakespeare Question-Answering System ===")
        print(f"Using: {'Transformer models' if self.use_transformer else 'Markov chains'}")
        print("Ask a question or type 'exit' to quit.")

        while True:
            question = input("Your question: ")
            if question.lower() in ['exit', 'quit', 'bye']:
                print("Farewell, good night, parting is such sweet sorrow.")
                break

            response = self.answer_question(question)
            print(f"Shakespeare: {response}")

Sample Q&A pairs for training:
Question: Can you explain fate?
Answer: Cassius, Be not deceiu'd: If I haue veyl'd my looke, I turne the trouble of my Countenance Meerely vpon my selfe.
Question: How should I think about jealousy?
Answer: O worthyest Cousin, The sinne of my Ingratitude euen now Was heauie on me.
Question: What would you say about loyalty?
Answer: Throw Physicke to the Dogs, Ile none of it.
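
Because each answer is drawn at random, independently of the question, the pairs above teach the model the Q&A format and Shakespearean diction rather than topical relevance. The draw is also unseeded, so the pairs differ on every run. Below is a small sketch of a reproducible variant that uses a local `random.Random` instance; `make_pairs`, its `seed` parameter, and the toy inputs are hypothetical and not part of the notebook.

```python
import random

def make_pairs(sentences, topics, templates, num_pairs, seed=0):
    """Pair random templated questions with random sentences, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    pairs = []
    for _ in range(num_pairs):
        topic = rng.choice(topics)
        question = rng.choice(templates).format(topic=topic)
        answer = rng.choice(sentences)  # chosen without looking at the question
        pairs.append((question, answer))
    return pairs

# Toy inputs (assumptions for illustration)
toy_sentences = ["Throw Physicke to the Dogs, Ile none of it.", "Long liue the King."]
toy_topics = ["love", "fate"]
toy_templates = ["What is {topic}?", "Can you explain {topic}?"]

first = make_pairs(toy_sentences, toy_topics, toy_templates, num_pairs=4, seed=42)
second = make_pairs(toy_sentences, toy_topics, toy_templates, num_pairs=4, seed=42)
print(first == second)
```

Fixing the seed makes two fine-tuning runs comparable, since both see the same synthetic pairs.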


In [19]:
if __name__ == "__main__":
    # Use simpler Markov model for quick testing (no need to load transformers)
    qa_system = ShakespeareQASystem(use_transformer=False, train_model=False)

    # Run interactive chat session
    qa_system.interactive_chat()
    # qa_system = ShakespeareQASystem(use_transformer=True, train_model=True)
    # qa_system.interactive_chat()

=== Shakespeare Question-Answering System ===
Using: Markov chains
Ask a question or type 'exit' to quit.
Your question: what is love?
Shakespeare: I can not tell  but I shame To weare a Kerchiefe ?
Your question: how to make memories?
Shakespeare: This is a sorry sight Macb .
Your question: exit
Farewell, good night, parting is such sweet sorrow.


In [21]:
if __name__ == "__main__":
    # Use the transformer pipeline (pretrained GPT-2, not fine-tuned because train_model=False)
    qa_system = ShakespeareQASystem(use_transformer=True, train_model=False)

    # Run interactive chat session
    qa_system.interactive_chat()
    # qa_system = ShakespeareQASystem(use_transformer=True, train_model=True)
    # qa_system.interactive_chat()

Initializing transformer models...
=== Shakespeare Question-Answering System ===
Using: Transformer models
Ask a question or type 'exit' to quit.
Your question: What is love?
Shakespeare: Love is a thing of great love and a thing of great need. For it is not a love which is wholly necessary to you; it is a desire of your own self, which is one which must be cherished by all. It is a desire of your own self which is one which must be cherished by all. Love is the desire of your own self and will not be forgotten, for it will be cherished in the same manner as love. Love is the desire of your own self and will not be forgotten, for it will be cherished in the same manner as love. The desire of your own self is one of one which must be cherished by all. It is a desire of your own
Your question: how to make memories ?
Shakespeare: You should use these as a starting point for any memory retrieval, and as such are best used in the context of a study that deals with remembering objects. The b
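
Both replies above come from `generate_response`, which wraps the question in a fixed prompt and then keeps whatever follows the last `Answer:` marker in the decoded text. This run used `train_model=False`, so the text was produced by the pretrained GPT-2 weights without fine-tuning. Here is a minimal sketch of the string handling alone, with no model involved. The `decoded` string is a made-up stand-in for model output, and the sketch assumes the separator token has already been dropped by `skip_special_tokens=True`.

```python
SEP = "<SEP>"  # separator token added to the tokenizer in the notebook

def build_prompt(question):
    """Format a question the same way the training examples are formatted."""
    return f"Question: {question} {SEP} Answer:"

def extract_answer(decoded_text):
    """Keep only the text after the last 'Answer:' marker."""
    return decoded_text.split("Answer:")[-1].strip()

prompt = build_prompt("What is love?")
print(prompt)

# Hypothetical decoded output with special tokens already removed
decoded = "Question: What is love?  Answer: 'Tis a thing of great need."
answer = extract_answer(decoded)
print(answer)
```

Because the split keeps the text after the last occurrence, a generation that itself emits another `Answer:` would be cut down to whatever follows that later marker.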