In [1]:
pip install pennylane

Collecting pennylane
  Downloading PennyLane-0.40.0-py3-none-any.whl.metadata (10 kB)
Collecting rustworkx>=0.14.0 (from pennylane)
  Downloading rustworkx-0.16.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting tomlkit (from pennylane)
  Downloading tomlkit-0.13.2-py3-none-any.whl.metadata (2.7 kB)
Collecting appdirs (from pennylane)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting autoray>=0.6.11 (from pennylane)
  Downloading autoray-0.7.0-py3-none-any.whl.metadata (5.8 kB)
Collecting pennylane-lightning>=0.40 (from pennylane)
  Downloading PennyLane_Lightning-0.40.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (27 kB)
Collecting diastatic-malt (from pennylane)
  Downloading diastatic_malt-2.15.2-py3-none-any.whl.metadata (2.6 kB)
Collecting scipy-openblas32>=0.3.26 (from pennylane-lightning>=0.40->pennylane)
  Downloading scipy_openblas32-0.3.29.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5

I. Encoding Text Data for QNLP

This is arguably the trickiest part. Here's a breakdown of popular encoding methods for text, along with their strengths, weaknesses, and suitability for RAG:

    A. Basis Encoding (One-Hot Encoding)

        Concept: Assign each word in your vocabulary a unique basis state (e.g., |000> for "the," |001> for "cat," |010> for "sat," etc.). A document becomes a superposition of the basis states representing its constituent words. You'd need n qubits to represent a vocabulary of size 2^n.

        Pros: Simple to understand.

        Cons:

            Exponential Scaling: Requires an exponentially large number of qubits to represent even moderately sized vocabularies. Completely impractical for real-world text.

            No Semantic Similarity: Words encoded in this way have no inherent relationship. "Cat" and "Dog" are as different as "Cat" and "Hydrogen." This makes similarity comparisons useless.

        Suitability for RAG: Unsuitable due to scaling and lack of semantic meaning. Avoid this for any practical QNLP task.

    B. Amplitude Encoding

        Concept: Encode the values of a classical vector (e.g., word counts, TF-IDF scores, pre-trained word embeddings) into the amplitudes of a quantum state. If your vector is [a1, a2, ..., an], the quantum state becomes:

              
        |ψ> = a1|00...0> + a2|00...1> + ... + an|11...1>

            

        IGNORE_WHEN_COPYING_START

    Use code with caution.
    IGNORE_WHEN_COPYING_END

    Crucially, the amplitudes must be normalized such that a1^2 + a2^2 + ... + an^2 = 1. This represents a valid quantum state.

    Pros:

        Efficient Qubit Usage: Can encode 2^n values using only n qubits. This is a significant advantage over basis encoding.

        Potential for Speedup: Quantum algorithms can operate directly on the amplitudes, offering potential computational advantages.

    Cons:

        State Preparation Complexity: Preparing the quantum state with the desired amplitudes can be challenging and may require complex quantum circuits (e.g., using Quantum Random Access Memory or QRAM, which is still largely theoretical). This is a major bottleneck.

        Measurement Challenges: Retrieving the encoded data requires careful measurements, and the outcome is probabilistic.

    Suitability for RAG: Potentially useful, especially if you're working with pre-computed embeddings from a classical model (e.g., Word2Vec, GloVe, or even sentence embeddings like Sentence-BERT). You could encode these embeddings into quantum states. The challenge is the state preparation complexity.

C. Angle Encoding

    Concept: Encode data into the angles of qubits. For example, a single data point x can be encoded as:

          
    |ψ> = cos(x) |0> + sin(x) |1>

        

    IGNORE_WHEN_COPYING_START

        Use code with caution.
        IGNORE_WHEN_COPYING_END

        For a vector, you'd encode each element into the angle of a different qubit. You can also encode multiple features into a single qubit using a more complex rotation.

        Pros:

            Relatively Simple State Preparation: Easier to implement on near-term quantum hardware compared to amplitude encoding, as it mainly involves single-qubit rotations.

            Data is in Rotation Angles: Allows use of quantum rotation gates to perform computations.

        Cons:

            Limited Encoding Capacity: Each qubit typically encodes only one or two data points.

            Sensitivity to Noise: Quantum noise can easily corrupt the angle information.

        Suitability for RAG: Can be used for encoding smaller feature vectors or for implementing specific quantum kernels. It may be suitable for representing the relevance scores or attention weights.

    D. Quantum Feature Maps (Variational Quantum Circuits)

        Concept: Use a parameterized quantum circuit to map classical data into a high-dimensional quantum Hilbert space. The circuit's parameters are trained to create a feature map that separates different classes of data.

        Pros:

            Potential for Feature Engineering: The quantum circuit can implicitly create complex, non-linear features that might be difficult to engineer classically.

            Flexibility: The circuit architecture and parameters can be tailored to the specific data.

        Cons:

            Training Complexity: Training these circuits (using Variational Quantum Eigensolver (VQE) or similar methods) is computationally expensive and can be challenging. Requires a hybrid quantum-classical approach.

            Hardware Dependence: The optimal circuit architecture depends on the available quantum hardware.

            Limited Theoretical Understanding: The exact nature of the learned feature map is often difficult to interpret.

        Suitability for RAG: Potentially powerful for learning complex relationships between queries and documents. However, the training cost is significant. More suitable for tasks like document classification that may improve RAG.

    E. Tensor Network Encoding

        Concept: Represent text as a tensor network. This is particularly useful for capturing long-range dependencies in text. Each word can be represented as a tensor, and the relationships between words are encoded in the connections of the network.

        Pros:

            Efficient Representation of Long-Range Dependencies: Captures complex relationships between words and phrases.

            Dimensionality Reduction: Can compress high-dimensional data into a lower-dimensional representation.

        Cons:

            Complex Implementation: Requires specialized knowledge of tensor networks.

            Computational Overhead: Performing computations on tensor networks can be computationally expensive, even classically.

        Suitability for RAG: Potentially useful for capturing contextual information and improving the accuracy of semantic search, but the complexity is high.

II. Encoding Image Data

Encoding images into quantum states can be achieved in a number of ways, mirroring the text encoding strategies:

    A. Amplitude Encoding: Represent pixel values as amplitudes of a quantum state. If you have an image with N pixels, each with a grayscale value, you would need log2(N) qubits.

        Pros: Efficient qubit usage.

        Cons: State preparation complexity is high, especially for high-resolution images.

    B. Angle Encoding: Encode pixel values into the angles of qubits.

        Pros: Easier state preparation compared to amplitude encoding.

        Cons: Limited encoding capacity. Each qubit can hold only a small amount of information. High number of qubits required for image features.

    C. Flexible Representation of Quantum Images (FRQI)

        Concept: Encodes both the color and the position information of each pixel into a quantum state. This is specifically designed for image data.

        Pros: Efficiently encodes image data.

        Cons: Can be complex to implement and manipulate.

    D. Quantum Feature Maps: Similar to text, you can use variational quantum circuits to extract features from images. This can be applied to image recognition tasks.

III. Retrieval in Quantum States

Once you have your text and images encoded into quantum states, you need to retrieve relevant information. Here are some key approaches:

    A. Quantum Similarity Measures: Calculate the similarity between the quantum state of a query and the quantum states of documents in your database. Common measures include:

        State Fidelity: Measures the overlap between two quantum states. A fidelity of 1 indicates identical states, while 0 indicates orthogonal states. High fidelity implies high similarity.

        Inner Product: Calculate the inner product (dot product) between two quantum state vectors.

        Quantum Earth Mover's Distance (QEMD): A quantum version of the Earth Mover's Distance (also known as Wasserstein distance), which is a measure of the distance between two probability distributions.

    B. Quantum Search Algorithms: Use algorithms like Grover's algorithm to efficiently search for relevant documents in your quantum database. Grover's algorithm can provide a quadratic speedup compared to classical search algorithms.

    C. Quantum Associative Memory (QuAM): Store associations between queries and documents in a quantum memory. When a new query is presented, the QuAM can retrieve the associated documents.

IV. Choosing the Right Encoding and Retrieval Method

Here's a decision-making framework for your QNLP-RAG project:

    Data Type: Consider the type of data you are working with (text, images, or both).

    Feature Engineering: Decide whether to perform classical feature engineering (e.g., using word embeddings) before encoding or to rely on quantum feature maps to learn features. If using pre-trained embeddings, amplitude encoding becomes a more attractive option.

    Quantum Hardware: Consider the limitations of available quantum hardware (number of qubits, connectivity, gate fidelity). Near-term devices (NISQ) are noisy and have limited qubit counts, so simpler encoding methods like angle encoding or shallower quantum circuits are more practical.

    Computational Resources: Evaluate the computational cost of state preparation, quantum computations, and measurements.

    Desired Speedup: Determine the level of speedup you are hoping to achieve. Grover's algorithm provides a quadratic speedup, but other quantum algorithms may offer different advantages.

    Hybrid Approach: Consider a hybrid quantum-classical approach, where some computations are performed classically and others are performed on a quantum computer. For example, you could use classical machine learning to pre-process the data and then use a quantum computer to perform the similarity search.

    State Preparation Complexity: Aim to prepare a simpler state preparation algorithm because it is often the bottleneck of quantum implementations

Example Scenario and Implementation Tips

Let's say you have a corpus of text documents and you want to build a QNLP-RAG system.

    Step 1: Classical Pre-processing:

        Use classical NLP techniques to pre-process the text data (e.g., tokenization, stemming, stop word removal).

        Generate word embeddings (e.g., Word2Vec, GloVe, Sentence-BERT) for each document and query.

    Step 2: Quantum Encoding:

        Encode the word embeddings into quantum states using amplitude encoding. Normalize the embedding vectors to ensure they represent valid quantum states.

    Step 3: Quantum Retrieval:

        Calculate the state fidelity between the quantum state of the query and the quantum states of the documents in your database.

        Use Grover's algorithm to efficiently search for the documents with the highest state fidelity.

    Step 4: Classical Post-processing:

        Retrieve the classical text documents that correspond to the most relevant quantum states.

        Use classical NLP techniques to rank and present the retrieved documents to the user.

In [2]:
import pennylane as qml
from pennylane import numpy as np

# Example: Amplitude Encoding of a 2-dimensional vector
dev = qml.device("default.qubit", wires=1)  # Need 1 qubit for a 2-dim vector

@qml.qnode(dev)
def amplitude_encoding(vector):
    """Encodes a normalized vector into the amplitudes of a quantum state."""
    qml.AmplitudeEmbedding(features=vector, wires=range(1), normalize=False)  #Normalization should happen before
    return qml.state()

# Example Usage
embedding_vector = np.array([1/np.sqrt(2), 1/np.sqrt(2)]) #Normalized vector.
quantum_state = amplitude_encoding(embedding_vector)
print(quantum_state)
# Example : Angle Encoding
dev2 = qml.device("default.qubit", wires=1)

@qml.qnode(dev2)
def angle_encoding(x):
    qml.RY(x, wires=0) #RY rotation encodes the angle
    return qml.expval(qml.PauliZ(wires=0)) #Example measurement

angle = np.pi/4
expectation_value = angle_encoding(angle)
print(expectation_value)

[0.70710678+0.j 0.70710678+0.j]
0.7071067811865475


Important Considerations:

    Normalization: Always normalize your data before encoding it into quantum states, especially when using amplitude encoding.

    Error Mitigation: Quantum noise is a major challenge. Explore error mitigation techniques to improve the accuracy of your results.

    Hybrid Quantum-Classical Approach: In the near term, hybrid approaches are often the most practical.

    Scalability: Keep scalability in mind when choosing an encoding method. Some methods scale exponentially with the size of the data, which can be a major limitation.

    Benchmarking: Thoroughly benchmark your quantum algorithms against classical algorithms to determine whether you are actually achieving a speedup.

    Simplified Embeddings: The one-hot encoding-like embeddings are extremely basic. Real-world QNLP would use pre-trained word embeddings to capture semantic relationships.

    Normalization: Crucial for amplitude encoding. The amplitudes must sum to 1 when squared.

    Padding: Important to pad the vector to be a power of 2 to comply with Amplitude Encoding requirements

    State Fidelity as Similarity: State fidelity is a basic similarity measure. More advanced quantum similarity measures exist.

    Quantum Hardware Limitations: This example uses a simulator. Running this on real quantum hardware would be significantly more challenging due to noise and decoherence.

    Scalability: This approach does not scale well to large corpora or vocabularies due to the qubit requirements of amplitude encoding.

    Basic RAG: The answer is simply the retrieved document. A more sophisticated RAG system would use the retrieved document to generate a more concise and relevant answer. This would usually be done via a LLM.

In [4]:
import nltk
nltk.download('punkt_tab')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml

# Download necessary NLTK resources (run this only once)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")  # Attempt to tokenize to trigger download
except LookupError:
    nltk.download('punkt')

# I. Sample Text Corpus
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The cat sat on the mat.",
    "A dog is a loyal companion.",
    "The fox is a cunning animal."
]
question = "What does the fox do?"

# II. Classical Pre-processing
def preprocess(text):
    """Tokenizes, removes stop words, and lowercases the text."""
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# III. Creating a Vocabulary and Word Embeddings (Simplified)
vocabulary = set()
for doc in processed_corpus:
    vocabulary.update(doc)
vocabulary.update(processed_question)

word_to_index = {word: i for i, word in enumerate(vocabulary)}
index_to_word = {i: word for word, i in word_to_index.items()}
vocab_size = len(vocabulary)

def create_simple_embedding(tokens, vocab_size, word_to_index):
    """Creates a simple vector embedding based on word presence."""
    embedding = np.zeros(vocab_size)
    for token in tokens:
        if token in word_to_index:
            embedding[word_to_index[token]] = 1
    return embedding

corpus_embeddings = [create_simple_embedding(doc, vocab_size, word_to_index) for doc in processed_corpus]
question_embedding = create_simple_embedding(processed_question, vocab_size, word_to_index)

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_corpus_embeddings = [normalize_vector(embedding) for embedding in corpus_embeddings]
normalized_question_embedding = normalize_vector(question_embedding)

# IV. Quantum Encoding (Amplitude Encoding)
num_qubits = int(np.ceil(np.log2(vocab_size)))

def pad_vector(vector, target_length):
    current_length = len(vector)
    if current_length < target_length:
        padding_length = target_length - current_length
        padding = np.zeros(padding_length)
        padded_vector = np.concatenate((vector, padding))
        return padded_vector
    return vector

padded_vocab_size = 2**num_qubits
padded_corpus_embeddings = [pad_vector(embedding, padded_vocab_size) for embedding in normalized_corpus_embeddings]
padded_question_embedding = pad_vector(normalized_question_embedding, padded_vocab_size)

dev = qml.device("default.qubit", wires=num_qubits)

@qml.qnode(dev)
def amplitude_encode(embedding):
    qml.AmplitudeEmbedding(features=embedding, wires=range(num_qubits), pad_with=0, normalize=False)
    return qml.state()

quantum_corpus_states = [amplitude_encode(embedding) for embedding in padded_corpus_embeddings]
quantum_question_state = amplitude_encode(padded_question_embedding)

# V. Quantum Retrieval (State Fidelity)
def state_fidelity(state1, state2):
    """Calculates the fidelity between two quantum states."""
    overlap = np.abs(np.vdot(state1, state2))**2
    return overlap

similarities = [state_fidelity(quantum_question_state, state) for state in quantum_corpus_states]
most_similar_index = np.argmax(similarities)

retrieved_document = corpus[most_similar_index]

# VI. Answering the Query
print(f"Query: {question}")
print(f"Answer: {retrieved_document}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Query: What does the fox do?
Answer: The fox is a cunning animal.


In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj

# I. Setup and Downloads (GloVe and NLTK)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')

glove_file = "glove.6B.50d.txt"
url = "http://nlp.stanford.edu/data/glove.6B.zip"

if not os.path.exists(glove_file):
    print("Downloading GloVe embeddings...")
    response = requests.get(url, stream=True)
    with open("glove.zip", "wb") as out_file:
        copyfileobj(response.raw, out_file)

    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile("glove.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("glove.zip")

# II. Load GloVe Embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings(glove_file)
embedding_dim = len(next(iter(glove_embeddings.values())))
print(f"Loaded {len(glove_embeddings)} GloVe embeddings with dimension {embedding_dim}")

# III. Larger Text Corpus and Preprocessing
corpus = [
    "The quick brown fox jumps over the lazy dog. This is a classic English pangram.",
    "The cat sat on the mat, peacefully napping in the sun.",
    "A dog is a loyal companion, offering unconditional love and support. Dogs are good pets.",
    "The fox is a cunning animal, known for its cleverness and adaptability. Foxes are often found in forests.",
    "Quantum computing holds the promise of revolutionizing various fields, including medicine and materials science.",
    "Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "The Earth is the third planet from the Sun and the only known planet to harbor life.",
    "Artificial intelligence (AI) is rapidly transforming industries and reshaping the way we live and work.",
    "Climate change is a pressing global issue, requiring urgent action to mitigate its effects.",
    "Renewable energy sources, such as solar and wind power, are becoming increasingly important in the transition to a sustainable future."
]
question = "What is NLP?"

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# IV. Document and Query Embeddings (Using GloVe)
def create_document_embedding(tokens, embeddings, embedding_dim):
    word_vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not word_vectors:
        return np.zeros(embedding_dim)
    document_embedding = np.mean(word_vectors, axis=0)
    return document_embedding

corpus_embeddings = [create_document_embedding(doc, glove_embeddings, embedding_dim) for doc in processed_corpus]
question_embedding = create_document_embedding(processed_question, glove_embeddings, embedding_dim)

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_corpus_embeddings = [normalize_vector(embedding) for embedding in corpus_embeddings]
normalized_question_embedding = normalize_vector(question_embedding)

# V. Quantum Encoding (Amplitude Encoding)
num_qubits = int(np.ceil(np.log2(embedding_dim)))
padded_vocab_size = 2**num_qubits

def pad_vector(vector, target_length):
    current_length = len(vector)
    if current_length < target_length:
        padding_length = target_length - current_length
        padding = np.zeros(padding_length)
        padded_vector = np.concatenate((vector, padding))
        return padded_vector
    return vector

padded_corpus_embeddings = [pad_vector(embedding, padded_vocab_size) for embedding in normalized_corpus_embeddings]
padded_question_embedding = pad_vector(normalized_question_embedding, padded_vocab_size)

# **IMPORTANT: Explicitly cast embeddings to real before quantum encoding**
padded_corpus_embeddings = [embedding.astype(np.float64) for embedding in padded_corpus_embeddings]
padded_question_embedding = padded_question_embedding.astype(np.float64)



dev = qml.device("default.qubit", wires=num_qubits)

@qml.qnode(dev)
def amplitude_encode(embedding):
    qml.AmplitudeEmbedding(features=embedding, wires=range(num_qubits), pad_with=0, normalize=False)
    return qml.state()

quantum_corpus_states = [amplitude_encode(embedding) for embedding in padded_corpus_embeddings]
quantum_question_state = amplitude_encode(padded_question_embedding)

# VI. Quantum Retrieval and Similarity (Cosine Similarity of Quantum States)
def quantum_cosine_similarity(state1, state2):
    # **IMPORTANT: Explicitly clip imaginary components to zero before cosine similarity**
    state1 = np.real(state1)
    state2 = np.real(state2)
    similarity = cosine_similarity(state1.reshape(1,-1), state2.reshape(1,-1))[0][0]
    return similarity

similarities = [quantum_cosine_similarity(quantum_question_state, state) for state in quantum_corpus_states]
most_similar_index = np.argmax(similarities)

retrieved_document = corpus[most_similar_index]

# VII. Answer Extraction (Basic)
def extract_answer(document, question):
    sentences = nltk.sent_tokenize(document)
    question_embedding = create_document_embedding(preprocess(question), glove_embeddings, embedding_dim)
    sentence_similarities = []
    for sentence in sentences:
        sentence_embedding = create_document_embedding(preprocess(sentence), glove_embeddings, embedding_dim)
        similarity = cosine_similarity(question_embedding.reshape(1, -1), sentence_embedding.reshape(1, -1))[0][0]
        sentence_similarities.append(similarity)
    best_sentence_index = np.argmax(sentence_similarities)
    return sentences[best_sentence_index]

answer = extract_answer(retrieved_document, question)

print(f"Query: {question}")
print(f"Answer: {answer}")

Loaded 400000 GloVe embeddings with dimension 50
Query: What is NLP?
Answer: Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.


What is Quantum in the Code:

    Amplitude Encoding: The AmplitudeEmbedding from PennyLane is used to encode classical data (the GloVe embeddings) into the amplitudes of a quantum state. This is a quantum representation of the data.

    Quantum State Representation: The corpus and query are represented as quantum states.

What is Classical in the Code:

    GloVe Embeddings: The GloVe embeddings themselves are generated using a classical machine learning model. The semantics of the text are captured classically.

    Preprocessing: Tokenization, stop word removal, and other preprocessing steps are performed classically.

    Cosine Similarity: The cosine_similarity calculation from sklearn is a classical computation. Even though we are feeding it the quantum states, the actual similarity calculation is done classically.

    Answer Extraction: The entire answer extraction process is classical.

    Control Flow: The overall logic of the RAG system (retrieval, answer extraction) is implemented using classical Python code.

What Would Make it Closer to a "Complete" Quantum RAG:

A truly "complete" quantum RAG system would aim to leverage quantum algorithms and quantum data representations throughout the entire process, not just in the encoding step. Here are some key areas where quantum algorithms could be applied:

    Quantum Embeddings: Instead of using classical GloVe embeddings, explore methods for generating quantum embeddings directly. This could involve training a variational quantum circuit (VQC) to learn a quantum feature map that maps text to a quantum state space. This is a research area and can be computationally expensive.

    Quantum Similarity Search: The most promising area for potential quantum advantage is the similarity search. Replace the classical cosine_similarity calculation with a quantum algorithm for similarity search, such as:

        Quantum Earth Mover's Distance (QEMD): A quantum version of Earth Mover's Distance, potentially offering a speedup for comparing probability distributions.

        Quantum Nearest Neighbor Search: Quantum algorithms for nearest neighbor search, such as those based on Grover's algorithm, could be used to efficiently find the most similar documents in the quantum database. However, QRAM (Quantum Random Access Memory) is often assumed for these which is a challenge.

        Quantum Inner Product Estimation: There are quantum algorithms that can estimate the inner product (related to cosine similarity) of two quantum states more efficiently than classical algorithms in some scenarios.

    Quantum Natural Language Processing (QNLP) for Understanding: Use quantum algorithms to perform tasks such as:

        Quantum Parsing: Develop quantum algorithms for parsing natural language sentences.

        Quantum Semantic Analysis: Use quantum techniques to analyze the meaning of text.

    Quantum Answer Generation: Explore methods for generating answers using quantum algorithms. This could involve:

        Quantum Language Models: Train quantum language models to generate text. This is highly experimental.

        Quantum Information Retrieval: Use quantum algorithms to retrieve relevant information from a knowledge base and combine it to generate an answer.

Challenges and Considerations:

    Qubit Requirements: Many quantum algorithms require a large number of qubits, which are not yet available on current quantum hardware.

    Quantum Random Access Memory (QRAM): Some quantum algorithms, such as those based on Grover's algorithm, require QRAM, which is a theoretical technology that is not yet practical.

    State Preparation Complexity: Preparing the quantum states required for many quantum algorithms can be computationally expensive.

    Quantum Noise: Quantum computers are susceptible to noise, which can degrade the accuracy of quantum computations. Error correction is essential but adds overhead.

    Lack of Mature QNLP Algorithms: The field of QNLP is still relatively new, and there are not yet many mature quantum algorithms available for natural language processing tasks.

In summary:

Your current code is a great starting point for exploring QNLP-RAG, but it's more accurately described as a quantum-enhanced or quantum-inspired RAG system rather than a "complete" quantum RAG system. To move closer to a complete quantum RAG, you would need to replace more of the classical components with quantum algorithms. This is a challenging but potentially rewarding area of research. The main bottleneck, realistically, is a lack of good quantum algorithms for many of the core tasks, and the hardware to run them effectively.

In [13]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj

# I. Setup and Downloads (GloVe and NLTK)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')

glove_file = "glove.6B.50d.txt"
url = "http://nlp.stanford.edu/data/glove.6B.zip"

if not os.path.exists(glove_file):
    print("Downloading GloVe embeddings...")
    response = requests.get(url, stream=True)
    with open("glove.zip", "wb") as out_file:
        copyfileobj(response.raw, out_file)

    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile("glove.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("glove.zip")

# II. Load GloVe Embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings(glove_file)
embedding_dim = len(next(iter(glove_embeddings.values())))
print(f"Loaded {len(glove_embeddings)} GloVe embeddings with dimension {embedding_dim}")

# III. Larger Text Corpus and Preprocessing
corpus = [
    "The quick brown fox jumps over the lazy dog. This is a classic English pangram.",
    "The cat sat on the mat, peacefully napping in the sun.",
    "A dog is a loyal companion, offering unconditional love and support. Dogs are good pets.",
    "The fox is a cunning animal, known for its cleverness and adaptability. Foxes are often found in forests.",
    "Quantum computing holds the promise of revolutionizing various fields, including medicine and materials science.",
    "Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "The Earth is the third planet from the Sun and the only known planet to harbor life.",
    "Artificial intelligence (AI) is rapidly transforming industries and reshaping the way we live and work.",
    "Climate change is a pressing global issue, requiring urgent action to mitigate its effects.",
    "Renewable energy sources, such as solar and wind power, are becoming increasingly important in the transition to a sustainable future."
]
question = "What is NLP?"

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# IV. Document and Query Embeddings (Using GloVe)
def create_document_embedding(tokens, embeddings, embedding_dim):
    word_vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not word_vectors:
        return np.zeros(embedding_dim)
    document_embedding = np.mean(word_vectors, axis=0)
    return document_embedding

corpus_embeddings = [create_document_embedding(doc, glove_embeddings, embedding_dim) for doc in processed_corpus]
question_embedding = create_document_embedding(processed_question, glove_embeddings, embedding_dim)

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_corpus_embeddings = [normalize_vector(embedding) for embedding in corpus_embeddings]
normalized_question_embedding = normalize_vector(question_embedding)

# V. Quantum Encoding (Amplitude Encoding)
num_qubits = int(np.ceil(np.log2(embedding_dim)))
num_document_qubits = int(np.ceil(np.log2(len(corpus))))
padded_vocab_size = 2**num_qubits

def pad_vector(vector, target_length):
    current_length = len(vector)
    if current_length < target_length:
        padding_length = target_length - current_length
        padding = np.zeros(padding_length)
        padded_vector = np.concatenate((vector, padding))
        return padded_vector
    return vector

padded_corpus_embeddings = [pad_vector(embedding, padded_vocab_size) for embedding in normalized_corpus_embeddings]
padded_question_embedding = pad_vector(normalized_question_embedding, padded_vocab_size)

# Explicitly cast embeddings to real before quantum encoding
padded_corpus_embeddings = [embedding.astype(np.float64) for embedding in padded_corpus_embeddings]
padded_question_embedding = padded_question_embedding.astype(np.float64)

dev = qml.device("default.qubit", wires=num_qubits + num_document_qubits) # Additional qubits for document indexing

@qml.qnode(dev)
def amplitude_encode(embedding, index_qubits):  # Added index_qubits argument
    """Encodes embedding and index into the quantum state."""
    qml.AmplitudeEmbedding(features=embedding, wires=range(num_qubits), pad_with=0, normalize=False)
    qml.BasisState(index_qubits, wires=range(num_qubits, num_qubits + num_document_qubits))
    return qml.state()

quantum_corpus_states = [amplitude_encode(embedding, i) for i, embedding in enumerate(padded_corpus_embeddings)]
quantum_question_state = amplitude_encode(padded_question_embedding, 0) # Index doesn't matter for question, as long as same device is used

# VI. Quantum Retrieval (Grover's Algorithm)

# Calculate cosine similarities classically for oracle (simulated QRAM)
# Explicitly clip imaginary components to zero before cosine similarity
similarities = [cosine_similarity(np.real(quantum_question_state[:padded_vocab_size]).reshape(1,-1), np.real(quantum_corpus_states[i][:padded_vocab_size]).reshape(1,-1))[0][0] for i in range(len(corpus))] # Clip states for similarity

#Set the threshold
threshold = np.mean(similarities)
#Get Index of Document Qubits.
document_qubit_index = num_qubits + num_document_qubits - 1

def oracle(wires):
    """Oracle marks states with similarity above the threshold."""
    for i in range(len(corpus)):
        if similarities[i] >= threshold: #threshold to check is cosine similarity is good enough
            qml.FlipSign(wires=(document_qubit_index,), n=1)

#Grover operator to amplify good states

def grover_diffusion_op(wires):
    """Grover diffusion operator."""
    num_document_qubits = int(np.ceil(np.log2(len(corpus))))
    if num_document_qubits < 2: #Grover diffusion operator needs at least two qubits
        print("Need at least two document qubits for Grover diffusion operator.")
        return

    #Apply Hadamard to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

    #Apply PauliX to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.PauliX(wires=wire)

    # Apply CZ to the first two document qubits (Control and Target)
    qml.CZ(wires=[num_qubits, num_qubits + 1])

    #Apply PauliX to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.PauliX(wires=wire)

    #Apply Hadamard to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

@qml.qnode(dev)
def grover_search():
    """Grover's algorithm implementation."""
    # Superposition over document indices
    num_document_qubits = int(np.ceil(np.log2(len(corpus))))
    if num_document_qubits < 2:
      print("Need at least two document qubits for Grover search.")
      return 0*np.ones(2**num_document_qubits)

    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

    # Number of Grover iterations
    N = len(corpus)
    num_iterations = int(np.floor(np.pi/4*np.sqrt(N))) # Optimal iterations

    #Grover iterations
    for _ in range(num_iterations):
        oracle(range(num_qubits + num_document_qubits)) #Apply Oracle to mark good states
        grover_diffusion_op(range(num_qubits + num_document_qubits))

    return qml.probs(wires=range(num_qubits, num_qubits + num_document_qubits)) #measure the probability of each document index

# Perform Grover's search
probabilities = grover_search()
if isinstance(probabilities, int): # Check if grover_search returned 0 due to insufficient document qubits.
    most_likely_index = 0
else:
    most_likely_index = np.argmax(probabilities)
print("Most likely index:", most_likely_index)

retrieved_document = corpus[most_likely_index]

# VII. Answer Extraction (Basic)
def extract_answer(document, question):
    sentences = nltk.sent_tokenize(document)
    question_embedding = create_document_embedding(preprocess(question), glove_embeddings, embedding_dim)
    sentence_similarities = []
    for sentence in sentences:
        sentence_embedding = create_document_embedding(preprocess(sentence), glove_embeddings, embedding_dim)
        similarity = cosine_similarity(question_embedding.reshape(1, -1), sentence_embedding.reshape(1, -1))[0][0]
        sentence_similarities.append(similarity)
    best_sentence_index = np.argmax(sentence_similarities)
    return sentences[best_sentence_index]

answer = extract_answer(retrieved_document, question)

print(f"Query: {question}")
print(f"Answer: {answer}")

Loaded 400000 GloVe embeddings with dimension 50
Most likely index: 0
Query: What is NLP?
Answer: This is a classic English pangram.


Key Changes and Explanations:

    Qubit Allocation: The PennyLane device now needs enough qubits for both the embedding and for representing the document indices in superposition. We add int(np.ceil(np.log2(len(corpus)))) additional qubits.

    Encoding Index: The amplitude_encode function now also encodes the index of the document into the basis state of the additional qubits. This is necessary for Grover's algorithm to operate on the documents in superposition.

    Classical Similarity Calculation (for Oracle): I've kept the classical cosine similarity calculation temporarily for the oracle. This is where the QRAM assumption comes into play. In a real quantum system, you'd want a quantum way to determine if a document is "relevant" to the query. I'm using the classical cosine similarity and a threshold to simulate the QRAM. The oracle checks if similarities[i] >= threshold: to determine which states to mark.

    Oracle Implementation: The oracle function is the heart of Grover's algorithm. It flips the sign of the amplitude of the states that satisfy the search condition (i.e., the documents that are similar to the query based on the threshold).

    Grover Diffusion Operator: Implemented to amplify the good states

    grover_search Function: Implements the main Grover's search algorithm:

        Creates a superposition over all document indices using Hadamard gates.

        Applies the oracle to mark the "good" states.

        Applies the Grover diffusion operator to amplify the amplitudes of the "good" states.

        Repeats the oracle and diffusion operator a calculated number of times for optimal amplification.

        Measures the document index qubits to determine the most likely index.

    Measurement: qml.probs is used to measure the probability of each document index after Grover's algorithm has been applied. The index with the highest probability is the most likely answer.

How this code gets closer to a complete quantum implementation:

    Grover's Algorithm: We've replaced the classical linear search with Grover's algorithm, providing a potential quadratic speedup.

    Quantum State Manipulation: Grover's algorithm operates directly on the quantum states of the documents, leveraging quantum superposition and interference.

Still Missing for a "Complete" Quantum RAG:

    Quantum Oracle: The biggest missing piece is a quantum oracle. Using a classical similarity measure to define the oracle means we're still relying on classical computation for a critical step. Ideally, the oracle would be based on a purely quantum similarity measure or a learned quantum feature map.

    True QRAM: We're simulating QRAM. True QRAM is a significant hardware challenge.

    Quantum Embeddings: Starting with quantum embeddings from the beginning would truly make it a quantum system.

This Grover's search based example is a step closer to a "complete" quantum RAG system, but it still has limitations. It highlights the challenges and opportunities in this exciting field. Remember that this is a research area, and practical quantum RAG systems are still years away.


In [16]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj

# I. Setup and Downloads (GloVe and NLTK)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')

glove_file = "glove.6B.50d.txt"
url = "http://nlp.stanford.edu/data/glove.6B.zip"

if not os.path.exists(glove_file):
    print("Downloading GloVe embeddings...")
    response = requests.get(url, stream=True)
    with open("glove.zip", "wb") as out_file:
        copyfileobj(response.raw, out_file)

    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile("glove.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("glove.zip")

# II. Load GloVe Embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings(glove_file)
embedding_dim = len(next(iter(glove_embeddings.values())))
print(f"Loaded {len(glove_embeddings)} GloVe embeddings with dimension {embedding_dim}")

# III. Larger Text Corpus and Preprocessing
corpus = [
    "The quick brown fox jumps over the lazy dog. This is a classic English pangram.",
    "The cat sat on the mat, peacefully napping in the sun.",
    "A dog is a loyal companion, offering unconditional love and support. Dogs are good pets.",
    "The fox is a cunning animal, known for its cleverness and adaptability. Foxes are often found in forests.",
    "Quantum computing holds the promise of revolutionizing various fields, including medicine and materials science.",
    "Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "The Earth is the third planet from the Sun and the only known planet to harbor life.",
    "Artificial intelligence (AI) is rapidly transforming industries and reshaping the way we live and work.",
    "Climate change is a pressing global issue, requiring urgent action to mitigate its effects.",
    "Renewable energy sources, such as solar and wind power, are becoming increasingly important in the transition to a sustainable future."
]

# Get user query from the command line
question = input("Enter your query: ")

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# IV. Document and Query Embeddings (Using GloVe)
def create_document_embedding(tokens, embeddings, embedding_dim):
    word_vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not word_vectors:
        return np.zeros(embedding_dim)
    document_embedding = np.mean(word_vectors, axis=0)
    return document_embedding

corpus_embeddings = [create_document_embedding(doc, glove_embeddings, embedding_dim) for doc in processed_corpus]
question_embedding = create_document_embedding(processed_question, glove_embeddings, embedding_dim)

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_corpus_embeddings = [normalize_vector(embedding) for embedding in corpus_embeddings]
normalized_question_embedding = normalize_vector(question_embedding)

# V. Quantum Encoding (Amplitude Encoding)
num_qubits = int(np.ceil(np.log2(embedding_dim)))
num_document_qubits = int(np.ceil(np.log2(len(corpus))))
padded_vocab_size = 2**num_qubits

def pad_vector(vector, target_length):
    current_length = len(vector)
    if current_length < target_length:
        padding_length = target_length - current_length
        padding = np.zeros(padding_length)
        padded_vector = np.concatenate((vector, padding))
        return padded_vector
    return vector

padded_corpus_embeddings = [pad_vector(embedding, padded_vocab_size) for embedding in normalized_corpus_embeddings]
padded_question_embedding = pad_vector(normalized_question_embedding, padded_vocab_size)

# Explicitly cast embeddings to real before quantum encoding
padded_corpus_embeddings = [embedding.astype(np.float64) for embedding in padded_corpus_embeddings]
padded_question_embedding = padded_question_embedding.astype(np.float64)

dev = qml.device("default.qubit", wires=num_qubits + num_document_qubits) # Additional qubits for document indexing

@qml.qnode(dev)
def amplitude_encode(embedding, index_qubits):  # Added index_qubits argument
    """Encodes embedding and index into the quantum state."""
    qml.AmplitudeEmbedding(features=embedding, wires=range(num_qubits), pad_with=0, normalize=False)
    qml.BasisState(index_qubits, wires=range(num_qubits, num_qubits + num_document_qubits))
    return qml.state()

quantum_corpus_states = [amplitude_encode(embedding, i) for i, embedding in enumerate(padded_corpus_embeddings)]
quantum_question_state = amplitude_encode(padded_question_embedding, 0) # Index doesn't matter for question, as long as same device is used

# VI. Quantum Retrieval (Grover's Algorithm)

# Calculate cosine similarities classically for oracle (simulated QRAM)
# Explicitly clip imaginary components to zero before cosine similarity
similarities = [cosine_similarity(np.real(quantum_question_state[:padded_vocab_size]).reshape(1,-1), np.real(quantum_corpus_states[i][:padded_vocab_size]).reshape(1,-1))[0][0] for i in range(len(corpus))] # Clip states for similarity

#Set the threshold
threshold = np.mean(similarities)
#Get Index of Document Qubits.
document_qubit_index = num_qubits + num_document_qubits - 1

def oracle(wires):
    """Oracle marks states with similarity above the threshold."""
    for i in range(len(corpus)):
        if similarities[i] >= threshold: #threshold to check is cosine similarity is good enough
            qml.FlipSign(wires=(document_qubit_index,), n=1)

#Grover operator to amplify good states

def grover_diffusion_op(wires):
    """Grover diffusion operator."""
    num_document_qubits = int(np.ceil(np.log2(len(corpus))))
    if num_document_qubits < 2: #Grover diffusion operator needs at least two qubits
        print("Need at least two document qubits for Grover diffusion operator.")
        return

    #Apply Hadamard to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

    #Apply PauliX to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.PauliX(wires=wire)

    # Apply CZ to the first two document qubits (Control and Target)
    qml.CZ(wires=[num_qubits, num_qubits + 1])

    #Apply PauliX to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.PauliX(wires=wire)

    #Apply Hadamard to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

@qml.qnode(dev)
def grover_search():
    """Grover's algorithm implementation."""
    # Superposition over document indices
    num_document_qubits = int(np.ceil(np.log2(len(corpus))))
    if num_document_qubits < 2:
      print("Need at least two document qubits for Grover search.")
      return 0*np.ones(2**num_document_qubits)

    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

    # Number of Grover iterations
    N = len(corpus)
    num_iterations = int(np.floor(np.pi/4*np.sqrt(N))) # Optimal iterations

    #Grover iterations
    for _ in range(num_iterations):
        oracle(range(num_qubits + num_document_qubits)) #Apply Oracle to mark good states
        grover_diffusion_op(range(num_qubits + num_document_qubits))

    return qml.probs(wires=range(num_qubits, num_qubits + num_document_qubits)) #measure the probability of each document index

# Perform Grover's search
probabilities = grover_search()
if isinstance(probabilities, int): # Check if grover_search returned 0 due to insufficient document qubits.
    most_likely_index = 0
else:
    most_likely_index = np.argmax(probabilities)
print("Most likely index:", most_likely_index)

retrieved_document = corpus[most_likely_index]

# VII. Answer Extraction (Basic)
def extract_answer(document, question):
    sentences = nltk.sent_tokenize(document)
    question_embedding = create_document_embedding(preprocess(question), glove_embeddings, embedding_dim)
    sentence_similarities = []
    for sentence in sentences:
        sentence_embedding = create_document_embedding(preprocess(sentence), glove_embeddings, embedding_dim)
        similarity = cosine_similarity(question_embedding.reshape(1, -1), sentence_embedding.reshape(1, -1))[0][0]
        sentence_similarities.append(similarity)
    best_sentence_index = np.argmax(sentence_similarities)
    return sentences[best_sentence_index]

answer = extract_answer(retrieved_document, question)

print(f"Query: {question}")
print(f"Answer: {answer}")

Loaded 400000 GloVe embeddings with dimension 50
Enter your query: DOG
Most likely index: 0
Query: DOG
Answer: The quick brown fox jumps over the lazy dog.


In [17]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj
from sentence_transformers import SentenceTransformer

# I. Setup and Downloads (NLTK and Sentence Transformers)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')


# II. Load Sentence Transformer Model
model = SentenceTransformer('all-mpnet-base-v2')
embedding_dim = model.get_sentence_embedding_dimension() # Get Embedding Dimension
print(f"Loaded Sentence Transformer model with dimension {embedding_dim}")

# III. Larger Text Corpus and Preprocessing
corpus = [
    "The quick brown fox jumps over the lazy dog. This is a classic English pangram.",
    "The cat sat on the mat, peacefully napping in the sun.",
    "A dog is a loyal companion, offering unconditional love and support. Dogs are good pets.",
    "The fox is a cunning animal, known for its cleverness and adaptability. Foxes are often found in forests.",
    "Quantum computing holds the promise of revolutionizing various fields, including medicine and materials science.",
    "Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "The Earth is the third planet from the Sun and the only known planet to harbor life.",
    "Artificial intelligence (AI) is rapidly transforming industries and reshaping the way we live and work.",
    "Climate change is a pressing global issue, requiring urgent action to mitigate its effects.",
    "Renewable energy sources, such as solar and wind power, are becoming increasingly important in the transition to a sustainable future."
]

# Get user query from the command line
question = input("Enter your query: ")

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# IV. Document and Query Embeddings (Using Sentence Transformers)
corpus_embeddings = model.encode(corpus)  # Encode entire sentences
question_embedding = model.encode(question)

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_corpus_embeddings = [normalize_vector(embedding) for embedding in corpus_embeddings]
normalized_question_embedding = normalize_vector(question_embedding)

# V. Quantum Encoding (Amplitude Encoding)
num_qubits = int(np.ceil(np.log2(embedding_dim)))
num_document_qubits = int(np.ceil(np.log2(len(corpus))))
padded_vocab_size = 2**num_qubits

def pad_vector(vector, target_length):
    current_length = len(vector)
    if current_length < target_length:
        padding_length = target_length - current_length
        padding = np.zeros(padding_length)
        padded_vector = np.concatenate((vector, padding))
        return padded_vector
    return vector

padded_corpus_embeddings = [pad_vector(embedding, padded_vocab_size) for embedding in normalized_corpus_embeddings]
padded_question_embedding = pad_vector(normalized_question_embedding, padded_vocab_size)

# Explicitly cast embeddings to real before quantum encoding
padded_corpus_embeddings = [embedding.astype(np.float64) for embedding in padded_corpus_embeddings]
padded_question_embedding = padded_question_embedding.astype(np.float64)

dev = qml.device("default.qubit", wires=num_qubits + num_document_qubits) # Additional qubits for document indexing

@qml.qnode(dev)
def amplitude_encode(embedding, index_qubits):  # Added index_qubits argument
    """Encodes embedding and index into the quantum state."""
    qml.AmplitudeEmbedding(features=embedding, wires=range(num_qubits), pad_with=0, normalize=False)
    qml.BasisState(index_qubits, wires=range(num_qubits, num_qubits + num_document_qubits))
    return qml.state()

quantum_corpus_states = [amplitude_encode(embedding, i) for i, embedding in enumerate(padded_corpus_embeddings)]
quantum_question_state = amplitude_encode(padded_question_embedding, 0) # Index doesn't matter for question, as long as same device is used

# VI. Quantum Retrieval (Grover's Algorithm)

# Calculate cosine similarities classically for oracle (simulated QRAM)
# Explicitly clip imaginary components to zero before cosine similarity
similarities = [cosine_similarity(np.real(quantum_question_state[:padded_vocab_size]).reshape(1,-1), np.real(quantum_corpus_states[i][:padded_vocab_size]).reshape(1,-1))[0][0] for i in range(len(corpus))] # Clip states for similarity

#Set the threshold
threshold = np.mean(similarities)
#Get Index of Document Qubits.
document_qubit_index = num_qubits + num_document_qubits - 1

def oracle(wires):
    """Oracle marks states with similarity above the threshold."""
    for i in range(len(corpus)):
        if similarities[i] >= threshold: #threshold to check is cosine similarity is good enough
            qml.FlipSign(wires=(document_qubit_index,), n=1)

#Grover operator to amplify good states

def grover_diffusion_op(wires):
    """Grover diffusion operator."""
    num_document_qubits = int(np.ceil(np.log2(len(corpus))))
    if num_document_qubits < 2: #Grover diffusion operator needs at least two qubits
        print("Need at least two document qubits for Grover diffusion operator.")
        return

    #Apply Hadamard to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

    #Apply PauliX to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.PauliX(wires=wire)

    # Apply CZ to the first two document qubits (Control and Target)
    qml.CZ(wires=[num_qubits, num_qubits + 1])

    #Apply PauliX to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.PauliX(wires=wire)

    #Apply Hadamard to all document index qubits
    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

@qml.qnode(dev)
def grover_search():
    """Grover's algorithm implementation."""
    # Superposition over document indices
    num_document_qubits = int(np.ceil(np.log2(len(corpus))))
    if num_document_qubits < 2:
      print("Need at least two document qubits for Grover search.")
      return 0*np.ones(2**num_document_qubits)

    for wire in range(num_qubits, num_qubits + num_document_qubits):
        qml.Hadamard(wires=wire)

    # Number of Grover iterations
    N = len(corpus)
    num_iterations = int(np.floor(np.pi/4*np.sqrt(N))) # Optimal iterations

    #Grover iterations
    for _ in range(num_iterations):
        oracle(range(num_qubits + num_document_qubits)) #Apply Oracle to mark good states
        grover_diffusion_op(range(num_qubits + num_document_qubits))

    return qml.probs(wires=range(num_qubits, num_qubits + num_document_qubits)) #measure the probability of each document index

# Perform Grover's search
probabilities = grover_search()
if isinstance(probabilities, int): # Check if grover_search returned 0 due to insufficient document qubits.
    most_likely_index = 0
else:
    most_likely_index = np.argmax(probabilities)
print("Most likely index:", most_likely_index)

retrieved_document = corpus[most_likely_index]

# VII. Answer Extraction (Basic)
def extract_answer(document, question):
    sentences = nltk.sent_tokenize(document)
    #Remove this since Sentence Transformer encodes whole sentence
    #question_embedding = create_document_embedding(preprocess(question), glove_embeddings, embedding_dim)
    sentence_similarities = []
    for sentence in sentences:
        #Remove this since Sentence Transformer encodes whole sentence
        #sentence_embedding = create_document_embedding(preprocess(sentence), glove_embeddings, embedding_dim)
        similarity = cosine_similarity(model.encode(question).reshape(1, -1), model.encode(sentence).reshape(1, -1))[0][0]
        sentence_similarities.append(similarity)
    best_sentence_index = np.argmax(sentence_similarities)
    return sentences[best_sentence_index]

answer = extract_answer(retrieved_document, question)

print(f"Query: {question}")
print(f"Answer: {answer}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Loaded Sentence Transformer model with dimension 768
Enter your query: quantum
Most likely index: 0
Query: quantum
Answer: The quick brown fox jumps over the lazy dog.


In [18]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj
from sentence_transformers import SentenceTransformer

# I. Setup and Downloads (NLTK and Sentence Transformers)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')


# II. Load Sentence Transformer Model
model = SentenceTransformer('all-mpnet-base-v2')
embedding_dim = model.get_sentence_embedding_dimension() # Get Embedding Dimension
print(f"Loaded Sentence Transformer model with dimension {embedding_dim}")

# III. Larger Text Corpus and Preprocessing
corpus = [
    "The quick brown fox jumps over the lazy dog. This is a classic English pangram.",
    "The cat sat on the mat, peacefully napping in the sun.",
    "A dog is a loyal companion, offering unconditional love and support. Dogs are good pets.",
    "The fox is a cunning animal, known for its cleverness and adaptability. Foxes are often found in forests.",
    "Quantum computing holds the promise of revolutionizing various fields, including medicine and materials science.",
    "Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "The Earth is the third planet from the Sun and the only known planet to harbor life.",
    "Artificial intelligence (AI) is rapidly transforming industries and reshaping the way we live and work.",
    "Climate change is a pressing global issue, requiring urgent action to mitigate its effects.",
    "Renewable energy sources, such as solar and wind power, are becoming increasingly important in the transition to a sustainable future."
]

# Get user query from the command line
question = input("Enter your query: ")

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# IV. Document and Query Embeddings (Using Sentence Transformers)
corpus_embeddings = model.encode(corpus)  # Encode entire sentences
question_embedding = model.encode(question)

def normalize_vector(vector):
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_corpus_embeddings = [normalize_vector(embedding) for embedding in corpus_embeddings]
normalized_question_embedding = normalize_vector(question_embedding)

# V. Quantum Circuit Preparation, drop amplitude encoding
num_documents = len(corpus)
num_document_qubits = int(np.ceil(np.log2(num_documents)))
dev = qml.device("default.qubit", wires=num_document_qubits)


# VI. Quantum Retrieval (Grover's Algorithm)

# Calculate cosine similarities classically for oracle (simulated QRAM)
similarities = [cosine_similarity(question_embedding.reshape(1,-1), corpus_embeddings[i].reshape(1,-1))[0][0] for i in range(num_documents)]

#Set the threshold
threshold = np.mean(similarities)

def oracle(wires):
    """Marks states with similarity above the threshold."""
    for i in range(num_documents):
        if similarities[i] >= threshold:
            qml.FlipSign(wires=(wires[0],), n=1)

#Grover operator to amplify good states

def grover_diffusion_op(wires):
    """Grover diffusion operator."""
    for wire in wires:
        qml.Hadamard(wires=wire)
    for wire in wires:
        qml.PauliX(wires=wire)
    if len(wires) > 1:  # Apply CZ only if there are at least two qubits
        qml.CZ(wires=[wires[0], wires[1]])
    for wire in wires:
        qml.PauliX(wires=wire)
    for wire in wires:
        qml.Hadamard(wires=wire)


@qml.qnode(dev)
def grover_search():
    """Grover's algorithm implementation."""
    wires = range(num_document_qubits)
    for wire in wires:
        qml.Hadamard(wires=wire)

    # Number of Grover iterations
    N = num_documents
    num_iterations = int(np.floor(np.pi / 4 * np.sqrt(N)))

    for _ in range(num_iterations):
        oracle(wires)
        grover_diffusion_op(wires)

    return qml.probs(wires=wires)

# Perform Grover's search
probabilities = grover_search()
most_likely_index = np.argmax(probabilities)
print("Most likely index:", most_likely_index)

retrieved_document = corpus[most_likely_index]

# VII. Answer Extraction (Basic)
def extract_answer(document, question):
    sentences = nltk.sent_tokenize(document)
    #Remove this since Sentence Transformer encodes whole sentence
    #question_embedding = create_document_embedding(preprocess(question), glove_embeddings, embedding_dim)
    sentence_similarities = []
    for sentence in sentences:
        #Remove this since Sentence Transformer encodes whole sentence
        #sentence_embedding = create_document_embedding(preprocess(sentence), glove_embeddings, embedding_dim)
        similarity = cosine_similarity(model.encode(question).reshape(1, -1), model.encode(sentence).reshape(1,-1))[0][0]
        sentence_similarities.append(similarity)
    best_sentence_index = np.argmax(sentence_similarities)
    return sentences[best_sentence_index]

answer = extract_answer(retrieved_document, question)

print(f"Query: {question}")
print(f"Answer: {answer}")

Loaded Sentence Transformer model with dimension 768
Enter your query: quantum
Most likely index: 0
Query: quantum
Answer: The quick brown fox jumps over the lazy dog.


Quantum Data Encoding:

    Goal: Represent text data in a quantum format that captures semantic relationships and is amenable to quantum computation.

    Approach: Variational Quantum Encoding (VQE).

        Train a VQC to map classical text into a quantum Hilbert space.

            Instead of starting from classical word embeddings, train from raw text using a quantum-classical hybrid approach.

            Use a quantum loss function that promotes semantic similarity between documents with similar meanings. This could involve a quantum version of contrastive loss.

Quantum Indexing:

    Goal: Create a quantum data structure that allows for efficient retrieval of relevant documents.

    Approach: Quantum Associative Memory (QuAM).

          
    *  Use a variation of QuAM that stores associations between query features and document features.

        

    IGNORE_WHEN_COPYING_START

        Use code with caution.
        IGNORE_WHEN_COPYING_END

    Quantum Retrieval:

        Goal: Use quantum algorithms to search the quantum index and retrieve relevant documents.

        Approach: Quantum Amplitude Estimation (QAE) combined with Grover's Algorithm.

    Quantum Answer Extraction:

        Goal: Extract the most relevant information from the retrieved quantum documents to answer the query.

        Approach: Quantum Attention Mechanisms with VQC.

II. Implementation Details

Since implementing the entire pipeline on real quantum hardware is currently infeasible, we'll focus on the most crucial components and simulate others.

A. Variational Quantum Encoding (VQE)

    Quantum Circuit Design:

        Choose a suitable VQC architecture. Good choices include:

            Hardware-Efficient Ansatz: Circuits designed to be easily implemented on specific quantum hardware.

            Tree Tensor Network (TTN) Ansatz: Circuits inspired by tensor networks, which can efficiently represent complex quantum states.

    Data Input:

        Encode the raw text (or pre-processed tokens) into a feature vector that can be fed into the VQC.

            Use techniques such as:

                Character-level encoding.

                Bag-of-words representation (with a limited vocabulary size).

                TF-IDF vectors.

    Quantum Loss Function:

        Define a quantum loss function that guides the training of the VQC.

            Examples:

                Quantum Contrastive Loss: Similar documents should have quantum states with high fidelity (overlap), while dissimilar documents should have low fidelity.

                      
                L = (1 - y) * (1 - Fidelity(ψ1, ψ2)) + y * max(0, Fidelity(ψ1, ψ2) - margin)

                    

                IGNORE_WHEN_COPYING_START

                Use code with caution.
                IGNORE_WHEN_COPYING_END

                where:

                    y = 1 if the documents are similar, y = 0 if they are dissimilar.

                    Fidelity(ψ1, ψ2) is the fidelity between the quantum states of the two documents.

                    margin is a hyperparameter that controls the separation between similar and dissimilar documents.

                Quantum Cross-Entropy Loss: If you have labeled data (e.g., document categories), you can use a quantum version of cross-entropy loss to train the VQC to classify documents.

    Training Loop:

        Use a hybrid quantum-classical optimization algorithm to train the VQC.

            Examples:

                Variational Quantum Eigensolver (VQE) with a classical optimizer like Adam or L-BFGS-B.

                Parameter-shift rule for calculating gradients.

    Output:

        The trained VQC becomes the quantum feature map that encodes text into quantum states.

B. Quantum Indexing (QuAM)

    Feature Extraction:

        Use the trained VQC to extract quantum features from each document.

        The output quantum state of the VQC represents the features of the document.

    Association Storage:

        Store the associations between document indices and their quantum features in a QuAM.

        This can be a theoretical step for now, as practical QuAM is not yet available.

C. Quantum Retrieval (Grover's Algorithm + QAE)

    Query Encoding:

        Encode the query into a quantum state using the same VQC that was used to encode the documents.

    Similarity Estimation:

        Use Quantum Amplitude Estimation (QAE) to estimate the similarity between the query state and the document states stored in the QuAM.

        QAE provides a quadratic speedup over classical methods for estimating amplitudes.

    Grover's Algorithm:

        Use Grover's algorithm to search the QuAM for documents with high similarity to the query.

        The oracle for Grover's algorithm can be based on the amplitude estimates from QAE.

D. Quantum Answer Extraction

    Quantum Attention Mechanisms:

        Implement quantum attention mechanisms to focus on the most relevant parts of the retrieved documents.

            This involves using quantum circuits to calculate attention weights between the query and the document.

    VQC for Answer Prediction:

        Train another VQC to predict the answer to the query based on the attended document features.

            This VQC can be trained to output a probability distribution over possible answers.
      
    IV. Optimization Strategies

    Circuit Optimization:

        Use circuit optimization techniques to reduce the number of gates and the circuit depth.

            Examples:

                Gate cancellation.

                ZX calculus.

                Quantum circuit compilation.

    Noise Mitigation:

        Implement error mitigation techniques to reduce the effects of quantum noise.

            Examples:

                Zero-noise extrapolation.

                Probabilistic error cancellation.

                Readout error mitigation.

    Hardware-Aware Design:

        Design the quantum circuits to match the specific architecture and connectivity of the available quantum hardware.

            This can involve:

                Qubit mapping.

                Gate scheduling.

                Pulse-level control.

V. Remarks

    Scalability:

        Scalability is a major challenge for quantum RAG systems. The number of qubits required for amplitude encoding and quantum computations can grow exponentially with the size of the data.

    Practicality:

        Practical quantum RAG systems are still years away. Current quantum hardware is limited in terms of qubit count, coherence time, and gate fidelity.

    Hybrid Approach:

        In the near term, hybrid quantum-classical approaches are likely to be the most practical.

            Use classical machine learning to pre-process data and prepare inputs for quantum algorithms.

    Research Directions:

        The field of quantum natural language processing (QNLP) is rapidly evolving. Stay up-to-date with the latest research and developments.

In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj
from sentence_transformers import SentenceTransformer

# ---------------------- Section I: Setup ----------------------

# 1. Download necessary NLTK resources
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')


# ---------------------- Section II: Load Resources and Preprocessing ----------------------

# Load Text Corpus
corpus = [
    "The quick brown fox jumps over the lazy dog. This is a classic English pangram.",
    "The cat sat on the mat, peacefully napping in the sun.",
    "A dog is a loyal companion, offering unconditional love and support. Dogs are good pets.",
    "The fox is a cunning animal, known for its cleverness and adaptability. Foxes are often found in forests.",
    "Quantum computing holds the promise of revolutionizing various fields, including medicine and materials science.",
    "Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.",
    "The Earth is the third planet from the Sun and the only known planet to harbor life.",
    "Artificial intelligence (AI) is rapidly transforming industries and reshaping the way we live and work.",
    "Climate change is a pressing global issue, requiring urgent action to mitigate its effects.",
    "Renewable energy sources, such as solar and wind power, are becoming increasingly important in the transition to a sustainable future."
]

# Get user query
question = input("Enter your query: ")

# Preprocessing
def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# ---------------------- Section III: Quantum Data Encoding ----------------------
num_qubits = 6  # Example, adjust based on needs
dev = qml.device("default.qubit", wires=num_qubits)

def create_feature_vector(text, vocab_size): #Dummy, just for demonstration
    """Create simple one-hot feature vector."""
    feature_vector = [0] * vocab_size
    words = text.split()
    for word in words:
        feature_vector[hash(word) % vocab_size] = 1 #Arbitrary hashing
    return feature_vector

vocab_size = 100 #Set based on corpus analysis
encoded_corpus_feature_vectors = [create_feature_vector(" ".join(doc), vocab_size) for doc in processed_corpus] # Text to Numbers
encoded_query_feature_vector = create_feature_vector(" ".join(processed_question), vocab_size)


@qml.qnode(dev)
def quantum_encoding_circuit(features, params): # Variational Quantum Circuit to be Trained
    """Example Quantum Circuit to map Features to Quantum State."""
    # Insert Circuit
    return qml.state()

# --- Simulated Training Section (This requires significant development) ---
# Here should be training of quantum_encoding_circuit to generate appropriate quantum features based on the dataset
# The data are "encoded_corpus_feature_vectors" and "encoded_query_feature_vector"
# Training requires optimization of "params" and definition of appropriate "quantum_loss"

params = np.random.rand(10)  # Example initialization, NEEDS TO BE ADJUSTED


quantum_corpus_states = [quantum_encoding_circuit(feature, params) for feature in encoded_corpus_feature_vectors]  # Quantum States of Corpus
quantum_query_state = quantum_encoding_circuit(encoded_query_feature_vector, params) # Quantum State of Query



# ---------------------- Section IV: Quantum Retrieval  ----------------------
num_document_qubits = int(np.ceil(np.log2(len(corpus))))
dev_grover = qml.device("default.qubit", wires=num_document_qubits) # Quantum Device for Grover's alg

# Manually assign the number of wires
num_wires = dev_grover.num_wires  = num_document_qubits

# Simulated Quantum Similarity Calculation (Replace with QAE later)
def quantum_state_similarity(state1, state2): #Simulated Quantum Similarity
  similarity = cosine_similarity(np.real(state1).reshape(1,-1), np.real(state2).reshape(1,-1))[0][0]
  return similarity


@qml.qnode(dev_grover)
def grover_search(quantum_states, quantum_query):

  #Oracle Phase
  def oracle(wires):
        max_sim = max([quantum_state_similarity(quantum_query, quantum_states[i]) for i in range(len(quantum_states))])
        similarities = [quantum_state_similarity(quantum_query, quantum_states[i]) for i in range(len(quantum_states))]
        for i in range(len(quantum_states)):
            if similarities[i] >= max_sim:
                qml.FlipSign(wires=wires, n = 1)

  def grover_diffusion_op(wires):
    """Grover diffusion operator."""
    for wire in wires:
        qml.Hadamard(wires=wire)
    for wire in wires:
        qml.PauliX(wires=wire)
    if len(wires) > 1:  # Apply CZ only if there are at least two qubits
        qml.CZ(wires=[wires[0], wires[1]])
    for wire in wires:
        qml.PauliX(wires=wire)
    for wire in wires:
        qml.Hadamard(wires=wire)

  # Apply Hadamards
  wires = range(num_wires)
  for wire in wires:
    qml.Hadamard(wires=wire)

    # Number of Grover iterations
  N = len(quantum_states)
  num_iterations = int(np.floor(np.pi / 4 * np.sqrt(N)))
   #Run Grover's
  for _ in range(num_iterations):
      oracle(wires)
      grover_diffusion_op(wires)

  return qml.probs(wires=wires) # Output
print("grover probs")
probabilities = grover_search(quantum_corpus_states,quantum_query_state)  # Run Grover and get Probabilities
print(probabilities)


most_likely_index = np.argmax(probabilities)  # Index of most likely document.

# ---------------------- Section V: Quantum Answer Extraction (Simulated) ----------------------
# --- Replace with Learned Quantum Answer Extraction Section ---
def extract_answer(document, question):  # Basic answer extraction (as before)
    sentences = nltk.sent_tokenize(document)
    #Remove this since Sentence Transformer encodes whole sentence
    #question_embedding = create_document_embedding(preprocess(question), glove_embeddings, embedding_dim)
    sentence_similarities = []
    model = SentenceTransformer('all-mpnet-base-v2')
    for sentence in sentences:
        #Remove this since Sentence Transformer encodes whole sentence
        #sentence_embedding = create_document_embedding(preprocess(sentence), glove_embeddings, embedding_dim)
        similarity = cosine_similarity(model.encode(question).reshape(1, -1), model.encode(sentence).reshape(1,-1))[0][0]
        sentence_similarities.append(similarity)
    best_sentence_index = np.argmax(sentence_similarities)
    return sentences[best_sentence_index]


retrieved_document = corpus[most_likely_index]
answer = extract_answer(retrieved_document, question)

print(f"Query: {question}")
print(f"Answer: {answer}")

Enter your query: quantum
grover probs
[0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625
 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625]
Query: quantum
Answer: The quick brown fox jumps over the lazy dog.


In [24]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pennylane as qml
from sklearn.metrics.pairwise import cosine_similarity
import os
import requests, zipfile
from shutil import copyfileobj
import torch

# ---------------------- Section I: Setup ----------------------
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

try:
    word_tokenize("example text")
except LookupError:
    nltk.download('punkt')

# ---------------------- Section II: Load Resources and Preprocessing ----------------------
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The cat sat on the mat.",
    "A dog is a loyal companion.",
    "The fox is a cunning animal.",
    "Quantum computing is promising.",
    "NLP is a branch of AI.",
    "Earth is the third planet.",
    "AI is transforming industries.",
    "Climate change is a global issue.",
    "Renewable energy is important."
]
question = input("Enter your query: ")

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if not w in stop_words]
    return tokens

processed_corpus = [preprocess(doc) for doc in corpus]
processed_question = preprocess(question)

# ---------------------- Section III: Quantum Data Encoding & Training ----------------------
num_qubits = 2  # Limited Qubits for Simplicity
dev = qml.device("default.qubit", wires=num_qubits)

def create_feature_vector(text, vocab_size):
    """Improved: Create feature vector counting word occurrences."""
    feature_vector = np.zeros(vocab_size)
    tokens = preprocess(text)  # Use preprocess here
    for token in tokens:
        feature_vector[hash(token) % vocab_size] += 1
    return feature_vector

vocab_size = 20  # Increased vocab size to capture more variance

# Create feature vectors for corpus and query
encoded_corpus_feature_vectors = [create_feature_vector(doc, vocab_size) for doc in corpus]
encoded_query_feature_vector = create_feature_vector(question, vocab_size)

@qml.qnode(dev, interface="torch")
def quantum_encoding_circuit(features, params):
    """Variational Quantum Circuit to map Features to Quantum State."""
    qml.Hadamard(wires=0)
    qml.RY(params[0], wires=0)
    qml.CNOT(wires=[0, 1])  # Entanglement
    qml.RY(params[1], wires=1)
    # Implement feature loading
    for i in range(len(features)):
        qml.RY(features[i] * params[i+2], wires=0) # data-reuploading
    return qml.probs(wires=[0,1])

# --- Simulated Training ---
num_params = vocab_size + 2 # num weights + 2 variational parameters

# 2 is for the number of variational params,
params = torch.tensor(np.random.rand(num_params), requires_grad=True)

optimizer = torch.optim.Adam([params], lr=0.1) #Torch Optimizer
num_steps = 100 #Reduced training steps for simplicity.

def quantum_loss(params, feature_vector, target):
    """Quantum loss function (example: Mean Squared Error)."""
    probs = quantum_encoding_circuit(feature_vector, params)
    loss = torch.sum((probs - torch.tensor(target)) ** 2)  # MSE
    return loss

#Emulate the Training
# Training Data must be manually crafted
# Given the number of examples I will target specific circuit outputs to correspond to relevance.

emulated_training_data = [
    (encoded_corpus_feature_vectors[0], [0.2, 0.2, 0.3, 0.3]),  # Relevant
    (encoded_corpus_feature_vectors[1], [0.25, 0.25, 0.25, 0.25]), # Not relevant
    (encoded_corpus_feature_vectors[2], [0.3, 0.3, 0.2, 0.2]),    # Relevant
    (encoded_corpus_feature_vectors[3], [0.25, 0.25, 0.25, 0.25]), # Not Relevant
    (encoded_corpus_feature_vectors[4], [0.4, 0.1, 0.4, 0.1]),  # Relevant
    (encoded_corpus_feature_vectors[5], [0.25, 0.25, 0.25, 0.25]), # Not relevant
    (encoded_corpus_feature_vectors[6], [0.1, 0.4, 0.1, 0.4]), # Relevant
    (encoded_corpus_feature_vectors[7], [0.25, 0.25, 0.25, 0.25]), # Not relevant
    (encoded_corpus_feature_vectors[8], [0.2, 0.3, 0.2, 0.3]),    # Relevant
    (encoded_corpus_feature_vectors[9], [0.25, 0.25, 0.25, 0.25]), # Not Relevant

]

for step in range(num_steps):
    for feature_vector, target in emulated_training_data:
        optimizer.zero_grad() #Zero the gradient
        loss = quantum_loss(params, feature_vector, target) # Calculate Loss
        loss.backward() # Calculate Gradients
        optimizer.step() #Update Parameters

    if (step + 1) % 10 == 0:
        print(f"Step {step+1}: Cost = {loss.item():.4f}")

quantum_corpus_states = [quantum_encoding_circuit(feature, params).detach().numpy() for feature in encoded_corpus_feature_vectors] # Quantum States of Corpus
quantum_query_state = quantum_encoding_circuit(encoded_query_feature_vector, params).detach().numpy() # Quantum State of Query

# ---------------------- Section IV: Quantum Retrieval ----------------------
def state_overlap(state1, state2):
    """Calculates the overlap (similarity) between two quantum states."""
    overlap = np.abs(np.dot(np.conj(state1), state2))**2
    return overlap

# Calculate overlap between query state and corpus states
overlaps = [state_overlap(quantum_query_state, state) for state in quantum_corpus_states]

# Select the most relevant document
most_likely_index = np.argmax(overlaps)

retrieved_document = corpus[most_likely_index]

print(f"Query: {question}")
print(f"Answer: {retrieved_document}")

Enter your query: quantum
Step 10: Cost = 0.0028
Step 20: Cost = 0.0120
Step 30: Cost = 0.0116
Step 40: Cost = 0.0315
Step 50: Cost = 0.0405
Step 60: Cost = 0.0338
Step 70: Cost = 0.0298
Step 80: Cost = 0.0281
Step 90: Cost = 0.0275
Step 100: Cost = 0.0275
Query: quantum
Answer: The cat sat on the mat.
