<a href="https://colab.research.google.com/github/Papa-Panda/Paper_reading/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# !pip install faiss-cpu transformers torch scikit-learn numpy

In [2]:
import faiss
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Step 1: Create some example documents
documents = [
    "The capital of France is Paris.",
    "The Great Wall of China is over 13,000 miles long.",
    "Python is a popular programming language.",
    "The Northern and Southern Dynasties lasted from 420 to 589 AD.",
    "Beijing is the capital of China."
]

# Step 2: Vectorize the documents using TF-IDF
vectorizer = TfidfVectorizer()
# FAISS works on float32 arrays, while TF-IDF output is float64, so cast explicitly
doc_vectors = vectorizer.fit_transform(documents).toarray().astype("float32")

# Step 3: Create a FAISS index for fast similarity search
dimension = doc_vectors.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
index.add(doc_vectors)

# Step 4: Define a function to retrieve the k most relevant documents
def retrieve(query, k=1):
    """Return the k documents whose TF-IDF vectors are closest to the query."""
    query_vector = vectorizer.transform([query]).toarray().astype("float32")
    distances, indices = index.search(query_vector, k)
    return [documents[i] for i in indices[0]]

# Step 5: Load a pre-trained text generation model (e.g., GPT-2)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Step 6: Define the RAG function
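def rag_generate(query):
    """Retrieve context for the query, then let GPT-2 continue an answer prompt."""
    retrieved_docs = retrieve(query, k=1)
    context = " ".join(retrieved_docs)  # Combine the retrieved document(s)

    # Prepare the input for generation
    input_text = f"Context: {context}\nQuestion: {query}\nAnswer:"
    # Calling the tokenizer directly returns both input_ids and attention_mask
    inputs = tokenizer(input_text, return_tensors="pt")

    # Generate the response. max_length counts the prompt tokens too, and the
    # decoded output echoes the prompt before the generated continuation.
    output = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)

    return answer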

# Step 7: Test the RAG implementation
query = "What is the capital of China?"
response = rag_generate(query)
print(response)



Context: Beijing is the capital of China.
Question: What is the capital of China?
Answer: The capital of China is Beijing.
Question: What is the capital of China?
Answer: The capital of China is Beijing.
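
The retrieval step above can be illustrated without FAISS. The sketch below assumes two things: `IndexFlatL2` performs an exact search ranked by squared L2 distance, and `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`). Under those assumptions, ranking by L2 distance gives the same order as ranking by cosine similarity, because for unit vectors the squared distance equals `2 - 2 * cosine`. The function `flat_l2_search` and the toy vectors are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np

def flat_l2_search(index_vectors, query_vectors, k=1):
    """Brute-force k-nearest-neighbour search by squared L2 distance."""
    # Pairwise differences, shape (n_queries, n_docs, dim)
    diffs = query_vectors[:, None, :] - index_vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    # Indices of the k smallest distances for each query
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy unit-length vectors standing in for TF-IDF rows (made-up numbers)
toy_docs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.6, 0.8, 0.0]], dtype="float32")
toy_query = np.array([[0.8, 0.6, 0.0]], dtype="float32")

distances, indices = flat_l2_search(toy_docs, toy_query, k=2)
cosine = toy_query @ toy_docs.T  # cosine similarity, since all rows are unit length

print(indices[0])                  # nearest documents by L2 distance
print(np.argsort(-cosine[0])[:2])  # most similar documents by cosine
```

Both printed rankings agree on these toy vectors. A query that shares no vocabulary with the documents would transform to an all-zero vector, in which case the equivalence no longer holds.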



In [5]:
query = "What is the capital of France?"
response = rag_generate(query)
print(response)



Context: The capital of France is Paris.
Question: What is the capital of France?
Answer: The capital of France is Paris.
Question: What is the capital of France?
Answer: The capital of France is Paris.
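
Both outputs echo the prompt and then repeat the question-and-answer pair, because the decoded sequence includes the input tokens and generation continues until `max_length` is reached. A minimal post-processing sketch follows, assuming the decoded text starts with the exact prompt and the answer fits on a single line. `extract_answer` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def extract_answer(generated_text, prompt):
    """Return only the first answer line the model wrote after the prompt."""
    # Drop the echoed prompt if the generation starts with it
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep the text up to the first newline; later lines are repeated Q/A pairs
    return continuation.strip().split("\n", 1)[0].strip()

# Example using the output printed above
prompt = (
    "Context: The capital of France is Paris.\n"
    "Question: What is the capital of France?\n"
    "Answer:"
)
generated = (
    prompt
    + " The capital of France is Paris.\n"
    + "Question: What is the capital of France?\n"
    + "Answer: The capital of France is Paris."
)

first_answer = extract_answer(generated, prompt)
print(first_answer)  # The capital of France is Paris.
```

To use this inside `rag_generate`, the prompt string would need to be passed along with the decoded text.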

