<a href="https://colab.research.google.com/github/DylanCTY/TextAnalytics_LearningSpace/blob/main/IB9CW0_5504008_draft2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
!pip install transformers datasets faiss-cpu sentence-transformers



Load and Prepare the Dataset

In [7]:
import pandas as pd
from sentence_transformers import SentenceTransformer

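# Read the contents of GOTbook.txt (assumes the file is UTF-8 encoded)
with open('GOTbook.txt', 'r', encoding='utf-8') as file:
    got_text = file.read()

# Split the text into paragraphs on blank lines, dropping empty chunks
paragraphs = [p.strip() for p in got_text.split('\n\n') if p.strip()]

# Create a DataFrame for the retriever; the RAG retriever expects 'title' and
# 'text' columns, so each paragraph gets a placeholder title
df = pd.DataFrame({
    'title': [f"Paragraph {i}" for i in range(len(paragraphs))],
    'text': paragraphs
})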
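
Splitting on blank lines can produce chunks that are very short (a heading, a single line of dialogue) or very long. As a minimal sketch of one way to even this out, the hypothetical helper `pack_paragraphs` below greedily merges consecutive paragraphs up to a character budget. The name and the `max_chars` value are illustrative assumptions, not part of the original notebook.

```python
def pack_paragraphs(paragraphs, max_chars=1000):
    """Greedily merge consecutive paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in paragraphs:
        para = " ".join(para.split())  # collapse newlines and repeated spaces
        if not para:
            continue  # skip empty or whitespace-only paragraphs
        candidate = f"{current} {para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = ["Winter is coming.", "", "  The north\nremembers. ", "A" * 50]
print(pack_paragraphs(sample, max_chars=40))
```

In the notebook this could be applied as `paragraphs = pack_paragraphs(paragraphs)` before building the DataFrame. A suitable budget depends on the embedding model's input limit.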



Generate Embeddings

In [8]:
# Load a pre-trained sentence transformer model
# (all-MiniLM-L6-v2 produces 384-dimensional embeddings)
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Generate one embedding per paragraph as a NumPy array
embeddings = embedder.encode(paragraphs, convert_to_numpy=True)

# Store each embedding as a list of floats so the DataFrame converts cleanly to a Dataset
df['embeddings'] = embeddings.tolist()
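
The passage embeddings must have the same size as the query vectors produced by the RAG question encoder. all-MiniLM-L6-v2 yields 384-dimensional vectors, while the DPR question encoder in facebook/rag-sequence-nq yields 768-dimensional ones. That mismatch is a plausible (though unconfirmed) cause of the bare AssertionError raised later in this notebook. The sketch below shows a check that fails early with a readable message. The helper names `embedding_dim` and `check_retriever_compat` are hypothetical, and the toy vectors stand in for the real embeddings.

```python
def embedding_dim(vectors):
    """Return the common length of all vectors; raise if empty or inconsistent."""
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"expected one embedding size, found {sorted(dims)}")
    return dims.pop()


def check_retriever_compat(vectors, expected_dim):
    """Fail early, with a readable message, if index and query sizes differ."""
    dim = embedding_dim(vectors)
    if dim != expected_dim:
        raise ValueError(
            f"passage embeddings have size {dim}, but the question encoder "
            f"produces size {expected_dim}"
        )
    return dim


# Toy stand-ins: two 384-dimensional passage vectors, checked against 768.
toy_vectors = [[0.0] * 384, [0.1] * 384]
try:
    check_retriever_compat(toy_vectors, expected_dim=768)
except ValueError as err:
    print(err)
```

In the notebook, the expected size could be read from the RAG model's `config.retrieval_vector_size` once the model is loaded. One possible remedy, not tested here, is to embed the paragraphs with the DPR context encoder that pairs with the checkpoint's question encoder.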


Create a Hugging Face Dataset and Add FAISS Index

In [12]:
from datasets import Dataset

# Load the DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

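# Add a FAISS index over the 'embeddings' column; the index name defaults to
# the column name, which is the name RagRetriever looks up
dataset = dataset.add_faiss_index(column='embeddings')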
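
To make the role of the index concrete: called with only `column`, `add_faiss_index` builds (as far as I am aware) a flat index, meaning every query is compared against every stored vector, with L2 distance as the default metric. The pure-Python sketch below mimics that exhaustive search on toy data. `flat_search` and `l2_squared` are illustrative names and do not reflect how FAISS is implemented internally.

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(index_vectors, query, k=2):
    """Exhaustive k-nearest-neighbour search; returns (distance, row_id) pairs."""
    scored = ((l2_squared(vec, query), i) for i, vec in enumerate(index_vectors))
    return heapq.nsmallest(k, scored)


toy_index = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_search(toy_index, [0.9, 0.1], k=2)
print([row_id for _, row_id in hits])  # nearest rows first
```

DPR-style retrievers are usually trained with inner-product similarity, so for RAG it may be worth passing an explicit `metric_type` to `add_faiss_index` rather than relying on the default.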


Set Up the RAG Model with the Custom Dataset

In [13]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, RagRetriever, RagSequenceForGeneration, RagTokenizer

# Load the baseline LLM (t5-small), used on its own without retrieval
llm_tokenizer = AutoTokenizer.from_pretrained("t5-small")
llm_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

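# Create the retriever over the custom dataset. The dataset needs 'title', 'text'
# and 'embeddings' columns plus a FAISS index named 'embeddings', and the stored
# vectors must match the question encoder's output size (config.retrieval_vector_size)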
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq",
    indexed_dataset=dataset
)
rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
rag_model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

def generate_response_with_llm(query):
    """Answer the query with t5-small alone; the model sees no retrieved context."""
    # "summarize: " is one of T5's built-in task prefixes
    inputs = llm_tokenizer.encode("summarize: " + query, return_tensors="pt", max_length=512, truncation=True)
    outputs = llm_model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return llm_tokenizer.decode(outputs[0], skip_special_tokens=True)

def generate_response_with_rag(query):
    """Answer the query with RAG: retrieve passages from the indexed dataset, then generate."""
    inputs = rag_tokenizer(query, return_tensors="pt")
    generated_ids = rag_model.generate(input_ids=inputs["input_ids"])
    return rag_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Define your queries
queries = [
    "Who is Jon Snow's mother?",
    "Describe the Battle of the Blackwater.",
    "What are the names of the direwolves?",
    "Explain the relationship between the Stark and Lannister families.",
    "What is the prophecy given to Daenerys Targaryen?"
]

# Run every query through both generators and collect the responses
def run_tests(queries):
    results = []
    for query in queries:
        llm_response = generate_response_with_llm(query)
        rag_response = generate_response_with_rag(query)

        results.append({
            "query": query,
            "llm_response": llm_response,
            "rag_response": rag_response
        })
    return results

# Run the tests
test_results = run_tests(queries)

# Print the results for comparison
for result in test_results:
    print(f"Query: {result['query']}")
    print(f"LLM Response: {result['llm_response']}")
    print(f"RAG Response: {result['rag_response']}")
    print("\n")


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.

Some weights of the model checkpoint at facebook/rag-sequence-nq were not used when initializing RagSequenceForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagSequenceForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagSequenceForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


AssertionError: 
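
Because `run_tests` calls both generators inside one loop, the exception above aborts the whole comparison before anything is printed. A minimal sketch of a more forgiving harness follows. `run_tests_safely` and the two stub generators are hypothetical stand-ins so the example runs without loading any model. The stubs only imitate a working baseline and a failing RAG call.

```python
def run_tests_safely(queries, generators):
    """Run each query through every generator, recording failures instead of stopping."""
    results = []
    for query in queries:
        row = {"query": query}
        for name, generate in generators.items():
            try:
                row[name] = generate(query)
            except Exception as exc:  # keep going so the other pipeline still reports
                row[name] = f"<failed: {type(exc).__name__}: {exc}>"
        results.append(row)
    return results


def stub_llm(query):
    """Stand-in for generate_response_with_llm."""
    return "stub answer"


def stub_rag(query):
    """Stand-in for generate_response_with_rag that fails like the run above."""
    raise AssertionError()


results = run_tests_safely(
    ["Who is Jon Snow's mother?"],
    {"llm_response": stub_llm, "rag_response": stub_rag},
)
for row in results:
    print(row)
```

With the real functions passed in place of the stubs, the baseline answers would still be collected while the RAG failure is being diagnosed.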