<a href="https://colab.research.google.com/github/DylanCTY/TextAnalytics_LearningSpace/blob/main/IB9CW0_5504008_draft2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers datasets faiss-cpu sentence-transformers



Load and Prepare the Dataset

In [3]:
import pandas as pd
from sentence_transformers import SentenceTransformer

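# Read the contents of GOTbook.txt (assuming the file is UTF-8 encoded;
# stating the encoding avoids depending on the platform default)
with open('GOTbook.txt', 'r', encoding='utf-8') as file:
    got_text = file.read()

# Split the text into paragraphs on blank lines, dropping empty chunks
paragraphs = [p.strip() for p in got_text.split('\n\n') if p.strip()]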

# Create a DataFrame for the retriever with dummy titles
df = pd.DataFrame({
    'title': [f"Paragraph {i}" for i in range(len(paragraphs))],
    'text': paragraphs
})
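
Splitting on '\n\n' alone can leave empty or whitespace-only chunks, and it keeps hard line breaks inside each paragraph. The following is a minimal sketch of a more tolerant splitter, not the notebook's own method; the helper name `split_paragraphs` and the `min_chars` threshold are hypothetical and introduced only for illustration.

```python
import re


def split_paragraphs(text, min_chars=1):
    """Split text on blank lines and return cleaned, non-empty chunks.

    `min_chars` is an illustrative threshold (an assumption, not taken
    from the notebook) for discarding very short fragments.
    """
    # A "blank line" is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", text)
    # Collapse internal runs of whitespace (including hard line breaks).
    cleaned = [" ".join(chunk.split()) for chunk in chunks]
    # Drop empty chunks and anything shorter than the threshold.
    return [c for c in cleaned if len(c) >= min_chars]
```

Raising `min_chars` would also filter out stray fragments such as page numbers, at the risk of dropping short lines of dialogue, so the right value depends on the text.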



Generate Embeddings

In [4]:
# Load a pre-trained sentence transformer model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each paragraph
embeddings = embedder.encode(paragraphs, convert_to_tensor=True)

# Add embeddings to the DataFrame
df['embeddings'] = embeddings.tolist()



Create a Hugging Face Dataset and Add FAISS Index

In [5]:
from datasets import Dataset

# Load the DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Add a FAISS index to the dataset
dataset = dataset.add_faiss_index(column='embeddings')



Set Up the RAG Model with the Custom Dataset

In [13]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, RagRetriever, RagSequenceForGeneration, RagTokenizer

# Load the LLM (e.g., T5 or BART)
llm_tokenizer = AutoTokenizer.from_pretrained("t5-small")
llm_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

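# Create the retriever using the custom dataset.
# Note: the retriever searches the FAISS index with vectors produced by the
# checkpoint's DPR question encoder, so the 'embeddings' column most likely
# has to match that dimension (retrieval_vector_size, 768 by default in
# RagConfig). The MiniLM embeddings built above are 384-dimensional.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq",
    indexed_dataset=dataset
)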
rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
rag_model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

def generate_response_with_llm(query):
    inputs = llm_tokenizer.encode("summarize: " + query, return_tensors="pt", max_length=512, truncation=True)
    outputs = llm_model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return llm_tokenizer.decode(outputs[0], skip_special_tokens=True)

def generate_response_with_rag(query):
    inputs = rag_tokenizer(query, return_tensors="pt")
    generated_ids = rag_model.generate(input_ids=inputs["input_ids"])
    return rag_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Define your queries
queries = [
    "Who is Jon Snow's mother?",
    "Describe the Battle of the Blackwater.",
    "What are the names of the direwolves?",
    "Explain the relationship between the Stark and Lannister families.",
    "What is the prophecy given to Daenerys Targaryen?"
]

# Implement the run_tests function
def run_tests(queries):
    results = []
    for query in queries:
        llm_response = generate_response_with_llm(query)
        rag_response = generate_response_with_rag(query)

        results.append({
            "query": query,
            "llm_response": llm_response,
            "rag_response": rag_response
        })
    return results

# Run the tests
test_results = run_tests(queries)

# Print the results for comparison
for result in test_results:
    print(f"Query: {result['query']}")
    print(f"LLM Response: {result['llm_response']}")
    print(f"RAG Response: {result['rag_response']}")
    print("\n")


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.

Some weights of the model checkpoint at facebook/rag-sequence-nq were not used when initializing RagSequenceForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagSequenceForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagSequenceForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


AssertionError: 

In [6]:
# Check the embedding dimensions
print(f"Embedding dimension: {embeddings.shape[1]}")


Embedding dimension: 384
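
The 384 printed above is a plausible explanation for the bare AssertionError. The index holds 384-dimensional MiniLM vectors, while the RAG checkpoint's question encoder is expected to emit 768-dimensional vectors (an assumption based on the default `retrieval_vector_size` in `RagConfig`). A nearest-neighbour search cannot compare vectors of different lengths. The sketch below is a plain-Python stand-in for a flat index lookup, not FAISS itself; the function name `top_k_by_inner_product` is hypothetical.

```python
def top_k_by_inner_product(query, passages, k=2):
    """Return the indices of the k passages with the highest dot product.

    Illustrative brute-force search; raises a descriptive error when the
    query and the stored vectors have different dimensions.
    """
    index_dim = len(passages[0])
    if len(query) != index_dim:
        raise ValueError(
            f"query has {len(query)} dimensions but the index stores "
            f"{index_dim}-dimensional vectors"
        )
    # Score every stored vector against the query.
    scores = [sum(q * p for q, p in zip(query, row)) for row in passages]
    # Rank passage indices from highest to lowest score.
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

If this is indeed the cause, one option is to embed the passages with a DPR context encoder that matches the RAG checkpoint rather than with all-MiniLM-L6-v2, so that both sides of the search use the same vector size.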


In [7]:
# Ensure embeddings have consistent dimensions
assert all(len(embedding) == embeddings.shape[1] for embedding in embeddings), "Inconsistent embedding dimensions"


In [8]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import Dataset

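# Read the contents of GOTbook.txt (assuming the file is UTF-8 encoded)
with open('GOTbook.txt', 'r', encoding='utf-8') as file:
    got_text = file.read()

# Split the text into paragraphs on blank lines, dropping empty chunks
paragraphs = [p.strip() for p in got_text.split('\n\n') if p.strip()]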

# Create a DataFrame for the retriever with dummy titles
df = pd.DataFrame({
    'title': [f"Paragraph {i}" for i in range(len(paragraphs))],
    'text': paragraphs
})

# Load a pre-trained sentence transformer model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each paragraph
embeddings = embedder.encode(paragraphs, convert_to_tensor=False)

# Ensure embeddings have consistent dimensions
assert all(len(embedding) == embeddings.shape[1] for embedding in embeddings), "Inconsistent embedding dimensions"

# Add embeddings to the DataFrame
df['embeddings'] = embeddings.tolist()

# Load the DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Add a FAISS index to the dataset
dataset = dataset.add_faiss_index(column='embeddings')

# Print the embedding dimensions for verification
print(f"Embedding dimension: {embeddings.shape[1]}")

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, RagRetriever, RagSequenceForGeneration, RagTokenizer

# Load the LLM (e.g., T5 or BART)
llm_tokenizer = AutoTokenizer.from_pretrained("t5-small")
llm_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

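# Create the retriever using the custom dataset.
# Note: switching to convert_to_tensor=False does not change the embedding
# size; the index still holds 384-dimensional vectors, while the checkpoint's
# question encoder is expected to produce 768-dimensional queries.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq",
    indexed_dataset=dataset
)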
rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
rag_model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

def generate_response_with_llm(query):
    inputs = llm_tokenizer.encode("summarize: " + query, return_tensors="pt", max_length=512, truncation=True)
    outputs = llm_model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return llm_tokenizer.decode(outputs[0], skip_special_tokens=True)

def generate_response_with_rag(query):
    inputs = rag_tokenizer(query, return_tensors="pt")
    generated_ids = rag_model.generate(input_ids=inputs["input_ids"])
    return rag_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Define your queries
queries = [
    "Who is Jon Snow's mother?",
    "Describe the Battle of the Blackwater.",
    "What are the names of the direwolves?",
    "Explain the relationship between the Stark and Lannister families.",
    "What is the prophecy given to Daenerys Targaryen?"
]

# Implement the run_tests function
def run_tests(queries):
    results = []
    for query in queries:
        llm_response = generate_response_with_llm(query)
        rag_response = generate_response_with_rag(query)

        results.append({
            "query": query,
            "llm_response": llm_response,
            "rag_response": rag_response
        })
    return results

# Run the tests
test_results = run_tests(queries)

# Print the results for comparison
for result in test_results:
    print(f"Query: {result['query']}")
    print(f"LLM Response: {result['llm_response']}")
    print(f"RAG Response: {result['rag_response']}")
    print("\n")





Embedding dimension: 384





Some weights of the model checkpoint at facebook/rag-sequence-nq were not used when initializing RagSequenceForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagSequenceForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagSequenceForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


AssertionError: 