Loading the LLM: GPT-2 (released in 2019, trained on web text collected before then)

In [13]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(2025)


Device set to use cpu


Load the processed text chunks

In [14]:
import pickle

# chunks.pkl holds the list of text chunks produced during preprocessing
with open("chunks.pkl", "rb") as f:
    all_chunks = pickle.load(f)
print(f"Loaded {len(all_chunks)} chunks")

Loaded 426 chunks
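The cell above assumes `chunks.pkl` already exists, and the preprocessing step that wrote it isn't shown. The sketch below shows one way such a file could be produced and reloaded. It assumes simple word-based chunks that overlap with the previous chunk. The `chunk_words` helper and its chunk size and overlap values are hypothetical, not the notebook's actual settings.

```python
import os
import pickle
import tempfile


def chunk_words(text, chunk_size=200, overlap=50):
    """Hypothetical helper: split text into chunks of up to chunk_size words,
    each sharing `overlap` words with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Round-trip through pickle, mirroring how chunks.pkl is loaded above
sample_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample_text, chunk_size=4, overlap=1)
path = os.path.join(tempfile.mkdtemp(), "chunks.pkl")
with open(path, "wb") as f:
    pickle.dump(chunks, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(len(loaded), loaded[0])
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighbouring chunks. This helps retrieval, at the cost of some duplicated text in the index.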


Embed the chunks above

In [15]:
from sentence_transformers import SentenceTransformer
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')  # From Day 1
embeddings = embed_model.encode(all_chunks)  # One 384-d vector per chunk, returned as a 2-D array
embeddings = np.array(embeddings).astype('float32')  # FAISS expects float32
print(f"Embeddings shape: {embeddings.shape}")  # (num_chunks, 384): chunks x dimensions

Embeddings shape: (426, 384)
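The index built next (`IndexFlatL2`) ranks chunks by Euclidean distance. Sentence embeddings, however, are usually compared by cosine similarity. For unit-length vectors the two give the same ranking, because ‖a − b‖² = 2 − 2·cos(a, b). The NumPy check below uses random vectors as stand-ins for real embeddings. The notebook passes raw (unnormalized) embeddings, so normalizing is an optional change. `SentenceTransformer.encode` also accepts `normalize_embeddings=True` to return unit vectors directly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for 5 chunk embeddings and 1 query embedding (384-d, like MiniLM)
chunk_vecs = rng.normal(size=(5, 384)).astype("float32")
query_vec = rng.normal(size=(384,)).astype("float32")

# Scale every vector to unit length
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

cosine = chunk_vecs @ query_vec                     # cosine similarity (unit vectors)
sq_l2 = np.sum((chunk_vecs - query_vec) ** 2, axis=1)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 cos(a, b)
assert np.allclose(sq_l2, 2 - 2 * cosine, atol=1e-5)

# Hence "smallest L2 distance first" and "largest cosine first" give the same order
print(np.argsort(sq_l2), np.argsort(-cosine))
```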


Using FAISS for fast vector search

In [16]:
import faiss

dimension = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dimension)  # Flat index: exact (brute-force) L2 search, fine for small corpora
index.add(embeddings)  # type: ignore  # Flat indexes need no training; vectors are simply stored
print(f"Index built with {index.ntotal} vectors")  # Should match the chunk count

Index built with 426 vectors
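`IndexFlatL2` performs exact search by comparing the query against every stored vector. The NumPy function below sketches the same computation; it is not FAISS's actual implementation. Like `index.search`, it returns squared L2 distances and a 2-D array of indices, with one row per query. The function name `flat_l2_search` is hypothetical.

```python
import numpy as np


def flat_l2_search(database, queries, k):
    """Exact k-nearest-neighbour search by squared L2 distance.

    database: (n, d) array, queries: (m, d) array.
    Returns (distances, indices), each of shape (m, k), nearest first.
    """
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, computed for every (query, vector) pair
    d2 = (
        np.sum(queries ** 2, axis=1, keepdims=True)
        - 2 * queries @ database.T
        + np.sum(database ** 2, axis=1)
    )
    d2 = np.maximum(d2, 0)  # clamp tiny negatives caused by rounding
    indices = np.argsort(d2, axis=1)[:, :k]
    distances = np.take_along_axis(d2, indices, axis=1)
    return distances, indices


database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")
D, I = flat_l2_search(database, query, k=2)
print(I)  # [[1 0]]
```

The cost is one distance per stored vector per query. At 426 chunks this is trivial, which is why a flat index is a sensible choice here. Approximate indexes only pay off at much larger scales.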


In [17]:
# Quick sanity check of the index (uncomment to run)
# dummy_query = embed_model.encode(["AI bias in ethics"])
# distances, indices = index.search(dummy_query, k=3)  # Top 3 nearest neighbours
# print("Top indices:", indices)  # e.g., [[5, 12, 3]]: one row of chunk IDs per query

In [18]:
# Testing retrieval (retrieve_chunks comes from the local retrieval module)
# from retrieval import *
# sample_query = "With all the information given, The name of the 46th president of the United States is"
# results = retrieve_chunks(sample_query, embed_model, index, all_chunks)
# for chunk, score in results:
#     print(f"Score: {score:.2f} | Chunk: {chunk[:100]}...")  # Preview
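The `retrieval` module imported above isn't shown. Its call site tells us only that `retrieve_chunks(query, embed_model, index, all_chunks)` returns `(chunk, score)` pairs. A minimal sketch consistent with that call is below. The function body and the `k` parameter are assumptions. The toy encoder and index are hypothetical stand-ins that expose the same `encode` and `search` method names, so the example runs without downloading a model or installing FAISS.

```python
import numpy as np


def retrieve_chunks(query, embed_model, index, all_chunks, k=3):
    """Hypothetical sketch: return the k nearest chunks as (chunk, distance) pairs."""
    query_vec = np.asarray(embed_model.encode([query]), dtype="float32")
    distances, indices = index.search(query_vec, k)
    # Row 0 holds the results for our single query; FAISS uses -1 for
    # empty slots when fewer than k vectors exist, so skip those
    return [
        (all_chunks[i], float(d))
        for d, i in zip(distances[0], indices[0])
        if i != -1
    ]


# Stand-ins mimicking SentenceTransformer.encode and faiss.IndexFlatL2.search
class ToyEncoder:
    def encode(self, texts):
        # Toy 2-d "embedding": (text length, number of spaces)
        return np.array([[len(t), t.count(" ")] for t in texts], dtype="float32")


class ToyIndex:
    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, queries, k):
        d2 = ((queries[:, None, :] - self.vectors[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argsort(d2, axis=1)[:, :k]
        return np.take_along_axis(d2, idx, axis=1), idx


docs = ["short", "a somewhat longer chunk", "mid size text"]
encoder = ToyEncoder()
toy_index = ToyIndex(encoder.encode(docs))
for chunk, score in retrieve_chunks("tiny", encoder, toy_index, docs, k=2):
    print(f"Score: {score:.2f} | Chunk: {chunk}")
```

With an L2 index the "score" is a distance, so lower means more relevant.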

Defining the RAG pipeline
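`rag_pipeline` is defined in the local `pipelines` module, which isn't shown. The `rag_answer` column in the results shows that its prompt begins with "Use the following context to answer the questi...". The hypothetical prompt builder below follows that idea: retrieved chunks go first, then the question in the same "Question: ... Answer:" format as the plain prompt. The exact wording, the separators, and the character budget are all assumptions. The budget is a rough guard so that context plus generated text stays within GPT-2's 1024-token window.

```python
def build_rag_prompt(query, context_chunks, max_context_chars=1500):
    """Hypothetical RAG prompt: retrieved chunks first, then the question.

    Chunks are added in retrieval order until max_context_chars would be
    exceeded (a rough character budget, not an exact token count).
    """
    context_parts = []
    used = 0
    for chunk in context_chunks:
        if used + len(chunk) > max_context_chars:
            break
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_rag_prompt(
    "Biden was born in",
    ["Joe Biden was born in Scranton, Pennsylvania.", "Unrelated chunk."],
)
print(prompt)
```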

In [19]:
import pandas as pd
import time
from pipelines import *  # provides generate_response and rag_pipeline

def compare_responses(queries, embed_model, index, all_chunks, max_length=100):
    """Answer each query with plain GPT-2 and with RAG; record answers and latencies.

    Note: per the warnings below, the helpers also set max_new_tokens=256,
    which takes precedence over max_length.
    """
    results = []
    for query in queries:
        # Plain LLM: the question alone, with no retrieved context
        plain_start = time.perf_counter()
        plain_prompt = f"Question: {query}\nAnswer:"
        plain_answer = generate_response(generator, plain_prompt, max_length)
        plain_time = time.perf_counter() - plain_start

        # RAG: retrieve relevant chunks, then generate with them as context
        rag_start = time.perf_counter()
        rag_answer = rag_pipeline(generator, query, embed_model, index, all_chunks, max_length)
        rag_time = time.perf_counter() - rag_start

        results.append({
            'query': query,
            'plain_answer': plain_answer[0]["generated_text"],
            'rag_answer': rag_answer[0]["generated_text"],
            'plain_latency': plain_time,  # seconds
            'rag_latency': rag_time       # seconds
        })
        print(f"Processed: {query} | Plain: {plain_time:.2f}s | RAG: {rag_time:.2f}s")

    # Save all results to CSV for later inspection
    df = pd.DataFrame(results)
    df.to_csv('comparison_results.csv', index=False)
    print("Results saved to comparison_results.csv")
    return df

# Smoke test with 3 quick queries first
test_queries = ["The name of the 46th president of the United State (USA) is ", "Biden was born in ", "Biden studied at "]
df = compare_responses(test_queries, embed_model, index, all_chunks, max_length=100)
df

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
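Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)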

Processed: The name of the 46th president of the United State (USA) is  | Plain: 9.61s | RAG: 9.74s


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Processed: Biden was born in  | Plain: 9.70s | RAG: 10.66s


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Processed: Biden studied at  | Plain: 9.45s | RAG: 9.50s
Results saved to comparison_results.csv
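The warnings above point to two problems. First, `max_new_tokens` (=256) overrides the `max_length` that is passed in, so the `max_length` argument currently has no effect. Second, `temperature` is ignored: in transformers, temperature only applies when sampling is enabled with `do_sample=True`. Below is a sketch of one consistent set of generation arguments that could be passed as `generator(prompt, **kwargs)`. The helper name and default values are hypothetical, not taken from the `pipelines` module.

```python
def make_generation_kwargs(max_new_tokens=100, temperature=None, pad_token_id=50256):
    """Hypothetical helper: keyword arguments for a text-generation pipeline call.

    - Uses only max_new_tokens (no max_length), so there is a single length limit.
    - Sets pad_token_id explicitly (GPT-2's eos id, 50256, per the log above),
      which avoids the "Setting pad_token_id" notice.
    - Enables sampling only when a temperature is given; otherwise decodes greedily.
    """
    kwargs = {"max_new_tokens": max_new_tokens, "pad_token_id": pad_token_id}
    if temperature is not None:
        kwargs.update(do_sample=True, temperature=temperature)
    else:
        kwargs["do_sample"] = False  # greedy decoding; temperature would be ignored
    return kwargs


print(make_generation_kwargs(max_new_tokens=50))
print(make_generation_kwargs(max_new_tokens=50, temperature=0.7))
```

With greedy decoding, `set_seed(2025)` has no effect on the output. It only matters once sampling is turned on.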


,query,plain_answer,rag_answer,plain_latency,rag_latency
0,The name of the 46th president of the United S...,Question: The name of the 46th president of th...,Use the following context to answer the questi...,9.609019,9.744886
1,Biden was born in,Question: Biden was born in \nAnswer: The firs...,Use the following context to answer the questi...,9.704808,10.656057
2,Biden studied at,Question: Biden studied at \nAnswer: He studie...,Use the following context to answer the questi...,9.44745,9.503505
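Both answer columns begin with the prompt itself ("Question: ..." and "Use the following context ..."). This is because a transformers text-generation pipeline returns the prompt plus the continuation in `generated_text` by default. Passing `return_full_text=False` to the pipeline call returns only the new text. Alternatively, the prompt can be sliced off afterwards, as in the small sketch below. The `strip_prompt` helper is hypothetical, and the continuation string is a placeholder rather than real model output.

```python
def strip_prompt(generated_text, prompt):
    """Return only the model's continuation, dropping an echoed prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()  # prompt not echoed verbatim; keep everything


prompt = "Question: Biden studied at \nAnswer:"
# Same shape as a pipeline result: a list of dicts with "generated_text"
output = [{"generated_text": prompt + " [model continuation]"}]
print(strip_prompt(output[0]["generated_text"], prompt))
```

Comparing continuations only makes the plain and RAG columns directly comparable. Otherwise the much longer RAG prompt dominates the stored text.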
