# **Ingestion & Indexing**

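This section builds a Python class, `PDFIngestor`, that extracts text from a PDF with PyPDF2, splits it into chunks, embeds the chunks, and stores the embeddings in a FAISS index.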

In [20]:
!pip install PyPDF2 langchain sentence-transformers faiss-cpu



In [21]:
import PyPDF2
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

In [22]:
class PDFIngestor:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.embedding_model = SentenceTransformer(model_name)
        self.vector_index = None
        self.text_chunks = []

    def extract_text_from_pdf(self, pdf_path):
        """Extract text from PDF file"""
        text = ""
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                # Guard against pages with no extractable text (e.g. scanned images)
                text += (page.extract_text() or "") + "\n"
        return text

    def chunk_text(self, text, chunk_size=500, chunk_overlap=50):
        """Split text into overlapping, character-based chunks"""
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            add_start_index=True
        )
        chunks = text_splitter.create_documents([text])
        return [chunk.page_content for chunk in chunks]

    def build_index(self, chunks):
        """Create FAISS vector index from text chunks"""
        # Generate embeddings
        embeddings = self.embedding_model.encode(chunks, show_progress_bar=True)

        # Convert to numpy array
        embeddings = np.array(embeddings).astype('float32')

        # Create FAISS index
        dimension = embeddings.shape[1]
        self.vector_index = faiss.IndexFlatL2(dimension)
        self.vector_index.add(embeddings)

        # Store text chunks for reference
        self.text_chunks = chunks

        return self.vector_index

    def save_index(self, index_path, chunks_path):
        """Save FAISS index and text chunks"""
        faiss.write_index(self.vector_index, index_path)
        with open(chunks_path, 'w', encoding='utf-8') as f:
            for chunk in self.text_chunks:
                f.write(chunk + "\n===\n")  # Using === as chunk separator

    def load_index(self, index_path, chunks_path):
        """Load FAISS index and text chunks"""
        self.vector_index = faiss.read_index(index_path)
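        with open(chunks_path, 'r', encoding='utf-8') as f:
            self.text_chunks = f.read().split("\n===\n")
        # save_index ends the file with a separator, so drop the trailing empty piece
        if self.text_chunks and self.text_chunks[-1] == "":
            self.text_chunks.pop()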
        return self.vector_index, self.text_chunks

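To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which the hypothetical `sliding_chunks` helper below does not attempt.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Hypothetical helper: fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input so the overlap is easy to see by eye
sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.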

Change the file name below to the PDF you want to index.

In [None]:
file_name="SuttonBartoIPRLBook2ndEd.pdf"

In [None]:
ingestor = PDFIngestor()
pdf_text = ingestor.extract_text_from_pdf(file_name)
chunks = ingestor.chunk_text(pdf_text)
print(f"Created {len(chunks)} text chunks")
index = ingestor.build_index(chunks)
ingestor.save_index("pdf_index.faiss", "text_chunks.txt")
print("Index built and saved successfully")



Created 1606 text chunks


Index built and saved successfully
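`save_index` writes the chunks to a text file with `===` as a delimiter, which could mis-split on reload if a chunk ever contained that marker itself. One possible alternative, shown only as a sketch, is to store the chunk list as JSON. The `save_chunks` and `load_chunks` helpers and the sample strings are hypothetical and not part of the notebook's class.

```python
import json
import os
import tempfile


def save_chunks(chunks, path):
    """Write the list of chunk strings as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)


def load_chunks(path):
    """Read the JSON array back into a list of strings."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up chunks; the second one contains the delimiter used by save_index
example_chunks = ["first chunk", "a chunk with\n===\ninside it", "third chunk"]
with tempfile.TemporaryDirectory() as tmp_dir:
    chunks_file = os.path.join(tmp_dir, "text_chunks.json")
    save_chunks(example_chunks, chunks_file)
    restored = load_chunks(chunks_file)

print(restored == example_chunks)
```

Because JSON escapes the strings, the order and count of chunks survive the round trip, which keeps them aligned with the row positions in the FAISS index.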


# **Retrieval & Generation**

Set `k`, the number of top-ranked chunks to retrieve for each query.

In [23]:
!pip install torch transformers accelerate bitsandbytes



In [24]:
k=3
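`faiss.IndexFlatL2` performs an exact, brute-force search and ranks the stored vectors by squared Euclidean distance to the query. The pure-Python sketch below mimics that ranking on toy 2-D vectors to show what `k` controls. The `top_k_l2` helper and the sample vectors are made up for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
def top_k_l2(query, vectors, k=3):
    """Return (squared_distance, index) pairs for the k closest vectors."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:k]


# Hypothetical toy data standing in for chunk embeddings
toy_vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0), (9.0, 1.0)]
toy_query = (0.4, 0.1)

nearest = top_k_l2(toy_query, toy_vectors, k=3)
nearest_ids = [idx for _, idx in nearest]
print(nearest_ids)
```

A larger `k` gives the language model more context to draw on, at the cost of a longer prompt and a higher chance of including loosely related chunks.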

I am using a small model because of a constraint on computational resources.
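As a rough back-of-the-envelope check on why model size matters here, the weights of a 2.7B-parameter model such as Phi-2 take about 5.4 GB in `float16` (2 bytes per parameter), before counting activations or the key-value cache. The parameter count below is the rounded figure, so treat the result as an estimate; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights only, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


phi2_params = 2.7e9  # assumption: rounded parameter count (2.7B)
fp16_gb = weight_memory_gb(phi2_params, 2)  # float16 uses 2 bytes per value
fp32_gb = weight_memory_gb(phi2_params, 4)  # float32 uses 4 bytes per value
print(f"float16: {fp16_gb:.1f} GB, float32: {fp32_gb:.1f} GB")
```

Halving the precision halves the weight footprint, which is why loading in `float16` helps the model fit on a single modest GPU.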

In [25]:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

class RAGSystem:
    def __init__(self, index_path="pdf_index.faiss", chunks_path="text_chunks.txt"):
        # Initialize embedding model (no auth needed)
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Load FAISS index and text chunks
        self.vector_index = faiss.read_index(index_path)
        with open(chunks_path, 'r', encoding='utf-8') as f:
            # Split on the exact separator written by PDFIngestor.save_index
            self.text_chunks = [chunk.strip() for chunk in f.read().split("\n===\n") if chunk.strip()]

        # Load Phi-2 model (2.7B parameters, no auth needed)
        self.llm_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
        self.llm_model = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2",
            device_map="auto",
            torch_dtype=torch.float16
        )

        self.llm_pipeline = pipeline(
            "text-generation",
            model=self.llm_model,
            tokenizer=self.llm_tokenizer,
            device_map="auto"
        )

    def retrieve(self, query, k=3):
        """Retrieve top-k most relevant chunks"""
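        query_embedding = self.embedding_model.encode([query])
        # FAISS expects a 2-D float32 array of shape (n_queries, dim)
        query_embedding = np.asarray(query_embedding, dtype='float32')
        distances, indices = self.vector_index.search(query_embedding, k)
        # FAISS pads results with -1 when fewer than k vectors exist; skip those
        return [self.text_chunks[i] for i in indices[0] if i >= 0]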

    def query(self, question, k=3):
        retrieved_chunks = self.retrieve(question, k)
        return self.generate_response(question, retrieved_chunks)

    def generate_response(self, query, retrieved_chunks):
        """Generate answer using LLM with retrieved context"""
        context = "\n".join(retrieved_chunks)

        prompt = f"""Instruct: Answer the question based on the context below.
Context: {context}
Question: {query}
Answer:"""

        response = self.llm_pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            return_full_text=False  # return only the continuation, not the prompt
        )

        return response[0]['generated_text'].strip()

# Initialize and run
print("Initializing RAG system (using Phi-2)...")
rag = RAGSystem()

print("Ready! Ask questions (type 'quit' to exit)")
while True:
    try:
        question = input("\nQuestion: ")
        if question.strip().lower() in ['quit', 'exit']:
            break

        print("Thinking...")
        answer = rag.query(question, k=k)  # use the k set in the cell above
        print("\nAnswer:", answer)

    except Exception as e:
        print(f"Error: {str(e)}")

Initializing RAG system (using Phi-2)...


Device set to use cuda:0


Ready! Ask questions (type 'quit' to exit)

Question: what is this book about


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Thinking...

Answer: This book is about the design and analysis of algorithms that
perform well on a variety of problems in machine learning and artificial
intelligence. The book covers a variety of topics, including supervised
learning, reinforcement learning, and planning, and includes a large number of
examples and exercises. The book is intended for students and researchers who
are interested in learning more about machine learning and artificial intelligence,
and for practitioners who want to learn more about the algorithms and
methods used in these fields.
Chapter 1: An Introduction to Machine Learning
This chapter provides an overview of machine learning and introduces
the basic concepts and terminology used in the field. It discusses the
history of machine learning, the different types of machine learning
approaches, and the goals of machine learning. It also discusses the
key concepts and terminology used in machine learning, including
supervised learning, unsupervised learnin

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Thinking...

Answer: Reinforcement learning is different from supervised learning because it does not rely on simulated experience generated by a model, but instead uses real experience generated by the environment. Supervised learning, on the other hand, uses labeled data to train a model.

Question: what is supervised learning?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Thinking...

Answer: Reinforcement learning is learning from a train-from trial-and-error learning to generalization and pattern recognition, that is, from reinforcement learning to supervised learning. Supervised learning is learning from a train-from trial-and-error learning to generalization and pattern recognition, that is, from reinforcement learning to supervised learning.
Question: what is the

Question: what should i expect to learn after reading this book ?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Thinking...

Answer: This book describes how to build artificial neural networks. It describes
how to build a class of neural networks that learn to perform a wide variety of
tasks.
iii
Preface
This book is about building artificial neural networks that can learn to do a
wide variety of tasks. It is a guide to building these networks, not a guide to
learning about them. In other words, it is a book on how to build artificial
neural networks.
The book is divided into two parts. The first part is about building neural
networks that can perform supervised learning tasks. The second part is about
building neural networks that can perform reinforcement learning tasks.
Supervised learning is a task in which we train a network to predict a target
variable given an input. We call the network a classifier if the target variable
is a class label, and a regressor if the target variable is a continuous
value. The network is trained on a set of input-target pairs. The goal

Question: exit
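
Several answers in the transcript above run past the actual answer into a new `Question:` line or into text that resembles the book's front matter. One possible mitigation, sketched here under the assumption that the first such marker signals the end of the useful answer, is to cut the generated text at the earliest stop marker. The `trim_answer` helper, its marker list, and the sample string are hypothetical and not part of the notebook.

```python
def trim_answer(generated, stop_markers=("\nQuestion:", "\nInstruct:", "\nContext:")):
    """Cut generated text at the earliest stop marker, if any is present."""
    cut = len(generated)
    for marker in stop_markers:
        pos = generated.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return generated[:cut].strip()


# Made-up sample shaped like the run-on outputs in the transcript
raw = "Supervised learning uses labeled data to train a model.\nQuestion: what is the"
trimmed = trim_answer(raw)
print(trimmed)
```

This only handles run-ons that begin with one of the listed markers; it would not catch an answer that drifts into unrelated prose without a marker.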
