**OBJECTIVE**
- To develop a Retrieval Augmented Generation (RAG) model. The RAG model should be capable of
retrieving relevant information from the PDF and generating coherent responses or
summaries based on user queries and maintain the chain of conversion till the end.

**Model Development**
- Develop a Retrieval Augmented Generation model.
- Implement a retrieval mechanism to efficiently search for relevant passages from the PDF document based on user queries.
- Design the generation component to produce coherent responses or summaries based on the retrieved passages.

-To develop a Retrieval Augmented Generation (RAG) model using a dataset stored in PDF format, we need to follow these steps systematically:

- Data Extraction from PDFs
- Preprocessing and Indexing
- Training the Retrieval Component
- Training the Generation Component
- Integration and Fine-tuning
- Inference and Testing

### Data Extraction from PDF
First we need to extract text from PDF files

In [1]:
pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.4-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.5/3.5 MB[0m [31m19.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting PyMuPDFb==1.24.3 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.8/15.8 MB[0m [31m49.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.4 PyMuPDFb-1.24.3


In [None]:
import pymupdf
pdf_path = "/content/Knowledge base for RAG-Handbook-of-Good-Dairy-Husbandry-Practices_.pdf"

def extract_text_from_pdf(pdf_path):
    text = ""
    doc = pymupdf.open(pdf_path)
    for page in doc:
        text += page.get_text()
    return text

text = extract_text_from_pdf(pdf_path)
print(text[:1000])

Handbook of 
Good Dairy Husbandry Practices
National Dairy
Development  Board
2
BREEDS OF INDIGENOUS DAIRY CATTLE
 
BULLS   
 
 
 
 
 
    COWS
Kankrej
Native tract: Kutch, Mehsana & 
Banaskantha districts of Gujarat
Tharparkar
Native tract: Jaisalmer, Barmer 
and Jodhpur districts in  
Rajasthan
Red Sindhi
Native tract: In Pakistan, also 
found in Punjab, Haryana & 
Rajasthan & Uttarakhand
Rathi
Native tract: Bikaner & 
Shri Ganganagar districts of 
Rajasthan
Sahiwal
Native tract: Ferozpur and 
Amritsar district of Punjab & Shri 
Ganganagar district of Rajasthan
Hariana
Native tract: Rohtak, Hissar, 
Sonepat, Gurgaon, Jind and 
Jhajjar districts in Haryana
Gir 
Native tract: Junagadh, Rajkot, 
Bhavnagar and Amreli districts 
in Gujarat
BREEDS OF BUFFALO
 
    BULLS 
 
 
 
 
 
  COWS
Murrah
Native tract: Hissar, Rohtak, 
Gurgaon and Jind districts of 
Haryana
Surti
Native tract: Anand, Kheda 
and Baroda districts of 
Gujarat
Mahesani
Native tract: Mahesana, 
Banaskantha & Sabarkantha 


**Step 2: Preprocessing and Indexing**
- Once we have the text extracted, we need to preprocess it (e.g. tokenization, cleaning) and then index it for efficient retrieval

In [None]:
pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.0/27.0 MB[0m [31m24.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0


In [None]:
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import faiss

nltk.download('punkt')

def preprocess_text(text):
  text = re.sub(r'[^\w\s]', '', text)
  sentences = nltk.sent_tokenize(text)
  return sentences

# Example usage
sentences = preprocess_text(text)

# Create TF-IDF embeddings
vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform([sentence for sentence in sentences])

# Create FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings.toarray())

# Save the index and vectorizer
faiss.write_index(index, 'index.faiss')
import pickle
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**Step3: Training the Retrieval Component**
- For retrieval component we will use TF-IDF-based FAISS index

** Step 4: Training the generation Component**
- We can use Hugging Face's Transformers library to fine-tune a generative model like GPT

In [None]:
!pip install torch

Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collectin

In [None]:
pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.30.1


In [None]:
from transformers import BartForConditionalGeneration, BartTokenizerFast, Trainer, TrainingArguments
import torch
# Load pretrained BART tokenizer and model
Tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# We have a dataset of (question, context, answer) tuples

train_data = [
    {"question":"Why animal health is important?", "context":"Animal health plays an important role in ...", "answer": "A diseased animal cannot perform to the expected level. Timely intervention is therefore pivotal in reducing the economic losses due to diseases."}
]

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        inputs = self.tokenizer(item["question"], item["context"], truncation=True, padding="max_length", max_length=self.max_length)
        outputs = self.tokenizer(item["answer"], truncation=True, padding="max_length", max_length=self.max_length)
        inputs['labels'] = outputs.input_ids
        return inputs

# Prepare the dataset
dataset = CustomDataset(train_data, Tokenizer)

# Training arguments
training_args = TrainingArguments(
    output_dir="./result",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

# Train the model
trainer.train()

Step,Training Loss


TrainOutput(global_step=3, training_loss=11.442729949951172, metrics={'train_runtime': 1.9757, 'train_samples_per_second': 1.518, 'train_steps_per_second': 1.518, 'total_flos': 3250656903168.0, 'train_loss': 11.442729949951172, 'epoch': 3.0})

**Step 5: Integration and Fine-tuning**
- Integrate the retrieval and generation components. So during inference, the retrival component fetches relevant passages, which are then used by the generative model to produce a response.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 1: Extract Text from PDF
def extract_text_from_pdf(pdf_path):
    doc = pymupdf.open(pdf_path)
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

# Preprocess Text
def preprocess_text(text, chunk_size=100):
    sentences = text.split('.')
    chunks = [' '.join(sentences[i:i+chunk_size]) for i in range(0, len(sentences), chunk_size)]
    return chunks

# Implement Search Mechanism
class PDFSearchEngine:
    def __init__(self, pdf_path):
        text = extract_text_from_pdf(pdf_path)
        self.chunks = preprocess_text(text)
        self.vectorizer = TfidfVectorizer()
        self.chunk_vectors = self.vectorizer.fit_transform(self.chunks)

    def search(self, query, top_n=3):
        query_vector = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vector, self.chunk_vectors).flatten()
        top_indices = np.argsort(similarities)[-top_n:][::-1]
        results = [(self.chunks[i], similarities[i]) for i in top_indices]
        return results


# Example usage
query = "Why animal health is important?"
response = PDFSearchEngine.search(pdf_path, query)
print(response)
