<a href="https://colab.research.google.com/github/S-Umasankar/PDF_genAI_poc/blob/develop/RAG_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install transformers
! pip install torch
! pip install pdfplumber
! pip install sentence_transformers
! pip install faiss-gpu

In [30]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [75]:
cd /content/drive/MyDrive/'NLP project - POC'/

/content/drive/MyDrive/NLP project - POC


In [76]:
import faiss
import pdfplumber
import os
import re
import nltk
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer


In [77]:
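# Step 1: Extract text from PDF
def extract_text_from_pdf(pdf_path):
    """Return the PDF's text as a list of blank-line-separated sections."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may yield None or an empty string for pages without extractable text
        page_texts = [page.extract_text() or "" for page in pdf.pages]
    # Join pages with a newline so words at page boundaries do not run together
    full_text = "\n".join(page_texts)
    return full_text.split("\n\n")  # Split into sections or paragraphs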
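PDF text extracted this way may contain few or no blank lines, in which case splitting on "\n\n" yields one very long section per file. A minimal sketch of a fallback, assuming fixed-size overlapping word windows are acceptable as retrieval units; `chunk_words`, `chunk_size`, and `overlap` are hypothetical names that are not part of this notebook:

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_words("a b c d e f g", chunk_size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g']
```

The overlap keeps a sentence that straddles a window boundary fully visible in at least one chunk, at the cost of embedding some words twice.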

In [78]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    if not isinstance(text, str):  # Check if text is a string
        return ""  # Return an empty string for None or non-string objects
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse all whitespace, including newlines, into single spaces
    words = text.split()
    words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [79]:
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_text(text):
  return sentence_model.encode([text])[0]



In [80]:
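dimension = sentence_model.get_sentence_embedding_dimension()  # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dimension)  # Empty placeholder; replaced once the embeddings are built

def index_embeddings(embeddings):
  # FAISS expects a contiguous float32 array of shape (n_vectors, dimension)
  embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
  faiss_index = faiss.IndexFlatL2(embeddings.shape[1])
  faiss_index.add(embeddings)
  return faiss_index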


In [81]:
def retrieve_relevant_embeddings(query, index, k=2):
  # Embed the query and return the positions of its k nearest stored vectors
  query_embedding = np.asarray([embed_text(query)], dtype=np.float32)
  distances, indices = index.search(query_embedding, k)
  return indices[0]
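`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances, so `index.search` returns the positions of the `k` stored vectors closest to the query. The pure-Python sketch below mirrors that behaviour on toy 2-D vectors to show what the returned indices mean. `brute_force_search` is a hypothetical helper for illustration, not a FAISS function, and padding with -1 when fewer than `k` vectors are stored is an assumption meant to match how FAISS reports missing neighbours.

```python
def brute_force_search(vectors, query, k=2):
    """Return (distances, indices) of the k stored vectors nearest to query."""
    scored = []
    for position, vector in enumerate(vectors):
        # Squared Euclidean distance between the query and this stored vector
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    # Pad the results when fewer than k vectors are stored
    missing = k - len(top)
    return distances + [float("inf")] * missing, indices + [-1] * missing


stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
distances, indices = brute_force_search(stored, [0.9, 1.2], k=2)
print(indices)
# [1, 0]
```

The indices are positions in the array passed to the index, so the text that produced each vector has to be kept in a list with the same ordering to turn a hit back into a passage.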

In [82]:
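pdf_files = [path for path in os.listdir() if path.endswith('.pdf')]
texts = [extract_text_from_pdf(file) for file in pdf_files]
# extract_text_from_pdf returns a list of sections per file, and preprocess_text
# returns "" for anything that is not a string, so flatten to individual sections first
chunks = [section for sections in texts for section in sections if section.strip()]
embeddings = np.array([embed_text(preprocess_text(chunk)) for chunk in chunks])
index = index_embeddings(embeddings)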

In [89]:
# Load the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def generate_response(context):
  # Encode the prompt text into token IDs
  inputs = tokenizer.encode(context, return_tensors='pt')

  # Generate a continuation; max_length counts the prompt tokens plus the new tokens
  outputs = model.generate(inputs, max_length=25, num_return_sequences=1)

  # Decode the generated token IDs back into text
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
  return response

In [90]:
def chatbot(query):
  response = generate_response(query)
  return response

# Example usage
query = 'What is astrology?'
response = chatbot(query)
print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is astrology?

Astrology is a science that uses the power of the sun to tell us what is happening
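
The warning in the output above appears because `tokenizer.encode` returns only token IDs, so `generate` receives neither an attention mask nor a pad token ID. One way to avoid it, sketched here under the assumption that the tokenizer and model follow the usual Hugging Face interfaces, is to call the tokenizer directly (which also returns an `attention_mask`) and to pass `pad_token_id` explicitly. The name `generate_response_with_mask` and the choice to pass `model` and `tokenizer` as arguments are made for this sketch and are not the notebook's.

```python
def generate_response_with_mask(context, model, tokenizer, max_length=25):
    """Generate a continuation of context with an explicit attention mask."""
    # Calling the tokenizer returns both input_ids and attention_mask
    inputs = tokenizer(context, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_return_sequences=1,
        # GPT-2 has no pad token, so reuse the end-of-sequence token
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the objects loaded earlier in this notebook, the call would be `generate_response_with_mask(query, model, tokenizer)`.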


In [91]:
query = 'What is mystic science?'
response = chatbot(query)
print(response)



What is mystic science?

The term mystic science is used to describe the study of the nature of the universe. It


In [92]:
query = 'What is karma?'
response = chatbot(query)
print(response)



What is karma?

Karma is the ability to change the world. It is the ability to change the world.
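
The answers above come from GPT-2 alone: `chatbot` passes the query straight to `generate_response` and never calls `retrieve_relevant_embeddings`, so the indexed PDFs play no part in them. A minimal sketch of how retrieved passages might be placed in the prompt, assuming a list of passage strings stored in the same order as the vectors in the index; `build_prompt`, `max_chars`, and the `Context:/Question:/Answer:` template are hypothetical and not from this notebook:

```python
def build_prompt(query, passages, indices, max_chars=500):
    """Prepend retrieved passages to the query as context for the generator."""
    # Skip -1 entries (no neighbour found) and anything out of range
    selected = [passages[i] for i in indices if 0 <= i < len(passages)]
    # Truncate the context so the prompt stays short for a small model
    context = "\n".join(selected)[:max_chars]
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Placeholder passages; in the notebook these would be the extracted PDF sections
passages = ["first passage", "second passage"]
print(build_prompt("What is karma?", passages, [1, -1]))
```

A prompt built this way is longer than the bare query, so `max_length=25` would leave little or no room for an answer, because `max_length` includes the prompt tokens.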
