<a href="https://colab.research.google.com/github/SurajMegharaj/QA--RAG-app/blob/main/RAG_APP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installations

In [47]:
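!pip install PyPDF2
!pip install -U langchain-community
!pip install faiss-cpu
# Needed by HuggingFaceEmbeddings and HuggingFaceHub below
!pip install sentence-transformers huggingface_hub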



Load the Dataset

In [48]:
from PyPDF2 import PdfReader
from langchain_core.documents import Document

# Path to the PDF uploaded to the Colab session
file_path = "/content/aws-vs-azure-vs-gcp-comparing-the-big-3-cloud-platforms.pdf"

# Open the PDF
reader = PdfReader(file_path)

# Extract text from all pages; fall back to "" in case a page
# (for example a scanned image) yields no text
text = ""
for page in reader.pages:
    text += page.extract_text() or ""

# Wrap the extracted text in a LangChain Document for the RAG pipeline
document = Document(page_content=text)

# Preview the first 500 characters to verify the extraction
print("Loaded document preview:")
print(document.page_content[:500])


Loaded document preview:
AWS VS AZURE VS GCP: COMPARING THE BIG 3 CLOUD
PLATFORMS
The big three of cloud computing platforms
Cloud computing  has revolutionized the way organizations handle digital operations. Amazon Web
Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are the three cloud service
providers  dominating the cloud market worldwide.
Most enterprises have moved computing from on-site servers into the cloud and even multi-cloud
environments , so that they can benefit from features such as:
Dec


In [49]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Use a pre-trained sentence-transformers model to generate embeddings (vector representations)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# `document` was created in the previous cell. The whole PDF is indexed as a
# single Document, so the retriever can only ever return the full text.
documents = [document]

# Index documents using FAISS
vectorstore = FAISS.from_documents(documents, embeddings)
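
Because the whole PDF is indexed as one document, every query retrieves the entire text. A common remedy is to split the text into smaller overlapping chunks before indexing. The following is a minimal, dependency-free sketch of that idea; `split_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration, and the sizes are arbitrary starting points rather than tuned values.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Stand-in text; in the notebook this would be the extracted `text`
sample = "abcdefghij" * 12
chunks = split_text(sample, chunk_size=50, overlap=10)
print(len(chunks), [len(c) for c in chunks])
```

In the notebook, each chunk could then be wrapped as `Document(page_content=chunk)` and the resulting list passed to `FAISS.from_documents`, so that the retriever returns only the most relevant passages.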


In [50]:
from langchain_community.llms import HuggingFaceHub

# Initialize the language model (GPT-2, served through the Hugging Face Hub)
llm = HuggingFaceHub(
    repo_id="gpt2",
    model_kwargs={"temperature": 0.5, "max_length": 150},
    huggingfacehub_api_token="Enter your API token",
)
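
Typing the API token directly into a cell makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the token from an environment variable and only prompts for it when the variable is unset. The helper name `get_hf_token` is hypothetical; `HUGGINGFACEHUB_API_TOKEN` is assumed here as the variable name because it is the one LangChain's Hugging Face Hub integration conventionally looks for.

```python
import os
from getpass import getpass


def get_hf_token(env_var: str = "HUGGINGFACEHUB_API_TOKEN") -> str:
    """Return the token from the environment, prompting only if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value, so it does not end up in cell output
        token = getpass("Hugging Face API token: ").strip()
    if not token:
        raise ValueError("No Hugging Face API token provided")
    return token
```

The model could then be created with `huggingfacehub_api_token=get_hf_token()` instead of a literal string.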


In [34]:
from langchain.chains import RetrievalQA

# Assumes `llm` and `vectorstore` were created in the cells above

# Use the retriever from the vectorstore
retriever = vectorstore.as_retriever()

# Create the retrieval-based QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # Your language model
    chain_type="stuff",  # Puts all retrieved documents into a single prompt
    retriever=retriever,  # Use the retriever from vectorstore
    return_source_documents=True  # Also return the retrieved documents
)

# Test with a query
query = "3 big platforms are"
# Use invoke() to get the full output
response = qa_chain.invoke({"query": query})

# Extract the result and source documents
result = response["result"]
source_documents = response["source_documents"]

# Print the full response dict (query, result, and source documents)
print("Response:", response)




Response: {'query': '3 big platforms are', 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nAWS VS AZURE VS GCP: COMPARING THE BIG 3 CLOUD\nPLATFORMS\nThe big three of cloud computing platforms\nCloud computing  has revolutionized the way organizations handle digital operations. Amazon Web\nServices (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are the three cloud service\nproviders  dominating the cloud market worldwide.\nMost enterprises have moved computing from on-site servers into the cloud and even multi-cloud\nenvironments , so that they can benefit from features such as:\nDecreased CapEx\nReduced infrastructure maintenance\nIncreased availability  and reliability\nScalability of an on-demand resource\nLower operational costs\nRemote access and facilitated collaboration\nSupport for multiple devices\nOptimized infrastructure for speed and perf
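
The `result` above starts with the prompt itself, which suggests the model returns the prompt followed by its continuation. A small post-processing step can keep only the text after the final answer marker. The sketch below assumes the prompt ends with a marker such as `Helpful Answer:`; the outputs shown here are truncated, so that marker is an assumption, and `extract_answer` is a hypothetical helper. The `echoed` string is made-up illustrative input, not real model output.

```python
def extract_answer(result: str, marker: str = "Helpful Answer:") -> str:
    """Return the text after the last occurrence of marker, or the input unchanged."""
    _, found, tail = result.rpartition(marker)
    # rpartition returns an empty separator when the marker is absent
    return tail.strip() if found else result.strip()


# Made-up example of a result that echoes the prompt before the answer
echoed = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Some retrieved context.\n\n"
    "Question: 3 big platforms are\n"
    "Helpful Answer: AWS, Azure, and GCP"
)
print(extract_answer(echoed))
```

In the notebook this could be applied as `extract_answer(response["result"])`. If the marker is not present, the text is returned unchanged, so the call is safe even when the assumption does not hold.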

In [45]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, RetrievalQA

# Prompt template for summarization
llm_prompt = PromptTemplate(
    input_variables=["context"],
    template="Summarize the following content:\n\n{context}\n\nSummary:"
)

# Standalone summarization chain. Note: it is not passed to the QA chain
# below, so RetrievalQA still uses its default prompt (see the output).
llm_chain = LLMChain(llm=llm, prompt=llm_prompt)

# Create a basic retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Uses a StuffDocumentsChain internally
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Test the system with a query
query = "What is AWS?"
response = qa_chain.invoke({"query": query})

# Inspect the structure of the response
print(response)

# The response is expected to have 'result' and 'source_documents' keys
if 'result' in response:
    result = response['result']
    print("Summarized Response:", result)

    # Show only the metadata of the source documents, not their full text
    if 'source_documents' in response:
        print("\nNote: Ignoring source document content...")
        for doc in response['source_documents']:
            print("Source Document Metadata:", doc.metadata)
else:
    print("No summarized result found.")




{'query': 'What is AWS?', 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nAWS VS AZURE VS GCP: COMPARING THE BIG 3 CLOUD\nPLATFORMS\nThe big three of cloud computing platforms\nCloud computing  has revolutionized the way organizations handle digital operations. Amazon Web\nServices (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are the three cloud service\nproviders  dominating the cloud market worldwide.\nMost enterprises have moved computing from on-site servers into the cloud and even multi-cloud\nenvironments , so that they can benefit from features such as:\nDecreased CapEx\nReduced infrastructure maintenance\nIncreased availability  and reliability\nScalability of an on-demand resource\nLower operational costs\nRemote access and facilitated collaboration\nSupport for multiple devices\nOptimized infrastructure for speed and performance\nEnhanced



Final version of working code

In [62]:
from PyPDF2 import PdfReader
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

# Step 1: Load and extract text from the PDF
file_path = "/content/Suraj J.pdf"
reader = PdfReader(file_path)

# Extract text from all pages (fall back to "" for pages with no text)
text = ""
for page in reader.pages:
    text += page.extract_text() or ""

# Load the extracted text into a document
document = Document(page_content=text)

# Step 2: Generate embeddings using a pre-trained model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
documents = [document]

# Index the document using FAISS
vectorstore = FAISS.from_documents(documents, embeddings)

# Step 3: Initialize the language model (GPT-2, served through the Hugging Face Hub)
llm = HuggingFaceHub(
    repo_id="gpt2",
    model_kwargs={"temperature": 0.5, "max_length": 150},
    huggingfacehub_api_token="Enter your API token",
)

# Step 4: Create the retrieval-based QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=False
)

while True:
    # Step 5: Read a user query (interrupt the cell to stop the loop)
    query = input("Enter your Query:")
    response = qa_chain.invoke({"query": query})

    # Step 6: Print the response
    result = response["result"]
    print("Response:", result)
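
The loop above only stops when the cell is interrupted. A gentler option is to end it when the user types an exit word. The sketch below shows one way to do that; `run_query_loop`, `answer_fn`, `read_input`, and the exit words are all hypothetical names introduced for illustration, and the demo uses scripted input and a stand-in answer function instead of the real chain.

```python
from typing import Callable, Iterable


def run_query_loop(
    answer_fn: Callable[[str], str],
    read_input: Callable[[str], str] = input,
    exit_words: Iterable[str] = ("exit", "quit"),
) -> int:
    """Answer queries until the user types an exit word; return how many were answered."""
    stop = {word.lower() for word in exit_words}
    answered = 0
    while True:
        try:
            query = read_input("Enter your Query:").strip()
        except (EOFError, KeyboardInterrupt):
            break  # treat Ctrl-C / end of input as a normal exit
        if query.lower() in stop:
            break
        if not query:
            continue  # ignore empty input instead of querying the chain
        print("Response:", answer_fn(query))
        answered += 1
    return answered


# Demo with scripted input and a stand-in answer function
scripted = iter(["is suraj ai engineer?", "", "quit"])
count = run_query_loop(lambda q: f"(answer to: {q})", read_input=lambda _: next(scripted))
```

In the notebook, the chain could be plugged in with `run_query_loop(lambda q: qa_chain.invoke({"query": q})["result"])`.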


Enter your Query:is suraj ai engineer?




Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

suraj1642001@gmail.com  Suraj J +91 7022584746  
 
AI Engineer with 1.7 years of professional experience and a BE in Computer Science and Engineering (2023). Specialize s in building 
and fine -tuning NLP, Computer -vision, and large language models (LLMs), with expertise in Lang Chain -based agent and chain 
development.  Proficient in developing and deploying machine learning and deep learning models,  along with creating robust data 
pipelines.  
EDUCATION  
Bachelor  of Engineer,  Computer  Science  & Engineering,  GPA: 8.14 /10.00  Aug 2019 — May 2023 
EXPERIENCE  
AI Engineer  Feb 2024 — Present  
Atharvo  Technology  Pvt Ltd  Bangalore  
AI Models:  Contributed to building LLM -based chains and agents, fine -tuning, and integrating NLP, computer vision, machine learning, 
and LLM models into applications. Al



Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

suraj1642001@gmail.com  Suraj J +91 7022584746  
 
AI Engineer with 1.7 years of professional experience and a BE in Computer Science and Engineering (2023). Specialize s in building 
and fine -tuning NLP, Computer -vision, and large language models (LLMs), with expertise in Lang Chain -based agent and chain 
development.  Proficient in developing and deploying machine learning and deep learning models,  along with creating robust data 
pipelines.  
EDUCATION  
Bachelor  of Engineer,  Computer  Science  & Engineering,  GPA: 8.14 /10.00  Aug 2019 — May 2023 
EXPERIENCE  
AI Engineer  Feb 2024 — Present  
Atharvo  Technology  Pvt Ltd  Bangalore  
AI Models:  Contributed to building LLM -based chains and agents, fine -tuning, and integrating NLP, computer vision, machine learning, 
and LLM models into applications. Al



KeyboardInterrupt: 