<a href="https://colab.research.google.com/github/HimaniKM/sithafaltask2/blob/main/sithafalTask_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install PyPDF2 langchain faiss-cpu huggingface-hub

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: PyPDF2, faiss-cpu
Successfully installed PyPDF2-3.0.1 faiss-cpu-1.9.0.post1


In [2]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.13-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.13 (from langchain-community)
  Downloading langchain-0.3.13-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.27 (from langchain-community)
  Downloading langchain_core-0.3.28-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.7.0-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.23.2-py3-none-any.whl.metadata (7.1 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-

In [3]:
!pip install -U langchain-huggingface


Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_huggingface-0.1.2-py3-none-any.whl (21 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.1.2
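
The class in the next cell splits the extracted text with `chunk_size=1000` and `chunk_overlap=200`. As a rough illustration of what those two numbers mean, here is a minimal sketch of a plain sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at separators such as paragraph and line breaks); the helper name `sliding_window_chunks` is hypothetical and exists only for this example.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 2600 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2600))
chunks = sliding_window_chunks(sample)

print([len(c) for c in chunks])             # [1000, 1000, 1000]
print(chunks[0][-200:] == chunks[1][:200])  # True: the windows share 200 characters
```

Because neighbouring chunks share text, a similarity search can return passages that partly repeat each other, which is what the overlap is trading for: a sentence that straddles a chunk boundary still appears whole in at least one chunk.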


In [4]:
# FAISS and HuggingFaceEmbeddings now live in the langchain-community and
# langchain-huggingface packages installed above; the old langchain.* import
# paths are deprecated.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
from typing import List, Optional
from PyPDF2 import PdfReader

class PDFQuestionAnswering:
    def __init__(self, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2'):
        """Initialize the QA system with the specified embedding model."""
        # Initialize the embedding function using HuggingFaceEmbeddings
        self.embedding_function = HuggingFaceEmbeddings(
            model_name=model_name
        )
        self.vector_store = None

    def get_pdf_text(self, pdf_docs: List[str]) -> str:
        """Extract text from multiple PDF documents."""
        text = []
        for pdf_path in pdf_docs:
            try:
                pdf_reader = PdfReader(pdf_path)
                for page in pdf_reader.pages:
                    extracted_text = page.extract_text()
                    if extracted_text:
                        text.append(extracted_text)
            except Exception as e:
                print(f"Error processing {pdf_path}: {str(e)}")
        return "\n".join(text)

    def get_text_chunks(self, text: str) -> List[str]:
        """Split text into chunks with specified size and overlap."""
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        return text_splitter.split_text(text)

    def create_vector_store(self, text_chunks: List[str], save_path: str = "faiss_index") -> None:
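        """Create and save a FAISS vector store from text chunks."""
        # An empty list (e.g. a scanned PDF with no extractable text) cannot be indexed.
        if not text_chunks:
            raise ValueError("No text chunks to index; the PDFs may contain no extractable text.")
        try: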
            # Create vector store directly using FAISS's from_texts method
            self.vector_store = FAISS.from_texts(
                texts=text_chunks,
                embedding=self.embedding_function
            )
            # Save the vector store
            self.vector_store.save_local(save_path)
        except Exception as e:
            print(f"Error creating vector store: {str(e)}")
            raise

    def load_vector_store(self, path: str = "faiss_index") -> None:
        """Load a previously saved vector store."""
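        try:
            # save_local() pickles the docstore, so recent langchain-community
            # releases refuse to load it unless this flag is set. Only enable
            # it for index folders you created yourself.
            self.vector_store = FAISS.load_local(
                folder_path=path,
                embeddings=self.embedding_function,
                allow_dangerous_deserialization=True
            )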
        except Exception as e:
            print(f"Error loading vector store: {str(e)}")
            raise

    def answer_question(self, question: str, k: int = 4) -> Optional[str]:
        """Return the k passages most similar to the question (retrieval only; no answer is generated)."""
        if self.vector_store is None:
            raise ValueError("Vector store not initialized. Please load or create one first.")

        try:
            # Get relevant documents using similarity search
            docs = self.vector_store.similarity_search(question, k=k)

            # Extract and combine the relevant text passages
            relevant_texts = [doc.page_content for doc in docs]

            # Return the most relevant passages
            return "\n\nRelevant passages:\n" + "\n---\n".join(relevant_texts)
        except Exception as e:
            print(f"Error answering question: {str(e)}")
            return None

def main():
    # Initialize the QA system
    qa_system = PDFQuestionAnswering()

    # Process PDF files
    pdf_files = ["test.pdf"]  # Replace with your PDF files
    if not pdf_files or not all(os.path.exists(pdf) for pdf in pdf_files):
        print("Error: Please ensure all PDF files exist.")
        return

    try:
        # Extract text and create vector store
        raw_text = qa_system.get_pdf_text(pdf_files)
        text_chunks = qa_system.get_text_chunks(raw_text)
        qa_system.create_vector_store(text_chunks)

        # Interactive question-answering loop
        print("Type 'exit' to quit.")
        while True:
            question = input("\nEnter your question: ").strip()
            if question.lower() == "exit":
                print("Exiting program. Goodbye!")
                break

            answer = qa_system.answer_question(question)
            if answer:
                print("\nAnswer:", answer)
            else:
                print("\nSorry, I couldn't find a relevant answer.")

    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    main()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Type 'exit' to quit.

Enter your question: line graph

Answer: 

Relevant passages:
In our example, we compared components of US GDP.
The line chart is useful when you want to show how a 
variable changes over time.  For our purposes, we used it 
show how GDP changed over time.
Bar graphs are good for comparing different groups of 
variables.  We used it to compare different components of US GDP.  We did the same with the pie chart; depending on your purposes you may choose to use a pie chart or a bar graph.
x   y
0   0
1   3
2   6
3   9
4   12
5   15
6   18
7   21
8   24
•If given a table of data, we should be able to plot it.  Below is 
some sample data; plot the data with x on the x-axis and y on the y-axis.
[Plot of the table data: x-axis ticks 0 to 8, y-axis ticks 0 to 30]
•Below is a plot of the data on the table from the previous 
slide.  Notice that this plot is a straight line meaning that a linear equation must have generated this data.
•What if the data is not generated by a linear equation?  We can
---
slide.  Notice that this plot is a straigh