In [None]:
"""
1. Business Problem: Develop a project that uses LangChain for question answering
based on four provided files.
1.1. Business Objective:
The objective is to develop a question-answering system that can interact with PDF
and Word documents using OpenAI's API, providing detailed answers based on the content
of these documents. This tool is designed to help users quickly extract relevant 
information from large documents by asking specific questions.

1.2. Constraints:
> The system relies on the OpenAI API, which has a usage limit and cost associated with it.
> The quality of the answers depends on the context available in the documents and the 
efficiency of the text chunking and embedding processes.
> Users must upload valid PDF or Word documents, and the system's performance may vary 
based on document size and content complexity.

2. Work on Each Feature of the Dataset
2.1. Data Dictionary:
For this project, the dataset consists of text extracted from PDF and Word documents. 
While a traditional data dictionary is not applicable, let's see a table.

3. Data Pre-processing
3.1. Data Cleaning, Feature Engineering, etc.:

Text Extraction: Text is extracted from PDF and Word documents.
Feature Engineering: Text chunks are created to ensure the context is maintained
and to handle large documents efficiently.

3.2. Outlier Treatment:
There’s no specific outlier treatment as the project deals with text data,
but text chunks are managed to maintain consistency and relevance.

4. Exploratory Data Analysis (EDA)
4.1. Summary:
The project involves extracting, chunking, and embedding text from documents for question-answering.
4.2. Univariate Analysis:
4.3. Bivariate Analysis:

5. Model Building
The project involves creating a system that extracts text from PDF and Word documents, 
breaks it into manageable chunks, and converts these chunks into vector embeddings using
OpenAI's model. These embeddings are stored in a FAISS index, enabling efficient similarity
searches to retrieve relevant text based on user queries. A question-answering chain then
generates detailed answers using the retrieved text. Users interact with this system
through a Streamlit interface, allowing them to ask questions and receive precise answers
from the uploaded documents. The model's effectiveness hinges on the seamless integration 
of text processing, embedding, and retrieval components.

6. Benefits/Impact of the Solution
The business benefits from this solution by providing users with a powerful tool to 
extract relevant information from large and complex documents efficiently.
This saves time and enhances decision-making processes, especially in environments 
where quick access to specific information is crucial.
"""
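
The retrieval step described under Model Building can be illustrated with a toy example. This is a minimal sketch, not the notebook's actual pipeline: `embed` is a hypothetical bag-of-words stand-in for the OpenAI embedding model, cosine similarity is used for ranking (the FAISS index may use a different metric), and the sample chunks are paraphrased from the answers shown in the output at the end of this notebook.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "India has 28 states and 8 union territories",
    "The Himalayas are the highest mountain range",
    "Ayushman Bharat provides health insurance coverage",
]
best = top_k("how many states does India have", chunks, k=1)
print(best)
```

Only the first chunk shares words with the query, so it ranks first. A real embedding model captures meaning rather than exact word overlap, but the ranking idea is the same.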


In [1]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI  # the langchain_community.llms.OpenAI class is deprecated
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from PyPDF2 import PdfReader
from docx import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv


In [2]:
# Load environment variables from a file named .env into the environment.
# OPENAI_API_KEY is read from there; it is good practice not to hard-code API keys.
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

In [3]:
# Function to extract text from PDF files
# Creates a corpus from the PDF files.
# The open function is used to open the PDF file located at path.
# The "rb" mode specifies that the file is opened in "read-binary" mode, which is necessary for handling PDF files.
def pdf_text(pdf_paths):
    text = ""
    for path in pdf_paths:
        with open(path, "rb") as file:
            pdf_reader = PdfReader(file)
            for page in pdf_reader.pages:
                # Guard against pages with no extractable text (e.g. scanned images)
                text += page.extract_text() or ""
    return text

# Function to extract text from DOCX files
# Creates a corpus from the Word files.
def docx_text(docx_paths):
    text = ""
    for path in docx_paths:
        doc_reader = Document(path)
        for paragraph in doc_reader.paragraphs:
            text += paragraph.text + "\n"
    return text
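
Since PDF and Word files go through different extractors, the input paths have to be sorted by type first. The following is a minimal sketch of one way to do that by file extension; `split_paths_by_type` is a hypothetical helper that is not part of the original notebook, and the file names in the usage lines are made up for illustration.

```python
from pathlib import Path


def split_paths_by_type(paths):
    """Group file paths into PDF and DOCX lists by extension (case-insensitive)."""
    pdf_paths, docx_paths, skipped = [], [], []
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix == ".pdf":
            pdf_paths.append(path)
        elif suffix == ".docx":
            docx_paths.append(path)
        else:
            # Anything else (including legacy .doc) is reported, not parsed
            skipped.append(path)
    return pdf_paths, docx_paths, skipped


# Illustrative file names only
pdfs, docs, skipped = split_paths_by_type(["overview.PDF", "notes.docx", "legacy.doc"])
print(pdfs, docs, skipped)
```

The two returned lists could then be passed to `pdf_text` and `docx_text` respectively. Legacy `.doc` files are kept out because python-docx only reads the `.docx` format.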


In [4]:
# Function to split text into chunks
# Recursively splits a long text into chunks no larger than chunk_size characters.
# chunk_overlap ensures that content spanning the boundary of two chunks is included in both.
def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
    return text_splitter.split_text(text)
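
To show what the overlap does, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `chunk_with_overlap` is a hypothetical helper that only illustrates the sliding-window idea.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


demo = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

Each chunk begins with the last character of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.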


In [5]:
# Function to create and save vector store
# FAISS : Facebook AI similarity search
# Embeds each text chunk and saves the resulting vectors to disk as a FAISS index
def get_vector_store(text_chunks):
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=api_key)
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local("faiss_index")
# The FAISS.from_texts method converts the text_chunks into vectors using the embeddings
# model and creates a FAISS index to store these vectors.
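
Because every run re-embeds all chunks through the paid API, it may be worth checking whether a saved index already exists. The sketch below assumes that `save_local` writes `index.faiss` and `index.pkl` into the target folder, which is the default naming in the langchain_community releases I am aware of; `index_exists` is a hypothetical helper, not part of the original notebook.

```python
import os


def index_exists(folder="faiss_index", index_name="index"):
    """Return True if both files that save_local is assumed to write are present."""
    expected = (f"{index_name}.faiss", f"{index_name}.pkl")
    return all(os.path.isfile(os.path.join(folder, name)) for name in expected)


print(index_exists())
```

A caller could skip `get_vector_store` when this returns True, as long as the source documents have not changed since the index was built.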


In [6]:
# Function to load the QA chain
# providing the prompts

def get_conversational_chain():
    # template for generating responses. 
    prompt_template = """
    Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
    provided context just say, "answer is not available in the context", don't provide the wrong answer\n\n
    Context:\n {context}\n
    Question: \n{question}\n

    Answer:
    """
    model = OpenAI(temperature=0.0, openai_api_key=api_key) # low temperature (0.0) for more deterministic responses
    # Creates a PromptTemplate object using the predefined template. The template expects two input variables: context and question.
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    return load_qa_chain(model, chain_type="stuff", prompt=prompt)  # Loads a question-answering (QA) chain
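
The "stuff" chain type places all retrieved chunks into a single prompt. The following sketch mimics that with plain string formatting so the final prompt is visible. It is an illustration only: `build_stuff_prompt` is a hypothetical function, the template is a shortened version of the one above, and joining chunks with a blank line is an assumption about the chain's default separator.

```python
PROMPT = (
    "Answer the question from the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Concatenate all retrieved chunks into one context and fill the template."""
    context = separator.join(chunks)
    return PROMPT.format(context=context, question=question)


filled = build_stuff_prompt(
    ["India has 28 states.", "India has 8 union territories."],
    "How many states does India have?",
)
print(filled)
```

Because everything is sent in one request, the combined length of the retrieved chunks plus the question has to fit in the model's context window, which is worth keeping in mind with a `chunk_size` of 10000 characters.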


In [7]:
# Function to handle user input and generate answers
def user_input(user_question):
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=api_key)
    new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
    # Loads a FAISS vector store from a local file named faiss_index.
    docs = new_db.similarity_search(user_question)
    # Performs a similarity search in the FAISS index using the user's question to find relevant documents or text chunks.
    chain = get_conversational_chain()
    # Calls the get_conversational_chain function to retrieve the QA chain for generating responses.
    # invoke() replaces the deprecated direct call chain(...)
    response = chain.invoke({"input_documents": docs, "question": user_question}, return_only_outputs=True)
    print("Reply:", response["output_text"])


In [8]:
# Main function
def main():
    # Embedded file paths
    pdf_paths = [
        "C:/Assignment/Training Data/India A Comprehensive Overview.pdf",
        "C:/Assignment/Training Data/India's Diverse States and Territories.pdf",
    ]
    docx_paths = [
        "C:/Assignment/Training Data/India's Education, Healthcare, and Social Development.docx",
        "C:/Assignment/Training Data/India's Natural Beauty and Wildlife.docx",
    ]
    
    raw_text = ""
    # Call pdf_text and docx_text for the given paths and accumulate the results
    # in raw_text, which then holds the combined text corpus of all files.
    if pdf_paths:
        raw_text += pdf_text(pdf_paths)
    if docx_paths:
        raw_text += docx_text(docx_paths)

    # Split the long text into smaller overlapping chunks using get_text_chunks.
    # These are character-based chunks, not model tokens.
    # get_vector_store then embeds the chunks and saves the vectors in the FAISS
    # index used for retrieval.
    if raw_text:
        text_chunks = get_text_chunks(raw_text)
        get_vector_store(text_chunks)
        print("Processing Complete")

        while True:
            user_question = input("Ask a Question from the Uploaded Files (or type 'exit' to quit): ")
            if user_question.lower() == 'exit':
                break
            user_input(user_question)
    else:
        print("No valid files uploaded")

if __name__ == "__main__":
    main()


Processing Complete
Ask a Question from the Uploaded Files (or type 'exit' to quit): How Many Indian States


  warn_deprecated(
stuff: https://python.langchain.com/v0.2/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/v0.2/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/v0.2/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/v0.2/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/v0.2/docs/how_to/#qa-with-rag


Reply: 
India is a federal union comprising 28 states and 8 union territories.
Ask a Question from the Uploaded Files (or type 'exit' to quit): what support is given by ayushman bharat scheme
Reply: 
The Ayushman Bharat scheme aims to provide health insurance coverage to millions of poor and vulnerable people in India. This support is crucial for improving healthcare access and affordability, especially in rural areas.
Ask a Question from the Uploaded Files (or type 'exit' to quit): what is the population of India
Reply: 
India has a population of over 1.3 billion, making it the world's second most populous country.
Ask a Question from the Uploaded Files (or type 'exit' to quit): what is the highest mountain range in India
Reply: The highest mountain range in India is the Himalayas.
Ask a Question from the Uploaded Files (or type 'exit' to quit): exit
