**MAIN CODE FILE WITH EXPLANATION**

1. Step 1: Import Required Libraries
2. Step 2: Define Function to Extract Q&A Pairs from PDFs
3. Step 3: Load PDFs and Create Knowledge Base
4. Step 4: Define the Function to Find Answers in the Knowledge Base
5. Step 5: Chatbot Response and Running the Chatbot


I used Colab for this assignment; the following dependencies were installed.

In [1]:
!pip install PyPDF2
!pip install python-Levenshtein
!pip install fuzzywuzzy
!pip install openai langchain

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.25.1 (from python-Levenshtein)
  Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Collecting rapidfuzz<4.0.0,>=3.8.0 (from Levenshtein==0.25.1->python-Levenshtein)
  Downloading rapidfuzz-3.9.7-cp310-cp310-manylinux_2_17_x86_64.

Below is the final code for the PDF-based chatbot.

In [2]:
## Step 1: Import Required Libraries

import os
import re
import PyPDF2
from fuzzywuzzy import fuzz

# Conversation history to maintain dialogue
conversation_history = []

# Function to extract Q&A pairs from PDFs
def extract_qa_from_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()

    # Split text into Q&A pairs based on numbering (e.g., "4. ", "5. ")
    qa_pairs = re.split(r'(\d+\.\s+)', text)

    # Merge the split text into a list of questions and answers
    qa_list = []
    for i in range(1, len(qa_pairs), 2):
        question = qa_pairs[i].strip() + " " + qa_pairs[i + 1].strip()
        qa_list.append(question)

    # Create a dictionary of question-answer pairs
    qa_dict = {}
    for qa in qa_list:
        if '\n' in qa:
            parts = qa.split('\n', 1)
        else:
            parts = qa.split('.', 1)

        if len(parts) > 1:
            question = parts[0].strip()
            answer = parts[1].strip()
            qa_dict[question] = answer

    return qa_dict

# Define function to find answers in the knowledge base with fuzzy matching
def find_answer_in_docs(query, knowledge_base, threshold=70):
    query = query.lower()
    best_match = None
    best_score = 0
    best_answer = None
    source = None

    # Iterate over documents in the knowledge base
    for doc_name, qa_dict in knowledge_base.items():
        for question, answer in qa_dict.items():
            score = fuzz.partial_ratio(query, question.lower())
            if score > best_score:
                best_score = score
                best_match = question
                best_answer = answer
                source = doc_name

    if best_score >= threshold:
        return f"Found in {source}:\nQ: {best_match}\nA: {best_answer}\nSource URL: {source}", best_match
    else:
        return None, None

# Define the function to return chatbot response
def chatbot_response(query):
    global conversation_history
    answer, matched_question = find_answer_in_docs(query, knowledge_base)

    if answer:
        conversation_history.append(f"User: {query}")
        conversation_history.append(f"Chatbot: {answer}")
        return f"{answer}\n"

    # Suggest similar questions from the knowledge base based on keywords
    suggestions = suggest_questions(query, knowledge_base)
    if suggestions:
        response = "I'm sorry, I couldn't find an exact match. Did you mean one of the following?\n"
        for suggestion in suggestions:
            response += f"- {suggestion}\n"
        return response
    else:
        return "I'm sorry, I couldn't find an answer to your question in the knowledge base.\n"

# Function to suggest questions based on fuzzy matching
def suggest_questions(query, knowledge_base, threshold=50):
    query = query.lower()
    suggestions = []

    # Iterate over documents to find similar questions
    for doc_name, qa_dict in knowledge_base.items():
        for question in qa_dict.keys():
            score = fuzz.partial_ratio(query, question.lower())
            if score >= threshold:
                suggestions.append(question)

    return suggestions if suggestions else None

# Function to run the chatbot interactively
def run_chatbot():
    global conversation_history
    conversation_history = []
    while True:
        query = input("You: ")
        if query.lower() == "exit":
            break
        response = chatbot_response(query)
        print(f"Bot: {response}")

# Example PDF paths (replace with actual paths)
pdf_path_1 = '/content/FINAL_FAQs_June_2018.pdf'
pdf_path_2 = '/content/Final_FREQUENTLY_ASKED_QUESTIONS_-PATENT.pdf'

# Extract Q&A pairs from the PDFs
pdf_text_1 = extract_qa_from_pdf(pdf_path_1)
pdf_text_2 = extract_qa_from_pdf(pdf_path_2)

# Combine the Q&A pairs into the knowledge base
knowledge_base = {
    'Document 1': pdf_text_1,
    'Document 2': pdf_text_2
}

# Run the chatbot
run_chatbot()


You: what is patent?
Bot: Found in Document 2:
Q: 1. What  is a Patent?
A: A Patent is a stat utory right for an invention granted for a lim ited period of time to the patentee by 
the Government, in exchange of full disclosure of his invention for excluding others,  from making , 
using, selling , importing the patented product or process for producing that product for those 
purposes with out his consent.
Source URL: Document 2

You: Types of application
Bot: Found in Document 2:
Q: 23. What a re the types of applicatio ns?
A: The types of applications that can be filed ar e: 
A) PROVIS IONAL APPLICATION 
Indian Patent Law follows first to file system. A provisional applicatio n is an applicatio n which  can be 
filed if the invention is still under experimentati on stage. Filing a provisional specificat ion provides 
the  advantage  to  the  inventor  since  it  helps  in establishing a  ―priority  date  of  the  invention. 
Further,  the inventor gets 12 months’ time to fully devel

KeyboardInterrupt: Interrupted by user

# **Code explanation**

**Step 1: Import Required Libraries**
1. os: Provides access to operating-system functionality such as file paths. It is imported but not used in the current code.
2. re: Python's regular expression library, used here to split the extracted text into Q&A chunks.
3. PyPDF2: A library for reading PDF documents and extracting their text.
4. fuzzywuzzy: A string-matching library that compares two strings and returns a similarity score from 0 to 100.

**Step 2: Define Function to Extract Q&A Pairs from PDFs**
1. extract_qa_from_pdf: This function opens a PDF file and reads its content.
2. file_path: The path to the PDF file that will be passed to this function.
3. open(file_path, "rb"): Opens the PDF file in binary mode.
4. PyPDF2.PdfReader(file): Creates a PdfReader object to read the contents of the PDF. The loop for page_num in range(len(pdf_reader.pages)) then visits each page in turn.
5. text += page.extract_text(): Appends the extracted text from each page to the text variable.
6. re.split(r'(\d+\.\s+)', text): Splits the text wherever the pattern (\d+\.\s+) matches, that is, one or more digits followed by a period and whitespace (e.g., "4. ", "5. "). Because the pattern is wrapped in a capture group, the matched number markers are kept in the resulting list. This breaks the content into potential Q&A pairs.

7. qa_list = []: Initializes an empty list to hold the extracted question-answer pairs.
8. for i in range(1, len(qa_pairs), 2): Loops over the odd indices of the split list, which hold the number markers (e.g., "4. "); the element after each marker holds the question and answer text.
9. question = qa_pairs[i].strip() + " " + qa_pairs[i + 1].strip(): Joins each number marker with the text that follows it, stripping surrounding whitespace. The combined string is then appended to qa_list.

10. qa_dict = {}: Initializes an empty dictionary to store the final question-answer pairs.
11. for qa in qa_list: Loops through the extracted questions and answers in qa_list.
12. if '\n' in qa: Checks if there is a line break (\n) in the question-answer pair.
13. parts = qa.split('\n', 1): Splits the text at the first line break.
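14. parts = qa.split('.', 1): If no line break is found, it splits the text at the first full stop instead. Since each entry begins with its number (e.g., "4."), this is the period that follows the number.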
15. if len(parts) > 1: Ensures that both the question and the answer are present after splitting.
16. question = parts[0].strip(): Takes the first part as the question and removes any surrounding spaces.
17. answer = parts[1].strip(): Takes the second part as the answer and removes any surrounding spaces.
18. qa_dict[question] = answer: Stores the question and its corresponding answer in the qa_dict.
19. return qa_dict: Returns the dictionary of question-answer pairs.
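
The splitting logic described above can be tried without a PDF. The following is a minimal sketch that applies the same regular expression and the same line-break split to a hard-coded string. The names `split_qa` and `sample_text` are illustrative and are not part of the notebook code.

```python
import re

def split_qa(text):
    """Split numbered FAQ text into a {question: answer} dictionary."""
    # The capture group keeps each number marker ("1. ", "2. ") in the result,
    # so the list alternates: preamble, marker, body, marker, body, ...
    chunks = re.split(r'(\d+\.\s+)', text)

    qa_dict = {}
    for i in range(1, len(chunks), 2):
        entry = chunks[i].strip() + " " + chunks[i + 1].strip()
        # First line is the question, the rest is the answer; fall back to the
        # first full stop when the entry has no line break.
        parts = entry.split('\n', 1) if '\n' in entry else entry.split('.', 1)
        if len(parts) > 1:
            qa_dict[parts[0].strip()] = parts[1].strip()
    return qa_dict

# Hypothetical sample text standing in for the output of page.extract_text().
sample_text = (
    "Frequently Asked Questions\n"
    "1. What is a patent?\nA statutory right granted for an invention.\n"
    "2. Who can apply?\nThe inventor or an assignee.\n"
)

for question, answer in split_qa(sample_text).items():
    print(question, "->", answer)
```

One consequence worth keeping in mind: any digits followed by a period and whitespace inside an answer (for example "Section 3. ") are also treated as a boundary, so such an answer is cut short and a spurious entry is created.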

**Step 3: Load PDFs and Create Knowledge Base**
1. pdf_path_1, pdf_path_2: Paths to the PDF files.
2. extract_qa_from_pdf(pdf_path_1): Calls the function to extract Q&A pairs from the first PDF.
3. knowledge_base: Combines the extracted Q&A pairs from both PDFs into a dictionary. Each document is represented as a key, with its extracted content as the value.
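
With more than two documents, the knowledge base could be built in a loop instead of one variable per file. The sketch below is one possible way to do that, not the notebook's method. `build_knowledge_base` is a hypothetical helper, and the extractor is passed in as an argument so that the example runs without any PDFs; in the notebook it would be `extract_qa_from_pdf`.

```python
import os

def build_knowledge_base(pdf_paths, extractor):
    """Map each file name to the Q&A dictionary that `extractor` returns for it."""
    knowledge_base = {}
    for path in pdf_paths:
        doc_name = os.path.basename(path)  # use the file name as the document key
        knowledge_base[doc_name] = extractor(path)
    return knowledge_base

# Stand-in extractor for illustration only: it returns fixed data instead of
# reading a PDF.
def fake_extractor(path):
    return {"1. What is a patent?": "A statutory right for an invention."}

kb = build_knowledge_base(
    ['/content/FINAL_FAQs_June_2018.pdf',
     '/content/Final_FREQUENTLY_ASKED_QUESTIONS_-PATENT.pdf'],
    fake_extractor,
)
print(list(kb.keys()))
```

Keying the dictionary by file name means that a "Found in ..." message would report the actual file rather than a generic label such as 'Document 1'.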

**Step 4: Define the Function to Find Answers in the Knowledge Base**
1. find_answer_in_docs: Function that searches for the best matching answer from the knowledge base based on the user query.
2. query.lower(): Converts the user query to lowercase for case-insensitive matching.
3. best_match, best_score, best_answer: Variables to store the closest match, its similarity score, and the corresponding answer.
4. for doc_name, qa_dict in knowledge_base.items(): Loops through the documents and their Q&A dictionaries.
5. score = fuzz.partial_ratio(query, question.lower()): Uses fuzzy matching to compare the user query with each question in the knowledge base.
6. if score > best_score: Updates best_score and stores the matching question, answer, and document name whenever the current score exceeds the best score so far.
7. if best_score >= threshold: If the best score is at or above the threshold (70 by default), the function returns a formatted string containing the document name, the matched question, and the answer, together with the matched question itself.
8. else: If no question reaches the threshold, the function returns (None, None). The fallback message is produced later, in chatbot_response.
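
To illustrate why a short query can score highly against a longer question, here is a rough sketch of the partial-matching idea using only the standard library. It swaps in `difflib.SequenceMatcher` with a sliding window in place of fuzzywuzzy, so its scores are an approximation and will not always equal those of `fuzz.partial_ratio`. The names `approx_partial_ratio` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher

def approx_partial_ratio(a, b):
    """Best 0-100 similarity between the shorter string and any same-length
    window of the longer one. An approximation, not fuzzywuzzy's algorithm."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)

def best_match(query, questions, threshold=70):
    """Return (question, score) for the top-scoring question,
    or (None, score) when the top score is below the threshold."""
    if not questions:
        return None, 0
    query = query.lower()
    scored = [(approx_partial_ratio(query, q.lower()), q) for q in questions]
    score, question = max(scored)
    return (question, score) if score >= threshold else (None, score)

questions = ["1. What is a Patent?", "23. What are the types of applications?"]
print(best_match("types of application", questions))
print(best_match("zzzz qqqq", questions))
```

Because the query "types of application" appears verbatim inside the second question once both are lowercased, one window matches it exactly and the score reaches the maximum, even though the full strings differ in length.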

**Step 5: Chatbot Response and Running the Chatbot**

1. chatbot_response: This function processes the user query and retrieves the response from the knowledge base.
2. find_answer_in_docs(query, knowledge_base): Calls the previously defined function to find an answer to the user's query.
3. return f"{answer}\n": Returns the answer if one was found, after recording the query and the answer in conversation_history. Otherwise, suggest_questions looks for questions that score at least 50; if any are found they are listed as suggestions, and if none are found a default message is returned.
4. run_chatbot: Starts the chatbot interaction in a loop. query = input("You: ") accepts user input as a query.
5. if query.lower() == "exit": Ends the loop if the user types "exit". Otherwise, response = chatbot_response(query) gets a response for the query.
6. print(f"Bot: {response}"): Outputs the chatbot's response.
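
The three possible outcomes of a query (answer, suggestions, or fallback message) can be sketched as one function that takes the knowledge base as an argument, which makes it testable without `input()`. This is an illustrative variant, not the notebook code: `respond` and `simple_score` are hypothetical names, and `simple_score` uses `difflib` from the standard library as a stand-in for `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher

def simple_score(a, b):
    """0-100 similarity from the standard library; a stand-in for fuzz.partial_ratio."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def respond(query, knowledge_base, score=simple_score,
            answer_threshold=70, suggest_threshold=50):
    """Return an answer, a list of suggestions, or a fallback message."""
    query = query.lower()
    # Score every question in every document once.
    scored = [
        (score(query, question.lower()), question, answer)
        for qa_dict in knowledge_base.values()
        for question, answer in qa_dict.items()
    ]
    if scored:
        best_score, best_question, best_answer = max(scored, key=lambda item: item[0])
        if best_score >= answer_threshold:
            return f"Q: {best_question}\nA: {best_answer}"

    # No confident answer: offer every question above the lower threshold.
    suggestions = [q for s, q, _ in scored if s >= suggest_threshold]
    if suggestions:
        lines = "\n".join(f"- {q}" for q in suggestions)
        return "Did you mean one of the following?\n" + lines
    return "I couldn't find an answer in the knowledge base."

kb = {"Document 2": {"What is a patent?": "A statutory right for an invention."}}
print(respond("What is a patent?", kb))
```

Passing the knowledge base and the scoring function explicitly, rather than reading a global, is a design choice made here so that each branch can be exercised in isolation.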