The PDF Text Search and Indexing System for Electronic Datasheets extracts and indexes text from electronic datasheets in PDF format. Users can search for a term or component name across multiple datasheets and get back the list of documents that contain it. The project uses PyMuPDF to read PDF files, Tesseract OCR to extract text from images, and NLTK for text processing (tokenization, stopword removal, and stemming). The resulting word-to-PDF index is saved to disk so that later searches do not need to reprocess the PDFs, which makes the tool useful for managing and analyzing technical datasheets in fields such as electronics, engineering, and manufacturing.
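
The word-to-PDF index described above is an inverted index. A minimal sketch of the idea follows, using only the standard library. The tiny stopword list, the sample documents, and the function names are assumptions made for illustration; the notebook itself relies on NLTK for tokenization, stopwords, and stemming, and this sketch does no stemming at all.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in", "for"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_index(documents):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            index[token].add(name)
    return index

def search(index, query):
    """Return the documents that contain every token of the query, sorted by name."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)

# Hypothetical sample data, invented for this example
documents = {
    "ne555.pdf": "The NE555 is a precision timer for astable operation.",
    "lm317.pdf": "The LM317 is an adjustable voltage regulator.",
}
index = build_index(documents)
print(search(index, "timer"))
print(search(index, "adjustable regulator"))
print(search(index, "timer regulator"))
```

Storing a set of file names per token keeps each document listed once per word, and a multi-word query becomes a set intersection.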

Download the files from this link: https://drive.google.com/drive/folders/1jAT7h6jPOfPdq8T_jJXKyY_3bzxDsOuM?usp=sharing

Note: I used ChatGPT to understand how to extract text from images and tables, then tried to combine that with what I learned in the TPs. I also used it to help structure the code.

In [None]:
pip install pymupdf pytesseract nltk sentence-transformers


Collecting pymupdf
  Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract, pymupdf
Successfully installed pymupdf-1.24.13 pytesseract-0.3.13


In [None]:
# Install pytesseract
!pip install pytesseract




In [None]:
# Install Tesseract OCR
!apt-get update -q
!apt-get install -y tesseract-ocr


Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,609 kB]
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,459 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2,696 k
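
Since pytesseract is only a wrapper around the `tesseract` executable, it can help to confirm that the binary is reachable before running any OCR. The following is a small sketch using the standard library; the helper name `tesseract_available` is an assumption introduced for this example.

```python
import shutil

def tesseract_available(binary="tesseract"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(binary) is not None

if tesseract_available():
    print("tesseract found on PATH")
else:
    print("tesseract not found; run the apt-get install cell first")
```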

In [None]:
import os
import fitz  # PyMuPDF for handling PDF files
import pytesseract
from PIL import Image
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer  # For stemming words
import pickle  # Used for saving and loading processed data
from collections import defaultdict

# Download the necessary NLTK data files (for tokenization and stopwords)
nltk.download('punkt')
nltk.download('stopwords')

# Initialize the stemmer to reduce words to their root form (e.g., "running" becomes "run")
stemmer = PorterStemmer()

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    text_parts = []  # Collect text chunks and join them once at the end

    # Open the PDF using PyMuPDF; the context manager closes the file when done
    with fitz.open(pdf_path) as doc:
        # Loop through all the pages in the PDF
        for page in doc:
            text_parts.append(page.get_text("text"))  # Extract the page's embedded text

            # Run OCR only when the page contains images: OCR is slow, and on
            # text-only pages it would just duplicate the embedded text
            if page.get_images():
                pix = page.get_pixmap(alpha=False)  # Render the whole page to an RGB image
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)  # Convert to a PIL image
                text_parts.append(pytesseract.image_to_string(img))  # OCR the rendered page

    # Join with newlines so words from adjacent chunks do not run together
    return "\n".join(text_parts)

# Function to process the text (tokenize, remove stopwords, and stem words)
def process_text(text):
    tokens = word_tokenize(text.lower())  # Tokenize the text and convert to lowercase

    # Remove common stopwords (like "the", "and", "is") to focus on important words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

    # Stem the words to get their root form (e.g., "running" → "run")
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

    return stemmed_tokens

# Function to process all PDFs in a given directory and create a mapping of words to PDFs
def process_pdfs_in_directory(directory_path):
    word_pdf_mapping = defaultdict(set)  # Map each word to the set of PDFs that contain it

    # Loop through all PDF files in the directory (sorted for a stable processing order)
    for pdf_file in sorted(os.listdir(directory_path)):
        if pdf_file.lower().endswith('.pdf'):  # Check if the file is a PDF
            pdf_path = os.path.join(directory_path, pdf_file)  # Get the full path of the PDF
            print(f"Processing {pdf_file}...")  # Show progress

            # Extract text from the PDF
            text = extract_text_from_pdf(pdf_path)

            # Process the text (tokenize, remove stopwords, and stem words)
            stemmed_words = process_text(text)

            # Update the word-to-PDF mapping; using a set records each PDF
            # once per word instead of once per occurrence
            for word in set(stemmed_words):
                word_pdf_mapping[word].add(pdf_file)

    return word_pdf_mapping

# Function to search for a word and find out in which PDFs it appears
def search_word(word, word_pdf_mapping):
    # Normalize the query the same way the indexed text was normalized
    # (lowercase, then stem); otherwise "timers" would never match the stored "timer"
    stemmed_query = stemmer.stem(word.strip().lower())
    if stemmed_query in word_pdf_mapping:
        # Return each matching PDF once, in sorted order
        return sorted(set(word_pdf_mapping[stemmed_query]))
    else:
        return f"The word '{word}' does not appear in any PDF."  # Return a message if no PDFs contain the word

# Function to save the word-to-PDF mapping to a file (to avoid reprocessing the PDFs)
def save_word_pdf_mapping(word_pdf_mapping, filename='word_pdf_mapping.pkl'):
    with open(filename, 'wb') as file:
        pickle.dump(word_pdf_mapping, file)  # Save the mapping using Pickle
    print(f"Word-to-PDF mapping saved to {filename}.")  # Confirm saving

# Function to load the word-to-PDF mapping from a file (if already saved)
def load_word_pdf_mapping(filename='word_pdf_mapping.pkl'):
    if os.path.exists(filename):  # Check if the saved file exists
        with open(filename, 'rb') as file:
            word_pdf_mapping = pickle.load(file)  # Load the saved mapping
        print(f"Word-to-PDF mapping loaded from {filename}.")  # Confirm loading
        return word_pdf_mapping
    else:
        print("No saved mapping found. Please process PDFs first.")  # If the file doesn't exist
        return None

# Main execution
if __name__ == "__main__":
    directory_path = '/content/pdffiles'  # Set the path to your PDFs

    # Try to load previously saved word-to-PDF mapping
    word_pdf_mapping = load_word_pdf_mapping()

    # If there's no saved mapping, process the PDFs and save the mapping
    if word_pdf_mapping is None:
        word_pdf_mapping = process_pdfs_in_directory(directory_path)
        save_word_pdf_mapping(word_pdf_mapping)

    # Start a loop to allow the user to search for multiple words
    searching = True
    while searching:
        word_to_search = input("Enter a word to search: ")  # Ask the user for a word

        # Search for the word in the mapping and display the result
        result = search_word(word_to_search, word_pdf_mapping)
        print("PDFs containing the word:", result)

        # Ask the user if they want to search for another word
        while True:
            repeat_search = input("Do you want to search for another word? (y/n): ").strip().lower()

            # Validate user input to handle unexpected responses
            if repeat_search == 'y':
                break  # Continue with another search
            elif repeat_search == 'n':
                print("Thank you for using the search tool!")  # Friendly exit message
                # End the outer loop rather than calling exit(), which in a
                # notebook can stop the kernel instead of just ending the cell
                searching = False
                break
            else:
                print("Please enter 'y' for yes or 'n' for no.")  # Prompt user for correct input


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Word-to-PDF mapping loaded from word_pdf_mapping.pkl.
PDFs containing the word: ['media (11).pdf', 'media (11).pdf', 'media (11).pdf', 'media (8).pdf', 'media (8).pdf', 'media (8).pdf', 'media (6).pdf', 'media (6).pdf', 'media (6).pdf', 'media (6).pdf', 'media (5).pdf', 'media (5).pdf', 'media (5).pdf', 'media (3).pdf', 'media (3).pdf', 'media (12).pdf', 'media (12).pdf', 'media (9).pdf', 'media (9).pdf', 'media (9).pdf', 'media (2).pdf', 'media (2).pdf', 'media (2).pdf']
PDFs containing the word: ['ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (1).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne555 (6).pdf', 'ne
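
The result lists above repeat a file name once for every occurrence of the word in that file. One way to make such a list readable is to collapse it into per-file counts, sketched below with `collections.Counter`. The helper name `summarize_hits` and the shortened sample list are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def summarize_hits(pdf_list):
    """Collapse a list with repeated file names into (file name, count) pairs,
    most frequent first."""
    return Counter(pdf_list).most_common()

# Shortened, hypothetical result list in the same shape as the output above
hits = ["media (11).pdf", "media (11).pdf", "media (8).pdf", "media (11).pdf"]
for name, count in summarize_hits(hits):
    print(f"{name}: {count}")
```

The counts also give a rough relevance signal: files that mention the term more often appear first.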