# ***Read Sleuth***
> Aim: To facilitate image-based and text-based search for identical or similar content within a user-uploaded book database.\
This notebook extracts the text from an uploaded PDF.

# **Libraries used for the project:**

In [None]:
!apt-get install -y poppler-utils
!pip install pytesseract pdf2image
!apt-get install -y tesseract-ocr
!pip install transformers
!pip install sentence-transformers
!pip install torch
!pip install PyMuPDF

**Importing Dependencies**

In [7]:
import matplotlib.pyplot as plt
import cv2
from PIL import Image
import pytesseract
import numpy as np
import fitz

**Function to display images**

In [8]:
def display_image(image):
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.axis('off')
    plt.show()

> First, when a book is uploaded, all of its text is extracted with PyTesseract and stored in an array along with the page number it came from.\
> Before performing OCR, some preprocessing steps are usually applied. These include:
> *   Grayscale conversion
> *   Thresholding
> *   Inversion
> *   Dilation and erosion

In our case, only the first two are necessary, so these were applied using ***OpenCV***.
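The two steps from the list that this notebook skips, inversion and dilation/erosion, can be illustrated with a minimal NumPy-only sketch. The helper names `invert`, `dilate3x3`, and `erode3x3` are hypothetical and not part of the notebook. They assume an 8-bit binary image and a fixed 3x3 square kernel. In practice OpenCV's `cv2.bitwise_not`, `cv2.dilate`, and `cv2.erode` would normally be used instead.

```python
import numpy as np

def invert(binary):
    """Swap black and white in an 8-bit image."""
    return 255 - binary

def _neighbourhoods(binary, pad_value):
    """Stack the nine 3x3-shifted views of the image (hypothetical helper)."""
    h, w = binary.shape
    padded = np.pad(binary, 1, mode="constant", constant_values=pad_value)
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def dilate3x3(binary):
    """Grow white regions: each pixel becomes the max of its 3x3 neighbourhood."""
    return _neighbourhoods(binary, 0).max(axis=0)

def erode3x3(binary):
    """Shrink white regions: each pixel becomes the min of its 3x3 neighbourhood."""
    return _neighbourhoods(binary, 255).min(axis=0)

# Toy image: a single white pixel on a black background
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 255
print(dilate3x3(img))            # the white pixel grows into a 3x3 block
print(erode3x3(dilate3x3(img)))  # eroding the block shrinks it back
print(invert(img))               # black and white are swapped
```

Dilation thickens strokes and erosion thins them, which tends to matter for faint or noisy scans. For cleanly rendered pages, grayscale conversion and thresholding are often enough.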



In [9]:
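def pre_proc(im):
    # Convert the PIL image to a NumPy array (channels are in RGB order)
    img = np.array(im)
    # Grayscale conversion; the input comes from PIL, so use RGB2GRAY rather than BGR2GRAY
    gray_image = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    # Binary threshold: pixels above 130 become 200, all others become 0
    thresh, im_bw = cv2.threshold(gray_image, 130, 200, cv2.THRESH_BINARY)
    return im_bw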

def perform_ocr(pdf_path, page_number):
    # Open PDF file
    doc = fitz.open(pdf_path)

    # Get the page
    page = doc[page_number]

    # Convert PDF page to an image
    image = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))
    img = Image.frombytes("RGB", [image.width, image.height], image.samples)

    img = pre_proc(img)

    # Perform OCR using Tesseract
    ocr_text = pytesseract.image_to_string(img)

    doc.close()

    return ocr_text


**Function to get the number of pages in a PDF**

In [10]:
def get_pdf_len(pdf_path):
    try:
        doc = fitz.open(pdf_path)
        num_pages = doc.page_count
        doc.close()

        return num_pages
    except Exception as e:
        print(f"Error: {e}")
        return None

*Performing OCR*

In [16]:
pdf_path = "/content/thermodynamics-an-engineering-approach-cengel-boles.pdf"
pdf_pages = get_pdf_len(pdf_path)
e_text = []
for i in range(pdf_pages):
    # Print progress every 10 pages
    if (i + 1) % 10 == 0:
        print(i + 1)
    text = perform_ocr(pdf_path, i)
    # Store each page as (1-based page number, list of lines)
    e_text.append((i + 1, text.split('\n')))
    # Stop after the first 51 pages (indices 0 to 50)
    if i == 50:
        break

10
20
30
40
50


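`e_text` is the array that holds the extracted text as (page number, lines) pairs. In this run it covers only the first 51 pages of the book, because the loop above stops at `i == 50`.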

In [17]:
len(e_text)

51
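Since the stated aim is text-based search, here is a minimal sketch of how an array shaped like `e_text` could be queried. The function `search_pages` and the `sample_pages` data are hypothetical and only stand in for the real OCR output. The match is a plain case-insensitive substring test, not the similarity search the project is ultimately aiming for.

```python
def search_pages(pages, query):
    """Return (page_number, line) pairs whose line contains the query, ignoring case."""
    needle = query.lower()
    hits = []
    for page_number, lines in pages:
        for line in lines:
            if needle in line.lower():
                hits.append((page_number, line))
    return hits

# Hypothetical stand-in with the same shape as e_text: (page number, list of lines)
sample_pages = [
    (1, ["Thermodynamics", "An Engineering Approach"]),
    (2, ["", "The first law of thermodynamics"]),
]

for page_number, line in search_pages(sample_pages, "thermodynamics"):
    print(page_number, line)
```

With the real data, the call would be `search_pages(e_text, "some phrase")`. Keeping the page number next to each line is what lets a hit be traced back to its location in the book.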