# Use the pytesseract library in Python for optical character recognition from (i) an image file and (ii) a multi-page PDF file.

In [38]:
print("""Tesseract OCR is an open-source engine that reads text from images and scanned documents. It was originally developed at Hewlett-Packard and later sponsored by Google. Tesseract analyzes the page layout to find text lines and words, then recognizes the characters in them. It is designed mainly for printed text, so accuracy on handwriting is usually much lower. It supports more than 100 languages and copes with a range of fonts, although results depend heavily on image quality. In Python, the pytesseract library calls Tesseract to extract text from images; PDF pages must first be converted to images. Typical uses include digitizing printed books, reading scanned forms, and extracting text from photos.""")

print("\n" + "-"*120 + "\n")

print("""OCR (Optical Character Recognition) is a technology that helps a computer read and understand text from images or scanned documents, just like humans read printed text. For example, if you have a photo of a book page or a scanned PDF, OCR can extract the words and convert them into editable and searchable text. This is very useful for digitizing printed documents, recognizing text from forms, or copying text from an image without typing it manually.""")

print("\n" + "-"*120 + "\n")

print("""pytesseract is a Python wrapper for the Tesseract OCR engine. It passes an image to the separately installed Tesseract program and returns the recognized text.""")

print("\n" + "-"*120 + "\n")

print("""cv2 (OpenCV) is used for image processing. It helps improve the quality of the image by converting it to grayscale, removing noise, and applying thresholding so that OCR can work more accurately.""")

print("\n" + "-"*120 + "\n")

print("""numpy is used to handle image arrays when converting between formats (for example, from a PIL image to an OpenCV image). It makes it easy to work with image data as numerical arrays.""")

print("\n" + "-"*120 + "\n")

print("""PIL (Python Imaging Library, maintained today as the Pillow package), specifically the Image module, is used to open, save, and convert images in Python. pytesseract accepts PIL images, NumPy arrays, or image file paths, so converting to PIL before OCR is a convenience rather than a requirement.""")


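
To make the grayscale step and the BGR/RGB distinction concrete, here is a small dependency-free sketch. It assumes the standard luma weights (0.299 R + 0.587 G + 0.114 B) that OpenCV documents for its color-to-gray conversion. The helper name `bgr_to_gray` is hypothetical and exists only for illustration; in the notebook itself `cv2.cvtColor` does this work.

```python
def bgr_to_gray(pixel):
    """Convert one (B, G, R) pixel to a gray level using the standard luma weights."""
    b, g, r = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# OpenCV stores pixels as (B, G, R); PIL and Matplotlib expect (R, G, B).
pure_red_bgr = (0, 0, 255)

# Correct: the tuple is interpreted in OpenCV's BGR order.
print(bgr_to_gray(pure_red_bgr))   # 76, since red contributes 0.299 * 255

# Wrong: RGB-ordered data passed to a BGR function treats red as blue.
pure_red_rgb = pure_red_bgr[::-1]  # (255, 0, 0)
print(bgr_to_gray(pure_red_rgb))   # 29, since only 0.114 * 255 is counted
```

The two results differ because the weights are not symmetric across channels, which is why the channel order has to match the conversion flag (`COLOR_BGR2GRAY` for OpenCV-loaded images, `COLOR_RGB2GRAY` for arrays that came from PIL).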

# Importing the required libraries

In [39]:
import pytesseract
import cv2
import numpy as np
from PIL import Image
from pdf2image import convert_from_path
import matplotlib.pyplot as plt

# OCR from Image File with Preprocessing

In [40]:
# Set the path to Tesseract-OCR executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

In [None]:
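# Load the image using OpenCV (returns None if the file cannot be read)
img = cv2.imread('image.png')
if img is None:
    raise FileNotFoundError("Could not read 'image.png'; check the path.")

# OpenCV loads pixels in BGR order, while Matplotlib expects RGB,
# so this first plot shows the red and blue channels swapped.
plt.imshow(img)
plt.axis('off')
plt.title("Original Image (BGR)")
plt.show()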

# Convert the image from BGR to RGB 
image_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(image_rgb)
plt.axis('off')  
plt.title("Original Image (RGB)")
plt.show()

# Convert to Grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(gray, cmap='gray')
plt.axis('off')
plt.title("Grayscale Image")
plt.show()

# Apply Thresholding 
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Apply Median Blur to remove small noise
processed_img = cv2.medianBlur(thresh, 3)
plt.imshow(processed_img, cmap='gray')
plt.axis('off')
plt.title("Processed Image (Threshold + Median Blur)")
plt.show()

# Convert the OpenCV array to a PIL Image (optional: pytesseract also accepts NumPy arrays)
processed_pil_image = Image.fromarray(processed_img)

# Extract text from image
image_text = pytesseract.image_to_string(processed_pil_image)
print("Extracted Text:")
print(image_text)
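
The fixed threshold of 150 used above suits scans with even lighting. A common alternative is Otsu's method, which picks the threshold from the image histogram; OpenCV exposes it through the `cv2.THRESH_OTSU` flag of `cv2.threshold`. The sketch below is not the notebook's method. It is a plain-Python illustration of the idea on a flat list of gray values, and `otsu_threshold` and `binarize` are hypothetical helper names.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels in the dark class so far
    sum_dark = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_bright = total - w_dark
        if w_bright == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t, maxval=255):
    """Same rule as cv2.THRESH_BINARY: maxval where pixel > t, else 0."""
    return [maxval if p > t else 0 for p in pixels]


# Toy "image": dark ink values and bright paper values.
pixels = [40, 50, 60] * 10 + [190, 200, 210] * 10
t = otsu_threshold(pixels)
print(t)
print(binarize([40, 60, 190, 210], t))
```

In the notebook, the corresponding OpenCV call would be `_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`; with that flag the threshold argument is ignored and the chosen value is returned as the first result.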



# OCR from Multi-page PDF with Preprocessing

In [49]:
# Define the PDF path
pdf_path = r"C:\Users\SARTH\OneDrive\Desktop\python\ML\result_college.pdf"

# Define the Poppler path (the bin folder of the local Poppler installation)
poppler_path = r"C:\Users\SARTH\OneDrive\Desktop\python\ML\poppler-24.08.0\Library\bin"

# Convert PDF to list of images
pdf_pages = convert_from_path(pdf_path, dpi=300, poppler_path=poppler_path)

print("Conversion done.")


Conversion done.
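
`convert_from_path` as called above renders every page into memory at once, which can be heavy for long PDFs at 300 dpi. One option, sketched here under the assumption that the installed pdf2image version provides `pdfinfo_from_path` and the `first_page`/`last_page` arguments, is to convert a few pages at a time. `page_batches` and `iter_pdf_pages` are hypothetical helper names, and the batch size of 5 is arbitrary.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def iter_pdf_pages(pdf_path, poppler_path=None, dpi=300, batch_size=5):
    """Yield (page_number, PIL image) pairs, converting one batch at a time."""
    # Imported here so the batching helper above works without pdf2image.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)["Pages"]
    for first, last in page_batches(total, batch_size):
        pages = convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=first,
            last_page=last,
            poppler_path=poppler_path,
        )
        for offset, page in enumerate(pages):
            yield first + offset, page


# The batching logic on its own, for a 12-page document:
print(list(page_batches(12, 5)))  # [(1, 5), (6, 10), (11, 12)]
```

A loop such as `for page_num, page in iter_pdf_pages(pdf_path, poppler_path):` could then replace the all-at-once list, keeping at most one batch of page images in memory.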


In [50]:
# Iterate through each page
for page_num, page in enumerate(pdf_pages):
    # Convert the PIL page (RGB) to a NumPy array
    page_rgb = np.array(page)

    # Convert straight to grayscale (same result as RGB -> BGR -> gray)
    gray_pdf = cv2.cvtColor(page_rgb, cv2.COLOR_RGB2GRAY)

    # Thresholding
    _, thresh_pdf = cv2.threshold(gray_pdf, 150, 255, cv2.THRESH_BINARY)

    # Optional: Noise removal
    processed_pdf_page = cv2.medianBlur(thresh_pdf, 3)

    # Convert back to PIL for OCR (optional: pytesseract also accepts NumPy arrays)
    processed_pil_pdf = Image.fromarray(processed_pdf_page)

    # OCR for the current page
    page_text = pytesseract.image_to_string(processed_pil_pdf)

    # Print the result
    print(f"\n Text Extracted from PDF Page {page_num + 1}:\n")
    print(page_text)



 Text Extracted from PDF Page 1:

PROGRESS REPORT

UNIVERSITY

NAAC ACCREDITED ‘A+’ GRADE

NIRMA UNIVERSITY,AHMEDABAD _” NIRMA

Institute : Institute of Technology

Programme: — 3B. Tech. in Computer Science and Enginecring

Admission Year: 2023-24

Student's Roll No.:  23BCE194 Name: Narola Sarth Dharmeshbhai
Registration Course Code & Title Course Grade Credit
Category Obtained Earned
Semester | (IR) (SEE) Result Date; 06-Feb-2024
(IR) IHS101 General English A+ 3
(IR) IMH101 Mathematics | At 3
dR) ISP201 Physics 0 3
(IR) 1CLS01 Environmental Science At 3
(IR) ICSSO1 Computer Programming A+ 3
(IR) lEF801 Electrical Science At 3
dR) OFTOO] Yoga At .
Credits Offered : 18.00 Credits Famed : 18.00 Grade Points Farmed: 165.00 SGPA: 9.17 10
Progressive Credits Offered : Progressive Credits Famed : Progressive Grade Points Earned : COPA PGPA :
18.00 18.00 165.00 917 10
Semester Il (IR) (SEE) Result Date 19-Jun-2024
dR) ICS101 Introduction to AI& MI. 0 3
(IR) 1HS102 Written Communication A 3