# OCR Test with Tesseract

## Requirements
1. **Install Tesseract OCR**:
   - Install Tesseract: `winget install --id UB-Mannheim.TesseractOCR`
   - Add the Tesseract install directory to the system `PATH` environment variable.
2. **Python packages**: `uv pip install opencv-python pytesseract pillow pdfplumber pymupdf`
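Before running the OCR cells, it can help to confirm that the Tesseract executable is actually reachable. The sketch below uses only the standard library; the fallback path in `DEFAULT_WINDOWS_PATHS` and the helper name `find_tesseract` are assumptions for illustration, not something the installer guarantees.

```python
import os
import shutil
from typing import Iterable, Optional

# Assumed default install location of the UB-Mannheim build; adjust if yours differs.
DEFAULT_WINDOWS_PATHS = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
)

def find_tesseract(extra_paths: Iterable[str] = DEFAULT_WINDOWS_PATHS) -> Optional[str]:
    """Return the path to the tesseract executable, or None if it is not found."""
    # First look on PATH (this works once the install directory has been added).
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise fall back to the explicit candidate locations.
    for candidate in extra_paths:
        if os.path.isfile(candidate):
            return candidate
    return None

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found: check the installation and the PATH variable")
else:
    print(f"Tesseract found at: {tesseract_path}")
    # If the executable is not on PATH, pytesseract can be pointed at it explicitly:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the function returns a path that is not on `PATH`, setting `pytesseract.pytesseract.tesseract_cmd` as shown in the comment avoids editing the system variables.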

## PDF only (text-based)

In [22]:
import pdfplumber
import os

def extract_pdf(input_pdf: str, output_txt: str):
    if not os.path.exists(input_pdf):
        raise FileNotFoundError(f'Input PDF not found: {input_pdf}')

    out_dir = os.path.dirname(output_txt) or '.'
    os.makedirs(out_dir, exist_ok=True)

    pages = []
    with pdfplumber.open(input_pdf) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or '')

    text = '\n'.join(pages).strip()

    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(text)

    return text

# Defaults (change as needed)
input_pdf = './Tests files/test_text_only.pdf'
output_txt = './Tests output/output_from_text.txt'

# Run
extracted_text = extract_pdf(input_pdf, output_txt)
print(f'Saved to {output_txt}')
print('\n\nExtracted text:\n', extracted_text)

Saved to ./Tests output/output_from_text.txt


Extracted text:
 Experiment-07
Roll No: A3-754
Aim: To study & implement Part-of-Speech (POS) tagging using the Viterbi Algorithm in Hidden Markov
Models (HMM)
Theory:
Part-of-Speech (POS) tagging is one of the most fundamental tasks in Natural Language Processing (NLP). It
refers to the process of assigning a grammatical category, such as noun, verb, adjective, adverb, etc., to each
word in a sentence. For example, in the sentence “Fish can swim”, the word “Fish” can be a noun, “can”
can be a verb, and “swim” is also a verb. However, words are often ambiguous – for example, “can” can also
mean a noun (like “a can of soda”). Hence, simply looking at individual words is not enough; we need to
consider the context of the words in the sentence.
➢ Hidden Markov Model (HMM) for POS Tagging
A Hidden Markov Model is a statistical model in which the system being modeled is assumed to be a Markov
process with hidden states. In the context of POS ta
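`extract_pdf` stores an empty string for any page where `page.extract_text()` returns nothing, which typically happens when a page is a scan with no text layer. A minimal sketch for flagging such pages so they can be sent to OCR instead; the helper name `pages_needing_ocr` and the `min_chars` threshold are assumptions, not part of the notebook.

```python
from typing import List

def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text looks empty."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only visible characters so whitespace-only pages are flagged too.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged

# Hypothetical per-page results: one text page, one empty page, one whitespace-only page.
sample_pages = ["Experiment-07\nRoll No: A3-754\nAim: To study POS tagging", "", "  \n "]
print(pages_needing_ocr(sample_pages))
```

The threshold is deliberately small; a page with only a page number or a short header would still be flagged, which is usually the safer default.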

## Image only (clear text)

In [23]:
import cv2
import pytesseract
import os

def run_ocr(input_path: str, output_path: str):
    if not os.path.exists(input_path):
        raise FileNotFoundError(f'Input file not found: {input_path}')
    
    out_dir = os.path.dirname(output_path) or '.'
    os.makedirs(out_dir, exist_ok=True)
    
    img = cv2.imread(input_path)
    if img is None:
        raise ValueError(f'Could not read image: {input_path}')
    
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5,5), 0)
    gray = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    
    text = pytesseract.image_to_string(gray)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)
    
    return text

# Defaults (change as needed)
input_path = './Tests files/test_image_only.png'
output_path = './Tests output/output_from_image.txt'

# Run
text = run_ocr(input_path, output_path)
print(f'Saved to {output_path}')
print('\n\nExtracted text:\n',text)

Saved to ./Tests output/output_from_image.txt


Extracted text:
 Experiment-07

Roll No: A3-754

Aim: To study & implement Part-of-Speech (POS) tagging using the Viterbi Algorithm in Hidden Markov
Models (HMM)
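The OCR output contains extra blank lines that the text-layer extraction does not, so a direct string comparison between the two results would fail even when the words match. A small sketch, using only the standard library, that normalizes whitespace before comparing; the helper names `normalize` and `similarity` are hypothetical.

```python
import difflib

def normalize(text: str) -> str:
    """Collapse runs of whitespace inside each line and drop empty lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized texts."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Shortened versions of the two outputs shown in this notebook.
pdf_text = "Experiment-07\nRoll No: A3-754"
ocr_text = " Experiment-07\n\nRoll No: A3-754\n\n"
print(similarity(pdf_text, ocr_text))
```

A ratio close to 1 indicates that OCR reproduced the text layer; lower values point to recognition errors rather than layout differences.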



## PDF (mixed images and text)

In [None]:
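import os

import fitz  # PyMuPDF
import pytesseract
import cv2
import numpy as np
from PIL import Image

def extract_pdf_with_images(pdf_path: str, out_txt: str):
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f'Input PDF not found: {pdf_path}')

    # Create the output folder if needed, as in the other cells
    out_dir = os.path.dirname(out_txt) or '.'
    os.makedirs(out_dir, exist_ok=True)

    doc = fitz.open(pdf_path)
    final_text = []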

    for page_num, page in enumerate(doc, start=1):
        blocks = []

        # --- Text blocks (with bbox) ---
        for b in page.get_text("blocks"):
            x0, y0, x1, y1, text, *_ = b
            if text.strip():
                blocks.append({
                    "bbox": (x0, y0, x1, y1),
                    "type": "text",
                    "content": text.strip()
                })

        # --- Image blocks ---
        # "dict" already lists image blocks; per-character "rawdict" detail is not needed here
        raw_dict = page.get_text("dict")
        for block in raw_dict["blocks"]:
            if block["type"] == 1:  # image
                bbox = block["bbox"]
                img = page.get_pixmap(matrix=fitz.Matrix(2, 2), clip=fitz.Rect(bbox))
                img_pil = Image.frombytes("RGB", [img.width, img.height], img.samples)

                # OCR on image (grayscale: pytesseract expects RGB or grayscale arrays, not BGR)
                img_gray = cv2.cvtColor(np.array(img_pil), cv2.COLOR_RGB2GRAY)
                ocr_text = pytesseract.image_to_string(img_gray).strip()
                if ocr_text:
                    blocks.append({
                        "bbox": bbox,
                        "type": "image",
                        "content": ocr_text
                    })

        # --- Sort by layout order ---
        blocks.sort(key=lambda b: (round(b["bbox"][1]), round(b["bbox"][0])))

        # --- Merge page ---
        page_text = []
        for b in blocks:
            if b["type"] == "image":
                page_text.append(f"[IMAGE OCR] {b['content']}")
            else:
                page_text.append(b["content"])
        final_text.append("\n".join(page_text))

    doc.close()  # release the PDF file handle

    merged_text = "\n\n--- Page Break ---\n\n".join(final_text)

    with open(out_txt, "w", encoding="utf-8") as f:
        f.write(merged_text)

    return merged_text


# Example run
pdf_path = "./Tests files/test_mix.pdf"
out_txt = "./Tests output/output_from_mix.txt"

text = extract_pdf_with_images(pdf_path, out_txt)
print('Extracted text:\n',text)

Extracted text:
 [IMAGE OCR] wanyandGessathe TRUST

RAJIV GANDHI INSTITUTE OF TECHNOLOGY, MUMBAI
DEPARTMENT OF COMPUTER ENGINEERING
Experiment-07
Roll No: A3-754
Aim: To study and implement the Random Forest algorithm for classification & regression tasks in ML
Theory:
Random Forest is an ensemble learning technique that combines multiple decision trees to produce a more 
accurate and robust model. It reduces overfitting and improves generalization compared to a single decision 
tree. It is a supervised machine learning algorithm that builds multiple decision trees and merges them to get 
a more accurate and stable prediction. It can be used for both classification and regression tasks.
[IMAGE OCR] Working of Random Forest

Tree2 (Aggregation)

Tree 3
➢ Importance:
• Handles high-dimensional data efficiently. 
• Reduces overfitting by averaging multiple trees. 
• Works well with large datasets and maintains accuracy even if a large portion of data is missing. 
• Provides feature import
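The layout sort in `extract_pdf_with_images` keys on `(round(y0), round(x0))`, so two blocks on the same visual row whose tops differ by a point or two are ordered by `y` rather than left to right. The sketch below groups blocks into rows with a tolerance first; the function name `sort_reading_order` and the `y_tol` value are assumptions, and the sample boxes are made up for illustration.

```python
from typing import Dict, List

def sort_reading_order(blocks: List[Dict], y_tol: float = 5.0) -> List[Dict]:
    """Sort blocks top to bottom, then left to right within each row."""
    ordered = sorted(blocks, key=lambda b: b["bbox"][1])
    rows: List[List[Dict]] = []
    for block in ordered:
        # Join the current row if this block's top is close to the row's first block.
        if rows and abs(block["bbox"][1] - rows[-1][0]["bbox"][1]) <= y_tol:
            rows[-1].append(block)
        else:
            rows.append([block])
    result: List[Dict] = []
    for row in rows:
        result.extend(sorted(row, key=lambda b: b["bbox"][0]))
    return result

# Hypothetical blocks: a title, then an image and a text block side by side.
blocks = [
    {"bbox": (300, 101.4, 500, 120), "type": "text", "content": "right"},
    {"bbox": (50, 102.6, 250, 120), "type": "image", "content": "left"},
    {"bbox": (50, 40, 500, 60), "type": "text", "content": "title"},
]
print([b["content"] for b in sort_reading_order(blocks)])
```

With these sample boxes the rounded key places "right" before "left", while the row-based grouping keeps them in left-to-right order. The tolerance has to be tuned to the document's line height.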

## PDF (mixed images and text), improved image OCR output