## Text Extraction Methods (OCR)
### Example 1: Using PyTesseract to Extract Text from Images or PDFs

PyTesseract is an open-source Python library for extracting text from image files. Built as a wrapper for Google's Tesseract OCR engine, it relies on traditional image processing techniques and pattern matching for character recognition. PyTesseract performs well on high-resolution, clean images with clear text and simple layouts. For lower-quality input, preprocessing (image cleanup, noise reduction, deskewing) is usually required.
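
The preprocessing mentioned above can be sketched with Pillow alone. This is a minimal illustration, not a tuned pipeline: the helper name `preprocess_for_ocr` and the default `threshold` of 150 are assumptions, and deskewing is left out because it needs an estimate of the skew angle.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(image, threshold=150):
    """Return a cleaned-up, black-and-white copy of `image` (hypothetical helper)."""
    gray = ImageOps.grayscale(image)                      # drop colour information
    denoised = gray.filter(ImageFilter.MedianFilter(3))   # remove speckle noise
    # Binarize: pixels brighter than `threshold` become white, the rest black.
    return denoised.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed to `pytesseract.image_to_string` in place of the original image. The best threshold depends on the scan, so it is worth comparing OCR output with and without this step.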

Combined with the pdf2image library, PyTesseract can also process PDFs: pdf2image renders each page of the PDF as an image, and PyTesseract then extracts the text from each image.

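#### Step 1: Set up libraries and global definitions

Install the pytesseract and pdf2image libraries and their dependencies. pytesseract needs the Tesseract OCR engine installed on the system, and pdf2image needs Poppler.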

In [None]:
from pathlib import Path
from PIL import Image
import pytesseract
import pdf2image
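
Both libraries call external programs, so the imports can succeed even when those programs are missing. A quick check, sketched below, looks for them on the PATH. The helper name `missing_tools` is an assumption; `tesseract` is the command pytesseract runs by default, and `pdftoppm` is one of the Poppler utilities that pdf2image uses.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
```

If Tesseract is installed somewhere that is not on the PATH, setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable is the usual workaround.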

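Define the input and output directories, and create the output directory if it does not exist. Place source images and PDFs in the input directory.

In [2]:
input_dir = './input/'
output_dir = './output/'
Path(output_dir).mkdir(parents=True, exist_ok=True)  # ensure the output folder exists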


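Collect lists of the image files and PDF files in the input directory.

In [3]:
# Collect .jpg and .jpeg files (glob has no alternation syntax, so filter on the suffix)
image_files = sorted(p for p in Path(input_dir).iterdir() if p.suffix.lower() in {'.jpg', '.jpeg'})
pdf_files = sorted(Path(input_dir).glob('*.pdf'))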
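
If the input folder may also hold other image formats that Pillow can open, such as PNG or TIFF, the file listing can be generalized. The sketch below is one way to do that; `list_by_suffix` is a hypothetical helper, and the set of suffixes is whatever the caller chooses.

```python
from pathlib import Path


def list_by_suffix(directory, suffixes):
    """Return sorted files in `directory` whose extension is in `suffixes` (case-insensitive)."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )
```

For example, `list_by_suffix(input_dir, {'.jpg', '.jpeg', '.png'})` would return JPEG and PNG files regardless of the case of their extensions.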

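#### Step 2: Define a function for text extraction

The following function runs pytesseract on each image in a sequence of page images and returns the combined text, with a header marking the start of each page.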

In [4]:
def img_to_text(images):
    """Run OCR on each page image and return the text with page headers."""
    extracted_text = []
    for pagenum, pagedata in enumerate(images, start=1):
        text = pytesseract.image_to_string(pagedata)
        extracted_text.append(f"--- Page {pagenum} ---\n")
        extracted_text.append(text + '\n')
    return "\n".join(extracted_text)
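
The page-joining logic can be checked without Tesseract installed if the OCR call is passed in as a parameter. The variant below is a sketch under that assumption: `pages_to_text` and its `ocr` parameter are hypothetical names, and the default still uses `pytesseract.image_to_string`.

```python
def pages_to_text(pages, ocr=None):
    """Join per-page OCR output under '--- Page N ---' headers.

    `ocr` is any callable mapping one page image to a string; it defaults
    to pytesseract.image_to_string, which is imported lazily.
    """
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string
    parts = []
    for pagenum, page in enumerate(pages, start=1):
        parts.append(f"--- Page {pagenum} ---\n")
        parts.append(ocr(page) + "\n")
    return "\n".join(parts)
```

The same parameter is a convenient place to supply Tesseract options, for example `functools.partial(pytesseract.image_to_string, lang='eng', config='--psm 6')` to treat each page as a single uniform block of English text.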


#### Step 3: Process images and PDFs in the input folder

##### A. Extract text from images

In [5]:
for file in image_files:
    img_filename = file.stem  # file name without its extension (.jpg or .jpeg)
    result = pytesseract.image_to_string(Image.open(file))

    # Write result to a txt file in the output directory
    with open(Path(output_dir) / f'pytess_{img_filename}.txt', 'w', encoding='utf-8') as f:
        f.write(result)

##### B. Extract text from PDFs

In [7]:
# Convert each PDF to page images, then run OCR on every page
for file in pdf_files:
    pdf_filename = file.stem  # file name without the .pdf extension
    images = pdf2image.convert_from_path(file)
    result = img_to_text(images)

    # Write result to a txt file in the output directory
    with open(Path(output_dir) / f'pytess_{pdf_filename}.txt', 'w', encoding='utf-8') as f:
        f.write(result)
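
Rendering every page of a long PDF at once can use a lot of memory. One possible refinement, sketched here, converts and processes a few pages at a time using the `first_page` and `last_page` parameters of `pdf2image.convert_from_path`. The helper names, the chunk size of 10 and the `dpi` of 200 are assumptions chosen for illustration.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def pdf_to_text_in_chunks(pdf_path, chunk_size=10, dpi=200):
    """OCR a PDF a few pages at a time to limit memory use (hypothetical helper)."""
    import pdf2image
    import pytesseract

    # pdfinfo_from_path reports document metadata, including the page count.
    total = pdf2image.pdfinfo_from_path(pdf_path)["Pages"]
    parts = []
    for first, last in page_ranges(total, chunk_size):
        images = pdf2image.convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for pagenum, image in enumerate(images, start=first):
            parts.append(f"--- Page {pagenum} ---\n")
            parts.append(pytesseract.image_to_string(image) + "\n")
    return "\n".join(parts)
```

A higher `dpi` generally gives Tesseract more detail to work with at the cost of larger images and slower processing, so the value is worth adjusting per document set.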