## Text Extraction Methods (OCR)
### Example 1: Using EasyOCR to Extract Text from Images or PDFs

EasyOCR is an open-source Python OCR library, comparable to PyTesseract, that extracts text from image files.

Key features:
- **Deep learning approach**: EasyOCR uses deep learning models for both stages: a CRAFT-based detector locates text regions, and a Convolutional Recurrent Neural Network (CRNN) with Connectionist Temporal Classification (CTC) decoding recognizes the text.
- **Handles noisy and complex images**: Recognizes text in challenging conditions such as noisy images, varying fonts, complex layouts, and distorted text.
- **Less preprocessing required**: Because the models learn from varied training data, less image preprocessing is usually needed before OCR.
- **GPU support**: Supports GPU acceleration for faster processing.

Combined with the pdf2image library, which renders each page of a PDF as an image, EasyOCR can also extract text from PDFs.

### Step 1: Set up libraries and global definitions

Install the easyocr and pdf2image libraries and their dependencies (pdf2image also requires the Poppler utilities to be installed on the system). Then create a **reader** for the language or set of languages expected in the text.

In [None]:
from pathlib import Path
from PIL import Image
import easyocr
import os
import pdf2image

# Create an OCR reader object (indicate which languages are expected)
reader = easyocr.Reader(['en'])
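
pdf2image does not render PDFs itself; it calls the Poppler command-line tools (`pdftoppm` or `pdftocairo`), which are installed separately from the Python package. The following is a minimal sketch of a pre-flight check, assuming the tools are expected on the system `PATH`. The helper name `poppler_available` is hypothetical and not part of pdf2image.

```python
import shutil

def poppler_available():
    """Return True if a Poppler rendering tool can be found on PATH."""
    # pdf2image shells out to pdftoppm (or pdftocairo), so one of them must be discoverable
    return any(shutil.which(tool) is not None for tool in ("pdftoppm", "pdftocairo"))

if not poppler_available():
    print("Poppler not found on PATH; install it before converting PDFs.")
```

If Poppler is installed in a location that is not on `PATH`, `pdf2image.convert_from_path` also accepts a `poppler_path` argument pointing at its `bin` directory.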

Define the input and output directories. Place source images and PDFs in the input directory.

In [None]:
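input_dir = './input/'
output_dir = './output/'
# Create the output directory if it does not exist yet, so later writes do not fail
os.makedirs(output_dir, exist_ok=True)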


Collect the image files and PDF files from the input directory.

In [None]:
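# Match both .jpg and .jpeg by suffix; build lists so the results can be iterated more than once
image_files = sorted(p for p in Path(input_dir).glob('*') if p.suffix.lower() in ('.jpg', '.jpeg'))
pdf_files = sorted(Path(input_dir).glob('*.pdf'))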
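
In glob patterns, square brackets match exactly one character from a set, so a bracket expression cannot express alternatives such as `jpg` or `jpeg`. The sketch below illustrates this with the standard `fnmatch` module, which follows the same bracket semantics, next to a suffix-based test that accepts both extensions. The file names and the helper `is_jpeg` are made up for the illustration.

```python
from fnmatch import fnmatch
from pathlib import Path

names = ["scan.jpg", "photo.jpeg", "notes.txt"]

# [g|eg] is a character class: it matches a single 'g', '|', or 'e'
bracket_matches = [n for n in names if fnmatch(n, "*.jp[g|eg]")]
print(bracket_matches)  # only "scan.jpg" matches

def is_jpeg(name):
    """Return True for names ending in .jpg or .jpeg, ignoring case."""
    return Path(name).suffix.lower() in {".jpg", ".jpeg"}

suffix_matches = [n for n in names if is_jpeg(n)]
print(suffix_matches)  # both JPEG names match
```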

### Step 2: Define text extraction function

Define a function that extracts text from a list of image files using EasyOCR.

In [None]:
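def img_to_text(files):
    """Run EasyOCR on each image path in `files` and return the text, one detection per line."""
    extracted_text = []
    for file in files:
        # readtext returns a list of (bounding box, text, confidence) entries
        result = reader.readtext(str(file))
        for detection in result:
            text = detection[1]
            extracted_text.append(text)
            print(text)

    full_text = "\n".join(extracted_text)
    return full_text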
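
Each entry returned by `reader.readtext` holds a bounding box, the recognized text, and a confidence score, and `img_to_text` keeps only the text. As a sketch of how the score could be used, the helper below drops low-confidence detections. The name `filter_by_confidence`, the 0.5 threshold, and the sample data are illustrative assumptions, not EasyOCR defaults.

```python
def filter_by_confidence(detections, min_conf=0.5):
    """Keep the text of detections whose confidence is at least min_conf."""
    return [text for _bbox, text, conf in detections if conf >= min_conf]

# Made-up detections in the (bounding box, text, confidence) layout
sample = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "Invoice", 0.97),
    ([[0, 30], [40, 30], [40, 50], [0, 50]], "l1|", 0.21),
    ([[0, 60], [90, 60], [90, 80], [0, 80]], "Total: 42", 0.88),
]

kept = filter_by_confidence(sample)
print(kept)
```

A suitable threshold depends on the image quality, so it is worth inspecting the scores on a few representative files before fixing a value.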


### Step 3: Process images and PDFs in input folder

#### A. Extract text from images

In [None]:
for file in image_files:
    # Use the stem so both .jpg and .jpeg extensions are stripped correctly
    img_filename = file.stem.strip()
    extracted_text = []
    result = reader.readtext(str(file))
    for detection in result:
        text = detection[1]  # each detection is (bounding box, text, confidence)
        extracted_text.append(text)

    full_text = "\n".join(extracted_text)
    print(full_text)

    # Write result to txt file
    with open(f'{output_dir}easyocr_{img_filename}.txt','w',encoding='utf-8') as f:
        f.write(full_text)

#### B. Extract text from PDFs

In [None]:
# Convert each page of every PDF to a JPEG image, then run OCR on the page images
for file in pdf_files:
    images = pdf2image.convert_from_path(file)
    pdf_filename = file.stem.strip()
    pg_images = []
    for pagenum, pagedata in enumerate(images):
        # Page images are saved to the output directory, numbered from 1
        image_filename = os.path.join(output_dir, f'{pdf_filename}_{pagenum+1}.jpg')
        pagedata.save(image_filename, 'JPEG')
        pg_images.append(image_filename)

    # Extract text from image version of pdf pages using EasyOCR
    result = img_to_text(pg_images)

    # Write result to txt file
    with open(f'{output_dir}easyocr_{pdf_filename}.txt','w',encoding='utf-8') as f:
        f.write(result)
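
The loop above concatenates the text of all pages into one string, so page boundaries are lost in the output file. If they matter, one option is to run OCR on each page separately and join the results with a marker line. A minimal sketch, where `join_pages` and the marker format are hypothetical choices:

```python
def join_pages(page_texts):
    """Join per-page text, prefixing each page with a marker line."""
    parts = []
    for pagenum, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {pagenum} ---\n{text}")
    return "\n\n".join(parts)

# Stand-in strings; in the notebook each would come from img_to_text([page_image])
print(join_pages(["first page text", "second page text"]))
```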