## Text Extraction Methods (OCR)
### Example 1: Using EasyOCR to Extract Text from Images or PDFs

EasyOCR is an open-source Python library (like PyTesseract) for extracting text from image files.

Key features:
- **Deep learning approach:** EasyOCR uses deep learning models: the CRAFT algorithm for text detection, and a Convolutional Recurrent Neural Network (CRNN) with Connectionist Temporal Classification (CTC) decoding for text recognition.
- **Handles noisy and complex images:** Designed to recognize text in challenging conditions such as noisy images, varying fonts, complex layouts, and distorted text.
- **Less preprocessing required:** The deep learning approach typically needs less image preprocessing than traditional OCR engines.
- **GPU support:** Supports GPU acceleration for faster processing.

Combined with the pdf2image library, EasyOCR can also process PDFs: pdf2image renders each page of a PDF as an image, and EasyOCR then extracts the text from those images.

#### Step 1: Set up libraries and global definitions

Install the easyocr and pdf2image libraries and their dependencies (pdf2image relies on Poppler being installed on the system). Then create a **reader** for the language or set of languages you expect in the text.

In [17]:
from pathlib import Path
from PIL import Image
import easyocr
import os
import pdf2image

# Create an OCR reader object (indicate which languages are expected)
reader = easyocr.Reader(['en'])

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
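
The message above appears because no GPU backend was found, so EasyOCR falls back to the CPU. As a minimal sketch, the `gpu` argument of `easyocr.Reader` can also be set explicitly. The helper names `cuda_available` and `make_reader` are hypothetical, and the probe below checks CUDA only (not MPS), which is an assumption made for brevity.

```python
def cuda_available():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch  # EasyOCR's models run on PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def make_reader(languages=("en",)):
    """Build an EasyOCR reader, requesting the GPU only when CUDA is present."""
    import easyocr  # imported lazily so the probe above works without it
    return easyocr.Reader(list(languages), gpu=cuda_available())
```

Calling `make_reader()` would then take the place of `easyocr.Reader(['en'])` in the cell above.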


Define the input and output directories. Place source images and PDFs in the input directory.

In [18]:
input_dir = './input/'
output_dir = './output/'
# Create the output directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)


Collect the image files and PDF files from the input directory.

In [19]:
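# A glob bracket expression matches a single character, so filter on the suffix to catch both .jpg and .jpeg
image_files = sorted(p for p in Path(input_dir).iterdir() if p.suffix.lower() in ('.jpg', '.jpeg'))
pdf_files = sorted(Path(input_dir).glob('*.pdf'))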

### Step 2: Define text extraction function

Define a function that extracts text from a list of image files using EasyOCR.

In [20]:
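def img_to_text(files):
    """Run EasyOCR on each image path in files and return the text, one detection per line."""
    extracted_text = []
    for file in files:
        # readtext returns a (bounding box, text, confidence) entry for each detected region
        result = reader.readtext(str(file))
        for detection in result:
            text = detection[1]
            extracted_text.append(text)
            print(text)

    full_text = "\n".join(extracted_text)
    return full_text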
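
With its default settings, `reader.readtext` returns one entry per detected text region in the form (bounding box, text, confidence), which is why the function reads `detection[1]`. The following sketch shows one way the confidence value could be used to drop unreliable detections. The helper name `keep_confident`, the 0.4 threshold, and the sample data are all made up for illustration and are not real OCR output.

```python
def keep_confident(detections, min_confidence=0.4):
    """Return the text of detections whose confidence meets the threshold.

    Each detection is assumed to be a (bbox, text, confidence) triple,
    the shape readtext returns at its default detail level.
    """
    return [text for _bbox, text, conf in detections if conf >= min_confidence]


# Hypothetical detections, hand-written for illustration
sample = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "School of Information", 0.93),
    ([[0, 30], [40, 30], [40, 50], [0, 50]], "S30", 0.21),
    ([[0, 60], [90, 60], [90, 80], [0, 80]], "Irving K. Barber Learning Centre", 0.88),
]

print("\n".join(keep_confident(sample)))
```

A suitable threshold depends on the source material, so it is worth inspecting a few low-confidence detections before discarding them.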


### Step 3: Process images and PDFs in input folder

#### A. Extract text from images

In [22]:
for file in image_files:
    img_filename = file.stem  # file name without its extension
    extracted_text = []
    result = reader.readtext(str(file))
    for detection in result:
        text = detection[1]
        extracted_text.append(text)

    full_text = "\n".join(extracted_text)
    print(full_text)

    # Write result to txt file
    with open(os.path.join(output_dir, f'easyocr_{img_filename}.txt'), 'w', encoding='utf-8') as f:
        f.write(full_text)



Create
Iink back to your website: You want to direct them
specific page or location within your website or
Do not just direct them back to your home page:
You must Iink them to
specific area
<a href="
http: / /www lovelvlongbeachhomes com/bixby_knolls shtml
">Bixby Knolls</a>
blog:


#### B. Extract text from PDFs

In [23]:
# Convert each PDF into one JPEG image per page
for file in pdf_files:
    images = pdf2image.convert_from_path(file)
    pdf_filename = file.stem  # file name without its extension
    pg_images = []
    for pagenum, pagedata in enumerate(images):
        image_filename = os.path.join(output_dir, f'{pdf_filename}_{pagenum+1}.jpg')
        pagedata.save(image_filename, 'JPEG')
        pg_images.append(image_filename)

    # Extract text from image version of pdf pages using EasyOCR
    result = img_to_text(pg_images)

    # Write result to txt file
    with open(os.path.join(output_dir, f'easyocr_{pdf_filename}.txt'), 'w', encoding='utf-8') as f:
        f.write(result)



Ada Limon, THE CARRYING
DANDELION INSOMNIA
The big-ass bees are back, tipsy, sun drunk
and
with thick knitted
warmers
of pollen. I was up all night
So
today"
S
yellow hours seem strange and hallucinogenic.
The
neighborhood is
with mowers, crazy
and
mending what winter ruined.
What I can't
over is
something simple, easy:
How could a dandelion seed head seemingly
grow
overnight?
A
neighbor
mows the
and bam, the next morning; there'$ a hundred
dandelion seed heads straight as arrows
and proud as cats high above any green blade
of manicured grass. It must
some
folks,
a
flower so
tricky it can reproduce asexually,
making perfect identical selves, bam, another me,
bam, another me. I can't help it-[ root
for that persecuted rosette so hyper in its
own
making it seems to devour the land.
Even its name, translated from the French
dent de
means lion'$ tooth: It'$ vicious,
made for a time that requires tenacity,a way
of remaking the toughest self while everyone
else is asleep:
heavy
leg -
again
l



UBC
THE
UNIVERSITY
BRITISH
COLUMBIA
School of Information
Irving K. Barber Learning Centre
470-1961 East Mall
Vancouver; BC Canada V6T 1ZI
Phone 604 822 2404
Fax 604 822 6006
ischool.ubcca
PERSONAL AND CONFIDENTIAL
April 28,2025
Neil Aitken
neil.aitken@ubc.ca
RE: Offer_Letter for Graduate_Research Assistant position
Dear Neil,
We are pleased to offer you the position of Graduate Research Assistant in the iSchool This position
provides an hourly rate of S30 plus a 4 % vacation entitlement and will commence upon May 1st, 2025 on
a term basis which will conclude upon June 30th, 2025. You will be working a maximum of 4 hours per
week: Please note that your employment is subject to you being physically located within BC during the
term of your
appointment. Please inform us immediately if your physical location changes during the
term of the appointment:
You will work under the
supervision of Dr. Fatemeh Salehian Kia.
As a Graduate Academic Assistant, your duties will include:
MyLA Deploymen



UBC
THE
UNIVERSITY
BRITISH
COLUMBIA
School of Information
Irving K. Barber Learning Centre
470-1961 East Mall
Vancouver; BC Canada V6T 1ZI
Phone 604 822 2404
Fax 604 822 6006
ischool.ubc ca
British   Columbia' s
Workers   Compensation
Act
includes   policies addressing
workplace
bullying
&
harassment: The BC Government requires that all UBC Faculty and Staff receive training on how to
recognize, prevent, and address workplace bullying and harassment: If you have not completed the
training modules previously, you are required to complete the online training module before
begin
your
appointment Please take a moment to do s0 and provide a copy of the certificate of completion,
ideally, at the same time as you provide acceptance of this offer: For more information, please visit:
https g[bullyingandharassment ubc caltraining-eventsL
Once you have reviewed and agreed to these terms and conditions of employment, please sign and return
this letter by April
2025 to Nicole Chan at ischooLassista
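
The output above often splits a single printed line into several fragments, because each detected region is written on its own line. `readtext` has a `paragraph=True` option that merges nearby regions. As an alternative, the sketch below regroups detections by the vertical position of their bounding boxes. It assumes (bbox, text, confidence) triples whose bbox is a list of four [x, y] corner points. The helper name `group_into_lines`, the pixel tolerance, and the sample coordinates are hypothetical.

```python
def group_into_lines(detections, y_tolerance=10):
    """Merge detections whose vertical centres lie within y_tolerance pixels
    of the first fragment of the current line."""
    def y_center(bbox):
        ys = [point[1] for point in bbox]
        return (min(ys) + max(ys)) / 2

    def x_left(bbox):
        return min(point[0] for point in bbox)

    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for bbox, text, _conf in sorted(detections, key=lambda d: y_center(d[0])):
        y = y_center(bbox)
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x_left(bbox), text))
        else:
            lines.append([y, [(x_left(bbox), text)]])

    # Within each line, order fragments left to right before joining
    return [" ".join(t for _x, t in sorted(frags)) for _y, frags in lines]


# Hypothetical detections: three fragments on one line, one on the next
sample = [
    ([[200, 12], [260, 12], [260, 30], [200, 30]], "line", 0.90),
    ([[10, 10], [40, 10], [40, 28], [10, 28]], "first", 0.95),
    ([[50, 11], [190, 11], [190, 29], [50, 29]], "printed", 0.90),
    ([[10, 40], [250, 40], [250, 58], [10, 58]], "second printed line", 0.90),
]

print("\n".join(group_into_lines(sample)))
```

The tolerance would need tuning to the page resolution, and multi-column layouts would need extra handling that this sketch does not attempt.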