# Image Text Extraction and Information Retrieval

## The Question

Photographing typed or hand-written text, converting it into machine-readable text, and then searching it is an interesting and useful capability.
**How can we use Python image processing and information retrieval libraries to extract text from images and search the results with specific queries?**

## The Data

Our data consists of **typed and hand-written document images and scans**. We assembled the dataset by combining several existing datasets. It contains approximately **1509 data points.**

## The Approach

Because these images vary in document style and structure, we want to see how well the image text extraction library converts each document into machine-readable text, which in turn lets us search through the documents. We will not implement text extraction or TF-IDF from scratch. Instead, we use existing libraries to see how these machine learning and natural language processing concepts work in practice. The libraries we will be using are **PIL**, **Pytesseract**, **pandas**, and **Elasticsearch**.
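To make the TF-IDF idea concrete, here is a minimal sketch of how a query term can be scored against a few documents. It is an illustration only, not what the search backend runs internally (Elasticsearch's default ranking is BM25, a refinement of the same idea). The function names, the whitespace tokenizer, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical stand-ins for extracted document text.
docs = [
    "depreciation of listed property",
    "handwritten letter about property taxes",
    "meeting notes from tuesday",
]
print(tf_idf_scores("depreciation", docs))
```

Only the first document contains the query term, so it is the only one with a non-zero score. A term that appears in every document gets an IDF of zero, which is why common words contribute little to the ranking.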

## Our Hypothesis

Since some characters share similar features with other characters, and handwritten characters are less consistent than typed characters, we expect a higher accuracy rate when we convert images with typed text to regular text than when we convert images with handwritten text to regular text.
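One way to test this hypothesis is to compare each extracted string with a reference transcription and compute a similarity score. The sketch below uses `difflib.SequenceMatcher` from the standard library. It assumes that reference transcriptions are available for the images, which this write-up does not confirm, and the sample strings are invented for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(ocr_text, reference_text):
    """Return a 0..1 similarity ratio between OCR output and a reference."""
    # Collapse whitespace so layout differences do not dominate the score.
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical examples: a clean typed result and a noisier handwritten one.
reference = "Do you have evidence to support the business use claimed?"
typed_guess = "Do you have evidence to support the business use claimed?"
handwritten_guess = "Do yov have evidenre to supp0rt the busines use clained?"

print("typed:", text_similarity(typed_guess, reference))
print("handwritten:", text_similarity(handwritten_guess, reference))
```

Averaging this ratio separately over the typed and the handwritten images would give two numbers that can be compared directly.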

## Why Bother?

Teaching a machine to recognize typed and hand-written documents from pictures and scans, and to search through them, is an important and useful application of machine learning.

One application is scanning paper documents into a database so that the information inside them can be searched.

## Basic Setup

### PIL, Pytesseract

We first load the libraries. **PIL** lets us open and process the images. **Pytesseract** converts the text in an image into a text string. Pytesseract is a wrapper around the Tesseract OCR engine, which must be installed separately.

In [4]:
!pip install pillow
!pip install pytesseract
!pip install googletrans
!pip install pandas
!pip install elasticsearch



In [5]:
import csv
import os

import pandas as pd

# converts the text in an image into a text string
import pytesseract

from elasticsearch import Elasticsearch, helpers

# adds image processing capabilities
from PIL import Image

# assumes an Elasticsearch node is running locally on the default port
es = Elasticsearch('http://localhost:9200')

## Convert Documents

In [6]:
# lists for storing each image path and its extracted text
images = []
results = []

def extract_text(folder, extensions):
    # run OCR on every file in the folder that has one of the given extensions
    for filename in os.listdir(folder):
        if filename.endswith(extensions):
            path = folder + '/' + filename
            images.append(path)
            # open the image and convert its text into a string
            with Image.open(path) as img:
                results.append(pytesseract.image_to_string(img))

extract_text('Data/Typed', ('.png',))
extract_text('Data/Written', ('.jpg', '.png'))
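OCR output on forms and scans often contains long runs of blank or whitespace-only lines. A cleanup step before storing the text could look like the sketch below. The function name `clean_ocr_text` and the sample string are assumptions for this example, and the notebook itself stores the raw output unchanged.

```python
import re


def clean_ocr_text(text):
    """Trim each line and collapse runs of blank lines into one blank line."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more newlines in a row means at least two blank lines; keep one.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


# Hypothetical raw OCR string with whitespace-only lines.
raw = "Form 4562 (1988)\n\n \n\n   \n\nDo you have evidence?\n  \n"
print(clean_ocr_text(raw))
```

Applied as `results.append(clean_ocr_text(result))`, this would keep the stored text and the printed search results compact without changing any recognized words.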

## Store in Datastore

In [7]:
df = pd.DataFrame({
        'image': images,
        'result': results
})

df.to_csv('output.csv', index=False)

## Elasticsearch

In [12]:
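# drop any index left over from a previous run (skipped on the first run)
if es.indices.exists(index='documents'):
    es.indices.delete(index='documents')

with open('output.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='documents')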
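The bulk helper accepts any iterable of actions, so the CSV rows can also be turned into explicit actions first. The sketch below gives each document a stable id based on its image path, so that re-indexing overwrites entries instead of duplicating them. The generator name `csv_to_actions` and the in-memory sample row are assumptions for illustration.

```python
import csv
import io


def csv_to_actions(csv_file, index_name):
    """Yield one bulk action per CSV row, using the image path as the id."""
    for row in csv.DictReader(csv_file):
        yield {
            "_index": index_name,
            "_id": row["image"],
            "_source": {"image": row["image"], "result": row["result"]},
        }


# In-memory stand-in for output.csv so the sketch runs without any files.
sample = io.StringIO("image,result\nData/Typed/a.png,hello world\n")
actions = list(csv_to_actions(sample, "documents"))
print(actions)
```

With a running cluster, the same generator could be passed straight to the helper, for example `helpers.bulk(es, csv_to_actions(f, 'documents'))`.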



In [None]:
while True:
    search = input("Search (to exit type 'exit'): ")

    if search.strip().lower() == 'exit':
        break

    print('Search: ' + search)
    print()
    result = es.search(
        index="documents",
        body={
            "query": {
                "multi_match": {
                    "query": search,
                    "fields": ["result"]
                }
            }
        }
    )

    # show each matching image along with its extracted text
    for hit in result['hits']['hits']:
        print("Image:", hit['_source']['image'])
        with Image.open(hit['_source']['image']) as im:
            im.show()
        print("Result:", hit['_source']['result'])

Search: yes

Image: Data/Written/r0098_06.png
Result: Page 2

tcp tsetse enim tte tenet

Form 4562 (1988)

FAUT Automobiles, Certain Other Vehicles, Computers, and Property Used for Entertainment, Recreation, or
Amusement (Listed Property).
if you are using the standard mileage rate or deducting vehicle lease expense, complete columns (a) through (d) of Section A, all of
Section B, and Section C if applicable.
Section A.—Depreciation (If automobiles and other listed property placed in service after June 18, 1984, are used 50% or
less in a trade or business, the section 179 deduction is not allowed and depreciation must be
taken using the straight line method over 5 years. For other limitations, see instructions.)

Yes LJ No It “Yes,” is the evidence written? Yes AI No

Do you have evidence to support the business use claimed?

Business (d) Cost or . api woe.
ope (b) Date () ‘ | ’ (e) Basis 
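The query body and the result handling from the search loop can be factored into small functions, which makes them easy to try without a running cluster. This is a sketch under assumptions: the helper names `build_query` and `summarize_hits` are invented here, and the response dictionary is hand-written in the shape the search loop reads, with a made-up score.

```python
def build_query(text, fields=("result",)):
    """Build the multi_match request body used by the search loop."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


def summarize_hits(response, preview_chars=80):
    """Return (image path, score, one-line preview) for each search hit."""
    summaries = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        # Join on single spaces so blank OCR lines do not fill the preview.
        preview = " ".join(source["result"].split())[:preview_chars]
        summaries.append((source["image"], hit.get("_score"), preview))
    return summaries


# Hand-written response for illustration only; the score is made up.
fake_response = {
    "hits": {
        "hits": [
            {
                "_score": 1.2,
                "_source": {
                    "image": "Data/Written/r0098_06.png",
                    "result": "Page 2\n\nForm 4562 (1988)\n\n \n\nYes LJ No",
                },
            }
        ]
    }
}

print(build_query("yes"))
print(summarize_hits(fake_response))
```

Printing a short preview per hit instead of the full extracted text keeps long forms like the one above from flooding the output.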




## Voila!