This code pattern demonstrates a methodology for deriving insights from scanned documents that have information organized into various sections or layouts.
Some of the scenarios encountered regularly in business are:
- A real estate company scans newspaper classifieds to extract the individual classifieds. The classifieds appear in different layouts depending on the newspaper.
- In a bank, the supporting documents for a loan are scanned and uploaded, and relevant information must be extracted from them for further processing. Here the information is organized into sections, and the information we want is in one of the sections.
- Employers get scanned copies of certificates, government IDs, and other documents from employees joining the organization. These documents are verified for the details they contain.
- In hospitals, medical records are often scanned and stored, and later taken up for auditing and analysis. In such a scenario, we need to extract specific information from the records for analysis.
When we use an OCR tool directly on an image, we get text that is aggregated and mixed from the various sections. In the scenarios described above, we need to understand the layout and then retrieve only the information from the individual sections.
Let us take the first scenario of the real estate company. The real estate classifieds are usually laid out in columns as shown below. Here, the individual classifieds must be identified and then the text content extracted.
Let us build a solution for this scenario to demonstrate our code pattern methodology. We will take a regional newspaper where the classifieds appear in the Hindi language. This methodology can be extended to any other regional language.
We need to perform the following steps on the scanned image:
- Extract the individual classified information laid out in columns and rows
- Convert it to text format, translate if required, and extract the required information.
This solution is applicable to other use-case scenarios where we need to extract only relevant portions of text in images and get insights on them.
When the reader has completed this Code Pattern, they will understand how to:
- Containerize OpenCV, Tesseract, and a Cloud Object Storage client using an Appsody stack, and deploy them on an OpenShift cluster on IBM Cloud.
- Pre-process images to separate them into different sections using OpenCV.
- Use Tesseract to extract text from an image.
- Use Watson Language Translator to translate the text from Hindi to English.
- Use Watson Natural Language Understanding to derive insights on the text.
- The classifieds image is stored in Object Storage, and the Jupyter Notebook execution is triggered.
- The Object Storage operations microservice is invoked.
- The classifieds image is retrieved from Object Storage.
- The image pre-processor service is invoked. The different sections in the image are identified and extracted into separate images, each containing a single classified.
- Each individual classified image is sent to the text extractor service, where the address text is extracted.
- The extracted address text is sent to Watson Language Translator, where the content is translated to English.
- The translated English text is sent to Watson Natural Language Understanding, where the entities of interest are extracted to generate the required insights.
- IBM Cloud account: Create an IBM Cloud account.
- Python 3: Install Python 3.
- Jupyter Software: Install Jupyter Software.
- Appsody CLI: Install Appsody CLI.
Follow the steps below to set up and run this code pattern.
- Clone the repo
- Create text extractor service
- Create image pre-processor service
- Create object storage operations service
- Set up Watson Language Translator
- Set up Watson Natural Language Understanding
- Run locally
- Deploy and run on cloud
- Analyze the results
Clone this git repo. In a terminal, run:
$ git clone https://github.com/IBM/process-images-derive-insights
Refer to the following steps in the tutorial Create a custom Appsody stack with support for Python Flask and Tesseract to create the Appsody stack:
a. Create a copy of an Appsody Python Flask stack
b. Modify the Python Flask stack to add support for Tesseract
Create a new empty folder, say `text_extractor`. Create an Appsody project inside the newly created folder by running the commands below:
$ cd text_extractor
$ appsody init dev.local/python-flask-tesseract
Replace the file `__init__.py` under the `text_extractor` folder with the file `__init__.py` under the folder `sources/text_extraction/` in this repo that you have cloned.
Go to the `text_extractor` folder and run the commands below:
$ appsody build
$ appsody run -p 3501:8080 -p 3502:5678
Open the URL http://localhost:3501/home. If it says `Your text extraction application test is successful`, then your application is working fine.
Press Ctrl+C on the terminal to stop the running server.
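For orientation, the core of such a service is a Flask route that runs Tesseract on an uploaded image. Below is a minimal sketch, assuming pytesseract and the Hindi traineddata are available in the stack; the endpoint name and request field are illustrative assumptions, not the repo's actual interface (which lives in `sources/text_extraction/__init__.py`):

```
# Minimal sketch only -- the repo's __init__.py is the actual implementation.
import pytesseract
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

@app.route('/extract', methods=['POST'])  # hypothetical endpoint name
def extract():
    # Read the uploaded classified image from a multipart form field.
    image = Image.open(request.files['file'].stream)
    # lang='hin' loads Tesseract's Hindi traineddata (Devanagari script).
    text = pytesseract.image_to_string(image, lang='hin')
    return jsonify({'text': text})
```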
Refer to the following steps in the tutorial Create a custom Appsody stack with support for Python Flask and OpenCV to create the Appsody stack:
a. Create a copy of an Appsody Python Flask stack
b. Modify the Python Flask stack to add support for OpenCV
Note: Specify the version of opencv-python in the install command: `pip install opencv-python==4.1.2.30`. This change is needed in both `Dockerfile-stack` under the `image` folder and `Dockerfile` under the `image/project` folder.
Create a new empty folder, say `image_preprocessor`. Create an Appsody project inside the newly created folder by running the commands below:
$ cd image_preprocessor
$ appsody init dev.local/python-flask-opencv
$ mkdir images
Replace the files `__init__.py` and `improcess.py` under the `image_preprocessor` folder with the files `__init__.py` and `improcess.py` under the folder `sources/image_preprocessor/` in this repo that you have cloned.
Go to the `image_preprocessor` folder and run the commands below:
$ appsody build
$ appsody run -p 4501:8080 -p 4502:5678
Open the URL http://localhost:4501/home. If it says `Your image preprocessor application test is successful`, then your application is working fine. Press Ctrl+C on the terminal to stop the running server.
Refer to the following steps in the tutorial Create a custom Appsody stack with template for IBM Cloud Object Storage operations to create the Appsody stack:
a. Create a copy of an Appsody Python Flask stack
b. Modify the Python Flask stack to add support for Object Storage operations
Create an instance of IBM Cloud Object Storage.
Create credentials for the newly created Cloud Object Storage service:
- Click on `Credentials`.
- Click on `New Credential`. Make a note of the newly created credential in JSON format.

Create a bucket and upload an image to the bucket:
- Click on `Buckets`.
- Click on `Create bucket`.
- Select `Standard` under `Predefined buckets`.
- Give a name under `Unique bucket name`, like `classifieds`, and click on `Next`.
- Click on `Browse and upload` and upload the image `newspaper_hindi.jpg` under `sources/object_storage_operations/` in this repo that you have cloned, then click on `Next`.
- Click on `Next` in the `Test bucket out` section.
- In the `Summary` section, click on `View Buckets` and search for your bucket `classifieds` in the search tab.
- When you click on your bucket, it should show the `newspaper_hindi.jpg` image in the bucket.
- You have now successfully created a bucket and uploaded an image.
Create a new empty folder, say `object_storage_operations`. Create an Appsody project inside the newly created folder by running the commands below:
$ cd object_storage_operations
$ appsody init dev.local/python-flask-os ostemplate
The files `Pipfile`, `osclient.py`, and `config.ini` are created under the `object_storage_operations` folder.
Replace the file `__init__.py` under the `object_storage_operations` folder with the file `__init__.py` under the folder `sources/object_storage_operations/` in this repo that you have cloned. This `__init__.py` exposes REST interfaces to test the Cloud Object Storage operations.
Modify the contents of `config.ini` with the credentials that we created earlier. The relevant portions of the credential JSON noted on IBM Cloud are entered into `config.ini`.
Modify the `COS_BUCKET_LOCATION` appropriately. The list of valid location constraints can be found here.
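As a purely hypothetical illustration of the shape of this file (use the key names actually generated by the template, and the values from your credential JSON):

```
; Hypothetical field names -- match them to the keys in your generated config.ini.
[COS]
COS_API_KEY = <apikey from the credential JSON>
COS_RESOURCE_INSTANCE_ID = <resource_instance_id from the credential JSON>
COS_ENDPOINT_URL = <endpoint for your region, e.g. https://s3.us-south.cloud-object-storage.appdomain.cloud>
COS_BUCKET_LOCATION = <a valid location constraint for your region>
```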
Go to the `object_storage_operations` folder and run the commands below:
$ appsody build
$ appsody run -p 5501:8080 -p 5502:5678
Open the URL http://localhost:5501/home. If it says `Your object storage operations application test is successful`, then your application is working fine. Press Ctrl+C on the terminal to stop the running server.
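For orientation, retrieving an object with the ibm-cos-sdk typically looks like the sketch below. This is not the repo's actual `osclient.py`; the config keys follow the hypothetical `config.ini` sketch above:

```
# Minimal sketch only -- osclient.py in the repo is the actual implementation.
import configparser
import ibm_boto3
from ibm_botocore.client import Config

cfg = configparser.ConfigParser()
cfg.read('config.ini')

cos = ibm_boto3.client(
    's3',
    ibm_api_key_id=cfg['COS']['COS_API_KEY'],
    ibm_service_instance_id=cfg['COS']['COS_RESOURCE_INSTANCE_ID'],
    config=Config(signature_version='oauth'),
    endpoint_url=cfg['COS']['COS_ENDPOINT_URL'],
)

# Download the classifieds image uploaded to the bucket earlier.
cos.download_file(Bucket='classifieds', Key='newspaper_hindi.jpg',
                  Filename='newspaper_hindi.jpg')
```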
Create an instance of IBM Language Translator. Copy the credentials, both `API Key` and `URL`, and make a note of them. We will use them in step 7.
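For orientation only, this is roughly how such credentials are used with the ibm-watson Python SDK (whether the notebook uses this exact SDK call is an assumption; the version date and sample text are illustrative):

```
# Minimal sketch, not the notebook's actual code.
from ibm_watson import LanguageTranslatorV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('<API Key>')
translator = LanguageTranslatorV3(version='2018-05-01', authenticator=authenticator)
translator.set_service_url('<URL>')

# Translate extracted Hindi text to English with the hi-en model.
result = translator.translate(text='<extracted address text>',
                              model_id='hi-en').get_result()
print(result['translations'][0]['translation'])
```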
Create an instance of IBM Natural Language Understanding. Copy the credentials, both `API Key` and `URL`, and make a note of them. We will use them in step 7.
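Similarly, for orientation only, an entities call with the ibm-watson Python SDK looks roughly like this (again an assumption about the notebook's exact code; the version date is illustrative):

```
# Minimal sketch, not the notebook's actual code.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('<API Key>')
nlu = NaturalLanguageUnderstandingV1(version='2019-07-12',
                                     authenticator=authenticator)
nlu.set_service_url('<URL>')

# Extract entities (such as locations) from the translated address text.
response = nlu.analyze(text='<translated address text>',
                       features=Features(entities=EntitiesOptions())).get_result()
print(response['entities'])
```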
Go to the `text_extractor` folder and run the command below:
$ appsody run -p 3501:8080 -p 3502:5678
Go to the `image_preprocessor` folder and run the command below:
$ appsody run -p 4501:8080 -p 4502:5678
Go to the `object_storage_operations` folder and run the command below:
$ appsody run -p 5501:8080 -p 5502:5678
- Launch Jupyter Notebook.
- Upload `process_image_insights.ipynb`, which is under the `notebook` folder in this repo that you have cloned.
- Go to section 1.3 in the notebook.
- Specify the URLs for the `text_extractor` (http://localhost:3501), `image_preprocessor` (http://localhost:4501), and `object_storage_operations` (http://localhost:5501) services.
- Fill in the Language Translator and Natural Language Understanding credentials that you noted in step 5 and step 6.
- Click on `Run All` under the `Cell` tab. Alternatively, you can run the notebook cell by cell by clicking on `Run` on each cell.
8.1 Log in to your IBM Cloud account.
```
ibmcloud login
```
8.2 Add a namespace to create your own image repository. Replace <my_namespace> with your preferred namespace.
```
ibmcloud cr namespace-add <my_namespace>
```
8.3 List the namespaces to verify that yours was added.
```
ibmcloud cr namespace-list
```
8.4 Tag the docker images built earlier by Appsody. Replace the placeholders for region and namespace with your container registry region and the namespace created earlier.
```
docker tag dev.local/text-extractor:latest <region>.icr.io/<my_namespace>/text-extractor:latest
docker tag dev.local/image-preprocessor:latest <region>.icr.io/<my_namespace>/image-preprocessor:latest
docker tag dev.local/object-storage-operations:latest <region>.icr.io/<my_namespace>/object-storage-operations:latest
```
8.5 Push the docker images into your namespace. Replace the placeholders for region and namespace with your container registry region and the namespace created earlier.
```
ibmcloud cr login
docker push <region>.icr.io/<my_namespace>/text-extractor:latest
docker push <region>.icr.io/<my_namespace>/image-preprocessor:latest
docker push <region>.icr.io/<my_namespace>/object-storage-operations:latest
```
Create an OpenShift cluster here. Log in to the cluster from a terminal, for example with the `oc login` command copied from the OpenShift web console.
Create three deployment configuration files: `text_extractor_deploy.yaml`, `image_preprocessor_deploy.yaml`, and `object_storage_operations_deploy.yaml`. In each file, replace the region, namespace, and service name placeholders with your container registry region, the namespace created earlier, and the service name (text-extractor / image-preprocessor / object-storage-operations).
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <service name>-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <service name>
  template:
    metadata:
      labels:
        app: <service name>
    spec:
      containers:
      - name: <service name>
        image: <region>.icr.io/<namespace>/<service name>:latest
        ports:
        - containerPort: 8080
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: <service name>
  name: <service name>
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: web
  selector:
    app: <service name>
  type: ClusterIP
```
Apply the deployment configurations:
```
oc apply -f text_extractor_deploy.yaml
oc apply -f image_preprocessor_deploy.yaml
oc apply -f object_storage_operations_deploy.yaml
```
Expose the services:
```
oc expose service/text-extractor
oc expose service/image-preprocessor
oc expose service/object-storage-operations
```
Check the routes created for the services:
```
oc get routes
```
You will see the routes to the services as shown below:
```
NAME                        HOST/PORT                                                                  PATH   SERVICES                    PORT   TERMINATION   WILDCARD
text-extractor              text-extractor-default...us-south.containers.appdomain.cloud                      text-extractor              web                  None
image-preprocessor          image-preprocessor-default...us-south.containers.appdomain.cloud                  image-preprocessor          web                  None
object-storage-operations   object-storage-operations-default...us-south.containers.appdomain.cloud           object-storage-operations   web                  None
```
Note down the URLs under the `HOST/PORT` column for the services as shown.
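Optionally, you can sanity-check each deployed service by hitting its `/home` test endpoint on the noted route, for example:

```
curl http://<text-extractor route host>/home
```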
Create an instance of Watson Studio here. Click `Create`.

Create a new project `insights` by selecting `New Project`.

Select `Add to project` and `Notebook`.

Import the notebook `process_image_insights.ipynb`, which is under the `notebook` folder in this repo that you have cloned. Click `Create notebook`.
Go to section 1.3 in the notebook.
- Specify the URLs for the `image_preprocessor`, `text_extractor`, and `object_storage_operations` services using the routes noted earlier.
- Fill in the Language Translator and Natural Language Understanding credentials that you noted in step 5 and step 6.
- Click on `Run All` under the `Cell` tab. Alternatively, you can run the notebook cell by cell by clicking on `Run` on each cell.
In section 2.1 of the notebook, we have retrieved the required image `newspaper_hindi.jpg` from Cloud Object Storage.
In section 3.1 of the notebook, we have pre-processed the image using OpenCV to detect the different addresses in the newspaper.
In section 4.1 of the notebook, we have used Tesseract to extract all the different addresses that were detected during pre-processing.
In section 5.1 of the notebook, we have translated all the extracted addresses using the Watson Language Translator service. This will help us in data processing and analytics.
In section 6.1 of the notebook, we have used the Watson Natural Language Understanding service to derive insights from the data.
In our case, the insights uncover that 50% of the detected addresses are from the state of Karnataka, 25% are from West Bengal, and the remaining 25% are from Maharashtra. We can also choose to see a comparison between two states, for example between Karnataka and Maharashtra.
In section 7.1 of the notebook, we can search for addresses in the newspaper based on the location name. Type a location in the text box and press Enter. It will return all the addresses in the newspaper from the specified locality.
Let's type `Karnataka` and press Enter.
Similarly, let's now try `West Bengal` and press Enter.
- Create a custom Appsody stack with support for Python Flask and Tesseract
- Create a custom Appsody stack with support for Python Flask and OpenCV
- TODO - Add Object storage tutorial link
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.