Create a low cost custom scan box #73

tshrinivasan · 2019-08-30T01:22:33Z

We need to create a low cost custom scan box, so that we can scan books easily.

The existing scanners are costly.

SV600 - 47,000 INR
https://www.amazon.in/Fujitsu-PA03641-B301-ScanSnap-SV600-Scanner/dp/B01AJI0426

CZUR ET 16 Plus Smart Book Scanner - 56,000 INR
https://www.amazon.in/CZUR-Plus-Smart-Scanner-Black/dp/B0758VY4G7

Instead of these, we can make a scan box like:
https://www.kickstarter.com/projects/limemouse/scanbox-turn-your-smartphone-into-a-portable-scann

https://www.amazon.com/Scanner-Bin-Document-Scanning-Solution/dp/B00XM7LKZM

tshrinivasan · 2019-08-30T01:25:15Z

Created the above box.

Size = 1.5 x 1.5 x 1.5 feet

Expenses so far (plywood, carpenter charges): Rs 2100

Pending items

Making a 2.5 inch hole on the top
Painting inside and outside
Adding 4 LED lamps on inner side of TOP
Getting a thick non reflective glass to press the big books

Expected cost: Rs 3000

With this, anyone can scan a book with a normal digital camera or mobile phone.
This is portable.

Once this is done, we can make further improvements in the next scan boxes.

gnuanwar · 2019-08-30T03:55:31Z

Srini, check Kalyan's recommendations:
[30/08, 7:42 AM] kalyan greenbooks: The design needs to be improved.
[30/08, 7:45 AM] kalyan greenbooks: There will be a lot of glare in this design. The lights should be away from the camera area. If you want, I can give you a physical drawing. I think I have it. You may try with that.
[30/08, 9:08 AM] Gnuanwar: OK, give me the drawing and we will try it. Tell me when I need to collect it.

tshrinivasan · 2019-08-30T07:25:30Z

Found a good alternative here:
http://techforelders.blogspot.com/2012/12/blog-post.html

We have to fit the lights on the side walls and fit white sheets in between.

tshrinivasan · 2019-09-04T11:19:29Z

Cut a hole of 2.5 inches diameter in the top.

Added two lights on both inner sides.

Added an 8 mm thick glass sheet, 15 x 17 inches in size, to press down big books.

Adding the glass adds more reflections. Exploring how to avoid this.

Keep the box away from kids. It seems to be a great fun box for them :-)

tshrinivasan · 2019-09-04T11:36:32Z

Exploring adding semi-transparent sheets between the lights and the glass, so that only limited light passes through.

Trying the sheets used on drums. They are thick, easy to fix with screws, and let only limited light through. No reflection on the glass.

tshrinivasan · 2019-09-06T07:45:09Z

tshrinivasan · 2019-09-06T07:45:52Z

Finally, added the thick mica sheet used on the drums to minimize the light reflection.

The box is ready to use now.

tshrinivasan · 2019-09-06T07:48:49Z

தாமோதரம்.pdf

Here is a sample scan from an Android phone with CamScanner.

Post-processed the scanned images using Scan Tailor.
Here is the result from Scan Tailor:
thamodaram.pdf

Found that tesseract 4 is good for Tamil OCR.
Here is the result.
thamodaram.txt

tshrinivasan · 2019-09-06T07:50:30Z

Install Scantailor and tesseract

sudo apt-get install tesseract-ocr yagf tesseract-ocr-tam tesseract-ocr-script-taml python3-pyocr ocrodjvu ocrmypdf lios gimagereader

sudo apt-get install scantailor

Scan a book.
Save the pages as images.
Split and improve the images using Scan Tailor.

Split a PDF into multiple images using Ghostscript:

gs -dNOPAUSE -dBATCH -sDEVICE=png16m -sOutputFile="Pic-%02d.png" output.pdf

Do OCR using Tesseract:

ls -1Nv *.png > filelist.txt

tesseract -l eng+tam filelist.txt article txt
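
The file list matters because Tesseract processes the pages in the order they are listed. `ls -v` sorts digit runs as numbers, so Pic-100.png comes after Pic-99.png instead of before Pic-11.png. Here is a minimal Python sketch of the same two steps. The helper names (`natural_key`, `write_filelist`, `build_ocr_command`, `run_ocr`) are made up for illustration, the sort only roughly imitates `ls -v`, and it assumes `tesseract` with the `eng` and `tam` language data is installed.

```python
import re
import subprocess
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs as numbers, roughly like `ls -v`."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p for p in parts]


def write_filelist(image_dir, filelist="filelist.txt", pattern="*.png"):
    """Write the page image names, one per line, in natural order."""
    names = sorted((p.name for p in Path(image_dir).glob(pattern)), key=natural_key)
    Path(image_dir, filelist).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names


def build_ocr_command(filelist="filelist.txt", outbase="article", langs="eng+tam"):
    """Same call as in the thread: the text ends up in <outbase>.txt."""
    return ["tesseract", filelist, outbase, "-l", langs, "txt"]


def run_ocr(image_dir):
    """Hypothetical driver: list the pages, then run Tesseract inside that folder."""
    write_filelist(image_dir)
    subprocess.run(build_ocr_command(), cwd=image_dir, check=True)
```

Calling `run_ocr("scans")` on a folder of page images would leave `article.txt` next to them, provided the tools are installed.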

tshrinivasan · 2019-09-06T07:53:11Z

If the output from scanning is a PDF in which each page holds two facing book pages, we can split every page into two using mutool.

mutool poster -x 2 input.pdf output.pdf

To crop the PDF, we can use pdfshuffler.

tshrinivasan · 2019-09-06T07:54:21Z

Here is a video on how to use scantailor

https://vimeo.com/12524529

venkatarangan · 2019-09-07T07:37:03Z

@tshrinivasan Thanks for the post, the scanner box is super impressive.

I am interested in the software process. One problem which has been plaguing scanning of old Tamil books to PDF has been, the inability to select, copy n paste intelligible Tamil text (Unicode) from the PDF pages to say MS Word or Notepad. For this to work, I guessed we need to have an OCR to generate Tamil text somewhere in the pipeline and then embed the text back to the PDF.

In this context, recently I found all the books in Singapore's NLB Tamil collection, supporting seamless (extremely low error) copy n paste of Tamil text out from PDF. I have been researching how they are doing - I experimented with a python script to use Google Vision OCR to get something working after I saved all pages in a PDF as images. That's where I got stuck, unable to proceed further in the workflow.

Now, seeing the tweet today, I was pleasantly surprised, you have solved the problem. I have two queries:

1. In the thamodaram.pdf you generated from CamScanner->ScanTailor, the copy n paste of Tamil works. But you mention doing OCR with tesseract later. Am I missing something? If not, how is copy n paste working before OCR?
2. Once OCR is done, how do we get the Tamil Unicode text back into the PDF?

Lastly, can you write a detailed blog post on the steps outside this GitHub thread, as it will benefit a wider set of people (especially non-programmers) who wish to scan and support Copy N Paste for old Tamil books.

Once again, many thanks for your effort and sharing them.

tshrinivasan · 2019-09-08T13:56:33Z

@venkatarangan Tesseract OCR can give its output as a text file and as a PDF file.

The PDF file is searchable. Tesseract adds a text layer on top of the original image when producing the PDF output.
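
As a rough illustration of the two output types: Tesseract chooses its output format from the config names given at the end of the command, so `txt` and `pdf` can be requested together. The sketch below only builds the command lines. The function names are hypothetical, and the second command assumes the `ocrmypdf` package from the install list earlier in this thread, which adds a text layer to a PDF that already exists.

```python
import subprocess


def tesseract_command(filelist, outbase, langs="tam", formats=("txt", "pdf")):
    """Build a Tesseract call; each format name yields <outbase>.<format>."""
    return ["tesseract", filelist, outbase, "-l", langs, *formats]


def ocrmypdf_command(scanned_pdf, searchable_pdf, langs="tam"):
    """Build an OCRmyPDF call that adds a text layer to an image-only PDF."""
    return ["ocrmypdf", "-l", langs, scanned_pdf, searchable_pdf]


def run(command):
    """Run one of the commands above and fail loudly if the tool reports an error."""
    subprocess.run(command, check=True)
```

For example, `run(tesseract_command("filelist.txt", "thamodaram"))` would be expected to write both `thamodaram.txt` and a searchable `thamodaram.pdf`, assuming the Tamil language data is installed.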

Will write a detailed blog post soon.

tshrinivasan · 2019-09-08T15:16:16Z

Initially, I thought we could do everything on the mobile itself: scan, fix and improve the images, and convert to PDF.

Apps like Adobe Scan and CamScanner do this.

But the results are not impressive.

Did a test with Scan Tailor. It is the magician and king in this field.

Raw scan output

Result from Adobe scan mobile app

See the difference in contrast across the page, caused by light reflecting off the glass. Such images are not good for printing or OCR.

But Scan Tailor improves the image quality to a very high level.
It does all the magic needed to produce excellent images.

This is perfect for print or OCR.

tshrinivasan · 2019-09-08T15:19:10Z

A few inputs from Bharat Varma on Twitter:
https://twitter.com/BharatVarma3/status/1170241584031391744

Hinge glass at the far edge to ease book placement.

Cut panels on the side & put white acrylic sheets for light diffusion. Keep lights outside.

If using a camera, use a CPL filter to cut glare.

Use a long lens and an inclined base, and you can probably shoot two books at once.

You can add a CPL filter to a point-and-shoot camera.
Point-and-shoots have three possibilities for attaching a CPL filter:

A filter thread on the lens itself, an add-on accessory tube that attaches to the body and has a threaded receptacle at the end, or a similar small tube that can be stuck to the front of the tube housing the lens.

Does this section work? :)

The inclined lines are the base & the book.

"O" is camera with a long focal length (less distortion, edge to edge sharpness).

o o is direction.

The camera needs to be parallel to the book. Keeping it at a distance with a longer focal length makes it more forgiving of alignment issues and the minimal distortion ensures better quality images.
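
To put a rough number on the alignment point: under a simple pinhole-camera model (an assumption, not something measured on this box), a page tilted by a small angle has one edge closer to the camera than the other, and the closer edge is imaged larger. The ratio of the two magnifications is a crude measure of the keystone distortion, and it approaches 1 as the camera moves further away. A longer focal length is what lets the page still fill the frame from that distance. The function name and the sample numbers below are illustrative only.

```python
import math


def keystone_ratio(distance, page_height, tilt_degrees):
    """Magnification of the near page edge divided by that of the far edge.

    Pinhole model: magnification is proportional to 1/depth, and a page of
    height `page_height` tilted by `tilt_degrees` has its two edges at depths
    distance -/+ (page_height / 2) * sin(tilt). Units only need to agree.
    """
    offset = (page_height / 2) * math.sin(math.radians(tilt_degrees))
    if abs(offset) >= distance:
        raise ValueError("page edge would touch or pass the camera")
    return (distance + offset) / (distance - offset)


# Same 30 cm page and 5 degree misalignment, camera at 45 cm versus 150 cm.
near = keystone_ratio(45, 30, 5)
far = keystone_ratio(150, 30, 5)
```

Here `far` comes out closer to 1 than `near`, which is the sense in which the longer working distance is more forgiving of alignment errors.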

nithyadurai87 · 2019-09-17T18:51:31Z

To convert all the TIFF output of Scan Tailor to PDF:

ls *.tif | parallel convert {} {.}.pdf

pdfunite *.pdf bookname.pdf
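
One thing to watch in the two commands above: `pdfunite` takes the output file as its last argument, and `*.pdf` expands alphabetically, so a leftover `bookname.pdf` or any other stray PDF in the folder would be merged in as well. Here is a hedged Python sketch that builds the same commands with an explicit page list derived from the TIFF names. `plan_book_pdf` is a made-up name, and it assumes ImageMagick's `convert` and Poppler's `pdfunite` are installed, and that the page files have zero-padded numbers so that plain alphabetical order is the page order.

```python
from pathlib import PurePath


def plan_book_pdf(tif_names, book_pdf="bookname.pdf"):
    """Return the per-page convert commands and the final pdfunite command."""
    # Alphabetical order, the same order the shell glob would produce.
    pages = sorted(tif_names)
    page_pdfs = [str(PurePath(t).with_suffix(".pdf")) for t in pages]
    converts = [["convert", tif, pdf] for tif, pdf in zip(pages, page_pdfs)]
    # pdfunite syntax: input PDFs in page order, output file last.
    unite = ["pdfunite", *page_pdfs, book_pdf]
    return converts, unite
```

Each returned command can be passed to `subprocess.run(cmd, check=True)`, running the `convert` commands first and the `pdfunite` command last.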

tshrinivasan · 2019-09-17T19:30:14Z

Here is the first book scanned with this scanbox:

https://archive.org/details/thamodharam

https://ia601401.us.archive.org/23/items/thamodharam/thamodharam.pdf

Camera used : Android Phone Honor 9N
Camera Software : Adobe Scan (saved the original images in the gallery, then processed the images via Scan Tailor)

venkatarangan · 2019-09-18T03:43:35Z

@tshrinivasan Considering the low-cost, OSS setup, the quality of the scan in this PDF is great. Kudos. If you are going to release this book (PDF) as such, I request you to add the tesseract OCR support as well, so that the text becomes searchable. Thanks.

tshrinivasan · 2019-09-18T18:39:44Z

@venkatarangan Sure sir. I am dreaming of a website like FreeTamilEbooks.com, but only with the scanned PDF files.
Will launch the site soon.

Is it OK to go with tesseract, given that it still has a lot of room for improvement with Tamil?

tshrinivasan · 2019-11-10T14:59:20Z

Improved the scanbox.

Added a light on the back wall.

Now the results are better.

tshrinivasan · 2019-11-10T15:00:11Z

tshrinivasan · 2019-11-10T15:00:51Z

Added a handle to the glass for easy operation.

venkatarangan · 2019-12-01T16:12:05Z

Regarding "Install Scantailor and tesseract" above:
Scan Tailor is great. Thank you for referencing it here. For a full 150-page book, I used it to split pages (from two facing pages to single pages), straighten up the pages, despeckle (remove the non-text/image areas like the brown paper background) and then output clean, sharp-looking pages.

gnuanwar · 2019-12-01T16:21:15Z

Welcome


tshrinivasan · 2020-02-11T06:25:40Z

tshrinivasan · 2020-02-11T06:26:23Z

The scanbox is now with Khaleel for scanning old magazines.

tshrinivasan · 2020-04-20T05:19:11Z

Wrote a blog post on the making of the scanbox:
https://goinggnu.wordpress.com/2020/04/20/making-of-kaniyam-scanbox-diy-scanner/

More photos
https://photos.app.goo.gl/rCTpCaqkW8tZ68md9

abubelinha · 2023-07-16T18:32:04Z

Thank you guys. Amazing work 👍 👍 👍
@abubelinha

tshrinivasan added the Book Scanning மின்னுருவாக்கம் label Aug 30, 2019

tshrinivasan closed this as completed Feb 11, 2020
