New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a low cost custom scan box #73
Comments
Created the above box. Size = 1.5 x 1.5 x 1.5 feet Expenses so far - plywood, carpenter charge - 2100 Pending items
Expected cost - 3000 Rs With this, anyone can scan a book with a normal digital camera or mobile phone. Once we did this, we can add more improvements for the next scan boxes. |
Srini check Kalyan recommendations |
http://techforelders.blogspot.com/2012/12/blog-post.html Have to fit the lights on the side walls and fit white sheets in between. |
Finally, added the thick mica sheet used on the drums to minimize the light reflection. The box is ready to use now. |
Here is sample scan from android phone with cam scanner. Did post processing the scanned images using scantailor. Found that tesseract 4 is good for Tamil OCR. |
Install Scantailor and tesseract
split a pdf to multiple images using ghostscript
Do ocr using tesseract
|
Here is a video on how to use scantailor |
@tshrinivasan Thanks for the post, the scanner box is super impressive. I am interested in the software process. One problem which has been plaguing scanning of old Tamil books to PDF has been, the inability to select, copy n paste intelligible Tamil text (Unicode) from the PDF pages to say MS Word or Notepad. For this to work, I guessed we need to have an OCR to generate Tamil text somewhere in the pipeline and then embed the text back to the PDF. In this context, recently I found all the books in Singapore's NLB Tamil collection, supporting seamless (extremely low error) copy n paste of Tamil text out from PDF. I have been researching how they are doing - I experimented with a python script to use Google Vision OCR to get something working after I saved all pages in a PDF as images. That's where I got stuck, unable to proceed further in the workflow. Now, seeing the tweet today, I was pleasantly surprised, you have solved the problem. I have two queries:
Lastly, can you write a detailed blog post on the steps outside this GitHub thread, as it will benefit a wider set of people (especially non-programmers) who wish to scan and support Copy N Paste for old Tamil books. Once again, many thanks for your effort and sharing them. |
@venkatarangan The Tesseract OCR can give output as text file and PDF file. PDF file is searchable. It adds a text layer ontop of the original image when producing the PDF output. Will write detailed blog soon. |
Initially, I thought that we do all the scan, fix images, improve images, convert to PDF using mobile itself. Adobe scan, cam scanner kind of apps do this. But, the results are not impressive. Did a test with ScanTailor. It is the magician and King in this field. Raw scan output Result from Adobe scan mobile app See the difference of contrast across the page due to light reflection over the glass. They are not good for printing and OCRing. But Found scantailor improves the image quality to a very high level. This is perfect for print or OCR. |
Few inputs from bharat varma in twitter Hinge glass at the far edge to ease book placement. Cut panels on the side & put white acrylic sheets for light diffusion. Keep lights outside. If using a camera, use a CPL filter to cut glare. Use a long lens and an inclined base, and you can probably shoot two books at once. you can add a CPL filter to a Point and shoot camera.. A filter thread on the lens itself, an add on accessory tube that attaches to the body with a threaded receptacle at the end & a similar small tube that can be stuck to the front of the tube housing the lens. Does this section work? :) The inclined lines are the base & the book. "O" is camera with a long focal length (less distortion, edge to edge sharpness). o o is direction. The camera needs to be parallel to the book. Keeping it at a distance with a longer focal length makes it more forgiving of alignment issues and the minimal distortion ensures better quality images. |
To convert all the tif output of scantailor to PDF
|
Here is the first books scanned with this scanbox https://archive.org/details/thamodharam https://ia601401.us.archive.org/23/items/thamodharam/thamodharam.pdf Camera used : Android Phone Honor 9N |
@tshrinivasan Considering, the low-cost and OSS setup, the quality of the scan for this PDF is great. Kudos. If you are going to release this book (PDF) as such, I will request you add the tesseract OCR support as well so that the text becomes searchable. Thanks. |
@venkatarangan Sure sir, Dreaming of a website like FreeTamilEbooks.com but only with the scanned PDF files. Is it OK to go with the tesseract as there is a lot of room for improvement for it with Tamil ? |
Improved the scanbox. Added a light at back side wall. Now the results are more good. |
Added a handle to the glass for easy operation. |
|
Welcome
…On Sun 1 Dec, 2019, 9:42 PM venkatarangan thirumalai, < ***@***.***> wrote:
*Install Scantailor and tesseract*
Scan Tailor <https://scantailor.org/> is great. Thank you for referencing
it here. I used it now to split pages (from two facing pages to single
pages), straighten up the pages, despeckle (remove the non text/image areas
like brown paper background) and then output clean sharp looking pages.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#73?email_source=notifications&email_token=AJDTEQUSW3W4UBXHCMZXJ6DQWPO5LA5CNFSM4ISIHQRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFRNQMI#issuecomment-560126001>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AJDTEQUBYKNN362MSGRWB4DQWPO5LANCNFSM4ISIHQRA>
.
|
scanbox is now with Khaleel for scanning old magazines. |
Wrote a blog post on making of scanbox More photos |
Thank you guys. Amazing work 👍 👍 👍 |
We need to create a low cost custom scan box, so that we can scan books easily.
the existing scanners are costly.
SV600 - 47,000 INR
https://www.amazon.in/Fujitsu-PA03641-B301-ScanSnap-SV600-Scanner/dp/B01AJI0426
CZUR ET 16 Plus Smart Book Scanner - 56,000 INR
https://www.amazon.in/CZUR-Plus-Smart-Scanner-Black/dp/B0758VY4G7
Instead of this, make a scanbox like
https://www.kickstarter.com/projects/limemouse/scanbox-turn-your-smartphone-into-a-portable-scann
https://www.amazon.com/Scanner-Bin-Document-Scanning-Solution/dp/B00XM7LKZM
The text was updated successfully, but these errors were encountered: