Skip to content

Latest commit

 

History

History
208 lines (150 loc) · 28.8 KB

README.md

File metadata and controls

208 lines (150 loc) · 28.8 KB

ScriptReader

This handwriting OCR application can convert JPEG handwritten text images into RTF documents, while removing typos for you!

Quote OCR Result

ScriptReader

License: AGPL-3.0 GitHub last commit Linux


ScriptReader is a tool enabling you to convert scanned handwritten pages (in JPEG image format) into rich text format (RTF) documents, complete with formatting elements such as text alignment, paragraphs, underline, italics, bold and strikethrough.

A neat functionality of ScriptReader is that the typos (square dot grid cells containing mistakes, which are filled in with ink) automatically get filtered out, and do not appear in the final RTF text. Also, the biodegradable fountain pen ink that I have disclosed earlier (see https://www.instructables.com/Recipe-for-a-Biodegradable-Blue-Ink-for-Fountain-P/) pairs well with the notebooks that you print with PrintANotebook (see https://github.com/LPBeaulieu/Notebook-Maker-PrintANotebook)!

My tests with over 30,000 characters of training data (about 50 half-letter pages of cursive handwriting on the ScriptReader pages, with 0.13 inch dot spacing and double line spacing) consistently gave me an OCR accuracy above 99%!

📝 Table of Contents

⛓️ Dependencies / Limitations

  • This Python project relies on the Fastai deep learning library (https://docs.fast.ai/) to generate a convolutional neural network deep learning model, which allows for handwriting optical character recognition (OCR). It also needs OpenCV to perform image segmentation (to crop the individual characters in the handwritten scanned images) and glob2 to automatically retrieve the cropped character image size.

  • A deep learning model trained with a specific handwriting is unlikely to generalize well to other handwritings. Also, I would advise you to keep your dataset private, as it would in theory be possible to reverse engineer it in order to generate text with your handwriting. For this reason, I encourage you to use another handwriting than your official handwriting, as an added precaution. For example, should you normally use print script, you would then train your OCR model with small caps handwriting instead.

  • The ScriptReader pages from PrintANotebook need to be used, and the individual letters need to be written within the vertical boundaries of a given dot grid square cell (comprised of four dots). The segmentation code allows plenty of space above and below the line of text for ascenders and descenders, however. The handwritten pages should be scanned at a resolution of 300 dpi, with the US Letter page size setting and the text facing the top of the page, as the black squares will be used to automatically align the pages. You should refrain from writing near the black squares to allow for the alignment to be unimpeded by any artifacts. You can write with any color of ink, as long as it is saturated enough to be picked up by your scanner, as the images are converted to greyscale images for training the model and OCR.

  • Make sure that all of your characters are very distinct from one another. I suggest using bars or dots in the zeros and writing the ones the way you see them displayed on screen ("1"), so that they aren't confused with an uppercase "O" and a lowercase "l", respectively. Also, I recommend that a sizable portion of your handwritten training data be comprised of pangrams (sentences that contain every letter of the alphabet), to ensure that you have a good set of characters to train on. You could write a set of pangrams all in capital letters and another in lowercase letters, so that every letter would be represented in your dataset both in upper- and lowercase form. Be sure to include RTF commands and the most common punctuation marks ("'", '"', ".", ",", ":", ";", "?", "!", "(", ")", "-") regularly throughout your sentences in order for them to be trained in the context of normal writing.

  • As mentioned above, the code is compatible with RTF commands (see examples below) so you will need to train the model to recognize hand-drawn backslashes as well if you wish to include formatting elements such as tabs, bold, new lines and paragraphs, for instance. For an in-depth explanation of all the most common RTF commands and escapes, please consult: https://www.oreilly.com/library/view/rtf-pocket-guide/9781449302047/ch01.html.

Here is a list of the most common RTF commands (within quotes) that you are likely to need:

  • "\b": Bold opening tag "\b0": Bold closing tag
  • "\i": Italics opening tag "\i0": Italics closing tag
  • "\ul": Underline opening tag "\ul0": Underline closing tag
  • "\scaps": Smallcaps opening tag "\scaps0": Smallcaps closing tag
  • "\tab": Insert a tab
  • "\par": Start a new paragraph (By default, my code already includes a tab at the start of every new paragraph)
  • "\qc": Center this paragraph (Simply add the "\qc" after "\par", for example: "\par\qc Your Centered Paragraph")
  • "\qj": Justified alignment for this paragraph (Simply add the "\qj" after "\par", for example: "\par\qj Your Justified Paragraph")
  • "\ql": Left alignment for this paragraph
    (Simply add the "\ql" after "\par", for example: "\par\ql Your Left-Aligned Paragraph")
  • "\qr": Right alignment for this paragraph
    (Simply add the "\qr" after "\par", for example: "\par\qr Your Right-Aligned Paragraph")
  • "\keepn": Do not split this paragraph from the next one on different pages, if possible. This is particularly useful for heading paragraphs, so that they appear on the same page as the first paragraph of the following section. For example: "\par\qc\keepn Heading \par Next paragraph on the same page as the heading"
  • "\line": Insert a line break
  • "\page": Insert a page break
  • "\sbkpage": Insert a section break that also starts a new page
  • "\sub": Subscript (ex: "You need to drink plenty of H{\sub 2}O", where the curly brackets delimit the subscript command)
  • "\super": Superscript (ex: "This is the 1{\super st} time that I use RTF commands!}")

ScriptReader Demo


The illustration above shows the various use cases of common RTF commands with ScriptReader. Notice that the typos have simply been darkened with ink, such that the model identifies dark squares as being mistakes that need to be filtered out of the final text. While curly brackets may be employed to avoid using closing RTF commands, their use should be avoided, as the presence of an OCR error in one of them might prevent the file from being opened in a regular word processor. You would then be notified of the location of the mistake in the RTF file, which could be corrected by hand by opening the file in a basic text editor (such as Notepad on Windows or Text Editor on Ubuntu) and by fixing the error at the appropriate row and column. Make sure to train your model on handwritten text that includes a lot of the abovementioned RTF commands, so that the OCR errors are minimized. Also remember to add a space at the beginning of a new line if you ended the previous word at the very end of the last line. Otherwise, the two words will end up merged together, as obviated by the red squiggly line in the image above.

To keep things as simple as possible in the (default) basic RTF mode of the "get_predictions.py" code, "\par" is changed for "\par\pard\tab" after OCR. This means that the paragraph-formatting attributes (such as centered alignment, "qc") are returned to their default values, and a tab is included automatically when a new paragraph is started by writing "\par". The advanced RTF mode just interprets the RTF commands as you write them.

🏁 Getting Started

The following instructions will be provided in great detail, as they are intended for a broad audience and will allow to run a copy of ScriptReader on a local computer.

The instructions below are for Windows operating systems, and while I am not 100% sure that it is able to run on Windows, I made every effort to adapt the code so that it would be compatible, but the code runs very nicely on Linux.

Start by holding the "Shift" key while right-clicking in your working folder, then select "Open PowerShell window here" to access the PowerShell in your working folder and enter the commands described below.

Step 1- Install PyTorch (Required Fastai library to convert images into a format usable for deep learning) using the following command (or the equivalent command found at https://pytorch.org/get-started/locally/ suitable to your system):

pip3 install torch torchvision torchaudio

Step 2- Install the CPU-only version of Fastai, which is a deep learning Python library. The CPU-only version suffices for this application, at least when running on Linux, as the character images are very small in size:

py -m pip install fastai

Step 3- Install OpenCV (Python library for image segmentation):

py -m pip install opencv-python

Step 4- Install alive-Progress (Python module for a progress bar displayed in command line):

py -m pip install alive-progress

Step 5- Install glob (Python module used to automatically retrieve the cropped character image pixel size):

py -m pip install glob2

Step 6- Install TextBlob (Python module used for the autocorrect feature):

py -m pip install textblob

Step 7- Create the folders "Training&Validation Data" and "OCR Raw Data" in your working folder:

mkdir "OCR Raw Data" 
mkdir "Training&Validation Data" 

Step 8- You're now ready to use ScriptReader! 🎉

🎈 Usage

First off, you will need to print some ScriptReader notebook pages, which are special in that they are dot grid pages with line spacing in-between lines of text, so that there may be enough room to accommodate the ascenders and descenders of your handwriting when performing OCR. Also, these pages have black squares in the top of the page, which help the code to automatically align the pages in order to correct for slight rotation angles (below about 1 degree) of the scanned images. Please refer to the PrintANotebook github repository for the basics on how to run this application on your system.

For a basic template, simply pass in "scriptreader:" as an additional argument when running PrintANotebook, with the following parameters after the colon, each separated by additional colons: the number of inches in-between dot grid dots (in inches and in decimal form, but without units):the dot diameter (5 px is a good value): the dot line width (1 px is appropriate): the number of lines in-between the lines of text (2 works well for me, but if your handwriting has marked ascenders and descenders, you might want to go with 3): gutter margin width (in inches and decimal form, but without units, 0.75 is a good setting that allows for you to use a hole punch). For example, the following ("scriptreader:0.13:5:1:2:0.75") would mean that there is 0.13 inch in-between dots, the dot diameter is 5 px, the dot line width is 1 px, there are two empty lines in-between every line of text and that the gutter margin measures 0.75 inch.

Alternatively, you might wish to write in your notebook using an erasable ink (Such as my biodegradable wet erase fountain pen ink! See my Instructables article on how to prepare it at home: https://www.instructables.com/Recipe-for-a-Multipurpose-Biodegradable-and-Wet-Er/) on notebook pages printed on transparencies with a laser printer. Make sure to train your model using the ink and medium (paper vs transparencies) that you will writing with, for best OCR accuracy. Also, please dispose of your transparencies responsibly once they are worn out, as the goal here is to be green after all :-). You would then pass in "scriptreader_acetate:" as an additional argument when running PrintANotebook, with the same parameters explained in the paragraph above. The pages would then be mirrored, such that you would print them on the left-hand side of the transparency sheets for the first half of your PDF document pages, then flip the stack of transparencies and print the second half of your PDF document pages on the reverse side (see the figure below for an illustration of the end result). There is also a vertical line indicating where you will need to cut the transparency in half in order to assemble your reusable notebook. The default heading on the pages will be "Write on this side", which guides you in remembering on which side of the transparency to write on, in order to avoid scratching the laser toner on the reverse side. You could change the heading to any other text, preceded by the argument "heading_text_right:" when running the Python code.

Printing instructions for transparencies


In the image above, the left portion was printed on the front side and is mirrored, while the right portion was printed on the back of the transparency, and can be read from above. You will need to place the transparency on top of a sheet of printer paper and cut them both along the central vertical line. The paper will serve as a reusable backing for your transparencies and will prevent ink on a page from blotting on the previous transparency in the notebook. I entered the following parameters when running the PrintANotebook Python code on my Linux system (I wanted 26 TOC pages, but you could just as well omit the "toc_pages_spacing:" argument and go with the default 8 pages).

python3 printanotebook.py "title:Notes and Thoughts (Volume 1)" "author:Louis-Philippe Bonhomme-Beaulieu" "number_of_pages:100" "page_numbers" "heading_text_right:Date:                " "scriptreader_acetate:0.13:5:1:2:0.75" "toc_pages_spacing:26"

It is preferable to scan transparencies on a flatbed scanner (at a resolution of 300 dpi and with the US Letter page size setting), in order to avoid scratching and damaging them in a multi-page scanner. This should still save you lots of time when compared to typing all these pages on the computer! When scanning pages on a flatbed scanner, make sure to place the page such that its top left corner is in the top left corner of the scanned image, as shown in the image below.

Transparency scanned on a flatbed scanner


The photo above shows a transparency that was scanned on a flatbed scanner. Notice that the top left corner of the transparency lines up with the top left corner of the scan. It is important that you scan your images like this on the flatbed scanner, so that the character segmentation may proceed smoothly.

Punching instructions


Once you have printed your notebook, you could punch 3 holes at the standard 2.75 inch spacing of Junior ring binders, using the instructions of the image above, for an Officemate Heavy Duty Adjustable 2-3 Hole Punch with Lever Handle, which could readily punch holes through 10 sheets of 28 lb half letter paper at a time in my experience (while I'm not affiliated with the company that makes that hole punch, I do recommend it).

Punched notebooks


Here is what the notebooks you generate might look like!

There are four different Python code files that are to be run in sequence. You can find instructional videos on ScriptReader usage on my YouTube channel: https://www.youtube.com/watch?v=o5gZ0sazeEk.

File 1: "create_rectangles.py"- This Python code enables you to see the segmentation results (the green rectangles delimiting the individual characters on the handwritten scanned image) and then write a ".txt" file with the correct labels for each rectangle. The mapping of every rectangle to a label will allow to generate a dataset of character images with their corresponding labels. The handwritten page images overlaid with the character rectangles are stored in the "Page image files with rectangles" folder, which is created automatically by the code.

Character Segmentation


The figure above shows the segmentation results (green rectangles) for the left-hand scanned image. In reality, some overlap is allowed horizontally and vertically in order to fully include the characters. The red rectangles show the position where the code has detected the black squares on top of the page, which allows for the automatic alignment of the page. The blue rectangles show where the code has screened in order to detect the black rectangles. It is therefore important that you avoid including very long titles or any writing in these regions of your notebook pages.

Image txt file processing


The image above illustrates the format of the ".txt" file listing all of the character rectangle labels. Any spaces (empty dot grid cells) on your scanned handwritten page need to be represented by the Cyrillic Capital Letter I ("И"). Furthermore, any typos on your training data, which are erroneous handwritten letters of which the dot grid cell was later darkened will need to be reported with the Cyrillic Capital Letter Be ("Б"). Finally, should there be any artifacts on the page or should you write a character out of frame of a given dot grid cell, you would then need to designate it with the Cyrillic Capital Letter De ("Д"). In this case, I added the "Д" symbol for the zeros on the seventh line, as I forgot to put the bars in the digits, which would certainly lead to some confusion with a capital "O". All the other character rectangles are represented by their own characters in the ".txt" file.

Importantly, such ".txt" files should be created, modified and saved exclusively in basic text editors (such as Notepad on Windows or Text Editor on Ubuntu), as more elaborate word processors would include extra formatting information, which would interfere with the correct mapping of the character rectangles to their labels in the ".txt" file.

Furthermore, the ".txt" files in the "Training&Validation Data" folder must have identical names to their corresponding JPEG images (minus the file extensions). For example, the file "my_text.txt" would contain the labels corresponding to the raw scanned handwritten page JPEG image (without the character rectangles) named "my_text.jpg".


File 2: "create_dataset.py"- This code will crop the individual characters in the same way as the "create_rectangles.py" code, and will then open the ".txt" file containing the labels in order to create the dataset. Each character image will be sorted in its label subfolder within the "Dataset" folder, which is created automatically by the code.

A good practice when creating a dataset is to make the ".txt" file and then run the "create_dataset.py" code one page at a time (only one JPEG image and its corresponding ".txt" file at a time in the "Training&Validation Data" folder) to validate that the labels in the ".txt" file line up with the character rectangles on the handwritten text image. Such a validation step involves opening every "Dataset" subfolder and ensuring that every image corresponds to its subfolder label (pro tip: select the icon display option in the folder in order to display the image thumbnails, which makes the validation a whole lot quicker). You can pinpoint the location of mistakes in your text file by dividing the first mistaken common character index (such as an "a", "e", or "space" label that displays another character in the thumbnail image) by the total number of characters per line, which corresponds to the number of square grid cells per horizontal line. You will need to delete the "Dataset" folder in-between every page, otherwise it will add the labels to the existing ones within the subfolders. This makes it more manageable to correct any mistakes in the writing of the ".txt" files. Of note, the spaces are picked up as actual characters and framed with rectangles. You need to label those spaces with a "И" symbol. Here is the list of symbols present in the ".txt" files mapping to the different character rectangles:
  • "И": "blank" character rectangle, which corresponds to a space. These character images are stored in the "space" subfolder within the "Dataset" folder. "Ctrl-C" and "Ctrl-V" will be your friends when adding the "И" characters for every space in your scanned image.
  • "Б": "typo" character rectangle (any character of which the square dot grid cell has been completely darkened). These character images are stored in the "empty" subfolder within the "Dataset" folder.
  • "Д": "to be deleted" character rectangle (any undesired artifact or letter that is out of frame with the dot grid cell). The "to be deleted" subfolder (within the "Dataset" folder) and all its contents is automatically deleted and the characters labelled with "Д" in the ".txt" file will be absent from the dataset, to avoid training on this erroneous data.
  • All the other characters in the ".txt" files are the same as those that you have written by hand. The character images are stored in subfolders within the "Dataset" folder bearing the character's name (e.g. "a" character images are stored in the subfolder named "a").

Once you're done validating the individual ".txt" files, you can delete the "Dataset" folder once more, add all of the ".txt" files along with their corresponding JPEG images to the "Training&Validation Data" folder and run the "create_dataset.py" code to get your complete dataset!


File 3: "train_model.py"- This code will train a convolutional neural network deep learning model from the labelled character images within the "Dataset" folder. It will also provide you with the accuracy of the model in making OCR predictions, which will be displayed in the command line for every epoch (run through the entire dataset). The default hyperparameters (number of epochs=5, batch size=64, learning rate=0.003, kernel size=5) were optimal and consistently gave OCR accuracies above 99%, provided a good-sized dataset is used (mine was above 30,000 characters). As this is a simple deep learning task, the accuracy relies more heavily on having good quality segmentation and character images that accurately reflect those that would be found in text. When you obtain a model with good accuracy, you should make a backup of it along with the "Training&Validation Data" and "Dataset" folders on which it was trained. Once again, it is advisable to keep this data private (offline), as it contains your handwriting fingerprint of sorts, as was mentioned above.


File 4: "get_predictions.py"- This code will perform OCR on JPEG images of scanned handwritten text (at a resolution of 300 dpi and with the US Letter page size setting) that you will place in the folder "OCR Raw Data".

Please note that all of the JPEG file names in the "OCR Raw Data" folder must contain at least one hyphen ("-") in order for the code to properly create subfolders in the "OCR Predictions" folder. These subfolders will contain the rich text format (RTF) OCR conversion documents.

The reason for this is that when you will scan a large document in a multi-page scanner, you will provide your scanner with a file root name (e.g. "my_text-") and the scanner will number them automatically (e.g."my_text-.jpg", "my_text-0001.jpg", "my_text-0002.jpg", "my_text-"0003.jpg", etc.) and the code would then label the subfolder within the "OCR Predictions" folder as "my_text". The OCR prediction results for each page will be added in sequence to the "my_text.rtf" file within the "my_text" subfolder of the "OCR Predictions" folder. Should you ever want to repeat the OCR prediction for a set of JPEG images, it would then be important to remove the "my_text" subfolder before running the "get_predictions.py" code once more, in order to avoid appending more text to the existing "my_text.rtf" file.

When scanning the ScriptReader notebook pages generated with PrintANotebook, you would ideally need to scan them in a multi-page scanner, which is typically found in all-in-one printers. Select 300 dpi resolution, JPEG file format, as well as the US Letter size settings and first scan the odd pages (right-hand pages) by specifying a file name that ends with a hyphen. Once the odd pages are scanned, you would simply flip the recovered stack of pages and scan the reverse pages (starting with the last one on top of the stack), with the same file name, but preceded by "back". The code will automatically assemble the left- and right-hand pages in the right order when performing OCR predictions. For example, your first scanned right-hand page file name would be "my_text-.jpg" and your first scanned left-hand page (the back side of the last odd page) file name would be "back my_text-.jpg".

There are a few options available to you with respect to the handling of smart quotes (also known as directional quotes). Should there be at least one instance of a non-directional single " ' " or double ' " ' quote in the document generated by OCR, or should you have selected the "smart_quotes" or "symmetrical_quotes" options by passing in these arguments when running the code, then all of the directional quotes found within the page will be switched to their non-directional counterparts. This would be relevant had you trained the CNN model on directional quotes, but didn't get good OCR accuracy when handling them. After that first step, if there was at least one instance of a symmetrical quote in the original document and that you didn't specify the "symmetrical_quotes" option, or if you selected the "smart_quotes" option, the appropriate directional quotes will be applied to the page.

There is an autocorrect feature in ScriptReader that allows you to specify the confidence threshold above which a correction should be made. Simply pass in "autocorrect:", followed by a percentage expressed in decimal form, the default being 1.0 (100%). For example, should you want the autocorrect feature to only make corrections for instances where it is at least 95% certain that the suggested word is the correct one, you would enter "autocorrect:0.95".


Well there you have it! You're now ready to convert your handwritten text into digital format! You can now write at the cottage or in the park without worrying about your laptop's battery life and still get your document polished up in digital form in the end! 🎉📖

✍️ Authors

  • 👋 Hi, I’m Louis-Philippe!
  • 👀 I’m interested in natural language processing (NLP) and anything to do with words, really! 📝
  • 🌱 I’m currently reading about deep learning (and reviewing the underlying math involved in coding such applications 🧮😕)
  • 📫 How to reach me: By e-mail! LPBeaulieu@gmail.com 💻

🎉 Acknowledgments

  • Hat tip to @kylelobo for the GitHub README template!