Using wordninja on ocr text #5616

Dnkhatri · 2021-12-17T07:51:28Z

Sometimes after OCR the text comes out as a long string of combined words. Using wordninja, especially with the Subtitle Edit dictionary where we have added names etc., should make it a lot easier to clean up the text. It could be part of the spell check or just another option to fix text.
https://github.com/jiawenhao2015/wordninja

niksedk · 2021-12-17T08:14:12Z

Maintainer

Looks nice... is a C# version available?

Dnkhatri · 2021-12-17T08:41:02Z

Author

Sadly no, just Python, Go and Rust; no C# that I could find.

niksedk · 2021-12-17T10:42:12Z

Maintainer

OK, it's fairly easy to do this... I guess it comes down to two issues:

Word list for all languages
How often will this give bad results

Dnkhatri · 2021-12-17T18:58:30Z

Author

Well, after looking around some more I think I have found something better; it even has C# versions:
https://github.com/wolfgarbe/WordSegmentationTM
https://github.com/wolfgarbe/WordSegmentationDP
https://github.com/wolfgarbe/LinSpell

niksedk · 2021-12-18T12:55:29Z

Maintainer

OK, an early test version is up now: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip
This test version includes an English split word file called Dictionaries/eng_WordSplitList.txt with 2,784 words extracted from 70 .srt files via a built-in generator tool.

To activate this tool, press Ctrl+Shift+Alt+F12 in the main SE window.

Works in OCR or in Fix common errors.

1 reply

Dnkhatri Dec 18, 2021
Author

So we can use the tool to generate word lists to improve the correction when we run OCR or Fix common errors?

niksedk · 2021-12-18T14:48:16Z

Maintainer

Yes, but this feature will require some testing and tuning... also, it's not using the names list atm.
Some words can be split even when that is wrong: should it be "meat" or "me at"?

Dnkhatri · 2021-12-18T15:39:31Z

Author

It works quite well already, though it's not perfect. To make it really accurate you would need to generate word lists for each language, as well as word-pair frequencies etc., by ingesting a large amount of text for each language, which is not really feasible. Being able to generate a word list for the languages they use will be good enough for most people.
I would also suggest adding a shortcut key, similar to the spell check, that gives a line-by-line check so we can accept or reject each change.
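
For anyone curious what generating such a word list involves, here is a minimal Python sketch. It is not Subtitle Edit's generator: the function name `build_word_list`, the `min_count` threshold and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter

# Runs of letters, optionally with one inner apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

def build_word_list(lines, min_count=2):
    """Count words in subtitle text and keep those seen at least min_count times.

    Returns (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in lines:
        counts.update(w.lower() for w in WORD_RE.findall(line))
    frequent = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by count descending, then alphabetically for a stable result.
    return sorted(frequent, key=lambda wc: (-wc[1], wc[0]))

# Hypothetical sample text standing in for the contents of .srt files.
sample = [
    "This is the city of Nanxing.",
    "The city is quiet today.",
    "I don't know the city.",
]
word_list = build_word_list(sample, min_count=2)
```

With real subtitle files the threshold would need tuning: too low and OCR garbage gets into the list, too high and rarer but valid words are dropped.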

niksedk · 2021-12-18T16:48:45Z

Maintainer

The Word split dictionary generator already removes infrequent words.
ATM, long words are taken first; if words are of equal length, then the most frequent words are used first.
Also, if a split fails because a "wrong" word is used (e.g. "needspecial" is changed to "needs pecial"), the split will be retried without using the wrong word (e.g. "needs").
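
The ordering and retry rules described here can be sketched in Python as a small backtracking search. This is an illustration only, not Subtitle Edit's actual code: the function name `split_joined` and the word counts are made up for the example.

```python
from functools import lru_cache

def split_joined(text, word_counts):
    """Split text into known words; return None if no full split exists."""
    # Longest words first; ties broken by higher frequency.
    ordered = sorted(word_counts, key=lambda w: (-len(w), -word_counts[w]))

    @lru_cache(maxsize=None)
    def solve(rest):
        if not rest:
            return ()
        for word in ordered:
            if rest.startswith(word):
                tail = solve(rest[len(word):])
                if tail is not None:
                    return (word,) + tail
                # Dead end with this word, so retry with the next candidate.
        return None

    return solve(text.lower())

# Hypothetical frequencies, chosen only to exercise the retry path.
counts = {"need": 50, "needs": 20, "special": 30}
print(split_joined("needspecial", counts))  # "needs" fails first, then "need" works
```

Because longer words win, a list containing "meat", "me" and "at" keeps "meat" whole; the wrong "me at" split only appears when "meat" is missing from the list.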

Seems to work okay, but I'll test more - and try adding "names"

niksedk · 2021-12-18T18:10:46Z

Maintainer

Name list is now used in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip

Please test :)

Dnkhatri · 2021-12-20T03:31:23Z

Author

The split dictionary is now showing the actual dictionaries. I tried combining a few names, but that does not seem to work.

niksedk · 2021-12-20T06:53:30Z

Maintainer

I need more details and concrete examples to check anything.

The "Word split dictionary generator" is only for generating a new word-split-list.

1 reply

Dnkhatri Dec 20, 2021
Author

"ThisiscityofNanxing" is the subtitle. Nanxing is a name included in the name list. It stays the same when run through Fix common OCR errors. When I change it to "Thisiscityof Nanxing", the words get separated out properly to "This is city of Nanxing". After some testing: names are no longer being split up once added to the name list, but they are not separated out either, so if a string of joined text contains a name, the name is not split off.

Dnkhatri · 2021-12-20T08:25:34Z

Author

After running the generator on 1000 srt files that I had, the resulting word-split dictionary seems to work very well. I expect a better-curated word list would give even better results.

I have a question, as I no longer have access to the files that were giving me long strings of text when using Subtitle Edit OCR. Is the word split applied before we get the OCR results? I am not getting long strings like I did previously.

niksedk · 2021-12-20T08:59:16Z

Maintainer

OK, the names list is now merged into the word-split-list (sorted by word length descending), and the spell check after the split is improved to use the name list, so ThisiscityofNanxing should now work: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip

Yes, the word-split-list is working in the OCR (fixes should be visible in logs/guesses).
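
To illustrate the idea of sorting names into the split list, here is a rough Python sketch. It is an assumption-laden illustration rather than the real implementation: `split_with_names`, the exact-case matching for names and the tiny word list are all hypothetical.

```python
def split_with_names(text, words, names):
    """Split joined text using ordinary words plus a case-sensitive name list."""
    # Names and words share one candidate list, longest first.
    candidates = sorted(
        [(n, True) for n in names] + [(w.lower(), False) for w in words],
        key=lambda c: -len(c[0]),
    )

    def solve(pos):
        if pos == len(text):
            return []
        for cand, is_name in candidates:
            chunk = text[pos:pos + len(cand)]
            # Names must match exactly; ordinary words ignore case.
            matches = chunk == cand if is_name else chunk.lower() == cand
            if matches:
                tail = solve(pos + len(cand))
                if tail is not None:
                    return [chunk] + tail
        return None

    parts = solve(0)
    # Leave the text untouched when no complete split is found.
    return " ".join(parts) if parts is not None else text

words = ["this", "is", "city", "of"]
print(split_with_names("ThisiscityofNanxing", words, ["Nanxing"]))
```

Without the name in the candidate list, no complete split exists and the sketch returns the input unchanged, which mirrors the behaviour reported earlier in the thread.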

niksedk · 2021-12-20T16:56:47Z

Maintainer

I've improved the word split a bit: it now ignores workin' (and other *in' words) and includes more words in the word split list, and I added some words to the English user dictionary.

Also, not having words like "stepmom" or "badass" in the normal dictionaries gives some annoying results - so this works best if the spell checker is not missing too many informal words often used in dialog.

https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.4/SubtitleEditBeta.zip
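
As a rough picture of the *in' rule, a check like the following Python sketch could be run before attempting a split. The pattern and the name `should_skip_split` are assumptions for illustration, not the code used in Subtitle Edit.

```python
import re

# Informal words with a dropped final "g", e.g. workin', nothin', goin'.
# Accepts both the straight and the typographic apostrophe.
DROPPED_G_RE = re.compile(r"^[A-Za-z]+in['’]$")

def should_skip_split(token):
    """Return True for *in' words that should be left untouched."""
    return bool(DROPPED_G_RE.match(token))

print(should_skip_split("workin'"))
print(should_skip_split("working"))
```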

Dnkhatri · 2021-12-22T04:22:51Z

Author

I think the last version is almost perfect, with manual intervention rarely required, and then mostly for names. There is one error I keep getting, though. I think it has to do with Tesseract rather than the word split, but just in case I am attaching the files. Sometimes when a sentence starts with "I'm", the sentence seems to break down into single or double letters. I am attaching 3 files: 2 where it happens and 1 where it does not. I am using Tesseract 5 with the original Tesseract setting and other settings disabled. For image preprocessing, invert colors and crop transparent colors are used.

2 replies

Boswell-Scrubbs Apr 21, 2022

A huge thank you for this! Just tried it out and it worked beautifully. Saved hours of time.

Just BTW, here's how your fix solved my problem:

I had a set of subtitles for a TV series (a couple dozen .srt files). In every file there were many instances where two lines had been combined into one long line with no space between the two (I'm not sure if these were originally created with OCR, but somehow this error had been introduced). Example:

I'm driving into town today and willattend a meeting at the new library.

Should be:

I'm driving into town today and will
attend a meeting at the new library.

I was working my way (tediously) through each file, manually breaking the lines. I had the thought, "I wish there was some solution that could detect where the lines needed to be broken and do this for me. But, no, that's pie in the sky." However, I decided to do some googling and lo-and-behold, your amazing fix was just included in the latest release!

I used these three steps:

1. Fix common errors... Fix common OCR errors (using OCR replace list)
   I assume this is where you created your fix.
   Spaces were automagically inserted in the jammed together words.
2. Fix common errors... Fix missing spaces
   Took care of the instances where the error included a punctuation mark, as in
   "Hey.You."
3. Fix common errors... Break long lines
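
For readers who want to see roughly what the last two steps do, here is a small Python sketch. It only approximates the idea and is not how Subtitle Edit implements these fixes: the regular expression, the function names and the `max_len=43` default are assumptions.

```python
import re

# A lowercase letter and a punctuation mark directly followed by an uppercase letter.
MISSING_SPACE_RE = re.compile(r"(?<=[a-z][.!?,])(?=[A-Z])")

def fix_missing_spaces(line):
    """Insert a space after punctuation that is glued to the next word."""
    return MISSING_SPACE_RE.sub(" ", line)

def break_long_line(line, max_len=43):
    """Break a too-long line in two at the space nearest its midpoint."""
    if len(line) <= max_len or " " not in line:
        return line
    mid = len(line) // 2
    spaces = [i for i, ch in enumerate(line) if ch == " "]
    cut = min(spaces, key=lambda i: abs(i - mid))
    return line[:cut] + "\n" + line[cut + 1:]

print(fix_missing_spaces("Hey.You."))
print(break_long_line("I'm driving into town today and will attend a meeting at the new library."))
```

The lowercase-letter requirement keeps the pattern away from numbers like 3.5 and abbreviations like U.S.A., at the cost of missing some genuine cases.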

And voilà! Worked like a charm.

Thank you again for saving hours of dreary editing for me, but more importantly, for creating a tool that improves subtitle files and enhances the viewing experience for many, many people. :-)

P.S. New member -- prompted to sign up so I could leave this "thank you."

niksedk Apr 23, 2022
Maintainer

Nice to hear your experience :)

niksedk · 2021-12-22T07:58:46Z

Maintainer

Yes, it seems Tesseract has issues with this font (not related to word splitting)

niksedk · 2022-01-01T10:24:49Z

Maintainer

And this feature can be turned on/off in Settings.xml via the tag "OcrUseWordSplitList"
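
If you prefer to script the change, something like this Python sketch could flip the tag. Only the tag name comes from this thread; the surrounding XML layout and the "True"/"False" value format are assumptions, so check your own Settings.xml (and back it up) before trying anything similar.

```python
import xml.etree.ElementTree as ET

def set_word_split(settings_xml, enabled):
    """Return settings XML text with every OcrUseWordSplitList element set."""
    root = ET.fromstring(settings_xml)
    # iter() finds the tag wherever it is nested, so the layout is not assumed.
    for node in root.iter("OcrUseWordSplitList"):
        node.text = "True" if enabled else "False"
    return ET.tostring(root, encoding="unicode")

# Hypothetical, heavily trimmed stand-in for a real Settings.xml.
sample = (
    "<Settings><Tools>"
    "<OcrUseWordSplitList>True</OcrUseWordSplitList>"
    "</Tools></Settings>"
)
updated = set_word_split(sample, enabled=False)
print(updated)
```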

1 reply

niksedk Apr 23, 2022
Maintainer

Or in the UI: Options - Settings - Tools.
