Fix OCR errors option ? #41

alexdns1 · 2023-10-17T09:46:50Z

Do you plan on adding a "Fix OCR errors" option, like the one in Subtitle Edit, to clean up badly OCR'd text?

GFoley83 · 2024-04-13T13:33:56Z

@Tentacule Just bumping this one.

I've run a bit of a test with the latest version of PgsToSrt and Subtitle Edit w/ Tesseract 5.3.3.
I can't seem to replicate the accuracy of Subtitle Edit using PgsToSrt, even with "Fix OCR errors" unchecked. Some of the conversions, which should be pretty basic, come out as gibberish (see screenshot).

Command I'm using for PgsToSrt is:
dotnet PgsToSrt.dll --input "file.mkv" --tracklanguage eng --tesseractdata "C:\Program Files\Tesseract-OCR\tessdata" --tesseractlanguage eng

Any ideas here?

Tentacule · 2024-04-13T15:23:13Z

I won't add "Fix OCR errors" for now because this functionality is not included in LibSE.

I have done some tests and it looks like an issue on Windows; it's working fine when run on Linux. I'll investigate.

GFoley83 · 2024-04-13T21:40:45Z

I just flicked you an email.

I don't think "Fix OCR errors" will make a difference anyway as I had it disabled in SE (see first screenshot) and it still converted the PGS subs almost perfectly. Issue is something else.

Thanks for looking into it.

Tentacule · 2024-04-14T21:35:20Z

There was an issue in the Windows Tesseract DLL. I tried another one and it looks good now.

Here is a new release with this change: PgsToStr-1.4.5.zip

GFoley83 · 2024-04-16T04:43:21Z

Can confirm that 1.4.5 fixes it. Does a much better job at conversion with no random gibberish to be seen.
Tested with eng.traineddata from:

Commands (first with tessdata, then with tessdata_best):

dotnet "PgsToSrt-1.4.5\PgsToSrt.dll" --input "file.mkv" --tracklanguage eng --tesseractdata "C:\Program Files\Tesseract-OCR\tessdata" --tesseractlanguage eng

dotnet "PgsToSrt-1.4.5\PgsToSrt.dll" --input "file.mkv" --tracklanguage eng --tesseractdata "C:\Program Files\Tesseract-OCR\tessdata_best" --tesseractlanguage eng

On the test I ran with the English subtitles for the movie Blade, using tessdata_best took just under 4 minutes with mostly perfect results, while tessdata took 1 minute and had only a few very minor mistakes, e.g. a capital "I" instead of "i".
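
For anyone who wants to repeat the tessdata vs tessdata_best timing comparison, here is a rough Python sketch. It only uses the PgsToSrt flags quoted in this thread (`--input`, `--tracklanguage`, `--tesseractdata`, `--tesseractlanguage`). The helper names `build_command`, `time_conversion` and `compare_models` are made up for illustration, and the paths are the ones from my machine, so treat them as assumptions and adjust them for yours.

```python
import subprocess
import time


def build_command(dll, input_file, tessdata_dir, language="eng"):
    """Build the PgsToSrt argument list using only the flags quoted in this thread."""
    return [
        "dotnet", str(dll),
        "--input", str(input_file),
        "--tracklanguage", language,
        "--tesseractdata", str(tessdata_dir),
        "--tesseractlanguage", language,
    ]


def time_conversion(command):
    """Run one conversion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)  # raises CalledProcessError on a non-zero exit
    return time.perf_counter() - start


def compare_models(dll, input_file, tessdata_dirs):
    """Time the same input against each tessdata folder (hypothetical helper)."""
    results = {}
    for tessdata_dir in tessdata_dirs:
        command = build_command(dll, input_file, tessdata_dir)
        results[tessdata_dir] = time_conversion(command)
    return results
```

Calling `compare_models(r"PgsToSrt-1.4.5\PgsToSrt.dll", "file.mkv", [r"C:\Program Files\Tesseract-OCR\tessdata", r"C:\Program Files\Tesseract-OCR\tessdata_best"])` should give a dict of seconds per folder. Each run may overwrite the .srt produced by the previous one, so copy the output aside between runs if you want to compare the text as well as the timing.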

GFoley83 mentioned this issue Apr 16, 2024

Model used for trained data #42

Open
