Fix OCR errors option ? #41

Open
alexdns1 opened this issue Oct 17, 2023 · 5 comments

Comments

@alexdns1

Do you plan on adding a "Fix OCR errors" option, like the one in Subtitle Edit, to clean up badly OCR'd text?

@GFoley83

GFoley83 commented Apr 13, 2024

@Tentacule Just bumping this one.

I ran a quick test with the latest version of PgsToSrt and with Subtitle Edit, both using Tesseract 5.3.3.
I can't replicate Subtitle Edit's accuracy with PgsToSrt, even with "Fix OCR errors" unchecked in Subtitle Edit. Some of the conversions, which should be pretty basic, come out as gibberish (see screenshot).

The command I'm using for PgsToSrt is:
dotnet PgsToSrt.dll --input "file.mkv" --tracklanguage eng --tesseractdata "C:\Program Files\Tesseract-OCR\tessdata" --tesseractlanguage eng

Any ideas here?

image

image
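
One way to put a number on the gap between the two tools, rather than comparing screenshots by eye, is to diff the subtitle text of the two .srt outputs. The sketch below is only an illustration and is not part of PgsToSrt or Subtitle Edit: `srt_text_lines` and `similarity` are hypothetical helper names, and it assumes both files use the standard SRT layout (index line, timestamp line, text lines).

```python
import difflib
import re

# Matches a standard SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_text_lines(srt: str) -> list[str]:
    """Return only the subtitle text lines, dropping indexes, timings and blanks.

    Caveat: a cue whose text is purely digits (e.g. "42") is also dropped,
    because it cannot be told apart from an index line here.
    """
    lines = []
    for raw in srt.splitlines():
        line = raw.strip()
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        lines.append(line)
    return lines


def similarity(reference_srt: str, candidate_srt: str) -> float:
    """Similarity ratio in [0, 1] between the text of two SRT documents."""
    ref = "\n".join(srt_text_lines(reference_srt))
    cand = "\n".join(srt_text_lines(candidate_srt))
    return difflib.SequenceMatcher(None, ref, cand).ratio()
```

Reading the Subtitle Edit output as the reference and the PgsToSrt output as the candidate, a ratio close to 1.0 would mean the two conversions nearly agree, while gibberish lines pull the ratio down.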

@Tentacule
Owner

Tentacule commented Apr 13, 2024

I won't add "Fix OCR errors" for now, because this functionality is not included in LibSE.

I have done some tests. It looks like an issue on Windows; it works fine when run on Linux. I'll investigate.

@GFoley83

GFoley83 commented Apr 13, 2024

I just flicked you an email.

I don't think "Fix OCR errors" will make a difference anyway, as I had it disabled in SE (see first screenshot) and it still converted the PGS subs almost perfectly. The issue is something else.

Thanks for looking into it.

@Tentacule
Owner

There was an issue in the Windows Tesseract DLL. I tried another one and it looks good now.

Here is a new release with this change: PgsToStr-1.4.5.zip

@GFoley83

GFoley83 commented Apr 16, 2024

Can confirm that 1.4.5 fixes it. It does a much better job at conversion, with no random gibberish to be seen.
Tested with eng.traineddata from:

Commands:

dotnet "PgsToSrt-1.4.5\\PgsToSrt.dll" --input "file.mkv" --tracklanguage eng --tesseractdata "C:\\Program Files\\Tesseract-OCR\\tessdata" --tesseractlanguage eng

dotnet "PgsToSrt-1.4.5\\PgsToSrt.dll" --input "file.mkv" --tracklanguage eng --tesseractdata "C:\\Program Files\\Tesseract-OCR\\tessdata_best" --tesseractlanguage eng

In the test I ran with the English subtitles for the movie Blade, tessdata_best took just under 4 minutes with mostly perfect results, while tessdata took 1 minute and made only a few very minor mistakes, e.g. a capital "I" instead of "i".
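
For anyone who wants to repeat the speed comparison, here is a rough Python sketch that builds the same command line and times one run per tessdata folder. The flags are the ones used in the commands above; `build_command` and `time_conversion` are hypothetical helper names, and the paths are placeholders to adjust for your machine.

```python
import subprocess
import time


def build_command(dll: str, input_file: str, tessdata_dir: str,
                  language: str = "eng") -> list[str]:
    """Build the PgsToSrt invocation with the flags shown in this thread."""
    return [
        "dotnet", dll,
        "--input", input_file,
        "--tracklanguage", language,
        "--tesseractdata", tessdata_dir,
        "--tesseractlanguage", language,
    ]


def time_conversion(cmd: list[str], runner=subprocess.run) -> float:
    """Run one conversion and return the elapsed wall-clock time in seconds.

    `runner` can be swapped out (e.g. for a dry run); by default the command
    is executed and a non-zero exit code raises CalledProcessError.
    """
    start = time.perf_counter()
    runner(cmd, check=True)
    return time.perf_counter() - start


# Example usage (placeholder paths, uncomment to actually run dotnet):
# for folder in ("tessdata", "tessdata_best"):
#     cmd = build_command(r"PgsToSrt-1.4.5\PgsToSrt.dll", "file.mkv",
#                         rf"C:\Program Files\Tesseract-OCR\{folder}")
#     print(folder, f"{time_conversion(cmd):.1f}s")
```

Passing the arguments as a list avoids shell quoting problems with the space in "Program Files".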
