Fix/tessdata prefix path resolution #2251
DhanushVarma-2 wants to merge 5 commits into CCExtractor:master
The matroska_track_text_subtitle_id_extensions array had 7 entries for an 8-value enum, leaving MATROSKA_TRACK_SUBTITLE_CODEC_ID_KATE (index 7) out of bounds. On most platforms this read NULL, which then caused strlen(NULL) UB and snprintf to emit .(null) in the output filename.

Two fixes:
- Add "kate" at index 7 in the extensions array so KATE tracks produce correct .kate output filenames
- Add a NULL guard in generate_filename_from_track() so any future unknown codec ID safely falls back to .bin instead of crashing or producing .(null)

Fixes CCExtractor#972
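The two fixes described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `demo_*` enum, table, and functions are hypothetical stand-ins (only the KATE codec and the `.bin` fallback come from the commit message), and the `name_track.ext` filename format is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the real codec enum; the real one has 8 values. */
enum demo_codec_id {
    DEMO_CODEC_ID_SRT = 0,
    DEMO_CODEC_ID_ASS,
    DEMO_CODEC_ID_KATE,
    DEMO_CODEC_ID_COUNT /* sentinel: number of real entries */
};

/* Designated initializers tie each extension to its enum value. */
static const char *const demo_extensions[] = {
    [DEMO_CODEC_ID_SRT] = "srt",
    [DEMO_CODEC_ID_ASS] = "ass",
    [DEMO_CODEC_ID_KATE] = "kate",
};

/* Fails the build if the last enum value has no table entry. */
static_assert(sizeof(demo_extensions) / sizeof(demo_extensions[0]) ==
                  DEMO_CODEC_ID_COUNT,
              "extension table out of sync with enum");

/* Returns the extension for a codec ID, or "bin" for unknown or unset IDs. */
static const char *demo_extension_for(int codec_id)
{
    const char *ext = NULL;

    if (codec_id >= 0 && codec_id < DEMO_CODEC_ID_COUNT)
        ext = demo_extensions[codec_id];
    return ext != NULL ? ext : "bin";
}

/* Writes "<basename>_<track>.<ext>" into out; returns 0 on success, -1 on
 * invalid input or truncation. The format is an assumption for this sketch. */
static int demo_make_filename(char *out, size_t out_size,
                              const char *basename, int track, int codec_id)
{
    int n;

    if (out == NULL || out_size == 0 || basename == NULL ||
        strlen(basename) == 0)
        return -1;
    n = snprintf(out, out_size, "%s_%d.%s", basename, track,
                 demo_extension_for(codec_id));
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The compile-time check catches a missing trailing entry, which is the shape of the bug described here. A hole in the middle of the table would still be NULL, which is why the runtime guard is kept as well.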
- Both Tesseract 4/5 and legacy (<4) branches now use a consistently built tess_path instead of raw tessdata_path or manual concatenation
- Handles the case where TESSDATA_PREFIX already points at the tessdata dir itself (avoids double-appending 'tessdata')
- Handles Windows paths ending with backslash correctly
- Adds mprint diagnostic showing the resolved tessdata path

Fixes CCExtractor#1492
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit d56a6be...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit d56a6be...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
Closing — a few issues:
The OCR path fix itself looks reasonable — please resubmit as a new PR with:
Thanks.
In raising this pull request, I confirm the following (please check boxes):
Reason for this PR:
Sanity check:
Repro instructions:
This is essential. We will not merge ANY PR that doesn't come with detailed instructions, including a sample. We don't want
"fixes" for theoretical issues that an AI agent found, without context. If you can't reproduce the bug, don't send a PR.
Creating PRs with AI is very quick, but we still have humans (even if AI assisted) going over each.
Be mindful of reviewers' time.
Root cause: Two bugs in init_ocr() in ocr.c:
- The Tesseract 4/5 branch unconditionally appended /tessdata to the path returned by probe_tessdata_location(). If TESSDATA_PREFIX was already set to a path ending in tessdata/, this caused a double-append (e.g. /usr/share/tessdata/tessdata).
- The legacy Tesseract <4 branch passed tessdata_path raw to TessBaseAPIInit4 without appending tessdata at all, so Tesseract looked for eng.traineddata directly in e.g. /usr/share/ instead of /usr/share/tessdata/.
Fix: Normalize the path once before both branches — detect whether the returned path already ends with tessdata or tessdata/, and handle Windows backslash separators correctly.
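One way the normalization described above could look is sketched below. This is an assumption-laden illustration rather than the code in this PR: `build_tess_path` and `is_sep` are hypothetical names, and choosing the output separator from the separators already present in the prefix is a design choice of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: treats both '/' and '\\' as path separators. */
static int is_sep(char c)
{
    return c == '/' || c == '\\';
}

/* Hypothetical helper: writes the directory expected to hold *.traineddata
 * into out. Returns 0 on success, -1 on NULL prefix or truncation. */
static int build_tess_path(const char *prefix, char *out, size_t out_size)
{
    static const char leaf[] = "tessdata";
    const size_t leaf_len = sizeof(leaf) - 1;
    size_t len;
    int ends_with_leaf;
    int n;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    if (prefix == NULL)
        return -1;

    /* Drop trailing separators, but keep a lone root such as "/". */
    len = strlen(prefix);
    while (len > 1 && is_sep(prefix[len - 1]))
        len--;

    /* True only if the last path component is exactly "tessdata". */
    ends_with_leaf = len >= leaf_len &&
                     strncmp(prefix + len - leaf_len, leaf, leaf_len) == 0 &&
                     (len == leaf_len || is_sep(prefix[len - leaf_len - 1]));

    if (ends_with_leaf) {
        /* Already points at the tessdata dir: copy without appending. */
        n = snprintf(out, out_size, "%.*s", (int)len, prefix);
    } else {
        const char *sep = "";

        /* Use a backslash only for prefixes that use backslashes alone. */
        if (len > 0 && !is_sep(prefix[len - 1]))
            sep = (memchr(prefix, '\\', len) != NULL &&
                   memchr(prefix, '/', len) == NULL) ? "\\" : "/";
        n = snprintf(out, out_size, "%.*s%s%s", (int)len, prefix, sep, leaf);
    }
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Under these assumptions, both `/usr/share` and `/usr/share/tessdata/` resolve to `/usr/share/tessdata`, while a prefix such as `/opt/mytessdata` still gets `/tessdata` appended because its last component is not exactly `tessdata`.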
Tested on: macOS (Apple Silicon, Tesseract 5.5.1 via Homebrew). All 6 path cases verified correct including TESSDATA_PREFIX pointing directly at tessdata dir and Windows paths.