Ver 0.85 CEA-708: 16 bit charset (Korean) Not support #690

Open
gkehstn opened this issue Feb 17, 2017 · 18 comments

@gkehstn
commented Feb 17, 2017

0.78 (2015-12-12)
  - CEA-708: 16 bit charset support (tested on Korean).
0.84: test result is normal.
0.85: not supported.

  • See issue #286.
@cfsmp3

Contributor

commented Feb 18, 2017

GSoC qualification: 2 points

@Izaron

Contributor

commented Feb 19, 2017

Well, I changed this part of the code because I got wrong output in many videos.
Link 1 (before my changes) - https://gist.github.com/Izaron/34136a8ec8216469c3c3828acdfbe53e
Link 2 (my change) - d60baf1
Link 3 (after my changes - absolutely correct) - https://gist.github.com/Izaron/44c030eae8c6c1049ae6d3e6c6d0dd32

Can you please attach the video file that produces the wrong text? If this worked correctly in 0.84 and doesn't work in 0.85, I will try to fix the error.

@HaneolLee

commented Feb 20, 2017

  1. When I run it with version 0.84, the Korean output is good.
    link 1 : https://drive.google.com/open?id=0BxFzM3fSXVOiZEo2R1E4MEFFY1U

  2. When I run it with version 0.85, I do not see any Korean.
    link 2 : https://drive.google.com/open?id=0BxFzM3fSXVOiSnVBZkc4RlBzVkE

Both were run with the same options.

  3. I have uploaded the tested video file.
    https://drive.google.com/open?id=0BxFzM3fSXVOiV3hUTnVoVVRjeDg
@Izaron

Contributor

commented Feb 20, 2017

I wrote a patch.
Remember that you should call it as "ccextractor -svc all[EUC-KR]" or similar.
Resulting file - https://paste.fedoraproject.org/paste/imMCT5qPdsAk8TlL8Qa35V5M1UNdIGYhyRLivL9gydE=/raw

"See issue # 286"

Yes, that's bad... I can say I'm waiting for a new GSoC student to come and fix it 😄

@cfsmp3 cfsmp3 closed this in #693 Feb 20, 2017

@unicode45

commented Dec 25, 2017

Version 0.85 still cannot extract Korean characters properly.

I've attached sample SRT files generated from the samples below.
https://drive.google.com/drive/folders/0B_61ywKPmI0TZU00VjRYWENfYjg

Files whose names start with Ver079 are correct. 0.85 produces broken characters for everything except ASCII characters.
cea708.zip

@cfsmp3 cfsmp3 reopened this Dec 25, 2017

@gray-v

Contributor

commented Dec 26, 2017

Further regressions since 0.85: using mbc.ts linked above, I get
00:00:01,234 --> 00:00:01,368
җס, Ѩ½ ֧ٮLߺյԄ.
instead of
00:00:01,601 --> 00:00:01,735
뇗랡, 냨쇽 뚧릮샌뻺듵도.

This is caused by using write_utf16_char instead of utf16_to_utf8 in 29180a9

Attempting fix now.

@gray-v

Contributor

commented Dec 26, 2017

...mate, I don't know how Korean encoding works, but I'm not getting Korean in the previous versions either.

Here's a byte-by-byte comparison of the 0.85 and 0.84 output, respectively:

EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E

B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4

0.84 literally does not produce valid Unicode characters, so either the change was actually a fix (I doubt it: 0.85 produces completely illegible strings of random words), or the data is in some encoding other than Unicode. Can someone confirm what exactly Korean 708 subs are encoded in: EUC-KR, UTF-16, or maybe something else?
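
The two dumps above appear to be related in a simple way: each 3-byte group in the 0.85 line is what results when a 2-byte pair from the 0.84 line is read as one UTF-16 code unit and re-encoded as UTF-8. A minimal sketch in C to check this by hand; `encode_bmp_utf8` is a hypothetical helper written for this illustration (not ccextractor's `utf16_to_utf8`), and it assumes BMP-only input with no surrogate handling:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not ccextractor code): encode one UTF-16 code unit
 * from the Basic Multilingual Plane as UTF-8. Surrogates are not handled.
 * Returns the number of bytes written to out (1 to 3). */
static size_t encode_bmp_utf8(uint16_t cu, unsigned char out[3])
{
    if (cu < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cu;
        return 1;
    }
    if (cu < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cu >> 6));
        out[1] = (unsigned char)(0x80 | (cu & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cu >> 12));
    out[1] = (unsigned char)(0x80 | ((cu >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cu & 0x3F));
    return 3;
}
```

Passing the first pair of the 0.84 dump as the code unit 0xB1D7 gives EB 87 97, and 0xB7A1 gives EB 9E A1, which are the first two groups of the 0.85 dump. That is consistent with the input being two-byte data in a non-Unicode charset that 0.85 reinterprets as UTF-16.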

@unicode45

commented Dec 26, 2017

EUC-KR is the common case, but both Unicode and EUC-KR can be used.
You can find out which encoding is used by checking the Caption Service Descriptor in the PMT.
If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (1 means EUC-KR).
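
The rule above could be sketched in C roughly as follows. This assumes the language code and the korean_code bit have already been read out of the descriptor; the enum and the function name `korean_caption_encoding` are hypothetical, and the bit layout of the descriptor itself is deliberately left out because it is not given in this thread:

```c
#include <ctype.h>

enum kr_encoding { KR_ENC_UNKNOWN, KR_ENC_UNICODE, KR_ENC_EUC_KR };

/* Hypothetical helper: apply the rule to fields already extracted from one
 * caption service descriptor entry.
 * lang        : 3-byte language code, not NUL-terminated
 * korean_code : the 1-bit flag (0 = Unicode, 1 = EUC-KR) */
static enum kr_encoding korean_caption_encoding(const unsigned char lang[3],
                                                int korean_code)
{
    /* Compare case-insensitively so "kor" and "KOR" both match
     * (this also accepts mixed case, which is an assumption). */
    if (tolower(lang[0]) != 'k' || tolower(lang[1]) != 'o' ||
        tolower(lang[2]) != 'r')
        return KR_ENC_UNKNOWN;  /* the rule only covers Korean services */

    return korean_code ? KR_ENC_EUC_KR : KR_ENC_UNICODE;
}
```

Returning a separate "unknown" value for non-Korean services leaves the caller free to keep its existing behaviour for other languages.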

@gray-v

Contributor

commented Dec 26, 2017

OK, I'm fairly sure we don't have EUC-KR support, and that definitely wasn't EUC-KR since it was on Notepad of all things, so I'm just as stumped here. I guess I'll work on EUC-KR support; there's a cool free lib for that. Other than that I'm stumped, since none of these outputs are legible and I have no idea what encoding @HaneolLee used to get that output on 0.84.

@unicode45

commented Dec 26, 2017

I think ccextractor requires iconv (libiconv) for this. I was able to convert it by adding "-svc all[EUC-KR]".
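
For reference, an EUC-KR to UTF-8 conversion with iconv can be sketched as below. This is a standalone illustration using the standard `iconv_open`/`iconv`/`iconv_close` calls, not ccextractor's own code path; the function name `euckr_to_utf8` is hypothetical, and it assumes the platform's iconv knows the "EUC-KR" charset (on some systems you also need to link with -liconv):

```c
#include <iconv.h>
#include <stddef.h>

/* Sketch: convert an EUC-KR byte string to UTF-8 with iconv(3).
 * Returns the number of UTF-8 bytes written, or -1 on failure. */
static long euckr_to_utf8(const char *in, size_t in_len,
                          char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "EUC-KR");   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;                                /* charset not available */

    char *src = (char *)in;   /* iconv takes char **; input bytes are not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;   /* invalid or truncated input, or output buffer too small */
    return (long)(out_cap - dst_left);
}
```

As a sanity check, the EUC-KR bytes C7 D1 B1 DB ("한글") should come out as the six UTF-8 bytes ED 95 9C EA B8 80.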

@gray-v

Contributor

commented Dec 27, 2017

Confirming: on the latest builds, both samples linked by @unicode45 convert successfully if I add -svc all[EUC-KR].

mystery solved

@cfsmp3

Contributor

commented Dec 27, 2017

@gray-v

Contributor

commented Dec 27, 2017

Doesn't seem possible: valid EUC-KR characters are also valid Unicode characters, and I reckon it would be very hard to tell automatically what the correct encoding is.
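
The ambiguity at the byte level can be illustrated with a small C sketch. It relies on two facts: two-byte EUC-KR characters use bytes in the range 0xA1..0xFE for both positions, and precomposed Hangul syllables occupy U+AC00..U+D7A3 in Unicode. The function names are hypothetical, and reading the pair as big-endian UTF-16 is an assumption made for the illustration:

```c
#include <stdint.h>

/* Two-byte EUC-KR characters use 0xA1..0xFE for both lead and trail byte. */
static int is_euckr_pair(uint8_t lead, uint8_t trail)
{
    return lead >= 0xA1 && lead <= 0xFE && trail >= 0xA1 && trail <= 0xFE;
}

/* Precomposed Hangul syllables occupy U+AC00..U+D7A3. */
static int is_hangul_syllable(uint16_t cu)
{
    return cu >= 0xAC00 && cu <= 0xD7A3;
}

/* True when the same two bytes look plausible under both readings:
 * as an EUC-KR pair and as a big-endian UTF-16 Hangul syllable. */
static int is_ambiguous(uint8_t b0, uint8_t b1)
{
    uint16_t as_utf16_be = (uint16_t)((b0 << 8) | b1);
    return is_euckr_pair(b0, b1) && is_hangul_syllable(as_utf16_be);
}
```

Pairs such as B1 D7 and B7 A1 from the dumps earlier in this thread pass both checks, so inspecting the caption bytes alone cannot settle the question.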

@cfsmp3

Contributor

commented Dec 28, 2017

@gray-v did you read this?

"EUC-KR is the common case, but both Unicode and EUC-KR can be used.
You can find out which encoding is used by checking the Caption Service Descriptor in the PMT.
If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (1 means EUC-KR)."

@thetransformerr

Contributor

commented Jul 9, 2018

Hi all, @unicode45, @cfsmp3

I tested with -svc all and it worked fine, but regarding the suggestion above:

"EUC-KR is the common case, but both Unicode and EUC-KR can be used.
You can find out which encoding is used by checking the Caption Service Descriptor in the PMT.
If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (1 means EUC-KR)."

I cannot find any entry for or reference to such a field in the PMT, either in the code or in the PMT definition in ISO 13818 (table 2.24), though I may have missed it. Could anyone point out where I can find references that would make the above change possible?
All I could determine is that PMTs are used to store program information, and that the table location can be defined for each service in the PAT, though ISO 13818 recommends 0x0002.

The following line from the code looks related, but I can't work out how to modify it:

int parse_PMT (struct ccx_demuxer *ctx, unsigned char *buf, int len, struct program_info *pinfo)

Please point out what I am missing.
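
One possible starting point, sketched in C: MPEG-2 descriptors are stored as (tag, length, payload) triples, so a descriptor loop taken from the PMT can be scanned for a given tag. To my understanding ATSC A/65 assigns tag 0x86 to the caption service descriptor; treat that value, the macro name, and the helper `find_descriptor` as assumptions for illustration. This is not part of `parse_PMT`, and locating the descriptor loop inside the PMT section is not shown:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed tag of the caption service descriptor (ATSC A/65). */
#define CAPTION_SERVICE_DESCRIPTOR_TAG 0x86

/* Hypothetical helper: scan a descriptor loop for the first descriptor with
 * the given tag. Returns a pointer to its payload and stores the payload
 * length in *payload_len, or returns NULL if absent or malformed. */
static const uint8_t *find_descriptor(const uint8_t *loop, size_t loop_len,
                                      uint8_t tag, size_t *payload_len)
{
    size_t pos = 0;
    while (pos + 2 <= loop_len) {
        uint8_t cur_tag = loop[pos];
        size_t cur_len = loop[pos + 1];

        if (pos + 2 + cur_len > loop_len)
            return NULL;              /* descriptor overruns the loop */
        if (cur_tag == tag) {
            *payload_len = cur_len;
            return loop + pos + 2;    /* payload starts after tag + length */
        }
        pos += 2 + cur_len;           /* skip to the next descriptor */
    }
    return NULL;
}
```

A NULL result would correspond to the case where no caption service descriptor is signalled for the stream.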

@unicode45

commented Jul 9, 2018

Hi, @thetransformerr

I've found some information, but I'm sorry, it's written in Korean (Google Translate should help).
http://www.nl.go.kr/app/nl/search/common/download.jsp?file_id=FILE-00008442489

Here's a summary of the parts related to the PMT.

  • Page No.25, Chapter B.1
    "PMT is an optional value."
    (I think that's the reason you could not find it in the PMT.)

  • Page No.25, Chapter B.2 to Page No.28
    Describes the caption service descriptor.

  • Page No.28, Chapter B.3
    "DTVCC default mode in Korea: if DTVCC subtitle data exists in the DTVCC transmission channel but the PMT and EIT do not contain any caption service descriptor, it will be treated as Service 1 and EUC-KR."
    (So if you cannot find any PMT information for it, please treat it as Service 1 and EUC-KR.)

In my experience so far, I have not found any Korean subtitles written in Unicode.
I hope this helps.
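
The default mode from Chapter B.3 could be expressed as a simple fallback, sketched here in C. The struct and the function name `dtvcc_settings_or_default` are hypothetical; the only facts taken from the summary above are the fallback values, Service 1 and EUC-KR:

```c
#include <stddef.h>

/* Hypothetical pair of settings a caption decoder would need. */
struct dtvcc_choice {
    int service_number;     /* caption service to extract */
    const char *charset;    /* charset name for the text converter */
};

/* signalled == NULL means no caption service descriptor was found in the
 * PMT or EIT; in that case use the Korean default mode. */
static struct dtvcc_choice dtvcc_settings_or_default(
        const struct dtvcc_choice *signalled)
{
    if (signalled != NULL)
        return *signalled;  /* trust what the stream signals */

    struct dtvcc_choice fallback = { 1, "EUC-KR" };
    return fallback;
}
```

Keeping the fallback in one place would make it easy for a user-supplied option to override it.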

@thetransformerr

Contributor

commented Jul 9, 2018

Hey @unicode45,

Thanks very much for your reply and help. Given that we are able to extract Korean with -svc, wouldn't it be useful to make svc 1, EUC-KR the default?
In case of failure, the user can specify Unicode manually.

@unicode45

commented Jul 9, 2018

Hi, @thetransformerr

"Wouldn't it be useful to make svc 1, EUC-KR the default?"
Yes, I think so, because all broadcasts were svc 1, EUC-KR in my several years of experience.

Thanks,
