[BUG] Code page problems

CCExtractor version (using the --version parameter preferably) : **0.87** (it is actually cfsmp3's build from #926)

- [X] I have read and understood the [contributors guide](https://github.com/CCExtractor/ccextractor/blob/master/.github/CONTRIBUTING.md).
- [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
- [X] I have checked that the issue I'm posting isn't already reported.
- [X] I have checked that the issue I'm posting isn't already solved and no duplicates exist in [closed issues](https://github.com/CCExtractor/ccextractor/issues?q=is%3Aissue+is%3Aclosed) and in [opened issues](https://github.com/CCExtractor/ccextractor/issues)
- [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
- [X] I have used the latest available version of CCExtractor to verify this issue exists.

**My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):**

- [X] I absolutely love CCExtractor, but have contributed only once previously.

**Necessary information**
- Is this a regression (did it work before)? [X] NO | [ ] YES
- What platform did you use? [X] Windows - [ ] Linux - [ ] Mac
- What were the used arguments? Multiple combinations. See below.

**Additional information**

On Windows, there are currently several issues with codepages and the handling of special characters. All of the following examples were tested on a German Windows system where CP-1252 is the default codepage for non-Unicode, non-cli applications and CP 850 is the default cli codepage.

a) `ccextractorwin.exe -autoprogram ä.ts` (make sure that a file "ä.ts" exists in your working directory or somewhere else ccextractor can find it; the actual content doesn't matter): The file is handled correctly, but ccextractor emits `Input: õ.ts` and `Opening file: õ.ts` on the command line. `ä` is 0xE4 in CP-1252, and 0xE4 is `õ` in CP 850, which cmd.exe uses and expects. So ccextractor seems to omit a conversion to the console's active codepage. (If I am not mistaken, these codepages are legacy and have been for quite some time; the console has full Unicode support, so using Unicode is probably a cleaner solution than converting between codepages.)
b) The command line is the same as in a), but this time the active console CP is 852 (run `chcp 852` first). The file is opened correctly, but now ccextractor emits `ń.ts` because 0xE4 is `ń` in CP 852. This confirms my conclusion from a).
c) `ccextractorwin.exe -autoprogram ě.ts`: This time the file is not handled at all; ccextractor outputs `Input: e.ts` and `Error: Failed to open input file: File does not exist.` (as well as its configuration data, `[Program : Auto ]` etc.). My guess as to what happens: because ě is in neither CP-1252 nor CP 850, somewhere in the processing there is a lossy, best-effort conversion to CP-1252 or CP 850, and the best match for ě is e.
d) [echoargs.exe](http://ss64.com/ps/EchoArgs.exe) shows that when the cli CP is 850, the console itself converts ě to e. This does not conclusively show whether the console or ccextractor converts the characters in case c).
e) Same as c), but this time we set chcp 852 first. The output is the same as c).
f) But with chcp 852 echoargs shows that the console leaves the ě untouched. So there seems to be a conversion of the input to CP 1252 (or more generally, to the CP that the non-unicode non-cli applications use) in any case.
g) To find out whether there is an intermediate conversion to the console's CP, I set the codepage to 437 and tested a file called "Ø.ts"; CP 437 lacks "Ø", and echoargs shows that the console converts it to O when it converts to CP 437. ccextractor can open the file; of course the problem from a) happens here, too: Ø is 0xD8 in CP-1252 and 0xD8 in CP 437 is ╪. This shows that there is no conversion to the cli CP in between, so it is probably not the console at all that converts the input file name.
h) Same as c), but this time there is also a file e.ts besides the ě.ts. Result: e.ts is opened.
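
The byte reinterpretation seen in a), b) and g) can be reproduced outside ccextractor. The following is a minimal Python sketch using only the standard codecs; the helper names `shown_on_console` and `approx_best_fit` are made up for illustration. The second helper is only a rough stand-in for the lossy conversion guessed at in c): it strips combining marks after Unicode decomposition, which is an assumption and not the actual Windows conversion.

```python
import unicodedata


def shown_on_console(name: str, ansi_cp: str, console_cp: str) -> str:
    """Encode with the ANSI code page, then decode the same bytes
    with the console code page (no conversion in between)."""
    return name.encode(ansi_cp).decode(console_cp)


def approx_best_fit(name: str, ansi_cp: str) -> str:
    """Rough approximation of a lossy best-effort conversion: keep characters
    the code page can encode, otherwise drop combining marks after NFKD.
    This is NOT the real Windows mapping, only an illustration."""
    out = []
    for ch in name:
        try:
            ch.encode(ansi_cp)
            out.append(ch)
        except UnicodeEncodeError:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


# Case a): 0xE4 written as CP-1252, read as CP 850
assert shown_on_console("ä.ts", "cp1252", "cp850") == "õ.ts"
# Case b): same byte, console switched to CP 852
assert shown_on_console("ä.ts", "cp1252", "cp852") == "ń.ts"
# Case g): 0xD8 written as CP-1252, read as CP 437
assert shown_on_console("Ø.ts", "cp1252", "cp437") == "╪.ts"
# Cases c) and h): ě is not in CP-1252, so the name degrades to e.ts
assert approx_best_fit("ě.ts", "cp1252") == "e.ts"
```

The first three assertions match the observed console output exactly, which supports the reading that the file name bytes stay in CP-1252 and are merely displayed in the console's codepage.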

I searched a bit, and it seems that the usual way of solving this is to use proper Unicode by including windows.h and defining UNICODE; but the fact that ccextractor is cross-platform might complicate things.
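
To illustrate why a Unicode code path would sidestep these problems, here is a small Python sketch (not ccextractor code; the helper `survives` is hypothetical). It checks which of the test file names survive a round trip through UTF-16, the encoding used by the wide-character Windows API, and through the legacy codepages mentioned above.

```python
names = ["ä.ts", "ě.ts", "Ø.ts"]


def survives(name: str, encoding: str) -> bool:
    """True if the name can be encoded and decoded again without loss."""
    try:
        return name.encode(encoding).decode(encoding) == name
    except UnicodeEncodeError:
        return False


encodings = ("utf-16-le", "cp1252", "cp850", "cp852", "cp437")
# For each encoding: one flag per entry in `names`, in the same order.
table = {enc: [survives(n, enc) for n in names] for enc in encodings}

assert table["utf-16-le"] == [True, True, True]   # every name is representable
assert table["cp1252"] == [True, False, True]     # ě is missing, as in c)
assert table["cp852"] == [True, True, False]      # ě exists here, as in f)
assert table["cp437"] == [True, False, False]     # Ø is missing, as in g)
```

Only the Unicode encoding represents all three names, so any path that passes through a single legacy codepage will lose at least one of them.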

And finally, there is a bug in the GUI's preview window. Some non-ASCII characters aren't properly displayed; others are fine, though. For example the [sample](https://www.dropbox.com/s/0o2ncppc0hq8ljt/DVB-Teletext%20incomplete.ts?dl=0) I uploaded for #922 shows this in the preview box:
```
00:00  00:04  Sie können ihn nicht von der Schule
               schmei�Yen! Schulpflicht!
 00:05  00:09  Für diese Klassenstufe ist er nicht
               geeignet.    Stufen Sie ihn zurück!
```
I ran the cli version with the gui_mode_reports parameter and redirected stderr. The subtitle-related part is proper UTF-8, and all characters longer than one byte are correct, too, including the "ß" that is displayed as �Y above. I have no clue why the umlauts are displayed fine in the preview but ß isn't.
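
One byte-level difference that might help narrow this down, offered as an observation and not a diagnosis since the GUI code was not inspected: the second UTF-8 byte of ß falls in the range 0x80 to 0x9F, where CP-1252 and ISO-8859-1 differ, while the second bytes of ä, ö and ü lie above that range. A short Python check (the helper name is made up for illustration):

```python
def trailing_byte_in_c1_range(ch: str) -> bool:
    """True if the last UTF-8 byte of ch lies in 0x80-0x9F."""
    return 0x80 <= ch.encode("utf-8")[-1] <= 0x9F


utf8_hex = {ch: ch.encode("utf-8").hex() for ch in "äöüß"}

assert utf8_hex == {"ä": "c3a4", "ö": "c3b6", "ü": "c3bc", "ß": "c39f"}
assert [trailing_byte_in_c1_range(ch) for ch in "äöüß"] == [False, False, False, True]

# In CP-1252 the byte 0x9F is 'Ÿ', which resembles the 'Y' in the preview.
assert bytes([0x9F]).decode("cp1252") == "Ÿ"

# If this range is the cause, uppercase umlauts should be affected as well.
assert all(trailing_byte_in_c1_range(ch) for ch in "ÄÖÜ")
```

If the assumption is right, a sample containing Ä, Ö or Ü should show the same kind of garbling in the preview, which would be an easy way to confirm or rule it out.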
