UTF-8 Encoding Arabic Support #40

Closed
AmrSubZero opened this issue Jul 7, 2016 · 16 comments

@AmrSubZero

AmrSubZero commented Jul 7, 2016

As far as I know, the API sends and receives raw bytes.

But when I try to display Arabic words, they show up as "����", which seems to be "ISO-8859-1" or maybe "windows-1252", I don't know.

Is there an option that lets me change the API charset encoding to UTF-8, like in the Pear2/RouterOS PHP API? (see the link).

Or can the diamond question marks be converted to "UTF-8" in Java? If so, please post an example!

I really need to display Arabic words correctly, thanks.

@GideonLeGrange
Owner

Can you please provide me with an example RouterOS config so I can reproduce the problem?

@GideonLeGrange
Owner

@AmrSubZero I'm looking into this. It should be possible to add a setCharSet() call that will allow you to get strings constructed with the right charset. I need to change code deep in the implementation which will cause some refactoring. I don't want to just patch and release code without being able to test, so I have two questions for you:

  1. Can you please provide me with a sample Mikrotik configuration with Arabic characters in?
  2. Can you tell me what character encoding is the correct one? I'm thinking cp1256 but is it?

Let me know please, I'd like to fix this.

@AmrSubZero
Author

AmrSubZero commented Jul 17, 2016

First, I have created a sample configuration, working on RouterOS v5.25, with Arabic characters in the comment column of "ip/hotspot/users" and "ip bindings", so you can connect, fetch and display the "comment" column for those entries and test with them.

In the configuration the user is admin and there is no password.
The IN (WAN) Ethernet IP is 192.168.1.1 and the OUT (LAN, local) IP is 10.0.0.1.

I connect via the API to 10.0.0.1 with the user admin and no password, as mentioned above. If there is a connection problem, please let me know.

Here's the configuration file

Second, the correct character encoding is "cp1256".

I'm truly glad that you're ready to help. Waiting for your reply, thanks.
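
Whether a single-byte code page such as cp1256 can carry a given comment can be checked before anything is sent. What follows is a minimal sketch using only the JDK: the class and method names are illustrative, and it assumes the runtime ships the `windows-1256` charset (the JDK's name for cp1256), which a trimmed-down runtime may not.

```java
import java.nio.charset.Charset;

final class CodePageCheck {
    /** Returns true if every character of text can be encoded in the named charset. */
    static boolean fits(String text, String charsetName) {
        // A fresh encoder per call keeps the check free of shared state.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String arabic = "\u062A\u062C\u0631\u0628\u0629"; // an Arabic word, written with escapes
        String japanese = "\u4E8B\u5831";                 // two Japanese characters

        System.out.println(fits(arabic, "windows-1256"));   // Arabic letters are in the code page
        System.out.println(fits(japanese, "windows-1256")); // Japanese characters are not
        System.out.println(fits(japanese, "UTF-8"));        // UTF-8 covers both scripts
    }
}
```

A check like this only says what a code page can represent; it does not say which code page the router's other clients expect.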

@GideonLeGrange
Owner

@AmrSubZero I think I have a fix for this. If you want to test the fix, please pull the current master branch from github and build and try that version.

As an experiment, I've changed the internal encoding for transmission between the mikrotik-java API and RouterOS to be UTF-8 and it seems to solve the problem. I'll release version 3.0.3 shortly to make this change available.

Thanks to @boen_robot - your comments above pointed me in the right direction, which led to this StackOverflow answer: http://stackoverflow.com/questions/5729806/encode-string-to-utf-8

A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.
So if you have an encoding problem, by the time you have String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.
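
The point of that quote can be sketched in a few lines of plain Java. This is an illustration only, not code from mikrotik-java; the class name and the sample word are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DecodeAtBoundary {
    /** The only place the charset matters: turning received bytes into a String. */
    static String decode(byte[] received, Charset charset) {
        return new String(received, charset);
    }

    public static void main(String[] args) {
        String sent = "\u0639\u0631\u0628\u0649";            // a four-letter Arabic word
        byte[] wire = sent.getBytes(StandardCharsets.UTF_8); // two bytes per letter

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.US_ASCII);

        System.out.println(right.equals(sent));
        // Every non-ASCII byte became U+FFFD, the "diamond question mark".
        System.out.println(wrong.chars().allMatch(c -> c == 0xFFFD));
        // The original bytes are gone, so re-encoding the bad String cannot recover them.
        System.out.println(Arrays.equals(wrong.getBytes(StandardCharsets.US_ASCII), wire));
    }
}
```

Once the replacement characters are in the String, no later conversion brings the Arabic back, which is why the fix has to sit where the bytes are decoded.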

@GideonLeGrange
Owner

GideonLeGrange commented Oct 8, 2016

@AmrSubZero

This works for me. Can you please see if it works for you as well?

Example Java code that creates users with Arabic, Japanese and Cyrillic text in comments

    private static final String JAPANESE = "事報ハヤ久送とゅ歳用ト候新放すルドう二5春園ユヲロ然納レ部悲と被状クヘ芸一ーあぽだ野健い産隊ず";
    private static final String CYRILLIC = "Лорем ипсум долор сит амет, легере елояуентиам хис ид. Елигенди нолуиссе вих ут. Нихил";
    private static final String ARABIC = "تجهيز والمانيا تم قام. وحتّى المتاخمة ما وقد. أسر أمدها تكبّد عل. فقد بسبب ترتيب استدعى أم, مما مع غرّة، لأداء. الشتاء، عسكرياً";

    // con is assumed to be an already connected and logged-in API connection.
    private void test() throws MikrotikApiException {
        // Create one hotspot user per script, with the sample text as the comment.
        con.execute("/ip/hotspot/user/add name=userJ comment='" + JAPANESE + "'");
        con.execute("/ip/hotspot/user/add name=userC comment='" + CYRILLIC + "'");
        con.execute("/ip/hotspot/user/add name=userA comment='" + ARABIC + "'");

        // Read the users back and print each name with its comment.
        for (Map<String, String> res : con.execute("/ip/hotspot/user/print return name,comment")) {
            System.out.printf("%s : %s\n", res.get("name"), res.get("comment"));
        }
    }

_Output from that code_

default-trial : counters and limits for trial users
userJ : 事報ハヤ久送とゅ歳用ト候新放すルドう二5春園ユヲロ然納レ部悲と被状クヘ芸一ーあぽだ野健い産隊ず
userC : Лорем ипсум долор сит амет, легере елояуентиам хис ид. Елигенди нолуиссе вих ут. Нихил
userA : تجهيز والمانيا تم قام. وحتّى المتاخمة ما وقد. أسر أمدها تكبّد عل. فقد بسبب ترتيب استدعى أم, مما مع غرّة، لأداء. الشتاء، عسكرياً

_Screen shot of ssh login to Mikrotik_

[image: screen shot 2016-10-08 at 20 35 16]

@AmrSubZero
Author

AmrSubZero commented Oct 9, 2016

@GideonLeGrange

I've updated the Maven dependency in my Android project to: compile 'me.legrange:mikrotik:3.0.3'

I cleaned and rebuilt the project and tried the following examples:

private static final String ARABIC = "تجربة نص عربى";
con.execute("/ip/hotspot/user/add name=userA comment='" + ARABIC + "'");

Results:

[image: usera]

and:

for (Map<String, String> res : con.execute("/ip/hotspot/user/print return name,comment")) {
    testArray.add(res.get("name") + " / " + res.get("comment"));
}

Results:

[image: useracomment]

This correctly displayed the userA Arabic string in the app, but not the comments of the other users.

So we have two problems here:

First problem: when I send a command to add a user with a comment, the comment is not stored as proper Arabic. Winbox displays it like this:

[image: non-arabic]

This is not the Arabic string that I passed in the add user command.

I think Arabic sent from the API to MikroTik should not be in "UTF-8" encoding; maybe it needs a specific encoding, which I don't know, to be displayed correctly in Winbox.

Second problem: as things stand, I would need to set all the user comments in that weird character encoding to display them correctly in the app, but then I won't be able to read Arabic in Winbox, only in the app, which is a problem, not a feature.
And the original issue remains: comments that show correct Arabic in Winbox are displayed as diamond question marks in the app.

Sending characters from the API to MikroTik should be done with a specific encoding.
Reading characters from MikroTik through the API should use a different encoding (UTF-8, I think).

If there is anything you don't understand, just let me know.

@GideonLeGrange
Owner

On my setup the API and the ssh command line show the same information, but Winbox and the web interface don't display the comments correctly; they look like the ones in your pictures (I even enabled Arabic in my Chrome settings to see if the web interface would display them correctly).

So I have a question. If you ssh in to your mikrotik and do a /ip/hotspot/user print command, how are the users shown?

@GideonLeGrange
Owner

GideonLeGrange commented Oct 9, 2016

I've experimented some more and modified the API code to explicitly convert from cp1256 before sending data over the network and to convert to cp1256 when receiving data from the network (and before passing it to the user) and it makes no difference at all. I tried the same with iso-8859-6, no difference.

The Arabic text set using the API looks correct from the router (ssh) command line, and it is read back correctly. But Winbox and the web interface do not show correctly.

I suspect that Winbox and the web interface do not encode and decode the Arabic characters. I set Chrome's encoding settings (under View/Encoding) to "Arabic (Windows-1256)" and it does not make any difference either. I tried with "Arabic (iso-8859-6)" as well, no difference. This leads me to believe that it is already wrong by the time it reaches the browser.

I fixed the encoding problem in the API in 3.0.3 and at this point the API will send non-English characters correctly and read them correctly. Other software is not working and I can't fix that. I think you're going to have to decide on a workaround for this problem. The options I see are:

  • Trust Winbox and don't show the comments in your app.
  • Trust your app and ignore the fact that Winbox formats it badly.
  • Restrict your app and your users to English characters.

@AmrSubZero
Author

AmrSubZero commented Oct 10, 2016

I understand that you want to fix this, and i appreciate your help.

But let's discuss it.

When I asked (this question), boen_robot told me that WinBox displays the bytes using the Windows ANSI charset, which is different for every locale.
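
If that description is right, the effect can be reproduced without a router. The sketch below is an illustration under that assumption: it treats the viewer as something that renders stored bytes in the Arabic ANSI code page (`windows-1256`), and all names in it are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AnsiViewSketch {
    // Assumption: the viewer's Windows ANSI code page is the Arabic one.
    static final Charset ANSI_ARABIC = Charset.forName("windows-1256");

    /** What an ANSI-only viewer would show for text that was stored as bytes in sentAs. */
    static String shownByAnsiViewer(String text, Charset sentAs) {
        byte[] stored = text.getBytes(sentAs);
        return new String(stored, ANSI_ARABIC);
    }

    public static void main(String[] args) {
        String comment = "\u062A\u062C\u0631\u0628\u0629"; // a five-letter Arabic word

        // Stored as windows-1256 bytes: the viewer shows the original text.
        System.out.println(shownByAnsiViewer(comment, ANSI_ARABIC).equals(comment));

        // Stored as UTF-8 bytes: the viewer shows one wrong character per byte.
        String garbled = shownByAnsiViewer(comment, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(comment));
        System.out.println(garbled.length());
    }
}
```

This would explain why text written as UTF-8 reads back correctly through the API and ssh yet looks wrong in WinBox.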

As far as I know, boen_robot is the owner of the Pear2/RouterOS PHP API for MikroTik. It has a configurable charset that works correctly for each user's configuration: with the setCharset() function I mentioned in this issue, the user chooses the REMOTE and LOCAL charsets, so that text is written to MikroTik with the correct charset and read from MikroTik with the correct charset as well.

//Here's where we specify the charset pair.
$client->setCharset(
    array(
        RouterOS\Communicator::CHARSET_REMOTE => 'windows-1256',
        RouterOS\Communicator::CHARSET_LOCAL => 'UTF-8'
    )
);

The cool thing is that i tested his API using setCharset() and it is working like a charm!
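
A rough Java counterpart of that charset pair might look like the sketch below. It is hypothetical and not part of mikrotik-java: `RemoteCharsetCodec` and its methods are invented for illustration. Only the remote charset is configurable here, since a Java String is always UTF-16 in memory and so has no separate "local" charset to choose.

```java
import java.nio.charset.Charset;

/** Hypothetical helper, not part of mikrotik-java: converts at the socket boundary. */
final class RemoteCharsetCodec {
    private final Charset remote;

    RemoteCharsetCodec(String remoteCharsetName) {
        this.remote = Charset.forName(remoteCharsetName);
    }

    /** String to the bytes that would be written to the router. */
    byte[] encode(String word) {
        return word.getBytes(remote);
    }

    /** Bytes read from the router back to a String. */
    String decode(byte[] raw) {
        return new String(raw, remote);
    }

    public static void main(String[] args) {
        RemoteCharsetCodec codec = new RemoteCharsetCodec("windows-1256");
        String comment = "\u062A\u062C\u0631\u0628\u0629 \u0646\u0635"; // two Arabic words

        byte[] wire = codec.encode(comment);
        System.out.println(wire.length);                        // one byte per character
        System.out.println(codec.decode(wire).equals(comment)); // round trip is lossless
    }
}
```

Such a codec would have to be applied where the library turns API words into bytes and back; wrapping Strings after the library has already decoded them would be too late.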

Writing to MikroTik from the API displayed the Arabic correctly in WinBox:

[image: winboxphp]

Reading from MikroTik through the API also displayed the Arabic correctly on the web:

[image: app]

boen_robot has recommended that you add a configurable charset to fix this problem:

[image: boen_robot]

Can that be achieved? Is it possible to fix this with a configurable setCharset() method?

I'd be happy if it can. I have a huge amount of data in my RouterOS, the only way I can distinguish one item from another is the comment column, and it would be much easier to read it in Arabic.

Please let me know about any step you take!

Thanks a lot.

@GideonLeGrange
Owner

I tried manipulating the charset and it does not solve it. Maybe I'm doing it wrong, but unfortunately I cannot spend more time on this.

The library is open source and you can easily implement a setCharset() method yourself. If you do so and it fixes the problem, please submit a pull request for your changes.

@AmrSubZero
Author

That is bad news, but anyway, thanks Mr Gideon for your help.

@AbdelsalamHaa

AbdelsalamHaa commented May 2, 2018

Hi, I'm using Tesseract 4 with VS 2017. I used it with English characters first, and now I've started to include Arabic as well. The thing is, I get weird characters even when I change eng.traineddata to ara.traineddata.

I found out that these are the characters you get when the UTF-8 code is treated as hex code.
[image]
This is the image I want to recognize.
The result is
[image]
I converted the Arabic letters to UTF-8 code using this website
https://r12a.github.io/app-conversion/
and when I take this code and convert it, as hex code, to characters, I get the same characters that Tesseract showed me the first time.
"عبدالسلام مدي عبدالعزيز"

I think the problem is with UTF-8, or maybe my Visual Studio can't recognize Arabic letters.
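
The pattern in that output matches UTF-8 bytes being decoded with a single-byte Western code page. The Java sketch below reproduces it for the first three letters of the name and then reverses it. It assumes the wrong charset was windows-1252, which the characters in the quoted result suggest but do not prove; the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: the text was wrongly decoded as windows-1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Simulates the bug: UTF-8 bytes decoded as windows-1252. */
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    /** Undoes it, provided every byte survived the wrong decoding. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u0639\u0628\u062F"; // first three Arabic letters of the name
        String garbled = misread(name);

        // The garbled form is six Latin-range characters, two per Arabic letter.
        System.out.println(garbled.equals("\u00D8\u00B9\u00D8\u00A8\u00D8\u00AF"));
        System.out.println(repair(garbled).equals(name));
    }
}
```

The repair is not guaranteed in general: windows-1252 leaves the byte values 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so letters whose UTF-8 form contains one of them may not survive the round trip. Decoding the bytes as UTF-8 in the first place is the safer fix.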

@GideonLeGrange
Owner

GideonLeGrange commented May 2, 2018 via email

@AbdelsalamHaa

@GideonLeGrange okay thanks for your reply

@mohamadtik

How can I use this code via class.php.api?

//Here's where we specify the charset pair.
$client->setCharset(
    array(
        RouterOS\Communicator::CHARSET_REMOTE => 'windows-1256',
        RouterOS\Communicator::CHARSET_LOCAL => 'UTF-8'
    )
);

@GideonLeGrange
Owner

> How can I use this code via class.php.api?
>
> //Here's where we specify the charset pair.
> $client->setCharset(
>     array(
>         RouterOS\Communicator::CHARSET_REMOTE => 'windows-1256',
>         RouterOS\Communicator::CHARSET_LOCAL => 'UTF-8'
>     )
> );

I wrote and maintain a Java API. I cannot help you with a PHP API that somebody else wrote. Ask the author of that class.
