No support for Spanish é, ç and other letters. #31

Closed
codigomaye opened this issue Mar 26, 2024 · 10 comments

Comments

@codigomaye

codigomaye commented Mar 26, 2024

Hey dear MommaWatasu,

I hope you can help me with this one, please 😅.

The problem

When I try to insert Spanish text in an HTML <p> tag, I get an error.
Ex:

<p>Desde el corazón de Jerusalén, pasando por Galilea hasta llegar al desierto. Cada paso que des te hara sentir un verdadero discípulo de Cristo
</p>

The error:

│   exception =
│    BoundsError: attempt to access 19-element Vector{Union{AbstractString, Symbol}} at index [20]

Further explanation:

Julia indexes strings by byte (UTF-8 code unit), not by character position.

Example: The text "OteraEngine" has the following index:

[1] => O
[2] => t
[3] => e
[4] => r
[5] => a
[6] => E
[7] => n
[8] => g
[9] => i
[10] => n
[11] => e

The index of t is the index of O plus the byte length of O, which is 1, because each of these characters occupies 1 byte. (Every unaccented English letter is 1 byte long in UTF-8.)

However, Spanish, French, and many other languages have letters that are 2 bytes long, including ñ, é, ç, and others.

Example 2: The text "España" (which means Spain) has the following indices:

[1] => E
[2] => s
[3] => p
[4] => a
[5] => ñ (occupies bytes 5 and 6)
[7] => a

So, when I step through the indices from 1 up to length(text), I get an error: bytes 5 and 6 together encode ñ and can't be separated, so text[6] is not a valid character index. (This is how I understand it; there may be a better way to explain it.)

Solution

Replace while i <= length(txt) with for i in eachindex(txt). This ensures that i is always a valid character index, which lets OteraEngine parse text in languages with multi-byte characters. (I tried it and it worked!)
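
The indexing behaviour described above can be checked directly in the REPL. This is a minimal sketch that only uses Base functions (length, ncodeunits, eachindex, isvalid); it assumes the ñ is stored as a single precomposed code point, which is why it is written as the escape \u00f1.

```julia
s = "Espa\u00f1a"   # "España" with a precomposed ñ (assumption: NFC form)

# length counts characters; ncodeunits counts bytes (UTF-8 code units).
n_chars = length(s)
n_bytes = ncodeunits(s)

# eachindex yields only the byte offsets at which a character starts.
valid = collect(eachindex(s))

# isvalid reports whether a byte offset is the start of a character.
second_byte_of_n_tilde = isvalid(s, 6)

println("characters: ", n_chars, ", bytes: ", n_bytes)
println("valid indices: ", valid)
for i in eachindex(s)
    println("[", i, "] => ", s[i])
end
```

Because length(s) is smaller than the last valid index whenever a multi-byte character is present, a loop bounded by length both lands on invalid offsets and stops too early.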

@codigomaye
Author

I just added the error description 😅

@codigomaye
Author

I found the solution to the problem:

Julia indexes strings by byte (UTF-8 code unit), not by character position.

@MommaWatasu
Owner

Hi @codigomaye

I think this bug is the same as this one, which was fixed in v0.5.1.
If you are using another version, please update the package and try again.

codigomaye pushed a commit to codigomaye/OteraEngine.jl that referenced this issue Mar 26, 2024
@codigomaye
Author

codigomaye commented Mar 26, 2024

> Hi @codigomaye
>
> I think this bug is the same as this one, which was fixed in v0.5.1. If you are using another version, please update the package and try again.

Hey @MommaWatasu, Thanks for your reply.

I get the error with both v0.5.1 and v0.5.0 (I tried both out of curiosity).

Steps to reproduce:

  1. Install the latest version of OteraEngine (Pkg.add("OteraEngine") installs v0.5.1 by default).
  2. Try to generate a template from the following text:
txt = "élena"
tmp = Template(txt, path = false)

This leads to a BoundsError, caused by the indexing problem explained before. Here is a brief demonstration of why this happens:

julia> txt = "élena"
julia> i = 1
julia> while i <= length(txt)
           println(txt[i])  # throws StringIndexError at i = 2, the second byte of é
           i = i + 1
       end

This example reproduces the while loop used in the tokenizer() function inside parser.jl, which cannot parse the text either. At some point it reaches a 2-byte character (é), and the index after it cannot be reached with the length-and-i++ pattern used in other programming languages (JavaScript, for example), because of how Julia strings are designed.

Now, if we replace the while loop and length with a for loop and eachindex:

julia> txt = "élena"
julia> for i in eachindex(txt)
           println(txt[i])  # eachindex yields only valid indices: 1, 3, 4, 5, 6
       end

We get the desired result without affecting the logic of the code 👍 .
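
When a tokenizer needs to keep a manual index (for example, to skip ahead by a variable amount), the while loop can stay as long as the index advances with nextind and is bounded by lastindex rather than length. The sketch below illustrates that pattern; collect_chars is a hypothetical helper written for this example and is not part of OteraEngine.

```julia
# Hypothetical helper: walks `txt` with a manually managed index and
# returns the characters it visits.
function collect_chars(txt::AbstractString)
    chars = Char[]
    i = firstindex(txt)
    while i <= lastindex(txt)      # lastindex is a byte offset, unlike length
        push!(chars, txt[i])
        i = nextind(txt, i)        # jump to the start of the next character
    end
    return chars
end

# "élena", with é written as an escape so the encoding is unambiguous.
result = collect_chars("\u00e9lena")
println(result)
```

For a plain left-to-right scan, for i in eachindex(txt) is simpler; the nextind form is mainly useful when the loop body sometimes moves the index itself.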

@codigomaye
Author

I rectified the while example above 👍

@codigomaye
Author

codigomaye commented Mar 27, 2024

> Hi @codigomaye
>
> I think this bug is the same as this one, which was fixed in v0.5.1.
> If you are using another version, please update the package and try again.

I just checked that issue.

It is more or less the same problem as in v0.5.0.

v0.5.1 solves it for characters like à, but not for others like ñ and é, because of how nextind works. To figure out the problem, I read this answer on Julia Discourse, going through the entire thread until I reached the answer.

The fix in v0.5.1 is still incomplete: I couldn't render a Spanish web page until I ran dev OteraEngine and tweaked the tokenizer. Then it happily worked ☺️
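
The thread does not say which use of nextind was at fault, so the following is only a general illustration of one common pitfall, not a description of the OteraEngine code. Looking ahead with arithmetic such as s[i:i+1] can land on an invalid index even when i itself came from nextind; the three-argument form nextind(s, i, n) steps by characters instead. peek_chars is a hypothetical helper introduced for this sketch.

```julia
# Hypothetical helper: returns up to `n` characters of `s` starting at the
# valid index `i`, without doing arithmetic on byte offsets.
function peek_chars(s::AbstractString, i::Int, n::Int)
    n >= 1 || return ""
    # Index of the n-th character of the window, clamped to the string end.
    stop = min(nextind(s, i, n - 1), lastindex(s))
    return s[i:stop]
end

s = "\u00f1{{x"                 # "ñ{{x": ñ occupies bytes 1 and 2
two = peek_chars(s, 1, 2)       # the first two characters
rest = peek_chars(s, 3, 10)     # asks for more characters than remain
println(two)
println(rest)
```

The same lookahead written as s[1:2] would raise a StringIndexError here, because offset 2 is the second byte of ñ.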

@MommaWatasu
Owner

Thanks for your great effort!
I fixed the tokenizer function and added some tests for Japanese and Spanish text. However, if nextind is an incomplete solution for this issue, as you mentioned, the current code may still have the problem.
Could you check the code in the master branch and tell me whether it still has the bug? If you don't reply by tomorrow, I'll release this code as v0.5.2.

@codigomaye
Author

Hey @MommaWatasu , I will check it right away 👍

@codigomaye
Author

Hey @MommaWatasu , now it works perfectly 👍 .

This is a screen capture of the rendered page, to make you feel happy for the great job you did 😄

Screenshot

Have a nice day 😄

@MommaWatasu
Owner

Thanks for the screenshot, I'm glad to see how OteraEngine is used!
Now v0.5.2 is available. Please update and enjoy using it!
