No support for spanish é, ç and other letters. #31

codigomaye · 2024-03-26T17:01:27Z

Hey dear MommaWatasu,

I hope you can help me on this one please 😅.

The problem

When I try to insert Spanish text inside an HTML <p> tag, I get an error.
Ex:

<p>Desde el corazón de Jerusalén, pasando por Galilea hasta llegar al desierto. Cada paso que des te hara sentir un verdadero discípulo de Cristo
</p>

The error:

│   exception =
│    BoundsError: attempt to access 19-element Vector{Union{AbstractString, Symbol}} at index [20]

Further explanation:

Julia indexes strings by byte offset, not by character position.

Example: The text "OteraEngine" has the following indices:

[1] => O
[2] => t
[3] => e
[4] => r
[5] => a
[6] => E
[7] => n
[8] => g
[9] => i
[10] => n
[11] => e

The index of t is the index of O plus the byte length of O (1 + 1 = 2), because each of these characters takes 1 byte. (Every unaccented English letter is 1 byte long in UTF-8, the encoding Julia uses for String.)

However, Spanish, French, and many other languages have letters that are 2 bytes long, including ñ, é, ç, and others.

Example 2: The text "España" (which means Spain) has the following indices:

[1] => E
[2] => s
[3] => p
[4] => a
[5] => ñ (occupies bytes 5 and 6)
[6] => not a valid index (second byte of ñ)
[7] => a

So, when I loop up to the value returned by the length() function, I get an error: length("España") is 6 (characters), but the valid indices are 1, 2, 3, 4, 5, and 7, and text[6] points into the middle of ñ. (This is how I understand it; there is probably a better way to explain it.)

Solution

Replace while i <= length(txt) with for i in eachindex(txt). This ensures that i is always a valid character index, which lets OteraEngine parse text in these languages. (I tried it and it worked!)
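To make the difference between character count and valid indices concrete, here is a small sketch using only Base Julia functions. The string is written with a `\u` escape so that ñ is guaranteed to be the single precomposed code point; the variable names are just for illustration.

```julia
txt = "Espa\u00f1a"   # "España", with ñ as the single code point U+00F1

# `length` counts characters; `ncodeunits` counts bytes (UTF-8 code units).
n_chars = length(txt)
n_bytes = ncodeunits(txt)

# `eachindex` yields only the byte offsets at which a character starts.
valid_indices = collect(eachindex(txt))

# Index 6 is the second byte of ñ, so it is not a valid character index.
six_is_valid = isvalid(txt, 6)

println("characters: ", n_chars, ", bytes: ", n_bytes)
println("valid indices: ", valid_indices)
println("isvalid(txt, 6): ", six_is_valid)
```

Here `length(txt)` is 6 while the last valid index is 7, which is why a counter that stops at `length(txt)` and steps by 1 both lands on an invalid index and never reaches the final character.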

codigomaye · 2024-03-26T17:05:36Z

I just added the error description 😅

codigomaye · 2024-03-26T22:42:01Z

I found the solution to the problem:

Julia uses "byte" indexing for characters, instead of "character" indexing.

MommaWatasu · 2024-03-26T22:43:23Z

Hi @codigomaye

I think this bug is the same as this one, which was fixed in v0.5.1.
If you are using another version, please update the package and try again.

codigomaye · 2024-03-26T23:49:12Z

> Hi @codigomaye
>
> I think this bug is the same as this one, which was fixed in v0.5.1. If you are using another version, please update the package and try again.

Hey @MommaWatasu, Thanks for your reply.

The error I get is from both v0.5.1 and v0.5.0 (I tried both out of curiosity)

Steps to reproduce:

1. Install the latest version of OteraEngine (Pkg.add("OteraEngine") installs v0.5.1 by default).
2. Try to generate a template from the following text:

using OteraEngine
txt = "élena"
tmp = Template(txt, path = false)

This fails with a BoundsError because of the indexing problem explained above. Here is a brief demonstration of why it happens:

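julia> txt = "élena"
julia> i = 1
julia> while i <= length(txt)   # length(txt) is 5 characters, but the string is 6 bytes
           println(txt[i])      # throws StringIndexError at i = 2, inside the 2-byte é
           i = i + 1
       end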

This example reproduces the while loop used in the tokenizer() function inside parser.jl, which can't parse the text either. At some point the loop reaches a 2-byte character (é), and the position right after its first byte is not a valid character index. Stepping with length and i = i + 1 works in other programming languages such as JavaScript, but it fails in Julia because of how its strings are designed.
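As a rough illustration of where such a loop breaks, the sketch below walks a string with integer steps and reports the first index that raises a `StringIndexError`. `first_bad_index` is a hypothetical helper written for this comment, not part of OteraEngine, and the tokenizer's own `BoundsError` may surface at a different point than this bare loop does.

```julia
# Hypothetical helper: step through 1:length(s) the way the naive loop does
# and return the first index that is not a valid character index.
function first_bad_index(s::AbstractString)
    for i in 1:length(s)
        try
            s[i]                     # indexing inside a multi-byte character throws
        catch err
            err isa StringIndexError && return i
            rethrow()                # anything else is unexpected
        end
    end
    return nothing                   # every visited index was valid
end

println(first_bad_index("\u00e9lena"))   # "élena": index 2 is inside é
println(first_bad_index("hello"))        # ASCII only, so nothing fails
```

For ASCII-only input the helper returns `nothing`, which matches the observation that the problem only shows up once the text contains accented letters.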

Now, if we replace the while loop and length with a for loop and eachindex:

julia> txt = "élena"
julia> for i in eachindex(txt)   # visits only the valid indices: 1, 3, 4, 5, 6
           println(txt[i])
       end

We get the desired result, without affecting the logic of the code 👍.
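If the tokenizer has to keep a while loop (for instance because it sometimes needs to skip ahead), the same effect can be had by bounding the loop with `lastindex` and advancing with `nextind`. This is a minimal sketch under that assumption; `chars_by_while` is an illustrative name and not the actual tokenizer code.

```julia
# Collect the characters of `s` with a manual while loop.
# `lastindex` (not `length`) is the largest valid index, and
# `nextind` jumps to the start of the next character.
function chars_by_while(s::AbstractString)
    out = Char[]
    i = firstindex(s)
    while i <= lastindex(s)
        push!(out, s[i])
        i = nextind(s, i)
    end
    return out
end

chars = chars_by_while("\u00e9lena")   # "élena"
println(chars)
```

Compared with `for i in eachindex(txt)`, this form leaves `i` under the loop body's control, at the cost of having to remember `nextind` every time the index moves.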

codigomaye · 2024-03-26T23:52:22Z

I rectified the while example above 👍

codigomaye · 2024-03-27T00:05:20Z

> Hi @codigomaye
>
> I think this bug is the same as this one, which was fixed in v0.5.1.
> If you are using another version, please update the package and try again.

I just checked this issue.

It is more or less the same problem as in v0.5.0.

v0.5.1 solves it for characters like à, but not for others like ñ and é, because of how nextind works. To figure out the problem, I read this answer on Julia Discourse, going through the entire thread until I reached the answer.

The fix in v0.5.1 is still incomplete: I couldn't render a Spanish web page until I ran dev OteraEngine and tweaked the tokenizer. Then it happily worked ☺️
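One place where `nextind` alone may not be enough is lookahead: a range such as `txt[i:i+1]` needs both ends to be valid indices. The sketch below avoids computing an end index by checking a `SubString` view with `startswith`. I don't know whether the tokenizer does lookahead this way, so `starts_at`, `find_delim`, and the `{{` delimiter are assumptions made purely for illustration.

```julia
# Hypothetical helper: does `delim` begin at the valid index `i` of `s`?
# `SubString(s, i)` is a view from `i` to the end of `s`, so no end index
# has to be computed by hand.
starts_at(s::AbstractString, i::Integer, delim::AbstractString) =
    startswith(SubString(s, i), delim)

# Return the first valid index at which `delim` begins, or `nothing`.
function find_delim(s::AbstractString, delim::AbstractString)
    i = firstindex(s)
    while i <= lastindex(s)
        starts_at(s, i, delim) && return i
        i = nextind(s, i)            # always move to the next character start
    end
    return nothing
end

txt = "a\u00f1o {{ x }}"             # "año {{ x }}"
pos = find_delim(txt, "{{")
println(pos)
```

The returned position is a byte index (the ñ before it takes two bytes), so it can be passed straight back to `txt[pos]` or `nextind(txt, pos)`.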

MommaWatasu · 2024-03-27T05:04:13Z

Thanks for your great effort!
I fixed the tokenizer function and added some tests for Japanese and Spanish text. However, if nextind is an incomplete solution for this issue, as you mentioned, the current code may still have the problem.
Could you check the code on the master branch and tell me whether it still has the bug? If you don't reply by tomorrow, I'll release this code as v0.5.2.

codigomaye · 2024-03-27T10:25:42Z

Hey @MommaWatasu, I will check it right away 👍

codigomaye · 2024-03-27T10:37:30Z

Hey @MommaWatasu, now it works perfectly 👍.

This is a screen capture of the rendered page, to make you feel happy for the great job you did 😄

Have a nice day 😄

MommaWatasu · 2024-03-27T11:16:45Z

Thanks for the screenshot, I'm glad to see how OteraEngine is used!
Now v0.5.2 is available. Please update and enjoy using it!

codigomaye pushed a commit to codigomaye/OteraEngine.jl that referenced this issue Mar 26, 2024

[fix] Resolves MommaWatasu#31

768bc51

codigomaye mentioned this issue Mar 26, 2024

[fix] Resolves #31 #32

Closed

MommaWatasu closed this as completed Mar 27, 2024
