Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve character encoding handling for friendly titles (Trac #2276) #2276

Closed
elgg-gitbot opened this issue Feb 16, 2013 · 42 comments

Comments

@elgg-gitbot
Copy link

commented Feb 16, 2013

Original ticket http://trac.elgg.org/ticket/2276 on 40445254-04-27 by trac user belhkaci, assigned to trac user mrclay.

Elgg version: 1.7

I can not see the page content rss blogs when there is a blog that has a title that contains accented characters such as "é"

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.7.2 by brettp on 40445929-01-14

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 40489073-08-08

I am not able to reproduce. Could you post a set of steps to reproduce this?

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user bman wrote on 40489097-10-12

I started looking at this and i don't see a specific blog related issue. The issue seems to lie in php's lack of unicode support. I have gotten a few functions working that will help, but the actual issue lies in the url creation and the characters being stripped out in strip tags and friendly_title function. The way I was looking to solve this was to convert utf8 to unicode array, convert unicode array to unicode entities (preserving ascii chars), striptags, then convert the remaining entities back to utf8. I am working on going back to utf8 still but not complete yet. anyway, just letting you all know my progress and that i am actually looking at this issue and my thought process.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user bman wrote on 40489100-03-11

yeah the actual bug should be titled "Add unicode support in friendly_title"
sorry cash i should have posted before i started working on this.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 40489432-07-28

Okay - I see the issue. I believe this is related to #2027.

I started to look into using iconv just before the release of 1.7.1. If that works, it would be a clean solution. Since Wikipedia seems to do a good job with URLs, MediaWiki would be a good place to look for how they handle character encodings and URLs.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Title changed from Problem of rss in blogs to improve character encoding handling for friendly titles by cash on 40489432-07-28

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user bman wrote on 40489955-12-12

Very nice, yeah unicode characters are just stripped out totally so posting something like: "é-test" would result in "-test" as the title used in the url

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 40489999-10-20

iconv should fallback to closest character in the ASCII character set if it can figure it out. This may be dependent on the server. In some tests I've seen something like "é-test" output as "e-test". I didn't do enough testing to turn it on in svn.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user bman wrote on 40490008-05-22

Ahhh well that makes it readable, but I was actually trying to preserve unicode seeing as its supported on the web now, and the internalization would probably be a good thing for others.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 40490078-02-18

Yeah, it was a temporary solution while PHP plays catch-up in the Unicode world. I think 5.3 is suppose to start having better i18n support. Have you checked out the intl extension: http://www.php.net/manual/en/book.intl.php

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user bman wrote on 40490114-08-22

Very good, I will read up and look again.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 40496896-05-29

(In [svn:6586]) Refs #2117 #2276 #2027 - added elgg_get_friendly_time and elgg_get_friendly_title so they can be used in non-html views

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 40514881-07-29

Pushing this back to 1.8

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8 by cash on 40514881-07-29

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8.1 by ewinslow on 41120363-04-12

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 41880155-04-16

Valid characters: http://www.ietf.org/rfc/rfc3986.txt

Germanazo's pull request: #83

I believe Wikipedia encodes the friendly title with url encode. It works for most modern browsers, but causes issues when people copy/paste the url like so: http://ta.wikipedia.org/wiki/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8.2 by cash on 41880155-04-16

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 41891104-06-12

This may require more time than we have for 1.8.2 so pushing back to 1.8.3. If it is done earlier, that's great.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8.3 by cash on 41891104-06-12

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 42003719-12-24

Options:

  • Use IRIs like we do for profile URLs
  • Encode the text like Wikipedia
  • Improve the convert technique of removing non-URI allowed characters (like Wordpress)
@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8.4 by cash on 42003719-12-24

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

brettp wrote on 42039239-12-01

I vote for option 3.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user sembrestels wrote on 42093554-11-22

Drupal uses that module for this:

http://drupal.org/project/transliteration

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user sembrestels wrote on 42132645-03-03

Let's use this function:

http://www.php.net/manual/en/function.iconv.php

Do you agree?

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 42140132-04-19

My concern with iconv is that it may be dependent on the server's configuration.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

brettp wrote on 42154993-03-16

Drupal's Transliteration plugin is basically a collection of maps of character transliterations. I don't want to maintain something like that.

Since we're not ready to use iconv, I'm going to push this out to 1.8.5.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8.5 by brettp on 42154993-03-16

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user mrclay wrote on 42388964-04-13

This looks, to me anyway, a unique method of transliteration based on decomposing diacritic marks out of the characters then stripping them. I believe each sequences of preg_replace calls could be replaced by str_replace w/ array args for a big speed improvement. Seems to only require vanilla PHP5.3.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user ManUtopiK wrote on 42425749-06-16

I suggest an other way to solve this issue.
I just do a pull request her on github :
#264

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user mrclay wrote on 42433022-09-21

iconv looks awfully dependent on server config. It also apparently chokes on invalid UTF-8 data. Since its input would practically always be coming from the DB (already UTF-8), I'm not sure the latter is a worry.

I don't think the [http://us.php.net/manual/en/normalizer.normalize.php#92592 Normalizer-based translit] would translate chars like 語 (that may not neatly separate into base latin + diacritics). The given code just strips all non-ASCII afterwords, but we could strike a balance and urlencode after. I.e. convert Español 日本語 to espanol-%e6%97%a5%e6%9c%ac%e8%aa%9e. Western langs would get ASCII translit, non Westerns get urlencoded.

Will 1.9 require PHP5.3?

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

ewinslow wrote on 42433148-08-14

1.9 Will likely not require PHP 5.3. I hope 1.10 does...

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

Milestone changed to Elgg 1.8.7 by cash on 42460625-09-26

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 42483022-06-02

Long term - something like Normalizer is the way to go. Until then, how about pulling in this?

We could modify it along the lines of what Steve is suggesting: European languages get transliterated and others get encoded as IRIs (which means those users would have to deal with ugly links if pasted somewhere).

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user mrclay wrote on 42483964-04-08

#281 adds an ElggTranslit class based on Doctrine1 Inflector, and more unit tests. Is there a better way to include the Doctrine license in the PHPDoc?

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user mrclay wrote on 42484839-01-02

Should we keep a special case for leaving periods in what look like file names? E.g. Someone uploads an image without setting title. The friendly title becomes "photo1jpg". I could easily leave periods that are immediately followed by an alphanumeric. Is there a guide/spec for this kind of function considering UX and SEO?

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user mrclay wrote on 42484883-04-21

I know we're trying to take only MIT licensed code, but WordPress's sanitize_title_with_dashes is exactly what we need here. http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/formatting.php#L941

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

cash wrote on 42484970-01-14

For the license, I think we could use license with a link to the license.

I don't think it is worth the effort to have special code to check for periods in titles (or if we do, it should probably be in the file plugin).

As long as we have a goal of producing a MIT only core, we cannot grab WP code. By the way, most of the plugins will remain GPL only (decision by the primary copyright holder - the old Curverider company).

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user mrclay wrote on 42487745-11-04

Squashed this work into 1 commit. The license in the Doctrine file looks MIT-ish, but the link is to GPL, so I suggest we keep the PHPDoc license. Regardless, almost the whole thing has been rewritten. Used this to format the translit array.

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user Steve Clay wrote on 42513475-07-31

Fixes #2276: Better friendly titles, portable ElggTranslit class, better units
Changeset: 4fff20a

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user Cash Costello wrote on 42513487-09-07

Refs #2276 made the new class private
Changeset: 2755f1c

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user Steve Clay wrote on 42524624-05-28

Fixes #2276: Better friendly titles, portable ElggTranslit class, better units
Changeset: 4fff20a

@elgg-gitbot

This comment has been minimized.

Copy link
Author

commented Feb 16, 2013

trac user Cash Costello wrote on 42524624-09-10

Merge pull request #281 from mrclay/2276-friendly-title

Fixes #2276: Better friendly titles
Changeset: 35bd23e

adayth pushed a commit to adayth/Elgg that referenced this issue Feb 22, 2013
adayth pushed a commit to adayth/Elgg that referenced this issue Feb 22, 2013
adayth pushed a commit to adayth/Elgg that referenced this issue Feb 22, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.