Manfred edited this page Apr 24, 2012 · 5 revisions

Handling Unicode

This page answers all your questions regarding “How to make Picky work with my Unicode character sets”.

Indexing special character sets

The basic generator creates a project that does not work with, for example, Japanese or Cyrillic character sets.

This is because indexing removes every character that is not listed in the example’s “negative” (“remove everything but these”) regexp:

indexing removes_characters: /[^a-z0-9\s]/i

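(Whitespace, including newlines, is not removed, so the text can be split on it later on.) What this means is that every character other than an ASCII letter, a digit or whitespace is simply removed, and that includes all Japanese and Cyrillic characters.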
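To see the effect of the default pattern outside of Picky, the removal can be imitated with plain String#gsub. This is only a sketch: strip_removed and DEFAULT_REMOVES are hypothetical names used for illustration, not part of Picky's API.

```ruby
# encoding: utf-8

# The pattern from the generated example: everything except ASCII letters,
# digits and whitespace is removed.
DEFAULT_REMOVES = /[^a-z0-9\s]/i

# Hypothetical helper that imitates the character removal with String#gsub.
def strip_removed(text, pattern)
  text.gsub(pattern, '')
end

# ASCII letters and digits survive, punctuation does not.
puts strip_removed("Picky 4 rocks!", DEFAULT_REMOVES)

# The Cyrillic words vanish; only the two spaces and the digits remain.
puts strip_removed("Привет мир 42", DEFAULT_REMOVES).inspect
```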

If you wish Picky to index your Cyrillic characters, you need to tell it to do so:

indexing removes_characters: /[^\p{Cyrillic}0-9\s]/i

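This means: “Indexing removes characters, but not Cyrillic or numeric ones, or whitespace”. Note that this pattern no longer lists a-z, so Latin letters are now removed; add a-z back to the character class if you need both.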

The Ruby documentation has more on Unicode character classes: http://www.ruby-doc.org/core-1.9.3/Regexp.html (look for \p{<character_class_name>})
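The following sketch compares a few such patterns using plain String#gsub. The constant names are hypothetical, and the Japanese pattern (built from the \p{Han}, \p{Hiragana} and \p{Katakana} script classes) is an assumption about what a Japanese index might need, not a recommendation from the Picky authors.

```ruby
# encoding: utf-8

# Keeps Cyrillic letters, digits and whitespace (the pattern shown above).
KEEP_CYRILLIC = /[^\p{Cyrillic}0-9\s]/i

# Additionally keeps ASCII letters, for mixed Latin/Cyrillic data.
KEEP_LATIN_AND_CYRILLIC = /[^\p{Cyrillic}a-z0-9\s]/i

# Assumed pattern for Japanese text: kanji, hiragana, katakana, digits, whitespace.
KEEP_JAPANESE = /[^\p{Han}\p{Hiragana}\p{Katakana}0-9\s]/

mixed = "Привет world 42!"

# "world" and "!" are removed, the Cyrillic word stays.
puts mixed.gsub(KEEP_CYRILLIC, '').inspect

# Only "!" is removed.
puts mixed.gsub(KEEP_LATIN_AND_CYRILLIC, '').inspect

# Only "!" is removed.
puts "東京 すし カメラ 2012!".gsub(KEEP_JAPANESE, '').inspect
```

Some marks, such as the long vowel mark ー, belong to the Common script rather than to Katakana, so a real pattern may need to list such characters explicitly.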

Problems with case insensitive searches

Picky uses downcase! to make searches case insensitive. There are two reasons why this might cause problems with Unicode strings.

Ruby versions before 2.4 only know how to lowercase ASCII characters, so String#downcase leaves Cyrillic characters unchanged. As an example, see issue 76.
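Whether this limitation applies depends on the Ruby version in use: String#downcase gained full Unicode case mapping in Ruby 2.4. A small probe, sketched here with the hypothetical helper name unicode_downcase?, shows how your interpreter behaves:

```ruby
# encoding: utf-8

# Returns true when this Ruby's String#downcase maps Cyrillic capitals
# to their lowercase forms, false when it only handles ASCII.
def unicode_downcase?
  "ПРИВЕТ".downcase == "привет"
end

puts "Ruby #{RUBY_VERSION}: Unicode-aware downcase? #{unicode_downcase?}"

# ASCII letters are lowercased on every Ruby version.
puts "HELLO".downcase
```

If the probe reports false, one of the workarounds described under “What can you do?” is needed.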

Equivalence between characters is more complicated than with ASCII strings. There are both technical and cultural reasons for this; for the full discussion of the issues, see: Unicode equivalence.
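As an illustration of the technical side, the same visible character can be encoded in more than one way. The sketch below assumes Ruby 2.2 or later, where String#unicode_normalize is available in the standard library; same_text? is a hypothetical helper and not something Picky provides.

```ruby
# encoding: utf-8

composed   = "\u00E9"   # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Hypothetical helper: compares two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

# The raw strings differ even though they render identically.
puts composed == decomposed

# After normalization they compare as equal.
puts same_text?(composed, decomposed)
```

A search for one form will not find the other unless both the indexed data and the query are normalized in the same way.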

What can you do?

Mitya solved this specific case as follows (http://github.com/floere/picky/issues/76#issuecomment-5280965), reprinted here in full:

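require 'unicode' # gem
class String
  # Replace Ruby's downcase with the unicode gem's Unicode-aware version.
  def downcase
    Unicode::downcase(self)
  end
  # Picky calls downcase!, so route it through the method above.
  def downcase!
    self.replace downcase
  end
end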

Manfred Stienstra helpfully offers a few other solutions: http://github.com/floere/picky/issues/76#issuecomment-5280616. For example, using his Unichars lib:

require 'unichars' # gem
class String
  def downcase
    Unichars.new(self).normalize.downcase
  end
  def downcase!
    self.replace downcase
  end
end

The libs mentioned differ in scope and performance. We suggest you use the one that fits your needs.

Remember: If all else fails, you can always override String#downcase and String#downcase! to have Ruby and Picky work as you need them to.

class String
  def downcase
    # Your correct downcase implementation that also works for Klingon
  end
  def downcase!
    self.replace downcase
  end
end
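As one hypothetical way to fill in such an implementation without an extra gem, the sketch below lowercases ASCII and basic Russian Cyrillic letters with String#tr. The CyrillicCase module is an assumption made for illustration; it is not taken from Picky or from issue 76, and it deliberately covers only А-Я and Ё, not other Cyrillic letters.

```ruby
# encoding: utf-8

# Hypothetical helper module, not part of Picky.
module CyrillicCase
  # Uppercase ranges and their lowercase counterparts, position by position.
  UPPER = "A-ZА-ЯЁ"
  LOWER = "a-zа-яё"

  # Returns a lowercased copy of text; characters outside the ranges
  # above are left unchanged.
  def self.downcase(text)
    text.tr(UPPER, LOWER)
  end
end

puts CyrillicCase.downcase("ПРИВЕТ Мир ЁЛКА 42")
puts CyrillicCase.downcase("Hello Picky")
```

Inside the String override, the downcase body would then simply call CyrillicCase.downcase(self).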