Support unicode characters by replacing unidecode with new normalize method #4484

Merged
nllong merged 11 commits into develop from enable-unicode-in-fields
Jan 27, 2024

nllong
Member

@nllong nllong commented Jan 15, 2024

Any background context you want to provide?

Back in 2016 we added the unidecode library to fix unicode issues with the data. That worked well until now: we need to keep diacritics/accent marks and also support the Arabic character set.

What's this PR do?

  • Remove unidecode
  • Create a new method that normalizes the set of characters that would otherwise prevent reasonable matches (e.g., em dashes, curly quotes).
  • Use the unicodedata.normalize method to combine each letter and its diacritic into a single composed character. This uses the NFC setting (Normalization Form C), which applies canonical decomposition followed by canonical composition.
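
A minimal sketch of what such a normalize step could look like, for readers who have not opened the diff. The function name `normalize_unicode` and the contents of `PUNCTUATION_MAP` are assumptions made for illustration; the actual method and character set live in the PR's changes to seed/lib/mcm/cleaners.py and may differ.

```python
import unicodedata

# Hypothetical mapping of typographic punctuation to ASCII equivalents.
# The real character set handled by the PR is not reproduced here.
PUNCTUATION_MAP = str.maketrans({
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_unicode(value: str) -> str:
    """Replace typographic punctuation, then compose the result to NFC.

    Unlike unidecode, this keeps diacritics and non-Latin scripts intact.
    """
    return unicodedata.normalize("NFC", value.translate(PUNCTUATION_MAP))


decomposed = "Cafe\u0301"  # 'e' followed by a combining acute accent
composed = "Caf\u00e9"     # single precomposed character

print(normalize_unicode(decomposed) == composed)
print(normalize_unicode("\u201cMain\u201d \u2014 St"))
```

The two `Café` spellings render identically but compare unequal as raw strings; after NFC they are the same four code points. Arabic text without combining marks passes through unchanged.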

How should this be manually tested?

  • unit tests
  • import unicode data (new test file forthcoming)
  • UI testing by inserting unicode characters. A great test is to edit a matching field to insert a unicode character, then import a new dataset with that unicode character in the matching field.
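
The UI test in the last bullet can be approximated in a few lines, assuming matching compares NFC-composed strings. The `match_key` helper, the dictionary standing in for stored records, and the sample values are all hypothetical and are not taken from the SEED code base.

```python
import unicodedata


def match_key(value: str) -> str:
    # Assumption: matching fields are compared after NFC composition.
    return unicodedata.normalize("NFC", value)


# A record whose matching field was edited in the UI (precomposed e-acute).
existing = {match_key("Universit\u00e9 Laval"): "record-1"}

# The same value arriving from an imported file in decomposed form.
incoming = "Universite\u0301 Laval"

raw_hit = existing.get(incoming)                    # lookup without normalization
normalized_hit = existing.get(match_key(incoming))  # lookup with normalization

print(raw_hit, normalized_hit)
```

Without normalization the imported value misses the existing record even though both strings look the same on screen; with it, the lookup succeeds.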

What are the relevant tickets?

#4479

Screenshots (if appropriate)

Label error. Requires at least 1 of: Feature, Bug, Enhancement, Maintenance, Documentation, Performance, Do not publish. Found:

@nllong nllong added the Feature label Jan 15, 2024
@nllong nllong changed the title Replace unidecode with new normalize method Support unicode characters by replacing unidecode with new normalize method Jan 15, 2024
@nllong nllong marked this pull request as ready for review January 18, 2024 01:10
seed/lib/mcm/cleaners.py (outdated review thread, resolved)
Member

@axelstudios axelstudios left a comment

LGTM

@nllong nllong merged commit c8a8e06 into develop Jan 27, 2024
8 checks passed
@nllong nllong deleted the enable-unicode-in-fields branch January 27, 2024 15:22