Normalization Tools

Noah Santacruz edited this page Dec 22, 2021 · 3 revisions

Code for normalization tools can be found here

Normalizers

Normalizers allow you to:

  • remove / edit a string using a regular expression or generic function
  • convert indices in a normalized string to indices in the original string. This is helpful when it is simpler to search in the normalized string.

Basic Usage

Basic use-cases will use either NormalizerByLang or NormalizerComposer. Below is a brief description of these classes. See the source code for more specific documentation.

General Case

Below is a standard example of how one might use a normalizer:

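    # assumes RegexNormalizer has been imported from the normalization module linked above
    double_space_normalizer = RegexNormalizer(r'\s+', ' ')  # raw string so that \s is not treated as a string escape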
    my_string = "hello with   extra  spaces"
    my_string_norm = double_space_normalizer.normalize(my_string)
    assert my_string_norm == "hello with extra spaces"

    norm_ind_start = my_string_norm.find("spaces")
    norm_range = (norm_ind_start, norm_ind_start + len("spaces"))  # example range

    # let's say you want to know where `norm_range` would map to in the original string
    mapping = double_space_normalizer.get_mapping_after_normalization(my_string)
    orig_range = double_space_normalizer.convert_normalized_indices_to_unnormalized_indices([norm_range], mapping)[0]
    assert my_string[slice(*orig_range)] == "spaces"
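To make the index conversion in the example above more concrete, here is a minimal, self-contained sketch of one way such a mapping could be built. It is not the library's implementation: `normalize_with_index_map` and `to_original_range` are hypothetical names, the replacement is treated as a literal string, and the pattern is assumed never to match the empty string.

```python
import re

def normalize_with_index_map(text, pattern, repl):
    """Replace every match of `pattern` with the literal string `repl`.

    Returns (normalized, index_map), where index_map[i] is the index in
    `text` that normalized[i] came from.
    """
    pieces, index_map, pos = [], [], 0
    for match in re.finditer(pattern, text):
        # Characters before the match are copied unchanged.
        pieces.append(text[pos:match.start()])
        index_map.extend(range(pos, match.start()))
        # Replacement characters all point at the start of the match.
        pieces.append(repl)
        index_map.extend([match.start()] * len(repl))
        pos = match.end()
    pieces.append(text[pos:])
    index_map.extend(range(pos, len(text)))
    return "".join(pieces), index_map


def to_original_range(norm_range, index_map):
    """Map a non-empty (start, end) range, end exclusive, back to the original."""
    start, end = norm_range
    return index_map[start], index_map[end - 1] + 1


original = "hello with   extra  spaces"
normalized, index_map = normalize_with_index_map(original, r"\s+", " ")
start = normalized.find("spaces")
orig_start, orig_end = to_original_range((start, start + len("spaces")), index_map)
print(original[orig_start:orig_end])  # the same word, located in the original
```

Mapping the end of a range through its last character is only one possible convention. It keeps removed or collapsed text that follows the range out of the result.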

NormalizerComposer

Allows you to list the normalizers you want to apply, in order. The NormalizerComposer instance can then be used as if it were a single normalizer. This is convenient because it is difficult to calculate the mapping from a normalized string back to the original after multiple normalization steps.
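The difficulty is that each step's indices refer to the output of the previous step, not to the original string. The sketch below shows one way the per-step maps could be chained so that the final map always points into the original. `ToyComposer` and `sub_with_map` are hypothetical stand-ins written for illustration, not the actual NormalizerComposer, and replacements are treated as literal strings.

```python
import re

def sub_with_map(text, pattern, repl):
    """Regex-substitute and return (new_text, index_map), where
    index_map[i] is the index in `text` that new_text[i] came from."""
    pieces, index_map, pos = [], [], 0
    for match in re.finditer(pattern, text):
        pieces.append(text[pos:match.start()])
        index_map.extend(range(pos, match.start()))
        pieces.append(repl)
        index_map.extend([match.start()] * len(repl))
        pos = match.end()
    pieces.append(text[pos:])
    index_map.extend(range(pos, len(text)))
    return "".join(pieces), index_map


class ToyComposer:
    """Hypothetical stand-in: applies (pattern, repl) steps in order."""

    def __init__(self, steps):
        self.steps = steps

    def normalize_with_map(self, text):
        # total_map[i] always points back into the ORIGINAL text.
        total_map = list(range(len(text)))
        for pattern, repl in self.steps:
            text, step_map = sub_with_map(text, pattern, repl)
            # step_map points into the previous step's text; route it
            # through total_map to reach the original.
            total_map = [total_map[j] for j in step_map]
        return text, total_map


# Step 1 strips simple tags, step 2 collapses runs of whitespace.
composer = ToyComposer([(r"<[^>]+>", ""), (r"\s+", " ")])
original = "a <b>bold</b>   word"
normalized, total_map = composer.normalize_with_map(original)

start = normalized.find("bold")
end = start + len("bold")
orig_range = (total_map[start], total_map[end - 1] + 1)
print(original[slice(*orig_range)])
```

Composing the maps step by step means callers only ever deal with one mapping, regardless of how many normalization steps ran.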

NormalizerByLang

An easy way to map normalizers to languages. Calling .normalize(string, lang=<LANG>) applies the normalizer registered for the language passed.
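The dispatch-by-language idea can be sketched in a few lines. `ToyNormalizerByLang` is a hypothetical stand-in that stores plain functions. Passing text through unchanged for an unregistered language is an assumption of this sketch, as are the `"en"` and `"he"` language codes; check the source code for the real behavior.

```python
import re

class ToyNormalizerByLang:
    """Hypothetical stand-in: picks a normalizing function by language code."""

    def __init__(self, normalizers_by_lang):
        self.normalizers_by_lang = normalizers_by_lang

    def normalize(self, text, lang):
        normalizer = self.normalizers_by_lang.get(lang)
        if normalizer is None:
            # Assumption: unregistered languages pass through unchanged.
            return text
        return normalizer(text)


by_lang = ToyNormalizerByLang({
    # Collapse runs of whitespace in English text.
    "en": lambda s: re.sub(r"\s+", " ", s),
    # Remove every code point in U+0591..U+05C7 (Hebrew cantillation marks
    # and vowel points, plus a few punctuation signs in that range).
    "he": lambda s: re.sub(r"[\u0591-\u05C7]", "", s),
})

english = by_lang.normalize("too   many    spaces", lang="en")
hebrew = by_lang.normalize("\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd", lang="he")
other = by_lang.normalize("left  alone", lang="fr")
```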

Other Classes

Below is a list of the other normalizers that currently exist:

  • ITagNormalizer: Removes i-tags from text (e.g. footnotes or commentary markers)
  • ReplaceNormalizer: Simply replaces a with b where a and b are strings
  • RegexNormalizer: Replaces the matches of a regex with a static string
  • TableReplaceNormalizer: Replaces keys in a table with their values

Text Sanitizer

This class makes it easy to go from a list of segments to the flat list of words required by dibur_hamatchil_matcher.match_text. It is primarily helpful when you need to keep track of the text both before and after the edits that were made to improve text matching.

Other Functions

  • char_indices_from_word_indices: Given a string and word ranges, return the corresponding indices of the characters in the string
  • sanitized_words_to_unsanitized_words: Converts normalized word indices to original word indices. This is useful with dibur_hamatchil_matcher
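As an illustration of the word-to-character conversion described above, here is a minimal sketch. `toy_char_indices_from_word_indices` is a hypothetical function, not the library's. It assumes words are delimited by whitespace and that each word range is a `(start, end)` pair with an exclusive end; the real function may use different conventions.

```python
import re

def toy_char_indices_from_word_indices(text, word_ranges):
    """Convert (start_word, end_word) ranges, end exclusive, into
    (start_char, end_char) ranges, end exclusive."""
    # Character span of each whitespace-delimited word, in order.
    word_spans = [m.span() for m in re.finditer(r"\S+", text)]
    char_ranges = []
    for start_word, end_word in word_ranges:
        start_char = word_spans[start_word][0]
        end_char = word_spans[end_word - 1][1]
        char_ranges.append((start_char, end_char))
    return char_ranges


text = "in the beginning God created"
char_ranges = toy_char_indices_from_word_indices(text, [(0, 2), (3, 5)])
snippets = [text[start:end] for start, end in char_ranges]
```

Recording word spans with a regex, instead of splitting and re-joining, keeps the character offsets correct even when words are separated by more than one space.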