Commentary Matcher (aka Dibur Hamatchil Matcher)

Noah Santacruz edited this page Sep 5, 2016 · 13 revisions


This logic and the first versions of this script were originally developed by Dicta, The Israel Center for Text Analysis. They generously gave Sefaria permission to use their code as a template for our own, and to include it in our code base.

The commentary matcher can be found in dibur_hamatchil_matcher.py. It is meant to be used when a commentary quotes short sections of a base text without explicitly saying which part of the text each comment refers to. The script matches blocks of commentary to blocks of base text. There are a few useful tools in the script, which are explained below.

Overview of useful functions

The script has two main functions, match_ref() and match_text().

match_text()

This function contains the main string-matching functionality. It is generalized so that it doesn't depend on Sefaria's database.

The signature for this function is

def match_text(base_text, comments, dh_extract_method=lambda x: x, verbose=False, word_threshold=0.27, char_threshold=0.2, prev_matched_results=None, with_abbrev_ranges=False):
    """
    base_text: list - list of words
    comments: list - list of comment strings
    dh_extract_method: f(string)->string
    prev_matched_results: [(start,end)] list of start/end indexes found in a previous iteration of match_text
    with_abbrev_ranges: boolean
    returns: [(start,end)] - start and end index of each comment. optionally also returns abbrev_ranges (see with_abbrev_ranges parameter)
    """

Parameters

base_text: list of words, representing the base text which you are matching to.

comments: list of comment strings. Each string represents a full comment. Note that the entire string doesn't need to be a quote from the base text. In a case where only part of the comment string is a quote, use the dh_extract_method parameter.

dh_extract_method: (optional) defines a function of the form f(str) -> str which takes elements from comments and returns only the portion which is a quote from the base text.

prev_matched_results: a list of (int, int) tuples, each representing a start/end index range. The list must be the same length as comments. Use this parameter when the placement of some comments is already known (e.g. you already ran the matcher and got partial results). A comment whose position is unknown should be given the tuple (-1,-1). Any comment with a range other than (-1,-1) will not be re-matched; its input range is passed through unchanged to the output.

with_abbrev_ranges: True if you want a second return value which is a list of tuples. Each tuple contains the range of text, with indices relative to the comment text, which matched as an abbreviation in the base text, if any was found.

returns: a list of (int,int) tuples, the same length as comments, giving the range in the base text that each comment matched. The indices are relative to the base_text list. Any comment which didn't match will have a range of (-1,-1).
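To make the calling convention concrete, here is a toy stand-in for match_text() using naive exact matching. This is NOT the real algorithm (which uses weighted Levenshtein distance and abbreviation handling); it only demonstrates the input/output shapes, the dh_extract_method hook, and the prev_matched_results pass-through described above. All names here are illustrative.

```python
# Toy illustration of match_text()'s calling convention (naive exact match).
def toy_match_text(base_text, comments, dh_extract_method=lambda x: x,
                   prev_matched_results=None):
    """base_text: list of words; comments: list of comment strings.
    Returns one (start, end) word-index tuple per comment, (-1, -1) if unmatched."""
    results = []
    for i, comment in enumerate(comments):
        # A range other than (-1, -1) from a previous run passes through unchanged.
        if prev_matched_results and prev_matched_results[i] != (-1, -1):
            results.append(prev_matched_results[i])
            continue
        dh = dh_extract_method(comment).split()
        match = (-1, -1)
        for start in range(len(base_text) - len(dh) + 1):
            if base_text[start:start + len(dh)] == dh:
                match = (start, start + len(dh) - 1)
                break
        results.append(match)
    return results

base = "in the beginning god created the heaven and the earth".split()
comments = ["god created - a comment on creation", "nonsense words here"]
dh = lambda s: s.split(" - ")[0]  # keep only the quoted portion
print(toy_match_text(base, comments, dh_extract_method=dh))
# → [(3, 4), (-1, -1)]
```

Note how the second comment, which quotes nothing from the base text, comes back as (-1,-1), and how supplying prev_matched_results would freeze any comment whose range is already known.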

match_ref()

This function wraps the match_text() function and is meant to be used when at least the base text is in the Sefaria database (the commentary doesn't necessarily need to be in the DB, as explained below).

def match_ref(base_text, comments, base_tokenizer, dh_extract_method=lambda x: x, verbose=False, word_threshold=0.27, char_threshold=0.2):
    """
    base_text: TextChunk
    comments: TextChunk or list of comment strings
    base_tokenizer: f(string)->list(string)
    dh_extract_method: f(string)->string
    :returns: [(Ref, Ref)] - base text, commentary
         or
         [Ref] - base text refs - if comments is a list
    """

Parameters

base_text: TextChunk representing the base text

comments: Can either be a TextChunk or a list of comment strings. Depending on which one is used, the return value will be different (see returns). It's recommended to use TextChunk when the commentary text is already in the Sefaria database, as the return value will be more useful.

base_tokenizer: a function of the form f(str) -> list(str). The input is a segment from base_text and the output is a list of words. In cases where there are unwanted tokens in the base text (e.g. html tags, citations) they should be removed in this function.

dh_extract_method: see this parameter described above in match_text

returns: If comments was a TextChunk, the output is a list of (Ref,Ref) tuples. The first Ref comes from the base text and the second from the commentary text. Where the commentary text didn't match, the tuple will be (None,Ref). If comments was a list, the output is a list of Refs, each from the base text, the same length as comments.

Tutorial

(The code discussed in this tutorial can be found in: Sefaria-Data/research/dibur_hamatchil/dh_source_scripts/mishnah_berurah.py)

To explain how to use the tools, the example of matching Mishnah Berurah to Shulchan Arukh, Orach Chaim will be used.

An example comment of the Mishnah Berurah is (Mishnah Berurah 1:3)

(ג) שהציבור מתפללין – היינו אף על פי שלא יעבור זמן תפילה, מכל מקום מצווה עם הציבור. ועיין לקמן סוף סעיף קטן ט'.

which is commenting on (Shulchan Arukh, Orach Chaim 1:1)

יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר: הַגָּה: וְעַל כָּל פָּנִים לֹא יְאַחֵר זְמַן הַתְּפִלָּה שֶׁהַצִּבּוּר מִתְפַּלְּלִין. (טוּר) הַגָּה: שִׁוִּיתִי ה' לְנֶגְדִּי תָמִיד הוּא כְּלָל גָּדוֹל בַּתּוֹרָה וּבְמַעֲלוֹת הַצַּדִּיקִים אֲשֶׁר הוֹלְכִים לִפְנֵי הָאֱלֹהִים, כִּי אֵין יְשִׁיבַת הָאָדָם וּתְנוּעוֹתָיו וַעֲסָקָיו וְהוּא לְבַדּוֹ בְּבֵיתוֹ כִּישִׁיבָתוֹ וּתְנוּעוֹתָיו וַעֲסָקָיו וְהוּא לִפְנֵי מֶלֶךְ גָּדוֹל, וְלֹא דִּבּוּרוֹ וְהַרְחָבַת פִּיו כִּרְצוֹנוֹ וְהוּא עִם אַנְשֵׁי בֵּיתוֹ וּקְרוֹבָיו כְּדִבּוּרוֹ בְּמוֹשַׁב הַמֶּלֶךְ, כָּל שֶׁכֵּן כְּשֶׁיָּשִׂים הָאָדָם אֶל לִבּוֹ שֶׁהַמֶּלֶךְ הַגָּדוֹל הָקָּבָּ''ה אֲשֶׁר מְלֹא כָל הָאָרֶץ כְּבוֹדוֹ עוֹמֵד עָלָיו וְרוֹאֶה בְּמַעֲשָׂיו, כְּמוֹ שֶׁנֶּאֱמַר: אִם יִסָּתֵר אִישׁ בַּמִּסְתָּרִים וַאֲנִי לֹא אֶרְאֶנּוּ נְאֻם ה', מִיָּד יַגִּיעַ אֵלָיו הַיִּרְאָה וְהַהַכְנָעָה וּפַחַד ה' יִתְבָּרַךְ וּבָשְׁתּוֹ מִמֶּנּוּ תָּמִיד (מוֹרֵה נְבוֹכִים ח''ג פ' כ''ב) וְלֹא יִתְבַּיֵּשׁ מִפְּנֵי בְּנֵי אָדָם הַמַּלְעִיגִים עָלָיו בַּעֲבוֹדַת ה' יִתְבָּרַךְ גַּם בְּהֶצְנֵעַ לֶכֶת. וּבְשָׁכְבּוֹ עַל מִשְׁכָּבוֹ יֵדַע לִפְנֵי מִי הוּא שׁוֹכֵב, וּמִיָּד כְּשֶׁיֵּעוֹר מִשְּׁנָתוֹ יָקוּם בִּזְרִיזוּת לַעֲבוֹדַת בּוֹרְאוֹ יִתְעַלֶּה וְיִתְרוֹמֵם (טוּר).

Note that the text matchup has been highlighted above. When considering this example from a programmatic point of view, one notices a few issues:

  1. The portion of the Shulchan Arukh that the Mishnah Berurah quotes (which will be referred to as the 'dibur hamatchil') needs to be parsed out of the comment.

  2. The base text, Shulchan Arukh, has nikud as well as other features such as HTML tags (not shown here) and punctuation, which make it more difficult to match the strings directly.

  3. The matches themselves in this case aren't exact, even after removing nikud (שהציבור מתפללין != שהצבור מתפללין): because the Shulchan Arukh text is vocalized with nikud, the yud (used there as a vowel letter) is omitted.

  4. An additional issue, not shown here, is when Shulchan Arukh uses an abbreviation and Mishnah Berurah expands the abbreviation (or vice versa).

These issues are addressed by the matcher script. Since both Mishnah Berurah and Shulchan Arukh are already in Sefaria, the match_ref() function will be used.

To address issue (1), the following dh_extract_method was defined:

def dh_extraction_method(str):
    m = re.search(ur"(\([^\(]+\))([^–]+)–", str)  # text between the (x) marker and an en-dash
    if m is None:
        m = re.search(ur"(\([^\(]+\))([^-]+)-", str)  # fall back to a plain hyphen
    if m:
        dh = m.group(2).strip()
        return dh.replace(u"וכו'",u"")
    else:
        return ""

Essentially, it finds the text between the parenthesized marker (e.g. (ג)) and the dash. Additionally, it removes וכו' ("etc.") from the text, since it doesn't appear in the base text.
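To see the extraction in action, here is a self-contained Python 3 rendition of the same logic, run on the example comment above. The function name extract_dh is illustrative (the script itself is Python 2 and calls it dh_extraction_method).

```python
import re

def extract_dh(comment):
    """Return the quoted dibur hamatchil: the text between the
    parenthesized marker, e.g. (ג), and the dash, with וכו' removed."""
    m = re.search(r"(\([^\(]+\))([^–]+)–", comment)   # en-dash delimiter
    if m is None:
        m = re.search(r"(\([^\(]+\))([^-]+)-", comment)  # plain hyphen fallback
    if m:
        return m.group(2).strip().replace("וכו'", "")
    return ""

print(extract_dh("(ג) שהציבור מתפללין – היינו אף על פי שלא יעבור זמן תפילה"))
# → שהציבור מתפללין
```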

To address issue (2), the following base_tokenizer function was defined:

def base_tokenizer(str):
    punc_pat = re.compile(ur"(\.|,|:)$")

    str = re.sub(ur"\([^\(\)]+\)", u"", str)  # remove citations in parentheses
    str = re.sub(r"</?[a-z]+>", "", str)  # get rid of html tags
    str = hebrew.strip_cantillation(str, strip_vowels=True)
    word_list = re.split(ur"\s+", str)
    word_list = [punc_pat.sub(u"", w).strip() for w in word_list]
    return [w for w in word_list if w]  # drop empty strings

The function removes punctuation, nikud, HTML tags, and citations (text surrounded by parentheses).
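For readers without the Sefaria code base, here is a self-contained Python 3 sketch of the same tokenizer. It substitutes a unicodedata-based stripper for Sefaria's hebrew.strip_cantillation() (nikud and cantillation marks are Unicode combining marks, category "Mn"); the name tokenize_base is illustrative.

```python
import re
import unicodedata

def tokenize_base(text):
    """Split a base-text segment into bare words: no citations,
    HTML tags, nikud/cantillation, or trailing punctuation."""
    text = re.sub(r"\([^\(\)]+\)", "", text)   # remove citations in parentheses
    text = re.sub(r"</?[a-z]+>", "", text)     # remove html tags
    # strip nikud/cantillation: decompose, then drop combining marks
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")
    words = re.split(r"\s+", text)
    words = [re.sub(r"(\.|,|:)$", "", w).strip() for w in words]
    return [w for w in words if w]             # drop empty tokens

print(tokenize_base("<b>יִתְגַּבֵּר כַּאֲרִי</b> לַעֲמֹד בַּבֹּקֶר. (טוּר)"))
# → ['יתגבר', 'כארי', 'לעמד', 'בבקר']
```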

The call to match_ref() therefore looks like

ref_map = dibur_hamatchil_matcher.match_ref(octc,mbtc,base_tokenizer=base_tokenizer,dh_extract_method=dh_extraction_method)

octc and mbtc are TextChunks of sections of the Shulchan Arukh and Mishnah Berurah, respectively (in this case, they're both simanim).

Issues (3) and (4) are solved by match_text() (again, match_ref() wraps match_text()).

For issue (3), match_text uses a Weighted Levenshtein algorithm to compare strings. The exact weights can be seen in Sefaria-Data/research/talmud_pos_research/language_classifier/language_tools.py

The weights are based on the relative frequencies of letters in Hebrew. Since yud tends to be the most common letter, it has the lowest weight, so a missing yud barely raises the overall Levenshtein distance between two strings.
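The idea can be sketched as follows. This is a minimal weighted Levenshtein implementation with made-up weights (the actual weights live in language_tools.py); the only point it demonstrates is that a cheap yud makes a missing yud nearly free.

```python
# Hypothetical per-letter cost table; every letter not listed costs 1.0.
COST = {"י": 0.2}

def letter_cost(c):
    return COST.get(c, 1.0)

def weighted_levenshtein(a, b):
    """Standard edit-distance DP, but insert/delete/substitute costs
    depend on the letter involved rather than being a flat 1."""
    rows = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        rows[i][0] = rows[i - 1][0] + letter_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        rows[0][j] = rows[0][j - 1] + letter_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else max(letter_cost(a[i - 1]),
                                                       letter_cost(b[j - 1]))
            rows[i][j] = min(rows[i - 1][j] + letter_cost(a[i - 1]),  # delete
                             rows[i][j - 1] + letter_cost(b[j - 1]),  # insert
                             rows[i - 1][j - 1] + sub)                # substitute
    return rows[-1][-1]

# A missing yud costs only 0.2, versus 1.0 for a typical substitution:
print(weighted_levenshtein("שהציבור", "שהצבור"))   # → 0.2
print(weighted_levenshtein("מתפללין", "מתפללים"))  # → 1.0
```

With a suitable threshold (cf. the word_threshold/char_threshold parameters), the first pair counts as a match while an ordinary one-letter difference may not.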

For issue (4), match_text searches for one-sided abbreviations. It can currently find abbreviations where:

  • Each letter represents a word
  • The first two letters are one word and each consecutive letter is another word
  • The first three letters are one word and each consecutive letter is another word
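The three patterns above can be sketched as a simple predicate. This is an illustrative simplification, not the script's actual abbreviation matcher; it assumes the abbreviation has already had its gershayim/quote marks removed.

```python
def expands_abbreviation(abbrev, words):
    """True if the word run `words` expands `abbrev` under one of the
    three patterns: one letter per word, or the first 2 or 3 letters
    taken from the first word and one letter per word after that."""
    # Pattern 1: each letter starts one word.
    if len(words) == len(abbrev) and all(w.startswith(c)
                                         for c, w in zip(abbrev, words)):
        return True
    # Patterns 2 and 3: first `head` letters come from the first word,
    # each remaining letter starts its own word.
    for head in (2, 3):
        if (len(abbrev) > head
                and len(words) == len(abbrev) - head + 1
                and all(c in words[0] for c in abbrev[:head])
                and all(w.startswith(c)
                        for c, w in zip(abbrev[head:], words[1:]))):
            return True
    return False

print(expands_abbreviation("אעפ", ["אף", "על", "פי"]))        # pattern 1 → True
print(expands_abbreviation("שצמ", ["שהציבור", "מתפללין"]))   # pattern 2 → True
```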

To do

  • match_ref doesn't generalize well for TextChunks with more than two levels
  • The abbreviation matcher should look for more complicated abbreviation patterns