Commentary Matcher (aka Dibur Hamatchil Matcher)
The logic and first versions of this script were originally developed by Dicta, The Israel Center for Text Analysis. They generously gave Sefaria permission to use their code as a template for our own, and to include it in our code base.
The commentary matcher can be found in dibur_hamatchil_matcher.py. It is meant to be used when a commentary quotes short sections from a base text without explicitly saying which part of the text each comment refers to. The script matches blocks of commentary to blocks of base text. It contains a few useful tools, which are explained below.
Overview of useful functions
The script has two main functions, match_text() and match_ref().
match_text() contains the main string matching functionality. It is generalized so it doesn't need to be used with Sefaria's database.
The signature for this function is

    def match_text(base_text, comments, dh_extract_method=lambda x: x, verbose=False,
                   word_threshold=0.27, char_threshold=0.2,
                   prev_matched_results=None, with_abbrev_ranges=False):
        """
        base_text: list - list of words
        comments: list - list of comment strings
        dh_extract_method: f(string)->string
        prev_matched_results: [(start,end)] list of start/end indexes found in a previous iteration of match_text
        with_abbrev_ranges: boolean
        returns: [(start,end)] - start and end index of each comment. optionally also returns abbrev_ranges (see with_abbrev_ranges parameter)
        """

base_text: list of words, representing the base text which you are matching to.
comments: list of comment strings. Each string represents a full comment. Note that the entire string doesn't need to be a quote from the base text. In a case where only part of the comment string is a quote, use the dh_extract_method parameter.
dh_extract_method: (optional) a function of the form f(str) -> str which takes an element of comments and returns only the portion which is a quote from the base text.
prev_matched_results: a list where every element is of the form (int,int). Each tuple represents a start/end index range. The list should be the same length as comments. Use this parameter when the placement of some comments is already known from the start (e.g. you already ran the matcher and got partial results). To represent a comment with an unknown position, the tuple should be (-1,-1). Any comment with a range other than (-1,-1) will not be matched again, and its input range is returned unchanged in the output.
with_abbrev_ranges: True if you want a second return value which is a list of tuples. Each tuple contains the range of text, with indices relative to the comment text, which matched as an abbreviation in the base text, if any was found.
returns: list of tuples of the form (int,int), the same length as comments, representing the range which each comment matched in the base text. The indices are relative to the base_text list. Any comment which didn't match will have a range of (-1,-1).
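The input and output conventions described above can be illustrated with a small standalone sketch. It does not call match_text() itself. The helper names seed_previous_results and matched_words are hypothetical, and the sketch assumes the end index of each range is inclusive, which the description above does not state.

```python
UNMATCHED = (-1, -1)  # sentinel used for a comment with no known position


def seed_previous_results(num_comments, known):
    """Build a prev_matched_results-style list (hypothetical helper).

    known maps a comment index to an already-known (start, end) range;
    every other comment gets the (-1, -1) sentinel.
    """
    return [known.get(i, UNMATCHED) for i in range(num_comments)]


def matched_words(base_text, ranges):
    """Turn (start, end) ranges back into the base-text words they cover.

    ASSUMPTION: end is inclusive. Unmatched comments map to None.
    """
    result = []
    for start, end in ranges:
        if (start, end) == UNMATCHED:
            result.append(None)
        else:
            result.append(base_text[start:end + 1])
    return result


base_text = "על כל פנים לא יאחר זמן התפלה שהצבור מתפללין".split()
prev = seed_previous_results(2, {0: (7, 8)})  # only the first comment is known
words = matched_words(base_text, prev)
```

A list built this way has one entry per comment, which is the shape the prev_matched_results parameter expects.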
match_ref() wraps the match_text() function and is meant to be used when at least the base text is in the Sefaria database (the commentary doesn't necessarily need to be in the DB, as will be explained below).

    def match_ref(base_text, comments, base_tokenizer, dh_extract_method=lambda x: x,
                  verbose=False, word_threshold=0.27, char_threshold=0.2):
        """
        base_text: TextChunk
        comments: TextChunk or list of comment strings
        base_tokenizer: f(string)->list(string)
        dh_extract_method: f(string)->string
        :returns: [(Ref, Ref)] - base text, commentary
                  or [Ref] - base text refs - if comments is a list
        """

base_text: TextChunk representing the base text
comments: can be either a TextChunk or a list of comment strings. Depending on which one is used, the return value will be different (see returns). It's recommended to use a TextChunk when the commentary text is already in the Sefaria database, as the return value will be more useful.
base_tokenizer: a function of the form f(str) -> list(str). The input is a segment from base_text and the output is a list of words. In cases where there are unwanted tokens in the base text (e.g. html tags, citations), they should be removed in this function.
dh_extract_method: see the description of this parameter under match_text() above.
returns: if comments is a TextChunk, the output is a list of tuples of the form (Ref,Ref). The first Ref comes from the base text and the second comes from the commentary text. Where a comment didn't match, the tuple will be (None,Ref). If comments is a list, the output is a list of Refs where each Ref is from the base text. The list is the same length as comments.
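As a rough illustration of consuming the tuple form of the return value, the sketch below separates matched pairs from unmatched comments. The helper name split_matches is hypothetical, and plain strings stand in for Sefaria Ref objects so the snippet runs without the Sefaria code base.

```python
def split_matches(ref_map):
    """Split a list of (base_ref, comment_ref) tuples (hypothetical helper).

    Returns (matched, unmatched): matched keeps the full tuples, while
    unmatched keeps only the commentary refs whose base ref is None.
    """
    matched = [(base, comment) for base, comment in ref_map if base is not None]
    unmatched = [comment for base, comment in ref_map if base is None]
    return matched, unmatched


# Strings stand in for Ref objects here (an assumption for illustration only).
ref_map = [
    ("Shulchan Arukh, Orach Chaim 1:1", "Mishnah Berurah 1:3"),
    (None, "Mishnah Berurah 1:4"),
]
matched, unmatched = split_matches(ref_map)
```

The unmatched list is a natural input for a manual review pass or for a second run with different settings.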
(The code discussed in this tutorial can be found in:
To explain how to use the tools, the example of matching Mishnah Berurah to Shulchan Arukh, Orach Chaim will be used.
An example comment of the Mishnah Berurah is (Mishnah Berurah 1:3)
(ג) שהציבור מתפללין – היינו אף על פי שלא יעבור זמן תפילה, מכל מקום מצווה עם הציבור. ועיין לקמן סוף סעיף קטן ט'.
which is commenting on (Shulchan Arukh, Orach Chaim 1:1)
יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר: הַגָּה: וְעַל כָּל פָּנִים לֹא יְאַחֵר זְמַן הַתְּפִלָּה שֶׁהַצִּבּוּר מִתְפַּלְּלִין. (טוּר) הַגָּה: שִׁוִּיתִי ה' לְנֶגְדִּי תָמִיד הוּא כְּלָל גָּדוֹל בַּתּוֹרָה וּבְמַעֲלוֹת הַצַּדִּיקִים אֲשֶׁר הוֹלְכִים לִפְנֵי הָאֱלֹהִים, כִּי אֵין יְשִׁיבַת הָאָדָם וּתְנוּעוֹתָיו וַעֲסָקָיו וְהוּא לְבַדּוֹ בְּבֵיתוֹ כִּישִׁיבָתוֹ וּתְנוּעוֹתָיו וַעֲסָקָיו וְהוּא לִפְנֵי מֶלֶךְ גָּדוֹל, וְלֹא דִּבּוּרוֹ וְהַרְחָבַת פִּיו כִּרְצוֹנוֹ וְהוּא עִם אַנְשֵׁי בֵּיתוֹ וּקְרוֹבָיו כְּדִבּוּרוֹ בְּמוֹשַׁב הַמֶּלֶךְ, כָּל שֶׁכֵּן כְּשֶׁיָּשִׂים הָאָדָם אֶל לִבּוֹ שֶׁהַמֶּלֶךְ הַגָּדוֹל הָקָּבָּ''ה אֲשֶׁר מְלֹא כָל הָאָרֶץ כְּבוֹדוֹ עוֹמֵד עָלָיו וְרוֹאֶה בְּמַעֲשָׂיו, כְּמוֹ שֶׁנֶּאֱמַר: אִם יִסָּתֵר אִישׁ בַּמִּסְתָּרִים וַאֲנִי לֹא אֶרְאֶנּוּ נְאֻם ה', מִיָּד יַגִּיעַ אֵלָיו הַיִּרְאָה וְהַהַכְנָעָה וּפַחַד ה' יִתְבָּרַךְ וּבָשְׁתּוֹ מִמֶּנּוּ תָּמִיד (מוֹרֵה נְבוֹכִים ח''ג פ' כ''ב) וְלֹא יִתְבַּיֵּשׁ מִפְּנֵי בְּנֵי אָדָם הַמַּלְעִיגִים עָלָיו בַּעֲבוֹדַת ה' יִתְבָּרַךְ גַּם בְּהֶצְנֵעַ לֶכֶת. וּבְשָׁכְבּוֹ עַל מִשְׁכָּבוֹ יֵדַע לִפְנֵי מִי הוּא שׁוֹכֵב, וּמִיָּד כְּשֶׁיֵּעוֹר מִשְּׁנָתוֹ יָקוּם בִּזְרִיזוּת לַעֲבוֹדַת בּוֹרְאוֹ יִתְעַלֶּה וְיִתְרוֹמֵם (טוּר).
Note that the text matchup has been highlighted above. When considering this example from a programmatic point of view, one notices a few issues:
1. The portion of Shulchan Arukh that the Mishnah Berurah quotes (referred to below as the 'dibur hamatchil') needs to be parsed out of the comment.
2. The base text, Shulchan Arukh, has nikud as well as other features such as html tags (not shown here) and punctuation, which make it more difficult to match the strings directly.
3. The match in this case isn't exact, even after removing nikud (שהציבור מתפללין != שהצבור מתפללין). The yud is missing in Shulchan Arukh because the text already has nikud.
4. An additional issue, not shown here, arises when Shulchan Arukh uses an abbreviation and Mishnah Berurah expands it (or vice versa).
These issues are addressed by the matcher script.
Since both Mishnah Berurah and Shulchan Arukh are already in Sefaria, the match_ref() function will be used.
To address issue (1), the following dh_extract_method was defined:

    import re

    def dh_extraction_method(str):
        # Look for "(marker) quote –" with an en dash first, then fall back to a hyphen.
        m = re.search(ur"(\([^\(]+\))([^–]+)–", str)
        if m is None:
            m = re.search(ur"(\([^\(]+\))([^-]+)-", str)
        if m:
            dh = m.group(2).strip()
            return dh.replace(u"וכו'", u"")  # drop "etc.", which is absent from the base text
        else:
            return ""

Essentially, it finds the text between the parenthesized marker, e.g. (א), and the dash that follows. It also removes "וכו'" from the text, since that doesn't appear in the base text.
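The ur"" literals used in the script are Python 2 syntax. For readers working in Python 3, a minimal sketch of the same extraction idea might look like the following. The function name extract_dh is hypothetical, and unlike the script it strips whitespace after removing "וכו'" so that no trailing space is left.

```python
import re


def extract_dh(comment):
    """Return the quoted opening words of a comment, or "" if none is found."""
    # "(marker) quote <en dash>"; \u2013 is the en dash used in the example comment.
    m = re.search(r"\([^(]+\)([^\u2013]+)\u2013", comment)
    if m is None:
        # Fall back to a plain hyphen as the separator.
        m = re.search(r"\([^(]+\)([^-]+)-", comment)
    if m is None:
        return ""
    # Remove "etc." and then trim surrounding whitespace.
    return m.group(1).replace("וכו'", "").strip()


comment = "(ג) שהציבור מתפללין \u2013 היינו אף על פי שלא יעבור זמן תפילה"
dh = extract_dh(comment)
```

Applied to the Mishnah Berurah 1:3 example, this yields the two quoted words without the marker or the dash.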
To address issue (2), the following base_tokenizer function was defined:

    def base_tokenizer(str):
        punc_pat = re.compile(ur"(\.|,|:)$")
        str = re.sub(ur"\([^\(\)]+\)", u"", str)  # get rid of citations in parentheses
        str = re.sub(r"</?[a-z]+>", "", str)  # get rid of html tags
        str = hebrew.strip_cantillation(str, strip_vowels=True)  # get rid of cantillation and nikud
        word_list = re.split(ur"\s+", str)
        # remove empty strings and punctuation at the end of a word
        word_list = [re.sub(punc_pat, u"", w).strip() for w in word_list
                     if len(re.sub(punc_pat, u"", w).strip()) > 0]
        return word_list

The function removes punctuation, nikud, html tags and citations (surrounded by parentheses).
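A standalone Python 3 sketch of the same cleanup steps is shown here for experimentation outside the Sefaria code base. It is not the script's implementation: the names strip_marks and tokenize_base are hypothetical, and Sefaria's hebrew.strip_cantillation helper is swapped for a generic filter that drops every Unicode nonspacing mark (category Mn), which covers Hebrew vowel points and cantillation marks.

```python
import re
import unicodedata

PUNCT_AT_END = re.compile(r"[.,:]$")  # one trailing period, comma, or colon


def strip_marks(text):
    """Drop all nonspacing marks (nikud and cantillation are category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def tokenize_base(segment):
    """Split a base-text segment into clean words (hypothetical helper)."""
    segment = re.sub(r"\([^()]+\)", "", segment)   # citations in parentheses
    segment = re.sub(r"</?[a-z]+>", "", segment)   # simple html tags
    segment = strip_marks(segment)
    words = (PUNCT_AT_END.sub("", w).strip() for w in re.split(r"\s+", segment))
    return [w for w in words if w]                 # drop empty strings


tokens = tokenize_base("<b>הַגָּה:</b> שֶׁהַצִּבּוּר מִתְפַּלְּלִין. (טוּר)")
```

The resulting word list has the shape that match_text() expects for its base_text argument.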
The call to match_ref() therefore looks like

    ref_map = dibur_hamatchil_matcher.match_ref(octc, mbtc, base_tokenizer=base_tokenizer, dh_extract_method=dh_extraction_method)

Here octc and mbtc are TextChunks holding sections of Shulchan Arukh and Mishnah Berurah respectively (in this case, both are Simanim).
Issues (3) and (4) are solved by match_text().
For issue (3), match_text uses a weighted Levenshtein algorithm to compare strings. The exact weights can be seen in
The weights are based on the relative frequencies of letters in Hebrew. Since yud tends to be the most common letter, it has the lowest weight, and therefore a missing yud adds very little to the overall Levenshtein distance between two strings.
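To make the idea concrete, here is a minimal sketch of a weighted edit distance. It is not the script's implementation: the weights below are invented for illustration (the real ones live in the script), and charging a substitution at the larger of the two letter weights is an assumption.

```python
# HYPOTHETICAL weights: common letters are cheap, everything else costs 1.0.
EXAMPLE_WEIGHTS = {"י": 0.2, "ו": 0.3}


def weighted_levenshtein(a, b, weights, default=1.0):
    """Edit distance where each letter has its own insert/delete cost."""
    def cost(ch):
        return weights.get(ch, default)

    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = [0.0]
    for ch in b:
        prev.append(prev[-1] + cost(ch))

    for ca in a:
        cur = [prev[0] + cost(ca)]  # delete every letter of a seen so far
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else max(cost(ca), cost(cb)))
            delete = prev[j] + cost(ca)
            insert = cur[j - 1] + cost(cb)
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


missing_yud = weighted_levenshtein("שהציבור", "שהצבור", EXAMPLE_WEIGHTS)
missing_resh = weighted_levenshtein("שהציבור", "שהציבו", EXAMPLE_WEIGHTS)
```

With these example weights, dropping the yud costs far less than dropping the resh, which is the behaviour that lets the two spellings in issue (3) be treated as a close match.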
For issue (4), match_text searches for these one-sided abbreviations. It can currently find abbreviations where
- Each letter represents a word
- The first two letters are one word and each consecutive letter is another word
- The first three letters are one word and each consecutive letter is another word
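The three patterns listed above can be sketched as a small standalone check. This is an illustration rather than the matcher's own code: the function name abbrev_matches is hypothetical, and treating "represents a word" as "is a prefix of that word" is an assumption.

```python
QUOTE_MARKS = "\"'\u05f3\u05f4"  # ASCII quotes plus Hebrew geresh and gershayim


def abbrev_matches(abbrev, words):
    """Check whether abbrev could stand for the given sequence of words.

    Tries a first chunk of one, two, or three letters; every remaining
    letter must then begin one further word.
    """
    letters = "".join(ch for ch in abbrev if ch not in QUOTE_MARKS)
    for first_len in (1, 2, 3):
        if len(letters) < first_len:
            continue
        chunks = [letters[:first_len]] + list(letters[first_len:])
        if len(chunks) == len(words) and all(
            word.startswith(chunk) for word, chunk in zip(words, chunks)
        ):
            return True
    return False


one_letter_each = abbrev_matches("אע\"פ", ["אף", "על", "פי"])
two_letter_start = abbrev_matches("הקב\"ה", ["הקדוש", "ברוך", "הוא"])
no_match = abbrev_matches("הקב\"ה", ["ברוך", "הוא"])
```

The second call corresponds to the abbreviation for הקדוש ברוך הוא that appears in the Shulchan Arukh passage quoted earlier, where the first two letters belong to one word.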
Known limitations
- match_ref doesn't generalize well for TextChunks with more than two levels
- The abbreviation matcher should be looking for more complicated abbreviations