Datamining Masoret HaShas
The Vilna Page of Talmud contains the Masoret HaShas, which gives cross-references to parallel Talmudic texts. Inspired by this concept, we have implemented an algorithm for automatically generating our own set of links between parallel passages in the Bavli using ElasticSearch.
There are several stages to the algorithm. First, the entire text of the Bavli is broken down into 6-grams. These 6-grams are dumped to a series of JSON files. Then, for every 6-gram, a query is generated and sent to the ElasticSearch server. The results are stored in a dictionary which maps the daf where the 6-gram came from to the daf where the 6-gram was found. Once all the hits have been generated, they are merged together, given a score representing their potential interest, and added to the results. These results can then be compared with the existing Masoret HaShas links in the database, or posted as links through the API.
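The first stage can be illustrated with a minimal sketch. It is not the script's own code: the function name `six_grams`, the `(ref, text)` input layout, and the sample refs are all assumptions made for illustration. It slides a window of six words over the text and records the line on which each window starts.

```python
from typing import Iterator, List, Tuple


def six_grams(lines: List[Tuple[str, str]], n: int = 6) -> Iterator[Tuple[str, str]]:
    """Yield (ref_of_first_word, n-gram text) for each window of n words.

    `lines` is a hypothetical layout: one (ref, text) pair per line.
    """
    # Flatten to (ref, word) pairs so windows may run across line boundaries.
    tagged = [(ref, word) for ref, text in lines for word in text.split()]
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        yield window[0][0], " ".join(word for _, word in window)


# Placeholder text; the real input would be the Hebrew/Aramaic of the Bavli.
sample = [
    ("Example 2a:1", "w1 w2 w3 w4"),
    ("Example 2a:2", "w5 w6 w7"),
]
grams = list(six_grams(sample))
```

Seven words yield two 6-grams here, both starting on the first line. Each n-gram would then become one search query.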
When the script generates n-grams, it maps them to the precise location it found them, which is a Ref containing the Daf and the line numbers. When the n-grams are searched with ElasticSearch, however, the results contain only the Daf where the term was found. In order to solve this, the hits are merged after they are generated. This means that if there is a hit from Menachot 12a:14 to Yevamot 65a and a hit from Yevamot 65a:1 to Menachot 12a, the hits are merged into a single link from Menachot 12a:14 to Yevamot 65a:1.
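The merge step might look roughly like the following sketch. It pairs hits on location alone, which is a simplification: a fuller version would likely also need to confirm that both hits come from the same text. The names `daf_of` and `merge_hits` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: (precise source ref, Daf-level target)
Hit = Tuple[str, str]


def daf_of(ref: str) -> str:
    """Strip the line number: 'Menachot 12a:14' -> 'Menachot 12a'."""
    return ref.split(":")[0]


def merge_hits(hits: List[Hit]) -> List[Tuple[str, str]]:
    """Pair hits that point at each other's Daf into one precise link."""
    merged = []
    for i, (src, target_daf) in enumerate(hits):
        for other_src, other_target in hits[i + 1:]:
            # Two hits are mirror images when each one's source Daf
            # is the other one's target Daf.
            if daf_of(other_src) == target_daf and other_target == daf_of(src):
                merged.append((src, other_src))
    return merged


links = merge_hits([
    ("Menachot 12a:14", "Yevamot 65a"),
    ("Yevamot 65a:1", "Menachot 12a"),
])
```

With the two hits from the example above, `links` holds the single pair `("Menachot 12a:14", "Yevamot 65a:1")`.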
Given that there can be multiple hits on the same Daf, the algorithm greedily links the largest contiguous text chunk that is found in both locations. Also, ElasticSearch cannot match text that spans multiple Dafs, whereas the n-gram generator does produce n-grams that cross Daf boundaries. This means that not every link has line numbers for both references. For the current iteration of the script, 83% of the results have complete references.
For every result, a score is assigned based on the potential interest of the link by the function score_result(result, count_avg). For example, a common 6-word phrase found throughout the Talmud is not an interesting textual link, but a longer passage found in only two or three places might be. The equation for the scoring algorithm is:
(Number of words - 6) * 10 - Average number of occurrences * 10 - 2 * (Number of Rabbinic chain words) - 5 * Number of Common Phrases - 20 * Number of Biblical Citations.
For the phrases, the script contains a list of common phrases which are checked against the text. For the Rabbinic chain words, there is a list of common Rabbinic titles and a list of chain words (such as בריה and משום). In addition, the number of occurrences of the root "אמר" is counted and added to the Rabbinic chain count. The average number of occurrences is found by averaging the number of times each n-gram within the phrase was found in the database. This algorithm can be tweaked to alter the results, as only links with a non-negative score get added to the database.
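As a worked illustration of the equation, here is a small sketch that takes the five counts as plain numbers. The function name `link_score` and its parameters are hypothetical; the script's own `score_result(result, count_avg)` derives these counts from the result itself.

```python
def link_score(num_words: int,
               avg_occurrences: float,
               chain_words: int,
               common_phrases: int,
               biblical_citations: int) -> float:
    """Apply the scoring equation to counts that are already computed."""
    return ((num_words - 6) * 10
            - avg_occurrences * 10
            - 2 * chain_words
            - 5 * common_phrases
            - 20 * biblical_citations)


# A 6-word phrase that occurs 50 times on average scores far below zero.
common = link_score(6, 50, 0, 0, 0)

# A 20-word passage found about twice, with three chain words and one
# common phrase: 140 - 20 - 6 - 5.
rare = link_score(20, 2, 3, 1, 0)

# Only non-negative scores are kept.
kept = [s for s in (common, rare) if s >= 0]
```

The first call evaluates to -500 and is dropped; the second evaluates to 109 and is kept.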
The Sefaria database already contains some links from the original Masoret HaShas that were added manually, mostly in Sukkah. In order to test the accuracy of the algorithm, the results can be compared to these existing Masoret HaShas links.
The links are pulled from the database, and then checked against the results generated by the script. For every Ref, the algorithm checks the results for any links within two lines of the Ref. In addition, because multi-Daf results are stored in the dictionary by the Daf that they start on, the previous Daf is also checked.
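The comparison can be sketched as follows. The data layout is an assumption: results are keyed by the Daf they start on, and each entry records its starting line and whether it continues onto the next Daf. The names `previous_daf` and `is_found` are hypothetical, and the sketch assumes nothing precedes 2a in a tractate.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: Daf -> list of (start_line, continues_onto_next_daf)
Results = Dict[str, List[Tuple[int, bool]]]


def previous_daf(daf: str) -> Optional[str]:
    """'Sukkah 12a' -> 'Sukkah 11b'; 'Sukkah 12b' -> 'Sukkah 12a'."""
    m = re.fullmatch(r"(.+) (\d+)([ab])", daf)
    if m is None:
        return None
    tractate, num, side = m.group(1), int(m.group(2)), m.group(3)
    if side == "b":
        return f"{tractate} {num}a"
    if num <= 2:  # assumption: nothing before 2a
        return None
    return f"{tractate} {num - 1}b"


def is_found(ref: str, results: Results, window: int = 2) -> bool:
    """True if a generated result lies within `window` lines of `ref`,
    or if a multi-Daf result starting on the previous Daf runs onto it."""
    daf, line_text = ref.split(":")
    line = int(line_text)
    if any(abs(start - line) <= window for start, _ in results.get(daf, [])):
        return True
    prev = previous_daf(daf)
    return prev is not None and any(spills for _, spills in results.get(prev, []))


# Made-up results for illustration only.
results = {"Sukkah 12a": [(14, False)], "Sukkah 20b": [(30, True)]}
near = is_found("Sukkah 12a:16", results)
far = is_found("Sukkah 12a:17", results)
spilled = is_found("Sukkah 21a:1", results)
```

Here `near` is true because line 16 is within two lines of 14, `far` is false, and `spilled` is true because a result starting on Sukkah 20b continues onto 21a.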