Text Structure Refactor

Lev Eliezer Israel edited this page Mar 29, 2015 · 45 revisions

This is an old development document and does not reflect what was actually developed.

Please refer to Index Records for Simple & Complex Texts.

Currently, all texts are assumed to have a Book-Chapter-Verse kind of structure, and are stored as Jagged Arrays - arrays of arrays.

The goal of the text structure refactor is to support more complex text formats, including introductions, multiple sections with differing structures, overlapping reference schemas that differ from the storage format, and named sections (intrinsic or extrinsic).

Use cases

Recognize complete Ref in a text

The function get_refs_in_string(st) returns a list of valid string refs found within a text.
It currently works by matching the text against one of two regular expressions, English or Hebrew, and every reference it catches is matched by one of them. After this change, we will support a wider range of references. Rather than continuing to catch all potential references with one regex, we will search for titles first, and then check each matching title as a reference against the regexes defined in the Index Node Types.

Instantiate Ref object from string

The main work of instantiating a Ref object from a string is identifying the appropriate Index record and the address or range within it. Put another way, it is the process of translating a reference name into an address.

The first pass is to match any 'map' names, and translate them to storage-format references.

For storage-format references, we keep a dictionary mapping all possible titles to the Index nodes that they reference (see getTitleNodeDict()). To process a text ref, we will match the maximal title string within it, and then get the regex from the matched node to parse the rest of the ref.
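The maximal-title step described above can be sketched roughly as follows. This is only an illustration: TITLE_NODE_REGEX is a hypothetical stand-in for getTitleNodeDict(), and the address patterns in it are assumed rather than taken from any real node.

```python
import re

# Hypothetical stand-in for getTitleNodeDict(): title -> regex source for the
# address part. Real nodes would supply this pattern themselves.
TITLE_NODE_REGEX = {
    "Job": r"(\d+)(?::(\d+))?",
    "Mishna Torah": r"(\d+)(?::(\d+))?",
    "Mishna Torah, Repentance": r"(\d+)(?::(\d+))?",
}


def match_maximal_title(ref):
    """Return the longest known title that is a prefix of ref, or None."""
    candidates = [t for t in TITLE_NODE_REGEX if ref.startswith(t)]
    return max(candidates, key=len) if candidates else None


def parse_ref(ref):
    """Split a string ref into (title, [address numbers]), or return None."""
    title = match_maximal_title(ref)
    if title is None:
        return None
    # Escape the title, then let the matched node's regex parse the rest.
    pattern = re.compile(re.escape(title) + r"\s+" + TITLE_NODE_REGEX[title] + r"$")
    m = pattern.match(ref)
    if m is None:
        return None
    return title, [int(g) for g in m.groups() if g is not None]
```

With this table, "Mishna Torah, Repentance 3:4" resolves against the longer title rather than against "Mishna Torah".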

Autocomplete

As a person is typing, offer suggestions for completion of the reference. It looks like we can use the standard jQuery UI Autocomplete widget (http://api.jqueryui.com/autocomplete/) with the results of full_title_list() as the source. If the source turns out to be too large to be workable, we can implement autocomplete based on a nested dictionary of names, which will give us quicker lookups.
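A minimal sketch of the nested-dictionary fallback, assuming a character-keyed structure. The function names here are hypothetical and the empty-string terminal key is just one possible convention.

```python
def build_title_trie(titles):
    """Build a nested dictionary keyed by character.

    The empty-string key marks the end of a complete title and stores the
    title in its original casing.
    """
    root = {}
    for title in titles:
        node = root
        for ch in title.lower():
            node = node.setdefault(ch, {})
        node[""] = title
    return root


def _collect(node, out, limit):
    """Depth-first walk in sorted key order, stopping once limit is reached."""
    for key in sorted(node):
        if len(out) >= limit:
            return
        if key == "":
            out.append(node[key])
        else:
            _collect(node[key], out, limit)


def complete(trie, prefix, limit=10):
    """Return up to limit titles that start with prefix (case-insensitive)."""
    node = trie
    for ch in prefix.lower():
        if ch not in node:
            return []
        node = node[ch]
    results = []
    _collect(node, results, limit)
    return results
```

Lookup cost depends on the length of the prefix and the number of results returned, not on the total number of titles.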

Object Model

Rough Code

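import re

def get_refs_in_string(st):
    """ Returns a list of the valid string refs found in st """
    refs = []
    lang = 'he' if is_hebrew(st) else 'en'
    # Find candidate titles first, then check each one as a full reference
    for match in all_titles_regex(lang).finditer(st):
        title_re = regex_for_title(match.group(), lang)
        ref_match = title_re.match(st[match.start():])
        if ref_match:
            refs.append(ref_match.group())
    return refs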

def all_titles_regex(lang):
    escaped = map(re.escape, full_title_list(lang))
    combined = '|'.join(sorted(escaped, key=len, reverse=True)) #Match longer titles first
    return re.compile(combined)

def full_title_list(lang):
    """ Returns a list of strings of all possible titles, including maps """
    titles = list(getTitleNodeDict(lang).keys())
    titles.extend(getMapDict().keys())  # extend, so map names are added one by one
    return titles

def getMapDict():
    """ Returns a dictionary of maps - {from: to} """
    maps = {}
    for i in getIndexForest():
        for map_node in i.getMapNodes():  # both simple maps & those derived from term schemes
            # 'from' is a reserved word in Python, so read the fields by key
            maps[map_node["from"]] = map_node["to"]
    return maps

def getIndexForest(titleBased = False):
    """
    Returns a list of nodes.
    :param titleBased: If true, texts with presentation 'alone' are passed as root level nodes
    """
    indexTrees = []
    for i in IndexSet():
        ...

def getTitleNodeDict(lang):
    """
    Returns a dictionary of string titles and the nodes that they point to.
    This does not include any map names. 
    """
    titleMap = {}
    trees = getIndexForest(titleBased=True)
    for tree in trees:
        titleMap.update(_branchTitleNodeMap(tree, lang))
    return titleMap

def _branchTitleNodeMap(node, lang, baseList=None):
    """
    :param baseList: list of starting strings that lead to this node
    """
    titleMap = {}
    thisnode = node

    if node.hasTitleScheme():
        thisNodeTitles = node.getSchemeTitles(lang)
    else:
        thisNodeTitles = [title["text"] for title in node.titles
                          if title["lang"] == lang and title.get("presentation") != "alone"]

    if baseList:
        nodeTitleList = [baseName + " " + title for baseName in baseList for title in thisNodeTitles]
    else:
        # Root-level node: there are no earlier titles to prepend
        nodeTitleList = thisNodeTitles

    if node.hasChildren():
        for child in node.getChildren():
            if child.isDefault():
                thisnode = child
            if not child.isOnlyRootTitle(lang):
                titleMap.update(_branchTitleNodeMap(child, lang, nodeTitleList))

    for title in nodeTitleList:
        titleMap[title] = thisnode

    return titleMap


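def regex_for_title(title, lang):
    '''
    Return a compiled, beginning-anchored regular expression for a full citation match of this title
    '''
    node = getTitleNodeDict(lang)[title]
    # Escape the title so that any punctuation in it is matched literally
    return re.compile(re.escape(title) + " " + node.getRegex())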

Storage Format of new Index records

Text structures are trees made up of nodes. A node can be a structure node or a content node, but not both.

Index Record and Root Node

The index record has the following attributes:

  • "categories"
  • "schema"
  • "order"
  • "mapSchemes"
  • "maps"
  • "title" - This legacy field is the PK of the Index record. Its value should be identical to the 'key' field of the root node.

map nodes:

{
   "from": <string>,
   "to": <ref>,
   "map_scheme": <optional map_scheme>,
   "order": <optional integer>
}

Structure Node

A structure node has one required member:

  • "nodes": An ordered list of dictionaries: the nodes under this node.

    {
        "nodes" : [
            { <structure or content node...> },
            { <structure or content node...> }
        ]
    }
    

Content Node

  • "nodeType": required. Specifies the structure of the document from this point on. Corresponds to a related class in the Python code.
  • "nodeParameters": required with certain "nodeType" arguments. If present, it is a dictionary of keyword arguments that further specify the structure of the document from this point on.

Simple string example:

 {
    "nodeType": "StringNode"
 }

Jagged Array example:

 {
    "nodeType": "JaggedArrayNode",
    "nodeParameters": {
        "depth": 2,
        "addressTypes": ["Integer","Integer"],
        "sectionNames": ["Chapter","Verse"],
        "lengths": [12, 122]
    }
 }
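As a rough illustration of how these parameters constrain one another, here is a small consistency check. The function name is hypothetical, and the rule that "lengths" has one entry per level is an assumption drawn from the examples in this document.

```python
def validate_jagged_array_params(params):
    """Return a list of problems found in a JaggedArrayNode's nodeParameters."""
    problems = []
    depth = params.get("depth")
    if not isinstance(depth, int) or depth < 1:
        # Without a usable depth the other checks are meaningless.
        return ["depth must be a positive integer"]
    # One address type and one section name are expected per level.
    for name in ("addressTypes", "sectionNames"):
        value = params.get(name)
        if not isinstance(value, list) or len(value) != depth:
            problems.append("%s must be a list of length %d" % (name, depth))
    # Assumption: "lengths" is optional, with one entry per level when present.
    lengths = params.get("lengths")
    if lengths is not None and len(lengths) != depth:
        problems.append("lengths must be a list of length %d" % depth)
    return problems
```

The Jagged Array example above passes this check with an empty list of problems.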

Every Node

  • Every node has a 'key' field. The 'key' field of the root node has the same value as the 'title' field of its Index record. The 'key' field is what the version document uses as the key to the node.
  • Every node has either explicit titles, a reference to a term node, or is marked 'default': True.
  • Only one content node among siblings may be marked as default. If the title points to the parent structure node, reference interpretation continues with the default. (E.g. a reference "Mishnah Torah, Foundations of the Law 5:7" gets mapped to ["Mishnah Torah"]["Foundations of the Law"]["Laws"][5][7]. The ["Laws"] is implied but not explicit. It gets added in to distinguish it from its sibling node ["introduction"].)
  • The string keys of nodes are not the titles; they are keys for the storage format. Only explicit or shared "titles" nodes are used as titles.
  • Every "titles" list must have one and only one title marked as "primary" for each language present.
  • There must be an English title.
  • The "presentation" field of a title indicates how it combines with earlier titles. Possible values:
    • "combined" - in referencing this node, the titles of earlier nodes are prepended to this one (default)
    • "alone" - this node is referenced by this title alone
    • "both" - this node is addressable in both a combined and an alone form.
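The default-node rule above can be sketched as a walk over the schema tree. SCHEMA is a hypothetical miniature record, and find_child and resolve_key_path are illustrative names; the sketch follows a single default child and does not attempt chained defaults.

```python
# Hypothetical miniature schema, shaped like the Index records in this document.
SCHEMA = {
    "key": "Mishna Torah",
    "nodes": [
        {
            "key": "Foundations of the Torah",
            "titles": [{"lang": "en", "text": "Foundations of the Torah", "primary": True}],
            "nodes": [
                {
                    "key": "Introduction",
                    "titles": [{"lang": "en", "text": "Introduction", "primary": True}],
                    "nodeType": "StringNode",
                },
                {"key": "Laws", "default": True, "nodeType": "JaggedArrayNode"},
            ],
        }
    ],
}


def find_child(node, title):
    """Return the child of node that carries the given title, or None."""
    for child in node.get("nodes", []):
        if any(t["text"] == title for t in child.get("titles", [])):
            return child
    return None


def resolve_key_path(root, titles):
    """Translate a list of titles under root into a list of storage keys."""
    node = root
    path = [root["key"]]
    for title in titles:
        node = find_child(node, title)
        if node is None:
            raise KeyError(title)
        path.append(node["key"])
    # If the titles ended on a structure node, continue with its default child.
    for child in node.get("nodes", []):
        if child.get("default"):
            path.append(child["key"])
            break
    return path
```

Resolving ["Foundations of the Torah"] adds the implied "Laws" key, while naming "Introduction" explicitly leaves the path as given.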

explicit titles

 "titles" : [
    {
        "lang": "en",
        "text": "Eight Chapters",
        "primary": true,
        "presentation": "alone"
    },
    {
        "lang": "he",
        "text": "שמנה פרקים",
        "primary": true,
        "presentation": "alone"
    },
    {
        "lang": "en",
        "text": "Introduction"
    },
    {
        "lang": "he",
        "text": "הקדמה"
    }
 ]

shared title

 "sharedTitle" : "Noah"

Shared Titles

Record: TermScheme

 {
  "name": "Parsha"
 }

Record: Term

 {
  "name": "Noah", 
  "scheme": "Parsha",
  "order": 2,
  "ref": "Genesis 6:9 - 11:32",
  "titles" : [
    { 
      "lang": "en",
      "text": "Noah",
      "primary": true
    },
    {
      "lang": "he", 
      "text": "נח",
      "primary": true
    }
  ]
 }
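One way a node's titles might be looked up, whether they are explicit or shared, is sketched below. TERMS is a hypothetical in-memory stand-in for the Term records, and node_titles is an illustrative helper name.

```python
# Hypothetical in-memory stand-in for the Term collection, keyed by term name.
TERMS = {
    "Noah": {
        "name": "Noah",
        "scheme": "Parsha",
        "titles": [
            {"lang": "en", "text": "Noah", "primary": True},
            {"lang": "he", "text": "נח", "primary": True},
        ],
    }
}


def node_titles(node, lang, terms=TERMS):
    """Return the title strings of a node in one language.

    A node with "sharedTitle" borrows its titles from the named Term;
    otherwise its own explicit "titles" list is used.
    """
    if "sharedTitle" in node:
        titles = terms[node["sharedTitle"]]["titles"]
    else:
        titles = node.get("titles", [])
    return [t["text"] for t in titles if t["lang"] == lang]
```

A default node with neither field simply yields an empty list.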

Python Classes used to represent structure trees

Node types

 class Index_Node(object):
      def getChildren(self): pass
      def hasChildren(self): pass
      def getParent(self): pass
      def getPhysicalPath(self): pass
      def isDefault(self): pass
      def isOnlyRootTitle(self, lang): pass  # Does this node only have 'alone' representations?

 class Index_Structure_Node(Index_Node):
      pass

 class Index_Content_Node(Index_Node):
      def getRegex(self): pass

 class JaggedArrayNode(Index_Content_Node):
      def __init__(self, depth, address_types, section_names):
           """
           depth: Integer depth of this JaggedArray
           address_types: A list of length (depth), with string values indicating class names for address types for each level
           section_names: A list of length (depth), with string values of section names for each level
           """
           pass

 class StringNode(Index_Content_Node):
      pass

Addressing Schemes

 class Address_Type(object):
      def toIndex(self):
           pass

 class Address_Integer(Address_Type):
      pass

 class Address_BavliDafAmud(Address_Type):
      pass

 class Address_YerushalmiDafAmud(Address_Type):
      pass
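To make the role of toIndex() concrete, here is a hedged sketch of two address types. The class names follow the outline above, but the string argument to toIndex and the zero-based numbering (with "1a" as index 0) are assumptions for illustration, not a confirmed convention.

```python
import re


class Address_Type(object):
    def toIndex(self, s):
        """Convert an address string to a zero-based array index."""
        raise NotImplementedError


class Address_Integer(Address_Type):
    def toIndex(self, s):
        # Assumption: addresses are 1-based, storage arrays are 0-based.
        return int(s) - 1


class Address_BavliDafAmud(Address_Type):
    _pattern = re.compile(r"^(\d+)([ab])$")

    def toIndex(self, s):
        m = self._pattern.match(s)
        if m is None:
            raise ValueError("not a daf/amud address: %r" % s)
        daf, amud = int(m.group(1)), m.group(2)
        # Assumption: each daf occupies two slots, side a and then side b.
        return (daf - 1) * 2 + (0 if amud == "a" else 1)
```

Under these assumptions "2a" maps to index 2 and "2b" to index 3.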

Structure of Version documents

Version documents are currently stored in the 'texts' collection, and the textual content is stored as a jagged array in the attribute 'chapter'.

The collection will be moved from 'texts' to 'version', and the attribute 'chapter' renamed to 'content'. The structure of the document in the 'content' attribute is defined by the Index record.
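The attribute rename might look like this for a single document held as a dictionary. The function name is hypothetical, and moving the document between collections is left out.

```python
def migrate_version_document(doc):
    """Return a copy of a version document with 'chapter' renamed to 'content'."""
    new_doc = dict(doc)  # shallow copy, so the input document is not modified
    if "chapter" in new_doc:
        new_doc["content"] = new_doc.pop("chapter")
    return new_doc
```

Documents that have already been migrated pass through unchanged.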

Odds and Ends

Questions

  • "title" is now used as a primary key of Index.  With this refactor, what happens to the old "title" node?
    • Keep the title node on the Index record, but use it only as a key. Copy the current title to the 'default' title of the root structure node and to its 'key' field.
  • What of "Commentary" Index records?
    • Leave as is
  • For the sake of presentation on the TOC - To what depth do we present? Do we stop at the root? Different for different books?
  • Structure nodes don't have names yet.  E.g. How do I know that Sefer Mada is a book, or section, or whatever?
  • All the books in Rambam have the same structure.  Can we define that once and reference it?
  • Do we support references to intermediate structure nodes? Should it be a matter of a boolean flag? What about references to nodes that have a default child?
  • Delimiters between nodes. Can we define just one, e.g. space? Probably not. Proliferation is a pain.
  • Chained defaults? Theoretically possible, but they may lead to odd edge cases.
  • Do nodes that have title 'alone' get shown on TOC? If so - they need categories.

What has to change on the client level:

  • In TOC and texts page - counts api
  • Next / Prev
  • Change building of TOC and/or make a book TOC page
  • Text browser on source sheet & discussion page

Examples

Job

Complete Old Index Record

 {
      "sectionNames" : [ 
           "Chapter", 
           "Verse"
      ],
      "title" : "Job",
      "lengths" : [ 
           42, 
           1070
      ],
      "heTitleVariants" : [ 
           "איוב"
      ],
      "heTitle" : "איוב",
      "maps" : [],
      "length" : 42,
      "titleVariants" : [ 
           "Job", 
           "Iyov"
      ],
      "order" : [ 
           29, 
           3
      ],
      "categories" : [ 
           "Tanach", 
           "Writings"
      ]
 }

Complete New Index Record

 {
   "order" : [ 
        29, 
        3
   ],
   "categories" : [ 
        "Tanach", 
        "Writings"
   ],
   "schema": {
      "titles" : [
        { 
          "lang": "en",
          "text": "Job",
          "primary": true
        },
        { 
          "lang": "en",
          "text": "Iyov"
        },          
        {
          "lang": "he", 
          "text": "איוב",
          "primary": true
        }
      ],
      "nodeType": "JaggedArrayNode",
      "nodeParameters": {
             "depth": 2, 
             "addressTypes": ["Integer","Integer"], 
             "sectionNames": ["Chapter","Verse"],
             "lengths": [42, 1070]
      }
    }
 }
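The mapping from the old record to the new one can be sketched for simple texts as follows. The function name is hypothetical, and two assumptions are baked in: every level uses the "Integer" address type, and the root node's "key" is copied from the legacy "title", as described under Every Node.

```python
def convert_simple_index(old):
    """Build a new-style Index record from an old simple-text record."""
    titles = []
    for lang, primary_key, variants_key in (
        ("en", "title", "titleVariants"),
        ("he", "heTitle", "heTitleVariants"),
    ):
        primary = old.get(primary_key)
        seen = set()
        if primary:
            titles.append({"lang": lang, "text": primary, "primary": True})
            seen.add(primary)
        # Variants that repeat the primary title are not listed twice.
        for variant in old.get(variants_key, []):
            if variant not in seen:
                titles.append({"lang": lang, "text": variant})
                seen.add(variant)

    section_names = list(old["sectionNames"])
    params = {
        "depth": len(section_names),
        # Assumption: simple texts are addressed by integers at every level.
        "addressTypes": ["Integer"] * len(section_names),
        "sectionNames": section_names,
    }
    if "lengths" in old:
        params["lengths"] = list(old["lengths"])

    new = {"title": old["title"]}
    for key in ("order", "categories"):
        if key in old:
            new[key] = old[key]
    new["schema"] = {
        "key": old["title"],
        "titles": titles,
        "nodeType": "JaggedArrayNode",
        "nodeParameters": params,
    }
    return new
```

Applied to the old Job record, this yields the same titles, order, categories and nodeParameters as the new record shown above. Talmud-style texts would need a different address type.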

Talmud Shabbat - nodeType and nodeParameters

 {
       "nodeType": "JaggedArrayNode",
       "nodeParameters": {
             "depth": 2, 
             "addressTypes": ["BavliDafAmud","Integer"], 
             "sectionNames": ["Page","Line"],
             "lengths": [82, 5023]
       }
 }

Jerusalem Talmud Shabbat - nodeType and nodeParameters

 {
       "nodeType": "JaggedArrayNode",
       "nodeParameters": {
             "depth": 2, 
             "addressTypes": ["YerushalmiDafAmud","Integer"], 
             "sectionNames": ["Page","Line"]
       }
 }

Mishna Torah - schema nodes

Index Record

{
    "titles": [
        {
            "lang": "en",
            "text": "Mishna Torah",
            "primary": True
        },
        {
            "lang": "en",
            "text": "Rambam"
        },
        {
            "lang": "he",
            "text": "משנה תורה",
            "primary": True
        }
    ],
    "nodes": [
        {
            "key": "Introduction",
            "titles": [
                {
                    "lang": "en",
                    "text": "Introduction",
                    "primary": True
                },
                {
                    "lang": "he",
                    "text": "הקדמה",
                    "primary": True
                }
            ],
            "nodes": [
                {
                    "key": "Transmission",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "Transmission",
                        "primary": True
                        }
                    ],
                    "nodeType": "JaggedArrayNode",
                    "nodeParameters": {
                        "depth": 1,
                        "addressTypes": ["Integer"],
                        "sectionNames": ["Paragraph"]
                    }
                },
                {
                    "key": "List of Positive Mitzvot",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "List of Positive Mitzvot",
                        "primary": True
                        }
                    ],
                    "nodeType": "JaggedArrayNode",
                    "nodeParameters": {
                        "depth": 1,
                        "addressTypes": ["Integer"],
                        "sectionNames": ["Mitzvah"]
                    }
                },
                {
                    "key": "List of Negative Mitzvot",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "List of Negative Mitzvot",
                        "primary": True
                        }
                    ],
                    "nodeType": "JaggedArrayNode",
                    "nodeParameters": {
                        "depth": 1,
                        "addressTypes": ["Integer"],
                        "sectionNames": ["Mitzvah"]
                    }
                }
            ]

        },
        {
            "key": "Sefer Mada",
            "titles": [
                {
                "lang": "en",
                "text": "Sefer Mada",
                "primary": True
                }
            ],
            "nodes": [
                {
                    "key": "Foundations of the Torah",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "Foundations of the Torah",
                        "primary": True
                        }
                    ],
                    "nodes": [
                        {
                            "key": "Introduction",
                            "titles": [
                                {
                                "lang": "en",
                                "text": "Introduction",
                                "primary": True
                                }
                            ],
                            "nodeType": "StringNode"
                        },
                        {
                            "key": "Laws",
                            "default": True,
                            "nodeType": "JaggedArrayNode",
                            "nodeParameters": {
                                "depth": 2,
                                "addressTypes": ["Integer", "Integer"],
                                "sectionNames": ["Chapter", "Law"]
                            }
                        }
                    ]
                },
                {
                    "key": "Human Dispositions",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "Human Dispositions",
                        "primary": True
                        }
                    ],
                    "nodes": [
                        {
                            "key": "Introduction",
                            "titles": [
                                {
                                "lang": "en",
                                "text": "Introduction",
                                "primary": True
                                }
                            ],
                            "nodeType": "StringNode"
                        },
                        {
                            "key": "Laws",
                            "default": True,
                            "nodeType": "JaggedArrayNode",
                            "nodeParameters": {
                                "depth": 2,
                                "addressTypes": ["Integer", "Integer"],
                                "sectionNames": ["Chapter", "Law"]
                            }
                        }
                    ]
                },
                {
                    "key": "Torah Study",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "Torah Study",
                        "primary": True
                        }
                    ],
                    "nodes": [
                        {
                            "key": "Introduction",
                            "titles": [
                                {
                                "lang": "en",
                                "text": "Introduction",
                                "primary": True
                                }
                            ],
                            "nodeType": "StringNode"
                        },
                        {
                            "key": "Laws",
                            "default": True,
                            "nodeType": "JaggedArrayNode",
                            "nodeParameters": {
                                "depth": 2,
                                "addressTypes": ["Integer", "Integer"],
                                "sectionNames": ["Chapter", "Law"]
                            }
                        }
                    ]
                },
                {
                    "key": "Foreign Worship and Customs of the Nations",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "Foreign Worship and Customs of the Nations",
                        "primary": True
                        }
                    ],
                    "nodes": [
                        {
                            "key": "Introduction",
                            "titles": [
                                {
                                "lang": "en",
                                "text": "Introduction",
                                "primary": True
                                }
                            ],
                            "nodeType": "StringNode"
                        },
                        {
                            "key": "Laws",
                            "default": True,
                            "nodeType": "JaggedArrayNode",
                            "nodeParameters": {
                                "depth": 2,
                                "addressTypes": ["Integer", "Integer"],
                                "sectionNames": ["Chapter", "Law"]
                            }
                        }
                    ]
                },
                {
                    "key": "Repentance",
                    "titles": [
                        {
                        "lang": "en",
                        "text": "Repentance",
                        "primary": True
                        }
                    ],
                    "nodes": [
                        {
                            "key": "Introduction",
                            "titles": [
                                {
                                "lang": "en",
                                "text": "Introduction",
                                "primary": True
                                }
                            ],
                            "nodeType": "StringNode"
                        },
                        {
                            "key": "Laws",
                            "default": True,
                            "nodeType": "JaggedArrayNode",
                            "nodeParameters": {
                                "depth": 2,
                                "addressTypes": ["Integer", "Integer"],
                                "sectionNames": ["Chapter", "Law"]
                            }
                        }
                    ]
                }
            ]
        }
    ]
}

Version Record text content

 {
       "Introduction": {
             "Transmission": ["Foo", "Bar", "Flan"],
             "List of Positive Mitzvot": ["Foo", "Bar", "Flan"],
             "List of Negative Mitzvot": ["Foo", "Bar", "Flan"],
       },
       "Sefer Mada": {
             "Foundations of the Torah": {
                   "Introduction": "foo",
                   "Laws": [["Foo", "Bar"],["Blam", "Flam"]]
             },
             "Human Dispositions": {
                   "Introduction":  "foo",
                   "Laws": [["Foo", "Bar"],["Blam", "Flam"]]
             },
             "Torah Study": {
                   "Introduction":  "foo",
                   "Laws": [["Foo", "Bar"],["Blam", "Flam"]]
             },
             "Foreign Worship and Customs of the Nations": {
                   "Introduction":  "foo",
                   "Laws": [["Foo", "Bar"],["Blam", "Flam"]]
             },
             "Repentance": {
                   "Introduction":  "foo",
                   "Laws": [["Foo", "Bar"],["Blam", "Flam"]]
             },
       }
 }
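Reading text out of such a version record could then look roughly like this. The helper name is hypothetical; it assumes that node keys are followed first and that any remaining address numbers are 1-based indexes into the jagged array.

```python
def get_text(content, key_path, address=()):
    """Follow node keys into the content, then 1-based addresses into the array."""
    node = content
    for key in key_path:
        node = node[key]
    for number in address:
        node = node[number - 1]  # addresses are assumed 1-based, lists are 0-based
    return node


# A cut-down copy of the sample version content above.
SAMPLE_CONTENT = {
    "Introduction": {
        "Transmission": ["Foo", "Bar", "Flan"],
    },
    "Sefer Mada": {
        "Foundations of the Torah": {
            "Introduction": "foo",
            "Laws": [["Foo", "Bar"], ["Blam", "Flam"]],
        },
    },
}
```

For example, the path ["Sefer Mada", "Foundations of the Torah", "Laws"] with address (2, 1) reaches "Blam" in the sample content.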

Reference

Structure of current Index records

Required Attributes

  • title
  • titleVariants
  • categories

Optional Attributes

  • sectionNames # required for simple texts, not for commentary
  • heTitle
  • heTitleVariants
  • maps
  • order
  • length
  • lengths
  • transliteratedTitle