Index Records for Simple & Complex Texts

JonMosenkis edited this page Jul 27, 2016 · 34 revisions

Overview

For the first couple of years, all texts in Sefaria had a Book-Chapter-Verse kind of structure, and were stored as Jagged Arrays - arrays of arrays. At the beginning of 2015, we completed a refactor of our text structures, allowing us to support more complex text formats, including texts with introductions, multiple sections with differing structures, overlapping reference schemas that differ from the storage format, named sections, and sections that are referred to both by name and by number.

It makes index records quite a bit more complicated.

API for complex Index records

The API endpoint for GETing and POSTing index records in the new format is /api/v2/raw/index/:title.

It is also possible to GET an index from /api/v2/index/:title - records returned from that endpoint will be filled out with additional data, and are not fit for POSTing.

See API Documentation for details.
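
As a rough illustration of the two endpoints, the sketch below builds the URLs and fetches a record with Python's standard library. The host name in BASE_URL and the helper names index_url and get_index are assumptions made for this example; only the endpoint paths come from the text above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed host, for illustration only; substitute your own server.
BASE_URL = "https://www.sefaria.org"


def index_url(title, raw=True):
    """Build the v2 index URL for a title.

    raw=True  -> /api/v2/raw/index/:title (fit for GET and POST)
    raw=False -> /api/v2/index/:title     (expanded record, GET only)
    """
    prefix = "/api/v2/raw/index/" if raw else "/api/v2/index/"
    return BASE_URL + prefix + quote(title)


def get_index(title, raw=True):
    """GET an index record and decode its JSON body (makes a network call)."""
    with urlopen(index_url(title, raw)) as response:
        return json.loads(response.read().decode("utf-8"))


print(index_url("Genesis"))
print(index_url("Example Book", raw=False))
```

A record that is going to be edited and POSTed back should be fetched with raw=True, since the expanded form is not fit for POSTing.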

Structure of an Index record

Required Attributes

  • schema (dict) - Storage format and primary addressing scheme of the text. See Index Schemas.
  • categories (array of strings) - List of categories that the text is kept in, starting with the top level category
  • title (string) - Identifying string for the Index record. Its value must be identical to the 'key' field of the root schema node and the primary title of the root schema node.

Optional Attributes

  • alt_structs (dict) - One or more alternate addressing schemes for this text. See Alternate Structures.
  • order (array of integers) - Order that this text appears in, within its category
  • authors (array of strings) - Keys to the Persons collection
  • enDesc (string) - English language description of the text
  • heDesc (string) - Hebrew language description of the text
  • pubDate (int) - Publication date (Gregorian)
  • compDate (int) - Composition/Compilation date (Gregorian)
  • errorMargin (int) - Margin of error, +/- this many years, for the compilation date. It turns the compilation date into a range: for example, a compDate of 300 with an errorMargin of 50 is understood to mean 250-350.
  • compPlace (string) - Place of composition - key to the Place collection
  • pubPlace (string) - Place of publication - Key to the Place collection
  • era (string) - Key to the TimePeriod collection - usually one of Gaonim, Rishonim, Achronim, or Contemporary
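
The attribute rules above can be sketched as two small checks on an Index record held as a Python dict. The helper names and the "Other" category are hypothetical, and treating a missing errorMargin as 0 is an assumption, not documented behavior.

```python
REQUIRED_ATTRIBUTES = ("schema", "categories", "title")


def missing_required(index_record):
    """Return the required attribute names absent from an Index record dict."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in index_record]


def comp_date_range(index_record):
    """Return (earliest, latest) composition year, or None if compDate is unset.

    Assumption: a missing errorMargin is treated as 0.
    """
    if "compDate" not in index_record:
        return None
    margin = index_record.get("errorMargin", 0)
    return (index_record["compDate"] - margin, index_record["compDate"] + margin)


# Hypothetical, deliberately incomplete record (no schema).
record = {
    "title": "Example Book",
    "categories": ["Other"],
    "compDate": 300,
    "errorMargin": 50,
}
print(missing_required(record))  # ['schema']
print(comp_date_range(record))   # (250, 350)
```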

Index Schemas

Index schemas define the structure of texts within the database. These schemas are trees made up of nodes. In the Python code, these nodes are instances of the class sefaria.model.schema.SchemaNode and its subclasses. The trees are stored in the database and transmitted through the API in a serialized form.

Serialized form

The simplest Index records in the system have schema trees with just one node - a content node. Specifically, they have a JaggedArrayNode, which is a type of content node.

Simple Schemas (Single Node Trees)

Content nodes describe a repeating structure that the text is stored in, from that level of the tree on. Currently, all content nodes are JaggedArrayNodes - they describe a jagged array (array of arrays) structure.

An example of a simple schema is the book of Genesis. The text for Genesis is stored in a depth 2 jagged array - An array with 50 elements, each representing a chapter, and each element being an array containing strings, which are verses.

The content for this book looks like this:

[["Verse 1:1", "Verse 1:2", ...], ["Verse 2:1", "Verse 2:2", ...], [...]]
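
To make the jagged array layout concrete, here is a minimal sketch of looking up a chapter or a verse in such a structure. The function name get_segment is hypothetical, and the sketch assumes that section numbers in references count from 1 (as in "Genesis 1:1") while Python lists count from 0.

```python
def get_segment(jagged, *address):
    """Follow 1-based section numbers down a jagged array."""
    node = jagged
    for number in address:
        node = node[number - 1]  # references count from 1, lists from 0
    return node


# A tiny stand-in for the depth 2 Genesis array (chapters of verses).
genesis = [
    ["Verse 1:1", "Verse 1:2", "Verse 1:3"],
    ["Verse 2:1", "Verse 2:2"],
]
print(get_segment(genesis, 2, 1))  # Verse 2:1
print(get_segment(genesis, 1))     # ['Verse 1:1', 'Verse 1:2', 'Verse 1:3']
```

The array is "jagged" because the inner arrays may have different lengths, as the two chapters above do.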

The Index record describing the book looks like this (we'll discuss the contents below). Note that this is a complete Index record, including its schema node; other examples below may show only the schema node.

{
    "title" : "Genesis",
    "maps" : [],
    "order" : [ 
        1, 
        1
    ],
    "categories" : [ 
        "Tanach", 
        "Torah"
    ],
    "schema" : {
        "titles" : [ 
            {
                "lang" : "en",
                "text" : "Genesis",
                "primary" : True
            }, 
            {
                "lang" : "en",
                "text" : "Bereishit"
            }, 
            {
                "lang" : "he",
                "text" : "בראשית",
                "primary" : True
            }
        ],
        "nodeType" : "JaggedArrayNode",
        "lengths" : [ 
            50, 
            1533
        ],
        "depth" : 2,
        "sectionNames" : [ 
            "Chapter", 
            "Verse"
        ],
        "addressTypes" : [ 
           "Integer", 
            "Integer"
        ],
        "key" : "Genesis"
    }
}

Properties of the schema:

  • key: a text field. For a single node schema like this, the value of "key" is the same as the value of the "title" field on the Index record.
  • nodeType: This corresponds to a class in the Python code. The value above, JaggedArrayNode, is currently the only one used for single-node schemas.
  • titles: is an array of dictionaries specifying titles for this node. (For the full description of how titles work, see Titles below) Each title dictionary has two required keys:
    • text: the title string
    • lang: either "en" or "he"
    • and optionally, primary: this field needs to be present and true for exactly one Hebrew and one English title.

JaggedArray nodes have these attributes as well:

  • depth: Integer: the depth of the Jagged Array
  • addressTypes: Array with depth number of values, each one indicating how that level of the jagged array is addressed. Most commonly, these values are Integer, but they could also be Talmud, or some less common values defined in sefaria.model.schema
  • sectionNames: Array with depth number of values, each one a string name for that level of the jagged array, e.g. ["Chapter","Verse"]
  • and optionally, lengths: Array with up to depth number of values, each one an integer specifying how many elements exist at that level of the jagged array.
  • toc_zoom: Integer: Sets the terminal depth for display in the table of contents. 0 will display segments (each string in the Jagged Array), 1 for sections, 2 for super-sections. If not set, toc will display the section level (or segment level for depth 1 texts).
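
The length constraints in the list above can be checked mechanically. The following is a minimal sketch over a serialized node held as a Python dict; jagged_node_problems is a hypothetical helper, not part of the Sefaria code base, and it covers only the depth-related rules.

```python
def jagged_node_problems(node):
    """Return a list of problems found in a serialized JaggedArrayNode dict."""
    problems = []
    depth = node.get("depth")
    if not isinstance(depth, int) or depth < 1:
        return ["depth must be a positive integer"]
    # addressTypes and sectionNames need one value per level.
    for field in ("addressTypes", "sectionNames"):
        if len(node.get(field, [])) != depth:
            problems.append(f"{field} must have exactly {depth} values")
    # lengths is optional and may cover only the upper levels.
    if len(node.get("lengths", [])) > depth:
        problems.append(f"lengths may have at most {depth} values")
    return problems


genesis_node = {
    "nodeType": "JaggedArrayNode",
    "key": "Genesis",
    "depth": 2,
    "lengths": [50, 1533],
    "sectionNames": ["Chapter", "Verse"],
    "addressTypes": ["Integer", "Integer"],
}
print(jagged_node_problems(genesis_node))  # []
```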

Complex Schemas (Multi-Node Trees)

Texts which are more complex than a simple Jagged Array need a complex Index schema. Complex schemas are always structured as trees made up of many nodes. Each node has a key and titles, and either has other nodes as children, or else describes Jagged Array content, as the node above does.

Let's look at a moderately complex text, and then at the Index Schema that describes it. This text has an introduction section, a main body, and a conclusion. The introduction and conclusion both have a series of paragraphs, while the main body is structured as chapter and section.

The text looks like this.

{
    "Introduction": ["Intro Paragraph 1", "Intro Paragraph 2", ...],
    "Contents": [["Chapter 1, Section 1", "Chapter 1, Section 2"],["Chapter 2, Section 1", "Chapter 2, Section 2"], ...],
    "Conclusion": ["Conclusion Paragraph 1", "Conclusion Paragraph 2", ...]
}

The schema of the Index record looks like this:

{
    "key": "Example Book",
    "titles": [
        {
            "lang": "en",
            "text": "Example Book",
            "primary": True
        },
        {
            "lang": "he",
            "text": "דוגמא",
            "primary": True
        }
    ],
    "nodes": [
        {
            "key": "Introduction",
            "titles": [
                {
                    "lang": "en",
                    "text": "Introduction",
                    "primary": True
                },
                {
                    "lang": "en",
                    "text": "Intro",
                },
                {
                    "lang": "he",
                    "text": "הקדמה",
                    "primary": True
                }
            ],
            "nodeType": "JaggedArrayNode",
            "depth": 1,
            "sectionNames": [
                "Paragraph"
            ],
            "addressTypes": [
                "Integer"
            ]
        },
        {
            "key": "Contents",
            "titles": [
                {
                    "lang": "en",
                    "text": "Contents",
                    "primary": True
                },
                {
                    "lang": "he",
                    "text": "תוכן",
                    "primary": True
                }
            ],
            "nodeType": "JaggedArrayNode",
            "depth": 2,
            "sectionNames": [
                "Chapter",
                "Section"
            ],
            "addressTypes": [
                "Integer",
                "Integer",
            ]
        },
        {
            "key": "Conclusion",
            "titles": [
                {
                    "lang": "en",
                    "text": "Conclusion",
                    "primary": True
                },
                {
                    "lang": "en",
                    "text": "Ending",
                },
                {
                    "lang": "he",
                    "text": "סיום",
                    "primary": True
                }
            ],
            "nodeType": "JaggedArrayNode",
            "depth": 1,
            "sectionNames": [
                "Paragraph"
            ],
            "addressTypes": [
                "Integer"
            ]
        }
    ]
}

The root node in the schema, besides the required key and titles attributes, has an attribute called nodes. This is a list of dictionaries, each of those dictionaries itself containing a node. Look at the keys of each of those children. They match the keys of the dictionary of the text. Each of those children nodes in the Index record describes how that section of the text is structured.
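
The correspondence between child keys and the keys of the text can be sketched as a short tree walk. The helpers content_paths and keys_match are hypothetical, and the schema below is a trimmed copy of the example with titles and address fields left out for brevity.

```python
def content_paths(node, path=()):
    """Yield the key path to every content (leaf) node in a schema tree."""
    path = path + (node["key"],)
    children = node.get("nodes")
    if not children:
        yield path
        return
    for child in children:
        yield from content_paths(child, path)


def keys_match(schema_node, text):
    """Check that a text dict has exactly the keys of the node's children."""
    return {child["key"] for child in schema_node["nodes"]} == set(text)


schema = {
    "key": "Example Book",
    "nodes": [
        {"key": "Introduction", "nodeType": "JaggedArrayNode", "depth": 1},
        {"key": "Contents", "nodeType": "JaggedArrayNode", "depth": 2},
        {"key": "Conclusion", "nodeType": "JaggedArrayNode", "depth": 1},
    ],
}
text = {
    "Introduction": ["Intro Paragraph 1", "Intro Paragraph 2"],
    "Contents": [["Chapter 1, Section 1", "Chapter 1, Section 2"]],
    "Conclusion": ["Conclusion Paragraph 1"],
}
for path in content_paths(schema):
    print(path)
print(keys_match(schema, text))  # True
```

Because content_paths recurses through nodes, the same walk works for trees of any depth.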

These trees can descend to any depth. A text like the Mishneh Torah contains many nodes: general introductions followed by 14 books, each with an introductory verse and then many sections, and then each section has chapters and laws.

Titles

Above we saw a few cases of simple node titling. For all of these, the node titles were listed explicitly on the node. That is one of three potential ways that nodes can be titled. They can also refer to a shared title (or "term"), shared by many nodes across the library, or they can be an untitled 'default' node. These options are discussed below, but first, let's look at how titles of nodes in a complex tree are put together.

How titles are put together

The full title of any node is built up from the titles of all of its ancestors, in order. Looking at our example book above, the normalized full title of the Introduction section would be "Example Book, Introduction". A reference to that part of the book would use that title. As trees get deeper and the number of alternate titles grows, and because node names can be separated by either a space or a comma and a space, the number of possible titles for a node expands combinatorially.
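
The combinatorial expansion can be seen in a small sketch that joins every title choice at each level with every separator choice. The function full_titles is hypothetical and only illustrates the counting; it is not how the library itself builds its title lookup.

```python
from itertools import product

SEPARATORS = (", ", " ")


def full_titles(title_lists):
    """Return every full title for a node.

    title_lists holds one non-empty list of same-language titles per node,
    ordered from the root down to the node being titled.
    """
    results = []
    for combo in product(*title_lists):
        # One separator choice for each gap between consecutive titles.
        for seps in product(SEPARATORS, repeat=len(combo) - 1):
            title = combo[0]
            for sep, part in zip(seps, combo[1:]):
                title += sep + part
            results.append(title)
    return results


titles = full_titles([["Example Book"], ["Introduction", "Intro"]])
for title in titles:
    print(title)
```

With one root title and two child titles this yields four strings; every extra level multiplies the count by its number of titles and by the two separators.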

Let's look at the way nodes get titles.

Explicit titles on nodes

A node can have explicit titles defined on it. That's what we see above, using the titles attribute of the node. As we said above, this attribute is a list of dictionaries, and each title dictionary has two required keys:

  • text: the title string
  • lang: either "en" or "he"

It optionally has two other keys:

  • primary: this field needs to be present and true for exactly one Hebrew and one English title. It specifies the default title used for presentation and normalization.
  • presentation: This field, if present, can have one of three values:
    • combined: in referencing this node, the titles of ancestor nodes are prepended to this one. This is the default, if no value is specified.
    • alone: this node is referenced by this title alone
    • both: this node is addressable in both the combined and the alone form.
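
The rules for explicit titles can be sketched as a check over a titles list. title_problems is a hypothetical helper written for this page; it mirrors the rules listed above and nothing more.

```python
def title_problems(titles):
    """Check a node's titles list against the rules for explicit titles."""
    problems = []
    for title in titles:
        if "text" not in title or title.get("lang") not in ("en", "he"):
            problems.append(f"bad title entry: {title!r}")
        # presentation is optional and defaults to "combined".
        if title.get("presentation", "combined") not in ("combined", "alone", "both"):
            problems.append(f"bad presentation value: {title!r}")
    for lang in ("en", "he"):
        primaries = [t for t in titles if t.get("lang") == lang and t.get("primary")]
        if len(primaries) != 1:
            problems.append(
                f"expected exactly one primary {lang} title, found {len(primaries)}"
            )
    return problems


titles = [
    {"lang": "en", "text": "Introduction", "primary": True},
    {"lang": "en", "text": "Intro"},
    {"lang": "he", "text": "הקדמה", "primary": True},
]
print(title_problems(titles))  # []
```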

Shared Titles (Terms)

Instead of listing its titles in the titles field, a node can specify the key of a shared title in the sharedTitle field. This is useful for titles that are used repeatedly, like parsha names and mesechet names. Each shared title has a collection of title dictionaries that are used on the node as if they were defined on that node. Valid keys for shared titles are defined by the Term class. A node that defines sharedTitle does not have a titles field.

An example node with a shared title:

{
    "key" : "Genesis",
    "sharedTitle" : "Genesis",
    "nodeType" : "JaggedArrayNode",
    "depth" : 2,
    "sectionNames" : [ 
        "Chapter", 
        "Verse"
    ],
    "addressTypes" : [ 
        "Integer", 
        "Integer"
    ]
}

If you are creating a new index with no titles or changing the titles within an object, the field "sharedTitle" cannot be set directly; instead, the method "add_shared_term()" must be called, as follows:

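from sefaria.model.schema import JaggedArrayNode

# `record` is assumed to be an existing parent SchemaNode.
node = JaggedArrayNode()
node.key = "Genesis"
node.add_shared_term("Genesis")  # use this instead of assigning sharedTitle
node.depth = 1
node.addressTypes = ['Integer']
node.sectionNames = ['Verse']
record.append(node)  # attach the new node as a child of the parent
record.validate()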

Default Nodes

Look at our "Example Book" Index record above. The section called 'Contents' seems to have some useless titles. To refer to the main body of the book, I would need references like "Example Book, Contents 3:5". What I really want to be able to say is "Example Book 3:5". I want anything that isn't the Introduction or Conclusion to go to the main body of the book. I can accomplish this by making that node a default node.

Some rules about default nodes:

  • must have key: "default"
  • must have default: True specified.
  • cannot have titles or sharedTitle attributes.
  • must not have another sibling that is a default node (There can be only one default node among siblings.)
  • must be a content node (e.g. JaggedArrayNode).
  • cannot have any children nodes.
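
The rules above lend themselves to a simple check over the children of a parent node. The sketch below is a hypothetical helper working on serialized dicts; it treats JaggedArrayNode as the only content node type, which matches the current state described earlier on this page.

```python
def default_node_problems(parent):
    """Check the default-node rules among the children of a parent node."""
    problems = []
    defaults = [
        child for child in parent.get("nodes", [])
        if child.get("default") or child.get("key") == "default"
    ]
    if len(defaults) > 1:
        problems.append("only one default node is allowed among siblings")
    for node in defaults:
        if node.get("key") != "default" or node.get("default") is not True:
            problems.append('a default node needs key "default" and default True')
        if "titles" in node or "sharedTitle" in node:
            problems.append("a default node cannot have titles or sharedTitle")
        if node.get("nodes") or node.get("nodeType") != "JaggedArrayNode":
            problems.append("a default node must be a childless content node")
    return problems


example_parent = {
    "key": "Example Book",
    "nodes": [
        {"key": "Introduction", "nodeType": "JaggedArrayNode", "depth": 1},
        {"key": "default", "default": True, "nodeType": "JaggedArrayNode", "depth": 2},
    ],
}
print(default_node_problems(example_parent))  # []
```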

Once a default node is specified, references to the parent that do not match other nodes will default to matching from the default node and below.

Example text using default:

{
    "Introduction": ["Intro Paragraph 1", "Intro Paragraph 2", ...],
    "default": [["Chapter 1, Section 1", "Chapter 1, Section 2"],["Chapter 2, Section 1", "Chapter 2, Section 2"], ...],
    "Conclusion": ["Conclusion Paragraph 1", "Conclusion Paragraph 2", ...]
}
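
To show how the fallback to the default node plays out, here is a toy resolver over a text shaped like the one above. It is only a sketch: the function resolve and its string handling are invented for illustration and are far simpler than real reference parsing, and it assumes that the reference starts with the book title and that section numbers count from 1.

```python
def resolve(text, book_title, ref):
    """Resolve refs like "Example Book, Introduction 2" or "Example Book 2:1"."""
    rest = ref[len(book_title):].lstrip(", ")
    key = "default"  # used when no titled child matches
    for name in text:
        if name != "default" and rest.startswith(name):
            key, rest = name, rest[len(name):]
            break
    address = [int(part) for part in rest.strip().split(":") if part]
    node = text[key]
    for number in address:
        node = node[number - 1]
    return node


text = {
    "Introduction": ["Intro Paragraph 1", "Intro Paragraph 2"],
    "default": [
        ["Chapter 1, Section 1", "Chapter 1, Section 2"],
        ["Chapter 2, Section 1", "Chapter 2, Section 2"],
    ],
    "Conclusion": ["Conclusion Paragraph 1", "Conclusion Paragraph 2"],
}
print(resolve(text, "Example Book", "Example Book 2:1"))              # Chapter 2, Section 1
print(resolve(text, "Example Book", "Example Book, Introduction 2"))  # Intro Paragraph 2
```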

Example Index Record using default:

{
    "key": "Example Book",
    "titles": [
        {
            "lang": "en",
            "text": "Example Book",
            "primary": True
        },
        {
            "lang": "he",
            "text": "דוגמא",
            "primary": True
        }
    ],
    "nodes": [
        {
            "key": "Introduction",
            "titles": [
                {
                    "lang": "en",
                    "text": "Introduction",
                    "primary": True
                },
                {
                    "lang": "en",
                    "text": "Intro",
                },
                {
                    "lang": "he",
                    "text": "הקדמה",
                    "primary": True
                }
            ],
            "nodeType": "JaggedArrayNode",
            "depth": 1,
            "sectionNames": [
                "Paragraph"
            ],
            "addressTypes": [
                "Integer"
            ]
        },
        {
            "key": "default",
            "default": True,
            "nodeType": "JaggedArrayNode",
            "depth": 2,
            "sectionNames": [
                "Chapter",
                "Section"
            ],
            "addressTypes": [
                "Integer",
                "Integer"
            ]
        },
        {
            "key": "Conclusion",
            "titles": [
                {
                    "lang": "en",
                    "text": "Conclusion",
                    "primary": True
                },
                {
                    "lang": "en",
                    "text": "Ending",
                },
                {
                    "lang": "he",
                    "text": "סיום",
                    "primary": True
                }
            ],
            "nodeType": "JaggedArrayNode",
            "depth": 1,
            "sectionNames": [
                "Paragraph"
            ],
            "addressTypes": [
                "Integer"
            ]
        }
    ]
}

Another example for handling introductions

An example of how to handle introductions that precede numbered content.

Let's say there is a chapter called "Laws" which has an Introduction with 3 paragraphs, then 10 laws, each with sub-sections. We should create three nodes for it:

SchemaNode: "Laws" with two children:
    JaggedArrayNode: "Introduction"
        depth: 1
        sectionNames: ["Paragraph"]
    default JaggedArrayNode
        depth: 2
        sectionNames: ["Law","Subsection"]  

Numbered sections

Intermediate nodes, meaning nodes that have children, generally reference their children by means of titles. Optionally, these children can be referenced by number as well. This is useful for books that have chapters that are both named and numbered.

In these cases, the intermediate node, which usually does not have nodeType, sectionNames, addressTypes and depth fields, will have them. The fields and values below, added to an intermediate node, would allow its children to be addressed by integer as well as by their titles.

            "nodeType": "JaggedArrayNode",
            "sectionNames": ["Chapter"],
            "addressTypes": ["Integer"],
            "depth": 1,

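As a sketch of what these fields enable, the hypothetical helper below finds a child of an intermediate node either by key or by number. It assumes that numbers follow the order of the children and start from 1, and it matches on key rather than on the full set of titles to keep the example short. The child keys are borrowed from Example Book.

```python
def find_child(parent, name_or_number):
    """Find a child of an intermediate node by key or, if allowed, by number."""
    if isinstance(name_or_number, int):
        if parent.get("addressTypes") != ["Integer"]:
            raise KeyError("this node's children are not addressable by number")
        return parent["nodes"][name_or_number - 1]  # assumed 1-based, in order
    for child in parent["nodes"]:
        if child["key"] == name_or_number:
            return child
    raise KeyError(name_or_number)


parent = {
    "key": "Example Book",
    "nodeType": "JaggedArrayNode",
    "sectionNames": ["Chapter"],
    "addressTypes": ["Integer"],
    "depth": 1,
    "nodes": [
        {"key": "Introduction"},
        {"key": "Contents"},
        {"key": "Conclusion"},
    ],
}
print(find_child(parent, 2)["key"])             # Contents
print(find_child(parent, "Conclusion")["key"])  # Conclusion
```
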
Alternate Structures

Oftentimes a text is referred to using more than one overlapping scheme. In cases like this, alternate structures may be specified on Index records. Examples of this are Torah, which has both a chapter-verse addressing schema, and a parsha-aliyah addressing schema, and Talmud which has both a daf addressing schema and a chapter-mishnah schema.

One structure, the one with the greatest detail, is used as the storage format of the text. It is specified in the schema attribute of the Index record. The other formats are specified in the alt_structs attribute of the Index record.

alt_structs is a dictionary, mapping structure keys (which can be arbitrary) to alt structures. Alt structures look very much like text schemas, but for a few differences.

  • The nodeType is generally ArrayMapNode
  • The root node has no titles. It uses the titles of the schema root.
  • Nodes of the alt structure do not have key fields.
  • Terminal nodes in an alt structure have mappings to underlying references, using one or two attributes:
    • wholeRef: A single string, which has a ref to the whole range covered by this node
    • refs: (required only when depth is greater than zero) A jagged array of refs that correspond to how wholeRef is broken into sections named by sectionNames.
  • Display attributes can be set which affect how an alternate structure is visualized in its Table of Contents:
    • includeSections: when True, the node will include links to each individual section within wholeRef underneath the alternate node name (e.g., Zohar).
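
The mapping from an alternate structure back to the storage format can be sketched as a lookup on a terminal node. The helper alt_ref is hypothetical, the node is a depth 1 parasha node for Exodus, and the sketch assumes that section numbers count from 1.

```python
def alt_ref(alt_node, number=None):
    """Map an alt-structure address to a ref in the underlying storage schema.

    With no number, the whole node is meant, so wholeRef is returned.
    """
    if number is None:
        return alt_node["wholeRef"]
    return alt_node["refs"][number - 1]  # assumed 1-based numbering


shemot = {
    "sharedTitle": "Shemot",
    "nodeType": "ArrayMapNode",
    "depth": 1,
    "sectionNames": ["Aliyah"],
    "addressTypes": ["Integer"],
    "wholeRef": "Exodus 1:1-6:1",
    "refs": [
        "Exodus 1:1-1:17",
        "Exodus 1:18-2:10",
        "Exodus 2:11-2:25",
        "Exodus 3:1-3:15",
        "Exodus 3:16-4:17",
        "Exodus 4:18-4:31",
        "Exodus 5:1-6:1",
    ],
}
print(alt_ref(shemot))     # Exodus 1:1-6:1
print(alt_ref(shemot, 3))  # Exodus 2:11-2:25
```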

Here is an example of one section of the alternate structure of the book of Exodus:

   "alt_structs" : {
        "Parasha" : {
            "nodes" : [ 
                {
                    "sharedTitle" : "Shemot",
                    "nodeType" : "ArrayMapNode",
                    "depth" : 1,
                    "sectionNames" : [ 
                       "Aliyah"
                    ],
                    "wholeRef" : "Exodus 1:1-6:1",
                    "refs" : [ 
                        "Exodus 1:1-1:17", 
                        "Exodus 1:18-2:10", 
                        "Exodus 2:11-2:25", 
                        "Exodus 3:1-3:15", 
                        "Exodus 3:16-4:17", 
                        "Exodus 4:18-4:31", 
                        "Exodus 5:1-6:1"
                    ],
                    "addressTypes" : [ 
                        "Integer"
                    ]
                }, 
                {
                    "sharedTitle" : "Vaera",
                    "nodeType" : "ArrayMapNode",
                    "depth" : 1,
                    "sectionNames" : [ 
                        "Aliyah"
                    ],
                    "wholeRef" : "Exodus 6:2-9:35",
                    "refs" : [ 
                        "Exodus 6:2-6:13", 
                        "Exodus 6:14-6:28", 
                        "Exodus 6:29-7:7", 
                        "Exodus 7:8-8:6", 
                        "Exodus 8:7-8:18", 
                        "Exodus 8:19-9:16", 
                        "Exodus 9:17-9:35"
                    ],
                    "addressTypes" : [ 
                        "Integer"
                    ]
                }, 
                {
                    "sharedTitle" : "Bo",
                    "nodeType" : "ArrayMapNode",
                    "depth" : 1,
                    "sectionNames" : [ 
                        "Aliyah"
                    ],
                    "wholeRef" : "Exodus 10:1-13:16",
                    "refs" : [ 
                        "Exodus 10:1-10:11", 
                        "Exodus 10:12-10:23", 
                        "Exodus 10:24-11:3", 
                        "Exodus 11:4-12:20", 
                        "Exodus 12:21-12:28", 
                        "Exodus 12:29-12:51", 
                        "Exodus 13:1-13:16"
                    ],
                    "addressTypes" : [ 
                        "Integer"
                    ]
                }, 
...