Getting Started as a Sefaria Developer

JonMosenkis edited this page Nov 6, 2016 · 28 revisions

Working with Sefaria

There are two primary ways to interact with Sefaria: through the API and through the Python code.

API

With the API, you can request data with GET requests and change data with POST requests. The API is reasonably well (but not completely) documented in the API Documentation. In the codebase, URL mappings for API methods are defined at the top of /sefaria/urls.py. The methods themselves are generally implemented in /reader/views.py and have names that end in _api.
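As a rough illustration of the read side, the sketch below builds (but does not send) a GET request for a text. The helper name build_text_request, the localhost server address, and the convention of writing spaces in a ref as underscores are assumptions made for this example; check them against the API Documentation before relying on them.

```python
from urllib.parse import quote
from urllib.request import Request


def build_text_request(server, ref):
    """Build (without sending) a GET request for a text.

    Hypothetical helper. Assumes texts are addressed under /api/texts/
    with spaces in the ref written as underscores.
    """
    url_ref = quote(ref.replace(' ', '_'))
    return Request('{}/api/texts/{}'.format(server, url_ref), method='GET')


# 'http://localhost:8000' stands in for a local Sefaria installation.
request = build_text_request('http://localhost:8000', 'Genesis 1')
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; the response body is expected to be JSON.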

A common use of the write API for a developer is to post a new text. This involves posting a new Index record using the /api/v2/raw/index endpoint, and then posting a new text record using the /api/texts endpoint.
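The two-step write described above might be prepared as in the following sketch. It only constructs the requests and never sends them. The form field names json and apikey, the practice of appending the title to the endpoint path, the server address, and the book title are all assumptions for illustration, not confirmed details of the API.

```python
import json
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_post_request(server, path, payload, api_key):
    """Build (without sending) a form-encoded POST request.

    Hypothetical helper. Assumes the payload travels in a 'json' form
    field next to an 'apikey' field.
    """
    body = urlencode({'json': json.dumps(payload), 'apikey': api_key})
    return Request(server + path, data=body.encode('utf-8'), method='POST')


SERVER = 'http://localhost:8000'   # assumed local installation
API_KEY = 'your-api-key'           # placeholder, not a real key

# Step 1: the Index record. Step 2: the text record for the same title.
index_request = build_post_request(
    SERVER, '/api/v2/raw/index/My_Book',
    {'title': 'My Book', 'categories': ['Commentary']}, API_KEY)
text_request = build_post_request(
    SERVER, '/api/texts/My_Book',
    {'versionTitle': 'Example', 'language': 'en', 'text': [['Some text']]},
    API_KEY)

# Decode the body again to inspect what would be sent.
decoded = parse_qs(index_request.data.decode('utf-8'))
print(json.loads(decoded['json'][0])['title'])
```

The order matters: the text record refers to the Index record by title, so the index request has to succeed first.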

Python Code

The other way to interact with Sefaria is through Python code. Using Python code is always faster than using the API, but it can only be done with shell access to the server that you're working on. A hybrid approach is recommended: read data from a local installation of Sefaria, and upload changes to the site via the API.

Scripts

Python scripts generally load the Sefaria code base with from sefaria.model import *. That loads the most commonly used classes and methods. See the sefaria.model __init__.py for exactly what symbols become available with that import. The most commonly used classes of the code base are documented here.

Command Line Interface

For convenience, there is a Command Line Interface script, which sets the appropriate environment variables and imports sefaria.model. The command line interface can be executed from the Sefaria-Project directory with cli -i

Using utilities from Sefaria-Data

Sefaria-Data is the repository in which we store parsing scripts and source files. For a new parsing project, open up a new folder within Sefaria-Data/sources. Within Sefaria-Data are several utilities which can be very helpful for parsing projects.

One of the most useful modules is Sefaria-Data/sources/functions. This module is currently being refactored, but it still has several useful functions, particularly post_index, post_text and post_link. Using these functions requires a current API key and a destination server, both of which are specified in the local_settings file within Sefaria-Data. The module can be viewed on GitHub here: https://github.com/Sefaria/Sefaria-Data/blob/master/sources/functions.py.

We are currently building a package of reusable functions and classes to help simplify parsing projects. This is located in Sefaria-Data/data_utilities. Several notable functions:

  • getGematria(txt) calculates the Gematria of a Hebrew text.
  • ja_to_xml(ja, section_names, filename='output.xml') creates a representation of a jagged array (a nested list of any depth with text at the lowest level) as an XML file. This function takes a jagged array as the first parameter and a list of strings as the second. These strings should represent the names of each section (e.g. 'chapter', 'verse').
  • file_to_ja(...) formats a text file into a jagged array structure. Documentation for this function can be found here: https://github.com/Sefaria/Sefaria-Data/blob/master/data_utilities/util.py#L449
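To make the jagged array idea concrete, here is a minimal sketch of converting one to XML, one element per section. It is not the ja_to_xml implementation from data_utilities; the function names, the index attribute, and the root element name are choices made for this example.

```python
import xml.etree.ElementTree as ET


def jagged_array_to_xml(ja, section_names, root_name='root'):
    """Return an XML element tree mirroring a jagged array (hypothetical helper)."""
    root = ET.Element(root_name)
    _add_children(root, ja, section_names)
    return root


def _add_children(parent, ja, section_names):
    if not section_names:
        raise ValueError('jagged array is deeper than section_names')
    name = section_names[0]
    for index, item in enumerate(ja, start=1):
        # One element per entry, numbered from 1 like chapters and verses.
        child = ET.SubElement(parent, name, index=str(index))
        if isinstance(item, str):
            child.text = item          # lowest level: the text itself
        else:
            _add_children(child, item, section_names[1:])


# A depth-2 jagged array: two chapters holding two and one verses.
sample = [['In the beginning', 'And the earth'], ['Thus the heavens']]
root = jagged_array_to_xml(sample, ['chapter', 'verse'])
print(ET.tostring(root, encoding='unicode'))
```

The list of section names must be at least as long as the array is deep, which is why the helper raises an error when it runs out of names.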

Setting up Sefaria-Data

To use Sefaria-Data, fork the repository to your personal GitHub account, then clone the forked repository to your local system. Warning: Sefaria-Data requires over 5.3GB, so make sure your system has enough disk space and a stable internet connection, as an interruption in the connection will require restarting the download.

Once downloaded, head over to the sources folder, make a copy of the file local_settings_example.py, and name the copy local_settings.py. Fill in the placeholders and you're ready to go!

Importing Sefaria-Data packages and modules is quite simple for scripts located within the sources folder. Here is an example of some helpful imports:

Sefaria-Data Imports example:

from sources import functions
import re                # Regular Expression engine
import data_utilities    # Sefaria-Data reusable utilities package
import codecs            # Highly recommended for handling non ascii characters.

Importing local_settings.py is usually not necessary, as the modules that require this data already import the data.

Upload Script Examples:

Simple index upload script example:

from sefaria.model import *   # provides JaggedArrayNode
from sources import functions

def upload(data, categories):
    # data: dict with the English ('en') and Hebrew ('he') titles of the book
    # categories: list of category names under which the book is filed

    # create index
    schema = JaggedArrayNode()
    schema.add_title(data['en'], 'en', True)
    schema.add_title(data['he'], 'he', True)
    schema.key = data['en']
    schema.depth = 3
    schema.addressTypes = ['Integer', 'Integer', 'Integer']
    schema.sectionNames = ['Chapter', 'Verse', 'Comment']
    schema.validate()

    index_dict = {
        'title': data['en'],  # title of the book; assumed to match the primary English title
        'categories': categories,
        'schema': schema.serialize()  # This line converts the schema into JSON
    }
    functions.post_index(index_dict)

Simple text upload script

def upload_text(title, parsed_text, version_title, version_source, language):
    # title: name of the text as defined in the index record
    # parsed_text: the parsed text as a jagged array
    # language: language of the text, must be 'en' or 'he'
    text_version = {
        'versionTitle': version_title,
        'versionSource': version_source,
        'language': language,
        'text': parsed_text
    }
    functions.post_text(title, text_version)
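Before posting, it can help to confirm that the parsed text has the nesting depth declared in the index record (3 in the simple index example above). The following is a minimal sketch of such a check; jagged_array_depth is a hypothetical helper, not part of Sefaria-Data.

```python
def jagged_array_depth(ja):
    """Return the nesting depth of a jagged array.

    A bare string has depth 0, a list of strings depth 1, and so on.
    An empty list counts as depth 1 (an empty section).
    """
    if isinstance(ja, str):
        return 0
    if not ja:
        return 1
    return 1 + max(jagged_array_depth(item) for item in ja)


# Chapter -> Verse -> Comment, matching sectionNames in the index example.
parsed_text = [
    [['first comment on 1:1', 'second comment on 1:1'], ['comment on 1:2']],
    [['comment on 2:1']],
]
depth = jagged_array_depth(parsed_text)
print(depth)
```

A mismatch between this number and the depth of the schema is a sign that the parser grouped the text incorrectly.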

Complex index upload

Books that have named sections must be structured as complex texts. For example, a commentary on Torah cannot be structured just by chapter and verse, but must also have named sections for books (e.g., Genesis). In the case of the simple text, we created a JaggedArrayNode to outline the structure. In the case of the complex text, we will create a SchemaNode, which is basically a list of JaggedArrayNodes as shown in the following example:

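from sefaria.model import *   # provides library, SchemaNode and JaggedArrayNode
from sefaria.utils.hebrew import hebrew_term   # assumed location of hebrew_term
from sources import functions

books = library.get_indexes_in_category('Torah')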

# create index record
record = SchemaNode()
record.add_title('Baal HaTurim', 'en', primary=True)
record.add_title(u'בעל הטורים', 'he', primary=True)
record.key = 'Baal HaTurim'

# add nodes
for book in books:
    node = JaggedArrayNode()
    node.add_title(book, 'en', primary=True)
    node.add_title(hebrew_term(book), 'he', primary=True)
    node.key = book
    node.depth = 3
    node.addressTypes = ['Integer', 'Integer', 'Integer']
    node.sectionNames = ['Chapter', 'Verse', 'Comment']
    record.append(node)
record.validate()

index_dict = {
    "title": "Baal HaTurim",
    "categories": ["Commentary2", "Torah", "Baal HaTurim"],
    "schema": record.serialize()
}
functions.post_index(index_dict)

Complex text upload

Currently only simple texts can be uploaded by API. As a complex text is basically a list of simple texts, it is therefore necessary to upload each simple text making up the complex text individually. In the following example, a dictionary was created with keys being the names of each section and values set to the parsed text in a jagged-array like structure. Notice how the post_text() function lies inside the for loop.

for book in library.get_indexes_in_category('Torah'):
    version = {
        'versionTitle': 'Baal HaTurim',
        'versionSource': 'http://www.toratemetfreeware.com/',
        'language': 'he',
        'text': parsed_data[book]
    }
    functions.post_text('Baal HaTurim, {}'.format(book), version)

It is important to note that a comma must appear after the title of the SchemaNode (in this example, Baal HaTurim is the name of the SchemaNode), so the text for Genesis is posted as 'Baal HaTurim, Genesis'.
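One way to double-check the references before posting anything is to build the full list of uploads first and inspect it. The sketch below does that with plain dictionaries and makes no network calls; build_uploads and the sample data are hypothetical.

```python
def build_uploads(index_title, parsed_data, version_title, version_source, language):
    """Return a list of (reference, version) pairs, one per named section."""
    uploads = []
    for section, text in parsed_data.items():
        # The comma separates the SchemaNode title from the section name.
        reference = '{}, {}'.format(index_title, section)
        version = {
            'versionTitle': version_title,
            'versionSource': version_source,
            'language': language,
            'text': text,
        }
        uploads.append((reference, version))
    return uploads


# Placeholder data standing in for the real parsed text of each book.
parsed_data = {
    'Genesis': [[['comment']]],
    'Exodus': [[['comment']]],
}
uploads = build_uploads('Baal HaTurim', parsed_data, 'Baal HaTurim',
                        'http://www.toratemetfreeware.com/', 'he')
for reference, version in uploads:
    print(reference)
```

Each pair can then be handed to functions.post_text(reference, version) once the references look right.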

Adding an alternate struct

Sometimes, there may be more than one way to navigate a book. For example, the Zohar is referenced by daf as well as by parasha. To allow for this, an alternate structure can be added. Constructing an alternate structure, such as a structure of book and parasha for your index, is almost identical to constructing a regular structure, with the following two exceptions:

  1. Nodes must not have keys
  2. The terminal node will be an ArrayMapNode (as opposed to a JaggedArrayNode). An ArrayMapNode has a required attribute called wholeRef which describes the mapping of the node. For example, the wholeRef of the ArrayMapNode for Parashat Bereshit is 'Genesis 1:1-6:8'.

For more information on alt-structs, please refer to the following page: https://github.com/Sefaria/Sefaria-Project/wiki/Index-Records-for-Simple-%26-Complex-Texts
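The two rules above can be illustrated with plain dictionaries standing in for ArrayMapNodes. This is only a sketch of the idea: the dictionary layout, the helper names, and the checks are assumptions for illustration and are not the serialized format Sefaria uses. Only the wholeRef attribute and the absence of keys come from the description above.

```python
def make_alt_node(en_title, he_title, whole_ref):
    """Hypothetical stand-in for an ArrayMapNode: titles and a wholeRef, no key."""
    return {
        'nodeType': 'ArrayMapNode',
        'titles': [
            {'lang': 'en', 'text': en_title, 'primary': True},
            {'lang': 'he', 'text': he_title, 'primary': True},
        ],
        'wholeRef': whole_ref,
    }


def check_alt_nodes(nodes):
    """Raise ValueError if a node breaks either rule for alternate structures."""
    for node in nodes:
        if 'key' in node:
            raise ValueError('alternate structure nodes must not have keys')
        if not node.get('wholeRef'):
            raise ValueError('terminal nodes require a wholeRef')


# The first two parashot of Genesis, each mapped to the range it covers.
genesis_parashot = [
    make_alt_node('Bereshit', 'בראשית', 'Genesis 1:1-6:8'),
    make_alt_node('Noach', 'נח', 'Genesis 6:9-11:32'),
]
check_alt_nodes(genesis_parashot)
print([node['wholeRef'] for node in genesis_parashot])
```

In a real script the nodes would be ArrayMapNode objects appended to a parent node, as described on the wiki page linked above.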

Some helpful upload tips:

  • Pass the parameter count_after=1 on the final POST request (for simple texts, only one POST needs to be made, so you can hard-code this setting into your upload script).

  • Passing weak_network=True to any post function will retry the upload in the event of a network failure. The parameter num_tries sets the number of attempts to be made and defaults to 3.

  • The rate-limiting step when uploading a new text is the Sefaria auto-linking utility. In the event that an upload is taking a very long time (e.g., the connection times out), the parameter skip_links=1 can be passed to functions.post_text(). Those with admin access can then use the following URL to run the auto-linker after the upload has succeeded: admin/rebuild/commentary-links/(?P<title>). Alternatively, those with command line access can use sefaria.helper.link.rebuild_links_from_text(title, user). It is important to note that the auto-linker MUST be run on the environment to which the text is being uploaded.

    Skipping the auto-linking process is only recommended in extreme circumstances, and should not be attempted without first discussing alternative options with the development team.
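The retry behaviour described for weak_network and num_tries can be pictured with a small wrapper like the one below. This is a generic sketch of retrying on network errors, not the code inside functions.py; post_with_retries, wait_seconds and flaky_post are names invented for this example.

```python
import time


def post_with_retries(post_func, num_tries=3, wait_seconds=0):
    """Call post_func until it succeeds or num_tries attempts have failed."""
    if num_tries < 1:
        raise ValueError('num_tries must be at least 1')
    last_error = None
    for attempt in range(1, num_tries + 1):
        try:
            return post_func()
        except IOError as error:   # urllib network errors are IOError subclasses
            last_error = error
            if attempt < num_tries:
                time.sleep(wait_seconds)
    raise last_error


# Simulate an upload that fails twice before succeeding.
attempts = []


def flaky_post():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError('simulated network failure')
    return 'ok'


result = post_with_retries(flaky_post, num_tries=3)
print(result, len(attempts))
```

If every attempt fails, the last error is raised again so the calling script can stop rather than continue with a partial upload.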

5 Projects to get you coding on Sefaria

1. Parse a simple text and post it to the site with the API.

Index Records for Simple & Complex Texts

2. Parse a complex text and save it to the site via Python.

Index Records for Simple & Complex Texts

3. Extract a link set from a text and save it to our database.

4. Ask a question that spans our whole corpus.

5. Build an AI that can make informed halakhic rulings.