Parsing Projects

Ari Elias-Bachrach edited this page Jan 26, 2015 · 4 revisions

The majority of the text on Sefaria today has come from volunteer developers who have written scripts to parse and post existing public domain digital texts to the Sefaria API.

The process involves:

  1. Identifying a digital public domain text to work with. Torat Emet, Wikisource, Daat and OnYourWay are big sources for us. Our Content Director can also recommend texts that we have OCR'd but not yet parsed.
  2. Developing a schema for how the text should be structured inside of Sefaria.
  3. Writing a script to parse source files into JSON for POSTing. For examples, look in the sources directory of our Sefaria-Data repo.
  4. Posting the text to dev.sefaria.org to identify any textual or technical issues.
  5. With an OK from the Sefaria team, posting the text to www.sefaria.org.
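
As a rough illustration of step 3, here is a minimal sketch of a parsing script. It assumes a hypothetical source file in which each chapter begins with a line such as "Chapter 1" and every following non-empty line is one segment of text. The function names, the sample text, and the payload field names (`versionTitle`, `versionSource`, `language`, `text`) are assumptions made for this example; check the real scripts in the Sefaria-Data repo and confirm the expected fields with the Sefaria team before posting anything.

```python
import json


def parse_chapters(raw):
    """Split raw text into a list of chapters, each a list of segment strings.

    Assumes (hypothetically) that a line starting with "Chapter " opens a
    new chapter and that every other non-empty line is one segment.
    """
    chapters = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between chapters
        if line.startswith("Chapter "):
            chapters.append([])  # start a new, empty chapter
        elif chapters:
            chapters[-1].append(line)  # add the segment to the current chapter
    return chapters


def build_payload(chapters, version_title, version_source, language):
    """Wrap the parsed text in a dict ready to be serialized as JSON.

    The field names here are assumptions for illustration only.
    """
    return {
        "versionTitle": version_title,
        "versionSource": version_source,
        "language": language,
        "text": chapters,
    }


# Hypothetical sample input standing in for a real source file.
SAMPLE = """Chapter 1
First segment.
Second segment.

Chapter 2
Third segment.
"""

chapters = parse_chapters(SAMPLE)
payload = build_payload(
    chapters, "Example Edition", "http://example.com/source", "en"
)
# ensure_ascii=False keeps Hebrew (or other non-ASCII) text readable in the output.
body = json.dumps(payload, ensure_ascii=False)
print(body)
```

The sketch stops at producing the JSON string and deliberately makes no network request. Sending it to dev.sefaria.org (step 4) requires the actual endpoint and credentials, which are not covered here.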