Search API (v1)

Noah Santacruz edited this page Jan 27, 2019 · 2 revisions

For a simpler version of this API, see Search API (v2). This API is a thin proxy for the ElasticSearch API, which can be fairly complicated.

Makes a query to Sefaria's search engine and returns results. Our search engine is built on ElasticSearch; the version currently in use is 6.2.3. For complete details of the ElasticSearch API, see their full documentation.

We expose two endpoints of the ElasticSearch API:

POST /api/search/:index/_search

Make a query to the search database. Full search documentation here

index: can be either text or sheet

text

Use to query our text index. Each document represents a segment in our library. Any of the fields in a document can be queried. Below is a description of some of the fields:

  • exact: Holds the content of the Sefaria segment. The field is indexed by the standard analyzer. Read more about analyzers here

  • naive_lemmatizer: Holds the content of the Sefaria segment. The field is indexed by the sefaria-naive-lemmatizer analyzer, which does basic lemmatization for Hebrew inputs.

  • ref: The Sefaria reference

  • lang: Language of the content


Here is an example document from the text index:

  {
    "path" : "Tanakh/Torah/Genesis",
    "titleVariants" : [ "Genesis", "Beresheet", "Bereshit", "Bereishis", "Gen.", "Ber", "Bereshith", "Breishit", "Bereishit", "Ber.", "Gen" ],
    "exact" : "In the beginning God created the heavens and the earth.",
    "comp_date" : -1400,
    "categories" : [ "Tanakh", "Torah" ],
    "lang" : "en",
    "pagesheetrank" : 190762.7589503033,
    "title" : "Genesis Chapter 1 Verse 1 (Jewish English Torah)",
    "heRef" : "בראשית א׳:א׳",
    "version" : "Jewish English Torah",
    "naive_lemmatizer" : "In the beginning God created the heavens and the earth.",
    "ref" : "Genesis 1:1",
    "version_priority" : 0,
    "order" : "A00000000010001"
  }
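
Since this endpoint is described as a proxy for ElasticSearch, a response would presumably follow the standard ElasticSearch search layout, with matches listed under `hits.hits` and each document's fields under `_source`. The Python sketch below pulls a few of the fields described above out of such a response. The helper name `summarize_hits` is hypothetical, and the sample response is a hand-built illustration rather than real API output.

```python
def summarize_hits(response):
    """Return (ref, lang, snippet) tuples from an ElasticSearch-style response.

    Assumes the standard layout: response["hits"]["hits"] is a list of hits,
    each with a "_source" document and an optional "highlight" mapping.
    """
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        # Prefer a highlighted fragment when present, else fall back to the full text.
        fragments = hit.get("highlight", {}).get("exact", [])
        snippet = fragments[0] if fragments else source.get("exact", "")
        rows.append((source.get("ref"), source.get("lang"), snippet))
    return rows


# Hand-built illustration only; not actual output from the API.
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "ref": "Genesis 1:1",
                    "lang": "en",
                    "exact": "In the beginning God created the heavens and the earth.",
                },
                "highlight": {"exact": ["In the beginning <b>God</b> created"]},
            }
        ]
    }
}

rows = summarize_hits(sample_response)
print(rows)
```

Reading fields through `.get` keeps the helper tolerant of hits that lack a highlight or a particular field.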

Example queries

  • Simple query for exact text results

    {
       "from":0,  // document offset
       "size":100,  // number of documents to return
       "highlight":{
          "pre_tags":[  // tags to wrap terms that were found
             "<b>"
          ],
          "post_tags":[
             "</b>"
          ],
          "fields":{
             "exact":{
                "fragment_size":200  // size of the snippet returned, in characters
             }
          }
       },
       "sort":[  // fields of the document to use when sorting results. sorts are applied sequentially.
          {
             "comp_date":{
    
             }
          },
          {
             "order":{
    
             }
          }
       ],
       "query":{
          "match_phrase":{
             "exact":{
                "query":"moshe"
             }
          }
       }
    }
    
  • Query text using the broad analyzer (sefaria-naive-lemmatizer), which parses Hebrew queries using heuristics. This involves converting plurals to singular, removing prefixes, and handling מלא (full) and חסר (defective) spellings.

    {
       "size":100,
       "highlight":{
          "pre_tags":[
             "<b>"
          ],
          "post_tags":[
             "</b>"
          ],
          "fields":{
             "naive_lemmatizer":{
                "fragment_size":200
             }
          }
       },
       "sort":[
          {
             "comp_date":{
    
             }
          },
          {
             "order":{
    
             }
          }
       ],
       "aggs":{
          "category":{
             "terms":{
                "field":"path",
                "size":10000
             }
          }
       },
       "query":{
          "match_phrase":{
             "naive_lemmatizer":{  // querying the naive_lemmatizer field
                "query":"moshe",
                "slop":10  // maximum distance between terms, in words
             }
          }
       }
    }
    
  • Query text using relevance ranking. Ranking is based on a variation of the PageRank algorithm using our link set plus references cited in source sheets.

    {
       "size":100,
       "highlight":{
          "pre_tags":[
             "<b>"
          ],
          "post_tags":[
             "</b>"
          ],
          "fields":{
             "naive_lemmatizer":{
                "fragment_size":200
             }
          }
       },
       "query":{
          "function_score":{
             "field_value_factor":{
                "field":"pagesheetrank",  // multiply each score by the document's precalculated pagesheetrank value
                "missing":0.04  // value used for documents that have no pagesheetrank
             },
             "query":{
                "match_phrase":{
                   "naive_lemmatizer":{
                      "query":"moshe",
                      "slop":10
                   }
                }
             }
          }
       }
    }
    
  • Query filtered to a specific book in our library

    {
       "size":100,
       "highlight":{
          "pre_tags":[
             "<b>"
          ],
          "post_tags":[
             "</b>"
          ],
          "fields":{
             "exact":{
                "fragment_size":200
             }
          }
       },
       "sort":[
          {
             "comp_date":{
    
             }
          },
          {
             "order":{
    
             }
          }
       ],
       "query":{
          "bool":{
             "must":{
                "match_phrase":{
                   "exact":{
                      "query":"moshe"
                   }
                }
             },
             "filter":{
                "bool":{
                   "should":[  // include a list of regular expressions that should be matched. In this case, we specify a regex on the `path` field
                      {
                         "regexp":{
                            "path":"Mishnah\\/Seder Zeraim\\/Mishnah Kilayim.*"
                         }
                      }
                   ]
                }
             }
          }
       }
    }
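
The request bodies above can also be assembled programmatically. The following Python sketch builds an exact, broad, or path-filtered query and shows one way to POST it. Note that the `//` comments in the examples are annotations; strict JSON does not allow them, so they must not appear in the body actually sent. The host in `BASE_URL` is an assumption, and `build_query` and `search` are hypothetical helper names, not part of any Sefaria client library.

```python
import json
import urllib.request

# Assumption: the API is served from this host; adjust for your deployment.
BASE_URL = "https://www.sefaria.org"


def build_query(text, field="exact", slop=0, path_regex=None, size=100):
    """Build a request body shaped like the examples above.

    field      -- "exact" or "naive_lemmatizer"
    slop       -- maximum distance between terms, in words
    path_regex -- optional regular expression applied to the `path` field
    """
    phrase = {"match_phrase": {field: {"query": text, "slop": slop}}}
    if path_regex is None:
        query = phrase
    else:
        # Wrap the phrase match in a bool query and filter on `path`.
        query = {
            "bool": {
                "must": phrase,
                "filter": {"bool": {"should": [{"regexp": {"path": path_regex}}]}},
            }
        }
    return {
        "from": 0,
        "size": size,
        "highlight": {
            "pre_tags": ["<b>"],
            "post_tags": ["</b>"],
            "fields": {field: {"fragment_size": 200}},
        },
        "sort": [{"comp_date": {}}, {"order": {}}],
        "query": query,
    }


def search(index, body, base_url=BASE_URL):
    """POST the body to /api/search/:index/_search and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/api/search/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build (but do not send) a query restricted to one book, as in the last example.
body = build_query("moshe", path_regex=r"Mishnah\/Seder Zeraim\/Mishnah Kilayim.*")
print(json.dumps(body, indent=2))
```

Calling `search("text", body)` would send the request; it is left out here so the sketch runs without network access. Passing `field="naive_lemmatizer", slop=10` to `build_query` would reproduce the broad query.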
    
POST _analyze

See how a query will be analyzed by ElasticSearch. See the full ElasticSearch documentation.

The JSON object included in the request should include the following fields:

analyzer: Name of the analyzer you want to use. Currently Sefaria uses the standard analyzer for exact searches and the sefaria-naive-lemmatizer analyzer for broad searches.

text: Text you want to analyze
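
As a minimal sketch, the request body for this endpoint could be built in Python as shown below. The two analyzer names come from the description above; `build_analyze_body` and `token_strings` are hypothetical helper names. The response shape handled by `token_strings` is assumed to be the standard ElasticSearch `_analyze` layout, a `tokens` list whose entries each carry a `token` string.

```python
import json

# Analyzer names taken from the description above.
KNOWN_ANALYZERS = ("standard", "sefaria-naive-lemmatizer")


def build_analyze_body(text, analyzer="standard"):
    """Return the JSON-serializable body for an _analyze request."""
    if analyzer not in KNOWN_ANALYZERS:
        raise ValueError(f"unknown analyzer: {analyzer!r}")
    return {"analyzer": analyzer, "text": text}


def token_strings(analyze_response):
    """Extract the token text from an ElasticSearch-style _analyze response.

    Assumes the layout {"tokens": [{"token": "..."}, ...]}.
    """
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


payload = json.dumps(build_analyze_body("moshe", analyzer="sefaria-naive-lemmatizer"))
print(payload)
```

Checking the analyzer name locally surfaces a typo before the request is sent.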
