ferhatsb edited this page Jun 27, 2012 · 3 revisions

Summary

Interface details are available in the indextank YARD documentation (on rubydoc.info).

Downloads

Package: gem install indextank
Source code (github)

Installation

(Please make sure you have gem installed)

Get the indextank gem (might need administrative privileges):

$ gem install indextank

Basic usage

If you have already created an index, instantiate the client with your API URL and then use your index name to get a handle on the index:

require 'rubygems'
require 'indextank'

api = IndexTank::Client.new "http://:iXJkJpnmlMIVk9@didwv.api.houndsleuth.com"

index = api.indexes "<YOUR INDEX NAME>"

Once you have an instance of the client all you need is the content you want to index. The simplest way to add a document is sending that content in a single field called "text":

docid = "<YOUR DOCUMENT ID>"
text = "<THE TEXTUAL CONTENT>"

index.document(docid).add({ :text => text })

That's it, you have indexed a document.

You can now search the index for any indexed document by simply providing the search query:

query = "<YOUR QUERY STRING>"

results = index.search query

print results['matches'], " results\n"
results['results'].each {|doc|
    docid = doc['docid']
    print "docid: #{docid}\n"
}

As you can see, the results are the document ids you provided when indexing the documents. You use them to retrieve the actual documents from your DB to render the result page.
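
A minimal sketch of that lookup step, with a plain Hash standing in for the database. `ARTICLES` and `render_hits` are hypothetical names, and the `results` value is hand-written in the shape shown above rather than fetched from the service.

```ruby
# Hypothetical in-memory "database", keyed by the docid used when indexing.
ARTICLES = {
  'post-1' => { :title => 'First post' },
  'post-2' => { :title => 'Second post' }
}

# Turn search hits into records, skipping ids that no longer exist locally.
def render_hits(results, db)
  results['results'].map { |hit| db[hit['docid']] }.compact
end

# Hand-written stand-in for the value returned by index.search.
results = {
  'matches' => 3,
  'results' => [{ 'docid' => 'post-2' }, { 'docid' => 'gone' }, { 'docid' => 'post-1' }]
}

render_hits(results, ARTICLES).each { |article| puts article[:title] }
```

Skipping unknown ids is a design choice for this sketch: a document may have been removed from your DB before its deletion reached the index.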

You can get richer results using the fetch fields and snippet fields options:

query = "<YOUR QUERY STRING>"
results = index.search(query, 
                       :fetch => 'title,timestamp', 
                       :snippet => 'text')

print results['matches'], " results\n"
results['results'].each {|doc|
    docid = doc['docid']
    title = doc['title']
    timestamp = doc['timestamp']
    text = doc['snippet_text']
    print "<a href='/#{docid}'>#{title}</a><p>#{text}</p>" 
}

Deleting an indexed document is also very easy:

docid = "<YOUR DOCUMENT ID>"

index.document(docid).delete()

Additional fields

When you index a document you can define different fields by adding more elements to the document object:

#### INDEX MULTIPLE FIELDS
index.document(docid).add({ :text => text, 
                            :title => title, 
                            :author => author })

By default, searches will only look at the "text" field, but you can define special searches that look at other fields by prefixing a search term with the field name. The following example restricts the results to a given user's content.

#### FILTER TO USER'S CONTENT
index.search "#{query} author:#{user}"
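
If you filter on several fields, the query string can be assembled by a small helper. This is only a sketch: `with_field_filters` is a hypothetical name, not part of the client, and it assumes single-word values (values containing spaces or special characters are not handled here).

```ruby
# Append "field:value" terms to a free-text query.
# Assumes each value is a single word; no quoting or escaping is attempted.
def with_field_filters(query, filters)
  terms = filters.map { |field, value| "#{field}:#{value}" }
  ([query] + terms).join(' ')
end

puts with_field_filters('ruby client', { :author => 'alice' })
```

The resulting string would be passed to `index.search` like any other query.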

There's also a special field named "timestamp" that is expected to contain the publication date of the content in seconds since the Unix epoch (1970-01-01 00:00 UTC). If none is provided, the time of indexing is used by default.

index.document(docid).add({ :text => text, 
                            :timestamp => Time.now.to_i })
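
When the publication date is stored as a string or a Time, it has to be converted to epoch seconds first. A small sketch using only Ruby's standard library; the date below is an arbitrary example value.

```ruby
require 'time'

# Parse an ISO 8601 publication date and convert it to seconds since the epoch.
published_at = Time.iso8601('2012-06-27T00:00:00Z')
timestamp = published_at.to_i
puts timestamp

# Converting back is handy when displaying a fetched timestamp.
puts Time.at(timestamp).utc.iso8601
```

The integer in `timestamp` is what would go into the `:timestamp` field.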

Document variables

When you index a document you can define special floating point fields that can be used in the results scoring. The following example can be used in an index that allows 3 or more variables:

#### INDEX DOCUMENT WITH VARIABLES
fields = { :text => text }
variables = { 
              0 => rating,
              1 => reputation,
              2 => visits
            }

index.document(docid).add(fields, :variables => variables)

You can also update a document's variables without having to re-send the entire document. This is much faster and should always be used if no changes were made to the document itself.

#### UPDATE DOCUMENT VARIABLES ONLY
new_variables = { 
                  0 => new_rating,
                  1 => new_reputation,
                  2 => new_visits
                }

index.document(docid).update_variables(new_variables)

Scoring functions

To use the variables in your searches you'll need to define scoring functions. These functions can be defined in your dashboard by clicking "Manage" in your index details or directly through the API client.

# FUNCTION 0 : sorts by most recent 
index.functions(0, "-age").add

# FUNCTION 1 : standard textual relevance
index.functions(1, "relevance").add

# FUNCTION 2 : sorts by rating
index.functions(2, "doc.var[0]").add

# FUNCTION 3 : sorts by reputation
index.functions(3, "d[1]").add

# FUNCTION 4 : advanced function
index.functions(4, "log(d[0]) - age/50000").add

Read the function definition syntax for help on how to write functions.

If you don't define any functions, you get the default function 0, which sorts by timestamp (most recent first). Searches use function 0 by default. You can specify a different function when searching with the function option:

index.search(query, :function => 2)

Query variables and Geolocation

Besides the document variables, in the scoring functions you can refer to query variables. These variables are defined at query time and can be different for each query.

A common use case for query variables is geolocation: you use two variables for latitude and longitude, both in the documents and in the query, and use a distance function to sort by proximity to the user. This example assumes that every document stores its position in variables 0 and 1, representing latitude and longitude respectively.

Defining a proximity scoring function:

# FUNCTION 5 : inverse distance calculated in miles
index.functions(5, "-miles(d[0], d[1], q[0], q[1])").add

Searching from a user's position:

index.search(query, 
             :function => 5, 
             :var0 => latitude, 
             :var1 => longitude)
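
To get a feel for what such a proximity score looks like, here is a local sketch in plain Ruby. It assumes the distance is a great-circle (haversine) distance. The exact formula and Earth radius used by the service are not stated here, so `haversine_miles`, `EARTH_RADIUS_MILES` and the sample coordinates are illustrative assumptions only.

```ruby
# Mean Earth radius in miles (an assumption; the service may use another value).
EARTH_RADIUS_MILES = 3958.8

def to_radians(degrees)
  degrees * Math::PI / 180.0
end

# Great-circle distance between two latitude/longitude points, in miles.
def haversine_miles(lat1, lon1, lat2, lon2)
  dlat = to_radians(lat2 - lat1)
  dlon = to_radians(lon2 - lon1)
  a = Math.sin(dlat / 2)**2 +
      Math.cos(to_radians(lat1)) * Math.cos(to_radians(lat2)) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a))
end

# Hypothetical documents: docid => [latitude, longitude].
places = {
  'la'  => [34.0522, -118.2437],
  'nyc' => [40.7128, -74.0060]
}
user_lat, user_lon = 40.73, -73.99

# Score as in "-miles(...)": the negated distance, so higher means closer.
scores = places.map do |docid, (lat, lon)|
  [docid, -haversine_miles(lat, lon, user_lat, user_lon)]
end
ranked = scores.sort_by { |_, score| -score }.map(&:first)
puts ranked.inspect
```

Negating the distance is what makes nearby documents rank first, since results are ordered from highest score to lowest.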

Faceting

Besides defining numeric variables on a document, you can tag documents with category values. Each category is defined by a string, and its values are also defined by strings. For instance, you can define a category named "articleType" whose values can be "camera", "laptop", etc. You can have another category called "priceRange" whose values can be "$0 to $49", "$50 to $100", etc.

Documents can be tagged with a single value for each category, so if a document is in the "$0 to $49" priceRange it can't be in any other, and retagging over the same category results in overwriting the value.

You can tag several categories at once like this:

categories = { 
                  'priceRange' => '$0 to $299',
                  'articleType' => 'camera'
             }
index.document(docid).update_categories(categories)

When searching, the results include an attribute called "facets", a hash keyed by category. The value for each category is another hash, with each category value as key and its number of occurrences as value. For instance:

{ 
    'matches' => 8,
    'results' => [ {'docid' => 'doc1'}, ... ],
    'facets' => {
        'articleType' => {
            'camera' => 5,
            'laptop' => 3
        },
        'priceRange' => {
            '$0 to $299' => 4,
            '$300 to $599' => 4
        }
    }    
}

This means that, of the 8 matches, 5 have the "camera" articleType and 3 have "laptop". Also, 4 of them are in the "$0 to $299" priceRange and 4 are in "$300 to $599".
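
A sketch of how the facets hash might be turned into display lines, for example for a sidebar. `facet_lines` is a hypothetical helper, and the `facets` value is copied from the sample response above rather than returned by a live search.

```ruby
# Facets in the shape shown in the sample response.
facets = {
  'articleType' => { 'camera' => 5, 'laptop' => 3 },
  'priceRange'  => { '$0 to $299' => 4, '$300 to $599' => 4 }
}

# One line per category, values ordered by count (highest first),
# with ties broken alphabetically so the output is stable.
def facet_lines(facets)
  facets.map do |category, counts|
    values = counts.sort_by { |value, count| [-count, value] }
    "#{category}: " + values.map { |value, count| "#{value} (#{count})" }.join(', ')
  end
end

puts facet_lines(facets)
```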

You can also filter a query by restricting it to a particular set of category values. For instance, the following will only return results that have the "camera" articleType and are in either the "$0 to $299" or the "$300 to $599" price range.

index.search(query,
             :category_filters => {
                'priceRange' => ['$0 to $299', '$300 to $599'],
                'articleType' => ['camera']
             })
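
When the selected values come from user input, such as ticked facet checkboxes, they need to be grouped by category before being used as a filter. A possible sketch, where `category_filters` and the `selections` list are hypothetical:

```ruby
# Hypothetical (category, value) pairs, e.g. from checked facet boxes in a UI.
selections = [
  ['priceRange', '$0 to $299'],
  ['articleType', 'camera'],
  ['priceRange', '$300 to $599'],
  ['articleType', 'camera'] # duplicate selection
]

# Group the selected values by category, dropping duplicates.
def category_filters(selections)
  selections.uniq.group_by(&:first).transform_values do |pairs|
    pairs.map(&:last)
  end
end

filters = category_filters(selections)
puts filters.inspect
```

The resulting hash has the same shape as the one passed as `:category_filters` above.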

Range queries

Document variables and scoring functions can also be used to filter your query results. When performing a search you can add variable and function filters. Variable filters restrict the results to documents whose variable values fall within a specific range (e.g. posts that have more than 10 votes but fewer than 100). Function filters restrict the results to documents for which a given scoring function returns a value within a specific range.

Each variable or function filter can specify more than one range, and a value passes the filter if it falls within at least one of them. You can use as many filters as you want in a search; a document must pass all of them:

# Ranges are specified by a two elements list: 
#  bottom and top bounds.
# Both top and bottom can be nil indicating they should be ignored.
#   
# In this sample, the results will only include documents 
# whose variable #0 value is between 5 and 10 or between 15
# and 20, and variable #1 value is less than or equal to 3
index.search(query, 
             :docvar_filters => { 0 => [ [5, 10], [15, 20] ],
                                  1 => [ [nil, 3] ] })
# This also applies to functions: here, only documents for which
# function 0 returns at least 0.5 are included
index.search(query, 
             :function_filters => { 0 => [ [0.5, nil] ] })
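
The matching rule (any range within a filter, all filters together) can be sketched locally in plain Ruby. This is only an illustration of the semantics described above, not the service's implementation. `in_range?` and `passes_filters?` are hypothetical names, and treating both bounds as inclusive is an assumption.

```ruby
# A range is [bottom, top]; nil means that bound is ignored.
# Both bounds are treated as inclusive here (an assumption).
def in_range?(value, range)
  bottom, top = range
  (bottom.nil? || value >= bottom) && (top.nil? || value <= top)
end

# A document passes when, for every filter, its value falls
# within at least one of that filter's ranges.
def passes_filters?(values, filters)
  filters.all? do |key, ranges|
    ranges.any? { |range| in_range?(values[key], range) }
  end
end

filters = { 0 => [[5, 10], [15, 20]], 1 => [[nil, 3]] }

puts passes_filters?({ 0 => 7,  1 => 2 }, filters)  # variable 0 in the first range
puts passes_filters?({ 0 => 12, 1 => 2 }, filters)  # variable 0 in neither range
```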

Batch indexing

When populating an index for the first time or when a batch task for adding documents makes sense, you can use the batch indexing call.

Batch indexing lets you add many documents to the index with a single call. The limit on a single call is not the number of documents but the total size of the resulting HTTP request, which should be less than 1MB.

Making a batch indexing call reduces the number of requests needed (reducing the latency introduced by round-trips) and increases the maximum throughput, which can be very useful when initially loading a large index.

The indexing of individual documents may fail and your code should handle that and retry indexing them. If there are formal errors in the request, the entire batch will be rejected with an exception.

documents = []
documents << { :docid => 'doc1', :fields => { :text => text1 } }
documents << { :docid => 'doc2', :fields => { :text => text2 } }
documents << { :docid => 'doc3', :fields => { :text => text3 }, 
               :variables => { 0 => 1.5 } }
documents << { :docid => 'doc4', :fields => { :text => text4 }, 
               :variables => { 0 => 2.1 }, 
               :categories => { 'Price' => '0 to 100' } }
response = index.batch_insert(documents)

The response will be an array with the same length as the sent batch. Each element will be a map with the key "added" denoting whether the document in this position of the batch was successfully added to the index. If it's false, an error message will also be in the map with the key "error".

failed_documents = []
response.each_with_index do |r, i|
     failed_documents << documents[i] unless r['added']
end
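
Because the limit applies to the request size, a large document set has to be split into several batches. One possible approach, sketched below, estimates each document's size from its JSON encoding. `slice_by_size` and `MAX_BATCH_BYTES` are hypothetical names, and the JSON size is only an approximation of the real request body, so the threshold leaves some headroom.

```ruby
require 'json'

# Conservative stand-in for "less than 1MB"; leaves room for request overhead.
MAX_BATCH_BYTES = 900_000

# Split documents into consecutive batches whose estimated JSON size
# stays within max_bytes. A single oversized document gets its own batch.
def slice_by_size(documents, max_bytes = MAX_BATCH_BYTES)
  batches = [[]]
  size = 2 # the enclosing "[" and "]"
  documents.each do |doc|
    doc_size = JSON.generate(doc).bytesize + 1 # +1 for the separating comma
    if !batches.last.empty? && size + doc_size > max_bytes
      batches << []
      size = 2
    end
    batches.last << doc
    size += doc_size
  end
  batches.reject(&:empty?)
end

# Small limit used here only to make the splitting visible.
docs = (1..10).map { |i| { :docid => "doc#{i}", :fields => { :text => 'x' * 100 } } }
batches = slice_by_size(docs, 400)
puts batches.map(&:length).inspect
```

Each batch would then be passed to `index.batch_insert` in turn, checking the response of each call as shown above.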

Bulk Delete

With this method, you can delete a batch of documents (reducing the latency introduced by round-trips). The total size of the resulting HTTP request should be less than 1MB.

The deletion of individual documents may fail and your code should handle that and retry deleting them. If there are formal errors in the request, the entire batch will be rejected with an exception.

docids = ["doc1", "doc2", "doc3", "doc4"]
response = index.bulk_delete(docids)

The response will be an array with the same length as the sent batch. Each element will be a map with the key "deleted" denoting whether the document with the id in this position of the batch was successfully deleted from the index. If it's false, an error message will also be in the map with the key "error".

failed_documents = []
response.each_with_index do |r, i|
     failed_documents << docids[i] unless r['deleted']
end
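
The retry step can be wrapped in a small loop. The sketch below is independent of the client: `delete_with_retries` is a hypothetical helper that yields the pending ids to a block and expects a response array in the shape described above. A lambda stands in for the real call so the example runs on its own.

```ruby
# Retry until every id is deleted or max_attempts is reached.
# Returns the ids that still failed after the last attempt.
def delete_with_retries(docids, max_attempts = 3)
  pending = docids
  max_attempts.times do
    break if pending.empty?
    response = yield(pending)
    pending = pending.each_with_index
                     .reject { |_, i| response[i]['deleted'] }
                     .map(&:first)
  end
  pending
end

# Stand-in for the service: "doc2" fails on the first attempt only.
attempts = 0
fake_bulk_delete = lambda do |ids|
  attempts += 1
  ids.map { |id| { 'deleted' => !(id == 'doc2' && attempts == 1) } }
end

leftover = delete_with_retries(%w[doc1 doc2 doc3]) { |ids| fake_bulk_delete.call(ids) }
puts "still failing: #{leftover.inspect} after #{attempts} attempts"
```

With the real client the block would call `index.bulk_delete(ids)` instead of the lambda. Bounding the number of attempts avoids looping forever on a document that can never be deleted.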

Delete by Search

With this method, you can delete a batch of documents that match a particular search query. You can use many of the same arguments applied to a normal search - start (which will preserve the results found before the value of start), scoring function, category filters, variables, and docvar filters.

query = "<YOUR QUERY STRING>"

index.delete_by_search query

Index management

You can create and delete indexes directly with the API client. These methods are equivalent to their corresponding actions in the dashboard. Keep in mind that index creation may take a few seconds.

To create an index, get a handle on it by name, call add on it, and wait until it is running:

require 'indextank'

api = IndexTank::Client.new "http://:iXJkJpnmlMIVk9@didwv.api.indextank.com"
index = api.indexes "<YOUR INDEX NAME>"

# this parameter allows you to create indexes with public search enabled.
# default is false. 
index.add :public_search => false

while not index.running?
    sleep 0.5
end

# use the index
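
The polling loop above waits indefinitely if the index never starts. A bounded variant could look like the following sketch, where `wait_until` is a hypothetical helper and a counter stands in for `index.running?` so the example runs without the service.

```ruby
# Poll the block until it returns true or the timeout (in seconds) expires.
# Returns true on success and false on timeout.
def wait_until(timeout, interval = 0.5)
  deadline = Time.now + timeout
  until yield
    return false if Time.now >= deadline
    sleep interval
  end
  true
end

# Stand-in condition: becomes true on the third check.
checks = 0
ready = wait_until(2, 0.01) { (checks += 1) >= 3 }
puts ready
```

With the client, the call would be something like `wait_until(30) { index.running? }`, with the timeout chosen to suit your setup.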

The delete method completely removes the index represented by the object:

index = api.indexes "<YOUR INDEX NAME>"
index.delete