Skip to content

StevenJL/rawscsi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

65 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Rawscsi

Build Status Code Climate Test Coverage

Ruby Amazon Web Services Cloud Search Interface (Rawscsi) is a flexible gem for searching and indexing AWS Cloud Search with the following characteristics:

  1. Maximal Flexibility: Rawscsi can construct very complex queries. For example, using Cloud Search's default movies search domain, Rawscsi can find the top three James Bond movies excluding 'Casino', staring either 'Connery', 'Craig', or 'Moore', but not 'Brosnan', released after 1970. This is how rawscsi would construct this query:
search_object.search(q: {
                          and: [
                                 { plot: "James Bond" }, 
                                 { not: { 
                                          title: "Casino"
                                        },
                                 },
                                 { or: [
                                         { actor: "Connery" },
                                         { actor: "Craig" },
                                         { actor: "Moore"}
                                       ]
                                 },
                                 { not: {
                                          actor: "Brosnan"
                                        }
                                 }]
                         },
                      date: {release_date: "['1970-01-01',}"},
                      sort: "rating desc",
                      limit: 3
                    )

In theory, if the query is allowed on Cloudsearch, this gem can construct it.

  1. Supports Cloud Search Version api 2013-01-01.

  2. Smart indexing of data to a search domain. Namely, it calculates the size of your batch and breaks it up into chunks, each of which is less than 5Mb (the upload limit).

  3. Has some nice Active Record integration features.

  4. Supports Ruby 1.9.3, 2.0.0, 2.1.0, and 2.1.2

  5. Supports Pagination

Here's the official aws cloud search documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/what-is-cloudsearch.html

Installation

Add this line to your application's Gemfile:

gem 'rawscsi'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rawscsi

Registering Search Domains

Suppose we have two search domains: Songs and Books. Let's say Song is an active_record model in our project. We first register them to Rawscsi.

Rawscsi.register 'Song' do |config|
  config.domain_name = 'good-songs'
  config.domain_id = 'akldfjakljf3894fjeaf9df'
  config.region = 'us-east-1'
  config.api_version = '2013-01-01' # rawscsi only supports api version 2013-01-01
  config.attributes = [:title, :album, :artist, :release_date] # since Song is an active model, we can specify the attributes we want indexed
end

Rawscsi.register 'Book' do |config|
  config.domain_name = 'good-books'
  config.domain_id = 'dj43g6i77dof86lk34fsf2s'
  config.region = 'us-east-1'
  config.api_version = '2013-01-01' # rawscsi only supports api version 2013-01-01
end

You can also specify AWS user credentials if you want to access the private serach domain. (note: search domain and AWS credentials specified below are fictional and serve only as an example)

Rawscsi.register 'SuperSecretDiaryEntry' do |config|
  config.domain_name = 'super-secret-diary-entries'
  config.domain_id = '12be09027f865cba2d57719'
  config.region = 'us-east-1'
  config.api_version = '2013-01-01' # rawscsi only supports api version 2013-01-01
  config.access_key_id = '7386b16f9bee30d6e5a98786'
  config.secret_key = 'af86958e6919a03713408fd9'
end

Uploading indices

song_indexer = Rawscsi::Index.new('Song', :active_record => true)
# Since Song is an active record object and we specified the attributes in the config, we can use the active record option
# Then we can just upload active record objects directly. Note this will use the active record id as the cloud search record id.
song_indexer.upload([
  # active record objects
  <id: 4567, title: "Saturdays Reprise", artist: "Cut Copy", album: "Bright Like Neon Love">
  <id: 5456, title: "Common Burn", artist: "Mazzy Star", album: "Seasons of Your Day">
  <id: 5346, title: "Honey Power", artist: "My Bloody Valentine">
  <id: 5762, title: "Introduction", artist: "Nick Drake">
])

book_indexer = Rawscsi::Index.new('Book')
# since Book is not an active record object, we upload hashes instead
# Note the id key in the hash  will be used as the cloud search record id.
book_indexer.upload([
  {id: 1234, title: "Surely you're Joking, Mr. Feynman", author: "Richard Feynman", publish_date: "1985-01-01T00:00:00Z"},
  {id: 5546, title: "Zen and the Art of Motorcycle Maintanence", author: "Robert Pirsig", publish_date: "1974-04-01T00:00:00Z"}
])
# Also note the date format

Deleting indices

song_indexer.delete([3244, 53452, 5435, 64545, 34342, 4545])
# To delete records in the search domain, you just pass an array of ids (the cloud search record id)

# you can also pass in an array of active record objects to be deleted
song_indexer.delete([
  # active record objects
  <id: 4567, title: "Saturdays Reprise", artist: "Cut Copy", album: "Bright Like Neon Love">
  <id: 5456, title: "Common Burn", artist: "Mazzy Star", album: "Seasons of Your Day">
])
Automatically Batches on cloud search's 5Mb Upload Limit

Rawscsi is smart enough to break up large batch uploads into chunks of 5 Mb (cloud search's upload size limit per post request).

Searching

Instantiate a search object

book_search_helper = Rawscsi::Search.new('Book')

song_search_helper = Rawscsi::Search.new('Song', :active_record => true)
# use the 'active record' option if 'Song' is a Active Record model in your project.
# Calling search will return an array of active record objects ordered by cloud search's rank score.

Simple Search

Simple search queries take one string as an argument and searches all text and text-array fields on the search domain.

search_books_helper.search('Richard Feynman')
  => [{id: 546, author: "Richard Feynman"  title: "Surely, You're Joking Mr. Feynman" },
      {id: 657, author: "Richard Feynman", title: "What Do You Care What Other People Think"}]

search_songs_helper.search('Air')
# since we initiated this search helper with the active record option
# it returns an array of active record objects
  => [<Song id:156, artist: "White Stripes", title: "The Air Near My Fingers">,
      <Song id:342, artist: "Air", title: "Mer du Japon">]

Compound Boolean Searches

Compound search queries are queries consisting of multiple constraints, linked together by either and or or or some combination of the two. Rawscsi's philosophy in this aspect is maximal flexibility: Any query that is possible on cloud search, should be possible to construct in Rawscsi.

Lets look at some sample compound boolean queries.

search_songs_helper.search(q: {
                                and: [{artist: "Daft Punk"}]
                              }
                           )
# Even though we're searching only one term, it's not a simple query because we're looking
# at just the artist field, not all text fields. Also note, even though its only one constrainst, we still use 'and'.
search_songs_helper.search(q: {
                                and: [{ title: 'Angel' }, { artist: 'Smith' }]
                              })
# Conjunction of two constraints is done using the `and` key
=> [{song_id: 12345, title: "Angel in the Snow", artist: "Elliot Smith"},
    {song_id: 43534, title: "Angel, Angel Down We Go Together", artist: "The Smiths"}]

By default, the search returns all the fields in the domain. But you can specify which fields to return.

search_songs.search(q: {and: [ {artist: "Cold Play"} ], 
                    fields: [:title])

=> [{ title: "Warning Sign"},
    { title: "God Put a Smile Upon Your Face"},
    { title: "Sparks"}]
Sort by location

Suppose your search domain has the latlon field for geographical location data. You can sort your results by geographical distance.

# returns all theathers playing "The Big Lebowski" sorted by geographical proximity
# where your location is 35.621966 latitude and -120.686706 longitude
movie_theaters.search(q: {films: "The Big Lebowski" },
            sort: "distance asc",
            location: {latitude: 35.621966, longitude: -120.686706}
            )
Weighing fields

You can weigh fields in your search request.

# You care much more about matches in the title, than matches in the plot.
movie_search.search(q: {and: [{title: "title", plot: "plot"}]}, weights: "{fields:['title^2','plot^0.5']}")

The default sort order is Amazon Cloudsearch's rank-score but you can use your own sort as well as a limiting the number of results.

search_songs.search(q: {and: [{genres: "Rock"}]}, sort: "rating desc", limit: 100)
# Top 100 Rock songs.

Here is an example of using a date constraint in conjunction ("and") with other constraints:

search_songs.search(q: {
                         and: [
                                { genres: "Hip Hop" }, 
                                { title: "Street"}
                               ]
                         },
                    date: { release_date: "['1995-01-01',}"}
                    limit: 5,
                    sort: "rating desc"
                   ) 
# Conjunction of two constraints and date constraint (Top 5 Hip Hop songs with Street in title released after 1995)
# Note date syntax is working in conjunction (and) with the frist two constraints. This is always the case.
# Also note syntax for searching date ranges: "['1995-01-01',}" is all dates after Jan 1 1995 while "{,'1995-01-01']" is all dates before.
# Note we're limiting only 5 results and sorting by rating in descending order.
=> [{song_id: 54642, title: "Street Dreams", artist: "Nas"},
    {song_id: 98786, title: "Street Struck", artist: "Big L"},
    {song_id: 54645, title: "Street Disputes", artist: "Wu-Tang Clan"},
    {song_id: 54542, title: "Streets is Watching", artist: "Jay-Z"},
    {song_id: 54644, title: "Street Corners", artist: "Wu-Tang Clan"}]

So far, we've only seen 'and' examples. Now let's look at some 'or' examples.

search_songs.search(q: {
                         or: [
                               { artist: "Digitalism"},
                               { artist: "Daft Punk"}, 
                               { artist: "Justice"}
                             ]
                        }
                    )
# Find the union of all the songs by these awesome electro-house bands
# you can also combine `and` and `or` constraints
search_songs.search(q: {
                          and: [
                                  { range: "rating:['9',}"}, 
                                  { or: [
                                          { artist: "Bob Dylan" },
                                          { artist: "Lorde"}
                                        ]
                                  }
                                ]
                         }
                     )
# You only want highly rated songs by either Bob Dylan or Lorde (cuz you know, Lorde is the new Bob Dylan)

And of course, negation:

# You love the song "All Along the Watchtower" but you didn't like the Dave Matthews Band cover 
search_songs.search(q: {
                         and: [ 
                                { title: "All Along the Watchtower"},
                                { not: { artist: "Dave Matthews Band" }
                              ]
                        }
                    )

By default, the search returns all the fields in the domain. But you can specify which fields to return.

search_songs.search(q: {and: [ {artist: "Cold Play"} ], 
                    fields: [:title])

=> [{ title: "Warning Sign"},
    { title: "God Put a Smile Upon Your Face"},
    { title: "Sparks"}]

Prefix Matching

Rawscsi supports prefix matching using the prefix key.

search_songs.search(q: {prefix: "To"})
=> [{song_id: 54967, title: "Aenima", artist: "Tool"},
    {song_id: 96566, title: "Lateralus", artist: "Tool"},
    {song_id: 32356, title: "Today is the Day", artist: "Yo La Tengo"}]

# you can specify the prefix for a certain field
search_songs.search(q: {and: [{genres: "80s"},
                              {prefix: {title: "Every"}})
=> [{song_id: 91485, title: "Everybody Wants to Rule the World", artist: "Tears for Fears"},
    {song_id: 96566, title: "Everything Counts", artist: "Depeche Mode"}]

Compound queries with phrase

Here's the AWS documentation on phrase matching in compound queries: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html#searching-text-phrases

Rawscsi supports the phrase condition for compound queries

search_songs.search(q: {and: [{phrase: {title: "Air Near"}}]})
=> [<Song id:156, artist: "White Stripes", title: "The Air Near My Fingers">]

Pagination

# paginate with the `start` and `limit` params
search_songs.search(q: {and: [{artist: "Beatles"}]},
                    start: 25,
                    limit: 5)

Request signing

Signature V4 guide from Amazon

Request signature is created by Rawscsi::RequestSignature class with simple public api with only a few methods:

  • #initialize - accepts the hash with the data of request to sign. Required keys are :secret_key, :access_key_id, :region_name, :endpoint, :method, :host. Optional keys are :debug, :payload, :service_name, :headers, :query.

  • #build - calculates and returns the hash with :signature key containing the headers to include in request. If :debug is passed on object creation, it also provides the debug data in :debug key, including of results of intermidiate steps.

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

License

(The MIT License)

Copyright © 2014 Steven Li

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.