---
title: A Thinking Man's Sphinx
created_at: 2008-07-14 12:21:05.990162 -04:00
layout: post
summary: Wherein our hero ponders the ineffable questions of life, the universe and full text search to see if he might not be able to eff them after all.
excerpt: We've recently switched a number of projects to "ThinkingSphinx":http://ts.freelancing-gods.com/ here at "Hashrocket":http://www.hashrocket.com. These projects were originally using SOLR or "UltraSphinx":http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/files/README.html. Today, we'll explore the differences between ThinkingSphinx and UltraSphinx and why we chose to switch.
filter:
  - erb
  - textile
---
 
We've recently switched a number of projects to "ThinkingSphinx":http://ts.freelancing-gods.com/ here at "Hashrocket":http://www.hashrocket.com. These projects were originally using SOLR or "UltraSphinx":http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/files/README.html. Today, we'll explore the differences between UltraSphinx and ThinkingSphinx and why we chose to switch.
 
UltraSphinx is written by "Evan Weaver":http://blog.evanweaver.com/. ThinkingSphinx is written by "Pat Allan":http://freelancing-gods.com/. They have some similarities: both use Sphinx (obviously); both are based on the underlying Ruby API for Sphinx, "Riddle":http://riddle.freelancing-gods.com/ (also by Pat Allan); both have excellent documentation and well-written tutorials. The similarities pretty much end there, however, and the differences are far more interesting.
 
h3. Basic Sphinx Configuration
 
Both plugins help you generate a sphinx.conf file for each of your Rails environments, but they do it in drastically different ways. ThinkingSphinx lets you use a configuration format you are already used to, at the expense of fewer configuration options. UltraSphinx is more flexible but less Rubyish.
 
h4. UltraSphinx
 
UltraSphinx generates the sphinx.conf file from a base configuration file. This base file uses the sphinx configuration syntax, passing it through ERB for some DRYness. A base file can be specified per-environment. It puts all of its configuration information in @RAILS_ROOT/config/ultrasphinx/@. This provides fine-grained - if rather tediously verbose - control over the multitude of Sphinx configuration options.
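
To give a feel for the approach, here is a minimal sketch of passing a Sphinx-style configuration fragment through ERB. The template contents, the paths, and the @render_base_config@ helper are my own illustration, not UltraSphinx's actual base file or internals.

```ruby
require "erb"

# Hypothetical base template: plain Sphinx syntax, with ERB tags for the
# values that differ between environments.
BASE_TEMPLATE = <<-CONF
indexer
{
  mem_limit = <%= mem_limit %>
}

searchd
{
  port = 3312
  log = <%= log_dir %>/searchd.log
  pid_file = <%= log_dir %>/searchd.<%= env %>.pid
}
CONF

# Render the template for one environment. The helper name and its
# arguments are illustrative only.
def render_base_config(env, log_dir, mem_limit)
  ERB.new(BASE_TEMPLATE).result(binding)
end

puts render_base_config("production", "/var/log/sphinx", "256M")
```

The output is ordinary Sphinx configuration text, which is why this style gives you access to every option Sphinx understands.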
 
h4. ThinkingSphinx
 
ThinkingSphinx uses a YAML configuration file that it locates at @RAILS_ROOT/config/sphinx.yml@. It accepts a YAML hash of configuration settings. These settings allow you to specify most of the basic Sphinx configuration options with ease but you may be out of luck if the option you need isn't available.
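
As a rough illustration of that format, the sketch below parses a per-environment YAML hash and looks up the settings for one environment. The keys and values are examples I picked, and @sphinx_settings@ is a hypothetical helper; check the ThinkingSphinx documentation for the options your version actually accepts.

```ruby
require "yaml"

# Hypothetical contents of config/sphinx.yml: one hash of settings per
# Rails environment. The keys shown are examples, not a complete list.
SPHINX_YML = <<-YML
development:
  port: 3312
  mem_limit: 64M
production:
  port: 3312
  mem_limit: 256M
  morphology: stem_en
YML

# Look up the settings hash for a single environment, falling back to an
# empty hash when the environment has no entry.
def sphinx_settings(env)
  YAML.load(SPHINX_YML).fetch(env, {})
end

p sphinx_settings("production")
```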
 
h3. Basic Index Configuration
 
Let's start with a basic example of a sphinx index declaration. Keep in mind that your indexes will likely be significantly more complex in the real world.
 
h4. UltraSphinx
 
UltraSphinx uses a declarative @is_indexed@ statement in the model that feels vaguely similar in style to an association or named scope declaration. This is the usage example given in the "README":http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/files/README.html:
 
<% coderay( :lang => "ruby") do -%>
class Post
  is_indexed :fields => ['created_at', 'title', 'body']
end
<% end %>
 
This seems simple enough for such a simple case. We'll see how it looks for less trivial cases.
 
h4. ThinkingSphinx
 
ThinkingSphinx, on the other hand, uses a @define_index@ block in the model to allow the individual index configuration options to be stated declaratively. The canonical example from UltraSphinx would look like this in ThinkingSphinx:
 
<% coderay( :lang => "ruby") do -%>
class Post
  define_index do
    indexes created_at, title, body
  end
end
<% end %>
 
The first thing you may notice is that the same index configuration is three lines in ThinkingSphinx instead of one in UltraSphinx. If you look closely, you'll also see that the field names are not symbols as you might expect but method calls. We'll get into why this is in a moment.
 
h3. Real World Index Configuration
 
Your real world applications are likely to require a significantly more complex index declaration to meet the search needs of your users. Let's look at an example of such a real world Sphinx index declaration.
 
h4. UltraSphinx
 
Here's an example of a more realistic UltraSphinx index configuration. This is the type of configuration you're likely to use on any non-trivial project.
 
<% coderay( :lang => "ruby") do -%>
class Post < ActiveRecord::Base
  belongs_to :blog
  belongs_to :category
 
  is_indexed :conditions => "posts.state = 'published'",
             :fields => [{:field => 'title', :sortable => true},
                             {:field => 'body'},
                             {:field => 'cached_tag_list'}],
             :include => [{:association_name => "blog",
                              :field => "title",
                              :as => "blog",
                              :sortable => true},
                             {:association_name => "blog",
                              :field => "description",
                              :as => "blog_description"},
                             {:association_name => "category",
                              :field => "title",
                              :as => "category",
                              :sortable => true}]
end
<% end %>
 
This is about as pretty as it's going to get - and that's not very pretty. Large, deeply nested hashes of arrays of hashes are not easily scannable and become increasingly difficult to maintain as their size and complexity grow.
 
h4. ThinkingSphinx
 
Let's look at that same example translated to ThinkingSphinx.
 
<% coderay( :lang => "ruby") do -%>
class Post < ActiveRecord::Base
  belongs_to :blog
  belongs_to :category
 
  define_index do
    indexes title, :sortable => true
    indexes body, cached_tag_list
 
    indexes blog.description, :as => :blog_description
    indexes blog.title, :as => :blog, :sortable => true
    indexes category.title, :as => :category, :sortable => true
 
    where "posts.state = 'published'"
  end
end
<% end %>
 
Not only did the number of lines decrease, the readability is far greater. I know which one I'd rather write. More importantly, I know which one I'd rather have to maintain weeks or months down the line when it needs to be modified.
 
Notice that the declarations use methods rather than symbols. ThinkingSphinx uses some interesting metaprogramming to allow this. Notice also that indexed fields on associations are specified in the same way you would access that field. Simple.
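
I haven't reproduced ThinkingSphinx's actual implementation here, but the general trick can be sketched with @instance_eval@ and @method_missing@. Everything below (@ColumnRef@, @IndexBuilder@, @define_index_sketch@) is a hypothetical, stripped-down stand-in for illustration only.

```ruby
# A column reference that remembers the chain of names used to reach it,
# so `blog.title` becomes the path [:blog, :title].
class ColumnRef
  attr_reader :path

  def initialize(path)
    @path = path
  end

  # Any unknown zero-argument method extends the path by one step.
  def method_missing(name, *args)
    return super unless args.empty?
    ColumnRef.new(path + [name])
  end
end

# Collects the fields and conditions declared inside the block.
class IndexBuilder
  attr_reader :fields, :conditions

  def initialize
    @fields = []
    @conditions = []
  end

  # Accepts one or more column references plus an optional options hash.
  def indexes(*args)
    options = args.last.is_a?(Hash) ? args.pop : {}
    args.each { |column| @fields << { :path => column.path, :options => options } }
  end

  def where(sql)
    @conditions << sql
  end

  # Bare names such as `title` are not defined anywhere, so they land here
  # and are turned into column references.
  def method_missing(name, *args)
    return super unless args.empty?
    ColumnRef.new([name])
  end
end

# Hypothetical entry point: evaluate the block with the builder as self.
def define_index_sketch(&block)
  builder = IndexBuilder.new
  builder.instance_eval(&block)
  builder
end

index = define_index_sketch do
  indexes title, :sortable => true
  indexes body, cached_tag_list
  indexes blog.title, :as => :blog, :sortable => true
  where "posts.state = 'published'"
end

index.fields.each { |f| puts "#{f[:path].join('.')} #{f[:options].inspect}" }
```

One caveat of this simplified version: a column whose name matches a method the builder already defines (@fields@, for example) would never reach @method_missing@, so a real implementation has to handle that case separately.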
 
h3. Sphinx Rake Tasks
 
Both UltraSphinx and ThinkingSphinx provide a number of rake tasks for common Sphinx chores: generating the configuration file, building the index, and starting, stopping, and restarting the searchd daemon. Both provide abbreviations for the more common tasks, such as @ts:in@ for @thinking_sphinx:index@ or @us:conf@ for @ultrasphinx:configure@.
 
h3. Deployment and Configuration Management
 
Both UltraSphinx and ThinkingSphinx are pretty simple to deploy. You should symlink your configuration file from a shared location into your app's path after deployment, just as you probably do for your @database.yml@ file. You will probably want to run the configuration task after you update the code. Here, for instance, is a Capistrano task to run your ThinkingSphinx configuration task:
 
<% coderay( :lang => "ruby") do -%>
namespace :sphinx do
  desc "Generate the ThinkingSphinx configuration file"
  task :configure do
    # Run the rake task inside the freshly deployed release
    run "cd #{release_path} && rake thinking_sphinx:configure"
  end
end
<% end %>
 
You'll want to have this task run after each deployment:
 
<% coderay( :lang => "ruby") do -%>
after "deploy:update_code", "sphinx:configure"
<% end %>
 
You can create other tasks relatively easily for reindexing and managing the searchd daemon. I found a good guide to "deploying a Rails app with ThinkingSphinx":http://www.updrift.com/article/deploying-a-rails-app-with-thinking-sphinx linked from Pat Allan's blog. I found a useful set of "UltraSphinx Capistrano tasks":http://github.com/ruberion/ruberion_server_tools/tree/master/recipes/tasks/ultrasphinx.rb in Ruberion's server tools plugin on GitHub. If you choose to host with "EngineYard":http://www.engineyard.com/, they can manage either configuration for you with their pre-baked builds and deploy tasks.
 
h3. Real World Experience
 
We ran into a number of issues when setting up UltraSphinx:
 
# <p>UltraSphinx loads your models without loading the full rails environment. This means that if
  your models depend on any of your lib files or any gems frozen in vendor/gems, you will have to
  require all of these files explicitly in each model. This is a pain.</p>
# <p>The fundamentally sound design and code of UltraSphinx are somewhat undermined by poorly
  implemented exception handling. This means that while most of the time things work swimmingly,
  when they fail you're really sunk! The errors that you receive are often useless in diagnosing
  the actual problem.</p>
# <p>We had bugs in our index that only existed on our staging and production slices. These caused
  page counts to be incorrect and nil records to be returned in certain cases. Occasionally they
  also caused 5xx errors.</p>
  
h4. Moving To ThinkingSphinx
 
After another Hashrocket team had success moving their project from SOLR to ThinkingSphinx, I decided to move our project as well. Moving to ThinkingSphinx proved to be a relatively painless experience. The process was essentially five-fold:
 
# <p>Uninstall UltraSphinx and install ThinkingSphinx.</p>
# <p>Translate your @is_indexed@ declaration into a @define_index@ block and change your search actions to use the ThinkingSphinx API.</p>
# <p>Rewrite your deployment tasks to run the ThinkingSphinx rake tasks.</p>
# <p>Stop searchd and then run your new configure, index and searchd start tasks.</p>
# <p>PROFIT!</p>