Fuzzy Matching Search for Semantic Data #264

Closed
d3mon187 opened this issue Apr 4, 2014 · 3 comments

d3mon187 commented Apr 4, 2014

I have a site I'm working on with millions of pages of data (business listings). Each listing has the business name, categories, location data, and keywords, all stored as semantic data. I have a semantic search where a person can enter a "what" string (name, category, or keyword) and a "where" string (city, state, or zip). The search of course needs to be forgiving (e.g. matching "restaurant" to "restaurants", and "san antonio, tx" to "San Antonio, Texas").

So far I've achieved this by creating semantic properties with strings like "italian; restaurants; olive garden" and "san antonio, tx 78211; san antonio, texas 78211", and then using an #ask query like {{#ask:[[SearchArea::~*{{lc:{{{Where|}}}}}*]][[SearchLocation::~*{{lc:{{{What|}}}}}*]]}}. It's fairly forgiving and works well, but after a few thousand records it becomes slow, because a pattern with a leading wildcard cannot use an index and every value has to be scanned. So I've come to the conclusion that I either need to split all the words up, get rid of the wildcards, and add things like "s" and "es" to the end of the user's search words, or get help with a better solution involving indexing or something else I haven't thought of.
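The first option (split the words, drop the wildcards, append "s" and "es") could be sketched roughly as follows. This is only an illustration of the idea in Python, not SMW code; the function names `index_keywords`, `expand_terms`, and `matches` are made up for the example, and the suffix handling is deliberately naive.

```python
import re

def index_keywords(property_value: str) -> set[str]:
    """Split a stored string such as "italian; restaurants" into lowercase words."""
    return set(re.findall(r"[a-z0-9]+", property_value.lower()))

def expand_terms(query: str) -> list[set[str]]:
    """Split the user's query into words, each with naive "s"/"es" variants."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [{word, word + "s", word + "es"} for word in words]

def matches(keywords: set[str], query: str) -> bool:
    """True when every query word, or one of its variants, is an exact keyword."""
    return all(variants & keywords for variants in expand_terms(query))

keywords = index_keywords("italian; restaurants; olive garden")
print(matches(keywords, "Italian restaurant"))  # exact lookups only, no wildcards
```

Because every comparison is an exact match on a single word, each variant could be looked up through an ordinary index instead of a wildcard scan. The sketch only expands the query side, so a plural query against a singular keyword would still miss.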

I'm sure I'm not the only one searching for a better semantic search solution, so hopefully someone can help me out. I will also happily fund a solution if I can get some help quickly with this. Thanks!


d3mon187 commented Apr 6, 2014

So I changed my smw_di_blob table to MyISAM, utf8_general_ci, and changed o_hash to varchar(255). Then added a fulltext index on o_hash.

Running the queries manually seems to get good speed, but obviously I now need an option for the ask query to generate MATCH(t0.o_hash) AGAINST ('keyword' IN BOOLEAN MODE), with a sort by match score desc.
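For reference, the kind of statement described above might be built like this. It is a sketch under assumptions: the table `smw_di_blob`, the column `o_hash`, and the alias `t0` come from the comments above, while the selected `s_id` column, the function name `build_fulltext_query`, and the `%s` placeholder style (as used by common Python MySQL drivers) are assumptions for the example. Nothing here is generated by SMW itself.

```python
import re

def build_fulltext_query(user_input: str) -> tuple[str, tuple[str, str]]:
    """Return a parameterized MATCH ... AGAINST query sorted by match score."""
    # Keep only word characters so that boolean-mode operators typed by the
    # user (+, -, *, quotes, parentheses) never reach the search expression.
    words = re.findall(r"\w+", user_input.lower())
    # "+word*" makes each word required and lets it match as a prefix,
    # so "restaurant" also finds "restaurants".
    against = " ".join(f"+{word}*" for word in words)
    sql = (
        "SELECT t0.s_id, "  # s_id is an assumed column name
        "MATCH(t0.o_hash) AGAINST (%s IN BOOLEAN MODE) AS score "
        "FROM smw_di_blob AS t0 "
        "WHERE MATCH(t0.o_hash) AGAINST (%s IN BOOLEAN MODE) "
        "ORDER BY score DESC"
    )
    # The search expression is bound twice: once for the score, once for the filter.
    return sql, (against, against)

sql, params = build_fulltext_query("Italian restaurant")
print(sql)
print(params)
```

The explicit `ORDER BY score DESC` matters because boolean-mode searches do not sort rows by relevance on their own. One caveat: with MyISAM's default `ft_min_word_len` of 4, short tokens such as "tx" are not indexed, so the "where" strings may need a different treatment.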

Does this seem like a viable solution to the problem? What files would I need to go about changing to add the new query?

JeroenDeDauw (Member) commented

> Does this seem like a viable solution to the problem?

As specified, this cannot go into SMW. If we want to support such functionality, we'd need to do it well, and not by "abusing" our current text table, which is not meant to be used like this. However, if it works well enough for you, then you can always create such a hack for yourself. The obvious downside is that you'll need to maintain it yourself and update it for each upgrade you make.

> What files would I need to go about changing to add the new query?

You'd want to add a new comparator to SMWValueDescription (in includes/query/SMW_Description.php) and extend the Ask wikitext parser appropriately. Unfortunately this code is not as clear as it could be. I'd hoped #209 would fix that, though unfortunately it did not.
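To make the shape of that change a little more concrete, here is a toy model of the two pieces: a value description that carries a comparator, and a parser step that recognizes a new comparator prefix. It is written in Python purely for illustration and does not mirror the actual SMW classes. The `~` prefix is the one used in the query above; the `~~` prefix, the comparator names, and `parse_condition_value` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical prefix table: "~" is the existing wildcard comparator,
# "~~" is an invented marker for a full-text comparator.
COMPARATORS = {"~~": "fulltext", "~": "like"}

@dataclass(frozen=True)
class ValueDescription:
    """Toy stand-in for a description holding a comparator and a value."""
    comparator: str
    value: str

def parse_condition_value(raw: str) -> ValueDescription:
    """Split the value part of a [[Property::value]] condition into comparator and value."""
    text = raw.strip()
    # Check longer prefixes first so "~~" is not mistaken for "~".
    for prefix in sorted(COMPARATORS, key=len, reverse=True):
        if text.startswith(prefix):
            return ValueDescription(COMPARATORS[prefix], text[len(prefix):].strip())
    return ValueDescription("equals", text)

print(parse_condition_value("~*restaurant*"))
print(parse_condition_value("~~restaurant"))
```

In a real patch the query engine would also need to translate the new comparator into SQL, which is the part that depends on the modified table.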


mwjames commented Aug 12, 2017

> Fuzzy Matching Search

Rather than fuzziness or "Fuzzy Matching Search", this issue is more about support for case-insensitive matching, which got addressed in:

> query to generate MATCH(t0.o_hash) AGAINST ('keyword' IN BOOLEAN MODE),

#1481

> I changed my smw_di_blob table to MyISAM, utf8_general_ci, and changed o_hash to

#2499

mwjames closed this as completed Aug 12, 2017