Add "native" full-text search support (MySQL/MariaDB) #1481

Merged
merged 1 commit into from Aug 7, 2016

@mwjames
Contributor

mwjames commented Mar 28, 2016

Problem

LIKE/NLIKE matches on value instances longer than 72 characters are currently not supported [0], and even if that length limit were extended, it would most likely create performance issues [1].

Solution

This PR adds a MySQL fulltext index to improve performance and match condition support on text content (using LIKE/NLIKE on text fields can be an expensive endeavour [1], is case sensitive, etc.).

All ~/!~ operations remain; when $GLOBALS['smwgEnabledFulltextSearch'] is enabled, they are run against the new smw_ft_search table using the MATCH ... AGAINST query syntax of the MySQL/MariaDB fulltext implementation.

When this feature is enabled, setupStore.php must be run first, followed by the rebuildFulltextSearchTable.php script.

This experimental feature is not enabled by default ($GLOBALS['smwgEnabledFulltextSearch'] = false;) and is only implemented for MySQL/MariaDB.
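
A minimal LocalSettings.php fragment for switching the feature on might look as follows. The setting name and its default are taken from this description; the script order in the comments follows the paragraph above, and script paths are left out because they depend on the installation.

```php
// LocalSettings.php (fragment), placed after Semantic MediaWiki is loaded.

// Disabled by default; set to true so that ~/!~ conditions use the
// smw_ft_search table (MySQL/MariaDB only).
$GLOBALS['smwgEnabledFulltextSearch'] = true;

// After changing the setting, run the maintenance scripts in this order:
//   1. setupStore.php                  (table setup)
//   2. rebuildFulltextSearchTable.php  (fills the index)
```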

Features and limitations

  • Text value indexing happens independently of the existing tables, using a dedicated smw_ft_search consolidation table; text values are only updated when CompositePropertyTableDiffIterator returns an appropriate update entry
  • Fulltext match search is only executed when the ~/!~ expressions are used
  • Text values longer than 255 characters (DIBlob types) can be searched
  • Support for phrase matching
  • IN BOOLEAN MODE[3] is enabled by default for queries
  • Other MySQL search modifiers are supported by adding &BOOL (IN BOOLEAN MODE), &INL (IN NATURAL LANGUAGE MODE), and &QE (WITH QUERY EXPANSION) to a condition statement (e.g. [[Has text::~+foo, -bar&BOOL]]) [4]
  • Only DIBlob values are selected for the index process
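
To illustrate the search modifiers from the list above, here is a rough sketch of how a condition value could be turned into a MATCH ... AGAINST clause. This is not the code from this PR: the function name `buildMatchClause` and the column name `o_text` are assumptions made for the example, and only the three suffixes and the boolean-mode default come from the description.

```php
<?php
/**
 * Hypothetical helper, not SMW code: maps a condition value such as
 * "+foo, -bar&BOOL" to a MATCH ... AGAINST clause and its search string.
 */
function buildMatchClause( string $value, string $column = 'o_text' ): array {
    // Suffixes described in the feature list and their MySQL search modifiers.
    $modifiers = [
        '&BOOL' => 'IN BOOLEAN MODE',
        '&INL'  => 'IN NATURAL LANGUAGE MODE',
        '&QE'   => 'WITH QUERY EXPANSION',
    ];

    // IN BOOLEAN MODE is the default when no suffix is given.
    $mode = 'IN BOOLEAN MODE';

    foreach ( $modifiers as $suffix => $sqlModifier ) {
        if ( substr( $value, -strlen( $suffix ) ) === $suffix ) {
            $mode = $sqlModifier;
            $value = substr( $value, 0, -strlen( $suffix ) );
            break;
        }
    }

    // The search string is returned separately so it can be bound as a parameter.
    return [ "MATCH($column) AGAINST (? $mode)", trim( $value ) ];
}

[ $sql, $search ] = buildMatchClause( '+foo, -bar&BOOL' );
echo $sql, "\n";    // MATCH(o_text) AGAINST (? IN BOOLEAN MODE)
echo $search, "\n"; // +foo, -bar
```

Returning the search string apart from the SQL fragment keeps user input out of the query text, which seems the safer choice for a sketch like this.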

Settings

  • smwgEnabledFulltextSearch must be set to true, with update.php / setupStore.php being executed thereafter, followed by rebuildFulltextSearchTable.php to rebuild the index table
  • smwgFulltextSearchTableOptions (internal table settings)
  • smwgFulltextSearchMinTokenSize (default = 3) is expected to correspond to either innodb_ft_min_token_size or ft_min_word_len (this helps us to switch back to LIKE in cases where the min threshold is not applicable)
  • rebuildFulltextSearchTable.php is provided to collect and rebuild the text index in its entirety
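
The role of smwgFulltextSearchMinTokenSize can be pictured with a small sketch. It assumes (this is not taken from the PR code) that a term shorter than the minimum token size is sent to LIKE instead of MATCH ... AGAINST; the function name `useFulltextMatch` is made up for the example, and `mb_strlen` needs the mbstring extension.

```php
<?php
/**
 * Hypothetical sketch, not SMW code: decide whether a ~/!~ search term can be
 * answered by the fulltext index or should fall back to LIKE.
 */
function useFulltextMatch( string $term, int $minTokenSize = 3 ): bool {
    // Drop wildcard, boolean-operator and quote characters before measuring.
    $plain = trim( str_replace( [ '*', '+', '-', '"' ], '', $term ) );

    // Words below the minimum token size are not indexed by the backend,
    // so a MATCH ... AGAINST query could not find them.
    return mb_strlen( $plain ) >= $minTokenSize;
}

var_dump( useFulltextMatch( 'database' ) ); // bool(true)
var_dump( useFulltextMatch( 'db*' ) );      // bool(false), i.e. use LIKE
```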

Technical notes

After a subject is stored, the CompositePropertyTableDiffIterator contains all elements that have been deleted or inserted during the storage transaction, and FulltextSearchTableUpdater::addUpdatesFromPropertyTableDiff filters and concatenates those values that are eligible for index storage (in the smw_ft_search table). Any indexing, tokenizing, stopword, or stemmer application is left to the MySQL/MariaDB backend.
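
As a rough illustration of the "filter and concatenate" step, consider the sketch below. It is not the actual FulltextSearchTableUpdater code: the row shape (`s_id`, `p_id`, `text`), the grouping by subject and property, and the function name are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch, not SMW code: collect the text values inserted during
 * a storage transaction and concatenate them per subject and property.
 *
 * Each row is assumed to look like
 * [ 's_id' => int, 'p_id' => int, 'text' => string ].
 */
function concatenateInsertedTexts( array $insertedRows ): array {
    $texts = [];

    foreach ( $insertedRows as $row ) {
        // Filter: skip rows without a usable text value.
        if ( !isset( $row['text'] ) || trim( $row['text'] ) === '' ) {
            continue;
        }

        $key = $row['s_id'] . ':' . $row['p_id'];

        // Concatenate: one consolidated text per subject/property pair.
        $texts[$key] = isset( $texts[$key] )
            ? $texts[$key] . ' ' . $row['text']
            : $row['text'];
    }

    return $texts;
}

print_r( concatenateInsertedTexts( [
    [ 's_id' => 42, 'p_id' => 7, 'text' => 'MariaDB overview' ],
    [ 's_id' => 42, 'p_id' => 7, 'text' => 'Sphinx search' ],
    [ 's_id' => 42, 'p_id' => 9, 'text' => '' ],
] ) );
```

The consolidated strings would then be what gets written to the index table, with tokenizing left to the backend as noted above.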

[0] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/includes/storage/SQLStore/SMW_DIHandler_Blob.php#L17-L34
[1] http://stackoverflow.com/questions/224714/what-is-full-text-search-vs-like and http://stackoverflow.com/questions/5629491/mysql-full-text-search-vs-like
[2] http://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
[3] http://dev.mysql.com/doc/refman/5.7/en/fulltext-boolean.html
[4] http://www.drdobbs.com/database/full-text-search-with-innodb/231902587?pgno=2

@mwjames mwjames added the feature label Mar 28, 2016

@mwjames

Contributor

mwjames commented Mar 28, 2016

MySQL/MariaDB

  • "Full-text indexes can be used only with InnoDB or MyISAM tables, and can be created only for CHAR, VARCHAR, or TEXT columns" [2]
  • MySQL Full-Text Restrictions
  • As of MySQL 5.7.6, MySQL provides a built-in full-text ngram parser that supports Chinese, Japanese, and Korean (CJK), and an installable MeCab full-text parser plugin for Japanese. [2]
  • While not tested in connection with SMW, there is a possibility to use the Sphinx search engine with MySQL to improve performance and query features [4, 5, 6, 7]

[4] http://sphinxsearch.com/blog/2014/02/07/use-sphinx-with-mysql/
[5] http://sphinxsearch.com/blog/2014/11/05/sphinx-search-quick-tour-using-a-mysql-datasource/
[6] http://www.ibm.com/developerworks/library/os-sphinx/
[7] http://journal.code4lib.org/articles/9793

@mwjames

Contributor

mwjames commented Mar 28, 2016

@kghbln @JeroenDeDauw The 2016 Easter egg, q-0104.json contains the integration test with similar examples such as:

Examples

Data

{{#subobject:
 |Has text=MySQL vs MariaDB database
}}{{#subobject:
 |Has text=Oracle vs MariaDB database
}}{{#subobject:
 |Has text=PostgreSQL vs MariaDB database
}}{{#subobject:
 |Has text=MariaDB overview 
}}{{#subobject:
 |Has text=Elastic search
}}{{#subobject:
 |Has text=Sphinx search
}}

Queries

image

image

image

image

image

@mwjames

Contributor

mwjames commented Mar 28, 2016

If a long text is stored as in the following example (which I don't recommend but then again what do I know):

image

One can now easily ask for it (using phrase matching, indicated by " ... "; also note the lower case):

{{#ask: [[Has text::~"probably first invented by plato"]]
 |?Has text
}}

which returns the subject:

image

@@ -120,7 +122,7 @@ private static function createTable( $tableName, array $fields, $db ) {
* @param DatabaseBase|Database $db
* @param object $reportTo Object to report back to.
*/
-private static function updateTable( $tableName, array $fields, $db, $reportTo ) {
+private static function updateTable( $tableName, array $fields, $tableOptions = array(), $db, $reportTo ) {

@JeroenDeDauw

JeroenDeDauw Mar 28, 2016

Member

$tableOptions is unused?

@mwjames

mwjames Mar 29, 2016

Contributor

Just need this for a8312dc.

@mwjames mwjames added this to the SMW 2.5 milestone Apr 8, 2016

@mwjames

Contributor

mwjames commented Aug 2, 2016

For people who want to test this early on (I remember that some have asked):

  • Requires onoi\tesa therefore pulling via composer/semantic-media-wiki:dev-fulltext is advised
  • Updates to the fulltext index are posted either via DeferredRequestDispatchManager or if that isn't successful then SearchTableUpdateJob is scheduled (so a cron job for SearchTableUpdateJob on a regular basis should help with any possible lag)
  • Make sure that your backend actually supports the fulltext feature (MySQL 5.5+/MariaDB 10.0.5+)
  • smwgEnabledFulltextSearch must be set to true
  • smwgFulltextSearchTableOptions can be used to tweak the table/index characteristics in case one uses InnoDB
  • smwgFulltextDeferredUpdate is used to "Throttle the amount of expected index updates"; by default (true), index updates are posted using a deferred update process
  • rebuildFulltextSearchTable.php is expected to be executed first (since the index update only acts on changes) and should produce output similar to what is shown below
  • If certain text or URI type properties should not be indexed then smwgFulltextSearchPropertyExemptionList needs to be extended
  • Fulltext index search is only supported for conditions using the ~/!~ expressions, such as [[Foo::~some]] or [[Bar::!~some]]. The newly added ~~ special condition expression allows a broad proximity search and can be used if one is unsure about the property a value is assigned to. It searches all stored texts in the fulltext index for matches and returns the matching subjects, but you have to select the property manually in the printrequest.
  • onoi\tesa brings some rudimentary CJK support without requiring to install extra extensions or software (and again the support is rudimentary based on the selected Tokenizer)
The script rebuilds the search index from property tables that
support a fulltext search. Any change of the index rules (altered
stopwords, new stemmer etc.) and/or a newly added or altered table
requires to run this script again to ensure that the index complies
with the rules set forth by the DB or Sanitizer.

- ICU (Intl) PHP-extension         54.1
- Tesa::Sanitizer                  0.2
- Tesa::Transliterator             0.2
- Tesa::LanguageDetector           (disabled)

The following properties are exempted from the fulltext search index.

- _ASKFO, _ASKST, _IMPO, _LCODE, _UNIT, _CONV, _TYPE, __sil_iwl_lang
- __sil_ill_lang

The entire index table is going to be purged first and
it may take a moment before the rebuild is completed due to
dependencies on table content and selected options.

Abort the rebuild with control-c in the next five seconds ...  0

The entire 'smw_ft_search' table was purged.

Rebuilding the text index from (rows finished/expected):

- smw_di_blob                       100% (1361/1361)
- smw_di_uri                        100% (561/561)
- smw_fpt_pval                      100% (14/14)
- smw_fpt_list                      100% (3/3)
- smw_fpt_serv                      100% (1/1)
- smw_fpt_uri                       100% (1/1)
- smw_fpt_text                      100% (72/72)
- smw_fpt_dtitle                    100% (723/723)
- smw_fpt_media                     100% (10/10)
- smw_fpt_mime                      100% (10/10)

@jaideraf

Member

jaideraf commented Aug 2, 2016

@mwjames how can I resolve the following issue with composer?
In composer.json I have "mediawiki/semantic-media-wiki": "dev-fulltext"
when I run composer update

> ComposerHookHandler::onPreUpdate
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Installation request for mediawiki/semantic-media-wiki dev-fulltext -> satisfiable by mediawiki/semantic-media-wiki[dev-fulltext].
    - mediawiki/semantic-media-wiki dev-fulltext requires onoi/tesa @dev -> satisfiable by onoi/tesa[dev-master] but these conflict with your requirements or minimum-stability.
@jaideraf

Member

jaideraf commented Aug 2, 2016

Oh, I can't test this. I use MySQL 5.5.32 and I can't upgrade it. So, nevermind... 😥

@mwjames

Contributor

mwjames commented Aug 3, 2016

I use MySQL 5.5.32 and I can't upgrade it.

In case of MyISAM (smwgFulltextSearchTableOptions uses that as standard Engine option) then 5.5 should be fine according to [0].

how can I resolve the following issue with composer?
In composer.json I have "mediawiki/semantic-media-wiki": "dev-fulltext"

It operates in dev mode therefore adding "minimum-stability": "dev", to your json file should satisfy the "... but these conflict with your requirements or minimum-stability ..." requirement.

Also, you need to run update.php to make sure that the new ft_search table is created.

[0] https://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

@mwjames

Contributor

mwjames commented Aug 3, 2016

@jaideraf

Member

jaideraf commented Aug 3, 2016

In case of MyISAM (smwgFulltextSearchTableOptions uses that as standard Engine option) then 5.5 should be fine according to [0].

Great! @mwjames 👍

I am excited about this feature, but I will have to wait. I will give it a try when it arrives on master. I can't handle Composer. Composer just complains and complains and complains...

In my understanding SRF-dev requires SMW-dev and does not accept SMW-dev-fulltext.
SMW-dev-fulltext requires wikimedia/cdb 1.4, but I have MW 1.27 which requires wikimedia/cdb 1.3.
I install wikimedia/cdb 1.4 in MW 1.27 and I try to install SMW-dev-fulltext, but SRF-dev complains...

So it's too much for me; I am not expert enough in managing Composer or dependencies to try this feature right now (but I am happy that this is coming soon). 😄

mwjames added a commit to onoi/tesa that referenced this pull request Aug 4, 2016

@mwjames

Contributor

mwjames commented Aug 4, 2016

SMW-dev-fulltext requires wikimedia/cdb 1.4, but I have MW 1.27 which requires wikimedia/cdb 1.3.
I install wikimedia/cdb 1.4 in MW 1.27

Of course, MW pins its library releases, which makes it impossible to retrieve a newer library version. In light of this discovery, tesa now only requires ~1.0 to match the deployed MW version. onoi/tesa@70b353c

Add "native" fulltext search support (MySQL/MariaDB)
This experimental feature is not enabled by default
(`$GLOBALS['smwgEnabledFulltextSearch'] = false;`) and is only implemented
for MySQL/MariaDB.

On the occasion that this feature was enabled, `setupStore.php` is required
to be run first followed by the `rebuildFulltextSearchTable.php` script.

@mwjames mwjames merged commit 45885fe into master Aug 7, 2016

3 checks passed

Scrutinizer: 20 new issues, 150 updated code elements
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@mwjames mwjames deleted the fulltext branch Aug 7, 2016

@mwjames

Contributor

mwjames commented Aug 7, 2016

Okay, this has landed now in master.

Neither the normal query process nor its results should change (this is also what the integration tests indicate). For those who enable smwgEnabledFulltextSearch, see my comments.

I tried to make this PR less invasive (compared to other parts of SMW). There's always room for improvement, but trying to make those improvements on this PR itself seems rather counterproductive.

As a side note: For those who try to use/enable TextCatLanguageDetector (which relies on wikimedia/textcat [0] to make predictions about a text and its associated language), it showed a rather unfavourable performance impact and has therefore been disabled by default. Also, I removed CdbNGramLanguageDetector from onoi/tesa for now as it would require some time investment. All this just means that smwgFulltextLanguageDetection shouldn't be used in production.

[0] https://lists.wikimedia.org/pipermail/wikitech-l/2016-July/086096.html

@jaideraf

Member

jaideraf commented Aug 7, 2016

As @mwjames commented [0], the fulltext search support seems to be working fine (with the default settings and the default tests) with MySQL 5.5.32 [1].

[0] #1481 (comment)
[1] http://wikincat.org/wiki/User:Jaider/Text_examples?uselang=en

@mwjames

Contributor

mwjames commented Aug 7, 2016

commented [0], the fulltext search support seems to be working fine (with the default settings and the default tests) with MySQL 5.5.32 [1].

To extend on the special ~~ proximity marker.

As one can see below, [[~~maria*]] does not contain a property and only declares "everything that contains maria plus something else".

Results are returned for all subjects that contain maria* in any of the "legitimate" properties on those subjects. ~~ is seen as a wide or broad proximity marker for when a related property is unknown to the person who executes the query.

Results only tell us that those listed subjects contain maria* in some form on one or more of the annotated properties, yet it will not allow a conclusion about which property contains the content at this step and requires a narrower search hence the "wide or broad proximity" characteristics.

image

@mwjames mwjames referenced this pull request Aug 7, 2016

Merged

FTS to retain spaces on +/- operators, refs #1481 #1762

2 of 2 tasks complete

mwjames added a commit that referenced this pull request Aug 7, 2016

Merge pull request #1762 from SemanticMediaWiki/fts
FTS to retain spaces on +/- operators, refs #1481
@kghbln

Member

kghbln commented Aug 8, 2016

Now live on sandbox. I have to note that even without setting $smwgEnabledFulltextSearch = true; one has to run "update.php" when updating from an earlier version of SMW.

@mwjames

Contributor

mwjames commented Aug 8, 2016

one has to run "update.php" when updating from an earlier version of SMW.

Right, the TableBuilder doesn't access smwgEnabledFulltextSearch and creates all tables according to their definition. Whether a table is used or not is decided later by the application.

@mwjames

Contributor

mwjames commented Aug 8, 2016

@kghbln Did you run rebuildFulltextSearchTable.php?

@kghbln

Member

kghbln commented Aug 8, 2016

Did you run rebuildFulltextSearchTable.php?

Yes, see also my e-mail.

@mwjames

Contributor

mwjames commented Aug 8, 2016

onoi\tesa brings some rudimentary CJK support without requiring to install extra extensions or software (and again the support is rudimentary based on the selected Tokenizer)

As a side note:

The ICU version on the sandbox is rather old (52.1 from 2013-10-09) [0], which makes https://github.com/onoi/tesa/blob/master/src/Tokenizer/IcuWordBoundaryTokenizer.php work sub-optimally on http://sandbox.semantic-mediawiki.org/wiki/Issue/1481_%28Fulltext%29/CJK_examples.

I used 54.1 for testing (https://github.com/onoi/tesa/blob/master/tests/phpunit/Unit/Tokenizer/IcuWordBoundaryTokenizerTest.php#L21).

[0] http://site.icu-project.org/download/52

mwjames added a commit that referenced this pull request Aug 12, 2016

mwjames added a commit that referenced this pull request Aug 18, 2016

Merge pull request #1801 from SemanticMediaWiki/fulltext-sqlite
Add "native" fulltext search support (SQLite), refs #1481
@kghbln

Member

kghbln commented Sep 14, 2016

REM Related: Annotation / Query

@kghbln kghbln changed the title from Add "native" fulltext search support (MySQL/MariaDB) to Add "native" full-text search support (MySQL/MariaDB) Dec 19, 2016

@mwjames

Contributor

mwjames commented Dec 31, 2016

Notes on stopword list

As shown by the example [0], a query on a single-word condition such as {{#ask:[[Has text::~*enough*]]}} is not expected to return any results when the word is part of the default MySQL/MariaDB stopword list [1, 2].

The general suggestion is to avoid arbitrary searches for words listed in [1, 2]; otherwise the standard MySQL/MariaDB configuration has to be altered (as described in the documentation).

For those listed stopwords [1, 2], SMW relies on the back-end to return appropriate matches and it cannot manipulate the DB settings for queries that contain those "system" stopwords.

[0] https://sandbox.semantic-mediawiki.org/wiki/TestSelectingPagesQuery
[1] https://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html
[2] https://mariadb.com/kb/en/mariadb/stopwords/
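
To make the stopword note more concrete, a wiki-side check could look roughly like the sketch below. This is not part of SMW: the function name `isStopwordCondition` is made up, and the two-word list is only an illustrative excerpt, so the real lists in [1, 2] (or a customised backend configuration) remain authoritative.

```php
<?php
/**
 * Hypothetical sketch, not SMW code: flag single-word conditions that the
 * backend would ignore because the word is a stopword.
 */
function isStopwordCondition( string $condition, array $stopwords ): bool {
    // Strip the ~ operator and wildcards, e.g. "~*enough*" => "enough".
    $word = strtolower( trim( $condition, "~* \t" ) );

    // Only single-word conditions are checked here.
    if ( $word === '' || strpos( $word, ' ' ) !== false ) {
        return false;
    }

    return in_array( $word, $stopwords, true );
}

// Illustrative excerpt only; the full default lists are in [1, 2].
$stopwords = [ 'about', 'enough' ];

var_dump( isStopwordCondition( '~*enough*', $stopwords ) ); // bool(true)
var_dump( isStopwordCondition( '~*maria*', $stopwords ) );  // bool(false)
```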

@kghbln

Member

kghbln commented Mar 11, 2017

@mwjames Today php rebuildFulltextSearchTable.php --quiet --with-maintenance-log returned "543210" after completion on smw.o. This has never happened on s.smw.o though admittedly I use php rebuildFulltextSearchTable.php --quiet --quick --with-maintenance-log over there which is what it should be in the end.

@mwjames

Contributor

mwjames commented Mar 11, 2017

Today php rebuildFulltextSearchTable.php --quiet --with-maintenance-log returned "543210" after completion on smw.o.

You mean "543210" was an output you did not expect because of the --quiet option?

@kghbln

Member

kghbln commented Mar 11, 2017

You mean "543210" was an output you did not expect because of the --quiet option?

Indeed.

@mwjames

Contributor

mwjames commented Mar 11, 2017

I'm diving blind on this as I'm not sure what "543210" is supposed to mean, and --quiet normally enforces a non-verbose state. What happens when you run the script without --quiet? Does "543210" appear somewhere on the command line during processing?

@kghbln

Member

kghbln commented Mar 11, 2017

I'm not sure what "543210" is supposed to mean

Now it came to my mind what this is: It is the count down we had since I forgot to add the --quick flag. Talk about you know what... This was too easy.

@mwjames mwjames referenced this pull request Apr 2, 2017

Merged

TextByChangeUpdater change update approach, refs 1481 #2388

2 of 2 tasks complete

@kghbln kghbln referenced this pull request Apr 21, 2017

Merged

Update DefaultSettings.php #2426

1 of 2 tasks complete
@sommer-gei

sommer-gei commented May 31, 2017

Hi!

I enabled $smwgEnabledFulltextSearch = true; and my case-insensitive search worked like a charm! :-)

But then I realised that searching for part of a property string isn’t working anymore. Is that a known problem?

Example:
I have the custom property Foaf:name, e.g. [[Foaf:name::John Doe]]
If I search for ohn (for example: ~*ohn*) there are no results anymore after enabling the (new) FullTextSearch. I have tried /wiki/Special:Ask and my /w/api.php.

Thanks in advance!

@kghbln

Member

kghbln commented May 31, 2017

Thanks for asking. I suggest recreating the issue on the sandbox and filing a separate issue if the problem can be observed there.
