(Team 4) Design: Keyword match operator #31

chenlica · 2016-04-05T05:11:49Z

Team 4:

Please do the following:

Put the design at https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Keyword-Match-Operator
Use this ticket to keep track of the progress.

Add @prakul to this issue.

chenlica · 2016-04-12T06:15:28Z

Per our discussion in the lecture today, please do the following:

Design your feature(s) as operators.
Come up with a few good test cases.
Update this issue with your progress.

Please contact me to schedule a F2F meeting to discuss the details. Also, I wonder whether you are still interested in solving the problem of "finding documents similar to a document."


chenlica · 2016-05-31T04:10:48Z

@akshaybetala and @prakul : please finish the documentation and performance test. Thanks.

chenlica · 2016-06-13T22:16:39Z

The initial performance numbers (time) for the index-based search operator were very high (https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Keyword-Match-Operator).

Here is the answer from @prakul:

The "Lucene Query time" includes the time for preparing the Lucene query
(in our custom function) from the string, initializing the scan operator, and
obtaining tokens for the query based on our chosen analyzer. These
are one-off costs that our system incurs on the first query.
For example, for the 1M dataset,

the time taken for the query "medicine history" is

Lucene Query time: 6.8490 seconds

which drops to

Lucene Query time: 3.1050 seconds

on the second run.

My reasoning is that our implementation across operators incurs this
cost because of various initializations that Lucene probably does only
once and then caches.

I still wonder why the overhead is so high. Asking @sandeepreddy602 @rajesh9625 @inkudo @zuozhi for their thoughts.
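
One way to check whether the gap comes from one-off initialization is to time the same query several times in one process and report the first run separately from the rest. The sketch below is a minimal illustration in plain Java. `FakeQuery` is a hypothetical stand-in that simulates lazy setup with sleeps, and nothing here calls TextDB or Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class WarmupTiming {

    /** Runs the same task `runs` times and returns the elapsed seconds of each run. */
    static List<Double> timeRuns(Supplier<Integer> task, int runs) {
        List<Double> seconds = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            seconds.add((System.nanoTime() - start) / 1e9);
        }
        return seconds;
    }

    /** Average of all runs except the first, i.e. the cost after warm-up. */
    static double steadyStateAverage(List<Double> seconds) {
        if (seconds.size() < 2) {
            return Double.NaN; // not enough runs to separate warm-up from steady state
        }
        double sum = 0.0;
        for (int i = 1; i < seconds.size(); i++) {
            sum += seconds.get(i);
        }
        return sum / (seconds.size() - 1);
    }

    /** Hypothetical query stand-in: pays a simulated setup cost only on its first call. */
    static final class FakeQuery implements Supplier<Integer> {
        private boolean initialized = false;
        int initCount = 0;

        @Override
        public Integer get() {
            if (!initialized) {
                initCount++;
                sleepMillis(50); // simulated one-off initialization
                initialized = true;
            }
            sleepMillis(5);      // simulated per-query work
            return 0;            // placeholder result count
        }
    }

    private static void sleepMillis(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<Double> runs = timeRuns(new FakeQuery(), 5);
        System.out.printf("first run: %.4f s%n", runs.get(0));
        System.out.printf("later runs (avg): %.4f s%n", steadyStateAverage(runs));
    }
}
```

If the real operator shows the same pattern, with a slow first run and stable later runs, that supports the one-off initialization explanation. If later runs stay slow, the overhead is per-query and needs a different explanation.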

rajesh9625 · 2016-06-15T02:24:34Z

@chenlica

These are the results I got when I ran DictionaryMatcher with the index-based search operator (PHRASEOPERATOR) on my machine:

Machine configuration : MacBook Pro, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

Performance results for DictionaryMatcher with PHRASEOPERATOR:

Dictionary : {"medical","medication","medicare","medicaid"}
Lucene Query time: 0.5950 seconds
Match time: 5.6970 seconds
Total: 36528 results

Lucene Query time for me has always been under a second.

zuozhiw · 2016-06-15T03:29:48Z

I just ran the query "medicine" with the keyword operator on 1 million records. The numbers I get are pretty normal:
Lucene Query time: 0.4100 seconds
match time: 2.0910 seconds
total: 9114 results

So is "7192.9650 seconds" the real number or a typo?

And are we talking about "Lucene Query time" here or the total time in general?

chenlica · 2016-06-15T06:19:43Z

@rajesh9625 : please include more interesting entities in the dictionary, with more variety and some multi-keyword entities.

@zuozhi : I think we are talking about total query time, since that's what a user experiences.
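
For reference, the two runs reported above can be put on the same "total time" footing. The sketch below assumes the reported "Lucene Query time" and "Match time" are non-overlapping phases, so the total is their sum. The thread does not confirm this, and `totalSeconds` is a hypothetical helper, not part of TextDB.

```java
final class TotalQueryTime {

    /** Total time under the assumption that the two reported phases do not overlap. */
    static double totalSeconds(double luceneQuerySeconds, double matchSeconds) {
        return luceneQuerySeconds + matchSeconds;
    }

    public static void main(String[] args) {
        // Figures as reported in this thread.
        System.out.printf("DictionaryMatcher, PHRASEOPERATOR: %.4f s%n",
                totalSeconds(0.5950, 5.6970));
        System.out.printf("Keyword operator, \"medicine\": %.4f s%n",
                totalSeconds(0.4100, 2.0910));
    }
}
```

Under that assumption the totals come to roughly 6.29 seconds and 2.50 seconds, so match time rather than Lucene query time dominates both runs.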

chenlica · 2016-07-02T18:03:55Z

This task is done.

chenlica assigned akshaybetala Apr 5, 2016

chenlica added the 4-keyword-matcher label Apr 5, 2016

sandeepreddy602 added a commit that referenced this issue May 2, 2016

Issue #31: Moved span related methods to Utils class

2d8afe6

2. Modified DictionaryMatcher and KeyWordMatcher to use the methods in Utils class.

sandeepreddy602 mentioned this issue May 2, 2016

Team 1 - Refactor Span Related Methods #86

Closed

prakul added a commit that referenced this issue May 6, 2016

Merge pull request #85 from TextDB/team4-indexsourceoperator

141dcf0

[Issue #31] (Team4) Keyword Operator

akshaybetala added a commit that referenced this issue May 11, 2016

Keyword Operator Refactoring #31

039e83e

prakul added a commit that referenced this issue May 14, 2016

Merge pull request #97 from TextDB/team4-keyword-refactoring

d464d88

[Issue #31] (Team 4) keyword matcher refactoring

akshaybetala added a commit that referenced this issue May 18, 2016

Merge pull request #103 from TextDB/team4-phrase-operator

db53ad5

[Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type

chenlica closed this as completed Jul 2, 2016
