(Team 4) Design: Keyword match operator #31

Closed
chenlica opened this issue Apr 5, 2016 · 7 comments
@chenlica
Collaborator

chenlica commented Apr 5, 2016

Team 4:

Please do the following:

Add @prakul to this issue.

@chenlica
Collaborator Author

Per our discussion in the lecture today, please do the following:

  • Design your feature(s) as operators.
  • Come up with a few good test cases.
  • Update this issue with your progress.

Please contact me to schedule a F2F meeting to discuss the details. Also, I wonder whether you are still interested in solving the problem of "finding documents similar to a document."

sandeepreddy602 added a commit that referenced this issue May 2, 2016
2. Modified DictionaryMatcher and KeyWordMatcher to use the methods in
Utils class.
prakul added a commit that referenced this issue May 6, 2016
akshaybetala added a commit that referenced this issue May 11, 2016
prakul added a commit that referenced this issue May 14, 2016
[Issue #31] (Team 4) keyword matcher refactoring
akshaybetala added a commit that referenced this issue May 18, 2016
[Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type
@chenlica
Collaborator Author

@akshaybetala and @prakul : please finish the documentation and performance test. Thanks.

@chenlica
Collaborator Author

The initial performance numbers (time) for the index-based search operator were very high (see https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Keyword-Match-Operator).

Here is @prakul's answer:

The 'Lucene Query time' includes the time for preparing the Lucene query
(in our custom function) from the query string, initializing the scan operator, and
obtaining tokens for the query based on our chosen analyzer. These
are one-time costs that our system incurs on the first query.
For example, on the 1M dataset,

the time taken for the query "medicine history" ->

Lucene Query time: 6.8490 seconds

drops to

Lucene Query time: 3.1050 seconds

on the second run.

My reasoning is that our implementation incurs this
cost across operators because of various initializations that Lucene probably performs only
once and then caches.

I still wonder why the overhead is so high. Asking @sandeepreddy602 @rajesh9625 @inkudo @zuozhi for their thoughts.
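
One way to separate the one-time initialization cost from the steady-state query cost is to time several consecutive runs and report the first (cold) run apart from the later (warm) ones. The following is only a minimal Java sketch under stated assumptions: `ColdWarmTimer` and `fakeQuery` are hypothetical names, not part of TextDB or Lucene, and the sleeps merely stand in for real initialization and query work.

```java
import java.util.ArrayList;
import java.util.List;

public class ColdWarmTimer {

    /** Runs the task once and returns the elapsed wall-clock time in seconds. */
    public static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    /** Times `runs` consecutive executions; index 0 is the cold (first) run. */
    public static List<Double> timeRuns(Runnable task, int runs) {
        List<Double> times = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            times.add(timeSeconds(task));
        }
        return times;
    }

    /** Average over every run after the first, i.e. the warm runs. */
    public static double warmAverage(List<Double> times) {
        if (times.size() < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        double sum = 0;
        for (int i = 1; i < times.size(); i++) {
            sum += times.get(i);
        }
        return sum / (times.size() - 1);
    }

    private static void sleepMillis(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a query: a one-time setup on the first
        // call (assumed here to model query preparation, operator
        // initialization, and caching), then a cheaper per-call cost.
        final boolean[] initialized = {false};
        Runnable fakeQuery = () -> {
            if (!initialized[0]) {
                sleepMillis(50);   // simulated one-time initialization
                initialized[0] = true;
            }
            sleepMillis(5);        // simulated steady-state query work
        };

        List<Double> times = timeRuns(fakeQuery, 5);
        System.out.printf("cold run:     %.4f seconds%n", times.get(0));
        System.out.printf("warm average: %.4f seconds%n", warmAverage(times));
    }
}
```

Replacing `fakeQuery` with the real operator call would show how much of the first-run time is initialization. If the warm average also stays high, the overhead is probably not just one-time setup.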

@rajesh9625
Contributor

@chenlica

These are the results I got when I ran DictionaryMatcher with the index-based search operator (PHRASEOPERATOR) on my machine:

Machine configuration: MacBook Pro, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

Performance results for DictionaryMatcher with PHRASEOPERATOR:

Dictionary : {"medical","medication","medicare","medicaid"}
Lucene Query time: 0.5950 seconds
Match time: 5.6970 seconds
Total: 36528 results

Lucene Query time for me has always been under a second.

@zuozhiw
Collaborator

zuozhiw commented Jun 15, 2016

I just ran the query "medicine" with the keyword operator on 1 million records. The numbers I get are pretty normal:
Lucene Query time: 0.4100 seconds
match time: 2.0910 seconds
total: 9114 results

So is "7192.9650 seconds" the real number or a typo?

And are we talking about "Lucene Query time" here or the total time in general?

@chenlica
Collaborator Author

@rajesh9625 : please include more interesting entities in the dictionary, with more variety and some multi-keyword entities.

@zuozhi : I think we are talking about total query time, since that's what a user experiences.

@chenlica
Collaborator Author

chenlica commented Jul 2, 2016

This task is done.

@chenlica chenlica closed this as completed Jul 2, 2016