The-Wallfacer-Plan/CI6226
LSearcher: A DBLP Search Engine Based on Lucene
===============================================

Overview
--------
## LSearcher consists of two projects:
- Project 1 is a basic search engine that can be used to search for the publication records in DBLP.
- Project 2 contains two applications.
  - Application 1 finds the top-10 most popular research topics in each year from Year 2000 to 2015, and in each year for a specific publication venue or an author;
  - Application 2 discovers the top-10 most similar publication venues for a given conference or journal.

### Resources and toolkits used in LSearcher:
(1) Dataset in compressed XML in DBLP: http://dblp.uni-trier.de/xml/
(2) Oracle Java 8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
(3) Lucene 5.5.0: https://lucene.apache.org/
(4) Play! framework 2.5.0: https://www.playframework.com/
(5) bootstrap-v4-alpha4: http://v4-alpha.getbootstrap.com/getting-started/introduction/
(6) jQuery-2.2: https://jquery.com/download/
(7) Scala-2.11.7: http://scala-lang.org/
(8) Mallet library (MAchine Learning for LanguagE Toolkit): http://mallet.cs.umass.edu/
(9) Lightbend Activator: https://www.lightbend.com/activator/download

### Setup
The latest source code of LSearcher is available on GitHub (https://github.com/HongxuChen/ci6226), a demo is currently available on http://155.69.145.146:9001 (NTU access only). Oracle Java 8 is required and should be pre-installed; if you have multiple JREs, please ensure `JAVA_HOME` is correctly pointed to the Java 8 directory.

    git clone https://github.com/HongxuChen/ci6226.git
    cd ci6226/
    # put the downloaded dblp.xml file in "public/resources"
    # change the "rootDir" value in app/models/common/Config.scala to point to "public/resources"
    ./activator                 # for Linux/Mac OS X
    # On Windows, download activator and run the `activator` executable in the project directory
    run -Dhttp.port=9001        # (inside the activator shell)

Then open http://localhost:9001 in your favorite web browser.

Functionalities and Results
---------------------------
There are three tabs at the top of the interface. The "Home" tab is the basic search engine, and the "App1" and "App2" tabs are the first and second applications in Project 2.

## Basic Search
The basic search engine is under the "Home" tab at the top of the UI.

(1) The first step is to choose the system settings. There are three options for the indexing process: whether to apply stemming, whether to be case sensitive, and whether to remove stopwords. You can also select how many records to display and which similarity to use (classic similarity or BM25 similarity).

(2) Once the system is configured, the second step is to index the collection by clicking the "Index Docs" button. When indexing has finished, the textbox at the bottom shows the option settings and the indexing time.

(3) In the third step, you enter your query in the input textbox. It accepts many kinds of queries.

(a) Simple queries are single terms or phrases. A single term is a single word, e.g., `system`, while a phrase is a group of words surrounded by double quotes, e.g., `"system model"`. For the single-term query `system`, the first returned record is:

    docID: 1134718    Score: 1.522377
    pubYear  1992
    paperId  journals/csys/Neuman92
    authors  B. Clifford Neuman
    kind     article
    title    The Prospero File System: A Global File System Based on the Virtual System Model.
    venue    Computing Systems

For the phrase `"system model"` (it must be surrounded by double quotes), the first record is:

    docID: 356624    Score: 2.047171
    pubYear  2011
    paperId  journals/procedia/Strand11
    authors  Gary Strand
    kind     inproceedings
    title    Community Earth System Model Data Management: Policies and Challenges.
    venue    ICCS

(b) A query can also specify a field. Without a field, the search considers all terms in the records; once a field is specified, the search only considers the terms in that field. The format of a query with a field is `field: <term>` or `field:"<phrase>"`. For example, to search for records whose titles contain "system model", use the query `title: "system model"`. The first record returned is:

    docID: 570186    Score: 4.107412
    pubYear  2008
    paperId  journals/istr/ProbstH08
    authors  Christian W. Probst;René Rydhof Hansen
    kind     article
    title    An extensible analysable system model.
    venue    Inf. Sec. Techn. Report

(c) For a single-term query, with or without a field, the system also supports single and multiple character wildcard searches. Use the "?" symbol for a single character wildcard and the "*" symbol for a multiple character wildcard. A single character wildcard matches terms with exactly one character in that position, while a multiple character wildcard matches zero or more characters. For example, to search for model, modeled or modeling, use the query `model*`. Some of the records returned for this query are:

Record 1:

    docID: 19    Score: 1.000000
    pubYear  2015
    paperId  journals/twc/KhademiCIJV15
    authors  Seyran Khademi;Sundeep Prabhakar Chepuri;Zoubir Irahhauten;Gerard J. M. Janssen;Alle-Jan van der Veen
    kind     article
    title    Channel Measurements and Modeling for a 60 GHz Wireless Link Within a Metal Cabinet.
    venue    IEEE Transactions on Wireless Communications

Record 2:

    docID: 35    Score: 1.000000
    pubYear  2004
    paperId  journals/twc/XuCHV04
    authors  Hao Xu;Dmitry Chizhik;Howard C. Huang;Reinaldo A. Valenzuela
    kind     article
    title    A generalized space-time multiple-input multiple-output (MIMO) channel model.
    venue    IEEE Transactions on Wireless Communications

Record 3:

    docID: 133    Score: 1.000000
    pubYear  2007
    paperId  journals/twc/HanT07
    authors  Y. Han;Kah Chan Teh
    kind     article
    title    Performance Study of Asynchronous FFH/MFSK Communications using Various Diversity Combining Techniques with MAI Modeled as Alpha-Stable Process.
    venue    IEEE Transactions on Wireless Communications

To search for test or text, use the query `te?t`. Some sample records returned are:

Record 1:

    docID: 1939    Score: 1.000000
    pubYear  2013
    paperId  journals/twc/HuangC13a
    authors  Qi Huang;Pei-Jung Chung
    kind     article
    title    An F-Test Based Approach for Spectrum Sensing in Cognitive Radio.
    venue    IEEE Transactions on Wireless Communications

Record 2:

    docID: 3066    Score: 1.000000
    pubYear  2010
    paperId  journals/siu/BellissensJDM10
    authors  Cédrick Bellissens;Patrick Jeuniaux;Nicholas D. Duran;Danielle S. McNamara
    kind     article
    title    A Text Relatedness and Dependency Computational Model.
    venue    Stud. Inform. Univ.

(d) A query can also be written as a regular expression, with the pattern placed between forward slashes "/". For example, to find records containing "text" or "test", use the query `/te[xs]t/`. The results of this example are the same as for the query `te?t`.

(e) A query can also be a fuzzy expression, written with the tilde "~". An optional parameter specifies the maximum number of edits allowed. Its value is between 0 and 2, and the default is an edit distance of 2. For example, to search for a term similar in spelling to text, use the query `text~`, which is the same as `text~2` since 2 is the default. This query can return records containing text, texts, test, and so on.
For example,

Record 1:

    docID: 691645    Score: 0.592871
    pubYear  2013
    paperId  journals/corr/abs-1305-2831
    authors  Khushboo Thakkar;Urmila Shrawankar
    kind     article
    title    Test Model for Text Categorization and Text Summarization
    venue    CoRR

Record 2:

    docID: 2240559    Score: 0.564503
    pubYear  2014
    paperId  conf/vts/Gattiker14
    authors  Anne Gattiker
    kind     inproceedings
    title    Unstructured text: Test analysis techniques applied to non-test problems.
    venue    VTS

Record 3:

    docID: 87327    Score: 0.547886
    pubYear  2010
    paperId  journals/jcse/Jo10
    authors  Taeho Jo
    kind     article
    title    Representation of Texts into String Vectors for Text Categorization.
    venue    JCSE

(f) The system also supports proximity searches, written with the tilde "~" at the end of a phrase, which find words that are within a specified distance of each other. For example, `"system model"~10` matches records in which system and model occur within 10 words of each other.

(g) Range searches. Range queries match records whose field values lie between the lower and upper bounds specified by the query. For example, to search for records whose authors are between A and B, but not including A and B, use the query `authors: {A TO B}`.

(h) Boosting a term. To boost a term, put the caret "^" followed by a boost factor (a number) at the end of the term being searched. The higher the boost factor, the more relevant the term will be, which lets you control the relevance of a record through its terms. Two examples follow. The first is the query `system model` (note that this query contains two single terms, unlike the phrase `"system model"`). The two terms are equally relevant, and the most relevant record is:

    docID: 1134718    Score: 1.713261
    pubYear  1992
    paperId  journals/csys/Neuman92
    authors  B. Clifford Neuman
    kind     article
    title    The Prospero File System: A Global File System Based on the Virtual System Model.
    venue    Computing Systems

The second is the query `system model^4`. Here the records containing model are more relevant.
Thus, the most relevant record is:

    docID: 2520501    Score: 1.493212
    pubYear  2009
    paperId  conf/icsoft/SchlechterSF09
    authors  Antoine Schlechter;Guy Simon;Fernand Feltz
    kind     inproceedings
    title    From an Abstract Object-oriented Domain Model to a Meta-model for the Domain - Model Driven Development of a Manufacturing Execution System.
    venue    ICSOFT (1)

(i) The system also supports Boolean operators, such as AND, OR, NOT, +, and -. For example:
- search for records that contain system or model: `system OR model`;
- search for records that contain system or the phrase "model checking": `system OR "model checking"`;
- search for records that contain system and model: `system AND model`;
- search for records that contain system and the phrase "model checking": `system AND "model checking"`;
- search for records that contain system and whose authors include Steve Walker: `system AND authors: "Steve Walker"`;
- search for records that contain system but not model: `system NOT model`;
- search for records that must contain system and may contain model: `+system model`;
- search for records that must contain the phrase "system model" but not "model checking": `"system model" -"model checking"`.

The operators can also be combined. For example, to search for either "system" or "checking", together with "model", use the query `(system OR checking) AND model`.

(j) Users can use parentheses to group multiple clauses under a single field. For example, to search for a title that contains both the phrase "system model" and the word "data", use the query `title: (+"system model" +data)`.

(k) Special characters are also supported by putting a "\" before them. These characters are +, -, &&, ||, !, (, ), {, }, [, ], ^, ", ~, *, ?, :, \, and /. For example, (1+1)||(2-1) can be written as `\(1\+1\)\||\(2\-1\)`.

For more details on the supported queries, users can click help for detailed descriptions of the accepted query formats.
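The escaping rule in (k) can be applied mechanically before user text is placed into a query. The following is a minimal standalone sketch in Python, not code from LSearcher; the name `escape_query` is hypothetical. Lucene itself ships a similar static helper, `QueryParser.escape`, which a Java or Scala caller would normally use instead. Note that this sketch escapes each `|` and `&` individually, so `||` becomes `\|\|` rather than the `\||` form shown above.

```python
# Characters that the Lucene classic query syntax treats as special.
# "&&" and "||" are handled by escaping each "&" and "|" on its own.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_query(text: str) -> str:
    """Prefix every special character in `text` with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    # Prints: \(1\+1\)\|\|\(2\-1\)
    print(escape_query("(1+1)||(2-1)"))
```

Plain words such as `system model` pass through unchanged, so the helper can be applied to any literal text that should not be interpreted as query operators.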
(4) The last step is to run the search and display the records that match the specified query.

## Application 1
This application is under the "App1" tab at the top of the user interface. It searches for popular topics. Users can specify the year, venue, and author they want to query. The year ranges from 2000 to 2015. The venue and author are optional and may be left empty. The results of any query contain two parts: the first is the topics based on term frequency, and the other is the topics based on topical N-Gram, considering each term as a unit. Some examples follow.

(1) Search for the most popular research topics in Year 2007. For this query, users only need to set the pubYear to 2007. The returned top 10 topics are:

    sensor networks (2145)
    wireless sensor (1576)
    wireless sensor networks (1324)
    neural networks (996)
    wireless networks (839)
    hoc networks (830)
    web services (663)
    genetic algorithm (555)
    performance analysis (553)
    support vector (513)

(2) Search for the most popular research topics in SIGIR 2012. For this query, the pubYear is set to 2012 and the venue to sigir. The results are:

    web search (10)
    federated search (4)
    aggregated search (4)
    search engines (4)
    social media (4)
    search result (3)
    text classification (3)
    learning rank (3)
    search engine (3)
    retrieval evaluation (3)

## Application 2
With App2, users can search for the most similar publication venues and years for a given conference or journal. As in the basic search engine, users can configure the system by setting stemming, character case, and stopwords. Before entering a query, users should look through the representations of all venues in DBLP and find the one they want to query. The help lists the DBLP representations of all publication venues in each year. For example, the item "2014 IEEE Trans. Knowl. Data Eng." from the help means that in DBLP, the journal "IEEE Transactions on Knowledge and Data Engineering" of Year 2014 is written as "IEEE Trans. Knowl. Data Eng.". Thus, to search for the venues and years most similar to TKDE 2014, enter "IEEE Trans. Knowl. Data Eng." in the venue textbox, rather than TKDE. The returned results are the top N publication venues and their scores. For example, the top 10 similar publication venues of TKDE 2014 are:

    IEEE Trans. Knowl. Data Eng., 2015    0.7648006
    ICDE, 2011                            0.7269442
    ICDE, 2012                            0.71243507
    IEEE Trans. Knowl. Data Eng., 2013    0.67970675
    IEEE Trans. Knowl. Data Eng., 2011    0.66545534
    IEEE Trans. Knowl. Data Eng., 2016    0.657765
    IEEE Trans. Knowl. Data Eng., 2012    0.6533236
    EDBT, 2009                            0.6407726
    DASFAA, 2009                          0.63892615
    DASFAA (1), 2012                      0.61591434
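This README does not say how the similarity scores are computed. One common way to obtain scores between 0 and 1 for this kind of task is cosine similarity between term-frequency vectors built from each (venue, year) pair's paper titles. The sketch below illustrates that idea only, under that assumption; it is not LSearcher's implementation. The function names and the toy titles are invented for illustration and are not DBLP data.

```python
import math
from collections import Counter


def term_vector(titles):
    """Build a bag-of-words term-frequency vector from a list of titles."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, vectors, top_n=10):
    """Rank every other (venue, year) key by similarity to `target`."""
    scores = [
        (key, cosine(vectors[target], vec))
        for key, vec in vectors.items()
        if key != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy titles, for illustration only (not DBLP data).
toy_titles = {
    ("Venue A", 2014): ["mining frequent patterns", "query processing on graphs"],
    ("Venue A", 2015): ["mining frequent patterns in graphs"],
    ("Venue B", 2014): ["wireless channel estimation"],
}
vectors = {key: term_vector(titles) for key, titles in toy_titles.items()}

if __name__ == "__main__":
    for key, score in most_similar(("Venue A", 2014), vectors):
        print(key, round(score, 4))
```

In this toy data, ("Venue A", 2015) ranks first because it shares four terms with the target, while ("Venue B", 2014) shares none and scores 0. Options such as stemming, lowercasing, and stopword removal would change the term vectors and therefore the ranking.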
About
Assignment for CI6226 Information Retrieval & Analysis in NTU, Singapore