GitHub - The-Wallfacer-Plan/CI6226: Assignment for CI6226 Information Retrieval & Analysis in NTU, Singapore


LSearcher: A DBLP Search Engine Based on Lucene
===============================================

Overview
--------

LSearcher consists of two projects:

- Project 1 is a basic search engine that can be used to search for the publication records in DBLP.
- Project 2 contains two applications:
  - Application 1 finds the top-10 most popular research topics in each year from Year 2000 to 2015, and in each year for a specific publication venue or an author;
  - Application 2 discovers the top-10 most similar publication venues for a given conference or journal.

### Resources and toolkits used in LSearcher:

(1) Dataset in compressed XML in DBLP: http://dblp.uni-trier.de/xml/
(2) Oracle Java 8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
(3) Lucene 5.5.0: https://lucene.apache.org/
(4) Play! framework 2.5.0: https://www.playframework.com/
(5) bootstrap-v4-alpha4: http://v4-alpha.getbootstrap.com/getting-started/introduction/
(6) jQuery-2.2: https://jquery.com/download/
(7) Scala-2.11.7: http://scala-lang.org/
(8) Mallet library (MAchine Learning for LanguagE Toolkit): http://mallet.cs.umass.edu/
(9) Lightbend Activator: https://www.lightbend.com/activator/download

### Setup

The latest source code of LSearcher is available on GitHub (https://github.com/HongxuChen/ci6226), and a demo is currently available at http://155.69.145.146:9001 (NTU access only). Oracle Java 8 is required and should be pre-installed; if you have multiple JREs, make sure `JAVA_HOME` points to the Java 8 directory.

git clone https://github.com/HongxuChen/ci6226.git
cd ci6226/
# put the downloaded dblp.xml file in "public/resources"
# change the "rootDir" value in app/models/common/Config.scala to point to "public/resources"
./activator # for Linux/Mac OS X
# On Windows, download activator and run the `activator` executable in the project directory
run -Dhttp.port=9001 # (inside activator shell)
# then open http://localhost:9001 in your favorite web browser

Functionalities and Results
---------------------------

There are three tabs at the top of the interface. "Home" is the basic search engine, and "App1" and "App2" are the first and second applications in Project 2.

## Basic Search

Basic search is reached through the "Home" tab at the top of the UI.
(1) The first step is to choose the system settings. There are three options for the indexing process: stemming or not, case sensitivity or not, and stopword removal or not. You can also select how many records are displayed and which similarity is used (classic similarity or BM25 similarity).

(2) Once the system is configured, the second step is to index the collection by clicking the "Index Docs" button. When indexing has finished, the textbox at the bottom shows the option settings and the indexing time.
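
The effect of the three indexing options can be illustrated with a small, self-contained Python sketch. It is not LSearcher's code (the project relies on Lucene's analyzers); the stopword list and the `naive_stem` suffix stripper are hypothetical stand-ins chosen only to show how each switch changes the indexed terms.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, stemming=True, ignore_case=True, remove_stopwords=True):
    """Turn a title into index terms according to the three option switches."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if ignore_case:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens
```

With all options enabled, `analyze("Modeling the File Systems")` yields `['model', 'file', 'system']`; with all three disabled, the original four tokens are kept unchanged.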

(3) At the third step, you can enter your query in the input textbox. It accepts many kinds of queries.
(a) The simple queries are single terms or phrases. A single term is a single word, e.g., system; while a phrase is a group of words surrounded by double quotes, e.g., "system model".
For the single-term query system, the first returned record is:
docID: 1134718 Score: 1.522377
pubYear 1992
paperId journals/csys/Neuman92
authors B. Clifford Neuman
kind article
title The Prospero File System: A Global File System Based on the Virtual System Model.
venue Computing Systems
For the phrase "system model" (must be surrounded by double quotes), the first record is the following one.
docID: 356624 Score: 2.047171
pubYear 2011
paperId journals/procedia/Strand11
authors Gary Strand
kind inproceedings
title Community Earth System Model Data Management: Policies and Challenges.
venue ICCS
(b) A query can also specify a field. When no field is given, the search considers all terms in the records; once a field is specified, the search considers only the terms in that field. The format of a query with a field is: field: <term> or field:"<phrase>". For example, to search for records whose titles contain "system model", the query can be written as: title: "system model". The first record returned is:
docID: 570186 Score: 4.107412
pubYear 2008
paperId journals/istr/ProbstH08
authors Christian W. Probst;René Rydhof Hansen
kind article
title An extensible analysable system model.
venue Inf. Sec. Techn. Report

(c) A single-term query, with or without a field, also supports single- and multiple-character wildcards. Use "?" for a single-character wildcard and "*" for a multiple-character wildcard. "?" matches exactly one arbitrary character, while "*" matches zero or more characters.
For example, to search for model, modeled, or modeling, the query can be written as: model*. With this query, some of the returned records are:
Record 1: docID: 19 Score: 1.000000
pubYear 2015
paperId journals/twc/KhademiCIJV15
authors Seyran Khademi;Sundeep Prabhakar Chepuri;Zoubir Irahhauten;Gerard J. M. Janssen;Alle-Jan van der Veen
kind article
title Channel Measurements and Modeling for a 60 GHz Wireless Link Within a Metal Cabinet.
venue IEEE Transactions on Wireless Communications
Record 2: docID: 35 Score: 1.000000
pubYear 2004
paperId journals/twc/XuCHV04
authors Hao Xu;Dmitry Chizhik;Howard C. Huang;Reinaldo A. Valenzuela
kind article
title A generalized space-time multiple-input multiple-output (MIMO) channel model.
venue IEEE Transactions on Wireless Communications
Record 3: docID: 133 Score: 1.000000
pubYear 2007
paperId journals/twc/HanT07
authors Y. Han;Kah Chan Teh
kind article
title Performance Study of Asynchronous FFH/MFSK Communications using Various Diversity Combining Techniques with MAI Modeled as Alpha-Stable Process.
venue IEEE Transactions on Wireless Communications

To search for test or text, the query can be written as: te?t. Some sample returned records are:
Record 1: docID: 1939 Score: 1.000000
pubYear 2013
paperId journals/twc/HuangC13a
authors Qi Huang;Pei-Jung Chung
kind article
title An F-Test Based Approach for Spectrum Sensing in Cognitive Radio.
venue IEEE Transactions on Wireless Communications
Record 2: docID: 3066 Score: 1.000000
pubYear 2010
paperId journals/siu/BellissensJDM10
authors Cédrick Bellissens;Patrick Jeuniaux;Nicholas D. Duran;Danielle S. McNamara
kind article
title A Text Relatedness and Dependency Computational Model.
venue Stud. Inform. Univ.
(d) A query can also be written as a regular expression placed between forward slashes "/". For example, to find records containing "text" or "test", the query can be written as /te[xs]t/. For this example the results are the same as for the query te?t.
(e) A query can also be a fuzzy expression, written with the tilde "~" at the end of a term. An optional parameter specifies the maximum number of edits allowed; its value is between 0 and 2, and the default is 2. For example, to search for terms similar in spelling to text, use the query text~, which is equivalent to text~2. This query returns records containing text, texts, test, and so on. For example,
Record 1: docID: 691645 Score: 0.592871
pubYear 2013
paperId journals/corr/abs-1305-2831
authors Khushboo Thakkar;Urmila Shrawankar
kind article
title Test Model for Text Categorization and Text Summarization
venue CoRR
Record 2: docID: 2240559 Score: 0.564503
pubYear 2014
paperId conf/vts/Gattiker14
authors Anne Gattiker
kind inproceedings
title Unstructured text: Test analysis techniques applied to non-test problems.
venue VTS
Record 3: docID: 87327 Score: 0.547886
pubYear 2010
paperId journals/jcse/Jo10
authors Taeho Jo
kind article
title Representation of Texts into String Vectors for Text Categorization.
venue JCSE
(f) The system also supports proximity searches, which find words that are within a specified distance of each other, using the tilde "~" followed by the distance at the end of a phrase. For example, "system model"~10 matches records in which system and model occur within 10 words of each other.
(g) Range searches. A range query matches records whose field values lie between the specified lower and upper bounds. For example, to search for records whose authors are between A and B, excluding A and B themselves, use the query authors: {A TO B}.
(h) Boosting a term. To boost a term, append the caret "^" and a boost factor (a number) to the term. The higher the boost factor, the more the term contributes to the relevance of the records that contain it.
Here are two examples.
The first is the query: system model (note that this query contains two single terms, unlike the phrase "system model"). The two terms are equally relevant. The most relevant record is:
docID: 1134718 Score: 1.713261
pubYear 1992
paperId journals/csys/Neuman92
authors B. Clifford Neuman
kind article
title The Prospero File System: A Global File System Based on the Virtual System Model.
venue Computing Systems
The second is the query: system model^4. Here records containing model are weighted more heavily, so the most relevant record is:
docID: 2520501 Score: 1.493212
pubYear 2009
paperId conf/icsoft/SchlechterSF09
authors Antoine Schlechter;Guy Simon;Fernand Feltz
kind inproceedings
title From an Abstract Object-oriented Domain Model to a Meta-model for the Domain - Model Driven Development of a Manufacturing Execution System.
venue ICSOFT (1)
(i) The system also supports Boolean operators, such as AND, OR, NOT, +, and -.
For example, search for records that contain system or model: system OR model;
search for records that contain system or the phrase "model checking": system OR "model checking";
search for records that contain system and model: system AND model;
search for records that contain system and the phrase "model checking": system AND "model checking";
search for records that contain system and whose authors include Steve Walker: system AND authors: "Steve Walker";
search for records that contain system but not model: system NOT model;
search for records that must contain system and may contain model: +system model;
search for records that must contain the phrase "system model" but not "model checking": "system model" -"model checking".
The operators can also be combined. For example, to search for records that contain either system or checking, and also model, use the query: (system OR checking) AND model
(j) Users can use parentheses to group multiple clauses under a single field. For example, to search for a title that contains both the phrase "system model" and the word "data", use the query: title: (+"system model" +data).
(k) Special characters are also supported by placing a "\" before each of them. These characters are +, -, &&, ||, !, (, ), {, }, [, ], ^, ", ~, *, ?, :, \, and /. For example, (1+1)||(2-1) can be written as \(1\+1\)\||\(2\-1\)
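
When a query string is built programmatically, the escaping rule in (k) can be applied automatically. The following Python sketch is only an illustration and not part of LSearcher; the name `escape_query` is hypothetical. It escapes every "&" and "|" individually, so it produces `\|\|` where the example above writes `\||`, which is a slightly more conservative form.

```python
# Characters listed in (k); "&&" and "||" are covered by escaping each
# "&" and "|" on its own.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_query(text):
    """Prefix every special query character in `text` with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    print(escape_query("(1+1)||(2-1)"))
```

Text without special characters, such as `system model`, passes through unchanged.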

For more details on the supported queries, users can click help to see detailed descriptions of the accepted query formats.
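
To make the term-matching rules in (c), (d), and (e) concrete, here is a minimal Python sketch that applies them to a handful of terms. It is an approximation for illustration only: the real matching is done by Lucene over the index, and the fuzzy rule below uses plain Levenshtein distance as a simplifying assumption. The helper names and the sample `terms` list are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wildcard_match(pattern, term):
    """'?' matches one character, '*' matches zero or more characters."""
    return fnmatchcase(term, pattern)


def regex_match(pattern, term):
    """The whole term must match the regular expression."""
    return re.fullmatch(pattern, term) is not None


def fuzzy_match(query, term, max_edits=2):
    """The term is accepted if it is within `max_edits` edits of the query."""
    return edit_distance(query, term) <= max_edits


# Sample vocabulary, made up for this example.
terms = ["model", "modeled", "modeling", "test", "text", "texts", "system"]

if __name__ == "__main__":
    print([t for t in terms if wildcard_match("model*", t)])
    print([t for t in terms if wildcard_match("te?t", t)])
    print([t for t in terms if regex_match("te[xs]t", t)])
    print([t for t in terms if fuzzy_match("text", t)])
```

On this vocabulary, `te?t` and `te[xs]t` select the same two terms (test and text), while the fuzzy query also accepts texts.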

(4) The last step is to search for and display the records based on the specified queries.

## Application 1

It is reached through the "App1" tab at the top of the user interface. This application searches for popular research topics.
Users can specify the year, venue, and author to query. The year ranges from 2000 to 2015. The venue and author are optional and can be left empty. The results of any query contain two parts: topics based on term frequency, and topics based on topical N-grams, with each term considered as a unit.
Some examples are given below.
(1) Search for the most popular research topics in Year 2007. For this query, users only need to set the pubYear to be 2007.
The returned top 10 topics are:
sensor networks (2145)
wireless sensor (1576)
wireless sensor networks (1324)
neural networks (996)
wireless networks (839)
hoc networks (830)
web services (663)
genetic algorithm (555)
performance analysis (553)
support vector (513)

(2) Search for the most popular research topics in SIGIR 2012. For this query, the pubYear is set to 2012, and the venue is sigir. The results are shown as follows.
web search (10)
federated search (4)
aggregated search (4)
search engines (4)
social media (4)
search result (3)
text classification (3)
learning rank (3)
search engine (3)
retrieval evaluation (3)
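
The term-frequency part of these results can be approximated by counting word n-grams over the titles that match the year, venue, and author filters. The Python sketch below is an assumption about the general technique, not LSearcher's implementation: the function name `top_ngrams`, the stopword list, and the choice to drop stopwords before forming n-grams are all hypothetical (the last is suggested by topics such as "learning rank" in the SIGIR example).

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def top_ngrams(titles, size=2, n=10):
    """Return the n most frequent word n-grams of length `size` in the titles."""
    counts = Counter()
    for title in titles:
        tokens = [t for t in re.findall(r"[a-z0-9]+", title.lower())
                  if t not in STOPWORDS]
        # Slide a window of `size` consecutive tokens over the title.
        grams = zip(*(tokens[i:] for i in range(size)))
        counts.update(" ".join(g) for g in grams)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up titles, assumed to be already filtered by year and venue.
    sample = [
        "Learning to Rank for Web Search",
        "Web Search Engines",
        "Federated Web Search",
    ]
    for topic, freq in top_ngrams(sample, size=2, n=3):
        print(f"{topic} ({freq})")
```

Calling the function once with `size=2` and once with `size=3` gives both two-word and three-word topics, like "wireless sensor" and "wireless sensor networks" in the 2007 list.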

## Application 2

With App2, users can search for the publication venues and years most similar to a given conference or journal.
As with the basic search engine, users can configure the system by setting stemming, character case, and stopword removal.
Before entering a query, users should look through the DBLP representations of all venues and find the one they want to query. The help page lists the DBLP representations of all publication venues in each year.
For example, the item "2014 IEEE Trans. Knowl. Data Eng." in the help means that in DBLP, the journal "IEEE Transactions on Knowledge and Data Engineering" of Year 2014 is written as "IEEE Trans. Knowl. Data Eng.". Thus, to search for the venues and years most similar to TKDE 2014, enter "IEEE Trans. Knowl. Data Eng." in the venue textbox, rather than TKDE. The returned results are the top N publication venues and their scores. For example, the top 10 publication venues most similar to TKDE 2014 are:

IEEE Trans. Knowl. Data Eng., 2015 0.7648006
ICDE, 2011 0.7269442
ICDE, 2012 0.71243507
IEEE Trans. Knowl. Data Eng., 2013 0.67970675
IEEE Trans. Knowl. Data Eng., 2011 0.66545534
IEEE Trans. Knowl. Data Eng., 2016 0.657765
IEEE Trans. Knowl. Data Eng., 2012 0.6533236
EDBT, 2009 0.6407726
DASFAA, 2009 0.63892615
DASFAA (1), 2012 0.61591434
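
This README does not state how the similarity scores are computed. One common way to obtain scores of this kind, shown here purely as an assumption, is to build a term-frequency vector for each (venue, year) pair from its paper titles and rank the other pairs by cosine similarity. The names `cosine`, `most_similar`, and `profiles` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(target, profiles, n=10):
    """Rank every other (venue, year) key by similarity to `target`."""
    scores = [(key, cosine(profiles[target], vec))
              for key, vec in profiles.items() if key != target]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up term counts, for illustration only.
    profiles = {
        ("Venue A", 2014): Counter({"data": 4, "mining": 2, "query": 1}),
        ("Venue A", 2015): Counter({"data": 3, "mining": 2, "graph": 1}),
        ("Venue B", 2014): Counter({"rendering": 5, "graph": 1}),
    }
    for key, score in most_similar(("Venue A", 2014), profiles):
        print(key, round(score, 4))
```

Under this scheme the same venue in adjacent years tends to score highly, which is consistent with the shape of the TKDE example above, although the actual method may differ.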
