GitHub - neubig/webigator: A program to aggregate, rank, and search text information

neubig / webigator Public

Notifications You must be signed in to change notification settings
Fork 0
Star 1

A program to aggregate, rank, and search text information

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
script		script
src		src
.gitignore		.gitignore
AUTHORS		AUTHORS
COPYING		COPYING
ChangeLog		ChangeLog
INSTALL		INSTALL
Makefile.am		Makefile.am
NEWS		NEWS
README		README
README.ja		README.ja
configure.ac		configure.ac

Repository files navigation

** webigator
**   by Graham Neubig
**   neubig at gmail dot com

Webigator is a program to extract useful information from large amounts of text.
In particular, it was developed considering the information needs after
disasters, such as:

* Locations at which people can evacuate, get water or supplies
* Text about the safety of people in the disaster-affected areas
* Requests for rescue or disaster supplies

At the moment, we are mainly focusing on filtering this information from Twitter.

-------------- Install -------------------

The latest version of webigator can be found here:
http://www.phontron.com/webigator
http://www.github.com/neubig/webigator
under the Eclipse Common License.

webigator works on linux, and will likely work on Mac or Cygwin. It is
dependent on a number of libraries including Boost and XML-RPC, which can be
downloaded as follows on Ubuntu:

$ sudo apt-get install libboost-all-dev libxmlrpc-c3-dev

In addition, for the Perl interface you will want XML::RPC:

$ cpan XML::RPC

Next, we build the server

$ autoreconf -i
$ ./configure
$ make

If this works properly, you should see a help message as follows:
src/bin/webigator --help
If this doesn't work, please contact the authors.

---------- Setting up the Web Interface ----------

Webigator runs best as a web interface, so we will want to install this too.
First, you will need to locate two directories. Your web server home directory,
and your cgi-bin directory. If you are running Apache on Ubuntu with the default
settings, these are /var/www and /usr/lib/cgi-bin respectively. So, to set up
the web server, you will run the following commands:

Create the webigator-demo directory:

$ mkdir /var/www/webigator-demo

Move all of the images, css files, etc to this directory, and all of the program
files to the cgi-bin directory:

$ cp -r script/cgi/{css,img,js} /var/www/webigator-demo/
$ cp script/cgi/*.{pl,cgi,tpl,pm} /usr/lib/cgi-bin

If this works, you will be able to access the webigator tool at:

http://localhost/cgi-bin/webigator-demo.cgi

(Or replace "localhost" with a different name if you are running on a server.)
There are also number of settings in /usr/lib/cgi-bin/settings.pl that you
can change, such as the UI language, tokenization method (the default is by
characters for Japanese), etc.

--------------- Using the tool -------------------

Once install has finished, run the server:

$ src/bin/webigator

Next, we can create a task by going to the web interface and clicking "create
task."

We start a program that streams data to the webigator server. This should eventually
do something like gather data from the Twitter stream, but for the time being, let's
assume that you have some data in tab-separated format in the file data.csv:

$ script/data-streamer.pl -data data.csv

Next, we access the web interface again, and start labeling the data for the task.

-------------------- API -------------------------

The perl programs and server interact through an XML API:

***** Add a keyword
add_keyword:
    id = Data ID (int)
    text = Data (string)
    lab = Is the keyword for useful info or not useful info (1 or 0)

***** Add an unlabeled example to be scored
add_unlabeled:
    id = Data ID (int)
    text = Data (string)

***** Add an instance that has been labeled
add_labeled:
    id = Data ID (int)
    text = Data (string)
    lab = Is the data for useful info or not useful info (1 or 0)

***** Return the highest scoring value from the server
pop_best:

***** Rescore all of the values on the server:
rescore:

--------------- TODO ------------------

* Improve the scoring function so it doesn't over-value short text at the beginning
* Create a program to stream tweets directly from Twitter