Entity extraction and span identification for heterogeneous document types. Build instructions and dependencies can be found below. This project is in development.
Installing and running the entity extraction and analysis tools
The following instructions have been tested only on Ubuntu 18.04 LTS.
Make sure the core system is up to date
sudo apt-get update
sudo apt-get upgrade
Install some basic requirements to build in a Python virtualenv:
sudo apt-get install virtualenv virtualenvwrapper python3-pip python3-dev
Install postgres and some textract dependencies
You will need the postgres database to store entity and span data produced by the tool. Run the following commands to install postgres, along with some dependencies required for the textract package.
sudo apt-get install postgresql postgresql-contrib postgresql-server-dev-10
sudo apt-get install libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev libasound2-dev
Set up virtualenv and virtualenvwrapper:
You can skip or modify this step (and the remaining virtualenv steps) if your local setup differs or you don't wish to use virtualenvs.
Add the following to the end of your .bashrc file. You may need to verify the location of virtualenvwrapper on your system:
# Virtualenv and virtualenvwrapper
export WORKON_HOME="$HOME/.virtualenvs"
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
Then reload it with the following command, or close and reopen the terminal:
source ~/.bashrc
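If you are unsure where virtualenvwrapper.sh lives on your system, a small check like the following may help. It is only a sketch: the first candidate path is the one used in this README, while the other two are assumed common locations for pip-based installs, and the helper name find_wrapper is hypothetical.

```python
from pathlib import Path

# Only the first path comes from this README; the others are assumptions
# about where a pip-based install might place the script.
CANDIDATES = [
    "/usr/share/virtualenvwrapper/virtualenvwrapper.sh",
    "/usr/local/bin/virtualenvwrapper.sh",
    str(Path.home() / ".local/bin/virtualenvwrapper.sh"),
]


def find_wrapper(candidates=CANDIDATES):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_wrapper()
    print(found or "virtualenvwrapper.sh not found in the candidate locations")
```

Whatever path is reported is the one to use on the source line in .bashrc.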
Make a virtualenv for the bitcurator-nlp-entspan tools
mkvirtualenv -p /usr/bin/python3 entspan
Install a textacy dependency requiring gcc-5:
The cld2-cffi package (see https://github.com/chartbeat-labs/textacy/issues/5) must be built with gcc-5 for the time being; this may be revisited in future. Do not use sudo in the second step when installing via pip in a virtualenv. If you do, the cld2-cffi dependency will remain broken because it won't be found in the virtualenv, and textacy will try to build it again with gcc-6.
sudo apt-get install gcc-5 g++-5 libffi-dev
env CC=/usr/bin/gcc-5 pip3 install -U cld2-cffi
Install textract, textacy, and some other required pip packages.
Note: Installing textacy via pip will also install the latest release of spaCy.
pip install textract textacy psycopg2-binary sqlalchemy sqlalchemy-utils configobj matplotlib
Create and populate the database
Note: If the DB "bcnlp_db" already exists and you want to start afresh, first delete it.
# drop the db named: "bcnlp_db"
sudo -u postgres dropdb bcnlp_db
To create a db with a user and password (commands are shown in the postgres prompt):
sudo -u postgres psql
postgres=# create database bcnlp_db;
CREATE DATABASE
postgres=# create user bcnlp with password 'bcnlp';
CREATE ROLE
postgres=# grant all privileges on database bcnlp_db to bcnlp;
GRANT
postgres=# \q
(Optional) Logging in to the db with the psql command:
psql -h localhost -U bcnlp bcnlp_db (passwd: bcnlp)
To list tables: \dt
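The same credentials can also be used from Python. The sketch below only assembles a connection URL in the form SQLAlchemy accepts; the helper name build_db_url is hypothetical, and port 5432 is assumed because it is the PostgreSQL default. How bcnlp_main.py itself connects is not shown in this README.

```python
from urllib.parse import quote_plus


def build_db_url(user="bcnlp", password="bcnlp", host="localhost",
                 port=5432, database="bcnlp_db"):
    """Assemble a PostgreSQL URL; user and password are percent-encoded."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database)


if __name__ == "__main__":
    url = build_db_url()
    print(url)
    # With SQLAlchemy and psycopg2 installed (see the pip step above),
    # the URL could be used roughly like this:
    #   from sqlalchemy import create_engine, inspect
    #   engine = create_engine(url)
    #   print(inspect(engine).get_table_names())
```

Percent-encoding matters only if you change the password to one containing characters such as @ or /.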
Download the spaCy English language model:
python -m spacy download en
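To confirm that the download worked before running the pipeline, a check along these lines may be useful. It is a sketch that assumes spaCy 2.x or later, where loading a missing model raises OSError; older releases only print a warning, so this check would not catch the problem there. The function name model_available is hypothetical.

```python
def model_available(name="en"):
    """Return True if spaCy can load the named model, False otherwise."""
    try:
        import spacy  # imported lazily so the check also works without spaCy
    except ImportError:
        return False
    try:
        spacy.load(name)
    except OSError:
        # Assumption: spaCy 2.x+ raises OSError when the model is missing.
        return False
    return True


if __name__ == "__main__":
    print("model 'en' available:", model_available("en"))
```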
If the language model was not downloaded properly, you will see the following spaCy warning when running bcnlp_main.py:
"Warning: no model found for 'en' Only loading the 'en' tokenizer."
Populating the DB with tables of entities, POS, etc.
Run the Python script bcnlp_main.py:
python bcnlp_main.py --infile <inputfile>
ex: python bcnlp_main.py --infile indir
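The examples in this README pass both a single file and a directory to --infile. One way such an argument could be expanded into a list of files is sketched below. This is an illustration only: the helper collect_input_files is hypothetical, and whether the real script recurses into subdirectories is not stated here.

```python
import os


def collect_input_files(path):
    """Return a sorted list of files for a single file or a directory tree."""
    if os.path.isfile(path):
        return [path]
    found = []
    # os.walk yields nothing for a nonexistent path, so the result is [].
    for root, _dirs, files in os.walk(path):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)


if __name__ == "__main__":
    for filename in collect_input_files("indir"):
        print(filename)
```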
Next, check if the DB is populated
psql -h localhost -U bcnlp bcnlp_db
Note: the password in this case is "bcnlp"
Some useful commands:
To list tables: \dt
To delete a table: drop table <table_name>;
To see items in a table: select * from <table_name>;
Run the curses interface and navigate through the menu:
You can do the following in this interface:
- List all the documents with number of terms, nouns, verbs and prepositions
- Export lists of entities or POS tags for each document listed.
- Similarity measures:
  - Get a list of the entities common to a selected pair of documents. A new table is created in the database.
  - Get the similarity measure for two documents (cosine, Euclidean, or Manhattan).
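For readers unfamiliar with the measures named above, the following sketch computes them over simple term-frequency vectors. It is not the tool's implementation: the whitespace tokenization and all function names are assumptions made for illustration, and the real interface may build its vectors differently.

```python
import math
from collections import Counter


def term_vector(text):
    """Count lowercase, whitespace-separated terms (a simplifying assumption)."""
    return Counter(text.lower().split())


def _aligned(a, b):
    """Project two Counters onto the same sorted vocabulary."""
    terms = sorted(set(a) | set(b))
    return [a[t] for t in terms], [b[t] for t in terms]


def cosine_similarity(a, b):
    x, y = _aligned(a, b)
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    x, y = _aligned(a, b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def manhattan_distance(a, b):
    x, y = _aligned(a, b)
    return sum(abs(p - q) for p, q in zip(x, y))


def common_entities(entities_a, entities_b):
    """Entities that appear in both documents, as a sorted list."""
    return sorted(set(entities_a) & set(entities_b))


if __name__ == "__main__":
    doc_a = term_vector("mango cake mango")
    doc_b = term_vector("mango pie")
    print(cosine_similarity(doc_a, doc_b))
    print(euclidean_distance(doc_a, doc_b))
    print(manhattan_distance(doc_a, doc_b))
```

Cosine is a similarity (higher means more alike), while Euclidean and Manhattan are distances (lower means more alike).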
Run createspan to create the entity spans and bar graphs:
python bcnlp_createspan.py [--bg] --infile <directory>
ex: python bcnlp_createspan.py --infile indir
python bcnlp_createspan.py --bg --infile mango_cake.txt
- It creates a .span file for each file in indir.
- To clear the spans, run the same script with the --clearspan flag.
- If the --bg flag is specified, it generates a set of bar graphs in the directory bgdir.
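As a rough illustration of what an entity span is, the sketch below records the start and end character offsets of each entity occurrence in a text. The layout of the .span files is not documented here, so the tuple format and the find_spans helper are assumptions; in the tool the entities would come from spaCy rather than from a hand-written list.

```python
import re


def find_spans(text, entities):
    """Return (start, end, entity) tuples for every occurrence, by position."""
    spans = []
    for entity in entities:
        # re.escape keeps punctuation in entity names from acting as regex syntax.
        for match in re.finditer(re.escape(entity), text):
            spans.append((match.start(), match.end(), entity))
    return sorted(spans)


if __name__ == "__main__":
    sample = "Mango cake needs mango pulp."
    for start, end, entity in find_spans(sample, ["mango", "cake"]):
        print(start, end, entity)
```

Matching here is case-sensitive, so "Mango" at the start of the sample is not reported for the entity "mango".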
Deactivating the python virtual environment
To deactivate the environment you're working in, simply type:
deactivate
You can reactivate the virtualenv from any terminal with virtualenvwrapper's workon command:
workon entspan
Permanently deleting a python virtual environment
To permanently remove the "entspan" environment and all dependencies (not including the Postgres database), run the following:
rmvirtualenv entspan
The BitCurator logo, BitCurator project documentation, and other non-software products of the BitCurator team are subject to the Creative Commons Attribution 4.0 Generic license (CC BY 4.0).
Unless otherwise indicated, software items in this repository are distributed under the terms of the GNU General Public License, Version 3. See the text file "COPYING" for further details about the terms of this license.
In addition to software produced by the BitCurator team, BitCurator packages and modifies open source software produced by other developers. Licenses and attributions are retained here where applicable.