Identify entities and entity spans in open text extracted from files in disk images


bitcurator-nlp-entspan


Entity extraction and span identification for heterogeneous document types. Build instructions and dependencies can be found below. This project is in development.

Installing and running the entity extraction and analysis tools

The following instructions have been tested only on Ubuntu 18.04 LTS.

Make sure the core system is up to date:

sudo apt-get update
sudo apt-get upgrade

Install some basic requirements to build in a Python virtualenv:

sudo apt-get install virtualenv virtualenvwrapper python3-pip python3-dev

Install PostgreSQL and some textract dependencies

The tool stores the entity and span data it produces in a PostgreSQL database. Run the following commands to install PostgreSQL, along with some dependencies required by the textract package.

sudo apt-get install postgresql postgresql-contrib postgresql-server-dev-10
sudo apt-get install libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev libasound2-dev

Set up virtualenv and virtualenvwrapper:

You can skip or modify this step (and the remaining virtualenv steps) if your local setup differs or you don't wish to use virtualenvs.

mkdir ~/.virtualenvs

Add the following to the end of your .bashrc file. You may need to verify the location of virtualenvwrapper on your system:

# Virtualenv and virtualenvwrapper
export WORKON_HOME="$HOME/.virtualenvs"
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh

Type source ~/.bashrc, or close and reopen the terminal.

Make a virtualenv for the bitcurator-nlp-entspan tools

mkvirtualenv -p /usr/bin/python3 entspan

Install a textacy dependency requiring gcc-5:

The cld2-cffi package (see https://github.com/chartbeat-labs/textacy/issues/5) must be built with gcc-5 for the time being; this should be revisited in the future. Do not use sudo for the pip step below when installing in a virtualenv: with sudo, cld2-cffi is installed outside the venv, so the dependency remains broken there and textacy will try to build it again with gcc-6.

sudo apt-get install gcc-5 g++-5 libffi-dev
env CC=/usr/bin/gcc-5 pip3 install -U cld2-cffi

Install textract, textacy, and some other required pip packages.

Note: Installing textacy via pip will also install the latest release of spaCy.

pip install textract textacy psycopg2-binary sqlalchemy sqlalchemy-utils configobj matplotlib

Create and populate the database

Note: If the DB "bcnlp_db" already exists and you want to start afresh, first delete it.

# drop the db named: "bcnlp_db"  
sudo -u postgres dropdb bcnlp_db

To create a db with a user and password (commands are shown in the postgres prompt):

sudo -u postgres psql  
postgres=# create database bcnlp_db;  
CREATE DATABASE  
postgres=# create user bcnlp with password 'bcnlp';  
CREATE ROLE   
postgres=# grant all privileges on database bcnlp_db to bcnlp;  
GRANT  
postgres=# \q  

(Optional) Logging in to the db with the psql command:

psql -h localhost -U bcnlp bcnlp_db  
(passwd: bcnlp)

To list tables: \dt
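
For scripted access, the same credentials can be assembled into a connection URL. The following is a minimal sketch using only the standard library. The helper name make_db_url is hypothetical, and the postgresql:// URL form is the one SQLAlchemy accepts; it is not necessarily how bcnlp_db.py builds its own connection.

```python
from urllib.parse import quote


def make_db_url(user, password, host, dbname):
    """Build a postgresql:// URL from the values used in the psql steps.

    make_db_url is a hypothetical helper for illustration only.
    """
    # Percent-encode the credentials so that characters such as '@' or ':'
    # in a password cannot be mistaken for URL delimiters.
    return "postgresql://{}:{}@{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, dbname
    )


url = make_db_url("bcnlp", "bcnlp", "localhost", "bcnlp_db")
print(url)
```

The resulting string can be passed to sqlalchemy.create_engine(), since sqlalchemy and psycopg2-binary are installed in the pip step above.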

Download the spaCy English language model:

python -m spacy download en
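
One way to confirm that the download worked before running the tools is to try loading the model from Python. This is a sketch, not part of the repository: the function name model_available is hypothetical, and it assumes a spaCy release that raises OSError when a model is missing. Older releases that only print a warning and fall back to the tokenizer would not be caught by this check.

```python
def model_available(name="en"):
    """Return True if spaCy can load the named model, False otherwise."""
    try:
        import spacy  # installed as a textacy dependency in the pip step
        spacy.load(name)
    except (ImportError, OSError):
        # ImportError: spaCy itself is missing from the active environment.
        # OSError: spaCy is present but cannot find the requested model.
        return False
    return True


if __name__ == "__main__":
    print("model 'en' available:", model_available("en"))
```

Run it inside the entspan virtualenv so that it checks the same environment the tools will use.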

If the language model is not downloaded properly, you will see the following spaCy warning when running bcnlp_main.py:
"Warning: no model found for 'en' Only loading the 'en' tokenizer."

Populating the DB with tables of entities, POS, etc.

Run the Python script bcnlp_main.py:

python bcnlp_main.py --infile <inputfile>
ex: python bcnlp_main.py --infile indir

Next, check whether the DB is populated:

psql -h localhost -U bcnlp bcnlp_db  

Note: the password in this case is "bcnlp"

Some useful commands:

To list tables: \dt  
To delete a table: drop table <table_name>;
To see items in a table: select * from <table_name>;

Run the curses interface and navigate through the menu:

python bcnlp_curses.py  

You can do the following in this interface:

  • List all the documents with their number of terms, nouns, verbs and prepositions.
  • Export lists of entities or POS for each document listed.
  • Similarity measures:
    • Get a list of the entities common to a selected pair of documents. A new table is created in the database.
    • Get the similarity measure for two documents (cosine, Euclidean or Manhattan).
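
For reference, the three measures named above can be sketched on raw term counts using only the standard library. These are the textbook definitions. The function names are hypothetical, and bcnlp_curses.py may compute the measures over different features or with a library routine.

```python
import math
from collections import Counter


def term_vectors(doc_a, doc_b):
    """Turn two token lists into term-count vectors over a shared vocabulary."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute per-term count differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy documents, already tokenized (illustrative data only).
u, v = term_vectors(["mango", "cake", "mango"], ["mango", "pie"])
print(cosine_similarity(u, v), euclidean_distance(u, v), manhattan_distance(u, v))
```

Note that cosine is a similarity (higher means more alike), while the Euclidean and Manhattan values are distances (lower means more alike).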

Run createspan to create the entity spans and bar graphs:

python bcnlp_createspan.py [--bg] --infile <file or directory>
ex: python bcnlp_createspan.py --infile indir
    python bcnlp_createspan.py --bg --infile mango_cake.txt
  • The script creates a .span file for each file in indir.
  • To clear the spans, run the same script with the --clearspan flag.
  • If the --bg flag is specified, the script also generates a set of bar graphs in the directory bgdir.
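
To illustrate what an entity span is, the sketch below records the character offsets of each entity mention in a piece of text. It is only an illustration under stated assumptions: the function find_spans and the (start, end, entity) layout are hypothetical, the entity list is supplied by hand rather than by spaCy, and the actual .span file format written by bcnlp_createspan.py is not documented here.

```python
import re


def find_spans(text, entities):
    """Return (start, end, entity) tuples for every mention, sorted by position."""
    spans = []
    for entity in entities:
        # re.escape keeps characters such as '.' in an entity name literal.
        # The \b anchors avoid matches inside longer words; they assume the
        # entity begins and ends with a word character.
        pattern = r"\b" + re.escape(entity) + r"\b"
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), entity))
    return sorted(spans)


text = "Alice met Bob in Paris. Bob left Paris on Monday."
for start, end, entity in find_spans(text, ["Bob", "Paris"]):
    print(start, end, entity)
```

Each tuple gives a half-open range, so text[start:end] returns the mention itself.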

Deactivating the python virtual environment

To deactivate the environment you're working in, simply type:

deactivate

You can reactivate the virtualenv at any time by running workon entspan from any terminal.

Permanently deleting a python virtual environment

To permanently remove the "entspan" environment and all dependencies (not including the Postgres database), run the following:

rmvirtualenv entspan

License(s)

The BitCurator logo, BitCurator project documentation, and other non-software products of the BitCurator team are subject to the Creative Commons Attribution 4.0 International license (CC BY 4.0).

Unless otherwise indicated, software items in this repository are distributed under the terms of the GNU General Public License, Version 3. See the text file "COPYING" for further details about the terms of this license.

In addition to software produced by the BitCurator team, BitCurator packages and modifies open source software produced by other developers. Licenses and attributions are retained here where applicable.