This code provides models that extract key information (company, address, date, and total amount) from invoice documents using natural language processing. The framework has two main steps. First, a text detection algorithm extracts and processes the text data. Then a recurrent neural network recognizes the context.
You can download the pre-trained models from Google Drive.
To install InvoiceNet on Ubuntu 18.04, run the following commands:
git clone https://github.com/RijunLiao/invoice.git
cd invoice/
# Run installation script
./install.sh
The install.sh script will install all the dependencies, create a virtual environment, and install InvoiceNet in the virtual environment.
To use InvoiceNet, activate the virtual environment in which the package was installed.
# Source virtual environment
source env/bin/activate
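To confirm that the virtual environment is active, you can compare the interpreter prefixes from inside Python. This is a minimal sketch that uses only the standard library; the helper name `in_virtualenv` is hypothetical and is not part of InvoiceNet.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv or virtualenv."""
    # Older virtualenv releases set sys.real_prefix inside the environment.
    if hasattr(sys, "real_prefix"):
        return True
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints False after sourcing env/bin/activate, the shell is probably still using a different Python interpreter.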
Prepare the data for training first by running the following command:
python prepare_data.py --data_dir train_data/
Train InvoiceNet using the following command:
python train.py --field enter-field-here --batch_size 8
# For example, for field 'total_amount'
python train.py --field total_amount --batch_size 8
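If you want to train several fields one after another, the command above can be wrapped in a small driver. This is only a sketch: the helpers `build_train_command` and `train_fields` are hypothetical, and it assumes train.py accepts exactly the `--field` and `--batch_size` options shown above.

```python
import subprocess
import sys


def build_train_command(field, batch_size=8):
    """Build the argument list for one train.py run, mirroring the command above."""
    return [sys.executable, "train.py", "--field", field, "--batch_size", str(batch_size)]


def train_fields(fields, batch_size=8, dry_run=True):
    """Build one training command per field; run them only when dry_run is False."""
    commands = [build_train_command(field, batch_size) for field in fields]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop as soon as one training run fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # 'total_amount' is the field name from the example above; add other
    # field names only if your copy of train.py accepts them.
    for cmd in train_fields(["total_amount"]):
        print(" ".join(cmd))
```

With the default dry_run=True the script only prints the commands, so you can review them before starting a long training run.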
To extract a field from a single invoice file, run the following command:
python predict.py --field enter-field-here --invoice path-to-invoice-file
# For example, to extract the total amount from the invoice file invoices/1.pdf
python predict.py --field total --invoice invoices/1.pdf # predict only the total amount
python predict.py --field company address total date --invoice invoices/1.pdf # predict the company, address, total, and date at the same time
To extract information with the trained InvoiceNet model, place the PDF invoice documents in a single directory with the following layout:
predict_data/
    invoice1.pdf
    invoice2.pdf
    ...
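Before running prediction, it can help to check that the directory really contains only PDF files. The following sketch uses the standard library; the helper names `list_invoice_pdfs` and `list_unexpected_files` are hypothetical, and the check assumes invoices sit directly inside the directory, as in the layout above.

```python
from pathlib import Path


def _is_pdf(path):
    """True for regular files whose extension is .pdf (case-insensitive)."""
    return path.is_file() and path.suffix.lower() == ".pdf"


def list_invoice_pdfs(data_dir):
    """Return the PDF files found directly inside data_dir, sorted by name."""
    root = Path(data_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted((p for p in root.iterdir() if _is_pdf(p)), key=lambda p: p.name)


def list_unexpected_files(data_dir):
    """Return entries in data_dir that are not PDF files (including subdirectories)."""
    root = Path(data_dir)
    return sorted((p for p in root.iterdir() if not _is_pdf(p)), key=lambda p: p.name)


if __name__ == "__main__":
    folder = Path("predict_data")
    if folder.is_dir():
        print("invoices:", [p.name for p in list_invoice_pdfs(folder)])
        print("unexpected:", [p.name for p in list_unexpected_files(folder)])
```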
Run InvoiceNet using the following command:
python predict.py --field enter-field-here --data_dir predict_data/
# For example, to extract the total amount
python predict.py --field total --data_dir predict_data/ # predict only the total amount
python predict.py --field company address total date --data_dir predict_data/ # predict the company, address, total, and date at the same time
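The single-file and directory modes differ only in their last option, so both can be driven from one helper. This is a sketch under stated assumptions: `build_predict_command` is a hypothetical name, and it relies only on the `--field`, `--invoice`, and `--data_dir` options shown in the commands above. It does not parse the output of predict.py, since that format is not described here.

```python
import subprocess
import sys


def build_predict_command(fields, invoice=None, data_dir=None):
    """Build the predict.py argument list for one invoice or a whole directory."""
    if not fields:
        raise ValueError("at least one field is required")
    # Exactly one input source must be given.
    if (invoice is None) == (data_dir is None):
        raise ValueError("pass exactly one of invoice or data_dir")
    cmd = [sys.executable, "predict.py", "--field", *fields]
    if invoice is not None:
        cmd += ["--invoice", str(invoice)]
    else:
        cmd += ["--data_dir", str(data_dir)]
    return cmd


if __name__ == "__main__":
    cmd = build_predict_command(["total", "date"], data_dir="predict_data/")
    print(" ".join(cmd))
    # Uncomment to actually run the prediction from the repository root:
    # subprocess.run(cmd, check=True)
```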
Input invoice:
Result of total amount extraction:
This implementation is largely based on the work of InvoiceNet.