<a href="https://colab.research.google.com/github/Kabongosalomon/task-dataset-metric-extraction/blob/dataleakage/BERT/BERTScienceResultExtractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Run experiments based on textual entailment system

We release the training/testing datasets for all experiments described in the paper. You can find them under the data/exp directory. The results reported in the paper are based on the datasets under the [data/exp/few-shot-setup/NLP-TDMS/paperVersion](data/exp/few-shot-setup/NLP-TDMS/paperVersion) directory. We later cleaned the datasets further (e.g., we removed five PDF files from the test set that also appear in the training set under a different name); the cleaned version is under the [data/exp/few-shot-setup/NLP-TDMS](data/exp/few-shot-setup/NLP-TDMS) folder. Below we illustrate how to run experiments on the NLP-TDMS dataset in the few-shot setup to extract TDM pairs.
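
The cleaning step above removes test files that duplicate training files under a different name. The sketch below shows one possible way to spot such overlaps by comparing file contents rather than file names. It is an illustration only, not the procedure used for the released data: the function names are hypothetical, and it only catches byte-identical copies, so two differently rendered PDFs of the same paper would go unnoticed.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_leaked_files(train_dir: Path, test_dir: Path) -> list:
    """List (test_file, train_file) pairs whose contents are byte-identical.

    Hypothetical helper: file names are ignored, only contents are compared.
    """
    # Index the training PDFs by content digest.
    train_by_digest = {}
    for path in sorted(train_dir.glob("*.pdf")):
        train_by_digest.setdefault(file_digest(path), path)

    # Report every test PDF whose digest already occurs in the training set.
    leaked = []
    for path in sorted(test_dir.glob("*.pdf")):
        match = train_by_digest.get(file_digest(path))
        if match is not None:
            leaked.append((path, match))
    return leaked
```

Calling `find_leaked_files(Path("train_pdfs"), Path("test_pdfs"))` on two directories of PDFs (the directory names here are placeholders) returns the overlapping pairs, which can then be reviewed by hand before anything is deleted.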


1) Fork and clone this repository.

2) Download or clone [BERT](https://github.com/google-research/bert).

3) Run `pip install -r requirements.txt` from the `./bert_tdms/` folder.

4) Copy [run_classifier_sci.py](./bert_tdms/run_classifier_sci.py) into the BERT directory.

5) Download BERT embeddings.  We use the [base uncased models](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip).

6) If we use `BERT_DIR` to point to the directory with the embeddings and `DATA_DIR` to point to the [directory with our train and test data](./data/exp/few-shot-setup/NLP-TDMS/), we can run the textual entailment system with  [run_classifier_sci.py](./bert_tdms/run_classifier_sci.py). For example:

```
> DATA_DIR=../data/exp/few-shot-setup/NLP-TDMS/
> BERT_DIR=./model/uncased_L-12_H-768_A-12
> python run_classifier_sci.py --do_train=true --do_eval=false --do_predict=true --data_dir=${DATA_DIR} --task_name=sci --vocab_file=${BERT_DIR}/vocab.txt --bert_config_file=${BERT_DIR}/bert_config.json --init_checkpoint=${BERT_DIR}/bert_model.ckpt --output_dir=bert_tdms --max_seq_length=512 --train_batch_size=6 --predict_batch_size=6
```

7) [TEModelEvalOnNLPTDMS](nlpLeaderboard/src/main/java/com/ibm/sre/tdmsie/TEModelEvalOnNLPTDMS.java) provides methods to evaluate TDMS tuple extraction.

8) [GenerateTestDataOnPDFPapers](nlpLeaderboard/src/main/java/com/ibm/sre/tdmsie/GenerateTestDataOnPDFPapers.java) provides methods to generate a test dataset from arbitrary PDF papers.
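
The command in step 6 is long, and every path in it depends on `BERT_DIR` or `DATA_DIR`. As a minimal sketch, the same argument list can be assembled in Python so the two directories are set in one place. The helper name `build_command` is hypothetical; the flags and values are copied from the command in step 6.

```python
import shlex


def build_command(bert_dir: str, data_dir: str, output_dir: str = "bert_tdms") -> list:
    """Assemble the argument list for run_classifier_sci.py (hypothetical helper)."""
    bert_dir = bert_dir.rstrip("/")
    return [
        "python", "run_classifier_sci.py",
        "--do_train=true",
        "--do_eval=false",
        "--do_predict=true",
        f"--data_dir={data_dir}",
        "--task_name=sci",
        f"--vocab_file={bert_dir}/vocab.txt",
        f"--bert_config_file={bert_dir}/bert_config.json",
        f"--init_checkpoint={bert_dir}/bert_model.ckpt",
        f"--output_dir={output_dir}",
        "--max_seq_length=512",
        "--train_batch_size=6",
        "--predict_batch_size=6",
    ]


command = build_command(
    bert_dir="./model/uncased_L-12_H-768_A-12",
    data_dir="../data/exp/few-shot-setup/NLP-TDMS/",
)
# Print the command as a single shell-safe line for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

To launch the run from the BERT directory, the list can be passed to `subprocess.run(command, check=True)`.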


### Read NLP-TDMS and ARC-PDN corpora ###

1) Follow the instructions in the [README](data/NLP-TDMS/downloader/README.md) in [data/NLP-TDMS/downloader/](data/NLP-TDMS/downloader/) to download the entire collection of raw PDFs of the NLP-TDMS dataset.  The downloaded PDFs can be moved to [data/NLP-TDMS/pdfFile](./data/NLP-TDMS/pdfFile) (i.e., `mv *.pdf ../pdfFile/.`).

2) For the ARC-PDN corpus, the original PDF files can be downloaded from the [ACL Anthology Reference Corpus (Version 20160301)](https://acl-arc.comp.nus.edu.sg/). We use papers from ACL(P)/EMNLP(D)/NAACL(N) between 2010 and 2015. After uncompressing the downloaded PDF files, put the PDF files into the corresponding directories under the /data/ARC-PDN/ folder, e.g., copy D10 to /data/ARC-PDN/D/D10.

3) We release the parsed NLP-TDMS and ARC-PDN corpora. [NlpTDMSReader](nlpLeaderboard/src/main/java/com/ibm/sre/data/corpus/NlpTDMSReader.java) and [ArcPDNReader](nlpLeaderboard/src/main/java/com/ibm/sre/data/corpus/ArcPDNReader.java) in the corpus package illustrate how to read section and table contents from PDF files in these two corpora. 
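
The directory layout described in step 2 (venue letter, then volume, e.g. `D10` under `D/`) can be expressed as a small helper. This is a sketch under assumptions: the function name `arc_pdn_dir` and the relative root `data/ARC-PDN` are illustrative, and the check only validates the naming pattern, not whether a given venue actually has a volume for that year.

```python
from pathlib import Path

# Venue prefixes and two-digit years taken from the corpus description above.
VENUE_PREFIXES = {"P": "ACL", "D": "EMNLP", "N": "NAACL"}
YEARS = range(10, 16)  # 2010 through 2015


def arc_pdn_dir(volume: str, root: Path = Path("data/ARC-PDN")) -> Path:
    """Map a volume name such as 'D10' to its target directory (hypothetical helper)."""
    if len(volume) != 3 or volume[0] not in VENUE_PREFIXES:
        raise ValueError(f"unexpected volume name: {volume!r}")
    year = volume[1:]
    if not year.isdigit() or int(year) not in YEARS:
        raise ValueError(f"year outside 2010-2015: {volume!r}")
    return root / volume[0] / volume


print(arc_pdn_dir("D10").as_posix())
```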



In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
! git clone https://github.com/google-research/bert

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 315.49 KiB | 4.26 MiB/s, done.
Resolving deltas: 100% (185/185), done.


In [3]:
# !wget https://raw.githubusercontent.com/Kabongosalomon/task-dataset-metric-extraction/dataleakage/bert_tdms/requirements.txt
!pip install tensorflow==1.11.0
!pip install tensorflow-gpu==1.11.0

Collecting tensorflow==1.11.0
  Downloading https://files.pythonhosted.org/packages/ce/d5/38cd4543401708e64c9ee6afa664b936860f4630dd93a49ab863f9998cd2/tensorflow-1.11.0-cp36-cp36m-manylinux1_x86_64.whl (63.0MB)
     |████████████████████████████████| 63.0MB 70kB/s 
Collecting setuptools<=39.1.0
  Downloading https://files.pythonhosted.org/packages/8c/10/79282747f9169f21c053c562a0baa21815a8c7879be97abd930dbcf862e8/setuptools-39.1.0-py2.py3-none-any.whl (566kB)
     |████████████████████████████████| 573kB 45.6MB/s 
Collecting tensorboard<1.12.0,>=1.11.0
  Downloading https://files.pythonhosted.org/packages/9b/2f/4d788919b1feef04624d63ed6ea45a49d1d1c834199ec50716edb5d310f4/tensorboard-1.11.0-py3-none-any.whl (3.0MB)
     |████████████████████████████████| 3.0MB 43.0MB/s 
Collecting keras-applications>=1.0.5
  Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-

Collecting tensorflow-gpu==1.11.0
  Downloading https://files.pythonhosted.org/packages/25/52/01438b81806765936eee690709edc2a975472c4e9d8d465a01840869c691/tensorflow_gpu-1.11.0-cp36-cp36m-manylinux1_x86_64.whl (258.8MB)
     |████████████████████████████████| 258.8MB 54kB/s 
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.11.0
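
The traceback at the end of this notebook shows the TensorFlow import failing because `libcublas.so.9.0` cannot be opened. Before starting a long run, it may help to check whether the dynamic loader can find that library at all. The sketch below does this with the standard `ctypes` module; the helper name `can_load` is hypothetical, and a successful load of this one library does not guarantee that every other CUDA dependency is present.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


print("libcublas.so.9.0 loadable:", can_load("libcublas.so.9.0"))
```

If this prints `False`, importing the GPU build of TensorFlow 1.11.0 in the same environment is likely to fail in the way the traceback shows.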


5) Download BERT embeddings.  We use the [base uncased models](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip).

In [6]:
# !wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

--2020-12-09 13:38:42--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.212.128, 172.217.214.128, 108.177.111.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.212.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip’


2020-12-09 13:38:45 (153 MB/s) - ‘uncased_L-12_H-768_A-12.zip’ saved [407727028/407727028]



In [7]:
# !unzip uncased_L-12_H-768_A-12.zip

Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-12_H-768_A-12/vocab.txt  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: uncased_L-12_H-768_A-12/bert_config.json  
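
The training command reads `vocab.txt`, `bert_config.json`, and the `bert_model.ckpt` checkpoint from `BERT_DIR`. A quick check that the unzipped directory contains the files listed above can catch a wrong path early. This is a minimal sketch; the helper name `missing_bert_files` is hypothetical, and the file list is taken from the unzip output.

```python
from pathlib import Path

# File names as listed in the unzip output above.
EXPECTED_FILES = [
    "bert_config.json",
    "vocab.txt",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "bert_model.ckpt.data-00000-of-00001",
]


def missing_bert_files(bert_dir: str) -> list:
    """Return the expected file names that are absent from bert_dir."""
    root = Path(bert_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_bert_files("uncased_L-12_H-768_A-12")
print("missing files:", missing or "none")
```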


6) If we use `BERT_DIR` to point to the directory with the embeddings and `DATA_DIR` to point to the [directory with our train and test data](./data/exp/few-shot-setup/NLP-TDMS/), we can run the textual entailment system with  [run_classifier_sci.py](./bert_tdms/run_classifier_sci.py). For example:

```
> DATA_DIR=../data/exp/few-shot-setup/NLP-TDMS/
> BERT_DIR=./model/uncased_L-12_H-768_A-12
> python run_classifier_sci.py --do_train=true --do_eval=false --do_predict=true --data_dir=${DATA_DIR} --task_name=sci --vocab_file=${BERT_DIR}/vocab.txt --bert_config_file=${BERT_DIR}/bert_config.json --init_checkpoint=${BERT_DIR}/bert_model.ckpt --output_dir=bert_tdms --max_seq_length=512 --train_batch_size=6 --predict_batch_size=6
```

In [1]:
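# A `!VAR=value` line runs in its own subshell, so the variable is lost before the next cell.
# `%env` sets it for the notebook process; absolute paths keep working after `%cd bert/`.
%env DATA_DIR=/content/gdrive/MyDrive/colab-ssh/science-result-extractor/data/exp/few-shot-setup/NLP-TDMS/
%env BERT_DIR=/content/gdrive/MyDrive/colab-ssh/uncased_L-12_H-768_A-12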

In [10]:
# !cp -r uncased_L-12_H-768_A-12/ ../content/gdrive/MyDrive/colab-ssh/

In [None]:
!wget https://raw.githubusercontent.com/Kabongosalomon/task-dataset-metric-extraction/dataleakage/bert_tdms/run_classifier_sci.py
!cp run_classifier_sci.py bert/

In [11]:
%cd bert/
# %cd ../

/content/bert


In [12]:
!python run_classifier_sci.py --do_train=true --do_eval=false --do_predict=true --data_dir=${DATA_DIR} --task_name=sci --vocab_file=${BERT_DIR}/vocab.txt --bert_config_file=${BERT_DIR}/bert_config.json --init_checkpoint=${BERT_DIR}/bert_model.ckpt --output_dir=bert_tdms --max_seq_length=512 --train_batch_size=6 --predict_batch_size=6

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_classifie