# PURE: Entity and Relation Extraction from Text

This is a reproduction of the results presented in the research paper [A Frustratingly Easy Approach for Entity and Relation Extraction](https://arxiv.org/pdf/2010.12812.pdf).

##### Environment information
Windows 11

Python 3.6.13

pip 21.2.2

## Setup

### Install dependencies

The authors first instruct us to run the following command to install all the requirements.

In [1]:
! pip install -r requirements.txt

Collecting tqdm (from -r requirements.txt (line 1))
  Using cached tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
Collecting allennlp==0.9.0 (from -r requirements.txt (line 2))
  Using cached allennlp-0.9.0-py3-none-any.whl.metadata (11 kB)


ERROR: Ignored the following versions that require a different python version: 0.2.0 Requires-Python ==3.6
ERROR: Could not find a version that satisfies the requirement torch==1.4.0 (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1)
ERROR: No matching distribution found for torch==1.4.0


But as you can see, that didn't work, at least not in my environment. The `requirements.txt` file pins a fairly old version of PyTorch, 1.4.0, and pip could not find a matching distribution for it.
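
One possible cause, which I have not confirmed for this environment, is that `! pip` resolves to a different Python installation than the one running the notebook kernel; pip only offers wheels built for the interpreter it belongs to. The following is a minimal sketch for checking this. The helper name `pip_command` is hypothetical; it builds a pip invocation tied to the kernel's own interpreter through `sys.executable`.

```python
import sys

def pip_command(args):
    """Build a pip command that runs under the current interpreter."""
    return [sys.executable, "-m", "pip"] + list(args)

# Which Python is this kernel running?
print("Interpreter:", sys.executable)
print("Version: %d.%d.%d" % sys.version_info[:3])

# The resulting list can be passed to subprocess.run to install into this interpreter
print(pip_command(["install", "-r", "requirements.txt"]))
```

If the printed version is not 3.6, the pinned packages would be resolved against a different Python than the one listed in the environment information.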

So I did some digging and found [this Stack Overflow question](https://stackoverflow.com/questions/56239310/could-not-find-a-version-that-satisfies-the-requirement-torch-1-0-0), where one of the answers, by [Sandokan](https://stackoverflow.com/users/8168933/sandokan), suggests this command:

In [1]:
! pip install torch===1.4.0 torchvision===0.5.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.4.0
  Using cached https://download.pytorch.org/whl/cu101/torch-1.4.0-cp36-cp36m-win_amd64.whl (796.8 MB)
Collecting torchvision===0.5.0
  Using cached torchvision-0.5.0-cp36-cp36m-win_amd64.whl (1.2 MB)
Collecting pillow>=4.1.1
  Using cached Pillow-8.4.0-cp36-cp36m-win_amd64.whl (3.2 MB)
Collecting numpy
  Using cached numpy-1.19.5-cp36-cp36m-win_amd64.whl (13.2 MB)
Installing collected packages: torch, pillow, numpy, torchvision
Successfully installed numpy-1.19.5 pillow-8.4.0 torch-1.4.0 torchvision-0.5.0
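
To confirm that the pinned versions are the ones actually visible to the kernel, a small check like the one below can help. This is a sketch, not part of the authors' instructions: `installed_version` and `check_pins` are hypothetical helper names, and the fallback to `pkg_resources` is there only because `importlib.metadata` is unavailable on Python 3.6.

```python
def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for Python 3.6/3.7
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def check_pins(pins):
    """Map each package to (wanted, found, matches), ignoring local suffixes like '+cpu'."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report

# Versions taken from the pip command above
for name, (wanted, found, ok) in check_pins({"torch": "1.4.0", "torchvision": "0.5.0"}).items():
    print(name, "wanted", wanted, "found", found, "OK" if ok else "MISMATCH")
```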


In [2]:
! pip install allennlp-models

Collecting allennlp-models
  Downloading allennlp_models-2.10.1-py3-none-any.whl.metadata (23 kB)
Collecting torch<1.13.0,>=1.7.0 (from allennlp-models)
  Downloading torch-1.12.1-cp39-cp39-win_amd64.whl.metadata (22 kB)
Collecting conllu==4.4.2 (from allennlp-models)
  Downloading conllu-4.4.2-py2.py3-none-any.whl.metadata (19 kB)
Collecting word2number>=1.1 (from allennlp-models)
  Downloading word2number-1.1.zip (9.7 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting py-rouge==1.1 (from allennlp-models)
  Downloading py_rouge-1.1-py3-none-any.whl.metadata (8.7 kB)
Collecting nltk>=3.6.5 (from allennlp-models)
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting ftfy (from allennlp-models)
  Downloading ftfy-6.2.0-py3-none-any.whl.metadata (7.3 kB)
Collecting datasets (from allennlp-models)
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting allennlp<2.11,>=2.10.1 (from allennlp

That worked like a charm!

Now let's try installing the rest of the requirements with the same original command:

In [3]:
! pip install -r requirements.txt

Collecting allennlp==0.9.0 (from -r requirements.txt (line 2))
  Using cached allennlp-0.9.0-py3-none-any.whl.metadata (11 kB)


ERROR: Ignored the following versions that require a different python version: 0.2.0 Requires-Python ==3.6
ERROR: Could not find a version that satisfies the requirement torch==1.4.0 (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1)
ERROR: No matching distribution found for torch==1.4.0


There we go! There's a little warning, but let's pretend we didn't see that.

### Download and preprocess the datasets

Next, the authors ask us to download and preprocess the datasets. Their experiments are based on three datasets: ACE04, ACE05, and SciERC. The links and preprocessing instructions are:
* ACE04/ACE05: The authors use the preprocessing code from the [DyGIE repo](https://github.com/luanyi/DyGIE/tree/master/preprocessing). Follow the instructions there to preprocess the ACE05 and ACE04 datasets.
* SciERC: The preprocessed SciERC dataset can be downloaded from its project [website](http://nlp.cs.washington.edu/sciIE/).

Let's do that in the next section.

## Quick Start


For this reproduction, we will use the pre-trained models provided by the authors, starting with the SciERC dataset.

In [40]:
! pip install clint

Collecting clint
  Downloading clint-0.5.1.tar.gz (29 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting args
  Downloading args-0.1.0.tar.gz (3.0 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: clint, args
  Building wheel for clint (setup.py): started
  Building wheel for clint (setup.py): finished with status 'done'
  Created wheel for clint: filename=clint-0.5.1-py3-none-any.whl size=34458 sha256=362cf8d8bbfcf083639eb05965cfbbefd23e2defeb0d69747083d96324b38260
  Stored in directory: c:\users\odaim\appdata\local\pip\cache\wheels\2c\69\16\04ffdd2e6fbbf2b3aa97970ba8d01c36d09df025f19f25c57e
  Building wheel for args (setup.py): started
  Building wheel for args (setup.py): finished with status 'done'
  Created wheel for args: filename=args-0.1.0-py3-none-any.whl size=3321 sha256=e53dc3c1bf6de0fcf15aacd97872c9b249ae6ab5600eb7

### Helper functions
Let's implement some functions to help us:
 - Download files
 - Extract tar files
 - Unzip zip files

In [2]:
import os
import tarfile
import zipfile

import requests
from tqdm import tqdm


def download_file(file_name, url):
    """Stream `url` to `file_name`, a path under the working directory that starts with '/'."""
    file_name = os.getcwd() + file_name
    os.makedirs(os.path.dirname(file_name), exist_ok=True)
    r = requests.get(url, stream=True)

    # Raise an error for bad status codes
    r.raise_for_status()

    # Content-Length may be absent; tqdm then shows progress without a total
    total = int(r.headers.get('Content-Length', 0)) or None
    with open(file_name, 'wb') as f, tqdm(unit="B", total=total, position=0, leave=True, desc='Downloading') as pbar:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                pbar.update(len(chunk))
                f.write(chunk)


def extract_tar_file(file_name, target_directory):
    """Extract every member of the tar archive into `target_directory`."""
    with tarfile.open(name=os.getcwd() + file_name) as tar:
        members = tar.getmembers()  # read the member list once
        for member in tqdm(iterable=members, total=len(members), desc='Extracting'):
            tar.extract(member=member, path=os.getcwd() + target_directory)


def unzip_file(file_name, target_directory):
    """Extract every member of the zip archive into `target_directory`."""
    with zipfile.ZipFile(os.getcwd() + file_name) as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, os.getcwd() + target_directory)
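
The archives downloaded in the following sections are several hundred megabytes each, so it can be worth skipping a download when the file is already on disk. The guard below is a sketch of that idea; `needs_download` is a hypothetical helper that is not part of the notebook, and the optional size check assumes you already know the expected byte count (for example from an earlier progress bar).

```python
import os

def needs_download(path, expected_size=None):
    """Return True when `path` is missing, or when its size differs from `expected_size`."""
    if not os.path.isfile(path):
        return True
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return True
    return False

# Example: only fetch the archive when it is not already present
archive = os.path.join(os.getcwd(), "scierc_data", "sciERC_processed.tar.gz")
print("download required:", needs_download(archive))
```

A size mismatch is treated as a reason to download again, which also covers downloads that were interrupted partway through.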

### Download and extract the SciERC dataset
This is the first step: getting the data that we will run the pre-trained models on.

In [109]:
# Download the SciERC dataset
download_file('/scierc_data/sciERC_processed.tar.gz', 'http://nlp.cs.washington.edu/sciIE/data/sciERC_processed.tar.gz')

Downloading: 100%|██████████| 695340151/695340151 [02:17<00:00, 5061976.13B/s] 


In [1]:
# Extract the SciERC dataset
extract_tar_file('/scierc_data/sciERC_processed.tar.gz', '/scierc_data')

NameError: name 'extract_tar_file' is not defined
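
Once the archive has been extracted, a quick look at the data helps confirm that it is usable. The sketch below assumes the files are in JSON Lines form, with one document per line, and it reports whatever top-level keys it finds rather than relying on specific field names. The example path in the comment and the `sentences` key are assumptions about the archive layout and should be checked against the extracted files; `read_json_lines` and `summarize` are hypothetical helper names.

```python
import json

def read_json_lines(path):
    """Read a file that holds one JSON object per line."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                docs.append(json.loads(line))
    return docs

def summarize(docs):
    """Count documents and sentences, and list the top-level keys that occur."""
    keys = sorted({key for doc in docs for key in doc})
    n_sentences = sum(len(doc.get("sentences", [])) for doc in docs)
    return {"documents": len(docs), "sentences": n_sentences, "keys": keys}

# Example (the path is an assumption about the archive layout):
# docs = read_json_lines("scierc_data/processed_data/json/dev.json")
# print(summarize(docs))
```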

### Download and extract the pre-trained entity model
Now we will download the pre-trained entity model and use it to extract entities from the dataset.

In [111]:
# Download the pre-trained entity model
download_file('/scierc_models/ent-scib-ctx0.zip', 'https://nlp.cs.princeton.edu/projects/pure/scierc_models/ent-scib-ctx0.zip')

Downloading: 100%|██████████| 409227718/409227718 [01:10<00:00, 5842408.54B/s] 


In [4]:
# Unzip the pre-trained entity model
unzip_file('/scierc_models/ent-scib-ctx0.zip', '/scierc_models')

Extracting: 100%|██████████| 6/6 [00:02<00:00,  2.47it/s]
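
The progress bar reports six extracted members. To see what they are without extracting anything, the archive can be listed with the standard `zipfile` module. This is a small sketch; `list_zip_members` is a hypothetical helper, and the example path simply mirrors the one used above.

```python
import zipfile

def list_zip_members(path):
    """Return (name, uncompressed size in bytes) for each member of a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

# Example:
# for name, size in list_zip_members("scierc_models/ent-scib-ctx0.zip"):
#     print(name, size)
```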


### Download and extract the pre-trained relation model
Now we'll do the same for the relation model.

In [113]:
# Download the pre-trained relation model
download_file('/scierc_models/rel-scib-ctx0.zip', 'https://nlp.cs.princeton.edu/projects/pure/scierc_models/rel-scib-ctx0.zip')

Downloading: 100%|██████████| 408246037/408246037 [00:22<00:00, 17918528.12B/s]


In [5]:
# Unzip the pre-trained relation model
unzip_file('/scierc_models/rel-scib-ctx0.zip', '/scierc_models')

Extracting: 100%|██████████| 6/6 [00:02<00:00,  2.38it/s]


### Download and extract the pre-trained approximation relation model
And one last time for the approximation relation model.

In [115]:
# Download the pre-trained approximation relation model
download_file('/scierc_models/rel_approx-scib-ctx0.zip', 'https://nlp.cs.princeton.edu/projects/pure/scierc_models/rel_approx-scib-ctx0.zip')

Downloading: 100%|██████████| 408248055/408248055 [00:31<00:00, 13067492.36B/s]


In [6]:
# Unzip the pre-trained approximation relation model
unzip_file('/scierc_models/rel_approx-scib-ctx0.zip', '/scierc_models')

Extracting: 100%|██████████| 6/6 [00:02<00:00,  2.42it/s]
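
The three download-and-unzip steps above differ only in the model name, so they could also be driven from a single list. The sketch below only builds the `(file_name, url)` pairs from the URLs used above; `model_targets` is a hypothetical helper, and the calls to `download_file` and `unzip_file` are left commented out so the block runs without network access.

```python
BASE_URL = "https://nlp.cs.princeton.edu/projects/pure/scierc_models/"
MODEL_NAMES = ["ent-scib-ctx0", "rel-scib-ctx0", "rel_approx-scib-ctx0"]

def model_targets(names=MODEL_NAMES, base_url=BASE_URL, folder="/scierc_models"):
    """Pair each model's local zip path with its download URL."""
    return [(folder + "/" + name + ".zip", base_url + name + ".zip") for name in names]

for file_name, url in model_targets():
    print(file_name, "<-", url)
    # download_file(file_name, url)               # helper defined earlier in the notebook
    # unzip_file(file_name, "/scierc_models")     # helper defined earlier in the notebook
```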
