# OAGBERT (Open Academic Graph BERT)
We released two versions of OAGBERT by the end of February 2021:

### V1: The vanilla version
A basic version of OAGBERT. Similar to [SciBERT](https://github.com/allenai/scibert), we pre-train the BERT model on an academic text corpus from the Open Academic Graph, including paper titles, abstracts, and bodies.

OAGBERT is used in the same way as ordinary SciBERT or BERT. For example, you can use the following code to encode two text sequences and retrieve their outputs:
```python
from cogdl import oagbert

tokenizer, bert_model = oagbert()

sequence = ["CogDL is developed by KEG, Tsinghua.", "OAGBert is developed by KEG, Tsinghua."]
tokens = tokenizer(sequence, return_tensors="pt", padding=True)
outputs = bert_model(**tokens)
```
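If you need fixed-size sentence embeddings from these outputs, a common approach is mean pooling over the token representations. The sketch below is an assumption-laden example: it assumes the model follows the standard BERT interface where `outputs[0]` is the last hidden state of shape `(batch, seq_len, hidden)` and that the tokenizer returns an `attention_mask`; check the actual return values of the cogdl model before relying on it.
```python
import torch

# Assumption: outputs[0] is the last hidden state (batch, seq_len, hidden),
# as in a standard BERT forward pass.
last_hidden = outputs[0]

# Mask out padding tokens before averaging.
mask = tokens["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two encoded sequences.
similarity = torch.nn.functional.cosine_similarity(
    sentence_embeddings[0], sentence_embeddings[1], dim=0
)
print(similarity.item())
```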

### V2: The entity augmented version
An extension of the vanilla OAGBERT. We incorporate rich entity information from the Open Academic Graph, such as **authors** and **fields of study**, so you can encode various types of entities with OAGBERT v2. For example, to encode the BERT paper, you can use the following code:
```python
from cogdl import oagbert
import torch

tokenizer, model = oagbert("oagbert-v2")
title = 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
abstract = 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation...'
authors = ['Jacob Devlin', 'Ming-Wei Chang', 'Kenton Lee', 'Kristina Toutanova']
venue = 'north american chapter of the association for computational linguistics'
affiliations = ['Google']
concepts = ['language model', 'natural language inference', 'question answering']
# build model inputs
input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions, num_spans = model.build_inputs(
    title=title, abstract=abstract, venue=venue, authors=authors, concepts=concepts, affiliations=affiliations
)
# run forward
sequence_output, pooled_output = model.bert.forward(
    input_ids=torch.LongTensor(input_ids).unsqueeze(0),
    token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
    attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
    output_all_encoded_layers=False,
    checkpoint_activations=False,
    position_ids=torch.LongTensor(position_ids).unsqueeze(0),
    position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0)
)
```
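The `pooled_output` can serve as a single-vector representation of the whole paper. As a minimal sketch (assuming `pooled_output` has shape `(1, hidden_size)` and the same `build_inputs` / `bert.forward` interface shown above), you could wrap the encoding step in a helper and compare papers by their pooled vectors:
```python
import torch

def encode_paper(model, **fields):
    # Assumes the same build_inputs / bert.forward interface shown above.
    input_ids, input_masks, token_type_ids, _, position_ids, position_ids_second, _, _ = model.build_inputs(**fields)
    _, pooled = model.bert.forward(
        input_ids=torch.LongTensor(input_ids).unsqueeze(0),
        token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0),
        attention_mask=torch.LongTensor(input_masks).unsqueeze(0),
        output_all_encoded_layers=False,
        checkpoint_activations=False,
        position_ids=torch.LongTensor(position_ids).unsqueeze(0),
        position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0),
    )
    return pooled  # assumed shape: (1, hidden_size)

paper_emb = encode_paper(
    model, title=title, abstract=abstract, venue=venue,
    authors=authors, concepts=concepts, affiliations=affiliations
)
# Embeddings of two papers could then be compared, e.g. with torch.nn.functional.cosine_similarity.
```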
You can also use integrated functions to work with OAGBERT v2 directly, such as `decode_beamsearch`, which generates entities conditioned on the existing context. For example, to generate concepts of two tokens for the BERT paper, run the following code:
```python
model.eval()
candidates = model.decode_beamsearch(
    title=title,
    abstract=abstract,
    venue=venue,
    authors=authors,
    affiliations=affiliations,
    decode_span_type='FOS',
    decode_span_length=2,
    beam_width=8,
    force_forward=False
)
```
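The exact return format of `decode_beamsearch` may vary across cogdl versions; assuming it returns an iterable of decoded candidate spans (with `beam_width=8` yielding up to eight of them), a quick way to inspect the results is:
```python
# Assumption: each candidate is a decoded field-of-study span (or a (span, score) pair).
for candidate in candidates:
    print(candidate)
```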
For more examples, refer to [examples/oagbert_metainfo.py](../../examples/oagbert_metainfo.py).
