# REBEL

This notebook is part of the lecture series of the Faculty Development Programme organised by the Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, in association with ShodhGuru Innovation and Research Labs, India. Specifically, it accompanies Tek Raj Chhetri's lecture, "Applications of Deep Neural Networks in Knowledge Graph Construction".



### Installation 
 
We will use two models from the Hugging Face Hub: `Babelscape/rebel-large` and `ibm/knowgl-large`.

- Huguet Cabot, P.-L. and Navigli, R., 2021. REBEL: Relation Extraction By End-to-end Language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2370-2381).
- https://huggingface.co/Babelscape/rebel-large
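The notebook does not list install commands, so it can help to check up front that the packages it imports are available. The snippet below is a minimal sketch rather than part of the original lecture material: `REQUIRED` and `missing_packages` are hypothetical names, the package list is inferred from the imports used later in this notebook, and `torch` is assumed as the backend for `transformers`.

```python
from importlib.util import find_spec

# Import name -> pip package name (assumed from the imports used in this notebook).
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",  # assumption: PyTorch is the backend used by the pipeline
    "pandas": "pandas",
    "streamlit": "streamlit",
    "streamlit_agraph": "streamlit-agraph",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, running the printed `pip install` command in a terminal (or prefixed with `!` in a notebook cell) should be enough in a typical environment.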



### Babelscape/rebel-large

In [6]:
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')


In [7]:
def extract_triplets(text):
    # See https://huggingface.co/Babelscape/rebel-large
    # REBEL linearizes each triplet as: <triplet> subject <subj> object <obj> relation
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'  # field that the next plain token belongs to
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            # A new subject starts: store the triplet collected so far, if any.
            current = 't'
            if relation != '':
                triplets.append({'subject': subject.strip(), 'relation': relation.strip(), 'object': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            # A new object for the same subject starts: store the previous triplet, if any.
            current = 's'
            if relation != '':
                triplets.append({'subject': subject.strip(), 'relation': relation.strip(), 'object': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            # The relation label follows.
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    # Store the last triplet.
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'subject': subject.strip(), 'relation': relation.strip(), 'object': object_.strip()})
    return triplets
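The token loop above decodes REBEL's linearized output, in which each triplet is written as `<triplet> subject <subj> object <obj> relation` and further `<subj> ... <obj> ...` pairs reuse the same subject. To make that format concrete, here is a compact sketch of the same idea using string splitting, run on a hand-written string. `parse_linearized` and `sample` are hypothetical names introduced for illustration; the sample is not real model output.

```python
def parse_linearized(text):
    """Split-based sketch of parsing REBEL's linearized triplets (illustrative only)."""
    # Drop the sequence-level special tokens, as the token loop does.
    for special in ("<s>", "</s>", "<pad>"):
        text = text.replace(special, "")
    triplets = []
    # Each <triplet> segment starts with the subject, followed by one or more
    # "<subj> object <obj> relation" pairs that share that subject.
    for segment in text.split("<triplet>")[1:]:
        subject, *pairs = segment.split("<subj>")
        for pair in pairs:
            if "<obj>" not in pair:
                continue  # incomplete pair, e.g. truncated generation
            object_, relation = pair.split("<obj>", 1)
            triplets.append({
                "subject": subject.strip(),
                "relation": relation.strip(),
                "object": object_.strip(),
            })
    return triplets


# Hand-written example in the linearized format (not actual model output).
sample = ("<s><triplet> Mount Gardner <subj> mountain <obj> instance of "
          "<subj> Sentinel Range <obj> mountain range</s>")
for triplet in parse_linearized(sample):
    print(triplet)
```

Note the slightly counterintuitive marker names: the text after `<subj>` is the object, and the text after `<obj>` is the relation label.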


In [8]:
input_text_einstein = "Mount Gardner  is a mountain, high, standing west of Mount Tyree in the west-central part of the Sentinel Range, in the Ellsworth Mountains of Antarctica. It surmounts Patton Glacier to the northeast.The peak was discovered by the Marie Byrd Land Traverse party of 1957–58 under Charles R. Bentley and was named by the Advisory Committee on Antarctic Names for Lieutenant Harvey L. Gardner, U.S. Navy, a pilot in Antarctica in the 1957–58 and 1958–59 seasons who was killed in the crash of a UB-1 Otter airplane at Marble Point on January 4, 1959. Russet Hills  is a line of hills trending east–west for 3.5 nautical miles (6 km) and forming the southern ridge of Gallipoli Heights in the Freyberg Mountains. Named by the New Zealand Antarctic Place-Names Committee (NZ-APC) on the proposal of P.J. Oliver, New Zealand Antarctic Research Program (NZARP) geologist who studied the hills, 1981–82. Named descriptively from the red-colored ignimbrite rock of this feature."

In [9]:
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(
    input_text_einstein, 
    return_tensors=True, return_text=False)[0]["generated_token_ids"]])

In [10]:
import pandas as pd

In [11]:
babel_df = pd.DataFrame.from_dict(extract_triplets(extracted_text[0]))

In [12]:
babel_df

|   | subject | relation | object |
|---|---|---|---|
| 0 | Mount Gardner | instance of | mountain |
| 1 | Mount Gardner | mountain range | Sentinel Range |
| 2 | Sentinel Range | mountain range | Ellsworth Mountains |
| 3 | Ellsworth Mountains | has part | Sentinel Range |


In [13]:
babel_df.to_csv("babel_df_einstein.csv", index=False)

In [14]:
%%writefile babel_df.py

import pandas as pd
from streamlit_agraph import agraph, Node, Edge, Config

nodes = set()
Edges = []
triples = pd.read_csv("babel_df_einstein.csv")
 

for _,triple in triples.iterrows():
    nodes.add(triple['subject'])
    nodes.add(triple['object'])
    Edges.append(Edge(source=triple['subject'],
                      label=triple['relation'],
                      target=triple['object'],
                      # **kwargs
                      )
                 )
st_nodes = [Node(id=n, label=n) for n in nodes]
# Width and height are large to suit my screen; adjust them for yours.
config = Config(width=3800,
                height=1300,
                nodeHighlightBehavior=True, 
                highlightColor="#ff0000",
                directed=True, 
                hierarchical=True,  
                )

agraph(nodes=st_nodes,
       edges=Edges, config=config)


Overwriting babel_df.py


In [None]:
!streamlit run babel_df.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-04-06 13:03:55.867 INFO    numexpr.utils: NumExpr defaulting to 8 threads.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://138.232.106.80:8501

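The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the cell is run near the top of the notebook, before the pipeline is created and before `!streamlit run` forks a subprocess:

```python
import os

# Set this before the tokenizer is first used. "false" disables tokenizer
# parallelism, which is what the library falls back to after a fork anyway.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` trades some tokenization speed for a quiet, deadlock-free fork, which seems a reasonable choice for a single short input like the one used here.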
