This notebook is part of the lecture series at the Faculty Development Programme organised by the Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, in association with ShodhGuru Innovation and Research Labs, India. Specifically, it accompanies Tek Raj Chhetri's lecture "Applications of Deep Neural Networks in Knowledge Graph Construction".

The demo uses the Python implementation of Stanford OpenIE from [1] and F-COREF [2]. 

[1] __Angeli, G., Premkumar, M.J.J. and Manning, C.D., 2015, July. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 344-354).__

[2] __Otmazgin, S., Cattan, A. and Goldberg, Y., 2022. F-COREF: Fast, Accurate and Easy to Use Coreference Resolution. arXiv preprint arXiv:2209.04280.__

### Installation
`pip install stanford_openie`

`pip install fastcoref` 

`pip install -U spacy`

We use the small model, `en_core_web_sm`, here; for higher accuracy, use `en_core_web_trf`, as suggested by [spaCy](https://spacy.io/usage).


Note: Java is also required, because Stanford OpenIE runs on top of a Stanford CoreNLP server.
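Because the OpenIE client starts a Java process, it can help to confirm that a `java` executable is reachable before running the rest of the notebook. The following is a minimal sketch using only the Python standard library; the helper name `java_available` is hypothetical and not part of any of the packages above.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    # `java -version` writes its version banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip())
else:
    print("Java not found: install a JRE/JDK before using StanfordOpenIE.")
```

This only checks that an executable exists on the PATH; it does not verify that the installed Java version is recent enough for CoreNLP.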


In [1]:
# !pip install -U spacy --quiet
# !python -m spacy download en_core_web_sm --quiet

In [6]:
import spacy
from fastcoref import spacy_component
from openie import StanfordOpenIE
from fastcoref import FCoref

In [7]:
def apply_coreference_resolution(sentence):
    """Return `sentence` with coreferent mentions (e.g. pronouns) resolved."""
    # Load the small English spaCy pipeline.
    spacym = spacy.load('en_core_web_sm')
    # Add the fastcoref component, using the LingMess model on CPU.
    spacym.add_pipe(
        "fastcoref",
        config={'model_architecture': 'LingMessCoref',
                'model_path': 'biu-nlp/lingmess-coref', 'device': 'cpu'}
    )
    # resolve_text=True makes fastcoref populate doc._.resolved_text.
    doc = spacym(sentence, component_cfg={"fastcoref": {'resolve_text': True}})
    return doc._.resolved_text


def extract_information(sentence, graph_name=None, apply_resolution=False):
    """Extract subject/relation/object triples from `sentence` with Stanford OpenIE."""
    if apply_resolution:
        sentence = apply_coreference_resolution(sentence)
    # The affinity value above which confidence of the extraction is taken as 1.0.
    # Default value: 1/3.
    properties = {
        'openie.affinity_probability_cap': 1.5 / 3,
    }

    with StanfordOpenIE(properties=properties) as client:
        triples = list(client.annotate(sentence))
        # For large graphs we skip the Graphviz image; pass graph_name to create one.
        if graph_name is not None:
            client.generate_graphviz_graph(sentence, graph_name)
        return triples
        

## Before applying coreference resolution


Text: __Sanju Tiwari is a researcher. She works at the Universidad Autonoma de Tamaulipas.__


![KG1](kg_pronoun.png)

In [9]:
sentence = 'Sanju Tiwari is a researcher. She works at the Universidad Autonoma de Tamaulipas.'
graph_name = 'kg_conference_resolution_fixed.png'
extract_information(sentence=sentence, graph_name=graph_name, apply_resolution=True)

04/06/2023 10:40:09 - INFO - 	 missing_keys: []
04/06/2023 10:40:09 - INFO - 	 unexpected_keys: []
04/06/2023 10:40:09 - INFO - 	 mismatched_keys: []
04/06/2023 10:40:09 - INFO - 	 error_msgs: []
04/06/2023 10:40:09 - INFO - 	 Model Parameters: 590.0M, Transformer: 434.6M, Coref head: 155.4M
04/06/2023 10:40:09 - INFO - 	 Tokenize 1 inputs...
04/06/2023 10:40:09 - INFO - 	 ***** Running Inference on 1 texts *****         
Inference: 100%|██████████████████████████████████| 1/1 [00:04<00:00,  4.66s/it]


Starting server with command: java -Xmx8G -cp /Users/tekrajchhetri/.stanfordnlp_resources/stanford-corenlp-4.5.3/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-28819b58fd494ee9.props -preload openie
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[{'subject': 'Sanju Tiwari', 'relation': 'is', 'object': 'researcher'},
 {'subject': 'Sanju Tiwari',
  'relation': 'works at',
  'object': 'Universidad Autonoma de Tamaulipas'}]
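The result is a plain list of dictionaries with `subject`, `relation` and `object` keys, so it can be turned into a graph description without any extra library. The sketch below converts such triples into Graphviz DOT text; `triples_to_dot` is a hypothetical helper written for illustration and is not how `generate_graphviz_graph` is implemented.

```python
def _quote(text):
    """Wrap text in double quotes, escaping characters that DOT treats specially."""
    escaped = str(text).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triples_to_dot(triples, name="KG"):
    """Render subject/relation/object dicts as a directed graph in DOT syntax."""
    lines = [f"digraph {name} {{"]
    for t in triples:
        # One labelled edge per triple: subject -> object, labelled with the relation.
        lines.append(
            f"  {_quote(t['subject'])} -> {_quote(t['object'])} "
            f"[label={_quote(t['relation'])}];"
        )
    lines.append("}")
    return "\n".join(lines)


triples = [
    {'subject': 'Sanju Tiwari', 'relation': 'is', 'object': 'researcher'},
    {'subject': 'Sanju Tiwari', 'relation': 'works at',
     'object': 'Universidad Autonoma de Tamaulipas'},
]
dot = triples_to_dot(triples)
print(dot)
```

Both edges start from the same `"Sanju Tiwari"` node because the two triples share an identical subject string.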

## After coreference resolution

Applying coreference resolution replaces the pronoun "She" with "Sanju Tiwari", so both extracted triples share the same subject and the earlier issue with the knowledge graph is fixed.

![KG1](kg_conference_resolution_fixed.png)

Let's create a knowledge graph from a longer text. For this, I have taken the following text from https://www.nobelprize.org/prizes/physics/1921/einstein/biographical/

`Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879. Six weeks later the family moved to Munich, where he later on began his schooling at the Luitpold Gymnasium. Later, they moved to Italy and Albert continued his education at Aarau, Switzerland and in 1896 he entered the Swiss Federal Polytechnic School in Zurich to be trained as a teacher in physics and mathematics. In 1901, the year he gained his diploma, he acquired Swiss citizenship and, as he was unable to find a teaching post, he accepted a position as technical assistant in the Swiss Patent Office. In 1905 he obtained his doctor’s degree. During his stay at the Patent Office, and in his spare time, he produced much of his remarkable work and in 1908 he was appointed Privatdozent in Berne. In 1909 he became Professor Extraordinary at Zurich, in 1911 Professor of Theoretical Physics at Prague, returning to Zurich in the following year to fill a similar post. In 1914 he was appointed Director of the Kaiser Wilhelm Physical Institute and Professor in the University of Berlin. He became a German citizen in 1914 and remained in Berlin until 1933 when he renounced his citizenship for political reasons and emigrated to America to take the position of Professor of Theoretical Physics at Princeton*. He became a United States citizen in 1940 and retired from his post in 1945. After World War II, Einstein was a leading figure in the World Government Movement, he was offered the Presidency of the State of Israel, which he declined, and he collaborated with Dr. Chaim Weizmann in establishing the Hebrew University of Jerusalem. Einstein always appeared to have a clear view of the problems of physics and the determination to solve them. He had a strategy of his own and was able to visualize the main stages on the way to his goal. He regarded his major achievements as mere stepping-stones for the next advance. 
At the start of his scientific work, Einstein realized the inadequacies of Newtonian mechanics and his special theory of relativity stemmed from an attempt to reconcile the laws of mechanics with the laws of the electromagnetic field. He dealt with classical problems of statistical mechanics and problems in which they were merged with quantum theory: this led to an explanation of the Brownian movement of molecules. He investigated the thermal properties of light with a low radiation density and his observations laid the foundation of the photon theory of light. In his early days in Berlin, Einstein postulated that the correct interpretation of the special theory of relativity must also furnish a theory of gravitation and in 1916 he published his paper on the general theory of relativity. During this time he also contributed to the problems of the theory of radiation and statistical mechanics. In the 1920s, Einstein embarked on the construction of unified field theories, although he continued to work on the probabilistic interpretation of quantum theory, and he persevered with this work in America. He contributed to statistical mechanics by his development of the quantum theory of a monatomic gas and he has also accomplished valuable work in connection with atomic transition probabilities and relativistic cosmology. After his retirement he continued to work towards the unification of the basic concepts of physics, taking the opposite approach, geometrisation, to the majority of physicists.`




In [41]:
input_text_einstein = """Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879. Six weeks later the family moved to Munich, where he later on began his schooling at the Luitpold Gymnasium. Later, they moved to Italy and Albert continued his education at Aarau, Switzerland and in 1896 he entered the Swiss Federal Polytechnic School in Zurich to be trained as a teacher in physics and mathematics. In 1901, the year he gained his diploma, he acquired Swiss citizenship and, as he was unable to find a teaching post, he accepted a position as technical assistant in the Swiss Patent Office. In 1905 he obtained his doctor’s degree. During his stay at the Patent Office, and in his spare time, he produced much of his remarkable work and in 1908 he was appointed Privatdozent in Berne. In 1909 he became Professor Extraordinary at Zurich, in 1911 Professor of Theoretical Physics at Prague, returning to Zurich in the following year to fill a similar post. In 1914 he was appointed Director of the Kaiser Wilhelm Physical Institute and Professor in the University of Berlin. He became a German citizen in 1914 and remained in Berlin until 1933 when he renounced his citizenship for political reasons and emigrated to America to take the position of Professor of Theoretical Physics at Princeton*. He became a United States citizen in 1940 and retired from his post in 1945. After World War II, Einstein was a leading figure in the World Government Movement, he was offered the Presidency of the State of Israel, which he declined, and he collaborated with Dr. Chaim Weizmann in establishing the Hebrew University of Jerusalem. Einstein always appeared to have a clear view of the problems of physics and the determination to solve them. He had a strategy of his own and was able to visualize the main stages on the way to his goal. He regarded his major achievements as mere stepping-stones for the next advance.
At the start of his scientific work, Einstein realized the inadequacies of Newtonian mechanics and his special theory of relativity stemmed from an attempt to reconcile the laws of mechanics with the laws of the electromagnetic field. He dealt with classical problems of statistical mechanics and problems in which they were merged with quantum theory: this led to an explanation of the Brownian movement of molecules. He investigated the thermal properties of light with a low radiation density and his observations laid the foundation of the photon theory of light. In his early days in Berlin, Einstein postulated that the correct interpretation of the special theory of relativity must also furnish a theory of gravitation and in 1916 he published his paper on the general theory of relativity. During this time he also contributed to the problems of the theory of radiation and statistical mechanics. In the 1920s, Einstein embarked on the construction of unified field theories, although he continued to work on the probabilistic interpretation of quantum theory, and he persevered with this work in America. He contributed to statistical mechanics by his development of the quantum theory of a monatomic gas and he has also accomplished valuable work in connection with atomic transition probabilities and relativistic cosmology. After his retirement he continued to work towards the unification of the basic concepts of physics, taking the opposite approach, geometrisation, to the majority of physicists."""

In [42]:
kg_triples_einstein = extract_information(input_text_einstein)

Starting server with command: java -Xmx8G -cp /Users/tekrajchhetri/.stanfordnlp_resources/stanford-corenlp-4.5.3/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-04f744b98f864a35.props -preload openie


In [47]:
# with coreference resolution
kg_triples_einstein_conf_res = extract_information(input_text_einstein, apply_resolution=True)

Exception ignored in: <function StanfordOpenIE.__del__ at 0x7ff727f57280>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.8/site-packages/openie/openie.py", line 116, in __del__
    del os.environ['CORENLP_HOME']
  File "/opt/anaconda3/lib/python3.8/os.py", line 691, in __delitem__
    raise KeyError(key) from None
KeyError: 'CORENLP_HOME'
Some weights of the model checkpoint at biu-nlp/lingmess-coref were not used when initializing LingMessModel: ['longformer.embeddings.position_ids']
- This IS expected if you are initializing LingMessModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LingMessModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-04-03 18:21:19.434 mi

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

2023-04-03 18:21:20.789 ***** Running Inference on 1 texts *****


Inference:   0%|          | 0/1 [00:00<?, ?it/s]

Starting server with command: java -Xmx8G -cp /Users/tekrajchhetri/.stanfordnlp_resources/stanford-corenlp-4.5.3/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-ea448e4aacf64c9c.props -preload openie
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


# Interactive visualisation of our knowledge graphs

Since our knowledge graph is now large, we will use an interactive visualisation. For this, we need the following two libraries:
- streamlit
- streamlit_agraph

## Without coreference resolution

In [43]:
import pandas as pd
kg_df = pd.DataFrame.from_dict(kg_triples_einstein)
kg_df.head()

| | subject | relation | object |
|---|---|---|---|
| 0 | Albert Einstein | was | born at Ulm in Württemberg |
| 1 | Albert Einstein | was born on | 1879 |
| 2 | Albert Einstein | was born at | Ulm |
| 3 | Albert Einstein | was | born in Württemberg on March 14 |
| 4 | Albert Einstein | was | born |


In [44]:
kg_df.tail()

| | subject | relation | object |
|---|---|---|---|
| 257 | he | continued | work towards unification of basic concepts |
| 258 | he | continued | to work towards unification of concepts of phy... |
| 259 | he | continued | to work towards unification to majority |
| 260 | he | continued | to work |
| 261 | he | work towards | unification of basic concepts |
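The last rows all have the pronoun `he` as their subject, which is the symptom that coreference resolution is meant to remove. A rough way to quantify this is to measure the share of triples whose subject is a bare pronoun. The sketch below assumes a small hand-picked pronoun list and a hypothetical helper `pronoun_subject_ratio`; the sample rows are copied from the tables above.

```python
# Hand-picked list for illustration only; it is not exhaustive.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}


def pronoun_subject_ratio(triples):
    """Return the fraction of triples whose subject is a bare pronoun."""
    if not triples:
        return 0.0
    hits = sum(1 for t in triples if t["subject"].strip().lower() in PRONOUNS)
    return hits / len(triples)


sample = [
    {"subject": "Albert Einstein", "relation": "was born at", "object": "Ulm"},
    {"subject": "he", "relation": "continued", "object": "to work"},
]
print(pronoun_subject_ratio(sample))
```

Applied to the full triple list, a high ratio suggests that many edges would hang off pronoun nodes instead of the entity they refer to.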


In [45]:
kg_df.to_csv("einstein_kg_without_resolution.csv", index=False)

## With coreference resolution

In [48]:
kg_df_cres = pd.DataFrame.from_dict(kg_triples_einstein_conf_res)
kg_df_cres.head()

| | subject | relation | object |
|---|---|---|---|
| 0 | Albert Einstein | was | born at Ulm in Württemberg |
| 1 | Albert Einstein | was born on | 1879 |
| 2 | Albert Einstein | was born at | Ulm |
| 3 | Albert Einstein | was | born in Württemberg on March 14 |
| 4 | Albert Einstein | was | born |


In [49]:
kg_df_cres.tail()

| | subject | relation | object |
|---|---|---|---|
| 274 | Albert Einstein | continued | to work towards unification of basic concepts ... |
| 275 | Albert Einstein | work to | majority of physicists |
| 276 | Albert Einstein | continued | work towards unification to majority of physic... |
| 277 | Albert Einstein | continued | work to majority |
| 278 | Albert Einstein | continued | to work towards unification of concepts of phy... |


In [50]:
kg_df_cres.to_csv("einstein_kg_with_resolution.csv", index=False)

In [12]:
%%writefile knowledge_graphs_visualisation.py

import pandas as pd
from streamlit_agraph import agraph, Node, Edge, Config

triples = pd.read_csv("einstein_kg_with_resolution.csv")
# triples = pd.read_csv("einstein_kg_without_resolution.csv")

# Collect the unique node labels and create one edge per triple.
nodes = set()
edges = []
for _, triple in triples.iterrows():
    nodes.add(triple['subject'])
    nodes.add(triple['object'])
    edges.append(Edge(source=triple['subject'],
                      label=triple['relation'],
                      target=triple['object']))

st_nodes = [Node(id=n, label=n) for n in nodes]

# Width and height are large because of my screen size; adjust them to yours.
config = Config(width=3800,
                height=1300,
                nodeHighlightBehavior=True,
                highlightColor="#ff0000",
                directed=True,
                hierarchical=True,
                )

agraph(nodes=st_nodes, edges=edges, config=config)


Overwriting knowledge_graphs_visualisation.py
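The node and edge construction in the script can be checked without Streamlit. The sketch below reproduces the same idea in plain Python: every distinct subject or object string becomes one node and every triple becomes one edge. The helper name `build_graph` and the extra de-duplication of repeated triples are assumptions added for illustration; the script above does not de-duplicate edges.

```python
def build_graph(triples):
    """Return (sorted unique node labels, unique (subject, relation, object) edges)."""
    nodes = set()
    edges = []
    seen = set()
    for t in triples:
        s, r, o = str(t["subject"]), str(t["relation"]), str(t["object"])
        nodes.update((s, o))
        # Skip exact repeats so the same edge is not drawn twice.
        if (s, r, o) not in seen:
            seen.add((s, r, o))
            edges.append((s, r, o))
    return sorted(nodes), edges


example = [
    {'subject': 'Sanju Tiwari', 'relation': 'is', 'object': 'researcher'},
    {'subject': 'Sanju Tiwari', 'relation': 'works at',
     'object': 'Universidad Autonoma de Tamaulipas'},
    {'subject': 'Sanju Tiwari', 'relation': 'is', 'object': 'researcher'},
]
node_labels, edge_list = build_graph(example)
print(len(node_labels), "nodes,", len(edge_list), "edges")
```

Because nodes are keyed by their exact string, `he` and `Albert Einstein` would be two separate nodes, which is why the resolved CSV gives a more connected graph.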


In [13]:
!streamlit run knowledge_graphs_visualisation.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-04-06 10:51:29.137 INFO    numexpr.utils: NumExpr defaulting to 8 threads.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://138.232.106.80:8501

^C
  Stopping...
