# Graph Extraction

In Spark NLP, we can use the `GraphExtraction` annotator to extract a dependency graph between entities. <br/>
`GraphExtraction` takes, for example, the entities extracted by a `NerDLModel` and creates a dependency tree that describes how the entities relate to each other. **Nodes** represent the entities and **edges** represent the relations between them. <br/>
The result is expressed in *triple store format*: the relationships between nodes in the dependency tree describe **RDF triples**.<br/><br/>



These triples can be used to index a text dataset, run semantic queries and analytics, and create an [RDF dataset](https://en.wikipedia.org/wiki/Resource_Description_Framework).
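To make the triple idea concrete, here is a minimal plain-Python sketch that turns a path string of the kind shown in the outputs later in this notebook into (head, relation, dependent) triples. The helper name `path_to_triples` is hypothetical and not part of Spark NLP; it assumes the path alternates token, edge, token and uses the default comma delimiter.

```python
from typing import List, Tuple


def path_to_triples(path: str, delimiter: str = ",") -> List[Tuple[str, str, str]]:
    """Split a path like 'born,nsubj,man,flat,Mexico' into (head, relation, dependent) triples.

    Assumption: the path alternates token, edge, token, ... (edges included).
    """
    parts = path.split(delimiter)
    # An alternating token/edge path always has an odd number of elements (at least 3).
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError(f"not an alternating token/edge path: {path!r}")
    # Step by two so that each triple starts at a token position.
    return [(parts[i], parts[i + 1], parts[i + 2]) for i in range(0, len(parts) - 2, 2)]


triples = path_to_triples("born,nsubj,man,flat,Mexico")
print(triples)
```

Each triple could then be written to a triple store as subject, predicate and object.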







**Input Annotation types**: `DOCUMENT`, `TOKEN`, `NAMED_ENTITY` <br/>
**Output Annotation type**: `NODE`

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/12.Graph_extraction.ipynb)

Python Documentation: [GraphExtraction](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/graph_extraction/index.html#sparknlp.annotator.graph_extraction.GraphExtraction)

Scala Documentation: [GraphExtraction](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/GraphExtraction.html)


In [1]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp==4.3.2
! pip install -q spark-nlp-display


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType

spark = sparknlp.start()
spark

**Parameters** <br/>
List of parameters that can be set:
- `relationshipTypes` *(List[str])*: Paths to find between a token and an entity.
- `entityTypes` *(List[str])*: Paths to find between a pair of entities.
- `explodeEntities` *(Boolean)*: Whether to find paths between entities.
- `rootTokens` *(List[str])*: Tokens to be considered as the root to start traversing the paths. Use it along with `explodeEntities`.
- `maxSentenceSize` *(Integer)*: Maximum sentence size that the annotator will process, by default 1000. Above this, the sentence is skipped.
- `minSentenceSize` *(Integer)*: Minimum sentence size that the annotator will process, by default 2. Below this, the sentence is skipped.
- `mergeEntities` *(Boolean)*: Whether to merge same neighboring entities as a single token.
- `mergeEntitiesIOBFormat` *(String)*: IOB format to apply when merging entities. Values IOB or IOB2.
- `includeEdges` *(Boolean)*: Whether to include edges when building paths.
- `delimiter` *(String)*: Delimiter symbol used for path output.
- `posModel` *(List[str])*: Coordinates (name, lang, remoteLoc) to a pretrained POS model.
- `dependencyParserModel` *(List[str])*: Coordinates (name, lang, remoteLoc) to a pretrained Dependency Parser model.
- `typedDependencyParserModel` *(List[str])*: Coordinates (name, lang, remoteLoc) to a pretrained Typed Dependency Parser model.



## Graph Extraction

We can leverage the output of the Dependency Parser and NER to extract paths from a dependency tree and find relevant relationships between words and entities by using the `GraphExtraction` annotator. <br/>
However, we do not need to add Part of Speech (POS), Dependency Parser and Typed Dependency Parser stages to the pipeline ourselves. If we set the `mergeEntities(True)` parameter, `GraphExtraction` uses POS, Dependency Parser and Typed Dependency Parser models under the hood.

In [3]:
# Create sample data

text = ['Peter Parker is a nice lad and lives in New York']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------+
|text                                            |
+------------------------------------------------+
|Peter Parker is a nice lad and lives in New York|
+------------------------------------------------+



## `mergeEntities` & `relationshipTypes`
We set `mergeEntities(True)` so that `GraphExtraction` uses the POS, Dependency Parser and Typed Dependency Parser models under the hood.  <br/>

In addition, we need to set either the `relationshipTypes` or the `explodeEntities` parameter; otherwise the annotator returns empty results. In this example, we first use `relationshipTypes`. <br/>

The `relationshipTypes` parameter takes a list of token-ENTITY relationships to extract paths for.
The following sample pipeline extracts paths for this token-ENTITY pair:

- `lad-PER` outputs the path between "lad" and "Peter Parker"
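
As a small illustration of the token-ENTITY format, the sketch below splits a value such as `lad-PER` into its two parts. The helper `parse_relationship_type` is hypothetical; how Spark NLP parses these strings internally is not shown in this notebook, and splitting on the last hyphen is an assumption made here.

```python
from typing import Tuple


def parse_relationship_type(spec: str) -> Tuple[str, str]:
    """Split a 'token-ENTITY' string into (token, entity label).

    Assumption: the entity label follows the last hyphen in the string.
    """
    token, separator, entity = spec.rpartition("-")
    if not separator or not token or not entity:
        raise ValueError(f"expected 'token-ENTITY', got {spec!r}")
    return token, entity


pair = parse_relationship_type("lad-PER")
print(pair)
```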

The pipeline consists of the `ner_dl` NER model and `GraphExtraction`, and extracts the `lad-PER` relation.

In [3]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRelationshipTypes(["lad-PER"]) \
    .setMergeEntities(True)


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [4]:
# The result dataset has a graph column with the paths for the lad-PER relationship
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

NameError: ignored

This configuration extracts one relationship from the given text: <br/>
> Node is "lad", the entity is "PER", the type of the relationship is "flat". 


## `explodeEntities`
When `explodeEntities` is set to True, Graph Extraction finds paths between all possible pairs of entities. 

Since our example sentence has only two entities (PER and LOC), it displays the paths between "lad" and PER as well as between "lad" and LOC. Each pair of entities has a left path and a right path. <br/>
By default the paths start from the root of the dependency tree, which is the token "lad" in this case. 

In [6]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setExplodeEntities(True) \
    .setMergeEntities(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [7]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------+
|[{node, 23, 25, lad, {entities -> PER,LOC, left_path -> lad,flat,Peter Parker, right_path -> lad,flat,New York}, []}]|
+---------------------------------------------------------------------------------------------------------------------+



As seen above, we get the paths for every pair of entities. <br/>
Our entities are: PER, LOC <br/>
1. Node is "lad", the entity is "PER", the type of the relationship is "flat", the type of the path is left_path. 
1. Node is "lad", the entity is "LOC", the type of the relationship is "flat", the type of the path is right_path. 
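
The reading above can be sketched in plain Python. The `metadata` dictionary below copies the values from the output shown earlier; the helper `entity_paths` is hypothetical, and it assumes that the first label in `entities` belongs to `left_path` and the second to `right_path`, which matches the outputs in this notebook but is not confirmed as a general rule.

```python
from typing import Dict


def entity_paths(metadata: Dict[str, str]) -> Dict[str, str]:
    """Map each entity label of an exploded pair to its path.

    Assumption: 'entities' holds exactly two comma-separated labels, the first
    tied to 'left_path' and the second to 'right_path'.
    """
    left_entity, right_entity = metadata["entities"].split(",")
    return {
        left_entity: metadata["left_path"],
        right_entity: metadata["right_path"],
    }


# Metadata values copied from the explodeEntities output above.
metadata = {
    "entities": "PER,LOC",
    "left_path": "lad,flat,Peter Parker",
    "right_path": "lad,flat,New York",
}
paths = entity_paths(metadata)
print(paths)
```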


## `rootTokens`
This parameter sets the tokens to be considered as the root from which the paths are traversed. We use it along with `explodeEntities`.

In [8]:
# Create sample data

text= ['Peter was born in Mexico and very succesful man.']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------+
|text                                            |
+------------------------------------------------+
|Peter was born in Mexico and very succesful man.|
+------------------------------------------------+



By default, the root (node) is "born".

In [9]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setExplodeEntities(True) \
    .setMergeEntities(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

By default the paths start from the root of the dependency tree, which is the token "born" in this case.

In [10]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------+
|[{node, 10, 13, born, {entities -> PER,LOC, left_path -> born,flat,Peter, right_path -> born,nsubj,man,flat,Mexico}, []}]|
+-------------------------------------------------------------------------------------------------------------------------+



No result is returned when we set a different token (here `Peter`) as the root.

In [11]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setExplodeEntities(True) \
    .setMergeEntities(True) \
    .setRootTokens(['Peter'])


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [12]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-----+
|graph|
+-----+
|[]   |
+-----+



We get the same result as before when we explicitly set the root to the default token, `born`.

In [13]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setExplodeEntities(True) \
    .setMergeEntities(True) \
    .setRootTokens(['born'])


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [14]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------+
|[{node, 10, 13, born, {entities -> PER,LOC, left_path -> born,flat,Peter, right_path -> born,nsubj,man,flat,Mexico}, []}]|
+-------------------------------------------------------------------------------------------------------------------------+



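## `entityTypes`

With the values tried below, `entityTypes` returns empty results. Each value here is a single entity label, whereas the parameter is described as finding paths between a *pair* of entities, so it may expect a pair such as `PER-LOC`; that form is not tested in this notebook.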

In [15]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [16]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setEntityTypes(['PER']) \
    .setMergeEntities(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [17]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-----+
|graph|
+-----+
|[]   |
+-----+



In [18]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setEntityTypes(['LOC']) \
    .setMergeEntities(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [19]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-----+
|graph|
+-----+
|[]   |
+-----+



In [20]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setEntityTypes(['LOC', 'PER']) \
    .setMergeEntities(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [21]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-----+
|graph|
+-----+
|[]   |
+-----+



Adding the `B-` prefix to the entity labels still returns empty results.

In [22]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setEntityTypes(['B-PER', "B-LOC"]) \
    .setMergeEntities(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [23]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-----+
|graph|
+-----+
|[]   |
+-----+



## `includeEdges`
We set this parameter to choose whether to include edges (relationship types) when building paths.

In [24]:
# Create sample data

text = ['Peter Parker is a nice lad and lives in New York']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------+
|text                                            |
+------------------------------------------------+
|Peter Parker is a nice lad and lives in New York|
+------------------------------------------------+



**`setIncludeEdges(True)`**

We expect to see the edges (relation types) in the result, since we set this parameter to True, which is the default value.

In [25]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setMergeEntities(True) \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setIncludeEdges(True) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [26]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{node, 23, 25, lad, {relationship -> lad,PER, path1 -> lad,flat,Peter Parker}, []}, {node, 23, 25, lad, {relationship -> lad,LOC, path1 -> lad,flat,New York}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+



As seen above, the paths include the relation type, `flat`.

**`setIncludeEdges(False)`**

We do not expect to see the edges (relation types) in the result since we set this parameter as False.

In [27]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setMergeEntities(True) \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setIncludeEdges(False) 


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [28]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                                                     |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{node, 23, 25, lad, {relationship -> lad,PER, path1 -> lad,Peter Parker}, []}, {node, 23, 25, lad, {relationship -> lad,LOC, path1 -> lad,New York}, []}]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------+



As seen above, the paths no longer include the relationship types.
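
The difference between the two outputs can be reproduced on the path strings themselves. The sketch below is plain Python, not Spark NLP code: the hypothetical helper `drop_edges` assumes a path alternates token, edge, token and keeps only the token positions.

```python
def drop_edges(path: str, delimiter: str = ",") -> str:
    """Remove the edge labels from an alternating token/edge path.

    Assumption: elements at even positions are tokens and elements at odd
    positions are edges (relationship types).
    """
    parts = path.split(delimiter)
    # Keep every second element, starting with the first token.
    return delimiter.join(parts[::2])


with_edges = "lad,flat,Peter Parker"
without_edges = drop_edges(with_edges)
print(without_edges)
```

Applied to the `setIncludeEdges(True)` path for `lad-PER`, this gives the same string as the `setIncludeEdges(False)` output above.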

## `delimiter`
We set this parameter to specify the delimiter symbol used in the path output.


**setDelimiter("/")** 

The default delimiter is a comma (,); let's set it to a slash (/).

In [29]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setMergeEntities(True) \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setDelimiter("/")

graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [30]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{node, 23, 25, lad, {relationship -> lad,PER, path1 -> lad/flat/Peter Parker}, []}, {node, 23, 25, lad, {relationship -> lad,LOC, path1 -> lad/flat/New York}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+



As seen above, a slash (/) now separates the elements of each path.
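
One plausible reason to change the delimiter is that a token may itself contain a comma, which makes a comma-separated path ambiguous to split. The plain-Python sketch below illustrates this with a hypothetical path whose last token is `1,000`; neither the path nor the helper `split_path` comes from the notebook outputs.

```python
from typing import List


def split_path(path: str, delimiter: str) -> List[str]:
    """Split a path string into its elements on the given delimiter."""
    return path.split(delimiter)


# Hypothetical path whose last token, "1,000", itself contains a comma.
comma_parts = split_path("paid,nummod,1,000", ",")
slash_parts = split_path("paid/nummod/1,000", "/")

print(comma_parts)  # the token "1,000" is broken into two elements
print(slash_parts)  # the token "1,000" stays whole
```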

## `mergeEntitiesIOBFormat`

This parameter chooses which IOB format to apply when merging entities. There are two options: IOB or IOB2.

**setMergeEntitiesIOBFormat("IOB2")**

It uses the IOB2 format (the default) when merging entities. 

In [31]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setMergeEntities(True) \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setMergeEntitiesIOBFormat("IOB2")

graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [32]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{node, 23, 25, lad, {relationship -> lad,PER, path1 -> lad,flat,Peter Parker}, []}, {node, 23, 25, lad, {relationship -> lad,LOC, path1 -> lad,flat,New York}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+



**setMergeEntitiesIOBFormat("IOB")**

It will use IOB format while merging the entities. 

In [33]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setMergeEntities(True) \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setMergeEntitiesIOBFormat("IOB")

graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [34]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                                                                                         |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{node, 23, 25, lad, {relationship -> lad,PER, path1 -> lad,flat,Peter, path2 -> lad,flat,Peter,flat,Parker}, []}, {node, 23, 25, lad, {relationship -> lad,LOC, path1 -> lad,flat,York}, []}]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



As seen above, this time "Peter" and "Parker", as well as "New" and "York", were not merged into single entities.
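
For background, in IOB2 every entity chunk starts with a `B-` tag, while in the original IOB scheme `B-` is used only to separate two adjacent chunks of the same type. The plain-Python sketch below shows what merging IOB2-tagged tokens into entities looks like. It is an illustration only, not the annotator's implementation; the helper `merge_iob2` is hypothetical and the tag sequence is hand-written for the example sentence.

```python
from typing import List, Tuple


def merge_iob2(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB2-tagged tokens into (entity text, label) chunks.

    Assumption: tags are strict IOB2. An I- tag that does not continue the
    current chunk is treated like O and closes the chunk.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one first.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with the same label extends the open chunk.
            current_tokens.append(token)
        else:
            # O (or an inconsistent I- tag) closes the open chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], ""
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


tokens = "Peter Parker is a nice lad and lives in New York".split()
# Hand-written IOB2 tags for illustration (not copied from a model run).
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC"]
entities = merge_iob2(tokens, tags)
print(entities)
```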

## `posModel` 
With this parameter we can choose which Part of Speech model is used under the hood. We need to pass the coordinates (name, lang, remoteLoc) of a pretrained POS model.