![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/prediction/english/graph_extraction_roots_paths.ipynb)

In [None]:
# This is only to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./spark_nlp-4.2.7-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-4.2.7


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import SparkSession

print("Spark NLP version", sparknlp.version())

Spark NLP version 4.2.7


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

spark = sparknlp.start(real_time_output=True)

print("Spark NLP version", sparknlp.version())

Spark NLP version 4.2.7


In [None]:
from pyspark.sql.types import StringType

text = ['Peter was born in Mexico and very successful man.']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+-------------------------------------------------+
|text                                             |
+-------------------------------------------------+
|Peter was born in Mexico and very successful man.|
+-------------------------------------------------+



Graph Extraction relies on POS tagging, dependency parsing and NER to extract information from a dependency tree. For background, check this [introductory notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/graph-extraction/graph_extraction_intro.ipynb).

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained() \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
Download done! Loading the resource.
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
Download done! Loading the resource.
[OK!]


# Graph Extraction Default Values

By default, Graph Extraction merges and explodes entities, which means:

*   **explodeEntities**: finds paths between all pairs of entities labeled by NER.
*   **mergeEntities**: merges neighboring entities of the same type into a single token, e.g. `New York` will be considered a single token, instead of `New` as one token and `York` as another.

**mergeEntities** also configures Graph Extraction to use the default pretrained POS, Dependency Parser and Typed Dependency Parser models under the hood. If we set this parameter to `false`, we need to define those stages in the pipeline ourselves.
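To make the two defaults more concrete, here is a minimal pure-Python sketch of what "merge" and "explode" mean conceptually. It is not the Spark NLP implementation: the helper names `merge_entities` and `explode_entity_pairs`, the IOB2 tag format and the sample sentence are assumptions made for illustration.

```python
from itertools import combinations


def merge_entities(tokens, tags):
    """Merge neighboring tokens that share an entity label (IOB2 tags assumed)."""
    merged = []  # list of (text, label); label is None for non-entity tokens
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and merged and merged[-1][1] == tag[2:]:
            # Continuation of the previous entity: extend its text.
            text, label = merged[-1]
            merged[-1] = (text + " " + token, label)
        elif tag == "O":
            merged.append((token, None))
        else:
            # "B-XXX" (or a stray "I-XXX") starts a new entity chunk.
            merged.append((token, tag[2:]))
    return merged


def explode_entity_pairs(merged):
    """List every pair of labeled chunks, in sentence order."""
    entities = [(text, label) for text, label in merged if label is not None]
    return list(combinations(entities, 2))


# Hypothetical sample input, not taken from the pipeline above.
tokens = ["Peter", "moved", "to", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]

merged = merge_entities(tokens, tags)
pairs = explode_entity_pairs(merged)
print(merged)
print(pairs)
```

In this sketch `New` and `York` collapse into one `LOC` chunk, and the only pair left to connect is (`Peter`, `New York`).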

In [None]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph")

In [None]:
graph_pipeline = Pipeline().setStages([document_assembler, tokenizer,
                                       word_embeddings, ner_tagger,
                                       graph_extraction])

In [None]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
Download done! Loading the resource.
dependency_typed_conllu download started this may take some time.
Approximate size to download 2.4 MB
Download done! Loading the resource.
+-------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------+
|[{node, 10, 13, born, {entities -> PER,LOC, left_path -> born,flat,Peter, right_path -> born,nsubj,man,flat,Mexico}, []}]|
+-------------------------------------------------------------------------------------------------------------------------+

## Entity Types

The **entityTypes** parameter allows us to find paths between specific pairs of entity types. The two entity types in a pair must be separated by a hyphen, so we must use this format:

`[ "ENTITY_1-ENTITY_2", "ENTITY_3-ENTITY_4", "ENTITY_N-ENTITY_M"]`
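As a small convenience, the pair strings can be built with a helper instead of being typed by hand. The function `entity_type_pair` below is hypothetical (it is not part of Spark NLP), and rejecting labels that contain a hyphen is a choice of this sketch, made because the hyphen is the separator. The labels are the bare NER labels seen in the output (`PER`, `LOC`); `ORG` is assumed to be another label produced by the NER model.

```python
def entity_type_pair(first, second):
    """Join two bare NER labels into the 'LABEL_1-LABEL_2' form."""
    for label in (first, second):
        # A hyphen inside a label (e.g. the IOB tag "B-PER") would be
        # ambiguous, because the hyphen separates the two labels.
        if not label or "-" in label:
            raise ValueError(f"invalid entity label: {label!r}")
    return f"{first}-{second}"


entity_types = [entity_type_pair("LOC", "PER"), entity_type_pair("PER", "ORG")]
print(entity_types)
```

The resulting list has the shape that `setEntityTypes` receives in the next cell.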

In [None]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setEntityTypes(['LOC-PER'])


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [None]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------+
|[{node, 10, 13, born, {entities -> LOC,PER, left_path -> born,nsubj,man,flat,Mexico, right_path -> born,flat,Peter}, []}]|
+-------------------------------------------------------------------------------------------------------------------------+
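Judging from the output above, each path string alternates token, dependency label, token, and so on, joined by commas. Under that assumption, a path can be split into (head, relation, dependent) edges. The helper `path_to_edges` is hypothetical and only illustrates how to read the metadata; it would misparse a path whose tokens contain the delimiter itself.

```python
def path_to_edges(path, delimiter=","):
    """Split 'tok,rel,tok,rel,tok' into a list of (head, relation, dependent)."""
    parts = path.split(delimiter)
    # Tokens sit at even positions and relation labels at odd positions,
    # so a well-formed path always has an odd number of parts.
    if len(parts) % 2 == 0:
        raise ValueError(f"unexpected path shape: {path!r}")
    return [(parts[i], parts[i + 1], parts[i + 2])
            for i in range(0, len(parts) - 2, 2)]


# Path strings copied from the output above.
left_edges = path_to_edges("born,nsubj,man,flat,Mexico")
right_edges = path_to_edges("born,flat,Peter")
print(left_edges)
print(right_edges)
```

A path made of a single token yields an empty edge list, since there is nothing to traverse.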



## Modifying Root Token

We can set a different root, but first we need to check which tokens can serve as one. Looking at the first level of the dependency tree in [this notebook](https://colab.research.google.com/drive/1BbLeRBjHxqIvcz8812ckwk5gNc2es383?usp=sharing), the candidates besides `born` are `Peter`, `was`, `.` and `man`. However, some of those won't return a relationship.

To define a root that will return meaningful relationships, a token has to fulfill the following requirements:
1. It has to have an ancestor node
2. It has to have descendants
3. It has to have at least one descendant node labeled as entity by NER

Let's check the `Peter` token:
1. It has an ancestor node: `born` (OK)
2. It does not have any descendants (fails)

*Peter* does not meet requirement 2, so it won't output any relationship. The same holds for the tokens `was` and `.`

Now, let's check the `man` token:
1. It has an ancestor node: `born` (OK)
2. It has descendants: `Mexico` and `successful` (OK)
3. It has at least one descendant node labeled as an entity by NER: `Mexico` as `LOC` (as we can see in [this visualization for NER](https://colab.research.google.com/drive/1BbLeRBjHxqIvcz8812ckwk5gNc2es383?usp=sharing)) (OK)
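The three requirements can also be checked mechanically. The sketch below uses a hand-written, partial copy of the dependency tree described above (only the tokens mentioned in this section; the `PARENT` and `NER` dictionaries are assumptions, not pipeline output) and reports the first requirement a candidate root fails to meet. It is meant for alternative roots, so the sentence root `born` is not covered.

```python
# Partial tree, child -> parent, transcribed from the discussion above.
PARENT = {
    "Peter": "born", "was": "born", ".": "born", "man": "born",
    "Mexico": "man", "successful": "man",
}
# Tokens labeled by NER in this sentence.
NER = {"Peter": "PER", "Mexico": "LOC"}


def children(token):
    return [child for child, parent in PARENT.items() if parent == token]


def descendants(token):
    """Collect all nodes below `token`, at any depth."""
    found, stack = [], children(token)
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children(node))
    return found


def first_unmet_requirement(token):
    """Return 1, 2 or 3 for the first failed requirement, or None if all hold."""
    if token not in PARENT:
        return 1  # no ancestor node
    below = descendants(token)
    if not below:
        return 2  # no descendants
    if not any(node in NER for node in below):
        return 3  # no descendant labeled by NER
    return None


for candidate in ["Peter", "was", ".", "man"]:
    print(candidate, first_unmet_requirement(candidate))
```

Only `man` passes all three checks in this toy tree, which matches the manual walkthrough.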

Even so, if we set `man` as the root and leave every other parameter at its default, Graph Extraction won't output anything, as we can see below:

In [None]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRootTokens(['man'])


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [None]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

[WARN] Not found paths between given roots: [man] and entities pairs: (PER,LOC).
This could mean there are no more labeled tokens below the given roots or NER didn't label any token.
You can try using relationshipTypes parameter, check this notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english//graph-extraction/graph_extraction_roots_paths.ipynb 
You can also use spark-nlp-display to visualize Dependency Parser and NER output to help identify the kind of relations you can extract, check this notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english//graph-extraction/graph_extraction_helper_display.ipynb
+-----+
|graph|
+-----+
|[]   |
+-----+



The output is empty because `Mexico` is the only entity under `man`; NER does not identify any other entity there, so `Mexico` has no second entity to form a pair and no path can be shown. However, we can use the `relationshipTypes` parameter to find a path between an unlabeled token and a labeled token, as the next section shows.

## Relationship Types

**relationshipTypes** allows us to find a path between an unlabeled token and a labeled token. To use this parameter, we need to set the **explodeEntities** parameter to `false`.

In [None]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setExplodeEntities(False) \
    .setRootTokens(['man']) \
    .setRelationshipTypes(["man-LOC"])

graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [None]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+------------------------------------------------------------------------------+
|graph                                                                         |
+------------------------------------------------------------------------------+
|[{node, 45, 47, man, {relationship -> man,LOC, path1 -> man,flat,Mexico}, []}]|
+------------------------------------------------------------------------------+



Currently, the search only goes downward: it finds relationships from the defined root to its labeled descendants. This means that if, for example, we set a relationship like `setRelationshipTypes(["successful-LOC"])`, it won't output a path, because no `LOC` token sits below `successful`.

So, a requirement for using `setRelationshipTypes` is that the unlabeled token in the relationship has to be an ancestor of the labeled token. Remember to use a hyphen to separate the pair: `["unlabeled_token-labeled_token"]`
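The downward-only search can be illustrated with a small depth-first traversal over a hand-written edge list. This is a sketch of the idea, not the library's algorithm: `EDGES`, `NER` and `find_path` are assumptions, and the `amod` label on the `man` to `successful` edge is a guess, since that edge does not appear in any output above.

```python
# (head, relation, dependent) edges for the first sentence.
EDGES = [
    ("born", "flat", "Peter"),
    ("born", "nsubj", "man"),
    ("man", "flat", "Mexico"),
    ("man", "amod", "successful"),  # relation label assumed
]
NER = {"Peter": "PER", "Mexico": "LOC"}


def find_path(root, label):
    """Search from `root` downward for a descendant tagged `label`."""
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        for head, relation, dependent in EDGES:
            if head != node:
                continue
            new_path = path + [relation, dependent]
            if NER.get(dependent) == label:
                return ",".join(new_path)
            stack.append((dependent, new_path))
    return None  # nothing labeled `label` below `root`


print(find_path("man", "LOC"))
print(find_path("successful", "LOC"))
```

Here `man` reaches `Mexico`, while `successful` has nothing below it, so no path is found.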

## More Entities, More Relations

Following the example above, we can set a root token and leave the other parameters at their defaults to get an output. However, we need a different sentence, one that produces a deeper dependency tree whose descendants include labeled tokens. If we tweak the sentence as shown below, we can make it work:

In [None]:
text = ['Peter was born in Mexico and very successful in Queens.']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+-------------------------------------------------------+
|text                                                   |
+-------------------------------------------------------+
|Peter was born in Mexico and very successful in Queens.|
+-------------------------------------------------------+



As we can see in this [visualization notebook](https://colab.research.google.com/drive/1BbLeRBjHxqIvcz8812ckwk5gNc2es383?usp=sharing), we now have a labeled token (`Queens`) at a deeper level, so we can safely use it to get a path from another root.

In [None]:
graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRootTokens(['Mexico'])


graph_pipeline = Pipeline().setStages([document_assembler, 
                                       tokenizer,
                                       word_embeddings, 
                                       ner_tagger,
                                       graph_extraction])

In [None]:
graph_data_set = graph_pipeline.fit(data_set).transform(data_set)
graph_data_set.select("graph").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------+
|graph                                                                                                                      |
+---------------------------------------------------------------------------------------------------------------------------+
|[{node, 18, 23, Mexico, {entities -> LOC,LOC, left_path -> Mexico, right_path -> Mexico,amod,successful,nsubj,Queens}, []}]|
+---------------------------------------------------------------------------------------------------------------------------+

