In [1]:
# increase the cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; } </style>"))

## Knowledge Graph

- <a href='https://github.com/bdmarius/python-knowledge-graph' style='text-decoration:none'>python-knowledge-graph</a> 
- <a href='https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/' style='text-decoration:none'>Knowledge Graph – A Powerful Data Science Technique to Mine Information from Text</a> 
- <a href='https://medium.com/analytics-vidhya/knowledge-graph-creation-part-ii-675fa480773a' style='text-decoration:none'>Knowledge Graph Creation</a>

In [2]:
import spacy  
import en_core_web_sm                        # a small English model trained on written web text 
nlp = en_core_web_sm.load()                  # load the pretrained pipeline



In [3]:
doc = nlp(
    "The Empire of Japan aimed to dominate Asia and the Pacific and was "
    "already at war with the Republic of China in 1937, but the world war is "
    "generally said to have begun on 1 September 1939 with the invasion of "
    "Poland by Germany and subsequent declarations of war on Germany by "
    "France and the United Kingdom. From late 1939 to early 1941, in a "
    "series of campaigns and treaties, Germany conquered or controlled much "
    "of continental Europe, and formed the Axis alliance with Italy and "
    "Japan. Under the Molotov-Ribbentrop Pact of August 1939, Germany and the "
    "Soviet Union partitioned and annexed territories of their European "
    "neighbours, Poland, Finland, Romania and the Baltic states. The war "
    "continued primarily between the European Axis powers and the coalition "
    "of the United Kingdom and the British Commonwealth, with campaigns "
    "including the North Africa and East Africa campaigns, the aerial Battle "
    "of Britain, the Blitz bombing campaign, the Balkan Campaign as well as "
    "the long-running Battle of the Atlantic. In June 1941, the European Axis "
    "powers launched an invasion of the Soviet Union, opening the largest "
    "land theatre of war in history, which trapped the major part of the "
    "Axis' military forces into a war of attrition. In December 1941, Japan "
    "attacked the United States and European territories in the Pacific "
    "Ocean, and quickly conquered much of the Western Pacific.")

<a href='https://github.com/krzysiekfonal/textpipeliner' style='text-decoration:none'>textpipeliner</a>
- Extracts parts of sentences from unstructured text as structured tuples.
- The library provides _Pipes_ and a _PipelineEngine_.
    - Pipes: each pipe extracts parts from every sentence.
        - AggregatePipe: takes a list of pipes and collects the results from all of them.
        - SequencePipe: takes a list of pipes and runs them in sequence, passing the tokens returned by each pipe as the argument to the next.
        - AnyPipe: takes a list of pipes and runs them in order until one returns a non-empty result.
        - GenericPipe: takes a function as an argument. The function is called with two arguments: the context and the token list.
        - FindTokensPipe: takes a regex-like pattern and extracts the matching tokens using the grammaregex library.
        - NamedEntityFilterPipe: filters the tokens passed to it, keeping those that are part of a named entity. A specific entity type (such as PERSON or LOC) can be passed when the pipe is created.
        - NamedEntityExtractorPipe: given a single token that is part of an entity, collects the entity's whole token chain.
    - PipelineEngine: applies the pipes structure and returns a list of extracted tuples.
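
To make the difference between the three combinators concrete, here is a minimal sketch in plain Python. It does not use textpipeliner: the functions `sequence_pipe`, `aggregate_pipe`, `any_pipe`, and the two filters are hypothetical stand-ins that work on lists of strings rather than on spaCy tokens, and they only mirror the behavior described in the list above.

```python
# Toy stand-ins for the pipe combinators; these are NOT the textpipeliner classes.
from typing import Callable, List

Pipe = Callable[[List[str]], List[str]]  # hypothetical type: tokens in, tokens out


def sequence_pipe(pipes: List[Pipe]) -> Pipe:
    """Run the pipes in order, feeding each result to the next pipe."""
    def run(tokens: List[str]) -> List[str]:
        for pipe in pipes:
            tokens = pipe(tokens)
        return tokens
    return run


def aggregate_pipe(pipes: List[Pipe]) -> Pipe:
    """Run every pipe on the same input and collect all results."""
    def run(tokens: List[str]) -> List[str]:
        collected: List[str] = []
        for pipe in pipes:
            collected.extend(pipe(tokens))
        return collected
    return run


def any_pipe(pipes: List[Pipe]) -> Pipe:
    """Return the first non-empty result, or an empty list if none is found."""
    def run(tokens: List[str]) -> List[str]:
        for pipe in pipes:
            result = pipe(tokens)
            if result:
                return result
        return []
    return run


def keep_capitalized(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t[:1].isupper()]


def keep_long(tokens: List[str]) -> List[str]:
    return [t for t in tokens if len(t) > 8]


tokens = ["Germany", "conquered", "much", "of", "Europe"]
seq_result = sequence_pipe([keep_capitalized, keep_long])(tokens)
agg_result = aggregate_pipe([keep_capitalized, keep_long])(tokens)
any_result = any_pipe([lambda t: [], keep_long, keep_capitalized])(tokens)
print(seq_result, agg_result, any_result)
```

The sequence narrows the tokens step by step, the aggregate unions the results of independent filters, and the any-combinator acts as a fallback chain.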

In [4]:
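# !pip install textpipeliner
import sys
# make the user-level site-packages directory importable
sys.path.insert(-1, '/home/u32/fanluo/.local/lib/python3.5/site-packages')
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import (
    AggregatePipe, AnyPipe, FindTokensPipe,
    NamedEntityExtractorPipe, NamedEntityFilterPipe, SequencePipe,
)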
pipes_structure = [
    SequencePipe([
        FindTokensPipe("VERB/nsubj/*"),
        NamedEntityFilterPipe(),
        NamedEntityExtractorPipe()
    ]),
    FindTokensPipe("VERB"),
    AnyPipe([
        SequencePipe([
            FindTokensPipe("VBD/dobj/NNP"),
            AggregatePipe([
                NamedEntityFilterPipe("GPE"),
                NamedEntityFilterPipe("PERSON")
            ]),
            NamedEntityExtractorPipe()
        ]),
        SequencePipe([
            FindTokensPipe("VBD/**/*/pobj/NNP"),
            AggregatePipe([
                NamedEntityFilterPipe("LOC"),
                NamedEntityFilterPipe("PERSON")
            ]),
            NamedEntityExtractorPipe()
        ])
    ])
]

# [0, 1, 2]: indices of the pipes that must return a non-empty result for a tuple to be kept
engine = PipelineEngine(pipes_structure, Context(doc), [0, 1, 2])
engine.process()

[([Germany], [conquered], [Europe]),
 ([Japan], [attacked], [the, United, States])]
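
Each tuple in this output has the shape (subject, verb, object), which maps directly onto a knowledge graph edge. A minimal sketch of that step follows, assuming plain strings in place of the spaCy tokens; the helper `triples_to_edges` and the adjacency-dict layout are illustrative choices, not part of textpipeliner.

```python
from collections import defaultdict

# Plain strings stand in for the spaCy tokens shown in the output above.
extracted = [
    (["Germany"], ["conquered"], ["Europe"]),
    (["Japan"], ["attacked"], ["the", "United", "States"]),
]


def triples_to_edges(triples):
    """Join each token list into one label; return (subject, relation, object) edges."""
    return [tuple(" ".join(part) for part in triple) for triple in triples]


edges = triples_to_edges(extracted)

# Adjacency view: subject -> list of (relation, object) pairs.
graph = defaultdict(list)
for subj, rel, obj in edges:
    graph[subj].append((rel, obj))

print(dict(graph))
```

With real engine output, each list holds spaCy `Token` objects, so the join would use `token.text` for each element instead of the string itself.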

## <a href="https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf" style="text-decoration:none">TextRank</a>
<a href="https://github.com/DerwenAI/pytextrank" style='text-decoration:none'>PyTextRank</a>: spaCy pipeline extension
-  <a href="https://colab.research.google.com/github/DerwenAI/pytextrank/blob/master/explain_algo.ipynb" style='text-decoration:none'>Explain PyTextRank: the algorithm</a>
    1. Create a lemma_graph:
        - node: the lemma of each token whose POS is in the POS_KEPT list (multiple occurrences of a lemma with the same POS are merged into one node)
        - link: between each kept lemma and the 3 kept lemmas before it
    2. Run the PageRank algorithm on lemma_graph to rank the nodes.
    3. For each phrase (noun_chunk or named entity), compute its score from the scores of the nodes it contains.
        - merge phrases that contain the same nodes
        - if a phrase is both a noun_chunk and a named entity, it is counted twice.
- <a href="https://colab.research.google.com/github/DerwenAI/pytextrank/blob/master/explain_summ.ipynb" style='text-decoration:none'>Explain PyTextRank: extractive summarization</a>
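
The three steps above can be sketched in plain Python without spaCy or PyTextRank. This is a simplified illustration, not the library's implementation: the tagged tokens are hand-written, the `POS_KEPT` value is an assumption, PageRank is a basic power iteration with damping 0.85, and a phrase score is taken as the plain sum of its node ranks, whereas the library applies its own normalization.

```python
from collections import defaultdict

POS_KEPT = {"ADJ", "NOUN", "PROPN", "VERB"}  # assumed set of kept POS tags

# Hand-tagged (lemma, pos) pairs standing in for spaCy output (toy data).
tagged = [
    ("the", "DET"), ("graph", "NOUN"), ("ranking", "NOUN"), ("find", "VERB"),
    ("important", "ADJ"), ("word", "NOUN"), ("important", "ADJ"),
    ("word", "NOUN"), ("form", "VERB"), ("key", "ADJ"), ("phrase", "NOUN"),
]


def build_lemma_graph(tokens, pos_kept=POS_KEPT, window=3):
    """Step 1: one node per (lemma, pos); link each kept lemma to the 3 kept lemmas before it."""
    adj = defaultdict(lambda: defaultdict(float))
    kept = []
    for lemma, pos in tokens:
        if pos not in pos_kept:
            continue
        node = (lemma, pos)
        adj[node]  # register the node even if it ends up without links
        for prev in kept[-window:]:
            if prev != node:  # repeated occurrences share one node, so skip self-links
                adj[node][prev] += 1.0
                adj[prev][node] += 1.0
        kept.append(node)
    return adj


def pagerank(adj, damping=0.85, iterations=50):
    """Step 2: weighted PageRank by power iteration on an undirected graph."""
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            incoming = sum(
                rank[nbr] * weight / sum(adj[nbr].values())
                for nbr, weight in adj[node].items()
            )
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def phrase_score(phrase_nodes, rank):
    """Step 3 (simplified): sum the ranks of the distinct nodes in a phrase."""
    return sum(rank.get(node, 0.0) for node in set(phrase_nodes))


lemma_graph = build_lemma_graph(tagged)
ranks = pagerank(lemma_graph)

phrases = {
    "important word": [("important", "ADJ"), ("word", "NOUN")],
    "key phrase": [("key", "ADJ"), ("phrase", "NOUN")],
}
scores = {text: phrase_score(nodes, ranks) for text, nodes in phrases.items()}
for text, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("{:.4f}  {}".format(score, text))
```

Because repeated lemmas are merged into a single node, "important" and "word" accumulate links from both of their occurrences, which is what pushes the phrase built from them to the top.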

In [1]:
import spacy
import sys
sys.path.insert(-1, '/xdisk/msurdeanu/fanluo/miniconda3/lib/python3.7/site-packages')
import en_core_web_lg
nlp = en_core_web_lg.load()

In [2]:
#!python -m pip install pytextrank
# Fan: changed pytextrank.py so that p.text is the lemmas of the tokens whose pos_ is in kept_pos, joined in their original order
import pytextrank
tr = pytextrank.TextRank()     
nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)

In [5]:
text = "Which magazine was started first Arthur's Magazine or First for Women?"
doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print('{:.4f} {:5d}  {}   {}'.format(p.rank, p.count, p.text, p.chunks)) 

0.2105     1  First   [First]
0.1579     1  Arthur Magazine   [Arthur's Magazine]
0.1562     1  woman   [Women]
0.0828     1  magazine   [Which magazine]
0.0793     1  Arthur   [first Arthur's]


In [6]:
for t in doc:
    print(t.text, t.pos_)

Which DET
magazine NOUN
was AUX
started VERB
first ADV
Arthur PROPN
's PART
Magazine PROPN
or CCONJ
First PROPN
for ADP
Women NOUN
? PUNCT


In [7]:
spacy.explain('PROPN')

'proper noun'

In [8]:
spacy.explain('PRON')

'pronoun'