## Building a Basic Knowledge Graph using spaCy
[kaggle](https://www.kaggle.com/shivamb/spacy-text-meta-features-knowledge-graphs)

In [3]:
import os
import pandas as pd
import spacy 
from spacy import displacy

# Load the large English model (en_core_web_lg)
nlp = spacy.load("en_core_web_lg")

In [5]:
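path = "../data/uci-news-aggregator"
# Note: os.listdir returns entries in arbitrary order, so files[1] may point to a different file on another system
files = os.listdir(path)
df = pd.read_csv(os.path.join(path, files[1]), nrows=500)
df[:3]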

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |


In [6]:
# Apply spacy to the article titles
df["spacy_title"] = df["TITLE"].apply(lambda x : nlp(x))

# add field of NE
df["named_entities"] = df["spacy_title"].apply(lambda x : x.ents)

df[['spacy_title','named_entities']][:3]

| | spacy_title | named_entities |
|---|---|---|
| 0 | (Fed, official, says, weak, data, caused, by, ... | ((Fed),) |
| 1 | (Fed, 's, Charles, Plosser, sees, high, bar, f... | ((Fed), (Charles, Plosser)) |
| 2 | (US, open, :, Stocks, fall, after, Fed, offici... | ((US), (Fed)) |


## IE-Relations using POS Pattern Recognition
#### [List of POS tags](https://sites.google.com/site/partofspeechhelp/home/nnp_nnps#TOC-Definition-of-NNPS-Proper-Noun-Plural-Form-1)
Next, we define a part-of-speech (POS) pattern that identifies the type of relation we want to extract from the data.

Let's say we are interested in finding an action relation between two named entities. We can define the pattern using part-of-speech tags as:

Proper Noun - Verb - Proper Noun

In [7]:
pos_chain_1 = "NNP-VBZ-NNP"

Using spaCy, we can now iterate over the titles and identify the relevant triplets (governor, relation, dependent), or in other words, the entities and the relations between them.

In [9]:
index_list = list()
for i, r in df.iterrows():
    pos_chain = "-".join([d.tag_ for d in r['spacy_title']])
    if pos_chain_1 in pos_chain:
        if len(r["named_entities"]) >= 2:
            index_list.append(i)
            print (r["TITLE"])
            print (r["named_entities"])
            print (pos_chain+'\n')

Fed's Plosser expects US unemployment to fall to 6.2% by the end of 2014
(Fed, US, 6.2%, the end of 2014)
NNP-POS-NNP-VBZ-NNP-NN-TO-VB-IN-CD-NN-IN-DT-NN-IN-CD

Noyer Says Strong Euro Creates Unwarranted Economic Pressure
(Noyer, Strong Euro Creates Unwarranted Economic Pressure)
NNP-VBZ-NNP-NNP-VBZ-JJ-NNP-NN

Noyer Says Strong Euro Creates Unwarranted Economic Pressure (1)
(Noyer, Strong Euro Creates Unwarranted Economic Pressure, 1)
NNP-VBZ-NNP-NNP-VBZ-JJ-NNP-NN--LRB--CD--RRB-

Omega's Cooperman says eBay should spin off portion of PayPal: CNBC
(Omega, Cooperman, eBay, PayPal, CNBC)
NN-POS-NNP-VBZ-NNP-MD-VB-RP-NN-IN-NNP-:-NNP

Carl Icahn Rift Hurts eBay (EBAY)
(Carl Icahn Rift Hurts eBay, EBAY)
NNP-NNP-NNP-VBZ-NNP--LRB--NNP--RRB-

EBay rejects Icahn slate of directors
(EBay, Icahn)
NNP-VBZ-NNP-NN-IN-NNS

EBay rejects Icahn board nominees, asks investors to do same
(EBay, Icahn)
NNP-VBZ-NNP-NN-NNS-,-VBZ-NNS-TO-VB-JJ

EBay Rejects Icahn Board Picks As Activist Strikes Again
(EBay, Icahn

From examples like these, one can read off different entities and relations, for example:

- Honda --- **restructures** ---> US operations  
- Carl Icahn --- **slams** ---> eBay CEO
- Google --- **confirms** ---> Android SDK 
- GM --- **hires** ---> Lehman Brothers 
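
The loop above only flags titles whose tag chain contains the pattern; it does not return the triplet itself. A substring test on the joined chain would also accept `NNP-VBZ-NNPS`, because `NNP` is a prefix of `NNPS`. Below is a minimal sketch of a token-level match that avoids both issues. It assumes hand-written `(text, tag)` pairs in place of a spaCy `Doc` so that it runs without a model, and the helper name `match_tag_pattern` is hypothetical. With spaCy, the pairs could be built as `[(t.text, t.tag_) for t in doc]`.

```python
def match_tag_pattern(tagged, pattern=("NNP", "VBZ", "NNP")):
    """Return the token texts of every window whose tags equal `pattern`."""
    tags = [tag for _, tag in tagged]
    size = len(pattern)
    hits = []
    # Slide a window of len(pattern) tokens over the title
    for start in range(len(tags) - size + 1):
        if tuple(tags[start:start + size]) == tuple(pattern):
            hits.append(tuple(text for text, _ in tagged[start:start + size]))
    return hits


# Hand-tagged copy of one title; the tags are copied from its printed chain above
title = [("EBay", "NNP"), ("rejects", "VBZ"), ("Icahn", "NNP"),
         ("slate", "NN"), ("of", "IN"), ("directors", "NNS")]
triplets = match_tag_pattern(title)
print(triplets)  # [('EBay', 'rejects', 'Icahn')]
```

Comparing whole tags rather than substrings means a plural proper noun (`NNPS`) is matched only if the pattern asks for it.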

References : https://kgtutorial.github.io/

## What about relations between Named Entities?
The problem with the above approach is that it requires a comprehensive list of possible *Part-Of-Speech* patterns defined a priori. In reality, nouns and verbs come in a wide variety of forms and with modifiers.
For instance, you might also want to capture prepositions (IN), e.g. IN-VBZ, VBZ-IN, VBZ-IN-IN, VBN-IN, etc.
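
One way to cover several of these variants without listing each chain is a regular expression over the tag sequence. The sketch below is an illustration, not part of the original notebook: the name `relation_spans` and the choice of relation tags (VBZ, VBN, IN) are assumptions. Tags are joined with spaces rather than hyphens because some tags, such as `-LRB-`, contain hyphens themselves.

```python
import re

# NNP, then one or more relation tags (VBZ, VBN or IN), then a lookahead for the closing NNP.
# The lookahead leaves the closing NNP unconsumed so it can also open the next match.
RELATION = re.compile(r"(?<!\S)NNP(?: (?:VBZ|VBN|IN))+(?= NNP(?!\S))")


def relation_spans(tags):
    """Return (start, end) token index pairs, end exclusive, for each match."""
    chain = " ".join(tags)
    spans = []
    for m in RELATION.finditer(chain):
        start = chain[:m.start()].count(" ")       # token index of the opening NNP
        n_tokens = m.group(0).count(" ") + 1       # opening NNP plus relation tags
        spans.append((start, start + n_tokens + 1))  # +1 for the closing NNP
    return spans


print(relation_spans(["NNP", "VBZ", "IN", "NNP"]))          # [(0, 4)]
print(relation_spans(["NNP", "VBZ", "NNP", "VBZ", "NNP"]))  # [(0, 3), (2, 5)]
print(relation_spans(["NNP", "VBZ", "NNPS"]))               # []
```

The returned indices can be used to slice a spaCy `Doc` (for example `doc[start:end]`) when `tags` is built from `[t.tag_ for t in doc]`.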

To overcome this you can:
 1. Constrain the type and number of relations you wish to find, and create patterns for those.
 2. Constrain the entities between which you wish to find relations, for example to Person named entities.
 3. Train a probabilistic model to identify relation triplets (e.g. Stanford, OLLIE; see the Reddit link in the references).
 
Below we try approach 2 and form relations between named entities.

In [11]:
limit = 25
n = 0
for i, r in df.iterrows():
    if len(r["named_entities"]) == 2:
        ents = r["named_entities"]
        words = r['spacy_title']
        pos_chain = "-".join([d.tag_ for d in words])

        # Inspect the tokens between the two named entities
        for w in words[ents[0].end:ents[1].start]:
            # VBZ: verb, 3rd person singular present; VBN: verb, past participle
            if w.tag_ in ('VBZ', 'VBN'):
                n += 1
                print(words)
                print(pos_chain)
                print((ents[0], ents[0].label_),
                      (w, w.tag_),
                      (ents[1], ents[1].label_), '\n')

        # Stop once enough relations have been printed (>= because one title can add several)
        if n >= limit:
            break

ECB FOCUS-Stronger euro drowns out ECB's message to keep rates low for  ...
NNP-NNP-HYPH-JJR-NN-VBZ-RP-NNP-POS-NN-TO-VB-NNS-JJ-IN--:
(ECB, 'ORG') (drowns, 'VBZ') (ECB, 'ORG') 

Forex - Pound drops to one-month lows against euro
NNP-HYPH-NNP-VBZ-IN-CD-HYPH-NN-NNS-IN-NN
(Forex, 'ORG') (drops, 'VBZ') (one-month, 'DATE') 

Noyer Says Strong Euro Creates Unwarranted Economic Pressure
NNP-VBZ-NNP-NNP-VBZ-JJ-NNP-NN
(Noyer, 'GPE') (Says, 'VBZ') (Strong Euro Creates Unwarranted Economic Pressure, 'LAW') 

Noyer Says Stronger Euro Creates Unwarranted Pressure on Economy
NNP-VBZ-JJR-NNP-VBZ-JJ-NN-IN-NN
(Noyer, 'ORG') (Says, 'VBZ') (Stronger Euro Creates Unwarranted Pressure on Economy, 'LAW') 

EU aims for deal on tackling failing banks next week
NNP-VBZ-IN-NN-IN-VBG-VBG-NNS-IN-NN
(EU, 'ORG') (aims, 'VBZ') (next week, 'DATE') 

EBay rejects Icahn slate of directors
NNP-VBZ-NNP-NN-IN-NNS
(EBay, 'ORG') (rejects, 'VBZ') (Icahn, 'ORG') 

EBay rejects Icahn board nominees, asks investors to do same
NN

OK, so using named entities appears better at capturing people and organisations. However, naively creating triplets by extracting the verbs between entities does not work well because:
 - It fails on complex sentence structures.
 - It ignores other objects represented by nouns, such as proper nouns and common nouns.
 - Not all entity types are relevant, e.g. a PERSON:ORDINAL pair.
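
The last point can be handled by whitelisting entity labels before a pair is turned into a relation. The sketch below is a minimal illustration: the label set and the helper name `relevant_pair` are assumptions, and `SimpleNamespace` objects stand in for spaCy `Span` objects so that it runs without a model. Any object with a `label_` attribute works, including the entries of `doc.ents`.

```python
from types import SimpleNamespace

# Assumption: the labels worth keeping depend on the relations of interest
ALLOWED_LABELS = {"PERSON", "ORG", "GPE"}


def relevant_pair(ents, allowed=ALLOWED_LABELS):
    """True if there are exactly two entities and both labels are allowed."""
    return len(ents) == 2 and all(e.label_ in allowed for e in ents)


# Stand-ins for spaCy spans; texts and labels are copied from the output above
ebay = SimpleNamespace(text="EBay", label_="ORG")
icahn = SimpleNamespace(text="Icahn", label_="ORG")
eu = SimpleNamespace(text="EU", label_="ORG")
next_week = SimpleNamespace(text="next week", label_="DATE")

print(relevant_pair((ebay, icahn)))    # True
print(relevant_pair((eu, next_week)))  # False
```

A filter like this does not correct mislabelled entities; it only discards pairs whose labels fall outside the chosen set.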

We could improve some of this by incorporating **[Noun Chunks](https://spacy.io/usage/linguistic-features#noun-chunks)**. You can think of noun chunks as a noun plus the words describing the noun, for example “the lavish green grass” or “the world’s largest tech fund”.

    Text: The original noun chunk text.
    Root text: The original text of the word connecting the noun chunk to the rest of the parse.
    Root dep: Dependency relation connecting the root to its head.
    Root head text: The text of the root token’s head.
    Children: The immediate syntactic dependents of the root token.
    
 - spaCy uses the terms **head** and **child** to describe the words connected by a single arc in the dependency tree. 
 - The term **dep** is used for the arc label, which describes the type of syntactic relation that connects the child to the head.
 
We can extract further relations by examining the noun modifiers in the noun chunks.  

In [28]:
print(df['spacy_title'][0])

print([x for x in df['spacy_title'][0].noun_chunks])

Fed official says weak data caused by weather, should not slow taper
[Fed official, weak data, weather, taper]


In [29]:
words = nlp("""\
    Google is expanding its pool of machine learning talent with the purchase of a startup that specializes in 'instant' smartphone image recognition. \
    On Wednesday, French firm Moodstocks announced on its website that it's being acquired by Google, stating that it expects the deal to be completed in the next few weeks. \
    There's no word yet on how much Google is paying for the company. \
    Moodstocks' "on-device image recognition" software for smartphones will be phased out as it joins Google. \
    Moodstocks' team will also move over to Google's R&D center in Paris, according to Google's French blog. \
    "Ever since we started Moodstocks, our dream has been to give eyes to machines by \
    turning cameras into smart sensors able to make sense of their surroundings," Moodstocks said in a statement on its site.
    "Our focus will be to build great image recognition tools within Google, \
    but rest assured that current paying Moodstocks customers will be able to use it until the end of their subscription." 
    """)

words = nlp("Barack Obama was born in Hawaii.")

In [32]:
dat = list()
for chunk in df['spacy_title'][0].noun_chunks:
    dat.append(pd.DataFrame([chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text, [c for c in chunk.root.children]]).T)

# With jupyter=True, displacy.render displays the markup inline and returns None, so there is nothing to print
displacy.render(df['spacy_title'][0], style='dep', jupyter=True, options={'distance': 110})
displacy.render(df['spacy_title'][0], style='ent', jupyter=True)

dat = pd.concat(dat)
dat.columns = ['Chunk', 'root.text', 'root.dep', 'root.head', 'root.child']
dat

| | Chunk | root.text | root.dep | root.head | root.child |
|---|---|---|---|---|---|
| 0 | Fed official | official | nsubj | says | [Fed] |
| 0 | weak data | data | nsubj | slow | [weak, caused] |
| 0 | weather | weather | pobj | by | [] |
| 0 | taper | taper | dobj | slow | [] |
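
The `root.dep` and `root.head` columns already contain what is needed for a simple subject-verb-object triple: a chunk root labelled `nsubj` and one labelled `dobj` that share the same head. The sketch below is an assumption-laden illustration: `Tok`, `attach` and `svo_triples` are hypothetical names, and the small `Tok` class stands in for spaCy tokens so that it runs without a model. The function only relies on `.text`, `.dep_`, `.head` and `.children`, so it should also accept a spaCy `Doc`.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class Tok:
    """Minimal stand-in for a spaCy Token."""
    text: str
    dep_: str = ""
    head: "Tok" = None
    children: list = field(default_factory=list)


def attach(child, head, dep):
    """Add an arc labelled `dep` from `head` to `child`."""
    child.dep_, child.head = dep, head
    head.children.append(child)


def svo_triples(tokens):
    """Pair every nsubj with each dobj that hangs off the same head."""
    triples = []
    for tok in tokens:
        if tok.dep_ == "nsubj":
            verb = tok.head
            for child in verb.children:
                if child.dep_ == "dobj":
                    triples.append((tok.text, verb.text, child.text))
    return triples


# Arcs copied from the table above (roots of the noun chunks only)
says, slow = Tok("says"), Tok("slow")
official, data, taper = Tok("official"), Tok("data"), Tok("taper")
attach(official, says, "nsubj")
attach(data, slow, "nsubj")
attach(taper, slow, "dobj")

print(svo_triples([official, says, data, slow, taper]))  # [('data', 'slow', 'taper')]
```

"official" yields no triple because "says" has no direct object among these arcs; handling clausal complements would need further rules.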


In [33]:
# Use the entities and tags of `words` (the Barack Obama sentence defined above)
ents = words.ents
pos_chain = "-".join([d.tag_ for d in words])
for w in words[ents[0].end:ents[1].start]:
    # VBZ: verb, 3rd person singular present; VBN: verb, past participle
    if w.tag_ in ('VBZ', 'VBN'):
        print(words)
        print(pos_chain)
        print((ents[0], ents[0].label_),
              (w, w.tag_),
              (ents[1], ents[1].label_), '\n')

Barack Obama was born in Hawaii.
NNP-NNP-VBD-VBN-IN-NNP-.
(Barack Obama, 'PERSON') (born, 'VBN') (Hawaii, 'GPE') 
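
Once triplets like this are available, they can be collected into a simple graph structure. The sketch below is a minimal illustration using a plain dictionary as an adjacency list. The function name `build_graph` is hypothetical, and the two triplets are copied by hand from the outputs above.

```python
from collections import defaultdict


def build_graph(triplets):
    """Map each subject to a list of (relation, object) edges."""
    graph = defaultdict(list)
    for subj, rel, obj in triplets:
        graph[subj].append((rel, obj))
    return dict(graph)


# Triplets copied by hand from the printed results above
triplets = [("Barack Obama", "born", "Hawaii"),
            ("EBay", "rejects", "Icahn")]
graph = build_graph(triplets)

for node, edges in graph.items():
    for rel, target in edges:
        print(f"{node} --- {rel} ---> {target}")
```

A dedicated graph library would add traversal and visualisation, but a dictionary is enough to look up all relations of a given entity.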



Some other factors to consider:

 - Ownership: e.g. a noun or named entity followed by [NNS/VBZ](https://sites.google.com/site/partofspeechhelp/home/nns_vbz)
 - [KG and pruning](http://philipperemy.github.io/information-extract/)
  - [git](https://github.com/philipperemy/information-extraction-with-dominating-rules)
 
### References

 - [OLLIE](https://www.reddit.com/r/LanguageTechnology/comments/bovsf5/we_release_opiec_the_largest_open_information/)
 - [ClausIE](https://github.com/mmxgn/clausiepy)
 - [MinIE](https://github.com/mmxgn/miniepy/graphs/contributors)