# Simple, rule-based relation extraction

* Extract simple relations between entities of interest
* Can be specified as surface patterns

## Hearst patterns

<img src="figs/fig_hearst_1.png" />

* See [here](https://arxiv.org/pdf/1806.03191.pdf) for more details
* Simple patterns defined over surface text, possibly lemmatized
* Surprisingly good at extracting properties and is-a relations
* Must be applied to large amounts of text to find enough matches
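
As a rough illustration of what such patterns look like in code, here is a minimal sketch with two Hearst-style patterns applied to space-separated lemmas. The names `HEARST_PATTERNS` and `extract_isa`, the restriction to single-word classes and examples, and the two sentences are assumptions made for this example only.

```python
import re

# Two Hearst-style surface patterns over space-separated lemmas.
# Each regex captures a class word and one example of that class.
# Restricting both to a single word is a simplifying assumption.
HEARST_PATTERNS = [
    (re.compile(r"\b(?P<clas>\w+) such as (?P<example>\w+)"), "such as"),
    (re.compile(r"\b(?P<example>\w+) and other (?P<clas>\w+)"), "and other"),
]

def extract_isa(lemma_line):
    """Yield (class, example, pattern_name) triples found in one lemmatized sentence."""
    for regex, name in HEARST_PATTERNS:
        for match in regex.finditer(lemma_line):
            yield match.group("clas"), match.group("example"), name

# Made-up lemmatized sentences, for illustration only
sentences = [
    "country such as Finland and Sweden sign the agreement",
    "she grow apple and other fruit in the garden",
]
for sentence in sentences:
    for clas, example, name in extract_isa(sentence):
        print(clas, example, name, sep="\t")
```

Matching on lemmas rather than word forms means that one pattern covers both *country such as* and *countries such as*.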

## Let's try

* We can try these quite easily
* Extract text from our parser output
* Run simple patterns


In [7]:
import gzip
ID,FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC=range(10) #column names

def read_conllu(inp):
    """The simplest conllu reader I can imagine"""
    current_comments=[]
    current_tree=[]
    for line in inp:
        line=line.strip()
        if not line: #empty line -> the current tree is complete, yield it and start a new one
            if current_tree: #nothing to yield on repeated empty lines
                yield current_comments, current_tree
            current_comments=[]
            current_tree=[]
        elif line.startswith("#"):
            current_comments.append(line) #this is a comment
        else:
            current_tree.append(line.split("\t"))
    if current_tree: #the last tree, in case the input does not end with an empty line
        yield current_comments, current_tree

with gzip.open("/course_data/textmine/parsed-data/english-news-crawl-30M.conllu.gz","rt",encoding="utf-8") as f:
    for comments, tree in read_conllu(f):
        words=[cols[FORM] for cols in tree]
        lemmas=[cols[LEMMA] for cols in tree]
        print("WORDS", words)
        print("LEMMAS", lemmas)
        break #or else we print waaay too much
        
        

WORDS ['On', 'Friday', ',', 'Los', 'Angeles', 'County', 'sheriff', "'s", 'detectives', 'announced', 'that', 'they', 'had', 'made', 'an', 'arrest', 'in', 'Broudreaux', "'s", 'killing', ',', 'the', 'result', 'of', 'a', 'DNA', 'match', '.']
LEMMAS ['on', 'Friday', ',', 'Los', 'Angeles', 'County', 'sheriff', "'s", 'detective', 'announce', 'that', 'they', 'have', 'make', 'a', 'arrest', 'in', 'Broudreaux', "'s", 'killing', ',', 'the', 'result', 'of', 'a', 'dna', 'match', '.']


* As a side remark:
* Many of the things in this course are too heavy or clumsy to run in Jupyter
* They are better run on the command line and written in a real editor, but for the sake of simplicity we use Jupyter to edit our code too
* Like so:

In [8]:
%%writefile conllu2text.py

#^^^^ this is how you can edit a file on the drive so you can run it in the terminal

import gzip
import sys
ID,FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC=range(10) #column names

def read_conllu(inp):
    """The simplest conllu reader I can imagine"""
    current_comments=[]
    current_tree=[]
    for line in inp:
        line=line.strip()
        if not line: #empty line -> the current tree is complete, yield it and start a new one
            if current_tree: #nothing to yield on repeated empty lines
                yield current_comments, current_tree
            current_comments=[]
            current_tree=[]
        elif line.startswith("#"):
            current_comments.append(line) #this is a comment
        else:
            current_tree.append(line.split("\t"))
    if current_tree: #the last tree, in case the input does not end with an empty line
        yield current_comments, current_tree

for comments, tree in read_conllu(sys.stdin):
    lemmas=[cols[LEMMA] for cols in tree]
    print(" ".join(lemmas))



Overwriting conllu2text.py


* Now we can run it in the terminal like so
* Note the decompression and compression on the fly: the uncompressed data is never written to disk

In [3]:
%%bash

exit #comment this line out to actually run the cell; it protects me from accidentally running it

zcat /course_data/textmine/parsed-data/english-news-crawl-30M.conllu.gz | python3 conllu2text.py | gzip > english_news_30M_lemma.txt.gz

* Now we have English lemmas, let's try some simple patterns
* Once again, much can be done on the command line

In [5]:
%%bash
zcat english_news_30M_lemma.txt.gz | grep -Po '\s\w+\ssuch\sas\s\w+\sand' | head -n 20
echo
echo "and in a tab-delimited form"
echo
zcat english_news_30M_lemma.txt.gz | grep -Po '\s\w+\ssuch\sas\s\w+\sand' | perl -pe 's/ such as /\t/;s/ and$//' | head -n 20
zcat english_news_30M_lemma.txt.gz | grep -Po '\s\w+\ssuch\sas\s\w+\sand' | perl -pe 's/ such as /\t/;s/ and$//' > simple_patterns.tsv

 giant such as Google and
 event such as heat and
 starch such as tapioca and
 brand such as Otrivine and
 food such as butter and
 medium such as pornography and
 region such as Gansu and
 excuse such as freedom and
 county such as Hertfordshire and
 country such as Finland and
 company such as Penguin and
 series such as homeland and
 country such as Germany and
 source such as wind and
 invertebrate such as worm and
 app such as WhatsApp and
 place such as Iraq and
 author such as well and
 symptom such as breathlessness and
 event such as concert and

and in a tab-delimited form

 giant	Google
 event	heat
 starch	tapioca
 brand	Otrivine
 food	butter
 medium	pornography
 region	Gansu
 excuse	freedom
 county	Hertfordshire
 country	Finland
 company	Penguin
 series	homeland
 country	Germany
 source	wind
 invertebrate	worm
 app	WhatsApp
 place	Iraq
 author	well
 symptom	breathlessness
 event	concert


* The patterns are now in the .tsv file, so we can read them in and see what we found

In [6]:
import collections
data={} #class_of_things -> list
with open("simple_patterns.tsv", encoding="utf-8") as f:
    for line in f:
        line=line.strip()
        clas,example=line.split("\t")
        data.setdefault(clas,[]).append(example)

def sort_key(clas_examples):
    clas,examples=clas_examples
    return len(examples)
        
items=sorted(data.items(), key=sort_key, reverse=True)
        
for clas,examples in items[:30]:
    print("CLASS:",clas)
    for ex,count in collections.Counter(examples).most_common(20):
        print(count,"   ",ex)
    print()
    print()

CLASS: country
88     China
53     Germany
39     France
36     India
26     Spain
25     Italy
21     Greece
21     Brazil
20     Canada
20     Syria
19     Turkey
18     Japan
17     Russia
16     Australia
14     Sweden
13     Finland
13     Britain
12     Indonesia
12     Vietnam
11     Norway


CLASS: company
71     Google
39     Facebook
34     Apple
30     Uber
10     Amazon
8     uber
7     IBM
7     Samsung
6     Unilever
6     Tesla
6     Netflix
5     BP
5     Comcast
5     Twitter
5     Boeing
4     Huawei
4     Microsoft
4     Vodafone
4     SpaceX
4     ExxonMobil


CLASS: service
57     Netflix
37     Spotify
32     Uber
23     school
13     health
12     WhatsApp
12     education
9     healthcare
8     netflix
7     uber
7     water
6     Skype
6     Twitter
5     Facebook
5     Snapchat
4     Pandora
4     electricity
4     library
4     hospital
3     Hulu


CLASS: area
21     health
20     education
9     agriculture
7     tax
6     retail
4     defence
4     trade
4

* Not bad for a 10-minute job on the command line and a mere 30M sentences of news!
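
The output also shows some noise: the lemmatizer does not normalize case, so *Uber* and *uber* are counted separately. One possible clean-up, sketched here under the assumption that case differences are never meaningful (which is not always true, compare *Apple* and *apple*), is to fold case and report each group under its most frequent spelling. The function name `merge_case_variants` is made up for this example.

```python
import collections

def merge_case_variants(examples):
    """Count examples case-insensitively, keeping the most frequent spelling of each."""
    by_spelling = collections.Counter(examples)
    merged = collections.Counter()
    best_spelling = {}  # lowercased form -> spelling used in the merged counts
    for spelling, count in by_spelling.most_common():
        key = spelling.lower()
        # most_common() yields frequent spellings first, so the first one seen wins
        best_spelling.setdefault(key, spelling)
        merged[best_spelling[key]] += count
    return merged

# Counts taken from the "company" class in the output above
examples = ["Google"] * 71 + ["Uber"] * 30 + ["uber"] * 8
for example, count in merge_case_variants(examples).most_common():
    print(count, "   ", example)
```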

# Dependency trees

* Surface patterns may miss a lot of cases
* Why?
* Remember dependency trees from the first lecture?

<img src="figs/giant_company.png" />


## The elements of a full parse tree

* Words
* Lemmas
* Tags
* Morphological features
* Dependency relations

* Where are they defined:
* [Universal Dependencies](https://universaldependencies.org/)
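
All of these elements sit in the ten tab-separated columns of a CoNLL-U token line. The sketch below picks them out of one token row copied from the search output shown later in these notes; the helper `parse_feats` is a hypothetical convenience function, not part of any library.

```python
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)  # CoNLL-U columns

def parse_feats(feats):
    """Turn a FEATS string such as 'Mood=Ind|Tense=Pres' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# One token row from the example sentence in the search output
line = "8\tdo\tdo\tAUX\tVBP\tMood=Ind|Tense=Pres|VerbForm=Fin\t10\taux\t_\tSpaceAfter=No"
cols = line.split("\t")

print("word:    ", cols[FORM])
print("lemma:   ", cols[LEMMA])
print("tags:    ", cols[UPOS], cols[XPOS])
print("features:", parse_feats(cols[FEATS]))
print("relation:", cols[DEPREL], "-> head token", cols[HEAD])
```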

## Search in syntactic trees

* Simple relations can be stated in terms of patterns in the syntax trees
* In theory, this removes much of the surface variation and increases our recall
* In practice, there is a trade-off between increased recall and parser-induced noise
* One also needs a tool to query parse trees
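
To make the idea concrete, here is a minimal sketch of one tree pattern in plain Python: find all words attached with a given relation to a head with a given lemma. The function `find_dependents` and the hand-written tree are made up for illustration, and the code assumes trees without multiword-token or empty-node lines, so that token `i` is at list position `i-1`.

```python
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)  # CoNLL-U columns

def find_dependents(tree, head_lemma, deprel):
    """Yield tokens attached with relation `deprel` to a governor whose lemma is `head_lemma`."""
    for cols in tree:
        head = int(cols[HEAD])
        if head == 0:  # the root token has no governor
            continue
        governor = tree[head - 1]  # token IDs are 1-based (assumes plain integer IDs only)
        if cols[DEPREL] == deprel and governor[LEMMA] == head_lemma:
            yield cols

# Hand-written tree for "The company has a serious problem ." (illustration only)
rows = """
1 The the DET DT _ 2 det _ _
2 company company NOUN NN _ 3 nsubj _ _
3 has have VERB VBZ _ 0 root _ _
4 a a DET DT _ 6 det _ _
5 serious serious ADJ JJ _ 6 amod _ _
6 problem problem NOUN NN _ 3 obj _ _
7 . . PUNCT . _ 3 punct _ _
"""
tree = [row.split() for row in rows.strip().splitlines()]

for cols in find_dependents(tree, "problem", "amod"):
    print(cols[FORM])
```

The same pattern matches no matter how many words stand between the adjective and *problem*, which is exactly the surface variation a flat regular expression struggles with.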

## Dep_search

* Querying large numbers of trees is quite difficult
* Need a specialized tool for that
* Many exist, few scale up
* Here's one: https://fginter.github.io/dep_search/
* We can try it at http://edu.turkunlp.org/dep_search with the parsed news datasets we used throughout

### Search examples

* Lemma equals *problem*: `L=problem` [Link](http://edu.turkunlp.org/dep_search/query?search=L%3Dproblem&db=NEWS_EN_10M&case_sensitive=False&hits_per_page=50)
* All possessive modifiers of the *problem* lemma (who is having a problem): `_ <nmod:poss L=problem` [Link](http://edu.turkunlp.org/dep_search/query?search=_%20%3Cnmod%3Aposs%20L%3Dproblem&db=NEWS_EN_10M&case_sensitive=True&hits_per_page=10)
* Keep only NOUN and PROPN words (we are not interested in *he* or *she*): `NOUN|PROPN <nmod:poss L=problem` [Link](http://edu.turkunlp.org/dep_search/query?search=NOUN%7CPROPN%20%3Cnmod%3Aposs%20L%3Dproblem&db=NEWS_EN_10M&case_sensitive=True&hits_per_page=10)
* All adjective modifiers (what kind of a problem): `_ <amod L=problem` [Link](http://edu.turkunlp.org/dep_search/query?search=_%20%3Camod%20L%3Dproblem&db=NEWS_EN_10M&case_sensitive=True&hits_per_page=10)
* All nominal modifiers of the *problem* lemma with a *with* adposition (a problem with what), where the *problem* also has a NOUN or PROPN possessive modifier: `_  >case with <nmod (L=problem >nmod:poss NOUN|PROPN) ` [Link](http://edu.turkunlp.org/dep_search/query?search=_%20%20%3Ecase%20with%20%3Cnmod%20%28L%3Dproblem%20%3Enmod%3Aposs%20NOUN%7CPROPN%29&db=NEWS_EN_10M&case_sensitive=True&hits_per_page=10)

### Top hits

```
# db-name: /home/ginter/dep_search_py2/en_news/trees_00000.db
# graph id: 401
# db-name: /home/ginter/dep_search_py2/en_news/trees_00000.db
# graph id: 4478
# graph id: 401
# visual-style  80      bgColor:lightgreen
# hittoken:     80      person  person  NOUN    NN      Number=Sing     83      nmod:poss       _       SpaceAfter=No
# sent_id = 402
# text = But in Australia where 14.6 per cent don't believe depression is a mental illness at all, 21.6 per cent would not hire someone who had been depressed, and 24.7 per cent think a person with depression could "snap out of it" if they wanted to, how are we to offer healthcare that truly cares for a person who is overweight if we ignore, or refuse to validate, the cause of this person's health problem?
1       But     but     CCONJ   CC      _       35      cc      _       _
2       in      in      ADP     IN      _       3       case    _       _
3       Australia       Australia       PROPN   NNP     Number=Sing     24      obl     _       _
4       where   where   ADV     WRB     PronType=Rel    10      advmod  _       _
5       14.6    14.6    NUM     CD      NumType=Card    7       nummod  _       _
6       per     per     NOUN    NN      Number=Sing     7       compound        _       _
7       cent    cent    NOUN    NN      Number=Sing     10      nsubj   _       _
8       do      do      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        10      aux     _       SpaceAfter=No
9       n't     not     PART    RB      _       10      advmod  _       _
10      believe believe VERB    VB      VerbForm=Inf    3       acl:relcl       _       _
11      depression      depression      NOUN    NN      Number=Sing     15      nsubj   _       _
...

```

* The matching word is marked as **# hittoken**
* Pick the most common matching lemmas (field 4; use field 3 for word forms) with grep, cut and sort: `cat results.conllu | grep -P '# hittoken:' | cut -f 4 | sort | uniq -c | sort -nr | head -10`

* Possessive modifiers:
```
     58 country
     36 world
     31 nation
     26 America
     23 Britain
     20 city
     17 Europe
     17 company
     16 Trump
     16 state
```

* Adjectives:
```
    468 serious
    362 big
    329 other
    313 real
    279 biggest
    255 major
    180 same
    175 many
    152 only
    143 huge
```    

* Nominal modifiers of a *problem* with a *with* adposition (a problem with what):
```
      3 woman
      2 photo
      2 drug
      1 wellbeing
      1 we
      1 voter
      1 violence
      1 Union
      1 Trump
      1 this
```
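
The same counting can be done in Python, which is handy when the results need further processing. This is a minimal sketch that assumes the tab-separated `# hittoken:` layout shown under Top hits; the function name `count_hit_lemmas` is made up, and the second hittoken line in the sample is invented for illustration.

```python
import collections

def count_hit_lemmas(lines):
    """Count the lemma on every line that starts with '# hittoken:'."""
    counts = collections.Counter()
    for line in lines:
        if line.startswith("# hittoken:"):
            fields = line.rstrip("\n").split("\t")
            counts[fields[3]] += 1  # 4th tab-separated field, same as `cut -f 4`
    return counts

sample = [
    "# graph id: 401\n",
    "# hittoken:\t80\tperson\tperson\tNOUN\tNN\tNumber=Sing\t83\tnmod:poss\t_\tSpaceAfter=No\n",
    "# sent_id = 402\n",
    # made-up hit, for illustration only
    "# hittoken:\t5\tcountries\tcountry\tNOUN\tNNS\tNumber=Plur\t7\tnmod:poss\t_\t_\n",
]
for lemma, count in count_hit_lemmas(sample).most_common(10):
    print(count, lemma)
```

For a saved result file, pass the open file object instead of the sample list, for example `count_hit_lemmas(open("results.conllu", encoding="utf-8"))`.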



* Subject-verb-object triples are a typical extraction target
* Let us investigate how that data behaves: [subject_verb_object.ipynb](subject_verb_object.ipynb)