# Simple, rule-based relation extraction

* Extract simple relations between entities of interest
* Can be specified as surface patterns

## Hearst patterns

<img src="figs/fig_hearst_1.png" />

* See [here](https://arxiv.org/pdf/1806.03191.pdf) for more details
* Simple patterns defined over surface text, possibly lemmatized
* Surprisingly good at extracting properties and is-a types of relations
* Need large amounts of text, since any single pattern matches only rarely
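
To make the idea concrete, here is a minimal sketch of a few Hearst-style patterns as Python regular expressions over lemmatized text. The three patterns, the `HEARST_PATTERNS` table and the `extract_isa` helper are illustrative assumptions, not the pattern inventory from the linked paper; note that the class comes first in "X such as Y" but last in "Y and other X".

```python
import re

# (compiled pattern, group holding the class, group holding the instance)
# Illustrative assumption: a tiny, incomplete set of patterns.
HEARST_PATTERNS = [
    (re.compile(r"\b(\w+) such as (\w+)"), 1, 2),   # "country such as Finland"
    (re.compile(r"\b(\w+) and other (\w+)"), 2, 1),  # "wind and other source"
    (re.compile(r"\b(\w+) or other (\w+)"), 2, 1),   # "worm or other invertebrate"
]

def extract_isa(text):
    """Return (class, instance) pairs found by any of the patterns."""
    pairs = []
    for pattern, class_group, instance_group in HEARST_PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group(class_group), match.group(instance_group)))
    return pairs

# A made-up lemmatized sentence, for illustration only
example = "country such as Finland and Sweden ban it , while wind and other source grow"
print(extract_isa(example))
```

Single-token groups (`\w+`) keep the sketch short; multi-word classes and instances would need more than this.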

## Let's try

* We can try these quite easily
* Extract text from our parser output
* Run simple patterns


In [4]:
import gzip
ID,FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC=range(10) #column names

def read_conllu(inp):
    """The simplest conllu reader I can imagine"""
    current_comments=[]
    current_tree=[]
    for line in inp:
        line=line.strip()
        if not line: #empty line -> the current tree is complete
            if current_tree: #do not yield empty trees on repeated blank lines
                yield current_comments, current_tree
            current_comments=[]
            current_tree=[]
        elif line.startswith("#"):
            current_comments.append(line) #this is a comment
        else:
            current_tree.append(line.split("\t"))
    if current_tree: #last tree, in case the input does not end with an empty line
        yield current_comments, current_tree

with gzip.open("/home/jmnybl/english-news-crawl-30M.conllu.gz","rt",encoding="utf-8") as f:
    for comments, tree in read_conllu(f):
        words=[cols[FORM] for cols in tree]
        lemmas=[cols[LEMMA] for cols in tree]
        print("WORDS", words)
        print("LEMMAS", lemmas)
        break #or else we print waaay too much
        
        

WORDS ['On', 'Friday', ',', 'Los', 'Angeles', 'County', 'sheriff', "'s", 'detectives', 'announced', 'that', 'they', 'had', 'made', 'an', 'arrest', 'in', 'Broudreaux', "'s", 'killing', ',', 'the', 'result', 'of', 'a', 'DNA', 'match', '.']
LEMMAS ['on', 'Friday', ',', 'Los', 'Angeles', 'County', 'sheriff', "'s", 'detective', 'announce', 'that', 'they', 'have', 'make', 'a', 'arrest', 'in', 'Broudreaux', "'s", 'killing', ',', 'the', 'result', 'of', 'a', 'dna', 'match', '.']


* As a side remark: many of the things in this course are too heavy or clumsy to run in Jupyter
* They are better run on the command line and written in a real editor, but for the sake of simplicity we use Jupyter to edit our code too
* Like so:

In [8]:
%%writefile conllu2text.py

import gzip
import sys
ID,FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC=range(10) #column names

def read_conllu(inp):
    """The simplest conllu reader I can imagine"""
    current_comments=[]
    current_tree=[]
    for line in inp:
        line=line.strip()
        if not line: #empty line -> the current tree is complete
            if current_tree: #do not yield empty trees on repeated blank lines
                yield current_comments, current_tree
            current_comments=[]
            current_tree=[]
        elif line.startswith("#"):
            current_comments.append(line) #this is a comment
        else:
            current_tree.append(line.split("\t"))
    if current_tree: #last tree, in case the input does not end with an empty line
        yield current_comments, current_tree

for comments, tree in read_conllu(sys.stdin):
    lemmas=[cols[LEMMA] for cols in tree] #one sentence of lemmas per output line
    print(" ".join(lemmas))



Overwriting conllu2text.py


* Now we can run it in the terminal like so
* Note the decompressing and compressing on the fly: the data is never stored uncompressed on disk

In [9]:
%%bash

exit #comment this out, it protects me from accidentally running this cell

zcat ~jmnybl/english-news-crawl-30M.conllu.gz | python3 conllu2text.py | gzip > english_news_30M.txt.gz
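
The same streaming idea can also be sketched in pure Python, in case the shell tools are not available. The `recompress_lines` helper, the temporary file names and the toy input are assumptions made for this illustration; only the standard `gzip.open` text modes are relied on.

```python
import gzip
import os
import tempfile

def recompress_lines(src_path, dst_path, transform):
    """Stream src -> transform -> dst line by line; both files are gzip-compressed."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(transform(line))

# Tiny self-contained demo on temporary files (made-up data)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt.gz")
dst_path = os.path.join(tmp_dir, "out.txt.gz")
with gzip.open(src_path, "wt", encoding="utf-8") as f:
    f.write("On Friday\nthey had made\n")

recompress_lines(src_path, dst_path, str.lower)

with gzip.open(dst_path, "rt", encoding="utf-8") as f:
    result = f.read()
print(result)
```

Only one line is held in memory at a time, so this should scale to large files, although the shell pipeline is likely faster in practice.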

* Now we have English lemmas, let's try some simple patterns
* Once again, much can be done on the command line

In [14]:
%%bash
zcat english_news_30M.txt.gz | grep -Po '\s\w+\ssuch\sas\s\w+\sand' | head -n 20
echo
echo "and in a tab-delimited form"
echo
zcat english_news_30M.txt.gz | grep -Po '\s\w+\ssuch\sas\s\w+\sand' | perl -pe 's/ such as /\t/;s/ and$//' | head -n 20
zcat english_news_30M.txt.gz | grep -Po '\s\w+\ssuch\sas\s\w+\sand' | perl -pe 's/ such as /\t/;s/ and$//' > simple_patterns.tsv

 giant such as Google and
 event such as heat and
 starch such as tapioca and
 brand such as Otrivine and
 food such as butter and
 medium such as pornography and
 region such as Gansu and
 excuse such as freedom and
 county such as Hertfordshire and
 country such as Finland and
 company such as Penguin and
 series such as homeland and
 country such as Germany and
 source such as wind and
 invertebrate such as worm and
 app such as WhatsApp and
 place such as Iraq and
 author such as well and
 symptom such as breathlessness and
 event such as concert and

and in a tab-delimited form

 giant	Google
 event	heat
 starch	tapioca
 brand	Otrivine
 food	butter
 medium	pornography
 region	Gansu
 excuse	freedom
 county	Hertfordshire
 country	Finland
 company	Penguin
 series	homeland
 country	Germany
 source	wind
 invertebrate	worm
 app	WhatsApp
 place	Iraq
 author	well
 symptom	breathlessness
 event	concert


* The extracted pairs are now in the .tsv file, so we can read them in and see what we found

In [21]:
import collections
data={} #class_of_things -> list
with open("simple_patterns.tsv", encoding="utf-8") as f:
    for line in f:
        line=line.strip()
        clas,example=line.split("\t")
        data.setdefault(clas,[]).append(example)

def sort_key(clas_examples):
    clas,examples=clas_examples
    return len(examples)
        
items=sorted(data.items(), key=sort_key, reverse=True)
        
for clas,examples in items[:30]:
    print("CLASS:",clas)
    for ex,count in collections.Counter(examples).most_common(20):
        print(count,"   ",ex)
    print()
    print()

CLASS: country
88     China
53     Germany
39     France
36     India
26     Spain
25     Italy
21     Brazil
21     Greece
20     Syria
20     Canada
19     Turkey
18     Japan
17     Russia
16     Australia
14     Sweden
13     Finland
13     Britain
12     Vietnam
12     Indonesia
11     Norway


CLASS: company
71     Google
39     Facebook
34     Apple
30     Uber
10     Amazon
8     uber
7     Samsung
7     IBM
6     Tesla
6     Unilever
6     Netflix
5     Boeing
5     Twitter
5     BP
5     Comcast
4     SpaceX
4     Microsoft
4     Vodafone
4     ExxonMobil
4     Huawei


CLASS: service
57     Netflix
37     Spotify
32     Uber
23     school
13     health
12     WhatsApp
12     education
9     healthcare
8     netflix
7     uber
7     water
6     Skype
6     Twitter
5     Facebook
5     Snapchat
4     library
4     hospital
4     electricity
4     Pandora
3     insurance


CLASS: area
21     health
20     education
9     agriculture
7     tax
6     retail
4     housing
4     de

* Not bad for a 10-minute job on the command line and a mere 30M sentences of news!

# Dependency trees

* Surface patterns don't always work
* Why?
* It gets even worse in languages with free word order
* Remember dependency trees from the first lecture?

<img src="figs/giant_company.png" />
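
A small sketch of how the surface pattern can go wrong: the regular expression below is a Python version of the `grep` pattern used earlier in this section, applied to a few made-up lemmatized sentences. The sentences and the `match_pattern` helper are assumptions for illustration only.

```python
import re

# Python version of the grep pattern used earlier, with capture groups added
PATTERN = re.compile(r"\s(\w+)\ssuch\sas\s(\w+)\sand")

def match_pattern(sentence):
    """Return the (class, example) pair of the first match, or None."""
    match = PATTERN.search(sentence)
    return match.groups() if match else None

# Made-up lemmatized sentences, for illustration only
sentences = [
    "they visit country such as Finland and Sweden",         # clean match
    "big tech giant such as Google and Apple pay tax",       # only "giant" is captured
    "many country in Europe such as Finland and Sweden",     # nearest word is not the class
    "we visit country such as the United States and Canada", # multi-word example, no match
]

for sentence in sentences:
    print(match_pattern(sentence), "<-", sentence)
```

The third sentence yields the pair `('Europe', 'Finland')`, because the pattern takes whichever word happens to stand next to "such as". Telling which word the phrase actually attaches to requires syntactic structure, which is what a dependency tree provides.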


## The elements of a full parse tree

* Words
* Lemmas
* Tags
* Morphological features
* Dependency relations

* Where are they defined?
* [Universal Dependencies](https://universaldependencies.org/)
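
As a rough illustration of where these elements live in the parser output, the sketch below splits a single CoNLL-U token line into the ten columns named earlier in this section. The token line is made up (modeled on the word "detectives" in the example sentence; the annotation values are assumed, not copied from the corpus), and `parse_feats` is a hypothetical helper.

```python
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)  # column names

# A made-up token line; the ten columns are tab-separated
token_line = "9\tdetectives\tdetective\tNOUN\tNNS\tNumber=Plur\t10\tnsubj\t_\t_"
cols = token_line.split("\t")

def parse_feats(feats):
    """Turn 'Number=Plur|Case=Nom' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))

print("word:      ", cols[FORM])
print("lemma:     ", cols[LEMMA])
print("tags:      ", cols[UPOS], cols[XPOS])
print("features:  ", parse_feats(cols[FEATS]))
print("dependency:", cols[DEPREL], "-> head token", cols[HEAD])
```

Each token stores the ID of its head in `HEAD` and the relation type in `DEPREL`; together these two columns encode the whole tree.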
