# Hearst Patterns

## A remark on the scope of operators

I have seen several solutions using disjunction with the wrong scope. This is illustrated by the following example:

In [8]:
import re

s = 'We met John and Mary.'
re.findall(r'\w+ and|or \w+',s) 

['John and']

Because `|` has the lowest precedence of all regex operators, the pattern is equivalent to the following one, which is obviously not what was intended:

In [9]:
re.findall(r'(\w+ and)|(or \w+)',s) 

[('John and', '')]

What we would like to have is the following:

In [10]:
re.findall(r'(\w+) (and|or) (\w+)',s) 

[('John', 'and', 'Mary')]

Lessons learned (hopefully):

1. Test your patterns to make sure that they do what you think they do.
2. Use brackets to make clear what the scope of operators is!
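Both lessons fit in a few lines. The following is a minimal sketch, not part of the original solutions: it uses a non-capturing group `(?:...)` to fix the scope of `|` without adding a result column, and plain `assert` statements as tests. The second and third test strings are made up for illustration.

```python
import re

s = 'We met John and Mary.'

# (?:and|or) limits the scope of "|" but does not create a capture group.
pattern = re.compile(r'(\w+) (?:and|or) (\w+)')

matches = pattern.findall(s)

# Small tests: each assert fails loudly if the pattern does not behave as expected.
assert matches == [('John', 'Mary')]
assert pattern.findall('tea or coffee') == [('tea', 'coffee')]
assert pattern.findall('We met John.') == []
```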

Also note what happens if we apply a Kleene closure to a bracket group:

In [11]:
re.findall(r'((\w+ )+)(and |or )(\w+)',s) 

[('We met John ', 'John ', 'and ', 'Mary')]
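The second element, `'John '`, shows that a repeated capturing group only keeps its last repetition. If only the whole sequence is needed, the inner group can be made non-capturing. A small sketch on the same sentence:

```python
import re

s = 'We met John and Mary.'

# (?:\w+ )+ repeats without capturing; the outer brackets capture the whole run.
result = re.findall(r'((?:\w+ )+)(and |or )(\w+)', s)
print(result)  # [('We met John ', 'and ', 'Mary')]
```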

## Intended solution

In the following we use a lot of brackets. For each pattern we also store which bracket group corresponds to the CONCEPT and which ones to the INSTANCEs. Later we use this information to extract these parts.
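To make this bookkeeping concrete, here is a minimal sketch for a single pattern of the kind used in the cell that follows. The stored numbers are 0-based positions in the tuples returned by `re.findall`. The example sentences and the helper name `extract_pairs` are made up for illustration.

```python
import re

# (pattern, index of the CONCEPT group, indices of the INSTANCE groups)
pat, cp, ips = (r'([\w\-]+) such as ([\w\-]+) ((and|or) ([\w\-]+))?', 0, [1, 4])

def extract_pairs(line):
    """Return the (concept, instance) pairs found in one line of text."""
    pairs = []
    for result in re.findall(pat, line):
        # result holds one string per bracket group; unmatched optional groups are ''
        for ip in ips:
            if result[ip]:
                pairs.append((result[cp], result[ip]))
    return pairs

pairs = extract_pairs('He likes animals such as cats and dogs.')
print(pairs)  # [('animals', 'cats'), ('animals', 'dogs')]
```

When the optional `and`/`or` part is missing, group 4 is an empty string and only the first instance is kept.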

In [28]:
import glob 
import codecs
import re
from collections import Counter



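# Each entry: (pattern, index of the CONCEPT group, indices of the INSTANCE groups).
# The indices are 0-based positions in the tuples returned by re.findall.
hearst_pat = [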
    (r'([\w\-]+) such as ([\w\-]+) ((and|or) ([\w\-]+))?',0,[1,4]),
    (r'([\w\-]+),? especially ([\w\-]+) ((and|or) ([\w\-]+))?',0,[1,4]),
    (r'([\w\-]+),? including ([\w\-]+) ((and|or) ([\w\-]+))?',0,[1,4]),
    (r'([\w\-]+),? and other ([\w\-]+)',1,[0]),
    (r'([\w\-]+),? or other ([\w\-]+)',1,[0])
] 

paircount = Counter()

filelist = glob.glob("infect/*.txt")
for f in filelist:
    file = codecs.open(f,'r','utf8')

    for line in file:
        line = line.strip()
        for pat,cp,ips in hearst_pat: 
            resultlist = re.findall(pat,line) 
            for result in resultlist:
                concept = result[cp]
                for ip in ips:
                    if ip < len(result):
                        instance = result[ip]
                        if instance:
                            paircount.update([(concept,instance)])
                             
paircount.most_common()

[(('animals', 'humans'), 7),
 (('bodily', 'blood'), 4),
 (('antibiotics', 'penicillin'), 4),
 (('animals', 'bats'), 3),
 (('is', 'important'), 3),
 (('antibiotics', 'the'), 2),
 (('organisms', 'Clostridium'), 2),
 (('pets', 'cats'), 2),
 (('organs', 'the'), 2),
 (('viruses', 'herpes'), 2),
 (('form', 'animal'), 2),
 (('infectious', 'eye'), 2),
 (('diseases', 'smallpox'), 2),
 (('is', 'common'), 2),
 (('countries', 'the'), 2),
 (('microorganisms', 'bacteria'), 2),
 (('parts', 'bloodstream'), 2),
 (('be', 'troublesome'), 2),
 (('symptoms', 'a'), 2),
 (('potentially', 'blood'), 2),
 (('general', 'monitors'), 2),
 (('procedures', 'those'), 2),
 (('antibiotics', 'methicillin'), 2),
 (('bodily', 'feces'), 2),
 (('agents', 'certain'), 2),
 (('agents', 'epidemiologically'), 2),
 (('disease', 'in'), 2),
 (('deep', 'Chromoblastomycosis'), 2),
 (('patients', 'those'), 2),
 (('lyssaviruses', 'the'), 2),
 (('pandemics', 'the'), 2),
 (('vectors', 'insects'), 2),
 (('bacteria', 'Staphylococcus'), 2),

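We now get many results like `(('animals', 'cats'), 1)`, meaning that cats are animals. The list also contains noise such as `(('is', 'important'), 3)` and `(('antibiotics', 'the'), 2)`, where the matched words are not nouns.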

## Extended solution

Results would be better if we could restrict the matches to nouns and proper nouns. Let us try to do so!

In [13]:
import nltk

def tag_text(txt):
    # Split into sentences, POS-tag every token and write it as word/TAG (e.g. dogs/NNS).
    tagged_tokens = ['/'.join(token) for s in nltk.sent_tokenize(txt) for token in nltk.pos_tag(nltk.word_tokenize(s))]
    return ' '.join(tagged_tokens)
    

In [14]:
text = 'The word quarantine comes from quarantena or quarantaine, meaning "forty days", used in the Venetian language in the 14th and 15th centuries and also in France. The word is designated in the period during which all ships were required to be isolated before passengers and crew could go ashore during the Black Death plague. The quarantena followed the trentino, or "thirty-day isolation" period, first imposed in 1347 in the Republic of Ragusa, Dalmatia (modern Dubrovnik in Croatia).Merriam-Webster gives various meanings to the noun form, including "a period of 40 days", several relating to ships, "a state of enforced isolation", and as "a restriction on the movement of people and goods which is intended to prevent the spread of disease or pests". The word is also used as a verb. Quarantine is distinct from medical isolation, in which those confirmed to be infected with a communicable disease are isolated from the healthy population. Quarantine may be used interchangeably with cordon sanitaire, and although the terms are related, cordon sanitaire refers to the restriction of movement of people into or out of a defined geographic area, such as a community, in order to prevent an infection from spreading.'

In [15]:
tag_text(text)

"The/DT word/NN quarantine/NN comes/VBZ from/IN quarantena/NN or/CC quarantaine/NN ,/, meaning/VBG ``/`` forty/JJ days/NNS ''/'' ,/, used/VBD in/IN the/DT Venetian/JJ language/NN in/IN the/DT 14th/CD and/CC 15th/CD centuries/NNS and/CC also/RB in/IN France/NNP ./. The/DT word/NN is/VBZ designated/VBN in/IN the/DT period/NN during/IN which/WDT all/DT ships/NNS were/VBD required/VBN to/TO be/VB isolated/VBN before/IN passengers/NNS and/CC crew/NN could/MD go/VB ashore/RB during/IN the/DT Black/NNP Death/NNP plague/NN ./. The/DT quarantena/NN followed/VBD the/DT trentino/NN ,/, or/CC ``/`` thirty-day/JJ isolation/NN ''/'' period/NN ,/, first/RB imposed/VBN in/IN 1347/CD in/IN the/DT Republic/NNP of/IN Ragusa/NNP ,/, Dalmatia/NNP (/( modern/JJ Dubrovnik/NNP in/IN Croatia/NNP )/) .Merriam-Webster/NN gives/VBZ various/JJ meanings/NNS to/TO the/DT noun/JJ form/NN ,/, including/VBG ``/`` a/DT period/NN of/IN 40/CD days/NNS ''/'' ,/, several/JJ relating/VBG to/TO ships/VB ,/, ``/`` a/DT state/N

In [16]:
print(tag_text("animals, including dogs and cats."))
print(tag_text("animals, especially Dogs and cats."))
print(tag_text("dogs and other animals."))

animals/NNS ,/, including/VBG dogs/NNS and/CC cats/NNS ./.
animals/NNS ,/, especially/RB Dogs/NNP and/CC cats/NNS ./.
dogs/NNS and/CC other/JJ animals/NNS ./.


Now we have to adjust the regular expressions.

In [17]:
hearst_pat = [
    (r'(\S+/NN[SP]?) such/JJ as/IN (\S+/NN[SP]?) ((and|or)/CC (\S+/NN[SP]?))?',0,[1,4]),
    (r'(\S+/NN[SP]?)( ,/,)? especially/\w+ (\S+/NN[SP]?) ((and|or)/CC (\S+/NN[SP]?))?',0,[2,5]),
    (r'(\S+/NN[SP]?)( ,/,)? including/\w+ (\S+/NN[SP]?) ((and|or)/CC (\S+/NN[SP]?))?',0,[2,5]),
    (r'(\S+/NN[SP]?)( ,/,)? (and|or)/CC other/\w+ (\S+/NN[SP]?)',3,[0])
]

paircount = Counter()

filelist = glob.glob("infect/*.txt")
for f in filelist:
    file = codecs.open(f,'r','utf8')

    for line in file:
        line = tag_text(line)
        for pat,cp,ips in hearst_pat:
            resultlist = re.findall(pat,line) 
            for result in resultlist:
                concept = result[cp].split('/')[0]
                for ip in ips:
                    if ip < len(result):
                        instance = result[ip].split('/')[0]
                        if instance:
                            paircount.update([(concept,instance)])
                             
paircount.most_common()

[]
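Independently of the corpus, the tagged patterns can be checked against the sample strings produced by `tag_text` in the earlier cell. The following is a minimal sketch: the two tagged strings are copied from that output, only two of the four patterns are included, and the helper name `pairs_from_tagged` is hypothetical.

```python
import re

# Two of the patterns from the cell above, with their group indices.
patterns = [
    (r'(\S+/NN[SP]?)( ,/,)? including/\w+ (\S+/NN[SP]?) ((and|or)/CC (\S+/NN[SP]?))?', 0, [2, 5]),
    (r'(\S+/NN[SP]?)( ,/,)? (and|or)/CC other/\w+ (\S+/NN[SP]?)', 3, [0]),
]

def pairs_from_tagged(tagged):
    """Apply the patterns to an already tagged string and return (concept, instance) pairs."""
    pairs = []
    for pat, cp, ips in patterns:
        for result in re.findall(pat, tagged):
            concept = result[cp].split('/')[0]  # drop the /TAG suffix
            for ip in ips:
                instance = result[ip].split('/')[0]
                if instance:  # unmatched optional groups are empty strings
                    pairs.append((concept, instance))
    return pairs

including = pairs_from_tagged('animals/NNS ,/, including/VBG dogs/NNS and/CC cats/NNS ./.')
other = pairs_from_tagged('dogs/NNS and/CC other/JJ animals/NNS ./.')
print(including)  # [('animals', 'dogs'), ('animals', 'cats')]
print(other)      # [('animals', 'dogs')]
```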

We could make many extensions and include cases with more complex noun phrases. Let us try slightly more complex phrases. In order to keep the expressions readable, we will abbreviate the NPs and expand them later.

Furthermore, we compile the regex to speed up the process a little bit.

In [18]:
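# The backslashes are doubled because re.sub() below processes escapes in its replacement string.
# The inner group is capturing, so every (NP) contributes two groups to the indices below.
NP = r'\\S+/NN[SP]?( \\S+/NN[SP]?)*'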

hearst_pat = [
    (r'(NP) such/JJ as/IN( \w+/DT)? (NP) ((and|or)/CC( \w+/DT)? (NP))?',0,[3,8]),
    (r'(NP)( ,/,)? especially/\w+( \w+/DT)? (NP) ((and|or)/CC( \w+/DT)? (NP))?',0,[4,9]),
    (r'(NP)( ,/,)? including/\w+( \w+/DT)? (NP) ((and|or)/CC( \w+/DT)? (NP))?',0,[4,9]),
    (r'(NP)( ,/,)? (and|or)/CC other/\w+ (NP)',4,[0])
]

hearst_pat = [(re.compile(re.sub('NP',NP,p)),c,i) for p,c,i in hearst_pat]
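An alternative way to expand the abbreviation, shown here only as a sketch and not as the approach used in this notebook: `str.replace` inserts the text literally, so single backslashes suffice, and a non-capturing inner group `(?:...)` makes every `(NP)` count as exactly one group. The group indices then differ from those in the cell above. The tagged example string is hand-written for illustration and is not tagger output.

```python
import re

# Assumption: the inner group is non-capturing, so each (NP) adds exactly one group.
NP = r'\S+/NN[SP]?(?: \S+/NN[SP]?)*'

template = r'(NP)( ,/,)? (and|or)/CC other/\w+ (NP)'
pattern = re.compile(template.replace('NP', NP))

# Groups: 0 = instance NP, 1 = optional comma, 2 = and/or, 3 = concept NP
tagged = 'cats/NNS and/CC other/JJ house/NN pets/NNS ./.'
result = pattern.findall(tagged)[0]

# Strip the /TAG suffixes but keep all words of a multi-word NP.
concept = re.sub(r'/[A-Z]+', '', result[3])
instance = re.sub(r'/[A-Z]+', '', result[0])
print(pattern.groups, (concept, instance))  # 4 ('house pets', 'cats')
```

`pattern.groups` is a quick way to confirm how many bracket groups a compiled pattern has before hard-coding indices.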

In [19]:
paircount = Counter()

filelist = glob.glob("infect/*.txt")
for f in filelist:
    file = codecs.open(f,'r','utf8')

    for line in file:
        line = tag_text(line)
        for pat,cp,ips in hearst_pat:
            resultlist = pat.findall(line) #use the compiled pattern
            for result in resultlist:
                concept = re.sub(r'/[A-Z]+','',result[cp]) # strip the tags but keep the whole NP, not just its first word
                for ip in ips:
                    if ip < len(result):
                        instance = re.sub(r'/[A-Z]+','',result[ip])
                        if instance:
                            paircount.update([(concept,instance)])
                             
paircount.most_common()

[]

In [20]:
len(paircount)

0

In [21]:
sum(paircount.values())

0