## 📚 Exercise 13: Entity & Relation Extraction

### Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern based heuristics that utilize: *Part Of Speech Tagging*, *Named Entity Recognition* and *Regular Expressions*.

#### Required Library: SpaCy
- ```conda install -y spacy```
- ```python -m spacy download en```

In [21]:
import urllib.request, json, csv, re
import spacy
nlp = spacy.load('en_core_web_sm')

In [22]:
#read tsv with input movies
def read_tsv():
    movies=[]
    with open('movies.tsv','r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv) #remove header
        movies = [{'movie':line[0], 'director':line[1]} for line in tsv]
    return movies

#parse wikipedia page
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            txt = next (iter (data['query']['pages'].values()))['extract']
    except:
        pass
    return txt

In [23]:
txt = parse_wikipedia('Inception')

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [24]:
def find_PER_entities(txt):
    persons = []
    doc = nlp(txt)

    for e in doc.ents: 
        if e.label_ == 'PERSON':
            persons.append(e.text)
    return set(persons)

In [25]:
find_PER_entities(txt)

{'Christopher Nolan',
 'Cillian Murphy',
 'Dileep Rao',
 'Elliot Page',
 'Emma Thomas',
 'Joseph Gordon-Levitt',
 'Ken Watanabe',
 'Leonardo DiCaprio',
 'Marion Cotillard',
 'Michael Caine',
 'The Dark Knight',
 'Tom Berenger',
 'Tom Hardy'}

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [29]:
def find_director(txt, persons):
    # hint from solution: find persons after the word directed
    for index, word in enumerate(txt.split()):
        if word == 'directed':
            remaining = ' '.join(txt.split()[index+1:])
            for ii in range(len(remaining.split())):
                for person in persons: 
                    if remaining.startswith(person): 
                        return person
                remaining = ' '.join(remaining.split()[1:])
    return ''

In [30]:
movies = read_tsv()
movies[:10]

[{'movie': '13_Assassins_(2010_film)', 'director': 'Takashi Miike'},
 {'movie': '14_Blades', 'director': 'Daniel Lee'},
 {'movie': '22_Bullets', 'director': 'Richard Berry'},
 {'movie': 'The_A-Team_(film)', 'director': 'Joe Carnahan'},
 {'movie': 'Alien_vs_Ninja', 'director': 'Seiji Chiba'},
 {'movie': 'Bad_Blood_(2010_film)', 'director': 'Dennis Law'},
 {'movie': 'Bangkok_Knockout', 'director': 'Panna Rittikrai'},
 {'movie': 'Blades_of_Blood', 'director': 'Lee Joon-ik'},
 {'movie': 'The_Book_of_Eli', 'director': 'Allen Hughes'},
 {'movie': 'The_Bounty_Hunter_(2010_film)', 'director': 'Andy Tennant'}]

In [31]:
from tqdm import tqdm

statements=[]
predicted_movies = {}
for m in tqdm(movies):

        txt = parse_wikipedia(m['movie'])
        persons = find_PER_entities(txt)
        director = find_director(txt, persons)
        
        if director != '':
            statements.append(m['movie'] + ' is directed by ' + director + '.')
            predicted_movies[m['movie']] = director
        else: 
            predicted_movies[m['movie']] = " "

100%|██████████| 287/287 [01:47<00:00,  2.67it/s]


In [32]:
gt = {m['movie'] : m['director'] for m in movies}

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [37]:
# compute precision and recall
precision = sum([1 if predicted_movies[m] == gt[m] else 0 for m in predicted_movies.keys()]) / len(predicted_movies)
recall = sum([1 if predicted_movies[m] == gt[m] else 0 for m in predicted_movies.keys()]) / len(gt)
print ('Precision:',precision)
print ('Recall:',recall)
print('\n***Sample Statements***')
print(statements[:5])

Precision: 0.6550522648083623
Recall: 0.6550522648083623

***Sample Statements***
['13_Assassins_(2010_film) is directed by Takashi Miike.', '14_Blades is directed by Daniel Lee.', '22_Bullets is directed by Richard Berry.', 'Alien_vs_Ninja is directed by Seiji Chiba.', 'Bad_Blood_(2010_film) is directed by Dennis Law.']


## Task 2: Named Entity Recognition using Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
Particularly, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Using the given sentences as training and test set:

In [38]:
training_set=['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .', \
              'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .' \
              'None of them lived in Chicago .']

test_set=['Ray Charles was born in 1930 .', \
          'Bobby Bland was born the same year as Ray Charles .', \
          'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non PER entities).
	
    *Hint*: Represent the sentences as sequences of bigrams, and label each bigram.
	Only bigrams that contain pairs of the form (*firstname lastname*) are considered as *PER* entities.

In [40]:
#Bigram Representation
def getBigrams(sents):
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for b in bigrams:
    annotations.append('I' if b in PER else 'O')
print('Annotation\n', annotations,'\n')

Annotation
 ['O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'] 



#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda$=0.5).

    *Hint*: For the emission probabilities you can utilize the morphology of the words that constitute a bigram (e.g., you can count their uppercase first characters).

In [62]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}

from collections import Counter

c = Counter(annotations)
c2 = Counter(list(zip(annotations, annotations[1:])))

#Prior
transition_prob['P(I|start)'] = c["I"] / c.total()
transition_prob['P(O|start)'] = c['O'] / c.total()

transition_prob['P(O|O)'] = (c2[('O','O')] / c2.total())  # this is wrong, only divide by the number of 'O'!!!
transition_prob['P(O|I)'] = (c2[('I','O')] / c2.total())
transition_prob['P(I|O)'] = (c2[('O','I')] / c2.total())
transition_prob['P(I|I)'] = (c2[('I','I')] / c2.total())

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

uppers = []
for b in bigrams:
    c = 0
    if b.split()[0][0].isupper():
        c+=1
    if b.split()[1][0].isupper():
        c +=1
    uppers.append(c)
        
default_emission = (1-lambda_) / len(bigrams)

c = Counter(list(zip(uppers, annotations)))

emission_prob['P(2_upper|O)'] = lambda_ * c[(2,'O')] / c.total() + default_emission  # same mistake as above !!!
emission_prob['P(2_upper|I)'] = lambda_ * c[(2,'I')] / c.total() + default_emission
emission_prob['P(1_upper|O)'] = lambda_ * c[(1,'O')] / c.total() + default_emission
emission_prob['P(1_upper|I)'] = lambda_ * c[(1,'I')] / c.total() + default_emission
emission_prob['P(0_upper|O)'] = lambda_ * c[(0,'O')] / c.total() + default_emission
emission_prob['P(0_upper|I)'] = lambda_ * c[(0,'I')] / c.total() + default_emission

print('Emission Probabilities\n', emission_prob, '\n')

Transition Probabilities
 {'P(I|start)': 0.11428571428571428, 'P(O|start)': 0.8857142857142857, 'P(O|O)': 0.7647058823529411, 'P(O|I)': 0.11764705882352941, 'P(I|O)': 0.11764705882352941, 'P(I|I)': 0.0} 

Emission Probabilities
 {'P(2_upper|O)': 0.125, 'P(2_upper|I)': 0.020833333333333332, 'P(1_upper|O)': 0.14583333333333334, 'P(1_upper|I)': 0.041666666666666664, 'P(0_upper|O)': 0.20833333333333334, 'P(0_upper|I)': 0.08333333333333333} 



#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [64]:
#Prediction
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:

    c = 0
    if b.split()[0][0].isupper():
        c+=1
    if b.split()[1][0].isupper():
        c +=1

    I_prob = transition_prob['P(I|{})'.format(prev_state)]*emission_prob['P({}_upper|I)'.format(c)]
    O_prob = transition_prob['P(O|{})'.format(prev_state)]*emission_prob['P({}_upper|O)'.format(c)]

    print(I_prob, O_prob)
    
    if O_prob > I_prob:
        prev_state = 'O'
    else:
        entities.append(b)
        prev_state = 'I'

print('Predicted Entities\n', entities, '\n')

0.0023809523809523807 0.11071428571428571
0.004901960784313725 0.11151960784313726
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.0024509803921568627 0.09558823529411764
0.004901960784313725 0.11151960784313726
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.004901960784313725 0.11151960784313726
0.0024509803921568627 0.09558823529411764
0.004901960784313725 0.11151960784313726
0.0024509803921568627 0.09558823529411764
0.004901960784313725 0.11151960784313726
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.00980392156862745 0.15931372549019607
0.004901960784313725 0.11151960784313726
0.0024509803921568627 0.09558823529411764
0.004901960784313725 0.11151960784313726
Predicted Entities
 [] 

Precision is *...%* while recall is *...%*. 

#### 4) Comment on how you can further improve this model.

...