## 📚 Exercise 13: Entity & Relation Extraction

### Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that combine *Part-of-Speech Tagging*, *Named Entity Recognition* and *Regular Expressions*.

#### Required Library: spaCy
- ```conda install -y spacy```
- ```python -m spacy download en_core_web_sm```

In [1]:
import urllib.request, json, csv, re
import spacy
from tqdm import tqdm 
nlp = spacy.load('en_core_web_sm')

In [2]:
#read tsv with input movies
def read_tsv():
    movies=[]
    with open('movies.tsv','r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv) #remove header
        movies = [{'movie':line[0], 'director':line[1]} for line in tsv]
    return movies

#parse wikipedia page
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            txt = next(iter(data['query']['pages'].values()))['extract']
    except Exception:
        # network errors and pages without an 'extract' field yield an empty text
        pass
    return txt
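
The request above builds the query string by plain concatenation and reads the first entry of `query.pages`. As a minimal sketch of the same two steps, the helpers below (`build_extract_url` and `extract_intro` are hypothetical names, not part of the exercise) percent-encode the title with `urllib.parse.urlencode` and read the extract defensively. The sample response is hand-written to mirror the shape the code above expects; it is not real API output.

```python
import json
from urllib.parse import urlencode

API = 'https://en.wikipedia.org/w/api.php'

def build_extract_url(title):
    """Build the same query as parse_wikipedia, with the title percent-encoded."""
    params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',
        'explaintext': '',
        'titles': title,
    }
    return API + '?' + urlencode(params)

def extract_intro(data):
    """Return the 'extract' of the first page, or '' if it is missing."""
    pages = data.get('query', {}).get('pages', {})
    for page in pages.values():
        return page.get('extract', '')
    return ''

# Hand-written sample response (assumption: same shape as the live API output)
sample = json.loads(
    '{"query": {"pages": {"123": {"title": "Example", "extract": "Example is a film."}}}}'
)
print(build_extract_url('The_A-Team_(film)'))
print(extract_intro(sample))
```

Encoding the title matters mostly for titles that contain spaces or non-ASCII characters, which would otherwise make the request fail and silently return an empty text.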

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [3]:
def find_PER_entities(txt):
    persons = []
    doc = nlp(txt)
    for entity in doc.ents: 
        if entity.label_ == 'PERSON':
            persons.append(entity.text)
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [11]:
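# Heuristic from the sample solution: return the first person mentioned after the word 'direct'
def find_director(txt, persons):
    doc = nlp(txt)
    words = txt.split()
    start = 0
    # index of the first token whose lemma is 'direct' (0 if there is none)
    for ii, token in enumerate(doc):
        if token.lemma_ == 'direct':
            start = ii
            break

    # note: start is a spaCy token index but is used as an index into the
    # whitespace-split words, so the scan only approximately starts at 'direct'
    for ii in range(start, len(words)):
        for person in persons:
            if person.startswith(words[ii]):
                return person

    return ''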
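
The heuristic above mixes two tokenizations (spaCy tokens and whitespace-split words), so the scan does not always begin exactly at the trigger word. One possible variant, sketched here under the assumption that a regular expression is good enough to locate the trigger, works on character offsets instead. `find_director_by_offset` is a hypothetical name, and the sample sentence is hand-written rather than taken from Wikipedia.

```python
import re

def find_director_by_offset(txt, persons):
    """Return the person whose mention starts closest after the first 'direct...' word."""
    # Assumption: any word starting with 'direct' counts as the trigger
    # (this also matches 'director', 'direction', 'directly').
    trigger = re.search(r'\bdirect\w*', txt, flags=re.IGNORECASE)
    if trigger is None:
        return ''
    best, best_pos = '', len(txt)
    for person in persons:
        pos = txt.find(person, trigger.end())  # first mention after the trigger
        if pos != -1 and pos < best_pos:
            best, best_pos = person, pos
    return best

sample = ('The A-Team is a 2010 action film directed by Joe Carnahan, '
          'starring Liam Neeson and Bradley Cooper.')
persons = ['Liam Neeson', 'Bradley Cooper', 'Joe Carnahan']
print(find_director_by_offset(sample, persons))  # → Joe Carnahan
```

Because the full entity string is searched, this variant returns `Joe Carnahan` rather than a partial mention, but it has not been evaluated on the movie list used here.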

In [12]:
movies = read_tsv()
movies[:10]

[{'movie': '13_Assassins_(2010_film)', 'director': 'Takashi Miike'},
 {'movie': '14_Blades', 'director': 'Daniel Lee'},
 {'movie': '22_Bullets', 'director': 'Richard Berry'},
 {'movie': 'The_A-Team_(film)', 'director': 'Joe Carnahan'},
 {'movie': 'Alien_vs_Ninja', 'director': 'Seiji Chiba'},
 {'movie': 'Bad_Blood_(2010_film)', 'director': 'Dennis Law'},
 {'movie': 'Bangkok_Knockout', 'director': 'Panna Rittikrai'},
 {'movie': 'Blades_of_Blood', 'director': 'Lee Joon-ik'},
 {'movie': 'The_Book_of_Eli', 'director': 'Allen Hughes'},
 {'movie': 'The_Bounty_Hunter_(2010_film)', 'director': 'Andy Tennant'}]

In [25]:
statements=[]
tp = 0

for m in tqdm(movies):
    txt = parse_wikipedia(m['movie'])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)

    if director != '':
        statements.append(m['movie'] + ' is directed by ' + director + '.')
        # exact string match against the ground truth
        if director == m['director']:
            tp += 1

100%|██████████| 287/287 [02:27<00:00,  1.95it/s]


#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [26]:
# compute precision and recall
precision = tp / len(statements)
recall = tp / len(movies)
print('Precision:', precision)
print('Recall:', recall)
print('\n***Sample Statements***')
print(statements[:5])

Precision: 0.4777327935222672
Recall: 0.41114982578397213

***Sample Statements***
['13_Assassins_(2010_film) is directed by Kōji Yakusho.', '14_Blades is directed by Donnie Yen.', '22_Bullets is directed by Jacky Imbert.', 'The_A-Team_(film) is directed by Carnahan.', 'Bad_Blood_(2010_film) is directed by Simon Yam.']
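
Precision divides the correct statements by all extracted statements, while recall divides them by all movies in the ground truth. The sketch below wraps both in a small helper (`precision_recall` is a hypothetical name) that also guards against empty inputs. The counts 118, 247 and 287 are inferred from the printed values and the progress bar; the notebook does not print them directly.

```python
def precision_recall(true_positives, n_extracted, n_gold):
    """Return (precision, recall); an empty denominator yields 0.0."""
    precision = true_positives / n_extracted if n_extracted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall

# Assumption: 118 correct statements, 247 extracted statements, 287 movies
p, r = precision_recall(118, 247, 287)
print('Precision:', p)
print('Recall:', r)
```

Note that the comparison is an exact string match, so a partial name such as `Carnahan` for `Joe Carnahan` counts as an error in both measures.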


### Task 2: Named Entity Recognition using a Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, the model must recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the sentences below as the training and test sets:

In [27]:
training_set=['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .',
              'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .',
              'None of them lived in Chicago .']

test_set=['Ray Charles was born in 1930 .',
          'Bobby Bland was born the same year as Ray Charles .',
          'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non PER entities).
	
*Hint*: Represent the sentences as sequences of bigrams and label each bigram.
Only bigrams that consist of a (*firstname lastname*) pair count as *PER* entities.

In [28]:
#Bigram Representation
def getBigrams(sents):
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for b in bigrams:
    annotations.append('I' if b in PER else 'O')
print('Annotation\n', annotations,'\n')

Annotation
 ['O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'] 



In [33]:
bigrams

['The best',
 'best blues',
 'blues singer',
 'singer was',
 'was Bobby',
 'Bobby Bland',
 'Bland while',
 'while Ray',
 'Ray Charles',
 'Charles pioneered',
 'pioneered soul',
 'soul music',
 'music .',
 'Bobby Bland',
 'Bland was',
 'was just',
 'just a',
 'a singer',
 'singer whereas',
 'whereas Ray',
 'Ray Charles',
 'Charles was',
 'was a',
 'a pianist',
 'pianist ,',
 ', songwriter',
 'songwriter and',
 'and singer',
 'singer .',
 'None of',
 'of them',
 'them lived',
 'lived in',
 'in Chicago',
 'Chicago .']

#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities, you can use the morphology of the words that make up a bigram (e.g., count how many of them start with an uppercase character).

In [35]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}


#Prior
transition_prob['P(I|start)'] = lambda_*0 + (1-lambda_) * 1 / len(bigrams)
transition_prob['P(O|start)'] = lambda_*1 + (1-lambda_) * 1 / len(bigrams)

transition_prob['P(O|O)'] = lambda_*(26/34) + (1-lambda_) * 1 / len(bigrams)
transition_prob['P(O|I)'] = lambda_*(4/34) + (1-lambda_) * 1/len(bigrams)
transition_prob['P(I|O)'] = lambda_*(4/34) + (1-lambda_) * 1/len(bigrams)
transition_prob['P(I|I)'] = lambda_*(0) + (1-lambda_) * 1/len(bigrams)

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

        
default_emission = (1-lambda_)* 1 / len(bigrams)

emission_prob['P(2_upper|O)'] = lambda_*0 + (1-lambda_) * 1 / len(bigrams)
emission_prob['P(2_upper|I)'] = lambda_*1 + (1-lambda_) * 1 / len(bigrams)
emission_prob['P(1_upper|O)'] = lambda_*12/34 + (1-lambda_) * 1 / len(bigrams)
emission_prob['P(1_upper|I)'] = lambda_*0 + (1-lambda_) * 1 / len(bigrams)
emission_prob['P(0_upper|O)'] = lambda_*22/34 + (1-lambda_) * 1 / len(bigrams)
emission_prob['P(0_upper|I)'] = lambda_*0 + (1-lambda_) * 1 / len(bigrams)

print('Emission Probabilities\n', emission_prob, '\n')

Transition Probabilities
 {'P(I|start)': 0.014285714285714285, 'P(O|start)': 0.5142857142857142, 'P(O|O)': 0.39663865546218485, 'P(O|I)': 0.073109243697479, 'P(I|O)': 0.073109243697479, 'P(I|I)': 0.014285714285714285} 

Emission Probabilities
 {'P(2_upper|O)': 0.014285714285714285, 'P(2_upper|I)': 0.5142857142857142, 'P(1_upper|O)': 0.19075630252100842, 'P(1_upper|I)': 0.014285714285714285, 'P(0_upper|O)': 0.3378151260504202, 'P(0_upper|I)': 0.014285714285714285} 
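
The hard-coded values above all follow the pattern $\lambda \cdot f + (1-\lambda) \cdot \frac{1}{N}$, where $f$ is a relative frequency and $N$ is the number of training bigrams. As a sketch, the transition counts can be derived from the annotation sequence instead of being typed in by hand. `smooth` is a hypothetical helper name, and the normalisation by the total number of transitions mirrors the cell above; it is an assumption carried over from there, whereas a conventional HMM would normalise per previous state.

```python
from collections import Counter

LAMBDA = 0.5

def smooth(rel_freq, n_bigrams, lam=LAMBDA):
    """Interpolate a relative frequency with a uniform 1/N term."""
    return lam * rel_freq + (1 - lam) / n_bigrams

# Annotation sequence of the 35 training bigrams, as printed in step 1
annotations = (['O'] * 5 + ['I'] + ['O'] * 2 + ['I'] + ['O'] * 4 + ['I']
               + ['O'] * 6 + ['I'] + ['O'] * 14)

# Count label pairs (previous, current) over the whole sequence
pair_counts = Counter(zip(annotations[:-1], annotations[1:]))
n_transitions = len(annotations) - 1

transition_prob = {}
for prev in 'OI':
    for cur in 'OI':
        rel = pair_counts[(prev, cur)] / n_transitions
        transition_prob['P(%s|%s)' % (cur, prev)] = smooth(rel, len(annotations))

print(dict(pair_counts))
print(transition_prob)
```

The counts come out as 26 for O→O, 4 each for O→I and I→O, and 0 for I→I, which matches the fractions used in the cell above.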



#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [37]:
def count_upper_first_char(bigram):
    count=0
    if bigram.split(' ')[0][0].isupper():
        count+=1
    if bigram.split(' ')[1][0].isupper():
        count+=1
    return count

In [38]:
#Prediction (greedy: each bigram gets the locally more probable label)
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:
    n_upper = str(count_upper_first_char(b))
    I_prob = transition_prob['P(I|' + prev_state + ')'] * emission_prob['P(' + n_upper + '_upper|I)']
    O_prob = transition_prob['P(O|' + prev_state + ')'] * emission_prob['P(' + n_upper + '_upper|O)']

    # ties are resolved in favour of I
    if O_prob > I_prob:
        prev_state = 'O'
    else:
        entities.append(b)
        prev_state = 'I'

print('Predicted Entities\n', entities, '\n')

Predicted Entities
 ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues'] 



Precision is *...%* while recall is *...%*. 
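
One way to fill in these two numbers is to compare the predicted bigrams with a hand-labelled gold standard. The sketch below assumes that the gold *PER* bigrams of the test set are the four (*firstname lastname*) mentions listed in `gold`; this list is an assumption and is not given in the exercise. `score` is a hypothetical helper name.

```python
from collections import Counter

# Predicted entities, copied from the output above
predicted = ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues']

# Assumption: hand-labelled PER bigrams of the three test sentences
gold = ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters']

def score(predicted, gold):
    """Precision and recall over multisets, so repeated mentions count separately."""
    tp = sum((Counter(predicted) & Counter(gold)).values())
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

precision, recall = score(predicted, gold)
print('Precision:', precision)
print('Recall:', recall)
```

The multiset intersection is used because `Ray Charles` occurs twice in both lists, and a plain set comparison would count it only once.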

#### 4) Comment on how you can further improve this model.

The model could also consider unigrams and average their scores with the bigram scores in the final prediction step.