# Exercise 11: Entity and Relation Extraction

## Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that combine *Part-of-Speech Tagging*, *Named Entity Recognition*, and *Regular Expressions*.

#### Required Library: spaCy
- ```conda install -y spacy```
- ```python -m spacy download en```

In [1]:
import urllib.request, json, csv, re
import spacy
nlp = spacy.load('en')

In [2]:
#read tsv with input movies
def read_tsv():
    movies=[]
    with open('movies.tsv','r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv) #remove header
        movies = [{'movie':line[0], 'director':line[1]} for line in tsv]
    return movies

#parse wikipedia page
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            txt = next(iter(data['query']['pages'].values()))['extract']
    except Exception:
        # network errors or a page without an 'extract' field leave txt empty
        pass
    return txt
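
Movie titles that contain characters such as `&` or spaces can break a URL that is built by plain string concatenation. The following is a minimal sketch of building the same query with `urllib.parse.urlencode`; the names `API_ENDPOINT` and `build_extract_url` are hypothetical helpers, not part of the exercise.

```python
from urllib.parse import urlencode

# Same endpoint as in parse_wikipedia above
API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

def build_extract_url(title):
    """Return the API URL that requests the plain-text intro of `title`."""
    params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',
        'explaintext': '',
        'titles': title,   # percent-encoded by urlencode
    }
    return API_ENDPOINT + '?' + urlencode(params)

print(build_extract_url('The_A-Team_(film)'))
```

The returned string could be passed to `urllib.request.urlopen` in place of the concatenated URL.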

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [3]:
def find_PER_entities(txt):
    txt = nlp(txt)
    
    persons = []
    for e in txt.ents:
        if e.label_ == 'PERSON':
            persons.append(e.text)
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [4]:
def find_director(txt, persons):
    txt = re.sub('[!?,.]', '', txt).split()
    
    # look for directed in text
    for p1 in range(0, len(txt)):
        if(txt[p1] == 'directed'):
            
            # look for the first person after the word "directed"
            for p2 in range(p1 + 1, len(txt)):
                # iterate through list of known persons
                for per in persons:
                    if per.startswith(txt[p2]):
                        return per
        
    # return empty string if no director found
    return ''
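
The task also mentions regular expressions. As an alternative to the token loop above, the sketch below searches for the phrase "directed ... by" within one sentence and returns the known person that appears first after it. This is an illustrative variant under that assumption, not the solution used to produce the results in this notebook; `find_director_regex` is a hypothetical name.

```python
import re

def find_director_regex(txt, persons):
    """Return the PERSON entity that appears first after 'directed ... by'."""
    # 'directed', then anything except a period, then 'by'
    # (also covers phrases such as 'directed and produced by')
    match = re.search(r'\bdirected\b[^.]*?\bby\b', txt)
    if match is None:
        return ''
    rest = txt[match.end():]
    # (position, name) for every known person found in the remaining text
    positions = [(rest.find(p), p) for p in persons if rest.find(p) != -1]
    return min(positions)[1] if positions else ''

sample = ('Alien vs Ninja is a 2010 Japanese film written and directed '
          'by Seiji Chiba. It stars Masanori Mimoto.')
print(find_director_regex(sample, ['Masanori Mimoto', 'Seiji Chiba']))
```

Unlike the prefix test with `startswith`, this variant compares full entity strings, so a single token cannot match the beginning of an unrelated name.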

In [5]:
movies = read_tsv()

statements=[]

tp = 0
fp = 0

# for each movie
for m in movies:
    
    # find the director
    txt = parse_wikipedia(m['movie'])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)
    
    if director != '':
        statements.append(m['movie'] + ' is directed by ' + director + '.')
        if director != m['director']:
            fp += 1
    
    #if director != '':
    #    statements.append(m['movie'] + ' is directed by ' + director + '.')
        
    # if director is correct
    #if(m['director'] == director):
    #    tp += 1
    #else:
    #    fp += 1
            

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [6]:
# compute precision and recall
fn = len(movies) - len(statements)
tp = len(statements) - fp
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print ('Precision:',precision)
print ('Recall:',recall)

print()
print('***Sample Statements***')
for s in statements[:5]:
    print (s)

Precision: 0.7918367346938775
Recall: 0.8220338983050848

***Sample Statements***
13_Assassins_(2010_film) is directed by Takashi Miike.
14_Blades is directed by Daniel Lee.
22_Bullets is directed by Richard Berry.
The_A-Team_(film) is directed by Joe Carnahan.
Alien_vs_Ninja is directed by Seiji Chiba.
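
The two measures follow the usual definitions: precision is `tp / (tp + fp)` and recall is `tp / (tp + fn)`. A small sketch that also guards against an empty denominator (for example, when no statement is extracted at all) could look like this; the helper name and the counts in the call are hypothetical and chosen only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: 8 correct statements, 2 wrong ones, 4 movies without a statement
print(precision_recall(tp=8, fp=2, fn=4))
```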


## Task 2: Named Entity Recognition using Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the given sentences as training and test sets:

In [None]:
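# each sentence is its own list element; without the comma after the second
# sentence, Python would silently concatenate it with the third one
training_set=['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .',
              'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .',
              'None of them lived in Chicago .']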

test_set=['Ray Charles was born in 1930 .', \
          'Bobby Bland was born the same year as Ray Charles .', \
          'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non-PER entities).

*Hint*: Represent the sentences as sequences of bigrams, and label each bigram.
Only bigrams that consist of a pair of the form (*firstname lastname*) are considered *PER* entities.

In [None]:
#Bigram Representation
def getBigrams(sents):
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)
print(bigrams)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for b in bigrams:
    
    if(b in PER):
        annotations.append([b, 'I'])
    else:
        annotations.append([b, 'O'])
    
print('Annotation\n', annotations,'\n')
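
To make the bigram representation concrete, here is a small self-contained sketch that labels a single sentence. The helper names `bigrams_of` and `label_bigrams` and the example sentence are assumptions made for illustration; the logic mirrors the cell above.

```python
def bigrams_of(sentence):
    """Split a whitespace-tokenized sentence into overlapping word pairs."""
    tokens = sentence.split(' ')
    return [a + ' ' + b for a, b in zip(tokens[:-1], tokens[1:])]

def label_bigrams(bigrams, person_names):
    """Label a bigram 'I' if it is a known (firstname lastname) pair, else 'O'."""
    return [(b, 'I' if b in person_names else 'O') for b in bigrams]

example = 'Ray Charles was a pianist .'
print(label_bigrams(bigrams_of(example), {'Ray Charles'}))
```

Only the bigram `Ray Charles` receives the label `I`; overlapping bigrams such as `Charles was` are labeled `O`.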

#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities you can utilize the morphology of the words that constitute a bigram (e.g., you can count their uppercase first characters).

In [None]:
# Count labels and label transitions
# ----------------------------------
# label counts
p_o = 0
p_i = 0
# number of transitions that end in O / end in I
p_xo = 0
p_xi = 0
# transition counts: p_ab counts label a followed directly by label b
p_oo = 0
p_oi = 0
p_io = 0
p_ii = 0

for i, (_, label) in enumerate(annotations):

    # every bigram contributes to the label counts
    if label == 'O':
        p_o += 1
    if label == 'I':
        p_i += 1

    # the first bigram has no predecessor, so it adds no transition
    if i == 0:
        continue

    prev = annotations[i-1][1]
    if prev == 'O' and label == 'O':
        p_oo += 1
        p_xo += 1
    if prev == 'O' and label == 'I':
        p_oi += 1
        p_xi += 1
    if prev == 'I' and label == 'O':
        p_io += 1
        p_xo += 1
    if prev == 'I' and label == 'I':
        p_ii += 1
        p_xi += 1
            
# Count emissions: capitalized words (0, 1 or 2) per label
# --------------------------------------------------------
p_0_upper_o = 0
p_1_upper_o = 0
p_2_upper_o = 0
p_0_upper_i = 0
p_1_upper_i = 0
p_2_upper_i = 0

for i, _ in enumerate(annotations):
        
    count = annotations[i][0].split(' ')[0][0].isupper() + annotations[i][0].split(' ')[1][0].isupper()
    
    if(annotations[i][1] == 'O'):
        if(count == 0):
            p_0_upper_o += 1
        if(count == 1):
            p_1_upper_o += 1
        if(count == 2):
            p_2_upper_o += 1
            
    if(annotations[i][1] == 'I'):
        if(count == 0):
            p_0_upper_i += 1
        if(count == 1):
            p_1_upper_i += 1
        if(count == 2):
            p_2_upper_i += 1

In [None]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}

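#Start probabilities (approximated by the overall label frequencies)
transition_prob['P(I|start)'] = p_i / (p_o + p_i)
transition_prob['P(O|start)'] = p_o / (p_o + p_i)

#P(X|Y): probability of moving to label X given the previous label Y,
#normalized by the number of transitions that leave Y
transition_prob['P(O|O)'] = p_oo / (p_oo + p_oi)
transition_prob['P(O|I)'] = p_io / (p_io + p_ii)
transition_prob['P(I|O)'] = p_oi / (p_oo + p_oi)
transition_prob['P(I|I)'] = p_ii / (p_io + p_ii)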

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

default_emission = 1/len(bigrams) * (1 - lambda_)

emission_prob['P(2_upper|O)'] = (p_2_upper_o / p_o) * lambda_ + default_emission
emission_prob['P(2_upper|I)'] = (p_2_upper_i / p_i) * lambda_ + default_emission
emission_prob['P(1_upper|O)'] = (p_1_upper_o / p_o) * lambda_ + default_emission
emission_prob['P(1_upper|I)'] = (p_1_upper_i / p_i) * lambda_ + default_emission
emission_prob['P(0_upper|O)'] = (p_0_upper_o / p_o) * lambda_ + default_emission
emission_prob['P(0_upper|I)'] = (p_0_upper_i / p_i) * lambda_ + default_emission

print('Emission Probabilities\n', emission_prob, '\n')
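
The prediction step looks up `P(X|prev_state)`, so `P(X|Y)` is read as the probability of moving to label X given the previous label Y. A compact way to estimate such probabilities directly from a label sequence is sketched below. The function name `estimate_transitions` and the example sequence are hypothetical, and the sketch applies no smoothing.

```python
from collections import Counter

def estimate_transitions(labels, states=('I', 'O')):
    """Estimate P(current | previous) from one sequence of labels."""
    pair_counts = Counter(zip(labels[:-1], labels[1:]))  # (previous, current)
    prev_counts = Counter(labels[:-1])                   # transitions leaving each label
    probs = {}
    for prev in states:
        for cur in states:
            total = prev_counts[prev]
            key = 'P(%s|%s)' % (cur, prev)
            probs[key] = pair_counts[(prev, cur)] / total if total else 0.0
    return probs

# Hypothetical label sequence, for illustration only
labels = ['O', 'O', 'I', 'O', 'O', 'I', 'O']
print(estimate_transitions(labels))
```

For each previous label, the two outgoing probabilities sum to one, which is a quick sanity check for the estimates.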

#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [None]:
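#Prediction with greedy search: each bigram gets the locally best label given
#the previously chosen one (Viterbi would instead score whole label sequences)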
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:
    
    count = b.split(' ')[0][0].isupper() + b.split(' ')[1][0].isupper()
    
    I_prob = transition_prob['P(I|'+prev_state+')'] * emission_prob['P('+str(count)+'_upper|I)']
    O_prob = transition_prob['P(O|'+prev_state+')'] * emission_prob['P('+str(count)+'_upper|O)']
    
    if O_prob > I_prob:
        prev_state = 'O'
    else:
        entities.append(b)
        prev_state = 'I'

print('Predicted Entities\n', entities, '\n')
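
Greedy search commits to one label per step, while Viterbi decoding keeps the best path into every state and backtracks at the end. The following is a minimal sketch of Viterbi decoding for a two-state model. All probabilities in the example are hypothetical values chosen for illustration, not the ones estimated above, and the observations are the numbers of capitalized words per bigram.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `observations`.

    start_p[s], trans_p[prev][s] and emit_p[s][obs] are probabilities;
    the computation runs in log-space to avoid underflow.
    """
    if not observations:
        return []

    def log(p):
        return math.log(p) if p > 0 else float('-inf')

    # best[t][s]: log-probability of the best path that ends in s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(trans_p[r][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # follow the back-pointers from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model parameters, for illustration only
states = ('I', 'O')
start_p = {'I': 0.2, 'O': 0.8}
trans_p = {'I': {'I': 0.1, 'O': 0.9}, 'O': {'I': 0.3, 'O': 0.7}}
emit_p = {'I': {0: 0.05, 1: 0.15, 2: 0.8}, 'O': {0: 0.6, 1: 0.35, 2: 0.05}}

# uppercase counts for: 'Ray Charles', 'Charles was', 'was born', 'born in'
print(viterbi([2, 1, 0, 0], states, start_p, trans_p, emit_p))
```

With these example values, the decoder labels only the first bigram as `I`.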

Precision is *...%* while recall is *...%*. 

#### 4) Comment on how you can further improve this model.

...