## 📚 Exercise 13: Entity & Relation Extraction

### Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that combine *Part-of-Speech Tagging*, *Named Entity Recognition*, and *Regular Expressions*.

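#### Required Library: spaCy
- ```conda install -y spacy```
- ```python -m spacy download en_core_web_sm```

In [None]:
import urllib.request, json, csv, re
import spacy
# spaCy 3.x removed the 'en' shortcut, so the small English model is loaded by its full name
nlp = spacy.load('en_core_web_sm')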

In [None]:
# Read the TSV file with the input movies and their ground-truth directors
def read_tsv():
    movies = []
    with open('movies.tsv', 'r', newline='') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv)  # skip the header row
        movies = [{'movie': line[0], 'director': line[1]} for line in tsv]
    return movies

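import urllib.parse

# Fetch the plain-text introduction of a movie's Wikipedia page via the MediaWiki API
def parse_wikipedia(movie):
    txt = ''
    # URL-encode the title so that spaces and special characters do not break the request
    api_url = ('https://en.wikipedia.org/w/api.php?format=json&action=query'
               '&prop=extracts&exintro=&explaintext=&titles=' + urllib.parse.quote(movie))
    try:
        with urllib.request.urlopen(api_url) as url:
            data = json.loads(url.read().decode())
            txt = next(iter(data['query']['pages'].values()))['extract']
    except Exception:
        # network error, malformed response, or missing page: fall back to an empty string
        pass
    return txt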
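
The request URL used by `parse_wikipedia` can also be assembled from a parameter dictionary, which keeps the query readable and handles escaping automatically. The sketch below is only an illustration: the helper name `build_extract_url` and the example title are assumptions, not part of the exercise, and no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

def build_extract_url(title):
    """Hypothetical helper: build the query URL for the intro extract of one page."""
    params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',       # only the introduction section
        'explaintext': '',   # plain text instead of HTML
        'titles': title,
    }
    # urlencode escapes spaces, commas and other special characters in the title
    return API_ENDPOINT + '?' + urlencode(params)

example_url = build_extract_url('The Good, the Bad and the Ugly')
print(example_url)
```

Decoding the query string again with `parse_qs` returns the original title, which is a quick way to check that nothing was lost in the escaping.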

#### 1) Parse the raw text of a Wikipedia movie page and extract the named entities of type person (PER).

In [None]:
def find_PER_entities(txt):
    persons = []
    ...
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.
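
One possible pattern-based heuristic, shown here only as a sketch, is to look for the cue phrase "directed by" and check which extracted person follows it. The function name `match_director`, the regular expression, and the sample sentence are illustrative assumptions; real Wikipedia introductions use many other phrasings that this pattern would miss.

```python
import re

# Hypothetical cue pattern: "directed by" followed by one or more capitalized words
DIRECTED_BY = re.compile(r"directed\s+by\s+((?:[A-Z][\w.'-]*\s?)+)")

def match_director(txt, persons):
    """Return the first person from `persons` that follows a 'directed by' cue, else ''."""
    for m in DIRECTED_BY.finditer(txt):
        candidate = m.group(1).strip()
        for p in persons:
            if candidate.startswith(p):
                return p
    return ''

# Made-up input for illustration; in the exercise, `persons` comes from the NER step
sample_txt = ('Inception is a 2010 science fiction film written and '
              'directed by Christopher Nolan, who also produced it.')
sample_persons = ['Leonardo DiCaprio', 'Christopher Nolan']
print(match_director(sample_txt, sample_persons))
```

Restricting the answer to names that the NER step already found keeps the regular expression simple, because the pattern only has to locate the cue and not decide where a name ends.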

In [None]:
def find_director(txt, persons):
    ...
    return ''

In [None]:
movies = read_tsv()
movies[:10]

In [None]:
statements = []
for m in movies:
    txt = parse_wikipedia(m['movie'])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)

    if director != '':
        statements.append(m['movie'] + ' is directed by ' + director + '.')

#### 3) Compute the precision and recall against the given ground truth (the `Director` column of the TSV file) and show examples of the extracted statements.
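
As a reminder of the two measures: precision is the share of extracted statements that are correct, and recall is the share of ground-truth statements that were extracted. The sketch below computes both on made-up data. The function name `precision_recall`, the dictionary layout, and the placeholder titles are assumptions for illustration, not the notebook's data structures.

```python
def precision_recall(predicted, gold):
    """predicted, gold: dicts mapping a movie title to a director name.

    Movies for which no director was extracted are absent from `predicted`.
    """
    correct = sum(1 for movie, director in predicted.items()
                  if gold.get(movie) == director)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Placeholder data: 4 gold statements, 3 extracted, 2 of them correct
toy_gold = {'Movie A': 'Director One', 'Movie B': 'Director Two',
            'Movie C': 'Director Three', 'Movie D': 'Director Four'}
toy_predicted = {'Movie A': 'Director One', 'Movie B': 'Someone Else',
                 'Movie C': 'Director Three'}

toy_precision, toy_recall = precision_recall(toy_predicted, toy_gold)
print('Precision:', toy_precision)  # 2 correct out of 3 extracted
print('Recall:', toy_recall)        # 2 correct out of 4 gold statements
```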

In [None]:
# compute precision and recall
precision = ...
recall = ...
print('Precision:', precision)
print('Recall:', recall)
print('\n***Sample Statements***')
print(statements[:5])

### Task 2: Named Entity Recognition using a Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the sentences below as the training and test sets:

In [None]:
# Note the comma after each sentence: without it, Python silently concatenates adjacent string literals
training_set=['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .',
              'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .',
              'None of them lived in Chicago .']

test_set=['Ray Charles was born in 1930 .',
          'Bobby Bland was born the same year as Ray Charles .',
          'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non-PER entities).

*Hint*: Represent the sentences as sequences of bigrams and label each bigram.
Only bigrams that consist of a pair of the form (*firstname lastname*) are considered *PER* entities.
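
To make the hint concrete, here is a minimal sketch of the bigram view on a made-up sentence that is not part of the training or test set. The names `to_bigrams`, `label_bigrams`, and the toy data are assumptions chosen for illustration.

```python
def to_bigrams(sentence):
    """Split a whitespace-tokenized sentence into overlapping word pairs."""
    tokens = sentence.split(' ')
    return [a + ' ' + b for a, b in zip(tokens[:-1], tokens[1:])]

def label_bigrams(bigrams, known_persons):
    """Label a bigram 'I' if it is exactly a known (firstname lastname) pair, else 'O'."""
    return ['I' if b in known_persons else 'O' for b in bigrams]

# Toy sentence and person list, made up for this example
toy_sentence = 'Yesterday Etta James sang .'
toy_persons = {'Etta James'}

toy_bigrams = to_bigrams(toy_sentence)
toy_labels = label_bigrams(toy_bigrams, toy_persons)
for bigram, label in zip(toy_bigrams, toy_labels):
    print(label, bigram)
```

Note that the bigrams overlapping a name on one side only ("Yesterday Etta", "James sang") are labeled O, because only the exact pair counts as an entity.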

In [None]:
#Bigram Representation
def getBigrams(sents):
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for b in bigrams:
    ...
print('Annotation\n', annotations,'\n')

#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities, you can use the morphology of the words that make up a bigram (e.g., count how many of them start with an uppercase character).
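
The sketch below illustrates the two ingredients on toy values. It assumes that the smoothing meant here is additive (Lidstone) smoothing, i.e. $(c + \lambda) / (N + \lambda K)$ for a count $c$, a total $N$, and $K$ possible outcomes; the exercise text does not spell this out. The helper names `count_upper` and `smoothed_prob` are hypothetical.

```python
def count_upper(bigram):
    """Number of words in the bigram whose first character is uppercase (0, 1 or 2)."""
    return sum(1 for w in bigram.split(' ') if w[:1].isupper())

def smoothed_prob(count, total, n_outcomes, lam=0.5):
    """Additive (Lidstone) estimate: (count + lam) / (total + lam * n_outcomes)."""
    return (count + lam) / (total + lam * n_outcomes)

# Morphology feature on made-up bigrams
print(count_upper('Etta James'))   # both words capitalized
print(count_upper('James sang'))   # one word capitalized
print(count_upper('sang .'))       # no word capitalized

# Toy counts: an event seen 3 times out of 10, with 2 possible outcomes
toy_estimate = smoothed_prob(3, 10, 2)
print(toy_estimate)
```

With this estimator, the smoothed probabilities of all outcomes of one conditional distribution still sum to 1, and events that were never observed receive a small non-zero probability.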

In [None]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}


#Prior
transition_prob['P(I|start)'] = ...
transition_prob['P(O|start)'] = ...

transition_prob['P(O|O)'] = ...
transition_prob['P(O|I)'] = ...
transition_prob['P(I|O)'] = ...
transition_prob['P(I|I)'] = ...

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

        
default_emission = ...

emission_prob['P(2_upper|O)'] = ...
emission_prob['P(2_upper|I)'] = ...
emission_prob['P(1_upper|O)'] = ...
emission_prob['P(1_upper|I)'] = ...
emission_prob['P(0_upper|O)'] = ...
emission_prob['P(0_upper|I)'] = ...

print('Emission Probabilities\n', emission_prob, '\n')

#### 3) Predict the labels of the test set and compute the precision and the recall of your model.
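
The prediction step can be read as a greedy left-to-right decoder: for each bigram, the score of each state is the transition probability from the previous state times the emission probability of the observed feature, and the higher-scoring state is kept. The sketch below shows this on hypothetical probabilities that are not derived from the training set. Note that this is greedy decoding, not the Viterbi algorithm, and the function name `greedy_decode` and the dictionary layout are assumptions.

```python
def greedy_decode(observations, trans, emit):
    """Pick the locally best state at every step, conditioning on the previous choice."""
    labels, prev = [], 'start'
    for obs in observations:
        # score(state) = P(state | prev) * P(obs | state)
        scores = {s: trans[(s, prev)] * emit[(obs, s)] for s in ('I', 'O')}
        prev = max(scores, key=scores.get)
        labels.append(prev)
    return labels

# Hypothetical probabilities, keyed as (state, previous_state) and (feature, state)
toy_trans = {('I', 'start'): 0.3, ('O', 'start'): 0.7,
             ('I', 'O'): 0.2, ('O', 'O'): 0.8,
             ('I', 'I'): 0.1, ('O', 'I'): 0.9}
toy_emit = {('2_upper', 'I'): 0.8, ('2_upper', 'O'): 0.1,
            ('1_upper', 'I'): 0.1, ('1_upper', 'O'): 0.4,
            ('0_upper', 'I'): 0.1, ('0_upper', 'O'): 0.5}

toy_features = ['1_upper', '2_upper', '1_upper', '0_upper']
toy_decoded = greedy_decode(toy_features, toy_trans, toy_emit)
print(toy_decoded)
```

A greedy decoder is simple, but it cannot revise an earlier decision; Viterbi decoding would instead choose the label sequence with the highest joint probability.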

In [None]:
#Prediction
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:
    I_prob = ...
    O_prob = ...
    
    if ...:
        prev_state = 'O'
    else:
        entities.append(b)
        prev_state = 'I'

print('Predicted Entities\n', entities, '\n')

Precision is *...%* while recall is *...%*. 

#### 4) Comment on how you can further improve this model.

...