## 📚 Exercise 13: Entity & Relation Extraction

### Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that utilize *Part-of-Speech Tagging*, *Named Entity Recognition*, and *Regular Expressions*.

#### Required Library: SpaCy
- ```conda install -y spacy```
- ```python -m spacy download en_core_web_sm```

In [1]:
import urllib.request, json, csv, re
import spacy
nlp = spacy.load('en_core_web_sm')

In [11]:
#read tsv with input movies
def read_tsv():
    movies=[]
    with open('movies.tsv','r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv) #remove header
        movies = [{'movie':line[0], 'director':line[1]} for line in tsv]
    return movies

#parse wikipedia page
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            txt = next (iter (data['query']['pages'].values()))['extract']
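    except Exception:
        # Network or lookup failure (e.g. no 'extract' key in the response): fall back to empty text
        pass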
    return txt
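
One way to avoid concatenating the query string by hand is to let the standard library encode it. The sketch below is an assumption about how the request URL used in `parse_wikipedia` could be built, not part of the original exercise: `build_extract_url` and `API_ENDPOINT` are hypothetical names, and the parameters are simply the ones already present in the URL above. It only constructs the URL and performs no network request.

```python
from urllib.parse import urlencode

# Hypothetical constant: the endpoint already used in parse_wikipedia
API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

def build_extract_url(title):
    """Return the API URL that requests the plain-text intro of one page."""
    params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',
        'explaintext': '',
        'titles': title,
    }
    # urlencode percent-encodes characters that are not URL-safe
    return API_ENDPOINT + '?' + urlencode(params)

url = build_extract_url('The_A-Team_(film)')
print(url)
```

Percent-encoding matters for titles that contain parentheses, spaces, or accented characters, which would otherwise be passed to `urlopen` unescaped.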

In [6]:
test = parse_wikipedia('14_Blades')
find_PER_entities(test)

['Daniel Lee',
 'Donnie Yen',
 'Zhao Wei',
 'Sammo Hung',
 'Wu Chun',
 'Kate Tsui',
 'Qi Yuwu']

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [12]:

def find_PER_entities(txt):

    persons = []
    doc = nlp(txt)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            persons.append(ent.text)
    
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

The call `re.sub('[!?,.]', '', txt)` uses `re.sub()` from Python's `re` (regular expression) module to strip punctuation from the string `txt`:

- The first argument, `'[!?,.]'`, is a character class that matches any single `!`, `?`, `,` or `.` character. Inside square brackets, `?` and `.` are literal characters, not regex operators.
- The second argument, `''`, is the replacement. Because it is the empty string, every match is deleted.
- The third argument, `txt`, is the input string.

The result is a copy of `txt` without those four punctuation characters. `txt` itself is unchanged, because Python strings are immutable.
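
A quick way to see the effect of this substitution is to run it on a single sentence. The sentence below is a made-up sample, not text fetched from Wikipedia, and `sample` and `tokens` are illustrative names.

```python
import re

# Made-up sample sentence, for illustration only
sample = 'It was directed by Daniel Lee, and stars Donnie Yen.'

# Remove the four punctuation characters, then split on whitespace
tokens = re.sub('[!?,.]', '', sample).split()
print(tokens)
# ['It', 'was', 'directed', 'by', 'Daniel', 'Lee', 'and', 'stars', 'Donnie', 'Yen']
```

Note that multi-word names such as `Daniel Lee` end up as two separate tokens, which is why the lookup that follows has to compare tokens against the start of each person name.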

In [20]:
#simple heuristic: find the next PER entity after the word 'directed'

def find_director(txt, persons):
    txt_list = re.sub('[!?,.]','',txt).split() # note: this also splits multi-word names into separate tokens
    for i in range(len(txt_list)):
        if txt_list[i] == 'directed':
            for j in range(i,len(txt_list)):
                for per in persons:
                    if per.startswith(txt_list[j]): # compare with startswith, because each entry in persons is a full name that contains spaces
                        return per
    return ''
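
As a stricter variant of the heuristic above, a regular expression can require that a recognised person directly follows the phrase `directed by`. This is only a sketch under stated assumptions: `find_director_regex` is a hypothetical helper, the sample sentence is made up, and the pattern will miss phrasings such as "directed and produced by".

```python
import re

def find_director_regex(txt, persons):
    """Return the first PER entity that immediately follows 'directed by', else ''."""
    for per in persons:
        # re.escape keeps characters in the name from being read as regex syntax
        if re.search(r'directed\s+by\s+' + re.escape(per), txt):
            return per
    return ''

# Made-up sample text and entity list, for illustration only
sample_txt = 'Jūsannin no Shikaku is a 2010 film directed by Takashi Miike.'
sample_persons = ['Jūsannin no Shikaku', 'Takashi Miike']
print(find_director_regex(sample_txt, sample_persons))  # Takashi Miike
```

Requiring adjacency trades recall for precision: fewer statements are extracted, but a title that was mislabelled as a person is less likely to be returned as the director.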

In [21]:
movies = read_tsv()
movies[:10]

[{'movie': '13_Assassins_(2010_film)', 'director': 'Takashi Miike'},
 {'movie': '14_Blades', 'director': 'Daniel Lee'},
 {'movie': '22_Bullets', 'director': 'Richard Berry'},
 {'movie': 'The_A-Team_(film)', 'director': 'Joe Carnahan'},
 {'movie': 'Alien_vs_Ninja', 'director': 'Seiji Chiba'},
 {'movie': 'Bad_Blood_(2010_film)', 'director': 'Dennis Law'},
 {'movie': 'Bangkok_Knockout', 'director': 'Panna Rittikrai'},
 {'movie': 'Blades_of_Blood', 'director': 'Lee Joon-ik'},
 {'movie': 'The_Book_of_Eli', 'director': 'Allen Hughes'},
 {'movie': 'The_Bounty_Hunter_(2010_film)', 'director': 'Andy Tennant'}]

In [22]:
statements = []
tp = 0
for m in movies:
    txt = parse_wikipedia(m['movie'])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)

    if director != '':
        statements.append(m['movie'] + ' is directed by ' + director + '.')
        if director == m['director']:
            tp += 1

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [16]:
for s in statements[:5]:
    print (s)
    

13_Assassins_(2010_film) is directed by Jūsannin no Shikaku.
22_Bullets is directed by L'Immortel.
Alien_vs_Ninja is directed by Seiji Chiba.
Blades_of_Blood is directed by Gureumeul Beoseonan Dalcheoreom.
The_Bounty_Hunter_(2010_film) is directed by Andy Tennant.


In [23]:
# compute precision and recall

precision = tp / len(statements)
recall = tp / len(movies)
print ('Precision:',precision)
print ('Recall:',recall)
print('\n***Sample Statements***')
print(statements[:5])

Precision: 0.7763713080168776
Recall: 0.6411149825783972

***Sample Statements***
['13_Assassins_(2010_film) is directed by Takashi Miike.', '14_Blades is directed by Daniel Lee.', '22_Bullets is directed by Richard Berry.', 'Alien_vs_Ninja is directed by Seiji Chiba.', 'Bad_Blood_(2010_film) is directed by Dennis Law.']
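
The two ratios used above can be wrapped in a small helper that also guards against division by zero when nothing was extracted. This is a sketch with hypothetical names (`precision_recall` and its parameters) and made-up counts. It does not reproduce the numbers printed above.

```python
def precision_recall(true_positives, num_extracted, num_gold):
    """Precision = TP / extracted statements, recall = TP / gold-standard items."""
    precision = true_positives / num_extracted if num_extracted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    return precision, recall

# Made-up counts: 4 statements extracted, 3 of them correct, 6 movies in total
p, r = precision_recall(true_positives=3, num_extracted=4, num_gold=6)
print(p, r)  # 0.75 0.5
```

In the notebook's terms, `num_extracted` corresponds to `len(statements)` and `num_gold` to `len(movies)`.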


### Task 2: Named Entity Recognition using a Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the following sentences as the training and test sets:

In [25]:
# Each sentence is a separate list element (the comma after the second sentence is required)
training_set = ['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .',
                'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .',
                'None of them lived in Chicago .']

test_set = ['Ray Charles was born in 1930 .',
            'Bobby Bland was born the same year as Ray Charles .',
            'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non-PER entities).

*Hint*: Represent the sentences as sequences of bigrams, and label each bigram.
Only bigrams that contain pairs of the form (*firstname lastname*) are considered as *PER* entities.

In [26]:
#Bigram Representation
def getBigrams(sents):
    # pair every token with its successor, sentence by sentence
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []
for b in bigrams:
    if b in PER:
        annotations.append([b,'I'])
    else:
        annotations.append([b,'O'])
print('Annotation\n', annotations,'\n')

Annotation
 [['The best', 'O'], ['best blues', 'O'], ['blues singer', 'O'], ['singer was', 'O'], ['was Bobby', 'O'], ['Bobby Bland', 'I'], ['Bland while', 'O'], ['while Ray', 'O'], ['Ray Charles', 'I'], ['Charles pioneered', 'O'], ['pioneered soul', 'O'], ['soul music', 'O'], ['music .', 'O'], ['Bobby Bland', 'I'], ['Bland was', 'O'], ['was just', 'O'], ['just a', 'O'], ['a singer', 'O'], ['singer whereas', 'O'], ['whereas Ray', 'O'], ['Ray Charles', 'I'], ['Charles was', 'O'], ['was a', 'O'], ['a pianist', 'O'], ['pianist ,', 'O'], [', songwriter', 'O'], ['songwriter and', 'O'], ['and singer', 'O'], ['singer .', 'O'], ['None of', 'O'], ['of them', 'O'], ['them lived', 'O'], ['lived in', 'O'], ['in Chicago', 'O'], ['Chicago .', 'O']] 



#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities you can utilize the morphology of the words that constitute a bigram (e.g., you can count their uppercase first characters).

In [None]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}


#Prior
transition_prob['P(I|start)'] = ...
transition_prob['P(O|start)'] = ...

transition_prob['P(O|O)'] = ...
transition_prob['P(O|I)'] = ...
transition_prob['P(I|O)'] = ...
transition_prob['P(I|I)'] = ...

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}

        
default_emission = ...

emission_prob['P(2_upper|O)'] = ...
emission_prob['P(2_upper|I)'] = ...
emission_prob['P(1_upper|O)'] = ...
emission_prob['P(1_upper|I)'] = ...
emission_prob['P(0_upper|O)'] = ...
emission_prob['P(0_upper|I)'] = ...

print('Emission Probabilities\n', emission_prob, '\n')
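
One possible way to estimate such probabilities is additive (Lidstone) smoothing, $P(x \mid y) = \frac{\mathrm{count}(x, y) + \lambda}{\mathrm{count}(y) + \lambda K}$, where $K$ is the number of possible outcomes. The sketch below illustrates the idea on a tiny made-up annotation. It is not the exercise's reference solution: `count_upper`, `smoothed`, `toy`, `transition` and `emission` are hypothetical names, and the start prior is left out for brevity.

```python
from collections import Counter

def count_upper(bigram):
    """Number of words in the bigram that start with an uppercase letter (0, 1 or 2)."""
    return sum(word[:1].isupper() for word in bigram.split(' '))

def smoothed(count, total, num_outcomes, lam=0.5):
    """Additive smoothing: (count + lam) / (total + lam * num_outcomes)."""
    return (count + lam) / (total + lam * num_outcomes)

# Made-up toy annotation, for illustration only
toy = [['The best', 'O'], ['best singer', 'O'], ['singer was', 'O'],
       ['was Bobby', 'O'], ['Bobby Bland', 'I'], ['Bland .', 'O']]
states = ['I', 'O']
features = [0, 1, 2]  # possible values of count_upper

labels = [label for _, label in toy]
trans_counts = Counter(zip(labels[:-1], labels[1:]))  # (previous, current) pairs
from_counts = Counter(labels[:-1])                    # how often each state is left
emit_counts = Counter((count_upper(b), label) for b, label in toy)
state_counts = Counter(labels)

# transition[(prev, cur)] approximates P(cur | prev)
transition = {(prev, cur): smoothed(trans_counts[(prev, cur)], from_counts[prev], len(states))
              for prev in states for cur in states}
# emission[(f, s)] approximates P(f uppercase initials | s)
emission = {(f, s): smoothed(emit_counts[(f, s)], state_counts[s], len(features))
            for f in features for s in states}

print(transition)
print(emission)
```

With smoothing, events that never occur in the toy data, such as an `I` bigram followed by another `I` bigram, still receive a small non-zero probability, and each conditional distribution still sums to 1.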

#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [None]:
#Prediction
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:
    I_prob = ...
    O_prob = ...
    
    if ...:
        prev_state = 'O'
    else:
        entities.append(b)
        prev_state = 'I'

print('Predicted Entities\n', entities, '\n')
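
The loop above labels one bigram at a time, keeping only the previously chosen state. The sketch below shows that greedy left-to-right scheme end to end. It is not full Viterbi decoding, and the probability tables are made-up values chosen for illustration rather than the values the exercise asks you to compute. `greedy_decode`, `count_upper`, `transition` and `emission` are hypothetical names.

```python
def count_upper(bigram):
    """Number of words in the bigram that start with an uppercase letter (0, 1 or 2)."""
    return sum(word[:1].isupper() for word in bigram.split(' '))

# Made-up probabilities, for illustration only: transition[(prev, cur)], emission[(feature, state)]
transition = {('start', 'I'): 0.2, ('start', 'O'): 0.8,
              ('O', 'I'): 0.3, ('O', 'O'): 0.7,
              ('I', 'I'): 0.25, ('I', 'O'): 0.75}
emission = {(2, 'I'): 0.6, (1, 'I'): 0.2, (0, 'I'): 0.2,
            (2, 'O'): 0.1, (1, 'O'): 0.5, (0, 'O'): 0.4}

def greedy_decode(bigrams):
    """Pick the locally best state for each bigram, given the previously chosen state."""
    entities = []
    prev_state = 'start'
    for b in bigrams:
        f = count_upper(b)
        score_i = transition[(prev_state, 'I')] * emission[(f, 'I')]
        score_o = transition[(prev_state, 'O')] * emission[(f, 'O')]
        prev_state = 'I' if score_i > score_o else 'O'
        if prev_state == 'I':
            entities.append(b)
    return entities

predicted = greedy_decode(['Ray Charles', 'Charles was', 'was born',
                           'born in', 'in 1930', '1930 .'])
print(predicted)  # ['Ray Charles']
```

Greedy decoding commits to each label immediately. Viterbi decoding would instead keep the best score for both states at every position and choose the best complete label sequence at the end.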

Precision is *...%* while recall is *...%*. 

#### 4) Comment on how you can further improve this model.

...