## 📚 Exercise 13: Entity & Relation Extraction

### Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern-based heuristics that use *Part-of-Speech Tagging*, *Named Entity Recognition*, and *Regular Expressions*.

#### Required Library: spaCy
- ```conda install -y spacy```
- ```python -m spacy download en_core_web_sm``` (spaCy 3 and later; on spaCy 2, use ```python -m spacy download en```)

In [1]:
import urllib.request, json, csv, re
import spacy

if int(spacy.__version__.split(".")[0]) > 2:
    nlp = spacy.load("en_core_web_sm")
else:
    nlp = spacy.load("en")

In [2]:
# read tsv with input movies
def read_tsv():
    movies = []
    with open("movies.tsv", "r") as file:
        tsv = csv.reader(file, delimiter="\t")
        next(tsv)  # remove header
        movies = [{"movie": line[0], "director": line[1]} for line in tsv]
    return movies


# parse wikipedia page
def parse_wikipedia(movie):
    txt = ""
    try:
        with urllib.request.urlopen(
            "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles="
            + movie
        ) as url:
            data = json.loads(url.read().decode())
            txt = next(iter(data["query"]["pages"].values()))["extract"]
    except Exception:
        # network errors or pages without an extract yield an empty text
        pass
    return txt
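
Movie titles such as `Bad_Blood_(2010_film)` contain characters that are safer to percent-encode before they are appended to a URL. The following is a minimal sketch, assuming the same API parameters as in `parse_wikipedia`; the helper name `build_extract_url` and the constant `API_ENDPOINT` are hypothetical and not part of the exercise code.

```python
from urllib.parse import urlencode

# Hypothetical constant; the endpoint itself is the one used in parse_wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build the query URL for the intro extract of a page (hypothetical helper)."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",
        "explaintext": "",
        "titles": title,  # urlencode percent-encodes reserved characters
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_url("Bad_Blood_(2010_film)")
print(url)
```

The resulting string could be passed to `urllib.request.urlopen` in place of the manual string concatenation.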

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [3]:
def find_PER_entities(txt):
    txt = nlp(txt)

    persons = []
    for e in txt.ents:
        if e.label_ == "PERSON":
            persons.append(e.text)
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [4]:
# simple heuristic: find the next PER entity after the word 'directed'
def find_director(txt, persons):
    txt = re.sub("[!?,.]", "", txt).split()
    for p1 in range(0, len(txt)):
        if txt[p1] == "directed":
            for p2 in range(p1, len(txt)):
                for per in persons:
                    if per.startswith(txt[p2]):
                        return per
    return ""
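
To see how the heuristic behaves, it can be run on a single sentence. The sketch below repeats the function so that it is self-contained; the sentence and the entity list are illustrative inputs, not taken from `movies.tsv` or from spaCy output.

```python
import re


def find_director(txt, persons):
    # Same heuristic as above: return the first PER entity whose text starts
    # with a token that appears at or after the word "directed".
    tokens = re.sub("[!?,.]", "", txt).split()
    for p1 in range(len(tokens)):
        if tokens[p1] == "directed":
            for p2 in range(p1, len(tokens)):
                for per in persons:
                    if per.startswith(tokens[p2]):
                        return per
    return ""


# Illustrative inputs (assumed, not produced by the notebook).
sentence = "Inception is a 2010 film directed by Christopher Nolan and starring Leonardo DiCaprio."
persons = ["Leonardo DiCaprio", "Christopher Nolan"]

print(find_director(sentence, persons))  # Christopher Nolan
print(repr(find_director("A film starring Leonardo DiCaprio.", persons)))  # ''
```

Because the trigger is an exact match on the token `directed`, sentences that only use forms such as "Directed" or "director" return an empty string.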

In [5]:
movies = read_tsv()

fp = 0
statements = []
for m in movies:
    txt = parse_wikipedia(m["movie"])
    persons = find_PER_entities(txt)
    director = find_director(txt, persons)

    if director != "":
        statements.append(m["movie"] + " is directed by " + director + ".")
        if director != m["director"]:
            fp += 1

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [6]:
# compute precision and recall
fn = len(movies) - len(statements)
tp = len(statements) - fp
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("Precision: {:.0%}".format(precision))
print("Recall: {:.0%}".format(recall))

print()
print("***Sample Statements***")
for s in statements[:5]:
    print(s)

Precision: 77%
Recall: 78%

***Sample Statements***
13_Assassins_(2010_film) is directed by Takashi Miike.
14_Blades is directed by Daniel Lee.
22_Bullets is directed by Richard Berry.
Alien_vs_Ninja is directed by Seiji Chiba.
Bad_Blood_(2010_film) is directed by Dennis Law.
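
The two measures reduce to simple ratios of the counts `tp`, `fp` and `fn`. Note that in the cell above `fn` only counts movies for which no statement was extracted, while a wrong director is counted as `fp`. The following sketch uses hypothetical counts purely for illustration and adds a guard against empty denominators; the helper name `precision_recall` is an assumption.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with an empty denominator is 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, chosen only for illustration.
p, r = precision_recall(tp=6, fp=2, fn=2)
print("Precision: {:.0%}".format(p))  # Precision: 75%
print("Recall: {:.0%}".format(r))  # Recall: 75%
```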


### Task 2: Named Entity Recognition using a Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
In particular, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Use the following sentences as the training and test sets:

In [7]:
training_set = [
    "The best blues singer was Bobby Bland while Ray Charles pioneered soul music .",
    "Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .",
    "None of them lived in Chicago .",
]

test_set = [
    "Ray Charles was born in 1930 .",
    "Bobby Bland was born the same year as Ray Charles .",
    "Muddy Waters is the father of Chicago Blues .",
]

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non-PER entities).

*Hint*: Represent the sentences as sequences of bigrams, and label each bigram. Only bigrams that contain pairs of the form (*firstname lastname*) are considered *PER* entities.

In [8]:
# Bigram Representation
def getBigrams(sents):
    return [
        [b[0] + " " + b[1] for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
        for l in sents
    ]


bigrams = getBigrams(training_set)

# Annotation
PER = ["Bobby Bland", "Ray Charles"]
annotations = [
    [[b, "I" if b in PER else "O"] for b in sentence] for sentence in bigrams
]

print("Annotation")
for annotated_sentence in annotations:
    print(annotated_sentence)

Annotation
[['The best', 'O'], ['best blues', 'O'], ['blues singer', 'O'], ['singer was', 'O'], ['was Bobby', 'O'], ['Bobby Bland', 'I'], ['Bland while', 'O'], ['while Ray', 'O'], ['Ray Charles', 'I'], ['Charles pioneered', 'O'], ['pioneered soul', 'O'], ['soul music', 'O'], ['music .', 'O']]
[['Bobby Bland', 'I'], ['Bland was', 'O'], ['was just', 'O'], ['just a', 'O'], ['a singer', 'O'], ['singer whereas', 'O'], ['whereas Ray', 'O'], ['Ray Charles', 'I'], ['Charles was', 'O'], ['was a', 'O'], ['a pianist', 'O'], ['pianist ,', 'O'], [', songwriter', 'O'], ['songwriter and', 'O'], ['and singer', 'O'], ['singer .', 'O']]
[['None of', 'O'], ['of them', 'O'], ['them lived', 'O'], ['lived in', 'O'], ['in Chicago', 'O'], ['Chicago .', 'O']]


#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda = 0.5$).

*Hint*: For the emission probabilities you can use the morphology of the words that constitute a bigram (e.g., you can count their uppercase first characters).

In [9]:
lambda_ = 0.5

# Transition Probabilities
transition_prob = {}

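# I_count and O_count only count bigrams that have a successor, i.e. the last
# bigram of each sentence is excluded. Note that the same counts are reused
# below as denominators of the emission probabilities.
I_count = sum(
    a[1] == "I" for annotated_sentence in annotations for a in annotated_sentence[:-1]
)
O_count = sum(
    a[1] == "O" for annotated_sentence in annotations for a in annotated_sentence[:-1]
)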

# Initial probabilities
transition_prob["P(I|start)"] = float(
    sum(annotated_sentence[0][1] == "I" for annotated_sentence in annotations)
) / len(annotations)
transition_prob["P(O|start)"] = 1 - transition_prob["P(I|start)"]


O_after_O_count = 0
O_after_I_count = 0
I_after_O_count = 0
I_after_I_count = 0
for annotated_sentence in annotations:
    for i, (_, token) in enumerate(annotated_sentence[:-1]):
        next_token = annotated_sentence[i + 1][1]
        if token == "O":
            if next_token == "O":
                O_after_O_count += 1
            else:
                I_after_O_count += 1
        else:
            if next_token == "O":
                O_after_I_count += 1
            else:
                I_after_I_count += 1

transition_prob["P(O|O)"] = O_after_O_count / O_count
transition_prob["P(O|I)"] = O_after_I_count / I_count
transition_prob["P(I|O)"] = I_after_O_count / O_count
transition_prob["P(I|I)"] = I_after_I_count / I_count

print("Transition Probabilities", transition_prob, sep="\n")
print()

# Emission Probabilities
emission_prob = {}


def count_upper_first_char(bigram):
    count = 0
    if bigram.split(" ")[0][0].isupper():
        count += 1
    if bigram.split(" ")[1][0].isupper():
        count += 1
    return count


both_upper_count_O = 0
both_upper_count_I = 0
one_upper_count_O = 0
one_upper_count_I = 0
no_upper_count_O = 0
no_upper_count_I = 0
for a in sum(annotations, []):
    if count_upper_first_char(a[0]) == 2 and a[1] == "O":
        both_upper_count_O += 1
    elif count_upper_first_char(a[0]) == 2 and a[1] == "I":
        both_upper_count_I += 1
    elif count_upper_first_char(a[0]) == 1 and a[1] == "O":
        one_upper_count_O += 1
    elif count_upper_first_char(a[0]) == 1 and a[1] == "I":
        one_upper_count_I += 1
    elif count_upper_first_char(a[0]) == 0 and a[1] == "O":
        no_upper_count_O += 1
    elif count_upper_first_char(a[0]) == 0 and a[1] == "I":
        no_upper_count_I += 1


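# Smoothing: interpolate the relative frequency (weight lambda_) with a
# uniform fallback of 1 / (number of training bigrams) (weight 1 - lambda_)
default_emission = 1 / len(sum(bigrams, [])) * (1 - lambda_)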

emission_prob["P(2_upper|O)"] = (
    both_upper_count_O / O_count
) * lambda_ + default_emission
emission_prob["P(2_upper|I)"] = (
    both_upper_count_I / I_count
) * lambda_ + default_emission
emission_prob["P(1_upper|O)"] = (
    one_upper_count_O / O_count
) * lambda_ + default_emission
emission_prob["P(1_upper|I)"] = (
    one_upper_count_I / I_count
) * lambda_ + default_emission
emission_prob["P(0_upper|O)"] = (
    no_upper_count_O / O_count
) * lambda_ + default_emission
emission_prob["P(0_upper|I)"] = (
    no_upper_count_I / I_count
) * lambda_ + default_emission

print("Emission Probabilities", emission_prob, sep="\n")

Transition Probabilities
{'P(I|start)': 0.3333333333333333, 'P(O|start)': 0.6666666666666667, 'P(O|O)': 0.8928571428571429, 'P(O|I)': 1.0, 'P(I|O)': 0.10714285714285714, 'P(I|I)': 0.0}

Emission Probabilities
{'P(2_upper|O)': 0.014285714285714285, 'P(2_upper|I)': 0.5142857142857142, 'P(1_upper|O)': 0.21071428571428572, 'P(1_upper|I)': 0.014285714285714285, 'P(0_upper|O)': 0.37142857142857144, 'P(0_upper|I)': 0.014285714285714285}
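
The smoothed emission values can be checked by hand. The sketch below restates the interpolation used in the cell above as one function and plugs in counts read off the annotated training set (4 `I` bigrams that have a successor, all with two capitalized words, and 35 training bigrams in total); the function name `smoothed_emission` is an assumption made for this illustration.

```python
def smoothed_emission(count, state_total, n_bigrams, lam=0.5):
    # lam weights the relative frequency; (1 - lam) weights a uniform
    # fallback of 1 / n_bigrams.
    return lam * (count / state_total) + (1 - lam) * (1 / n_bigrams)


# Counts taken from the annotation output above.
p_2_upper_given_I = smoothed_emission(count=4, state_total=4, n_bigrams=35)
p_0_upper_given_I = smoothed_emission(count=0, state_total=4, n_bigrams=35)

print(round(p_2_upper_given_I, 4))  # 0.5143
print(round(p_0_upper_given_I, 4))  # 0.0143
```

Since the fallback is spread over 35 bigrams rather than over the three observation classes, the three smoothed values for a state do not sum to 1; the decoder only compares products for `I` and `O`, so it still yields a decision.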


#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [10]:
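# Prediction: greedy left-to-right decoding that picks the more likely label
# for each bigram given the label of the previous bigram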
bigrams = getBigrams(test_set)
entities = []
for sentence in bigrams:
    prev_state = "start"
    for b in sentence:
        I_prob = (
            transition_prob["P(I|" + prev_state + ")"]
            * emission_prob["P(" + str(count_upper_first_char(b)) + "_upper|I)"]
        )
        O_prob = (
            transition_prob["P(O|" + prev_state + ")"]
            * emission_prob["P(" + str(count_upper_first_char(b)) + "_upper|O)"]
        )

        if I_prob > O_prob:
            entities.append(b)
            prev_state = "I"
        else:
            prev_state = "O"

print("Predicted Entities\n", entities, "\n")

Predicted Entities
 ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues'] 



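Four of the five predicted entities are correct ("Chicago Blues" is a false positive) and all four *PER* bigrams in the test set are found, so precision is *80%* while recall is *100%*.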
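
The false positive can be traced to a single decoding step. The following sketch recomputes the two scores for the bigram "Chicago Blues", using probabilities copied from the printed output of the training step; it assumes, as the model predicts, that the preceding bigram "of Chicago" is labeled `O`.

```python
# Probabilities copied from the printed training output.
transition_prob = {"P(I|O)": 0.10714285714285714, "P(O|O)": 0.8928571428571429}
emission_prob = {
    "P(2_upper|I)": 0.5142857142857142,
    "P(2_upper|O)": 0.014285714285714285,
}

# "Chicago Blues" has two capitalized words and follows an O-labeled bigram.
I_prob = transition_prob["P(I|O)"] * emission_prob["P(2_upper|I)"]
O_prob = transition_prob["P(O|O)"] * emission_prob["P(2_upper|O)"]

print(round(I_prob, 4), round(O_prob, 4))  # 0.0551 0.0128
print("I" if I_prob > O_prob else "O")  # I
```

The capitalization feature outweighs the low `P(I|O)` transition, so any bigram of two capitalized words that does not directly follow an `I` bigram is labeled as a person.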

#### 4) Comment on how you can further improve this model.

We could increase precision by also computing the probabilities for unigrams and averaging them with the bigram probabilities in the prediction step.