# Assignment 2: Information Extraction

In [None]:
import nltk
import re

nltk.download('all')

## Task 1: Named Entity Annotation (10 Marks)

Using the IOB tagging scheme, annotate all of the named entities (PERson, LOCation, ORGanisation, TIME) in the following sentence:

*Wayne Rooney is a professional footballer from England who last played for Major League Soccer club D.C. United and will join Derby County in January 2020.*

Edit this cell and write your annotation below the line. (Note that you don't have to write code for this task; annotate the sentence manually.)

---

**B_PER**:Wayne 
**I_PER**:Rooney 
**O**:is 
**O**:a 
**O**:professional 
**O**:footballer 
**O**:from 
**B_LOC**:England 
**O**:who 
**O**:last 
**O**:played 
**O**:for 
**B_ORG**:Major 
**I_ORG**:League 
**I_ORG**:Soccer 
**O**:club 
**B_ORG**:D.C. 
**I_ORG**:United 
**O**:and 
**O**:will 
**O**:join 
**B_ORG**:Derby 
**I_ORG**:County 
**O**:in 
**B_TIME**:January 
**I_TIME**:2020 
**O**:.


---
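
As an aside, an IOB annotation like the one above can be decoded back into entity spans mechanically. The following is a minimal sketch, assuming the `B_`/`I_`/`O` underscore convention used in this cell; the function name `iob_to_entities` and the shortened token list are illustrative and not part of the assignment.

```python
def iob_to_entities(tagged_tokens):
    """Group (token, tag) pairs in IOB format into (entity_type, text) spans."""
    entities = []
    current_type = None
    current_tokens = []
    for token, tag in tagged_tokens:
        # "B_PER" -> ("B", "_", "PER"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("_")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Lenient choice: a stray I_ tag also starts a new entity.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Shortened version of the sentence above, for illustration only.
tagged = [
    ("Wayne", "B_PER"), ("Rooney", "I_PER"), ("will", "O"), ("join", "O"),
    ("Derby", "B_ORG"), ("County", "I_ORG"), ("in", "O"),
    ("January", "B_TIME"), ("2020", "I_TIME"), (".", "O"),
]
spans = iob_to_entities(tagged)
# [('PER', 'Wayne Rooney'), ('ORG', 'Derby County'), ('TIME', 'January 2020')]
```

The `B_` prefix is what lets two adjacent entities of the same type be kept apart, which a plain inside/outside labelling could not do.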

### For subsequent tasks in this assignment, you will work with the documents in `football_players.txt` to perform various information extraction tasks.

In [None]:
# Download the text file (uncomment the line below in this cell, if not already downloaded from Blackboard)
!curl "https://ideone.com/plain/OvwDXZ" > football_players.txt



Read all the documents from `football_players.txt` into a list called `docs`.

In [None]:
docs = []

with open('football_players.txt', encoding='utf-8') as f:
  docs = f.readlines()

## Task 2 (10 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

Please keep in mind that the expected output is a list of lists: each inner list holds the `(token, tag)` tuples for one sentence, as in the expected output shown below.


Hint: For this task you need to perform three steps:
1. Sentence Segmentation
1. Word Tokenization
1. Part-of-Speech Tagging

In [None]:
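def ie_preprocess(document):
  # 1. Split the document into sentences.
  sentences = nltk.sent_tokenize(document)
  # 2. Split each sentence into word tokens.
  tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
  # 3. Tag each tokenized sentence with part-of-speech labels.
  tagged_sentences = [nltk.pos_tag(tokens) for tokens in tokenized_sentences]

  return tagged_sentences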

Run the cell below to verify your result for the second sentence in the first document.
Expected output: 
`[('He', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('forward', 'NN'), ('and', 'CC'), ('serves', 'NNS'), ('as', 'IN'), ('captain', 'NN'), ('for', 'IN'), ('Portugal', 'NNP'), ('.', '.')]`

In [None]:
first_doc = docs[0]
tagged_sentences = ie_preprocess(first_doc)
tagged_sentences[1]

[('He', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('forward', 'NN'),
 ('and', 'CC'),
 ('serves', 'NNS'),
 ('as', 'IN'),
 ('captain', 'NN'),
 ('for', 'IN'),
 ('Portugal', 'NNP'),
 ('.', '.')]

## Task 3 (20 Marks)
Write a function that takes a list of tokens with POS tags for a sentence and returns a list of named entities (NE). 

Hint: Use `binary=True` when calling the NE chunk function (`nltk.ne_chunk`).

In [None]:
def find_named_entities(sent):
  named_entities = []

  # With binary=True, every entity chunk is labelled 'NE' regardless of its type.
  tree = nltk.ne_chunk(sent, binary=True)

  for subtree in tree.subtrees():
    if subtree.label() == 'NE':
      # Leaves are (token, tag) pairs; join the tokens into a single string.
      entity = " ".join(token for token, tag in subtree.leaves())
      named_entities.append(entity)

  return named_entities

Run the cell below to verify your result for the first sentence in the first document.
Expected output: `['Cristiano Ronaldo', 'Santos Aveiro', 'ComM', 'GOIH', 'Portuguese', 'Portuguese', 'Spanish', 'Real Madrid', 'Portugal']`

In [None]:
tagged_sentences = ie_preprocess(docs[0])
find_named_entities(tagged_sentences[0])

['Cristiano Ronaldo',
 'Santos Aveiro',
 'ComM',
 'GOIH',
 'Portuguese',
 'Portuguese',
 'Spanish',
 'Real Madrid',
 'Portugal']

## Task 4 (5 Marks)

Implement the `find_all_named_entities` function below to find **all** NEs in a given document.

Hint: Use `find_named_entities` implemented above for this task.

In [None]:
def find_all_named_entities(doc):
  named_entities = []

  tagged_sentences = ie_preprocess(doc)

  for sent in tagged_sentences:
    named_entities.extend(find_named_entities(sent))
  
  return named_entities   # return a flat list and not a list of lists

How many named entities did you find in the first document?

In [None]:
named_entities = find_all_named_entities(docs[0])
len(named_entities)

56

## Task 5 (5 Marks)

Find named entities across **all** documents in `football_players.txt`, and save the result into a single flat list.

In [None]:
all_named_entities = []

for doc in docs:
  named_entities = find_all_named_entities(doc)
  all_named_entities.extend(named_entities)

How many named entities did you find across all documents?

In [None]:
len(all_named_entities)

380
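
Because the result is a flat list of strings, frequency counts are easy to compute, which helps when picking the most frequently occurring entity. Below is a minimal sketch using `collections.Counter`; the `sample_entities` list is a made-up stand-in for `all_named_entities`, not the actual output of the cell above.

```python
from collections import Counter

# Hypothetical flat list of entity strings, shaped like `all_named_entities`.
sample_entities = [
    "Portugal", "Real Madrid", "Portugal",
    "Cristiano Ronaldo", "Real Madrid", "Portugal",
]

entity_counts = Counter(sample_entities)

# The two most frequent entities, as (entity, count) pairs.
top_two = entity_counts.most_common(2)
# [('Portugal', 3), ('Real Madrid', 2)]
```

A list of lists would need flattening first, which is why the tasks above ask for a single flat list.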

## Task 6 (40 Marks)

Write functions to extract the name of the player, country of origin and date of birth as well as the following relations: team(s) of the player and position(s) of the player.

Hint: Use the `re.compile()` function to create the extraction patterns.

Reference: https://docs.python.org/3/howto/regex.html

---

1. **`name_of_the_player`** <br>
    - **Named Entity Recognition**: Extract the first named entity or the most frequently occurring named entity with the label `PERSON`. An issue highlighted by many students here is that some names are matched only partially with this approach. This is because the `NER` module in `nltk` is trained on English data, whereas the football corpus contains some Spanish and Portuguese names, so some names can be missed.
    - **Noun Phrase Extraction**: A similar approach is to parse the documents using grammar rules and extract sequences of `NNP` (proper noun) tokens using the POS tags.
    - **Pattern matching using Regex**: The name of the player is the first thing mentioned in each document, so matching it with a regular expression is the easiest way to retrieve the name of the player.
1. **`country_of_origin`** <br>
    - **Named Entity Recognition**: Extract the first named entity or the most frequently occurring named entity with the label `GPE`.
    - **Relation Extraction**: <br>
        - Each document contains the information about the *national team* of the player and this relation can be used here.
        - Another useful relation is "*`X` is a `Y` professional footballer*", where `X` is the name of the player and `Y` is the nationality of the player. However, this finds the nationality rather than the name of the country, and we have penalised solutions that do not return the name of the country. Some solutions fix this by using a custom mapping, WordNet synsets, or edit distance to convert from nationality to country name.
        - An assumption here, correctly highlighted by a few students, is that nationality or national team is only a proxy, because the country of origin can be different.
1. **`date_of_birth`** <br>
    - **Relation Extraction**: Extract the relation "*`X` born `Y`*" where `X` is the name of the player and `Y` is the date of birth. This can be implemented using a regular expression where the pattern for the date looks something like this: `(\d+ \w+ \d{4})`.
    - Instead of using regex, some solutions use grammar rules like `<CD><NNP><CD>` to capture the pattern for the date which is also a valid solution here.
    - Some solutions were penalised here as they only searched for the date pattern without specifying the `born` relation and returned multiple dates.
1. **`team_of_the_player`**<br>
    - **Relation Extraction**: Extract the relation "*`X` plays for `Y`*" or the relation "*`X` played for `Y`*" where `X` is the player and `Y` contains the names of teams. This requires some post-processing to correctly identify the names of the teams, which can be done with named entity recognition by looking for `ORG` and `GPE` labels or by extracting terms and phrases using grammar rules. This is sufficient for the most part and can be improved further by extracting similar relations like "*`X` signed for `Y`*", "*`X` transferred to `Y`*" or "*`X` joined `Y`*".
    - Other solutions extract this relation using a combination of `club`, `FC` and `national team` patterns which works fine too.
    - Some solutions searched the names of teams inside the documents using a pre-defined list of teams. This is a somewhat naive solution since it can be difficult to know ahead of time the names of all the teams as the information in the documents spans a large variety of countries and soccer leagues.
    - For almost all the players, the documents contain detailed information about the multiple football clubs they have played for, as well as information about the national team. Solutions which omit either the national team or the club were therefore penalised here.
1. **`position_of_the_player`**<br>
    - **Relation Extraction**: Extract the relation "*`X` plays as a `Y`*" or the relation "*`X` is a `Y`*" where `X` is the player and `Y` is the position. 
    - Similar to country and teams above, term extraction or noun phrase extraction can also be used here to identify the names of the positions.
    - Along with the relations mentioned above, some solutions searched for positions inside the documents using a pre-defined list of football positions. Unlike teams, this is acceptable here since there is only a small number of positions on a football field which have consistent and fixed names.
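
Three of the approaches described above can be sketched with the `re` module alone: the "*`X` born `Y`*" relation, the nationality relation followed by a custom mapping to a country name, and a position lookup restricted to a pre-defined list. This is a minimal sketch under stated assumptions rather than a reference solution. The sample text is made up for illustration (it is not taken from `football_players.txt`), and the mapping and position list are deliberately small and incomplete.

```python
import re

# Hypothetical sample text, written for illustration only.
SAMPLE = (
    "John Smith (born 5 February 1985) is an English professional "
    "footballer who plays as a forward for Example FC and the England "
    "national team. He signed his first contract on 1 July 2003."
)

# Date pattern from the feedback above, anchored on "born" so that
# other dates in the document are ignored.
BORN_RE = re.compile(r"\bborn\s+(\d+ \w+ \d{4})")

# "X is a/an Y professional footballer": Y is the nationality adjective.
NATIONALITY_RE = re.compile(r"\bis an? (\w+) professional footballer")

# Illustrative, incomplete custom mapping from nationality to country.
NATIONALITY_TO_COUNTRY = {
    "English": "England",
    "Portuguese": "Portugal",
    "Spanish": "Spain",
}

# Small closed set of positions (an assumed list, not an exhaustive one).
POSITIONS = ("goalkeeper", "defender", "midfielder", "forward", "striker", "winger")
POSITION_RE = re.compile(r"\b(?:plays as|is) an? (" + "|".join(POSITIONS) + r")\b")


def date_of_birth(doc):
    """Return the date following 'born', or None if the relation is absent."""
    match = BORN_RE.search(doc)
    return match.group(1) if match else None


def country_of_origin(doc):
    """Return a country name via the nationality relation, or None."""
    match = NATIONALITY_RE.search(doc)
    if match is None:
        return None
    # Unknown nationalities yield None instead of a guessed country.
    return NATIONALITY_TO_COUNTRY.get(match.group(1))


def position_of_the_player(doc):
    """Return all positions found through the 'plays as a' / 'is a' relations."""
    return [m.group(1) for m in POSITION_RE.finditer(doc)]


dob = date_of_birth(SAMPLE)                # '5 February 1985'
country = country_of_origin(SAMPLE)        # 'England'
positions = position_of_the_player(SAMPLE) # ['forward']
```

Anchoring the date pattern on `born` is what keeps the second date in the sample (`1 July 2003`) out of the result, which is the issue that cost marks in solutions searching for the date pattern alone.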

