# Assignment 3 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **December 31, 2021**. Please note that there will be no further extensions to this deadline and we highly encourage you to submit this assignment before Semester 1 exams.
- This is an individual assignment, you **must not** work with other students to complete this assessment.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

|           | Task | Marks |
| :---      | :-----| -----:|
| Task 1    | Pre-processing |   15 |
| Task 2    | Named Entity Recognition |    10 |
| Task 3    | Information / Relation Extraction (I) | 30 |
| Task 4    | Information / Relation Extraction (II) | 15 |
| Task 5    | Combining information in the output   | 5 |
| Task 6    | Evaluation (I) | 15 |
| Task 7    | Evaluation (II) | 10 |



---

## Information Extraction and Relation Extraction

In the following tasks you will write code to perform **_information extraction_** and **_relation extraction_** across a collection of documents in `movies.zip`.

The zip archive contains 100 files, out of which 50 are plaintext documents and other 50 contain data structured as JSON.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains *gold-standard* labels (also called *reference* labels) stored as key-value pairs for the entities and relations for each document.

You are only allowed to use the given documents and labels and **must not** use any other external sources of data for this assignment.

---

Download and unarchive `movies.zip` from Blackboard and place it in the same location as this notebook or uncomment the code cell below to get the data in a directory called `movies` and also place it automatically in the same location as this notebook.

In [2]:
!if test -f "movies.zip"; then rm "movies.zip"; fi
!if test -d "movies/"; then rm -rf "movies/"; fi
!wget "https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB" -O "movies.zip"
!unzip "movies.zip"

-f was unexpected at this time.
-d was unexpected at this time.
'wget' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [1]:
######### DO NOT EDIT THIS CELL #########

import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt') , encoding='utf8') as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json'), encoding='utf8') as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

---

In [7]:
# Load the libraries which might be useful

import re
import nltk
import itertools
from nltk.tokenize import word_tokenize
nltk.download('all', quiet=True)

True

---

## Task 1: Document Pre-processing (15 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

The expected output is a list of tagged sentences where each tagged sentence is a list containing `(token, tag)` pairs.


In [8]:
def ie_preprocess(document):
    '''
    Return a list of sentences tagged with part-of-speech tags for the given document.
    '''
    tagged_sentences = []

  # your code goes here
  # ...
    sentences = nltk.sent_tokenize(document)
    for sentence in sentences:
        tagged_sentences.append(nltk.pos_tag(word_tokenize(sentence)))

    return tagged_sentences

Run the cell below to check if the output is formatted correctly.

Expected output: `[('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'), ('Oscar', 'NNP'), ('nominations', 'NNS'), ('(', '('), ('including', 'VBG'), ('Best', 'NNP'), ('Picture', 'NN'), (')', ')'), (',', ','), ('winning', 'VBG'), ('seven', 'CD'), ('.', '.')]`

In [9]:
# check output for Task 1
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

## Task 2: Named Entity Recognition (10 Marks)

Write a function that returns a list of all the named entities in a given document. The document here is structured as a list of sentences and tagged with part-of-speech tags.

Hint: Set `binary = True` while calling the `ne_chunk` function.

In [10]:
def find_named_entities(tagged_document):
    '''Return a list of all the named entities in the given tagged document.'''
  
    named_entities = []

    # your code goes here
    # ...
    tree = nltk.ne_chunk(tagged_document[0], binary=True)
    
    for subtree in tree.subtrees():
        entity = ""
        for leaf in subtree.leaves():
            entity = entity+leaf[0]+" "
        named_entities.append(entity.strip())
    
    return named_entities

Run the cell below to check if the output is formatted correctly.

The output values might not match exactly, but should look similar to: `['Star Wars', 'Star Wars', 'New Hope', 'American', 'George Lucas', 'Lucasfilm', ...]`

In [12]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document
find_named_entities(tagged_document)[1:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox']

## Task 3: Information / Relation Extraction (I) (30 Marks)

Choose any **three** relations out of the following and write functions to extract them from a given document.

* **Title**
* **Language**
* **Starring**
* **Release date**
* **Cinematography**
* **Dialogue by**
* **Directed by**
* **Edited by**
* **Music by**
* **Narrated by**
* **Produced by**
* **Screenplay by**
* **Story by**
* **Written by**
* **Production companies**
* **Distribution companies**
* **Budget**
* **Box office**


The functions you define here must take as input a string called `document` and return the information/relation extracted as a list. You can explain your approach with comments along with your code.


In [193]:
# relation 1 - your code goes here
def extract_budget(document):
    dollar_refs = (re.findall(r'[budget of]\s\$\w*\s[b,m]*\w*|\$\w*\s[b]\w*|\$\w*\s[m]\w*', document))
    # Typically these documents have a budget and a box office gross income and hopefully the gross is greater than the budget, so return the lowest million/billion dollar entry.
    return[min(dollar_refs)]


In [194]:
# relation 2 - your code goes here
def extract_releaseDate(document):
    # Extract text in format dd/mm/yyyy after keyword 'on' (released on ... ) or any year in format YYYY following keyword 'a' (this is a 2018 movie...) 
    date_refs = re.findall(r'[on]*\s*[0-9]{1,2}\s[a-z,A-Z]*\s[0-9]{2,4}|([a]\s)([0-9]{4})', document)[0][1]
    return [date_refs]

In [195]:
# relation 3 - your code goes here
def extract_title(document):
    try:
        # Extract the title by taking text before 'is a' shows up or before the first ( as is the case with a few Star Ward movies
        title_ref = re.split(' \(| is a',re.match(r"([a-z,A-Z, 0-9,\:, \, \-']*)\sis a|([a-z,A-Z,0-9, \: \( ]*)\s\(*\w*", document)[0])[0]
    except:
        title_ref=None
    return [title_ref]

In [216]:
def extract_director(document):
    try:
        # Extract the text following 'produced by', can be a single studio or a FirstName Surname producer
        director_refs = re.findall(r'([dD]irected\sby\s)([A-Z]\w*) | ([dD]irected\sby\s)([A-Z\.]\w*\s[A-Z\.]\w*)| ([dD]irected\sby\s)([A-Z]\.\w*\s[A-Z]\.*\w*\s[A-Z\.]\w*)', document)[0]
    except:
        director_refs='None'
    #pdb.set_trace()
    if len(director_refs[1])>1:
        return [director_refs[1]]
    elif len(director_refs[3])>1:
        return [director_refs[3]]
    else:
        return [director_refs]

---

## Task 4: Information / Relation Extraction (II)  (15 Marks)

Identify one other relation of your choice, besides the ones mentioned in the previous task, and write a function to extract it. 

The function you define here must take as input a string called `document` and return the information/relations extracted as a list.

In [217]:
# your code goes here
def extract_producer(document):
    try:
        # Extract the text following 'produced by', can be a single studio or a FirstName Surname producer
        producer_refs = re.findall(r'([pP]roduced\sby\s)([A-Z]\w*) | ([pP]roduced\sby\s)([A-Z]\w*\s[A-Z]\w*)', document)[0]
    except:
        producer_refs='None'
    #pdb.set_trace()
    if len(producer_refs[1])>1:
        return [producer_refs[1]]
    elif len(producer_refs[3])>1:
        return [producer_refs[3]]
    else:
        return [producer_refs]

In [218]:
for i in documents:
    extract_producer(i)
    

---

## Task 5: Combining information in the output (5 Marks)

Edit the function below to return a Python dictionary with the outputs from the functions defined in tasks $3 - 4$.

In [219]:
def extract_info(document):
  '''Extract information and relations from a given document.'''

  # Edit the output dict below and assign the values to keys by 
  # calling the appropriate functions from Tasks 3 and 4.
  
  # You can delete the keys for which you do not perform extraction in Task 3.

  output = {
    ##### EDIT BELOW THIS LINE #####
    
    # For the relations you extract in Task 3, 
    # save the output in the appropriate key and delete rest of the keys.
    
    "Title": extract_title(document),
    "Release date": extract_releaseDate(document),
    "Budget": extract_budget(document),
    "Director":extract_director(document),

    # save the output from Task 4 here
    "Task 4": extract_producer(document),

    ##### EDIT ABOVE THIS LINE #####
  }

  return output


# check output for the first document
extract_info(documents[0])

{'Title': ['Star Wars'],
 'Release date': ['1977'],
 'Budget': ['$550 million'],
 'Director': ['George Lucas'],
 'Task 4': ['Lucasfilm']}

The output from the cell above should look something like the dictionary shown below. Overall values might be different, based on what four items you choose to extract in Tasks 3 and 4, but the structure should be similar.

For example, if you choose to extract **Starring**, **Release Date**, **Box office**, and **Directed by**, then the output should look something like this for the first document:

```javascript
{
  'Box office': ['$775 million'],
  'Directed by': ['George Lucas'],
  'Release date': ['May 25, 1977'],
  'Starring': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 
               'Peter Cushing', 'David Prowse', 'James Earl Jones', ],
}
```

---

## Task 6: Evaluation (I) (15 Marks)

Write a function to evaluate the performance of Task $3$ using **Precision**, **Recall** and **F1** scores. Use the gold-standard labels provided in the JSON files to calculate these values.

Please note that not all the information / relations mentioned in Task $3$ have associated labels for each and every movie in the JSON documents, i.e., some JSON documents will have certain keys-value pairs missing. For example, we have labels for *Budget* in 46 out of the 50 movies and in the remaining 4 documents, you will find that the key `Budget` is omitted from the JSON.
 
Also keep in mind that we will further run this evaluation on a hidden test set containing similar movie descriptions.

In [None]:
def evaluate(labels, predictions):
  '''
  Evaluate the performance of relation extraction 
  using Precision, Recall, and F1 scores.

  Args:
    labels: A list containing gold-standard labels
    predictions: A list containing information extracted from documents
  Returns:
    scores: A dictionary containing Precision, Recall and F1 scores 
            for the information/relations extracted in Task 3.
  '''

  assert len(predictions) == len(labels)

  scores = {
      'precision': 0.0, 'recall': 0.0, 'f1': 0.0
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.

  # your code goes here
  # ...



  return scores

---
Run the cell below to calculate and display the evaluation scores for the 50 documents in `movies.zip`.

You can consider the following as a baseline score. Your aim should be to score higher or atleast get as close as possible to these values.

| Precision | Recall | F1    |
| :---:     | :---:  | :---: |
| 0.5       | 0.25   | 0.333 |

In [None]:
# !pip install pandas
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

---

## Task 7: Evaluation (II) (10 Marks)

Describe **two** challenges you encountered above or might encounter in the evaluation of *information extraction* or *relation extraction* tasks.


Edit this cell to write your answer below the line in no more than 100 words. No coding is required for this task.

---

> Delete this line and write your answer here.