# Assignment 3 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **December 31, 2021**. Please note that there will be no further extensions to this deadline and we highly encourage you to submit this assignment before Semester 1 exams.
- This is an individual assignment; you **must not** work with other students to complete this assessment.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

|           | Task | Marks |
| :---      | :-----| -----:|
| Task 1    | Pre-processing |   15 |
| Task 2    | Named Entity Recognition |    10 |
| Task 3    | Information / Relation Extraction (I) | 30 |
| Task 4    | Information / Relation Extraction (II) | 15 |
| Task 5    | Combining information in the output   | 5 |
| Task 6    | Evaluation (I) | 15 |
| Task 7    | Evaluation (II) | 10 |



---

## Information Extraction and Relation Extraction

In the following tasks you will write code to perform **_information extraction_** and **_relation extraction_** across a collection of documents in `movies.zip`.

The zip archive contains 100 files: 50 are plaintext documents and the other 50 contain data structured as JSON.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains *gold-standard* labels (also called *reference* labels) stored as key-value pairs for the entities and relations for each document.

You are only allowed to use the given documents and labels and **must not** use any other external sources of data for this assignment.

---

Download `movies.zip` from Blackboard and extract it in the same location as this notebook. Alternatively, uncomment the code cell below to download the archive and extract it automatically into a directory called `movies` next to this notebook.

In [4]:
# !if test -f "movies.zip"; then rm "movies.zip"; fi
# !if test -d "movies/"; then rm -rf "movies/"; fi
# !wget "https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB" -O "movies.zip"
# !unzip "movies.zip"
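
Where shell commands are unavailable (for example on Windows), the archive can also be unpacked from Python. The following is a minimal sketch using the standard-library `zipfile` module. The helper name `unpack_archive` is hypothetical, and the sketch assumes that `movies.zip` has already been downloaded next to the notebook and that the archive itself contains the `movies` directory.

```python
import zipfile
from pathlib import Path

def unpack_archive(archive_path="movies.zip", target_dir="."):
    """Extract every member of the archive into target_dir and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Only unpack when the archive is actually present.
if Path("movies.zip").exists():
    print(len(unpack_archive()), "entries extracted")
```
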

---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [4]:
######### DO NOT EDIT THIS CELL #########

import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt'), encoding="utf8") as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json'), encoding="utf8") as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

---

In [5]:
# Load the libraries which might be useful

import re
import nltk
nltk.download('all', quiet=True)

True

---

## Task 1: Document Pre-processing (15 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

The expected output is a list of tagged sentences where each tagged sentence is a list containing `(token, tag)` pairs.
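
As an illustration of this structure, the sketch below builds a one-sentence tagged document by hand and unpacks it. The pairs are the first five from the expected output quoted later in this task; the variable names are illustrative only.

```python
# A tagged document is a list of sentences;
# each sentence is a list of (token, tag) pairs.
tagged_document = [
    [('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'),
     ('Oscar', 'NNP'), ('nominations', 'NNS')],
]

for sentence in tagged_document:
    # Unpack each pair to separate the tokens from their tags.
    tokens = [token for token, tag in sentence]
    proper_nouns = [token for token, tag in sentence if tag == 'NNP']

print(tokens)
print(proper_nouns)
```
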



In [8]:
documents[0]

'Star Wars (retroactively titled Star Wars: Episode IV - A New Hope) is a 1977 American epic space-opera film written and directed by George Lucas, produced by Lucasfilm and distributed by 20th Century Fox. It stars Mark Hamill, Harrison Ford, Carrie Fisher, Peter Cushing, Alec Guinness, David Prowse, James Earl Jones, Anthony Daniels, Kenny Baker and Peter Mayhew. It is the first installment of the original Star Wars trilogy, the first of the franchise to be produced, and the fourth episode of the "Skywalker saga". Lucas had the idea for a science-fiction film in the vein of Flash Gordon around the time he completed his first film, THX 1138 (1971) and began working on a treatment after the release of American Graffiti (1973). Star Wars takes place "a long time ago", in a fictional universe inhabited by both humans and various alien species; most of the known galaxy is ruled by the tyrannical Galactic Empire, which is only opposed by the Rebel Alliance, a group of freedom fighters. The

In [17]:
def ie_preprocess(document):
    '''Return a list of sentences tagged with part-of-speech tags for the given document.'''

    # your code goes here

    # Step 1: Sentence segmentation.
    sentences = nltk.sent_tokenize(document)
    # Step 2: Tokenize sentences into words.
    tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
    # Step 3: POS tagging.
    tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]
    return tagged_sentences

Run the cell below to check if the output is formatted correctly.

Expected output: `[('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'), ('Oscar', 'NNP'), ('nominations', 'NNS'), ('(', '('), ('including', 'VBG'), ('Best', 'NNP'), ('Picture', 'NN'), (')', ')'), (',', ','), ('winning', 'VBG'), ('seven', 'CD'), ('.', '.')]`

In [18]:
# check output for Task 1
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

## Task 2: Named Entity Recognition (10 Marks)

Write a function that returns a list of all the named entities in a given document. The document here is structured as a list of sentences and tagged with part-of-speech tags.

Hint: Set `binary = True` while calling the `ne_chunk` function.

In [40]:
def find_named_entities(tagged_document):
    '''Return a list of all the named entities in the given tagged document.'''

    # your code goes here
    named_entities = []

    # Chunk one tagged sentence at a time.
    for pos_tags in tagged_document:
        tree = nltk.ne_chunk(pos_tags, binary=True)

        # With binary=True every named entity is a subtree labelled 'NE'.
        for subtree in tree.subtrees():
            if subtree.label() == 'NE':
                # Join the tokens of the entity, dropping their POS tags.
                entity = ' '.join(token for token, tag in subtree.leaves())
                named_entities.append(entity)
    return named_entities

Run the cell below to check if the output is formatted correctly.

The output values might not match exactly, but should look similar to: `['Star Wars', 'Star Wars', 'New Hope', 'American', 'George Lucas', 'Lucasfilm', ...]`

In [43]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document

find_named_entities(tagged_document)[:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox',
 'Mark Hamill',
 'Harrison Ford',
 'Carrie Fisher']

## Task 3: Information / Relation Extraction (I) (30 Marks)

Choose any **three** relations out of the following and write functions to extract them from a given document.

* **Title**
* **Language**
* **Starring**
* **Release date**
* **Cinematography**
* **Dialogue by**
* **Directed by**
* **Edited by**
* **Music by**
* **Narrated by**
* **Produced by**
* **Screenplay by**
* **Story by**
* **Written by**
* **Production companies**
* **Distribution companies**
* **Budget**
* **Box office**


The functions you define here must take as input a string called `document` and return the information/relation extracted as a list. You can explain your approach with comments along with your code.
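
To illustrate the expected interface (a string in, a list out), here is a minimal sketch of an extractor for monetary amounts, which could serve as a starting point for **Budget** or **Box office**. It is not one of the three relations implemented in the cells that follow. The function name `extract_money_amounts`, the regular expression, and the sample sentence are all assumptions made for this example.

```python
import re

# Dollar amounts such as "$11 million" or "$775,000"; the scale word is optional.
MONEY_PATTERN = re.compile(r'\$\d+(?:[.,]\d+)*(?:\s(?:million|billion))?')

def extract_money_amounts(document):
    """Return every dollar amount mentioned in the document as a list of strings."""
    return MONEY_PATTERN.findall(document)

# Invented sample sentence, used only to exercise the pattern.
sample = "The film was made on a budget of $11 million and grossed $775 million worldwide."
print(extract_money_amounts(sample))  # → ['$11 million', '$775 million']
```

A pattern like this only finds candidate amounts; deciding which one is the budget and which one is the box office figure still requires looking at the surrounding words.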


In [44]:
# relation 1 - your code goes here
def relation1_title(document):
    # Approach: the title is the text that precedes "is a" in the opening sentence.
    sentences = nltk.sent_tokenize(document)
    pattern = r'(.*\bis a)'
    title = []
    titles = re.findall(pattern, sentences[0])
    print('Result:', titles)
    if titles:
        title = titles[0]
    else:
        # Fallback 1: first named entity before "is a" in any sentence.
        for sent in sentences:
            titlesen = re.findall(pattern, sent)
            if titlesen:
                title = first_NE(titlesen)
                break
        # Fallback 2: the text that precedes the first opening bracket.
        if not title:
            for sent in sentences:
                titles_new = re.search(r'(.*?)\s*\(', sent)
                if titles_new:
                    title = titles_new.group(1)
                    break
    if title:
        # Drop a parenthesised remark and the trailing "is a".
        title = re.sub(r'\s*\(.*\)', '', title)
        title = title.replace(' is a', '').strip()
    return title

In [58]:
relation1_title(documents[2])

Result: ['The Dark Knight is a']


'The Dark Knight'

In [49]:
def first_NE(result):
    '''Return the first named entity found in result[0], or [] if there is none.'''
    tokens = nltk.word_tokenize(result[0])        # tokenization
    pos_tags = nltk.pos_tag(tokens)               # pos-tagging
    tree = nltk.ne_chunk(pos_tags, binary=True)   # chunking
    for subtree in tree.subtrees():
        if subtree.label() == 'NE':
            return ' '.join(token for token, tag in subtree.leaves())
    return []

In [50]:
# relation 2 - your code goes here
def relation2_directed(document):
    result=[]
    movie1=document
    movie1sen1 = nltk.sent_tokenize(movie1)
    pattern = r'(directed\b.*|Directed\b.*)'
    for sent in movie1sen1:
        results = re.findall(pattern, sent)
        if results:
            result=results
            break
   # print(result)
    if result:
        director=first_NE(result)
    else:
        director=[]
    return director

In [51]:
relation2_directed(documents[2])

'Christopher Nolan'

In [52]:
# relation 3 - your code goes here
def relation3_produced(document):
    result=[]
    producer=[]
    movie1=document
    movie1sen1 = nltk.sent_tokenize(movie1)
    pattern = r'(produced\b.*|Produced\b.*|Producers\b.*|producers\b.*)'
    for sent in movie1sen1:
        results = re.findall(pattern, sent)
        if results:
            result=results
            break
    if result:
        producer=first_NE(result)
    return producer

In [57]:
relation3_produced(documents[2])

'Christopher Nolan'

---

## Task 4: Information / Relation Extraction (II)  (15 Marks)

Identify one other relation of your choice, besides the ones mentioned in the previous task, and write a function to extract it. 

The function you define here must take as input a string called `document` and return the information/relations extracted as a list.

In [55]:
# your code goes here
def task4_Released_year(document):
    # Approach: the release year usually follows "is a" in the opening sentence.
    sentences = nltk.sent_tokenize(document)
    results = re.findall(r'(\bis a [1-9][0-9]{3})', sentences[0])
    if results:
        return results[0].replace('is a ', '')
    # Fallback: the first four-digit number anywhere in the document.
    for sent in sentences:
        years = re.findall(r'(\b[1-9][0-9]{3})', sent)
        if years:
            return years[0]
    return []

In [56]:
task4_Released_year(documents[2])

'2008'

---

## Task 5: Combining information in the output (5 Marks)

Edit the function below to return a Python dictionary with the outputs from the functions defined in Tasks 3 and 4.

In [62]:
def extract_info(document):
  '''Extract information and relations from a given document.'''

  # Edit the output dict below and assign the values to keys by 
  # calling the appropriate functions from Tasks 3 and 4.
  
  # You can delete the keys for which you do not perform extraction in Task 3.

  output = {
    ##### EDIT BELOW THIS LINE #####
    
    # For the relations you extract in Task 3, 
    # save the output in the appropriate key and delete rest of the keys.
    
    "Title": relation1_title(document),
    "Directed by": relation2_directed(document),
    "Produced by": relation3_produced(document),
    # save the output from Task 4 here
    "Task 4": task4_Released_year(document),

    ##### EDIT ABOVE THIS LINE #####
  }
  return output


# check output for the third document
extract_info(documents[2])

Result: ['The Dark Knight is a']


{'Title': 'The Dark Knight',
 'Directed by': 'Christopher Nolan',
 'Produced by': 'Christopher Nolan',
 'Task 4': '2008'}

The output from the cell above should look something like the dictionary shown below. The values will differ depending on which four items you choose to extract in Tasks 3 and 4, but the structure should be similar.

For example, if you choose to extract **Starring**, **Release Date**, **Box office**, and **Directed by**, then the output should look something like this for the first document:

```python
{
  'Box office': ['$775 million'],
  'Directed by': ['George Lucas'],
  'Release date': ['May 25, 1977'],
  'Starring': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher',
               'Peter Cushing', 'David Prowse', 'James Earl Jones'],
}
```

---

## Task 6: Evaluation (I) (15 Marks)

Write a function to evaluate the performance of Task $3$ using **Precision**, **Recall** and **F1** scores. Use the gold-standard labels provided in the JSON files to calculate these values.

Please note that not all the information / relations mentioned in Task $3$ have associated labels for every movie in the JSON documents, i.e., some JSON documents will have certain key-value pairs missing. For example, we have labels for *Budget* in 46 out of the 50 movies; in the remaining 4 documents, the key `Budget` is omitted from the JSON.
 
Also keep in mind that we will further run this evaluation on a hidden test set containing similar movie descriptions.

In [None]:
def evaluate(labels, predictions):
  '''
  Evaluate the performance of relation extraction 
  using Precision, Recall, and F1 scores.

  Args:
    labels: A list containing gold-standard labels
    predictions: A list containing information extracted from documents
  Returns:
    scores: A dictionary containing Precision, Recall and F1 scores 
            for the information/relations extracted in Task 3.
  '''

  assert len(predictions) == len(labels)

  scores = {
      'precision': 0.0, 'recall': 0.0, 'f1': 0.0
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.

  # your code goes here
  # ...



  return scores
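
One possible way to fill in this function is micro-averaging with exact string matching, sketched below. Precision is the number of correct predictions divided by the number of predictions, recall is the number of correct predictions divided by the number of gold-standard values, and F1 is their harmonic mean. The sketch rests on several assumptions: gold-standard values are lists of strings (a bare string is treated as a one-element list), a relation missing from a JSON document is skipped for that document, and the names `as_set`, `micro_scores` and the `fields` argument are hypothetical.

```python
def as_set(value):
    """Normalise a prediction or label (string, list, or None) to a set of strings."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)

def micro_scores(labels, predictions, fields):
    """Micro-averaged precision, recall and F1 with exact string matching."""
    correct = predicted = expected = 0
    for label, prediction in zip(labels, predictions):
        for field in fields:
            if field not in label:   # no gold label for this document: skip it
                continue
            gold = as_set(label[field])
            pred = as_set(prediction.get(field))
            correct += len(gold & pred)
            predicted += len(pred)
            expected += len(gold)
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1}

# Toy data, invented for illustration (not taken from the JSON files).
toy_labels = [{'Directed by': ['A', 'B']}, {'Title': ['T']}]
toy_predictions = [{'Directed by': 'A', 'Title': 'X'}, {'Title': ['T', 'U']}]
toy_scores = micro_scores(toy_labels, toy_predictions, ['Title', 'Directed by'])
print(toy_scores)
```

Inside `evaluate`, the same helper could be called with the relations chosen in Task 3, for example `micro_scores(labels, predictions, ['Title', 'Directed by', 'Produced by'])`. Exact matching is strict: a prediction such as `'Century Fox'` would not count as correct against a gold value of `'20th Century Fox'`.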

---
Run the cell below to calculate and display the evaluation scores for the 50 documents in `movies.zip`.

You can consider the following as a baseline score. Your aim should be to score higher than, or at least get as close as possible to, these values.

| Precision | Recall | F1    |
| :---:     | :---:  | :---: |
| 0.5       | 0.25   | 0.333 |
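
As a consistency check on the baseline, F1 is the harmonic mean of precision $P$ and recall $R$, so the third column follows from the first two:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.5 \times 0.25}{0.5 + 0.25} = \frac{0.25}{0.75} \approx 0.333
$$
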

In [None]:
# !pip install pandas
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

---

## Task 7: Evaluation (II) (10 Marks)

Describe **two** challenges you encountered above or might encounter in the evaluation of *information extraction* or *relation extraction* tasks.


Edit this cell to write your answer below the line in no more than 100 words. No coding is required for this task.

---

> Delete this line and write your answer here.