# Assignment 3 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **December 31, 2021**. Please note that there will be no further extensions to this deadline and we highly encourage you to submit this assignment before Semester 1 exams.
- This is an individual assignment; you **must not** work with other students to complete this assessment.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

|           | Task | Marks |
| :---      | :-----| -----:|
| Task 1    | Pre-processing |   15 |
| Task 2    | Named Entity Recognition |    10 |
| Task 3    | Information / Relation Extraction (I) | 30 |
| Task 4    | Information / Relation Extraction (II) | 15 |
| Task 5    | Combining information in the output   | 5 |
| Task 6    | Evaluation (I) | 15 |
| Task 7    | Evaluation (II) | 10 |



---

## Information Extraction and Relation Extraction

In the following tasks you will write code to perform **_information extraction_** and **_relation extraction_** across a collection of documents in `movies.zip`.

The zip archive contains 100 files: 50 are plaintext documents and the other 50 contain data structured as JSON.
Each plaintext document contains a text description of a movie taken from the English version of Wikipedia, while each JSON document contains *gold-standard* labels (also called *reference* labels) stored as key-value pairs for the entities and relations for each document.

You are only allowed to use the given documents and labels and **must not** use any other external sources of data for this assignment.

---

Download `movies.zip` from Blackboard, extract it, and place the resulting `movies` directory in the same location as this notebook. Alternatively, run the code cell below, which downloads the archive and extracts it into a `movies` directory next to this notebook.

In [None]:
!if test -f "movies.zip"; then rm "movies.zip"; fi
!if test -d "movies/"; then rm -rf "movies/"; fi
!wget "https://drive.google.com/uc?export=download&id=1L6NcSGkubNJaL6xSnYEZZKSrlyXq1AbB" -O "movies.zip"
!unzip "movies.zip"

---

## Reading Data

Place the unzipped `movies` directory in the same location as this notebook and run the following code cell to read the plaintext and JSON documents.

In [46]:
######### DO NOT EDIT THIS CELL #########

import os
import json

documents = []   # store the text documents as a list of strings
labels = []      # store the gold-standard labels as a list of dictionaries

for idx in range(50):
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.doc.txt')) as f:
    doc = f.read().strip()
  with open(os.path.join('movies', str(idx+1).zfill(2) + '.info.json')) as f:
    label = json.load(f)

  documents.append(doc)
  labels.append(label)

assert len(documents) == 50
assert len(labels) == 50

---

In [47]:
# Load the libraries which might be useful

import re
import nltk
nltk.download('all', quiet=True)

True

---

## Task 1: Document Pre-processing (15 Marks)
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

The expected output is a list of tagged sentences where each tagged sentence is a list containing `(token, tag)` pairs.


In [48]:
def ie_preprocess(document):
  '''Return a list of sentences tagged with part-of-speech tags for the given document.'''

  tagged_sentences = []

  # your code goes here
  # ...
  # Step 1: Split the document into sentences.
  sentences = nltk.sent_tokenize(document)

  # Step 2: Tokenize sentences into words.
  tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

  # Step 3: POS tagging.
  tagged_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentences]

  return tagged_sentences

Run the cell below to check if the output is formatted correctly.

Expected output: `[('It', 'PRP'), ('received', 'VBD'), ('ten', 'JJ'), ('Oscar', 'NNP'), ('nominations', 'NNS'), ('(', '('), ('including', 'VBG'), ('Best', 'NNP'), ('Picture', 'NN'), (')', ')'), (',', ','), ('winning', 'VBG'), ('seven', 'CD'), ('.', '.')]`

In [49]:
# check output for Task 1
ie_preprocess(documents[0])[-10]

[('It', 'PRP'),
 ('received', 'VBD'),
 ('ten', 'JJ'),
 ('Oscar', 'NNP'),
 ('nominations', 'NNS'),
 ('(', '('),
 ('including', 'VBG'),
 ('Best', 'NNP'),
 ('Picture', 'NN'),
 (')', ')'),
 (',', ','),
 ('winning', 'VBG'),
 ('seven', 'CD'),
 ('.', '.')]

## Task 2: Named Entity Recognition (10 Marks)

Write a function that returns a list of all the named entities in a given document. The document here is structured as a list of sentences and tagged with part-of-speech tags.

Hint: Set `binary = True` while calling the `ne_chunk` function.

In [50]:
def find_named_entities(tagged_document):
  '''Return a list of all the named entities in the given tagged document.'''
  
  # your code goes here
  # ...
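  named_entities = []

  # Chunk every tagged sentence; binary=True labels each entity simply as "NE".
  chunked_sentences = nltk.ne_chunk_sents(tagged_document, binary=True)
  for tree in chunked_sentences:
    for subtree in tree.subtrees():
      if subtree.label() == "NE":
        # Join the tokens of the entity, dropping their POS tags.
        entity = " ".join(token for token, tag in subtree.leaves())
        named_entities.append(entity)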

  return named_entities



Run the cell below to check if the output is formatted correctly.

The output values might not match exactly, but should look similar to: `['Star Wars', 'Star Wars', 'New Hope', 'American', 'George Lucas', 'Lucasfilm', ...]`

In [51]:
# check output for Task 2
tagged_document = ie_preprocess(documents[0]) # pre-process the first document
find_named_entities(tagged_document)[:10]     # display the first 10 named entities

['Star Wars',
 'Star Wars',
 'New Hope',
 'American',
 'George Lucas',
 'Lucasfilm',
 'Century Fox',
 'Mark Hamill',
 'Harrison Ford',
 'Carrie Fisher']

## Task 3: Information / Relation Extraction (I) (30 Marks)

Choose any **three** relations out of the following and write functions to extract them from a given document.

* **Title**
* **Language**
* **Starring**
* **Release date**
* **Cinematography**
* **Dialogue by**
* **Directed by**
* **Edited by**
* **Music by**
* **Narrated by**
* **Produced by**
* **Screenplay by**
* **Story by**
* **Written by**
* **Production companies**
* **Distribution companies**
* **Budget**
* **Box office**


The functions you define here must take as input a string called `document` and return the information/relation extracted as a list. You can explain your approach with comments along with your code.
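As an illustration of the required interface (a string in, a list out), the following is a minimal sketch of pattern-based extraction for **Release date** and **Box office** using only the `re` module. The sample sentence is written for illustration and is not taken from the corpus, and the function names and patterns are assumptions rather than a reference solution.

```python
import re

# Hypothetical sentence written for illustration; it is not taken from the corpus.
sample = "The film was released on May 25, 1977, and grossed $775 million worldwide."

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Matches dates written as "Month D, YYYY".
DATE_PATTERN = re.compile(MONTHS + r"\s+\d{1,2},\s+\d{4}")

# Matches a dollar amount that directly follows the word "grossed".
GROSS_PATTERN = re.compile(
    r"grossed\s+(\$\d+(?:[.,]\d+)*(?:\s+(?:million|billion))?)"
)

def extract_release_date(document):
    """Return every 'Month D, YYYY' string found in the document."""
    return DATE_PATTERN.findall(document)

def extract_box_office(document):
    """Return every dollar amount that follows the word 'grossed'."""
    return [match.group(1) for match in GROSS_PATTERN.finditer(document)]

print(extract_release_date(sample))
print(extract_box_office(sample))
```

A pattern this narrow favours precision over recall: it will miss dates written in other formats and grosses introduced by verbs other than "grossed".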


In [72]:
# Approach:
# 1. Compile a regular expression for the keyword "produced"; it is later matched against the text between two named entities.
# 2. For each tagged sentence, build a chunk tree with nltk.ne_chunk().
# 3. tree2semi_rel() splits each tree into a list of two-member lists, each holding a string followed by a named entity.
# 4. semi_rel2reldict() converts these pairs into dictionaries describing the subject and object named entities.
# 5. filter() keeps only the dictionaries that fit the pattern (subject NE ... "produced" ... object NE).
# 6. Each matching object is appended to the result list after replace() swaps underscores for spaces and title() capitalizes each word.

def extract_producer_name(document):

  subject_class_entity = 'NE'
  object_class_entity = 'NE'
  # Pattern matched against the text between two named entities
  searched_pattern_word = re.compile(r'.*\bproduced\b.*', re.IGNORECASE)

  producer_namelist = []
  sentence_tagged = ie_preprocess(document)  # POS-tag every sentence

  for single_tagged_sent in sentence_tagged:
    # Chunk the sentence into a tree with binary named-entity labels
    sentence_chunked = nltk.ne_chunk(single_tagged_sent, binary=True)
    # Extract (text, named entity) pairs from the chunked sentence
    pair_extraction_task = nltk.sem.relextract.tree2semi_rel(sentence_chunked)
    # Build relation dictionaries from the extracted pairs
    relation_dictionary_extract = nltk.sem.relextract.semi_rel2reldict(pair_extraction_task + [[[]]])

    # Keep relations of the form: subject NE ... "produced" ... object NE
    pattern_filter = lambda sentence: (sentence['subjclass'] == subject_class_entity and searched_pattern_word.match(sentence['filler']) and sentence['objclass'] == object_class_entity)
    relation_load = list(filter(pattern_filter, relation_dictionary_extract))

    for rel in relation_load:
      # Replace underscores with spaces and capitalize each word
      producer_namelist.append(rel['objsym'].replace("_", " ").title())

  return producer_namelist

In [73]:
# Approach:
# 1. Compile a regular expression for the keyword "directed"; it is later matched against the text between two named entities.
# 2. For each tagged sentence, build a chunk tree with nltk.ne_chunk().
# 3. tree2semi_rel() splits each tree into a list of two-member lists, each holding a string followed by a named entity.
# 4. semi_rel2reldict() converts these pairs into dictionaries describing the subject and object named entities.
# 5. filter() keeps only the dictionaries that fit the pattern (subject NE ... "directed" ... object NE).
# 6. Each matching object is appended to the result list after replace() swaps underscores for spaces and title() capitalizes each word.

def extract_director_name(document):

  subject_class_entity = 'NE'
  object_class_entity = 'NE'
  # Pattern matched against the text between two named entities
  searched_pattern_word = re.compile(r'.*\bdirected\b.*', re.IGNORECASE)

  director_namelist = []
  sentence_tagged = ie_preprocess(document)  # POS-tag every sentence

  for single_tagged_sent in sentence_tagged:
    # Chunk the sentence into a tree with binary named-entity labels
    sentence_chunked = nltk.ne_chunk(single_tagged_sent, binary=True)
    # Extract (text, named entity) pairs from the chunked sentence
    pair_extraction_task = nltk.sem.relextract.tree2semi_rel(sentence_chunked)
    # Build relation dictionaries from the extracted pairs
    relation_dictionary_extract = nltk.sem.relextract.semi_rel2reldict(pair_extraction_task + [[[]]])

    # Keep relations of the form: subject NE ... "directed" ... object NE
    pattern_filter = lambda sentence: (sentence['subjclass'] == subject_class_entity and searched_pattern_word.match(sentence['filler']) and sentence['objclass'] == object_class_entity)
    relation_load = list(filter(pattern_filter, relation_dictionary_extract))

    for rel in relation_load:
      # Replace underscores with spaces and capitalize each word
      director_namelist.append(rel['objsym'].replace("_", " ").title())

  return director_namelist

In [74]:
# Approach:
# 1. Compile a regular expression for the keyword "written"; it is later matched against the text between two named entities.
# 2. For each tagged sentence, build a chunk tree with nltk.ne_chunk().
# 3. tree2semi_rel() splits each tree into a list of two-member lists, each holding a string followed by a named entity.
# 4. semi_rel2reldict() converts these pairs into dictionaries describing the subject and object named entities.
# 5. filter() keeps only the dictionaries that fit the pattern (subject NE ... "written" ... object NE).
# 6. Each matching object is appended to the result list after replace() swaps underscores for spaces and title() capitalizes each word.

def extract_writer_name(document):

  subject_class_entity = 'NE'
  object_class_entity = 'NE'
  # Pattern matched against the text between two named entities
  searched_pattern_word = re.compile(r'.*\bwritten\b.*', re.IGNORECASE)

  writer_namelist = []
  sentence_tagged = ie_preprocess(document)  # POS-tag every sentence

  for single_tagged_sent in sentence_tagged:
    # Chunk the sentence into a tree with binary named-entity labels
    sentence_chunked = nltk.ne_chunk(single_tagged_sent, binary=True)
    # Extract (text, named entity) pairs from the chunked sentence
    pair_extraction_task = nltk.sem.relextract.tree2semi_rel(sentence_chunked)
    # Build relation dictionaries from the extracted pairs
    relation_dictionary_extract = nltk.sem.relextract.semi_rel2reldict(pair_extraction_task + [[[]]])

    # Keep relations of the form: subject NE ... "written" ... object NE
    pattern_filter = lambda sentence: (sentence['subjclass'] == subject_class_entity and searched_pattern_word.match(sentence['filler']) and sentence['objclass'] == object_class_entity)
    relation_load = list(filter(pattern_filter, relation_dictionary_extract))

    for rel in relation_load:
      # Replace underscores with spaces and capitalize each word
      writer_namelist.append(rel['objsym'].replace("_", " ").title())

  return writer_namelist
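
The three functions above differ only in the keyword they search for, so the filtering and formatting step can be isolated. The sketch below shows that step alone, on hand-written dictionaries that imitate the keys read from the output of `semi_rel2reldict()` (`subjclass`, `filler`, `objclass`, `objsym`). The sample values and the helper name `names_for_keyword` are assumptions made for illustration; real relation dictionaries contain more keys.

```python
import re

# Hand-written stand-ins for relation dictionaries (assumed values, for illustration only).
reldicts = [
    {"subjclass": "NE", "filler": "is/VBZ a/DT film/NN directed/VBN by/IN",
     "objclass": "NE", "objsym": "george_lucas"},
    {"subjclass": "NE", "filler": "produced/VBN by/IN",
     "objclass": "NE", "objsym": "lucasfilm"},
    {"subjclass": "NE", "filler": "and/CC",
     "objclass": "NE", "objsym": "harrison_ford"},
]

def names_for_keyword(reldicts, keyword):
    """Return the object entities of relations whose filler text contains the keyword."""
    pattern = re.compile(r".*\b" + re.escape(keyword) + r"\b.*", re.IGNORECASE)
    return [
        rel["objsym"].replace("_", " ").title()  # "george_lucas" -> "George Lucas"
        for rel in reldicts
        if rel["subjclass"] == "NE"
        and rel["objclass"] == "NE"
        and pattern.match(rel["filler"])
    ]

print(names_for_keyword(reldicts, "directed"))
print(names_for_keyword(reldicts, "produced"))
```

Note that `title()` capitalizes only the first letter of each word, so a symbol such as `mcgregor` becomes `Mcgregor`, which would not match a gold-standard label spelled `McGregor`.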

---

## Task 4: Information / Relation Extraction (II)  (15 Marks)

Identify one other relation of your choice, besides the ones mentioned in the previous task, and write a function to extract it. 

The function you define here must take as input a string called `document` and return the information/relations extracted as a list.

In [75]:
# Approach:
# 1. Compile a regular expression for the keyword "awards"; it is later matched against the text between two named entities.
# 2. For each tagged sentence, build a chunk tree with nltk.ne_chunk().
# 3. tree2semi_rel() splits each tree into a list of two-member lists, each holding a string followed by a named entity.
# 4. semi_rel2reldict() converts these pairs into dictionaries describing the subject and object named entities.
# 5. filter() keeps only the dictionaries that fit the pattern (subject NE ... "awards" ... object NE).
# 6. Each matching object is appended to the result list after replace() swaps underscores for spaces and title() capitalizes each word.

def extract_award_names(document):

  subject_class_entity = 'NE'
  object_class_entity = 'NE'
  # Pattern matched against the text between two named entities
  searched_pattern_word = re.compile(r'.*\bawards\b.*', re.IGNORECASE)

  award_namelist = []
  sentence_tagged = ie_preprocess(document)  # POS-tag every sentence

  for single_tagged_sent in sentence_tagged:
    # Chunk the sentence into a tree with binary named-entity labels
    sentence_chunked = nltk.ne_chunk(single_tagged_sent, binary=True)
    # Extract (text, named entity) pairs from the chunked sentence
    pair_extraction_task = nltk.sem.relextract.tree2semi_rel(sentence_chunked)
    # Build relation dictionaries from the extracted pairs
    relation_dictionary_extract = nltk.sem.relextract.semi_rel2reldict(pair_extraction_task + [[[]]])

    # Keep relations of the form: subject NE ... "awards" ... object NE
    pattern_filter = lambda sentence: (sentence['subjclass'] == subject_class_entity and searched_pattern_word.match(sentence['filler']) and sentence['objclass'] == object_class_entity)
    relation_load = list(filter(pattern_filter, relation_dictionary_extract))

    for rel in relation_load:
      # Replace underscores with spaces and capitalize each word
      award_namelist.append(rel['objsym'].replace("_", " ").title())

  return award_namelist

---

## Task 5: Combining information in the output (5 Marks)

Edit the function below to return a Python dictionary with the outputs from the functions defined in Tasks $3$ and $4$.

In [76]:
def extract_info(document):
  '''Extract information and relations from a given document.'''

  # Edit the output dict below and assign the values to keys by 
  # calling the appropriate functions from Tasks 3 and 4.
  
  # You can delete the keys for which you do not perform extraction in Task 3.

  output = {
    ##### EDIT BELOW THIS LINE #####
    
    # For the relations you extract in Task 3, 
    # save the output in the appropriate key and delete rest of the keys.
   # "Distribution companies": extract_distribution_name(document),
    "Directed by": extract_director_name(document),
    "Written by": extract_writer_name(document),
    "Produced by": extract_producer_name(document),

    # save the output from Task 4 here
    "Task 4": extract_award_names(document)

    ##### EDIT ABOVE THIS LINE #####
  }

  return output


# check output for the first document
extract_info(documents[0])

{'Directed by': ['George Lucas'],
 'Produced by': ['Lucasfilm'],
 'Task 4': [],
 'Written by': ['George Lucas']}

The output from the cell above should look something like the dictionary shown below. The values will differ depending on which four items you choose to extract in Tasks 3 and 4, but the structure should be similar.

For example, if you choose to extract **Starring**, **Release Date**, **Box office**, and **Directed by**, then the output should look something like this for the first document:

```javascript
{
  'Box office': ['$775 million'],
  'Directed by': ['George Lucas'],
  'Release date': ['May 25, 1977'],
  'Starring': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 
               'Peter Cushing', 'David Prowse', 'James Earl Jones', ],
}
```

---

## Task 6: Evaluation (I) (15 Marks)

Write a function to evaluate the performance of Task $3$ using **Precision**, **Recall** and **F1** scores. Use the gold-standard labels provided in the JSON files to calculate these values.

Please note that not every relation listed in Task $3$ has a label for every movie in the JSON documents, i.e., some JSON documents will have certain key-value pairs missing. For example, we have labels for *Budget* in 46 out of the 50 movies; in the remaining 4 documents, the key `Budget` is omitted from the JSON.
 
Also keep in mind that we will further run this evaluation on a hidden test set containing similar movie descriptions.
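
To make the metric definitions concrete, here is a minimal sketch of micro-averaged Precision, Recall and F1 over set-valued fields, computed on toy data. The helper name `micro_prf`, the placeholder names, and the choice to skip a field entirely when its key is missing from the gold-standard labels are assumptions made for illustration; other conventions (for example, counting such predictions as false positives) are also defensible.

```python
def micro_prf(labels, predictions, fields):
    """Micro-averaged precision, recall and F1 over the given fields."""
    true_pos = n_predicted = n_gold = 0
    for gold, pred in zip(labels, predictions):
        for field in fields:
            if field not in gold:
                continue  # assumption: no gold label means the field is not scored
            gold_set = set(gold[field])
            pred_set = set(pred.get(field, []))
            true_pos += len(gold_set & pred_set)
            n_predicted += len(pred_set)
            n_gold += len(gold_set)

    # Guard every division so empty inputs give 0.0 instead of raising.
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy data with placeholder names (not taken from the corpus).
toy_labels = [
    {"Directed by": ["Director A"], "Produced by": ["Producer A"]},
    {"Directed by": ["Director B", "Director C"]},  # "Produced by" key is missing
]
toy_predictions = [
    {"Directed by": ["Director A"], "Produced by": ["Studio X"]},
    {"Directed by": ["Director B"], "Produced by": ["Studio Y"]},
]

# 2 true positives, 3 scored predictions, 4 gold values:
# precision = 2/3, recall = 1/2, F1 = 4/7.
print(micro_prf(toy_labels, toy_predictions, ["Directed by", "Produced by"]))
```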

In [78]:
def evaluate(labels, predictions):
  '''
  Evaluate the performance of relation extraction 
  using Precision, Recall, and F1 scores.

  Args:
    labels: A list containing gold-standard labels
    predictions: A list containing information extracted from documents
  Returns:
    scores: A dictionary containing Precision, Recall and F1 scores 
            for the information/relations extracted in Task 3.
  '''

  assert len(predictions) == len(labels)

  scores = {
      'precision': 0.0, 'recall': 0.0, 'f1': 0.0
  }

  # calculate the precision, recall and f1 score over the information fields 
  # corresponding to Task 3 and store the result in the `scores` dict.

  # your code goes here
  # ...
  true_positive_value = 0
  predicted_positive_value = 0
  actual_positive_value = 0

  for pred_value, lbl_value in zip(predictions, labels):
    for dict_key, dict_value in pred_value.items():
      # Score only Task 3 fields that have a gold-standard label in this document
      if dict_key == "Task 4" or dict_key not in lbl_value:
        continue
      predicted_set = set(dict_value)
      gold_set = set(lbl_value[dict_key])
      true_positive_value += len(predicted_set & gold_set)
      predicted_positive_value += len(predicted_set)
      actual_positive_value += len(gold_set)

  # Guard against division by zero when nothing was predicted or labelled
  if predicted_positive_value:
    scores['precision'] = true_positive_value / predicted_positive_value
  if actual_positive_value:
    scores['recall'] = true_positive_value / actual_positive_value
  if scores['precision'] + scores['recall']:
    scores['f1'] = 2 * (scores['precision'] * scores['recall']) / (scores['precision'] + scores['recall'])

  return scores

---
Run the cell below to calculate and display the evaluation scores for the 50 documents in `movies.zip`.

You can consider the following as a baseline score. Your aim should be to score higher or at least get as close as possible to these values.

| Precision | Recall | F1    |
| :---:     | :---:  | :---: |
| 0.5       | 0.25   | 0.333 |

In [79]:
# !pip install pandas
import pandas as pd

# calculate evaluation score across all the 50 documents
extracted_infos = []
for document in documents:
  extracted_infos.append(extract_info(document))

scores = evaluate(labels, extracted_infos)

pd.DataFrame([scores])

|     | precision | recall   | f1       |
| ---: | ---:     | ---:     | ---:     |
| 0   | 0.683544  | 0.252336 | 0.368601 |


---

## Task 7: Evaluation (II) (10 Marks)

Describe **two** challenges you encountered above or might encounter in the evaluation of *information extraction* or *relation extraction* tasks.


Edit this cell to write your answer below the line in no more than 100 words. No coding is required for this task.

---

> 

**Ambiguity** -
Unlike humans, computers cannot easily draw on background knowledge to interpret meaning.
For example, the relations 'Produced by' and 'Production companies' are both matched by the same pattern, "produced". The pattern returns a producer's name in document 3 but the production company "Marvel Studios" in document 4, which lowers the precision score.

**Term mismatch** -
The gold-standard labels use official names, while the corpus often uses conventional short forms, so exact matching fails. For example, the JSON file lists "Lucasfilm Ltd.", but the corpus refers to it as "Lucasfilm".
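
One possible way to soften the term-mismatch problem is to normalize names before comparing them. The sketch below is only an illustration: the suffix list and the helper name `normalize_name` are assumptions, and aggressive normalization can also merge names that are genuinely different.

```python
import re

# Hypothetical list of trailing corporate suffixes to ignore when comparing names.
SUFFIX_PATTERN = re.compile(r"\b(?:ltd|inc|llc|corp|co)\.?$", re.IGNORECASE)

def normalize_name(name):
    """Lowercase a name and drop a trailing corporate suffix and stray punctuation."""
    name = SUFFIX_PATTERN.sub("", name.strip())
    name = name.strip(" ,.")
    return " ".join(name.split()).casefold()

# The gold-standard form and the corpus form now compare as equal.
print(normalize_name("Lucasfilm Ltd.") == normalize_name("Lucasfilm"))
```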













 