In [0]:
# -----------------------------------------------------
# Natural Language Processing
# Assignment 2 - Automatic Sense Making and Explanation
# Task 3 - Explanation (Generation)
# Michael McAleer R00143621
# -----------------------------------------------------
# Notes: 
# 1. This has been run and tested on Python 3.6 with TensorFlow 1.15.0
# 2. This model only predicts on 100 samples from the test data set due to
#    the time involved in predicting nonsensical answers
# 3. This model was only tested with 10 training epochs because Google
#    restricted GPU usage due to unfair usage of resources - it takes a
#    considerable amount of time to train per epoch (5-7 mins)

### 1. Install package dependencies

In [0]:
!pip install -q textgenrnn

### 2. Import libraries

In [2]:
import argparse
import collections
import csv
import logging
import math
import os
import sys

from textgenrnn import textgenrnn
from typing import List, Dict

Using TensorFlow backend.


### 3. Set paths, download dependencies, set constants

In [3]:
# Set root path dependent on system
# If System is Windows, set ROOT_DIR as current working directory
if os.name == 'nt':
    ROOT_DIR = os.getcwd()
# Else running on CoLab, set ROOT_DIR to match environment path
else:
    from google.colab import drive

    drive.mount('/content/drive')
    ROOT_DIR = '/content/drive/My Drive/Colab Notebooks'

# Paths to data and model output dir
DATA_DIR = '{root}/data'.format(root=ROOT_DIR)
TRAIN_DIR = '{data}/train'.format(data=DATA_DIR)
TEST_DIR = '{data}/test'.format(data=DATA_DIR)

# Evaluator constants
EXIT_STATUS_ANSWERS_MALFORMED = 1
EXIT_STATUS_PREDICTIONS_MALFORMED = 2
EXIT_STATUS_PREDICTIONS_EXTRA = 3
EXIT_STATUS_PREDICTION_MISSING = 4

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 4. Import training and test data

In [0]:
def read_csv_values(file_name, answers_file=False):
    """Read any csv file.

    :param file_name: path to CSV file -- str
    :param answers_file: if the CSV file contains dataset answers -- bool
    :return: CSV parsed data -- dict
    """
    # open the file with newline='' so the csv module handles line endings
    with open(file_name, newline='') as infile:
        # if the file contains answers set the column names manually
        if answers_file:
            reader = csv.DictReader(infile, fieldnames=['id', 'answer0',
                                                        'answer1', 'answer2'])
        else:
            # read each row as a dict ({header: value})
            reader = csv.DictReader(infile)
        # Initialise response dict
        data = dict()
        # For each row in the file
        for row in reader:
            # For header, value in each row
            for header, value in row.items():
                try:
                    # Add the value to the existing list
                    data[header].append(value)
                except KeyError:
                    # Create a new list for the header and assign value
                    data[header] = [value]
    return data


# Set training data path
training_data_path = '{td}/subtaskC_data_all.csv'.format(td=TRAIN_DIR)
training_answers_path = '{td}/subtaskC_answers_all.csv'.format(td=TRAIN_DIR)
# Load training data
train_data = read_csv_values(training_data_path)
train_answers = read_csv_values(training_answers_path, answers_file=True)

# Set test data path
test_data_path = '{td}/taskC_trial_data.csv'.format(td=TEST_DIR)
test_answers_path = '{td}/taskC_trial_references.csv'.format(td=TEST_DIR)
# Load test data
test_data = read_csv_values(test_data_path)
test_answers = read_csv_values(test_answers_path, answers_file=True)

# Initialise corpus list
corpus = list()
# For each context and possible reference answer
for c, a0, a1, a2 in zip(train_data['FalseSent'], train_answers['answer0'],
                         train_answers['answer1'], train_answers['answer2']):
    # If the context doesn't end in a period add one, this will help
    # the model determine that a new sentence is required and not a
    # continuation of the context
    if not c.endswith('.'):
        c += '.'
    # For each answer
    for a in [a0, a1, a2]:
        # Add the context and answer to the corpus
        corpus.append('{context} {answer}'.format(context=c, answer=a))
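
The corpus construction above pairs each false statement with each of its three reference explanations. A minimal standalone sketch of that pairing follows. `build_corpus` and the sample rows are hypothetical, introduced only for illustration, and the period is appended once per context rather than once per answer.

```python
from typing import Iterable, List, Sequence


def build_corpus(contexts: Iterable[str],
                 answer_columns: Sequence[Sequence[str]]) -> List[str]:
    """Pair every context with each of its reference answers.

    `build_corpus` is a hypothetical helper, not part of the notebook.
    """
    corpus = []
    for context, *answers in zip(contexts, *answer_columns):
        context = context.strip()
        # Terminate the context so the model sees a sentence boundary
        if not context.endswith('.'):
            context += '.'
        for answer in answers:
            corpus.append('{c} {a}'.format(c=context, a=answer))
    return corpus


# Made-up rows standing in for the FalseSent and answer columns
sample = build_corpus(
    ['He drinks apple', 'A giraffe is a person.'],
    [['Apple is a solid', 'Giraffes are animals'],
     ['Apples are eaten', 'A giraffe is not human']])
```

Each entry in `sample` is one training text of the form "context. answer", which is the shape passed to `train_on_texts` in the next section.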

### 5. Transfer & Train Model

In [5]:
# Initialise Char-RNN (https://github.com/karpathy/char-rnn) module
# TextGenRNN (https://pypi.org/project/textgenrnn/)

# The model is pre-trained on a Reddit corpus and consists of multiple LSTM
# layers with attention included that determines the next character in a
# sequence of 394 possible characters. Retraining of the model is done with
# a momentum based optimiser and linearly decaying learning rate.
textgen = textgenrnn()

# The number of epochs had to be reduced significantly after my Colab
# account was GPU restricted. Given how nonsensical the generated text was
# even after a number of epochs, reducing the epoch count is unlikely to
# have much impact on the quality of the results without further work on
# the training corpus:
# https://research.google.com/colaboratory/faq.html#gpu-availability
textgen.train_on_texts(corpus, num_epochs=10, batch_size=1024,
                       train_size=0.95, validation=True, dropout=0.3)


Training on 2,393,457 character sequences.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Epoch 1/10
####################
Temperature: 0.2
####################
He put a bath in the sky. A dog is not a protect to see the store.

He was a liquid on the moon. A shower is not a computer to complain and it is a place to cook in the sun. The moon is not a place to be pretty the sun to be eaten.

He was a shopping and he was always a football. A person cannot be protected in a park.

####################
Temperature: 0.5
####################
The people can be suddented in the sun. A computer is a place to see it for human bears. It is a space of the parents and cannot be drunk

He was eating contains in a store. You cannot go out of a body in the bathroom.

Cars are so hoping the rain. A bath is not real.

####################
Temperature: 1.0
####################
you won't excrean from living food. the lady too much person cannot fa
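
The `Temperature` values in the samples above control how sharply the model's next-character distribution is peaked before a character is sampled. A minimal sketch of the usual temperature rescaling follows, in plain Python. `apply_temperature` and the toy probabilities are illustrative assumptions, not textgenrnn's own code.

```python
import math
from typing import List


def apply_temperature(probs: List[float], temperature: float) -> List[float]:
    """Rescale a probability distribution by a sampling temperature.

    Hypothetical helper for illustration: it divides log-probabilities by
    the temperature and renormalises.
    """
    if temperature <= 0:
        raise ValueError('temperature must be positive')
    eps = 1e-12  # avoids log(0)
    scaled = [math.log(p + eps) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


toy = [0.6, 0.3, 0.1]  # made-up next-character probabilities
sharp = apply_temperature(toy, 0.2)  # low temperature: close to greedy
same = apply_temperature(toy, 1.0)   # temperature 1.0: distribution unchanged
```

Under this rescaling a low temperature concentrates probability on the most likely character, while values near 1.0 leave more room for unlikely characters, which is consistent with the noisier sample shown at temperature 1.0.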

### 6. Generate an explanation of why an input context does not make sense

In [12]:
# Initialise answers dictionary
answers = dict()
# For id and context in the first 100 rows (read here from train_data)
for i, context in zip(train_data['id'][:100], train_data['FalseSent'][:100]):
    # Cast the index as an integer
    i = int(i)
    # If the context does not end in a period add one
    if context[-1] != '.':
        context += '.'
    # Occasionally the model does not return a prediction, this can be
    # circumvented by continuing to make predictions until one is made
    generated_texts = list()
    while not generated_texts:
        # Make a prediction with the current context from the dataset. The
        # temperature controls sampling randomness: lower values favour the
        # most likely next character, higher values allow less likely ones
        generated_texts = textgen.generate(n=1, prefix=context,
                                           temperature=0.7,
                                           return_as_list=True,
                                           max_gen_length=150)
    # Extract the generated text by splitting on the period at the end of the
    # input context
    answer = generated_texts[0].split('.')[1]
    # Add context and predicted text to answer dict
    answers[str(i + 1)] = answer
    # Output the predictions for ids 0 to 10
    if i <= 10:
        print('Context: {c} | Answer: {a}'.format(c=context, a=answer))

Context: He poured orange juice on his cereal. | Answer:  Boats are best programmed because they were not enough to be furniture
Context: He drinks apple. | Answer:  Programming is not used for speaking for cancer
Context: Jeff ran 100,000 miles today. | Answer:  walleting a clock is not a place for gold in a morning
Context: I sting a mosquito. | Answer:  Leaves are not sold as the sun in a fridge care of air
Context: A giraffe is a person. | Answer:  A cat gave his desk when they are dead
Context: A normal closet is larger than a walk-in closet. | Answer:  A book has nothing to get the car only on modern the freezer
Context: I like to ride my chocolate. | Answer:  Monkeys are not edible
Context: A GIRL WON THE RACE WITH HORSE. | Answer:  The sea and used to chase it to make people use by seawater
Context: he put elephant into the jug. | Answer:  Babies are a common thing
Context: A dog plays volleyball. | Answer:  Smoothies do not enhance dead people
Context: Eggs eat kis on Easter. 
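
Splitting the generated text on the first period assumes that the context contains no period other than its closing one. A sketch of an alternative that removes the known prefix first is given below. `extract_explanation` is a hypothetical helper, and it assumes the generated string begins with the exact prefix that was supplied.

```python
def extract_explanation(generated: str, context: str) -> str:
    """Return the first generated sentence following the context.

    Hypothetical helper; assumes `generated` starts with `context`.
    """
    # Drop the prefix if present, otherwise fall back to the whole string
    if generated.startswith(context):
        remainder = generated[len(context):]
    else:
        remainder = generated
    # Keep the text up to the next period and trim surrounding whitespace
    return remainder.split('.')[0].strip()


# Made-up generated string whose context contains an internal period
context = 'Mr. Smith ate the moon.'
generated = 'Mr. Smith ate the moon. The moon is not food. It is far away'
explanation = extract_explanation(generated, context)
```

With this made-up input, indexing `generated.split('.')[1]` would return part of the context rather than the generated sentence, which is the case the prefix-based version is meant to avoid.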

### 7. Evaluate accuracy of model using BLEU

In [13]:
def _get_ngrams(segment, max_order):
    """Extracts all n-grams up to a given maximum order from an input segment.
    Args:
        segment: text segment from which n-grams will be extracted.
        max_order: maximum length in tokens of the n-grams returned by this
        method.
    Returns:
        The Counter containing all n-grams up to max_order in segment
        with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i:i + order])
            ngram_counts[ngram] += 1
    return ngram_counts


def _compute_bleu(reference_corpus, translation_corpus, max_order=4,
                  smooth=False):
    """Computes BLEU score of translated segments against one or more references.
    Args:
        reference_corpus: list of lists of references for each translation. Each
            reference should be tokenized into a list of tokens.
        translation_corpus: list of translations to score. Each translation
            should be tokenized into a list of tokens.
        max_order: Maximum n-gram order to use when computing BLEU score.
        smooth: Whether or not to apply Lin et al. 2004 smoothing.
    Returns:
        6-Tuple with the BLEU score, n-gram precisions, brevity penalty,
            translation/reference length ratio, translation length and
            reference length.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    for (references, translation) in zip(reference_corpus, translation_corpus):
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)

        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
        translation_ngram_counts = _get_ngrams(translation, max_order)
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches

    precisions = [0] * max_order
    for i in range(0, max_order):
        if smooth:
            precisions[i] = ((matches_by_order[i] + 1.) /
                             (possible_matches_by_order[i] + 1.))
        else:
            if possible_matches_by_order[i] > 0:
                precisions[i] = (float(matches_by_order[i]) /
                                 possible_matches_by_order[i])
            else:
                precisions[i] = 0.0

    if min(precisions) > 0:
        p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ratio = float(translation_length) / reference_length

    if ratio > 1.0:
        bp = 1.
    else:
        bp = math.exp(1 - 1. / ratio)

    bleu = geo_mean * bp

    return (bleu, precisions, bp, ratio, translation_length, reference_length)


def calculate_bleu(references: Dict[str, List[List[str]]],
                   predictions: Dict[str, List[str]],
                   max_order=4,
                   smooth=False) -> float:
    reference_corpus = []
    prediction_corpus = []

    for instance_id, reference_sents in references.items():
        try:
            prediction_sent = predictions[instance_id]
        except KeyError:
            logging.error("Missing prediction for instance '%s'.", instance_id)
            sys.exit(EXIT_STATUS_PREDICTION_MISSING)

        del predictions[instance_id]

        prediction_corpus.append(prediction_sent)
        reference_corpus.append(reference_sents)

    if len(predictions) > 0:
        logging.error("Found %d extra predictions, for example: %s",
                      len(predictions),
                      ", ".join(list(predictions.keys())[:3]))
        sys.exit(EXIT_STATUS_PREDICTIONS_EXTRA)

    score = _compute_bleu(reference_corpus, prediction_corpus,
                          max_order=max_order, smooth=smooth)[0]

    return score


def read_references(filename: str) -> Dict[str, List[List[str]]]:
    references = {}
    with open(filename, "rt", encoding="UTF-8", errors="replace") as f:
        reader = csv.reader(f)
        try:
            count = 0
            for row in reader:
                if count < 100:
                    try:
                        instance_id = row[0]
                        references_raw1 = row[1]
                        references_raw2 = row[2]
                        references_raw3 = row[3]
                    except IndexError as e:
                        logging.error(
                            "Error reading value from CSV file %s on line %d: %s",
                            filename, reader.line_num, e)
                        sys.exit(EXIT_STATUS_ANSWERS_MALFORMED)

                    if instance_id in references:
                        logging.error("Key %s repeated in file %s on line %d",
                                      instance_id, filename, reader.line_num)
                        sys.exit(EXIT_STATUS_ANSWERS_MALFORMED)

                    if instance_id == "":
                        logging.error(
                            "Key is empty in file %s on line %d", filename,
                            reader.line_num)
                        sys.exit(EXIT_STATUS_ANSWERS_MALFORMED)

                    tokens = []
                    for ref in [references_raw1, references_raw2,
                                references_raw3]:
                        if ref:
                            tokens.append(ref.split())

                    if len(tokens) == 0:
                        logging.error(
                            "No reference sentence in file %s on line %d",
                            filename, reader.line_num)
                        sys.exit(EXIT_STATUS_ANSWERS_MALFORMED)

                    references[instance_id] = tokens
                count += 1

        except csv.Error as e:
            logging.error('file %s, line %d: %s', filename, reader.line_num, e)
            sys.exit(EXIT_STATUS_ANSWERS_MALFORMED)

    return references


references = read_references(test_answers_path)
bleu = calculate_bleu(references, answers,
                      max_order=4, smooth=True)

print(f'BLEU score: {bleu * 100:.4f}.')

BLEU score: 0.0712.
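
`calculate_bleu` is annotated as taking each prediction as a list of tokens, while the `answers` built in section 6 hold plain strings, so n-grams taken from a prediction are sequences of characters rather than words. A minimal sketch of whitespace-tokenizing the predictions before scoring follows. `tokenize_predictions` and `word_ngrams` are hypothetical names, the sample text is made up, and the score reported above was not recomputed with this change.

```python
from typing import Dict, List, Tuple


def tokenize_predictions(answers: Dict[str, str]) -> Dict[str, List[str]]:
    """Split each predicted sentence on whitespace (hypothetical helper)."""
    return {instance_id: text.split() for instance_id, text in answers.items()}


def word_ngrams(segment, order: int) -> List[Tuple[str, ...]]:
    """List the n-grams of one order, sliced the way _get_ngrams slices."""
    return [tuple(segment[i:i + order])
            for i in range(len(segment) - order + 1)]


raw = {'1': ' Monkeys are not edible'}  # made-up prediction
tokens = tokenize_predictions(raw)

char_bigrams = word_ngrams(raw['1'], 2)      # pairs of characters
token_bigrams = word_ngrams(tokens['1'], 2)  # pairs of words
```

Only the tokenized form yields n-grams that can match the word-level references produced by `read_references`, which splits each reference with `ref.split()`.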


### 8. Results: Task 3 - Explanation (Generation)

The results from task 3 were very disappointing: the BLEU score was 0.0712 (below 0.1 on the 0-100 scale printed above), and the generated sentences were nonsensical in every instance, only rarely showing any connection to the input context.

Example:

`Context: A dog plays volleyball. | Answer:  Smoothies do not enhance dead people`

The poor performance of the character-level RNN is likely due to the lack of training on a suitable corpus. Whilst other NLU/NLP models such as BERT or GPT-2 have been trained on extensive corpora, the character-level RNN implemented in task 3 was pre-trained on a subsection of the website Reddit. In my opinion, the best approach to this task would be to train the model on a body of text such as Wikipedia, or a collection of encyclopaedias, so that it can start to learn the connections between words and deeper contexts.

Additional training and fine-tuning were carried out on the char-RNN to improve its output, but after 10 epochs it still could not generate any reasonable answers. An attempt was made to extend training by increasing the number of epochs, but this resulted in a GPU restriction being placed on my Colab account due to unfair usage of GPU resources, which has still not been lifted at the time of writing. It is assumed that more training epochs will result in better text generation, but this remains untested with the current dataset.

BERT was not chosen for this task because its masked, bidirectional nature makes it unsuitable for next-word/character prediction. Further research could explore masking entire sentences in BERT inputs and predicting words or sentences in that manner, but this is outside the scope of this assignment and the time frame available for such a body of work.

