In [None]:
import tensorflow as tf
import numpy as np

# Key Phrase Extraction

In [None]:
!pip install textacy==0.9.1
!python -m spacy download en_core_web_sm

import spacy
import textacy.ke
from textacy import *


In [None]:
#Load a spacy model, which will be used for all further processing.
en = textacy.load_spacy_lang("en_core_web_sm")

In [None]:
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation. Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled 'Computing Machinery and Intelligence' which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence. The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it is confronted with. Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. 
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks, for example in language modeling, parsing, and many others. A major drawback of statistical methods is that they require elaborate feature engineering. Since the early 2010s, the field has thus largely abandoned statistical methods and shifted to neural networks for machine learning. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT)."

In [None]:
doc = textacy.make_spacy_doc(text, lang=en)

In [None]:
textacy.ke.textrank(doc, topn=5)

[('statistical natural language processing', 0.035339725905947916),
 ('natural language processing system', 0.03163262196028641),
 ('natural language task', 0.029347081246597313),
 ('natural language datum', 0.025156505672350015),
 ('style machine learning method', 0.024700572677656928)]

In [None]:
textacy.ke.sgrank(doc, topn=5)

[('natural language processing', 0.5730151310757177),
 ('natural language understanding', 0.09846370418298048),
 ('NLP', 0.01582432031799395),
 ('artificial intelligence', 0.014770636080973837),
 ('deep neural network', 0.012977602491495587)]

To address the issue of overlapping key phrases, textacy has a function: aggregate_term_variants.

In [None]:
terms = set([term for term, weight in textacy.ke.sgrank(doc)])
print(textacy.ke.utils.aggregate_term_variants(terms))

[{'natural language understanding'}, {'natural language processing'}, {'artificial intelligence'}, {'deep neural network'}, {'linguistic'}, {'computer'}, {'speech'}, {'datum'}, {'task'}, {'NLP'}]
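
To make the idea of overlapping key phrases concrete, here is a minimal pure-Python sketch. It assumes a very simple rule: a phrase is dropped when its words appear as a contiguous run inside a longer phrase that was already kept. `drop_nested_phrases` is a hypothetical helper written for this illustration; it is not how textacy implements aggregate_term_variants.

```python
def drop_nested_phrases(phrases):
    """Keep only phrases that are not contained, word for word, in a longer kept phrase."""
    kept = []
    # Visit longer phrases first so shorter variants are compared against them
    for phrase in sorted(phrases, key=lambda p: (-len(p.split()), p)):
        words = phrase.split()
        n = len(words)
        nested = False
        for other in kept:
            other_words = other.split()
            # Slide a window of n words over the longer phrase
            if any(other_words[i:i + n] == words
                   for i in range(len(other_words) - n + 1)):
                nested = True
                break
        if not nested:
            kept.append(phrase)
    return kept


candidates = [
    "statistical natural language processing",
    "natural language processing",
    "natural language task",
    "natural language",
]
result = drop_nested_phrases(candidates)
print(result)
```

Only the phrases that are not fully covered by a longer one survive, which is the kind of redundancy visible in the textrank output above.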


# Question Answering

We will build a QA system in a couple of labs, after learning about BERT and other transformers.

For today, we will only apply the preprocessing steps for SQuAD v1.1.

In [None]:
train_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
train_path = tf.keras.utils.get_file("train.json", train_data_url)
eval_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
eval_path = tf.keras.utils.get_file("eval.json", eval_data_url)

In [None]:
import json

with open(train_path) as f:
    raw_train_data = json.load(f)

with open(eval_path) as f:
    raw_eval_data = json.load(f)

In [None]:
raw_train_data["data"][0]["paragraphs"][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'id': '5733be284776f41900661182',
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ
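
In this layout, `answer_start` is a character offset into `context`, so slicing the context should give back the answer text. The sketch below checks that on a hand-made paragraph with the same keys; the paragraph content and the `answer_spans` helper are made up for illustration.

```python
# Toy paragraph in the SQuAD v1.1 layout (content invented for this example)
paragraph = {
    "context": "The Grotto is a replica of the grotto at Lourdes, France.",
    "qas": [
        {
            "id": "toy-0",
            "question": "Where is the original grotto?",
            "answers": [{"answer_start": 41, "text": "Lourdes, France"}],
        }
    ],
}


def answer_spans(paragraph):
    """Yield (question, extracted text, matches) for every answer in a paragraph."""
    context = paragraph["context"]
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            end = start + len(answer["text"])
            extracted = context[start:end]
            yield qa["question"], extracted, extracted == answer["text"]


spans = list(answer_spans(paragraph))
for question, extracted, ok in spans:
    print(question, "->", extracted, ok)
```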

Tokenizer

In [None]:
all_text = []

for item in raw_train_data["data"]:
        for para in item["paragraphs"]:
            all_text.append(para["context"])
            for qa in para["qas"]:
                all_text.append([qa["question"]])

for item in raw_eval_data["data"]:
        for para in item["paragraphs"]:
            all_text.append(para["context"])
            for qa in para["qas"]:
                all_text.append([qa["question"]])

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(all_text)

In [None]:
print(all_text[2])
tokenizer.texts_to_sequences(all_text[2])

['What is in front of the Notre Dame Main Building?']


[[244, 8, 4, 1022, 2, 1, 2567, 2568, 274, 327]]
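
The integer ids above come from a word index that ranks words by frequency. Here is a simplified sketch of that idea, assuming only lowercasing and whitespace splitting (the Keras Tokenizer additionally filters punctuation by default). `build_word_index` and `encode` are hypothetical helpers, not part of Keras.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an id, most frequent first. Id 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically to make ties deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}


def encode(text, word_index):
    """Turn a text into a list of ids, silently dropping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


corpus = ["the main building", "the gold dome", "the main drive"]
word_index = build_word_index(corpus)
encoded = encode("The gold building", word_index)
print(word_index)
print(encoded)
```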

In [None]:
max_len = 384 # context + question

Squad Object

In [None]:
class SquadExample:
    def __init__(self, question, context, start_char_idx, answer_text):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx
        self.answer_text = answer_text
        self.skip = False

    def preprocess(self):
        context = self.context
        question = self.question
        answer_text = self.answer_text
        start_char_idx = self.start_char_idx

        # Find end character index of answer in context
        end_char_idx = start_char_idx + len(answer_text)
        if end_char_idx >= len(context):
            self.skip = True
            return

        # Mark the character indexes in context that are in answer
        is_char_in_ans = [0] * len(context)
        for idx in range(start_char_idx, end_char_idx):
            is_char_in_ans[idx] = 1

        # Tokenize context
        tokenized_before = tokenizer.texts_to_sequences([context[: start_char_idx]])[0]
        tokenized_answer = tokenizer.texts_to_sequences([context[start_char_idx: end_char_idx]])[0]
        tokenized_after = tokenizer.texts_to_sequences([context[end_char_idx:]])[0]
        tokenized_context = tokenized_before + tokenized_answer + tokenized_after


        # Find tokens that were created from answer characters
        ans_token_idx = list(range(len(tokenized_before), len(tokenized_before)+ len(tokenized_answer)))

        if len(ans_token_idx) == 0:
            self.skip = True
            return

        # Find start and end token index for tokens from answer
        start_token_idx = ans_token_idx[0]
        end_token_idx = ans_token_idx[-1]

        # Tokenize question
        tokenized_question = tokenizer.texts_to_sequences([question])[0]
        # Create inputs
        input_ids = tokenized_context  + tokenized_question
        token_type_ids = [0] * len(tokenized_context) + [1] * len(tokenized_question)
        attention_mask = [1] * len(input_ids)

        # Pad and create attention masks.
        # Skip if truncation is needed
        padding_length = max_len - len(input_ids)
        if padding_length > 0:  # pad
            input_ids = input_ids + ([0] * padding_length)
            attention_mask = attention_mask + ([0] * padding_length)
            token_type_ids = token_type_ids + ([0] * padding_length)
        elif padding_length < 0:  # skip
            self.skip = True
            return

        self.input_ids = input_ids
        self.token_type_ids = token_type_ids
        self.attention_mask = attention_mask
        self.start_token_idx = start_token_idx
        self.end_token_idx = end_token_idx
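
The key trick in preprocess is to tokenize the text before the answer and the answer itself separately, so the token positions of the answer follow directly from the two lengths. A minimal sketch of that step, assuming plain whitespace tokenization as a stand-in for the fitted tokenizer (`answer_token_span` is a hypothetical helper):

```python
def answer_token_span(context, start_char_idx, answer_text):
    """Return (start_token_idx, end_token_idx) of the answer, or None if it has no tokens."""
    end_char_idx = start_char_idx + len(answer_text)
    before = context[:start_char_idx].split()
    answer = context[start_char_idx:end_char_idx].split()
    if not answer:
        return None
    # The answer starts right after the tokens that precede it
    start_token_idx = len(before)
    end_token_idx = start_token_idx + len(answer) - 1
    return start_token_idx, end_token_idx


context = "It is a replica of the grotto at Lourdes, France"
answer_text = "Lourdes, France"
start_char_idx = context.index(answer_text)
span = answer_token_span(context, start_char_idx, answer_text)
tokens = context.split()
print(span, tokens[span[0]: span[1] + 1])
```

Note that this only works cleanly when the answer begins and ends on token boundaries, which is also an implicit assumption of the class above.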

Important: We are missing the \<start> and \<end> tokens. Transformer tokenizers add them automatically, which is why we don't add them here. However, if you want to train a seq2seq model, don't forget them.

Input: \<start> context \<end> question \<end>
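
A sketch of how that layout could be assembled by hand. The reserved ids (`START_ID`, `END_ID`, `PAD_ID`) and the `build_input` helper are assumptions for illustration; with a real tokenizer you would have to reserve those ids in the vocabulary first.

```python
START_ID, END_ID, PAD_ID = 1, 2, 0  # hypothetical reserved ids


def build_input(context_ids, question_ids, max_len):
    """Build <start> context <end> question <end>, padded to max_len."""
    input_ids = [START_ID] + context_ids + [END_ID] + question_ids + [END_ID]
    # Segment 0 covers <start> context <end>, segment 1 covers question <end>
    token_type_ids = [0] * (len(context_ids) + 2) + [1] * (len(question_ids) + 1)
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    if padding < 0:
        return None  # too long; the class above skips such examples
    return (
        input_ids + [PAD_ID] * padding,
        token_type_ids + [0] * padding,
        attention_mask + [0] * padding,
    )


ids, types, mask = build_input([10, 11, 12], [20, 21], max_len=10)
print(ids)
print(types)
print(mask)
```

Keep in mind that the extra \<start> token shifts every context position by one, so the answer start and end token indices would need the same shift.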

In [None]:
def create_squad_examples(raw_data):
    squad_examples = []
    for item in raw_data["data"]:
        for para in item["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                question = qa["question"]
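                answer_text = qa["answers"][0]["text"]
                # Character offset of the first answer within the context
                start_char_idx = qa["answers"][0]["answer_start"]
                squad_eg = SquadExample(question, context, start_char_idx, answer_text)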
                squad_eg.preprocess()
                squad_examples.append(squad_eg)
    return squad_examples

train_squad_examples = create_squad_examples(raw_train_data)
eval_squad_examples = create_squad_examples(raw_eval_data)

In [None]:
def create_inputs_targets(squad_examples):
    dataset_dict = {
        "input_ids": [],
        "token_type_ids": [],
        "attention_mask": [],
        "start_token_idx": [],
        "end_token_idx": [],
    }
    for item in squad_examples:
        if not item.skip:
            for key in dataset_dict:
                dataset_dict[key].append(getattr(item, key))
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])

    x = [
        dataset_dict["input_ids"],
        dataset_dict["token_type_ids"],
        dataset_dict["attention_mask"],
    ]
    y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]
    return x, y

In [None]:
x_train, y_train = create_inputs_targets(train_squad_examples)
x_eval, y_eval = create_inputs_targets(eval_squad_examples)

In [None]:
x_train[2][0]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
y_train[0][0]

90

# Continue Learning

## Implement a POS-Tagger from scratch

We will use the Penn Treebank dataset from NLTK

The model will take a sequence of words in a sentence as input, then will output the
corresponding POS tag for each word. Thus, for an input sequence consisting of the
words [The, cat, sat, on, the, mat, .], the output sequence should be the POS symbols
[DT, NN, VBD, IN, DT, NN, .].

In [None]:
import nltk
nltk.download("treebank")

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [None]:
sentences = nltk.corpus.treebank.tagged_sents()
sents = []
poss = []
for sentence in sentences:
    sents.append(" ".join([w for w, p in sentence]))
    poss.append(" ".join([p for w, p in sentence]))

print(len(sents))

3914


In [None]:
sent_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")
sent_tokenizer.fit_on_texts(sents)

poss_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False)
poss_tokenizer.fit_on_texts(poss)

sent_vocab_size = len(sent_tokenizer.word_index)
poss_vocab_size = len(poss_tokenizer.word_index)

In [None]:
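def max_sequence_length(sequences):
    # Named differently from max_len (384) in the QA section to avoid overwriting it
    return max(len(s) for s in sequences)

sents_sequences = sent_tokenizer.texts_to_sequences(sents)
max_seqlen = max_sequence_length(sents_sequences)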

sents_sequences = tf.keras.preprocessing.sequence.pad_sequences(sents_sequences,
                                maxlen=max_seqlen, padding="post")
poss_sequences = poss_tokenizer.texts_to_sequences(poss)
poss_sequences = tf.keras.preprocessing.sequence.pad_sequences(poss_sequences,
                                maxlen=max_seqlen, padding="post")

This time we will preprocess the POS tags not as a sequence but as categories! This makes the problem simpler. We will treat it as a multiclass classification problem (one class per token) instead of a seq2seq problem. (A seq2seq model can have better accuracy because it takes the order of the POS tags into account, but let's keep it simple.)

In [None]:
poss_categories = []
for p in poss_sequences:
    poss_categories.append(tf.keras.utils.to_categorical(p, num_classes=poss_vocab_size+1, dtype="int32"))
poss_categories = tf.keras.preprocessing.sequence.pad_sequences(poss_categories, maxlen=max_seqlen)
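
To see what the one-hot step produces, here is a small pure-Python sketch on a single padded tag sequence. The tag ids are made up, and `one_hot` is a hypothetical stand-in for to_categorical.

```python
def one_hot(tag_ids, num_classes):
    """Turn a list of tag ids into a list of one-hot rows."""
    return [[1 if c == tag_id else 0 for c in range(num_classes)]
            for tag_id in tag_ids]


padded_tags = [3, 1, 2, 0, 0]  # hypothetical tag ids; 0 is padding
encoded = one_hot(padded_tags, num_classes=4)  # 3 real tags + 1 padding class
for row in encoded:
    print(row)
```

Each position becomes a vector with one entry per class. Padding positions become class 0, which is why the number of classes is the tag vocabulary size plus one.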

In [None]:
print(sents_sequences[0])
print(poss_sequences[0])
print(poss_categories[0][0])

[5601 3746    1 2024   86  331    1   46 2405    2  131   27    6 2025
  332  459 2026    3    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((sents_sequences, poss_categories))

Instead of using scikit-learn's train/test split code, we will apply Dataset API operations directly. It is quite straightforward.

In [None]:
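# split into training, validation, and test datasets
# reshuffle_each_iteration=False keeps the order fixed across passes;
# otherwise take/skip would see a new order every epoch and the splits would overlap
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)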
test_size = len(sents) // 3
val_size = (len(sents) - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)
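
The split arithmetic can be checked without TensorFlow. The sketch below recomputes the three sizes for the 3914 sentences and mimics take and skip with list slicing; `split_sizes`, `take`, and `skip` are hypothetical helpers for illustration only.

```python
def split_sizes(num_examples):
    """Same arithmetic as the cell above: a third for test, a tenth of the rest for validation."""
    test_size = num_examples // 3
    val_size = (num_examples - test_size) // 10
    train_size = num_examples - test_size - val_size
    return train_size, val_size, test_size


def take(items, n):
    return items[:n]


def skip(items, n):
    return items[n:]


sizes = split_sizes(3914)
print(sizes)

# take/skip on a toy list of 10 items with test_size=3 and val_size=2
items = list(range(10))
test_items = take(items, 3)
val_items = take(skip(items, 3), 2)
train_items = skip(items, 3 + 2)
print(test_items, val_items, train_items)
```

The three parts are disjoint only if the underlying order is the same every time the dataset is iterated.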

Build the model

In [None]:
embedding_dims = 128
hidden_units = 256
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(sent_vocab_size + 1, # Add Padding!
                                    embedding_dims, 
                                    input_length=max_seqlen))
model.add(tf.keras.layers.SpatialDropout1D(0.2))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.GRU(hidden_units, return_sequences=True)))
model.add(tf.keras.layers.Dense(poss_vocab_size + 1, activation="softmax"))  # Add Padding!

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 271, 128)          1457664   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 271, 128)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 271, 512)          592896    
_________________________________________________________________
dense (Dense)                (None, 271, 47)           24111     
Total params: 2,074,671
Trainable params: 2,074,671
Non-trainable params: 0
_________________________________________________________________
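
The parameter counts in the summary can be reproduced by hand. This is a sketch under two assumptions: the embedding has 11388 rows (sent_vocab_size + 1, inferred from 1,457,664 / 128), and the GRU uses the TF2 default reset_after=True, which gives each gate two bias vectors.

```python
embedding_dims = 128
hidden_units = 256
num_tags = 47        # last dimension of the Dense output in the summary
vocab_rows = 11388   # assumed: sent_vocab_size + 1, inferred from the summary

# Embedding: one vector of size embedding_dims per vocabulary row
embedding_params = vocab_rows * embedding_dims

# One GRU direction: 3 gates, each with input weights, recurrent weights,
# and two bias vectors (assumes reset_after=True)
gru_params = 3 * (hidden_units * (embedding_dims + hidden_units) + 2 * hidden_units)
bidirectional_params = 2 * gru_params

# Dense: the input is the concatenation of both directions (2 * hidden_units), plus biases
dense_params = 2 * hidden_units * num_tags + num_tags

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
```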


Because of the padding, both the labels and the predictions contain many zeros, so the regular accuracy numbers will be very optimistic. Let's implement a masked accuracy that does not take the zero labels into account.

This is very similar to the loss we implemented in the machine translation code!

However, this time I use tf.keras.backend for the operations between tensors. Most of them have the same name as the direct tf operation (tf.keras.backend.argmax is equivalent to tf.argmax), BUT not always (tf.keras.backend.sum is equivalent to tf.reduce_sum).

I show them here so that you do not get confused when you see keras.backend operations.

In [None]:
def masked_accuracy():
    def masked_accuracy_fn(ytrue, ypred):
        ytrue = tf.keras.backend.argmax(ytrue, axis=-1)
        ypred = tf.keras.backend.argmax(ypred, axis=-1)
 
        # Mask out padding positions, i.e. positions where the true label is 0
        mask = tf.keras.backend.cast(
            tf.keras.backend.not_equal(ytrue, 0), tf.int32)
        matches = tf.keras.backend.cast(
            tf.keras.backend.equal(ytrue, ypred), tf.int32) * mask
        numer = tf.keras.backend.sum(matches)
        denom = tf.keras.backend.maximum(tf.keras.backend.sum(mask), 1)
        accuracy =  numer / denom
        return accuracy

    return masked_accuracy_fn
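
A small worked example shows why the mask matters. It is a pure-Python sketch on label ids (after argmax), assuming the mask is built from the true labels with 0 as padding; `plain_accuracy` and `masked_accuracy_ids` are hypothetical helpers.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree, padding included."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def masked_accuracy_ids(y_true, y_pred, pad_id=0):
    """Same as above, but positions whose true label is padding are ignored."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != pad_id]
    matches = sum(t == p for t, p in pairs)
    return matches / max(len(pairs), 1)


# 3 real tokens followed by 7 padding positions; one real token is mislabeled
y_true = [4, 2, 7, 0, 0, 0, 0, 0, 0, 0]
y_pred = [4, 2, 5, 0, 0, 0, 0, 0, 0, 0]

plain = plain_accuracy(y_true, y_pred)
masked = masked_accuracy_ids(y_true, y_pred)
print(plain, masked)
```

With one of three real tokens wrong, the plain accuracy still looks high because the padding positions count as correct, while the masked accuracy only reflects the real tokens.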

In [None]:
model.compile(loss="categorical_crossentropy",
              optimizer="adam", 
              metrics=["accuracy", masked_accuracy()])

In [None]:
BATCH_SIZE = 32

history = model.fit(train_dataset.batch(BATCH_SIZE), 
                    epochs=20,
                    validation_data=val_dataset.batch(BATCH_SIZE))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
model.evaluate(test_dataset.batch(BATCH_SIZE))



[0.0026725211646407843, 0.9992699027061462, 0.9921793937683105]

In [None]:
import numpy as np

for inputs, outputs in test_dataset.take(5).batch(5):
    pred = model.predict(inputs)
    preds_b = np.argmax(pred, axis=-1)
    outputs_b = np.argmax(outputs.numpy(), axis=-1)
    # Decode every sentence in the batch once
    input_texts = sent_tokenizer.sequences_to_texts(inputs.numpy())
    for i, (pred_l, output_l) in enumerate(zip(preds_b, outputs_b)):
        # Take the i-th sentence so that words and tags come from the same example
        input_tokens = input_texts[i].split(" ")
        output_tokens = poss_tokenizer.sequences_to_texts([output_l])[0].split(" ")
        predicted_tokens = poss_tokenizer.sequences_to_texts([pred_l])[0].split(" ")
        true = ""
        predicted = ""
        for word, o, p in zip(input_tokens, output_tokens, predicted_tokens):
            true = true + word + "/" + o + " "
            predicted = predicted + word + "/" + p + " "
        print("True:", true.strip())
        print("Predicted:", predicted.strip())
        print("\n\n")

True: that/DT explains/VBZ why/WRB the/DT number/NN of/IN these/DT wines/NNS is/VBZ expanding/VBG so/RB rapidly/RB ./.
Predicted: that/DT explains/VBZ why/WRB the/DT number/NN of/IN these/DT wines/NNS is/VBZ expanding/VBG so/RB rapidly/RB ./.



True: that/PRP explains/VBD why/NNS the/IN number/DT of/JJ these/NN wines/DT is/NN expanding/RB so/WP rapidly/-NONE- ./VBD
Predicted: that/PRP explains/VBD why/NNS the/IN number/DT of/JJ these/NN wines/DT is/NN expanding/JJR so/WP rapidly/-NONE- ./VBD



True: that/JJ explains/NN why/VBN the/-NONE- number/RB of/IN these/NNP wines/NNPS is/NNP expanding/NNP so/: rapidly/CD ./NN
Predicted: that/JJ explains/NN why/VBN the/-NONE- number/RB of/IN these/NNP wines/NNPS is/NNP expanding/NNP so/: rapidly/CD ./NN



True: that/`` explains/PRP why/VBP the/VBN number/-NONE- of/TO these/VB wines/DT is/NN expanding/NN so/, rapidly/'' ./VBD
Predicted: that/`` explains/PRP why/VBP the/VBN number/-NONE- of/TO these/VB wines/DT is/NN expanding/NN so/, rapidly/'' 