# A solution to Watson's contradictions
### From Kaggle's recurring challenge "Contradictory, My Dear Watson"


This is a preliminary model for a Natural Language Inference (NLI) problem: given two sentences, a premise and a hypothesis, it decides whether the premise entails, contradicts, or is neutral toward the hypothesis.

It follows [Ana Uzsoy's tutorial](https://www.kaggle.com/anasofiauzsoy/tutorial-notebook).

See the challenge on Kaggle [here](https://www.kaggle.com/c/contradictory-my-dear-watson/overview).

## Improvements

After getting the notebook to run with TensorFlow and BERT on Kaggle's servers, I worked on improving the accuracy of my approach. My initial score was only 0.65. This matched the accuracy Keras reported while fitting the model, but it can definitely be improved. I approached improvements in this order:
1. Optimize arguments to TensorFlow functions.
2. Attempt improvements to model structure.
3. Build model from scratch using only NumPy methods.
4. Check that BERT encoding is optimal.

In [None]:
# Notebook-wide control to print debugging/extraneous info
VERBOSE = True

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
os.environ["WANDB_API_KEY"] = "0" ## to silence warning.

In [None]:
from transformers import BertTokenizer, TFBertModel
import matplotlib.pyplot as plt
import tensorflow as tf

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

# Report the replica count for whichever strategy was selected.
print('Number of replicas:', strategy.num_replicas_in_sync)

## Load the data

From Kaggle's [tutorial notebook](https://www.kaggle.com/anasofiauzsoy/tutorial-notebook):
>The training set contains a premise, a hypothesis, a label (0 = entailment, 1 = neutral, 2 = contradiction), and the language of the text.
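
The integer labels can be turned into readable names with a small lookup. This is only a sketch built from the mapping quoted above; `LABEL_NAMES` and `label_name` are hypothetical helpers, not part of the tutorial or the dataset.

```python
# Label ids as described in the quoted tutorial text.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def label_name(label_id):
    """Return the class name for an integer label (hypothetical helper)."""
    return LABEL_NAMES[int(label_id)]

print(label_name(2))  # contradiction
```

Something like `train.label.map(LABEL_NAMES)` should give the same names for a whole column, assuming the column holds only these three values.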

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")

train.head()

Looking at one pair of sentences, we have:

In [None]:
print('Premise:')
print(train.premise.values[1])
print('\nHypothesis:')
print(train.hypothesis.values[1])
print('\nRelationship:')
print(train.label.values[1])

## Find the longest premise/hypothesis pair

In [None]:
longest_len = 0
idx = 0
for index, row in train.iterrows():
    length = 3
    length += len(row.premise.split())
    length += len(row.hypothesis.split())
    if length > longest_len:
        longest_len = length
        idx = index
        
# Whitespace word counts underestimate BERT's subword token counts, so add headroom.
max_len = longest_len + 100
        
if VERBOSE:
    print(f"Length of longest premise/hypothesis pair:\n{longest_len}")
    print(idx)
    print(train.premise.values[idx])
    print(f"length:{len(train.premise.values[idx].split())}")
    print(train.hypothesis.values[idx])
    print(f"length:{len(train.hypothesis.values[idx].split())}")
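
The same search can be written with vectorized pandas string methods instead of `iterrows()`. The sketch below assumes a DataFrame with `premise` and `hypothesis` columns; the function name `longest_pair` and the toy data are made up for illustration. The constant 3 mirrors the loop above, presumably reserving room for `[CLS]` and two `[SEP]` tokens.

```python
import pandas as pd

def longest_pair(df, extra_tokens=3):
    """Return (index, length) of the longest premise/hypothesis pair.

    Length is a whitespace word count plus `extra_tokens`,
    mirroring the explicit loop (hypothetical helper).
    """
    lengths = (
        df["premise"].str.split().str.len()
        + df["hypothesis"].str.split().str.len()
        + extra_tokens
    )
    # idxmax returns the first row with the maximum, like the strict `>` in the loop.
    return lengths.idxmax(), int(lengths.max())

# Toy data for illustration only.
toy = pd.DataFrame({
    "premise": ["a b c", "a b c d e"],
    "hypothesis": ["x y", "x"],
})
print(longest_pair(toy))
```

Note that this is still a word count, not a token count: BERT's subword tokenizer can split one word into several tokens, which is why some headroom on top of the result is useful.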

## Data abstraction with pretrained model

Use a multilingual BERT (Bidirectional Encoder Representations from Transformers) model to tokenize the sentences in many languages.

In [None]:
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)

In [None]:
def encode_sentence(s):
    tokens = list(tokenizer.tokenize(s))
    tokens.append('[SEP]')
    return tokenizer.convert_tokens_to_ids(tokens)

def demo_encode():
    example = 'Hello ML world!'
    tokens = list(tokenizer.tokenize(example))
    tokens.append('[SEP]')
    print('Demo using:\n\"%s\"\n' % example)
    print('Tokens:\n', tokens)
    print('IDs:\n', tokenizer.convert_tokens_to_ids(tokens))

demo_encode()

In [None]:
print(encode_sentence("I love machine learning"))

if VERBOSE:
    print("Encode longest:")
    # Encode once and reuse the result for both printing and length.
    long_encode = encode_sentence(train.premise.values[idx])
    print(long_encode)
    print(f"Length: {len(long_encode)}")

BERT takes three inputs:
1. Input word IDs
2. Input masks
3. Input type IDs

Using these, the model can distinguish the premise from the hypothesis (via the type IDs) and ignore the padding added to bring every pair to the same length (via the mask).

A [CLS] token marks the beginning of the input, and a [SEP] token ends each sentence, which also separates the premise from the hypothesis.
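
For a single pair, the three inputs can be sketched in plain NumPy. This is an illustration under assumptions, not the notebook's encoding code: `build_bert_inputs` is a hypothetical helper, the token IDs are made up, and padding with 0 is assumed.

```python
import numpy as np

def build_bert_inputs(premise_ids, hypothesis_ids, cls_id, max_len):
    """Build padded word IDs, mask, and type IDs for one sentence pair.

    Both ID lists are expected to already end with the [SEP] id.
    """
    word_ids = [cls_id] + list(premise_ids) + list(hypothesis_ids)
    # Segment 0 covers [CLS] and the premise, segment 1 the hypothesis.
    type_ids = [0] * (1 + len(premise_ids)) + [1] * len(hypothesis_ids)
    if len(word_ids) > max_len:
        raise ValueError("pair is longer than max_len")
    pad = max_len - len(word_ids)
    return {
        "input_word_ids": np.array(word_ids + [0] * pad),
        # 1 marks real tokens, 0 marks padding.
        "input_mask": np.array([1] * len(word_ids) + [0] * pad),
        "input_type_ids": np.array(type_ids + [0] * pad),
    }

# Made-up IDs: 101 stands in for [CLS], 102 for [SEP].
example = build_bert_inputs([7, 8, 102], [9, 102], cls_id=101, max_len=8)
for name, values in example.items():
    print(name, values)
```

The mask and the word IDs always have the same number of non-padding positions, and the type IDs switch from 0 to 1 right after the premise's [SEP].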

We now encode all of the premise/hypothesis pairs for input into BERT.

In [None]:
def bert_encode(premises, hypotheses, tokenizer):
    # Parameter order matches the calls below: premises first, then hypotheses.
    num_examples = len(premises)
    print(f'Encoding {num_examples} premise/hypothesis pairs as inputs...')
    
    # Premises become sentence 1 (type ID 0), hypotheses sentence 2 (type ID 1).
    sentence1 = tf.ragged.constant([
        encode_sentence(s)
        for s in np.array(premises)
    ])
    sentence2 = tf.ragged.constant([
        encode_sentence(s)
        for s in np.array(hypotheses)
    ])
    
    print(sentence1[0])
    
    cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
    input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
    print(input_word_ids[0])
    
    input_mask = tf.ones_like(input_word_ids).to_tensor(shape=[num_examples,max_len])
    
    type_cls = tf.zeros_like(cls)
    type_s1 = tf.zeros_like(sentence1)
    type_s2 = tf.ones_like(sentence2)
    input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1).to_tensor(shape=[num_examples,max_len])
    
    inputs = {
        'input_word_ids': input_word_ids.to_tensor(shape=[num_examples,max_len]),
        'input_mask': input_mask,
        'input_type_ids': input_type_ids
    }
    
    print('Done.')
    
    return inputs

In [None]:
train_input = bert_encode(train.premise.values, train.hypothesis.values, tokenizer)

In [None]:
if VERBOSE:
    print("Example encoded tensor:")
    print(train_input['input_word_ids'][0])
    print(tokenizer.convert_ids_to_tokens(train_input['input_word_ids'][0]))
    print(train_input['input_word_ids'][idx])

## Create and Train Keras Functional Model

With the inputs encoded for BERT, we can now train a Keras functional model on them.

**IMPROVEMENT: See if there are more ways to optimize the model with built-in TensorFlow Keras tools.**

In [None]:
# max_len = 50

def build_model():
    bert_encoder = TFBertModel.from_pretrained(model_name)
    
    input_word_ids = tf.keras.Input(
        shape=(max_len,),
        dtype=tf.int32,
        name="input_word_ids")
    input_mask = tf.keras.Input(
        shape=(max_len,),
        dtype=tf.int32,
        name="input_mask")
    input_type_ids = tf.keras.Input(
        shape=(max_len,),
        dtype=tf.int32,
        name="input_type_ids")
    
    # Index 0 of the encoder output is the per-token sequence output.
    embedding = bert_encoder([input_word_ids, input_mask, input_type_ids])[0]
    # Classify from the embedding at position 0, i.e. the [CLS] token.
    output = tf.keras.layers.Dense(3, activation='softmax')(embedding[:,0,:])
    
    model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)
    # TODO: check model optimization and see if there are other options to try.
    # `learning_rate` replaces the deprecated `lr` argument.
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
with strategy.scope():
    model = build_model()
    model.summary()
    
tf.keras.utils.plot_model(model, "three-input-bert-model.png", show_shapes=True)

**IMPROVEMENT: See if there are other ways to optimize the `tf.keras` fit method.**

In [None]:
# TODO: check arguments for fit() for optimization.
model.fit(train_input, train.label.values, epochs=2, verbose=1, batch_size=64, validation_split=0.2)
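
One `fit()` detail worth checking: Keras takes the `validation_split` fraction from the last samples of the data, before any shuffling, so the validation set is not a random sample. A possible alternative is to shuffle and split manually and pass the result through the `validation_data` argument. The sketch below uses only NumPy; `shuffled_split` is a hypothetical helper and the toy arrays are made up.

```python
import numpy as np

def shuffled_split(inputs, labels, val_fraction=0.2, seed=0):
    """Randomly split a dict of equally sized arrays and their labels."""
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(labels))
    n_val = int(len(labels) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]

    def take(idx):
        # Apply the same row selection to every input array.
        return {name: np.asarray(arr)[idx] for name, arr in inputs.items()}

    return (take(train_idx), labels[train_idx]), (take(val_idx), labels[val_idx])

# Toy data: row i holds [2*i, 2*i + 1], and its label is i % 3.
toy_inputs = {"input_word_ids": np.arange(20).reshape(10, 2)}
toy_labels = np.arange(10) % 3
(train_x, train_y), (val_x, val_y) = shuffled_split(toy_inputs, toy_labels)
print(len(train_y), len(val_y))  # 8 2
```

The two parts could then be passed as `model.fit(train_x, train_y, validation_data=(val_x, val_y), ...)` in place of `validation_split=0.2`. Whether this changes the score here is untested.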

## Get Test Data

In [None]:
test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
test_input = bert_encode(test.premise.values, test.hypothesis.values, tokenizer)

test.head()

## Generate and Submit Predictions

In [None]:
# Pick the most probable of the three classes for each test pair.
predictions = np.argmax(model.predict(test_input), axis=1)
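
The softmax layer outputs one row of three class probabilities per pair, and the predicted label is the index of the largest value in each row. A minimal illustration with made-up probabilities (not real model output):

```python
import numpy as np

# Made-up softmax outputs for three test pairs; each row sums to 1.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
# argmax along axis 1 picks the most probable class per row.
predicted = probs.argmax(axis=1)
print(predicted)  # [0 2 1]
```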

In [None]:
submission = test.id.copy().to_frame()
submission['prediction'] = predictions

submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)