# Introduction

In this notebook we will examine the Cross-Lingual Natural Language Inference ([XNLI](https://cims.nyu.edu/~sbowman/xnli/)) dataset. This is a collection of sentence pairs for textual entailment, so the first sentence either implies, contradicts, or is neutral towards the second sentence. Each pair is then translated from English into 14 other languages ranging from Spanish to Swahili.

# Dataset

To access this dataset we will be using the Hugging Face (HF) [datasets](https://huggingface.co/docs/datasets/index.html) package. The code below will download the data to the *data* folder, and subsequent calls will reuse this data. We note that HF will only allow us to download a single language set per call, so we will start with English as that will be the easiest to understand.

In [1]:
import datasets

In [2]:
en = datasets.load_dataset('xnli', 'en', cache_dir='./data')

Reusing dataset xnli (./data/xnli/en/1.1.0/51ba3a1091acf33fd7c2a54bcbeeee1b1df3ecb127fdca003d31968fa3a1e6a8)


That went smoothly, so let's examine the data we have obtained.

In [3]:
print(en.keys())

dict_keys(['train', 'test', 'validation'])


In [4]:
print('Total Training Examples:', len(en['train']))
print('Total Testing Examples:', len(en['test']))
print('Total Validation Examples:', len(en['validation']))

Total Training Examples: 392702
Total Testing Examples: 5010
Total Validation Examples: 2490


We have the three splits one would expect: almost 400,000 training examples and about 7,500 test and validation examples combined. Each of the examples in the splits above also exists in its 14 translations. In addition, the languages of the premise and hypothesis can be mixed and matched, which yields far more data than is immediately apparent.

Below we can see one such example from the validation split:

In [5]:
for key, val in en['validation'][0].items():
    print(key+':', val) 

hypothesis: He called his mom as soon as the school bus dropped him off.
label: 1
premise: And he said, Mama, I'm home.


Each example has this same structure of hypothesis and premise with a label (in this case 1). Examining the example above, I would expect label 1 to be neutral, but we can confirm this.

In [6]:
for i, name in enumerate(en['validation'].features['label'].names):
    print('{} -> {}'.format(i, name))

0 -> entailment
1 -> neutral
2 -> contradiction


Label 0 corresponds to entailment, 1 to neutral (as expected), and 2 to contradiction. For fun, let's look at a few more examples.

In [8]:
import random
random.seed(57)
samples = random.sample(list(en['validation']), 2)
for s in samples:
    for key, val in s.items():
        print(key+':', val)
    print('\n')

hypothesis: They told me I should stay home.
label: 2
premise: They asked a few questions and I answered them and they said, Get your baggage and leave there immediately, and come to the address you were supposed to when you arrived in Washington.


hypothesis: It's hard to install the system because hackers attack it every night.
label: 1
premise: It is not clear the system can be installed before 2010, but even this timetable may be too slow, given the possible security dangers.




This is more of what we expect, and although the sample is small, the examples appear quite easy for a human to understand. I am not sure whether human performance results exist for this dataset, but I would expect them to be quite high.

The last thing I want to look at in terms of our data is the distribution of classes in the various splits.

In [9]:
import numpy as np
dist = np.zeros((3,3))
for i, key in enumerate(en.keys()):
    for item in en[key]:
        j = item['label']
        dist[i,j] += 1

In [10]:
import pandas as pd
df = pd.DataFrame(dist, ['train', 'test', 'validation'], 
                  en['validation'].features['label'].names)
df

| split | entailment | neutral | contradiction |
|---|---|---|---|
| train | 130899.0 | 130900.0 | 130903.0 |
| test | 1670.0 | 1670.0 | 1670.0 |
| validation | 830.0 | 830.0 | 830.0 |


That is about as even a distribution as possible, and really we shouldn't expect anything less from such a widely used and well curated dataset.
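
The counting loop above visits every example one at a time. A minimal sketch of a shorter alternative, assuming the label column of a split can be read out as a sequence of integers (for example `en['train']['label']`). The `class_counts` helper and the `toy_labels` list are hypothetical names used only for illustration, so the snippet runs without downloading anything:

```python
from collections import Counter


def class_counts(labels, num_classes=3):
    """Return how often each integer label 0..num_classes-1 occurs."""
    counts = Counter(labels)
    # Classes that never occur are reported as 0 rather than omitted.
    return [counts.get(k, 0) for k in range(num_classes)]


# Hypothetical stand-in for a label column such as en['train']['label'].
toy_labels = [0, 1, 2, 2, 1, 0, 0]
counts = class_counts(toy_labels)
print(counts)
```

One list per split could then be stacked into the same kind of table shown above.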

# Pre-Trained Model

I am interested in using this dataset by way of a pre-trained model. Specifically, we will consider a [RoBERTa](https://arxiv.org/abs/1911.02116) type model fine-tuned on the XNLI dataset which can be found [here](https://huggingface.co/vicgalle/xlm-roberta-large-xnli-anli).

RoBERTa, as its name suggests, is based on the well-known BERT language model. The architecture stays much the same, but a variety of improved training techniques are applied. These include using significantly more data, refining hyperparameter choices, and changing the training scheme itself. These changes allow much more power to be squeezed out of the same original base. The XLM-RoBERTa model is a variant built specifically for cross-lingual tasks, and it was trained on data in 100 different languages.

The instantiation we will be using is intended for zero-shot text classification, which isn't quite our task, but we will address this later. Some of its specific details are vague, but it appears to have been fine-tuned on at least XNLI and ANLI, the latter being an adversarial NLI dataset. Importantly, based on the information available, it was not trained on any of the XNLI test data, which (surprisingly) was a problem with some of the other potential options on HF.

In [3]:
#Load model
from transformers import pipeline
classifier = pipeline('zero-shot-classification',
                        model='vicgalle/xlm-roberta-large-xnli-anli', device=0)

Some weights of XLMRobertaModel were not initialized from the model checkpoint at vicgalle/xlm-roberta-large-xnli-anli and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We have loaded our model above, which results in a warning about untrained parameters. As we will see, it doesn't seem to make our task impossible, so I will go ahead and ignore that warning. As I mentioned earlier, this model is meant to be used for zero-shot text classification: you give it an input and a selection of labels, and it picks the best one. However, the way it does this is by using the input as a premise and then using a template hypothesis of 'This example is {label}', which is then filled in by the labels. We can use this to implement a workaround by considering the labels to be 'contradicts', 'implies', and 'is neutral towards'. Then we use the template 'This example {} ' + h, where h is a given hypothesis. An example of this technique can be seen below.

In [13]:
premise = 'I am at home.'
labels = ['implies', 'is neutral towards', 'contradicts']

hypothesis = ['Home is where I am.', 'I am at work.', 'I am 22 years old.']

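for h in hypothesis:
    # Build a hypothesis template with a '{}' slot for the candidate label
    hyp = 'This example {} ' + h
    out = classifier(premise, candidate_labels=labels, hypothesis_template=hyp)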
    
    print(f'({premise}) {out["labels"][0]} the hypothesis ({h}) with probability {out["scores"][0]}')
    print('\n')

(I am at home.) implies the hypothesis (Home is where I am.) with probability 0.8172876834869385


(I am at home.) contradicts the hypothesis (I am at work.) with probability 0.947782039642334


(I am at home.) is neutral towards the hypothesis (I am 22 years old.) with probability 0.36400923132896423




Our premise is 'I am at home.' and we consider the following three hypotheses: 'Home is where I am.', 'I am at work.', and 'I am 22 years old.'. The first is an implication, the second a contradiction, and the third is neutral. I will note here that the neutral label is not ideal, but there does not seem to be a good English verb for this purpose.

We can see that the model gives the correct predictions in all three cases, although the neutrality case has quite low probability. It seems likely to me that this latter result comes from the poorer quality of the label.
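
To make the workaround more concrete, here is a minimal sketch of the sentences the model ends up scoring against the premise. It assumes the pipeline fills the template with Python's `str.format`, one candidate label at a time, which is my understanding of its behavior rather than something confirmed here. The `expand_template` helper is a hypothetical name, and the brace escaping is an extra precaution that the notebook's own cells do not take:

```python
def expand_template(hypothesis, labels):
    """Return one filled-in hypothesis sentence per candidate label."""
    # Escape literal braces so str.format only fills the label slot.
    safe = hypothesis.replace('{', '{{').replace('}', '}}')
    template = 'This example {} ' + safe
    return [template.format(label) for label in labels]


labels = ['implies', 'is neutral towards', 'contradicts']
for sentence in expand_template('I am at work.', labels):
    print(sentence)
```

Each printed sentence is paired with the premise, and the label whose sentence the model finds most entailed is returned first.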

We will load the XNLI metric to evaluate the performance of our model. In fact, we will use three different metric objects, one per reference label, to look independently at the performance on all three classes. Really, these metric objects are just convenience wrappers for computing accuracy.

In [78]:
# Load metrics
en_metrics = [datasets.load_metric('xnli') for _ in range(3)]

sw_metrics = [datasets.load_metric('xnli') for _ in range(3)]
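
Since the metric objects only compute accuracy, the same per-class numbers can be sketched in plain Python. This is an illustrative equivalent under that assumption, not the library's implementation. The `per_class_accuracy` helper and the toy `refs` and `preds` lists are hypothetical:

```python
def per_class_accuracy(predictions, references, num_classes=3):
    """Accuracy restricted to the examples of each reference class.

    Returns None for a class that never occurs in `references`.
    """
    correct = [0] * num_classes
    total = [0] * num_classes
    for pred, ref in zip(predictions, references):
        total[ref] += 1
        correct[ref] += int(pred == ref)
    return [c / t if t else None for c, t in zip(correct, total)]


# Hypothetical predictions and references, for illustration only.
refs = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 1, 2, 0]
print(per_class_accuracy(preds, refs))
```

Accuracy computed within a single reference class is what is usually called the recall of that class.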

I will not consider all 15 languages, but I am interested in looking at the performance as a function of the language resource. We will use English as our high-resource language and Swahili as a low-resource language.

In [65]:
#Load other languages
sw = datasets.load_dataset('xnli', 'sw', cache_dir='./data')

Reusing dataset xnli (./data/xnli/sw/1.1.0/51ba3a1091acf33fd7c2a54bcbeeee1b1df3ecb127fdca003d31968fa3a1e6a8)


In [67]:
print(sw['test'][0])

{'hypothesis': 'Sijaongea na yeye tena.', 'label': 2, 'premise': 'Naam, sikukuwa nafikiri juu ya hilo, lakini nilichanganyikiwa sana, na, hatimaye nikaendelea kuzungumza naye tena.'}


With the Swahili portion loaded we are ready to determine the performance.

In [8]:
label_map = {
    'implies' : 0,
    'is neutral towards' : 1,
    'contradicts' : 2
}

In [98]:
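def evaluate(dataset, metrics):
    """Score every example and return one accuracy result per reference label."""
    for example in dataset:
        premise = example['premise']
        # Template with a '{}' slot that the pipeline fills with each candidate label
        h = 'This example {} ' + example['hypothesis']
        ref = example['label']

        out = classifier(premise, candidate_labels=labels, hypothesis_template=h)

        # The pipeline returns labels sorted by score, so index 0 is the prediction
        pred = label_map[out['labels'][0]]

        # Route the example to the metric object for its reference class
        metrics[ref].add(prediction=pred, reference=ref)

    score = [metric.compute() for metric in metrics]

    return score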

In [99]:
en_score = evaluate(en['test'], en_metrics)
sw_score = evaluate(sw['test'], sw_metrics)

In [100]:
for i in range(3):
    print(f'Label: {labels[i]}')
    print(f'English Accuracy: {en_score[i]}')
    print(f'Swahili Accuracy: {sw_score[i]}')
    print('\n')

Label: implies
English Accuracy: {'accuracy': 0.8772455089820359}
Swahili Accuracy: {'accuracy': 0.7604790419161677}


Label: is neutral towards
English Accuracy: {'accuracy': 0.22994011976047904}
Swahili Accuracy: {'accuracy': 0.26706586826347306}


Label: contradicts
English Accuracy: {'accuracy': 0.45209580838323354}
Swahili Accuracy: {'accuracy': 0.5173652694610779}




In the results above we can see a number of interesting things. First, the implication accuracy is much better than either the neutrality or contradiction accuracy. I am not surprised that neutrality has the lowest accuracy, but I would have expected the contradiction accuracy to be higher. Second, accuracy on English is not necessarily higher. It is in the case of implication, but Swahili is better on contradiction and neutrality.

There are a few potential explanations for the results we are seeing. The most obvious is that we aren't really using the model the way it is meant to be used, so that could have some adverse effects. This is especially likely in the case of neutrality where the label is a little funky. I will also say that we are using a pre-trained model from someone on the internet, and with things like the warning we encountered earlier it is hard to say exactly what we are working with. 
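
One way to probe these explanations, which the notebook does not do, would be to tabulate which label the model picks for each reference class. The sketch below assumes the predictions and references have been collected into two lists, something the `evaluate` function above does not currently return. The `confusion_matrix` helper and the toy data are hypothetical:

```python
def confusion_matrix(predictions, references, num_classes=3):
    """matrix[r][p] counts examples with reference label r predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for pred, ref in zip(predictions, references):
        matrix[ref][pred] += 1
    return matrix


names = ['entailment', 'neutral', 'contradiction']
# Hypothetical predictions and references, for illustration only.
refs = [0, 1, 1, 1, 2, 2]
preds = [0, 0, 0, 1, 2, 0]
for name, row in zip(names, confusion_matrix(preds, refs)):
    print(f'{name:>13}: {row}')
```

If most neutral examples landed in the entailment column, that would point to the wording of the label rather than to the language.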