# CentraleSupelec - Natural language processing
# Practical session n°7

## Natural Language Inference (NLI)

Natural Language Inference (NLI) is a classical NLP (Natural Language Processing) problem: given two sentences (the premise and the hypothesis), decide how they are related (whether the premise *entails* the hypothesis, *contradicts* it, or *neither*).

Ex: 


| Premise | Label | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |

### Stanford NLI (SNLI) corpus

In this labwork, I propose to use the Stanford NLI (SNLI) corpus (https://nlp.stanford.edu/projects/snli/), available in the *Datasets* library by Hugging Face.

    from datasets import load_dataset
    snli = load_dataset("snli")
    #Removing sentence pairs with no label (-1)
    snli = snli.filter(lambda example: example['label'] != -1) 

## Subject

You are asked to provide an operational Jupyter notebook that performs the task of NLI. For that, you need to tackle the following aspects of the problem:

1. Loading and preprocessing the data
2. Designing a PyTorch model that, given two sentences, decides how they are related (*entails*, *contradicts* or *neither*)
3. Training and evaluating the model using appropriate metrics
4. (Optional) Letting users play with the model (forward user sentences and visualize the prediction easily)
5. (Optional) Providing visual insight into the model (e.g. visualizing the attention if your model uses attention)

Although it is not mandatory, I suggest that you use a transformer model to perform the task. For that, you can use the *Transformers* library by Hugging Face.

## Evaluation

The evaluation will be based on several criteria:

- Clarity and readability of the notebook. The notebook is the report of your project. Make it easy and pleasant to read.
- Justification of implementation choices (e.g. the network, the cost function, the optimizer, ...)
- Quality of the code. The various deep learning and NLP labworks provide many examples of good practices for designing experiments with neural networks. Use them as inspirational examples!

## Additional recommendations

- You are not seeking to publish a research paper! I'm not expecting state-of-the-art results! The idea of this labwork is to check that you have acquired the skills necessary to handle textual data using deep neural network techniques.

- This labwork will be evaluated but we are still here to help you! Don't hesitate to request our help if you are stuck.

- If you intend to use BERT-based models, here is a piece of advice. The bert-base-* models available in *Transformers* need more than 12 GB of GPU memory to be fine-tuned. To avoid memory issues, you have several options:

    - Use a lighter BERT-based model such as DistilBERT, ALBERT, ...
    - Train a classification model on top of BERT without fine-tuning it (i.e. freezing the BERT weights)
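
The second option, freezing the encoder and training only a classifier on top, can be sketched in PyTorch as follows. The `TinyEncoder` and `FrozenEncoderClassifier` classes are hypothetical stand-ins for a pretrained BERT-like model, used so that the snippet runs without downloading any weights. The only point being illustrated is the `requires_grad = False` pattern.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder (not a real BERT)."""
    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, input_ids):
        # Mean-pool the token vectors into one vector per sequence
        return torch.tanh(self.proj(self.embed(input_ids))).mean(dim=1)

class FrozenEncoderClassifier(nn.Module):
    def __init__(self, encoder, hidden=16, num_labels=3):
        super().__init__()
        self.encoder = encoder
        # Freeze every encoder parameter: no gradients are computed for them
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        with torch.no_grad():
            features = self.encoder(input_ids)
        return self.classifier(features)

model_sketch = FrozenEncoderClassifier(TinyEncoder())
# Only the classifier head (weight and bias) is handed to the optimizer
trainable = [p for p in model_sketch.parameters() if p.requires_grad]
optimizer_sketch = torch.optim.AdamW(trainable, lr=1e-3)

logits = model_sketch(torch.randint(0, 100, (4, 12)))  # 4 sequences of 12 token ids
```

Because the frozen part needs no gradients or optimizer state, the memory footprint during training is much lower than with full fine-tuning.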

## Hugging Face documentation

In case you want to use the Hugging Face *Datasets* and *Transformers* libraries (which I advise), here are some useful documentation pages:

- Dataset quick tour

    https://huggingface.co/docs/datasets/quicktour.html
    
- Documentation on data preprocessing for transformers

    https://huggingface.co/transformers/preprocessing.html
    
- Transformers quick tour (with a DistilBERT example for classification)

    https://huggingface.co/transformers/quicktour.html
    


## Quick summary of the notebook

This notebook is by Youssef, Othmane and Alain.

- First, we import the corpus and do some visualization
- Second, we apply DistilBERT for sequence classification

In [None]:
import os
import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

import pandas as pd
import numpy as np
import unicodedata
import re
import time
import random
import math

print(torch.__version__)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## 1/Loading the data

In [None]:
from datasets import load_dataset

snli = load_dataset("snli")

#Removing sentence pairs with no label (-1)
snli = snli.filter(lambda example: example['label'] != -1) 

The snli dataset is a dictionary containing the train, test and validation splits:

In [4]:
print(snli)

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9824
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 549367
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9842
    })
})


The first element of the test split is the following:

In [6]:
print(snli['test'][0])

{'premise': 'This church choir sings to the masses as they sing joyous songs from the book at a church.', 'hypothesis': 'The church has cracks in the ceiling.', 'label': 1}
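
The integer `label` follows the SNLI class order exposed by the *Datasets* library: 0 is entailment, 1 is neutral and 2 is contradiction. A small helper makes printed examples easier to read. The names `ID2LABEL` and `describe` are ours and purely illustrative.

```python
# SNLI class order as exposed by the Datasets library
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}

def describe(example):
    """Return a readable one-line summary of an SNLI example dict."""
    return f"{example['premise']} -> [{ID2LABEL[example['label']]}] -> {example['hypothesis']}"

sample = {
    'premise': 'This church choir sings to the masses as they sing joyous songs from the book at a church.',
    'hypothesis': 'The church has cracks in the ceiling.',
    'label': 1,
}
summary = describe(sample)
print(summary)
```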


## 2/Tokenizing

We use the pretrained DistilBERT tokenizer. DistilBERT has no token type (segment) embeddings, so the tokenizer returns only `input_ids` and `attention_mask`.

In [None]:
from transformers import DistilBertTokenizer

# num_labels is a model argument, not a tokenizer argument, so it is not passed here
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def encode(examples):
    return tokenizer(examples['premise'], examples['hypothesis'], truncation=True, padding='max_length')

dataset = snli.map(encode,batched=True)
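
When given two texts, a BERT-style tokenizer packs them into a single sequence of the form `[CLS] premise [SEP] hypothesis [SEP]`, pads it to a fixed length, and returns an attention mask marking the real tokens. The toy function below illustrates that layout only: it splits on whitespace instead of using WordPiece, and the name `toy_pair_encode` and the `max_length` default are assumptions made for the illustration.

```python
def toy_pair_encode(premise, hypothesis, max_length=12):
    """Toy illustration of sentence-pair packing (whitespace split, not WordPiece)."""
    tokens = ["[CLS]"] + premise.lower().split() + ["[SEP]"] + hypothesis.lower().split() + ["[SEP]"]
    # Naive truncation; a real tokenizer truncates the sentences and keeps the special tokens
    tokens = tokens[:max_length]
    attention_mask = [1] * len(tokens)       # 1 = real token
    padding = max_length - len(tokens)
    tokens += ["[PAD]"] * padding            # pad up to max_length
    attention_mask += [0] * padding          # 0 = padding, ignored by attention
    return {"tokens": tokens, "attention_mask": attention_mask}

encoded = toy_pair_encode("Two men smile", "Some people are happy")
print(encoded["tokens"])
print(encoded["attention_mask"])
```

With `padding='max_length'` every example is padded to the same length, which keeps batching simple at the cost of processing many padding positions.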


## 3/Formatting the dataset

We modify the dataset to be able to train:
- Copying the 'label' column to 'labels', the argument name expected by DistilBERT
- Using `.set_format` to return PyTorch tensors
- Keeping only the columns ['input_ids', 'attention_mask', 'labels'], which are the ones used by DistilBERT
- Creating the PyTorch dataloaders

In [None]:
dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True)

In [None]:
import torch 

dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


train_dataset = dataset["train"]
test_dataset = dataset["test"]
validation_dataset = dataset["validation"]

batch_size = 10
# shuffle the training data so that batches differ from one epoch to the next
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)
validation_dataloader = torch.utils.data.DataLoader(validation_dataset, batch_size=batch_size)

example = next(iter(train_dataloader))

In [None]:
print(example)

## 4/The model

We use a pretrained DistilBERT model with a sequence classification head (3 output labels).

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification


model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)


## 5/Metrics

The loss function is the cross-entropy between the predictions and the labels; the model computes it itself when `labels` are passed.

We add an accuracy score to further evaluate the classification model.

In [None]:
from sklearn.metrics import accuracy_score

def metrics(loss, output_logits, labels):
    
    loss = loss.clone().detach()
    logits = output_logits.clone().detach()
    labels = labels.clone().detach()
    
    predictions = torch.argmax(nn.functional.softmax(logits, dim=-1), dim=1) # computing the prediction from the logits
    
    # tensors need to be on the cpu to convert to numpy
    loss = loss.cpu().numpy()
    labels = labels.cpu().numpy()
    predictions = predictions.cpu().numpy()
    
    loss = loss.item()
    acc = accuracy_score(labels, predictions)
    
    return loss, acc
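
Accuracy computed on a single batch of 10 examples is noisy. One way to get a figure for a whole split is to accumulate the number of correct predictions and the number of examples over all batches, then divide once at the end. The sketch below uses plain Python lists in place of tensors, and the class name `RunningAccuracy` is hypothetical.

```python
class RunningAccuracy:
    """Accumulates correct/total counts across batches (hypothetical helper)."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, batch_logits, batch_labels):
        for logits, label in zip(batch_logits, batch_labels):
            predicted = max(range(len(logits)), key=lambda i: logits[i])  # argmax over classes
            self.correct += int(predicted == label)
            self.total += 1

    def compute(self):
        return self.correct / self.total if self.total else 0.0

running = RunningAccuracy()
running.update([[2.0, 0.1, -1.0], [0.0, 0.5, 0.2]], [0, 2])  # 1 of 2 correct
running.update([[-0.3, 0.1, 1.4]], [2])                      # 1 of 1 correct
epoch_accuracy = running.compute()
print(epoch_accuracy)
```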

## 6/Training

We train the model by iterating over the training set for a few epochs, and evaluate on the validation set after each epoch.

In [None]:
from transformers import get_linear_schedule_with_warmup
import tqdm as tqdm

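device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available
epochs = 3
warmup_steps = 4
# the scheduler needs the total number of optimizer steps; with a smaller value the learning rate reaches 0 too early
train_steps = epochs * len(train_dataloader)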

model.to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, train_steps)

print('training the model')

for epoch in range(epochs):
    
    model.train()

    for batch in tqdm.tqdm(train_dataloader):
        
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # forward :
        outputs = model(**batch)
        
        loss = outputs.loss
        
        # back propagate :
        loss.backward()
        
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()


    print('metrics train (last batch only): loss and accuracy')
    display_variables = metrics(loss, outputs.logits, batch['labels'])
    print(display_variables)

    model.eval()
    
    with torch.no_grad():
        for batch in tqdm.tqdm(validation_dataloader):

            batch = {k: v.to(device) for k, v in batch.items()}

            # forward :
            outputs = model(**batch)

            loss = outputs.loss
        
    print('metrics validation (last batch only): loss and accuracy')
    display_variables = metrics(loss, outputs.logits, batch['labels'])
    print(display_variables)
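
A linear schedule with warmup multiplies the base learning rate by a factor that rises linearly from 0 to 1 during the warmup steps, then decays linearly to 0 at the final training step. The function below is our own re-implementation of that shape for illustration (it is not the library code). It shows why the declared total should cover the whole run, typically `epochs * len(train_dataloader)`: once the step counter passes the declared total, the factor stays at 0 and the weights stop changing.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier at a given optimizer step (illustrative re-implementation)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay from 1 to 0

base_lr = 1e-5
total = 100
for step in (0, 5, 10, 55, 100, 150):
    print(step, base_lr * linear_warmup_factor(step, warmup_steps=10, total_steps=total))
```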
    
    

## 7/Our own hypothesis and premise

We build a function that takes our own hypothesis and premise and returns the predicted relation together with the class probabilities.

In [None]:
from datasets import Dataset

def prediction(hypothesis, premise):
    # build a one-example dataset and tokenize it with the same `encode` function as the training data
    input_dict = {'hypothesis': [hypothesis], 'premise': [premise]}
    dataset = Dataset.from_dict(input_dict)
    dataset = dataset.map(encode, batched=True, batch_size=1)
    dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=1)
    
    model.eval()
    with torch.no_grad():
        batch = next(iter(dataloader))
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)

        logits = outputs.logits
        proba = nn.functional.softmax(logits, dim=-1)  # class probabilities
        predictions = torch.argmax(proba, dim=1)       # index of the most likely class

        proba = proba.cpu().numpy()[0]
        predictions = predictions.cpu().numpy().item()

    # SNLI label order: 0 = entailment, 1 = neutral, 2 = contradiction
    idx_to_labels = {0: "entails", 1: "is neutral regarding", 2: "contradicts"}
    out = f"The premise ['{premise}'] {idx_to_labels[predictions]} the hypothesis ['{hypothesis}']."
    
    return out, proba

In [None]:
result, probs = prediction("I am fast", "I run a lot")
print(result)
print(probs)
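
The probabilities returned by the function come from a softmax over the three logits. As a reminder of what that computes, here is a dependency-free sketch. The logit values are made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["entailment", "neutral", "contradiction"]
example_logits = [1.2, 0.3, -2.0]  # made-up values
example_probs = softmax(example_logits)
best = max(range(len(example_probs)), key=lambda i: example_probs[i])
print(labels[best], [round(p, 3) for p in example_probs])
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether the argmax is taken over the logits or over the probabilities.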