Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# Sentence Representations

In a previous notebook, we have explored different ways of leveraging word embeddings to come up with a representation for an input sequence. Given a sequence of words (tokens), we can get the embeddings for each word in the sequence and either concatenate them, or use some kind of aggregation function such as taking the mean or element-wise max of the embeddings.
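
As a quick reminder of those aggregation options, here is a minimal sketch on toy data. The 4-dimensional vectors and the three example words are made up for illustration; real word embeddings would come from a pretrained model.

```python
import numpy as np

# Toy 4-dimensional embeddings for a 3-token sequence (made-up values).
token_embeddings = np.array([
    [0.1, 0.4, -0.2, 0.0],   # "the"
    [0.7, -0.1, 0.3, 0.5],   # "food"
    [0.2, 0.6, 0.1, -0.3],   # "rocks"
])

concatenated = token_embeddings.reshape(-1)   # shape (12,): grows with sequence length
mean_pooled = token_embeddings.mean(axis=0)   # shape (4,): fixed size
max_pooled = token_embeddings.max(axis=0)     # shape (4,): element-wise max

print(concatenated.shape, mean_pooled.shape, max_pooled.shape)
print(max_pooled)
```

Note that concatenation yields a vector whose size depends on the sequence length, whereas mean and max pooling always return a vector of the embedding size.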

In the notebook on Transformer models, we have seen how to use a full pretrained BERT-based model as is, or even how to fine-tune the whole model to our task.

In this notebook, we explore alternative ways of coming up with richer sentence representations by leveraging language models, while avoiding the need to fine-tune the full BERT-based model. As mentioned in the [BERT paper](https://arxiv.org/abs/1810.04805), a *feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages*. One of them is the computational benefit of pre-computing an expensive representation of the data once and then running several experiments on top of it with computationally cheaper models.
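
The pre-computation idea can be sketched as a simple cache. This is only an illustration: `extract_features` is a hypothetical stand-in for an expensive pretrained encoder (here it just returns random numbers), and `load_or_compute_features` and the file name are likewise made up.

```python
import os
import tempfile
import numpy as np

def extract_features(texts):
    """Hypothetical stand-in for an expensive pretrained encoder."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

def load_or_compute_features(texts, cache_path):
    """Compute the features once, then reuse the copy saved on disk."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    features = extract_features(texts)
    np.save(cache_path, features)
    return features

texts = ["great food", "slow service", "would come back"]
cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")

first = load_or_compute_features(texts, cache_path)    # computed and saved
second = load_or_compute_features(texts, cache_path)   # loaded from disk
print(first.shape, np.array_equal(first, second))
```

Once the features are on disk, any number of lightweight classifiers can be trained on them without running the large model again.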

## The dataset and evaluation setup

We will be comparing the effect of using different sentence representations for the same text classification task. For that, we start by loading our dataset:

In [None]:
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)  # quoting=3 is csv.QUOTE_NONE

dataset.head()

To make it easy to test different sentence representations, let's define a generic function that takes the features representing each text entry and the output labels, splits the data into training and test sets (stratified, 80/20), trains a logistic regression classifier on the training set, and prints results on the test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def evaluate_feature_representation(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)

    #print(X_train.shape, y_train.shape)
    #print(X_test.shape, y_test.shape)

    #print("\nLabel distribution in the training set:")
    #print(y_train.value_counts())

    #print("\nLabel distribution in the test set:")
    #print(y_test.value_counts())
    
    clf = LogisticRegression(max_iter=1000)  # a higher iteration cap helps the solver converge on dense embedding features
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(confusion_matrix(y_test, y_pred))
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred))
    print('Recall: ', recall_score(y_test, y_pred))
    print('F1: ', f1_score(y_test, y_pred))
    
    return
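
To help read the numbers printed by this function, the sketch below recomputes the same metrics by hand on made-up labels. It assumes the binary case with 1 as the positive class, and follows scikit-learn's confusion matrix layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Made-up true and predicted labels for 10 examples.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives

matrix = np.array([[tn, fp], [fn, tp]])           # rows: true label, columns: predicted label
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(accuracy, precision, recall, round(f1, 4))
```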

## BERT embeddings

We can make use of BERT's internal representation of the input sequence as features. Let's start by loading a BERT model.

In [None]:
model_name = "distilbert-base-uncased"

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModel

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

### Using the CLS token

BERT models add a special CLS token to the beginning of the input sequence. In the model's final hidden state, this token's representation is used as an aggregate sequence representation for classification tasks.

Let's see what we get from the representation of the CLS token for a specific example.

In [None]:
print(dataset['Review'][0])
inputs = tokenizer(dataset['Review'][0], padding=True, truncation=True, return_tensors="pt")
print(inputs['input_ids'])
#print(inputs['input_ids'].shape)

As you can see, the text has been tokenized to 10 tokens, including the special [CLS] (101) and [SEP] (102) tokens.

We now pass the input through BERT and obtain the last hidden state of the model.
(Note: to inspect all hidden states via *outputs.hidden_states*, either load the model with the *output_hidden_states=True* option or pass that argument when calling the model.)

In [None]:
outputs = model(**inputs)
#print(outputs.last_hidden_state)   # or outputs["last_hidden_state"]
print(outputs.last_hidden_state.shape)

The embedding size is, in this case, 768, so we have a tensor with dimensions 1x10x768. To get the CLS token embedding, we access the first position along the token axis.

In [None]:
print(outputs.last_hidden_state[0][0].shape)
outputs.last_hidden_state[0][0]   # the CLS token is the first one
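
The indexing above can be checked on a stand-in array of the same shape. This sketch uses random NumPy data instead of real model output, so only the shapes are meaningful; the slice form `[:, 0, :]` is shown as an alternative that also works when a batch contains several sequences.

```python
import numpy as np

# Random stand-in for outputs.last_hidden_state: (batch, tokens, hidden size).
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(1, 10, 768))

cls_vector = last_hidden_state[0][0]     # first sequence, first token
cls_batch = last_hidden_state[:, 0, :]   # first token of every sequence in the batch

print(cls_vector.shape)   # (768,)
print(cls_batch.shape)    # (1, 768)
```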

Now, we can get the CLS token embeddings for every review. For that, we need to convert each tensor object into a numpy.ndarray by using the *numpy()* method.

In [None]:
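import numpy as np
import torch

# Collect one CLS vector per review; no_grad() skips gradient tracking, which saves memory and time
features = []
with torch.no_grad():
    for rev in dataset['Review']:
        inputs = tokenizer(rev, padding=True, truncation=True, return_tensors="pt")
        outputs = model(**inputs)
        features.append(outputs.last_hidden_state[0][0].numpy())   # CLS token embedding
X = np.array(features)   # shape: (number of reviews, 768)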

# 1-liner:
#X = np.array([ 
#    model(**tokenizer(rev, padding=True, truncation=True, return_tensors="pt")).last_hidden_state[0][0].detach().numpy()
#    for rev in dataset['Review']
#    ])

We get the labels and check the shape of the feature matrix. Each input element should have 768 features (the dimension of the encoder layers in DistilBERT, which is also the hidden size of the BERT base model).

In [None]:
y = dataset['Liked']
print(X.shape, y.shape)

Let's see how this representation fares with our generic evaluation function, which trains and tests a classifier based on the representation we provide to it.

In [None]:
evaluate_feature_representation(X, y)

### Averaging over token embeddings

Alternatively, we can average the embeddings of all tokens in the last hidden state. In fact, even though the [BERT](https://arxiv.org/abs/1810.04805) paper suggests using the CLS token as a representation of the input sequence for classification tasks, in some cases averaging across token embeddings yields better performance. Can you try it out?
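
As a hint, here is a minimal sketch of the averaging step on toy NumPy arrays. The arrays `hidden` and `attention_mask` are made-up stand-ins for `outputs.last_hidden_state` and `inputs['attention_mask']`. The mask only matters when several sequences are padded to a common length; for a single unpadded sentence, a plain mean over the token axis gives the same result.

```python
import numpy as np

# Toy batch: 2 sequences, 4 token positions, hidden size 3 (made-up values).
hidden = np.array([
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]],
    [[4.0, 0.0, 2.0], [0.0, 4.0, 6.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],
])
# 1 marks a real token, 0 marks padding.
attention_mask = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
])

mask = attention_mask[:, :, None].astype(hidden.dtype)   # (2, 4, 1)
summed = (hidden * mask).sum(axis=1)                      # padded positions contribute zero
counts = mask.sum(axis=1)                                 # number of real tokens per sequence
mean_pooled = summed / counts                             # (2, 3)

print(mean_pooled)
```

The values at padded positions (the 9s in the second sequence) have no effect on the result.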

In [None]:
# your code here


## SBERT (SentenceTransformers)

Several other sentence representation models exist, and here we explore the usage of [SentenceTransformers](https://www.sbert.net/). Although this framework was built with semantic similarity tasks in mind, these representations can also be used for text classification tasks, as evidenced in the [original paper](https://arxiv.org/abs/1908.10084).

SBERT modifies the BERT network with siamese and triplet network structures so that it produces semantically meaningful sentence embeddings. The original models were trained on Natural Language Inference data ([SNLI](https://nlp.stanford.edu/projects/snli/) and MultiNLI).
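
To make the triplet idea mentioned above concrete, here is a minimal sketch of a triplet loss on toy vectors. It assumes Euclidean distance and a margin of 1; the function name and the example vectors are made up, and this is not the SentenceTransformers implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Push the anchor closer to the positive than to the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(d_pos - d_neg + margin, 0.0))

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])         # distance 1 from the anchor
easy_negative = np.array([3.0, 4.0])    # distance 5 from the anchor
hard_negative = np.array([0.0, 1.5])    # distance 1.5 from the anchor

print(triplet_loss(anchor, positive, easy_negative))   # 0.0: already separated by more than the margin
print(triplet_loss(anchor, positive, hard_negative))   # 0.5: negative is too close
```

In a siamese setup, the anchor, positive, and negative would all be encoded by the same network with shared weights.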

### Comparing BERT and SBERT representations

To compare the representations obtained by BERT and those provided by SentenceTransformers, we can see how similar those representations are for a few sentences.

In [None]:
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The quick brown fox jumps over the lazy dog."
    ]

Let's start with BERT, making use of a utility provided by SentenceTransformers to compute cosine similarities.
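
For reference, a pairwise cosine similarity matrix can be sketched in plain NumPy as follows. The function name and toy vectors are made up for illustration; the cells below rely on the library utility instead.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
sims = cosine_similarity_matrix(vectors, vectors)
print(np.round(sims, 3))
```

Entry `[i, j]` compares vector `i` with vector `j`, so the diagonal is always 1 and the matrix is symmetric when both arguments are the same.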

In [None]:
from sentence_transformers import util

embeddings = np.empty([0,768])
for s in sentences:
    inputs = tokenizer(s, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)
    embeddings = np.append(embeddings, [outputs.last_hidden_state[0][0].detach().numpy()], axis=0)

cos_sim = util.cos_sim(embeddings, embeddings)
cos_sim

As you can see, all sentence representation pairs have very high cosine similarities. 
This can be somewhat alleviated by averaging across the embeddings for all tokens in the last hidden state, but the sentences will still have an unexpectedly high cosine similarity.

Let's now load a SentenceTransformer model and see what it gives us.

In [None]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

With SentenceTransformers, we simply encode all our sentences in a single step.

In [None]:
sbert_embeddings = sbert_model.encode(sentences)

cos_sim = util.cos_sim(sbert_embeddings, sbert_embeddings)
cos_sim

### Using SBERT embeddings for classification

We now use SentenceTransformer embeddings for our classification problem. For that, we need to encode the reviews in the dataset. You will find that this step is much faster than doing it using BERT.
Then, we can use our generic function to train and test a classifier by passing it the reviews' embeddings and the labels.

In [None]:
# your code here


## SimCSE: Simple Contrastive Learning of Sentence Embeddings

[SimCSE](https://github.com/princeton-nlp/SimCSE) is another recent approach that trains a BERT-based model using contrastive learning.
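
To give an intuition for contrastive learning, here is a minimal NumPy sketch of a contrastive loss with in-batch negatives. It assumes cosine similarity and a temperature of 0.05; the function name and the random toy embeddings are made up, and this is not the SimCSE training code.

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.05):
    """Cross-entropy over similarities: row i should match column i, other columns act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                         # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)      # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))                     # toy sentence embeddings
aligned = anchors + 0.01 * rng.normal(size=(4, 16))    # each positive is close to its anchor
shuffled = aligned[[1, 2, 3, 0]]                       # positives paired with the wrong anchors

loss_aligned = contrastive_loss(anchors, aligned)
loss_shuffled = contrastive_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is low when each embedding is most similar to its own positive and high when the pairs are mismatched, which is what drives the encoder to pull matching sentences together and push the others apart.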

Let's load a SimCSE model and see what it gives us as sentence representations.

In [None]:
from simcse import SimCSE
simcse_model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

We can easily obtain sentence embeddings:

In [None]:
simcse_model.encode(sentences)

SimCSE's API also allows us to obtain similarity scores directly from the source sentences.

In [None]:
simcse_model.similarity(sentences, sentences)

Compare these with those obtained using SBERT.

### Using SimCSE embeddings for classification

We now use SimCSE embeddings for our classification problem.

In [None]:
# your code here
