# Language Modelling: ngram models, distributional semantics and word embeddings

## 1. Traditional Language Modelling (n-gram models)

In [None]:
!pip install -U pip
!pip install -U dill
!pip install -U nltk==3.8

Nowadays, everything seems to be going neural... 

Traditionally, we can use n-grams to generate language models to predict which word comes next given a history of words. 

We'll use the `lm` module in `nltk` to get a sense of how non-neural language modelling is done.

(**Source:** The content in this notebook is largely based on [language model tutorial in NLTK documentation by Ilia Kurenkov](https://github.com/nltk/nltk/blob/develop/nltk/lm/__init__.py))

In [None]:
from nltk.util import bigrams
from nltk.util import ngrams

If we want to train a bigram model, we need to turn this text into bigrams. Here's what the first sentence of our text would look like if we use the `ngrams` function from NLTK for this.

In [None]:
text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

In [None]:
list(bigrams(text[0]))

In [None]:
list(ngrams(text[1], n=3))

Notice how "b" occurs both as the first and second member of different bigrams but "a" and "c" don't? 

Wouldn't it be nice to somehow indicate how often sentences start with "a" and end with "c"?


A standard way to deal with this is to add special "padding" symbols to the sentence before splitting it into ngrams. Fortunately, NLTK also has a function for that. Let's see what it does to the first sentence.


In [None]:
from nltk.util import pad_sequence
list(pad_sequence(text[0],
                  pad_left=True, left_pad_symbol="<s>",
                  pad_right=True, right_pad_symbol="</s>",
                  n=2)) # The n order of n-grams, if it's 2-grams, you pad once, 3-grams pad twice, etc. 

In [None]:
padded_sent = list(pad_sequence(text[0], pad_left=True, left_pad_symbol="<s>", 
                                pad_right=True, right_pad_symbol="</s>", n=2))
list(ngrams(padded_sent, n=2))

In [None]:
list(pad_sequence(text[0],
                  pad_left=True, left_pad_symbol="<s>",
                  pad_right=True, right_pad_symbol="</s>",
                  n=3)) # The n order of n-grams, if it's 2-grams, you pad once, 3-grams pad twice, etc. 

In [None]:
padded_sent = list(pad_sequence(text[0], pad_left=True, left_pad_symbol="<s>", 
                                pad_right=True, right_pad_symbol="</s>", n=3))
list(ngrams(padded_sent, n=3))

Note the `n` argument: it tells the function which n-gram order to pad for, so `n=2` adds one padding symbol on each side and `n=3` adds two.
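
To see what padding and n-gram extraction amount to, here is a minimal pure-Python sketch. The helper name `pad_and_ngrams` is made up for illustration and is not part of NLTK; it only assumes that we pad with `n - 1` symbols on each side and then slide a window of size `n`.

```python
def pad_and_ngrams(tokens, n, left="<s>", right="</s>"):
    """Pad with n-1 boundary symbols per side, then slide a window of size n.

    Hypothetical helper for illustration only, not an NLTK function.
    """
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


bigrams_abc = pad_and_ngrams(["a", "b", "c"], n=2)
trigrams_abc = pad_and_ngrams(["a", "b", "c"], n=3)
print(bigrams_abc)
print(trigrams_abc)
```

With padding, "a" now shows up as the second member of a bigram starting with `<s>`, which is how the model can learn how often sentences start with "a".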

Now, passing all these parameters every time is tedious and in most cases they can be safely assumed as defaults anyway.

Thus the `nltk.lm` module provides a convenience function that has all these arguments already set while the other arguments remain the same as for `pad_sequence`.

In [None]:
from nltk.lm.preprocessing import pad_both_ends
list(pad_both_ends(text[0], n=2))


Combining the two parts discussed so far we get the following preparation steps for one sentence.

In [None]:
list(bigrams(pad_both_ends(text[0], n=2)))

To make our model more robust we could also train it on unigrams (single words) as well as bigrams, its main source of information.
NLTK once again helpfully provides a function called `everygrams`.

While not the most efficient, it is conceptually simple.

In [None]:
from nltk.util import everygrams
padded_bigrams = list(pad_both_ends(text[0], n=2))
list(everygrams(padded_bigrams, max_len=2))

We are almost ready to start counting ngrams, just one more step left.

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model.

To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.


In [None]:
from nltk.lm.preprocessing import flatten
list(flatten(pad_both_ends(sent, n=2) for sent in text))

In most cases we want to use the same text as the source for both vocabulary and ngram counts.

Now that we understand what this means for our preprocessing, we can simply import a function that does everything for us.

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline
train, vocab = padded_everygram_pipeline(2, text)

So as to avoid re-creating the text in memory, both `train` and `vocab` are lazy iterators. They are evaluated on demand at training time.

For the sake of understanding the output of `padded_everygram_pipeline`, we'll "materialize" the lazy iterators by casting them into a list.

In [None]:
training_ngrams, padded_sentences = padded_everygram_pipeline(2, text)
for ngramlize_sent in training_ngrams:
    print(list(ngramlize_sent))
    print()
print('#############')
list(padded_sentences)
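
One practical consequence of laziness: an iterator can only be consumed once. The toy generator below (nothing to do with NLTK, just plain Python) illustrates why the cells above call `padded_everygram_pipeline` again each time they need fresh data.

```python
def lazy_squares(limit):
    """A generator: values are produced on demand, and only once."""
    for i in range(limit):
        yield i * i


stream = lazy_squares(4)
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing is left to yield
print(first_pass, second_pass)
```

So if you materialize `train` or `vocab` to inspect them, re-create them before fitting a model.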

### Let's get some real data and tokenize it

In [None]:
# NB: You need to run nltk.download(); a window will appear. Go to the models tab and install the punkt_tokenizer model, then close the window.
# If you don't do this, the code below will fall back on a very trivial regex-based tokenization.

try: # Use the default NLTK tokenizer.
    from nltk import word_tokenize, sent_tokenize 
    # Testing whether it works. 
    # Sometimes it doesn't work on some machines because of setup issues.
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
except Exception: # Use a naive sentence tokenizer and toktok.
    import re
    from nltk.tokenize import ToktokTokenizer
    # See https://stackoverflow.com/a/25736515/610569
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    # Use the toktok tokenizer that requires no dependencies.
    toktok = ToktokTokenizer()
    word_tokenize = toktok.tokenize

In [None]:
import os
import requests
import io #codecs


# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

In [None]:
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                  for sent in sent_tokenize(text)]

In [None]:
tokenized_text[0]

In [None]:
print(text[:500])

In [None]:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

### Training an N-gram Model

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE).

We only need to specify the highest ngram order to instantiate it.

In [None]:
from nltk.lm import MLE
model = MLE(n) # Let's train a 3-gram model; we set n=3 above

Initializing the MLE model creates an empty vocabulary...

In [None]:
len(model.vocab)

... which gets filled as we fit the model.

In [None]:
model.fit(train_data, padded_sents)
print(model.vocab)

In [None]:
len(model.vocab)

The vocabulary helps us handle words that have not occurred during training.

In [None]:
print(model.vocab.lookup(tokenized_text[0]))

In [None]:
# If we look up unseen sentences that are not from the training data,
# the vocabulary automatically replaces words it does not know with `<UNK>`.
print(model.vocab.lookup('language is never random lah .'.split()))

Moreover, in some cases we want to ignore words that we did see during training but that didn't occur frequently enough to provide us with useful information.

You can tell the vocabulary to ignore such words using its `unk_cutoff` argument. To find out how that works, check out the docs for the [`nltk.lm.vocabulary.Vocabulary` class](https://github.com/nltk/nltk/blob/develop/nltk/lm/vocabulary.py).
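
The idea behind a count cutoff can be sketched in a few lines of plain Python. The `lookup` function below is a hypothetical stand-in, not NLTK's implementation; it assumes a token is kept when its training count is at least the cutoff.

```python
from collections import Counter


def lookup(tokens, counts, unk_cutoff=1, unk_label="<UNK>"):
    """Replace tokens whose training count is below the cutoff with the unknown label.

    Illustrative only; assumes "count >= cutoff" means the token is known.
    """
    return [tok if counts[tok] >= unk_cutoff else unk_label for tok in tokens]


counts = Counter("a b a c a b d".split())  # a: 3, b: 2, c: 1, d: 1
default_lookup = lookup(["a", "b", "c", "z"], counts)
strict_lookup = lookup(["a", "b", "c", "z"], counts, unk_cutoff=2)
print(default_lookup)
print(strict_lookup)
```

Raising the cutoff trades vocabulary size for more reliable counts: rare words are pooled into a single `<UNK>` bucket.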

**Note:** For more sophisticated ngram models, take a look at [these objects from `nltk.lm.models`](https://github.com/nltk/nltk/blob/develop/nltk/lm/models.py):

 - `Lidstone`: Provides Lidstone-smoothed scores.
 - `Laplace`: Implements Laplace (add one) smoothing.
 - `InterpolatedLanguageModel`: Logic common to all interpolated language models (Chen & Goodman 1995).
 - `WittenBellInterpolated`: Interpolated version of Witten-Bell smoothing.
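
To give a feel for what the simplest of these smoothers do, here is a hedged sketch of additive (Lidstone) smoothing on toy counts. The function name, the counts and the four-word vocabulary are all invented for illustration; this is the textbook formula, not NLTK's code.

```python
from collections import Counter


def lidstone_score(word, context_counts, vocab_size, gamma=1.0):
    """(count + gamma) / (total + gamma * V); gamma=1 gives Laplace (add-one) smoothing."""
    total = sum(context_counts.values())
    return (context_counts[word] + gamma) / (total + gamma * vocab_size)


# Toy counts of the words seen after "language" (made-up numbers).
following_language = Counter({"is": 3, "model": 1})
toy_vocab = ["is", "model", "never", "random"]

p_seen = lidstone_score("is", following_language, len(toy_vocab))
p_unseen = lidstone_score("never", following_language, len(toy_vocab))
p_total = sum(lidstone_score(w, following_language, len(toy_vocab)) for w in toy_vocab)
print(p_seen, p_unseen, p_total)
```

Unlike MLE, an unseen continuation gets a small non-zero probability, and the scores over the vocabulary still sum to one.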

### Using the N-gram Language Model

When it comes to ngram models the training boils down to counting up the ngrams from the training corpus.

In [None]:
print(model.counts)

This provides a convenient interface to access counts for unigrams...

In [None]:
model.counts['language'] # i.e. Count('language')

...and bigrams for the phrase "language is"

In [None]:
model.counts[['language']]['is'] # i.e. Count('is'|'language')

... and trigrams for the phrase "language is never"

In [None]:
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')

And so on. However, the real purpose of training a language model is to have it score how probable words are in certain contexts.

This being MLE, the model returns the item's relative frequency as its score.

In [None]:
model.score('language') # P('language')

In [None]:
model.score('is', 'language'.split())  # P('is'|'language')

In [None]:
model.score('never', 'language is'.split())  # P('never'|'language is')
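
Relative frequency is simple enough to compute by hand. The sketch below does it for bigrams on the toy `text` from the start of this notebook, using only the standard library; the helper names are hypothetical and the code is not how NLTK stores its counts.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count, for each previous token, how often each next token follows it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            counts[prev][word] += 1
    return counts


def mle_score(counts, word, prev):
    """Relative frequency: Count(prev, word) / Count(prev, anything)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


toy_counts = train_bigram_counts([["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]])
p_b_given_a = mle_score(toy_counts, "b", "a")
p_d_given_c = mle_score(toy_counts, "d", "c")
p_a_given_start = mle_score(toy_counts, "a", "<s>")
print(p_b_given_a, p_d_given_c, p_a_given_start)
```

"a" is followed once by "b" and once by "c", hence a score of one half; anything never seen after a context scores zero, which is the main weakness of plain MLE.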

Items that are not seen during training are mapped to the vocabulary's "unknown label" token. This is `<UNK>` by default.


In [None]:
model.score("<UNK>") == model.score("lah")

In [None]:
model.score("<UNK>") == model.score("leh")

In [None]:
model.score("<UNK>") == model.score("lor")

To avoid underflow when working with many small score values it makes sense to take their logarithm. 

For convenience this can be done with the `logscore` method.


In [None]:
model.logscore("never", "language is".split())
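
The underflow problem is easy to reproduce. In the sketch below the probabilities are made up, and base-2 logarithms are used (which, as far as I know, is also what `nltk.lm` uses for `logscore`).

```python
import math

probs = [1e-5] * 80  # 80 small conditional probabilities (invented values)

product = 1.0
for p in probs:
    product *= p  # 1e-400 is below the smallest positive float, so this ends up as 0.0

log_total = sum(math.log2(p) for p in probs)  # the sum of logs stays finite
print(product, log_total)
```

Once the product has collapsed to zero, every sequence looks equally impossible, whereas the log scores can still be compared.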

### Generation using N-gram Language Model

One cool feature of ngram models is that they can be used to generate text.

In [None]:
print(model.generate(20, random_seed=7))
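
Under the hood, generation boils down to repeatedly sampling the next word from the counts of the current context. Here is a rough bigram-only sketch of that idea on the toy corpus from the beginning of the notebook; it is not NLTK's implementation, and all names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count, for each previous token, how often each next token follows it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            counts[prev][word] += 1
    return counts


def generate(counts, max_words, seed=0):
    """Sample words one at a time, each in proportion to its count after the previous word."""
    rng = random.Random(seed)
    prev, out = "<s>", []
    for _ in range(max_words):
        followers = counts[prev]
        words = list(followers)
        weights = [followers[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "</s>":  # the model decided the sentence is over
            break
        out.append(word)
        prev = word
    return out


toy_counts = train_bigram_counts([["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]])
sample = generate(toy_counts, max_words=10, seed=7)
print(sample)
```

Every adjacent pair in the sample is a bigram that occurred in training, which is why n-gram output looks locally fluent but globally incoherent.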

We can clean up the generated tokens so that the output reads more like natural text.

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.models`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [None]:
generate_sent(model, 20, random_seed=7)

In [None]:
print(model.generate(28, random_seed=0))

In [None]:
generate_sent(model, 28, random_seed=0)

In [None]:
generate_sent(model, 20, random_seed=1)

In [None]:
generate_sent(model, 20, random_seed=30)

In [None]:
generate_sent(model, 20, random_seed=42)

### Saving the model 

Python's built-in `pickle` may fail to serialize the lambda functions stored in the model, so we use the `dill` library in its place to save and load the language model.


In [None]:
import dill as pickle 

with open('kilgariff_ngram_model.pkl', 'wb') as fout:
    pickle.dump(model, fout)

In [None]:
with open('kilgariff_ngram_model.pkl', 'rb') as fin:
    model_loaded = pickle.load(fin)

In [None]:
generate_sent(model_loaded, 20, random_seed=42)
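
If you are curious why the standard library struggles here, this small experiment shows `pickle` refusing to serialize a lambda. The lambda below is just a stand-in for the anonymous functions mentioned above.

```python
import pickle

square = lambda x: x * x  # stand-in for an anonymous function stored inside an object

try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    # pickle stores functions by importable name, and a lambda has none
    lambda_pickled = False

print(lambda_pickled)
```

`dill` works around this by serializing the function body itself rather than a reference to its name.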

### Let's try some generation with Donald Trump data!


**Dataset:** https://www.kaggle.com/kingburrito666/better-donald-trump-tweets#Donald-Tweets!.csv


In this part, I'll be munging the data the way I would at work. 
I haven't really seen the data before, but I hope this session will help you see how to approach new datasets with the skills you have.

In [None]:
import pandas as pd
df = pd.read_csv('trump_tweets.csv')
df.head()

In [None]:
trump_corpus = list(df['Tweet_Text'].apply(word_tokenize))

In [None]:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, trump_corpus)

In [None]:
from nltk.lm import MLE
trump_model = MLE(n) # Let's train a 3-gram model; we set n=3 above
trump_model.fit(train_data, padded_sents)

In [None]:
generate_sent(trump_model, num_words=20, random_seed=42)

In [None]:
generate_sent(trump_model, num_words=10, random_seed=0)

In [None]:
generate_sent(trump_model, num_words=50, random_seed=10)

In [None]:
print(generate_sent(trump_model, num_words=100, random_seed=52))

## 2. Latent Word Models (aka. Distributional Semantics, aka. Word Embeddings)

The idea of word embeddings was born in the 1990s with the advent of so-called distributional
semantics methods, which are inspired by the distributional hypothesis (Firth, 1957).

"A word is known by the company it keeps" 

### Latent Semantic Analysis 
In IR, we are interested in building document models, that is, the distributions of word occurrences in
documents. In that context, LSA consists of building a term-document matrix, weighting it with tf-idf or Okapi BM25, and then 
reducing its dimensionality with SVD or PCA to obtain a latent document model. 

As far as surface-level semantics is concerned (lexical semantics), we want to capture the contexts 
in which words appear (we count co-occurrences between words in a particular context). To do so,
instead of computing word-document matrices, we compute word-word matrices representing the co-occurrences.

These co-occurrences are computed within a sliding window over the text (a few words around the target 
word, or a whole sentence), on the basis of either co-occurrence frequency or the frequency of dependency relations between the words. 

Once we have computed a word-word (or term-term) matrix, we can normalize it using tf-idf and use a dimensionality 
reduction technique to obtain compact latent representations. 

We can compute a sparse word-word co-occurrence matrix weighted with tf-idf in Python as follows, using NLTK for tokenization. Here the documents are the sentences; the twist is that the tf calculation is based on co-occurrences within the window.

In [None]:
import nltk
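import scipy.sparse  # import the sparse submodule explicitly; older SciPy versions do not load it with a plain `import scipy`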
import math
def create_cooccurrence_matrix(sentences, window_size=5, use_tfidf=True):
    vocabulary = {}
    token_idfs = []
    data = []
    row = []
    col = []

    tokenizer = nltk.tokenize.word_tokenize  # We could also use SpaCy here

    for sentence in sentences:
        sentence = sentence.strip()
        tokens = [token for token in tokenizer(sentence) if token != u""]

        for pos, token in enumerate(tokens):
            i = vocabulary.setdefault(token, len(vocabulary))
            start = max(0, pos - window_size)  # window start: current position - window size
            end = min(len(tokens), pos + window_size + 1)  # window end: current position + window size
            for pos2 in range(start, end):  # Sliding over the window and counting
                if pos2 == pos:
                    continue
                j = vocabulary.setdefault(tokens[pos2], len(vocabulary))
                data.append(1.)
                row.append(i)
                col.append(j)


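    cooccurrence_matrix_sparse = scipy.sparse.coo_matrix((data, (row, col)))  # Transforming list into sparse matrix
    cooccurrence_matrix_sparse.sum_duplicates()  # Merge repeated (i, j) entries so each stored value is a true co-occurrence count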

    if use_tfidf:
        N = len(sentences)
        tf_idf_matrix = scipy.sparse.csr_matrix((data, (row, col)))
        total_counts = cooccurrence_matrix_sparse.sum(axis=0).tolist()[0]
        for token in vocabulary.keys():
            token_df = sum(1 for sentence in sentences if token in tokenizer(sentence))  # document frequency over tokens, not substrings
            token_idfs.append(math.log(N / token_df))

        # Computing tf-idf on coocurence counts (here tf = coocurrence count)
        for i, j, v in zip(cooccurrence_matrix_sparse.row, cooccurrence_matrix_sparse.col, cooccurrence_matrix_sparse.data):
            tf_idf_matrix[i, j] = (0.5 + (0.5 * v / total_counts[i])) * token_idfs[j]
        return vocabulary, tf_idf_matrix
    else:
        return vocabulary, cooccurrence_matrix_sparse
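
Written out, the weighting that the function above is aiming for looks like this. This is my reading of the code, with $c_{ij}$ the number of times words $i$ and $j$ co-occur within the window, $N$ the number of sentences, and $\mathrm{df}_j$ the number of sentences containing word $j$:

$$w_{ij} = \left(0.5 + 0.5\,\frac{c_{ij}}{\sum_{k} c_{ik}}\right)\log\frac{N}{\mathrm{df}_j}$$

The code sums over a column rather than a row to get the denominator, which comes to the same thing because the co-occurrence counts are symmetric. Note also that a word occurring in every sentence has $\log(N/\mathrm{df}_j) = 0$, so its whole column is weighted down to zero.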

In [None]:
import nltk
nltk.download('punkt')

In [None]:
import pandas

sentences = ['I love nlp',    'I love to learn',
             'nlp is future', 'nlp is cool']

#We generate the matrix
vocabs,co_occ = create_cooccurrence_matrix(sentences, use_tfidf=True)

df_co_occ  = pandas.DataFrame(co_occ.todense(),
                          index=vocabs.keys(),
                          columns = vocabs.keys())

#We sort the dimensions to be in the same alphabetical order as the vocabulary 
df_co_occ = df_co_occ.sort_index()[sorted(vocabs.keys())]

#Visualizing
df_co_occ.style.applymap(lambda x: 'color: red' if x>0 else '')



Here, we compute vectors over the entire vocabulary space. If the vocabulary is very large (e.g. millions of words), the vectors would be too large and sparse to be practical. 
A classical technique in information retrieval, but also in what we call distributional semantics (based on co-occurrence information), is to apply dimensionality reduction and keep only the most important components. 

Singular Value Decomposition (SVD) is the standard choice. It is closely related to PCA (PCA amounts to an SVD of the mean-centered data), and in fact many implementations of PCA rely on SVD!
SVD is a matrix decomposition technique formulated as (over the reals): 

$D=U\Sigma V^T$
Where $D$ is the original data, $\Sigma$ is a diagonal matrix, and $U$ and $V$ are two orthogonal matrices. 

We compute a truncated SVD of the co-occurrence matrix with $k$ components; the rows of the $U$ matrix give the coordinates of each word in the reduced vector space. 
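
In symbols, keeping only $k$ singular values gives a low-rank approximation of the data (with $|V|$ denoting the vocabulary size, a notation introduced here for convenience):

$$D \approx U_k \Sigma_k V_k^T, \qquad U_k \in \mathbb{R}^{|V| \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{|V| \times k}$$

Row $i$ of $U_k$ is then used as the $k$-dimensional embedding of word $i$. A common variant, not used in this notebook, scales the rows by the singular values and takes $U_k \Sigma_k$ instead.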

In [None]:

#Computing Sparse SVD with 5 components (embedding vectors of dimension 5):
from scipy.sparse.linalg import svds
u, s, vt = svds(co_occ, k=5)

#U contains the embeddings
df_embeddings = pandas.DataFrame(u,
                          index=vocabs.keys())
df_embeddings

The `u` matrix contains our embedding vectors: each row corresponds to the embedding vector of a word in 
the vocabulary. 

In [None]:
future_vec = u[vocabs["future"]]
future_vec

We can thus compute distances and similarities between vectors:

In [None]:
from scipy.spatial.distance import cosine
print(cosine(u[vocabs["love"]], u[vocabs["I"]])) #Smaller distance = words are closer

print(cosine(u[vocabs["love"]], u[vocabs["nlp"]]))
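
Keep in mind that `scipy.spatial.distance.cosine` returns a distance, that is, one minus the cosine similarity. A dependency-free sketch of both quantities (the function names are mine) makes the relationship explicit:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - similarity: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vector length does not matter, only direction, which is why cosine is a popular choice for comparing embeddings.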


### Neural Word Embeddings
With the creation of Word2Vec, neural word embeddings revolutionized the field by framing the learning of the embeddings as a prediction task: 
- Predict a word from the context (Continuous Bag of Words)
- Predict the context from a target word (Skip-Gram Model)
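
The two framings differ only in how training examples are cut out of the text. The sketch below builds both kinds of examples from a toy sentence; the function names and the window size are illustrative choices, not Word2Vec's actual code.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (target, context_word) pair per neighbour inside the window."""
    pairs = []
    for pos, target in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                pairs.append((target, tokens[ctx_pos]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for pos, target in enumerate(tokens):
        context = tokens[max(0, pos - window):pos] + tokens[pos + 1:pos + window + 1]
        examples.append((context, target))
    return examples


tokens = ["I", "love", "nlp"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(sg)
print(cbow)
```

Skip-Gram produces more, smaller training examples per sentence, while CBOW produces one example per position with the whole context as input.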

Computing PCA/SVD on a huge matrix doesn't scale and is not only computationally costly but also difficult to distribute and run in parallel. 

Word2Vec and its simple neural architecture with negative sampling made training word embedding models dramatically faster, which in turn made it possible to train on billion-word corpora. See: Reading material 2 on word embeddings. 
For more context and history, please read: http://jalammar.github.io/illustrated-word2vec/
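
For reference, the negative-sampling objective for one (input word $w_I$, observed context word $w_O$) pair, as given in the Word2Vec paper by Mikolov et al. (2013), is:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

Here $v$ and $v'$ are the input and output embeddings, $\sigma$ is the logistic sigmoid, and $P_n(w)$ is a noise distribution from which $k$ negative words are drawn. Each update touches only $k + 1$ output vectors instead of the whole vocabulary, which is where the speed-up comes from.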


You have already seen how to load a pre-trained word embeddings model with TensorFlow2 and Keras:

In [None]:
!pip install tensorflow

In [None]:
!pip install tensorflow_hub

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import losses

#Loading a hub model (pretrained word vector)
hub_layer = hub.KerasLayer("https://tfhub.dev/google/Wiki-words-250/2", 
                           input_shape=[], 
                           dtype=tf.string, 
                           trainable=False) #If we just use the vectors on their own, no need to make them trainable

# Cosine similarity between words
tf.print("dog & cat: ", -losses.CosineSimilarity()(hub_layer(["cat"]),hub_layer(["dog"])))
tf.print("mom & dad: ", -losses.CosineSimilarity()(hub_layer(["mom"]),hub_layer(["dad"])))
tf.print("house & tree: ", -losses.CosineSimilarity()(hub_layer(["house"]),hub_layer(["tree"])))
print(hub_layer(["cat"]).numpy()[0])

### Transformers & Sentence Embeddings 

We will see Transformer Language Models in detail in the next tutorial session and apply them to text generation, however it is important to introduce their use to obtain word and sentence embeddings and exploit those for various applications. 

#### Embedding text with a pretrained transformer (hugging face transformers)
Transformers have revolutionized many NLP tasks by providing a way of building models that can capture many aspects of language (except true semantics and pragmatics) through multi-task, modular pre-training. 
Most large-scale pre-trained deep "language" models are stacks of transformer layers (encoder-only, decoder-only, or encoder-decoder). They take in tokenized text encoded in a way that captures word position (positional embeddings) and output, depending on the target pre-training task, contextualized vectors for each token, a pooled vector for classification problems, or other task-specific values (e.g. start/end offsets for a named entity recognition task). 
Beyond using transformers in the usual classification tasks, one can also extract contextualized word vectors or even pooled sequence vectors, which typically capture more information than classical word embeddings or distributional semantics models. 

Let's see how to do that with the transformers library. 


In [None]:
#If you don't have pytorch, install this. 
!pip install torch

In [None]:
#First we need to install the transformer library
!pip install transformers

We use autoclasses from transformers that instantiate the right neural network modules based on the specified pre-trained model. 

We will select one of the official pre-trained models https://huggingface.co/transformers/v3.3.1/pretrained_models.html, but there are many more available in the model hub https://huggingface.co/models.

DistilBERT is a compressed BERT model that is much smaller, with only limited performance degradation. We can use a multilingual version to make it applicable to several common languages like French and English. 

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased")

We can first tokenize the text by using the tokenizer. This is a class instance that implements the `__call__` method, allowing us to call the object like a function.
`return_tensors='pt'` tells the tokenizer to return PyTorch tensors, which is required if we are going to use a PyTorch model. Transformers also supports TensorFlow models!

In [None]:

input = tokenizer("Transformers have revolutionized many NLP tasks, by providing a way of building model that can capture many aspects of language", padding=True, return_tensors='pt')

We can see the actual tokens by using the `convert_ids_to_tokens` method of the tokenizer. Just make sure to convert the input ids to a list.
Also notice how some words are segmented into pieces, for example `revolution` `##ized` or `NL` `##P`. The `##` prefix indicates that a piece continues the word started by the previous token. This technique is called WordPiece tokenization; it, or a similar subword scheme, is used by many transformer models.

In [None]:

tokenizer.convert_ids_to_tokens(input['input_ids'].tolist()[0])
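
To get an intuition for how a word ends up split like this, here is a simplified greedy longest-match-first segmenter in plain Python. The tiny vocabulary is invented for the example, and real tokenizers add normalization and other details that are left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first, then shrink it
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word becomes unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"revolution", "##ized", "NL", "##P", "many"}  # hypothetical vocabulary
split_revolutionized = wordpiece("revolutionized", toy_vocab)
split_nlp = wordpiece("NLP", toy_vocab)
split_unknown = wordpiece("xyz", toy_vocab)
print(split_revolutionized, split_nlp, split_unknown)
```

Because frequent words are in the vocabulary as a whole and rare ones are assembled from pieces, the model rarely has to fall back on an unknown token.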

Now we can use the model to embed the text encoded as token ids in the input variable! We have to use `**input` to unpack the dictionary so that each key/value pair is passed as a keyword argument to the call. Without any other options, we only get `last_hidden_state` as an output of the model, which gives us a vector for each of the tokens. For classification tasks, the output of the first token (always `[CLS]`) serves as a sequence-level feature vector, but this isn't always the best way of summarizing the whole sequence. 

In [None]:
output = model(**input)
print(output.last_hidden_state.shape)
cls_vector = output.last_hidden_state.squeeze()[0]
cls_vector.shape

A better way of having one vector is aggregating individual word vectors. This can be done manually by applying some aggregation function. There are endless possibilities, but the arithmetic mean, the sum, or the max are typical examples. The naïve version below will work, but proper pooling should take the attention mask into account so that padding tokens are ignored.

In [None]:
output_matrix = output.last_hidden_state.squeeze()[1:] # Without the CLS vector
print(output_matrix.shape)
sum_pooled = output_matrix.sum(dim=0)
mean_pooled = output_matrix.mean(dim=0)
max_pooled = output_matrix.max(dim=0).values # max() along a dimension returns (values, indices)

Now if we use the attention mask:

In [None]:
import torch
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

print(output)
mean_pooled = mean_pooling(output, input.attention_mask)
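
Why does the mask matter? With a single sentence there is no padding, but in a batch the shorter sentences are padded and those positions should not contribute to the average. The toy example below uses plain lists and made-up numbers to show the difference; it mirrors the idea of `mean_pooling`, not its exact tensor code.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the vectors whose mask value is 1 (real tokens, not padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against division by zero
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


# Two real tokens followed by one padding position (values are invented).
vectors = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

masked = masked_mean(vectors, mask)
naive = [sum(vec[d] for vec in vectors) / len(vectors) for d in range(2)]
print(masked, naive)
```

The naive average is dragged far away by the padding row, while the masked average only reflects the two real tokens.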


#### Sentence embeddings for text similarity and search 

SentenceTransformers aims at producing efficient sentence embeddings in order to compute similarity scores between sentences (textual similarity). To do so, SentenceTransformers uses BERT to encode two sentences, and then uses a cosine similarity loss function to train an encoder on top of BERT to rank sentences by similarity. It either uses a Siamese-inspired architecture called the bi-encoder, or directly fine-tunes BERT to bring similar sentences closer in its internal representation.

![Bi-encoder v.s. Cross-encoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png)

Unlike BERT, which produces one vector per token plus one for the [CLS] token, a trained SentenceBERT model produces a single sentence embedding that is better suited to text similarity tasks. The model uses a pooling layer to go from word vectors to a single aggregate vector. 

Let's install it first, load it and then embed our first sentence! Sentence-transformers can be used directly through the dedicated library or through the transformers library. We will give both examples. 



In [None]:
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')


In [None]:
sentences = ["Jacques à joyeusement passé la tondeuse à gazon ce matin sous un soleil chatoyant.", "Jack merrily mowed the lawn this morning under a shimmering sun."]
embeddings = model.encode(sentences, show_progress_bar=True)

In [None]:
from numpy import dot
from numpy.linalg import norm

cos_sim = lambda a,b: dot(a, b)/(norm(a)*norm(b))

cos_sim(embeddings[0], embeddings[1])

We can also use sentence transformers directly through the unified interface of the transformers library.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('AIDA-UPM/mstsb-paraphrase-multilingual-mpnet-base-v2')
model = AutoModel.from_pretrained('AIDA-UPM/mstsb-paraphrase-multilingual-mpnet-base-v2')



In [None]:
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
#print(sentence_embeddings[1])
from torch import nn
t_cos = nn.CosineSimilarity(dim=0)
print(t_cos(sentence_embeddings[0],sentence_embeddings[1]))



## 3. Let's extend our text classification pipeline!

Now that you have seen how to make operational use of language modelling approaches, 
create three variants where you use: 

1. an n-gram model 
2. Word Embeddings (pre-trained) 
3. Sentence embeddings

You can use those as features for a standard scikit-learn classifier, for example. 