# Introduction

Welcome to the tutorial on text embeddings with neural language models!

Here you will learn how to produce embeddings of a collection of short pieces of text. The embeddings correlate with the semantic meaning of the short pieces of text.

This allows us to automatically compare the embeddings of two or more pieces of text to see if they are similar in meaning.

The embeddings are produced by a pretrained neural language model. There are different ways to perform pretraining. A common approach is to show the model pairs of sentences that are deemed to be similar. For instance, these could be two sentences that occur in the same Wikipedia article. The exact pretraining procedure and training data are different for each language model.
Different language models are available online, and some are even multilingual. The model we use to generate embeddings can be specified below.

Our code is executed on a Virtual Machine (VM) provided by Colab. It has enough processing power and memory to run our code.
A VM is a computer that exists only in software: it is simulated on another computer or server, but it can do almost anything a physical computer can do.
Colab VMs use Ubuntu, a Linux-based operating system. Many common libraries we use for data analysis are already installed.


## Imports
We need some libraries to run our code. By importing them into our notebook, we can use them.

We use the following libraries:

**NumPy (np)** 'is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.' (https://en.wikipedia.org/wiki/NumPy)

**Pandas (pd)** 'is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.' (https://pandas.pydata.org/) Pandas represents data in a spreadsheet-like format and allows us to manipulate it accordingly.

**os** allows us to access the operating system. We use it to create folders.

**sentence_transformers** 'SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining. The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.' (https://www.sbert.net/)



**google.colab drive** is used to mount our Google Drive folder.

In [None]:
# Before we can use sentence_transformers we need to install it via pip ('pip is the package installer for Python.' https://pypi.org/project/pip/)
!pip install sentence_transformers
import numpy as np
import pandas as pd
import os
from sentence_transformers import SentenceTransformer,util


from google.colab import drive

## Set Paths

First we need to define the file paths.

This includes the path to our input folder and our output folder.

In [None]:
# Define the name of the root folder.
root = "/content/Gdrive"
#@markdown **input_path_name** specifies the folder from which the input data is loaded.
input_path_name = "Data LUSIR/" #@param {type:'string'}

#@markdown **output_path_name** specifies the folder where everything is saved.
output_path_name = "Output LUSIR/" #@param {type:'string'}

output_path = root+"/My Drive/"+output_path_name

# Define the output folder
preprocessed_path = output_path+"preprocessed/"

# Define the input folder
data_path = root+"/My Drive/"+input_path_name
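
The string concatenation above only works if every folder name ends with a slash. As a minimal sketch, the same layout can be built with `os.path.join`, which inserts the separators itself. The helper name `build_paths` and the variable names below are hypothetical and not used elsewhere in this notebook; note that the resulting paths have no trailing slash.

```python
import os

def build_paths(root, input_folder, output_folder):
    """Return (data_dir, output_dir, preprocessed_dir) below '<root>/My Drive'."""
    drive_dir = os.path.join(root, "My Drive")
    data_dir = os.path.join(drive_dir, input_folder)
    output_dir = os.path.join(drive_dir, output_folder)
    preprocessed_dir = os.path.join(output_dir, "preprocessed")
    return data_dir, output_dir, preprocessed_dir

# Folder names are given without a trailing slash here
data_dir, out_dir, prep_dir = build_paths("/content/Gdrive", "Data LUSIR", "Output LUSIR")
print(prep_dir)
```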

## Set File Names

Then we need to define which file we want to use to produce embeddings.

The file should be a .pkl file that contains a pickled Pandas DataFrame. It should have a text column with the texts we want to embed; the loading cell further down reads the column `chunk`.

In [None]:
#@markdown **input_file_name** specifies which file you want to load. It should be the exact name of the file.
input_file_name = "LUSIR_df_speakers_clean_normalized_50sentence(s)_NEW - CORPUS" #@param {type:'string'}
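
To illustrate the expected input format, here is a minimal sketch that writes a tiny DataFrame to a .pkl file and reads it back. The example texts, the file name `toy_corpus.pkl`, and the temporary folder are made up for illustration; the column name `chunk` matches the column that is read later in this notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up example texts in a column named 'chunk'
toy_df = pd.DataFrame({"chunk": ["Erster Textabschnitt.", "Zweiter Textabschnitt."]})

# Write the DataFrame to a pickle file in a temporary folder
toy_file = os.path.join(tempfile.mkdtemp(), "toy_corpus.pkl")
toy_df.to_pickle(toy_file)

# Read it back the same way the notebook loads its input file
reloaded = pd.read_pickle(toy_file)
print(reloaded.columns.tolist(), reloaded.shape)
```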

## Mount Drive Folder

In order to access our data, we need to connect the Colab VM to our Google Drive.

A new Colab VM is assigned to us every time we start the Colab notebook.

Colab VMs run on Google's servers. For security and privacy reasons, a VM cannot access our data without our permission.

Similar to how one would connect a network drive to a computer, we can connect our Google Drive to this Colab VM.

Colab uses OAuth2 to authorize the VM to access our Google Drive.

1. Execute the cell.
2. Follow the link in the output.
3. Select your Google account.
4. Confirm access privileges.
5. Copy the link.
6. Paste the link in the input field in the output.
7. Press **Enter**.

In [None]:
drive.mount(root, force_remount=True)

## Create Folders

Before we can write anything into our output folder, we have to create it.

If it already exists, no new directory is created and a different message is printed.

In [None]:
paths = [output_path, preprocessed_path]

for p in paths:
  print(p)
  # Try to create the folder
  try:
    os.mkdir(p)
  except FileExistsError:
    print("The directory %s already exists" % p)
  except OSError:
    print("Creation of the directory %s failed" % p)
  else:
    print("Successfully created the directory %s " % p)
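
As an alternative sketch, `os.makedirs` with `exist_ok=True` creates a folder together with any missing parent folders and does not raise an error if the folder is already there. The temporary base folder below is an assumption so that the example runs outside of Colab; it stands in for the Drive output folder.

```python
import os
import tempfile

# Hypothetical stand-in for the Drive output folder
base_dir = tempfile.mkdtemp()
nested_dir = os.path.join(base_dir, "Output LUSIR", "preprocessed")

# Creates 'Output LUSIR' and 'preprocessed' in one call
os.makedirs(nested_dir, exist_ok=True)

# Calling it again changes nothing and raises no exception
os.makedirs(nested_dir, exist_ok=True)

print(os.path.isdir(nested_dir))
```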


## Set Language model

Here we can specify which language model we use.

**RoBERTa-DE-EN** (*hosted at*: https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer, *Paper*: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.)

**USE_Multilingual** (*hosted at*: sbert.net , *Paper*: Yang, Y., Cer, D.M., Ahmad, A., Guo, M., Law, J., Constant, N., Ábrego, G., Yuan, S., Tar, C., Sung, Y., Strope, B., & Kurzweil, R. (2020). Multilingual Universal Sentence Encoder for Semantic Retrieval. ACL.)

Reimers, Nils & Gurevych, Iryna. (2020). Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.


In [None]:
url_dict = {
            'RoBERTa-DE-EN' : 'T-Systems-onsite/cross-en-de-roberta-sentence-transformer',
            'USE_Multilingual':'distiluse-base-multilingual-cased-v1'
}

#@markdown  #Global Parameters

model_type = 'RoBERTa-DE-EN' #@param ['RoBERTa-DE-EN', 'USE_Multilingual']

model_name = url_dict[model_type]

## Pandas print options

To get more readable outputs when printing our data in Pandas, we have to set some print options.

In [None]:
#pandas print options

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Load Data

To access our data, we need to load it into our notebook.

In [None]:
# Specify the input file
input_file = data_path+input_file_name

# Load the data by reading the .pkl-file with Pandas.
data = pd.read_pickle(input_file)

# Specify which column is used to produce embeddings
text_column = 'chunk'

In [None]:
# Print the shape of our data (number of rows, number of columns)
print(data.shape)

# Print the first rows
data.head(10)

# Preprocessing

Some preprocessing can be done to produce better embeddings. This step highly depends on the input data. For interviews, we could for example parse out speaker labels such as \*Interviewer says\*.

## Drop unused columns

In [None]:
data = data.dropna(subset=[text_column]) # drop rows with no content
print(data.shape)
#data=data.drop(['0'],axis=1) # drop unused columns
data.head()

## Remove New Line Tag

In [None]:

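# Replace newline characters with a space
# regex=True is set explicitly so that the pattern r'\n' matches the newline character
new_line_pattern = r'\n'
data[text_column] = data[text_column].str.replace(new_line_pattern, ' ', regex=True)
print("Removed new line characters")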

data.head()
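
How `str.replace` treats the pattern depends on its `regex` argument, and the default of that argument differs between pandas versions. The following minimal sketch uses made-up data (the `toy` frame is an assumption, not the LUSIR data) and passes `regex=True` explicitly, so that the pattern `r'\n'` matches the newline character.

```python
import pandas as pd

# Toy data standing in for the 'chunk' column
toy = pd.DataFrame({"chunk": ["first line\nsecond line", "no line break"]})

# With regex=True the pattern r'\n' is a regular expression that matches a newline
toy["chunk"] = toy["chunk"].str.replace(r"\n", " ", regex=True)

print(toy["chunk"].tolist())
```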

In [None]:
import spacy
spacy.cli.download('de_core_news_sm')
nlp = spacy.load('de_core_news_sm')

In [None]:
open_stoplist = 'german_stopwords_full_BE_MOD Topics.txt'
stopword_path = root+"/My Drive/Output LUSIR/"

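# Read the UTF-16 encoded stop word file and split it into single words
with open(stopword_path+open_stoplist, encoding='UTF-16', mode='r') as stop_file:
    stoplist = stop_file.read().split()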

print(stoplist)

In [None]:
# A set makes the stop word lookup fast
stopword_set = set(stoplist)

def preprocess_text(text):
    allowed_postags = ['NOUN', 'PROPN', 'VERB', 'ADJ', 'ADV']
    min_wordlen = 2 #@param {type:"integer"}
    doc = nlp(text)

    # Lemmatization: keep only content words that are longer than min_wordlen
    # and that are not in the stop word list
    lemmatized_words = [token.lemma_ for token in doc
                        if len(token) > min_wordlen
                        and token.pos_ in allowed_postags
                        and token.text.lower() not in stopword_set]

    # Join the lemmatized words back into a single string
    processed_text = ' '.join(lemmatized_words)
    return processed_text

data['processed'] = data[text_column].apply(preprocess_text)
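
The filtering rules of this step (stop word removal, part-of-speech filter, minimum word length) can be tried out without downloading a spaCy model. This sketch uses hand-written `(text, lemma, pos)` tuples as a stand-in for spaCy tokens; the tuples, the tiny stop word set, and the function name `filter_tokens` are made up for illustration.

```python
ALLOWED_POSTAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

def filter_tokens(tokens, stopwords, min_wordlen=2):
    """Keep the lemmas of content words that are long enough and not stop words."""
    return [
        lemma
        for text, lemma, pos in tokens
        if len(text) > min_wordlen
        and pos in ALLOWED_POSTAGS
        and text.lower() not in stopwords
    ]

# Hand-written stand-ins for spaCy tokens: (text, lemma, part of speech)
toy_tokens = [
    ("Die", "der", "DET"),
    ("Katzen", "Katze", "NOUN"),
    ("saßen", "sitzen", "VERB"),
    ("auf", "auf", "ADP"),
    ("dem", "der", "DET"),
    ("Dach", "Dach", "NOUN"),
    ("sehr", "sehr", "ADV"),
]
toy_stopwords = {"die", "auf", "dem", "sehr"}

kept_lemmas = filter_tokens(toy_tokens, toy_stopwords)
print(" ".join(kept_lemmas))
```

Determiners and prepositions are dropped by the part-of-speech filter, while "sehr" passes that filter as an adverb and is removed only because it is in the stop word set.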

# Create Text Embeddings with Language Model

## Load the model and Embed Documents

### Load Model

We use a pretrained model to produce our embeddings.
Before we can use the model we have to download it.

In [None]:
model = SentenceTransformer(model_name)

### Embed documents

Then we can pass our data into the model and it will return embeddings. This will take some time.

For each text (i.e. each row in the data frame) one embedding is generated.
Embeddings are just lists of numbers that can be thought of as points in high-dimensional space. Each point represents one text. The closer two points are to each other, the more similar the two texts are.
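
Later in this notebook, closeness is measured with cosine similarity, which compares the direction of two vectors. As a minimal sketch of the idea, the function below computes it in plain Python for made-up three-dimensional vectors; real embeddings have far more dimensions, and the names `cosine_similarity`, `v_cat`, `v_kitten` and `v_car` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings"
v_cat = [1.0, 2.0, 0.0]
v_kitten = [2.0, 4.0, 0.0]  # points in the same direction as v_cat
v_car = [0.0, 0.0, 3.0]     # orthogonal to v_cat

sim_same_direction = cosine_similarity(v_cat, v_kitten)
sim_orthogonal = cosine_similarity(v_cat, v_car)
print(sim_same_direction, sim_orthogonal)
```

Vectors that point in the same direction get a score close to 1, and orthogonal vectors get a score of 0, regardless of their lengths.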

In [None]:
print('Embedding...')
# Pass a plain list of strings so that the result does not depend on the DataFrame index
embeddings = model.encode(data['processed'].tolist(), convert_to_tensor=True, batch_size = 128, show_progress_bar = True)

In [None]:
# The embeddings form a matrix of shape (number of texts, number of embedding dimensions)
print(embeddings.shape)
embeddings[:10]

# Try out the model

After we have generated embeddings, we can compare them.

Similarity is measured with cosine similarity.

We can either define our own lists of sentences to compare:


In [None]:
# Two lists of sentences
sentences1 = ['The cat is sitting on the roof.',
             'A man is playing guitar.',
             'The new movie is awesome.',
              'I want to wear a hat.',
              'this is cool',
              'it is cool',
              'Hello!, Hello!Hello!Hello!Hello!Hello!',
              'Tab'
              ]

sentences2 = ['Die Katze sitzt auf dem Dach.',
              'A woman watches TV.',
              'The new movie is so great.',
              'Ich trage einen Hut.',
              'is this cool',
              'it is sad',
              'Hi! Hello!Hello!',
              'Bat'
              ]

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))



Or we can compare randomly selected texts from the LUSIR dataset:

In [None]:
# Single list of sentences
number_of_samples = 300 #@param {type:'integer'}

#sentences = data['cleaned'][start:start+range_].tolist()

sentences = data[text_column]

sentences = sentences.sample(n=number_of_samples).tolist()

#Compute embeddings
embeddings_sentences = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings_sentences, embeddings_sentences)


#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        if cosine_scores[i][j] >= .5:
          pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
print(len(pairs))
for pair in pairs[0:20]:
    i, j = pair['index']
    print("{}\n{}\n Score: {:.4f}\n\n".format(sentences[i], sentences[j], pair['score']))
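
The pair search above can be checked on a small, hand-made score matrix. This is a minimal sketch in plain Python; the function name `top_pairs` and the toy scores are assumptions. A tensor such as `cosine_scores` would presumably have to be converted to nested lists first, for example with `.tolist()`.

```python
from itertools import combinations

def top_pairs(score_matrix, threshold=0.5, limit=20):
    """Return up to `limit` tuples (score, i, j) with i < j, highest score first."""
    pairs = [
        (score_matrix[i][j], i, j)
        for i, j in combinations(range(len(score_matrix)), 2)
        if score_matrix[i][j] >= threshold
    ]
    pairs.sort(reverse=True)
    return pairs[:limit]

# Made-up symmetric similarity matrix for three texts
toy_scores = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]

best = top_pairs(toy_scores)
for score, i, j in best:
    print(i, j, score)
```

Each pair is visited only once because `combinations` yields only index pairs with `i < j`, which also skips the diagonal where every text is compared with itself.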

# Save Data and Embeddings

After the data is preprocessed and embeddings are generated, both are saved to disk for further use.

In [None]:
data.to_pickle(preprocessed_path+'data_'+model_type+'.pkl')
print(data.head())

In [None]:
import numpy as np
embeddings = embeddings.to('cpu').detach().numpy()
np.save(preprocessed_path+'embeddings_'+model_type+'.npy', embeddings, allow_pickle=True, fix_imports=True)
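
To reuse the saved embeddings in another notebook, the `.npy` file can be read back with `np.load`. This is a minimal round-trip sketch with a made-up array and a temporary folder; both are assumptions that stand in for the real embeddings and `preprocessed_path`.

```python
import os
import tempfile

import numpy as np

# Made-up "embeddings": 3 texts with 2 dimensions each
toy_embeddings = np.arange(6, dtype=np.float32).reshape(3, 2)

# Save to a temporary folder instead of the Drive folder
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "embeddings_toy.npy")
np.save(file_path, toy_embeddings)

# Load the array again and check that nothing changed
loaded = np.load(file_path)
print(loaded.shape)
print(np.array_equal(loaded, toy_embeddings))
```

Because row `k` of the array belongs to row `k` of the saved data frame, both files should be loaded together and kept in the same order.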