### In this notebook we check what happens if you feed a string with more than 128 tokens into Sentence Transformers, using sentence-transformers/paraphrase-multilingual-mpnet-base-v2

In [1]:
import json
from sentence_transformers import SentenceTransformer
import pandas as pd

In [2]:
#load model and data
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

def get_data(path):
    """Load the merged CSV and return the pieces used in this notebook."""
    merged_df = pd.read_csv(path)

    # Raw text of every video as a NumPy array
    strings = merged_df['String']
    str_lst = strings.values

    # Map each title to its identifier
    vocab = merged_df['Title'].values
    identifier = merged_df['identifier']
    identifier_vocab = pd.DataFrame({'ID': identifier, 'Vocab': vocab})
    identifier_vocab = identifier_vocab.set_index('Vocab')['ID'].to_dict()
    return merged_df, str_lst, vocab, identifier_vocab, identifier


merged_df, str_lst, vocab, identifier_vocab, identifier = get_data('data/merged_data_for_AI.csv')




### We encode a string that has more than four times the 128-token limit. Then we encode the first half and the first quarter of the string separately. If the encodings are the same, we need another approach for longer inputs

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')


In [17]:
# create the strings (note: the cuts are by character count, not by token count)
full_string = str_lst[0]
half_string = full_string[:len(full_string) // 2]
quarter_string = full_string[:len(full_string) // 4]

In [18]:
tokenized_input_full = tokenizer.tokenize(full_string)
tokenized_input_half = tokenizer.tokenize(half_string)
tokenized_input_quarter = tokenizer.tokenize(quarter_string)

print("# tokens full: ", len(tokenized_input_full))
print("# tokens half: ", len(tokenized_input_half))
print("# tokens quarter: ", len(tokenized_input_quarter))


Token indices sequence length is longer than the specified maximum sequence length for this model (1914 > 512). Running this sequence through the model will result in indexing errors


# tokens full:  1914
# tokens half:  962
# tokens quarter:  488


In [23]:
enc_full_string = model.encode(full_string)
enc_half_string = model.encode(half_string)
enc_quarter_string = model.encode(quarter_string)

In [24]:
import numpy as np

print(np.array_equal(enc_full_string, enc_half_string))
print(np.array_equal(enc_full_string, enc_quarter_string))

True
True
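
The two `True` values are what one would expect if only the first 128 tokens of each input reach the model. The following is a minimal sketch of that reasoning, not the library's actual truncation code: it checks whether two texts share the same leading tokens. `same_after_truncation` and `WhitespaceTokenizer` are hypothetical names introduced here; the stand-in tokenizer only exists so the sketch runs without downloading a model. Note that the real limit probably also has to accommodate special tokens, so slightly fewer content tokens may survive.

```python
def same_after_truncation(tokenizer, text_a, text_b, max_tokens=128):
    """Return True if both texts yield the same first `max_tokens` tokens.

    `tokenizer` is any object with a `tokenize(str) -> list` method,
    for example the AutoTokenizer loaded earlier in this notebook.
    """
    tokens_a = tokenizer.tokenize(text_a)[:max_tokens]
    tokens_b = tokenizer.tokenize(text_b)[:max_tokens]
    return tokens_a == tokens_b


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def tokenize(self, text):
        return text.split()


# Toy data: 1000 distinct "words", cut by characters like in the cells above
toy_full = " ".join(f"w{i}" for i in range(1000))
toy_half = toy_full[:len(toy_full) // 2]
toy_quarter = toy_full[:len(toy_full) // 4]

toy_tokenizer = WhitespaceTokenizer()
print(same_after_truncation(toy_tokenizer, toy_full, toy_half))
print(same_after_truncation(toy_tokenizer, toy_full, toy_quarter))
```

With the real tokenizer, one could call `same_after_truncation(tokenizer, full_string, half_string)` to confirm that the inputs only differ beyond the limit.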


### The comparison in cell [24] shows that the encodings are identical, which is consistent with the model truncating every input to its first 128 tokens
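
Since everything past the limit appears to be ignored, one possible alternative for longer inputs is to split the token sequence into windows that fit the limit, embed each window separately, and average the resulting vectors. This is only a sketch of that idea under stated assumptions, not a method taken from this notebook or from the library: `chunk_tokens` and `mean_vector` are hypothetical helpers, and averaging is just one of several ways to combine chunk embeddings.

```python
def chunk_tokens(tokens, max_tokens=128, overlap=0):
    """Split a token list into consecutive windows of at most `max_tokens`.

    `overlap` is the number of tokens shared between neighbouring windows.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the sequence
        if start + max_tokens >= len(tokens):
            break
    return chunks


def mean_vector(vectors):
    """Average equally sized vectors component-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy usage: 300 fake tokens split into windows of 128
toy_tokens = [f"tok{i}" for i in range(300)]
toy_chunks = chunk_tokens(toy_tokens, max_tokens=128)
print([len(chunk) for chunk in toy_chunks])

# With the real model, each chunk would be turned back into text and
# encoded; the per-chunk embeddings could then be combined like this:
print(mean_vector([[1.0, 3.0], [3.0, 5.0]]))
```

A window size somewhat below 128 may be the safer choice in practice, because special tokens added by the model count towards the limit as well.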

128 tokens were estimated at approximately 25 words. Cell [25] prints the first 50 words of some videos together with their token counts, so we can see how much information fits within the limit.
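
The word estimate can also be measured rather than guessed. The sketch below counts how many leading words of a text fit into a given token budget. `words_within_budget` and `ThreeCharTokenizer` are hypothetical names introduced for illustration; the stand-in tokenizer simply cuts every word into pieces of three characters to imitate subword splitting, and would be replaced by the real `tokenizer` from this notebook.

```python
def words_within_budget(text, tokenizer, max_tokens=128):
    """Return how many leading whitespace-separated words fit into `max_tokens`.

    `tokenizer` is any object with a `tokenize(str) -> list` method.
    """
    words = text.split()
    for n in range(1, len(words) + 1):
        prefix = " ".join(words[:n])
        if len(tokenizer.tokenize(prefix)) > max_tokens:
            # The n-th word pushed the prefix over the budget
            return n - 1
    return len(words)


class ThreeCharTokenizer:
    """Hypothetical stand-in: splits every word into 3-character pieces."""

    def tokenize(self, text):
        tokens = []
        for word in text.split():
            tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
        return tokens


# Toy usage: every word "abcdef" becomes two tokens under the stand-in
toy_text = "abcdef " * 100
print(words_within_budget(toy_text, ThreeCharTokenizer(), max_tokens=128))
```

The loop re-tokenizes a growing prefix on every step, which is simple but quadratic; for a 128-token budget it stops after a few dozen words, so the cost should be negligible here.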

In [25]:
def print_first_50_words(strings):
    """Print the first 50 words of each string together with their token count."""
    for string in strings:
        # Split the string into whitespace-separated words
        words = string.split()

        # Keep only the first 50 words
        first_50_words = " ".join(words[:50])
        print(first_50_words)

        # Count the tokens those words produce
        tokenized_input = tokenizer.tokenize(first_50_words)
        print("# tokens: ", len(tokenized_input))
        print("_________________________________")

print_first_50_words(str_lst[:100])

Flutter UI Design Tutorial File Storage - 2023 - Explained step by step part 2MJSD Coding Hey Leute, willkommen zurück auf dem Kanal, heute werden wir unsere Videoserie fortsetzen, in der wir versuchen, schöne flatternde Benutzeroberflächen mit Schritt-für-Schritt- Erklärungen nachzubilden. Letzte Woche haben wir diese erste Seite des Designs erstellt,
# tokens:  81
_________________________________
Heartthrob Tangle Lesson Pattern #137Melinda Barlow CZT Inkidoodles das ist Melinda Barlow czt-zertifizierte Zentangle-Lehrerin und die heutige Lektion ist Herztropfen und eine Lektion in Verschönerung, also werden wir heute viele Verschönerungsideen und Verwicklungen lernen und eine tolle Zeit haben, also fangen wir an, das ist wirklich ein Frauenschwarm Zunächst ein
# tokens:  97
_________________________________
LG Optimus Elite Unboxing - Virgin Mobilefunzier1 Hallo zusammen, ich werde das neue jungfräuliche Handy lg optimus elite auspacken und ich habe es noch nicht geöffnet,