# Evaluating Models and Text Preprocessing Options on the Translated Sample

This chapter evaluates every combination of the following options.

Embedding models:
1. SentenceTransformer('sentence-transformers/LaBSE'), dimensions: (768,) [Link](https://huggingface.co/sentence-transformers/LaBSE)
2. SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2'), dimensions: (384,) [Link](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
3. SentenceTransformer('paraphrase-multilingual-mpnet-base-v2'), dimensions: (768,) [Link](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)

Text preprocessing steps:
1. Dirty text (unprocessed text)
2. Cleaned text (removal of non-text characters and more; see src.text_preparation_04.py)
3. Lemmas (cleaned and lemmatized text)
4. Stop-word-removed lemmas

Each combination is evaluated with a coherence score and a diversity score, both computed on the top 40 words per topic, in the following word spaces:

1. Translated nouns, adjectives, and verbs
2. Untranslated nouns, adjectives, and verbs
3. Translated, stop-word-removed lemmas
4. Untranslated, stop-word-removed lemmas
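
As a rough illustration of the diversity score, the sketch below computes the share of unique words among the top-k words of all topics, which is a common definition of topic diversity. Whether `src.metrics` computes it exactly this way is an assumption; `topic_diversity` and the toy topics are hypothetical names and data.

```python
def topic_diversity(topic_words: list[list[str]], top_k: int = 40) -> float:
    """Share of unique words among the top_k words of every topic."""
    top_words = [word for words in topic_words for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy example with three topics and top_k=3 (the chapter uses the top 40 words).
toy_topics = [
    ["ukraine", "putin", "war"],
    ["ukraine", "sanction", "gas"],
    ["refugee", "border", "war"],
]
print(topic_diversity(toy_topics, top_k=3))
```

A value close to 1 means the topics share few top words; a value close to 0 means the same words dominate many topics.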

## Preparing the embeddings

In [28]:
import pandas as pd
from src.SampleTranslation05.translation_01 import load_samples
from src.stop_words import stop_words

path_to_samples = (
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Samples/samples_translated.csv"
)
df = load_samples(path_to_samples)

df["stop_word_removed_lemmas"] = df.apply(
    lambda x: [word for word in x["lemmas"] if word not in stop_words[x["lang"]]],
    axis=1,
)
df["translated_stop_word_removed_lemmas"] = df.apply(
    lambda x: [word for word in x["translated_lemmas"] if word not in stop_words["en"]],
    axis=1,
)
df["translated_nouns_adjs_verbs"] = df["translated_nouns"] + df["translated_adjs_verbs"]
df["nouns_adjs_verbs"] = df["nouns"] + df["adjs_verbs"]
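
The stop-word filtering above depends on the language of each row. A minimal sketch of the same idea without pandas follows; it assumes `stop_words` maps a language code to a collection of stop words, as the indexing `stop_words[x["lang"]]` suggests. `toy_stop_words` and `remove_stop_words` are hypothetical.

```python
# Hypothetical miniature of the stop_words mapping: language code -> stop words.
toy_stop_words = {
    "de": {"was", "ein", "mit", "in", "der"},
    "en": {"what", "a", "with", "of", "the"},
}


def remove_stop_words(lemmas: list[str], lang: str) -> list[str]:
    """Keep only the lemmas that are not stop words of the given language."""
    blocked = toy_stop_words.get(lang, set())
    return [word for word in lemmas if word not in blocked]


# Original lemmas are filtered with their own language, translated lemmas with English.
print(remove_stop_words(["was", "Bild", "Ukraine", "Putin", "stopwar"], "de"))
print(remove_stop_words(["what", "picture", "ukraine", "putin", "stopwar"], "en"))
```

The comparison is case-sensitive, so the lemmas and the stop-word lists need to use the same casing.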

In [50]:
df["stop_word_removed_lemmas"] = df.apply(
    lambda x: [word for word in x["lemmas"] if word not in stop_words[x["lang"]]],
    axis=1,
)

In [49]:
df = df.drop("stop_word_removed_lemmas", axis=1)

In [51]:
df["stop_word_removed_lemmas"]

0                          [Bild, Ukraine, Putin, stopwar]
1        [wegschauen, Form, falsch, handelns, unterlass...
2        [stinknormal, Anektierung, Krim, plötzlich, hi...
3        [halten, Wort, putinversteh, gefährlich, Homop...
4        [absolut, Solidarität, Mensch, Ukraine, tiefst...
                               ...                        
48995    [entrer, temps, amérique, placer, protéger, in...
48996    [ukrain, frappe, russe, région, mort, kiev, uk...
48997    [sentir, aller, fendre, gueule, aller, explosi...
48998    [celer, vrai, pénurie, go, mois, aller, explos...
48999    [armageddon, sheytanyahu, missile, russe, frap...
Name: stop_word_removed_lemmas, Length: 49000, dtype: object

In [29]:
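# Keep only the columns needed for the evaluation.
# Note: "stop_word_removed_lemmas" is listed twice below, so the resulting frame
# contains a duplicate column (visible in the output).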
df = df[
    [
        "text",
        "cleaned_text",
        "lemmas",
        "stop_word_removed_lemmas",
        "translated_nouns_adjs_verbs",
        "nouns_adjs_verbs",
        "translated_lemmas",
        "stop_word_removed_lemmas",
        "translated_stop_word_removed_lemmas",
        "lang",
        "week",
        "translated",
        "emojis",
    ]
]
print(df.columns)
print(df.shape)
df.isnull().any()

Index(['text', 'cleaned_text', 'lemmas', 'stop_word_removed_lemmas',
       'translated_nouns_adjs_verbs', 'nouns_adjs_verbs', 'translated_lemmas',
       'stop_word_removed_lemmas', 'translated_stop_word_removed_lemmas',
       'lang', 'week', 'translated', 'emojis'],
      dtype='object')
(49000, 13)


text                                   False
cleaned_text                           False
lemmas                                 False
stop_word_removed_lemmas               False
translated_nouns_adjs_verbs            False
nouns_adjs_verbs                       False
translated_lemmas                      False
stop_word_removed_lemmas               False
translated_stop_word_removed_lemmas    False
lang                                   False
week                                   False
translated                             False
emojis                                 False
dtype: bool

In [32]:
df.head()

Unnamed: 0,text,cleaned_text,lemmas,stop_word_removed_lemmas,translated_nouns_adjs_verbs,nouns_adjs_verbs,translated_lemmas,stop_word_removed_lemmas.1,translated_stop_word_removed_lemmas,lang,week,translated,emojis
0,Was ein Bild.\n#Ukraine #Putin #StopWar https:...,was ein bild. ukraine putin stopwar,"[was, Bild, Ukraine, Putin, stopwar]","[Bild, Ukraine, Putin, stopwar]","[picture, ukraine, putin, stopwar, ]","[Bild, Putin, Ukraine, stopwar]","[what, picture, ukraine, putin, stopwar]","[Bild, Ukraine, Putin, stopwar]","[picture, ukraine, putin, stopwar]",de,2022-08,What a picture.\n#Ukraine #Putin #StopWar http...,[]
1,@A1Telekom @MagdalenaZzzet Wegschauen kann auc...,wegschauen kann auch eine form falschen handel...,"[wegschauen, auch, Form, falsch, handelns, unt...","[wegschauen, Form, falsch, handelns, unterlass...","[way, form, action, sin, omission, belarus, me...","[Form, handelns, Diktator, Invasion, ukrain, P...","[look, other, way, also, form, of, wrong, acti...","[wegschauen, Form, falsch, handelns, unterlass...","[look, way, form, wrong, action, sin, omission...",de,2022-08,@A1Telekom @MagdalenaZzzet Looking the other w...,[]
2,Die stinknormale #Anektierung der Krim und plö...,die stinknormale anektierung der krim und plö...,"[stinknormal, Anektierung, Krim, plötzlich, hi...","[stinknormal, Anektierung, Krim, plötzlich, hi...","[annexation, crimea, war, aggression, whimsy, ...","[Anektierung, Krim, Angriffskrieg, Ukraine, Sc...","[perfectly, normal, annexation, of, crimea, su...","[stinknormal, Anektierung, Krim, plötzlich, hi...","[perfectly, normal, annexation, crimea, sudden...",de,2022-08,The perfectly normal #annexation of Crimea and...,[]
3,"@Saefken Deshalb halten ich das Wort ""#Putinve...","deshalb halten ich das wort "" putinversteher"" ...","[deshalb, halten, ich, Wort, putinversteh, für...","[halten, Wort, putinversteh, gefährlich, Homop...","[word, putinversteher, homophobic, minority, d...","[Wort, Homophober, Minderheit, Diktator, verha...","[that, I, think, word, putinversteher, so, dan...","[halten, Wort, putinversteh, gefährlich, Homop...","[I, think, word, putinversteher, dangerous, ho...",de,2022-08,"@Saefken That's why I think the word ""#Putinve...",[]
4,Absolute Solidarität mit den Menschen in der U...,absolute solidarität mit den menschen in der u...,"[absolut, Solidarität, mit, Mensch, in, Ukrain...","[absolut, Solidarität, Mensch, Ukraine, tiefst...","[solidarity, people, ukraine, contempt, putin,...","[Solidarität, Mensch, Ukraine, Verachtung, Put...","[absolute, solidarity, with, people, of, ukrai...","[absolut, Solidarität, Mensch, Ukraine, tiefst...","[absolute, solidarity, people, ukraine, deep, ...",de,2022-08,Absolute solidarity with the people of Ukraine...,[🇺🇦]


## LaBSE Embeddings

In [30]:
from sentence_transformers import SentenceTransformer

model_labse = SentenceTransformer("sentence-transformers/LaBSE")



In [31]:
df_labse = df.copy(deep=True)

In [54]:
df_labse = df_labse.drop("stop_word_removed_lemmas", axis=1)

In [56]:
df_labse["stop_word_removed_lemmas"] = df["stop_word_removed_lemmas"]

In [60]:
# Estimated runtime: around 2 hours.
# progress_apply is only available after tqdm.pandas() has been called.
from tqdm import tqdm

tqdm.pandas()
# df_labse['dirty_text_embedding'] = df_labse['text'].progress_apply(model_labse.encode)
# df_labse['cleaned_text_embeddings'] = df_labse['cleaned_text'].progress_apply(model_labse.encode)
# df_labse['lemmas_embeddings'] = df_labse['lemmas'].str.join(" ").progress_apply(model_labse.encode)
df_labse["stop_word_removed_lemmas_embeddings"] = (
    df_labse["stop_word_removed_lemmas"]
    .str.join(" ")
    .progress_apply(model_labse.encode)
)

100%|██████████| 49000/49000 [47:52<00:00, 17.06it/s]  
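
The cell above calls the model once per row. `SentenceTransformer.encode` also accepts a list of sentences together with a `batch_size` argument, which is usually faster than row-wise calls. The sketch below shows the batching pattern with a stand-in encoder so that it runs without downloading a model; `encode_in_batches` and `fake_encode` are hypothetical helpers, not part of the project.

```python
from typing import Callable, Sequence


def encode_in_batches(
    texts: Sequence[str],
    encode_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Encode texts chunk by chunk and return one vector per input text."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start : start + batch_size])
        vectors.extend(encode_batch(chunk))
    return vectors


def fake_encode(batch: list[str]) -> list[list[float]]:
    """Stand-in for a sentence encoder: maps each text to a 2-dimensional vector."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


lemmas = [["Bild", "Ukraine"], ["halten", "Wort", "gefährlich"], []]
texts = [" ".join(words) for words in lemmas]  # same joining as .str.join(" ")
embeddings = encode_in_batches(texts, fake_encode, batch_size=2)
print(len(embeddings), embeddings[0])
```

With a real model the helper is not needed: passing the whole list, for example `model_labse.encode(texts, batch_size=64, show_progress_bar=True)`, lets the library handle the batching itself.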


In [61]:
from src.utility import (
    save_csv_with_embeddings,
    load_samples_with_numpy,
    series_numpy_equals,
)

save_csv_with_embeddings(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Labse",
    df_labse,
    embeddings_columns=[
        "dirty_text_embedding",
        "cleaned_text_embeddings",
        "lemmas_embeddings",
        "stop_word_removed_lemmas_embeddings",
    ],
)

## Paraphrase Mini Embeddings

In [62]:
from sentence_transformers import SentenceTransformer

model_parahprase_min = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

In [63]:
df_paraphrase_min = df.copy(deep=True)

In [72]:
# Estimated runtime: around 2 hours.
# df_paraphrase_min['cleaned_text_embeddings'] = df_paraphrase_min['cleaned_text'].progress_apply(model_parahprase_min.encode)
df_paraphrase_min["lemmas_embeddings"] = (
    df_paraphrase_min["lemmas"]
    .str.join(" ")
    .progress_apply(model_parahprase_min.encode)
)
# df_paraphrase_min['stop_word_removed_lemmas_embeddings'] = df_paraphrase_min['stop_word_removed_lemmas'].str.join(" ").progress_apply(model_parahprase_min.encode)
df_paraphrase_min["dirty_text_embedding"] = df_paraphrase_min["text"].progress_apply(
    model_parahprase_min.encode
)

100%|██████████| 49000/49000 [24:26<00:00, 33.42it/s]  
100%|██████████| 49000/49000 [31:03<00:00, 26.29it/s]  


In [71]:
df_paraphrase_min["dirty_text_embedding"].apply(lambda x: x.shape)

0        (384,)
1        (384,)
2        (384,)
3        (384,)
4        (384,)
          ...  
48995    (384,)
48996    (384,)
48997    (384,)
48998    (384,)
48999    (384,)
Name: dirty_text_embedding, Length: 49000, dtype: object

In [73]:
from src.utility import (
    save_csv_with_embeddings,
    load_samples_with_numpy,
    series_numpy_equals,
)

save_csv_with_embeddings(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Paraphrase_Min",
    df_paraphrase_min,
    embeddings_columns=[
        "dirty_text_embedding",
        "cleaned_text_embeddings",
        "lemmas_embeddings",
        "stop_word_removed_lemmas_embeddings",
    ],
)

## Paraphrase Max Embeddings

In [74]:
from sentence_transformers import SentenceTransformer

model_parahprase_max = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

In [75]:
df_paraphrase_max = df.copy(deep=True)

In [76]:
# Estimated runtime: around 2 hours.
df_paraphrase_max["cleaned_text_embeddings"] = df_paraphrase_max[
    "cleaned_text"
].progress_apply(model_parahprase_max.encode)
df_paraphrase_max["lemmas_embeddings"] = (
    df_paraphrase_max["lemmas"]
    .str.join(" ")
    .progress_apply(model_parahprase_max.encode)
)
df_paraphrase_max["stop_word_removed_lemmas_embeddings"] = (
    df_paraphrase_max["stop_word_removed_lemmas"]
    .str.join(" ")
    .progress_apply(model_parahprase_max.encode)
)
df_paraphrase_max["dirty_text_embedding"] = df_paraphrase_max["text"].progress_apply(
    model_parahprase_max.encode
)

100%|██████████| 49000/49000 [1:17:08<00:00, 10.59it/s]   
100%|██████████| 49000/49000 [7:12:45<00:00,  1.89it/s]     
100%|██████████| 49000/49000 [72:09:08<00:00,  5.30s/it]      
100%|██████████| 49000/49000 [38:20<00:00, 21.30it/s]


In [77]:
from src.utility import (
    save_csv_with_embeddings,
    load_samples_with_numpy,
    series_numpy_equals,
)

save_csv_with_embeddings(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Paraphrase_Max",
    df_paraphrase_max,
    embeddings_columns=[
        "dirty_text_embedding",
        "cleaned_text_embeddings",
        "lemmas_embeddings",
        "stop_word_removed_lemmas_embeddings",
    ],
)

## Loading Samples with Embeddings

In [1]:
from src.utility import load_samples_with_numpy, load_samples_with_list
import pandas as pd

cols_to_listify = [
    "lemmas",
    "translated_nouns_adjs_verbs",
    "nouns_adjs_verbs",
    "translated_lemmas",
    "stop_word_removed_lemmas",
    "translated_stop_word_removed_lemmas",
    "emojis",
]
df_labse = load_samples_with_numpy(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Labse",
    loading_func=lambda x: load_samples_with_list(cols_to_listify, x),
)
df_paraphrase_max = load_samples_with_numpy(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Paraphrase_Max",
    loading_func=lambda x: load_samples_with_list(cols_to_listify, x),
)
df_paraphrase_min = load_samples_with_numpy(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Paraphrase_Min",
    loading_func=lambda x: load_samples_with_list(cols_to_listify, x),
)


## Define Metrics

Fitting the topics and computing the metrics takes about 40 seconds per run, so 100 runs should finish in roughly one hour.
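
The runtime estimate can be checked with a short calculation. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
from datetime import timedelta

seconds_per_run = 40  # per-run estimate stated in the text
n_runs = 100

estimated = timedelta(seconds=seconds_per_run * n_runs)
print(estimated)  # 1:06:40
```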

In [2]:
from src.metrics import calculate_metrics_weighted_sorted


def get_metrics(
    df: pd.DataFrame, topics: list[int], eval_space: str
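) -> dict[str, float]:
    """Topic-size statistics plus the metrics from calculate_metrics_weighted_sorted."""
    topic_series = pd.Series(topics)
    # Topic -1 is BERTopic's outlier bucket; exclude it from the size statistics.
    topic_series = topic_series[topic_series != -1]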

    return {
        "mean_topic_size": topic_series.value_counts().mean(),
        "std_topic_size": topic_series.value_counts().std(),
        "outlier_freq": (pd.Series(topics) == -1).sum() / pd.Series(topics).shape[0],
        "nr_topics": topic_series.unique().shape[0],
        **{
            f"Topic_Size_{quantil}_Quantil": value
            for quantil, value in topic_series.value_counts()
            .quantile([0, 0.25, 0.5, 0.75])
            .to_dict()
            .items()
        },
        **calculate_metrics_weighted_sorted(df, topics, eval_space),
    }

from src.metrics import calculate_metrics

def get_metrics_unweighted(
    df: pd.DataFrame, topics: list[int], eval_space: str
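) -> dict[str, float]:
    """Topic-size statistics plus the metrics from calculate_metrics (unweighted)."""
    topic_series = pd.Series(topics)
    # Topic -1 is BERTopic's outlier bucket; exclude it from the size statistics.
    topic_series = topic_series[topic_series != -1]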

    return {
        "mean_topic_size": topic_series.value_counts().mean(),
        "std_topic_size": topic_series.value_counts().std(),
        "outlier_freq": (pd.Series(topics) == -1).sum() / pd.Series(topics).shape[0],
        "nr_topics": topic_series.unique().shape[0],
        **{
            f"Topic_Size_{quantil}_Quantil": value
            for quantil, value in topic_series.value_counts()
            .quantile([0, 0.25, 0.5, 0.75])
            .to_dict()
            .items()
        },
        **calculate_metrics(df, topics, eval_space),
    }
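
To make the topic-size statistics concrete, here is a minimal sketch on a toy topic assignment that uses only the standard library. It mirrors the size-related entries of the functions above but leaves out the quantiles and the coherence and diversity metrics; `topic_size_stats` and `toy_assignment` are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def topic_size_stats(topics: list[int]) -> dict[str, float]:
    """Size statistics over all topics except the outlier topic -1."""
    sizes = Counter(topic for topic in topics if topic != -1)
    counts = list(sizes.values())
    return {
        "mean_topic_size": mean(counts),
        # Sample standard deviation, the same convention as pandas' .std().
        "std_topic_size": stdev(counts),
        "outlier_freq": topics.count(-1) / len(topics),
        "nr_topics": len(sizes),
    }


# Two outliers and three topics with 2, 3, and 1 documents.
toy_assignment = [-1, 0, 0, 1, 1, 1, -1, 2]
print(topic_size_stats(toy_assignment))
```

Note that `stdev` needs at least two topics; a run that collapses into a single topic would need separate handling.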


In [5]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np


topic_model = BERTopic()


topics, probs = topic_model.fit_transform(
    df_labse["text"].to_list(),
    embeddings=np.stack(df_labse["dirty_text_embedding"].to_list(), axis=0),
)

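# get_metrics requires an eval_space; the word space used in the later runs is passed here.
get_metrics(df_labse, topics, "translated_stop_word_removed_lemmas")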

{'mean_topic_size': 62.88212927756654,
 'std_topic_size': 92.98287099105377,
 'outlier_freq': 0.6624897959183673,
 'nr_topics': 263,
 'Topic_Size_0.25_Quantil': 16.0,
 'Topic_Size_0.5_Quantil': 27.0,
 'Topic_Size_0.75_Quantil': 62.5,
 'coherence': -0.00013048423764516268,
 'diversity': 0.23854166666666668}

In [39]:
df_labse.columns

Index(['Unnamed: 0', 'text', 'cleaned_text', 'lemmas',
       'translated_nouns_adjs_verbs', 'nouns_adjs_verbs', 'translated_lemmas',
       'translated_stop_word_removed_lemmas', 'lang', 'week', 'translated',
       'emojis', 'stop_word_removed_lemmas', 'lemmas_embeddings',
       'stop_word_removed_lemmas_embeddings', 'dirty_text_embedding',
       'cleaned_text_embeddings'],
      dtype='object')

In [6]:
word_spaces = [
    # "translated_nouns_adjs_verbs",
    # "nouns_adjs_verbs",
    "translated_stop_word_removed_lemmas",
    # "stop_word_removed_lemmas",
]

## Modelling and Evaluation

## LDA

## LaBSE Embeddings

In [1]:
from src.utility import load_samples_with_numpy, load_samples_with_list
import pandas as pd

cols_to_listify = [
    "lemmas",
    "translated_nouns_adjs_verbs",
    "nouns_adjs_verbs",
    "translated_lemmas",
    "stop_word_removed_lemmas",
    "translated_stop_word_removed_lemmas",
    "emojis",
]
df_labse = load_samples_with_numpy(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Labse",
    loading_func=lambda x: load_samples_with_list(cols_to_listify, x),
)


### Dirty Text Embeddings

In [48]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_labse_dirty_text_2 = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_labse["text"].to_list(),
        embeddings=np.stack(df_labse["dirty_text_embedding"].to_list(), axis=0),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_labse, topics, eval_space),
            "eval_space": eval_space,
            "model": "labse",
            "preprocessing": "dirty_text_embeddings",
        }

        eval_labse_dirty_text_2.append(eval_dict)

100%|██████████| 100/100 [1:38:05<00:00, 58.85s/it]


In [125]:
pd.DataFrame(eval_labse_dirty_text_2).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/labse_dirty_text.csv"
)
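
The evaluation cells in this chapter all follow the same pattern: fit repeatedly, compute the metrics per word space, and collect one row per combination. A possible way to factor this out is sketched below with stand-in functions so that it runs without BERTopic or the embeddings. `run_repeated_evaluation`, `fake_fit`, `fake_metrics`, and the extra `run` key are assumptions for illustration and do not appear in the original cells.

```python
from typing import Callable


def run_repeated_evaluation(
    fit_topics: Callable[[], list[int]],
    compute_metrics: Callable[[list[int], str], dict[str, float]],
    word_spaces: list[str],
    model: str,
    preprocessing: str,
    n_runs: int = 100,
) -> list[dict]:
    """Fit n_runs times and collect one metrics row per run and word space."""
    rows: list[dict] = []
    for run in range(n_runs):
        topics = fit_topics()
        for eval_space in word_spaces:
            rows.append(
                {
                    **compute_metrics(topics, eval_space),
                    "eval_space": eval_space,
                    "model": model,
                    "preprocessing": preprocessing,
                    "run": run,  # added here to tell the repetitions apart
                }
            )
    return rows


# Stand-ins so the sketch runs without a topic model.
def fake_fit() -> list[int]:
    return [-1, 0, 0, 1]


def fake_metrics(topics: list[int], eval_space: str) -> dict[str, float]:
    return {"outlier_freq": topics.count(-1) / len(topics)}


rows = run_repeated_evaluation(
    fake_fit,
    fake_metrics,
    ["translated_stop_word_removed_lemmas"],
    model="labse",
    preprocessing="dirty_text_embeddings",
    n_runs=3,
)
print(len(rows), rows[0])
```

In the notebook, `fit_topics` could wrap the `topic_model.fit_transform(...)` call and return only the topics, while `compute_metrics` could wrap `get_metrics` with the matching data frame.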

### Cleaned Text Embeddings

In [49]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_labse_cleaned_text_2 = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_labse["text"].to_list(),
        embeddings=np.stack(df_labse["cleaned_text_embeddings"].to_list(), axis=0),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_labse, topics, eval_space),
            "eval_space": eval_space,
            "model": "labse",
            "preprocessing": "cleaned_text_embeddings",
        }

        eval_labse_cleaned_text_2.append(eval_dict)

100%|██████████| 100/100 [1:24:39<00:00, 50.79s/it]


In [127]:
pd.DataFrame(
    eval_labse_cleaned_text_2
)  # .to_csv('/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/labse_cleaned_text.csv')

Unnamed: 0,mean_topic_size,std_topic_size,outlier_freq,nr_topics,Topic_Size_0.0_Quantil,Topic_Size_0.25_Quantil,Topic_Size_0.5_Quantil,Topic_Size_0.75_Quantil,coherence,diversity,eval_space,model,preprocessing
0,606.454545,2729.769391,0.727714,22,11.0,13.25,20.0,31.25,-0.001423,0.732609,translated_nouns_adjs_verbs,labse,cleaned_text_embeddings
1,606.454545,2729.769391,0.727714,22,11.0,13.25,20.0,31.25,-0.012167,0.855435,nouns_adjs_verbs,labse,cleaned_text_embeddings
2,606.454545,2729.769391,0.727714,22,11.0,13.25,20.0,31.25,-0.001166,0.741304,translated_stop_word_removed_lemmas,labse,cleaned_text_embeddings
3,606.454545,2729.769391,0.727714,22,11.0,13.25,20.0,31.25,-0.011957,0.861957,stop_word_removed_lemmas,labse,cleaned_text_embeddings
4,131.333333,896.214573,0.710531,108,10.0,16.00,23.0,43.25,-0.000506,0.520183,translated_nouns_adjs_verbs,labse,cleaned_text_embeddings
...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,1512.444444,4453.739471,0.722204,9,12.0,13.00,24.0,44.00,-0.028410,0.817500,stop_word_removed_lemmas,labse,cleaned_text_embeddings
396,444.787879,2381.009105,0.700449,33,10.0,13.00,22.0,32.00,-0.000774,0.682353,translated_nouns_adjs_verbs,labse,cleaned_text_embeddings
397,444.787879,2381.009105,0.700449,33,10.0,13.00,22.0,32.00,-0.007466,0.826471,nouns_adjs_verbs,labse,cleaned_text_embeddings
398,444.787879,2381.009105,0.700449,33,10.0,13.00,22.0,32.00,-0.000647,0.693382,translated_stop_word_removed_lemmas,labse,cleaned_text_embeddings


### Lemmas Embeddings

In [120]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_labse_lemmas = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):

    topics, probs = topic_model.fit_transform(
        df_labse["text"].to_list(),
        embeddings=np.stack(df_labse["lemmas_embeddings"].to_list(), axis=0),
    )
    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_labse, topics, eval_space),
            "eval_space": eval_space,
            "model": "labse",
            "preprocessing": "lemmas_embeddings",
        }

        eval_labse_lemmas.append(eval_dict)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)

In [129]:
pd.DataFrame(eval_labse_lemmas).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/labse_lemmas.csv"
)

In [27]:
pd.DataFrame(eval_labse_lemmas)

Unnamed: 0,mean_topic_size,std_topic_size,outlier_freq,nr_topics,Topic_Size_0.25_Quantil,Topic_Size_0.5_Quantil,Topic_Size_0.75_Quantil,coherence,diversity,eval_space,model,preprocessing
0,57.241758,102.329055,0.681082,273,15.0,24.0,52.00,-0.000273,0.329197,translated_nouns_adjs_verbs,labse,lemmas_embeddings
1,57.241758,102.329055,0.681082,273,15.0,24.0,52.00,,0.549726,nouns_adjs_verbs,labse,lemmas_embeddings
2,57.241758,102.329055,0.681082,273,15.0,24.0,52.00,-0.000286,0.342336,translated_stop_word_removed_lemmas,labse,lemmas_embeddings
3,57.241758,102.329055,0.681082,273,15.0,24.0,52.00,,0.557847,stop_word_removed_lemmas,labse,lemmas_embeddings
4,91.529412,423.513620,0.587184,221,15.0,29.0,61.00,-0.000154,0.332658,translated_nouns_adjs_verbs,labse,lemmas_embeddings
...,...,...,...,...,...,...,...,...,...,...,...,...
395,54.335740,81.377493,0.692837,277,15.0,23.0,51.00,,0.564209,stop_word_removed_lemmas,labse,lemmas_embeddings
396,49.595070,70.042350,0.712551,284,15.0,25.0,50.25,-0.000304,0.319912,translated_nouns_adjs_verbs,labse,lemmas_embeddings
397,49.595070,70.042350,0.712551,284,15.0,25.0,50.25,,0.550702,nouns_adjs_verbs,labse,lemmas_embeddings
398,49.595070,70.042350,0.712551,284,15.0,25.0,50.25,-0.000318,0.331579,translated_stop_word_removed_lemmas,labse,lemmas_embeddings
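
The empty coherence cells in the table above are missing values (NaN). When the runs are summarized, such entries have to be skipped explicitly in plain Python, otherwise the mean itself becomes NaN. A minimal sketch, with `nan_mean` as a hypothetical helper and a few coherence values taken from the table:

```python
import math


def nan_mean(values: list[float]) -> float:
    """Mean over the non-NaN entries; NaN if no valid entry remains."""
    valid = [value for value in values if not math.isnan(value)]
    return sum(valid) / len(valid) if valid else math.nan


# Three coherence values from the table above plus one missing entry.
coherence = [-0.000273, float("nan"), -0.000154, -0.000304]
print(nan_mean(coherence))
```

pandas skips NaN by default in `Series.mean()`, so this only matters when the values are aggregated outside pandas.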


### Stop-Word-Removed Lemmas Embeddings

In [131]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_labse_lemmas_stop_word_removed = []


topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):

    topics, probs = topic_model.fit_transform(
        df_labse["text"].to_list(),
        embeddings=np.stack(
            df_labse["stop_word_removed_lemmas_embeddings"].to_list(), axis=0
        ),
    )
    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_labse, topics, eval_space),
            "eval_space": eval_space,
            "model": "labse",
            "preprocessing": "stop_word_removed_lemmas_embeddings",
        }

        eval_labse_lemmas_stop_word_removed.append(eval_dict)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)

In [134]:
pd.DataFrame(
    eval_labse_lemmas_stop_word_removed
)  # .to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/labse_stop_word_removed_lemmas.csv")

Unnamed: 0,mean_topic_size,std_topic_size,outlier_freq,nr_topics,Topic_Size_0.0_Quantil,Topic_Size_0.25_Quantil,Topic_Size_0.5_Quantil,Topic_Size_0.75_Quantil,coherence,diversity,eval_space,model,preprocessing
0,394.300000,2215.144484,0.678122,40,11.0,14.75,22.5,36.25,-0.000966,0.626220,translated_nouns_adjs_verbs,labse,stop_word_removed_lemmas_embeddings
1,394.300000,2215.144484,0.678122,40,11.0,14.75,22.5,36.25,-0.005355,0.793902,nouns_adjs_verbs,labse,stop_word_removed_lemmas_embeddings
2,394.300000,2215.144484,0.678122,40,11.0,14.75,22.5,36.25,-0.000806,0.628659,translated_stop_word_removed_lemmas,labse,stop_word_removed_lemmas_embeddings
3,394.300000,2215.144484,0.678122,40,11.0,14.75,22.5,36.25,-0.005201,0.809756,stop_word_removed_lemmas,labse,stop_word_removed_lemmas_embeddings
4,768.173913,3467.012215,0.639429,23,10.0,15.50,23.0,34.00,-0.001370,0.677083,translated_nouns_adjs_verbs,labse,stop_word_removed_lemmas_embeddings
...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,117.413043,813.998554,0.669327,138,10.0,14.00,22.5,48.00,-0.000911,0.714388,stop_word_removed_lemmas,labse,stop_word_removed_lemmas_embeddings
396,147.711864,1045.040056,0.644286,118,10.0,14.00,29.5,59.75,-0.000514,0.458613,translated_nouns_adjs_verbs,labse,stop_word_removed_lemmas_embeddings
397,147.711864,1045.040056,0.644286,118,10.0,14.00,29.5,59.75,-0.001848,0.685294,nouns_adjs_verbs,labse,stop_word_removed_lemmas_embeddings
398,147.711864,1045.040056,0.644286,118,10.0,14.00,29.5,59.75,-0.000540,0.475420,translated_stop_word_removed_lemmas,labse,stop_word_removed_lemmas_embeddings


## Paraphrase Mini Embeddings


In [1]:
from src.utility import load_samples_with_numpy, load_samples_with_list
import pandas as pd

cols_to_listify = [
    "lemmas",
    "translated_nouns_adjs_verbs",
    "nouns_adjs_verbs",
    "translated_lemmas",
    "stop_word_removed_lemmas",
    "translated_stop_word_removed_lemmas",
    "emojis",
]
df_paraphrase_min = load_samples_with_numpy(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Paraphrase_Min",
    loading_func=lambda x: load_samples_with_list(cols_to_listify, x),
)
print(df_paraphrase_min.shape)
print(df_paraphrase_min.columns)


(49000, 17)
Index(['Unnamed: 0', 'text', 'cleaned_text', 'lemmas',
       'translated_nouns_adjs_verbs', 'nouns_adjs_verbs', 'translated_lemmas',
       'translated_stop_word_removed_lemmas', 'lang', 'week', 'translated',
       'emojis', 'stop_word_removed_lemmas', 'lemmas_embeddings',
       'stop_word_removed_lemmas_embeddings', 'dirty_text_embedding',
       'cleaned_text_embeddings'],
      dtype='object')


### Dirty Text Embeddings

In [7]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_min_dirty_text = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_min["text"].to_list(),
        embeddings=np.stack(
            df_paraphrase_min["dirty_text_embedding"].to_list(), axis=0
        ),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics_unweighted(df_paraphrase_min, topics, eval_space),
            "eval_space": eval_space + "_unweighted",
            "model": "paraphrase_min",
            "preprocessing": "dirty_text_embeddings",
        }

        eval_paraphrase_min_dirty_text.append(eval_dict)

100%|██████████| 100/100 [2:14:00<00:00, 80.40s/it]  


In [8]:
df_t = pd.DataFrame(eval_paraphrase_min_dirty_text)
df_t["model"] = "paraphrase_min"
df_t.to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleDataUnweighted/paraphrase_min_dirty_text.csv"
)

### Cleaned Text Embeddings

In [9]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_min_cleaned_text = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_min["text"].to_list(),
        embeddings=np.stack(
            df_paraphrase_min["cleaned_text_embeddings"].to_list(), axis=0
        ),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics_unweighted(df_paraphrase_min, topics, eval_space),
            "eval_space": eval_space + "_unweighted",
            "model": "paraphrase_min",
            "preprocessing": "cleaned_text_embeddings",
        }

        eval_paraphrase_min_cleaned_text.append(eval_dict)

100%|██████████| 100/100 [6:13:19<00:00, 223.99s/it]   


In [10]:
pd.DataFrame(eval_paraphrase_min_cleaned_text).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleDataUnweighted/eval_paraphrase_min_cleaned_text.csv"
)

### Lemmas Embeddings

In [11]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_min_lemmas = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_min["text"].to_list(),
        embeddings=np.stack(df_paraphrase_min["lemmas_embeddings"].to_list(), axis=0),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics_unweighted(df_paraphrase_min, topics, eval_space),
            "eval_space": eval_space + "_unweighted",
            "model": "paraphrase_min",
            "preprocessing": "lemmas_embeddings",
        }

        eval_paraphrase_min_lemmas.append(eval_dict)

 54%|█████▍    | 54/100 [28:36<24:21, 31.78s/it]


KeyboardInterrupt: 

In [None]:
pd.DataFrame(eval_paraphrase_min_lemmas).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleDataUnweighted/eval_paraphrase_min_lemmas.csv"
)

### Stop-Word-Removed Lemmas Embeddings

In [None]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_min_stop_word_removed_lemmas = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_min["text"].to_list(),
        embeddings=np.stack(
            df_paraphrase_min["stop_word_removed_lemmas_embeddings"].to_list(), axis=0
        ),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics_unweighted(df_paraphrase_min, topics, eval_space),
            "eval_space": eval_space + "_unweighted",
            "model": "paraphrase_min",
            "preprocessing": "stop_word_removed_lemmas_embeddings",
        }

        eval_paraphrase_min_stop_word_removed_lemmas.append(eval_dict)

100%|██████████| 100/100 [2:35:16<00:00, 93.17s/it]  


In [None]:
pd.DataFrame(
    eval_paraphrase_min_stop_word_removed_lemmas
).to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleDataUnweighted/eval_paraphrase_min_stop_word_removed_lemmas.csv")

| Unnamed: 0 | mean_topic_size | std_topic_size | outlier_freq | nr_topics | Topic_Size_0.0_Quantil | Topic_Size_0.25_Quantil | Topic_Size_0.5_Quantil | Topic_Size_0.75_Quantil | coherence | diversity | eval_space | model | preprocessing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 225.000000 | 1866.406341 | 0.513265 | 106 | 10.0 | 17.00 | 29.0 | 49.0 | -0.000415 | 0.457710 | translated_nouns_adjs_verbs | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 1 | 225.000000 | 1866.406341 | 0.513265 | 106 | 10.0 | 17.00 | 29.0 | 49.0 | -0.002358 | 0.733178 | nouns_adjs_verbs | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 2 | 225.000000 | 1866.406341 | 0.513265 | 106 | 10.0 | 17.00 | 29.0 | 49.0 | -0.000376 | 0.484346 | translated_stop_word_removed_lemmas | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 3 | 225.000000 | 1866.406341 | 0.513265 | 106 | 10.0 | 17.00 | 29.0 | 49.0 | -0.002323 | 0.756075 | stop_word_removed_lemmas | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 4 | 161.713287 | 1324.584578 | 0.528061 | 143 | 10.0 | 18.00 | 30.0 | 50.5 | -0.000328 | 0.399479 | translated_nouns_adjs_verbs | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 84.855124 | 388.529292 | 0.509918 | 283 | 10.0 | 16.50 | 30.0 | 59.5 | -0.000836 | 0.576673 | stop_word_removed_lemmas | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 396 | 102.754464 | 684.567369 | 0.530265 | 224 | 10.0 | 16.75 | 29.5 | 58.0 | -0.000254 | 0.338778 | translated_nouns_adjs_verbs | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 397 | 102.754464 | 684.567369 | 0.530265 | 224 | 10.0 | 16.75 | 29.5 | 58.0 | -0.001034 | 0.620111 | nouns_adjs_verbs | paraphrapse_min | stop_word_removed_lemmas_embeddings |
| 398 | 102.754464 | 684.567369 | 0.530265 | 224 | 10.0 | 16.75 | 29.5 | 58.0 | -0.000278 | 0.354333 | translated_stop_word_removed_lemmas | paraphrapse_min | stop_word_removed_lemmas_embeddings |
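
The size columns in the table above appear to describe the non-outlier topics only: in row 0, 106 topics with a mean size of 225 give 23,850 assigned documents, which fits an outlier share of 0.513265 if the sample holds about 49,000 documents. The sketch below shows one way such statistics could be derived from the `topics` list returned by `fit_transform`. It is not the notebook's `get_metrics_unweighted`, whose code is not shown here; `topic_size_stats` is a hypothetical name, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence

import numpy as np


def topic_size_stats(topics: Sequence[int]) -> Dict[str, float]:
    """Summarise topic sizes, treating label -1 as the outlier bucket."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # BERTopic assigns -1 to outlier documents
    sizes = np.array(list(counts.values()), dtype=float)

    stats = {
        "mean_topic_size": float(sizes.mean()),
        "std_topic_size": float(sizes.std(ddof=1)),  # assumption: sample std
        "outlier_freq": n_outliers / len(topics),
        "nr_topics": len(sizes),
    }
    # Quantile column names follow the spelling used in the table.
    for q in (0.0, 0.25, 0.5, 0.75):
        stats[f"Topic_Size_{q}_Quantil"] = float(np.quantile(sizes, q))
    return stats


# Toy assignment: 2 outliers and three topics of sizes 3, 2 and 5.
toy_topics = [-1, -1, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
stats = topic_size_stats(toy_topics)
```

Because these values depend only on the topic assignment, they are identical across the four `eval_space` rows of a single run, which matches the repeated values in the table.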


## Paraphrase Max Embeddings

In [1]:
from src.utility import load_samples_with_numpy, load_samples_with_list
import pandas as pd

cols_to_listify = [
    "lemmas",
    "translated_nouns_adjs_verbs",
    "nouns_adjs_verbs",
    "translated_lemmas",
    "stop_word_removed_lemmas",
    "translated_stop_word_removed_lemmas",
    "emojis",
]
df_paraphrase_max = load_samples_with_numpy(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Embeddings/Evaluation/Paraphrase_Max",
    loading_func=lambda x: load_samples_with_list(cols_to_listify, x),
)



### Dirty Text Embeddings

In [6]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_max_dirty_text = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_max["text"].to_list(),
        embeddings=np.stack(
            df_paraphrase_max["dirty_text_embedding"].to_list(), axis=0
        ),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_paraphrase_max, topics, eval_space),
            "eval_space": eval_space,
            "model": "paraphrase_max",
            "preprocessing": "dirty_text_embeddings",
        }

        eval_paraphrase_max_dirty_text.append(eval_dict)

100%|██████████| 100/100 [4:55:00<00:00, 177.01s/it]   
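
Each loop in this chapter passes `np.stack(..., axis=0)` as the `embeddings` argument. That call turns a column holding one 1-D vector per row into a single 2-D array with one row per document. A small sketch with toy 4-dimensional vectors (the real models produce 384 or 768 dimensions) illustrates the shape; the `toy` frame is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: one short text and one 1-D embedding vector per row.
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c"],
        "dirty_text_embedding": [
            np.arange(4, dtype=np.float32) + i for i in range(3)
        ],
    }
)

# Stack the per-row vectors along a new first axis: (n_documents, n_dimensions).
matrix = np.stack(toy["dirty_text_embedding"].to_list(), axis=0)
print(matrix.shape)  # (3, 4)
```

`np.stack` raises a `ValueError` if the vectors differ in length, so it also acts as a cheap consistency check on the stored embeddings.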


In [12]:
pd.DataFrame(eval_paraphrase_max_dirty_text).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/eval_paraphrase_max_dirty_text.csv"
)

### Cleaned Text Embeddings

In [8]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_max_cleaned_text = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_max["text"].to_list(),
        embeddings=np.stack(
            df_paraphrase_max["cleaned_text_embeddings"].to_list(), axis=0
        ),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_paraphrase_max, topics, eval_space),
            "eval_space": eval_space,
            "model": "paraphrase_max",
            "preprocessing": "cleaned_text_embeddings",
        }

        eval_paraphrase_max_cleaned_text.append(eval_dict)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|██████████| 100/100 [1:45:0
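
The two output lines above look like the source-line echoes that accompany NumPy's `RuntimeWarning`s when a mean is taken over an empty array. The warning text itself is not preserved in this output, so this reading is an inference. In that situation NumPy returns `nan`, which would then end up in the metric columns. A minimal sketch of the behaviour and of one possible guard follows; `safe_mean` is a hypothetical helper, not a function from this project.

```python
import warnings

import numpy as np


def safe_mean(values) -> float:
    """Return the mean of values, or nan without a warning if there are none."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return float("nan")
    return float(arr.mean())


# Reproduce the warning: the mean of an empty array is nan.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    raw = np.mean(np.array([]))

guarded = safe_mean([])
```

Whether an empty input should yield `nan`, be skipped, or be treated as an error depends on how the metric is aggregated afterwards; the sketch only makes the case explicit.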

In [14]:
pd.DataFrame(eval_paraphrase_max_cleaned_text).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/eval_paraphrase_max_cleaned_text.csv"
)

### Lemmas Embeddings

In [9]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_max_lemmas = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_max["text"].to_list(),
        embeddings=np.stack(df_paraphrase_max["lemmas_embeddings"].to_list(), axis=0),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_paraphrase_max, topics, eval_space),
            "eval_space": eval_space,
            "model": "paraphrase_max",
            "preprocessing": "lemmas_embeddings",
        }

        eval_paraphrase_max_lemmas.append(eval_dict)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)

In [16]:
pd.DataFrame(eval_paraphrase_max_lemmas).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/eval_paraphrase_max_lemmas.csv"
)

### Stop Word Removed Lemmas Embeddings

In [10]:
from tqdm import tqdm
from bertopic import BERTopic
import numpy as np

eval_paraphrase_max_stop_word_removed_lemmas = []

topic_model = BERTopic(nr_topics="auto")


for _ in tqdm(range(100)):
    topics, probs = topic_model.fit_transform(
        df_paraphrase_max["text"].to_list(),
        embeddings=np.stack(
            df_paraphrase_max["stop_word_removed_lemmas_embeddings"].to_list(), axis=0
        ),
    )

    for eval_space in word_spaces:

        eval_dict = {
            **get_metrics(df_paraphrase_max, topics, eval_space),
            "eval_space": eval_space,
            "model": "paraphrase_max",
            "preprocessing": "stop_word_removed_lemmas_embeddings",
        }

        eval_paraphrase_max_stop_word_removed_lemmas.append(eval_dict)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)

In [18]:
pd.DataFrame(eval_paraphrase_max_stop_word_removed_lemmas).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/eval_paraphrase_max_stop_word_removed_lemmas.csv"
)
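
The four evaluation cells of this section differ only in the embedding column, the `preprocessing` label and the result list. A sketch of folding them into one helper is given below. `evaluate_embedding_column`, `fit_fn` and `metrics_fn` are hypothetical names; the stubs stand in for BERTopic and `get_metrics` so that the example runs on its own.

```python
from typing import Callable, Dict, List, Sequence

import numpy as np
import pandas as pd


def evaluate_embedding_column(
    df: pd.DataFrame,
    embedding_column: str,
    model_name: str,
    preprocessing: str,
    word_spaces: Sequence[str],
    fit_fn: Callable[[List[str], np.ndarray], Sequence[int]],
    metrics_fn: Callable[[pd.DataFrame, Sequence[int], str], Dict],
    n_runs: int = 100,
) -> List[Dict]:
    """Fit n_runs times on one embedding column and score every word space."""
    docs = df["text"].to_list()
    embeddings = np.stack(df[embedding_column].to_list(), axis=0)

    rows: List[Dict] = []
    for _ in range(n_runs):
        topics = fit_fn(docs, embeddings)
        for eval_space in word_spaces:
            rows.append(
                {
                    **metrics_fn(df, topics, eval_space),
                    "eval_space": eval_space,
                    "model": model_name,
                    "preprocessing": preprocessing,
                }
            )
    return rows


# Stand-ins so the sketch runs without BERTopic or the project's metrics.
def stub_fit(docs, embeddings):
    return [0 if row[0] < 1.0 else 1 for row in embeddings]


def stub_metrics(df, topics, eval_space):
    return {"nr_topics": len(set(topics))}


toy = pd.DataFrame(
    {
        "text": ["a", "b", "c"],
        "lemmas_embeddings": [np.full(2, float(i)) for i in range(3)],
    }
)
rows = evaluate_embedding_column(
    toy,
    embedding_column="lemmas_embeddings",
    model_name="paraphrase_max",
    preprocessing="lemmas_embeddings",
    word_spaces=["nouns_adjs_verbs", "stop_word_removed_lemmas"],
    fit_fn=stub_fit,
    metrics_fn=stub_metrics,
    n_runs=2,
)
```

In the notebook, `fit_fn` could presumably be `lambda docs, emb: topic_model.fit_transform(docs, embeddings=emb)[0]` and `metrics_fn` could be `get_metrics`, assuming both keep the call shapes used in the cells above.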

### Concatenate All Results into a Single DataFrame

In [26]:
from src.utility import iterate_dataframes_path

dfs = []
for df, path in iterate_dataframes_path(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/"
):
    dfs.append(df)

100%|██████████| 12/12 [00:00<00:00, 731.84it/s]


In [28]:
pd.concat(dfs).to_csv(
    "/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/TranslateSampleEval/all_results.csv"
)

| Unnamed: 0.1 | Unnamed: 0 | mean_topic_size | std_topic_size | outlier_freq | nr_topics | Topic_Size_0.0_Quantil | Topic_Size_0.25_Quantil | Topic_Size_0.5_Quantil | Topic_Size_0.75_Quantil | coherence | diversity | eval_space | model | preprocessing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 99.502924 | 486.942349 | 0.652755 | 171 | 10.0 | 16.00 | 29.0 | 61.50 | -0.000339 | 0.418023 | translated_nouns_adjs_verbs | labse | lemmas_embeddings |
| 1 | 1 | 99.502924 | 486.942349 | 0.652755 | 171 | 10.0 | 16.00 | 29.0 | 61.50 | -0.001242 | 0.645058 | nouns_adjs_verbs | labse | lemmas_embeddings |
| 2 | 2 | 99.502924 | 486.942349 | 0.652755 | 171 | 10.0 | 16.00 | 29.0 | 61.50 | -0.000367 | 0.429506 | translated_stop_word_removed_lemmas | labse | lemmas_embeddings |
| 3 | 3 | 99.502924 | 486.942349 | 0.652755 | 171 | 10.0 | 16.00 | 29.0 | 61.50 | -0.001306 | 0.659884 | stop_word_removed_lemmas | labse | lemmas_embeddings |
| 4 | 4 | 383.500000 | 2415.561101 | 0.593020 | 52 | 11.0 | 14.75 | 29.0 | 52.75 | -0.000687 | 0.619811 | translated_nouns_adjs_verbs | labse | lemmas_embeddings |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 395 | 1167.764706 | 4681.187944 | 0.594857 | 17 | 10.0 | 13.00 | 21.0 | 40.00 | -0.013246 | 0.776389 | stop_word_removed_lemmas | labse | lemmas_embeddings |
| 396 | 396 | 723.550000 | 3124.281382 | 0.704673 | 20 | 10.0 | 12.75 | 23.5 | 33.50 | -0.001403 | 0.783333 | translated_nouns_adjs_verbs | labse | lemmas_embeddings |
| 397 | 397 | 723.550000 | 3124.281382 | 0.704673 | 20 | 10.0 | 12.75 | 23.5 | 33.50 | -0.012435 | 0.833333 | nouns_adjs_verbs | labse | lemmas_embeddings |
| 398 | 398 | 723.550000 | 3124.281382 | 0.704673 | 20 | 10.0 | 12.75 | 23.5 | 33.50 | -0.001305 | 0.783333 | translated_stop_word_removed_lemmas | labse | lemmas_embeddings |
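
The `Unnamed: 0.1` and `Unnamed: 0` columns in the table above are what pandas typically produces when a frame is written with its index and read back without `index_col`, here once for the per-configuration files and once for the combined file. The sketch below shows the effect and one way to avoid it, followed by a per-configuration summary. It uses in-memory buffers instead of the project's paths, and the numbers in `runs` are toy values, not results from this evaluation.

```python
import io

import pandas as pd

# Toy rows standing in for one saved result file (values are made up).
runs = pd.DataFrame(
    {
        "coherence": [-0.1, -0.2, -0.3],
        "diversity": [0.4, 0.5, 0.6],
        "model": ["labse"] * 3,
        "preprocessing": ["lemmas_embeddings"] * 3,
    }
)

# Writing with the default index and reading back adds an extra column.
with_index = io.StringIO()
runs.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(reloaded.columns[0])  # Unnamed: 0

# Writing with index=False keeps the columns unchanged on reload.
without_index = io.StringIO()
runs.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

# ignore_index=True gives the combined frame one continuous row index.
combined = pd.concat([clean, clean], ignore_index=True)

# Mean score per model and preprocessing option.
summary = combined.groupby(["model", "preprocessing"])[
    ["coherence", "diversity"]
].mean()
```

If `iterate_dataframes_path` reads every CSV in the folder, which is not shown here, writing `all_results.csv` into that same folder would cause it to be picked up on a re-run; saving the combined file elsewhere would avoid that.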
