# 05 Sample Preparation


I have no experience with multilingual topic modelling so far. To experiment with such models and compare them with common monolingual topic modelling approaches, I need a sample of translated data. Because translation resources are very limited, I sample 100 tweets per week and per language and translate only those. The samples are then cleaned and lemmatised for further use. The last section gives an overview of the sample data.

## 5.1 Translation

I am using the [Google Translate API](https://cloud.google.com/translate/docs/reference/rest) to translate samples of tweets, which also lets me check that everything works for languages I don't understand. I rely on the free monthly quota and hope it will be enough.

In [1]:
from src.SampleTranslation05.translation_01 import translate_text

translate_text(text = "Hallo, dies ist ein Beispiel Text", source_language="de")

'Hello, this is an example text'
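
The implementation of `translate_text` lives in `src.SampleTranslation05.translation_01` and is not shown here. As a rough sketch of what such a wrapper might look like, assuming the `google-cloud-translate` package and its `translate_v2` client are used: the name `translate_text_sketch` and the optional `client` parameter are hypothetical and only there so the function can be tried without credentials.

```python
import html


def translate_text_sketch(text, source_language, target_language="en", client=None):
    """Translate one string; hypothetical stand-in for translate_text."""
    if client is None:
        # Imported lazily so the sketch can be exercised with a fake client.
        # Assumption: google-cloud-translate is installed and credentials are configured.
        from google.cloud import translate_v2 as translate

        client = translate.Client()

    result = client.translate(
        text,
        source_language=source_language,
        target_language=target_language,
    )
    # The v2 API returns HTML-escaped text by default (e.g. &#39; for an apostrophe).
    return html.unescape(result["translatedText"])
```

Passing the client in explicitly also makes it easy to count characters before sending anything, which matters when the quota is limited.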

## 5.2 Sample Selection and Translation

As translation resources are very limited, only a small sample of the tweets will be translated. I draw a sample for each week and each language because I will later try to interpret the topics over time.

In [17]:
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()
from src.SampleTranslation05.translation_01 import translate_df, create_week_from_timestamp, sample_from_weeks

Sample 100 tweets per week and language and translate them.

In [9]:
for language in ['de','en','es','fr','it','ru','uk']:
    df = pd.read_csv(f'/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/{language}.csv')
    df = df[df['tweetcreatedts'].astype(str).progress_apply(lambda x:len(x)>10)]
    df = df[df['tweetcreatedts'].astype(str).progress_apply(lambda x: x[:10]).progress_apply(lambda x: x[0]=='2')]

    df['week'] = create_week_from_timestamp(df)
    df_sample = sample_from_weeks(df, sample_size=100)


    # Translation is commented out because every call consumes translation quota.
    # Re-enable it before saving: the 'translated' column selected below only exists after this step.
    # df_sample['translated'] = translate_df(df_sample, language)
    df_sample[['text','translated','tweetcreatedts','tweetid','week']].to_csv(f'/Users/robinfeldmann/TopicAnalysisRUWTweets/src/SampleTranslation05/TranslatedSamples/{language}_100.csv')
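
The helpers `create_week_from_timestamp` and `sample_from_weeks` are imported from the project and not shown. A minimal sketch of the idea, assuming the week is derived from `tweetcreatedts` and a fixed number of rows is drawn per week, could look like this. The function names, the `seed` parameter and the choice of Monday as week start are assumptions, not the project's actual code.

```python
import pandas as pd


def week_from_timestamp_sketch(df, column="tweetcreatedts"):
    """Return the Monday that starts the week of each timestamp."""
    ts = pd.to_datetime(df[column], errors="coerce")  # unparsable values become NaT
    return ts.dt.to_period("W").dt.start_time


def sample_from_weeks_sketch(df, sample_size=100, week_column="week", seed=0):
    """Draw up to sample_size rows from every week."""
    parts = [
        # min() keeps weeks with fewer tweets than sample_size instead of failing.
        group.sample(n=min(sample_size, len(group)), random_state=seed)
        for _, group in df.groupby(week_column)
    ]
    return pd.concat(parts)


example = pd.DataFrame({
    "tweetcreatedts": ["2022-02-24 10:00:00"] * 5 + ["2022-03-01 08:00:00"] * 2,
    "text": list("abcdefg"),
})
example["week"] = week_from_timestamp_sketch(example)
example_sample = sample_from_weeks_sketch(example, sample_size=3)
```

With 100 tweets per week, the 7000 rows per language reported further down correspond to 70 sampled weeks.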

Sample 100 English tweets per week and prepare them as if they had been translated.



In [6]:
df = pd.read_csv('/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/en.csv')
df = df[df['tweetcreatedts'].astype(str).progress_apply(lambda x:len(x)>10)]
df = df[df['tweetcreatedts'].astype(str).progress_apply(lambda x: x[:10]).progress_apply(lambda x: x[0]=='2' and x[1]=='0')]

df['week'] = create_week_from_timestamp(df)
df_sample = sample_from_weeks(df, sample_size=100)

df_sample['translated'] = df_sample['text'].copy()
df_sample[['text','translated','tweetcreatedts','tweetid','week']].to_csv(f'/Users/robinfeldmann/TopicAnalysisRUWTweets/src/SampleTranslation05/TranslatedSamples/en_100.csv')

In [5]:
from src.utility import iterate_dataframes_path

translated_chars = 0
for df, path in iterate_dataframes_path('/Users/robinfeldmann/TopicAnalysisRUWTweets/src/SampleTranslation05/TranslatedSamples/'):
    lang = path.split('/')[-1].split('.')[0].split('_')[0]

    print(f"For {lang} there are {df.shape[0]} tweets that have {df['text'].str.len().sum()} characters.")
    if lang != 'en':
        translated_chars+= df['text'].str.len().sum()
print(f"Translated chars after all: {translated_chars}")

For de there are 7000 tweets that have 1442229 characters.
For es there are 7000 tweets that have 1391900 characters.
For ru there are 7000 tweets that have 1023639 characters.
For uk there are 7000 tweets that have 957905 characters.
For en there are 7000 tweets that have 1338079 characters.
For it there are 7000 tweets that have 1368515 characters.
For fr there are 7000 tweets that have 1393112 characters.
Translated chars after all: 7577300


Translating those 7,577,300 characters used up about 50% of the free translation quota I got from the Google Cloud API. If necessary, I could add other languages and perhaps increase the sample size slightly, but not by much.
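
To make the remaining headroom concrete, the following back-of-the-envelope calculation uses the character counts printed above. It takes the "about 50%" figure literally (`used_share = 0.5` is an assumption, not a measured value) and assumes that additional tweets would be about as long as the ones already translated.

```python
# Characters per non-English language, copied from the output above.
chars_per_language = {
    "de": 1442229, "es": 1391900, "ru": 1023639,
    "uk": 957905, "it": 1368515, "fr": 1393112,
}
tweets_per_language = 7000
used_share = 0.5  # assumption: "about 50%" of the free quota, taken literally

translated = sum(chars_per_language.values())
avg_chars_per_tweet = translated / (tweets_per_language * len(chars_per_language))
remaining_chars = translated / used_share - translated

# How many more tweets (and full 7000-tweet languages) would fit into the rest.
extra_tweets = round(remaining_chars / avg_chars_per_tweet)
extra_languages = extra_tweets // tweets_per_language

print(f"Average tweet length: {avg_chars_per_tweet:.1f} characters")
print(f"Roughly {extra_tweets} more tweets, or {extra_languages} more languages, would fit.")
```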

## 5.3 Preparing Texts

Load the sample data into a dictionary of dataframes.

In [1]:
from src.utility import iterate_dataframes_path


language_df = {}
for df, path in iterate_dataframes_path('/Users/robinfeldmann/TopicAnalysisRUWTweets/src/SampleTranslation05/TranslatedSamples/'):
    lang = path.split('/')[-1].split('.')[0].split('_')[0]

    language_df[lang] = df


Add columns with the cleaned text for both the original and the translated tweets.

In [2]:
from src.text_preparation_07 import clean
for df in language_df.values():
    df['cleaned_text_translated'] = df['translated'].progress_apply(clean)
    df['cleaned_text'] = df['text'].progress_apply(clean)
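
The `clean` function itself is defined elsewhere in the project. As an illustration of what a typical tweet-cleaning step does, here is a small sketch that strips URLs and @-mentions, keeps hashtag words without the `#`, and collapses whitespace. The name `clean_sketch` and the exact rules are assumptions and may differ from the real `clean`.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_sketch(text):
    """Hypothetical tweet cleaner; not the project's actual clean()."""
    if not isinstance(text, str):
        # Missing values (NaN) from pandas become an empty string.
        return ""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned_example = clean_sketch("@user Putin in #Ukraine https://t.co/abc  now")
```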

Extract lemmas, nouns, adjectives and verbs, emojis and named entities with spaCy.

In [5]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import add_lemmas_to_df
import pandas as pd

nlps = {}

nlps['en'] = spacy.load('en_core_web_sm')
nlps['de'] = spacy.load("de_core_news_sm")
nlps['fr'] = spacy.load("fr_core_news_sm")
nlps['es'] = spacy.load("es_core_news_sm")
nlps['it'] = spacy.load('it_core_news_sm')
nlps['ru'] = spacy.load('ru_core_news_sm')
nlps['uk'] = spacy.load('uk_core_news_sm')

for nlp in nlps.values():
    nlp.add_pipe("emoji", first=True)


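for key, df in language_df.items():
    # Lemmatise the original text with the language-specific model.
    df = add_lemmas_to_df(df, nlps[key])
    # Lemmatise the English translation; new columns get the 'translated_' prefix.
    df = add_lemmas_to_df(df, nlps['en'], 'cleaned_text_translated', 'translated_')
    # Store the result explicitly in case add_lemmas_to_df returns a new dataframe.
    language_df[key] = df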

Merge the per-language dataframes and save them as one CSV file.

In [10]:
for lang, df in language_df.items():
    df['lang'] = lang

pd.concat(language_df.values()).drop('Unnamed: 0', axis=1).to_csv('/Users/robinfeldmann/TopicAnalysisRUWTweets/src/SampleTranslation05/samples_ready.csv')

## 5.4 Analysing Sample

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/robinfeldmann/TopicAnalysisRUWTweets/src/SampleTranslation05/samples_ready.csv')

In [8]:
df['lang'].unique()

array(['de', 'es', 'ru', 'uk', 'en', 'it', 'fr'], dtype=object)

In [4]:
df.shape

(49000, 19)

In [4]:
from collections import Counter

def counter_from_df(df: pd.DataFrame, columns: list[str]) -> dict[str, Counter]:
    """Count token frequencies per column; each cell must hold a list of tokens."""
    for col in columns:
        if col not in df.columns:
            raise KeyError(f"{col} not in df.columns")

    counters = {col: Counter() for col in columns}

    # Iterating column-wise avoids the slow row-by-row df.iterrows().
    for col in columns:
        for tokens in df[col]:
            counters[col].update(tokens)

    return counters


In [5]:
counters = counter_from_df(df, ['lemmas'])

In [23]:
df.columns

Index(['Unnamed: 0', 'text', 'translated', 'tweetcreatedts', 'tweetid', 'week',
       'cleaned_text_translated', 'cleaned_text', 'lemmas', 'adjs_verbs',
       'nouns', 'entities', 'emojis', 'translated_lemmas',
       'translated_adjs_verbs', 'translated_nouns', 'translated_entities',
       'translated_emojis', 'lang'],
      dtype='object')
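
One caveat when reading `samples_ready.csv` back with `pd.read_csv`: if the token columns were stored as Python lists, they come back as strings such as `"['Bild', 'Putin']"`, and `Counter.update` would then count single characters instead of tokens. The sketch below shows one way to parse them again with `ast.literal_eval`. The helper name `parse_list_columns` and the column list are assumptions; the project's own `load_samples` may handle this differently.

```python
import ast
import io

import pandas as pd

LIST_COLUMNS = ["lemmas", "nouns", "adjs_verbs"]  # assumed list-valued columns


def parse_list_columns(df, columns=LIST_COLUMNS):
    """Turn stringified lists from a CSV round trip back into Python lists."""
    df = df.copy()
    for col in columns:
        df[col] = df[col].apply(
            # Missing values are not strings and become empty lists.
            lambda value: ast.literal_eval(value) if isinstance(value, str) else []
        )
    return df


# Small round trip through an in-memory CSV to show the effect.
original = pd.DataFrame({
    "lemmas": [["Bild", "Putin"]],
    "nouns": [["Bild"]],
    "adjs_verbs": [[]],
})
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
read_back = pd.read_csv(buffer)
parsed = parse_list_columns(read_back)
```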

In [7]:
df = load_samples()

In [8]:
counters = counter_from_df(df, ['lemmas','nouns','adjs_verbs'])

In [49]:
df['nouns'].head(30)

0                                         [Bild, Putin]
1     [Form, handelns, Diktator, Invasion, ukrain, P...
2     [Anektierung, Krim, Angriffskrieg, Ukraine, Sc...
3     [Wort, Homophober, Minderheit, Diktator, verha...
4     [Solidarität, Mensch, Ukraine, Verachtung, Putin]
5     [Russland, Land, Trolle, Druck, Kreml, Stopputin]
6                               [Putin, Ukraine, trump]
7     [kriegseintreiten, Deutschland, Russland, Krie...
8     [Mariupol, cyborg-veteranen, Gegenoffensive, M...
9                                          [kyiv, Ziel]
10    [Selenskyj, Tweet, Hilfe, Danksagung, Deutschl...
11    [Russland, Grenze, Inkl, Putin, Bodenschätz, a...
12    [Invasion, putin-regierung, Ukraine, Kriegstre...
13                         [oh, .., Aktion, Russia, As]
14                                          [Bild, Tag]
15    [Russe, Million, Schutzschilde, Widerstand, lu...
16    [Ende, Symbolpolitik, Wirkung, Grund, ukrain, ...
17    [Krieg, Waffe, Lösung, Gedanke, Mitgefühl,

In [47]:
counters['nouns'].most_common(19)

[('ukraine', 6129),
 ('ucrania', 3124),
 ('Putin', 2966),
 ('rusia', 2943),
 ('russia', 2661),
 ('putin', 2174),
 ('guerra', 1980),
 ('zelensky', 1779),
 ('україна', 1565),
 ('украина', 1556),
 ('Ukraine', 1522),
 ('russie', 1438),
 ('Russia', 1386),
 ('Russland', 1353),
 ('war', 1242),
 ('россия', 1165),
 ('guerre', 1165),
 ('', 1109),
 ('Ucraina', 1039)]
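
The list above contains case variants of the same word (`'ukraine'` and `'Ukraine'`, `'Putin'` and `'putin'`) as well as an empty string. A possible post-processing step, sketched here as a suggestion rather than something the notebook already does, merges case variants and drops empty tokens. The function name `normalise_counter` is hypothetical.

```python
from collections import Counter


def normalise_counter(counter):
    """Merge case variants and drop empty tokens."""
    merged = Counter()
    for token, count in counter.items():
        key = token.strip().casefold()
        if key:  # skips '' and whitespace-only tokens
            merged[key] += count
    return merged


# A few entries copied from the output above.
noun_counts = Counter({
    "ukraine": 6129, "Ukraine": 1522,
    "Putin": 2966, "putin": 2174,
    "russia": 2661, "Russia": 1386,
    "": 1109,
})
normalised = normalise_counter(noun_counts)
```

Note that case-folding also merges German nouns with identically spelled lowercase words, so whether this step is desirable depends on the downstream topic model.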

In [16]:
df['#lemmas'] = df['lemmas'].apply(len)
df.plot(kind='box', backend='plotly', x='lang', y='#lemmas', color='lang')

In [17]:
df['#text'] = df['text'].apply(len)
df.plot(kind='box', backend='plotly', x='lang', y='#text', color='lang')