# 04 Text Preparation

For text preparation I am sticking to methods that are defined in Prof. Albrecht's book: [Blueprints for Text Analytics Using Python](https://learning.oreilly.com/library/view/blueprints-for-text/9781492074076/ch04.html#idm46749280280440).
I will be using some of his Python code, either in its original form or slightly modified, from this [GitHub repo](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch04/Data_Preparation.ipynb).

The first step of text preparation is cleaning the text. Tweets especially can be very dirty, containing odd punctuation, symbols, smileys and more.
I will use the cleaning function from Prof. Albrecht's lecture materials, slightly modified for this context.
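To give a rough idea of what such a cleaning step does, here is a minimal sketch. `clean_sketch` is a hypothetical stand-in, not the `clean` function from `src.text_preparation_04`: it only lowercases the text, drops links and removes the `#` marker, and it additionally collapses whitespace, which the real function may not do.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # links such as https://t.co/...
HASH_RE = re.compile(r"#")             # the hashtag marker, not the tag word
WHITESPACE_RE = re.compile(r"\s+")


def clean_sketch(text: str) -> str:
    """Hypothetical stand-in for `clean`; the real function may differ."""
    text = URL_RE.sub(" ", text)         # drop links
    text = HASH_RE.sub(" ", text)        # keep the tag word, drop the '#'
    text = WHITESPACE_RE.sub(" ", text)  # assumption: collapse whitespace runs
    return text.strip().lower()


print(clean_sketch("Einordnung rund um den #UkraineKrieg. https://t.co/okaiNsguFA"))
```

Replacing the `#` with a space rather than deleting the whole hashtag keeps the tag word available for the later lemma extraction.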

In [4]:
import pandas as pd
from src.text_preparation_04 import clean


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min/de.csv")



Below is an example of how the cleaning modifies the text of a German tweet. As you can see, the cleaning lowercases the text and removes the link and the hashtag symbol, but the result is still not very clean.

In [5]:
df_sample = df.sample(10)
df_sample['cleaned_text'] = df_sample['text'].apply(clean)

print(df_sample[['text']].iloc[0]['text'])

print("")

print(df_sample[['cleaned_text']].iloc[0]['cleaned_text'])

Absolut sehenswerte Einordnung der Hintergründe und aktuellen Ereignisse rund um den #UkraineKrieg. https://t.co/okaiNsguFA

absolut sehenswerte einordnung der hintergründe und aktuellen ereignisse rund um den  ukrainekrieg.


The second step is lemma extraction. Here the tokens are lemmatised, and lemmas, nouns, adjectives, verbs and emojis are extracted. I again use a modified version of Prof. Albrecht's function from his Natural Language Processing lectures.
Below is an example of how the process works on German tweets.

In [17]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import add_lemmas_to_df
df_sample = df.sample(10)
df_sample['cleaned_text'] = df_sample['text'].apply(clean)


nlp = spacy.load('de_core_news_sm')
nlp.add_pipe("emoji", first=True)

df_sample = add_lemmas_to_df(df_sample, nlp)

print(df_sample.iloc[0]['cleaned_text'])

print("")

print(df_sample.iloc[0][['lemmas']])
print(df_sample.iloc[0][['nouns']])
print(df_sample.iloc[0][['adjs_verbs']])
print(df_sample.iloc[0][['emojis']])

100%|██████████| 1/1 [00:00<00:00, 14.58it/s]

dabei wird indes auch erkennbar, dass "der westen" die neuen realitäten nicht erkennen kann oder will.  china ist zur  supermacht geworden und schließt als solche diplomatische abkommen mit seinen nachbarn. egal ob  australien, die  usa oder  eu das wollen oder nicht.

lemmas    [dabei, indes, auch, erkennbar, Westen, neu, Realität, erkennen, wollen, China, zu, Supermacht, schließen, als, diplomatisch, Abkommen, mit, Nachbar, egal, Australien, USA, EU, der, wollen]
Name: 1083033, dtype: object
nouns    [Westen, Realität, China, Supermacht, Abkommen, Nachbar, Australien, USA, EU]
Name: 1083033, dtype: object
adjs_verbs    [neu, erkennen, wollen, schließen, diplomatisch, wollen]
Name: 1083033, dtype: object
emojis    []
Name: 1083033, dtype: object
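The grouping into `lemmas`, `nouns` and `adjs_verbs` shown above can be sketched as follows. This is not the code of `add_lemmas_to_df`: `Tok` and `extract_columns` are hypothetical names, the filtering rule (skip punctuation only) and the part-of-speech sets are my assumptions, and emojis are left out. `Tok` only mimics the spaCy token attributes `lemma_`, `pos_` and `is_punct`, so the sketch runs without a language model.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token (only the attributes used here)."""
    lemma_: str
    pos_: str
    is_punct: bool = False


def extract_columns(tokens):
    """Group lemmas by coarse part of speech, skipping punctuation."""
    kept = [t for t in tokens if not t.is_punct]
    return {
        "lemmas": [t.lemma_ for t in kept],
        # assumption: proper nouns such as "China" count as nouns
        "nouns": [t.lemma_ for t in kept if t.pos_ in {"NOUN", "PROPN"}],
        "adjs_verbs": [t.lemma_ for t in kept if t.pos_ in {"ADJ", "VERB"}],
    }


tokens = [
    Tok("China", "PROPN"),
    Tok("neu", "ADJ"),
    Tok("Abkommen", "NOUN"),
    Tok("schließen", "VERB"),
    Tok(".", "PUNCT", is_punct=True),
]
print(extract_columns(tokens))
```

Since a spaCy `Doc` iterates over tokens that carry the same three attributes, passing `nlp(text)` instead of the hand-built list should work in the same way.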





## Lemma Extraction 
In the following section the lemmas are extracted for every language and saved in /Lemmas together with the tweet ID and timestamp. This is a very computationally heavy calculation. On the biggest dataset, the English tweets, it took my poor computer more than 24 hours. So don't rerun these cells! In most cells, the line that saves the DataFrame has been commented out to prevent overwriting the existing files.
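The cells in this section differ only in the language code and the spaCy model name. As a sketch of how that could be expressed once, and how a save could be guarded instead of commented out, consider the helpers below. `lemma_paths` and `save_if_absent` are hypothetical names that do not exist in `src.text_preparation_04`; the model names and folder names are taken from the cells in this section.

```python
from pathlib import Path

# Language code -> spaCy model name, as used in the cells of this section.
SPACY_MODELS = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "ru": "ru_core_news_sm",
    "es": "es_core_news_sm",
    "it": "it_core_news_sm",
    "fr": "fr_core_news_sm",
    "uk": "uk_core_news_sm",
}


def lemma_paths(data_dir, lang):
    """Return (input_csv, output_csv) for one language code."""
    data_dir = Path(data_dir)
    return (
        data_dir / "Language_min_dedupl" / f"{lang}.csv",
        data_dir / "Lemmas" / f"{lang}.csv",
    )


def save_if_absent(path, write):
    """Call write(path) only if path does not exist yet; return True if it ran."""
    path = Path(path)
    if path.exists():
        return False  # keep the existing file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    write(path)
    return True
```

In a cell, the guarded save would then look like `save_if_absent(out_csv, lambda p: df[columns].to_csv(p))`, where `columns` is the list of columns to keep, so rerunning a cell by accident cannot replace a result that took hours to compute.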

### Creating Lemmas for German Tweets

In [None]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/de.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

#df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("")


  0%|          | 0/1580284 [00:00<?, ?it/s]

  0%|          | 0/15803 [00:00<?, ?it/s]

### Creating Lemmas for English Tweets

In [None]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/en.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

#df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Lemmas/en.csv")

  0%|          | 0/12116412 [00:00<?, ?it/s]

  0%|          | 0/121165 [00:00<?, ?it/s]

### Creating Lemmas for Russian Tweets

In [20]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/ru.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load('ru_core_news_sm')
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

#df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Lemmas/ru.csv")

KeyboardInterrupt: 

### Creating Lemmas for Spanish Tweets

In [21]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/es.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load('es_core_news_sm')
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

#df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Lemmas/es.csv")

100%|██████████| 1002517/1002517 [01:34<00:00, 10578.34it/s]
100%|██████████| 10026/10026 [4:15:56<00:00,  1.53s/it]     


### Creating Lemmas for Italian Tweets

In [22]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/it.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load("it_core_news_sm")
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

#df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Lemmas/it.csv")

100%|██████████| 1178277/1178277 [01:47<00:00, 10922.41it/s]
100%|██████████| 11783/11783 [16:55:06<00:00,  5.17s/it]      


### Creating Lemmas for French Tweets

In [25]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/fr.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load("fr_core_news_sm")
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Lemmas/fr.csv")

100%|██████████| 1049599/1049599 [01:58<00:00, 8849.74it/s]
100%|██████████| 10496/10496 [5:19:09<00:00,  1.82s/it]      


### Creating Lemmas for Ukrainian Tweets

In [26]:
import spacy
from spacymoji import Emoji
from src.text_preparation_04 import clean, add_lemmas_to_df
import pandas as pd


df = pd.read_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Language_min_dedupl/uk.csv")
df['cleaned_text'] = df['text'].progress_apply(clean)
nlp = spacy.load("uk_core_news_sm")
nlp.add_pipe("emoji", first=True)

df = add_lemmas_to_df(df, nlp)

df[['tweetid','tweetcreatedts','lemmas','adjs_verbs','nouns', 'entities', 'emojis']].to_csv("/Users/robinfeldmann/TopicAnalysisRUWTweets/Data/Lemmas/uk.csv")

100%|██████████| 899934/899934 [01:01<00:00, 14590.00it/s]
100%|██████████| 9000/9000 [7:26:01<00:00,  2.97s/it]       


# 8 Exploratory Analysis

In [None]:
df_sample = df

In [None]:
from collections import Counter

counter_nouns = Counter()
counter_verbs = Counter()
counter_lemmas = Counter()
counter_emojis = Counter()
for ind, row in df_sample.iterrows():
    counter_nouns.update(row['nouns'])
    counter_verbs.update(row['adjs_verbs'])
    counter_lemmas.update(row['lemmas'])
    counter_emojis.update(row['emojis'])
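The loop above works on a DataFrame that still holds real Python lists. If the counts are instead computed from one of the saved CSV files, the list columns come back as strings such as `"['war', 'news']"`, and `Counter.update` on a string counts single characters. A minimal sketch of a workaround, with the hypothetical helpers `parse_list_cell` and `count_items`:

```python
import ast
from collections import Counter


def parse_list_cell(cell):
    """Turn a CSV cell like "['war', 'news']" back into a Python list."""
    if isinstance(cell, list):
        return cell               # already a list (in-memory DataFrame)
    return ast.literal_eval(cell)  # safely parse the list literal


def count_items(cells):
    """Count list items across cells given as lists or as their string form."""
    counter = Counter()
    for cell in cells:
        counter.update(parse_list_cell(cell))
    return counter


print(count_items(["['war', 'news']", "['war']", ["putin"]]).most_common(1))
# → [('war', 2)]
```

With a reloaded DataFrame this would be called as `count_items(df['nouns'])`, and likewise for the other list columns.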


In [None]:
sum(counter_nouns.values())

9331

In [None]:
counter_nouns.most_common(10)

[('russia', 303),
 ('ukraine', 298),
 ('war', 140),
 ('putin', 130),
 ('news', 69),
 ('people', 66),
 ('usa', 65),
 ('nato', 60),
 ('biden', 47),
 ('country', 44)]
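Raw counts are hard to compare across languages with different numbers of tweets, so a share of the total can be more telling. With the figures above, 'russia' accounts for 303 of 9331 counted nouns, roughly 3.2 %. A small sketch with the hypothetical helper `top_shares`:

```python
from collections import Counter


def top_shares(counter, n=10):
    """Return the n most common items as (item, count, share of all items)."""
    total = sum(counter.values())
    if total == 0:
        return []  # nothing counted, avoid dividing by zero
    return [(item, count, count / total) for item, count in counter.most_common(n)]


toy = Counter({"russia": 3, "ukraine": 1})
print(top_shares(toy))
# → [('russia', 3, 0.75), ('ukraine', 1, 0.25)]
```

Applied to the counters above, this would be `top_shares(counter_nouns)`.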

In [None]:
counter_verbs.most_common(10)

[('ukraine', 161),
 ('russian', 139),
 ('ukrainian', 84),
 ('have', 80),
 ('say', 65),
 ('go', 50),
 ('more', 44),
 ('ukrainewar', 44),
 ('do', 43),
 ('new', 42)]