# Data Preprocessing

In [112]:
! pip install --upgrade pip
! pip install fasttext pandas emoji openpyxl



To identify the language of each Telegram post, I applied automatic language detection with a pre-trained fastText model (lid.176.ftz, the compressed variant of lid.176.bin), which supports 176 languages. This step was needed to filter the posts and classify them as Russian or Ukrainian. To avoid errors caused by empty or very short texts, posts with missing text are dropped before detection, and texts shorter than 10 characters are labeled "unknown".

In [113]:
import fasttext
import pandas as pd
import re
import emoji

In [114]:
# Loading the language detection model
lang_model = fasttext.load_model("data/lid.176.ftz")

# Defining language detection function
def detect_language(text):
    if not isinstance(text, str) or len(text.strip()) < 10:
        return "unknown"
    prediction = lang_model.predict(text.replace('\n', ' '), k=1)
    return prediction[0][0].replace("__label__", "")
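
fastText's `predict` also returns a probability for each label, which could be used to flag low-confidence predictions instead of accepting every label. The sketch below is an illustration under assumptions, not part of the original pipeline: `detect_language_with_confidence`, the `min_confidence` threshold of 0.5, and the `StubModel` stand-in (used so the example runs without the model file) are all hypothetical.

```python
# Hypothetical stand-in for the fastText model so the sketch runs without
# lid.176.ftz; a real model returns (labels, probabilities) in the same shape.
class StubModel:
    def predict(self, text, k=1):
        if "ї" in text:
            return (("__label__uk",), [0.97])
        return (("__label__ru",), [0.40])


def detect_language_with_confidence(text, model, min_confidence=0.5, min_chars=10):
    """Return (language, confidence).

    The language is "unknown" when the text is too short or when the top
    prediction falls below min_confidence (an assumed threshold).
    """
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return "unknown", 0.0
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    confidence = float(probs[0])
    if confidence < min_confidence:
        return "unknown", confidence
    return labels[0].replace("__label__", ""), confidence


stub = StubModel()
result_uk = detect_language_with_confidence("В Україну повернули тіла", stub)
result_low = detect_language_with_confidence("Сбор в Курскую область", stub)
result_short = detect_language_with_confidence("ok", stub)
```

With the real model, `stub` would be replaced by `lang_model`; the second return value can then be stored in its own column and inspected before filtering by language.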

In [115]:
df_ko = pd.read_csv("data/ua/Posts_ko.csv")
df_ds = pd.read_csv("data/ua/Posts_ds.csv")
df_uo = pd.read_csv("data/ua/Posts_uo.csv")
df_no = pd.read_csv("data/ua/Posts_no.csv")
df_dm = pd.read_csv("data/ru/Posts_dm.csv")
df_cl = pd.read_csv("data/ru/Posts_cl.csv")
df_re = pd.read_csv("data/ru/Posts_re.csv")

In [116]:
df_ko = df_ko[df_ko["text"].notna()].copy()  # copy so later column assignments do not target a slice
df_ko["language"] = df_ko["text"].apply(detect_language)

In [117]:
df_ds = df_ds[df_ds["text"].notna()].copy()
df_ds["language"] = df_ds["text"].apply(detect_language)
df_ds

Unnamed: 0,post_id,date,text,views,forwards,channel_id,channel_name,language
0,21462,2025-03-13 21:41:55+00:00,**🌊**** **[**36 ОБрМП**](https://t.me/ua_marin...,311822,370,-1001469021333,DeepStateUA,uk
1,21461,2025-03-13 18:46:49+00:00,**🤝 Повністю згодні з думкою наших друзів з РН...,302309,280,-1001469021333,DeepStateUA,uk
2,21460,2025-03-13 17:03:33+00:00,⚖️ **ЄСПЛ **[**визнав**](https://hudoc.echr.co...,335154,896,-1001469021333,DeepStateUA,uk
3,21459,2025-03-13 13:48:11+00:00,**🔄**** Мапу оновлено\n**\n⚔️ Ворог просунувся...,296881,154,-1001469021333,DeepStateUA,uk
4,21458,2025-03-13 12:20:54+00:00,**🟡****Кадри ураження кацапні в Басівці від бі...,371653,557,-1001469021333,DeepStateUA,uk
...,...,...,...,...,...,...,...,...
175,21283,2025-02-15 07:38:48+00:00,**🇩🇪**** Медіана опитувань перед Парламентськи...,262553,441,-1001469021333,DeepStateUA,uk
176,21282,2025-02-14 23:45:47+00:00,**🔄**** Мапу оновлено\n**\n⚔️ Ворог просунувся...,272033,115,-1001469021333,DeepStateUA,uk
177,21281,2025-02-14 12:58:01+00:00,🇷🇺 **Просування кацапів на правому березі річк...,292437,743,-1001469021333,DeepStateUA,uk
178,21280,2025-02-14 12:36:50+00:00,**🕯 В Україну повернули тіла 757 українських Г...,274636,329,-1001469021333,DeepStateUA,uk


In [118]:
df_uo = df_uo[df_uo["text"].notna()].copy()
df_uo["language"] = df_uo["text"].apply(detect_language)
df_uo

Unnamed: 0,post_id,date,text,views,forwards,channel_id,channel_name,language
0,97026,2025-03-13 21:51:13+00:00,❗️**Трамп не вводив нових санкцій проти рф: їх...,303971,652,-1001233777422,UaOnlii,uk
1,97025,2025-03-13 21:37:21+00:00,**❗️Завдяки діям Трампа майже всі великі росій...,292042,934,-1001233777422,UaOnlii,uk
2,97024,2025-03-13 21:24:05+00:00,**❗️Спецпредставника США Келлога відсторонили ...,276603,688,-1001233777422,UaOnlii,uk
3,97023,2025-03-13 21:13:59+00:00,**❗️США відновлюють постачання Україні далекоб...,274636,656,-1001233777422,UaOnlii,uk
4,97022,2025-03-13 21:10:44+00:00,**❗️Тиша протрималася недовго: відбулися пуски...,254215,241,-1001233777422,UaOnlii,uk
...,...,...,...,...,...,...,...,...
2090,94787,2025-02-14 07:17:53+00:00,"**❗️Вночі дрон рф атакував Чорнобильську АЕС, ...",231312,2596,-1001233777422,UaOnlii,uk
2091,94786,2025-02-14 06:40:39+00:00,**Не ведіться на емоційні гойдалки. Інтереси У...,218032,323,-1001233777422,UaOnlii,uk
2092,94785,2025-02-14 06:39:25+00:00,**Ціни на ліки офіційно знизять на 30% з 1 бер...,243144,256,-1001233777422,UaOnlii,uk
2093,94784,2025-02-14 06:01:15+00:00,"**❗️Трамп передасть Україні ядерну зброю, якщо...",258946,1286,-1001233777422,UaOnlii,uk


In [119]:
df_no = df_no[df_no["text"].notna()].copy()
df_no["language"] = df_no["text"].apply(detect_language)
df_no

Unnamed: 0,post_id,date,text,views,forwards,channel_id,channel_name,language
0,53962,2025-03-13 21:52:57+00:00,Головне станом на зараз: \n\n• Канада [виділил...,52034.0,53.0,-1001134948258,novinach,uk
1,53961,2025-03-13 21:33:51+00:00,NBC News: спецпредставника США Келлога [відсто...,55178.0,514.0,-1001134948258,novinach,uk
2,53960,2025-03-13 19:42:14+00:00,"Зеленський заявив, що путін готує відмову від ...",55562.0,179.0,-1001134948258,novinach,uk
3,53959,2025-03-13 17:12:42+00:00,"Трамп заявив, що путін зробив ""дуже багатообіц...",56554.0,146.0,-1001134948258,novinach,uk
4,53958,2025-03-13 15:59:01+00:00,"путін вимагає гарантій, що під час 30-денного ...",58506.0,747.0,-1001134948258,novinach,uk
...,...,...,...,...,...,...,...,...
686,53258,2025-02-14 09:46:22+00:00,Суспільне: Україна [доопрацювала](https://susp...,48756.0,133.0,-1001134948258,novinach,uk
695,53249,2025-02-14 09:21:29+00:00,З Днем всіх закоханих! Надішліть валентинку св...,51155.0,998.0,-1001134948258,novinach,uk
696,53248,2025-02-14 08:19:22+00:00,"""Дія"" до Дня всіх закоханих [презентувала](htt...",55985.0,1490.0,-1001134948258,novinach,uk
697,53247,2025-02-14 07:20:17+00:00,російський безпілотник влучив по саркофагу Чор...,55546.0,1215.0,-1001134948258,novinach,uk


In [120]:
df_dm = df_dm[df_dm["text"].notna()].copy()
df_dm["language"] = df_dm["text"].apply(detect_language)
df_dm

Unnamed: 0,post_id,date,text,views,forwards,channel_id,channel_name,language
0,66717,2025-03-13 20:03:47+00:00,«…Коль в верхах большие лица\nНе нашли для мир...,352222.0,275.0,-1001513431778,dva_majors,ru
1,66716,2025-03-13 19:50:18+00:00,**✨**** Сбор в** **Курскую область! ****✨****\...,553973.0,56.0,-1001513431778,dva_majors,ru
2,66715,2025-03-13 19:40:59+00:00,"**Курская область. Суджа**, **кадры **[**Групп...",418278.0,1029.0,-1001513431778,dva_majors,ru
3,66714,2025-03-13 19:08:55+00:00,[**Информация**](https://t.me/dva_majors/66712...,405044.0,4466.0,-1001513431778,dva_majors,ru
5,66712,2025-03-13 19:01:43+00:00,🔞 **Всем бойцам на передовой и в прифронтовой ...,315548.0,5767.0,-1001513431778,dva_majors,ru
...,...,...,...,...,...,...,...,...
2170,64517,2025-02-14 06:08:24+00:00,Началась подготовка к продвижению Залужного в ...,314479.0,1330.0,-1001513431778,dva_majors,ru
2171,64516,2025-02-14 04:31:56+00:00,**Переговорный **[**процесс**](https://t.me/ne...,318387.0,590.0,-1001513431778,dva_majors,ru
2172,64515,2025-02-14 04:20:56+00:00,**Минобороны России:**\n\nВ течение прошедшей ...,306357.0,240.0,-1001513431778,dva_majors,ru
2173,64514,2025-02-14 03:43:50+00:00,**#Сводка**** на утро 14 февраля 2025 года**\n...,629203.0,853.0,-1001513431778,dva_majors,ru


In [121]:
df_cl = df_cl[df_cl["text"].notna()].copy()
df_cl["language"] = df_cl["text"].apply(detect_language)
df_cl

Unnamed: 0,post_id,date,text,views,forwards,channel_id,channel_name,language
0,157779,2025-03-13 23:20:52+00:00,"О тех, кто занимался военными преступлениями н...",253245.0,210.0,-1001101806611,boris_rozhin,ru
1,157778,2025-03-13 23:12:01+00:00,[Операторы](https://t.me/mod_russia/50058) FPV...,246294.0,56.0,-1001101806611,boris_rozhin,ru
2,157777,2025-03-13 22:23:01+00:00,**Европейский Союз очень и очень противный \nЯ...,250403.0,492.0,-1001101806611,boris_rozhin,ru
4,157775,2025-03-13 22:11:29+00:00,**Очередная самоделка с Покровского направлени...,254333.0,505.0,-1001101806611,boris_rozhin,ru
5,157774,2025-03-13 21:34:33+00:00,Путин провел телефонные переговоры с бин Салма...,271850.0,158.0,-1001101806611,boris_rozhin,ru
...,...,...,...,...,...,...,...,...
3198,154532,2025-02-14 03:13:01+00:00,Дроноводы «Рубикона» уничтожили бронированную ...,196734.0,78.0,-1001101806611,boris_rozhin,ru
3200,154530,2025-02-14 02:10:15+00:00,**⠀**\n**🔴**** ** **Свидетельства вернувшихся ...,187674.0,335.0,-1001101806611,boris_rozhin,ru
3201,154529,2025-02-14 01:36:01+00:00,"Новый иранский реактивный БПЛА-камикадзе ""Хали...",188367.0,231.0,-1001101806611,boris_rozhin,ru
3202,154528,2025-02-14 00:33:01+00:00,В районе Великой Новосёлки наши парни с 305 бр...,190817.0,110.0,-1001101806611,boris_rozhin,ru


In [122]:
df_re = df_re[df_re["text"].notna()].copy()
df_re["language"] = df_re["text"].apply(detect_language)
df_re

Unnamed: 0,post_id,date,text,views,forwards,channel_id,channel_name,language
0,94053,2025-03-13 19:29:58+00:00,**Владимир Путин дипломатично расставил акцент...,959162,1495,-1001260622817,readovkanews,ru
1,94052,2025-03-13 18:59:58+00:00,**Подготовлены документы на снятие с розыска 1...,959836,925,-1001260622817,readovkanews,ru
2,94051,2025-03-13 18:34:58+00:00,**ДУМ РФ и Турция снова признаются в любви дру...,1020364,1215,-1001260622817,readovkanews,ru
4,94049,2025-03-13 17:59:58+00:00,**Русская армия формирует новый плацдарм за ка...,903210,675,-1001260622817,readovkanews,ru
5,94048,2025-03-13 17:00:31+00:00,"**Путин сделал очень многообещающее заявление,...",880199,1130,-1001260622817,readovkanews,ru
...,...,...,...,...,...,...,...,...
1034,93006,2025-02-14 07:37:59+00:00,**Накануне переговоров в Мюнхене Зеленский «шо...,334579,2377,-1001260622817,readovkanews,ru
1035,93005,2025-02-14 07:22:17+00:00,**Тайна пирогов раскрыта — ртуть в булочки доб...,292128,2216,-1001260622817,readovkanews,ru
1036,93004,2025-02-14 06:20:01+00:00,"**США могут отправить войска на Украину, если ...",307165,1787,-1001260622817,readovkanews,ru
1037,93003,2025-02-14 05:30:07+00:00,"❗️**«Всем спасибо, президенту Владимиру Владим...",319996,1688,-1001260622817,readovkanews,ru


# Further processing

### Removing duplicates

In [123]:
df_ko = df_ko.drop_duplicates(subset="text")
df_ds = df_ds.drop_duplicates(subset="text")
df_uo = df_uo.drop_duplicates(subset="text")
df_no = df_no.drop_duplicates(subset="text")
df_dm = df_dm.drop_duplicates(subset="text")
df_cl = df_cl.drop_duplicates(subset="text")
df_re = df_re.drop_duplicates(subset="text")

### Normalizing Whitespace

In [124]:
# Assign to the "text" column; .loc["text"] would address a row labeled "text" instead.
df_ko["text"] = df_ko["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
df_ds["text"] = df_ds["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
df_uo["text"] = df_uo["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
df_no["text"] = df_no["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
df_dm["text"] = df_dm["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
df_cl["text"] = df_cl["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
df_re["text"] = df_re["text"].str.replace(r'\s+', ' ', regex=True).str.strip()
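
When a DataFrame was produced by boolean filtering, assigning to one of its columns can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of one way to avoid it, assuming an explicit `.copy()` is acceptable; the toy DataFrame and the helper name `normalize_whitespace` are illustrative only.

```python
import pandas as pd


def normalize_whitespace(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    """Return a copy of df with runs of whitespace in `column` collapsed."""
    out = df.copy()  # work on an independent copy, not on a slice of another frame
    out[column] = out[column].str.replace(r"\s+", " ", regex=True).str.strip()
    return out


toy = pd.DataFrame(
    {"text": ["  Мапу   оновлено\n\n", None, "a\tb"], "views": [1, 2, 3]}
)
filtered = toy[toy["text"].notna()]       # boolean filter, as in the cells above
cleaned = normalize_whitespace(filtered)  # the filtered frame itself is left untouched
print(cleaned["text"].tolist())
```

The same helper could be applied to each of the seven DataFrames in a loop instead of repeating one line per channel.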



### Filtering out irrelevant posts from Ukrainian Telegram channels

During preprocessing, posts that do not contribute meaningfully to the framing analysis need to be removed. In particular, messages containing terms such as "повітряна тривога" (air raid alert), "увага" (attention), "відбій тривоги" (end of alert), "бот" (bot), or "нагадування" (reminder) are often:

- automated system alerts, or
- security and public service announcements.

Such messages typically lack political framing, narrative construction, or propaganda elements. They serve purely informational or emergency-related purposes and may introduce noise or bias into qualitative, discursive, or sentiment-based analyses.

Excluding these entries focuses the dataset on relevant, communicative content, which strengthens the reliability and interpretability of the subsequent framing analysis.




In [125]:
# Keywords that mark system alerts and service announcements (matched as case-insensitive substrings)
system_keywords = r"повітряна тривога|увага|відбій тривоги|бот|нагадування"
df_ko = df_ko[~df_ko["text"].str.contains(system_keywords, case=False, na=False)].copy()
df_ds = df_ds[~df_ds["text"].str.contains(system_keywords, case=False, na=False)].copy()
df_uo = df_uo[~df_uo["text"].str.contains(system_keywords, case=False, na=False)].copy()
df_no = df_no[~df_no["text"].str.contains(system_keywords, case=False, na=False)].copy()
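
Note that `str.contains` performs substring matching, so a short keyword such as "бот" also matches inside longer words like "робота" (work). If whole-word matching is what is intended, which is an assumption here, the alternatives can be wrapped in word boundaries. The sketch below uses only the standard `re` module; `system_pattern_wb` and `is_system_post` are hypothetical names.

```python
import re

system_keywords = r"повітряна тривога|увага|відбій тривоги|бот|нагадування"

# Assumed variant: \b anchors the alternatives to whole words
# (word boundaries are Unicode-aware for str patterns in Python 3).
system_pattern_wb = re.compile(rf"\b(?:{system_keywords})\b", flags=re.IGNORECASE)


def is_system_post(text: str) -> bool:
    """True if the text contains one of the keywords as a whole word."""
    return bool(system_pattern_wb.search(text))


news = "Робота над угодою триває"           # ordinary news sentence
alert = "Увага! Повітряна тривога в Києві"  # system-style alert

# Plain substring search flags the news sentence because "робота" contains "бот".
substring_hit = bool(re.search(system_keywords, news, flags=re.IGNORECASE))
whole_word_hit_news = is_system_post(news)
whole_word_hit_alert = is_system_post(alert)
```

The pattern string `system_pattern_wb.pattern` can be passed to `Series.str.contains` in place of `system_keywords`; whether the stricter match is preferable depends on how the keywords appear in the actual posts.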

### Text Cleaning

In [126]:
def clean_text(text):
    if pd.isnull(text):
        return ""
    text = re.sub(r"http\S+|www\S+|t\.me\S+", "", text)  # removing URLs
    text = re.sub(r"@\w+", "", text)                     # removing mentions
    text = re.sub(r"#\w+", "", text)                     # removing hashtags
    return text
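
The posts shown above contain Telegram Markdown such as `**bold**` markers and `[label](url)` links. With the URL regex alone, the URL is removed but the brackets and asterisks around it can remain in the text. A possible extra step, offered as a sketch and not as part of the original pipeline, is to unwrap links to their label and drop the bold markers first; `strip_markdown` and the sample string are hypothetical.

```python
import re

# Matches a Markdown link and captures its visible label.
MD_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def strip_markdown(text: str) -> str:
    """Keep link labels, drop link targets and bold markers, tidy whitespace."""
    text = MD_LINK.sub(r"\1", text)           # [label](url) -> label
    text = text.replace("**", "")             # remove bold markers
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


sample = "**ЄСПЛ **[**визнав**](https://example.org/doc) позов"
unwrapped = strip_markdown(sample)
print(unwrapped)
```

If used, this step would run before `clean_text`, so that the URL pattern only has to deal with bare links.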

In [127]:
df_ko["text"] = df_ko["text"].apply(clean_text)
df_ds["text"] = df_ds["text"].apply(clean_text)
df_uo["text"] = df_uo["text"].apply(clean_text)
df_no["text"] = df_no["text"].apply(clean_text)
df_dm["text"] = df_dm["text"].apply(clean_text)
df_cl["text"] = df_cl["text"].apply(clean_text)
df_re["text"] = df_re["text"].apply(clean_text)



### Extracting the date into a separate column

In [128]:
df_ko["date"] = pd.to_datetime(df_ko["date"], errors="coerce")
df_ds["date"] = pd.to_datetime(df_ds["date"], errors="coerce")
df_uo["date"] = pd.to_datetime(df_uo["date"], errors="coerce")
df_no["date"] = pd.to_datetime(df_no["date"], errors="coerce")
df_dm["date"] = pd.to_datetime(df_dm["date"], errors="coerce")
df_cl["date"] = pd.to_datetime(df_cl["date"], errors="coerce")
df_re["date"] = pd.to_datetime(df_re["date"], errors="coerce")
df_ko["date_only"] = df_ko["date"].dt.date
df_ds["date_only"] = df_ds["date"].dt.date
df_uo["date_only"] = df_uo["date"].dt.date
df_no["date_only"] = df_no["date"].dt.date
df_dm["date_only"] = df_dm["date"].dt.date
df_cl["date_only"] = df_cl["date"].dt.date
df_re["date_only"] = df_re["date"].dt.date


### Lowercasing

In [129]:
df_ko["text"] = df_ko["text"].str.lower()
df_ds["text"] = df_ds["text"].str.lower()
df_uo["text"] = df_uo["text"].str.lower()
df_no["text"] = df_no["text"].str.lower()
df_dm["text"] = df_dm["text"].str.lower()
df_cl["text"] = df_cl["text"].str.lower()
df_re["text"] = df_re["text"].str.lower()



# Transforming emojis into text

In [130]:
# Replace each emoji with its textual name, prefixed by ":" and followed by a space
def transform_emojis(text):
    if pd.isnull(text):
        return ""
    return emoji.demojize(text, delimiters=(":", " "))

In [131]:
# Applying transformation to the "text" column
df_ko["text_transformed"] = df_ko["text"].apply(transform_emojis)
df_ds["text_transformed"] = df_ds["text"].apply(transform_emojis)
df_uo["text_transformed"] = df_uo["text"].apply(transform_emojis)
df_no["text_transformed"] = df_no["text"].apply(transform_emojis)
df_dm["text_transformed"] = df_dm["text"].apply(transform_emojis)
df_cl["text_transformed"] = df_cl["text"].apply(transform_emojis)
df_re["text_transformed"] = df_re["text"].apply(transform_emojis)



### Saving the data to a new file in the "data_clean/" folder