<a id='Homework'></a>
# Homework

Theory (5 points):
- Complete theory questions in Google Form
- Take a look at all the links
- Read and analyze all theory `TODO`s

Practice (10 points):
1. Take 2-3 channels from `KyivChannels_Dataset_v01`
2. Apply clustering and/or topic modelling techniques to find the topics of these channels. Ideal output: `channel_name:[topic_1, topic_2, topic_3]`. Example: `Крипта Миколи : [криптовалюта, біржа]`
3. (Advanced) Try to come up with a universal approach
4. (Advanced) Apply your approach on other channels

Here is a list of standard topics for TG channels (from TGStat). Ideally, use these as your topic labels.
```
Adult
Art
Blogs
Bookmaking
Books
Business and startups
Career
Courses and guides
Cryptocurrencies
Darknet
Design
Economics
Education
Edutainment
Erotic
Esoterics
Family & Children
Fashion and beauty
Food and cooking
Games
Handiwork
Health and Fitness
Humor and entertainment
Instagram
Interior and construction
Law
Linguistics
Marketing, PR, advertising
Medicine
Music
Nature
News and media
Other
Pictures and photos
Politics
Psychology
Quotes
Religion
Sales
Shock content
Software & Applications
Sport
Technologies
Telegram
Transport
Travel
Video and films
```
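
One way to move from raw topic keywords to the standard labels above is a simple keyword-overlap lookup. The sketch below is only an illustration: the `LABEL_KEYWORDS` seed lists and the `assign_label` helper are hypothetical (they do not come from TGStat or the dataset), and a real solution would more likely compare embeddings of keywords and label names.

```python
# Hypothetical seed keywords per TGStat label (illustrative only).
LABEL_KEYWORDS = {
    "Politics": {"політика", "политика", "вибори", "депутат"},
    "News and media": {"новини", "новости", "змі", "сми"},
    "Cryptocurrencies": {"криптовалюта", "біткоїн", "біржа"},
    "Transport": {"метро", "автобус", "дтп", "рух"},
}


def assign_label(topic_words, label_keywords=LABEL_KEYWORDS, default="Other"):
    """Return the label whose seed keywords overlap most with topic_words."""
    words = {w.lower() for w in topic_words}
    best_label, best_score = default, 0
    for label, seeds in label_keywords.items():
        score = len(words & seeds)
        if score > best_score:  # ties keep the earlier label
            best_label, best_score = label, score
    return best_label


print(assign_label(["криптовалюта", "біржа"]))  # matches the Cryptocurrencies seeds
print(assign_label(["погода", "дощ"]))          # no overlap, falls back to "Other"
```

Falling back to `Other` mirrors the catch-all entry in the list, so every topic still receives one of the standard labels.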

# Imports

In [1]:
!pip install -q langid transformers datasets sentence-transformers bertopic

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
os.environ["TOKENIZERS_PARALLELISM"]="true"

import pandas as pd
import numpy as np
import nltk
import spacy
import re
import torch
import torch.nn as nn
import string
import langid
import random

from matplotlib import pyplot as plt
from pprint import pprint
from nltk.corpus import stopwords
from nltk import tokenize
from wordcloud import WordCloud, STOPWORDS
from functools import reduce
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from copy import deepcopy
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, BertTokenizer, BertModel, pipeline, get_linear_schedule_with_warmup
)
from datasets import Dataset
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import pairwise_distances
from bertopic import BERTopic

torch.manual_seed(42)
torch.backends.cudnn.deterministic = True  # the flag lives under cudnn, not cuda


%matplotlib inline

In [3]:
code_env = 'colab' # 'colab', 'local'

if code_env == 'colab':
    from google.colab import drive
    drive.mount('/content/drive')

    data_path = os.path.join('drive', 'MyDrive', 'MachineLearning', 'Data', 'KyivChannels_Dataset_v01.csv')

elif code_env == 'local':
    data_path = os.path.join('..', 'data', 'KyivChannels_Dataset_v01.csv')


Mounted at /content/drive


# Load Data

In [4]:
df = pd.read_csv(data_path, converters={"Date": pd.to_datetime})

In [5]:
df

Unnamed: 0,channelname,Date,content,lang
0,kyivpolitics,2023-08-01 09:45:38,Отбой. Угрозы для столицы нет\n\nКиев. Главное...,ru
1,kyivpolitics,2023-08-01 10:03:38,На 8 перекрестках Киева в пилотном режиме внед...,ru
2,kyivpolitics,2023-08-01 14:42:31,⚡️НБУ отозвал банковскую лицензию Конкорд Банк...,ru
3,kyivpolitics,2023-08-01 15:37:34,Завтра синоптики прогнозируют небольшой дождь ...,ru
4,kyivpolitics,2023-08-01 13:06:08,А вот и сам снятый советский герб \n\nКиев. Гл...,ru
...,...,...,...,...
31177,hmarochos,2023-10-27 04:56:20,🎨 Художницю зобовʼязали замалювати мурал на Сі...,uk
31178,hmarochos,2023-10-27 06:12:15,🚧 Львів хоче отримати 50 млн євро на реконстру...,uk
31179,hmarochos,2023-10-27 05:38:42,🙈 На Набережно-Хрещатицькій самовільно влаштув...,uk
31180,semenovatut,2023-10-27 11:50:39,Може залишити Пушкіна?\nБуде об‘єктом перформа...,uk


In [6]:
df['channelname'].value_counts()

channelname
novynylive                     3590
lossolomas_kyiv                3009
darnicalive                    2715
kievvlast                      2273
vichirniykyiv                  1738
big_kyiv                       1670
kyivpolitics                   1383
nashkyivua                     1366
kyiv_novyny_24                 1102
kievreal1                      1096
huevyi_kiev                    1091
obolonlife                     1070
kiev1                          1006
khreschatyk36                   959
kyiv_n                          809
lisovy_masyv_official           722
poznyakyosokorkykharkivskiy     633
hmarochos                       606
ushkiklichko                    578
semenovatut                     526
kyivpasstrans                   362
kyivpatrol                      313
kyivpastrans_live               280
kyivpassengers                  277
uhmc2022                        271
kyiv_pro_office                 248
kyivcityofficial                247
kyiv_by_grishyn 

# Data Processing

Load the Ukrainian and Russian stopword lists

In [7]:

def read_txt_to_list(path):
    # Read one entry per line; the stopword file is UTF-8 encoded Cyrillic text
    with open(path, 'r', encoding='utf-8') as file:
        return [line.strip() for line in file]

nltk.download("stopwords")

ru_stopwords = stopwords.words("russian")

# Take https://raw.githubusercontent.com/skupriienko/Ukrainian-Stopwords/master/stopwords_ua.txt
!wget https://raw.githubusercontent.com/skupriienko/Ukrainian-Stopwords/master/stopwords_ua.txt
ua_stopwords = read_txt_to_list("stopwords_ua.txt")

# A set makes the per-word membership test in remove_stopwords O(1)
combined_stopwords = set(ru_stopwords + ua_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


--2023-11-13 15:15:01--  https://raw.githubusercontent.com/skupriienko/Ukrainian-Stopwords/master/stopwords_ua.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24502 (24K) [text/plain]
Saving to: ‘stopwords_ua.txt’


2023-11-13 15:15:01 (13.9 MB/s) - ‘stopwords_ua.txt’ saved [24502/24502]



In [8]:
def collapse_dots(text):
    # Collapse runs of dots, including dots separated only by spaces, into one dot
    return re.sub(r"\.(?: *\.)+", ".", text)


def remove_emojis(text):
    # Define a regex pattern for emojis
    emoji_pattern = re.compile("["
                              u"\U0001F600-\U0001F64F"  # emoticons
                              u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                              u"\U0001F680-\U0001F6FF"  # transport & map symbols
                              u"\U0001F700-\U0001F77F"  # alchemical symbols
                              u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                              u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                              u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                              u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                              u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                              u"\U00002702-\U000027B0"  # Dingbats
                              u"\U000024C2-\U0001F251"  # broad catch-all; also spans CJK and other non-emoji characters
                              "]+", flags=re.UNICODE)

    # Remove emojis using the regex pattern
    cleaned_text = emoji_pattern.sub('', text)

    return cleaned_text


def remove_symbols(text):
    # Define a regex pattern for the specified symbols
    symbol_pattern = r'[«»“”—]'

    # Use re.sub to replace the specified symbols with an empty string
    cleaned_text = re.sub(symbol_pattern, '', text)

    return cleaned_text


def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in combined_stopwords]
    return ' '.join(filtered_words)


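def process_text(text):
    if isinstance(text, str):
        # NOTE: remove_stopwords splits on any whitespace, so newlines are already
        # gone when the "\n+" rule below runs; apply that rule first if paragraph
        # breaks should become sentence breaks.
        text = remove_stopwords(text)
        text = remove_emojis(text)
        text = re.sub(r"http\S+", "", text)  # drop URLs
        text = re.sub(r"\n+", ". ", text)
        # Drop a dot that directly follows another punctuation mark
        for symb in ["!", ",", ":", ";", "?", "«", "»", "“", "”", "—"]:
            text = re.sub(re.escape(symb) + r"\.", symb, text)
        text = re.sub(r"#\S+", "", text)  # drop hashtags
        text = collapse_dots(text)
        text = text.strip()
    return text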

df["content_processed"] = df["content"].apply(process_text)
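
`str.split()` with no arguments splits on any whitespace, including `\n`, so joining the result with spaces discards line breaks. The small check below shows how the order of the stopword and newline steps changes the result. It is a sketch only: the sample string and the one-word stopword set are made up for illustration, and `drop_stopwords` is a stand-in for `remove_stopwords`.

```python
import re

STOPWORDS = {"для"}  # tiny illustrative subset, not the full lists


def drop_stopwords(text):
    # split() treats "\n" as whitespace, so line breaks vanish here.
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


sample = "Угрозы для столицы\n\nКиев. Главное"

# Stopwords first: the newline rule has nothing left to match.
stop_first = re.sub(r"\n+", ". ", drop_stopwords(sample))

# Newlines first: the paragraph break survives as ". ".
newline_first = drop_stopwords(re.sub(r"\n+", ". ", sample))

print(stop_first)     # Угрозы столицы Киев. Главное
print(newline_first)  # Угрозы столицы. Киев. Главное
```

Whether sentence boundaries matter depends on the downstream model. For bag-of-words representations the difference is small, but sentence-level embeddings may be affected.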

In [9]:
df["content"].iloc[2]

'⚡️НБУ отозвал банковскую лицензию Конкорд Банка за нарушение в сфере денежного мониторинга\n\nРанее он фигурировал в расследованиях по мискодингу и махинациям игорного бизнеса. Вывод его с рынка не повлияет на стабильность банковского сектора Украины. Каждый вкладчик банка получит полное возмещение.\n\nКиев. Главное. Политика'

In [10]:
df["content_processed"].iloc[2]

'НБУ отозвал банковскую лицензию Конкорд Банка нарушение сфере денежного мониторинга Ранее фигурировал расследованиях мискодингу махинациям игорного бизнеса. Вывод рынка повлияет стабильность банковского сектора Украины. Каждый вкладчик банка получит полное возмещение. Киев. Главное. Политика'

Drop duplicate posts and keep only the posts from three channels

In [11]:
# Drop posts whose processed text is duplicated, then keep three channels
selected_channels = ['novynylive', 'kyivpolitics', 'poznyakyosokorkykharkivskiy']
deduplicated_df = df.drop_duplicates("content_processed")
# .copy() avoids SettingWithCopyWarning when a "topic" column is added later
selected_df = deduplicated_df[deduplicated_df["channelname"].isin(selected_channels)].copy()


In [12]:
selected_df

Unnamed: 0,channelname,Date,content,lang,content_processed
0,kyivpolitics,2023-08-01 09:45:38,Отбой. Угрозы для столицы нет\n\nКиев. Главное...,ru,Отбой. Угрозы столицы Киев. Главное. Политика
1,kyivpolitics,2023-08-01 10:03:38,На 8 перекрестках Киева в пилотном режиме внед...,ru,8 перекрестках Киева пилотном режиме внедрят с...
2,kyivpolitics,2023-08-01 14:42:31,⚡️НБУ отозвал банковскую лицензию Конкорд Банк...,ru,НБУ отозвал банковскую лицензию Конкорд Банка ...
3,kyivpolitics,2023-08-01 15:37:34,Завтра синоптики прогнозируют небольшой дождь ...,ru,Завтра синоптики прогнозируют небольшой дождь ...
4,kyivpolitics,2023-08-01 13:06:08,А вот и сам снятый советский герб \n\nКиев. Гл...,ru,снятый советский герб Киев. Главное. Политика
...,...,...,...,...,...
31147,novynylive,2023-10-27 05:28:18,У приміщеннях столичних ТЕЦ ДБР проводить обшу...,uk,приміщеннях столичних ТЕЦ ДБР проводить обшуки...
31148,novynylive,2023-10-27 05:41:29,Окупант захопив безпілотник українських військ...,uk,Окупант захопив безпілотник українських військ...
31149,novynylive,2023-10-27 05:26:00,"Роздали незаконних премій на 1,6 млн: двоє екс...",uk,"Роздали незаконних премій 1,6 млн: двоє експос..."
31150,novynylive,2023-10-27 04:58:00,Німеччина хоче пришвидшити депортацію нелегалі...,uk,Німеччина хоче пришвидшити депортацію нелегалі...


# Topic Modeling

In [13]:
from bertopic import BERTopic
from transformers.pipelines import pipeline

embedding_model = pipeline("feature-extraction", model="distilbert-base-multilingual-cased")
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)


In [14]:
topics, probs = topic_model.fit_transform(selected_df['content_processed'].to_list())

100%|██████████| 5504/5504 [18:00<00:00,  5.09it/s]
2023-11-13 15:34:00,046 - BERTopic - Transformed documents to Embeddings
2023-11-13 15:34:45,397 - BERTopic - Reduced dimensionality
2023-11-13 15:34:45,714 - BERTopic - Clustered reduced embeddings


In [16]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2102,-1_новини_live_киев_политика,"[новини, live, киев, политика, главное, одеса,...",[Окупанти атакували Херсонщину п’яти літаків «...
1,0,651,0_україни_хамас_україні_сша,"[україни, хамас, україні, сша, росії, рф, єс, ...",[Огляд світових ЗМІ вечір: Politico: ЄС зроста...
2,1,390,1_главное_киев_политика_грн,"[главное, киев, политика, грн, это, тыс, будут...","[СБУ разоблачила Киевщине главу ВЛК, которая ""..."
3,2,262,2_окупанти_ова_обстрілу_постраждалих,"[окупанти, ова, обстрілу, постраждалих, пошкод...",[Окупанти скинули авіабомбу міськраду Куп'янсь...
4,3,207,3_україни_сша_ізраїлю_єс,"[україни, сша, ізраїлю, єс, оон, ізраїль, хама...","[ВР розповіли, Україна розпочне переговори вст..."
5,4,131,4_чоловік_поліції_чоловіка_річний,"[чоловік, поліції, чоловіка, річний, волі, заг...","[Одесі підліток зв'язав орендодавця, з'їжджав ..."
6,5,111,5_политика_главное_киев_ещё,"[политика, главное, киев, ещё, кадры, отбой, т...","[Отбой Киев. Главное. Политика, Киев. Главное...."
7,6,109,6_бійці_бригади_відео_зсу,"[бійці, бригади, відео, зсу, окупантів, знищил...",[Воїни 93-ї ОМБр поділилися відеокадрами робот...
8,7,89,7_відео_увага_містить_лексику,"[відео, увага, містить, лексику, ненормативну,...",[Увага! Відео містить ненормативну лексику! Ві...
9,8,83,8_киев_политика_главное_также,"[киев, политика, главное, также, киева, это, у...",[Украинские СМИ продолжают пестрить заголовкам...
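
Several topics in the table above are dominated by a handful of words that look like a channel signature appended to every post (the raw `kyivpolitics` samples shown earlier all end with the same three-word line). A possible remedy, sketched below under that assumption, is to detect the most frequent trailing line per channel and strip it before topic modelling. The helper names and the 60% threshold are hypothetical choices, and the two `novynylive` posts are made-up placeholders.

```python
from collections import Counter, defaultdict


def find_signatures(posts, min_share=0.6):
    """posts: iterable of (channel, raw_text). Returns {channel: signature line}."""
    last_lines = defaultdict(Counter)
    totals = Counter()
    for channel, text in posts:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            last_lines[channel][lines[-1]] += 1
        totals[channel] += 1
    signatures = {}
    for channel, counts in last_lines.items():
        line, n = counts.most_common(1)[0]
        # Keep the line only if it closes a large share of the channel's posts.
        if n / totals[channel] >= min_share:
            signatures[channel] = line
    return signatures


def strip_signature(text, signature):
    """Remove a trailing signature line, if present."""
    stripped = text.rstrip()
    if signature and stripped.endswith(signature):
        return stripped[: -len(signature)].rstrip()
    return text


posts = [
    ("kyivpolitics", "Отбой. Угрозы для столицы нет\n\nКиев. Главное. Политика"),
    ("kyivpolitics", "А вот и сам снятый советский герб \n\nКиев. Главное. Политика"),
    ("novynylive", "Перший пост"),   # placeholder text
    ("novynylive", "Другий пост"),   # placeholder text
]
signatures = find_signatures(posts)
print(signatures)
print(strip_signature(posts[0][1], signatures.get("kyivpolitics")))
```

Stripping would have to run on the raw `content` column, before `process_text` collapses line breaks.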


In [18]:
topic_model.get_topic(6)

[('бійці', 0.06346711220639283),
 ('бригади', 0.052176667381763624),
 ('відео', 0.051959561332723773),
 ('зсу', 0.049976704284323485),
 ('окупантів', 0.044510861884420466),
 ('знищили', 0.04436727553937708),
 ('омбр', 0.04062445196424579),
 ('слава', 0.038206108735797516),
 ('напрямку', 0.03599833080091178),
 ('показали', 0.03262148051130972)]

In [19]:
topic_model.visualize_topics()

In [53]:
topic_model.save("my_model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)


Show the topics for each channel

In [56]:
loaded_model = BERTopic.load("my_model_dir", embedding_model=embedding_model)

In [61]:
topics_per_class = loaded_model.topics_per_class(selected_df['content_processed'].to_list(),
    classes=selected_df.channelname)

loaded_model.visualize_topics_per_class(topics_per_class,
    top_n_topics=10, normalize_frequency = True)

3it [00:00,  4.88it/s]


In [80]:
# Get a topic for each document in dataframe
selected_df["topic"] = loaded_model.get_document_info(selected_df["content_processed"].to_list())["Topic"].values

In [104]:
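# Top 5 topics per channel, ignoring the outlier topic (-1).
# After reset_index(), "level_1" holds the topic id and "topic" holds its post count.
grouped_channels = selected_df.loc[selected_df["topic"] != -1].groupby("channelname")["topic"].apply(lambda x: x.value_counts().head(5)).reset_index()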

In [106]:
grouped_channels

Unnamed: 0,channelname,level_1,topic
0,kyivpolitics,1,388
1,kyivpolitics,5,109
2,kyivpolitics,8,82
3,kyivpolitics,16,43
4,kyivpolitics,18,41
5,novynylive,0,584
6,novynylive,2,238
7,novynylive,3,203
8,novynylive,6,107
9,novynylive,4,97


In [114]:
for channel in grouped_channels["channelname"].unique():
    topic_numbers = grouped_channels[grouped_channels["channelname"] == channel]["level_1"].values
    topics = [loaded_model.get_topic(topic_num)[0][0] for topic_num in topic_numbers]
    print(f"{channel} topics: {topics}")

kyivpolitics topics: ['главное', 'политика', 'киев', 'дтп', 'движение']
novynylive topics: ['україни', 'окупанти', 'україни', 'бійці', 'чоловік']
poznyakyosokorkykharkivskiy topics: ['україни', 'відео', 'чоловік', 'загубив', 'окупанти']
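
The loop above keeps only the first keyword of each topic, so a channel whose top topics share a leading keyword gets repeated labels, as in the `novynylive` row. A minimal sketch of one alternative follows: join the top few keywords of each topic and skip labels already seen. Plain dictionaries stand in for the fitted model here; `channel_topic_labels` is a hypothetical helper, and `topic_words` holds only a few keywords copied from the topic table rather than real `get_topic` output.

```python
def channel_topic_labels(channel_topics, topic_words, n_words=2):
    """Build {channel: [label, ...]} with duplicate labels removed.

    channel_topics: {channel: [topic_id, ...]} ordered by frequency.
    topic_words: {topic_id: [keyword, ...]} ordered by weight.
    """
    result = {}
    for channel, topic_ids in channel_topics.items():
        seen, labels = set(), []
        for topic_id in topic_ids:
            label = " ".join(topic_words[topic_id][:n_words])
            if label not in seen:
                seen.add(label)
                labels.append(label)
        result[channel] = labels
    return result


# Stand-in for the model's keywords per topic (subset of the table above).
topic_words = {
    0: ["україни", "хамас", "україні", "сша"],
    2: ["окупанти", "ова", "обстрілу"],
    3: ["україни", "сша", "ізраїлю"],
}
channel_topics = {"novynylive": [0, 2, 3]}

print(channel_topic_labels(channel_topics, topic_words))
print(channel_topic_labels(channel_topics, topic_words, n_words=1))
```

With `n_words=1` the two `україни` topics collapse into a single label. With two keywords they stay distinct, which is closer to the `channel_name:[topic_1, topic_2, topic_3]` output the assignment asks for.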
