## Loading the data and taking a first look at it

In [1]:
import pandas as pd
import regex as re

In [2]:
import os
from os.path import expanduser
home = expanduser("~")
os.chdir(os.path.join(home, 'Documents', 'Projekty', 'tweets-analysis'))
print('Current working directory set to:')
os.getcwd()

Current working directory set to:


'C:\\Users\\Asia\\Documents\\Projekty\\tweets-analysis'

In [3]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device: ",device)



Device:  cuda


In [4]:
RAW_DIR = os.path.join(os.getcwd(), 'data', 'raw')

In [5]:
tweets = pd.read_csv(os.path.join(RAW_DIR, 'tweety_rekrutacja.csv'))

In [6]:
tweets.describe()

Unnamed: 0,id,id_str,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,geo,retweet_count,favorite_count,quoted_status_id,quoted_status_id_str,quote_count,reply_count
count,92638.0,92638.0,34479.0,34479.0,34655.0,34655.0,0.0,92638.0,92638.0,6867.0,6867.0,0.0,0.0
mean,1.579679e+18,1.579679e+18,1.579132e+18,1.579132e+18,5.661875e+17,5.661875e+17,,73.162633,34.634783,1.575022e+18,1.575022e+18,,
std,7527990000000000.0,7527990000000000.0,8071696000000000.0,8071696000000000.0,6.109383e+17,6.109383e+17,,178.928928,202.122725,4.407537e+16,4.407537e+16,,
min,1.565128e+18,1.565128e+18,1.32685e+18,1.32685e+18,1652541.0,1652541.0,,0.0,0.0,1.067044e+17,1.067044e+17,,
25%,1.573912e+18,1.573912e+18,1.573369e+18,1.573369e+18,589820600.0,589820600.0,,0.0,0.0,1.57227e+18,1.57227e+18,,
50%,1.579398e+18,1.579398e+18,1.578717e+18,1.578717e+18,4230791000.0,4230791000.0,,2.0,0.0,1.578083e+18,1.578083e+18,,
75%,1.585848e+18,1.585848e+18,1.584846e+18,1.584846e+18,1.165397e+18,1.165397e+18,,48.0,5.0,1.584073e+18,1.584073e+18,,
max,1.595005e+18,1.595005e+18,1.594295e+18,1.594295e+18,1.591702e+18,1.591702e+18,,1633.0,16430.0,1.594114e+18,1.594114e+18,,


In [7]:
tweets.head()

Unnamed: 0,name,created_at,id,id_str,full_text,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,geo,is_quote_status,retweet_count,favorite_count,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quote_count,reply_count
0,ZZurecki,2022-10-21 22:33:45,1583587372913942528,1583587372913942528,@agnieszkawolsk9 @prezydentpl Konflikt atomowy...,1.583492e+18,1.583492e+18,7.885019e+17,7.885019e+17,agnieszkawolsk9,,False,0,0,False,pl,,,,
1,ZZurecki,2022-10-21 14:50:47,1583470861306122240,1583470861306122240,Rozpoczeto juz sledztwo kryminalne dot. zakupu...,,,,,,,False,0,0,False,pl,,,,
2,ZZurecki,2022-09-20 21:35:32,1572338694815875072,1572338694815875072,RT @M7A7G7X: Włochy. San Patrignano: rachunek ...,,,,,,,False,500,0,,pl,,,,
3,ZZurecki,2022-09-20 19:07:01,1572301318961741824,1572301318961741824,RT @cyfralab: PO 1989 zachowano urzędową cenę...,,,,,,,False,117,0,,pl,,,,
4,ZZurecki,2022-09-20 08:37:16,1572142838527463424,1572142838527463424,RT @cyfralab: Mafia POmagdalenkowa w latach 19...,,,,,,,False,186,0,,pl,,,,


## 1. What topics can we see in the texts (what are people writing about)?

I decided to use the [BERTopic](https://github.com/MaartenGr/BERTopic) model, described in [BERTopic: Neural topic modeling with a class-based TF-IDF procedure](https://arxiv.org/abs/2203.05794) by Maarten Grootendorst. As the author puts it, BERTopic is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

In [8]:
from bertopic import BERTopic
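To make the "class-based TF-IDF" idea more concrete, here is a minimal pure-Python sketch of the weighting formula from the paper, W(t, c) = tf(t, c) * log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of term t across all classes. The function name `c_tf_idf` and the toy documents are made up for illustration, and BERTopic's own implementation differs in details such as normalization.

```python
import math
from collections import Counter

def c_tf_idf(classes: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score terms per class with tf(t, c) * log(1 + A / f(t))."""
    # Treat all documents of a class as one long document and count its terms.
    class_counts = {
        name: Counter(" ".join(docs).lower().split())
        for name, docs in classes.items()
    }
    # f(t): frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)
    return {
        name: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for name, counts in class_counts.items()
    }

# Hypothetical toy corpus with two "topics".
toy = {
    "energy": ["ceny gazu rosną", "ceny węgla rosną"],
    "sport": ["mecz piłki nożnej", "ceny biletów na mecz"],
}
scores = c_tf_idf(toy)
top_energy = max(scores["energy"], key=scores["energy"].get)
print(top_energy)
```

In this toy corpus "ceny" occurs in both classes, so it is down-weighted relative to "rosną", which occurs only in the "energy" class.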

We only remove URLs from the tweets. The model's author recommends against further preprocessing of the input data, since it would interfere with the BERT embedding step, which relies on the context of the original text.

In [9]:
def remove_url_from_text(text: str) -> str:
    '''
    Remove URLs (any whitespace-free run starting with "http") from the text.
    '''
    text = re.sub(r'http\S+', '', text)
    return text

In [10]:
# Create a new column with URL-free tweets
tweets['url_free_tweets'] = tweets['full_text'].apply(remove_url_from_text)
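As a quick sanity check, the snippet below applies the same pattern to a few hand-written strings (hypothetical examples, not taken from the dataset). It uses the standard-library `re` module, which behaves the same as `regex` for this pattern.

```python
import re

def remove_url_from_text(text: str) -> str:
    """Remove every whitespace-free run that starts with 'http'."""
    return re.sub(r'http\S+', '', text)

# Hypothetical sample tweets.
samples = [
    "RT @user: czytaj https://t.co/abc123 teraz",
    "brak linku w tym tweecie",
    "www.example.com nie zaczyna się od http",
]
cleaned = [remove_url_from_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two limitations are visible here: removing a URL from the middle of a tweet leaves a double space behind, and links that do not start with `http` (such as `www.example.com`) are left untouched.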

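Adding Polish stopwords to the vectorizer keeps common but uninformative words out of the topic descriptions, leaving room for the important ones. The vectorizer is only used to build the topic representations, so this does not change the text that gets embedded.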

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
# Common Polish function words (plus the retweet marker "rt") to exclude from topic descriptions
list_of_polish_stopwords = ["rt", "na", "bo", "to", "do", "z", "w", "nie", "co", "ma", "się", "że", "jak", "za", "żeby", "ci", "cię", "ale", "po"]
vectorizer_model = CountVectorizer(stop_words=list_of_polish_stopwords, ngram_range=(1, 1), min_df=5)

The BERTopic class takes a list of strings as input, so let's extract the values of our URL-free tweets column as a list.

In [27]:
docs = list(tweets['url_free_tweets'].values)

Initialize a BERTopic model for the Polish language and fit it to the data. With `nr_topics="auto"`, BERTopic automatically merges similar topics after clustering.

In [28]:
# calculate_probabilities=False skips computing the full document-topic probability matrix
model = BERTopic(language="polish", calculate_probabilities=False, verbose=True, vectorizer_model=vectorizer_model, nr_topics="auto")
topics, probabilities = model.fit_transform(docs)

Batches: 100%|██████████| 2895/2895 [08:39<00:00,  5.58it/s]
2023-01-24 18:38:59,825 - BERTopic - Transformed documents to Embeddings
2023-01-24 18:41:50,345 - BERTopic - Reduced dimensionality
2023-01-24 18:42:11,641 - BERTopic - Clustered reduced embeddings
2023-01-24 18:42:40,865 - BERTopic - Reduced number of topics from 1567 to 37


Assign topics to a new column in the data.

In [29]:
tweets["topic"] = topics

### Let's take a look at the topics obtained and their counts
According to the log above, the model reduced 1567 initial topics to 37, and their sizes are very uneven.
Topic -1 collects all outliers, i.e. documents without an assigned topic (forcing every document into a topic could lead to poor performance), so we will ignore it in further analysis.


In [30]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,0,47996
1,-1,42502
2,1,408
3,2,139
4,3,119
5,4,119
6,5,113
7,6,112
8,7,88
9,8,81
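To put these counts in perspective, the short sketch below computes the share of the outlier topic and of the largest topic. The counts are copied from the table above; the total of 92,638 is the row count reported by `describe()`, under the assumption that every row received a topic label.

```python
# Counts copied from the get_topic_freq() output above (first ten rows).
topic_counts = {0: 47996, -1: 42502, 1: 408, 2: 139, 3: 119,
                4: 119, 5: 113, 6: 112, 7: 88, 8: 81}
# Assumption: every one of the 92,638 rows in the dataset got a topic label.
total_tweets = 92638

outlier_share = topic_counts[-1] / total_tweets
largest_share = topic_counts[0] / total_tweets
# Tweets left for all topics other than -1 and 0.
remaining = total_tweets - topic_counts[-1] - topic_counts[0]

print(f"Outliers (topic -1): {outlier_share:.1%}")
print(f"Largest topic (0):   {largest_share:.1%}")
print(f"Tweets in all other topics: {remaining}")
```

Under that assumption, topics -1 and 0 together cover almost the whole dataset, which is worth keeping in mind when reading the visualizations that follow.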


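A closer look at the words describing the most popular topic. Function words such as "jest", "dla" and "od" still appear in the output, which suggests that the stopword list could be extended.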

In [31]:
model.get_topic(0)

[('węgla', 0.019248160656274396),
 ('pis', 0.018057810036926053),
 ('jest', 0.016832226472532227),
 ('węgiel', 0.015731737297177046),
 ('dla', 0.015364439214188709),
 ('inflacja', 0.015285331316183445),
 ('ceny', 0.015027468125025163),
 ('gazu', 0.015020151262549417),
 ('od', 0.013958075631508178),
 ('cen', 0.013681599394084906)]

### Visualize topics, their sizes, and their corresponding words

This interactive visualization is inspired by LDAvis, a visualization technique typically used with LDA models.

In [32]:
model.visualize_topics()

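### Visualize the topic probability distribution of a single document
The `probabilities` variable returned by `fit_transform()` can be used to understand how confident BERTopic is that certain topics can be found in a document. This cell needs the full document-topic probability matrix, which is only computed when the model is created with `calculate_probabilities=True`. Its lower execution count suggests it was run before the model above was refitted with `calculate_probabilities=False`.

In [19]:
model.visualize_distribution(probabilities[200], min_probability=0.015)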
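The role of `min_probability` can be illustrated with a few lines of plain Python. The probability values below are invented for illustration, and the strict `>` comparison is an assumption about how the threshold is applied; the point is only that topics at or below the threshold are dropped from the chart.

```python
# Hypothetical topic probabilities for one document (index = topic id).
topic_probabilities = [0.40, 0.01, 0.25, 0.005, 0.10]
min_probability = 0.015

# Keep only the topics whose probability exceeds the threshold.
visible = {
    topic: p
    for topic, p in enumerate(topic_probabilities)
    if p > min_probability
}
print(visible)
```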

### Visualize terms representing the top 10 topics
We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [33]:
model.visualize_barchart(top_n_topics=10)

### Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by applying cosine similarity to those topic embeddings. The resulting matrix indicates how similar the topics are to each other.

In [34]:
model.visualize_heatmap(n_clusters=30, width=1000, height=1000)
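The similarity matrix behind such a heatmap can be sketched in a few lines. The three topic vectors below are hypothetical stand-ins for real topic embeddings, and `cosine_similarity` is a hand-written helper rather than a BERTopic function.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical topic embeddings: A and B point in the same direction,
# C is orthogonal to both.
topic_vectors = {
    "A": [1.0, 0.0, 1.0],
    "B": [2.0, 0.0, 2.0],
    "C": [0.0, 3.0, 0.0],
}
names = list(topic_vectors)
similarity = [
    [cosine_similarity(topic_vectors[a], topic_vectors[b]) for b in names]
    for a in names
]
for name, row in zip(names, similarity):
    print(name, [round(value, 2) for value in row])
```

Because cosine similarity ignores vector length, topics A and B come out as identical in direction even though their magnitudes differ.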

### Visualize topics popularity over time

In [35]:
# Keep only the date part (YYYY-MM-DD) of each timestamp string
timestamps = tweets['created_at'].str[:10].tolist()
topics_over_time = model.topics_over_time(docs, timestamps)
model.visualize_topics_over_time(topics_over_time)

83it [00:04, 16.92it/s]
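The timestamp truncation used above groups tweets by calendar day. As a small illustration, the sketch below applies the same slicing to the five `created_at` values shown in the `tweets.head()` output and counts the tweets per day.

```python
from collections import Counter

# created_at values copied from the tweets.head() output.
created_at = [
    "2022-10-21 22:33:45",
    "2022-10-21 14:50:47",
    "2022-09-20 21:35:32",
    "2022-09-20 19:07:01",
    "2022-09-20 08:37:16",
]
# The first ten characters of each string are the date (YYYY-MM-DD).
days = [timestamp[:10] for timestamp in created_at]
tweets_per_day = Counter(days)
print(tweets_per_day)
```

This slicing only works while `created_at` is stored as a string in this exact format; if the column were parsed into datetimes, the date would have to be extracted differently.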


In [36]:
model.visualize_term_rank()
# Each topic is represented by a set of words. These words, however,
#     do not all equally represent the topic. This visualization shows
#     how many words are needed to represent a topic and at which point
#     the beneficial effect of adding words starts to decline.
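The same idea can be checked by hand for a single topic. The sketch below takes the ten term scores of topic 0 from the `model.get_topic(0)` output above (rounded to five decimals) and prints how much the score drops from one rank to the next.

```python
# (term, c-TF-IDF score) pairs for topic 0, rounded from the output above.
topic_0_terms = [
    ("węgla", 0.01925), ("pis", 0.01806), ("jest", 0.01683),
    ("węgiel", 0.01573), ("dla", 0.01536), ("inflacja", 0.01529),
    ("ceny", 0.01503), ("gazu", 0.01502), ("od", 0.01396), ("cen", 0.01368),
]
scores = [score for _, score in topic_0_terms]
# Score drop between each pair of consecutive ranks.
drops = [round(prev - cur, 5) for prev, cur in zip(scores, scores[1:])]

for rank, ((term, score), drop) in enumerate(zip(topic_0_terms[1:], drops), start=2):
    print(f"rank {rank:2d}  {term:10s} score={score:.5f}  drop={drop:.5f}")
```

For this topic the scores decline slowly and fairly evenly, so no single term dominates the description.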