## Loading the data and taking a first look at it

In [15]:
import pandas as pd
import regex as re
import requests
from urllib.parse import urlparse
import spacy
from spacy import displacy

In [2]:
import os
from os.path import expanduser
home = expanduser("~")
os.chdir(os.path.join(home, 'Documents', 'Projekty', 'tweets-analysis'))
print('Current working directory set to:')
os.getcwd()

Current working directory set to:


'C:\\Users\\Asia\\Documents\\Projekty\\tweets-analysis'

In [3]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device: ",device)

Device:  cuda


In [4]:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
from tqdm import tqdm
tqdm.pandas()


In [5]:
RAW_DIR = os.path.join(os.getcwd(), 'data', 'raw')

In [6]:
tweets = pd.read_csv(os.path.join(RAW_DIR, 'tweety_rekrutacja.csv'))

In [7]:
tweets.describe()

Unnamed: 0,id,id_str,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,geo,retweet_count,favorite_count,quoted_status_id,quoted_status_id_str,quote_count,reply_count
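
A side note on the ID columns: `describe()` lists `id_str` and `in_reply_to_status_id_str` among the numeric columns, which suggests the IDs were parsed as numbers. pandas stores a numeric column with missing values as float64, and a float64 cannot represent every integer of tweet-ID size exactly. The sketch below shows the effect with a made-up ID; it is an illustration, not a claim that any ID in this dataset was actually altered.

```python
# Made-up tweet ID for illustration; real IDs are of a similar magnitude.
tweet_id = 1618291234567890123

# float64 has a 53-bit mantissa, so not every integer above 2**53 is representable.
as_float = float(tweet_id)
round_tripped = int(as_float)

print(tweet_id)
print(round_tripped)
print("exact after float round trip:", round_tripped == tweet_id)

# Keeping the ID as text avoids the problem entirely.
as_text = str(tweet_id)
print("exact after str round trip:", int(as_text) == tweet_id)
```

If the IDs are needed later (for example to join replies with the tweets they answer), one option is to pass `dtype={'id_str': str}` (and likewise for the other `*_str` columns) to `pd.read_csv`.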


In [8]:
tweets.head()

name,created_at,id,id_str,full_text,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,geo,is_quote_status,retweet_count,favorite_count,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quote_count,reply_count


## 1. What topics can we see in the texts (what are people writing about)?

I decided to use the [BERTopic](https://github.com/MaartenGr/BERTopic) model, described in [BERTopic: Neural topic modeling with a class-based TF-IDF procedure](https://arxiv.org/abs/2203.05794) by Maarten Grootendorst. As the author puts it, BERTopic is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

In [9]:
from bertopic import BERTopic

We only remove URLs from the tweets, because the model's author recommends not preprocessing the input data any further, as this could interfere with the BERT embedding step.

In [10]:
def remove_url_from_text(text:str) -> str:
    '''
    Cleans text from urls
    '''
    text = re.sub(r'http\S+', '', text)
    return text

In [11]:
# Create a new column with url free tweets
tweets['url_free_tweets'] = tweets['full_text'].apply(remove_url_from_text)
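
As a quick sanity check of the URL removal, here is a small standalone sketch. It is a variant of the function above, not a replacement for it: the pattern requires `://`, so words that merely start with "http" are kept, and it collapses the whitespace left behind. The name `strip_urls` and the sample sentence are made up for this example.

```python
import re

# Stricter than r'http\S+': requires the scheme separator, so words that merely
# start with "http" are left alone.
URL_PATTERN = re.compile(r'https?://\S+')

def strip_urls(text: str) -> str:
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

sample = "Ceny węgla rosną https://t.co/abc123 a httpd to serwer"
print(strip_urls(sample))
```

Note that collapsing whitespace also removes line breaks inside a tweet, which may or may not be desirable before embedding.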

Adding Polish stop words keeps common but uninformative words out of the topic descriptions, leaving room for the important ones.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
list_of_polish_stopwords =["rt","na", "bo", "to", "do", "z", "w", "nie", "co", "ma", "się", "że", "jak", "za", "żeby", "ci", "cię", "ale", "po"]
vectorizer_model = CountVectorizer(stop_words=list_of_polish_stopwords, ngram_range=(1,1), min_df=3)

The BERTopic class takes a list of strings as input, so let's extract the values of our URL-free tweets column as a list.

In [22]:
docs = list(tweets['url_free_tweets'].values)

Initialize the BERTopic model for the Polish language and fit it to the data.

In [23]:
model = BERTopic(language="polish", calculate_probabilities=False, verbose=True, vectorizer_model=vectorizer_model, nr_topics="auto")
topics, probabilities = model.fit_transform(docs)

Batches:   0%|          | 0/2895 [00:00<?, ?it/s]

2023-01-26 15:43:22,596 - BERTopic - Transformed documents to Embeddings
2023-01-26 15:45:48,936 - BERTopic - Reduced dimensionality
2023-01-26 15:46:07,383 - BERTopic - Clustered reduced embeddings
2023-01-26 15:46:35,298 - BERTopic - Reduced number of topics from 1567 to 71


Assign topics to a new column in the data.

In [24]:
tweets["topic"]= topics

### Let's take a look at the topics obtained and their counts
According to the log above, the model reduced 1567 initial topics to 71, each covering a different number of tweets.
Topic -1 collects all outliers, i.e. tweets without an assigned topic (forcing documents into a topic could lead to poor performance), so we will ignore it in further analysis.
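
Since topic -1 is ignored from here on, it is worth checking how large it is. The following is a minimal sketch that works on a plain list of topic assignments, such as the `topics` list returned by `fit_transform`; the helper name and the example values are made up.

```python
from collections import Counter

def summarize_topics(topics):
    """Count documents per topic, reporting outliers (-1) separately."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)
    outlier_share = n_outliers / len(topics) if topics else 0.0
    return counts, outlier_share

# Made-up assignments standing in for the real `topics` list.
example_topics = [0, 0, 1, -1, 2, -1, 0, 1]
counts, outlier_share = summarize_topics(example_topics)
print(counts.most_common())
print(f"outliers: {outlier_share:.0%}")
```

A large outlier share would mean that the conclusions below describe only part of the dataset.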


In [25]:
model.get_topic_freq()

Unnamed: 0,Topic,Count


A closer look at the words describing the most popular topic.

In [26]:
model.get_topic(0)

[('węgla', 0.01111644989067506),
 ('pis', 0.010307268091958715),
 ('węgiel', 0.009211614425803077),
 ('jest', 0.009050501432631074),
 ('dla', 0.008701658906185107),
 ('gazu', 0.008633367410894741),
 ('ceny', 0.00853248339579565),
 ('inflacja', 0.008241844792019802),
 ('cen', 0.008083907297382342),
 ('od', 0.007698990599334426)]

### Visualization of topics, their sizes, and their corresponding words

This visualization is heavily inspired by LDAvis, a great visualization technique typically used with LDA.

In [27]:
model.visualize_topics()

### Terms representing the top 10 topics
We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [29]:
model.visualize_barchart(top_n_topics=10)

### Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between every pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [30]:
model.visualize_heatmap(n_clusters=30, width=1000, height=1000)
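
For readers unfamiliar with cosine similarity, here is a tiny pure-Python sketch of how such a matrix is built: the similarity of two vectors is their dot product divided by the product of their lengths. The toy vectors are made up and far smaller than real topic embeddings; this only illustrates the computation, not BERTopic's internals.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy 3-dimensional "topic embeddings" (made up for illustration).
topic_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
matrix = similarity_matrix(topic_vectors)
for row in matrix:
    print([round(x, 3) for x in row])
```

The matrix is symmetric with ones on the diagonal, and the third vector, which is orthogonal to the other two, gets a similarity of zero with both.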

### Topic popularity over time

In [31]:
timestamps = list(tweets['created_at'].apply(lambda x: x[:10]).values)
topics_over_time = model.topics_over_time(docs, timestamps)
model.visualize_topics_over_time(topics_over_time)

83it [00:10,  7.63it/s]
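
The `x[:10]` slice above suggests that `created_at` starts with a `YYYY-MM-DD` date, so the timestamps are effectively bucketed by day. Under that assumption, the daily counts behind such a chart can be sketched in plain Python as follows. Skipping topic -1 is a choice made in this sketch; the function name and sample values are made up.

```python
from collections import Counter

def topic_counts_per_day(created_at_values, topics):
    """Count (day, topic) pairs, skipping outliers (topic -1)."""
    counts = Counter()
    for created_at, topic in zip(created_at_values, topics):
        if topic == -1:
            continue
        day = created_at[:10]  # assumes an ISO-like 'YYYY-MM-DD...' prefix
        counts[(day, topic)] += 1
    return counts

created = ["2022-10-01 08:15:00", "2022-10-01 09:30:00",
           "2022-10-02 11:00:00", "2022-10-02 12:00:00"]
assigned = [0, 0, 0, -1]
daily = topic_counts_per_day(created, assigned)
for (day, topic), n in sorted(daily.items()):
    print(day, topic, n)
```

If `created_at` were instead in the classic Twitter API format (e.g. `Wed Oct 10 20:19:24 +0000 2018`), the first ten characters would not form a date and a real date parser would be needed.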


## 2. What are the most important users who initiate discussions in specific topics?

To answer this question, we first need to define what counts as a discussion around a given topic: the number of retweets, the replies to a tweet, its likes, or some mix of these. In my opinion, the number of retweets indicates how popular a user's tweets are, but not whether that user initiates a discussion on a topic. That is why I focused on a very simple indicator: replies to a person's tweets. So let's count how many times a person's tweets were replied to in a given topic (not counting their own replies to their own tweets).

In [32]:
# remove rows where user is the same as in_reply_to_screen_name
df = tweets[tweets['name'] != tweets['in_reply_to_screen_name']]

In [33]:
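# count replies received per replied-to user and topic (grouping by the author would count replies written)
df = df.groupby(['in_reply_to_screen_name', 'topic']).agg({'name': 'count'})
df = df.rename_axis(index=['name', 'topic']).rename(columns={'name': 'in_reply_to_screen_name'})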

In [34]:
df = df.groupby(['topic']).apply(lambda x: x.nlargest(3, 'in_reply_to_screen_name'))
df = df.reset_index(level=0, drop=True)
df.columns = ['times_people_replied_to_user']
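
To make the indicator explicit, here is a small pure-Python sketch of "replies received from other users, per topic". It assumes each tweet is a dict with the `name`, `in_reply_to_screen_name` and `topic` fields, and that the topic of a reply is a reasonable proxy for the topic of the tweet it answers (the answered tweet may not be in the dataset). The function name and the example records are made up.

```python
from collections import Counter, defaultdict

def top_replied_users(tweets, n=3):
    """For each topic, return the n users who received the most replies from others."""
    per_topic = defaultdict(Counter)
    for tweet in tweets:
        target = tweet["in_reply_to_screen_name"]
        # Skip non-replies and replies to one's own tweets.
        if target is None or target == tweet["name"]:
            continue
        per_topic[tweet["topic"]][target] += 1
    return {topic: counter.most_common(n) for topic, counter in per_topic.items()}

example = [
    {"name": "anna", "in_reply_to_screen_name": "jan", "topic": 0},
    {"name": "ola", "in_reply_to_screen_name": "jan", "topic": 0},
    {"name": "jan", "in_reply_to_screen_name": "jan", "topic": 0},   # self-reply, ignored
    {"name": "jan", "in_reply_to_screen_name": "anna", "topic": 0},
    {"name": "ola", "in_reply_to_screen_name": None, "topic": 1},    # not a reply
]
result = top_replied_users(example)
print(result)
```

The key point is that the count is keyed by the account being replied to, not by the author of the reply.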

The table below shows the top 3 users who initiate discussions in each topic:

In [40]:
df

name,topic,times_people_replied_to_user


## 3. What are the sources (e.g. media, webpages) people refer to in tweets?

### Let's start with the simplest thing we can do here: extracting the webpages from tweets

We use a regular expression to find URLs in the tweet text.

In [42]:
tweets['urls'] = tweets['full_text'].str.findall(r'(https?://[^\s]+)')

In [43]:
tweets_with_urls = tweets[tweets['urls'].map(len) > 0]
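# let's take only a small sample for this example run; copy() avoids a
# SettingWithCopyWarning when the 'urls' column is overwritten later
tweets_with_urls = tweets_with_urls[:1800].copy()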

Since all links shared on Twitter are automatically shortened to t.co links, we need to reverse this process and extract the domains to see which webpages the authors referred to.

In [44]:
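def get_domain_name_from_twitter_url(url: str) -> 'str | None':
    '''Follows redirects and returns the final domain, or None if the request fails.'''
    try:
        # the timeout keeps one unresponsive server from stalling the whole run
        return urlparse(requests.get(url, timeout=10).url).netloc
    except requests.exceptions.RequestException:
        return None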


In [45]:
# resolve every URL in the urls column to the domain it redirects to
tweets_with_urls['urls'] = tweets_with_urls['urls'].progress_apply(lambda x: [get_domain_name_from_twitter_url(url) for url in x])

100%|██████████| 1800/1800 [20:49<00:00,  1.44it/s]


This process is very time-consuming and therefore far from optimal, but it is a workable way to deal with t.co links. The Twitter API should offer a way to download expanded URLs directly, which we should use in the future to avoid this slow and expensive URL translation.
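
Until expanded URLs are available, one cheap improvement is to resolve each distinct URL only once. The sketch below shows the idea with a caller-supplied `resolve` function, which is a hypothetical stand-in for the network call; a fake resolver with made-up redirects is used so that the example runs offline.

```python
from urllib.parse import urlparse

def resolve_domains(urls, resolve):
    """Map each URL to the domain of its final destination, resolving each distinct URL once.

    `resolve` is a caller-supplied function (hypothetical here) that follows
    redirects and returns the final URL, or None on failure.
    """
    cache = {}
    for url in urls:
        if url not in cache:
            final_url = resolve(url)
            cache[url] = urlparse(final_url).netloc if final_url else None
    return cache

# Stand-in resolver so the example runs offline; the mapping is made up.
fake_redirects = {"https://t.co/a": "https://www.example.com/news/1",
                  "https://t.co/b": "https://example.org/post"}
calls = []

def fake_resolve(url):
    calls.append(url)
    return fake_redirects.get(url)

urls = ["https://t.co/a", "https://t.co/b", "https://t.co/a", "https://t.co/c"]
domains = resolve_domains(urls, fake_resolve)
print(domains)
print("network calls:", len(calls))
```

In a real run, `resolve` could wrap `requests.head(url, allow_redirects=True, timeout=10).url`, which may be cheaper than a full GET if the target servers handle HEAD requests properly. How much this saves depends on how often the same link is shared.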

In [47]:
# get most popular domains
tweets_with_urls['urls'].explode().value_counts().head(30)

Unnamed: 0,urls


As we can see, the most popular domain in our small sample was twitter.com itself, followed mainly by portals covering politics and the economy.

### Another approach: dictionary-based method for the most popular Polish media
The list of media was created from the [Instytut Monitorowania Mediów](https://www.imm.com.pl/wp-content/uploads/2021/02/Najbardziej_opiniotworcze_media_w_Polsce_2020_-_raport_roczny-1.pdf) report.

In [36]:
polish_media_list = ["Rzeczpospolita", "Gazeta Wyborcza", "Super Express", "Dziennik Gazeta Prawna",
                    "Fakt", "Przegląd Sportowy", "Puls Biznesu", "Do Rzeczy", "Wprost", "Gazeta Polska",
                    "TVN24", "Polsat News", "TVP Info", "TVN", "TVP1", "Polsat", "TVP Sport", "TV Trwam",
                    "Polsat Sport", "TV Republika", "RMF FM", "Radio Zet", "Program Pierwszy Polskiego Radia", "Polskie Radio 24",
                    "TOK FM", "Radio Plus", "Program Trzeci Polskiego Radia", "Radio Maryja", "Radio Kraków",
                    "Radio Poznań", "Onet.pl", "Wp.pl", "Interia.pl", "Money.pl", "wPolityce.pl", "Gazeta.pl",
                    "Wirtualnemedia.pl", "Pudelek.pl", "Plejada.pl", "Businessinsider.com.pl", "Sieci", "Newsweek",
                    "Forbes", "Perspektywy", "Press", "Twój Styl", "Zwierciadło", "Pani", "Focus", "Bankier.pl"]

In [37]:
# find Polish media names in the full_text column and store them in a new column
tweets['media'] = tweets['full_text'].progress_apply(lambda x: [media for media in polish_media_list if media in x])
# list most popular media and order by count
tweets['media'].explode().value_counts().sort_values(ascending=False)

100%|██████████| 92638/92638 [00:01<00:00, 74285.46it/s]


Unnamed: 0,media


One remark is in order here: the word "Pani" is the name of a Polish magazine, but also a regular courtesy form meaning "Mrs.". A dictionary-based approach cannot tell these two usages apart, which inflates the count for "Pani". Such special cases should be investigated further and corrected. Apart from that, TVN, Polsat, TVN24 and Fakt are the most popular Polish sources in our dataset. Foreign media remain an open question and should be part of future work.
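
A related limitation of the plain substring test (`media in x`) is that short names also match inside longer ones: "TVN" is found in every mention of "TVN24", and "Fakt" in a word like "Faktycznie". A possible refinement, sketched below with a made-up helper and sample sentence, is to match whole words only and to prefer longer names. It does not solve the "Pani" ambiguity, and it will still miss inflected forms of the names.

```python
import re

def compile_media_pattern(names):
    """Build one regex that matches any name as a whole word, preferring longer names."""
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")

media_names = ["TVN", "TVN24", "Polsat", "Polsat News", "Fakt", "Onet.pl"]
pattern = compile_media_pattern(media_names)

text = "TVN24 i Polsat News podały, że Faktycznie Onet.pl miał rację"
matches = pattern.findall(text)
print(matches)
```

Sorting by length makes "Polsat News" win over "Polsat", while the look-around assertions keep "TVN" from matching inside "TVN24".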

### Can we use a NER algorithm for this task?

We will use the pre-trained [spaCy pl_core_news_sm model](https://spacy.io/models/pl), which is one of the more popular NER models for the Polish language. The entities recognized by the model are: date, geogName, orgName, persName, placeName and time.

In [48]:
!python -m spacy download pl_core_news_sm

Collecting pl-core-news-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pl_core_news_sm-3.5.0/pl_core_news_sm-3.5.0-py3-none-any.whl (20.2 MB)
     ---------------------------------------- 20.2/20.2 MB 8.8 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('pl_core_news_sm')


In [49]:
import pl_core_news_sm
ner = pl_core_news_sm.load()

Let's see how the model works on an example from our dataset:

In [71]:
doc = ner(tweets['full_text'][5])
displacy.render(doc, style="ent")

Unfortunately, the orgName entity is very generic: it covers both media and various other organizations, such as the EU ("UE" in Polish) or NATO. NER can therefore be treated as a starting point for further ideas. With enough data, one could fine-tune a model with a narrower entity type, or combine NER with the dictionary method and an expanded list of media that is not limited to Polish outlets.
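
The combination of NER with the dictionary method could look roughly like the sketch below. It assumes the entities have already been collected as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in doc.ents]`, and keeps only organisation entities that appear in the media list. The function name and the example entities are made up, and inflected forms of the names would still need separate handling.

```python
def media_mentions(entities, media_names):
    """Keep organisation entities whose text matches a known media name (case-insensitive)."""
    known = {name.lower(): name for name in media_names}
    found = []
    for text, label in entities:
        if label == "orgName" and text.lower() in known:
            found.append(known[text.lower()])
    return found

# Made-up (text, label) pairs standing in for real model output.
entities = [("NATO", "orgName"), ("TVN24", "orgName"),
            ("Pani", "persName"), ("Onet.pl", "orgName")]
mentions = media_mentions(entities, ["TVN24", "Onet.pl", "Pani"])
print(mentions)
```

Requiring the orgName label is what could help with cases like "Pani", provided the model labels the courtesy form differently from the magazine name, which would have to be verified on real data.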

## 4. How can we identify inauthentic accounts which could possibly be bots or trolls?

Developing even a simple algorithm to catch bots or fake Twitter accounts is a very labor-intensive task, so rather than undertake it, I considered how I would approach the topic with more time and resources.

Let's start with the fact that an Internet troll is a real person (an annoying and often unpleasant person, but a person nonetheless). As long as that person does not violate the rules of a given site, there is little we can do other than ignore them.

Bots, on the other hand, are artificial creations programmed for a specific purpose. That purpose may be to boost posts by specific people or posts with a specific slant, such as a political one. Bots can also be used to advertise products intrusively, or to spread fake news and propaganda on a given topic, which is particularly dangerous.

So how do you recognize a bot? It is quite complicated, but it is possible to distinguish several features that, taken together, make a user suspicious enough to be considered a fake account and, for example, blocked.

The first is anonymity combined with an attempt to appear as "human" as possible. Bot accounts will therefore often use photos stolen from other users or bought from stock libraries, so we may see multiple accounts with the same photo. On the other hand, bots often have no photo at all. That alone is not enough of a clue: after all, many of us also prefer not to show our photos on the Internet, where nothing is ever lost and our photos become the property of, for example, Facebook.

The need for anonymity also often results in strange account names containing incomprehensible strings of characters or digits. Again, this is not a clear enough indicator to remove such an account from the portal, because some Twitter users, for example, never change the name assigned to them when the account was created.

Another key factor to analyze is the activity of the suspicious account. How many posts per day does it generate? How long has the account existed and what is its average activity? If an account was created yesterday and has already responded to thousands of tweets, this is quite suspicious. The same goes for alternating periods of silence and tremendous activity: someone who has not written a single tweet for months is unlikely to suddenly write thousands of them. Then there are accounts that generate posts every second, which is impressive, but is it really that "human"?
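
The activity questions above translate fairly directly into simple features. Below is a minimal sketch, assuming we know the account creation time (which is not among the columns of our dataset) and the timestamps of the account's tweets as `datetime` objects. The function name, the feature names and the example values are made up.

```python
from collections import Counter
from datetime import datetime

def activity_features(account_created, tweet_times):
    """Simple activity statistics for one account; inputs are datetime objects."""
    if not tweet_times:
        return {"tweets_per_day": 0.0, "max_tweets_in_one_minute": 0, "longest_gap_days": 0.0}
    times = sorted(tweet_times)
    # Account age in days, with a floor of one day to avoid dividing by ~0.
    age_days = max((times[-1] - account_created).total_seconds() / 86400, 1.0)
    per_minute = Counter(t.replace(second=0, microsecond=0) for t in times)
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return {
        "tweets_per_day": len(times) / age_days,
        "max_tweets_in_one_minute": max(per_minute.values()),
        "longest_gap_days": max(gaps, default=0.0),
    }

created = datetime(2023, 1, 1)
times = [datetime(2023, 1, 1, 12, 0, s) for s in (1, 2, 3, 4)] + [datetime(2023, 1, 11, 12, 0, 0)]
features = activity_features(created, times)
print(features)
```

None of these numbers proves anything on its own; they would only be inputs to a rule set or a classifier that looks at several signals together.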

Yet another key indicator is amplification. One of the main roles of bots is to amplify the signal from other users by retweeting, liking or quoting them. So if a user constantly retweets specific posts without a word of comment, or, what's more, an organized group of such accounts spreads specific information, we can suspect that these are fake accounts.
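
A first, very rough amplification signal is the share of an account's tweets that are plain retweets. The sketch below assumes that retweets can be recognized by the conventional `RT @` prefix in the tweet text; the function names, the minimum number of tweets and the threshold are hypothetical values chosen only for illustration.

```python
def retweet_share(texts):
    """Fraction of an account's tweets that are plain retweets (text starts with 'RT @')."""
    if not texts:
        return 0.0
    retweets = sum(1 for text in texts if text.startswith("RT @"))
    return retweets / len(texts)

def flag_amplifiers(tweets_by_user, min_tweets=3, threshold=0.9):
    """Return users with enough tweets whose retweet share reaches the threshold."""
    return sorted(user for user, texts in tweets_by_user.items()
                  if len(texts) >= min_tweets and retweet_share(texts) >= threshold)

example = {
    "user_a": ["RT @x: węgiel", "RT @y: ceny", "RT @x: gaz"],
    "user_b": ["RT @x: węgiel", "Moim zdaniem ceny spadną", "Dobre pytanie"],
}
flagged = flag_amplifiers(example)
print(flagged)
```

Many genuine users mostly retweet as well, so a flag like this would only mark accounts for closer inspection, ideally together with the activity features discussed earlier.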

In conclusion, work on a system for detecting fake accounts and bots has to start with thinking about their characteristics. This is a very complex and wide-ranging topic, but such a system can certainly be developed by the relevant specialists.