## <b>Topic extraction</b>

<i>The code below extracts topics from the tweets. It uses a sentence transformer fine-tuned on multilingual data. To use a different sentence transformer, replace the model name with another one from <a href="https://www.sbert.net/" target="_blank">sentence transformers</a>. It is also possible to use a transformer trained on words rather than sentences (<a href="https://huggingface.co/models" target="_blank">huggingface transformers</a>). The transformer converts each text into a vector in a high-dimensional space, so that texts can be compared numerically.

The script then uses BERTopic to cluster the tweets automatically (HDBSCAN forms the clusters, and c-TF-IDF extracts the keywords that describe each cluster).</i>
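
To make the c-TF-IDF step more concrete, the sketch below scores words per topic on a toy example. It is a simplified illustration, not BERTopic's own implementation: the function name `class_tfidf`, the whitespace tokenization and the toy documents are assumptions made for this example. Each word's frequency within a topic is multiplied by log(1 + A / f), where A is the average number of words per topic and f is the frequency of the word across all topics.

```python
import math
from collections import Counter

def class_tfidf(docs_per_topic):
    """Score each word per topic with a c-TF-IDF style weight.

    docs_per_topic maps a topic id to the list of documents in that topic.
    (Hypothetical helper for illustration only.)
    """
    # Treat all documents of one topic as a single bag of words
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each word across all topics
    total_term_freq = Counter()
    for counts in term_counts.values():
        total_term_freq.update(counts)
    # Average number of words per topic
    avg_words = sum(total_term_freq.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_term_freq[term])
            for term, freq in counts.items()
        }
    return scores

# Toy documents, already grouped by topic
toy_topics = {
    0: ["lockdown rules extended", "new lockdown rules"],
    1: ["mask rules in shops", "wear a mask"],
}
scores = class_tfidf(toy_topics)
# Highest-scoring word per topic
top_terms = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_terms)
```

In this toy example, "rules" occurs in both topics and therefore scores lower than a word that is frequent in only one topic.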

In [None]:
#Import needed libraries:
    #pandas for reading the data
    #sentence_transformers for using a sentence transformer
    #bertopic for extracting topics
    #umap for reducing the dimensionality of the embeddings
import pandas as pd
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
import plotly.express as px
from umap import UMAP

#Read the csv file into a pandas dataframe
tweets_germany = pd.read_csv('Data\\filtered_tweets_germany.csv')

#Set embedding model to transformer you want to use
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')

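#Set UMAP with a fixed random state to make sure results are reproducible
    #Note: passing a custom UMAP replaces BERTopic's default UMAP settings with UMAP's own defaults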
umap_model = UMAP(random_state=42)

#Set BERTopic model
    #embedding_model is your sentence transformer
    #min_topic_size: minimum number of tweets needed to form a topic
    #verbose=True shows progress messages while the model is running
    #umap_model is the dimensionality reduction model defined above
topic_model = BERTopic(embedding_model=model, min_topic_size=70, verbose=True, umap_model=umap_model)

#Transform text into embeddings and create clusters
    #Between the parentheses, pass the text to create embeddings and clusters on
topics, probs = topic_model.fit_transform(tweets_germany['text_clean'])

#Add the created topics to the dataframe
tweets_germany['Topic'] = topics

#Count the number of tweets per topic (topic -1 contains the outliers)
print(tweets_germany['Topic'].value_counts())

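#Then, a time analysis can be performed on the topics
    #Keyword arguments are used because the order of the arguments differs between BERTopic versions
topics_over_time = topic_model.topics_over_time(docs=tweets_germany['text_clean'], topics=topics, timestamps=tweets_germany['Date'], nr_bins=50)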

#Visualize the time analysis in a graph
topicovertime = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

#Save dynamic figure as html file
topicovertime.write_html("Data\\topic70overtime.html")

#Save model
topic_model.save("time_bertopic70")

#Save dataframe as csv
tweets_germany.to_csv('Data\\topics70_tweets_germany.csv', index=False, header=True)
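
Before interpreting the topics, it can help to check how many tweets ended up in each topic and how many were left unassigned. BERTopic labels outliers with topic -1. The sketch below is a minimal example on a made-up list of topic ids; the helper name `summarize_topics` and the example values are assumptions for illustration, and in the script above the list `topics` could be passed instead.

```python
from collections import Counter

def summarize_topics(topics):
    """Return the number of documents per topic and the share of outliers.

    Hypothetical helper: topics is a list of topic ids, where -1 marks outliers.
    """
    sizes = Counter(topics)
    outlier_share = sizes.get(-1, 0) / len(topics) if topics else 0.0
    return sizes, outlier_share

# Made-up topic assignments for eight documents
example_topics = [0, 0, 1, -1, 2, 0, -1, 1]
sizes, outlier_share = summarize_topics(example_topics)
print(sizes.most_common())
print(f"{outlier_share:.0%} of documents are outliers")
```

A large share of outliers may be a reason to experiment with a different `min_topic_size`.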

The code below can be used to retrieve information about the created clusters. See the documentation on <a href="https://maartengr.github.io/BERTopic/index.html" target="_blank">BERTopic</a> for more visualizations.

In [2]:
from bertopic import BERTopic

In [None]:
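#Load the saved model
topic_model = BERTopic.load('time_bertopic70')

#Show an overview of all topics and their sizes
topic_model.get_topic_info()

#Show the keywords and scores of topic 10
topic_model.get_topic(10)

#Show representative tweets for topic 13
topic_model.get_representative_docs(13)

#Visualize the topics in an intertopic distance map
topic_model.visualize_topics()

#Visualize the top keywords per topic as bar charts
topic_model.visualize_barchart()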

##### Create wordcloud with all keywords of each topic

In [4]:
#Import necessary libraries
from wordcloud import WordCloud
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#Set image for the wordcloud
mask2 = np.array(Image.open("Data\\duitslandwordcloud.png"))

#Create function for the generation of the wordcloud
    #data is a dictionary containing words + scores
    #title is the name of the wordcloud
    #name is the path where the png file is saved

def generate_wordcloud(data, title, name):
    #mask2 gives the wordcloud its shape; the contour settings only take effect when a mask is used
    cloud = WordCloud(background_color='white', prefer_horizontal=1, mask=mask2, contour_color='black', contour_width=1).generate_from_frequencies(data)
    plt.imshow(cloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(title)
    plt.show()
    cloud.to_file(name)
    #Return the wordcloud so it can be assigned to a variable
    return cloud

Corona and policies wordcloud

In [None]:
#Merge the keywords and scores of the topics in this group into one dictionary
    #If a word occurs in several topics, the score of the last topic is kept
coronapolicies = {}
for topic_id in [0, 2, 4, 6, 13]:
    coronapolicies.update(dict(topic_model.get_topic(topic_id)))

coronapolicieswordcloud = generate_wordcloud(coronapolicies, 'Corona and policies', 'Wordclouds\\coronapolicieswordcloud.png')
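
When the keyword dictionaries of several topics are merged, a word that occurs in more than one topic keeps only the score of the topic merged last. If that is not desired, one alternative is to keep the highest score per word. The sketch below assumes each topic is a list of (word, score) pairs, which is the shape `get_topic` returns; the helper name `merge_topic_keywords` and the example words and scores are made up for illustration.

```python
def merge_topic_keywords(topic_keyword_lists):
    """Merge several lists of (word, score) pairs into one dictionary.

    Hypothetical helper: for a word that occurs in several topics,
    the highest score is kept.
    """
    merged = {}
    for keywords in topic_keyword_lists:
        for word, score in keywords:
            merged[word] = max(score, merged.get(word, 0.0))
    return merged

# Made-up keywords and scores for two topics
topic_a = [("lockdown", 0.05), ("regeln", 0.03)]
topic_b = [("regeln", 0.01), ("maske", 0.04)]
merged = merge_topic_keywords([topic_a, topic_b])
print(merged)
```

With a fitted model, the same helper could be called as `merge_topic_keywords([topic_model.get_topic(i) for i in (0, 2, 4, 6, 13)])`, and the result passed to `generate_wordcloud`.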

Lockdown activities

In [None]:
#Merge the keywords and scores of the topics in this group into one dictionary
    #If a word occurs in several topics, the score of the last topic is kept
lockdownactivities = {}
for topic_id in [3, 7, 8, 9, 11, 12]:
    lockdownactivities.update(dict(topic_model.get_topic(topic_id)))

lockdownactivitieswordcloud = generate_wordcloud(lockdownactivities, 'Lockdown activities', 'Wordclouds\\lockdownactivitieswordcloud.png')

Prevention

In [None]:
#Merge the keywords and scores of the topics in this group into one dictionary
    #If a word occurs in several topics, the score of the last topic is kept
prevention = {}
for topic_id in [1, 5, 10, 14, 15]:
    prevention.update(dict(topic_model.get_topic(topic_id)))

preventionwordcloud = generate_wordcloud(prevention, 'Prevention', 'Wordclouds\\preventionwordcloud.png')