
# Using BERTopic to identify themes in recent Tweets

This repo and notebook explore whether there are (semi-)automatable ways to measure and illustrate changes in public attitude on social media, using a controversial tweet by one of the most famous people on Earth as an example.

In this notebook I load the tweets I retrieved in mid-October, using the node of @elonmusk as a starting point. I collected over 200,000 Tweets, either by Elon or mentioning him directly. In October he tweeted a very controversial poll proposing how, in his view, peace could best be achieved in the Russo-Ukrainian War. This seemed to spark a backlash on Twitter, yet plenty of people voted positively on the poll or liked it, so it is very hard to gauge public sentiment around it from the available metadata alone.
Therefore, we will try to assess whether there are any topics within the data and, later on in a different notebook, gauge whether overall Twitter sentiment has changed as a result.

In [None]:
# !pip install bertopic
# !pip install --upgrade joblib==1.1.0

Successfully installed bertopic-0.12.0 hdbscan-0.8.28 huggingface-hub-0.10.1 pynndescent-0.5.7 pyyaml-5.4.1 sentence-transformers-2.2.2 sentencepiece-0.1.97 tokenizers-0.13.1 transformers-4.23.1 umap-learn-0.5.3
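
The install log above lists specific package versions. A minimal sketch of how one might check that the running environment matches them; the `EXPECTED` dictionary copies a few of the versions from the log line, and the helper names are hypothetical, not part of any library:

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the install log above (a subset, for illustration).
EXPECTED = {
    "bertopic": "0.12.0",
    "hdbscan": "0.8.28",
    "umap-learn": "0.5.3",
    "sentence-transformers": "2.2.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(expected):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in find_mismatches(EXPECTED).items():
    print(f"{package}: expected {wanted}, found {found}")
```

If nothing is printed, the listed packages match the versions this notebook was run with.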


In [None]:
from bertopic import BERTopic
import pandas as pd
from umap import UMAP

In [None]:
len(docs)  # docs is defined in the loading cell below, so run that cell first

209668

In [None]:
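%%time
# %%time runs the cell once and keeps its variables;
# %%timeit would re-run it several times and discard them
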
filepath = 'data/text_for_topics.csv'
docs = pd.read_csv(filepath, index_col=0)['clean_tweet_text'].to_list()

from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")



# Encode every tweet into a sentence embedding;
# the embeddings are passed to BERTopic further down

embeddings = sentence_model.encode(docs, show_progress_bar=True)

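import pickle

# Keep the tweet IDs next to the embeddings so each vector can be traced back to its tweet
tweet_ids = pd.read_csv(filepath, index_col=0)['tweet_id'].to_list()

# Cache the embeddings on disk so they do not have to be recomputed
with open("data/sentence_embeddings.pkl", "wb") as f_out:
    pickle.dump({'tweet_id': tweet_ids, 'embeddings': embeddings}, f_out)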

RD_STATE = 12345  # fixed seed so the UMAP projection is reproducible
umap_model = UMAP(n_neighbors=20, n_components=5, min_dist=0.0,
                  metric='cosine', low_memory=True, random_state=RD_STATE)


topic_model = BERTopic(
    language="multilingual",
    umap_model=umap_model,
    min_topic_size=10,
    nr_topics='auto',
    low_memory=True,
    # also compute each document's probability for every topic, not just its assigned one
    calculate_probabilities=True,
)

topics, probs = topic_model.fit_transform(docs, embeddings)

topic_model.save('topic_model_22_10_13')


viz_tops = topic_model.visualize_topics()
viz_tops.write_html("fig/viz_topics_22_10_13.html")


topic_model.get_topic_info()

Batches:   0%|          | 0/6553 [00:00<?, ?it/s]
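
The progress bar total can be sanity-checked against the corpus size. A small worked calculation, assuming `encode` was left at its default `batch_size` of 32 (the notebook does not set it explicitly):

```python
import math

n_docs = 209_668   # len(docs) reported earlier in the notebook
batch_size = 32    # assumed: the default batch_size of SentenceTransformer.encode

# The last batch is only partly filled, hence the ceiling.
n_batches = math.ceil(n_docs / batch_size)
print(n_batches)  # 6553
```

This matches the 6553 batches shown in the progress bar, which suggests the whole corpus was encoded in one pass.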

In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 8634 | `-1_the_of_to_and` |
| 1 | 0 | 1017 | `0_牛b_omg_yessssssss_beauti` |
| 2 | 1 | 748 | `1_ukraine_russia_putin_russian` |
| 3 | 2 | 508 | `2_elon_elons_his_he` |
| 4 | 3 | 373 | `3_starship_spacex_falcon_launch` |
| ... | ... | ... | ... |
| 268 | 267 | 11 | `267_scale_master_18x6_20x4` |
| 269 | 268 | 10 | `268_bully_bullies_warmonger_sheepzero` |
| 270 | 269 | 10 | `269_shes_legs_her_girl` |
| 271 | 270 | 10 | `270_employees_qualifications_resignation_under...` |
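
Each `Name` in the output above packs the topic ID and its top terms into one underscore-separated string, with topic `-1` holding the outliers. A minimal sketch for pulling these apart when skimming topics; `split_topic_name` is a hypothetical helper, not a BERTopic function, and it assumes that no individual term contains an underscore:

```python
def split_topic_name(name):
    """Split a name such as '1_ukraine_russia' into (1, ['ukraine', 'russia'])."""
    topic_id, *keywords = name.split("_")
    return int(topic_id), keywords


# A few names copied from the topic table above.
names = [
    "-1_the_of_to_and",
    "1_ukraine_russia_putin_russian",
    "3_starship_spacex_falcon_launch",
]

for name in names:
    topic_id, keywords = split_topic_name(name)
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {', '.join(keywords)}")
```

Names that the display has truncated with `...` will carry that truncation into their last keyword.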


In [None]:
viz_tops = topic_model.visualize_topics()


In [None]:
viz_tops.write_html("fig/viz_topics_22_10_13_1st.html")


In [None]:
topic_model.save('topic_model_22_10_13_1st')


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
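
Besides the saved model, the main cell also pickles the sentence embeddings, which are the most expensive thing to recompute. A minimal sketch of a load-or-compute pattern that a later session could use to avoid re-encoding; `load_or_compute` is a hypothetical helper, and pickle files should only be loaded when they come from a trusted source:

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled object at cache_path, creating it with compute() if missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f_in:
            return pickle.load(f_in)
    result = compute()
    # Create the target folder if needed, then write the cache for next time.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f_out:
        pickle.dump(result, f_out)
    return result


# Hypothetical use in this notebook, reusing the names from the main cell:
# cached = load_or_compute(
#     "data/sentence_embeddings.pkl",
#     lambda: {"tweet_id": tweet_ids,
#              "embeddings": sentence_model.encode(docs, show_progress_bar=True)},
# )
# embeddings = cached["embeddings"]
```

The cache is keyed only by file path, so it has to be deleted by hand whenever the input tweets or the embedding model change.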

