<a href="https://colab.research.google.com/github/Ioana-P/IoanaFio/blob/main/content/project/twitter_sentiment_tracking/Modelling_w_BERTopic_GColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using BERTopic to identify themes in recent Tweets

The purpose of the repo and of this notebook is to see if there are (semi-)automatable ways of measuring and illustrating changes in public attitude via social media, using a controversial tweet by one of the most famous people on Earth as an example. 

In this notebook I load the tweets I retrieved in mid-October, using the @elonmusk node as a starting point. Topics are inferred from a subset of tweets either by Musk or mentioning him, from September 2022 onwards (n = 12,037).
In October he tweeted a very controversial poll proposing what was, in his view, the best way to achieve peace in the Russo-Ukrainian War. This seemed to spark a backlash on Twitter; however, plenty of people voted positively on the poll or liked it, so it is very hard to gauge public sentiment around it from the available metadata alone.
Therefore, we will try to assess whether there are any distinct topics within the data and, later on in a different notebook, gauge whether overall Twitter sentiment has changed as a result.

In [None]:
# necessary installs
!pip install bertopic
# owing to an error that comes up on import of BERTopic, it's necessary to 
# downgrade joblib
!pip install --upgrade joblib==1.1.0

In [3]:
# NOTE: for some reason this cell returns an error (due to the joblib install)
# but the error goes away if you run it a second time.

from bertopic import BERTopic
import pandas as pd
from umap import UMAP
from sentence_transformers import SentenceTransformer
import numpy as np

In [4]:
# load the cleaned tweet texts (the docs to be modelled)
print('Loading up docs')
filepath = 'text_for_topics_post_Aug22.csv'
df = pd.read_csv(filepath, index_col=0)

Loading up docs


In [5]:

docs = df['clean_tweet_text'].to_list()


In [6]:
# %%timeit
# with open("data/sentence_embeddings_1st_batch2.pkl", "rb") as f:
#     wv = pickle.load(f)

###
RD_STATE = 12345
print("Initialising UMAP")
umap_model = UMAP(n_neighbors=20, n_components=5,
                  low_memory=True,
                  min_dist=0.0, metric='cosine', random_state=RD_STATE)

print("Instantiating BERTopic model")

topic_model = BERTopic(
    # there are some non-English tweets (very few)
    language="multilingual",
    umap_model=umap_model,
    min_topic_size=10,
    # automatically detect the number of topics; we'll reduce them later
    nr_topics='auto',
    low_memory=False,
    # you can set calculate_probabilities to False if you don't have
    # sufficient compute
    calculate_probabilities=True,
)

print("Fit_transforming BERTopic model")

topics, probs = topic_model.fit_transform(docs)

print("Saving visuals")
viz_tops = topic_model.visualize_topics()
viz_tops.write_html("viz_topics_22_10_14.html")


print('Outputting topic info')
topic_model.get_topic_info()


Initialising UMAP
Instantiating BERTopic model
Fit_transforming BERTopic model


Saving BERTopic model
Saving visuals
Outputting topic info


Unnamed: 0,Topic,Count,Name
0,-1,5015,-1_the_to_it_and
1,0,4405,0_to_the_tesla_is
2,1,333,1_russia_ukraine_putin_russian
3,2,122,2_update_release_version_branch
4,3,113,3_starlink_internet_remote_access
...,...,...,...
62,61,13,61_water_fountain_evaporation_feature
63,62,12,62_bird_seagulls_alien_app
64,63,12,63_hurricane_florida_carve_orlando
65,64,11,64_train_trains_bostonnyc_track
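
In BERTopic, topic `-1` collects the documents that the clustering step left unassigned (outliers). The following is a minimal sketch of how one might quantify that share from the `topics` list returned by `fit_transform`. The helper name `outlier_share` and the toy list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def outlier_share(topics, outlier_label=-1):
    """Return the fraction of documents assigned to the outlier topic."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[outlier_label] / len(topics)


# Toy list standing in for the `topics` output of fit_transform (hypothetical values).
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
print(f"toy outlier share: {outlier_share(toy_topics):.1%}")

# The same arithmetic with the counts from the table above:
# 5,015 outlier tweets out of 12,037 tweets in total.
print(f"outlier share in this run: {5015 / 12037:.1%}")
```

With the counts shown above, roughly 42% of the tweets end up in the outlier topic, which is worth keeping in mind when interpreting the remaining topics.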


In [12]:
## Saving the model using joblib and/or pickle
import joblib
with open('topic_model_22_10_14_3.pkl', 'wb') as file:
  print("Saving BERTopic model")
  joblib.dump(topic_model, file, protocol=4)

In [14]:
#testing if the save has worked and reloading
model_reloaded = joblib.load('topic_model_22_10_14_3.pkl')

In [16]:
model_reloaded.embedding_model

<bertopic.backend._sentencetransformers.SentenceTransformerBackend at 0x7fadea395650>

In [15]:
model_reloaded.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,5015,-1_the_to_it_and
1,0,4405,0_to_the_tesla_is
2,1,333,1_russia_ukraine_putin_russian
3,2,122,2_update_release_version_branch
4,3,113,3_starlink_internet_remote_access
...,...,...,...
62,61,13,61_water_fountain_evaporation_feature
63,62,12,62_bird_seagulls_alien_app
64,63,12,63_hurricane_florida_carve_orlando
65,64,11,64_train_trains_bostonnyc_track


In [19]:
dates = pd.read_csv('datetime_for_topics_post_Aug22.csv', index_col='tweet_id')
dates['datetime'] = pd.to_datetime(dates['datetime'])

In [33]:
topics_over_time = topic_model.topics_over_time(docs, dates['datetime'].to_list())

topic_model.visualize_topics_over_time(topics_over_time)

In [10]:
topic_model.visualize_barchart()


In [27]:
# hierarchical_topics is a method of the fitted model, not a standalone function
topic_model.visualize_hierarchy(hierarchical_topics=topic_model.hierarchical_topics(docs))


100%|██████████| 65/65 [00:00<00:00, 114.36it/s]


In [11]:
topic_model.visualize_heatmap()

In [21]:
# time to reduce some topics via merging

topics_to_merge = [[35, 36, 30, 1, 10], # Russia & Ukraine, nuclear war
                   [45, 52, 12, 79], # tesla cars, self-driving and batteries
                   [2, 89], # twitter bots
                   [9, 29, 47], #starlink
                   [11, 51, 77], # robots and AI, neuralink
                   [13, 19, 53 ] #spacex
                  ]


topic_model.merge_topics(docs, topics_to_merge)
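
Topic ids are specific to one fitted model, so a hand-written merge list can go stale if the model is refitted. A small sanity check before merging could look like the sketch below. The helper `check_merge_groups` and the example ids are hypothetical and only illustrate the idea.

```python
def check_merge_groups(groups, known_topics):
    """Report topic ids that are unknown to the model or listed more than once."""
    known = set(known_topics)
    seen = set()
    unknown, duplicated = [], []
    for group in groups:
        for topic_id in group:
            if topic_id not in known:
                unknown.append(topic_id)
            if topic_id in seen:
                duplicated.append(topic_id)
            seen.add(topic_id)
    return sorted(unknown), sorted(duplicated)


# Hypothetical model with topics 0..9 plus the outlier topic -1.
known_topics = range(-1, 10)
groups = [[1, 3], [2, 12], [3, 4]]

unknown, duplicated = check_merge_groups(groups, known_topics)
print("unknown ids:", unknown)
print("duplicated ids:", duplicated)
```

In the notebook, the known ids could be taken from the `Topic` column of `topic_model.get_topic_info()`.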

In [25]:
topic_model.set_topic_labels({
    1:'Russia_Ukraine_and_war',
    2:'Twitter_bots',
    5:'Starlink_and_satellites',
    6:'SpaceX',
    11:'Robots_and_AI',
  })

In [26]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName
0,-1,5041,-1_the_to_it_of,-1_the_to_it_of
1,0,2591,0_elon_tesla_he_to,0_elon_tesla_he_to
2,1,573,1_russia_war_ukraine_putin,Russia_Ukraine_and_war
3,2,266,2_bot_bots_tesla_twitter,Twitter_bots
4,3,222,3_twitter_tweets_tweet_my,3_twitter_tweets_tweet_my
...,...,...,...,...
74,73,12,73_hurricane_florida_orlando_carve,73_hurricane_florida_orlando_carve
75,74,11,74_speech_free_dictates_allowed,74_speech_free_dictates_allowed
76,75,11,75_ramp_pod_mount_acceleration,75_ramp_pod_mount_acceleration
77,76,10,76_deliveries_delivery_csection_hospital,76_deliveries_delivery_csection_hospital


In [29]:
#and now we visualise them again 
viz_tops = topic_model.visualize_topics()
viz_tops.write_html("viz_topics_22_10_14_redux.html")

In [33]:
bar_plot = topic_model.visualize_barchart([1,2,5,6,11])
bar_plot.write_html('viz_terms_topics_22_10_14_redux.html')

In [36]:
hmap = topic_model.visualize_heatmap()
hmap.write_html('hmap_22_10_14_redux.html')

In [43]:
#finally, let's get the topic probabilities, merge them with the text data and return the df
probs = pd.DataFrame(topic_model.probabilities_, index = df.tweet_id)
probs = probs.join(df.set_index('tweet_id'))
probs.head()

tweet_id,0,1,2,3,4,5,6,7,8,9,...,69,70,71,72,73,74,75,76,77,clean_tweet_text
1580168615357140992,1.0,8.298923e-307,7.61658e-307,2.0383e-307,3.2276130000000003e-307,4.680169e-307,5.4077620000000006e-307,3.2439990000000003e-307,7.597482e-307,2.394892e-307,...,4.358597e-307,1.904953e-307,2.424376e-307,1.8603629999999997e-307,4.161548e-307,2.612142e-307,2.279323e-307,2.66925e-307,2.154047e-307,My favorite least favorite is ONLY TWO IDEAS
1580013582778974208,0.265244,0.01166869,0.009759405,0.00271444,0.002978305,0.005367448,0.005956865,0.003633807,0.008282313,0.003077484,...,0.005023313,0.002180062,0.002966637,0.002100634,0.004724942,0.004118049,0.002455282,0.0031632,0.002342698,My first day back to twit after a bit youre g...
1579994233699565568,0.155546,0.01081946,0.01073696,0.00345832,0.003079629,0.004491499,0.004869585,0.003182556,0.005963418,0.002864103,...,0.002728176,0.001708569,0.002599438,0.00163145,0.002504219,0.004552303,0.001690918,0.002153339,0.001579499,Bremmers a straight shooter I dont see him inv...
1579976175732281344,0.178175,0.01168058,0.009382016,0.002652212,0.003456878,0.005746364,0.006351354,0.002821025,0.007522751,0.002832234,...,0.004476122,0.002157688,0.002740427,0.002010582,0.004281461,0.004008176,0.002214602,0.00253141,0.00202853,lollll ‘Musk by MuskIt was alway rite there
1579963541414903815,0.387081,0.02430777,0.01972092,0.005567648,0.004101697,0.008698562,0.009409069,0.005573041,0.01245387,0.006206985,...,0.006536267,0.003582153,0.005552579,0.00339881,0.005630716,0.01606638,0.003653734,0.004966527,0.003424375,Yup Maybe Bremmer betrayed a confidence
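
Each row above holds one probability per topic. Assuming column `i` corresponds to topic `i`, the most likely topic for a tweet is simply the position of the largest value. The sketch below shows that lookup on a plain list. The function name `most_likely_topic` and the `min_prob` cut-off are illustrative assumptions rather than part of BERTopic.

```python
def most_likely_topic(prob_row, min_prob=0.0):
    """Return (topic_index, probability) for the largest entry of a non-empty row.

    If the largest probability is below `min_prob`, the topic index is
    reported as -1 to signal "no confident assignment".
    """
    best_idx = max(range(len(prob_row)), key=lambda i: prob_row[i])
    best_prob = prob_row[best_idx]
    if best_prob < min_prob:
        return -1, best_prob
    return best_idx, best_prob


# First four probabilities of the second row shown above (truncated for brevity).
row = [0.265244, 0.01166869, 0.009759405, 0.00271444]

print(most_likely_topic(row))                # no cut-off
print(most_likely_topic(row, min_prob=0.5))  # hypothetical cut-off of 0.5
```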


In [44]:
probs.to_csv('text_topic_probs.csv')

In [64]:
topic_model.save('topic_model_22_10_14_redux')

## Sentiment classification

In this extra section, we also use a pretrained sentiment classification model to sort the tweets into two categories, depending on whether they seem to express a positive or a negative sentiment.
 

In [73]:
# load a pretrained sentiment-analysis pipeline from Hugging Face
from transformers import pipeline

generator = pipeline(task="sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


One severe drawback of this very quick approach is that we can only really infer the sentiment of tweets written in English. However, 11,914 of the 12,037 tweets in this subset (about 99%) are in English, so the loss of data is tiny. We'll discard the non-English tweets before producing any analysis of the sentiment.
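
As a rough sketch of that filtering step, assume each tweet record carries a language tag under a `lang` key, as tweets returned by the Twitter API do. Whether the CSV used here keeps that field is an assumption, and the helper `keep_english` and the sample records are purely illustrative.

```python
def keep_english(records, lang_key="lang"):
    """Keep only the records whose language tag is 'en'."""
    return [record for record in records if record.get(lang_key) == "en"]


# Hypothetical records; the real data may store the language differently.
tweets = [
    {"tweet_id": 1, "lang": "en", "clean_tweet_text": "yay have fun be safe"},
    {"tweet_id": 2, "lang": "de", "clean_tweet_text": "Sehr gut"},
    {"tweet_id": 3, "lang": "en", "clean_tweet_text": "Hope Connor is ok"},
]

english_tweets = keep_english(tweets)
share = len(english_tweets) / len(tweets)
print(f"kept {len(english_tweets)} of {len(tweets)} tweets ({share:.1%})")
```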

In [75]:
# a quick demonstration of how easy it is to use pre-trained pipelines from HuggingFace
print(docs[0])
print(generator(docs[0]))

print(docs[100])
print(generator(docs[100]))

print(docs[1000])
print(generator(docs[1000]))

print(docs[10001])
print(generator(docs[10001]))

My favorite least favorite is  ONLY TWO IDEAS
[{'label': 'NEGATIVE', 'score': 0.9990482926368713}]
Cool old tweet from  about Austin  the orbital launch complex “Starbase” back in 2013 🙌🚀♥️
[{'label': 'NEGATIVE', 'score': 0.7108004093170166}]
yay have fun be safe
[{'label': 'POSITIVE', 'score': 0.9993841648101807}]
Thank you Hes done SO MUCH GOOD for Ukraine by enabling Starlink  providing thousands of terminals As a fellow Aspie I get that hes hyperfocused on this topic right now but his poll and the tweets about bot attacks arent helping – in fact they enrage many Ukrainians
[{'label': 'POSITIVE', 'score': 0.9854568839073181}]
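
The second example above is labelled NEGATIVE with a score of only about 0.71, even though the tweet reads as enthusiastic. One simple way to surface such doubtful cases is to flag predictions below a confidence threshold. The sketch below assumes the pipeline outputs have already been unwrapped into plain dicts. The threshold of 0.9 and the helper name `flag_uncertain` are arbitrary illustrations, not values used in the notebook.

```python
def flag_uncertain(predictions, threshold=0.9):
    """Return the indices of predictions whose score is below the threshold."""
    return [i for i, pred in enumerate(predictions) if pred["score"] < threshold]


# Labels and scores from the four examples above (scores rounded to 4 decimals).
preds = [
    {"label": "NEGATIVE", "score": 0.9990},
    {"label": "NEGATIVE", "score": 0.7108},
    {"label": "POSITIVE", "score": 0.9994},
    {"label": "POSITIVE", "score": 0.9855},
]

uncertain = flag_uncertain(preds)
print("indices worth a manual look:", uncertain)
```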


In [78]:
sentiment_df = pd.DataFrame({'clean_tweet_text':docs}, index=df.tweet_id)
sentiment_df['Pred_sentiment'] = sentiment_df['clean_tweet_text'].apply(lambda x : generator(x))
sentiment_df.tail()

tweet_id,clean_tweet_text,Pred_sentiment
1566233616543125505,Yes also very important,"[{'label': 'POSITIVE', 'score': 0.999814689159..."
1566233000458592256,Accurate assessment Raptor design started out ...,"[{'label': 'NEGATIVE', 'score': 0.860174000263..."
1565441825376243713,Hope Connor is ok,"[{'label': 'POSITIVE', 'score': 0.999670863151..."
1565190122924015616,On a bot basis this deal is awesome,"[{'label': 'POSITIVE', 'score': 0.999826610088..."
1565189065158311937,Sure sounds higher than 5,"[{'label': 'POSITIVE', 'score': 0.999288380146..."


In [79]:
sentiment_df.to_csv('text_and_sentiment_preds.csv')

In [81]:
sentiment_df.Pred_sentiment.to_list()[0][0]

{'label': 'NEGATIVE', 'score': 0.9990482926368713}

In [83]:
sentiment_df['Pred_sentiment_out'] = [ x[0]['label'] for x in sentiment_df.Pred_sentiment.to_list() ] 
sentiment_df['Pred_sentiment_score'] = [ x[0]['score'] for x in sentiment_df.Pred_sentiment.to_list() ]

In [85]:
sentiment_df.drop(columns=['Pred_sentiment'],inplace=True)
sentiment_df.to_csv('text_and_sentiment_preds.csv')

In [86]:
sentiment_df.Pred_sentiment_out.value_counts()

NEGATIVE    7293
POSITIVE    4744
Name: Pred_sentiment_out, dtype: int64
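
To make these counts easier to compare, they can be turned into shares, and each label and score pair can be folded into a single signed number for later aggregation. The sketch below is one possible way to do this. The helpers `label_shares` and `signed_score` are hypothetical names and not part of the notebook.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def signed_score(label, score):
    """Map POSITIVE to +score and any other label to -score."""
    return score if label == "POSITIVE" else -score


# Rebuild a label list with the counts reported above.
labels = ["NEGATIVE"] * 7293 + ["POSITIVE"] * 4744
shares = label_shares(labels)
print({label: round(share, 3) for label, share in shares.items()})

# Signed version of the first prediction shown earlier.
print(signed_score("NEGATIVE", 0.9990482926368713))
```

On these counts, about 61% of the tweets are classified as negative and about 39% as positive.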