<h1>Twitter Topic Emotion Analysis - Part 1</h1>
<h2><i>Topic Modeling</i></h2>

In [1]:
### Imports ###
import pandas as pd
from matplotlib import style
from src.TextNormalizer import TextNormalizer
style.use('ggplot')
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aklei\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aklei\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv('../data/twitter/tweets_isTweet_emotions.csv')
df['combined_text'] = df['tweet_text'].fillna('') + '' + df['quoted_tweet_text'].fillna('')
df['combined_text'] = df["combined_text"].apply(TextNormalizer.remove_noise)
df = (df[df['combined_text'] != ''])
df['combined_text'].head()

2    big deallast week treasury went live first aut...
4    whoathe invisible puppet masters ais disturbin...
5    next week grok 35 early beta release supergrok...
6    existential crisisa friendly reminder make bab...
7                                     knock knock doge
Name: combined_text, dtype: object

In [3]:
### Pre process text (embeddings) ###
tweets = df['combined_text'].values.tolist()
print(f"[Info] Embedding {len(tweets)} tweets ...")

# 1. Embedding-Modell (vorher berechnen oder cachen)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(tweets, show_progress_bar=True)

[Info] Embedding 10268 tweets ...


Batches:   0%|          | 0/321 [00:00<?, ?it/s]

In [6]:
### Fit BERTopic model and print topic info ###
# 2. UMAP (Reduktion für semantisch klarere Cluster)
umap_model = UMAP(n_neighbors=50, n_components=20, min_dist=0.05, metric="cosine", random_state=42)

# 3. HDBSCAN (Cluster-Zahl steuern)
hdbscan_model = HDBSCAN(min_cluster_size=20, cluster_selection_epsilon=0.3, metric="euclidean", cluster_selection_method="eom", prediction_data=True)

# 4. CountVectorizer
vectorizer_model = CountVectorizer(min_df=2, stop_words="english")

# 5. Repräsentation (optional, für bessere Labels)
representation_model = KeyBERTInspired()

# 6.1 Tesla-related seed_words (use to find other topics and populate seed_topic_list)
seed_words = ["tesla", "elon musk", "autopilot", "cybertruck", "model3", "gigafactory", "electric vehicle", "supercharger", "amp"]

# 6.2 Tesla-related seed_topic_list (populated by finding broad topics with seed_words and wide clustering, now narrowing it down for accuracy)
# --> Should not choose to many words as seed_list can become blurry
seed_topic_list = [
    ["tesla", "elon musk", "autopilot", "cybertruck", "model3", "gigafactory", "electric vehicle", "supercharger"],
    ["president", "trump", "government", "election", "republican", "democrat", "vote", "ballot"],
    ["judge", "activist", "illegal"],
    ["doge", "dogefather"],
    ["spacex", "launch", "falcon", "orbit", "mars"],
    ["bitcoin", "dogecoin"],
    ["starlink", "broadband", "highspeed"],
    ["fertility", "birthrate", "population", "births", "demographic"],
    ["twitter", "tweet", "ban", "free speech", "grok", "grokai"],
    ["crypto", "bitcoin", "dogecoin"],
    ["white", "farmers", "south africa", "field", "genocide"],
    ["afd", "german", "coalition", "berlin"]
]

# 7. Topic-Modell
topic_model = BERTopic(
    embedding_model=embedder,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    calculate_probabilities=True,
    #seed_topic_list=seed_words, # used to populate seed_topic_list
    seed_topic_list=seed_topic_list,
    nr_topics="auto", # Automatically generate topic count
    verbose=True
)

# 8. Fitting
topics, probs = topic_model.fit_transform(tweets, embeddings)

# 9. Reduce Outliers
#new_topics = topic_model.reduce_outliers(tweets, topics, strategy="embeddings") # Method to reduce outliers
new_topics = topic_model.reduce_outliers(tweets, topics, probabilities=probs, strategy="probabilities") # Method to reduce outliers
topic_model.update_topics(tweets, topics=new_topics)

# 9. Show Topic Info
topic_model.get_topic_info()

2025-07-14 21:51:08,707 - BERTopic - Guided - Find embeddings highly related to seeded topics.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-07-14 21:51:08,920 - BERTopic - Guided - Completed ✓
2025-07-14 21:51:08,922 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-07-14 21:51:43,380 - BERTopic - Dimensionality - Completed ✓
2025-07-14 21:51:43,383 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-07-14 21:51:47,122 - BERTopic - Cluster - Completed ✓
2025-07-14 21:51:47,124 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-07-14 21:51:47,598 - BERTopic - Representation - Completed ✓
2025-07-14 21:51:47,599 - BERTopic - Topic reduction - Reducing number of topics
2025-07-14 21:51:47,628 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-07-14 21:51:53,116 - BERTopic - Representation - Completed ✓
2025-07-14 21:51:53,120 - BERTopic - Topic reduction - Reduced number of topics from 35 to 35


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1515,0_media_news_legacy_trump,"[media, news, legacy, trump, people, elon, mus...","[media, media, media legacy media slow amp bia..."
1,1,1061,1_government_doge_spending_federal,"[government, doge, spending, federal, money, b...","[yesshrink government, also government, yeahel..."
2,2,802,2_vote_illegal_border_election,"[vote, illegal, border, election, id, voter, i...",[people america realize vote rendered meaningl...
3,3,829,3_grok_ai_xai_openai,"[grok, ai, xai, openai, neuralink, image, try,...","[grok 3, grok 3面白いし頭もいいから使ってみて, grok 3]"
4,4,435,4_spacex_starship_launch_falcon,"[spacex, starship, launch, falcon, flight, orb...","[another falcon launch, congratulations spacex..."
5,5,390,5_tesla_model_cybertruck_car,"[tesla, model, cybertruck, car, fsd, vehicle, ...",[yupbuilt bolted tesla truly americanmade tesl...
6,6,401,6_rape_uk_children_gangs,"[rape, uk, children, gangs, inquiry, girls, br...",[investigate worse gets beyond think possiblei...
7,7,353,7_wow_interesting_yup_day,"[wow, interesting, yup, day, lol, accurate, yi...","[wow, wow, wow]"
8,8,337,8_starlink_internet_highspeed_available,"[starlink, internet, highspeed, available, sat...",[starlink approved zambiastarlinks highspeed i...
9,9,330,9_speech_censorship_free_freedom,"[speech, censorship, free, freedom, people, eu...",[yeselon musk think free speech existential un...


In [38]:
### Create Dataframe from Topic List and Filter Tesla Topic ###
# 1. Get topic info
topic_info = topic_model.get_topic_info()

# 2. Create dataframe
df_topics = pd.DataFrame(topic_info)[['Topic', 'Representation']]

# 3. Get topic ids with tesla related topics
tesla_key = "tesla"
tesla_topic_ids = []

for _, row in df_topics.iterrows():
    if tesla_key in row['Representation']:
        tesla_topic_ids.append(row['Topic'])

print(f"Tesla-Topic-IDs: {tesla_topic_ids}")

# 4. Filter df by documents
doc_info = topic_model.get_document_info(tweets)
df['topics'] = doc_info['Topic']
df_tesla = df[df['topics'].isin(tesla_topic_ids)]

# 5. Drop unnecessary columns and safe to csv
df_tesla = df_tesla[['tweet_id', 'createdAt', 'topics', 'combined_text', 'tweet_text_dominant_emotion', 'quoted_tweet_id', 'quoted_tweet_text_dominant_emotion']]
df_tesla.to_csv('../data/twitter/tweets_isTweet_emotions_tesla.csv', index=False)
print(f"Found Tesla-Tweets: {len(df_tesla)}")

Tesla-Topic-IDs: [5, 24, 30]
Found Tesla-Tweets: 576


<h1>Twitter Topic Emotion Analysis - Part 1</h1>
<h2><i>Event Study</i></h2>
<p>
    In this section, the two event studies of the emotion and topic data will be combined in order to examine, if we can observe an effect of the topic data with certain emotions on the stock price / trading volume on the NYSE and Xetra.
</p>

In [40]:
### Read necessary dataframes, set index, and convert timezones ###
df_tesla = pd.read_csv('../data/twitter/tweets_isTweet_emotions_tesla.csv')
df_nyse = pd.read_csv('../data/stocks/tsla_intraday_202305_202504-1m.csv')
df_xetra = pd.read_csv('../data/stocks/TSL0_intraday_230501_250501-1m.csv')