# Step 6: Tweet Assignment

**Carta et al. (2021), Step 6**

Enrich the news clusters with semantically correlated tweets in order to assess the social resonance of the detected events.

### Libraries

In [1]:
import pandas as pd
import spacy
import plotly.express as px
import os
import sys
from gensim.models import KeyedVectors


sys.path.append(os.path.abspath(os.path.join("..")))
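# Explicit imports instead of a wildcard, so the origin of each helper is visible
from src.tweet_assignment import (
    assign_tweets_to_events_by_period,
    filter_and_embed_tweets,
    get_event_tweets_summary,
    plot_tweet_assignment_bars,
    preprocess_tweets_spacy,
)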

### Data loading

In [17]:
# # Load cluster signatures
# df_sig = pd.read_csv('../data/for_models/final_event_signatures_SVB.csv')
# cluster_col = 'Unnamed: 0'

# Load cluster signatures
df_sig = pd.read_csv("../data/for_models/final_event_signatures_AI.csv")
cluster_col = "Unnamed: 0"

# Convert to a dictionary {cluster_id: numpy_vector}
final_event_signatures = {}
for _, row in df_sig.iterrows():
    cluster_id = int(row[cluster_col])
    # Take all remaining columns (0 to 299) and convert them to an array
    vector = row.drop(labels=[cluster_col]).values
    final_event_signatures[cluster_id] = vector

print(f"Signatures loaded for {len(final_event_signatures)} clusters.")

Signatures loaded for 6 clusters.
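
For readers without the signature file at hand, the dependency-free sketch below shows the shape of the dictionary built above. The CSV content and its 3-dimensional vectors are made up for illustration (the real file is assumed to hold 300 dimensions per cluster). Only the `Unnamed: 0` column name comes from the notebook.

```python
import csv
import io

# Toy stand-in for the signatures CSV (illustrative values only).
toy_csv = """Unnamed: 0,0,1,2
0,0.10,0.20,0.30
1,-0.40,0.50,0.00
"""

cluster_col = "Unnamed: 0"
toy_signatures = {}
for row in csv.DictReader(io.StringIO(toy_csv)):
    # The first column holds the cluster id.
    cluster_id = int(row.pop(cluster_col))
    # The remaining columns are the embedding dimensions, in file order.
    toy_signatures[cluster_id] = [float(value) for value in row.values()]

print(toy_signatures)
```

Each key is a cluster id and each value is that cluster's signature vector, which is the structure the assignment step expects.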


### Tweets preprocessing

In [3]:
# Load datasets
tweets = pd.read_csv("../data/processed/tweets_2023.csv")
tweets["date"] = pd.to_datetime(tweets["date"], errors="coerce")

In [4]:
# Count tweets per calendar day (rows with an unparseable date are dropped)
daily_counts = (
    tweets.groupby(tweets["date"].dt.date).size().reset_index(name="count")
)

# Daily tweet distribution
fig_dist = px.bar(
    daily_counts,
    x="date",
    y="count",
    title="<b>Daily Tweet Distribution (Financial Context)</b>",
    labels={"date": "Date", "count": "Number of Tweets"},
    template="plotly_white",
    color_discrete_sequence=["#1DA1F2"],  # Twitter blue
)
fig_dist.update_layout(
    xaxis_title="Date", yaxis_title="Tweet Count", hovermode="x unified"
)
fig_dist.show()
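
As a quick illustration of what the daily aggregation computes, here is a minimal standard-library sketch. The timestamps are invented for the example and do not come from `tweets_2023.csv`.

```python
from collections import Counter
from datetime import datetime

# Toy timestamps (illustrative only).
toy_timestamps = [
    "2023-09-05 14:30:00",
    "2023-09-05 18:02:11",
    "2023-09-06 09:00:00",
]

# Truncate each timestamp to its calendar day, then count occurrences.
daily = Counter(datetime.fromisoformat(ts).date().isoformat() for ts in toy_timestamps)

for day, count in sorted(daily.items()):
    print(day, count)
```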

### Tweets Embeddings Generation

In [5]:
nlp = spacy.blank("en")
tweets["cleaned_text"] = tweets["full_content"].apply(
    lambda x: preprocess_tweets_spacy(x, nlp)
)
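
The helper `preprocess_tweets_spacy` lives in `src/tweet_assignment.py` and is not shown in this notebook. As a rough idea of what such a cleaning step might do, here is a hypothetical regex-based sketch (`clean_tweet_sketch` is an invented name, and it uses no spaCy pipeline): it drops URLs and @mentions, lowercases the text, and keeps alphanumeric tokens. The real function may behave differently.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def clean_tweet_sketch(text: str) -> str:
    """Hypothetical cleaner: strip URLs and @mentions, lowercase, keep alphanumeric tokens."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    tokens = TOKEN_PATTERN.findall(text.lower())
    return " ".join(tokens)


cleaned = clean_tweet_sketch("$SPY daily RSI now oversold, says @trader https://t.co/abc")
print(cleaned)
```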

In [6]:
my_lexicon = [
    # --- Macroeconomics & Central Banks ---
    "inflation",
    "deflation",
    "stagflation",
    "cpi",
    "ppi",
    "gdp",
    "recession",
    "growth",
    "expansion",
    "unemployment",
    "employment",
    "payroll",
    "payrolls",
    "deficit",
    "surplus",
    "debt",
    "stimulus",
    "productivity",
    "spending",
    "consumer",
    "retail",
    "fed",
    "fomc",
    "ecb",
    "boj",
    "boe",
    "rates",
    "interest",
    "hiking",
    "tightening",
    "easing",
    "hawkish",
    "dovish",
    "quantitative",
    "tapering",
    "policy",
    "reserve",
    "federal",
    "monetary",
    "fiscal",
    "yield",
    "curve",
    "basis",
    "points",
    # --- Banking Sector & Crisis ---
    "bank",
    "banking",
    "deposit",
    "withdrawal",
    "solvency",
    "insolvency",
    "liquidity",
    "capital",
    "tier",
    "bailout",
    "default",
    "bankruptcy",
    "collapse",
    "failure",
    "contagion",
    "stress",
    "leverage",
    "credit",
    "lending",
    "loan",
    "panic",
    "run",
    "rescue",
    "fdic",
    "regulator",
    "stress test",
    # --- Markets & Sentiment ---
    "bullish",
    "bearish",
    "bull",
    "bear",
    "volatile",
    "volatility",
    "vix",
    "rally",
    "plunge",
    "correction",
    "crash",
    "momentum",
    "rebound",
    "slump",
    "surge",
    "sideways",
    "outlook",
    "market",
    "stocks",
    "equities",
    "shares",
    "index",
    "nasdaq",
    "dow",
    "sp500",
    "spy",
    "spx",
    "ndx",
    "derivative",
    "futures",
    "options",
    "swap",
    "etf",
    "commodity",
    "gold",
    "oil",
    "btc",
    "eth",
    # --- Corporate Finance & Earnings ---
    "earnings",
    "eps",
    "revenue",
    "ebitda",
    "profit",
    "loss",
    "margin",
    "guidance",
    "forecast",
    "dividend",
    "buyback",
    "ipo",
    "merger",
    "acquisition",
    "takeover",
    "restructuring",
    "layoff",
    "valuation",
    "quarterly",
    "outperform",
    "underperform",
    "upgrade",
    "downgrade",
    "security",
    "securities",
    # --- Social Media & Trading Slang (Enrichment) ---
    "ath",
    "atl",
    "fomo",
    "fud",
    "hodl",
    "btfd",
    "moon",
    "whale",
    "short",
    "long",
    "squeeze",
    "short squeeze",
    "bagholder",
    "pump",
    "dump",
    # --- Geopolitics, Regulation & Tech ---
    "brexit",
    "sanctions",
    "trade",
    "tariff",
    "war",
    "election",
    "regulation",
    "sec",
    "compliance",
    "antitrust",
    "lawsuit",
    "settlement",
    "fraud",
    "tech",
    "privacy",
    "data",
    "cybersecurity",
    "intellectual",
    "property",
    "patent",
]

In [7]:
# Loading the Dolma 2024 KeyedVectors
print("Loading Dolma 2024 Vectors...")
word_vectors = KeyedVectors.load_word2vec_format(
    "../models/dolma_300_2024_1.2M.100_combined.txt", binary=False, no_header=True
)

Loading Dolma 2024 Vectors...
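
The `binary=False, no_header=True` arguments indicate a plain-text vector file without the usual first line giving the vocabulary size and dimension. Assuming each line is a token followed by space-separated floats, the format can be illustrated with this small sketch. The two lines of toy data are invented, and the parsing here is only a stand-in for what gensim does internally.

```python
import io

# Toy file in header-less text format: "<token> <v1> <v2> ... <vN>" per line.
toy_file = io.StringIO("fed 0.1 0.2 0.3\nrates -0.4 0.5 0.0\n")

toy_vectors = {}
for line in toy_file:
    # First field is the token, the rest are the vector components.
    token, *values = line.rstrip().split(" ")
    toy_vectors[token] = [float(v) for v in values]

print(toy_vectors)
```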


In [8]:
# Apply the lexicon filter and compute the tweet embeddings
tweets_ready = filter_and_embed_tweets(
    df=tweets, text_col="cleaned_text", lexicon=my_lexicon, w2v_model=word_vectors
)
tweets_ready[["date", "tweet_embedding", "full_content", "cleaned_text"]].to_csv(
    "../data/for_models/tweets_features.csv", index=False
)

Tweets analyzed (after deduplication): 2096
Tweets filtered out (social noise): 51
Tweets kept (financial signal): 2045
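
The body of `filter_and_embed_tweets` is not shown here. A plausible reading, stated as an assumption, is that a tweet is kept when it contains at least one lexicon term and is then embedded as the average of its word vectors. The sketch below illustrates that idea with toy data: `embed_if_financial`, `toy_lexicon` and `toy_vectors` are hypothetical names, not part of the project.

```python
def embed_if_financial(tokens, lexicon, vectors):
    """Return the mean word vector, or None when the tweet has no lexicon term."""
    if not any(token in lexicon for token in tokens):
        return None  # treated as social noise in this sketch
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None  # no in-vocabulary token to average
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


toy_lexicon = {"fed", "rates"}
toy_vectors = {"fed": [1.0, 0.0], "rates": [0.0, 1.0], "hikes": [1.0, 1.0]}

financial = embed_if_financial(["fed", "hikes", "rates"], toy_lexicon, toy_vectors)
noise = embed_if_financial(["good", "morning"], toy_lexicon, toy_vectors)
print(financial, noise)
```

Under this reading, single-token matching would never hit multi-word lexicon entries, so those would need phrase-level matching to have any effect.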


### Tweets Assignment to clusters

$$\text{sim}(\mathbf{t}, \mathbf{c}_k) = \frac{\mathbf{t} \cdot \mathbf{c}_k}{\|\mathbf{t}\| \times \|\mathbf{c}_k\|}$$

Condition from the paper: similarity > 0.5. The cells below apply a slightly stricter threshold of 0.55.
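
To make the formula concrete, here is a minimal dependency-free sketch of cosine similarity followed by a thresholded assignment. Picking the single most similar cluster is an assumption about how `assign_tweets_to_events_by_period` works, and the 2-dimensional vectors are toy values.

```python
import math


def cosine_similarity(t, c):
    """Cosine of the angle between vectors t and c."""
    dot = sum(a * b for a, b in zip(t, c))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_c = math.sqrt(sum(b * b for b in c))
    if norm_t == 0.0 or norm_c == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_t * norm_c)


def assign_to_best_cluster(tweet_vec, signatures, threshold):
    """Return (cluster_id, similarity), with cluster_id None if the best match is too weak."""
    best_id, best_sim = max(
        ((cid, cosine_similarity(tweet_vec, sig)) for cid, sig in signatures.items()),
        key=lambda pair: pair[1],
    )
    if best_sim > threshold:
        return best_id, best_sim
    return None, best_sim


toy_signatures = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(assign_to_best_cluster([3.0, 4.0], toy_signatures, threshold=0.55))
print(assign_to_best_cluster([-1.0, -1.0], toy_signatures, threshold=0.55))
```

The first toy tweet has similarity 0.6 with cluster 0 and 0.8 with cluster 1, so it goes to cluster 1. The second points away from both signatures and stays unassigned.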

In [25]:
# # Observation dates (slightly wider than the news window)
# START_DATE = "2023-03-03"
# END_DATE = "2023-03-17"

# Observation dates (wider than the news window because there are few tweets in this period)
START_DATE = "2023-09-03"
END_DATE = "2023-10-03"

# Run the assignment
final_tweets_assigned = assign_tweets_to_events_by_period(
    tweets_df=tweets_ready,
    news_signatures=final_event_signatures,
    start_date=START_DATE,
    end_date=END_DATE,
    threshold=0.55,  # Delta threshold (similarity cutoff)
)

# final_tweets_assigned.to_csv('../data/for_models/tweets_assigned_SVB.csv', index=False)

final_tweets_assigned.to_csv("../data/for_models/tweets_assigned_AI.csv", index=False)

--- Results for the period 2023-09-03 to 2023-10-03 ---
Tweets in the period: 29
Tweets assigned to events: 25
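
The period restriction itself can be sketched as a simple date-window filter. Whether the real helper treats both bounds as inclusive is an assumption. `filter_by_period` and the toy tweets are hypothetical.

```python
from datetime import date


def filter_by_period(tweets, start_date, end_date):
    """Keep (day, text) pairs whose day lies in [start_date, end_date], bounds included."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return [(day, text) for day, text in tweets if start <= day <= end]


toy_tweets = [
    (date(2023, 9, 2), "too early"),
    (date(2023, 9, 3), "first day of the window"),
    (date(2023, 10, 3), "last day of the window"),
    (date(2023, 10, 4), "too late"),
]

kept = filter_by_period(toy_tweets, "2023-09-03", "2023-10-03")
print([text for _, text in kept])
```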


In [26]:
# --- USAGE ---
fig_bars = plot_tweet_assignment_bars(
    tweets_ready, final_event_signatures, START_DATE, END_DATE, threshold=0.55
)
fig_bars.show()

This chart shows that nearly all of the tweets are assigned to the news clusters (almost everything is green, with little red). Concretely:

Most tweets have a cosine similarity above 0.55 with the cluster centroids, which is high. This suggests that the financial vocabulary of the tweets (after lexicon filtering) is semantically very close to that of the clustered news articles. The assignment rate is about 86% (25 of the 29 tweets in the period).
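
The assignment rate can be recomputed directly from the two counts printed by the assignment cell (29 tweets in the period, 25 assigned):

```python
tweets_in_period = 29  # count reported by the assignment cell
tweets_assigned = 25   # count reported by the assignment cell

assignment_rate = tweets_assigned / tweets_in_period
rate_label = f"{assignment_rate:.1%}"
print(f"Assignment rate: {rate_label}")
```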

In [28]:
# Top Tweets by cluster
table_tweets = get_event_tweets_summary(final_tweets_assigned, final_event_signatures)

pd.set_option("display.max_colwidth", None)
display(table_tweets)

| | Cluster ID | Top 3 Representative Tweets |
|---|---|---|
| 0 | Event #0 | [0.866] *Walter Bloomberg tweeted about BX BLACKSTONE SHARES UP 3.68% AFTER ANNOUNCEMENT CO TO JOIN S&P 500<br>$BX<br><br>[0.742] *Walter Bloomberg tweeted about 🔸 U.S. S&P 500 E-MINI FUTURES DOWN 0.17%, NASDAQ FUTURES DOWN 0.29%, DOW FUTURES DOWN 0.14%<br><br>[0.733] *Walter Bloomberg tweeted about S&P 500 E-MINI FUTURES DOWN 0.08%, NASDAQ FUTURES DOWN 0.19% |
| 1 | Event #1 | [0.940] First Squawk tweeted about DOW JONES DOWN 128.36 POINTS, OR 0.38 %, AT 33,304.99 AFTER MARKET OPEN<br><br>S&P 500 DOWN 19.63 POINTS, OR 0.46 PERCENT, AT 4,268.76 AFTER MARKET OPEN<br><br>NASDAQ DOWN 79.05 POINTS, OR 0.59 PERCENT, AT 13,228.72 AFTER MARKET OPEN<br><br>[0.940] First Squawk tweeted about DOW JONES DOWN 12.72 POINTS, OR 0.04 %, AT 34,824.99 AFTER MARKET OPEN<br><br>S&P 500 DOWN 4.89 POINTS, OR 0.11 PERCENT, AT 4,510.88 AFTER MARKET OPEN<br><br>NASDAQ DOWN 34.06 POINTS, OR 0.24 PERCENT, AT 13,997.76 AFTER MARKET OPEN<br><br>[0.922] Daan Crypto Trades tweeted about The US Stock Market opens back up today.<br><br>Are we seeing the first proper lower high being formed in this 2023 rally or will this break through soon?<br><br>What do you think?<br><br>If you want to trade Indices, FX & Commodities using Crypto consider: https://t.co/VZNADvrnZu |
| 2 | Event #2 | [0.766] Don't follow Shardi B if you hate Money tweeted about SPY $SPY<br><br>This is interesting...daily RSI now oversold (below 30) and approaching most oversold level in a few years...<br><br>We are probably close to at least a short term bounce IMO |
| 3 | Event #3 | No tweets assigned |
| 4 | Event #4 | No tweets assigned |
| 5 | Event #5 | No tweets assigned |
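
The layout of this table (a `[similarity] text` entry per tweet, up to three per event, and a placeholder for empty events) suggests a simple ranking step. The sketch below is a hypothetical reconstruction of that logic with toy data. `top_tweets_summary` is an invented name and is not the project's `get_event_tweets_summary`.

```python
def top_tweets_summary(assigned, cluster_ids, top_n=3):
    """Hypothetical summary: `assigned` is a list of (cluster_id, similarity, text)."""
    summary = {}
    for cid in cluster_ids:
        # Rank this cluster's tweets by decreasing similarity and keep the best ones.
        ranked = sorted(
            (row for row in assigned if row[0] == cid),
            key=lambda row: row[1],
            reverse=True,
        )[:top_n]
        if ranked:
            cell = "\n\n".join(f"[{sim:.3f}] {text}" for _, sim, text in ranked)
        else:
            cell = "No tweets assigned"
        summary[f"Event #{cid}"] = cell
    return summary


toy_assigned = [
    (0, 0.742, "futures down"),
    (0, 0.866, "shares up"),
    (1, 0.940, "market open"),
]
toy_summary = top_tweets_summary(toy_assigned, cluster_ids=[0, 1, 2])
for event, cell in toy_summary.items():
    print(event, "->", repr(cell))
```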


In [29]:
# # Export to CSV
# table_tweets.to_csv("../data/for_models/output/table_3_tweet_assignment_SVB.csv", index=False)

# Export to CSV
table_tweets.to_csv(
    "../data/for_models/output/table_3_tweet_assignment_AI.csv", index=False
)