# Step 6: Tweet Assignment

**Carta et al. (2021), Step 6**

Enrich the news clusters with semantically correlated tweets in order to assess the social resonance of the detected events.

### Libraries

In [None]:
import pandas as pd
import numpy as np
import spacy
import re
import plotly.express as px
import plotly.graph_objects as go
from sklearn.feature_extraction.text import CountVectorizer
from datetime import timedelta
from tqdm import tqdm
import os
import sys
from gensim.models import KeyedVectors


sys.path.append(os.path.abspath(os.path.join('..')))
from src.tweet_assignment import *

### Data loading

In [2]:
# Load the cluster signatures
df_sig = pd.read_csv('../data/for_models/final_event_signatures_SVB.csv')
cluster_col = 'Unnamed: 0'

# Convert to a dictionary {cluster_id: numpy_vector}
final_event_signatures = {}
for _, row in df_sig.iterrows():
    cluster_id = int(row[cluster_col])
    # Keep every embedding column (0 to 299) and convert them to a float array
    vector = row.drop(labels=[cluster_col]).to_numpy(dtype=float)
    final_event_signatures[cluster_id] = vector

print(f"Signatures loaded for {len(final_event_signatures)} clusters.")

Signatures loaded for 2 clusters.


### Tweets preprocessing

In [3]:
# Load datasets
tweets = pd.read_csv('../data/processed/tweets_2023.csv')
tweets['date'] = pd.to_datetime(tweets['date'], errors='coerce')

In [4]:
# Count tweets per calendar day
daily_counts = tweets.groupby(tweets['date'].dt.date).size().reset_index(name='count')

# Daily tweet distribution
fig_dist = px.bar(
    daily_counts, 
    x='date', 
    y='count',
    title="<b>Daily Tweet Distribution (Financial Context)</b>",
    labels={'date': 'Date', 'count': 'Number of Tweets'},
    template="plotly_white",
    color_discrete_sequence=['#1DA1F2'] # Twitter blue
)
fig_dist.update_layout(
    xaxis_title="Date",
    yaxis_title="Tweet Count",
    hovermode="x unified"
)
fig_dist.show()

### Tweets Embeddings Generation

In [5]:
nlp = spacy.blank("en")
tweets['cleaned_text'] = tweets['full_content'].apply(lambda x: preprocess_tweets_spacy(x, nlp))
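
`preprocess_tweets_spacy` is defined in `src.tweet_assignment` and is not shown in this notebook. As a rough illustration of the kind of cleaning such a step usually performs, here is a regex-only sketch. The helper name `clean_tweet_sketch` and the exact rules (lowercasing, dropping URLs and mentions, keeping alphanumeric tokens) are assumptions, not the project's actual implementation.

```python
import re

# Hypothetical cleaning rules; the real preprocess_tweets_spacy may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def clean_tweet_sketch(text):
    """Lowercase a tweet, strip URLs and @mentions, keep alphanumeric tokens."""
    if not isinstance(text, str):
        return ""  # missing values (NaN) become an empty string
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    return " ".join(TOKEN_RE.findall(text))


print(clean_tweet_sketch("$SPY drops 2% after @FirstSquawk report https://t.co/abc #stocks"))
```

Cashtags and hashtags lose their leading symbol here, so `$SPY` becomes `spy` and can match a lexicon entry directly.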

In [6]:
my_lexicon = [
    # --- Macroeconomics & Central Banks ---
    'inflation', 'deflation', 'stagflation', 'cpi', 'ppi', 'gdp', 'recession', 'growth', 'expansion', 
    'unemployment', 'employment', 'payroll', 'payrolls', 'deficit', 'surplus', 'debt', 'stimulus', 
    'productivity', 'spending', 'consumer', 'retail', 'fed', 'fomc', 'ecb', 'boj', 'boe', 'rates', 
    'interest', 'hiking', 'tightening', 'easing', 'hawkish', 'dovish', 'quantitative', 'tapering', 
    'policy', 'reserve', 'federal', 'monetary', 'fiscal', 'yield', 'curve', 'basis', 'points',

    # --- Banking Sector & Crisis ---
    'bank', 'banking', 'deposit', 'withdrawal', 'solvency', 'insolvency', 'liquidity', 'capital', 
    'tier', 'bailout', 'default', 'bankruptcy', 'collapse', 'failure', 'contagion', 'stress', 
    'leverage', 'credit', 'lending', 'loan', 'panic', 'run', 'rescue', 'fdic', 'regulator', 'stress test',

    # --- Markets & Sentiment ---
    'bullish', 'bearish', 'bull', 'bear', 'volatile', 'volatility', 'vix', 'rally', 'plunge', 
    'correction', 'crash', 'momentum', 'rebound', 'slump', 'surge', 'sideways', 'outlook', 
    'market', 'stocks', 'equities', 'shares', 'index', 'nasdaq', 'dow', 'sp500', 'spy', 'spx', 
    'ndx', 'derivative', 'futures', 'options', 'swap', 'etf', 'commodity', 'gold', 'oil', 'btc', 'eth',

    # --- Corporate Finance & Earnings ---
    'earnings', 'eps', 'revenue', 'ebitda', 'profit', 'loss', 'margin', 'guidance', 'forecast', 
    'dividend', 'buyback', 'ipo', 'merger', 'acquisition', 'takeover', 'restructuring', 'layoff', 
    'valuation', 'quarterly', 'outperform', 'underperform', 'upgrade', 'downgrade', 'security', 'securities',

    # --- Social Media & Trading Slang (Enrichment) ---
    'ath', 'atl', 'fomo', 'fud', 'hodl', 'btfd', 'moon', 'whale', 'short', 'long', 'squeeze', 
    'short squeeze', 'bagholder', 'pump', 'dump',

    # --- Geopolitics, Regulation & Tech ---
    'brexit', 'sanctions', 'trade', 'tariff', 'war', 'election', 'regulation', 'sec', 
    'compliance', 'antitrust', 'lawsuit', 'settlement', 'fraud', 'tech', 'privacy', 
    'data', 'cybersecurity', 'intellectual', 'property', 'patent'
]

In [7]:
# Loading the Dolma 2024 KeyedVectors
print("Loading Dolma 2024 Vectors...")
word_vectors = KeyedVectors.load_word2vec_format(
    '../models/dolma_300_2024_1.2M.100_combined.txt', 
    binary=False, 
    no_header=True)

Loading Dolma 2024 Vectors...


In [8]:
# Apply the lexicon filter and build the tweet embeddings
tweets_ready = filter_and_embed_tweets(
    df=tweets, 
    text_col='cleaned_text', 
    lexicon=my_lexicon, 
    w2v_model=word_vectors
)
tweets_ready[['date','tweet_embedding','full_content','cleaned_text']].to_csv('../data/for_models/tweets_features.csv', index=False)

Tweets analyzed (after deduplication): 2096
Tweets filtered out (social noise): 51
Tweets kept (financial signal): 2045
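
The body of `filter_and_embed_tweets` lives in `src.tweet_assignment`, so the logic below is only a plausible sketch of the two stages its output suggests: discard tweets with no lexicon term, then average the word vectors of the remaining tokens. The helper name `embed_if_financial`, the toy vectors, and the choice to average every in-vocabulary token (rather than only lexicon terms) are assumptions.

```python
def embed_if_financial(tokens, lexicon, vectors):
    """Return the mean vector of in-vocabulary tokens, or None for noise.

    A tweet is treated as noise when none of its tokens is in the lexicon,
    or when none of its tokens has a word vector.
    """
    if not any(tok in lexicon for tok in tokens):
        return None  # no financial term: filtered out as social noise
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None  # nothing to average
    dim = len(known[0])
    # Component-wise mean of the token vectors
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional vectors standing in for the 300-dimensional model
toy_vectors = {"bank": [1.0, 0.0], "collapse": [0.0, 1.0], "lunch": [0.5, 0.5]}
toy_lexicon = {"bank", "collapse"}

print(embed_if_financial(["bank", "collapse"], toy_lexicon, toy_vectors))  # [0.5, 0.5]
print(embed_if_financial(["lunch"], toy_lexicon, toy_vectors))  # None
```

A gensim `KeyedVectors` object supports the same `tok in vectors` and `vectors[tok]` operations as the toy dictionary, so the lookup pattern carries over to `word_vectors`.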


### Tweets Assignment to clusters

$$\text{sim}(\mathbf{t}, \mathbf{c}_k) = \frac{\mathbf{t} \cdot \mathbf{c}_k}{\|\mathbf{t}\| \times \|\mathbf{c}_k\|}$$

The paper's assignment condition is a similarity above 0.5. This notebook applies a slightly stricter threshold of 0.55.
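
To make the formula concrete, here is a dependency-free sketch of the similarity and of a threshold-based assignment. The helper names `cosine_similarity` and `assign_to_event` are hypothetical, and assigning each tweet to its single most similar event is an assumption about how `assign_tweets_to_events_by_period` behaves, not something confirmed by this notebook.

```python
import math


def cosine_similarity(t, c):
    """Cosine of the angle between vectors t and c (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(t, c))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in c))
    return dot / norm if norm else 0.0


def assign_to_event(tweet_vec, signatures, threshold=0.5):
    """Return (event_id, similarity) for the closest event, or (None, similarity)
    when even the best similarity does not exceed the threshold."""
    best_id, best_sim = None, -1.0
    for event_id, signature in signatures.items():
        sim = cosine_similarity(tweet_vec, signature)
        if sim > best_sim:
            best_id, best_sim = event_id, sim
    return (best_id, best_sim) if best_sim > threshold else (None, best_sim)


# Toy 2-dimensional signatures standing in for the 300-dimensional ones
signatures = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(assign_to_event([0.9, 0.1], signatures))    # closest to event 0
print(assign_to_event([-1.0, -1.0], signatures))  # below threshold, unassigned
```

With NumPy arrays the same quantity is `np.dot(t, c) / (np.linalg.norm(t) * np.linalg.norm(c))`, which is preferable on real 300-dimensional embeddings.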

In [11]:
# Observation window (slightly wider than the news window)
START_DATE = "2023-03-03"
END_DATE = "2023-03-17"

# Run the assignment
final_tweets_assigned = assign_tweets_to_events_by_period(
    tweets_df=tweets_ready, 
    news_signatures=final_event_signatures, 
    start_date=START_DATE,
    end_date=END_DATE,
    threshold=0.55 # Delta threshold
)
final_tweets_assigned.to_csv('../data/for_models/tweets_assigned_SVB.csv', index=False)

--- Results for the period 2023-03-03 to 2023-03-17 ---
Tweets in the period: 167
Tweets assigned to events: 152


In [12]:
import plotly.graph_objects as go
import numpy as np
from scipy.spatial.distance import cosine

def plot_tweet_assignment_bars(tweets_df, news_signatures, start_date, end_date, threshold=0.55):
    # 1. Filter by date
    mask = (tweets_df['date'] >= start_date) & (tweets_df['date'] <= end_date)
    df_period = tweets_df.loc[mask].copy()
    
    # 2. Compute the maximum similarity for each tweet
    max_similarities = []
    for _, row in df_period.iterrows():
        tweet_vec = row['tweet_embedding']
        # Cosine similarity with each event signature
        sims = [1 - cosine(tweet_vec, sig_vec) for sig_vec in news_signatures.values()]
        max_similarities.append(max(sims) if sims else 0)
    
    df_period['max_similarity'] = max_similarities
    
    # 3. Sort chronologically (important for the x-axis)
    df_period = df_period.sort_values(by='date')
    
    # 4. Define the colors (green for assigned, red for rejected)
    colors = ['#2ecc71' if sim >= threshold else '#e74c3c' for sim in df_period['max_similarity']]
    
    # 5. Build the bar plot
    fig = go.Figure()

    fig.add_trace(go.Bar(
        x=list(range(len(df_period))), # Numeric index to preserve the order
        y=df_period['max_similarity'],
        marker_color=colors,
        # Attach the data shown on hover
        customdata=np.stack((
            df_period['date'].dt.strftime('%Y-%m-%d'), 
            df_period['full_content'],
            df_period['max_similarity']
        ), axis=-1),
        hovertemplate=(
            "<b>Date:</b> %{customdata[0]}<br>" +
            "<b>Similarity:</b> %{customdata[2]:.4f}<br>" +
            "<b>Full text:</b> %{customdata[1]}<extra></extra>"
        )
    ))

    # 6. Add the threshold line
    fig.add_hline(
        y=threshold, 
        line_dash="dash", 
        line_color="#3498db", 
        line_width=2,
        annotation_text=f"Delta threshold ({threshold})", 
        annotation_position="top right"
    )

    # Layout
    fig.update_layout(
        title=f"<b>Tweet Assignment Distribution</b><br><sup>Period: {start_date} to {end_date}</sup>",
        xaxis_title=f"Tweets sorted by date (Total: {len(df_period)})",
        yaxis_title="Cosine similarity",
        template="plotly_white",
        hoverlabel=dict(bgcolor="white", font_size=12),
        height=600
    )

    # Hide the x-axis tick labels (too many) and rely on hover instead
    fig.update_xaxes(showticklabels=False)

    return fig

# --- USAGE ---
fig_bars = plot_tweet_assignment_bars(
    tweets_ready, 
    final_event_signatures, 
    START_DATE, 
    END_DATE, 
    threshold=0.55
)
fig_bars.show()

This plot shows that nearly all of the tweets are assigned to the news clusters: almost every bar is green and very few are red. Concretely:

Most tweets have a cosine similarity above 0.55 with the event signatures (the cluster centroids), which is high. The financial vocabulary of the tweets, after lexicon filtering, is therefore semantically very close to the clustered news articles. The assignment step reports 152 of the 167 tweets in the period as assigned, a rate of about 91%.
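
The assignment rate can be recomputed directly from the two counts printed by the assignment step. The helper name `assignment_rate` is hypothetical and is only used here to keep the arithmetic explicit.

```python
def assignment_rate(n_assigned, n_total):
    """Share of tweets in the period that were assigned to an event."""
    return n_assigned / n_total if n_total else 0.0


# Counts taken from the assignment output for 2023-03-03 to 2023-03-17
rate = assignment_rate(152, 167)
print(f"{rate:.1%}")  # 91.0%
```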

In [13]:
# Show the 3 most representative tweets for each event
for cluster_id in final_tweets_assigned['assigned_event'].unique():
    print(f"\nEVENT #{cluster_id} - Top 3 closest tweets:")
    top_3 = final_tweets_assigned[final_tweets_assigned['assigned_event'] == cluster_id] \
            .sort_values(by='similarity', ascending=False).head(3)
    
    for _, row in top_3.iterrows():
        print(f"  [{row['similarity']:.3f}] - {row['full_content'][:150]}...")


EVENT #0 - Top 3 closest tweets:
  [0.847] - reciknows 🔁 cfromhertz about SPY today is quadwitching 

notable:
1 .  $SPY and all SPDR Sector ETFs go ex div (always on quad witch)

2.  the open - ...
  [0.833] - IncomeSharks tweeted about STOCK The #stock market is volatile because there's a lot less liquidity. Advisors pushed everyone into 4/5% T bills, CDs a...
  [0.827] - FirstSquawk tweeted about NASDAQ STOCK MARKET: TRADING WILL REMAIN HALTED UNTIL SIGNATURE BANK HAS FULLY SATISFIED NASDAQ'S REQUEST FOR ADDITIONAL INF...

EVENT #1 - Top 3 closest tweets:
  [0.939] - CryptoNoan 🔁 CryptoNoan about If you apply this strategy you can easily achieve 100% failure. Every single trade will hit SL 

Most of them just copy ...
  [0.930] - trader1sz 🔁 WifeyAlpha about “Index Gamma into $2.8 Trillion OPeX this friday - S&P 500 Index gamma is long +$1.8B. Dealers lost $2B worth of long gam...
  [0.922] - FirstSquawk tweeted about DOW JONES DOWN 460.74