# Relevant Words Extraction

Objective: The goal of this stage is to translate abstract vector clusters into human-readable financial events so that the results can be evaluated qualitatively. Extracting the most significant terms lets us verify at a glance that the clusters are semantically coherent and reflect real-world market news rather than random noise.

Steps:
- Lexicon-filtered TF-IDF: We fit a TF-IDF model on all articles in the time interval, using only the specialized financial lexicon as features.
- Document Vectorization: We generate a relevance vector for each news document, where each value represents the importance of a specific financial term in that document.
- Cluster Aggregation: We group the vectors by their assigned cluster and compute the average TF-IDF score of every word within each group.
- Ranking and Selection: We sort the terms by average score in descending order and select the top 10 words to define the cluster's "identity."

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import os
import sys

sys.path.append(os.path.abspath(os.path.join('..')))
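# Import only the helpers used in this notebook instead of a wildcard import
from src.relevant_words_extraction import (
    extract_relevant_words_with_scores,
    plot_keywords_bar_chart,
)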

### Loading the cleaned news data

In [5]:
# clean_news_df = pd.read_csv('../data/for_models/clean_news_week_SVB.csv')

clean_news_df = pd.read_csv('../data/for_models/clean_news_week_AI.csv')

In [6]:
my_lexicon = [
    # --- Macroeconomics & Central Banks ---
    'inflation', 'deflation', 'stagflation', 'cpi', 'ppi', 'gdp', 'recession', 'growth', 'expansion', 
    'unemployment', 'employment', 'payroll', 'payrolls', 'deficit', 'surplus', 'debt', 'stimulus', 
    'productivity', 'spending', 'consumer', 'retail', 'fed', 'fomc', 'ecb', 'boj', 'boe', 'rates', 
    'interest', 'hiking', 'tightening', 'easing', 'hawkish', 'dovish', 'quantitative', 'tapering', 
    'policy', 'reserve', 'federal', 'monetary', 'fiscal', 'yield', 'curve', 'basis', 'points',

    # --- Banking Sector & Crisis ---
    'bank', 'banking', 'deposit', 'withdrawal', 'solvency', 'insolvency', 'liquidity', 'capital', 
    'tier', 'bailout', 'default', 'bankruptcy', 'collapse', 'failure', 'contagion', 'stress', 
    'leverage', 'credit', 'lending', 'loan', 'panic', 'run', 'rescue', 'fdic', 'regulator', 'stress test',

    # --- Markets & Sentiment ---
    'bullish', 'bearish', 'bull', 'bear', 'volatile', 'volatility', 'vix', 'rally', 'plunge', 
    'correction', 'crash', 'momentum', 'rebound', 'slump', 'surge', 'sideways', 'outlook', 
    'market', 'stocks', 'equities', 'shares', 'index', 'nasdaq', 'dow', 'sp500', 'spy', 'spx', 
    'ndx', 'derivative', 'futures', 'options', 'swap', 'etf', 'commodity', 'gold', 'oil', 'btc', 'eth',

    # --- Corporate Finance & Earnings ---
    'earnings', 'eps', 'revenue', 'ebitda', 'profit', 'loss', 'margin', 'guidance', 'forecast', 
    'dividend', 'buyback', 'ipo', 'merger', 'acquisition', 'takeover', 'restructuring', 'layoff', 
    'valuation', 'quarterly', 'outperform', 'underperform', 'upgrade', 'downgrade', 'security', 'securities',

    # --- Social Media & Trading Slang (Enrichment) ---
    'ath', 'atl', 'fomo', 'fud', 'hodl', 'btfd', 'moon', 'whale', 'short', 'long', 'squeeze', 
    'short squeeze', 'bagholder', 'pump', 'dump',

    # --- Geopolitics, Regulation & Tech ---
    'brexit', 'sanctions', 'trade', 'tariff', 'war', 'election', 'regulation', 'sec', 
    'compliance', 'antitrust', 'lawsuit', 'settlement', 'fraud', 'tech', 'privacy', 
    'data', 'cybersecurity', 'intellectual', 'property', 'patent'
]
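
Because the lexicon serves as the TF-IDF feature set, it may be worth checking it before use. scikit-learn's vectorizers raise an error on duplicate vocabulary terms, and with the default `ngram_range=(1, 1)` they only produce single-word tokens, so multi-word entries such as `'stress test'` or `'short squeeze'` would never be matched. Whether the project's extraction function already handles this is not shown here. The helper below, `audit_lexicon`, is a hypothetical name used only for illustration.

```python
from collections import Counter


def audit_lexicon(lexicon):
    """Return (duplicates, multi_word) found in a list of lexicon terms."""
    # Normalize the same way a lowercasing vectorizer would see the terms.
    normalized = [term.strip().lower() for term in lexicon]
    counts = Counter(normalized)
    # Terms that appear more than once after normalization.
    duplicates = sorted(term for term, n in counts.items() if n > 1)
    # Terms containing whitespace, which need n-grams of matching length to be found.
    multi_word = sorted({term for term in normalized if len(term.split()) > 1})
    return duplicates, multi_word


duplicates, multi_word = audit_lexicon(["bank", "stress", "stress test", "Bank", "short squeeze"])
print(duplicates)  # ['bank']
print(multi_word)  # ['short squeeze', 'stress test']
```

Running `audit_lexicon(my_lexicon)` in the notebook would list any entries that need attention before the lexicon is passed to the extraction step.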

### Relevant Financial Words Extraction 

In [7]:
# Extract the top 10 relevant words with their scores
keywords_data = extract_relevant_words_with_scores(clean_news_df, my_lexicon, top_n=10)

# Visualization
fig_keywords = plot_keywords_bar_chart(keywords_data)
fig_keywords.show()