# Lexicon Generation

Objective: Create a time-aware, domain-specific lexicon to represent news documents by capturing the correlation between specific words and stock price movements.
- **Time Window**: For each day $d$, we collect news articles from a 4-week look-back period (days $d-28$ through $d-1$) to capture the impact of emerging financial terms.
- **Return Calculation**: We calculate the daily stock price variation ($\Delta$) for the day following each article's publication using the closing prices.
- **Word Scoring**: We assign each word $j$ a score $f(j)$ by averaging the $\Delta$ values of all articles in which that term appears.
- **Frequency Filtering**: We remove terms appearing in more than 90% of documents (too common) or fewer than 10 documents (statistically insignificant).
- **Marginal Screening**: By sorting remaining words by their average scores, we identify those consistently followed by significant positive or negative market variations.
- **Final Selection**: We form the lexicon by keeping only the words whose scores fall below the 20th percentile or above the 80th percentile, which focuses it on the most impactful terms.
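
The scoring, filtering, and selection steps above can be sketched on a toy corpus. This is a minimal illustration, not the code in `src.lexicon_generation`: the function `select_lexicon`, its parameters, the token-list input format, and the return values are all assumptions made for the example, and strict inequalities at the percentile cut-offs are one possible reading of "below" and "above".

```python
import pandas as pd


def select_lexicon(docs, deltas, min_docs=10, max_doc_frac=0.90,
                   low_q=0.20, high_q=0.80):
    """Score words by mean next-day return, filter by frequency, keep the tails.

    docs   : list of token lists, one per article (hypothetical input format)
    deltas : next-day price variation for each article, same order as docs
    """
    # One (word, delta) row per article in which the word appears;
    # set() ensures a word counts once per article.
    rows = [(word, delta)
            for tokens, delta in zip(docs, deltas)
            for word in set(tokens)]
    table = pd.DataFrame(rows, columns=["word", "delta"])

    # f(j): average delta over the articles containing word j,
    # plus the number of articles containing it.
    stats = table.groupby("word")["delta"].agg(score="mean", n_docs="count")

    # Frequency filtering: drop words that are too rare or too common.
    n_articles = len(docs)
    keep = (stats["n_docs"] >= min_docs) & (stats["n_docs"] <= max_doc_frac * n_articles)
    stats = stats[keep]

    # Final selection: keep only the two tails of the score distribution.
    low_cut, high_cut = stats["score"].quantile([low_q, high_q])
    return stats[(stats["score"] < low_cut) | (stats["score"] > high_cut)]


# Toy corpus with made-up returns; min_docs is lowered so the result is non-empty.
docs = [
    ["market", "rally", "bank"],
    ["market", "crash", "bank"],
    ["market", "rally", "growth"],
    ["market", "crash", "default"],
    ["market", "bank", "flat"],
]
deltas = [0.02, -0.03, 0.01, -0.02, 0.0]

lexicon = select_lexicon(docs, deltas, min_docs=2)
print(lexicon)
```

In this toy run, `market` is dropped for appearing in every article, the single-occurrence words are dropped as too rare, and `bank` sits in the middle of the score distribution, so only the two tail words remain.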

### Libraries

In [2]:
import pandas as pd
import spacy
from datetime import timedelta
from tqdm import tqdm
import os
import sys

sys.path.append(os.path.abspath(os.path.join('..')))
from src.lexicon_generation import preprocess_spacy, build_daily_lexicon, visualize_daily_lexicon

### Data Loading and Cleaning

In [3]:
# Load datasets
news = pd.read_csv('../data/processed/news_2023.csv')
tweets = pd.read_csv('../data/processed/tweets_2023.csv')
prices = pd.read_csv('../data/processed/sp500_2023.csv', skiprows=3, names=['date', 'close', 'high', 'low', 'open', 'vol', 'returns'])

In [4]:
news['date'] = pd.to_datetime(news['date']).dt.date
prices['date'] = pd.to_datetime(prices['date']).dt.date
prices_map = prices.set_index('date')['returns'].to_dict()
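
Because `prices_map` only has entries for trading days, an article published on a Friday or before a holiday has no return on the next calendar day. How `build_daily_lexicon` handles this is not shown here; the sketch below is one plausible approach under that assumption. The helper `next_trading_return`, its `max_gap_days` parameter, and the return values in `toy_prices_map` are all hypothetical.

```python
from datetime import date, timedelta


def next_trading_return(pub_date, prices_map, max_gap_days=7):
    """Return the value stored for the first trading day strictly after pub_date.

    Scans forward one calendar day at a time and gives up (returns None)
    if no trading day is found within max_gap_days.
    """
    for offset in range(1, max_gap_days + 1):
        candidate = pub_date + timedelta(days=offset)
        if candidate in prices_map:
            return prices_map[candidate]
    return None


# Made-up returns for two trading days (a Friday and the following Monday).
toy_prices_map = {
    date(2023, 3, 10): -0.01,
    date(2023, 3, 13): -0.002,
}

# Thursday article -> Friday's return; Friday article -> Monday's return.
print(next_trading_return(date(2023, 3, 9), toy_prices_map))
print(next_trading_return(date(2023, 3, 10), toy_prices_map))
```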

### Lexicon Generation

In [5]:
# Configuration
nlp = spacy.blank("en")
DTM_OUTPUT_DIR = '../data/processed/daily_dtm/'
LEXICON_OUTPUT_DIR = '../data/processed/daily_lexicons_full/'
FILTERED_LEXICON_OUTPUT_DIR = '../data/processed/daily_lexicons_filtered/'

In [7]:
print("Step 1: Pre-processing text...")
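# Replace missing headlines/bodies with empty strings so the concatenation never yields NaN
news['clean'] = (news['headline'].fillna('') + " " + news['body'].fillna('')).apply(lambda x: preprocess_spacy(x, nlp))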
news.to_csv('../data/processed/news_2023_clean.csv', index=False)

Step 1: Pre-processing text...


In [8]:
# Daily Loop (Rolling Window)
results = []
start_d = pd.to_datetime('2023-02-01').date()
end_d = pd.to_datetime('2023-12-31').date()

print("Step 2: Generating Daily Lexicons (Rolling Window)...")
for current_date in tqdm(pd.date_range(start_d, end_d)):
    d = current_date.date()   
    # Collection of articles in window [d-28, d-1]
    window_news = news[(news['date'] >= d - timedelta(days=28)) & (news['date'] < d)].copy()  
    # Build Lexicon
    lex_map = build_daily_lexicon(window_news, prices_map, d, DTM_OUTPUT_DIR, LEXICON_OUTPUT_DIR, FILTERED_LEXICON_OUTPUT_DIR)

Step 2: Generating Daily Lexicons (Rolling Window)...


100%|██████████| 334/334 [00:20<00:00, 16.42it/s]
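
As a quick sanity check on the window boundaries used in the loop, the snippet below reproduces the same date filter on a toy DataFrame. The helper `window_bounds` and the toy rows are illustrative assumptions; only the `[d-28, d-1]` convention comes from the loop above.

```python
from datetime import date, timedelta

import pandas as pd


def window_bounds(d, lookback_days=28):
    """Return the first and last calendar day of the look-back window for day d."""
    return d - timedelta(days=lookback_days), d - timedelta(days=1)


d = date(2023, 3, 10)
first_day, last_day = window_bounds(d)

# Toy articles straddling both edges of the window.
toy_news = pd.DataFrame({
    "date": [date(2023, 2, 9), date(2023, 2, 10), date(2023, 3, 9), date(2023, 3, 10)],
    "headline": ["a", "b", "c", "d"],
})

# Same condition as in the loop: date >= d - 28 days and date < d.
in_window = toy_news[(toy_news["date"] >= first_day) & (toy_news["date"] <= last_day)]
print(first_day, last_day, list(in_window["headline"]))
```

Articles published on day `d` itself fall outside the window, as does anything older than 28 days.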


#### Visualization

In [9]:
# Visualization for the SVB collapse on March 10, 2023:
visualize_daily_lexicon('2023-03-10')

In [12]:
# Visualization for the CAC 40 peak at 7,500 points on April 21, 2023:
visualize_daily_lexicon('2023-04-21')

In [13]:
# Visualization for the New York IPO of chip designer ARM on September 14, 2023:
visualize_daily_lexicon('2023-09-14')