# Lexicon Generation

The objective is to identify the words that have a real impact on the financial market. To achieve this, a lexicon is rebuilt for each day $d$ with the following procedure:

- Take the news articles published in the 4-week window preceding day $d$.
- Compute the correlation of each word with the index price change ($\Delta$) on the day after the article's publication.
- Filter out words that are too frequent (>90% of documents) or too rare (<10 documents).
- Select the words in the extreme percentiles (the most positive and the most negative) to form the "time-aware" lexicon.
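The procedure above can be illustrated with a minimal, standard-library sketch. It is not the code of `build_daily_lexicon` in `src.lexicon_generation`, whose internals are not shown here. The function names (`pearson`, `sketch_lexicon`), the binary presence/absence encoding of words, and the default `tail=0.1` are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def sketch_lexicon(docs, returns, min_docs=10, max_doc_ratio=0.9, tail=0.1):
    """Hypothetical helper: docs is a list of token lists, returns holds the
    next-day index return associated with each document."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)
    doc_freq = Counter(tok for doc in doc_sets for tok in doc)

    scores = {}
    for word, df in doc_freq.items():
        # Drop words that are too rare or too frequent.
        if df < min_docs or df / n_docs > max_doc_ratio:
            continue
        # Assumption: a word is encoded as present (1) or absent (0) per document.
        presence = [1.0 if word in doc else 0.0 for doc in doc_sets]
        scores[word] = pearson(presence, returns)

    # Keep only the extreme tails of the score distribution.
    ranked = sorted(scores, key=scores.get)
    k = max(1, int(len(ranked) * tail)) if ranked else 0
    negative = {w: scores[w] for w in ranked[:k] if scores[w] < 0}
    positive = {w: scores[w] for w in ranked[len(ranked) - k:] if scores[w] > 0}
    return positive, negative


# Toy data (made up): four tokenized documents and their next-day returns.
toy_docs = [["rally", "market"], ["rally", "market"],
            ["crash", "market"], ["crash", "market"]]
toy_returns = [0.02, 0.01, -0.02, -0.01]
positive, negative = sketch_lexicon(toy_docs, toy_returns, min_docs=2)
```

In this toy run, `market` appears in every document and is filtered out as too frequent, while `rally` and `crash` land in the positive and negative tails respectively.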

### Libraries

In [50]:
import pandas as pd
import numpy as np
import spacy
import re
from sklearn.feature_extraction.text import CountVectorizer
from datetime import timedelta
from tqdm import tqdm
import os
import sys

sys.path.append(os.path.abspath(os.path.join('..')))
from src.lexicon_generation import preprocess_spacy, build_daily_lexicon

### Data Loading and Cleaning

In [51]:
# Load datasets
news = pd.read_csv('../data/processed/news_2023.csv')
tweets = pd.read_csv('../data/processed/tweets_2023.csv')
prices = pd.read_csv('../data/processed/sp500_2023.csv', skiprows=3, names=['date', 'close', 'high', 'low', 'open', 'vol', 'returns'])

In [52]:
news['date'] = pd.to_datetime(news['date']).dt.date
prices['date'] = pd.to_datetime(prices['date']).dt.date
prices_map = prices.set_index('date')['returns'].to_dict()
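Because `prices_map` only contains trading days, an article published on a Friday or before a holiday has no entry for the calendar day after its publication. How `build_daily_lexicon` handles this case is not shown here. The sketch below is one possible convention, assuming the "next day" is the first date after publication that has a return; the helper name `next_trading_return` and the `max_lookahead` parameter are hypothetical.

```python
from datetime import date, timedelta


def next_trading_return(prices_map, pub_date, max_lookahead=7):
    """Return the first available return strictly after pub_date, or None."""
    for offset in range(1, max_lookahead + 1):
        candidate = pub_date + timedelta(days=offset)
        if candidate in prices_map:
            return prices_map[candidate]
    return None


# Toy values (made up): a Friday and the following Monday.
toy_prices_map = {date(2023, 1, 6): 0.01, date(2023, 1, 9): -0.02}

# An article published on Friday 2023-01-06 is matched to Monday's return.
friday_article_return = next_trading_return(toy_prices_map, date(2023, 1, 6))
```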

### Lexicon Generation

In [39]:
# Configuration
nlp = spacy.blank("en")
LEXICON_OUTPUT_DIR = '../data/processed/daily_lexicons/'
DTM_OUTPUT_DIR = '../data/processed/daily_dtm/'

In [53]:
print("Step 1: Pre-processing text...")
news['clean'] = (news['headline'] + " " + news['body']).apply(lambda x: preprocess_spacy(x, nlp))

Step 1: Pre-processing text...


In [None]:
# Daily Loop (Rolling Window)
results = []
start_d = pd.to_datetime('2023-01-29').date()
end_d = pd.to_datetime('2023-12-31').date()

print("Step 2: Generating Daily Lexicons (Rolling Window)...")
for current_date in tqdm(pd.date_range(start_d, end_d)):
    d = current_date.date()   
    # Collection of articles in window [d-28, d-1]
    window_news = news[(news['date'] >= d - timedelta(days=28)) & (news['date'] < d)].copy()  
    # Build Lexicon
    lex_map = build_daily_lexicon(window_news, prices_map, d, LEXICON_OUTPUT_DIR, DTM_OUTPUT_DIR)

Step 2: Generating Daily Lexicons (Rolling Window)...


100%|██████████| 334/334 [00:43<00:00,  7.75it/s]

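The window filter in the loop keeps articles with `d - 28 days <= date < d`, so day $d$ itself is excluded. The short sketch below restates those bounds with hypothetical helpers (`window_bounds`, `in_window`) to make the inclusive and exclusive ends explicit. It also shows that for `2023-01-29` the window starts on `2023-01-01`, which is presumably why the loop starts on that date.

```python
from datetime import date, timedelta


def window_bounds(d, days=28):
    """First and last publication dates (both inclusive) used for day d."""
    return d - timedelta(days=days), d - timedelta(days=1)


def in_window(pub_date, d, days=28):
    """Same condition as the DataFrame filter: d - days <= pub_date < d."""
    return d - timedelta(days=days) <= pub_date < d


first_day = date(2023, 1, 29)
window_start, window_end = window_bounds(first_day)
# window_start is 2023-01-01 and window_end is 2023-01-28
```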