# Feature Engineering: News Embedding

This stage converts each day's news text into a single numerical representation that captures the overall market sentiment for that day. Each article is processed in three steps:
- **Lexicon Filtering**: For each news article, only the words that appear in the time-aware, domain-specific lexicon generated in the previous step are retained.
- **Word Embedding Integration**: Each remaining word is replaced by its pre-trained dense word vector, typically in a 300-dimensional space. The vectors are loaded below in word2vec text format.
- **Document Representation**: The final vector for each article is the average of the word embeddings of its filtered words, which yields a single news embedding per article.
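
The three steps above can be sketched as follows. This is a minimal illustration, not the code in `src.feature_engineering`: the function name `embed_article`, the toy vectors, the toy lexicon, and the choice of returning a zero vector when no word survives filtering are all assumptions made for the example. A plain dict stands in for the loaded vectors; gensim `KeyedVectors` supports the same `in` and `[]` lookups.

```python
import numpy as np

def embed_article(tokens, lexicon, vectors, dim=300):
    """Average the vectors of tokens found in both the lexicon and the vocabulary.

    Hypothetical helper: returns a zero vector when no token survives filtering.
    """
    # Step 1 (lexicon filtering) and step 2 (vector lookup)
    kept = [vectors[t] for t in tokens if t in lexicon and t in vectors]
    if not kept:
        return np.zeros(dim, dtype=np.float32)
    # Step 3: element-wise mean over the retained word vectors
    return np.mean(kept, axis=0)

# Toy 3-dimensional vectors (illustrative values only)
toy_vectors = {
    "profit": np.array([1.0, 0.0, 2.0]),
    "loss": np.array([3.0, 2.0, 0.0]),
    "the": np.array([9.0, 9.0, 9.0]),
}
toy_lexicon = {"profit", "loss", "surge"}

# "the" is not in the lexicon; "surge" has no vector; both are skipped
article_vec = embed_article(["the", "profit", "loss", "surge"],
                            toy_lexicon, toy_vectors, dim=3)
print(article_vec)  # [2. 1. 1.]
```

How empty articles are handled in the real pipeline is not stated here; a zero vector is only one option, and dropping such articles is another.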

In [5]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import sys
from gensim.models import KeyedVectors

sys.path.append(os.path.abspath(os.path.join('..')))
from src.feature_engineering import run_feature_engineering_pipeline

### Loading pre-trained GloVe embeddings

In [6]:
# Load the Dolma 2024 vectors as gensim KeyedVectors.
# The file is plain text without a header line, hence no_header=True.
print("Loading Dolma 2024 Vectors...")
word_vectors = KeyedVectors.load_word2vec_format(
    '../models/dolma_300_2024_1.2M.100_combined.txt',
    binary=False,
    no_header=True
)

Loading Dolma 2024 Vectors...
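
The `no_header=True` argument matters because the word2vec text format normally begins with a line giving the vocabulary size and vector dimension, while GloVe-style text files start directly with `word v1 v2 ... vN` rows. The sketch below parses that header-less layout by hand to show what each line contains. The function `read_headerless_vectors` and the sample rows are hypothetical and only illustrate the file layout; they are not how gensim reads the file.

```python
import io
import numpy as np

def read_headerless_vectors(handle):
    """Parse 'word v1 v2 ... vN' lines into a dict mapping word -> vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = np.array([float(x) for x in values], dtype=np.float32)
    return vectors

# Two made-up rows in the header-less layout (3 dimensions instead of 300)
sample = io.StringIO("market 0.1 0.2 0.3\nrally 0.4 0.5 0.6\n")
toy = read_headerless_vectors(sample)

# Every vector should have the same dimension
dims = {v.shape[0] for v in toy.values()}
print(len(toy), dims)
```

A similar check on the real `word_vectors` object (for example, inspecting `word_vectors.vector_size`) is a quick way to confirm that the file was read with the expected dimension.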


### News embedding procedure

In [7]:
# Create an embedding for each news article
news = pd.read_csv('../data/processed/news_2023_clean.csv')
news_features = run_feature_engineering_pipeline(news, '../data/processed/daily_lexicons/', word_vectors)
news_features.to_csv('../data/for_models/news_features.csv', index=False)

Feature Engineering: 100%|██████████| 334/334 [00:01<00:00, 208.42it/s]
