# Data Extraction and Loading

In this notebook, we retrieve the three main data sources:
- Financial News: High-quality English articles (Source: GDELT Project) 
- Social Media: Crowd sentiment and engagement (Source: [Kaggle Twitter Archive](https://www.kaggle.com/datasets/amulyas/financial-tweets))
- Market Prices: Daily S&P 500 OHLCV data (Source: Yahoo Finance)

### Libraries

In [1]:
import yfinance as yf
import sys
import os

sys.path.append(os.path.abspath(os.path.join("..")))
from src.extract_data import check_density, scrape_content, process_social_data

### Extraction

#### Financial News: High-quality English articles (Source: GDELT Project) 

We extracted financial news by querying the `GDELT Project database` via **Google BigQuery**, restricting results to reputable English-language sources. A randomized daily partitioning strategy samples articles from every day so that the raw export has no missing dates: the audit below reports 334 of 334 days covered, consistent with the 1 January to 30 November **2023** span seen in the scraped output. The newspaper3k library was then used to scrape full headlines and article bodies, automatically discarding low-quality or short entries. The result is a structured dataset suited to the sentiment-based rolling window methodology.
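The notebook's sampling happens in the BigQuery query, which is not shown here. As a rough illustration of the idea only, the sketch below draws a capped random sample per calendar day in plain Python. The function name `sample_per_day`, the `per_day` cap, and the example URLs are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_day(records, per_day, seed=0):
    """Keep at most `per_day` randomly chosen records for every date.

    `records` is an iterable of (date_string, url) pairs. The names and the
    cap are illustrative assumptions, not the notebook's actual query.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_day = defaultdict(list)
    for day, url in records:
        by_day[day].append(url)

    sampled = []
    for day in sorted(by_day):
        urls = by_day[day]
        # Days with fewer candidates than the cap are kept whole.
        chosen = rng.sample(urls, min(per_day, len(urls)))
        sampled.extend((day, url) for url in chosen)
    return sampled


# Hypothetical candidate rows: three articles on one day, one on the next.
candidates = [
    ("2023-01-01", "https://example.com/a"),
    ("2023-01-01", "https://example.com/b"),
    ("2023-01-01", "https://example.com/c"),
    ("2023-01-02", "https://example.com/d"),
]
sampled = sample_per_day(candidates, per_day=2)
print(sampled)
```

Capping each day separately, rather than sampling the whole table at once, is what keeps quiet days represented alongside busy ones.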

In [2]:
PATH_NEWS = "../data/raw/bq-results-20260210-003031-1770683483531.csv"  # gdelt data
OUTPUT_NEWS = "../data/processed/news_2023.csv"

In [3]:
# Dataset audit
df_raw = check_density(PATH_NEWS)

Total records: 2000
Coverage: 334 / 334 days
Gaps found: 0 days
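
The implementation of `check_density` lives in `src.extract_data` and is not reproduced in this notebook. A minimal sketch of a comparable audit, assuming coverage is measured between the first and last observed dates, could look like the following. `coverage_report` is a hypothetical helper, and the input is synthetic.

```python
from datetime import date, timedelta


def coverage_report(days):
    """Return (covered, expected, gaps) for an iterable of datetime.date values.

    Assumption: the expected span runs from the earliest to the latest
    observed date, inclusive.
    """
    observed = set(days)
    first, last = min(observed), max(observed)
    n_expected = (last - first).days + 1
    gaps = [
        first + timedelta(days=i)
        for i in range(n_expected)
        if first + timedelta(days=i) not in observed
    ]
    return n_expected - len(gaps), n_expected, gaps


# Synthetic input: every day from 1 January to 30 November 2023.
full_span = [date(2023, 1, 1) + timedelta(days=i) for i in range(334)]
covered, expected, gaps = coverage_report(full_span)
print(f"Coverage: {covered} / {expected} days")
print(f"Gaps found: {len(gaps)} days")
```

Note that under this assumption a dataset that simply stops early still reports full coverage, because the expected span ends at the last observed date.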


In [4]:
scrape_content(df_raw, OUTPUT_NEWS)

Starting extraction process...


100%|██████████| 2000/2000 [29:38<00:00,  1.12it/s]


Complete: 1565 valid articles extracted.


| Unnamed: 0 | date | headline | body | url | source |
|---|---|---|---|---|---|
| 0 | 2023-01-01 | Why We Like The Returns At Nucor (NYSE:NUE) | Did you know there are some financial metrics ... | https://news.yahoo.com/why-returns-nucor-nyse-... | yahoo.com |
| 1 | 2023-01-01 | Returns At Shin Yang Shipping Corporation Berh... | Finding a business that has the potential to g... | https://news.yahoo.com/returns-shin-yang-shipp... | yahoo.com |
| 2 | 2023-01-01 | Dundee Precious Metals' (TSE:DPM) investors wi... | The worst result, after buying shares in a com... | https://news.yahoo.com/dundee-precious-metals-... | yahoo.com |
| 3 | 2023-01-01 | How Should I Change My Portfolio As Inflation ... | Interest rates are on the rise. The Federal Re... | https://finance.yahoo.com/news/4-moves-portfol... | yahoo.com |
| 4 | 2023-01-01 | dorsaVi Ltd (ASX:DVL) insiders placed bullish ... | Generally, when a single insider buys stock, i... | https://news.yahoo.com/dorsavi-ltd-asx-dvl-ins... | yahoo.com |
| ... | ... | ... | ... | ... | ... |
| 1560 | 2023-11-29 | GM Reinstates 2023 Earnings Guidance and Annou... | DETROIT, Nov. 29, 2023 /PRNewswire/ -- General... | https://www.prnewswire.com/news-releases/gm-re... | prnewswire.com |
| 1561 | 2023-11-29 | Arete Wealth announces launch of direct-to-con... | 'Arete Investments' announced as newest cloud-... | https://www.prnewswire.com/news-releases/arete... | prnewswire.com |
| 1562 | 2023-11-29 | Thinking about trading options or stock in Okt... | NEW YORK, Nov. 29, 2023 /PRNewswire/ -- Invest... | https://www.prnewswire.com/news-releases/think... | prnewswire.com |
| 1563 | 2023-11-30 | AuguStar Life(SM) Selects iPipeline® to Help S... | EXTON, Pa., Nov. 30, 2023 /PRNewswire/ -- iPip... | https://www.prnewswire.com/news-releases/augus... | prnewswire.com |


#### Social Media: Crowd sentiment and engagement (Source: [Kaggle Twitter Archive](https://www.kaggle.com/datasets/amulyas/financial-tweets))

We processed a large-scale financial tweet dataset from **Kaggle**, filtering for `S&P 500` cashtags and related keywords within calendar year **2023**. Each post was scored with the `VADER` sentiment library to obtain a quantitative sentiment baseline. To align with the paper's influence-weighting methodology, we simulated the engagement metrics (``likes``, ``retweets``, and ``followers``) using a log-normal distribution correlated with sentiment intensity, so these values are synthetic rather than observed. The result is a high-frequency social signal that complements the more factual newswire data.
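
The simulation itself is implemented inside `process_social_data` and is not shown in this notebook. The sketch below is one possible way to tie log-normal engagement counts to sentiment intensity, assuming the intensity is the absolute value of a VADER-style compound score in [-1, 1]. The function name and every parameter value (`base_mu`, `slope`, `sigma`, and the offsets between metrics) are hypothetical choices for illustration.

```python
import random


def simulate_engagement(compound, rng, base_mu=2.0, slope=1.5, sigma=1.0):
    """Draw hypothetical like, retweet, and follower counts for one post.

    `compound` is a sentiment score in [-1, 1]. Stronger sentiment in either
    direction raises the log-scale mean, so the draws are positively
    correlated with intensity. All parameter values are assumptions.
    """
    intensity = abs(compound)
    mu = base_mu + slope * intensity
    return {
        "likes": int(rng.lognormvariate(mu, sigma)),
        # Offsets are arbitrary: fewer retweets than likes, many more followers.
        "retweets": int(rng.lognormvariate(mu - 1.0, sigma)),
        "followers": int(rng.lognormvariate(mu + 4.0, sigma)),
    }


rng = random.Random(42)  # seeded for reproducibility
neutral_post = simulate_engagement(0.0, rng)
emphatic_post = simulate_engagement(-0.9, rng)
print(neutral_post)
print(emphatic_post)
```

Because each draw is random, a single emphatic post can still receive fewer likes than a neutral one; the relationship only holds on average across many posts.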

In [11]:
PATH_TWEETS = "../data/raw/financial_tweets.csv"
OUTPUT_TWEETS = "../data/processed/tweets_2023.csv"

df_tweets = process_social_data(PATH_TWEETS)
df_tweets.to_csv(OUTPUT_TWEETS, index=False)

Extraction complete: 2243 records for 2023


#### Market Prices: Daily S&P 500 OHLCV data (Source: Yahoo Finance)

In [12]:
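# Download S&P 500 data (^GSPC). Starting in December 2022 gives the first
# 2023 trading day a prior close; `end` is exclusive in yfinance.
sp500 = yf.download("^GSPC", start="2022-12-01", end="2024-01-01")
# Daily simple return: Delta_d = (Price_d - Price_{d-1}) / Price_{d-1}
sp500["returns"] = sp500["Close"].pct_change()
# Save market data (the file also contains the December 2022 rows)
sp500.to_csv("../data/processed/sp500_2023.csv")
print(f"Market data saved: {len(sp500)} trading days retrieved.")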

[*********************100%***********************]  1 of 1 completed

Market data saved: 271 trading days retrieved.
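
The 271 rows include the December 2022 trading days as well as those of 2023. For reference, the simple-return formula used in the cell above can be written out in plain Python. This is a minimal sketch with made-up prices; `simple_returns` is a hypothetical helper, not part of the project code.

```python
def simple_returns(closes):
    """Return daily simple returns, with None for the first observation.

    Each value is (Price_d - Price_{d-1}) / Price_{d-1}.
    """
    if not closes:
        return []
    returns = [None]  # the first day has no prior close
    for prev, curr in zip(closes, closes[1:]):
        returns.append((curr - prev) / prev)
    return returns


# Made-up closing prices: one warm-up day in 2022, then two days in 2023.
dates = ["2022-12-30", "2023-01-03", "2023-01-04"]
closes = [100.0, 110.0, 99.0]
rows = list(zip(dates, simple_returns(closes)))

# Dropping the warm-up rows leaves only 2023, each with a defined return.
rows_2023 = [(d, r) for d, r in rows if d >= "2023-01-01"]
print(rows_2023)
```

A downstream step that needs only 2023 observations could apply the same date filter to the saved CSV.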



