# 1: Data Preparation for Inference

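This script pre-processes the dataset sp500_headlines_2008_2024_raw, sourced from Kaggle:
- Rows are sorted by date, and daily returns (Return) and log returns (Return_log) are computed from the closing price (CP). The dataset has one row per headline, so returns are taken between consecutive rows; a row with the same closing price as the previous row therefore gets a return of 0.
- The column 'Title' is renamed to 'Headline' for clarity and consistency with subsequent scripts.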
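
For reference, the two return measures listed above can be written out explicitly. The notation is introduced here for illustration only and is not part of the dataset: $P_t$ is the closing price in row $t$, $R_t$ the simple return and $r_t$ the log return. The worked numbers use the closing prices 1447.16 and 1416.18 that appear in the final dataset.

```latex
\begin{aligned}
R_t &= \frac{P_t}{P_{t-1}} - 1, \qquad r_t = \ln\frac{P_t}{P_{t-1}} = \ln(1 + R_t) \\
R &= \frac{1416.18}{1447.16} - 1 \approx -0.021407, \qquad r = \ln\frac{1416.18}{1447.16} \approx -0.02164
\end{aligned}
```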


### Import required packages and load the raw data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/sp500_headlines_2008_2024_raw.csv')
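
The later steps rely on the columns 'Title', 'Date' and 'CP' being present in the raw file. A minimal sketch of a check that could run right after loading is shown below; the helper name `missing_columns` and the toy frame are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Columns the preparation steps depend on.
REQUIRED_COLUMNS = ["Title", "Date", "CP"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]

# Toy frame standing in for the raw CSV (hypothetical values).
toy_raw = pd.DataFrame({
    "Title": ["2008 predictions for the S&P 500"],
    "Date": ["2008-01-02"],
    "CP": [1447.16],
})

missing = missing_columns(toy_raw)
missing_after_drop = missing_columns(toy_raw.drop(columns="CP"))
```

In the notebook, the same call on `df` could be followed by raising an error if the returned list is not empty.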

### Data preparation

In [2]:
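# Parse dates and sort chronologically so that row-wise differences follow time order
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)
# Simple and log returns between consecutive rows (the first row has no predecessor, hence NaN)
df['Return'] = df['CP'].pct_change()
df['Return_log'] = np.log(df['CP'] / df['CP'].shift(1))
# Rename the headline column for consistency with subsequent scripts
df = df.rename(columns={"Title": "Headline"})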
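
Because several headlines can share one date, the row-wise computation above assigns a return of 0 to every row whose closing price repeats. If a return per trading day is wanted on every headline row instead, one possible alternative is sketched below. It is not the approach used in this script: it assumes each date has a single closing price, and the column names `Return_daily` and `Return_log_daily` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data mimicking the layout: several headlines can share one date.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2008-01-02", "2008-01-02", "2008-01-07", "2008-01-07"]),
    "CP": [1447.16, 1447.16, 1416.18, 1416.18],
})

# Keep one closing price per date, then compute returns between consecutive dates.
daily = toy.drop_duplicates("Date")[["Date", "CP"]].sort_values("Date")
daily["Return_daily"] = daily["CP"].pct_change()
daily["Return_log_daily"] = np.log(daily["CP"]).diff()

# Broadcast the per-date returns back onto every headline row.
toy = toy.merge(daily.drop(columns="CP"), on="Date", how="left")
```

With this layout, all headlines published on the same date carry the same return, and only the rows of the first date are NaN.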


### Final dataset

In [3]:
df

| | Headline | Date | CP | Return | Return_log |
|---|---|---|---|---|---|
| 0 | JPMorgan Predicts 2008 Will Be "Nothing But Net" | 2008-01-02 | 1447.16 | NaN | NaN |
| 1 | Dow Tallies Biggest First-session-of-year Poin... | 2008-01-02 | 1447.16 | 0.000000 | 0.00000 |
| 2 | 2008 predictions for the S&P 500 | 2008-01-02 | 1447.16 | 0.000000 | 0.00000 |
| 3 | U.S. Stocks Higher After Economic Data, Monsan... | 2008-01-03 | 1447.16 | 0.000000 | 0.00000 |
| 4 | U.S. Stocks Climb As Hopes Increase For More F... | 2008-01-07 | 1416.18 | -0.021407 | -0.02164 |
| ... | ... | ... | ... | ... | ... |
| 19122 | Zions (ZION) Loses Spot in S&P 500 as Concerns... | 2024-03-04 | 5130.95 | 0.000000 | 0.00000 |
| 19123 | S&P 500: Super Micro, Deckers Jump On News The... | 2024-03-04 | 5130.95 | 0.000000 | 0.00000 |
| 19124 | Bank of America boosts S&P 500 target to 5,400... | 2024-03-04 | 5130.95 | 0.000000 | 0.00000 |
| 19125 | S&P 500 Price Forecast – S&P 500 Continues to ... | 2024-03-04 | 5130.95 | 0.000000 | 0.00000 |
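
The first row has no preceding closing price, so both return columns are missing there. A small sketch of how the missing values could be counted and located before the data is used further is given below; the three-row frame is a stand-in built from closing prices shown in the table, not the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for the prepared data, using closing prices from the table.
sample = pd.DataFrame({"CP": [1447.16, 1447.16, 1416.18]})
sample["Return"] = sample["CP"].pct_change()
sample["Return_log"] = np.log(sample["CP"] / sample["CP"].shift(1))

# Count missing values per return column.
nan_counts = sample[["Return", "Return_log"]].isna().sum()

# Index labels of the rows with at least one missing return.
rows_with_nan = sample.index[sample[["Return", "Return_log"]].isna().any(axis=1)].tolist()
```

Whether that first row is dropped or kept is left to the downstream script.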


### Export final dataset as CSV

In [4]:
df.to_csv("../data/sp500_headlines_2008_2024.csv", index=False, encoding="utf-8")
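
A CSV file stores dates as plain text, so a script that reads the exported file has to parse the 'Date' column again. The sketch below illustrates the round trip with an in-memory buffer in place of the real file path; the one-row frame is a stand-in for the final dataset.

```python
import io
import pandas as pd

# One-row stand-in for the final dataset.
out = pd.DataFrame({
    "Headline": ["2008 predictions for the S&P 500"],
    "Date": pd.to_datetime(["2008-01-02"]),
    "CP": [1447.16],
})

# Write without the index, as in the export step, then read the text back.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV text format does not keep.
back = pd.read_csv(buffer, parse_dates=["Date"])
```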