# Time Alignment and Feature Preparation

This notebook tackles a core problem in financial machine learning:
aligning unstructured news data with structured market data without
introducing look-ahead bias, that is, without letting the features
for a trading day use news published after that day's market close.

The goal is to:
- map news timestamps to the correct trading day
- aggregate sentiment signals at a daily frequency
- prepare a clean feature set for downstream modeling and backtesting


In [1]:
import pandas as pd
import numpy as np
from datetime import time
import pytz


In [2]:
news_df = pd.DataFrame({
    "headline": [
        "Apple reports strong quarterly earnings",
        "Apple faces antitrust scrutiny in Europe",
        "Markets fall amid global uncertainty",
        "Federal Reserve hints at rate cuts",
        "Tech stocks rally on strong demand"
    ],
    "sentiment_score": [0.8, -0.6, -0.4, 0.2, 0.6],
    "timestamp_utc": [
        "2024-01-15 07:30:00",
        "2024-01-15 16:00:00",
        "2024-01-16 03:00:00",
        "2024-01-20 09:00:00",
        "2024-01-21 12:00:00"
    ]
})

news_df["timestamp_utc"] = pd.to_datetime(news_df["timestamp_utc"]).dt.tz_localize("UTC")
news_df


,headline,sentiment_score,timestamp_utc
0,Apple reports strong quarterly earnings,0.8,2024-01-15 07:30:00+00:00
1,Apple faces antitrust scrutiny in Europe,-0.6,2024-01-15 16:00:00+00:00
2,Markets fall amid global uncertainty,-0.4,2024-01-16 03:00:00+00:00
3,Federal Reserve hints at rate cuts,0.2,2024-01-20 09:00:00+00:00
4,Tech stocks rally on strong demand,0.6,2024-01-21 12:00:00+00:00


In [3]:
IST = pytz.timezone("Asia/Kolkata")
MARKET_CLOSE = time(15, 30)  # 3:30 PM IST


In [4]:
news_df["timestamp_ist"] = news_df["timestamp_utc"].dt.tz_convert(IST)
news_df[["headline", "timestamp_ist"]]


,headline,timestamp_ist
0,Apple reports strong quarterly earnings,2024-01-15 13:00:00+05:30
1,Apple faces antitrust scrutiny in Europe,2024-01-15 21:30:00+05:30
2,Markets fall amid global uncertainty,2024-01-16 08:30:00+05:30
3,Federal Reserve hints at rate cuts,2024-01-20 14:30:00+05:30
4,Tech stocks rally on strong demand,2024-01-21 17:30:00+05:30


In [5]:
def map_to_trading_day(ts):
    """Assign a tz-aware IST timestamp to a trading day.

    Weekends are skipped; exchange holidays are not handled.
    """
    # Rule 1: Weekend news → next Monday
    if ts.weekday() >= 5:
        return (ts + pd.Timedelta(days=7 - ts.weekday())).normalize()

    # Rule 2: After market close on a weekday → next business day
    if ts.time() > MARKET_CLOSE:
        return (ts + pd.tseries.offsets.BusinessDay(1)).normalize()

    # Rule 3: At or before market close (including pre-open) → same day
    return ts.normalize()
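
The mapping rule above skips weekends but not exchange holidays. A minimal holiday-aware sketch, assuming a hand-maintained holiday list, could build on pandas' `CustomBusinessDay`. The names `EXAMPLE_HOLIDAYS` and `map_to_trading_day_with_holidays` are hypothetical, the single date listed is only illustrative, and the function returns timezone-naive dates:

```python
from datetime import time

import pandas as pd

MARKET_CLOSE = time(15, 30)  # 3:30 PM IST, as in the notebook

# Hypothetical holiday list for illustration only; in practice this would
# come from the exchange's published calendar.
EXAMPLE_HOLIDAYS = ["2024-01-26"]
trading_bday = pd.tseries.offsets.CustomBusinessDay(holidays=EXAMPLE_HOLIDAYS)


def map_to_trading_day_with_holidays(ts):
    """Map a tz-aware IST timestamp to a tz-naive trading date."""
    # Work on the IST wall-clock date, without timezone information.
    day = ts.tz_localize(None).normalize()
    # After the close, the news belongs to a later session.
    if ts.time() > MARKET_CLOSE:
        day = day + pd.Timedelta(days=1)
    # Roll forward past weekends and listed holidays
    # (this is a no-op when `day` is already a trading day).
    return trading_bday.rollforward(day)


# Thursday after the close: Friday the 26th is listed as a holiday,
# so the headline rolls to Monday 2024-01-29.
example = map_to_trading_day_with_holidays(
    pd.Timestamp("2024-01-25 16:00", tz="Asia/Kolkata")
)
```

Returning timezone-naive dates is a design choice here: it matches the naive `date` column used for the price data later in the notebook.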


In [6]:
news_df["trading_day"] = news_df["timestamp_ist"].apply(map_to_trading_day)
news_df[["headline", "timestamp_ist", "trading_day"]]


,headline,timestamp_ist,trading_day
0,Apple reports strong quarterly earnings,2024-01-15 13:00:00+05:30,2024-01-15 00:00:00+05:30
1,Apple faces antitrust scrutiny in Europe,2024-01-15 21:30:00+05:30,2024-01-16 00:00:00+05:30
2,Markets fall amid global uncertainty,2024-01-16 08:30:00+05:30,2024-01-16 00:00:00+05:30
3,Federal Reserve hints at rate cuts,2024-01-20 14:30:00+05:30,2024-01-22 00:00:00+05:30
4,Tech stocks rally on strong demand,2024-01-21 17:30:00+05:30,2024-01-22 00:00:00+05:30
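
One way to sanity-check the assignment above is to confirm that no headline was published after the close of the trading day it was assigned to. The sketch below is an illustration under that reading of look-ahead bias; `violates_lookahead` and `check_df` are hypothetical names, and the two rows are copied from the table above:

```python
from datetime import time

import pandas as pd

MARKET_CLOSE = time(15, 30)  # 3:30 PM IST, as in the notebook


def violates_lookahead(timestamp_ist, trading_day):
    """Return True where news arrived after the assigned day's close."""
    close_offset = pd.Timedelta(
        hours=MARKET_CLOSE.hour, minutes=MARKET_CLOSE.minute
    )
    # trading_day holds midnight timestamps, so adding the offset
    # gives the closing time of that session.
    return timestamp_ist > trading_day + close_offset


check_df = pd.DataFrame({
    "timestamp_ist": pd.to_datetime(
        ["2024-01-15 13:00:00", "2024-01-15 21:30:00"]
    ).tz_localize("Asia/Kolkata"),
    "trading_day": pd.to_datetime(
        ["2024-01-15", "2024-01-16"]
    ).tz_localize("Asia/Kolkata"),
})

flags = violates_lookahead(check_df["timestamp_ist"], check_df["trading_day"])
assert not flags.any(), "some headlines were assigned to an earlier session"
```

The same function could be applied to the full `news_df` columns `timestamp_ist` and `trading_day`.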


In [7]:
daily_sentiment = (
    news_df
    .groupby("trading_day")
    .agg(
        avg_sentiment=("sentiment_score", "mean"),
        news_count=("sentiment_score", "count")
    )
    .reset_index()
)

daily_sentiment


,trading_day,avg_sentiment,news_count
0,2024-01-15 00:00:00+05:30,0.8,1
1,2024-01-16 00:00:00+05:30,-0.5,2
2,2024-01-22 00:00:00+05:30,0.4,2


In [8]:
trading_days = pd.date_range(
    start="2024-01-15",
    end="2024-01-25",
    freq="B"
)

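# Synthetic closing prices for illustration only; the generator is
# unseeded, so the values change on every run.
stock_df = pd.DataFrame({
    "date": trading_days,
    "close_price": np.random.uniform(150, 160, len(trading_days))
})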

stock_df


,date,close_price
0,2024-01-15,150.280087
1,2024-01-16,155.365375
2,2024-01-17,159.45137
3,2024-01-18,155.53971
4,2024-01-19,151.356376
5,2024-01-22,159.447329
6,2024-01-23,157.321813
7,2024-01-24,153.785526
8,2024-01-25,156.043122
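
Because the prices above are drawn without a seed, they differ on every run. A reproducible variant, sketched here with NumPy's `default_rng`, could look as follows; the seed value and the name `stock_df_seeded` are arbitrary choices, and the generated prices will not match the table above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, chosen for illustration

trading_days = pd.date_range(start="2024-01-15", end="2024-01-25", freq="B")

# Same shape as stock_df, but identical on every run.
stock_df_seeded = pd.DataFrame({
    "date": trading_days,
    "close_price": rng.uniform(150, 160, len(trading_days)),
})
```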


In [9]:
# Drop the timezone (the IST calendar date is kept) so that trading_day
# matches the timezone-naive stock_df["date"] column in the merge
daily_sentiment["trading_day"] = (
    daily_sentiment["trading_day"]
    .dt.tz_localize(None)
)


In [10]:
df_merged = pd.merge(
    stock_df,
    daily_sentiment,
    left_on="date",
    right_on="trading_day",
    how="left"
).drop(columns=["trading_day"])

# Days without news: news_count becomes 0, and avg_sentiment = 0
# treats them as neutral
df_merged[["avg_sentiment", "news_count"]] = df_merged[
    ["avg_sentiment", "news_count"]
].fillna(0)

df_merged


,date,close_price,avg_sentiment,news_count
0,2024-01-15,150.280087,0.8,1.0
1,2024-01-16,155.365375,-0.5,2.0
2,2024-01-17,159.45137,0.0,0.0
3,2024-01-18,155.53971,0.0,0.0
4,2024-01-19,151.356376,0.0,0.0
5,2024-01-22,159.447329,0.4,2.0
6,2024-01-23,157.321813,0.0,0.0
7,2024-01-24,153.785526,0.0,0.0
8,2024-01-25,156.043122,0.0,0.0
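
Filling `avg_sentiment` with 0 makes a day with no news look the same as a day whose headlines averaged to neutral, and the fill also leaves `news_count` as a float. A possible refinement, sketched below on three rows reconstructed from the merge result before the fill, restores the integer count and adds an explicit flag; the column name `has_news` is hypothetical:

```python
import numpy as np
import pandas as pd

# Three rows as they would look right after the left merge, before fillna
merged = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-16", "2024-01-17", "2024-01-22"]),
    "avg_sentiment": [-0.5, np.nan, 0.4],
    "news_count": [2.0, np.nan, 2.0],
})

# A missing count really is zero, so fill it and restore the integer dtype.
merged["news_count"] = merged["news_count"].fillna(0).astype(int)

# Flag days with at least one headline.
merged["has_news"] = merged["news_count"] > 0

# Filling sentiment with 0 remains a modeling choice; the flag lets a model
# tell "no coverage" apart from "coverage that averaged to neutral".
merged["avg_sentiment"] = merged["avg_sentiment"].fillna(0.0)
```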


In [11]:
features = df_merged[[
    "date",
    "close_price",
    "avg_sentiment",
    "news_count"
]]

features


,date,close_price,avg_sentiment,news_count
0,2024-01-15,150.280087,0.8,1.0
1,2024-01-16,155.365375,-0.5,2.0
2,2024-01-17,159.45137,0.0,0.0
3,2024-01-18,155.53971,0.0,0.0
4,2024-01-19,151.356376,0.0,0.0
5,2024-01-22,159.447329,0.4,2.0
6,2024-01-23,157.321813,0.0,0.0
7,2024-01-24,153.785526,0.0,0.0
8,2024-01-25,156.043122,0.0,0.0


## Summary

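This notebook demonstrated how to align timestamped news headlines
with daily market data so that each trading day's features use only
news published by that day's market close, which is what prevents
look-ahead bias.

Key steps included:
- timezone conversion (UTC → IST)
- trading day assignment based on the 15:30 IST market close, with after-close and weekend news rolled forward to the next business day
- daily aggregation of sentiment signals (mean score and headline count)
- a left merge onto the stock price dates, with no-news days filled with zeros

The resulting dataset, built here from sample headlines and randomly
generated prices, illustrates a clean feature set for predictive
modeling and strategy evaluation.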
