# Data Ingestion and Storage Layer

This notebook focuses on the data ingestion and storage layer
of the Market Mood & Moves pipeline.

The objectives are to:
- understand the structure of raw news and stock data
- demonstrate responsible data storage practices
- separate raw data from processed data

At this stage, the emphasis is on correctness and reproducibility,
not on data volume or real-time ingestion.


## Types of Data

The project relies on two primary data sources:

1. **News Data**
   - headline or article text
   - publication timestamp
   - source metadata

2. **Market Data**
   - trading date
   - price information (e.g., close price)

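These two datasets differ in frequency and structure:
news items carry a full publication timestamp and can arrive at any time of day,
whereas market data is keyed by trading date.
This mismatch motivates careful storage and alignment.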


In [1]:
import pandas as pd

news_data = {
    "headline": [
        "Apple reports strong quarterly earnings",
        "Apple faces antitrust scrutiny in the EU",
        "Amazon expands cloud infrastructure in India"
    ],
    "timestamp_utc": [
        "2024-01-15 08:00:00",
        "2024-01-15 18:30:00",
        "2024-01-16 06:00:00"
    ],
    "source": ["Reuters", "Financial Times", "Bloomberg"]
}

df_news_raw = pd.DataFrame(news_data)
df_news_raw


|   | headline | timestamp_utc | source |
|---|----------|---------------|--------|
| 0 | Apple reports strong quarterly earnings | 2024-01-15 08:00:00 | Reuters |
| 1 | Apple faces antitrust scrutiny in the EU | 2024-01-15 18:30:00 | Financial Times |
| 2 | Amazon expands cloud infrastructure in India | 2024-01-16 06:00:00 | Bloomberg |
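
To make the structure of the news data more concrete, the following minimal sketch inspects the sample without modifying it. It repeats two columns of the sample so that it runs on its own; the names `df_sample`, `parsed_ts`, and `headlines_per_day` are illustrative and not part of the pipeline. The timestamps are parsed into a separate series purely for inspection, so the raw frame stays as received.

```python
import pandas as pd

# Same illustrative sample as above, repeated so the sketch is self-contained
df_sample = pd.DataFrame({
    "headline": [
        "Apple reports strong quarterly earnings",
        "Apple faces antitrust scrutiny in the EU",
        "Amazon expands cloud infrastructure in India",
    ],
    "timestamp_utc": [
        "2024-01-15 08:00:00",
        "2024-01-15 18:30:00",
        "2024-01-16 06:00:00",
    ],
})

# Parse into a separate series; the raw column itself is left untouched
parsed_ts = pd.to_datetime(df_sample["timestamp_utc"], utc=True)

# Count headlines per calendar day (UTC)
headlines_per_day = parsed_ts.dt.date.value_counts().sort_index()
print(headlines_per_day)
```

In this sample, two headlines fall on 2024-01-15 and one on 2024-01-16, while market data would contribute at most one row per trading date. That difference is what later alignment has to reconcile.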


## Why Professional Storage Is Needed

CSV files are simple, but they have several limitations:
- no enforced schema
- inefficient updates (changing one record typically means rewriting the file)
- error-prone as the data grows
- poor support for concurrent access

For a scalable and reproducible pipeline,
a database-backed approach is preferred.
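
As an illustration of what an enforced schema provides, here is a minimal sketch using SQLite. The table name `raw_news_strict` and the `NOT NULL` constraints are assumptions made for this example; the notebook itself lets pandas create the table. An in-memory database is used so that no file is left behind.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")

# Hypothetical schema: column names follow the sample data,
# the NOT NULL constraints are assumptions for illustration
conn.execute(
    """
    CREATE TABLE raw_news_strict (
        headline      TEXT NOT NULL,
        timestamp_utc TEXT NOT NULL,
        source        TEXT NOT NULL
    )
    """
)

# A complete row is accepted
conn.execute(
    "INSERT INTO raw_news_strict VALUES (?, ?, ?)",
    ("Apple reports strong quarterly earnings", "2024-01-15 08:00:00", "Reuters"),
)

# A row with a missing timestamp violates the NOT NULL constraint
try:
    conn.execute(
        "INSERT INTO raw_news_strict VALUES (?, ?, ?)",
        ("Headline without a timestamp", None, "Reuters"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM raw_news_strict").fetchone()[0]
conn.close()

print(rejected, row_count)
```

A CSV file would accept the incomplete row silently, whereas the database refuses it at write time.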


In [2]:
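import sqlite3

# Create (or connect to) a local SQLite database file
conn = sqlite3.connect("news_data.db")

try:
    # Store the raw news data unchanged.
    # if_exists="replace" drops and recreates the table on every run,
    # so re-executing this cell does not duplicate rows.
    df_news_raw.to_sql(
        "raw_news",
        conn,
        if_exists="replace",
        index=False
    )
finally:
    # Release the connection even if the write fails
    conn.close()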


In [3]:
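# Reopen the database and read the stored table back into a DataFrame
conn = sqlite3.connect("news_data.db")

try:
    df_news_loaded = pd.read_sql(
        "SELECT * FROM raw_news",
        conn
    )
finally:
    # Release the connection even if the read fails
    conn.close()

df_news_loaded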


|   | headline | timestamp_utc | source |
|---|----------|---------------|--------|
| 0 | Apple reports strong quarterly earnings | 2024-01-15 08:00:00 | Reuters |
| 1 | Apple faces antitrust scrutiny in the EU | 2024-01-15 18:30:00 | Financial Times |
| 2 | Amazon expands cloud infrastructure in India | 2024-01-16 06:00:00 | Bloomberg |
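
Since the emphasis here is on correctness, the visual comparison of the two tables can be backed by a programmatic check. The following is a minimal sketch of such a round-trip test, assuming an in-memory database and the illustrative names `df_original`, `df_roundtrip`, and `roundtrip_ok`; it is not part of the pipeline itself.

```python
import sqlite3

import pandas as pd

# Two rows of the sample data, repeated so the sketch is self-contained
df_original = pd.DataFrame({
    "headline": [
        "Apple reports strong quarterly earnings",
        "Apple faces antitrust scrutiny in the EU",
    ],
    "timestamp_utc": ["2024-01-15 08:00:00", "2024-01-15 18:30:00"],
    "source": ["Reuters", "Financial Times"],
})

# In-memory database so the check leaves no file behind
conn = sqlite3.connect(":memory:")
try:
    df_original.to_sql("raw_news", conn, if_exists="replace", index=False)
    df_roundtrip = pd.read_sql("SELECT * FROM raw_news", conn)
finally:
    conn.close()

# Compare row by row as plain Python dicts (column names and values)
roundtrip_ok = df_original.to_dict("records") == df_roundtrip.to_dict("records")
print(roundtrip_ok)
```

Comparing records as plain dictionaries keeps the check independent of how pandas represents string columns internally.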


## Raw vs Processed Data

In this project:
- **Raw data** is stored exactly as received from the source
- **Processed data** (cleaned text, sentiment scores, aligned dates)
  is created downstream and stored separately if needed

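This separation helps ensure that:
- preprocessing decisions are reversible, because the originals are never overwritten
- data leakage is easier to avoid, because every derived value can be traced back to its original timestamp
- experiments remain reproducible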
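
One way to picture this separation is a second table that holds derived columns while the raw table stays untouched. The sketch below is only an illustration under stated assumptions: the table name `processed_news` and the derived column `published_date` are hypothetical, and the actual processing steps are handled in later notebooks.

```python
import sqlite3

import pandas as pd

# Two rows of the sample data, repeated so the sketch is self-contained
df_raw = pd.DataFrame({
    "headline": [
        "Apple reports strong quarterly earnings",
        "Amazon expands cloud infrastructure in India",
    ],
    "timestamp_utc": ["2024-01-15 08:00:00", "2024-01-16 06:00:00"],
    "source": ["Reuters", "Bloomberg"],
})

conn = sqlite3.connect(":memory:")
try:
    # Raw table: written once, exactly as received
    df_raw.to_sql("raw_news", conn, if_exists="replace", index=False)

    # Processed table: derived from a fresh read of the raw table
    df_processed = pd.read_sql("SELECT * FROM raw_news", conn)
    df_processed["published_date"] = pd.to_datetime(
        df_processed["timestamp_utc"], utc=True
    ).dt.strftime("%Y-%m-%d")  # hypothetical derived column
    df_processed.to_sql("processed_news", conn, if_exists="replace", index=False)

    # Column names of each table (field 1 of each PRAGMA table_info row)
    raw_columns = [row[1] for row in conn.execute("PRAGMA table_info(raw_news)")]
    processed_columns = [
        row[1] for row in conn.execute("PRAGMA table_info(processed_news)")
    ]
finally:
    conn.close()

print(raw_columns)
print(processed_columns)
```

Because the derived column lives only in `processed_news`, a different processing choice can be applied later by rebuilding that table from `raw_news`.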


## Scope of This Notebook

This notebook demonstrates:
- the structure of incoming data
- basic ingestion logic
- responsible storage practices

It does not:
- fetch live data from APIs
- perform cleaning or sentiment analysis
- align data with market prices

These steps are handled in subsequent notebooks.
