# 02 - Data Cleaning

## Objective
Load the latest raw snapshots for each ticker, clean and standardise the structure, then save a single tidy dataset for analysis and modelling.

## Inputs
- Raw CSV snapshots in `data/raw/<version>/`
- Version label (e.g. `v1`), taken from `DEFAULT_VERSION` in `src.config`

## Outputs
- Cleaned dataset saved to `data/processed/<version>/clean_prices_<version>_<timestamp>.csv`
- Basic data quality checks (shape, missing values)
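
The outputs listed above could be produced along the following lines. This is a minimal sketch, not the notebook's actual implementation: the helper names `build_output_path` and `quality_report`, the `%Y%m%d_%H%M%S` timestamp format, and the column names in the example frame are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path

import pandas as pd


def build_output_path(processed_dir: Path, version: str, now: datetime) -> Path:
    """Return processed_dir/clean_prices_<version>_<timestamp>.csv (hypothetical helper)."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    return processed_dir / f"clean_prices_{version}_{timestamp}.csv"


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise the shape and the number of missing values per column (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_by_column": df.isna().sum().to_dict(),
    }


# Tiny illustrative frame; the column names are placeholders, not the real schema.
example = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
        "ticker": ["AAA", "AAA", "AAA"],
        "close": [10.0, None, 10.5],
    }
)

report = quality_report(example)
out_path = build_output_path(
    Path("data/processed/v1"), "v1", datetime(2024, 1, 5, 9, 30, 0)
)
print(report)
print(out_path)
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the file name reproducible when the function is tested.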

## CRISP-DM Stage
Data Preparation

In [1]:
# Make the project root importable (so `import src...` works in notebooks)
import sys
from pathlib import Path

# Assumes the working directory is jupyter_notebooks/, one level below the project root
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("Project root added to sys.path:", PROJECT_ROOT)

In [2]:
import pandas as pd

from src.config import DEFAULT_TICKERS, DEFAULT_VERSION, get_paths

In [None]:
VERSION = DEFAULT_VERSION
TICKERS = DEFAULT_TICKERS

paths = get_paths(VERSION)
RAW_DIR = paths.raw_dir
PROCESSED_DIR = paths.processed_dir
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print("Raw dir:", RAW_DIR)
print("Processed dir:", PROCESSED_DIR)
print("Tickers:", ", ".join(TICKERS))
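
With `RAW_DIR` and `TICKERS` defined, the next step in the objective is to find the latest raw snapshot for each ticker. The sketch below shows one way this might be done. It assumes a file naming scheme of `<TICKER>_<YYYYMMDD_HHMMSS>.csv`, which this notebook does not confirm, and `latest_snapshot` is a hypothetical helper rather than part of `src`. The demo writes placeholder files to a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_snapshot(raw_dir: Path, ticker: str) -> Optional[Path]:
    """Return the newest raw CSV for `ticker`, or None if there is none.

    Assumes files are named <TICKER>_<YYYYMMDD_HHMMSS>.csv, so the
    lexicographically largest name is also the most recent one.
    """
    candidates = sorted(raw_dir.glob(f"{ticker}_*.csv"))
    return candidates[-1] if candidates else None


# Demo with placeholder files in a temporary directory (not real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_raw_dir = Path(tmp)
    for name in (
        "AAA_20240101_090000.csv",
        "AAA_20240102_090000.csv",
        "BBB_20240101_090000.csv",
    ):
        (demo_raw_dir / name).write_text("date,close\n")

    latest = {t: latest_snapshot(demo_raw_dir, t) for t in ("AAA", "BBB", "CCC")}
    # Keep only the file names, since the directory is removed on exit.
    latest_names = {t: (p.name if p else None) for t, p in latest.items()}

print(latest_names)
```

In this notebook the equivalent call would be `latest_snapshot(RAW_DIR, ticker)` for each entry in `TICKERS`. Returning `None` for a ticker with no snapshot makes the gap explicit, so it can be reported before any cleaning starts.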