# 01 — Load & Explore Raw Data

This notebook is the starting point for the project.

**Goals:**
- Load the original **raw sales dataset** and the **make translation table**
- Normalise column headers for consistency
- Run some light exploratory checks on the raw data
- Save a verified copy of both CSVs into the `data/` folder for reproducibility

No cleaning or modelling happens here. The aim is to understand and lock in
the exact raw inputs that the rest of the pipeline depends on.


In [1]:
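import os

import pandas as pd
# Jupyter injects display() automatically; importing it makes the dependency explicit.
from IPython.display import display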

RAW_PATH = "data/raw_sales.csv"
TRANS_PATH = "data/translations.csv"

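# Make sure the output folder exists before anything is written to it.
os.makedirs("data", exist_ok=True)

# Fail fast with a clear message if either input file is missing.
if not os.path.exists(RAW_PATH):
    raise FileNotFoundError(f"Missing raw file: {RAW_PATH}")

if not os.path.exists(TRANS_PATH):
    raise FileNotFoundError(f"Missing translation file: {TRANS_PATH}")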


In [None]:
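# Load both inputs with pandas' default parsing.
raw_df = pd.read_csv(RAW_PATH)
trans_df = pd.read_csv(TRANS_PATH)

# Normalise headers to title case so both tables follow one naming convention.
# Note: str.title() changes letter case only; it does not strip whitespace.
raw_df.columns = raw_df.columns.str.title()
trans_df.columns = trans_df.columns.str.title()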

print("Raw shape:", raw_df.shape)
print("Translations shape:", trans_df.shape)

display(raw_df.head())
display(trans_df.head())
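
To make the effect of the header normalisation concrete, here is a small sketch on made-up headers. The header names are hypothetical and not taken from the real files; the `.str.strip()` variant is shown only as a possible extension, not as something this notebook does.

```python
import pandas as pd

# Hypothetical headers, chosen only to illustrate edge cases.
headers = pd.Index(["MAKE", "sale_price", " year", "model name"])

# What the notebook does: title-case each header.
titled = headers.str.title()

# Possible extension (an assumption, not part of the notebook):
# strip surrounding whitespace before title-casing.
stripped = headers.str.strip().str.title()

print(list(titled))    # the leading space in " year" is kept
print(list(stripped))  # the leading space is removed
```

Title-casing capitalises the letter after any non-letter character, so `sale_price` becomes `Sale_Price` rather than `Sale_price`.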


In [None]:
print("Null counts in raw data (10 columns with the most nulls):")
display(raw_df.isnull().sum().sort_values(ascending=False).head(10))
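
Raw null counts are easier to judge next to the share of rows they represent. The helper below is a minimal sketch of that idea; `null_summary` and the toy data are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, worst columns first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("null_count", ascending=False)


# Toy data for illustration only; the real check would pass raw_df instead.
toy = pd.DataFrame({
    "Make": ["A", None, "B", None],
    "Price": [1.0, 2.0, np.nan, 4.0],
    "Year": [2019, 2020, 2021, 2022],
})

summary = null_summary(toy)
print(summary)
```

In the notebook itself this could be called as `null_summary(raw_df).head(10)` to mirror the existing check.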


In [None]:
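RAW_VERIFIED_PATH = "data/raw_sales_verified.csv"
TRANS_VERIFIED_PATH = "data/translations_verified.csv"

# Write the header-normalised frames; index=False keeps the row index out of the files.
raw_df.to_csv(RAW_VERIFIED_PATH, index=False)
trans_df.to_csv(TRANS_VERIFIED_PATH, index=False)
print("Saved verified copies to:")
print(f"- {RAW_VERIFIED_PATH}")
print(f"- {TRANS_VERIFIED_PATH}")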
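
The notebook does not say how the saved copies are verified. One possible way to lock in exact inputs, offered here as an assumption and not as the project's method, is to record a SHA-256 checksum of each file and compare it on later runs. The helper name `sha256_of_file` and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; in the notebook the paths would be the saved CSVs.
content = b"Make,Price\nA,1\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as fh:
        fh.write(content)
    first = sha256_of_file(demo_path)
    second = sha256_of_file(demo_path)

# An unchanged file always hashes to the same value.
print(first == second)
```

Storing the resulting hex strings alongside the data would let later notebooks confirm they are reading byte-identical files.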
