# Hackathon 2025: End-to-end pipeline

This notebook reproduces the full pipeline that generates the forecasts and assembles the **JAN/2023** submission file.

**Expected outputs**:
- `reports/submission_final_JAN2023.csv`: final CSV with `;` as separator (UTF-8)
- `reports/submission_final_JAN2023.csv.gz`: gzip-compressed copy of the same CSV


## 1) Configuration
Adjust the parameters below if needed.

In [None]:
from pathlib import Path
import os, sys, subprocess, textwrap

# Repository URL (change it if you use a fork)
REPO_URL = "https://github.com/Guedes1981/hackathon-forecast-2025.git"
WORKDIR  = Path("/content/drive/MyDrive/hackathon-forecast-2025") if Path("/content").exists() else Path.cwd()/"hackathon-forecast-2025"

print("WORKDIR:", WORKDIR)


## 2) Get the code (clone or update)

In [None]:
import subprocess, os
if not WORKDIR.exists():
    subprocess.run(["git", "clone", REPO_URL, str(WORKDIR)], check=True)
else:
    subprocess.run(["git", "-C", str(WORKDIR), "pull"], check=True)
print("OK - code available at:", WORKDIR)
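
The cell above decides between `clone` and `pull` only by checking whether the folder exists, so a folder that exists but is not a git repository would make `git pull` fail. A minimal sketch of a stricter decision, assuming a hypothetical helper named `git_sync_command` (not part of the project) that looks for the `.git` directory instead:

```python
from pathlib import Path


def git_sync_command(repo_url, workdir):
    """Return the git command that brings `workdir` in sync with `repo_url`.

    Hypothetical helper for illustration; it is not part of the project repository.
    """
    workdir = Path(workdir)
    if (workdir / ".git").is_dir():
        # Existing clone: update it, refusing anything but a fast-forward
        return ["git", "-C", str(workdir), "pull", "--ff-only"]
    # No clone yet: clone from scratch (git reports a clear error
    # if the target folder exists and is not empty)
    return ["git", "clone", repo_url, str(workdir)]
```

The returned list can be passed straight to `subprocess.run(..., check=True)`, as in the cell above.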

## 3) Install dependencies
Tries to install from the project's `requirements.txt`. If that file is missing, falls back to a default `requirements.txt`.

In [None]:
req_proj = WORKDIR/"requirements.txt"
if req_proj.exists():
    print("Installing project dependencies…")
    subprocess.run([sys.executable, "-m", "pip", "install", "-r", str(req_proj)], check=True)
else:
    # Fallback: use the requirements.txt shipped alongside this notebook
    from pathlib import Path
    fallback_reqs = Path("requirements.txt") if Path("requirements.txt").exists() else Path("/mnt/data/requirements.txt")
    print("Installing default dependencies… (fallback)")
    subprocess.run([sys.executable, "-m", "pip", "install", "-r", str(fallback_reqs)], check=True)
print("OK - dependencies installed")
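
A quick check that the installation produced an importable environment can save a failed run later. The sketch below uses only the standard library. The module list is an assumption inferred from the later cells (`pandas` for the CSV step, a parquet engine such as `pyarrow` for `read_parquet`, and `prophet` for the training script), so adjust it to the project's actual `requirements.txt`.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of modules; adjust it to match the project's requirements.txt
required = ["pandas", "pyarrow", "prophet"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```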

## 4) (Optional) Run the Prophet validation (val4)
Generates `data/processed/prophet_topN_val4_preds.parquet` and the validation metrics.

In [None]:
cmd = [
    sys.executable, "-u", str(WORKDIR/"src"/"train_prophet_topn.py"),
    "--top_n", "200",
    "--changepoint_prior_scale", "0.8",
    "--val_split", "val4",
    "--out_parquet", str(WORKDIR/"data/processed/prophet_topN_val4_preds.parquet"),
]
print("Running:", " ".join(cmd))
try:
    subprocess.run(cmd, check=True)
except subprocess.CalledProcessError as e:
    print("[Warning] Validation failed (continuing):", e)
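
The notebook does not say which metrics the validation script reports. As an illustration only, the sketch below computes WMAPE (sum of absolute errors divided by the sum of absolute actuals), a common choice for demand forecasts. The function name and the toy values are assumptions and are not taken from the project.

```python
def wmape(actual, predicted):
    """Weighted MAPE: sum of absolute errors divided by sum of absolute actuals."""
    actual = list(actual)
    predicted = list(predicted)
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("WMAPE is undefined when all actual values are zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total


# Toy values for illustration, not taken from the project data
print(wmape([10, 20, 30], [12, 18, 30]))
```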

## 5) Run production (Jan/2023) with Prophet
Uses the `--predict_jan2023` flag implemented in the script.

In [None]:
out_jan = WORKDIR/"data/processed/prophet_topN_jan2023_preds.parquet"
cmd = [
    sys.executable, "-u", str(WORKDIR/"src"/"train_prophet_topn.py"),
    "--top_n", "200", "--changepoint_prior_scale", "0.8",
    "--predict_jan2023",
    "--out_parquet", str(out_jan),
]
print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)
print("OK - Prophet Jan/2023:", out_jan)

## 6) Ensemble
Generates `data/processed/forecast_ensemble_jan2023.parquet` (uses the script's default paths).

In [None]:
cmd = [sys.executable, "-u", str(WORKDIR/"src"/"forecast_ensemble.py")]
print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)
print("OK - ensemble generated")
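
Because the ensemble script writes to its own default paths, a zero exit code does not by itself show that the file the next step reads was produced. A minimal sketch of a guard, assuming a hypothetical helper named `require_nonempty_file`:

```python
from pathlib import Path


def require_nonempty_file(path):
    """Return the file size in bytes, or raise if the file is missing or empty.

    Hypothetical helper for illustration; it is not part of the project repository.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected output not found: {path}")
    size = path.stat().st_size
    if size == 0:
        raise ValueError(f"Expected output is empty: {path}")
    return size
```

In this notebook it could be called as `require_nonempty_file(WORKDIR/"data/processed/forecast_ensemble_jan2023.parquet")` before moving on to step 7.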

## 7) Build the final CSV (`;`, UTF-8)
Creates `reports/submission_final_JAN2023.csv` in the required format: `semana;pdv;produto;quantidade`.

In [None]:
import pandas as pd
from pathlib import Path
ens_pq = WORKDIR/"data/processed/forecast_ensemble_jan2023.parquet"
out_csv = WORKDIR/"reports/submission_final_JAN2023.csv"
out_gz  = WORKDIR/"reports/submission_final_JAN2023.csv.gz"

df = pd.read_parquet(ens_pq)
# Normalize column names
ren = {}
for c in df.columns:
    lc = c.lower()
    if lc in ("sku","sku_id","produto_id","product","product_id"): ren[c] = "produto"
    elif lc in ("pdv","pdv_id","store","store_id"): ren[c] = "pdv"
    elif lc in ("date","dt","ds","semana"): ren[c] = "ds"
    elif lc in ("pred","prediction","forecast","y_pred","yhat_prophet","yhat_baseline","ens_pred","yhat_ensemble","yhat"): ren[c] = "yhat"
df = df.rename(columns=ren)

# Extract the keys from "id" if needed (expected format: "pdv|produto")
if "id" in df.columns and ("pdv" not in df.columns or "produto" not in df.columns):
    ids = df["id"].astype(str).str.split("|", n=1, expand=True)
    if ids.shape[1] == 2:
        df["pdv"], df["produto"] = ids[0], ids[1]

# Types and dates
for k in ("pdv","produto"): df[k] = df[k].astype(str)
df["ds"] = pd.to_datetime(df.get("ds"), errors="coerce")
# Drop any timezone so the date filters below work on naive timestamps
try:
    df["ds"] = df["ds"].dt.tz_localize(None)
except Exception:
    pass

# Fallback: 4-period moving average when there is no yhat column
if "yhat" not in df.columns:
    keys = ["produto", "pdv"]
    # Use the observed quantity column; fail clearly if none is present
    src = next((c for c in ("quantidade", "y") if c in df.columns), None)
    if src is None:
        raise KeyError("No 'yhat', 'quantidade' or 'y' column to build the forecast from")
    aux = df.sort_values(keys + ["ds"])
    ma4 = aux.groupby(keys, sort=False)[src].rolling(4, min_periods=1).mean().reset_index(level=[0, 1], drop=True)
    # Assign by index so the values line up with the original (unsorted) row order
    df["yhat"] = ma4

# Keep the JAN/2023 weeks (days 02, 09, 16, 23) and map them to 1..4
wmap = {2:1, 9:2, 16:3, 23:4}
mask_jan = (df["ds"].dt.month == 1) & (df["ds"].dt.year == 2023) & (df["ds"].dt.day.isin(wmap))
dfj = df.loc[mask_jan, ["ds","pdv","produto","yhat"]].copy()
dfj["semana"] = dfj["ds"].dt.day.map(wmap)
dfj["quantidade"] = (dfj["yhat"].clip(lower=0).round().astype(int))
dfj = dfj[["semana","pdv","produto","quantidade"]]
dfj = dfj.sort_values(["semana","pdv","produto"]).reset_index(drop=True)

print("Rows:", len(dfj))
print("Weeks:", dfj["semana"].min(), "->", dfj["semana"].max())
print("Nulls per column:", dfj.isna().sum().to_dict())

out_csv.parent.mkdir(parents=True, exist_ok=True)
dfj.to_csv(out_csv, sep=';', index=False, encoding='utf-8')
dfj.to_csv(out_gz, sep=';', index=False, encoding='utf-8', compression='gzip')
print("OK - saved:", out_csv)
print("OK - saved (gz):", out_gz)
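
Before submitting, it can help to read the CSV back and confirm that it matches the format described above: the `semana;pdv;produto;quantidade` header, weeks between 1 and 4, and non-negative integer quantities. The sketch below uses only the standard `csv` module. The function name `validate_submission` is hypothetical, and the rules it enforces are inferred from this notebook rather than from an official specification.

```python
import csv

EXPECTED_HEADER = ["semana", "pdv", "produto", "quantidade"]
VALID_WEEKS = {"1", "2", "3", "4"}


def validate_submission(path):
    """Check header, week range and quantities; return the number of data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=";")
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            raise ValueError(f"Unexpected header: {header}")
        n_rows = 0
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 4:
                raise ValueError(f"Line {line_no}: expected 4 fields, got {len(row)}")
            semana, _pdv, _produto, quantidade = row
            if semana not in VALID_WEEKS:
                raise ValueError(f"Line {line_no}: invalid week {semana!r}")
            # Quantities must be plain non-negative integers such as "0" or "17"
            if not (quantidade.isascii() and quantidade.isdigit()):
                raise ValueError(f"Line {line_no}: invalid quantity {quantidade!r}")
            n_rows += 1
    return n_rows
```

Calling `validate_submission(out_csv)` after the cell above should return the same count printed as `Rows:`.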

## 8) Final checks (hashes)

In [None]:
import hashlib
def file_hash(path, algo="md5"):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024*1024), b""):
            h.update(chunk)
    return h.hexdigest()

csv_path = WORKDIR/"reports/submission_final_JAN2023.csv"
gz_path  = WORKDIR/"reports/submission_final_JAN2023.csv.gz"
print("MD5 csv   :", file_hash(csv_path, "md5"))
print("MD5 gz    :", file_hash(gz_path, "md5"))
print("SHA256 csv:", file_hash(csv_path, "sha256"))
print("SHA256 gz :", file_hash(gz_path, "sha256"))
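
The hashes above identify each file, but they do not show that the `.gz` holds the same bytes as the plain CSV. The hash of the `.gz` file itself may also change between runs, since the gzip header can embed a timestamp. A minimal sketch of a content-level comparison, with hypothetical helper names:

```python
import gzip
import hashlib


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in chunks so large files are not loaded at once."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def gz_matches_plain(plain_path, gz_path):
    """Return True if the decompressed .gz content equals the plain file."""
    with open(plain_path, "rb") as plain, gzip.open(gz_path, "rb") as packed:
        return sha256_of_stream(plain) == sha256_of_stream(packed)
```

With the variables from the cell above, `gz_matches_plain(csv_path, gz_path)` is expected to return `True`.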

## 9) (Optional) Commit/push/tag the final artifact
Run this cell if you want to version the generated artifacts.

In [None]:
import subprocess
subprocess.run(["git", "-C", str(WORKDIR), "add", "README.md", "reports/submission_final_JAN2023.csv", "reports/submission_final_JAN2023.csv.gz"], check=False)
subprocess.run(["git", "-C", str(WORKDIR), "commit", "-m", "feat: final submission artifact (JAN/2023)"], check=False)
subprocess.run(["git", "-C", str(WORKDIR), "push"], check=False)
subprocess.run(["git", "-C", str(WORKDIR), "tag", "-f", "v-final-jan2023"], check=False)
subprocess.run(["git", "-C", str(WORKDIR), "push", "-f", "origin", "v-final-jan2023"], check=False)
print("Done - versioning commands issued (they only succeed if the repository is authenticated)")