# 0. PopForecast — Baseline Modeling

Purpose:
    
    This notebook implements the Cycle 1 baseline evaluation for PopForecast.
    It loads the processed modeling dataset, applies the defined evaluation protocol for Cycle 1, and computes baseline metrics under two split strategies:

        1) Random split (primary benchmark)
        2) Temporal split (diagnostic)

    The goal is not to optimize models, but to validate:
        - the evaluation protocol,
        - the metric suite,
        - the behavior of simple baselines,
        - and the impact of temporal drift.

    Outputs:
        - baseline metrics tables
        - segmented MAE diagnostics (zero vs positive popularity)
        - consolidated comparison
        - Cycle 1 decisions and next steps

# 1. Setup

## 1.1 - Imports

In [1]:
from __future__ import annotations

# --- Project root setup ---
import sys
from pathlib import Path

# Add project root so that `src/` can be imported when running from notebooks/
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

# --- Standard libraries ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# --- ML / Metrics ---
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# --- Project modules ---
from src.core.preprocessing import run_preprocessing, default_config

## 1.2 - Global settings

In [2]:
# --- Reproducibility (use only when sampling / splitting inside the notebook) ---
TEST_SIZE = 0.2
RANDOM_STATE = 42

# --- Pandas display ---
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)
pd.set_option("display.max_colwidth", 60)
pd.set_option("display.float_format", "{:,.4f}".format)

# --- Matplotlib defaults (lightweight) ---
plt.rcParams["figure.figsize"] = (10, 4)
plt.rcParams["axes.grid"] = True

## 1.3 - Project paths

In [3]:
DATA_PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "spotify_tracks_modeling.parquet"

print("Project root:", PROJECT_ROOT)
print("Processed dataset:", DATA_PROCESSED_PATH)

Project root: /mnt/c/Users/Daniel/OneDrive/Documentos/_Cursos/Outros/PopForecast
Processed dataset: /mnt/c/Users/Daniel/OneDrive/Documentos/_Cursos/Outros/PopForecast/data/processed/spotify_tracks_modeling.parquet


# 2. Load Processed Dataset (Parquet)

We use the processed Parquet dataset to guarantee **type stability** across runs and environments (downcasting is preserved).
If the file is missing, we trigger the preprocessing pipeline to regenerate it from the raw CSV.
This keeps the notebook reproducible while keeping preprocessing logic out of the notebook.
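
The type-stability argument can be illustrated without the project data. The sketch below uses a tiny made-up frame (two column names borrowed from the dataset, values invented) and shows that a CSV round trip discards downcast dtypes, which is the failure mode a typed format such as Parquet avoids.

```python
import io

import pandas as pd

# Toy frame with downcast dtypes (values are invented for illustration).
toy = pd.DataFrame({
    "song_popularity": pd.Series([0, 33, 16], dtype="int16"),
    "danceability": pd.Series([0.284, 0.625, 0.789], dtype="float32"),
})

# CSV stores plain text, so dtypes are re-inferred when the file is read back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print("before:", toy.dtypes.astype(str).to_dict())
print("after :", roundtrip.dtypes.astype(str).to_dict())
```

The in-memory buffer stands in for a file on disk; the dtype inference behaves the same way for a real CSV path.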

In [4]:
# Ensure processed dataset exists
if not DATA_PROCESSED_PATH.exists():
    print("Processed dataset not found. Running preprocessing...")
    run_preprocessing(default_config(PROJECT_ROOT))

# Load processed dataset
df = pd.read_parquet(DATA_PROCESSED_PATH)
df.sample(5)

Unnamed: 0,song_popularity,album_release_year,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,song_explicit,speechiness,tempo,time_signature,total_available_markets,valence,release_year_missing_or_suspect
126289,33,2015,0.632,0.625,97800.0,0.485,0.0,0,0.211,-10.607,1,False,0.478,86.918,4,169,0.461,False
431989,0,2018,0.0,0.284,185327.0,0.644,0.0002,0,0.337,-10.222,0,False,0.0385,138.265,4,169,0.437,False
404047,0,2019,0.314,0.729,162560.0,0.552,0.0,1,0.0907,-10.272,1,True,0.0699,120.034,4,170,0.158,False
250982,16,2018,0.63,0.789,179221.0,0.436,0.0006,8,0.109,-8.447,0,True,0.278,153.963,4,170,0.837,False
369787,2,2017,0.609,0.699,363760.0,0.435,0.924,9,0.116,-13.083,1,False,0.0283,128.018,4,160,0.0867,False


# 3. Quick Sanity Checks

In [5]:
print("Shape:", df.shape)
print("\nSchema:")
print(df.dtypes)

print("\nMissing values:")
print(df.isna().sum())

print("\nTarget range:")
print("min:", df["song_popularity"].min(), "max:", df["song_popularity"].max())

Shape: (439865, 18)

Schema:
song_popularity                      int16
album_release_year                   Int16
acousticness                       float32
danceability                       float32
duration_ms                        float32
energy                             float32
instrumentalness                   float32
key                                   int8
liveness                           float32
loudness                           float32
mode                                  int8
song_explicit                         bool
speechiness                        float32
tempo                              float32
time_signature                        int8
total_available_markets              int16
valence                            float32
release_year_missing_or_suspect       bool
dtype: object

Missing values:
song_popularity                     0
album_release_year                 22
acousticness                        0
danceability                        0
duration_ms   

# 4. Evaluation Protocol Definition

This notebook uses two evaluation strategies:

## 4.1 - Random Split (primary benchmark)
Assumes i.i.d. sampling and provides a stable baseline for model comparison.

## 4.2 - Temporal Split (diagnostic)
Simulates a realistic scenario where the model trains on past data and predicts future data.  
Rows without `album_release_year` are excluded from this split. The random split also drops them (22 rows in total) because `LinearRegression` does not accept NaNs.

## 4.3 - Implementing the Evaluation Protocol (Train/Test Splits)

This section implements the two evaluation strategies defined above:

- **Random Split (primary benchmark)**  
  A simple 80/20 shuffled split, independent of release‑year inconsistencies.

- **Temporal Split (diagnostic)**  
  A best‑effort temporal evaluation using an explicit cutoff year.  
  Rows without a valid `album_release_year` are excluded from this split; the 22 affected rows are also dropped from the random split so that every model sees the same data.

No validation set is used in Cycle 1, since no hyperparameter tuning is performed.

In [43]:
# -----------------------------
# Random Split (primary benchmark)
# -----------------------------
X = df.drop(columns=["song_popularity"])
y = df["song_popularity"].astype("int16")

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    shuffle=True,
)

# --- Minimal hygiene: remove rows with NaN in features ---
# LinearRegression does not accept NaNs.
# In this dataset, missingness is expected only in album_release_year (few rows).
# We drop rows with missing year for the random split to keep models comparable.
missing_year_train = X_train_r["album_release_year"].isna()
missing_year_test = X_test_r["album_release_year"].isna()

X_train_r = X_train_r.loc[~missing_year_train].copy()
y_train_r = y_train_r.loc[X_train_r.index]

X_test_r = X_test_r.loc[~missing_year_test].copy()
y_test_r = y_test_r.loc[X_test_r.index]

print("Dropped (random split) due to missing year:",
      int(missing_year_train.sum() + missing_year_test.sum()))

print("\nRandom split:")
print("  Train:", X_train_r.shape)
print("  Test :", X_test_r.shape)


# -----------------------------
# Temporal Split (diagnostic)
# -----------------------------
# Remove rows without a valid release year
df_temp = df.dropna(subset=["album_release_year"]).copy()

YEAR_CUTOFF = 2020  # explicit, simple, defensible

train_mask = df_temp["album_release_year"] <= YEAR_CUTOFF
test_mask  = df_temp["album_release_year"] >  YEAR_CUTOFF

X_train_t = df_temp.loc[train_mask].drop(columns=["song_popularity"])
y_train_t = df_temp.loc[train_mask, "song_popularity"]

X_test_t  = df_temp.loc[test_mask].drop(columns=["song_popularity"])
y_test_t  = df_temp.loc[test_mask, "song_popularity"]

print("Temporal split:")
print("  Train:", X_train_t.shape)
print("  Test :", X_test_t.shape)

Dropped (random split) due to missing year: 22

Random split:
  Train: (351873, 17)
  Test : (87970, 17)
Temporal split:
  Train: (389071, 17)
  Test : (50772, 17)


In [42]:
def zero_share(y: pd.Series) -> float:
    return float((y == 0).mean())

print(f"Zero share (random)   - train: {zero_share(y_train_r)*100:.3f}%   test: {zero_share(y_test_r)*100:.3f}%")
print(f"Zero share (temporal) - train: {zero_share(y_train_t)*100:.3f}%   test: {zero_share(y_test_t)*100:.3f}%")


Zero share (random)   - train: 13.382%   test: 13.483%
Zero share (temporal) - train: 12.036%   test: 23.869%


# 5. Metrics

- MAE (primary metric)
- RMSE
- R² (context only)
- Segmented MAE:
    - MAE for popularity == 0
    - MAE for popularity > 0
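
For reference, the metrics are defined as follows, with $y_i$ the true popularity, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the true values in the evaluated set, and $n$ the number of evaluated rows. These are the standard definitions, which is also what the scikit-learn functions imported above compute.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

The segmented MAE applies the MAE formula separately to the subsets $\{i : y_i = 0\}$ and $\{i : y_i > 0\}$.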

In [7]:
def regression_metrics(y_true, y_pred):
    return {
        "mae":  mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "r2":   r2_score(y_true, y_pred),
    }

def segmented_mae(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    zero_mask = y_true == 0
    pos_mask  = y_true > 0

    return {
        "mae_zero": mean_absolute_error(y_true[zero_mask], y_pred[zero_mask]) if zero_mask.any() else np.nan,
        "mae_pos":  mean_absolute_error(y_true[pos_mask],  y_pred[pos_mask])  if pos_mask.any()  else np.nan,
    }

# 6. Baseline 0 — Constant Predictor (training median)

## 6.1 - Baseline 0 on Random Split

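This baseline predicts a **single constant value** for all observations: the **median popularity in the training set**. The median is a natural choice because, among constant predictions, it minimizes MAE (the primary metric) on the data it is computed from.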

This gives us a minimal benchmark to validate:
- the correctness of the evaluation protocol,
- the metric suite,
- and the behavior of a naïve predictor.

If a more complex model cannot outperform this baseline, it is not learning meaningful structure.
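
As a quick illustration of why the median is the reference constant for an MAE-first protocol, the sketch below compares the median and the mean as constant predictors on a small invented, zero-heavy target (not project data) and scans a grid of alternative constants.

```python
import numpy as np

# Skewed toy target with several zeros (invented values, not project data).
y = np.array([0, 0, 0, 5, 12, 20, 25, 31, 48, 90], dtype=float)


def constant_mae(y_true: np.ndarray, c: float) -> float:
    """MAE obtained by predicting the constant c for every row."""
    return float(np.mean(np.abs(y_true - c)))


mae_median = constant_mae(y, float(np.median(y)))
mae_mean = constant_mae(y, float(np.mean(y)))

# Scan candidate constants across the target range; none should beat the median.
grid = np.linspace(y.min(), y.max(), 181)
best_on_grid = min(constant_mae(y, float(c)) for c in grid)

print(f"MAE at median: {mae_median:.3f}")
print(f"MAE at mean:   {mae_mean:.3f}")
print(f"Best on grid:  {best_on_grid:.3f}")
```

The guarantee holds on the data the median is computed from; on a held-out test set the training median is only approximately optimal.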

In [20]:
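# Baseline: median predictor.
# np.full with a float dtype avoids silently truncating a non-integer median,
# which np.full_like would do because y_test_r is int16.
y_pred_r = np.full(len(y_test_r), y_train_r.median(), dtype=float)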

metrics_random = {
    **regression_metrics(y_test_r, y_pred_r),
    **segmented_mae(y_test_r, y_pred_r),
}

pd.DataFrame([{"split": "random", **metrics_random}])

Unnamed: 0,split,mae,rmse,r2,mae_zero,mae_pos
0,random,15.209,18.6765,-0.0221,20.0,14.4623
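
The slightly negative R² is expected for a constant predictor and is not a sign of a bug. For any constant $c$, the residual sum of squares equals the total sum of squares plus $n(\bar{y} - c)^2$, so $R^2 = -n(\bar{y} - c)^2 / \sum_i (y_i - \bar{y})^2 \le 0$, with equality only when $c$ is the mean of the evaluated set. A small check on a synthetic right-skewed target (the gamma parameters are arbitrary assumptions):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Hypothetical right-skewed target clipped to the 0-100 popularity range.
y_test = np.clip(rng.gamma(shape=1.5, scale=15.0, size=1_000), 0, 100)


def constant_r2(y_true: np.ndarray, c: float) -> float:
    """R² of predicting the constant c for every row."""
    return float(r2_score(y_true, np.full(len(y_true), c, dtype=float)))


r2_at_mean = constant_r2(y_test, float(y_test.mean()))
r2_at_median = constant_r2(y_test, float(np.median(y_test)))

print(f"R² at test mean:   {r2_at_mean:.4f}")
print(f"R² at test median: {r2_at_median:.4f}")
```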


## 6.2 - Baseline 0 on Temporal Split

In [21]:
# Float dtype avoids truncating a non-integer median (y_test_t is int16).
y_pred_t = np.full(len(y_test_t), y_train_t.median(), dtype=float)

metrics_temporal = {
    **regression_metrics(y_test_t, y_pred_t),
    **segmented_mae(y_test_t, y_pred_t),
}

pd.DataFrame([{"split": "temporal", **metrics_temporal}])

Unnamed: 0,split,mae,rmse,r2,mae_zero,mae_pos
0,temporal,15.6517,18.3903,-0.0038,20.0,14.2884


# 7. Baseline 1 — Simple and Interpretable Model

This baseline uses a **Linear Regression (OLS)** model.

Rationale:
- It is simple, transparent, and easy to audit.
- It provides a meaningful step above the constant predictor.
- It helps verify whether the features contain linear signal.
- No hyperparameter tuning is performed in Cycle 1.

The model is evaluated under the same two split strategies:
- Random split (primary benchmark)
- Temporal split (diagnostic)

## 7.1 - Linear Regression on Random Split

In [22]:
# Fit model
linreg_r = LinearRegression()
linreg_r.fit(X_train_r, y_train_r)

# Predict
y_pred_r_lr = linreg_r.predict(X_test_r)

# Metrics
metrics_random_lr = {
    **regression_metrics(y_test_r, y_pred_r_lr),
    **segmented_mae(y_test_r, y_pred_r_lr),
}

pd.DataFrame([{"split": "random", **metrics_random_lr}])

Unnamed: 0,split,mae,rmse,r2,mae_zero,mae_pos
0,random,14.8511,17.9222,0.0588,21.5907,13.8008


## 7.2 - Linear Regression on Temporal Split

In [23]:
linreg_t = LinearRegression()
linreg_t.fit(X_train_t, y_train_t)

y_pred_t_lr = linreg_t.predict(X_test_t)

metrics_temporal_lr = {
    **regression_metrics(y_test_t, y_pred_t_lr),
    **segmented_mae(y_test_t, y_pred_t_lr),
}

pd.DataFrame([{"split": "temporal", **metrics_temporal_lr}])

Unnamed: 0,split,mae,rmse,r2,mae_zero,mae_pos
0,temporal,15.3828,18.2145,0.0153,19.9086,13.9638


# 8. Baseline 2 — Tree/Boosting Model (strong baseline)

This baseline uses a **Random Forest Regressor**, a non‑linear ensemble model that
typically captures richer structure than linear models.

Rationale:
- Handles complex interactions and non‑linearities.
- Requires no feature scaling.
- Provides a strong, widely‑used baseline without hyperparameter tuning.
- Helps determine whether the dataset contains meaningful non‑linear signal.

No tuning is performed in Cycle 1.  
We use a small, stable configuration to avoid overfitting and keep training time reasonable.

## 8.1 - Random Forest on Random Split

In [12]:
rf_r = RandomForestRegressor(
    n_estimators=50,
    max_depth=12,
    min_samples_leaf=5,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

rf_r.fit(X_train_r, y_train_r)
y_pred_r_rf = rf_r.predict(X_test_r)

metrics_random_rf = {
    **regression_metrics(y_test_r, y_pred_r_rf),
    **segmented_mae(y_test_r, y_pred_r_rf),
}

pd.DataFrame([{"split": "random", **metrics_random_rf}])

Unnamed: 0,split,mae,rmse,r2,mae_zero,mae_pos
0,random,13.5357,16.685,0.1843,18.6831,12.7335


## 8.2 - Random Forest on Temporal Split

In [13]:
rf_t = RandomForestRegressor(
    n_estimators=50,
    max_depth=12,
    min_samples_leaf=5,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

rf_t.fit(X_train_t, y_train_t)
y_pred_t_rf = rf_t.predict(X_test_t)

metrics_temporal_rf = {
    **regression_metrics(y_test_t, y_pred_t_rf),
    **segmented_mae(y_test_t, y_pred_t_rf),
}

pd.DataFrame([{"split": "temporal", **metrics_temporal_rf}])

Unnamed: 0,split,mae,rmse,r2,mae_zero,mae_pos
0,temporal,15.1799,18.2344,0.0131,20.581,13.4865


## 8.3 - Consolidated Comparison 

In [18]:
comparison_random = pd.DataFrame([
    {"split": "random", "model": "Baseline 0 — Constant", **metrics_random},
    {"split": "random", "model": "Baseline 1 — Linear Regression", **metrics_random_lr},
    {"split": "random", "model": "Baseline 2 — Random Forest", **metrics_random_rf},
])

comparison_temporal = pd.DataFrame([
    {"split": "temporal", "model": "Baseline 0 — Constant", **metrics_temporal},
    {"split": "temporal", "model": "Baseline 1 — Linear Regression", **metrics_temporal_lr},
    {"split": "temporal", "model": "Baseline 2 — Random Forest", **metrics_temporal_rf},
])

comparison_all = pd.concat(
    [comparison_random, comparison_temporal],
    ignore_index=True
)

comparison_all

Unnamed: 0,split,model,mae,rmse,r2,mae_zero,mae_pos
0,random,Baseline 0 — Constant,15.209,18.6765,-0.0221,20.0,14.4623
1,random,Baseline 1 — Linear Regression,14.8511,17.9222,0.0588,21.5907,13.8008
2,random,Baseline 2 — Random Forest,13.5357,16.685,0.1843,18.6831,12.7335
3,temporal,Baseline 0 — Constant,15.6517,18.3903,-0.0038,20.0,14.2884
4,temporal,Baseline 1 — Linear Regression,15.3828,18.2145,0.0153,19.9086,13.9638
5,temporal,Baseline 2 — Random Forest,15.1799,18.2344,0.0131,20.581,13.4865


## 8.4 - Interpretation Notes 

### 1. Random Split
The Random Forest clearly outperforms both the constant baseline and the linear model:

- **MAE improves from 15.21 → 14.85 → 13.54**
- **RMSE improves from 18.68 → 17.92 → 16.69**
- **R² improves from –0.02 → 0.06 → 0.18**

This confirms that:
- the dataset contains **non‑linear signal**,  
- linear models capture only part of it,  
- and the Random Forest is a meaningful strong baseline.

Segmented MAE shows:
- **MAE_zero improves** (21.59 → 18.68), meaning the RF handles zero‑popularity tracks better than the linear model.
- **MAE_pos improves** (13.80 → 12.73), meaning the RF captures structure in positive‑popularity tracks.

Overall, the random split behaves as expected:  
**non‑linear models extract more structure than linear ones.**

---

### 2. Temporal Split
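Compared with the random split, MAE is higher for every model and the gap between models nearly closes (values listed as Constant → Linear Regression → Random Forest):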

- MAE: **15.65 → 15.38 → 15.18**
- RMSE: **18.39 → 18.21 → 18.23**
- R²: **–0.0038 → 0.015 → 0.013**

Key observations:

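- The Random Forest **still has the lowest MAE**, but by a much smaller margin; on RMSE and R² the linear model is marginally ahead (18.21 vs 18.23 and 0.015 vs 0.013).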
- The linear model barely improves over the constant baseline.
- The RF’s R² drops from **0.18 (random)** to **0.013 (temporal)**.

This pattern is consistent with **temporal drift**:
- relationships between features and popularity may change over time,
- the target distribution shifts: the zero share roughly doubles, from 12.0% in the temporal training set to 23.9% in the temporal test set,
- there is a clear generalization gap when training on releases up to 2020 and testing on 2021 releases (the only year present in the temporal test set, see Section 9.4).

Segmented MAE reinforces this:
- **MAE_zero stays high** (≈ 20 for all three models), meaning zero‑popularity tracks remain unpredictable.
- **MAE_pos improves only slightly** from Linear Regression to Random Forest (13.96 → 13.49), less than half the gain seen in the random split (13.80 → 12.73).

This suggests that:
- the non‑linear structure learned from releases up to 2020 does not transfer well to 2021 releases,
- the dataset likely contains **year‑dependent patterns** that the model cannot capture.

---

### 3. What this means for Cycle 2

These diagnostics point to three clear directions:

1. **Temporal modeling is necessary**  
   - models need to exploit release year more effectively (`album_release_year` is already a feature, but 2021 lies outside the training range),  
   - or use rolling/expanding windows,  
   - or include temporal embeddings.

2. **Zero‑inflation is a structural challenge**  
   - zero‑popularity tracks behave differently,  
   - a two‑stage model (classification → regression) may help.

3. **Non‑linear models are justified**  
   - Random Forest already shows meaningful gains,  
   - boosting models (LightGBM, XGBoost, CatBoost) are natural next candidates and may outperform it.

---

### 4. Summary
- Random split: **clear non‑linear signal**, RF performs strongly.  
- Temporal split: **drift dominates**, RF gains shrink dramatically.  
- Zero‑popularity tracks remain the hardest segment.  
- Cycle 2 should focus on **temporal robustness** and **zero‑inflation handling**.

# 9. Error Diagnostics

This section provides a lightweight diagnostic of model errors to understand
where the baselines struggle and what patterns may guide Cycle 2 modeling.

The goal is **not** to perform deep statistical analysis or tuning, but to
identify broad structural issues such as:

- systematic under/over‑prediction,
- sensitivity to zero‑inflated targets,
- temporal drift,
- non‑linear patterns missed by simpler models.

Diagnostics are computed for the strongest baseline (Random Forest), since it
achieves the lowest MAE among the Cycle 1 baselines on both splits.

## 9.1 - Residual Distribution (Random Split)

In [25]:
residuals_r = y_test_r - y_pred_r_rf

pd.DataFrame({
    "mean_residual": [residuals_r.mean()],
    "median_residual": [residuals_r.median()],
    "std_residual": [residuals_r.std()],
    "min_residual": [residuals_r.min()],
    "max_residual": [residuals_r.max()],
})

Unnamed: 0,mean_residual,median_residual,std_residual,min_residual,max_residual
0,-0.1172,-1.9483,16.6847,-46.8608,65.7655


## 9.2 - Residual Distribution (Temporal Split)

In [24]:
residuals_t = y_test_t - y_pred_t_rf

pd.DataFrame({
    "mean_residual": [residuals_t.mean()],
    "median_residual": [residuals_t.median()],
    "std_residual": [residuals_t.std()],
    "min_residual": [residuals_t.min()],
    "max_residual": [residuals_t.max()],
})

Unnamed: 0,mean_residual,median_residual,std_residual,min_residual,max_residual
0,-3.123,-5.4727,17.9651,-48.3888,73.8522


## 9.3 - Error vs. True Popularity (Binned MAE)
*(helps detect whether the model struggles with specific popularity ranges)*

In [30]:
bins = [0, 5, 20, 40, 60, 80, 100]
labels = ["0–5", "5–20", "20–40", "40–60", "60–80", "80–100"]

df_err_r = pd.DataFrame({
    "y_true": y_test_r,
    "y_pred": y_pred_r_rf,
})
df_err_r["bin"] = pd.cut(df_err_r["y_true"], bins=bins, labels=labels, include_lowest=True)

binned_mae_r = (
    df_err_r
    .groupby("bin", observed=False)[["y_true", "y_pred"]]
    .apply(lambda g: mean_absolute_error(g["y_true"], g["y_pred"]))
)

binned_mae_r.to_frame(name="MAE_random_split")

bin,MAE_random_split
0–5,17.7416
5–20,9.2042
20–40,7.5571
40–60,22.1402
60–80,36.9073
80–100,51.1427
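
A note on the bin edges, since they affect how the first row is read: `pd.cut` builds right-closed intervals by default, and `include_lowest=True` only closes the first interval on the left. The label `0–5` therefore covers popularity 0 through 5 (not only zeros), and `5–20` covers 6 through 20 for integer targets. A quick check on hand-picked probe values, reusing the same `bins` and `labels`:

```python
import pandas as pd

bins = [0, 5, 20, 40, 60, 80, 100]
labels = ["0–5", "5–20", "20–40", "40–60", "60–80", "80–100"]

# Probe values that sit on or next to the bin edges.
probe = pd.Series([0, 5, 6, 20, 21, 100])
assigned = pd.cut(probe, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"value": probe, "bin": assigned.astype(str)}))
```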


## 9.4 - Error vs. Release Year (Temporal Split)
*(detects drift or degradation on newer releases)*

In [31]:
df_err_t = pd.DataFrame({
    "y_true": y_test_t,
    "y_pred": y_pred_t_rf,
    "year": X_test_t["album_release_year"],
})

year_mae_t = (
    df_err_t
    .groupby("year", observed=False)[["y_true", "y_pred"]]
    .apply(lambda g: mean_absolute_error(g["y_true"], g["y_pred"]))
)

year_mae_t.to_frame(name="MAE_by_year_temporal")

year,MAE_by_year_temporal
2021,15.1799


## 9.5 - Interpretation Notes

### Residual distribution
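- In the **random split**, residuals (`y_true - y_pred`) have a mean near zero (≈ –0.12), indicating little average bias, although the median (≈ –1.95) shows that the typical track is slightly over‑predicted. The model occasionally produces large errors (residuals range from about –47 to +66), consistent with the high variance of music popularity.
- In the **temporal split**, residuals shift negatively (mean ≈ –3.12, median ≈ –5.47). Because residuals are computed as `y_true - y_pred`, a negative shift means **systematic over‑prediction** for recent releases, which is consistent with the higher zero share in the temporal test set (23.9% vs 12.0% in training).  
  This aligns with the weaker performance observed in Section 8.4.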

### Binned MAE (random split)
- Errors are lowest for **low‑to‑mid popularity** (the 5–20 and 20–40 bins), likely because predictions concentrate near the center of the target distribution (the training median is 20).
- **Low‑popularity tracks** remain difficult (MAE ≈ 17.7 in the 0–5 bin, which includes zeros and values up to 5); for exact zeros, the Random Forest MAE is 18.68 (Section 8.3).
- Errors increase sharply for **high‑popularity tracks** (>60), exceeding 50 in the 80–100 bin, plausibly due to their rarity and idiosyncratic behavior.

### Temporal drift
- The temporal split shows higher error dispersion (residual std ≈ 17.97 vs 16.68) and a negative shift in the residuals, suggesting that a model trained on earlier years generalizes poorly when evaluated on 2021 releases.
- This pattern is compatible with temporal shift/drift, for example:
  - changes in the distribution of releases (e.g., genre/artist mix and platform dynamics),
  - a higher share of zero-popularity tracks among recent releases (23.9% of the temporal test set vs 12.0% of its training set),
  - and a shift in how metadata/audio features relate to popularity.

### Summary
These diagnostics reinforce the conclusions from Section 8.4:  
the dataset exhibits **temporal drift**, **zero‑inflation**, and **extreme‑value instability**, all of which will need to be addressed in Cycle 2 through more expressive models, temporal‑aware strategies, and potentially a two‑stage modeling approach.

# 10. Cycle Decisions & Next Steps

## Summary of Cycle 1 Findings

Cycle 1 established three baselines of increasing complexity:

- **Baseline 0 (Constant Predictor)**  
  Served as the minimal reference point. Errors were high across all scenarios, as expected.

- **Baseline 1 (Linear Regression)**  
  Captured some structure in the random split but showed only modest gains in the temporal split.  
  This indicates that purely linear relationships are insufficient.

- **Baseline 2 (Random Forest)**  
  Delivered clear improvements in the random split (lower MAE/RMSE, higher R²).  
  In the temporal split, however, the improvement was small, revealing **temporal drift** and limited generalization to 2021.

The error diagnostics reinforced three structural characteristics of the dataset:

1. **Zero‑inflation**: tracks with zero popularity are numerous (about 13% of the random split and about 24% of the 2021 test set) and difficult to predict.  
2. **Extreme‑value instability**: highly popular tracks are rare and produce large errors.  
3. **Temporal drift**: models trained on earlier years generalize poorly to 2021.

These patterns are consistent with the results from the consolidated comparison.

---

## Cycle Decision

Based on the findings, Cycle 1 **achieved its objective**:

- validating the split protocol,  
- establishing comparable baselines,  
- confirming the presence of non‑linear signal,  
- identifying clear limitations that must be addressed in the next cycle.

Therefore, the decision is:

### **→ Proceed to Cycle 2.**

Cycle 1 does not require further refinement — it already provides a solid and reliable diagnostic of the dataset’s behavior.

---

## Next Steps for Cycle 2

Cycle 2 should focus on three main directions:

### **1. More expressive models**
- Boosting models (LightGBM, XGBoost, CatBoost)  
- Models that handle extremes and complex interactions more effectively  
- Evaluation of regularization and early stopping

### **2. Temporal strategies**
- explicit inclusion of temporal features,  
- more granular temporal validation,  
- rolling/expanding window experiments,  
- drift analysis across years.
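
One possible shape for the rolling/expanding experiments is sketched below. The helper `expanding_year_splits` is hypothetical (it is not part of `src/`), and the toy frame is invented; the only assumption carried over from this notebook is a nullable integer `album_release_year` column in which missing years must be excluded from both sides of each fold.

```python
import pandas as pd


def expanding_year_splits(years: pd.Series, first_test_year: int):
    """Yield (test_year, train_mask, test_mask): train on all earlier years, test on one year."""
    candidates = sorted(int(yr) for yr in years.dropna().unique() if yr >= first_test_year)
    for test_year in candidates:
        # Missing years compare as <NA>; treat them as belonging to neither side.
        train_mask = (years < test_year).fillna(False).to_numpy(dtype=bool)
        test_mask = (years == test_year).fillna(False).to_numpy(dtype=bool)
        if train_mask.any() and test_mask.any():
            yield test_year, train_mask, test_mask


# Invented toy data, including one missing release year.
toy = pd.DataFrame({
    "album_release_year": pd.Series(
        [2017, 2018, 2018, 2019, 2019, 2020, 2020, 2021, 2021, pd.NA], dtype="Int16"
    ),
    "song_popularity": [30, 12, 0, 45, 7, 0, 22, 0, 15, 9],
})

folds = list(expanding_year_splits(toy["album_release_year"], first_test_year=2019))
for test_year, train_mask, test_mask in folds:
    print(f"test year {test_year}: train rows={int(train_mask.sum())}, test rows={int(test_mask.sum())}")
```

Reporting MAE per fold would turn the single 2021 data point from Section 9.4 into a per-year curve, which is one way to separate gradual drift from a one-off shift.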

### **3. Zero‑inflation strategies**
- two‑stage modeling (classification → regression),  
- more robust loss functions,  
- explicit segmentation of zero vs positive popularity tracks.
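
A minimal sketch of the two‑stage idea, on synthetic data only. The features, the data‑generating process, and the hyperparameters below are assumptions for illustration, not Cycle 2 choices. Stage 1 estimates the probability that popularity is positive, stage 2 estimates popularity for positive tracks, and the two are multiplied.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
n = 2_000
X = rng.normal(size=(n, 3))

# Synthetic zero-inflated target: a logistic gate decides zero vs positive.
p_pos = 1.0 / (1.0 + np.exp(-(1.0 + 1.5 * X[:, 0])))
is_pos = rng.random(n) < p_pos
positive_value = np.clip(30 + 10 * X[:, 1] + rng.normal(scale=5, size=n), 1, 100)
y = np.where(is_pos, positive_value, 0.0)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Stage 1: classify zero vs positive popularity.
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, random_state=42)
clf.fit(X_train, y_train > 0)

# Stage 2: regress popularity using only the positive training rows.
pos_train = y_train > 0
reg = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=42)
reg.fit(X_train[pos_train], y_train[pos_train])

# Combine: E[y | x] is approximated by P(y > 0 | x) * E[y | x, y > 0].
pos_col = list(clf.classes_).index(True)
p_hat = clf.predict_proba(X_test)[:, pos_col]
y_hat = p_hat * reg.predict(X_test)

print(f"Two-stage MAE on synthetic data: {mean_absolute_error(y_test, y_hat):.3f}")
```

The product targets the conditional mean. Since MAE is the primary metric here, a thresholded variant (predict 0 when the estimated probability is low, otherwise the stage 2 output) may be worth comparing; that is a design question for Cycle 2 rather than something this sketch settles.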

---

## Closing Note

Cycle 1 provided a clear view of the landscape:  
there is signal, there is non‑linearity, there is drift, and there are structural challenges.  
Cycle 2 can now move forward with focus and direction, exploring approaches that the data genuinely justifies.