# Complete Pipeline: Google Play Reviews Trend Analysis

This notebook orchestrates the end-to-end daily pipeline for the Pulsegen assignment. Run each section from top to bottom.


## 1. Configure Inputs
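Set the local LLM model ID, the generation token limit, and the directory that holds the pipeline notebooks. Each value can be overridden through an environment variable of the same name.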

In [None]:
import os
from pathlib import Path

KAGGLE_MODEL_ID = os.getenv('KAGGLE_MODEL_ID', 'Qwen/Qwen2.5-3B-Instruct')
KAGGLE_MAX_NEW_TOKENS = int(os.getenv('KAGGLE_MAX_NEW_TOKENS', '180'))
PIPELINE_ROOT = Path(os.getenv('PIPELINE_ROOT', 'Notebook2')).resolve()

print(f"✓ Using local transformers model: {KAGGLE_MODEL_ID}")
print(f"✓ Pipeline notebooks expected under: {PIPELINE_ROOT}")
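Before launching the later steps, it can help to confirm that the notebooks they execute are actually present under `PIPELINE_ROOT`. The following is a minimal sketch of such a check; the helper name `missing_notebooks` is hypothetical, and the file names are the ones referenced by the later sections of this notebook.

```python
from pathlib import Path

# Notebook file names referenced by the later sections of this pipeline.
EXPECTED_NOTEBOOKS = (
    "01_setup_and_clean.ipynb",
    "02_topic_router.ipynb",
    "05_trend_analysis.ipynb",
)


def missing_notebooks(root, names=EXPECTED_NOTEBOOKS):
    """Return the expected notebook names that are not files under root.

    Hypothetical helper, not part of the original pipeline.
    """
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]
```

A typical use would be `missing = missing_notebooks(PIPELINE_ROOT)` followed by raising `FileNotFoundError` when the list is non-empty, so a wrong `PIPELINE_ROOT` fails early rather than partway through the run.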

## 2. Run Data Cleaning (June 2024+)
Loads raw CSV and writes filtered parquet.

In [None]:
# "!" must start the line; IPython expands {name} from the Python variables set in section 1.
!KAGGLE_MODEL_ID="{KAGGLE_MODEL_ID}" KAGGLE_MAX_NEW_TOKENS={KAGGLE_MAX_NEW_TOKENS} PIPELINE_ROOT="{PIPELINE_ROOT}" jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.kernel_name=python3 "{PIPELINE_ROOT}/01_setup_and_clean.ipynb"
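The same `jupyter nbconvert` invocation can also be driven from plain Python, which avoids shell quoting and stops the run when a notebook fails. This is a sketch under assumptions: `build_nbconvert_command` and `run_notebook` are hypothetical helper names, and the flags are copied from the command used in this notebook.

```python
import os
import subprocess


def build_nbconvert_command(notebook_path, kernel_name="python3"):
    """Build the nbconvert call as an argument list (no shell quoting needed)."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        f"--ExecutePreprocessor.kernel_name={kernel_name}",
        str(notebook_path),
    ]


def run_notebook(notebook_path, extra_env=None):
    """Execute one notebook in place, passing extra environment variables.

    Hypothetical helper: check=True raises CalledProcessError when
    nbconvert exits with a non-zero status.
    """
    env = os.environ.copy()
    env.update({key: str(value) for key, value in (extra_env or {}).items()})
    subprocess.run(build_nbconvert_command(notebook_path), env=env, check=True)
```

For example, `run_notebook(PIPELINE_ROOT / "01_setup_and_clean.ipynb", {"KAGGLE_MODEL_ID": KAGGLE_MODEL_ID})` would run the cleaning step with the configured model ID visible to the child notebook.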

## 3. Run Topic Routing by Day
Generates per-day parquet files with topic assignments.

In [None]:
!KAGGLE_MODEL_ID="{KAGGLE_MODEL_ID}" KAGGLE_MAX_NEW_TOKENS={KAGGLE_MAX_NEW_TOKENS} PIPELINE_ROOT="{PIPELINE_ROOT}" jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.kernel_name=python3 "{PIPELINE_ROOT}/02_topic_router.ipynb"

## 4. Generate 30-Day Trend Report
Creates CSV and HTML trend outputs under `output/`.

In [None]:
!KAGGLE_MODEL_ID="{KAGGLE_MODEL_ID}" KAGGLE_MAX_NEW_TOKENS={KAGGLE_MAX_NEW_TOKENS} PIPELINE_ROOT="{PIPELINE_ROOT}" jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.kernel_name=python3 "{PIPELINE_ROOT}/05_trend_analysis.ipynb"
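To make the "30-day" window concrete, the sketch below lists the dates a report for target date T might cover. It assumes the window ends on T and includes it; the trend notebook itself may define the window differently (for example T-30 through T), and `trend_window` is a hypothetical name.

```python
from datetime import date, timedelta


def trend_window(report_date, days=30):
    """Return the dates in the window ending on report_date, oldest first.

    Assumption: the window includes report_date itself, so it spans
    exactly `days` calendar days.
    """
    return [report_date - timedelta(days=offset) for offset in range(days - 1, -1, -1)]


window = trend_window(date(2024, 7, 30))
print(window[0], window[-1], len(window))  # 2024-07-01 2024-07-30 30
```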

## 5. Verify Output Artifacts
Checks that at least one CSV trend report exists and prints whether the matching HTML report is present.

In [None]:
from pathlib import Path
OUTPUT_DIR = Path('output')
artifacts = sorted(OUTPUT_DIR.glob('topics_trend_*.csv'))
if not artifacts:
    raise FileNotFoundError('No CSV trend report found in output/.')
latest_csv = artifacts[-1]
latest_html = latest_csv.with_suffix('.html')
print(f'Latest CSV: {latest_csv}')
print(f'HTML report exists: {latest_html.exists()}')
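A stricter variant of this check could fail when the HTML report is missing instead of only printing a flag. The sketch below is one way to do that; note that it picks the newest CSV by modification time rather than by file name, which is a substitution made here because the date format in the file names is not specified. `latest_report_pair` is a hypothetical helper name.

```python
from pathlib import Path


def latest_report_pair(output_dir="output", pattern="topics_trend_*.csv"):
    """Return (csv_path, html_path) for the most recently modified CSV report.

    Raises FileNotFoundError if no CSV matches or the HTML twin is absent.
    """
    csv_files = list(Path(output_dir).glob(pattern))
    if not csv_files:
        raise FileNotFoundError(f"No CSV trend report matching {pattern!r} in {output_dir}.")
    # Newest by modification time (assumption; the cell above sorts by name).
    latest_csv = max(csv_files, key=lambda path: path.stat().st_mtime)
    latest_html = latest_csv.with_suffix(".html")
    if not latest_html.is_file():
        raise FileNotFoundError(f"Missing HTML report: {latest_html}")
    return latest_csv, latest_html
```

Calling `latest_report_pair()` at the end of the run would then turn a missing artifact into an error that stops the notebook.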
