# Task 1 — Google Play Review Scrape & Clean

This notebook demonstrates **Task 1 (Data Collection & Preprocessing)** for the Ethiopian bank reviews project.

**Outputs**
- `data/raw/reviews_raw.csv` (raw scrape; may include duplicates and missing values)
- `data/processed/reviews_task1_clean.csv` (final Task 1 deliverable; exactly 5 columns)

**Task 1 KPI checks**
- Total cleaned reviews: **≥ 1,200**
- Cleaned reviews per bank: **≥ 400**
- Missing critical fields in cleaned dataset: **0%** (rows with missing values are dropped during cleaning)
- Date format: `YYYY-MM-DD`


## 0) Setup
We resolve the project root from the current working directory (stepping up one level when the notebook is run from inside `notebooks/`), then initialize standard paths and app configuration.

In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import pandas as pd

# --- Resolve project root (repo root) ---
# notebooks/01_task1_scrape_and_clean.ipynb -> project root assumed to be parent of notebooks/
PROJECT_ROOT = Path.cwd()
# If running from within notebooks/ directory, go one level up
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent

print("PROJECT_ROOT:", PROJECT_ROOT.resolve())

# --- Ensure `src/` is on sys.path so `import bank_reviews` works ---
SRC_DIR = PROJECT_ROOT / "src"
if SRC_DIR.exists() and str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

# --- Import project config ---
from bank_reviews.config import Paths, default_app_config

# --- Import project utils (Task 1: use io.py and dates.py explicitly) ---
from bank_reviews.utils.io import ensure_parent_dir, read_csv, write_csv
from bank_reviews.utils.dates import normalize_date

paths = Paths.from_root(PROJECT_ROOT)
app_cfg = default_app_config()

RAW_CSV = paths.raw_dir / "reviews_raw.csv"
CLEAN_CSV = paths.processed_dir / "reviews_task1_clean.csv"

# Ensure output folders exist (via utils.io)
ensure_parent_dir(RAW_CSV)
ensure_parent_dir(CLEAN_CSV)

print("RAW_CSV:", RAW_CSV)
print("CLEAN_CSV:", CLEAN_CSV)
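
For readers who have not yet written the config and I/O helpers, the sketch below shows one way `Paths.from_root` and `ensure_parent_dir` could behave. It is an illustration only: the field names (`raw_dir`, `processed_dir`) follow how the cell above uses them, but the `data/raw` and `data/processed` layout is inferred from the output paths listed at the top, and the real `bank_reviews` implementations may differ.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Paths:
    """Hypothetical stand-in for bank_reviews.config.Paths (assumed layout)."""

    root: Path
    raw_dir: Path
    processed_dir: Path

    @classmethod
    def from_root(cls, root: Path) -> "Paths":
        # Assumption: raw and processed data live under <root>/data/.
        root = Path(root)
        data_dir = root / "data"
        return cls(root=root, raw_dir=data_dir / "raw", processed_dir=data_dir / "processed")


def ensure_parent_dir(path: Path) -> Path:
    """Create the parent directory of `path` if it is missing, then return `path`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory so nothing is written into the real repo.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = Paths.from_root(Path(tmp))
    demo_raw_csv = ensure_parent_dir(demo_paths.raw_dir / "reviews_raw.csv")
    print("parent exists:", demo_raw_csv.parent.is_dir())
```

Returning the path from `ensure_parent_dir` is a convenience choice in this sketch; the notebook only relies on the directory being created.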


## 1) Scrape Google Play reviews
This step pulls reviews for each bank app using the `scraping.play_store` module.

If you have already scraped and saved raw reviews, you can skip scraping and load `data/raw/reviews_raw.csv` instead.

In [None]:
SCRAPE = True  # set False to reuse existing RAW_CSV
N_TARGET_PER_BANK = 450  # scrape above 400 to survive cleaning

if SCRAPE:
    try:
        import importlib
        import bank_reviews.scraping.play_store as play_store
        importlib.reload(play_store)
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "Scraper module not found. Implement src/bank_reviews/scraping/play_store.py first."
        ) from e

    df_raw = play_store.scrape_all_banks(
        app_cfg=app_cfg,
        n_target_per_bank=N_TARGET_PER_BANK,
        sort="NEWEST",
    )
    if df_raw is None:
        raise RuntimeError("Scraper returned None; ensure scrape_all_banks returns a pandas DataFrame")

    # Persist raw scrape (via utils.io)
    write_csv(df_raw, RAW_CSV)
else:
    if not RAW_CSV.exists():
        raise FileNotFoundError(f"RAW_CSV not found: {RAW_CSV}. Set SCRAPE=True to create it.")
    df_raw = read_csv(RAW_CSV)

print("Raw shape:", df_raw.shape)
df_raw.head(3)
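
The scraper module is not shown in this notebook, so the following is a rough sketch of the shaping step it might perform: turning a list of per-review records into the raw column layout used below. The record keys (`reviewId`, `content`, `score`, `at`) are an assumption about what the `google-play-scraper` package returns, and `records_to_raw_frame` and the `"Google Play"` source label are hypothetical names, not part of `bank_reviews`. No network call is made here.

```python
from datetime import datetime
from typing import Any, Dict, Iterable

import pandas as pd

# Column layout audited in the raw sanity checks of this notebook.
RAW_COLUMNS = ["review_id", "review", "rating", "date", "bank", "source"]


def records_to_raw_frame(
    records: Iterable[Dict[str, Any]],
    bank: str,
    source: str = "Google Play",  # hypothetical label
) -> pd.DataFrame:
    """Map scraper-style records to the raw schema, keeping missing fields as None."""
    rows = [
        {
            "review_id": r.get("reviewId"),  # assumed key names
            "review": r.get("content"),
            "rating": r.get("score"),
            "date": r.get("at"),
            "bank": bank,
            "source": source,
        }
        for r in records
    ]
    return pd.DataFrame(rows, columns=RAW_COLUMNS)


# Made-up records purely to exercise the function.
fake_records = [
    {"reviewId": "r1", "content": "Fast transfers", "score": 5, "at": datetime(2024, 1, 15, 10, 0)},
    {"reviewId": "r2", "content": None, "score": 2, "at": datetime(2024, 1, 16, 9, 30)},
]
print(records_to_raw_frame(fake_records, bank="Bank A"))
```

Keeping missing values as-is at this stage matches the notebook's intent that the raw CSV is an unfiltered record of the scrape.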


### 1.1) Quick raw sanity checks
Raw data may include duplicates and missing fields. We check bank distribution and missingness before cleaning.

In [None]:
# Count by bank (raw)
if "bank" in df_raw.columns:
    display(df_raw["bank"].value_counts(dropna=False))
else:
    print("WARNING: raw data has no bank column. Ensure scraper adds bank.")

# Missingness in key fields (raw); only columns that actually exist are checked
key_cols = [c for c in ["review", "rating", "date", "bank", "source", "review_id"] if c in df_raw.columns]
if key_cols:
    print("Missing rate (raw):")
    display(df_raw[key_cols].isna().mean().sort_values(ascending=False))
else:
    print("WARNING: none of the expected key columns are present in the raw data.")
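
The text above mentions duplicates, but the cell only measures missingness. A small duplicate audit could look like the sketch below. `duplicate_summary` is a hypothetical helper, and the choice of `review_id` and `(review, bank)` as duplicate keys is an assumption about what counts as a repeat.

```python
from typing import Dict, Sequence

import pandas as pd


def duplicate_summary(
    df: pd.DataFrame,
    id_col: str = "review_id",
    text_cols: Sequence[str] = ("review", "bank"),
) -> Dict[str, int]:
    """Count repeated IDs and repeated (review, bank) pairs, skipping absent columns."""
    summary = {"rows": len(df)}
    if id_col in df.columns:
        # Missing IDs are excluded so they are not counted as duplicates of each other.
        summary["duplicate_ids"] = int(df[id_col].dropna().duplicated().sum())
    present = [c for c in text_cols if c in df.columns]
    if present:
        summary["duplicate_text_rows"] = int(df.duplicated(subset=present).sum())
    return summary


# Made-up rows: one repeated ID and one repeated (review, bank) pair.
demo_raw = pd.DataFrame(
    {
        "review_id": ["a", "a", "b", None],
        "review": ["good", "good", "bad", "bad"],
        "bank": ["Bank A", "Bank A", "Bank A", "Bank B"],
    }
)
print(duplicate_summary(demo_raw))
```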


### 1.2) Date parsing audit (explicit dates.py usage)

In production, `clean_reviews_task1()` should normalize dates internally via `bank_reviews.utils.dates.normalize_date`.
This cell is just an **audit** to confirm raw dates can be normalized cleanly.

In [None]:
if "date" in df_raw.columns:
    sample = df_raw["date"].dropna().astype(object).head(10).tolist()
    normalized = [normalize_date(x) for x in sample]
    audit_df = pd.DataFrame({"raw_date_sample": sample, "normalize_date(raw)": normalized})
    display(audit_df)
else:
    print("No 'date' column found in df_raw.")
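
The real `normalize_date` lives in `bank_reviews.utils.dates` and is not reproduced here. As a point of comparison for the audit, a minimal version might look like the sketch below, assuming the goal is "anything pandas can parse becomes `YYYY-MM-DD`, everything else becomes `None`". The name `normalize_date_sketch` is hypothetical.

```python
from datetime import datetime
from typing import Any, Optional

import pandas as pd


def normalize_date_sketch(value: Any) -> Optional[str]:
    """Return `value` as a YYYY-MM-DD string, or None if it cannot be parsed."""
    if value is None:
        return None
    # errors="coerce" turns unparseable input into NaT instead of raising.
    ts = pd.to_datetime(value, errors="coerce")
    if pd.isna(ts):
        return None
    return ts.strftime("%Y-%m-%d")


for raw in ["2024-03-05 14:22:10", datetime(2023, 12, 31, 23, 59), "not a date", None]:
    print(repr(raw), "->", normalize_date_sketch(raw))
```

Returning `None` rather than raising lets the cleaner drop unparseable rows in the same pass that removes other missing values.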


## 2) Clean to Task 1 deliverable format
Cleaning enforces the Task 1 schema: **`review, rating, date, bank, source`** with dates normalized to `YYYY-MM-DD` and duplicates removed.

In [None]:
try:
    from bank_reviews.preprocessing.clean_reviews import clean_reviews_task1
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        "Cleaner module not found. Implement src/bank_reviews/preprocessing/clean_reviews.py first."
    ) from e

df_clean = clean_reviews_task1(df_raw, dedup_strategy="review_id")

print("Clean shape:", df_clean.shape)
df_clean.head(5)
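
If the cleaner module does not exist yet, the sketch below shows one plausible shape for it, based only on what this notebook requires: the five-column schema, `YYYY-MM-DD` dates, no missing values, and deduplication. `clean_reviews_sketch` is a hypothetical name, and the 1 to 5 rating filter and the fallback dedup key `(review, bank, date)` are assumptions rather than the project's confirmed rules.

```python
import pandas as pd

TASK1_COLUMNS = ["review", "rating", "date", "bank", "source"]


def clean_reviews_sketch(df_raw: pd.DataFrame, dedup_strategy: str = "review_id") -> pd.DataFrame:
    """Illustrative cleaner: normalize, drop incomplete rows, dedupe, select columns."""
    df = df_raw.copy()
    # Unparseable dates and ratings become missing and are dropped below.
    df["date"] = pd.to_datetime(df["date"], errors="coerce").dt.strftime("%Y-%m-%d")
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df = df.dropna(subset=TASK1_COLUMNS).copy()
    # Treat whitespace-only reviews as missing; keep ratings in the 1-5 range (assumed).
    df["review"] = df["review"].astype(str).str.strip()
    df = df[(df["review"] != "") & df["rating"].between(1, 5)].copy()
    if dedup_strategy == "review_id" and "review_id" in df.columns:
        # Note: rows lacking a review_id would collapse into one row here.
        df = df.drop_duplicates(subset=["review_id"]).copy()
    else:
        df = df.drop_duplicates(subset=["review", "bank", "date"]).copy()
    df["rating"] = df["rating"].astype(int)
    return df[TASK1_COLUMNS].reset_index(drop=True)


# Made-up raw rows: one duplicate, one blank review, one missing date.
demo_raw = pd.DataFrame(
    {
        "review_id": ["r1", "r1", "r2", "r3", "r4"],
        "review": ["Great app", "Great app", "   ", "Crashes often", "Works fine"],
        "rating": [5, 5, 3, 1, 4],
        "date": ["2024-01-15 10:00:00", "2024-01-15 10:00:00", "2024-01-16 09:30:00",
                 None, "2024-02-01 08:00:00"],
        "bank": ["Bank A", "Bank A", "Bank A", "Bank B", "Bank B"],
        "source": ["Google Play"] * 5,
    }
)
demo_clean = clean_reviews_sketch(demo_raw)
print(demo_clean)
```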


## 3) KPI checks
We enforce the Task 1 acceptance criteria.

In [None]:
# 3.1) Schema check
expected_cols = ["review", "rating", "date", "bank", "source"]
assert list(df_clean.columns) == expected_cols, f"Expected columns {expected_cols}, got {list(df_clean.columns)}"

# 3.2) Missingness check (should be 0 after dropping)
missing_clean = df_clean.isna().mean()
print("Missing rate (clean):")
display(missing_clean)
assert (missing_clean == 0).all(), "Clean dataset still contains missing values."

# 3.3) KPI: totals
total_n = len(df_clean)
per_bank = df_clean["bank"].value_counts()
print("Total cleaned reviews:", total_n)
display(per_bank)

assert total_n >= 1200, f"KPI failed: total cleaned reviews {total_n} < 1200"
assert (per_bank >= 400).all(), f"KPI failed: some bank has < 400 reviews: {per_bank.to_dict()}"

# 3.4) Date format check (fullmatch rejects values with trailing characters such as a newline)
date_ok = df_clean["date"].astype(str).str.fullmatch(r"\d{4}-\d{2}-\d{2}")
bad_dates = df_clean.loc[~date_ok, "date"].head(10).tolist()
assert date_ok.all(), f"Date format check failed. Example bad dates: {bad_dates}"

print("All Task 1 KPI checks passed.")
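
Two gaps in the assertions above are worth noting: a bank that is absent from the cleaned data never appears in `value_counts()`, so it cannot fail the per-bank check, and a pattern match accepts impossible dates such as `2024-02-30`. The sketch below addresses both under the assumption that the list of expected bank names is available (here passed in explicitly; how it would be read from `app_cfg` is not shown in this notebook). `task1_kpi_report` is a hypothetical helper.

```python
from typing import Any, Dict, Sequence

import pandas as pd


def task1_kpi_report(
    df: pd.DataFrame,
    expected_banks: Sequence[str],
    min_total: int = 1200,
    min_per_bank: int = 400,
) -> Dict[str, Any]:
    """Evaluate the Task 1 KPIs and return the results instead of asserting."""
    # Reindexing adds a zero count for any expected bank that is entirely absent.
    per_bank = df["bank"].value_counts().reindex(list(expected_banks), fill_value=0)
    dates = df["date"].astype(str)
    shape_ok = dates.str.fullmatch(r"\d{4}-\d{2}-\d{2}")
    # Strict parsing rejects calendar-invalid dates that still match the pattern.
    parsed = pd.to_datetime(dates.where(shape_ok), format="%Y-%m-%d", errors="coerce")
    return {
        "total_ok": len(df) >= min_total,
        "per_bank_ok": bool((per_bank >= min_per_bank).all()),
        "no_missing": not bool(df.isna().any().any()),
        "dates_ok": bool(parsed.notna().all()),
        "per_bank": per_bank.to_dict(),
    }


# Made-up data with tiny thresholds so the example is easy to follow.
demo_clean = pd.DataFrame(
    {
        "review": ["a", "b", "c"],
        "rating": [5, 4, 1],
        "date": ["2024-01-15", "2024-02-30", "2024-03-01"],
        "bank": ["Bank A", "Bank A", "Bank B"],
        "source": ["Google Play"] * 3,
    }
)
report = task1_kpi_report(demo_clean, ["Bank A", "Bank B", "Bank C"], min_total=3, min_per_bank=1)
print(report)
```

Returning a dictionary makes it possible to log every failing KPI at once, whereas a chain of `assert` statements stops at the first failure.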


## 4) Save Task 1 deliverable
We save the cleaned dataset to `data/processed/reviews_task1_clean.csv`.

In [None]:
# Save the cleaned dataframe via utils.io
write_csv(df_clean, CLEAN_CSV)
print("Saved:", CLEAN_CSV)
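
A quick way to confirm the saved deliverable is intact is to read it back and compare it with the in-memory frame. The sketch below uses `pandas` directly rather than the project's `read_csv`/`write_csv` helpers, whose exact signatures are not shown here, and writes to a temporary directory. `verify_csv_round_trip` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_csv_round_trip(df: pd.DataFrame, path: Path) -> bool:
    """Return True if the CSV at `path` has the same columns and row count as `df`."""
    reloaded = pd.read_csv(path)
    return list(reloaded.columns) == list(df.columns) and len(reloaded) == len(df)


# Made-up frame in the Task 1 schema.
demo_clean = pd.DataFrame(
    {
        "review": ["Great app", "Works fine"],
        "rating": [5, 4],
        "date": ["2024-01-15", "2024-02-01"],
        "bank": ["Bank A", "Bank B"],
        "source": ["Google Play", "Google Play"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "reviews_task1_clean.csv"
    demo_clean.to_csv(demo_path, index=False)  # index=False keeps exactly five columns
    round_trip_ok = verify_csv_round_trip(demo_clean, demo_path)

print("round trip ok:", round_trip_ok)
```

Writing without the index matters for the "exactly 5 columns" requirement: a saved index would come back as a sixth, unnamed column.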


## 5) Quick inspection
A quick look at rating distribution and date range.

In [None]:
display(df_clean["rating"].value_counts().sort_index())
print("Date min/max:", df_clean["date"].min(), df_clean["date"].max())

# Optional: show a few samples per bank
display(df_clean.groupby("bank").head(2))
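
The same inspection can be broken down per bank. The sketch below uses made-up data in the Task 1 schema; because the dates are `YYYY-MM-DD` strings, `min` and `max` on the text already give the chronological range.

```python
import pandas as pd

# Made-up cleaned rows for illustration only.
demo_clean = pd.DataFrame(
    {
        "review": ["a", "b", "c", "d"],
        "rating": [5, 3, 1, 2],
        "date": ["2024-01-15", "2024-03-02", "2023-12-30", "2024-02-10"],
        "bank": ["Bank A", "Bank A", "Bank B", "Bank B"],
        "source": ["Google Play"] * 4,
    }
)

# Named aggregation: one row per bank with count, mean rating, and date range.
per_bank_summary = demo_clean.groupby("bank").agg(
    n_reviews=("rating", "size"),
    mean_rating=("rating", "mean"),
    first_date=("date", "min"),
    last_date=("date", "max"),
)
print(per_bank_summary)
```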


---
### Notes
- If scraping fails because of network errors or rate limits, set `SCRAPE = False` and re-run once a successful scrape has been saved to `data/raw/reviews_raw.csv`.
- For reproducibility, keep raw CSVs (even if gitignored) and document scrape time in your report.
- `utils.dates.normalize_date` is audited explicitly above; the cleaner should also use it internally.
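
One lightweight way to document the scrape time mentioned in the notes is to write a small JSON sidecar next to the raw CSV. This is a sketch only: `write_scrape_metadata`, the `.meta.json` naming, and the recorded fields are all hypothetical and not part of the project.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_scrape_metadata(csv_path: Path, n_rows: int, sort: str = "NEWEST") -> Path:
    """Write <name>.meta.json beside `csv_path` with the scrape time and basic settings."""
    meta_path = Path(csv_path).with_suffix(".meta.json")
    meta = {
        "scraped_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "n_rows": n_rows,
        "sort": sort,
        "csv_file": Path(csv_path).name,
    }
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta_path


# Demo in a temporary directory; in the notebook this would sit beside RAW_CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_meta_path = write_scrape_metadata(Path(tmp) / "reviews_raw.csv", n_rows=1350)
    demo_meta = json.loads(demo_meta_path.read_text(encoding="utf-8"))

print(demo_meta_path.name, demo_meta["n_rows"])
```

Recording the time in UTC avoids ambiguity when the report is read in a different time zone from where the scrape ran.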