# 🏦 Task 1 – Google Play Review Scraping & Preprocessing  
📘 Version: 2025-06-07

Structured collection and preprocessing of mobile app reviews for three Ethiopian banks (CBE, BOA, Dashen) to enable downstream sentiment and thematic analysis.

### This notebook covers:
- Scraping 400+ reviews per bank using `google-play-scraper`
- Extracting key review metadata (rating, date, content, version)
- Cleaning, deduplication, and date normalization
- Mapping to a 5-column output schema for analysis
- CSV export for downstream sentiment + theme pipelines
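
The date-normalization and schema-mapping steps listed above could look roughly like the sketch below. The notebook does not list the five output columns, so the column names here (`review`, `rating`, `date`, `bank`, `source`) and the helper `to_output_row` are placeholders for illustration only. The input keys (`content`, `score`, `at`) are the ones `google-play-scraper` returns.

```python
from datetime import datetime


def to_output_row(review, bank_label, source="Google Play"):
    """Map one raw review dict to a flat 5-column record.

    Column names are placeholders; the notebook does not list the schema.
    """
    at = review.get("at")
    return {
        "review": (review.get("content") or "").strip(),
        "rating": review.get("score"),
        # Normalize datetimes to an ISO date string; leave anything else empty
        "date": at.strftime("%Y-%m-%d") if isinstance(at, datetime) else None,
        "bank": bank_label,
        "source": source,
    }
```

Calling `to_output_row(review, "CBE")` on each scraped record would give rows that can be passed straight to `pd.DataFrame(...)` before export.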


In [1]:
# ------------------------------------------------------------------------------
# 🛠 Ensure Notebook Runs from Project Root (for src/ imports to work)
# ------------------------------------------------------------------------------

import os
import sys

# If running from /notebooks/, move up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print("📂 Changed working directory to project root")

# Add project root to sys.path so `src/` modules can be imported
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✅ Added to sys.path: {project_root}")

# Optional: verify the raw-data directory exists to confirm we're in the right place
expected_path = "data/raw"
print(
    "📁 Output path ready"
    if os.path.exists(expected_path)
    else f"⚠️ Output path not found: {expected_path}"
)

📂 Changed working directory to project root
✅ Added to sys.path: c:\Users\admin\Documents\GIT Repositories\b5w2-customer-ux-analytics-challenge
📁 Output path ready
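
The working-directory check in this cell relies on the notebook living in a folder named `notebooks`. A slightly more general sketch walks up the parent directories until it finds a marker directory. The helper name `find_project_root` and the choice of `src` as the marker are assumptions about the repository layout, not part of the project code.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) containing `marker`.

    Hypothetical helper: the `src` marker is an assumption about the repo layout.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # marker not found anywhere above `start`
```

In a notebook this could be used as `root = find_project_root(Path.cwd())`, followed by `os.chdir(root)` and `sys.path.insert(0, str(root))` when `root` is not `None`.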


## 📥 Preview Raw Google Play Reviews (EDA)

This step performs an initial, single-batch scrape of Google Play reviews for one banking app (CBE) using `google-play-scraper`.

- Pulls 100 of the most recent reviews for `com.combanketh.mobilebanking`.
- Displays full metadata per review (user, rating, date, content, etc.).
- Uses a lightweight preview (no continuation tokens) for EDA/debugging.
- Provides a clean, commented structure for extensibility or testing.

This helps validate scraping behavior before integrating the modular review pipeline or running full-batch collection for all three banks.
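
Once a preview batch has been fetched, a compact summary can be easier to scan than the full printout. The sketch below assumes each review is a dict carrying the `score` and `appVersion` keys that the preview cell in this section prints. The function `summarize_reviews` and the two sample records are illustrative and not part of the project code.

```python
from collections import Counter


def summarize_reviews(batch):
    """Summarize a list of review dicts shaped like google-play-scraper results."""
    # Count ratings, skipping records that have no score at all
    ratings = Counter(
        r.get("score") for r in batch if r.get("score") is not None
    )
    # Reviews where the app version is absent or None
    missing_version = sum(1 for r in batch if not r.get("appVersion"))
    return {
        "total": len(batch),
        "ratings": dict(sorted(ratings.items())),
        "missing_app_version": missing_version,
    }


# Illustrative records only; in the notebook, pass the scraped `result` list.
sample = [
    {"score": 4, "appVersion": None},
    {"score": 1, "appVersion": "5.1.0"},
]
print(summarize_reviews(sample))
```

A high `missing_app_version` count in the preview is worth noting early, since it limits any later per-version analysis.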


In [2]:
# ------------------------------------------------------------------------------
# 📥 Preview Raw Reviews for CBE via google-play-scraper
# ------------------------------------------------------------------------------

from google_play_scraper import Sort, reviews

# Define the Play Store app ID for Commercial Bank of Ethiopia
bank_playstore_id = "com.combanketh.mobilebanking"

# Configure scraping parameters
SCRAPE_COUNT = 100  # number of reviews to fetch
LANG = "en"  # language code for returned reviews
COUNTRY = "us"  # Play Store country/region code
SORT_ORDER = Sort.NEWEST

# Execute a single-page scrape (no continuation) for EDA purposes
try:
    result, continuation_token = reviews(
        bank_playstore_id,
        lang=LANG,
        country=COUNTRY,
        sort=SORT_ORDER,
        count=SCRAPE_COUNT,
        filter_score_with=None,
    )
except Exception as e:
    print(f"❌ Failed to fetch reviews: {e}")
    result, continuation_token = [], None  # keep both names defined on failure

# Pretty-print sample reviews with full metadata
for i, review in enumerate(result, 1):
    print(f"\n🔹 Review {i}")
    print(f"  🧑 User Name        : {review.get('userName', '')}")
    print(f"  ⭐ Rating           : {review.get('score', '')}")
    print(
        f"  📅 Date            : {review.get('at').strftime('%Y-%m-%d %H:%M:%S') if review.get('at') else 'N/A'}"
    )
    print(f"  📝 Review Content   : {review.get('content', '')}")
    print(f"  🆔 Review ID        : {review.get('reviewId', '')}")
    print(f"  📱 App Version      : {review.get('appVersion', '')}")
    print(f"  🔁 Replied At       : {review.get('repliedAt', '—')}")
    print(f"  💬 Reply Content    : {review.get('replyContent', '—')}")
    print(f"  👍 Thumbs Up Count  : {review.get('thumbsUpCount', '')}")
    print(f"  🌐 User Image URL   : {review.get('userImage', '')}")
    print("-" * 70)


🔹 Review 1
  🧑 User Name        : Aim4 Beyond
  ⭐ Rating           : 4
  📅 Date            : 2025-06-06 09:54:11
  📝 Review Content   : "Why don’t your ATMs support account-to-account transfers like other countries( Kenya, Nigeria , South africa)"
  🆔 Review ID        : be2cb2ac-bbe0-4175-81c4-9f6c86afdaaa
  📱 App Version      : None
  🔁 Replied At       : None
  💬 Reply Content    : None
  👍 Thumbs Up Count  : 0
  🌐 User Image URL   : https://play-lh.googleusercontent.com/a/ACg8ocJ8haRPi_VW5lsN16hQDpUE8f3f24u6P2mvRwSw8wBpampb4g=mo
----------------------------------------------------------------------

🔹 Review 2
  🧑 User Name        : zakir man
  ⭐ Rating           : 1
  📅 Date            : 2025-06-05 22:16:56
  📝 Review Content   : what is this app problem???
  🆔 Review ID        : 8efd71e9-59cd-41ce-8c5c-12052dee9ad0
  📱 App Version      : 5.1.0
  🔁 Replied At       : None
  💬 Reply Content    : None
  👍 Thumbs Up Count  : 0
  🌐 User Image URL   : https://play-lh.googleusercontent.

## 🔁 Scrape and Export Reviews for All Banks (Modular Pipeline)

This step performs a full review scraping pass for all three Ethiopian banks using the `BankReviewScraper` class from `src.scraper.review_scraper`.

- Loops through Play Store app IDs for CBE, BOA, and Dashen Bank.
- Collects 400 reviews per bank (`REVIEW_TARGET`) using continuation tokens and polite throttling.
- Optionally saves per-bank raw CSVs to the `data/raw/` directory; that export is commented out in the cell below.
- Writes a combined export file (`reviews_all_banks.csv`) to `data/raw/` for streamlined processing.
- Implements verbose diagnostics and error handling for full traceability.

This pipeline ensures traceable, reproducible, and challenge-compliant collection of mobile app reviews, forming the foundation for sentiment modeling and UX insight extraction.
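
The internals of `BankReviewScraper` are not shown in this notebook, so the following is only a minimal sketch of how paginated collection with continuation tokens and throttling might work. The names `collect_reviews` and `make_play_store_fetcher`, the batch size of 200, and the one-second pause are all assumptions. The only library call used is `google_play_scraper.reviews` with its `continuation_token` argument.

```python
import time


def collect_reviews(fetch_page, target_count, batch_size=200, pause_seconds=1.0):
    """Accumulate reviews page by page until `target_count` is reached.

    `fetch_page(count, token)` must return `(batch, next_token)`; this is a
    hypothetical interface, not the actual `BankReviewScraper` internals.
    """
    collected, token = [], None
    while len(collected) < target_count:
        remaining = target_count - len(collected)
        batch, token = fetch_page(min(batch_size, remaining), token)
        if not batch:
            break  # no more reviews available
        collected.extend(batch)
        if len(collected) < target_count:
            time.sleep(pause_seconds)  # polite pause between requests
    # Trim in case a page returned more than was asked for
    return collected[:target_count]


def make_play_store_fetcher(app_id, lang="en", country="us"):
    """Wrap `google_play_scraper.reviews` in the `fetch_page` interface."""
    from google_play_scraper import Sort, reviews  # imported lazily

    def fetch_page(count, token):
        return reviews(
            app_id,
            lang=lang,
            country=country,
            sort=Sort.NEWEST,
            count=count,
            continuation_token=token,
        )

    return fetch_page
```

Keeping the page fetcher as a separate callable means the loop can be exercised offline with a stub, for example `collect_reviews(make_play_store_fetcher("com.combanketh.mobilebanking"), 400)` in the notebook and a fake fetcher in tests.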


In [2]:
# ------------------------------------------------------------------------------
# 🔁 Task 1 – Batch Scraping & Combined Export for All Banks
# ------------------------------------------------------------------------------

from src.scraper.review_scraper import BankReviewScraper
import pandas as pd
import os

# 📌 Define bank metadata (Play Store app IDs)
BANKS = {
    "CBE": "com.combanketh.mobilebanking",
    "BOA": "com.boa.boaMobileBanking",
    "Dashen": "com.dashen.dashensuperapp",
}

# ⚙️ Scraping configuration
REVIEW_TARGET = 400  # Reviews per bank
OUTPUT_DIR = "data/raw"  # Folder for CSV exports
COMBINED_FILENAME = "reviews_all_banks.csv"  # Combined output file
OUTPUT_PATH = os.path.join(OUTPUT_DIR, COMBINED_FILENAME)

# ✅ Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# 📦 Initialize list to collect all bank reviews
all_reviews = []

# 🔁 Loop through banks and scrape reviews
for bank_label, app_id in BANKS.items():
    print(f"\n🚀 Scraping {REVIEW_TARGET} reviews for {bank_label}...")

    try:
        # Initialize scraper instance
        scraper = BankReviewScraper(
            app_id=app_id,
            bank_label=bank_label,
            target_count=REVIEW_TARGET,
            verbose=True,
        )

        # Run scraping pipeline
        scraper.scrape_reviews()

        # Append structured reviews to global list
        all_reviews.extend(scraper.reviews_raw)

        # Optional: save per-bank CSVs for diagnostics
        # bank_csv_path = os.path.join(OUTPUT_DIR, f"reviews_{bank_label}.csv")
        # pd.DataFrame(scraper.reviews_raw).to_csv(bank_csv_path, index=False, encoding="utf-8-sig")

    except Exception as e:
        print(f"❌ Error scraping {bank_label}: {e}")

# 📤 Final combined export
if all_reviews:
    df_all = pd.DataFrame(all_reviews)
    df_all.to_csv(OUTPUT_PATH, index=False, encoding="utf-8-sig")
    print(f"\n✅ Exported {len(df_all):,} combined reviews to: {OUTPUT_PATH}")
else:
    print("⚠️ No reviews collected across all banks.")


🚀 Scraping 400 reviews for CBE...
🔍 Starting scrape for CBE (400 reviews)...
📦 Collected 200 / 400
📦 Collected 400 / 400

🚀 Scraping 400 reviews for BOA...
🔍 Starting scrape for BOA (400 reviews)...
📦 Collected 200 / 400
📦 Collected 400 / 400

🚀 Scraping 400 reviews for Dashen...
🔍 Starting scrape for Dashen (400 reviews)...
📦 Collected 200 / 400
📦 Collected 400 / 400

✅ Exported 1,200 combined reviews to: data/raw\reviews_all_banks.csv


## 🧼 Task 1 – Clean Raw Google Play Reviews (All Banks)

This step runs the full cleaning pipeline on the raw combined CSV exported from Task 1 scraping (`reviews_all_banks.csv`).

- Loads the raw dataset from `data/raw/` using the modular `ReviewDataCleaner` class.
- Drops rows with missing required fields or blank review content.
- Removes duplicate reviews (by `reviewId`) and normalizes whitespace in text fields.
- Overwrites any existing cleaned file with the same name, with a fallback if that file is locked.
- Outputs a cleaned dataset to `data/cleaned/reviews_all_banks_cleaned.csv`.

This produces a sanitized, analysis-ready dataset for downstream NLP, sentiment, and thematic diagnostics.
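
The `ReviewDataCleaner` source is not included here, so the sketch below only approximates the steps listed above using plain pandas. The function name `clean_reviews` and the set of required columns are assumptions; `reviewId` and `content` are the field names used by the scraper output.

```python
import pandas as pd


def clean_reviews(df, required=("reviewId", "content", "score", "at")):
    """Apply the listed cleaning steps to a raw reviews DataFrame.

    The `required` columns are an assumption, not the project's actual list.
    """
    # Drop rows missing any required field
    out = df.dropna(subset=list(required)).copy()
    # Collapse runs of whitespace and trim the review text
    out["content"] = (
        out["content"].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    out = out[out["content"] != ""]  # drop blank reviews
    # Keep the first occurrence of each review ID
    out = out.drop_duplicates(subset="reviewId", keep="first")
    return out.reset_index(drop=True)
```

Comparing `len(df)` before and after the call gives the same kind of row-loss figure that the cleaner reports in its verbose output.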


In [3]:
# ------------------------------------------------------------------------------
# 🧼 Task 1 – Clean Raw Google Play Reviews (All Banks)
# ------------------------------------------------------------------------------

# 📦 Load cleaner module
from src.cleaning.review_cleaner import ReviewDataCleaner
import os

# 📁 Define file paths
RAW_INPUT_PATH = "data/raw/reviews_all_banks.csv"  # Raw combined review data
CLEANED_OUTPUT_PATH = (
    "data/cleaned/reviews_all_banks_cleaned.csv"  # Final cleaned output
)

# ✅ Ensure output directory exists
os.makedirs(os.path.dirname(CLEANED_OUTPUT_PATH), exist_ok=True)

# 🧼 Run cleaning pipeline with logging and error handling
try:
    # Initialize cleaner with verbose mode ON
    cleaner = ReviewDataCleaner(raw_path=RAW_INPUT_PATH, verbose=True)

    # Step 1: Load raw data from disk
    cleaner.load_raw_data()

    # Step 2: Clean dataset (nulls, blanks, duplicates, normalization)
    cleaner.clean()

    # Step 3: Export cleaned data to output path
    cleaner.export_cleaned(output_path=CLEANED_OUTPUT_PATH)

    # ✅ Final success message
    print(f"\n✅ Cleaning complete. Output saved to: {CLEANED_OUTPUT_PATH}")

except Exception as e:
    # ❌ Graceful failure message
    print(f"❌ Cleaning failed: {e}")

📥 Loaded raw reviews: 1,200 rows, 12 columns
🧹 Dropped 0 rows with missing fields
🧹 Dropped 0 rows with blank reviews
🧹 Dropped 0 duplicate reviewId rows
✅ Cleaned dataset has 1,200 rows (−0.00% loss)
📤 Cleaned reviews exported to: data/cleaned/reviews_all_banks_cleaned.csv

✅ Cleaning complete. Output saved to: data/cleaned/reviews_all_banks_cleaned.csv
