## Introduction
This notebook collects and cleans Google Play Store reviews of the mobile apps of three major Ethiopian banks. The steps include:

- Scraping app reviews using `google-play-scraper`.
- Preprocessing and cleaning the data (removing duplicates, missing values, and normalizing dates).
- Saving cleaned datasets as separate CSV files (one per bank).
- Validating the datasets to ensure quality.

This is part of a consulting simulation to assess customer satisfaction and improve mobile banking experiences.

---

### Set Up Python Path
Add the project root directory to the Python path so that we can import local modules from the `scripts/` folder.

In [1]:
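import os
import sys

# Put the parent directory (the project root) first on the import path
# so that `scripts.*` modules resolve from this notebook.
sys.path.insert(0, os.path.abspath('..'))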
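
Re-running the cell above inserts the same path again on every execution. A minimal sketch of an idempotent variant follows. `add_project_root` is a hypothetical helper, not part of this project, and it assumes the notebook sits one level below the project root.

```python
import os
import sys


def add_project_root(relative_root: str = "..") -> str:
    """Put the project root on sys.path at most once and return it.

    Hypothetical helper: assumes the notebook lives one level below the root.
    """
    root = os.path.abspath(relative_root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


project_root = add_project_root()
```

Calling `add_project_root()` a second time leaves `sys.path` unchanged, so the cell can be re-run safely.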

### Import Required Modules
We import all custom functions and external libraries needed for scraping, cleaning, and validating the app reviews.

In [2]:
from scripts.scraper import scrape_reviews
from scripts.preprocessing import clean_reviews
from scripts.validation import load_and_validate_reviews

from tqdm import tqdm

### Define Target Apps and Output Directory
We define the Google Play app IDs and corresponding bank names, and ensure that the output directory for cleaned CSV files exists.

In [3]:
# App IDs and Names
apps = {
    "com.combanketh.mobilebanking": "Commercial Bank of Ethiopia",
    "com.boa.boaMobileBanking": "Bank of Abyssinia",
    "com.dashen.dashensuperapp": "Dashen Bank"
}

# Directory to save cleaned CSVs
output_dir = "../data/"
os.makedirs(output_dir, exist_ok=True)

### Scrape and Clean Reviews
For each bank app:
- Request 500 reviews (`count=500`) using the Google Play scraper; fewer may remain after cleaning.
- Clean the data (remove duplicates, handle missing values, normalize date format).
- Save results to a separate CSV per bank.

In [4]:
# Loop over each app and process individually
for app_id, bank_name in tqdm(apps.items()):
    raw_reviews = scrape_reviews(app_id, bank_name, count=500)
    cleaned_df = clean_reviews(raw_reviews)

    # Sanitize filename
    bank_filename = bank_name.lower().replace(" ", "_").replace(".", "")
    output_path = os.path.join(output_dir, f"{bank_filename}_reviews.csv")

    cleaned_df.to_csv(output_path, index=False)
    print(f"✅ Saved cleaned reviews for {bank_name} → {output_path}")

 33%|███▎      | 1/3 [00:01<00:02,  1.05s/it]

✅ Saved cleaned reviews for Commercial Bank of Ethiopia → ../data/commercial_bank_of_ethiopia_reviews.csv


 67%|██████▋   | 2/3 [00:02<00:01,  1.10s/it]

✅ Saved cleaned reviews for Bank of Abyssinia → ../data/bank_of_abyssinia_reviews.csv


100%|██████████| 3/3 [00:03<00:00,  1.11s/it]

✅ Saved cleaned reviews for Dashen Bank → ../data/dashen_bank_reviews.csv
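
`clean_reviews` lives in `scripts/preprocessing.py` and is not shown in this notebook. To illustrate the three cleaning steps listed above, here is a rough pandas sketch. The function name `clean_reviews_sketch`, the column names `review`, `rating`, and `date`, and the sample rows are all assumptions made for illustration; the project's actual implementation may differ.

```python
from datetime import datetime

import pandas as pd


def clean_reviews_sketch(raw_reviews: list[dict]) -> pd.DataFrame:
    """Illustrative cleaning pass; column names are assumed, not the project's."""
    df = pd.DataFrame(raw_reviews, columns=["review", "rating", "date"])

    # 1. Drop rows that are missing the review text or the rating.
    df = df.dropna(subset=["review", "rating"])

    # 2. Normalize dates to YYYY-MM-DD strings; unparseable dates are dropped.
    parsed = pd.to_datetime(df["date"], errors="coerce")
    df = df.assign(date=parsed.dt.strftime("%Y-%m-%d")).dropna(subset=["date"])

    # 3. Remove rows that are exact duplicates after normalization.
    df = df.drop_duplicates(subset=["review", "rating", "date"])

    return df.reset_index(drop=True)


# Made-up sample rows, for demonstration only.
sample = [
    {"review": "Great app", "rating": 5, "date": datetime(2025, 6, 1, 9, 30)},
    {"review": "Great app", "rating": 5, "date": datetime(2025, 6, 1, 9, 30)},  # duplicate
    {"review": None, "rating": 1, "date": datetime(2025, 6, 2)},  # missing text
    {"review": "Crashes on login", "rating": 1, "date": datetime(2025, 6, 3)},
]
cleaned = clean_reviews_sketch(sample)
```

Deduplicating after date normalization means two copies of a review that differ only in time of day are treated as one.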





--- 

### Validate Scraped Review Data
We validate the output to ensure:
- Total number of reviews ≥ 1,200.
- Each file has <5% missing values.
- All dates are in YYYY-MM-DD format.
- No duplicate entries.

In [5]:
# Load the data
results = load_and_validate_reviews(output_dir)

# Extract results
combined_df = results['combined_df']
summary_df = results['summary_df']
missing_files = results['missing_files']
metrics = results['metrics']
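
`load_and_validate_reviews` is defined in `scripts/validation.py`, so its internals are not visible here. The sketch below shows one way the per-file checks listed above could be computed with pandas. `validate_frame` and the `date` column name are assumptions for illustration, not the project's actual code.

```python
import pandas as pd


def validate_frame(df: pd.DataFrame, date_col: str = "date") -> dict:
    """Compute quality metrics for one bank's reviews (illustrative only)."""
    total_cells = df.size
    missing = int(df.isna().sum().sum())

    # Shape check only (four digits, two digits, two digits);
    # a missing date counts as a format failure.
    matches = df[date_col].astype("string").str.fullmatch(r"\d{4}-\d{2}-\d{2}")
    date_ok = bool(matches.fillna(False).all())

    return {
        "reviews": len(df),
        "missing_values": missing,
        "missing_%": 100 * missing / total_cells if total_cells else 0.0,
        "duplicates": int(df.duplicated().sum()),
        "date_format_OK": date_ok,
    }
```

The returned keys mirror the columns of the summary table so that one dictionary per file could be stacked into a summary DataFrame.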

### File Presence Check
Ensure all expected bank files were saved.

In [6]:
# File status
if missing_files:
    print(f"\n⚠️ Missing files: {', '.join(missing_files)}")
else:
    print("\n✅ All expected bank files found")


✅ All expected bank files found


--- 

### Summary of Cleaned Data
Display a breakdown of:
- Review counts per bank
- Missing value percentages
- Duplicate counts
- Date format correctness

In [7]:
# Data summary
if not combined_df.empty:
    print(f"\n📊 Total reviews collected: {metrics['total_reviews']}")
    
    print("\n🔍 Per Bank Summary:")
    display(summary_df.style.format({
        'missing_%': '{:.2f}%',
        'reviews': '{:,}'
    }))

else:
    print("\n❌ No data available - all files missing or empty")


📊 Total reviews collected: 1432

🔍 Per Bank Summary:


| | bank | file | reviews | missing_values | missing_% | duplicates | date_format_OK |
|---|---|---|---|---|---|---|---|
| 0 | Commercial Bank Of Ethiopia | commercial_bank_of_ethiopia_reviews.csv | 484 | 0 | 0.00% | 0 | True |
| 1 | Bank Of Abyssinia | bank_of_abyssinia_reviews.csv | 499 | 0 | 0.00% | 0 | True |
| 2 | Dashen Bank | dashen_bank_reviews.csv | 449 | 0 | 0.00% | 0 | True |
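
The figures in the summary table can be checked against the two numeric thresholds from the validation section. The short sketch below copies the per-bank values from the table by hand; the variable names are illustrative and do not come from `scripts/validation.py`.

```python
# Per-bank figures copied from the summary table above.
summary = {
    "Commercial Bank of Ethiopia": {"reviews": 484, "missing_pct": 0.0},
    "Bank of Abyssinia": {"reviews": 499, "missing_pct": 0.0},
    "Dashen Bank": {"reviews": 449, "missing_pct": 0.0},
}

MIN_TOTAL_REVIEWS = 1200  # minimum total stated in the validation section
MAX_MISSING_PCT = 5.0     # per-file missing-value threshold

total_reviews = sum(row["reviews"] for row in summary.values())
meets_total = total_reviews >= MIN_TOTAL_REVIEWS
meets_missing = all(row["missing_pct"] < MAX_MISSING_PCT for row in summary.values())

print(f"Total reviews: {total_reviews} (minimum {MIN_TOTAL_REVIEWS}) -> {meets_total}")
print(f"Every file under {MAX_MISSING_PCT}% missing -> {meets_missing}")
```

The per-bank counts sum to 1,432, matching the total reported by the notebook.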


--- 

### Summary and Next Steps

- Three separate datasets were created: Commercial Bank of Ethiopia (CBE), Bank of Abyssinia (BOA), and Dashen Bank.
- Each dataset was cleaned and saved to `../data/`.
- The data passed validation: 1,432 reviews in total (minimum 1,200) and 0% missing values in each file (threshold <5%).

Next step: run sentiment analysis and thematic extraction in the next notebook.