# Task 1: Data Collection and Preprocessing

## Objective
The goal of this task is to collect, clean, and prepare user reviews from the Google Play Store for three Ethiopian banking apps: Dashen Bank, Commercial Bank of Ethiopia (CBE), and Bank of Abyssinia (BOA). This data will be used for sentiment and thematic analysis in later stages.


## Import Required Libraries

In this section, we import the necessary libraries:

- `google_play_scraper`: to scrape reviews from the Google Play Store.
- `pandas`: for handling and manipulating tabular data.
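- `datetime`: the type of each review's `at` timestamp, which is formatted with `strftime` below; the import itself is not referenced directly.
- `csv`: not used in this notebook, since pandas handles the CSV export; the import is optional.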

In [1]:
from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
import csv

## Scrape Reviews for Each Bank App

This section defines the target banking apps with their respective Google Play Store App IDs. 
For each app, we scrape up to 500 of the most recent user reviews using the `google-play-scraper` package.

The following information is collected for each review:
- Review text (`content`)
- Star rating (`score`)
- Review date (`at`)
- App name (`bank`)
- Data source (`source`, always `Google Play`)

All reviews are compiled into a single list and then converted into a Pandas DataFrame for further processing.

In [2]:
apps = {
    'Dashen Bank': 'com.dashen.dashensuperapp',
    'Commercial Bank of Ethiopia': 'com.combanketh.mobilebanking',
    'Bank of Abyssinia': 'com.boa.boaMobileBanking'
}

## Extract and Store Review Data

For each bank app in the `apps` dictionary, this loop:
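- Requests up to 500 of the newest reviews with the `reviews()` function (`sort=Sort.NEWEST`, `count=500`).
- Extracts the review text, star rating, and posting date (formatted as `YYYY-MM-DD`), then tags each record with the bank name and the source.
- Appends each review as a dictionary to the `all_reviews` list.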

Once all reviews are collected, the list is converted into a Pandas DataFrame named `df_raw`, which provides a structured format for further cleaning and analysis.

In [3]:
all_reviews = []

for bank_name, app_id in apps.items():
    print(f"Scraping reviews for {bank_name}...")
    results, _ = reviews(
        app_id,
        lang='en',
        country='US',
        sort=Sort.NEWEST,
        count=500,
        filter_score_with=None
    )
    
    for entry in results:
        all_reviews.append({
            'review': entry['content'],
            'rating': entry['score'],
            'date': entry['at'].strftime('%Y-%m-%d'),
            'bank': bank_name,
            'source': 'Google Play'
        })

df_raw = pd.DataFrame(all_reviews)
df_raw.head()

Scraping reviews for Dashen Bank...
Scraping reviews for Commercial Bank of Ethiopia...
Scraping reviews for Bank of Abyssinia...


| | review | rating | date | bank | source |
|---|---|---|---|---|---|
| 0 | kalid | 5 | 2025-06-08 | Dashen Bank | Google Play |
| 1 | I like this mobile banking app very much. Over... | 2 | 2025-06-07 | Dashen Bank | Google Play |
| 2 | love | 3 | 2025-06-06 | Dashen Bank | Google Play |
| 3 | መቸሸጠ | 5 | 2025-06-03 | Dashen Bank | Google Play |
| 4 | wow | 5 | 2025-06-03 | Dashen Bank | Google Play |
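
Because `count=500` is an upper bound rather than a guarantee, a per-bank tally is a quick way to see how many reviews each app actually returned. The sketch below runs on a hand-made stand-in for `df_raw`; the rows and the name `toy_raw` are illustrative assumptions, not scraped data.

```python
import pandas as pd

# Toy stand-in for df_raw; the rows are illustrative, not real scraped reviews.
toy_raw = pd.DataFrame({
    'review': ['good app', 'crashes often', 'ok', 'slow login'],
    'rating': [5, 1, 3, 2],
    'bank': ['Dashen Bank', 'Dashen Bank',
             'Commercial Bank of Ethiopia', 'Bank of Abyssinia'],
})

# Number of reviews collected per bank.
per_bank = toy_raw['bank'].value_counts()
print(per_bank)
```

Running the same `value_counts()` call on the real `df_raw['bank']` column would show whether any app returned fewer than 500 reviews.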


## Clean the Review Data

To ensure data quality and accuracy, the following cleaning steps are applied:

1. **Remove Duplicates**  
   Duplicate entries are dropped based on a combination of `review`, `date`, and `bank` to avoid over-representing repeated feedback.

2. **Handle Missing Values**  
   Any rows with missing values in the key fields `review`, `rating`, or `date` are removed.

3. **Check Dataset Size**  
   The number of reviews before and after cleaning is printed to confirm how many were removed during preprocessing.


In [4]:
df_clean = df_raw.drop_duplicates(subset=['review', 'date', 'bank'])

df_clean = df_clean.dropna(subset=['review', 'rating', 'date'])

print(f"Original count: {len(df_raw)}")
print(f"Cleaned count: {len(df_clean)}")

Original count: 1450
Cleaned count: 1428
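
The two cleaning steps can be checked on a toy frame where the expected result is easy to verify by hand. Note that `dropna` only removes true missing values (`None`/`NaN`); an empty or whitespace-only review string survives it. The final filter in this sketch is an assumption about what one might want in addition, not something the notebook does, and all rows are made up for illustration.

```python
import pandas as pd

# Toy data (illustrative only): one exact duplicate, one missing review,
# and one whitespace-only review.
toy = pd.DataFrame({
    'review': ['great', 'great', None, '   ', 'slow'],
    'rating': [5, 5, 4, 3, 1],
    'date': ['2025-06-01', '2025-06-01', '2025-06-02',
             '2025-06-02', '2025-06-03'],
    'bank': ['Dashen Bank'] * 5,
})

# Same two steps as the notebook cell.
deduped = toy.drop_duplicates(subset=['review', 'date', 'bank'])
cleaned = deduped.dropna(subset=['review', 'rating', 'date'])

# Optional extra step (assumption): drop reviews that are blank after stripping.
non_blank = cleaned[cleaned['review'].str.strip() != '']

print(len(toy), len(deduped), len(cleaned), len(non_blank))
```

On this toy input the row counts go from 5 to 4 after de-duplication, to 3 after `dropna`, and to 2 once the blank review is filtered out.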


## Save the Cleaned Dataset

After cleaning, the final dataset is saved to a CSV file for later use in sentiment and thematic analysis.

- The file is written under a fixed name, `cleaned_reviews.csv`, so re-running the cell overwrites the previous export.
- The file is saved without the index column.
- A confirmation message is printed showing the file name and successful save.

In [8]:
filename = "cleaned_reviews.csv"
df_clean.to_csv(filename, index=False)
print(f"✅ Saved cleaned dataset as: {filename}")

✅ Saved cleaned dataset as: cleaned_reviews.csv
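
The cell above writes to a fixed name. If each run should produce its own file, a timestamp can be worked into the filename. The sketch below is one way to do that; the helper name `timestamped_filename`, the `cleaned_reviews_YYYYMMDD_HHMMSS.csv` pattern, and the toy rows are illustrative assumptions, not the notebook's convention. It also reads the file back with `parse_dates` to confirm the round trip.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd


def timestamped_filename(prefix='cleaned_reviews', now=None):
    """Build a name like cleaned_reviews_20250608_143000.csv (hypothetical pattern)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Toy stand-in for df_clean (illustrative rows only).
toy_clean = pd.DataFrame({
    'review': ['great', 'slow'],
    'rating': [5, 1],
    'date': ['2025-06-01', '2025-06-03'],
    'bank': ['Dashen Bank', 'Bank of Abyssinia'],
    'source': ['Google Play', 'Google Play'],
})

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, timestamped_filename())
    toy_clean.to_csv(path, index=False)
    # parse_dates turns the 'date' strings back into datetime64 values.
    reloaded = pd.read_csv(path, parse_dates=['date'])

print(reloaded.dtypes)
```

Since the notebook stores `date` as a `YYYY-MM-DD` string, passing `parse_dates=['date']` when loading is a convenient way to get real datetime values back for later analysis.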
