## Review Data Preprocessing Notebook

This notebook starts the review pipeline by loading raw Google Play Store reviews collected for three Ethiopian banks. Each record carries a review ID, the review text, a star rating, a date, a bank label, and source metadata.

We'll begin by:
- Verifying the structure and integrity of the dataset  
- Exploring column types, value distributions, and potential anomalies  
- Identifying data patterns that will inform our cleaning strategy  
- Laying groundwork for downstream modeling (sentiment, thematic clustering)

The raw CSV was produced by an earlier scraping step and is refined by the preprocessing steps documented in this notebook.
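The structure and integrity checks listed above could be gathered into one helper. The following is only a sketch: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names that do not exist in `src.utils.utils`, and the expected columns are taken from what this notebook displays.

```python
import pandas as pd

# Hypothetical expectations, taken from the columns this notebook displays.
EXPECTED_COLUMNS = ["review_id", "review", "rating", "date", "bank", "source"]


def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if not df["rating"].between(1, 5).all():
        problems.append("ratings outside the 1-5 range")
    if df["review_id"].duplicated().any():
        problems.append("duplicate review_id values")
    if pd.to_datetime(df["date"], errors="coerce").isna().any():
        problems.append("unparseable dates")
    return problems


# Tiny illustrative frame (made-up IDs, not rows from the scraped data).
sample = pd.DataFrame({
    "review_id": ["a1", "a2"],
    "review": ["wow", "great"],
    "rating": [5, 4],
    "date": ["2025-07-25", "2025-07-24"],
    "bank": ["Dashen Bank", "Dashen Bank"],
    "source": ["Google Play", "Google Play"],
})
print(check_schema(sample))
```

Returning a list of problems rather than raising on the first failure makes it easier to see every issue in a single pass.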

In [1]:
import pandas as pd
import numpy as np
import os 
import sys

sys.path.append(os.path.abspath("../"))
from src.utils.utils import load_data, clean_data, clean_review_text

# Load raw review data
raw_path = "../data/raw/scraped_reviews/all_bank_reviews_20250726_131601.csv"
df = load_data(raw_path)

# Quick preview
df.head()

| | review_id | review | rating | date | bank | source |
|---|---|---|---|---|---|---|
| 0 | 24b9381c-3cd8-431a-b3c9-bd156427585a | wow | 5 | 2025-07-25 | Commercial Bank of Ethiopia (CBE) | Google Play |
| 1 | 6a7da8ad-486f-4132-8eb9-aee1e55f7322 | excellent | 5 | 2025-07-25 | Commercial Bank of Ethiopia (CBE) | Google Play |
| 2 | d12f5fc4-9d9a-4f3d-b3e0-074f7b379f14 | great | 5 | 2025-07-24 | Commercial Bank of Ethiopia (CBE) | Google Play |
| 3 | fba3dc77-c9b1-4cfc-8afe-317b302b007c | Great | 5 | 2025-07-24 | Commercial Bank of Ethiopia (CBE) | Google Play |
| 4 | bd502a94-e136-4cee-bb6b-e0c51cc5c245 | there is many thing u have to fix. | 1 | 2025-07-24 | Commercial Bank of Ethiopia (CBE) | Google Play |


In [2]:
# Display basic information about the DataFrame
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.info()

Shape: (1492, 6)
Columns: ['review_id', 'review', 'rating', 'date', 'bank', 'source']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1492 entries, 0 to 1491
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review_id  1492 non-null   object
 1   review     1492 non-null   object
 2   rating     1492 non-null   int64 
 3   date       1492 non-null   object
 4   bank       1492 non-null   object
 5   source     1492 non-null   object
dtypes: int64(1), object(5)
memory usage: 70.1+ KB


In [3]:
# Rating distribution
print("\nRatings:")
print(df['rating'].value_counts().sort_index())

# Bank label distribution
print("\nBanks:")
print(df['bank'].value_counts())

# Source breakdown 
print("\nSources:")
print(df['source'].value_counts())

# Quick check for empty or null reviews
missing_reviews = df['review'].isnull().sum()
empty_reviews = (df['review'].str.strip() == '').sum()
print(f"\nMissing review entries: {missing_reviews}")
print(f"Empty review text entries: {empty_reviews}")


Ratings:
rating
1    313
2     54
3     77
4     98
5    950
Name: count, dtype: int64

Banks:
bank
Commercial Bank of Ethiopia (CBE)    500
Bank of Abyssinia (BOA)              500
Dashen Bank                          492
Name: count, dtype: int64

Sources:
source
Google Play    1492
Name: count, dtype: int64

Missing review entries: 0
Empty review text entries: 0
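The rating counts above are easier to judge as proportions. As a minimal sketch, the counts are copied by hand from the output into a Series; in the notebook itself, `df['rating'].value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
rating_counts = pd.Series({1: 313, 2: 54, 3: 77, 4: 98, 5: 950}, name="count")

# Share of each star rating, rounded for readability.
shares = (rating_counts / rating_counts.sum()).round(3)
print(shares)
```

Roughly 64% of reviews are five-star and about 21% are one-star, so the middle ratings are sparse. That imbalance is worth keeping in mind when sentiment results are evaluated later.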


## Preprocessing Functions

Two core cleaning utilities are introduced:

- `clean_data(df)`: Cleans the raw DataFrame by removing duplicates and null values, then parsing the specified date columns
- `clean_review_text(text)`: Performs minimal cleaning suited to DistilBERT-based sentiment modeling (preserves emojis, punctuation, and contractions)

Together they enforce a consistent structure while leaving the review text close to its original form, so the signal a transformer model relies on is kept.
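The source of `src.utils.utils` is not shown in this notebook, so the following is only a sketch of what the two utilities might do, based on the descriptions above. The `_sketch` function names are hypothetical stand-ins, and the whitespace-only text cleaning is an assumption chosen to be consistent with the sample output, where case and punctuation are left unchanged.

```python
import re

import pandas as pd


def clean_data_sketch(df: pd.DataFrame, date_columns=None) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data: drop duplicates and nulls, parse dates."""
    out = df.drop_duplicates().dropna().reset_index(drop=True)
    for col in date_columns or []:
        # Unparseable values become NaT instead of raising.
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


def clean_review_text_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_review_text: normalize whitespace only."""
    # Collapse whitespace runs; leave case, emojis, punctuation, and contractions alone.
    return re.sub(r"\s+", " ", str(text)).strip()


# Made-up rows: one exact duplicate and one row with a missing review.
demo = pd.DataFrame({
    "review": ["  Great   app 👍 ", "  Great   app 👍 ", None],
    "date": ["2025-07-24", "2025-07-24", "2025-07-25"],
})
cleaned = clean_data_sketch(demo, date_columns=["date"])
cleaned["review_clean"] = cleaned["review"].apply(clean_review_text_sketch)
print(cleaned)
```

Heavier steps such as lowercasing, stop-word removal, or stemming are deliberately absent: a pretrained tokenizer handles raw text, and stripping tokens can discard sentiment cues.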

In [4]:
# General data cleaning
df = clean_data(df, date_columns=["date"])

In [5]:
# Check the result
print("Data shape after cleaning:")
print(df.shape)
print("\nData types after cleaning:")
df.dtypes

Data shape after cleaning:
(1492, 6)

Data types after cleaning:


review_id            object
review               object
rating                int64
date         datetime64[ns]
bank                 object
source               object
dtype: object

In [6]:
# Apply the lightweight cleaning function to the review text
df["review_clean"] = df["review"].apply(clean_review_text)

In [7]:
# View cleaned review text
print("\nSample cleaned reviews:")  
print(df["review_clean"].head(5))


Sample cleaned reviews:
0                                   wow
1                             excellent
2                                 great
3                                 Great
4    there is many thing u have to fix.
Name: review_clean, dtype: object


In [8]:
# save cleaned data
cleaned_path = "../data/processed/cleaned_reviews.csv"
df.to_csv(cleaned_path, index=False)
print(f"\nCleaned data saved to {cleaned_path} successfully✅")


Cleaned data saved to ../data/processed/cleaned_reviews.csv successfully✅
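One caveat when a later notebook reloads this file: CSV stores dates as text, so the `datetime64[ns]` dtype produced by cleaning is not restored unless `parse_dates` is passed to `pd.read_csv`. A minimal sketch, using an in-memory buffer as a stand-in for `cleaned_reviews.csv` and a one-row frame that mirrors two of the columns:

```python
import io

import pandas as pd

# One-row frame mirroring the review and date columns.
demo = pd.DataFrame({"review": ["wow"], "date": pd.to_datetime(["2025-07-25"])})

buffer = io.StringIO()  # in-memory stand-in for the CSV file on disk
demo.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dates come back as plain text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])  # dates restored as datetimes

print(plain["date"].dtype)
print(parsed["date"].dtype)
```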


## ✅ Notebook Summary: Initial Review Preprocessing

This notebook established the foundational steps for processing raw Google Play Store review data across three Ethiopian banks.

### 🔹 What We Accomplished:
- **Loaded raw review data** into a structured DataFrame
- **Inspected** column types, value distributions, and completeness
- Applied **`clean_data()`** to:
  - Remove duplicates and nulls
  - Parse datetime columns for temporal alignment
- Applied **`clean_review_text()`** to:
  - Clean the text lightly, as suits DistilBERT-based sentiment analysis
  - Retain emojis, punctuation, and contractions so the model sees the full signal

### What’s Next:
In upcoming notebooks, we will:
- Integrate sentiment classification using fine-tuned DistilBERT
- Expand text preprocessing for keyword extraction and thematic clustering
- Explore visualization and performance metrics across banks

Because structural cleanup and text cleaning live in separate functions, later notebooks can apply text preprocessing suited to their own task without redoing the shared cleanup.