# Task 1: Data Collection and Preprocessing

## Customer Experience Analytics for Ethiopian Fintech Apps

**Objective:** Scrape and preprocess user reviews from the Google Play Store for the mobile apps of three Ethiopian banks:
- Commercial Bank of Ethiopia (CBE)
- Bank of Abyssinia (BOA)
- Dashen Bank

**Target:** 400+ reviews per bank (1,200 total minimum)

## 1. Setup and Imports

In [None]:
# Add src directory to path for imports
import sys
import os
sys.path.insert(0, os.path.abspath('../src'))

# Standard imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Project imports
from config import APP_IDS, BANK_NAMES, DATA_PATHS
from scraper import PlayStoreScraper
from preprocessing import ReviewPreprocessor

# Display settings
pd.set_option('display.max_colwidth', 100)
plt.style.use('seaborn-v0_8-whitegrid')  # on Matplotlib < 3.6 this style is named 'seaborn-whitegrid'
sns.set_palette('Set2')

print("Setup complete!")

## 2. Configuration Overview

Let's verify our target banks and app IDs.
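
For orientation, here is a minimal sketch of what `config.py` might contain. The dictionary names and the `raw_reviews` / `processed_reviews` keys come from this notebook; the bank codes, the package names (placeholders, not the real Play Store IDs), and the file names are assumptions for illustration only.

```python
import os

# Hypothetical placeholder package names; replace with the real Play Store app IDs.
APP_IDS = {
    "CBE": "com.example.cbe",
    "BOA": "com.example.boa",
    "Dashen": "com.example.dashen",
}

# Bank codes mapped to display names (the code "Dashen" is an assumption).
BANK_NAMES = {
    "CBE": "Commercial Bank of Ethiopia",
    "BOA": "Bank of Abyssinia",
    "Dashen": "Dashen Bank",
}

# File names are assumptions; only the dictionary keys appear in this notebook.
DATA_PATHS = {
    "raw_reviews": os.path.join("..", "data", "raw", "reviews_raw.csv"),
    "processed_reviews": os.path.join("..", "data", "processed", "reviews_processed.csv"),
}


def missing_app_ids():
    """Return bank codes that have a display name but no app ID."""
    return sorted(set(BANK_NAMES) - set(APP_IDS))
```

Calling `missing_app_ids()` before scraping is a cheap way to catch a bank that was added to one dictionary but not the other.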

In [None]:
# Display target banks
print("Target Banks for Analysis")
print("=" * 50)
for code, name in BANK_NAMES.items():
    app_id = APP_IDS[code]
    print(f"\n{code}: {name}")
    print(f"   App ID: {app_id}")

## 3. Data Collection - Scraping Google Play Reviews

We'll use the `google-play-scraper` library to collect reviews from the Google Play Store.
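
The project's `PlayStoreScraper` is not shown in this notebook. As a rough sketch of what such a wrapper could do, the code below calls `reviews()` from `google-play-scraper` and maps the returned dictionaries to the column names used later (`review_text`, `rating`, `review_date`, `bank_name`). The function names `fetch_reviews` and `to_rows` and the default arguments are assumptions, not the project's actual implementation.

```python
from datetime import datetime


def fetch_reviews(app_id, count=400):
    """Fetch up to `count` of the newest reviews for one app (requires network access)."""
    # Imported lazily so that to_rows() can be used without the package installed.
    from google_play_scraper import Sort, reviews

    result, _continuation_token = reviews(
        app_id,
        lang="en",
        country="us",
        sort=Sort.NEWEST,
        count=count,
    )
    return result


def to_rows(raw_reviews, bank_name):
    """Map google-play-scraper review dicts to flat rows with this notebook's column names."""
    return [
        {
            "review_text": review.get("content"),
            "rating": review.get("score"),
            "review_date": review.get("at"),
            "bank_name": bank_name,
        }
        for review in raw_reviews
    ]


# Offline usage example with a hand-written review dict
sample = [{"content": "Works well", "score": 4, "at": datetime(2024, 1, 2, 10, 30)}]
rows = to_rows(sample, "Dashen Bank")
```

Keeping the network call separate from the row mapping makes the mapping easy to check without hitting the Play Store.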

In [None]:
# Initialize the scraper
scraper = PlayStoreScraper()

# Scrape reviews for all banks
# Target: 400+ reviews per bank; actual counts depend on what the Play Store returns
raw_df = scraper.scrape_all_banks()

In [None]:
# Display scraping results summary
if not raw_df.empty:
    print("\nRaw Data Summary")
    print("=" * 50)
    print(f"Total reviews collected: {len(raw_df)}")
    print("\nReviews per bank:")
    print(raw_df['bank_name'].value_counts())
    print(f"\nColumns: {list(raw_df.columns)}")
else:
    print("No reviews were collected; check the app IDs and the network connection.")

In [None]:
# Preview raw data
raw_df.head(10)

## 4. Data Preprocessing

The preprocessing pipeline performs:
1. Missing data check
2. Duplicate removal
3. Missing value handling
4. Date normalization (YYYY-MM-DD)
5. Text cleaning
6. English language filtering
7. Rating validation (1-5)
8. Final output preparation
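
The steps above can be sketched in plain Python over a list of review dictionaries. This is only an illustration of the listed steps, not the code of `ReviewPreprocessor` (which is not shown here). The ASCII-share test is a crude stand-in for real language detection, and the duplicate key of bank name plus cleaned text is an assumption.

```python
import re
from datetime import datetime

REQUIRED = ("review_text", "rating", "review_date", "bank_name")


def is_mostly_ascii(text, threshold=0.9):
    """Crude English proxy (assumption): share of ASCII characters in the text."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold


def preprocess(rows):
    """Apply the listed steps to a list of review dicts and return the cleaned rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Steps 1 and 3: drop rows with a missing required field
        if any(row.get(col) in (None, "") for col in REQUIRED):
            continue
        # Step 5: collapse runs of whitespace and trim
        text = re.sub(r"\s+", " ", str(row["review_text"])).strip()
        # Step 6: keep reviews that look like English
        if not is_mostly_ascii(text):
            continue
        # Step 7: the rating must be an integer from 1 to 5
        try:
            rating = int(row["rating"])
        except (TypeError, ValueError):
            continue
        if not 1 <= rating <= 5:
            continue
        # Step 4: normalize dates to YYYY-MM-DD
        date = row["review_date"]
        try:
            if not isinstance(date, datetime):
                date = datetime.fromisoformat(str(date))
        except ValueError:
            continue
        # Step 2: drop duplicates (same bank and same cleaned text)
        key = (row["bank_name"], text)
        if key in seen:
            continue
        seen.add(key)
        # Step 8: emit the cleaned row
        cleaned.append({**row, "review_text": text, "rating": rating,
                        "review_date": date.strftime("%Y-%m-%d")})
    return cleaned
```

Checking for duplicates after text cleaning means two reviews that differ only in whitespace count as the same review.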

In [None]:
# Initialize and run preprocessor
preprocessor = ReviewPreprocessor()
success = preprocessor.process()

if success:
    processed_df = preprocessor.df
    print(f"\nProcessed dataset shape: {processed_df.shape}")
else:
    print("Preprocessing failed!")

In [None]:
# Reload the processed data from disk so later cells also work when the scraping and preprocessing cells are skipped
processed_df = pd.read_csv(DATA_PATHS['processed_reviews'])
print(f"Loaded {len(processed_df)} processed reviews")
processed_df.head()

## 5. Data Quality Assessment

In [None]:
# Data quality metrics
print("Data Quality Report")
print("=" * 50)
print(f"\nTotal records: {len(processed_df)}")
print(f"Missing values: {processed_df.isnull().sum().sum()}")
print(f"Duplicate rows: {processed_df.duplicated().sum()}")

# Check if we meet the minimum requirement
min_required = 400
print(f"\nReviews per bank (minimum required: {min_required}):")
bank_counts = processed_df['bank_name'].value_counts()
for bank, count in bank_counts.items():
    status = "✓" if count >= min_required else "✗"
    print(f"  {status} {bank}: {count}")

total_required = 1200
total_status = "✓" if len(processed_df) >= total_required else "✗"
print(f"\n{total_status} Total reviews: {len(processed_df)} (required: {total_required})")

In [None]:
# Data types and info
print("Dataset Info")
print("=" * 50)
processed_df.info()

## 6. Exploratory Data Analysis (EDA)

### 6.1 Reviews Distribution by Bank

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

bank_counts = processed_df['bank_name'].value_counts()
colors = sns.color_palette('Set2', len(bank_counts))

bars = ax.bar(bank_counts.index, bank_counts.values, color=colors, edgecolor='black')

# Add value labels on bars
for bar, count in zip(bars, bank_counts.values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
            str(count), ha='center', va='bottom', fontweight='bold', fontsize=12)

# Add minimum threshold line
ax.axhline(y=400, color='red', linestyle='--', linewidth=2, label='Minimum Required (400)')

ax.set_xlabel('Bank', fontsize=12)
ax.set_ylabel('Number of Reviews', fontsize=12)
ax.set_title('Reviews Collected per Bank', fontsize=14, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.savefig('../data/processed/reviews_per_bank.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.2 Rating Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall rating distribution
ax1 = axes[0]
# Reindex to 1-5 so each star level keeps its own color even if a level has no reviews
rating_counts = processed_df['rating'].value_counts().reindex(range(1, 6), fill_value=0)
colors = ['#d73027', '#fc8d59', '#fee08b', '#d9ef8b', '#91cf60']  # Red to Green
bars = ax1.bar(rating_counts.index, rating_counts.values, color=colors, edgecolor='black')

for bar, count in zip(bars, rating_counts.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
             str(count), ha='center', va='bottom', fontsize=10)

ax1.set_xlabel('Rating (Stars)', fontsize=12)
ax1.set_ylabel('Number of Reviews', fontsize=12)
ax1.set_title('Overall Rating Distribution', fontsize=14, fontweight='bold')
ax1.set_xticks([1, 2, 3, 4, 5])

# Rating distribution by bank
ax2 = axes[1]
rating_by_bank = processed_df.groupby(['bank_name', 'rating']).size().unstack(fill_value=0)
rating_by_bank.plot(kind='bar', ax=ax2, colormap='RdYlGn', edgecolor='black')

ax2.set_xlabel('Bank', fontsize=12)
ax2.set_ylabel('Number of Reviews', fontsize=12)
ax2.set_title('Rating Distribution by Bank', fontsize=14, fontweight='bold')
ax2.legend(title='Rating', bbox_to_anchor=(1.02, 1), loc='upper left')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('../data/processed/rating_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.3 Average Rating by Bank

In [None]:
# Calculate average ratings
avg_ratings = processed_df.groupby('bank_name')['rating'].agg(['mean', 'std', 'count'])
avg_ratings.columns = ['Average Rating', 'Std Dev', 'Review Count']
avg_ratings = avg_ratings.round(2)

print("Average Ratings by Bank")
print("=" * 50)
print(avg_ratings)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

banks = avg_ratings.index
means = avg_ratings['Average Rating']
stds = avg_ratings['Std Dev']

colors = sns.color_palette('Set2', len(banks))
bars = ax.bar(banks, means, yerr=stds, capsize=5, color=colors, edgecolor='black')

for bar, mean in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
            f'{mean:.2f}', ha='center', va='bottom', fontweight='bold', fontsize=12)

ax.set_xlabel('Bank', fontsize=12)
ax.set_ylabel('Average Rating', fontsize=12)
ax.set_title('Average Rating by Bank (with Std Dev)', fontsize=14, fontweight='bold')
ax.set_ylim(0, 5.5)
ax.axhline(y=3, color='gray', linestyle='--', alpha=0.5, label='Neutral (3.0)')
ax.legend()

plt.tight_layout()
plt.savefig('../data/processed/average_rating_by_bank.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.4 Review Text Length Analysis

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Text length distribution
ax1 = axes[0]
processed_df['text_length'].hist(bins=50, ax=ax1, color='steelblue', edgecolor='black', alpha=0.7)
ax1.axvline(processed_df['text_length'].median(), color='red', linestyle='--', 
            label=f'Median: {processed_df["text_length"].median():.0f}')
ax1.axvline(processed_df['text_length'].mean(), color='orange', linestyle='--',
            label=f'Mean: {processed_df["text_length"].mean():.0f}')
ax1.set_xlabel('Review Length (characters)', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of Review Text Length', fontsize=14, fontweight='bold')
ax1.legend()

# Text length by bank (boxplot)
ax2 = axes[1]
processed_df.boxplot(column='text_length', by='bank_name', ax=ax2)
ax2.set_xlabel('Bank', fontsize=12)
ax2.set_ylabel('Review Length (characters)', fontsize=12)
ax2.set_title('Review Length by Bank', fontsize=14, fontweight='bold')
plt.suptitle('')  # Remove automatic title

plt.tight_layout()
plt.savefig('../data/processed/text_length_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.5 Reviews Over Time

In [None]:
# Convert review_date to datetime if needed
processed_df['review_date'] = pd.to_datetime(processed_df['review_date'])

# Reviews over time (monthly)
processed_df['year_month'] = processed_df['review_date'].dt.to_period('M')

fig, ax = plt.subplots(figsize=(14, 6))

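# Count reviews per month and bank on one shared, sorted month axis.
# Plotting each bank separately against string labels can put months out of order
# when the banks cover different date ranges.
monthly_counts = (
    processed_df.groupby(['year_month', 'bank_name']).size().unstack(fill_value=0).sort_index()
)
months = monthly_counts.index.astype(str)
for bank in monthly_counts.columns:
    ax.plot(months, monthly_counts[bank].values, marker='o', label=bank, linewidth=2)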

ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Number of Reviews', fontsize=12)
ax.set_title('Reviews Over Time by Bank', fontsize=14, fontweight='bold')
ax.legend()
ax.tick_params(axis='x', rotation=45)

# Show only every nth label to avoid crowding
n = max(1, len(ax.get_xticklabels()) // 10)
for i, label in enumerate(ax.get_xticklabels()):
    if i % n != 0:
        label.set_visible(False)

plt.tight_layout()
plt.savefig('../data/processed/reviews_over_time.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Sample Reviews

In [None]:
# Display sample reviews for each bank
for bank in processed_df['bank_name'].unique():
    print(f"\n{'='*60}")
    print(f"{bank}")
    print('='*60)
    
    bank_df = processed_df[processed_df['bank_name'] == bank]
    
    # Show one positive and one negative review
    positive = bank_df[bank_df['rating'] >= 4].head(1)
    negative = bank_df[bank_df['rating'] <= 2].head(1)
    
    for label, sample in (("Positive", positive), ("Negative", negative)):
        if not sample.empty:
            text = str(sample['review_text'].iloc[0])
            # Append an ellipsis only when the text was actually truncated
            snippet = text[:300] + ("..." if len(text) > 300 else "")
            print(f"\n[{label} Review - {sample['rating'].iloc[0]} stars]")
            print(f'"{snippet}"')

## 8. Data Export Summary

In [None]:
# Final summary
print("Task 1 Completion Summary")
print("=" * 60)
print(f"\nData Collection:")
print(f"  - Source: Google Play Store")
print(f"  - Banks: {', '.join(BANK_NAMES.values())}")
print(f"  - Total reviews: {len(processed_df)}")

print(f"\nData Quality:")
print(f"  - Missing values: {processed_df.isnull().sum().sum()}")
print(f"  - Duplicates: {processed_df.duplicated().sum()}")
print(f"  - Date range: {processed_df['review_date'].min()} to {processed_df['review_date'].max()}")

print(f"\nOutput Files:")
print(f"  - Raw data: {DATA_PATHS['raw_reviews']}")
print(f"  - Processed data: {DATA_PATHS['processed_reviews']}")

print(f"\nColumns in processed dataset:")
for col in processed_df.columns:
    print(f"  - {col}")

In [None]:
# Summary statistics for the numeric columns of the final dataset
processed_df.describe()

## 9. Next Steps

With Task 1 complete, and provided the checks in Section 5 pass, we have:
- ✅ Scraped 400+ reviews per bank from Google Play Store
- ✅ Preprocessed and cleaned the data
- ✅ Normalized dates to YYYY-MM-DD format
- ✅ Filtered to English-only reviews
- ✅ Saved clean CSV with required columns

**Task 2** will involve:
- Sentiment analysis using DistilBERT
- Thematic analysis with keyword extraction
- Topic clustering (3-5 themes per bank)