# Task 1: Data Collection and Preprocessing

This notebook covers the data collection and preprocessing phase of the Customer Experience Analytics project.

## Objectives:
- Scrape reviews from the Google Play Store for three Ethiopian banks
- Preprocess and clean the collected data
- Validate data quality metrics
- Prepare data for analysis

## Banks Analyzed:
- Commercial Bank of Ethiopia (CBE)
- Bank of Abyssinia (BOA)
- Dashen Bank


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully")


## Step 1: Web Scraping

**Note**: The actual scraping is done using the `scrape_reviews.py` script. This notebook assumes the data has already been collected.

To run scraping manually:
```bash
python task1_data_collection/scrape_reviews.py
```
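
The raw CSV loaded in the next step is read with `bank`, `rating`, `review`, and `date` columns. As a rough sketch of how scraped records might be shaped into that layout, the helper below converts a list of review dictionaries into a DataFrame. The input keys (`content`, `score`, `at`), the function name `records_to_frame`, and the sample records are assumptions for illustration; the actual `scrape_reviews.py` may work differently.

```python
from datetime import datetime

import pandas as pd


def records_to_frame(records, bank):
    """Shape scraped review dicts into bank/rating/review/date rows.

    Assumes each record has 'content', 'score', and 'at' keys (hypothetical).
    """
    rows = []
    for rec in records:
        posted = rec.get("at")
        rows.append({
            "bank": bank,
            "rating": rec.get("score"),
            "review": rec.get("content"),
            # Keep only the calendar date; use None when no timestamp is present
            "date": posted.strftime("%Y-%m-%d") if isinstance(posted, datetime) else None,
        })
    return pd.DataFrame(rows, columns=["bank", "rating", "review", "date"])


# Made-up sample records, for illustration only
sample_records = [
    {"content": "Works well", "score": 5, "at": datetime(2024, 1, 15, 9, 30)},
    {"content": "Keeps crashing", "score": 1, "at": None},
]
sample_raw = records_to_frame(sample_records, "CBE")
print(sample_raw)
```

Passing `columns=` explicitly keeps the column order stable even when the input list is empty.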

### 1.1 Load Raw Data


In [None]:
# Load raw scraped data
try:
    df_raw = pd.read_csv('../data/raw/all_reviews_raw.csv')
    print(f"✅ Loaded {len(df_raw)} raw reviews")
    print(f"\n📊 Data Overview:")
    print(f"   Columns: {list(df_raw.columns)}")
    print(f"\n🏦 Reviews by Bank:")
    print(df_raw['bank'].value_counts())
    print(f"\n⭐ Rating Distribution:")
    print(df_raw['rating'].value_counts().sort_index())
except FileNotFoundError:
    print("⚠️  Raw data file not found.")
    print("💡 Please run: python task1_data_collection/scrape_reviews.py")
    df_raw = None


### 1.2 Display Sample Raw Data


In [None]:
# Display sample raw reviews
if df_raw is not None:
    print("📝 Sample Raw Reviews:")
    print("="*60)
    for idx, row in df_raw.head(5).iterrows():
        print(f"\nBank: {row['bank']} | Rating: {row['rating']}★ | Date: {row.get('date', 'N/A')}")
        # str() guards against missing (NaN) review text, which cannot be sliced
        print(f"Review: {str(row['review'])[:150]}...")
        print("-"*60)


## Step 2: Data Preprocessing

**Note**: Preprocessing is done by the `preprocess_reviews.py` script. This notebook loads the cleaned output and checks its quality.
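
As a rough illustration of the kind of cleaning involved (duplicate removal, missing-data handling, and date normalization), the sketch below applies those steps to a small made-up DataFrame. The function name `clean_reviews`, the sample rows, and the exact rules are assumptions; the real logic lives in `preprocess_reviews.py` and may differ.

```python
import pandas as pd


def clean_reviews(raw):
    """Return a cleaned copy of a raw reviews DataFrame (illustrative rules)."""
    df = raw.copy()

    # Normalize types before filtering
    df["review"] = df["review"].fillna("").astype(str).str.strip()
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

    # Normalize dates to YYYY-MM-DD; unparseable or missing dates become 'Unknown'
    parsed = pd.to_datetime(df["date"], errors="coerce")
    df["date"] = parsed.dt.strftime("%Y-%m-%d").fillna("Unknown")

    # Keep rows with non-empty text and a rating in the 1-5 range
    keep = (df["review"] != "") & df["rating"].between(1, 5)
    df = df[keep]

    # Remove duplicate reviews within the same bank
    return df.drop_duplicates(subset=["review", "bank"]).reset_index(drop=True)


# Made-up rows covering each rule, for illustration only
raw_demo = pd.DataFrame({
    "bank": ["CBE", "CBE", "BOA", "Dashen Bank", "BOA"],
    "rating": [5, 5, 1, 9, 3],
    "review": ["Great app", "Great app", None, "Out of range rating", "Slow login "],
    "date": [
        "2024-01-15 09:30:00",
        "2024-01-15 09:30:00",
        "2024-02-01 10:00:00",
        "2024-02-02 11:00:00",
        "not a date",
    ],
})
cleaned_demo = clean_reviews(raw_demo)
print(cleaned_demo)
```

Type conversions run before filtering so that every rule sees consistent values. The duplicate check uses the same `review` and `bank` pair that the quality metrics in section 2.2 report on.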

### 2.1 Load Cleaned Data


In [None]:
# Load cleaned and preprocessed data
df = pd.read_csv('../data/processed/reviews_cleaned.csv')

print(f"✅ Loaded {len(df)} cleaned reviews")
print(f"\n📊 Data Overview:")
print(f"   Columns: {list(df.columns)}")
print(f"\n🏦 Reviews by Bank:")
print(df['bank'].value_counts())
print(f"\n⭐ Rating Distribution:")
print(df['rating'].value_counts().sort_index())


### 2.2 Data Quality Metrics


In [None]:
# Data quality assessment
print("="*60)
print("📊 DATA QUALITY METRICS")
print("="*60)

print(f"\n1. Total Records: {len(df)}")
print(f"2. Missing Data: {df.isnull().sum().sum()} ({df.isnull().sum().sum()/(len(df)*len(df.columns))*100:.2f}%)")
print(f"3. Duplicate Reviews: {df.duplicated(subset=['review', 'bank']).sum()}")
print(f"4. Valid Ratings (1-5): {len(df[(df['rating'] >= 1) & (df['rating'] <= 5)])}")
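# Exclude missing (NaN) dates as well as the 'Unknown' placeholder
has_date = df['date'].notna() & (df['date'] != 'Unknown')
print(f"5. Reviews with Dates: {has_date.sum()}")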

# Check missing data by column
print(f"\n📋 Missing Data by Column:")
missing_data = df.isnull().sum()
missing_pct = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0])


### 2.3 Data Visualization


In [None]:
# Visualize reviews by bank
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Reviews count by bank
ax1 = axes[0]
bank_counts = df['bank'].value_counts()
colors = ['#2E86AB', '#A23B72', '#F18F01']
bank_counts.plot(kind='bar', ax=ax1, color=colors[:len(bank_counts)])
ax1.set_title('Total Reviews by Bank', fontsize=14, fontweight='bold')
ax1.set_xlabel('Bank', fontsize=12)
ax1.set_ylabel('Number of Reviews', fontsize=12)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
ax1.grid(axis='y', alpha=0.3)

# Rating distribution
ax2 = axes[1]
rating_counts = df['rating'].value_counts().sort_index()
rating_counts.plot(kind='bar', ax=ax2, color='#06A77D')
ax2.set_title('Overall Rating Distribution', fontsize=14, fontweight='bold')
ax2.set_xlabel('Star Rating', fontsize=12)
ax2.set_ylabel('Number of Reviews', fontsize=12)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Rating distribution by bank
fig, ax = plt.subplots(figsize=(12, 6))
rating_by_bank = pd.crosstab(df['bank'], df['rating'])
rating_by_bank.plot(kind='bar', ax=ax, 
                    color=['#FF6B6B', '#FFA07A', '#FFD700', '#98D8C8', '#6BCB77'],
                    width=0.8)
ax.set_title('Rating Distribution by Bank', fontsize=14, fontweight='bold')
ax.set_xlabel('Bank', fontsize=12)
ax.set_ylabel('Number of Reviews', fontsize=12)
ax.legend(title='Rating', labels=['1★', '2★', '3★', '4★', '5★'], title_fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()


### 2.4 Summary Statistics


In [None]:
# Summary statistics by bank
print("="*60)
print("📊 SUMMARY STATISTICS BY BANK")
print("="*60)

summary_stats = df.groupby('bank').agg({
    'rating': ['count', 'mean', 'std', 'min', 'max'],
    'review_id': 'count'
}).round(2)

summary_stats.columns = ['Total Reviews', 'Avg Rating', 'Std Dev', 'Min Rating', 'Max Rating', 'Count']
summary_stats = summary_stats.drop('Count', axis=1)

print("\n", summary_stats)

# Average rating comparison
print("\n" + "="*60)
print("⭐ AVERAGE RATING BY BANK")
print("="*60)
avg_rating = df.groupby('bank')['rating'].mean().sort_values(ascending=False)
for bank, rating in avg_rating.items():
    print(f"   {bank}: {rating:.2f}★")
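
The per-bank summary in this section renames columns by position, which depends on the order of the aggregated columns. As an alternative sketch, pandas named aggregation assigns the output names directly in the `agg` call. The small DataFrame and the output column names here are made up for illustration.

```python
import pandas as pd

# Made-up ratings, for illustration only
ratings_demo = pd.DataFrame({
    "bank": ["CBE", "CBE", "BOA", "BOA"],
    "rating": [5, 3, 1, 2],
})

# Each keyword becomes an output column; the value names the aggregation
summary_demo = ratings_demo.groupby("bank")["rating"].agg(
    total_reviews="count",
    avg_rating="mean",
    std_dev="std",
    min_rating="min",
    max_rating="max",
).round(2)
print(summary_demo)
```

With this form there is no helper column to drop afterwards, and adding or removing a statistic cannot shift the other labels.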


## Task 1 Summary

✅ **Completed Steps:**
1. Data collection from Google Play Store (400+ reviews per bank)
2. Data preprocessing and cleaning
3. Duplicate removal
4. Missing data handling
5. Date normalization
6. Data quality validation

✅ **KPIs Achieved:**
- 1,200+ reviews collected
- <5% missing data
- Clean CSV dataset ready for analysis
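
The KPI thresholds above can also be checked programmatically. The following is a minimal sketch, assuming a DataFrame with a `bank` column; the function name `check_kpis` and its parameters are hypothetical, with defaults taken from the figures in this summary. The demo call uses lowered thresholds on made-up data.

```python
import pandas as pd


def check_kpis(df, min_total=1200, min_per_bank=400, max_missing_pct=5.0):
    """Return a dict of KPI name -> bool for a reviews DataFrame."""
    total_cells = df.shape[0] * df.shape[1]
    # Share of all cells that are missing, as a percentage
    missing_pct = df.isnull().sum().sum() / total_cells * 100 if total_cells else 0.0
    per_bank = df["bank"].value_counts()
    return {
        "total_reviews": len(df) >= min_total,
        "reviews_per_bank": bool((per_bank >= min_per_bank).all()),
        "missing_data": bool(missing_pct < max_missing_pct),
    }


# Made-up data with lowered thresholds, for illustration only
kpi_demo = pd.DataFrame({
    "bank": ["CBE", "CBE", "BOA", "BOA"],
    "rating": [5, 4, 1, 2],
    "review": ["Good", "Fine", "Bad", None],
})
kpi_result = check_kpis(kpi_demo, min_total=4, min_per_bank=2)
print(kpi_result)
```

In the demo, one of twelve cells is missing (about 8.3%), so the missing-data check fails while the two count checks pass.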

**Next Step**: Proceed to Task 2 for Sentiment and Thematic Analysis
