# Multilingual App Reviews - Exploratory Data Analysis

This notebook provides an initial exploration of the multilingual mobile app reviews dataset. We'll examine the data structure, identify patterns, and prepare for deeper analysis.

## Table of Contents
1. [Data Loading](#data-loading)
2. [Dataset Overview](#dataset-overview)
3. [Data Quality Assessment](#data-quality-assessment)
4. [Initial Insights](#initial-insights)
5. [Next Steps](#next-steps)

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully!")

## Data Loading

First, let's load the raw dataset from the CSV file. Make sure you have placed the `multilingual_mobile_app_reviews_2025.csv` file in the `../data/raw/` directory.

In [None]:
# Define data paths
project_root = Path().resolve().parent  # Go up one level from notebooks/
raw_data_path = project_root / "data" / "raw"
csv_file = raw_data_path / "multilingual_mobile_app_reviews_2025.csv"

print(f"Looking for dataset at: {csv_file}")
print(f"File exists: {csv_file.exists()}")

# Load the dataset
if csv_file.exists():
    try:
        df = pd.read_csv(csv_file, encoding='utf-8')
        print(f"✅ Successfully loaded dataset with {len(df)} rows and {len(df.columns)} columns")
    except UnicodeDecodeError:
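        # latin-1 maps every byte to a character, so this fallback never raises,
        # but non-Latin text may come out garbled; inspect a few rows afterwards.
        print("UTF-8 encoding failed, trying latin-1...")
        df = pd.read_csv(csv_file, encoding='latin-1')
        print(f"✅ Successfully loaded dataset with {len(df)} rows and {len(df.columns)} columns")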
else:
    print("❌ Dataset file not found!")
    print("Please download 'multilingual_mobile_app_reviews_2025.csv' from Kaggle")
    print("and place it in the data/raw/ directory.")
    df = None
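
The load-with-fallback logic above can be wrapped in a small reusable helper. This is a minimal sketch rather than part of the project's code: the name `read_csv_with_fallback` and its return shape are assumptions made for illustration. It also reports which encoding succeeded, which is useful because `latin-1` accepts any byte sequence and can therefore hide a wrong guess.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (DataFrame, encoding_used).

    Hypothetical helper for illustration; the encoding order mirrors the
    loading cell above.
    """
    if not encodings:
        raise ValueError("At least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as exc:
            last_error = exc  # Remember the failure and try the next encoding
    raise last_error
```

A call such as `df, used = read_csv_with_fallback(csv_file)` would then make the fallback visible: if `used` is `"latin-1"`, it is worth checking a few non-English reviews for garbled characters.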

## Dataset Overview

Let's examine the basic structure of our dataset.

In [None]:
if df is not None:
    print("📊 DATASET SHAPE")
    print(f"Rows: {df.shape[0]:,}")
    print(f"Columns: {df.shape[1]:,}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print("\n📋 COLUMN INFORMATION")
    print("Column names and data types:")
    for i, (col, dtype) in enumerate(df.dtypes.items()):
        print(f"{i+1:2d}. {col:<30} | {str(dtype):<15}")
    
    print("\n🔍 FIRST FEW ROWS")
    display(df.head())

## Data Quality Assessment

Now let's check for missing values and data quality issues.

In [None]:
if df is not None:
    print("🔍 MISSING VALUES ANALYSIS")
    missing_info = df.isnull().sum()
    missing_info = missing_info[missing_info > 0].sort_values(ascending=False)
    
    if len(missing_info) > 0:
        print("Columns with missing values:")
        for col, count in missing_info.items():
            percentage = (count / len(df)) * 100
            print(f"  {col:<30}: {count:,} ({percentage:.1f}%)")
    else:
        print("✅ No missing values found!")
    
    print("\n📈 BASIC STATISTICS")
    print("Numeric columns summary:")
    display(df.describe())
    
    print("\n📝 TEXT COLUMNS SAMPLE")
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols[:3]:  # Show first 3 text columns
        print(f"\n{col} - Sample values:")
        unique_vals = df[col].dropna().unique()[:5]
        for val in unique_vals:
            text = str(val)
            # Truncate long text; add an ellipsis only when something was cut
            print(f"  • {text[:100]}{'...' if len(text) > 100 else ''}")
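
The missing-value report printed above can also be kept as a table, which is easier to sort, filter, or save. The following is a sketch under stated assumptions: `missing_value_summary` is a hypothetical helper, and the toy frame uses made-up column names rather than the dataset's real schema.

```python
import pandas as pd


def missing_value_summary(frame):
    """Return one row per column that has missing values, sorted by count.

    Hypothetical helper for illustration only.
    """
    counts = frame.isna().sum()
    # max(..., 1) avoids dividing by zero for an empty frame
    percentages = (counts / max(len(frame), 1) * 100).round(1)
    summary = pd.DataFrame({"missing_count": counts, "missing_pct": percentages})
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy data standing in for the reviews dataset (column names are made up)
toy = pd.DataFrame({
    "review_text": ["great", None, "bad", None],
    "rating": [5.0, 4.0, None, 1.0],
    "language": ["en", "es", "de", "fr"],
})
print(missing_value_summary(toy))
```

Columns without gaps (here `language`) are dropped from the result, so an empty table means no missing values were found.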

## Initial Insights

Let's generate some quick visualizations to understand the data better.

In [None]:
if df is not None:
    # Try to identify key columns
    rating_col = None
    language_col = None
    app_col = None
    date_col = None
    
    for col in df.columns:
        if 'rating' in col.lower() or 'score' in col.lower():
            rating_col = col
        if 'lang' in col.lower():
            language_col = col
        if 'app' in col.lower():
            app_col = col
        if 'date' in col.lower() or 'time' in col.lower():
            date_col = col
    
    print("🎯 IDENTIFIED KEY COLUMNS")
    print(f"Rating column: {rating_col}")
    print(f"Language column: {language_col}")
    print(f"App column: {app_col}")
    print(f"Date column: {date_col}")
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Initial Data Exploration', fontsize=16, fontweight='bold')
    
    # Plot 1: Rating distribution (if rating column exists)
    if rating_col and df[rating_col].notna().sum() > 0:
        axes[0, 0].hist(df[rating_col].dropna(), bins=20, alpha=0.7, edgecolor='black')
        axes[0, 0].set_title(f'Distribution of {rating_col}')
        axes[0, 0].set_xlabel('Rating')
        axes[0, 0].set_ylabel('Frequency')
        axes[0, 0].grid(True, alpha=0.3)
    else:
        axes[0, 0].text(0.5, 0.5, 'No rating data\navailable', 
                       ha='center', va='center', transform=axes[0, 0].transAxes)
        axes[0, 0].set_title('Rating Distribution')
    
    # Plot 2: Language distribution (if language column exists)
    if language_col and df[language_col].notna().sum() > 0:
        lang_counts = df[language_col].value_counts().head(10)
        axes[0, 1].bar(range(len(lang_counts)), lang_counts.values)
        axes[0, 1].set_title('Top 10 Languages')
        axes[0, 1].set_xlabel('Language')
        axes[0, 1].set_ylabel('Count')
        axes[0, 1].set_xticks(range(len(lang_counts)))
        axes[0, 1].set_xticklabels(lang_counts.index, rotation=45, ha='right')
        axes[0, 1].grid(True, alpha=0.3)
    else:
        axes[0, 1].text(0.5, 0.5, 'No language data\navailable', 
                       ha='center', va='center', transform=axes[0, 1].transAxes)
        axes[0, 1].set_title('Language Distribution')
    
    # Plot 3: Missing values per column (bar chart)
    missing_data = df.isnull().sum()
    if missing_data.sum() > 0:
        missing_data = missing_data[missing_data > 0]
        axes[1, 0].bar(range(len(missing_data)), missing_data.values)
        axes[1, 0].set_title('Missing Values by Column')
        axes[1, 0].set_xlabel('Column')
        axes[1, 0].set_ylabel('Missing Count')
        axes[1, 0].set_xticks(range(len(missing_data)))
        axes[1, 0].set_xticklabels(missing_data.index, rotation=45, ha='right')
        axes[1, 0].grid(True, alpha=0.3)
    else:
        axes[1, 0].text(0.5, 0.5, 'No missing\nvalues found!', 
                       ha='center', va='center', transform=axes[1, 0].transAxes,
                       fontsize=14, color='green', fontweight='bold')
        axes[1, 0].set_title('Missing Values')
    
    # Plot 4: Data types distribution
    dtype_counts = df.dtypes.value_counts()
    axes[1, 1].pie(dtype_counts.values, labels=dtype_counts.index, autopct='%1.1f%%')
    axes[1, 1].set_title('Data Types Distribution')
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 QUICK STATS SUMMARY")
    print(f"• Total reviews: {len(df):,}")
    if rating_col:
        valid_ratings = df[rating_col].dropna()
        if len(valid_ratings) > 0:
            print(f"• Average rating: {valid_ratings.mean():.2f}")
            print(f"• Rating range: {valid_ratings.min():.1f} - {valid_ratings.max():.1f}")
    if language_col:
        unique_langs = df[language_col].nunique()
        print(f"• Unique languages: {unique_langs}")
    if app_col:
        unique_apps = df[app_col].nunique()
        print(f"• Unique apps: {unique_apps}")
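
One caveat about the key-column detection in the cell above: the loop keeps the last column that matches a keyword, and plain substring matching can pick up unintended columns. If the dataset had both an app-name and an app-version column, for example, the later one would win. A possible alternative, sketched here with a hypothetical `find_column` helper and made-up column names, is to try keywords in priority order and return the first match.

```python
def find_column(columns, keywords):
    """Return the first column whose lowercased name contains a keyword.

    Keywords are tried in order, so earlier keywords take priority.
    Returns None when nothing matches. Hypothetical helper for illustration.
    """
    lowered = [(col, str(col).lower()) for col in columns]
    for keyword in keywords:
        for original, name in lowered:
            if keyword in name:
                return original
    return None


# Made-up column names for illustration; the real dataset may differ
columns = ["review_id", "app_name", "app_version",
           "review_language", "rating", "review_date"]

print(find_column(columns, ["rating", "score"]))   # rating
print(find_column(columns, ["app_name", "app"]))   # app_name
print(find_column(columns, ["lang"]))              # review_language
print(find_column(columns, ["date", "time"]))      # review_date
```

Whatever heuristic is used, the printed "identified key columns" should be checked by eye against `df.columns` before the plots are trusted.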

## Next Steps

Based on this initial exploration, here are the recommended next steps:

1. **Data Cleaning**: Run the data preprocessing script to handle missing values and standardize data types
2. **Language Detection**: For any missing language labels, use automatic language detection
3. **Detailed Analysis**: Use the analytics script to generate comprehensive insights and visualizations
4. **Advanced Analytics**: Consider sentiment analysis, topic modeling, or trend analysis
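
To make the data cleaning step more concrete, here is a small sketch of what "standardizing data types" might look like with pandas. It does not describe what `src/data_prep.py` actually does; the column names and values below are invented for illustration.

```python
import pandas as pd

# Toy frame with made-up column names; the real dataset's schema may differ
raw = pd.DataFrame({
    "rating": ["5", "4.5", "n/a", "3"],
    "review_date": ["2025-01-15", "2025-02-30", "2025-03-01", None],
    "review_language": [" EN", "es ", None, "De"],
})

cleaned = raw.copy()
# Non-numeric entries such as "n/a" become NaN instead of raising
cleaned["rating"] = pd.to_numeric(cleaned["rating"], errors="coerce")
# Unparseable or impossible dates (e.g. February 30) become NaT
cleaned["review_date"] = pd.to_datetime(
    cleaned["review_date"], format="%Y-%m-%d", errors="coerce"
)
# Normalize language codes: trim whitespace and lowercase
cleaned["review_language"] = cleaned["review_language"].str.strip().str.lower()

print(cleaned.dtypes)
print(cleaned)
```

Using `errors="coerce"` turns bad values into missing values, so they show up in the missing-value analysis rather than stopping the run.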

### Running the Analysis Pipeline

```bash
# Clean and preprocess the data
python src/data_prep.py --verbose

# Generate analytics and visualizations  
python src/analytics.py --verbose
```

The cleaned data will be saved to `data/processed/` and visualizations to `reports/figures/`.