# Data Preprocessing Notebook

This notebook handles the complete lifecycle of data acquisition, cleaning, and preprocessing. It can:

1. Fetch live reviews from the Google Play Store (real data)
2. Load data from an existing CSV file (cached data)
3. Generate mock data for testing purposes

The output of this notebook is a clean, processed dataset stored in `data/processed/processed_reviews.csv`, which other notebooks can use for analysis and visualization.

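**Note:** You can control notebook behavior with these environment variables (the boolean flags accept `true`, `1`, `yes`, or `y`, case-insensitively):
- `USE_MOCK_DATA=true` - Use mock data instead of real Google Play data
- `FORCE_REFRESH=true` - Fetch fresh data even if cached data is available
- `MAX_REVIEWS=100` - Set the maximum number of reviews to fetch/process (defaults to 100)
- `APP_ID` - Google Play package name of the app to fetch (defaults to `in.goindigo.android`)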
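
As a minimal sketch of how these flags are interpreted, the helper below mirrors the parsing used in the first code cell. The name `env_flag` and the `example_env` dictionary are hypothetical and exist only for illustration; the notebook reads `os.environ` directly.

```python
import os

TRUTHY = ("true", "1", "yes", "y")

def env_flag(name, default="false", environ=None):
    """Return True when the variable holds a truthy string (case-insensitive)."""
    environ = os.environ if environ is None else environ
    return environ.get(name, default).lower() in TRUTHY

# Illustrative values only; in the notebook these come from the shell or a .env file
example_env = {"USE_MOCK_DATA": "TRUE", "FORCE_REFRESH": "no", "MAX_REVIEWS": "250"}

use_mock = env_flag("USE_MOCK_DATA", environ=example_env)
force_refresh = env_flag("FORCE_REFRESH", environ=example_env)
max_reviews = int(example_env.get("MAX_REVIEWS", "100"))
print(use_mock, force_refresh, max_reviews)
```

Any value outside the truthy set, including an unset variable, is treated as disabled.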

In [1]:
# Import necessary libraries
import os
import sys
import pandas as pd
import numpy as np
import re
import importlib
import subprocess  # Add subprocess import
from datetime import datetime, timedelta
from dotenv import load_dotenv

# Add the project root to the path. Notebooks do not define __file__, so the
# root is resolved as the parent of the current working directory.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"Added {project_root} to Python path")

# Import project modules
from src.runner import ReviewAnalysisRunner
from src.modules.acquisition import google_play
from src.modules.storage import file_storage
from src.modules.preprocessing import nlp_preprocessor

# Force reload modules to get latest changes
importlib.reload(google_play)
importlib.reload(file_storage)
importlib.reload(nlp_preprocessor)

# Load environment variables
load_dotenv()

# Configuration and paths
DATA_DIR = os.path.join(project_root, 'data')  # Updated to use /data instead of /src/data
RAW_DATA_PATH = os.path.join(DATA_DIR, 'reviews.csv')
PROCESSED_DATA_PATH = os.path.join(DATA_DIR, 'processed', 'processed_reviews.csv')

# Create directories if they don't exist
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(os.path.dirname(PROCESSED_DATA_PATH), exist_ok=True)

# Environment variables
USE_MOCK_DATA = os.environ.get('USE_MOCK_DATA', 'false').lower() in ('true', '1', 'yes', 'y')
FORCE_REFRESH = os.environ.get('FORCE_REFRESH', 'false').lower() in ('true', '1', 'yes', 'y')
MAX_REVIEWS = int(os.environ.get('MAX_REVIEWS', '100'))

# Extract APP_ID properly (removing any comments)
app_id_env = os.environ.get("APP_ID", "in.goindigo.android")
app_id = app_id_env.split('#')[0].strip()  # Remove any comments and whitespace

print(f"Configuration:\n")
print(f"- App ID: {app_id}")
print(f"- Max Reviews: {MAX_REVIEWS}")
print(f"- Use Mock Data: {USE_MOCK_DATA}")
print(f"- Force Refresh: {FORCE_REFRESH}")
print(f"- Raw Data Path: {RAW_DATA_PATH}")
print(f"- Processed Data Path: {PROCESSED_DATA_PATH}")

# Debug: Show environment variables as seen by Python
print("\nEnvironment variables:")
for env_var in ['MAX_REVIEWS', 'USE_MOCK_DATA', 'FORCE_REFRESH', 'APP_ID']:
    print(f"- {env_var}: {os.environ.get(env_var, 'Not set')}")

Added /Users/dipesh/Local-Projects/indigo-reviews-ai to Python path
DEBUG: Applying APP_ID from environment: 'com.fss.indus'
Configuration:

- App ID: com.fss.indus
- Max Reviews: 5000
- Use Mock Data: False
- Force Refresh: True
- Raw Data Path: /Users/dipesh/Local-Projects/indigo-reviews-ai/data/reviews.csv
- Processed Data Path: /Users/dipesh/Local-Projects/indigo-reviews-ai/data/processed/processed_reviews.csv

Environment variables:
- MAX_REVIEWS: 5000
- USE_MOCK_DATA: false
- FORCE_REFRESH: true
- APP_ID: com.fss.indus


In [2]:
def load_existing_data():
    """Attempt to load existing data from reviews.csv"""
    if os.path.exists(RAW_DATA_PATH):
        try:
            df = pd.read_csv(RAW_DATA_PATH)
            # Convert date column to datetime
            if 'date' in df.columns:
                df['date'] = pd.to_datetime(df['date'], errors='coerce')
            print(f"Successfully loaded {len(df)} reviews from {RAW_DATA_PATH}")
            
            # Display the first few rows of the loaded data
            print("\nFirst 5 rows of loaded data:")
            print(df.head().to_string())
            
            return df
        except Exception as e:
            print(f"Error loading existing data: {e}")
            return None
    print(f"No existing data found at {RAW_DATA_PATH}")
    return None

def fetch_fresh_data_direct():
    """Fetch fresh data by directly using the Google Play scraper library"""
    # Import the scraper directly
    from google_play_scraper import app as gp_app
    from google_play_scraper import reviews as gp_reviews
    from google_play_scraper.features.reviews import Sort
    
    print(f"Fetching data directly using Google Play scraper...")
    
    # Get app info first
    try:
        app_info = gp_app(app_id)
        total_reported_reviews = app_info.get('reviews', 0)
        print(f"App info reports {total_reported_reviews} total reviews available")
        print("Note: This may be lower than actual available reviews")
        
        # The reported count is informational only; MAX_REVIEWS sets the fetch target
        target_reviews = MAX_REVIEWS
        print(f"Will attempt to fetch up to {target_reviews} reviews")
    except Exception as e:
        print(f"Error getting app info: {e}")
        target_reviews = MAX_REVIEWS
    
    all_reviews = []
    continuation_token = None
    retries = 0
    max_retries = 3
    
    # Try multiple language/country combinations to maximize review collection
    combinations = [
        {"lang": "en", "country": "us"},  # Start with US English
        {"lang": "en", "country": "in"},  # Then try Indian English
        {"lang": "hi", "country": "in"},  # Hindi reviews from India
        {"lang": "en", "country": "gb"}   # British English
    ]
    
    # Loop through each combination
    for combo in combinations:
        print(f"\nTrying with language: {combo['lang']}, country: {combo['country']}")
        current_lang = combo['lang']
        current_country = combo['country']
        continuation_token = None  # Reset for each combination
        retries = 0  # Reset the error counter so earlier failures do not cut this combination short
        combo_reviews = 0
        
        # Loop until we have enough reviews or no more are available
        while len(all_reviews) < target_reviews:
            try:
                # Fetch batch of reviews (100 is the max per request)
                batch_size = min(100, target_reviews - len(all_reviews))
                if batch_size <= 0:
                    break
                    
                print(f"Fetching batch of {batch_size} reviews... (total so far: {len(all_reviews)})")
                result, continuation_token = gp_reviews(
                    app_id=app_id,
                    lang=current_lang,
                    country=current_country,
                    sort=Sort.NEWEST,
                    count=batch_size,
                    continuation_token=continuation_token
                )
                
                # If no results, we're done with this combination
                if not result:
                    print(f"No more reviews available for {current_lang}/{current_country}")
                    break
                    
                # Add results to our collection, avoiding duplicates
                existing_ids = set(r.get('reviewId', '') for r in all_reviews)
                new_reviews = [r for r in result if r.get('reviewId', '') not in existing_ids]
                
                if len(new_reviews) < len(result):
                    print(f"Filtered out {len(result) - len(new_reviews)} duplicate reviews")
                
                all_reviews.extend(new_reviews)
                combo_reviews += len(new_reviews)
                
                # Print progress
                print(f"Retrieved {len(new_reviews)} new reviews, total: {len(all_reviews)}/{target_reviews}")
                
                # If no continuation token, we've reached the end for this combination
                if not continuation_token:
                    print(f"No continuation token - reached the end of available reviews for {current_lang}/{current_country}")
                    break
                
                # Reset retries on successful fetch
                retries = 0
            except Exception as e:
                retries += 1
                print(f"Error fetching reviews (attempt {retries}/{max_retries}): {e}")
                if retries >= max_retries:
                    print(f"Too many errors, moving to next language/country combination")
                    break
                import time
                time.sleep(2)  # Add a short delay before retrying
        
        print(f"Completed fetching {combo_reviews} reviews for {current_lang}/{current_country}")
        
        # If we've reached our target, we can stop
        if len(all_reviews) >= target_reviews:
            print(f"Reached target of {target_reviews} reviews, stopping")
            break
    
    print(f"Completed fetching a total of {len(all_reviews)} reviews across all language/country combinations")
    
    if not all_reviews:
        # The caller receives None; no automatic fallback to mock data happens here
        print("No reviews retrieved.")
        return None
    
    # Transform review data to match our schema
    transformed_reviews = []
    for review in all_reviews:
        # Basic schema transformation
        transformed = {
            "review_id": review.get("reviewId", ""),
            "author": review.get("userName", ""),
            "rating": review.get("score", 0),
            "text": review.get("content", ""),
            "version": review.get("reviewCreatedVersion", ""),
            "thumbsUpCount": review.get("thumbsUpCount", 0),
            "replyContent": review.get("replyContent", None),
            "repliedAt": review.get("repliedAt", None)
        }
        
        # Handle date fields
        at_date = None
        try:
            # First try to use the 'at' field
            if isinstance(review.get("at"), datetime):
                at_date = review["at"]
            elif review.get("at"):
                at_date = pd.to_datetime(review["at"])
        except Exception as e:
            print(f"Error parsing 'at' field: {e}")
        
        # Fallback to timeMillis if available
        if at_date is None:
            try:
                if review.get("timeMillis"):
                    at_date = pd.to_datetime(review["timeMillis"], unit='ms')
            except Exception as e:
                print(f"Error parsing 'timeMillis' field: {e}")
        
        transformed["date"] = at_date
        transformed["timestamp"] = at_date  # Use the same value for timestamp
        
        transformed_reviews.append(transformed)
    
    # Create DataFrame
    reviews_df = pd.DataFrame(transformed_reviews)
    
    # Write the data to file
    csv_path = RAW_DATA_PATH
    reviews_df.to_csv(csv_path, index=False)
    
    # Display the first few rows
    print(f"\nSuccessfully processed {len(reviews_df)} reviews directly from Google Play")
    print("\nFirst 5 rows of fetched data:")
    print(reviews_df.head().to_string())
    
    print(f"Saved raw data to {RAW_DATA_PATH}")
    
    return reviews_df

def fetch_fresh_data():
    """Fetch fresh data from Google Play API or generate mock data"""
    if USE_MOCK_DATA:
        # Initialize the runner for mock data
        runner = ReviewAnalysisRunner()
        runner._initialize_modules()
        
        print("Using mock data source...")
        # Fetch mock reviews directly from the acquisition module
        fresh_reviews_df = runner.acquisition.fetch_reviews(
            app_id=app_id,
            max_reviews=MAX_REVIEWS,
            use_mock=True
        )
        
        if fresh_reviews_df is not None and not fresh_reviews_df.empty:
            print(f"Successfully generated {len(fresh_reviews_df)} mock reviews")
            
            # Display the first few rows of the mock data
            print("\nFirst 5 rows of mock data:")
            print(fresh_reviews_df.head().to_string())
            
            # Save to CSV
            fresh_reviews_df.to_csv(RAW_DATA_PATH, index=False)
            print(f"Saved mock data to {RAW_DATA_PATH}")
            
            return fresh_reviews_df
        else:
            print("Failed to generate mock reviews.")
            return None
    else:
        # For real data, use direct API access to avoid issues with the runner
        return fetch_fresh_data_direct()

# Main data acquisition logic
print("\n★ Data Acquisition Phase ★")

# Always respect FORCE_REFRESH flag
if FORCE_REFRESH:
    print("Force refresh requested, fetching fresh data...")
    reviews_df = fetch_fresh_data()
else:
    # Try to load existing data first
    reviews_df = load_existing_data()
    
    # Only proceed to fetch new data if:
    # 1. No existing data was found, or
    # 2. We didn't get enough reviews 
    if reviews_df is None or len(reviews_df) < MAX_REVIEWS:
        print(f"Existing data not sufficient (have {len(reviews_df) if reviews_df is not None else 0}, need {MAX_REVIEWS})")
        print("Fetching fresh data...")
        reviews_df = fetch_fresh_data()

# Display basic info about the dataset
if reviews_df is not None and not reviews_df.empty:
    print("\nDataset Overview:")
    print(f"- Shape: {reviews_df.shape} (rows, columns)")
    print(f"- Columns: {reviews_df.columns.tolist()}")
    print(f"- Date range: {reviews_df['date'].min()} to {reviews_df['date'].max()}")
    print(f"- Rating distribution:\n{reviews_df['rating'].value_counts().sort_index()}")
    
    # Check for nulls in important columns
    null_counts = reviews_df[['review_id', 'date', 'rating', 'text']].isnull().sum()
    print(f"\nNull values in key columns:\n{null_counts}")
else:
    print("No data available for processing. Please check your configuration and try again.")


★ Data Acquisition Phase ★
Force refresh requested, fetching fresh data...
Fetching data directly using Google Play scraper...
App info reports 408 total reviews available
Note: This may be lower than actual available reviews
Will attempt to fetch up to 5000 reviews

Trying with language: en, country: us
Fetching batch of 100 reviews... (total so far: 0)
Retrieved 100 new reviews, total: 100/5000
Fetching batch of 100 reviews... (total so far: 100)
Retrieved 100 new reviews, total: 200/5000
Fetching batch of 100 reviews... (total so far: 200)
Retrieved 100 new reviews, total: 300/5000
Fetching batch of 100 reviews... (total so far: 300)
Retrieved 100 new reviews, total: 400/5000
Fetching batch of 100 reviews... (total so far: 400)
Retrieved 100 new reviews, total: 500/5000
Fetching batch of 100 reviews... (total so far: 500)
Retrieved 100 new reviews, total: 600/5000
Fetching batch of 100 reviews... (total so far: 600)
Retrieved 100 new reviews, total: 700/5000
Fetching batch of 100 r
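
The pagination-with-deduplication pattern used in the cell above can be tried offline with a sketch like the following. `fake_fetch` is a hypothetical stand-in for a paginated source that returns `(batch, next_token)`; it is not the `google_play_scraper` API, and the repeated ids are invented to exercise the duplicate filter.

```python
def fake_fetch(count, continuation_token=None):
    """Hypothetical stand-in for a paginated API: returns (batch, next_token)."""
    data = [{"reviewId": f"r{i % 7}"} for i in range(10)]  # ids repeat after r6
    start = continuation_token or 0
    batch = data[start:start + count]
    next_token = start + count if start + count < len(data) else None
    return batch, next_token

def collect_reviews(fetch, target, batch_size=4):
    """Page through `fetch` until `target` unique reviews or the source is exhausted."""
    collected, seen_ids, token = [], set(), None
    while len(collected) < target:
        batch, token = fetch(min(batch_size, target - len(collected)), token)
        if not batch:
            break
        for review in batch:
            # Keep a running set of ids instead of rebuilding it for every batch
            if review["reviewId"] not in seen_ids:
                seen_ids.add(review["reviewId"])
                collected.append(review)
        if token is None:
            break
    return collected

reviews = collect_reviews(fake_fetch, target=20)
print(len(reviews))
```

Tracking seen ids in one set avoids rescanning every collected review on each batch, which matters more as the target grows.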

## Data Cleaning

This section performs initial cleaning of the raw data to prepare it for preprocessing:
1. Removing duplicates
2. Handling missing values
3. Basic text cleaning
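
The three steps can be sketched on a toy frame as follows. The rows are invented for illustration; only the column names follow the notebook's schema.

```python
import re
import pandas as pd

# Toy rows invented for illustration
raw = pd.DataFrame({
    "review_id": ["a1", "a1", "b2", "c3"],
    "rating": [5, 5, None, 2],
    "text": ["  great   app ", "  great   app ", "crashes", None],
})

cleaned = raw.drop_duplicates(subset=["review_id"]).copy()  # 1. remove duplicates
cleaned["text"] = cleaned["text"].fillna("")                # 2. fill missing text
cleaned["text"] = cleaned["text"].apply(                    # 3. collapse whitespace
    lambda t: re.sub(r"\s+", " ", str(t).strip())
)
# Rows without a rating cannot be analysed, so they are dropped
cleaned = cleaned.dropna(subset=["review_id", "rating"])

print(cleaned.to_string(index=False))
```

Missing text is filled rather than dropped, so a review that carries only a rating is kept.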

In [3]:
def clean_data(df):
    """Clean the raw data to prepare for preprocessing"""
    if df is None or df.empty:
        print("No data to clean.")
        return None
    
    print("Cleaning data...")
    cleaned_df = df.copy()
    
    # 1. Remove duplicates based on review_id
    original_count = len(cleaned_df)
    cleaned_df = cleaned_df.drop_duplicates(subset=['review_id'])
    print(f"Removed {original_count - len(cleaned_df)} duplicate reviews")
    
    # 2. Handle missing values
    # For text: replace NaN with empty string
    cleaned_df['text'] = cleaned_df['text'].fillna('')
    
    # For author: replace NaN with 'Anonymous'
    cleaned_df['author'] = cleaned_df['author'].fillna('Anonymous')
    
    # For version: replace NaN with 'Unknown'
    cleaned_df['version'] = cleaned_df['version'].fillna('Unknown')
    
    # 3. Basic text cleaning
    # Remove excessive whitespace
    cleaned_df['text'] = cleaned_df['text'].apply(lambda x: re.sub(r'\s+', ' ', str(x).strip()))
    
    # Add text length as a feature
    cleaned_df['text_length'] = cleaned_df['text'].apply(len)
    
    # Ensure rating is numeric first, so values that fail to parse become NaN
    cleaned_df['rating'] = pd.to_numeric(cleaned_df['rating'], errors='coerce')
    
    # Drop rows with missing critical data (review_id, date, rating)
    original_count = len(cleaned_df)
    cleaned_df = cleaned_df.dropna(subset=['review_id', 'date', 'rating'])
    print(f"Dropped {original_count - len(cleaned_df)} rows with missing critical data")
    
    print(f"Cleaning complete. Resulting dataset has {len(cleaned_df)} rows")
    return cleaned_df

# Clean the data
cleaned_df = clean_data(reviews_df)

# Display a sample of the cleaned data
if cleaned_df is not None and not cleaned_df.empty:
    print("\nSample of cleaned data:")
    display(cleaned_df[['review_id', 'date', 'rating', 'text', 'text_length']].head())

Cleaning data...
Removed 0 duplicate reviews
Dropped 0 rows with missing critical data
Cleaning complete. Resulting dataset has 5000 rows

Sample of cleaned data:


| | review_id | date | rating | text | text_length |
|---|---|---|---|---|---|
| 0 | 15c7b93a-00a9-46ed-b5d0-ff21acee2c67 | 2025-05-07 18:18:21 | 5 | good | 4 |
| 1 | 2a6d3874-5be3-492b-954c-24390dfd2afe | 2025-05-07 17:24:26 | 1 | such an irritating app is this. you will open ... | 157 |
| 2 | 67f42ed4-4fb9-44f3-b548-023ee98eb488 | 2025-05-07 16:21:42 | 5 | good nice | 9 |
| 3 | dc8bd34d-d4ee-40fd-8f91-932f472d6d43 | 2025-05-07 16:18:56 | 1 | app is not opening, please rectify | 34 |
| 4 | 0ce29084-1e76-4846-8688-1819614b289f | 2025-05-07 16:14:14 | 1 | after upgrade its converted to worst app. and ... | 75 |


## Feature Engineering

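This section adds new features to the dataset to enhance analysis:
1. Sentiment labels derived from the star rating
2. Text complexity metrics (word count, average word length)
3. Time-based features (year, month, day of week, weekend flag)
4. Additional categorical features (length category, major app version)
5. Cleaned and normalized text columns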
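
A minimal sketch of the rating-based sentiment label and a few of the derived columns, on invented rows. The thresholds match the cell below; the label is a proxy taken from the star rating and does not analyse the review text.

```python
import pandas as pd

def sentiment_from_rating(rating):
    """Map a 1-5 star rating to a coarse label (a proxy, not text-based sentiment)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"

# Toy rows invented for illustration
df = pd.DataFrame({
    "rating": [5, 3, 1],
    "date": pd.to_datetime(["2025-05-03", "2025-05-05", "2025-05-07"]),
    "text": ["good nice", "", "app is not opening"],
})

df["sentiment"] = df["rating"].apply(sentiment_from_rating)
df["day_of_week"] = df["date"].dt.day_name()
df["is_weekend"] = df["day_of_week"].isin(["Saturday", "Sunday"])
df["word_count"] = df["text"].str.split().str.len()  # empty text yields 0

print(df[["rating", "sentiment", "day_of_week", "is_weekend", "word_count"]])
```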

In [4]:
def add_features(df):
    """Add new features to the dataset to enhance analysis"""
    if df is None or df.empty:
        print("No data for feature engineering.")
        return None
    
    print("Adding features...")
    enhanced_df = df.copy()
    
    # 1. Simple sentiment analysis based on rating
    def get_sentiment_from_rating(rating):
        if rating >= 4:
            return 'positive'
        elif rating <= 2:
            return 'negative'
        else:
            return 'neutral'
    
    enhanced_df['sentiment'] = enhanced_df['rating'].apply(get_sentiment_from_rating)
    
    # 2. Time-based features
    enhanced_df['year'] = enhanced_df['date'].dt.year
    enhanced_df['month'] = enhanced_df['date'].dt.month
    enhanced_df['day_of_week'] = enhanced_df['date'].dt.day_name()
    enhanced_df['is_weekend'] = enhanced_df['day_of_week'].isin(['Saturday', 'Sunday'])
    
    # 3. Text complexity metrics
    # Simple metrics: word count and average word length
    enhanced_df['word_count'] = enhanced_df['text'].apply(lambda x: len(str(x).split()))
    
    def avg_word_length(text):
        words = str(text).split()
        if len(words) == 0:
            return 0
        return sum(len(word) for word in words) / len(words)
    
    enhanced_df['avg_word_length'] = enhanced_df['text'].apply(avg_word_length)
    
    # 4. Categorize reviews by text length
    def categorize_length(length):
        if length == 0:
            return 'empty'
        elif length < 50:
            return 'very_short'
        elif length < 200:
            return 'short'
        elif length < 500:
            return 'medium'
        else:
            return 'long'
    
    enhanced_df['length_category'] = enhanced_df['text_length'].apply(categorize_length)
    
    # 5. Version-based features
    # Extract major version (e.g., '7.2.0' -> '7')
    enhanced_df['major_version'] = enhanced_df['version'].apply(
        lambda x: x.split('.')[0] if isinstance(x, str) and '.' in x else 'Unknown'
    )
    
    # 6. Add text preprocessing columns
    print("Adding text preprocessing columns...")
    try:
        # Import NLP preprocessor
        sys.path.insert(0, project_root)
        from src.modules.preprocessing.nlp_preprocessor import NLPPreprocessor
        
        # Initialize the preprocessor
        preprocessor = NLPPreprocessor({"enable_lemmatization": True})
        if not hasattr(preprocessor, 'is_initialized') or not preprocessor.is_initialized:
            preprocessor.initialize()
        
        # Apply preprocessing to text column
        print("Cleaning and normalizing review text...")
        enhanced_df['cleaned_text'] = enhanced_df['text'].apply(
            lambda x: preprocessor.clean_text(str(x)) if pd.notna(x) else "")
        
        enhanced_df['normalized_text'] = enhanced_df['cleaned_text'].apply(
            lambda x: preprocessor.normalize_text(x) if pd.notna(x) else "")
        
        print("Text preprocessing complete.")
    except Exception as e:
        print(f"Error during text preprocessing: {e}")
        print("Falling back to basic text cleaning...")
        
        # Basic fallback preprocessing if the advanced module fails
        # (re is already imported at the top of the notebook)
        try:
            import nltk
            nltk.download('stopwords', quiet=True)
            from nltk.corpus import stopwords
            stopwords_list = set(stopwords.words('english'))
        except Exception:
            # If NLTK is not installed or its data is unavailable, use a small set of common stopwords
            stopwords_list = {'a', 'an', 'the', 'and', 'or', 'but', 'is', 'are', 
                              'was', 'were', 'to', 'of', 'in', 'for', 'with'}
        
        # Simple cleaning function
        def basic_clean(text):
            if not isinstance(text, str):
                text = str(text)
            # Convert to lowercase
            text = text.lower()
            # Remove special characters and punctuation
            text = re.sub(r'[^\w\s]', '', text)
            # Remove numbers
            text = re.sub(r'\d+', '', text)
            # Remove extra spaces
            text = re.sub(r'\s+', ' ', text).strip()
            return text
            
        # Simple stopword removal
        def remove_stopwords(text):
            return ' '.join([word for word in text.split() if word not in stopwords_list])
        
        # Apply basic cleaning
        enhanced_df['cleaned_text'] = enhanced_df['text'].apply(
            lambda x: basic_clean(x) if pd.notna(x) else "")
        
        # Apply stopword removal for normalization
        enhanced_df['normalized_text'] = enhanced_df['cleaned_text'].apply(
            lambda x: remove_stopwords(x) if pd.notna(x) else "")
        
        print("Basic text preprocessing complete.")
    
    print(f"Feature engineering complete. Added {len(enhanced_df.columns) - len(df.columns)} new features")
    return enhanced_df

# Add features
enhanced_df = add_features(cleaned_df)

# Display a summary of the new features
if enhanced_df is not None and not enhanced_df.empty:
    print("\nNew features added:")
    new_features = set(enhanced_df.columns) - set(cleaned_df.columns)
    print(', '.join(sorted(new_features)))
    
    # Display feature statistics
    print("\nSentiment distribution:")
    print(enhanced_df['sentiment'].value_counts())
    
    print("\nLength category distribution:")
    print(enhanced_df['length_category'].value_counts())
    
    print("\nMajor version distribution (top 5):")
    print(enhanced_df['major_version'].value_counts().head())
    
    # Display sample of text preprocessing
    print("\nText preprocessing sample:")
    sample_df = enhanced_df[['text', 'cleaned_text', 'normalized_text']].head(2)
    for idx, row in sample_df.iterrows():
        print(f"\nOriginal: {row['text']}")
        print(f"Cleaned:  {row['cleaned_text']}")
        print(f"Normalized: {row['normalized_text']}")

Adding features...
Adding text preprocessing columns...
Cleaning and normalizing review text...
Text preprocessing complete.
Feature engineering complete. Added 11 new features

New features added:
avg_word_length, cleaned_text, day_of_week, is_weekend, length_category, major_version, month, normalized_text, sentiment, word_count, year

Sentiment distribution:
sentiment
negative    2797
positive    2037
neutral      166
Name: count, dtype: int64

Length category distribution:
length_category
very_short    3339
short         1314
medium         336
long            11
Name: count, dtype: int64

Major version distribution (top 5):
major_version
10         2720
9          1481
Unknown     783
8            14
6             1
Name: count, dtype: int64

Text preprocessing sample:

Original: good
Cleaned:  good
Normalized: good

Original: such an irritating app is this. you will open to do some important work and everytime it will redirect you to update some other app which is already updated.

## Final Processing and Export

This section sorts the reviews by date (newest first), resets the index, and exports the result to a CSV file.
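
The sort-and-export step can be sketched on a toy frame as follows. The rows are invented, and the sketch writes to an in-memory buffer instead of `PROCESSED_DATA_PATH` so it has no side effects.

```python
import io
import pandas as pd

# Toy rows invented for illustration
df = pd.DataFrame({
    "review_id": ["a", "b", "c"],
    "date": pd.to_datetime(["2025-05-01", "2025-05-07", "2025-05-03"]),
})

# Newest first, with a fresh 0..n-1 index
final = df.sort_values("date", ascending=False).reset_index(drop=True)

# index=False keeps the positional index out of the file
buffer = io.StringIO()
final.to_csv(buffer, index=False)
print(buffer.getvalue())
```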

In [5]:
def finalize_and_export(df):
    """Perform final processing and export the data"""
    if df is None or df.empty:
        print("No data to export.")
        return None
    
    print("Performing final processing...")
    final_df = df.copy()
    
    # Sort by date (newest first)
    final_df = final_df.sort_values('date', ascending=False)
    
    # Reset index
    final_df = final_df.reset_index(drop=True)
    
    # Export to CSV
    final_df.to_csv(PROCESSED_DATA_PATH, index=False)
    print(f"Processed data exported to {PROCESSED_DATA_PATH}")
    
    return final_df

# Finalize and export
final_df = finalize_and_export(enhanced_df)

# Display final dataset info
if final_df is not None and not final_df.empty:
    print("\nFinal Dataset Summary:")
    print(f"- Shape: {final_df.shape}")
    print(f"- Memory usage: {final_df.memory_usage().sum() / 1024:.2f} KB")
    print(f"- Column list: {', '.join(final_df.columns.tolist())}")
    
    # Display sample rows
    print("\nSample of processed data:")
    sample_columns = ['review_id', 'date', 'rating', 'sentiment', 'text', 'length_category', 'word_count']
    display(final_df[sample_columns].head())

Performing final processing...
Processed data exported to /Users/dipesh/Local-Projects/indigo-reviews-ai/data/processed/processed_reviews.csv

Final Dataset Summary:
- Shape: (5000, 22)
- Memory usage: 786.26 KB
- Column list: review_id, author, rating, text, version, thumbsUpCount, replyContent, repliedAt, date, timestamp, text_length, sentiment, year, month, day_of_week, is_weekend, word_count, avg_word_length, length_category, major_version, cleaned_text, normalized_text

Sample of processed data:


| | review_id | date | rating | sentiment | text | length_category | word_count |
|---|---|---|---|---|---|---|---|
| 0 | 15c7b93a-00a9-46ed-b5d0-ff21acee2c67 | 2025-05-07 18:18:21 | 5 | positive | good | very_short | 1 |
| 1 | 2a6d3874-5be3-492b-954c-24390dfd2afe | 2025-05-07 17:24:26 | 1 | negative | such an irritating app is this. you will open ... | short | 29 |
| 2 | 67f42ed4-4fb9-44f3-b548-023ee98eb488 | 2025-05-07 16:21:42 | 5 | positive | good nice | very_short | 2 |
| 3 | dc8bd34d-d4ee-40fd-8f91-932f472d6d43 | 2025-05-07 16:18:56 | 1 | negative | app is not opening, please rectify | very_short | 6 |
| 4 | 0ce29084-1e76-4846-8688-1819614b289f | 2025-05-07 16:14:14 | 1 | negative | after upgrade its converted to worst app. and ... | short | 13 |


## Summary

This notebook has performed the following steps:

1. Acquired data from the appropriate source (API, CSV, or mock)
2. Cleaned the data by handling duplicates and missing values
3. Added features for enhanced analysis
4. Exported the processed data to `processed_reviews.csv`

The processed data can now be used by other notebooks for analysis and visualization.
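
A downstream notebook might load the export along these lines. This is a sketch under assumptions: `load_processed` is a hypothetical helper, and the in-memory CSV stands in for the real file. CSV does not preserve dtypes, so the `date` column is parsed again on read, and empty text fields come back as NaN unless they are refilled.

```python
import io
import pandas as pd

def load_processed(source):
    """Read the exported CSV, restoring the datetime dtype and empty text values."""
    df = pd.read_csv(source, parse_dates=["date"])
    if "text" in df.columns:
        df["text"] = df["text"].fillna("")  # empty strings are read back as NaN
    return df

# A two-row stand-in for processed_reviews.csv; a real notebook would pass the file path
sample_csv = io.StringIO(
    "review_id,date,rating,text\n"
    "a,2025-05-07 18:18:21,5,good\n"
    "b,2025-05-07 17:24:26,1,\n"
)
reviews = load_processed(sample_csv)
print(reviews.dtypes)
```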