# Sentio+ RAG-Optimized Preprocessing

This notebook implements the **Hybrid Stratified Signal** sampling strategy to create a high-signal corpus for the Sentio+ RAG chatbot.

**Core Logic:**
- **Source:** `apps_reviews.csv` and `apps_info.csv` (Kaggle cache)
- **Filter:** Minimum 150 characters for high-density insights.
- **Sample:** 50,000 reviews balanced across categories and ratings.
- **Technique:** Stratified sampling prioritizing **Length** and **Helpfulness**.
- **Output:** `sentio_plus_rag_ready.csv` with Context Headers.

## 1. Imports and Setup

In [18]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime, timedelta
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

print("Environment ready.")

Environment ready.


## 2. Load Raw Data

In [19]:
# Path to Kaggle cached data
DATA_PATH = Path.home() / ".cache/kagglehub/datasets/dmytrobuhai/play-market-2025-1m-reviews-500-titles/versions/1"

print(f"Loading data from: {DATA_PATH}")

reviews_df = pd.read_csv(DATA_PATH / "apps_reviews.csv")
info_df = pd.read_csv(DATA_PATH / "apps_info.csv")

print(f"Raw Reviews: {len(reviews_df):,}")
print(f"Raw App Info: {len(info_df):,}")

Loading data from: /Users/chenchenliu/.cache/kagglehub/datasets/dmytrobuhai/play-market-2025-1m-reviews-500-titles/versions/1
Raw Reviews: 466,700
Raw App Info: 217


## 3. Data Integration & Cleaning

In [20]:
def clean_category(cat_str):
    if pd.isna(cat_str): return "Unknown"
    if "," in cat_str: return cat_str.split(",")[-1].strip()
    return cat_str.strip()

# 1. Clean categories
info_df['category'] = info_df['categories'].apply(clean_category)

# 2. Merge
merged_df = reviews_df.merge(
    info_df[['app_id', 'app_name', 'category', 'content_rating', 'score', 'downloads']],
    on='app_id', 
    how='left'
)

# 3. Initial Clean
merged_df = merged_df.dropna(subset=['review_text', 'app_name', 'category'])

print(f"Integrated dataset: {len(merged_df):,}")

Integrated dataset: 466,700


## 4. The Quality Gate (Filtering)

- **Length Threshold:** > 150 characters
- **Date Handling:** Identify recent reviews (Last 12 months)

In [21]:
# Calculate text length
merged_df['text_length'] = merged_df['review_text'].str.len()

# Filter for length (The Quality Gate)
high_quality_pool = merged_df[merged_df['text_length'] > 150].copy()

# Handle dates
high_quality_pool['review_date'] = pd.to_datetime(high_quality_pool['review_date'])
CUTOFF_DATE = high_quality_pool['review_date'].max() - timedelta(days=365)
high_quality_pool['is_recent'] = high_quality_pool['review_date'] >= CUTOFF_DATE

print(f"High-Quality Pool (>150 chars): {len(high_quality_pool):,}")
print(f"Recent reviews (Last 12m): {high_quality_pool['is_recent'].sum():,}")

High-Quality Pool (>150 chars): 222,320
Recent reviews (Last 12m): 52,558


## 5. Hybrid Stratified Signal Sampling

We sample 50,000 reviews ensuring breadth (Category/Rating) and depth (Length/Helpful).

In [22]:
TARGET_TOTAL = 50000
categories = high_quality_pool['category'].unique()
ratings = sorted(high_quality_pool['review_score'].unique())

# Calculate target per (cat, rating) bucket
n_buckets = len(categories) * len(ratings)
target_per_bucket = TARGET_TOTAL // n_buckets

print(f"Buckets: {n_buckets}")
print(f"Target per bucket: {target_per_bucket}")

sampled_dfs = []

for cat in categories:
    for rating in ratings:
        bucket = high_quality_pool[
            (high_quality_pool['category'] == cat) & 
            (high_quality_pool['review_score'] == rating)
        ]
        
        if len(bucket) == 0: continue
        
        # Signal Sorting: Prioritize Length and Helpfulness
        # Within bucket, we want 60% recent if possible
        n_recent_target = int(target_per_bucket * 0.6)
        n_older_target = target_per_bucket - n_recent_target
        
        recent_bucket = bucket[bucket['is_recent']].sort_values(['text_length', 'helpful_count'], ascending=False)
        older_bucket = bucket[~bucket['is_recent']].sort_values(['text_length', 'helpful_count'], ascending=False)
        
        # Extract segments
        recent_sample = recent_bucket.head(n_recent_target)
        older_sample = older_bucket.head(target_per_bucket - len(recent_sample))
        
        bucket_sample = pd.concat([recent_sample, older_sample])
        
        # If still short, take whatever is left from the bucket
        if len(bucket_sample) < target_per_bucket:
            remaining = bucket[~bucket.index.isin(bucket_sample.index)].sort_values(['text_length', 'helpful_count'], ascending=False)
            bucket_sample = pd.concat([bucket_sample, remaining.head(target_per_bucket - len(bucket_sample))])
            
        sampled_dfs.append(bucket_sample)

final_sampled_df = pd.concat(sampled_dfs, ignore_index=True)

# If we are still short of 50k (due to small buckets), fill from the remaining high-quality pool
if len(final_sampled_df) < TARGET_TOTAL:
    needed = TARGET_TOTAL - len(final_sampled_df)
    remaining_pool = high_quality_pool[~high_quality_pool.index.isin(final_sampled_df.index)]
    top_up = remaining_pool.sort_values(['is_recent', 'text_length', 'helpful_count'], ascending=False).head(needed)
    final_sampled_df = pd.concat([final_sampled_df, top_up], ignore_index=True)

print(f"\nFinal Sample Size: {len(final_sampled_df):,}")

Buckets: 90
Target per bucket: 555

Final Sample Size: 50,000


## 6. Metadata Enrichment (The Context Header)

Format: `[APP: {app_name} | CAT: {category} | RATING: {rating}/5 | DATE: {date} | SEGMENT: {content_rating}] USER REVIEW: {review_text}`

In [23]:
def enrich_review(row):
    header = (
        f"[APP: {row['app_name']} | "
        f"CAT: {row['category']} | "
        f"RATING: {int(row['review_score'])}/5 | "
        f"DATE: {row['review_date'].strftime('%Y-%m')} | "
        f"SEGMENT: {row['content_rating']}] "
        f"USER REVIEW: {row['review_text']}"
    )
    return header

final_sampled_df['enriched_text'] = final_sampled_df.apply(enrich_review, axis=1)

print("Enrichment complete. Example:")
print(final_sampled_df['enriched_text'].iloc[0])

Enrichment complete. Example:
[APP: Google Wallet | CAT: Finance | RATING: 1/5 | DATE: 2024-09 | SEGMENT: Everyone] USER REVIEW: I used to love this app, I've got all my loyalty cards and that stored in it, but lately, a security update requires verification for every purchase. And it doesn't seem to even make any difference if the phone is unlocked or not, I have to go into the app, click 'Verify it's you', and then put in my pin again. It often fails the transaction first time, so then I'll have to unlock my phone, try again, fail again, then put in my code. The whole experience is a thousand times clunkier than before.


## 7. Final Preparation & Export

In [24]:
# Map column names to final spec
final_df = final_sampled_df[[
    'app_id', 'app_name', 'category', 'review_score', 'review_date', 
    'helpful_count', 'content_rating', 'score', 'downloads', 'enriched_text', 'text_length'
]].copy()

final_df = final_df.rename(columns={
    'review_score': 'rating',
    'score': 'app_avg_score'
})

# Add ID
final_df['review_id'] = range(1, len(final_df) + 1)

# Reorder
final_df = final_df[[
    'review_id', 'app_id', 'app_name', 'category', 'rating', 'review_date', 
    'helpful_count', 'content_rating', 'app_avg_score', 'downloads', 'text_length', 'enriched_text'
]]

# Save
OUTPUT_DIR = Path("data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_FILE = OUTPUT_DIR / "sentio_plus_rag_ready.csv"

final_df.to_csv(OUTPUT_FILE, index=False)

print(f"\n✅ Saved RAG-Ready dataset to: {OUTPUT_FILE}")
print(f"Total Evidence Chunks: {len(final_df):,}")


✅ Saved RAG-Ready dataset to: data/processed/sentio_plus_rag_ready.csv
Total Evidence Chunks: 50,000


## 8. Data Distribution Audit

In [25]:
print("=== Rating Distribution ===")
print(final_df['rating'].value_counts().sort_index())

print("\n=== Top 10 Categories ===")
print(final_df['category'].value_counts().head(10))

print(f"\n=== Signal Metrics ===")
print(f"Average Review Length: {final_df['text_length'].mean():.1f} chars")
print(f"Total Helpful Votes: {final_df['helpful_count'].sum():,}")

=== Rating Distribution ===
rating
1    13997
2     9665
3     8763
4     8443
5     9132
Name: count, dtype: int64

=== Top 10 Categories ===
category
Food & Drink     4655
Entertainment    4611
Business         4141
Social           3855
Shopping         3668
Productivity     3649
Communication    3609
Finance          3470
House & Home     2817
Education        2605
Name: count, dtype: int64

=== Signal Metrics ===
Average Review Length: 404.9 chars
Total Helpful Votes: 2,560,996


In [None]:
# Define the output location
import os
output_path = "data/processed/sentio_plus_rag_ready.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Export the DataFrame
# index=False ensures you don't save the row numbers as a separate column
final_df.to_csv(output_path, index=False)

print(f"Successfully saved {len(final_df):,} reviews to {output_path}")

Successfully saved 50,000 reviews to data/processed/THIS_ONE.csv
