# Data Preprocessing Pipeline

| Step                          | Purpose                                                      | Method/Components Used |
| ----------------------------- | ------------------------------------------------------------ | ---------------------- |
| **Imports**                   | Load required libraries for ML, NLP, and data processing    | pandas, numpy, sklearn, xgboost, sentence-transformers, spacy, transformers, imbalanced-learn, textstat, tqdm |
| **Load Data**                 | Read CSV file and identify text/ID columns                  | `pd.read_csv()`, column validation, data type conversion |
| **Model Initialization**      | Load pre-trained NLP models and pipelines                   | spaCy (en_core_web_sm), SentenceTransformer (all-MiniLM-L6-v2), RoBERTa sentiment |
| **Text Preprocessing**        | Clean and normalize review text                              | Regex patterns, whitespace normalization, punctuation cleanup |
| **Label Generation**          | Create spam labels using heuristic scoring                  | Length checks, repetition analysis, pattern matching, contextual scoring |
| **Feature Extraction**        | Generate comprehensive feature vectors                       | **TF-IDF** (300 features, 1-2 grams), **Sentence embeddings** (384-dim), **Linguistic features** (POS, entities), **Style features** (ratios, patterns) |
| **Train/Test Split**          | Split dataset for model validation                          | `train_test_split()` with stratification (80/20 split) |
| **Feature Scaling**           | Normalize features for consistent model input               | `StandardScaler()` fit on training data |
| **Class Balancing**           | Address class imbalance in training data                    | SMOTE oversampling with random_state=42 |
| **Model Training**            | Train ensemble classifier                                    | **Random Forest** (100 estimators, balanced) + **XGBoost** (auto-scaled) with soft voting |
| **Model Evaluation**          | Assess classifier performance                                | Confusion matrix, classification report, accuracy score, F1-score |
| **Spam Prediction**           | Generate spam probabilities for all reviews                 | Ensemble `.predict_proba()` with 0.5 threshold |
| **Duplicate Detection**       | Identify near-identical reviews using clustering            | **DBSCAN** (eps=0.01, cosine distance) + text verification |
| **Off-topic Detection**       | Flag reviews lacking domain relevance                       | Multi-criteria filtering: length + keyword analysis + review language detection |
| **Informativeness Scoring**   | Calculate content quality metrics                           | **Formula**: 0.4×lexical_diversity + 0.3×min(10×entity_density, 1) + 0.3×length_factor |
| **Results Integration**       | Combine all analysis flags and scores                       | Create comprehensive DataFrame with all detection results |
| **Quality Filtering**         | Apply conservative filtering rules                          | Remove high-confidence spam (>0.8) OR duplicates OR multi-issue reviews |
| **Export Results**            | Save processed datasets to CSV files                       | **Reviews_Clean.csv** (filtered), **Reviews_Filtered.csv** (removed), **Reviews_Analysis.csv** (complete) |
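
The Quality Filtering row can be read as a single boolean predicate over the per-review flags. The sketch below mirrors the removal conditions used later in `run_analysis()` with plain Python; the `ReviewFlags` container and the `should_remove` name are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewFlags:
    """Per-review analysis results (hypothetical container for illustration)."""
    is_spam: bool
    spam_probability: float
    is_duplicate: bool
    is_off_topic: bool
    informativeness_score: float


def should_remove(r: ReviewFlags) -> bool:
    """Return True when a review meets any of the removal conditions."""
    high_confidence_spam = r.is_spam and r.spam_probability > 0.8
    off_topic_and_spam = r.is_off_topic and r.is_spam
    off_topic_and_empty = r.is_off_topic and r.informativeness_score < 0.1
    return (high_confidence_spam or r.is_duplicate
            or off_topic_and_spam or off_topic_and_empty)


# A borderline spam prediction is kept; a confident one is removed.
borderline = ReviewFlags(True, 0.62, False, False, 0.45)
confident = ReviewFlags(True, 0.93, False, False, 0.45)
print(should_remove(borderline), should_remove(confident))  # False True
```

Because a spam flag alone is not enough to remove a review, the number of filtered reviews can be smaller than the sum of the individual issue counts.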

## Pipeline Flow Summary

```
Input CSV → Model Loading → Text Preprocessing → Feature Engineering →
ML Training → Multi-Detection → Quality Assessment → Output Generation
```

**Key Metrics Tracked:**
- Original reviews count
- Clean reviews retained (%)
- Spam detected (%)
- Duplicates found (%)
- Off-topic flagged (%)
- Average informativeness score
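
As a rough illustration of how these metrics relate to the per-review flags, the following sketch computes them from a list of dictionaries. The `summarize` function and the `removed` key are assumptions made for this example; the notebook itself derives the numbers from pandas DataFrames.

```python
def summarize(records):
    """Compute the tracked summary metrics from per-review dicts."""
    total = len(records)
    if total == 0:
        return {}

    def pct(count):
        # Percentage of all reviews, rounded to one decimal place
        return round(100.0 * count / total, 1)

    kept = sum(1 for r in records if not r["removed"])
    return {
        "original_reviews": total,
        "clean_retained_pct": pct(kept),
        "spam_pct": pct(sum(r["is_spam"] for r in records)),
        "duplicate_pct": pct(sum(r["is_duplicate"] for r in records)),
        "off_topic_pct": pct(sum(r["is_off_topic"] for r in records)),
        "avg_informativeness": round(
            sum(r["informativeness_score"] for r in records) / total, 3),
    }


# Toy data: four reviews, one spam and one duplicate, both removed.
sample = [
    {"is_spam": False, "is_duplicate": False, "is_off_topic": False,
     "informativeness_score": 0.8, "removed": False},
    {"is_spam": True, "is_duplicate": False, "is_off_topic": False,
     "informativeness_score": 0.2, "removed": True},
    {"is_spam": False, "is_duplicate": True, "is_off_topic": False,
     "informativeness_score": 0.6, "removed": True},
    {"is_spam": False, "is_duplicate": False, "is_off_topic": False,
     "informativeness_score": 0.4, "removed": False},
]
summary = summarize(sample)
print(summary)
```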

In [None]:
#@title connect google drive

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/SMU_MITB_NLP/Project/

In [None]:
!pip install pandas scikit-learn xgboost sentence-transformers transformers tqdm imbalanced-learn spacy
!python -m spacy download en_core_web_sm
!pip install textstat


## Overview

This notebook implements a machine-learning pipeline that filters and analyzes product reviews: it flags spam, duplicates, and off-topic content, and scores each review's informativeness. It combines an ensemble classifier, NLP feature extraction, and heuristic rules to clean review datasets.

## Key Features

### 🔍 **Multi-Modal Detection**
- **Spam Detection**: ML-based classification using ensemble methods
- **Duplicate Detection**: Clustering-based approach with similarity thresholds
- **Off-Topic Detection**: Domain-specific keyword analysis
- **Informativeness Scoring**: Content richness evaluation

### 🤖 **Machine Learning Components**
- **Ensemble Classifier**: Random Forest + XGBoost with soft voting
- **Feature Engineering**: TF-IDF, sentence embeddings, linguistic features, style features
- **Class Balancing**: SMOTE oversampling for imbalanced datasets
- **Hold-out Validation**: Single stratified 80/20 train/test split (no k-fold cross-validation is performed)
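
Soft voting averages the class probabilities of the individual models and thresholds the mean, rather than counting hard votes. The sketch below shows the idea with plain lists and equal model weights; `soft_vote` and the probability values are illustrative only and stand in for the `predict_proba` outputs of the two classifiers.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model spam probabilities, then threshold the mean."""
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    averaged = [sum(p[i] for p in prob_lists) / n_models
                for i in range(n_samples)]
    labels = [1 if p > threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical spam probabilities for three reviews from each model
rf_probs = [0.9, 0.4, 0.2]
xgb_probs = [0.7, 0.7, 0.1]

avg, labels = soft_vote([rf_probs, xgb_probs])
print(labels)  # [1, 1, 0]
```

For the second review the two models disagree, and the more confident model decides the outcome because the averaged probability ends up above the threshold.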

### 📊 **NLP Processing**
- **Sentence Transformers**: all-MiniLM-L6-v2 for semantic embeddings
- **spaCy Integration**: Linguistic feature extraction (POS tags, entities)
- **Sentiment Analysis**: Twitter-RoBERTa model integration
- **Readability Analysis**: Flesch reading ease scoring

## System Architecture

### Feature Extraction Pipeline

#### 1. **TF-IDF Features**
```python
TfidfVectorizer(
    max_features=300,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95,
    stop_words='english'
)
```

#### 2. **Sentence Embeddings**
- Model: `all-MiniLM-L6-v2`
- Generates 384-dimensional semantic vectors
- Batch processing for efficiency

#### 3. **Linguistic Features**
- Token counts and sentence segmentation
- POS tag ratios (NOUN, VERB, ADJ)
- Named entity density
- Average words per sentence

#### 4. **Style Features**
- Character/word count ratios
- Capitalization and punctuation patterns
- Lexical diversity scores
- URL detection
- Sentiment scores
- Readability metrics
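
For quick experimentation without loading any models, a subset of the style features can be computed with the standard library alone. This is a condensed sketch that reuses the 0.3 and 0.15 cut-offs from `_get_style_features()`; the `style_features` function name and the sample text are made up for illustration, and sentiment and readability are left out.

```python
import re


def style_features(text):
    """Compute a few style ratios and pattern flags for one review."""
    chars = len(text)
    words = text.split()
    upper_ratio = sum(c.isupper() for c in text) / max(chars, 1)
    punct_ratio = sum(c in ".,!?;:" for c in text) / max(chars, 1)
    lexical_diversity = len(set(words)) / max(len(words), 1)
    return {
        "upper_ratio": upper_ratio,
        "punct_ratio": punct_ratio,
        "lexical_diversity": lexical_diversity,
        "has_url": bool(re.search(r"http|www\.", text)),
        "excessive_caps": upper_ratio > 0.3,
        "excessive_punct": punct_ratio > 0.15,
    }


feats = style_features("BUY NOW!!! www.example.com")
print(feats)
```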

## Detection Algorithms

### Spam Detection

**Heuristic Scoring System:**
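- Length-based penalties (+2 if under 8 characters, otherwise +1 if under 3 words)
- Word repetition analysis (+1 if fewer than half of the words are unique)
- Pattern matching (+1 each for 6+ consecutive capitals, 3+ consecutive `!`/`?`, or a character repeated 5+ times)
- Generic one-word reviews (+1 for text that is exactly "good", "bad", "ok", "great", "nice", or "test")
- Context-aware scoring (+1 for a 1- or 5-star rating with text under 10 characters)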

**Threshold:** Spam score ≥ 2
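
The scoring rules can be tried out in isolation with a small standalone function. The sketch below follows the same checks as `generate_labels()`, but takes a single text and an optional star rating instead of reading from the DataFrame; the function names are hypothetical.

```python
import re


def heuristic_spam_score(text, stars=None):
    """Sum the heuristic penalties for a single review."""
    raw = str(text).strip()
    clean = raw.lower()
    words = clean.split()
    score = 0

    # Length checks
    if len(clean) < 8:
        score += 2
    elif len(words) < 3:
        score += 1

    # Repetition: fewer than half of the words are unique
    if words and len(set(words)) / len(words) < 0.5:
        score += 1

    # Pattern checks on the original casing
    if re.search(r"[A-Z]{6,}", raw):
        score += 1
    if re.search(r"[!?]{3,}", raw):
        score += 1
    if re.search(r"(.)\1{4,}", raw):
        score += 1

    # Generic one-word reviews
    if re.fullmatch(r"good|bad|ok|great|nice|test", clean):
        score += 1

    # Extreme rating with almost no text
    if stars in (1, 5) and len(clean) < 10:
        score += 1
    return score


def is_spam(text, stars=None, threshold=2):
    """Apply the spam threshold to the heuristic score."""
    return heuristic_spam_score(text, stars) >= threshold


print(heuristic_spam_score("ok"))            # 3
print(heuristic_spam_score("AMAZING!!!!!"))  # 4
print(is_spam("The delivery was fast and the product works well"))  # False
```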

### Duplicate Detection

**DBSCAN Clustering:**
- Epsilon: 0.01 (maximum cosine distance between neighbors, i.e. cosine similarity of at least 0.99)
- Min samples: 2
- Cosine distance metric (`metric='cosine'`)
- Additional text verification for high precision
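
The two stages can be illustrated without scikit-learn. The sketch below computes the cosine distance that DBSCAN compares against `eps`, and reproduces the position-by-position text check applied inside each cluster. The function names and sample inputs are illustrative; the default thresholds are the ones used in `detect_duplicates()`.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity (1.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def texts_nearly_identical(t1, t2, min_len=20, max_len_gap=5, min_overlap=0.9):
    """Exact match, or long texts of similar length that agree character by character."""
    a, b = t1.lower().strip(), t2.lower().strip()
    if a == b:
        return True
    if len(a) <= min_len or len(b) <= min_len:
        return False
    if abs(len(a) - len(b)) >= max_len_gap:
        return False
    matches = sum(c1 == c2 for c1, c2 in zip(a, b))
    return matches / max(len(a), len(b)) > min_overlap


# Two almost parallel vectors fall within eps=0.01 of each other
print(cosine_distance([1.0, 0.0], [0.99, 0.1]) < 0.01)  # True

# A changed final character still counts as a duplicate
print(texts_nearly_identical("Great product, fast shipping!",
                             "Great product, fast shipping."))  # True
```

Because the text check compares characters at the same position, inserting a word at the start of a review shifts every later character and the pair is no longer treated as a duplicate. This keeps precision high at the cost of missing some paraphrased copies.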

### Off-Topic Detection

**Multi-Criteria Filtering:**
Only flagged when ALL conditions met:
- Very short content (< 10 characters)
- No domain-relevant keywords
- No review-specific language
- No emotional indicators
- No evaluative content
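
A minimal sketch of the all-conditions rule follows. It is a variant, not a copy, of `detect_off_topic()`: the word lists are abbreviated, and keywords are matched against whole tokens rather than substrings, which is an assumption made here so that a word such as "book" does not count as containing "ok".

```python
import re

# Abbreviated keyword sets for illustration; the notebook uses longer lists.
DOMAIN_WORDS = {"product", "service", "quality", "delivery", "price"}
REVIEW_WORDS = {"recommend", "bought", "received", "works", "broken"}
EMOTION_WORDS = {"love", "hate", "happy", "disappointed"}
EVALUATION_WORDS = {"good", "bad", "ok", "okay", "fine", "nice"}


def is_off_topic(text, max_len=10):
    """Flag only very short texts that contain no review indicator at all."""
    clean = str(text).lower().strip()
    tokens = set(re.findall(r"[a-z']+", clean))
    indicators = DOMAIN_WORDS | REVIEW_WORDS | EMOTION_WORDS | EVALUATION_WORDS
    return len(clean) < max_len and not (tokens & indicators)


print(is_off_topic("asdf"))  # True
print(is_off_topic("ok"))    # False
print(is_off_topic("book"))  # True
```

With substring matching, the last example would not be flagged, because "book" contains "ok".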

### Informativeness Scoring

**Composite Score Formula:**
```
info_score = 0.4 × lexical_diversity +
             0.3 × min(10 × entity_density, 1) +
             0.3 × length_factor

where length_factor = min(num_words / 20, 1)
```
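
A worked example makes the weighting concrete. The sketch below takes a list of alphabetic tokens and a named-entity count as inputs (in the notebook both come from spaCy), and applies the same caps as `calculate_informativeness()`: the entity term saturates at one entity per ten words and the length term at 20 words. The `informativeness` function name and the sample sentence are illustrative.

```python
def informativeness(words, num_entities):
    """Composite informativeness score in [0, 1] for one review."""
    if not words:
        return 0.0
    lowered = [w.lower() for w in words]
    lexical_diversity = len(set(lowered)) / len(lowered)
    entity_term = min(10 * num_entities / len(lowered), 1.0)
    length_factor = min(len(lowered) / 20, 1.0)
    score = (0.4 * lexical_diversity
             + 0.3 * entity_term
             + 0.3 * length_factor)
    return round(min(1.0, max(0.0, score)), 3)


words = "The battery lasts two days and the screen is bright".split()

# 10 words, 9 unique (case-insensitive), 1 assumed entity ("two days"):
# 0.4 * 0.9 + 0.3 * 1.0 + 0.3 * 0.5 = 0.81
print(informativeness(words, num_entities=1))

# Same text with no entities: 0.4 * 0.9 + 0.3 * 0.0 + 0.3 * 0.5 = 0.51
print(informativeness(words, num_entities=0))
```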

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import re
import warnings
warnings.filterwarnings('ignore')

# Core ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

# NLP libraries
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import spacy
from textstat import flesch_reading_ease
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

class CompactReviewFilter:
    def __init__(self, csv_path, text_col='review', id_col='review_index'):
        """Initialize the review filter with essential components."""
        self.df = pd.read_csv(csv_path)
        # Treat missing reviews as empty strings rather than the literal text "nan"
        self.texts = self.df[text_col].fillna('').astype(str).tolist()
        self.text_col = text_col
        self.id_col = id_col if id_col in self.df.columns else None

        print(f"Loaded {len(self.texts)} reviews")
        print(f"Available columns: {list(self.df.columns)}")

        # Initialize models
        print("Loading NLP models...")
        try:
            self.nlp = spacy.load("en_core_web_sm")
        except OSError:
            print("spaCy model not found. Install with: python -m spacy download en_core_web_sm")
            raise

        self.sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
        try:
            self.sentiment_model = pipeline("sentiment-analysis",
                                           model="cardiffnlp/twitter-roberta-base-sentiment-latest")
        except Exception:
            self.sentiment_model = None
            print("Sentiment model not available")

    def preprocess_texts(self):
        """Basic text preprocessing."""
        processed = []
        for text in self.texts:
            # Clean whitespace and basic normalization
            text = re.sub(r'\s+', ' ', str(text).strip())
            text = re.sub(r'([.!?])\1+', r'\1', text)  # Remove repeated punctuation
            processed.append(text)
        return processed

    def extract_features(self):
        """Extract key features for classification."""
        print("Extracting features...")

        # Preprocess texts
        clean_texts = self.preprocess_texts()

        # 1. TF-IDF features
        tfidf = TfidfVectorizer(max_features=300, ngram_range=(1, 2),
                               min_df=2, max_df=0.95, stop_words='english')
        X_tfidf = tfidf.fit_transform(clean_texts).toarray()

        # 2. Sentence embeddings
        X_embeddings = self.sentence_model.encode(clean_texts, show_progress_bar=True, batch_size=32)

        # 3. Basic linguistic features
        X_linguistic = self._get_linguistic_features(clean_texts)

        # 4. Style features
        X_style = self._get_style_features(clean_texts)

        # Combine features
        X_combined = np.hstack([X_tfidf, X_embeddings, X_linguistic, X_style])
        print(f"Total features: {X_combined.shape[1]}")

        return X_combined

    def _get_linguistic_features(self, texts):
        """Extract linguistic features using spaCy."""
        features = []
        for text in tqdm(texts, desc="Linguistic features"):
            doc = self.nlp(text)

            num_tokens = len([t for t in doc if not t.is_punct])
            num_sentences = len(list(doc.sents))
            num_words = len([t for t in doc if t.is_alpha])

            # POS ratios
            pos_counts = {}
            for token in doc:
                if not token.is_punct:
                    pos_counts[token.pos_] = pos_counts.get(token.pos_, 0) + 1
            total_pos = sum(pos_counts.values()) or 1

            # Named entities
            num_entities = len(doc.ents)

            feature_vec = [
                num_tokens,
                num_sentences,
                num_words / max(num_sentences, 1),  # avg words per sentence
                pos_counts.get('NOUN', 0) / total_pos,
                pos_counts.get('VERB', 0) / total_pos,
                pos_counts.get('ADJ', 0) / total_pos,
                num_entities / max(num_tokens, 1),  # entity density
            ]
            features.append(feature_vec)

        return np.array(features)

    def _get_style_features(self, texts):
        """Extract style and quality features."""
        features = []
        for text in tqdm(texts, desc="Style features"):
            # Basic counts
            char_count = len(text)
            words = text.split()
            word_count = len(words)

            # Ratios
            upper_ratio = sum(1 for c in text if c.isupper()) / max(char_count, 1)
            digit_ratio = sum(1 for c in text if c.isdigit()) / max(char_count, 1)
            punct_ratio = sum(1 for c in text if c in '.,!?;:') / max(char_count, 1)

            # Word-level analysis
            avg_word_len = np.mean([len(w) for w in words]) if words else 0
            unique_words = len(set(words))
            lexical_diversity = unique_words / max(word_count, 1)

            # Pattern detection
            has_urls = 1 if re.search(r'http|www\.', text) else 0
            excessive_caps = 1 if upper_ratio > 0.3 else 0
            excessive_punct = 1 if punct_ratio > 0.15 else 0

            # Sentiment if available. Note: 'score' is the model's confidence in
            # its predicted label, not a signed polarity value.
            sentiment_score = 0
            if self.sentiment_model:
                try:
                    result = self.sentiment_model(text[:512])
                    if isinstance(result, list) and len(result) > 0:
                        sentiment_score = result[0].get('score', 0)
                except Exception:
                    sentiment_score = 0

            # Readability
            try:
                readability = flesch_reading_ease(text) / 100
            except Exception:
                readability = 0.5

            feature_vec = [
                char_count / 500,  # normalized
                upper_ratio,
                digit_ratio,
                punct_ratio,
                avg_word_len / 8,  # normalized
                lexical_diversity,
                has_urls,
                excessive_caps,
                excessive_punct,
                sentiment_score,
                readability
            ]
            features.append(feature_vec)

        return np.array(features)

    def generate_labels(self):
        """Generate spam labels using improved heuristics."""
        print("Generating spam labels...")
        labels = []

        for i, text in enumerate(self.texts):
            spam_score = 0
            text_clean = str(text).strip().lower()
            words = text_clean.split()

            # Length checks
            if len(text_clean) < 8:
                spam_score += 2
            elif len(words) < 3:
                spam_score += 1

            # Repetition
            if len(words) > 0:
                unique_ratio = len(set(words)) / len(words)
                if unique_ratio < 0.5:
                    spam_score += 1

            # Pattern checks
            if re.search(r'[A-Z]{6,}', text):  # Excessive caps
                spam_score += 1
            if re.search(r'[!?]{3,}', text):  # Excessive punctuation
                spam_score += 1
            if re.search(r'(.)\1{4,}', text):  # Character repetition
                spam_score += 1

            # Common spam patterns
            if re.match(r'^(good|bad|ok|great|nice|test)$', text_clean):
                spam_score += 1

            # Star rating context (if available)
            if 'stars' in self.df.columns:
                try:
                    stars = float(self.df.iloc[i]['stars'])
                    if (stars == 5 or stars == 1) and len(text_clean) < 10:
                        spam_score += 1
                except (TypeError, ValueError):
                    pass

            labels.append(1 if spam_score >= 2 else 0)

        spam_count = sum(labels)
        print(f"Generated labels: {spam_count} spam ({spam_count/len(labels):.1%}) out of {len(labels)}")
        return np.array(labels)

    def train_classifier(self, X, y):
        """Train spam classifier."""
        print("Training classifier...")

        # Split data
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Balance with SMOTE
        smote = SMOTE(random_state=42)
        X_train_bal, y_train_bal = smote.fit_resample(X_train_scaled, y_train)

        print(f"Training samples: {len(y_train_bal)} (spam: {sum(y_train_bal)}, clean: {sum(y_train_bal==0)})")

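        # Train ensemble. After SMOTE the classes are balanced, so
        # class_weight='balanced' and scale_pos_weight (which evaluates to 1.0 here)
        # have no practical effect. The use_label_encoder argument is omitted
        # because recent XGBoost releases no longer accept it.
        rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42, n_jobs=-1)
        xgb = XGBClassifier(scale_pos_weight=sum(y_train_bal==0)/sum(y_train_bal==1),
                           eval_metric='logloss', random_state=42, n_jobs=-1)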

        ensemble = VotingClassifier([('rf', rf), ('xgb', xgb)], voting='soft')
        ensemble.fit(X_train_bal, y_train_bal)

        # Evaluate
        y_pred = ensemble.predict(X_test_scaled)
        y_probs = ensemble.predict_proba(X_test_scaled)[:, 1]

        print("\n=== MODEL EVALUATION ===")
        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))
        print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

        # Store for later use
        self.scaler = scaler
        self.ensemble = ensemble

        return ensemble

    def detect_duplicates(self, X):
        """Detect duplicate reviews with more conservative thresholds."""
        print("Detecting duplicates...")

        # Use much more conservative parameters for DBSCAN
        dbscan = DBSCAN(eps=0.01, min_samples=2, metric='cosine', n_jobs=-1)  # Much stricter similarity
        clusters = dbscan.fit_predict(X)

        dup_flags = [False] * len(self.texts)
        dup_refs = [None] * len(self.texts)

        # Only mark as duplicate if texts are very similar
        for cluster_id in set(clusters):
            if cluster_id == -1:  # Skip noise
                continue
            cluster_indices = np.where(clusters == cluster_id)[0]
            if len(cluster_indices) > 1:
                # Additional check: verify actual text similarity
                verified_duplicates = []

                for idx in cluster_indices:
                    text1 = self.texts[idx].lower().strip()
                    # Check if this text is very similar to any already verified
                    is_dup = False
                    for base_idx in verified_duplicates:
                        text2 = self.texts[base_idx].lower().strip()
                        # Exact match, or near-exact: both long, similar length,
                        # and >90% of characters equal position by position
                        if (text1 == text2 or
                            (len(text1) > 20 and len(text2) > 20 and
                             abs(len(text1) - len(text2)) < 5 and
                             sum(c1 == c2 for c1, c2 in zip(text1, text2)) / max(len(text1), len(text2)) > 0.9)):
                            dup_flags[idx] = True
                            dup_refs[idx] = base_idx
                            is_dup = True
                            break

                    if not is_dup:
                        verified_duplicates.append(idx)

        dup_count = sum(dup_flags)
        print(f"Duplicates found: {dup_count} ({dup_count/len(self.texts):.1%})")
        return dup_flags, dup_refs

    def detect_off_topic(self):
        """Detect off-topic reviews with more lenient criteria."""
        print("Detecting off-topic reviews...")

        # Expanded domain keywords
        domain_words = ['product', 'service', 'quality', 'delivery', 'shipping', 'price', 'order',
                       'buy', 'purchase', 'recommend', 'good', 'bad', 'great', 'terrible', 'awesome',
                       'satisfied', 'disappointed', 'excellent', 'poor', 'fast', 'slow', 'cheap',
                       'expensive', 'worth', 'value', 'money', 'item', 'received', 'arrived',
                       'package', 'box', 'customer', 'seller', 'store', 'shop', 'website',
                       'online', 'return', 'refund', 'exchange', 'warranty', 'brand', 'company']

        # Review-specific language
        review_words = ['recommend', 'bought', 'ordered', 'received', 'love', 'hate', 'like',
                       'dislike', 'amazing', 'terrible', 'perfect', 'awful', 'satisfied',
                       'disappointed', 'happy', 'unhappy', 'pleased', 'upset', 'impressed',
                       'expected', 'surprised', 'works', 'broken', 'defective', 'damaged']

        # Emotional/evaluative words
        emotion_words = ['love', 'hate', 'like', 'dislike', 'happy', 'sad', 'angry', 'excited',
                        'disappointed', 'satisfied', 'pleased', 'upset', 'glad', 'sorry']

        flags = []
        for text in self.texts:
            text_lower = str(text).lower().strip()
            words = text_lower.split()

            # Only mark as off-topic if ALL conditions are met:
            # 1. Very short (less than 10 characters)
            # 2. No domain words
            # 3. No review language
            # 4. No emotional content

            is_very_short = len(text_lower) < 10
            has_domain_words = any(word in text_lower for word in domain_words)
            has_review_lang = any(word in text_lower for word in review_words)
            has_emotions = any(word in text_lower for word in emotion_words)

            # Also check for basic review structure
            has_evaluation = any(word in text_lower for word in ['good', 'bad', 'ok', 'okay', 'fine', 'nice'])

            # Only mark as off-topic if it's very short AND lacks any review indicators
            is_off_topic = (is_very_short and
                           not has_domain_words and
                           not has_review_lang and
                           not has_emotions and
                           not has_evaluation)

            flags.append(is_off_topic)

        off_topic_count = sum(flags)
        print(f"Off-topic found: {off_topic_count} ({off_topic_count/len(flags):.1%})")
        return flags

    def calculate_informativeness(self):
        """Calculate informativeness scores."""
        print("Calculating informativeness...")

        scores = []
        for text in self.texts:
            doc = self.nlp(str(text))

            # Basic metrics
            num_words = len([t for t in doc if t.is_alpha])
            if num_words == 0:
                scores.append(0.0)
                continue

            # Lexical diversity
            words = [t.lower_ for t in doc if t.is_alpha]
            lexical_diversity = len(set(words)) / len(words)

            # Entities and content richness
            entities = len(doc.ents)
            entity_density = entities / num_words

            # Length bonus
            length_factor = min(num_words / 20, 1.0)

            # Combine factors
            info_score = (0.4 * lexical_diversity +
                         0.3 * min(entity_density * 10, 1) +
                         0.3 * length_factor)

            scores.append(round(min(1.0, max(0.0, info_score)), 3))

        return scores

    def run_analysis(self):
        """Run complete analysis pipeline."""
        print("=== STARTING REVIEW ANALYSIS ===\n")

        # Extract features
        X = self.extract_features()

        # Generate labels and train classifier
        y = self.generate_labels()
        model = self.train_classifier(X, y)

        # Get spam predictions
        X_scaled = self.scaler.transform(X)
        spam_probs = self.ensemble.predict_proba(X_scaled)[:, 1]
        spam_pred = (spam_probs > 0.5).astype(int)

        # Detect other issues
        dup_flags, dup_refs = self.detect_duplicates(X)
        off_topic_flags = self.detect_off_topic()
        info_scores = self.calculate_informativeness()

        # Create results dataframe
        results = pd.DataFrame({
            'review_id': range(len(self.texts)) if not self.id_col else self.df[self.id_col],
            'review': self.texts,
            'is_spam': spam_pred,
            'spam_probability': np.round(spam_probs, 4),
            'is_duplicate': dup_flags,
            'duplicate_of': dup_refs,
            'is_off_topic': off_topic_flags,
            'informativeness_score': info_scores
        })

        # Create clean and filtered datasets with more balanced approach
        # Only remove if multiple issues are present OR very high confidence single issue
        problematic_mask = (
            (results['is_spam'] & (results['spam_probability'] > 0.8)) |  # High confidence spam
            (results['is_duplicate']) |  # Keep duplicate detection as is
            (results['is_off_topic'] & results['is_spam']) |  # Off-topic AND spam
            (results['is_off_topic'] & (results['informativeness_score'] < 0.1))  # Off-topic AND low info
        )

        keep_mask = ~problematic_mask

        clean_df = self.df[keep_mask].copy()
        filtered_df = self.df[problematic_mask].copy()

        # Add analysis columns to filtered dataset
        for col in ['is_spam', 'spam_probability', 'is_duplicate', 'duplicate_of',
                   'is_off_topic', 'informativeness_score']:
            filtered_df[col] = results[col][problematic_mask].values

        # Save results
        clean_df.to_csv("Reviews_Clean.csv", index=False)
        filtered_df.to_csv("Reviews_Filtered.csv", index=False)
        results.to_csv("Reviews_Analysis.csv", index=False)

        # Summary
        print(f"\n=== ANALYSIS SUMMARY ===")
        print(f"Original reviews: {len(self.df)}")
        print(f"Clean reviews: {len(clean_df)} ({len(clean_df)/len(self.df):.1%})")
        print(f"Filtered out: {len(filtered_df)} ({len(filtered_df)/len(self.df):.1%})")
        print(f"  - Spam: {sum(results['is_spam'])} ({sum(results['is_spam'])/len(self.df):.1%})")
        print(f"  - Duplicates: {sum(results['is_duplicate'])} ({sum(results['is_duplicate'])/len(self.df):.1%})")
        print(f"  - Off-topic: {sum(results['is_off_topic'])} ({sum(results['is_off_topic'])/len(self.df):.1%})")

        return clean_df, filtered_df, results

# Usage
if __name__ == "__main__":
    # Initialize and run analysis
    filter_system = CompactReviewFilter(
        csv_path='USS_Reviews_Silver.csv',
        text_col='review',
        id_col='review_index'
    )

    clean_reviews, filtered_reviews, analysis_results = filter_system.run_analysis()

    print(f"\nFiles saved:")
    print(f"- Reviews_Clean.csv: {len(clean_reviews)} clean reviews")
    print(f"- Reviews_Filtered.csv: {len(filtered_reviews)} filtered reviews")
    print(f"- Reviews_Analysis.csv: Complete analysis results")

Loaded 29412 reviews
Available columns: ['integrated_review', 'stars', 'name', 'review', 'publishedAtDate', 'data_split', 'review_index']
Loading NLP models...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


=== STARTING REVIEW ANALYSIS ===

Extracting features...


Batches:   0%|          | 0/920 [00:00<?, ?it/s]

Linguistic features: 100%|██████████| 29412/29412 [04:44<00:00, 103.33it/s]
Style features: 100%|██████████| 29412/29412 [04:43<00:00, 103.84it/s]


Total features: 702
Generating spam labels...
Generated labels: 1215 spam (4.1%) out of 29412
Training classifier...
Training samples: 45114 (spam: 22557, clean: 22557)

=== MODEL EVALUATION ===
[[5618   22]
 [  23  220]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5640
           1       0.91      0.91      0.91       243

    accuracy                           0.99      5883
   macro avg       0.95      0.95      0.95      5883
weighted avg       0.99      0.99      0.99      5883

Accuracy: 0.9924
Detecting duplicates...
Duplicates found: 1938 (6.6%)
Detecting off-topic reviews...
Off-topic found: 586 (2.0%)
Calculating informativeness...

=== ANALYSIS SUMMARY ===
Original reviews: 29412
Clean reviews: 27052 (92.0%)
Filtered out: 2360 (8.0%)
  - Spam: 1238 (4.2%)
  - Duplicates: 1938 (6.6%)
  - Off-topic: 586 (2.0%)

Files saved:
- Reviews_Clean.csv: 27052 clean reviews
- Reviews_Filtered.csv: 2360 filtered reviews
- Review

---

# 📊 Review Analysis

The `ImprovedReviewPresentation` class is an end-to-end system for analyzing customer reviews, identifying quality issues (spam, duplicates, off-topic), and generating both static and interactive dashboards for reporting. It is especially useful in customer experience analysis and content moderation use cases.

---

## 🧱 Key Components

### 1. **Initialization**

```python
analyzer = ImprovedReviewPresentation()
```

* Loads the datasets
* Ensures required columns exist
* Creates sample data if input is missing

---

### 2. **Validation & Cleaning**

* Ensures key columns exist (`is_spam`, `spam_probability`, etc.)
* Converts columns to appropriate data types
* Adds default values if necessary

---

### 3. **Metric Calculation**

Calculates:

* Clean vs. filtered review counts
* Spam, duplicate, and off-topic counts
* Informativeness score (average before and after filtering)
* Retention rate and quality improvement %

---

### 4. **Executive Summary**

Outputs a concise text-based report covering:

* Filtering performance
* Issue breakdown
* Quality metrics
* Estimated reviewer time saved

```python
analyzer.print_executive_summary()
```

---

### 5. **Static Dashboard**

Generates a 6-panel dashboard using `matplotlib`:

* Pie chart: Clean vs Filtered
* Bar chart: Issue categories
* Bar chart: Metric improvement
* Histogram: Spam probability
* Histogram: Informativeness (clean vs all)
* Histogram: Review length distribution

```python
analyzer.create_main_dashboard()
```

---

### 6. **Interactive Dashboard**

Interactive Plotly dashboard with:

* Pie chart, bar charts, scatter plots, gauge
* Score histograms and filtering visuals
* HTML output for web-based demo or embed

```python
analyzer.create_interactive_plotly_dashboard()
```

---

### 7. **Detailed Examples**

Displays top clean and spam reviews for manual review or presentations.

```python
analyzer.show_detailed_examples()
```

---


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

class ImprovedReviewPresentation:
    def __init__(self, analysis_results_path='Reviews_Analysis.csv',
                 clean_reviews_path='Reviews_Clean.csv',
                 filtered_reviews_path='Reviews_Filtered.csv'):
        """Initialize presentation analysis with robust error handling."""
        print("🔄 Loading review analysis data...")

        # Load main analysis results
        self.results_df = self._safe_load_csv(analysis_results_path, "analysis results")
        self.clean_df = self._safe_load_csv(clean_reviews_path, "clean reviews")
        self.filtered_df = self._safe_load_csv(filtered_reviews_path, "filtered reviews")

        # Validate and fix data
        self._validate_data()

        # Calculate metrics
        self._calculate_key_metrics()

        print(f"✅ Data loaded successfully: {len(self.results_df)} total reviews")

    def _safe_load_csv(self, path, description):
        """Safely load CSV with error handling."""
        try:
            df = pd.read_csv(path)
            print(f"  ✓ Loaded {description}: {len(df)} records")
            return df
        except FileNotFoundError:
            print(f"  ⚠️ {description.title()} file not found: {path}")
            return pd.DataFrame()
        except Exception as e:
            print(f"  ❌ Error loading {description}: {e}")
            return pd.DataFrame()

    def _validate_data(self):
        """Validate and fix data structure."""
        print("🔧 Validating data structure...")

        # If main results are empty, create sample data
        if self.results_df.empty:
            print("  📝 Creating sample data for demonstration...")
            self.results_df = self._create_sample_data()

        # Ensure required columns exist
        required_columns = {
            'is_spam': False,
            'is_duplicate': False,
            'is_off_topic': False,
            'spam_probability': 0.1,
            'informativeness_score': 0.5,
            'review': 'Sample review text'
        }

        for col, default_val in required_columns.items():
            if col not in self.results_df.columns:
                print(f"  + Adding missing column: {col}")
                self.results_df[col] = default_val

        # Fix data types
        bool_cols = ['is_spam', 'is_duplicate', 'is_off_topic']
        for col in bool_cols:
            # Treat missing flags as False; a bare astype(bool) would turn NaN into True
            self.results_df[col] = self.results_df[col].fillna(False).astype(bool)

        numeric_cols = ['spam_probability', 'informativeness_score']
        for col in numeric_cols:
            self.results_df[col] = pd.to_numeric(self.results_df[col], errors='coerce').fillna(0.5)

        # Ensure review text exists
        self.results_df['review'] = self.results_df['review'].astype(str)

        print("  ✅ Data validation complete")

    def _create_sample_data(self):
        """Create realistic sample data for demonstration."""
        n = 1000
        np.random.seed(42)  # For reproducible results

        return pd.DataFrame({
            'review_id': range(n),
            'review': [f"This is sample review {i} with some content about products and services."
                      for i in range(n)],
            'is_spam': np.random.choice([True, False], n, p=[0.05, 0.95]),
            'spam_probability': np.random.beta(1, 9, n),
            'is_duplicate': np.random.choice([True, False], n, p=[0.08, 0.92]),
            'is_off_topic': np.random.choice([True, False], n, p=[0.12, 0.88]),
            'informativeness_score': np.random.beta(2, 2, n)
        })

    def _calculate_key_metrics(self):
        """Calculate key presentation metrics."""
        print("📊 Calculating key metrics...")

        # Calculate clean mask (reviews not flagged for any issue)
        self.clean_mask = ~(
            self.results_df['is_spam'] |
            self.results_df['is_duplicate'] |
            self.results_df['is_off_topic']
        )

        # Core metrics
        self.total_reviews = len(self.results_df)
        self.clean_count = self.clean_mask.sum()
        self.filtered_count = self.total_reviews - self.clean_count

        # Issue counts
        self.spam_count = self.results_df['is_spam'].sum()
        self.duplicate_count = self.results_df['is_duplicate'].sum()
        self.off_topic_count = self.results_df['is_off_topic'].sum()

        # Quality metrics
        self.avg_info_all = self.results_df['informativeness_score'].mean()
        self.avg_info_clean = (self.results_df[self.clean_mask]['informativeness_score'].mean()
                              if self.clean_count > 0 else self.avg_info_all)

        self.retention_rate = (self.clean_count / self.total_reviews) * 100
        self.quality_improvement = ((self.avg_info_clean - self.avg_info_all) / self.avg_info_all * 100
                                   if self.avg_info_all > 0 else 0)

        print(f"  ✓ Clean reviews: {self.clean_count:,} ({self.retention_rate:.1f}%)")
        print(f"  ✓ Quality improvement: {self.quality_improvement:.1f}%")

    def print_executive_summary(self):
        """Print comprehensive executive summary."""
        print("\n" + "="*70)
        print("📋 EXECUTIVE SUMMARY - REVIEW FILTERING ANALYSIS")
        print("="*70)

        print(f"📈 OVERALL PERFORMANCE:")
        print(f"  Total Reviews Processed:    {self.total_reviews:,}")
        print(f"  Clean Reviews Retained:     {self.clean_count:,} ({self.retention_rate:.1f}%)")
        print(f"  Reviews Filtered Out:       {self.filtered_count:,} ({(self.filtered_count/self.total_reviews)*100:.1f}%)")

        print(f"\n🎯 FILTERING BREAKDOWN:")
        print(f"  Spam Detected:              {self.spam_count:,} ({(self.spam_count/self.total_reviews)*100:.1f}%)")
        print(f"  Duplicates Found:           {self.duplicate_count:,} ({(self.duplicate_count/self.total_reviews)*100:.1f}%)")
        print(f"  Off-topic Reviews:          {self.off_topic_count:,} ({(self.off_topic_count/self.total_reviews)*100:.1f}%)")

        print(f"\n📊 QUALITY IMPROVEMENTS:")
        print(f"  Informativeness Increase:   {self.quality_improvement:.1f}%")
        print(f"  Average Score (All):        {self.avg_info_all:.3f}")
        print(f"  Average Score (Clean):      {self.avg_info_clean:.3f}")

        # Estimated reviewer time saved, assuming 30 seconds (0.5 minutes) per filtered review
        time_saved_minutes = self.filtered_count * 0.5
        print(f"\n💰 ESTIMATED IMPACT:")
        print(f"  Time Saved (hours):         {time_saved_minutes/60:.1f}")
        print(f"  Reviews Requiring Review:   {self.clean_count:,} (down from {self.total_reviews:,})")

        print("="*70)

    def create_main_dashboard(self, save_path='review_analysis_dashboard.png'):
        """Create comprehensive main dashboard."""
        print("🎨 Creating main analysis dashboard...")

        fig = plt.figure(figsize=(20, 12))
        gs = fig.add_gridspec(3, 4, hspace=0.3, wspace=0.3)

        # Main title
        fig.suptitle('Review Filtering Analysis Dashboard', fontsize=24, fontweight='bold', y=0.95)

        # 1. Overall distribution (large pie chart)
        ax1 = fig.add_subplot(gs[0, :2])
        if self.filtered_count > 0:
            sizes = [self.clean_count, self.filtered_count]
            labels = [f'Clean Reviews\n({self.clean_count:,})', f'Filtered Reviews\n({self.filtered_count:,})']
            colors = ['#27ae60', '#e74c3c']
            explode = (0.05, 0.05)

            wedges, texts, autotexts = ax1.pie(sizes, labels=labels, colors=colors, explode=explode,
                                              autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12})
            ax1.set_title('Overall Review Distribution', fontsize=16, fontweight='bold', pad=20)
        else:
            ax1.text(0.5, 0.5, f'All {self.clean_count:,}\nReviews are Clean!',
                    ha='center', va='center', fontsize=16, fontweight='bold')
            ax1.set_title('Perfect Quality!', fontsize=16, fontweight='bold')

        # 2. Issues breakdown (bar chart)
        ax2 = fig.add_subplot(gs[0, 2:])
        categories = ['Spam', 'Duplicates', 'Off-topic']
        counts = [self.spam_count, self.duplicate_count, self.off_topic_count]
        colors = ['#e74c3c', '#f39c12', '#3498db']

        bars = ax2.bar(categories, counts, color=colors, alpha=0.8, edgecolor='black', linewidth=1)
        ax2.set_title('Issues Detected by Category', fontsize=14, fontweight='bold')
        ax2.set_ylabel('Number of Reviews', fontsize=12)

        # Add value labels on bars
        for bar, count in zip(bars, counts):
            height = bar.get_height()
            ax2.text(bar.get_x() + bar.get_width()/2., height + max(height*0.02, 1),
                    f'{count:,}', ha='center', va='bottom', fontweight='bold', fontsize=11)

        # 3. Quality improvement metrics
        ax3 = fig.add_subplot(gs[1, :2])
        metrics = ['Retention\nRate (%)', 'Quality\nImprovement (%)', 'Avg Info Score\n(Before)', 'Avg Info Score\n(After)']
        values = [self.retention_rate, self.quality_improvement, self.avg_info_all, self.avg_info_clean]
        colors_metrics = ['#27ae60', '#9b59b6', '#34495e', '#2ecc71']

        bars_metrics = ax3.bar(metrics, values, color=colors_metrics, alpha=0.8, edgecolor='black')
        ax3.set_title('Key Performance Metrics', fontsize=14, fontweight='bold')
        ax3.set_ylabel('Score / Percentage', fontsize=12)

        for bar, value in zip(bars_metrics, values):
            height = bar.get_height()
            ax3.text(bar.get_x() + bar.get_width()/2., height + height*0.02,
                    f'{value:.1f}', ha='center', va='bottom', fontweight='bold', fontsize=11)

        # 4. Spam probability distribution
        ax4 = fig.add_subplot(gs[1, 2:])
        spam_probs = self.results_df['spam_probability']
        ax4.hist(spam_probs, bins=30, alpha=0.7, color='#e74c3c', edgecolor='black')
        ax4.axvline(spam_probs.mean(), color='darkred', linestyle='--', linewidth=2,
                   label=f'Mean: {spam_probs.mean():.3f}')
        ax4.set_title('Spam Probability Distribution', fontsize=14, fontweight='bold')
        ax4.set_xlabel('Spam Probability', fontsize=12)
        ax4.set_ylabel('Frequency', fontsize=12)
        ax4.legend()

        # 5. Informativeness comparison (bottom row)
        ax5 = fig.add_subplot(gs[2, :2])
        all_info = self.results_df['informativeness_score']
        clean_info = self.results_df[self.clean_mask]['informativeness_score']

        ax5.hist(all_info, bins=25, alpha=0.6, label=f'All Reviews (μ={all_info.mean():.3f})',
                color='lightcoral', density=True, edgecolor='black')
        if len(clean_info) > 0:
            ax5.hist(clean_info, bins=25, alpha=0.8, label=f'Clean Reviews (μ={clean_info.mean():.3f})',
                    color='forestgreen', density=True, edgecolor='black')
        ax5.set_title('Informativeness Score Distribution', fontsize=14, fontweight='bold')
        ax5.set_xlabel('Informativeness Score', fontsize=12)
        ax5.set_ylabel('Density', fontsize=12)
        ax5.legend()

        # 6. Review length analysis
        ax6 = fig.add_subplot(gs[2, 2:])
        # 'review' is cast to str in _validate_data, so .str.len() is safe here
        all_lengths = self.results_df['review'].str.len()
        clean_lengths = all_lengths[self.clean_mask]

        ax6.hist(all_lengths, bins=30, alpha=0.6, label=f'All Reviews (μ={np.mean(all_lengths):.0f})',
                color='lightblue', density=True, range=(0, 500))
        if len(clean_lengths) > 0:
            ax6.hist(clean_lengths, bins=30, alpha=0.8, label=f'Clean Reviews (μ={np.mean(clean_lengths):.0f})',
                    color='darkblue', density=True, range=(0, 500))
        ax6.set_title('Review Length Distribution', fontsize=14, fontweight='bold')
        ax6.set_xlabel('Character Count', fontsize=12)
        ax6.set_ylabel('Density', fontsize=12)
        ax6.legend()

        plt.tight_layout()
        plt.savefig(save_path, dpi=300, bbox_inches='tight', facecolor='white')
        plt.show()

        print(f"  ✅ Dashboard saved as {save_path}")
        return fig

    def create_interactive_plotly_dashboard(self, save_path='interactive_dashboard.html'):
        """Create interactive Plotly dashboard."""
        print("🌐 Creating interactive dashboard...")

        # Create subplots
        fig = make_subplots(
            rows=2, cols=3,
            subplot_titles=('Review Distribution', 'Issues by Category', 'Quality Metrics',
                           'Spam vs Informativeness', 'Score Distributions', 'Processing Efficiency'),
            specs=[[{"type": "domain"}, {"type": "xy"}, {"type": "xy"}],
                   [{"type": "scatter"}, {"type": "xy"}, {"type": "indicator"}]]
        )

        # 1. Pie chart - Review distribution
        fig.add_trace(go.Pie(
            labels=['Clean Reviews', 'Filtered Reviews'],
            values=[self.clean_count, self.filtered_count],
            hole=0.4,
            marker_colors=['#27ae60', '#e74c3c'],
            textinfo='label+percent+value',
            textfont_size=12
        ), row=1, col=1)

        # 2. Bar chart - Issues by category
        fig.add_trace(go.Bar(
            x=['Spam', 'Duplicates', 'Off-topic'],
            y=[self.spam_count, self.duplicate_count, self.off_topic_count],
            marker_color=['#e74c3c', '#f39c12', '#3498db'],
            text=[f'{self.spam_count:,}', f'{self.duplicate_count:,}', f'{self.off_topic_count:,}'],
            textposition='auto',
            name='Issues'
        ), row=1, col=2)

        # 3. Bar chart - Quality metrics
        metrics = ['Retention %', 'Quality Improvement %']
        values = [self.retention_rate, self.quality_improvement]
        fig.add_trace(go.Bar(
            x=metrics,
            y=values,
            marker_color=['#27ae60', '#9b59b6'],
            text=[f'{v:.1f}%' for v in values],
            textposition='auto',
            name='Metrics'
        ), row=1, col=3)

        # 4. Scatter plot - Spam probability vs informativeness
        colors = ['red' if spam else 'blue' for spam in self.results_df['is_spam']]
        fig.add_trace(go.Scatter(
            x=self.results_df['spam_probability'],
            y=self.results_df['informativeness_score'],
            mode='markers',
            marker=dict(
                color=colors,
                size=6,
                opacity=0.6,
                line=dict(width=1, color='white')
            ),
            text=[f"Review {i}<br>Spam: {spam}<br>Info: {info:.3f}"
                  for i, (spam, info) in enumerate(zip(self.results_df['is_spam'],
                                                      self.results_df['informativeness_score']))],
            hovertemplate='%{text}<extra></extra>',
            name='Reviews'
        ), row=2, col=1)

        # 5. Histogram - Score distributions
        fig.add_trace(go.Histogram(
            x=self.results_df['informativeness_score'],
            name='All Reviews',
            opacity=0.6,
            nbinsx=20,
            marker_color='lightcoral'
        ), row=2, col=2)

        if self.clean_count > 0:
            fig.add_trace(go.Histogram(
                x=self.results_df[self.clean_mask]['informativeness_score'],
                name='Clean Reviews',
                opacity=0.8,
                nbinsx=20,
                marker_color='forestgreen'
            ), row=2, col=2)

        # 6. Gauge - Processing efficiency
        fig.add_trace(go.Indicator(
            mode="gauge+number+delta",
            value=self.retention_rate,
            delta={'reference': 80},
            gauge={
                'axis': {'range': [None, 100]},
                'bar': {'color': "#27ae60"},
                'steps': [
                    {'range': [0, 50], 'color': "lightgray"},
                    {'range': [50, 80], 'color': "gray"},
                    {'range': [80, 100], 'color': "lightgreen"}
                ],
                'threshold': {
                    'line': {'color': "red", 'width': 4},
                    'thickness': 0.75,
                    'value': 90
                }
            },
            title={'text': "Retention Rate (%)"},
            number={'suffix': "%"}
        ), row=2, col=3)

        # Update layout (overlay mode lets the two semi-transparent histograms share bins)
        fig.update_layout(
            title_text="Interactive Review Filtering Analysis Dashboard",
            title_x=0.5,
            title_font_size=20,
            showlegend=True,
            barmode='overlay',
            height=800
        )

        # Save and show
        fig.write_html(save_path)
        fig.show()

        print(f"  ✅ Interactive dashboard saved as {save_path}")
        return fig

    def show_detailed_examples(self, n_examples=5):
        """Show detailed examples with context."""
        print("\n" + "="*80)
        print("📝 DETAILED REVIEW EXAMPLES")
        print("="*80)

        # High-quality clean reviews
        if self.clean_count > 0:
            print(f"\n🏆 TOP {min(n_examples, self.clean_count)} HIGH-QUALITY CLEAN REVIEWS:")
            print("-" * 70)

            clean_examples = (self.results_df[self.clean_mask]
                            .nlargest(min(n_examples, self.clean_count), 'informativeness_score'))

            for i, (_, row) in enumerate(clean_examples.iterrows(), 1):
                print(f"\n{i}. CLEAN REVIEW - ID: {row.get('review_id', 'N/A')}")
                print(f"   📊 Informativeness: {row['informativeness_score']:.3f}")
                print(f"   🚫 Spam Probability: {row['spam_probability']:.3f}")
                print(f"   📝 Text: \"{str(row['review'])[:200]}{'...' if len(str(row['review'])) > 200 else ''}\"")
        else:
            print("\n🏆 NO CLEAN REVIEWS FOUND")

        # Spam examples
        if self.spam_count > 0:
            print(f"\n🚫 TOP {min(n_examples, self.spam_count)} SPAM EXAMPLES:")
            print("-" * 70)

            spam_examples = (self.results_df[self.results_df['is_spam']]
                           .nlargest(min(n_examples, self.spam_count), 'spam_probability'))

            for i, (_, row) in enumerate(spam_examples.iterrows(), 1):
                print(f"\n{i}. SPAM REVIEW - ID: {row.get('review_id', 'N/A')}")
                print(f"   🚫 Spam Probability: {row['spam_probability']:.3f}")
                print(f"   📊 Informativeness: {row['informativeness_score']:.3f}")
                print(f"   📝 Text: \"{str(row['review'])[:200]}{'...' if len(str(row['review'])) > 200 else ''}\"")
        else:
            print(f"\n🚫 NO SPAM EXAMPLES FOUND")

        # Show filtering impact
        print(f"\n📈 FILTERING IMPACT SUMMARY:")
        print(f"   • Processing efficiency: {self.retention_rate:.1f}% of reviews retained")
        print(f"   • Quality improvement: {self.quality_improvement:.1f}% increase in informativeness")
        print(f"   • Manual review reduction: {self.filtered_count:,} reviews automatically filtered")

    def generate_complete_presentation(self):
        """Generate all presentation materials."""
        print("\n🎯 GENERATING COMPLETE PRESENTATION PACKAGE")
        print("="*70)

        try:
            # 1. Executive summary
            self.print_executive_summary()

            # 2. Main dashboard
            print("\n📊 Creating visualizations...")
            self.create_main_dashboard()

            # 3. Interactive dashboard
            self.create_interactive_plotly_dashboard()

            # 4. Detailed examples
            self.show_detailed_examples()

            print(f"\n✅ PRESENTATION PACKAGE COMPLETE!")
            print("="*70)
            print("📁 Generated Files:")
            print("  • review_analysis_dashboard.png - Main presentation chart")
            print("  • interactive_dashboard.html - Interactive dashboard for demos")
            print("\n🎯 Ready for presentation! Use the PNG for slides and HTML for live demos.")

        except Exception as e:
            print(f"❌ Error generating presentation: {e}")
            print("Some components may not have been created.")

# Simplified usage
if __name__ == "__main__":
    try:
        print("🚀 STARTING REVIEW ANALYSIS PRESENTATION GENERATOR")
        print("="*70)

        # Initialize analyzer
        analyzer = ImprovedReviewPresentation(
            analysis_results_path='Reviews_Analysis.csv',
            clean_reviews_path='Reviews_Clean.csv',
            filtered_reviews_path='Reviews_Filtered.csv'
        )

        # Generate complete presentation
        analyzer.generate_complete_presentation()

    except Exception as e:
        print(f"💥 Critical error: {e}")
        print("Please check your data files and try again.")
