In [None]:
#@title connect google drive

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/SMU_MITB_NLP/Project/

# Labeling Pipeline

| Step                          | Purpose                                                      | Method/Components Used |
| ----------------------------- | ------------------------------------------------------------ | ---------------------- |
| **Initialization**            | Load predefined categories and create alias mappings        | 60+ theme park categories, smart alias generation system |
| **Category Alias Generation** | Create intelligent variations for each venue/attraction     | Rule-based alias creation, abbreviations, common variations |
| **Data Loading**              | Read review dataset and validate column structure           | `pd.read_csv()` / `pd.read_excel()`, column validation |
| **Text Preprocessing**        | Clean and normalize review text for matching               | Lowercase conversion, whitespace normalization, regex cleaning |
| **Exact Phrase Matching**     | Find direct mentions of categories and aliases             | Substring detection, case-insensitive comparison, alias lookup |
| **Fuzzy String Matching**     | Catch variations, typos, and partial mentions              | **fuzzywuzzy** partial_ratio + token_sort_ratio (85% threshold) |
| **Score Calculation**         | Calculate confidence scores for each potential match        | Match length ratios, similarity percentages, quality weighting |
| **Confidence Filtering**      | Remove low-confidence matches below threshold              | Minimum confidence filtering (80% default) |
| **Duplicate Removal**         | Keep highest-scoring match per category                    | Category deduplication, score-based selection |
| **Result Ranking**            | Sort matches by confidence score                           | Descending score sorting, top-K selection (3 default) |
| **Multi-Category Assignment** | Assign multiple relevant categories per review             | Pipe-separated category lists, score arrays |
| **Distribution Analysis**     | Analyze category frequency and patterns                    | Counter-based frequency analysis, type grouping |
| **Category Grouping**         | Organize results by venue type                             | **Rides** 🎢, **Food** 🍕, **Shows** 🎭, **Services** 🏢, **Characters** 👥 |
| **Performance Metrics**       | Report match rate and per-category score statistics         | Match rates, average confidence per category, category coverage |
| **Results Export**            | Save categorized reviews with analysis metadata           | CSV/Excel output with match details and confidence scores |

## Pipeline Flow Summary

```
Input Reviews → Category Aliases → Text Cleaning → Dual Matching →
Confidence Filtering → Result Ranking → Category Analysis → Output Generation
```

**Key Quality Controls:**
- Conservative confidence thresholds (80%+)
- Exact matching priority over fuzzy
- Meaningful phrase requirements (4+ chars)
- Category-specific alias intelligence
- Multi-level validation pipeline

In [None]:
!pip install pandas numpy scikit-learn nltk fuzzywuzzy python-levenshtein openpyxl



## Overview

This notebook implements a rule-based classifier that assigns theme park reviews to specific attraction, dining, and service categories. It combines exact phrase matching with fuzzy string matching to detect mentions of individual venues, rides, and amenities in review text, and it favors precision by discarding low-confidence matches.

## Key Features

### 🎯 **High-Precision Classification**
- **Exact Phrase Matching**: Direct keyword detection with alias variations
- **Fuzzy String Matching**: `fuzzywuzzy` similarity scoring to catch typos and spelling variations
- **Category Filtering**: Minimum confidence thresholds for quality control
- **Multi-Category Support**: Reviews can match multiple relevant categories

### 🏰 **Theme Park Domain Expertise**
- **Pre-defined Categories**: 60+ specific attractions, dining, and services
- **Intelligent Aliases**: Smart variations for each category
- **Domain-Specific Logic**: Hand-written alias rules for major attractions (Transformers, Minions, Jurassic Park, and others)
- **Quality Scoring**: Confidence-based result ranking

### 📊 **Comprehensive Analysis**
- **Category Distribution**: Detailed frequency and popularity analysis
- **Type Grouping**: Organized by rides, dining, shows, services, characters
- **Performance Metrics**: Match rates and confidence scoring
- **Console Reporting**: Printed tables with category breakdowns by type

## System Architecture

### Core Classes

#### `PreciseReviewClassifier`
Main classification engine that handles category matching and analysis.

**Key Components:**
- 60+ predefined attraction/service categories
- Smart alias generation system
- Dual matching strategy (exact + fuzzy)
- Comprehensive analysis pipeline

### Category Coverage

#### **Attractions & Rides (25+ categories)**
```python
# Major attractions include:
"transformers the ride the ultimate 3d battle"
"revenge of the mummy"
"battlestar galactica human/cylon"
"despicable me minion mayhem"
"jurassic park rapids adventure"
"shrek 4 d adventure"
```

#### **Dining & Food (15+ categories)**
```python
# Food venues include:
"mel s drive in"
"kt s grill"
"loui s ny pizza parlor"
"starbucks"
"discovery food court"
```

#### **Shows & Entertainment (10+ categories)**
```python
# Entertainment includes:
"waterworld"
"lake hollywood spectacular"
"trolls hug time jubilee"
"donkey live"
```

#### **Character Experiences (8+ categories)**
```python
# Character meets include:
"gru lucy"
"po and master tigress"
"raptor encounter with blue"
"minion dance party"
```

## Classification Algorithm

### Smart Alias Generation

**Automatic Variation Creation:**
```python
def _create_category_aliases(self):
    # Creates intelligent variations:
    # "transformers the ride" → ["transformers", "transformers ride"]
    # "mel s drive in" → ["mel s drive in", "mels drive in"]
    # "battlestar galactica human" → ["battlestar", "battlestar human"]
    ...  # Full implementation is in the classifier cell
```
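
As a rough, standalone illustration of the filler-word rule used for ride names, the sketch below strips common words to form a shorter alias. The names `FILLER_WORDS` and `short_alias` are hypothetical and not part of the class; the word list mirrors the one in `_create_category_aliases`, and the sketch skips the check that limits this rule to ride-like categories.

```python
# Hypothetical helper illustrating the filler-word rule; not part of the class.
FILLER_WORDS = {"the", "ride", "adventure", "ultimate", "featuring"}


def short_alias(category):
    """Return the category without filler words, or None if fewer than two words remain."""
    words = [w for w in category.lower().split() if w not in FILLER_WORDS]
    # Same guard as the original rule: keep only aliases with 2+ meaningful words
    return " ".join(words) if len(words) >= 2 else None


print(short_alias("transformers the ride the ultimate 3d battle"))
print(short_alias("jurassic park rapids adventure"))
```

The two-word guard is what keeps a single generic word from becoming an alias under this rule; the hand-written branches in the class add shorter aliases such as `transformers` separately.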

### Dual Matching Strategy

#### 1. **Exact Phrase Matching**
- Direct substring detection
- Case-insensitive comparison
- Alias-based variations
- Quality scoring based on match length
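
The length-based scoring can be sketched in isolation as follows. This is a minimal version that assumes the same formula as `exact_phrase_match` (alias-to-category length ratio as a percentage, plus 20, capped at 100); the function names `exact_match_score` and `exact_matches` are hypothetical.

```python
def exact_match_score(alias, category):
    """Alias/category length ratio as a percentage, plus 20, capped at 100."""
    return min(100, (len(alias) / len(category)) * 100 + 20)


def exact_matches(text, category, aliases):
    """Return (category, score, detail) for every alias found as a substring of text."""
    cleaned = " ".join(str(text).lower().split())  # lowercase, collapse whitespace
    return [
        (category, exact_match_score(alias, category), f"exact: '{alias}'")
        for alias in aliases
        if alias in cleaned
    ]


hits = exact_matches(
    "Revenge of the Mummy was scary but amazing",
    "revenge of the mummy",
    ["revenge of the mummy", "mummy"],
)
for hit in hits:
    print(hit)
```

In this example the full name scores 100 while the short alias `mummy` scores 45.0, so under this formula a hit on a short alias alone would fall below an 80-point confidence cut-off.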

#### 2. **Fuzzy String Matching**
```python
# Uses fuzzywuzzy library:
from fuzzywuzzy import fuzz

# alias: one alias of the category; text: the cleaned review text
partial_score = fuzz.partial_ratio(alias, text)      # Substring matching
token_score = fuzz.token_sort_ratio(alias, text)     # Word order flexibility
best_score = max(partial_score, token_score)         # Take best result
```
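
To make the two scorers less of a black box, here is a standard-library approximation of what they measure. It is a sketch built on `difflib.SequenceMatcher`, not fuzzywuzzy's actual implementation: the real functions also normalize punctuation and use a different alignment strategy, so scores can differ. The names `ratio`, `token_sort_ratio_sketch`, and `partial_ratio_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(a, b):
    """Compare strings after sorting their words, so word order is ignored."""
    sort_words = lambda s: " ".join(sorted(s.lower().split()))
    return ratio(sort_words(a), sort_words(b))


def partial_ratio_sketch(a, b):
    """Best similarity between the shorter string and any equal-length window of the longer."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    windows = (long_[i:i + len(short)] for i in range(len(long_) - len(short) + 1))
    return max(ratio(short, window) for window in windows)


print(token_sort_ratio_sketch("revenge mummy", "mummy revenge"))
print(partial_ratio_sketch("minion mayhem", "minion mayhem had us laughing"))
```

Note that a partial-style score reaches 100 whenever the alias appears verbatim anywhere in the text, which means short, generic aliases can match many unrelated reviews.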

### Classification Pipeline

| Step | Method | Threshold | Purpose |
|------|--------|-----------|---------|
| **Text Cleaning** | `clean_text()` | N/A | Normalize whitespace, lowercase conversion |
| **Exact Matching** | `exact_phrase_match()` | Verbatim substring | Find direct mentions; score reflects alias length relative to the category name |
| **Fuzzy Matching** | `fuzzy_phrase_match()` | 85% similarity | Catch variations and typos |
| **Score Filtering** | `min_confidence` | 80% default | Remove low-confidence matches |
| **Result Ranking** | Score-based sorting | Top 3 default | Return best matches |
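
The last two rows of the table can be sketched as one small function. This is a simplified, standalone version of the deduplicate, sort, trim, and filter logic, assuming matches are `(category, score, method)` tuples as in the classifier; the name `rank_matches` and the candidate scores below are made up for illustration.

```python
def rank_matches(matches, max_results=3, min_confidence=80):
    """Keep the best match per category, sort by score, trim to top-K, then filter."""
    best = {}
    for category, score, method in matches:
        if category not in best or score > best[category][1]:
            best[category] = (category, score, method)
    ranked = sorted(best.values(), key=lambda m: m[1], reverse=True)[:max_results]
    return [m for m in ranked if m[1] >= min_confidence]


# Illustrative candidates only; the scores are invented for this example
candidates = [
    ("revenge of the mummy", 45.0, "exact: 'mummy'"),
    ("revenge of the mummy", 100, "exact: 'revenge of the mummy'"),
    ("waterworld", 86, "fuzzy: 'water world' (86)"),
    ("starbucks", 70, "fuzzy: 'starbucks' (70)"),
]
for match in rank_matches(candidates):
    print(match)
```

Because the list is sorted before trimming, applying the confidence filter after the top-K cut returns the same matches as filtering first.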

## Output Analysis

### Classification Results

#### **Per-Review Columns:**
- `Matched_Categories`: Pipe-separated category names
- `Category_Scores`: Confidence scores for each match
- `Top_Category`: Highest-confidence category
- `Confidence_Score`: Primary match confidence
- `Match_Details`: Matching method details
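
When reading the results back, the pipe-separated columns need to be split again. The sketch below assumes the `' | '` separator used when the columns are written; `parse_matches` is a hypothetical helper name.

```python
def parse_matches(matched_categories, category_scores):
    """Split the pipe-separated columns into a list of (category, score) pairs."""
    if not matched_categories:
        return []  # review had no match above the confidence threshold
    categories = matched_categories.split(" | ")
    scores = [float(s) for s in category_scores.split(" | ")]
    return list(zip(categories, scores))


pairs = parse_matches(
    "despicable me minion mayhem | minion dance party",
    "100.0 | 100.0",
)
print(pairs)
```

If the file is reloaded with `pd.read_csv()`, empty cells typically come back as `NaN` rather than `''`, so a caller may want to fill missing values with an empty string first.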

#### **Sample Output:**
```csv
review,Matched_Categories,Category_Scores,Top_Category,Confidence_Score
"Transformers ride was amazing!","transformers the ride the ultimate 3d battle","95.0","transformers the ride the ultimate 3d battle",95.0
"Food at Mel's was great","mel s drive in","88.5","mel s drive in",88.5
```

### Category Distribution Analysis

#### **Summary Statistics:**
- Total reviews processed
- Reviews with high-confidence matches
- Precision match rate percentage
- Total category mentions
- Unique categories found
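
These statistics can be computed from the `Matched_Categories` column alone. The following is a minimal sketch over a plain list of strings; `summarize` and the dictionary keys are hypothetical names, and the sample values are for illustration only.

```python
from collections import Counter


def summarize(matched_column):
    """Compute summary statistics from a list of Matched_Categories strings."""
    matched_rows = [m for m in matched_column if m]
    mentions = [cat for m in matched_rows for cat in m.split(" | ")]
    counts = Counter(mentions)
    total = len(matched_column)
    return {
        "total_reviews": total,
        "matched_reviews": len(matched_rows),
        "match_rate_pct": round(100 * len(matched_rows) / total, 1) if total else 0.0,
        "total_mentions": len(mentions),
        "unique_categories": len(counts),
    }


sample = ["", "express", "despicable me minion mayhem | minion dance party", ""]
print(summarize(sample))
```

The match rate here is the share of reviews with at least one match; it measures coverage and says nothing about whether the matches are correct.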

#### **Category Grouping:**
```
Automatic categorization by type:
🎢 RIDES & ATTRACTIONS    # roller coasters, dark rides, experiences
🍕 FOOD & DINING          # restaurants, food courts, snack stands
🎭 SHOWS & ENTERTAINMENT  # live shows, character performances
🏢 SERVICES & AMENITIES   # restrooms, WiFi, lockers, parking
👥 CHARACTER EXPERIENCES  # meet & greets, photo opportunities
```
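
The grouping is keyword-based, which the sketch below reproduces outside the class. The keyword lists are copied from `analyze_category_distribution`; the names `TYPE_KEYWORDS` and `category_types` are hypothetical.

```python
# Keyword lists copied from analyze_category_distribution; the dict itself is illustrative
TYPE_KEYWORDS = {
    "rides": ["ride", "adventure", "encounter", "coaster", "mayhem",
              "galactica", "transformers", "mummy", "shrek", "jurassic"],
    "food": ["drive", "grill", "pizza", "café", "court", "stand",
             "starbucks", "food", "cookie", "fuel", "bites"],
    "shows": ["show", "theater", "spectacular", "live", "dance", "voices", "waterworld"],
    "services": ["restroom", "locker", "wifi", "parking", "service",
                 "first aid", "baby care", "atm", "retail"],
    "characters": ["gru", "lucy", "po", "tigress", "donkey", "portrait", "guards"],
}


def category_types(category):
    """Return every type whose keyword list has a substring hit in the category name."""
    name = category.lower()
    return [t for t, keywords in TYPE_KEYWORDS.items() if any(k in name for k in keywords)]


for cat in ["mel s drive in", "transformers voices of cybertron", "pops popcorn delight"]:
    print(cat, "->", category_types(cat))
```

Because the test is a plain substring check, a category can land in several groups or in an unexpected one: `transformers voices of cybertron` is listed under both rides and shows, and the short keyword `po` places `pops popcorn delight` under characters.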

## Configuration Options

### **Precision Controls:**
```python
classifier.process_reviews_file(
    'your_file.csv',           # Input file (.csv, .xlsx, or .xls)
    'precise_results.csv',     # Output file
    review_column='review',    # Source column name
    fuzzy_threshold=85,        # Similarity threshold (higher = stricter)
    max_categories=3,          # Maximum matches per review
    min_confidence=80          # Minimum score to include result
)
```

### **Quality Thresholds:**
- **Exact Match**: Alias must appear verbatim in the cleaned text; score is based on alias length and capped at 100
- **Fuzzy Match**: 85% similarity default
- **Minimum Confidence**: 80% score threshold
- **Meaningful Phrases**: 4+ character minimum for fuzzy matching


In [None]:
import pandas as pd
import re
from fuzzywuzzy import fuzz
from collections import defaultdict

class PreciseReviewClassifier:
    def __init__(self):
        self.categories = [
            "buggie boogie",
            "despicable me minion mayhem",
            "silly swirly",
            "sesame street spaghetti space chase",
            "battlestar galactica human",
            "battlestar galactica cylon",
            "transformers the ride the ultimate 3d battle",
            "accelerator",
            "revenge of the mummy",
            "treasure hunters",
            "sesame street goes bollywood",
            "canopy flyer",
            "dino soarin",
            "jurassic park rapids adventure",
            "enchanted airways",
            "magic potion spin",
            "puss in boots giant journey",
            "minute of minion mayhem",
            "despicable me family portrait",
            "gru lucy",
            "minion dance party",
            "mel s mixtape",
            "pantages hollywood theater trolls hug time jubilee",
            "lake hollywood spectacular",
            "po and master tigress",
            "lights camera action",
            "rhythm truck",
            "sesame street",
            "transformers voices of cybertron",
            "aset warrior",
            "anubis guards",
            "egyptian guards",
            "hatched featuring dr rodney",
            "waterworld",
            "baby kilos naptime tango",
            "raptor encounter with blue",
            "raptor encounter generations",
            "donkey live",
            "fortune favors the furry",
            "shrek 4 d adventure",
            "happily ever after",
            "ice cream stand",
            "pop a nana",
            "super hungry food stand",
            "mel s drive in",
            "pops popcorn delight",
            "starbucks",
            "kt s grill",
            "loui s ny pizza parlor",
            "me want cookie",
            "frozen fuel",
            "planet yen",
            "starbot café",
            "stardots",
            "discovery food court",
            "fossil fuels",
            "jungle bites",
            "friar s good food",
            "goldilocks",
            "express"
        ]

        # Create smart keyword aliases for better matching
        self.category_aliases = self._create_category_aliases()

    def _create_category_aliases(self):
        """Create aliases and variations for each category"""
        aliases = {}

        for category in self.categories:
            category_set = set()

            # Add the original category
            category_set.add(category.lower())

            # Create meaningful variations based on category type
            words = category.lower().split()

            # For rides/attractions - create sensible short forms
            if any(keyword in category.lower() for keyword in ['ride', 'adventure', 'encounter', 'attraction']):
                # Remove common words but keep meaningful ones
                meaningful_words = [w for w in words if w not in ['the', 'ride', 'adventure', 'ultimate', 'featuring']]
                if len(meaningful_words) >= 2:
                    category_set.add(' '.join(meaningful_words))

            # For character names and specific attractions
            if 'transformers' in category.lower():
                category_set.add('transformers')
                category_set.add('transformers ride')

            elif 'minion' in category.lower():
                category_set.add('minion mayhem')
                category_set.add('minions')
                category_set.add('despicable me')

            elif 'jurassic park' in category.lower():
                category_set.add('jurassic park')
                category_set.add('jurassic rapids')

            elif 'revenge of the mummy' in category.lower():
                category_set.add('mummy')
                category_set.add('revenge mummy')

            elif 'battlestar galactica' in category.lower():
                category_set.add('battlestar galactica')
                category_set.add('battlestar')
                if 'human' in category:
                    category_set.add('battlestar human')
                if 'cylon' in category:
                    category_set.add('battlestar cylon')

            elif 'shrek' in category.lower():
                category_set.add('shrek')
                category_set.add('shrek 4d')

            elif 'sesame street' in category.lower():
                category_set.add('sesame street')
                if 'spaghetti' in category:
                    category_set.add('spaghetti space chase')
                if 'bollywood' in category:
                    category_set.add('sesame bollywood')

            elif 'waterworld' in category.lower():
                category_set.add('waterworld')
                category_set.add('water world')

            # For food venues - keep restaurant names intact
            elif any(food_word in category.lower() for food_word in ['drive', 'grill', 'pizza', 'café', 'court', 'stand']):
                # Keep the full name for restaurants
                category_set.add(category.lower())
                # Category names store possessives as a detached "s" ("mel s drive in"),
                # so add the joined form as a variation ("mels drive in")
                category_set.add(re.sub(r"\b(\w+) s\b", r"\1s", category.lower()))

            # For services - add common variations
            elif category.lower() in ['wifi', 'restrooms', 'lockers', 'atms']:
                category_set.add(category.lower())
                if category.lower() == 'wifi':
                    category_set.add('wi-fi')
                    category_set.add('internet')
                elif category.lower() == 'restrooms':
                    category_set.add('bathroom')
                    category_set.add('toilets')
                elif category.lower() == 'atms':
                    category_set.add('atm')
                    category_set.add('cash machine')

            aliases[category] = list(category_set)

        return aliases

    def clean_text(self, text):
        """Basic text cleaning"""
        if pd.isna(text) or text == '':
            return ''

        # Convert to lowercase and remove extra spaces
        text = str(text).lower()
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()

        return text

    def exact_phrase_match(self, text, category):
        """Check for exact phrase matches"""
        clean_text = self.clean_text(text)
        matches = []

        for alias in self.category_aliases[category]:
            if alias in clean_text:
                # Calculate match quality based on alias length vs category length
                match_quality = min(100, (len(alias) / len(category)) * 100 + 20)
                matches.append((category, match_quality, f"exact: '{alias}'"))

        # Return the collected matches (empty list if no alias was found)
        return matches

    def analyze_category_distribution(self, df):
        """Analyze and print detailed category distribution"""
        if df is None or 'Matched_Categories' not in df.columns:
            print("No classification data found.")
            return

        from collections import Counter

        # Count all categories (including multiple per review)
        all_categories = []
        category_scores = {}

        for idx, row in df.iterrows():
            if row['Matched_Categories'] != '':
                categories = row['Matched_Categories'].split(' | ')
                scores = row['Category_Scores'].split(' | ')

                for cat, score in zip(categories, scores):
                    all_categories.append(cat)
                    if cat not in category_scores:
                        category_scores[cat] = []
                    category_scores[cat].append(float(score))

        # Count occurrences
        category_counts = Counter(all_categories)

        print("\n" + "="*80)
        print("DETAILED CATEGORY ANALYSIS")
        print("="*80)

        print(f"\nTotal reviews processed: {len(df)}")
        print(f"Reviews with matches: {len(df[df['Matched_Categories'] != ''])}")
        print(f"Total category mentions: {len(all_categories)}")
        print(f"Unique categories found: {len(category_counts)}")

        if category_counts:
            print(f"\nCATEGORY DISTRIBUTION (sorted by frequency):")
            print("-" * 80)
            print(f"{'Rank':<4} {'Category':<50} {'Count':<8} {'Avg Score':<10}")
            print("-" * 80)

            for rank, (category, count) in enumerate(category_counts.most_common(), 1):
                avg_score = sum(category_scores[category]) / len(category_scores[category])
                print(f"{rank:<4} {category:<50} {count:<8} {avg_score:<10.1f}")

            # Group by category type
            print(f"\n\nCATEGORIES BY TYPE:")
            print("-" * 50)

            # Rides & Attractions
            rides = [cat for cat in category_counts.keys() if any(keyword in cat.lower()
                    for keyword in ['ride', 'adventure', 'encounter', 'coaster', 'mayhem', 'galactica', 'transformers', 'mummy', 'shrek', 'jurassic'])]
            if rides:
                print(f"\n🎢 RIDES & ATTRACTIONS ({len(rides)} categories):")
                for cat in sorted(rides):
                    print(f"   {cat}: {category_counts[cat]} mentions")

            # Food & Dining
            food = [cat for cat in category_counts.keys() if any(keyword in cat.lower()
                   for keyword in ['drive', 'grill', 'pizza', 'café', 'court', 'stand', 'starbucks', 'food', 'cookie', 'fuel', 'bites'])]
            if food:
                print(f"\n🍕 FOOD & DINING ({len(food)} categories):")
                for cat in sorted(food):
                    print(f"   {cat}: {category_counts[cat]} mentions")

            # Shows & Entertainment
            shows = [cat for cat in category_counts.keys() if any(keyword in cat.lower()
                    for keyword in ['show', 'theater', 'spectacular', 'live', 'dance', 'voices', 'waterworld'])]
            if shows:
                print(f"\n🎭 SHOWS & ENTERTAINMENT ({len(shows)} categories):")
                for cat in sorted(shows):
                    print(f"   {cat}: {category_counts[cat]} mentions")

            # Services & Amenities
            services = [cat for cat in category_counts.keys() if any(keyword in cat.lower()
                       for keyword in ['restroom', 'locker', 'wifi', 'parking', 'service', 'first aid', 'baby care', 'atm', 'retail'])]
            if services:
                print(f"\n🏢 SERVICES & AMENITIES ({len(services)} categories):")
                for cat in sorted(services):
                    print(f"   {cat}: {category_counts[cat]} mentions")

            # Character Meet & Greets
            characters = [cat for cat in category_counts.keys() if any(keyword in cat.lower()
                         for keyword in ['gru', 'lucy', 'po', 'tigress', 'donkey', 'portrait', 'guards'])]
            if characters:
                print(f"\n👥 CHARACTER EXPERIENCES ({len(characters)} categories):")
                for cat in sorted(characters):
                    print(f"   {cat}: {category_counts[cat]} mentions")

        print("\n" + "="*80)

        return category_counts, category_scores

    def get_category_summary(self, input_file, review_column='review'):
        """Quick method to analyze category distribution from file"""
        try:
            # Read file
            if input_file.lower().endswith('.csv'):
                df = pd.read_csv(input_file)
            else:
                df = pd.read_excel(input_file)

            # Check if already processed
            if 'Matched_Categories' not in df.columns:
                print("File hasn't been processed yet. Processing now...")
                df = self.process_reviews_file(input_file, 'temp_analysis.csv', review_column)

            # Analyze distribution
            return self.analyze_category_distribution(df)

        except Exception as e:
            print(f"Error analyzing file: {str(e)}")
            return None, None

    def fuzzy_phrase_match(self, text, category, min_threshold=85):
        """High-quality fuzzy matching on meaningful phrases only"""
        clean_text = self.clean_text(text)
        matches = []

        # Only do fuzzy matching on longer, more specific aliases
        for alias in self.category_aliases[category]:
            if len(alias) >= 4:  # Only fuzzy match meaningful phrases

                # Partial ratio for substring matching
                partial_score = fuzz.partial_ratio(alias, clean_text)

                # Token sort ratio for word order variations
                token_score = fuzz.token_sort_ratio(alias, clean_text)

                # Use the better score
                best_score = max(partial_score, token_score)

                if best_score >= min_threshold:
                    matches.append((category, best_score, f"fuzzy: '{alias}' ({best_score})"))

        return matches

    def classify_review(self, review_text, fuzzy_threshold=85, max_results=3):
        """Classify review with high precision"""
        if pd.isna(review_text) or review_text == '':
            return []

        all_matches = []

        # Check each category
        for category in self.categories:
            # First try exact matches
            exact_matches = self.exact_phrase_match(review_text, category)
            if exact_matches:  # Only extend if not empty
                all_matches.extend(exact_matches)

            # If no exact match, try fuzzy matching
            if not exact_matches:
                fuzzy_matches = self.fuzzy_phrase_match(review_text, category, fuzzy_threshold)
                if fuzzy_matches:  # Only extend if not empty
                    all_matches.extend(fuzzy_matches)

        # Remove duplicates by category (keep highest score)
        category_best = {}
        for category, score, method in all_matches:
            if category not in category_best or score > category_best[category][1]:
                category_best[category] = (category, score, method)

        # Convert back to list and sort by score
        final_matches = list(category_best.values())
        final_matches.sort(key=lambda x: x[1], reverse=True)

        # Return top matches
        return final_matches[:max_results]

    def process_reviews_file(self, input_file, output_file, review_column='review',
                            fuzzy_threshold=85, max_categories=3, min_confidence=80):
        """Process reviews with high precision settings"""
        try:
            # Read file
            if input_file.lower().endswith('.csv'):
                df = pd.read_csv(input_file)
            elif input_file.lower().endswith(('.xlsx', '.xls')):
                df = pd.read_excel(input_file)
            else:
                raise ValueError("Unsupported file format. Please use .csv, .xlsx, or .xls files.")

            print(f"Loaded {len(df)} reviews from {input_file}")

            # Check if review column exists
            if review_column not in df.columns:
                print(f"Column '{review_column}' not found. Available columns: {list(df.columns)}")
                return None

            # Initialize result columns
            df['Matched_Categories'] = ''
            df['Category_Scores'] = ''
            df['Top_Category'] = ''
            df['Confidence_Score'] = 0.0
            df['Match_Details'] = ''

            # Process reviews
            for idx, row in df.iterrows():
                review_text = row[review_column]

                # Classify with strict settings
                matches = self.classify_review(
                    review_text,
                    fuzzy_threshold=fuzzy_threshold,
                    max_results=max_categories
                )

                # Filter by minimum confidence
                high_confidence_matches = [(cat, score, method) for cat, score, method in matches
                                         if score >= min_confidence]

                if high_confidence_matches:
                    categories = [match[0] for match in high_confidence_matches]
                    scores = [f"{match[1]:.1f}" for match in high_confidence_matches]
                    methods = [match[2] for match in high_confidence_matches]

                    df.at[idx, 'Matched_Categories'] = ' | '.join(categories)
                    df.at[idx, 'Category_Scores'] = ' | '.join(scores)
                    df.at[idx, 'Top_Category'] = high_confidence_matches[0][0]
                    df.at[idx, 'Confidence_Score'] = high_confidence_matches[0][1]
                    df.at[idx, 'Match_Details'] = ' | '.join(methods)

                # Progress update
                if (idx + 1) % 100 == 0:
                    print(f"Processed {idx + 1} reviews...")

            # Save results
            if output_file.lower().endswith('.csv'):
                df.to_csv(output_file, index=False)
            else:
                df.to_excel(output_file, index=False)

            print(f"Results saved to {output_file}")

            # Detailed statistics
            matched_reviews = len(df[df['Matched_Categories'] != ''])
            print(f"\nHigh-Precision Results:")
            print(f"Total reviews: {len(df)}")
            print(f"Reviews with high-confidence matches: {matched_reviews}")
            print(f"Precision match rate: {matched_reviews/len(df)*100:.1f}%")

            # Analyze category distribution
            self.analyze_category_distribution(df)

            return df

        except Exception as e:
            print(f"Error: {str(e)}")
            return None

    def analyze_single_review(self, review_text, show_details=True):
        """Analyze single review with detailed output"""
        matches = self.classify_review(review_text, fuzzy_threshold=80)

        print(f"Review: {review_text[:150]}...")
        print(f"High-precision matches found: {len(matches)}")

        if matches:
            for i, (category, score, method) in enumerate(matches, 1):
                print(f"  {i}. {category}")
                print(f"     Confidence: {score:.1f}%")
                if show_details:
                    print(f"     Method: {method}")
                print()
        else:
            print("  No high-confidence matches found")

        return matches


# Example usage
if __name__ == "__main__":
    print("Initializing High-Precision Classifier...")
    classifier = PreciseReviewClassifier()

    # Test samples
    test_reviews = [
        "The Transformers ride was incredible! Best 3D experience ever.",
        "Minion Mayhem had us laughing the whole time, kids loved it",
        "Revenge of the Mummy was scary but amazing",
        "Food at Mel's Drive-in was really good",
        "Restrooms were clean and easy to find",
        "WiFi connection was terrible throughout the park",
        "Battlestar Galactica Human coaster was intense!",
        "The park was nice but nothing specific mentioned here"
    ]

    print("\nTesting high-precision classification:\n")
    for i, review in enumerate(test_reviews, 1):
        print(f"Test {i}:")
        print("-" * 60)
        classifier.analyze_single_review(review)
        print()

    print("\nUsage for your data:")
    print("classifier.process_reviews_file(")
    print("    'your_file.csv', 'precise_results.csv',")
    print("    review_column='review',    # Your column name")
    print("    fuzzy_threshold=85,        # Higher = more strict")
    print("    min_confidence=80,         # Only show confident matches")
    print("    max_categories=3           # Limit results per review")
    print(")")
    print("\nTo analyze existing results:")
    print("classifier.get_category_summary('precise_results.csv')")

Initializing High-Precision Classifier...

Testing high-precision classification:

Test 1:
------------------------------------------------------------
Review: The Transformers ride was incredible! Best 3D experience ever....
High-precision matches found: 2
  1. transformers the ride the ultimate 3d battle
     Confidence: 100.0%
     Method: fuzzy: 'transformers' (100)

  2. transformers voices of cybertron
     Confidence: 100.0%
     Method: fuzzy: 'transformers' (100)


Test 2:
------------------------------------------------------------
Review: Minion Mayhem had us laughing the whole time, kids loved it...
High-precision matches found: 3
  1. despicable me minion mayhem
     Confidence: 100.0%
     Method: fuzzy: 'minion mayhem' (100)

  2. minute of minion mayhem
     Confidence: 100.0%
     Method: fuzzy: 'minion mayhem' (100)

  3. minion dance party
     Confidence: 100.0%
     Method: fuzzy: 'minion mayhem' (100)


Test 3:
-----------------------------------------------------

In [None]:
classifier = PreciseReviewClassifier()

classifier.process_reviews_file(
    'google_reviews_adaptive_moe_results.csv',
    'precise_classified_reviews.csv',
    fuzzy_threshold=80,    # Looser than the default of 85
    min_confidence=75,     # Looser than the default of 80
    max_categories=5       # Up to 5 categories per review
)

Loaded 29412 reviews from google_reviews_adaptive_moe_results.csv
Processed 100 reviews...
Processed 200 reviews...
Processed 300 reviews...
Processed 400 reviews...
Processed 500 reviews...
Processed 600 reviews...
Processed 700 reviews...
Processed 800 reviews...
Processed 900 reviews...
Processed 1000 reviews...
Processed 1100 reviews...
Processed 1200 reviews...
Processed 1300 reviews...
Processed 1400 reviews...
Processed 1500 reviews...
Processed 1600 reviews...
Processed 1700 reviews...
Processed 1800 reviews...
Processed 1900 reviews...
Processed 2000 reviews...
Processed 2100 reviews...
Processed 2200 reviews...
Processed 2300 reviews...
Processed 2400 reviews...
Processed 2500 reviews...
Processed 2600 reviews...
Processed 2700 reviews...
Processed 2800 reviews...
Processed 2900 reviews...
Processed 3000 reviews...
Processed 3100 reviews...
Processed 3200 reviews...
Processed 3300 reviews...
Processed 3400 reviews...
Processed 3500 reviews...
Processed 3600 reviews...
Process

Unnamed: 0,integrated_review,stars,name,review,publishedAtDate,review_index,true_label,ensemble_prediction,ensemble_confidence,prob_negative,...,prob_positive,primary_model,transformer_weight,bilstm_weight,logistic_weight,Matched_Categories,Category_Scores,Top_Category,Confidence_Score,Match_Details
0,Nice lot if activities need entire day to cove...,4,user_0,Nice lot if activities need entire day to cove...,2025-05-23,0,2,1,0.503277,0.076376,...,0.420347,logistic,0.25,0.45,0.30,,,,0.0,
1,Universal Studios Singapore offers an unforget...,5,user_1,Universal Studios Singapore offers an unforget...,2025-05-23,1,2,2,0.934817,0.007332,...,0.934817,transformer,0.60,0.25,0.15,battlestar galactica human | battlestar galact...,100.0 | 100.0 | 100.0 | 100.0,battlestar galactica human,100.0,fuzzy: 'battlestar' (100) | fuzzy: 'battlestar...
2,Mummy ride was great but cost way too much for...,2,user_2,Mummy ride was great but cost way too much for...,2025-05-23,2,0,0,0.809111,0.809111,...,0.015428,transformer,0.60,0.25,0.15,battlestar galactica human | battlestar galact...,100.0 | 100.0 | 100.0 | 100.0 | 100.0,battlestar galactica human,100.0,fuzzy: 'battlestar' (100) | fuzzy: 'battlestar...
3,We went there to enjoy the Minions. It's hot d...,4,user_3,We went there to enjoy the Minions. It's hot d...,2025-05-23,3,2,2,0.476674,0.095326,...,0.476674,transformer,0.40,0.35,0.25,despicable me minion mayhem | minute of minion...,100.0 | 100.0 | 100.0,despicable me minion mayhem,100.0,fuzzy: 'minions' (100) | fuzzy: 'minions' (100...
4,Universal Studio's famous Playland Singapore. ...,5,user_4,Universal Studio's famous Playland Singapore.,2025-05-23,4,2,2,0.532141,0.239987,...,0.532141,logistic,0.20,0.55,0.25,,,,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29407,Fun! Prices are expensive and queues are long!,4,user_28948,Fun! Prices are expensive and queues are long!,2018-07-29,29407,2,1,0.528107,0.205499,...,0.266394,logistic,0.20,0.55,0.25,,,,0.0,
29408,"I have also been to the LA version of it, henc...",4,user_28949,"I have also been to the LA version of it, henc...",2018-07-29,29408,2,2,0.665794,0.020162,...,0.665794,transformer,0.40,0.35,0.25,,,,0.0,
29409,"An hour of wait time for almost every ride, ev...",2,user_28950,"An hour of wait time for almost every ride, ev...",2018-07-29,29409,0,0,0.637221,0.637221,...,0.095369,bilstm,0.45,0.35,0.20,express,100.0,express,100.0,fuzzy: 'express' (100)
29410,Such a happy place to visit.,4,user_28951,Such a happy place to visit.,2018-07-29,29410,2,2,0.734643,0.043541,...,0.734643,logistic,0.20,0.55,0.25,,,,0.0,
