# Semantic Persona Classification of USS Reviews

## Using Sentence Transformers and Cosine Similarity

### Objective:
To develop and evaluate a semantic-based machine learning model that classifies Universal Studios Singapore (USS) user reviews into six distinct visitor personas — **Families**, **Thrill Seekers**, **International Tourists**, **Budget Conscious**, **Premium Visitors**, and **Experience Focused** — using sentence transformer embeddings and cosine similarity matching.

### Dataset:
Google Reviews collected for Universal Studios Singapore, including textual reviews, star ratings, and temporal metadata spanning multiple years.

### Methodology:
- **Data Loading**: Using pre-processed USS review dataset (cleaned in previous analysis)
- **Semantic Embeddings**: Using BAAI/bge-large-en-v1.5 sentence transformer model with GPU acceleration
- **Persona Definition**: Six detailed semantic descriptions with positive and negative indicators
- **Batch Processing**: Efficient embedding generation with batch size optimization (512 reviews/batch)
- **Classification**: Cosine similarity calculation between review embeddings and persona embeddings
- **Enhanced Analysis**: Primary and secondary persona identification with confidence gap calculation
- **Results Export**: Dual format output (CSV + Parquet) with full similarity matrices
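
The classification step listed above amounts to a nearest-neighbour lookup under cosine similarity. Below is a minimal NumPy sketch on made-up 3-dimensional vectors; the real embeddings are 1024-dimensional, and `cosine_sim`, `toy_personas`, and `toy_reviews` are illustrative names that do not appear in the notebook.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T  # shape (n, m)

# Hypothetical persona and review vectors, for illustration only
toy_personas = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
toy_reviews = np.array([[2.0, 1.0, 0.0],
                        [0.0, 3.0, 4.0]])

sims = cosine_sim(toy_reviews, toy_personas)  # one row per review, one column per persona
labels = sims.argmax(axis=1)                  # index of the most similar persona
```

Each review is assigned to the column with the largest similarity, and that largest value is what the notebook reports as the confidence score.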

### Evaluation Metrics:
- **Classification Distribution**: Persona percentage breakdown and counts
- **Confidence Statistics**: Overall mean, median, min, and max confidence scores, plus mean ± standard deviation per persona
- **Mixed Persona Analysis**: Detection of reviews whose confidence gap (top similarity minus second-highest similarity) is below 0.05
- **Uncertainty Detection**: Reviews with a confidence gap below 0.1, flagged for validation
- **Cross-Persona Patterns**: Most common primary-secondary persona combinations
- **Star Rating Correlation**: Validation against user satisfaction scores
- **High-Confidence Sampling**: Top examples per persona for quality assessment
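
As a rough illustration of the mixed-persona and uncertainty metrics above, the sketch below applies the two gap thresholds to a small hand-made table. The column names follow the notebook; the rows and the `is_mixed` / `is_uncertain` flags are invented for the example.

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration
toy = pd.DataFrame({
    "predicted_persona": ["families", "families", "thrill_seekers", "budget_conscious"],
    "secondary_persona": ["international_tourists", "international_tourists",
                          "families", "experience_focused"],
    "confidence": [0.60, 0.55, 0.52, 0.58],
    "secondary_confidence": [0.58, 0.52, 0.45, 0.40],
})

# Confidence gap: top similarity minus second-highest similarity
toy["confidence_gap"] = toy["confidence"] - toy["secondary_confidence"]
toy["is_mixed"] = toy["confidence_gap"] < 0.05      # mixed-persona threshold
toy["is_uncertain"] = toy["confidence_gap"] < 0.1   # uncertainty threshold

# Most common primary + secondary combinations among mixed reviews
combos = (toy[toy["is_mixed"]]
          .groupby(["predicted_persona", "secondary_persona"])
          .size()
          .sort_values(ascending=False))
```

Because 0.05 is below 0.1, every mixed review is also counted as uncertain, so the two metrics are nested rather than independent.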

In [7]:
# Business-Oriented Review Analysis for Universal Studios Singapore
# Focus: Monthly operational issue tracking + visitor persona analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("🎯 Business-Oriented USS Review Analysis")
print("=" * 50)

# Define project paths
project_root = Path('.').resolve().parent
data_path = project_root / 'data' / 'processed'

print(f"Loading data from: {data_path}")

# Load the cleaned dataset
df = pd.read_csv(data_path / 'USS_Reviews_Silver_cleaned_l2.csv')

# # Basic preprocessing (from previous analysis)
# df['review_length'] = df['review'].str.len()
# df['publishedAtDate'] = pd.to_datetime(df['publishedAtDate'])



print(f"Dataset prepared: {len(df):,} reviews")
print(f"Date range: {df['publishedAtDate'].min()} to {df['publishedAtDate'].max()}")
print(f"Star rating distribution:")
print(df['stars'].value_counts().sort_index())

🎯 Business-Oriented USS Review Analysis
Loading data from: C:\Users\nshan\Desktop\SMU\MITB\CS605\project\CS605-NLP-Project\data\processed
Dataset prepared: 24,021 reviews
Date range: 2018-07-29 to 2025-05-23
Star rating distribution:
stars
1     1311
2      796
3     1982
4     4851
5    15081
Name: count, dtype: int64


In [2]:
# Phase 2 continued: Pure semantic vector matching

# Import required libraries (sentence-transformers, scikit-learn, and torch must already be installed)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import torch

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

CUDA available: True
Using device: cuda


In [31]:
# Load sentence transformer model
model_name = 'BAAI/bge-large-en-v1.5'  # Large English BGE model (1024-dim embeddings)
print(f"\nLoading model: {model_name}")
sentence_model = SentenceTransformer(model_name, device=device)
print("Model loaded successfully!")

# Simplified persona definitions (pure descriptions for semantic matching)
persona_descriptions = {
    'families': """Parents with young children and teenagers. Key phrases: "with kids", "children", "family", 
    "parents", "suitable for kids", "child-friendly", "age-appropriate", "family bonding", "stroller", 
    "educational for children", "traveling with family", "good for kids and parents".
    
    NEGATIVE INDICATORS: Does NOT mention solo travel, adrenaline, scary rides, first-time Singapore visit, 
    price complaints, express passes, queue problems, staff issues, maintenance problems.""",
    

    'thrill_seekers': """Adrenaline junkies focused on intense rides. Key phrases: "adrenaline", "scary", 
    "intense", "thrilling", "roller coaster", "extreme", "heart-pounding", "not for faint hearted", 
    "test of courage", "fear factor", "exciting and tense", "adrenaline rush", "challenging rides".
    
    NEGATIVE INDICATORS: Does NOT mention family activities, children, Singapore tourism, cost concerns, 
    VIP services, operational complaints, staff problems, maintenance issues.""",

    'international_tourists': """Foreign visitors to Singapore discussing travel planning. Key phrases: 
    "first time in Singapore", "visiting Singapore", "must visit", "bucket list", "travel to Singapore", 
    "Singapore attractions", "tourist destination", "once in lifetime", "Singapore itinerary", "foreign tourists".
    
    NEGATIVE INDICATORS: Does NOT mention family activities, adrenaline rides, price complaints, 
    premium services, operational issues, staff problems, local resident experiences.""",
    
    'budget_conscious': """Visitors angry about high costs and poor value. Key phrases: "expensive", 
    "overpriced", "not worth the money", "waste of money", "too expensive", "jaw-breaking expensive", 
    "insane prices", "poor value", "financially disappointed", "ripped off", "budget constraints".
    
    NEGATIVE INDICATORS: Does NOT mention family fun, thrilling experiences, Singapore tourism, 
    premium services, positive service experiences, adrenaline activities.""",
    
    'premium_visitors': """Users who bought or recommend paid upgrades. Key phrases: "express pass", 
    "VIP", "fast track", "skip the line", "priority access", "bought express", "recommend express", 
    "worth buying express", "VIP experience", "premium service", "exclusive access", "personal guide".
    
    NEGATIVE INDICATORS: Does NOT mention cost complaints, family-specific needs, adrenaline seeking, 
    Singapore travel planning, operational complaints, basic ticket experiences.""",
    
    'experience_focused': """Visitors complaining about operations and service quality. Key phrases: 
    "long queues", "waiting time", "staff attitude", "poor service", "maintenance issues", "attraction closed", 
    "crowd management", "facility problems", "operational issues", "park management", "service quality".
    
    NEGATIVE INDICATORS: Does NOT mention price concerns, family activities, thrill experiences, 
    Singapore tourism, premium services, positive experiences."""
}

print(f"\nGenerating embeddings for {len(persona_descriptions)} personas...")

# Generate embeddings for persona descriptions
persona_embeddings = {}
for persona, description in persona_descriptions.items():
    embedding = sentence_model.encode([description])
    persona_embeddings[persona] = embedding[0]
    print(f"✓ {persona}: embedding shape {embedding.shape}")

print("\nPersona embeddings ready for classification!")


Loading model: BAAI/bge-large-en-v1.5
Model loaded successfully!

Generating embeddings for 6 personas...
✓ families: embedding shape (1, 1024)
✓ thrill_seekers: embedding shape (1, 1024)
✓ international_tourists: embedding shape (1, 1024)
✓ budget_conscious: embedding shape (1, 1024)
✓ premium_visitors: embedding shape (1, 1024)
✓ experience_focused: embedding shape (1, 1024)

Persona embeddings ready for classification!


In [32]:
# Phase 3: Batch review classification

import time
from tqdm import tqdm

print("Starting batch review classification...")
print("=" * 50)

# Prepare review data
reviews_list = df['review'].tolist()
total_reviews = len(reviews_list)

print(f"Total reviews to classify: {total_reviews:,}")
print(f"Batch processing on {device.upper()}")

# Set batch size for efficient GPU processing
batch_size = 512  # Adjust based on GPU memory
print(f"Batch size: {batch_size}")

# Generate embeddings for all reviews in batches
print("\nGenerating review embeddings...")
start_time = time.time()

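all_review_embeddings = []
for i in tqdm(range(0, total_reviews, batch_size), desc="Processing batches"):
    batch = reviews_list[i:i + batch_size]
    # Note: encode() splits each chunk again using its own batch_size (default 32);
    # pass batch_size=... to encode() to change the size of the actual forward-pass batches.
    batch_embeddings = sentence_model.encode(batch, device=device, show_progress_bar=False)
    all_review_embeddings.extend(batch_embeddings)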

review_embeddings = np.array(all_review_embeddings)
end_time = time.time()

print(f"\n✓ Generated embeddings for {total_reviews:,} reviews")
print(f"Processing time: {end_time - start_time:.2f} seconds")
print(f"Embedding matrix shape: {review_embeddings.shape}")
print(f"Speed: {total_reviews / (end_time - start_time):.0f} reviews/second")

Starting batch review classification...
Total reviews to classify: 24,021
Batch processing on CUDA
Batch size: 512

Generating review embeddings...


Processing batches: 100%|██████████| 47/47 [04:15<00:00,  5.43s/it]


✓ Generated embeddings for 24,021 reviews
Processing time: 255.28 seconds
Embedding matrix shape: (24021, 1024)
Speed: 94 reviews/second
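
The figures in the output above can be checked with a few lines of arithmetic. The sketch below reproduces the batch count shown by the progress bar and the reported throughput; `iter_batches` is a hypothetical helper that mirrors the slicing in the loop and is not part of the notebook.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

total_reviews = 24021
batch_size = 512

n_batches = math.ceil(total_reviews / batch_size)              # number of loop iterations
last_batch = total_reviews - (n_batches - 1) * batch_size      # size of the final, partial batch
throughput = total_reviews / 255.28                            # reviews per second, from the logged time

batch_sizes = [len(b) for b in iter_batches(list(range(total_reviews)), batch_size)]
```

With 24,021 reviews and chunks of 512, the loop runs 47 times and the final chunk holds 469 reviews, which matches the `47/47` shown by tqdm.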





In [33]:
# Phase 4: Cosine similarity calculation and classification

print("Computing persona similarities and classifications...")
print("=" * 50)

# Convert persona embeddings to numpy array for efficient computation
persona_names = list(persona_embeddings.keys())
persona_matrix = np.array([persona_embeddings[name] for name in persona_names])

print(f"Persona matrix shape: {persona_matrix.shape}")
print(f"Review matrix shape: {review_embeddings.shape}")

# Calculate cosine similarities between all reviews and all personas
print("\nCalculating cosine similarities...")
start_time = time.time()

# Use sklearn's efficient cosine similarity
similarity_matrix = cosine_similarity(review_embeddings, persona_matrix)

end_time = time.time()
print(f"Similarity calculation completed in {end_time - start_time:.2f} seconds")
print(f"Similarity matrix shape: {similarity_matrix.shape}")

# Classify each review to the most similar persona
print("\nClassifying reviews...")

# Get the persona with highest similarity for each review
best_persona_indices = np.argmax(similarity_matrix, axis=1)
best_similarities = np.max(similarity_matrix, axis=1)

# Create classification results
classification_results = []
for i, (persona_idx, similarity) in enumerate(zip(best_persona_indices, best_similarities)):
    classification_results.append({
        'review_index': i,
        'best_persona': persona_names[persona_idx],
        'confidence': similarity,
        'all_similarities': dict(zip(persona_names, similarity_matrix[i]))
    })

print(f"✓ Classified {len(classification_results):,} reviews")

# Quick statistics
print(f"\nClassification confidence statistics:")
print(f"Mean confidence: {np.mean(best_similarities):.3f}")
print(f"Median confidence: {np.median(best_similarities):.3f}")
print(f"Min confidence: {np.min(best_similarities):.3f}")
print(f"Max confidence: {np.max(best_similarities):.3f}")

Computing persona similarities and classifications...
Persona matrix shape: (6, 1024)
Review matrix shape: (24021, 1024)

Calculating cosine similarities...
Similarity calculation completed in 0.10 seconds
Similarity matrix shape: (24021, 6)

Classifying reviews...
✓ Classified 24,021 reviews

Classification confidence statistics:
Mean confidence: 0.543
Median confidence: 0.550
Min confidence: 0.302
Max confidence: 0.762


In [34]:
# Phase 5: Classification results analysis and sample inspection

print("Classification Distribution Analysis")
print("=" * 50)

# Add classification results to dataframe
df_classified = df.copy()
df_classified['predicted_persona'] = [result['best_persona'] for result in classification_results]
df_classified['confidence'] = [result['confidence'] for result in classification_results]

# 1. Overall distribution
print("1. PERSONA DISTRIBUTION:")
persona_counts = df_classified['predicted_persona'].value_counts()
persona_percentages = (persona_counts / len(df_classified) * 100).round(1)

for persona in persona_names:
    count = persona_counts.get(persona, 0)
    pct = persona_percentages.get(persona, 0)
    print(f"   {persona:20}: {count:5,} reviews ({pct:5.1f}%)")

# 2. Confidence distribution by persona
print(f"\n2. CONFIDENCE BY PERSONA:")
for persona in persona_names:
    persona_data = df_classified[df_classified['predicted_persona'] == persona]
    if len(persona_data) > 0:
        mean_conf = persona_data['confidence'].mean()
        std_conf = persona_data['confidence'].std()
        print(f"   {persona:20}: {mean_conf:.3f} ± {std_conf:.3f}")

# 3. Star rating correlation
print(f"\n3. PERSONA vs STAR RATING:")
persona_star_crosstab = pd.crosstab(df_classified['predicted_persona'], df_classified['stars'], normalize='index') * 100
print(persona_star_crosstab.round(1))

# 4. High confidence samples (top 8 per persona)
print(f"\n4. HIGH CONFIDENCE SAMPLES:")
print("=" * 60)

for persona in persona_names:
    persona_data = df_classified[df_classified['predicted_persona'] == persona]
    if len(persona_data) > 0:
        top_samples = persona_data.nlargest(8, 'confidence')
        print(f"\n{persona.upper()} (Top 8 high confidence):")
        for idx, row in top_samples.iterrows():
            print(f"   Confidence: {row['confidence']:.3f} | Stars: {row['stars']}")
            print(f"   Review: \"{row['review'][:350]}...\"")
            print()

Classification Distribution Analysis
1. PERSONA DISTRIBUTION:
   families            : 5,076 reviews ( 21.1%)
   thrill_seekers      : 3,141 reviews ( 13.1%)
   international_tourists: 3,649 reviews ( 15.2%)
   budget_conscious    : 1,323 reviews (  5.5%)
   premium_visitors    : 2,914 reviews ( 12.1%)
   experience_focused  : 7,918 reviews ( 33.0%)

2. CONFIDENCE BY PERSONA:
   families            : 0.519 ± 0.073
   thrill_seekers      : 0.504 ± 0.064
   international_tourists: 0.541 ± 0.070
   budget_conscious    : 0.527 ± 0.074
   premium_visitors    : 0.549 ± 0.076
   experience_focused  : 0.575 ± 0.069

3. PERSONA vs STAR RATING:
stars                      1    2     3     4     5
predicted_persona                                  
budget_conscious        10.4  7.6  16.3  20.9  44.7
experience_focused      13.5  7.9  16.2  26.2  36.2
families                 0.4  0.5   3.9  19.7  75.6
international_tourists   0.4  0.3   1.8  12.9  84.6
premium_visitors         2.1  0.8   5.2  19.3

In [35]:
# Phase 4 (enhanced): Cosine similarity classification with secondary persona and confidence gap

print("Computing persona similarities and classifications...")
print("=" * 50)

# Convert persona embeddings to numpy array for efficient computation
persona_names = list(persona_embeddings.keys())
persona_matrix = np.array([persona_embeddings[name] for name in persona_names])

print(f"Persona matrix shape: {persona_matrix.shape}")
print(f"Review matrix shape: {review_embeddings.shape}")

# Calculate cosine similarities between all reviews and all personas
print("\nCalculating cosine similarities...")
start_time = time.time()

# Use sklearn's efficient cosine similarity
similarity_matrix = cosine_similarity(review_embeddings, persona_matrix)

end_time = time.time()
print(f"Similarity calculation completed in {end_time - start_time:.2f} seconds")
print(f"Similarity matrix shape: {similarity_matrix.shape}")

# Classify each review to the most similar persona
print("\nClassifying reviews...")

# Get the persona with highest similarity for each review
best_persona_indices = np.argmax(similarity_matrix, axis=1)
best_similarities = np.max(similarity_matrix, axis=1)

# Create detailed classification results with ALL similarities
classification_results = []
for i, (persona_idx, similarity) in enumerate(zip(best_persona_indices, best_similarities)):
    all_similarities = dict(zip(persona_names, similarity_matrix[i]))
    
    # Find secondary persona (second-highest similarity)
    secondary_idx = np.argsort(similarity_matrix[i])[-2]
    secondary_persona = persona_names[secondary_idx]
    secondary_confidence = similarity_matrix[i][secondary_idx]

    # Confidence gap: difference between the top two similarity scores
    confidence_gap = similarity - secondary_confidence
    
    classification_results.append({
        'review_index': i,
        'best_persona': persona_names[persona_idx],
        'confidence': similarity,
        'secondary_persona': secondary_persona,
        'secondary_confidence': secondary_confidence,
        'confidence_gap': confidence_gap,
        'all_similarities': all_similarities
    })

print(f"Classified {len(classification_results):,} reviews")

# Quick statistics
print(f"\nClassification confidence statistics:")
print(f"Mean confidence: {np.mean(best_similarities):.3f}")
print(f"Median confidence: {np.median(best_similarities):.3f}")
print(f"Min confidence: {np.min(best_similarities):.3f}")
print(f"Max confidence: {np.max(best_similarities):.3f}")

# Confidence gap statistics
confidence_gaps = [result['confidence_gap'] for result in classification_results]
print(f"\nConfidence gap statistics:")
print(f"Mean gap: {np.mean(confidence_gaps):.3f}")
print(f"Median gap: {np.median(confidence_gaps):.3f}")
print(f"Min gap: {np.min(confidence_gaps):.3f}")
print(f"Max gap: {np.max(confidence_gaps):.3f}")

Computing persona similarities and classifications...
Persona matrix shape: (6, 1024)
Review matrix shape: (24021, 1024)

Calculating cosine similarities...
Similarity calculation completed in 0.12 seconds
Similarity matrix shape: (24021, 6)

Classifying reviews...
Classified 24,021 reviews

Classification confidence statistics:
Mean confidence: 0.543
Median confidence: 0.550
Min confidence: 0.302
Max confidence: 0.762

Confidence gap statistics:
Mean gap: 0.034
Median gap: 0.025
Min gap: 0.000
Max gap: 0.215
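
The per-review loop in the cell above can also be expressed as a few array operations. The following is a sketch of one possible vectorized version, not the notebook's own code; `top_two` and the toy matrix are illustrative.

```python
import numpy as np

def top_two(similarity_matrix: np.ndarray):
    """Return primary index, secondary index, and confidence gap for every row."""
    order = np.argsort(similarity_matrix, axis=1)   # ascending order within each row
    primary_idx = order[:, -1]                      # highest similarity
    secondary_idx = order[:, -2]                    # second-highest similarity
    rows = np.arange(similarity_matrix.shape[0])
    primary = similarity_matrix[rows, primary_idx]
    secondary = similarity_matrix[rows, secondary_idx]
    return primary_idx, secondary_idx, primary - secondary

# Hypothetical similarity scores for two reviews and three personas
toy = np.array([[0.50, 0.48, 0.30],
                [0.20, 0.61, 0.40]])
primary_idx, secondary_idx, gap = top_two(toy)
```

Reading both indices from the same `argsort` result also guarantees that the primary and secondary personas differ, even when the top two scores are exactly tied.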


In [None]:
# Phase 5: Enhanced classification results analysis

print("\nClassification Distribution Analysis")
print("=" * 50)

# Add classification results to dataframe
df_classified = df.copy()
df_classified['predicted_persona'] = [result['best_persona'] for result in classification_results]
df_classified['confidence'] = [result['confidence'] for result in classification_results]
df_classified['secondary_persona'] = [result['secondary_persona'] for result in classification_results]
df_classified['secondary_confidence'] = [result['secondary_confidence'] for result in classification_results]
df_classified['confidence_gap'] = [result['confidence_gap'] for result in classification_results]

# Add all similarity scores as separate columns
for persona in persona_names:
    df_classified[f'{persona}_similarity'] = [result['all_similarities'][persona] for result in classification_results]

# 1. Overall distribution
print("1. PERSONA DISTRIBUTION:")
persona_counts = df_classified['predicted_persona'].value_counts()
persona_percentages = (persona_counts / len(df_classified) * 100).round(1)

for persona in persona_names:
    count = persona_counts.get(persona, 0)
    pct = persona_percentages.get(persona, 0)
    print(f"   {persona:20}: {count:5,} reviews ({pct:5.1f}%)")

# 2. Confidence distribution by persona
print(f"\n2. CONFIDENCE BY PERSONA:")
for persona in persona_names:
    persona_data = df_classified[df_classified['predicted_persona'] == persona]
    if len(persona_data) > 0:
        mean_conf = persona_data['confidence'].mean()
        std_conf = persona_data['confidence'].std()
        mean_gap = persona_data['confidence_gap'].mean()
        print(f"   {persona:20}: {mean_conf:.3f} ± {std_conf:.3f} (gap: {mean_gap:.3f})")

# 3. Star rating correlation
print(f"\n3. PERSONA vs STAR RATING:")
persona_star_crosstab = pd.crosstab(df_classified['predicted_persona'], df_classified['stars'], normalize='index') * 100
print(persona_star_crosstab.round(1))

# 4. NEW: Mixed personas analysis (small confidence gaps)
print(f"\n4. MIXED PERSONA PATTERNS:")
print("Top persona combinations with small confidence gaps (<0.05):")
mixed_reviews = df_classified[df_classified['confidence_gap'] < 0.05]
if len(mixed_reviews) > 0:
    mixed_combinations = mixed_reviews.groupby(['predicted_persona', 'secondary_persona']).size().sort_values(ascending=False)
    for (primary, secondary), count in mixed_combinations.head(10).items():
        pct = count / len(df_classified) * 100
        print(f"   {primary} + {secondary}: {count} reviews ({pct:.1f}%)")
else:
    print("   No reviews with confidence gap < 0.05")

# 5. NEW: Uncertain classifications
print(f"\n5. UNCERTAIN CLASSIFICATIONS:")
uncertain_threshold = 0.1
uncertain_reviews = df_classified[df_classified['confidence_gap'] < uncertain_threshold]
print(f"Reviews with confidence gap < {uncertain_threshold}: {len(uncertain_reviews)} ({len(uncertain_reviews)/len(df_classified)*100:.1f}%)")

if len(uncertain_reviews) > 0:
    print("\nTop uncertain classification examples:")
    uncertain_samples = uncertain_reviews.nsmallest(5, 'confidence_gap')
    for idx, row in uncertain_samples.iterrows():
        print(f"   Gap: {row['confidence_gap']:.3f} | {row['predicted_persona']} ({row['confidence']:.3f}) vs {row['secondary_persona']} ({row['secondary_confidence']:.3f})")
        print(f"   Review: \"{row['review'][:100]}...\"")
        print()

# 6. High confidence samples (top 8 per persona)
print(f"\n6. HIGH CONFIDENCE SAMPLES:")
print("=" * 60)

for persona in persona_names:
    persona_data = df_classified[df_classified['predicted_persona'] == persona]
    if len(persona_data) > 0:
        top_samples = persona_data.nlargest(8, 'confidence')
        print(f"\n{persona.upper()} (Top 8 high confidence):")
        for idx, row in top_samples.iterrows():
            print(f"   Confidence: {row['confidence']:.3f} | Gap: {row['confidence_gap']:.3f} | Stars: {row['stars']}")
            print(f"   Secondary: {row['secondary_persona']} ({row['secondary_confidence']:.3f})")
            print(f"   Review: \"{row['review'][:200]}...\"")
            print()


print(f"\n7. SAVE ENHANCED RESULTS:")

processed_path = project_root / 'data' / 'processed'

processed_path.mkdir(parents=True, exist_ok=True)

csv_file = 'USS_Reviews_Silver_cleaned_l3.csv'
parquet_file = 'USS_Reviews_Silver_cleaned_l3.parquet'

csv_path = processed_path / csv_file
parquet_path = processed_path / parquet_file

# save as csv
print(f"Saving CSV to: {csv_path}")
df_classified.to_csv(csv_path, index=False)
print(f"✓ Enhanced classification results saved to '{csv_path}'")

# save as parquet
print(f"Saving Parquet to: {parquet_path}")
df_classified.to_parquet(parquet_path, index=False)
print(f"✓ Enhanced classification results saved to '{parquet_path}'")

print(f"✓ Includes all {len(persona_names)} persona similarity scores for each review")
print(f"✓ Total columns: {len(df_classified.columns)}")
print(f"✓ Total rows: {len(df_classified):,}")

# compare size
import os
csv_size = os.path.getsize(csv_path) / (1024*1024)  # MB
parquet_size = os.path.getsize(parquet_path) / (1024*1024)  # MB

print(f"\nFile size comparison:")
print(f"  CSV: {csv_size:.1f} MB")
print(f"  Parquet: {parquet_size:.1f} MB")
print(f"  Space savings: {((csv_size - parquet_size) / csv_size * 100):.1f}%")

print(f"\nNext time you can load either format:")
print(f"  df = pd.read_csv('{csv_path}')")
print(f"  df = pd.read_parquet('{parquet_path}')  # Faster loading")


Classification Distribution Analysis
1. PERSONA DISTRIBUTION:
   families            : 5,076 reviews ( 21.1%)
   thrill_seekers      : 3,141 reviews ( 13.1%)
   international_tourists: 3,649 reviews ( 15.2%)
   budget_conscious    : 1,323 reviews (  5.5%)
   premium_visitors    : 2,914 reviews ( 12.1%)
   experience_focused  : 7,918 reviews ( 33.0%)

2. CONFIDENCE BY PERSONA:
   families            : 0.519 ± 0.073 (gap: 0.039)
   thrill_seekers      : 0.504 ± 0.064 (gap: 0.028)
   international_tourists: 0.541 ± 0.070 (gap: 0.025)
   budget_conscious    : 0.527 ± 0.074 (gap: 0.025)
   premium_visitors    : 0.549 ± 0.076 (gap: 0.032)
   experience_focused  : 0.575 ± 0.069 (gap: 0.039)

3. PERSONA vs STAR RATING:
stars                      1    2     3     4     5
predicted_persona                                  
budget_conscious        10.4  7.6  16.3  20.9  44.7
experience_focused      13.5  7.9  16.2  26.2  36.2
families                 0.4  0.5   3.9  19.7  75.6
international_tour