# Tourism Sentiment Analysis - TripAdvisor NYC Data Extraction

**Project:** Tourism Sentiment Analysis

**Task:** Data Extraction & Processing

**Dataset Source:** TripAdvisor (SciDB)

**Focus:** NYC, 2022-2025, Hotels

**Source URL:** https://www.scidb.cn/en/file?fid=df2d477ee4830d106a58c14053a57b07

## 1. Setup & Configuration
*Import libraries, set up project paths, create directory structure*

In [4]:
import requests
from pathlib import Path
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from openpyxl import load_workbook

# Set up project paths with bronze subfolder structure
project_root = Path("../../../../").resolve()
bronze_base = project_root / "data" / "bronze" / "tripadvisor"
print(f"Project root: {project_root}")
print(f"Bronze base: {bronze_base}")

Project root: /Users/db/code/tourism_data_project
Bronze base: /Users/db/code/tourism_data_project/data/bronze/tripadvisor


## 2. Data Acquisition
*Download the raw Excel file from ScienceDB via its direct download endpoint*

<details>
<summary><strong>Manual Download Instructions</strong> (click to expand)</summary>

If automated download fails:
1. Visit: https://www.scidb.cn/en/file?fid=df2d477ee4830d106a58c14053a57b07
2. Download file manually
3. Rename to: `tripadvisor_nyc_2022_2025_original.xlsx`
4. Place in: `data/bronze/tripadvisor/00_original_download/`

</details>

In [None]:
# Set up download directory
original_dir = bronze_base / "00_original_download"
original_dir.mkdir(parents=True, exist_ok=True)

# Direct download URL (SciDB.cn pattern)
file_id = "df2d477ee4830d106a58c14053a57b07"
url = f"https://china.scidb.cn/download?fileId={file_id}"
file_name = "tripadvisor_nyc_2022_2025_original.xlsx"
file_path = original_dir / file_name

# Download the file
if not file_path.exists():
    print(f"Downloading from: {url}...")
    response = requests.get(url, stream=True, timeout=60)  # avoid hanging forever on a stalled connection
    response.raise_for_status()

    with open(file_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Download complete: {file_path}")
else:
    print(f"File already exists: {file_path}")

# Check file size
file_size = file_path.stat().st_size / (1024 * 1024)
print(f"File size: {file_size:.1f} MB")

Downloading from: https://china.scidb.cn/download?fileId=df2d477ee4830d106a58c14053a57b07...
Download complete: /Users/db/code/tourism_data_project/data/bronze/tripadvisor/00_original_download/tripadvisor_nyc_2022_2025_original.xlsx
File size: 156.9 MB
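If the server answers with a successful response that is not the workbook (for example an HTML error or login page), the cell above could still save it under the `.xlsx` name, and a re-run would then skip the download because the file exists. A minimal sketch of a sanity check, assuming only that a valid `.xlsx` file is a ZIP archive containing a `[Content_Types].xml` part; the helper name `looks_like_xlsx` is hypothetical and not part of the original notebook:

```python
import zipfile
from pathlib import Path


def looks_like_xlsx(path: Path) -> bool:
    """Return True if `path` is a ZIP archive that contains the OOXML content-types part."""
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

Calling `looks_like_xlsx(file_path)` right after the download, and deleting the file when it returns `False`, would let a re-run retry the download instead of reusing a bad file.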


## 3. Bronze Layer: Raw Data Processing
*Convert Excel to chunked parquet files preserving original structure*

**Input:** `00_original_download/tripadvisor_nyc_2022_2025_original.xlsx` (156.9 MB)

**Output:** `01_raw_conversion/tripadvisor_nyc_raw_chunk_*.parquet` (chunked files)

**Processing:** 5,000-row chunks for memory efficiency

**Purpose:** Preserve complete dataset structure while converting to analysis-friendly format

In [7]:
# Set up conversion output directory
conversion_dir = bronze_base / "01_raw_conversion"
conversion_dir.mkdir(parents=True, exist_ok=True)

print("Loading Excel file...")
wb = load_workbook(file_path, read_only=True)
ws = wb.active
header = [str(cell.value) if cell.value is not None else f"col_{i}" for i, cell in enumerate(next(ws.iter_rows(min_row=1, max_row=1)))]
print(f"Columns found: {len(header)}")

# Convert to parquet chunks
chunk_size = 5000
rows = []
part = 0

print("Converting to parquet chunks...")
for row in ws.iter_rows(min_row=2, values_only=True):
    row = list(row[:len(header)])  # truncate any extra columns
    while len(row) < len(header):  # fill missing columns with None
        row.append(None)
    rows.append(row)

    if len(rows) >= chunk_size:
        df = pd.DataFrame(rows, columns=header)
        chunk_filename = f"tripadvisor_nyc_raw_chunk_{part:05d}.parquet"
        pq.write_table(pa.Table.from_pandas(df), conversion_dir / chunk_filename, compression="snappy")
        rows = []
        part += 1

        # Progress indicator every 10 files
        if part % 10 == 0:
            print(f"Processed {part} chunks...")

# Write remaining rows
if rows:
    df = pd.DataFrame(rows, columns=header)
    chunk_filename = f"tripadvisor_nyc_raw_chunk_{part:05d}.parquet"
    pq.write_table(pa.Table.from_pandas(df), conversion_dir / chunk_filename, compression="snappy")
    part += 1  # count the final partial chunk only if it was written

wb.close()  # read-only workbooks keep the file handle open until closed
print(f"Conversion complete. Total chunks: {part}")
print(f"Output location: {conversion_dir}")

Loading Excel file...
Columns found: 15
Converting to parquet chunks...
Processed 10 chunks...
Processed 20 chunks...
Processed 30 chunks...
Processed 40 chunks...
Processed 50 chunks...
Processed 60 chunks...
Processed 70 chunks...
Processed 80 chunks...
Conversion complete. Total chunks: 84
Output location: /Users/db/code/tourism_data_project/data/bronze/tripadvisor/01_raw_conversion
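The padding and batching logic in the loop above can be separated from the Excel and Parquet I/O so that it is easy to check on small inputs. A sketch using only the standard library; `normalize_row` and `iter_chunks` are hypothetical helper names, not functions from the notebook:

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def normalize_row(row: Sequence, width: int) -> list:
    """Truncate extra cells and pad missing ones with None so every row has `width` cells."""
    cells = list(row[:width])
    cells.extend([None] * (width - len(cells)))
    return cells


def iter_chunks(rows: Iterable[Sequence], width: int, chunk_size: int) -> Iterator[List[list]]:
    """Yield lists of at most `chunk_size` normalized rows; the last list may be shorter."""
    iterator = iter(rows)
    while True:
        chunk = [normalize_row(r, width) for r in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


# 12 rows of uneven length, 3 columns, 5 rows per chunk
demo_rows = [(i,) * (i % 4 + 1) for i in range(12)]
chunk_lengths = [len(c) for c in iter_chunks(demo_rows, width=3, chunk_size=5)]
print(chunk_lengths)  # [5, 5, 2]
```

With this shape, the number of chunks is simply the number of lists yielded, so no adjustment is needed when the row count is an exact multiple of the chunk size.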


## 4. Data Verification & Column Inspection
*Load converted data to verify structure and examine columns before filtering*

**Purpose:** Confirm parquet conversion preserved data integrity  
**Check:** Column names, data types, row counts  
**Next:** Identify date column format for primary filtering

In [8]:
# Load a sample chunk to verify conversion
sample_file = conversion_dir / "tripadvisor_nyc_raw_chunk_00000.parquet"
df_sample = pd.read_parquet(sample_file)

print(f"Sample chunk shape: {df_sample.shape}")
print(f"Columns ({len(df_sample.columns)}):")
for i, col in enumerate(df_sample.columns):
    print(f"  {i+1:2d}. {col}")

print(f"\n Data types:")
print(df_sample.dtypes)

print(f"\n First few rows:")
print(df_sample.head(3))

# Check date column specifically for filtering strategy
if 'date' in df_sample.columns:
    print(f"\n Date column sample:")
    print(df_sample['date'].head(10).tolist())

Sample chunk shape: (5000, 15)
Columns (15):
   1. col_0
   2. Unnamed: 0
   3. hotel_name
   4. id_review
   5. title
   6. date
   7. location
   8. user_name
   9. user_link
  10. date_of_stay
  11. rating
  12. review
  13. rating_review
  14. n_review_user
  15. n_votes_review

 Data types:
col_0              int64
Unnamed: 0         int64
hotel_name        object
id_review          int64
title             object
date              object
location          object
user_name         object
user_link         object
date_of_stay      object
rating             int64
review            object
rating_review      int64
n_review_user      int64
n_votes_review     int64
dtype: object

 First few rows:
   col_0  Unnamed: 0                        hotel_name  id_review  \
0      0           0  Premier Inn London Holborn hotel  877377326   
1      1           1  Premier Inn London Holborn hotel  831115773   
2      2           2  Premier Inn London Holborn hotel  877070098   

                   

## 5. Primary Filter: Date Range Selection  
*Filter reviews to 2022-2025 timeframe and consolidate chunks*

**Input:** 84 raw chunks (at most 420,000 rows at 5,000 rows per chunk)

**Filter Criteria:** Date contains "2022", "2023", "2024", or "2025"

**Output:** `02_primary_filter/tripadvisor_nyc_2022_2025_date_filtered.parquet`  

**Expected Reduction:** ~90% of data (based on original analysis)

In [11]:
# Set up primary filter output directory
primary_filter_dir = bronze_base / "02_primary_filter"
primary_filter_dir.mkdir(parents=True, exist_ok=True)

# Load all chunks and apply date filter
years_keywords = ["2022", "2023", "2024", "2025"]
chunk_files = sorted(conversion_dir.glob("tripadvisor_nyc_raw_chunk_*.parquet"))

print(f"Processing {len(chunk_files)} chunks for date filtering...")
all_filtered_rows = []

for i, chunk_file in enumerate(chunk_files):
    df = pd.read_parquet(chunk_file)
    date_mask = df["date"].fillna("").apply(lambda x: any(year in str(x) for year in years_keywords))
    filtered_df = df[date_mask]
    all_filtered_rows.append(filtered_df)

    # Progress indicator every 20 files
    if (i + 1) % 20 == 0:
        print(f"Processed {i + 1}/{len(chunk_files)} chunks...")

# Consolidate filtered data
print("Consolidating filtered chunks...")
filtered_df = pd.concat(all_filtered_rows, ignore_index=True)

# Save consolidated result
output_file = primary_filter_dir / "tripadvisor_nyc_2022_2025_date_filtered.parquet"
filtered_df.to_parquet(output_file, compression="snappy")

print(f"Date filtering complete!")
print(f"Original chunks: {len(chunk_files)}")
print(f"Filtered rows: {len(filtered_df):,}")
print(f"Saved to: {output_file}")

# File size check
file_size_mb = output_file.stat().st_size / (1024*1024)
print(f"File size: {file_size_mb:.1f} MB")

Processing 84 chunks for date filtering...
Processed 20/84 chunks...
Processed 40/84 chunks...
Processed 60/84 chunks...
Processed 80/84 chunks...
Consolidating filtered chunks...
Date filtering complete!
Original chunks: 84
Filtered rows: 48,992
Saved to: /Users/db/code/tourism_data_project/data/bronze/tripadvisor/02_primary_filter/tripadvisor_nyc_2022_2025_date_filtered.parquet
File size: 17.7 MB
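Substring matching keeps any value that merely contains one of the year strings. A slightly stricter sketch extracts a four-digit year as a standalone token and compares it numerically. It assumes the dates look like `Jan 2023` or `Aug 2022`, as in the samples printed in Section 8; `extract_year` and `in_target_range` are hypothetical helpers, not part of the notebook:

```python
import re
from typing import Optional

YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")


def extract_year(value: object) -> Optional[int]:
    """Return the first standalone four-digit year in `value`, or None if there is none."""
    if value is None:
        return None
    match = YEAR_RE.search(str(value))
    return int(match.group(0)) if match else None


def in_target_range(value: object, first: int = 2022, last: int = 2025) -> bool:
    """True when the extracted year falls within [first, last]."""
    year = extract_year(value)
    return year is not None and first <= year <= last


print([in_target_range(v) for v in ["Jan 2023", "Aug 2019", None, "id 120224"]])
# [True, False, False, False]
```

In the filtering loop this could be applied as `df["date"].map(in_target_range)` in place of the lambda. The last sample value shows the difference: plain substring matching would accept it because it contains "2022".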


## 6. Refined Filter: Geographic Filtering
*Extract NYC hotels using positive filtering approach*

**Input:** `02_primary_filter/tripadvisor_nyc_2022_2025_date_filtered.parquet` (48,992 rows)  

**Strategy:** 
1. Positive NYC filtering (hotel names with NYC indicators)
2. Manual cleanup of misclassified hotels

**Expected Output:** ~12,500 rows, ~125 hotels

**Final Location:** `data/silver/tripadvisor/tripadvisor_nyc_2022_2025_final.parquet`

### 6A. Exploratory Analysis Section

**Purpose:** Show analysis process used to develop filtering strategy

**Status:** Optional - Skip to "7. Final Geographic Filter & Save" (Cell 19) to run workflow 

**Contains:** Novel implementation strategies for geographic filtering challenges



#### 6A.1 Initial Hotel Name Analysis
*Examine hotel name patterns after date filtering*

In [12]:
# Load date-filtered data for geographic analysis
primary_filter_dir = bronze_base / "02_primary_filter"
exploration_df = pd.read_parquet(primary_filter_dir / "tripadvisor_nyc_2022_2025_date_filtered.parquet")

print(f"Starting geographic analysis with: {len(exploration_df):,} rows")
print(f"Unique hotels: {exploration_df['hotel_name'].nunique()}")

# Initial hotel name examination
print(f"\n Top 15 hotels by review count:")
top_hotels = exploration_df['hotel_name'].value_counts().head(15)
for hotel, count in top_hotels.items():
    print(f"  • {hotel} ({count:,} reviews)")

# Look for obvious non-NYC patterns
print(f"\n Sample hotel names (checking for international patterns):")
sample_hotels = exploration_df['hotel_name'].value_counts().head(25).index
for hotel in sample_hotels:
    print(f"  • {hotel}")

Starting geographic analysis with: 48,992 rows
Unique hotels: 668

 Top 15 hotels by review count:
  • Park Plaza Westminster Bridge London (749 reviews)
  • Luma Hotel Time Square (712 reviews)
  • The Clermont, Charing Cross (660 reviews)
  • Travelodge London City hotel (616 reviews)
  • Hyatt Grand Central New York (595 reviews)
  • Sea Containers London (465 reviews)
  • Travelodge London Central Waterloo (449 reviews)
  • Travelodge London Greenwich High Road (437 reviews)
  • Hyatt Centric Times Square New York (434 reviews)
  • Hotel Riu Plaza New York Times Square (431 reviews)
  • The Resident Covent Garden (431 reviews)
  • Leonardo Royal London Tower Bridge (418 reviews)
  • Travelodge London Central City Road (409 reviews)
  • Park Grand London Hyde Park (402 reviews)
  • Travelodge London Farringdon (390 reviews)

 Sample hotel names (checking for international patterns):
  • Park Plaza Westminster Bridge London
  • Luma Hotel Time Square

#### 6A.2 UK Reviewer Concentration Strategy

*Approach: Use reviewer location patterns to identify misclassified hotels*

**Challenge:** Hotel names alone are insufficient (e.g., "SoHo" exists in both NYC and London)

**Innovation:** Analyze reviewer geographic patterns to detect misclassified hotels

**Logic:** London hotels will have high concentrations of UK-based reviewers

**Threshold:** Hotels with >60% UK reviewers (min. 10 location entries) flagged for removal

In [14]:
# Analyze reviewer location patterns to identify non-NYC hotels
print("Analyzing reviewer geographic patterns...")

hotel_stats = []
for hotel_name, group in exploration_df.groupby('hotel_name'):
    location_data = group['location'].fillna('')

    total_reviews = len(group)
    total_with_location = group['location'].notna().sum()
    uk_reviews = location_data.str.contains('United Kingdom|UK|England|Scotland|Wales', case=False).sum()
    shanghai_reviews = location_data.str.contains('Shanghai|China', case=False).sum()

    hotel_stats.append({
        'hotel_name': hotel_name,
        'total_reviews': total_reviews,
        'total_with_location': total_with_location,
        'uk_reviews': uk_reviews,
        'shanghai_reviews': shanghai_reviews
    })

# Convert to analysis DataFrame
hotel_analysis = pd.DataFrame(hotel_stats)
hotel_analysis['uk_percentage'] = (hotel_analysis['uk_reviews'] / hotel_analysis['total_with_location']).fillna(0)
hotel_analysis['shanghai_percentage'] = (hotel_analysis['shanghai_reviews'] / hotel_analysis['total_with_location']).fillna(0)

# Identify problematic hotels
uk_threshold = 0.6
uk_hotels = hotel_analysis[
    (hotel_analysis['uk_percentage'] > uk_threshold) &
    (hotel_analysis['total_with_location'] >= 10)
]

print(f"Hotels with >{uk_threshold*100:.0f}% UK reviewers: {len(uk_hotels)}")
if len(uk_hotels) > 0:
    print("\n UK-heavy hotels (likely London):")
    uk_display = uk_hotels.nlargest(10, 'uk_percentage')[['hotel_name', 'total_with_location', 'uk_percentage']]
    for _, row in uk_display.iterrows():
        print(f"  • {row['hotel_name']} - {row['uk_percentage']:.1%} UK reviewers ({row['total_with_location']} total)")

Analyzing reviewer geographic patterns...
Hotels with >60% UK reviewers: 141

 UK-heavy hotels (likely London):
  • Premier Inn London Hanger Lane hotel - 92.7% UK reviewers (96 total)
  • Fitzrovia Hotel - 90.5% UK reviewers (21 total)
  • Premier Inn London New Southgate Hotel - 90.5% UK reviewers (21 total)
  • The Chamberlain Hotel - 90.0% UK reviewers (30 total)
  • Premier Inn London Archway hotel - 88.1% UK reviewers (109 total)
  • The Luxury Inn - 87.5% UK reviewers (16 total)
  • hub by Premier Inn London Spitalfields, Brick Lane hotel - 85.9% UK reviewers (85 total)
  • Premier Inn London Greenwich hotel - 85.5% UK reviewers (124 total)
  • Premier Inn London Tolworth - 85.5% UK reviewers (62 total)
  • The Prince of Wales - Townhouse - 84.6% UK reviewers (13 total)
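One caveat with the case-insensitive pattern used above: the bare alternative `UK` also matches inside longer words such as "Ukraine" or "Milwaukee". A sketch of a stricter matcher with word boundaries, together with the share calculation behind the 60% threshold; `is_uk_location` and `uk_share` are hypothetical names, and the sample locations are made up for illustration:

```python
import re
from typing import Iterable, Optional

# \b keeps "UK" from matching inside words such as "Ukraine" or "Milwaukee"
UK_RE = re.compile(r"\b(United Kingdom|UK|England|Scotland|Wales)\b", re.IGNORECASE)


def is_uk_location(location: Optional[str]) -> bool:
    """True when a reviewer location string names the UK or one of the listed countries."""
    return bool(location) and UK_RE.search(location) is not None


def uk_share(locations: Iterable[Optional[str]]) -> float:
    """Fraction of non-empty location strings that are UK-based (0.0 when none are filled in)."""
    filled = [loc for loc in locations if loc]
    if not filled:
        return 0.0
    return sum(is_uk_location(loc) for loc in filled) / len(filled)


# Made-up sample locations for illustration only
sample = ["London, UK", "Leeds, England", "Kyiv, Ukraine", "Milwaukee, WI", None]
print(uk_share(sample))  # 0.5
```

Word boundaries do not resolve every ambiguity: "Wales" would still match "New South Wales", so a manual look at borderline hotels remains useful.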


#### 6A.3 Positive NYC Filtering Strategy
*Conservative approach: Identify genuine NYC hotels using location indicators*

**Strategy Shift:** Instead of removing non-NYC, actively identify NYC hotels

**Indicators:** Hotel names containing NYC-specific terms

**Advantage:** Reduces false positives from ambiguous neighborhood names (SoHo, Chelsea, etc.)

**Final Cleanup:** Manual removal of remaining misclassified hotels

In [15]:
# Apply positive NYC filtering - identify genuine NYC hotels
nyc_indicators = [
    'New York', 'NYC', 'Manhattan', 'Brooklyn', 'Queens', 'Bronx',
    'Times Square', 'Time Square', 'Central Park', 'Wall Street',
    'Midtown', 'Downtown', 'Financial District', 'SoHo', 'NoMad',
    'TriBeCa', 'Upper East', 'Upper West', 'Lower East', 'Herald Square',
    'Penn Station', 'Grand Central', 'JFK', 'LaGuardia', 'Empire State'
]

nyc_pattern = '|'.join(nyc_indicators)
nyc_hotels = exploration_df[exploration_df['hotel_name'].str.contains(nyc_pattern, case=False, na=False)]

print(f"NYC hotels identified: {len(nyc_hotels):,} rows")
print(f"Unique NYC hotels: {nyc_hotels['hotel_name'].nunique()}")

# Check for remaining ambiguous terms that might be misclassified
print(f"\n Top 10 NYC hotels:")
nyc_top = nyc_hotels['hotel_name'].value_counts().head(10)
for hotel, count in nyc_top.items():
    print(f"  • {hotel} ({count:,} reviews)")

# Check for potentially ambiguous hotels needing manual review
ambiguous_terms = ['SoHo', 'Chelsea', 'Greenwich', 'Victoria']
print(f"\n NYC hotels with ambiguous neighborhood terms:")
for term in ambiguous_terms:
    matching = nyc_hotels[nyc_hotels['hotel_name'].str.contains(term, case=False, na=False)]
    if len(matching) > 0:
        unique_hotels = matching['hotel_name'].unique()
        print(f"  {term}: {len(unique_hotels)} hotels")
        for hotel in unique_hotels[:3]:  # Show first 3
            print(f"    • {hotel}")

NYC hotels identified: 12,846 rows
Unique NYC hotels: 127

 Top 10 NYC hotels:
  • Luma Hotel Time Square (712 reviews)
  • Hyatt Grand Central New York (595 reviews)
  • Hyatt Centric Times Square New York (434 reviews)
  • Hotel Riu Plaza New York Times Square (431 reviews)
  • DoubleTree by Hilton Hotel New York Times Square West (354 reviews)
  • Hyatt Place New York/Chelsea (334 reviews)
  • M Social Hotel Times Square New York (291 reviews)
  • Lotte New York Palace (284 reviews)
  • 1 Hotel Central Park (282 reviews)
  • Arlo Midtown (281 reviews)

 NYC hotels with ambiguous neighborhood terms:
  SoHo: 7 hotels
    • The Soho Hotel
    • hub by Premier Inn London Soho hotel
    • The Z Hotel Soho
  Chelsea: 9 hotels
    • SpringHill Suites New York Manhattan/Chelsea
    • TownePlace Suites by Marriott New York Manhattan/Chelsea
    • Hyatt House New York/Chelsea
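The positive filter reduces to one case-insensitive regular expression built from the indicator list. The sketch below uses a shortened indicator list and escapes each term with `re.escape`, so that any future indicator containing regex metacharacters would be matched literally; `build_pattern` is a hypothetical helper, and the hotel names are taken from the outputs above:

```python
import re
from typing import Iterable


def build_pattern(indicators: Iterable[str]) -> "re.Pattern":
    """Compile a case-insensitive alternation of literally matched indicator terms."""
    return re.compile("|".join(re.escape(term) for term in indicators), re.IGNORECASE)


# Shortened list for illustration; the notebook uses a longer one
nyc_re = build_pattern(["New York", "Times Square", "Time Square", "SoHo"])

names = [
    "Luma Hotel Time Square",
    "Hyatt Place New York/Chelsea",
    "The Z Hotel Soho",  # London hotel, matched via "SoHo"
    "Travelodge London Farringdon",
]
matched = [name for name in names if nyc_re.search(name)]
print(matched)
# ['Luma Hotel Time Square', 'Hyatt Place New York/Chelsea', 'The Z Hotel Soho']
```

The third match illustrates why a manual cleanup step is still needed after positive filtering: the "SoHo" indicator also matches hotels in London's Soho.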


#### 6A.4 Manual Cleanup of Misclassified Hotels
*Remove remaining London hotels caught by ambiguous neighborhood names*

**Issue:** "SoHo" exists in both NYC and London

**Solution:** Remove clearly London-branded hotels

**Targets:** Hotels with "London" in name or known London hotel chains

In [19]:
# Manual removal of identified London hotels
london_hotels_to_remove = [
    'The Soho Hotel',           # London SoHo hotel
    'The Z Hotel Soho',         # London hotel chain
    'hub by Premier Inn London Soho hotel'  # Explicitly London-branded
]

print(f"Removing London hotels:")
for hotel in london_hotels_to_remove:
    count = nyc_hotels[nyc_hotels['hotel_name'] == hotel].shape[0]
    print(f"  • {hotel} ({count:,} reviews)")

# Apply manual cleanup
final_nyc_df = nyc_hotels[~nyc_hotels['hotel_name'].isin(london_hotels_to_remove)].copy()

print(f"\nManual cleanup complete")
print(f"Final NYC dataset: {len(final_nyc_df):,} rows")
print(f"Unique hotels: {final_nyc_df['hotel_name'].nunique()}")

# Verify remaining SoHo hotels are legitimate NYC hotels
remaining_soho = final_nyc_df[final_nyc_df['hotel_name'].str.contains('soho', case=False)]['hotel_name'].unique()
print(f"\n Remaining SoHo hotels (verified NYC):")
for hotel in remaining_soho:
    print(f"  • {hotel}")

Removing London hotels:
  • The Soho Hotel (19 reviews)
  • The Z Hotel Soho (38 reviews)
  • hub by Premier Inn London Soho hotel (154 reviews)

Manual cleanup complete
Final NYC dataset: 12,635 rows
Unique hotels: 124

 Remaining SoHo hotels (verified NYC):
  • Arlo SoHo
  • Courtyard New York Manhattan/SoHo
  • Soho Grand Hotel
  • Sohotel
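The counts reported in 6A.3 and 6A.4 can be cross-checked with simple arithmetic: the reviews removed for the three London hotels should account for the whole difference between the two datasets. A short check using only the numbers printed above:

```python
# Counts copied from the cell outputs in 6A.3 and 6A.4
rows_before, hotels_before = 12_846, 127
removed_reviews = {
    "The Soho Hotel": 19,
    "The Z Hotel Soho": 38,
    "hub by Premier Inn London Soho hotel": 154,
}

rows_after = rows_before - sum(removed_reviews.values())
hotels_after = hotels_before - len(removed_reviews)
print(rows_after, hotels_after)  # 12635 124
```

Both values agree with the "Final NYC dataset" and "Unique hotels" lines in the output.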


## 7. Final Geographic Filter & Save

*Clean, validated approach - works whether exploration was run or skipped*

**Implementation:** Apply proven NYC filter strategy

**Output:** `data/silver/tripadvisor/tripadvisor_nyc_2022_2025_final.parquet`

In [None]:
# Load primary filtered data (works whether exploration was run or skipped)
primary_filter_dir = bronze_base / "02_primary_filter"
df_for_filtering = pd.read_parquet(primary_filter_dir / "tripadvisor_nyc_2022_2025_date_filtered.parquet")

# Apply validated NYC filter strategy
nyc_indicators = [
    'New York', 'NYC', 'Manhattan', 'Brooklyn', 'Queens', 'Bronx',
    'Times Square', 'Time Square', 'Central Park', 'Wall Street',
    'Midtown', 'Downtown', 'Financial District', 'SoHo', 'NoMad',
    'TriBeCa', 'Upper East', 'Upper West', 'Lower East', 'Herald Square',
    'Penn Station', 'Grand Central', 'JFK', 'LaGuardia', 'Empire State'
]

nyc_pattern = '|'.join(nyc_indicators)
nyc_filtered = df_for_filtering[df_for_filtering['hotel_name'].str.contains(nyc_pattern, case=False, na=False)]

# Remove identified London hotels
london_hotels_to_remove = ['The Soho Hotel', 'The Z Hotel Soho', 'hub by Premier Inn London Soho hotel']
final_clean_df = nyc_filtered[~nyc_filtered['hotel_name'].isin(london_hotels_to_remove)].copy()

# Save to silver directory (corrected structure)
silver_dir = project_root / "data" / "silver" / "tripadvisor"
silver_dir.mkdir(parents=True, exist_ok=True)

output_file = silver_dir / "tripadvisor_nyc_2022_2025_final.parquet"
final_clean_df.to_parquet(output_file, compression="snappy")

print(f"Final dataset saved!")
print(f"Location: {output_file}")
print(f"Rows: {len(final_clean_df):,}")
print(f"Hotels: {final_clean_df['hotel_name'].nunique()}")

file_size_mb = output_file.stat().st_size / (1024*1024)
print(f"File size: {file_size_mb:.1f} MB")
print(f"\nReady for gold layer processing and analysis!")

Final dataset saved!
Location: /Users/db/code/tourism_data_project/data/bronze/tripadvisor/03_refined_filter/tripadvisor_nyc_2022_2025_final.parquet
Rows: 12,635
Hotels: 124
File size: 4.9 MB

Ready for gold layer processing and analysis!


## 8. Final Verification & Cleanup

*Verify saved dataset and optional cleanup of intermediate files*

In [None]:
# Load and verify final saved dataset
print("Final Dataset Verification:")
print("=" * 50)

final_saved = pd.read_parquet(output_file)
print(f"File loads successfully")
print(f"Shape: {final_saved.shape}")
print(f"Unique hotels: {final_saved['hotel_name'].nunique()}")

# Quick data quality check
print(f"\nColumn info:")
print(f"   • Total columns: {len(final_saved.columns)}")
print(f"   • Date range sample: {final_saved['date'].dropna().head(3).tolist()}")
print(f"   • Top 3 hotels:")
for hotel, count in final_saved['hotel_name'].value_counts().head(3).items():
    print(f"     - {hotel}: {count:,} reviews")

# Document known column issues for gold layer processing
print(f"\nKnown Column Issues (to address in gold layer):")
dummy_columns = [col for col in final_saved.columns if 'Unnamed:' in str(col) or col in ['col_0']]
if dummy_columns:
    print(f"   • Dummy columns found: {dummy_columns}")
    print(f"   • These are Excel conversion artifacts - will be cleaned in gold processing")
else:
    print(f"   • No dummy columns detected")

print(f"\nFile Storage Summary:")
conversion_size = sum(f.stat().st_size for f in conversion_dir.glob("*.parquet")) / (1024*1024)
primary_filter_size = (bronze_base / "02_primary_filter" / "tripadvisor_nyc_2022_2025_date_filtered.parquet").stat().st_size / (1024*1024)
print(f"   • Raw chunks (bronze/01_raw_conversion): ~{conversion_size:.0f} MB")
print(f"   • Primary filter (bronze/02_primary_filter): {primary_filter_size:.1f} MB")
print(f"   • Final dataset (silver/tripadvisor): {file_size_mb:.1f} MB")

print(f"\nOptional Cleanup:")
print(f"   • To save disk space, you can delete intermediate processing files:")
print(f"   • rm -rf {conversion_dir}")
print(f"   • rm -rf {bronze_base}/02_primary_filter")
print(f"   • Keeps: original Excel + final silver parquet ({156.9 + file_size_mb:.1f} MB total)")
print(f"\nWorkflow complete! Ready for gold layer processing.")

Final Dataset Verification:
File loads successfully
Shape: (12635, 15)
Unique hotels: 124

Column info:
   • Total columns: 15
   • Date range sample: ['Jan 2023', 'Aug 2022', 'Jan 2023']
   • Top 3 hotels:
     - Luma Hotel Time Square: 712 reviews
     - Hyatt Grand Central New York: 595 reviews
     - Hyatt Centric Times Square New York: 434 reviews

Known Column Issues (to address in gold layer):
   • Dummy columns found: ['col_0', 'Unnamed: 0']
   • These are Excel conversion artifacts - will be cleaned in gold processing

File Storage Summary:
   • Raw chunks (01_raw_conversion): ~169 MB
   • Primary filter (02_primary_filter): 17.7 MB
   • Final dataset (03_refined_filter): 4.9 MB

Optional Cleanup:
   • To save disk space, you can delete intermediate processing files:
   • rm -rf /Users/db/code/tourism_data_project/data/bronze/tripadvisor/01_raw_conversion
   • rm -rf /Users/db/code/tourism_data_project/data/bronze/tripadvisor/02_primary_filter
   •
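
As an alternative to running `rm -rf` by hand, the cleanup could be scripted from Python with a dry-run default. This is a sketch only; `remove_intermediate` is a hypothetical helper that is not part of the notebook:

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def remove_intermediate(dirs: Iterable[Path], dry_run: bool = True) -> List[Path]:
    """Delete each existing directory in `dirs`; with dry_run=True only report them."""
    targets = [d for d in dirs if d.is_dir()]
    if not dry_run:
        for d in targets:
            shutil.rmtree(d)
    return targets
```

For example, `remove_intermediate([bronze_base / "01_raw_conversion", bronze_base / "02_primary_filter"])` returns the directories that would be deleted, and passing `dry_run=False` performs the deletion.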