# Tourism Sentiment Analysis - TripAdvisor NYC Data Extraction

**Project:** Tourism Sentiment Analysis

**Task:** Data Extraction & Processing

**Dataset Source:** TripAdvisor (SciDB)

**Focus:** NYC, 2022-2025, Hotels

**Source URL:** https://www.scidb.cn/en/file?fid=df2d477ee4830d106a58c14053a57b07

## 1. Setup & Configuration
*Import libraries, set up project paths, create directory structure*

### 1A. Imports

In [1]:
import requests
from pathlib import Path
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from openpyxl import load_workbook

### 1B. Project Root Detection
*Cross-platform function to locate project directory automatically*

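**Purpose:** Walks up from the current working directory (at most 10 levels) and returns the first directory that looks like the project: its name contains "tourism", it has a `notebooks/` or `data/` subfolder, or it holds an `.ipynb` file. Empty `.python-version` and `.gitignore` marker files are created there if missing.

**Handles:** Working directory issues, different operating systems, various notebook locations

**Fallback:** Uses the current working directory if no candidate is found. Because any folder containing an `.ipynb` file counts as a match, running from the notebook's own folder returns that folder rather than the repository root.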

In [2]:
def find_project_root():
    """Return the first project-like directory at or above the working directory."""
    # Strategy 1: Look for tourism project indicators
    current = Path.cwd()
    # Search up the directory tree for project indicators
    for _ in range(10):
        # Check for any tourism project signs
        tourism_indicators = [
            current.name.lower().find('tourism') != -1,
            (current / "notebooks").exists(),
            (current / "data").exists(),
            any(f.name.endswith('.ipynb') for f in current.glob('*') if f.is_file())
        ]
        if any(tourism_indicators):
            # Create marker files if missing
            (current / ".python-version").touch(exist_ok=True)
            (current / ".gitignore").touch(exist_ok=True)
            return current
        if current.parent == current:  # Reached filesystem root
            break
        current = current.parent
    # Fallback: Use current working directory
    project_root = Path.cwd()
    # Ensure marker files exist
    (project_root / ".python-version").touch(exist_ok=True)
    (project_root / ".gitignore").touch(exist_ok=True)
    return project_root
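
A stricter alternative is to treat marker files that already exist as the only evidence of the project root and to fail loudly otherwise. The sketch below is an assumption about how that could look, not the notebook's method; `find_root_by_markers` and its parameters are hypothetical names.

```python
from pathlib import Path


def find_root_by_markers(start=None, markers=(".python-version", ".gitignore"), max_levels=10):
    """Return the nearest ancestor of `start` that contains every marker file.

    Hypothetical helper: unlike find_project_root(), it never creates files
    and raises instead of falling back to the working directory.
    """
    current = (Path(start) if start is not None else Path.cwd()).resolve()
    origin = current
    for _ in range(max_levels):
        if all((current / name).exists() for name in markers):
            return current
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    raise FileNotFoundError(
        f"No directory containing {list(markers)} found at or above {origin}. "
        "Run the notebook from inside the repository or create the marker files."
    )
```

Raising an error keeps a misplaced working directory from silently creating a second `data/` tree somewhere unexpected.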


### 1C. Set Project Paths
*Establish standardized directory structure for bronze and silver processing*

**Bronze Structure:** Raw download → chunked conversion → primary filter

**Silver Structure:** Final staging area for gold layer integration

**Auto-creation:** All directories created automatically for new collaborators

In [3]:
# Set up project paths with bronze subfolder structure
project_root = find_project_root()
bronze_base = project_root / "data" / "bronze" / "tripadvisor"
print(f"Project root: {project_root}")
print(f"Bronze base: {bronze_base}")

Project root: /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor
Bronze base: /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor/data/bronze/tripadvisor
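
The bronze and silver folders used throughout this notebook could also be created in one place. This is a minimal sketch: the folder names are the ones that appear in the cells of this notebook, while `build_layer_dirs` and the keys of its return value are hypothetical.

```python
from pathlib import Path


def build_layer_dirs(project_root, source="tripadvisor"):
    """Create the bronze/silver folders used in this notebook and return them by name."""
    data = Path(project_root) / "data"
    dirs = {
        "original": data / "bronze" / source / "00_original_download",
        "conversion": data / "bronze" / source / "01_raw_conversion",
        "primary_filter": data / "bronze" / source / "02_primary_filter",
        "silver_staging": data / "silver" / source / "staging",
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return dirs
```

Calling `build_layer_dirs(project_root)` once would replace the separate `mkdir` calls in the later cells.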


## 2. Data Acquisition
*Download the raw Excel file from ScienceDB (SciDB) via its direct download URL*

<details>
<summary><strong>Manual Download Instructions </strong> (click to expand)</summary>

***If automated download fails:***
1. Visit: https://www.scidb.cn/en/file?fid=df2d477ee4830d106a58c14053a57b07
2. Download file manually
3. Rename to: `tripadvisor_nyc_2022_2025_original.xlsx`
4. Place in: `data/bronze/tripadvisor/00_original_download/`

</details>

In [4]:
# Set up download directory
original_dir = bronze_base / "00_original_download"
original_dir.mkdir(parents=True, exist_ok=True)

# Direct download URL (SciDB.cn pattern)
file_id = "df2d477ee4830d106a58c14053a57b07"
url = f"https://china.scidb.cn/download?fileId={file_id}"
file_name = "tripadvisor_nyc_2022_2025_original.xlsx"
file_path = original_dir / file_name

# Download the file
if not file_path.exists():
    print(f"Downloading from: {url}...")
    response = requests.get(url, stream=True, timeout=60)  # fail instead of hanging on a stalled connection
    response.raise_for_status()

    with open(file_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Download complete: {file_path}")
else:
    print(f"File already exists: {file_path}")

# Check file size
file_size = file_path.stat().st_size / (1024 * 1024)
print(f"File size: {file_size:.1f} MB")

Downloading from: https://china.scidb.cn/download?fileId=df2d477ee4830d106a58c14053a57b07...
Download complete: /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor/data/bronze/tripadvisor/00_original_download/tripadvisor_nyc_2022_2025_original.xlsx
File size: 156.9 MB
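
Because the download cell skips any file that already exists, an interrupted transfer could leave a truncated `.xlsx` that later runs treat as complete. One way to guard against this, sketched here under the assumption that a temporary `.part` file is acceptable, is to write the stream to a temporary name and rename it only on success. `write_stream_atomically` is a hypothetical helper, not part of the notebook.

```python
import os
from pathlib import Path


def write_stream_atomically(chunks, dest):
    """Write an iterable of byte chunks to `dest` via a temporary .part file.

    The final name only appears once every chunk has been written, so an
    interrupted download is not mistaken for a finished one on the next run.
    """
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty chunks
                    f.write(chunk)
        os.replace(tmp, dest)  # rename within the same directory
    except BaseException:
        tmp.unlink(missing_ok=True)  # discard the partial file
        raise
    return dest
```

With the notebook's variables this would be called as `write_stream_atomically(response.iter_content(chunk_size=8192), file_path)`.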


## 3. Bronze Layer: Raw Data Processing
*Convert Excel to chunked parquet files preserving original structure*

**Input:** `data/bronze/tripadvisor/00_original_download/tripadvisor_nyc_2022_2025_original.xlsx` (156.9 MB)

**Output:** `data/bronze/tripadvisor/01_raw_conversion/tripadvisor_nyc_raw_chunk_*.parquet` (chunked files)

**Processing:** 5,000-row chunks for memory efficiency

**Purpose:** Preserve complete dataset structure while converting to analysis-friendly format

In [5]:
# Set up conversion output directory
conversion_dir = bronze_base / "01_raw_conversion"
conversion_dir.mkdir(parents=True, exist_ok=True)

# Check if conversion already completed
existing_chunks = list(conversion_dir.glob("tripadvisor_nyc_raw_chunk_*.parquet"))
if existing_chunks:
    print(f"[SKIP] Conversion already complete - found {len(existing_chunks)} existing chunks")
    print(f"Output location: {conversion_dir}")
    print("\nTo reconvert, delete the 01_raw_conversion folder first")
else:
    print("Loading Excel file...")
    wb = load_workbook(file_path, read_only=True)
    ws = wb.active
    header = [str(cell.value) if cell.value is not None else f"col_{i}" for i, cell in enumerate(next(ws.iter_rows(min_row=1, max_row=1)))]
    print(f"Columns found: {len(header)}")

    # Convert to parquet chunks
    chunk_size = 5000
    rows = []
    part = 0

    print("Converting to parquet chunks...")
    for row in ws.iter_rows(min_row=2, values_only=True):
        row = list(row[:len(header)])  # truncate any extra columns
        while len(row) < len(header):  # fill missing columns with None
            row.append(None)
        rows.append(row)

        if len(rows) >= chunk_size:
            df = pd.DataFrame(rows, columns=header)
            chunk_filename = f"tripadvisor_nyc_raw_chunk_{part:05d}.parquet"
            pq.write_table(pa.Table.from_pandas(df), conversion_dir / chunk_filename, compression="snappy")
            rows = []
            part += 1

            # Progress indicator every 10 files
            if part % 10 == 0:
                print(f"Processed {part} chunks...")

    # Write remaining rows
    if rows:
        df = pd.DataFrame(rows, columns=header)
        chunk_filename = f"tripadvisor_nyc_raw_chunk_{part:05d}.parquet"
        pq.write_table(pa.Table.from_pandas(df), conversion_dir / chunk_filename, compression="snappy")
        part += 1  # count the final partial chunk

    print(f"Conversion complete. Total chunks: {part}")
    print(f"Output location: {conversion_dir}")

Loading Excel file...
Columns found: 15
Converting to parquet chunks...
Processed 10 chunks...
Processed 20 chunks...
Processed 30 chunks...
Processed 40 chunks...
Processed 50 chunks...
Processed 60 chunks...
Processed 70 chunks...
Processed 80 chunks...
Conversion complete. Total chunks: 84
Output location: /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor/data/bronze/tripadvisor/01_raw_conversion
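
The row padding and 5,000-row batching in the conversion cell can be separated from the Parquet writing. The sketch below shows only that batching logic in plain Python; `normalize_row` and `iter_chunks` are hypothetical names, and the Excel and Parquet calls are left out on purpose.

```python
from itertools import islice


def normalize_row(row, width):
    """Truncate or right-pad a row with None so it has exactly `width` cells."""
    cells = list(row[:width])
    cells.extend([None] * (width - len(cells)))
    return cells


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows; the last one may be shorter."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # never yields an empty trailing chunk
        yield chunk
```

Iterating with `for part, chunk in enumerate(iter_chunks(rows, 5000))` gives each chunk its file index, and the number of chunks is simply how many were yielded.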


## 4. Data Verification & Column Inspection
*Load converted data to verify structure and examine columns before filtering*

**Purpose:** Confirm parquet conversion preserved data integrity

**Check:** Column names, data types, row counts

**Next:** Identify date column format for primary filtering

In [6]:
# Load a sample chunk to verify conversion
sample_file = conversion_dir / "tripadvisor_nyc_raw_chunk_00000.parquet"
df_sample = pd.read_parquet(sample_file)

# Validation checks - flag conversion issues against expected TripAdvisor structure
expected_chunk_size = 5000
expected_columns = [
    'col_0', 'Unnamed: 0', 'hotel_name', 'id_review', 'title',
    'date', 'location', 'user_name', 'user_link', 'date_of_stay',
    'rating', 'review', 'rating_review', 'n_review_user', 'n_votes_review'
]

print(f"Conversion Validation:")
print("=" * 40)

# 1. Exact shape validation
shape_ok = df_sample.shape == (expected_chunk_size, len(expected_columns))
print(f"{'Shape validated' if shape_ok else 'Shape invalid'}\nShape:{df_sample.shape} [Expected ({expected_chunk_size}, {len(expected_columns)})])")

# 2. Exact column validation
columns_ok = list(df_sample.columns) == expected_columns
print(f"{'Columns validated' if columns_ok else 'Columns invalid'}\nColumn structure: {columns_ok}")
if not columns_ok:
    missing = set(expected_columns) - set(df_sample.columns)
    extra = set(df_sample.columns) - set(expected_columns)
    if missing: print(f"   Missing: {missing}")
    if extra: print(f"   Extra: {extra}")

# 3. Sample data display
print(f"\nSample Data Preview:")
print(df_sample[['hotel_name', 'date', 'rating', 'location']].head(3))

# 4. Overall conversion status
all_checks_ok = shape_ok and columns_ok
print(f"\n{'Conversion Successful...' if all_checks_ok else 'Conversion Issues Detected...'}")

if not all_checks_ok:
    print("Review validation failures above before proceeding")

Conversion Validation:
Shape validated
Shape:(5000, 15) [Expected (5000, 15)])
Columns validated
Column structure: True

Sample Data Preview:
                         hotel_name      date  rating  \
0  Premier Inn London Holborn hotel     Feb 3      30   
1  Premier Inn London Holborn hotel  Mar 2022      50   
2  Premier Inn London Holborn hotel  Jan 2023      50   

                      location  
0                         None  
1  Bournemouth, United Kingdom  
2  Cleckheaton, United Kingdom  

Conversion Successful...


## 5. Primary Filter: Date Range Selection  
*Filter reviews to 2022-2025 timeframe and consolidate chunks*

**Input:** 84 raw chunks (416,032 total rows)

**Filter Criteria:** Date contains "2022", "2023", "2024", or "2025"

**Output:** `data/bronze/tripadvisor/02_primary_filter/tripadvisor_nyc_2022_2025_date_filtered.parquet`

**Expected Reduction:** ~88% of rows removed (11.8% retained), consistent with the ~90% estimate from the original analysis

In [7]:
# Set up primary filter output directory
primary_filter_dir = bronze_base / "02_primary_filter"
primary_filter_dir.mkdir(parents=True, exist_ok=True)

# Check if primary filtering already completed
output_file = primary_filter_dir / "tripadvisor_nyc_2022_2025_date_filtered.parquet"

if output_file.exists():
    # Load existing file to show stats
    existing_df = pd.read_parquet(output_file)
    file_size_mb = output_file.stat().st_size / (1024*1024)
    print(f"[SKIP] Primary filter already complete")
    print(f"Existing file: {len(existing_df):,} rows, {file_size_mb:.1f} MB")
    print(f"Location: {output_file}")
    print("\nTo reprocess, delete the 02_primary_filter folder first")
else:
    # Load all chunks and apply date filter
    years_keywords = ["2022", "2023", "2024", "2025"]
    chunk_files = sorted(conversion_dir.glob("tripadvisor_nyc_raw_chunk_*.parquet"))

    print(f"Processing {len(chunk_files)} chunks for date filtering...")
    all_filtered_rows = []
    original_total = 0

    for i, chunk_file in enumerate(chunk_files):
        df = pd.read_parquet(chunk_file)
        original_total += len(df)
        date_mask = df["date"].fillna("").apply(lambda x: any(year in str(x) for year in years_keywords))
        filtered_df = df[date_mask]
        all_filtered_rows.append(filtered_df)

        # Progress indicator every 20 files
        if (i + 1) % 20 == 0:
            print(f"Processed {i + 1}/{len(chunk_files)} chunks...")

    # Consolidate filtered data
    print("Consolidating filtered chunks...")
    filtered_df = pd.concat(all_filtered_rows, ignore_index=True)

    kept_count = len(filtered_df)
    removed_count = original_total - kept_count

    # Save consolidated result
    filtered_df.to_parquet(output_file, compression="snappy")

    print(f"\nDate filtering complete:")
    print(f"  Original rows: {original_total:,}")
    print(f"  Rows kept (2022-2025): {kept_count:,}")
    print(f"  Rows removed: {removed_count:,}")
    print(f"  Retention rate: {kept_count/original_total*100:.1f}%")
    print(f"\nSaved to: {output_file}")

    # File size check
    file_size_mb = output_file.stat().st_size / (1024*1024)
    print(f"File size: {file_size_mb:.1f} MB")

Processing 84 chunks for date filtering...
Processed 20/84 chunks...
Processed 40/84 chunks...
Processed 60/84 chunks...
Processed 80/84 chunks...
Consolidating filtered chunks...

Date filtering complete:
  Original rows: 416,032
  Rows kept (2022-2025): 48,992
  Rows removed: 367,040
  Retention rate: 11.8%

Saved to: /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor/data/bronze/tripadvisor/02_primary_filter/tripadvisor_nyc_2022_2025_date_filtered.parquet
File size: 17.7 MB
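
The filter above keeps a row when any target year appears as a substring of the `date` value. A slightly stricter variant, sketched here as an assumption rather than the notebook's method, extracts a standalone four-digit year first and then checks the range. `review_year` and `in_target_years` are hypothetical names. Either way, values without a year (such as the `Feb 3` entry in the sample preview) are dropped.

```python
import re

# A four-digit year starting with "20", not embedded in a longer number
YEAR_RE = re.compile(r"\b(20\d{2})\b")


def review_year(value):
    """Return the four-digit year found in a date string, or None if absent."""
    if value is None:
        return None
    match = YEAR_RE.search(str(value))
    return int(match.group(1)) if match else None


def in_target_years(value, first=2022, last=2025):
    """True when the date string carries a year within [first, last]."""
    year = review_year(value)
    return year is not None and first <= year <= last
```

Applied to the notebook's data, this could be used as `df[df["date"].apply(in_target_years)]`.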


## 6. Data Verification & Geographic Filtering
*Load the date-filtered data, verify it, and extract NYC hotels using positive filtering*

**Input:** `data/bronze/tripadvisor/02_primary_filter/tripadvisor_nyc_2022_2025_date_filtered.parquet` (48,992 rows)

**Strategy:** 
1. Verify date filtering results
2. Positive NYC filtering (hotel names with NYC indicators)
3. Manual cleanup of misclassified hotels

**Expected Output:** ~12,500 rows, ~125 hotels

**Final Location:** `data/silver/tripadvisor/staging/tripadvisor_nyc_2022_2025_final.parquet`

### 6A. Data Verification

*Load and verify primary filtered data before geographic analysis*

In [8]:
# Load primary filtered data for geographic analysis
primary_filter_file = bronze_base / "02_primary_filter" / "tripadvisor_nyc_2022_2025_date_filtered.parquet"

if not primary_filter_file.exists():
    print("[ERROR] Primary filter file not found - run previous steps first")
    print(f"Expected location: {primary_filter_file}")
else:
    exploration_df = pd.read_parquet(primary_filter_file)

    print("Primary Filter Verification:")
    print("=" * 40)
    print(f"File loaded successfully")
    print(f"Rows: {len(exploration_df):,}")
    print(f"Hotels: {exploration_df['hotel_name'].nunique()}")

    # Verify date filtering worked
    date_samples = exploration_df['date'].dropna().head(10).tolist()
    target_years = ["2022", "2023", "2024", "2025"]
    dates_valid = all(any(year in str(date) for year in target_years) for date in date_samples)
    print(f"\n{'Year filter validated' if dates_valid else 'Year filter not valid'}: \nSample dates contain target years: {date_samples[:5]}")

    print(f"\nTop 5 hotels by review count:")
    top_hotels = exploration_df['hotel_name'].value_counts().head(5)
    for hotel, count in top_hotels.items():
        print(f"  • {hotel}: {count:,} reviews")

    print(f"\nReady for geographic filtering")

Primary Filter Verification:
File loaded successfully
Rows: 48,992
Hotels: 668

Year filter validated: 
Sample dates contain target years: ['Mar 2022', 'Jan 2023', 'Jan 2023', 'Jan 2023', 'Jan 2023']

Top 5 hotels by review count:
  • Park Plaza Westminster Bridge London: 749 reviews
  • Luma Hotel Time Square: 712 reviews
  • The Clermont, Charing Cross: 660 reviews
  • Travelodge London City hotel: 616 reviews
  • Hyatt Grand Central New York: 595 reviews

Ready for geographic filtering


### 6B. Exploratory Analysis Section (Optional)

**Purpose:** Show analysis process used to develop filtering strategy

***Status:*** *Optional - Jump to "7. Silver Layer: Final Geographic Filter & Save" to run workflow*

**Contains:** Implementation strategies for geographic filtering challenges

#### 6B.1 Initial Hotel Name Analysis
*Examine hotel name patterns after date filtering*

In [9]:
# Load date-filtered data for geographic analysis
primary_filter_dir = bronze_base / "02_primary_filter"
exploration_df = pd.read_parquet(primary_filter_dir / "tripadvisor_nyc_2022_2025_date_filtered.parquet")

print(f"Starting geographic analysis with: {len(exploration_df):,} rows")
print(f"Unique hotels: {exploration_df['hotel_name'].nunique()}")

# Initial hotel name examination
print(f"\n Top 15 hotels by review count:")
top_hotels = exploration_df['hotel_name'].value_counts().head(15)
for hotel, count in top_hotels.items():
    print(f"  • {hotel} ({count:,} reviews)")

# Look for obvious non-NYC patterns
print(f"\n Sample hotel names (checking for international patterns):")
sample_hotels = exploration_df['hotel_name'].value_counts().head(25).index
for hotel in sample_hotels:
    print(f"  • {hotel}")

Starting geographic analysis with: 48,992 rows
Unique hotels: 668

 Top 15 hotels by review count:
  • Park Plaza Westminster Bridge London (749 reviews)
  • Luma Hotel Time Square (712 reviews)
  • The Clermont, Charing Cross (660 reviews)
  • Travelodge London City hotel (616 reviews)
  • Hyatt Grand Central New York (595 reviews)
  • Sea Containers London (465 reviews)
  • Travelodge London Central Waterloo (449 reviews)
  • Travelodge London Greenwich High Road (437 reviews)
  • Hyatt Centric Times Square New York (434 reviews)
  • Hotel Riu Plaza New York Times Square (431 reviews)
  • The Resident Covent Garden (431 reviews)
  • Leonardo Royal London Tower Bridge (418 reviews)
  • Travelodge London Central City Road (409 reviews)
  • Park Grand London Hyde Park (402 reviews)
  • Travelodge London Farringdon (390 reviews)

 Sample hotel names (checking for international patterns):
  • Park Plaza Westminster Bridge London
  • Luma Hotel Time Square
  • The Clermont, Charing Cross
 

#### 6B.2 UK Reviewer Concentration Strategy

**Approach:** Use reviewer location patterns to identify misclassified hotels

**Challenge:** Hotel names alone insufficient (e.g., "SoHo" exists in both NYC and London)

**Adjustment:** Analyze reviewer geographic patterns ([user_...] 'location') to detect misclassified hotels

**Logic:** London hotels will have high concentrations of UK-based reviewers

**Threshold:** Hotels with >60% UK reviewers (min. 10 location entries) flagged for removal

In [10]:
# Analyze reviewer location patterns to identify non-NYC hotels
print("Analyzing reviewer geographic patterns...")

hotel_stats = []
for hotel_name, group in exploration_df.groupby('hotel_name'):
    location_data = group['location'].fillna('')

    total_reviews = len(group)
    total_with_location = group['location'].notna().sum()
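    # Note: with case=False the bare "UK" alternative also matches substrings such as "Ukraine";
    # r"\bUK\b" would restrict it to the standalone abbreviation.
    uk_reviews = location_data.str.contains('United Kingdom|UK|England|Scotland|Wales', case=False).sum()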
    shanghai_reviews = location_data.str.contains('Shanghai|China', case=False).sum()

    hotel_stats.append({
        'hotel_name': hotel_name,
        'total_reviews': total_reviews,
        'total_with_location': total_with_location,
        'uk_reviews': uk_reviews,
        'shanghai_reviews': shanghai_reviews
    })

# Convert to analysis DataFrame
hotel_analysis = pd.DataFrame(hotel_stats)
hotel_analysis['uk_percentage'] = (hotel_analysis['uk_reviews'] / hotel_analysis['total_with_location']).fillna(0)
hotel_analysis['shanghai_percentage'] = (hotel_analysis['shanghai_reviews'] / hotel_analysis['total_with_location']).fillna(0)

# Identify problematic hotels
uk_threshold = 0.6
uk_hotels = hotel_analysis[
    (hotel_analysis['uk_percentage'] > uk_threshold) &
    (hotel_analysis['total_with_location'] >= 10)
]

print(f"\nHotels with >{uk_threshold*100:.0f}% UK reviewers: {len(uk_hotels)}")
if len(uk_hotels) > 0:
    print("\n UK-heavy hotels (likely London):")
    uk_display = uk_hotels.nlargest(10, 'uk_percentage')[['hotel_name', 'total_with_location', 'uk_percentage']]
    for _, row in uk_display.iterrows():
        print(f"  • {row['hotel_name']} - {row['uk_percentage']:.1%} UK reviewers ({row['total_with_location']} total)")

Analyzing reviewer geographic patterns...

Hotels with >60% UK reviewers: 141

 UK-heavy hotels (likely London):
  • Premier Inn London Hanger Lane hotel - 92.7% UK reviewers (96 total)
  • Fitzrovia Hotel - 90.5% UK reviewers (21 total)
  • Premier Inn London New Southgate Hotel - 90.5% UK reviewers (21 total)
  • The Chamberlain Hotel - 90.0% UK reviewers (30 total)
  • Premier Inn London Archway hotel - 88.1% UK reviewers (109 total)
  • The Luxury Inn - 87.5% UK reviewers (16 total)
  • hub by Premier Inn London Spitalfields, Brick Lane hotel - 85.9% UK reviewers (85 total)
  • Premier Inn London Greenwich hotel - 85.5% UK reviewers (124 total)
  • Premier Inn London Tolworth - 85.5% UK reviewers (62 total)
  • The Prince of Wales - Townhouse - 84.6% UK reviewers (13 total)


#### 6B.3 Positive NYC Filtering Strategy

**Approach:** Identify genuine NYC hotels using location indicators

**Strategy Shift:** Instead of removing non-NYC, actively identify NYC hotels

**Indicators:** Hotel names containing NYC-specific terms

**Advantage:** Reduces false positives from ambiguous neighborhood names (SoHo, Chelsea, etc.)

**Final Cleanup:** Manual removal of remaining misclassified hotels

In [11]:
# Apply positive NYC filtering - identify genuine NYC hotels
nyc_indicators = [
    'New York', 'NYC', 'Manhattan', 'Brooklyn', 'Queens', 'Bronx',
    'Times Square', 'Time Square', 'Central Park', 'Wall Street',
    'Midtown', 'Downtown', 'Financial District', 'SoHo', 'NoMad',
    'TriBeCa', 'Upper East', 'Upper West', 'Lower East', 'Herald Square',
    'Penn Station', 'Grand Central', 'JFK', 'LaGuardia', 'Empire State'
]

nyc_pattern = '|'.join(nyc_indicators)
nyc_hotels = exploration_df[exploration_df['hotel_name'].str.contains(nyc_pattern, case=False, na=False)]

print(f"NYC hotels identified: {len(nyc_hotels):,} rows")
print(f"Unique NYC hotels: {nyc_hotels['hotel_name'].nunique()}")

# Check for remaining ambiguous terms that might be misclassified
print(f"\n Top 10 NYC hotels:")
nyc_top = nyc_hotels['hotel_name'].value_counts().head(10)
for hotel, count in nyc_top.items():
    print(f"  • {hotel} ({count:,} reviews)")

# Check for potentially ambiguous hotels needing manual review
ambiguous_terms = ['SoHo', 'Chelsea', 'Greenwich', 'Victoria']
print(f"\n NYC hotels with ambiguous neighborhood terms")
for term in ambiguous_terms:
    matching = nyc_hotels[nyc_hotels['hotel_name'].str.contains(term, case=False, na=False)]
    if len(matching) > 0:
        unique_hotels = matching['hotel_name'].unique()
        print(f"\n  {term}: {len(unique_hotels)} hotels")
        for hotel in unique_hotels[:3]:
            print(f"    • {hotel}")

NYC hotels identified: 12,846 rows
Unique NYC hotels: 127

 Top 10 NYC hotels:
  • Luma Hotel Time Square (712 reviews)
  • Hyatt Grand Central New York (595 reviews)
  • Hyatt Centric Times Square New York (434 reviews)
  • Hotel Riu Plaza New York Times Square (431 reviews)
  • DoubleTree by Hilton Hotel New York Times Square West (354 reviews)
  • Hyatt Place New York/Chelsea (334 reviews)
  • M Social Hotel Times Square New York (291 reviews)
  • Lotte New York Palace (284 reviews)
  • 1 Hotel Central Park (282 reviews)
  • Arlo Midtown (281 reviews)

 NYC hotels with ambiguous neighborhood terms

  SoHo: 7 hotels
    • The Soho Hotel
    • hub by Premier Inn London Soho hotel
    • The Z Hotel Soho

  Chelsea: 9 hotels
    • SpringHill Suites New York Manhattan/Chelsea
    • TownePlace Suites by Marriott New York Manhattan/Chelsea
    • Hyatt House New York/Chelsea


#### 6B.4 Manual Cleanup of Misclassified Hotels
*Remove remaining London hotels caught by ambiguous neighborhood names*

**Issue:** "SoHo" exists in both NYC and London

**Solution:** Remove clearly London-branded hotels

**Targets:** Hotels with "London" in name or known London hotel chains

In [12]:
# Manual removal of identified London hotels
london_hotels_to_remove = [
    'The Soho Hotel',                       # Hotel in Soho, London
    'The Z Hotel Soho',                     # London hotel chain
    'hub by Premier Inn London Soho hotel'  # Explicitly London-branded
]

print(f"Removing London hotels:")
for hotel in london_hotels_to_remove:
    count = nyc_hotels[nyc_hotels['hotel_name'] == hotel].shape[0]
    print(f"  • {hotel} ({count:,} reviews)")

# Apply manual cleanup
final_nyc_df = nyc_hotels[~nyc_hotels['hotel_name'].isin(london_hotels_to_remove)].copy()

print(f"\nManual cleanup complete...")
print(f"Final NYC dataset: {len(final_nyc_df):,} rows")
print(f"Unique hotels: {final_nyc_df['hotel_name'].nunique()}")

# Verify remaining SoHo hotels are legitimate NYC hotels
remaining_soho = final_nyc_df[final_nyc_df['hotel_name'].str.contains('soho', case=False)]['hotel_name'].unique()
print(f"\nRemaining SoHo hotels (verified NYC):")
for hotel in remaining_soho:
    print(f"  • {hotel}")

Removing London hotels:
  • The Soho Hotel (19 reviews)
  • The Z Hotel Soho (38 reviews)
  • hub by Premier Inn London Soho hotel (154 reviews)

Manual cleanup complete...
Final NYC dataset: 12,635 rows
Unique hotels: 124

Remaining SoHo hotels (verified NYC):
  • Arlo SoHo
  • Courtyard New York Manhattan/SoHo
  • Soho Grand Hotel
  • Sohotel


## 7. Silver Layer: Final Geographic Filter & Save
*Clean, validated approach - works whether exploration was run or not*

**Implementation:** Apply proven NYC filter strategy

**Output:** `data/silver/tripadvisor/staging/tripadvisor_nyc_2022_2025_final.parquet`

In [13]:
# Set up silver staging directory
silver_dir = project_root / "data" / "silver" / "tripadvisor" / "staging"
silver_dir.mkdir(parents=True, exist_ok=True)
output_file = silver_dir / "tripadvisor_nyc_2022_2025_final.parquet"

# Check if final filtering already completed
if output_file.exists():
    # Load existing file to show stats
    existing_final = pd.read_parquet(output_file)
    file_size_mb = output_file.stat().st_size / (1024*1024)
    print(f"[SKIP] Final geographic filter already complete")
    print(f"Existing file: {len(existing_final):,} rows, {existing_final['hotel_name'].nunique()} hotels")
    print(f"File size: {file_size_mb:.1f} MB")
    print(f"Location: {output_file}")
    print("\nTo reprocess, delete the silver/staging folder first")
else:
    # Load primary filtered data (works whether exploration was run or skipped)
    primary_filter_dir = bronze_base / "02_primary_filter"
    df_for_filtering = pd.read_parquet(primary_filter_dir / "tripadvisor_nyc_2022_2025_date_filtered.parquet")

    # Apply validated NYC filter strategy
    nyc_indicators = [
        'New York', 'NYC', 'Manhattan', 'Brooklyn', 'Queens', 'Bronx',
        'Times Square', 'Time Square', 'Central Park', 'Wall Street',
        'Midtown', 'Downtown', 'Financial District', 'SoHo', 'NoMad',
        'TriBeCa', 'Upper East', 'Upper West', 'Lower East', 'Herald Square',
        'Penn Station', 'Grand Central', 'JFK', 'LaGuardia', 'Empire State'
    ]

    nyc_pattern = '|'.join(nyc_indicators)
    nyc_filtered = df_for_filtering[df_for_filtering['hotel_name'].str.contains(nyc_pattern, case=False, na=False)]

    # Remove identified London hotels
    london_hotels_to_remove = ['The Soho Hotel', 'The Z Hotel Soho', 'hub by Premier Inn London Soho hotel']
    final_clean_df = nyc_filtered[~nyc_filtered['hotel_name'].isin(london_hotels_to_remove)].copy()

    # Save to silver staging directory
    final_clean_df.to_parquet(output_file, compression="snappy")

    print(f"Final dataset saved")
    print(f"Location: {output_file}")
    file_size_mb = output_file.stat().st_size / (1024*1024)
    print(f"Rows: {len(final_clean_df):,}")
    print(f"Hotels: {final_clean_df['hotel_name'].nunique()}")
    print(f"File size: {file_size_mb:.1f} MB")
    print(f"\nReady for gold layer processing and analysis")

Final dataset saved
Location: /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor/data/silver/tripadvisor/staging/tripadvisor_nyc_2022_2025_final.parquet
Rows: 12,635
Hotels: 124
File size: 4.9 MB

Ready for gold layer processing and analysis


## 8. Final Verification & Cleanup

*Verify saved dataset and optional cleanup of intermediate files*

In [14]:
# Load and verify final saved dataset
print("Final Dataset Verification:")
print("=" * 50)

final_saved = pd.read_parquet(output_file)
print(f"File loads successfully")
print(f"Shape: {final_saved.shape}")
print(f"Unique hotels: {final_saved['hotel_name'].nunique()}")

# Quick data quality check
print(f"\nColumn info:")
print(f"   • Total columns: {len(final_saved.columns)}")
print(f"   • Date range sample: {final_saved['date'].dropna().head(3).tolist()}")
print(f"   • Top 3 hotels:")
for hotel, count in final_saved['hotel_name'].value_counts().head(3).items():
    print(f"     - {hotel}: {count:,} reviews")

# Document known column issues for gold layer processing
print(f"\nKnown Column Issues (to address in gold layer):")
dummy_columns = [col for col in final_saved.columns if 'Unnamed:' in str(col) or col in ['col_0']]
if dummy_columns:
    print(f"   • Dummy columns found: {dummy_columns}")
    print(f"   • These are Excel conversion artifacts to be cleaned in gold processing")
else:
    print(f"   • No dummy columns detected")

print(f"\nFile Storage Summary:")
conversion_size = sum(f.stat().st_size for f in conversion_dir.glob("*.parquet")) / (1024*1024)
primary_filter_size = (bronze_base / "02_primary_filter" / "tripadvisor_nyc_2022_2025_date_filtered.parquet").stat().st_size / (1024*1024)
print(f"   • Raw chunks (bronze/01_raw_conversion): ~{conversion_size:.0f} MB")
print(f"   • Primary filter (bronze/02_primary_filter): {primary_filter_size:.1f} MB")
print(f"   • Final dataset (silver/tripadvisor): {file_size_mb:.1f} MB")

print(f"\nOptional Cleanup:")
print(f"   • To save disk space, you can delete intermediate processing files:")
print(f"   • rm -rf {conversion_dir}")
print(f"   • rm -rf {bronze_base}/02_primary_filter")
print(f"   • Keeps: original Excel + final silver parquet ({file_size + file_size_mb:.1f} MB total)")
print(f"\nWorkflow complete! Ready for gold layer processing.")

Final Dataset Verification:
File loads successfully
Shape: (12635, 15)
Unique hotels: 124

Column info:
   • Total columns: 15
   • Date range sample: ['Jan 2023', 'Aug 2022', 'Jan 2023']
   • Top 3 hotels:
     - Luma Hotel Time Square: 712 reviews
     - Hyatt Grand Central New York: 595 reviews
     - Hyatt Centric Times Square New York: 434 reviews

Known Column Issues (to address in gold layer):
   • Dummy columns found: ['col_0', 'Unnamed: 0']
   • These are Excel conversion artifacts to be cleaned in gold processing

File Storage Summary:
   • Raw chunks (bronze/01_raw_conversion): ~169 MB
   • Primary filter (bronze/02_primary_filter): 17.7 MB
   • Final dataset (silver/tripadvisor): 4.9 MB

Optional Cleanup:
   • To save disk space, you can delete intermediate processing files:
   • rm -rf /home/anna/code/TinaKgn/tourism_data_project/notebooks/shared/data_extraction/tripadvisor/data/bronze/tripadvisor/01_raw_conversion
   • rm -rf /home/anna/code/TinaKgn/tourism_data_project/n

## 9. Next Steps: Gold Layer Processing

**Current Status:**
Bronze → Silver workflow complete for TripAdvisor NYC dataset

**Upcoming Gold Layer Integration:**

- **Multi-dataset analysis:** All processed silver datasets (TripAdvisor NYC, Yelp New Orleans, Airbnb LA/Chicago) will be explored for shared columns
  
- **Schema standardization:** Common fields (location, date, rating, text) will be unified across datasets
  
- **Data quality:** Null value handling, strategic imputation, and appropriate data type conversions
  
- **Analysis-ready format:** Final gold datasets optimized for sentiment analysis and tourism correlation modeling

**Gold Processing Pipeline:**
1. Load all silver datasets and analyze column overlap
2. Standardize shared column names and formats
3. Handle missing values with dataset-appropriate strategies
4. Convert data types for analysis efficiency
5. Create unified gold datasets for cross-platform analysis
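
Step 1 of the pipeline, analyzing column overlap, can be sketched with plain sets. The TripAdvisor column names below come from the validation step in this notebook; the Yelp and Airbnb column lists are hypothetical placeholders, and `shared_columns` is an illustrative helper rather than planned project code.

```python
from collections import Counter


def shared_columns(schemas, min_datasets=2):
    """Map each column name to the datasets containing it, keeping widely shared ones."""
    counts = Counter(col for cols in schemas.values() for col in set(cols))
    return {
        col: sorted(name for name, cols in schemas.items() if col in cols)
        for col, n in counts.items()
        if n >= min_datasets
    }


schemas = {
    # Columns taken from the validation step in this notebook
    "tripadvisor_nyc": ["hotel_name", "id_review", "title", "date", "location", "rating", "review"],
    # Hypothetical column lists, for illustration only
    "yelp_new_orleans": ["business_id", "date", "stars", "text"],
    "airbnb_la": ["listing_id", "date", "comments"],
}
overlap = shared_columns(schemas)
```

Columns that carry the same meaning under different names (for example a rating stored as `rating` in one dataset and `stars` in another) would not be detected this way and would need a manual mapping in step 2.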

<!-- **Template Replication:** Use this notebook structure for remaining datasets:
- `002_yelp_new_orleans_2013_2016_2018_extraction.ipynb`
- `003_airbnb_los_angeles_2022_2024_extraction.ipynb`
- `004_airbnb_chicago_2022_2024_extraction.ipynb` -->

In [15]:
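# Clear all cell outputs for clean notebook sharing
# Note: the `Jupyter.notebook` JavaScript object exists only in the classic Notebook interface

from IPython.display import Javascript, display
# display() is required: a bare Javascript(...) expression is rendered only when it is the last line of a cell
display(Javascript("Jupyter.notebook.clear_all_output()"))
print("Outputs cleared for clean collaboration")

# Reset all cell numbers to None
display(Javascript("Jupyter.notebook.get_cells().forEach(function(c) {c.set_input_prompt();})"))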

# # Verify no local paths
# import json
# with open('001_tripadvisor_nyc_extraction.ipynb') as f:
#     nb = json.load(f)
#     nb_str = json.dumps(nb)
#     if '[!Replace with Local user directory root!]' in nb_str:
#         print("Local paths found in notebook metadata")
#     else:
#         print("No local paths detected")

Outputs cleared for clean collaboration


<IPython.core.display.Javascript object>
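
Clearing outputs and checking for local paths can also be done on the saved `.ipynb` file itself, independent of the notebook front end. The sketch below assumes the nbformat v4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); `strip_outputs` and `contains_text` are hypothetical helpers.

```python
import json


def strip_outputs(notebook):
    """Remove outputs and execution counts from a notebook dict (nbformat v4 layout)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def contains_text(notebook, needle):
    """True when `needle` appears anywhere in the serialized notebook."""
    return needle in json.dumps(notebook)
```

A typical use would be to `json.load` the notebook file, call `strip_outputs`, confirm with `contains_text` that no home-directory prefix remains, and write the result back with `json.dump`.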