# Tourism Sentiment Analysis - Yelp New Orleans Data Extraction

**Project:** Tourism Sentiment Analysis

**Task:** Data Extraction & Processing

**Dataset Source:** Yelp Academic Dataset (Kaggle)

**Focus:** New Orleans, 2013/2016/2018, Tourism Businesses

**Source URL:** https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset

---
<br>
<details>
<summary><strong>Dataset Update Warning</strong> (click to expand)</summary>

The Yelp Academic Dataset is updated annually with newer data and potential schema changes.

The current dataset covers 2005-2022, but future versions may:
- Remove older historical data (pre-2010)
- Change JSON structure or column names
- Modify business categorization system

This notebook includes validation to detect significant dataset changes.
</details>
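
One way the validation mentioned above could work is to compare the keys of a raw record against the fields this notebook relies on. The sketch below is illustrative only: `REQUIRED_BUSINESS_FIELDS` and `missing_fields` are hypothetical names (not the helpers in `data_validation`), and the field list is simply the set of business columns used later in this notebook.

```python
import json

# Business fields used later in this notebook (assumed minimum schema)
REQUIRED_BUSINESS_FIELDS = {
    "business_id", "name", "city", "state", "postal_code",
    "latitude", "longitude", "stars", "review_count", "categories",
}

def missing_fields(json_line: str, required: set) -> set:
    """Return the required field names absent from one JSON record."""
    record = json.loads(json_line)
    return required - set(record)

# Illustrative record in which 'categories' was renamed by a schema change
sample = json.dumps({
    "business_id": "abc", "name": "Example Cafe", "city": "New Orleans",
    "state": "LA", "postal_code": "70112", "latitude": 29.95,
    "longitude": -90.07, "stars": 4.0, "review_count": 12,
    "category_list": "Cafes",
})
print(missing_fields(sample, REQUIRED_BUSINESS_FIELDS))
```

A non-empty result on the first record of a freshly downloaded file would be an early signal that column names have changed.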

---
<br>
<details>
<summary><strong>Historical Tourism Events Focus</strong> (click to expand)</summary>

This analysis focuses on major tourism events in New Orleans to understand how volume fluctuations affect customer sentiment:

- **2013:** Super Bowl XLVII (February) - massive tourism influx
- **2016:** Normal tourism baseline - regular seasonal patterns  
- **2018:** New Orleans Tricentennial - 300th anniversary celebrations

The workflow first applies a 2012+ data-quality threshold, then narrows to these specific years.

</details>

---

## 1. Setup & Configuration
*Import libraries, set up project paths, create directory structure*

### 1A. Imports & Script Setup
*Load required packages and configure script imports*

In [None]:
# Core data processing
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# File handling & JSON processing
import json
from pathlib import Path

# System utilities
import sys

# Bootstrap: Add shared scripts to Python path
def setup_scripts_path():
    """Add shared scripts to Python path"""
    current = Path.cwd()
    for _ in range(10):
        if (current / '.projectroot').exists():
            scripts_dir = current / 'notebooks' / 'shared' / 'scripts'
            if scripts_dir.exists():
                sys.path.insert(0, str(scripts_dir))
                return scripts_dir
        if current.parent == current:
            break
        current = current.parent
    raise FileNotFoundError(
        "Scripts directory not found.\n"
        "Ensure .projectroot exists at project root."
    )

# Setup path and import utilities
scripts_dir = setup_scripts_path()

from project_utils import find_project_root
from data_io import setup_extraction_directories, check_existing_file, check_existing_chunks
from data_validation import print_final_summary, print_storage_summary

# Define target files for download
TARGET_FILES = [
    'yelp_academic_dataset_business.json',
    'yelp_academic_dataset_review.json',
    'yelp_academic_dataset_user.json'
]

print(f"✓ Packages and scripts loaded successfully")

### 1B. Project Root Detection
*Cross-platform function to locate project directory automatically*

**Purpose:** Finds project root by searching for `.projectroot` marker file

**Handles:** Working directory issues, different operating systems, various notebook locations

**Confirmation:** Displays detected path for verification

**Manual Override:** Uncomment line below if auto-detection fails

In [None]:
# Automatically detect project root
project_root = find_project_root()

In [None]:
# If auto-detection fails, uncomment and edit this line:
# project_root = Path('/.../.../.../[tourism_data_project]')

### 1C. Set Project Paths
*Establish standardized directory structure for bronze and silver processing*

**Bronze Structure:** Raw download → chunked conversion → primary filter

**Silver Structure:** Final staging area for gold layer integration

**Auto-creation:** All directories created automatically for new collaborators
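
For readers without access to `data_io`, the directory setup can be pictured roughly as follows. This is a minimal sketch, not the project's implementation: `make_yelp_directories` is a hypothetical stand-in for `setup_extraction_directories`, the bronze folder names come from the paths quoted elsewhere in this notebook, and the silver path is an assumption.

```python
from pathlib import Path
import tempfile

def make_yelp_directories(project_root: Path) -> dict:
    """Create the bronze/silver folders and return them keyed by role."""
    bronze = project_root / "data" / "bronze" / "yelp"
    dirs = {
        "bronze_original": bronze / "00_original_download",
        "bronze_conversion": bronze / "01_raw_conversion",
        "bronze_primary_filter": bronze / "02_primary_filter",
        # Assumed location; the real silver path is defined in data_io
        "silver_staging": project_root / "data" / "silver" / "yelp",
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
    return dirs

# Demo in a throwaway directory so nothing is written to the real project
with tempfile.TemporaryDirectory() as tmp:
    created = make_yelp_directories(Path(tmp))
    all_created = all(path.is_dir() for path in created.values())
    print(all_created)
```

Using `exist_ok=True` keeps the step idempotent, so re-running the cell on an existing project does nothing harmful.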

In [None]:
# Create all required directories automatically
dirs = setup_extraction_directories(project_root, 'yelp')

# Access directories throughout notebook
original_dir = dirs['bronze_original']
conversion_dir = dirs['bronze_conversion']
primary_filter_dir = dirs['bronze_primary_filter']
silver_staging = dirs['silver_staging']

print("Directories ready:")
print(f"  Bronze original: {original_dir}")
print(f"  Bronze conversion: {conversion_dir}")
print(f"  Bronze primary filter: {primary_filter_dir}")
print(f"  Silver staging: {silver_staging}")

## 2. Data Acquisition
*Download raw JSON files from Kaggle using Yelp Academic Dataset*

**Source:** Official Yelp Academic Dataset via Kaggle API

**Files:** Business, Review, and User JSON files (~8.5 GB total uncompressed)

**Authentication:** Requires Kaggle API credentials (setup below)

***<u>Important Note</u>:*** If API setup fails or if manual download is preferable, *please skip to Section 2C*

### 2A. Kaggle API Configuration
*Step-by-step setup for Yelp Academic Dataset access*

**Required:** Kaggle account and API credentials for 8GB dataset download



#### 2A.1 Kaggle Account & API Token Setup

<details>
<summary>Create account and generate credentials (click to expand)</summary>

**Visit:** https://www.kaggle.com/

**Register/Login:** Use email or Google account

**Generate API Token:**
   - Go to: https://www.kaggle.com/settings/account
   - Scroll to "API" section  
   - Click "Create New Token"
   - **Copy your username** (displayed on the account page)
   - **Copy the API token** immediately

**Next:** Choose ONE configuration method below (2A.2 or 2A.3)

</details>

---

#### 2A.2 Project .env Configuration (Method Option 1 of 2)

<details>
<summary><strong>Recommended</strong>: Secure, project-specific credential storage (click to expand)</summary>

```bash
# Navigate to your project root directory
cd ~/path/to/tourism_data_project

# Create .env file with your credentials (replace with values from Kaggle interface)
echo "KAGGLE_USERNAME=[your_username]" >> .env
echo "KAGGLE_KEY=[your_copied_token]" >> .env
echo "KAGGLE_API_TOKEN=[your_copied_token]" >> .env

# Secure the file
chmod 600 .env
echo ".env" >> .gitignore
```

***<u>Note</u>:*** Kaggle documentation refers to both `KAGGLE_KEY` and `KAGGLE_API_TOKEN`, so setting both ensures compatibility.
</details>
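
Values stored in `.env` are not visible to the notebook until something reads the file. How `yelp_utils` does this is not shown here, so the following is only a sketch under that caveat: `parse_env_file` is a hypothetical helper that reads simple `KEY=VALUE` lines, and the credentials in the demo are placeholders.

```python
import tempfile
from pathlib import Path

def parse_env_file(env_path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip('"')
    return values

# Demo with a throwaway file and placeholder credentials
with tempfile.TemporaryDirectory() as tmp:
    demo_env = Path(tmp) / ".env"
    demo_env.write_text("KAGGLE_USERNAME=demo_user\nKAGGLE_KEY=demo_token\n")
    loaded = parse_env_file(demo_env)

# Report any credential that is still missing before attempting a download
missing = [key for key in ("KAGGLE_USERNAME", "KAGGLE_KEY") if not loaded.get(key)]
print(missing)
```

In a real session the parsed values could be exported with `os.environ.update(loaded)` before the Kaggle client is imported.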

---

#### 2A.3 Terminal Environment Variables (Method Option 2 of 2)


<details>
<summary><strong>Alternative</strong>: Temporary Configuration <i>requiring notebook restart</i> (click to expand)</summary>

```bash
# Run in terminal BEFORE starting Jupyter (temporary)
export KAGGLE_USERNAME="[your_username]"
export KAGGLE_KEY="[your_copied_token]"
export KAGGLE_API_TOKEN="[your_copied_token]"

# Then start Jupyter from same terminal
jupyter notebook
```

***<u>Note</u>:*** Kaggle documentation refers to both `KAGGLE_KEY` and `KAGGLE_API_TOKEN`, so setting both ensures compatibility.

</details>

---

### 2B. API Authentication Test
*Verify Kaggle API credentials before large download*

**Purpose:** Test connection before proceeding to download 8GB files

**Next:** Automated download if successful, *manual instructions if test fails*

In [None]:
# Import Yelp-specific functions
from yelp_utils import test_kaggle_authentication, download_yelp_with_complete_handling

# Test Kaggle authentication
print("Testing Kaggle API authentication...")

auth_success, auth_message = test_kaggle_authentication()

if auth_success:
    print("\n✓ Authentication successful, ready for download")
    print("Proceeding to Section 2C...")
else:
    print("\n✗ Authentication failed")
    print("See manual download instructions in Section 2C below")

### 2C. Dataset Download
*Download Yelp Academic Dataset (~8GB) using Kaggle API*

**Files:** Business (~115 MB), Reviews (~5.3 GB), Users (~3.1 GB)

**Process:** ZIP download → automatic extraction → JSON files

---
<details>
<summary><strong>Manual Download Instructions</strong> (click to expand)</summary>

***If automated download fails:***

**Option A: Kaggle Website Download**
1. Visit Kaggle: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
   
2. Click "Download" button (requires Kaggle account)
   
3. Extract the downloaded ZIP file
   
4. Copy these files to: 
   
    `data/bronze/yelp/00_original_download/...`

    `yelp_academic_dataset_business.json`
    `yelp_academic_dataset_review.json`
    `yelp_academic_dataset_user.json`

**Option B: Official Yelp Dataset**
- Visit Yelp's official hosting page: https://business.yelp.com/data/resources/open-dataset/

- May require registration for academic/research use
  
- Same file handling as Option A above (steps 3-4).

**File Size Expectations:**
- Business: ~115 MB (150K+ businesses)
- Reviews: ~5.3 GB (7M+ reviews)
- Users: ~3.1 GB (2M+ users)
- Total: ~8.5 GB uncompressed

</details>

---

In [None]:
print("Large download: ~8GB total, may take 10-20 minutes")
print("Files: Business (~115 MB), Reviews (~5.3 GB), Users (~3.1 GB)")

# Download & validate in one step
success, message, status, validation = download_yelp_with_complete_handling(
    download_dir=original_dir,
    target_files=TARGET_FILES
)

print(f"\n{message}")

if not success:
    print("\nIf download failed, see manual instructions in markdown above")

## 3. Bronze Layer: Raw Data Processing
*Convert large JSON files to chunked parquet preserving original structure*

**Input:** 3 JSON files (~8.4 GB total)

**Output:** Chunked parquet files in `data/bronze/yelp/01_raw_conversion/`

**Processing:** Memory-efficient chunked conversion (10,000 records per chunk)

**Purpose:** Preserve complete dataset structure while converting to analysis-friendly format
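
The chunking idea can be sketched without the project helpers. The example below assumes each Yelp file stores one JSON object per line; `iter_record_chunks` is a hypothetical name and not the implementation of `convert_json_dataset_to_chunks`. It reads lazily, so only one chunk of records is held in memory at a time.

```python
import io
import json
from itertools import islice

def iter_record_chunks(lines, chunk_size):
    """Yield lists of parsed records, at most chunk_size per list."""
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        yield chunk

# Demo: 5 records read in chunks of 2
fake_file = io.StringIO("\n".join(json.dumps({"id": i}) for i in range(5)))
chunk_sizes = [len(chunk) for chunk in iter_record_chunks(fake_file, 2)]
print(chunk_sizes)
```

Each yielded chunk could then be written out on its own, for example with `pd.DataFrame(chunk).to_parquet(...)`, which is what keeps peak memory bounded by the chunk size rather than the file size.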

### 3A. Configuration & Pre-Conversion Check
*Set chunking parameters and confirm files ready for processing*

**Input:** Validated JSON files from Section 2 (8.4 GB total)

**Configuration:** 10,000 records per parquet chunk for memory efficiency

**Purpose:** Final check before beginning memory-intensive conversion process

In [None]:
# Configuration
CHUNK_SIZE = 10000

# Confirm files from Section 2 validation
file_paths = {
    'business': original_dir / 'yelp_academic_dataset_business.json',
    'review': original_dir / 'yelp_academic_dataset_review.json',
    'user': original_dir / 'yelp_academic_dataset_user.json'
}

# Quick existence check (validation already done in Section 2)
all_present = all(path.exists() for path in file_paths.values())

if not all_present:
    print("✗ Files missing - run Section 2 first")
    missing = [name for name, path in file_paths.items() if not path.exists()]
    print(f"Missing: {missing}")
else:
    # Show file sizes
    total_size_mb = sum(path.stat().st_size for path in file_paths.values()) / (1024 * 1024)

    print("Files ready for bronze layer conversion:")
    for name, path in file_paths.items():
        size_mb = path.stat().st_size / (1024 * 1024)
        print(f"  ✓ {name}: {size_mb:.1f} MB")

    print(f"\nTotal dataset: {total_size_mb / 1024:.1f} GB")
    print(f"Chunk size: {CHUNK_SIZE:,} records per parquet file")
    print("\n✓ Ready for conversion (Sections 3B-3D)")

### 3B. Business Data Conversion
*Convert business JSON to parquet chunks*

**Input:** `yelp_academic_dataset_business.json` (~113 MB)

**Expected:** ~150K business records

**Output:** `yelp_business_chunk_*.parquet` files


In [None]:
from data_io import convert_json_dataset_to_chunks

business_file = original_dir / 'yelp_academic_dataset_business.json'

success, chunk_count, total_records = convert_json_dataset_to_chunks(
    json_file=business_file,
    output_dir=conversion_dir,
    file_prefix='yelp_business',
    chunk_size=CHUNK_SIZE
)

if success:
    print(f"✓ Business conversion complete: {total_records:,} records, {chunk_count} chunks")
else:
    print(f"✗ Business conversion failed")

### 3C. Review Data Conversion
*Convert large review JSON to parquet chunks*

**Input:** `yelp_academic_dataset_review.json` (~5.1 GB)

**Expected:** ~7M review records

**Output:** `yelp_reviews_chunk_*.parquet` files

**Note:** *Largest file - may take 5-10 minutes*


In [None]:
review_file = original_dir / 'yelp_academic_dataset_review.json'

print("Converting review data (this may take 5-10 minutes)...")
success, chunk_count, total_records = convert_json_dataset_to_chunks(
    json_file=review_file,
    output_dir=conversion_dir,
    file_prefix='yelp_reviews',
    chunk_size=CHUNK_SIZE
)

if success:
    print(f"✓ Review conversion complete: {total_records:,} records, {chunk_count} chunks")
else:
    print(f"✗ Review conversion failed")

### 3D. User Data Conversion
*Convert user JSON to parquet chunks*

**Input:** `yelp_academic_dataset_user.json` (~3.2 GB)

**Expected:** ~2M user records

**Output:** `yelp_users_chunk_*.parquet` files


In [None]:
user_file = original_dir / 'yelp_academic_dataset_user.json'

success, chunk_count, total_records = convert_json_dataset_to_chunks(
    json_file=user_file,
    output_dir=conversion_dir,
    file_prefix='yelp_users',
    chunk_size=CHUNK_SIZE
)

if success:
    print(f"✓ User conversion complete: {total_records:,} records, {chunk_count} chunks")
else:
    print(f"✗ User conversion failed")

### 4A. Verify Bronze Conversion
*Validate all three datasets converted successfully before proceeding*

**Input:** Parquet chunks from Section 3 conversion

**Purpose:** Confirm all business, review, and user chunks created before exploration

**Uses:** `check_existing_chunks()` from data_io.py
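
The hard-coded chunk expectations in the next cell presumably follow from dividing each record count by `CHUNK_SIZE` and rounding up. A small sketch of that arithmetic, using a hypothetical `expected_chunks` helper and illustrative record counts rather than measured ones:

```python
import math

CHUNK_SIZE = 10_000

def expected_chunks(total_records: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunk files needed; the last chunk may be partially filled."""
    return math.ceil(total_records / chunk_size)

# Illustrative record counts, not measured values
just_over_150k = expected_chunks(150_001)   # 15 full chunks plus 1 partial
exactly_full = expected_chunks(160_000)     # 16 full chunks
one_extra = expected_chunks(160_001)        # 16 full chunks plus 1 partial
print(just_over_150k, exactly_full, one_extra)
```

Because the check below uses exact equality, a newer dataset release with different record counts would be reported as a mismatch even if the conversion itself succeeded.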

In [None]:
from data_io import check_existing_chunks

# Verify bronze conversion completed successfully
print("Bronze Layer Conversion Verification")

# Check business chunks
exists_business, count_business = check_existing_chunks(
    conversion_dir,
    pattern="yelp_business_chunk_*.parquet",
    show_info=False
)

# Check review chunks
exists_reviews, count_reviews = check_existing_chunks(
    conversion_dir,
    pattern="yelp_reviews_chunk_*.parquet",
    show_info=False
)

# Check user chunks
exists_users, count_users = check_existing_chunks(
    conversion_dir,
    pattern="yelp_users_chunk_*.parquet",
    show_info=False
)

print(f"\nBusiness chunks: {count_business} files")
print(f"Review chunks: {count_reviews} files")
print(f"User chunks: {count_users} files")

# Validate expected counts
expected = {'business': 16, 'reviews': 700, 'users': 199}
actual = {'business': count_business, 'reviews': count_reviews, 'users': count_users}

all_valid = all(actual[key] == expected[key] for key in expected)

if all_valid:
    print(f"\n✓ All conversions validated, ready for exploration")
else:
    print(f"\n✗ Conversion mismatch - check Section 3")

### 4B. Load Business Data & Analyze Geographic Distribution
*Load all business chunks to understand state/city coverage*

**Input:** `data/bronze/yelp/01_raw_conversion/yelp_business_chunk_*.parquet` (16 chunks, ~150K businesses)

**Purpose:** Identify which states/cities have sufficient tourism business presence for analysis

**Output:** Geographic distribution to inform city selection

In [None]:
# Load all business chunks
business_chunks = []
business_files = sorted(conversion_dir.glob("yelp_business_chunk_*.parquet"))

for file in business_files:
    df = pd.read_parquet(file)
    business_chunks.append(df)

business_df = pd.concat(business_chunks, ignore_index=True)

print(f"✓ Loaded {len(business_df):,} businesses")
print(f"Columns: {len(business_df.columns)}")
print(f"\nKey columns: {list(business_df.columns[:10])}")

# Quick structure check
print(f"\nDataset shape: {business_df.shape}")
print(f"States represented: {business_df['state'].nunique()}")
print(f"Cities represented: {business_df['city'].nunique()}")

### 4C. Analyze State & City Review Distribution
*Use business metadata to identify states/cities with highest review activity*

**Strategy:** Aggregate review counts from business metadata → identify top states → show leading cities

**Purpose:** Identify optimal geographic target without loading full review dataset

**Output:** Top 5 states with their leading cities by review volume

In [None]:
# Aggregate review counts by city
city_reviews = business_df.groupby(['state', 'city'])['review_count'].sum().reset_index()
city_reviews = city_reviews.sort_values('review_count', ascending=False)

# Aggregate by state with top 2 cities
print("Top 5 States by Review Volume")

state_totals = city_reviews.groupby('state')['review_count'].sum().sort_values(ascending=False)

for state in state_totals.head(5).index:
    state_total = state_totals[state]
    state_cities = city_reviews[city_reviews['state'] == state].head(2)

    print(f"\n  {state}: {state_total:,} total reviews")
    for _, row in state_cities.iterrows():
        print(f"  • {row['city']}: {row['review_count']:,} reviews")

### 4D. Load Reviews & Filter Louisiana 2012-2019
*Load all review chunks, filter to Louisiana 2012-2019, prefix columns with data source*

**Input:** All review chunks (~7M reviews)

**Filters:** Louisiana business reviews between 2012 and 2019

**Column Strategy:** Prefix with `rev_` to identify review-source data

**Output:** Louisiana reviews ready for tourism classification

In [None]:
# Load all review chunks
print("Loading all review chunks (this may take 2-3 minutes)")
review_chunks = []
review_files = sorted(conversion_dir.glob("yelp_reviews_chunk_*.parquet"))

for i, file in enumerate(review_files):
    df = pd.read_parquet(file)
    review_chunks.append(df)

    if (i + 1) % 100 == 0:
        print(f"  Loaded {i + 1}/{len(review_files)} chunks...")

reviews_df = pd.concat(review_chunks, ignore_index=True)
print(f"\n✓ Loaded {len(reviews_df):,} total reviews")

# Get Louisiana business IDs
la_business_ids = set(business_df[business_df['state'] == 'LA']['business_id'])
print(f"Louisiana businesses: {len(la_business_ids):,}")

# Create datetime column for filtering (preserve original date string)
reviews_df['date_dt'] = pd.to_datetime(reviews_df['date'], errors='coerce')

# Filter to Louisiana + 2012-2019 with only needed columns
reviews_la_2012_2019 = reviews_df[
    (reviews_df['business_id'].isin(la_business_ids)) &
    (reviews_df['date_dt'].dt.year >= 2012) &
    (reviews_df['date_dt'].dt.year <= 2019)
][['review_id', 'user_id', 'business_id', 'stars', 'text', 'date', 'date_dt', 'useful']].copy()

# Prefix review columns (keep both date and date_dt)
reviews_la_2012_2019 = reviews_la_2012_2019.rename(columns={
    'review_id': 'rev_id',
    'user_id': 'rev_user_id',
    'business_id': 'rev_business_id',
    'stars': 'rev_stars',
    'text': 'rev_text',
    'date': 'rev_date',           # Original string preserved
    'date_dt': 'rev_date_dt',     # Datetime for analysis
    'useful': 'rev_useful',
})

print(f"\nLouisiana 2012-2019 reviews: {len(reviews_la_2012_2019):,} ({len(reviews_la_2012_2019)/len(reviews_df)*100:.1f}%)")
print(f"Final columns: {list(reviews_la_2012_2019.columns)}")
print(f"\nData integrity check:")
print(f"  Original date format: {reviews_la_2012_2019['rev_date'].iloc[0]} (string)")
print(f"  Datetime column: {reviews_la_2012_2019['rev_date_dt'].iloc[0]} (datetime64)")

### 4E. Apply Tourism Classification to Louisiana Businesses
*Classify businesses using comprehensive category groupings*

**Purpose:** Identify tourism-relevant businesses for final dataset filtering
<br>

---
<br>
<details>
<summary><strong>Normalizing categories before matching</strong> (click to expand)</summary>

- Input categories and dictionary keywords are both normalized (lowercased, trailing 's' removed)
  
- Example: "Walking Tours" → "walking tour", matches keyword "walking tour"
  
- Example: "Restaurants" → "restaurant", matches keyword "restaurant"
</details>

---
<br>
<details>
<summary><strong>Addressing category specificity</strong> (click to expand)</summary>

Keywords like "tour", "walking tour", "boat tour" are not redundant:

- "tour" matches: "Tour", "Tours" (generic tour businesses)
  
- "walking tour" matches: "Walking Tour", "Walking Tours" (specific type)
  
- All options required to ensure full range of business types
</details>
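
The two notes above can be made concrete with a small sketch. It is not the code of `classify_tourism_business`; `normalize` and `classify` are hypothetical names, and the example assumes that Yelp stores `categories` as a single comma-separated string (or a missing value).

```python
def normalize(label: str) -> str:
    """Lowercase, trim, and drop a single trailing 's'."""
    label = label.strip().lower()
    return label[:-1] if label.endswith("s") else label

def classify(categories, groups: dict) -> list:
    """Return the groups whose keywords exactly match a normalized category."""
    if not categories:  # missing category strings yield no groups
        return []
    found = {normalize(part) for part in categories.split(",") if part.strip()}
    return [
        group for group, keywords in groups.items()
        if found & {normalize(keyword) for keyword in keywords}
    ]

demo_groups = {
    "restaurant": ["restaurant", "cafe"],
    "tourism": ["tour", "walking tour"],
}
both = classify("Walking Tours, Restaurants", demo_groups)
none_matched = classify("Auto Repair", demo_groups)
print(both, none_matched, classify(None, demo_groups))
```

Because matching is exact after normalization, "Walking Tours" is caught by the keyword "walking tour" but would not be caught by "tour" alone, which is why both keywords appear in the dictionary.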


---


In [None]:
from yelp_utils import classify_tourism_business
from collections import Counter

# Define comprehensive tourism category groups
category_groups = {
    "restaurant": [
        "restaurant", "food", "coffee & tea", "dessert", "bakery",
        "ice cream & frozen yogurt", "juice bar & smoothie", "gelato",
        "cupcake", "shaved ice", "food truck", "cafe"
    ],

    "hotels_travel": [
        "hotel & travel", "hotels & travel", "hotel", "resort",
        "vacation rental", "bed & breakfast",
        "taxi", "limo", "airport shuttle", "car rental", "airline",
        "transportation", "public transportation", "pedicab"
    ],

    "tourism": [
        "tour", "walking tour", "historical tour", "boat tour",
        "wine tour", "beer tour", "bus tour",
        "museum", "art museum",
        "landmark & historical building", "landmarks & historical building",
        "amusement park", "zoo", "botanical garden", "park",
        "stadium & arena", "stadiums & arena",
        "bike rental", "local flavor"
    ],

    "entertainment": [
        "art & entertainment", "arts & entertainment",
        "music venue", "performing art",
        "comedy club", "jazz & blue", "festival",
        "cinema", "professional sports team"
    ],

    "nightlife": [
        "nightlife", "bar", "pub", "wine bar", "beer bar",
        "beer garden", "cocktail bar", "dive bar", "brewpub",
        "distillery", "winery"
    ],

    "events": [
        "event planning & service", "venue & event space",
        "wedding planning", "party & event planning", "bridal",
        "party bus rental", "custom cake", "dance club"
    ],
}

# Apply tourism classification to Louisiana businesses
la_businesses = business_df[business_df['state'] == 'LA'].copy()
la_businesses['tourism_groups'] = la_businesses['categories'].apply(
    lambda x: classify_tourism_business(x, category_groups)
)

# Count businesses by tourism group
classified = la_businesses[la_businesses['tourism_groups'].apply(len) > 0]
print(f"Classified as tourism: {len(classified):,} / {len(la_businesses):,} ({len(classified)/len(la_businesses)*100:.1f}%)")

all_groups = [group for groups in classified['tourism_groups'] for group in groups]
group_counts = Counter(all_groups)
for group, count in group_counts.most_common():
    print(f"  {group}: {count:,} businesses")

unclassified = la_businesses[la_businesses['tourism_groups'].apply(len) == 0]
print(f"\nUnclassified: {len(unclassified):,} businesses")

### 4F. Filter Reviews to Tourism Businesses Only
*Reduce review dataset to only tourism-classified businesses*

**Input:** Louisiana reviews (2012-2019) with `rev_` prefixed columns

**Purpose:** Final dataset focuses exclusively on tourism sentiment

---
<br>
<details>
<summary><strong>Filter and combine handling</strong> (click to expand)</summary>

**Filter:** Keep only reviews for tourism businesses (classified in 4E)

**Join Strategy:** Match `rev_business_id` to tourism business IDs

**Data Integrity:** No column modifications - pure filtering operation

</details>

---


In [None]:
# Get tourism business IDs
tourism_business_ids = set(classified['business_id'])

print(f"Tourism businesses: {len(tourism_business_ids):,}")

# Filter reviews to tourism businesses only
reviews_tourism = reviews_la_2012_2019[
    reviews_la_2012_2019['rev_business_id'].isin(tourism_business_ids)
].copy()

print(f"\nReview Filtering")
print(f"  Before: {len(reviews_la_2012_2019):,} reviews")
print(f"  After: {len(reviews_tourism):,} reviews")
print(f"  Reduction: {(1 - len(reviews_tourism)/len(reviews_la_2012_2019))*100:.1f}%")

# Analyze city distribution in filtered reviews
tourism_business_with_city = classified[['business_id', 'city']]
reviews_with_city = reviews_tourism.merge(
    tourism_business_with_city,
    left_on='rev_business_id',
    right_on='business_id',
    how='left'
)

print(f"\nCity distribution (tourism reviews only)")
city_dist = reviews_with_city['city'].value_counts().head(5)
for city, count in city_dist.items():
    pct = (count / len(reviews_with_city)) * 100
    print(f"  {city}: {count:,} reviews ({pct:.1f}%)")

### 4G. Load User Data & Filter to Tourism Reviewers
*Load user chunks and filter to users who reviewed tourism businesses*

**Input:** User chunks (199 chunks, ~2M users)

**Filter:** Keep only users present in tourism review dataset (identified by `rev_user_id`)

**Purpose:** Reduce user dataset before final merge, minimize memory footprint

---
<br>
<details>
<summary><strong>Column handling</strong> (click to expand)</summary>

**Columns Kept:** 5 essential user attributes only

- `usr_id`, `usr_review_count`, `usr_yelping_since`, `usr_useful`, `usr_average_stars`

**Column Strategy:** Prefix with `usr_` for source identification

</details>

---

In [None]:
# Get unique user IDs from tourism reviews
tourism_user_ids = set(reviews_tourism['rev_user_id'])
print(f"Unique users in tourism reviews: {len(tourism_user_ids):,}")

# Check if user data already loaded and processed
if 'users_tourism' in locals() and 'usr_id' in users_tourism.columns:
    print("\n✓ User data already loaded and processed")
    print(f"Users: {len(users_tourism):,}")
    print(f"Columns: {list(users_tourism.columns)}")
else:
    # Load and filter user chunks
    print("\nLoading and filtering user data")
    user_chunks = []
    user_files = sorted(conversion_dir.glob("yelp_users_chunk_*.parquet"))

    for i, file in enumerate(user_files):
        df = pd.read_parquet(file)
        # Filter to tourism users only
        df_filtered = df[df['user_id'].isin(tourism_user_ids)]
        if len(df_filtered) > 0:
            user_chunks.append(df_filtered)

        if (i + 1) % 50 == 0:
            print(f"  Processed {i + 1}/{len(user_files)} chunks...")

    users_tourism = pd.concat(user_chunks, ignore_index=True)

    # Keep only needed user columns
    users_tourism = users_tourism[[
        'user_id', 'review_count', 'yelping_since', 'useful', 'average_stars'
    ]].copy()

    # Rename with usr_ prefix
    users_tourism = users_tourism.rename(columns={
        'user_id': 'usr_id',
        'review_count': 'usr_review_count',
        'yelping_since': 'usr_yelping_since',
        'useful': 'usr_useful',
        'average_stars': 'usr_average_stars'
    })

    print(f"\n✓ Loaded {len(users_tourism):,} tourism users")
    print(f"  Columns: {list(users_tourism.columns)}")

### 4H. Merge Reviews, Business, and User Data
*Combine three datasets with source prefixes in staged approach*

**Output:** Complete tourism dataset with clear data provenance (`rev_`, `bus_`, `usr_` prefixes)

---
<br>
<details>
<summary><strong>Merge handling with data integrity</strong> (click to expand)</summary>

**Step 1:** Prepare business data with `bus_` prefix, keeping only the nine columns needed for analysis (drops `state`, `hours`, `is_open`, and other unused fields)

**Step 2:** Merge reviews + business data

**Step 3:** Merge result + user data

**Data Integrity:** All merges use left join to preserve all review records

</details>
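
A left join preserves every review row only if the right-hand keys are unique. The toy example below sketches how pandas can enforce and report this using the `validate` and `indicator` parameters of `DataFrame.merge`; the data is invented for illustration, and the notebook's own merge cell does not use these options.

```python
import pandas as pd

reviews = pd.DataFrame({
    "rev_id": ["r1", "r2", "r3"],
    "rev_business_id": ["b1", "b1", "b9"],   # b9 has no business row
})
businesses = pd.DataFrame({
    "bus_id": ["b1", "b2"],
    "bus_name": ["Cafe A", "Tour B"],
})

merged_demo = reviews.merge(
    businesses,
    left_on="rev_business_id",
    right_on="bus_id",
    how="left",
    validate="many_to_one",   # raises MergeError if bus_id is duplicated
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

rows_preserved = len(merged_demo) == len(reviews)
unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(rows_preserved, unmatched)
```

Here the row count is unchanged and one review is flagged as unmatched, which is the same information the notebook's integrity check derives from counting non-null `bus_name` values.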

---


In [None]:
print("Preparing datasets for merge...")
print()

# Check if merge already completed
if 'merged' in locals() and 'bus_name' in merged.columns and 'usr_review_count' in merged.columns:
    print("✓ Merge already completed")
    print(f"Rows: {len(merged):,}")
    print(f"Columns: {len(merged.columns)}")
    print()

    print("Column structure")
    rev_cols = [c for c in merged.columns if c.startswith('rev_')]
    bus_cols = [c for c in merged.columns if c.startswith('bus_')]
    usr_cols = [c for c in merged.columns if c.startswith('usr_')]
    print(f"  rev_ columns: {len(rev_cols)}")
    print(f"  bus_ columns: {len(bus_cols)}")
    print(f"  usr_ columns: {len(usr_cols)}")
else:
    # Prepare business data with renamed columns
    business_tourism = classified[[
        'business_id', 'name', 'city', 'latitude', 'longitude',
        'stars', 'review_count', 'categories', 'postal_code'
    ]].copy()

    business_tourism = business_tourism.rename(columns={
        'business_id': 'bus_id',
        'name': 'bus_name',
        'city': 'bus_city',
        'latitude': 'bus_latitude',
        'longitude': 'bus_longitude',
        'stars': 'bus_average_stars',
        'review_count': 'bus_review_count',
        'categories': 'bus_categories',
        'postal_code': 'bus_postal_code'
    })

    print("Merging datasets...")
    print()

    # Merge reviews + business
    merged = reviews_tourism.merge(
        business_tourism,
        left_on='rev_business_id',
        right_on='bus_id',
        how='left'
    )

    # Merge + users (already has usr_ prefix from 4G)
    merged = merged.merge(
        users_tourism,
        left_on='rev_user_id',
        right_on='usr_id',
        how='left'
    )

    print("Merge complete")
    print(f"  Final rows: {len(merged):,}")
    print(f"  Final columns: {len(merged.columns)}")
    print()

    # Verify column structure
    rev_cols = [c for c in merged.columns if c.startswith('rev_')]
    bus_cols = [c for c in merged.columns if c.startswith('bus_')]
    usr_cols = [c for c in merged.columns if c.startswith('usr_')]

    print("Column structure")
    print(f"  rev_ columns: {len(rev_cols)}")
    print(f"  bus_ columns: {len(bus_cols)}")
    print(f"  usr_ columns: {len(usr_cols)}")
    print()

    # Verify no data loss
    business_match = merged['bus_name'].notna().sum()
    user_match = merged['usr_review_count'].notna().sum()

    print("Data integrity check")
    print(f"  Reviews matched to business: {business_match:,} ({business_match/len(merged)*100:.1f}%)")
    print(f"  Reviews matched to user: {user_match:,} ({user_match/len(merged)*100:.1f}%)")

## 5. Primary Filter: Save Louisiana Tourism Dataset
*Save merged dataset to bronze primary filter*

**Input:** Tourism reviews with business and user data

**Output:** `data/bronze/yelp/02_primary_filter/louisiana_tourism_2012_2019.parquet`


---
<details>
<summary><strong>Dataset Summary</strong> (click to expand)</summary>

- Louisiana tourism businesses only
  
- Years: 2012-2019

- Complete merge: Reviews + Business + User data
  
- Original data integrity preserved (no type conversion overwrites)

</details>

---


### 5A. Calculate User Review Counts by Year
*Tally each user's Louisiana tourism reviews per year for reviewer classification*

**Purpose:** Identify high-volume reviewers (*"locals"* vs. *"tourists"*) by analyzing annual Louisiana review patterns

---
<details>
<summary><strong>Strategy and Use Case</strong> (click to expand)</summary>

**Strategy:** Create pivot table of user review counts per year

- Columns: `usr_2012_la_rev_count`, `usr_2013_la_rev_count`, ..., `usr_2019_la_rev_count`

- Values: Number of reviews each user wrote for Louisiana tourism businesses in that year

**Use Case:** Enable reviewer classification during analysis


</details>
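
To show what the resulting table looks like, here is a toy version of the same reshaping on invented data. It uses `groupby(...).size().unstack()` as an equivalent shortcut to the groupby-then-pivot sequence in the cell below; the user IDs and years are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "rev_user_id": ["u1", "u1", "u1", "u2"],
    "rev_year":    [2013, 2013, 2016, 2018],
})

counts = (
    toy.groupby(["rev_user_id", "rev_year"]).size()
       .unstack(fill_value=0)                          # years become columns
       .rename(columns=lambda year: f"usr_{year}_la_rev_count")
       .reset_index()
)
counts.columns.name = None
print(counts.to_dict("records"))
```

Years that never occur in the data (2012 in this toy case) produce no column at all, which is why downstream code should check that a year column exists before using it.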

---


In [None]:
print("Calculating user review counts by year...")
print()

# Extract year from review date for pivot
merged['rev_year'] = merged['rev_date_dt'].dt.year

# Create pivot: users × years
print("Creating user-year pivot table...")
user_year_pivot = merged.groupby(['rev_user_id', 'rev_year']).size().reset_index(name='year_review_count')
user_year_pivot = user_year_pivot.pivot(
    index='rev_user_id',
    columns='rev_year',
    values='year_review_count'
).fillna(0).astype(int)

# Rename columns with usr_ prefix and _la_rev_count suffix
year_columns = {year: f'usr_{int(year)}_la_rev_count' for year in user_year_pivot.columns}
user_year_pivot = user_year_pivot.rename(columns=year_columns)
user_year_pivot = user_year_pivot.reset_index()

print()
print("User-year pivot created")
print(f"  Unique users: {len(user_year_pivot):,}")
year_cols = [col for col in user_year_pivot.columns if col.startswith('usr_')]
print(f"  User review count/year columns: {year_cols}")
print()

# Merge back to main dataset
print("Merging annual counts back to dataset...")
merged = merged.merge(user_year_pivot, on='rev_user_id', how='left')

# Drop temporary rev_year column
merged = merged.drop(columns=['rev_year'])

print()
print("Merged dataset updated")
print(f"  Total columns: {len(merged.columns)}")
print(f"  New year count columns: {len([c for c in merged.columns if '_la_rev_count' in c])}")
print()

# Analyze high-volume reviewers across all years
# (merged is review-level, so count distinct users rather than rows)
print("High-Volume Reviewer Analysis")
for year in range(2012, 2020):
    col_name = f'usr_{year}_la_rev_count'
    if col_name in merged.columns:
        high_volume = merged.loc[merged[col_name] > 50, 'rev_user_id'].nunique()
        if high_volume > 0:
            max_count = merged[col_name].max()
            print(f"  {year}: {high_volume:,} users with >50 reviews (max: {max_count})")
print()

print("✓ Annual review counts calculated and merged")

### 5B. Primary Filter: Save Louisiana Tourism Dataset
*Save the merged dataset to the bronze primary filter directory*

**Input:** 547,432 tourism reviews with complete business, user, and annual count data

**Output:** `data/bronze/yelp/02_primary_filter/louisiana_tourism_2012_2019.parquet`


---
<details>
<summary><strong>Dataset Summary</strong> (click to expand)</summary>

- Louisiana tourism businesses only (6,307 businesses)

- Years: 2012-2019

- Complete merge: Reviews (`rev_`) + Business (`bus_`) + User (`usr_`) + Annual counts

- Original data integrity preserved (source date strings maintained, datetime in separate column)


</details>
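
The "source date strings maintained, datetime in separate column" point can be illustrated with a small sketch. The parsing step itself happens earlier in the notebook and is not shown in this section, so the `pd.to_datetime(..., errors="coerce")` call and the sample timestamps below are assumptions.

```python
import pandas as pd

# Toy rows; the timestamp strings are illustrative only
toy = pd.DataFrame({"rev_date": ["2016-02-07 18:30:00", "2018-05-07 09:15:42"]})

# Parse into a new column; the source string column is left untouched
toy["rev_date_dt"] = pd.to_datetime(toy["rev_date"], errors="coerce")

print(toy)
```

Keeping both columns means the original string can always be re-parsed if a different datetime handling is needed later.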

---

<details>
<summary><strong>Column Structure</strong> (click to expand)</summary>


- `rev_id`, `rev_user_id`, `rev_business_id`, `rev_stars`, `rev_text`, `rev_date`, `rev_date_dt`, `rev_useful`

- `bus_id`, `bus_name`, `bus_city`, `bus_latitude`, `bus_longitude`, `bus_average_stars`, `bus_review_count`, `bus_categories`, `bus_postal_code`

- `usr_id`, `usr_review_count`, `usr_yelping_since`, `usr_useful`, `usr_average_stars`

- `usr_2012_la_rev_count` through `usr_2019_la_rev_count`


</details>

---

In [None]:
from data_io import check_existing_file

# Set up primary filter output
output_file = primary_filter_dir / "louisiana_tourism_2012_2019.parquet"

# Check if already saved
exists, info = check_existing_file(output_file, file_type='parquet', show_info=False)

if not exists:
    print("Saving Louisiana tourism dataset...")

    merged.to_parquet(output_file, compression="snappy", index=False)

    file_size_mb = output_file.stat().st_size / (1024 * 1024)

    print()
    print("✓ Dataset saved successfully")
    print(f"Location: {output_file.name}")
    print(f"Directory: {output_file.parent}")
    print(f"Size: {file_size_mb:.1f} MB")
    print(f"Rows: {len(merged):,}")
    print(f"Columns: {len(merged.columns)}")
    print()

    # Column breakdown
    rev_cols = [c for c in merged.columns if c.startswith('rev_')]
    bus_cols = [c for c in merged.columns if c.startswith('bus_')]
    usr_cols = [c for c in merged.columns if c.startswith('usr_') and '_la_rev_count' not in c]
    year_cols = [c for c in merged.columns if '_la_rev_count' in c]

    print("Column summary")
    print(f"  rev_ columns: {len(rev_cols)}")
    print(f"  bus_ columns: {len(bus_cols)}")
    print(f"  usr_ columns: {len(usr_cols)}")
    print(f"  year count columns: {len(year_cols)}")

else:
    print(f"[SKIP] Primary filter already exists: {output_file.name}")
    print(f"Size: {info['size_mb']:.1f} MB")

print()
print("Primary filter complete, ready for year selection")

## 6. Load Primary Filter & Year Selection
*Load Louisiana tourism dataset and configure target year for analysis*

**Input:** `data/bronze/yelp/02_primary_filter/louisiana_tourism_2012_2019.parquet` (217.9 MB)

**Strategy:** Load → Verify → Select year → Validate → Ready for silver save

**Target Years:** 2013 (Super Bowl), 2016 (Baseline), 2018 (Tricentennial)


### 6A. Memory Cleanup
*Clear intermediate DataFrames from memory before loading primary filter*

**Purpose:** Free memory and ensure clean slate for year selection workflow

**Action:** Remove large DataFrames from Sections 4-5, force garbage collection

In [None]:
print("Cleaning up intermediate data from memory...")

# List of variables to clean up
cleanup_vars = [
    'merged',
    'reviews_tourism',
    'reviews_la_2012_2019',
    'business_df',
    'la_businesses',
    'classified',
    'users_tourism',
    'business_tourism',
    'user_year_pivot',
    'user_year_counts'
]

# Iterate and delete if exists
cleanup_count = 0
for var in cleanup_vars:
    if var in globals():
        del globals()[var]
        cleanup_count += 1

# Force garbage collection
import gc
gc.collect()

print()
print(f"✓ Cleaned {cleanup_count} variables from memory")
print("Ready to load primary filter fresh from disk")

### 6B. Load Primary Filter & Verify Dataset
*Load saved Louisiana tourism dataset and confirm structure using shared utilities*

**Input:** `data/bronze/yelp/02_primary_filter/louisiana_tourism_2012_2019.parquet`

**Uses:** `check_existing_file()` from data_io.py for consistent file handling

**Purpose:** Verify primary filtering results and prepare for year selection


---
<details>
<summary><strong>Expected Structure</strong> (click to expand)</summary>

- 547,432 tourism reviews

- 30 total columns (8 rev_, 9 bus_, 5 usr_, 8 year counts)

- Years: 2012-2019 coverage
</details>
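
The expected column breakdown can also be checked programmatically. The sketch below is one possible approach: `count_by_group` and `structure_mismatches` are hypothetical helper names (not part of `data_io.py` or `data_validation.py`), and the column list is copied from the Section 5B column structure.

```python
# Expected prefix counts, taken from the structure listed above
EXPECTED = {"rev": 8, "bus": 9, "usr": 5, "year": 8}


def count_by_group(columns):
    """Bucket column names into rev_/bus_/usr_/annual-count groups."""
    groups = {"rev": 0, "bus": 0, "usr": 0, "year": 0}
    for col in columns:
        if col.endswith("_la_rev_count"):
            groups["year"] += 1
        elif col[:4] in ("rev_", "bus_", "usr_"):
            groups[col[:3]] += 1
    return groups


def structure_mismatches(columns, expected=EXPECTED):
    """Return {group: (expected, actual)} for every group that differs."""
    actual = count_by_group(columns)
    return {k: (expected[k], actual[k]) for k in expected if expected[k] != actual[k]}


# Column names from the Section 5B listing
columns = [
    "rev_id", "rev_user_id", "rev_business_id", "rev_stars",
    "rev_text", "rev_date", "rev_date_dt", "rev_useful",
    "bus_id", "bus_name", "bus_city", "bus_latitude", "bus_longitude",
    "bus_average_stars", "bus_review_count", "bus_categories", "bus_postal_code",
    "usr_id", "usr_review_count", "usr_yelping_since", "usr_useful", "usr_average_stars",
] + [f"usr_{year}_la_rev_count" for year in range(2012, 2020)]

print(len(columns), structure_mismatches(columns))
```

An empty dictionary means the loaded file matches the expected 8/9/5/8 split; any entry points at the group that drifted.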

---


In [None]:
from data_io import check_existing_file

print("Load Primary Filter & Verify Dataset")
print()

# Load primary filter dataset
primary_filter_file = primary_filter_dir / "louisiana_tourism_2012_2019.parquet"

# Check if file exists using shared utility
exists, info = check_existing_file(primary_filter_file, file_type='parquet', show_info=False)

if not exists:
    print("✗ Primary filter file not found - run Section 5 first")
    print(f"Expected location: {primary_filter_file}")
else:
    # Load dataset
    exploration_df = pd.read_parquet(primary_filter_file)

    print("✓ Dataset loaded successfully")
    print(f"File: {primary_filter_file.name}")
    print(f"Directory: {primary_filter_file.parent}")
    print(f"Size: {info['size_mb']:.1f} MB")
    print()

    print("Dataset shape")
    print(f"  Rows: {len(exploration_df):,}")
    print(f"  Columns: {len(exploration_df.columns)}")
    print()

    # Verify column structure
    rev_cols = [c for c in exploration_df.columns if c.startswith('rev_')]
    bus_cols = [c for c in exploration_df.columns if c.startswith('bus_')]
    usr_cols = [c for c in exploration_df.columns if c.startswith('usr_') and '_la_rev_count' not in c]
    year_cols = [c for c in exploration_df.columns if '_la_rev_count' in c]

    print("Column summary")
    print(f"  rev_ columns: {len(rev_cols)}")
    print(f"  bus_ columns: {len(bus_cols)}")
    print(f"  usr_ columns: {len(usr_cols)}")
    print(f"  year count columns: {len(year_cols)}")
    print()

    # Year distribution
    year_dist = exploration_df['rev_date_dt'].dt.year.value_counts().sort_index()

    print("Year distribution")
    for year, count in year_dist.items():
        print(f"  {year}: {count:,} reviews")
    print()

    print("✓ Primary filter verified - ready for year selection")

### 6C. Configure Target Year for Analysis
*Select specific year for detailed tourism event analysis*

**Available Years:** 2012-2019 with strong representation

**Strategy:** Process one year at a time for focused analysis

**Configuration:** Set the `TARGET_YEAR` variable in the cell below, then rerun from that cell to process a different year


---
<details>
<summary><strong>Project Target Years</strong> (click to expand)</summary>

- **2013:** Super Bowl XLVII (February) - massive tourism influx (40,084 reviews)

- **2016:** Baseline year - normal seasonal patterns (77,097 reviews)

- **2018:** New Orleans Tricentennial - 300th anniversary (97,887 reviews)
</details>

---

In [None]:
# USER CONFIGURATION - Target Year Selection
TARGET_YEAR = 2016          # Options: 2013, 2016, 2018 (or any 2012-2019)

print("Target Year Configuration")
print()
print(f"Selected year: {TARGET_YEAR}")
print("To change: update the TARGET_YEAR variable above and rerun from here")
print()

# Validate target year availability
available_years = sorted(year_dist.index.tolist())

if TARGET_YEAR not in available_years:
    print(f"✗ Target year {TARGET_YEAR} not available")
    print(f"Available years: {available_years}")
else:
    # Filter to target year
    year_mask = exploration_df['rev_date_dt'].dt.year == TARGET_YEAR
    year_df = exploration_df[year_mask].copy()

    print(f"✓ Target year {TARGET_YEAR} is available")
    print(f"Records: {len(year_df):,}")
    print()

    # Show year context
    print("Year context")
    for year in available_years:
        count = year_dist[year]
        marker = " <- TARGET" if year == TARGET_YEAR else ""
        print(f"  {year}: {count:,} reviews{marker}")
    print()

    # Basic statistics for target year
    print("Target year summary")
    print(f"  Unique users: {year_df['rev_user_id'].nunique():,}")
    print(f"  Unique businesses: {year_df['rev_business_id'].nunique():,}")
    print(f"  Cities: {year_df['bus_city'].nunique()}")
    print(f"  Average rating: {year_df['rev_stars'].mean():.2f}")
    print()

    print("✓ Configuration complete - ready for exploratory analysis")

### 6D. Exploratory Validation of Target Year (Optional)
*Analyze data richness and distribution for selected year*

**Purpose:** Ensure selected year has sufficient data quality for sentiment analysis

**Note:** Optional section - can skip to Section 7 for silver save

---
<details>
<summary><strong>Annual Dataset Checks</strong> (click to expand)</summary>

- Seasonal distribution (ensure all seasons represented)

- City coverage (confirm New Orleans dominance)

- Review completeness (text availability)

- Rating distribution

- High-volume reviewer identification
</details>
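
The checklist asks for seasonal coverage, while the cell below reports counts per month. A minimal sketch of rolling months up into seasons follows; the meteorological season boundaries (December to February as winter, and so on) and the helper names `season_counts` and `missing_seasons` are assumptions, since the notebook does not define seasons.

```python
from collections import Counter

# Assumed meteorological seasons for the Northern Hemisphere
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def season_counts(months):
    """Count reviews per season from an iterable of month numbers (1-12)."""
    return Counter(SEASONS[m] for m in months)


def missing_seasons(months):
    """Return the seasons that have no reviews at all."""
    return sorted(set(SEASONS.values()) - set(season_counts(months)))


sample_months = [1, 2, 3, 7]  # illustrative month values only
print(season_counts(sample_months))
print(missing_seasons(sample_months))
```

With a real year, `year_df['rev_date_dt'].dt.month` could be passed in place of `sample_months`.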

---


In [None]:
print(f"Exploratory Analysis - {TARGET_YEAR}")
print()

# 1. Seasonal distribution
year_df['month'] = year_df['rev_date_dt'].dt.month

print("Monthly distribution")
monthly_counts = year_df['month'].value_counts().sort_index()
for month in range(1, 13):
    if month in monthly_counts.index:
        count = monthly_counts[month]
        pct = count / len(year_df) * 100
        print(f"  Month {month:2d}: {count:5,} reviews ({pct:4.1f}%)")
print()

# 2. City distribution
print("City distribution")
city_dist = year_df['bus_city'].value_counts().head(5)
for city, count in city_dist.items():
    pct = count / len(year_df) * 100
    print(f"  {city}: {count:,} reviews ({pct:.1f}%)")
print()

# 3. Review completeness
text_available = year_df['rev_text'].notna().sum()
text_pct = text_available / len(year_df) * 100
print("Review completeness")
print(f"  Text available: {text_available:,} ({text_pct:.1f}%)")
print(f"  Text missing: {len(year_df) - text_available:,} ({100-text_pct:.1f}%)")
print()

# 4. Rating distribution
print("Rating distribution")
rating_dist = year_df['rev_stars'].value_counts().sort_index()
for stars, count in rating_dist.items():
    pct = count / len(year_df) * 100
    bar = "█" * int(pct / 2)  # Visual bar
    print(f"  {int(stars)} stars: {count:5,} ({pct:4.1f}%) {bar}")
print()

# 5. High-volume reviewers in target year
year_count_col = f'usr_{TARGET_YEAR}_la_rev_count'
if year_count_col in year_df.columns:
    high_volume_mask = year_df[year_count_col] > 50
    high_volume_count = high_volume_mask.sum()
    high_volume_users = year_df[high_volume_mask]['rev_user_id'].nunique()

    print(f"High-volume reviewers (>50 reviews in {TARGET_YEAR})")
    print(f"  Unique users: {high_volume_users:,}")
    print(f"  Total reviews from these users: {high_volume_count:,} ({high_volume_count/len(year_df)*100:.1f}%)")
print()

# Drop temporary month column
year_df = year_df.drop(columns=['month'])

print("✓ Exploratory validation complete")
print("Ready for Section 7: Silver Layer Save")

---

<u>***Year Configuration Checkpoint***</u>

**Project Scope:** New Orleans tourism events (2013 Super Bowl, 2016 Baseline, 2018 Tricentennial)

**Reset Point:** To process a different year, update `TARGET_YEAR` in Section 6C and rerun from that cell


---

## 7. Silver Layer: Save Target Year to Staging
*Save validated year-specific dataset to silver staging for gold layer integration*

**Input:** Validated year data from Section 6

**Output:** `data/silver/yelp/staging/new_orleans_{TARGET_YEAR}_final.parquet`

**Purpose:** Create year-specific dataset ready for gold layer processing (where reviewer classification will occur)

In [None]:
from data_io import check_existing_file

print(f"Silver Layer Save - New Orleans {TARGET_YEAR}")
print()

# Set up silver staging output file
output_file = silver_staging / f"new_orleans_{TARGET_YEAR}_final.parquet"

# Check if save already completed
exists, info = check_existing_file(output_file, file_type='parquet', show_info=False)

if not exists:
    print("Saving to silver staging...")

    # Save year-specific data to silver
    year_df.to_parquet(output_file, compression="snappy", index=False)

    file_size_mb = output_file.stat().st_size / (1024 * 1024)

    print()
    print("✓ Silver layer save complete")
    print(f"Location: {output_file.name}")
    print(f"Directory: {output_file.parent}")
    print(f"Size: {file_size_mb:.1f} MB")
    print(f"Rows: {len(year_df):,}")
    print(f"Columns: {len(year_df.columns)}")
    print()

    # Show key statistics
    print("Dataset summary")
    print(f"  City: New Orleans (+ metro area)")
    print(f"  Year: {TARGET_YEAR}")
    print(f"  Tourism businesses: {year_df['rev_business_id'].nunique():,}")
    print(f"  Reviewers: {year_df['rev_user_id'].nunique():,}")
    print(f"  Average rating: {year_df['rev_stars'].mean():.2f}")

else:
    print(f"[SKIP] Silver staging file already exists: {output_file.name}")
    print(f"Size: {info['size_mb']:.1f} MB")

print()
print("Available silver staging files")
all_files = sorted(silver_staging.glob("new_orleans_*_final.parquet"))
if all_files:
    for file in all_files:
        size_mb = file.stat().st_size / (1024 * 1024)
        marker = " <- CURRENT" if file == output_file else ""
        print(f"  ✓ {file.name} ({size_mb:.1f} MB){marker}")
else:
    print("  No files found")

print()
print("✓ Ready for final verification")

## 8. Final Verification & Summary
*Comprehensive dataset summary and workflow completion status*

**Uses:** `print_final_summary()` from data_validation.py

**Purpose:** Verify saved dataset integrity and display workflow status

In [None]:
from data_validation import print_final_summary

print(f"Final Verification - New Orleans {TARGET_YEAR}")
print()

# Print comprehensive final summary
print_final_summary(
    output_file,
    dataset_name=f"Yelp New Orleans {TARGET_YEAR}",
    file_type='parquet'
)

print()

# Workflow completion status
print("Workflow Status")
print(f"  City: New Orleans ✓")
print(f"  Year: {TARGET_YEAR} ✓")
print(f"  Reviews: {len(year_df):,} ✓")
print(f"  File: {output_file.name} ✓")
print()

print("To process different year")
print(f"  1. Update TARGET_YEAR variable in Section 6C")
print(f"  2. Rerun Sections 6C-8")
print(f"  3. Produces: new_orleans_[YEAR]_final.parquet")
print()

print("Project scope: Process years 2013, 2016, 2018 for comparative analysis")

## 9. Workflow Completion

**Current Status:**
Bronze → Silver workflow complete for New Orleans tourism dataset

**Action:** Review next steps and clean memory when ready to proceed

### 9A. Next Steps: Additional Years & Gold Layer Processing
*Review workflow iteration strategy and upcoming gold layer integration*

---
<br>
<details>
<summary><strong>Remaining target years</strong> (click to expand)</summary>

**Project scope:** Three years for comparative tourism event analysis

- **2013:** Super Bowl XLVII (February) - massive tourism influx (40,084 reviews)
  
- **2016:** Normal tourism baseline - regular seasonal patterns (77,097 reviews)
  
- **2018:** New Orleans Tricentennial - 300th anniversary celebrations (97,887 reviews)

**To process additional years:**

1. Update `TARGET_YEAR` variable in Section 6C

2. Rerun Sections 6C-8 for each year

3. Run Section 9B after all years processed
</details>
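
To see which target years still need processing, the silver staging directory can be scanned for the `new_orleans_{TARGET_YEAR}_final.parquet` naming pattern. This is a sketch only: `processed_years` and `remaining_years` are hypothetical helpers, and the demo runs against a temporary directory with an empty placeholder file rather than real staging data.

```python
import tempfile
from pathlib import Path

TARGET_YEARS = (2013, 2016, 2018)


def processed_years(staging_dir):
    """Collect years that already have a silver staging file."""
    years = set()
    for path in Path(staging_dir).glob("new_orleans_*_final.parquet"):
        token = path.stem.split("_")[2]  # new_orleans_<year>_final
        if token.isdigit():
            years.add(int(token))
    return years


def remaining_years(staging_dir, targets=TARGET_YEARS):
    """Return target years with no staging file yet, in target order."""
    done = processed_years(staging_dir)
    return [year for year in targets if year not in done]


# Demo against a throwaway directory holding one empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "new_orleans_2016_final.parquet").touch()
    demo_remaining = remaining_years(tmp)

print(demo_remaining)
```

In the notebook, `silver_staging` would be passed as `staging_dir`.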

---
<br>
<details>
<summary><strong>Gold layer integration strategy</strong> (click to expand)</summary>

**Multi-dataset analysis:** All processed silver datasets (TripAdvisor NYC, Yelp New Orleans, AirBnB LA/Chicago) will be explored for schema compatibility

**Reviewer classification:** Three-tier system (locals/frequent visitors/tourists) will be implemented in gold layer using `usr_YYYY_la_rev_count` columns

**Data enhancements:**

- Null value handling with appropriate strategies

- Type conversions for analysis efficiency

- Tourism event markers (Super Bowl, Tricentennial, etc.)

- Seasonal analysis standardization

**Analysis-ready format:** Final gold datasets optimized for sentiment analysis and tourism correlation modeling
</details>
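
The three-tier classification is deferred to the gold layer, and its cutoffs are not defined in this notebook. The sketch below shows only the general shape of such a rule. Both thresholds are placeholders: `local_min=51` mirrors the ">50 reviews" high-volume check used earlier, and `frequent_min=5` is an arbitrary illustrative value.

```python
def classify_reviewer(annual_count, local_min=51, frequent_min=5):
    """Map a usr_YYYY_la_rev_count value to a reviewer tier.

    Thresholds are hypothetical placeholders, not project decisions.
    """
    if annual_count >= local_min:
        return "local"
    if annual_count >= frequent_min:
        return "frequent_visitor"
    return "tourist"


for count in (60, 10, 1):
    print(count, classify_reviewer(count))
```

Keeping the thresholds as parameters makes it straightforward to tune them once the gold layer analysis settles on actual cutoffs.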

---
<br>
<details>
<summary><strong>Gold processing pipeline</strong> (click to expand)</summary>

**Workflow steps:**

1. Load all silver year datasets (2013, 2016, 2018)
   
2. Implement reviewer classification using annual review patterns
   
3. Add tourism event context markers
   
4. Standardize temporal features across datasets
   
5. Create unified schemas for cross-platform comparison
   
6. Output analysis-ready gold datasets for sentiment analysis

**Integration considerations:**

- Schema alignment across TripAdvisor, Yelp, AirBnB datasets

- Common temporal features (year, season, event proximity)

- Standardized reviewer type classifications

- Tourism volume correlation preparation
</details>
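
Step 3 (tourism event context markers) could be as simple as a year-to-event lookup. The mapping below uses the three project years named in this notebook; the `EVENTS` dictionary, the `event_marker` helper, and the `"unmarked"` fallback label are assumptions for illustration.

```python
# Event labels for the three project years described in this notebook
EVENTS = {
    2013: "Super Bowl XLVII",
    2016: "Baseline",
    2018: "New Orleans Tricentennial",
}


def event_marker(year):
    """Return the event label for a year, or 'unmarked' if none is defined."""
    return EVENTS.get(year, "unmarked")


for year in (2013, 2016, 2018, 2015):
    print(year, event_marker(year))
```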

---

### 9B. Memory Cleanup & Storage Summary
*Clear year selection data and review storage optimization options*

**Purpose:** Free memory and identify intermediate files for optional cleanup

**When to run:** After processing all target years (2013, 2016, 2018)

**Storage optimization:** Review bronze layer intermediate files and cleanup options

In [None]:
print("Memory Cleanup & Storage Summary")
print()

# List of variables to clean up
cleanup_vars = ['exploration_df', 'year_df', 'year_dist']

# Iterate and delete if exists
cleanup_count = 0
for var in cleanup_vars:
    if var in globals():
        del globals()[var]
        cleanup_count += 1

# Force garbage collection
import gc
gc.collect()

print(f"✓ Cleaned {cleanup_count} variables from memory")
print()

# Show storage breakdown
bronze_base = project_root / "data" / "bronze" / "yelp"
print_storage_summary(bronze_base, silver_staging, dataset_name="Yelp New Orleans")

print()
print("Workflow complete - ready for gold layer processing")