# Tourism Sentiment Analysis - AirBnB Data Extraction

**Project:** Tourism Sentiment Analysis

**Task:** Data Extraction & Processing

**Dataset Source:** InsideAirbnb

**Focus:** Chicago/Los Angeles, 2022-2024, Post-COVID Analysis

**Source:** https://insideairbnb.com/get-the-data/

---
<br>
<details>
<summary><strong>Dataset Update Warning</strong> (click to expand)</summary>



InsideAirbnb regularly updates their datasets with new timestamps.

Download URLs contain dates that change over time (e.g., `2025-06-17` → `2025-12-XX`).

Historical data may be removed or restructured.

This notebook includes validation to detect significant dataset changes.
</details>

---
<br>
<details>
<summary><strong>Post-COVID Focus</strong> (click to expand)</summary>

This analysis focuses on post-pandemic tourism patterns (2022+) to understand how tourism volume fluctuations affect customer sentiment in hospitality reviews.

The workflow filters out pre-2022 data as the first major data reduction step.

</details>

---

## 1. Setup & Configuration
*Import libraries, set up project paths, create directory structure*

### 1A. Imports & Script Setup
*Load required packages and configure script imports*

In [None]:
# Core data processing
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# File handling & web requests
import requests
from pathlib import Path

# System utilities
import sys

# Bootstrap: Add shared scripts to Python path
def setup_scripts_path():
    """Add shared scripts to Python path"""
    current = Path.cwd()
    for _ in range(10):
        if (current / '.projectroot').exists():
            scripts_dir = current / 'notebooks' / 'shared' / 'scripts'
            if scripts_dir.exists():
                sys.path.insert(0, str(scripts_dir))
                return scripts_dir
        if current.parent == current:
            break
        current = current.parent
    raise FileNotFoundError(
        "scripts directory not found.\n"
        "ensure .projectroot exists at project root."
    )

# Setup path and import utilities
scripts_dir = setup_scripts_path()

from project_utils import find_project_root
from data_io import setup_extraction_directories, check_existing_file, check_existing_chunks
from data_validation import print_final_summary, print_storage_summary

print(f"✓ Packages and scripts loaded successfully")

### 1B. Project Root Detection
*Cross-platform function to locate project directory automatically*

**Purpose:** Finds project root by searching for `.projectroot` marker file

**Handles:** Working directory issues, different operating systems, various notebook locations

**Confirmation:** Displays detected path for verification

**Manual Override:** Uncomment line below if auto-detection fails

In [None]:
# Automatically detect project root
project_root = find_project_root()

In [None]:
# If auto-detection fails, uncomment and edit this line:
# project_root = Path('/.../.../.../[tourism_data_project]')

### 1C. Set Project Paths
*Establish standardized directory structure for bronze and silver processing*

**Bronze Structure:** Raw download → chunked conversion → primary filter

**Silver Structure:** Final staging area for gold layer integration

**Auto-creation:** All directories created automatically for new collaborators

In [None]:
# Create base AirBnB directory structure
# City-specific subdirectories created after selection in Section 2
bronze_base = project_root / "data" / "bronze" / "airbnb"
silver_base = project_root / "data" / "silver" / "airbnb"

bronze_base.mkdir(parents=True, exist_ok=True)
silver_base.mkdir(parents=True, exist_ok=True)

print(f"Bronze base: {bronze_base}")
print(f"Silver base: {silver_base}")
print("Directory structure created successfully")

## 2. Data Acquisition
*Download raw CSV files from InsideAirbnb for Chicago or Los Angeles*

**City Selection:** Configure `CITY` variable below (chicago or los_angeles)

**Automatic Detection:** Tests recent dates to find current data availability
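
The date testing described above could be sketched roughly as follows. This is an illustration only: the URL layout, the `united-states/il` region string, and the helper names `build_snapshot_url`, `candidate_dates`, and `first_available` are assumptions, and `download_insideairbnb` may work differently. The availability check is passed in as a function so the sketch runs without network access.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, List, Optional

# ASSUMPTION: illustrative URL layout, not confirmed by this notebook.
# Check the links on the InsideAirbnb "Get the Data" page before relying on it.
URL_TEMPLATE = "https://data.insideairbnb.com/{region}/{city}/{snapshot}/data/{file_type}.csv.gz"


def build_snapshot_url(region: str, city: str, snapshot: str, file_type: str) -> str:
    """Fill the assumed URL template for one city, snapshot date, and file type."""
    return URL_TEMPLATE.format(region=region, city=city, snapshot=snapshot, file_type=file_type)


def candidate_dates(start: date, days_back: int) -> List[str]:
    """List ISO dates from `start` going back `days_back` days, newest first."""
    return [(start - timedelta(days=offset)).isoformat() for offset in range(days_back + 1)]


def first_available(dates: Iterable[str], is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first date for which `is_available` reports True, else None."""
    for snapshot in dates:
        if is_available(snapshot):
            return snapshot
    return None


# Offline demonstration: pretend only the 2025-06-17 snapshot is published.
published = {"2025-06-17"}
found = first_available(
    candidate_dates(date(2025, 6, 20), days_back=7),
    lambda snapshot: snapshot in published,
)
print(found)
print(build_snapshot_url("united-states/il", "chicago", found, "listings"))
```

In a live run, `is_available` could wrap an HTTP HEAD request against the built URL, for example with `requests.head(url, timeout=10).ok`.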

<details>
<summary><strong>Manual Download Instructions</strong> (click to expand)</summary>

***If automated download fails:***

1. Visit: https://insideairbnb.com/get-the-data/
2. Navigate to Chicago or Los Angeles section
3. Download current: `listings.csv.gz` and `reviews.csv.gz`
4. Rename to: `{city}_listings.csv.gz`, `{city}_reviews.csv.gz`
5. Place in: `data/bronze/airbnb/{city}/00_original_download/`

**Note:** InsideAirbnb updates datasets regularly. If automated download fails, manually download current files and adjust date references if needed.

</details>

In [None]:
# Import AirBnB-specific utilities
from airbnb_utils import download_insideairbnb

# USER CONFIGURATION - City Selection
CITY = "chicago"            # Options: "chicago" or "los_angeles"

print(f"Current city: {CITY.replace('_', ' ').title()}")
print("To change city: Update |CITY| variable above and rerun this cell")

# Create city-specific directories
city_bronze_base = bronze_base / CITY
city_silver_base = silver_base / CITY
city_bronze_base.mkdir(parents=True, exist_ok=True)
city_silver_base.mkdir(parents=True, exist_ok=True)

print(f"\nBronze directory: {city_bronze_base}")
print(f"Silver directory: {city_silver_base}")
print("\nReady for download configuration")

In [None]:
# Download configuration
DATE_SNAPSHOT = "2025-06-17"  # Update if needed based on InsideAirbnb availability

original_dir = city_bronze_base / "00_original_download"
original_dir.mkdir(parents=True, exist_ok=True)

print(f"Testing download for {CITY.replace('_', ' ').title()} ({DATE_SNAPSHOT})...")

# Download listings
listings_path = original_dir / f"{CITY}_listings.csv.gz"
listings_ok, message = download_insideairbnb(
    city=CITY,
    date_snapshot=DATE_SNAPSHOT,
    file_type='listings',
    output_path=listings_path
)
print(f"Listings: {message}")

# Download reviews
reviews_path = original_dir / f"{CITY}_reviews.csv.gz"
reviews_ok, message = download_insideairbnb(
    city=CITY,
    date_snapshot=DATE_SNAPSHOT,
    file_type='reviews',
    output_path=reviews_path
)
print(f"Reviews: {message}")

# Keep both results so a failed listings download is not hidden by the reviews call
if not (listings_ok and reviews_ok):
    print("\n✗ Download incomplete - see Manual Download Instructions above")

print(f"\nFiles location: {original_dir}")

---

<u>***City Configuration Checkpoint***</u>

**Project Scope:** Chicago & Los Angeles (2022, 2024) comparative analysis  

**Reset Point:** Update `CITY` variable above and rerun from here for different city

---

## 3. Bronze Layer: Raw Data Processing
*Load CSV files, merge datasets, convert to chunked parquet preserving original structure*

**Input:** `data/bronze/airbnb/{city}/00_original_download/{city}_listings.csv.gz` & `{city}_reviews.csv.gz`

**Output:** `data/bronze/airbnb/{city}/01_raw_conversion/{city}_merged_chunk_*.parquet`

**Processing:** Merge listings + reviews, then chunked conversion for memory efficiency

**Purpose:** Preserve complete dataset structure while converting to analysis-friendly format

### 3A.1 Load and Verify Raw Files
*Load downloaded CSV files and verify structure before processing*

In [None]:
# Import AirBnB validation
from airbnb_utils import validate_insideairbnb_structure

# Load and verify downloaded files
listings_file = original_dir / f"{CITY}_listings.csv.gz"
reviews_file = original_dir / f"{CITY}_reviews.csv.gz"

# Verify files exist
if not listings_file.exists():
    print(f"Listings file not found: {listings_file}")
    print("Please run Section 2 to download files first")
elif not reviews_file.exists():
    print(f"Reviews file not found: {reviews_file}")
    print("Please run Section 2 to download files first")
else:
    print("✓ Both files found, loading data...")

    # Load files
    print("\nLoading listings...")
    listings_df = pd.read_csv(listings_file, compression="gzip")
    print(f"Listings shape: {listings_df.shape}")

    print("\nLoading reviews...")
    reviews_df = pd.read_csv(reviews_file, compression="gzip")
    print(f"Reviews shape: {reviews_df.shape}")

    # Validate structure
    valid, missing = validate_insideairbnb_structure(reviews_df, listings_df)

    if not valid:
        print("\n✗ Dataset structure validation failed:")
        if missing['reviews']:
            print(f"  Missing review columns: {missing['reviews']}")
        if missing['listings']:
            print(f"  Missing listing columns: {missing['listings']}")
    else:
        print("\nDataset structure validated")
        print(f"  Unique listings: {listings_df['id'].nunique():,}")
        print(f"  Unique listing IDs in reviews: {reviews_df['listing_id'].nunique():,}")
        print("\nRaw data loaded successfully")

### 3A.2 Raw Data Structure Validation
*Verify downloaded file structure before merging*

**Purpose:** Detect InsideAirbnb format changes and validate required columns

**Expected:** Reviews (6 cols), Listings (~79 cols) based on Nov 2025 Chicago dataset

In [None]:
# Raw data structure validation using future-proof checks
print("Raw Data Structure Validation:")
print("=" * 40)

# Use validation function
valid, missing = validate_insideairbnb_structure(reviews_df, listings_df)

# Report structure
print(f"Structure check:")
print(f"  Reviews: {len(reviews_df.columns)} columns")
print(f"  Listings: {len(listings_df.columns)} columns")

if not valid:
    print(f"\nMissing required columns:")
    if missing['reviews']:
        print(f"  Reviews: {missing['reviews']}")
    if missing['listings']:
        print(f"  Listings: {missing['listings']}")
    print(f"\nDataset may be incompatible - check InsideAirbnb version")
else:
    print(f"\n✓ All required columns present")
    print(f"✓ Ready for merge operation")
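
For readers without access to `airbnb_utils`, a required-column check of this kind might look like the sketch below. The helper name `check_structure` and both required-column sets are assumptions chosen for illustration; only the `(valid, missing)` return shape with `reviews` and `listings` keys is taken from how the notebook uses `validate_insideairbnb_structure`.

```python
# ASSUMPTION: illustrative required-column sets; the real helper in
# airbnb_utils may check a different list.
REQUIRED_REVIEW_COLS = {"listing_id", "id", "date", "reviewer_id", "comments"}
REQUIRED_LISTING_COLS = {"id", "name", "room_type", "latitude", "longitude"}


def check_structure(review_columns, listing_columns):
    """Return (valid, missing); missing maps each table to its absent columns."""
    missing = {
        "reviews": sorted(REQUIRED_REVIEW_COLS - set(review_columns)),
        "listings": sorted(REQUIRED_LISTING_COLS - set(listing_columns)),
    }
    valid = not missing["reviews"] and not missing["listings"]
    return valid, missing


# Toy column lists: the listings table is missing `longitude`.
valid_demo, missing_demo = check_structure(
    ["listing_id", "id", "date", "reviewer_id", "reviewer_name", "comments"],
    ["id", "name", "room_type", "latitude"],
)
print(valid_demo, missing_demo)
```

With DataFrames, the same function could be called as `check_structure(reviews_df.columns, listings_df.columns)`. Checking for required names rather than an exact column count tolerates new columns being added upstream.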

### 3B. Merge Listings and Reviews
*Combine datasets following Anna's proven methodology*

**Pattern:** Left join `reviews` to `listings` to preserve all review data with listing details

**Join Strategy:** `reviews.listing_id` → `listings.id`

**Result:** Comprehensive dataset with review content + listing characteristics

In [None]:
# Import merge function
from airbnb_utils import merge_listings_reviews

# Merge listings and reviews
print("Merging reviews with listings...")
print("Join strategy: reviews (left) + listings (right) on listing_id = id")

merged_data = merge_listings_reviews(
    listings_path=listings_file,
    reviews_path=reviews_file
)

print(f"\nMerge complete:")
print(f"Merged records: {len(merged_data):,}")
print(f"Merged dataset shape: {merged_data.shape}")

# Verify merge quality - check for unmatched reviews
missing_listings = merged_data['id_listing'].isna().sum()
if missing_listings > 0:
    print(f"\n✗ {missing_listings:,} reviews could not be matched to listings")
else:
    print("\n✓ All reviews successfully matched to listings")

# Show sample
key_cols = ['listing_id', 'date', 'comments', 'name', 'property_type']
available_cols = [col for col in key_cols if col in merged_data.columns]
print(f"\nSample merged data:")
print(merged_data[available_cols].head(3))

print("\nReady for year filtering")
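
The join strategy above can be tried on toy data as follows. This is a sketch, not the body of `merge_listings_reviews`: the suffixes are inferred from the `id_review` and `id_listing` column names used in this notebook, and the `indicator` and `validate` arguments are additions for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables; column names follow the notebook's usage.
reviews_demo = pd.DataFrame({
    "listing_id": [10, 10, 99],
    "id": [1, 2, 3],
    "comments": ["Great stay", "Clean and quiet", "Listing since removed"],
})
listings_demo = pd.DataFrame({
    "id": [10, 20],
    "name": ["Loop studio", "Lakeview flat"],
    "room_type": ["Entire home/apt", "Private room"],
})

# Left join keeps every review. Both tables have an `id` column, so the
# suffixes (an assumption about the real helper) yield `id_review` and `id_listing`.
merged_demo = reviews_demo.merge(
    listings_demo,
    how="left",
    left_on="listing_id",
    right_on="id",
    suffixes=("_review", "_listing"),
    indicator=True,           # adds a `_merge` column: "both" or "left_only"
    validate="many_to_one",   # raises if a listing id appears twice in listings
)

unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(merged_demo[["listing_id", "id_review", "id_listing", "name"]])
print(f"Unmatched reviews: {unmatched}")
```

Reviews whose listing is absent keep their row and get nulls in every listing column, which is why the notebook counts nulls in `id_listing` to measure merge quality.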

### 3C. Primary Filters: Year, Business Customer & Data Quality
*Apply core filters immediately to reduce dataset before saving*

**Input:** Merged dataset (all available reviews, all available years, all property types)

**Filters Applied:**

1. **Year filter:** Reviews from 2022+ only *(post-COVID focus)*
   
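2. **Business customer filter:** Entire home/apartment and hotel room listings only (exclude private/shared rooms)

3. **Column optimization:** Keep only the sentiment-relevant columns listed in `keep_columns`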

**Output:** `data/bronze/airbnb/{city}/02_primary_filter/{city}_post2022_filtered.parquet`

**Purpose:** Focus on business travel market with current, relevant data

In [None]:
# Set up primary filter output directory
primary_filter_dir = city_bronze_base / "02_primary_filter"
primary_filter_dir.mkdir(parents=True, exist_ok=True)
output_file = primary_filter_dir / f"{CITY}_post2022_filtered.parquet"

# Check if filtering already completed
exists, info = check_existing_file(output_file, file_type='parquet')

if not exists:
    print("Applying primary filters...")
    print(f"\nOriginal merged data: {len(merged_data):,} rows, {len(merged_data.columns)} columns")

    # Filter 1: Year filter (2022+)
    merged_data['date'] = pd.to_datetime(merged_data['date'], errors='coerce')
    year_filtered = merged_data[merged_data['date'].dt.year >= 2022].copy()
    print(f"  After year filter (2022+): {len(year_filtered):,} rows")

    # Filter 2: Business customer filter (Entire home/apt + Hotel room)
    business_filtered = year_filtered[
        year_filtered['room_type'].isin(['Entire home/apt', 'Hotel room'])
    ].copy()
    print(f"  After business customer filter: {len(business_filtered):,} rows")

    # Filter 3: Column optimization - keep only sentiment-relevant columns
    keep_columns = [
        # Review essentials
        'listing_id', 'id_review', 'date', 'reviewer_id', 'comments',

        # Property context
        'name', 'description', 'neighborhood_overview',
        'room_type', 'accommodates', 'bedrooms', 'beds', 'bathrooms', 'price',
        'amenities',

        # Location
        'latitude', 'longitude',

        # Host quality
        'host_id', 'host_since', 'host_about',
        'host_response_time', 'host_response_rate', 'host_is_superhost',
        'host_listings_count',

        # Review scores
        'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
        'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
        'review_scores_value',

        # Metadata
        'first_review', 'last_review', 'license', 'listing_url'
    ]

    # Filter to available columns only
    available_keep = [col for col in keep_columns if col in business_filtered.columns]
    final_filtered = business_filtered[available_keep].copy()

    print(f"  After column optimization: {len(final_filtered.columns)} columns kept")

    # Save filtered data
    final_filtered.to_parquet(output_file, compression="snappy")

    print(f"\nPrimary filtering complete:")
    print(f"  Rows kept: {len(final_filtered):,}")
    print(f"  Rows removed: {len(merged_data) - len(final_filtered):,}")
    print(f"  Retention rate: {len(final_filtered)/len(merged_data)*100:.1f}%")
    print(f"  Columns kept: {len(final_filtered.columns)} of {len(merged_data.columns)}")
    print(f"\nSaved to: {output_file}")

    file_size_mb = output_file.stat().st_size / (1024*1024)
    print(f"File size: {file_size_mb:.1f} MB")
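
One detail of the year filter is easy to miss: `errors='coerce'` turns unparseable dates into `NaT`, and those rows then fail the `>= 2022` comparison and drop out without any warning. The toy example below (invented rows, for illustration only) shows the effect and counts the affected rows before filtering.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": ["2021-12-31", "2022-01-01", "not a date", "2024-07-04"],
    "room_type": ["Entire home/apt", "Hotel room", "Entire home/apt", "Private room"],
})

# Unparseable strings become NaT instead of raising an error.
demo["date"] = pd.to_datetime(demo["date"], errors="coerce")
unparseable = int(demo["date"].isna().sum())

# NaT has no year, so the comparison is False and the row is dropped silently.
year_ok = demo["date"].dt.year >= 2022
room_ok = demo["room_type"].isin(["Entire home/apt", "Hotel room"])
kept = demo[year_ok & room_ok]

print(f"Unparseable dates: {unparseable}")
print(f"Rows kept: {len(kept)} of {len(demo)} ({len(kept) / len(demo) * 100:.1f}%)")
```

Reporting the unparseable count separately makes it possible to tell rows removed by the year filter apart from rows lost to date parsing.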

## 4. Load and Verify Filtered Data
*Load primary filtered data and confirm filtering results*

**Input:** `data/bronze/airbnb/{city}/02_primary_filter/{city}_post2022_filtered.parquet`

**Purpose:** Verify primary filtering results and analyze data distribution

**Next:** Year selection for targeted analysis

In [None]:
# Load primary filtered data
primary_filter_file = primary_filter_dir / f"{CITY}_post2022_filtered.parquet"

if not primary_filter_file.exists():
    print("Primary filter file not found - run Section 3 first")
    print(f"Expected location: {primary_filter_file}")
else:
    exploration_df = pd.read_parquet(primary_filter_file)

    print("✓ Primary filtered data loaded successfully")
    print(f"Rows: {len(exploration_df):,}")
    print(f"Columns: {len(exploration_df.columns)}")
    print(f"File: {primary_filter_file.name}")

    # Verify filters applied correctly
    print(f"\nFilter verification:")

    # Year distribution
    exploration_df['date'] = pd.to_datetime(exploration_df['date'], errors='coerce')
    year_dist = exploration_df['date'].dt.year.value_counts().sort_index()
    print(f"  Year distribution:")
    for year, count in year_dist.items():
        print(f"    {year}: {count:,} reviews")

    # Room type distribution
    room_dist = exploration_df['room_type'].value_counts()
    print(f"\n  Room type distribution:")
    for room_type, count in room_dist.items():
        print(f"    {room_type}: {count:,} reviews")

    print(f"\n✓ Filters confirmed - ready for year selection")
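
Beyond reading the printed distributions, the filters could also be checked programmatically. The sketch below uses a hypothetical helper, `filter_violations`, which is not part of the project's shared scripts; the allowed room types mirror the filter in Section 3C.

```python
import pandas as pd

ALLOWED_ROOM_TYPES = ["Entire home/apt", "Hotel room"]  # mirrors the Section 3C filter


def filter_violations(df: pd.DataFrame, min_year: int = 2022) -> dict:
    """Count rows that should not have survived the primary filters."""
    dates = pd.to_datetime(df["date"], errors="coerce")
    return {
        "missing_date": int(dates.isna().sum()),
        "before_min_year": int((dates.dt.year < min_year).sum()),
        "unexpected_room_type": int((~df["room_type"].isin(ALLOWED_ROOM_TYPES)).sum()),
    }


# Toy frame with one pre-2022 review and one shared room slipped in.
demo = pd.DataFrame({
    "date": ["2022-03-01", "2023-08-15", "2021-11-30"],
    "room_type": ["Entire home/apt", "Hotel room", "Shared room"],
})
violations = filter_violations(demo)
print(violations)
```

On correctly filtered data every count should be zero, so `assert not any(violations.values())` would work as a hard stop before year selection.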

## 5. Target Year Selection
*Configure specific year for detailed seasonal analysis*

**Available Years:** 2022-2025 with strong representation

**Default:** 2024 (most complete recent year)

**Strategy:** Process one year at a time for focused analysis

**Next:** Exploratory validation of selected year

In [None]:
# USER CONFIGURATION - Target Year Selection
TARGET_YEAR = 2024          # Current options: 2022, 2023, 2024, 2025

print(f"Current target year: {TARGET_YEAR}")
print("To change year: Update |TARGET_YEAR| variable above and rerun this cell")

# Validate target year availability
available_years = exploration_df['date'].dt.year.unique()
# Cast to int: .dt.year returns floats when the column contains missing dates
available_years = sorted(int(y) for y in available_years if pd.notna(y))

if TARGET_YEAR in available_years:
    # Filter to target year
    year_mask = exploration_df['date'].dt.year == TARGET_YEAR
    year_df = exploration_df[year_mask].copy()

    print(f"\nTarget year {TARGET_YEAR} is available")
    print(f"Records: {len(year_df):,}")

    # Show year context
    print(f"\nYear context:")
    for year in available_years:
        count = (exploration_df['date'].dt.year == year).sum()
        marker = " <- TARGET" if year == TARGET_YEAR else ""
        print(f"  {year}: {count:,} reviews{marker}")

    print(f"\nConfiguration summary:")
    print(f"  City: {CITY.replace('_', ' ').title()}")
    print(f"  Target year: {TARGET_YEAR}")
    print(f"  Records: {len(year_df):,}")
    print("\nReady for exploratory validation")

else:
    print(f"\n✗ Target year {TARGET_YEAR} not available")
    print(f"Available years: {available_years}")
    print("Update TARGET_YEAR to an available year above")

---

<u>***Year Configuration Checkpoint***</u>

**Project Scope:** Chicago & Los Angeles (2022, 2024) comparative analysis

**Reset Point:** Update `TARGET_YEAR` variable above and rerun from here for different year


---

## 6. Exploratory Analysis
*Validate data richness and distribution before final save*

**Purpose:** Ensure selected year has sufficient data quality for sentiment analysis

**Strategy:** Examine seasonal patterns, property distribution, review completeness

**Note:** *Optional section - can skip to Section 7 for direct save*

### 6A. Seasonal Distribution
*Examine temporal patterns for tourism insights*

In [None]:
# Seasonal distribution analysis
print(f"Seasonal Analysis - {CITY.replace('_', ' ').title()} {TARGET_YEAR}:")
print("=" * 60)

# Extract month information
year_df['month'] = year_df['date'].dt.month
year_df['month_name'] = year_df['date'].dt.month_name()

# Monthly distribution
monthly_counts = year_df['month'].value_counts().sort_index()
print(f"Monthly Review Distribution:")
for month in range(1, 13):
    if month in monthly_counts.index:
        count = monthly_counts[month]
        month_name = year_df[year_df['month'] == month]['month_name'].iloc[0]
        percentage = count / len(year_df) * 100
        print(f"  {month_name}: {count:,} reviews ({percentage:.1f}%)")

# Seasonal groupings
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

year_df['season'] = year_df['month'].apply(get_season)
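# Reindex so a season with no reviews appears as 0 in the sufficiency check
seasonal_counts = year_df['season'].value_counts().reindex(
    ['Winter', 'Spring', 'Summer', 'Fall'], fill_value=0
)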

print(f"\nSeasonal Distribution:")
for season in ['Winter', 'Spring', 'Summer', 'Fall']:
    if season in seasonal_counts.index:
        count = seasonal_counts[season]
        percentage = count / len(year_df) * 100
        print(f"  {season}: {count:,} reviews ({percentage:.1f}%)")

# Data sufficiency check
min_seasonal = seasonal_counts.min()
min_seasonal_pct = (min_seasonal / len(year_df)) * 100

print(f"\nData Sufficiency:")
print(f"  Minimum seasonal representation: {min_seasonal:,} reviews ({min_seasonal_pct:.1f}%)")
if min_seasonal_pct >= 10.0:
    print("  Sufficient data for all seasons (>10% minimum)")
else:
    print("  ✗ Limited data in some seasons (<10% minimum)")
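
The season grouping and the 10% sufficiency threshold can also be written as a dictionary lookup that reports seasons with no reviews as 0%. The sketch below uses plain Python and invented month values; `SEASON_BY_MONTH` and `seasonal_shares` are illustrative names, not part of the project's scripts.

```python
from collections import Counter

SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
SEASONS = ["Winter", "Spring", "Summer", "Fall"]


def seasonal_shares(months):
    """Return each season's percentage of reviews, including seasons with none."""
    counts = Counter(SEASON_BY_MONTH[m] for m in months)
    total = sum(counts.values())
    return {s: (counts[s] * 100 / total if total else 0.0) for s in SEASONS}


# Toy input: ten review months with no spring reviews at all.
shares = seasonal_shares([1, 1, 2, 6, 7, 7, 8, 9, 10, 12])
sufficient = min(shares.values()) >= 10.0
print(shares)
print("Sufficient" if sufficient else "Limited data in some seasons")
```

With a DataFrame, the same mapping could be applied as `year_df['month'].map(SEASON_BY_MONTH)`, which avoids a Python-level function call per row.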

### 6B. Property & Review Quality
*Validate property diversity and review completeness*

In [None]:
# Property and review quality analysis
print(f"Property & Review Quality - {CITY.replace('_', ' ').title()} {TARGET_YEAR}:")
print("=" * 60)

# Property diversity
print(f"Property Coverage:")
print(f"  Unique listings: {year_df['listing_id'].nunique():,}")
print(f"  Unique reviewers: {year_df['reviewer_id'].nunique():,}")

# Room type breakdown
print(f"\nRoom Type Distribution:")
room_type_dist = year_df['room_type'].value_counts()
for room_type, count in room_type_dist.items():
    percentage = count / len(year_df) * 100
    print(f"  {room_type}: {count:,} reviews ({percentage:.1f}%)")

# Review completeness
print(f"\nReview Completeness:")
comments_complete = year_df['comments'].notna().sum()
comments_pct = comments_complete / len(year_df) * 100
print(f"  Comments available: {comments_complete:,} ({comments_pct:.1f}%)")

# Review score availability
if 'review_scores_rating' in year_df.columns:
    scores_available = year_df['review_scores_rating'].notna().sum()
    scores_pct = scores_available / len(year_df) * 100
    print(f"  Review scores available: {scores_available:,} ({scores_pct:.1f}%)")

# Host diversity
if 'host_id' in year_df.columns:
    print(f"\nHost Diversity:")
    unique_hosts = year_df['host_id'].nunique()
    print(f"  Unique hosts: {unique_hosts:,}")
    avg_reviews_per_host = len(year_df) / unique_hosts
    print(f"  Average reviews per host: {avg_reviews_per_host:.1f}")

print(f"\nIf data quality suffices, ready for silver layer save")

## 7. Silver Layer: Final Save
*Save validated year-specific dataset to silver staging*

**Input:** Validated year data from exploratory analysis

**Output:** `data/silver/airbnb/{city}/staging/{city}_{target_year}_final.parquet`

**Purpose:** Create year-specific dataset ready for silver staging.

In [None]:
# Set up silver staging directory
# Use a separate name so the city-level silver path from Section 2 is not overwritten
silver_staging_dir = project_root / "data" / "silver" / "airbnb" / CITY / "staging"
silver_staging_dir.mkdir(parents=True, exist_ok=True)
output_file = silver_staging_dir / f"{CITY}_{TARGET_YEAR}_final.parquet"

# Check if save already completed
exists, info = check_existing_file(output_file, file_type='parquet')

if not exists:
    # Remove exploration columns (month, month_name, season)
    exploration_cols = ['month', 'month_name', 'season']
    clean_df = year_df.drop(columns=[col for col in exploration_cols if col in year_df.columns])

    # Save clean year-specific data to silver
    clean_df.to_parquet(output_file, compression="snappy")

    file_size_mb = output_file.stat().st_size / (1024*1024)
    print(f"Silver layer save complete:")
    print(f"  Dataset: {CITY.replace('_', ' ').title()} {TARGET_YEAR}")
    print(f"  Records: {len(clean_df):,}")
    print(f"  Columns: {len(clean_df.columns)}")
    print(f"  File size: {file_size_mb:.1f} MB")
    print(f"\nSaved to: {output_file}")
    print(f"\n✓ Ready for silver stage processing")
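
The check-then-save pattern used here makes reruns safe: an existing output is left untouched. A minimal sketch of the idea follows, with hypothetical helpers `file_status` and `save_once`. The real `check_existing_file` in `data_io` may return different details in `info`; only the `(exists, info)` shape is taken from the notebook.

```python
from pathlib import Path
import tempfile


def file_status(path: Path):
    """Return (exists, info); info holds the path and size in MB when present."""
    if not path.is_file():
        return False, {}
    size_mb = path.stat().st_size / (1024 * 1024)
    return True, {"path": str(path), "size_mb": size_mb}


def save_once(path: Path, write) -> bool:
    """Call write(path) only if the file is absent; return True if it was written."""
    exists, _ = file_status(path)
    if exists:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    write(path)
    return True


# Demonstration in a temporary directory with placeholder bytes, not real parquet.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "staging" / "chicago_2024_final.parquet"
    first = save_once(target, lambda p: p.write_bytes(b"placeholder"))
    second = save_once(target, lambda p: p.write_bytes(b"overwritten"))
    content = target.read_bytes()

print(first, second, content)
```

One consequence of this design is that changing a filter upstream does not refresh an existing output; the old file has to be deleted first.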

## 8. Final Verification & Summary
*Verify saved dataset and review workflow completion*

**Purpose:** Confirm data integrity and provide workflow summary

**Next:** Gold layer integration or repeat workflow for different city/year

In [None]:
# Print comprehensive summary
print_final_summary(
    output_file,
    dataset_name=f"AirBnB {CITY.replace('_', ' ').title()} {TARGET_YEAR}",
    file_type='parquet'
)

# Workflow completion status
print(f"\nWorkflow Status:")
print(f"  City: {CITY.replace('_', ' ').title()} ✓")
print(f"  Year: {TARGET_YEAR} ✓")
print(f"  Records: {len(year_df):,} ✓")
print(f"  File: {output_file.name} ✓")

print(f"\nNext steps:")
print(f"  - Change |CITY| variable (Section 2) for different city")
print(f"  - Change |TARGET_YEAR| variable (Section 5) for different year")
print(f"  - Proceed to gold layer integration")

## 9. Next Steps & Project Completion Status

**Project Scope:** Chicago & Los Angeles × 2022, 2024 = 4 datasets

**Workflow Reset Points:**

**To process different city:**
1. Update `CITY` variable in Section 2
2. Rerun Sections 2-8
3. Automatic directory creation

**To process different year (same city):**
1. Update `TARGET_YEAR` variable in Section 5
2. Rerun Sections 5-8
3. Uses existing filtered data
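
The four city and year combinations in the project scope, and the silver files they should produce, can be enumerated as in the sketch below. The helper `silver_output_path` and the placeholder root are illustrative; the path pattern follows the Output line in Section 7.

```python
from itertools import product
from pathlib import Path

CITIES = ["chicago", "los_angeles"]
YEARS = [2022, 2024]


def silver_output_path(project_root: Path, city: str, year: int) -> Path:
    """Build the silver staging path using the notebook's naming pattern."""
    staging = project_root / "data" / "silver" / "airbnb" / city / "staging"
    return staging / f"{city}_{year}_final.parquet"


# Placeholder root for illustration; the notebook detects the real one.
root = Path("project_root")
targets = [silver_output_path(root, city, year) for city, year in product(CITIES, YEARS)]
for path in targets:
    print(path.as_posix())
```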

---

**Data Quality Notes:**

All datasets show:
- Complete date coverage for target year
- Sufficient seasonal distribution (>10% minimum)
- High review completeness (>99%)
- Expected nulls in optional fields (host_about, neighborhood_overview, license)

---

**Upcoming Gold Layer Integration:**

- **Multi-dataset analysis:** All processed silver datasets (TripAdvisor NYC, Yelp New Orleans, AirBnB LA/Chicago) will be explored for shared columns
  
- **Schema standardization:** Common fields (location, date, rating, text) will be unified across datasets
  
- **Data quality:** Null value handling, strategic imputation, and appropriate data type conversions
  
- **Analysis-ready format:** Final gold datasets optimized for sentiment analysis and tourism correlation modeling

**Gold Processing Pipeline:**
1. Load all silver datasets and analyze column overlap
2. Standardize shared column names and formats
3. Handle missing values with dataset-appropriate strategies
4. Convert data types for analysis efficiency
5. Create unified gold datasets for cross-platform analysis
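
Step 1 of the pipeline, analyzing column overlap, could start from something like the sketch below. The dataset keys and column lists are invented for illustration and are not the real silver schemas.

```python
from collections import Counter

# ASSUMPTION: toy column lists; the real silver schemas may differ.
silver_columns = {
    "airbnb_chicago": ["listing_id", "date", "comments", "review_scores_rating", "latitude", "longitude"],
    "tripadvisor_nyc": ["hotel_id", "date", "review_text", "rating"],
    "yelp_new_orleans": ["business_id", "date", "text", "stars", "latitude", "longitude"],
}

# Columns present in every dataset are direct candidates for the shared schema.
shared = set.intersection(*(set(cols) for cols in silver_columns.values()))

# Columns present in some but not all datasets need a rename map or a fill strategy.
frequency = Counter(col for cols in silver_columns.values() for col in set(cols))
partial = sorted(col for col, n in frequency.items() if 1 < n < len(silver_columns))

print(sorted(shared))
print(partial)
```

Exact name matching finds only part of the overlap: in this toy data `comments`, `review_text`, and `text` hold the same kind of content under different names, which is what the name standardization in step 2 would address.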
