# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as the first cell is left untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power. Use text-vectorization methods to extract meaningful features from them.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
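
The dtype and sampling hints above can be sketched as follows. This is a minimal illustration on made-up rows that mimic the raw calendar export; the toy values and the `typed` / `subset` names are assumptions, not part of the provided data.

```python
import pandas as pd

# Toy rows mimicking the raw calendar export (values are illustrative only).
raw = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2025-06-27", "2025-06-28", "2025-06-27", "2025-06-28"],
    "available": ["t", "f", "t", "t"],
    "price": ["$95.00", "$1,250.00", None, "$80.00"],
})

typed = raw.assign(
    # Strings -> datetime64 so that .dt accessors work later.
    date=pd.to_datetime(raw["date"]),
    # 't'/'f' flags -> nullable boolean.
    available=raw["available"].map({"t": True, "f": False}).astype("boolean"),
    # "$1,250.00" -> 1250.0; anything unparseable becomes NaN.
    price=pd.to_numeric(
        raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    ),
)

# Reproducible subset: a fixed random_state gives the same sample on reruns.
subset = typed.sample(frac=0.5, random_state=42)
print(typed.dtypes)
```

Fixing `random_state` keeps the sampled subset stable across team members, which makes results easier to compare in a shared repository.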




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [3]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [4]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


('', [])

In [5]:
import pandas as pd
calendar = pd.read_csv("../Data/calendar.csv.gz", compression="gzip")
print(calendar.head())

           listing_id        date available  price  adjusted_price  \
0  686088974677118082  2025-06-27         f    NaN             NaN   
1  686088974677118082  2025-06-28         f    NaN             NaN   
2  686088974677118082  2025-06-29         t    NaN             NaN   
3  686088974677118082  2025-06-30         t    NaN             NaN   
4  686088974677118082  2025-07-01         t    NaN             NaN   

   minimum_nights  maximum_nights  
0               2            1125  
1               2            1125  
2               2            1125  
3               2            1125  
4               2            1125  


In [6]:
listings_gz = pd.read_csv("../Data/listings.csv.gz", compression="gzip")
print(listings_gz.head())


       id                          listing_url       scrape_id last_scraped  \
0  155305  https://www.airbnb.com/rooms/155305  20250617145515   2025-06-17   
1  197263  https://www.airbnb.com/rooms/197263  20250617145515   2025-06-17   
2  209068  https://www.airbnb.com/rooms/209068  20250617145515   2025-06-17   
3  246315  https://www.airbnb.com/rooms/246315  20250617145515   2025-06-17   
4  314540  https://www.airbnb.com/rooms/314540  20250617145515   2025-06-17   

        source                                               name  \
0  city scrape                 Cottage! BonPaul + Sharky's Hostel   
1  city scrape                       Tranquil Room & Private Bath   
2  city scrape                                    Terrace Cottage   
3  city scrape                          Asheville Dreamer's Cabin   
4  city scrape  Asheville Urban Farmhouse Entire Home 4.6 mi t...   

                                         description  \
0  West Asheville Cottage within walking distance...  

In [7]:
reviews_gz = pd.read_csv("../Data/reviews.csv.gz", compression="gzip")
print(reviews_gz.head())



   listing_id       id        date  reviewer_id reviewer_name  \
0      155305   409437  2011-07-31       844309       Jillian   
1      155305   469775  2011-08-23       343443         Katie   
2      155305   548257  2011-09-19      1152025         Katie   
3      155305   671470  2011-10-28      1245885         Jason   
4      155305  1606327  2012-07-01      1891395         Craig   

                                            comments  
0  We had a wonderful time! The cottage was very ...  
1  Place was great! Can't really speak to the ins...  
2  We had a great time!  The cabin was nice and a...  
3  Clean and comfortable room with everything you...  
4  The cabin was solid for an overnight stay. It ...  


In [8]:
neighbourhoods_geo = pd.read_json("../Data/neighbourhoods.geojson")
print(neighbourhoods_geo.head())



                type                                           features
0  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
1  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
2  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
3  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
4  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
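
As the output above shows, `pd.read_json` leaves each GeoJSON feature as a nested dict in a single column. One possible way to get a flat table is to parse the file with `json` and flatten the `features` list with `pd.json_normalize`. The sketch below uses a small in-memory stand-in; the property names (`neighbourhood`, `neighbourhood_group`) are assumed, so check them against your file.

```python
import json
import pandas as pd

# Minimal stand-in for neighbourhoods.geojson; the property names are assumed.
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"neighbourhood": "28801", "neighbourhood_group": None},
         "geometry": {"type": "MultiPolygon", "coordinates": []}},
        {"type": "Feature",
         "properties": {"neighbourhood": "28806", "neighbourhood_group": None},
         "geometry": {"type": "MultiPolygon", "coordinates": []}},
    ],
})

# For a real file, read it with json.load(...) on an open file handle instead.
collection = json.loads(geojson_text)

# One row per feature; nested keys become dotted column names.
neighbourhoods_flat = pd.json_normalize(collection["features"])
print(neighbourhoods_flat.columns.tolist())
```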


In [9]:
listings = pd.read_csv("../Data/listings.csv")
print(listings.head())


       id                                               name  host_id  \
0  155305                 Cottage! BonPaul + Sharky's Hostel   746673   
1  197263                       Tranquil Room & Private Bath   961396   
2  209068                                    Terrace Cottage  1029919   
3  246315                          Asheville Dreamer's Cabin  1292070   
4  314540  Asheville Urban Farmhouse Entire Home 4.6 mi t...   381660   

  host_name  neighbourhood_group  neighbourhood   latitude  longitude  \
0   BonPaul                  NaN          28806  35.578640 -82.595780   
1   Timothy                  NaN          28806  35.577350 -82.638040   
2     Kevin                  NaN          28804  35.617641 -82.551819   
3     Annie                  NaN          28805  35.596150 -82.506350   
4       Tom                  NaN          28806  35.585610 -82.627310   

         room_type  price  minimum_nights  number_of_reviews last_review  \
0  Entire home/apt   95.0               1     

In [10]:
reviews = pd.read_csv("../Data/reviews.csv")
print(reviews.head())


   listing_id        date
0      155305  2011-07-31
1      155305  2011-08-23
2      155305  2011-09-19
3      155305  2011-10-28
4      155305  2012-07-01


In [11]:
neighbourhoods = pd.read_csv("../Data/neighbourhoods.csv")
print(neighbourhoods.head())


   neighbourhood_group  neighbourhood
0                  NaN          28704
1                  NaN          28715
2                  NaN          28732
3                  NaN          28801
4                  NaN          28803


In [None]:
# ============================================================================
# PART 1: DATA EXPLORATION AND MISSING VALUES ANALYSIS
# ============================================================================

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("=" * 80)
print("DATA OVERVIEW")
print("=" * 80)

# Check data shapes
print("\n1. DATA SHAPES:")
print(f"   Calendar: {calendar.shape}")
print(f"   Listings (gz): {listings_gz.shape}")
print(f"   Listings (csv): {listings.shape}")
print(f"   Reviews (gz): {reviews_gz.shape}")
print(f"   Reviews (csv): {reviews.shape}")
print(f"   Neighbourhoods: {neighbourhoods.shape}")

# Use the gzipped versions: they are the detailed files (e.g. reviews include
# the 'comments' text), while the plain CSVs are reduced summaries
listings_df = listings_gz.copy()
reviews_df = reviews_gz.copy()

# Rename 'id' to 'listing_id' for consistency across datasets
if 'id' in listings_df.columns:
    listings_df = listings_df.rename(columns={'id': 'listing_id'})

# Check missing values in calendar
print("\n2. MISSING VALUES IN CALENDAR:")
print(calendar.isnull().sum())
print(f"\n   Total missing values: {calendar.isnull().sum().sum()}")
print(f"   Missing percentage: {(calendar.isnull().sum().sum() / calendar.size * 100):.2f}%")

# Check missing values in listings
print("\n3. MISSING VALUES IN LISTINGS:")
missing_listings = listings_df.isnull().sum()
missing_listings_pct = (missing_listings / len(listings_df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing_listings,
    'Missing %': missing_listings_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
print(f"\n   Columns with missing values: {len(missing_df)}")
print(f"   Total missing values: {listings_df.isnull().sum().sum()}")
print("\n   Top 20 columns with most missing values:")
print(missing_df.head(20))

# Check missing values in reviews
print("\n4. MISSING VALUES IN REVIEWS:")
print(reviews_df.isnull().sum())
print(f"\n   Total missing values: {reviews_df.isnull().sum().sum()}")

# Basic statistics
print("\n5. CALENDAR DATA TYPES:")
print(calendar.dtypes)
print("\n   Unique listings in calendar:", calendar['listing_id'].nunique())
print("   Date range:", calendar['date'].min(), "to", calendar['date'].max())
print("   Available values:", calendar['available'].value_counts().to_dict())
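
The missing-value checks in the cell above repeat the same pattern for each table. A small helper, sketched here on toy data, could make the report reusable; the function name `missing_summary` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )

# Toy frame: 'price' is half missing, 'beds' has one gap, 'room_type' is complete.
demo = pd.DataFrame({
    "price": [np.nan, np.nan, 95.0, 80.0],
    "beds": [1.0, np.nan, 2.0, 1.0],
    "room_type": ["Entire home/apt"] * 4,
})
report = missing_summary(demo)
print(report)
```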


In [None]:
# ============================================================================
# PART 2: EXPLORATORY DATA ANALYSIS (EDA) WITH VISUALIZATIONS
# ============================================================================

# Convert date column to datetime
calendar['date'] = pd.to_datetime(calendar['date'])
calendar['available_bool'] = calendar['available'] == 't'

# Convert price columns - remove $ and commas, convert to float
def clean_price(price_str):
    if pd.isna(price_str):
        return np.nan
    if isinstance(price_str, str):
        return float(price_str.replace('$', '').replace(',', ''))
    return float(price_str)

calendar['price_clean'] = calendar['price'].apply(clean_price)
calendar['adjusted_price_clean'] = calendar['adjusted_price'].apply(clean_price)

# 1. Availability over time
print("=" * 80)
print("EDA: AVAILABILITY PATTERNS")
print("=" * 80)

# Daily availability rate
daily_availability = calendar.groupby('date').agg({
    'available_bool': ['mean', 'count']
}).reset_index()
daily_availability.columns = ['date', 'availability_rate', 'total_listings']

fig1 = px.line(
    daily_availability, 
    x='date', 
    y='availability_rate',
    title='Daily Availability Rate Over Time',
    labels={'availability_rate': 'Availability Rate', 'date': 'Date'}
)
fig1.update_layout(height=500, showlegend=False)
fig1.show()

# 2. Price distribution (when available)
price_available = calendar[calendar['available_bool'] & calendar['price_clean'].notna()]['price_clean']

fig2 = px.histogram(
    price_available,
    nbins=50,
    title='Price Distribution for Available Listings',
    labels={'value': 'Price ($)', 'count': 'Frequency'}
)
fig2.update_layout(height=500)
fig2.show()

print(f"\nPrice Statistics (available listings):")
print(f"   Mean: ${price_available.mean():.2f}")
print(f"   Median: ${price_available.median():.2f}")
print(f"   Min: ${price_available.min():.2f}")
print(f"   Max: ${price_available.max():.2f}")
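
`clean_price` is applied row by row, which can be slow on a calendar with millions of rows. A vectorized variant is sketched below as one possible alternative; `clean_price_vectorized` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

def clean_price_vectorized(prices: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a whole column at once and convert to float."""
    # Cast everything to text first so numeric entries are handled too.
    # Missing entries either stay missing or become text such as 'None',
    # depending on the pandas version; errors='coerce' maps both to NaN.
    as_text = prices.astype(str)
    stripped = as_text.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce").astype("float64")

demo_prices = pd.Series(["$95.00", "$1,250.50", None, 80])
cleaned = clean_price_vectorized(demo_prices)
print(cleaned.tolist())
```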


In [None]:
# 3. Listings analysis
print("\n" + "=" * 80)
print("EDA: LISTINGS CHARACTERISTICS")
print("=" * 80)

# Room type distribution
if 'room_type' in listings_df.columns:
    room_type_counts = listings_df['room_type'].value_counts()
    fig3 = px.pie(
        values=room_type_counts.values,
        names=room_type_counts.index,
        title='Distribution of Room Types'
    )
    fig3.update_layout(height=500)
    fig3.show()
    print("\nRoom Type Distribution:")
    print(room_type_counts)

# Price distribution in listings
if 'price' in listings_df.columns:
    # Clean price in listings
    listings_df['price_clean'] = listings_df['price'].apply(clean_price)
    price_listings = listings_df[listings_df['price_clean'].notna()]['price_clean']
    
    fig4 = px.histogram(
        price_listings,
        nbins=50,
        title='Listing Price Distribution',
        labels={'value': 'Price ($)', 'count': 'Frequency'}
    )
    fig4.update_layout(height=500)
    fig4.show()

# Number of reviews distribution
if 'number_of_reviews' in listings_df.columns:
    # Pass the DataFrame with x= so the axis label mapping below is applied
    fig5 = px.histogram(
        listings_df,
        x='number_of_reviews',
        nbins=50,
        title='Distribution of Number of Reviews',
        labels={'number_of_reviews': 'Number of Reviews', 'count': 'Frequency'}
    )
    fig5.update_layout(height=500)
    fig5.show()

# Review scores if available
review_score_cols = [col for col in listings_df.columns if 'review_scores' in col.lower()]
if review_score_cols:
    print(f"\nReview Score Columns Found: {review_score_cols}")
    for col in review_score_cols[:3]:  # Show first 3
        if listings_df[col].notna().sum() > 0:
            # Pass the DataFrame with x= so the axis label mapping below is applied
            fig = px.histogram(
                listings_df.dropna(subset=[col]),
                x=col,
                nbins=20,
                title=f'Distribution of {col}',
                labels={col: col.replace('_', ' ').title(), 'count': 'Frequency'}
            )
            fig.update_layout(height=400)
            fig.show()


In [None]:
# 4. Temporal patterns - day of week, month effects
print("\n" + "=" * 80)
print("EDA: TEMPORAL PATTERNS")
print("=" * 80)

calendar['day_of_week'] = calendar['date'].dt.day_name()
calendar['month'] = calendar['date'].dt.month
calendar['day_of_month'] = calendar['date'].dt.day
calendar['is_weekend'] = calendar['date'].dt.dayofweek >= 5

# Availability by day of week
dow_availability = calendar.groupby('day_of_week')['available_bool'].mean().reset_index()
dow_availability.columns = ['day_of_week', 'availability_rate']
dow_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_availability['day_of_week'] = pd.Categorical(dow_availability['day_of_week'], categories=dow_order, ordered=True)
dow_availability = dow_availability.sort_values('day_of_week')

fig6 = px.bar(
    dow_availability,
    x='day_of_week',
    y='availability_rate',
    title='Average Availability Rate by Day of Week',
    labels={'availability_rate': 'Availability Rate', 'day_of_week': 'Day of Week'}
)
fig6.update_layout(height=500, xaxis_tickangle=-45)
fig6.show()

# Availability by month
month_availability = calendar.groupby('month')['available_bool'].mean().reset_index()
month_availability.columns = ['month', 'availability_rate']
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_availability['month_name'] = month_availability['month'].apply(lambda x: month_names[x-1])

fig7 = px.bar(
    month_availability,
    x='month_name',
    y='availability_rate',
    title='Average Availability Rate by Month',
    labels={'availability_rate': 'Availability Rate', 'month_name': 'Month'}
)
fig7.update_layout(height=500)
fig7.show()

# Weekend vs weekday
weekend_availability = calendar.groupby('is_weekend')['available_bool'].mean()
print(f"\nWeekend Availability Rate: {weekend_availability[True]:.3f}")
print(f"Weekday Availability Rate: {weekend_availability[False]:.3f}")


In [None]:
# ============================================================================
# PART 3: FEATURE ENGINEERING - NUMERICAL FEATURES
# ============================================================================

print("=" * 80)
print("FEATURE ENGINEERING: NUMERICAL FEATURES")
print("=" * 80)

# Start with listings features
features_list = []

# 1. Basic listing numerical features
if 'id' in listings_df.columns:
    # Rename id to listing_id for merging
    listings_df = listings_df.rename(columns={'id': 'listing_id'})

# Select key numerical features from listings
numerical_features = [
    'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 
    'beds', 'price', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
    'reviews_per_month', 'calculated_host_listings_count', 'availability_365',
    'number_of_reviews_ltm'
]

# Add any review score columns
review_score_cols = [col for col in listings_df.columns if 'review_scores' in col.lower()]
numerical_features.extend(review_score_cols)

# Extract numerical features
listing_numerical = listings_df[['listing_id'] + [f for f in numerical_features if f in listings_df.columns]].copy()

# Clean price
if 'price' in listing_numerical.columns:
    listing_numerical['price'] = listing_numerical['price'].apply(clean_price)

# Handle missing values in numerical features - fill with the column median
for col in listing_numerical.columns:
    if col != 'listing_id' and pd.api.types.is_numeric_dtype(listing_numerical[col]):
        missing_count = listing_numerical[col].isnull().sum()
        if missing_count > 0:
            median_val = listing_numerical[col].median()
            # Assign the result back: fillna(inplace=True) on a column selection
            # may not update the DataFrame in newer pandas versions
            listing_numerical[col] = listing_numerical[col].fillna(median_val)
            print(f"   Filled {missing_count} missing values in {col} with median: {median_val:.2f}")

# 2. Aggregate features from calendar
print("\nCreating calendar-based numerical features...")

calendar_features = calendar.groupby('listing_id').agg({
    'available_bool': ['mean', 'sum', 'count'],
    'price_clean': ['mean', 'median', 'std', 'min', 'max'],
    'minimum_nights': ['mean', 'min', 'max'],
    'maximum_nights': ['mean', 'min', 'max']
}).reset_index()

# Flatten column names
calendar_features.columns = ['listing_id', 
                            'avg_availability_rate', 'total_available_days', 'total_calendar_days',
                            'avg_price', 'median_price', 'price_std', 'min_price', 'max_price',
                            'avg_min_nights', 'min_min_nights', 'max_min_nights',
                            'avg_max_nights', 'min_max_nights', 'max_max_nights']

# Fill missing values
calendar_features = calendar_features.fillna(0)

print(f"   Created {len(calendar_features.columns) - 1} calendar-based features")

# 3. Merge listing and calendar numerical features
features_numerical = listing_numerical.merge(calendar_features, on='listing_id', how='left')
features_numerical = features_numerical.fillna(0)

print(f"\nTotal numerical features created: {len(features_numerical.columns) - 1}")
print(f"Shape: {features_numerical.shape}")
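
The calendar aggregation above assigns the flattened column names by hand, which relies on the list order matching the aggregation order. Named aggregation is one way to tie each output name to its source column and function directly. The sketch below uses a toy calendar; the values are made up.

```python
import pandas as pd

demo_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "available_bool": [True, False, True, False, False],
    "minimum_nights": [2, 2, 3, 1, 1],
})

# Named aggregation: each keyword is the output column name,
# paired with (source column, aggregation function).
calendar_summary = (
    demo_calendar.groupby("listing_id")
    .agg(
        avg_availability_rate=("available_bool", "mean"),
        total_available_days=("available_bool", "sum"),
        total_calendar_days=("available_bool", "count"),
        max_min_nights=("minimum_nights", "max"),
    )
    .reset_index()
)
print(calendar_summary)
```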


In [None]:
# ============================================================================
# PART 4: FEATURE ENGINEERING - CATEGORICAL FEATURES
# ============================================================================

print("=" * 80)
print("FEATURE ENGINEERING: CATEGORICAL FEATURES")
print("=" * 80)

# Extract categorical features from listings
categorical_features = ['room_type', 'neighbourhood', 'neighbourhood_group']

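# Add low-cardinality host-related categorical features if available.
# URL or free-text host columns are skipped: one-hot encoding them would
# create roughly one column per listing.
MAX_CATEGORIES = 20  # assumed threshold; tune it for your city
host_cat_features = [col for col in listings_df.columns if 'host_' in col.lower() and 
                     listings_df[col].dtype == 'object' and col not in ['host_id', 'host_name']
                     and listings_df[col].nunique() <= MAX_CATEGORIES]
categorical_features.extend(host_cat_features[:5])  # Limit to first 5 to avoid too many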

# Extract categorical features
listing_categorical = listings_df[['listing_id'] + [f for f in categorical_features if f in listings_df.columns]].copy()

# Handle missing values - fill with 'unknown'
for col in listing_categorical.columns:
    if col != 'listing_id':
        missing_count = listing_categorical[col].isnull().sum()
        if missing_count > 0:
            listing_categorical[col] = listing_categorical[col].fillna('unknown')
            print(f"   Filled {missing_count} missing values in {col} with 'unknown'")

# One-hot encode categorical features
print("\nOne-hot encoding categorical features...")
features_categorical = pd.get_dummies(listing_categorical, columns=[c for c in listing_categorical.columns if c != 'listing_id'], 
                                      prefix_sep='_', drop_first=False)

print(f"   Created {len(features_categorical.columns) - 1} categorical features (one-hot encoded)")
print(f"   Shape: {features_categorical.shape}")

# Merge with numerical features
features_combined = features_numerical.merge(features_categorical, on='listing_id', how='left')
features_combined = features_combined.fillna(0)

print(f"\nCombined features (numerical + categorical): {features_combined.shape}")
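
One-hot encoding grows by one column per distinct value, so columns with many rare values can inflate the dataset. A common mitigation, sketched here on toy data, is to keep the most frequent categories and pool the rest before encoding. The helper name `limit_categories` and the sample values are hypothetical.

```python
import pandas as pd

def limit_categories(values: pd.Series, top_n: int = 10, other: str = "other") -> pd.Series:
    """Keep the top_n most frequent categories and pool the rest into `other`."""
    keep = values.value_counts().nlargest(top_n).index
    return values.where(values.isin(keep), other)

demo = pd.DataFrame({
    "listing_id": [1, 2, 3, 4, 5],
    "host_location": ["Asheville", "Asheville", "Miami", "Austin", None],
})

# Fill gaps first, then pool everything except the single most common value.
demo["host_location"] = limit_categories(
    demo["host_location"].fillna("unknown"), top_n=1
)
encoded = pd.get_dummies(demo, columns=["host_location"], dtype=int)
print(encoded.columns.tolist())
```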


In [None]:
# ============================================================================
# PART 5: FEATURE ENGINEERING - TEMPORAL FEATURES
# ============================================================================

print("=" * 80)
print("FEATURE ENGINEERING: TEMPORAL FEATURES")
print("=" * 80)

# Create temporal features from calendar data
# For each listing-date combination, extract temporal features
calendar_temporal = calendar[['listing_id', 'date']].copy()
calendar_temporal['year'] = calendar_temporal['date'].dt.year
calendar_temporal['month'] = calendar_temporal['date'].dt.month
calendar_temporal['day_of_month'] = calendar_temporal['date'].dt.day
calendar_temporal['day_of_week'] = calendar_temporal['date'].dt.dayofweek
calendar_temporal['is_weekend'] = (calendar_temporal['day_of_week'] >= 5).astype(int)
calendar_temporal['is_month_start'] = (calendar_temporal['day_of_month'] <= 3).astype(int)
calendar_temporal['is_month_end'] = (calendar_temporal['day_of_month'] >= 28).astype(int)

# Extract quarter
calendar_temporal['quarter'] = calendar_temporal['date'].dt.quarter

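# For the final dataset, we'll merge these temporal features with calendar
# But first, let's create aggregated temporal patterns per listing.
# Restrict to available days: over the full calendar every listing has nearly
# the same weekend share and month mix, so those aggregates carry no signal.
# These features are derived from the target column, so when modelling they
# should be computed on training dates only.
available_days = calendar[calendar['available_bool']]
temporal_patterns = available_days.groupby('listing_id').agg(
    weekend_preference=('is_weekend', 'mean'),  # Share of available days that fall on weekends
    preferred_month=('month', lambda x: x.mode().iloc[0]),  # Most common available month
    preferred_day=('date', lambda x: x.dt.dayofweek.mode().iloc[0])  # Most common available weekday (0 = Monday)
).reset_index()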

print(f"   Created {len(temporal_patterns.columns) - 1} aggregated temporal pattern features")

# Merge temporal patterns
features_combined = features_combined.merge(temporal_patterns, on='listing_id', how='left')
features_combined = features_combined.fillna(0)

print(f"Features after adding temporal patterns: {features_combined.shape}")

# Note: When creating the final dataset, we'll add date-specific temporal features
# to each calendar row
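
Integer month and weekday codes place December far from January and Sunday far from Monday, although they are adjacent in time. A cyclical sine/cosine encoding is one optional way to represent that; the sketch below is an illustrative addition on made-up dates, and the `*_sin` / `*_cos` column names are assumptions.

```python
import numpy as np
import pandas as pd

dates = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-15", "2025-06-27", "2025-12-31"])
})

# Map month (1-12) and weekday (0-6) onto the unit circle so that
# neighbouring periods stay close after encoding.
month = dates["date"].dt.month
weekday = dates["date"].dt.dayofweek
dates["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
dates["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)
dates["dow_sin"] = np.sin(2 * np.pi * weekday / 7)
dates["dow_cos"] = np.cos(2 * np.pi * weekday / 7)
print(dates.round(3))
```

Each pair of columns lies on the unit circle, so distance between two encoded values reflects how far apart the periods are in the cycle.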


In [None]:
# ============================================================================
# PART 6: FEATURE ENGINEERING - TEXTUAL FEATURES (TF-IDF)
# ============================================================================

print("=" * 80)
print("FEATURE ENGINEERING: TEXTUAL FEATURES (TF-IDF)")
print("=" * 80)

from sklearn.feature_extraction.text import TfidfVectorizer

# 1. TF-IDF on listing descriptions
text_columns = ['name', 'description', 'neighborhood_overview']
available_text_cols = [col for col in text_columns if col in listings_df.columns]

print(f"\nProcessing text columns: {available_text_cols}")

# Combine text columns
listings_df['combined_text'] = listings_df[available_text_cols].fillna('').apply(
    lambda row: ' '.join(row.astype(str)), axis=1
)

# Remove HTML tags and clean text
import re
def clean_text(text):
    if pd.isna(text):
        return ''
    text = str(text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text.lower()

listings_df['combined_text_clean'] = listings_df['combined_text'].apply(clean_text)

# Create TF-IDF features (limit to top 50 features to avoid too many dimensions)
print("\nCreating TF-IDF features from listing descriptions...")
tfidf = TfidfVectorizer(max_features=50, stop_words='english', ngram_range=(1, 2))
tfidf_features = tfidf.fit_transform(listings_df['combined_text_clean'])

# Convert to DataFrame
tfidf_df = pd.DataFrame(
    tfidf_features.toarray(),
    columns=[f'tfidf_desc_{i}' for i in range(tfidf_features.shape[1])],
    index=listings_df.index
)
tfidf_df['listing_id'] = listings_df['listing_id'].values

print(f"   Created {len(tfidf_df.columns) - 1} TF-IDF features from descriptions")

# 2. TF-IDF on review comments
print("\nCreating TF-IDF features from review comments...")

# Aggregate reviews per listing
reviews_agg = reviews_df.groupby('listing_id')['comments'].apply(
    lambda x: ' '.join(x.fillna('').astype(str))
).reset_index()
reviews_agg['comments_clean'] = reviews_agg['comments'].apply(clean_text)

# Create TF-IDF for reviews (top 30 features)
tfidf_reviews = TfidfVectorizer(max_features=30, stop_words='english', ngram_range=(1, 2))
tfidf_reviews_features = tfidf_reviews.fit_transform(reviews_agg['comments_clean'])

tfidf_reviews_df = pd.DataFrame(
    tfidf_reviews_features.toarray(),
    columns=[f'tfidf_review_{i}' for i in range(tfidf_reviews_features.shape[1])],
    index=reviews_agg.index
)
tfidf_reviews_df['listing_id'] = reviews_agg['listing_id'].values

print(f"   Created {len(tfidf_reviews_df.columns) - 1} TF-IDF features from reviews")

# Merge TF-IDF features
features_combined = features_combined.merge(tfidf_df, on='listing_id', how='left')
features_combined = features_combined.merge(tfidf_reviews_df, on='listing_id', how='left')
features_combined = features_combined.fillna(0)

print(f"\nFeatures after adding TF-IDF: {features_combined.shape}")


In [None]:
# ============================================================================
# PART 7: ADDITIONAL FEATURE ENGINEERING
# ============================================================================

print("=" * 80)
print("FEATURE ENGINEERING: ADDITIONAL DERIVED FEATURES")
print("=" * 80)

# 1. Review-based features
print("\nCreating review-based features...")

review_stats = reviews_df.groupby('listing_id').agg({
    'id': 'count',  # Total number of reviews
    'date': ['min', 'max']  # First and last review dates
}).reset_index()
review_stats.columns = ['listing_id', 'total_reviews_count', 'first_review_date', 'last_review_date']

# Convert dates
review_stats['first_review_date'] = pd.to_datetime(review_stats['first_review_date'])
review_stats['last_review_date'] = pd.to_datetime(review_stats['last_review_date'])

# Calculate review recency (days since last review)
# Derive the reference date from the scrape itself instead of hard-coding it
current_date = pd.to_datetime(listings_df['last_scraped']).max()
review_stats['days_since_last_review'] = (current_date - review_stats['last_review_date']).dt.days
review_stats['review_span_days'] = (review_stats['last_review_date'] - review_stats['first_review_date']).dt.days
review_stats['reviews_per_day'] = review_stats['total_reviews_count'] / (review_stats['review_span_days'] + 1)

# Fill missing values
review_stats = review_stats.fillna(0)

# Select only numerical columns for merging
review_features = review_stats[['listing_id', 'total_reviews_count', 'days_since_last_review', 
                                'review_span_days', 'reviews_per_day']].copy()

features_combined = features_combined.merge(review_features, on='listing_id', how='left')
features_combined = features_combined.fillna(0)

print(f"   Created {len(review_features.columns) - 1} review-based features")

# 2. Price-related features
print("\nCreating price-related features...")
if 'price' in features_combined.columns:
    # Price per person (if accommodates available). A listing hosts at least one
    # guest, so clip guards against zeros without inflating the denominator.
    if 'accommodates' in features_combined.columns:
        features_combined['price_per_person'] = features_combined['price'] / features_combined['accommodates'].clip(lower=1)
    
    # Price per bedroom (+1 avoids division by zero for studios with 0 bedrooms,
    # so this is price per (bedrooms + 1) rather than a true ratio)
    if 'bedrooms' in features_combined.columns:
        features_combined['price_per_bedroom'] = features_combined['price'] / (features_combined['bedrooms'] + 1)
    
    # Price per bed (+1 for the same reason)
    if 'beds' in features_combined.columns:
        features_combined['price_per_bed'] = features_combined['price'] / (features_combined['beds'] + 1)

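# 3. Host experience features
# Heuristic flag (host has >= 5 listings and this listing has >= 20 reviews);
# it is not Airbnb's official Superhost status.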
print("\nCreating host experience features...")
if 'calculated_host_listings_count' in features_combined.columns:
    features_combined['is_superhost_candidate'] = (
        (features_combined['calculated_host_listings_count'] >= 5) & 
        (features_combined.get('number_of_reviews', 0) >= 20)
    ).astype(int)

# 4. Availability features
if 'availability_365' in features_combined.columns:
    features_combined['availability_rate_365'] = features_combined['availability_365'] / 365

print(f"\nFinal feature set shape: {features_combined.shape}")
print(f"Total features: {len(features_combined.columns) - 1}")  # -1 for listing_id
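
Several listing-level features built above (for example `avg_availability_rate` and `total_available_days`) are aggregated from the same `available` column that becomes the target in the next step, so they can leak the answer into the features. One way to limit this, sketched below on toy data, is to aggregate the target only over dates before a cutoff and attach the result to later rows. The cutoff date and the `past_availability_rate` name are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(
        ["2025-07-01", "2025-07-02", "2025-08-01", "2025-08-02"] * 2
    ),
    "available_bool": [True, False, True, True, False, False, True, False],
})

cutoff = pd.Timestamp("2025-08-01")  # hypothetical split date
history = demo[demo["date"] < cutoff]
future = demo[demo["date"] >= cutoff]

# Aggregate the target over the history window only...
past_rate = (
    history.groupby("listing_id")["available_bool"].mean()
    .rename("past_availability_rate")
    .reset_index()
)

# ...and attach it to rows on or after the cutoff, which form the modelling set.
model_rows = future.merge(past_rate, on="listing_id", how="left")
print(model_rows)
```

With this layout, no row's feature value is computed from its own target or from later dates.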


In [None]:
# ============================================================================
# PART 8: CREATE UNIFIED ML-READY DATASET
# ============================================================================

print("=" * 80)
print("CREATING UNIFIED ML-READY DATASET")
print("=" * 80)

# The target variable is availability for each listing-date combination
# We'll merge calendar data with all engineered features

# Prepare calendar with target variable
calendar_ml = calendar[['listing_id', 'date', 'available_bool']].copy()
calendar_ml['target'] = calendar_ml['available_bool'].astype(int)

# Add temporal features to calendar
calendar_ml['year'] = calendar_ml['date'].dt.year
calendar_ml['month'] = calendar_ml['date'].dt.month
calendar_ml['day_of_month'] = calendar_ml['date'].dt.day
calendar_ml['day_of_week'] = calendar_ml['date'].dt.dayofweek
calendar_ml['is_weekend'] = (calendar_ml['day_of_week'] >= 5).astype(int)
calendar_ml['quarter'] = calendar_ml['date'].dt.quarter
calendar_ml['is_month_start'] = (calendar_ml['day_of_month'] <= 3).astype(int)  # first 3 days of the month
calendar_ml['is_month_end'] = (calendar_ml['day_of_month'] >= 28).astype(int)  # day 28 onward

# Calculate days elapsed since the first calendar date (captures trend over the calendar window)
reference_date = calendar_ml['date'].min()
calendar_ml['days_from_start'] = (calendar_ml['date'] - reference_date).dt.days

print(f"\nCalendar shape: {calendar_ml.shape}")

# Merge with engineered features
ml_dataset = calendar_ml.merge(features_combined, on='listing_id', how='left')

# Fill any remaining missing values
ml_dataset = ml_dataset.fillna(0)

print(f"\nML Dataset shape: {ml_dataset.shape}")
print(f"Total columns (including listing_id, date, available_bool, target): {len(ml_dataset.columns)}")
print("Target variable: 'target' (1 = available, 0 = not available)")

# Check final missing values
print("\n" + "=" * 80)
print("FINAL MISSING VALUES CHECK")
print("=" * 80)
missing_final = ml_dataset.isnull().sum()
missing_final = missing_final[missing_final > 0]
if len(missing_final) > 0:
    print("Columns with missing values:")
    print(missing_final)
else:
    print("✓ No missing values in final dataset!")

# Check data types
print("\n" + "=" * 80)
print("DATA TYPES SUMMARY")
print("=" * 80)
print(ml_dataset.dtypes.value_counts())

# Display sample
print("\n" + "=" * 80)
print("SAMPLE OF ML-READY DATASET")
print("=" * 80)
print(ml_dataset.head())
print(f"\nDataset shape: {ml_dataset.shape}")
print(f"Columns: {list(ml_dataset.columns[:10])}... (showing first 10)")
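
The merge in Part 8 relies on the feature table having exactly one row per `listing_id`; a duplicated key would silently multiply calendar rows, and listings missing from the feature table would end up with all-zero features after `fillna(0)`. The sketch below, on toy frames with made-up values, shows how the `validate` and `indicator` arguments of `DataFrame.merge` can surface both problems.

```python
import pandas as pd

# Toy stand-ins for calendar_ml and features_combined
calendar_toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01"]),
    "target": [1, 0, 1, 0],
})
features_toy = pd.DataFrame({
    "listing_id": [1, 2],
    "accommodates": [2, 4],
})

merged = calendar_toy.merge(
    features_toy,
    on="listing_id",
    how="left",
    validate="many_to_one",  # raises MergeError if features_toy repeats a listing_id
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Listings present in the calendar but absent from the feature table
unmatched = merged.loc[merged["_merge"] == "left_only", "listing_id"].unique()
print(f"Rows before/after merge: {len(calendar_toy)} / {len(merged)}")
print(f"Listings without features: {list(unmatched)}")
```

Dropping the `_merge` column after the check keeps it out of the feature set.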


In [None]:
# ============================================================================
# PART 9: FINAL VALIDATION AND SUMMARY
# ============================================================================

print("=" * 80)
print("FINAL VALIDATION AND SUMMARY")
print("=" * 80)

# 1. Check for any infinite values
inf_cols = []
for col in ml_dataset.select_dtypes(include=[np.number]).columns:
    if np.isinf(ml_dataset[col]).any():
        inf_cols.append(col)
        ml_dataset[col] = ml_dataset[col].replace([np.inf, -np.inf], 0)

if inf_cols:
    print(f"✓ Replaced infinite values in {len(inf_cols)} columns")
else:
    print("✓ No infinite values found")

# 2. Check target distribution
print("\nTarget Variable Distribution:")
target_dist = ml_dataset['target'].value_counts()
print(target_dist)
print(f"   Available (1): {target_dist.get(1, 0):,} ({target_dist.get(1, 0)/len(ml_dataset)*100:.2f}%)")
print(f"   Not Available (0): {target_dist.get(0, 0):,} ({target_dist.get(0, 0)/len(ml_dataset)*100:.2f}%)")

# 3. Feature categories summary
print("\n" + "=" * 80)
print("FEATURE CATEGORIES SUMMARY")
print("=" * 80)

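# Note: categories are matched by substring, so a column can land in more than one
# (e.g. 'total_reviews_count' matches both 'total_' and 'review'); counts may overlap
feature_categories = {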
    'Temporal': [col for col in ml_dataset.columns if col in ['year', 'month', 'day_of_month', 'day_of_week', 
                                                               'is_weekend', 'quarter', 'is_month_start', 
                                                               'is_month_end', 'days_from_start']],
    'Numerical (Listings)': [col for col in ml_dataset.columns if col in numerical_features],
    'Numerical (Calendar Aggregates)': [col for col in ml_dataset.columns if 'avg_' in col or 'total_' in col or 
                                        'median_' in col or 'min_' in col or 'max_' in col or 'std' in col],
    'Categorical (One-hot)': [col for col in ml_dataset.columns if any(cat in col for cat in categorical_features)],
    'TF-IDF (Descriptions)': [col for col in ml_dataset.columns if 'tfidf_desc' in col],
    'TF-IDF (Reviews)': [col for col in ml_dataset.columns if 'tfidf_review' in col],
    'Review-based': [col for col in ml_dataset.columns if 'review' in col.lower() and 'tfidf' not in col],
    'Price-derived': [col for col in ml_dataset.columns if 'price_per' in col],
    'Other': []
}

# Categorize remaining columns
all_categorized = set()
for cat, cols in feature_categories.items():
    all_categorized.update(cols)

other_cols = [col for col in ml_dataset.columns if col not in all_categorized 
              and col not in ['listing_id', 'date', 'available_bool', 'target']]
feature_categories['Other'] = other_cols

for category, features in feature_categories.items():
    if features:
        print(f"\n{category}: {len(features)} features")
        if len(features) <= 10:
            print(f"   {features}")
        else:
            print(f"   {features[:5]} ... and {len(features)-5} more")

# 4. Save the dataset
print("\n" + "=" * 80)
print("SAVING ML-READY DATASET")
print("=" * 80)

# Remove non-feature columns for ML (keep listing_id and date for reference if needed)
ml_features = ml_dataset.drop(['available_bool'], axis=1, errors='ignore')

# Optionally save to CSV (commented out to avoid large files)
# ml_features.to_csv('ml_ready_dataset.csv', index=False)
# print("✓ Dataset saved to 'ml_ready_dataset.csv'")

print(f"\n✓ ML-ready dataset created successfully!")
print(f"   Shape: {ml_features.shape}")
print(f"   Features: {len(ml_features.columns) - 3}")  # Excluding listing_id, date, target
print(f"   Target: 'target' (availability: 1 = available, 0 = not available)")

# Display final statistics
print("\n" + "=" * 80)
print("DATASET STATISTICS")
print("=" * 80)
print(f"Total rows: {len(ml_features):,}")
print(f"Total columns: {len(ml_features.columns)}")
print(f"Unique listings: {ml_features['listing_id'].nunique():,}")
print(f"Date range: {ml_features['date'].min()} to {ml_features['date'].max()}")
print(f"Memory usage: {ml_features.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
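
Because every listing-level feature is repeated for each calendar date, the memory figure printed above can grow quickly. One possible mitigation is to downcast numeric columns before saving or modelling. The sketch below assumes a hypothetical helper `downcast_numeric` and a toy frame; note that going from `float64` to `float32` trades precision for space.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast to smaller dtypes.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data standing in for ml_features (values are made up)
toy = pd.DataFrame({
    "is_weekend": np.zeros(1000, dtype="int64"),
    "price_per_person": np.arange(1000) * 0.5,
})
small = downcast_numeric(toy)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"Memory: {before:,} bytes -> {after:,} bytes")
```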


In [None]:
# ============================================================================
# PART 10: FINAL EDA VISUALIZATION - FEATURE IMPORTANCE PREVIEW
# ============================================================================

print("=" * 80)
print("FINAL EDA: CORRELATION WITH TARGET")
print("=" * 80)

# Calculate correlation of numerical features with target
numerical_cols = ml_dataset.select_dtypes(include=[np.number]).columns.tolist()
# Remove non-feature columns
numerical_cols = [col for col in numerical_cols if col not in ['listing_id', 'target', 'available_bool']]

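# Calculate absolute correlation of every numerical feature with the target
correlations = []
for col in numerical_cols:
    try:
        corr = ml_dataset[col].corr(ml_dataset['target'])
    except (TypeError, ValueError):
        continue  # skip columns that cannot be correlated
    if not np.isnan(corr):  # constant columns yield NaN
        correlations.append({'feature': col, 'correlation': abs(corr)})

# Explicit column names keep sort_values from failing when the list is empty
corr_df = (
    pd.DataFrame(correlations, columns=['feature', 'correlation'])
    .sort_values('correlation', ascending=False)
    .head(20)
)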

if len(corr_df) > 0:
    fig = px.bar(
        corr_df,
        x='correlation',
        y='feature',
        orientation='h',
        title='Top 20 Features by Absolute Correlation with Target (Availability)',
        labels={'correlation': 'Absolute Correlation', 'feature': 'Feature'}
    )
    fig.update_layout(height=600, yaxis={'categoryorder': 'total ascending'})
    fig.show()
    
    print("\nTop 10 features most correlated with availability:")
    print(corr_df.head(10)[['feature', 'correlation']].to_string(index=False))
else:
    print("Could not calculate correlations")

# Final summary
print("\n" + "=" * 80)
print("✓ EXERCISE COMPLETED SUCCESSFULLY!")
print("=" * 80)
print("\nSummary:")
print(f"  • EDA: Comprehensive exploratory data analysis with Plotly visualizations")
print(f"  • Numerical Features: Extracted and engineered from listings and calendar")
print(f"  • Categorical Features: One-hot encoded from listings attributes")
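print("  • Temporal Features: Year, quarter, month, day of month, day of week, weekend and month start/end flags")
print(f"  • Textual Features: TF-IDF from descriptions "
      f"({sum('tfidf_desc' in c for c in ml_features.columns)} features) and reviews "
      f"({sum('tfidf_review' in c for c in ml_features.columns)} features)")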
print(f"  • Additional Features: Review statistics, price ratios, host metrics")
print(f"  • Final Dataset: {ml_features.shape[0]:,} rows × {ml_features.shape[1]} columns")
print(f"  • Target Variable: 'target' (1 = available, 0 = not available)")
print("  • Missing Values: All handled (remaining gaps filled with 0)")
print(f"  • Data Quality: Cleaned and validated")
print("\nThe dataset is now ready for machine learning model training!")
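
The bar chart in Part 10 ranks features by absolute correlation, which hides whether a feature is associated with more or less availability. As a rough sketch on toy data (column names are illustrative, not taken from the dataset), `DataFrame.corrwith` can compute signed correlations in one call, and the ranking can still be done by strength.

```python
import pandas as pd

# Toy data; values are made up for illustration
toy = pd.DataFrame({
    "target":     [0, 0, 1, 1, 0, 1],
    "feat_pos":   [1, 2, 5, 6, 2, 7],  # rises with the target
    "feat_neg":   [9, 8, 3, 2, 7, 1],  # falls as the target rises
    "feat_const": [4, 4, 4, 4, 4, 4],  # constant, so its correlation is undefined (NaN)
})

features = toy.drop(columns="target")
signed = features.corrwith(toy["target"]).dropna()

# Rank by strength but keep the sign for interpretation
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```

Correlation with a binary target only captures linear association, so it is best read as a first look rather than a measure of feature importance.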
