# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual via TF–IDF, and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You are free to add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power. Use text-vectorization methods such as TF–IDF to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
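
One way to follow the `.sample()` hint without breaking the joins between tables is to sample listing IDs once and filter every table by them. The sketch below runs on toy frames; `sample_by_listing` is a hypothetical helper (not part of pandas), and the column names `id` and `listing_id` are assumed to match the Inside Airbnb files.

```python
import pandas as pd


def sample_by_listing(listings, calendar, reviews, frac=0.1, seed=42):
    """Sample a fraction of listings and keep only their calendar/review rows.

    Hypothetical helper: assumes `listings` has an `id` column and the other
    two frames have a `listing_id` column.
    """
    kept_ids = listings["id"].sample(frac=frac, random_state=seed)
    listings_s = listings[listings["id"].isin(kept_ids)]
    calendar_s = calendar[calendar["listing_id"].isin(kept_ids)]
    reviews_s = reviews[reviews["listing_id"].isin(kept_ids)]
    return listings_s, calendar_s, reviews_s


# Toy data standing in for the real files
toy_listings = pd.DataFrame({"id": [1, 2, 3, 4]})
toy_calendar = pd.DataFrame(
    {"listing_id": [1, 1, 2, 3, 4], "available": list("tftft")}
)
toy_reviews = pd.DataFrame({"listing_id": [2, 2, 4], "comments": ["a", "b", "c"]})

l_s, c_s, r_s = sample_by_listing(toy_listings, toy_calendar, toy_reviews, frac=0.5)
```

Sampling at the listing level keeps each sampled listing's full calendar and review history, so the rest of the pipeline behaves the same on the subset as on the full data.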




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it if you have time.  
Avoid over-complicating your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [3]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [4]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


('', [])

In [5]:
import pandas as pd
calendar = pd.read_csv("../Data/calendar.csv.gz", compression="gzip")
print(calendar.head())

           listing_id        date available  price  adjusted_price  \
0  686088974677118082  2025-06-27         f    NaN             NaN   
1  686088974677118082  2025-06-28         f    NaN             NaN   
2  686088974677118082  2025-06-29         t    NaN             NaN   
3  686088974677118082  2025-06-30         t    NaN             NaN   
4  686088974677118082  2025-07-01         t    NaN             NaN   

   minimum_nights  maximum_nights  
0               2            1125  
1               2            1125  
2               2            1125  
3               2            1125  
4               2            1125  
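
The calendar columns arrive as raw strings (`date`, `available`), so a first step could be to cast them and derive a binary target plus a few temporal features. The sketch below uses toy rows. The `"$1,200.00"` price format is an assumption about how populated price values are encoded, and `available_target` is the target name used by the merge step later in this notebook.

```python
import pandas as pd

# Toy rows mimicking calendar.csv.gz (price format is an assumption)
toy_calendar = pd.DataFrame(
    {
        "listing_id": [1, 1, 2],
        "date": ["2025-06-27", "2025-06-28", "2025-06-29"],
        "available": ["f", "t", "t"],
        "price": ["$95.00", None, "$1,200.00"],
    }
)

cal = toy_calendar.copy()
cal["date"] = pd.to_datetime(cal["date"])

# Binary target: 1 if the listing is available on that date, 0 otherwise
cal["available_target"] = cal["available"].map({"t": 1, "f": 0})

# Strip currency symbols and thousands separators, then convert;
# anything that is not a number (including missing values) becomes NaN
cal["price"] = pd.to_numeric(
    cal["price"].astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Simple temporal features derived from the date
cal["day_of_week"] = cal["date"].dt.dayofweek  # Monday=0 ... Sunday=6
cal["month"] = cal["date"].dt.month
cal["is_weekend"] = cal["day_of_week"] >= 5
```

Going through `.astype(str)` first means the same code also works when the whole `price` column is empty and was read as float, as in the output above.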


In [6]:
listings_gz = pd.read_csv("../Data/listings.csv.gz", compression="gzip")
print(listings_gz.head())


       id                          listing_url       scrape_id last_scraped  \
0  155305  https://www.airbnb.com/rooms/155305  20250617145515   2025-06-17   
1  197263  https://www.airbnb.com/rooms/197263  20250617145515   2025-06-17   
2  209068  https://www.airbnb.com/rooms/209068  20250617145515   2025-06-17   
3  246315  https://www.airbnb.com/rooms/246315  20250617145515   2025-06-17   
4  314540  https://www.airbnb.com/rooms/314540  20250617145515   2025-06-17   

        source                                               name  \
0  city scrape                 Cottage! BonPaul + Sharky's Hostel   
1  city scrape                       Tranquil Room & Private Bath   
2  city scrape                                    Terrace Cottage   
3  city scrape                          Asheville Dreamer's Cabin   
4  city scrape  Asheville Urban Farmhouse Entire Home 4.6 mi t...   

                                         description  \
0  West Asheville Cottage within walking distance...  

In [7]:
reviews_gz = pd.read_csv("../Data/reviews.csv.gz", compression="gzip")
print(reviews_gz.head())



   listing_id       id        date  reviewer_id reviewer_name  \
0      155305   409437  2011-07-31       844309       Jillian   
1      155305   469775  2011-08-23       343443         Katie   
2      155305   548257  2011-09-19      1152025         Katie   
3      155305   671470  2011-10-28      1245885         Jason   
4      155305  1606327  2012-07-01      1891395         Craig   

                                            comments  
0  We had a wonderful time! The cottage was very ...  
1  Place was great! Can't really speak to the ins...  
2  We had a great time!  The cabin was nice and a...  
3  Clean and comfortable room with everything you...  
4  The cabin was solid for an overnight stay. It ...  


In [8]:
neighbourhoods_geo = pd.read_json("../Data/neighbourhoods.geojson")
print(neighbourhoods_geo.head())



                type                                           features
0  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
1  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
2  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
3  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
4  FeatureCollection  {'type': 'Feature', 'geometry': {'type': 'Mult...
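
As the output above shows, `pd.read_json` keeps each GeoJSON feature as a nested dict in a single column. One way to get a tabular view is `pd.json_normalize` on the list of features. The sketch below uses an inline toy `FeatureCollection`; the property names `neighbourhood` and `neighbourhood_group` are an assumption and should be checked against the real file, which can be loaded with `json.load`.

```python
import pandas as pd

# Toy FeatureCollection; property names are assumed, inspect your own file
toy_geo = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "MultiPolygon", "coordinates": []},
            "properties": {"neighbourhood": "28801", "neighbourhood_group": None},
        },
        {
            "type": "Feature",
            "geometry": {"type": "MultiPolygon", "coordinates": []},
            "properties": {"neighbourhood": "28803", "neighbourhood_group": None},
        },
    ],
}

# Flatten nested dicts into dotted column names, one row per feature
geo_df = pd.json_normalize(toy_geo["features"])
```

The resulting columns include `properties.neighbourhood` and `geometry.type`, which are easier to join with the listings table than the nested dicts.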


In [9]:
listings = pd.read_csv("../Data/listings.csv")
print(listings.head())


       id                                               name  host_id  \
0  155305                 Cottage! BonPaul + Sharky's Hostel   746673   
1  197263                       Tranquil Room & Private Bath   961396   
2  209068                                    Terrace Cottage  1029919   
3  246315                          Asheville Dreamer's Cabin  1292070   
4  314540  Asheville Urban Farmhouse Entire Home 4.6 mi t...   381660   

  host_name  neighbourhood_group  neighbourhood   latitude  longitude  \
0   BonPaul                  NaN          28806  35.578640 -82.595780   
1   Timothy                  NaN          28806  35.577350 -82.638040   
2     Kevin                  NaN          28804  35.617641 -82.551819   
3     Annie                  NaN          28805  35.596150 -82.506350   
4       Tom                  NaN          28806  35.585610 -82.627310   

         room_type  price  minimum_nights  number_of_reviews last_review  \
0  Entire home/apt   95.0               1     

In [10]:
reviews = pd.read_csv("../Data/reviews.csv")
print(reviews.head())


   listing_id        date
0      155305  2011-07-31
1      155305  2011-08-23
2      155305  2011-09-19
3      155305  2011-10-28
4      155305  2012-07-01


In [11]:
neighbourhoods = pd.read_csv("../Data/neighbourhoods.csv")
print(neighbourhoods.head())


   neighbourhood_group  neighbourhood
0                  NaN          28704
1                  NaN          28715
2                  NaN          28732
3                  NaN          28801
4                  NaN          28803


In [None]:
# ------------------------------------------------------------
# 3. Listings feature selection & cleaning
# ------------------------------------------------------------
# Goal: select a subset of informative features and clean them.

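# The summary file loaded as `listings` lacks most of the candidate columns
# below (host, review-score, amenities), so work from the detailed
# `listings_gz` frame instead.
listings = listings_gz.copy()

# If listings has 'id' instead of 'listing_id', rename it for consistency
if "listing_id" not in listings.columns and "id" in listings.columns:
    listings = listings.rename(columns={"id": "listing_id"})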

# Candidate listing-level features (we keep only those that actually exist)
listing_feature_candidates = [
    "listing_id",
    # Host-related
    "host_is_superhost",
    "host_response_time",
    "host_response_rate",
    "host_listings_count",
    "host_total_listings_count",
    # Property characteristics
    "room_type",
    "property_type",
    "accommodates",
    "bathrooms",
    "bathrooms_text",
    "bedrooms",
    "beds",
    "minimum_nights",
    "maximum_nights",
    "minimum_minimum_nights",
    "maximum_minimum_nights",
    "minimum_maximum_nights",
    "maximum_maximum_nights",
    # Availability summary
    "availability_30",
    "availability_60",
    "availability_90",
    "availability_365",
    # Reviews summary already in listings
    "number_of_reviews",
    "number_of_reviews_ltm",
    "review_scores_rating",
    "review_scores_cleanliness",
    "review_scores_location",
    "review_scores_value",
    # Location
    "neighbourhood_cleansed",
    "latitude",
    "longitude",
    # Booking behavior
    "instant_bookable",
    # Potentially useful text/meta
    "amenities",
]

listing_features_present = [c for c in listing_feature_candidates if c in listings.columns]
listings_feat = listings[listing_features_present].copy()

print("Listing feature columns used:", listing_features_present)

# Clean simple boolean-like columns if they exist
bool_candidates = ["host_is_superhost", "instant_bookable"]
for col in bool_candidates:
    if col in listings_feat.columns:
        listings_feat[col] = (
            listings_feat[col]
            .astype(str)
            .str.lower()
            .map({"t": 1, "true": 1, "y": 1, "yes": 1, "f": 0, "false": 0, "n": 0, "no": 0})
        )

# Convert host_response_rate from "97%" to float in [0,1]
if "host_response_rate" in listings_feat.columns:
    listings_feat["host_response_rate"] = (
        listings_feat["host_response_rate"]
        .astype(str)
        .str.replace("%", "", regex=False)
        .str.strip()
    )
    listings_feat["host_response_rate"] = pd.to_numeric(
        listings_feat["host_response_rate"], errors="coerce"
    ) / 100.0

# Numeric conversion for obvious numeric columns (safe conversion)
numeric_like_cols = [
    "accommodates",
    "bathrooms",
    "bedrooms",
    "beds",
    "minimum_nights",
    "maximum_nights",
    "minimum_minimum_nights",
    "maximum_minimum_nights",
    "minimum_maximum_nights",
    "maximum_maximum_nights",
    "availability_30",
    "availability_60",
    "availability_90",
    "availability_365",
    "number_of_reviews",
    "number_of_reviews_ltm",
    "review_scores_rating",
    "review_scores_cleanliness",
    "review_scores_location",
    "review_scores_value",
]
for col in numeric_like_cols:
    if col in listings_feat.columns:
        listings_feat[col] = pd.to_numeric(listings_feat[col], errors="coerce")
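
The merge step later in this notebook expects two frames, `calendar_base` and `reviews_agg`, whose construction is not shown. The sketch below is one possible way to build them on toy data. It is an assumption rather than the original step, and the aggregate column names (`review_count`, `first_review_date`, `last_review_date`, `review_span_days`) are hypothetical.

```python
import pandas as pd

toy_calendar = pd.DataFrame(
    {
        "listing_id": [1, 1, 2],
        "date": ["2025-06-27", "2025-06-28", "2025-06-27"],
        "available": ["f", "t", "t"],
    }
)
toy_reviews = pd.DataFrame(
    {
        "listing_id": [1, 1, 1],
        "date": ["2024-01-05", "2024-03-10", "2025-02-01"],
    }
)

# Calendar base: one row per (listing_id, date) with a binary target
calendar_base = toy_calendar[["listing_id", "date"]].copy()
calendar_base["date"] = pd.to_datetime(calendar_base["date"])
calendar_base["available_target"] = toy_calendar["available"].map({"t": 1, "f": 0})

# Review aggregates: one row per listing so the later merge stays many-to-one
rev = toy_reviews.assign(date=pd.to_datetime(toy_reviews["date"]))
reviews_agg = (
    rev.groupby("listing_id")
    .agg(
        review_count=("date", "size"),
        first_review_date=("date", "min"),
        last_review_date=("date", "max"),
    )
    .reset_index()
)
reviews_agg["review_span_days"] = (
    reviews_agg["last_review_date"] - reviews_agg["first_review_date"]
).dt.days
```

Listings with no reviews are absent from `reviews_agg`, so a left merge leaves their aggregate columns missing. Those gaps then need an explicit imputation decision.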



In [None]:
# ------------------------------------------------------------
# 5. Merge calendar + listings + reviews into a single dataset
# ------------------------------------------------------------

# Start from calendar base (listing_id, date, target)
df = calendar_base.copy()

# Merge listing-level static features
df = df.merge(listings_feat, on="listing_id", how="left")

# Merge review-level features
df = df.merge(reviews_agg, on="listing_id", how="left")

print("Merged dataset shape (before cleaning):", df.shape)

# ------------------------------------------------------------
# 6. Basic cleaning of the merged dataset
# ------------------------------------------------------------

# Remove rows without target
df = df.dropna(subset=["available_target"])

# Identify numeric and categorical columns (excluding identifiers and target)
id_cols = ["listing_id", "date"]
target_col = "available_target"
excluded_cols = id_cols + [target_col]

numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()
# Keep identifiers and the target out of the feature lists so they are
# neither imputed nor recast
numeric_feature_cols = [c for c in numeric_cols if c not in excluded_cols]

categorical_cols = [
    c
    for c in df.select_dtypes(include=["object", "category"]).columns
    if c not in excluded_cols
]

# Fill numeric missing values with median (simple, robust strategy)
for col in numeric_feature_cols:
    median_value = df[col].median()
    df[col] = df[col].fillna(median_value)

# Fill categorical missing with "Unknown"
# (cast to object first so the new label is accepted by category columns)
for col in categorical_cols:
    df[col] = df[col].astype("object").fillna("Unknown").astype("category")
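
Because the dataset is built from left merges on `listing_id`, a duplicated key on the right side would silently multiply calendar rows. A small sketch of how this can be guarded against with the `validate` argument of `DataFrame.merge`, shown on toy frames whose names are illustrative only:

```python
import pandas as pd

toy_base = pd.DataFrame({"listing_id": [1, 1, 2], "available_target": [0, 1, 1]})
toy_feat = pd.DataFrame({"listing_id": [1, 2], "beds": [2, 1]})

# validate="many_to_one" raises MergeError if the right side has duplicate keys
merged = toy_base.merge(
    toy_feat, on="listing_id", how="left", validate="many_to_one"
)

# Duplicate keys on the right side are caught instead of inflating the row count
toy_dupes = pd.concat([toy_feat, toy_feat.iloc[[0]]], ignore_index=True)
try:
    toy_base.merge(toy_dupes, on="listing_id", how="left", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
```

Comparing `len(merged)` with the length of the calendar base after each merge is a cheap additional check that no rows were added.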

