# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as the first cell is left untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power. Use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline, then refine it if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


## 1. Data Loading & Validation

In this section we load the three raw tables and validate them before any analysis. The steps are:

### **1.1 Import libraries**
We load all core packages needed for:
- data manipulation (`pandas`, `numpy`)
- file handling (`pathlib`)
- visualisation (`plotly.express`, `plotly.graph_objects`)

---

### **1.2 Define data directory and dtypes**
We specify:
- where the `listings.csv.gz`, `calendar.csv.gz`, and `reviews.csv.gz` files are stored  
- optimized `dtype` settings to:
  - reduce memory usage
  - avoid incorrect automatic type inference
  - keep categorical variables clean (`room_type`, `property_type`, `host_is_superhost`)
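
As a rough illustration of the memory argument, the sketch below compares `object` and `category` storage for a low-cardinality column. The toy frame and its values are invented for the example and are not taken from the Airbnb files.

```python
import pandas as pd

# Toy stand-in for a low-cardinality column such as room_type (values are made up)
toy = pd.DataFrame(
    {"room_type": ["Entire home/apt", "Private room", "Shared room", "Hotel room"] * 25_000}
)

# Force plain Python-object storage so the comparison is explicit
object_bytes = toy["room_type"].astype(object).memory_usage(deep=True)
# Category storage keeps small integer codes plus one copy of each label
category_bytes = toy["room_type"].astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes / 1e6:.2f} MB")
print(f"category: {category_bytes / 1e6:.2f} MB")
```

The gain shrinks as the number of distinct values grows, so `category` is mainly worthwhile for columns with few repeated labels.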

---

### **1.3 Load the three raw datasets**
Using `pd.read_csv()` with:
- `dtype=` for consistency  
- `parse_dates=` for correct time handling  
- `compression='gzip'` for the InsideAirbnb format  
- `low_memory=False` to avoid mixed-type warnings

This gives us:
- `listings` → one row per listing  
- `calendar` → one row per listing per date  
- `reviews` → one row per textual review  
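
To see how these `read_csv` arguments interact, here is a minimal sketch that writes a tiny gzip-compressed CSV to a temporary folder and reads it back. The rows, the file name `toy_calendar.csv.gz`, and the chosen dtypes are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up calendar extract, written as a gzip-compressed CSV
toy_calendar = pd.DataFrame(
    {
        "listing_id": [101, 101, 202],
        "date": ["2025-01-01", "2025-01-02", "2025-01-01"],
        "available": ["t", "f", "t"],
        "price": ["$120.00", "$120.00", "$1,050.00"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_calendar.csv.gz"
    toy_calendar.to_csv(path, index=False, compression="gzip")

    loaded = pd.read_csv(
        path,
        compression="gzip",                       # explicit, although pandas can infer it from ".gz"
        dtype={"listing_id": "int64", "available": "category", "price": "string"},
        parse_dates=["date"],                     # date arrives as datetime64, not as text
    )

print(loaded.dtypes)
```

Keeping `price` as a string at load time avoids a failed numeric parse on values such as `$1,050.00`; the conversion can then happen in one controlled step.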

---

### **1.4 Normalize boolean-like fields**
Inside Airbnb stores boolean values as `"t"`/`"f"` strings.
We convert:
- `host_is_superhost` → `True/False`
- `available` → `1/0` (numeric form for grouping/averaging)
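
The two target representations behave differently when values are missing. The sketch below, on a made-up series, shows one possible way to handle both; using the nullable `boolean` dtype is a suggestion here, not something the dataset requires.

```python
import pandas as pd

# Made-up flag column with one missing entry
raw = pd.Series(["t", "f", None, "t"], name="available")

# Nullable boolean keeps the missing entry as <NA> instead of turning it into False
as_bool = raw.map({"t": True, "f": False}).astype("boolean")

# Float form: the missing entry becomes NaN and is skipped by mean()
as_float = raw.map({"t": 1.0, "f": 0.0})

print(as_bool.tolist())
print(as_float.mean())
```

The float form is convenient for `groupby(...).mean()`, because the mean of a 0/1 column is directly the availability rate.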

---

### **1.5 Inspect availability distribution**
We compute:
- count of available vs unavailable calendar rows  
- share (%) of availability

This helps detect issues like:
- the entire calendar marked unavailable (a failure mode seen in poorly scraped files)

---

### **1.6 Print basic metadata**
We print:
- column dtypes  
- missing availability share  
- shape of each table  
- match rate between `calendar`/`reviews` and `listings` via `listing_id`

These checks confirm that the joins in later steps will work (or warn us when the dataset is broken).
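
A match rate below 100% is easier to act on when the unmatched keys are listed explicitly. The following sketch uses invented IDs to show both the rate and the orphaned keys.

```python
import pandas as pd

# Made-up IDs: listing 303 appears in the calendar but not in listings
listings_toy = pd.DataFrame({"id": [101, 202]})
calendar_toy = pd.DataFrame({"listing_id": [101, 101, 202, 303]})

# Share of calendar rows whose key exists in listings
match_rate = calendar_toy["listing_id"].isin(listings_toy["id"]).mean()

# Keys present in the calendar but absent from listings
orphans = sorted(set(calendar_toy["listing_id"]) - set(listings_toy["id"]))

print(f"match rate: {match_rate:.0%}")
print(f"orphaned listing_ids: {orphans}")
```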

---

### **Outcome**
After this block, we have:
- cleanly loaded data  
- normalized boolean fields  
- validated linking keys  
- confirmed dataset size and structure  

We are now ready to start exploratory data analysis (EDA).


**Exploratory Analysis**

In [4]:
from pathlib import Path
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# The dtype maps are defined first so that each file is read only once, with explicit types.
listings_dtypes = {
    "id": "int64",
    "host_id": "int64",
    "host_is_superhost": "category",
    "room_type": "category",
    "property_type": "category",
}
calendar_dtypes = {
    "listing_id": "int64",
    "available": "category",   # convert after load
    "price": "string",         # clean currency symbols later
}
reviews_dtypes = {
    "listing_id": "int64",
    "reviewer_id": "int64",
}

listings = pd.read_csv(
    "listings.csv.gz",
    compression="gzip",
    dtype=listings_dtypes,
    parse_dates=["host_since"],
    low_memory=False,
)
calendar = pd.read_csv(
    "calendar.csv.gz",
    compression="gzip",
    dtype=calendar_dtypes,
    parse_dates=["date"],
)
reviews = pd.read_csv(
    "reviews.csv.gz",
    compression="gzip",
    dtype=reviews_dtypes,
    parse_dates=["date"],
)

# Normalize boolean-ish flags so downstream logic sees consistent types
listings["host_is_superhost"] = listings["host_is_superhost"].map({"t": True, "f": False})
calendar["available"] = (
    calendar["available"]
    .astype(str)
    .str.strip()
    .map({"t": 1.0, "f": 0.0})
)

availability_stats = (
    calendar["available"]
    .value_counts(dropna=False)
    .rename_axis("available_flag")
    .to_frame("count")
)
availability_stats["share"] = availability_stats["count"] / availability_stats["count"].sum()
display(availability_stats)

if availability_stats.index.tolist() == [0.0]:
    print(
        "⚠️ All rows are marked unavailable (0.0). "
        "Monthly/heatmap visuals will look blank until the raw calendar data "
        "contains some 't' (available) entries."
    )

print(f"calendar['available'] dtype: {calendar['available'].dtype}")
print(f"Missing availability share: {calendar['available'].isna().mean():.2%}")

# Sanity checks on key relationships
assert listings["id"].is_unique, "Listing IDs must be unique in listings.csv"
# Share of rows whose listing_id exists in listings (a match rate, not a missing rate)
calendar_match_rate = calendar["listing_id"].isin(listings["id"]).mean()
reviews_match_rate = reviews["listing_id"].isin(listings["id"]).mean()
print(f"Calendar rows matching listings: {calendar_match_rate:.2%}")
print(f"Review rows matching listings: {reviews_match_rate:.2%}")

print(f"Listings shape: {listings.shape}")
print(f"Calendar shape: {calendar.shape}")
print(f"Reviews shape: {reviews.shape}")


available_flag,count,share
0.0,2839532,0.742322
1.0,985668,0.257678


calendar['available'] dtype: float64
Missing availability share: 0.00%
Calendar rows matching listings: 100.00%
Review rows matching listings: 100.00%
Listings shape: (10480, 79)
Calendar shape: (3825200, 7)
Reviews shape: (501084, 6)


## 2.1 Core Listing Metrics & Data Quality

In this section we answer two basic questions:

1. **What does a “typical” Amsterdam Airbnb listing look like?**  
   We summarise key numeric variables:
   - `price`
   - `minimum_nights`
   - `number_of_reviews`
   - `reviews_per_month`

   Using `.describe()` (with quartiles) we get:
   - central tendency (mean, median)
   - spread (std, 25% / 75% quantiles)
   - minimum and maximum values

2. **Where are the biggest data gaps in `listings`?**  
   We compute the **percentage of missing values per column** and display the
   **top 15 columns with the most missing data**.  
   This helps us:
   - identify variables that are risky to use directly in models,
   - decide where imputation or feature dropping might be necessary,
   - understand which signals are “strong” and which ones are sparse.

These basic summaries give a first, high-level picture of the Amsterdam Airbnb market
and show how reliable each column is for downstream feature engineering.


In [5]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

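# Note: `price` is still a currency string at this point, so describe() silently
# drops it from the summary; it is converted to a numeric value in section 2.2.
listings_summary = (
    listings[["price", "minimum_nights", "number_of_reviews", "reviews_per_month"]]
    .describe(percentiles=[0.25, 0.5, 0.75])
    .T
)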
display(listings_summary)

missing_pct = (
    listings.isna().mean().sort_values(ascending=False).mul(100).round(1).head(15)
)
display(missing_pct.to_frame(name="percent_missing"))


column,count,mean,std,min,25%,50%,75%,max
minimum_nights,10480.0,4.390267,19.80735,1.0,2.0,3.0,4.0,1001.0
number_of_reviews,10480.0,47.813359,131.50744,0.0,3.0,10.0,30.0,5097.0
reviews_per_month,9383.0,0.998668,2.306143,0.01,0.2,0.41,0.91,99.42


column,percent_missing
neighbourhood_group_cleansed,100.0
calendar_updated,100.0
host_neighbourhood,73.4
neighbourhood,50.5
neighborhood_overview,50.5
host_about,47.4
estimated_revenue_l365d,44.0
price,44.0
beds,43.7
bathrooms,43.4


## 2.2 Price Distribution of Amsterdam Listings

To understand the pricing landscape of Amsterdam's Airbnb market, we visualize
the distribution of nightly rates.

### Why this matters
- Identifies whether the market is **cheap, mid-range, or luxury-heavy**
- Helps detect **outliers** (extremely expensive properties)
- Supports later feature engineering (e.g., log-price, price categories)

### Data cleaning steps performed
Before plotting, we:
- strip the `$` sign and the thousands separator `,`
- convert `price` from string to numeric
- drop empty or invalid entries
- clip prices above €800 to €800 so that a few extreme listings do not distort the histogram

### What the histogram shows
The resulting histogram reveals how nightly rates are spread across the city,
highlighting typical price ranges and market skewness.
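
The log-price and price-category features mentioned above could be derived along the following lines. This is a minimal sketch on made-up price strings; the band edges (100 and 250) and the band labels are arbitrary placeholders, not values derived from the data.

```python
import numpy as np
import pandas as pd

# Made-up raw price strings in the "$1,234.00" style, with one missing value
raw_price = pd.Series(["$85.00", "$120.00", "$1,250.00", None, "$310.00"])

price = pd.to_numeric(
    raw_price.str.replace(r"[$,]", "", regex=True), errors="coerce"
)

features = pd.DataFrame(
    {
        "price": price,
        # log1p compresses the long right tail and is defined at 0
        "log_price": np.log1p(price),
        # Band edges are placeholders chosen for the example
        "price_band": pd.cut(
            price,
            bins=[0, 100, 250, np.inf],
            labels=["budget", "mid", "premium"],
        ),
    }
)
print(features)
```

Quantile-based bands (for example via `pd.qcut`) are an alternative when fixed thresholds do not transfer well between cities.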


In [6]:
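price_clean = (
    pd.to_numeric(
        listings["price"]
        .astype(str)                              # missing values become the text "nan"
        .str.replace(r"[$,]", "", regex=True),    # remove $ and thousands separators
        errors="coerce",                          # "", "nan" and other invalid text -> NaN
    )
    .dropna()                                     # drop rows without a usable price
    .clip(upper=800)                              # cap extreme prices at 800
)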

fig_price = px.histogram(
    price_clean,
    nbins=60,
    title="Amsterdam nightly price distribution (capped at €800)",
    labels={"value": "price_eur"},
    opacity=0.75,
)
fig_price.update_layout(bargap=0.02)
fig_price.show()


## 2.3 Availability Patterns Across Months and Weekdays

Understanding when listings are most or least available helps us uncover
seasonality, tourism patterns, and booking pressure in Amsterdam.

### What we compute
We extract two temporal dimensions from the calendar data:

1. **Monthly Availability Rate**  
   - For each month (1–12), we calculate the *share of nights marked available*.
   - This reveals **high-demand vs. low-demand seasons**.

2. **Weekday × Season Availability Heatmap**  
   - We classify each date into one of four seasons:
     - Winter (Dec–Feb)
     - Spring (Mar–May)
     - Summer (Jun–Aug)
     - Autumn (Sep–Nov)
   - Then we compute average availability for every weekday within each season.

### Why this matters
- Shows **touristic pressure** (e.g., summer months often have low availability)
- Highlights **weekday booking behavior** (e.g., weekends may be fully booked)
- These patterns are crucial for feature engineering such as:
  - `is_peak_season`
  - `is_weekend`
  - `expected_availability_rate`
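
A minimal sketch of how such flags might be derived from a date column is shown below. The dates are invented, and treating summer as the peak season is an assumption that should be checked against the monthly availability rates before it is used.

```python
import pandas as pd

dates = pd.DataFrame(
    {"date": pd.to_datetime(["2025-01-04", "2025-04-15", "2025-07-20", "2025-12-01"])}
)

# month % 12 maps December to 0, so Dec-Feb fall into one contiguous bin
season = pd.cut(
    dates["date"].dt.month % 12,
    bins=[-1, 2, 5, 8, 11],
    labels=["Winter", "Spring", "Summer", "Autumn"],
)

features = dates.assign(
    season=season,
    is_weekend=dates["date"].dt.dayofweek >= 5,  # Saturday=5, Sunday=6
    # Assumption: summer is the peak season; verify against the data
    is_peak_season=season.eq("Summer"),
)
print(features)
```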

### Visualisations included
- **Bar chart**: availability by month  
- **Heatmap**: availability by weekday × season  


In [7]:
calendar["month"] = calendar["date"].dt.month
calendar["weekday"] = calendar["date"].dt.day_name()

availability_by_month = (
    calendar.groupby("month", observed=True)["available"]
    .mean()
    .rename("share_available")
    .reset_index()
)
display(availability_by_month)

weekday_pivot = (
    calendar.assign(
        season=pd.cut(
            calendar["date"].dt.month % 12,
            bins=[-float("inf"), 2, 5, 8, 11],
            labels=["Winter", "Spring", "Summer", "Autumn"],
        )
    )
    .pivot_table(
        values="available", index="season", columns="weekday",
        aggfunc="mean", observed=True
    )
    .reindex(index=["Winter", "Spring", "Summer", "Autumn"])
    .reindex(columns=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"])
)

fig_month = px.bar(
    availability_by_month,
    x="month",
    y="share_available",
    title="Share of available nights by month",
    labels={"share_available": "availability_rate"}
)
fig_month.update_yaxes(tickformat=".0%")
fig_month.show()

fig_heat = go.Figure(go.Heatmap(
    z=weekday_pivot.values,
    x=weekday_pivot.columns,
    y=weekday_pivot.index,
    colorscale="Viridis",
    colorbar=dict(title="availability rate", tickformat=".0%"),  # values are fractions in [0, 1]
))
fig_heat.update_layout(
    title="Availability heatmap by season and weekday",
    xaxis_title="Weekday", 
    yaxis_title="Season"
)
fig_heat.show()


index,month,share_available
0,1,0.304411
1,2,0.315223
2,3,0.282686
3,4,0.256733
4,5,0.272701
5,6,0.232137
6,7,0.221282
7,8,0.224326
8,9,0.178232
9,10,0.217751


## 2.4 Review Sentiment Proxy from Text Comments

Guest reviews contain valuable qualitative information about listing quality,
host responsiveness, cleanliness, and the overall guest experience.

In this section we build **simple text-based sentiment indicators** using the
raw `comments` field from `reviews.csv`.

### What features we engineer
1. **Review Word Count**  
   - Measures how detailed each guest review is.
   - Longer reviews often correspond to stronger emotions (positive or negative).

2. **Positive Keyword Indicator**  
   - Flags reviews containing common positive sentiment words such as  
     *“great”, “amazing”, “perfect”, “wonderful”*.

3. **Negative Keyword Indicator**  
   - Flags reviews containing common negative sentiment words such as  
     *“bad”, “terrible”, “dirty”, “noisy”*.

### Aggregation
We group by `listing_id` and compute the **mean of each indicator**, giving:
- average positivity rate per listing  
- average negativity rate per listing  
- average review verbosity  
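
On toy data, the three per-listing averages could be computed roughly as follows. The reviews and listing IDs are made up; the keyword patterns mirror the word lists above. Note that this sketch includes `review_word_count` in the aggregation so that the verbosity average is actually produced.

```python
import pandas as pd

# Made-up reviews for two hypothetical listings
reviews_toy = pd.DataFrame(
    {
        "listing_id": [1, 1, 2],
        "comments": ["Great place, perfect location", "Bit noisy at night", None],
    }
)

comments = reviews_toy["comments"]
per_review = pd.DataFrame(
    {
        "listing_id": reviews_toy["listing_id"],
        "review_word_count": comments.str.split().str.len(),
        "has_positive_word": comments.str.contains(
            r"\b(?:great|amazing|perfect|wonderful)\b", case=False, na=False
        ),
        "has_negative_word": comments.str.contains(
            r"\b(?:bad|terrible|dirty|noisy)\b", case=False, na=False
        ),
    }
)

# One row per listing: mean verbosity plus the share of reviews hitting each word list
per_listing = per_review.groupby("listing_id").mean().reset_index()
print(per_listing)
```

A missing comment counts as "no keyword found" because of `na=False`, while its word count stays missing and is ignored by the mean.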

### Why this matters
These text features can enrich ML models by capturing:
- host quality
- cleanliness issues
- overall guest satisfaction  
- listing reputation signals

This creates a foundation for more advanced NLP features like TF–IDF in later stages.


In [8]:
reviews["review_word_count"] = reviews["comments"].str.split().str.len()
reviews["has_positive_word"] = reviews["comments"].str.contains(
    r"\b(?:great|amazing|perfect|wonderful)\b", case=False, na=False
)
reviews["has_negative_word"] = reviews["comments"].str.contains(
    r"\b(?:bad|terrible|dirty|noisy)\b", case=False, na=False
)

sentiment_pivot = (
    reviews.groupby("listing_id")[["has_positive_word", "has_negative_word"]]
    .mean()
    .reset_index()
)

display(sentiment_pivot.describe())


statistic,listing_id,has_positive_word,has_negative_word
count,9383.0,9383.0,9383.0
mean,5.316341e+17,0.480164,0.011921
std,5.422466e+17,0.24499,0.042594
min,27886.0,0.0,0.0
25%,23103000.0,0.333333,0.0
50%,6.194346e+17,0.5,0.0
75%,1.030774e+18,0.631579,0.003263
max,1.498684e+18,1.0,1.0


## 3.1 Centralised Configuration with `AirbnbConfig`

To make the loading step reusable, we define a small configuration object that stores all
file-related settings in one place.

The `AirbnbConfig` dataclass:

- sets the base directory (`data_dir`) where the Airbnb CSV files are located,
- stores the exact filenames used in this project,
- exposes convenient path properties:
  - `calendar_path`
  - `listings_path`
  - `reviews_path`

This makes our notebook cleaner and avoids hardcoding paths in multiple places.
If we later switch to another city or folder, only the config needs to change.
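
The "only the config needs to change" idea can be sketched as below. `CityConfig` is a hypothetical, reduced stand-in for `AirbnbConfig`, and the folder name `another_city` is a placeholder; making the dataclass frozen and using `dataclasses.replace` is one possible design, not a requirement.

```python
from dataclasses import dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class CityConfig:
    """Hypothetical stand-in for AirbnbConfig, reduced to one file for brevity."""

    data_dir: Path = Path(".")
    listings_file: str = "listings.csv"

    @property
    def listings_path(self) -> Path:
        return self.data_dir / self.listings_file


default_cfg = CityConfig()
# Placeholder folder: only the config changes, the loading code stays the same
other_cfg = replace(default_cfg, data_dir=Path("data") / "another_city")

print(default_cfg.listings_path)
print(other_cfg.listings_path)
```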


In [10]:
from dataclasses import dataclass
from pathlib import Path

@dataclass
class AirbnbConfig:
    # Load files from the current notebook directory or a provided folder
    data_dir: Path = Path(".")

    # Filenames exactly as they exist in your folder
    calendar_file: str = "calendar.csv.gz"
    listings_file: str = "listings.csv"
    reviews_file: str = "reviews.csv"

    @property
    def calendar_path(self) -> Path:
        return self.data_dir / self.calendar_file

    @property
    def listings_path(self) -> Path:
        return self.data_dir / self.listings_file

    @property
    def reviews_path(self) -> Path:
        return self.data_dir / self.reviews_file


## 3.2 Lazy CSV Loading with `AirbnbDataLoader`

To avoid repeating `pd.read_csv(...)` all over the notebook, we wrap the loading
logic in a small helper class called `AirbnbDataLoader`.

### What this class does

- Uses the `AirbnbConfig` object to know **where the files are**.
- Exposes three **lazy-loaded** attributes:
  - `listings`
  - `calendar`
  - `reviews`
- Each attribute:
  - reads the corresponding CSV **once** (the first time it is accessed),
  - performs basic cleaning:
    - `listings`:
      - parses `host_since` as a date
      - converts `host_is_superhost` from `"t"/"f"` to `True/False`
    - `calendar`:
      - parses `date` as a date
      - converts `available` from `"t"/"f"` to numeric `1.0/0.0`
    - `reviews`:
      - parses `date` as a date

### Why use `@cached_property`?

We decorate each loader method with `@cached_property`, so:

- the CSV file is read **only the first time**,
- subsequent access to `loader.listings`, `loader.calendar`, or `loader.reviews`
  returns the **already-loaded DataFrame**,
- this keeps the code clean and avoids unnecessary disk I/O.
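
The caching behaviour can be checked with a toy class that counts how often the expensive step runs. `CountingLoader` and its `table` attribute are invented for this illustration; the list stands in for a DataFrame.

```python
from functools import cached_property


class CountingLoader:
    """Toy loader: counts how often the expensive 'read' actually runs."""

    def __init__(self) -> None:
        self.read_calls = 0

    @cached_property
    def table(self) -> list[int]:
        self.read_calls += 1          # stands in for pd.read_csv(...)
        return [1, 2, 3]


loader = CountingLoader()
first = loader.table                  # triggers the read
second = loader.table                 # served from the cache
calls_after_two_reads = loader.read_calls

# Deleting the attribute clears the cache, so the next access reloads
del loader.table
_ = loader.table

print(calls_after_two_reads, loader.read_calls)
print(first is second)
```

Because the cached object is returned by reference, in-place modifications to a cached DataFrame persist across accesses; working on a `.copy()` avoids surprises.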

This class gives us a neat, reusable entry point for the rest of the analysis
and feature engineering steps.


In [13]:
from functools import cached_property
import pandas as pd

class AirbnbDataLoader:
    """
    Lazily loads raw CSVs only once using the paths defined in AirbnbConfig.
    """

    def __init__(self, config: AirbnbConfig):
        self.config = config

    @cached_property
    def listings(self) -> pd.DataFrame:
        df = pd.read_csv(
            self.config.listings_path,
            low_memory=False,
            parse_dates=["host_since"],
        )
        df["host_is_superhost"] = df["host_is_superhost"].map({"t": True, "f": False})
        return df

    @cached_property
    def calendar(self) -> pd.DataFrame:
        df = pd.read_csv(
            self.config.calendar_path,
            compression="gzip",
            parse_dates=["date"],
        )
        df["available"] = (
            df["available"].astype(str).str.strip().map({"t": 1.0, "f": 0.0})
        )
        return df

    @cached_property
    def reviews(self) -> pd.DataFrame:
        df = pd.read_csv(
            self.config.reviews_path,
            parse_dates=["date"],
        )
        return df
