# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You are free to add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods such as TF–IDF to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
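
The dtype hint above can be sketched on a tiny stand-in frame. The column names mirror ones visible in `listings.csv` (`last_scraped`, `source`, `instant_bookable`), but the three rows are made up for illustration, and the `'t'`/`'f'` mapping assumes those are the only flag values present in your city's file.

```python
import pandas as pd

# Made-up rows; only the column names come from listings.csv.
demo = pd.DataFrame({
    "last_scraped": ["2025-09-16", "2025-09-18", "2025-09-15"],
    "source": ["city scrape", "previous scrape", "city scrape"],
    "instant_bookable": ["f", "f", "t"],
})

# Dates stored as text become real datetimes (unparseable values become NaT).
demo["last_scraped"] = pd.to_datetime(demo["last_scraped"], errors="coerce")

# Low-cardinality text is a natural fit for the category dtype.
demo["source"] = demo["source"].astype("category")

# 't'/'f' flags become nullable booleans; unmapped values turn into <NA>.
demo["instant_bookable"] = (
    demo["instant_bookable"].map({"t": True, "f": False}).astype("boolean")
)

print(demo.dtypes)
```

Checking `df.dtypes` again after each conversion is a cheap way to confirm that nothing silently stayed as `object`.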




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


In [1]:
import pandas as pd
import os

# Use relative path so it also works on other machines
data_path = "data"

# Build full file paths
reviews_file = os.path.join(data_path, "reviews.csv")
listings_file = os.path.join(data_path, "listings.csv")
calendar_file = os.path.join(data_path, "calendar.csv")

# Load CSV files
df_reviews = pd.read_csv(reviews_file)
df_listings = pd.read_csv(listings_file)
df_calendar = pd.read_csv(calendar_file)

# Quick checks
print("Reviews:", df_reviews.shape)
print("Listings:", df_listings.shape)
print("Calendar:", df_calendar.shape)


Reviews: (2097996, 6)
Listings: (96871, 79)
Calendar: (35357974, 7)
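
With roughly 35 million rows, `calendar.csv` is the table most likely to strain memory. A minimal sketch of a leaner load follows. It assumes the calendar file has `listing_id`, `date`, and `available` columns with `'t'`/`'f'` flags, so check this against your own file. The in-memory CSV is a made-up stand-in so the snippet runs without the data folder.

```python
import io
import pandas as pd

# Made-up stand-in for data/calendar.csv; the column names are assumptions.
fake_calendar = io.StringIO(
    "listing_id,date,available,price\n"
    "13913,2025-09-16,f,$70.00\n"
    "13913,2025-09-17,t,$70.00\n"
    "15400,2025-09-16,t,$149.00\n"
)

cal = pd.read_csv(
    fake_calendar,
    usecols=["listing_id", "date", "available"],  # skip columns we do not need
    dtype={"listing_id": "int64", "available": "category"},
    parse_dates=["date"],
)

# Turn the 't'/'f' flag into a compact 0/1 column.
cal["available"] = (cal["available"] == "t").astype("int8")

print(cal.dtypes)
print(cal)
```

On the real data, `fake_calendar` would be replaced by `calendar_file`; whether the saving is worth it depends on which calendar columns your features actually need.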


In [2]:
# First look at the listings data
df_listings.head()


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,13913,https://www.airbnb.com/rooms/13913,20250914034649,2025-09-16,city scrape,Holiday London DB Room Let-on going,My bright double bedroom with a large window h...,Finsbury Park is a friendly melting pot commun...,https://a0.muscache.com/pictures/miso/Hosting-...,54730,...,4.87,4.78,4.78,,f,2,1,1,0,0.3
1,15400,https://www.airbnb.com/rooms/15400,20250914034649,2025-09-16,city scrape,Bright Chelsea Apartment. Chelsea!,Lots of windows and light. St Luke's Gardens ...,It is Chelsea.,https://a0.muscache.com/pictures/428392/462d26...,60302,...,4.84,4.93,4.74,,f,1,1,0,0,0.51
2,17402,https://www.airbnb.com/rooms/17402,20250914034649,2025-09-16,city scrape,Very Central Modern 3-Bed/2 Bath By Oxford St W1,"You'll have a great time in this beautiful, cl...","Fitzrovia is a very desirable trendy, arty and...",https://a0.muscache.com/pictures/39d5309d-fba7...,67564,...,4.72,4.89,4.61,,f,2,2,0,0,0.32
3,24328,https://www.airbnb.com/rooms/24328,20250914034649,2025-09-18,previous scrape,Battersea live/work artist house,"Artist house by SW Battersea Park, bright high...","- Battersea is a quiet family area, easy acces...",https://a0.muscache.com/pictures/9194b40f-c627...,41759,...,4.93,4.6,4.65,,f,1,1,0,0,0.53
4,36274,https://www.airbnb.com/rooms/36274,20250914034649,2025-09-15,city scrape,Bright 1 bedroom apt off brick lane in Shoreditch,*Update June '25- Pump Installed to improve wa...,,https://a0.muscache.com/pictures/hosting/Hosti...,133271,...,4.46,4.85,4.54,,t,2,2,0,0,0.09


In [3]:
# Basic info about listings dataset
df_listings.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96871 entries, 0 to 96870
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            96871 non-null  int64  
 1   listing_url                                   96871 non-null  object 
 2   scrape_id                                     96871 non-null  int64  
 3   last_scraped                                  96871 non-null  object 
 4   source                                        96871 non-null  object 
 5   name                                          96871 non-null  object 
 6   description                                   94421 non-null  object 
 7   neighborhood_overview                         41208 non-null  object 
 8   picture_url                                   96865 non-null  object 
 9   host_id                                       96871 non-null 

In [4]:
# Summary statistics for numeric columns
df_listings.describe()


Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,latitude,longitude,accommodates,bathrooms,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,96871.0,96871.0,96871.0,96830.0,96830.0,0.0,96871.0,96871.0,96871.0,62025.0,...,72706.0,72729.0,72705.0,72705.0,0.0,96871.0,96871.0,96871.0,96871.0,72749.0
mean,6.894448e+17,20250910000000.0,214449400.0,50.088991,80.305246,,51.50974,-0.127638,3.326228,1.35555,...,4.790344,4.81227,4.729839,4.616202,,16.685499,14.303249,2.325526,0.028553,0.990334
std,5.941222e+17,0.0,219605300.0,399.405694,589.326173,,0.049067,0.101112,2.078605,0.72077,...,0.430135,0.425909,0.409689,0.511816,,53.13029,52.478299,9.623243,0.631484,1.304282
min,13913.0,20250910000000.0,2594.0,1.0,1.0,,51.295937,-0.49676,1.0,0.0,...,0.0,0.0,0.0,0.0,,1.0,0.0,0.0,0.0,0.01
25%,30260580.0,20250910000000.0,27268140.0,1.0,1.0,,51.48415,-0.189468,2.0,1.0,...,4.76,4.8,4.65,4.5,,1.0,0.0,0.0,0.0,0.15
50%,8.505248e+17,20250910000000.0,116432100.0,2.0,3.0,,51.51372,-0.127505,2.0,1.0,...,4.93,4.96,4.85,4.75,,2.0,1.0,0.0,0.0,0.52
75%,1.254262e+18,20250910000000.0,419897400.0,10.0,14.0,,51.539108,-0.068316,4.0,1.5,...,5.0,5.0,5.0,4.94,,8.0,5.0,1.0,0.0,1.29
max,1.508964e+18,20250910000000.0,718690500.0,5469.0,8769.0,,51.68263,0.27896,16.0,26.0,...,5.0,5.0,5.0,5.0,,500.0,499.0,116.0,25.0,36.96


In [5]:
# Check missing values ratio per column (top 20)
missing_ratio = df_listings.isnull().mean().sort_values(ascending=False)
missing_ratio.head(20)


license                         1.000000
calendar_updated                1.000000
neighbourhood_group_cleansed    1.000000
neighborhood_overview           0.574610
neighbourhood                   0.574599
host_neighbourhood              0.526690
host_about                      0.485574
beds                            0.360479
price                           0.360356
estimated_revenue_l365d         0.360356
bathrooms                       0.359715
host_response_rate              0.327312
host_response_time              0.327312
host_acceptance_rate            0.286567
review_scores_location          0.249466
review_scores_value             0.249466
review_scores_checkin           0.249455
review_scores_communication     0.249218
review_scores_accuracy          0.249166
review_scores_cleanliness       0.249104
dtype: float64

### Missing Values Analysis

The output above shows the 20 columns with the highest share of missing values in `listings.csv`.
Three columns (`license`, `calendar_updated`, and `neighbourhood_group_cleansed`) are completely missing (100%).
Neighbourhood and host description fields (`neighborhood_overview`, `neighbourhood`, `host_neighbourhood`, `host_about`) are missing in roughly 49% to 57% of rows, `beds`, `price`, and `bathrooms` in about 36%, and the review scores in about 25%.
Next, I will explore whether these columns are useful and decide how to handle them.
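
When deciding how to handle each column, it can help to see missing counts next to percentages. The sketch below builds such a summary with a hypothetical helper, `missing_summary` (not part of pandas), shown on a made-up frame so it runs on its own.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and percentage per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
    })
    # Keep only columns that actually have gaps.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("pct_missing", ascending=False)

# Made-up frame for illustration only.
demo = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],
    "beds": [1.0, np.nan, 3.0, 2.0],
    "id": [1, 2, 3, 4],
})

print(missing_summary(demo))
```

Calling `missing_summary(df_listings)` would give the same view for the real table.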


In [6]:
# Drop columns that are completely missing
cols_to_drop = [
    "license",
    "calendar_updated",
    "neighbourhood_group_cleansed"
]

df_listings = df_listings.drop(columns=cols_to_drop)
df_listings.shape


(96871, 76)

In [7]:
# Inspect sample values of the partially missing columns
df_listings[[
    "neighborhood_overview",
    "host_about",
    "beds",
    "bathrooms",
    "price"
]].head(10)


Unnamed: 0,neighborhood_overview,host_about,beds,bathrooms,price
0,Finsbury Park is a friendly melting pot commun...,I am a Multi-Media Visual Artist and Creative ...,1.0,1.0,$70.00
1,It is Chelsea.,"English, grandmother, I have travelled quite ...",1.0,1.0,$149.00
2,"Fitzrovia is a very desirable trendy, arty and...",We are Liz and Jack. We manage a number of ho...,3.0,2.0,$411.00
3,"- Battersea is a quiet family area, easy acces...","I've been using Airbnb for a while now, both a...",,,
4,,"We are Hendryks Services - your resident, mana...",0.0,1.0,$210.00
5,"Residential family neighborhood, with both Eng...",".Hi, I am Geert, and I've been on Airbnb both ...",3.0,1.5,$280.00
6,East Finchley is a friendly popular area of No...,We are a happy couple who live in a wonderful ...,1.0,0.0,$90.00
7,Shepherds Bush is one of London's truly divers...,We are Elisa and Dominic and we live in West L...,1.0,1.0,$61.00
8,Shepherds Bush itself is a prime real estate l...,We are Elisa and Dominic and we live in West L...,4.0,2.0,$340.00
9,Peckham itself is a vibrant and fashionable ar...,"Hi, I am Cesar. Welcome to my place. I love d...",1.0,1.0,$49.00


In [8]:
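# Strip the currency symbol and thousands separators (e.g. "$1,200.00" -> "1200.00")
df_listings["price"] = (
    df_listings["price"]
    .astype(str)  # NaN becomes the string "nan", which is coerced back to NaN below
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Convert to float (errors="coerce" turns unparseable values into NaN)
df_listings["price"] = pd.to_numeric(df_listings["price"], errors="coerce")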

df_listings["price"].describe()


count    6.196300e+04
mean     2.299170e+02
std      4.437589e+03
min      7.000000e+00
25%      7.700000e+01
50%      1.350000e+02
75%      2.210000e+02
max      1.085147e+06
Name: price, dtype: float64
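
The summary above shows a maximum above one million against a median of 135, so a handful of extreme prices dominates the mean and standard deviation. One possible way to inspect and tame that tail, sketched here on made-up numbers, is to look at the upper quantiles and derive a clipped, log-transformed copy. The 99th-percentile cap and the `price_clipped` and `price_log` names are illustrative choices, not part of the pipeline above.

```python
import numpy as np
import pandas as pd

# Made-up prices with one extreme value, for illustration only.
price = pd.Series([49.0, 61.0, 70.0, 90.0, 149.0, 210.0, 280.0, 1_085_147.0, np.nan])

# Upper quantiles show where the tail starts.
print(price.quantile([0.5, 0.9, 0.99]))

# Cap values at the 99th percentile, then compress the scale with log1p.
cap = price.quantile(0.99)
price_clipped = price.clip(upper=cap)
price_log = np.log1p(price_clipped)

print(price_log.describe())
```

With only eight values the cap barely moves the outlier; on tens of thousands of listings the same percentile would sit far lower. Missing prices stay missing in both derived series.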

In [9]:
# Fill beds and bathrooms with median (common practice)
df_listings["beds"] = df_listings["beds"].fillna(df_listings["beds"].median())
df_listings["bathrooms"] = df_listings["bathrooms"].fillna(df_listings["bathrooms"].median())
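
Because about 36% of `beds` and `bathrooms` values are missing, filling them with the median erases the information that a value was absent. A hedged alternative is to record a flag before imputing. The `_missing` column names are hypothetical and the frame is made up; on the real data the flags would have to be created before the fill above is applied.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, 1.0],
    "bathrooms": [1.0, 1.5, np.nan, 1.0],
})

for col in ["beds", "bathrooms"]:
    # Record missingness before imputing, otherwise the signal is lost.
    demo[f"{col}_missing"] = demo[col].isna()
    demo[col] = demo[col].fillna(demo[col].median())

print(demo)
```

A downstream model can then learn whether the absence of a value is itself informative.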
