# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as the first cell is left untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power. Use text-vectorization methods to extract meaningful features from them.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
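
One way to apply the `.sample()` hint is to sample listing IDs rather than individual rows, so that every kept listing retains its full set of dates. The sketch below assumes a calendar-like frame with a `listing_id` column. The toy data, the `frac=0.5` setting, and the names `toy_calendar`, `sampled_ids`, and `calendar_subset` are illustrative assumptions, not part of the provided notebook.

```python
import pandas as pd

# Toy stand-in for calendar.csv: 4 listings x 3 dates (values are made up).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "date": pd.to_datetime(["2025-09-15", "2025-09-16", "2025-09-17"] * 4),
    "available": ["t", "f", "f", "t", "t", "f", "f", "f", "f", "t", "t", "t"],
})

# Sample listing IDs, not rows, so each sampled listing keeps all of its dates.
sampled_ids = (
    toy_calendar["listing_id"]
    .drop_duplicates()
    .sample(frac=0.5, random_state=42)  # fixed seed keeps the subset reproducible
)
calendar_subset = toy_calendar[toy_calendar["listing_id"].isin(sampled_ids)]

print(calendar_subset["listing_id"].nunique(), "listings,", len(calendar_subset), "rows")
```

Row-level sampling would also work for quick plots, but it leaves gaps in each listing's calendar, which may distort per-listing temporal features.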




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline.  
Avoid over-complicating your code; refine the solution only if you have time.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


In [1]:
import pandas as pd

In [None]:
# Load the calendar data
calendar_df = pd.read_csv('data/calendar.csv.gz')

In [None]:
# Preview the dataset
display(calendar_df.head(20))
print("MISSING VALUES OVERVIEW")
print("=" * 50)

# Count of missing values per column
missing_count = calendar_df.isnull().sum()
print("\nMissing values count:")
print(missing_count)

# Percentage of missing values per column
missing_percent = missing_count / len(calendar_df) * 100
print("\nMissing values percentage:")
print(missing_percent)

# Combined view
missing_summary = pd.DataFrame({
    'Missing_Count': missing_count,
    'Missing_Percent': missing_percent
})
print("\nCombined summary:")
print(missing_summary)

            listing_id        date  available  price  adjusted_price  minimum_nights  maximum_nights
0  1196288722069341420  2025-09-15          f    NaN             NaN               1            1125
1  1196288722069341420  2025-09-16          f    NaN             NaN               1            1125
2  1196288722069341420  2025-09-17          f    NaN             NaN               1            1125
3  1196288722069341420  2025-09-18          f    NaN             NaN               1            1125
4  1196288722069341420  2025-09-19          f    NaN             NaN               1            1125
5  1196288722069341420  2025-09-20          f    NaN             NaN               1            1125
6  1196288722069341420  2025-09-21          f    NaN             NaN               1            1125
7  1196288722069341420  2025-09-22          f    NaN             NaN               1            1125
8  1196288722069341420  2025-09-23          f    NaN             NaN               1            1125
9  1196288722069341420  2025-09-24          f    NaN             NaN               1            1125


MISSING VALUES OVERVIEW

Missing values count:
listing_id               0
date                     0
available                0
price             35357974
adjusted_price    35357974
minimum_nights           0
maximum_nights           0
dtype: int64

Missing values percentage:
listing_id          0.0
date                0.0
available           0.0
price             100.0
adjusted_price    100.0
minimum_nights      0.0
maximum_nights      0.0
dtype: float64

Combined summary:
                Missing_Count  Missing_Percent
listing_id                  0              0.0
date                        0              0.0
available                   0              0.0
price                35357974            100.0
adjusted_price       35357974            100.0
minimum_nights              0              0.0
maximum_nights              0              0.0


In [8]:
# 'price' and 'adjusted_price' are 100% missing in the calendar data, so drop them here.
# Price information is expected to come back when we join the datasets.
calendar_df = calendar_df.drop(columns=['price', 'adjusted_price'])
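
Because the brief discourages hard-coded results, the empty columns could also be detected from the data instead of being named by hand. The following is a minimal sketch on a made-up frame; `toy`, `empty_cols`, and `toy_clean` are hypothetical names, and the same pattern would apply to `calendar_df` if the columns are entirely null, as the missing-values summary suggests.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the calendar layout, with two fully empty columns.
toy = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "available": ["f", "t", "f"],
    "price": [np.nan, np.nan, np.nan],
    "adjusted_price": [np.nan, np.nan, np.nan],
})

# Columns where every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop only those columns; partially filled columns are kept.
toy_clean = toy.drop(columns=empty_cols)

print(empty_cols)  # ['price', 'adjusted_price']
```

`toy.dropna(axis=1, how="all")` gives the same result in one call; building `empty_cols` explicitly just makes it easy to log which columns were removed.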

In [None]:
# Prepare the 'date' column

# Convert to datetime so the .dt accessor is available
calendar_df['date'] = pd.to_datetime(calendar_df['date'])

# Extract time features
calendar_df['year'] = calendar_df['date'].dt.year
calendar_df['month'] = calendar_df['date'].dt.month
calendar_df['day_of_week'] = calendar_df['date'].dt.dayofweek  # 0=Monday
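
In line with the dtype hint, the `available` column (stored as the strings `'t'` and `'f'` in the preview above) could be mapped to a boolean, and a weekend flag could be derived from `day_of_week`. This is a sketch on toy data rather than the notebook's actual pipeline; `toy_calendar` and `is_weekend` are illustrative names, and it assumes `'t'` and `'f'` are the only values present.

```python
import pandas as pd

# Toy rows covering Friday to Monday (2025-09-19 is a Friday).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 1],
    "date": pd.to_datetime(["2025-09-19", "2025-09-20", "2025-09-21", "2025-09-22"]),
    "available": ["f", "t", "t", "f"],
})

# Map 't'/'f' to True/False; the nullable "boolean" dtype tolerates
# unexpected values, which would become <NA> instead of raising.
toy_calendar["available"] = (
    toy_calendar["available"].map({"t": True, "f": False}).astype("boolean")
)

# dayofweek uses 0=Monday ... 6=Sunday, so 5 and 6 are the weekend.
toy_calendar["day_of_week"] = toy_calendar["date"].dt.dayofweek
toy_calendar["is_weekend"] = toy_calendar["day_of_week"] >= 5

print(toy_calendar.dtypes)
```

Checking `toy_calendar["available"].isna().sum()` after the mapping is a cheap way to confirm that no unexpected values were present.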

