# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>
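
The prediction target described above implies one row per listing per date, with listing-level attributes attached to each row. A minimal sketch of that join on synthetic data follows; the column names (`id`, `listing_id`, `available`, `accommodates`, `room_type`) are assumptions modeled on typical Inside Airbnb files and should be checked against your own download.

```python
import pandas as pd

# Synthetic stand-ins for listings.csv and calendar.csv (assumed column names).
listings_demo = pd.DataFrame({
    "id": [1, 2],
    "accommodates": [2, 4],
    "room_type": ["Entire home/apt", "Private room"],
})
calendar_rows = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2025-09-10", "2025-09-11", "2025-09-10", "2025-09-10"]),
    "available": [True, False, True, True],
})

# One row per (listing, date); listing attributes are repeated on every date.
ml_ready = calendar_rows.merge(
    listings_demo.rename(columns={"id": "listing_id"}),
    on="listing_id",
    how="left",
    validate="many_to_one",  # raises if a listing id is duplicated in listings_demo
    indicator=True,          # adds a _merge column to audit unmatched rows
)

# Calendar rows whose listing_id has no match in the listings table.
unmatched = (ml_ready["_merge"] == "left_only").sum()
ml_ready = ml_ready.drop(columns="_merge")
```

Counting unmatched rows before dropping `_merge` is a cheap way to notice key mismatches between the files early.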

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
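
The dtype, temporal-feature, and sampling hints above can be sketched on a tiny synthetic frame. The column names (`date`, `available`) and the `'t'`/`'f'` encoding are assumptions about the calendar file, so verify them on your own data before reusing this.

```python
import pandas as pd

# Synthetic stand-in for a few calendar rows (assumed columns and encoding).
typed = pd.DataFrame({
    "listing_id": [27886, 27886, 28871],
    "date": ["2025-09-12", "2025-09-13", "2025-09-14"],
    "available": ["t", "f", "t"],
})

# Parse strings into proper dtypes.
typed["date"] = pd.to_datetime(typed["date"])
typed["available"] = typed["available"].map({"t": True, "f": False}).astype("boolean")

# Derived temporal features (Monday is 0, Sunday is 6).
typed["day_of_week"] = typed["date"].dt.dayofweek
typed["is_weekend"] = typed["day_of_week"] >= 5
typed["month"] = typed["date"].dt.month

# Work on a reproducible subset if the full table does not fit in memory.
subset = typed.sample(n=2, random_state=42)
```

Fixing `random_state` keeps the subset identical across team members, which makes results comparable between branches.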




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline.  
Do not over-complicate your code; refine the solution only if you have time.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


In [12]:
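import pandas as pd

# Load the three Inside Airbnb files for the chosen city
listings = pd.read_csv("listings.csv.gz")
calendar = pd.read_csv("calendar.csv.gz")
reviews = pd.read_csv("reviews.csv.gz")

# Summary statistics for the numeric columns of each table
display(reviews.describe())
display(calendar.describe())
display(listings.describe())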

Unnamed: 0,listing_id,id,reviewer_id
count,501084.0,501084.0,501084.0
mean,1.382436e+17,6.408181e+17,176324000.0
std,3.386866e+17,5.495376e+17,172103600.0
min,27886.0,82539.0,1.0
25%,6255736.0,432562300.0,35777460.0
50%,20468520.0,7.135894e+17,113546000.0
75%,44423330.0,1.149944e+18,271445000.0
max,1.498684e+18,1.507956e+18,717414000.0


Unnamed: 0,listing_id,price,adjusted_price,minimum_nights,maximum_nights
count,3825200.0,0.0,0.0,3825200.0,3825200.0
mean,5.925464e+17,,,4.374672,410197.7
std,5.620407e+17,,,18.7946,29663530.0
min,27886.0,,,1.0,1.0
25%,26293730.0,,,2.0,21.0
50%,6.893474e+17,,,3.0,50.0
75%,1.11961e+18,,,4.0,730.0
max,1.506287e+18,,,1001.0,2147484000.0


Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,10480.0,10480.0,10480.0,10477.0,10477.0,0.0,10480.0,10480.0,10480.0,5932.0,10174.0,5904.0,10480.0,10480.0,10476.0,10476.0,10476.0,10476.0,10480.0,10480.0,0.0,10480.0,10480.0,10480.0,10480.0,10480.0,10480.0,10480.0,10480.0,10480.0,10480.0,5874.0,9383.0,9383.0,9382.0,9383.0,9383.0,9383.0,9383.0,10480.0,10480.0,10480.0,10480.0,9383.0
mean,5.925464e+17,20250910000000.0,134501900.0,3.967262,5.991887,,52.366679,4.889447,2.920515,1.229855,1.554158,1.793022,4.390267,282.453053,3.877911,4.878771,410339.9,410363.6,4.374828,410197.7,,5.156679,12.382061,21.455153,93.999809,47.813359,8.588073,0.686069,27.784447,8.195134,48.438931,15811.94,4.844096,4.853297,4.777811,4.894101,4.907071,4.817007,4.653918,1.844084,1.217748,0.560115,0.029103,0.998668
std,5.620675e+17,0.0,180435900.0,37.409613,61.183451,,0.017246,0.034821,1.276192,0.536533,0.886438,1.599438,19.80735,385.865218,17.959485,19.88346,29670600.0,29670600.0,18.735179,29664940.0,,8.311859,17.926748,29.341653,122.276158,131.50744,25.195305,2.233509,37.045892,24.099944,76.929561,41124.52,0.257871,0.251559,0.317948,0.215202,0.216665,0.231716,0.323497,3.159096,2.433486,1.713971,0.409605,2.306143
min,27886.0,20250910000000.0,1662.0,1.0,1.0,,52.290276,4.75587,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.01
25%,26293730.0,20250910000000.0,12777810.0,1.0,1.0,,52.355694,4.864618,2.0,1.0,1.0,1.0,2.0,20.0,2.0,2.0,21.0,21.0,2.0,21.0,,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,1632.0,4.79,4.8,4.69,4.87,4.9,4.73,4.53,1.0,1.0,0.0,0.0,0.2
50%,6.893474e+17,20250910000000.0,45478430.0,1.0,1.0,,52.36569,4.887516,2.0,1.0,1.0,1.0,3.0,30.0,2.0,3.0,31.0,60.0,3.0,60.0,,0.0,1.0,2.0,20.0,10.0,2.0,0.0,4.0,2.0,16.0,7020.0,4.92,4.92,4.87,4.97,5.0,4.89,4.71,1.0,1.0,0.0,0.0,0.41
75%,1.11961e+18,20250910000000.0,187719600.0,1.0,2.0,,52.37651,4.908675,4.0,1.5,2.0,2.0,4.0,365.0,3.0,4.0,700.0,731.0,4.0,730.0,,7.0,21.0,42.0,173.0,30.0,6.0,1.0,56.0,6.0,48.0,20371.25,5.0,5.0,5.0,5.0,5.0,5.0,4.85,1.0,1.0,0.0,0.0,0.91
max,1.506287e+18,20250910000000.0,717347000.0,957.0,1655.0,,52.42512,5.02815,16.0,17.0,17.0,33.0,1001.0,1125.0,1001.0,1001.0,2147484000.0,2147484000.0,1001.0,2147484000.0,,30.0,60.0,90.0,365.0,5097.0,949.0,89.0,112.0,835.0,255.0,2480558.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,35.0,32.0,15.0,9.0,99.42
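
In the summaries above, the calendar's `maximum_nights` has a median of 50 but a maximum near 2.1e9, which is close to the largest 32-bit signed integer and looks like a placeholder rather than a real limit. One possible treatment, sketched on synthetic values, is to flag such rows and cap the column. The cap of 1125 is an assumption borrowed from the largest listing-level `maximum_nights` in the same output; choose your own threshold after inspecting the data.

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # 2147483647
NIGHTS_CAP = 1125                   # hypothetical cap, chosen for illustration

# Synthetic values mimicking the spread seen in the summary statistics.
nights_demo = pd.DataFrame({"maximum_nights": [21, 50, 730, 1125, INT32_MAX]})

# Keep the information that a placeholder was present, then cap the value.
nights_demo["max_nights_placeholder"] = nights_demo["maximum_nights"] >= INT32_MAX
nights_demo["maximum_nights_capped"] = nights_demo["maximum_nights"].clip(upper=NIGHTS_CAP)
```

Keeping the flag column alongside the capped value lets a downstream model distinguish "effectively unlimited" from a genuine 1125-night limit.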


**Exploratory Analysis**

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def build_features(filename='reviews.csv.gz'):
    """Build one row of TF-IDF features per listing from its review comments."""
    # Only the file read can raise FileNotFoundError, so guard just that call
    try:
        df = pd.read_csv(filename)
    except FileNotFoundError:
        print(f"File {filename} not found.")
        return pd.DataFrame()

    # 1. Aggregate text per listing (join all comments that share a listing_id)
    print("Aggregating text by listing_id...")
    grouped_df = df.groupby('listing_id')['comments'].apply(
        lambda x: " ".join(x.fillna("").astype(str))
    ).reset_index()

    # 2. Vectorize the aggregated text with TF-IDF
    print("Vectorizing...")
    vectorizer = TfidfVectorizer(
        min_df=5,             # ignore terms that appear in fewer than 5 listings
        max_features=300,     # keep the 300 most frequent terms
        ngram_range=(1, 2),   # unigrams and bigrams
        stop_words='english'  # English stop words only; other languages are not filtered
    )
    X = vectorizer.fit_transform(grouped_df["comments"])

    # 3. Wrap the matrix in a DataFrame indexed by listing_id
    tfidf_df = pd.DataFrame(
        X.toarray(),
        index=grouped_df['listing_id'],
        columns=[f"tfidf_{t}" for t in vectorizer.get_feature_names_out()]
    )
    return tfidf_df

if __name__ == "__main__":
    features = build_features()
    if not features.empty:
        print(features.head())
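
The feature names produced by this cell include `tfidf_br` and `tfidf_br br`, which suggests that HTML line breaks in the raw comments are being tokenized as words. A minimal sketch of a cleaning step that could run before aggregation is shown below; `clean_comment` is a hypothetical helper, and the sample strings are synthetic.

```python
import html
import re

import pandas as pd

TAG_RE = re.compile(r"<[^>]+>")  # matches tags such as <br/> or <br>

def clean_comment(text):
    """Decode HTML entities, drop tags, and collapse whitespace (hypothetical helper)."""
    if not isinstance(text, str):
        return ""  # missing comments become empty strings
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

# Synthetic comments illustrating the kind of residue to remove.
comments_demo = pd.Series([
    "Great stay!<br/>Close to Central Station.",
    None,
    "Tr&egrave;s calme<br><br>merci",
])
cleaned = comments_demo.map(clean_comment)
```

Applying such a function to the `comments` column before the `groupby` would free vocabulary slots that are otherwise spent on markup tokens.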

Aggregating text by listing_id...
Vectorizing...
            tfidf_10  tfidf_15  tfidf_20  tfidf_able  tfidf_absolutely  tfidf_access  tfidf_accommodating  tfidf_accommodation  tfidf_agréable  tfidf_airbnb  tfidf_alles  tfidf_amazing  tfidf_amenities  tfidf_amsterdam  tfidf_amsterdam br  tfidf_apartment  tfidf_appartement  tfidf_area  tfidf_arrival  tfidf_arrived  tfidf_attractions  tfidf_au  tfidf_auch  tfidf_available  tfidf_avec  tfidf_avons  tfidf_away  tfidf_awesome  tfidf_bars  tfidf_bathroom  tfidf_beautiful  tfidf_bed  tfidf_beds  tfidf_best  tfidf_better  tfidf_bien  tfidf_big  tfidf_bike  tfidf_bikes  tfidf_bit  tfidf_boat  tfidf_bon  tfidf_br  tfidf_br br  tfidf_breakfast  tfidf_bus  tfidf_cafes  tfidf_calme  tfidf_canal  tfidf_casa  tfidf_ce  tfidf_center  tfidf_central  tfidf_central station  tfidf_centre  tfidf_chambre  tfidf_check  tfidf_city  tfidf_city center  tfidf_city centre  tfidf_clean  tfidf_close  tfidf_coffee  tfidf_come  tfidf_comfortable  tfidf_comfy  tfidf_c