# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>
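
One possible way to turn the calendar file into the prediction target described above is sketched here. It assumes `available` is stored as the strings `"t"` and `"f"`, which is how Inside Airbnb calendar exports usually encode it; the three rows and the `is_available` column name are hypothetical, so inspect `calendar["available"].unique()` on the real data first.

```python
import pandas as pd

# Hypothetical three-row stand-in for calendar.csv (one row per listing per day).
calendar_demo = pd.DataFrame({
    "listing_id": [101, 101, 205],
    "date": ["2025-09-11", "2025-09-12", "2025-09-11"],
    "available": ["t", "f", "t"],  # assumed "t"/"f" encoding
})

# Parse the date so temporal features can be derived from it later.
calendar_demo["date"] = pd.to_datetime(calendar_demo["date"])

# Map the text flag to a 0/1 target; any unexpected value becomes <NA>
# instead of being silently treated as available or unavailable.
calendar_demo["is_available"] = (
    calendar_demo["available"].map({"t": 1, "f": 0}).astype("Int8")
)

# The class balance is worth checking before any modelling.
target_share = float(calendar_demo["is_available"].mean())
print(calendar_demo)
print(f"Share of available listing-days: {target_share:.2f}")
```

Using a nullable integer dtype keeps unexpected flags visible as missing values rather than forcing them into one of the two classes.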

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  
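
As a small illustration of step 2, temporal features can be derived from a parsed `date` column with the pandas `.dt` accessor. The rows and the feature names (`day_of_week`, `month`, `is_weekend`) below are illustrative assumptions, not a required schema.

```python
import pandas as pd

# Hypothetical calendar dates; only the `date` column matters for this sketch.
dates_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2025-12-05", "2025-12-06", "2025-12-25"])}
)

# pandas numbers weekdays from Monday=0 to Sunday=6.
dates_demo["day_of_week"] = dates_demo["date"].dt.dayofweek
dates_demo["month"] = dates_demo["date"].dt.month

# Saturday (5) and Sunday (6) are flagged as weekend days.
dates_demo["is_weekend"] = dates_demo["day_of_week"] >= 5

print(dates_demo)
```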

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
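
The dtype hint can be made concrete with a short sketch. The rows below are hypothetical, and the column names (`price`, `instant_bookable`, `room_type`, `last_review`) are assumptions about how such fields commonly arrive as text; adapt them to what your city's files actually contain.

```python
import pandas as pd

# Hypothetical raw rows: every column is read as text, including the missing values.
raw = pd.DataFrame({
    "price": ["$1,250.00", "$89.00", None],
    "instant_bookable": ["t", "f", "t"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "last_review": ["2025-08-30", None, "2024-12-01"],
})

clean = pd.DataFrame({
    # Strip the currency symbol and thousands separator, then parse as float.
    "price": pd.to_numeric(
        raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    ),
    # The nullable boolean dtype keeps unknown flags as <NA>.
    "instant_bookable": raw["instant_bookable"]
    .map({"t": True, "f": False})
    .astype("boolean"),
    # Low-cardinality text is stored more compactly as a categorical.
    "room_type": raw["room_type"].astype("category"),
    # Unparseable or missing dates become NaT.
    "last_review": pd.to_datetime(raw["last_review"], errors="coerce"),
})

print(clean.dtypes)
```

Converting once, near the loading step, means later feature code can rely on numeric, boolean, categorical, and datetime operations without repeated string handling.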




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline.  
Do not over-complicate your code; refine the solution only if you have time.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


In [12]:
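import pandas as pd

# Load the three Inside Airbnb files; pandas infers gzip compression from the .gz suffix
listings = pd.read_csv("listings.csv.gz")
calendar = pd.read_csv("calendar.csv.gz")
reviews = pd.read_csv("reviews.csv.gz")

# Summary statistics for the numeric columns of each table
# (display() is available inside Jupyter notebooks)
display(reviews.describe())
display(calendar.describe())
display(listings.describe())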

|  | listing_id | id | reviewer_id |
|---|---|---|---|
| count | 501084.0 | 501084.0 | 501084.0 |
| mean | 1.382436e+17 | 6.408181e+17 | 176324000.0 |
| std | 3.386866e+17 | 5.495376e+17 | 172103600.0 |
| min | 27886.0 | 82539.0 | 1.0 |
| 25% | 6255736.0 | 432562300.0 | 35777460.0 |
| 50% | 20468520.0 | 7.135894e+17 | 113546000.0 |
| 75% | 44423330.0 | 1.149944e+18 | 271445000.0 |
| max | 1.498684e+18 | 1.507956e+18 | 717414000.0 |


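|  | listing_id | price | adjusted_price | minimum_nights | maximum_nights |
|---|---|---|---|---|---|
| count | 3825200.0 | 0.0 | 0.0 | 3825200.0 | 3825200.0 |
| mean | 5.925464e+17 | NaN | NaN | 4.374672 | 410197.7 |
| std | 5.620407e+17 | NaN | NaN | 18.7946 | 29663530.0 |
| min | 27886.0 | NaN | NaN | 1.0 | 1.0 |
| 25% | 26293730.0 | NaN | NaN | 2.0 | 21.0 |
| 50% | 6.893474e+17 | NaN | NaN | 3.0 | 50.0 |
| 75% | 1.11961e+18 | NaN | NaN | 4.0 | 730.0 |
| max | 1.506287e+18 | NaN | NaN | 1001.0 | 2147484000.0 |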


|  | id | scrape_id | host_id | host_listings_count | host_total_listings_count | neighbourhood_group_cleansed | latitude | longitude | accommodates | bathrooms | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10480.0 | 10480.0 | 10480.0 | 10477.0 | 10477.0 | 0.0 | 10480.0 | 10480.0 | 10480.0 | 5932.0 | ... | 9382.0 | 9383.0 | 9383.0 | 9383.0 | 9383.0 | 10480.0 | 10480.0 | 10480.0 | 10480.0 | 9383.0 |
| mean | 5.925464e+17 | 20250910000000.0 | 134501900.0 | 3.967262 | 5.991887 | NaN | 52.366679 | 4.889447 | 2.920515 | 1.229855 | ... | 4.777811 | 4.894101 | 4.907071 | 4.817007 | 4.653918 | 1.844084 | 1.217748 | 0.560115 | 0.029103 | 0.998668 |
| std | 5.620675e+17 | 0.0 | 180435900.0 | 37.409613 | 61.183451 | NaN | 0.017246 | 0.034821 | 1.276192 | 0.536533 | ... | 0.317948 | 0.215202 | 0.216665 | 0.231716 | 0.323497 | 3.159096 | 2.433486 | 1.713971 | 0.409605 | 2.306143 |
| min | 27886.0 | 20250910000000.0 | 1662.0 | 1.0 | 1.0 | NaN | 52.290276 | 4.75587 | 1.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 26293730.0 | 20250910000000.0 | 12777810.0 | 1.0 | 1.0 | NaN | 52.355694 | 4.864618 | 2.0 | 1.0 | ... | 4.69 | 4.87 | 4.9 | 4.73 | 4.53 | 1.0 | 1.0 | 0.0 | 0.0 | 0.2 |
| 50% | 6.893474e+17 | 20250910000000.0 | 45478430.0 | 1.0 | 1.0 | NaN | 52.36569 | 4.887516 | 2.0 | 1.0 | ... | 4.87 | 4.97 | 5.0 | 4.89 | 4.71 | 1.0 | 1.0 | 0.0 | 0.0 | 0.41 |
| 75% | 1.11961e+18 | 20250910000000.0 | 187719600.0 | 1.0 | 2.0 | NaN | 52.37651 | 4.908675 | 4.0 | 1.5 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 4.85 | 1.0 | 1.0 | 0.0 | 0.0 | 0.91 |
| max | 1.506287e+18 | 20250910000000.0 | 717347000.0 | 957.0 | 1655.0 | NaN | 52.42512 | 5.02815 | 16.0 | 17.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 35.0 | 32.0 | 15.0 | 9.0 | 99.42 |


**Exploratory Analysis**

In [13]:
print("--- Calendar Nulls ---")
print(calendar.isnull().sum())

print("\n--- Reviews Nulls ---")
print(reviews.isnull().sum())

print("\n--- Listings Nulls ---")
print(listings.isnull().sum())

--- Calendar Nulls ---
listing_id              0
date                    0
available               0
price             3825200
adjusted_price    3825200
minimum_nights          0
maximum_nights          0
dtype: int64

--- Reviews Nulls ---
listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     1
comments         31
dtype: int64

--- Listings Nulls ---
id                                                 0
listing_url                                        0
scrape_id                                          0
last_scraped                                       0
source                                             0
                                                ... 
calculated_host_listings_count                     0
calculated_host_listings_count_entire_homes        0
calculated_host_listings_count_private_rooms       0
calculated_host_listings_count_shared_rooms        0
reviews_per_month                               1097
Length: 79, dtype: 
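
The null counts show that calendar `price` and `adjusted_price` are missing in all 3,825,200 rows, and the earlier summary shows a `maximum_nights` maximum of about 2.1 billion, which looks like a placeholder rather than a real limit. A minimal sketch of one way to handle both follows; the miniature table is hypothetical, and the cap of 730 nights is an assumed ceiling (borrowed from the 75th percentile in that summary), not a documented rule.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the calendar table: `price` is entirely missing
# and `maximum_nights` contains an int32-max style placeholder.
cal_small = pd.DataFrame({
    "listing_id": [101, 101, 205],
    "price": [np.nan, np.nan, np.nan],
    "minimum_nights": [2, 2, 30],
    "maximum_nights": [30, 30, 2_147_483_647],
})

# Columns with no observed values carry no information, so drop them.
all_null = [col for col in cal_small.columns if cal_small[col].isna().all()]
cal_small = cal_small.drop(columns=all_null)

# Cap implausible stay limits at an assumed ceiling.
MAX_NIGHTS_CAP = 730  # assumption for illustration only
cal_small["maximum_nights"] = cal_small["maximum_nights"].clip(upper=MAX_NIGHTS_CAP)

print("Dropped columns:", all_null)
print(cal_small)
```

Detecting the empty columns programmatically, instead of naming them, keeps the step valid if a different city's files do contain prices.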

Calendar rows: 0
Listings rows: 10480
Match count:   0


Unnamed: 0,listing_id,date,available,price_x,adjusted_price,minimum_nights_x,maximum_nights_x,id,listing_url,scrape_id,...,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,total_reviews,latest_review
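
The cell that produced the row counts and the empty merged frame above is not shown, so the following is only a sketch of how a calendar-to-listings join could be checked. It assumes the join is `calendar.listing_id` to `listings.id` (suggested by the `_x` suffixes in the merged header); the miniature tables and the `accommodates` values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables standing in for calendar.csv and listings.csv.
calendar_demo = pd.DataFrame({
    "listing_id": [101, 101, 205, 999],  # 999 has no matching listing
    "date": ["2025-09-11", "2025-09-12", "2025-09-11", "2025-09-11"],
    "available": ["t", "f", "t", "t"],
})
listings_demo = pd.DataFrame({
    "id": ["101", "205", "309"],  # keys read as text, a common join pitfall
    "accommodates": [2, 4, 3],
})

# Merge keys should share a dtype, so align them before joining.
listings_demo["id"] = listings_demo["id"].astype("int64")

merged = calendar_demo.merge(
    listings_demo,
    left_on="listing_id",
    right_on="id",
    how="left",               # keep every calendar row
    validate="many_to_one",   # fail loudly if listing ids are duplicated
    indicator=True,           # adds `_merge` to show where each row came from
)

match_count = int((merged["_merge"] == "both").sum())
print(f"Calendar rows: {len(calendar_demo)}")
print(f"Listings rows: {len(listings_demo)}")
print(f"Match count:   {match_count}")
```

A left join with `indicator=True` keeps unmatched calendar rows visible, which makes a zero match count easier to diagnose than an empty inner join.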


In [6]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import string

class ReviewFeatureBuilder:
    def __init__(self, dataframe):
        """
        Initialize with the raw reviews dataframe.
        Expected columns: listing_id, id, date, reviewer_id, reviewer_name, comments
        """
        self.raw_df = dataframe.copy()
        self.final_features = None
        
        # Simple lexicon for sentiment analysis without TextBlob
        self.pos_words = {
            'good', 'great', 'love', 'excellent', 'amazing', 'clean', 'nice', 
            'best', 'beautiful', 'fantastic', 'comfortable', 'perfect', 'lovely'
        }
        self.neg_words = {
            'bad', 'terrible', 'awful', 'dirty', 'worst', 'poor', 'rude', 
            'noisy', 'hate', 'horrible', 'disgusting', 'messy', 'broken'
        }

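    def _clean_text(self, text):
        """Basic text cleaning: lowercase, then replace non-alphanumeric characters with spaces."""
        if not isinstance(text, str):
            return ""
        text = text.lower()
        # Replace punctuation with a space instead of deleting it, so words joined
        # only by punctuation (e.g. "clean/quiet") do not fuse into one token
        cleaned = "".join(c if c.isalnum() else " " for c in text)
        # Collapse runs of whitespace into single spaces
        return " ".join(cleaned.split())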

    def _get_sentiment(self, text):
        """
        Returns a basic sentiment score based on word counts.
        Range: -1.0 (negative) to 1.0 (positive).
        """
        words = text.split()
        if not words:
            return 0.0
            
        score = 0
        for word in words:
            if word in self.pos_words:
                score += 1
            elif word in self.neg_words:
                score -= 1
        
        # Normalize by review length, scale by 5 so short reviews still give a
        # visible signal, then clamp to [-1, 1] so the documented range holds.
        # `words` is guaranteed non-empty here because of the early return above.
        normalized_score = (score / len(words)) * 5
        return max(min(normalized_score, 1.0), -1.0)

    def build(self, max_tfidf_features=20):
        """
        Main execution pipeline:
        1. Preprocessing
        2. Per-review Sentiment extraction
        3. Aggregation by listing_id
        4. TF-IDF vectorization on aggregated text
        """
        print("--- Starting Feature Engineering (No External NLP Libs) ---")
        
        # 1. Preprocessing
        print("Cleaning text...")
        self.raw_df['clean_comments'] = self.raw_df['comments'].fillna("").apply(self._clean_text)

        # 2. Per-Review Sentiment
        # We calculate sentiment BEFORE grouping to get the average sentiment per listing
        # and the volatility (std dev) of sentiment.
        print("Calculating sentiment for individual reviews...")
        self.raw_df['sentiment_polarity'] = self.raw_df['clean_comments'].apply(self._get_sentiment)

        # 3. Aggregation
        print("Aggregating by listing_id...")
        
        # Define aggregation rules
        agg_rules = {
            'clean_comments': lambda x: ' '.join(x),  # Concatenate all reviews into one document
            'sentiment_polarity': ['mean', 'std', 'count'], # Avg sentiment, consistency, and volume
            'date': ['min', 'max'] # Recency features could be derived here
        }
        
        grouped = self.raw_df.groupby('listing_id').agg(agg_rules)
        
        # Flatten MultiIndex columns
        grouped.columns = ['_'.join(col).strip() for col in grouped.columns.values]
        
        # Rename for clarity
        grouped = grouped.rename(columns={
            'clean_comments_<lambda>': 'aggregated_text',
            'sentiment_polarity_mean': 'avg_sentiment',
            'sentiment_polarity_std': 'sentiment_std',
            'sentiment_polarity_count': 'review_count',
            'date_min': 'first_review',
            'date_max': 'last_review'
        })
        
        # Handle NaN in std deviation (occurs if a listing has only 1 review)
        grouped['sentiment_std'] = grouped['sentiment_std'].fillna(0)

        # 4. TF-IDF
        # We treat the aggregated reviews of one listing as a single "document"
        print(f"Vectorizing text (Top {max_tfidf_features} features)...")
        tfidf = TfidfVectorizer(
            stop_words='english', 
            max_features=max_tfidf_features,
            min_df=0.05, # Ignore terms that appear in less than 5% of listings
            max_df=0.95  # Ignore terms that appear in more than 95% of listings
        )
        
        # Fit on the aggregated text
        tfidf_matrix = tfidf.fit_transform(grouped['aggregated_text'])
        
        # Create a DataFrame for the TF-IDF features
        feature_names = [f"tfidf_{w}" for w in tfidf.get_feature_names_out()]
        tfidf_df = pd.DataFrame(
            tfidf_matrix.toarray(), 
            index=grouped.index, 
            columns=feature_names
        )

        # 5. Final Assembly
        # Combine metrics with text vectors
        self.final_features = pd.concat([
            grouped[['review_count', 'avg_sentiment', 'sentiment_std']], 
            tfidf_df
        ], axis=1)

        print("--- Feature Build Complete ---")
        return self.final_features

# ==========================================
# Example Usage with Mock Data
# ==========================================

if __name__ == "__main__":
    # 1. Create Mock Dataset
    data = {
        'listing_id': [101, 101, 101, 205, 205, 309],
        'id': [1, 2, 3, 4, 5, 6],
        'date': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-01-15', '2023-02-20', '2023-01-10'],
        'reviewer_id': [99, 88, 77, 66, 55, 44],
        'reviewer_name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Frank'],
        'comments': [
            "The place was clean and the host was great.",
            "Great location, but a bit noisy at night.",
            "Absolutely loved the clean kitchen and spacious room.",
            "Terrible experience. Dirty and rude host.",
            "Not worth the money. Very dirty.",
            "It was okay. Nothing special but good location."
        ]
    }
    
    df_reviews = pd.DataFrame(data)
    
    print(f"Input Data:\n{df_reviews[['listing_id', 'comments']]}\n")

    # 2. Instantiate and Run Builder
    builder = ReviewFeatureBuilder(df_reviews)
    features = builder.build(max_tfidf_features=5)

    # 3. Display Results
    print("\nResulting Features (Indexed by listing_id):")
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 1000)
    print(features)
    
    # Interpretation derived from the computed scores rather than hard-coded,
    # so the printed labels always agree with the actual results
    print("\n--- Interpretation ---")
    for listing_id, score in features['avg_sentiment'].items():
        label = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
        print(f"Listing {listing_id}: {label} sentiment (avg = {score:.2f}).")

Input Data:
   listing_id                                           comments
0         101        The place was clean and the host was great.
1         101          Great location, but a bit noisy at night.
2         101  Absolutely loved the clean kitchen and spaciou...
3         205          Terrible experience. Dirty and rude host.
4         205                   Not worth the money. Very dirty.
5         309    It was okay. Nothing special but good location.

--- Starting Feature Engineering (No External NLP Libs) ---
Cleaning text...
Calculating sentiment for individual reviews...
Aggregating by listing_id...
Vectorizing text (Top 5 features)...
--- Feature Build Complete ---

Resulting Features (Indexed by listing_id):
            review_count  avg_sentiment  sentiment_std  tfidf_clean  tfidf_dirty  tfidf_great  tfidf_host  tfidf_location
listing_id                                                                                                               
101                  