# Airbnb Recommendation System with Clustering

This notebook recommends Airbnb listings across European cities based on user preferences and budget.

The system supports two recommendation modes:

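- **Duration**: Used when the number of nights is left blank. Finds listings where the user can stay the longest within their budget.
- **Value**: Used when a number of nights is given. Ranks listings that fit the budget by a composite score using:
  - Guest satisfaction
  - Price per night
  - Number of bedrooms
  - Distance to city center
  - Distance to metro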

We also apply **KMeans clustering** to group listings into 6 behavioral types (e.g., budget shared, high-end central, suburban). Users can optionally filter results by these cluster types for more tailored recommendations.
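
The `cluster` column is read from `clustered_airbnb.csv`, so the clustering step itself is not shown in this notebook. As a rough illustration of what KMeans does, here is a minimal, dependency-free sketch of the underlying assign-and-update loop (Lloyd's algorithm) on two hypothetical, pre-scaled features with k = 2. The feature choice, the toy points, and the fixed starting centroids are assumptions for illustration; this is not scikit-learn's `KMeans` implementation, and the real clustering uses 6 clusters.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical, already-scaled features: (price, distance to city center)
points = [(0.10, 0.90), (0.15, 0.80), (0.20, 0.85),   # cheap, far from the center
          (0.90, 0.10), (0.85, 0.20), (0.95, 0.15)]   # expensive, central
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Each listing ends up with an integer label, which is the kind of value the notebook later matches against the user's selected cluster.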


In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

### Recommend Airbnbs Function  
This function allows users to interactively enter travel preferences, such as budget, city, number of bedrooms, and preferred distances. Based on these inputs, it filters Airbnb listings and either maximizes the duration of stay or ranks listings by value using a custom scoring function. The results are displayed directly in the notebook or terminal.
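
To make the duration mode concrete, the sketch below ranks a few hypothetical listings by how many whole nights a budget covers. It uses plain Python so it runs without the dataset, and it assumes, as the function's budget check does, that `realSum` is the price per night. The `name` field and the prices are made up, and skipping listings the budget cannot cover for a single night is an added assumption.

```python
import math

def rank_by_duration(listings, budget):
    """Return listings sorted by the number of whole nights the budget covers."""
    ranked = []
    for row in listings:
        nights = math.floor(budget / row["realSum"])  # whole nights only
        if nights >= 1:  # assumption: skip listings that are unaffordable for one night
            ranked.append({**row, "max_nights": nights})
    return sorted(ranked, key=lambda row: row["max_nights"], reverse=True)

# Hypothetical listings; realSum is treated as the nightly price
listings = [
    {"name": "A", "realSum": 120.0},
    {"name": "B", "realSum": 80.0},
    {"name": "C", "realSum": 700.0},
]
ranked = rank_by_duration(listings, budget=600.0)
for row in ranked:
    print(row["name"], row["max_nights"])
```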


In [None]:
# --- Recommend Airbnb Listings Based on User Preferences ---
def recommend_airbnbs(filepath="../ML-exam/data/clustered_airbnb.csv"):
    
    # ## Load and Preview Data
    df = pd.read_csv(filepath)

    # --- Display available cities from the dataset ---
    cities = sorted(df['City'].unique())
    print("Available cities:")
    for c in cities:
        print("-", c)

    # ## User Input Section
    print("\n--- Enter your preferences ---")
    try:
        # Collect user inputs for filtering
        user_budget = float(input("Total budget (€): "))
        duration_input = input("Number of nights (leave blank to maximize duration): ").strip()
        min_bedrooms = int(input("Minimum number of bedrooms: "))
        max_city_dist = float(input("Max distance to city center (e.g. 3.0): "))
        max_metro_dist = float(input("Max distance to metro (e.g. 1.0): "))
        weekend = input("Is your stay during a weekend? (yes/no): ").strip().lower() == "yes"
        city_input = input("Pick a city (leave blank to search all): ").strip()

        # --- Optional Cluster Filter ---
        use_cluster = input("Would you like to filter by listing type (cluster)? (yes/no): ").strip().lower() == "yes"
        selected_cluster = None  # stays None when no cluster filter is requested
        if use_cluster:
            print("\nAvailable clusters:")
            print("  0: Large and Expensive")
            print("  1: Budget and Shared")
            print("  2: Compact Private Rooms")
            print("  3: High-End Central")
            print("  4: Suburban Midrange")
            print("  5: Poorly Rated")
            selected_cluster = int(input("Enter cluster number (0–5): "))
            # Reject cluster numbers outside the six known clusters
            if selected_cluster not in range(6):
                raise ValueError("cluster number out of range")
    except ValueError:
        print("Invalid input. Please try again.")
        return

    # ## Determine Recommendation Mode (maximize duration or sort by value)
    if duration_input == "":
        mode = "duration"
        user_duration = None
    else:
        mode = "value"
        try:
            user_duration = int(duration_input)
            # A stay must last at least one night
            if user_duration < 1:
                raise ValueError
        except ValueError:
            print("Invalid number of nights. Please enter a positive integer.")
            return

    # ## Filter Listings Based on User Inputs
    filtered_df = df.copy()

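    # Apply filters step-by-step
    # Note: answering "no" to the weekend question applies no filter,
    # so weekend rows remain in the results alongside weekday rows.
    if weekend:
        filtered_df = filtered_df[filtered_df['Is_weekend_bool'] == 1]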
    if city_input:
        filtered_df = filtered_df[filtered_df['City'].str.lower() == city_input.lower()]
    filtered_df = filtered_df[
        (filtered_df['bedrooms'] >= min_bedrooms) &
        (filtered_df['dist'] <= max_city_dist) &
        (filtered_df['metro_dist'] <= max_metro_dist)
    ]
    if use_cluster:
        filtered_df = filtered_df[filtered_df['cluster'] == selected_cluster]

    # Return early if no listings match the filters
    if filtered_df.empty:
        print("\nNo listings match your criteria.")
        return

    # ## Recommendation Logic
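    if mode == "duration":
        # No duration given: compute how many whole nights the budget covers
        filtered_df = filtered_df.copy()  # avoid SettingWithCopyWarning on the filtered slice
        filtered_df['max_nights'] = (user_budget // filtered_df['realSum']).astype(int)
        # Drop listings the budget cannot cover for even one night
        filtered_df = filtered_df[filtered_df['max_nights'] >= 1]
        if filtered_df.empty:
            print("\nNo listings within your budget.")
            return
        recommended = filtered_df.sort_values(by='max_nights', ascending=False)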
        display_cols = ['City', 'realSum', 'bedrooms', 'dist', 'metro_dist',
                        'guest_satisfaction_overall', 'max_nights']
    else:
        # Duration given: keep listings whose total cost (nightly price * nights) fits the budget
        filtered_df = filtered_df[filtered_df['realSum'] * user_duration <= user_budget]
        if filtered_df.empty:
            print("\nNo listings within your budget for the selected duration.")
            return

        # --- Score the remaining listings with calculate_value_scores (defined in a later cell) ---
        scored_df = calculate_value_scores(filtered_df)
        recommended = scored_df.sort_values(by='value_score', ascending=False)
        display_cols = ['City', 'realSum', 'bedrooms', 'dist', 'metro_dist',
                        'guest_satisfaction_overall', 'value_score']

    # ## Output Results
    print(f"\nTop 10 Recommended Listings (Mode: {mode}):\n")
    print(recommended[display_cols].head(10).to_string(index=False))

### Value Score Calculation Function
This function calculates a `value_score` for each Airbnb listing by combining price, proximity to the city center and metro, number of bedrooms, and guest satisfaction. Each feature is Min-Max scaled, the scaled features are averaged with equal weight, and the result is rescaled to 0–100 so that the best listing in the filtered set scores 100. This allows the listings to be ranked by overall value for money.

In [None]:
def calculate_value_scores(df):
    # Work on a copy to avoid modifying the original DataFrame
    df = df.copy()

    # --- Invert features where lower values are better ---
    # Inverse price: cheaper is better
    df['inv_price'] = 1 / df['realSum']
    
    # Inverse distances (add 0.1 to avoid division by zero)
    df['inv_dist'] = 1 / (df['dist'] + 0.1)
    df['inv_metro'] = 1 / (df['metro_dist'] + 0.1)

    # --- Combine selected features into a scoring matrix ---
    # We use both direct (bedrooms, satisfaction) and inverse (price, distance) metrics
    scoring_data = pd.DataFrame({
        'price': df['inv_price'],                           # Favor cheaper listings
        'center': df['inv_dist'],                           # Favor closer to city center
        'metro': df['inv_metro'],                           # Favor closer to metro
        'bedrooms': df['bedrooms'],                         # Favor more rooms
        'satisfaction': df['guest_satisfaction_overall']    # Favor higher guest scores
    })
    
    # --- Normalize features using Min-Max scaling ---
    scaler = MinMaxScaler()
    normalized = scaler.fit_transform(scoring_data)

    # Average the normalized features to get a raw value score
    raw_score = normalized.mean(axis=1)

    # --- Final Value Score Scaling (0–100) ---
    max_score = raw_score.max()
    if pd.notna(max_score) and max_score != 0:
        df['value_score'] = (raw_score / max_score) * 100
    else:
        # Handle edge case: no variability or invalid score
        df['value_score'] = 0

    return df

In [None]:
recommend_airbnbs()

### How the Value Score Is Calculated

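When the user provides a fixed number of nights (i.e., does not leave the "number of nights" field blank), the system first drops listings whose total cost (`realSum` × nights) exceeds the budget. It then calculates a **value score** for each remaining listing to rank the most cost-effective options.

The calculation in `calculate_value_scores` can be summarized as:

```python
features = [1 / realSum, 1 / (dist + 0.1), 1 / (metro_dist + 0.1), bedrooms, guest_satisfaction_overall]
raw_score = mean(min_max_scale(f) for f in features)   # each feature scaled to 0–1 across the filtered listings
value_score = 100 * raw_score / max(raw_score)          # the best listing scores 100
```

Where:

* `realSum`: The listing price, which the code treats as the price per night. It is inverted so that cheaper listings score higher.
* `dist` and `metro_dist`: Distances to the city center and to the nearest metro. They are inverted so that closer listings score higher; 0.1 is added to avoid division by zero.
* `bedrooms`: The number of bedrooms; more bedrooms score higher.
* `guest_satisfaction_overall`: A rating from 0 to 100 indicating how satisfied previous guests were.

### Interpretation

* A **higher value score** indicates a **better overall balance of price, location, size, and guest rating**. All five features carry equal weight.
* Scores are relative to the listings that passed the filters, so they are comparable only within a single search.
* This helps highlight listings that are not just cheap, but also highly rated and well located, so you get the **best bang for your buck**.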
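
As a worked example, the sketch below mirrors the steps in `calculate_value_scores` (invert price and distances, Min-Max scale each feature, average, rescale to 0–100) in plain Python so it runs without pandas or scikit-learn. The three listings and their `name` field are hypothetical, and `min_max` is a stand-in helper written for this example rather than the `MinMaxScaler` the notebook uses.

```python
def min_max(values):
    """Scale values to 0-1; a constant column maps to 0.0, as MinMaxScaler does for a zero-range feature."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def value_scores(listings):
    """Equal-weight average of scaled features, rescaled so the best listing scores 100."""
    columns = [
        [1 / row["realSum"] for row in listings],              # cheaper is better
        [1 / (row["dist"] + 0.1) for row in listings],         # closer to center is better
        [1 / (row["metro_dist"] + 0.1) for row in listings],   # closer to metro is better
        [row["bedrooms"] for row in listings],                 # more bedrooms is better
        [row["guest_satisfaction_overall"] for row in listings],
    ]
    scaled = [min_max(col) for col in columns]
    raw = [sum(col[i] for col in scaled) / len(scaled) for i in range(len(listings))]
    best = max(raw)
    return [100 * r / best if best else 0.0 for r in raw]

# Hypothetical listings: A is best on every feature, B is worst on every feature
listings = [
    {"name": "A", "realSum": 100.0, "dist": 1.0, "metro_dist": 0.2, "bedrooms": 2, "guest_satisfaction_overall": 95},
    {"name": "B", "realSum": 200.0, "dist": 3.0, "metro_dist": 0.8, "bedrooms": 1, "guest_satisfaction_overall": 80},
    {"name": "C", "realSum": 150.0, "dist": 2.0, "metro_dist": 0.5, "bedrooms": 1, "guest_satisfaction_overall": 90},
]
scores = value_scores(listings)
for row, score in zip(listings, scores):
    print(row["name"], round(score, 1))
```

Because the scaling is relative, listing B scores 0 here only because it is the weakest of these three on every feature; added to a different set of listings, the same listing could score well above 0.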