## **Introduction & Explanation: Airbnb Data Preparation Scripts**

This notebook contains two preprocessing scripts that turn the raw Airbnb price and listing CSV files for European cities into datasets for machine learning: one for unsupervised learning and one for supervised learning.

In [None]:
import pandas as pd
import os

# Dataset source: https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities

#### **Unsupervised Learning Dataset Creation**

This section builds the dataset used for unsupervised learning tasks such as clustering.

1. **Combining Data**

   * Loads every CSV file in `./data/airbnb-prices-in-european-cities/`; file names are expected to end in `_weekdays` or `_weekends`.
   * Adds a `City` column (the file name without that suffix, capitalized) and an `Is_weekend` flag to every row.
   * Concatenates all files into a single DataFrame.

2. **Cleaning the Data**

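   * Drops columns that are not used downstream: coordinates (`lng`, `lat`) and index scores (`attr_index`, `attr_index_norm`, `rest_index`, `rest_index_norm`).
   * Renames `'Unnamed: 0'` to `'ID'` if present.
   * Converts boolean columns, plus the 0/1 columns `biz` and `multi`, to integer flags and appends a `_bool` suffix to their names.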

3. **Outlier Removal**

   * Removes extreme price outliers: rows whose `realSum` is at or above the 99th percentile are dropped, which keeps the dataset focused on typical listings.

4. **Final Touch**

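   * Rounds `dist` and `metro_dist` to one decimal place and `realSum` to two, and casts `person_capacity`, `cleanliness_rating`, and `guest_satisfaction_overall` to integers.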
   * Saves the cleaned dataset as `cleaned_airbnb_data.csv`.
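
The file-name convention behind step 1 can be illustrated in isolation. The following is a minimal sketch, not part of the notebook's pipeline: `parse_listing_filename` is a hypothetical helper, and the sample file names are assumptions rather than a listing of the actual folder.

```python
import os

def parse_listing_filename(filename):
    """Return (city, is_weekend) for a name such as 'amsterdam_weekends.csv'.

    Hypothetical helper for illustration; is_weekend is None when the
    name carries neither suffix.
    """
    stem = os.path.splitext(filename)[0]
    for suffix, is_weekend in (("_weekends", True), ("_weekdays", False)):
        if stem.endswith(suffix):
            # Strip only the trailing suffix, then capitalize the city name
            return stem[: -len(suffix)].capitalize(), is_weekend
    return stem.capitalize(), None

# Sample names are assumptions, not the real folder contents
for name in ["amsterdam_weekends.csv", "rome_weekdays.csv", "notes.csv"]:
    print(name, "->", parse_listing_filename(name))
```

Slicing off the suffix, rather than calling `str.replace`, only touches the end of the name, so a city whose name happened to contain the suffix text elsewhere would be left intact.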

In [None]:
# RUN THIS TO CREATE THE FINAL DATASET FOR UNSUPERVISED LEARNING MODEL
# Step 1: Combine all CSV files
folder_path = './data/airbnb-prices-in-european-cities/'
# os.listdir returns files in arbitrary order; sort for a reproducible row order
csv_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.csv'))
combined_df = []

for file in csv_files:
    file_path = os.path.join(folder_path, file)
    df = pd.read_csv(file_path)

    filename_no_ext = os.path.splitext(file)[0]
    if filename_no_ext.endswith('_weekends'):
        city = filename_no_ext.replace('_weekends', '').capitalize()
        is_weekend = True
    elif filename_no_ext.endswith('_weekdays'):
        city = filename_no_ext.replace('_weekdays', '').capitalize()
        is_weekend = False
    else:
        city = filename_no_ext.capitalize()
        is_weekend = pd.NA

    df['City'] = city
    df['Is_weekend'] = is_weekend
    combined_df.append(df)

final_df = pd.concat(combined_df, ignore_index=True)

# Step 2: Clean and optimize
def clean_airbnb_data(df):
    # Drop unnecessary columns
    cols_to_drop = ['lng', 'lat', 'attr_index', 'attr_index_norm', 'rest_index', 'rest_index_norm']
    df = df.drop(columns=[col for col in cols_to_drop if col in df.columns])

    # Rename 'Unnamed: 0' to 'ID' if present
    if 'Unnamed: 0' in df.columns:
        df = df.rename(columns={'Unnamed: 0': 'ID'})

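    # Convert boolean columns to 0/1 and add "_bool" suffix
    bool_cols = df.select_dtypes(include='bool').columns.tolist()

    # Also treat these 0/1 integer columns as flags; skip any that are
    # absent or already picked up as boolean, to avoid a KeyError
    manual_bool_cols = [col for col in ['biz', 'multi']
                        if col in df.columns and col not in bool_cols]

    flag_cols = bool_cols + manual_bool_cols
    for col in flag_cols:
        df[col] = df[col].astype(int)
    df = df.rename(columns={col: f"{col}_bool" for col in flag_cols})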

    return df

# Step 3: Remove outliers
def remove_price_outliers(df, column='realSum', quantile=0.99):
    threshold = df[column].quantile(quantile)
    return df[df[column] < threshold].copy()

# Apply cleaning and filtering
cleaned_df = clean_airbnb_data(final_df)
cleaned_df = remove_price_outliers(cleaned_df)

# Round selected values
cleaned_df['dist'] = cleaned_df['dist'].round(1)
cleaned_df['metro_dist'] = cleaned_df['metro_dist'].round(1)
cleaned_df['realSum'] = cleaned_df['realSum'].round(2)
cleaned_df['person_capacity'] = cleaned_df['person_capacity'].round(0).astype(int)
cleaned_df['cleanliness_rating'] = cleaned_df['cleanliness_rating'].round(0).astype(int)
cleaned_df['guest_satisfaction_overall'] = cleaned_df['guest_satisfaction_overall'].round(0).astype(int)

# Step 4: Save to a new CSV
output_path = os.path.join('./data/', 'cleaned_airbnb_data.csv')
cleaned_df.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")
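
To see what the 99th-percentile filter does, here is a small sketch on synthetic prices. The values are made up for illustration and are not taken from the dataset; only the column name `realSum` and the strict `<` comparison mirror the script.

```python
import pandas as pd

# Synthetic prices: 99 typical listings plus one extreme value (made-up data)
prices = pd.DataFrame({"realSum": [100.0 + i for i in range(99)] + [10_000.0]})

threshold = prices["realSum"].quantile(0.99)
# Strict '<' drops every row at or above the threshold
filtered = prices[prices["realSum"] < threshold].copy()

print(f"threshold: {threshold:.2f}")
print(f"rows before: {len(prices)}, rows after: {len(filtered)}")
print(f"max before: {prices['realSum'].max()}, max after: {filtered['realSum'].max()}")
```

Because pandas interpolates between neighboring values by default, the threshold need not be a price that actually occurs in the data.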


#### **Supervised Learning Dataset Creation**

This section prepares a cleaned dataset tailored for supervised learning models such as regression or classification. Unlike the unsupervised pipeline, it does not remove price outliers.

1. **Combining Data**

   * Same as in the unsupervised pipeline: combines all CSV files and adds the `City` and `Is_weekend` columns.

2. **Cleaning and Encoding**

   * Drops unnecessary columns and renames `'Unnamed: 0'` to `'ID'` if present.
   * Converts boolean columns, plus the 0/1 columns `biz` and `multi`, to integer flags and appends a `_bool` suffix to their names.
   * Encodes the categorical columns `'City'` and `'room_type'` as integer IDs starting from 1, assigned in order of first appearance. The IDs are arbitrary labels, not ordered quantities.

3. **Final Touch**

   * Rounds the distance and price fields and casts the capacity and rating fields to integers.
   * Saves the cleaned dataset as `supervised_cleaned_airbnb_data.csv`.
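
The integer encoding in step 2 can be sketched on a toy column. This example is an illustration rather than the notebook's code: alongside the order-of-appearance mapping it shows `pd.factorize` with `sort=True` as an alternative that assigns IDs alphabetically, so the mapping does not depend on row order. The city names are sample values.

```python
import pandas as pd

cities = pd.Series(["Rome", "Amsterdam", "Rome", "Berlin"])  # sample values

# Order-of-appearance IDs, as an enumerate-based mapping produces
appearance_map = {city: idx + 1 for idx, city in enumerate(cities.unique())}
by_appearance = cities.map(appearance_map)

# Alphabetical IDs via pd.factorize; codes start at 0, so shift by 1
codes, uniques = pd.factorize(cities, sort=True)
by_sorted = pd.Series(codes + 1, index=cities.index)

print(by_appearance.tolist())
print(by_sorted.tolist())
print(list(uniques))
```

Either way, it is worth saving the mapping next to the dataset so that the IDs can be translated back to names later.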

In [None]:
# RUN THIS TO CREATE THE FINAL DATASET FOR SUPERVISED LEARNING MODEL
# Step 1: Combine all CSV files
folder_path = './data/airbnb-prices-in-european-cities/'
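# os.listdir returns files in arbitrary order; sort so the City IDs
# assigned below are the same on every run and platform
csv_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.csv'))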
combined_df = []

for file in csv_files:
    file_path = os.path.join(folder_path, file)
    df = pd.read_csv(file_path)

    filename_no_ext = os.path.splitext(file)[0]
    if filename_no_ext.endswith('_weekends'):
        city = filename_no_ext.replace('_weekends', '').capitalize()
        is_weekend = True
    elif filename_no_ext.endswith('_weekdays'):
        city = filename_no_ext.replace('_weekdays', '').capitalize()
        is_weekend = False
    else:
        city = filename_no_ext.capitalize()
        is_weekend = pd.NA

    df['City'] = city
    df['Is_weekend'] = is_weekend
    combined_df.append(df)

final_df = pd.concat(combined_df, ignore_index=True)

# Step 2: Clean and optimize
def clean_airbnb_data(df):
    # Drop unnecessary columns
    cols_to_drop = ['lng', 'lat', 'attr_index', 'attr_index_norm', 'rest_index', 'rest_index_norm']
    df = df.drop(columns=[col for col in cols_to_drop if col in df.columns])

    # Rename 'Unnamed: 0' to 'ID' if present
    if 'Unnamed: 0' in df.columns:
        df = df.rename(columns={'Unnamed: 0': 'ID'})

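    # Convert boolean columns to 0/1 and add "_bool" suffix
    bool_cols = df.select_dtypes(include='bool').columns.tolist()

    # Also treat these 0/1 integer columns as flags; skip any that are
    # absent or already picked up as boolean, to avoid a KeyError
    manual_bool_cols = [col for col in ['biz', 'multi']
                        if col in df.columns and col not in bool_cols]

    flag_cols = bool_cols + manual_bool_cols
    for col in flag_cols:
        df[col] = df[col].astype(int)
    df = df.rename(columns={col: f"{col}_bool" for col in flag_cols})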

    # Convert 'City' to numeric ID starting from 1
    unique_cities = df['City'].unique()
    city_mapping = {city: idx + 1 for idx, city in enumerate(unique_cities)}
    df['City'] = df['City'].map(city_mapping)

    # Convert 'room_type' to numeric ID starting from 1
    room_types = df['room_type'].unique()
    room_type_mapping = {room_type: idx + 1 for idx, room_type in enumerate(room_types)}
    df['room_type'] = df['room_type'].map(room_type_mapping)

    return df

cleaned_df = clean_airbnb_data(final_df)

cleaned_df['dist'] = cleaned_df['dist'].round(1)
cleaned_df['metro_dist'] = cleaned_df['metro_dist'].round(1)
cleaned_df['realSum'] = cleaned_df['realSum'].round(2)
cleaned_df['person_capacity'] = cleaned_df['person_capacity'].round(0).astype(int)
cleaned_df['cleanliness_rating'] = cleaned_df['cleanliness_rating'].round(0).astype(int)
cleaned_df['guest_satisfaction_overall'] = cleaned_df['guest_satisfaction_overall'].round(0).astype(int)

# Step 3: Save to a new CSV
output_path = os.path.join('./data/', 'supervised_cleaned_airbnb_data.csv')
cleaned_df.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")
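
The flag conversion shared by both pipelines can be checked on a toy frame. This is a minimal sketch with made-up rows; the column names are sample values, and the single-pass `astype`/`rename` shown here is one possible way to write the step, not the notebook's exact code.

```python
import pandas as pd

# Toy frame with one boolean column and one 0/1 integer column (made-up data)
toy = pd.DataFrame({
    "host_is_superhost": [True, False, True],
    "biz": [0, 1, 0],
    "realSum": [120.5, 300.0, 89.9],
})

bool_cols = toy.select_dtypes(include="bool").columns.tolist()
# 'multi' is absent from the toy frame and is skipped by the membership check
flag_cols = bool_cols + [c for c in ["biz", "multi"] if c in toy.columns]

# Cast the flag columns to int and rename them in a single pass
toy = toy.astype({c: int for c in flag_cols}).rename(
    columns={c: f"{c}_bool" for c in flag_cols}
)
print(toy.dtypes)
```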


### **Summary**
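These two scripts turn the raw per-city weekday and weekend CSV files into two machine learning-ready datasets: `cleaned_airbnb_data.csv` for pattern discovery (unsupervised), with price outliers removed, and `supervised_cleaned_airbnb_data.csv` for predictive modeling (supervised), with `City` and `room_type` encoded as integers. Both share the same column cleanup, flag conversion, and rounding.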