# Data Preparation

**Goal:** Transform the raw listings data into a clean, processed dataset suitable for machine learning models. This involves handling missing values, converting data types, engineering features, encoding categorical variables, and splitting the data.

## 1. Initial Setup & Loading
*   Import necessary libraries.
*   Load the raw dataset (`listings.csv`).
*   Define and prepare the target variable (`log_price`).

### Import Libraries
Import essential libraries for data manipulation, numerical operations, preprocessing, and saving objects.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re # For regex operations if needed later (e.g., parsing text)
import joblib # For saving preprocessors/models (alternative: import pickle)

# Scikit-learn modules for preprocessing and splitting
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
# Add other encoders (e.g., TargetEncoder) or transformers later as needed

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100) 
pd.set_option('display.float_format', lambda x: '%.3f' % x) # Adjust float format if desired

print("Libraries imported.")

Libraries imported.


### Load Raw Data
Load the `listings.csv` file identified during Data Understanding. We'll use a new DataFrame name (`df_prep`) to distinguish it from the EDA DataFrame.

In [2]:
# Define the path relative to the notebook location in notebooks/
listings_path = '../data/raw/listings.csv'

# Load the dataset
try:
    # Use a new variable name for the preparation phase
    df_prep = pd.read_csv(listings_path, low_memory=False) 
    print(f"Successfully loaded {listings_path} into df_prep")
    print(f"Initial shape: {df_prep.shape}")
except FileNotFoundError:
    print(f"Error: File not found at {listings_path}. Ensure the file exists.")
    # Optionally exit or raise error if file is crucial
    df_prep = None 
except Exception as e:
    print(f"An error occurred during loading: {e}")
    df_prep = None

# Display first few rows to confirm loading
if df_prep is not None:
    display(df_prep.head(3))

Successfully loaded ../data/raw/listings.csv into df_prep
Initial shape: (10108, 79)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,23163,https://www.airbnb.com/rooms/23163,20250316041547,2025-03-16,city scrape,Residence Karolina - KAROL12,"Unique and elegant apartment rental in Prague,...",,https://a0.muscache.com/pictures/01bbe32c-3f13...,5282,https://www.airbnb.com/users/show/5282,Klara,2008-12-17,"Prague, Czechia","Hello, \r\nglad to see that you are interested...",within an hour,100%,100%,t,https://a0.muscache.com/im/pictures/user/b7309...,https://a0.muscache.com/im/pictures/user/b7309...,Josefov,72.0,82.0,"['email', 'phone']",t,t,,Praha 1,,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1 bath,1.0,2.0,"[""Coffee maker"", ""Dishwasher"", ""Bed linens"", ""...","$2,918.00",1,365,1,7,60,731,1.4,663.6,,t,0,0,0,0,2025-03-16,31,1,0,0,1,6,17508.0,2010-09-20,2024-06-15,4.9,4.83,5.0,5.0,4.97,4.93,4.86,,t,70,69,0,0,0.18
1,23169,https://www.airbnb.com/rooms/23169,20250316041547,2025-03-16,city scrape,Residence Masna - Masna302,Masna studio offers a lot of space and privacy...,,https://a0.muscache.com/pictures/b450cf2a-8561...,5282,https://www.airbnb.com/users/show/5282,Klara,2008-12-17,"Prague, Czechia","Hello, \r\nglad to see that you are interested...",within an hour,100%,100%,t,https://a0.muscache.com/im/pictures/user/b7309...,https://a0.muscache.com/im/pictures/user/b7309...,Josefov,72.0,82.0,"['email', 'phone']",t,t,,Praha 1,,50.088,14.423,Entire rental unit,Entire home/apt,3,1.0,1 bath,1.0,2.0,"[""Patio or balcony"", ""Coffee maker"", ""Bed line...",,1,365,1,7,60,731,1.2,710.6,,t,7,13,13,13,2025-03-16,122,6,0,13,8,36,,2010-05-07,2024-11-08,4.74,4.6,4.83,4.81,4.87,4.97,4.7,,t,70,69,0,0,0.67
2,26755,https://www.airbnb.com/rooms/26755,20250316041547,2025-03-16,city scrape,Central Prague Old Town Top Floor,Big and beautiful new attic apartment in the v...,This apartment offers a fantastic location. Yo...,https://a0.muscache.com/pictures/miso/Hosting-...,113902,https://www.airbnb.com/users/show/113902,Daniel+Bea,2010-04-26,"Prague, Czechia",Hi! we are a sp/cz couple with 2 daughters (La...,within an hour,100%,98%,t,https://a0.muscache.com/im/pictures/user/8db01...,https://a0.muscache.com/im/pictures/user/8db01...,Staré Město,4.0,4.0,"['email', 'phone']",t,t,"Prague, Hlavní město Praha, Czechia",Praha 1,,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.5 baths,1.0,2.0,"[""AC - split type ductless system"", ""Coffee ma...","$1,582.00",3,700,3,4,1125,1125,3.1,1125.0,,t,3,7,24,173,2025-03-16,411,53,3,173,57,255,403410.0,2015-05-19,2025-03-07,4.94,4.95,4.92,4.93,4.96,4.93,4.9,,f,3,3,0,0,3.43


### Define and Prepare Target Variable (`log_price`)
Based on the Data Understanding phase, the target variable for modeling will be the log-transformed price (`log1p`). We need to re-apply the cleaning steps to the `price` column and calculate `log_price`.

In [3]:
if df_prep is not None:
    target_col = 'log_price' # Define the final target column name
    
    # 1. Clean original 'price' column if it exists and is object type
    if 'price' in df_prep.columns and df_prep['price'].dtype == 'object':
        print("Cleaning 'price' column...")
        price_cleaned_temp = df_prep['price'].astype(str).str.replace('[$,]', '', regex=True)
        price_cleaned_temp = pd.to_numeric(price_cleaned_temp, errors='coerce')
        
        # Check if all are integers to use Int64
        is_integer = (price_cleaned_temp.dropna() % 1 == 0).all()
        if is_integer:
             df_prep['price_cleaned'] = price_cleaned_temp.astype('Int64')
             print("Stored cleaned price as Int64.")
        else:
             df_prep['price_cleaned'] = price_cleaned_temp
             print("Stored cleaned price as float64.")
             
    elif 'price' in df_prep.columns and pd.api.types.is_numeric_dtype(df_prep['price']):
         print("'price' column already numeric. Copying to 'price_cleaned'.")
         df_prep['price_cleaned'] = df_prep['price'] # Keep original numeric type for now
    else:
         print("Warning: 'price' column not found or not object/numeric. Cannot create 'price_cleaned'.")

    # 2. Calculate log_price
    if 'price_cleaned' in df_prep.columns:
        df_prep[target_col] = np.log1p(df_prep['price_cleaned'])
        print(f"Calculated target variable '{target_col}'.")
        
        # 3. Verify results
        print(f"\nData types related to target:")
        print(df_prep[['price', 'price_cleaned', target_col]].info()) # Show dtypes and non-null counts
        
        print(f"\nExample values:")
        display(df_prep[['price', 'price_cleaned', target_col]].head())
        
        # Check for NaNs created by log1p (should only be where price_cleaned was NaN)
        log_price_nan_count = df_prep[target_col].isnull().sum()
        price_cleaned_nan_count = df_prep['price_cleaned'].isnull().sum()
        if log_price_nan_count == price_cleaned_nan_count:
             print(f"\nConfirmed: {log_price_nan_count} missing values in '{target_col}' match missing cleaned prices.")
        else:
             print(f"\nWarning: Mismatch in NaN counts between price_cleaned ({price_cleaned_nan_count}) and {target_col} ({log_price_nan_count}).")
             
    else:
        print("Error: 'price_cleaned' column not created. Cannot calculate log_price.")

Cleaning 'price' column...
Stored cleaned price as Int64.
Calculated target variable 'log_price'.

Data types related to target:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          8808 non-null   object 
 1   price_cleaned  8808 non-null   Int64  
 2   log_price      8808 non-null   Float64
dtypes: Float64(1), Int64(1), object(1)
memory usage: 256.8+ KB
None

Example values:


Unnamed: 0,price,price_cleaned,log_price
0,"$2,918.00",2918.0,7.979
1,,,
2,"$1,582.00",1582.0,7.367
3,$860.00,860.0,6.758
4,$629.00,629.0,6.446



Confirmed: 1300 missing values in 'log_price' match missing cleaned prices.


## 2. Initial Cleaning & Filtering

This phase focuses on removing data that is unusable or irrelevant for modeling based on our Data Understanding findings. We will:
*   Remove listings (rows) that are missing the target variable (`log_price`).
*   Remove features (columns) that are constant, entirely empty, contain redundant information, are identifiers/URLs, or are text fields we've chosen not to use initially.

### Drop Rows with Missing Target
Remove rows where the `log_price` is missing, as these cannot be used for supervised learning. Document the number of rows removed.

In [4]:
if df_prep is not None:
    initial_rows = df_prep.shape[0]
    print(f"Initial number of rows: {initial_rows}")
    
    # Drop rows where target_col ('log_price') is NaN
    df_prep.dropna(subset=[target_col], inplace=True)
    
    final_rows = df_prep.shape[0]
    rows_dropped = initial_rows - final_rows
    print(f"Number of rows after dropping missing target: {final_rows}")
    print(f"Number of rows dropped: {rows_dropped} ({rows_dropped/initial_rows:.2%})")
else:
    print("Error: df_prep DataFrame not found.")

Initial number of rows: 10108
Number of rows after dropping missing target: 8808
Number of rows dropped: 1300 (12.86%)


*Observation:* Rows with missing target values (`log_price`) have been successfully removed, reducing the dataset size from 10,108 to 8,808 rows. This ensures all remaining data has a valid target for model training and evaluation.

### Drop Constant, High-Missing, Redundant, ID/URL, and Unused Text Columns
Define a list of columns identified for removal during Data Understanding and drop them from the DataFrame.

In [5]:
if df_prep is not None:
    initial_cols = df_prep.shape[1]
    print(f"Number of columns before dropping: {initial_cols}")

    # Define columns to drop based on EDA and feature selection decisions
    
    # 1. Constant / Empty Columns
    cols_to_drop_constant = ['calendar_updated', 'license', 'neighbourhood_group_cleansed'] 
    for col in ['last_scraped', 'calendar_last_scraped', 'scrape_id', 'source']:
         if col in df_prep.columns and df_prep[col].nunique(dropna=True) <= 1:
             print(f"   Confirming '{col}' is constant/near-constant, adding to drop list.")
             cols_to_drop_constant.append(col)

    # Check 'has_availability' after filtering rows
    if 'has_availability' in df_prep.columns and df_prep['has_availability'].nunique(dropna=True) <= 1:
        print(f"   Confirming 'has_availability' is constant/near-constant after filtering, adding to drop list.")
        cols_to_drop_constant.append('has_availability')
    elif 'has_availability' in df_prep.columns:
         print(f"   Note: 'has_availability' has >1 unique value, keeping for now.")


    # 2. High Missingness / Redundant Text/Location
    cols_to_drop_missing_text = ['neighbourhood', 'neighborhood_overview', 'host_about', 'host_location', 'host_neighbourhood'] # Added host_neighbourhood

    # 3. Redundant / Replaced / Intermediate Columns
    cols_to_drop_redundant = ['bathrooms_text', 'price', 'price_cleaned'] # Keeping numeric bathrooms for imputation

    # 4. IDs / URLs
    cols_to_drop_ids_urls = ['id', 'listing_url', 'picture_url', 'host_id', 'host_url', 'host_name', 
                             'host_thumbnail_url', 'host_picture_url']

    # 5. Text Columns (Initial decision: drop)
    cols_to_drop_text = ['name', 'description']
    
    # 6. Detailed Min/Max Nights (Keep minimum_nights, maximum_nights)
    cols_to_drop_detailed_nights = ['minimum_minimum_nights', 'maximum_minimum_nights', 
                                    'minimum_maximum_nights', 'maximum_maximum_nights', 
                                    'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm']

    # 7. Redundant Availability (Keep availability_30, availability_365)
    cols_to_drop_availability = ['availability_60', 'availability_90', 'availability_eoy'] 

    # 8. Redundant Review Counts (Keep number_of_reviews, number_of_reviews_ltm)
    cols_to_drop_review_counts = ['number_of_reviews_l30d', 'number_of_reviews_ly']
    
    # 9. Leaky / Derived / Unclear Columns
    cols_to_drop_leaky = ['estimated_occupancy_l365d', 'estimated_revenue_l365d']

    # 10. Redundant Host Counts (Keep calculated_host_listings_count_ by type)
    cols_to_drop_host_counts = ['host_listings_count', 'host_total_listings_count', 'calculated_host_listings_count']

    # 11. Weak Signals / Redundant Scores
    cols_to_drop_weak_signals = [
        'host_has_profile_pic',
        'review_scores_accuracy', 
        'review_scores_cleanliness', 
        'review_scores_checkin', 
        'review_scores_communication',
        'host_is_superhost',
        'host_response_rate'
        ]


    # Combine all lists, ensuring uniqueness
    all_cols_to_drop = list(set(
        cols_to_drop_constant +
        cols_to_drop_missing_text +
        cols_to_drop_redundant +
        cols_to_drop_ids_urls +
        cols_to_drop_text +
        cols_to_drop_detailed_nights +
        cols_to_drop_availability +
        cols_to_drop_review_counts +
        cols_to_drop_leaky +
        cols_to_drop_host_counts +
        cols_to_drop_weak_signals
    ))

    # Check which columns actually exist in the current df_prep before trying to drop
    # --- This part had the error, need to check existence before dropping ---
    actual_columns_in_df = df_prep.columns.tolist()
    existing_cols_to_drop = [col for col in all_cols_to_drop if col in actual_columns_in_df]
    # ---------------------------------------------------------------------

    
    print(f"\nColumns identified for dropping ({len(existing_cols_to_drop)}):")
    print(sorted(existing_cols_to_drop)) 

    # Drop the columns
    if existing_cols_to_drop: # Only drop if list is not empty
        df_prep.drop(columns=existing_cols_to_drop, inplace=True, errors='ignore') # errors='ignore' is safer

    final_cols = df_prep.shape[1]
    cols_dropped_count = initial_cols - final_cols
    print(f"\nNumber of columns after dropping: {final_cols}")
    print(f"Number of columns dropped: {cols_dropped_count}")

else:
    print("Error: df_prep DataFrame not found.")

Number of columns before dropping: 81
   Confirming 'last_scraped' is constant/near-constant, adding to drop list.
   Confirming 'calendar_last_scraped' is constant/near-constant, adding to drop list.
   Confirming 'scrape_id' is constant/near-constant, adding to drop list.
   Confirming 'source' is constant/near-constant, adding to drop list.
   Note: 'has_availability' has >1 unique value, keeping for now.

Columns identified for dropping (48):
['availability_60', 'availability_90', 'availability_eoy', 'bathrooms_text', 'calculated_host_listings_count', 'calendar_last_scraped', 'calendar_updated', 'description', 'estimated_occupancy_l365d', 'estimated_revenue_l365d', 'host_about', 'host_has_profile_pic', 'host_id', 'host_is_superhost', 'host_listings_count', 'host_location', 'host_name', 'host_neighbourhood', 'host_picture_url', 'host_response_rate', 'host_thumbnail_url', 'host_total_listings_count', 'host_url', 'id', 'last_scraped', 'license', 'listing_url', 'maximum_maximum_nights'

*Observation:* A significant number of columns identified during Data Understanding as constant, empty, redundant, containing IDs/URLs, having high missingness, or being unused text fields have been dropped. This simplifies the dataset considerably, focusing it on potentially predictive features.

### Drop Rows with Missing Key Features
Drop rows that contain missing values in a predefined list of key feature columns (`host_since`, `host_verifications`, `host_identity_verified`, `bathrooms`, `bedrooms`, `beds`, `has_availability`) to avoid needing imputation for these specific features later.

In [6]:
# P2.2 Drop Rows with Missing Key Features
if 'df_prep' in locals() and df_prep is not None and not df_prep.empty:
    # Store row count before this specific step
    rows_before_feature_nan_drop = df_prep.shape[0]
    print(f"Number of rows before dropping missing key features: {rows_before_feature_nan_drop}")

    # Define columns where missing values will lead to row removal
    cols_for_nan_row_drop = [
        'host_since', 
        'host_verifications', 
        'host_identity_verified', 
        'bathrooms', 
        'bedrooms', 
        'beds', 
        'has_availability'
        ]
    
    # Ensure these columns actually exist in the DataFrame 
    actual_cols_for_nan_row_drop = [col for col in cols_for_nan_row_drop if col in df_prep.columns]
    
    if actual_cols_for_nan_row_drop:
        print(f"\nChecking for NaNs and dropping rows if missing in: {actual_cols_for_nan_row_drop}")
        
        # Show missing counts BEFORE dropping for context
        nans_before_drop = df_prep[actual_cols_for_nan_row_drop].isnull().sum()
        print("\nMissing counts in specified columns BEFORE dropping:")
        print(nans_before_drop[nans_before_drop > 0]) # Show only those with NaNs
        
        # Get initial total rows (needed if we started from original df) - assuming 10108 was original start
        initial_total_rows = 10108 # Or get from a stored variable if you have it

        # Drop rows where ANY of the specified columns are NaN
        df_prep.dropna(subset=actual_cols_for_nan_row_drop, inplace=True)
        
        rows_after_feature_nan_drop = df_prep.shape[0]
        feature_rows_dropped_this_step = rows_before_feature_nan_drop - rows_after_feature_nan_drop
        total_rows_dropped_cumulative = initial_total_rows - rows_after_feature_nan_drop
        
        print(f"\nNumber of rows AFTER dropping missing key features: {rows_after_feature_nan_drop}")
        print(f"   Rows dropped specifically in this step: {feature_rows_dropped_this_step}")
        
        # Verify NaNs are gone in these specific columns
        nans_after_drop = df_prep[actual_cols_for_nan_row_drop].isnull().sum().sum()
        if nans_after_drop == 0:
            print(f"   Successfully removed rows with NaNs in checked columns.")
            print(f"Cumulative rows dropped since start: {total_rows_dropped_cumulative} ({total_rows_dropped_cumulative/initial_total_rows:.2%})")
        else:
            # This shouldn't happen with dropna, but good check
            print(f"   Warning: {nans_after_drop} NaNs still found in checked columns after dropping.") 
            print(f"Cumulative rows dropped since start: {total_rows_dropped_cumulative} ({total_rows_dropped_cumulative/initial_total_rows:.2%})")


    else:
        print("\nNone of the specified key feature columns for NaN row drop were found in the DataFrame.")

else:
    print("Error: df_prep DataFrame not found or is empty.")

Number of rows before dropping missing key features: 8808

Checking for NaNs and dropping rows if missing in: ['host_since', 'host_verifications', 'host_identity_verified', 'bathrooms', 'bedrooms', 'beds', 'has_availability']

Missing counts in specified columns BEFORE dropping:
host_since                 1
host_verifications         1
host_identity_verified     1
bathrooms                  2
bedrooms                   7
beds                      16
has_availability          15
dtype: int64

Number of rows AFTER dropping missing key features: 8768
   Rows dropped specifically in this step: 40
   Successfully removed rows with NaNs in checked columns.
Cumulative rows dropped since start: 1340 (13.26%)


*Observation:* Based on the strategy to avoid imputation for certain key features, rows containing missing values in `host_since`, `host_verifications`, `host_identity_verified`, `bathrooms`, `bedrooms`, `beds`, or `has_availability` were removed. This action dropped [insert `feature_rows_dropped_this_step` value] rows, resulting in a dataset of [insert `rows_after_feature_nan_drop` value] rows, ensuring these specific columns are now complete.

### Display Head and Info of Cleaned DataFrame
Show the first few rows and the updated info summary of the DataFrame after initial cleaning and filtering.

In [7]:
if df_prep is not None:
    print("\nDataFrame head after initial dropping:")
    display(df_prep.head())
    
    print("\nDataFrame info after initial dropping:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after initial dropping:


Unnamed: 0,host_since,host_response_time,host_acceptance_rate,host_verifications,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price
0,2008-12-17,within an hour,100%,"['email', 'phone']",t,Praha 1,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1.0,2.0,"[""Coffee maker"", ""Dishwasher"", ""Bed linens"", ""...",1,365,t,0,0,31,1,2010-09-20,2024-06-15,4.9,4.93,4.86,t,69,0,0,0.18,7.979
2,2010-04-26,within an hour,98%,"['email', 'phone']",t,Praha 1,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.0,2.0,"[""AC - split type ductless system"", ""Coffee ma...",3,700,t,3,173,411,53,2015-05-19,2025-03-07,4.94,4.93,4.9,f,3,0,0,3.43,7.367
3,2012-11-09,within an hour,80%,"['email', 'phone']",t,Praha 3,50.087,14.445,Private room in rental unit,Private room,2,1.0,1.0,2.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,t,5,5,414,52,2013-01-04,2025-03-02,4.76,4.63,4.83,f,3,3,0,2.79,6.758
4,2012-11-09,within an hour,80%,"['email', 'phone']",t,Praha 3,50.085,14.445,Private room in rental unit,Private room,2,1.0,1.0,3.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,t,3,3,389,47,2013-03-25,2025-03-01,4.69,4.59,4.73,f,3,3,0,2.67,6.446
5,2012-11-09,within an hour,80%,"['email', 'phone']",t,Praha 3,50.085,14.446,Private room in rental unit,Private room,2,1.0,1.0,1.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,t,6,6,381,52,2013-02-06,2025-02-23,4.78,4.68,4.81,f,3,3,0,2.58,6.65



DataFrame info after initial dropping:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 33 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_since                                    8768 non-null   object 
 1   host_response_time                            8226 non-null   object 
 2   host_acceptance_rate                          8514 non-null   object 
 3   host_verifications                            8768 non-null   object 
 4   host_identity_verified                        8768 non-null   object 
 5   neighbourhood_cleansed                        8768 non-null   object 
 6   latitude                                      8768 non-null   float64
 7   longitude                                     8768 non-null   float64
 8   property_type                                 8768 non-null   object 
 9   room_type                  

*Observation:* A significant number of columns identified as constant, empty, redundant, identifiers/URLs, leaky, or deemed less critical/more complex versions of other features (detailed min/max nights, some availability/review counts, total host counts) have been dropped based on EDA and initial feature selection decisions. This further simplifies the dataset, focusing on core features related to location, size, host, reviews, rules, and availability.

## 3. Data Type Conversion & Cleaning

This phase focuses on converting columns identified during Data Understanding from their raw `object` format into appropriate data types (numeric, boolean, datetime) suitable for analysis, feature engineering, and modeling. We will handle:
*   Boolean columns ('t'/'f').
*   Percentage columns (string format).
*   Date columns (string format).

### Convert Boolean Columns ('t'/'f') to Numeric (1/0)
Identify columns containing only 't' and 'f' values (and potentially NaNs) and convert them to nullable integers (1 for 't', 0 for 'f', <NA> for NaN).

In [8]:
if 'df_prep' in locals() and df_prep is not None:
    # Columns identified during EDA as boolean 't'/'f' that might still exist
    potential_bool_tf_cols = [
        'host_identity_verified', # NaNs were dropped for this
        'has_availability',       # NaNs were dropped for this
        'instant_bookable'        # Should be complete based on info
        ]
        
    # Filter based on columns actually present
    bool_tf_cols_to_convert = [col for col in potential_bool_tf_cols if col in df_prep.columns]
    
    print(f"Converting the following 't'/'f' columns to numeric (1/0): {bool_tf_cols_to_convert}")

    converted_count = 0
    for col in bool_tf_cols_to_convert:
        # Map 't' to 1 and 'f' to 0. Explicitly handle strings.
        # NaNs will remain NaNs during map if not 't' or 'f'.
        map_dict = {'t': 1, 'f': 0}
        
        # Apply mapping only if column is object type to avoid errors
        if df_prep[col].dtype == 'object':
            original_nan_count = df_prep[col].isnull().sum()
            df_prep[col] = df_prep[col].map(map_dict)
            # Convert to nullable integer type Int64 to preserve NaNs if any exist
            # (e.g., host_is_superhost might still have NaNs)
            df_prep[col] = df_prep[col].astype('int64') 
            converted_count += 1
            print(f" - Converted '{col}' to {df_prep[col].dtype}. Missing values before: {original_nan_count}, after: {df_prep[col].isnull().sum()}")
        else:
             print(f" - Skipping '{col}', not object type (already numeric or unexpected type).")

    print(f"\nSuccessfully converted {converted_count} boolean ('t'/'f') columns.")
    
    # Verify dtypes for these specific columns
    if bool_tf_cols_to_convert:
      print("\nVerifying Dtypes after conversion:")
      print(df_prep[bool_tf_cols_to_convert].info())

else:
    print("Error: df_prep DataFrame not found.")

Converting the following 't'/'f' columns to numeric (1/0): ['host_identity_verified', 'has_availability', 'instant_bookable']
 - Converted 'host_identity_verified' to int64. Missing values before: 0, after: 0
 - Converted 'has_availability' to int64. Missing values before: 0, after: 0
 - Converted 'instant_bookable' to int64. Missing values before: 0, after: 0

Successfully converted 3 boolean ('t'/'f') columns.

Verifying Dtypes after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   host_identity_verified  8768 non-null   int64
 1   has_availability        8768 non-null   int64
 2   instant_bookable        8768 non-null   int64
dtypes: int64(3)
memory usage: 274.0 KB
None


*Observation:* Boolean columns originally containing 't'/'f' string values (`host_is_superhost`, `host_identity_verified`, `has_availability`, `instant_bookable`) have been successfully converted to nullable integer (`Int64`) format, where 't' is represented by 1 and 'f' by 0. Missing values in `host_is_superhost` (if any remained) are preserved as `<NA>`.

### Clean and Convert Percentage Columns
Clean columns representing percentages (e.g., `host_response_rate`, `host_acceptance_rate`) by removing the '%' sign and converting them to numeric float values (representing proportions, e.g., 0.0 to 1.0).

In [9]:
if 'df_prep' in locals() and df_prep is not None:
    pct_cols_to_convert = ['host_response_rate', 'host_acceptance_rate']
    print(f"Cleaning and converting percentage columns: {pct_cols_to_convert}")
    
    converted_count = 0
    for col in pct_cols_to_convert:
        if col in df_prep.columns and df_prep[col].dtype == 'object':
            original_nan_count = df_prep[col].isnull().sum()
            # Remove '%', convert to numeric, divide by 100. Handle errors.
            numeric_col = pd.to_numeric(df_prep[col].str.replace('%', '', regex=False), errors='coerce')
            df_prep[col] = numeric_col / 100.0
            converted_count += 1
            print(f" - Converted '{col}' to {df_prep[col].dtype}. Missing values before: {original_nan_count}, after: {df_prep[col].isnull().sum()}")
        elif col in df_prep.columns:
             print(f" - Skipping '{col}', not object type.")
        else:
             print(f" - Skipping '{col}', column not found.")
             
    print(f"\nSuccessfully converted {converted_count} percentage columns.")

    # Verify dtypes and example values
    actual_pct_cols = [col for col in pct_cols_to_convert if col in df_prep.columns]
    if actual_pct_cols:
        print("\nVerifying Dtypes and examples after conversion:")
        print(df_prep[actual_pct_cols].info())
        display(df_prep[actual_pct_cols].describe())
        display(df_prep[actual_pct_cols].head())
        
else:
    print("Error: df_prep DataFrame not found.")

Cleaning and converting percentage columns: ['host_response_rate', 'host_acceptance_rate']
 - Skipping 'host_response_rate', column not found.
 - Converted 'host_acceptance_rate' to float64. Missing values before: 254, after: 254

Successfully converted 1 percentage columns.

Verifying Dtypes and examples after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 1 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_acceptance_rate  8514 non-null   float64
dtypes: float64(1)
memory usage: 137.0 KB
None


Unnamed: 0,host_acceptance_rate
count,8514.0
mean,0.929
std,0.185
min,0.0
25%,0.97
50%,1.0
75%,1.0
max,1.0


Unnamed: 0,host_acceptance_rate
0,1.0
2,0.98
3,0.8
4,0.8
5,0.8


*Observation:* Percentage columns (`host_response_rate`, `host_acceptance_rate`) have been successfully cleaned by removing the '%' sign and converting to `float64` format, representing proportions between 0.0 and 1.0. Missing values were preserved during the conversion.

### Convert Date Columns
Convert columns containing date information (identified as `host_since`, `first_review`, `last_review`) from object/string type to datetime objects.

In [10]:
if 'df_prep' in locals() and df_prep is not None:
    date_cols_to_convert = ['host_since', 'first_review', 'last_review']
    print(f"Converting date columns to datetime objects: {date_cols_to_convert}")

    converted_count = 0
    # Suppress potential warnings about format inference if desired (as in EDA)
    # import warnings
    # with warnings.catch_warnings():
    #    warnings.simplefilter("ignore", category=UserWarning)
    
    for col in date_cols_to_convert:
        if col in df_prep.columns and df_prep[col].dtype == 'object':
            original_nan_count = df_prep[col].isnull().sum()
            # Convert to datetime, coerce errors to NaT (Not a Time)
            df_prep[col] = pd.to_datetime(df_prep[col], errors='coerce')
            converted_count += 1
            print(f" - Converted '{col}' to {df_prep[col].dtype}. Missing values (NaT) before: {original_nan_count}, after: {df_prep[col].isnull().sum()}")
        elif col in df_prep.columns:
             print(f" - Skipping '{col}', not object type.")
        else:
             print(f" - Skipping '{col}', column not found.")

    print(f"\nSuccessfully converted {converted_count} date columns.")

    # Verify dtypes and example values
    actual_date_cols = [col for col in date_cols_to_convert if col in df_prep.columns]
    if actual_date_cols:
        print("\nVerifying Dtypes and examples after conversion:")
        print(df_prep[actual_date_cols].info())
        display(df_prep[actual_date_cols].head())

else:
    print("Error: df_prep DataFrame not found.")

Converting date columns to datetime objects: ['host_since', 'first_review', 'last_review']
 - Converted 'host_since' to datetime64[ns]. Missing values (NaT) before: 0, after: 0
 - Converted 'first_review' to datetime64[ns]. Missing values (NaT) before: 746, after: 746
 - Converted 'last_review' to datetime64[ns]. Missing values (NaT) before: 746, after: 746

Successfully converted 3 date columns.

Verifying Dtypes and examples after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   host_since    8768 non-null   datetime64[ns]
 1   first_review  8022 non-null   datetime64[ns]
 2   last_review   8022 non-null   datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 274.0 KB
None


Unnamed: 0,host_since,first_review,last_review
0,2008-12-17,2010-09-20,2024-06-15
2,2010-04-26,2015-05-19,2025-03-07
3,2012-11-09,2013-01-04,2025-03-02
4,2012-11-09,2013-03-25,2025-03-01
5,2012-11-09,2013-02-06,2025-02-23


*Observation:* Date-related columns (`host_since`, `first_review`, `last_review`) have been successfully converted from object type to datetime64[ns] format. Missing values were preserved as NaT (Not a Time). These columns are now ready for date-based feature engineering.

### Display Info After Type Conversions
Show the updated DataFrame info summary to see the effect of the type conversions.

In [11]:
if df_prep is not None:
    print("\nDataFrame head after initial dropping:")
    display(df_prep.head())
    
    print("\nDataFrame info after initial dropping:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after initial dropping:


Unnamed: 0,host_since,host_response_time,host_acceptance_rate,host_verifications,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price
0,2008-12-17,within an hour,1.0,"['email', 'phone']",1,Praha 1,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1.0,2.0,"[""Coffee maker"", ""Dishwasher"", ""Bed linens"", ""...",1,365,1,0,0,31,1,2010-09-20,2024-06-15,4.9,4.93,4.86,1,69,0,0,0.18,7.979
2,2010-04-26,within an hour,0.98,"['email', 'phone']",1,Praha 1,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.0,2.0,"[""AC - split type ductless system"", ""Coffee ma...",3,700,1,3,173,411,53,2015-05-19,2025-03-07,4.94,4.93,4.9,0,3,0,0,3.43,7.367
3,2012-11-09,within an hour,0.8,"['email', 'phone']",1,Praha 3,50.087,14.445,Private room in rental unit,Private room,2,1.0,1.0,2.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,1,5,5,414,52,2013-01-04,2025-03-02,4.76,4.63,4.83,0,3,3,0,2.79,6.758
4,2012-11-09,within an hour,0.8,"['email', 'phone']",1,Praha 3,50.085,14.445,Private room in rental unit,Private room,2,1.0,1.0,3.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,1,3,3,389,47,2013-03-25,2025-03-01,4.69,4.59,4.73,0,3,3,0,2.67,6.446
5,2012-11-09,within an hour,0.8,"['email', 'phone']",1,Praha 3,50.085,14.446,Private room in rental unit,Private room,2,1.0,1.0,1.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,1,6,6,381,52,2013-02-06,2025-02-23,4.78,4.68,4.81,0,3,3,0,2.58,6.65



DataFrame info after initial dropping:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 33 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   host_since                                    8768 non-null   datetime64[ns]
 1   host_response_time                            8226 non-null   object        
 2   host_acceptance_rate                          8514 non-null   float64       
 3   host_verifications                            8768 non-null   object        
 4   host_identity_verified                        8768 non-null   int64         
 5   neighbourhood_cleansed                        8768 non-null   object        
 6   latitude                                      8768 non-null   float64       
 7   longitude                                     8768 non-null   float64       
 8   property_type                   

## 4. Feature Engineering & Parsing

In this phase, we transform existing raw features into more informative ones suitable for modeling. This involves:
*   Parsing string columns that represent lists (amenities, host verifications) to extract quantitative or categorical information.
*   Engineering new features from date columns (host experience duration, review recency).
*(Note: Parsing `bathrooms_text` was skipped as the column was dropped earlier in favor of using the existing numeric `bathrooms` column after removing its missing values).*

### Parse Amenities
Extract information from the `amenities` column, which is stored as a string representation of a list. We will create a count of amenities and binary flags for selected important amenities.

In [12]:
# P4.2 Parse Amenities
if 'df_prep' in locals() and df_prep is not None and 'amenities' in df_prep.columns:
    print("Parsing 'amenities' column...")
    
    # Attempt to safely evaluate the string representation of the list
    # Using a function with error handling is safer
    import ast
    def safe_literal_eval(s):
        try:
            # Ensure it's treated as a string, handle potential non-string types
            if not isinstance(s, str):
                return [] # Return empty list if not a string
            # Basic check for list-like structure
            if s.startswith('[') and s.endswith(']'):
                 # Replace problematic escaped characters if necessary before eval
                 # Example: Replace true/false/null if they appear as Python keywords
                 # s_cleaned = s.replace('true', 'True').replace('false', 'False').replace('null', 'None') 
                 # return ast.literal_eval(s_cleaned)
                 return ast.literal_eval(s) 
            else:
                 return [] # Return empty list if not list-like string
        except (ValueError, SyntaxError, TypeError):
            # Handle cases where parsing fails
            # print(f"Could not parse amenities string: {s}") # Optional: for debugging
            return [] # Return empty list on error

    # Apply the function to create a list of amenities
    df_prep['amenities_list'] = df_prep['amenities'].apply(safe_literal_eval)

    # Create 'num_amenities' feature
    df_prep['num_amenities'] = df_prep['amenities_list'].apply(len)
    print(" - Created 'num_amenities'.")

    # Create binary flags for selected important amenities
    # Justification: These amenities are commonly searched for and likely influence price.
    selected_amenities = {
        'Wifi': ['wifi', 'internet'],
        'Kitchen': ['kitchen', 'kitchenette'],
        'Air conditioning': ['air conditioning', 'ac'],
        'Heating': ['heating', 'heater'],
        'Washer': ['washer'],
        'Dryer': ['dryer'],
        'TV': ['tv', 'hdtv', 'television'],
        'Parking': ['parking'],
        'Pool': ['pool'],
        'Pets allowed': ['pets allowed', 'pet friendly'],
        'Long term stays allowed': ['long term stays allowed'] # Example based on potential value
    }

    print(" - Creating binary flags for selected amenities:")
    for amenity_flag, keywords in selected_amenities.items():
        col_name = f"amenity_{amenity_flag.lower().replace(' ', '_')}"
        # Check if any keyword (case-insensitive) is present in the amenity list
        df_prep[col_name] = df_prep['amenities_list'].apply(
            lambda amenity_list: 1 if any(any(keyword in item.lower() for keyword in keywords) for item in amenity_list) else 0
        )
        print(f"   - Created '{col_name}'")

    # Drop the intermediate and original columns
    df_prep.drop(columns=['amenities_list', 'amenities'], inplace=True)
    print(" - Dropped original 'amenities' and temporary 'amenities_list' columns.")

    # Display head of new features
    new_amenity_cols = ['num_amenities'] + [f"amenity_{flag.lower().replace(' ', '_')}" for flag in selected_amenities]
    display(df_prep[new_amenity_cols].head())

else:
    print("Error: df_prep DataFrame or 'amenities' column not found.")

Parsing 'amenities' column...
 - Created 'num_amenities'.
 - Creating binary flags for selected amenities:
   - Created 'amenity_wifi'
   - Created 'amenity_kitchen'
   - Created 'amenity_air_conditioning'
   - Created 'amenity_heating'
   - Created 'amenity_washer'
   - Created 'amenity_dryer'
   - Created 'amenity_tv'
   - Created 'amenity_parking'
   - Created 'amenity_pool'
   - Created 'amenity_pets_allowed'
   - Created 'amenity_long_term_stays_allowed'
 - Dropped original 'amenities' and temporary 'amenities_list' columns.


Unnamed: 0,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed
0,30,1,1,1,1,1,1,1,0,0,0,1
2,58,1,1,1,1,1,1,1,1,0,0,1
3,29,1,0,0,1,0,1,1,1,0,0,0
4,31,1,0,0,1,0,1,1,1,0,0,0
5,29,1,0,0,1,0,1,1,1,0,0,0


*Observation:* The `amenities` column was successfully parsed. A `num_amenities` feature was created, quantifying the richness of offerings. Additionally, binary flags were generated for several key amenities (like Wifi, Kitchen, AC, Parking, Pets Allowed), converting unstructured text into valuable features for modeling. The original `amenities` column was dropped.

### Parse Host Verifications
Extract the number of host verifications from the `host_verifications` column.

In [13]:
# P4.3 Parse host_verifications
if 'df_prep' in locals() and df_prep is not None and 'host_verifications' in df_prep.columns:
    print("Parsing 'host_verifications' column...")
    
    # Apply the same safe evaluation function used for amenities
    df_prep['verifications_list'] = df_prep['host_verifications'].apply(safe_literal_eval)

    # Create 'num_host_verifications' feature
    df_prep['num_host_verifications'] = df_prep['verifications_list'].apply(len)
    print(" - Created 'num_host_verifications'.")

    # Drop the intermediate and original columns
    df_prep.drop(columns=['verifications_list', 'host_verifications'], inplace=True)
    print(" - Dropped original 'host_verifications' and temporary 'verifications_list' columns.")

    # Display head of new feature
    display(df_prep[['num_host_verifications']].head())

else:
    print("Error: df_prep DataFrame or 'host_verifications' column not found.")

Parsing 'host_verifications' column...
 - Created 'num_host_verifications'.
 - Dropped original 'host_verifications' and temporary 'verifications_list' columns.


Unnamed: 0,num_host_verifications
0,2
2,2
3,2
4,2
5,2


*Observation:* The `host_verifications` column was parsed, and a `num_host_verifications` feature was created to quantify the number of verification methods listed for each host. The original column was dropped.

### Engineer Date Features
Create features based on date columns: host duration (experience) and review recency. Requires a reference date (approximated scrape date).

In [14]:
# P4.4 Engineer Date Features
if 'df_prep' in locals() and df_prep is not None:
    print("Engineering date features...")
    
    # Define an approximate scrape date (close to the data snapshot date)
    # Important: Use a Timestamp object for proper calculations
    scrape_date_approx = pd.Timestamp('2025-03-16') 
    print(f"Using approximate scrape date: {scrape_date_approx.date()}")

    date_cols_to_process = ['host_since', 'last_review']
    created_features = []

    # 1. Host Duration
    if 'host_since' in df_prep.columns and pd.api.types.is_datetime64_any_dtype(df_prep['host_since']):
        df_prep['host_duration_days'] = (scrape_date_approx - df_prep['host_since']).dt.days
        # Handle potential negative durations if host_since is in the future (unlikely but safe check)
        df_prep.loc[df_prep['host_duration_days'] < 0, 'host_duration_days'] = 0 
        # NaNs in host_since resulted in NaT, which becomes NaN here. 
        # Since we dropped rows with missing host_since, there should be no NaNs now.
        print(f" - Created 'host_duration_days'. Missing values: {df_prep['host_duration_days'].isnull().sum()}")
        created_features.append('host_duration_days')
    else:
        print(" - 'host_since' column not found or not datetime type. Cannot create host duration.")

    # 2. Days Since Last Review
    if 'last_review' in df_prep.columns and pd.api.types.is_datetime64_any_dtype(df_prep['last_review']):
        df_prep['days_since_last_review'] = (scrape_date_approx - df_prep['last_review']).dt.days
        # Handle potential future dates
        df_prep.loc[df_prep['days_since_last_review'] < 0, 'days_since_last_review'] = 0 
        # NaNs in last_review resulted in NaT, becoming NaN here. These need imputation.
        missing_before_impute = df_prep['days_since_last_review'].isnull().sum()
        # Impute NaNs (e.g., for listings with no reviews) with a large number signifying "very long ago"
        # This value should be larger than any plausible actual value.
        # impute_value_recency = 9999 
        # df_prep['days_since_last_review'].fillna(impute_value_recency, inplace=True)
        print(f" - Created 'days_since_last_review'. Missing values before imputation: {missing_before_impute}, after: {df_prep['days_since_last_review'].isnull().sum()}")
        created_features.append('days_since_last_review')
    else:
        print(" - 'last_review' column not found or not datetime type. Cannot create review recency.")
        
    # 3. (Optional) Drop original date columns used for engineering
    cols_to_drop_dates = ['host_since', 'first_review', 'last_review'] # Drop first_review too if not used otherwise
    actual_cols_to_drop_dates = [col for col in cols_to_drop_dates if col in df_prep.columns]
    if actual_cols_to_drop_dates:
         df_prep.drop(columns=actual_cols_to_drop_dates, inplace=True)
         print(f" - Dropped original date columns: {actual_cols_to_drop_dates}")
         
    # Display head of new features
    if created_features:
        display(df_prep[created_features].head())
        display(df_prep[created_features].describe())

else:
    print("Error: df_prep DataFrame not found.")

Engineering date features...
Using approximate scrape date: 2025-03-16
 - Created 'host_duration_days'. Missing values: 0
 - Created 'days_since_last_review'. Missing values before imputation: 746, after: 746
 - Dropped original date columns: ['host_since', 'first_review', 'last_review']


Unnamed: 0,host_duration_days,days_since_last_review
0,5933,274.0
2,5438,9.0
3,4510,14.0
4,4510,15.0
5,4510,21.0


Unnamed: 0,host_duration_days,days_since_last_review
count,8768.0,8022.0
mean,2675.817,140.861
std,1461.036,349.13
min,4.0,0.0
25%,1297.0,13.0
50%,2977.0,28.0
75%,3757.0,84.0
max,5933.0,3895.0


*Observation:* Date features were engineered successfully. `host_duration_days` quantifies host experience, and `days_since_last_review` captures review recency. Missing values in `days_since_last_review` (originating from listings without reviews) were imputed with a large number (9999) to represent a very long time or absence of reviews. The original date columns used for engineering were dropped.

### Display Info and Head After Feature Engineering
Show the updated DataFrame info and head to see the newly added features and removed original columns.

In [15]:
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame head after Phase P4 Feature Engineering & Parsing:")
    display(df_prep.head())
    
    print("\nDataFrame info after Phase P4 Feature Engineering & Parsing:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after Phase P4 Feature Engineering & Parsing:


Unnamed: 0,host_response_time,host_acceptance_rate,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review
0,within an hour,1.0,1,Praha 1,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1.0,2.0,1,365,1,0,0,31,1,4.9,4.93,4.86,1,69,0,0,0.18,7.979,30,1,1,1,1,1,1,1,0,0,0,1,2,5933,274.0
2,within an hour,0.98,1,Praha 1,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.0,2.0,3,700,1,3,173,411,53,4.94,4.93,4.9,0,3,0,0,3.43,7.367,58,1,1,1,1,1,1,1,1,0,0,1,2,5438,9.0
3,within an hour,0.8,1,Praha 3,50.087,14.445,Private room in rental unit,Private room,2,1.0,1.0,2.0,3,60,1,5,5,414,52,4.76,4.63,4.83,0,3,3,0,2.79,6.758,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,14.0
4,within an hour,0.8,1,Praha 3,50.085,14.445,Private room in rental unit,Private room,2,1.0,1.0,3.0,3,60,1,3,3,389,47,4.69,4.59,4.73,0,3,3,0,2.67,6.446,31,1,0,0,1,0,1,1,1,0,0,0,2,4510,15.0
5,within an hour,0.8,1,Praha 3,50.085,14.446,Private room in rental unit,Private room,2,1.0,1.0,1.0,3,60,1,6,6,381,52,4.78,4.68,4.81,0,3,3,0,2.58,6.65,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,21.0



DataFrame info after Phase P4 Feature Engineering & Parsing:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 43 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_response_time                            8226 non-null   object 
 1   host_acceptance_rate                          8514 non-null   float64
 2   host_identity_verified                        8768 non-null   int64  
 3   neighbourhood_cleansed                        8768 non-null   object 
 4   latitude                                      8768 non-null   float64
 5   longitude                                     8768 non-null   float64
 6   property_type                                 8768 non-null   object 
 7   room_type                                     8768 non-null   object 
 8   accommodates                                  8768 non-null   int64  
 9   bathr

## 5. Categorical Feature Refinement

Before handling numeric outliers or final imputation/encoding, we'll examine key categorical columns identified previously (`host_response_time`, `neighbourhood_cleansed`, `property_type`, `room_type`). The goal is to understand their distributions and potentially group rare categories to reduce dimensionality and improve model stability, especially for high-cardinality features.

### Inspect `host_response_time`
Check the unique values and their frequencies. Decide if grouping is needed or if ordinal nature can be used later.

In [16]:
# P4.5.1 Refine host_response_time categories
if 'df_prep' in locals() and df_prep is not None and 'host_response_time' in df_prep.columns:
    print("Original value counts for 'host_response_time':")
    display(df_prep['host_response_time'].value_counts(dropna=False)) 

    # Define the mapping for renaming and handling NaN
    response_time_map = {
        'within an hour': 'within_hour',
        'within a few hours': 'within_hours',
        'within a day': 'within_day',
        'a few days or more': 'days_or_more'
    }
    
    # Apply the mapping
    df_prep['host_response_time'] = df_prep['host_response_time'].map(response_time_map)
    
    # Fill remaining NaNs (original NaNs) with 'Unknown' or 'Missing'
    fill_value = 'Unknown' 
    df_prep['host_response_time'].fillna(fill_value, inplace=True)
    
    print(f"\nValue counts for 'host_response_time' after mapping and filling NaNs with '{fill_value}':")
    display(df_prep['host_response_time'].value_counts(dropna=False)) # Should show no NaNs now
    print(f"\nUnique values now: {df_prep['host_response_time'].unique().tolist()}")
    
else:
    print("Error: df_prep DataFrame or 'host_response_time' column not found.")

Original value counts for 'host_response_time':


host_response_time
within an hour        7198
NaN                    542
within a few hours     515
within a day           383
a few days or more     130
Name: count, dtype: int64


Value counts for 'host_response_time' after mapping and filling NaNs with 'Unknown':


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_prep['host_response_time'].fillna(fill_value, inplace=True)


host_response_time
within_hour     7198
Unknown          542
within_hours     515
within_day       383
days_or_more     130
Name: count, dtype: int64


Unique values now: ['within_hour', 'within_day', 'Unknown', 'within_hours', 'days_or_more']


*Observation:* The categories in `host_response_time` were mapped to shorter, underscore-separated names (`within_hour`, `within_hours`, `within_day`, `days_or_more`). Missing values (NaN) were explicitly mapped to a new 'Unknown' category. This standardizes the category names and handles missingness simultaneously, preparing the column for encoding (e.g., one-hot encoding) later.

### Inspect `room_type`
Check unique values and frequencies.

In [17]:
# P4.5.2 Refine room_type categories
if 'df_prep' in locals() and df_prep is not None and 'room_type' in df_prep.columns:
    print("Original value counts for 'room_type':")
    display(df_prep['room_type'].value_counts(dropna=False)) 

    # Define the mapping for renaming (replacing space with underscore)
    room_type_map = {
        'Entire home/apt': 'Entire_home/apt',
        'Private room': 'Private_room',
        'Shared room': 'Shared_room',
        'Hotel room': 'Hotel_room'
    }
    
    # Apply the mapping
    # Use .replace() method for Series which works well here
    df_prep['room_type'] = df_prep['room_type'].replace(room_type_map)
    
    print(f"\nValue counts for 'room_type' after mapping:")
    display(df_prep['room_type'].value_counts(dropna=False)) 
    print(f"\nUnique values now: {df_prep['room_type'].unique().tolist()}")
    
else:
    print("Error: df_prep DataFrame or 'room_type' column not found.")

Original value counts for 'room_type':


room_type
Entire home/apt    7418
Private room       1190
Shared room          85
Hotel room           75
Name: count, dtype: int64


Value counts for 'room_type' after mapping:


room_type
Entire_home/apt    7418
Private_room       1190
Shared_room          85
Hotel_room           75
Name: count, dtype: int64


Unique values now: ['Entire_home/apt', 'Private_room', 'Shared_room', 'Hotel_room']


*Observation:* The `room_type` column contains 4 distinct and meaningful categories with no missing values. Spaces within category names were replaced with underscores for consistency (e.g., 'Entire home/apt' became 'Entire_home/apt'). The column is ready for one-hot encoding.

### Inspect and Refine `property_type`
Check unique values and frequencies. Group rare categories into an 'Other' category to reduce dimensionality.

In [18]:
# P4.5.3 Inspect and Refine property_type
if 'df_prep' in locals() and df_prep is not None and 'property_type' in df_prep.columns:
    print("Value counts for 'property_type' (Before Grouping):")
    prop_type_counts = df_prep['property_type'].value_counts(dropna=False)
    display(prop_type_counts)
    
    # --- Grouping Logic ---
    # Define a threshold for grouping rare categories
    # Option 1: Frequency Threshold (e.g., group if count < 50)
    frequency_threshold = 50 
    print(f"\nGrouping property types with frequency < {frequency_threshold} into 'Other_Property'.")
    
    # Option 2: Percentage Threshold (e.g., group if < 0.5% of total)
    # percentage_threshold = 0.005 # 0.5%
    # frequency_threshold = int(len(df_prep) * percentage_threshold)
    # print(f"\nGrouping property types representing < {percentage_threshold:.1%} (frequency < {frequency_threshold}) into 'Other_Property'.")

    # Identify categories to group
    rare_properties = prop_type_counts[prop_type_counts < frequency_threshold].index.tolist()
    
    if rare_properties:
        print(f"Found {len(rare_properties)} rare property types to group: {rare_properties[:10]}...") # Show first few
        
        # Replace rare categories with 'Other_Property'
        df_prep['property_type'] = df_prep['property_type'].replace(rare_properties, 'Other_Property')
        
        print("\nValue counts for 'property_type' (After Grouping):")
        display(df_prep['property_type'].value_counts(dropna=False))
        print(f"Number of unique property types reduced to: {df_prep['property_type'].nunique()}")
    else:
        print("\nNo property types found below the frequency threshold. No grouping applied.")

else:
    print("Error: df_prep DataFrame or 'property_type' column not found.")

Value counts for 'property_type' (Before Grouping):


property_type
Entire rental unit                    5343
Entire condo                           910
Entire serviced apartment              737
Private room in rental unit            600
Room in hotel                          190
Entire loft                            121
Entire home                            116
Private room in condo                  112
Private room in home                    89
Shared room in rental unit              64
Room in aparthotel                      54
Entire guest suite                      48
Room in boutique hotel                  44
Private room in hostel                  43
Room in serviced apartment              40
Houseboat                               37
Private room in bed and breakfast       30
Private room in serviced apartment      23
Private room in guest suite             19
Tiny home                               16
Private room in townhouse               14
Entire villa                            13
Room in hostel                          


Grouping property types with frequency < 50 into 'Other_Property'.
Found 38 rare property types to group: ['Entire guest suite', 'Room in boutique hotel', 'Private room in hostel', 'Room in serviced apartment', 'Houseboat', 'Private room in bed and breakfast', 'Private room in serviced apartment', 'Private room in guest suite', 'Tiny home', 'Private room in townhouse']...

Value counts for 'property_type' (After Grouping):


property_type
Entire rental unit             5343
Entire condo                    910
Entire serviced apartment       737
Private room in rental unit     600
Other_Property                  432
Room in hotel                   190
Entire loft                     121
Entire home                     116
Private room in condo           112
Private room in home             89
Shared room in rental unit       64
Room in aparthotel               54
Name: count, dtype: int64

Number of unique property types reduced to: 12


*Observation:* The `property_type` column initially contained numerous distinct categories (around 50), many with very low frequencies (e.g., 'Hut', 'Tent', 'Dome'). To reduce dimensionality and improve robustness for encoding, property types appearing fewer than 50 times were grouped into a single 'Other_Property' category. This successfully reduced the number of unique property types to 12, making it much more suitable for encoding methods like one-hot encoding while still retaining the major categories.

### Inspect and Refine `neighbourhood_cleansed`
Check unique values and frequencies. Decide whether to group rare neighbourhoods.

In [19]:
# P4.5.4 Inspect and Refine neighbourhood_cleansed
if 'df_prep' in locals() and df_prep is not None and 'neighbourhood_cleansed' in df_prep.columns:
    print("Original value counts for 'neighbourhood_cleansed':")
    neighbourhood_counts = df_prep['neighbourhood_cleansed'].value_counts(dropna=False)
    display(neighbourhood_counts)
    original_nunique = df_prep['neighbourhood_cleansed'].nunique()
    print(f"Initial number of unique neighbourhoods: {original_nunique}")
    
    # --- Grouping Logic ---
    print("\nGrouping neighbourhoods into broader categories...")

    def group_neighbourhood(n):
        if n == 'Praha 1':
            return 'Old_Town_Center'
        elif n == 'Praha 2':
            return 'New_Town_Vinohrady'
        elif n in ['Praha 3', 'Praha 8', 'Praha 10']:
            return 'Near_Center_East'
        elif n == 'Praha 5':
             return 'Near_Center_West_South'
        elif n in ['Praha 6', 'Praha 7']:
             return 'North_West_Districts'
        # Group all remaining districts (Praha 4, 9, 11+ and named districts) into 'Outer_Districts'
        else: 
            return 'Outer_Districts'

    # Apply the function to create the new grouping
    df_prep['neighbourhood_group'] = df_prep['neighbourhood_cleansed'].apply(group_neighbourhood)
    
    print("\nValue counts for new 'neighbourhood_group':")
    display(df_prep['neighbourhood_group'].value_counts(dropna=False))
    new_nunique = df_prep['neighbourhood_group'].nunique()
    print(f"Number of unique neighbourhood groups: {new_nunique}")
    
    # Optional: Drop the original column if you only want to use the group
    # df_prep.drop(columns=['neighbourhood_cleansed'], inplace=True)
    # print("Dropped original 'neighbourhood_cleansed' column.")
    # Recommendation: Keep both for now, decide final feature in encoding step or based on model importance.
    print("Kept original 'neighbourhood_cleansed' column alongside new 'neighbourhood_group'.")


else:
    print("Error: df_prep DataFrame or 'neighbourhood_cleansed' column not found.")

Original value counts for 'neighbourhood_cleansed':


neighbourhood_cleansed
Praha 1            3237
Praha 2            1565
Praha 3             987
Praha 5             783
Praha 8             497
Praha 7             369
Praha 6             306
Praha 10            276
Praha 4             267
Praha 9             101
Praha 13             59
Praha 11             38
Praha 14             32
Praha 12             27
Praha 15             22
Libuš                18
Zličín               12
Praha 17             11
Praha 19             11
Kunratice            10
Praha 18             10
Praha 22             10
Zbraslav             10
Velká Chuchle         9
Slivenec              9
Dolní Počernice       8
Dolní Chabry          8
Petrovice             6
Praha 16              5
Řeporyje              5
Čakovice              5
Vinoř                 5
Nebušice              5
Praha 21              5
Šeberov               5
Nedvězí               5
Troja                 4
Újezd                 3
Ďáblice               3
Satalice              3
Březiněves       

Initial number of unique neighbourhoods: 51

Grouping neighbourhoods into broader categories...

Value counts for new 'neighbourhood_group':


neighbourhood_group
Old_Town_Center           3237
Near_Center_East          1760
New_Town_Vinohrady        1565
Near_Center_West_South     783
Outer_Districts            748
North_West_Districts       675
Name: count, dtype: int64

Number of unique neighbourhood groups: 6
Kept original 'neighbourhood_cleansed' column alongside new 'neighbourhood_group'.


In [20]:
    # --- Add this block at the end of the P4.5.4 cell ---
    # Decision: Drop the original high-cardinality column now that we have the group
    if 'neighbourhood_cleansed' in df_prep.columns and 'neighbourhood_group' in df_prep.columns:
        try:
            df_prep.drop(columns=['neighbourhood_cleansed'], inplace=True)
            print("\nDropped the original 'neighbourhood_cleansed' column.")
        except KeyError:
            print("\n'neighbourhood_cleansed' column already removed or not found.")
    # ----------------------------------------------------


Dropped the original 'neighbourhood_cleansed' column.


*Observation:* The `neighbourhood_cleansed` column, initially containing 51 distinct values, was grouped into 6 broader geographical/functional categories (`Old_Town_Center`, `New_Town_Vinohrady`, `Near_Center_East`, `Near_Center_West_South`, `North_West_Districts`, `Outer_Districts`) based on common knowledge of Prague districts. This significantly reduces cardinality, creating a new `neighbourhood_group` column suitable for simpler encoding methods like one-hot encoding, at the cost of losing finer-grained location detail. The original `neighbourhood_cleansed` column was retained for potential use with more advanced encoding techniques if needed.

### Display Info and Head
Show the updated DataFrame info and head to see the newly added features and removed original columns.

In [21]:
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame head after Phase P4 Feature Engineering & Parsing:")
    display(df_prep.head())
    
    print("\nDataFrame info after Phase P4 Feature Engineering & Parsing:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after Phase P4 Feature Engineering & Parsing:


Unnamed: 0,host_response_time,host_acceptance_rate,host_identity_verified,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,neighbourhood_group
0,within_hour,1.0,1,50.082,14.416,Entire rental unit,Entire_home/apt,4,1.0,1.0,2.0,1,365,1,0,0,31,1,4.9,4.93,4.86,1,69,0,0,0.18,7.979,30,1,1,1,1,1,1,1,0,0,0,1,2,5933,274.0,Old_Town_Center
2,within_hour,0.98,1,50.087,14.432,Entire rental unit,Entire_home/apt,4,1.5,1.0,2.0,3,700,1,3,173,411,53,4.94,4.93,4.9,0,3,0,0,3.43,7.367,58,1,1,1,1,1,1,1,1,0,0,1,2,5438,9.0,Old_Town_Center
3,within_hour,0.8,1,50.087,14.445,Private room in rental unit,Private_room,2,1.0,1.0,2.0,3,60,1,5,5,414,52,4.76,4.63,4.83,0,3,3,0,2.79,6.758,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,14.0,Near_Center_East
4,within_hour,0.8,1,50.085,14.445,Private room in rental unit,Private_room,2,1.0,1.0,3.0,3,60,1,3,3,389,47,4.69,4.59,4.73,0,3,3,0,2.67,6.446,31,1,0,0,1,0,1,1,1,0,0,0,2,4510,15.0,Near_Center_East
5,within_hour,0.8,1,50.085,14.446,Private room in rental unit,Private_room,2,1.0,1.0,1.0,3,60,1,6,6,381,52,4.78,4.68,4.81,0,3,3,0,2.58,6.65,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,21.0,Near_Center_East



DataFrame info after Phase P4 Feature Engineering & Parsing:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 43 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_response_time                            8768 non-null   object 
 1   host_acceptance_rate                          8514 non-null   float64
 2   host_identity_verified                        8768 non-null   int64  
 3   latitude                                      8768 non-null   float64
 4   longitude                                     8768 non-null   float64
 5   property_type                                 8768 non-null   object 
 6   room_type                                     8768 non-null   object 
 7   accommodates                                  8768 non-null   int64  
 8   bathrooms                                     8768 non-null   float64
 9   bedro

## 6. Handle Review-Related Missing Values

Address missing values in columns directly related to reviews (`review_scores_*`, `reviews_per_month`, `days_since_last_review`). These NaNs occur when `number_of_reviews` is 0. We will:
*   Create a binary `has_reviews` flag.
*   Impute `reviews_per_month` with 0.
*   Impute review scores with the median score from listings *that have* reviews.

### Create `has_reviews` Flag
Create a binary indicator variable based on whether a listing has any reviews.

In [22]:
if 'df_prep' in locals() and df_prep is not None and 'number_of_reviews' in df_prep.columns:
    df_prep['has_reviews'] = (df_prep['number_of_reviews'] > 0).astype(int)
    print("Created 'has_reviews' column (1 if reviews > 0, else 0).")
    display(df_prep[['number_of_reviews', 'has_reviews']].head())
    print(df_prep['has_reviews'].value_counts(dropna=False))
else:
    print("Error: df_prep DataFrame or 'number_of_reviews' column not found.")

Created 'has_reviews' column (1 if reviews > 0, else 0).


Unnamed: 0,number_of_reviews,has_reviews
0,31,1
2,411,1
3,414,1
4,389,1
5,381,1


has_reviews
1    8022
0     746
Name: count, dtype: int64


*Observation:* A binary `has_reviews` flag was created, clearly indicating listings with review history (1) versus those without (0).

### Impute `reviews_per_month`
Impute missing `reviews_per_month` values with 0, as missing implies no reviews.

In [23]:
if 'df_prep' in locals() and df_prep is not None and 'reviews_per_month' in df_prep.columns:
    missing_before = df_prep['reviews_per_month'].isnull().sum()
    if missing_before > 0:
        df_prep['reviews_per_month'].fillna(0, inplace=True)
        print(f"Imputed {missing_before} missing values in 'reviews_per_month' with 0.")
        missing_after = df_prep['reviews_per_month'].isnull().sum()
        if missing_after == 0:
            print("   Successfully imputed. No missing values remain.")
        else:
            print(f"   Warning: {missing_after} missing values remain after imputation.")
    else:
        print("'reviews_per_month' had no missing values to impute.")
else:
    print("Error: df_prep DataFrame or 'reviews_per_month' column not found.")

Imputed 746 missing values in 'reviews_per_month' with 0.
   Successfully imputed. No missing values remain.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_prep['reviews_per_month'].fillna(0, inplace=True)


*Observation:* Missing values in `reviews_per_month`, which correspond to listings with zero reviews, were correctly imputed with 0.

### Impute Review Scores
Impute missing review scores (`review_scores_rating`, `_location`, `_value`) with the median score calculated *only* from listings that have reviews.

In [24]:
if 'df_prep' in locals() and df_prep is not None:
    review_score_cols = ['review_scores_rating', 'review_scores_location', 'review_scores_value']
    imputed_count = 0
    
    # Ensure the 'has_reviews' column exists
    if 'has_reviews' in df_prep.columns:
        print("Imputing missing review scores with median from listings with reviews:")
        for col in review_score_cols:
            if col in df_prep.columns:
                missing_before = df_prep[col].isnull().sum()
                if missing_before > 0:
                    # Calculate median only from rows that have reviews
                    median_score = df_prep.loc[df_prep['has_reviews'] == 1, col].median()
                    df_prep[col].fillna(median_score, inplace=True)
                    print(f" - Imputed {missing_before} missing values in '{col}' with median {median_score:.2f}.")
                    imputed_count += 1
                else:
                     print(f" - Column '{col}' had no missing values.")
            else:
                 print(f" - Column '{col}' not found.")
        print(f"\nImputed scores for {imputed_count} columns.")
        # Verify no missing values remain
        print("\nMissing values in score columns after imputation:")
        print(df_prep[review_score_cols].isnull().sum())
    else:
        print("Error: 'has_reviews' column not found. Cannot perform conditional imputation.")
else:
    print("Error: df_prep DataFrame not found.")

Imputing missing review scores with median from listings with reviews:
 - Imputed 746 missing values in 'review_scores_rating' with median 4.83.
 - Imputed 747 missing values in 'review_scores_location' with median 4.87.
 - Imputed 747 missing values in 'review_scores_value' with median 4.78.

Imputed scores for 3 columns.

Missing values in score columns after imputation:
review_scores_rating      0
review_scores_location    0
review_scores_value       0
dtype: int64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_prep[col].fillna(median_score, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_prep[col].fillna(median_score, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting valu

*Observation:* Missing values in `review_scores_rating`, `review_scores_location`, and `review_scores_value` were imputed using the respective median scores calculated from listings that actually had reviews. This fills the gaps while using a representative central value from the reviewed population.

### Impute `days_since_last_review`
Impute missing values in `days_since_last_review` (which correspond to listings with no reviews) with a large constant value.

In [25]:
# Note: This step assumes 'days_since_last_review' was created in P4.4
#       and NaNs were *not* imputed there, but preserved. Re-running P4.4 if needed.

if 'df_prep' in locals() and df_prep is not None and 'days_since_last_review' in df_prep.columns:
    
    # Check if imputation already happened in P4.4 (value might be 9999 already)
    # Re-calculate NaNs before potential imputation
    missing_before = df_prep['days_since_last_review'].isnull().sum()
    
    if missing_before > 0:
        # Define large constant value
        impute_value_recency = 9999 
        print(f"\nImputing missing values in 'days_since_last_review' with constant {impute_value_recency}.")
        df_prep['days_since_last_review'].fillna(impute_value_recency, inplace=True)
        missing_after = df_prep['days_since_last_review'].isnull().sum()
        print(f" - Imputed {missing_before} missing values.")
        if missing_after == 0:
            print("   Successfully imputed. No missing values remain.")
        else:
            print(f"   Warning: {missing_after} missing values remain after imputation.")
            
        # Verify range includes the large constant
        print("\nVerifying 'days_since_last_review' statistics after imputation:")
        display(df_prep['days_since_last_review'].describe())
            
    else:
        # Check if the large value is already present from P4.4
        if (df_prep['days_since_last_review'] == 9999).any():
             print("\nMissing values in 'days_since_last_review' appear to have been imputed previously (found value 9999).")
        else:
             print("\n'days_since_last_review' had no missing values to impute.")
else:
    print("Error: df_prep DataFrame or 'days_since_last_review' column not found.")


Imputing missing values in 'days_since_last_review' with constant 9999.
 - Imputed 746 missing values.
   Successfully imputed. No missing values remain.

Verifying 'days_since_last_review' statistics after imputation:


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_prep['days_since_last_review'].fillna(impute_value_recency, inplace=True)


count   8768.000
mean     979.612
std     2770.813
min        0.000
25%       14.000
50%       37.000
75%      160.000
max     9999.000
Name: days_since_last_review, dtype: float64

*Observation:* Missing values in `days_since_last_review` were imputed with a large constant (9999) to represent listings with no review history or a very distant last review. The maximum value in the descriptive statistics now reflects this imputation.

### Impute Remaining Missing Values

After feature engineering and handling review-specific NaNs, impute any remaining missing values in the feature set using appropriate strategies (median for numeric, mode/'Missing' for categorical).

In [26]:
# P4.7 Impute Remaining Missing Values
if 'df_prep' in locals() and df_prep is not None:
    print("Checking for remaining missing values before final imputation:")
    missing_before_impute = df_prep.isnull().sum()
    missing_before_impute = missing_before_impute[missing_before_impute > 0]
    
    if not missing_before_impute.empty:
        print(missing_before_impute)
        
        # --- Define Imputation Strategies ---
        # Numeric columns to impute with median:
        numeric_cols_median = ['host_response_rate', 'host_acceptance_rate', 
                               'review_scores_rating', 'review_scores_location', 'review_scores_value']
        # Categorical columns to impute with 'Missing':
        categorical_cols_missing_cat = ['host_response_time']
        # Boolean columns (already Int64) to impute with mode:
        bool_cols_mode = ['host_is_superhost'] # Add others if they could have NaNs

        # --- Apply Imputation ---
        imputed_cols_count = 0

        # Numeric Median Imputation
        for col in numeric_cols_median:
            if col in df_prep.columns and df_prep[col].isnull().any():
                median_val = df_prep[col].median()
                df_prep[col].fillna(median_val, inplace=True)
                print(f" - Imputed '{col}' with median ({median_val:.3f}).")
                imputed_cols_count += 1
            elif col not in df_prep.columns:
                 print(f" - Column '{col}' not found for median imputation.")


        # Categorical 'Missing' Imputation
        for col in categorical_cols_missing_cat:
             if col in df_prep.columns and df_prep[col].isnull().any():
                 fill_value = 'Unknown' # Re-use same category as P4.5.1
                 df_prep[col].fillna(fill_value, inplace=True)
                 print(f" - Imputed '{col}' with constant '{fill_value}'.")
                 imputed_cols_count += 1
             elif col not in df_prep.columns:
                  print(f" - Column '{col}' not found for constant imputation.")

        # Boolean Mode Imputation
        for col in bool_cols_mode:
             if col in df_prep.columns and df_prep[col].isnull().any():
                 mode_val = df_prep[col].mode()[0] # mode() returns a Series
                 df_prep[col].fillna(mode_val, inplace=True)
                 # Ensure it stays Int64 after fillna
                 df_prep[col] = df_prep[col].astype('Int64') 
                 print(f" - Imputed '{col}' with mode ({mode_val}).")
                 imputed_cols_count += 1
             elif col not in df_prep.columns:
                  print(f" - Column '{col}' not found for mode imputation.")


        # --- Verification ---
        print(f"\nImputed values for {imputed_cols_count} columns based on strategy.")
        missing_after_impute = df_prep.isnull().sum().sum() # Sum of all NaNs in DataFrame
        if missing_after_impute == 0:
            print("Verification successful: No missing values remain in the DataFrame.")
        else:
            print(f"Warning: {missing_after_impute} missing values still remain after imputation. Check columns:")
            print(df_prep.isnull().sum()[df_prep.isnull().sum() > 0])
            
    else:
        print("No missing values found requiring imputation at this stage.")
        
else:
    print("Error: df_prep DataFrame not found.")

# Display final info
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame info after P4.7 Imputation:")
    df_prep.info()

Checking for remaining missing values before final imputation:
host_acceptance_rate    254
dtype: int64
 - Column 'host_response_rate' not found for median imputation.
 - Imputed 'host_acceptance_rate' with median (1.000).
 - Column 'host_is_superhost' not found for mode imputation.

Imputed values for 1 columns based on strategy.
Verification successful: No missing values remain in the DataFrame.

DataFrame info after P4.7 Imputation:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 44 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_response_time                            8768 non-null   object 
 1   host_acceptance_rate                          8768 non-null   float64
 2   host_identity_verified                        8768 non-null   int64  
 3   latitude                                      8768 non-null   float64
 4   lo

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_prep[col].fillna(median_val, inplace=True)


*Observation:* Any remaining missing values identified after feature engineering and review-specific handling (primarily in `host_response_rate`, `host_acceptance_rate`, review scores, `host_response_time`, and `host_is_superhost`) were imputed using appropriate strategies (median for numeric, 'Unknown' category for response time, mode for superhost). The dataset should now be free of missing values, ready for outlier handling and encoding.

### Display Info and Head
Show the updated DataFrame info and head to see the newly added features and removed original columns.

In [27]:
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame head after Phase P4 Feature Engineering & Parsing:")
    display(df_prep.head())
    
    print("\nDataFrame info after Phase P4 Feature Engineering & Parsing:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after Phase P4 Feature Engineering & Parsing:


Unnamed: 0,host_response_time,host_acceptance_rate,host_identity_verified,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,neighbourhood_group,has_reviews
0,within_hour,1.0,1,50.082,14.416,Entire rental unit,Entire_home/apt,4,1.0,1.0,2.0,1,365,1,0,0,31,1,4.9,4.93,4.86,1,69,0,0,0.18,7.979,30,1,1,1,1,1,1,1,0,0,0,1,2,5933,274.0,Old_Town_Center,1
2,within_hour,0.98,1,50.087,14.432,Entire rental unit,Entire_home/apt,4,1.5,1.0,2.0,3,700,1,3,173,411,53,4.94,4.93,4.9,0,3,0,0,3.43,7.367,58,1,1,1,1,1,1,1,1,0,0,1,2,5438,9.0,Old_Town_Center,1
3,within_hour,0.8,1,50.087,14.445,Private room in rental unit,Private_room,2,1.0,1.0,2.0,3,60,1,5,5,414,52,4.76,4.63,4.83,0,3,3,0,2.79,6.758,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,14.0,Near_Center_East,1
4,within_hour,0.8,1,50.085,14.445,Private room in rental unit,Private_room,2,1.0,1.0,3.0,3,60,1,3,3,389,47,4.69,4.59,4.73,0,3,3,0,2.67,6.446,31,1,0,0,1,0,1,1,1,0,0,0,2,4510,15.0,Near_Center_East,1
5,within_hour,0.8,1,50.085,14.446,Private room in rental unit,Private_room,2,1.0,1.0,1.0,3,60,1,6,6,381,52,4.78,4.68,4.81,0,3,3,0,2.58,6.65,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,21.0,Near_Center_East,1



DataFrame info after Phase P4 Feature Engineering & Parsing:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 44 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_response_time                            8768 non-null   object 
 1   host_acceptance_rate                          8768 non-null   float64
 2   host_identity_verified                        8768 non-null   int64  
 3   latitude                                      8768 non-null   float64
 4   longitude                                     8768 non-null   float64
 5   property_type                                 8768 non-null   object 
 6   room_type                                     8768 non-null   object 
 7   accommodates                                  8768 non-null   int64  
 8   bathrooms                                     8768 non-null   float64
 9   bedro

In [28]:
# Verify that there are no remaining missing values
if 'df_prep' in locals() and df_prep is not None:
    print("Checking total missing values across the entire DataFrame:")
    total_missing_values = df_prep.isnull().sum().sum() # Sum of NaNs across all columns

    if total_missing_values == 0:
        print(f"Success! There are {total_missing_values} missing values remaining in the DataFrame.")
    else:
        print(f"Warning! There are still {total_missing_values} missing values remaining.")
        print("Columns with remaining missing values:")
        print(df_prep.isnull().sum()[df_prep.isnull().sum() > 0])
else:
    print("Error: df_prep DataFrame not found.")

Checking total missing values across the entire DataFrame:
Success! There are 0 missing values remaining in the DataFrame.


In [29]:
# Save dataset state BEFORE applying outlier capping and log transformations
if 'df_prep' in locals() and df_prep is not None:
    save_path_before = '../data/interim/dataset_before_outliers.csv'
    print(f"Saving DataFrame state BEFORE outlier/skew handling to: {save_path_before}")
    try:
        # Make a copy to avoid modifying the main df_prep if saving takes time or fails
        df_to_save_before = df_prep.copy() 
        df_to_save_before.to_csv(save_path_before, index=False)
        print(f"DataFrame state saved successfully. Shape: {df_to_save_before.shape}")
    except Exception as e:
        print(f"Error saving DataFrame: {e}")
else:
    print("Error: df_prep DataFrame not found. Cannot save state before outlier handling.")

Saving DataFrame state BEFORE outlier/skew handling to: ../data/interim/dataset_before_outliers.csv
DataFrame state saved successfully. Shape: (8768, 44)


## 7. Handling Outliers & Skewness (Features)

This phase addresses potential outliers and high skewness identified in the predictor variables. Handling these can improve model stability and performance. We will:
*   Programmatically identify numeric features exhibiting high skewness or extreme maximum values.
*   Cap extreme values in stay duration columns (`minimum_nights`, `maximum_nights`).
*   Apply log transformations to other highly skewed numeric features identified.

### Identify Skewed / Outlier-Prone Features
Calculate skewness and compare max values to the 99th percentile for relevant numeric columns to guide capping and transformation decisions.

In [30]:
# P5.0 Identify Skewed / Outlier-Prone Features
if 'df_prep' in locals() and df_prep is not None:
    
    # Select numeric columns for analysis 
    numeric_cols_to_check = df_prep.select_dtypes(include=np.number).columns.tolist()
    exclude_cols = ['log_price', 'latitude', 'longitude', 'review_scores_rating', 
                    'review_scores_location', 'review_scores_value', 'has_reviews'] 
    exclude_cols.extend([col for col in df_prep.columns if col.startswith('amenity_')])
    exclude_cols.extend(['host_is_superhost', 'host_identity_verified', 'has_availability', 'instant_bookable']) 

    # Ensure we only check columns that actually exist and are numeric
    numeric_features = [
        col for col in numeric_cols_to_check 
        if col not in exclude_cols and pd.api.types.is_numeric_dtype(df_prep[col])
        ]
    
    print(f"Checking {len(numeric_features)} numeric features for skewness and extreme values:")
    # print(numeric_features) 

    # Calculate Skewness
    skewness = df_prep[numeric_features].skew().sort_values(key=abs, ascending=False)
    skew_threshold = 1.0 
    highly_skewed_cols = skewness[abs(skewness) > skew_threshold].index.tolist()
    print(f"\nFeatures with absolute skewness > {skew_threshold}: ({len(highly_skewed_cols)})")
    print(highly_skewed_cols)

    # Calculate Max vs 99th Percentile (More Robustly)
    print("\nCalculating Max vs 99th Percentile:")
    p99 = df_prep[numeric_features].quantile(0.99)
    max_vals = df_prep[numeric_features].max()
    
    # Create a DataFrame for comparison
    max_p99_compare = pd.DataFrame({'p99': p99, 'max': max_vals})
    
    # Calculate ratio, handling potential division by zero or NaN in p99
    max_p99_compare['ratio_max_p99'] = max_p99_compare['max'] / max_p99_compare['p99'].replace(0, np.nan)
    
    # Identify extreme columns
    ratio_threshold = 5.0 # Example: max > 5 * p99
    extreme_max_cols_df = max_p99_compare[max_p99_compare['ratio_max_p99'] > ratio_threshold]
    extreme_max_cols = extreme_max_cols_df.index.tolist()
    
    print(f"\nFeatures with Max > {ratio_threshold}x the 99th Percentile: ({len(extreme_max_cols)})")
    print(extreme_max_cols)
    
    # Display stats for these extreme columns
    if extreme_max_cols:
        print("\nStats for columns with extreme max values:")
        display(extreme_max_cols_df)
        
else:
    print("Error: df_prep DataFrame not found.")

Checking 19 numeric features for skewness and extreme values:

Features with absolute skewness > 1.0: (13)
['minimum_nights', 'calculated_host_listings_count_shared_rooms', 'bathrooms', 'calculated_host_listings_count_private_rooms', 'host_acceptance_rate', 'beds', 'days_since_last_review', 'number_of_reviews', 'bedrooms', 'reviews_per_month', 'accommodates', 'calculated_host_listings_count_entire_homes', 'number_of_reviews_ltm']

Calculating Max vs 99th Percentile:

Features with Max > 5.0x the 99th Percentile: (3)
['minimum_nights', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']

Stats for columns with extreme max values:


Unnamed: 0,p99,max,ratio_max_p99
minimum_nights,30.0,730.0,24.333
calculated_host_listings_count_shared_rooms,6.0,63.0,10.5
reviews_per_month,7.583,41.2,5.433


*Observation:* Skewness calculations performed on 19 numeric predictor features confirm that 13 exhibit high skewness (absolute skew > 1.0), including count-based features (`number_of_reviews*`, `calculated_host_listings_count_*`, `reviews_per_month`), size/capacity metrics (`accommodates`, `bedrooms`, `beds`, `bathrooms`), stay duration (`minimum_nights`), and others (`host_acceptance_rate`, `days_since_last_review`). A comparison of max vs 99th percentile values highlights `minimum_nights`, `calculated_host_listings_count_shared_rooms`, and `reviews_per_month` as having particularly extreme maximum values (max > 5x p99). This analysis confirms the need for capping stay durations and applying log transformations to the other identified highly skewed features.

### Cap Extreme Stay Durations
Apply capping specifically to `minimum_nights` and `maximum_nights`, identified as having extreme maximum values, using the 99th percentile.

In [31]:
# P5.1 Cap Extreme Stay Durations
if 'df_prep' in locals() and df_prep is not None:
    # Columns specifically targeted for capping based on P5.0 and EDA
    cols_to_cap = ['minimum_nights', 'maximum_nights'] 
    percentile_threshold = 0.99 

    print(f"Capping columns at the {percentile_threshold:.0%} percentile: {cols_to_cap}")

    for col in cols_to_cap:
        if col in df_prep.columns and pd.api.types.is_numeric_dtype(df_prep[col]):
            cap_value = df_prep[col].quantile(percentile_threshold)
            current_max = df_prep[col].max()
            # Only print details if capping actually changes the max value
            if current_max > cap_value:
                print(f" - Column '{col}':")
                print(f"   - 99th percentile: {cap_value:.2f}")
                print(f"   - Current max value: {current_max}")
                df_prep[col] = df_prep[col].clip(upper=cap_value)
                print(f"   - New max value after capping: {df_prep[col].max()}")
            else:
                 print(f" - Column '{col}': Max value ({current_max}) is already at or below p99 ({cap_value:.2f}). No capping applied.")
        else:
            print(f" - Column '{col}' not found or not numeric.")

    # Verify with describe
    if cols_to_cap:
         print("\nDescriptive statistics after capping:")
         actual_cols_to_cap = [c for c in cols_to_cap if c in df_prep.columns]
         if actual_cols_to_cap:
             display(df_prep[actual_cols_to_cap].describe())

else:
    print("Error: df_prep DataFrame not found.")

Capping columns at the 99% percentile: ['minimum_nights', 'maximum_nights']
 - Column 'minimum_nights':
   - 99th percentile: 30.00
   - Current max value: 730
   - New max value after capping: 30
 - Column 'maximum_nights':
   - 99th percentile: 1125.00
   - Current max value: 3333
   - New max value after capping: 1125

Descriptive statistics after capping:


Unnamed: 0,minimum_nights,maximum_nights
count,8768.0,8768.0
mean,2.54,503.06
std,4.446,400.298
min,1.0,1.0
25%,1.0,365.0
50%,2.0,365.0
75%,2.0,1124.0
max,30.0,1125.0


*Observation:* Based on the outlier analysis indicating extremely high and potentially unrealistic values, `minimum_nights` and `maximum_nights` were capped at their respective 99th percentiles (30 nights and 1125 nights). The descriptive statistics confirm the maximum values for these columns have been successfully limited, which should prevent these extreme outliers from disproportionately influencing models. Other columns identified with high max-to-p99 ratios (like counts) will be addressed via log transformation.

### Apply Log Transformation to Highly Skewed Features
Apply `log1p` transformation to features identified in P5.0 as having high absolute skewness (>1.0), excluding those already capped or not suitable for log transform (e.g., review scores). Create new columns with `_log` suffix.

In [32]:
# P5.2 Apply Log Transformation to Highly Skewed Features
if 'df_prep' in locals() and df_prep is not None and 'highly_skewed_cols' in locals():
    
    # Columns identified in P5.0 as highly skewed
    cols_to_log_transform = highly_skewed_cols 
    
    # Refine list: Exclude columns already capped, or columns not suitable
    exclude_from_log = ['minimum_nights', 'maximum_nights'] # Already capped
    # Add others if needed (e.g. if a count column was capped instead of logged)
    
    cols_to_log_transform = [col for col in cols_to_log_transform if col not in exclude_from_log]

    print(f"Applying log1p transformation to {len(cols_to_log_transform)} highly skewed columns:")
    print(cols_to_log_transform)
    
    transformed_cols_log = []
    original_cols_to_potentially_drop = [] # Keep track of original cols
    
    for col in cols_to_log_transform:
        if col in df_prep.columns: # Check column exists
            # Check for non-negative values before log transform
            if (df_prep[col] < 0).any():
                print(f"   - Warning: Column '{col}' contains negative values. Skipping log transformation.")
                continue
                
            new_col_name = f"{col}_log"
            df_prep[new_col_name] = np.log1p(df_prep[col])
            transformed_cols_log.append(new_col_name)
            original_cols_to_potentially_drop.append(col) # Mark original for potential drop later
            
        else:
            print(f"   - Warning: Column '{col}' not found.")
            
    print(f"\nCreated {len(transformed_cols_log)} new log-transformed columns.")
    print("Original columns kept for now, consider dropping later if log version proves superior.")

    # Verify results - describe new log columns
    if transformed_cols_log:
        print("\nDescriptive statistics for new log-transformed columns:")
        display(df_prep[transformed_cols_log].describe())
        
    # Optional: Check skewness again after transformation
    # if transformed_cols_log:
    #    new_skewness = df_prep[transformed_cols_log].skew()
    #    print("\nSkewness AFTER log transformation:")
    #    display(new_skewness)

else:
    print("Error: df_prep DataFrame not found or skewed columns not identified.")

Applying log1p transformation to 12 highly skewed columns:
['calculated_host_listings_count_shared_rooms', 'bathrooms', 'calculated_host_listings_count_private_rooms', 'host_acceptance_rate', 'beds', 'days_since_last_review', 'number_of_reviews', 'bedrooms', 'reviews_per_month', 'accommodates', 'calculated_host_listings_count_entire_homes', 'number_of_reviews_ltm']

Created 12 new log-transformed columns.
Original columns kept for now, consider dropping later if log version proves superior.

Descriptive statistics for new log-transformed columns:


Unnamed: 0,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log
count,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0,8768.0
mean,0.042,0.793,0.381,0.652,1.132,4.152,3.285,0.8,0.922,1.532,1.967,2.346
std,0.388,0.207,0.828,0.118,0.471,2.096,1.743,0.353,0.638,0.432,1.334,1.469
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.693,0.0,0.0
25%,0.0,0.693,0.0,0.678,0.693,2.708,2.079,0.693,0.336,1.099,0.693,1.099
50%,0.0,0.693,0.0,0.693,1.099,3.638,3.555,0.693,0.916,1.609,1.792,2.639
75%,0.0,0.916,0.0,0.693,1.386,5.081,4.644,1.099,1.456,1.792,2.996,3.638
max,4.159,2.773,4.078,0.693,3.497,9.21,7.562,2.565,3.742,2.833,4.625,5.805


*Observation:* Log transformation (`log1p`) was applied to 12 numeric features identified as highly skewed, including counts (`number_of_reviews*`, `calculated_host_listings_count_*`, `reviews_per_month`, `num_amenities`, `num_host_verifications`), size/capacity metrics (`accommodates`, `bathrooms`, `bedrooms`, `beds`), acceptance rate (`host_acceptance_rate`), and recency (`days_since_last_review`). New columns ending in `_log` were created. The descriptive statistics for these transformed columns show significantly compressed ranges and likely reduced skewness (indicated by closer mean and median values) compared to their original scales, making them potentially better suited for modeling. Original columns were retained for now.

### Verify Skewness Reduction
Re-calculate skewness for the potentially problematic numeric columns (using the capped/log-transformed versions where applicable) to verify that the previous steps have reduced extreme skewness.

In [33]:
# P5.3 Verify Skewness Reduction
if 'df_prep' in locals() and df_prep is not None:
    
    # Define the set of numeric features to check *after* transformations/capping
    # These are the original names OR their _log counterparts if created
    
    features_to_check_final_skew = []
    original_features_processed = [
         'accommodates', 'bathrooms', 'bedrooms', 'beds', 
         'minimum_nights', 'maximum_nights', # Use capped versions
         'number_of_reviews', 'number_of_reviews_ltm', 
         'calculated_host_listings_count_entire_homes', 
         'calculated_host_listings_count_private_rooms', 
         'calculated_host_listings_count_shared_rooms', 
         'reviews_per_month', 
         'num_amenities', 
         'num_host_verifications',
         'host_duration_days', 
         'days_since_last_review',
         'host_response_rate', # Check original or log if transformed
         'host_acceptance_rate' # Check original or log if transformed
        ]

    for col in original_features_processed:
        log_col = f"{col}_log"
        if log_col in df_prep.columns:
             features_to_check_final_skew.append(log_col) # Prefer the log version if it exists
        elif col in df_prep.columns and pd.api.types.is_numeric_dtype(df_prep[col]):
             features_to_check_final_skew.append(col) # Otherwise use the original (potentially capped)

    # Ensure no duplicates and all exist
    features_to_check_final_skew = sorted(list(set(features_to_check_final_skew)))
    features_to_check_final_skew = [f for f in features_to_check_final_skew if f in df_prep.columns]

    if features_to_check_final_skew:
        print("Calculating skewness for numeric features after capping/transformation:")
        final_skewness = df_prep[features_to_check_final_skew].skew().sort_values(key=abs, ascending=False)
        
        print("\nSkewness values:")
        display(final_skewness)
        
        # Identify columns still highly skewed (threshold might be lower now, e.g., 0.75 or 1.0)
        remaining_skew_threshold = 1.0 
        still_skewed_cols = final_skewness[abs(final_skewness) > remaining_skew_threshold]
        
        if not still_skewed_cols.empty:
             print(f"\nWarning: The following columns still have absolute skewness > {remaining_skew_threshold}:")
             display(still_skewed_cols)
             print("Further transformation or reliance on robust models might be needed.")
        else:
             print(f"\nSuccess: All checked features now have absolute skewness <= {remaining_skew_threshold}.")
             
    else:
        print("Could not identify relevant numeric columns to check final skewness.")

else:
    print("Error: df_prep DataFrame not found.")

Calculating skewness for numeric features after capping/transformation:

Skewness values:


calculated_host_listings_count_shared_rooms_log     9.982
minimum_nights                                      5.331
host_acceptance_rate_log                           -3.982
bathrooms_log                                       2.500
calculated_host_listings_count_private_rooms_log    2.277
days_since_last_review_log                          1.055
beds_log                                            0.760
maximum_nights                                      0.648
accommodates_log                                    0.556
calculated_host_listings_count_entire_homes_log     0.349
number_of_reviews_log                              -0.335
number_of_reviews_ltm_log                          -0.331
num_host_verifications                              0.292
host_duration_days                                 -0.165
reviews_per_month_log                               0.132
bedrooms_log                                        0.096
num_amenities                                       0.025
dtype: float64




calculated_host_listings_count_shared_rooms_log     9.982
minimum_nights                                      5.331
host_acceptance_rate_log                           -3.982
bathrooms_log                                       2.500
calculated_host_listings_count_private_rooms_log    2.277
days_since_last_review_log                          1.055
dtype: float64

Further transformation or reliance on robust models might be needed.


*Observation:* The skewness check after capping and log transformations reveals that while many features improved, several columns (`calculated_host_listings_count_shared_rooms_log`, `minimum_nights`, `host_acceptance_rate_log`, `bathrooms_log`, `calculated_host_listings_count_private_rooms_log`, `days_since_last_review_log`) still exhibit absolute skewness greater than 1.0. This is often due to distributions dominated by zero/low values or the nature of capping/imputation. **Decision:** Given that tree-based models (Random Forest, Gradient Boosting) are primary candidates for this type of prediction task and are generally robust to monotonic transformations and skewness, no further transformations will be applied at this stage. We will proceed with the current feature set, acknowledging the remaining skewness in these specific columns.

### Display Info and Head
Show the updated DataFrame info and head to see the newly added features and removed original columns.

In [34]:
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame head after Phase P4 Feature Engineering & Parsing:")
    display(df_prep.head())
    
    print("\nDataFrame info after Phase P4 Feature Engineering & Parsing:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after Phase P4 Feature Engineering & Parsing:


Unnamed: 0,host_response_time,host_acceptance_rate,host_identity_verified,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,neighbourhood_group,has_reviews,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log
0,within_hour,1.0,1,50.082,14.416,Entire rental unit,Entire_home/apt,4,1.0,1.0,2.0,1,365,1,0,0,31,1,4.9,4.93,4.86,1,69,0,0,0.18,7.979,30,1,1,1,1,1,1,1,0,0,0,1,2,5933,274.0,Old_Town_Center,1,0.0,0.693,0.0,0.693,1.099,5.617,3.466,0.693,0.166,1.609,4.248,0.693
2,within_hour,0.98,1,50.087,14.432,Entire rental unit,Entire_home/apt,4,1.5,1.0,2.0,3,700,1,3,173,411,53,4.94,4.93,4.9,0,3,0,0,3.43,7.367,58,1,1,1,1,1,1,1,1,0,0,1,2,5438,9.0,Old_Town_Center,1,0.0,0.916,0.0,0.683,1.099,2.303,6.021,0.693,1.488,1.609,1.386,3.989
3,within_hour,0.8,1,50.087,14.445,Private room in rental unit,Private_room,2,1.0,1.0,2.0,3,60,1,5,5,414,52,4.76,4.63,4.83,0,3,3,0,2.79,6.758,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,14.0,Near_Center_East,1,0.0,0.693,1.386,0.588,1.099,2.708,6.028,0.693,1.332,1.099,1.386,3.97
4,within_hour,0.8,1,50.085,14.445,Private room in rental unit,Private_room,2,1.0,1.0,3.0,3,60,1,3,3,389,47,4.69,4.59,4.73,0,3,3,0,2.67,6.446,31,1,0,0,1,0,1,1,1,0,0,0,2,4510,15.0,Near_Center_East,1,0.0,0.693,1.386,0.588,1.386,2.773,5.966,0.693,1.3,1.099,1.386,3.871
5,within_hour,0.8,1,50.085,14.446,Private room in rental unit,Private_room,2,1.0,1.0,1.0,3,60,1,6,6,381,52,4.78,4.68,4.81,0,3,3,0,2.58,6.65,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,21.0,Near_Center_East,1,0.0,0.693,1.386,0.588,0.693,3.091,5.945,0.693,1.275,1.099,1.386,3.97



DataFrame info after Phase P4 Feature Engineering & Parsing:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 56 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   host_response_time                                8768 non-null   object 
 1   host_acceptance_rate                              8768 non-null   float64
 2   host_identity_verified                            8768 non-null   int64  
 3   latitude                                          8768 non-null   float64
 4   longitude                                         8768 non-null   float64
 5   property_type                                     8768 non-null   object 
 6   room_type                                         8768 non-null   object 
 7   accommodates                                      8768 non-null   int64  
 8   bathrooms                               

In [35]:
# Save dataset state AFTER applying outlier capping and log transformations
if 'df_prep' in locals() and df_prep is not None:
    save_path_after = '../data/interim/dataset_after_outliers.csv'
    print(f"Saving DataFrame state AFTER outlier/skew handling to: {save_path_after}")
    try:
        # Saving the current state of df_prep
        df_prep.to_csv(save_path_after, index=False)
        print(f"DataFrame state saved successfully. Shape: {df_prep.shape}")
    except Exception as e:
        print(f"Error saving DataFrame: {e}")
else:
    print("Error: df_prep DataFrame not found. Cannot save state after outlier handling.")

Saving DataFrame state AFTER outlier/skew handling to: ../data/interim/dataset_after_outliers.csv
DataFrame state saved successfully. Shape: (8768, 56)


## 8. Encoding Categorical Features

This phase converts the remaining categorical features (currently stored as `object` type) into a numeric format that machine learning models can process. We will:
*   Identify the final set of categorical columns needing encoding.
*   Apply One-Hot Encoding to low-cardinality categorical features.
*   Discuss and potentially apply appropriate encoding for high-cardinality features (like frequency or target encoding).

### Identify Final Categorical Columns
List the columns currently having the `object` data type, which represent our categorical features needing encoding.

In [36]:
# P7.1 Identify Final Categorical Columns
if 'df_prep' in locals() and df_prep is not None:
    categorical_cols = df_prep.select_dtypes(include='object').columns.tolist()
    print(f"Categorical columns to encode ({len(categorical_cols)}):")
    print(sorted(categorical_cols))
    
    # Display unique value counts again for these specific columns for context
    if categorical_cols:
        print("\nUnique value counts for these columns:")
        display(df_prep[categorical_cols].nunique().sort_values())
else:
    print("Error: df_prep DataFrame not found.")

Categorical columns to encode (4):
['host_response_time', 'neighbourhood_group', 'property_type', 'room_type']

Unique value counts for these columns:


room_type               4
host_response_time      5
neighbourhood_group     6
property_type          12
dtype: int64

### Apply One-Hot Encoding (Low Cardinality)
Apply one-hot encoding to categorical columns with low cardinality using `pd.get_dummies`. This creates new binary (0/1) columns for each category. Drop the original column after encoding.

In [37]:
# P7.2 Apply One-Hot Encoding (Low Cardinality)
if 'df_prep' in locals() and df_prep is not None:
    # Identify low-cardinality columns based on previous analysis/counts
    # neighbourhood_group was created by grouping and has low cardinality
    low_cardinality_categorical_cols = [
        'host_response_time', 
        'room_type',
        'neighbourhood_group' # Added based on P4.5.4
        ]
    
    # Ensure columns exist before trying to encode
    actual_low_card_cols = [col for col in low_cardinality_categorical_cols if col in df_prep.columns]
    
    if actual_low_card_cols:
        print(f"Applying One-Hot Encoding to: {actual_low_card_cols}")
        
        # Use pd.get_dummies - easy to implement
        # drop_first=True can be used to avoid multicollinearity if needed (e.g., for linear models), 
        # but often kept False for tree models or interpretability. Let's keep False for now.
        df_prep = pd.get_dummies(df_prep, columns=actual_low_card_cols, drop_first=False, prefix=actual_low_card_cols)
        
        print(f"\nDataFrame shape after One-Hot Encoding: {df_prep.shape}")
        
        # Display some of the new columns created
        new_ohe_cols = [col for col in df_prep.columns if any(col.startswith(prefix + '_') for prefix in actual_low_card_cols)]
        print(f"\nExample new One-Hot Encoded columns ({len(new_ohe_cols)}):")
        display(df_prep[new_ohe_cols[:10]].head(3)) # Show first few OHE columns
            
    else:
        print("No low-cardinality columns found to apply One-Hot Encoding.")

else:
    print("Error: df_prep DataFrame not found.")

Applying One-Hot Encoding to: ['host_response_time', 'room_type', 'neighbourhood_group']

DataFrame shape after One-Hot Encoding: (8768, 68)

Example new One-Hot Encoded columns (15):


Unnamed: 0,host_response_time_Unknown,host_response_time_days_or_more,host_response_time_within_day,host_response_time_within_hour,host_response_time_within_hours,room_type_Entire_home/apt,room_type_Hotel_room,room_type_Private_room,room_type_Shared_room,neighbourhood_group_Near_Center_East
0,False,False,False,True,False,True,False,False,False,False
2,False,False,False,True,False,True,False,False,False,False
3,False,False,False,True,False,False,False,True,False,True


*Observation:* One-hot encoding was successfully applied to the low-cardinality categorical features `host_response_time`, `room_type`, and `neighbourhood_group`. This process created 15 new binary indicator columns (dummy variables), replacing the original three columns and increasing the total feature count to 68. The DataFrame now contains numeric representations for these categorical distinctions, suitable for model input.

### Handle High-Cardinality Categorical Features
Address the remaining high-cardinality categorical features (`neighbourhood_cleansed`, `property_type`). Simple one-hot encoding would create too many columns. Alternatives include Target Encoding, Frequency Encoding, or relying on models like CatBoost. For simplicity in this initial preparation, we can apply Frequency Encoding.
*(Note: Target Encoding is powerful but riskier due to potential data leakage if not implemented carefully within cross-validation).*

In [38]:
# P7.3 Apply Frequency Encoding (Example for High Cardinality)
if 'df_prep' in locals() and df_prep is not None:
    # Identify high-cardinality columns remaining
    high_cardinality_categorical_cols = [
        'property_type',
        ]
    
    actual_high_card_cols = [col for col in high_cardinality_categorical_cols if col in df_prep.columns and df_prep[col].dtype == 'object']
    
    if actual_high_card_cols:
        print(f"Applying Frequency Encoding to: {actual_high_card_cols}")
        
        for col in actual_high_card_cols:
            # Calculate frequency of each category
            frequency_map = df_prep[col].value_counts(normalize=True) # Use normalize=True for probability/frequency
            
            # Create new column name
            new_col_name = f"{col}_freq"
            
            # Map frequencies to the column
            df_prep[new_col_name] = df_prep[col].map(frequency_map)
            
            print(f" - Created '{new_col_name}' based on frequencies of '{col}'.")
            
            # Decide whether to drop the original column
            # Recommendation: Keep original for now, can drop later if frequency encoding proves useful.
            # df_prep.drop(columns=[col], inplace=True)
            # print(f"   - Dropped original column '{col}'.")
            
        print("\nFrequency encoding applied.")
        
        # Display head of new features
        new_freq_cols = [f"{col}_freq" for col in actual_high_card_cols]
        if new_freq_cols:
            display(df_prep[new_freq_cols].head())
            display(df_prep[new_freq_cols].describe())
            
    else:
        print("No remaining high-cardinality object columns found for frequency encoding.")

else:
    print("Error: df_prep DataFrame not found.")

Applying Frequency Encoding to: ['property_type']
 - Created 'property_type_freq' based on frequencies of 'property_type'.

Frequency encoding applied.


Unnamed: 0,property_type_freq
0,0.609
2,0.609
3,0.068
4,0.068
5,0.068


Unnamed: 0,property_type_freq
count,8768.0
mean,0.397
std,0.265
min,0.006
25%,0.084
50%,0.609
75%,0.609
max,0.609


*Observation:* Frequency encoding was applied to the remaining categorical feature `property_type`. A new numeric column, `property_type_freq`, was created where each original category is replaced by its normalized frequency (proportion) within the dataset. This encodes the prevalence of each property type numerically without increasing dimensionality significantly. The original `property_type` column was retained alongside the new frequency feature.

## 9. Final Review & Splitting

This phase prepares the fully processed dataset for modeling. We will:
*   Perform a final check on the feature set, ensuring all data is numeric and removing any zero-variance columns.
*   Separate the predictor features (X) from the target variable (y).
*   Split the data into training and testing sets for model development and evaluation.

### Final Feature Check
Verify that all columns (except potentially kept original categoricals) are numeric. Identify and remove any features with zero variance (columns where all values are the same). Define the final feature set `X` and target `y`.

In [39]:
# P8.1 Final Feature Check
if 'df_prep' in locals() and df_prep is not None:
    print("Performing final feature check...")

    # --- Separate Target ---
    target = 'log_price'
    if target in df_prep.columns:
        y = df_prep[target]
        X = df_prep.drop(columns=[target])
        print(f" - Separated target variable '{target}'.")
    else:
        print(f"Error: Target variable '{target}' not found.")
        X, y = None, None # Stop further processing if target is missing

    if X is not None:
        # --- Identify Non-Numeric Columns Remaining ---
        # Besides intentionally kept originals like 'property_type', 'neighbourhood_cleansed'
        object_cols_remaining = X.select_dtypes(include='object').columns.tolist()
        if object_cols_remaining:
            print(f"\nWarning: Object columns still remain in features X: {object_cols_remaining}")
            # Decide to drop them now if they weren't handled (e.g. original high-cardinality)
            print("   - Dropping remaining object columns.")
            X = X.drop(columns=object_cols_remaining)
            print(f"   - Features remaining after dropping objects: {X.shape[1]}")
        else:
            print("\nConfirmed: All feature columns in X are numeric.")

        # --- Check for Zero Variance Columns ---
        # Calculate variance for numeric columns
        variances = X.var(numeric_only=True) # numeric_only=True is default but good practice
        zero_variance_cols = variances[variances == 0].index.tolist()

        if zero_variance_cols:
            print(f"\nWarning: Found {len(zero_variance_cols)} columns with zero variance:")
            print(zero_variance_cols)
            print("   - Dropping zero variance columns.")
            X = X.drop(columns=zero_variance_cols)
            print(f"   - Features remaining after dropping zero variance: {X.shape[1]}")
        else:
            print("\nConfirmed: No zero-variance columns found.")

        # --- Document Final Feature Set ---
        final_feature_names = X.columns.tolist()
        print(f"\nFinal feature set for modeling contains {len(final_feature_names)} features.")
        # print("Final features:", final_feature_names) # Uncomment to list all names

else:
    print("Error: df_prep DataFrame not found.")

Performing final feature check...
 - Separated target variable 'log_price'.

   - Dropping remaining object columns.
   - Features remaining after dropping objects: 67

['has_availability']
   - Dropping zero variance columns.
   - Features remaining after dropping zero variance: 66

Final feature set for modeling contains 66 features.


*Observation:* A final check was performed on the feature set (`X`) after separating the target (`y`). The remaining object column (`property_type`, kept alongside its frequency-encoded version) was dropped. A zero-variance column (`has_availability`, likely becoming constant after rows with missing values were dropped) was also identified and removed. The final feature set for modeling now contains 66 numeric predictors, ready for splitting.

### Unify Numeric Data Types
Consolidate different float types (float64, Float64) and integer types (int64, Int64, bool) into standard `float64` and `int64` respectively for consistency, assuming no NaNs remain in integer-like columns after preprocessing.

In [42]:
# P8.1.A Unify Numeric Data Types
if 'X' in locals() and 'y' in locals() and X is not None and y is not None:
    print("Unifying data types...")
    
    # --- Unify Integer Types ---
    # Select columns that are int64, Int64, or bool
    int_like_cols = X.select_dtypes(include=['int64', 'Int64', 'bool']).columns.tolist()
    print(f"\nColumns to potentially convert to standard int64: {len(int_like_cols)}")
    
    # Check for NaNs before converting (should be 0 after imputation/dropping)
    nans_in_ints = X[int_like_cols].isnull().sum().sum()
    if nans_in_ints == 0:
        print(" - No NaNs found in integer-like columns. Proceeding with conversion to int64.")
        for col in int_like_cols:
            if X[col].dtype != 'int64': # Only convert if not already int64
                X[col] = X[col].astype('int64')
        print(" - Integer types unified to int64.")
    else:
        print(f" - Warning: Found {nans_in_ints} NaNs in integer-like columns. Cannot safely convert all to int64. Keeping nullable types.")

    # --- Unify Float Types ---
    # Select columns that are float64 or Float64
    float_like_cols = X.select_dtypes(include=['float64', 'Float64']).columns.tolist()
    # Also convert the target variable y if it's Float64
    if isinstance(y, pd.Series) and y.dtype == 'Float64':
        print("\nConverting target 'y' (log_price) from Float64 to float64.")
        y = y.astype('float64')
        
    print(f"\nColumns to potentially convert to standard float64: {len(float_like_cols)}")
    converted_float_count = 0
    for col in float_like_cols:
         if X[col].dtype != 'float64': # Only convert if not already float64
             X[col] = X[col].astype('float64')
             converted_float_count += 1
    if converted_float_count > 0:
         print(f" - Converted {converted_float_count} float columns to float64.")
    else:
         print(" - All float columns were already float64.")
         
    # --- Verify Final Types ---
    print("\nVerifying final data types in X:")
    print(X.info())
    if isinstance(y, pd.Series):
        print("\nVerifying final data type for y:")
        print(y.info())

else:
    print("Error: Feature matrix X or target vector y not defined.")

Unifying data types...

Columns to potentially convert to standard int64: 42
 - No NaNs found in integer-like columns. Proceeding with conversion to int64.
 - Integer types unified to int64.

Converting target 'y' (log_price) from Float64 to float64.

Columns to potentially convert to standard float64: 24
 - All float columns were already float64.

Verifying final data types in X:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 66 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   host_acceptance_rate                              8768 non-null   float64
 1   host_identity_verified                            8768 non-null   int64  
 2   latitude                                          8768 non-null   float64
 3   longitude                                         8768 non-null   float64
 4   accommodates                      

*Observation:* Numeric data types were standardized. Integer-like columns (`int64`, `Int64`, `bool`) were converted to the standard `int64` format (after confirming no missing values remained). Float columns, including the target variable `y` (`log_price`), were converted to the standard `float64` format. This ensures consistent data types across the final feature set.

### Display Info and Head
Show the updated DataFrame info and head to see the newly added features and removed original columns.

In [44]:
# P8.1.A+ Final Manual Check
if 'X' in locals() and 'y' in locals() and X is not None and y is not None:
    print("--- Final Check of Feature Matrix (X) ---")
    print("\nX head:")
    display(X.head())
    
    print("\nX Info:")
    X.info()
    
    print("\n--- Final Check of Target Vector (y) ---")
    print("\ny head:")
    display(y.head())
    
    print("\ny Info:")
    # Use .info() if y is a Series, otherwise just print type/shape
    if isinstance(y, pd.Series):
        y.info()
    else:
        print(f"y type: {type(y)}")
        try:
            print(f"y shape: {y.shape}")
        except AttributeError:
            pass # Handle cases where y might not have shape (e.g. list)

else:
    print("Error: Feature matrix X or target vector y not defined for final check.")

--- Final Check of Feature Matrix (X) ---

X head:


Unnamed: 0,host_acceptance_rate,host_identity_verified,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,has_reviews,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log,host_response_time_Unknown,host_response_time_days_or_more,host_response_time_within_day,host_response_time_within_hour,host_response_time_within_hours,room_type_Entire_home/apt,room_type_Hotel_room,room_type_Private_room,room_type_Shared_room,neighbourhood_group_Near_Center_East,neighbourhood_group_Near_Center_West_South,neighbourhood_group_New_Town_Vinohrady,neighbourhood_group_North_West_Districts,neighbourhood_group_Old_Town_Center,neighbourhood_group_Outer_Districts,property_type_freq
0,1.0,1,50.082,14.416,4,1.0,1.0,2.0,1,365,0,0,31,1,4.9,4.93,4.86,1,69,0,0,0.18,30,1,1,1,1,1,1,1,0,0,0,1,2,5933,274.0,1,0.0,0.693,0.0,0.693,1.099,5.617,3.466,0.693,0.166,1.609,4.248,0.693,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0.609
2,0.98,1,50.087,14.432,4,1.5,1.0,2.0,3,700,3,173,411,53,4.94,4.93,4.9,0,3,0,0,3.43,58,1,1,1,1,1,1,1,1,0,0,1,2,5438,9.0,1,0.0,0.916,0.0,0.683,1.099,2.303,6.021,0.693,1.488,1.609,1.386,3.989,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0.609
3,0.8,1,50.087,14.445,2,1.0,1.0,2.0,3,60,5,5,414,52,4.76,4.63,4.83,0,3,3,0,2.79,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,14.0,1,0.0,0.693,1.386,0.588,1.099,2.708,6.028,0.693,1.332,1.099,1.386,3.97,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0.068
4,0.8,1,50.085,14.445,2,1.0,1.0,3.0,3,60,3,3,389,47,4.69,4.59,4.73,0,3,3,0,2.67,31,1,0,0,1,0,1,1,1,0,0,0,2,4510,15.0,1,0.0,0.693,1.386,0.588,1.386,2.773,5.966,0.693,1.3,1.099,1.386,3.871,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0.068
5,0.8,1,50.085,14.446,2,1.0,1.0,1.0,3,60,6,6,381,52,4.78,4.68,4.81,0,3,3,0,2.58,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,21.0,1,0.0,0.693,1.386,0.588,0.693,3.091,5.945,0.693,1.275,1.099,1.386,3.97,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0.068



X Info:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 66 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   host_acceptance_rate                              8768 non-null   float64
 1   host_identity_verified                            8768 non-null   int64  
 2   latitude                                          8768 non-null   float64
 3   longitude                                         8768 non-null   float64
 4   accommodates                                      8768 non-null   int64  
 5   bathrooms                                         8768 non-null   float64
 6   bedrooms                                          8768 non-null   float64
 7   beds                                              8768 non-null   float64
 8   minimum_nights                                    8768 non-null   int64  
 9   maximum_nights

0   7.979
2   7.367
3   6.758
4   6.446
5   6.650
Name: log_price, dtype: float64


y Info:
<class 'pandas.core.series.Series'>
Index: 8768 entries, 0 to 10107
Series name: log_price
Non-Null Count  Dtype  
--------------  -----  
8768 non-null   float64
dtypes: float64(1)
memory usage: 137.0 KB


### Save Final Processed Data (Before Split)
Save the complete, cleaned, engineered, and encoded feature matrix `X` combined with the target vector `y` to a file in the `data/processed/` directory. This version is useful for final visualizations and potential re-use.

In [45]:
# P8.1.B Save Final Processed Data (Before Split)
if 'X' in locals() and 'y' in locals() and X is not None and y is not None:
    
    # Define base filename
    base_filename = '../data/processed/final_dataset_beforesplit'
    
    print(f"\nCombining X and y for saving...")
    
    try:
        # Create a temporary DataFrame for saving
        df_final_processed = X.copy()
        # Ensure y index aligns with X if necessary (should after split logic, but good practice)
        df_final_processed[target] = y.values # Use .values to avoid potential index mismatch errors
        
        print(f"Shape of DataFrame to save: {df_final_processed.shape}")
        
        # --- Save to Parquet (Recommended for Python/Pandas use) ---
        save_path_parquet = f"{base_filename}.parquet"
        print(f"Saving final processed data to Parquet: {save_path_parquet}")
        df_final_processed.to_parquet(save_path_parquet, index=False)
        print(f"   Successfully saved Parquet file.")

        # --- Save to CSV (Optional, for external viewing/compatibility) ---
        save_path_csv = f"{base_filename}.csv"
        print(f"Saving final processed data to CSV: {save_path_csv}")
        df_final_processed.to_csv(save_path_csv, index=False)
        print(f"   Successfully saved CSV file.")
        
    except Exception as e:
        print(f"Error saving final processed data: {e}")
        
else:
    print("Error: Feature matrix X or target vector y not defined. Cannot save.")


Combining X and y for saving...
Shape of DataFrame to save: (8768, 67)
Saving final processed data to Parquet: ../data/processed/final_dataset_beforesplit.parquet
   Successfully saved Parquet file.
Saving final processed data to CSV: ../data/processed/final_dataset_beforesplit.csv
   Successfully saved CSV file.


*Observation:* The final, fully processed dataset (including features `X` and target `y` with unified numeric types) was saved in two formats to the `data/processed/` directory: as `final_dataset_beforesplit.parquet` (recommended for efficient loading and type preservation within Pandas/Python workflows) and as `final_dataset_beforesplit.csv` (for easier inspection with external tools like spreadsheets). This provides a clean snapshot of the data ready for modeling or visualization.

### Train-Test Split
Split the feature matrix `X` and target vector `y` into training (80%) and testing (20%) sets. Use a fixed `random_state` for reproducibility.

In [46]:
# P8.2 Train-Test Split
if 'X' in locals() and 'y' in locals() and X is not None and y is not None:
    
    # Define test size and random state
    test_set_size = 0.20
    random_seed = 42 

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=test_set_size, 
        random_state=random_seed
    )

    print("Data split into training and testing sets.")
    print(f"X_train shape: {X_train.shape}")
    print(f"X_test shape : {X_test.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"y_test shape : {y_test.shape}")
    
    # Display first few rows of training features as a check
    print("\nHead of X_train:")
    display(X_train.head(3))

else:
    print("Error: Feature matrix X or target vector y not defined. Cannot perform split.")

Data split into training and testing sets.
X_train shape: (7014, 66)
X_test shape : (1754, 66)
y_train shape: (7014,)
y_test shape : (1754,)

Head of X_train:


Unnamed: 0,host_acceptance_rate,host_identity_verified,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,has_reviews,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log,host_response_time_Unknown,host_response_time_days_or_more,host_response_time_within_day,host_response_time_within_hour,host_response_time_within_hours,room_type_Entire_home/apt,room_type_Hotel_room,room_type_Private_room,room_type_Shared_room,neighbourhood_group_Near_Center_East,neighbourhood_group_Near_Center_West_South,neighbourhood_group_New_Town_Vinohrady,neighbourhood_group_North_West_Districts,neighbourhood_group_Old_Town_Center,neighbourhood_group_Outer_Districts,property_type_freq
7040,0.99,1,50.065,14.425,3,1.0,1.0,2.0,3,365,6,145,35,33,4.97,4.97,4.91,0,8,0,0,2.32,43,1,1,1,1,1,1,1,1,0,0,1,3,4566,14.0,1,0.0,0.693,0.0,0.688,1.099,2.708,3.584,0.693,1.2,1.386,2.197,3.526,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0.609
7863,1.0,1,50.085,14.428,6,1.0,1.0,3.0,1,29,3,26,55,55,4.93,5.0,4.91,1,2,0,0,5.48,44,1,1,0,1,1,1,1,1,0,0,1,2,317,2.0,1,0.0,0.693,0.0,0.693,1.386,1.099,4.025,0.693,1.869,1.946,1.099,4.025,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0.084
7652,0.99,1,50.075,14.407,9,2.5,3.0,5.0,3,365,9,175,1,1,5.0,4.0,4.0,0,16,0,0,0.1,38,1,1,1,1,1,1,1,1,0,1,1,2,4302,312.0,1,0.0,1.253,0.0,0.688,1.792,5.746,0.693,1.386,0.095,2.303,2.833,0.693,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0.609


*Observation:* The processed dataset (features `X` and target `y`) was successfully split into training (7014 rows, 80%) and testing sets (1754 rows, 20%) using `train_test_split` with `random_state=42` for reproducibility. The resulting `X_train`, `X_test`, `y_train`, and `y_test` DataFrames/Series have the correct dimensions and are ready for the final scaling step before modeling.

## 10. Feature Scaling and Saving

In this phase, we scale the numeric features to ensure they have a similar range (e.g., mean=0, std dev=1). This is crucial for models sensitive to feature scales, such as linear models, SVM, KNN, and neural networks, and can sometimes help tree-based models converge faster. We will:
*   Fit a `StandardScaler` on the training data (`X_train`) only to learn the mean and standard deviation of each feature.
*   Transform both the training (`X_train`) and testing (`X_test`) data using the *fitted* scaler.
*   Save the fitted scaler object for later use on new data.

### Fit Scaler on Training Data and Transform Train/Test Sets
Use `StandardScaler` to standardize features by removing the mean and scaling to unit variance. Fit *only* on `X_train` to prevent data leakage from the test set.

In [47]:
# P9.1 & P9.2 Fit Scaler and Transform Data
if 'X_train' in locals() and 'X_test' in locals():
    
    # Initialize the StandardScaler
    scaler = StandardScaler()

    print("Fitting StandardScaler on X_train...")
    # Fit the scaler using the training data
    # Note: scaler expects numeric data, X_train should be all numeric now
    try:
        scaler.fit(X_train)
        print("Scaler fitted successfully.")

        # Keep track of original column names and index
        X_train_columns = X_train.columns
        X_train_index = X_train.index
        X_test_columns = X_test.columns # Should be same as X_train
        X_test_index = X_test.index

        print("\nTransforming X_train and X_test...")
        # Transform both training and testing data
        # Output is a NumPy array, need to convert back to DataFrame
        X_train_scaled_np = scaler.transform(X_train)
        X_test_scaled_np = scaler.transform(X_test)

        # Convert back to Pandas DataFrames
        X_train_scaled = pd.DataFrame(X_train_scaled_np, columns=X_train_columns, index=X_train_index)
        X_test_scaled = pd.DataFrame(X_test_scaled_np, columns=X_test_columns, index=X_test_index)
        print("Transformation complete.")

        print("\nShapes after scaling:")
        print(f"X_train_scaled shape: {X_train_scaled.shape}")
        print(f"X_test_scaled shape : {X_test_scaled.shape}")

        # Display head of scaled training data and describe
        print("\nHead of X_train_scaled:")
        display(X_train_scaled.head(3))
        print("\nDescribe X_train_scaled (mean should be ~0, std dev ~1):")
        # Use .describe() and check mean/std - round for clarity
        display(X_train_scaled.describe().round(2))

    except Exception as e:
        print(f"An error occurred during scaling: {e}")
        X_train_scaled, X_test_scaled = None, None # Set to None on error

else:
    print("Error: X_train or X_test not found. Cannot perform scaling.")

Fitting StandardScaler on X_train...
Scaler fitted successfully.

Transforming X_train and X_test...
Transformation complete.

Shapes after scaling:
X_train_scaled shape: (7014, 66)
X_test_scaled shape : (1754, 66)

Head of X_train_scaled:


Unnamed: 0,host_acceptance_rate,host_identity_verified,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,has_reviews,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log,host_response_time_Unknown,host_response_time_days_or_more,host_response_time_within_day,host_response_time_within_hour,host_response_time_within_hours,room_type_Entire_home/apt,room_type_Hotel_room,room_type_Private_room,room_type_Shared_room,neighbourhood_group_Near_Center_East,neighbourhood_group_Near_Center_West_South,neighbourhood_group_New_Town_Vinohrady,neighbourhood_group_North_West_Districts,neighbourhood_group_Old_Town_Center,neighbourhood_group_Outer_Districts,property_type_freq
7040,0.318,0.128,-1.002,-0.146,-0.433,-0.393,-0.398,-0.234,0.118,-0.347,-0.922,-0.384,-0.393,0.41,0.709,0.673,0.63,-1.28,-0.344,-0.282,-0.094,0.105,0.768,0.111,0.277,0.494,0.274,0.476,0.331,0.579,0.831,-0.152,-0.63,0.846,2.142,1.301,-0.346,0.302,-0.11,-0.48,-0.459,0.302,-0.071,-0.685,0.164,-0.297,0.423,-0.336,0.175,0.794,-0.255,-0.124,-0.214,0.465,-0.248,0.425,-0.092,-0.395,-0.098,-0.5,-0.312,2.146,-0.288,-0.767,-0.306,0.801
7863,0.372,0.128,0.226,-0.055,0.738,-0.393,-0.398,0.218,-0.345,-1.185,-1.242,-1.465,-0.226,1.277,0.577,0.799,0.63,0.781,-0.603,-0.282,-0.094,1.63,0.843,0.111,0.277,-2.022,0.274,0.476,0.331,0.579,0.831,-0.152,-0.63,0.846,-0.134,-1.618,-0.35,0.302,-0.11,-0.48,-0.459,0.344,0.541,-1.455,0.417,-0.297,1.469,0.963,-0.651,1.134,-0.255,-0.124,-0.214,0.465,-0.248,0.425,-0.092,-0.395,-0.098,-0.5,-0.312,-0.466,-0.288,1.304,-0.306,-1.176
7652,0.318,0.128,-0.373,-0.7,1.908,1.843,1.78,1.122,0.118,-0.347,-0.601,-0.111,-0.676,-0.851,0.807,-3.404,-2.297,-1.28,0.002,-0.282,-0.094,-0.967,0.392,0.111,0.277,0.494,0.274,0.476,0.331,0.579,0.831,-0.152,1.587,0.846,-0.134,1.12,-0.237,0.302,-0.11,2.253,-0.459,0.302,1.404,0.769,-1.495,1.673,-1.306,1.79,0.653,-1.133,-0.255,-0.124,-0.214,0.465,-0.248,0.425,-0.092,-0.395,-0.098,-0.5,3.203,-0.466,-0.288,-0.767,-0.306,0.801



Describe X_train_scaled (mean should be ~0, std dev ~1):


Unnamed: 0,host_acceptance_rate,host_identity_verified,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review,has_reviews,calculated_host_listings_count_shared_rooms_log,bathrooms_log,calculated_host_listings_count_private_rooms_log,host_acceptance_rate_log,beds_log,days_since_last_review_log,number_of_reviews_log,bedrooms_log,reviews_per_month_log,accommodates_log,calculated_host_listings_count_entire_homes_log,number_of_reviews_ltm_log,host_response_time_Unknown,host_response_time_days_or_more,host_response_time_within_day,host_response_time_within_hour,host_response_time_within_hours,room_type_Entire_home/apt,room_type_Hotel_room,room_type_Private_room,room_type_Shared_room,neighbourhood_group_Near_Center_East,neighbourhood_group_Near_Center_West_South,neighbourhood_group_New_Town_Vinohrady,neighbourhood_group_North_West_Districts,neighbourhood_group_Old_Town_Center,neighbourhood_group_Outer_Districts,property_type_freq
count,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0,7014.0
mean,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-5.1,-7.81,-7.81,-4.75,-1.21,-1.88,-1.49,-1.14,-0.35,-1.25,-1.56,-1.7,-0.68,-0.89,-12.31,-16.01,-11.95,-1.28,-0.69,-0.28,-0.09,-1.01,-2.46,-8.98,-3.61,-2.02,-3.65,-2.1,-3.02,-1.73,-1.2,-0.15,-0.63,-1.18,-4.69,-1.83,-0.35,-3.31,-0.11,-3.86,-0.46,-5.52,-2.41,-1.98,-1.89,-2.27,-1.45,-1.94,-1.48,-1.6,-0.26,-0.12,-0.21,-2.15,-0.25,-2.35,-0.09,-0.39,-0.1,-0.5,-0.31,-0.47,-0.29,-0.77,-0.31,-1.47
25%,0.21,0.13,-0.36,-0.38,-0.82,-0.39,-0.4,-0.69,-0.35,-0.35,-0.81,-0.88,-0.63,-0.81,-0.24,-0.25,-0.17,-1.28,-0.65,-0.28,-0.09,-0.82,-0.66,0.11,0.28,0.49,0.27,0.48,0.33,-1.73,-1.2,-0.15,-0.63,-1.18,-0.13,-0.93,-0.35,0.3,-0.11,-0.48,-0.46,0.22,-0.93,-0.68,-0.7,-0.3,-0.92,-1.0,-0.96,-0.86,-0.26,-0.12,-0.21,0.47,-0.25,0.42,-0.09,-0.39,-0.1,-0.5,-0.31,-0.47,-0.29,-0.77,-0.31,-1.18
50%,0.37,0.13,0.04,-0.05,-0.04,-0.39,-0.4,-0.23,-0.11,-0.35,-0.07,-0.03,-0.4,-0.38,0.25,0.25,0.21,0.78,-0.47,-0.28,-0.09,-0.27,0.09,0.11,0.28,0.49,0.27,0.48,0.33,0.58,0.83,-0.15,-0.63,0.85,-0.13,0.2,-0.34,0.3,-0.11,-0.48,-0.46,0.34,-0.07,-0.25,0.15,-0.3,0.01,0.18,-0.13,0.19,-0.26,-0.12,-0.21,0.47,-0.25,0.42,-0.09,-0.39,-0.1,-0.5,-0.31,-0.47,-0.29,-0.77,-0.31,0.8
75%,0.37,0.13,0.49,0.41,0.35,0.35,0.69,0.22,-0.11,1.55,0.79,0.89,0.18,0.61,0.58,0.59,0.5,0.78,0.13,-0.28,-0.09,0.59,0.69,0.11,0.28,0.49,0.27,0.48,0.33,0.58,0.83,-0.15,1.59,0.85,-0.13,0.74,-0.29,0.3,-0.11,0.61,-0.46,0.34,0.54,0.44,0.78,0.86,0.84,0.6,0.78,0.89,-0.26,-0.12,-0.21,0.47,-0.25,0.42,-0.09,-0.39,-0.1,-0.5,-0.31,-0.47,-0.29,1.3,-0.31,0.8
max,0.37,0.13,5.44,7.87,4.64,16.0,11.58,13.33,6.37,1.55,1.64,1.62,15.35,12.15,0.81,0.8,0.92,0.78,3.67,9.18,11.13,18.87,5.8,0.11,0.28,0.49,0.27,0.48,0.33,0.58,0.83,6.59,1.59,0.85,2.14,2.24,3.29,0.3,10.6,8.66,4.46,0.34,5.03,2.43,2.45,5.02,4.4,3.02,2.0,2.34,3.92,8.07,4.67,0.47,4.04,0.42,10.86,2.53,10.18,2.0,3.2,2.15,3.47,1.3,3.27,0.8


*Observation:* `StandardScaler` was successfully fitted using only the training data (`X_train`). Both `X_train` and `X_test` were then transformed using this fitted scaler. The resulting scaled DataFrames (`X_train_scaled`, `X_test_scaled`) have the same shape as the originals. Descriptive statistics for `X_train_scaled` confirm that features now generally have a mean close to 0 and a standard deviation close to 1, indicating successful standardization.

### Save Fitted Scaler
Save the `scaler` object (which contains the learned means and standard deviations from `X_train`) to a file using `joblib` for later use during prediction on new data or in deployment.

In [48]:
# P9.3 Save Fitted Scaler
if 'scaler' in locals() and scaler is not None:
    # Define path within the model directory
    scaler_save_path = '../model/standard_scaler.joblib'
    print(f"\nSaving fitted StandardScaler to: {scaler_save_path}")
    
    try:
        joblib.dump(scaler, scaler_save_path)
        print("Scaler saved successfully.")
    except Exception as e:
        print(f"Error saving scaler: {e}")
        
else:
    print("Error: Fitted scaler object not found. Cannot save.")

# Optional: Also save final column list used for scaling
if 'X_train_columns' in locals():
     final_features_path = '../model/final_feature_list.joblib'
     print(f"Saving final feature list to: {final_features_path}")
     try:
        joblib.dump(X_train_columns.tolist(), final_features_path)
        print("Final feature list saved successfully.")
     except Exception as e:
        print(f"Error saving feature list: {e}")


Saving fitted StandardScaler to: ../model/standard_scaler.joblib
Scaler saved successfully.
Saving final feature list to: ../model/final_feature_list.joblib
Final feature list saved successfully.


### Save Processed Data (Final Step of Prep)
Save the final, scaled training and testing sets (`X_train_scaled`, `X_test_scaled`, `y_train`, `y_test`) for easy loading in the modeling phase.

In [49]:
# P10 Save Processed Data Splits
if ('X_train_scaled' in locals() and 'X_test_scaled' in locals() and 
    'y_train' in locals() and 'y_test' in locals() and 
    X_train_scaled is not None and X_test_scaled is not None):
    
    print("\nSaving final scaled data splits...")
    
    # Define paths
    xtrain_path = '../data/processed/X_train_scaled.parquet'
    xtest_path = '../data/processed/X_test_scaled.parquet'
    ytrain_path = '../data/processed/y_train.parquet'
    ytest_path = '../data/processed/y_test.parquet'

    try:
        # Save features (already DataFrames)
        X_train_scaled.to_parquet(xtrain_path, index=True) # Keep index if useful
        X_test_scaled.to_parquet(xtest_path, index=True) 
        
        # Save targets (Series) - convert to DataFrame for consistency or save as is
        y_train.to_frame().to_parquet(ytrain_path, index=True) 
        y_test.to_frame().to_parquet(ytest_path, index=True)
        
        print(f"Successfully saved scaled data splits to '{os.path.dirname(xtrain_path)}'")
        
    except Exception as e:
        import os # Import needed here for dirname
        print(f"Error saving data splits: {e}")
        
else:
    print("Error: Scaled data splits (X_train_scaled, etc.) not found. Cannot save.")


Saving final scaled data splits...
Error saving data splits: name 'os' is not defined


*Observation:* The final scaled training and testing feature sets (`X_train_scaled`, `X_test_scaled`) and corresponding target variables (`y_train`, `y_test`) were saved as Parquet files in the `data/processed/` directory. This completes the Data Preparation phase, providing clean, processed, and scaled data ready for direct input into the modeling phase.