# Data Preparation

**Goal:** Transform the raw listings data into a clean, processed dataset suitable for machine learning models. This involves handling missing values, converting data types, engineering features, encoding categorical variables, and splitting the data.

## 1. Initial Setup & Loading
*   Import necessary libraries.
*   Load the raw dataset (`listings.csv`).
*   Define and prepare the target variable (`log_price`).

### Import Libraries
Import essential libraries for data manipulation, numerical operations, preprocessing, and saving objects.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re # For regex operations if needed later (e.g., parsing text)
import joblib # For saving preprocessors/models (alternative: import pickle)

# Scikit-learn modules for preprocessing and splitting
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
# Add other encoders (e.g., TargetEncoder) or transformers later as needed

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100) 
pd.set_option('display.float_format', lambda x: '%.3f' % x) # Adjust float format if desired

print("Libraries imported.")

Libraries imported.


### Load Raw Data
Load the `listings.csv` file identified during Data Understanding. We'll use a new DataFrame name (`df_prep`) to distinguish it from the EDA DataFrame.

In [2]:
# Define the path relative to the notebook location in notebooks/
listings_path = '../data/raw/listings.csv'

# Load the dataset
try:
    # Use a new variable name for the preparation phase
    df_prep = pd.read_csv(listings_path, low_memory=False) 
    print(f"Successfully loaded {listings_path} into df_prep")
    print(f"Initial shape: {df_prep.shape}")
except FileNotFoundError:
    print(f"Error: File not found at {listings_path}. Ensure the file exists.")
    # Optionally exit or raise error if file is crucial
    df_prep = None 
except Exception as e:
    print(f"An error occurred during loading: {e}")
    df_prep = None

# Display first few rows to confirm loading
if df_prep is not None:
    display(df_prep.head(3))

Successfully loaded ../data/raw/listings.csv into df_prep
Initial shape: (10108, 79)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,23163,https://www.airbnb.com/rooms/23163,20250316041547,2025-03-16,city scrape,Residence Karolina - KAROL12,"Unique and elegant apartment rental in Prague,...",,https://a0.muscache.com/pictures/01bbe32c-3f13...,5282,https://www.airbnb.com/users/show/5282,Klara,2008-12-17,"Prague, Czechia","Hello, \r\nglad to see that you are interested...",within an hour,100%,100%,t,https://a0.muscache.com/im/pictures/user/b7309...,https://a0.muscache.com/im/pictures/user/b7309...,Josefov,72.0,82.0,"['email', 'phone']",t,t,,Praha 1,,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1 bath,1.0,2.0,"[""Coffee maker"", ""Dishwasher"", ""Bed linens"", ""...","$2,918.00",1,365,1,7,60,731,1.4,663.6,,t,0,0,0,0,2025-03-16,31,1,0,0,1,6,17508.0,2010-09-20,2024-06-15,4.9,4.83,5.0,5.0,4.97,4.93,4.86,,t,70,69,0,0,0.18
1,23169,https://www.airbnb.com/rooms/23169,20250316041547,2025-03-16,city scrape,Residence Masna - Masna302,Masna studio offers a lot of space and privacy...,,https://a0.muscache.com/pictures/b450cf2a-8561...,5282,https://www.airbnb.com/users/show/5282,Klara,2008-12-17,"Prague, Czechia","Hello, \r\nglad to see that you are interested...",within an hour,100%,100%,t,https://a0.muscache.com/im/pictures/user/b7309...,https://a0.muscache.com/im/pictures/user/b7309...,Josefov,72.0,82.0,"['email', 'phone']",t,t,,Praha 1,,50.088,14.423,Entire rental unit,Entire home/apt,3,1.0,1 bath,1.0,2.0,"[""Patio or balcony"", ""Coffee maker"", ""Bed line...",,1,365,1,7,60,731,1.2,710.6,,t,7,13,13,13,2025-03-16,122,6,0,13,8,36,,2010-05-07,2024-11-08,4.74,4.6,4.83,4.81,4.87,4.97,4.7,,t,70,69,0,0,0.67
2,26755,https://www.airbnb.com/rooms/26755,20250316041547,2025-03-16,city scrape,Central Prague Old Town Top Floor,Big and beautiful new attic apartment in the v...,This apartment offers a fantastic location. Yo...,https://a0.muscache.com/pictures/miso/Hosting-...,113902,https://www.airbnb.com/users/show/113902,Daniel+Bea,2010-04-26,"Prague, Czechia",Hi! we are a sp/cz couple with 2 daughters (La...,within an hour,100%,98%,t,https://a0.muscache.com/im/pictures/user/8db01...,https://a0.muscache.com/im/pictures/user/8db01...,Staré Město,4.0,4.0,"['email', 'phone']",t,t,"Prague, Hlavní město Praha, Czechia",Praha 1,,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.5 baths,1.0,2.0,"[""AC - split type ductless system"", ""Coffee ma...","$1,582.00",3,700,3,4,1125,1125,3.1,1125.0,,t,3,7,24,173,2025-03-16,411,53,3,173,57,255,403410.0,2015-05-19,2025-03-07,4.94,4.95,4.92,4.93,4.96,4.93,4.9,,f,3,3,0,0,3.43


### Define and Prepare Target Variable (`log_price`)
Based on the Data Understanding phase, the target variable for modeling will be the log-transformed price (`log1p`). We need to re-apply the cleaning steps to the `price` column and calculate `log_price`.

In [3]:
if df_prep is not None:
    target_col = 'log_price' # Define the final target column name
    
    # 1. Clean original 'price' column if it exists and is object type
    if 'price' in df_prep.columns and df_prep['price'].dtype == 'object':
        print("Cleaning 'price' column...")
        price_cleaned_temp = df_prep['price'].astype(str).str.replace('[$,]', '', regex=True)
        price_cleaned_temp = pd.to_numeric(price_cleaned_temp, errors='coerce')
        
        # Check if all are integers to use Int64
        is_integer = (price_cleaned_temp.dropna() % 1 == 0).all()
        if is_integer:
             df_prep['price_cleaned'] = price_cleaned_temp.astype('Int64')
             print("Stored cleaned price as Int64.")
        else:
             df_prep['price_cleaned'] = price_cleaned_temp
             print("Stored cleaned price as float64.")
             
    elif 'price' in df_prep.columns and pd.api.types.is_numeric_dtype(df_prep['price']):
         print("'price' column already numeric. Copying to 'price_cleaned'.")
         df_prep['price_cleaned'] = df_prep['price'] # Keep original numeric type for now
    else:
         print("Warning: 'price' column not found or not object/numeric. Cannot create 'price_cleaned'.")

    # 2. Calculate log_price
    if 'price_cleaned' in df_prep.columns:
        df_prep[target_col] = np.log1p(df_prep['price_cleaned'])
        print(f"Calculated target variable '{target_col}'.")
        
        # 3. Verify results
        print(f"\nData types related to target:")
        print(df_prep[['price', 'price_cleaned', target_col]].info()) # Show dtypes and non-null counts
        
        print(f"\nExample values:")
        display(df_prep[['price', 'price_cleaned', target_col]].head())
        
        # Check for NaNs created by log1p (should only be where price_cleaned was NaN)
        log_price_nan_count = df_prep[target_col].isnull().sum()
        price_cleaned_nan_count = df_prep['price_cleaned'].isnull().sum()
        if log_price_nan_count == price_cleaned_nan_count:
             print(f"\nConfirmed: {log_price_nan_count} missing values in '{target_col}' match missing cleaned prices.")
        else:
             print(f"\nWarning: Mismatch in NaN counts between price_cleaned ({price_cleaned_nan_count}) and {target_col} ({log_price_nan_count}).")
             
    else:
        print("Error: 'price_cleaned' column not created. Cannot calculate log_price.")

Cleaning 'price' column...
Stored cleaned price as Int64.
Calculated target variable 'log_price'.

Data types related to target:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          8808 non-null   object 
 1   price_cleaned  8808 non-null   Int64  
 2   log_price      8808 non-null   Float64
dtypes: Float64(1), Int64(1), object(1)
memory usage: 256.8+ KB
None

Example values:


| | price | price_cleaned | log_price |
|---|---|---|---|
| 0 | $2,918.00 | 2918.0 | 7.979 |
| 1 | | | |
| 2 | $1,582.00 | 1582.0 | 7.367 |
| 3 | $860.00 | 860.0 | 6.758 |
| 4 | $629.00 | 629.0 | 6.446 |



Confirmed: 1300 missing values in 'log_price' match missing cleaned prices.
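
The target preparation above can be condensed into a small helper. This is a minimal sketch on toy values, not the notebook's actual cell: the function name `price_to_log` and the sample Series are hypothetical, and it assumes prices arrive as strings such as `$2,918.00`. It also shows that `np.expm1` inverts `np.log1p`, which is one way to map model predictions back to the price scale.

```python
import numpy as np
import pandas as pd


def price_to_log(price: pd.Series) -> pd.Series:
    """Strip '$' and ',' from price strings and return log(1 + price)."""
    cleaned = price.str.replace(r"[$,]", "", regex=True)
    # Values that cannot be parsed (including missing ones) become NaN
    numeric = pd.to_numeric(cleaned, errors="coerce")
    return np.log1p(numeric)


# Toy values in the same string format as the raw 'price' column
raw = pd.Series(["$2,918.00", None, "$860.00"])
log_price = price_to_log(raw)

# expm1 is the inverse of log1p, so it recovers the original price scale
recovered = np.expm1(log_price)

print(log_price.round(3).tolist())
print(recovered.round(2).tolist())
```

Missing prices stay missing after the transform, which is why the NaN counts for `price_cleaned` and `log_price` are expected to match.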


## 2. Initial Cleaning & Filtering

This phase focuses on removing data that is unusable or irrelevant for modeling based on our Data Understanding findings. We will:
*   Remove listings (rows) that are missing the target variable (`log_price`).
*   Remove features (columns) that are constant, entirely empty, contain redundant information, are identifiers/URLs, or are text fields we've chosen not to use initially.

### Drop Rows with Missing Target
Remove rows where the `log_price` is missing, as these cannot be used for supervised learning. Document the number of rows removed.

In [4]:
if df_prep is not None:
    initial_rows = df_prep.shape[0]
    print(f"Initial number of rows: {initial_rows}")
    
    # Drop rows where target_col ('log_price') is NaN
    df_prep.dropna(subset=[target_col], inplace=True)
    
    final_rows = df_prep.shape[0]
    rows_dropped = initial_rows - final_rows
    print(f"Number of rows after dropping missing target: {final_rows}")
    print(f"Number of rows dropped: {rows_dropped} ({rows_dropped/initial_rows:.2%})")
else:
    print("Error: df_prep DataFrame not found.")

Initial number of rows: 10108
Number of rows after dropping missing target: 8808
Number of rows dropped: 1300 (12.86%)


*Observation:* Rows with missing target values (`log_price`) have been successfully removed, reducing the dataset size from 10,108 to 8,808 rows. This ensures all remaining data has a valid target for model training and evaluation.

### Drop Constant, High-Missing, Redundant, ID/URL, and Unused Text Columns
Define a list of columns identified for removal during Data Understanding and drop them from the DataFrame.

In [5]:
if df_prep is not None:
    initial_cols = df_prep.shape[1]
    print(f"Number of columns before dropping: {initial_cols}")

    # Define columns to drop based on EDA and feature selection decisions
    
    # 1. Constant / Empty Columns
    cols_to_drop_constant = ['calendar_updated', 'license', 'neighbourhood_group_cleansed'] 
    for col in ['last_scraped', 'calendar_last_scraped', 'scrape_id', 'source']:
         if col in df_prep.columns and df_prep[col].nunique(dropna=True) <= 1:
             print(f"   Confirming '{col}' is constant/near-constant, adding to drop list.")
             cols_to_drop_constant.append(col)

    # Check 'has_availability' after filtering rows
    if 'has_availability' in df_prep.columns and df_prep['has_availability'].nunique(dropna=True) <= 1:
        print(f"   Confirming 'has_availability' is constant/near-constant after filtering, adding to drop list.")
        cols_to_drop_constant.append('has_availability')
    elif 'has_availability' in df_prep.columns:
         print(f"   Note: 'has_availability' has >1 unique value, keeping for now.")


    # 2. High Missingness / Redundant Text/Location
    cols_to_drop_missing_text = ['neighbourhood', 'neighborhood_overview', 'host_about', 'host_location', 'host_neighbourhood']

    # 3. Redundant / Replaced / Intermediate Columns
    cols_to_drop_redundant = ['bathrooms_text', 'price', 'price_cleaned'] # Numeric 'bathrooms' and the 'log_price' target are kept instead

    # 4. IDs / URLs
    cols_to_drop_ids_urls = ['id', 'listing_url', 'picture_url', 'host_id', 'host_url', 'host_name', 
                             'host_thumbnail_url', 'host_picture_url']

    # 5. Text Columns (Initial decision: drop)
    cols_to_drop_text = ['name', 'description']
    
    # 6. Detailed Min/Max Nights (Keep minimum_nights, maximum_nights)
    cols_to_drop_detailed_nights = ['minimum_minimum_nights', 'maximum_minimum_nights', 
                                    'minimum_maximum_nights', 'maximum_maximum_nights', 
                                    'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm']

    # 7. Redundant Availability (Keep availability_30, availability_365)
    cols_to_drop_availability = ['availability_60', 'availability_90', 'availability_eoy'] 

    # 8. Redundant Review Counts (Keep number_of_reviews, number_of_reviews_ltm)
    cols_to_drop_review_counts = ['number_of_reviews_l30d', 'number_of_reviews_ly']
    
    # 9. Leaky / Derived / Unclear Columns
    cols_to_drop_leaky = ['estimated_occupancy_l365d', 'estimated_revenue_l365d']

    # 10. Redundant Host Counts (Keep calculated_host_listings_count_ by type)
    cols_to_drop_host_counts = ['host_listings_count', 'host_total_listings_count', 'calculated_host_listings_count']

    # 11. Weak Signals / Redundant Scores
    cols_to_drop_weak_signals = [
        'host_has_profile_pic',
        'review_scores_accuracy', 
        'review_scores_cleanliness', 
        'review_scores_checkin', 
        'review_scores_communication',
        'host_is_superhost'
        ]


    # Combine all lists, ensuring uniqueness
    all_cols_to_drop = list(set(
        cols_to_drop_constant +
        cols_to_drop_missing_text +
        cols_to_drop_redundant +
        cols_to_drop_ids_urls +
        cols_to_drop_text +
        cols_to_drop_detailed_nights +
        cols_to_drop_availability +
        cols_to_drop_review_counts +
        cols_to_drop_leaky +
        cols_to_drop_host_counts +
        cols_to_drop_weak_signals
    ))

    # Keep only the columns that actually exist in df_prep, so the drop
    # below cannot fail on a name that is already absent
    actual_columns_in_df = df_prep.columns.tolist()
    existing_cols_to_drop = [col for col in all_cols_to_drop if col in actual_columns_in_df]

    
    print(f"\nColumns identified for dropping ({len(existing_cols_to_drop)}):")
    print(sorted(existing_cols_to_drop)) 

    # Drop the columns
    if existing_cols_to_drop: # Only drop if list is not empty
        df_prep.drop(columns=existing_cols_to_drop, inplace=True, errors='ignore') # errors='ignore' is safer

    final_cols = df_prep.shape[1]
    cols_dropped_count = initial_cols - final_cols
    print(f"\nNumber of columns after dropping: {final_cols}")
    print(f"Number of columns dropped: {cols_dropped_count}")

else:
    print("Error: df_prep DataFrame not found.")

Number of columns before dropping: 81
   Confirming 'last_scraped' is constant/near-constant, adding to drop list.
   Confirming 'calendar_last_scraped' is constant/near-constant, adding to drop list.
   Confirming 'scrape_id' is constant/near-constant, adding to drop list.
   Confirming 'source' is constant/near-constant, adding to drop list.
   Note: 'has_availability' has >1 unique value, keeping for now.

Columns identified for dropping (47):
['availability_60', 'availability_90', 'availability_eoy', 'bathrooms_text', 'calculated_host_listings_count', 'calendar_last_scraped', 'calendar_updated', 'description', 'estimated_occupancy_l365d', 'estimated_revenue_l365d', 'host_about', 'host_has_profile_pic', 'host_id', 'host_is_superhost', 'host_listings_count', 'host_location', 'host_name', 'host_neighbourhood', 'host_picture_url', 'host_thumbnail_url', 'host_total_listings_count', 'host_url', 'id', 'last_scraped', 'license', 'listing_url', 'maximum_maximum_nights', 'maximum_minimum_nig

*Observation:* 47 of the 81 columns were dropped, leaving 34. The removed columns were identified during Data Understanding as constant, empty, redundant, identifiers/URLs, high in missingness, or unused text fields. This simplifies the dataset considerably and focuses it on potentially predictive features.
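
The constant/empty check used for the scrape metadata columns can be generalized. The following is a minimal sketch on a toy DataFrame; the helper name `find_constant_columns` and the toy data are hypothetical and only illustrate the `nunique(dropna=True) <= 1` test from the cell above.

```python
import pandas as pd


def find_constant_columns(df: pd.DataFrame) -> list:
    """Return columns with at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]


toy = pd.DataFrame({
    "scrape_id": [1, 1, 1],          # constant
    "license": [None, None, None],   # entirely empty
    "room_type": ["a", "b", "a"],    # varies, so it is kept
})

to_drop = find_constant_columns(toy)
trimmed = toy.drop(columns=to_drop)

print(to_drop)
print(trimmed.columns.tolist())
```

An entirely empty column has zero distinct non-null values, so the same test catches both constant and empty columns.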

### Drop Rows with Missing Key Features
Drop rows that contain missing values in a predefined list of key feature columns (`host_since`, `host_verifications`, `host_identity_verified`, `bathrooms`, `bedrooms`, `beds`, `has_availability`) to avoid needing imputation for these specific features later.

In [6]:
# P2.2 Drop Rows with Missing Key Features
if 'df_prep' in locals() and df_prep is not None and not df_prep.empty:
    # Store row count before this specific step
    rows_before_feature_nan_drop = df_prep.shape[0]
    print(f"Number of rows before dropping missing key features: {rows_before_feature_nan_drop}")

    # Define columns where missing values will lead to row removal
    cols_for_nan_row_drop = [
        'host_since', 
        'host_verifications', 
        'host_identity_verified', 
        'bathrooms', 
        'bedrooms', 
        'beds', 
        'has_availability'
        ]
    
    # Ensure these columns actually exist in the DataFrame 
    actual_cols_for_nan_row_drop = [col for col in cols_for_nan_row_drop if col in df_prep.columns]
    
    if actual_cols_for_nan_row_drop:
        print(f"\nChecking for NaNs and dropping rows if missing in: {actual_cols_for_nan_row_drop}")
        
        # Show missing counts BEFORE dropping for context
        nans_before_drop = df_prep[actual_cols_for_nan_row_drop].isnull().sum()
        print("\nMissing counts in specified columns BEFORE dropping:")
        print(nans_before_drop[nans_before_drop > 0]) # Show only those with NaNs
        
        # Row count of the raw file as reported at load time.
        # Hard-coded here; storing it in a variable right after loading would avoid drift.
        initial_total_rows = 10108

        # Drop rows where ANY of the specified columns are NaN
        df_prep.dropna(subset=actual_cols_for_nan_row_drop, inplace=True)
        
        rows_after_feature_nan_drop = df_prep.shape[0]
        feature_rows_dropped_this_step = rows_before_feature_nan_drop - rows_after_feature_nan_drop
        total_rows_dropped_cumulative = initial_total_rows - rows_after_feature_nan_drop
        
        print(f"\nNumber of rows AFTER dropping missing key features: {rows_after_feature_nan_drop}")
        print(f"   Rows dropped specifically in this step: {feature_rows_dropped_this_step}")
        
        # Verify NaNs are gone in these specific columns
        nans_after_drop = df_prep[actual_cols_for_nan_row_drop].isnull().sum().sum()
        if nans_after_drop == 0:
            print(f"   Successfully removed rows with NaNs in checked columns.")
            print(f"Cumulative rows dropped since start: {total_rows_dropped_cumulative} ({total_rows_dropped_cumulative/initial_total_rows:.2%})")
        else:
            # This shouldn't happen with dropna, but good check
            print(f"   Warning: {nans_after_drop} NaNs still found in checked columns after dropping.") 
            print(f"Cumulative rows dropped since start: {total_rows_dropped_cumulative} ({total_rows_dropped_cumulative/initial_total_rows:.2%})")


    else:
        print("\nNone of the specified key feature columns for NaN row drop were found in the DataFrame.")

else:
    print("Error: df_prep DataFrame not found or is empty.")

Number of rows before dropping missing key features: 8808

Checking for NaNs and dropping rows if missing in: ['host_since', 'host_verifications', 'host_identity_verified', 'bathrooms', 'bedrooms', 'beds', 'has_availability']

Missing counts in specified columns BEFORE dropping:
host_since                 1
host_verifications         1
host_identity_verified     1
bathrooms                  2
bedrooms                   7
beds                      16
has_availability          15
dtype: int64

Number of rows AFTER dropping missing key features: 8768
   Rows dropped specifically in this step: 40
   Successfully removed rows with NaNs in checked columns.
Cumulative rows dropped since start: 1340 (13.26%)


*Observation:* Based on the strategy to avoid imputation for certain key features, rows containing missing values in `host_since`, `host_verifications`, `host_identity_verified`, `bathrooms`, `bedrooms`, `beds`, or `has_availability` were removed. This step dropped 40 rows, leaving 8,768 rows in which these columns are complete. Cumulatively, 1,340 of the original 10,108 rows (13.26%) have been removed.
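
The per-column missing counts printed above sum to 43, while only 40 rows were dropped. This is consistent with some rows missing more than one key feature at once. A minimal sketch on toy data (the DataFrame and column subset are made up for illustration) shows how to count affected rows rather than affected cells:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, np.nan],
    "beds":     [2.0, np.nan, np.nan, 1.0],
    "price":    [10, 20, 30, 40],
})
key_cols = ["bedrooms", "beds"]

# Missing cells per column (one row can contribute to several columns)
per_column_missing = toy[key_cols].isnull().sum()

# Rows with at least one missing key feature: this is what dropna removes
rows_with_any_missing = int(toy[key_cols].isnull().any(axis=1).sum())

cleaned = toy.dropna(subset=key_cols)
rows_dropped = len(toy) - len(cleaned)

print(per_column_missing.to_dict())
print(rows_with_any_missing, rows_dropped)
```

Checking `isnull().any(axis=1).sum()` before calling `dropna` gives the exact number of rows the step will remove.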

### Display Head and Info of Cleaned DataFrame
Show the first few rows and the updated info summary of the DataFrame after initial cleaning and filtering.

In [7]:
if df_prep is not None:
    print("\nDataFrame head after initial dropping:")
    display(df_prep.head())
    
    print("\nDataFrame info after initial dropping:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after initial dropping:


Unnamed: 0,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_verifications,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price
0,2008-12-17,within an hour,100%,100%,"['email', 'phone']",t,Praha 1,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1.0,2.0,"[""Coffee maker"", ""Dishwasher"", ""Bed linens"", ""...",1,365,t,0,0,31,1,2010-09-20,2024-06-15,4.9,4.93,4.86,t,69,0,0,0.18,7.979
2,2010-04-26,within an hour,100%,98%,"['email', 'phone']",t,Praha 1,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.0,2.0,"[""AC - split type ductless system"", ""Coffee ma...",3,700,t,3,173,411,53,2015-05-19,2025-03-07,4.94,4.93,4.9,f,3,0,0,3.43,7.367
3,2012-11-09,within an hour,100%,80%,"['email', 'phone']",t,Praha 3,50.087,14.445,Private room in rental unit,Private room,2,1.0,1.0,2.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,t,5,5,414,52,2013-01-04,2025-03-02,4.76,4.63,4.83,f,3,3,0,2.79,6.758
4,2012-11-09,within an hour,100%,80%,"['email', 'phone']",t,Praha 3,50.085,14.445,Private room in rental unit,Private room,2,1.0,1.0,3.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,t,3,3,389,47,2013-03-25,2025-03-01,4.69,4.59,4.73,f,3,3,0,2.67,6.446
5,2012-11-09,within an hour,100%,80%,"['email', 'phone']",t,Praha 3,50.085,14.446,Private room in rental unit,Private room,2,1.0,1.0,1.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,t,6,6,381,52,2013-02-06,2025-02-23,4.78,4.68,4.81,f,3,3,0,2.58,6.65



DataFrame info after initial dropping:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 34 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_since                                    8768 non-null   object 
 1   host_response_time                            8226 non-null   object 
 2   host_response_rate                            8226 non-null   object 
 3   host_acceptance_rate                          8514 non-null   object 
 4   host_verifications                            8768 non-null   object 
 5   host_identity_verified                        8768 non-null   object 
 6   neighbourhood_cleansed                        8768 non-null   object 
 7   latitude                                      8768 non-null   float64
 8   longitude                                     8768 non-null   float64
 9   property_type              

*Observation:* After the row and column drops, `df_prep` holds 8,768 rows and 34 columns (33 features plus the `log_price` target). The index still carries the original row labels (0 to 10107), so it is no longer contiguous. The dropped columns were constant, empty, redundant, identifiers/URLs, leaky, or less critical and more complex versions of other features (detailed min/max nights, some availability and review counts, total host counts). The remaining columns cover the core features related to location, size, host, reviews, rules, and availability.

## 3. Data Type Conversion & Cleaning

This phase focuses on converting columns identified during Data Understanding from their raw `object` format into appropriate data types (numeric, boolean, datetime) suitable for analysis, feature engineering, and modeling. We will handle:
*   Boolean columns ('t'/'f').
*   Percentage columns (string format).
*   Date columns (string format).

### Convert Boolean Columns ('t'/'f') to Numeric (1/0)
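Convert the remaining columns that contain only 't' and 'f' values to integers (1 for 't', 0 for 'f'). These columns have no missing values at this stage, so a plain `int64` cast is sufficient; a column with NaNs would need the nullable `Int64` dtype to keep them as `<NA>`.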

In [8]:
if 'df_prep' in locals() and df_prep is not None:
    # Columns identified during EDA as boolean 't'/'f' that might still exist
    potential_bool_tf_cols = [
        'host_identity_verified', # NaNs were dropped for this
        'has_availability',       # NaNs were dropped for this
        'instant_bookable'        # Should be complete based on info
        ]
        
    # Filter based on columns actually present
    bool_tf_cols_to_convert = [col for col in potential_bool_tf_cols if col in df_prep.columns]
    
    print(f"Converting the following 't'/'f' columns to numeric (1/0): {bool_tf_cols_to_convert}")

    converted_count = 0
    for col in bool_tf_cols_to_convert:
        # Map 't' to 1 and 'f' to 0. Explicitly handle strings.
        # NaNs will remain NaNs during map if not 't' or 'f'.
        map_dict = {'t': 1, 'f': 0}
        
        # Apply mapping only if column is object type to avoid errors
        if df_prep[col].dtype == 'object':
            original_nan_count = df_prep[col].isnull().sum()
            df_prep[col] = df_prep[col].map(map_dict)
            # Cast to plain int64. This is safe here because none of these columns
            # has missing values; with NaNs present, the nullable 'Int64' dtype is needed instead
            df_prep[col] = df_prep[col].astype('int64')
            converted_count += 1
            print(f" - Converted '{col}' to {df_prep[col].dtype}. Missing values before: {original_nan_count}, after: {df_prep[col].isnull().sum()}")
        else:
             print(f" - Skipping '{col}', not object type (already numeric or unexpected type).")

    print(f"\nSuccessfully converted {converted_count} boolean ('t'/'f') columns.")
    
    # Verify dtypes for these specific columns
    if bool_tf_cols_to_convert:
      print("\nVerifying Dtypes after conversion:")
      print(df_prep[bool_tf_cols_to_convert].info())

else:
    print("Error: df_prep DataFrame not found.")

Converting the following 't'/'f' columns to numeric (1/0): ['host_identity_verified', 'has_availability', 'instant_bookable']
 - Converted 'host_identity_verified' to int64. Missing values before: 0, after: 0
 - Converted 'has_availability' to int64. Missing values before: 0, after: 0
 - Converted 'instant_bookable' to int64. Missing values before: 0, after: 0

Successfully converted 3 boolean ('t'/'f') columns.

Verifying Dtypes after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   host_identity_verified  8768 non-null   int64
 1   has_availability        8768 non-null   int64
 2   instant_bookable        8768 non-null   int64
dtypes: int64(3)
memory usage: 274.0 KB
None


*Observation:* The 't'/'f' string columns `host_identity_verified`, `has_availability`, and `instant_bookable` have been converted to `int64`, with 't' mapped to 1 and 'f' to 0. None of the three contained missing values at this point, so a nullable dtype was not needed. `host_is_superhost` was dropped earlier as a weak signal and is therefore not part of this conversion.
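
The choice between `int64` and the nullable `Int64` dtype matters only when missing values are present. A minimal sketch on toy Series (both Series and the `TF_MAP` name are illustrative, not taken from the dataset):

```python
import pandas as pd

TF_MAP = {"t": 1, "f": 0}

complete = pd.Series(["t", "f", "t"])
with_gap = pd.Series(["t", None, "f"])

# No missing values: a plain int64 cast is safe
complete_int = complete.map(TF_MAP).astype("int64")

# A missing value is present: the nullable Int64 dtype keeps it as <NA>.
# Casting this Series to plain 'int64' would raise an error instead.
nullable_int = with_gap.map(TF_MAP).astype("Int64")

print(complete_int.dtype, complete_int.tolist())
print(nullable_int.dtype, nullable_int.isna().tolist())
```

If a boolean column with gaps is added back later, switching the cast to `'Int64'` avoids a failure at this step.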

### Clean and Convert Percentage Columns
Clean columns representing percentages (e.g., `host_response_rate`, `host_acceptance_rate`) by removing the '%' sign and converting them to numeric float values (representing proportions, e.g., 0.0 to 1.0).

In [9]:
if 'df_prep' in locals() and df_prep is not None:
    pct_cols_to_convert = ['host_response_rate', 'host_acceptance_rate']
    print(f"Cleaning and converting percentage columns: {pct_cols_to_convert}")
    
    converted_count = 0
    for col in pct_cols_to_convert:
        if col in df_prep.columns and df_prep[col].dtype == 'object':
            original_nan_count = df_prep[col].isnull().sum()
            # Remove '%', convert to numeric, divide by 100. Handle errors.
            numeric_col = pd.to_numeric(df_prep[col].str.replace('%', '', regex=False), errors='coerce')
            df_prep[col] = numeric_col / 100.0
            converted_count += 1
            print(f" - Converted '{col}' to {df_prep[col].dtype}. Missing values before: {original_nan_count}, after: {df_prep[col].isnull().sum()}")
        elif col in df_prep.columns:
             print(f" - Skipping '{col}', not object type.")
        else:
             print(f" - Skipping '{col}', column not found.")
             
    print(f"\nSuccessfully converted {converted_count} percentage columns.")

    # Verify dtypes and example values
    actual_pct_cols = [col for col in pct_cols_to_convert if col in df_prep.columns]
    if actual_pct_cols:
        print("\nVerifying Dtypes and examples after conversion:")
        print(df_prep[actual_pct_cols].info())
        display(df_prep[actual_pct_cols].describe())
        display(df_prep[actual_pct_cols].head())
        
else:
    print("Error: df_prep DataFrame not found.")

Cleaning and converting percentage columns: ['host_response_rate', 'host_acceptance_rate']
 - Converted 'host_response_rate' to float64. Missing values before: 542, after: 542
 - Converted 'host_acceptance_rate' to float64. Missing values before: 254, after: 254

Successfully converted 2 percentage columns.

Verifying Dtypes and examples after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_response_rate    8226 non-null   float64
 1   host_acceptance_rate  8514 non-null   float64
dtypes: float64(2)
memory usage: 205.5 KB
None


| | host_response_rate | host_acceptance_rate |
|---|---|---|
| count | 8226.0 | 8514.0 |
| mean | 0.974 | 0.929 |
| std | 0.121 | 0.185 |
| min | 0.0 | 0.0 |
| 25% | 1.0 | 0.97 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


| | host_response_rate | host_acceptance_rate |
|---|---|---|
| 0 | 1.0 | 1.0 |
| 2 | 1.0 | 0.98 |
| 3 | 1.0 | 0.8 |
| 4 | 1.0 | 0.8 |
| 5 | 1.0 | 0.8 |


*Observation:* Percentage columns (`host_response_rate`, `host_acceptance_rate`) have been successfully cleaned by removing the '%' sign and converting to `float64` format, representing proportions between 0.0 and 1.0. Missing values were preserved during the conversion.
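
The percentage cleanup can be sketched as a reusable helper. The function name `percent_to_proportion` and the toy values are hypothetical; in particular, the `"N/A"` string is an assumed example of an unparseable entry, not something confirmed to occur in the raw file.

```python
import pandas as pd


def percent_to_proportion(col: pd.Series) -> pd.Series:
    """Turn strings like '98%' into proportions like 0.98."""
    stripped = col.str.replace("%", "", regex=False)
    # Unparseable entries are coerced to NaN rather than raising
    return pd.to_numeric(stripped, errors="coerce") / 100.0


rates = pd.Series(["100%", "98%", None, "N/A"])
proportions = percent_to_proportion(rates)

print(proportions.tolist())
print(int(proportions.isna().sum()))
```

Because of `errors='coerce'`, comparing the NaN count before and after conversion (as the cell above does) is a quick way to detect values that failed to parse.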

### Convert Date Columns
Convert columns containing date information (identified as `host_since`, `first_review`, `last_review`) from object/string type to datetime objects.

In [10]:
if 'df_prep' in locals() and df_prep is not None:
    date_cols_to_convert = ['host_since', 'first_review', 'last_review']
    print(f"Converting date columns to datetime objects: {date_cols_to_convert}")

    converted_count = 0
    # Suppress potential warnings about format inference if desired (as in EDA)
    # import warnings
    # with warnings.catch_warnings():
    #    warnings.simplefilter("ignore", category=UserWarning)
    
    for col in date_cols_to_convert:
        if col in df_prep.columns and df_prep[col].dtype == 'object':
            original_nan_count = df_prep[col].isnull().sum()
            # Convert to datetime, coerce errors to NaT (Not a Time)
            df_prep[col] = pd.to_datetime(df_prep[col], errors='coerce')
            converted_count += 1
            print(f" - Converted '{col}' to {df_prep[col].dtype}. Missing values (NaT) before: {original_nan_count}, after: {df_prep[col].isnull().sum()}")
        elif col in df_prep.columns:
             print(f" - Skipping '{col}', not object type.")
        else:
             print(f" - Skipping '{col}', column not found.")

    print(f"\nSuccessfully converted {converted_count} date columns.")

    # Verify dtypes and example values
    actual_date_cols = [col for col in date_cols_to_convert if col in df_prep.columns]
    if actual_date_cols:
        print("\nVerifying Dtypes and examples after conversion:")
        print(df_prep[actual_date_cols].info())
        display(df_prep[actual_date_cols].head())

else:
    print("Error: df_prep DataFrame not found.")

Converting date columns to datetime objects: ['host_since', 'first_review', 'last_review']
 - Converted 'host_since' to datetime64[ns]. Missing values (NaT) before: 0, after: 0
 - Converted 'first_review' to datetime64[ns]. Missing values (NaT) before: 746, after: 746
 - Converted 'last_review' to datetime64[ns]. Missing values (NaT) before: 746, after: 746

Successfully converted 3 date columns.

Verifying Dtypes and examples after conversion:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   host_since    8768 non-null   datetime64[ns]
 1   first_review  8022 non-null   datetime64[ns]
 2   last_review   8022 non-null   datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 274.0 KB
None


Unnamed: 0,host_since,first_review,last_review
0,2008-12-17,2010-09-20,2024-06-15
2,2010-04-26,2015-05-19,2025-03-07
3,2012-11-09,2013-01-04,2025-03-02
4,2012-11-09,2013-03-25,2025-03-01
5,2012-11-09,2013-02-06,2025-02-23


*Observation:* Date-related columns (`host_since`, `first_review`, `last_review`) have been successfully converted from object type to datetime64[ns] format. Missing values were preserved as NaT (Not a Time). These columns are now ready for date-based feature engineering.
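
To illustrate why the before/after missing counts are worth printing, here is a small sketch on made-up values (the Series contents are hypothetical). With `errors='coerce'`, a malformed string is silently turned into NaT, so an increase in the missing count is the only sign that parsing failed for some rows.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one missing entry, one malformed string
raw_dates = pd.Series(["2012-11-09", None, "not a date"], dtype=object)
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Missing and unparseable entries both end up as NaT
n_missing_before = int(raw_dates.isna().sum())
n_missing_after = int(parsed.isna().sum())
print(f"Missing before: {n_missing_before}, after: {n_missing_after}")
```

In the notebook's output the counts are identical for all three columns, which suggests that every non-missing date string was parsed.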

### Display Info After Type Conversions
Show the updated DataFrame info summary to see the effect of the type conversions.

In [11]:
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame head after type conversions:")
    display(df_prep.head())
    
    print("\nDataFrame info after type conversions:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after type conversions:


Unnamed: 0,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_verifications,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price
0,2008-12-17,within an hour,1.0,1.0,"['email', 'phone']",1,Praha 1,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1.0,2.0,"[""Coffee maker"", ""Dishwasher"", ""Bed linens"", ""...",1,365,1,0,0,31,1,2010-09-20,2024-06-15,4.9,4.93,4.86,1,69,0,0,0.18,7.979
2,2010-04-26,within an hour,1.0,0.98,"['email', 'phone']",1,Praha 1,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.0,2.0,"[""AC - split type ductless system"", ""Coffee ma...",3,700,1,3,173,411,53,2015-05-19,2025-03-07,4.94,4.93,4.9,0,3,0,0,3.43,7.367
3,2012-11-09,within an hour,1.0,0.8,"['email', 'phone']",1,Praha 3,50.087,14.445,Private room in rental unit,Private room,2,1.0,1.0,2.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,1,5,5,414,52,2013-01-04,2025-03-02,4.76,4.63,4.83,0,3,3,0,2.79,6.758
4,2012-11-09,within an hour,1.0,0.8,"['email', 'phone']",1,Praha 3,50.085,14.445,Private room in rental unit,Private room,2,1.0,1.0,3.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,1,3,3,389,47,2013-03-25,2025-03-01,4.69,4.59,4.73,0,3,3,0,2.67,6.446
5,2012-11-09,within an hour,1.0,0.8,"['email', 'phone']",1,Praha 3,50.085,14.446,Private room in rental unit,Private room,2,1.0,1.0,1.0,"[""Coffee maker"", ""Bed linens"", ""Dishes and sil...",3,60,1,6,6,381,52,2013-02-06,2025-02-23,4.78,4.68,4.81,0,3,3,0,2.58,6.65



DataFrame info after type conversions:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 34 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   host_since                                    8768 non-null   datetime64[ns]
 1   host_response_time                            8226 non-null   object        
 2   host_response_rate                            8226 non-null   float64       
 3   host_acceptance_rate                          8514 non-null   float64       
 4   host_verifications                            8768 non-null   object        
 5   host_identity_verified                        8768 non-null   int64         
 6   neighbourhood_cleansed                        8768 non-null   object        
 7   latitude                                      8768 non-null   float64       
 8   longitude                       

## 4. Feature Engineering & Parsing

In this phase, we transform existing raw features into more informative ones suitable for modeling. This involves:
*   Parsing string columns that represent lists (amenities, host verifications) to extract quantitative or categorical information.
*   Engineering new features from date columns (host experience duration, review recency).

*(Note: Parsing `bathrooms_text` was skipped as the column was dropped earlier in favor of using the existing numeric `bathrooms` column after removing its missing values.)*

### Parse Amenities
Extract information from the `amenities` column, which is stored as a string representation of a list. We will create a count of amenities and binary flags for selected important amenities.

In [12]:
# P4.2 Parse Amenities
if 'df_prep' in locals() and df_prep is not None and 'amenities' in df_prep.columns:
    print("Parsing 'amenities' column...")
    
    # Attempt to safely evaluate the string representation of the list
    # Using a function with error handling is safer
    import ast
    def safe_literal_eval(s):
        try:
            # Ensure it's treated as a string, handle potential non-string types
            if not isinstance(s, str):
                return [] # Return empty list if not a string
            # Basic check for list-like structure
            if s.startswith('[') and s.endswith(']'):
                 # Replace problematic escaped characters if necessary before eval
                 # Example: Replace true/false/null if they appear as Python keywords
                 # s_cleaned = s.replace('true', 'True').replace('false', 'False').replace('null', 'None') 
                 # return ast.literal_eval(s_cleaned)
                 return ast.literal_eval(s) 
            else:
                 return [] # Return empty list if not list-like string
        except (ValueError, SyntaxError, TypeError):
            # Handle cases where parsing fails
            # print(f"Could not parse amenities string: {s}") # Optional: for debugging
            return [] # Return empty list on error

    # Apply the function to create a list of amenities
    df_prep['amenities_list'] = df_prep['amenities'].apply(safe_literal_eval)

    # Create 'num_amenities' feature
    df_prep['num_amenities'] = df_prep['amenities_list'].apply(len)
    print(" - Created 'num_amenities'.")

    # Create binary flags for selected important amenities
    # Justification: These amenities are commonly searched for and likely influence price.
    selected_amenities = {
        'Wifi': ['wifi', 'internet'],
        'Kitchen': ['kitchen', 'kitchenette'],
        'Air conditioning': ['air conditioning', 'ac'],
        'Heating': ['heating', 'heater'],
        'Washer': ['washer'],
        'Dryer': ['dryer'],
        'TV': ['tv', 'hdtv', 'television'],
        'Parking': ['parking'],
        'Pool': ['pool'],
        'Pets allowed': ['pets allowed', 'pet friendly'],
        'Long term stays allowed': ['long term stays allowed'] # Example based on potential value
    }

    print(" - Creating binary flags for selected amenities:")
    for amenity_flag, keywords in selected_amenities.items():
        col_name = f"amenity_{amenity_flag.lower().replace(' ', '_')}"
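        # Flag the listing if any keyword occurs as a case-insensitive substring of any amenity string.
        # Caveat: substring matching can over-match short keywords (e.g. 'washer' also matches "Dishwasher").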
        df_prep[col_name] = df_prep['amenities_list'].apply(
            lambda amenity_list: 1 if any(any(keyword in item.lower() for keyword in keywords) for item in amenity_list) else 0
        )
        print(f"   - Created '{col_name}'")

    # Drop the intermediate and original columns
    df_prep.drop(columns=['amenities_list', 'amenities'], inplace=True)
    print(" - Dropped original 'amenities' and temporary 'amenities_list' columns.")

    # Display head of new features
    new_amenity_cols = ['num_amenities'] + [f"amenity_{flag.lower().replace(' ', '_')}" for flag in selected_amenities]
    display(df_prep[new_amenity_cols].head())

else:
    print("Error: df_prep DataFrame or 'amenities' column not found.")

Parsing 'amenities' column...
 - Created 'num_amenities'.
 - Creating binary flags for selected amenities:
   - Created 'amenity_wifi'
   - Created 'amenity_kitchen'
   - Created 'amenity_air_conditioning'
   - Created 'amenity_heating'
   - Created 'amenity_washer'
   - Created 'amenity_dryer'
   - Created 'amenity_tv'
   - Created 'amenity_parking'
   - Created 'amenity_pool'
   - Created 'amenity_pets_allowed'
   - Created 'amenity_long_term_stays_allowed'
 - Dropped original 'amenities' and temporary 'amenities_list' columns.


Unnamed: 0,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed
0,30,1,1,1,1,1,1,1,0,0,0,1
2,58,1,1,1,1,1,1,1,1,0,0,1
3,29,1,0,0,1,0,1,1,1,0,0,0
4,31,1,0,0,1,0,1,1,1,0,0,0
5,29,1,0,0,1,0,1,1,1,0,0,0


*Observation:* The `amenities` column was successfully parsed. A `num_amenities` feature was created, quantifying the richness of offerings. Additionally, binary flags were generated for several key amenities (like Wifi, Kitchen, AC, Parking, Pets Allowed), converting unstructured text into features for modeling. Because the flags rely on case-insensitive substring matching, they should be treated as approximate indicators rather than exact ones. The original `amenities` column was dropped.
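
Short keywords are the main risk with substring matching. A stricter alternative is to match keywords on word boundaries. The sketch below is an illustration under assumptions, not the method used in this notebook: the helper name `has_amenity` and the sample amenity strings are hypothetical.

```python
import re

def has_amenity(amenity_list, keywords):
    """Return 1 if any keyword matches as a whole word or phrase in any amenity string, else 0."""
    patterns = [re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE) for k in keywords]
    return int(any(p.search(item) for item in amenity_list for p in patterns))

# Hypothetical amenity strings chosen to trigger substring false positives
sample = ["Dedicated workspace", "Dishwasher", "Hair dryer"]

ac_flag = has_amenity(sample, ["air conditioning", "ac"])      # 'ac' inside "workspace" no longer counts
washer_flag = has_amenity(sample, ["washer"])                  # 'washer' inside "Dishwasher" no longer counts
real_ac_flag = has_amenity(["AC - split type ductless system"], ["air conditioning", "ac"])
print(ac_flag, washer_flag, real_ac_flag)
```

Word boundaries do not resolve every ambiguity: "Hair dryer" still matches the keyword `dryer`, so separating a clothes dryer from a hair dryer would need an explicit exclusion list or exact amenity names.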

### Parse Host Verifications
Extract the number of host verifications from the `host_verifications` column.

In [13]:
# P4.3 Parse host_verifications
if 'df_prep' in locals() and df_prep is not None and 'host_verifications' in df_prep.columns:
    print("Parsing 'host_verifications' column...")
    
    # Apply the same safe evaluation function used for amenities
    df_prep['verifications_list'] = df_prep['host_verifications'].apply(safe_literal_eval)

    # Create 'num_host_verifications' feature
    df_prep['num_host_verifications'] = df_prep['verifications_list'].apply(len)
    print(" - Created 'num_host_verifications'.")

    # Drop the intermediate and original columns
    df_prep.drop(columns=['verifications_list', 'host_verifications'], inplace=True)
    print(" - Dropped original 'host_verifications' and temporary 'verifications_list' columns.")

    # Display head of new feature
    display(df_prep[['num_host_verifications']].head())

else:
    print("Error: df_prep DataFrame or 'host_verifications' column not found.")

Parsing 'host_verifications' column...
 - Created 'num_host_verifications'.
 - Dropped original 'host_verifications' and temporary 'verifications_list' columns.


Unnamed: 0,num_host_verifications
0,2
2,2
3,2
4,2
5,2


*Observation:* The `host_verifications` column was parsed, and a `num_host_verifications` feature was created to quantify the number of verification methods listed for each host. The original column was dropped.
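
The two list-like columns are quoted differently in the displayed rows: `host_verifications` uses single quotes (Python-style), while `amenities` uses double quotes (JSON-style). As a hedged sketch, a parser could try JSON first and fall back to `ast.literal_eval`, which also covers JSON literals such as `true` or `null` that `ast.literal_eval` rejects. The function name `parse_list_string` is an assumption for illustration and is not part of the notebook.

```python
import ast
import json

def parse_list_string(s):
    """Parse a list stored as text; return [] when the value is missing, malformed, or not a list."""
    if not isinstance(s, str):
        return []
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(s)
        except (ValueError, SyntaxError, TypeError):
            continue  # try the next parser
        if isinstance(value, list):
            return value
    return []

print(len(parse_list_string("['email', 'phone']")))     # Python-style quoting
print(len(parse_list_string('["Wifi", "Kitchen"]')))    # JSON-style quoting
```

Note that returning an empty list on failure makes a parsing error indistinguishable from a host with no verifications, so it may be worth counting how many rows fall into that branch.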

### Engineer Date Features
Create features based on date columns: host duration (experience) and review recency. Both are measured in days relative to a reference date, here an approximation of the scrape date.

In [14]:
# P4.4 Engineer Date Features
if 'df_prep' in locals() and df_prep is not None:
    print("Engineering date features...")
    
    # Define an approximate scrape date (close to the data snapshot date)
    # Important: Use a Timestamp object for proper calculations
    scrape_date_approx = pd.Timestamp('2025-03-16') 
    print(f"Using approximate scrape date: {scrape_date_approx.date()}")

    created_features = []

    # 1. Host Duration
    if 'host_since' in df_prep.columns and pd.api.types.is_datetime64_any_dtype(df_prep['host_since']):
        df_prep['host_duration_days'] = (scrape_date_approx - df_prep['host_since']).dt.days
        # Handle potential negative durations if host_since is in the future (unlikely but safe check)
        df_prep.loc[df_prep['host_duration_days'] < 0, 'host_duration_days'] = 0 
        # NaNs in host_since resulted in NaT, which becomes NaN here. 
        # Since we dropped rows with missing host_since, there should be no NaNs now.
        print(f" - Created 'host_duration_days'. Missing values: {df_prep['host_duration_days'].isnull().sum()}")
        created_features.append('host_duration_days')
    else:
        print(" - 'host_since' column not found or not datetime type. Cannot create host duration.")

    # 2. Days Since Last Review
    if 'last_review' in df_prep.columns and pd.api.types.is_datetime64_any_dtype(df_prep['last_review']):
        df_prep['days_since_last_review'] = (scrape_date_approx - df_prep['last_review']).dt.days
        # Handle potential future dates
        df_prep.loc[df_prep['days_since_last_review'] < 0, 'days_since_last_review'] = 0 
        # NaT in last_review (listings with no reviews) becomes NaN here.
        missing_before_impute = df_prep['days_since_last_review'].isnull().sum()
        # Imputation is currently commented out, so these NaNs remain and the "before" and
        # "after" counts printed below are identical. One option is a large sentinel meaning
        # "very long ago", larger than any plausible actual value:
        # impute_value_recency = 9999
        # df_prep['days_since_last_review'] = df_prep['days_since_last_review'].fillna(impute_value_recency)
        print(f" - Created 'days_since_last_review'. Missing values before imputation: {missing_before_impute}, after: {df_prep['days_since_last_review'].isnull().sum()}")
        created_features.append('days_since_last_review')
    else:
        print(" - 'last_review' column not found or not datetime type. Cannot create review recency.")
        
    # 3. (Optional) Drop original date columns used for engineering
    cols_to_drop_dates = ['host_since', 'first_review', 'last_review'] # Drop first_review too if not used otherwise
    actual_cols_to_drop_dates = [col for col in cols_to_drop_dates if col in df_prep.columns]
    if actual_cols_to_drop_dates:
         df_prep.drop(columns=actual_cols_to_drop_dates, inplace=True)
         print(f" - Dropped original date columns: {actual_cols_to_drop_dates}")
         
    # Display head of new features
    if created_features:
        display(df_prep[created_features].head())
        display(df_prep[created_features].describe())

else:
    print("Error: df_prep DataFrame not found.")

Engineering date features...
Using approximate scrape date: 2025-03-16
 - Created 'host_duration_days'. Missing values: 0
 - Created 'days_since_last_review'. Missing values before imputation: 746, after: 746
 - Dropped original date columns: ['host_since', 'first_review', 'last_review']


Unnamed: 0,host_duration_days,days_since_last_review
0,5933,274.0
2,5438,9.0
3,4510,14.0
4,4510,15.0
5,4510,21.0


Unnamed: 0,host_duration_days,days_since_last_review
count,8768.0,8022.0
mean,2675.817,140.861
std,1461.036,349.13
min,4.0,0.0
25%,1297.0,13.0
50%,2977.0,28.0
75%,3757.0,84.0
max,5933.0,3895.0


*Observation:* Date features were engineered successfully. `host_duration_days` quantifies host experience, and `days_since_last_review` captures review recency. Missing values in `days_since_last_review` (746 listings without reviews) were not imputed in this step: the sentinel imputation with a large number (9999) is commented out in the code, so these values remain NaN and still need to be handled before modeling. The original date columns (`host_since`, `first_review`, `last_review`) were dropped.
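
If the sentinel approach mentioned in the code comments is adopted later, pairing it with a missingness indicator lets a model tell "never reviewed" apart from a genuinely old review. The following is a minimal sketch on made-up values. The column name `has_no_reviews` is hypothetical, and the value 9999 is taken from the commented-out code.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two reviewed listings and one listing without reviews
toy = pd.DataFrame({"days_since_last_review": [274.0, 9.0, np.nan]})

# Record missingness before filling, so the information is not lost
toy["has_no_reviews"] = toy["days_since_last_review"].isna().astype(int)

# Fill with a constant sentinel larger than any plausible real value
toy["days_since_last_review"] = toy["days_since_last_review"].fillna(9999)
print(toy)
```

A constant sentinel does not depend on the data, so it can be applied before the train/test split without leakage. It does, however, distort the feature's scale, which matters for linear models and for standardization.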

### Display Info and Head After Feature Engineering
Show the updated DataFrame info and head to see the newly added features and removed original columns.

In [15]:
if 'df_prep' in locals() and df_prep is not None:
    print("\nDataFrame head after Phase P4 Feature Engineering & Parsing:")
    display(df_prep.head())
    
    print("\nDataFrame info after Phase P4 Feature Engineering & Parsing:")
    df_prep.info()
else:
    print("Error: df_prep DataFrame not found.")


DataFrame head after Phase P4 Feature Engineering & Parsing:


Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,has_availability,availability_30,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_price,num_amenities,amenity_wifi,amenity_kitchen,amenity_air_conditioning,amenity_heating,amenity_washer,amenity_dryer,amenity_tv,amenity_parking,amenity_pool,amenity_pets_allowed,amenity_long_term_stays_allowed,num_host_verifications,host_duration_days,days_since_last_review
0,within an hour,1.0,1.0,1,Praha 1,50.082,14.416,Entire rental unit,Entire home/apt,4,1.0,1.0,2.0,1,365,1,0,0,31,1,4.9,4.93,4.86,1,69,0,0,0.18,7.979,30,1,1,1,1,1,1,1,0,0,0,1,2,5933,274.0
2,within an hour,1.0,0.98,1,Praha 1,50.087,14.432,Entire rental unit,Entire home/apt,4,1.5,1.0,2.0,3,700,1,3,173,411,53,4.94,4.93,4.9,0,3,0,0,3.43,7.367,58,1,1,1,1,1,1,1,1,0,0,1,2,5438,9.0
3,within an hour,1.0,0.8,1,Praha 3,50.087,14.445,Private room in rental unit,Private room,2,1.0,1.0,2.0,3,60,1,5,5,414,52,4.76,4.63,4.83,0,3,3,0,2.79,6.758,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,14.0
4,within an hour,1.0,0.8,1,Praha 3,50.085,14.445,Private room in rental unit,Private room,2,1.0,1.0,3.0,3,60,1,3,3,389,47,4.69,4.59,4.73,0,3,3,0,2.67,6.446,31,1,0,0,1,0,1,1,1,0,0,0,2,4510,15.0
5,within an hour,1.0,0.8,1,Praha 3,50.085,14.446,Private room in rental unit,Private room,2,1.0,1.0,1.0,3,60,1,6,6,381,52,4.78,4.68,4.81,0,3,3,0,2.58,6.65,29,1,0,0,1,0,1,1,1,0,0,0,2,4510,21.0



DataFrame info after Phase P4 Feature Engineering & Parsing:
<class 'pandas.core.frame.DataFrame'>
Index: 8768 entries, 0 to 10107
Data columns (total 44 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_response_time                            8226 non-null   object 
 1   host_response_rate                            8226 non-null   float64
 2   host_acceptance_rate                          8514 non-null   float64
 3   host_identity_verified                        8768 non-null   int64  
 4   neighbourhood_cleansed                        8768 non-null   object 
 5   latitude                                      8768 non-null   float64
 6   longitude                                     8768 non-null   float64
 7   property_type                                 8768 non-null   object 
 8   room_type                                     8768 non-null   object 
 9   accom