

---

## Task 1.2: Data Cleaning and Preprocessing

This task transformed raw Airbnb data into a clean, model-ready dataset from a **landlord perspective**, using only features a host controls or knows when creating a listing.
We merged the San Francisco and San Diego listings, removed URLs, free-text descriptions, and all review-based features (to avoid data leakage), and dropped columns with more than 50% missing values unless they were key landlord features.
Data types were converted (prices, percentages, booleans, dates), and feature engineering created `host_years`, `price_per_person`, `price_per_bedroom`, `price_per_bathroom`, `space_efficiency`, `is_multi_listing_host`, and `availability_rate`.
Missing values were imputed (median for numeric columns, mode for categorical ones), and listings with a zero price or a price beyond 3×IQR from the quartiles were removed.
The target variable `value_category` was created from an FP (Fair Price) Score that divides a quality score built from landlord-controlled features (accommodates, beds, bathrooms, superhost status, instant bookable) by the normalized price. Listings were then split at the 33rd and 67th percentiles of that score into Poor, Fair, and Excellent value tiers, without using any review data.
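
As a compact reference, the FP Score computed in Section 9 can be written as below. The symbols ($\tilde{x}$, $Q_i$, $a$, $b$, $r$, $s$, $k$, $p$) are notation introduced here for illustration and do not appear in the notebook; the weights 0.5 and 0.3 and the offset 0.1 are taken from the code. Here $a$ is `accommodates`, $b$ is `beds`, $r$ is `bathrooms_numeric`, $s$ is `host_is_superhost`, $k$ is `instant_bookable`, and $p$ is `price`.

```latex
\begin{aligned}
\tilde{x}_i &= \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j} \\
Q_i &= \tilde{a}_i + \tilde{b}_i + \tilde{r}_i + 0.5\, s_i + 0.3\, k_i \\
\mathrm{FP}_i &= \frac{Q_i}{\tilde{p}_i + 0.1}
\end{aligned}
```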

## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import os

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("="*80)
print("T1.2: DATA CLEANING AND PREPROCESSING (LANDLORD PERSPECTIVE)")
print("="*80)

In [None]:
# Load both datasets
sf_df = pd.read_csv('../../data/raw/san francisco.csv')
sd_df = pd.read_csv('../../data/raw/san diego.csv')

# Add city identifier
sf_df['city'] = 'San Francisco'
sd_df['city'] = 'San Diego'

# Combine datasets
df = pd.concat([sf_df, sd_df], ignore_index=True)

print(f"\n Combined Dataset Shape: {df.shape}")
print(f"   - Total Rows: {df.shape[0]:,}")
print(f"   - Total Columns: {df.shape[1]}")
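
Before concatenating, it can help to check whether both city files share the same columns, since `pd.concat` silently fills gaps with NaN. The following is a minimal sketch on toy frames; `sf_toy`, `sd_toy`, and their values are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins for the two city files (hypothetical columns and values)
sf_toy = pd.DataFrame({"id": [1, 2], "price": ["$100.00", "$250.00"]})
sd_toy = pd.DataFrame({"id": [3], "price": ["$80.00"], "license": ["STR-1"]})

# Columns present in only one of the two frames
only_in_one = set(sf_toy.columns) ^ set(sd_toy.columns)

combined = pd.concat(
    [sf_toy.assign(city="San Francisco"), sd_toy.assign(city="San Diego")],
    ignore_index=True,
)

# pd.concat takes the union of columns by default, so a column missing
# from one frame is kept and filled with NaN for that frame's rows
license_missing = combined["license"].isna().sum()
```

A non-empty `only_in_one` would explain some of the high missing rates seen in the next section.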

## 2. Missing Values Analysis

In [None]:
print("\n" + "="*80)
print("2. MISSING VALUES ANALYSIS")
print("="*80)

# Calculate missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
})

# Filter columns with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print(f"\n Columns with Missing Values: {len(missing_df)}")
print("\nTop 20 Columns with Most Missing Data:")
print(missing_df.head(20).to_string(index=False))

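# Visualize missing data (create the figures directory first so savefig does not fail)
os.makedirs('../../outputs/figures', exist_ok=True)
plt.figure(figsize=(12, 8))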
top_missing = missing_df.head(20)
plt.barh(top_missing['Column'], top_missing['Missing_Percentage'])
plt.xlabel('Missing Percentage (%)')
plt.title('Top 20 Columns with Missing Values')
plt.tight_layout()
plt.savefig('../../outputs/figures/missing_values_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n Visualization saved: outputs/figures/missing_values_analysis.png")
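
The missing-value table above can also be obtained more compactly, because the mean of a boolean mask is the share of `True` values. A small sketch on a toy frame (column values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "license": [np.nan, np.nan, np.nan, "STR-1"],
    "beds": [1, 2, 2, 3],
})

# isna().mean() gives the missing fraction per column directly
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns above the 50% threshold that Section 6 uses
above_threshold = missing_share[missing_share > 0.5].index.tolist()
```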

## 3. Feature Selection - Remove Irrelevant Columns

In [None]:
print("\n" + "="*80)
print("3. FEATURE SELECTION - LANDLORD PERSPECTIVE")
print("="*80)

# Define columns to drop (URLs, IDs, text descriptions, REVIEW-BASED FEATURES)
columns_to_drop = [
    # URLs and IDs
    'listing_url', 'scrape_id', 'picture_url', 'host_url', 
    'host_thumbnail_url', 'host_picture_url',
    
    # Text descriptions (too noisy for initial model)
    'description', 'neighborhood_overview', 'host_about', 'name',
    
    # Redundant or highly specific
    'source', 'calendar_updated', 'last_scraped', 'calendar_last_scraped',
    
    # License (mostly missing or not useful)
    'license',
    
    # Neighbourhood group (if empty)
    'neighbourhood_group_cleansed',
    
    # Bathrooms (we'll use bathrooms_text instead)
    'bathrooms',
    
    # Host verifications (complex nested data)
    'host_verifications',
    
    # Amenities (complex nested data - can be processed later)
    'amenities',
    
    # *** CRITICAL: REMOVE ALL REVIEW-BASED FEATURES (DATA LEAKAGE) ***
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
    'review_scores_value', 'number_of_reviews', 'number_of_reviews_ltm',
    'number_of_reviews_l30d', 'first_review', 'last_review', 'reviews_per_month'
]

# Drop columns that exist in the dataframe
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df_cleaned = df.drop(columns=columns_to_drop)

print(f"\n Dropped {len(columns_to_drop)} columns (including ALL review-based features)")
print(f"   - Original: {df.shape[1]} columns")
print(f"   - After dropping: {df_cleaned.shape[1]} columns")
print(f"\n Remaining columns: {df_cleaned.shape[1]}")
print(f"\n Note: All review-based features removed to prevent data leakage")
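
A name-based filter could complement the explicit list, so that review columns added by a later scrape are not missed. This sketch assumes every review feature has "review" in its name, which holds for the thirteen columns listed above; the toy frame is empty and only carries column names.

```python
import pandas as pd

toy = pd.DataFrame(columns=[
    "price", "accommodates", "review_scores_rating",
    "number_of_reviews", "first_review", "reviews_per_month",
])

# Every column whose name contains "review"
review_cols = toy.filter(like="review").columns.tolist()

# Drop them to keep only landlord-side columns
landlord_only = toy.drop(columns=review_cols)
```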

## 4. Data Type Conversions and Cleaning

In [None]:
print("\n" + "="*80)
print("4. DATA TYPE CONVERSIONS")
print("="*80)

# 4.1 Clean price column (remove $ and commas)
if 'price' in df_cleaned.columns:
    df_cleaned['price'] = df_cleaned['price'].replace(r'[\$,]', '', regex=True).astype(float)
    print("\n✓ Cleaned 'price' column (removed $ and commas)")

# 4.2 Convert percentage columns
percentage_cols = ['host_response_rate', 'host_acceptance_rate']
for col in percentage_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = df_cleaned[col].str.rstrip('%').astype(float) / 100
        print(f"✓ Converted '{col}' to decimal")

# 4.3 Convert boolean columns
boolean_cols = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 
                'has_availability', 'instant_bookable']
for col in boolean_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = df_cleaned[col].map({'t': 1, 'f': 0})
        print(f"✓ Converted '{col}' to binary (0/1)")

# 4.4 Convert date columns
date_cols = ['host_since']
for col in date_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = pd.to_datetime(df_cleaned[col], errors='coerce')
        print(f"✓ Converted '{col}' to datetime")

# 4.5 Extract number from bathrooms_text
if 'bathrooms_text' in df_cleaned.columns:
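    # Raw string avoids an invalid-escape warning; expand=False returns a Series.
    # Text without a digit (e.g. "Half-bath") becomes NaN and is imputed in Section 6.
    df_cleaned['bathrooms_numeric'] = df_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)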
    print("\n✓ Extracted numeric bathrooms from 'bathrooms_text'")
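
To make the effect of these conversions concrete, here is a minimal sketch that applies the same operations to a two-row toy frame. The sample strings are invented, but follow the formats the code above expects.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": ["$1,250.00", "$80.00"],
    "host_response_rate": ["95%", np.nan],
    "host_is_superhost": ["t", "f"],
    "bathrooms_text": ["1.5 shared baths", "Half-bath"],
})

# Strip "$" and thousands separators, then cast to float
toy["price"] = toy["price"].replace(r"[\$,]", "", regex=True).astype(float)

# "95%" -> 0.95; missing values stay missing
toy["host_response_rate"] = toy["host_response_rate"].str.rstrip("%").astype(float) / 100

# 't'/'f' flags -> 1/0
toy["host_is_superhost"] = toy["host_is_superhost"].map({"t": 1, "f": 0})

# First number in the text; rows without a digit become NaN
toy["bathrooms_numeric"] = (
    toy["bathrooms_text"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)
)
```

Note that a half bath is not mapped to 0.5 by this pattern; it ends up as a missing value.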

## 5. Feature Engineering (Landlord-Controlled Features Only)

In [None]:
print("\n" + "="*80)
print("5. FEATURE ENGINEERING (LANDLORD PERSPECTIVE ONLY)")
print("="*80)

# 5.1 Host experience (years as host)
if 'host_since' in df_cleaned.columns:
    df_cleaned['host_years'] = (pd.Timestamp.now() - df_cleaned['host_since']).dt.days / 365.25
    print("\n✓ Created 'host_years' feature")

# 5.2 Price per person (a zero 'accommodates' becomes NaN instead of producing inf)
if 'price' in df_cleaned.columns and 'accommodates' in df_cleaned.columns:
    df_cleaned['price_per_person'] = df_cleaned['price'] / df_cleaned['accommodates'].replace(0, np.nan)
    print("✓ Created 'price_per_person' feature")

# 5.3 Availability rate
if 'availability_365' in df_cleaned.columns:
    df_cleaned['availability_rate'] = df_cleaned['availability_365'] / 365
    print("✓ Created 'availability_rate' feature")

# 5.4 Price per bedroom
if 'price' in df_cleaned.columns and 'bedrooms' in df_cleaned.columns:
    df_cleaned['price_per_bedroom'] = df_cleaned['price'] / (df_cleaned['bedrooms'] + 1)  # +1 avoids division by zero when bedrooms is 0
    print("✓ Created 'price_per_bedroom' feature")

# 5.5 Price per bathroom
if 'price' in df_cleaned.columns and 'bathrooms_numeric' in df_cleaned.columns:
    df_cleaned['price_per_bathroom'] = df_cleaned['price'] / (df_cleaned['bathrooms_numeric'] + 0.5)  # +0.5 avoids division by zero
    print("✓ Created 'price_per_bathroom' feature")

# 5.6 Space efficiency (beds per guest accommodated; zero 'accommodates' becomes NaN)
if 'beds' in df_cleaned.columns and 'accommodates' in df_cleaned.columns:
    df_cleaned['space_efficiency'] = df_cleaned['beds'] / df_cleaned['accommodates'].replace(0, np.nan)
    print("✓ Created 'space_efficiency' feature")

# 5.7 Host portfolio size indicator
if 'calculated_host_listings_count' in df_cleaned.columns:
    df_cleaned['is_multi_listing_host'] = (df_cleaned['calculated_host_listings_count'] > 1).astype(int)
    print("✓ Created 'is_multi_listing_host' feature")

print(f"\n Total features after engineering: {df_cleaned.shape[1]}")
print(f"\n Note: NO review-based features created (landlord perspective only)")
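
Ratio features are sensitive to zero denominators: pandas returns `inf` rather than raising, and `inf` is neither flagged by `isna()` nor filled by median imputation. The sketch below shows the effect on a toy frame and one possible guard; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [100.0, 90.0], "accommodates": [2, 0]})

# Plain division: a zero denominator yields inf, which isna() does not flag
raw_ratio = toy["price"] / toy["accommodates"]

# Guarded division: treat a zero denominator as missing instead
safe_ratio = toy["price"] / toy["accommodates"].replace(0, np.nan)
```

With the guard, the affected rows are handled by the imputation step in Section 6 like any other missing value.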

## 6. Handle Missing Values

In [None]:
print("\n" + "="*80)
print("6. HANDLING MISSING VALUES")
print("="*80)

# 6.1 Drop columns with >50% missing values, EXCEPT important landlord features
missing_threshold = 0.5
missing_pct = df_cleaned.isnull().sum() / len(df_cleaned)

# Preserve important landlord features even if they have high missing rates
important_landlord_features = [
    'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
    'bedrooms', 'beds', 'bathrooms_numeric', 'price', 'accommodates',
    'neighbourhood_cleansed', 'property_type'
]

cols_to_drop = []
for col in missing_pct[missing_pct > missing_threshold].index:
    if col not in important_landlord_features:
        cols_to_drop.append(col)

if cols_to_drop:
    df_cleaned = df_cleaned.drop(columns=cols_to_drop)
    print(f"\n Dropped {len(cols_to_drop)} columns with >{missing_threshold*100}% missing values")
    print(f"   Columns dropped: {cols_to_drop}")
    print(f"   Preserved important landlord features: {important_landlord_features}")

# 6.2 Fill missing values for specific columns
# Numeric columns - fill with median
# (assign the result back: fillna(..., inplace=True) on a column selection is
#  chained assignment, which recent pandas versions warn about and may ignore)
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

print(f"\n Filled missing numeric values with median")

# Categorical columns - fill with mode or 'Unknown'
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        mode_val = df_cleaned[col].mode()
        if len(mode_val) > 0:
            df_cleaned[col] = df_cleaned[col].fillna(mode_val[0])
        else:
            df_cleaned[col] = df_cleaned[col].fillna('Unknown')

print(f" Filled missing categorical values with mode or 'Unknown'")

# Check remaining missing values
remaining_missing = df_cleaned.isnull().sum().sum()
print(f"\n Remaining missing values: {remaining_missing}")
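
The same median/mode imputation can be expressed without explicit loops. This is a sketch of an equivalent vectorized form on a toy frame, not the notebook's own code; the column values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, 2.0],
    "room_type": ["Entire home/apt", None, "Private room", "Entire home/apt"],
})

# Numeric columns: fill each with its own median
toy = toy.fillna(toy.median(numeric_only=True))

# Remaining (categorical) gaps: fill each column with its most frequent value
toy = toy.fillna(toy.mode().iloc[0])
```

Passing a Series to `DataFrame.fillna` fills each column with the value stored under that column's name.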

## 7. Handle Outliers

In [None]:
print("\n" + "="*80)
print("7. OUTLIER DETECTION AND HANDLING")
print("="*80)

# Focus on price outliers
if 'price' in df_cleaned.columns:
    # Remove listings with price = 0 or extremely high prices
    initial_rows = len(df_cleaned)
    
    # Remove price = 0
    df_cleaned = df_cleaned[df_cleaned['price'] > 0]
    
    # Remove extreme outliers (using IQR method)
    Q1 = df_cleaned['price'].quantile(0.25)
    Q3 = df_cleaned['price'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    
    df_cleaned = df_cleaned[(df_cleaned['price'] >= lower_bound) & 
                    (df_cleaned['price'] <= upper_bound)]
    
    rows_removed = initial_rows - len(df_cleaned)
    print(f"\n Removed {rows_removed} rows with price outliers")
    print(f"   - Price range: ${df_cleaned['price'].min():.2f} - ${df_cleaned['price'].max():.2f}")
    print(f"   - Remaining rows: {len(df_cleaned):,}")

# Visualize price distribution after cleaning
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df_cleaned['price'], bins=50, edgecolor='black')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.title('Price Distribution (After Outlier Removal)')

plt.subplot(1, 2, 2)
plt.boxplot(df_cleaned['price'])
plt.ylabel('Price ($)')
plt.title('Price Boxplot (After Outlier Removal)')

plt.tight_layout()
plt.savefig('../../outputs/figures/price_distribution_cleaned.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n Visualization saved: outputs/figures/price_distribution_cleaned.png")
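
The bounds used above can be checked on a small example. The helper `iqr_bounds` and the price values are hypothetical, introduced only for this sketch; the multiplier 3 matches the code above.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

prices = pd.Series([50.0, 80.0, 100.0, 120.0, 150.0, 5000.0])
lower, upper = iqr_bounds(prices)

# Keep only prices inside the bounds
kept = prices[(prices >= lower) & (prices <= upper)]
```

For right-skewed, strictly positive prices the lower bound is typically negative, so in practice only the upper bound removes rows.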

## 8. Encode Categorical Variables (Partial)

In [None]:
print("\n" + "="*80)
print("8. ENCODING CATEGORICAL VARIABLES (PARTIAL)")
print("="*80)

# Identify categorical columns (excluding datetime)
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns.tolist()

# Remove date columns if they're still object type
date_related = ['host_since']
categorical_cols = [col for col in categorical_cols if col not in date_related]

print(f"\n Categorical columns found: {len(categorical_cols)}")
print(f"   Columns: {categorical_cols[:10]}...")  # Show first 10

# One-hot encode categorical variables with low cardinality
low_cardinality_cols = []
for col in categorical_cols:
    if df_cleaned[col].nunique() <= 10:  # Only encode if <=10 unique values
        low_cardinality_cols.append(col)

if low_cardinality_cols:
    df_encoded = pd.get_dummies(df_cleaned, columns=low_cardinality_cols, drop_first=True)
    print(f"\n One-hot encoded {len(low_cardinality_cols)} categorical columns")
    print(f"   Columns: {low_cardinality_cols}")
else:
    df_encoded = df_cleaned.copy()

# Keep important high-cardinality features for later encoding (T1.4)
important_categorical = ['neighbourhood_cleansed', 'property_type']

remaining_categorical = df_encoded.select_dtypes(include=['object']).columns.tolist()
remaining_categorical = [col for col in remaining_categorical if col not in important_categorical]

if remaining_categorical:
    df_encoded = df_encoded.drop(columns=remaining_categorical)
    print(f"\n Dropped {len(remaining_categorical)} high-cardinality categorical columns")
    print(f"   Columns dropped: {remaining_categorical}")

# Show which important categoricals are preserved
preserved = [col for col in important_categorical if col in df_encoded.columns]
if preserved:
    print(f"\n Preserved for T1.4 (Categorical Encoding): {preserved}")

print(f"\n Dataset shape after partial encoding: {df_encoded.shape}")
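
The cardinality rule above can be illustrated on a toy frame. The threshold of 2 is chosen only to fit the three-row example (the notebook uses 10), and the column values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "neighbourhood_cleansed": ["Mission", "La Jolla", "Pacific Beach"],
    "price": [200.0, 90.0, 150.0],
})

max_levels = 2  # toy threshold; the notebook uses 10
candidate_cols = ["room_type", "neighbourhood_cleansed"]

# Only columns with few distinct values are one-hot encoded
low_card = [c for c in candidate_cols if toy[c].nunique() <= max_levels]

# drop_first=True drops one level per column to avoid redundant dummies
encoded = pd.get_dummies(toy, columns=low_card, drop_first=True)
```

Here `room_type` becomes a single indicator column, while `neighbourhood_cleansed` stays as text for later encoding.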

## 9. Create Target Variable (Landlord Perspective - NO REVIEWS)

In [None]:
print("\n" + "="*80)
print("9. CREATE TARGET VARIABLE - VALUE CATEGORY (LANDLORD PERSPECTIVE)")
print("="*80)
print("\n Using ONLY landlord-controlled features (NO review data)")
print("   Quality indicators: accommodates, beds, bathrooms, superhost, instant_bookable\n")

# Create value category based on LANDLORD-AVAILABLE features only
# Value = Quality indicators / Price
# Quality indicators: accommodates, beds, bathrooms, superhost status, instant booking

if 'price' in df_encoded.columns:
    df_with_target = df_encoded.copy()
    
    # Calculate quality score from landlord-controlled features
    quality_score = 0
    
    # Accommodation capacity (normalized)
    if 'accommodates' in df_with_target.columns:
        quality_score += (df_with_target['accommodates'] - df_with_target['accommodates'].min()) / \
                        (df_with_target['accommodates'].max() - df_with_target['accommodates'].min())
        print(" Added 'accommodates' to quality score")
    
    # Beds (normalized)
    if 'beds' in df_with_target.columns:
        quality_score += (df_with_target['beds'] - df_with_target['beds'].min()) / \
                        (df_with_target['beds'].max() - df_with_target['beds'].min())
        print(" Added 'beds' to quality score")
    
    # Bathrooms (normalized)
    if 'bathrooms_numeric' in df_with_target.columns:
        quality_score += (df_with_target['bathrooms_numeric'] - df_with_target['bathrooms_numeric'].min()) / \
                        (df_with_target['bathrooms_numeric'].max() - df_with_target['bathrooms_numeric'].min())
        print(" Added 'bathrooms_numeric' to quality score")
    
    # Superhost bonus
    if 'host_is_superhost' in df_with_target.columns:
        quality_score += df_with_target['host_is_superhost'] * 0.5
        print(" Added 'host_is_superhost' bonus to quality score")
    
    # Instant bookable bonus
    if 'instant_bookable' in df_with_target.columns:
        quality_score += df_with_target['instant_bookable'] * 0.3
        print(" Added 'instant_bookable' bonus to quality score")
    
    # Min-max normalize price to [0, 1]. Price enters the score as a divisor,
    # so a lower price yields a higher (better) value score.
    price_normalized = (df_with_target['price'] - df_with_target['price'].min()) / \
                      (df_with_target['price'].max() - df_with_target['price'].min())
    
    # Calculate FP Score (Fair Price Score): higher quality and lower price = better value.
    # The 0.1 offset keeps the denominator positive for the cheapest listing.
    df_with_target['fp_score'] = quality_score / (price_normalized + 0.1)
    
    # Classify into 3 categories based on FP Score
    fp_33 = df_with_target['fp_score'].quantile(0.33)
    fp_67 = df_with_target['fp_score'].quantile(0.67)
    
    def classify_value(fp_score):
        if fp_score <= fp_33:
            return 'Poor_Value'
        elif fp_score <= fp_67:
            return 'Fair_Value'
        else:
            return 'Excellent_Value'
    
    df_with_target['value_category'] = df_with_target['fp_score'].apply(classify_value)
    
    print(f"\n Created FP Score and Value Category (Landlord Perspective)")
    print(f"   - Total listings: {len(df_with_target):,}")
    print(f"   - FP Score range: {df_with_target['fp_score'].min():.2f} - {df_with_target['fp_score'].max():.2f}")
    print(f"\n Value Category Distribution:")
    print(df_with_target['value_category'].value_counts())
    
    # Visualize distribution
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
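    # value_counts() sorts by frequency, so fix the category order to keep colors aligned
    category_order = ['Poor_Value', 'Fair_Value', 'Excellent_Value']
    df_with_target['value_category'].value_counts().reindex(category_order).plot(kind='bar', color=['red', 'orange', 'green'])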
    plt.xlabel('Value Category')
    plt.ylabel('Count')
    plt.title('Distribution of Value Categories (Landlord Perspective)')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 2, 2)
    plt.hist(df_with_target['fp_score'], bins=50, edgecolor='black')
    plt.xlabel('FP Score')
    plt.ylabel('Frequency')
    plt.title('FP Score Distribution')
    
    plt.tight_layout()
    plt.savefig('../../outputs/figures/value_category_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n Visualization saved: outputs/figures/value_category_distribution.png")
    print("\n NO REVIEW DATA USED - Pure landlord perspective!")
    
    # Use df_with_target for further processing
    df_final = df_with_target.copy()
else:
    print("\n Warning: 'price' not found. Skipping target creation.")
    df_final = df_encoded.copy()
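
The three-way split can be traced on a handful of toy scores. The sketch repeats the threshold logic from the cell above and compares it with `pd.qcut`, which is an alternative shown for illustration rather than what the notebook uses; `pd.qcut` cuts at exact thirds instead of 0.33 and 0.67.

```python
import pandas as pd

fp_scores = pd.Series([0.2, 0.5, 0.9, 1.4, 2.0, 3.5])

# Thresholds at the 33rd and 67th percentiles, as in the notebook
low, high = fp_scores.quantile(0.33), fp_scores.quantile(0.67)

def classify_value(score: float) -> str:
    if score <= low:
        return "Poor_Value"
    if score <= high:
        return "Fair_Value"
    return "Excellent_Value"

by_threshold = fp_scores.apply(classify_value)

# Alternative: quantile-based binning into three equally sized groups
by_qcut = pd.qcut(
    fp_scores, q=3, labels=["Poor_Value", "Fair_Value", "Excellent_Value"]
)
```

Because the classes are defined by quantiles of the score itself, they come out close to balanced by construction.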

## 10. Prepare Features for Modeling

In [None]:
print("\n" + "="*80)
print("10. PREPARE FEATURES FOR MODELING")
print("="*80)

# Drop columns that shouldn't be used as features
columns_to_exclude = [
    'id', 'host_id', 'value_category', 'fp_score',
    'host_since',  # Datetime column
    # Ensure ALL review-based features are excluded (double-check)
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
    'review_scores_value', 'number_of_reviews', 'number_of_reviews_ltm',
    'number_of_reviews_l30d', 'first_review', 'last_review', 'reviews_per_month'
]

# Get feature columns (only those that exist)
columns_to_exclude = [col for col in columns_to_exclude if col in df_final.columns]
feature_cols = [col for col in df_final.columns if col not in columns_to_exclude]

# Separate features and target
X = df_final[feature_cols]
y = df_final['value_category']

print(f"\n Features prepared (LANDLORD PERSPECTIVE ONLY)")
print(f"   - Number of features: {X.shape[1]}")
print(f"   - Number of samples: {X.shape[0]:,}")
print(f"   - Target variable: value_category")
print(f"   - Excluded columns: {len(columns_to_exclude)}")
print(f"\n Feature columns (first 20):")
print(X.columns.tolist()[:20])

# Verify no review columns leaked through
review_keywords = ['review', 'rating', 'score']
leaked_features = [col for col in X.columns if any(keyword in col.lower() for keyword in review_keywords)]
if leaked_features:
    print(f"\n WARNING: Potential review-based features detected: {leaked_features}")
else:
    print(f"\n VERIFIED: No review-based features in feature set!")

## 11. Train-Test Split

In [None]:
print("\n" + "="*80)
print("11. TRAIN-TEST SPLIT")
print("="*80)

print("\n Train-test split will be performed in Task 1.6")
print("   This ensures all engineered features from T1.3, T1.4, and T1.5 are included.")
print("\n Skipping split for now...")

## 12. Feature Scaling (Deferred to T1.6)

In [None]:
print("\n" + "="*80)
print("12. FEATURE SCALING")
print("="*80)

print("\n Feature scaling will be performed in Task 1.6 (Train-Test Split & Scaling)")
print("   This ensures all engineered features from T1.3, T1.4, and T1.5 are included.")
print("\n Saving UNSCALED data for now...")

## 13. Save Processed Data

In [None]:
print("\n" + "="*80)
print("13. SAVE PROCESSED DATA")
print("="*80)

# Create output directory if it doesn't exist
os.makedirs('../../data/processed', exist_ok=True)

# Save cleaned full dataset
df_final.to_csv('../../data/processed/listings_cleaned_with_target.csv', index=False)
print("\n Saved: data/processed/listings_cleaned_with_target.csv")
print(f"   - Shape: {df_final.shape}")
print(f"   - Columns: {df_final.shape[1]}")

# Save feature names (for reference); make sure the reports directory exists first
os.makedirs('../../outputs/reports', exist_ok=True)
feature_cols = [col for col in df_final.columns if col not in ['id', 'host_id', 'value_category', 'fp_score', 'host_since']]
with open('../../outputs/reports/feature_names_T1.2.txt', 'w') as f:
    for feature in feature_cols:
        f.write(f"{feature}\n")

print("✓ Saved: outputs/reports/feature_names_T1.2.txt")
print(f"   - Features: {len(feature_cols)}")

print("\n Note: Train-test split will be done in T1.6 after all feature engineering (T1.3, T1.4, T1.5)")

## 14. Summary Report

In [None]:
print("\n" + "="*80)
print("14. PREPROCESSING SUMMARY REPORT")
print("="*80)

summary = f"""
DATA PREPROCESSING COMPLETED SUCCESSFULLY (LANDLORD PERSPECTIVE)
{'='*80}

ORIGINAL DATA:
  - San Francisco: 7,780 listings
  - San Diego: 13,162 listings
  - Combined: {df.shape[0]:,} listings, {df.shape[1]} columns

AFTER CLEANING:
  - Final dataset: {df_final.shape[0]:,} listings, {df_final.shape[1]} columns
  - Model features (excluding IDs, target, fp_score, host_since): {len(feature_cols)}

TARGET VARIABLE:
  - Name: value_category
  - Classes: Poor_Value, Fair_Value, Excellent_Value
  - Based on: Landlord-controlled features ONLY (NO REVIEWS)
  - Quality indicators: accommodates, beds, bathrooms, superhost, instant_bookable
  - Distribution:
{df_final['value_category'].value_counts().to_string()}

KEY STEPS PERFORMED:
   Removed irrelevant columns (URLs, IDs, text descriptions)
   REMOVED ALL REVIEW-BASED FEATURES (data leakage prevention)
   Converted data types (price, percentages, booleans, dates)
   Feature engineering (host_years, price_per_person, etc.) - LANDLORD ONLY
   Handled missing values (preserved important landlord features)
   Removed outliers (zero prices and prices beyond 3×IQR from the quartiles)
   Encoded categorical variables (partial - low cardinality only)
   Created target variable (FP Score - LANDLORD PERSPECTIVE)
   Train-test split deferred to T1.6 (after all feature engineering)

OUTPUT FILES:
  - data/processed/listings_cleaned_with_target.csv
  - outputs/reports/feature_names_T1.2.txt

VISUALIZATIONS:
  - outputs/figures/missing_values_analysis.png
  - outputs/figures/price_distribution_cleaned.png
  - outputs/figures/value_category_distribution.png

CRITICAL NOTES:
   NO REVIEW DATA USED - Pure landlord perspective
   All review-based features excluded to prevent data leakage
   Target based on features available at listing creation time
   neighbourhood_cleansed and property_type preserved for T1.4
   Train-test split will be performed in T1.6

NEXT STEP:
  Task 1.3: Algebraic Feature Engineering

{'='*80}
"""

print(summary)

# Save summary to file (os is already imported in the first cell)
os.makedirs('../../outputs/reports', exist_ok=True)
with open('../../outputs/reports/T1.2_summary.txt', 'w') as f:
    f.write(summary)

print("\nSummary saved to: outputs/reports/T1.2_summary.txt")