## Airbnb Value Prediction - San Francisco & San Diego

**Project Goal:** Predict the value-for-money category of Airbnb listings using only landlord-controlled features.

**Pipeline Overview:**
- **Task 1.1:** Initial Data Exploration
- **Task 1.2:** Data Cleaning and Preprocessing
- **Task 1.3:** Algebraic Feature Engineering
- **Task 1.4:** Categorical Encoding
- **Task 1.5:** Feature Selection 


---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib
import os
from pathlib import Path


warnings.filterwarnings('ignore')

# Load both datasets
sf_df = pd.read_csv('../../data/raw/san francisco.csv')
sd_df = pd.read_csv('../../data/raw/san diego.csv')

print("="*80)
print("T1.1: Initial Data Exploration")
print("="*80)

print("\n" + "="*80)
print("1. Dataset Overview")
print("="*80)

print(f"\n San Francisco Dataset:")
print(f"   - Rows: {sf_df.shape[0]:,}")
print(f"   - Columns: {sf_df.shape[1]}")

print(f"\n San Diego Dataset:")
print(f"   - Rows: {sd_df.shape[0]:,}")
print(f"   - Columns: {sd_df.shape[1]}")

print(f"\n Combined Dataset:")
print(f"   - Total Rows: {sf_df.shape[0] + sd_df.shape[0]:,}")

# Check if columns match
sf_cols = set(sf_df.columns)
sd_cols = set(sd_df.columns)
print(f"\n✓ Column names match: {sf_cols == sd_cols}")

if sf_cols != sd_cols:
    print(f"   - Columns only in SF: {sf_cols - sd_cols}")
    print(f"   - Columns only in SD: {sd_cols - sf_cols}")

print("\n" + "="*80)
print("2. Column summary in San Francisco")
print("="*80)
print(f"\nTotal Columns: {sf_df.shape[1]}")
print("\nColumn Names:")
for i, col in enumerate(sf_df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n" + "="*80)
print("3. Data types analysis")
print("="*80)

# Analyze data types
sf_dtypes = sf_df.dtypes.value_counts()
print("\n San Francisco Data Types:")
for dtype, count in sf_dtypes.items():
    print(f"   - {dtype}: {count} columns")

sd_dtypes = sd_df.dtypes.value_counts()
print("\n San Diego Data Types:")
for dtype, count in sd_dtypes.items():
    print(f"   - {dtype}: {count} columns")

print("\n" + "="*80)
print("4. Missing values analysis - San Francisco")
print("="*80)

sf_missing = sf_df.isnull().sum()
sf_missing_pct = (sf_missing / len(sf_df) * 100).round(2)
sf_missing_df = pd.DataFrame({
    'Column': sf_missing.index,
    'Missing_Count': sf_missing.values,
    'Missing_Percentage': sf_missing_pct.values
})
sf_missing_df = sf_missing_df[sf_missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print(f"\n Columns with Missing Values: {len(sf_missing_df)}/{len(sf_df.columns)}")
print("\nTop 20 Columns with Most Missing Values:")
print(sf_missing_df.head(20).to_string(index=False))

print("\n" + "="*80)
print("5. Missing Values Analysis - San Diego")
print("="*80)

sd_missing = sd_df.isnull().sum()
sd_missing_pct = (sd_missing / len(sd_df) * 100).round(2)
sd_missing_df = pd.DataFrame({
    'Column': sd_missing.index,
    'Missing_Count': sd_missing.values,
    'Missing_Percentage': sd_missing_pct.values
})
sd_missing_df = sd_missing_df[sd_missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print(f"\n Columns with Missing Values: {len(sd_missing_df)}/{len(sd_df.columns)}")
print("\n Top 20 Columns with Most Missing Values:")
print(sd_missing_df.head(20).to_string(index=False))

print("\n" + "="*80)
print("6. Key numerical features summary - San Francisco")
print("="*80)

# Key numerical columns to analyze
key_numerical = ['price', 'accommodates', 'bedrooms', 'beds', 'bathrooms', 
                 'minimum_nights', 'maximum_nights', 'number_of_reviews',
                 'review_scores_rating', 'review_scores_accuracy', 
                 'review_scores_cleanliness', 'review_scores_checkin',
                 'review_scores_communication', 'review_scores_location',
                 'review_scores_value']

# Check which columns exist
existing_numerical = [col for col in key_numerical if col in sf_df.columns]

sf_summary = sf_df[existing_numerical].describe().T
sf_summary['missing'] = sf_df[existing_numerical].isnull().sum()
sf_summary['missing_pct'] = (sf_summary['missing'] / len(sf_df) * 100).round(2)
print(sf_summary[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'missing', 'missing_pct']].to_string())

print("\n" + "="*80)
print("7. Key numerical features summary - San Diego")
print("="*80)

sd_summary = sd_df[existing_numerical].describe().T
sd_summary['missing'] = sd_df[existing_numerical].isnull().sum()
sd_summary['missing_pct'] = (sd_summary['missing'] / len(sd_df) * 100).round(2)
print(sd_summary[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'missing', 'missing_pct']].to_string())

print("\n" + "="*80)
print("8. Key categorical features analysis")
print("="*80)

key_categorical = ['property_type', 'room_type', 'neighbourhood_cleansed', 
                   'host_is_superhost', 'instant_bookable']

existing_categorical = [col for col in key_categorical if col in sf_df.columns]

print("\n San Francisco - Categorical Features:")
for col in existing_categorical:
    unique_count = sf_df[col].nunique()
    missing = sf_df[col].isnull().sum()
    print(f"\n{col}:")
    print(f"   - Unique values: {unique_count}")
    print(f"   - Missing: {missing} ({missing/len(sf_df)*100:.2f}%)")
    if unique_count <= 10:
        print(f"   - Value counts:")
        print(sf_df[col].value_counts().head(10).to_string())

print("\n San Diego - Categorical Features:")
for col in existing_categorical:
    unique_count = sd_df[col].nunique()
    missing = sd_df[col].isnull().sum()
    print(f"\n{col}:")
    print(f"   - Unique values: {unique_count}")
    print(f"   - Missing: {missing} ({missing/len(sd_df)*100:.2f}%)")
    if unique_count <= 10:
        print(f"   - Value counts:")
        print(sd_df[col].value_counts().head(10).to_string())

print("\n" + "="*80)
print("9. PRICE ANALYSIS ")
print("="*80)

print("\n San Francisco - Price Column:")
print(f"   - Data type: {sf_df['price'].dtype}")
print(f"   - Sample values: {sf_df['price'].head(10).tolist()}")
print(f"   - Missing: {sf_df['price'].isnull().sum()}")

print("\n San Diego - Price Column:")
print(f"   - Data type: {sd_df['price'].dtype}")
print(f"   - Sample values: {sd_df['price'].head(10).tolist()}")
print(f"   - Missing: {sd_df['price'].isnull().sum()}")

print("\n  Note: Price column is stored as string with '$' and ',' - needs cleaning in T1.2")

print("\n" + "="*80)
print("10. Data quality issues")
print("="*80)

issues = []

# Check for duplicates
sf_dupes = sf_df.duplicated().sum()
sd_dupes = sd_df.duplicated().sum()
if sf_dupes > 0 or sd_dupes > 0:
    issues.append(f"Duplicate rows: SF={sf_dupes}, SD={sd_dupes}")

# Check price format
if sf_df['price'].dtype == 'object':
    issues.append("Price column needs cleaning (contains '$' and ',')")

# Check high missing value columns
high_missing_sf = sf_missing_df[sf_missing_df['Missing_Percentage'] > 50]
high_missing_sd = sd_missing_df[sd_missing_df['Missing_Percentage'] > 50]
issues.append(f"Columns with >50% missing: SF={len(high_missing_sf)}, SD={len(high_missing_sd)}")

# Check for columns with all missing
all_missing_sf = sf_missing_df[sf_missing_df['Missing_Percentage'] == 100]
all_missing_sd = sd_missing_df[sd_missing_df['Missing_Percentage'] == 100]
if len(all_missing_sf) > 0 or len(all_missing_sd) > 0:
    issues.append(f"Columns with 100% missing: SF={len(all_missing_sf)}, SD={len(all_missing_sd)}")

print("\n Issues Found:")
for i, issue in enumerate(issues, 1):
    print(f"{i}. {issue}")

print("\n" + "="*80)
print("T1.1 Summary")
print("="*80)

print(f"""
 Task T1.1 Completed: Initial Data Exploration

 Dataset Overview:
   - San Francisco: {sf_df.shape[0]:,} rows × {sf_df.shape[1]} columns
   - San Diego: {sd_df.shape[0]:,} rows × {sd_df.shape[1]} columns
   - Combined: {sf_df.shape[0] + sd_df.shape[0]:,} rows

 Key Findings:
   1. Both datasets have {sf_df.shape[1]} columns with matching structure
   2. {len(sf_missing_df)} columns in SF and {len(sd_missing_df)} columns in SD have missing values
   3. Review scores have significant missing values (~30-40%)
   4. Text columns (description, host_about, etc.) need NLP processing
   5. Categorical encoding needed for property_type, room_type, neighbourhood
   """)



print("\n" + "="*80)


---
---

# Task 1.2: Data Cleaning and Preprocessing

---



**Important note on data leakage:**
This notebook creates a comprehensive feature set including review-based features. These review features are essential for creating our target variable (`value_category`) which measures "value for money" based on rating/price ratio.

However, review-based features will be REMOVED from the model input in Task 1.5 because:
1. New listings have no reviews yet
2. We need to predict value for listings without review history
3. The target is derived from `review_scores_rating`, so using review features as inputs leaks the label into the model

**Strategy:**
- Keep all features (including review-based ones) in this task for target creation
- Task 1.5 will filter out review-based features from X (input) while keeping them for y (target)
- Price MUST remain as a feature - you cannot predict "value for money" without knowing the price!
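
As a rough illustration of the strategy above, the split between model inputs and target could look like the sketch below. The helper name `split_features_target` and the substring patterns used to spot review-derived columns are assumptions made for this example; the actual filtering is done in Task 1.5.

```python
import pandas as pd

# Hypothetical patterns for review-derived columns (an assumption for this sketch).
REVIEW_PATTERNS = ("review", "rating", "fp_score")

def split_features_target(df: pd.DataFrame, target: str = "value_category"):
    """Return (X, y): review-derived columns are dropped from X, price is kept."""
    leaky = [c for c in df.columns
             if c != target and any(p in c for p in REVIEW_PATTERNS)]
    X = df.drop(columns=leaky + [target])
    y = df[target]
    return X, y

toy = pd.DataFrame({
    "price": [100.0, 250.0],
    "accommodates": [2, 4],
    "review_scores_rating": [4.8, 4.1],
    "number_of_reviews": [12, 3],
    "fp_score": [30.0, 8.0],
    "value_category": ["Excellent_Value", "Poor_Value"],
})
X, y = split_features_target(toy)
print(list(X.columns))
```

Matching on substrings is deliberately blunt; an explicit list of columns would be safer once the final feature set is known.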

## 1. Import Libraries and Load Data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("="*80)
print("Task 1.2: Data cleaning and preprocessing")
print("="*80)

In [None]:
# Load both datasets
sf_df = pd.read_csv('../../data/raw/san francisco.csv')
sd_df = pd.read_csv('../../data/raw/san diego.csv')

# Add city identifier
sf_df['city'] = 'San Francisco'
sd_df['city'] = 'San Diego'

# Combine datasets
df = pd.concat([sf_df, sd_df], ignore_index=True)

print(f"\n Combined Dataset Shape: {df.shape}")
print(f"   - Total Rows: {df.shape[0]:,}")
print(f"   - Total Columns: {df.shape[1]}")

## 2. Missing Values Analysis
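
The same missing-value table is built separately for San Francisco, San Diego, and the combined data. A small helper could remove that repetition; `missing_summary` is a hypothetical name, and the toy frame below only illustrates the output shape.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, worst first."""
    counts = frame.isnull().sum()
    out = pd.DataFrame({
        "Column": counts.index,
        "Missing_Count": counts.values,
        "Missing_Percentage": (counts.values / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps.
    out = out[out["Missing_Count"] > 0]
    return out.sort_values("Missing_Percentage", ascending=False).reset_index(drop=True)

toy = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, 4], "c": [None, 2, 3, 4]})
summary = missing_summary(toy)
print(summary)
```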

In [None]:
print("\n" + "="*80)
print("2. MISSING VALUES ANALYSIS")
print("="*80)

# Calculate missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
})

# Filter columns with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print(f"\n Columns with Missing Values: {len(missing_df)}")
print("\nTop 20 Columns with Most Missing Data:")
print(missing_df.head(20).to_string(index=False))

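# Visualize missing data (create the output folder first so savefig does not fail)
os.makedirs('../../outputs/figures', exist_ok=True)
plt.figure(figsize=(12, 8))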
top_missing = missing_df.head(20)
plt.barh(top_missing['Column'], top_missing['Missing_Percentage'])
plt.xlabel('Missing Percentage (%)')
plt.title('Top 20 Columns with Missing Values')
plt.tight_layout()
plt.savefig('../../outputs/figures/missing_values_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n Visualization saved: outputs/figures/missing_values_analysis.png")

## 3. Feature Selection - Remove Irrelevant Columns

In [None]:
print("\n" + "="*80)
print("3. FEATURE SELECTION")
print("="*80)

# Define columns to drop (URLs, IDs, text descriptions, etc.)
columns_to_drop = [
    # URLs and IDs
    'listing_url', 'scrape_id', 'picture_url', 'host_url', 
    'host_thumbnail_url', 'host_picture_url',
    
    # Text descriptions (too noisy for initial model)
    'description', 'neighborhood_overview', 'host_about', 'name',
    
    # Redundant or highly specific
    'source', 'calendar_updated', 'last_scraped', 'calendar_last_scraped',
    
    # License (mostly missing or not useful)
    'license',
    
    # Neighbourhood group (if empty)
    'neighbourhood_group_cleansed',
    
    # Bathrooms (we'll use bathrooms_text instead)
    'bathrooms',
    
    # Host verifications (complex nested data)
    'host_verifications',
    
    # Amenities (complex nested data - can be processed later)
    'amenities'
]

# Drop columns that exist in the dataframe
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df_cleaned = df.drop(columns=columns_to_drop)

print(f"\n Dropped {len(columns_to_drop)} columns")
print(f"   - Original: {df.shape[1]} columns")
print(f"   - After dropping: {df_cleaned.shape[1]} columns")

## 4. Data Type Conversions and Cleaning
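
A minimal sketch of the string-to-number conversions in this section, run on made-up values. It uses `pd.to_numeric(..., errors="coerce")` instead of `astype(float)`, which is a swapped-in, more tolerant variant: unparseable entries become NaN rather than raising.

```python
import pandas as pd

raw = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", None],
    "host_response_rate": ["95%", "N/A", None],
    "instant_bookable": ["t", "f", None],
})

# Strip '$' and ',' and convert; anything unparseable becomes NaN.
price = pd.to_numeric(raw["price"].str.replace(r"[\$,]", "", regex=True), errors="coerce")

# '95%' becomes 0.95; 'N/A' becomes NaN instead of raising an error.
rate = pd.to_numeric(raw["host_response_rate"].str.rstrip("%"), errors="coerce") / 100

# 't'/'f' flags become 1/0; anything else stays missing.
bookable = raw["instant_bookable"].map({"t": 1, "f": 0})

print(price.tolist(), rate.tolist(), bookable.tolist())
```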

In [None]:
print("\n" + "="*80)
print("4. DATA TYPE CONVERSIONS")
print("="*80)

# 4.1 Clean price column (remove $ and commas)
if 'price' in df_cleaned.columns:
    df_cleaned['price'] = df_cleaned['price'].replace(r'[\$,]', '', regex=True).astype(float)
    print("\nCleaned 'price' column (removed $ and commas)")

# 4.2 Convert percentage columns
percentage_cols = ['host_response_rate', 'host_acceptance_rate']
for col in percentage_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = df_cleaned[col].str.rstrip('%').astype(float) / 100
        print(f"Converted '{col}' to decimal")

# 4.3 Convert boolean columns
boolean_cols = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 
                'has_availability', 'instant_bookable']
for col in boolean_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = df_cleaned[col].map({'t': 1, 'f': 0})
        print(f"Converted '{col}' to binary (0/1)")

# 4.4 Convert date columns
date_cols = ['host_since', 'first_review', 'last_review']
for col in date_cols:
    if col in df_cleaned.columns:
        df_cleaned[col] = pd.to_datetime(df_cleaned[col], errors='coerce')
        print(f"Converted '{col}' to datetime")

# 4.5 Extract number from bathrooms_text
if 'bathrooms_text' in df_cleaned.columns:
    df_cleaned['bathrooms_numeric'] = df_cleaned['bathrooms_text'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
    print("\n Extracted numeric bathrooms from 'bathrooms_text'")

## 5. Feature Engineering
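
The date-based features in this section are measured against `pd.Timestamp.now()`, so their values shift every time the notebook is rerun. A sketch with a fixed reference date is shown below; the date itself is an assumption for illustration (the scrape date of the data would be a natural choice).

```python
import pandas as pd

# Assumed reference date, for illustration only.
reference_date = pd.Timestamp("2024-01-01")

hosts = pd.DataFrame({"host_since": ["2014-01-01", "2023-01-01", "not a date"]})
hosts["host_since"] = pd.to_datetime(hosts["host_since"], errors="coerce")

# Elapsed time in years; rows with unparseable dates stay NaN.
hosts["host_years"] = (reference_date - hosts["host_since"]).dt.days / 365.25
print(hosts)
```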

In [None]:
print("\n" + "="*80)
print("5. FEATURE ENGINEERING")
print("="*80)

# 5.1 Host experience (years as host)
if 'host_since' in df_cleaned.columns:
    df_cleaned['host_years'] = (pd.Timestamp.now() - df_cleaned['host_since']).dt.days / 365.25
    print("\n Created 'host_years' feature")

# 5.2 Days since first review
if 'first_review' in df_cleaned.columns:
    df_cleaned['days_since_first_review'] = (pd.Timestamp.now() - df_cleaned['first_review']).dt.days
    print(" Created 'days_since_first_review' feature")

# 5.3 Days since last review
if 'last_review' in df_cleaned.columns:
    df_cleaned['days_since_last_review'] = (pd.Timestamp.now() - df_cleaned['last_review']).dt.days
    print(" Created 'days_since_last_review' feature")

# 5.4 Price per person 
if 'price' in df_cleaned.columns and 'accommodates' in df_cleaned.columns:
    df_cleaned['price_per_person'] = df_cleaned['price'] / df_cleaned['accommodates']
    print(" Created 'price_per_person' feature ")

# 5.5 Reviews per month (will be removed in T1.5)
if 'reviews_per_month' not in df_cleaned.columns:
    if 'number_of_reviews' in df_cleaned.columns and 'days_since_first_review' in df_cleaned.columns:
        df_cleaned['reviews_per_month'] = (df_cleaned['number_of_reviews'] / 
                    (df_cleaned['days_since_first_review'] / 30.44))
        print(" Created 'reviews_per_month' feature")

# 5.6 Availability rate 
if 'availability_365' in df_cleaned.columns:
    df_cleaned['availability_rate'] = df_cleaned['availability_365'] / 365
    print(" Created 'availability_rate' feature ")

# 5.7 Average review score (will be removed in T1.5)
review_score_cols = [col for col in df_cleaned.columns if 'review_scores_' in col and col != 'review_scores_rating']
if review_score_cols:
    df_cleaned['avg_review_score'] = df_cleaned[review_score_cols].mean(axis=1)
    print(" Created 'avg_review_score' feature ")

# 5.8 Has reviews flag (will be removed in T1.5)
if 'number_of_reviews' in df_cleaned.columns:
    df_cleaned['has_reviews'] = (df_cleaned['number_of_reviews'] > 0).astype(int)
    print(" Created 'has_reviews' feature ")

print(f"\n Total features after engineering: {df_cleaned.shape[1]}")

## 6. Handle Missing Values
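
The imputation rule used here (drop columns that are more than 50% missing, then fill numeric columns with the median and text columns with the mode) can be sketched as one function. `impute_simple` is a hypothetical helper written for this example, not part of the pipeline.

```python
import numpy as np
import pandas as pd

def impute_simple(frame: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, then fill numerics with the median and text with the mode."""
    out = frame.loc[:, frame.isnull().mean() <= drop_threshold].copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode()
            out[col] = out[col].fillna(mode[0] if len(mode) else "Unknown")
    return out

toy = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, 2.0],
    "room_type": ["Private room", None, "Private room", "Entire home/apt"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
clean = impute_simple(toy)
print(clean)
```

Note that filling review scores this way also gives a rating to listings that have none, which matters when the target is built later.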

In [None]:
print("\n" + "="*80)
print("6. HANDLING MISSING VALUES")
print("="*80)

# 6.1 Drop columns with >50% missing values
missing_threshold = 0.5
missing_pct = df_cleaned.isnull().sum() / len(df_cleaned)
cols_to_drop = missing_pct[missing_pct > missing_threshold].index.tolist()

if cols_to_drop:
    df_cleaned = df_cleaned.drop(columns=cols_to_drop)
    print(f"\n Dropped {len(cols_to_drop)} columns with >{missing_threshold*100}% missing values")
    print(f"   Columns dropped: {cols_to_drop}")

# 6.2 Fill missing values for specific columns
# Numeric columns - fill with median
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    if df_cleaned[col].isnull().sum() > 0:
        # Assign back instead of inplace=True, which acts on a temporary under pandas Copy-on-Write
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

print(f"\n Filled missing numeric values with median")

# Categorical columns - fill with mode or 'Unknown'
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        mode_val = df_cleaned[col].mode()
        fill_value = mode_val[0] if len(mode_val) > 0 else 'Unknown'
        df_cleaned[col] = df_cleaned[col].fillna(fill_value)

print(f"Filled missing categorical values with mode or 'Unknown'")

# Check remaining missing values
remaining_missing = df_cleaned.isnull().sum().sum()
print(f"\n Remaining missing values: {remaining_missing}")

## 7. Handle Outliers
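
A small worked example of the 3 × IQR fence applied to price in this section. The helper name `iqr_bounds` and the price values are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 3.0):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

prices = pd.Series([50, 80, 100, 120, 150, 5000], dtype=float)
low, high = iqr_bounds(prices)
kept = prices[(prices >= low) & (prices <= high)]
print(low, high, kept.tolist())
```

With right-skewed prices like these the lower fence comes out negative, so in practice only the upper fence removes rows; the `price > 0` filter handles the low end.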

In [None]:
print("\n" + "="*80)
print("7. OUTLIER DETECTION AND HANDLING")
print("="*80)

# Focus on price outliers
if 'price' in df_cleaned.columns:
    # Removing listings with price = 0 or extremely high prices
    initial_rows = len(df_cleaned)
    
    # Removing price = 0
    df_cleaned = df_cleaned[df_cleaned['price'] > 0]
    
    # Remove extreme outliers (using IQR method)
    Q1 = df_cleaned['price'].quantile(0.25)
    Q3 = df_cleaned['price'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    
    df_cleaned = df_cleaned[(df_cleaned['price'] >= lower_bound) & 
                    (df_cleaned['price'] <= upper_bound)]
    
    rows_removed = initial_rows - len(df_cleaned)
    print(f"\n Removed {rows_removed} rows with price outliers")
    print(f" - Price range: ${df_cleaned['price'].min():.2f} - ${df_cleaned['price'].max():.2f}")
    print(f" - Remaining rows: {len(df_cleaned):,}")

# Visualize price distribution after cleaning
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df_cleaned['price'], bins=50, edgecolor='black')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.title('Price Distribution (After Outlier Removal)')

plt.subplot(1, 2, 2)
plt.boxplot(df_cleaned['price'])
plt.ylabel('Price ($)')
plt.title('Price Boxplot (After Outlier Removal)')

plt.tight_layout()
plt.savefig('../../outputs/figures/price_distribution_cleaned.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n Visualization saved: outputs/figures/price_distribution_cleaned.png")

## 8. Create Target Variable (Value Category)

**CRITICAL:** We use review_scores_rating here to create labels. This is correct because:
1. We need historical data (with reviews) to learn what makes good/bad value
2. The target represents "what value category would this listing be if it had reviews"
3. Review features will be removed from the features (X) in Task 1.5 but kept for creating the target (y)
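
To make the labeling concrete, here is a toy version of the FP Score and the three-way split. The ratings and prices are invented, the rating is left on its raw scale (a constant rescaling does not change the ranking), and `pd.cut` is used here in place of a row-wise function.

```python
import pandas as pd

toy = pd.DataFrame({
    "review_scores_rating": [4.9, 4.5, 4.8, 3.9, 4.7, 4.2],
    "price": [60.0, 120.0, 200.0, 90.0, 400.0, 150.0],
})

# Min-max scale price to [0, 1]; the +0.1 offset avoids dividing by zero for the cheapest listing.
price_norm = (toy["price"] - toy["price"].min()) / (toy["price"].max() - toy["price"].min())
toy["fp_score"] = toy["review_scores_rating"] / (price_norm + 0.1)

# Cut points at the 33rd and 67th percentiles; intervals are closed on the right (<=).
edges = [-float("inf"), toy["fp_score"].quantile(0.33), toy["fp_score"].quantile(0.67), float("inf")]
toy["value_category"] = pd.cut(
    toy["fp_score"], bins=edges,
    labels=["Poor_Value", "Fair_Value", "Excellent_Value"],
)
print(toy[["price", "fp_score", "value_category"]])
```

In this toy data the cheap, well-rated listings land in `Excellent_Value` and the most expensive ones in `Poor_Value`, which shows how strongly price drives the label.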

In [None]:
print("\n" + "="*80)
print("8. Creating target variable: Value Category")
print("="*80)

# Calculate FP Score (Fair Price Score) = rating / (min-max normalized price + 0.1)
# This measures "value for money"; the 0.1 offset avoids division by zero for the cheapest listing

if 'review_scores_rating' in df_cleaned.columns and 'price' in df_cleaned.columns:
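    # Filter listings with reviews (needed for labeling). Ratings were median-imputed in
    # Section 6, so notna() alone keeps every row; also require at least one review.
    has_rating = df_cleaned['review_scores_rating'].notna()
    if 'number_of_reviews' in df_cleaned.columns:
        has_rating &= df_cleaned['number_of_reviews'] > 0
    df_with_reviews = df_cleaned[has_rating].copy()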
    
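    # Normalize rating to a 0-5 scale and price to 0-1.
    # Assumption: ratings above 5 indicate a 0-100 scale; otherwise they are already 0-5.
    rating_scale = 20 if df_with_reviews['review_scores_rating'].max() > 5 else 1
    df_with_reviews['rating_normalized'] = df_with_reviews['review_scores_rating'] / rating_scale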
    df_with_reviews['price_normalized'] = (df_with_reviews['price'] - df_with_reviews['price'].min()) / \
                    (df_with_reviews['price'].max() - df_with_reviews['price'].min())
    
    # Calculate FP Score (higher = better value)
    df_with_reviews['fp_score'] = df_with_reviews['rating_normalized'] / (df_with_reviews['price_normalized'] + 0.1)
    
    # Classify into 3 categories based on FP Score
    fp_33 = df_with_reviews['fp_score'].quantile(0.33)
    fp_67 = df_with_reviews['fp_score'].quantile(0.67)
    
    def classify_value(fp_score):
        if fp_score <= fp_33:
            return 'Poor_Value'
        elif fp_score <= fp_67:
            return 'Fair_Value'
        else:
            return 'Excellent_Value'
    
    df_with_reviews['value_category'] = df_with_reviews['fp_score'].apply(classify_value)
    
    print(f"\n Created FP Score and Value Category")
    print(f"   - Listings with reviews: {len(df_with_reviews):,}")
    print(f"   - FP Score range: {df_with_reviews['fp_score'].min():.2f} - {df_with_reviews['fp_score'].max():.2f}")
    print(f"\n Value Category Distribution:")
    print(df_with_reviews['value_category'].value_counts())
    
    # Visualize distribution
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
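    # Fix the bar order so each color matches its category (value_counts sorts by frequency)
    category_order = ['Poor_Value', 'Fair_Value', 'Excellent_Value']
    df_with_reviews['value_category'].value_counts().reindex(category_order).plot(kind='bar', color=['red', 'orange', 'green'])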
    plt.xlabel('Value Category')
    plt.ylabel('Count')
    plt.title('Distribution of Value Categories')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 2, 2)
    plt.hist(df_with_reviews['fp_score'], bins=50, edgecolor='black')
    plt.xlabel('FP Score')
    plt.ylabel('Frequency')
    plt.title('FP Score Distribution')
    
    plt.tight_layout()
    plt.savefig('../../outputs/figures/value_category_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n Visualization saved: outputs/figures/value_category_distribution.png")
    
    # Use df_with_reviews for further processing
    df_final = df_with_reviews.copy()
else:
    print("\n Warning: 'review_scores_rating' or 'price' not found. Skipping target creation.")
    df_final = df_cleaned.copy()

## 9. Save Cleaned Data 

**Note:** This dataset contains all features including review-based ones.
Tasks 1.3 and 1.4 will use this file.
Task 1.5 will filter out review-based features from model input.

In [None]:
print("\n" + "="*80)
print("9. SAVE CLEANED DATA")
print("="*80)

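# Save cleaned full dataset with target (create the folder first so to_csv does not fail)
os.makedirs('../../data/processed', exist_ok=True)
df_final.to_csv('../../data/processed/listings_cleaned_with_target.csv', index=False)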
print("\n Saved to: data/processed/listings_cleaned_with_target.csv")
print(f"   - Shape: {df_final.shape}")
print(f"   - Contains all features (including review-based ones)")
print(f"   - Review features will be filtered in Task 1.5")

## 10. Summary Report

In [None]:
print("\n" + "="*80)
print("10. SUMMARY REPORT")
print("="*80)

summary = f"""
{'='*80}

Original data:  
  - San Francisco: 7,780 listings
  - San Diego: 13,162 listings
  - Combined: {df.shape[0]:,} listings, {df.shape[1]} columns

After cleaning and preprocessing steps:
  - Final dataset: {df_final.shape[0]:,} listings, {df_final.shape[1]} columns

Target variable properties:
  - Name: value_category
  - Classes: Poor_Value, Fair_Value, Excellent_Value
  - Based on FP Score = Rating / (normalized price + 0.1) (measures value for money)
  - Distribution:
{df_final['value_category'].value_counts().to_string()}

Key steps performed:
  - Removed irrelevant columns (URLs, IDs, text descriptions)
  - Converted data types (price, percentages, booleans, dates)
  - Feature engineering (host_years, price_per_person, etc.)
  - Handled missing values (dropped >50% missing, imputed rest)
  - Removed outliers (zero prices and prices beyond 3 × IQR from the quartiles)
  - Created target variable (FP Score classification)

OUTPUT FILES:
  - data/processed/listings_cleaned_with_target.csv

VISUALIZATIONS:
  - outputs/figures/missing_values_analysis.png
  - outputs/figures/price_distribution_cleaned.png
  - outputs/figures/value_category_distribution.png


{'='*80}
"""

print(summary)




---
---

# Task 1.3: Algebraic Feature Engineering

---


In [None]:

warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("="*80)
print("Task 1.3: Algebraic Feature Engineering and Visualization")
print("San Francisco & San Diego Airbnb Dataset")
print("="*80)

## 1. Load Data from T1.2

In [None]:
# Load the cleaned data with target from T1.2
df = pd.read_csv('../../data/processed/listings_cleaned_with_target.csv')
print(f"\n Loaded Dataset Shape: {df.shape}")
print(f"Columns: {df.shape[1]}")
print(f"Rows: {df.shape[0]:,}")

# Check existing columns
print("\n Existing Columns:")
print(df.columns.tolist())

## 2. Verify Required Columns

In [None]:
print("\n Checking Required Columns for Algebraic Features:")
required_cols = ['price', 'accommodates', 'bedrooms', 'beds', 'bathrooms_numeric', 
                 'availability_365', 'host_total_listings_count', 'minimum_nights']

all_present = True
for col in required_cols:
    if col in df.columns:
        print(f"   [OK] {col}")
    else:
        print(f"   [MISSING] {col}")
        all_present = False

if all_present:
    print("\n All required columns present! Ready to create algebraic features.")
else:
    print("\n Some required columns are missing!")

## 3. Create 10 Landlord-Controlled Algebraic Features
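
Most of the ratio features below follow the same pattern: divide two columns, treat a zero denominator as missing, then fill the gaps with the median. A hypothetical helper, `safe_ratio`, captures that pattern; the toy values are made up.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, treating a zero denominator as missing, then fill with the median."""
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(ratio.median())

toy = pd.DataFrame({"beds": [2, 1, 4, 3], "bedrooms": [1, 0, 2, 1]})
toy["space_efficiency"] = safe_ratio(toy["beds"], toy["bedrooms"])
print(toy)
```

The studio in the second row (0 bedrooms) receives the median ratio instead of an infinite value.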

In [None]:
print("\n" + "="*80)
print("CREATING 10 NEW LANDLORD-CONTROLLED ALGEBRAIC FEATURES")
print("NO REVIEW-BASED DATA - PREVENTS DATA LEAKAGE")
print("="*80)

# Feature 1: space_efficiency (beds per bedroom) 
print("\n Creating: space_efficiency = beds / bedrooms ")
df['space_efficiency'] = df['beds'] / df['bedrooms'].replace(0, np.nan)
df['space_efficiency'] = df['space_efficiency'].fillna(df['space_efficiency'].median())
print(f"   Range: {df['space_efficiency'].min():.2f} to {df['space_efficiency'].max():.2f}")
print(f"   Mean: {df['space_efficiency'].mean():.2f}")
print(f"   Median: {df['space_efficiency'].median():.2f}")

# Feature 2: price_per_bedroom 
print("\n Creating: price_per_bedroom = price / bedrooms ")
df['price_per_bedroom'] = df['price'] / df['bedrooms'].replace(0, np.nan)
df['price_per_bedroom'] = df['price_per_bedroom'].fillna(df['price_per_bedroom'].median())
print(f"   Range: ${df['price_per_bedroom'].min():.2f} to ${df['price_per_bedroom'].max():.2f}")
print(f"   Mean: ${df['price_per_bedroom'].mean():.2f}")
print(f"   Median: ${df['price_per_bedroom'].median():.2f}")

# Feature 3: price_per_bathroom 
print("\n Creating: price_per_bathroom = price / bathrooms_numeric ")
df['price_per_bathroom'] = df['price'] / df['bathrooms_numeric'].replace(0, np.nan)
df['price_per_bathroom'] = df['price_per_bathroom'].fillna(df['price_per_bathroom'].median())
print(f"   Range: ${df['price_per_bathroom'].min():.2f} to ${df['price_per_bathroom'].max():.2f}")
print(f"   Mean: ${df['price_per_bathroom'].mean():.2f}")
print(f"   Median: ${df['price_per_bathroom'].median():.2f}")

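# Feature 4: occupancy_rate (a proxy: unavailable days may be blocked by the host rather than booked;
#            it also equals 1 - availability_rate from T1.2, so the two are perfectly correlated)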
print("\n Creating: occupancy_rate = (365 - availability_365) / 365 ")
df['occupancy_rate'] = (365 - df['availability_365']) / 365
df['occupancy_rate'] = df['occupancy_rate'].clip(0, 1)
print(f"   Range: {df['occupancy_rate'].min():.2f} to {df['occupancy_rate'].max():.2f}")
print(f"   Mean: {df['occupancy_rate'].mean():.2f}")
print(f"   Median: {df['occupancy_rate'].median():.2f}")

# Feature 5: booking_flexibility_score 
print("\n Creating: booking_flexibility_score = 1 / (minimum_nights + 1) ")
df['booking_flexibility_score'] = 1 / (df['minimum_nights'] + 1)
print(f"   Range: {df['booking_flexibility_score'].min():.6f} to {df['booking_flexibility_score'].max():.6f}")
print(f"   Mean: {df['booking_flexibility_score'].mean():.6f}")
print(f"   Median: {df['booking_flexibility_score'].median():.6f}")

# Feature 6: space_per_person 
print("\n Creating: space_per_person = bedrooms / accommodates ")
df['space_per_person'] = df['bedrooms'] / df['accommodates'].replace(0, np.nan)
df['space_per_person'] = df['space_per_person'].fillna(df['space_per_person'].median())
print(f"   Range: {df['space_per_person'].min():.2f} to {df['space_per_person'].max():.2f}")
print(f"   Mean: {df['space_per_person'].mean():.2f}")
print(f"   Median: {df['space_per_person'].median():.2f}")

# Feature 7: host_portfolio_intensity 
print("\n Creating: host_portfolio_intensity = host_total_listings_count / accommodates ")
df['host_portfolio_intensity'] = df['host_total_listings_count'] / df['accommodates'].replace(0, np.nan)
df['host_portfolio_intensity'] = df['host_portfolio_intensity'].fillna(df['host_portfolio_intensity'].median())
print(f"   Range: {df['host_portfolio_intensity'].min():.2f} to {df['host_portfolio_intensity'].max():.2f}")
print(f"   Mean: {df['host_portfolio_intensity'].mean():.2f}")
print(f"   Median: {df['host_portfolio_intensity'].median():.2f}")

# Feature 8: bathroom_to_bedroom_ratio 
print("\n Creating: bathroom_to_bedroom_ratio = bathrooms_numeric / bedrooms ")
df['bathroom_to_bedroom_ratio'] = df['bathrooms_numeric'] / df['bedrooms'].replace(0, np.nan)
df['bathroom_to_bedroom_ratio'] = df['bathroom_to_bedroom_ratio'].fillna(df['bathroom_to_bedroom_ratio'].median())
print(f"   Range: {df['bathroom_to_bedroom_ratio'].min():.2f} to {df['bathroom_to_bedroom_ratio'].max():.2f}")
print(f"   Mean: {df['bathroom_to_bedroom_ratio'].mean():.2f}")
print(f"   Median: {df['bathroom_to_bedroom_ratio'].median():.2f}")
print(f"   Interpretation: Higher ratio = more luxury (more bathrooms per bedroom)")

# Feature 9: price_to_capacity_ratio  
print("\n Creating: price_to_capacity_ratio = price / (accommodates × bedrooms) ")
df['price_to_capacity_ratio'] = df['price'] / (df['accommodates'] * df['bedrooms'].replace(0, np.nan))
df['price_to_capacity_ratio'] = df['price_to_capacity_ratio'].fillna(df['price_to_capacity_ratio'].median())
print(f"   Range: ${df['price_to_capacity_ratio'].min():.2f} to ${df['price_to_capacity_ratio'].max():.2f}")
print(f"   Mean: ${df['price_to_capacity_ratio'].mean():.2f}")
print(f"   Median: ${df['price_to_capacity_ratio'].median():.2f}")
print(f"   Interpretation: Price efficiency per unit of space")

# Feature 10: availability_flexibility_score  
print("\n Creating: availability_flexibility_score = availability_365 / minimum_nights ")
df['availability_flexibility_score'] = df['availability_365'] / df['minimum_nights'].replace(0, np.nan)
df['availability_flexibility_score'] = df['availability_flexibility_score'].fillna(df['availability_flexibility_score'].median())
df['availability_flexibility_score'] = df['availability_flexibility_score'].clip(0, 365)  # Cap at 365
print(f"   Range: {df['availability_flexibility_score'].min():.2f} to {df['availability_flexibility_score'].max():.2f}")
print(f"   Mean: {df['availability_flexibility_score'].mean():.2f}")
print(f"   Median: {df['availability_flexibility_score'].median():.2f}")
print(f"   Interpretation: High availability + low minimum nights = more flexible booking")

print("\n" + "="*80)
print(" All 10 new features created successfully!")
print("="*80)

## 4. Summary and Data Quality Check

In [None]:
# Summary
new_features = [
    'space_efficiency', 'price_per_bedroom', 'price_per_bathroom',
    'occupancy_rate', 'booking_flexibility_score', 'space_per_person',
    'host_portfolio_intensity', 'bathroom_to_bedroom_ratio',
    'price_to_capacity_ratio', 'availability_flexibility_score'
]

print(f"\n New Dataset Shape: {df.shape}")
print(f" Added: {len(new_features)} new algebraic features")

# Check data quality
print("\n Data Quality Check:")
quality_ok = True
for feature in new_features:
    nan_count = df[feature].isna().sum()
    inf_count = np.isinf(df[feature]).sum()
    if nan_count > 0 or inf_count > 0:
        print(f"{feature}: {nan_count} NaN, {inf_count} Inf values")
        quality_ok = False
        
if quality_ok:
    print(" All features are clean (no NaN or Inf values)")

## 5. Save Dataset

In [None]:
# Save the dataset
output_path = '../../data/processed/listings_with_algebraic_features.csv'
df.to_csv(output_path, index=False)
print(f"\nSaved dataset to: {output_path}")
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

## 6. Detailed Statistics for All 10 Features

In [None]:
print("\n" + "="*100)
print("Detailed Statistical Summary of New Features")
print("="*100)

stats_data = []
for i, feature in enumerate(new_features, 1):
    stats = {
        'No.': i,
        'Feature Name': feature,
        'Mean': df[feature].mean(),
        'Median': df[feature].median(),
        'Std Dev': df[feature].std(),
        'Min': df[feature].min(),
        'Max': df[feature].max(),
        'Q1': df[feature].quantile(0.25),
        'Q3': df[feature].quantile(0.75),
        'Skewness': df[feature].skew(),
        'Missing': df[feature].isna().sum()
    }
    stats_data.append(stats)

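stats_df = pd.DataFrame(stats_data)

# Persist the statistics table; the final summary report lists this file as an output
stats_df.to_csv('../../data/processed/algebraic_features_statistics.csv', index=False)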

# Display formatted table
print("\n")
for idx, row in stats_df.iterrows():
    print(f"{row['No.']}. {row['Feature Name'].upper()}")
    print(f"   Mean: {row['Mean']:.4f} | Median: {row['Median']:.4f} | Std: {row['Std Dev']:.4f}")
    print(f"   Range: [{row['Min']:.4f}, {row['Max']:.4f}] | IQR: [{row['Q1']:.4f}, {row['Q3']:.4f}]")
    print(f"   Skewness: {row['Skewness']:.4f} | Missing: {row['Missing']}")
    print()

print("\n" + "="*100)

## 7. Visualizations

In [None]:
# Create visualizations for the new features
import os
os.makedirs('../../outputs/figures', exist_ok=True)

# 1. Distribution plots for all 10 features
fig, axes = plt.subplots(5, 2, figsize=(15, 20))
axes = axes.ravel()

for idx, feature in enumerate(new_features):
    axes[idx].hist(df[feature], bins=50, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{feature}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../../outputs/figures/algebraic_features_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
print("\n Saved to : outputs/figures/algebraic_features_distributions.png")

# 2. Correlation heatmap of new features
plt.figure(figsize=(12, 10))
correlation_matrix = df[new_features].corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of 10 Landlord-Controlled Algebraic Features', 
          fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../../outputs/figures/algebraic_features_correlation.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved to : outputs/figures/algebraic_features_correlation.png")

# 3. Box plots for outlier detection
fig, axes = plt.subplots(5, 2, figsize=(15, 20))
axes = axes.ravel()

for idx, feature in enumerate(new_features):
    axes[idx].boxplot(df[feature].dropna())
    axes[idx].set_title(f'{feature}', fontsize=10, fontweight='bold')
    axes[idx].set_ylabel('Value')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../../outputs/figures/algebraic_features_boxplots.png', dpi=300, bbox_inches='tight')
plt.show()
print("Saved to : outputs/figures/algebraic_features_boxplots.png")

# 4. Summary statistics visualization
fig, ax = plt.subplots(figsize=(14, 8))
x_pos = np.arange(len(new_features))
means = [df[f].mean() for f in new_features]
stds = [df[f].std() for f in new_features]

# Normalize for visualization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
means_normalized = scaler.fit_transform(np.array(means).reshape(-1, 1)).flatten()
stds_normalized = scaler.fit_transform(np.array(stds).reshape(-1, 1)).flatten()

ax.bar(x_pos - 0.2, means_normalized, 0.4, label='Mean (normalized)', alpha=0.8)
ax.bar(x_pos + 0.2, stds_normalized, 0.4, label='Std Dev (normalized)', alpha=0.8)
ax.set_xlabel('Features', fontweight='bold')
ax.set_ylabel('Normalized Value', fontweight='bold')
ax.set_title('Mean and Standard Deviation of Algebraic Features (Normalized)', 
             fontsize=14, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(new_features, rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../../outputs/figures/algebraic_features_variability.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved to : outputs/figures/algebraic_features_variability.png")

## 8. Final Summary Report

In [None]:
print("\n" + "="*80)
print(" Task 1.3 Summary Report")
print("="*80)

summary = f"""
{'='*80}

INPUT DATA:
  - Source: data/processed/listings_cleaned_with_target.csv (from T1.2)
  - Shape: {df.shape[0]:,} rows × {df.shape[1] - len(new_features)} columns (before)

OUTPUT DATA:
  - Destination: data/processed/listings_with_algebraic_features.csv
  - Shape: {df.shape[0]:,} rows × {df.shape[1]} columns (after)
  - Added: {len(new_features)} new algebraic features

  New algebraic features created:
  1. space_efficiency = beds / bedrooms
  2. price_per_bedroom = price / bedrooms
  3. price_per_bathroom = price / bathrooms_numeric
  4. occupancy_rate = (365 - availability_365) / 365
  5. booking_flexibility_score = 1 / (minimum_nights + 1)
  6. space_per_person = bedrooms / accommodates
  7. host_portfolio_intensity = host_total_listings_count / accommodates
  8. bathroom_to_bedroom_ratio = bathrooms_numeric / bedrooms (luxury indicator)
  9. price_to_capacity_ratio = price / (accommodates × bedrooms) (price efficiency)
  10. availability_flexibility_score = availability_365 / minimum_nights (booking flexibility)

 Data leakage prevention:
  - No review-based features used in these calculations
  - All features derived from landlord-controllable data only
  - Features are available for NEW listings without review history
  - Model can predict value for listings with zero reviews

OUTPUT FILES:
  - data/processed/listings_with_algebraic_features.csv
  - data/processed/algebraic_features_statistics.csv

VISUALIZATIONS:
  - outputs/figures/algebraic_features_distributions.png
  - outputs/figures/algebraic_features_correlation.png
  - outputs/figures/algebraic_features_boxplots.png
  - outputs/figures/algebraic_features_variability.png

{'='*80}
{'='*80}
"""

print(summary)






---
---

# Task 1.4: Categorical encoding

---


## 1. Import Libraries and Load Data

In [None]:

from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

print("="*80)
print("Task 1.4: Categorical encoding")
print("="*80)

# Load the dataset with algebraic features from T1.3
df = pd.read_csv('../../data/processed/listings_with_algebraic_features.csv')
print(f"\n Loaded dataset from T1.3: {df.shape}")
print(f"   Rows: {df.shape[0]:,}")
print(f"   Columns: {df.shape[1]}")

## 2. Merge Categorical Columns from Raw Data

In [None]:
print("\n Loading raw data to get categorical columns...")

# Load raw datasets
sf_raw = pd.read_csv('../../data/raw/san francisco.csv')
sd_raw = pd.read_csv('../../data/raw/san diego.csv')

print(f" San Francisco: {sf_raw.shape}")
print(f" San Diego: {sd_raw.shape}")

# Combine raw datasets
raw_combined = pd.concat([sf_raw, sd_raw], ignore_index=True)
print(f" Combined raw: {raw_combined.shape}")

# Select only needed categorical columns
categorical_cols = raw_combined[['id', 'property_type', 'room_type', 'neighbourhood_cleansed']]

# Drop existing categorical columns from df before merging to avoid duplicates
cols_to_drop = ['property_type', 'room_type', 'neighbourhood_cleansed']
existing_cols = [col for col in cols_to_drop if col in df.columns]
if existing_cols:
    df = df.drop(columns=existing_cols)
    print(f"\n Dropped existing columns: {existing_cols}")

# Merge with main dataset
df = df.merge(categorical_cols, on='id', how='left')
print(f"\n After merging categorical columns: {df.shape}")

# Check for missing values in categorical columns
print("\n Checking categorical columns:")
for col in ['property_type', 'room_type', 'neighbourhood_cleansed']:
    missing = df[col].isna().sum()
    unique = df[col].nunique()
    print(f"   {col}: {unique} unique values, {missing} missing")
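
A left merge on `id` silently duplicates rows if the same `id` appears more than once in the raw data, and leaves `NaN` in the categorical columns for any `id` that has no match. The sketch below, on hypothetical toy frames, shows one way to guard against both using the `validate` and `indicator` arguments of `DataFrame.merge`. It assumes that dropping repeated `id` rows from the lookup table is acceptable.

```python
import pandas as pd

# Toy stand-ins for the processed listings and the combined raw data (hypothetical values)
listings = pd.DataFrame({'id': [1, 2, 3], 'price': [120.0, 90.0, 250.0]})
raw = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'room_type': ['Entire home/apt', 'Private room', 'Private room', 'Shared room'],
})

# Keep one row per id so the merge cannot multiply rows
lookup = raw.drop_duplicates(subset='id')

# validate raises MergeError if the right-hand keys are not unique;
# indicator adds a '_merge' column that flags rows without a match
merged = listings.merge(
    lookup, on='id', how='left', validate='many_to_one', indicator=True
)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
merged = merged.drop(columns='_merge')

print(f"Rows before: {len(listings)}, rows after: {len(merged)}")
print(f"Listing ids without raw categorical data: {unmatched}")
```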

## 3. Perform Categorical Encoding

In [None]:
print("\n" + "="*80)
print("Categorical encoding - 4 variables, 10 new features")
print("="*80)

# Initialize label encoders
le_property = LabelEncoder()
le_neighbourhood = LabelEncoder()

# 1. ROOM TYPE - One-Hot Encoding 
print("\n ROOM TYPE - One-Hot Encoding ")
print(f" Original categories: {df['room_type'].nunique()}")
print(f" Categories: {df['room_type'].unique().tolist()}")

room_dummies = pd.get_dummies(df['room_type'], prefix='room_type')
df = pd.concat([df, room_dummies], axis=1)

print(f" Created {len(room_dummies.columns)} binary columns:")
for col in room_dummies.columns:
    count = df[col].sum()
    pct = (count / len(df)) * 100
    print(f"      {col}: {int(count):,} ({pct:.2f}%)")

# 2. PROPERTY TYPE - Label + Frequency Encoding 
print("\n Property Type - Label + Frequency Encoding ")
print(f"   Original categories: {df['property_type'].nunique()}")

# Label encoding
df['property_type_label'] = le_property.fit_transform(df['property_type'])

# Frequency encoding
df['property_type_frequency'] = df['property_type'].map(
    df['property_type'].value_counts(normalize=True)
)

print(f" Created 2 columns:")
print(f" property_type_label: Range 0-{int(df['property_type_label'].max())}")
print(f" property_type_frequency: Range {df['property_type_frequency'].min():.4f}-{df['property_type_frequency'].max():.4f}")
print(f" Mean frequency: {df['property_type_frequency'].mean():.4f}")

# 3. NEIGHBOURHOOD - Target + Frequency + Label Encoding 
print("\n NEIGHBOURHOOD - Target + Frequency + Label Encoding ")
print(f"  Original categories: {df['neighbourhood_cleansed'].nunique()}")

# First encode value_category for target encoding
value_mapping = {'Poor_Value': 0, 'Fair_Value': 1, 'Excellent_Value': 2}
df['value_encoded'] = df['value_category'].map(value_mapping)

# Target encoding (mean value_encoded per neighbourhood)
neighbourhood_target = df.groupby('neighbourhood_cleansed')['value_encoded'].mean()
df['neighbourhood_target_encoded'] = df['neighbourhood_cleansed'].map(neighbourhood_target)

# Frequency encoding
df['neighbourhood_frequency'] = df['neighbourhood_cleansed'].map(
    df['neighbourhood_cleansed'].value_counts(normalize=True)
)

# Label encoding
df['neighbourhood_label'] = le_neighbourhood.fit_transform(df['neighbourhood_cleansed'])

print(f" Created 3 columns:")
print(f" neighbourhood_label: Range 0-{int(df['neighbourhood_label'].max())}")
print(f" neighbourhood_target_encoded: Range {df['neighbourhood_target_encoded'].min():.4f}-{df['neighbourhood_target_encoded'].max():.4f}")
print(f" neighbourhood_frequency: Range {df['neighbourhood_frequency'].min():.4f}-{df['neighbourhood_frequency'].max():.4f}")
print(f" Mean target encoding: {df['neighbourhood_target_encoded'].mean():.4f}")

# 4. VALUE CATEGORY - Already encoded as value_encoded [TARGET VARIABLE]
print("\n VALUE CATEGORY - Label Encoding [Target Variable] ")
print(f"  Original categories: {df['value_category'].nunique()}")
print(f"  Mapping: Poor_Value=0, Fair_Value=1, Excellent_Value=2")
print(f"  Created 1 column: value_encoded")

value_dist = df['value_encoded'].value_counts().sort_index()
for val, count in value_dist.items():
    pct = (count / len(df)) * 100
    label = ['Poor_Value', 'Fair_Value', 'Excellent_Value'][int(val)]
    print(f"      {val} ({label}): {count:,} ({pct:.2f}%)")

print("\n" + "="*80)
print(" All categorical encoding completed!")
print(f" Total new encoded features: 10")
print("="*80)
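
The neighbourhood target encoding above averages `value_encoded` over every row, so each listing's own label contributes to its encoded value. Task 1.5 leaves `neighbourhood_target_encoded` out of the model features, which sidesteps the issue. If the column were ever used for modelling, an out-of-fold variant is one common alternative. The sketch below is not the method used in this notebook; the function name `oof_target_encode`, the fold count, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(categories: pd.Series, target: pd.Series,
                      n_splits: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row with the category mean computed on the other folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kfold.split(categories):
        fit_target = target.iloc[fit_idx]
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        # Categories unseen in the fitting folds fall back to the fitting-fold mean
        mapped = categories.iloc[apply_idx].map(fold_means).fillna(fit_target.mean())
        encoded.iloc[apply_idx] = mapped.to_numpy()
    return encoded


# Toy data (hypothetical neighbourhoods and labels)
toy = pd.DataFrame({
    'neighbourhood_cleansed': ['Mission', 'Mission', 'Mission', 'Marina',
                               'Marina', 'Marina', 'Sunset', 'Sunset'],
    'value_encoded': [2, 1, 2, 0, 1, 0, 1, 2],
})
toy['neighbourhood_target_oof'] = oof_target_encode(
    toy['neighbourhood_cleansed'], toy['value_encoded'], n_splits=4
)
print(toy)
```

With this scheme a row's encoded value does not change when its own label changes, which is the property that plain group means lack.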

## 4. Data Quality Check

In [None]:
print("\n Data Quality Check:")

# Check for duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()].tolist()
if duplicate_cols:
    print(f" WARNING: Duplicate columns found: {duplicate_cols}")
    print(f" Removing duplicate columns...")
    df = df.loc[:, ~df.columns.duplicated()]
    print(f" Duplicates removed. New shape: {df.shape}")
else:
    print(f" No duplicate columns found")

# Check all new encoding columns for missing values
all_clean = True
new_encoding_cols = [
    'room_type_Entire home/apt', 'room_type_Hotel room', 
    'room_type_Private room', 'room_type_Shared room',
    'property_type_label', 'property_type_frequency',
    'neighbourhood_label', 'neighbourhood_target_encoded', 
    'neighbourhood_frequency', 'value_encoded'
]

print("\n   Checking encoded columns for missing values:")
for col in new_encoding_cols:
    if col in df.columns:
        missing = int(df[col].isna().sum())
        if missing > 0:
            print(f" {col}: {missing} missing values")
            all_clean = False
        else:
            print(f" {col}: No missing values")

if all_clean:
    print("\n All encoded columns are complete (no missing values)")
else:
    print("\n Some columns have missing values - review required")

print(f"\n Final Dataset Shape: {df.shape}")
print(f"   Rows: {df.shape[0]:,}")
print(f"   Columns: {df.shape[1]}")

## 5. Save Encoded Dataset and Mapping Files

In [None]:

# Save the encoded dataset
output_path = '../../data/processed/listings_with_categorical_encoding.csv'
df.to_csv(output_path, index=False)
print(f"\n Saved encoded dataset to: {output_path}")
print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

# Create encoding reference files
print(f"\n Creating encoding reference files...")

# 1. Property Type Mapping
property_mapping = pd.DataFrame({
    'property_type': le_property.classes_,
    'label': range(len(le_property.classes_))
})
property_mapping = property_mapping.merge(
    df.groupby('property_type')['property_type_frequency'].first().reset_index(),
    on='property_type'
)
property_mapping = property_mapping.merge(
    # rename_axis + reset_index(name=...) yields the same column names on pandas 1.x and 2.x
    df['property_type'].value_counts().rename_axis('property_type').reset_index(name='count'),
    on='property_type'
)
property_mapping = property_mapping.sort_values('count', ascending=False)
property_mapping.to_csv('../../outputs/property_type_encoding_map.csv', index=False)
print(f" Saved to: outputs/property_type_encoding_map.csv ({len(property_mapping)} property types)")

# 2. Neighbourhood Mapping
neighbourhood_mapping = pd.DataFrame({
    'neighbourhood': le_neighbourhood.classes_,
    'label': range(len(le_neighbourhood.classes_))
})
neighbourhood_mapping = neighbourhood_mapping.merge(
    df.groupby('neighbourhood_cleansed').agg({
        'neighbourhood_target_encoded': 'first',
        'neighbourhood_frequency': 'first'
    }).reset_index(),
    left_on='neighbourhood',
    right_on='neighbourhood_cleansed'
).drop('neighbourhood_cleansed', axis=1)
neighbourhood_mapping = neighbourhood_mapping.merge(
    # rename_axis + reset_index(name=...) yields the same column names on pandas 1.x and 2.x
    df['neighbourhood_cleansed'].value_counts().rename_axis('neighbourhood_cleansed').reset_index(name='count'),
    left_on='neighbourhood',
    right_on='neighbourhood_cleansed'
).drop('neighbourhood_cleansed', axis=1)
neighbourhood_mapping = neighbourhood_mapping.sort_values('count', ascending=False)
neighbourhood_mapping.to_csv('../../outputs/neighbourhood_encoding_map.csv', index=False)
print(f" Saved to : outputs/neighbourhood_encoding_map.csv ({len(neighbourhood_mapping)} neighbourhoods)")

# 3. Value Category Mapping
value_mapping_df = pd.DataFrame({
    'value_category': ['Poor_Value', 'Fair_Value', 'Excellent_Value'],
    'encoded_value': [0, 1, 2],
    'count': [df[df['value_encoded']==i].shape[0] for i in range(3)]
})
value_mapping_df.to_csv('../../outputs/value_category_encoding_map.csv', index=False)
print(f" Saved to: outputs/value_category_encoding_map.csv")

print("\n" + "="*80)
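
The mapping files saved above can be reused to encode listings that were not part of this dataset. A new listing may carry a property type that never appeared in the original data, in which case a plain `map` returns `NaN`. The sketch below shows one possible fallback for frequency encoding, assuming that an unseen category should receive a frequency of 0.0. The toy values are hypothetical.

```python
import pandas as pd

# Frequencies learned from the existing data (toy values)
train_types = pd.Series(['Entire home', 'Entire home', 'Private room', 'Entire condo'])
freq_map = train_types.value_counts(normalize=True)

# New listings, one of which has a property type never seen before
new_types = pd.Series(['Entire home', 'Treehouse'])

# Unseen categories map to NaN; treat them as frequency 0.0 (an assumption)
new_freq = new_types.map(freq_map).fillna(0.0)
print(new_freq.tolist())
```

`LabelEncoder.transform` raises a `ValueError` on unseen labels, so the label-encoded columns would need a similar explicit fallback.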

## 6. Create Visualizations

In [None]:
print("\n Creating visualizations...")



# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (16, 10)

# Figure 1: Categorical Encoding Analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Categorical Encoding Analysis - Task 1.4', fontsize=16, fontweight='bold')

# 1. Room Type Distribution
# Derive the one-hot columns from the data so a missing category does not raise a KeyError
room_type_cols = [col for col in df.columns if col.startswith('room_type_')]
room_type_data = df[room_type_cols].sum()
axes[0, 0].bar(range(len(room_type_data)), room_type_data.values, color='skyblue', edgecolor='black')
axes[0, 0].set_xticks(range(len(room_type_data)))
axes[0, 0].set_xticklabels([col.replace('room_type_', '') for col in room_type_cols],
                           rotation=45, ha='right')
axes[0, 0].set_title('Room Type Distribution (One-Hot Encoded)', fontweight='bold')
axes[0, 0].set_ylabel('Count')
axes[0, 0].grid(True, alpha=0.3)

# 2. Property Type Frequency Distribution
axes[0, 1].hist(df['property_type_frequency'], bins=30, color='coral', edgecolor='black')
axes[0, 1].set_title('Property Type Frequency Distribution', fontweight='bold')
axes[0, 1].set_xlabel('Frequency')
axes[0, 1].set_ylabel('Count')
axes[0, 1].grid(True, alpha=0.3)

# 3. Neighbourhood Target Encoding Distribution
axes[1, 0].hist(df['neighbourhood_target_encoded'], bins=30, color='lightgreen', edgecolor='black')
axes[1, 0].set_title('Neighbourhood Target Encoding Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Target Encoded Value')
axes[1, 0].set_ylabel('Count')
axes[1, 0].grid(True, alpha=0.3)

# 4. Value Category Distribution
value_counts = df['value_encoded'].value_counts().sort_index()
axes[1, 1].bar(['Poor Value', 'Fair Value', 'Excellent Value'], value_counts.values, 
               color=['#ff6b6b', '#ffd93d', '#6bcf7f'], edgecolor='black')
axes[1, 1].set_title('Value Category Distribution (Label Encoded)', fontweight='bold')
axes[1, 1].set_ylabel('Count')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../../outputs/figures/categorical_encoding_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved to: outputs/figures/categorical_encoding_analysis.png")

# Figure 2: Encoding Methods Comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Encoding Methods Comparison - Task 1.4', fontsize=16, fontweight='bold')

# 1. Cardinality Comparison
variables = ['room_type', 'property_type', 'neighbourhood', 'value_category']
cardinalities = [4, df['property_type'].nunique(), df['neighbourhood_cleansed'].nunique(), 3]
colors_card = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
axes[0, 0].barh(variables, cardinalities, color=colors_card, edgecolor='black')
axes[0, 0].set_title('Original Cardinality by Variable', fontweight='bold')
axes[0, 0].set_xlabel('Number of Unique Categories')
axes[0, 0].grid(True, alpha=0.3, axis='x')

# 2. Encoding Methods Used
methods = ['One-Hot', 'Label +\nFrequency', 'Target +\nFrequency +\nLabel', 'Label\n(Ordinal)']
columns_created = [4, 2, 3, 1]
axes[0, 1].bar(range(len(methods)), columns_created, color=colors_card, edgecolor='black')
axes[0, 1].set_xticks(range(len(methods)))
axes[0, 1].set_xticklabels(methods)
axes[0, 1].set_title('Columns Created by Encoding Method', fontweight='bold')
axes[0, 1].set_ylabel('Number of Columns')
axes[0, 1].grid(True, alpha=0.3, axis='y')

# 3. Property Type - Top 10
top_properties = df['property_type'].value_counts().head(10)
axes[1, 0].barh(range(len(top_properties)), top_properties.values, color='steelblue', edgecolor='black')
axes[1, 0].set_yticks(range(len(top_properties)))
axes[1, 0].set_yticklabels(top_properties.index, fontsize=9)
axes[1, 0].set_title('Top 10 Property Types', fontweight='bold')
axes[1, 0].set_xlabel('Count')
axes[1, 0].invert_yaxis()
axes[1, 0].grid(True, alpha=0.3, axis='x')

# 4. Neighbourhood - Top 10
top_neighbourhoods = df['neighbourhood_cleansed'].value_counts().head(10)
axes[1, 1].barh(range(len(top_neighbourhoods)), top_neighbourhoods.values, 
                color='mediumseagreen', edgecolor='black')
axes[1, 1].set_yticks(range(len(top_neighbourhoods)))
axes[1, 1].set_yticklabels(top_neighbourhoods.index, fontsize=9)
axes[1, 1].set_title('Top 10 Neighbourhoods', fontweight='bold')
axes[1, 1].set_xlabel('Count')
axes[1, 1].invert_yaxis()
axes[1, 1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('../../outputs/figures/encoding_methods_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved to: outputs/figures/encoding_methods_comparison.png")

print("\n" + "="*80)

## 7. Final Summary Report

In [None]:
print("\n" + "="*80)
print(" Task 1.4 Summary Report")
print("="*80)

summary = f"""
{'='*80}

Input Data:
  - Source: data/processed/listings_with_algebraic_features.csv (from T1.3)
  - Shape: {df.shape[0]:,} rows × {df.shape[1] - 10} columns (before encoding)

Output Data:
  - Destination: data/processed/listings_with_categorical_encoding.csv
  - Shape: {df.shape[0]:,} rows × {df.shape[1]} columns (after encoding)
  - Added: 10 new encoded features

Encoding Breakdown:

1. Room Type (One-Hot Encoding) [Landlord-Controlled]
   - Original categories: 4
   - Encoded columns: 4
   - Method: Binary columns for each category

2. Property Type (Label + Frequency Encoding) [Landlord-Controlled]
   - Original categories: {df['property_type'].nunique()}
   - Encoded columns: 2
   - Methods: Label encoding + Frequency encoding

3. Neighbourhood (Target + Frequency + Label Encoding) [Landlord-Controlled]
   - Original categories: {df['neighbourhood_cleansed'].nunique()}
   - Encoded columns: 3
   - Methods: Target encoding + Frequency encoding + Label encoding

4. Value Category (Label Encoding - Ordinal) [Target Variable]
   - Original categories: 3
   - Encoded columns: 1
   - Mapping: Poor_Value=0, Fair_Value=1, Excellent_Value=2

OUTPUT FILES:
  - data/processed/listings_with_categorical_encoding.csv
  - outputs/property_type_encoding_map.csv
  - outputs/neighbourhood_encoding_map.csv
  - outputs/value_category_encoding_map.csv
  

VISUALIZATIONS:
  - outputs/figures/categorical_encoding_analysis.png
  - outputs/figures/encoding_methods_comparison.png


{'='*80}

{'='*80}
"""

print(summary)

print("\n" + "="*80)



---
---

# Task 1.5: Feature Selection


---


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)



In [None]:
# Load the dataset from T1.4 (with categorical encoding)
df = pd.read_csv('../../data/processed/listings_with_categorical_encoding.csv')

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumn names:\n{df.columns.tolist()}")

## Feature Categorization

We categorize all features into:
1. **Landlord-Controlled Features** - Available at listing creation
2. **Review-Based Features** - Only available after guests stay
3. **Target-Related Features** - Used to create the target, so including them causes data leakage
4. **Identifier Features** - Not useful for prediction

In [None]:
# Define feature categories

# 1. Landlord-controlled features (available at listing creation)
landlord_features = [
    # Price information
    'price',
    
    # Property characteristics
    'accommodates', 'bedrooms', 'beds', 'bathrooms',
    
    # Location
    'latitude', 'longitude', 'city',
    
    # Host information (available at listing creation)
    'host_is_superhost', 'host_identity_verified',
    'host_response_time', 'host_response_rate',
    
    # Listing policies
    'instant_bookable', 'cancellation_policy',
    'minimum_nights', 'maximum_nights',
    
    # Availability
    'availability_30', 'availability_60', 'availability_90', 'availability_365',
    
   
    # Algebraic features (from T1.3) - landlord-controlled
    'space_efficiency', 'price_per_bedroom', 'price_per_bathroom',
    'bathroom_to_bedroom_ratio',  # name as created in T1.3
    
    # Categorical encodings (from T1.4)
    'room_type_Entire home/apt', 'room_type_Private room', 'room_type_Shared room',
    'property_type_label', 'property_type_frequency',
    'neighbourhood_frequency', 'neighbourhood_label'
]

# 2. Review-based features (not available for new listings)
review_features_to_remove = [
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
    'review_scores_value', 'number_of_reviews', 'number_of_reviews_ltm',
    'number_of_reviews_l30d', 'reviews_per_month',
    'first_review', 'last_review', 'days_since_first_review', 'days_since_last_review',
    'review_recency_score', 'estimated_occupancy', 'estimated_revenue',
    'quality_score'  # This is derived from review_scores_rating
]

# 3. Target-related features (cause data leakage)
target_leakage_features = [
    'fp_score',  # This is rating/price - directly related to target
    'price_normalized', 'rating_normalized',  # Used to create fp_score
    'value_category'  # This is our target variable
]

# 4. Identifier features (not useful for modeling)
identifier_features = [
    'id', 'listing_url', 'name', 'description', 'host_id', 'host_name'
]

print("Feature categories defined:")
print(f"  - Landlord-controlled features: {len(landlord_features)}")
print(f"  - Review-based features to remove: {len(review_features_to_remove)}")
print(f"  - Target leakage features to remove: {len(target_leakage_features)}")
print(f"  - Identifier features to remove: {len(identifier_features)}")
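
The removal lists above are only counted, not enforced, so a feature name that ends up in both the landlord list and a removal list would go unnoticed. A minimal sketch of an explicit overlap check follows. The helper `find_leaks` and the demo lists are hypothetical. The demo also blocks `value_encoded` and `neighbourhood_target_encoded`, since both are computed from `value_category` in Task 1.4.

```python
def find_leaks(selected, *blocked_groups):
    """Return the selected feature names that appear in any blocked group."""
    blocked = set().union(*blocked_groups)
    return sorted(set(selected) & blocked)


# Demo lists (hypothetical subsets of the real feature lists)
selected_demo = ['price', 'bedrooms', 'neighbourhood_frequency', 'review_scores_rating']
review_demo = ['review_scores_rating', 'number_of_reviews']
leakage_demo = ['fp_score', 'value_category', 'value_encoded', 'neighbourhood_target_encoded']
identifier_demo = ['id', 'host_id']

leaks = find_leaks(selected_demo, review_demo, leakage_demo, identifier_demo)
print(f"Leaking features: {leaks}")
```

In the notebook the same call could take `landlord_features` together with the three removal lists, and an `assert not leaks` would stop the run before any leaking column reaches the model.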

In [None]:
# Verify which features actually exist in the dataset
available_landlord_features = [f for f in landlord_features if f in df.columns]
missing_landlord_features = [f for f in landlord_features if f not in df.columns]

print(f"Available landlord features: {len(available_landlord_features)}")
print(f"Missing landlord features: {len(missing_landlord_features)}")

if missing_landlord_features:
    print(f"\nMissing features: {missing_landlord_features}")

# Check if target variable exists
if 'value_category' not in df.columns:
    print("\n WARNING: 'value_category' not found in dataset!")
    print("Creating target variable from fp_score.")
    
    # Create target variable if it doesn't exist
    if 'fp_score' in df.columns:
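        # Use the same labels as the rest of the pipeline (lowest fp_score tercile = Poor_Value)
        df['value_category'] = pd.qcut(df['fp_score'], q=3,
                                       labels=['Poor_Value', 'Fair_Value', 'Excellent_Value'])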
        print(" Target variable created successfully")
    else:
        print(" Error: Cannot create target variable - fp_score not found!")
else:
    print("\n Target variable 'value_category' found in dataset")

## Create Clean Dataset with Landlord-Only Features

In [None]:
# Separate features (X) and target (y)
X = df[available_landlord_features].copy()
y = df['value_category'].copy()

print(f"Feature matrix (X): {X.shape}")
print(f"Target variable (y): {y.shape}")
print(f"\nTarget distribution:\n{y.value_counts()}")
print(f"\nTarget distribution (%):\n{y.value_counts(normalize=True) * 100}")

In [None]:
# Check for missing values in landlord features
missing_values = X.isnull().sum()
missing_pct = (missing_values / len(X)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing_values,
    'Missing_Percentage': missing_pct
}).sort_values('Missing_Count', ascending=False)

print("Missing values in landlord features:")
print(missing_df[missing_df['Missing_Count'] > 0])

# Handle missing values if any
if missing_df['Missing_Count'].sum() > 0:
    print("\nHandling missing values...")
    
    # Fill numeric columns with median
    numeric_cols = X.select_dtypes(include=[np.number]).columns
    X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].median())
    
    # Fill categorical columns with mode
    categorical_cols = X.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        X[col] = X[col].fillna(X[col].mode()[0] if not X[col].mode().empty else 'Unknown')
    
    print(" Missing values handled")
else:
    print("\n No missing values found")

## Train-Test Split

In [None]:
# Split data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining target distribution:\n{y_train.value_counts()}")
print(f"\nTest target distribution:\n{y_test.value_counts()}")
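
Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both splits, which matters when the value categories are unevenly sized. The sketch below illustrates the effect on a hypothetical, deliberately imbalanced target; the 50/30/20 class sizes are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 50% / 30% / 20% (hypothetical proportions)
y_demo = pd.Series(['Poor_Value'] * 50 + ['Fair_Value'] * 30 + ['Excellent_Value'] * 20)
X_demo = pd.DataFrame({'price': range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits should show the same class shares as the full target
print(y_tr.value_counts(normalize=True).round(2))
print(y_te.value_counts(normalize=True).round(2))
```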

## Feature Scaling

In [None]:

# Remove non-numeric columns before scaling
non_numeric = X_train.select_dtypes(include=['object']).columns.tolist()
if non_numeric:
    print(f"Removing non-numeric columns: {non_numeric}")
    X_train = X_train.drop(columns=non_numeric)
    X_test = X_test.drop(columns=non_numeric)


# Identify numeric columns for scaling
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()

print(f"Numeric features to scale: {len(numeric_features)}")
print(f"Features: {numeric_features}")

# Initialize and fit scaler on training data
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Scale numeric features
X_train_scaled[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test_scaled[numeric_features] = scaler.transform(X_test[numeric_features])

print("\n Feature scaling completed")
print(f"\nScaled training set: {X_train_scaled.shape}")
print(f"Scaled test set: {X_test_scaled.shape}")

# Save the scaler for future use
os.makedirs('../../models', exist_ok=True)  # same directory the scaler is written to
joblib.dump(scaler, '../../models/standard_scaler.pkl')
print("\n Saved: models/standard_scaler.pkl")

# Scaled splits are saved in the next cell under *_scaled.csv names, separate from
# the unscaled X_train_landlord.csv / X_test_landlord.csv files.
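
The scaler is fitted on the training split only and then saved, so the same training statistics can be applied to listings that arrive later. The sketch below shows the save, reload and transform round trip on toy data. It writes to a temporary directory instead of `models/`, and the two-column frame is a hypothetical stand-in for the real feature matrix.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical values)
train = pd.DataFrame({'price': [100.0, 200.0, 300.0], 'bedrooms': [1.0, 2.0, 3.0]})
scaler = StandardScaler().fit(train)

# Round trip through disk, as a later notebook would do
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = Path(tmp_dir) / 'standard_scaler.pkl'
    joblib.dump(scaler, scaler_path)
    reloaded = joblib.load(scaler_path)

# A new listing must present the same columns in the same order as the training data
new_listing = pd.DataFrame({'bedrooms': [2.0], 'price': [200.0]})
scaled_new = pd.DataFrame(
    reloaded.transform(new_listing[train.columns]), columns=train.columns
)
print(scaled_new)
```

Reindexing with `new_listing[train.columns]` before the transform avoids a column-order mismatch, which scikit-learn reports as an error when the scaler was fitted on a DataFrame.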

## Save Processed Data

In [None]:
print("\n Saving processed datasets...")

# Save the clean dataset with landlord features only
landlord_df = pd.concat([X, y], axis=1)
landlord_df.to_csv('../../data/processed/listings_landlord_features_only.csv', index=False)
print(" Saved: listings_landlord_features_only.csv")

# Save train-test splits (unscaled)
X_train.to_csv('../../data/processed/X_train_landlord.csv', index=False)
X_test.to_csv('../../data/processed/X_test_landlord.csv', index=False)
y_train.to_csv('../../data/processed/y_train_landlord.csv', index=False)
y_test.to_csv('../../data/processed/y_test_landlord.csv', index=False)
print(" Saved: X_train_landlord.csv, X_test_landlord.csv, y_train_landlord.csv, y_test_landlord.csv")

# Save scaled versions
X_train_scaled.to_csv('../../data/processed/X_train_landlord_scaled.csv', index=False)
X_test_scaled.to_csv('../../data/processed/X_test_landlord_scaled.csv', index=False)
print(" Saved: X_train_landlord_scaled.csv, X_test_landlord_scaled.csv")

print("\n" + "="*60)
print(" All processed datasets saved")
print("="*60)

## Feature Analysis and Visualization

In [None]:
# Summary statistics for landlord features
print("Summary Statistics for Landlord Features:")
print("="*60)
print(X[numeric_features].describe())

In [None]:
# Visualize target distribution
plt.figure(figsize=(10, 6))
y.value_counts().plot(kind='bar', color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.title('Distribution of Value Categories (Target Variable)', fontsize=14, fontweight='bold')
plt.xlabel('Value Category', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('../../outputs/figures/target_distribution_landlord.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: target_distribution_landlord.png")

In [None]:
# Visualize key landlord features
key_features = ['price', 'accommodates', 'bedrooms', 'beds', 'bathrooms']
available_key_features = [f for f in key_features if f in X.columns]

if available_key_features:
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx, feature in enumerate(available_key_features):
        axes[idx].hist(X[feature].dropna(), bins=50, color='#4ECDC4', edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
        axes[idx].set_xlabel(feature, fontsize=10)
        axes[idx].set_ylabel('Frequency', fontsize=10)
        axes[idx].grid(axis='y', alpha=0.3)
    
    # Hide unused subplots
    for idx in range(len(available_key_features), len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.savefig('../../outputs/figures/key_features_distribution_landlord.png', dpi=300, bbox_inches='tight')
    plt.show()
    print(" Saved: key_features_distribution_landlord.png")

In [None]:
# Correlation heatmap for numeric features
if len(numeric_features) > 1:
    plt.figure(figsize=(14, 10))
    correlation_matrix = X[numeric_features].corr()
    sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0, 
                square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Heatmap - Landlord Features Only', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('../../outputs/figures/correlation_heatmap_landlord.png', dpi=300, bbox_inches='tight')
    plt.show()
    print(" Saved: correlation_heatmap_landlord.png")
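
The heatmap is drawn with `annot=False`, so the exact coefficients are not visible on the figure. One way to read them off is to list the strongest pairs from the upper triangle of the correlation matrix. The sketch below runs on synthetic columns, and the 0.8 cutoff is an assumed threshold, not a value used elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

# Synthetic data: "beds" is built to track "accommodates" closely
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "accommodates": base,
    "beds": base + rng.normal(scale=0.1, size=200),
    "price": rng.normal(size=200),
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

threshold = 0.8  # assumed cutoff for "highly correlated"
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Replacing `demo` with `X[numeric_features]` would apply the same listing to the landlord features.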

## Documentation Report

In [None]:
# Comprehensive documentation
report = f"""
{'='*80}
Task 1.5: Feature Selection 
{'='*80}

Real-world use case:
- Predict whether a new listing with no reviews yet will be good value for money
- Based on: the landlord's price + property characteristics

{'='*80}
Dataset Information:
{'='*80}

- Input Dataset: listings_with_categorical_encoding.csv
- Total Samples: {len(df)}
- Total Features (before filtering): {len(df.columns)}

- Output Dataset: listings_landlord_features_only.csv
- Total Samples: {len(landlord_df)}
- Landlord Features: {len(available_landlord_features)}

{'='*80}
Feature Categories
{'='*80}

1. Landlord-Controlled Features : {len(available_landlord_features)}
{chr(10).join(['   - ' + f for f in available_landlord_features])}

2. Review-Based Features : {len(review_features_to_remove)}
{chr(10).join(['   - ' + f for f in review_features_to_remove])}

3. Target Leakage Features : {len(target_leakage_features)}
{chr(10).join(['   - ' + f for f in target_leakage_features])}

{'='*80}
Target Variable Distribution
{'='*80}

{y.value_counts()}

Percentage Distribution:
{y.value_counts(normalize=True) * 100}


{'='*80}
Feature Scaling
{'='*80}

Method: StandardScaler (zero mean, unit variance)
Numeric Features Scaled: {len(numeric_features)}

{'='*80}
Output files generated
{'='*80}

1. listings_landlord_features_only.csv - Full dataset with landlord features
2. X_train_landlord.csv - Training features (unscaled)
3. X_test_landlord.csv - Test features (unscaled)
4. y_train_landlord.csv - Training target
5. y_test_landlord.csv - Test target
6. X_train_landlord_scaled.csv - Training features (scaled)
7. X_test_landlord_scaled.csv - Test features (scaled)
8. target_distribution_landlord.png - Target distribution plot
9. key_features_distribution_landlord.png - Key features histograms
10. correlation_heatmap_landlord.png - Feature correlation heatmap

{'='*80}
Critical Reminders
{'='*80}

- Price must be included as a feature
- Review features are used only for labeling, not for training
- The model predicts value for new listings without reviews


"""

print(report)


## Summary

### What We Did:
1. Loaded dataset from T1.4 (with categorical encoding)
2. Identified and categorized all features
3. Removed review-based features from model input
4. Removed target leakage features
5. Kept only landlord-controlled features
6. Created train-test split (80-20, stratified)
7. Applied feature scaling (StandardScaler)
8. Saved all processed datasets with 'landlord' suffix
9. Generated visualizations and documentation
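
Steps 6 and 7 can be sketched as follows on synthetic data. The sketch assumes scikit-learn's `train_test_split` and `StandardScaler`, and assumes the scaler is fit on the training rows only so that test-set statistics do not influence the transformation. The column values and the `random_state` are illustrative, not the project's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a three-class target (hypothetical values)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "price": rng.uniform(50, 500, size=100),
    "bedrooms": rng.integers(1, 5, size=100),
})
y_demo = pd.Series(
    ["Poor_Value"] * 20 + ["Fair_Value"] * 50 + ["Excellent_Value"] * 30,
    name="value_category",
)

# Stratifying on the target keeps class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit on the training split, then reuse the same mean/variance for the test split
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(y_te.value_counts().to_dict())
```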

### Key Takeaways:
- **Data leakage fixed**: Review features removed from model input
- **Price included**: Critical for predicting "value for money"
- **Production-ready**: The model can score new listings that have no reviews yet




---



## Summary of All Tasks

### Task 1.1: Initial Data Exploration
- Loaded SF (7,780) and SD (13,162) datasets
- Identified 79 columns with matching structure
- Analyzed missing values and data quality issues

### Task 1.2: Data Cleaning and Preprocessing
- Cleaned price column and converted data types
- Created engineered features (host_years, price_per_person, etc.)
- Handled missing values and outliers
- Created target variable (value_category) from FP Score
- Output: `listings_cleaned_with_target.csv` (19,912 rows × 73 columns)

### Task 1.3: Algebraic Feature Engineering
- Created 10 landlord-controlled algebraic features
- Features: space_efficiency, price_per_bedroom, price_per_bathroom, etc.
- All features use ONLY landlord-controlled data (no review features)
- Output: `listings_with_algebraic_features.csv` (19,912 rows × 83 columns)
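
Ratio features such as `price_per_bedroom` and `price_per_bathroom` need a guard for listings with zero bedrooms or bathrooms. The sketch below assumes the features are plain ratios and treats a zero denominator as missing. Both choices are assumptions for illustration, since the exact formulas used in Task 1.3 are not restated in this summary:

```python
import numpy as np
import pandas as pd

# Toy listings; the second row stands in for a studio with zero bedrooms
demo = pd.DataFrame({
    "price": [150.0, 90.0, 300.0],
    "bedrooms": [1, 0, 3],
    "bathrooms": [1.0, 1.0, 2.5],
})

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN instead of inf where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

demo["price_per_bedroom"] = safe_ratio(demo["price"], demo["bedrooms"])
demo["price_per_bathroom"] = safe_ratio(demo["price"], demo["bathrooms"])

print(demo)
```

Leaving the result as NaN keeps the decision about imputation explicit rather than hiding it inside the ratio.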

### Task 1.4: Categorical Encoding
- Encoded room_type (One-Hot)
- Encoded property_type (Label + Frequency)
- Encoded neighbourhood (Target + Frequency + Label)
- Encoded value_category (Label - ordinal)
- Output: `listings_with_categorical_encoding.csv` (19,912 rows × 90 columns)
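
Three of the encodings named above can be sketched on a toy frame. The generated column names (`property_type_freq`, `value_category_encoded`) and the 0/1/2 ordering are hypothetical, and target encoding is left out because it depends on how the target is held out during fitting:

```python
import pandas as pd

demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "property_type": ["Condo", "House", "Condo", "Condo"],
    "value_category": ["Poor_Value", "Excellent_Value", "Fair_Value", "Fair_Value"],
})

# One-hot: one 0/1 column per room type
room_dummies = pd.get_dummies(demo["room_type"], prefix="room_type", dtype=int)

# Frequency: replace each category with its share of rows
freq = demo["property_type"].value_counts(normalize=True)
demo["property_type_freq"] = demo["property_type"].map(freq)

# Ordinal label: an explicit mapping preserves the Poor < Fair < Excellent order
order = {"Poor_Value": 0, "Fair_Value": 1, "Excellent_Value": 2}
demo["value_category_encoded"] = demo["value_category"].map(order)

encoded = pd.concat([demo, room_dummies], axis=1)
print(encoded)
```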

### Task 1.5: Feature Selection 
- Kept 28 landlord-controlled features
- Created a stratified 80/20 train-test split
- Applied StandardScaler for feature scaling
- Output files:
  - `listings_landlord_features_only.csv`
  - `X_train_landlord.csv`, `X_test_landlord.csv`
  - `X_train_landlord_scaled.csv`, `X_test_landlord_scaled.csv`
  - `y_train_landlord.csv`, `y_test_landlord.csv`

## Final Dataset for Modeling

- **Features:** 28 landlord-controlled features
- **Samples:** 19,912 listings
- **Target:** value_category (Poor_Value, Fair_Value, Excellent_Value)
- **Train/Test Split:** 15,929 / 3,983 (80/20)
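
The split sizes follow from the sample count. A short arithmetic check, assuming the test share is rounded up to a whole row (which reproduces the counts reported above):

```python
import math

n_samples = 19_912
test_fraction = 0.20

# 0.20 * 19,912 = 3,982.4, which rounds up to 3,983 test rows
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```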

## Key Achievements

- **Data Leakage Fixed:** Review features removed from model input
- **Price Included:** Critical for predicting value-for-money
- **Production-Ready:** Model can predict for new listings without reviews



---

