# Task 1.5: Feature Selection

## Critical Issue: Data Leakage Fix

**Problem identified:**
- Previous models achieved 99% accuracy due to data leakage
- Review-based features (e.g., `review_scores_rating`, `number_of_reviews`) were used to predict `value_category`
- These features are not available for new listings without reviews

**Solution:**
- Use review-based features only for creating the target variable (`value_category`)
- Remove all review-based features from the model's input (X)
- Keep only landlord-controlled features that are available at listing creation time
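
A small guard can enforce the rule above in code rather than by convention. The following is a minimal sketch, not part of the original pipeline: `assert_no_leakage` and the shortened column lists are hypothetical names chosen for illustration.

```python
def assert_no_leakage(feature_columns, forbidden_columns):
    """Raise ValueError if any forbidden column is present among the model inputs."""
    leaked = sorted(set(feature_columns) & set(forbidden_columns))
    if leaked:
        raise ValueError(f"Leaky columns in X: {leaked}")
    return True


# Illustrative column lists (assumed, shortened for the example)
forbidden = ['review_scores_rating', 'number_of_reviews', 'fp_score', 'value_category']
clean_inputs = ['price', 'bedrooms', 'bathrooms']
leaky_inputs = ['price', 'bedrooms', 'review_scores_rating']

assert_no_leakage(clean_inputs, forbidden)  # passes silently

try:
    assert_no_leakage(leaky_inputs, forbidden)
except ValueError as err:
    print(err)
```

Calling such a check right before training would stop the run if a review-based or target-related column slipped back into `X`.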

**Real-World Use Case:**
Train a model that predicts if a new listing (with no reviews yet) will be good value for money, based on:
- Landlord's chosen price
- Property characteristics (beds, bathrooms, amenities, location, etc.)



In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully!")

In [None]:
# Load the dataset from T1.4 (with categorical encoding)
df = pd.read_csv('../../data/processed/listings_with_categorical_encoding.csv')

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumn names:\n{df.columns.tolist()}")

## Feature Categorization

We categorize all features into:
1. **Landlord-Controlled Features** (keep) - Available at listing creation
2. **Review-Based Features** (remove) - Only available after guests stay
3. **Target-Related Features** (remove) - Used to create the target, so they leak it
4. **Identifier Features** (remove) - Not useful for prediction
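
Hand-written category lists can miss columns, which then drop out of the analysis without anyone deciding so. The sketch below shows one way to list columns that belong to no category. `find_uncategorized` and the sample column names are hypothetical, used only for illustration.

```python
def find_uncategorized(all_columns, *category_lists):
    """Return columns that do not appear in any of the given category lists."""
    categorized = set()
    for names in category_lists:
        categorized.update(names)
    return [col for col in all_columns if col not in categorized]


# Illustrative inputs (assumed): a few dataset columns and shortened categories
columns = ['id', 'price', 'bedrooms', 'review_scores_rating', 'fp_score', 'calendar_updated']
landlord = ['price', 'bedrooms']
review = ['review_scores_rating']
target_related = ['fp_score', 'value_category']
identifiers = ['id']

leftover = find_uncategorized(columns, landlord, review, target_related, identifiers)
print(leftover)
```

Any column reported here needs an explicit decision: keep it as a landlord feature or add it to one of the removal lists.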

In [None]:
# Define feature categories

# 1. Landlord-controlled Features (To be retained)
landlord_features = [
    # Price (needed to predict value for money)
    'price',
    
    # Property characteristics
    'accommodates', 'bedrooms', 'beds', 'bathrooms',
    
    # Location
    'latitude', 'longitude', 'city',
    
    # Host information (available at listing creation)
    'host_is_superhost', 'host_identity_verified',
    'host_response_time', 'host_response_rate',
    
    # Listing policies
    'instant_bookable', 'cancellation_policy',
    'minimum_nights', 'maximum_nights',
    
    # Availability
    'availability_30', 'availability_60', 'availability_90', 'availability_365',
    
   
    # Algebraic features (from T1.3) - landlord-controlled
    'space_efficiency', 'price_per_bedroom', 'price_per_bathroom',
    'bathroom_per_bedroom',
    
    # Categorical encodings (from T1.4)
    'room_type_Entire home/apt', 'room_type_Private room', 'room_type_Shared room',
    'property_type_label', 'property_type_frequency',
    'neighbourhood_frequency', 'neighbourhood_label'
]

# 2. Review-Based Features (to be removed - not available for new listings)
review_features_to_remove = [
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
    'review_scores_value', 'number_of_reviews', 'number_of_reviews_ltm',
    'number_of_reviews_l30d', 'reviews_per_month',
    'first_review', 'last_review', 'days_since_first_review', 'days_since_last_review',
    'review_recency_score', 'estimated_occupancy', 'estimated_revenue',
    'quality_score'  # This is derived from review_scores_rating
]

# 3. Target-Related Features (to be removed - they cause data leakage)
target_leakage_features = [
    'fp_score',  # This is rating/price - directly related to target
    'price_normalized', 'rating_normalized',  # Used to create fp_score
    'value_category'  # This is our target variable
]

# 4. Identifier Features (to be removed - not useful for modeling)
identifier_features = [
    'id', 'listing_url', 'name', 'description', 'host_id', 'host_name'
]

print("Feature categories defined:")
print(f" - Landlord-controlled features: {len(landlord_features)}")
print(f" - Review-based features to remove: {len(review_features_to_remove)}")
print(f" - Target leakage features to remove: {len(target_leakage_features)}")
print(f" - Identifier features to remove: {len(identifier_features)}")

In [None]:
# Verify which features actually exist in the dataset
available_landlord_features = [f for f in landlord_features if f in df.columns]
missing_landlord_features = [f for f in landlord_features if f not in df.columns]

print(f"Available landlord features: {len(available_landlord_features)}")
print(f"Missing landlord features: {len(missing_landlord_features)}")

if missing_landlord_features:
    print(f"\nMissing features: {missing_landlord_features}")

# Check that the target variable exists; build it from fp_score if it does not
if 'value_category' not in df.columns:
    print("\nWarning: 'value_category' not found in dataset!")
    print("Creating target variable from fp_score.")

    if 'fp_score' in df.columns:
        # Tercile split: three equal-sized groups labeled Low / Medium / High
        df['value_category'] = pd.qcut(df['fp_score'], q=3, labels=['Low', 'Medium', 'High'])
        print("Target variable created successfully")
    else:
        # Stop here: the cells below cannot run without a target
        raise KeyError("Cannot create target variable: 'fp_score' not found")
else:
    print("\nTarget variable 'value_category' found in dataset")

## Creating Clean Dataset with Landlord-Only Features
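
The target `value_category` is a three-way quantile split of `fp_score`, as in the fallback of the verification cell. The toy example below sketches how `pd.qcut` assigns the labels and how a missing score behaves. The score values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative fp_score values (assumed), including one missing score
scores = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, np.nan])

# q=3 cuts at the 1/3 and 2/3 quantiles of the non-missing values
labels = pd.qcut(scores, q=3, labels=['Low', 'Medium', 'High'])

print(labels.value_counts())
print("Rows without a label:", labels.isna().sum())

# Rows with no label would likely need to be dropped before a stratified split
labeled = labels.dropna()
```

A missing `fp_score` yields a missing label, so it is worth checking `y.isna().sum()` before splitting.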

In [None]:
# Separate features (X) and target (y)
X = df[available_landlord_features].copy()
y = df['value_category'].copy()

print(f"Feature matrix (X): {X.shape}")
print(f"Target variable (y): {y.shape}")
print(f"\nTarget distribution:\n{y.value_counts()}")
print(f"\nTarget distribution (%):\n{y.value_counts(normalize=True) * 100}")

In [None]:
# Check for missing values in landlord features
missing_values = X.isnull().sum()
missing_pct = (missing_values / len(X)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing_values,
    'Missing_Percentage': missing_pct
}).sort_values('Missing_Count', ascending=False)

print("Missing values in landlord features:")
print(missing_df[missing_df['Missing_Count'] > 0])

# Handle missing values if any
# Note: medians and modes are computed on the full dataset, before the train-test split
if missing_df['Missing_Count'].sum() > 0:
    print("\nHandling missing values...")

    # Fill numeric columns with median
    numeric_cols = X.select_dtypes(include=[np.number]).columns
    X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].median())

    # Fill categorical columns with mode (or 'Unknown' if the column is entirely empty)
    categorical_cols = X.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        col_mode = X[col].mode()
        X[col] = X[col].fillna(col_mode[0] if not col_mode.empty else 'Unknown')

    print("Missing values handled")
else:
    print("\nNo missing values found")

## Train-Test Split
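
The notebook fills missing values on the full `X` before splitting, so the fill values are influenced by test rows. A stricter alternative is to split first and learn the fill values from the training rows only. The sketch below uses toy frames (`X_train_demo`, `X_test_demo`) as stand-ins and is not the notebook's actual procedure.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for X_train / X_test (values are illustrative)
X_train_demo = pd.DataFrame({'bedrooms': [1.0, 2.0, np.nan, 3.0]})
X_test_demo = pd.DataFrame({'bedrooms': [np.nan, 10.0]})

# Learn the fill value from the training rows only
train_medians = X_train_demo.median(numeric_only=True)

# Apply the same training medians to both sets
X_train_filled = X_train_demo.fillna(train_medians)
X_test_filled = X_test_demo.fillna(train_medians)

print(train_medians['bedrooms'])
print(X_test_filled['bedrooms'].tolist())
```

The test row's value of 10 never affects the fill value, which mirrors how the scaler below is fit on the training set only.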

In [None]:
# Split data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining target distribution:\n{y_train.value_counts()}")
print(f"\nTest target distribution:\n{y_test.value_counts()}")

## Feature Scaling
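
Standardization subtracts the training mean and divides by the training standard deviation, and the same two numbers are then reused on the test set. The NumPy sketch below reproduces that arithmetic by hand on made-up prices. It illustrates the idea and does not replace `StandardScaler`.

```python
import numpy as np

train = np.array([[100.0], [200.0], [300.0]])  # illustrative training prices
test = np.array([[400.0]])                     # illustrative test price

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has zero mean and unit variance. The test column generally does not, which is expected because it is transformed with the training statistics.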

In [None]:

# Remove non-numeric columns before scaling
non_numeric = X_train.select_dtypes(include=['object']).columns.tolist()
if non_numeric:
    print(f"Removing non-numeric columns: {non_numeric}")
    X_train = X_train.drop(columns=non_numeric)
    X_test = X_test.drop(columns=non_numeric)


# Identify numeric columns for scaling
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()

print(f"Numeric features to scale: {len(numeric_features)}")
print(f"Features: {numeric_features}")

# Initialize and fit scaler on training data
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Scale numeric features
X_train_scaled[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test_scaled[numeric_features] = scaler.transform(X_test[numeric_features])

print("\n Feature scaling completed")
print(f"\nScaled training set: {X_train_scaled.shape}")
print(f"Scaled test set: {X_test_scaled.shape}")

# Save the scaler for future use
os.makedirs('../../models', exist_ok=True)
joblib.dump(scaler, '../../models/standard_scaler.pkl')
print("\nSaved: ../../models/standard_scaler.pkl")

# Scaled data for T2.1 (logistic regression model) is saved in the
# "Save Processed Data" cell as X_train_landlord_scaled.csv / X_test_landlord_scaled.csv.
# X_train_landlord.csv / X_test_landlord.csv are reserved for the unscaled split.

## Save Processed Data

In [None]:
# Create output directory if it doesn't exist
os.makedirs('../../data/processed', exist_ok=True)

# Save the clean dataset with landlord features only
landlord_df = pd.concat([X, y], axis=1)
landlord_df.to_csv('../../data/processed/listings_landlord_features_only.csv', index=False)
print("Saved: listings_landlord_features_only.csv")

# Save train-test splits (unscaled)
X_train.to_csv('../../data/processed/X_train_landlord.csv', index=False)
X_test.to_csv('../../data/processed/X_test_landlord.csv', index=False)
y_train.to_csv('../../data/processed/y_train_landlord.csv', index=False)
y_test.to_csv('../../data/processed/y_test_landlord.csv', index=False)
print("Saved: X_train_landlord.csv, X_test_landlord.csv, y_train_landlord.csv, y_test_landlord.csv")

# Save scaled versions
X_train_scaled.to_csv('../../data/processed/X_train_landlord_scaled.csv', index=False)
X_test_scaled.to_csv('../../data/processed/X_test_landlord_scaled.csv', index=False)
print("Saved: X_train_landlord_scaled.csv, X_test_landlord_scaled.csv")

# Save feature names for reference
os.makedirs('../../outputs/reports', exist_ok=True)
with open('../../outputs/reports/landlord_feature_names.txt', 'w') as f:
    f.write('\n'.join(available_landlord_features))
print("Saved: landlord_feature_names.txt")
print("\n" + "="*60)

## Feature Analysis and Visualization
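
The heatmap further down is drawn without annotations, so strongly correlated pairs have to be read off by eye. A short helper can list them instead. This is a sketch: `top_correlated_pairs`, the 0.8 threshold, and the demo values are assumptions made for illustration.

```python
import pandas as pd


def top_correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for column pairs with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Illustrative data (assumed): beds tracks bedrooms closely, minimum_nights does not
demo = pd.DataFrame({
    'bedrooms': [1, 2, 3, 4, 5],
    'beds': [1, 2, 3, 4, 6],
    'minimum_nights': [3, 1, 4, 1, 5],
})

for a, b, r in top_correlated_pairs(demo):
    print(f"{a} ~ {b}: r = {r:.2f}")
```

On the real data the call would be `top_correlated_pairs(X[numeric_features])`. Pairs such as the availability windows are natural candidates to inspect.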

In [None]:
# Summary statistics for landlord features
print("Summary Statistics for Landlord Features:")
print("="*60)
print(X[numeric_features].describe())

In [None]:
# Visualize target distribution
plt.figure(figsize=(10, 6))
y.value_counts().plot(kind='bar', color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.title('Distribution of Value Categories (Target Variable)', fontsize=14, fontweight='bold')
plt.xlabel('Value Category', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
os.makedirs('../../outputs/figures', exist_ok=True)  # make sure the figure folder exists
plt.savefig('../../outputs/figures/target_distribution_landlord.png', dpi=300, bbox_inches='tight')
plt.show()
print("Saved: target_distribution_landlord.png")

In [None]:
# Visualize key landlord features
key_features = ['price', 'accommodates', 'bedrooms', 'beds', 'bathrooms']
available_key_features = [f for f in key_features if f in X.columns]

if available_key_features:
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx, feature in enumerate(available_key_features):
        axes[idx].hist(X[feature].dropna(), bins=50, color='#4ECDC4', edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
        axes[idx].set_xlabel(feature, fontsize=10)
        axes[idx].set_ylabel('Frequency', fontsize=10)
        axes[idx].grid(axis='y', alpha=0.3)
    
    # Hide unused subplots
    for idx in range(len(available_key_features), len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.savefig('../../outputs/figures/key_features_distribution_landlord.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("Saved: key_features_distribution_landlord.png")

In [None]:
# Correlation heatmap for numeric features
if len(numeric_features) > 1:
    plt.figure(figsize=(14, 10))
    correlation_matrix = X[numeric_features].corr()
    sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0, 
                square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Heatmap - Landlord Features Only', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('../../outputs/figures/correlation_heatmap_landlord.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("Saved: correlation_heatmap_landlord.png")

## Documentation Report

In [None]:
# Comprehensive documentation
report = f"""

Problem Identified:
- Previous models achieved 99% accuracy due to data leakage
- Review-based features were used to predict value_category
- These features are not available for new listings without reviews

Solution we implemented:
- Review features used only for creating target variable (value_category)
- All review-based features removed from model input (X)
- Only landlord-controlled features kept for training

Real-World Use Case:
- Predict if a new listing (no reviews) will be good value for money
- Based on: Landlord's price + property characteristics

{'='*80}
Dataset Summary
{'='*80}

Input Dataset: listings_with_categorical_encoding.csv
Total Samples: {len(df)}
Total Features (before filtering): {len(df.columns)}

Output Dataset: listings_landlord_features_only.csv
Total Samples: {len(landlord_df)}
Landlord Features: {len(available_landlord_features)}

{'='*80}
Feature Categories
{'='*80}

1. Landlord-controlled features (kept): {len(available_landlord_features)}
{chr(10).join(['   - ' + f for f in available_landlord_features])}

2. Review-based features (removed): {len(review_features_to_remove)}
{chr(10).join(['   - ' + f for f in review_features_to_remove])}

3. Target leakage features (removed): {len(target_leakage_features)}
{chr(10).join(['   - ' + f for f in target_leakage_features])}

{'='*80}
Target Variable Distribution
{'='*80}

{y.value_counts()}

Percentage Distribution:
{y.value_counts(normalize=True) * 100}

{'='*80}
Train-test Split Summary
{'='*80}

Training Set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)
Test Set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)
Features: {X_train.shape[1]}

Split Strategy: Stratified (maintains class distribution)
Random State: 42 (reproducible)

{'='*80}
Feature Scaling
{'='*80}

Method: StandardScaler (zero mean, unit variance)
Numeric Features Scaled: {len(numeric_features)}

{'='*80}
Generated output files
{'='*80}

1. listings_landlord_features_only.csv - Full dataset with landlord features
2. X_train_landlord.csv - Training features (unscaled)
3. X_test_landlord.csv - Test features (unscaled)
4. y_train_landlord.csv - Training target
5. y_test_landlord.csv - Test target
6. X_train_landlord_scaled.csv - Training features (scaled)
7. X_test_landlord_scaled.csv - Test features (scaled)
8. landlord_feature_names.txt - List of feature names
9. target_distribution_landlord.png - Target distribution plot
10. key_features_distribution_landlord.png - Key features histograms
11. correlation_heatmap_landlord.png - Feature correlation heatmap

{'='*80}
Critical Reminders
{'='*80}
- The model targets new listings with no review history.
- It uses only information available at listing creation time, so it can be applied to real new listings.
- Price must be included as a feature (it is landlord-controlled).
- Review features are used only for labeling, not for training.

"""
print(report)